Deep neural networks (DNNs) have achieved spectacular success in classification tasks when trained on very large, but still finite, training sets. DNN training mostly follows the principle of Empirical Risk Minimization (ERM), which states that by minimizing the training error the classifier will generalize to previously unseen data, under the condition that novel data points and labels are drawn from the same distribution as the training data. Although this assumption works remarkably well on difficult benchmark datasets such as ImageNet, the assumption of identically distributed training and test sets is likely to be violated in DNN-based systems deployed in real-world situations. Knowing when a DNN can or cannot be trusted because of dataset shift is of utmost importance whenever DNNs are used in safety-critical applications [13, 11], such as autonomous driving, robotics, surveillance, or medical diagnosis. At the same time, there can be true ambiguity in the data, e.g. when human annotators cannot agree or make mistakes, when inputs are corrupted or occluded, or whenever environmental conditions prevent a conclusive classification, e.g. due to challenging light or weather conditions. Such situations require DNNs that do not just predict the most likely class, but also quantify the uncertainty or confidence of their prediction, thereby allowing decision making systems to take the risk caused by perceptual uncertainty into account.
Unfortunately, the softmax outputs of modern DNNs, although accurate in their class predictions, have proven to perform poorly as indicators of uncertainty. Overfitting to training data with one-hot encoded (hard) labels, and the over-confidence of ReLU networks on out-of-distribution inputs, have been identified as potential root causes for this behavior. Estimating the predictive uncertainty in deep learning is thus an active and challenging research topic. The ultimate goal is to obtain calibrated confidence scores, i.e. indicators that directly quantify the likelihood of a correct prediction.
Since in all practical machine learning scenarios there is no access to the true data generating distribution, a reasonable starting point for uncertainty estimation is to assume that only data points in the vicinity of training data points can be predicted with high certainty. In fact, this is closely related to the problem of generalization, and various data augmentation techniques improve classification accuracy on unseen data by generating new training samples obtained by applying simple transformations to the original training samples without modifying the labels. In this article we propose a novel approach, On-manifold Adversarial Data Augmentation (OMADA), which yields calibrated uncertainty predictions by augmenting the training dataset with ambiguous samples generated by adversarial attacks, but constrained to lie on an estimated training data manifold (see Fig. 1). The adversarial attack targets a latent space classifier. Unlike typical image space classifiers that directly process the data samples, the latent space classifier is built on top of an autoencoder (encoder-decoder) based generative model (see Fig. 2), and processes the latent codes of data samples created by the encoder. The encoder and decoder of the generative model are responsible for transforming between the latent and image space. They are jointly trained to approximate the true data distribution. By constraining the augmented samples to lie on the data manifold, the latent-space classifier can closely approximate the true decision boundaries between classes, while we avoid confusing the image-space classifier by injecting out-of-distribution samples into the training set.
We perform extensive experiments and comparisons against alternative methods from the literature for supervised classification. Augmenting the supervised classification training with OMADA, we observe significant improvements in calibration and accuracy across multiple benchmark datasets, such as CIFAR-100, CIFAR-10, and SVHN, and diverse network architectures, such as DenseNet, Wide ResNet, VGG, and ResNeXt. Furthermore, we test the (image space) classifier on out-of-distribution samples. Using the confidence of the predictions as the metric to detect unseen data, OMADA outperforms multiple baseline methods in outlier detection performance, in terms of the area under the ROC curve and Mean Maximal Confidence (MMC). In summary, the results suggest that on-manifold samples between class clusters aid in resolving the notorious over-confidence characteristics associated with DNNs.
Our contributions include (1) OMADA, a novel approach to create on-manifold ambiguous samples for data augmentation in supervised classification; (2) extensive empirical comparisons against a wide spectrum of alternative methods from the literature on various uncertainty evaluation metrics and on out-of-distribution detection tasks; (3) significant improvements over the benchmark methods on prediction calibration and outlier detection. For example, on CIFAR-100, OMADA results in up to a 9.8x reduction in calibration error against standard training, up to a 5.9x reduction compared to Mixup, and up to a 3x reduction compared to temperature scaling.
2 Related Work
OMADA extends elements of recent successful approaches for uncertainty estimation, data augmentation, and adversarial training, and our experiments show superior performance across a wide range of tasks and architectures.
The recently proposed Mixup method creates new training samples by linear interpolation in the image space between a random pair of training samples. In addition, it creates soft labels by linearly interpolating between the original one-hot label vectors. Mixup was shown to not only improve generalization, but also to yield well-calibrated softmax scores and less confident predictions for out-of-distribution data than a DNN trained without it. Recent variants of Mixup that also interpolate between inputs and use soft labels are Manifold Mixup, which performs Mixup in the feature space of a DNN instead of the image space, and CutMix, which samples patches from multiple images and creates soft labels based on the fraction of patches from each image. All of these methods may generate unrealistic samples that lie off the true data manifold. Furthermore, the labels are generated by interpolating between two or more hard label vectors, and may thus not reflect true ambiguity, e.g. if the image obtained by interpolation is more similar to a third class (see Section A2 for an example).
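For reference, the interpolation at the core of Mixup can be sketched in a few lines. This is a minimal NumPy sketch; the `alpha` value and toy inputs are illustrative, not settings used in our experiments:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Create one Mixup sample: the same convex combination is applied
    to the two inputs and to their one-hot labels. `alpha` parameterizes
    the Beta distribution from which the mixing coefficient is drawn."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy "images" with one-hot labels for classes 0 and 2.
x_a, y_a = np.ones((4, 4)), np.array([1.0, 0.0, 0.0])
x_b, y_b = np.zeros((4, 4)), np.array([0.0, 0.0, 1.0])
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b)
# The soft label still sums to one, with its mass spread over classes 0 and 2.
```

Note that the mixed label can only place mass on the two source classes; as discussed above, it cannot express similarity to a third class.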
Soft labels were also used to improve generalization via ε-smoothing, where a probability mass of size ε is distributed over all but the correct class, thus penalizing over-confident predictions. Another simple and effective method to avoid over-confidence on outliers is to include out-of-distribution samples with uniform labels in the training set [6, 10]; the samples can even be as simple as uniform noise images.
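A minimal sketch of this smoothing scheme, assuming the mass ε is spread evenly over the incorrect classes as described above:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Epsilon-smoothing: remove a probability mass of size eps from the
    correct class and distribute it evenly over the remaining classes."""
    k = y_onehot.shape[-1]
    return y_onehot * (1 - eps) + (1 - y_onehot) * eps / (k - 1)

y = np.array([0.0, 1.0, 0.0, 0.0])
y_soft = smooth_labels(y, eps=0.1)
# Correct class keeps 0.9; each of the three other classes receives 0.1/3.
```

Unlike the input-dependent soft labels produced by OMADA, this smoothing is uniform and identical for every training sample of a given class.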
Calibration can also be efficiently achieved after training, most prominently by temperature scaling. This method re-scales the softmax scores on a validation set, thereby achieving good calibration without affecting the accuracy of the model. However, temperature scaling does not perform on par with other methods in outlier detection tasks or when dataset shift occurs. As a post-hoc method applied to a trained classifier, temperature scaling can be combined with data augmentation and label smoothing methods.
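Temperature scaling fits a single scalar on held-out data. The following is a simplified NumPy sketch with toy logits; practical implementations usually optimize the temperature by gradient descent on the NLL rather than by the grid search assumed here:

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of the correct class after
    dividing the logits by temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the single scalar T minimizing validation NLL. Dividing the
    logits by T does not change the argmax, so accuracy is untouched."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Toy over-confident model: two confident correct predictions and one
# confidently wrong prediction (third sample, true class 0).
logits = np.array([[8.0, 0.0, 0.0], [0.0, 9.0, 0.0], [1.0, 0.0, 7.0]])
labels = np.array([0, 1, 0])
T = fit_temperature(logits, labels)  # T > 1 here: the scores are softened
```

Because a confidently wrong prediction dominates the NLL, the fitted temperature exceeds one, softening all softmax scores without changing any predicted class.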
For studying adversarial robustness, prior work introduced the concept of on- and off-manifold adversarial examples. Augmenting the training set with on-manifold adversarial examples is particularly useful for improving generalization performance. However, common perturbations in the image space, including the above-mentioned Mixup, CutMix, and additive random noise, are not constrained to the data manifold. In this work we are interested in the effect of on-manifold data augmentation on uncertainty calibration. The proposed OMADA method trains an autoencoder-based generative model to approximate the data manifold and uses adversarial attacks to create ambiguous samples with soft labels. Unlike the soft labels created by ε-smoothing, Mixup, and its variants, these soft labels are semantically coherent with the samples (see, e.g., Fig. 1). In the experiments section, we showcase the benefits of OMADA across multiple benchmark datasets, architectures, and uncertainty evaluation metrics on both classification and out-of-distribution detection tasks.
3 OMADA Methodology
The core of OMADA is generating realistic (yet ambiguous) samples for data augmentation, and input-dependent soft labels for improving the calibration of classifiers. This section explains in detail how to create on-manifold ambiguous training samples, and how to exploit them for the target classification task. Fig. 2 sketches the three main training phases of OMADA: generative modeling, latent space adversarial attacks, and classifier training with on-manifold data augmentation.
3.1 Generative Modeling
Generative models are used to approximate the inaccessible ground-truth data distribution. We choose BigBi-GAN because it has achieved state-of-the-art results on image synthesis and representation learning tasks, and exploit its design for learning the training data manifold. OMADA is not limited to BigBi-GAN, and can work with any other autoencoder-based generative model.
The encoder maps each data sample to a latent code that follows the standard normal distribution. The decoder attempts to reconstruct the input from the latent code. Instead of explicitly defining a metric to evaluate the reconstruction loss, BigBi-GAN uses a discriminator network, as in other GAN methods. The discriminator is trained to distinguish decoder outputs from real samples, measuring the Jensen-Shannon divergence between the data distribution and the model distribution induced by the encoder-decoder pair. The decoder competes against the discriminator by synthesizing increasingly realistic samples. As the discriminator is only needed for training BigBi-GAN, it is omitted in Fig. 2-(1).
Latent Space Classifier
The current setting of BigBi-GAN is unsupervised. On top of it, we further introduce a latent space classifier that exploits the label information to cluster the latent codes according to the classes. Namely, given the labeled training samples, the classifier is trained by cross entropy minimization to predict the label of each sample from the latent code obtained by applying the encoder to it. The cross entropy multi-class classification loss is added to the original encoder-decoder training loss of BigBi-GAN, and the three networks are jointly trained to fool the discriminator. In doing so, the latent codes of samples from the same class are clustered together, see Fig. 1. We further observe from Fig. 1 that sampling from the boundaries between two class clusters yields ambiguous samples at the decoder output. Such generated samples lie in the support of the model distribution, which closely approximates the data manifold; they are therefore treated as on-manifold samples.
The trained encoder, decoder, and latent space classifier provide all of the necessary tools for OMADA to generate ambiguous samples and corresponding labels, as described in the following section.
3.2 Latent Space Adversarial Attack
OMADA uses the generative model to synthesize samples which specifically raise class ambiguity. Since ambiguous samples should reflect characteristics from two or more classes, their latent codes are expected to lie close to the class decision boundaries of the latent space classifier. We cannot directly generate samples close to decision boundaries by randomly sampling the prior distribution over the latent space; however, adversarial attacks on the latent space classifier provide a targeted way to raise class ambiguity.
We start to explore the latent space from the latent code $z$ of an arbitrary training sample and move in a direction that approaches a target class, encoded as a one-hot label vector $t$. An adversarial attack, e.g. the projected gradient descent (PGD) method, is used to find a small perturbation $\eta$ on $z$ such that the latent space classifier $f$ classifies $z + \eta$ as the target class rather than the original one. Using the cross entropy loss, the perturbation is attained by solving the following minimization problem:
$$\min_{\eta} \; -\sum_{c=1}^{C} t_c \log f_c(z + \eta), \qquad (1)$$
where $C$ denotes the number of classes and $t_c$ is the $c$-th entry of the one-hot vector $t$. Unlike standard adversarial attacks, here we do not need to constrain $\eta$ to lie within an $\epsilon$-ball, because the decoder is trained to produce realistic samples from any latent code and the support of the prior distribution is the whole latent space. As depicted in Fig. 2-(2), the work horse of our second training phase is the attack model, which solves (1) in an iterative manner (for all adversarial attacks we perform $k$ steps with a fixed step size). The other networks in Fig. 2-(2) are not changed after phase (1).
By iteratively solving (1), the intermediate perturbed latent codes are collected to create an attack path in the latent space, see Fig. 1. Compared to simple linear interpolation in the latent space, the proposed adversarial attack path has an important advantage: the adversarial loss (1) penalizes paths that pass through any class cluster except the target one. As shown in Fig. 1, the attacker mainly explores the empty regions between class clusters (i.e., the decision boundaries of the latent space classifier) to reach the target, and is therefore more efficient than linear interpolation at creating ambiguous samples. Feeding the latent codes along the attack path into the decoder, Fig. 1 depicts a series of synthetic samples that smoothly diverge from the source and approach a sample belonging to the target class. The samples in-between realistically exhibit features from both the source and target classes, and possibly other classes encountered on the attack path. They lie on the data manifold because the encoder-decoder pair models the data distribution.
The labels of the samples can be obtained by applying the latent space classifier to the latent codes. Unlike one-hot encoded hard label vectors, the softmax responses can take soft values between 0 and 1. Since the perturbation may traverse multiple class boundaries to reach the target, the soft labels are not simply based on the source and target classes, and can have non-zero mass on other classes. As Fig. 1 shows, the soft labels are semantically coherent with the samples synthesized by the decoder. Compared with Mixup, which linearly interpolates both the samples and their labels, the proposed adversarial attack always produces on-manifold ambiguous samples and labels them according to their class-specific features.
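The attack and soft labeling described above can be illustrated end to end on a toy problem. In the sketch below, a hypothetical linear softmax model over a 2-D latent space stands in for the trained latent space classifier; the weights, step count, and step size are all illustrative. The loop performs the unconstrained iterative minimization of (1) and records the attack path:

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

# Hypothetical stand-in for the latent space classifier: a linear
# softmax model over 2-D latent codes with three class clusters.
W = np.array([[ 4.0,  0.0],
              [-2.0,  3.5],
              [-2.0, -3.5]])

def attack_path(z0, target, steps=20, step_size=0.2):
    """Iteratively minimize the cross entropy to the one-hot target,
    as in (1); unlike a standard attack there is no epsilon-ball
    constraint on the perturbation. Returns every intermediate code."""
    z, path = z0.copy(), [z0.copy()]
    for _ in range(steps):
        p = softmax(W @ z)
        # Gradient of CE(target, softmax(Wz)) w.r.t. z is W^T (p - target).
        z = z - step_size * W.T @ (p - target)
        path.append(z.copy())
    return np.array(path)

z_source = np.array([1.5, 0.0])  # latent code firmly in class 0
path = attack_path(z_source, target=np.array([0.0, 1.0, 0.0]))
soft_labels = np.array([softmax(W @ z) for z in path])
# Along the path the label mass shifts from class 0 to the target class 1;
# intermediate codes receive genuinely ambiguous soft labels.
```

Decoding the intermediate codes of `path` would yield the ambiguous on-manifold samples, while `soft_labels` provides their input-dependent labels.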
Using the attacker together with the pretrained BigBi-GAN models to create our OMADA augmentation set, we investigate two ways to sample latent codes from the attack path. The first, and default, mode samples uniformly along the path. The second favors samples whose soft labels have large entropies: after proper normalization, we use the entropies of each latent code's soft label vector along the path to parameterize a probability mass function (pmf), and then sample latent codes according to this pmf.
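The second, entropy-based sampling mode can be sketched as follows (a minimal NumPy sketch with a toy attack path; the soft label values are illustrative):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each soft label vector along the path."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def sample_by_entropy(path_labels, n, rng=None):
    """Sample indices along an attack path with probability proportional
    to the entropy of the soft labels, favoring ambiguous boundary
    samples over confident ones."""
    rng = rng or np.random.default_rng(0)
    h = entropy(path_labels)
    pmf = h / h.sum()  # normalize entropies into a pmf
    return rng.choice(len(path_labels), size=n, p=pmf)

# Soft labels along a toy path: confident -> ambiguous -> confident.
labels = np.array([[0.98, 0.01, 0.01],
                   [0.60, 0.35, 0.05],
                   [0.45, 0.45, 0.10],
                   [0.05, 0.90, 0.05]])
idx = sample_by_entropy(labels, n=1000)
# Boundary points (rows 1 and 2) are drawn far more often than the
# confident endpoints (rows 0 and 3).
```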
3.3 On-Manifold Data Augmentation
In order to solve the classification task, we train a DNN in the original input space. As shown in Fig. 2-(3), the only difference is that we augment the original dataset with the OMADA set generated in Step (2), i.e., on-manifold ambiguous samples together with their soft labels, thereby complementing the original training set, which contains only hard labels. Mixing the two datasets has two effects. Firstly, the enlarged training set improves generalization performance and reduces model uncertainty. As the size of the OMADA set is effectively unlimited, since the latent space can be sampled repeatedly, it also prevents overfitting and memorization. Secondly, the DNN learns from the soft labels of OMADA to make soft predictions in addition to hard ones, tempering over-confidence in the training process and achieving improved calibration at test time. In the subsequent experiment section, we find that soft labeling of ambiguous samples is particularly helpful for detecting out-of-distribution samples, so the model knows what it does not know.
4 Experiments
Setup We evaluate and compare OMADA against multiple benchmark methods from the literature across datasets (SVHN, CIFAR-10, CIFAR-100) and models (DenseNet, Wide-ResNet 28-10 (WRN), ResNeXt-29, and VGG-16). The benchmark methods from the literature primarily address data augmentation, label smoothing, and combinations of the two, similar to our proposed method. Additionally, we compare to temperature scaling, as this is the gold standard for network calibration.
The following is the list of methods we compare against: Base network (trained without data augmentation), standard data augmentation (random crops and flips), Mixup, Manifold Mixup, ε-smoothing, CEDA, CutMix, and temperature scaling (TS). Unless otherwise noted, hyperparameters are taken from the original publications. For Mixup, the interpolation hyperparameter α is chosen based on results from the literature. Further details about hyperparameters for individual methods can be found in Section A3.
Training details The training hyperparameters (learning rate, etc.) for each network are listed in the appendix (Section A3); these hyperparameters do not vary across datasets and methods. At the end of training, the model weights used for evaluation are chosen from the epoch with the best validation accuracy. Each reported result is the mean over independent runs with the same hyperparameters.
For all OMADA-trained networks we evenly balance each batch between real training samples and on-manifold adversarial samples. In order for these networks to be comparable to other baselines, we ensure that each epoch has the same number of updates as the Base method (i.e. in each epoch the networks observe only a fraction of all real training samples).
4.1 Evaluation Metrics
While the experimental investigation is primarily focused on calibration, we also look at other applications of network uncertainty, as well as classification accuracy.
Calibration A classifier is well-calibrated if its probabilistic output corresponds to the actual likelihood of being correct; i.e., of all images a network predicts with a softmax confidence of $p$, approximately a fraction $p$ should be classified correctly. This is typically measured by creating a reliability diagram, in which images are binned by the softmax value of their predicted class, and calculating a distance metric between the resulting curve and the ideal calibration curve. The most popular of these metrics is the Expected Calibration Error (ECE). We instead use the Adaptive Calibration Error (ACE), which places an equal number of images in each bin; this metric is more robust w.r.t. the binning hyperparameters and the baseline network accuracy:
$$\mathrm{ACE} = \frac{1}{KR} \sum_{k=1}^{K} \sum_{r=1}^{R} \left| \mathrm{acc}(r, k) - \mathrm{conf}(r, k) \right|,$$
in which $K$ is the number of classes, $R$ the number of calibration ranges, $N$ the total number of data points, and the calibration range $r$ is defined by the $\lfloor N/R \rfloor$-th index of the sorted softmax predictions.
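The equal-mass binning behind ACE can be sketched in a simplified, class-agnostic form (operating only on the confidence of the predicted class; the full metric additionally averages over classes):

```python
import numpy as np

def adaptive_calibration_error(confidences, correct, n_bins=5):
    """Simplified one-class sketch of ACE: sort predictions by
    confidence, split them into equal-mass bins, and average the
    absolute gap between accuracy and mean confidence per bin."""
    order = np.argsort(confidences)
    conf, corr = confidences[order], correct[order]
    bins = np.array_split(np.arange(len(conf)), n_bins)  # equal-sized bins
    gaps = [abs(corr[b].mean() - conf[b].mean()) for b in bins]
    return float(np.mean(gaps))

# Toy over-confident model: ~90% mean confidence but only 60% accuracy.
rng = np.random.default_rng(0)
conf = rng.uniform(0.8, 1.0, size=1000)
corr = (rng.uniform(size=1000) < 0.6).astype(float)
ace = adaptive_calibration_error(conf, corr)  # large gap, roughly 0.3
```

Because the bins are defined by sample quantiles rather than fixed confidence intervals, no bin is ever empty, which is what makes the metric robust to the binning hyperparameter.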
Outlier Detection Outlier detection focuses on identifying out-of-distribution (OOD) inputs at test time, based on thresholding the predicted uncertainty. OOD data can be a completely different dataset, corrupted data, or classes from the same dataset not seen during training. The outlier detection experiments in this paper focus on the former case: the network is trained on CIFAR-10, and the predicted softmax is used to identify anomalous SVHN images at inference time. The metric used for evaluating outlier detection performance is the area under the receiver operating characteristic (ROC) curve (AUC); intuitively, this measures the ability of the uncertainty measure to binary-classify an input as in-distribution or out-of-distribution over various thresholds.
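The AUC for this detection task can be computed directly from its probabilistic interpretation (a minimal NumPy sketch with illustrative confidence distributions, not our experimental data):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUC of the ROC curve for separating in-distribution from OOD
    inputs by thresholding a confidence score. Equals the probability
    that a random in-distribution sample scores higher than a random
    OOD sample (ties counted as one half)."""
    s_in = scores_in[:, None]
    s_out = scores_out[None, :]
    return float((s_in > s_out).mean() + 0.5 * (s_in == s_out).mean())

# A well-behaved model: high max-softmax on in-distribution inputs,
# lower max-softmax on unseen OOD inputs.
rng = np.random.default_rng(0)
conf_in = rng.uniform(0.7, 1.0, size=500)
conf_ood = rng.uniform(0.3, 0.8, size=500)
auc = auroc(conf_in, conf_ood)  # close to 1.0: the confidences separate well
```

An AUC of 0.5 corresponds to a confidence measure that carries no information about whether an input is in-distribution.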
Other Uncertainty Measures We also investigate how confident the networks are on OOD data by measuring the Mean Maximal Confidence (MMC), and how well the produced uncertainty estimates correlate with the true error by producing sparsification plots. These results, and a more detailed explanation of the metrics, can be found in the appendix (Section A4).
4.2 Experimental Results
Network Calibration We first compare the calibration and accuracy performance of OMADA against a range of baselines and similar methods. Figure 3 visualizes the ACE performance on the in-distribution test set. OMADA shows significant improvements over all dataset and model combinations, except for DenseNet on CIFAR-10, where it is slightly outperformed only by CutMix. However, CutMix does not consistently perform well across other architectures and datasets, e.g. on WRN and ResNeXt. Mixup is also comparable to OMADA on CIFAR-100 for DenseNet and VGG, but is significantly worse in other situations (e.g. SVHN).
We observe that the ACE of the base network is relatively low for some networks (e.g. CIFAR-10 + ResNeXt) which have large capacity; this is likely due to the fact that early stopping was used during training, and the results of the model with the highest accuracy are reported. Further investigations on the effect of early stopping are shown in the appendix Section A5.
We observe larger performance gains for harder datasets such as CIFAR-100 (Fig. 2(c)), as well as SVHN (Fig. 2(d)), where the dataset contains multiple class instances (digits) in the same image, introducing high uncertainty.
The stability of OMADA’s performance across models is remarkable. The selected networks range from low capacity networks such as DenseNet, larger networks such as WRN and ResNeXt, as well as a network architecture with multiple dense layers (VGG). Achieving such low calibration errors across such a diverse list of network architectures and multiple datasets posed great challenges to alternative methods in the literature, further emphasizing the benefits of OMADA for calibrated network training.
Fig. 2(b) shows the accuracy on CIFAR-10 across all models. Standard data augmentation yields the highest accuracy for all models, but OMADA always improves over the base classifier and outperforms most other methods. This indicates that the improved calibration obtained by OMADA does not come at the expense of a drop in accuracy; rather, accuracy significantly increases. This observation is consistent across all studied datasets (results shown in the appendix, Section A4). Additionally, standard data augmentation can be combined with all of the label-smoothing methods as well as with OMADA; this will be investigated in future work.
In summary, for in-distribution samples, OMADA results in well-calibrated, accurate classifiers across many network architectures and datasets, especially in comparison to competing label smoothing/data augmentation approaches.
Temperature scaling As temperature scaling (TS) is an orthogonal post-processing calibration technique, we separately compare the effect of TS applied to the Base network as well as to OMADA. Fig. 4 compares the ACE of Base and OMADA with their respective temperature-scaled variants. For CIFAR-10, temperature scaling on Base mostly surpasses the calibration performance of OMADA alone, but the best performance is obtained by applying temperature scaling on top of OMADA. For harder datasets such as CIFAR-100, OMADA alone achieves a similar or often better ACE than Base-TS.
An interesting observation about TS can be seen in Fig. 4: OMADA-TS does not always produce better calibration than OMADA. This is an unintuitive effect, though further investigation showed similar behavior for other methods in the literature, usually when the calibration error without temperature scaling is already fairly low (as is the case for OMADA). This can happen because the NLL, for which TS is optimized, is not directly correlated with the ACE metric. This result calls for careful consideration when using temperature scaling for calibration, as it can degrade performance for already-calibrated networks. A simple alternative is to do a grid search over temperatures and choose the one which yields the best calibration on a validation set. For further explanation, and an example of this phenomenon, see Section A6 in the appendix. Furthermore, as temperature scaling normalizes the logits by a constant, it does not change the accuracy of the models, and thus does not come with the accuracy improvements of OMADA.
OMADA Variants In this part we show the performance of different OMADA variants, to investigate the effects of adding ambiguous images and soft labels independently.
We first investigate an alternative sampling method, which preferentially samples images from the path with high label entropies, as opposed to uniformly sampling them. We call this variant OMADA-SE (Sample from Entropy). Furthermore, we study the effect of the soft labels produced by the latent-space classifier by training the networks with the generated ambiguous samples from OMADA and OMADA-SE, but changing the labels. We either harden the soft labels based on the maximum class probability (OMADA*-H), or change the class labels to be uniform across all classes (OMADA*-U).
We investigate the resulting network calibration (ACE), the accuracy (ACC), and the outlier detection performance (AUC) of the variants. The results can be observed in Table 1 for CIFAR-10 on DenseNet and WRN.
OMADA-SE focuses on sampling from the high-entropy regions of the path (i.e. a higher chance of sampling pure boundary-region samples). We observe in Table 1 that this alternative sampling method, compared to uniform sampling across the attack paths, provides a variant of OMADA which performs very competitively on multiple tasks, especially on outlier detection.
The effect of hardening the labels yields surprisingly good results on ACE, where it sometimes improves calibration over the corresponding soft label variant, suggesting that the ambiguous samples generated by the on-manifold attacks alone are enough to improve the network’s confidence estimates. However, this gain comes at the cost of a drop in accuracy, suggesting that the soft labels help generalization. This observation will be the focus of future research.
The effect of hardening labels is different for the sampling variants; as OMADA-SE contains more samples with higher entropy soft labels, the change in label density is much more drastic than in OMADA, which also produces samples far away from decision boundaries (i.e. the soft label is already relatively hard). This is illustrated especially in the outlier detection performance: here, OMADA-H and OMADA-SE-H suffer in comparison to their soft-label counterparts. These observations are consistent with OOD-MMC reported in the appendix (Section A7).
Next, we study the effect of assigning uniform class labels to each ambiguous sample generated by the adversarial attack. Here, it becomes apparent that the soft labels of the ambiguous samples are required to attain competitive ACE and accuracy for in-distribution data. However, for out-of-distribution samples, where the AUC and OOD-MMC metrics are optimized by predicting near-uniform class labels on OOD data, the OMADA*-U networks do very well. This is consistent with observations from CEDA, where uniform class labels are also used to improve the detection of OOD samples (shown in Fig. 5).
In summary, changing the soft labels increases performance on some tasks, but degrades performance across other tasks; the best choice of labels is then task-dependent. On average, the soft labeled methods (OMADA and OMADA-SE) perform stably across tasks.
Outlier Detection In order to put the outlier detection abilities of OMADA-SE (the best variant across multiple tasks) into context, we compare its AUC to the already-investigated label smoothing methods (Fig. 5). OMADA-SE outperforms all other methods on both DenseNet and WRN, albeit with a small gap to CEDA on DenseNet. The good performance of CEDA is not surprising, as it is explicitly trained to predict lower confidence on out-of-distribution samples (CEDA uses adversarially perturbed uniform random noise as the out-of-distribution samples). Interestingly, soft labels alone are not enough to achieve good outlier detection, as evidenced by the poor performance of ε-smoothing.
Stochastic DNN methods We compare the calibration and outlier detection performance of the OMADA variants to both Monte Carlo (MC) Dropout and Ensembles, as these are commonly used to obtain uncertainty estimates and have been shown to improve network calibration (results in Section A8). As with temperature scaling, these methods are both orthogonal to OMADA and can be easily combined with it. While an Ensemble is competitive on ACE, it performs worse than OMADA-SE on outlier detection. MC-Dropout is less competitive on both tasks.
5 Conclusion and Discussion
In this article we have introduced the concept of on-manifold adversarial data augmentation for uncertainty estimation by leveraging recent advances in generative modeling. We use a latent space classifier to estimate the class decision boundaries on the approximated data manifold. This lays the groundwork for our novel sampling procedure, which directs the search through the latent space to discover challenging regions for the latent space classifier (i.e. decision boundary regions). Leveraging the ability to sample specifically from these challenging regions on the manifold, we use a decoder to generate ambiguous samples and the latent space classifier to assign them soft labels, resulting in our OMADA dataset. Through a range of carefully chosen experiments, we study the effect that OMADA has when training an independent image space classifier.
An extensive set of experiments show significant improvements across multiple datasets and diverse network architectures, as well as on multiple tasks. The stability of the OMADA results for ACE across multiple networks is a particularly desirable property, as most alternative methods fail to perform well across all investigated networks. OMADA can be combined with post-processing methods such as temperature scaling , and we are confident that further beneficial combinations and extensions of the key concept of OMADA will be discovered in future research. Furthermore, we show that OMADA additionally results in increased classification accuracy compared to Base and most other methods.
Lastly, we show that OMADA-SE, which focuses its data generating sampling from boundary regions, outperforms all other methods for outlier detection.
This is a first attempt at using on-manifold adversarial samples for the study of uncertainty. Initial results show significant improvements in the network's ability to assign confidence to its predictions on in-distribution as well as out-of-distribution samples. Further studies are required to investigate the behavior of these networks on data which marginally leaves the data manifold (e.g. unseen transformations or corruptions).
- (2019) Large scale adversarial representation learning. In Advances in Neural Information Processing Systems.
- (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML).
- (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR).
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems.
- (2017) On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
- (2019) Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR).
- (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Adversarial machine learning at scale. In International Conference on Learning Representations (ICLR).
- (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations (ICLR).
- (2019) Towards neural networks that provably know when they don't know. arXiv preprint arXiv:1909.12180.
- (2019) Measuring calibration in deep learning. In CVPR Workshop on Uncertainty and Robustness in Deep Visual Learning.
- (2019) Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530.
- (2019) Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV).
- (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
- (2019) Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems.
- (1992) Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems.
- (2019) Manifold mixup: better representations by interpolating hidden states. In International Conference on Machine Learning (ICML).
- (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Selective classification for deep neural networks. In Advances in Neural Information Processing Systems.
- (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC).
- (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations (ICLR).
Appendix: On-manifold Adversarial Data Augmentation Improves Uncertainty Calibration
A1 Visualizing Other Attack Paths
Fig. A1 depicts more examples of attack paths, with different start and end targets, produced by the presented method. The OMADA attack path examples include paths where the target is another class (e.g. the blue path with target “2”), as well as paths where the target is a decision boundary (e.g. the green path with target between “1” and “8” and the red path with target between “0” and “2”). The decision boundary between two classes can be reached by setting the target vector in Eq. 1 to 0.5 for each of the two classes and 0 elsewhere. The images produced by the decision-boundary paths are confusing samples that mix features from the neighboring clusters, and this confusion is reflected in the soft labels.
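The decision-boundary soft targets can be sketched as follows (the helper name and class count are ours, assuming a uniform split of probability mass between the two target classes):

```python
import numpy as np

def boundary_target(class_a, class_b, num_classes=10):
    """Soft target on the decision boundary between two classes:
    probability 0.5 on each of the two classes, 0 elsewhere."""
    target = np.zeros(num_classes)
    target[class_a] = 0.5
    target[class_b] = 0.5
    return target

# e.g. a target on the boundary between digits "1" and "8"
t = boundary_target(1, 8)
```

Using such a vector as the attack target drives the adversarial sample toward the boundary region between the two classes.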
A2 Input Mixup Example
In Fig. A1, the Input Mixup projection path is visualized in magenta. This path is produced by projecting the linearly interpolated images produced by Input Mixup into the latent space using the encoder. Even though Mixup mainly produces unrealistic images (Fig. 1), when it does produce a realistic sample from another class, the soft label does not reflect the presence of that class. For example, in Fig. A1, Input Mixup produces an interpolated image between the classes “5” and “2” that looks similar to an “8”. Input Mixup assigns zero probability to class “8”, whereas with our encoder the image gets mapped to the “8” cluster, which means the soft label produced by our method reflects the presence of class “8”.
A3 Experiment Hyperparameters
This section presents detailed information about the training process.
A3.1 Model and Training Hyper-parameters
All optimizer hyper-parameters for training the image-space classifiers can be found in Table A1. These parameters are kept unchanged across the three datasets and all methods, as well as across all repetitions (where only the random seed was changed).
| Network | # Parameters | Nesterov | LR Schedule | Batch Size |
| --- | --- | --- | --- | --- |
| DenseNet (L=100, k=12) | 796,162 | True | Multi-step | 64 |
| WRN-28-10 | 36,479,194 | True | Multi-step | 128 |
| VGG-16 | 33,646,666 | True | Multi-step | 128 |
| ResNeXt-29 | 34,426,698 | True | Multi-step | 128 |
A3.2 OMADA Training Hyper-parameters
Each OMADA-trained network uses a balanced mix of 50% real training samples (with hard one-hot labels) and 50% on-manifold adversarial samples in each batch.
In order to enable direct comparison with alternative methods from the literature, we ensure that the total number of gradient updates per epoch stays the same despite the balanced sampling from both datasets.
Therefore, at the end of each epoch, 50% of the real training samples have not been seen, having been replaced by on-manifold adversarial samples.
It should be noted that the real samples seen during each epoch vary across epochs.
In order to speed up the training process, we create an offline On-manifold adversarial dataset, and sample from this dataset to fill up each batch during training.
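A minimal sketch of this batch composition (function and variable names are ours; labels for both sources are assumed to be given as probability vectors, one-hot for the real samples and soft for the adversarial ones):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def make_batch(real_x, real_y, adv_x, adv_y, batch_size=128):
    """Fill half of the batch with real samples and half with pre-computed
    on-manifold adversarial samples drawn from an offline dataset.
    Labels are probability vectors: one-hot for real, soft for adversarial."""
    half = batch_size // 2
    real_idx = rng.choice(len(real_x), size=half, replace=False)
    adv_idx = rng.choice(len(adv_x), size=half, replace=False)
    batch_x = np.concatenate([real_x[real_idx], adv_x[adv_idx]])
    batch_y = np.concatenate([real_y[real_idx], adv_y[adv_idx]])
    return batch_x, batch_y
```

Because the adversarial samples come from a fixed offline dataset, this sampling adds essentially no overhead per batch.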
For all networks, K random training samples are withheld to create a validation set. The validation set accuracy is used for early stopping, and all experiments (unless stated otherwise) report the results from the checkpoint with the highest validation accuracy during training. Furthermore, the validation set is used to find the best temperature to produce the temperature scaling results.
A3.3 Details about Literature Methods
This sub-section reports the hyper-parameters of the alternative methods in the literature in more detail.
Base: base network trained using only real samples with hard labels and no data augmentation.
Standard augmentation (std_aug): base network trained with data augmentation on the training samples (random crops with padding and random horizontal flips).
Manifold Mixup: extends Mixup training by linearly interpolating hidden-layer representations of the network together with the corresponding hard labels.
ε-smoothing: smooths the labels with a smoothing factor ε (found to perform best in prior work) by taking a linear combination of the one-hot label and the uniform distribution over classes.
CEDA: confidence-enhancing data augmentation (CEDA) is a training scheme that enforces uniform confidences on out-of-distribution noise. These out-of-distribution images are included in training by replacing half of each batch of real samples with permuted-pixel images and uniform random noise images. A Gaussian filter is applied to each of these augmented images to give the noise more low-frequency structure. The label for each of these images is the uniform distribution over classes.
Cut-Mix: replaces rectangular regions of an image with patches from another sample in the batch; the authors claim this generates more locally natural images than Mixup. The combination ratio is sampled from a beta distribution, and the cut-mix probability is 1 (i.e., cut-mix is performed for every sample). The procedure for sampling the image patches to replace is kept identical to the original work.
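As an illustration, the CEDA-style noise images described above could be generated as follows (a sketch with assumed shapes and names; the Gaussian low-pass filtering step is only indicated in the docstring, since its standard deviation is not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)

def ceda_noise_batch(real_imgs, num_classes=10):
    """Build CEDA-style out-of-distribution images: half are permuted-pixel
    versions of real images, half are uniform random noise; every image gets
    the uniform label 1/num_classes. (CEDA additionally applies a Gaussian
    low-pass filter to these images; omitted here as its width is unspecified.)"""
    n, h, w = real_imgs.shape
    half = n // 2
    flat = real_imgs[:half].reshape(half, -1)
    permuted = rng.permuted(flat, axis=1).reshape(half, h, w)  # shuffle pixels per image
    noise = rng.uniform(0.0, 1.0, size=(n - half, h, w))
    imgs = np.concatenate([permuted, noise])
    labels = np.full((n, num_classes), 1.0 / num_classes)  # uniform targets
    return imgs, labels
```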
A4 Additional Results
In Fig. A2 we report the classification accuracy for CIFAR-100 and SVHN.
We make similar observations as in Fig. 2(b).
OMADA achieves an improvement in accuracy compared to the Base models and most other methods, emphasizing that the gain in calibration does not come at the cost of a drop in accuracy.
A4.2 Sparsification and OOD-MMC
In this section we report the Sparsification results on the in-distribution test set and the Mean Maximal Confidence on out-of-distribution data (OOD-MMC). Sparsification evaluates how well a given uncertainty estimate correlates with the true error; intuitively, we want our networks to be more confident about correct predictions and less confident about incorrect ones. This is calculated by computing the classification accuracy on increasingly large subsets of the test set. Samples are added to the subset based on their uncertainty, with the most certain samples added first. The final metric is the difference between the curve generated by the method and the ideal curve, in which all incorrectly classified images have a higher uncertainty than all correctly classified images. A lower Sparsification error is better.
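A minimal sketch of this Sparsification error computation (function names are ours):

```python
import numpy as np

def sparsification_curve(confidence, correct):
    """Accuracy on increasingly large subsets of the test set,
    adding the most confident samples first."""
    order = np.argsort(-confidence)          # most certain first
    hits = correct[order].astype(float)
    return np.cumsum(hits) / np.arange(1, len(hits) + 1)

def sparsification_error(confidence, correct):
    """Mean gap between the oracle curve (all correct predictions ranked
    ahead of all incorrect ones) and the method's curve; lower is better."""
    curve = sparsification_curve(confidence, correct)
    oracle = sparsification_curve(correct.astype(float), correct)
    return float(np.mean(oracle - curve))
```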
Another measure for evaluating the over-confidence of networks is the OOD-MMC on out-of-distribution data. For out-of-distribution samples we want the network to assign a confidence of 1/C (for C classes), reflecting maximum uncertainty. The Mean Max Confidence (MMC) measures how well the network assigns low confidence to such unseen samples.
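The MMC computation itself is straightforward (a sketch, with our naming):

```python
import numpy as np

def mean_max_confidence(probs):
    """Mean of the maximum softmax probability over a set of samples;
    for C classes the lowest attainable value is 1/C (uniform predictions)."""
    return float(np.max(probs, axis=1).mean())
```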
These results can be seen in Fig. A3. We observe that OMADA-SE achieves a significantly lower Sparsification error than all other methods except standard augmentation, to which it performs similarly. For OOD-MMC, OMADA-SE performs better than all other methods except CEDA, which has the lowest OOD-MMC on DenseNet; for WRN, OMADA-SE again performs best.
A5 ACE on Last Epoch Checkpoint
In the main paper, we report evaluations based on the model weights with the highest validation accuracy. To show that the results are consistent with those from the last-epoch model weights, in this section we report the ACE results for CIFAR-10 on all models, which ensures that all models were trained for exactly the same number of epochs. Fig. A4 shows the ACE results for CIFAR-10 for all models at the last-epoch checkpoint. Similar performance orderings can be observed compared to Fig. 2(a). Most methods have a worse ACE when evaluated at the last epoch (as longer training often increases mis-calibration), though surprisingly some exceptions exist. This suggests that a further study of the temporal aspect of network calibration across training epochs would be informative.
A6 When Does Temperature Scaling Help?
Temperature scaling is a simple method for improving network calibration. Interestingly, we observe that it does not always improve performance; for networks that are already fairly well calibrated, applying temperature scaling worsens the ACE. This suggests that the negative log-likelihood (NLL) optimized by temperature scaling does not always correlate with a lower ACE (or ECE). We show this phenomenon for WRN on CIFAR-100, where the optimized temperature increased the calibration error (ACE). Fig. A5 shows the NLL and ACE when performing a grid search across temperatures. The best temperature (T=0.952) based on the validation NLL (vertical dashed black line) minimizes the ACE on neither the test nor the validation set. This shows that the NLL and ACE are not perfectly correlated (similar observations were made for ECE), and that a grid search for the temperature based on the ACE might be an alternative way to find better temperatures.
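The temperature grid search can be sketched as follows (a simplified version with our naming and an assumed grid range; the selection criterion here is the validation NLL, and a calibration metric such as ACE could be substituted for the `nll` scoring function):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax with a numerically stable shift."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits, T)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())

def grid_search_temperature(logits, labels, grid=np.linspace(0.5, 3.0, 251)):
    """Pick the temperature minimizing validation NLL over a fixed grid."""
    scores = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(scores))])
```

Note that a temperature below 1 sharpens the predictions, which is why well-calibrated but under-confident validation sets can yield T < 1 as in the WRN example above.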
A7 Ablation Studies
Here we report the OOD-MMC results of the ablation study, as well as the standard deviations of the networks in Table 1. We report these numbers in Table A2. Similar to the AUC results, the soft labels are important for achieving a lower MMC on out-of-distribution data.
A8 Stochastic Bayesian Neural Network Approximations
Here we report the results of two stochastic Bayesian Neural Network Approximations: Ensembles and MC-Dropout (15 forward passes). As these are orthogonal methods and can be applied to all methods, we compare our results with the Base network when applying these two stochastic approaches for uncertainty estimation.
Fig. A6 shows the results for an ensemble and MC-Dropout for CIFAR-10 on DenseNet and WRN. Each ensemble entry reports the mean and standard deviation across ensembles, each consisting of several independently trained networks. As none of the networks reported in the paper are trained with dropout, we additionally train DenseNet and WRN with dropout in order to compare against MC-Dropout. As these networks can be considered to have a different architecture than their no-dropout counterparts, we report the ACE and AUC for a single deterministic forward pass (Dropout-1FP) and compare this to the stochastic forward passes (Dropout-15FP). We observe that for ACE on DenseNet, Base-ENS performs best, though an ensemble of OMADA networks achieves similar performance. For WRN, OMADA-ENS significantly surpasses Base-ENS. This shows that ensembles help to improve network calibration, though at the cost of expensive inference. For ACE on both networks, MC-Dropout does not perform competitively.
On the other hand, when comparing the AUC numbers, we see that Base-ENS only slightly improves on Base and falls short of all OMADA-trained networks (with and without an ensemble on top). OMADA-ENS improves on OMADA alone, though interestingly it does not perform as well as OMADA-SE (which sees many more confusing samples with high-entropy soft labels).
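The predictive averaging underlying both ensembles and MC-Dropout can be sketched as (our naming):

```python
import numpy as np

def ensemble_predict(member_probs):
    """Average the predictive distributions of several ensemble members
    (or of multiple stochastic dropout forward passes, as in MC-Dropout).
    member_probs has shape (members, samples, classes)."""
    return np.mean(member_probs, axis=0)
```

Averaging probability distributions (rather than logits) keeps the result a valid distribution and is the standard choice for both techniques.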