On-manifold Adversarial Data Augmentation Improves Uncertainty Calibration

by Kanil Patel et al.

Uncertainty estimates help to identify ambiguous, novel, or anomalous inputs, but the reliable quantification of uncertainty has proven to be challenging for modern deep networks. To improve uncertainty estimation, we propose On-Manifold Adversarial Data Augmentation, or OMADA, which attempts to generate the most challenging examples by following an on-manifold adversarial attack path in the latent space of an autoencoder-based generative model that closely approximates the decision boundaries between two or more classes. On a variety of datasets and for multiple network architectures, OMADA consistently yields more accurate and better calibrated classifiers than baseline models, and outperforms competing approaches such as Mixup and CutMix, while achieving performance similar to (at times better than) post-processing calibration methods such as temperature scaling. Variants of OMADA can employ different sampling schemes for ambiguous on-manifold examples based on the entropy of their estimated soft labels, which exhibit specific strengths for generalization, calibration of predicted uncertainty, or detection of out-of-distribution inputs.





1 Introduction

Figure 1: Visualization of an MNIST encoder-decoder latent space with two trajectories traversing between pairs of clusters. On the right we visualize the decoded image path for OMADA (top) and the Input-Mixup images (bottom), along with their corresponding soft labels (10 rows below the images; red intensity corresponds to the likelihood of each of the 10 classes) and the class entropy (bottom row; black shows high entropy). The OMADA trajectory starts at the cluster of “0” and smoothly transitions to the target class “1”. It can be seen that the path favors routes which stick to the boundary regions of class clusters (e.g. going around the red cluster of “3”s). For comparison, we visualize the projection of Input-Mixup images onto the same manifold, for linear input interpolations between the digits “5” and “2”. It can be seen that the images generated by OMADA are more confusing and, more importantly, that the soft labels assigned to each image depend on the location on the manifold. This is in contrast to Mixup, where the soft label is zero for all classes except the two classes being interpolated, regardless of whether the mixed images share features with other classes.

Deep neural networks (DNNs) have achieved spectacular success in classification tasks when trained on very large, but still finite, training sets. DNN training mostly follows the principle of Empirical Risk Minimization (ERM), which states that by minimizing the training error the classifier will generalize to previously unseen data, under the condition that novel data points and labels are drawn from the same distribution as the training data. Although this assumption works remarkably well on difficult benchmark datasets such as ImageNet [15], the assumption of identically distributed training and test sets is likely to be violated in DNN-based systems deployed in real-world situations. Knowing when a DNN can or cannot be trusted because of dataset shift is of utmost importance whenever DNNs are used in safety-critical applications [13, 11], such as autonomous driving, robotics, surveillance, or medical diagnosis. At the same time, there can be true ambiguity in the data, e.g. when human annotators cannot agree or make mistakes [14], when inputs are corrupted or occluded, or whenever environmental conditions prevent a conclusive classification, e.g. due to challenging light or weather conditions [7]. Such situations require DNNs that do not just predict the most likely class, but also quantify the uncertainty or confidence of their prediction, thereby allowing decision making systems to take the risk caused by perceptual uncertainty into account.

Unfortunately, the softmax outputs of modern DNNs, although accurate in their class predictions, have proven to perform poorly as indicators of uncertainty. Overfitting to training data with one-hot encoded or hard labels [19], and the over-confidence of ReLU networks on out-of-distribution inputs, have been identified as potential root causes for this behavior. Estimating the predictive uncertainty in deep learning is thus an active and challenging research topic. The ultimate goal is to obtain calibrated confidence scores [5], i.e. indicators that directly quantify the likelihood of a correct prediction.

Since in all practical machine learning scenarios there is no access to the true data generating distribution, a reasonable starting point for uncertainty estimation is to assume that only data points in the vicinity of training data points can be predicted with high certainty [26]. In fact, this is closely related to the problem of generalization, and various data augmentation techniques improve classification accuracy on unseen data by generating new training samples, obtained by applying simple transformations to the original training samples without modifying the labels. In this article we propose a novel approach, On-Manifold Adversarial Data Augmentation (OMADA), which yields calibrated uncertainty predictions by augmenting the training dataset with ambiguous samples generated by adversarial attacks [3], constrained to lie on an estimated training data manifold [17]; see Fig. 1. The adversarial attack targets a latent space classifier. Unlike typical image space classifiers that directly process the data samples, the latent space classifier is built on top of an autoencoder (encoder-decoder) based generative model, see Fig. 2, and processes the latent codes of data samples created by the encoder. The encoder and decoder of the generative model transform between the latent and image space; they are jointly trained to approximate the true data distribution. By constraining the augmented samples to lie on the data manifold, we can closely approximate the true decision boundaries between classes with the latent-space classifier, while avoiding confusing the image-space classifier by injecting out-of-distribution samples into the training set.

We perform extensive experiments and comparisons against alternative methods from the literature for supervised classification. Augmenting supervised classification training with OMADA, we observe significant improvements in calibration and accuracy across multiple benchmark datasets (CIFAR-100, CIFAR-10, and SVHN) and diverse network architectures (DenseNet [8], Wide ResNet [25], VGG [16], and ResNeXt [22]). Furthermore, we test the (image space) classifier on out-of-distribution samples. Using the confidence of the predictions as the metric to detect unseen data, OMADA outperforms multiple baseline methods in outlier detection, in terms of the area under the ROC curve and Mean Maximal Confidence (MMC). In summary, the results suggest that on-manifold samples between two or more class clusters help resolve the notorious over-confidence of DNNs.

Our contributions include (1) a novel approach OMADA to create on-manifold ambiguous samples for data augmentation in supervised classification; (2) extensive empirical comparisons of a wide spectrum of alternative methods in the literature on various uncertainty evaluation metrics, and on out-of-distribution detection tasks; (3) significant improvement over the benchmark methods on prediction calibration and outlier detection. For example, on CIFAR-100, OMADA results in up to a 9.8x reduction in calibration error against standard training, up to a 5.9x reduction compared to Mixup, and up to a 3x reduction compared to temperature scaling.

2 Related Work

OMADA extends elements of recent successful approaches for uncertainty estimation, data augmentation, and adversarial training, and our experiments show superior performance across a wide range of tasks and architectures.

The recently proposed Mixup method [26] creates new training samples by linear interpolation in the image space between a random pair of training samples. In addition, it creates soft labels by linearly interpolating between the original one-hot label vectors. In [19] it was shown that Mixup not only improves generalization, but also yields well-calibrated softmax scores, and produces less confident predictions for out-of-distribution data than a DNN trained without Mixup. Recent variants of Mixup that also interpolate between inputs and use soft labels are Manifold Mixup [21], which performs Mixup in the feature space of a DNN instead of the image space, and CutMix [24], which samples patches from multiple images and creates soft labels based on the fraction of patches from each image. All of these methods may generate unrealistic samples that lie off the true data manifold. Furthermore, the labels are generated by interpolating between two or more hard label vectors, and may thus not reflect true ambiguity, e.g. if the image obtained by interpolation is more similar to a third class (see Section A2 for an example).
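For reference, Mixup's sample and label interpolation can be sketched in a few lines. This is a minimal NumPy sketch; the function name, the default mixing parameter, and the seeded generator are our illustrative choices, not taken from any of the cited implementations.

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup sketch: convex combination of two inputs and their one-hot
    labels, with a Beta-distributed mixing coefficient."""
    rng = rng or np.random.default_rng(0)
    lam = float(rng.beta(alpha, alpha))   # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2       # interpolated input
    y = lam * y1 + (1.0 - lam) * y2       # interpolated soft label
    return x, y, lam
```

Note that the resulting soft label has non-zero mass only on the two endpoint classes, which is exactly the limitation discussed above.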

Soft labels were also used to improve generalization via ε-smoothing [18], where a probability mass of size ε is distributed over all but the correct class, thus penalizing over-confident predictions. Another simple and effective method to avoid over-confidence on outliers is to include out-of-distribution samples with uniform labels in the training set [6, 10]; the samples can even be as simple as uniform noise images.
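Both labeling schemes above (spreading a small probability mass over the incorrect classes, and assigning uniform labels to outlier samples) can be sketched in a few lines; the function names and the default mass are ours, not from the cited works.

```python
import numpy as np

def epsilon_smooth(one_hot, eps=0.1):
    """Move probability mass eps off the correct class and spread it
    uniformly over the remaining classes."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * (eps / (k - 1))

def uniform_label(n_classes):
    """Uniform label, e.g. for out-of-distribution training samples."""
    return np.full(n_classes, 1.0 / n_classes)
```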

Calibration can also be efficiently achieved after training, most prominently by temperature scaling [5]. This method re-scales softmax scores on a validation set, thereby achieving good calibration without affecting the accuracy of the model. However, temperature scaling does not perform on par with other methods in outlier detection tasks, and when dataset shift occurs [13]. As a post-tuning method on a trained classifier, temperature scaling can be combined with data augmentation and label smoothing methods.
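As a concrete sketch of the rescaling: temperature scaling divides the logits by a scalar temperature fitted on a validation set, which leaves the argmax (and hence accuracy) unchanged while softening the confidences. The names below are ours.

```python
import numpy as np

def softmax(z, axis=-1):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temperature_scale(logits, T):
    """Rescale logits by a scalar temperature T; in practice T is
    fitted on a validation set by minimizing the NLL."""
    return softmax(np.asarray(logits, dtype=float) / T)
```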

For studying adversarial robustness, the authors of [17] introduced the concept of on- and off-manifold adversarial examples. Augmenting the training set with on-manifold adversarial examples is particularly useful for improving generalization performance. However, common perturbations in the image space, including the above-mentioned Mixup, CutMix, and additive random noise, are not constrained to the data manifold. In this work we are interested in the effect of on-manifold data augmentation on uncertainty calibration. The proposed OMADA method trains an autoencoder-based generative model to approximate the data manifold and uses adversarial attacks to create ambiguous samples with soft labels. In contrast to the soft labels created by ε-smoothing, Mixup, and its variants, these soft labels are semantically coherent with the samples (see Fig. 1). In the experiments section, we showcase the benefits of OMADA across multiple benchmark datasets, architectures, and uncertainty evaluation metrics, on both classification and out-of-distribution detection tasks.

Figure 2: Illustration of the three phases of OMADA: (1) a generative model is trained, which is used in step (2) for latent space adversarial attacks to create the OMADA set, which in turn is used in step (3) to train the classifier in image space using on-manifold data augmentation.

3 OMADA Methodology

The core of OMADA is generating realistic (yet ambiguous) samples for data augmentation, and input-dependent soft labels for improving the calibration of classifiers. This section explains in detail how to create on-manifold ambiguous training samples, and how to exploit them for the target classification task. Fig. 2 sketches the three main training phases of OMADA: generative modeling, latent space adversarial attacks, and classifier training with on-manifold data augmentation.

3.1 Generative Modeling

Generative models are used to approximate the inaccessible ground-truth data distribution. We choose BigBi-GAN [1] because it has achieved state-of-the-art results on image synthesis and representation learning tasks, and exploit its design for learning the training data manifold. OMADA is not limited to BigBi-GAN, and can work with any other autoencoder-based generative model.

BigBi-GAN model

The BigBi-GAN model (see Fig. 2-(1), [1]) consists of an encoder E and a decoder G. The encoder maps an input sample x from the training set to a latent code z = E(x) that follows the standard normal distribution N(0, I). The decoder attempts to reconstruct the input from the latent code, x ≈ G(z). Instead of explicitly defining a metric to evaluate the reconstruction loss, BigBi-GAN uses a discriminator network, as in other GAN methods [4]. The discriminator is trained to distinguish decoder outputs from real samples, measuring the Jensen-Shannon divergence between the data distribution and the model distribution induced by the encoder-decoder pair. The decoder competes against the discriminator by synthesizing increasingly realistic samples. As the discriminator is only needed for training BigBi-GAN, it is omitted in Fig. 2-(1).

Latent Space Classifier

The setting of BigBi-GAN described so far is unsupervised. On top of it, we further introduce a latent space classifier that exploits the label information to cluster the latent codes according to their classes. Namely, given the labeled training samples, the classifier is trained by cross-entropy minimization to predict the label from the latent code obtained by applying the encoder to the sample. The cross-entropy multi-class classification loss is added to the original encoder-decoder training loss of BigBi-GAN, and the three networks are jointly trained to fool the discriminator. In doing so, the latent codes of samples from the same class are clustered together, see Fig. 1. We further observe from Fig. 1 that sampling from the boundaries between two class clusters yields ambiguous samples at the decoder output. Such generated samples lie in the support of the model distribution, which closely approximates the data manifold; we therefore treat them as on-manifold samples.

The trained encoder, decoder, and latent space classifier provide all of the necessary tools for OMADA to generate ambiguous samples and corresponding labels, as described in the following section.

3.2 Latent Space Adversarial Attack

OMADA uses the generative model to synthesize samples which specifically raise class ambiguity. Since ambiguous samples should reflect characteristics of two or more classes, their latent codes are expected to lie close to the class decision boundaries of the latent space classifier. We cannot directly generate samples close to decision boundaries by randomly sampling the prior distribution over the latent space; however, adversarial attacks on the latent space classifier provide a targeted way to raise class ambiguity.

We start to explore the latent space from the latent code z of an arbitrary training sample and move in a direction that approaches a target class y_t, where y_t is a one-hot vector encoding the target class label. An adversarial attack, e.g. the projected gradient descent (PGD) method [9], is used to find a perturbation η on z such that the latent space classifier C classifies z + η as the target class rather than the source class. Using the cross-entropy loss, the perturbation is attained by solving the following minimization problem:

min_η  −∑_{k=1}^{K} (y_t)_k log C_k(z + η),      (1)

where K denotes the number of classes and (y_t)_k is the k-th entry of the one-hot vector y_t. Unlike standard adversarial attacks, here we do not need to constrain η to lie within an ε-ball: the decoder is trained to produce realistic samples from any latent code, and the support of the prior distribution is the whole latent space. As depicted in Fig. 2-(2), the workhorse of our second training phase is the attack model that solves (1) in an iterative manner (for all adversarial attacks we perform a fixed number of steps under a fixed norm constraint with a fixed step size). The other networks in Fig. 2-(2) are not changed after phase (1).

By iteratively solving (1), the intermediate perturbed latent codes create an attack path in the latent space, see Fig. 1. Compared to simple linear interpolation in latent space, the proposed adversarial attack path has an important advantage: the adversarial loss (1) penalizes paths that pass through any class cluster other than the target one. As shown in Fig. 1, the attacker mainly explores the empty regions between class clusters (i.e., the decision boundaries of the latent space classifier) to reach the target, and is therefore more efficient than linear interpolation at creating ambiguous samples. Feeding the latent codes along the attack path into the decoder yields, as depicted in Fig. 1, a series of synthetic samples that smoothly diverge from the source and approach a sample of the target class. The samples in between realistically exhibit features of both the source and target classes, and possibly of other classes encountered on the attack path. They lie on the data manifold because the encoder-decoder pair models the data distribution.
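As an illustration of the attack loop, the following toy sketch replaces the learned latent space classifier with a linear one, so that the cross-entropy gradient is available in closed form; the linear model, all names, and the hyperparameter values are our assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def latent_attack_path(z0, W, b, target, steps=20, step_size=0.1):
    """Iterative gradient attack on a toy linear latent classifier
    z -> softmax(W z + b): each step reduces the cross-entropy to the
    target class, and every intermediate latent code is recorded as
    part of the attack path."""
    z = np.asarray(z0, dtype=float).copy()
    path = [z.copy()]
    onehot = np.eye(W.shape[0])[target]
    for _ in range(steps):
        p = softmax(W @ z + b)
        grad = W.T @ (p - onehot)   # d(cross-entropy)/dz for softmax
        z = z - step_size * grad    # unconstrained: no epsilon-ball
        path.append(z.copy())
    return np.stack(path)
```

The recorded intermediate codes play the role of the attack path from which OMADA samples its augmentation set.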

The labels of the samples are obtained by applying the latent space classifier to the latent codes. Unlike one-hot encoded hard label vectors, the softmax responses can take soft values between 0 and 1. Since the perturbation may traverse multiple class boundaries to reach the target, the soft labels are not simply interpolations between the source and target labels, and can have non-zero mass on other classes. As Fig. 1 shows, the soft labels are semantically coherent with the samples synthesized by the decoder. Compared with Mixup [26], which linearly interpolates both the samples and their labels, the proposed adversarial attack always produces on-manifold ambiguous samples and labels them according to their class-specific features.

Using the attacker together with the pretrained BigBi-GAN models to create our OMADA augmentation set, we investigate two ways to sample latent codes from the attack path. The first, and default, mode samples uniformly along the path. The second approach favors samples whose soft labels yield large entropies: after proper normalization, we use the entropies of each latent code's soft label vector along the path to parameterize a probability mass function (pmf), and then sample latent codes according to this constructed pmf.
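The entropy-based sampling scheme can be sketched as follows; this is a NumPy sketch with our own names, and direct normalization of the entropies is the simplest choice, not necessarily the paper's exact one.

```python
import numpy as np

def entropy_sampling_pmf(soft_labels, eps=1e-12):
    """Turn the entropies of the soft labels along an attack path into
    a pmf over path indices; sampling from it favors high-ambiguity
    (boundary-region) latent codes."""
    p = np.clip(np.asarray(soft_labels, dtype=float), eps, 1.0)
    h = -(p * np.log(p)).sum(axis=-1)   # entropy of each soft label
    return h / h.sum()
```

Path indices would then be drawn with, e.g., `rng.choice(len(path), p=pmf)`; the uniform default corresponds to a constant pmf.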

3.3 On-Manifold Data Augmentation

In order to solve the classification task, we train a DNN in the original input space. As shown in Fig. 2-(3), the only difference is that we augment the original dataset with the OMADA set generated in step (2), sampling on-manifold ambiguous samples together with their soft labels and thereby complementing the original training set, which contains only hard labels. Mixing the two datasets has two effects. Firstly, the enlarged training set improves generalization and reduces model uncertainty. As the OMADA set can be made arbitrarily large by repeatedly sampling the latent space, it also prevents overfitting and memorization. Secondly, the DNN learns from the soft labels of OMADA to make soft predictions in addition to hard ones, tempering over-confidence during training and achieving improved calibration at test time. In the subsequent experiment section, we find that soft labeling of ambiguous samples is particularly helpful for detecting out-of-distribution samples, so the model knows what it does not know.
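Hard- and soft-labeled samples can share one loss, since the cross-entropy against soft targets reduces to the standard classification loss when the target is one-hot. A minimal NumPy sketch (names are ours):

```python
import numpy as np

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy H(q, p) = -sum_k q_k log p_k against soft targets
    q; identical to the usual loss when q is one-hot, so mixed batches
    of real (hard-labeled) and augmented (soft-labeled) samples need no
    special casing."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)                   # stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(np.asarray(soft_targets) * log_p).sum(axis=-1).mean())
```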

4 Experiments

Setup We evaluate and compare OMADA against multiple benchmark methods in the literature across datasets (SVHN, CIFAR-10, CIFAR-100) and models (DenseNet [8], Wide-ResNet 28-10 (WRN) [25], ResNeXt-29 [22], and VGG-16 [16]). The benchmark methods from the literature primarily address data augmentation, label smoothing, and combinations of the two, similar to our proposed method. Additionally, we compare to temperature scaling [5], as this is the gold standard for network calibration.

The following is the list of methods we compare against: Base network (trained without data augmentation), standard data augmentation (random crops and flips), Mixup [26], Manifold Mixup [21], ε-smoothing [18], CEDA [6], CutMix [24], and temperature scaling (TS) [5]. Unless otherwise noted, hyperparameters are taken from the original publications. For Mixup, the mixing parameter α is chosen based on the results from [19]. Further details about hyperparameters for individual methods can be found in Section A3.

Training details The training hyperparameters (learning rate, etc.) for each network are listed in the appendix (Section A3); these hyperparameters do not vary across datasets and methods. At the end of training, the model weights used for evaluation are chosen from the epoch with the best validation accuracy. Each reported result is the mean over five independent runs with the same hyperparameters.

For all OMADA-trained networks we evenly balance each batch, with 50% real training samples and 50% on-manifold adversarial samples. In order for these networks to be comparable to the other baselines, we ensure that each epoch has the same number of updates as the Base method (i.e. in each epoch the networks only observe 50% of all real training samples).

4.1 Tasks

While the experimental investigation is primarily focused on calibration, we also look at other applications of network uncertainty, and the classification accuracy.

Calibration A classifier is well-calibrated if its probabilistic output corresponds to the actual likelihood of being correct, i.e. of all images a network predicts with a softmax confidence of p, approximately a fraction p should be classified correctly. This is typically measured by creating a reliability diagram, in which images are binned by the softmax value of their predicted class, and calculating some distance metric between the resulting curve and the ideal calibration curve. The most popular of these metrics is the Expected Calibration Error (ECE) [5]. We instead use the Adaptive Calibration Error (ACE), which uses an equal number of images per bin; this metric is more robust w.r.t. binning hyperparameters and the baseline network accuracy [12]:

ACE = (1 / (KR)) ∑_{k=1}^{K} ∑_{r=1}^{R} | acc(r, k) − conf(r, k) |,      (2)

in which K is the number of classes, R is the number of calibration ranges, N is the total number of data points, and calibration range r is defined by the ⌊N/R⌋-th index of the sorted softmax predictions.
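A minimal sketch of the metric, aggregating over classes for brevity (a simplification of the per-class formulation in [12]; names are ours):

```python
import numpy as np

def adaptive_calibration_error(confidences, correct, n_ranges=10):
    """ACE sketch with equal-mass ranges: sort predictions by
    confidence, split them into ranges holding the same number of
    points, and average |accuracy - mean confidence| over the ranges."""
    order = np.argsort(confidences)
    conf = np.asarray(confidences, dtype=float)[order]
    corr = np.asarray(correct, dtype=float)[order]
    gaps = [abs(a.mean() - c.mean())
            for c, a in zip(np.array_split(conf, n_ranges),
                            np.array_split(corr, n_ranges))]
    return float(np.mean(gaps))
```

Because the ranges are equal-mass rather than equal-width, sparse high-confidence regions do not dominate the estimate.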

Outlier Detection Outlier detection focuses on identifying out-of-distribution (OOD) inputs at test time, based on thresholding the predicted uncertainty. OOD data can be a completely different dataset, corrupted data, or classes from the same dataset not seen during training. The outlier detection experiments in this paper focus on the first case: the network is trained on CIFAR-10, and the predicted softmax is used to identify anomalous SVHN images at inference time. The metric used for evaluating outlier detection performance is the area under the receiver operating characteristic (ROC) curve (AUC); intuitively, this measures the ability of the uncertainty measure to classify an input as in-distribution or out-of-distribution over various thresholds.
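The AUC over all thresholds can be computed directly from the two sets of confidence scores via the rank-sum (Mann-Whitney U) identity; a sketch under the assumption of no tied scores, with our own names:

```python
import numpy as np

def auroc(in_conf, ood_conf):
    """AUC of the ROC curve for separating in-distribution from OOD
    inputs by confidence, via the rank-sum identity (no tie handling)."""
    scores = np.concatenate([in_conf, ood_conf])
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_in, n_ood = len(in_conf), len(ood_conf)
    u = ranks[:n_in].sum() - n_in * (n_in + 1) / 2.0
    return float(u / (n_in * n_ood))
```

A value of 1.0 means every in-distribution confidence exceeds every OOD confidence; 0.5 means the confidences are indistinguishable.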

Other Uncertainty Measures We also investigate how confident the networks are on OOD data by measuring the Mean Maximal Confidence (MMC), and how well the produced uncertainty estimates correlate with the true error by producing sparsification plots [23]. These results, and a more detailed explanation of the metrics, can be found in the appendix (Section A4).

(a) CIFAR-10 ACE
(b) CIFAR-10 Accuracy
(c) CIFAR-100 ACE
Figure 3: Calibration performance (ACE) of label-smoothing methods on in-distribution data for CIFAR-10 (a), CIFAR-100 (c), and SVHN (d). Hatched bars indicate the best-performing method per network architecture. Error bars are 1 std. dev. over 5 runs. (b) Shows the classification accuracy on CIFAR-10. Across all datasets and architectures, OMADA achieves better ACE (except for DenseNet and CIFAR-10), and higher accuracy than most other calibration methods.

4.2 Experimental Results

Network Calibration We first compare the calibration and accuracy performance of OMADA against a range of baselines and similar methods. Figure 3 visualizes the ACE performance on the in-distribution test set. OMADA shows significant improvements over all dataset and model combinations, except for DenseNet on CIFAR-10, where it is slightly outperformed only by CutMix. However, CutMix does not perform consistently well across other architectures and datasets, e.g. on WRN and ResNeXt. Mixup is also comparable to OMADA on CIFAR-100 for DenseNet and VGG, but is significantly worse in other situations (e.g. on SVHN).

We observe that the ACE of the base network is relatively low for some networks (e.g. CIFAR-10 + ResNeXt) which have large capacity; this is likely due to the fact that early stopping was used during training, and the results of the model with the highest accuracy are reported. Further investigations on the effect of early stopping are shown in the appendix Section A5.

We observe larger performance gains for harder datasets such as CIFAR-100 (Fig. 3(c)), as well as SVHN (Fig. 3(d)), where the dataset contains multiple class instances (digits) in the same image, introducing high uncertainty.

The stability of OMADA’s performance across models is remarkable. The selected networks range from low capacity networks such as DenseNet, larger networks such as WRN and ResNeXt, as well as a network architecture with multiple dense layers (VGG). Achieving such low calibration errors across such a diverse list of network architectures and multiple datasets posed great challenges to alternative methods in the literature, further emphasizing the benefits of OMADA for calibrated network training.

Fig. 3(b) shows the accuracy on CIFAR-10 across all models. Standard data augmentation yields the highest accuracy for all models, but OMADA always improves over the base classifier and outperforms most other methods. This indicates that the improved calibration obtained by OMADA does not come at the expense of a drop in accuracy; rather, it significantly increases accuracy. This observation is consistent across all studied datasets (results shown in the appendix, Section A4). Additionally, standard data augmentation can be applied on top of all of the label-smoothing methods as well as OMADA; we leave this investigation to future work.

In summary, for in-distribution samples, OMADA results in well-calibrated, accurate classifiers across many network architectures and datasets, especially in comparison to competing label smoothing/data augmentation approaches.

Temperature scaling As temperature scaling (TS) [5] is an orthogonal post-processing calibration technique, we separately compare the effect of TS applied to the Base network as well as on OMADA. Fig. 4 compares the ACE of Base and OMADA with their respective temperature scaling variants. It can be seen that for CIFAR-10, temperature scaling on Base mostly surpasses the calibration performance of OMADA alone, but the best performance is obtained by applying temperature scaling on top of OMADA. For harder datasets such as CIFAR-100, OMADA alone achieves a similar or often better ACE than Base-TS.

An interesting observation about TS can be seen in Fig. 4: OMADA-TS does not always produce better calibration than OMADA. This is an unintuitive effect, though further investigation showed similar behavior for other methods in the literature, usually in cases where the calibration error without temperature scaling is already fairly low (as for OMADA). This can happen because the NLL, which TS optimizes, is not directly correlated with the ACE metric. This result calls for careful consideration when using temperature scaling for calibration, as it can degrade performance for already well-calibrated networks. A simple alternative is a grid search over temperatures, choosing the one which yields the best calibration performance on a validation set. For further explanation, and an example of this phenomenon, see Section A6 in the appendix. Furthermore, as temperature scaling normalizes the logits with a constant, it does not change the accuracy of the models, and thus does not come with the accuracy improvements of OMADA.

(a) CIFAR-10
(b) CIFAR-100
Figure 4: Calibration performance (ACE) of OMADA and temperature scaling (TS) on in-distribution data. Hatched bars indicate the best-performing method. Error bars are 1 std. dev. over 5 runs.

OMADA Variants In this part we show the performance of different OMADA variants, to investigate the effects of adding ambiguous images and soft labels independently.

We first investigate an alternative sampling method, which preferentially samples images from the path with high label entropies, as opposed to uniformly sampling them. We call this variant OMADA-SE (Sample from Entropy). Furthermore, we study the effect of the soft labels produced by the latent-space classifier by training the networks with the generated ambiguous samples from OMADA and OMADA-SE, but changing the labels. We either harden the soft labels based on the maximum class probability (OMADA*-H), or change the class labels to be uniform across all classes (OMADA*-U).
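The two label ablations are simple transformations of the soft label matrix; a NumPy sketch with our own function names:

```python
import numpy as np

def harden(soft_labels):
    """OMADA*-H: replace each soft label with a one-hot vector at the
    class of maximum probability."""
    soft = np.asarray(soft_labels, dtype=float)
    hard = np.zeros_like(soft)
    hard[np.arange(len(soft)), soft.argmax(axis=1)] = 1.0
    return hard

def uniformize(soft_labels):
    """OMADA*-U: replace each soft label with the uniform distribution
    over all classes."""
    soft = np.asarray(soft_labels, dtype=float)
    return np.full_like(soft, 1.0 / soft.shape[1])
```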

We investigate the resulting network calibration (ACE), the accuracy (ACC), and the outlier detection performance (AUC) of the variants. The results can be observed in Table 1 for CIFAR-10 on DenseNet and WRN.

Model  Method       ACE     ACC     AUC
DN     Base         0.0319  92.768  0.9076
DN     OMADA        0.0138  94.530  0.9252
DN     OMADA-H      0.0058  93.652  0.8913
DN     OMADA-U      0.0260  93.890  0.9678
DN     OMADA-SE     0.0273  94.988  0.9741
DN     OMADA-SE-H   0.0281  94.258  0.9119
DN     OMADA-SE-U   0.0302  94.618  0.9786
WRN    Base         0.0341  91.596  0.9022
WRN    OMADA        0.0208  95.772  0.9243
WRN    OMADA-H      0.0172  95.150  0.9210
WRN    OMADA-U      0.0274  95.446  0.9750
WRN    OMADA-SE     0.0207  96.022  0.9833
WRN    OMADA-SE-H   0.0222  95.882  0.9643
WRN    OMADA-SE-U   0.0231  95.248  0.9346
Table 1: Performance of OMADA ablation methods on calibration (ACE), network accuracy (ACC), and outlier detection (AUC). -H refers to the respective hard-label variant, -U to the respective uniform-label variant. DN refers to DenseNet. Bold entries indicate the best-performing method. We report the mean over five independent runs for each method (std. devs. can be found in the appendix).

OMADA-SE focuses sampling on the high-entropy regions of the path (i.e. a higher chance of sampling pure boundary-region samples). We observe in Table 1 that using this alternative sampling scheme instead of sampling uniformly across the attack paths provides a variant of OMADA which performs very competitively on multiple tasks, especially on outlier detection.

The effect of hardening the labels yields surprisingly good results on ACE, where it sometimes improves calibration over the corresponding soft label variant, suggesting that the ambiguous samples generated by the on-manifold attacks alone are enough to improve the network’s confidence estimates. However, this gain comes at the cost of a drop in accuracy, suggesting that the soft labels help generalization. This observation will be the focus of future research.

The effect of hardening labels differs between the sampling variants: as OMADA-SE contains more samples with high-entropy soft labels, the change in label density is much more drastic than in OMADA, which also produces samples far away from decision boundaries (where the soft label is already relatively hard). This is especially evident in the outlier detection performance: OMADA-H and OMADA-SE-H suffer in comparison to their soft-label counterparts. These observations are consistent with the OOD-MMC results reported in the appendix (Section A7).

Next, we study the effect of assigning uniform class labels to each ambiguous sample generated by the adversarial attack. Here, it becomes apparent that the soft labels of the ambiguous samples are required to attain competitive ACE and accuracy on in-distribution data. However, on out-of-distribution samples, where AUC and OOD-MMC are optimized by predicting near-uniform class labels, the OMADA*-U networks do very well. This is consistent with observations from CEDA [6], where uniform class labels are also used to improve the detection of OOD samples (shown in Fig. 5).

In summary, changing the soft labels increases performance on some tasks, but degrades performance across other tasks; the best choice of labels is then task-dependent. On average, the soft labeled methods (OMADA and OMADA-SE) perform stably across tasks.

Outlier Detection In order to put the outlier detection abilities of OMADA-SE (the best variant across multiple tasks) into context, we compare its AUC to the already-investigated label smoothing methods (Fig. 5). OMADA-SE outperforms all other methods on both DenseNet and WRN, albeit with only a small gap to CEDA on DenseNet. The good performance of CEDA is not surprising, as it is implicitly trained to predict lower confidence on out-of-distribution samples (CEDA uses adversarially perturbed uniform random noise as the out-of-distribution samples). Interestingly, soft labels alone are not enough to yield good outlier detection, as evidenced by the poor performance of ε-smoothing.
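The outlier detection AUC can be computed from confidence scores alone; a small sketch using the rank-based (Mann-Whitney) formulation, under the standard assumption that the predicted confidence serves as the in-distribution score (ties are not rank-averaged in this sketch):

```python
import numpy as np

def ood_auc(conf_in, conf_out):
    """AUROC for separating in-distribution from out-of-distribution
    inputs, scoring each input by its predicted confidence."""
    scores = np.concatenate([conf_in, conf_out])
    labels = np.concatenate([np.ones(len(conf_in)), np.zeros(len(conf_out))])
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos, n_neg = len(conf_in), len(conf_out)
    # fraction of (in, out) pairs where the in-distribution input scores higher
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```

An AUC of 1.0 means every in-distribution input received a higher confidence than every OOD input; 0.5 corresponds to chance-level separation.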

Stochastic DNN methods We compare the calibration and outlier detection performance of the OMADA variants to both Monte Carlo (MC) Dropout [2] and Ensembles, as these are commonly used to obtain uncertainty estimates and have been shown to improve network calibration (results in Section A8). As with temperature scaling, these methods are orthogonal to OMADA and can easily be combined with it. While an Ensemble is competitive on ACE, it performs worse than OMADA-SE on outlier detection. MC-Dropout is less competitive on both tasks.

Figure 5: Outlier detection performance (AUC) of label smoothing methods. Hatched bars indicate the best-performing method. Error bars are 1 std. dev. over 5 runs.

5 Conclusion and Discussion

In this article we have introduced the concept of on-manifold adversarial data augmentation for uncertainty estimation by leveraging recent advances in generative modeling. We use a latent space classifier to estimate the class decision boundaries on the approximated data manifold. This lays the groundwork for our novel sampling procedure, which directs the search through the latent space towards regions that are challenging for the latent space classifier (i.e. decision boundary regions). Leveraging the ability to sample specifically in these challenging regions on the manifold, we use a decoder to generate ambiguous samples and the latent space classifier to assign them soft labels, resulting in our OMADA dataset. Through a range of carefully chosen experiments, we study the effect that OMADA has when training an independent image space classifier.

An extensive set of experiments shows significant improvements across multiple datasets and diverse network architectures, as well as on multiple tasks. The stability of the OMADA results for ACE across multiple networks is a particularly desirable property, as most alternative methods fail to perform well across all investigated networks. OMADA can be combined with post-processing methods such as temperature scaling [5], and we are confident that further beneficial combinations and extensions of the key concept of OMADA will be discovered in future research. Furthermore, we show that OMADA additionally yields increased classification accuracy compared to Base and most other methods.

Lastly, we show that OMADA-SE, which focuses its data generating sampling from boundary regions, outperforms all other methods for outlier detection.

This is a first attempt at using on-manifold adversarial samples for the study of uncertainty. Initial results show significant improvements in the network's ability to assign confidence to its predictions on in-distribution as well as out-of-distribution samples. Further studies are required to investigate the behavior of these networks on data which marginally leaves the data manifold (e.g. unseen transformations or corruptions).


  • [1] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. Advances in Neural Information Processing Systems. Cited by: §3.1, §3.1.
  • [2] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In icml, Cited by: §4.2.
  • [3] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR). Cited by: §1.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §3.1.
  • [5] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning (ICML), Cited by: §1, §2, §4.1, §4.2, §4, §4, §5.
  • [6] M. Hein, M. Andriushchenko, and J. Bitterwolf (2019) Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, item 6, §4.2, §4.
  • [7] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. International Conference on Learning Representations (ICLR). Cited by: §1.
  • [8] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, Table A1, §4.
  • [9] A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial machine learning at scale. International Conference on Learning Representations (ICLR). Cited by: §3.2.
  • [10] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [11] A. Meinke and M. Hein (2019) Towards neural networks that provably know when they don’t know. arXiv preprint arXiv:1909.12180. Cited by: §1.
  • [12] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran (2019) Measuring calibration in deep learning. In CVPR Workshop: Uncertainty and Robustness in Deep Visual Learning, Cited by: §4.1.
  • [13] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530. Cited by: §1, §2.
  • [14] J. Peterson, R. Battleday, T. L. Griffiths, and O. Russakovsky (2019) Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • [15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. Proceedings of the IEEE International Conference on Computer Vision (ICCV). Cited by: §1.
  • [16] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR). Cited by: §1, Table A1, §4.
  • [17] D. Stutz, M. Hein, and B. Schiele (2019) Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.
  • [19] S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, and S. Michalak (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. Advances in Neural Information Processing Systems. Cited by: §1, §2, item 3, item 5, §4.
  • [20] V. Vapnik (1992) Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [21] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio (2019) Manifold mixup: better representations by interpolating hidden states. In International Conference on Machine Learning (ICML), Cited by: §2, item 4, §4.
  • [22] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, Table A1, §4.
  • [23] Y. Geifman and R. El-Yaniv (2017) Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, Cited by: §4.1, §A4.2.
  • [24] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2, item 7, §4.
  • [25] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §1, Table A1, §4.
  • [26] H. Zhang, M. Cissé, Y. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. International Conference on Learning Representations (ICLR). Cited by: §1, §2, item 3, §3.2, §4.

Appendix: On-manifold Adversarial Data Augmentation Improves Uncertainty Calibration

A1 Visualizing Other Attack Paths

Fig. A1 depicts more examples of attack paths, with different start and end targets, produced by the presented method. The OMADA attack path examples include paths where the target is set to another class (e.g. the blue path with target “2”), as well as paths where the target is a decision boundary (e.g. the green path with target between “1” and “8” and the red path with target between “0” and “2”). The decision boundary between two classes can be reached by setting the target vector in Eq. 1 to place equal probability on the two classes and zero probability elsewhere. It can be seen that the images produced along the decision boundary paths are confusing samples which mix features from the neighboring clusters. Furthermore, this confusion is reflected in the soft label.
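Constructing such a boundary target can be sketched as below; placing equal mass on the two classes is our reading of "decision boundary" here, since the exact numeric values of the target do not survive in this text:

```python
import numpy as np

def boundary_target(num_classes, class_a, class_b):
    """Soft target aimed at the decision boundary between two classes:
    equal probability on class_a and class_b, zero elsewhere."""
    t = np.zeros(num_classes)
    t[class_a] = t[class_b] = 0.5
    return t
```

For the green path in Fig. A1, for example, this would be `boundary_target(10, 1, 8)`.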

Figure A1: Visualization of an MNIST encoder-decoder latent space with multiple trajectories traversing the latent space. The paths depict on-manifold adversarial attack paths, as well as an Input Mixup projection into the same latent space. Below the latent space we visualize the decoded image path for OMADA (top 3 blocks) and the Input Mixup images (bottom block), along with their corresponding soft labels (10 rows below the images; red intensity corresponds to the likelihood for classes 0 to 9) and the class entropy (bottom row; black shows high entropy). The green and red paths are generated by setting the target to a soft label between two classes (targeting specifically the decision boundaries). For example, the green path starts at cluster “0” and optimizes Eq. 1 with the target set to a soft label placing equal probability on classes “1” and “8” and zero elsewhere. As a result, this produces perturbations which direct the path to the decision boundary between these two classes. The magenta path shows the projection of Input Mixup images between samples “5” and “2”. It can be seen that the OMADA paths produce mostly confusing samples at the decision boundaries, and that the soft labels reflect this confusion, whereas Input Mixup produces images which resemble an “8” (seen in the image path below, as well as in the magenta projection path going to the “8” cluster first before heading towards the target cluster “2”); Mixup does not reflect this in its soft label (the soft label is zero at class “8”).

A2 Input Mixup Example

In Fig. A1, the Input Mixup projection path is visualized in magenta. This path is produced by projecting the linearly interpolated images produced by Input Mixup into the latent space using the encoder. Even though Mixup mainly produces unrealistic images (Fig. 1), when it does produce realistic samples from another class, the soft label does not reflect the presence of this class. For example, in Fig. A1, Input Mixup produces an interpolated image between the classes “5” and “2” which looks similar to an “8”. Input Mixup assigns zero probability to class “8”, whereas with our encoder the image gets mapped to the “8” cluster, meaning a soft label produced by our method would reflect the presence of class “8”.

A3 Experiment Hyperparameters

This section presents detailed information about the training process.

A3.1 Model and Training Hyper-parameters

All optimizer training hyper-parameters for the training of the image-space classifiers can be found in Table A1. These parameters are kept unchanged across the three datasets and all methods, as well as across all repetitions (where only the random seed was changed).

Model                      Num Params.   Weight  Nesterov  Epochs  LR          LR          LR     Batch
                           (10 classes)  Decay                     Milestones  Scheduler   Decay  Size
DenseNet (L=100,k=12) [8]  796,162       –       True      –       –           Multi-step  –      64
WRN-28-10 [25]             36,479,194    –       True      –       –           Multi-step  –      128
VGG-16 [16]                33,646,666    –       True      –       –           Multi-step  –      128
ResNeXt-29 [22]            34,426,698    –       True      –       –           Multi-step  –      128
Table A1: Training hyper-parameters. We use SGD with the same base learning rate and momentum for all trainings.

A3.2 OMADA Training Hyper-parameters

Each OMADA-trained network uses a balanced mix of real training samples (with hard one-hot labels) and On-manifold adversarial samples in each batch. In order to enable direct comparison to alternative methods in the literature, we ensure that the total number of gradient updates performed per epoch is the same despite the balanced mix of samples from both datasets. Therefore, at the end of each epoch, part of the real training samples have not been seen and were instead replaced by On-manifold adversarial samples. It should be noted that the real samples seen during each epoch vary across epochs. In order to speed up the training process, we create an offline On-manifold adversarial dataset, and sample from this dataset to fill up each batch during training.
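The batch construction can be sketched as follows, under the assumption of an even 50/50 split, which the word "balanced" suggests (the exact fractions are not restated here):

```python
import numpy as np

def mixed_batch(real_x, real_y, omada_x, omada_y, batch_size, rng):
    """One training batch: half real samples (hard one-hot labels),
    half pre-generated on-manifold adversarial samples (soft labels)."""
    n_real = batch_size // 2
    n_adv = batch_size - n_real
    ri = rng.choice(len(real_x), size=n_real, replace=False)
    ai = rng.choice(len(omada_x), size=n_adv, replace=False)
    xs = np.concatenate([real_x[ri], omada_x[ai]])
    ys = np.concatenate([real_y[ri], omada_y[ai]])
    return xs, ys
```

Because the adversarial samples come from a fixed offline dataset, drawing fresh random indices per batch is all that is needed to vary the real samples seen across epochs.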
For all networks, K random training samples are withheld to create a validation set. The validation set accuracy is used for early stopping, and all experiments (unless stated otherwise) report the results from the checkpoint with the highest validation accuracy during training. Furthermore, the validation set is used to find the best temperature to produce the temperature scaling results.

A3.3 Details about Literature Methods

This sub-section reports the hyper-parameters of the alternative methods in the literature in more detail.

  1. Base: base network trained using only real samples with hard labels and no data augmentation.

  2. Standard augmentation (std_aug): base network trained with data augmentation on the training samples (random crops with padding and random horizontal flips).

  3. Mixup: mixup training [26] with the interpolation parameter setting from [19]. Augments the training dataset by linearly interpolating between both images and labels within a mini-batch.

  4. Manifold Mixup: extends mixup training by taking linear interpolations of hidden-layer activations in the network and hard labels. We use the settings from [21].

  5. ε-smoothing: smooths the labels with the ε found to be best in [19], taking a linear combination of the one-hot label and the uniform distribution.

  6. CEDA: confidence enhancing data augmentation (CEDA) is a training scheme that enforces uniform confidences on out-of-distribution noise. These out-of-distribution images are included in the training by replacing half of the batch of real samples with permuted-pixel images and uniform random noise images. Each of these augmented images is smoothed with a Gaussian filter (with the standard deviation from [6]) so that the noise has more low-frequency structure. The label for each of these images is the uniform class label.

  7. Cut-Mix: replaces patches of regions in the image with patches from another sample in the batch. The authors claim this generates more locally natural images than Mixup. The combination ratio is sampled from a Beta distribution, and the cut-mix probability is set so that cut-mix is performed for all samples. The procedure for sampling the image patches to replace is kept identical to [24].
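For concreteness, the label constructions behind Mixup (item 3) and ε-smoothing (item 5) can be sketched as below; the specific λ and ε values recommended in [19] are not reproduced here:

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Input Mixup: the same convex combination is applied
    to the two images and to their labels."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def eps_smooth(one_hot, eps):
    """Label smoothing: mix the one-hot label with the uniform
    distribution over all classes."""
    k = one_hot.shape[-1]
    return (1 - eps) * one_hot + eps * np.full(k, 1.0 / k)
```

Both transformations keep the label a valid probability distribution, which is what makes them comparable to OMADA's soft labels in the experiments above.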

A4 Additional Results

A4.1 Accuracy

In Fig. A2 we report the classification accuracy for CIFAR-100 and SVHN. We make similar observations as in Fig. 2(b). OMADA achieves an improvement in accuracy compared to the Base models and most other methods, emphasizing that the gain in calibration does not come at the cost of a drop in accuracy.

(a) CIFAR-100 Accuracy
(b) SVHN Accuracy
Figure A2: Classification accuracy on CIFAR-100 (a) and SVHN (b). Across both datasets, we observe similar behavior as in Fig. 2(b).

A4.2 Sparsification and OOD-MMC

In this section we report the Sparsification results on the in-distribution test set and Mean Maximal Confidence on out-of-distribution data (OOD-MMC). Sparsification evaluates how well a given uncertainty estimate correlates with the true error; intuitively, we want our networks to be more confident about correct predictions, and less confident about incorrect predictions [23]. This is calculated by selectively calculating the classification accuracy on increasingly large subsets of the test set. Samples are added to the subset based on their uncertainty; the more certain samples are added first. The final metric is the difference between the curve generated by the method and the ideal curve, in which all incorrectly-classified images have a higher uncertainty than all correctly-classified images. A lower Sparsification error is desired.
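The computation described above can be sketched as follows (averaging the gap to the oracle curve over all subset sizes is our assumption; the text only states that the metric is the difference between the method's curve and the ideal curve):

```python
import numpy as np

def sparsification_curve(confidences, correct):
    """Accuracy on increasingly large subsets of the test set,
    adding the most confident samples first."""
    order = np.argsort(-confidences)  # most certain first
    correct_sorted = correct[order].astype(float)
    return np.cumsum(correct_sorted) / np.arange(1, len(correct_sorted) + 1)

def sparsification_error(confidences, correct):
    """Mean gap to the oracle curve in which every incorrect prediction
    is less confident than every correct one. Lower is better."""
    curve = sparsification_curve(confidences, correct)
    oracle = sparsification_curve(correct.astype(float), correct)  # perfect ranking
    return float(np.mean(oracle - curve))
```

A perfectly ranked uncertainty estimate yields an error of zero; mis-ranked samples push the curve below the oracle and increase the error.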

Another measure for evaluating the over-confidence of networks is the OOD-MMC on out-of-distribution data. For out-of-distribution samples we want the network to assign the lowest possible confidence, reflecting maximum uncertainty. The Mean Max Confidence (MMC) measures how well the network performs the task of assigning a low confidence to unseen samples.
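OOD-MMC itself reduces to a one-liner over the predicted class distributions; a sketch:

```python
import numpy as np

def ood_mmc(softmax_probs):
    """Mean Maximal Confidence over a set of out-of-distribution inputs.
    For a maximally uncertain network this approaches 1/num_classes."""
    return float(np.mean(np.max(softmax_probs, axis=1)))
```

Lower values indicate that the network refrains from confident predictions on OOD data.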

These results can be seen in Fig. A3. We observe that OMADA-SE significantly improves the Sparsification error compared to all other methods except Standard Augmentation, to which it performs similarly. For OOD-MMC, we observe that OMADA-SE performs better than all other methods except CEDA, which has the lowest OOD-MMC on DenseNet. For WRN, however, OMADA-SE again performs best.

(a) CIFAR-10 Sparsification Error
(b) CIFAR-10 OOD-MMC
Figure A3: Sparsification (a) and OOD-MMC (b) for CIFAR-10 on DenseNet and WRN. For Sparsification error we observe that OMADA-SE has a significantly lower error compared to all other methods except Standard Augmentation. For OOD-MMC, we observe that the best performing methods are CEDA and OMADA-SE. Both methods have significantly lower confidence assigned to OOD data compared to all other methods.

A5 ACE on Last Epoch Checkpoint

In the main paper, we report the evaluations based on the model weights resulting in the highest validation accuracy. In order to show that the results are consistent with results from the last epoch model weights, in this section we report the ACE results for CIFAR-10 on all models. This ensures that all models were trained for the exact same number of epochs. Fig. A4 shows ACE results for CIFAR-10 for all models for the last epoch checkpoint. It can be seen that similar performance orderings can be observed compared to Fig. 2(a). Most methods have a worse ACE when evaluating using the last epoch (as longer training often increases mis-calibration), though surprisingly some exceptions do exist. This suggests a further study into the temporal aspect of network calibration across training epochs would be informative.

Figure A4: Calibration performance (ACE) of label-smoothing methods on in-distribution test data for CIFAR-10 for the last epoch checkpoint. Hatched bars indicate the best-performing method per network architecture.

A6 When Does Temperature Scaling Help?

Temperature scaling is a simple method for improving network calibration. Interestingly, we observe that temperature scaling does not always improve performance; for networks which are already fairly well calibrated, the ACE gets worse when temperature scaling is applied. This suggests that the negative log-likelihood (NLL) optimized by temperature scaling does not always correlate with a lower ACE (or ECE). We show this phenomenon for WRN on CIFAR-100, where the optimized temperature increased the calibration error (ACE). Fig. A5 shows the NLL and ACE when performing a grid search across temperatures. It can be seen that the best temperature (T=0.952) based on the validation NLL (vertical dashed black line) minimizes the ACE on neither the test nor the validation set. This shows that the NLL and ACE are not perfectly correlated (similar observations were made for ECE), and that a grid search for the temperature based on the ACE might be an alternative option to find better temperatures.
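The grid search can be sketched as follows; passing a calibration-error criterion (ACE or ECE) instead of the NLL implements the alternative suggested above. The grid range is an arbitrary choice for this sketch:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood at temperature T."""
    p = softmax(logits, T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def grid_search_temperature(logits, labels, criterion,
                            temps=np.linspace(0.5, 3.0, 26)):
    """Pick the temperature minimizing an arbitrary criterion
    (NLL, ECE, ACE, ...) on a held-out validation set."""
    scores = [criterion(logits, labels, T) for T in temps]
    return float(temps[int(np.argmin(scores))])
```

Since only the criterion changes, comparing the temperatures found by NLL and by a calibration-error criterion directly exposes the mismatch discussed above.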

(a) CIFAR-100 + WRN Negative Log-Likelihood (NLL)
(b) CIFAR-100 + WRN ACE
Figure A5: The figure depicts the NLL (a) and ACE (b) values when performing a grid search for the best temperature T on the validation and test sets for CIFAR-100 on WRN. The vertical dashed black line shows the chosen temperature (T = 0.952) based on the lowest NLL on the validation set (i.e. the optimized temperature). It can be seen that for both validation and test sets, the optimized temperature does not minimize the ACE. In this case, not applying temperature scaling (T=1) would give a lower ACE.

A7 Ablation Studies

Here we report the OOD-MMC results of the ablation study, as well as the standard deviations of the networks in Table 1. We report these numbers in Table A2. It can be seen that, similar to AUC, the soft labels are important for obtaining a lower MMC on out-of-distribution data.

Table A2: We report the mean and standard deviations across runs for all OMADA variants. The means were also shown in Table 1 and show the ACE, ACC and AUC of each network. Additionally, this table shows the OOD-MMC for all the networks.

A8 Stochastic Bayesian Neural Network Approximations

Here we report the results of two stochastic Bayesian neural network approximations: Ensembles and MC-Dropout (15 forward passes). As these approaches are orthogonal to the training schemes and can be applied on top of any of them, we compare our results with the Base network when applying these two stochastic approaches for uncertainty estimation.

Fig. A6 shows the results for an Ensemble and MC-Dropout for CIFAR-10 on DenseNet and WRN. Each ensemble entry reports the mean and standard deviation across ensembles of independently trained networks. As none of the networks reported in the paper are trained with dropout, we specially train DenseNet and WRN with dropout in order to compare against MC-Dropout. As these networks can be considered to have a different architecture compared to their no-dropout counterparts, we report the ACE and AUC for a single deterministic forward pass through the network (Dropout-1FP) and compare this to 15 stochastic forward passes (Dropout-15FP). We observe that for ACE on DenseNet, Base-ENS performs best, though an ensemble of OMADA networks achieves similar performance. For WRN, however, OMADA-ENS significantly surpasses Base-ENS. This shows that ensembles help to improve network calibration, though they come at the cost of expensive compute during inference. For ACE on both networks, MC-Dropout does not perform competitively.

On the other hand, when comparing the AUC numbers, we see that Base-ENS only slightly improves on Base and falls short of all OMADA-trained networks (with and without an ensemble on top). OMADA-ENS improves on OMADA alone, though interestingly it does not perform as well as OMADA-SE (which contains many more high-entropy, confusing samples).

(a) CIFAR-10 ACE
(b) CIFAR-10 AUC
Figure A6: The figure depicts the ACE (a) and AUC (b) for CIFAR-10 on DenseNet and WRN. We denote the methods which involve an ensemble with “*-ENS” and report the mean and standard deviation across sets of ensembles. The Dropout networks are specially trained with dropout, and we report results for a single deterministic forward pass (Dropout-1FP) and for 15 stochastic forward passes (Dropout-15FP).