Manifold Mixup: Encouraging Meaningful On-Manifold Interpolation as a Regularizer

06/13/2018 · Vikas Verma et al. · Aalto University, Université de Montréal

Deep networks often perform well on the data manifold on which they are trained, yet give incorrect (and often very confident) answers when evaluated on points from off of the training distribution. This is exemplified by the adversarial examples phenomenon but can also be seen in terms of model generalization and domain shift. We propose Manifold Mixup which encourages the network to produce more reasonable and less confident predictions at points with combinations of attributes not seen in the training set. This is accomplished by training on convex combinations of the hidden state representations of data samples. Using this method, we demonstrate improved semi-supervised learning, learning with limited labeled data, and robustness to adversarial examples. Manifold Mixup requires no (significant) additional computation. Analytical experiments on both real data and synthetic data directly support our hypothesis for why the Manifold Mixup method improves results.


1 Introduction

Deep neural networks are the backbone of state-of-the-art systems for computer vision, speech recognition, and language translation (LeCun et al., 2015). However, these systems perform well only when evaluated on instances very similar to those from the training set. When evaluated on slightly different distributions, neural networks often provide incorrect predictions with strikingly high confidence. This is a worrying prospect, since deep learning systems are being deployed in settings where data may be subject to distributional shifts. Adversarial examples (Szegedy et al., 2014) are one such failure case: deep neural networks with nearly perfect performance provide incorrect predictions with very high confidence when evaluated on perturbations imperceptible to the human eye. Adversarial examples are a serious hazard when deploying machine learning systems in security-sensitive applications. More generally, deep learning systems quickly degrade in performance as the distributions of training and testing data differ slightly from each other (Ben-David et al., 2010).

Figure 1: An experiment on a network trained on the 2D spiral dataset with a 2D bottleneck hidden representation in the middle of the network. Manifold Mixup has three effects on learning when compared to vanilla training. First, it smooths decision boundaries (from a. to b.). Second, it improves the arrangement of hidden representations and encourages broader regions of low-confidence predictions (from d. to e.). Black dots are the hidden representations of inputs sampled uniformly from the range of the input space. Third, it flattens the representations (c. at layer 1, f. at layer 3). Figure 2 shows that these effects are not accomplished by other well-studied regularizers (input mixup, weight decay, dropout, batch normalization, and adding noise to the hidden representations).

In this paper, we identify several troubling properties of the hidden representations and decision boundaries of state-of-the-art neural networks. First, we observe that the decision boundary is often sharp and close to the data. Second, we observe that the vast majority of the hidden representation space corresponds to high confidence predictions, both on and off of the data manifold.

Motivated by these intuitions we propose Manifold Mixup (Section 2), a simple regularizer that addresses several of these flaws by training neural networks on linear combinations of hidden representations of training examples. Previous work, including the study of analogies through word embeddings (e.g. king − man + woman ≈ queen), has shown that interpolations are an effective way of combining factors (Mikolov et al., 2013). Since high-level representations are often low-dimensional and useful to linear classifiers, linear interpolations of hidden representations should explore meaningful regions of the feature space effectively. To use combinations of hidden representations of data as novel training signal, we also perform the same linear interpolation in the associated pair of one-hot labels, leading to mixed examples with soft targets.

To start off with the right intuitions, Figure 1 illustrates the impact of Manifold Mixup on a simple two-dimensional classification task with small data. In this example, vanilla training of a deep neural network leads to an irregular decision boundary (Figure 1a), and a complex arrangement of hidden representations (Figure 1d). Moreover, every point in both the raw (Figure 1a) and hidden (Figure 1d) data representations is assigned a prediction with very high confidence. This includes points (depicted in black) that correspond to inputs off the data manifold! In contrast, training the same deep neural network with Manifold Mixup leads to a smoother decision boundary (Figure 1b) and a simpler (linear) arrangement of hidden representations (Figure 1e). In sum, the representations obtained by Manifold Mixup have two desirable properties: the class-representations are flattened into a minimal number of directions of variation, and all points in between these flat representations (most of which are unobserved during training and off the data manifold) are assigned low-confidence predictions.

This example conveys the central message of this paper:

Manifold Mixup improves the hidden representations and decision boundaries of neural networks at multiple layers.

(Figure 2 panels: input-space and hidden-space views of the spiral task under weight decay, noise, dropout, batch normalization, and input mixup.)

Figure 2: The same experimental setup as Figure 1, but using a variety of competitive regularizers. This shows that the effect of concentrating the hidden representation for each class and providing a broad region of low confidence between the regions is not accomplished by the other regularizers (although input space mixup does produce regions of low confidence, it does not flatten the class-specific state distribution). Noise refers to Gaussian noise in the input layer, dropout refers to dropout of 50% in all layers except the bottleneck itself (due to its low dimensionality), and batch normalization refers to batch normalization in all layers.

More specifically, Manifold Mixup improves generalization in deep neural networks because it:

  • Leads to smoother decision boundaries that are further away from the training data, at multiple levels of representation. Smoothness and margin are well-established factors of generalization (Bartlett & Shawe-Taylor, 1998; Lee et al., 1995).

  • Leverages interpolations in deeper hidden layers, which capture higher-level information (Zeiler & Fergus, 2013), to provide additional training signal.

  • Flattens the class-representations, reducing their number of directions with significant variance (Section 3). This can be seen as a form of compression, which is linked to generalization by a well-established theory (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017) and extensive experimentation (Alemi et al., 2017; Belghazi et al., 2018; Goyal et al., 2018; Achille & Soatto, 2018).

Throughout a variety of experiments, we demonstrate four benefits of Manifold Mixup:

  • Better generalization than other competitive regularizers (such as Cutout, Mixup, AdaMix, and Dropout) (Section 5.1).

  • Improved log-likelihood on test samples (Section 5.1).

  • Increased performance at predicting data subject to novel deformations (Section 5.2).

  • Improved robustness to single-step adversarial attacks. This is evidence that Manifold Mixup pushes the decision boundary away from the data in some directions (Section 5.3). This is not to be confused with full adversarial robustness, which is defined in terms of moving the decision boundary away from the data in all directions.

2 Manifold Mixup

Consider training a deep neural network $f(x) = f_k(g_k(x))$, where $g_k$ denotes the part of the neural network mapping the input data to the hidden representation at layer $k$, and $f_k$ denotes the part mapping such hidden representation to the output $f(x)$. Training using Manifold Mixup is performed in five steps. First, we select a random layer $k$ from a set of eligible layers $\mathcal{S}$ in the neural network. This set may include the input layer $g_0(x) = x$. Second, we process two random data minibatches $(x, y)$ and $(x', y')$ as usual, until reaching layer $k$. This provides us with two intermediate minibatches $(g_k(x), y)$ and $(g_k(x'), y')$. Third, we perform Input Mixup (Zhang et al., 2018) on these intermediate minibatches. This produces the mixed minibatch:

$$(\tilde{g}_k, \tilde{y}) := \big(\mathrm{Mix}_\lambda(g_k(x), g_k(x')),\, \mathrm{Mix}_\lambda(y, y')\big),$$

where $\mathrm{Mix}_\lambda(a, b) = \lambda \cdot a + (1 - \lambda) \cdot b$. Here, $(y, y')$ are one-hot labels, and the mixing coefficient $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ as proposed in mixup (Zhang et al., 2018). For instance, $\alpha = 1.0$ is equivalent to sampling $\lambda \sim U(0, 1)$. Fourth, we continue the forward pass in the network from layer $k$ until the output using the mixed minibatch $(\tilde{g}_k, \tilde{y})$. Fifth, this output is used to compute the loss value and gradients that update all the parameters of the neural network.

Mathematically, Manifold Mixup minimizes:

$$L(f) = \mathbb{E}_{(x,y)\sim D}\; \mathbb{E}_{(x',y')\sim D}\; \mathbb{E}_{\lambda\sim \mathrm{Beta}(\alpha,\alpha)}\; \mathbb{E}_{k\sim \mathcal{S}}\; \ell\big(f_k(\mathrm{Mix}_\lambda(g_k(x), g_k(x'))),\, \mathrm{Mix}_\lambda(y, y')\big). \quad (1)$$

Some implementation considerations. We backpropagate gradients through the entire computational graph, including those layers before the mixup layer (Section 5.1 and Appendix B explore this issue in more detail). In the case where $\mathcal{S} = \{0\}$, i.e. only the input layer is eligible, Manifold Mixup reduces to the original mixup algorithm of Zhang et al. (2018).

While one could try to reduce the variance of the gradient updates by sampling a random $(\lambda, k)$ per example, we opted for the simpler alternative of sampling a single $(\lambda, k)$ per minibatch, which in practice gives the same performance. As in Input Mixup, we use a single minibatch to compute the mixed minibatch. We do so by mixing the minibatch with a copy of itself with shuffled rows.
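
To make the five steps concrete, here is a minimal PyTorch-style sketch of one training step; the helper names (mixup_pair, manifold_mixup_step, blocks, eligible) are our own and not the authors' reference implementation, and the soft-target cross-entropy at the end is one standard way to consume the interpolated labels.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def mixup_pair(h, targets, alpha):
    """Mix a minibatch (hidden states and soft targets) with a shuffled copy of itself,
    using a single lambda per minibatch, as described above."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(h.size(0), device=h.device)
    return lam * h + (1 - lam) * h[index], lam * targets + (1 - lam) * targets[index]

def manifold_mixup_step(blocks, x, y_onehot, eligible=(0, 1, 2), alpha=2.0):
    """One training step.  `blocks` is an ordered list of modules ending in the
    classifier head (e.g. [resblock1, resblock2, rest_of_network_and_head]);
    `eligible` indexes the mixing layers: 0 = the input itself, i = output of blocks[i-1]."""
    k = random.choice(eligible)
    h, targets = mixup_pair(x, y_onehot, alpha) if k == 0 else (x, y_onehot)
    for i, block in enumerate(blocks, start=1):
        h = block(h)
        if i == k:                                   # mix the intermediate minibatch at layer k
            h, targets = mixup_pair(h, targets, alpha)
    log_probs = F.log_softmax(h, dim=1)              # h now holds the logits
    return -(targets * log_probs).sum(dim=1).mean()  # soft-target cross-entropy
```

Calling loss.backward() on the returned value then propagates gradients through the layers both before and after the mixing point, which is the behaviour discussed above.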


Figure 3: Illustration of why Manifold Mixup learns flatter representations. The interpolation between A1 and B2 in the left panel soft-labels the black dot as 50% red and 50% blue, despite it being very close to a blue point. In the middle panel, a different interpolation between A2 and B1 soft-labels the same point as ≈95% blue and ≈5% red. However, since Manifold Mixup learns the hidden representations, the pressure to predict consistent soft-labels at interpolated points causes the states to become flattened (right panel).

3 Manifold Mixup Flattens Representations

We turn to the study of how Manifold Mixup impacts the hidden representations of a deep neural network. At a high level, Manifold Mixup flattens the class-specific representations. More specifically, this flattening reduces the number of directions with significant variance (akin to reducing their number of principal components).

In the sequel, we first develop a theory (Section 3.1) that characterizes this behavior precisely under idealized conditions. Second, we show that this flattening also happens in practice, by performing the SVD of class-specific representations of neural networks trained on real datasets (Section 3.2). Finally, we discuss why the flattening of class-specific representations is a desirable property (Section 3.3).

3.1 Theory

We start by characterizing how the representations of a neural network are changed by Manifold Mixup, under a simplifying set of assumptions. More concretely, we will show that if one performs mixup in a sufficiently deep hidden layer in a neural network, then the loss can be driven to zero if the dimensionality $\dim(\mathcal{H})$ of that hidden layer is greater than the number of classes $d$. As a consequence of this, the resulting representations for each class will lie on a subspace of dimension $\dim(\mathcal{H}) - d + 1$.

A more intuitive and less formal version of this argument is given in Figure 3 and Appendix F.

To this end, assume that $\mathcal{X}$ and $\mathcal{H}$ denote the input and representation spaces, respectively. We denote the label-set by $\mathcal{Y}$ and let $d = |\mathcal{Y}|$, identifying each label with a one-hot vector in $\mathbb{R}^d$. Let $\mathcal{G} \subseteq \mathcal{H}^{\mathcal{X}}$ denote the set of functions realizable by the neural network, from the input to the representation. Similarly, let $\mathcal{F} \subseteq (\mathbb{R}^d)^{\mathcal{H}}$ be the set of all functions realizable by the neural network, from the representation to the output.

We are interested in the solution of the following problem in some asymptotic regimes:

$$\min_{g \in \mathcal{G},\, f \in \mathcal{F}} \; \mathbb{E}_{(x,y)\sim P}\; \mathbb{E}_{(x',y')\sim P}\; \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha,\alpha)}\; \ell\big(f(\mathrm{Mix}_\lambda(g(x), g(x'))),\, \mathrm{Mix}_\lambda(y, y')\big). \quad (2)$$

More specifically, let $P_D$ be the empirical distribution defined by a dataset $D = \{(x_i, y_i)\}_{i=1}^n$. Then, let $f^\star \in \mathcal{F}$ and $g^\star \in \mathcal{G}$ be the minimizers of (2) for $P = P_D$. Also, let $\mathcal{G} = \mathcal{H}^{\mathcal{X}}$, $\mathcal{F} = (\mathbb{R}^d)^{\mathcal{H}}$, and let $\mathcal{H}$ be a vector space. These conditions (Cybenko, 1989) state that the mappings realizable by large neural networks are dense in the set of all continuous bounded functions. In this case, we show that the minimizer $f^\star$ is a linear function from $\mathcal{H}$ to $\mathbb{R}^d$. In this case, the objective (2) can be rewritten as:

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha,\alpha)}\; \ell\big(f^\star(\mathrm{Mix}_\lambda(g(x_i), g(x_j))),\, \mathrm{Mix}_\lambda(y_i, y_j)\big),$$

where $g \in \mathcal{G}$ remains to be chosen.

Theorem 1.

Let $\mathcal{H}$ be a vector space of dimension $\dim(\mathcal{H})$, and let $d$ represent the number of classes contained in some dataset $D$. If $\dim(\mathcal{H}) \geq d - 1$, then the objective (2) attains its minimum value of zero, and the corresponding minimizer $f^\star$ is a linear function from $\mathcal{H}$ to $\mathbb{R}^d$.

Proof.

First, we observe that the following statement is true if $\dim(\mathcal{H}) \geq d - 1$: there exist $A \in \mathbb{R}^{d \times \dim(\mathcal{H})}$, $H \in \mathbb{R}^{\dim(\mathcal{H}) \times d}$, and $b \in \mathbb{R}^{d}$ such that

$$A H + b\, 1_d^{\top} = I_{d \times d},$$

where $I_{d \times d}$ and $1_d$ denote the $d$-dimensional identity matrix and all-one vector, respectively. In fact, $b\, 1_d^{\top}$ is a rank-one matrix, while the rank of the identity matrix is $d$. So, $A H$ only needs rank $d - 1$.

Let $g^\star(x_i) = H_{\zeta(y_i)}$ for all $i = 1, \ldots, n$, where $H_{\zeta(y_i)}$ is the $\zeta(y_i)$-th column of $H$ and $\zeta(y_i)$ stands for the class-index of the example $x_i$. Let $f^\star(h) = A h + b$. These choices minimize (2), since:

$$f^\star(\mathrm{Mix}_\lambda(g^\star(x_i), g^\star(x_j))) = A\,\mathrm{Mix}_\lambda(H_{\zeta(y_i)}, H_{\zeta(y_j)}) + b = \mathrm{Mix}_\lambda(A H_{\zeta(y_i)} + b,\; A H_{\zeta(y_j)} + b) = \mathrm{Mix}_\lambda(y_i, y_j).$$

The result follows from $\ell(\mathrm{Mix}_\lambda(y_i, y_j), \mathrm{Mix}_\lambda(y_i, y_j)) = 0$ for all $i, j$. ∎

Furthermore, if $\dim(\mathcal{H}) > d - 1$, then data points in the representation space $\mathcal{H}$ have some degrees of freedom to move independently.

Corollary 1.

Consider the setting in Theorem 1 with $\dim(\mathcal{H}) > d - 1$. Let $g^\star \in \mathcal{G}$ minimize (2) under $f^\star$. Then, the representations of the training points $g^\star(x_i)$ fall on a $(\dim(\mathcal{H}) - d + 1)$-dimensional subspace.

Proof.

From the proof of Theorem 1, $A H = I_{d \times d} - b\, 1_d^{\top}$. The r.h.s. of this expression is a rank-$(d-1)$ matrix for a properly chosen $b$. Thus, $A$ can have a null-space of dimension $\dim(\mathcal{H}) - d + 1$. This way, one can assign $g^\star(x_i) = H_{\zeta(y_i)} + e_i$, where $H_{\zeta(y_i)}$ is defined as in the proof of Theorem 1, and the $e_i$ are arbitrary vectors in the null-space of $A$, for all $i = 1, \ldots, n$. ∎

This result implies that if the Manifold Mixup loss is minimized, then the representation of each class lies on a subspace of dimension $\dim(\mathcal{H}) - d + 1$. In the extreme case where $\dim(\mathcal{H}) = d - 1$, each class representation will collapse to a single point, meaning that hidden representations would not change in any direction, for each class-conditional manifold. In the more general case with larger $\dim(\mathcal{H})$, the majority of directions in $\mathcal{H}$-space will be empty in the class-conditional manifold.
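
The construction in the proofs can be checked numerically. Below is a small NumPy sketch of our own (it picks the concrete choice b = (1/d)·1 and an SVD-based factorization, and uses the squared error as the loss) verifying that interpolations of the constructed class representations map exactly onto the correspondingly interpolated one-hot labels, so the Manifold Mixup loss is zero:

```python
import numpy as np

d, dim_h = 5, 4                                  # number of classes, dim(H) = d - 1
rng = np.random.default_rng(0)

# Build A, H, b with A @ H + b @ 1^T == I_d (possible because dim_h >= d - 1):
# with b = (1/d) * 1, the matrix I - b @ 1^T is the centering matrix of rank d - 1.
b = np.full((d, 1), 1.0 / d)
C = np.eye(d) - b @ np.ones((1, d))
U, s, Vt = np.linalg.svd(C)
A = U[:, :dim_h] * s[:dim_h]                     # d x dim_h
H = Vt[:dim_h, :]                                # dim_h x d; column j = representation of class j
assert np.allclose(A @ H + b @ np.ones((1, d)), np.eye(d))

# Interpolate two class representations and check that the linear map f(h) = A h + b
# returns exactly the interpolated one-hot labels (hence zero squared-error loss).
i, j, lam = 0, 3, rng.uniform()
h_mix = lam * H[:, i] + (1 - lam) * H[:, j]
pred = A @ h_mix + b.ravel()
target = lam * np.eye(d)[i] + (1 - lam) * np.eye(d)[j]
print("squared-error loss at the interpolated point:", np.sum((pred - target) ** 2))
assert np.allclose(pred, target)
```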

3.2 Empirical Investigation of Flattening

We now show that the “flattening” theory that we have just developed also holds for real neural networks trained on real data. To this end, we trained a collection of fully-connected neural networks on the MNIST dataset using multiple regularizers, including Manifold Mixup. When using Manifold Mixup, we mixed representations at a single, fixed hidden layer per network. After training, we performed the Singular Value Decomposition (SVD) of the hidden representations of each network, and analyzed their spectrum decay.

More specifically, we computed the largest singular value per class, as well as the sum of all the other singular values. We computed these statistics at the first hidden layer for all networks and regularizers. For the largest singular value, we obtained: 51.73 (baseline), 33.76 (weight decay), 28.83 (dropout), 33.46 (input mixup), and 31.65 (manifold mixup). For the sum of all the other singular values, we obtained: 78.67 (baseline), 73.36 (weight decay), 77.47 (dropout), 66.89 (input mixup), and 40.98 (manifold mixup). Therefore, weight decay, dropout, and input mixup all reduce the largest singular value, but only Manifold Mixup achieves a reduction of the sum of all the other singular values (i.e., flattening). For more details regarding this experiment, consult Appendix G.
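
A sketch of how these per-class spectrum statistics can be computed, assuming the hidden representations have already been extracted into a NumPy array; whether to center each class before the SVD is our own choice here.

```python
import numpy as np

def class_spectrum_stats(hidden, labels):
    """For each class: the largest singular value of its (per-class centered) hidden
    representations, and the sum of all the other singular values."""
    stats = {}
    for c in np.unique(labels):
        h_c = hidden[labels == c]
        h_c = h_c - h_c.mean(axis=0, keepdims=True)
        s = np.linalg.svd(h_c, compute_uv=False)   # singular values, largest first
        stats[int(c)] = (s[0], s[1:].sum())
    return stats

# Example: average the two statistics over classes, as reported above.
# top, rest = zip(*class_spectrum_stats(features, y).values())
# print(np.mean(top), np.mean(rest))
```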

3.3 Why is Flattening Representations Desirable?

We have presented evidence to conclude that Manifold Mixup leads to flatter class-specific representations, and that such flattening is not accomplished by other regularizers.

But why is this flattening desirable? First, it means that the hidden representations computed from our data occupy a much smaller volume. Thus, a randomly sampled hidden representation within the convex hull spanned by the data in this space is more likely to have a classification score with lower confidence (higher entropy). Second, compression has been linked to generalization in the information theory literature (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017). Third, compression has been linked to generalization empirically in the past by work which minimizes the mutual information between the features and the inputs as a regularizer (Belghazi et al., 2018; Alemi et al., 2017; Achille & Soatto, 2018).

4 Related Work

Regularization is a major area of research in machine learning. Manifold Mixup is a generalization of Input Mixup, the idea of building random interpolations between training examples and performing the same interpolation for their labels (Zhang et al., 2018; Tokozume et al., 2018).

Intriguingly, our experiments show that Manifold Mixup changes the representations associated to the layers before and after the mixing operation, and that this effect is crucial to achieve good results (Section 5.1, Appendix G). This suggests that Manifold Mixup works differently than Input Mixup.

Another line of research closely related to Manifold Mixup involves regularizing deep networks by perturbing their hidden representations. These methods include dropout (Hinton et al., 2012), batch normalization (Ioffe & Szegedy, 2015), and the information bottleneck (Alemi et al., 2017). Notably, Hinton et al. (2012) and Ioffe & Szegedy (2015) demonstrated that regularizers that work well in the input space can also be applied to the hidden layers of a deep network, often to further improve results. We believe that Manifold Mixup is a complementary form of regularization.

Zhao & Cho (2018) explored improving adversarial robustness by classifying points using a function of the nearest neighbors in a fixed feature space. This involves applying mixup between each set of nearest neighbor examples in that feature space. The similarity between (Zhao & Cho, 2018) and Manifold Mixup is that both consider linear interpolations of hidden representations with the same interpolation applied to their labels. However, an important difference is that Manifold Mixup backpropagates gradients through the earlier parts of the network (the layers before the point where mixup is applied), unlike (Zhao & Cho, 2018). In Section 3 we explain how this discrepancy significantly affects the learning process.

AdaMix (Guo et al., 2018a) is another related method which attempts to learn better mixing distributions to avoid overlap. AdaMix performs interpolations only on the input space, reporting that their method degrades significantly when applied to hidden layers. Thus, AdaMix likely works for different reasons than Manifold Mixup, and perhaps the two are complementary. AgrLearn (Guo et al., 2018b) adds an information bottleneck layer to the output of deep neural networks. AgrLearn leads to substantial improvements, achieving 2.45% test error on CIFAR-10 when combined with Input Mixup (Zhang et al., 2018). As AgrLearn is complementary to Input Mixup, it may also be complementary to Manifold Mixup. Wang et al. (2018) proposed an interpolation exclusively in the output space; their method does not backpropagate through the interpolation procedure, and has a very different framing in terms of the Euler-Lagrange equation (their Equation 2), where the cost is based on unlabeled data (and the pseudolabels at those points) and the labeled data provide constraints.

5 Experiments

We now turn to the empirical evaluation of Manifold Mixup. We will study its regularization properties in supervised learning (Section 5.1), as well as how it affects the robustness of neural networks to novel input deformations (Section 5.2), and adversarial examples (Section 5.3).

5.1 Generalization on Supervised Learning

PreActResNet18 Test Error (%) Test NLL
No Mixup
AdaMix NA
Input Mixup NA
Input Mixup ()
Manifold Mixup ()
PreActResNet34
No Mixup
Input Mixup ()
Manifold Mixup ()
Wide-Resnet-28-10
No Mixup
Input Mixup ()
Manifold Mixup ()
(a) CIFAR-10
PreActResNet18 Test Error (%) Test NLL
No Mixup
AdaMix 20.97 n/a
Input Mixup 21.10 n/a
Input Mixup ()
Manifold Mixup ()
PreActResNet34
No Mixup
Input Mixup ()
Manifold Mixup ()
Wide-Resnet-28-10
No Mixup
Input Mixup ()
Manifold Mixup ()
(b) CIFAR-100
Table 1: Classification errors on (a) CIFAR-10 and (b) CIFAR-100. We include results from (Zhang et al., 2018) and (Guo et al., 2018a). Standard deviations over five repetitions.

PreActResNet18 Test Error (%) Test NLL
No Mixup
Input Mixup ()
Manifold Mixup ()
PreActResNet34
No Mixup
Input Mixup ()
Manifold Mixup ()
Wide-Resnet-28-10
No Mixup
Input Mixup ()
Manifold Mixup ()
Table 2: Classification errors and negative log-likelihoods on SVHN. We run each experiment five times.
PreActResNet18 top-1 top-5
No Mixup 55.52 71.04
Input Mixup ()
Input Mixup ()
Input Mixup ()
Input Mixup ()
Manifold Mixup ()
Manifold Mixup ()
Manifold Mixup ()
Manifold Mixup ()
Table 3: Accuracy on TinyImageNet.

We train a variety of residual networks (He et al., 2016) using different regularizers: no regularization, AdaMix, Input Mixup, and Manifold Mixup. We follow the training procedure of (Zhang et al., 2018), which is to use SGD with momentum, weight decay, and a step-wise learning rate decay. Please refer to Appendix C for further details (including the values of the hyperparameter α). We show results for the CIFAR-10 (Table 1(a)), CIFAR-100 (Table 1(b)), SVHN (Table 2), and TinyImageNet (Table 3) datasets. Manifold Mixup outperforms vanilla training, AdaMix, and Input Mixup across datasets and model architectures. Furthermore, Manifold Mixup leads to models with significantly better Negative Log-Likelihood (NLL) on the test data. In the case of CIFAR-10, Manifold Mixup models achieve as high as relative improvement of test NLL.

As a complementary experiment to better understand why Manifold Mixup works, we zeroed the gradient updates immediately after the layer where mixup is applied, so that the layers before the mixing point receive no gradient from the mixed loss. On CIFAR-10, using a PreActResNet18, this led to a 4.33% test error, which is worse than our results for Input Mixup and Manifold Mixup, yet better than the baseline. Because Manifold Mixup selects the mixing layer at random, each layer is still being trained even when zeroing gradients, although it will receive fewer updates. This demonstrates that Manifold Mixup improves performance by updating the layers both before and after the mixing operation.
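
One way to implement this ablation, sketched on top of the hypothetical manifold_mixup_step from Section 2 (it reuses the mixup_pair helper defined there), is to detach the hidden state at the mixing layer, which stops gradients from reaching the layers below it.

```python
import random
import torch.nn.functional as F

def manifold_mixup_step_blocked(blocks, x, y_onehot, eligible=(0, 1, 2), alpha=2.0):
    """Ablation variant of the Section 2 sketch: the hidden state is detached at the
    mixing layer, so the layers before the mixing point receive no gradient from
    the mixed loss (requires mixup_pair from the earlier sketch)."""
    k = random.choice(eligible)
    h, targets = mixup_pair(x, y_onehot, alpha) if k == 0 else (x, y_onehot)
    for i, block in enumerate(blocks, start=1):
        h = block(h)
        if i == k:
            h = h.detach()                          # block the backward pass here
            h, targets = mixup_pair(h, targets, alpha)
    log_probs = F.log_softmax(h, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```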

We also compared Manifold Mixup against other strong regularizers. For each regularizer, we selected the best hyper-parameters using a validation set. The training of PreActResNet50 on CIFAR-10 for 600 epochs led to the following test errors (%): no regularization (), Dropout (), Cutout (Devries & Taylor, 2017) (), Mixup (), and Manifold Mixup (). (Note that the results in Table 1 for PreActResNet were run for 1200 epochs, and therefore are not directly comparable to the numbers in this paragraph.)

To provide further evidence about the quality of representations learned with Manifold Mixup, we applied a k-nearest neighbour classifier on top of the features extracted from a PreActResNet18 trained on CIFAR-10. We achieved test errors of 6.09% (vanilla training), 5.54% (Input Mixup), and 5.16% (Manifold Mixup).
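
A brief sketch of this evaluation with scikit-learn, assuming the penultimate-layer features have already been extracted into arrays; the value of k shown is an arbitrary placeholder rather than the one used for the numbers above.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_test_error(train_feats, train_labels, test_feats, test_labels, k=5):
    """Fit a k-NN classifier on frozen features and return its test error."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_feats, train_labels)
    return 1.0 - knn.score(test_feats, test_labels)
```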

Finally, we considered a synthetic dataset where the data generating process is a known function of disentangled factors of variation, and mixed in the space of those factors. As shown in Appendix A, this led to significant improvements in performance. This suggests that mixing in the correct level of representation has a positive impact on the decision boundary. However, our purpose here is not to make any claim about when deep networks learn representations corresponding to disentangled factors of variation.

Lastly, Table 5 and Table 6 show the sensitivity of Manifold Mixup to the hyper-parameter α and the set of eligible layers. (These results are based on training a PreActResNet18 for 2000 epochs, so these numbers are not exactly comparable to the ones in Table 1.) This shows that Manifold Mixup is robust with respect to the choice of hyper-parameters, with improvements for many choices.

Deformation No Mixup Input Mixup () Input Mixup () Manifold Mixup ()
Rotation U(,) 56.48
Rotation U(,) 36.78
Shearing U(, ) 60.01
Shearing U(, ) 39.7
Zoom In (60% rescale)
Zoom In (80% rescale) 50.47
Zoom Out (120% rescale) 61.62
Zoom Out (140% rescale) 42.02
Table 4: Test accuracy on novel deformations. All models trained on normal CIFAR-100.
CIFAR-10 CIFAR-100
Table 6: Test accuracy (%) of Input Mixup and Manifold Mixup for different values of α on CIFAR-10.
Input Mixup Manifold Mixup
Table 5: Test accuracy (%) of Manifold Mixup for different sets of eligible layers on CIFAR.

5.2 Generalization to Novel Deformations

To further evaluate the quality of representations learned with Manifold Mixup, we train PreActResNet34 models on the normal CIFAR-100 training split, but test them on novel (not seen during training) deformations of the test split. These deformations include random rotations, random shearings, and different rescalings. Better representations should generalize to a larger variety of deformations. Table 4 shows that networks trained using Manifold Mixup are the most able to classify test instances subject to novel deformations, which suggests the learning of better representations. For more results see Appendix C, Table 9.
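
One way to generate such deformed test sets with torchvision (a sketch under our own assumptions; the authors' exact interpolation settings and parameter ranges are only partially described in Appendix C, Table 9):

```python
import random
import torchvision.transforms.functional as TF

def random_rotation(img, lo, hi):
    """Rotate by an angle drawn uniformly from [lo, hi] degrees."""
    return TF.rotate(img, angle=random.uniform(lo, hi))

def random_shear(img, lo, hi):
    """Shear by an angle drawn uniformly from [lo, hi] degrees."""
    return TF.affine(img, angle=0.0, translate=[0, 0], scale=1.0,
                     shear=[random.uniform(lo, hi)])

def zoom(img, pct, size=32):
    """pct < 100 zooms in (crop a pct%-sized central box, then upscale);
    pct > 100 zooms out (the box is larger than the image, so the border
    is zero-padded before rescaling back to size x size)."""
    box = int(round(size * pct / 100.0))
    out = TF.center_crop(img, [box, box])
    return TF.resize(out, [size, size])
```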

5.3 Robustness to Adversarial Examples

CIFAR-10 FGSM
No Mixup 36.32
Input Mixup () 71.51
Manifold Mixup () 77.50
PGD training (7-steps) 56.10
CIFAR-100 FGSM
Input Mixup () 40.7
Manifold Mixup () 44.96
SVHN FGSM
No Mixup 21.49
Input Mixup () 56.98
Manifold Mixup () 65.91
PGD training (7-steps) 72.80
Table 7: Test accuracy on white-box FGSM adversarial examples on CIFAR-10 and CIFAR-100 (using a PreActResNet18 model) and SVHN (using a WideResNet20-10 model). We include the results of (Madry et al., 2018).

Adversarial robustness is related to the position of the decision boundary relative to the data. Because Manifold Mixup only considers some directions around data points (those corresponding to interpolations), we would not expect the model to be robust to adversarial attacks that consider any direction around each example. However, since Manifold Mixup expands the set of examples seen during training, an intriguing hypothesis is that these expansions overlap with the set of possible adversarial examples, providing some degree of defense. If this hypothesis is true, Manifold Mixup would force adversarial attacks to consider a wider set of directions, leading to a larger computational expense for the attacker. To explore this, we consider the Fast Gradient Sign Method (FGSM, Goodfellow et al., 2015), which constructs adversarial examples in a single step, thus considering a relatively small subset of directions around examples. The performance of networks trained using Manifold Mixup against FGSM attacks is given in Table 7. One challenge in evaluating robustness against adversarial examples is the “gradient masking problem”, in which a defense succeeds only by reducing the quality of the gradient signal. Athalye et al. (2018) explored this issue in depth, and proposed running an unbounded search for a large number of iterations to confirm the quality of the gradient signal. Manifold Mixup passes this sanity check (consult Appendix D for further details). While we found that using Manifold Mixup improves robustness to single-step FGSM attacks (especially over Input Mixup), we found that Manifold Mixup did not significantly improve robustness against stronger, multi-step attacks such as PGD (Madry et al., 2018).
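
For reference, a minimal sketch of a single-step FGSM evaluation of this kind, assuming inputs scaled to [0, 1]; the perturbation budget eps is an assumed placeholder, not the value used for Table 7.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8.0 / 255):
    """Single-step FGSM: perturb each input by eps along the sign of the input
    gradient of the cross-entropy loss, then clip back to the valid pixel range."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()

# Robust accuracy = accuracy of `model` on fgsm(model, x, y) over the test set.
```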

6 Connections to Neuroscience and Credit Assignment

We present an intriguing connection between Manifold Mixup and a challenging problem in neuroscience. At a high level, we can imagine systems in the brain which compute predictions from a stream of changing inputs, and pass these predictions on to other modules which return some kind of feedback signal (Lee et al., 2015; Scellier & Bengio, 2017; Whittington & Bogacz, 2017; Bartunov et al., 2018). For instance, these feedback signals can be gradients or targets for prediction. There is a delay between the output of the prediction and the point in time at which the feedback can return to the system after having travelled across the brain. Moreover, this delay could be noisy and could differ based on the type of the prediction or other conditions in the brain, as well as depending on which paths are considered (there are many skip connections between areas). This means that it could be very difficult for a system in the brain to establish a clear correspondence between its outputs and the feedback signals that it receives over time.

While it is preliminary, an intriguing hypothesis is that part of how systems in the brain could be working around this limitation is by averaging their states and feedback signals across multiple points in time. The empirical results from mixup suggest that such a technique may not just allow successful computation, but also act as a potent regularizer. Manifold Mixup strengthens this result by showing that the same regularization effect can be achieved from mixing in higher level hidden representations.

7 Conclusion

Deep neural networks often give incorrect, yet extremely confident predictions on examples that differ from those seen during training. This problem is one of the most central challenges in deep learning. We have investigated this issue from the perspective of the representations learned by deep neural networks. We observed that vanilla neural networks spread the training data widely throughout the representation space, and assign high confidence predictions to almost the entire volume of representations. This leads to major drawbacks since the network will provide high-confidence predictions to examples off the data manifold, thus lacking enough incentives to learn discriminative representations about the training data. To address these issues, we introduced Manifold Mixup, a new algorithm to train neural networks on interpolations of hidden representations. Manifold Mixup encourages the neural network to be uncertain across the volume of the representation space unseen during training. This leads to concentrating the representations of the real training examples in a low dimensional subspace, resulting in more discriminative features. Throughout a variety of experiments, we have shown that neural networks trained using Manifold Mixup have better generalization in terms of error and log-likelihood, as well as better robustness to novel deformations of the data and adversarial examples. Being easy to implement and incurring little additional computational cost, we hope that Manifold Mixup will become a useful regularization tool for deep learning practitioners.

Acknowledgements

The authors thank Christopher Pal, Sherjil Ozair and Dzmitry Bahdanau for useful discussions and feedback. Vikas Verma was supported by Academy of Finland project 13312683 / Raiko Tapani AT kulut. We would also like to acknowledge Compute Canada for providing computing resources used in this work.

References

Appendix A Synthetic Experiments Analysis

We conducted experiments using a generated synthetic dataset where each image is deterministically rendered from a set of independent factors. The goal of this experiment is to study the impact of input mixup and an idealized version of Manifold Mixup where we know the true factors of variation in the data and we can do mixup in exactly the space of those factors. This is not meant to be a fair evaluation or representation of how Manifold Mixup actually performs - rather it’s meant to illustrate how generating relevant and semantically meaningful augmented data points can be much better than generating points by mixing in the input space.

We considered three tasks. In Task A, we train on images with angles uniformly sampled between (-, -) (label 0) with 50% probability and uniformly between (, ) (label 1) with 50% probability. At test time we sampled uniformly between (-, -) (label 0) with 50% probability and uniformly between (, ) (label 1) with 50% probability. Task B used the same setup as Task A for training, but the test instead used (-, -) as label 0 and (-, ) as label 1. In Task C we made the label whether the digit was a “1” or a “7”, and our training images were uniformly sampled between (-, -) with 50% probability and uniformly between (, ) with 50% probability. The test data for Task C were uniformly sampled with angles from (-, ).

Examples of the data are in Figure 4 and results are in Table 8. In all cases we found that Input Mixup gave some improvements in likelihood but limited improvements in accuracy, suggesting that even generating nonsensical points can help a classifier trained with Input Mixup to be better calibrated. Nonetheless the improvements were much smaller than those achieved with mixing in the ground truth attribute space.

Figure 4: Synthetic task where the underlying factors are known exactly. Training images (left), images from input mixup (center), and images from mixing in the ground truth factor space (right).
Task Model Test Accuracy Test NLL
No Mixup 1.6 8.8310
Task A Input Mixup (1.0) 0.0 6.0601
Ground Truth Factor Mixup (1.0) 94.77 0.4940
No Mixup 21.25 7.0026
Task B Input Mixup (1.0) 18.40 4.3149
Ground Truth Factor Mixup (1.0) 84.02 0.4572
No Mixup 63.05 4.2871
Task C Input Mixup 66.09 1.4181
Ground Truth Factor Mixup 99.06 0.1279
Table 8: Results on synthetic data generalization task with an idealized Manifold Mixup (mixing in the true latent generative factors space). Note that in all cases Input Mixup significantly improved likelihood, but not to the same degree as mixing in the factor space.

Appendix B Analysis of how Manifold Mixup changes learned representations

We have found significant improvements from using Manifold Mixup, but a key question is whether the improvements come from changing the behavior of the layers before the mixup operation is applied or the layers after the mixup operation is applied. This is a place where Manifold Mixup and Input Mixup are clearly differentiated, as Input Mixup has no “layers before the mixup operation” to change. We conducted analytical experiments where the representations are low-dimensional enough to visualize. More concretely, we trained a fully-connected network on MNIST with two fully-connected leaky-ReLU layers of 1024 units, followed by a 2-dimensional bottleneck layer, followed by two more fully-connected leaky-ReLU layers with 1024 units.

We then considered training with no mixup, training with mixup in the input space, and training only with mixup directly following the 2D bottleneck. We consistently found that Manifold Mixup has the effect of making the representations much tighter, with the real data occupying a smaller region in the hidden space, and with a better separated margin between the classes, as shown in Figure 5.

Figure 5: Representations from a classifier on MNIST (top is trained on digits 0-4, bottom is trained on all digits) with a 2D bottleneck representation in the middle layer. No Mixup Baseline (left), Input Mixup (center), Manifold Mixup (right).
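
For concreteness, a sketch of a bottleneck classifier of this shape; the LeakyReLU slope, the absence of an activation around the 2-D bottleneck, and the 10-way output head are our assumptions rather than stated details.

```python
import torch.nn as nn

# Two 1024-unit leaky-ReLU layers, a 2-D bottleneck (used for plotting and, when
# enabled, for mixing), two more 1024-unit leaky-ReLU layers, and a 10-way head.
bottleneck_classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 1024), nn.LeakyReLU(),
    nn.Linear(1024, 1024), nn.LeakyReLU(),
    nn.Linear(1024, 2),
    nn.Linear(2, 1024), nn.LeakyReLU(),
    nn.Linear(1024, 1024), nn.LeakyReLU(),
    nn.Linear(1024, 10),
)
```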

Appendix C Supervised Regularization Experimental Details

Test Set Deformation No Mixup Baseline Input Mixup α=1.0 Input Mixup α=2.0 Manifold Mixup α=2.0
Rotation U(,) 52.96 55.55 56.48 60.08
Rotation U(,) 33.82 37.73 36.78 42.13
Rotation U(,) 26.77 28.47 27.53 33.78
Rotation U(,) 24.19 26.72 25.34 29.95
Shearing U(, ) 55.92 58.16 60.01 62.85
Shearing U(, ) 35.66 39.34 39.7 44.27
Shearing U(, ) 19.57 22.94 22.8 24.69
Shearing U(, ) 17.55 21.66 21.22 23.56
Shearing U(, ) 22.38 25.53 25.27 28.02
Zoom In (20% rescale) 2.43 1.9 2.45 2.03
Zoom In (40% rescale) 4.97 4.47 5.23 4.17
Zoom In (60% rescale) 12.68 13.75 13.12 11.49
Zoom In (80% rescale) 47.95 52.18 50.47 52.7
Zoom Out (120% rescale) 43.18 60.02 61.62 63.59
Zoom Out (140% rescale) 19.34 41.81 42.02 45.29
Zoom Out (160% rescale) 11.12 25.48 25.85 27.02
Zoom Out (180% rescale) 7.98 18.11 18.02 15.68
Table 9: Models trained on the normal CIFAR-100 and evaluated on a test set with novel deformations. Manifold Mixup (ours) consistently allows the model to be more robust to random shearing, rescaling, and rotation even though these deformations were not observed during training. For the rotation experiment, each image is rotated with an angle uniformly sampled from the given range. Likewise, the shearing is performed with uniformly sampled angles. Zooming in refers to taking a bounding box at the center of the image with k% of the length and k% of the width of the original image, and then expanding this box to fit the original size. Likewise, zooming out refers to drawing a bounding box with k% of the height and k% of the width, and then taking this larger area and scaling it down to the original size of the image (the padding outside of the image is black).

For supervised regularization we considered the following architectures: PreActResNet18, PreActResNet34, and Wide-Resnet-28-10. When using Manifold Mixup, we selected the layer to perform mixing uniformly at random from a set of eligible layers. In all our experiments, for the PreActResNet architectures, the eligible layers for mixing in Manifold Mixup were: the input layer, the output from the first resblock, and the output from the second resblock. For the Wide-ResNet-20-10 architecture, the eligible layers for mixing in Manifold Mixup were: the input layer and the output from the first resblock. For PreActResNet18, the first resblock has four layers and the second resblock has four layers. For PreActResNet34, the first resblock has six layers and the second resblock has eight layers. For Wide-Resnet-28-10, the first resblock has four layers. Thus the mixing is often done in fairly deep layers of the network.

Throughout our experiments, we use the SGD with momentum optimizer with learning rate 0.1, momentum 0.9, and weight decay, with a step-wise learning rate decay.

For Table 1(a), Table 1(b) and Table 2, we train PreActResNet18 and PreActResNet34 for 1200 epochs with the learning rate annealed by a factor of 10 at epochs 400 and 800. For the above tables, we train Wide-ResNet-28-10 for 400 epochs with the learning rate annealed by a factor of 10 at epochs 200 and 300. For Table 3, we train PreActResNet18 for 2000 epochs with the learning rate annealed by a factor of 10 at epochs 1000 and 1500.

For Table 5 and Table 6, we train the PreActResNet18 network for 2000 epochs with the learning rate annealed by a factor of 10 at epochs 1000 and 1500.

For Table 7, Table 4 and Table 9, we train the networks for 1200 epochs with learning rate annealed by a factor of 10 at epoch 400 and 800.

In Figure 6 and Figure 7, we present the training loss (binary cross-entropy) for the CIFAR-10 and CIFAR-100 datasets, respectively. We observe that performing Manifold Mixup in higher layers allows the training loss to go down faster compared to Input Mixup, which suggests that while Input Mixup may suffer from underfitting, Manifold Mixup alleviates this problem to some extent.

In Table 9, we present the full set of experiments from Section 5.2.

C.1 Hyperparameter α

For Input Mixup on the CIFAR10 and CIFAR100 datasets, we used the value of α recommended in (Zhang et al., 2018). For Input Mixup on the SVHN and Tiny-imagenet datasets, we experimented with α values in the set . We obtained the best results using and for SVHN and Tiny-imagenet, respectively.

For Manifold Mixup, for all datasets, we experimented with α values in the set . We obtained the best results with for CIFAR10, CIFAR100 and SVHN, and with for Tiny-imagenet.

Figure 6: CIFAR-10 train set Binary Cross Entropy Loss (BCE) on Y-axis using PreActResNet18, with respect to training epochs (X-axis). The numbers in {} refer to the resblock after which Manifold Mixup is performed. The ordering of the losses is consistent over the course of training: Manifold Mixup with gradient blocked before the mixing layer has the highest training loss, followed by Input Mixup. The lowest training loss is achieved by mixing in the deepest layer, which suggests that having more hidden units can help to prevent underfitting.
Figure 7: CIFAR-100 train set Binary Cross Entropy Loss (BCE) on Y-axis using PreActResNet50, with respect to training epochs (X-axis). The numbers in {} refer to the resblock after which Manifold Mixup is performed. The lowest training loss is achieved by mixing in the deepest layer.

Appendix D Adversarial Examples

We ran the unbounded projected gradient descent (PGD) (Madry et al., 2018) sanity check suggested in (Athalye et al., 2018). We took our trained models for the Input Mixup baseline and Manifold Mixup and ran PGD for 200 iterations with a step size of 0.01, which reduced the Input Mixup model's accuracy to 1% and reduced the Manifold Mixup model's accuracy to 0%. This is evidence that our defense did not improve results primarily as a result of gradient masking.
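
A sketch of this sanity check; the cross-entropy loss and the clipping to the valid [0, 1] pixel range are our assumptions, and no epsilon ball is enforced, which is what makes the attack unbounded.

```python
import torch
import torch.nn.functional as F

def unbounded_pgd(model, x, y, steps=200, step_size=0.01):
    """Many small gradient-sign steps with no epsilon-ball constraint; a defense
    that relies on gradient masking would keep non-trivial accuracy here."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).clamp(0.0, 1.0).detach()
    return x_adv
```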

Appendix E Generative Adversarial Networks

The recent literature has suggested that regularizing the discriminator is beneficial for training GANs (Salimans et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018). In a similar vein, one could add mixup to the original GAN training objective such that the extra data augmentation acts as a beneficial regularization to the discriminator, which is what was proposed in Zhang et al. (2018). Mixup proposes the following objective (this formulation is based on the official code provided with the paper, rather than the description in the paper; the discrepancy between the two is that the formulation in the paper only considers mixes between real and fake):

(3)

where x and x' can be either real or fake samples, and λ is sampled from a Beta(α, α). Note that we have used a function of the pair to denote the label, since there are four possibilities depending on whether x and x' are real or fake:

(4)

In practice, however, we found that it did not make sense to create mixes between real and real where the label is set to 1 (as shown in Equation 4), since the mixup of two real examples in input space is not a real example. So we only create mixes that are either real-fake, fake-real, or fake-fake. Secondly, instead of using just Equation 3, we optimize it in addition to the regular minimax GAN objective:

(5)

Using similar notation to earlier in the paper, we present the manifold mixup version of our GAN objective in which we mix in the hidden space of the discriminator:

(6)

where g_k is a function denoting the intermediate output of the discriminator at layer k, and f_k the output of the discriminator given input from layer k.

The layer k that we choose to sample can be any of: the input layer (i.e., Input Mixup), or the first or second resblock of the discriminator, each with equal probability of selection.
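
Since Equations (3)-(6) are not reproduced above, the following heavily simplified sketch only illustrates the shape of the extra discriminator term: two samples (never real-real) are forwarded to a randomly chosen eligible layer of the discriminator, mixed, and the discriminator is trained toward the interpolated real/fake label. The function and argument names are ours, and the real = 1 / fake = 0 convention is an assumption.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def mixed_disc_loss(d_blocks, x1, y1, x2, y2, alpha=0.5, eligible=(0, 1, 2)):
    """Auxiliary discriminator term.  d_blocks is an ordered list of discriminator
    modules ending in a logit head (e.g. [resblock1, resblock2, rest_and_head]);
    y1, y2 are 1.0 for real and 0.0 for fake, and real-real pairs are never passed in."""
    lam = float(np.random.beta(alpha, alpha))
    k = random.choice(eligible)                  # 0 = mix the inputs themselves
    h1, h2 = x1, x2
    for block in d_blocks[:k]:                   # forward both samples up to layer k
        h1, h2 = block(h1), block(h2)
    h = lam * h1 + (1 - lam) * h2                # mix the hidden states
    for block in d_blocks[k:]:                   # continue the forward pass on the mix
        h = block(h)
    target = lam * y1 + (1 - lam) * y2           # interpolated real/fake label
    logits = h.view(-1)
    return F.binary_cross_entropy_with_logits(logits, torch.full_like(logits, target))
```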

We run experiments evaluating the quality of generated images on CIFAR-10, using as a baseline a JSGAN with spectral normalization (Miyato et al., 2018) (our configuration is almost identical to theirs). Results are averaged over at least three runs (Inception scores are typically reported with a mean and variance, though this is across multiple splits of samples from a single model; since we run multiple experiments, we average their respective means and variances). From these results, the best-performing mixup experiments (both Input and Manifold Mixup) are with , with mixing in all layers (both resblocks and input) achieving an average Inception / FID of / , input mixup achieving / , and the baseline experiment / . This suggests that mixup acts as a useful regularization on the discriminator, which is even further improved by Manifold Mixup. (See Figure 8 for the full set of experimental results.)

Figure 8: We test out various values of α in conjunction with either: input mixup (pixel) (Zhang et al., 2018), mixing in the output of the first resblock (h1), mixing in either the output of the first resblock or the output of the second resblock (h1,2), and mixing in the input or the output of the first resblock or the output of the second resblock (1,2,pixel). The dotted line indicates the baseline Inception / FID score. Higher scores are better for Inception, while lower is better for FID.

Appendix F Intuitive Explanation of how Manifold Mixup avoids Inconsistent Interpolations

An essential motivation behind Manifold Mixup is that as the network learns the hidden states, it does so in a way that encourages them to be flatter (per class). Section 3.1 characterized this for hidden states with any number of dimensions, and Figure 1 showed how this can occur on the 2D spiral dataset.

Our goal here is to discuss concrete examples to illustrate why this flattening happens, as shown in Figure 3. If we consider any two points, the interpolated point between them is based on a sampled λ, and the soft-target for that interpolated point is the pair of targets interpolated with the same λ. So if we consider two points A, B which have the same label, it is apparent that every point on the line between A and B should have that same label with 100% confidence. If we consider two points A, B with different labels, then the point which is halfway between them will be given the soft-label of 50% the label of A and 50% the label of B (and so on for other values of λ).

It is clear that for many arrangements of data points, it is possible for a point in the space to be reached through distinct interpolations between different pairs of examples, and reached with different λ values. Because the learned model tries to capture the conditional distribution of labels, it can only assign a single distribution over the label values to a single particular point (for example it could say that a point is 100% label A, or it could say that a point is 50% label A and 50% label B). Intuitively, these inconsistent soft-labels at interpolated points can be avoided if the states for each class are more concentrated and the representations do not have variability in directions pointing towards other classes. This leads to flattening: a reduction in the number of directions with variability. The theory in Section 3.1 characterizes exactly what this concentration needs to be: the representations for each class need to lie on a subspace of dimension equal to “number of hidden dimensions” − “number of classes” + 1.

Appendix G Spectral Analysis of Learned Representations

When we refer to flattening, we mean that the class-specific representations have reduced variability in some directions. Our analysis in this section makes this more concrete.

We trained an MNIST classifier with a hidden state bottleneck in the middle with 12 units (intentionally selected to be just slightly greater than the number of classes). We then took the representation for each class and computed a singular value decomposition (Figure 9 and Figure 10), and we also computed an SVD over all of the representations together (Figure 12). Our architecture contained three hidden layers with 1024 units and LeakyReLU activation, followed by a bottleneck representation layer (with either 12 or 30 hidden units), followed by an additional four hidden layers each with 1024 units and LeakyReLU activation. When we performed Manifold Mixup for our analysis, we only performed mixing in the bottleneck layer, and used a beta distribution with an alpha of 2.0. Additionally, we performed another experiment (Figure 11) where we placed the bottleneck representation layer with 30 units immediately following the first hidden layer with 1024 units and LeakyReLU activation.

We found that Manifold Mixup had a striking effect on the singular values, with most of the singular values becoming much smaller. Effectively, this means that the representations for each class have variance in fewer directions. While our theory in Section 3.1 showed that this flattening must force each class's representations onto a lower-dimensional subspace (and hence implies an upper bound on the number of non-zero singular values), this experiment explores how the flattening occurs empirically, and does not require the number of hidden dimensions to be so small that it can be manually visualized. In our experiments we tried using 12 hidden units in the bottleneck (Figure 9) as well as 30 hidden units in the bottleneck (Figure 10).

Our results from this experiment are unequivocal: Manifold Mixup dramatically reduces the size of the smaller singular values for each class's representations. This indicates a flattening of the class-specific representations. At the same time, the singular values over all the representations are not changed in a clear way (Figure 12), which suggests that this flattening occurs in directions which are distinct from the directions occupied by representations from other classes; this is the same intuition behind our theory. Moreover, Figure 11 shows that when the mixing is performed earlier in the network, there is still a flattening effect, though it is weaker than in the later layers, and again Input Mixup has an inconsistent effect.

Figure 9: SVD on the class-specific representations in a bottleneck layer with 12 units following 3 hidden layers. For the first singular value, the value (averaged across the plots) is 50.08 for the baseline, 37.17 for Input Mixup, and 43.44 for Manifold Mixup (these are the values at x=0 which are cutoff). We can see that the class-specific SVD leads to singular values which are dramatically more concentrated when using Manifold Mixup with Input Mixup not having a consistent effect.
Figure 10: SVD on the class-specific representations in a bottleneck layer with 30 units following 3 hidden layers. For the first singular value, the value (averaged across the plots) is 14.68 for the baseline, 12.49 for Input Mixup, and 14.43 for Manifold Mixup (these are the values at x=0 which are cutoff).
Figure 11: SVD on the class-specific representations in a bottleneck layer with 30 units following a single hidden layer. For the first singular value, the value (averaged across the plots) is 33.64 for the baseline, 27.60 for Input Mixup, and 24.60 for Manifold Mixup (these are the values at x=0 which are cutoff). We see that with the bottleneck layer placed earlier, the reduction in the singular values from Manifold Mixup is smaller but still clearly visible. This makes sense, as it is not possible for this early layer to be perfectly discriminative.
Figure 12: When we run SVD on all of the classes together (in the setup with 12 units in the bottleneck layer following 3 hidden layers), we see no clear difference in the singular values for the Baseline, Input Mixup, and Manifold Mixup models. Thus we can see that the flattening effect of manifold mixup is entirely class-specific, and does not appear overall, which is consistent with what our theory has predicted. More intuitively, this means that the directions which are being flattened are those directions which point towards the representations of different classes.