1 Introduction
Deep neural networks are the backbone of state-of-the-art systems for computer vision, speech recognition, and language translation (LeCun et al., 2015). However, these systems perform well only when evaluated on instances very similar to those from the training set. When evaluated on slightly different distributions, neural networks often provide incorrect predictions with strikingly high confidence. This is a worrying prospect, since deep learning systems are being deployed in settings where data may be subject to distributional shifts. Adversarial examples (Szegedy et al., 2014) are one such failure case: deep neural networks with nearly perfect performance provide incorrect predictions with very high confidence when evaluated on perturbations imperceptible to the human eye. Adversarial examples are a serious hazard when deploying machine learning systems in security-sensitive applications. More generally, deep learning systems quickly degrade in performance as the distributions of training and testing data differ slightly from each other (Ben-David et al., 2010).
[Figure caption fragment: these effects are not accomplished by other well-studied regularizers (input mixup, weight decay, dropout, batch normalization, and adding noise to the hidden representations).]
In this paper, we identify several troubling properties concerning the hidden representations and decision boundaries of state-of-the-art neural networks. First, we observe that the decision boundary is often sharp and close to the data. Second, we observe that the vast majority of the hidden representation space corresponds to high-confidence predictions, both on and off the data manifold.
Motivated by these intuitions, we propose Manifold Mixup (Section 2), a simple regularizer that addresses several of these flaws by training neural networks on linear combinations of hidden representations of training examples. Previous work, including the study of analogies through word embeddings (e.g. king − man + woman ≈ queen), has shown that interpolations are an effective way of combining factors (Mikolov et al., 2013). Since high-level representations are often low-dimensional and useful to linear classifiers, linear interpolations of hidden representations should explore meaningful regions of the feature space effectively. To use combinations of hidden representations of data as a novel training signal, we also perform the same linear interpolation in the associated pair of one-hot labels, leading to mixed examples with soft targets.
To start off with the right intuitions, Figure 1 illustrates the impact of Manifold Mixup on a simple two-dimensional classification task with small data. In this example, vanilla training of a deep neural network leads to an irregular decision boundary (Figure 1a), and a complex arrangement of hidden representations (Figure 1d). Moreover, every point in both the raw (Figure 1a) and hidden (Figure 1d) data representations is assigned a prediction with very high confidence. This includes points (depicted in black) that correspond to inputs off the data manifold! In contrast, training the same deep neural network with Manifold Mixup leads to a smoother decision boundary (Figure 1b) and a simpler (linear) arrangement of hidden representations (Figure 1e). In sum, the representations obtained by Manifold Mixup have two desirable properties: the class-representations are flattened into a minimal number of directions of variation, and all points in-between these flat representations, most unobserved during training and off the data manifold, are assigned low-confidence predictions.
This example conveys the central message of this paper:
Manifold Mixup improves the hidden representations and decision boundaries of neural networks at multiple layers.
More specifically, Manifold Mixup improves generalization in deep neural networks because it:

Leverages interpolations in deeper hidden layers, which capture higher-level information (Zeiler & Fergus, 2013), to provide additional training signal.

Flattens the class-representations, reducing their number of directions with significant variance (Section 3). This can be seen as a form of compression, which is linked to generalization by a well-established theory (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017) and extensive experimentation (Alemi et al., 2017; Belghazi et al., 2018; Goyal et al., 2018; Achille & Soatto, 2018).
Throughout a variety of experiments, we demonstrate four benefits of Manifold Mixup:

Better generalization than other competitive regularizers (such as Cutout, Mixup, AdaMix, and Dropout) (Section 5.1).

Improved log-likelihood on test samples (Section 5.1).

Increased performance at predicting data subject to novel deformations (Section 5.2).

Improved robustness to single-step adversarial attacks. This is evidence that Manifold Mixup pushes the decision boundary away from the data in some directions (Section 5.3). This is not to be confused with full adversarial robustness, which is defined in terms of moving the decision boundary away from the data in all directions.
2 Manifold Mixup
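Consider training a deep neural network $f(x) = f_k(g_k(x))$, where $g_k$ denotes the part of the neural network mapping the input data to the hidden representation at layer $k$, and $f_k$ denotes the part mapping such hidden representation to the output $f(x)$. Training $f$ using Manifold Mixup is performed in five steps. First, we select a random layer $k$ from a set of eligible layers $\mathcal{S}$ in the neural network. This set may include the input layer $g_0(x)$. Second, we process two random data minibatches $(x, y)$ and $(x', y')$ as usual, until reaching layer $k$. This provides us with two intermediate minibatches $(g_k(x), y)$ and $(g_k(x'), y')$. Third, we perform Input Mixup (Zhang et al., 2018) on these intermediate minibatches. This produces the mixed minibatch:
$$(\tilde{g}_k, \tilde{y}) := \big(\mathrm{Mix}_\lambda(g_k(x), g_k(x')),\ \mathrm{Mix}_\lambda(y, y')\big),$$
where $\mathrm{Mix}_\lambda(a, b) = \lambda \cdot a + (1 - \lambda) \cdot b$. Here, $(y, y')$ are one-hot labels, and the mixing coefficient $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ as proposed in mixup (Zhang et al., 2018). For instance, $\alpha = 1.0$ is equivalent to sampling $\lambda \sim U(0, 1)$. Fourth, we continue the forward pass in the network from layer $k$ until the output using the mixed minibatch $(\tilde{g}_k, \tilde{y})$. Fifth, this output is used to compute the loss value and gradients that update all the parameters of the neural network.
Mathematically, Manifold Mixup minimizes:
$$L(f) = \mathbb{E}_{(x,y)\sim P}\ \mathbb{E}_{(x',y')\sim P}\ \mathbb{E}_{\lambda\sim\mathrm{Beta}(\alpha,\alpha)}\ \mathbb{E}_{k\sim\mathcal{S}}\ \ell\big(f_k(\mathrm{Mix}_\lambda(g_k(x), g_k(x'))),\ \mathrm{Mix}_\lambda(y, y')\big). \qquad (1)$$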
Some implementation considerations. We backpropagate gradients through the entire computational graph, including those layers before the mixup layer $k$ (Section 5.1 and Appendix B explore this issue in more detail). In the case where $\mathcal{S} = \{0\}$, Manifold Mixup reduces to the original mixup algorithm of Zhang et al. (2018). While one could try to reduce the variance of the gradient updates by sampling a random $(k, \lambda)$ per example, we opted for the simpler alternative of sampling a single $(k, \lambda)$ per minibatch, which in practice gives the same performance. As in Input Mixup, we use a single minibatch to compute the mixed minibatch. We do so by mixing the minibatch with a copy of itself with shuffled rows.
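The five steps can be sketched as a single forward pass. The following is a minimal NumPy illustration, not the authors' implementation: the small fully-connected network, the helper names (`run_layers`, `manifold_mixup_loss`), and the layer-indexing convention (index 0 means mixing the raw inputs) are all assumptions made for this example. It computes only the loss; in an autodiff framework the gradients of this loss would flow through the layers both before and after the mixing point.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def run_layers(h, weights, start, stop):
    """Apply layers start..stop-1; all but the network's last layer use ReLU."""
    last = len(weights) - 1
    for i in range(start, stop):
        W, b = weights[i]
        h = h @ W + b
        if i < last:
            h = relu(h)
    return h


def manifold_mixup_loss(x, y_onehot, weights, eligible_layers, alpha, rng):
    """Forward pass of one Manifold Mixup step on a single minibatch.

    Assumed convention: k indexes the layer whose *input* is mixed, so
    k = 0 mixes raw inputs (Input Mixup) and k >= 1 mixes hidden states.
    """
    k = int(rng.choice(eligible_layers))    # step 1: one layer per minibatch
    lam = float(rng.beta(alpha, alpha))     # one mixing coefficient per minibatch
    perm = rng.permutation(x.shape[0])      # shuffled copy of the same minibatch

    h = run_layers(x, weights, 0, k)        # step 2: representation at layer k
    h_mix = lam * h + (1.0 - lam) * h[perm]                 # step 3: mix states
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]   # ... and labels

    logits = run_layers(h_mix, weights, k, len(weights))    # step 4: finish pass
    log_probs = np.log(softmax(logits) + 1e-12)
    loss = float(-np.mean(np.sum(y_mix * log_probs, axis=1)))  # step 5: soft-target CE
    return loss, k, lam


# Toy usage with random data (all sizes are illustrative).
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
           for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(16, 4))
y = np.eye(3)[rng.integers(0, 3, size=16)]
loss, k, lam = manifold_mixup_loss(x, y, weights, [0, 1, 2], alpha=2.0, rng=rng)
```

Because every layer acts row-wise, indexing the representation with `perm` is the same as passing the shuffled minibatch through the first layers, which is why a single forward pass up to the mixing layer suffices.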
3 Manifold Mixup Flattens Representations
We turn to the study of how Manifold Mixup impacts the hidden representations of a deep neural network. At a high level, Manifold Mixup flattens the class-specific representations. More specifically, this flattening reduces the number of directions with significant variance (akin to reducing their number of principal components).
In the sequel, we first prove a theoretical result (Section 3.1) that characterizes this behavior precisely under idealized conditions. Second, we show that this flattening also happens in practice, by performing the SVD of class-specific representations of neural networks trained on real datasets (Section 3.2). Finally, we discuss why the flattening of class-specific representations is a desirable property (Section 3.3).
3.1 Theory
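We start by characterizing how the representations of a neural network are changed by Manifold Mixup, under a simplifying set of assumptions. More concretely, we will show that if one performs mixup in a sufficiently deep hidden layer in a neural network, then the loss can be driven to zero if the dimensionality of that hidden layer $\dim(\mathcal{H})$ is greater than the number of classes $d$. As a consequence of this, the resulting representations for that class will have $\dim(\mathcal{H}) - d + 1$ dimensions.
To this end, assume that $\mathcal{X}$ and $\mathcal{H}$ denote the input and representation spaces, respectively. We denote the label-set by $\mathcal{Y}$ and let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\mathcal{G} \subseteq \mathcal{H}^{\mathcal{X}}$ denote the set of functions realizable by the neural network, from the input to the representation. Similarly, let $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{H}}$ be the set of all functions realizable by the neural network, from the representation to the output.
We are interested in the solution of the following problem in some asymptotic regimes:
$$J(P) = \inf_{g \in \mathcal{G},\, f \in \mathcal{F}} \mathbb{E}_{(x,y),(x',y'),\lambda}\ \ell\big(f(\mathrm{Mix}_\lambda(g(x), g(x'))),\ \mathrm{Mix}_\lambda(y, y')\big). \qquad (2)$$
More specifically, let $P_D$ be the empirical distribution defined by a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$. Then, let $f^\star \in \mathcal{F}$ and $g^\star \in \mathcal{G}$ be the minimizers of (2) for $P = P_D$. Also, let $\mathcal{G} = \mathcal{H}^{\mathcal{X}}$, $\mathcal{F} = \mathcal{Y}^{\mathcal{H}}$, and $\mathcal{H}$ be a vector space. These conditions (Cybenko, 1989) state that the mappings realizable by large neural networks are dense in the set of all continuous bounded functions. In this case, we show that the minimizer $f^\star$ is a linear function from $\mathcal{H}$ to $\mathcal{Y}$, and the objective (2) can be rewritten as:
$$J(P_D) = \inf_{h_1, \ldots, h_n \in \mathcal{H}} \frac{1}{n(n-1)} \sum_{i \neq j} \left\{ \inf_{f \in \mathcal{F}} \int_0^1 \ell\big(f(\mathrm{Mix}_\lambda(h_i, h_j)),\ \mathrm{Mix}_\lambda(y_i, y_j)\big)\, p(\lambda)\, d\lambda \right\},$$
where $h_i = g(x_i)$.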
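Theorem 1.
Let $\mathcal{H}$ be a vector space of dimension $\dim(\mathcal{H})$, and let $d \in \mathbb{N}$ represent the number of classes contained in some dataset $D$. If $\dim(\mathcal{H}) \geq d - 1$, then $J(P_D) = 0$ and the corresponding minimizer $f^\star$ is a linear function from $\mathcal{H}$ to $\mathbb{R}^d$.
Proof.
First, we observe that the following statement is true if $\dim(\mathcal{H}) \geq d - 1$:
$$\exists\, A, H \in \mathbb{R}^{\dim(\mathcal{H}) \times d},\ b \in \mathbb{R}^{d} \ :\ A^\top H + b\, 1_d^\top = I_{d \times d},$$
where $I_{d \times d}$ and $1_d$ denote the $d$-dimensional identity matrix and all-one vector, respectively. In fact, $b\, 1_d^\top$ is a rank-one matrix, while the rank of the identity matrix is $d$. So, $A^\top H$ only needs rank $d - 1$.
Let $f^\star(h) = A^\top h + b$ for all $h \in \mathcal{H}$. Let $g^\star(x_i) = H_{:,\zeta_i}$ be the $\zeta_i$-th column of $H$, where $\zeta_i \in \{1, \ldots, d\}$ stands for the class-index of the example $x_i$. These choices minimize (2), since:
$$\ell\big(f^\star(\mathrm{Mix}_\lambda(g^\star(x_i), g^\star(x_j))),\ \mathrm{Mix}_\lambda(y_i, y_j)\big) = \ell\big(A^\top \mathrm{Mix}_\lambda(H_{:,\zeta_i}, H_{:,\zeta_j}) + b,\ \mathrm{Mix}_\lambda(y_i, y_j)\big) = \ell(u, u) = 0.$$
The result follows from $A^\top H_{:,\zeta_i} + b = y_i$ for all $i$. ∎
Furthermore, if $\dim(\mathcal{H}) > d - 1$, then data points in the representation space $\mathcal{H}$ have some degrees of freedom to move independently.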
Corollary 1.
Proof.
This result implies that if the Manifold Mixup loss is minimized, then the representation of each class lies on a subspace of dimension $\dim(\mathcal{H}) - d + 1$. In the extreme case where $\dim(\mathcal{H}) = d - 1$, each class representation will collapse to a single point, meaning that hidden representations would not change in any direction, for each class-conditional manifold. In the more general case with larger $\dim(\mathcal{H})$, the majority of directions in $\mathcal{H}$-space will be empty in the class-conditional manifold.
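The rank argument behind this result can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original proof: it writes the linear classifier as $h \mapsto A^\top h + b$ (notation assumed here), picks the particular bias $b = \frac{1}{d} 1_d$, uses $d = 4$ classes with a hidden dimension of $d - 1$, and builds one representation per class from an eigendecomposition. It then verifies that the classifier applied to a mixed representation returns exactly the mixed one-hot target, so any loss that vanishes when prediction equals target is zero.

```python
import numpy as np

d = 4            # number of classes (illustrative)
dim_h = d - 1    # smallest hidden dimension for which the construction works

b = np.full(d, 1.0 / d)                   # one valid choice of bias (assumption)
M = np.eye(d) - np.outer(b, np.ones(d))   # target value of A^T H; has rank d - 1

# M is a symmetric projection, so the eigenvectors with eigenvalue 1 factor it.
eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
U = eigvecs[:, 1:]                        # d x (d - 1): drop the zero eigenvalue
A = U.T                                   # dim_h x d
H = U.T                                   # dim_h x d; column c represents class c


def f_star(h):
    """Linear (affine) classifier acting on a hidden representation."""
    return A.T @ h + b


# The classifier maps each class representation to its one-hot label.
identity_error = np.abs(A.T @ H + np.outer(b, np.ones(d)) - np.eye(d)).max()

# Mixing two class representations yields exactly the mixed one-hot target.
Y = np.eye(d)
lam, i, j = 0.3, 0, 2
h_mix = lam * H[:, i] + (1 - lam) * H[:, j]
y_mix = lam * Y[i] + (1 - lam) * Y[j]
residual = np.abs(f_star(h_mix) - y_mix).max()
```

Both `identity_error` and `residual` are zero up to floating-point error. With `dim_h` equal to `d - 1` there is no null space left in `A`, which matches the extreme case in which each class collapses to a single point.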
3.2 Empirical Investigation of Flattening
We now show that the “flattening” theory that we have just developed also holds for real neural networks trained on real data. To this end, we trained a collection of fully-connected neural networks on the MNIST dataset using multiple regularizers, including Manifold Mixup. When using Manifold Mixup, we mixed representations at a single, fixed hidden layer per network. After training, we performed the Singular Value Decomposition (SVD) of the hidden representations of each network, and analyzed their spectrum decay.
More specifically, we computed the largest singular value per class, as well as the sum of all the other singular values. We computed these statistics at the first hidden layer for all networks and regularizers. For the largest singular value, we obtained: 51.73 (baseline), 33.76 (weight decay), 28.83 (dropout), 33.46 (input mixup), and 31.65 (manifold mixup). For the sum of all the other singular values, we obtained: 78.67 (baseline), 73.36 (weight decay), 77.47 (dropout), 66.89 (input mixup), and 40.98 (manifold mixup). Therefore, weight decay, dropout, and input mixup all reduce the largest singular value, but only Manifold Mixup achieves a substantial reduction of the sum of all the other singular values (i.e., flattening). For more details regarding this experiment, consult Appendix G.
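The two statistics can be computed with a few lines of NumPy. This is a minimal sketch, not the code used for the experiment: the function name `class_spectrum_stats` is hypothetical, and centering each class before the SVD is an assumption, since the text does not say whether the representations were centered or how the per-class values were aggregated.

```python
import numpy as np


def class_spectrum_stats(hidden, labels):
    """Return {class: (largest singular value, sum of the remaining ones)}."""
    stats = {}
    for c in np.unique(labels):
        reps = hidden[labels == c]                  # (n_c, dim) matrix for class c
        reps = reps - reps.mean(axis=0)             # ASSUMPTION: centre per class
        s = np.linalg.svd(reps, compute_uv=False)   # singular values, descending
        stats[int(c)] = (float(s[0]), float(s[1:].sum()))
    return stats


# Synthetic check: class 0 lies on a line (flat), class 1 fills the space.
rng = np.random.default_rng(0)
flat = np.outer(rng.normal(size=200), np.array([1.0, 2.0, -1.0]))
full = rng.normal(size=(200, 3))
hidden = np.vstack([flat, full])
labels = np.array([0] * 200 + [1] * 200)
stats = class_spectrum_stats(hidden, labels)
```

For the flat class the sum of the remaining singular values is numerically zero, while for the isotropic class it is comparable to the largest one. A flatter class representation in the sense of this section corresponds to a smaller second entry.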
3.3 Why is Flattening Representations Desirable?
We have presented evidence to conclude that Manifold Mixup leads to flatter class-specific representations, and that such flattening is not accomplished by other regularizers.
But why is this flattening desirable? First, it means that the hidden representations computed from our data occupy a much smaller volume. Thus, a randomly sampled hidden representation within the convex hull spanned by the data in this space is more likely to have a classification score with lower confidence (higher entropy). Second, compression has been linked to generalization in the information theory literature (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017). Third, compression has been linked to generalization empirically by work that minimizes the mutual information between the features and the inputs as a regularizer (Belghazi et al., 2018; Alemi et al., 2017; Achille & Soatto, 2018).
4 Related Work
Regularization is a major area of research in machine learning. Manifold Mixup is a generalization of Input Mixup, the idea of building random interpolations between training examples and performing the same interpolation for their labels (Zhang et al., 2018; Tokozume et al., 2018).
Intriguingly, our experiments show that Manifold Mixup changes the representations associated to the layers before and after the mixing operation, and that this effect is crucial to achieve good results (Section 5.1, Appendix G). This suggests that Manifold Mixup works differently than Input Mixup.
Another line of research closely related to Manifold Mixup involves regularizing deep networks by perturbing their hidden representations. These methods include dropout (Hinton et al., 2012), batch normalization (Ioffe & Szegedy, 2015), and the information bottleneck (Alemi et al., 2017). Notably, Hinton et al. (2012) and Ioffe & Szegedy (2015) demonstrated that regularizers that work well in the input space can also be applied to the hidden layers of a deep network, often to further improve results. We believe that Manifold Mixup is a complementary form of regularization.
Zhao & Cho (2018) explored improving adversarial robustness by classifying points using a function of the nearest neighbors in a fixed feature space. This involves applying mixup between each set of nearest neighbor examples in that feature space. The similarity between Zhao & Cho (2018) and Manifold Mixup is that both consider linear interpolations of hidden representations with the same interpolation applied to their labels. However, an important difference is that Manifold Mixup backpropagates gradients through the earlier parts of the network (the layers before the point where mixup is applied), unlike Zhao & Cho (2018). In Section 3 we explain how this discrepancy significantly affects the learning process.
AdaMix (Guo et al., 2018a) is another related method which attempts to learn better mixing distributions to avoid overlap. AdaMix performs interpolations only on the input space, and its authors report that the method degrades significantly when applied to hidden layers. Thus, AdaMix may work for different reasons than Manifold Mixup, and perhaps the two are complementary. AgrLearn (Guo et al., 2018b) adds an information bottleneck layer to the output of deep neural networks. AgrLearn leads to substantial improvements, achieving 2.45% test error on CIFAR-10 when combined with Input Mixup (Zhang et al., 2018). As AgrLearn is complementary to Input Mixup, it may also be complementary to Manifold Mixup. Wang et al. (2018) proposed an interpolation exclusively in the output space, which does not backpropagate through the interpolation procedure and has a very different framing in terms of the Euler-Lagrange equation (Equation 2), where the cost is based on unlabeled data (and the pseudolabels at those points) and the labeled data provide constraints.
5 Experiments
We now turn to the empirical evaluation of Manifold Mixup. We will study its regularization properties in supervised learning (Section 5.1), as well as how it affects the robustness of neural networks to novel input deformations (Section 5.2), and adversarial examples (Section 5.3).
5.1 Generalization on Supervised Learning


(Table caption, partially recovered) Standard deviations over five repetitions.
PreActResNet18  Test Error (%)  Test NLL 

No Mixup  
Input Mixup ()  
Manifold Mixup ()  
PreActResNet34  
No Mixup  
Input Mixup ()  
Manifold Mixup ()  
WideResnet2810  
No Mixup  
Input Mixup ()  
Manifold Mixup () 
PreActResNet18  top1  top5 

No Mixup  55.52  71.04 
Input Mixup ()  
Input Mixup ()  
Input Mixup ()  
Input Mixup ()  
Manifold Mixup ()  
Manifold Mixup ()  
Manifold Mixup ()  
Manifold Mixup () 
We train a variety of residual networks (He et al., 2016) using different regularizers: no regularization, AdaMix, Input Mixup, and Manifold Mixup. We follow the training procedure of Zhang et al. (2018), which is to use SGD with momentum, weight decay, and a step-wise learning rate decay. Please refer to Appendix C for further details (including the values of the hyperparameter $\alpha$). We show results for the CIFAR-10 (Table 1(a)), CIFAR-100 (Table 1(b)), SVHN (Table 3), and TinyImageNet (Table 3) datasets. Manifold Mixup outperforms vanilla training, AdaMix, and Input Mixup across datasets and model architectures. Furthermore, Manifold Mixup leads to models with significantly better Negative Log-Likelihood (NLL) on the test data. In the case of CIFAR-10, Manifold Mixup models achieve a sizable relative improvement in test NLL.
As a complementary experiment to better understand why Manifold Mixup works, we zeroed gradient updates immediately after the layer where mixup is applied. On CIFAR-10 with a PreActResNet18, this led to a 4.33% test error, which is worse than our results for Input Mixup and Manifold Mixup, yet better than the baseline. Because Manifold Mixup selects the mixing layer at random, each layer is still being trained even when zeroing gradients, although it will receive fewer updates. This demonstrates that Manifold Mixup improves performance by updating the layers both before and after the mixing operation.
We also compared Manifold Mixup against other strong regularizers. For each regularizer, we selected the best hyperparameters using a validation set. We trained a PreActResNet50 on CIFAR-10 for 600 epochs and compared the test errors obtained with no regularization, Dropout, Cutout (Devries & Taylor, 2017), Mixup, and Manifold Mixup. (Note that the results in Table 1 for PreActResNet were run for 1200 epochs, and therefore are not directly comparable to the results in this paragraph.)
To provide further evidence about the quality of representations learned with Manifold Mixup, we applied a $k$-nearest neighbour classifier on top of the features extracted from a PreActResNet18 trained on CIFAR-10. We achieved test errors of 6.09% (vanilla training), 5.54% (Input Mixup), and 5.16% (Manifold Mixup).
We also considered a synthetic dataset where the data generating process is a known function of disentangled factors of variation, and mixed in this space of factors. As shown in Appendix A, this led to significant improvements in performance. This suggests that mixing in the correct level of representation has a positive impact on the decision boundary. However, our purpose here is not to make any claim about when deep networks learn representations corresponding to disentangled factors of variation.
Finally, Table 6 and Table 6 show the sensitivity of Manifold Mixup to the hyperparameter $\alpha$ and the set of eligible layers $\mathcal{S}$. (These results are based on training a PreActResNet18 for a different number of epochs, so these numbers are not exactly comparable to the ones in Table 1.) This shows that Manifold Mixup is robust with respect to the choice of hyperparameters, with improvements for many choices.
Deformation  No Mixup  Input Mixup ()  Input Mixup ()  Manifold Mixup () 

Rotation U(,)  56.48  
Rotation U(,)  36.78  
Shearing U(, )  60.01  
Shearing U(, )  39.7  
Zoom In (60% rescale)  
Zoom In (80% rescale)  50.47  
Zoom Out (120% rescale)  61.62  
Zoom Out (140% rescale)  42.02 
5.2 Generalization to Novel Deformations
To further evaluate the quality of representations learned with Manifold Mixup, we train PreActResNet34 models on the normal CIFAR-100 training split, but test them on novel (not seen during training) deformations of the test split. These deformations include random rotations, random shearings, and different rescalings. Better representations should generalize to a larger variety of deformations. Table 4 shows that networks trained using Manifold Mixup are best able to classify test instances subject to novel deformations, which suggests the learning of better representations. For more results see Appendix C, Table 9.
5.3 Robustness to Adversarial Examples
CIFAR-10  FGSM
No Mixup  36.32
Input Mixup ()  71.51
Manifold Mixup ()  77.50
PGD training (7 steps)  56.10
CIFAR-100  FGSM
Input Mixup ()  40.7
Manifold Mixup ()  44.96
SVHN  FGSM
No Mixup  21.49
Input Mixup ()  56.98
Manifold Mixup ()  65.91
PGD training (7 steps)  72.80
Adversarial robustness is related to the position of the decision boundary relative to the data. Because Manifold Mixup only considers some directions around data points (those corresponding to interpolations), we would not expect the model to be robust to adversarial attacks that consider any direction around each example. However, since Manifold Mixup expands the set of examples seen during training, an intriguing hypothesis is that these expansions overlap with the set of possible adversarial examples, providing some degree of defense. If this hypothesis is true, Manifold Mixup would force adversarial attacks to consider a wider set of directions, leading to a larger computational expense for the attacker. To explore this, we consider the Fast Gradient Sign Method (FGSM, Goodfellow et al., 2015), which constructs adversarial examples in a single step, thus considering a relatively small subset of directions around examples. The performance of networks trained using Manifold Mixup against FGSM attacks is given in Table 7. One challenge in evaluating robustness against adversarial examples is the “gradient masking problem”, in which a defense succeeds only by reducing the quality of the gradient signal. Athalye et al. (2018) explored this issue in depth, and proposed running an unbounded search for a large number of iterations to confirm the quality of the gradient signal. Manifold Mixup passes this sanity check (consult Appendix D for further details). While we found that using Manifold Mixup improves the robustness to single-step FGSM attacks (especially over Input Mixup), we found that Manifold Mixup did not significantly improve robustness against stronger, multi-step attacks such as PGD (Madry et al., 2018).
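To make the single-step nature of FGSM concrete, the sketch below applies it to a linear softmax classifier, for which the input gradient of the cross-entropy loss has a closed form. This is an illustration only: the linear model, the function names, and the value of `eps` are assumptions for this example, and the experiments in this section attack deep residual networks, where the gradient would come from backpropagation instead.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(x, y_onehot, W, b):
    """Per-example cross-entropy of a linear softmax classifier."""
    probs = softmax(x @ W + b)
    return -np.sum(y_onehot * np.log(probs + 1e-12), axis=1)


def fgsm(x, y_onehot, W, b, eps):
    """One FGSM step: move each input by eps along the sign of its loss gradient."""
    probs = softmax(x @ W + b)
    grad_x = (probs - y_onehot) @ W.T     # d(cross-entropy)/dx for logits = x @ W + b
    return x + eps * np.sign(grad_x)


# Toy usage with a random linear model (sizes and eps are illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
b = np.zeros(3)
x = rng.normal(size=(8, 5))
y = np.eye(3)[rng.integers(0, 3, size=8)]

x_adv = fgsm(x, y, W, b, eps=0.1)
loss_clean = cross_entropy(x, y, W, b)
loss_adv = cross_entropy(x_adv, y, W, b)
```

The attack explores a single direction per example, the sign of the gradient, inside an `eps`-ball in the max norm. For this convex toy model the loss cannot decrease under the perturbation. Multi-step attacks such as PGD repeat this step with projection and therefore search a much larger set of directions.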
6 Connections to Neuroscience and Credit Assignment
We present an intriguing connection between Manifold Mixup and a challenging problem in neuroscience. At a high level, we can imagine systems in the brain which compute predictions from a stream of changing inputs, and pass these predictions onto other modules which return some kind of feedback signal (Lee et al., 2015; Scellier & Bengio, 2017; Whittington & Bogacz, 2017; Bartunov et al., 2018). For instance, these feedback signals can be gradients or targets for prediction. There is a delay between the output of the prediction and the point in time in which the feedback can return to the system after having travelled across the brain. Moreover, this delay could be noisy and could differ based on the type of the prediction or other conditions in the brain, as well as depending on which paths are considered (there are many skip connections between areas). This means that it could be very difficult for a system in the brain to establish a clear correspondence between its outputs and the feedback signals that it receives over time.
While it is preliminary, an intriguing hypothesis is that part of how systems in the brain could be working around this limitation is by averaging their states and feedback signals across multiple points in time. The empirical results from mixup suggest that such a technique may not just allow successful computation, but also act as a potent regularizer. Manifold Mixup strengthens this result by showing that the same regularization effect can be achieved from mixing in higher-level hidden representations.
7 Conclusion
Deep neural networks often give incorrect, yet extremely confident predictions on examples that differ from those seen during training. This problem is one of the most central challenges in deep learning. We have investigated this issue from the perspective of the representations learned by deep neural networks. We observed that vanilla neural networks spread the training data widely throughout the representation space, and assign high-confidence predictions to almost the entire volume of representations. This leads to major drawbacks since the network will provide high-confidence predictions to examples off the data manifold, thus lacking enough incentives to learn discriminative representations about the training data. To address these issues, we introduced Manifold Mixup, a new algorithm to train neural networks on interpolations of hidden representations. Manifold Mixup encourages the neural network to be uncertain across the volume of the representation space unseen during training. This leads to concentrating the representations of the real training examples in a low-dimensional subspace, resulting in more discriminative features. Throughout a variety of experiments, we have shown that neural networks trained using Manifold Mixup have better generalization in terms of error and log-likelihood, as well as better robustness to novel deformations of the data and adversarial examples. Because Manifold Mixup is easy to implement and incurs little additional computational cost, we hope that it will become a useful regularization tool for deep learning practitioners.
Acknowledgements
The authors thank Christopher Pal, Sherjil Ozair and Dzmitry Bahdanau for useful discussions and feedback. Vikas Verma was supported by Academy of Finland project 13312683 / Raiko Tapani AT kulut. We would also like to acknowledge Compute Canada for providing computing resources used in this work.
References
 Achille & Soatto (2018) Achille, A. and Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 Alemi et al. (2017) Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
 Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 274–283, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/athalye18a.html.

 Bartlett & Shawe-Taylor (1998) Bartlett, P. and Shawe-Taylor, J. Generalization performance of support vector machines and other pattern classifiers, 1998.
 Bartunov et al. (2018) Bartunov, S., Santoro, A., Richards, B. A., Hinton, G. E., and Lillicrap, T. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. submitted to ICLR’2018, 2018.
 Belghazi et al. (2018) Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D., and Courville, A. C. MINE: mutual information neural estimation. CoRR, abs/1801.04062, 2018. URL http://arxiv.org/abs/1801.04062.
 Ben-David et al. (2010) Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
 Cybenko (1989) Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
 Devries & Taylor (2017) Devries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.
 Goodfellow et al. (2015) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, 2015.
 Goyal et al. (2018) Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Larochelle, H., Botvinick, M., Levine, S., and Bengio, Y. Transfer and exploration via the information bottleneck. 2018.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
 Guo et al. (2016) Guo, H., Mao, Y., and Zhang, R. MixUp as Locally Linear Out-Of-Manifold Regularization. ArXiv e-prints, 2016. URL https://arxiv.org/abs/1809.02499.
 Guo et al. (2018a) Guo, H., Mao, Y., and Zhang, R. MixUp as Locally Linear Out-Of-Manifold Regularization. ArXiv e-prints, September 2018a.
 Guo et al. (2018b) Guo, H., Mao, Y., and Zhang, R. Aggregated Learning: A Vector Quantization Approach to Learning with Neural Networks. ArXiv e-prints, July 2018b.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In ECCV, 2016.
 Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
 Lee et al. (2015) Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Machine Learning and Knowledge Discovery in Databases (ECML/PKDD). 2015.
 Lee et al. (1995) Lee, W. S., Bartlett, P. L., and Williamson, R. C. Lower bounds on the vc dimension of smoothly parameterized function classes. Neural Computation, 7(5):1040–1053, Sep. 1995. ISSN 0899-7667. doi: 10.1162/neco.1995.7.5.1040.
 Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

 Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

 Scellier & Bengio (2017) Scellier, B. and Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in computational neuroscience, 11, 2017.
 Shwartz-Ziv & Tishby (2017) Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.
 Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 Tishby & Zaslavsky (2015) Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. CoRR, abs/1503.02406, 2015. URL http://arxiv.org/abs/1503.02406.

 Tokozume et al. (2018) Tokozume, Y., Ushiku, Y., and Harada, T. Between-class learning for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 Wang et al. (2018) Wang, B., Luo, X., Li, Z., Zhu, W., Shi, Z., and Osher, S. J. Deep learning with data dependent implicit activation function. CoRR, abs/1802.00168, 2018. URL http://arxiv.org/abs/1802.00168.
 Whittington & Bogacz (2017) Whittington, J. C. and Bogacz, R. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation, 2017.
 Zeiler & Fergus (2013) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.
 Zhang et al. (2018) Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1Rb.
 Zhao & Cho (2018) Zhao, J. and Cho, K. Retrieval-augmented convolutional neural networks for improved robustness against adversarial examples. CoRR, abs/1802.09502, 2018. URL http://arxiv.org/abs/1802.09502.
Appendix A Synthetic Experiments Analysis
We conducted experiments using a generated synthetic dataset in which each image is deterministically rendered from a set of independent factors. The goal of this experiment is to study the impact of Input Mixup and of an idealized version of Manifold Mixup in which we know the true factors of variation in the data and can perform mixup in exactly the space of those factors. This is not meant to be a fair evaluation or representation of how Manifold Mixup actually performs; rather, it is meant to illustrate how generating relevant and semantically meaningful augmented data points can be much better than generating points by mixing in the input space.
We considered three tasks. In Task A, we train on images with angles uniformly sampled between (, ) (label 0) with 50% probability and uniformly between (, ) (label 1) with 50% probability. At test time we sampled uniformly between (, ) (label 0) with 50% probability and uniformly between (, ) (label 1) with 50% probability. Task B used the same setup as Task A for training, but the test instead used (, ) as label 0 and (, ) as label 1. In Task C the label was whether the digit was a “1” or a “7”, and our training images were uniformly sampled between (, ) with 50% probability and uniformly between (, ) with 50% probability. The test data for Task C were uniformly sampled with angles from (, ).
Examples of the data are in Figure 4 and results are in Table 8. In all cases we found that Input Mixup gave some improvements in likelihood but limited improvements in accuracy, suggesting that even generating nonsensical points can help a classifier trained with Input Mixup to be better calibrated. Nonetheless, the improvements were much smaller than those achieved by mixing in the ground truth attribute space.
Task  Model  Test Accuracy  Test NLL

Task A  No Mixup  1.6  8.8310
Task A  Input Mixup (1.0)  0.0  6.0601
Task A  Ground Truth Factor Mixup (1.0)  94.77  0.4940
Task B  No Mixup  21.25  7.0026
Task B  Input Mixup (1.0)  18.40  4.3149
Task B  Ground Truth Factor Mixup (1.0)  84.02  0.4572
Task C  No Mixup  63.05  4.2871
Task C  Input Mixup  66.09  1.4181
Task C  Ground Truth Factor Mixup  99.06  0.1279
Appendix B Analysis of how Manifold Mixup changes learned representations
We have found significant improvements from using Manifold Mixup, but a key question is whether the improvements come from changing the behavior of the layers before the mixup operation is applied or the layers after it. This is a place where Manifold Mixup and Input Mixup are clearly differentiated, as Input Mixup has no “layers before the mixup operation” to change. We conducted analytical experiments in which the representations are low-dimensional enough to visualize. More concretely, we trained a fully-connected network on MNIST with two fully-connected leaky ReLU layers of 1024 units, followed by a 2-dimensional bottleneck layer, followed by two more fully-connected leaky ReLU layers of 1024 units.
We then considered training with no mixup, training with mixup in the input space, and training only with mixup directly following the 2D bottleneck. We consistently found that Manifold Mixup makes the representations much tighter, with the real data occupying a smaller region in the hidden space and with a better-separated margin between the classes, as shown in Figure 5.
Appendix C Supervised Regularization Experimental Details
Test Set Deformation  No Mixup Baseline  Input Mixup (α=1.0)  Input Mixup (α=2.0)  Manifold Mixup (α=2.0)

Rotation U(,)  52.96  55.55  56.48  60.08 
Rotation U(,)  33.82  37.73  36.78  42.13 
Rotation U(,)  26.77  28.47  27.53  33.78 
Rotation U(,)  24.19  26.72  25.34  29.95 
Shearing U(, )  55.92  58.16  60.01  62.85 
Shearing U(, )  35.66  39.34  39.7  44.27 
Shearing U(, )  19.57  22.94  22.8  24.69 
Shearing U(, )  17.55  21.66  21.22  23.56 
Shearing U(, )  22.38  25.53  25.27  28.02 
Zoom In (20% rescale)  2.43  1.9  2.45  2.03 
Zoom In (40% rescale)  4.97  4.47  5.23  4.17 
Zoom In (60% rescale)  12.68  13.75  13.12  11.49 
Zoom In (80% rescale)  47.95  52.18  50.47  52.7 
Zoom Out (120% rescale)  43.18  60.02  61.62  63.59 
Zoom Out (140% rescale)  19.34  41.81  42.02  45.29 
Zoom Out (160% rescale)  11.12  25.48  25.85  27.02 
Zoom Out (180% rescale)  7.98  18.11  18.02  15.68 
Manifold Mixup (ours) consistently allows the model to be more robust to random shearing, rescaling, and rotation even though these deformations were not observed during training. For the rotation experiment, each image is rotated by an angle uniformly sampled from the given range. Likewise, the shearing is performed with uniformly sampled angles. Zooming in refers to taking a bounding box at the center of the image with k% of the length and k% of the width of the original image, and then expanding this crop to fit the original size. Likewise, zooming out refers to drawing a bounding box with k% of the height and k% of the width, and then taking this larger area and scaling it down to the original size of the image (the padding outside of the image is black).
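The zoom deformations described above both start from a centered bounding box. The following is a minimal sketch of that box computation only. The function name `centered_box` and the rounding behavior are illustrative assumptions, not the authors' implementation, and the final resize back to the original resolution is left out.

```python
def centered_box(height, width, k_percent):
    """Return (top, left, bottom, right) of a centered box covering
    k_percent of the image height and width.

    k_percent < 100 gives a zoom-in crop that lies inside the image.
    k_percent > 100 gives a zoom-out box that extends past the image;
    the region outside the image would be filled with black.
    """
    box_h = round(height * k_percent / 100)
    box_w = round(width * k_percent / 100)
    top = (height - box_h) // 2
    left = (width - box_w) // 2
    return top, left, top + box_h, left + box_w
```

Negative coordinates in the zoom-out case indicate how much black padding would be needed on each side before rescaling.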
For supervised regularization we considered the following architectures: PreActResNet18, PreActResNet34, and WideResNet-28-10. When using Manifold Mixup, we selected the layer at which to perform mixing uniformly at random from a set of eligible layers. In all our experiments, for the PreActResNet architectures, the eligible layers for mixing in Manifold Mixup were: the input layer, the output from the first resblock, and the output from the second resblock. For the WideResNet-28-10 architecture, the eligible layers for mixing in Manifold Mixup were: the input layer and the output from the first resblock. For PreActResNet18, the first resblock has four layers and the second resblock has four layers. For PreActResNet34, the first resblock has six layers and the second resblock has eight layers. For WideResNet-28-10, the first resblock has four layers. Thus the mixing is often done in fairly deep layers of the network.
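The per-step layer selection described above can be sketched as follows. The string labels for the layers and the function name `sample_mixing_layer` are hypothetical names chosen for illustration; only the sets of eligible layers come from the text.

```python
import random

# Eligible mixing layers per architecture, as listed in the text.
# The string labels themselves are illustrative.
ELIGIBLE_LAYERS = {
    "PreActResNet18": ["input", "resblock1", "resblock2"],
    "PreActResNet34": ["input", "resblock1", "resblock2"],
    "WideResNet-28-10": ["input", "resblock1"],
}


def sample_mixing_layer(architecture, rng=random):
    """Pick the layer to mix at, uniformly at random, for one training step."""
    return rng.choice(ELIGIBLE_LAYERS[architecture])
```

Choosing `"input"` in this sketch corresponds to ordinary Input Mixup for that step.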
Throughout our experiments, we use the SGD+Momentum optimizer with learning rate 0.1, momentum 0.9 and weight decay , with step-wise learning rate decay.
For Table 1(a), Table 1(b) and Table 3, we train PreActResNet18 and PreActResNet34 for 1200 epochs, with the learning rate annealed by a factor of 10 at epochs 400 and 800. For the above tables, we train WideResNet-28-10 for 400 epochs, with the learning rate annealed by a factor of 10 at epochs 200 and 300. In Table 3, we train PreActResNet18 for 2000 epochs, with the learning rate annealed by a factor of 10 at epochs 1000 and 1500.
For Table 6 and Table 6, we train the PreActResNet18 network for 2000 epochs with learning rate annealed by a factor of 10 at epoch 1000 and 1500.
For Table 7, Table 4 and Table 9, we train the networks for 1200 epochs with learning rate annealed by a factor of 10 at epoch 400 and 800.
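The step-wise schedules listed above share one form: a base learning rate of 0.1 divided by 10 at fixed epochs. A minimal sketch follows; the function name `step_lr` is hypothetical, and whether the drop takes effect at the milestone epoch itself or one epoch later is an assumption here.

```python
def step_lr(epoch, base_lr=0.1, milestones=(400, 800), factor=10.0):
    """Learning rate at a given epoch under step-wise decay.

    The rate is divided by `factor` once for every milestone that has
    already been reached (assumed: the drop applies from the milestone
    epoch onward).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)
```

For the 400-epoch WideResNet runs, the same function would be called with `milestones=(200, 300)`, and for the 2000-epoch runs with `milestones=(1000, 1500)`.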
In Figure 6 and Figure 7, we present the training loss (binary cross-entropy) for the CIFAR-10 and CIFAR-100 datasets, respectively. We observe that performing Manifold Mixup in higher layers allows the training loss to go down faster than with Input Mixup, which suggests that while Input Mixup may suffer from underfitting, Manifold Mixup alleviates this problem to some extent.
C.1 Hyperparameter α
For Input Mixup on the CIFAR-10 and CIFAR-100 datasets, we used the value of α recommended in Zhang et al. (2018). For Input Mixup on the SVHN and Tiny-ImageNet datasets, we experimented with the α values in the set . We obtained the best results using and for SVHN and Tiny-ImageNet, respectively. For Manifold Mixup, for all datasets, we experimented with the α values in the set . We obtained the best results with for CIFAR-10, CIFAR-100 and SVHN and with for Tiny-ImageNet.
Appendix D Adversarial Examples
We ran the unbounded projected gradient descent (PGD) (Madry et al., 2018) sanity check suggested in Athalye et al. (2018). We took our trained models for the Input Mixup baseline and for Manifold Mixup and ran PGD for 200 iterations with a step size of 0.01, which reduced the Input Mixup model’s accuracy to 1% and the Manifold Mixup model’s accuracy to 0%. This is evidence that our defense did not improve results primarily as a result of gradient masking.
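To illustrate the logic of this sanity check, the sketch below runs an unbounded signed-gradient attack against a toy linear logistic classifier rather than a deep network. The toy model, the function names, and the clipping of inputs to [0, 1] are all assumptions made for illustration; only the 200 iterations and the step size of 0.01 come from the text.

```python
import math


def unbounded_pgd(x, y, w, b, steps=200, step_size=0.01):
    """Signed-gradient ascent on the logistic loss of a linear model.

    There is no epsilon-ball projection (the "unbounded" setting);
    inputs are only clipped to the assumed valid range [0, 1].
    """
    x = list(x)
    for _ in range(steps):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-logit))
        # For binary cross-entropy, d(loss)/dx_i = (p - y) * w_i.
        grad = [(p - y) * wi for wi in w]
        signs = [(g > 0) - (g < 0) for g in grad]
        x = [min(1.0, max(0.0, xi + step_size * s)) for xi, s in zip(x, signs)]
    return x


def predict(x, w, b):
    """Predicted class (0 or 1) of the linear model."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
```

If gradients are informative, an attack with no perturbation budget should drive accuracy to near zero, which is the behavior the check looks for.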
Appendix E Generative Adversarial Networks
The recent literature has suggested that regularizing the discriminator is beneficial for training GANs (Salimans et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018). In a similar vein, one could add mixup to the original GAN training objective such that the extra data augmentation acts as a beneficial regularization to the discriminator, which is what was proposed in Zhang et al. (2018). Mixup proposes the following objective (the formulation written here is based on the official code provided with the paper, rather than the description in the paper; the discrepancy between the two is that the formulation in the paper only considers mixes between real and fake):
(3) 
where can be either real or fake samples, and is sampled from a . Note that we have used a function to denote the label since there are four possibilities depending on and :
(4) 
In practice, however, we found that it does not make sense to create mixes between real and real where the label is set to 1 (as shown in Equation 4), since the mixup of two real examples in input space is not a real example. So we only create mixes that are either real-fake, fake-real, or fake-fake. Secondly, instead of using just Equation 3, we optimize it in addition to the regular minimax GAN equations:
(5) 
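The label function with four cases, and the restriction to pairs that are not real-real, can be sketched as follows. This assumes the usual mixup convention that the mixed input is `lam * x1 + (1 - lam) * x2` with real samples labelled 1 and fake samples labelled 0. The function names are hypothetical, and only the real-real case (label 1) is stated explicitly in the text.

```python
def mixup_gan_label(lam, first_is_real, second_is_real):
    """Soft discriminator target for the mix lam * x1 + (1 - lam) * x2.

    Real samples carry label 1 and fake samples carry label 0, so the
    four cases are: real-real -> 1, fake-fake -> 0,
    real-fake -> lam, fake-real -> 1 - lam.
    """
    return lam * float(first_is_real) + (1.0 - lam) * float(second_is_real)


def allowed_pair(first_is_real, second_is_real):
    """Pairs kept in the modified objective: all except real-real."""
    return not (first_is_real and second_is_real)
```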
Using similar notation to earlier in the paper, we present the manifold mixup version of our GAN objective in which we mix in the hidden space of the discriminator:
(6) 
where is a function denoting the intermediate output of the discriminator at layer , and the output of the discriminator given input from layer .
The layer at which we mix is sampled from an arbitrary combination of the input layer (i.e., Input Mixup) and the first or second resblocks of the discriminator, all with equal probability of selection.
We ran experiments evaluating the quality of generated images on CIFAR-10, using as a baseline JSGAN with spectral normalization (Miyato et al., 2018) (our configuration is almost identical to theirs). Results are averaged over at least three runs (Inception scores are typically reported with a mean and variance, though this is across multiple splits of samples from a single model; since we run multiple experiments, we average their respective means and variances). From these results, the best-performing mixup experiments (both Input and Manifold Mixup) are with , with mixing in all layers (both resblocks and input) achieving an average Inception / FID of / , Input Mixup achieving / , and the baseline experiment / . This suggests that mixup acts as a useful regularization on the discriminator, which is even further improved by Manifold Mixup. (See Figure 8 for the full set of experimental results.)
Appendix F Intuitive Explanation of how Manifold Mixup avoids Inconsistent Interpolations
An essential motivation behind Manifold Mixup is that as the network learns the hidden states, it does so in a way that encourages them to be flatter (per class). Section 3.1 characterized this for hidden states with any number of dimensions, and Figure 1 showed how this can occur on the 2D spiral dataset.
Our goal here is to discuss concrete examples that illustrate why this flattening happens, as shown in Figure 3. If we consider any two points, the interpolated point between them is based on a sampled λ, and the soft target for that interpolated point is the targets interpolated with the same λ. So if we consider two points A, B which have the same label, it is apparent that every point on the line between A and B should have that same label with 100% confidence. If we consider two points A, B with different labels, then the point halfway between them will be given the soft label of 50% the label of A and 50% the label of B (and so on for other values of λ).
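The interpolation of points and of their targets described above can be written out directly. In this sketch the coordinates and the helper name `mix` are made up for illustration, and `lam` stands for the sampled mixing coefficient.

```python
def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b, applied element-wise."""
    return [lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)]


# Two hidden states with different one-hot labels (values are made up).
h_a, y_a = [0.0, 2.0], [1.0, 0.0]
h_b, y_b = [4.0, 0.0], [0.0, 1.0]

lam = 0.5
h_mid = mix(h_a, h_b, lam)  # the point halfway between A and B
y_mid = mix(y_a, y_b, lam)  # soft label: half the label of A, half of B

# Two points that share a label keep that label at every interpolation.
y_same = mix(y_a, y_a, 0.3)
```

The same coefficient is used for the states and for the targets, which is what ties a location in hidden space to one specific soft label.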
It is clear that for many arrangements of data points, it is possible for a point in the space to be reached through distinct interpolations between different pairs of examples, and reached with different λ values. Because the learned model tries to capture the distribution , it can only assign a single distribution over the label values to a single particular point (for example, it could say that a point is 100% label A, or it could say that a point is 50% label A and 50% label B). Intuitively, these inconsistent soft labels at interpolated points can be avoided if the states for each class are more concentrated and the representations do not have variability in directions pointing towards other classes. This leads to flattening: a reduction in the number of directions with variability. The theory in Section 3.1 characterizes exactly what this concentration needs to be: the representations for each class need to lie on a subspace of dimension equal to “number of hidden dimensions” - “number of classes” + 1.
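As a worked instance of this dimension count, take the 12-unit and 30-unit bottlenecks used in Appendix G together with the 10 MNIST classes. The symbols $\dim(\mathcal{H})$ for the number of hidden dimensions and $d$ for the number of classes are notation assumed here.

```latex
\dim(\mathcal{H}) - d + 1 = 12 - 10 + 1 = 3
\qquad\text{and}\qquad
\dim(\mathcal{H}) - d + 1 = 30 - 10 + 1 = 21
```

Under the stated result, each class would therefore be confined to a subspace of dimension 3 with the 12-unit bottleneck, and of dimension 21 with the 30-unit bottleneck.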
Appendix G Spectral Analysis of Learned Representations
When we refer to flattening, we mean that the class-specific representations have reduced variability in some directions. Our analysis in this section makes this more concrete.
We trained an MNIST classifier with a hidden state bottleneck in the middle with 12 units (intentionally selected to be just slightly greater than the number of classes). We then took the representation for each class and computed a singular value decomposition (Figure 9 and Figure 10), and we also computed an SVD over all of the representations together (Figure 12). Our architecture contained three hidden layers with 1024 units and LeakyReLU activation, followed by a bottleneck representation layer (with either 12 or 30 hidden units), followed by an additional four hidden layers each with 1024 units and LeakyReLU activation. When we performed Manifold Mixup for our analysis, we only performed mixing in the bottleneck layer, and used a beta distribution with an alpha of 2.0. Additionally, we performed another experiment (Figure 11) where we placed the bottleneck representation layer with 30 units immediately following the first hidden layer with 1024 units and LeakyReLU activation.
We found that Manifold Mixup had a striking effect on the singular values, with most of the singular values becoming much smaller. Effectively, this means that the representations for each class have variance in fewer directions. While our theory in Section 3.1 showed that this flattening must force each class's representations onto a lower-dimensional subspace (and hence gives an upper bound on the number of nonzero singular values), this experiment explores how the flattening occurs empirically and does not require the number of hidden dimensions to be so small that it can be manually visualized. In our experiments we tried using 12 hidden units in the bottleneck (Figure 9) as well as 30 hidden units in the bottleneck (Figure 10).
Our results from this experiment are unequivocal: Manifold Mixup dramatically reduces the size of the smaller singular values for each class's representations. This indicates a flattening of the class-specific representations. At the same time, the singular values over all the representations are not changed in a clear way (Figure 12), which suggests that this flattening occurs in directions which are distinct from the directions occupied by representations from other classes, which is the same intuition behind our theory. Moreover, Figure 11 shows that when the mixing is performed earlier in the network, there is still a flattening effect, though it is weaker than in the later layers, and again Input Mixup has an inconsistent effect.
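The per-class spectral analysis described in this appendix can be sketched with NumPy as follows. The function name `class_singular_values` is hypothetical, and mean-centering each class before the SVD is an assumption, since the text does not say whether the representations were centered.

```python
import numpy as np


def class_singular_values(hidden, labels):
    """Singular values of the hidden states of each class.

    hidden: array of shape (n_examples, n_hidden), e.g. bottleneck outputs
    labels: integer array of shape (n_examples,)
    Returns a dict mapping class id to singular values (descending).
    """
    spectra = {}
    for c in np.unique(labels):
        h_c = hidden[labels == c]
        # Assumed preprocessing: remove the class mean before the SVD.
        h_c = h_c - h_c.mean(axis=0, keepdims=True)
        spectra[int(c)] = np.linalg.svd(h_c, compute_uv=False)
    return spectra
```

A flattened class shows up as a spectrum in which only a few leading singular values are far from zero.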