# Towards Understanding the Data Dependency of Mixup-style Training

In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixup-optimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions for Mixup training to also minimize the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.



## 1 Introduction

All code associated with this paper can be found at: https://github.com/2014mchidamb/Mixup-Data-Dependency.

Mixup (zhang2018Mixup) is a modification to the standard supervised learning setup in which a model is trained on convex combinations of pairs of data points and their labels instead of the original data itself. In the original paper, zhang2018Mixup demonstrated that training deep neural networks using Mixup leads to better generalization performance, as well as greater robustness to adversarial attacks and label noise on image classification tasks. The empirical advantages of Mixup training have been affirmed by several follow-up works (He_2019_CVPR; thulasidasan2019Mixup; lamb2019interpolated; arazo2019unsupervised). The idea of Mixup has also been extended beyond the supervised learning setting, and has been applied to semi-supervised learning (mixmatch; fixmatch), privacy-preserving learning (huang2021instahide), and learning with fairness constraints (fairmixup).

However, from a theoretical perspective, Mixup training remains mysterious even in the basic multi-class classification setting: why should the output at a linear mixture of two training samples be the same linear mixture of their labels, especially for highly nonlinear models? Despite several recent theoretical results (guo2019mixup; carratino2020mixup; zhang2020does; zhang2021mixup), there is still not a complete understanding of why Mixup training actually works in practice. In this paper, we try to understand why Mixup works by first understanding when Mixup works: in particular, how the properties of Mixup training rely on the structure of the training data.

We consider two properties for classifiers trained with Mixup. First, even though Mixup training does not observe many original data points during training, it usually can still correctly classify all of the original data points (empirical risk minimization (ERM)). Second, the aforementioned empirical works have shown how classifiers trained with Mixup often have better adversarial robustness and generalization than standard training. In this work, we show that both of these properties can rely heavily on the data used for training, and that they need not hold in general.

Main Contributions and Related Work. The idea that Mixup can potentially fail to minimize the original risk is not new; guo2019mixup provide examples of how Mixup labels can conflict with actual data point labels. However, their theoretical results do not characterize the data and model conditions under which this failure can provably happen when minimizing the Mixup loss. In Section 2 of this work, we provide a concrete classification dataset on which continuous approximate-minimizers of the Mixup loss can fail to minimize the empirical risk. We also provide sufficient conditions for Mixup to minimize the original risk, and show that these conditions hold approximately on standard image classification benchmarks.

With regards to generalization and robustness, the parallel works of carratino2020mixup and zhang2020does showed that Mixup training can be viewed as minimizing the empirical loss along with a data-dependent regularization term. zhang2020does further relate this term to the adversarial robustness and Rademacher complexity of certain function classes learned with Mixup. In Section 3, we take an alternative approach to understanding generalization and robustness by analyzing the margin of Mixup classifiers. Our perspective can be viewed as complementary to that of the aforementioned works, as we directly consider the properties exhibited by a Mixup-optimal classifier instead of considering which properties are encouraged by the regularization effects of the Mixup loss. In addition to our margin analysis, we also show that, in the common setting of linear models trained on high-dimensional Gaussian features, both Mixup (for a large class of mixing distributions) and ERM with gradient descent learn the same classifier with high probability.

Finally, we note the related works that are beyond the scope of our paper; namely the many Mixup-like training procedures such as Manifold Mixup (verma2019manifold), Cut Mix (yun2019cutmix), Puzzle Mix (kim2020puzzle), and Co-Mixup (kim2021comixup).

## 2 Mixup and Empirical Risk Minimization

The goal of this section is to understand when Mixup training also minimizes the empirical risk. Our main technique for doing so is to derive a closed form for the Mixup-optimal classifier over a sufficiently powerful function class, which we do in Section 2.2 after introducing the basic setup in Section 2.1. We use this closed form to motivate a concrete example on which Mixup training does not minimize the empirical risk in Section 2.3, and show under mild nondegeneracy conditions that Mixup will minimize the empirical risk in Section 2.4.

### 2.1 Setup

We consider the problem of $k$-class classification where the classes correspond to compact disjoint sets $X_1, \ldots, X_k \subset \mathbb{R}^n$ with an associated probability measure $P_X$ supported on $X = \bigcup_i X_i$. We use $\mathcal{F}$ to denote the set of all functions $g : \mathbb{R}^n \to [0,1]^k$ satisfying the property that $\sum_{i=1}^{k} g_i(x) = 1$ for all $x$ (where $g_i$ represents the $i$-th coordinate function of $g$). We refer to a function $g \in \mathcal{F}$ as a classifier, and say that $g$ classifies $x$ as class $i$ if $g_i(x) > g_j(x)$ for all $j \neq i$. The cross-entropy loss associated with such a classifier is then:

$$J(g, P_X) = -\sum_{i=1}^{k} \int_{X_i} \log g_i(x) \, dP_X(x)$$

The goal of standard training is to learn a classifier $g^* \in \operatorname{arg\,min}_{g \in \mathcal{F}} J(g, P_X)$. Any such classifier will necessarily satisfy $g^*_i = 1$ on $X_i$ since the $X_i$ are disjoint.

Mixup. In the Mixup version of our setup, we are interested in minimizing the cross-entropy of convex combinations of the original data and their classes. These convex combinations are determined according to a probability measure $P_f$ whose support is $[0,1]$, and we assume this measure has a density $f$. For two points $s, t \in X$, we let $z_{st}(\lambda) = \lambda s + (1-\lambda)t$ (and write $z_{st}$ when $\lambda$ is understood) and define the Mixup cross-entropy on $s, t$ with respect to a classifier $g$ as:

$$\ell_{\mathrm{mix}}(g, s, t, \lambda) = \begin{cases} -\log g_i(z_{st}) & s, t \in X_i \\ -\left(\lambda \log g_i(z_{st}) + (1-\lambda) \log g_j(z_{st})\right) & s \in X_i, \, t \in X_j \end{cases}$$

Having defined $\ell_{\mathrm{mix}}$ as above, we may write the component of the full Mixup cross-entropy loss corresponding to mixing points from classes $i$ and $j$ as:

$$J^{i,j}_{\mathrm{mix}}(g, P_X, P_f) = \int_{X_i \times X_j \times [0,1]} \ell_{\mathrm{mix}}(g, s, t, \lambda) \, d(P_X \times P_X \times P_f)(s, t, \lambda)$$

We omit some, or all, of the arguments of $J^{i,j}_{\mathrm{mix}}$ when they are clear from context. The final Mixup cross-entropy loss is then the sum of $J^{i,j}_{\mathrm{mix}}$ over all pairs $i, j$ (corresponding to all possible mixings between classes, including a class with itself):

$$J_{\mathrm{mix}}(g, P_X, P_f) = \sum_{i=1}^{k} J^{i,i}_{\mathrm{mix}} + 2 \sum_{i=1}^{k} \sum_{j=i+1}^{k} J^{i,j}_{\mathrm{mix}}$$

Here the coefficient of 2 in front of the second term comes from the fact that $J^{i,j}_{\mathrm{mix}} = J^{j,i}_{\mathrm{mix}}$ by Fubini's Theorem (we consider only classifiers for which the terms $J^{i,j}_{\mathrm{mix}}$ are defined and integrable).
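To make the objective concrete, the Mixup loss above can be estimated by Monte Carlo sampling on a finite dataset. The sketch below is our own illustrative code (a hypothetical 1-D dataset and smooth classifier, with the data measure discrete uniform and a uniform mixing density), not the paper's implementation:

```python
import math
import random

random.seed(0)

# Hypothetical 1-D dataset: two class supports, encoded as (point, class index).
data = [(0.0, 0), (2.0, 0), (1.0, 1)]

def g(z):
    """A hypothetical smooth classifier; returns [g_1(z), g_2(z)] summing to 1."""
    p1 = 1.0 / (1.0 + math.exp(-4.0 * (abs(z - 1.0) - 0.5)))
    return [p1, 1.0 - p1]

def mixup_loss_mc(g, data, n_samples=20000):
    """Monte Carlo estimate of the Mixup loss: sample (s, t, lambda), mix,
    and weight the log-loss of each endpoint's class by lambda and (1 - lambda)."""
    total = 0.0
    for _ in range(n_samples):
        (s, ys), (t, yt) = random.choice(data), random.choice(data)
        lam = random.random()  # uniform mixing density on [0, 1]
        z = lam * s + (1 - lam) * t
        probs = g(z)
        total -= lam * math.log(probs[ys]) + (1 - lam) * math.log(probs[yt])
    return total / n_samples

loss = mixup_loss_mc(g, data)
```

Note that when $s$ and $t$ share a class, the two weighted terms collapse to a single log-loss, matching the first case in the definition of the Mixup cross-entropy.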

Relation to Prior Work. We have opted for a more general definition of the Mixup loss (at least when restricted to multi-class classification) than prior works. This is not generality for generality's sake, but rather because many of our results apply to any mixing distribution $P_f$ supported on $[0,1]$. One obtains the original Mixup formulation of zhang2018Mixup for multi-class classification on a finite dataset by taking the $X_i$ to be finite sets, choosing $P_X$ to be the normalized counting measure (corresponding to a discrete uniform distribution), and choosing $P_f$ to have density $\mathrm{Beta}(\alpha, \alpha)$, where $\alpha > 0$ is a hyperparameter.
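As a quick empirical check of how this hyperparameter shapes the canonical mixing density, one can sample $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with the Python standard library: small $\alpha$ pushes $\lambda$ toward the endpoints of $[0,1]$ (near-ERM behavior), while large $\alpha$ concentrates $\lambda$ around $1/2$. This is our own sketch, not code from the paper:

```python
import random

random.seed(0)

def lambda_spread(alpha, n=20000):
    """Average distance of lambda ~ Beta(alpha, alpha) from 1/2."""
    samples = (random.betavariate(alpha, alpha) for _ in range(n))
    return sum(abs(lam - 0.5) for lam in samples) / n

# Spread shrinks as alpha grows: mixtures approach midpoints of data pairs.
spreads = {alpha: lambda_spread(alpha) for alpha in (0.1, 1.0, 100.0)}
```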

### 2.2 Mixup-optimal Classifier

Given our setup, we now wish to characterize the behavior of a Mixup-optimal classifier at a point $x$. However, if the optimization of $J_{\mathrm{mix}}$ is considered over the full class of functions $\mathcal{F}$, this is intractable (to the best of our knowledge) due to the lack of regularity conditions imposed on functions in $\mathcal{F}$. We thus wish to constrain the optimization of $J_{\mathrm{mix}}$ to a class of functions that is sufficiently powerful (so as to include almost all practical settings) while still allowing for local analysis. To do so, we will need the following definitions, which will also be referenced throughout the results in this section and the next:

$$\begin{aligned} A^{i,j}_{x,\epsilon} &= \{(s,t,\lambda) \in X_i \times X_j \times [0,1] : \ \lambda s + (1-\lambda)t \in B_\epsilon(x)\} \\ A^{i,j}_{x,\epsilon,\delta} &= \{(s,t,\lambda) \in X_i \times X_j \times [0,1-\delta] : \ \lambda s + (1-\lambda)t \in B_\epsilon(x)\} \\ X_{\mathrm{mix}} &= \{x \in \mathbb{R}^n : \ \textstyle\bigcup_{i,j} A^{i,j}_{x,\epsilon} \text{ has positive measure for every } \epsilon > 0\} \\ \xi^{i,j}_{x,\epsilon} &= \int_{A^{i,j}_{x,\epsilon}} d(P_X \times P_X \times P_f)(s,t,\lambda) \\ \xi^{i,j}_{x,\epsilon,\lambda} &= \int_{A^{i,j}_{x,\epsilon}} \lambda \, d(P_X \times P_X \times P_f)(s,t,\lambda) \end{aligned}$$

The set $A^{i,j}_{x,\epsilon}$ represents all pairs of points in $X_i \times X_j$ whose connecting line segments intersect an $\epsilon$-neighborhood of $x$, while $A^{i,j}_{x,\epsilon,\delta}$ is the restriction of $A^{i,j}_{x,\epsilon}$ to mixing weights $\lambda$ bounded above by $1 - \delta$. The set $X_{\mathrm{mix}}$ consists of all points every neighborhood of which is hit by mixed points with positive probability. The term $\xi^{i,j}_{x,\epsilon}$ is the measure of the set $A^{i,j}_{x,\epsilon}$, while $\xi^{i,j}_{x,\epsilon,\lambda}$ is the (unnormalized) expectation of $\lambda$ over the same set. To provide better intuition for these definitions, we provide visualizations in Section B of the Appendix. We can now define the subset of $\mathcal{F}$ to which we will constrain our optimization of $J_{\mathrm{mix}}$.

###### Definition 2.1.

Let $\mathcal{F}_{\mathrm{mix}}$ be the subset of $\mathcal{F}$ for which every $g^* \in \mathcal{F}_{\mathrm{mix}}$ satisfies $g^*(x) = \lim_{\epsilon \to 0} \operatorname{arg\,min}_{c} J_{\mathrm{mix}}(c, x, \epsilon)$ for all $x \in X_{\mathrm{mix}}$ when the limit exists. Here $J_{\mathrm{mix}}(c, x, \epsilon)$ represents the Mixup loss for a constant function with value $c$, with each term $J^{i,j}_{\mathrm{mix}}$ restricted to the set $A^{i,j}_{x,\epsilon}$.

We immediately justify this definition with the following proposition.

###### Proposition 2.2.

Any function $g^* \in \mathcal{F}_{\mathrm{mix}}$ satisfies $J_{\mathrm{mix}}(g^*) \leq J_{\mathrm{mix}}(g)$ for any continuous $g \in \mathcal{F}$.

Proof Sketch. We can argue directly from definitions by considering points in $X_{\mathrm{mix}}$ at which $g^*$ and $g$ differ.

Proposition 2.2 demonstrates that optimizing over $\mathcal{F}_{\mathrm{mix}}$ is at least as good as optimizing over the subset of $\mathcal{F}$ consisting of continuous functions, so we cover most cases of practical interest (i.e. optimizing deep neural networks). As such, the term "Mixup-optimal" means optimal with respect to $J_{\mathrm{mix}}$ over $\mathcal{F}_{\mathrm{mix}}$ throughout the rest of the paper. We may now characterize the classification of a Mixup-optimal classifier on $X_{\mathrm{mix}}$.

###### Lemma 2.3.

For any point $x \in X_{\mathrm{mix}}$ and any $\epsilon > 0$, there exists a continuous function $h^\epsilon$ satisfying:

$$h^\epsilon_i(x) = \frac{\xi^{i,i}_{x,\epsilon} + 2\left(\sum_{j=i+1}^{k} \xi^{i,j}_{x,\epsilon,\lambda} + \sum_{j=1}^{i-1} \left(\xi^{j,i}_{x,\epsilon} - \xi^{j,i}_{x,\epsilon,\lambda}\right)\right)}{\sum_{q=1}^{k} \left[\xi^{q,q}_{x,\epsilon} + 2\left(\sum_{j=q+1}^{k} \xi^{q,j}_{x,\epsilon,\lambda} + \sum_{j=1}^{q-1} \left(\xi^{j,q}_{x,\epsilon} - \xi^{j,q}_{x,\epsilon,\lambda}\right)\right)\right]} \tag{1}$$

with the property that $g^*(x) = \lim_{\epsilon \to 0} h^\epsilon(x)$ for every $g^* \in \mathcal{F}_{\mathrm{mix}}$ when the limit exists.

Proof Sketch. We define $h^\epsilon(x)$ to be $\operatorname{arg\,min}_{c} J_{\mathrm{mix}}(c, x, \epsilon)$ and show that this is well-defined and continuous using the strict convexity of the minimization problem.

###### Remark 2.4.

For the important case of finite datasets, it will be shown that the limit above always exists as part of the proof of Theorem 3.2.

Although the expression for $h^\epsilon_i$ looks complicated, it just represents the expected location of the point $x$ on all lines between class $i$ and the other classes, normalized by the sum of the corresponding quantities over all classes. Importantly, we note that while $h^\epsilon$ as defined in Lemma 2.3 is continuous for every $\epsilon > 0$, its pointwise limit need not be, which we demonstrate below.
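The $\xi$ quantities appearing in Equation 1 are also easy to estimate by Monte Carlo for small finite datasets. The sketch below is our own code (a hypothetical two-point dataset with uniform mixing, not an example from the paper): it aggregates the $\lambda$-weights of mixed points landing in the $\epsilon$-ball around a query point, which is exactly the content of the closed form. By symmetry, the estimate at $x = 0.5$ should come out near $(1/2, 1/2)$:

```python
import random

random.seed(0)

# Hypothetical dataset: the point 0 in class 0 and the point 1 in class 1.
data = [(0.0, 0), (1.0, 1)]

def h_eps(x, eps, n_samples=200000):
    """Monte Carlo estimate of the Mixup-optimal class probabilities at x:
    each sampled mixture landing within eps of x contributes weight lambda
    to the class of s and weight (1 - lambda) to the class of t."""
    w = [0.0, 0.0]
    for _ in range(n_samples):
        (s, ys), (t, yt) = random.choice(data), random.choice(data)
        lam = random.random()  # uniform mixing density
        z = lam * s + (1 - lam) * t
        if abs(z - x) < eps:
            w[ys] += lam
            w[yt] += 1 - lam
    total = w[0] + w[1]
    return [wi / total for wi in w]

probs = h_eps(x=0.5, eps=0.05)
```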

###### Proposition 2.5.

Let $X_1 = \{0, 2\}$ and $X_2 = \{1\}$, with $P_X$ being discrete uniform over $\{0, 1, 2\}$ and $P_f$ being continuous uniform over $[0,1]$. Then the Mixup-optimal classifier $g^*$ is discontinuous at $x = 1$.

Proof Sketch. One may explicitly compute for $x \in (0,1)$ that $g^*_2(x) = 2x/3$, so that $\lim_{x \to 1^-} g^*_2(x) = 2/3 \neq 1 = g^*_2(1)$.

Proposition 2.5 illustrates a first significant difference between Mixup training and standard training: there always exists a minimizer of the standard empirical cross-entropy that can be extended to a continuous function (since a minimizer is constant on the class supports and unconstrained elsewhere), whereas, depending on the data, the minimizer of $J_{\mathrm{mix}}$ can be discontinuous.

### 2.3 A Mixup Failure Case

With that in mind, several model classes popular in practical applications consist of continuous functions. For example, neural networks with ReLU activations are continuous, and several works have noted that they are Lipschitz continuous, with shallow networks having approximately small Lipschitz constant (scaman2019lipschitz; fazlyab2019efficient; latorre2020lipschitz). Given the regularity of such models, we are motivated to consider the continuous approximations $h^\epsilon$ in Lemma 2.3 and ask whether it is possible to construct a dataset on which $h^\epsilon$ (for a fixed $\epsilon$) can fail to classify the original points correctly. We thus consider the following dataset:

###### Definition 2.6.

[3-Point Alternating Line] We define the 3-point alternating line dataset to be the binary classification dataset consisting of the points $0, 1, 2$ classified as classes $1, 2, 1$ respectively. In our setup, this corresponds to $X_1 = \{0, 2\}$ and $X_2 = \{1\}$ with $P_X$ discrete uniform over $\{0, 1, 2\}$.

Intuitively, the reason why Mixup can fail on this dataset is that, for choices of $P_f$ that concentrate about $\lambda = 1/2$, we will have by Lemma 2.3 that the Mixup-optimal classification in a neighborhood of point 1 should skew towards class 1 instead of class 2, due to the sandwiching of point 1 between points 0 and 2. The canonical choice of $P_f$ corresponding to a mixing density of $\mathrm{Beta}(\alpha, \alpha)$ is, for large $\alpha$, one such choice:

###### Theorem 2.7.

Let $P_f$ have associated density $\mathrm{Beta}(\alpha, \alpha)$. Then for any classifier $h^\epsilon$ (as defined in Lemma 2.3), we may choose $\alpha$ such that $h^\epsilon$ does not achieve 0 classification error on the 3-point alternating line dataset.

Proof Sketch. For any $\epsilon$, we can bound the $\xi$ terms in Equation 1 using the fact that the $\mathrm{Beta}(\alpha, \alpha)$ distribution is strictly subgaussian (marchal), and then choose $\alpha$ appropriately.

Experiments. The result of Theorem 2.7 leads us to believe that Mixup training of a continuous model should fail on the 3-point alternating line dataset for appropriately chosen $\alpha$. To verify that the theory predicts the experiments, we train a two-layer feedforward neural network with 512 hidden units and ReLU activations on this dataset, with and without Mixup. The implementation of Mixup training does not differ from the theoretical setup; we uniformly sample pairs of data points and train on their mixtures. Our implementation uses PyTorch (pytorch) and is based heavily on the open-source implementation of Manifold Mixup (verma2019manifold) by Shivam Saboo. Results for training using (full-batch) Adam (adam) with the suggested (and common) default hyperparameters and learning rate are shown in Figure 1. The class 1 probabilities outputted by the learned Mixup classifiers from Figure 1 for each point in the dataset are shown in Table 1 below:

We see from Figure 1 and Table 1 that Mixup training fails to correctly classify the points of the dataset for large values of $\alpha$, and this misclassification becomes more exacerbated as we increase $\alpha$. The choice of $\alpha$ at which misclassifications begin to happen is largely superficial; we show in Section D of the Appendix that it is straightforward to construct datasets in the style of the 3-point alternating line for which Mixup training will fail even for very mild choices of $\alpha$. We focus on the case of large $\alpha$ here to simplify the theory. The key takeaway is that, for datasets that exhibit (approximately) collinear structure amongst points, an inappropriately chosen mixing distribution can cause Mixup training to fail to minimize the original empirical risk.
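The sandwiching intuition can be checked numerically: under a concentrated $\mathrm{Beta}(\alpha, \alpha)$ mixing density, mixtures of the two class-1 endpoints 0 and 2 pile up near the class-2 point 1, so the aggregated label mass in a neighborhood of 1 tips toward class 1. The Monte Carlo sketch below is our own illustration (the specific $\alpha$ values and ball radius are arbitrary choices, not those of the paper):

```python
import random

random.seed(0)

# The 3-point alternating line: points 0, 1, 2 with classes 1, 2, 1
# (encoded here as labels 0, 1, 0).
data = [(0.0, 0), (1.0, 1), (2.0, 0)]

def class1_mass_near_middle(alpha, eps=0.05, n_samples=200000):
    """Fraction of Mixup label mass landing within eps of the point 1 that is
    assigned to the class of the endpoints 0 and 2."""
    w = [0.0, 0.0]
    for _ in range(n_samples):
        (s, ys), (t, yt) = random.choice(data), random.choice(data)
        lam = random.betavariate(alpha, alpha)
        z = lam * s + (1 - lam) * t
        if abs(z - 1.0) < eps:
            w[ys] += lam
            w[yt] += 1 - lam
    return w[0] / (w[0] + w[1])

# Mild mixing leaves point 1 dominated by its own class; concentrated mixing
# flips the balance toward the class of the sandwiching endpoints.
mild, concentrated = class1_mass_near_middle(1.0), class1_mass_near_middle(1024.0)
```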

### 2.4 Sufficient Conditions for Minimizing the Original Risk

The natural follow-up question to the results of the previous subsection is: under what conditions on the data can this failure case be avoided? In other words, when can the Mixup-optimal classifier classify the original data points correctly while being continuous at those points?

Prior to answering that question, we first point out that if discontinuous functions are allowed, then Mixup training always minimizes the original risk on finite datasets:

###### Proposition 2.8.

Consider $k$-class classification where the supports $X_i$ are finite and $P_X$ corresponds to the discrete uniform distribution. Then for every $g^* \in \mathcal{F}_{\mathrm{mix}}$, we have that $g^*_i = 1$ on $X_i$.

Proof Sketch. This is just because the measure of mixing a point with itself stays constant as $\epsilon \to 0$, while the measure of the cross-class mixings vanishes.

Note that Proposition 2.8 holds for any continuous mixing distribution supported on $[0,1]$; we just need a rich enough model class. In order to obtain the result of Proposition 2.8 with the added restriction of continuity of $g^*$ on each of the $X_i$, we need to make further assumptions. Namely, we need to avoid the collinearity of different-class points that occurred in the previous subsection, and we do so with the following assumption, which is a function of a class $i$ and a point $x \in X_i$:

###### Assumption 2.9.

For a class $i$ and a point $x \in X_i$, there exists an $\epsilon > 0$ such that $A^{j,q}_{x,\epsilon}$ has measure zero for all classes $j, q$ when both $j \neq i$ and $q \neq i$.

A visualization of Assumption 2.9 is provided in Section B of the Appendix. With this assumption in hand, we obtain the following result as a corollary of Theorem 3.2 which is proved in the next section:

###### Theorem 2.10.

We consider the same setting as Proposition 2.8 and further suppose that Assumption 2.9 is satisfied by all points in $X$. Then for every $g^* \in \mathcal{F}_{\mathrm{mix}}$, we have that $g^*_i = 1$ on $X_i$ and that $g^*$ is continuous on $X$.

Application of Sufficient Conditions. Theorem 2.10 suggests a way to test whether a dataset will be amenable to Mixup: we simply attempt to verify whether Assumption 2.9 holds for some large enough $\epsilon$ (depending on the Lipschitz constant of the model). This is, however, computationally intensive for large, high-dimensional datasets. We thus consider the following approximate verification scheme: we sample an epoch's worth of Mixup points (to simulate training) from a downsampled version of the train dataset, and then compute the minimum distance between each Mixup point and the points (from both train and test data) of classes other than the two mixed classes. The minimum over these distances is an estimate of $\epsilon$ in Assumption 2.9. For our experiments, we consider MNIST, CIFAR-10, and CIFAR-100 (cifar) downsampled to 20% of their sizes (replicating the setting of guo2019mixup), and use angular distance instead of Euclidean distance since ReLU activations are positive homogeneous. We use a $\mathrm{Beta}(\alpha, \alpha)$ mixing distribution with a large value of $\alpha$, both because the results and experiments of Subsection 2.3 demonstrate that the underfitting issue manifests more readily in practice for concentrated mixing distributions, and because larger values of $\alpha$ move further away from the ERM regime.

We find that the $\epsilon$ value computed according to our scheme is approximately 0.1 or greater (in angular distance) for each of MNIST, CIFAR-10, and CIFAR-100. Given the large Lipschitz constant estimates (scaman2019lipschitz) for the deep models typically used on these datasets, this much separation between the original points and the mixed points suggests that Mixup training should minimize the original risk. We verify this by training ResNet-18 (resnet) (using the popular implementation of Kuang Liu) on MNIST, CIFAR-10, and CIFAR-100, with and without Mixup, for 50 epochs with a batch size of 128 (with otherwise identical settings to the previous subsection). Results are shown in Figure 2. We checked that the graphs shown are not sensitive to small changes in the aforementioned hyperparameters, although we did not perform an exhaustive hyperparameter search due to resource constraints.
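The approximate verification scheme just described can be sketched in a few lines of NumPy. The toy Gaussian blobs below stand in for the downsampled image datasets, and all parameter choices are our own illustrative assumptions rather than the actual experimental code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in dataset: 3 well-separated classes of points in R^32.
n_per_class, dim = 50, 32
centers = rng.normal(size=(3, dim))
X = np.concatenate([c + 0.1 * rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(3), n_per_class)

def angular_dist(u, V):
    """Angular distance (as a fraction of pi) from u to each row of V."""
    cos = (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def estimate_eps(X, y, n_mix=500, alpha=100.0):
    """Estimate of epsilon in Assumption 2.9: the minimum angular distance from
    sampled Mixup points to points of classes other than the two mixed ones."""
    eps = np.inf
    for _ in range(n_mix):
        i, j = rng.integers(len(X), size=2)
        lam = rng.beta(alpha, alpha)
        z = lam * X[i] + (1 - lam) * X[j]
        others = X[(y != y[i]) & (y != y[j])]
        eps = min(eps, angular_dist(z, others).min())
    return eps

eps_hat = estimate_eps(X, y)
```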

As predicted by our approximate calculation and Theorem 2.10, Mixup training minimizes the empirical risk on MNIST, CIFAR-10, and CIFAR-100. However, we find that the test performance of Mixup at our choice of $\alpha$ is significantly worse than ERM for CIFAR-10 and CIFAR-100, affirming what was observed previously by guo2019mixup for large choices of $\alpha$. This is in contrast to the fact that the test data points exhibit greater angular distance to the mixed training points than the original training points do themselves. As such, we challenge the implication in guo2019mixup that collisions between mixed points and test points are the cause of the degradation in test performance; understanding when Mixup generalizes poorly seems to require more than just this perspective.

### 2.5 The Rate of Empirical Risk Minimization Using Mixup

Another striking aspect of the experiments in Figure 2 is that Mixup training minimizes the original empirical risk at a rate very similar to that of direct empirical risk minimization. A priori, there is no reason to expect that Mixup should be able to do this: a simple calculation shows that Mixup training sees only one true data point per epoch in expectation (each ordered pair of points is sampled with probability $1/m^2$, there are $m$ true point pairs corresponding to a point mixed with itself, and $m$ pairs are seen per epoch, where $m$ is the dataset size). The experimental results are even more surprising given that we are training using a highly concentrated mixing distribution, which essentially corresponds to training using the midpoints of the original data points. This seems to imply that it is possible to recover the classifications of the original data points from the midpoints alone (not including the midpoint of a point and itself). We make this rigorous with the following result:

###### Theorem 2.11.

Suppose the points $x_1, \ldots, x_m \in \mathbb{R}^n$ with $m > 2$ are sampled according to $P_X$, and that $P_X$ has a density (in other words, it is a continuous distribution). Then with probability 1, we can uniquely recover the points given only the midpoints $\{(x_i + x_j)/2 : i \neq j\}$.

Proof Sketch. The idea is to represent the recovery problem as a linear system, and show using rank arguments that the non-recoverable points are a measure zero set.

Theorem 2.11 shows, in an information-theoretic sense, that it is possible to obtain the original data points (and therefore also their labels) from only their midpoints. While this gives more theoretical backing as to why it is possible for Mixup training using to recover the original data point classifications with very low error, it does not explain why this actually happens in practice at the rate that it does. A full theoretical analysis of this phenomenon would necessarily require analyzing the training dynamics of neural networks (or another model of choice) when trained only on midpoints of the original data, which is outside the intended scope of this work. That being said, we hope that such analysis will be a fruitful line of investigation for future work.
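For three points the recovery behind Theorem 2.11 is an explicit linear system: each point is a signed sum of the three pairwise midpoints. The sketch below assumes the pairing of midpoints is known; handling the unknown pairing is where the measure-zero argument in the proof does its work:

```python
# Recover three points from their pairwise midpoints:
# p12 + p13 - p23 = ((x1 + x2) + (x1 + x3) - (x2 + x3)) / 2 = x1, and cyclically.
def midpoint(a, b):
    return tuple((ai + bi) / 2 for ai, bi in zip(a, b))

def recover(p12, p13, p23):
    x1 = tuple(a + b - c for a, b, c in zip(p12, p13, p23))
    x2 = tuple(a + c - b for a, b, c in zip(p12, p13, p23))
    x3 = tuple(b + c - a for a, b, c in zip(p12, p13, p23))
    return x1, x2, x3

x1, x2, x3 = (0.5, 1.0), (2.0, -1.0), (-3.0, 4.0)
recovered = recover(midpoint(x1, x2), midpoint(x1, x3), midpoint(x2, x3))
# recovered == ((0.5, 1.0), (2.0, -1.0), (-3.0, 4.0))
```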

## 3 Generalization Properties of Mixup Classifiers

Having discussed how Mixup training differs from standard empirical risk minimization with regards to the original training data, we now consider how a learned Mixup classifier can differ from one learned through empirical risk minimization on unseen test data. To do so, we analyze the per-class margin of Mixup classifiers, i.e. the distance one can move from a class support $X_i$ while still being classified as class $i$.

### 3.1 The Margin of Mixup Classifiers

Intuitively, if a point $x$ falls only on line segments between $X_i$ and the other classes, and if $x$ always falls closer to $X_i$ than to the other classes, we can expect $x$ to be classified as class $i$ by the Mixup-optimal classifier due to Lemma 2.3. To make this rigorous, we introduce another assumption in the same vein as Assumption 2.9:

###### Assumption 3.1.

For a class $i$ and a point $x$, suppose there exists an $\epsilon > 0$ and a $\delta > 0$ such that $A^{i,j}_{x,\epsilon,\delta}$ has measure zero for all $j \neq i$.

Here the measure-zero condition on the sets $A^{i,j}_{x,\epsilon,\delta}$ codifies the aforementioned idea that the point $x$ falls closer to $X_i$ than to any other class on every line segment that intersects it. A visualization of Assumption 3.1 is provided in Section B of the Appendix. Now, for points for which Assumptions 2.9 and 3.1 hold, we can prove:

###### Theorem 3.2.

Consider $k$-class classification where the supports $X_i$ are finite and $P_X$ corresponds to the discrete uniform distribution. If a point $x \in X_{\mathrm{mix}}$ satisfies Assumptions 2.9 and 3.1 with respect to a class $i$, then for every $g^* \in \mathcal{F}_{\mathrm{mix}}$, we have that $g^*$ classifies $x$ as class $i$ and that $g^*$ is continuous at $x$.

Proof Sketch. The limit can be shown to exist using the Lebesgue differentiation theorem, and we can bound the limit from below since the sets $A^{i,j}_{x,\epsilon,\delta}$ have measure zero.

Any point $x \in X_i$ is easily seen to satisfy Assumption 3.1 with respect to class $i$, and hence we get Theorem 2.10 as a corollary of Theorem 3.2 as mentioned in Section 2. To use Theorem 3.2 to understand generalization, we make the observation that a point can satisfy Assumptions 2.9 and 3.1 while lying at a considerable distance from the class $X_i$ itself. This distance can be significantly greater than, for example, the margin afforded by the optimal linear separator in a linearly separable dataset.

Experiments. To illustrate this, we consider the two moons dataset (twomoons), which consists of two classes of points supported on semicircles with added Gaussian noise. Our motivation for doing so comes from the work of gradstarv, in which it was noted that neural network models trained on a separated version of the two moons dataset essentially learned a linear separator while ignoring the curvature of the class supports. While gradstarv introduced an explicit regularizer to encourage a nonlinear decision boundary, we expect due to Theorem 3.2 that Mixup training will achieve a similar result without any additional modifications.

To verify this empirically, we train a two-layer neural network with 500 hidden units with and without Mixup, to have a one-to-one comparison with the setting of gradstarv. We use two different values of $\alpha$ for Mixup to capture a wide band of mixing densities. The version of the two moons dataset we use is also identical to the one used in the experiments of gradstarv, and we are grateful to the authors for releasing their code under the MIT license. We do full-batch training with all other training, implementation, and compute details remaining the same as in the previous section. Results are shown in Figure 3.

Our results affirm the observations of gradstarv and previous work (combes) that neural network training dynamics may ignore salient features of the dataset; in this case the "Base Model" learns to differentiate the two classes essentially on the basis of a single coordinate. On the other hand, the models trained using Mixup have highly nonlinear decision boundaries. Further experiments for different class separations and values of $\alpha$ are included in Section F of the Appendix.

### 3.2 When Mixup Training Learns the Same Classifier

The experiments and theory of the previous sections have shown how a Mixup classifier can differ significantly from one learned through standard training. In this subsection, we now consider the opposing question - when is the Mixup classifier the same as the one learned through standard training? The motivation for doing so is the increasing computational cost of model training; knowing when Mixup produces the same result as ERM allows a practitioner to avoid having to try Mixup training.

Towards that end, we consider the case of binary classification using a linear model on high-dimensional Gaussian data, which is a setting that arises naturally when training using Gaussian kernels. Specifically, we consider the dataset $X$ to consist of $n$ points in $\mathbb{R}^d$ distributed according to $\mathcal{N}(0, I_d)$ with $d \gg n$ (to be made more precise shortly). We also consider the mixing distribution $P_f$ to be any symmetric distribution supported on $[0,1]$ (thereby including $\mathrm{Beta}(\alpha, \alpha)$ as a special case). We let the labels of the points in $X$ be $\pm 1$ (so that, for a linear model $\theta$, the sign of $\theta^\top x$ is the classification), and use $X_1$ and $X_{-1}$ to denote the points of the two classes. We will show that in this setting, the optimal Mixup classifier is the same (up to rescaling of $\theta$) as the ERM classifier learned using gradient descent, with high probability. To do so we need some additional definitions.

###### Definition 3.3.

We say $\hat\theta \in \mathbb{R}^d$ is an interpolating solution if there exists $k > 0$ such that

$$\hat\theta^\top x_i = -\hat\theta^\top z_j = k \quad \forall x_i \in X_1, \ \forall z_j \in X_{-1}.$$
###### Definition 3.4.

The maximum margin solution is defined through:

$$\tilde\theta := \operatorname*{arg\,max}_{\|\theta\|_2 = 1} \left\{ \min_{x_i \in X_1, \, z_j \in X_{-1}} \left\{ \theta^\top x_i, \, -\theta^\top z_j \right\} \right\}$$

When the maximum margin solution coincides with an interpolating solution for the dataset $X$ (i.e. all of the points are support vectors), we have that Mixup training leads to learning the max-margin solution (up to rescaling).

###### Theorem 3.5.

If the maximum margin solution for $X$ is also an interpolating solution for $X$, then any $\theta^*$ that lies in the span of the points of $X$ and minimizes the Mixup loss for a symmetric mixing distribution is a rescaling of the maximum margin solution.

Proof Sketch. It can be shown that $\theta^*$ is an interpolating solution using a combination of the strict convexity of $J_{\mathrm{mix}}$ as a function of $\theta$ and the symmetry of the mixing distribution.

###### Remark 3.6.

For every $\theta \in \mathbb{R}^d$, we can decompose it as $\theta = \theta_{\parallel} + \theta_{\perp}$, where $\theta_{\parallel}$ is the projection of $\theta$ onto the subspace spanned by the points of $X$. By definition, $\theta_{\perp}$ is orthogonal to all possible mixings of points in $X$. Hence, $\theta_{\perp}$ affects neither the Mixup loss nor the interpolating property, so for simplicity we may just assume $\theta$ lies in the span of $X$.

To characterize the conditions on under which the maximum margin solution interpolates the data, we use a key result of muthukumar2020classification, restated below. Note that muthukumar2020classification actually provide more settings in their paper, but we constrain ourselves to the one stated below for simplicity.

###### Lemma 3.7.

[Theorem 1 in muthukumar2020classification, Rephrased] Assuming $d = \Omega(n \log n)$, then with high probability the maximum margin solution for $X$ is also an interpolating solution.
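The interpolating property in Definition 3.3 is easy to probe numerically in the high-dimensional regime of Lemma 3.7: for $n$ Gaussian points in dimension $d \gg n$, the minimum-norm solution of $\theta^\top x_i = y_i$ places every point at the same signed margin, exactly as Definition 3.3 requires with $k = 1$. The NumPy sketch below is our own illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 500  # d >> n, the regime of Lemma 3.7
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Minimum-norm theta solving X theta = y; for an underdetermined full-rank
# system, lstsq returns the least-norm interpolator.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

margins = X @ theta  # every point sits exactly at signed margin k = 1
```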

To tie the optimal Mixup classifier back to the classifier learned through standard training, we appeal to the fact that minimizing the empirical cross-entropy of a linear model using gradient descent leads to learning the maximum margin solution on linearly separable data (soudry2018implicit; matus). From this we obtain the desired result of this subsection:

###### Corollary 3.8.

Under the same conditions as Lemma 3.7, the optimal Mixup classifier has the same direction as the classifier learned through minimizing the empirical cross-entropy using gradient descent with high probability.

## 4 Conclusion

The main contribution of our work has been to provide a theoretical framework for analyzing how Mixup training can differ from empirical risk minimization. Our results characterize a practical failure case of Mixup, and also identify conditions under which Mixup provably minimizes the original risk. They further show, in terms of margin, why the generalization of Mixup classifiers can be superior to that of classifiers learned through empirical risk minimization, while again identifying model classes and datasets for which the generalization of a Mixup classifier is no different (with high probability). We also emphasize that the generality of our theoretical framework allows most of our results to hold for any continuous mixing distribution. Our hope is that the tools developed in this work will see application in future works concerned with analyzing the relationship between the benefits obtained from Mixup training and the properties of the training data.

## Acknowledgements

Rong Ge, Muthu Chidambaram, Xiang Wang, and Chenwei Wu are supported in part by NSF Award DMS-2031849, CCF-1704656, CCF-1845171 (CAREER), CCF-1934964 (Tripods), a Sloan Research Fellowship, and a Google Faculty Research Award. Muthu would like to thank Michael Lin for helpful discussions during the early stages of this project.

## Appendix A Review of Definitions and Assumptions

For convenience, we first recall the definitions and assumptions stated throughout the paper below.

$$
\ell_{mix}(g, s, t, \lambda) = \begin{cases} -\log g_i(z_{st}) & s, t \in X_i \\ -\left(\lambda \log g_i(z_{st}) + (1 - \lambda)\log g_j(z_{st})\right) & s \in X_i,\ t \in X_j \end{cases}
$$

$$
J^{i,j}_{mix}(g, P_X, P_f) = \int_{X_i \times X_j \times [0,1]} \ell_{mix}(g, s, t, \lambda)\, d(P_X \times P_X \times P_f)(s, t, \lambda)
$$

$$
J_{mix}(g, P_X, P_f) = \sum_{i=1}^k J^{i,i}_{mix} + 2\sum_{i=1}^k \sum_{j=i+1}^k J^{i,j}_{mix}
$$

$$
A^{i,j}_{x,\epsilon} = \left\{(s, t, \lambda) \in X_i \times X_j \times [0,1] :\ \lambda s + (1 - \lambda)t \in B_\epsilon(x)\right\}
$$

$$
A^{i,j}_{x,\epsilon,\delta} = \left\{(s, t, \lambda) \in X_i \times X_j \times [0, 1 - \delta] :\ \lambda s + (1 - \lambda)t \in B_\epsilon(x)\right\}
$$

$$
X_{mix} = \left\{x \in \mathbb{R}^n :\ \textstyle\bigcup_{i,j} A^{i,j}_{x,\epsilon} \text{ has positive measure for every } \epsilon > 0\right\}
$$

$$
\xi^{i,j}_{x,\epsilon} = \int_{A^{i,j}_{x,\epsilon}} d(P_X \times P_X \times P_f)(s, t, \lambda)
$$

$$
\xi^{i,j}_{x,\epsilon,\lambda} = \int_{A^{i,j}_{x,\epsilon}} \lambda\, d(P_X \times P_X \times P_f)(s, t, \lambda)
$$
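To make these definitions concrete, the following sketch (our own illustration; the dataset and the fixed linear model are hypothetical) estimates $J_{mix}$ by Monte Carlo for a two-class problem with $\lambda$ drawn uniformly, which matches how the Mixup loss is sampled in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical supports: class 1 concentrated at x0 = 0, class 2 at x0 = 3.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def g(z):
    """Softmax output of a fixed linear model (chosen by hand, not learned)."""
    logits = np.stack([-z[:, 0], z[:, 0] - 3.0], axis=1)
    logits -= logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Monte Carlo estimate of J_mix: sample (s, t) uniformly, lam ~ Uniform[0, 1],
# and apply ell_mix (which reduces to -log g_i when s and t share class i).
N = 50_000
si = rng.integers(0, len(X), N)
ti = rng.integers(0, len(X), N)
lam = rng.uniform(0.0, 1.0, N)
z = lam[:, None] * X[si] + (1.0 - lam)[:, None] * X[ti]
p = g(z)
ell = -(lam * np.log(p[np.arange(N), labels[si]])
        + (1.0 - lam) * np.log(p[np.arange(N), labels[ti]]))
print(ell.mean())  # stays below log 2, the loss of the uniform predictor
```

Note that for same-class pairs the two log terms coincide, recovering the first case of the definition of $\ell_{mix}$.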

See 2.1 See 2.6 See 2.9 See 3.1 See 3.3 See 3.4

## Appendix B Visualizations of Definitions and Assumptions

Due to the technical nature of the definitions and assumptions above, we provide several visualizations in Figures 4 to 7 to aid the reader’s intuition for our main results.

## Appendix C Full Proofs for Section 2

We now prove all results found in Section 2 of the main body of the paper in the order that they appear.

### C.1 Proofs for Propositions, Lemmas, and Theorems 2.2–2.10

See 2.2

###### Proof.

Intuitively, the idea is that when and differ at a point , there must be a neighborhood of for which the constant function that takes value has lower loss than due to the continuity constraint on . We formalize this below.

Let be an arbitrary function in and let be a continuous function in . Consider a point such that the limit in Definition 2.1 exists and that (if such an did not exist, we would be done). Now let and be the constant functions whose values are and (respectively) on all of , and further let (this is shown to be a single value in the proof of Lemma 2.3 below).

Since , we have that there exists a such that for we have . From this we get that there exists depending only on and such that:

$$
J_{mix}(\theta_g)\big|_{B_\delta(x)} - J_{mix}(\theta_h)\big|_{B_\delta(x)} \ge J_{mix}(\epsilon')\big|_{B_\delta(x)}
$$

Where is an abuse of notation indicating the result of replacing all terms in the integrands of with (this corresponds to rescaling all the terms defined in Section 2.2 by ). Now by the continuity of (and thus the continuity of ), we may choose such that . This implies , and since was arbitrary (within the initially mentioned constraints, as outside of them is unconstrained) we have the desired result. ∎

See 2.3

###### Proof.

Firstly, the condition that is necessary, since if has measure zero the LHS of Equation 1 is not even defined.

Now we simply take as in Definition 2.1 and show that is well-defined.

Since the is over constant functions, we may unpack the definition of and pull each of the terms out of the integrands and rewrite them simply as . From this we obtain that:

$$
J_{mix}(g, P_X, P_f)\big|_{B_\epsilon(x)} = -\sum_{i=1}^k \left(\xi^{i,i}_{x,\epsilon} + 2\left(\sum_{j=i+1}^k \xi^{i,j}_{x,\epsilon,\lambda} + \sum_{j=1}^{i-1}\left(\xi^{j,i}_{x,\epsilon} - \xi^{j,i}_{x,\epsilon,\lambda}\right)\right)\right)\log \theta_i
$$

Where the first part of the summation above is from mixing with itself, and the second part of the summation corresponds to the and components of mixing with . Discarding the terms for which the coefficients above are 0 (the associated terms are taken to be 0, as anything else is suboptimal due to the summation constraint), we are left with a linear combination of , where the set is a subset of . We obviously cannot minimize by choosing any of the to be 0, and as a consequence we also cannot choose any of the to be 1 since they are constrained to add to 1.

As such, we may consider each of the for some . With this consideration in mind, is strictly convex in terms of the , since the Hessian of a linear combination of will be diagonal with positive entries when the arguments of the log terms are strictly greater than 0. Thus, as is compact, there exists a unique solution to , justifying the use of equality. This unique solution is easily computed via Lagrange multipliers, and the solution is given in Equation 1.
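The Lagrange-multiplier computation here is the standard one for minimizing a weighted sum of negative logs over the simplex: the minimizer of $-\sum_i c_i \log \theta_i$ subject to $\sum_i \theta_i = 1$, $\theta_i > 0$, is $\theta_i = c_i / \sum_j c_j$. A quick numerical check of this closed form (our own sketch, with arbitrary positive coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([3.0, 1.0, 0.5, 2.5])   # arbitrary positive coefficients
theta_star = c / c.sum()             # claimed closed-form minimizer

def objective(theta):
    return -(c * np.log(theta)).sum()

# Compare against random points on the simplex: the closed form should win
# everywhere, by strict convexity of the objective.
samples = rng.dirichlet(np.ones(len(c)), size=10_000)
assert all(objective(theta_star) <= objective(s) + 1e-12 for s in samples)
print(theta_star, objective(theta_star))
```

The strict convexity noted in the proof guarantees this is the unique minimizer, so any other simplex point attains a strictly larger objective value.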

We have thus far defined on , and it remains to check that this construction of corresponds to the restriction of a continuous function from . To do so, we first note that is closed, since any limit point of must necessarily either be contained in one of the supports (due to compactness) or on a line segment between/within two supports (else it has positive distance from ), and every -neighborhood of contains points for which has positive measure for every (immediately implying that has positive measure).

Now we can check that is continuous on as follows. Consider a sequence ; we wish to show that . Since the codomain of is compact, the sequence must have a limit point . Furthermore, since each is the unique minimizer of , we must have that is the unique minimizer of , implying that . We have thus established the continuity of on , and since is closed (in fact, compact), we have by the Tietze Extension Theorem that can be extended to a continuous function on all of .

Finally, if exists, then it is by definition for any and therefore for any . ∎

See 2.5

###### Proof.

Considering the point , we note that only and are non-zero for all . Furthermore, we have that since the -measure of is and the sets have -measure . From this we have that the limit is the Mixup-optimal value at .

On the other hand, every other point on the line segment connecting the points in will have an -neighborhood disjoint from the line segment between the points in (and vice versa), so we will have for all . This implies that (and an identical result for points on the line segment), so we have that the Mixup-optimal classifier is discontinuous at as desired. ∎

See 2.7

###### Proof.

Fix a classifier as defined in Lemma 2.3 on . Now for we have by the fact that is strictly subgaussian that , where is the variance of . As a result, we can choose to guarantee that and therefore that .

Now we have by Lemma 2.3 that:

$$
h^1_\epsilon(1) = \frac{\xi^{1,1}_{1,\epsilon} + 2\xi^{1,2}_{1,\epsilon,\lambda}}{\xi^{1,1}_{1,\epsilon} + 2\xi^{1,2}_{1,\epsilon} + \xi^{2,2}_{1,\epsilon}} = \frac{\xi^{1,1}_{1,\epsilon} + \xi^{1,2}_{1,\epsilon}}{\xi^{1,1}_{1,\epsilon} + 2\xi^{1,2}_{1,\epsilon} + \xi^{2,2}_{1,\epsilon}} > \frac{1}{2}
$$

Where the second equality follows from the fact that $2\xi^{1,2}_{1,\epsilon,\lambda} = \xi^{1,2}_{1,\epsilon}$ by the symmetry of $P_f$. Thus, we have shown that will classify the point 1 as class 1 despite it belonging to class 2. ∎

See 2.8

###### Proof.

The full proof is not much more than the proof sketch in the main body. For , we have that and as for every , while and . As a result, we have as desired. ∎

See 2.10

###### Proof.

Obtained as a corollary of Theorem 3.2. ∎

### C.2 Proof for Theorem 2.11

Prior to proving Theorem 2.11, we first introduce some additional notation as well as a lemma that will be necessary for the proof.

Notation: Throughout the following corresponds to the number of data points being considered (as in Theorem 2.11), and as a shorthand we use to indicate . We use to denote the -th basis vector in , and use to denote the -th basis vector in . Let be the “Mixup matrix” where each row has two s and the other entries are , representing a mixture of the two data points whose associated indices have a . The rows enumerate all the possible mixings of the data points. In this way, is uniquely defined up to a permutation of rows. We can pick any representative as our matrix, and prove the following lemma.

###### Lemma C.1.

Assume , and is a permutation matrix. If is not a permutation of the columns of , then the rank of is larger than .

Using this lemma, we can prove Theorem 2.11. See 2.11

###### Proof.

We only need to show that the set of original points that do not allow unique recovery has measure zero. It suffices to show this for the first entry (dimension) of the ’s, as the result then follows for all dimensions. For convenience, we group the first entries of the data points into a vector and similarly obtain from . Suppose is not a permutation of but that they have the same set of Mixup points. We only need to show that the set of such has measure zero.

Suppose is a Mixup matrix and is a permutation matrix. Suppose is not a permutation of the columns in . We would need , which is equivalent to . According to Lemma C.1, we know the rank of is at least , which implies that the solution set of is at most

So fixing and , the set of non-recoverable has measure zero. There are only a finite number of combinations of and . Thus, considering all of these and , the full set of non-recoverable still has measure zero. ∎
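The linear-algebra step of the argument above can be illustrated numerically. The sketch below is our own (assuming the midpoint convention suggested by the surrounding notation: each row of the Mixup matrix has two entries equal to 1/2 and zeros elsewhere). It builds the matrix for all pairs, verifies full column rank, and recovers the original first coordinates from the mixed values by least squares:

```python
import itertools
import numpy as np

def mixup_matrix(n):
    """Rows enumerate all unordered pairs {i, j}, i < j; each row is (e_i + e_j)/2."""
    rows = []
    for i, j in itertools.combinations(range(n), 2):
        r = np.zeros(n)
        r[i] = r[j] = 0.5
        rows.append(r)
    return np.array(rows)

n = 5
Z = mixup_matrix(n)
assert np.linalg.matrix_rank(Z) == n  # full column rank for n >= 3

rng = np.random.default_rng(0)
x = rng.normal(size=n)                # first coordinates of the original points
b = Z @ x                             # observed Mixup (midpoint) values
x_rec, *_ = np.linalg.lstsq(Z, b, rcond=None)
assert np.allclose(x_rec, x)
print("recovered:", np.round(x_rec, 6))
```

This only exercises the uniqueness of the solution to the linear system for a fixed row order; the combinatorial part of the theorem, handling unknown permutations of the Mixup points, is what Lemma C.1 and the measure-zero argument address.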

#### C.2.1 Proof of Supporting Lemma

See C.1

###### Proof.

First, we show that both the ranks of and are . For all , define , and define . Note that these vectors are all rows of . The first vectors are linearly independent because each has a unique direction that is not a linear combination of any of the other vectors in . Moreover, we know that the span of is a subspace of , where is the -th entry of . Therefore, does not lie in the span of