1 Introduction
Data augmentation (Simard et al., 1998) has played an important role in training deep neural networks, preventing overfitting and improving generalization performance. Recently, sample-mixed data augmentation (Zhang et al., 2018; Tokozume et al., 2018a, b; Verma et al., 2018; Guo et al., 2019; Inoue, 2018) has attracted attention, where two samples are combined linearly to generate augmented samples. The effectiveness of this approach has been shown especially for image classification and sound recognition tasks. Many traditional data augmentations (e.g., slight deformations for image data (Taylor and Nitschke, 2017)) rely on specific properties of the target domain, such as invariance to certain transformations. In contrast, sample-mix augmentation can be applied to any dataset owing to its simplicity. However, its effectiveness depends on the structure of the data. There is a basic difference between original clean samples and augmented samples: augmented samples are not drawn directly from the underlying distribution. Thus a classifier trained with sample-mix augmentation may learn a biased decision boundary. In fact, we can easily construct a distribution where sample-mix deteriorates the classification performance (see Fig. 1).
To overcome this problem, we propose an alternative framework called Data Interpolating Prediction (DIP), where the sample-mixing process is encapsulated in the classifier. More specifically, we regard sample-mix as a stochastic perturbation inside a function and obtain the prediction by computing the expected value over the random variables. Note that we apply sample-mix to both training and test samples in our framework. This procedure is similar to existing methods such as Monte Carlo dropout
(Gal and Ghahramani, 2016) and Augmented PAttern Classification (Sato et al., 2015). Furthermore, we derive a generalization error bound for our algorithm via Rademacher complexity and find that sample-mix helps to reduce the Rademacher complexity of the hypothesis class. Through experiments on benchmark image datasets, we confirm that sample-mix reduces the generalization gap and demonstrate the effectiveness of the proposed method.

2 Proposed Method
2.1 Data Interpolating Prediction
Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a $d$-dimensional input space and $\mathcal{Y} \subseteq \{0, 1\}^c$ be a $c$-dimensional one-hot label space. Denote a classifier by $f\colon \mathcal{X} \to \mathbb{R}^c$. The standard goal of classification problems is to obtain a classifier that minimizes the classification risk defined as

$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}[\ell(f(x), y)],$ (1)

where $p(x, y)$ is the joint density of an underlying distribution and $\ell$ is a loss function. Here, we consider a function in which the sample-mixing process is encapsulated as a random variable. We describe sample-mix between $x$ and $x'$ with a function $m_\lambda$ as follows:

$m_\lambda(x, x') = \lambda x + (1 - \lambda) x',$ (2)

where $\lambda \in [0, 1]$ is a parameter that controls the mixing ratio between the two samples.
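As a concrete illustration, the mixing operation of equation 2 is a one-line function (a NumPy sketch; the name `mix` is ours):

```python
import numpy as np

def mix(x, x_prime, lam):
    """Linear sample-mix of equation 2: lam * x + (1 - lam) * x_prime."""
    return lam * x + (1.0 - lam) * x_prime

# mixing two toy samples with ratio lam = 0.7
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mixed = mix(a, b, 0.7)  # → array([0.7, 0.3])
```

Setting `lam = 1.0` recovers the original sample `a`, which matches the remark below that the mixed classifier reduces to the base one for a degenerate mixing distribution.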
Let $\{(x_i, y_i)\}_{i=1}^{n}$ be i.i.d. samples drawn from $p(x, y)$, $\hat{p}(x)$ be the density of the empirical distribution of $\{x_i\}_{i=1}^{n}$, and $g\colon \mathcal{X} \to \mathbb{R}^c$ be a specified classifier. Using the function $m_\lambda$, we redefine the deterministic function $f$ by

$f(x) = \mathbb{E}_{\lambda \sim q(\lambda)}\, \mathbb{E}_{x' \sim \hat{p}(x')}\big[g(m_\lambda(x, x'))\big],$ (3)

where $q(\lambda)$ is some density function over $[0, 1]$. Note that the function $f$ is equivalent to the base function $g$ when we set $q(\lambda) = \delta(\lambda - 1)$, with $\delta$ the Dirac delta.
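In practice the expectation in equation 3 is approximated by Monte Carlo sampling (as done in our experiments). A minimal sketch, assuming a Beta prior for $q(\lambda)$ and uniform draws of $x'$ from the training set (the function name `dip_predict` and its signature are our own):

```python
import numpy as np

def dip_predict(g, x, train_xs, a, b, n_mc=500, seed=0):
    """Estimate f(x) = E_{lam, x'}[g(lam * x + (1 - lam) * x')]
    by averaging n_mc Monte Carlo draws (equation 3)."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(np.asarray(g(x), dtype=float))
    for _ in range(n_mc):
        lam = rng.beta(a, b)                             # lam ~ Beta(a, b)
        x_prime = train_xs[rng.integers(len(train_xs))]  # x' ~ empirical dist.
        total += g(lam * x + (1.0 - lam) * x_prime)
    return total / n_mc
```

As a sanity check, if every training sample equals `x`, each mixed input is `x` itself and the estimate reduces to `g(x)` exactly.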
2.2 Practical Optimization
Since the expected value in equation 3 is usually intractable, we train the classifier by minimizing an upper bound of the empirical version of $R(f)$,

$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$ (4)

By applying Jensen's inequality (with $\ell$ convex in its first argument), we have

$\hat{R}(f) \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\{\lambda_k\}_{k=1}^{K},\, \{x'_k\}_{k=1}^{K}} \Big[ \ell\Big( \frac{1}{K} \sum_{k=1}^{K} g\big(m_{\lambda_k}(x_i, x'_k)\big),\ y_i \Big) \Big],$ (5)

where $K$ is a positive integer that represents the number of samples used to estimate the expectation. We denote the RHS of equation 5 by $\hat{R}_K(f)$. The tightness of the above bound is related to the value of $K$ as

$\hat{R}(f) \le \hat{R}_{K+1}(f) \le \hat{R}_K(f).$ (6)

We can prove this in a similar manner to Burda et al. (2016). Since $\hat{R}_K(f) \to \hat{R}(f)$ as $K \to \infty$, a larger $K$ gives a more precise risk estimate.
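One stochastic draw of the $K$-sample objective on a single example can be sketched as follows (our own names; `loss` is any loss convex in its first argument, and we again assume a Beta prior and uniform draws of partners from the training set):

```python
import numpy as np

def surrogate_loss(g, loss, x, y, train_xs, a, b, K, seed=0):
    """One draw of the K-sample objective of equation 5: mix x with K
    random training samples, average the K predictions, apply the loss."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(K):
        lam = rng.beta(a, b)
        x_prime = train_xs[rng.integers(len(train_xs))]
        preds.append(g(lam * x + (1.0 - lam) * x_prime))
    return loss(np.mean(preds, axis=0), y)
```

The expectation of this quantity upper-bounds the loss of the exact prediction by Jensen's inequality, and the bound tightens as $K$ grows, mirroring equation 6.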
2.3 Label-Mixing or Label-Preserving
There are two types of sample-mix data augmentation, namely, the label-mixing approach and the label-preserving approach. We can show that the objective functions of both approaches are consistent under some conditions.
Proposition 1.
Suppose that $\ell$ is a linear function with respect to the second argument and $q(\lambda) = \mathrm{Beta}(\alpha, \alpha)$ for some constant $\alpha > 0$. Then we have the following equation:

$\mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha, \alpha)}\big[\ell\big(g(m_\lambda(x, x')),\ \lambda y + (1 - \lambda) y'\big)\big] = \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha + 1, \alpha)}\big[\ell\big(g(m_\lambda(x, x')),\ y\big)\big],$ (7)

where both sides are also taken in expectation over the pairs $(x, y)$ and $(x', y')$ drawn i.i.d. from the data distribution.
The proof of this proposition can be found in the blog post by inFERENCe (https://www.inference.vc/mixup-data-dependent-data-augmentation/). Many label-mixing approaches (Zhang et al., 2018; Verma et al., 2018) use a beta distribution as the prior of $\lambda$. Thus, the optimization of such approaches can be considered a special case of our framework, because the empirical version of the RHS in equation 7 corresponds to $\hat{R}_K(f)$ with $K$ set to $1$ and $q(\lambda) = \mathrm{Beta}(\alpha + 1, \alpha)$. We experimentally investigate the behaviors of both label-mixing and label-preserving training in Sec. 3.
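The linearity condition in Proposition 1 holds, for example, for cross-entropy, which is linear in the label argument: the loss of a mixed label equals the mixture of the losses. A tiny numeric check (our own illustration, with an arbitrary fixed prediction):

```python
import numpy as np

def xent(p, y):
    """Cross-entropy -sum(y * log p); linear in the label argument y."""
    return float(-np.sum(y * np.log(p)))

p = np.array([0.7, 0.3])                          # a fixed model output
y, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lam = 0.3

lhs = xent(p, lam * y + (1 - lam) * y2)               # label-mixing loss
rhs = lam * xent(p, y) + (1 - lam) * xent(p, y2)      # mixture of losses
# lhs == rhs: mixed labels and mixed losses give the same objective
```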
2.4 Generalization Bound via Rademacher Complexity
In this section, we present a generalization bound for a function equipped with sample-mix. Let $\mathcal{F}$ be the function class of the specified model and $\hat{\mathfrak{R}}_S(\mathcal{F})$ be the empirical Rademacher complexity of $\mathcal{F}$ (Definition 1 in Appendix A). Then we have the following inequality.
Proposition 2.
Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be i.i.d. random variables drawn from an underlying distribution with the density $p(x, y)$. Suppose that $\ell$ is bounded by some constant $C > 0$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following holds for all $f \in \mathcal{F}$:

$R(f) \le \hat{R}(f) + 2\, \hat{\mathfrak{R}}_S(\ell \circ \mathcal{F}) + 3C \sqrt{\frac{\log(2/\delta)}{2n}}.$ (8)
The proof of this proposition can be found in textbooks such as Mohri et al. (2012). Now we analyze the Rademacher complexity of the proposed function class. Let $\mathcal{G}$ be the specified base function class and let $\mathcal{F}$ be the class of functions $f$ defined from $g \in \mathcal{G}$ as in equation 3. Suppose that the empirical Rademacher complexity of the class of mixed functions can be bounded with some constant $c > 0$ as follows:

$\hat{\mathfrak{R}}_S\big(\{x \mapsto g(m_\lambda(x, x')) \mid g \in \mathcal{G}\}\big) \le \lambda c\, \hat{\mathfrak{R}}_S(\mathcal{G}).$ (9)

We can prove that this assumption holds for neural network models in a similar manner to Gao and Zhou (2016). Then we have the following theorem.
Theorem 1.
Under the assumption in equation 9, we have

$\hat{\mathfrak{R}}_S(\mathcal{F}) \le c\, \mathbb{E}_{\lambda \sim q(\lambda)}[\lambda]\, \hat{\mathfrak{R}}_S(\mathcal{G}).$ (10)

Note that $\hat{\mathfrak{R}}_S(\mathcal{F}) \le \mathbb{E}_{\lambda, x'}\big[\hat{\mathfrak{R}}_S(\{x \mapsto g(m_\lambda(x, x'))\})\big]$ always holds from Jensen's inequality, and $\mathbb{E}_{\lambda \sim q}[\lambda] \le 1$ since $\lambda \in [0, 1]$. Thus, sample-mix can reduce the empirical Rademacher complexity of the function class, which reduces the generalization gap (i.e., $R(f) - \hat{R}(f)$). For example, when $q(\lambda) = \mathrm{Beta}(\alpha + 1, \alpha)$ in equation 3, we have $\mathbb{E}_q[\lambda] = \frac{\alpha + 1}{2\alpha + 1}$, which is a monotonically decreasing function with respect to $\alpha$. Hence, we claim that a larger $\alpha$ can be effective for a smaller generalization gap. We experimentally analyze the behavior with respect to $\alpha$ in Sec. 3.
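If we read the label-preserving prior as $\mathrm{Beta}(\alpha + 1, \alpha)$ (our reading from Sec. 2.3), its mean $(\alpha + 1)/(2\alpha + 1)$ can be computed directly; a quick check that it decreases in $\alpha$ toward $1/2$:

```python
def mean_lambda(alpha):
    """Mean of Beta(alpha + 1, alpha): (alpha + 1) / (2 * alpha + 1)."""
    return (alpha + 1) / (2 * alpha + 1)

means = [mean_lambda(a) for a in (0.5, 1.0, 2.0, 8.0)]
# strictly decreasing, approaching 1/2 as alpha grows
decreasing = all(m1 > m2 for m1, m2 in zip(means, means[1:]))
```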
3 Experiments on CIFAR Datasets
In this section, we analyze the behavior of our proposed framework through experiments on the CIFAR-10/100 datasets (Krizhevsky and Hinton, 2009). We evaluated the classification performance with two neural network architectures, VGG16 (Simonyan and Zisserman, 2015) and PreAct ResNet-18 (He et al., 2016). The details of the experimental setting are described in Appendix B. For our proposed DIP, the output after the final fully-connected layer is used as $g$ in equation 3, and the expected output is approximated by 500 Monte Carlo samples in the test stage. As discussed in Sec. 2.3, there are two possible styles of optimization. We evaluated both label-preserving and label-mixing training, setting $q(\lambda) = \mathrm{Beta}(\alpha + 1, \alpha)$ for label-preserving sample-mix training and $q(\lambda) = \mathrm{Beta}(\alpha, \alpha)$ for label-mixing sample-mix training. Note that the prediction was computed with $q(\lambda) = \mathrm{Beta}(\alpha + 1, \alpha)$ in the test stage even when the label-mixing style was used for training.
As two baseline methods, we trained a classifier with (i) standard training (without sample-mix) and (ii) Mixup (Zhang et al., 2018) training (label-mixing style). To evaluate the performance of these methods, we computed the prediction only from the original clean samples. We used $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ for Mixup training.
We show the classification performance in Table 1 and the generalization gap (i.e., the gap between train and test performance) in Fig. 2. Note that magnified versions of the experimental results are deferred to Appendix C. As can be seen in Table 1, our proposed method tends to outperform the existing Mixup approach.
Remarks: For all approaches, including existing Mixup, a larger $\alpha$ leads to a smaller generalization gap, which is consistent with the discussion in Sec. 2.4. In addition, we found that a larger $K$ is likely to enlarge the gap and deteriorate the performance on test samples. This might be because the variance of the empirical loss function computed by $K$-times sampling plays the role of a regularizer.

Model  Method  CIFAR-10  CIFAR-100

VGG16  without sample-mix  6.78 (0.057)  28.68 (0.169) 
Mixup () (Zhang et al., 2018)  5.81 (0.031)  26.58 (0.044)  
DIP (, , label-mixing)  5.74 (0.100)  25.48 (0.034)  
DIP (, , label-preserving)  6.05 (0.015)  26.57 (0.155)  
DIP (, )  5.52 (0.041)  26.73 (0.054)  
PreAct ResNet-18  without sample-mix  5.68 (0.015)  25.25 (0.272) 
Mixup ()  4.46 (0.082)  22.58 (0.074)  
DIP (, , label-mixing)  4.36 (0.079)  21.97 (0.052)  
DIP (, , label-preserving)  4.83 (0.125)  23.33 (0.052)  
DIP (, )  4.40 (0.036)  22.04 (0.067) 
Table 1: Mean misclassification rate (%) and standard error over three trials on the CIFAR-10/100 datasets.
4 Conclusion
In this paper, we proposed a novel framework called DIP, in which sample-mix is encapsulated in the hypothesis class of a classifier. We theoretically evaluated the generalization error bound via Rademacher complexity and showed that sample-mix is effective in reducing the generalization gap. Through experiments on CIFAR datasets, we demonstrated that our approach can outperform existing Mixup data augmentation.
References

Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.

Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

W. Gao and Z.-H. Zhou. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 59(7):072104, 2016.

H. Guo, Y. Mao, and R. Zhang. Mixup as locally linear out-of-manifold regularization. In AAAI, 2019.

K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

I. Sato, H. Nishimura, and K. Yokoi. APAC: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.

P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade. 1998.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

L. Taylor and G. Nitschke. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020, 2017.

Y. Tokozume, Y. Ushiku, and T. Harada. Between-class learning for image classification. In CVPR, 2018a.

Y. Tokozume, Y. Ushiku, and T. Harada. Learning from between-class examples for deep sound recognition. In ICLR, 2018b.

V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold mixup: Learning better representations by interpolating hidden states. arXiv preprint arXiv:1806.05236, 2018.

H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
Appendix A Proof of Theorem 1
In this section, we give a complete proof of Theorem 1. The empirical Rademacher complexity is defined as follows.
Definition 1.
Let $n$ be a positive integer, $\{x_i\}_{i=1}^{n}$ be i.i.d. random variables drawn from $p(x)$, $\mathcal{G}$ be a class of measurable functions, and $\sigma_1, \ldots, \sigma_n$ be Rademacher random variables, namely, random variables taking $+1$ and $-1$ with equal probabilities. Then the empirical Rademacher complexity of $\mathcal{G}$ is defined as

$\hat{\mathfrak{R}}_S(\mathcal{G}) = \mathbb{E}_{\sigma}\Big[\sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(x_i)\Big].$ (11)
We assume that $\ell$ is an $L$-Lipschitz function with respect to the first argument. Here we have the following useful lemma; its proof can be found in Mohri et al. (2012).
Lemma 1 (Talagrand’s lemma).
Let $\phi\colon \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function. Then for any hypothesis set $\mathcal{H}$ of real-valued functions, the following inequality holds:

$\hat{\mathfrak{R}}_S(\phi \circ \mathcal{H}) \le L\, \hat{\mathfrak{R}}_S(\mathcal{H}).$ (12)
From this lemma, we have

$\hat{\mathfrak{R}}_S(\ell \circ \mathcal{F}) \le L\, \hat{\mathfrak{R}}_S(\mathcal{F}).$ (13)
Let $\mathcal{G}_\lambda = \{x \mapsto g(m_\lambda(x, x')) \mid g \in \mathcal{G}\}$. In equation 9, we assume that $\hat{\mathfrak{R}}_S(\mathcal{G}_\lambda) \le \lambda c\, \hat{\mathfrak{R}}_S(\mathcal{G})$. Now we can bound $\hat{\mathfrak{R}}_S(\mathcal{F})$ as follows:

$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, \mathbb{E}_{\lambda, x'}\big[g(m_\lambda(x_i, x'))\big]\Big] \le \mathbb{E}_{\lambda, x'}\, \mathbb{E}_{\sigma}\Big[\sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, g(m_\lambda(x_i, x'))\Big] \le \mathbb{E}_{\lambda}\big[\lambda c\, \hat{\mathfrak{R}}_S(\mathcal{G})\big] = c\, \mathbb{E}_{\lambda \sim q}[\lambda]\, \hat{\mathfrak{R}}_S(\mathcal{G}),$

where the first inequality follows from Jensen's inequality (the supremum of an expectation is at most the expectation of the supremum). By combining the above result and equation 13, we complete the proof of the theorem. ∎
Appendix B Details of Experimental Setting
In this section, we describe the details of training for experiments in Section 3.
B.1 Training
VGG16 (Simonyan and Zisserman, 2015) and PreAct ResNet-18 (He et al., 2016) were used for the experiments. We did not apply Dropout, following Mixup (Zhang et al., 2018). For all experiments, we trained each neural network for 200 epochs. The learning rate started from a fixed initial value and was decayed at epochs 100 and 150. We applied standard augmentation such as cropping and flipping, and used a fixed mini-batch size. We set $q(\lambda) = \mathrm{Beta}(\alpha + 1, \alpha)$ for label-preserving sample-mix training and $q(\lambda) = \mathrm{Beta}(\alpha, \alpha)$ for label-mixing sample-mix training. $\lambda$ was generated for each sample in the mini-batch, and $x'$ was obtained by a permutation of the samples in the mini-batch.

B.2 Prediction
For the standard method without sample-mix and the Mixup method, we predicted the labels of test samples from the original clean samples. For the proposed method, we predicted labels from the expectation over mixed samples computed by Monte Carlo approximation. In the same manner as in the training process, we sampled $\lambda$ and $x'$ 500 times and averaged the outputs to obtain the final prediction. In the evaluation stage, all data augmentation except for sample-mix was turned off.
Appendix C Magnified Versions of Experimental Results
In this section, we present the magnified versions of the experimental results in Section 3.
CIFAR-10  CIFAR-100  

Model  Method  Train Acc.  Test Acc.  Train Acc.  Test Acc. 
VGG16  without mix  0.00 (0.000)  6.78 (0.057)  0.03 (0.003)  28.68 (0.169) 
Mixup ()  0.05 (0.007)  5.81 (0.031)  0.27 (0.006)  26.58 (0.044)  
Mixup ()  0.26 (0.029)  5.73 (0.042)  1.77 (0.108)  26.34 (0.225)  
DIP (, , label-mixing)  0.13 (0.000)  5.74 (0.100)  0.48 (0.012)  25.48 (0.034)  
DIP (, , label-mixing)  0.72 (0.035)  5.85 (0.015)  3.08 (0.147)  25.45 (0.179)  
DIP (, , label-preserving)  0.47 (0.012)  6.05 (0.015)  1.26 (0.072)  26.57 (0.155)  
DIP (, , label-preserving)  2.08 (0.026)  6.81 (0.046)  7.38 (0.152)  27.73 (0.140)  
DIP (, )  0.02 (0.003)  5.57 (0.093)  0.12 (0.006)  25.87 (0.200)  
DIP (, )  0.30 (0.015)  5.63 (0.032)  1.15 (0.038)  25.72 (0.042)  
DIP (, )  0.00 (0.000)  5.85 (0.041)  0.04 (0.003)  27.20 (0.067)  
DIP (, )  0.01 (0.006)  5.52 (0.041)  0.10 (0.009)  26.73 (0.054)  
PreAct ResNet-18  without mix  0.00 (0.000)  5.68 (0.015)  0.02 (0.000)  25.25 (0.272) 
Mixup ()  0.02 (0.004)  4.46 (0.082)  0.09 (0.006)  22.58 (0.074)  
Mixup ()  0.18 (0.013)  4.32 (0.098)  0.50 (0.027)  22.87 (0.100)  
DIP (, , label-mixing)  0.09 (0.006)  4.36 (0.079)  0.26 (0.009)  21.97 (0.052)  
DIP (, , label-mixing)  0.51 (0.013)  4.66 (0.125)  1.43 (0.029)  22.31 (0.127)  
DIP (, , label-preserving)  0.40 (0.032)  4.83 (0.125)  0.78 (0.003)  23.33 (0.052)  
DIP (, , label-preserving)  1.74 (0.022)  5.84 (0.059)  3.87 (0.023)  23.75 (0.156)  
DIP (, )  0.02 (0.000)  4.50 (0.116)  0.09 (0.003)  21.85 (0.231)  
DIP (, )  0.27 (0.010)  4.75 (0.110)  0.57 (0.003)  21.94 (0.197)  
DIP (, )  0.00 (0.000)  4.75 (0.046)  0.03 (0.000)  22.37 (0.136)  
DIP (, )  0.01 (0.003)  4.40 (0.036)  0.08 (0.007)  22.04 (0.067) 