Implicit Semantic Data Augmentation for Deep Networks

09/26/2019 ∙ by Yulin Wang, et al. ∙ Tsinghua University

In this paper, we propose a novel implicit semantic data augmentation (ISDA) approach to complement traditional augmentation techniques like flipping, translation or rotation. Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., adding sunglasses or changing backgrounds. As a consequence, translating training samples along many semantic directions in the feature space can effectively augment the dataset to improve generalization. To implement this idea effectively and efficiently, we first perform an online estimate of the covariance matrix of deep features for each class, which captures the intra-class semantic variations. Then random vectors are drawn from a zero-mean normal distribution with the estimated covariance to augment the training data in that class. Importantly, instead of augmenting the samples explicitly, we can directly minimize an upper bound of the expected cross-entropy (CE) loss on the augmented training set, leading to a highly efficient algorithm. In fact, we show that the proposed ISDA amounts to minimizing a novel robust CE loss, which adds negligible extra computational cost to a normal training procedure. Although simple, ISDA consistently improves the generalization performance of popular deep models (ResNets and DenseNets) on a variety of datasets, e.g., CIFAR-10, CIFAR-100 and ImageNet. Code for reproducing our results is available at




1 Introduction

Data augmentation is an effective technique to alleviate the overfitting problem in training deep networks Krizhevsky and Hinton (2009); Krizhevsky et al. (2012); Simonyan and Zisserman (2015); He et al. (2016); Huang et al. (2017). In the context of image recognition, this usually corresponds to applying content-preserving transformations, e.g., cropping, horizontal mirroring, rotation and color jittering, to the input samples. Although effective, these augmentation techniques are not capable of performing semantic transformations, such as changing the background of an object or the texture of a foreground object. Recent work has shown that data augmentation can be more powerful if (class identity preserving) semantic transformations are allowed Ratner et al. (2017); Bowles et al. (2018); Antoniou et al. (2018). For example, by training a generative adversarial network (GAN) for each class in the training set, one could then draw an unlimited number of samples from the generator. Unfortunately, this procedure is computationally intensive because training generative models and running inference with them to obtain augmented samples are both nontrivial tasks. Moreover, due to the extra augmented data, the training procedure is also likely to be prolonged.

In this paper, we propose an implicit semantic data augmentation (ISDA) algorithm for training deep image recognition networks. The ISDA is highly efficient as it does not require training/inferring auxiliary networks or explicitly generating extra training samples. Our approach is motivated by the intriguing observation made by recent work showing that the features deep in a network are usually linearized Upchurch et al. (2017); Bengio et al. (2013). Specifically, there exist many semantic directions in the deep feature space, such that translating a data sample in the feature space along one of these directions results in a feature representation corresponding to another sample with the same class identity but different semantics. For example, a certain direction corresponds to the semantic translation of "make-bespectacled". When the feature of a person, who does not wear glasses, is translated along this direction, the new feature may correspond to the same person but with glasses (The new image can be explicitly reconstructed using proper algorithms as shown in Upchurch et al. (2017)). Therefore, by searching for many such semantic directions, we can effectively augment the training set in a way complementary to traditional data augmenting techniques.

However, explicitly finding semantic directions is not a trivial task, as it usually requires extensive human annotations Upchurch et al. (2017). In contrast, sampling directions randomly is efficient but may result in meaningless transformations. For example, it makes no sense to apply the "make-bespectacled" transformation to the "car" class. In this paper, we adopt a simple method that achieves a good balance between effectiveness and efficiency. Specifically, we perform an online estimate of the covariance matrix of the features for each class, which captures the intra-class variations. Then we sample directions from a zero-mean multi-variate normal distribution with the estimated covariance, and apply them to the features of training samples in that class to augment the dataset. In this way, the chance of generating meaningless semantic transformations can be significantly reduced.

To further improve the efficiency, we derive a closed-form upper bound of the expected cross-entropy (CE) loss with the proposed data augmentation scheme. Therefore, instead of performing the augmentation procedure explicitly, we can directly minimize the upper bound, which is in fact a novel robust loss function. As there is no need to generate explicit data samples, we call our algorithm implicit semantic data augmentation (ISDA). Compared to existing semantic data augmentation algorithms, the proposed ISDA can be conveniently implemented on top of most deep models without introducing auxiliary models or noticeable extra computational cost.

Although simple, the proposed ISDA algorithm is surprisingly effective, and complements existing non-semantic data augmentation techniques quite well. Extensive empirical evaluations on several competitive image classification benchmarks show that ISDA consistently improves the generalization performance of popular deep networks, especially when training data is limited and when powerful traditional augmentation techniques are used.

Figure 1: An overview of ISDA. Inspired by the observation that certain directions in the feature space correspond to meaningful semantic transformations, we augment the training data semantically by translating their features along these semantic directions, without involving auxiliary deep networks. The directions are obtained by sampling random vectors from a zero-mean normal distribution with dynamically estimated class-conditional covariance matrices. In addition, instead of performing augmentation explicitly, ISDA boils down to minimizing a closed-form upper-bound of the expected cross-entropy loss on the augmented training set, which makes our method highly efficient.

2 Related Work

In this section, we briefly review existing research on related topics.

Data augmentation is a widely used technique to alleviate overfitting in training deep networks. For example, in image recognition tasks, data augmentation techniques like random flipping, mirroring and rotation are applied to enforce certain invariance in convolutional networks He et al. (2016); Huang et al. (2017); Simonyan and Zisserman (2015); Srivastava et al. (2015). Recently, automatic data augmentation techniques, e.g., AutoAugment Cubuk et al. (2018), have been proposed to search for a better augmentation strategy among a large pool of candidates. Similar to our method, learning with marginalized corrupted features Maaten et al. (2013) can be viewed as an implicit data augmentation technique, but it is limited to simple linear models. Complementarily, recent research shows that semantic data augmentation techniques, which apply class-identity-preserving transformations (e.g., changing backgrounds of objects or varying visual angles) to the training data, are effective as well Jaderberg et al. (2016); Bousmalis et al. (2017); Ratner et al. (2017); Antoniou et al. (2018). This is usually achieved by generating extra semantically transformed training samples with specialized deep structures such as DAGAN Antoniou et al. (2018), domain adaptation networks Bousmalis et al. (2017) or other GAN-based generators Jaderberg et al. (2016); Ratner et al. (2017). Although effective, these approaches are nontrivial to implement and computationally expensive, due to the need to train generative models beforehand and run inference with them during training.

Robust loss function. As shown in the paper, ISDA amounts to minimizing a novel robust loss function. Therefore, we give a brief review of related work on this topic. Recently, several robust loss functions have been proposed for deep learning. For example, the $L_q$ loss Zhang and Sabuncu (2018) is a balanced noise-robust form for the cross entropy (CE) loss and mean absolute error (MAE) loss, derived from the negative Box-Cox transformation. Focal loss Lin et al. (2017) attaches high weights to a sparse set of hard examples to prevent the vast number of easy samples from dominating the training of the network. The idea of introducing a large margin for the CE loss has been proposed in Liu et al. (2016); Liang et al. (2017); Wang et al. (2018). In Sun et al. (2014), the CE loss and the contrastive loss are combined to learn more discriminative features. From a similar perspective, center loss Wen et al. (2016) simultaneously learns a center for deep features of each class and penalizes the distances between the samples and their corresponding class centers in the feature space, enhancing the intra-class compactness and inter-class separability.
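To make the "balanced form between CE and MAE" remark concrete, here is a minimal NumPy sketch. It assumes the commonly stated form of the generalized cross-entropy loss, $(1 - p^q)/q$, where $p$ is the probability the model assigns to the true class; the function names are illustrative and not taken from any library.

```python
import numpy as np

def lq_loss(p_true, q):
    """Assumed L_q form: (1 - p^q) / q, with p the true-class probability."""
    p_true = np.asarray(p_true, dtype=float)
    return (1.0 - p_true ** q) / q

def ce_loss(p_true):
    """Standard cross-entropy on the true-class probability."""
    return -np.log(np.asarray(p_true, dtype=float))

def mae_like_loss(p_true):
    """MAE between a one-hot label and the prediction, up to a constant factor."""
    return 1.0 - np.asarray(p_true, dtype=float)

p = np.array([0.1, 0.5, 0.9])
near_ce = lq_loss(p, q=1e-6)   # q -> 0 approaches the CE loss
at_mae = lq_loss(p, q=1.0)     # q = 1 gives the MAE-like loss
```

Intermediate values of `q` interpolate between the two behaviours, which is the sense in which the loss is "balanced".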

Semantic transformations in deep feature space. Our work is motivated by the fact that high-level representations learned by deep convolutional networks can potentially capture abstractions with semantics Bengio and others (2009); Bengio et al. (2013). In fact, translating deep features along certain directions has been shown to correspond to performing meaningful semantic transformations on the input images. For example, deep feature interpolation Upchurch et al. (2017) leverages simple interpolations of deep features from pre-trained neural networks to achieve semantic image transformations. Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) based methods Choi et al. (2018); Zhu et al. (2017); He et al. (2017) establish a latent representation corresponding to the abstractions of images, which can be manipulated to edit the semantics of images. Generally, these methods reveal that certain directions in the deep feature space correspond to meaningful semantic transformations, and can be leveraged to perform semantic data augmentation.

3 Method

Deep networks are known to excel at forming high-level representations in the deep feature space He et al. (2016); Huang et al. (2017); Upchurch et al. (2017); Ren et al. (2015), where the semantic relations between samples can be captured by the relative positions of their features Bengio et al. (2013). Previous work has demonstrated that translating features towards specific directions corresponds to meaningful semantic transformations when the features are mapped to the input space Upchurch et al. (2017); Li et al. (2016); Bengio et al. (2013). Based on this observation, we propose to directly augment the training data in the feature space, and integrate this procedure into the training of deep models.

The proposed implicit semantic data augmentation (ISDA) has two important components, i.e., online estimation of class-conditional covariance matrices and optimization with a robust loss function. The first component aims to find a distribution from which we can sample meaningful semantic transformation directions for data augmentation, while the second saves us from explicitly generating a large amount of extra training data, leading to remarkable efficiency compared to existing data augmentation techniques.

3.1 Semantic Transformations in Deep Feature Space

As mentioned above, certain directions in the deep feature space correspond to meaningful semantic transformations like "make-bespectacled" or "change-view-angle". This motivates us to augment the training set by applying such semantic transformations on deep features. However, manually searching for semantic directions is infeasible for large scale problems. To address this problem, we propose to approximate the procedure by sampling random vectors from a normal distribution with zero mean and a covariance that is proportional to the intra-class covariance matrix, which captures the variance of samples in that class and is thus likely to contain rich semantic information. Intuitively, features for the person class may vary along the "wear-glasses" direction, while having nearly zero variance along the "has-propeller" direction, which only occurs for other classes like the plane class. We hope that directions corresponding to meaningful transformations for each class are well represented by the principal components of the covariance matrix of that class.

Consider training a deep network $G$ with weights $\Theta$ on a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{1, \ldots, C\}$ is the label of the $i$-th sample $x_i$ over $C$ classes. Let the $A$-dimensional vector $a_i = [a_{i1}, \ldots, a_{iA}]^T = G(x_i, \Theta)$ denote the deep features of $x_i$ learned by $G$, and $a_{ij}$ indicate the $j$-th element of $a_i$.

To obtain semantic directions to augment $a_i$, we randomly sample vectors from a zero-mean multi-variate normal distribution $\mathcal{N}(0, \Sigma_{y_i})$, where $\Sigma_{y_i}$ is the class-conditional covariance matrix estimated from the features of all the samples in class $y_i$. In implementation, the covariance matrix is computed in an online fashion by aggregating statistics from all mini-batches. The online estimation algorithm is given in Section A in the supplementary.
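The supplementary algorithm is not reproduced in this section, so the following is only a sketch of one standard way to aggregate per-class means and covariances across mini-batches (the usual rule for merging the statistics of two groups of samples). The class name `OnlineClassCovariance` and its interface are hypothetical. Note that during real training the features drift as the network is updated, so the running estimate mixes features produced by different network states.

```python
import numpy as np

class OnlineClassCovariance:
    """Running per-class mean and (biased) covariance of feature vectors."""

    def __init__(self, num_classes, feat_dim):
        self.count = np.zeros(num_classes)
        self.mean = np.zeros((num_classes, feat_dim))
        self.cov = np.zeros((num_classes, feat_dim, feat_dim))

    def update(self, features, labels):
        """features: (B, A) float array; labels: (B,) integer array."""
        for c in np.unique(labels):
            batch = features[labels == c]
            m = batch.shape[0]            # samples of class c in this batch
            n = self.count[c]             # samples of class c seen so far
            batch_mean = batch.mean(axis=0)
            centered = batch - batch_mean
            batch_cov = centered.T @ centered / m
            total = n + m
            delta = self.mean[c] - batch_mean
            # Merge old and new statistics; the outer-product term accounts
            # for the shift between the old mean and the batch mean.
            self.cov[c] = ((n * self.cov[c] + m * batch_cov) / total
                           + (n * m / total ** 2) * np.outer(delta, delta))
            self.mean[c] = (n * self.mean[c] + m * batch_mean) / total
            self.count[c] = total
```

After all batches have been seen, `cov[c]` matches the biased covariance computed in one pass over all features of class `c`, provided the features themselves did not change between batches.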

During training, $C$ covariance matrices are computed, one for each class. The augmented feature $\tilde{a}_i$ is obtained by translating $a_i$ along a random direction sampled from $\mathcal{N}(0, \lambda \Sigma_{y_i})$. Equivalently, we have

$$\tilde{a}_i \sim \mathcal{N}(a_i, \lambda \Sigma_{y_i}), \tag{1}$$

where $\lambda$ is a positive coefficient to control the strength of semantic data augmentation. As the covariances are computed dynamically during training, the estimates in the first few epochs are not very informative, since the network is not yet well trained. To address this issue, we let $\lambda = (t/T) \times \lambda_0$ be a function of the current iteration $t$, where $T$ is the total number of iterations, so as to reduce the impact of the estimated covariances on our algorithm early in the training stage.
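The explicit form of this augmentation can be sketched in a few lines of NumPy. The sketch assumes a linear ramp for the strength coefficient (zero at the first iteration, a base value `lambda_0` at the last) and assumes that each feature is perturbed with Gaussian noise whose covariance is the scaled covariance of its class; the function names are illustrative only.

```python
import numpy as np

def augmentation_strength(t, total_iters, lambda_0):
    """Assumed linear ramp: 0 at t = 0, lambda_0 at t = total_iters."""
    return (t / total_iters) * lambda_0

def sample_augmented_features(features, labels, class_covs, lam, rng):
    """Draw one augmented copy per feature from N(a_i, lam * Sigma_{y_i}).

    features: (N, A); labels: (N,) ints; class_covs: (C, A, A).
    """
    augmented = np.empty_like(features)
    for i, (a_i, y_i) in enumerate(zip(features, labels)):
        augmented[i] = rng.multivariate_normal(mean=a_i, cov=lam * class_covs[y_i])
    return augmented

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))
labels = np.array([0, 1, 0, 1, 1, 0])
class_covs = np.stack([np.eye(3), 2.0 * np.eye(3)])
lam = augmentation_strength(t=50, total_iters=100, lambda_0=0.5)
augmented = sample_augmented_features(features, labels, class_covs, lam, rng)
```

This per-sample loop is the costly explicit route; the implicit formulation discussed next avoids drawing samples altogether.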

3.2 Implicit Semantic Data Augmentation (ISDA)

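A naive method to implement ISDA is to explicitly augment each $a_i$ for $M$ times, forming an augmented feature set $\{(a_i^1, y_i), \ldots, (a_i^M, y_i)\}_{i=1}^{N}$ of size $MN$, where $a_i^m$ is the $m$-th copy of augmented features for sample $x_i$. Then the networks are trained by minimizing the cross-entropy (CE) loss:

$$\mathcal{L}_M(W, b, \Theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} -\log\left(\frac{e^{w_{y_i}^T a_i^m + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^T a_i^m + b_j}}\right), \tag{2}$$

where $W = [w_1, \ldots, w_C]^T \in \mathbb{R}^{C \times A}$ and $b = [b_1, \ldots, b_C]^T \in \mathbb{R}^{C}$ are the weight matrix and biases corresponding to the final fully connected layer, respectively.

Obviously, the naive implementation is computationally inefficient when $M$ is large, as the feature set is enlarged by $M$ times. In the following, we consider the case that $M$ grows to infinity, and find that an easy-to-compute upper bound can be derived for the loss function, leading to a highly efficient implementation.

Upper bound of the loss function. In the case $M \to \infty$, we are in fact considering the expectation of the CE loss under all possible augmented features. Specifically, $\mathcal{L}_\infty$ is given by:

$$\mathcal{L}_\infty(W, b, \Theta \mid \Sigma) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}_{\tilde{a}_i}\left[-\log\left(\frac{e^{w_{y_i}^T \tilde{a}_i + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^T \tilde{a}_i + b_j}}\right)\right]. \tag{3}$$

If $\mathcal{L}_\infty$ can be computed efficiently, then we can directly minimize it without explicitly sampling augmented features. However, Eq. (3) is difficult to compute in its exact form. Alternatively, we find that it is possible to derive an easy-to-compute upper bound for $\mathcal{L}_\infty$, as given by the following proposition.

Proposition 1.

Suppose that $\tilde{a}_i \sim \mathcal{N}(a_i, \lambda \Sigma_{y_i})$, then we have an upper bound of $\mathcal{L}_\infty$, given by

$$\mathcal{L}_\infty \le \frac{1}{N} \sum_{i=1}^{N} -\log\left(\frac{e^{w_{y_i}^T a_i + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^T a_i + b_j + \frac{\lambda}{2} v_{j y_i}^T \Sigma_{y_i} v_{j y_i}}}\right) \triangleq \overline{\mathcal{L}}_\infty, \tag{4}$$

where $v_{j y_i} = w_j - w_{y_i}$.

Proof. According to the definition of $\mathcal{L}_\infty$ in (3), we have:

$$\mathcal{L}_\infty = \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}_{\tilde{a}_i}\left[\log\left(\sum_{j=1}^{C} e^{v_{j y_i}^T \tilde{a}_i + (b_j - b_{y_i})}\right)\right] \tag{5}$$

$$\le \frac{1}{N} \sum_{i=1}^{N} \log\left(\sum_{j=1}^{C} \mathrm{E}_{\tilde{a}_i}\left[e^{v_{j y_i}^T \tilde{a}_i + (b_j - b_{y_i})}\right]\right) \tag{6}$$

$$= \frac{1}{N} \sum_{i=1}^{N} \log\left(\sum_{j=1}^{C} e^{v_{j y_i}^T a_i + (b_j - b_{y_i}) + \frac{\lambda}{2} v_{j y_i}^T \Sigma_{y_i} v_{j y_i}}\right) \tag{7}$$

$$= \overline{\mathcal{L}}_\infty. \tag{8}$$

In the above, Inequality (6) follows from Jensen's inequality $\mathrm{E}[\log X] \le \log \mathrm{E}[X]$, as the logarithmic function $\log(\cdot)$ is concave. Eq. (7) is obtained by leveraging the moment-generating function

$$\mathrm{E}\left[e^{tX}\right] = e^{t\mu + \frac{1}{2}\sigma^2 t^2}, \quad X \sim \mathcal{N}(\mu, \sigma^2),$$

due to the fact that $v_{j y_i}^T \tilde{a}_i + (b_j - b_{y_i})$ is a Gaussian random variable, i.e.,

$$v_{j y_i}^T \tilde{a}_i + (b_j - b_{y_i}) \sim \mathcal{N}\left(v_{j y_i}^T a_i + (b_j - b_{y_i}),\; \lambda\, v_{j y_i}^T \Sigma_{y_i} v_{j y_i}\right).$$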

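Input: $\mathcal{D}$, $\lambda_0$
Randomly initialize $W$, $b$ and $\Theta$
for $t = 0$ to $T$ do
    Sample a mini-batch $\{x_i, y_i\}_{i=1}^{B}$ from $\mathcal{D}$
    Compute $a_i = G(x_i, \Theta)$
    Estimate the covariance matrices $\Sigma_1$, $\Sigma_2$, ..., $\Sigma_C$
    Compute $\overline{\mathcal{L}}_\infty$ according to Eq. (4)
    Update $W$, $b$, $\Theta$ with SGD
end for
Output: $W$, $b$ and $\Theta$
Algorithm 1 The ISDA Algorithm.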

Essentially, Proposition 1 provides a surrogate loss for our implicit data augmentation algorithm. Instead of minimizing the exact loss function $\mathcal{L}_\infty$, we can optimize its upper bound $\overline{\mathcal{L}}_\infty$ in a much more efficient way. Therefore, the proposed ISDA boils down to a novel robust loss function, which can be easily adopted by most deep models. In addition, we can observe that when $\lambda \to 0$, which means no features are augmented, $\overline{\mathcal{L}}_\infty$ reduces to the standard CE loss.
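A minimal NumPy sketch of such a surrogate loss is given below. It assumes the bound has the form of a CE loss in which the logit of every class $j$ is increased by $\frac{\lambda}{2}(w_j - w_{y})^T \Sigma_{y} (w_j - w_{y})$, with $y$ the true class, $w_j$ the rows of the final-layer weight matrix and $\Sigma_y$ the class covariance. The function name `isda_loss` and the argument layout are illustrative, not the authors' implementation.

```python
import numpy as np

def _log_sum_exp(z):
    """Numerically stable log(sum(exp(z))) over the class axis."""
    z_max = z.max(axis=1, keepdims=True)
    return (z_max + np.log(np.exp(z - z_max).sum(axis=1, keepdims=True))).squeeze(1)

def isda_loss(features, labels, W, b, class_covs, lam):
    """Surrogate loss under the assumed form of the upper bound.

    features: (N, A); labels: (N,) ints; W: (C, A); b: (C,);
    class_covs: (C, A, A); lam: augmentation strength.
    """
    logits = features @ W.T + b                          # (N, C)
    v = W[None, :, :] - W[labels][:, None, :]            # (N, C, A): w_j - w_y
    sigma = class_covs[labels]                           # (N, A, A)
    quad = np.einsum("nja,nab,njb->nj", v, sigma, v)     # v^T Sigma v, zero for j = y
    augmented_logits = logits + 0.5 * lam * quad
    true_logit = logits[np.arange(len(labels)), labels]
    return np.mean(_log_sum_exp(augmented_logits) - true_logit)
```

With `lam = 0` the extra term vanishes and the function returns the ordinary CE loss. For `lam > 0` the value can be compared against a Monte Carlo average of the CE loss over explicitly sampled features, which it should not fall below.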

In summary, the proposed ISDA can be simply plugged into deep networks as a robust loss function, and efficiently optimized with the stochastic gradient descent (SGD) algorithm. We present the pseudo code of ISDA in Algorithm 1. Details of estimating covariance matrices and computing gradients are presented in Appendix A.

4 Experiments

In this section, we empirically validate the proposed algorithm on several widely used image classification benchmarks, i.e., CIFAR-10, CIFAR-100 Krizhevsky and Hinton (2009) and ImageNet Deng et al. (2009). We first evaluate the effectiveness of ISDA with different deep network architectures on these datasets. Second, we apply several recently proposed non-semantic image augmentation methods in addition to the standard baseline augmentation, and investigate the performance of ISDA. Third, we present comparisons with state-of-the-art robust loss functions and generator-based semantic data augmentation algorithms. Finally, an ablation study is conducted to examine the effectiveness of each component. We also visualize the augmented samples in the original input space with the aid of a generative network.

4.1 Datasets and Baselines

Datasets. We use three image recognition benchmarks in the experiments. (1) The two CIFAR datasets consist of 32x32 colored natural images in 10 classes for CIFAR-10 and 100 classes for CIFAR-100, each with 50,000 images for training and 10,000 images for testing. In our experiments, we hold out 5,000 images from the training set as the validation set to search for the hyper-parameter $\lambda_0$. These samples are also used for training after an optimal $\lambda_0$ is selected, and the results on the test set are reported. Images are normalized with channel means and standard deviations for preprocessing. For the non-semantic data augmentation of the training set, we follow the standard operation in Howard (2014): 4 pixels are padded at each side of the image, followed by a random 32x32 cropping combined with random horizontal flipping. (2) ImageNet is a 1,000-class dataset from ILSVRC2012 Deng et al. (2009), providing 1.2 million images for training and 50,000 images for validation. We adopt the same augmentation configurations as in Krizhevsky et al. (2012); He et al. (2016); Huang et al. (2017).
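The standard CIFAR augmentation described above (pad, random crop, random horizontal flip) can be sketched as follows. Zero padding and a flip probability of 0.5 are assumptions, since the text does not state them; the function name is illustrative.

```python
import numpy as np

def pad_crop_flip(image, rng, pad=4):
    """Pad each side by `pad` pixels, crop back to the original size at a
    random offset, then mirror horizontally with probability 0.5.

    image: (H, W, C) array. Zero padding is an assumption of this sketch.
    """
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = pad_crop_flip(image, rng)
```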

Non-semantic augmentation techniques. To study the complementary effects of ISDA to traditional data augmentation methods, two state-of-the-art non-semantic augmentation techniques are applied, with and without ISDA. (1) Cutout DeVries and Taylor (2017) randomly masks out square regions of input during training to regularize the model. (2) AutoAugment Cubuk et al. (2019) automatically searches for the best augmentation policies to yield the highest validation accuracy on a target dataset. All hyper-parameters are the same as reported in the papers introducing them.
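As an illustration of the first technique, a Cutout-style mask can be sketched in NumPy as below. The mask size of 16 pixels, the zero fill value and the clipping of the square at the image border are assumptions made for this sketch, not settings taken from the text.

```python
import numpy as np

def cutout(image, rng, size=16):
    """Zero out one square region centred at a random pixel.

    image: (H, W, C) array; the square is clipped at the image border.
    """
    h, w = image.shape[:2]
    cy = rng.integers(0, h)
    cx = rng.integers(0, w)
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()          # leave the input untouched
    out[y0:y1, x0:x1] = 0
    return out

rng = np.random.default_rng(0)
image = np.ones((32, 32, 3))
masked = cutout(image, rng)
```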

| Method | Params | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| ResNet-32 He et al. (2016) | 0.5M | 7.39 ± 0.10% | 31.20 ± 0.41% |
| ResNet-32 + ISDA | 0.5M | **7.09 ± 0.12%** | **30.27 ± 0.34%** |
| ResNet-110 He et al. (2016) | 1.7M | 6.76 ± 0.34% | 28.67 ± 0.44% |
| ResNet-110 + ISDA | 1.7M | **6.33 ± 0.19%** | **27.57 ± 0.46%** |
| SE-ResNet-110 Hu et al. (2018) | 1.7M | 6.14 ± 0.17% | 27.30 ± 0.03% |
| SE-ResNet-110 + ISDA | 1.7M | **5.96 ± 0.21%** | **26.63 ± 0.21%** |
| Wide-ResNet-16-8 Zagoruyko and Komodakis (2017) | 11.0M | 4.25 ± 0.18% | 20.24 ± 0.27% |
| Wide-ResNet-16-8 + ISDA | 11.0M | **4.04 ± 0.29%** | **19.91 ± 0.21%** |
| Wide-ResNet-28-10 Zagoruyko and Komodakis (2017) | 36.5M | 3.82 ± 0.15% | 18.53 ± 0.07% |
| Wide-ResNet-28-10 + ISDA | 36.5M | **3.58 ± 0.15%** | **17.98 ± 0.15%** |
| ResNeXt-29, 8x24d Xie et al. (2017) | 34.4M | 3.86 ± 0.14% | 18.16 ± 0.13% |
| ResNeXt-29, 8x24d + ISDA | 34.4M | **3.67 ± 0.12%** | **17.43 ± 0.25%** |
| DenseNet-BC-100-12 Huang et al. (2017) | 0.8M | 4.90 ± 0.08% | 22.61 ± 0.10% |
| DenseNet-BC-100-12 + ISDA | 0.8M | **4.54 ± 0.07%** | **22.10 ± 0.34%** |
| DenseNet-BC-190-40 Huang et al. (2017) | 15.2M | 3.52% | 17.74% |
| DenseNet-BC-190-40 + ISDA | 15.2M | **3.24%** | **17.42%** |

Table 1: Evaluation of ISDA on CIFAR with different models. The average test error over the last 10 epochs is calculated in each experiment, and we report mean values and standard deviations in three independent experiments. The best results are bold-faced.

| Dataset | Network | Cutout DeVries and Taylor (2017) | Cutout + ISDA | AA Cubuk et al. (2019) | AA + ISDA |
|---|---|---|---|---|---|
| CIFAR-10 | Wide-ResNet-28-10 Zagoruyko and Komodakis (2017) | 2.99 ± 0.06% | 2.83 ± 0.04% | 2.65 ± 0.07% | **2.56 ± 0.01%** |
| CIFAR-10 | Shake-Shake (26, 2x32d) Gastaldi (2017) | 3.16 ± 0.09% | 2.93 ± 0.03% | 2.89 ± 0.09% | **2.68 ± 0.12%** |
| CIFAR-10 | Shake-Shake (26, 2x112d) Gastaldi (2017) | 2.36% | 2.25% | 2.01% | **1.82%** |
| CIFAR-100 | Wide-ResNet-28-10 Zagoruyko and Komodakis (2017) | 18.05 ± 0.25% | 17.02 ± 0.11% | 16.60 ± 0.40% | **15.62 ± 0.32%** |
| CIFAR-100 | Shake-Shake (26, 2x32d) Gastaldi (2017) | 18.92 ± 0.21% | 18.17 ± 0.08% | 17.50 ± 0.19% | **17.21 ± 0.33%** |
| CIFAR-100 | Shake-Shake (26, 2x112d) Gastaldi (2017) | 17.34 ± 0.28% | 16.24 ± 0.20% | 15.21 ± 0.20% | **13.87 ± 0.26%** |

Table 2: Evaluation of ISDA with state-of-the-art non-semantic augmentation techniques. 'AA' refers to AutoAugment Cubuk et al. (2019). We report mean values and standard deviations in three independent experiments. The best results are bold-faced.

Figure 2: Curves of test errors on CIFAR-100 with Wide-ResNet (WRN).
Figure 3: Training and test errors on ImageNet.

Baselines. Our method is compared to several baselines including state-of-the-art robust loss functions and generator-based semantic data augmentation methods. (1) Dropout Srivastava et al. (2014) is a widely used regularization approach which randomly mutes some neurons during training. (2) Large-margin softmax loss Liu et al. (2016) introduces a large decision margin, measured by a cosine distance, to the standard CE loss. (3) Disturb label Xie et al. (2016) is a regularization mechanism that randomly replaces a fraction of labels with incorrect ones in each iteration. (4) Focal loss Lin et al. (2017) focuses on a sparse set of hard examples to prevent easy samples from dominating the training procedure. (5) Center loss Wen et al. (2016) simultaneously learns a center of features for each class and minimizes the distances between the deep features and their corresponding class centers. (6) $L_q$ loss Zhang and Sabuncu (2018) is a noise-robust loss function, using the negative Box-Cox transformation. (7) For generator-based semantic augmentation methods, we train several state-of-the-art GANs Arjovsky et al. (2017); Mirza and Osindero (2014); Odena et al. (2017); Chen et al. (2016), which are then used to generate extra training samples for data augmentation. For fair comparison, all methods are implemented with the same training configurations whenever possible. Details for hyper-parameter settings are presented in Appendix B.

Training details. For deep networks, we implement the ResNet, SE-ResNet, Wide-ResNet, ResNeXt, DenseNet and PyramidNet on the two CIFAR datasets, and ResNet on ImageNet. Detailed configurations for these models are given in Appendix B. The hyper-parameter $\lambda_0$ for ISDA is selected from a set of candidate values according to the performance on the validation set. On ImageNet, due to GPU memory limitation, we approximate the covariance matrices by their diagonals, i.e., the variance of each dimension of the features. The best hyper-parameter $\lambda_0$ is again selected from a set of candidate values.
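The diagonal approximation can be illustrated with a short NumPy sketch. When only per-dimension variances are kept, a quadratic form $v^T \Sigma v$ collapses to a weighted sum of squares, so a class needs $A$ numbers instead of $A \times A$. As a back-of-the-envelope illustration under assumed sizes (1,000 classes, 2,048-dimensional features, 4-byte floats), full matrices would take roughly 17 GB, against about 8 MB for the diagonals. The function names below are illustrative.

```python
import numpy as np

def quad_form_full(v, cov):
    """v^T Sigma v with a full (A, A) covariance matrix."""
    return v @ cov @ v

def quad_form_diag(v, var):
    """Same quantity when Sigma is approximated by diag(var), var of shape (A,)."""
    return np.sum(v * v * var)

rng = np.random.default_rng(0)
v = rng.normal(size=8)
var = rng.random(8)
full_value = quad_form_full(v, np.diag(var))
diag_value = quad_form_diag(v, var)
```

The two values agree exactly when the off-diagonal entries are zero; with a genuinely full covariance the diagonal version drops the cross-dimension correlation terms.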

4.2 Main Results

Table 1 presents the performance of several state-of-the-art deep networks with and without ISDA. It can be observed that ISDA consistently improves the generalization performance of these models, especially with fewer training samples per class. On CIFAR-100, for relatively small models like ResNet-32 and ResNet-110, ISDA reduces test errors by about 1%, while for larger models like Wide-ResNet-28-10 and ResNeXt-29, 8x24d, our method outperforms the competitive baselines by 0.55% and 0.73%, respectively. Compared to ResNets, DenseNets generally suffer less from overfitting due to their architecture design, and thus appear to benefit less from our algorithm.

Table 2 shows experimental results with recently proposed powerful traditional image augmentation methods (i.e., Cutout DeVries and Taylor (2017) and AutoAugment Cubuk et al. (2019)). Interestingly, ISDA seems to be even more effective when these techniques are applied. For example, when applying AutoAugment, ISDA achieves performance gains of 1.34% and 0.98% on CIFAR-100 with the Shake-Shake (26, 2x112d) and the Wide-ResNet-28-10, respectively. Notice that these improvements are larger than those obtained in the standard setting. A plausible explanation for this phenomenon is that non-semantic augmentation methods help to learn a better feature representation, which makes semantic transformations in the deep feature space more reliable. The curves of test errors during training on CIFAR-100 with Wide-ResNet-28-10 are presented in Figure 2. It is clear that ISDA achieves a significant improvement after the third learning rate drop, and shows even better performance after the fourth drop.

| Method | ResNet-110, CIFAR-10 | ResNet-110, CIFAR-100 | Wide-ResNet-28-10, CIFAR-10 | Wide-ResNet-28-10, CIFAR-100 |
|---|---|---|---|---|
| Large Margin Liu et al. (2016) | 6.46 ± 0.20% | 28.00 ± 0.09% | 3.69 ± 0.10% | 18.48 ± 0.05% |
| Disturb Label Xie et al. (2016) | 6.61 ± 0.04% | 28.46 ± 0.32% | 3.91 ± 0.10% | 18.56 ± 0.22% |
| Focal Loss Lin et al. (2017) | 6.68 ± 0.22% | 28.28 ± 0.32% | 3.62 ± 0.07% | 18.22 ± 0.08% |
| Center Loss Wen et al. (2016) | 6.38 ± 0.20% | 27.85 ± 0.10% | 3.76 ± 0.05% | 18.50 ± 0.25% |
| $L_q$ Loss Zhang and Sabuncu (2018) | 6.69 ± 0.07% | 28.78 ± 0.35% | 3.78 ± 0.08% | 18.43 ± 0.37% |
| WGAN Arjovsky et al. (2017) | 6.63 ± 0.23% | - | 3.81 ± 0.08% | - |
| CGAN Mirza and Osindero (2014) | 6.56 ± 0.14% | 28.25 ± 0.36% | 3.84 ± 0.07% | 18.79 ± 0.08% |
| ACGAN Odena et al. (2017) | 6.32 ± 0.12% | 28.48 ± 0.44% | 3.81 ± 0.11% | 18.54 ± 0.05% |
| infoGAN Chen et al. (2016) | 6.59 ± 0.12% | 27.64 ± 0.14% | 3.81 ± 0.05% | 18.44 ± 0.10% |
| Basic | 6.76 ± 0.34% | 28.67 ± 0.44% | - | - |
| Basic + Dropout | 6.23 ± 0.11% | 27.11 ± 0.06% | 3.82 ± 0.15% | 18.53 ± 0.07% |
| ISDA | 6.33 ± 0.19% | 27.57 ± 0.46% | - | - |
| ISDA + Dropout | **5.98 ± 0.20%** | **26.35 ± 0.30%** | **3.58 ± 0.15%** | **17.98 ± 0.15%** |

Table 3: Comparisons with the state-of-the-art methods. We report mean values and standard deviations of the test error in three independent experiments. Best results are bold-faced.

| Method | Top-1 | Top-5 |
|---|---|---|
| ResNet-50 He et al. (2016) | 23.58% | 6.92% |
| ResNet-50 + ISDA | 23.30% | 6.82% |
| ResNet-152 He et al. (2016) | 21.65% | 6.01% |
| ResNet-152 + ISDA | 21.20% | 5.67% |

Table 4: Evaluation of ISDA on ImageNet.

Table 4 presents the performance of ISDA on the large scale ImageNet dataset. It can be observed that ISDA reduces the Top-1 error rate by 0.45% for the ResNet-152 model. The training and test error curves are shown in Figure 3. Notably, ISDA achieves a slightly higher training error but a lower test error, indicating that ISDA performs effective regularization on deep networks.

4.3 Comparison with Other Approaches

We compare ISDA with a number of competitive baselines described in Section 4.1, ranging from robust loss functions to semantic data augmentation algorithms based on generative models. The results are summarized in Table 3, and the training curves are presented in Appendix D. One can observe that ISDA compares favorably with all the competitive baseline algorithms. With ResNet-110, the lowest test errors among the other robust loss functions are 6.38% and 27.85% on CIFAR-10 and CIFAR-100, respectively (center loss in Table 3), while ISDA achieves 6.33% and 27.57%, and 5.98% and 26.35% when combined with dropout.

Among all GAN-based semantic augmentation methods, ACGAN gives the best performance, especially on CIFAR-10. However, these models generally suffer a performance reduction on CIFAR-100, which does not contain enough samples to learn a valid generator for each class. In contrast, ISDA shows consistent improvements on all the datasets. In addition, GAN-based methods require additional computation to train the generators, and introduce significant overhead to the training process. In comparison, ISDA not only leads to lower generalization error, but is also simpler and more efficient.

4.4 Visualization Results

To demonstrate that our method is able to generate meaningful semantically augmented samples, we introduce an approach to map the augmented features back to the pixel space to explicitly show semantic changes of the images. Due to space limitations, the detailed introduction of the mapping algorithm is deferred to Appendix C.

Figure 4 shows the visualization results. The first and second columns show the original images and the reconstructed images without any augmentation. The remaining columns show the images augmented by the proposed ISDA. It can be observed that ISDA is able to alter the semantics of images, e.g., backgrounds, visual angles, colors and types of cars, and skin colors, which is not possible for traditional data augmentation techniques.

Figure 4: Visualization results of semantically augmented images.

4.5 Ablation Study

| Setting | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Basic | 3.82 ± 0.15% | 18.58 ± 0.10% |
| Identity matrix | 3.63 ± 0.12% | 18.53 ± 0.02% |
| Diagonal matrix | 3.70 ± 0.15% | 18.23 ± 0.02% |
| Single covariance matrix | 3.67 ± 0.07% | 18.29 ± 0.13% |
| Constant $\lambda_0$ | 3.69 ± 0.08% | 18.33 ± 0.16% |
| ISDA | 3.58 ± 0.15% | 17.98 ± 0.15% |

Table 5: The ablation study for ISDA.

To get a better understanding of the effectiveness of different components in ISDA, we conduct an ablation study. Specifically, several variants are considered: (1) Identity matrix means replacing the covariance matrix $\Sigma_c$ by the identity matrix. (2) Diagonal matrix means using only the diagonal elements of the covariance matrix $\Sigma_c$. (3) Single covariance matrix means using a global covariance matrix computed from the features of all classes. (4) Constant $\lambda_0$ means using a constant $\lambda_0$ without setting it as a function of the training iterations.

Table 5 presents the ablation results. Adopting the identity matrix increases the test error by 0.05% on CIFAR-10 and by 0.55% on CIFAR-100. Using a single covariance matrix degrades the generalization performance as well. The likely reason is that both variants fail to find proper directions in the deep feature space to perform meaningful semantic transformations. Adopting a diagonal matrix also hurts the performance, as it does not consider correlations between features.

5 Conclusion

In this paper, we proposed an efficient implicit semantic data augmentation algorithm (ISDA) to complement existing data augmentation techniques. Different from existing approaches leveraging generative models to augment the training set with semantically transformed samples, our approach is considerably more efficient and easier to implement. In fact, we showed that ISDA can be formulated as a novel robust loss function, which is compatible with any deep network with the cross-entropy loss. Extensive results on several competitive image classification datasets demonstrate the effectiveness and efficiency of the proposed algorithm.


Acknowledgments

Gao Huang is supported in part by Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106 and Tencent AI Lab Rhino-Bird Focused Research Program under grant JR201914.


  • A. Antoniou, A. J. Storkey, and H. A. Edwards (2018) Data augmentation generative adversarial networks. CoRR abs/1711.04340. Cited by: §1, §2.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. CoRR abs/1701.07875. Cited by: Appendix B, Appendix C, §4.1, Table 3.
  • Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai (2013) Better mixing via deep representations. In ICML, pp. 552–560. Cited by: §1, §2, §3.
  • Y. Bengio et al. (2009) Learning deep architectures for ai.

    Foundations and trends® in Machine Learning

    2 (1), pp. 1–127.
    Cited by: §2.
  • K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, pp. 3722–3731. Cited by: §2.
  • C. Bowles, L. J. Chen, R. Guerrero, P. Bentley, R. N. Gunn, A. Hammers, D. A. Dickie, M. del C. Valdés Hernández, J. M. Wardlaw, and D. Rueckert (2018) GAN augmentation: augmenting training data using generative adversarial networks. CoRR abs/1810.10863. Cited by: §1.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2172–2180. Cited by: Appendix B, §4.1, Table 3.
  • Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, pp. 8789–8797. Cited by: §2.
  • E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le (2019) AutoAugment: learning augmentation policies from data. In CVPR, Cited by: §4.1, §4.2, Table 2.
  • E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le (2018) AutoAugment: learning augmentation policies from data. CoRR abs/1805.09501. Cited by: §2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.1, §4.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §4.1, §4.2, Table 2.
  • X. Gastaldi (2017) Shake-shake regularization. arXiv preprint arXiv:1705.07485. Cited by: Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §2, §3, §4.1, Table 1, Table 4.
  • Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen (2017) AttGAN: facial attribute editing by only changing what you want. CoRR abs/1711.10678. Cited by: §2.
  • A. G. Howard (2014) Some improvements on deep convolutional neural network based image classification. CoRR abs/1312.5402. Cited by: §4.1.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: Table 1.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely Connected Convolutional Networks. In CVPR, pp. 2261–2269. Cited by: §1, §2, §3, §4.1, Table 1.
  • G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In ECCV, pp. 646–661. Cited by: Appendix B.
  • M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2016) Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116 (1), pp. 1–20. Cited by: §2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1, §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §1, §4.1.
  • M. Li, W. Zuo, and D. Zhang (2016) Convolutional network for attribute-driven and identity-preserving human face generation. CoRR abs/1608.06434. Cited by: §3.
  • X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li (2017) Soft-margin softmax for deep classification. In ICONIP, Cited by: §2.
  • T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2999–3007. Cited by: §2, §4.1, Table 3.
  • W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, Cited by: §2, §4.1, Table 3.
  • L. Maaten, M. Chen, S. Tyree, and K. Weinberger (2013) Learning with marginalized corrupted features. In ICML, pp. 410–418. Cited by: §2.
  • A. Mahendran and A. Vedaldi (2015) Understanding deep image representations by inverting them. In CVPR, pp. 5188–5196. Cited by: Appendix C.
  • M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: Appendix B, §4.1, Table 3.
  • A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier GANs. In ICML, pp. 2642–2651. Cited by: Appendix B, §4.1, Table 3.
  • A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré (2017) Learning to compose domain-specific transformations for data augmentation. In NeurIPS, pp. 3236–3246. Cited by: §1, §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §3.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §1, §2.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: Appendix B, §4.1.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. In NeurIPS, pp. 2377–2385. Cited by: §2.
  • Y. Sun, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In NeurIPS, Cited by: §2.
  • P. Upchurch, J. R. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger (2017) Deep feature interpolation for image content changes. In CVPR, pp. 6090–6099. Cited by: Appendix C, §1, §1, §2, §3.
  • X. Wang, S. Zhang, Z. Lei, S. Liu, X. Guo, and S. Z. Li (2018) Ensemble soft-margin softmax loss for image classification. In IJCAI, Cited by: §2.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515. Cited by: §2, §4.1, Table 3.
  • L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian (2016) DisturbLabel: regularizing CNN on the loss layer. In CVPR, pp. 4753–4762. Cited by: §4.1, Table 3.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500. Cited by: Table 1.
  • S. Zagoruyko and N. Komodakis (2017) Wide residual networks. In BMVC, Cited by: Table 1, Table 2.
  • Z. Zhang and M. R. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, Cited by: Appendix B, §2, §4.1, Table 3.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232. Cited by: §2.

Appendix A Implementation Details of ISDA

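Dynamic estimation of covariance matrices. During the training process using the ISDA loss, covariance matrices are estimated by:

$$\mu_j^{(t)} = \frac{n_j^{(t-1)} \mu_j^{(t-1)} + m_j^{(t)} {\mu'}_j^{(t)}}{n_j^{(t-1)} + m_j^{(t)}},$$

$$\Sigma_j^{(t)} = \frac{n_j^{(t-1)} \Sigma_j^{(t-1)} + m_j^{(t)} {\Sigma'}_j^{(t)}}{n_j^{(t-1)} + m_j^{(t)}} + \frac{n_j^{(t-1)} m_j^{(t)} \left(\mu_j^{(t-1)} - {\mu'}_j^{(t)}\right) \left(\mu_j^{(t-1)} - {\mu'}_j^{(t)}\right)^{T}}{\left(n_j^{(t-1)} + m_j^{(t)}\right)^{2}},$$

$$n_j^{(t)} = n_j^{(t-1)} + m_j^{(t)},$$

where $\mu_j^{(t)}$ and $\Sigma_j^{(t)}$ are the estimates of the average values and covariance matrices of the features of the $j$-th class at the $t$-th step. ${\mu'}_j^{(t)}$ and ${\Sigma'}_j^{(t)}$ are the average values and covariance matrices of the features of the $j$-th class in the $t$-th mini-batch. $n_j^{(t)}$ denotes the total number of training samples belonging to the $j$-th class in all $t$ mini-batches, and $m_j^{(t)}$ denotes the number of training samples belonging to the $j$-th class in the $t$-th mini-batch only.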
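The online estimate described above can be sketched in NumPy as follows. This is a minimal illustration for a single class, not the authors' code: the class name `RunningClassStats` is hypothetical, and the sketch assumes biased (divide-by-count) covariances, so that merging mini-batch statistics reproduces the covariance of all samples seen so far.

```python
import numpy as np


class RunningClassStats:
    """Running mean and biased covariance of the features of one class."""

    def __init__(self, dim):
        self.n = 0                       # samples of this class seen so far
        self.mean = np.zeros(dim)        # running mean estimate
        self.cov = np.zeros((dim, dim))  # running covariance estimate

    def update(self, batch):
        """Merge a mini-batch of shape (m, dim) into the running estimates."""
        m = batch.shape[0]
        if m == 0:
            return  # the class does not appear in this mini-batch
        batch_mean = batch.mean(axis=0)
        centered = batch - batch_mean
        batch_cov = centered.T @ centered / m  # biased: divide by m

        total = self.n + m
        diff = self.mean - batch_mean  # uses the mean before this update
        # Count-weighted average of the two covariances, plus a term that
        # accounts for the shift between the old mean and the batch mean.
        self.cov = ((self.n * self.cov + m * batch_cov) / total
                    + self.n * m * np.outer(diff, diff) / total ** 2)
        self.mean = (self.n * self.mean + m * batch_mean) / total
        self.n = total
```

In a training loop one such object would be kept per class and updated with the features of that class in each mini-batch, so no past features need to be stored.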

Gradient computation. In backward propagation, gradients of are given by:


where denotes element of . can be obtained through the backward propagation algorithm using .

Appendix B Training Details

On CIFAR, we implement ResNet, SE-ResNet, Wide-ResNet, ResNeXt, DenseNet and PyramidNet. All models are trained using SGD with Nesterov momentum. The specific training hyper-parameters are presented in Table 6.


| Network | Total Epochs | Batch Size | Weight Decay | Momentum | Initial Learning Rate | Schedule |
|---|---|---|---|---|---|---|
| ResNet | 160 | 128 | 1e-4 | 0.9 | 0.1 | Multiplied by 0.1 at two epochs. |
| SE-ResNet | 200 | 128 | 1e-4 | 0.9 | 0.1 | Multiplied by 0.1 at three epochs. |
| Wide-ResNet | 240 | 128 | 5e-4 | 0.9 | 0.1 | Multiplied by 0.2 at four epochs. |
| DenseNet-BC | 300 | 64 | 1e-4 | 0.9 | 0.1 | Multiplied by 0.1 at three epochs. |
| ResNeXt | 350 | 128 | 5e-4 | 0.9 | 0.05 | Multiplied by 0.1 at three epochs. |
| Shake-Shake | 1800 | 64 | 1e-4 | 0.9 | 0.1 | Cosine learning rate. |
| PyramidNet | 1800 | 128 | 1e-4 | 0.9 | 0.1 | Cosine learning rate. |

Table 6: Training configurations on CIFAR.

On ImageNet, we train ResNet for 120 epochs using the same L2 weight decay and momentum as on CIFAR, following Huang et al. (2016). The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. The mini-batch size is set to 256.
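The ImageNet learning rate schedule stated above can be written as a small helper. This is only a sketch: the function name `step_lr` is hypothetical, and it assumes epochs are counted from 0, so the first decay takes effect at epoch 30.

```python
def step_lr(epoch, base_lr=0.1, decay=0.1, step=30):
    """Learning rate for a 0-indexed epoch under a step-decay schedule.

    The rate starts at base_lr and is multiplied by `decay`
    (i.e. divided by 10) once every `step` epochs.
    """
    return base_lr * decay ** (epoch // step)
```

Over the 120 training epochs this gives four plateaus, one per block of 30 epochs.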

All baselines are implemented with the same training configurations mentioned above. The dropout rate is set to 0.3 for comparison if dropout is not applied in the basic model, following the instructions in Srivastava et al. (2014). For the noise rate in DisturbLabel, 0.05 is adopted for Wide-ResNet-28-10 on both CIFAR-10 and CIFAR-100 and for ResNet-110 on CIFAR-10, while 0.1 is used for ResNet-110 on CIFAR-100. Focal loss has two hyper-parameters, $\alpha$ and $\gamma$. Numerous combinations were tested on the validation set, and a single combination was ultimately chosen for all four experiments. For the L$_q$ loss, we adopt a value of $q$ different from the one that Zhang and Sabuncu (2018) report as performing best under most conditions, as we find it more suitable in our experiments. For center loss, we find that its performance is largely affected by the learning rate of the center loss module, so its initial learning rate is set to 0.5 for the best generalization performance.

For generator-based augmentation methods, we apply the GAN architectures introduced in Arjovsky et al. (2017); Mirza and Osindero (2014); Odena et al. (2017); Chen et al. (2016) to train the generators. For WGAN, a generator is trained for each class of the CIFAR-10 dataset. For CGAN, ACGAN and infoGAN, a single model suffices to generate images of all classes. A 100-dimensional noise vector drawn from a standard normal distribution is used as input to generate images corresponding to the given label. In particular, infoGAN takes an additional two-dimensional input, which represents specific attributes of the whole training set. Synthetic images are included at a fixed ratio in every mini-batch, with the proportion of generated images chosen based on experiments on the validation set.
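Including synthetic images at a fixed ratio in every mini-batch could look like the following NumPy sketch. The function `mixed_batch` and its `fake_ratio` parameter are hypothetical names, and the ratio value used in the usage note is a placeholder, since the proportion actually adopted is not given here.

```python
import numpy as np


def mixed_batch(real_x, real_y, fake_x, fake_y, batch_size, fake_ratio, rng):
    """Draw one mini-batch containing a fixed share of generated samples.

    fake_ratio is the fraction of the batch taken from the generator
    output (a placeholder hyper-parameter in this sketch).
    """
    n_fake = int(round(batch_size * fake_ratio))
    n_real = batch_size - n_fake
    # Sample without replacement from each pool.
    real_idx = rng.choice(len(real_x), size=n_real, replace=False)
    fake_idx = rng.choice(len(fake_x), size=n_fake, replace=False)
    x = np.concatenate([real_x[real_idx], fake_x[fake_idx]])
    y = np.concatenate([real_y[real_idx], fake_y[fake_idx]])
    # Shuffle so real and generated samples are interleaved.
    perm = rng.permutation(batch_size)
    return x[perm], y[perm]
```

For example, `mixed_batch(real_x, real_y, fake_x, fake_y, batch_size=128, fake_ratio=0.25, rng=np.random.default_rng(0))` would return 96 real and 32 generated samples in random order.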

Appendix C Reversing Convolutional Networks

To explicitly demonstrate the semantic changes generated by ISDA, we propose an algorithm to map deep features back to the pixel space. Some extra visualization results are shown in Figure 6.

An overview of the algorithm is presented in Figure 5. As there is no closed-form inverse function for convolutional networks like ResNet or DenseNet, the mapping algorithm acts in a similar way to Mahendran and Vedaldi (2015) and Upchurch et al. (2017), fixing the model and adjusting the inputs to find images corresponding to the given features. However, given that ISDA in essence augments the semantics of images, we find that directly optimizing the inputs in the pixel space is ineffective. Therefore, we add a fixed pre-trained generator $\mathcal{G}$, obtained by training a Wasserstein GAN Arjovsky et al. (2017), to produce images for the classification model, and optimize the inputs of the generator instead. This approach makes it possible to effectively reconstruct images with augmented semantics.

Figure 5: Overview of the algorithm. We adopt a fixed generator $\mathcal{G}$, obtained by training a Wasserstein GAN, to generate fake images for convolutional networks, and optimize the inputs of $\mathcal{G}$ in terms of the consistency in both the pixel space and the deep feature space.

Figure 6: Extra visualization results.

The mapping algorithm can be divided into two steps:

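Step I. Assume a random variable $z$ is normalized to $\hat{z}$ and input to the generator $\mathcal{G}$, generating the fake image $\mathcal{G}(\hat{z})$. $x_i$ is a real image sampled from the dataset (such as CIFAR). $\mathcal{G}(\hat{z})$ and $x_i$ are forwarded through a pre-trained convolutional network to obtain the deep feature vectors $f(\mathcal{G}(\hat{z}))$ and $a_i$. The first step of the algorithm is to find the input noise variable $z_i$ corresponding to $x_i$, namely

$$z_i = \arg\min_{z} \left\| f(\mathcal{G}(\hat{z})) - a_i \right\|_2^2 + \eta \left\| \mathcal{G}(\hat{z}) - x_i \right\|_2^2, \quad \text{s.t.} \ \hat{z} = \frac{z - \mu_z}{\delta_z}, \tag{15}$$

where $\mu_z$ and $\delta_z$ are the average value and the standard deviation of $z$, respectively. The consistency in both the pixel space and the deep feature space is considered in the loss function, and we introduce a hyper-parameter $\eta$ to adjust the relative importance of the two objectives.

Step II. We augment $a_i$ with ISDA, forming $\tilde{a}_i$, and reconstruct it in the pixel space. Specifically, we search for $z_i'$ corresponding to $\tilde{a}_i$ in the deep feature space, with the start point $z_i$ found in Step I:

$$z_i' = \arg\min_{z'} \left\| f(\mathcal{G}(\hat{z}')) - \tilde{a}_i \right\|_2^2, \quad \text{s.t.} \ \hat{z}' = \frac{z' - \mu_{z'}}{\delta_{z'}}. \tag{16}$$

As the mean square error in the deep feature space is optimized to 0, $\mathcal{G}(\hat{z}_i')$ is supposed to represent the image corresponding to $\tilde{a}_i$.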
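The two objectives of the mapping algorithm can be sketched as plain functions. This is an illustrative reading of the description rather than the authors' code: the names `normalize`, `step1_loss` and `step2_loss` are hypothetical, `generator` and `features` stand for the fixed generator and the pre-trained convolutional network, and the sketch assumes squared L2 distances with the hyper-parameter `eta` weighting the pixel-space term.

```python
import numpy as np


def normalize(z):
    """Standardize the noise with its own mean and standard deviation."""
    return (z - z.mean()) / z.std()


def step1_loss(z, x, a, generator, features, eta):
    """Step I: match a real image in feature space and in pixel space."""
    image = generator(normalize(z))
    feature_term = np.sum((features(image) - a) ** 2)
    pixel_term = np.sum((image - x) ** 2)
    return feature_term + eta * pixel_term


def step2_loss(z, a_aug, generator, features):
    """Step II: match the augmented feature in feature space only."""
    image = generator(normalize(z))
    return np.sum((features(image) - a_aug) ** 2)
```

Step I would be minimized over `z` from a random start, and Step II would then be minimized starting from the solution of Step I, with the generator and the network kept fixed throughout.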

The proposed algorithm is performed on a single batch. In practice, a ResNet-32 network is used as the convolutional network. We solve Eqs. (15) and (16) with a standard gradient descent (GD) algorithm for 10000 iterations. The initial learning rate is set to 10 and 1 for Step I and Step II, respectively, and is divided by 10 every 2500 iterations. We apply a momentum of 0.9 and an L2 weight decay of 1e-4.

Appendix D Extra Experimental Results

(a) ResNet-110 on CIFAR-10
(b) ResNet-110 on CIFAR-100
Figure 7: Comparison with state-of-the-art image classification methods.

Curves of the test errors of state-of-the-art methods and ISDA are presented in Figure 7. ISDA consistently outperforms the other methods and shows the best generalization performance in all cases. Notably, ISDA decreases test errors more markedly on CIFAR-100, which demonstrates that our method is more suitable for datasets with fewer samples per class. This observation is consistent with the results in the paper. In addition, among the other methods, center loss shows performance competitive with ISDA on CIFAR-10, but it fails to significantly enhance generalization on CIFAR-100.