1 Introduction
Data augmentation is an effective technique to alleviate overfitting when training deep networks Krizhevsky and Hinton (2009); Krizhevsky et al. (2012); Simonyan and Zisserman (2015); He et al. (2016); Huang et al. (2017). In image recognition, it usually corresponds to applying content-preserving transformations, e.g., cropping, horizontal mirroring, rotation and color jittering, to the input samples. Although effective, these augmentation techniques are not capable of performing semantic transformations, such as changing the background of an object or the texture of a foreground object. Recent work has shown that data augmentation can be more powerful if (class identity preserving) semantic transformations are allowed Ratner et al. (2017); Bowles et al. (2018); Antoniou et al. (2018). For example, by training a generative adversarial network (GAN) for each class in the training set, one could sample an unlimited number of images from each generator. Unfortunately, this procedure is computationally intensive, because both training the generative models and running inference on them to obtain augmented samples are nontrivial tasks. Moreover, due to the extra augmented data, the training procedure is also likely to be prolonged.
In this paper, we propose an implicit semantic data augmentation (ISDA) algorithm for training deep image recognition networks. ISDA is highly efficient, as it requires neither training/inference of auxiliary networks nor explicit generation of extra training samples. Our approach is motivated by the intriguing observation in recent work that the features deep in a network are usually linearized Upchurch et al. (2017); Bengio et al. (2013). Specifically, there exist many semantic directions in the deep feature space, such that translating a data sample in the feature space along one of these directions results in a feature representation corresponding to another sample with the same class identity but different semantics. For example, a certain direction corresponds to the semantic transformation of "make-bespectacled". When the feature of a person who does not wear glasses is translated along this direction, the new feature may correspond to the same person, but with glasses (the new image can be explicitly reconstructed using proper algorithms, as shown in Upchurch et al. (2017)). Therefore, by searching for many such semantic directions, we can effectively augment the training set in a way complementary to traditional data augmentation techniques.
However, explicitly finding semantic directions is not a trivial task, and usually requires extensive human annotation Upchurch et al. (2017). In contrast, sampling directions randomly is efficient but may result in meaningless transformations. For example, it makes no sense to apply the "make-bespectacled" transformation to the "car" class. In this paper, we adopt a simple method that achieves a good balance between effectiveness and efficiency. Specifically, we perform an online estimate of the covariance matrix of the features for each class, which captures the intra-class variations. Then we sample directions from a zero-mean multivariate normal distribution with the estimated covariance, and apply them to the features of training samples in that class to augment the dataset. In this way, the chance of generating meaningless semantic transformations is significantly reduced.
To further improve the efficiency, we derive a closed-form upper bound of the expected cross-entropy (CE) loss under the proposed data augmentation scheme. Therefore, instead of performing the augmentation procedure explicitly, we can directly minimize the upper bound, which is in fact a novel robust loss function. As there is no need to generate explicit data samples, we call our algorithm implicit semantic data augmentation (ISDA). Compared to existing semantic data augmentation algorithms, the proposed ISDA can be conveniently implemented on top of most deep models without introducing auxiliary models or noticeable extra computational cost. Although simple, the proposed ISDA algorithm is surprisingly effective, and complements existing non-semantic data augmentation techniques quite well. Extensive empirical evaluations on several competitive image classification benchmarks show that ISDA consistently improves the generalization performance of popular deep networks, especially when training data is limited or powerful traditional augmentation techniques are applied.
2 Related Work
In this section, we briefly review existing research on related topics.
Data augmentation is a widely used technique to alleviate overfitting in training deep networks. For example, in image recognition tasks, data augmentation techniques like random flipping, mirroring and rotation are applied to enforce certain invariances in convolutional networks He et al. (2016); Huang et al. (2017); Simonyan and Zisserman (2015); Srivastava et al. (2015). Recently, automatic data augmentation techniques, e.g., AutoAugment Cubuk et al. (2018), have been proposed to search for a better augmentation strategy among a large pool of candidates. Similar to our method, learning with marginalized corrupted features Maaten et al. (2013) can be viewed as an implicit data augmentation technique, but it is limited to simple linear models. Complementarily, recent research shows that semantic data augmentation techniques which apply class identity preserving transformations (e.g., changing the backgrounds of objects or varying visual angles) to the training data are effective as well Jaderberg et al. (2016); Bousmalis et al. (2017); Ratner et al. (2017); Antoniou et al. (2018). This is usually achieved by generating extra semantically transformed training samples with specialized deep structures such as DAGAN Antoniou et al. (2018), domain adaptation networks Bousmalis et al. (2017) or other GAN-based generators Jaderberg et al. (2016); Ratner et al. (2017). Although effective, these approaches are nontrivial to implement and computationally expensive, due to the need to train generative models beforehand and perform inference on them during training.
Robust loss functions. As shown in the paper, ISDA amounts to minimizing a novel robust loss function. Therefore, we give a brief review of related work on this topic. Recently, several robust loss functions have been proposed for deep learning. For example, the $L_q$ loss Zhang and Sabuncu (2018) is a balanced noise-robust form between the cross-entropy (CE) loss and the mean absolute error (MAE) loss, derived from the negative Box-Cox transformation. Focal loss Lin et al. (2017) attaches high weights to a sparse set of hard examples to prevent the vast number of easy samples from dominating the training of the network. The idea of introducing a large margin into the CE loss has been proposed in Liu et al. (2016); Liang et al. (2017); Wang et al. (2018). In Sun et al. (2014), the CE loss and the contrastive loss are combined to learn more discriminative features. From a similar perspective, center loss Wen et al. (2016) simultaneously learns a center for the deep features of each class and penalizes the distances between samples and their corresponding class centers in the feature space, enhancing intra-class compactness and inter-class separability.

Semantic transformations in deep feature space. Our work is motivated by the fact that high-level representations learned by deep convolutional networks can potentially capture abstractions with semantics Bengio and others (2009); Bengio et al. (2013). In fact, translating deep features along certain directions has been shown to correspond to performing meaningful semantic transformations on the input images. For example, deep feature interpolation Upchurch et al. (2017) leverages simple interpolations of deep features from pre-trained neural networks to achieve semantic image transformations. Variational autoencoder (VAE) and generative adversarial network (GAN) based methods Choi et al. (2018); Zhu et al. (2017); He et al. (2017) establish a latent representation corresponding to the abstractions of images, which can be manipulated to edit the semantics of images. Generally, these methods reveal that certain directions in the deep feature space correspond to meaningful semantic transformations, and can be leveraged to perform semantic data augmentation.

3 Method
Deep networks are known to excel at forming high-level representations in the deep feature space He et al. (2016); Huang et al. (2017); Upchurch et al. (2017); Ren et al. (2015), where the semantic relations between samples can be captured by the relative positions of their features Bengio et al. (2013). Previous work has demonstrated that translating features along specific directions corresponds to meaningful semantic transformations when the features are mapped back to the input space Upchurch et al. (2017); Li et al. (2016); Bengio et al. (2013). Based on this observation, we propose to directly augment the training data in the feature space, and integrate this procedure into the training of deep models.
The proposed implicit semantic data augmentation (ISDA) approach has two important components: online estimation of class-conditional covariance matrices, and optimization with a robust loss function. The first component aims to find a distribution from which we can sample meaningful semantic transformation directions for data augmentation, while the second saves us from explicitly generating large amounts of extra training data, leading to remarkable efficiency compared to existing data augmentation techniques.
3.1 Semantic Transformations in Deep Feature Space
As aforementioned, certain directions in the deep feature space correspond to meaningful semantic transformations like "make-bespectacled" or "change-view-angle". This motivates us to augment the training set by applying such semantic transformations to deep features. However, manually searching for semantic directions is infeasible for large-scale problems. To address this problem, we propose to approximate the procedure by sampling random vectors from a normal distribution with zero mean and a covariance that is proportional to the intra-class covariance matrix, which captures the variance of samples in that class and is thus likely to contain rich semantic information. Intuitively, features of the person class may vary along the "wear-glasses" direction, while having nearly zero variance along the "has-propeller" direction, which only occurs in other classes like the plane class. We hope that the directions corresponding to meaningful transformations for each class are well represented by the principal components of the covariance matrix of that class.

Consider training a deep network $G$ with weights $\Theta$ on a training set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{1, \ldots, C\}$ is the label of the $i$-th sample $x_i$ over $C$ classes. Let the $A$-dimensional vector $a_i = [a_{i1}, \ldots, a_{iA}]^\top = G(x_i; \Theta)$ denote the deep features of $x_i$ learned by $G$, and $a_{ij}$ indicate the $j$-th element of $a_i$.
To obtain semantic directions with which to augment $a_i$, we randomly sample vectors from a zero-mean multivariate normal distribution $\mathcal{N}(0, \Sigma_{y_i})$, where $\Sigma_{y_i}$ is the class-conditional covariance matrix estimated from the features of all samples in class $y_i$. In implementation, the covariance matrices are computed in an online fashion by aggregating statistics from all mini-batches. The online estimation algorithm is given in Section A of the supplementary material.
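The estimation and sampling steps above can be sketched in a few lines of NumPy. This is a simplified, offline version with function names of our own choosing; the actual algorithm estimates the covariance matrices online over mini-batches, as described above.

```python
import numpy as np

def class_covariances(features, labels, num_classes):
    """Estimate one intra-class covariance matrix per class.

    features: (N, A) array of deep features; labels: (N,) int array.
    Returns a list of (A, A) covariance matrices.
    """
    return [np.cov(features[labels == c].T) for c in range(num_classes)]

def augment_features(features, labels, covs, lam, rng):
    """Translate each feature along a direction drawn from N(0, lam * Sigma_y)."""
    out = np.empty_like(features)
    for i, (a, y) in enumerate(zip(features, labels)):
        delta = rng.multivariate_normal(np.zeros(len(a)), lam * covs[y])
        out[i] = a + delta
    return out
```

Sampling from the class-conditional covariance keeps the perturbations inside directions along which that class actually varies, which is what reduces the chance of meaningless transformations.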
During training, $C$ covariance matrices are computed, one for each class. The augmented feature $\tilde{a}_i$ is obtained by translating $a_i$ along a random direction sampled from $\mathcal{N}(0, \lambda \Sigma_{y_i})$. Equivalently, we have

$$\tilde{a}_i \sim \mathcal{N}(a_i, \lambda \Sigma_{y_i}), \tag{1}$$

where $\lambda$ is a positive coefficient that controls the strength of the semantic data augmentation. As the covariances are computed dynamically during training, their estimates in the first few epochs are not very informative, since the network is not yet well trained. To address this issue, we let $\lambda = (t/T) \times \lambda_0$ be a function of the current iteration $t$, with $T$ the total number of iterations and $\lambda_0$ a constant, so as to reduce the impact of the estimated covariances early in the training stage.

3.2 Implicit Semantic Data Augmentation (ISDA)
A naive method to implement ISDA is to explicitly augment each $a_i$ $M$ times, forming an augmented feature set $\{(\tilde{a}_i^1, y_i), \ldots, (\tilde{a}_i^M, y_i)\}_{i=1}^{N}$ of size $MN$, where $\tilde{a}_i^m$ is the $m$-th copy of the augmented features for sample $x_i$. Then the network is trained by minimizing the cross-entropy (CE) loss:

$$\mathcal{L}_M(W, b, \Theta) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{M}\sum_{m=1}^{M} -\log\frac{e^{w_{y_i}^\top \tilde{a}_i^m + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^\top \tilde{a}_i^m + b_j}}, \tag{2}$$

where $W = [w_1, \ldots, w_C]^\top$ and $b = [b_1, \ldots, b_C]^\top$ are the weight matrix and bias vector corresponding to the final fully connected layer, respectively.
Obviously, this naive implementation is computationally inefficient when $M$ is large, as the feature set is enlarged by a factor of $M$. In the following, we consider the case where $M$ grows to infinity, and find that an easy-to-compute upper bound can be derived for the loss function, leading to a highly efficient implementation.
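For concreteness, the naive procedure can be written down directly. The following NumPy sketch (with hypothetical function names) averages the CE loss over M explicitly sampled augmentations per sample:

```python
import numpy as np

def softmax_ce(logits, y):
    # numerically stable cross-entropy: -log softmax(logits)[y]
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[y])

def naive_isda_loss(features, labels, W, b, covs, lam, M, rng):
    """Explicit ISDA: sample M augmented copies of every feature and
    average the CE loss over all of them (the loss L_M in Eq. (2))."""
    total = 0.0
    for a, y in zip(features, labels):
        for _ in range(M):
            a_aug = rng.multivariate_normal(a, lam * covs[y])
            total += softmax_ce(W @ a_aug + b, y)
    return total / (len(features) * M)
```

The cost grows linearly in M, which is exactly what the closed-form bound derived next avoids.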
Upper bound of the loss function. In the case $M \to \infty$, we are in fact considering the expectation of the CE loss over all possible augmented features. Specifically, $\mathcal{L}_\infty$ is given by:

$$\mathcal{L}_\infty(W, b, \Theta) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\tilde{a}_i}\left[-\log\frac{e^{w_{y_i}^\top \tilde{a}_i + b_{y_i}}}{\sum_{j=1}^{C} e^{w_j^\top \tilde{a}_i + b_j}}\right]. \tag{3}$$
If $\mathcal{L}_\infty$ could be computed efficiently, we could directly minimize it without explicitly sampling augmented features. However, Eq. (3) is difficult to compute in its exact form. Alternatively, we find that it is possible to derive an easy-to-compute upper bound for $\mathcal{L}_\infty$, as given by the following proposition.
Proposition 1. Suppose that $\tilde{a}_i \sim \mathcal{N}(a_i, \lambda \Sigma_{y_i})$. Then we have an upper bound of $\mathcal{L}_\infty$, given by

$$\mathcal{L}_\infty \leq \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{z_{y_i}^i}}{\sum_{j=1}^{C} e^{z_j^i}} \triangleq \overline{\mathcal{L}}_\infty, \tag{4}$$

where $z_j^i = w_j^\top a_i + b_j + \frac{\lambda}{2}(w_j - w_{y_i})^\top \Sigma_{y_i} (w_j - w_{y_i})$.
Proof. According to the definition of $\mathcal{L}_\infty$ in (3), we have:

$$\mathcal{L}_\infty = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\tilde{a}_i}\left[\log \sum_{j=1}^{C} e^{(w_j - w_{y_i})^\top \tilde{a}_i + (b_j - b_{y_i})}\right] \tag{5}$$
$$\leq \frac{1}{N}\sum_{i=1}^{N} \log \sum_{j=1}^{C} \mathbb{E}_{\tilde{a}_i}\left[e^{(w_j - w_{y_i})^\top \tilde{a}_i + (b_j - b_{y_i})}\right] \tag{6}$$
$$= \frac{1}{N}\sum_{i=1}^{N} \log \sum_{j=1}^{C} e^{(w_j - w_{y_i})^\top a_i + (b_j - b_{y_i}) + \frac{\lambda}{2}(w_j - w_{y_i})^\top \Sigma_{y_i} (w_j - w_{y_i})} \tag{7}$$
$$= \overline{\mathcal{L}}_\infty. \tag{8}$$

In the above, Inequality (6) follows from Jensen's inequality $\mathbb{E}[\log X] \leq \log \mathbb{E}[X]$, as the logarithmic function is concave. Eq. (7) is obtained by leveraging the moment-generating function of the Gaussian distribution, $\mathbb{E}[e^X] = e^{\mu + \frac{1}{2}\sigma^2}$ for $X \sim \mathcal{N}(\mu, \sigma^2)$, due to the fact that $(w_j - w_{y_i})^\top \tilde{a}_i + (b_j - b_{y_i})$ is a Gaussian random variable, i.e.,

$$(w_j - w_{y_i})^\top \tilde{a}_i + (b_j - b_{y_i}) \sim \mathcal{N}\left((w_j - w_{y_i})^\top a_i + (b_j - b_{y_i}),\; \lambda (w_j - w_{y_i})^\top \Sigma_{y_i} (w_j - w_{y_i})\right). \;\;\square$$
Essentially, Proposition 1 provides a surrogate loss for our implicit data augmentation algorithm. Instead of minimizing the exact loss function $\mathcal{L}_\infty$, we can optimize its upper bound $\overline{\mathcal{L}}_\infty$ in a much more efficient way. Therefore, the proposed ISDA boils down to a novel robust loss function, which can be easily adopted by most deep models. In addition, we can observe that when $\lambda \to 0$, which means no features are augmented, $\overline{\mathcal{L}}_\infty$ reduces to the standard CE loss.
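Under these definitions, the surrogate loss is a small modification of standard softmax cross-entropy: only the logits of non-target classes receive a nonnegative quadratic shift. A NumPy sketch follows (names are ours; a practical implementation would batch this and rely on the framework's autograd):

```python
import numpy as np

def isda_loss(features, labels, W, b, covs, lam):
    """Closed-form ISDA surrogate loss (the upper bound in Eq. (4)).

    For each sample, the logit of class j is shifted by
    (lam / 2) * (w_j - w_y)^T Sigma_y (w_j - w_y); the shift of the
    target class is zero, so lam = 0 recovers the plain CE loss.
    """
    total = 0.0
    for a, y in zip(features, labels):
        d = W - W[y]                                   # rows: w_j - w_y
        shift = 0.5 * lam * np.einsum('ja,ab,jb->j', d, covs[y], d)
        z = W @ a + b + shift
        z = z - z.max()                                # numerical stability
        total += np.log(np.exp(z).sum()) - z[y]
    return total / len(features)
```

Because the shift is zero for the target class and nonnegative elsewhere (the covariance is positive semi-definite), increasing lam can only increase the loss relative to plain cross-entropy.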
In summary, the proposed ISDA can simply be plugged into deep networks as a robust loss function, and efficiently optimized with the stochastic gradient descent (SGD) algorithm. We present the pseudo code of ISDA in Algorithm 1. Details of estimating covariance matrices and computing gradients are presented in Appendix A.

4 Experiments
In this section, we empirically validate the proposed algorithm on several widely used image classification benchmarks, i.e., CIFAR-10, CIFAR-100 Krizhevsky and Hinton (2009) and ImageNet Deng et al. (2009). We first evaluate the effectiveness of ISDA with different deep network architectures on these datasets. Second, we apply several recently proposed non-semantic image augmentation methods in addition to the standard baseline augmentation, and investigate the performance of ISDA. Third, we present comparisons with state-of-the-art robust loss functions and generator-based semantic data augmentation algorithms. Finally, ablation studies are conducted to examine the effectiveness of each component. We also visualize the augmented samples in the original input space with the aid of a generative network.
4.1 Datasets and Baselines
Datasets. We use three image recognition benchmarks in the experiments. (1) The two CIFAR datasets consist of 32x32 colored natural images in 10 classes for CIFAR-10 and 100 classes for CIFAR-100, with 50,000 images for training and 10,000 images for testing. In our experiments, we hold out 5,000 images from the training set as a validation set to search for the hyper-parameter $\lambda_0$. These samples are also used for training after an optimal $\lambda_0$ is selected, and the results on the test set are reported. Images are normalized with channel means and standard deviations for preprocessing. For non-semantic data augmentation of the training set, we follow the standard operation in Howard (2014): 4 pixels are padded on each side of the image, followed by random 32x32 cropping combined with random horizontal flipping. (2) ImageNet is a 1,000-class dataset from ILSVRC 2012 Deng et al. (2009), providing 1.2 million images for training and 50,000 images for validation. We adopt the same augmentation configurations as in Krizhevsky et al. (2012); He et al. (2016); Huang et al. (2017).

Non-semantic augmentation techniques. To study the complementary effects of ISDA with respect to traditional data augmentation methods, two state-of-the-art non-semantic augmentation techniques are applied, with and without ISDA. (1) Cutout DeVries and Taylor (2017) randomly masks out square regions of the input during training to regularize the model. (2) AutoAugment Cubuk et al. (2019) automatically searches for the best augmentation policies to yield the highest validation accuracy on a target dataset. All hyper-parameters are the same as reported in the papers introducing them.
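For reference, the masking operation of Cutout can be sketched as follows. This is a simplified version with names of our own choosing; as in the original method, the patch center is sampled uniformly and the patch is clipped at image borders.

```python
import numpy as np

def cutout(image, size, rng):
    """Zero out a square patch of side `size` centered at a random pixel.

    image: (H, W, C) array; returns a masked copy, leaving the input intact.
    """
    h, w = image.shape[:2]
    cy, cx = rng.integers(h), rng.integers(w)
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()
    out[y0:y1, x0:x1] = 0.0
    return out
```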
Method  Params  CIFAR-10  CIFAR-100
ResNet-32 He et al. (2016)  0.5M  7.39±0.10%  31.20±0.41%
ResNet-32 + ISDA  0.5M  7.09±0.12%  30.27±0.34%
ResNet-110 He et al. (2016)  1.7M  6.76±0.34%  28.67±0.44%
ResNet-110 + ISDA  1.7M  6.33±0.19%  27.57±0.46%
SE-ResNet-110 Hu et al. (2018)  1.7M  6.14±0.17%  27.30±0.03%
SE-ResNet-110 + ISDA  1.7M  5.96±0.21%  26.63±0.21%
Wide-ResNet-16-8 Zagoruyko and Komodakis (2017)  11.0M  4.25±0.18%  20.24±0.27%
Wide-ResNet-16-8 + ISDA  11.0M  4.04±0.29%  19.91±0.21%
Wide-ResNet-28-10 Zagoruyko and Komodakis (2017)  36.5M  3.82±0.15%  18.53±0.07%
Wide-ResNet-28-10 + ISDA  36.5M  3.58±0.15%  17.98±0.15%
ResNeXt-29, 8x64d Xie et al. (2017)  34.4M  3.86±0.14%  18.16±0.13%
ResNeXt-29, 8x64d + ISDA  34.4M  3.67±0.12%  17.43±0.25%
DenseNet-BC-100-12 Huang et al. (2017)  0.8M  4.90±0.08%  22.61±0.10%
DenseNet-BC-100-12 + ISDA  0.8M  4.54±0.07%  22.10±0.34%
DenseNet-BC-190-40 Huang et al. (2017)  15.2M  3.52%  17.74%
DenseNet-BC-190-40 + ISDA  15.2M  3.24%  17.42%
Dataset  Networks  Cutout DeVries and Taylor (2017)  Cutout + ISDA  AA Cubuk et al. (2019)  AA + ISDA
CIFAR-10  Wide-ResNet-28-10 Zagoruyko and Komodakis (2017)  2.99±0.06%  2.83±0.04%  2.65±0.07%  2.56±0.01%
CIFAR-10  Shake-Shake (26, 2x32d) Gastaldi (2017)  3.16±0.09%  2.93±0.03%  2.89±0.09%  2.68±0.12%
CIFAR-10  Shake-Shake (26, 2x112d) Gastaldi (2017)  2.36%  2.25%  2.01%  1.82%
CIFAR-100  Wide-ResNet-28-10 Zagoruyko and Komodakis (2017)  18.05±0.25%  17.02±0.11%  16.60±0.40%  15.62±0.32%
CIFAR-100  Shake-Shake (26, 2x32d) Gastaldi (2017)  18.92±0.21%  18.17±0.08%  17.50±0.19%  17.21±0.33%
CIFAR-100  Shake-Shake (26, 2x112d) Gastaldi (2017)  17.34±0.28%  16.24±0.20%  15.21±0.20%  13.87±0.26%
Baselines. Our method is compared to several baselines, including state-of-the-art robust loss functions and generator-based semantic data augmentation methods. (1) Dropout Srivastava et al. (2014) is a widely used regularization approach which randomly mutes some neurons during training. (2) Large-margin softmax loss Liu et al. (2016) introduces a large decision margin, measured by a cosine distance, into the standard CE loss. (3) Disturb label Xie et al. (2016) is a regularization mechanism that randomly replaces a fraction of labels with incorrect ones in each iteration. (4) Focal loss Lin et al. (2017) focuses on a sparse set of hard examples to prevent easy samples from dominating the training procedure. (5) Center loss Wen et al. (2016) simultaneously learns a center of features for each class and minimizes the distances between deep features and their corresponding class centers. (6) $L_q$ loss Zhang and Sabuncu (2018) is a noise-robust loss function based on the negative Box-Cox transformation. (7) For generator-based semantic augmentation methods, we train several state-of-the-art GANs Arjovsky et al. (2017); Mirza and Osindero (2014); Odena et al. (2017); Chen et al. (2016), which are then used to generate extra training samples for data augmentation. For fair comparison, all methods are implemented with the same training configurations whenever possible. Details of hyper-parameter settings are presented in Appendix B.

Training details. For deep networks, we implement ResNet, SE-ResNet, Wide-ResNet, ResNeXt, DenseNet and PyramidNet on the two CIFAR datasets, and ResNet on ImageNet. Detailed configurations for these models are given in Appendix B. The hyper-parameter $\lambda_0$ for ISDA is selected from a small candidate set according to the performance on the validation set. On ImageNet, due to GPU memory limitations, we approximate the covariance matrices by their diagonals, i.e., the variance of each dimension of the features, and select the best hyper-parameter in the same way.
4.2 Main Results
Table 1 presents the performance of several state-of-the-art deep networks with and without ISDA. It can be observed that ISDA consistently improves the generalization performance of these models, especially with fewer training samples per class. On CIFAR-100, for relatively small models like ResNet-32 and ResNet-110, ISDA reduces test errors by about 1%, while for larger models like Wide-ResNet-28-10 and ResNeXt-29, 8x64d, our method outperforms the competitive baselines by 0.55% and 0.73%, respectively. Compared to ResNets, DenseNets generally suffer less from overfitting due to their architecture design, and thus appear to benefit less from our algorithm.
Table 2 shows experimental results with recently proposed powerful traditional image augmentation methods (i.e., Cutout DeVries and Taylor (2017) and AutoAugment Cubuk et al. (2019)). Interestingly, ISDA seems to be even more effective when combined with these techniques. For example, when applied on top of AutoAugment, ISDA achieves performance gains of 1.34% and 0.98% on CIFAR-100 with Shake-Shake (26, 2x112d) and Wide-ResNet-28-10, respectively. Notice that these improvements are more significant than in the standard setting. A plausible explanation for this phenomenon is that non-semantic augmentation methods help to learn a better feature representation, which makes semantic transformations in the deep feature space more reliable. The curves of test errors during training on CIFAR-100 with Wide-ResNet-28-10 are presented in Figure 3. It is clear that ISDA achieves a significant improvement after the third learning rate drop, and shows even better performance after the fourth drop.
Method  ResNet-110 CIFAR-10  ResNet-110 CIFAR-100  Wide-ResNet-28-10 CIFAR-10  Wide-ResNet-28-10 CIFAR-100
Large Margin Liu et al. (2016)  6.46±0.20%  28.00±0.09%  3.69±0.10%  18.48±0.05%
Disturb Label Xie et al. (2016)  6.61±0.04%  28.46±0.32%  3.91±0.10%  18.56±0.22%
Focal Loss Lin et al. (2017)  6.68±0.22%  28.28±0.32%  3.62±0.07%  18.22±0.08%
Center Loss Wen et al. (2016)  6.38±0.20%  27.85±0.10%  3.76±0.05%  18.50±0.25%
$L_q$ Loss Zhang and Sabuncu (2018)  6.69±0.07%  28.78±0.35%  3.78±0.08%  18.43±0.37%
WGAN Arjovsky et al. (2017)  6.63±0.23%  –  3.81±0.08%  –
CGAN Mirza and Osindero (2014)  6.56±0.14%  28.25±0.36%  3.84±0.07%  18.79±0.08%
ACGAN Odena et al. (2017)  6.32±0.12%  28.48±0.44%  3.81±0.11%  18.54±0.05%
InfoGAN Chen et al. (2016)  6.59±0.12%  27.64±0.14%  3.81±0.05%  18.44±0.10%
Basic  6.76±0.34%  28.67±0.44%  –  –
Basic + Dropout  6.23±0.11%  27.11±0.06%  3.82±0.15%  18.53±0.07%
ISDA  6.33±0.19%  27.57±0.46%  –  –
ISDA + Dropout  5.98±0.20%  26.35±0.30%  3.58±0.15%  17.98±0.15%
Method  Top-1  Top-5
ResNet-50 He et al. (2016)  23.58%  6.92%
ResNet-50 + ISDA  23.30%  6.82%
ResNet-152 He et al. (2016)  21.65%  6.01%
ResNet-152 + ISDA  21.20%  5.67%
Table 4 presents the performance of ISDA on the large-scale ImageNet dataset. It can be observed that ISDA reduces the Top-1 error rate by 0.45% for the ResNet-152 model. The training and test error curves are shown in Figure 3. Notably, ISDA achieves a slightly higher training error but a lower test error, indicating that ISDA performs effective regularization on deep networks.
4.3 Comparison with Other Approaches
We compare ISDA with a number of competitive baselines described in Section 4.1, ranging from robust loss functions to semantic data augmentation algorithms based on generative models. The results are summarized in Table 3, and the training curves are presented in Appendix D. One can observe that ISDA compares favorably with all the competitive baseline algorithms. With ResNet-110, the best test errors achieved by the compared robust loss functions are 6.38% and 27.85% on CIFAR-10 and CIFAR-100, respectively, while ISDA with dropout achieves 5.98% and 26.35%.
Among all GAN-based semantic augmentation methods, ACGAN gives the best performance, especially on CIFAR-10. However, these models generally suffer a performance reduction on CIFAR-100, which does not contain enough samples to learn a valid generator for each class. In contrast, ISDA shows consistent improvements on all datasets. In addition, GAN-based methods require additional computation to train the generators, and introduce significant overhead to the training process. In comparison, ISDA not only leads to lower generalization error, but is also simpler and more efficient.
4.4 Visualization Results
To demonstrate that our method is able to generate meaningful semantically augmented samples, we introduce an approach to map the augmented features back to the pixel space, in order to explicitly show the semantic changes of the images. Due to space limitations, we defer the detailed introduction of the mapping algorithm to Appendix C.
Figure 4 shows the visualization results. The first and second columns show the original images and the reconstructed images without any augmentation. The remaining columns present images augmented by the proposed ISDA. It can be observed that ISDA is able to alter the semantics of images, e.g., backgrounds, visual angles, the colors and types of cars, and skin colors, which is not possible for traditional data augmentation techniques.
4.5 Ablation Study
Setting  CIFAR-10  CIFAR-100
Basic  3.82±0.15%  18.58±0.10%
Identity matrix  3.63±0.12%  18.53±0.02%
Diagonal matrix  3.70±0.15%  18.23±0.02%
Single covariance matrix  3.67±0.07%  18.29±0.13%
Constant $\lambda$  3.69±0.08%  18.33±0.16%
ISDA  3.58±0.15%  17.98±0.15%
To better understand the effectiveness of the different components of ISDA, we conduct a series of ablation studies. Specifically, several variants are considered: (1) Identity matrix means replacing the covariance matrix $\Sigma_{y_i}$ by the identity matrix. (2) Diagonal matrix means using only the diagonal elements of the covariance matrix $\Sigma_{y_i}$. (3) Single covariance matrix means using a global covariance matrix computed from the features of all classes. (4) Constant $\lambda$ means using a constant $\lambda$ instead of letting it be a function of the training iterations.
Table 5 presents the ablation results. Adopting the identity matrix increases the test error by 0.05% on CIFAR-10 and nearly 0.56% on CIFAR-100. Using a single covariance matrix degrades the generalization performance as well. The reason is likely that both variants fail to find proper directions in the deep feature space for meaningful semantic transformations. Adopting a diagonal matrix also hurts the performance, as it ignores correlations between feature dimensions.
5 Conclusion
In this paper, we proposed an efficient implicit semantic data augmentation algorithm (ISDA) to complement existing data augmentation techniques. Different from existing approaches leveraging generative models to augment the training set with semantically transformed samples, our approach is considerably more efficient and easier to implement. In fact, we showed that ISDA can be formulated as a novel robust loss function, which is compatible with any deep network with the crossentropy loss. Extensive results on several competitive image classification datasets demonstrate the effectiveness and efficiency of the proposed algorithm.
Acknowledgments
Gao Huang is supported in part by Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106 and Tencent AI Lab RhinoBird Focused Research Program under grant JR201914.
References
 Data augmentation generative adversarial networks. CoRR abs/1711.04340. Cited by: §1, §2.
 Wasserstein gan. CoRR abs/1701.07875. Cited by: Appendix B, Appendix C, §4.1, Table 3.
 Better mixing via deep representations. In ICML, pp. 552–560. Cited by: §1, §2, §3.

Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (1), pp. 1–127. Cited by: §2.
 Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, pp. 3722–3731. Cited by: §2.
 GAN augmentation: augmenting training data using generative adversarial networks. CoRR abs/1810.10863. Cited by: §1.
 Infogan: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, pp. 2172–2180. Cited by: Appendix B, §4.1, Table 3.

StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, pp. 8789–8797. Cited by: §2.
 AutoAugment: learning augmentation policies from data. In CVPR. Cited by: §4.1, §4.2, Table 2.
 AutoAugment: learning augmentation policies from data. CoRR abs/1805.09501. Cited by: §2.
 ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.1, §4.

Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §4.1, §4.2, Table 2.
 Shake-shake regularization. arXiv preprint arXiv:1705.07485. Cited by: Table 2.
 Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §2, §3, §4.1, Table 1, Table 4.
 AttGAN: facial attribute editing by only changing what you want. CoRR abs/1711.10678. Cited by: §2.
 Some improvements on deep convolutional neural network based image classification. CoRR abs/1312.5402. Cited by: §4.1.
 Squeezeandexcitation networks. In CVPR, pp. 7132–7141. Cited by: Table 1.
 Densely Connected Convolutional Networks. In CVPR, pp. 2261–2269. Cited by: §1, §2, §3, §4.1, Table 1.
 Deep networks with stochastic depth. In ECCV, pp. 646–661. Cited by: Appendix B.

Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116 (1), pp. 1–20. Cited by: §2.
 Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: §1, §4.
 ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §1, §4.1.
 Convolutional network for attribute-driven and identity-preserving human face generation. CoRR abs/1608.06434. Cited by: §3.
 Soft-margin softmax for deep classification. In ICONIP. Cited by: §2.
 Focal loss for dense object detection. In ICCV, pp. 2999–3007. Cited by: §2, §4.1, Table 3.
 Large-margin softmax loss for convolutional neural networks. In ICML. Cited by: §2, §4.1, Table 3.
 Learning with marginalized corrupted features. In ICML, pp. 410–418. Cited by: §2.
 Understanding deep image representations by inverting them. In CVPR, pp. 5188–5196. Cited by: Appendix C.
 Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: Appendix B, §4.1, Table 3.

Conditional image synthesis with auxiliary classifier GANs. In ICML, pp. 2642–2651. Cited by: Appendix B, §4.1, Table 3.
 Learning to compose domain-specific transformations for data augmentation. In NeurIPS, pp. 3236–3246. Cited by: §1, §2.
 Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §3.
 Very deep convolutional networks for large-scale image recognition. In ICLR. Cited by: §1, §2.
 Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: Appendix B, §4.1.
 Training very deep networks. In NeurIPS, pp. 2377–2385. Cited by: §2.
 Deep learning face representation by joint identificationverification. In NeurIPS, Cited by: §2.
 Deep feature interpolation for image content changes. In CVPR, pp. 6090–6099. Cited by: Appendix C, §1, §1, §2, §3.
 Ensemble soft-margin softmax loss for image classification. In IJCAI. Cited by: §2.

A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515. Cited by: §2, §4.1, Table 3.
 DisturbLabel: regularizing CNN on the loss layer. In CVPR, pp. 4753–4762. Cited by: §4.1, Table 3.
 Aggregated residual transformations for deep neural networks. In CVPR, pp. 1492–1500. Cited by: Table 1.
 Wide residual networks. In BMVC, Cited by: Table 1, Table 2.
 Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, Cited by: Appendix B, §2, §4.1, Table 3.
 Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2223–2232. Cited by: §2.
Appendix A Implementation Details of ISDA
Dynamic estimation of covariance matrices. During the training process with the ISDA loss, the covariance matrices are estimated by:

$$ \mu_j^{(t)} = \frac{n_j^{(t-1)} \mu_j^{(t-1)} + m_j^{(t)} {\mu'}_j^{(t)}}{n_j^{(t-1)} + m_j^{(t)}}, \tag{9} $$

$$ \Sigma_j^{(t)} = \frac{n_j^{(t-1)} \Sigma_j^{(t-1)} + m_j^{(t)} {\Sigma'}_j^{(t)} + \frac{n_j^{(t-1)} m_j^{(t)}}{n_j^{(t-1)} + m_j^{(t)}} \big(\mu_j^{(t-1)} - {\mu'}_j^{(t)}\big) \big(\mu_j^{(t-1)} - {\mu'}_j^{(t)}\big)^{\top}}{n_j^{(t-1)} + m_j^{(t)}}, \tag{10} $$

$$ n_j^{(t)} = n_j^{(t-1)} + m_j^{(t)}, \tag{11} $$

where $\mu_j^{(t)}$ and $\Sigma_j^{(t)}$ are the estimates of the average values and covariance matrices of the features of class $j$ at step $t$; ${\mu'}_j^{(t)}$ and ${\Sigma'}_j^{(t)}$ are the average values and covariance matrices of the features of class $j$ in the $t$-th mini-batch; $n_j^{(t)}$ denotes the total number of training samples belonging to class $j$ in the first $t$ mini-batches, and $m_j^{(t)}$ denotes the number of training samples belonging to class $j$ in the $t$-th mini-batch alone.
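Equations (9)–(11) are the standard pairwise (pooled) mean/covariance aggregation. A minimal numpy sketch follows; the function name and per-class bookkeeping are illustrative, not taken from the authors' code:

```python
import numpy as np

def update_class_stats(mu, sigma, n, feats):
    """Online update of one class's feature mean and covariance, Eqs. (9)-(11).

    mu    : (d,)   running mean estimate from the previous steps
    sigma : (d, d) running (biased) covariance estimate
    n     : number of samples of this class seen so far
    feats : (m, d) features of this class in the current mini-batch
    """
    m = feats.shape[0]
    if m == 0:
        return mu, sigma, n
    mu_b = feats.mean(axis=0)                 # mini-batch mean
    centered = feats - mu_b
    sigma_b = centered.T @ centered / m       # mini-batch covariance (biased)
    total = n + m
    delta = (mu - mu_b).reshape(-1, 1)
    sigma_new = (n * sigma + m * sigma_b
                 + (n * m / total) * (delta @ delta.T)) / total
    mu_new = (n * mu + m * mu_b) / total
    return mu_new, sigma_new, total
```

Applying the update batch by batch yields exactly the mean and (biased) covariance of all samples seen so far, which is what makes the estimation memory-efficient during training.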
Gradient computation. Write the per-sample loss as $\mathcal{L}_i = -\log\big(e^{z_{y_i}} / \sum_{j=1}^{C} e^{z_j}\big)$, where $z$ is the vector of augmented logits. In backward propagation, the gradients of $\mathcal{L}_i$ are given by:

$$ \frac{\partial \mathcal{L}_i}{\partial z_j} = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}} - \mathbb{1}\{j = y_i\}, \tag{12} $$

$$ \frac{\partial z_j}{\partial \boldsymbol{a}_i} = \boldsymbol{w}_j, \tag{13} $$

$$ \frac{\partial \mathcal{L}_i}{\partial \boldsymbol{a}_i} = \sum_{j=1}^{C} \frac{\partial \mathcal{L}_i}{\partial z_j} \boldsymbol{w}_j, \tag{14} $$

where $z_j$ denotes the $j$-th element of $z$ and $\boldsymbol{a}_i$ is the deep feature of the $i$-th sample. The gradients with respect to the parameters of the preceding layers can be obtained through the backward propagation algorithm using $\partial \mathcal{L}_i / \partial \boldsymbol{a}_i$.
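The core of the backward computation is the gradient of a softmax cross-entropy loss with respect to its logits (softmax probability minus the one-hot label). A quick numpy sketch with a finite-difference check; the helper name is illustrative:

```python
import numpy as np

def softmax_xent_grad(z, y):
    """Gradient of L = -log(softmax(z)[y]) with respect to the logits z."""
    p = np.exp(z - z.max())   # shift for numerical stability
    p /= p.sum()
    p[y] -= 1.0               # softmax probability minus one-hot indicator
    return p
```

A central-difference check on a small logit vector confirms the analytic form agrees with the numerical gradient.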
Appendix B Training Details
On CIFAR, we implement ResNet, SE-ResNet, WideResNet, ResNeXt, DenseNet and PyramidNet. The SGD optimization algorithm with Nesterov momentum is applied to train all models. The specific training hyperparameters are presented in Table 6.

Network  Total Epochs  Batch Size  Weight Decay  Momentum  Initial LR  Schedule
ResNet  160  128  1e-4  0.9  0.1  Multiplied by 0.1 at two scheduled epochs.
SE-ResNet  200  128  1e-4  0.9  0.1  Multiplied by 0.1 at three scheduled epochs.
WideResNet  240  128  5e-4  0.9  0.1  Multiplied by 0.2 at four scheduled epochs.
DenseNet-BC  300  64  1e-4  0.9  0.1  Multiplied by 0.1 at three scheduled epochs.
ResNeXt  350  128  5e-4  0.9  0.05  Multiplied by 0.1 at three scheduled epochs.
Shake-Shake  1800  64  1e-4  0.9  0.1  Cosine learning rate.
PyramidNet  1800  128  1e-4  0.9  0.1  Cosine learning rate.
On ImageNet, we train ResNet for 120 epochs using the same ℓ2 weight decay and momentum as on CIFAR, following Huang et al. (2016). The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. The mini-batch size is set to 256.
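The ImageNet schedule above (divide the learning rate by 10 every 30 epochs) can be written as a one-liner; `step_lr` is an illustrative helper, not from the paper's code:

```python
def step_lr(epoch, initial_lr=0.1, decay=0.1, step=30):
    """Step-decay schedule: initial_lr * decay ** (epoch // step)."""
    return initial_lr * decay ** (epoch // step)
```

For example, the learning rate stays at 0.1 through epoch 29, drops to 0.01 at epoch 30, and reaches 1e-4 by epoch 90.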
All baselines are implemented with the same training configurations described above. The dropout rate is set to 0.3 for comparison when it is not applied in the basic model, following Srivastava et al. (2014). For the noise rate in DisturbLabel, 0.05 is adopted for WideResNet-28-10 on both CIFAR-10 and CIFAR-100 and for ResNet-110 on CIFAR-10, while 0.1 is used for ResNet-110 on CIFAR-100. Focal Loss contains two hyperparameters, α and γ. Numerous combinations were tested on the validation set, and we ultimately adopt a single combination for all four experiments. For the Lq loss, although Zhang and Sabuncu (2018) report a value of q that achieves the best performance under most conditions, we find a different value more suitable in our experiments and adopt it. For center loss, we find its performance is largely affected by the learning rate of the center-loss module; its initial learning rate is therefore set to 0.5 for the best generalization performance.
For generator-based augmentation methods, we apply the GAN structures introduced in Arjovsky et al. (2017); Mirza and Osindero (2014); Odena et al. (2017); Chen et al. (2016) to train the generators. For WGAN, a generator is trained for each class of the CIFAR-10 dataset. For CGAN, ACGAN and infoGAN, a single model suffices to generate images of all classes. A 100-dimensional noise vector drawn from a standard normal distribution is used as input, generating images corresponding to their labels. In addition, infoGAN takes an extra two-dimensional input representing specific attributes of the whole training set. Synthetic images are included at a fixed ratio in every mini-batch; the proportion of generated images is chosen based on experiments on the validation set.
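The fixed-ratio mixing of synthetic images into each mini-batch can be sketched as follows; `mix_batch` and its signature are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mix_batch(real_batch, synth_pool, ratio, rng):
    """Replace a fixed fraction of a mini-batch with synthetic samples.

    real_batch : (b, ...) array of real training samples
    synth_pool : (n, ...) array of pre-generated synthetic samples
    ratio      : fraction of the batch to fill with synthetic samples
    """
    b = len(real_batch)
    k = int(round(ratio * b))
    idx = rng.choice(len(synth_pool), size=k, replace=False)
    # keep (b - k) real samples, append k synthetic ones
    return np.concatenate([real_batch[:b - k], synth_pool[idx]])
```

Keeping the batch size constant while swapping in a fixed fraction of synthetic samples leaves the optimizer's hyperparameters (learning rate, momentum) untouched.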
Appendix C Reversing Convolutional Networks
To explicitly demonstrate the semantic changes generated by ISDA, we propose an algorithm to map deep features back to the pixel space. Some extra visualization results are shown in Figure 6.
An overview of the algorithm is presented in Figure 5. As there is no closed-form inverse function for convolutional networks like ResNet or DenseNet, the mapping algorithm acts in a similar way to Mahendran and Vedaldi (2015) and Upchurch et al. (2017), fixing the model and adjusting the inputs to find images corresponding to the given features. However, given that ISDA augments the semantics of images in essence, we find it ineffective to directly optimize the inputs in the pixel space. Therefore, we add a fixed pretrained generator $G$, obtained by training a Wasserstein GAN Arjovsky et al. (2017), to produce images for the classification model, and optimize the inputs of the generator instead. This approach makes it possible to effectively reconstruct images with augmented semantics.
The mapping algorithm can be divided into two steps:
Step I. A random variable $z$ is normalized to $\bar{z}$ and input to $G$, generating the fake image $G(\bar{z})$. $x$ is a real image sampled from the dataset (such as CIFAR). $G(\bar{z})$ and $x$ are forwarded through a pretrained convolutional network $f$ to obtain the deep feature vectors $f(G(\bar{z}))$ and $f(x)$. The first step of the algorithm is to find the input noise variable $z^*$ corresponding to $x$, namely

$$ z^* = \arg\min_{z} \big\| f(G(\bar{z})) - f(x) \big\|_2^2 + \eta \big\| G(\bar{z}) - x \big\|_2^2, \quad \bar{z} = \frac{z - \mu_z}{\sigma_z}, \tag{15} $$

where $\mu_z$ and $\sigma_z$ are the average value and the standard deviation of $z$, respectively. The consistency of both the pixel space and the deep feature space is considered in the loss function, and we introduce a hyperparameter $\eta$ to adjust the relative importance of the two objectives.
Step II. We augment $f(x)$ with ISDA, forming the augmented feature $\tilde{f}(x)$, and reconstruct it in the pixel space. Specifically, we search for the $z^{**}$ corresponding to $\tilde{f}(x)$ in the deep feature space, starting from the point $z^*$ found in Step I:

$$ z^{**} = \arg\min_{z} \big\| f(G(\bar{z})) - \tilde{f}(x) \big\|_2^2. \tag{16} $$

As the mean squared error in the deep feature space is optimized towards 0, $G(\bar{z}^{**})$ is supposed to represent the image corresponding to $\tilde{f}(x)$.
The proposed algorithm is performed on a single batch. In practice, a ResNet-32 network is used as the convolutional network. We solve Eqs. (15) and (16) with a standard gradient descent (GD) algorithm for 10,000 iterations. The initial learning rate is set to 10 for Step I and 1 for Step II, and is divided by 10 every 2,500 iterations. We apply a momentum of 0.9 and an ℓ2 weight decay of 1e-4.
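The two optimization steps of Eqs. (15) and (16) amount to gradient descent on the generator input. The sketch below uses small linear maps as toy stand-ins for the actual generator and feature extractor (the real $G$ and $f$ are a Wasserstein GAN and a ResNet-32); `invert` and its parameters are illustrative:

```python
import numpy as np

def invert(A, B, target_feat, x_real=None, eta=0.0, z0=None, lr=0.05, iters=500):
    """Gradient descent on the generator input z, minimising
    ||f(G(z)) - target_feat||^2 (+ eta * ||G(z) - x_real||^2 as in Eq. (15)),
    with toy linear stand-ins G(z) = A @ z and f(x) = B @ x."""
    z = np.zeros(A.shape[1]) if z0 is None else z0.astype(float).copy()
    for _ in range(iters):
        Gz = A @ z
        grad = 2.0 * A.T @ (B.T @ (B @ Gz - target_feat))  # feature-space term
        if eta > 0.0:
            grad += 2.0 * eta * A.T @ (Gz - x_real)        # pixel-space term
        z -= lr * grad
    return z
```

Step I corresponds to calling `invert` with `eta > 0` and a real image; Step II restarts from the recovered `z0 = z*` with only the feature-space term, targeting the ISDA-augmented feature.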
Appendix D Extra Experimental Results


Curves of test errors of state-of-the-art methods and ISDA are presented in Figure 7. ISDA outperforms the other methods consistently and shows the best generalization performance in all situations. Notably, ISDA decreases test errors more evidently on CIFAR-100, which demonstrates that our method is more suitable for datasets with fewer samples per class. This observation is consistent with the results in the paper. In addition, among the other methods, center loss shows competitive performance with ISDA on CIFAR-10, but fails to significantly enhance generalization on CIFAR-100.