Hierarchically Robust Representation Learning

11/11/2019 ∙ by Qi Qian, et al. ∙ University of Washington

With the tremendous success of deep learning in visual tasks, the representations extracted from intermediate layers of learned models, that is, deep features, have attracted much attention from researchers. Previous analyses show that those features capture appropriate semantic information. By training deep models on a large-scale benchmark data set (e.g., ImageNet), the features can work well on other tasks. In this work, we investigate this phenomenon and demonstrate that deep features can fail because they are learned by minimizing empirical risk: when the distribution of data differs from that of the benchmark data set, the performance of deep features can degrade. Hence, we propose a hierarchically robust optimization method to learn more generic features. Considering example-level and concept-level robustness simultaneously, we formulate the problem as a distributionally robust optimization problem with Wasserstein ambiguity set constraints. An efficient algorithm compatible with the conventional training pipeline is proposed. Experiments on benchmark data sets confirm our claim and demonstrate the effectiveness of the robust deep representations.





Extracting representations is essential for visual recognition. In the past decades, various hand-crafted features have been developed to capture the semantics of images, e.g., SIFT [Lowe2004] and HOG [Dalal and Triggs2005]. The conventional pipeline works in two phases. In the first phase, representations are extracted from each image with a fixed schema. After that, a specific model (e.g., SVM [Cortes and Vapnik1995]) is learned with these features for the target task. Since the hand-crafted features are task independent, the performance of this pipeline can be suboptimal.

Deep learning incorporates these two phases by training end-to-end convolutional neural networks [LeCun et al.1989]. Without explicit feature design, a task-dependent representation is learned through multiple layers, and a fully connected layer is attached at the end as a linear classifier for recognition. Benefiting from this coherent structure, deep learning dramatically improves performance on visual tasks, e.g., categorization [Krizhevsky, Sutskever, and Hinton2012] and detection [Ren et al.2017]. Despite the success of deep learning on large-scale data sets, deep neural networks (DNNs) easily overfit small data sets due to their large number of parameters. Moreover, DNNs require expensive GPU resources for efficient training.

Researchers therefore attempt to leverage DNNs to improve the feature design mechanism. Surprisingly, it has been observed that features extracted from the last few layers perform well on generic tasks when the model is pre-trained on a large-scale benchmark data set, e.g., ImageNet [Russakovsky et al.2015]. Deep features, which are outputs from intermediate layers of a deep model, have become a popular substitute for training deep models thanks to their light computation. Systematic comparisons show that these deep features outperform existing hand-crafted features by a large margin [Donahue et al.2014, Mormont, Geurts, and Marée2018, Qian et al.2015].

The objectives of learning deep models and deep features can be different, but little effort has been devoted to further investigating deep features. Learning a deep model focuses on optimizing performance on the current training data set. In contrast, deep features are learned for generic tasks rather than a single data set. In applications of deep features, it has also been noticed that deep features can fail when the distribution of data differs from the benchmark ImageNet data set [Zhou et al.2014]. By studying the objective of learning models, we find that it optimizes performance under the uniform distribution over training examples and is a standard empirical risk minimization (ERM) problem. It is well known that models obtained by ERM can generalize well on data from the same distribution as the training set [Bousquet and Elisseeff2002]. Since a large-scale data set covers data from a wide range of classes, this explains the good generalization performance of deep features.

However, the distribution of data in real applications can be significantly different from the benchmark data set, which can result in performance degradation when adopting representations learned by ERM. The differences come from at least two aspects. First, the distribution of examples in each class can be different. This problem has attracted much attention recently, and approaches that optimize the worst-case performance have been developed to handle the issue [Chen et al.2017, Namkoong and Duchi2016, Sinha, Namkoong, and Duchi2018]. Second, the distribution of concepts can also differ from that of the benchmark data set, where each concept can contain multiple classes. This difference has been less investigated but is more crucial for deploying deep features, since the concepts in real applications may be only a subset of, or partially overlap with, those in the benchmark data set.

In this work, we propose to consider the drifting in examples and concepts simultaneously and learn hierarchically robust representations from deep neural networks. Compared with ERM, this is more consistent with the objective of learning deep features. For example-level robustness, we adopt a Wasserstein ambiguity set to encode the uncertainty from examples for efficient optimization. Our theoretical analysis also illustrates that an appropriate augmentation can be better than regularization in training DNNs, since the former provides a tighter approximation for the optimization problem. For concept-level robustness, we formulate a game between the deep model and the distribution over different concepts to optimize the worst-case performance over concepts. By learning deep features with the adversarial distribution, the worst-case performance over concepts can be improved. Finally, to keep the simplicity of the training pipeline, we develop an algorithm that leverages the standard random sampling strategy at each iteration and re-weights the obtained gradient for an unbiased estimation. This step may increase the variance of the gradient, and we reduce the variance by carefully setting the learning rate. We show that the adversarial distribution converges at a rate of $\mathcal{O}(\log(T)/T)$, where $T$ denotes the total number of iterations. The empirical study on benchmark data sets confirms the effectiveness of our method.

Related Work

Deep learning has become popular since ImageNet ILSVRC12, and various structures of deep neural networks have been proposed, e.g., AlexNet [Krizhevsky, Sutskever, and Hinton2012], VGG [Simonyan and Zisserman2014], GoogLeNet [Szegedy et al.2015] and ResNet [He et al.2016]. Besides the success on image categorization, features extracted from the last few layers have been applied to generic tasks. [Donahue et al.2014] adopts the deep features from the last two layers of AlexNet and shows impressive performance on visual recognition across different applications. After that, [Qian et al.2015] applies deep features to distance metric learning and achieves overwhelming improvements over hand-crafted features on fine-grained visual categorization. [Mormont, Geurts, and Marée2018] compares deep features from different neural networks, where ResNet shows the best results. Besides models pre-trained on ImageNet, [Zhou et al.2014] proposes to learn deep features from a large-scale scene data set to improve performance on the scene recognition task. All of these works directly extract features from models learned with ERM as the objective. In contrast, we develop an algorithm that is tailored to learn robust deep representations. Note that deep features can be extracted from multiple layers of deep models; we focus on the layer before the final fully connected layer in this work.

Recently, distributionally robust optimization, which aims to optimize the worst-case performance, has attracted much attention [Chen et al.2017, Namkoong and Duchi2016, Sinha, Namkoong, and Duchi2018]. [Namkoong and Duchi2016] proposes to optimize the performance under the worst-case distribution over examples derived from the empirical distribution. [Chen et al.2017] extends the problem to non-convex loss functions, but requires a near-optimal oracle for the non-convex problem to learn the robust model. [Sinha, Namkoong, and Duchi2018] introduces an adversarial perturbation on each example for robustness. Most of these algorithms only consider example-level robustness. In contrast, we propose a hierarchically robust optimization, which considers example-level and concept-level robustness simultaneously, to learn representations for real applications.

Hierarchical Robustness

Problem Formulation

Let $x$ denote an image and $y$ be its corresponding label. Given a benchmark data set $\{(x_i, y_i)\}_{i=1}^N$ where $y_i \in \{1, \dots, C\}$, the parameters $w$ of a deep neural network can be learned by solving the optimization problem

$$\min_w \frac{1}{N} \sum_{i=1}^N \ell(f(x_i; w), y_i) \qquad (1)$$

where $\ell(\cdot, \cdot)$ is a non-negative loss function (e.g., the cross-entropy loss) and $f(\cdot; w)$ denotes the network. This is an empirical risk minimization (ERM) problem, which can be inappropriate for learning generic features. We will explore hierarchical robustness to obtain robust deep representations.

First, we consider example-level robustness. Unlike ERM, a robust model minimizes the loss under the worst-case distribution derived from the empirical distribution. The optimization problem can be cast as a game between the prediction model and the adversarial distribution

$$\min_w \max_{p} \ \mathbb{E}_{i \sim p}\big[\ell(f(x_i; w), y_i)\big]$$

which is equivalent to

$$\min_w \max_{p \in \Delta} \ \sum_{i=1}^N p_i\, \ell(f(x_i; w), y_i)$$

where $p$ is the adversarial distribution over training examples and $\Delta$ is the simplex $\Delta = \{p \in \mathbb{R}^N : \sum_i p_i = 1,\ p_i \ge 0\}$. When $p$ is the uniform distribution, the distributionally robust problem reduces to ERM.
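To make the minimax concrete, here is a minimal numerical sketch (our own toy illustration, not the paper's implementation) contrasting the uniform ERM weighting with the unregularized worst case over the simplex, which puts all mass on the hardest example; the loss values are arbitrary.

```python
import numpy as np

# Hypothetical per-example losses for a small batch.
losses = np.array([0.2, 1.5, 0.4, 0.9])

# ERM: uniform weights over the examples.
p_erm = np.full(len(losses), 1.0 / len(losses))
erm_risk = float(p_erm @ losses)

# Unregularized worst case over the simplex: all mass on the hardest example.
p_adv = np.zeros(len(losses))
p_adv[losses.argmax()] = 1.0
robust_risk = float(p_adv @ losses)

assert robust_risk >= erm_risk   # the worst case upper-bounds the ERM risk
```

Without any constraint on how far $p$ may drift, the inner maximization degenerates to a point mass; this is exactly what the regularizer introduced next prevents.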

To alleviate the issue of outliers and constrain the space of the adversarial distribution, a regularizer can be added to the formulation as

$$\min_w \max_{p \in \Delta} \ \sum_{i=1}^N p_i\, \ell(f(x_i; w), y_i) - \lambda\, d(p, p_0) \qquad (2)$$

where $p_0$ is the empirical distribution with $p_{0,i} = 1/N$. $d(p, p_0)$ measures the distance between the learned adversarial distribution and the empirical distribution; we apply the squared $\ell_2$ distance $d(p, p_0) = \frac{1}{2}\|p - p_0\|_2^2$ in this work. The regularizer guarantees that the generated adversarial distribution is not too far away from the empirical distribution. It implies that the adversarial distribution comes from an ambiguity set $\{p \in \Delta : d(p, p_0) \le \epsilon\}$, where $\epsilon$ is determined by $\lambda$.
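With the squared-$\ell_2$ regularizer, the inner maximization over $p$ has a simple closed form whenever the solution stays inside the simplex: $p_i = 1/N + (\ell_i - \bar{\ell})/\lambda$, where $\bar{\ell}$ is the mean loss. The sketch below (toy losses and hypothetical $\lambda$ values of our choosing) shows how $\lambda$ interpolates between the uniform distribution and a distribution concentrated on hard examples.

```python
import numpy as np

losses = np.array([0.2, 1.5, 0.4, 0.9])
N = len(losses)

def adversarial_weights(losses, lam):
    """Interior solution of max_p p.l - (lam/2)||p - u||^2 over the simplex.

    Valid when lam is large enough that all weights stay non-negative;
    otherwise an additional projection onto the simplex would be required.
    """
    return 1.0 / N + (losses - losses.mean()) / lam

p = adversarial_weights(losses, lam=10.0)
assert np.isclose(p.sum(), 1.0) and np.all(p >= 0)
assert p.argmax() == losses.argmax()       # hardest example gets most weight

# As lam grows, the adversarial distribution approaches the uniform one.
p_large = adversarial_weights(losses, lam=1e6)
assert np.allclose(p_large, 1.0 / N, atol=1e-5)
```

This makes the role of $\lambda$ explicit: it is the inverse of the radius $\epsilon$ of the ambiguity set.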

Besides example-level robustness, concept-level robustness is even more important for learning generic features. A qualified model should perform consistently well over different concepts. Assume that there are $K$ concepts in the training set and the $k$-th concept consists of the examples indexed by $\mathcal{C}_k$ with $|\mathcal{C}_k| = N_k$. The concept robust optimization problem can be written as

$$\min_w \max_{q \in \Delta_K} \ \sum_{k=1}^K q_k \frac{1}{N_k} \sum_{i \in \mathcal{C}_k} \ell(f(x_i; w), y_i)$$

With a similar analysis as above, the problem becomes

$$\min_w \max_{q \in \Delta_K} \ \sum_{k=1}^K q_k \frac{1}{N_k} \sum_{i \in \mathcal{C}_k} \ell(f(x_i; w), y_i) - \lambda\, d(q, q_0)$$

where $q_0$ can be set as the empirical concept distribution, $q_{0,k} = N_k / N$.

Combined with example-level robustness, the hierarchically robust problem is

$$\min_w \max_{p \in \Delta_N,\ q \in \Delta_K} \ \sum_{k=1}^K q_k \sum_{i \in \mathcal{C}_k} p_i\, \ell(f(x_i; w), y_i) - \lambda\, d(q, q_0) - \lambda_e\, d(p, p_0)$$

In this formulation, each example is associated with a weight $p_i$ and each concept with a weight $q_k$. The high dimensionality and the coupling structure make efficient optimization challenging. Since the example-level weights are coupled with the concept-level weights, we try to decouple the hierarchical robustness with an alternative formulation for the example-level robustness.

Wasserstein Ambiguity Set

In Eqn. 2, the ambiguity set is defined by the distance to the uniform distribution over the training set. It introduces the adversarial distribution by re-weighting each example, which couples the example-level parameters with those of the concept-level problem. To simplify the optimization, we instead generate the ambiguity set for the adversarial distribution with the Wasserstein distance. The property of the Wasserstein distance helps to decouple example-level robustness from concept-level robustness.

Assume that $\mathcal{P}$ is the data-generating distribution over the data space and $\hat{\mathcal{P}}_N$ is the empirical distribution of the training set drawn from it. The ambiguity set for the distribution can be defined as

$$\{\mathcal{P}' : W(\mathcal{P}', \hat{\mathcal{P}}_N) \le \epsilon\}$$

where $W(\cdot, \cdot)$ is the Wasserstein distance between distributions; we denote an example generated from $\mathcal{P}'$ as $x'$, and $c(x', x)$ is the transportation cost between examples.

The problem of example-level robustness can be written as

$$\min_w \max_{\mathcal{P}' :\, W(\mathcal{P}', \hat{\mathcal{P}}_N) \le \epsilon} \ \mathbb{E}_{(x', y) \sim \mathcal{P}'}\big[\ell(f(x'; w), y)\big]$$

According to the definition of the Wasserstein distance [Sinha, Namkoong, and Duchi2018], and letting the cost function be the squared Euclidean distance $c(x', x) = \|x' - x\|_2^2$, the problem is equivalent to

$$\min_w \frac{1}{N} \sum_{i=1}^N \max_{x_i' \in \mathcal{X}} \ \ell(f(x_i'; w), y_i) - \gamma \|x_i' - x_i\|_2^2$$

where $\mathcal{X}$ is the data space. In [Sinha, Namkoong, and Duchi2018], the optimal $x_i'$ is obtained by solving the subproblem for each example at each iteration. To accelerate the optimization, we propose to minimize an upper bound of the subproblem, which also provides insight into the comparison between regularization and augmentation.

The main results are stated in the following theorems and all proofs are in the appendix.

Theorem 1.

Assume the loss $\ell(f(x; w), y)$ is $L$-smooth in $x$. Then we have

$$\max_{x'} \ \ell(f(x'; w), y) - \gamma \|x' - x\|_2^2 \ \le\ \ell(f(x; w), y) + \frac{\|\nabla_x \ell(f(x; w), y)\|_2^2}{2(2\gamma - L)}$$

where $\gamma$ is sufficiently large such that $2\gamma > L$.

Theorem 2.

With the same assumption as in Theorem 1, and letting the augmented example be

$$\tilde{x} = x + \delta\, \frac{\nabla_x \ell(f(x; w), y)}{\|\nabla_x \ell(f(x; w), y)\|_2}$$

we have

$$\max_{x'} \ \ell(f(x'; w), y) - \gamma \|x' - x\|_2^2 \ \le\ \ell(f(\tilde{x}; w), y) + C$$

where $C$ is a non-negative constant depending on $\delta$, $\gamma$ and $L$.
Theorem 1 shows that learning the model on the original examples with a regularizer on the complexity of the model can make the learned model robust to examples from the ambiguity set. A similar result has been observed in conventional robust optimization [Ben-Tal, El Ghaoui, and Nemirovski2009]. However, regularization is not sufficient for training DNNs well, and many optimization algorithms have to rely on augmented examples to obtain meaningful models.

Theorem 2 interprets this phenomenon by analyzing a specific augmentation and shows that augmented examples can provide a tighter bound for examples in the ambiguity set. Besides, the augmentation direction corresponds to the gradient at the original example. To make the approximation tight, it should be identical to the direction of the gradient, so we set the augmented example to move along the normalized gradient, which is similar to adversarial training [Goodfellow, Shlens, and Szegedy2015].
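As a sanity check of this choice, the sketch below uses a toy logistic loss (our own example, not the paper's deep network) and verifies that moving an input a small step $\delta$ along its normalized loss gradient, $\tilde{x} = x + \delta\,\nabla_x\ell/\|\nabla_x\ell\|_2$, increases the loss, i.e., the augmented example is indeed harder than the original.

```python
import numpy as np

# Toy logistic loss on a single example.
def logistic_loss(w, x, y):
    return float(np.log1p(np.exp(-y * (w @ x))))

def grad_x(w, x, y):
    s = 1.0 / (1.0 + np.exp(y * (w @ x)))   # sigmoid(-y * w.x)
    return -y * s * w                        # d loss / d x

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.3])
y = 1.0

g = grad_x(w, x, y)
# Finite-difference check that the analytic input gradient is correct.
eps = 1e-6
fd = np.array([(logistic_loss(w, x + eps * e, y) - logistic_loss(w, x, y)) / eps
               for e in np.eye(2)])
assert np.allclose(fd, g, atol=1e-4)

# Augment along the normalized gradient: the example becomes harder.
delta = 0.1
x_aug = x + delta * g / np.linalg.norm(g)
assert logistic_loss(w, x_aug, y) > logistic_loss(w, x, y)
```

The same augmentation is what the training algorithm applies to each mini-batch before the gradient step.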

For concept-level robustness, we keep the strategy above and obtain the final objective

$$\min_w \max_{q \in \Delta_K} \ \sum_{k=1}^K q_k \frac{1}{N_k} \sum_{i \in \mathcal{C}_k} \ell(f(\tilde{x}_i; w), y_i) - \lambda\, d(q, q_0) \qquad (3)$$
Efficient Optimization

The problem in Eqn. 3 can be solved efficiently by stochastic gradient descent (SGD). In the standard training pipeline for ERM in Eqn. 1, a mini-batch of examples $\mathcal{B}$ is randomly sampled at each iteration and the model is updated by gradient descent as

$$w_{t+1} = w_t - \eta\, \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla_w \ell(f(x_i; w_t), y_i)$$

where $m$ is the size of the mini-batch.

For the problem in Eqn. 3, each example carries a weight $q_{k(i)}/N_{k(i)}$, where $k(i)$ denotes the concept of the $i$-th example, and the gradient has to be re-weighted for an unbiased estimation as

$$w_{t+1} = w_t - \eta\, \frac{N}{m} \sum_{i \in \mathcal{B}} \frac{q_{k(i)}}{N_{k(i)}}\, \nabla_w \ell(f(\tilde{x}_i; w_t), y_i) \qquad (4)$$
For the adversarial distribution $q$, each concept has a weight, and the straightforward way would be to sample a mini-batch of examples from each concept to estimate the gradient of the distribution. However, the number of concepts varies and can be larger than the size of a mini-batch. Besides, it results in different sampling processes for computing the gradient of the deep model and of the adversarial distribution, which increases the complexity of the system. To address this issue, we keep the same random sampling pipeline and update the distribution by weighted gradient ascent as

$$q_{t+1} = \Pi_{\Delta}\big(q_t + \eta_q\, \hat{g}_t\big), \qquad \hat{g}_{t,k} = \frac{N}{m N_k} \sum_{i \in \mathcal{B}_k} \ell(f(\tilde{x}_i; w_t), y_i) - \lambda\, (q_{t,k} - q_{0,k})$$

where $\mathcal{B}_k$ contains the $m_k$ examples of the $k$-th concept in the mini-batch, $\sum_k m_k = m$, and $\Pi_{\Delta}(\cdot)$ projects the vector onto the simplex.
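One ascent step on $q$ can be sketched as follows, using the standard sorting-based Euclidean projection onto the simplex; the concept sizes, batch composition, losses and hyper-parameters are hypothetical stand-ins.

```python
import numpy as np

def project_simplex(v):
    """Sorting-based Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)

# Hypothetical setup: 3 concepts with sizes N_k in the full training set,
# and a uniformly sampled mini-batch with per-example concept ids / losses.
N_k = np.array([500, 300, 200]); N = N_k.sum()
concept_ids = np.array([0, 0, 1, 2, 0, 1])
losses = np.array([0.3, 0.5, 1.2, 0.8, 0.4, 1.0])
m = len(losses)

q = np.full(3, 1.0 / 3.0)     # current adversarial concept distribution
q0 = N_k / N                   # reference (empirical) concept distribution
lam, eta_q = 0.1, 0.5

# Unbiased re-weighted gradient estimate for each concept weight q_k.
g = np.zeros(3)
for k in range(3):
    g[k] = (N / (m * N_k[k])) * losses[concept_ids == k].sum()
g -= lam * (q - q0)            # gradient of the regularizer

q_new = project_simplex(q + eta_q * g)
assert np.isclose(q_new.sum(), 1.0) and np.all(q_new >= 0)
```

The concept with the largest re-weighted loss (concept 1 here) receives the largest weight after the update, which is exactly the worst-case behavior the formulation asks for.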

1:  Input: dataset $\{(x_i, y_i)\}_{i=1}^N$, iterations $T$, mini-batch size $m$, warm-up length $T_0$, $\lambda$, $\delta$, learning rates $\eta$, $\eta_q$
2:  for $t = 1, \dots, T$ do
3:     if $t \le T_0$ then
4:        set the learning rate for the distribution as $\eta_t = \eta_q\, t / T_0$
5:     else
6:        set the learning rate for the distribution as $\eta_t = \eta_q$
7:     end if
8:     Randomly sample a mini-batch of $m$ examples
9:     Generate the augmented data as $\tilde{x}_i = x_i + \delta\, \nabla_x \ell_i / \|\nabla_x \ell_i\|_2$
10:  Update the model with gradient descent as in Eqn. 4
11:  Update the distribution with the weighted gradient ascent step described above
12:  end for
13:  return $w$
Algorithm 1 Hierarchically Robust Representation Learning (HRRL)
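To show how the pieces of Alg. 1 fit together, here is a self-contained toy sketch on a two-concept logistic-regression problem; the data, linear model and all hyper-parameters are hypothetical stand-ins for the deep network and ImageNet-scale setting of the paper.

```python
import numpy as np

def project_simplex(v):
    """Sorting-based Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
# Toy data: two "concepts", each one class of a binary problem.
N_k = np.array([80, 80]); N = N_k.sum()
X = np.vstack([rng.normal(loc=-1.0, size=(N_k[0], 2)),
               rng.normal(loc=1.0, size=(N_k[1], 2))])
y = np.where(np.arange(N) < N_k[0], -1.0, 1.0)
concept = np.where(np.arange(N) < N_k[0], 0, 1)

def loss_and_grads(w, xb, yb):
    z = yb * (xb @ w)
    s = 1.0 / (1.0 + np.exp(z))              # sigmoid(-z)
    l = np.log1p(np.exp(-z))                 # logistic loss per example
    gw = -(s * yb)[:, None] * xb             # gradient w.r.t. w
    gx = -(s * yb)[:, None] * w[None, :]     # gradient w.r.t. x
    return l, gw, gx

w = np.zeros(2)
q = N_k / N; q0 = N_k / N                    # adversarial concept weights
T, T0, m = 200, 50, 16
eta_w, eta_q, lam, delta = 0.1, 0.05, 0.1, 0.05

for t in range(1, T + 1):
    eta_t = eta_q * t / T0 if t <= T0 else eta_q   # warm-up for q
    idx = rng.integers(0, N, size=m)
    xb, yb, cb = X[idx], y[idx], concept[idx]
    # Augment along the normalized input gradient (Theorem 2).
    _, _, gx = loss_and_grads(w, xb, yb)
    xb = xb + delta * gx / (np.linalg.norm(gx, axis=1, keepdims=True) + 1e-12)
    l, gw, _ = loss_and_grads(w, xb, yb)
    # Re-weighted, unbiased gradient step for the model (Eqn. 4).
    ex_w = N * q[cb] / (m * N_k[cb])
    w = w - eta_w * (ex_w[:, None] * gw).sum(axis=0)
    # Re-weighted gradient ascent step for q, then simplex projection.
    g = np.array([(N / (m * N_k[k])) * l[cb == k].sum() for k in range(2)])
    q = project_simplex(q + eta_t * (g - lam * (q - q0)))

final_loss = np.log1p(np.exp(-y * (X @ w))).mean()
assert final_loss < np.log(2.0)   # better than the w = 0 initialization
assert np.isclose(q.sum(), 1.0)
```

The same random-sampling pipeline used for the model update also drives the distribution update, which is the point of the re-weighting scheme.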

The re-weighting strategy makes the gradient unbiased but introduces additional variance. Since batch normalization [Ioffe and Szegedy2015] is inapplicable to the parameters of the adversarial distribution, which lie in the simplex, we develop a learning-rate strategy to reduce the variance of the gradients.

First, to illustrate the issue, let $z$ and $\tilde{z}$ be two binary random variables, where $\tilde{z}$ is the re-weighted counterpart of $z$ obtained by uniform sampling. Then we have $\mathbb{E}[\tilde{z}] = \mathbb{E}[z]$, which demonstrates that the gradient after re-weighting is unbiased. However, the variances are different: the variance of $\tilde{z}$ is roughly inflated by the re-weighting factor.
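The effect can be checked numerically: sampling from the adversarial distribution directly and sampling uniformly with re-weighting both estimate the same weighted risk, but the re-weighted estimator has a much larger variance. The toy losses and weights below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
losses = np.array([0.2, 1.5, 0.4, 0.9])     # hypothetical per-example losses
p = np.array([0.1, 0.6, 0.1, 0.2])          # hypothetical adversarial weights
N = len(losses)
true_risk = float(p @ losses)
S = 200_000

# (a) Sample indices from the adversarial distribution itself.
est_a = losses[rng.choice(N, size=S, p=p)]

# (b) Sample uniformly and re-weight (compatible with standard pipelines).
idx = rng.integers(0, N, size=S)
est_b = N * p[idx] * losses[idx]

# Both estimators are unbiased for the weighted risk ...
assert abs(est_a.mean() - true_risk) < 0.02
assert abs(est_b.mean() - true_risk) < 0.02
# ... but re-weighting inflates the variance considerably.
assert est_b.var() > 2 * est_a.var()
```

This inflated variance is precisely what the learning-rate schedule below is designed to damp.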

By investigating the updating criterion above, we find that the gradient is rescaled by the learning rate $\eta_t$. If the learning rate is kept small early on, the norm of the effective gradient step remains reasonable after a sufficient number of iterations. Besides, the norm of $q$ is bounded by a small value since the distribution lies in the simplex. This inspires us to handle the first several iterations with a small learning rate. The algorithm is summarized in Alg. 1. In short, we use the learning rate $\eta_t = \eta_q\, t / T_0$ for the first $T_0$ iterations, and then the conventional learning rate $\eta_q$ is applied.
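The schedule itself is a one-line helper (the function name is ours):

```python
def adversarial_lr(t, T0, eta):
    """Learning rate for the adversarial distribution: linear warm-up over
    the first T0 iterations, then the conventional constant rate."""
    return eta * t / T0 if t <= T0 else eta

assert adversarial_lr(1, 1000, 0.1) == 0.1 * 1 / 1000
assert adversarial_lr(1000, 1000, 0.1) == 0.1
assert adversarial_lr(5000, 1000, 0.1) == 0.1
```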

The convergence of the adversarial distribution is stated as follows.

Theorem 3.

Assume the gradient of the distribution is bounded as $\|\hat{g}_t\|_2 \le G$ and set the learning rate as in Alg. 1. Then we have

$$\mathbb{E}\big[\|q_T - q_*\|_2^2\big] \le \mathcal{O}\!\left(\frac{\log T}{T}\right)$$

where the hidden constant is non-negative and depends on $G$, $\lambda$ and $T_0$, and $T$ should be larger than $T_0$.

Theorem 3 shows the convergence rate for the adversarial distribution. The gain of the varying learning rate is reflected in one term of the bound. When applying the conventional learning rate from the first iteration, it is easy to show that this gain vanishes. To further investigate its properties, we fix the remaining quantities and study its behavior.

Proposition 1.

The gain of the varying learning rate is non-negative.


Since the learning rate is reduced during the early iterations, the corresponding variance term is reduced as well. It implies that we can benefit from the variance reduction as long as the variance is large. Then, we fix the remaining parameters and plot the curve of the gain in Fig. 1. We find that it achieves its maximum after thousands of iterations, which suggests that $T_0$ should not be too large. It is consistent with our claim that the gradient will be shrunk by the learning rate and the variance has little influence when $t$ is large.

Figure 1: Curves of the gain of the varying learning rate with different $T_0$.


Experiments

We adopt ImageNet ILSVRC12 [Russakovsky et al.2015] as the benchmark data set to learn the models that extract representations in the experiments. ImageNet includes 1,000 classes, where each class has about 1,300 images for training and 50 images for validation. We extract concepts according to the structure of WordNet [Fellbaum1998], and the statistics are summarized in Table 1. Evidently, ImageNet is biased toward specific animals: for example, it contains 59 classes of birds and more than 100 classes of dogs. This can result in performance degradation when applying a model learned by ERM to generate representations.

Concept Mammal Bird Vehicle Container
#Classes 100 59 67 56
Concept Structure Device Instrumentality Artifact
#Classes 57 129 106 107
Concept Dog Animal Others
#Classes 118 121 80
Table 1: Concepts in ImageNet.

We apply ResNet-18 [He et al.2016], a popular network for feature extraction [Mormont, Geurts, and Marée2018], to learn the representations. We train the model with stochastic gradient descent (SGD) on 2 GPUs with the standard mini-batch size. Following common practice, we learn the model for a fixed budget of epochs with a step-wise decayed learning rate. All model training includes random crops and horizontal flipping as data augmentation. After obtaining the deep models, we extract deep features from the layer before the last fully connected layer, which generates a 512-dimensional feature vector for a single image. Given the features, we learn a linear SVM [Chang and Lin2011] to categorize examples. $\lambda$, $\delta$ and the regularization parameter of the SVM are searched over a grid. Four different deep features with SVM are compared in the experiments.

  • SVM$_{ERM}$: deep features learned with ERM.

  • SVM$_E$: deep features learned with example-level robustness only.

  • SVM$_C$: deep features learned with concept-level robustness only.

  • SVM$_H$: deep features learned with both example-level and concept-level robustness.

Experiments are repeated multiple times, and the average results with standard deviations are reported.


CIFAR-10

First, we demonstrate the influence of example-level robustness. We conduct experiments with deep features on CIFAR-10, which contains 10 classes and 60,000 images; 50,000 of them are for training and the rest are for test. CIFAR-10 has similar concepts to those in ImageNet, e.g., “bird” and “dog”, so the drifting in concepts is negligible. On the contrary, each image in CIFAR-10 has a size of 32×32, which is significantly smaller than that of images in ImageNet. Fig. 2 illustrates examples from ImageNet and CIFAR-10. It is obvious that the distribution of images changes dramatically, and example-level robustness is important for this task.

Figure 2: Examples from ImageNet, CIFAR-10 and SOP.

Table 2 summarizes the comparison. First, we observe that the accuracy of SVM$_{ERM}$ reaches 85.77%, which surpasses the performance of SIFT features [Bo, Ren, and Fox2010] by a large margin. It confirms that representations extracted from a DNN trained on the benchmark data set are applicable to generic tasks. Compared with representations from the model learned with ERM, SVM$_E$ outperforms it by a margin of about 0.9%. It shows that optimizing with the Wasserstein ambiguity set can learn example-level robust features and handle the drifting in examples better than ERM. SVM$_C$ has performance similar to SVM$_{ERM}$, which is consistent with the fact that the difference in concepts between CIFAR-10 and ImageNet is small. Finally, the performance of SVM$_H$ is comparable to that of SVM$_E$ and significantly better than that of SVM$_{ERM}$, which demonstrates the effectiveness of the proposed algorithm.

Methods Acc (mean±std)
SVM$_{ERM}$ 85.77±0.12
SVM$_E$ 86.62±0.18
SVM$_C$ 85.64±0.26
SVM$_H$ 86.49±0.19
Table 2: Comparison of accuracy (%) on CIFAR-10.

Stanford Online Products (SOP)

Then, we illustrate the importance of concept-level robustness. We use Stanford Online Products (SOP) as the target task to evaluate the learned representations. It collects product images from eBay.com and consists of 59,551 images for training and 60,502 images for test. We adopt the super-class label for each image, which leads to a 12-class classification problem. Fig. 2 illustrates examples from ImageNet and SOP. We find that the distributions of images are similar while the distributions of concepts are different: ImageNet includes many natural objects, e.g., animals, while SOP only contains artificial ones. Handling the concept drifting is challenging for this task.

Table 3 shows the accuracy. Evidently, SVM$_E$ performs similarly to SVM$_{ERM}$ due to the minor changes in the distribution of images. However, SVM$_C$ demonstrates better accuracy, about 0.9% higher than SVM$_{ERM}$. It illustrates that the deep features learned with the proposed algorithm are more robust than those from ERM when the distribution of concepts varies. Besides, the performances of SVM$_H$ and SVM$_C$ are comparable, which confirms that deep features obtained with hierarchical robustness work consistently well in different scenarios.

Methods Acc (mean±std)
SVM$_{ERM}$ 73.47±0.09
SVM$_E$ 73.48±0.08
SVM$_C$ 74.34±0.05
SVM$_H$ 74.23±0.08
Table 3: Comparison of accuracy (%) on SOP.

Street View House Numbers (SVHN)

Finally, we deal with a problem where the distributions of both examples and concepts vary. We evaluate the robustness of deep features on the Street View House Numbers (SVHN) data set. It consists of 73,257 images for training and 26,032 for test. The target is to identify one of 10 digits in each image. The images have the same size as CIFAR-10, which is very different from ImageNet. Moreover, SVHN has the concepts of digits, which are also different from those in ImageNet.

We compare the different deep features in Table 4. First, as observed on CIFAR-10, SVM$_E$ outperforms SVM$_{ERM}$ by a large margin, because features learned with example-level robustness are more applicable than those from ERM when examples come from a different distribution. Second, SVM$_C$ improves the performance by more than 2%. It is consistent with the observation on SOP, where features learned with concept-level robustness perform better when concepts vary. Besides, the performance of SVM$_C$ surpasses that of SVM$_E$, which implies that controlling concept-level robustness, which has not been investigated before, may be more important than example-level robustness. Finally, by combining example-level and concept-level robustness, SVM$_H$ shows an improvement of more than 4%. It demonstrates that example-level and concept-level robustness are complementary: incorporating both of them can further improve the performance of deep features when the example and concept distributions are different from those of the benchmark data used to learn the representations.

Methods Acc (mean±std)
SVM$_{ERM}$ 63.23±0.35
SVM$_E$ 65.01±0.37
SVM$_C$ 65.47±0.27
SVM$_H$ 67.33±0.39
Table 4: Comparison of accuracy (%) on SVHN.
(a) CIFAR-10 (b) SOP (c) SVHN
Figure 3: Comparison of finetuning with different initializations.


Finetuning

Besides extracting features, a pre-trained model is often applied as an initialization for training DNNs on the target task when GPUs are available. Since initialization is crucial for the final performance of DNNs [Sutskever et al.2013], we conduct experiments that initialize the model with parameters trained on ImageNet and then finetune the model on CIFAR-10, SOP and SVHN. After initialization, the model is finetuned for a fixed number of epochs with a step-wise decayed learning rate. Fig. 3 illustrates the curves of the test error. We let “ERM” denote the model initialized with the model pre-trained with ERM and “Robust” denote the one initialized with the model pre-trained with the proposed algorithm. We observe that the models initialized with the proposed algorithm still surpass those with ERM. It implies that the learned robust model can be used for initialization besides feature extraction.

Influence of Robustness

Finally, we investigate the influence of the proposed algorithm on ImageNet to further illustrate the impact of robustness. First, we demonstrate the results of example-level robustness. We generate augmented examples for the validation set as in Theorem 2 and report the accuracy of different models in Fig. 4. The horizontal axis shows the step size used for generating the augmented examples. When the step size is 0, the original validation set is applied for evaluation; otherwise, each image in the validation set is modified with the corresponding step size and model, and only the modified images are used for evaluation. Intuitively, a larger step size implies a larger distribution change compared with the original validation set.

Besides ERM, four different models are included in the comparison. Each model is trained with example-level robustness, and the corresponding augmentation parameter $\delta$ is denoted in the legend, where a larger $\delta$ should theoretically provide a more robust model.

We observe that ERM performs well when there is no augmentation, but its performance degrades significantly as the augmentation step size increases. It confirms that ERM only optimizes over examples in the training set and cannot generalize well when the distribution of examples changes. Fortunately, we observe that more robust models (i.e., larger $\delta$) provide better generalization performance, as expected. This is because the proposed algorithm focuses on optimizing the worst-case performance among different distributions derived from the original distribution. The proposed method is more powerful when the target is to learn generic features and information about the data distribution is unavailable.

Figure 4: Comparison of accuracy on augmented examples.

Second, we show the influence of concept-level robustness. We train models with different $\lambda$ for the regularizer and summarize the accuracy over concepts in Fig. 5. We sort the accuracies in increasing order to make the comparison clear. Evidently, ERM optimizes over the uniform distribution of examples and ignores the distribution of concepts. Consequently, certain concepts, e.g., “bird” and “vehicle”, have much higher accuracy than others. This bias over concepts makes deep features sensitive to the concepts in the target task. When decreasing $\lambda$, the freedom of the adversarial distribution increases. With more freedom, it can move further away from the initial distribution and focus on the concepts with bad performance. By optimizing with the adversarial distribution, the model balances the performance between different concepts, as illustrated in Fig. 5.

Figure 5: Comparison of accuracy on concepts in ImageNet.

Figs. 4 and 5 demonstrate the different influences of example-level and concept-level robustness. Evidently, they deal with perturbations from different aspects, and improving hierarchical robustness is important for applying deep features or initializing models in real-world applications.


Conclusion

In this work, we study the problem of learning deep features for generic tasks. We propose a hierarchically robust optimization algorithm to learn robust representations from a large-scale benchmark data set. The theoretical analysis also interprets the importance of augmentation in training DNNs. Experiments on ImageNet and benchmark data sets demonstrate the effectiveness of the learned features. The framework can be further improved when side information is available: for example, given the concepts of the target domain, we can construct a specific reference distribution accordingly and then learn the features for the desired task. This direction will be our future work.


References

  • [Ben-Tal, El Ghaoui, and Nemirovski2009] Ben-Tal, A.; El Ghaoui, L.; and Nemirovski, A. 2009. Robust Optimization, volume 28. Princeton University Press.
  • [Bo, Ren, and Fox2010] Bo, L.; Ren, X.; and Fox, D. 2010. Kernel descriptors for visual recognition. In NIPS, 244–252.
  • [Bousquet and Elisseeff2002] Bousquet, O., and Elisseeff, A. 2002. Stability and generalization. Journal of Machine Learning Research 2:499–526.

  • [Chang and Lin2011] Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
  • [Chen et al.2017] Chen, R. S.; Lucier, B.; Singer, Y.; and Syrgkanis, V. 2017. Robust optimization for non-convex objectives. In NIPS, 4708–4717.
  • [Cortes and Vapnik1995] Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3):273–297.
  • [Dalal and Triggs2005] Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In CVPR, 886–893.
  • [Donahue et al.2014] Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 647–655.
  • [Fellbaum1998] Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press.
  • [Goodfellow, Shlens, and Szegedy2015] Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In ICLR.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1106–1114.
  • [LeCun et al.1989] LeCun, Y.; Boser, B. E.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W. E.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541–551.
  • [Lowe2004] Lowe, D. G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110.

  • [Mormont, Geurts, and Marée2018] Mormont, R.; Geurts, P.; and Marée, R. 2018. Comparison of deep transfer learning strategies for digital pathology. In CVPR Workshops, 2262–2271.
  • [Namkoong and Duchi2016] Namkoong, H., and Duchi, J. C. 2016. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, 2208–2216.
  • [Qian et al.2015] Qian, Q.; Jin, R.; Zhu, S.; and Lin, Y. 2015. Fine-grained visual categorization via multi-stage metric learning. In CVPR, 3716–3724.
  • [Ren et al.2017] Ren, S.; He, K.; Girshick, R. B.; and Sun, J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6):1137–1149.
  • [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3):211–252.
  • [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
  • [Sinha, Namkoong, and Duchi2018] Sinha, A.; Namkoong, H.; and Duchi, J. 2018. Certifiable distributional robustness with principled adversarial training. In ICLR.
  • [Sutskever et al.2013] Sutskever, I.; Martens, J.; Dahl, G. E.; and Hinton, G. E. 2013. On the importance of initialization and momentum in deep learning. In ICML, 1139–1147.
  • [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1–9.
  • [Zhou et al.2014] Zhou, B.; Lapedriza, À.; Xiao, J.; Torralba, A.; and Oliva, A. 2014. Learning deep features for scene recognition using places database. In NIPS, 487–495.


Appendix

Proof of Theorem 1


Due to the smoothness of the loss in $x$, the objective of the inner maximization admits a quadratic upper bound around the original example. When $\gamma$ is sufficiently large, the R.H.S. is bounded and the maximizer admits a closed form. Since the loss is smooth, plugging the maximizer back bounds the objective by the loss at the original example plus a gradient-norm term. Assuming the gradient norm is bounded, as in many neural networks, the original subproblem can be bounded as stated in Theorem 1. ∎

Proof of Theorem 2


We consider the augmented examples $\tilde{x} = x + \delta\, \nabla_x \ell / \|\nabla_x \ell\|_2$. According to the smoothness, the objective of the inner maximization can be upper-bounded by the loss at the augmented example plus a residual term. The last equality follows from setting the augmentation step to its optimal value. ∎

Proof of Theorem 3


For the convergence of the distribution, we expand the distance between the iterate and the optimal distribution using the projected gradient ascent update; the last inequality follows from the fact that the objective is strongly concave in $q$. Therefore, we can telescope the resulting inequality over iterations. For the first $T_0$ iterations the warm-up learning rate is applied, and afterwards the conventional learning rate is applied. Combining the two stages and setting the parameters as in Theorem 3 completes the proof. ∎