Modern deep learning algorithms are exceptional at interpolation. For example, they can achieve superhuman performance on image classification tasks when tested on the same distribution of images that they were trained on (Karpathy, 2011; Krizhevsky et al., 2012; Huang et al., 2018). When these models are evaluated on images that are even slightly perturbed, however, their performance often degrades catastrophically (Dodge & Karam, 2017; Hendrycks & Dietterich, 2019; Azulay & Weiss, 2018; Rosenfeld et al., 2018).
A common way of increasing the robustness of deep learning algorithms is to apply perturbations to images during training (Simard et al., 2003; Cubuk et al., 2018). Although models trained with certain image perturbations become more robust to the specific perturbations they were trained with, they remain vulnerable to most other kinds of noise distributions (Dodge & Karam, 2017; Hendrycks & Dietterich, 2019; Azulay & Weiss, 2018; Geirhos et al., 2018).
In this work, we explore the effects of the optimization algorithm on robustness. Specifically, we employ meta-learning to learn an optimizer designed specifically to produce models which perform well on corrupted images. The meta-learning framework consists of two nested learning problems. In the inner-problem, a learned, parametric optimizer trains a model, making use of gradients computed only on clean training data. The outer-problem involves training the parameters of the optimizer so that the model trained in the inner-loop has a low outer-loss. In this work, we employ outer-losses based on validation performance on corrupted images. We find that the learned optimizers produce models which are not only robust to the noise distribution used in outer-training, but, in some cases, are also more robust to additional noise distributions as well.
where $y$ are the prediction targets, and $\ell$ is the cross entropy loss. The ellipses ($\ldots$) denote potential additional features passed to the update rule (e.g. momentum values). An example of an update function $U$ is SGD, which can be expressed as $U = -\alpha \nabla_w \ell$, where $\nabla_w \ell$ is the gradient of the inner-loss, and the learning rate $\alpha$ is the single outer-parameter. In this work, we introduce more complex update functions with many outer-parameters $\theta$.
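As an illustration of this notation, the sketch below expresses SGD as an update function whose only outer-parameter is the learning rate, applied to a toy quadratic inner-loss (the names `sgd_update` and `theta` are illustrative, not from our implementation):

```python
import numpy as np

def sgd_update(w, grad, theta):
    """SGD written as an update function U: the step is -alpha * grad,
    where the learning rate alpha is the single outer-parameter."""
    alpha = theta["alpha"]
    return w - alpha * grad

# Toy quadratic inner-loss l(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
theta = {"alpha": 0.1}
for _ in range(100):
    grad = w  # gradient of the inner-loss at the current weights
    w = sgd_update(w, grad, theta)
# With alpha = 0.1, each step multiplies w by 0.9, so w shrinks toward 0.
```

A learned optimizer replaces `sgd_update` with a parametric function of many outer-parameters, trained so that the resulting inner-model scores well on the outer-objective.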
To evaluate inner-problem performance, we often use held-out validation data. In this work, we additionally want to be robust to different kinds of corruptions. As such, to compute the outer-objective at inner-iteration $t$ we compute $\ell(c(x_{\text{valid}}), y_{\text{valid}}; w_t)$, where $c$ is a function which injects noise. We emphasize that during outer-training of the optimizers, $c$ is not used to train the inner-model, only to evaluate it through the outer-objective. In some experiments, we do apply the learned optimizer to noised data after it has been outer-trained.
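The split between clean inner-training data and corrupted outer-evaluation data can be sketched as follows (a toy linear model with a squared-error stand-in for the loss; the noise-injection function `c` and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def c(x, std=0.05):
    """Hypothetical noise-injection function: additive per-pixel Gaussian noise."""
    return x + std * rng.normal(size=x.shape)

def loss(w, x, y):
    # Toy squared-error loss on a linear model, standing in for cross entropy.
    pred = x @ w
    return float(np.mean((pred - y) ** 2))

w = np.array([0.5, -0.25])
x_train, y_train = rng.normal(size=(32, 2)), rng.normal(size=32)
x_valid, y_valid = rng.normal(size=(32, 2)), rng.normal(size=32)

# Inner-training only ever sees clean data; the outer-objective evaluates
# the same weights on corrupted validation data.
train_loss = loss(w, x_train, y_train)
outer_loss = loss(w, c(x_valid), y_valid)
```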
To find the outer-parameters, $\theta$, we optimize for performance of the meta-objective (noised validation loss) with a corruption chosen from the meta-training corruption set. In all experiments we employ truncated evolutionary strategies for outer-training. While it is possible to use gradients, the estimators can be very high variance (Metz et al., 2018).
In this work, we parameterize our learned optimizer similarly to Metz et al. (2018), employing a small fully connected network that operates on each inner-parameter independently (with the exception of some cross-parameter normalization, described in Appendix A.2). This parameterization leverages existing features from optimization (such as momentum at different scales (Lucas et al., 2018)) and is flexible enough to express common regularization techniques, such as weight decay or learning rate decay, since weight value and timestep are included as input features. See Appendix A for more information on the update rule parameterization and outer-training.
3 Related Work
Recent work highlights the contrast between the human visual system and artificial neural networks (ANNs) by looking at commonplace corruptions of images. Geirhos et al. (2018) report that CNNs rely much more on texture than shape, relative to humans. They find that data augmentation via style transfer can help ANNs focus more on shape, which leads to improved robustness on the Common Corruptions Benchmark (Hendrycks & Dietterich, 2019). Dodge & Karam (2017) report that while ANNs and humans perform comparably well on clean, high-quality images, ANNs perform significantly worse on distorted images. They also report that errors made by humans and ANNs show little correlation (though other work has found surprising similarities in errors (Elsayed et al., 2018)). Azulay & Weiss (2018) show that ANNs are not robust to geometric transformations of objects either, such as translations and scale changes.
On the other hand, Gilmer et al. (2018); Fawzi et al. (2018); Ford et al. (2019) show that robustness to commonplace corruptions and worst-case corruptions (such as adversarial examples (Szegedy et al., 2014)) are directly related. Cubuk et al. (2017) find that the sensitivity of ANNs to distortions at the input has a universal functional form across machine learning models, caused by a lack of correlation between outputs for different classes.
Meta-learning is a general term often used to describe learning some aspects of a learning algorithm. Early work in this area is from Schmidhuber (1987), which involves self-referential algorithms. Optimizer learning was first studied in (Bengio et al., 1990, 1992) and then advanced with more complex parametric update rules and inner-models (Andrychowicz et al., 2016; Chen et al., 2016; Li & Malik, 2017; Wichrowska et al., 2017; Bello et al., 2017; Metz et al., 2018). In this work, we target an objective (validation loss on a noised image distribution) different from that used at training time (training loss). This idea has been explored in the context of validation loss (Metz et al., 2018), as well as in unsupervised learning (Metz et al., 2019) and in reinforcement learning (Houthooft et al., 2018).
We perform experiments on two types of noise distributions. First, we explore a corruption distribution consisting of different amounts of Gaussian noise added to the input image. Second, we explore a noise distribution based on the (Hendrycks & Dietterich, 2019) corruption benchmark. We select an outer-train set of corruptions and test our method on held out corruptions. In all cases the inner-model, the model being trained by an optimizer, consists of a 4 layer CNN on Cifar10. All values reported are cross entropy loss calculated on test images.
To aid in clarity, we color code our experimental setup in Table 1. For both experiments we train a learned optimizer. Our contribution, shown in black, is a learned optimizer outer-trained to perform well on noised validation data. At evaluation time, we can assess performance by inner-training on either clean data (to match how it was outer-trained) or noised data, and testing the performance of the trained model on different noise distributions. For the corruption experiments, to help isolate the effects of a more powerful optimizer from those of outer-training to target model robustness, we employ a second learned optimizer (blue) that we outer-train targeting clean validation images.
For both experiments we include Adam (Kingma & Ba, 2014) baselines, with the learning rate tuned over a range of values, outer-trained on both clean (solid) and noised (dashed) data matching the outer-training corruption distribution. To match standard hyperparameter tuning, we select the learning rate based on the target noise distribution, as opposed to the outer-train noise distribution.
4.1 Gaussian Noise
In this section, we train a learned optimizer to perform well on validation images (scaled 0-1) that have 0.05 per-pixel Gaussian noise added to them. In Figure 1a, we show outer-training curves. We find that our learned optimizer starts to outperform the learning rate tuned Adam after 500 outer-iterations, and Adam inner-trained on noisy data after 600 outer-iterations. In Figure 1b, we show inner-training of our learned optimizer evaluated on the noise distribution used at outer-training time. We present two baselines: the learning rate tuned Adam trained on clean data, and the learning rate tuned Adam trained on the 0.05-noised training data. We find that despite never seeing noised data at inner-training time, our learned optimizer can outperform Adam specifically trained at this noise level.
In Figure 1c we show outer-generalization outside the outer-training distribution. We present two settings of inner-training: training on clean data (solid) and on 0.05-noised data (dashed). On clean data, our learned optimizer outperforms the clean Adam baselines but does not outperform Adam trained on noised data beyond 0.08 noise. When training on noised data, we find considerable improvements in robustness and outperform all other models. This is particularly surprising, as this learned optimizer never saw noised inner-training data at outer-training time. Ideally we would like the learned optimizer to outperform Adam when inner-trained on noisy data. While this is true at 0.05 noise (solid black is lower than dashed yellow), it does not hold at higher noise levels.
4.2 Novel corruption types
In this section we explore the effects of transferring between different kinds of corruptions. We take the set of corruptions proposed in Hendrycks & Dietterich (2019) and divide the set of nine corruptions (excluding JPEG corruption) into an outer-train set consisting of 7 training corruptions (Gaussian noise, shot noise, impulse noise, defocus blur, zoom blur, brightness, and contrast), and an outer-test set consisting of 2 corruptions (frosted glass blur and fog). For computational reasons, during outer-training we monitor only 2 of the train corruptions and the 2 test corruptions. In Figure 2, we show two of the better performing corruptions (frosted glass, and shot noise) and provide the other two (fog, and brightness) in Appendix B. As an additional baseline, to isolate the effect of having a better optimizer as opposed to outer-training against a corruption objective, we also outer-train an optimizer targeting performance on clean validation images.
We find the performance of our learned optimizer varies dramatically across both the outer-train and outer-test corruptions. When inner-trained on clean data, our learned optimizer outer-trained for robustness outperforms both the baseline learned optimizer and Adam (also inner-trained on clean data) in all cases except the brightness corruption. Once again we find that inner-training on the outer-train corruption distribution helps dramatically, for Adam and for both learned optimizers. In the Appendix, we find that for fog and brightness, our baseline learned optimizer outperforms both our robustness-trained learned optimizer and Adam.
In this work we demonstrate the use of meta-learning to outer-train optimizers that produce robust classifiers. While small in scale, we see our results as a first step towards achieving this goal in real-world settings. In this work, we present two extremes of how to parameterize optimizers: our MLP learned optimizer, and the learning rate tuned Adam. The Adam parameterization used in this work is limited, as it is not able to make use of learning rate decay and regularizers the way our learned optimizer can. Designing better inductive biases and parameterizations for robustness on either end of this spectrum would be greatly beneficial. For example, the use of other regularizers (e.g. dropout) or data augmentation techniques would likely improve both our baseline and the learned optimizers.
In this work, we make the simplifying assumption for our learned optimizers that we are always inner-training on clean data. This choice defines a specific experimental paradigm. We outperform the hand designed optimizers in most cases when the hand designed optimizers abide by this paradigm (Adam trained on clean data, solid lines). When we break this experimental setup and train on noised data (dashed) we achieve much better performance with both Adam and our learned optimizers. Future work involves further exploring the impact of the training distribution on the meta-learning procedure. We could, for example, inner-train on a distribution of corruptions, train an optimizer to target a different set, and outer-test on a third set.
A limitation of meta-learning is the need for a distribution of corruptions. We have found the existing set of 9 corruptions presented in Hendrycks & Dietterich (2019) are quite different in nature. This makes outer-generalization to unseen corruptions challenging. Techniques such as meta-unsupervised learning (Hsu et al., 2018) could be used to build heuristic corruption types to train on, with the hope that the learned optimizer would transfer.
We would like to thank Justin Gilmer for discussion on this project as well as the rest of the Brain Team.
- Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
- Azulay & Weiss (2018) Azulay, A. and Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
- Bello et al. (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. Neural optimizer search with reinforcement learning. 2017. URL https://arxiv.org/pdf/1709.07417.pdf.
- Bengio et al. (1992) Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
- Bengio et al. (1990) Bengio, Y., Bengio, S., and Cloutier, J. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
- Chen et al. (2016) Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. Learning to learn without gradient descent by gradient descent. arXiv preprint arXiv:1611.03824, 2016.
- Cubuk et al. (2017) Cubuk, E. D., Zoph, B., Schoenholz, S. S., and Le, Q. V. Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846, 2017.
- Cubuk et al. (2018) Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- Dodge & Karam (2017) Dodge, S. and Karam, L. A study and comparison of human and deep learning recognition performance under visual distortions. In Computer Communication and Networks (ICCCN), 2017 26th International Conference on, pp. 1–7. IEEE, 2017.
- Elsayed et al. (2018) Elsayed, G., Shankar, S., Cheung, B., Papernot, N., Kurakin, A., Goodfellow, I., and Sohl-Dickstein, J. Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pp. 3910–3920, 2018.
- Fawzi et al. (2018) Fawzi, A., Fawzi, H., and Fawzi, O. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
- Ford et al. (2019) Ford, N., Gilmer, J., Carlini, N., and Cubuk, D. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
- Geirhos et al. (2018) Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
- Gilmer et al. (2018) Gilmer, J., Metz, L., Faghri, F., Schoenholz, S. S., Raghu, M., Wattenberg, M., and Goodfellow, I. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
- Hendrycks & Dietterich (2019) Hendrycks, D. and Dietterich, T. G. Benchmarking neural network robustness to common corruptions and surface variations. International Conference on Learning Representations, 2019.
- Houthooft et al. (2018) Houthooft, R., Chen, R. Y., Isola, P., Stadie, B. C., Wolski, F., Ho, J., and Abbeel, P. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.
- Hsu et al. (2018) Hsu, K., Levine, S., and Finn, C. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.
- Huang et al. (2018) Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
- Karpathy (2011) Karpathy, A. Lessons learned from manually classifying cifar-10. Published online at http://karpathy.github.io/2011/04/27/manually-classifying-cifar10, 2011.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
- Li & Malik (2017) Li, K. and Malik, J. Learning to optimize. International Conference on Learning Representations, 2017.
- Lucas et al. (2018) Lucas, J., Sun, S., Zemel, R., and Grosse, R. Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325, 2018.
- Metz et al. (2018) Metz, L., Maheswaranathan, N., Nixon, J., Freeman, C. D., and Sohl-Dickstein, J. Understanding and correcting pathologies in the training of learned optimizers. arXiv preprint arXiv:1810.10180, 2018.
- Metz et al. (2019) Metz, L., Maheswaranathan, N., Cheung, B., and Sohl-Dickstein, J. Meta-learning update rules for unsupervised representation learning. ICLR, 2019.
- Rosenfeld et al. (2018) Rosenfeld, A., Zemel, R., and Tsotsos, J. K. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
- Schmidhuber (1987) Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
- Simard et al. (2003) Simard, P. Y., Steinkraus, D., Platt, J. C., et al. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of International Conference on Document Analysis and Recognition, 2003.
- Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.
- Wichrowska et al. (2017) Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and Sohl-Dickstein, J. Learned optimizers that scale and generalize. International Conference on Machine Learning, 2017.
Appendix A Optimizer Details
We briefly give an overview of our optimizer training details. The optimizer used in this work is similar to that of Metz et al. (2018).
The inner-model used in this work consists of a 4 layer convolutional neural network with ReLU activations. It has hidden sizes of 32, 32, 64, 64 with strides 2, 2, 1, 1. All layers use a kernel size of 3. The activations of the final layer are averaged spatially and then passed into a linear projection to 10 units. We train with cross entropy loss.
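As a sanity check on these architecture numbers, the following sketch traces the spatial sizes through the network for a 32x32 Cifar10 input (assuming SAME padding, which the text does not specify):

```python
import math

def conv_out(size, stride):
    # Output spatial size of a 3x3 convolution with SAME padding.
    return math.ceil(size / stride)

channels = [32, 32, 64, 64]
strides = [2, 2, 1, 1]
size = 32  # Cifar10 images are 32x32
for ch, s in zip(channels, strides):
    size = conv_out(size, s)
# Spatial size goes 32 -> 16 -> 8 -> 8 -> 8; the 8x8x64 activations are then
# averaged spatially (leaving 64 features) and projected linearly to 10 logits.
```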
When outer-training our learned optimizer, we use clean Cifar10 data rescaled to fall between 0 and 1. Note that at evaluation time (after the model has been outer-trained) we also inner-train on noised data.
A.2 Learned optimizer architecture
The learned optimizer consists of a 1 hidden layer MLP that is shared across all units. For each unit, we construct a feature vector containing a variety of features commonly used in hand designed optimizers (Wichrowska et al., 2017). These include the gradient values; momentum values at 5 timescales (0.5, 0.9, 0.99, 0.999, 0.9999); the current weights; and the log absolute value of the weights. These values are then normalized by the second moment of each feature across each tensor. We include time-based features of the current inner-training iteration $t$, computed at timescales $\beta \in \{2, 10, 20, 100, 200, 1000\}$. Additionally, we include a feature that is the log norm of each tensor, and the log of the number of units in the tensor.
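The per-parameter feature construction can be sketched as follows (assuming the momenta are exponential moving averages of the gradient, which is the standard form; all names are illustrative):

```python
import numpy as np

DECAYS = [0.5, 0.9, 0.99, 0.999, 0.9999]  # the 5 momentum timescales

def update_momentum(moms, grad):
    """Track momentum (an EMA of gradients) at several timescales."""
    return [beta * m + (1 - beta) * grad for beta, m in zip(DECAYS, moms)]

def normalize(feature):
    """Normalize a per-parameter feature by its second moment over the tensor."""
    return feature / np.sqrt(np.mean(feature ** 2) + 1e-8)

grad = np.array([0.1, -0.2, 0.3])
moms = [np.zeros_like(grad) for _ in DECAYS]
moms = update_momentum(moms, grad)

# One normalized feature per quantity, stacked per-parameter for the MLP.
features = [normalize(grad)] + [normalize(m) for m in moms]
```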
These features are all passed through a 1 hidden layer MLP with 32 units to produce 2 outputs: $d$ and $m$. We combine them to produce a step as follows: $\Delta w = \lambda_1 d \exp(\lambda_2 m)$, where $\lambda_1$ and $\lambda_2$ are fixed small constants. The form of this update can be thought of as learning a direction, $d$, and a log step length, $m$. We multiply by the small constants $\lambda_1$, $\lambda_2$ to ensure that the initial step size is stable and so that we do not initialize in an unstable outer-loss regime.
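This combination step can be sketched directly; the constant values 1e-3 below are illustrative, chosen only to show why a randomly initialized MLP (whose outputs are order-1) yields near-zero initial steps:

```python
import numpy as np

LAMBDA1 = 1e-3  # illustrative small constants; they keep the initial
LAMBDA2 = 1e-3  # step of a randomly initialized MLP close to zero

def combine(d, m):
    """Combine the MLP's two outputs into a step: direction d, scaled by
    an exponentiated log step length m."""
    return LAMBDA1 * d * np.exp(LAMBDA2 * m)

# Near initialization, m is around 0, so exp(LAMBDA2 * m) is near 1 and the
# step magnitude is on the order of LAMBDA1.
step = combine(np.array([1.0, -1.0]), np.array([0.0, 0.0]))
```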
A.3 Outer-training details
We outer-train on an asynchronous, batched, distributed cluster containing 256 workers and a batch size of 256. Each worker performs partial truncations and sends gradient information to a centralized learner. A worker then synchronizes weights and proceeds from where the previous truncation left off. To account for and mitigate truncation bias, we use an increasing schedule of truncation length that starts at 100 and linearly increases to 10k over 5k outer-iterations. Note that we never actually train until completion in any of our experiments. To prevent artifacts arising from the truncation schedule, we jitter this truncation amount by 20% while training. If at any point the outer-loss is greater than 2 times the initial loss, we stop the unroll and reinitialize the inner-model randomly.
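The truncation schedule described above can be sketched as follows (the function name and the uniform form of the 20% jitter are our illustrative choices):

```python
import random

random.seed(0)

def truncation_length(outer_iteration, start=100, end=10_000,
                      ramp=5_000, jitter=0.2):
    """Linearly ramp the truncation length from `start` to `end` over `ramp`
    outer-iterations, then jitter by +/-20% to avoid schedule artifacts."""
    frac = min(outer_iteration / ramp, 1.0)
    length = start + frac * (end - start)
    length *= 1.0 + random.uniform(-jitter, jitter)
    return int(length)

length_early = truncation_length(0)       # roughly 100, +/-20%
length_late = truncation_length(10_000)   # roughly 10k, +/-20%
```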
For an outer-gradient estimator, we make use of variational optimization. As shown in Metz et al. (2018), we can use a reparameterization-based gradient (backprop through unrolled training), a gradient based on evolutionary strategies, or the combination of the two. In this work, we only use the evolutionary strategies based estimator, as it uses less RAM with our naive implementation and is thus easier to work with given our computing infrastructure. We expect using the combined estimator would speed up outer-training. For lower-variance evolutionary strategies gradients, we make use of antithetic sampling with shared randomness wherever possible.
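A generic antithetic evolutionary-strategies gradient estimate (not our distributed implementation; names and the toy quadratic objective are illustrative) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def es_grad(outer_loss, theta, sigma=0.01, n_pairs=256):
    """Antithetic ES estimate of d outer_loss / d theta: evaluate mirrored
    perturbations +eps and -eps, and weight the loss difference by eps."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.normal(size=theta.shape)
        delta = outer_loss(theta + sigma * eps) - outer_loss(theta - sigma * eps)
        grad += delta * eps / (2 * sigma)
    return grad / n_pairs

# Sanity check on a quadratic, whose true gradient at theta is 2 * theta.
theta = np.array([1.0, -0.5])
g = es_grad(lambda th: float(np.sum(th ** 2)), theta)
```

Antithetic pairs cancel the zeroth-order term of the loss exactly, which is what makes the estimate usable at this sample count.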
While progress has been made on increasing stability of learned optimizer training, not all random seeds converge. We use the outer-train loss to select the best model out of 4 random seeds for the corruptions experiments, and 3 random seeds for the Gaussian noise experiments.
A.4 Outer-training task distribution: Gaussian experiments
Our outer-objective for the Gaussian noise experiments consists of validation Cifar10 images corrupted with 0.05 Gaussian noise added to them.
A.5 Outer-training task distribution: Corruption experiments
Our outer-objective for the corruption experiments consists of sampling a severity (1, 2, or 3) and a training corruption (Gaussian noise, shot noise, impulse noise, defocus blur, zoom blur, brightness, or contrast). For each inner-training run, we sample a new corruption with which to compute the meta-objective.
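The per-run task sampling amounts to a uniform draw over corruptions and severities, which can be sketched as (names are illustrative):

```python
import random

random.seed(0)

SEVERITIES = [1, 2, 3]
TRAIN_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise",
    "defocus_blur", "zoom_blur", "brightness", "contrast",
]

def sample_outer_task():
    """Sample the corruption applied to validation data for one inner-training run."""
    return random.choice(TRAIN_CORRUPTIONS), random.choice(SEVERITIES)

corruption, severity = sample_outer_task()
```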