1 Introduction
As machine learning moves to new domains, collecting diverse, rich, and application-relevant datasets is critical for its continued success. Historically, research on learning optimization algorithms has only leveraged single tasks (andrychowicz2016learning; metz2019understanding), or parametric synthetic tasks (wichrowska2017learned), due to the difficulty of obtaining large sets of tasks.
1.1 TaskSet: A set of tasks
We present a set of tasks significantly larger than any optimizer dataset previously studied. We aim to better enable standardized research on optimizers, be that analysis of existing optimizers, or development of new learned learning algorithms. We call this suite of tasks TaskSet.
Much in the same way that learned features in computer vision outpaced hand-designed features (krizhevsky2012imagenet; lecun2015deep), we believe that data-driven approaches to discovering optimization algorithms will replace their hand-designed counterparts, resulting in increased performance and usability. To this end, standardizing a large suite of optimization tasks is an important first step towards more rigorous learned optimizer research.
In this setting, a single “example” is an entire training procedure for a task defined by data, loss function, and architecture. Thus, TaskSet consists of over a thousand optimization tasks, largely focused on deep learning (neural networks). They include image classification using fully connected and convolutional models, generative models with variational autoencoders (kingma2013auto) or flows (dinh2016density; papamakarios2017masked), natural language processing tasks including both language modeling and classification, as well as synthetic tasks such as quadratics and optimization test functions. The problems themselves are diverse in size, spanning 7 orders of magnitude in parameter count, but remain reasonably fast to compute, as almost all tasks can be trained for 10k iterations on a CPU in under one hour. To demonstrate the breadth of this dataset we show an embedding of all the tasks in Figure 1.
1.2 Amortizing hyperparameter search
Machine learning methods are growing ever more complex, and their computational demands are increasing at a frightening pace (amodei2018ai). Unfortunately, most modern machine learning models also require extensive hyperparameter tuning. Often, hyperparameter search is many times more costly than the final algorithm, which ultimately has large economic and environmental costs (strubell2019energy).
The most common approach to hyperparameter tuning involves some form of quasi-random search over a pre-specified grid of hyperparameters. We propose a new hyperparameter search strategy: a simple ordered list of hyperparameters to try. The idea is that the first few elements in this list will cover most of the variation in hyperparameters found in typical machine learning workloads.
We choose the elements in this list by leveraging the diversity of tasks in TaskSet: we meta-learn a hyperparameter list that performs best on the set of tasks in TaskSet. We then test this list of hyperparameters on new, larger machine learning tasks.
Although learning the list of hyperparameters is costly (in total we train 29 million models consisting of over 4000 distinct hyperparameter configurations), our final published list is now available as a good starting guess for new tasks.
Furthermore, we believe the raw training curves generated by this search will be useful for future hyperparameter analysis and meta-learning research, and we release them as part of this work (footnote 2: github.com/googleresearch/googleresearch/tree/master/task_set). We additionally release code in Tensorflow (abadi2016tensorflow), Jax (jax2018github), and PyTorch (pytorch) for a reference optimizer which uses our learned hyperparameter list and can be easily applied to any model (footnote 3: github.com/googleresearch/googleresearch/tree/master/opt_list). We believe this hyperparameter search strategy will enable the machine learning community to train better performing models, in less time, and with reduced compute and energy cost.
1.3 Paper structure
In §2, we define our open source optimizer dataset, TaskSet. In §3, we choose several common optimizers, and we detail our algorithm for finding a performant search strategy over those optimizers' hyperparameters for TaskSet. §4 provides qualitative summary statistics for the tasks in TaskSet for a typical optimizer. §5 constitutes a detailed study of the performance of our learned hyperparameter list benchmarked against several baseline search strategies, both in terms of training time and optimizer generalization performance. Finally, §6 considers transfer learning experiments of our learned list to significantly larger architectures.
2 TaskSet: A set of tasks
How should one choose what problems to include in a set of optimization tasks? In our case, we seek to include common problems in machine learning research. As such, we strive to include optimization tasks that have been influential over the course of research in the last several decades. This is necessarily subjective, but by distilling these beliefs into a clear set of tasks we are explicit about this subjectivity. Designing this dataset requires striking a balance between including realistic large-scale workloads and ensuring that tasks are fast to train so that using it for meta-learning is tractable. We chose to fill our dataset with a mixture of mostly neural network based tasks. Our chosen tasks have between ten thousand and one million parameters (much smaller than the billions commonly used today); as a result, most problems can train in under an hour on a cloud CPU with 5 cores. We additionally focus on increased “task diversity” by including many different kinds of training algorithms, architectures, and datasets. This is inspired by past work in reinforcement learning, which has demonstrated that large numbers of problems and increased diversity around some domain of interest are useful for both training and generalization (heess2017emergence; tobin2017domain; cobbe2018quantifying; openai2019rubiks). Once again though, a balance must be struck, as in the limit of too much diversity no learning can occur due to the no free lunch theorem (wolpert1997no).
Our dataset, TaskSet, is made up of 1162 tasks in total. We define a task as the combination of a loss function, a dataset, and initialization.
Specifically we define a task as a set of 4 functions:

Initialization: () → parameter initial values

Data generator: data split (e.g. train / valid / test) → batch of data

Forward pass: (batch of data, params) → loss

Compute gradients: (input data, params) → gradients of the loss with respect to the params
A task has no tunable hyperparameters and, coupled with an optimizer, provides all the necessary information to train using first order optimization. This makes experimentation easier, as each task definition also specifies reasonable defaults for hyperparameters such as batch size (shallue2018measuring; mccandlish2018empirical) or initialization (schoenholz2016deep; yang17; xiao18a; li2018on; pretorius2018critical; hayou2018selection; karakida2018universal; blumenfeld2019mean; hayou2019meanfield) that no longer need to be tuned.
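To make the four-function definition concrete, the following is a minimal sketch of what such a task interface could look like. It is not the TaskSet API: the `Task` container, the toy linear regression task, and the plain SGD loop are all hypothetical names introduced here for illustration, and the toy data generator ignores the split argument that a real task would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict

import numpy as np

Params = Dict[str, np.ndarray]
Batch = Dict[str, np.ndarray]


@dataclass(frozen=True)
class Task:
    """Hypothetical container for the four functions that define a task."""
    init: Callable[[], Params]                # () -> initial parameter values
    data: Callable[[str], Batch]              # data split -> batch of data
    loss: Callable[[Batch, Params], float]    # (batch, params) -> loss
    grad: Callable[[Batch, Params], Params]   # (batch, params) -> gradients


def make_linear_regression_task(dim: int = 4, batch_size: int = 32,
                                seed: int = 0) -> Task:
    """Toy task with fixed batch size and initialization (nothing left to tune)."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=dim)

    def init() -> Params:
        return {"w": np.zeros(dim)}

    def data(split: str) -> Batch:
        # A real task would read a different data split here;
        # this toy simply draws fresh noiseless samples.
        x = rng.normal(size=(batch_size, dim))
        return {"x": x, "y": x @ true_w}

    def loss(batch: Batch, params: Params) -> float:
        err = batch["x"] @ params["w"] - batch["y"]
        return float(np.mean(err ** 2))

    def grad(batch: Batch, params: Params) -> Params:
        err = batch["x"] @ params["w"] - batch["y"]
        return {"w": 2.0 * batch["x"].T @ err / len(err)}

    return Task(init, data, loss, grad)


def train_sgd(task: Task, learning_rate: float = 0.05, steps: int = 200) -> Params:
    """Pair the task with an optimizer (here plain SGD) and train."""
    params = task.init()
    for _ in range(steps):
        grads = task.grad(task.data("train"), params)
        params = {k: v - learning_rate * grads[k] for k, v in params.items()}
    return params
```

The only knobs exposed at training time belong to the optimizer (`learning_rate`, `steps`), which mirrors the statement that a task itself has no tunable hyperparameters.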
Hand-designing architectures, datasets, and losses for thousands of neural-network-based tasks is a challenge. We augment a set of “fixed” tasks which have been designed by hand, with “sampled” tasks that are randomly generated task instances.
2.1 Sampled families of tasks
Sampled tasks are created by sampling neural network architectures, activation functions, datasets, and other properties. We organize these sampled tasks into families of similar tasks, listed below. See Appendix G for more details and example configurations.

mlp: Multi-layer perceptrons trained on image data.

mlp_ae: Multi-layer perceptron based autoencoder trained on image data (hinton2006reducing).

mlp_vae: Multi-layer perceptron based variational autoencoder trained on image data (kingma2013auto).

conv_pooling: ConvNet with spatial pooling before the classification layer.

conv_fc: ConvNet with fully connected classification network instead of pooling.

nvp: Non-volume-preserving flows trained on image data (dinh2016density).

maf: Masked autoregressive flows trained on image data (papamakarios2017masked).

char_rnn_language_model: Language modeling with an RNN on characters (graves2013generating).

word_rnn_language_model: Language modeling with an RNN on words / subwords.

rnn_text_classification: Text classification using RNN models.

quadratic: Problems based on quadratics possibly transformed by a nonlinearity.

losg_tasks: Tasks generated from the synthetic optimization problems documented in “Learned Optimizers that Scale and Generalize” (wichrowska2017learned), abbreviated “losg”.
Defining a sampling distribution that generates tasks that are always valid, and that run within a time constraint, is difficult. Instead, we define a broad distribution and make use of rejection sampling to remove tasks that are either too slow, contain errors, or that we are unable to optimize at all. By starting with a distribution that is too broad, and pruning it, we hope to achieve better coverage of tasks.
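The broad-distribution-then-prune idea can be sketched as a plain rejection sampler. Everything below is an illustrative assumption rather than the sampling code used for TaskSet: the configuration fields, the `estimated_cost` proxy (a stand-in for actually timing a task), and the budget are hypothetical. A real filter would also reject configurations that raise errors or cannot be optimized at all.

```python
import random
from typing import Callable, Dict, List

Config = Dict[str, object]


def sample_mlp_config(rng: random.Random) -> Config:
    """Hypothetical, deliberately broad distribution over MLP task configs."""
    depth = rng.randint(1, 6)
    return {
        "hidden_sizes": [rng.choice([16, 64, 256, 1024]) for _ in range(depth)],
        "activation": rng.choice(["relu", "tanh", "sigmoid"]),
        "batch_size": rng.choice([8, 32, 128, 512]),
    }


def estimated_cost(config: Config) -> int:
    """Crude proxy for step time: multiply-adds per training batch."""
    sizes = [784] + list(config["hidden_sizes"]) + [10]
    mults = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    return mults * int(config["batch_size"])


def rejection_sample(sample_fn: Callable[[random.Random], Config],
                     accept_fn: Callable[[Config], bool],
                     num_tasks: int,
                     rng: random.Random,
                     max_tries: int = 100_000) -> List[Config]:
    """Draw from the broad distribution and keep only accepted configs."""
    accepted: List[Config] = []
    for _ in range(max_tries):
        if len(accepted) == num_tasks:
            break
        config = sample_fn(rng)
        if accept_fn(config):
            accepted.append(config)
    return accepted
```

For example, `rejection_sample(sample_mlp_config, lambda c: estimated_cost(c) <= 50_000_000, 20, random.Random(0))` keeps the first 20 sampled configurations that fit the assumed compute budget.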
2.2 Hand designed tasks
In addition to the sampled tasks, we also include 107 hand designed tasks. These consist of more common tasks that both improve the coverage beyond the sampled tasks, and provide for better interpretability through a closer match to existing tasks in the literature. These tasks span image classification, text classification, language modeling, and generative modeling, as well as some synthetic tasks such as associative retrieval (ba2016using). We leave the description of each one of these tasks to Appendix G.3.
3 Amortized hyperparameter search
As a first demonstration leveraging the large-scale task dataset for meta-learning research, we consider learning hyperparameter lists. We define an optimizer as the pairing of an optimization algorithm and all its corresponding hyperparameters (e.g. learning rate). While practitioners sometimes use a single optimizer (e.g. Adam (kingma2014adam) with default hyperparameters), most will run multiple optimizers and use a validation set to select the best performer.
3.1 Optimizer families
We define different parameterizations of hand designed optimizers as an optimizer family. The optimizer families we consider consist of:

Adam1p: One hyperparameter, the fixed learning rate

Adam4p: Four Adam hyperparameters, $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$

Adam6p: Adam4p hyperparameters, and two additional hyperparameters controlling linear and exponential learning rate decays

Adam8p: The hyperparameters in Adam6p plus two additional hyperparameters for $\ell_1$ and $\ell_2$ regularization terms

NAdamW: A 10 hyperparameter search space based on NAdam (dozat2016incorporating) with cosine learning rate decay, and weight decay.
For the full update equations see Appendix C.1 for Adam and C.2 for NAdamW. We chose Adam based on its use in existing work, and NAdam based on the performance shown in (choi2019empirical).
3.2 Learned hyperparameter lists
Traditionally researchers tune hyperparameters on a per-model basis. While this often results in performance gains, it comes at the cost of immense compute, and researchers are almost never able to expend enough compute to saturate model performance (shallue2018measuring). As an alternative to per-problem tuning, we propose instead tuning the search strategy itself on a dataset of tasks and transferring the knowledge gained to new tasks of interest. This idea is already implicitly applied by humans: we don't start a hyperparameter search from an arbitrary learning rate, we use values that the community has found useful.
This dataset-based tuning has a number of desirable properties. First, the resulting search strategies are much more efficient, resulting in large speedups in sample efficiency on unseen tasks over a random search baseline. Second, we are less restricted by the number of optimizer parameters we search over or by needing to define reasonable search spaces. For example, if there are redundant regions of search space, our learned search strategy will be less likely to sample them repeatedly, unlike random search. If there is a region of hyperparameter space that performs poorly on all problems, the learned search strategy will avoid it.
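In this work we parameterize the learned search strategy as an ordered list of optimizers to try (i.e. a list of hyperparameter configurations). Given a fixed number of task evaluations we would like to achieve the best possible performance on all tasks in the training set of tasks. For a length $k$ list of optimizers we define our loss as:
$$J(\theta_1, \ldots, \theta_k) = \sum_{\tau \in \text{tasks}} \left[ \min_{i \in 1..k} f(\tau, \theta_i) \right] \qquad (1)$$
where $\theta_i$ are the optimizer hyperparameters for element $i$ in the list, and $f$ is an appropriately normalized loss computed after training task $\tau$.
We seek to find an optimal list of optimizers as:
$$\theta^*_{1, \ldots, k} = \operatorname*{arg\,min}_{\theta_1, \ldots, \theta_k} J(\theta_1, \ldots, \theta_k) \qquad (2)$$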
3.3 Scoring an optimizer by averaging over tasks
To score a task, we initialize the parameters of the task and run 10,000 iterations of an optimizer. We monitor loss on each data split (train, validation, test) every 200 steps using an average over 50 minibatches per evaluation. For all data presented in this paper we also compute averages over 5 random task parameter initializations.
A side effect of the diverse task dataset is that losses span multiple orders of magnitude, making direct aggregation of performance problematic. To remedy this we normalize the loss values for all tasks linearly between 0 and 1 where 1 is validation loss at initialization and zero is the lowest validation loss achieved by any tested optimizer. Loss values greater than the loss at initialization are clipped to 1.
To collapse an entire normalized training curve into a scalar cost, we compute the mean normalized loss over the 10,000 iterations. We find empirically that this choice is similar to taking the minimum (Appendix A.4). We leave exploring alternative methods such as performance profiles (dolan2002benchmarking) and Nash averaging (balduzzi2018re) for future work.
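The normalization and aggregation described in this section can be sketched for a single task as follows. The array layout is an assumption made for illustration: `valid_curves[o, t]` holds the validation loss of optimizer `o` at evaluation point `t`, with column 0 being the loss at initialization. Treating non-finite (diverged) values as the initialization loss is also an assumption of this sketch, not something stated in the text.

```python
import numpy as np


def normalize_curves(valid_curves: np.ndarray) -> np.ndarray:
    """Rescale one task's validation curves to [0, 1].

    1 corresponds to the validation loss at initialization and 0 to the
    lowest validation loss reached by any tested optimizer.
    """
    init_loss = valid_curves[0, 0]
    # Assumption: diverged runs (NaN / inf) count as "no better than init".
    curves = np.where(np.isfinite(valid_curves), valid_curves, init_loss)
    best_loss = curves.min()
    denom = max(init_loss - best_loss, 1e-12)  # guard against a flat task
    scaled = (curves - best_loss) / denom
    # Losses worse than the loss at initialization are clipped to 1.
    return np.clip(scaled, 0.0, 1.0)


def score_optimizers(valid_curves: np.ndarray) -> np.ndarray:
    """Collapse each normalized curve to a scalar: the mean over training."""
    return normalize_curves(valid_curves).mean(axis=1)
```

Stacking the output of `score_optimizers` over tasks gives a (tasks × optimizers) matrix of normalized losses, which is the quantity that plays the role of $f$ in Eq. 1.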
3.4 Greedy learning from random search
Optimizing Eq. 2 is combinatorially expensive. To tractably solve this optimization problem, we introduce two approximations. First, we shift the unconstrained search over the full space of optimizers to a search over a finite set of optimizers, $\Theta$. This finite set can be computed ahead of time and decouples the expensive procedure of training each task with an optimizer from training the learned search space. Separating data and training in this way has been done for both hyperparameter search (eggensperger2015efficient) and neural architecture search (klein2019tabular; ying2019nasbench). In total we trained 1,000 optimizer configurations for each of Adam1p, Adam4p, Adam6p, Adam8p, and NAdamW on all 1,162 tasks with 5 random seeds per pair. Second, we use a greedy heuristic to approximate the combinatorial search over sets of $k$ optimizers. For a single optimizer trial, $k = 1$, we select the best performing optimizer on average across all training tasks. We then continue to select optimizer parameters such that the minimum of all optimizer parameters per task, aggregated over all tasks, is minimized. This shifts the complexity from exponential in $k$ to linear. Finding a length $k$ set of optimizers can thus be efficiently computed as follows:
$$\theta_1 = \operatorname*{arg\,min}_{\theta \in \Theta} \sum_{\tau \in \text{tasks}} f(\tau, \theta) \qquad (3)$$
$$\theta_2 = \operatorname*{arg\,min}_{\theta \in \Theta} \sum_{\tau \in \text{tasks}} \left[ \min\left(f(\tau, \theta_1), f(\tau, \theta)\right) \right] \qquad (4)$$
$$\theta_k = \operatorname*{arg\,min}_{\theta \in \Theta} \sum_{\tau \in \text{tasks}} \left[ \min\left(b, f(\tau, \theta)\right) \right], \quad \text{where } b = \min_{i \in 1..(k-1)} f(\tau, \theta_i) \qquad (5)$$
We note that the first argument of the outer min, $b$, can be computed once per set of hyperparameters as it does not depend on $\theta$. Finally, as our tasks are stochastic, we order optimizers based on validation loss and report test loss (van2016deep). (Footnote 4: This technically means that increasing the number of optimizers could potentially decrease performance, but we find this rarely happens in practice.)
This training strategy requires an original search space from which to collect data and build the finite set $\Theta$. The search space we use is described in Appendix D.1. While large, we find that the optimal parameters for each task end up covering almost the entire space.
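The greedy procedure of Eqs. 3 to 5 can be sketched in a few lines, assuming the precomputed results are stored as a matrix `f[task, optimizer]` of normalized losses (one column per element of the finite optimizer set). The function names and the matrix layout are illustrative assumptions, not the released implementation.

```python
from typing import List, Sequence

import numpy as np


def list_loss(f: np.ndarray, selected: Sequence[int]) -> float:
    """Eq. 1: sum over tasks of the best loss achieved by any list element."""
    return float(f[:, list(selected)].min(axis=1).sum())


def greedy_list(f: np.ndarray, k: int) -> List[int]:
    """Greedily build a length-k list of column indices into f."""
    num_tasks = f.shape[0]
    # b in Eq. 5: best loss per task over the optimizers chosen so far.
    # Starting from +inf makes the first iteration reduce to Eq. 3.
    best_so_far = np.full(num_tasks, np.inf)
    selected: List[int] = []
    for _ in range(k):
        # b is fixed within a step, so every candidate is scored at once.
        totals = np.minimum(best_so_far[:, None], f).sum(axis=0)
        choice = int(np.argmin(totals))
        selected.append(choice)
        best_so_far = np.minimum(best_so_far, f[:, choice])
    return selected
```

Each step costs one pass over the matrix, so building a length-$k$ list is linear in $k$. In this sketch, once no candidate improves any task, `argmin` falls back to the first column, so in practice one would stop early or skip already-selected entries.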
4 Experiments: TaskSet
In this section we demonstrate various properties of the task suite. For a qualitative view, we first construct a feature space consisting of performance measurements for each task and optimizer pair (see §3.3). This forms a dense matrix of size number of tasks by number of optimizers. We then perform t-SNE (maaten2008visualizing; van2014accelerating) to reduce the dimensionality to two and plot the results, coloring by task family (Figure 1). Clusters in this space correspond to tasks that work well with similar optimizers. We find a diversity of tasks, with clusters occurring around similar families of tasks.
Next, we look at aggregate statistics. In Figure 2a we show histograms of compute times for all problems and find that almost all problems train in under an hour (see Appendix B for per task family histograms). In Figure 2c we plot a histogram of the number of parameters per task. Finally, in Figure 2b we show a distribution of task difficulty by plotting the fraction of optimizer configurations that achieve a certain loss value. We find that for some tasks a large fraction of optimizers perform well, while for others only a small fraction achieve a loss close to the smallest observed loss.
5 Experiments: Training and generalization of learned hyperparameter lists
With our dataset of tasks and data collected, we turn our attention to exploring training of the hyperparameter lists, and generalization beyond the suite of tasks in TaskSet. Our main tool for showing performance is a figure that sweeps the number of optimizer configurations on the x-axis and shows the best performance achieved for each number of optimizers tried, averaged over some set of tasks (see Eq. 1).
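The quantity plotted in these sweeps can be computed with a running minimum. The sketch below assumes a matrix `f[task, optimizer]` of normalized losses and an ordering of optimizer indices (a learned list, or a random permutation for the random search baseline). Simulating random search by sampling without replacement from the finite pool of evaluated configurations is an assumption of this sketch.

```python
from typing import Sequence

import numpy as np


def best_so_far_curve(f: np.ndarray, order: Sequence[int]) -> np.ndarray:
    """curve[n-1] = mean over tasks of the best loss among the first n tried."""
    per_task = np.minimum.accumulate(f[:, list(order)], axis=1)
    return per_task.mean(axis=0)


def random_search_curve(f: np.ndarray, num_trials: int,
                        num_repeats: int = 100, seed: int = 0) -> np.ndarray:
    """Average best-so-far curve over random orderings of the optimizer pool."""
    rng = np.random.default_rng(seed)
    curves = [
        best_so_far_curve(f, rng.permutation(f.shape[1])[:num_trials])
        for _ in range(num_repeats)
    ]
    return np.mean(curves, axis=0)
```

Both curves are non-increasing in the number of optimizers tried when evaluated on the same data used for selection. A learned list is more efficient than random search when its curve lies below the random search curve at small budgets.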
5.1 Learned hyperparameter lists are more efficient than random search
To demonstrate the impact of learning a search space, we take the 1,162 tasks and split them evenly into train and test tasks. We then learn a search strategy using optimizers from the Adam8p family following Eq. 5 on the train tasks. As baselines, we use random search with different search spaces, including just learning rate (Rand: Adam1p), the default Adam hyperparameters (Rand: Adam4p), as well as the Adam 8 dimensional search space (Rand: Adam8p). Search spaces are specified in Appendix D.1.
The performance of random search critically depends on the boundaries of the original search space. Without prior knowledge about the problems, however, picking a good search space is difficult. To explore this we additionally choose search spaces after collecting and looking at the data. We then use these search spaces to simulate random search within the constraints via rejection sampling. To find these search spaces we find the best hyperparameters for each task and construct new hyperparameter ranges with min and max values determined by the smallest and largest values of each hyperparameter that were the best for some task. This removes regions of the search space not used by any task. We also tested bounds based on the 5th and 95th percentile of best performing hyperparameters computed over all tasks. In the case of min and max, we find the optimal hyperparameters cover nearly all of the existing space, whereas the percentile based search spaces reduce the volume of the search hypercube by more than 90%, leaving us with only 100 hyperparameter configurations. In Figure 5, we find that, in all cases, learning the hyperparameter list is much more efficient.
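The data-dependent search spaces described above can be sketched as follows. The sketch assumes a loss matrix `f[task, optimizer]` and a matrix `hparams[optimizer, dim]` of the sampled hyperparameter values (scale-like hyperparameters would typically be stored in log space). The function names are hypothetical.

```python
from typing import Tuple

import numpy as np


def best_hparams_per_task(f: np.ndarray, hparams: np.ndarray) -> np.ndarray:
    """Hyperparameters of the best optimizer for each task: shape [task, dim]."""
    return hparams[np.argmin(f, axis=1)]


def min_max_bounds(best: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Per-dimension range spanned by hyperparameters that were best somewhere."""
    return best.min(axis=0), best.max(axis=0)


def percentile_bounds(best: np.ndarray, lo: float = 5.0,
                      hi: float = 95.0) -> Tuple[np.ndarray, np.ndarray]:
    """Tighter per-dimension range based on percentiles of the best values."""
    return np.percentile(best, lo, axis=0), np.percentile(best, hi, axis=0)


def rejection_filter(hparams: np.ndarray, low: np.ndarray,
                     high: np.ndarray) -> np.ndarray:
    """Indices of already-evaluated configurations inside the new hypercube."""
    keep = np.all((hparams >= low) & (hparams <= high), axis=1)
    return np.flatnonzero(keep)
```

Random search in the restricted space is then simulated by sampling only from the indices returned by `rejection_filter`. Because a configuration must fall inside the range in every dimension, even mild per-dimension percentile bounds can discard most of an 8 dimensional pool.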
5.2 More tasks lead to better generalization
We next look at the effects of the number of training tasks on generalization. We take subsets of tasks of different size, and train hyperparameter lists using Eq. 5. We compute test performance on the remainder of the tasks and plot loss averaged over different splits in Figure 4. We find that a large number of tasks (more than 100) are required to achieve near-optimal test performance. This is surprising to us given how simple our learned search strategy is (simply a list of hyperparameters), but not wholly so given past work studying generalization in RL (cobbe2018quantifying).
5.3 Generalization to different types of problem
For learned algorithms to be generally useful, some amount of generalization to unseen task families is required. To test this, we split our data into disjoint task types. We perform two splits: testing on RNN tasks and training on all others, and testing on autoencoder tasks and training on all others. As a best case baseline we additionally train search spaces on the test task families directly. We find an order of magnitude better sample efficiency than random search for both cases and find our learned search space is close in performance to search spaces trained on just the testing tasks (Fig. 5).
5.4 Generalization to different sized problems
Training learned algorithms on large models is often infeasible for computational reasons. As such, one form of generalization needed when building learned algorithms is the ability to transfer to different sized models. As shown in Figure 2, the tasks in this suite contain a wide range of parameter counts, and can thus be used to test this kind of generalization. We split the tasks into 8 groups, one group per order of magnitude in parameter count, and train hyperparameter lists on one range and test on the rest. In Figure 6 we plot the fraction of the training loss achieved by the test loss on the target parameter range. We find peak performance around the model sizes used for training, and smooth falloff as the testing tasks become more dissimilar as measured by parameter count. We note that our problems are not evenly distributed across these groups, so each group contains a different percentage of the underlying tasks. While this potentially confounds these results, we believe a similar bias occurs in realistic workloads as well.
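The grouping by parameter count can be sketched as below. The task names and the dictionary layout are hypothetical. The order of magnitude is computed from the number of decimal digits rather than a floating point logarithm, to avoid rounding problems at exact powers of ten.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def magnitude_group(num_params: int) -> int:
    """Order of magnitude of a parameter count, e.g. 2_500_000 -> 6."""
    if num_params < 1:
        raise ValueError("parameter count must be positive")
    return len(str(int(num_params))) - 1


def group_tasks(param_counts: Dict[str, int]) -> Dict[int, List[str]]:
    """Map each order of magnitude to the task ids that fall in it."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for task_id, count in param_counts.items():
        groups[magnitude_group(count)].append(task_id)
    return dict(groups)


def split_by_group(groups: Dict[int, List[str]],
                   train_group: int) -> Tuple[List[str], Dict[int, List[str]]]:
    """Train on one magnitude group and keep every other group for testing."""
    train = list(groups[train_group])
    test = {g: ids for g, ids in groups.items() if g != train_group}
    return train, test
```

A hyperparameter list would then be learned on the `train` tasks and evaluated separately on each remaining group, which gives one point per test group for a fixed training group.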
6 Experiments: Realistic problems
In §5.3 and §5.4 we explored generalization of learned hyperparameter lists to held out tasks within the TaskSet dataset. While useful for analysis, these tasks are still far from the workloads commonly employed to solve real problems. In this section, we explore the performance of our learned search space on a number of state of the art models. These models drastically differ from the training set of tasks in parameter count and compute cost. For all experiments in this section we take the optimizer ordering using the NAdamW optimizer family on all TaskSet tasks then apply the resulting search space to the target problem. The final list of hyperparameters can be found in Appendix F. We show results for ResNet50 on ImageNet, and Transformers on LM1B. Additional results with reinforcement learning using PPO are in Appendix A.1.
6.1 ImageNet Resnet50
We take the TPU implementation with default settings from the official Tensorflow models repository (resnet50code) and swap out different optimizers. We test the default optimizer (SGD + momentum with a learning rate warmup and staircase decay), learning rate tuned Adam (in half orders of magnitude between 1e-6 and 3e-2), as well as the learned list of hyperparameters. For momentum and learning rate tuned Adam we leave the default weight decay value. For our learned search space we remove weight decay as this is handled by the optimizer.
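For readers unfamiliar with the convention, a learning rate sweep in half orders of magnitude is usually taken to mean alternating factors of roughly 3, i.e. 1e-6, 3e-6, 1e-5, and so on. The sketch below generates such a grid. Both the 1-and-3 convention and the endpoints 1e-6 and 3e-2 are assumptions of this sketch about how the baseline sweep was laid out.

```python
from typing import List


def half_decade_grid(min_exp: int = -6, max_exp: int = -2) -> List[float]:
    """Learning rates 1e{min_exp}, 3e{min_exp}, ..., 1e{max_exp}, 3e{max_exp}."""
    grid: List[float] = []
    for exponent in range(min_exp, max_exp + 1):
        grid.append(1.0 * 10.0 ** exponent)
        grid.append(3.0 * 10.0 ** exponent)
    return grid
```

With the default arguments this yields ten candidate learning rates, so the learning rate tuned Adam baseline in this reading corresponds to ten full training runs.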
We show accuracy computed over the course of training as well as best performance for a given hyperparameter budget in Figure 7. We find that the learned search space vastly outperforms learning rate tuned Adam. After 67 model evaluations we find an optimizer that outperforms the default SGD+momentum staircase learning rate schedule commonly used to train these models. By using the list as opposed to searching randomly in the original search space we find better hyperparameters faster. Note that using this list of hyperparameters does not require any problem-specific knowledge. Despite this, we are able to slightly improve upon the default methodology used when training a ResNet50.
6.2 LM1B Transformer
We take the transformer (vaswani2017attention) example implemented in Jax (jax2018github) with Flax (flax2020github). We train using a 2x2 TPU V2 configuration for 100k iterations. Once again we take all other hyperparameters as is and simply swap the optimizer implementation. We additionally split a second validation set from the training set, which we use to select the best hyperparameters. We present 2 baselines: first, tuning learning rate only, and otherwise using the default transformer training hyperparameters; and second, a fixed learning rate Adam baseline. Results are shown in Figure 10. We find the learned hyperparameter list dramatically outperforms the default optimizer setting and the fixed learning rate baseline. We suspect the fact that the fixed learning rate performs better than the built-in learning rate schedule is due to limited training time and model hyperparameters. Nevertheless, we emphasize that our method does not require any knowledge of the underlying problem to achieve faster results. See Appendix A.2 for this same transformer with a budget of 20k iterations.
7 Related Work
The idea of sets of tasks has been explored throughout machine learning. The majority of these suites are for use in evaluation, whereas our suite is targeted at meta-learning. The closest family of optimization tasks for evaluation to those presented here is DeepObs (schneider2019deepobs), which includes 20 neural network tasks. Our task suite focuses on smaller problems and contains 50x more tasks. Outside of evaluation, task suites in reinforcement learning such as Obstacle Tower (juliani2019obstacle), ProcGen (cobbe2019leveraging), CoinRun (cobbe2018quantifying), and Sonic (nichol2018gotta) focus on training algorithms that work across a variety of settings.
The creation of TaskSet was motivated by the goal of learning learning algorithms, or meta-learning (schmidhuber1987evolutionary; schmidhuber1995learning; hochreiter2001learning), and in particular learned optimizers (bengio1990learning; andrychowicz2016learning; Bello17; wichrowska2017learned; li2017learning; lv2017learning; metz2019understanding; metz2019using). In this work we do not use this task suite to train learned optimizers, but instead focus on learning a hyperparameter search strategy. Tuning hyperparameters by leveraging multiple tasks has been explored within the contexts of Bayesian optimization (swersky2013multi; perrone2019learning; perrone2018scalable) as well as meta-learning (chen2017learning).
8 Discussion
Learning optimization algorithms represents a promising direction to accelerate machine learning research. For the resulting algorithms to become useful tools, however, we must further understand the relationships between training tasks, meta-optimization, and both iid and out of distribution generalization. This work takes steps towards this goal by introducing a set of tasks which can be used to train and study optimization algorithms. We then use this task set and learned hyperparameter lists to answer questions related to optimization and generalization of learned learning algorithms. We find a large degree of generalization even to out of distribution tasks, but as the tasks get more varied, transfer performance suffers. At this point, the training of learned learning algorithms is computationally expensive despite the extreme simplicity of our learned learning algorithm parameterization (a list of hyperparameters). We hope to explore alternative parameterizations which will increase performance, such as by leveraging previous evaluations and partial model trainings (swersky2014freeze; li2016hyperband).
We are releasing the optimal hyperparameter list we have found as a drop-in replacement optimizer in a variety of deep learning frameworks (Tensorflow (abadi2016tensorflow), PyTorch (pytorch), and JAX (jax2018github)) in the hopes that the research community finds it useful. We believe this represents a new set of reasonable optimizer defaults for new problems. We additionally hope TaskSet encourages more standardized research on general purpose optimizers.
Acknowledgments
We would like to thank Alex Alemi, George Dahl, Justin Gilmer, Jaehoon Lee, Chris Madison, Alec Radford, and Christopher Shallue for input on this work. Finally, we would like to thank the entire Brain Team for providing a supportive research environment.
References
Appendix A Additional Experiments
A.1 Reinforcement Learning with PPO
We test the learned hyperparameter lists on two continuous control reinforcement learning environments, half cheetah and humanoid, from Gym’s Mujoco environments (todorov2012mujoco; Brockman2016). We use TFAgents (TFAgents) with all non-optimizer hyperparameters set via searching a mixture of environments. In Figure A.1 we find that our learned hyperparameter list achieves comparable to slightly worse performance and does not outperform learning rate tuning of Adam in either efficiency or final performance. To diagnose this behavior we ran all 1k optimizers for both problems and found the learned hyperparameter list performs comparably to random search in the underlying space. To probe further, we computed the Spearman correlation of the performance of each optimizer on these problems with its performance on the rest of the tasks in the task suite. We found considerably worse correlations than were present for tasks in TaskSet. This is not surprising as TaskSet contains no reinforcement learning problems.
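The correlation diagnostic can be sketched with a rank transform followed by a Pearson correlation, which is the definition of Spearman correlation when there are no ties. The sketch assumes `new_scores[o]` is the performance of optimizer `o` on the new problem and `f[task, o]` holds the same optimizers' normalized losses on TaskSet tasks. Ties are ranked arbitrarily here, which is a simplification.

```python
import numpy as np


def ranks(values: np.ndarray) -> np.ndarray:
    """Rank of each entry (0 = smallest). Ties are broken arbitrarily."""
    values = np.asarray(values)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


def correlation_with_suite(new_scores: np.ndarray, f: np.ndarray) -> np.ndarray:
    """Spearman correlation of a new task with every task (row) in f."""
    return np.array([spearman(new_scores, row) for row in f])
```

A new problem whose correlations with the suite are all close to zero ranks optimizers in a way the suite does not predict, which is the situation in which a learned list should be expected to behave like random search.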
A.2 LM1B targeting 20k iterations
A.3 Probing short horizon
Often the goal when training a learned optimizer is to minimize the loss after training for some number of iterations. This is extremely computationally expensive and in practice approximations must be used. One common family of approximations is short horizon based methods. These methods rely upon somehow truncating training so that updates can be made to the learned optimizer more frequently. This is commonly done via truncated backprop (werbos1990backpropagation; wichrowska2017learned; metz2019understanding; wuunderstanding), or proxy objectives such as only training for a handful of epochs (Zoph2017). While this short horizon proxy is certainly not optimal (wuunderstanding), the performance gains are immense, and in practice it is what makes meta-training optimizers feasible. In our task suite, we test this short horizon learning by training hyperparameter lists using only some finite number of training iterations per task and testing in the full training regime (10k steps). Results are shown in Figure 11. We find that even when learning the hyperparameter list on a mere 200 steps, our hyperparameter list continues to generalize and outperforms random search on Adam8p. This is promising, as it suggests that training the learned hyperparameter list can be done with 1/50th of the total compute. This result is surprising to us as prior work indicates the effect of this bias can be severe (wuunderstanding; metz2019understanding). We suspect it is due to the simplicity of the learned parameter space but leave a thorough analysis of this for future work.
A.4 Choice of normalization function
There is no easy way to define a single metric for optimizer performance over a mixture of tasks. This paper picks a single normalization strategy based on minimum validation loss and the validation loss at initialization, presented in §3.3. In this section we show the impact of choosing a different normalization and/or aggregation technique. First, instead of computing the mean over learning curves as described in §3.3, we compute a min. Second, instead of rescaling based on init and min, we linearly rescale based on the 95th percentile of validation loss and the min validation loss achieved at the end of training each task.
In Figure 12 we show learned hyperparameter list training and testing performance as a function of number of optimizers tried when training with different normalization techniques. We find using the min instead of mean results in a negligible change, while using the percentile loss more significantly hurts performance. This difference can be explained by Figure 12b and 12c where we show correlations between the two losses. We find the percentile loss has a much weaker correlation to the default normalizer. We suspect this difference is due to the fact that many optimizers diverge on tasks. By using the 95th percentile we upweight optimizers that do not diverge.
a.5 Task families are diverse
To show the effects of diversity we train and test hyperparameter lists on each pair of task families. We additionally normalize each column to the range 0 to 1 to account for different mean losses across tasks. Results are shown in Figure 13. While we do find some similarity between tasks, e.g. between MAF and NVP models, no two task families have the same performance characteristics (there are no duplicate columns), suggesting that each task family provides a different contribution to the space of all tasks. We also find that training on certain “far away” tasks, e.g. the quadratic family, yields poor performance on most other task families.
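The per-column rescaling can be illustrated with a small min-max sketch. The layout (rows as training families, columns as test families) and the function name are assumptions made for this example.

```python
import numpy as np

def normalize_columns(matrix):
    """Min-max rescale every column of a train-by-test loss matrix to [0, 1]."""
    m = np.asarray(matrix, dtype=float)
    col_min = m.min(axis=0, keepdims=True)
    col_max = m.max(axis=0, keepdims=True)
    # Guard against constant columns to avoid dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (m - col_min) / span
```

After this step, two task families with identical normalized columns would indicate redundant test behavior.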
a.6 Effects of the metatraining search space size
Our offline learning technique described in §3.4 hinges on a finite set of optimizers collected via random search; this is the set that appears in Eq. 5. In this section we probe the impact of the size of this set. We take different sized subsets of the thousand Adam8p optimizer configurations and train and test search spaces on different iid splits of tasks. We then plot performance as a function of this number of optimizers in Figure 15. Moving left in this figure corresponds to increasing the compute needed to train the learned hyperparameter list. We find performance continues to improve as the size of this set grows. Given the high dimension of our metaparameters, 8, this is not a surprise, as the number of evaluations needed to explore the space grows exponentially with dimension. We find that the full thousand trials are needed to outperform learning rate tuned Adam when only given a single optimizer evaluation. We find that around 100 optimizers are needed in the case of 10 optimizer trials.
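One way to picture this experiment is sketched below. It assumes, purely for illustration, that the hyperparameter list is built greedily from a table of normalized losses with shape [configurations, tasks]; the selection rule of §3.4 is not restated in this appendix, so the greedy rule and all names here are assumptions, and the train/test task split is omitted.

```python
import numpy as np

def greedy_list(losses, list_len):
    """Greedily pick configurations minimizing the mean over tasks of the best loss so far."""
    losses = np.asarray(losses, dtype=float)
    best_so_far = np.full(losses.shape[1], np.inf)
    chosen = []
    for _ in range(list_len):
        # Score of each candidate: mean over tasks of the per-task best after adding it.
        scores = np.minimum(losses, best_so_far).mean(axis=1)
        idx = int(np.argmin(scores))
        chosen.append(idx)
        best_so_far = np.minimum(best_so_far, losses[idx])
    return chosen, float(best_so_far.mean())

def subset_experiment(losses, subset_size, list_len, seed=0):
    """Restrict the candidate set to a random subset before building the list."""
    losses = np.asarray(losses, dtype=float)
    rng = np.random.default_rng(seed)
    subset = rng.choice(losses.shape[0], size=subset_size, replace=False)
    chosen, score = greedy_list(losses[subset], list_len)
    return [int(subset[i]) for i in chosen], score
```

Sweeping `subset_size` and plotting the returned score would give a curve of the kind discussed above.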
Overall this suggests that random search might not be the most efficient learning method for creating hyperparameter lists. This is especially true as we work with optimizer families that have more hyperparameters. Other approximate learning methods should likely be explored, such as truncated backprop through time as used by the learned optimizer community (metz2019understanding), and/or population based methods (balduzzi2019open).
Appendix B Task timings
In Figure 14 we show box plots of training times for each problem. For each task we take the median step time recorded over a mixture of different physical devices and multiply it by 10k to estimate a full training time. Future versions of this dataset of tasks will contain more variation within each task family.
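The timing estimate is simple arithmetic; a minimal sketch, with a hypothetical function name and step times given in seconds, is:

```python
import statistics

def estimated_training_time(step_times_sec, num_steps=10_000):
    """Estimate full training time as the median step time multiplied by the step count."""
    return statistics.median(step_times_sec) * num_steps
```

Using the median rather than the mean keeps a few slow measurements on one device from dominating the estimate.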
Appendix C Optimizer family update equations
c.1 Adam8p update equations
The 8 metaparameters are: the learning rate, the first and second moment momentum, the numerical stability term, the two regularization strengths, and two learning rate schedule constants. For Adam6p, we set and to zero.
problem specified random initialization  (6)  
(7)  
(8)  
(9)  
(10)  
(11)  
(12)  
(13)  
(14)  
(15)  
(16)  
(17) 
c.2 NAdamW update equations
This optimizer family has 10 hyperparameters: the base learning rate, the first and second moment momentum, the numerical stability term, the regularization strength, the AdamW style weight decay, and a boolean to switch between NAdam and Adam. The learning rate schedule is based on a single cycle cosine decay with a warmup. It is controlled by 3 additional parameters: the warmup fraction, the fraction of training during which the learning rate is held constant, and the minimum learning rate multiplier.
The learning rate is defined by:
(18)  
(19)  
(20)  
(21)  
(22)  
(23) 
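Since the schedule equations are not legible here, the following is only a sketch of a single cycle cosine decay with warmup, a constant phase, and a minimum learning rate multiplier. The way the three phases are composed (linear warmup from zero, then a constant phase taken as a fraction of the post-warmup steps, then cosine decay) is an assumption of this example and may differ from Eqs. 18 to 23; all argument names are hypothetical.

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_frac, constant_frac, min_lr_mult):
    """Piecewise schedule: linear warmup, constant hold, cosine decay to a floor."""
    warmup_steps = warmup_frac * total_steps
    # Assumption: the constant phase is a fraction of the steps left after warmup.
    constant_steps = constant_frac * (total_steps - warmup_steps)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    if step < warmup_steps + constant_steps:
        return base_lr
    decay_steps = total_steps - warmup_steps - constant_steps
    progress = min(1.0, (step - warmup_steps - constant_steps) / max(decay_steps, 1e-12))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between base_lr (cosine = 1) and min_lr_mult * base_lr (cosine = 0).
    return base_lr * (min_lr_mult + (1.0 - min_lr_mult) * cosine)
```

With `min_lr_mult = 0` the rate decays to zero at the end of training; a nonzero multiplier leaves a floor of `min_lr_mult * base_lr`.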
The update equations of NAdamW are quite similar to those of Adam8p. For clarity we list the full update here.
problem specified random initialization  (24)  
(25)  
(26)  
(27)  
(28)  
(29)  
(30)  
(31)  
(32)  
(33)  
(34)  
(35) 
Appendix D Optimizer family search spaces
d.1 Adam8p, Adam6p, Adam4p, AdamLr search spaces
For Adam1p, Adam4p, Adam6p, and Adam8p we sample the learning rate logarithmically between 1e-8 and 10. We parametrize beta1 and beta2 as 1 - x and sample x logarithmically between 1e-4 and 1 and between 1e-6 and 1, respectively. For learning rate schedules we sample the linear decay logarithmically between 1e-7 and 1e-4 and the exponential decay logarithmically between 1e-6 and 1e-3. We sample both regularization strengths logarithmically between 1e-8 and 1e1.
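A log-uniform sampler for part of this search space might look like the sketch below. It reads the exponents above as negative (1e-8 rather than 1e8, and so on), treats beta1 and beta2 as one minus a log-uniform draw, and deliberately omits the numerical stability term and the regularization strengths; these readings and all dictionary keys are assumptions of the example.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_adam_partial(rng):
    """Sample the learning rate, momentum, and decay constants of an Adam-family optimizer."""
    return {
        "learning_rate": log_uniform(rng, 1e-8, 10.0),
        # Momentum terms are parametrized as 1 - x with x log-uniform.
        "beta1": 1.0 - log_uniform(rng, 1e-4, 1.0),
        "beta2": 1.0 - log_uniform(rng, 1e-6, 1.0),
        "linear_decay": log_uniform(rng, 1e-7, 1e-4),
        "exponential_decay": log_uniform(rng, 1e-6, 1e-3),
    }
```

Sampling 1 - beta in log space concentrates draws near beta close to 1, where small changes matter most.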
d.2 NAdamW search space
This search space was chosen heuristically in an effort to generalize to new problems. We would like to emphasize that it was not tuned. We used our insight from Adam based optimizer families and chose this. No iterations were done. We expect more iterations will improve not only in-distribution performance, but also generalization performance.
The initial learning rate, , is sampled from log space between and . is sampled logarithmically between and . is sampled between and . is sampled logarithmically between and . We sample using nesterov () 50% of the time. We sample and logarithmically between and . With equal probabilities of a third we either use both terms, zero out , or zero out . With 50% probability we use a nonzero min learning rate multiplier sampled logarithmically between and . With 50% probability we sample the warm up fraction between 1e-5 and 1e-1; otherwise it is set to zero. Finally, we uniformly sample the amount of time the learning rate is held constant () between 0 and 1.
Appendix E Extended related work
e.1 Sets of tasks
Benchmarks consisting of multiple tasks are becoming an increasingly common technique for measuring improvement in algorithm design. Reinforcement learning has Atari (bellemare2013arcade), DMLab (beattie2016deepmind), Gym (Brockman2016), and dm_control (deepmindcontrolsuite2018). Natural language processing has evaluation sets such as GLUE (wang2018glue), SuperGLUE (wang2019superglue), and the NLP Decathlon (McCann2018decaNLP). In computer vision there is (zhai2019visual), which studies transfer learning of image features. In black box optimization there is Nevergrad (nevergrad), COmparing Continuous Optimizers (COCO) (hansen2016coco), and a number of tasks to test Bayesian hyperparameter optimization presented in (dewancker2016stratified). For first order gradient methods there are unit tests for stochastic optimization (schaul2013unit), which studies toy optimization functions, and DeepObs (schneider2019deepobs), which includes 20 neural network tasks. Hyperparameter tuning practices on these benchmarks vary from tuning on each task separately to tuning one set of hyperparameters for all problems. In Atari (bellemare2013arcade), for example, it is common practice to tune hyperparameters on a subset of tasks and evaluate on the full set. This protocol can further be extended by leveraging unseen levels or games at test time, as done in Obstacle Tower (juliani2019obstacle), ProcGen (cobbe2019leveraging), CoinRun (cobbe2018quantifying), and Sonic (nichol2018gotta). We believe generalization to unseen tasks is key for learned algorithms to be useful; thus our learned search space experiments mirror this setting by making use of held-out tasks.
Existing metalearning datasets share similar goals with our work but focus on different domains. In few shot learning there is MiniImageNet (vinyals2016matching), which is built procedurally from the ImageNet dataset (ILSVRC15). MetaDataset (triantafillou2019meta) takes this further and also focuses on generalization by constructing few shot learning tasks using images from a number of different domains for evaluation purposes. The automated machine learning community has OpenML (OpenML2013), with a focus on selecting and tuning non-neural algorithms. For learning optimizers, the use of task suites has been limited and ad hoc. Many works use a single or small number of standard machine learning tasks (andrychowicz2016learning; li2017learning; lv2017learning; metz2019understanding). wichrowska2017learned uses a set of synthetic problems meant to emulate many different kinds of loss surfaces. While collections of tasks for optimizer evaluation exist, e.g. (schneider2019deepobs), they contain too few tasks to act as a comprehensive training set for learning algorithms, and many of their tasks are additionally too computationally expensive to be useful during learning.
e.2 Hand designed and learned optimizers
Optimization is core to machine learning and thus the focus of extensive work. Methods such as Nesterov momentum (nesterov1983method), AdaGrad (duchi2011adaptive), RMSProp (tieleman2012lecture), and Adam (kingma2014adam) have all shown considerable improvements in both the speed of optimization and ease of use by exposing hyperparameters that are more robust and easier to tune than those of SGD (sivaprasad2019tunability). Adaptive step size methods in particular have emerged at the forefront, with many works building on them, including AdamW (loshchilov2017fixing), RAdam (liu2019variance), Novograd (ginsburg2019stochastic), and NAdam (dozat2016incorporating). Recently, there has been a focus on comparing optimizers either for best performance or for ease of use (wilson2017marginal; choi2019empirical; schneider2019deepobs; sivaprasad2019tunability). This has proven difficult, as performance is heavily dependent on the choice of search space for optimization hyperparameters (choi2019empirical).
Learned optimizers represent a parallel thread in the development of optimizers. By learning as opposed to hand-designing optimizers, researchers hope to not only increase performance but also ease of use (e.g. minimize the number of hyperparameters required or lower hyperparameter sensitivity) (bengio1990learning; schmidhuber1995learning; hochreiter2001learning). Recently, there has been renewed interest in parameterizing learning algorithms with neural networks and learning these optimizers on neural network based losses (andrychowicz2016learning; wichrowska2017learned; li2017learning; lv2017learning; metz2019understanding; metz2019using). Other approaches learn symbolic parameterizations for new optimizers (Bello17). These various methods are all trained and evaluated on different distributions of tasks, making comparison across papers challenging. The dataset of tasks presented here will hopefully aid in the ability to compare and evaluate progress in learned optimizer research.
In this work, we develop a much more minimal type of “learned optimizer” than previous work which developed new functional forms for the optimizer. Optimization involves not only the functional form of the optimizer, but also the rules for choosing hyperparameters and applying the optimizer. We focus on this second aspect of optimization and learn a hyperparameter search space to improve the performance of existing hand designed methods.
e.3 Hyperparameter search
Hyperparameter search is a key component in machine learning. Considerable improvements have been made in language (melis2017state), computer vision (snoek2012practical), and RL (chen2018bayesian) simply by tuning better. Often no single hyperparameter configuration works well across all tasks for existing optimization methods. Most current hyperparameter search methods involve trying a very large number of hyperparameters for every new task, which is computationally infeasible for large tasks and additionally can severely limit the number of hyperparameters that can be tuned. Many common techniques such as random search (bergstra2012random; bousquet2017critical), Bayesian optimization (snoek2012practical; snoek2015scalable), tree parzen estimators (NIPS2011_4443), or sequential halving (kumar2018parallel) require setting a hyperparameter search space by hand, which is not only difficult but often wildly inefficient.
Learning hyperparameters or search strategies by leveraging multiple tasks has been explored within the context of Bayesian optimization (swersky2013multi; perrone2019learning; perrone2018scalable) as well as under the term metalearning in (chen2017learning), in which an LSTM is metatrained to produce function locations to query.
The cost of hyperparameter search is often large, as each evaluation requires training a model to completion. Often multifidelity based approaches are used, which leverage “simpler” tasks and transfer the resulting hyperparameters (automl). Common approaches include training on partial function evaluations (swersky2014freeze; domhan2015speeding; li2016hyperband; klein2016learning; falkner2018bohb), or leveraging simplified data and models (petrak2000fast; zoph2016neural; brock2017smash). Our dataset of tasks serves as: a “simpler” set of tasks to train on; a large and diverse enough set of problems that optimization algorithms trained on it may be expected to generalize; and a framework to test transfer across different types of problems.
Appendix F List of NAdam HParams
Idx  Lr  warmup  constant  Min LR mult  beta1  beta2  epsilon  nesterov  l2 reg  l2 weight decay 

0  1.24e-3  0.000  0.477  1.01e-3  0.94666  0.94067  8.114e-8  False  0.000e+00  7.258e-5 
1  5.33e-3  0.000  0.172  0.0  0.96047  0.99922  8.665e-8  True  0.000e+00  5.563e-3 
2  2.12e-4  0.000  0.210  1.39e-3  0.62297  0.97278  1.540e-7  False  0.000e+00  5.361e-2 
3  4.06e-1  0.000  0.324  0.0  0.99724  0.98680  1.079e+02  True  0.000e+00  1.562e-2 
4  2.05e-2  0.000  0.885  1.57e-5  0.35731  0.86043  8.874e-5  True  0.000e+00  7.217e-2 
5  5.95e-4  0.008  0.378  0.0  0.89130  0.99983  1.483e-7  True  0.000e+00  4.087e-2 
6  7.53e-3  0.000  0.422  9.55e-4  0.69192  0.98434  3.593e-8  False  0.000e+00  3.060e-4 
7  4.69e-3  0.000  0.509  0.0  0.99639  0.98820  2.056e-5  False  0.000e+00  3.552e-2 
8  2.95e-1  0.000  0.201  0.0  0.99678  0.99981  7.498e+00  False  3.792e-4  3.463e-4 
9  2.04e-3  0.000  0.527  0.0  0.49995  0.99755  5.630e-8  True  0.000e+00  2.796e-2 
10  7.39e-1  0.001  0.556  3.31e-3  0.99691  0.80639  2.900e+03  False  0.000e+00  7.851e-2 
11  8.12e-3  0.000  0.207  0.0  0.17785  0.96033  7.971e-2  False  0.000e+00  1.489e-2 
12  3.33e-2  0.000  0.369  0.0  0.69592  0.99997  5.510e-6  True  0.000e+00  1.362e-5 
13  6.95e-3  0.000  0.014  0.0  0.99412  0.99305  4.352e-7  False  0.000e+00  3.142e-5 
14  1.88e-1  0.000  0.205  1.08e-1  0.98597  0.56531  3.335e+00  True  1.265e-5  3.868e-3 
15  9.47e-4  0.007  0.452  0.0  0.43977  0.09422  2.120e-7  False  0.000e+00  6.902e-3 
16  3.75e-3  0.000  0.184  0.0  0.87756  0.96128  3.163e-3  True  7.468e-5  2.627e-3 
17  7.25e-1  0.000  0.495  0.0  0.99800  0.99781  3.608e+00  True  1.656e-5  3.911e-2 
18  4.58e-3  0.000  0.107  3.66e-1  0.42294  0.99963  4.174e-6  True  0.000e+00  4.446e-3 
19  3.07e-4  0.007  0.518  0.0  0.57863  0.99625  9.881e-6  False  0.000e+00  5.521e-2 
20  2.94e-5  0.000  0.830  8.27e-5  0.96916  0.99896  7.782e-7  True  3.364e-4  3.416e-3 
21  1.65e-4  0.002  0.457  2.70e-1  0.95280  0.04565  2.832e-6  True  0.000e+00  1.141e-2 
22  9.17e-1  0.010  0.897  2.67e-2  0.45061  0.99244  4.945e-1  False  1.253e-3  0.000e+00 
23  2.36e-3  0.000  0.986  0.0  0.98560  0.99997  1.080e-8  True  0.000e+00  3.023e-3 
24  2.14e-2  0.000  0.128  0.0  0.98741  0.99336  1.266e-4  False  0.000e+00  5.194e-4 
25  5.91e-2  0.000  0.062  0.0  0.99794  0.99383  3.447e+02  True  0.000e+00  3.935e-2 
26  1.57e-3  0.000  0.251  0.0  0.91820  0.99991  4.675e-5  False  0.000e+00  4.112e-5 
27  4.43e-1  0.000  0.702  0.0  0.94375  0.93551  2.335e-8  True  0.000e+00  8.325e-5 
28  2.98e-3  0.008  0.046  0.0  0.68612  0.94232  6.614e-2  False  6.489e-5  0.000e+00 
29  1.65e-2  0.004  0.082  4.92e-4  0.95717  0.99789  3.068e+01  True  0.000e+00  8.920e-2 
30  5.58e-3  0.000  0.538  0.0  0.97559  0.99990  3.238e-8  True  0.000e+00  4.896e-4 
31  8.54e-1  0.000  0.229  0.0  0.93129  0.50200  2.051e-2  False  2.068e-4  2.801e-2 
32  7.38e-3  0.000  0.722  8.78e-2  0.21456  0.99752  2.862e-2  False  0.000e+00  8.439e-2 
33  4.26e-4  0.001  0.923  2.06e-1  0.47239  0.99974  8.221e-5  False  1.248e-5  0.000e+00 
34  6.04e-3  0.000  0.698  0.0  0.97849  0.91449  1.806e+00  False  3.183e-3  1.762e-2 
35  8.86e-3  0.000  0.104  1.66e-1  0.98967  0.99720  1.493e-2  True  0.000e+00  2.253e-2 
36  1.51e-2  0.000  0.431  1.99e-3  0.80488  0.97878  2.538e-8  True  0.000e+00  2.269e-5 
37  2.50e-3  0.000  0.009  0.0  0.98127  0.99988  1.799e-7  False  0.000e+00  1.303e-2 
38  3.42e-4  0.000  0.827  6.38e-1  0.25217  0.96572  2.928e-7  True  0.000e+00  1.318e-3 
39  6.94e-5  0.000  0.085  0.0  0.98674  0.42709  2.387e-7  False  0.000e+00  2.071e-4 
40  3.03e-2  0.001  0.313  0.0  0.90610  0.99997  4.449e-3  True  0.000e+00  2.813e-5 
41  4.64e-3  0.000  0.495  2.26e-5  0.64658  0.54108  3.528e-8  False  0.000e+00  2.996e-5 
42  2.25e-3  0.000  0.722  0.0  0.97967  0.97518  1.488e-7  True  1.812e-5  2.180e-2 
43  6.66e-4  0.000  0.632  2.79e-5  0.65968  0.99997  6.848e-6  True  0.000e+00  3.130e-3 
44  3.31e-3  0.000  0.146  0.0  0.90447  0.99970  6.618e-6  True  0.000e+00  2.184e-2 
45  7.84e-4  0.016  0.124  0.0  0.95065  0.99685  2.141e-2  False  0.000e+00  4.024e-5 
46  6.16e-3  0.016  0.623  0.0  0.98823  0.98744  1.616e-6  False  0.000e+00  1.544e-2 
47  3.26e-4  0.000  0.738  1.61e-4  0.78425  0.99998  3.468e-3  False  0.000e+00  4.709e-2 
48  4.12e-3  0.001  0.205  0.0  0.99561  0.75382  2.390e-6  True  0.000e+00  3.631e-2 
49  6.26e-1  0.000  0.932  2.52e-3  0.99401  0.83521  2.431e+00  True  0.000e+00  1.048e-2 
Top 50 hyperparameters found using the NAdamW search space. We find diverse learning rates, with very little warmup used. We additionally find that most well-performing optimizers make use of AdamW style weight decay. Finally, matching insight from (choi2019empirical), we find large values of epsilon.
Appendix G Description of tasks in task suite
In this section we detail the task distribution used throughout this work. In addition to this text, a TensorFlow (abadi2016tensorflow) implementation is also released at github.com/google-research/google-research/tree/master/task_set.
g.1 Sampled Tasks
g.1.1 Default sampled components
As many of the sampled tasks are neural networks, we define common sampling routines used by all the sampled tasks.
Activation functions:
We define a distribution over activation functions, sampled according to the following list, which gives both name and weight. These range from standard functions (relu, tanh) to less standard ones (cos).

relu: 6

tanh: 3

cos: 1

elu: 1

sigmoid: 1

swish (ramachandran2017searching): 1

leaky relu (with ): 1

leaky relu (with ): 1

leaky relu (with ): 1
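The weighted draw over activation functions listed above can be sketched with the standard library. The weights are taken from the list; the three leaky relu slopes are not legible here, so the keys `leaky_relu_a`, `leaky_relu_b`, and `leaky_relu_c` are placeholders rather than real parameter values.

```python
import random

# Sampling weights from the list above; leaky relu variants use placeholder names.
ACTIVATION_WEIGHTS = {
    "relu": 6,
    "tanh": 3,
    "cos": 1,
    "elu": 1,
    "sigmoid": 1,
    "swish": 1,
    "leaky_relu_a": 1,
    "leaky_relu_b": 1,
    "leaky_relu_c": 1,
}

def sample_activation(rng):
    """Draw one activation name with probability proportional to its weight."""
    names = list(ACTIVATION_WEIGHTS)
    weights = list(ACTIVATION_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

The weights sum to 16, so relu is drawn with probability 6/16 and tanh with probability 3/16.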
Initializations:
We sample initializers according to a weighted distribution. Each initialization sample also optionally samples hyperparameters (e.g. for random normal initializers we sample standard deviation of the underlying distribution).

he normal (he2015delving): 2

he uniform (he2015delving): 2

glorot normal (glorot2010understanding): 2

glorot uniform (glorot2010understanding): 2

orthogonal: 1. We sample the “gain”, or multiplication of the orthogonal matrix, logarithmically between .

random uniform: 1.0: This is defined between where is sampled logarithmically between .

random normal: 1.0: The std is sampled logarithmically between .

truncated normal: 1.0: The std is sampled logarithmically between .

variance scaling: 1.0: The scale is sampled logarithmically between .
RNN Cores: We define a distribution over different types of RNN cores used by the sequential tasks. With equal probability we sample either a vanilla RNN (elman1990finding), GRU (chung2014empirical), or LSTM (hochreiter1997long). For each cell we either sample 1 shared initialization method or sample a different initialization method per parameter vector, with a 4:1 ratio. We sample the core hidden dimension logarithmically between .
g.1.2 Sampled Datasets
Image Datasets: We sample uniformly from the following image datasets. Each dataset additionally has sampled parameters. For all datasets we make use of four data splits: train, valid-inner, valid-outer, test. Train is used to train models; valid-inner is used while training models to allow for modification of the training procedure (e.g. if validation loss stops decreasing, drop the learning rate). Valid-outer is used to select metaparameters. Test should not be used during metatraining.
For all datasets, we sample a switch with low probability (10% of the time) to only use training data and thus not test generalization. This ensures that our learned optimizers are capable of optimizing a loss as opposed to a mix of optimizing and generalizing.
Mnist: Batch size is sampled logarithmically between . We sample the number of training images logarithmically between (lecun1998mnist).
Fashion Mnist: Batch size is sampled logarithmically between . We sample the number of training images logarithmically between (xiao2017/online).
Cifar10: Batch size is sampled logarithmically between . The number of training examples is sampled logarithmically (krizhevsky2009cifar).
Cifar100: Batch size is sampled logarithmically between . The number of training examples is sampled logarithmically (krizhevsky2009cifar).
{food101_32x32, coil100_32x32, deep_weeds_32x32, sun397_32x32}: These datasets take the original set of images and resize them to 32x32 using OpenCV’s (opencv_library) cubic interpolation. We ignore aspect ratio for this resize. Batch size is sampled logarithmically between (bossard14; nene1996columbia; DeepWeeds2019; Xiao2010).
Imagenet32x32 / Imagenet16x16: The ImageNet 32x32 and 16x16 datasets as created by chrabaszcz2017downsampled. Batch size is logarithmically sampled between .
g.1.3 Text classification:
IMDB sentiment classification: We use text from the IMDB movie reviews dataset (maasEtAl:2011:ACLHLT2011) and tokenize it into subwords using a vocab size of 8k (sennrich2015neural). We then take a length s random slice from each example, where s is sampled logarithmically between . These examples are then batched into a batch size logarithmically sampled between . We sample the number of training examples logarithmically between and, with 10% probability, just use training data instead of valid / test to test pure optimization as opposed to generalization.
g.1.4 Character and Word language Modeling
For the character and word language modeling datasets we make use of the following data sources: IMDB movie reviews (maasEtAl:2011:ACLHLT2011), Amazon product reviews (amazonreviews) using the Books, Camera, Home, and Video subsets each as separate datasets, LM1B (DBLP:journals/corr/ChelbaMSGBK13), and Wikipedia (wikidump) taken from the 20190301 dump using the zh, ru, ja, hab, and en language codes. We split each article by new lines and only keep resulting examples that contain more than 5 characters. For infrastructure reasons, we only use a million articles from each language and only 200k examples to build the tokenizer.
Byte encoding: We take length s random slices of each example, where s is sampled logarithmically between . These examples are then batched into a batch size logarithmically sampled between . With probability 0.2 we restrict the number of training examples to a number logarithmically sampled between . Finally, with a 10% probability we just use training data instead of valid / test to test pure optimization as opposed to generalization.
Subword encoding: We encode the text as subwords with a vocab size of 8k (sennrich2015neural). We then take length s random slices of each example, where s is sampled logarithmically between . These examples are then batched into a batch size logarithmically sampled between . With probability we restrict the number of training examples to a number logarithmically sampled between . Finally, with a 10% probability we just use training data instead of valid / test to test pure optimization as opposed to generalization.
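The slicing and batching steps shared by both encodings can be sketched as follows. This is an illustration only: dropping examples shorter than s and dropping the final incomplete batch are assumptions of this example, and the function names are hypothetical.

```python
import random

def to_bytes(text):
    """Byte-level encoding of a string as a list of integers in [0, 255]."""
    return list(text.encode("utf-8"))

def random_slice(tokens, length, rng):
    """Return a contiguous random slice of the given length (requires len(tokens) >= length)."""
    start = rng.randrange(len(tokens) - length + 1)
    return list(tokens[start:start + length])

def make_batches(examples, length, batch_size, rng):
    """Slice every sufficiently long example, then group slices into full batches."""
    slices = [random_slice(ex, length, rng) for ex in examples if len(ex) >= length]
    return [
        slices[i:i + batch_size]
        for i in range(0, len(slices) - batch_size + 1, batch_size)
    ]
```

For the subword variant, `to_bytes` would be replaced by a subword tokenizer with an 8k vocabulary; the slicing and batching are unchanged.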
g.2 Sampled Tasks
g.2.1 Mlp
This task family consists of a multi layer perceptron trained on flattened image data. The number of layers is sampled uniformly from . Layer hidden unit sizes are sampled logarithmically between , with a different number of hidden units per layer. One activation function is chosen for the whole network, as described in G.1.1. One shared initializer strategy is also sampled. The image dataset used is also sampled.
Two sampled configurations are shown below.
g.2.2 MLP_ae
This task family consists of a multi layer perceptron trained with an auto encoding loss. The number of layers is sampled uniformly from . Layer hidden unit sizes are sampled logarithmically between , with a different number of hidden units per layer. The last layer always maps back to the input dimension. The output activation function is sampled with the following weights: tanh:2, sigmoid:1, linear_center:1, linear:1, where linear_center is an identity mapping. When using the linear_center and tanh activations we shift the ground truth image to before performing a comparison to the model’s predictions. We sample the per dimension distance function used to compute the loss with weights l2:2, l1:1, and the reduction function across dimensions to be either mean or sum with equal probability. A single activation function and a single initializer are sampled. We train on image datasets, which are also sampled.
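The sampled reconstruction loss can be sketched as below. The weights l2:2, l1:1 and the mean-or-sum choice come from the text; averaging over the batch dimension and the function names are assumptions of this example.

```python
import random
import numpy as np

def sample_loss_config(rng):
    """Sample the per-dimension distance (l2:2, l1:1) and the reduction (mean or sum)."""
    distance = rng.choices(["l2", "l1"], weights=[2, 1], k=1)[0]
    reduction = rng.choice(["mean", "sum"])
    return distance, reduction

def autoencoder_loss(prediction, target, distance="l2", reduction="mean"):
    """Reconstruction loss for a batch of shape [batch, dims]."""
    diff = np.asarray(prediction, dtype=float) - np.asarray(target, dtype=float)
    per_dim = diff ** 2 if distance == "l2" else np.abs(diff)
    # Reduce across feature dimensions, then average across the batch (assumption).
    per_example = per_dim.mean(axis=-1) if reduction == "mean" else per_dim.sum(axis=-1)
    return float(per_example.mean())
```

Choosing sum instead of mean scales the loss by the input dimensionality, which changes the effective learning rate an optimizer sees.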
A sample configuration is shown below.
g.2.3 Mlp Vae
This task has an encoder with a sampled number of layers between . For each layer we sample the number of hidden units logarithmically between . For the decoder we sample the number of layers uniformly between . For each layer we sample the number of hidden units logarithmically between . We use a Gaussian prior of dimensionality logarithmically sampled between . A single activation function and initialization is chosen for the whole network. The output of the encoder is projected to both a mean and a log standard deviation, which parameterize the variational distribution. The decoder maps samples from the latent space to a quantized Gaussian distribution under which we compute data log likelihoods. The loss we optimize is the evidence lower bound (ELBO), which combines this likelihood with the KL divergence between the variational distribution and our normal distribution prior. We use the reparameterization trick to compute gradients. This model is trained on sampled image datasets.
A sample configuration is listed below.
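Separately from the sample configuration, the reparameterization step and the KL term of the loss can be sketched in NumPy. This assumes a diagonal Gaussian variational distribution and a standard normal prior, and uses the standard closed-form KL; the decoder likelihood is passed in as a number, and the function names are hypothetical.

```python
import numpy as np

def reparameterize(mean, log_std, rng):
    """Sample z = mean + std * eps with eps drawn from a standard normal."""
    mean = np.asarray(mean, dtype=float)
    log_std = np.asarray(log_std, dtype=float)
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(log_std) * eps

def kl_to_standard_normal(mean, log_std):
    """KL(N(mean, std^2) || N(0, 1)), summed over the latent dimension."""
    mean = np.asarray(mean, dtype=float)
    log_std = np.asarray(log_std, dtype=float)
    return 0.5 * np.sum(np.exp(2.0 * log_std) + mean ** 2 - 1.0 - 2.0 * log_std, axis=-1)

def negative_elbo(log_likelihood, mean, log_std):
    """Loss to minimize: KL term minus the data log likelihood."""
    return kl_to_standard_normal(mean, log_std) - log_likelihood
```

Because the noise `eps` is sampled independently of the parameters, gradients can flow through `mean` and `log_std` in an autodiff framework.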
g.2.4 Conv Pooling
This task consists of small convolutional neural networks with pooling. We sample the number of layers uniformly between . We sample a stride pattern to be either all stride 2, repeating the stride pattern of 1,2,1,2… for the total number of layers, or 2,1,2,1… for the total number of layers. The hidden units are logarithmically sampled for each layer between . We sample one activation function and weight init for the entire network. Padding for the convolutions is sampled per layer to be either same or valid with equal probability. For the convnet we also sample whether or not to use a bias with equal probability. At the last layer of the convnet we do a spatial reduction using either the mean, max, or squared mean, sampled uniformly. This reduced output is fed into a linear layer and a softmax cross entropy loss. These models are trained on a sampled image dataset.
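The three stride patterns described above can be generated with a short helper. The uniform choice among the three patterns and the pattern labels are assumptions of this sketch.

```python
import random

def sample_stride_pattern(num_layers, rng):
    """Return per-layer strides: all 2, alternating 1,2,1,2..., or alternating 2,1,2,1..."""
    kind = rng.choice(["all_2", "1_2_repeat", "2_1_repeat"])
    if kind == "all_2":
        return [2] * num_layers
    first = 1 if kind == "1_2_repeat" else 2
    second = 3 - first  # the other value of the pair (1, 2)
    return [first if i % 2 == 0 else second for i in range(num_layers)]
```

For a 5-layer network this yields one of `[2, 2, 2, 2, 2]`, `[1, 2, 1, 2, 1]`, or `[2, 1, 2, 1, 2]`.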
A sample configuration is shown below.
g.2.5 Conv FC
This task consists of small convolutional neural networks whose output is flattened and then run through an MLP. We sample the number of conv layers uniformly between . We sample a stride pattern to be either all stride 2, repeating the stride pattern of 1,2,1,2… for the total number of layers, or 2,1,2,1… for the total number of layers. The hidden units are logarithmically sampled for each layer between . Padding for the convolutions is sampled per layer to be either same or valid with equal probability.
The output is then flattened and run through an MLP with the number of hidden layers sampled uniformly from and with sizes sampled logarithmically from . The loss is then computed via softmax cross entropy.
We sample one activation function and weight init for the entire network. For the convnet we also sample whether or not to use a bias with equal probability. These models are trained on a sampled image dataset.
An example configuration is shown below.
g.2.6 character rnn language model
This task takes character embedded data, and embeds it in a size s embedding vector, where s is sampled logarithmically between , with a random normal initializer with std . With 80% probability we use all tokens, and with 20% probability we only consider a subset of tokens sampled logarithmically . We then pass this embedded vector to an RNN with teacher forcing. With equal probability we use a trainable initial state or zeros. A linear projection is then applied to the number of vocab tokens. Losses are computed using softmax cross entropy and averaged across the sequence.
A sample configuration is shown below.
g.2.7 word rnn language model
This task takes word embedded data, and embeds it in a size s embedding vector, where s is sampled logarithmically between , with a random normal initializer with std 1.0. A vocab size for this embedding table is sampled logarithmically between . We then pass this embedded vector to an RNN with teacher forcing. With equal probability we use a trainable initial state or zeros. A linear projection is then applied to the number of vocab tokens. Losses are computed using softmax cross entropy and averaged across the sequence.
A sample configuration is shown below.
g.2.8 LOSG Problems
These tasks consist of a mixture of many other tasks. We sample uniformly over the following types of problems. We briefly describe them here but refer the reader to the provided source for more information. In this work we took all the base problems from (wichrowska2017learned) but modified the sampling distributions to better cover the space as opposed to narrowly sampling particular problem families. Future work will consist of evaluating which sets of problems or which sampling decisions are required.
quadratic: n dimensional quadratic problems where n is sampled logarithmically between . Noise is optionally added with probability 0.5 and of the scale s where s is sampled logarithmically between .
bowl: A 2d quadratic bowl problem with a sampled condition number (logarithmically between ). Noise is optionally added with probability 0.5 and of the scale s where s is sampled logarithmically between .
sparse_softmax_regression: A synthetic random sparse logistic regression task.
optimization_test_problems: A uniform sample over the following functions: Ackley, Beale, Branin, logsumexp, Matyas, Michalewicz, Rosenbrock, StyblinskiTang.
fully_connected: A sampled random fully connected classification neural network predicting 2 classes on synthetic data. The number of input features is sampled logarithmically between 1 and 16, with a random activation function, and a number of layers uniformly sampled from 2-5.
norm: A problem that finds a minimum error in an arbitrary norm. Specifically: where , . The dimensionality, , is sampled logarithmically between 3 and 1000. The power, , is sampled uniformly between 0.1 and 5.0. , and are drawn from a standard normal distribution.
dependency_chain: A synthetic problem where each parameter must be brought to zero sequentially. We sample dimensionality logarithmically between 3 and 100.
outward_snake: This loss creates a winding path to infinity. Step size should remain constant across this path. We sample dimensionality logarithmically between 3 and 100.
min_max_well: A loss based on the sum of min and max over parameters: . Note that the gradient is zero for all but 2 parameters. We sample dimensionality logarithmically between 10 and 1000. Noise is optionally added with probability 0.5 and of the scale s where s is sampled logarithmically between [0.01, 10].
sum_of_quadratics: A least squares loss of a dimensionality sampled logarithmically between 3 and 100 to a synthetic dataset.
projection_quadratic: A quadratic minimized by probing different directions. Dimensionality is sampled from 3 to 100 logarithmically.
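For reference, three of the functions named under optimization_test_problems are written out below using their standard textbook definitions. How the task suite parameterizes, initializes, or scales them is not specified here, so this is only an illustration of the loss surfaces involved.

```python
def rosenbrock(x, y):
    """Rosenbrock function; global minimum 0 at (1, 1)."""
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def matyas(x, y):
    """Matyas function; global minimum 0 at (0, 0)."""
    return 0.26 * (x ** 2 + y ** 2) - 0.48 * x * y

def beale(x, y):
    """Beale function; global minimum 0 at (3, 0.5)."""
    return (
        (1.5 - x + x * y) ** 2
        + (2.25 - x + x * y ** 2) ** 2
        + (2.625 - x + x * y ** 3) ** 2
    )
```

These two-dimensional surfaces are cheap to evaluate but exhibit curved valleys and flat regions that stress step size selection.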
In addition to these base tasks, we also provide a variety of transformations described below. The use of these transformations is also sampled.
sparse_problems: With probability 0.9 to 0.99 the gradient per parameter is set to zero. Additional noise is added with probability 0.5 sampled from a normal with std sampled logarithmically between .
rescale_problems: Rescales the loss value by 0.001 to 1000.0 sampled logarithmically.
log_objective: Takes the log of the objective value.
Two sample configurations are shown below.
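The three transformations can be pictured as thin wrappers around a base loss or its gradient. This is a hedged sketch: the function names are hypothetical, the additional gradient noise of sparse_problems is omitted, and the log wrapper assumes a strictly positive base loss.

```python
import numpy as np

def rescale_loss(loss_fn, scale):
    """Wrap a loss so that its value is multiplied by a constant scale."""
    def wrapped(params):
        return scale * loss_fn(params)
    return wrapped

def log_loss(loss_fn):
    """Wrap a strictly positive loss so that its logarithm is optimized instead."""
    def wrapped(params):
        return np.log(loss_fn(params))
    return wrapped

def sparsify_gradient(grad, zero_probability, rng):
    """Zero each gradient entry independently with the given probability."""
    grad = np.asarray(grad, dtype=float)
    keep = rng.random(grad.shape) >= zero_probability
    return grad * keep
```

With `zero_probability` between 0.9 and 0.99, only a few percent of the parameters receive a nonzero gradient at each step.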
g.2.9 Masked Autoregressive Flows
Masked autoregressive flows are a family of tractable density generative models. See XX for more information. The MAF is defined by a sequence of bijectors. For each bijector, we sample the number of layers to be either 1 or 2 with equal probability, and a number of hidden layers sampled logarithmically between . We sample the number of bijectors uniformly from and use the same hidden layers across all bijectors. We sample the activation function and initializer once for the whole model. In this task we model image datasets, which are also sampled.
A sample configuration is shown below.
g.2.10 Non volume preserving flows
NVP are a family of tractable density generative models. See dinh2016density for more information. The NVP is defined by a sequence of bijectors. For each bijector, we sample the number of layers to be either 1 or 2 with equal probability, and a number of hidden layers sampled logarithmically between . We sample the number of bijectors uniformly from and use the same hidden layers across all bijectors. We sample the activation function and initializer once for the whole model. In this task we model image datasets, which are also sampled.
A sample configuration is shown below.
g.2.11 Quadratic like problems
This task distribution defines a synthetic problem based on a nonlinear modification to a quadratic. The dimensionality of the problem is sampled logarithmically between [2, 3000].
The loss for this task is described by:
(36) 
where and where param is initialized by initial_dist.sample() / weight_rescale.
The output_fn is sampled uniformly between identity, and . The loss scale is sampled logarithmically between [, ].
We define a distribution over matrices A as a sample from one of the following: normal: we sample a mean from a normal draw with a standard deviation of 0.05 and a std from a uniform [0, 0.05]. The elements of A are drawn from the resulting distribution. uniform: linspace_eigen: logspace_eigen:
We define a distribution over B to be either normal with mean and std sampled from N(0, 1), U(0, 2) respectively or uniform with min and range equal to U(5, 2.5), U(0, 5) respectively.
With probability 50% we add noise from a distribution whose parameters are also sampled.
A sample configuration is shown below.
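As an illustration of the "normal" option for A described in this section, the two-level sampling could be sketched as follows. Several details are assumptions: the normal that produces the mean is taken to be zero-centered, the matrix shape is passed in explicitly, and the helper name sample_normal_matrix is hypothetical.

```python
import random

def sample_normal_matrix(rng, rows, cols):
    """Sample A elementwise from a normal whose parameters are themselves sampled."""
    mean = rng.gauss(0.0, 0.05)     # mean drawn from a normal with std 0.05
    std = rng.uniform(0.0, 0.05)    # std drawn from uniform [0, 0.05]
    return [[rng.gauss(mean, std) for _ in range(cols)] for _ in range(rows)]

A = sample_normal_matrix(random.Random(0), 3, 3)
```

All elements of one matrix share the same sampled mean and std, so different draws of A differ both in their offset and in how spread out their entries are.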
g.2.12 RNN Text classification
This task consists of using an RNN to classify tokenized text. We first trim the vocab length to be of a size logarithmically sampled between . The text is then embedded into a vocab size logarithmically sampled between . These embeddings get fed into a sampled config RNN. With equal probability the initial state of the RNN is either sampled, or zeros. With equal probability we either take the last RNN prediction, the mean over features, or the per feature max over the sequence. This batch of activations is then passed through a linear layer and a softmax cross entropy loss. The initialization for the linear projection is sampled. An example configuration is shown below. In this version of TaskSet the dataset sampling contains a bug: all data used is from the imdb_reviews/subwords8k dataset.
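The three ways of reducing the RNN outputs to a single vector can be sketched as below. This is an illustrative reading, not the TaskSet implementation: the "mean over features" option is assumed to be a per-feature mean over the sequence, and pool_sequence is a hypothetical helper operating on one example rather than a batch.

```python
def pool_sequence(outputs, mode):
    """Reduce per-step RNN outputs to one feature vector.

    outputs: list over time steps, each a list of feature values.
    mode: "last", "mean", or "max".
    """
    if mode == "last":
        return list(outputs[-1])
    columns = list(zip(*outputs))  # one tuple per feature, spanning all steps
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError("unknown pooling mode: " + mode)

# Two time steps, two features.
outputs = [[1.0, 4.0], [3.0, 2.0]]
pooled = {mode: pool_sequence(outputs, mode) for mode in ("last", "mean", "max")}
```

The pooled vector is what would then be passed through the linear layer and the softmax cross entropy loss.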
g.3 Fixed Tasks
In addition to sampled tasks, we also define a set of hand designed and hand specified tasks. These tasks are either more typical of what a researcher would do (e.g. using default initializations) or exercise specific architecture features such as bottlenecks in autoencoders, normalization, or dropout.
In total there are 107 fixed tasks. Each task is labeled by name with some information about the underlying task. We list all tasks, discuss groups of tasks, but will not describe each task in detail. Please see the source for exact details.
Associative_GRU128_BS128_Pairs10_Tokens50
Associative_GRU256_BS128_Pairs20_Tokens50
Associative_LSTM128_BS128_Pairs10_Tokens50
Associative_LSTM128_BS128_Pairs20_Tokens50
Associative_LSTM128_BS128_Pairs5_Tokens20
Associative_LSTM256_BS128_Pairs20_Tokens50
Associative_LSTM256_BS128_Pairs40_Tokens100
Associative_VRNN128_BS128_Pairs10_Tokens50
Associative_VRNN256_BS128_Pairs20_Tokens50
These tasks use RNNs to perform an associative memory task. Given a vocab of tokens, some number of pairs to store, and a query, the RNN’s goal is to produce the desired value. For example, given the input sequence A1B2C3?B_ the RNN should produce ________B.
This model embeds tokens, applies an RNN, and applies a linear layer to map back to the output space. Softmax cross entropy loss is used to compare outputs. A weight is also placed on the losses so that loss is incurred only when the RNN is supposed to predict. For RNN cells we use LSTM (hochreiter1997long), GRU (chung2014empirical), and VRNN – a vanilla RNN. The previous tasks are defined with the corresponding RNN cell, number of units, batch size, sequence lengths, and number of possible tokens for the retrieval task.
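The weighting of the loss described above can be sketched as a masked softmax cross entropy. This is a minimal single-sequence version under stated assumptions: the weights are 0 or 1 per time step, the result is normalized by the total weight (the source does not specify the normalization), and masked_cross_entropy is a hypothetical name.

```python
import math

def masked_cross_entropy(logits, targets, weights):
    """Weighted softmax cross entropy over the time steps of one sequence.

    logits: list over steps of per-class scores.
    targets: list over steps of integer class ids.
    weights: list over steps; 0.0 where no prediction is required.
    """
    total = 0.0
    for step_logits, target, weight in zip(logits, targets, weights):
        top = max(step_logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(v - top) for v in step_logits))
        total += weight * (log_z - step_logits[target])
    return total / max(sum(weights), 1e-8)

# Three steps, four classes; only the final step contributes to the loss.
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = masked_cross_entropy(logits, [0, 1, 2], [0.0, 0.0, 1.0])
```

With this weighting, whatever the model outputs while it is still reading the pairs and the query has no effect on the loss.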
Copy_GRU128_BS128_Length20_Tokens10
Copy_GRU256_BS128_Length40_Tokens50
Copy_LSTM128_BS128_Length20_Tokens10
Copy_LSTM128_BS128_Length20_Tokens20
Copy_LSTM128_BS128_Length50_Tokens5
Copy_LSTM128_BS128_Length5_Tokens10
Copy_LSTM256_BS128_Length40_Tokens50
Copy_VRNN128_BS128_Length20_Tokens10
Copy_VRNN256_BS128_Length40_Tokens50
These tasks use RNNs to perform a copy task. Given a vocab of tokens and some number of tokens, the RNN’s job is to read the tokens and to produce the corresponding outputs. For example, an input might be ABBC____ and the RNN should output ____ABBC. See the source for a complete description of the task. Each task in this set varies the RNN core, as well as the dataset structure.
This model embeds tokens, applies an RNN, and applies a linear layer to map back to the output space. Softmax cross entropy loss is used to compare outputs. A weight is also placed on the losses so that loss is incurred only when the RNN is supposed to predict. For RNN cells we use LSTM (hochreiter1997long), GRU (chung2014empirical), and VRNN – a vanilla RNN. The previous tasks are defined with the corresponding RNN cell, number of units, batch size, sequence lengths, and number of possible tokens.
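A single copy-task example in the ABBC____ / ____ABBC format can be generated as follows. This is a sketch for illustration only: make_copy_example is a hypothetical helper, and the blank symbol and the 0/1 loss weights are assumptions based on the description above rather than on the TaskSet source.

```python
import random

def make_copy_example(rng, length, vocab, blank="_"):
    """Build inputs, targets, and loss weights for one copy-task sequence."""
    tokens = [rng.choice(vocab) for _ in range(length)]
    inputs = tokens + [blank] * length           # read phase, then blanks
    targets = [blank] * length + tokens          # blanks, then the copied tokens
    weights = [0.0] * length + [1.0] * length    # loss only while predicting
    return inputs, targets, weights

inputs, targets, weights = make_copy_example(random.Random(0), 4, "ABC")
print("".join(inputs), "".join(targets))
```

For a task such as Copy_LSTM128_BS128_Length20_Tokens10, length would presumably be 20 and the vocab would hold 10 tokens, with 128 such sequences per batch.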
FixedImageConvAE_cifar10_32x32x32x32x32_bs128
FixedImageConvAE_cifar10_32x64x8x64x32_bs128
FixedImageConvAE_mnist_32x32x32x32x32_bs128
FixedImageConvAE_mnist_32x64x32x64x32_bs512
FixedImageConvAE_mnist_32x64x8x64x32_bs128
Convolutional autoencoders trained on different datasets and with different architectures (sizes of hidden units).
FixedImageConvVAE_cifar10_32x64x128x64x128x64x32_bs128
FixedImageConvVAE_cifar10_32x64x128x64x128x64x32_bs512
FixedImageConvVAE_cifar10_32x64x128x64x32_bs128
FixedImageConvVAE_cifar10_64x128x256x128x256x128x64_bs128
FixedImageConvVAE_mnist_32x32x32x32x32_bs128
FixedImageConvVAE_mnist_32x64x32x64x32_bs128
FixedImageConvVAE_mnist_64x128x128x128x64_bs128
Convolutional variational autoencoders trained on different datasets, batch sizes, and with different architectures.
FixedImageConv_cifar100_32x64x128_FC64x32_tanh_variance_scaling_bs64
FixedImageConv_cifar100_32x64x64_flatten_bs128
FixedImageConv_cifar100_bn_32x64x128x128_bs128
FixedImageConv_cifar10_32x64x128_flatten_FC64x32_tanh_he_bs8
FixedImageConv_cifar10_32x64x128_flatten_FC64x32_tanh_variance_scaling_bs64
FixedImageConv_cifar10_32x64x128_he_bs64
FixedImageConv_cifar10_32x64x128_largenormal_bs64
FixedImageConv_cifar10_32x64x128_normal_bs64
FixedImageConv_cifar10_32x64x128_smallnormal_bs64
FixedImageConv_cifar10_32x64x128x128x128_avg_he_bs64
FixedImageConv_cifar10_32x64x64_bs128
FixedImageConv_cifar10_32x64x64_fc_64_bs128
FixedImageConv_cifar10_32x64x64_flatten_bs128
FixedImageConv_cifar10_32x64x64_tanh_bs64
FixedImageConv_cifar10_batchnorm_32x32x32x64x64_bs128
FixedImageConv_cifar10_batchnorm_32x64x64_bs128
FixedImageConv_coil10032x32_bn_32x64x128x128_bs128
FixedImageConv_colorectalhistology32x32_32x64x64_flatten_bs128
FixedImageConv_food10164x64_Conv_32x64x64_flatten_bs64
FixedImageConv_food101_batchnorm_32x32x32x64x64_bs128
FixedImageConv_mnist_32x64x64_fc_64_bs128
FixedImageConv_sun39732x32_bn_32x64x128x128_bs128
Mnist_Conv_32x16x64_flatten_FC32_tanh_bs32
Convolutional neural networks doing supervised classification. These models vary in dataset, architecture, and initializations.
FixedLM_lm1b_patch128_GRU128_embed64_avg_bs128
FixedLM_lm1b_patch128_GRU256_embed64_avg_bs128
FixedLM_lm1b_patch128_GRU64_embed64_avg_bs128
FixedLM_lm1b_patch128_LSTM128_embed64_avg_bs128
FixedLM_lm1b_patch128_LSTM256_embed64_avg_bs128
Language modeling tasks on different RNN cell types and sizes.
FixedMAF_cifar10_3layer_bs64
FixedMAF_mnist_2layer_bs64
FixedMAF_mnist_3layer_thin_bs64
Masked autoregressive flow models with different architectures (number of layers and sizes).
FixedMLPAE_cifar10_128x32x128_bs128
FixedMLPAE_mnist_128x32x128_bs128
FixedMLPAE_mnist_32x32x32_bs128
Autoencoder models based on multilayer perceptrons with different numbers of hidden layers and datasets.
FixedMLPVAE_cifar101_128x128x32x128x128_bs128
FixedMLPVAE_cifar101_128x32x128_bs128
FixedMLPVAE_food10132x32_128x64x32x64x128_bs64
FixedMLPVAE_mnist_128x128x8x128_bs128
FixedMLPVAE_mnist_128x64x32x64x128_bs64
FixedMLPVAE_mnist_128x8x128x128_bs128
Imagenet32x30_FC_VAE_128x64x32x64x128_relu_bs256
Variational autoencoder models built from multilayer perceptrons with different datasets, batch sizes, and architectures.
FixedMLP_cifar10_BatchNorm_128x128x128_relu_bs128
FixedMLP_cifar10_BatchNorm_64x64x64x64x64_relu_bs128
FixedMLP_cifar10_Dropout02_128x128_relu_bs128
FixedMLP_cifar10_Dropout05_128x128_relu_bs128
FixedMLP_cifar10_Dropout08_128x128_relu_bs128
FixedMLP_cifar10_LayerNorm_128x128x128_relu_bs128
FixedMLP_cifar10_LayerNorm_128x128x128_tanh_bs128
FixedMLP_cifar10_ce_128x128x128_relu_bs128
FixedMLP_cifar10_mse_128x128x128_relu_bs128
FixedMLP_food10132x32_ce_128x128x128_relu_bs128
FixedMLP_food10132x32_mse_128x128x128_relu_bs128
FixedMLP_mnist_ce_128x128x128_relu_bs128
FixedMLP_mnist_mse_128x128x128_relu_bs128
Image classification based on multilayer perceptrons. We vary architecture, data, batch size, normalization techniques, dropout, and loss type across problems.
FixedNVP_mnist_2layer_bs64
FixedNVP_mnist_3layer_thin_bs64
FixedNVP_mnist_5layer_bs64
FixedNVP_mnist_5layer_thin_bs64
FixedNVP_mnist_9layer_thin_bs16
Non volume preserving flow models with different batch sizes and architectures.
FixedTextRNNClassification_imdb_patch128_LSTM128_avg_bs64
FixedTextRNNClassification_imdb_patch128_LSTM128_bs64
FixedTextRNNClassification_imdb_patch128_LSTM128_embed128_bs64
FixedTextRNNClassification_imdb_patch32_GRU128_bs128
FixedTextRNNClassification_imdb_patch32_GRU64_avg_bs128
FixedTextRNNClassification_imdb_patch32_IRNN64_relu_avg_bs128
FixedTextRNNClassification_imdb_patch32_IRNN64_relu_last_bs128
FixedTextRNNClassification_imdb_patch32_LSTM128_E128_bs128
FixedTextRNNClassification_imdb_patch32_LSTM128_bs128
FixedTextRNNClassification_imdb_patch32_VRNN128_tanh_bs128
FixedTextRNNClassification_imdb_patch32_VRNN64_relu_avg_bs128
FixedTextRNNClassification_imdb_patch32_VRNN64_tanh_avg_bs128
RNN text classification problems with different RNN cells, sizes, embedding sizes, and batch sizes.
TwoD_Bowl1
TwoD_Bowl10
TwoD_Bowl100
TwoD_Bowl1000
2D quadratic bowls with different condition numbers.
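To illustrate why the condition number matters for these bowls, here is a minimal sketch assuming an axis-aligned quadratic whose curvatures are 1 and the condition number. The exact parameterization used in TaskSet is not given here, so bowl_loss, bowl_grad, and gradient_descent are illustrative assumptions.

```python
def bowl_loss(params, condition_number):
    """Assumed form: 0.5 * (x^2 + condition_number * y^2)."""
    x, y = params
    return 0.5 * (x * x + condition_number * y * y)

def bowl_grad(params, condition_number):
    x, y = params
    return (x, condition_number * y)

def gradient_descent(params, condition_number, lr, steps):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        gx, gy = bowl_grad(params, condition_number)
        params = (params[0] - lr * gx, params[1] - lr * gy)
    return params

final = {}
for cond in (1.0, 10.0, 100.0, 1000.0):
    # Step size matched to the steepest direction: lr = 1 / condition_number.
    final[cond] = gradient_descent((1.0, 1.0), cond, 1.0 / cond, 100)
```

With this step size the steep direction is solved almost immediately, while the flat direction shrinks by a factor of (1 - 1/condition_number) per step, so the badly conditioned bowls are still far from the minimum after 100 steps.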
TwoD_Rosenbrock
TwoD_StyblinskiTang
TwoD_Ackley
TwoD_Beale
Toy 2D test functions.
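For reference, the textbook 2D forms of these four test functions are sketched below. These are the standard definitions from the optimization literature. Whether TaskSet applies any extra scaling, shifting, or noise is not stated here, so treat the code as an assumption about the underlying functions rather than as the task definitions.

```python
import math

def rosenbrock(x, y):
    """Minimum 0 at (1, 1)."""
    return (1.0 - x) ** 2 + 100.0 * (y - x * x) ** 2

def styblinski_tang(x, y):
    """Minimum of about -78.332 at x = y = -2.903534 (approximately)."""
    return 0.5 * sum(v ** 4 - 16.0 * v ** 2 + 5.0 * v for v in (x, y))

def ackley(x, y):
    """Minimum 0 at the origin."""
    radial = -20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x * x + y * y)))
    cosine = -math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
    return radial + cosine + math.e + 20.0

def beale(x, y):
    """Minimum 0 at (3, 0.5)."""
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)
```

Evaluating each function at its known minimizer is a quick sanity check before using it as an optimization target.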