Learning to Rank Learning Curves

06/05/2020
by   Martin Wistuba, et al.

Many automated machine learning methods, such as those for hyperparameter and neural architecture optimization, are computationally expensive because they involve training many different model configurations. In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. In contrast to existing methods, we consider this task as a ranking and transfer learning problem. We qualitatively show that by optimizing a pairwise ranking loss and leveraging learning curves from other datasets, our model is able to effectively rank learning curves without having to observe many or very long learning curves. We further demonstrate that our method can be used to accelerate a neural architecture search by a factor of up to 100 without a significant performance degradation of the discovered architecture. In further experiments we analyze the quality of ranking, the influence of different model components as well as the predictive behavior of the model.


1 Introduction

A method commonly used by human experts to speed up the optimization of neural architectures or hyperparameters is the early termination of iterative training processes that are unlikely to improve the current solution. A common technique to determine the likelihood of no improvement is to compare the learning curve of a new configuration to the one of the currently best configuration. This idea can also be used to speed up automated machine learning processes. For this purpose, it is common practice to extrapolate the partial learning curve in order to predict the final performance of the currently investigated model. Current extrapolation techniques have several weaknesses that prevent them from realizing their full potential in practice. Many of the methods require a sufficient number of sample learning curves to make reliable predictions (Chandrashekaran and Lane, 2017; Klein et al., 2017; Baker et al., 2018). Thus, the extrapolation method cannot be used yet for the first candidates, which means additional computational effort. Other methods do not have this disadvantage, but require sufficiently long learning curves to make reliable predictions, which again means unnecessary overhead (Domhan et al., 2015). Many of these methods also do not take into account other information such as the hyperparameters of the model being examined or its network architecture.

We address the need for sample learning curves by devising a transfer learning technique that uses learning curves from other problems. Since the range of accuracy varies from dataset to dataset, we must account for this in our modeling. As we are not interested in predicting the performance of a model anyway, we use a ranking model that models the probability that the model currently under investigation surpasses the best solution so far.

In order to be able to make reliable predictions for short learning curves, we consider further characteristics of the model such as its network architecture. We compare our ranking method, with respect to a ranking measure, against different methods on five image classification and four tabular regression datasets. We also show that our method is capable of significantly accelerating neural architecture search (NAS) and hyperparameter optimization. Furthermore, we conduct several ablation studies to better motivate our model and explain its behavior.

2 Related work

Most of the prior work on learning curve prediction is based on the idea of extrapolating the partial learning curve by using a combination of continuously increasing basic functions. Domhan et al. (2015) define a set of 11 parametric basic functions, estimate their parameters and combine them in an ensemble. Klein et al. (2017) propose a heteroscedastic Bayesian model which learns a weighted average of the basic functions. Chandrashekaran and Lane (2017) do not use basic functions but use previously observed learning curves of the current dataset. An affine transformation for each previously seen learning curve is estimated by minimizing the mean squared error with respect to the partial learning curve. The best fitting extrapolations are averaged as the final prediction. Baker et al. (2018) use a different procedure. They use support vector machines as sequential regressive models to predict the final accuracy based on features extracted from the learning curve, its gradients, and the neural architecture itself.

The predictor by Domhan et al. (2015) is able to forecast without having seen any learning curve before, but requires observing more epochs for accurate predictions. The model by Chandrashekaran and Lane (2017) requires seeing a few learning curves to extrapolate future learning curves; however, accurate forecasts are already possible after a few epochs. The algorithms proposed by Klein et al. (2017) and Baker et al. (2018) need to observe many full-length learning curves before providing any useful forecasts. However, this is prohibitive in scenarios where training is time-consuming, such as for large convolutional neural networks (CNNs).

All previous methods for automatically terminating iterative learning processes are based on methods that predict the learning curve. Ultimately, however, we are less interested in the exact learning curve than in whether the current learning curve leads to a better result. Methods that obtain a ranking in this way are referred to as pointwise ranking methods (Liu, 2011). They have proven to be less efficient than pairwise ranking methods, which directly optimize for the objective function (Burges et al., 2005). Yet, we are the first to consider a pairwise ranking loss for this application.

There are some bandit-based methods which leverage early termination as well. Successive Halving (Jamieson and Talwalkar, 2016) simultaneously trains multiple models with randomly chosen settings and terminates the worst-performing half at predefined intervals. Hyperband (Li et al., 2017) observes that the choice of the intervals is vital and therefore proposes to run Successive Halving with different intervals. BOHB (Falkner et al., 2018) is an extension of Hyperband which replaces the random selection of settings with Bayesian optimization.
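For illustration, a minimal sketch of Successive Halving as just described (fixed intervals, randomly chosen configurations); the function and parameter names are ours, not taken from any of the cited implementations.

```python
def successive_halving(configs, train_for, budget_per_round=1, rounds=4):
    """Simplified Successive Halving (Jamieson and Talwalkar, 2016): train every
    surviving configuration for a fixed interval, then discard the worse half.

    train_for: callable (config, budget) -> validation performance (higher is better).
    """
    survivors = list(configs)
    for _ in range(rounds):
        scored = sorted(survivors,
                        key=lambda c: train_for(c, budget_per_round),
                        reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]   # terminate the worst half
    return survivors[0]
```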

The use of methods that terminate less promising iterative training processes early is of particular interest in the field of automated machine learning, as they can, for example, significantly accelerate hyperparameter optimization. The most computationally intensive subproblem of automated machine learning is Neural Architecture Search (Wistuba et al., 2019), the optimization of the neural network topology. Our method is not limited to this problem, but as it is currently one of the major challenges in automated machine learning, we use it as a sample application in our evaluation. For this particular application, optimization methods that leverage parameter sharing (Pham et al., 2018) have become established as a standard approach. Here, an overparameterized neural network spanning the search space is learned and finally used to determine the best architecture. DARTS (Liu et al., 2019) is one of the latest extensions of this idea; it uses a continuous relaxation of the search space, which allows learning the shared parameters as well as the architecture by means of gradient descent. These methods can be a strong alternative to early termination but do not transfer to general machine learning, and recent research implies that this approach does not work for arbitrary architecture search spaces (Yu et al., 2020).

3 Learning curve ranking

Figure 1: Learning curve prediction tries to predict the final performance from the partial learning curve (solid line); the axes show validation accuracy over iterations, with the intermediate and final performance marked.

By learning curve we refer to the performance of a model as a function of the number of iterations of an iterative learning algorithm. We use the term final learning curve to explicitly denote the entire learning curve, $y_1, \ldots, y_L$, reflecting the training process from beginning to end. Here, $y_i$ is a measure of the performance of the model (e.g., classification accuracy), which is determined at regular intervals. On the contrary, a partial learning curve, $y_1, \ldots, y_l$ with $l < L$, refers to a learning curve that is observed only up to a time $l$. We visualize these concepts in Figure 1.

There is a set of automated early termination methods which broadly follow the same principle. The first model is trained to completion and is considered the current best model $\hat{m}$, with performance $\hat{y}$. For each further model $m$, the probability that it is better than the current best model, $p(m \succ \hat{m})$, is monitored at periodic intervals during training. If it is below a given threshold, the model's training is terminated (see Algorithm 1). All existing methods rely on a two-step approach to determine this probability, which involves extrapolating the partial learning curve and applying a heuristic measure. Instead of this two-step process, we propose LCRankNet to predict the probability that a model $m_i$ is better than $m_j$ directly. LCRankNet is based on a neural network $f$ which considers model characteristics and the partial learning curve as input. We define the probability that $m_i$ is better than $m_j$ as

$p(m_i \succ m_j) = \frac{1}{1 + \exp\!\left(-(f(m_i) - f(m_j))\right)}$    (1)

Using the logistic function is an established modelling approach in the learning-to-rank community (Burges et al., 2005). Given a set of final learning curves for some models, the estimation of the target values $p_{i,j}$ between these models is trivial, since we know whether $m_i$ is better than $m_j$ or not. Therefore, we set

$p_{i,j} = 1$ if $m_i$ is better than $m_j$, and $p_{i,j} = 0$ otherwise.    (2)

We minimize the cross-entropy loss

$\mathcal{L}_{\text{rank}} = -\sum_{i,j} \big( p_{i,j} \log p(m_i \succ m_j) + (1 - p_{i,j}) \log (1 - p(m_i \succ m_j)) \big)$    (3)

to determine the parameters of $f$. Now $p(m_i \succ m_j)$ can be predicted for arbitrary pairs $m_i$ and $m_j$ using Equation (1). A model trained on all partial learning curves of length $l$ is used to predict the probabilities for curves of length $l$.
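For illustration, a minimal sketch of this pairwise objective in PyTorch (a framework choice of ours, not necessarily the authors'); `scores_i` and `scores_j` stand for the outputs $f(m_i)$ and $f(m_j)$ of an arbitrary scoring network.

```python
import torch

def pairwise_rank_loss(scores_i, scores_j, targets):
    """Cross-entropy over pairwise preference probabilities (Eqs. 1-3).

    scores_i, scores_j: f(m_i), f(m_j) for a batch of model pairs.
    targets: 1.0 where m_i ended up better than m_j, else 0.0.
    """
    # Eq. (1): p(m_i > m_j) = sigmoid(f(m_i) - f(m_j))
    p_hat = torch.sigmoid(scores_i - scores_j)
    # Eq. (3): binary cross-entropy between predicted and observed preferences
    return torch.nn.functional.binary_cross_entropy(p_hat, targets)

# toy usage with random scores for 8 model pairs
scores_i, scores_j = torch.randn(8), torch.randn(8)
targets = (torch.rand(8) > 0.5).float()   # Eq. (2): observed pairwise outcomes
loss = pairwise_rank_loss(scores_i, scores_j, targets)
```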

Require:  Dataset $d$, model $m$, performance of the best model so far $\hat{y}$, threshold $\tau$.
Ensure:  Learning curve of $m$.
1:  for $t = 1, \ldots, T$ do
2:     Train $m$ on $d$ for a step and observe a further part of the learning curve $y_1, \ldots, y_t$.
3:     if $p(m \succ \hat{m}) \geq \tau$ then
4:        continue
5:     else if $p(m \succ \hat{m}) < \tau$ then
6:        return $y_1, \ldots, y_t$
7:     end if
8:  end for
9:  return $y_1, \ldots, y_T$
Algorithm 1: Early Termination Method
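A minimal Python sketch of this loop is given below; `train_one_step` and `prob_better` are hypothetical placeholders for the actual training routine and the ranking model, and the default threshold follows the value of 0.45 used later in the paper.

```python
def train_with_early_termination(model, dataset, train_one_step, prob_better,
                                 tau=0.45, max_steps=100):
    """Train `model`, stopping as soon as the predicted probability of beating
    the incumbent drops below the threshold `tau` (cf. Algorithm 1).

    train_one_step: callable (model, dataset) -> latest validation performance.
    prob_better:    callable (model, partial_curve) -> probability that `model`
                    ends up better than the current best model.
    """
    curve = []
    for _ in range(max_steps):
        curve.append(train_one_step(model, dataset))   # observe next curve value
        if prob_better(model, curve) >= tau:
            continue                                   # still promising, keep training
        return curve                                   # terminate early
    return curve                                       # trained to completion
```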

Figure 2: LCRankNet has four different components, each dealing with one type of input: the partial learning curve (convolutions with kernel sizes 2-5 followed by max pooling), the architecture encoding (embedding and LSTM), the dataset ID (embedding), and further hyperparameters. Their outputs are merged in a fully connected layer.

3.1 Ranking model to learn across datasets

We have defined the prediction of our model in Equation (1). It depends on the outcome of the function $f$, which takes a model representation as input. The information contained in this representation depends on the task at hand. The representation consists of up to four different parts: first, the partial learning curve $y_1, \ldots, y_l$; second, the description of the model's architecture, which is a sequence of strings representing the layers it is comprised of; third, a dataset ID indicating on which dataset the corresponding architecture and learning curve were trained and observed; fourth, all remaining hyperparameters, e.g. learning rate or batch size.

We model $f$ using a neural network and use special layers to process the different parts of the representation. The learned representations of the different parts are finally concatenated and fed to a fully connected layer. The architecture is visualized in Figure 2. We will now describe the different components in detail.

Figure 3: The x-axis indicates the observed length of the learning curve. We report the mean of ten repetitions; the shaded area is the standard deviation. Our method LCRankNet outperforms its competitors on all datasets.

Learning curve component

A CNN is used to process the partial learning curve. We consider up to four different convolutional layers with different kernel sizes. The exact number depends on the length of the learning curve, since we consider no kernel sizes larger than the learning curve length. Each convolution is followed by a global max pooling layer, and their outputs are concatenated. This results in the learned representation of the learning curve. In the appendix we analyze alternative modeling options.
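A sketch of such a learning curve encoder in PyTorch; the framework and the filter count are our assumptions, while the parallel kernels of sizes 2-5 with global max pooling follow the description above (recall that a separate model is trained per curve length, so the set of applicable kernels is fixed for a given model).

```python
import torch
import torch.nn as nn

class LearningCurveEncoder(nn.Module):
    """Encode a partial learning curve with parallel 1-D convolutions
    (kernel sizes 2..5, as in Figure 2) followed by global max pooling."""

    def __init__(self, n_filters=16, max_kernel=5):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(1, n_filters, kernel_size=k) for k in range(2, max_kernel + 1)
        )

    def forward(self, curve):                    # curve: (batch, length)
        x = curve.unsqueeze(1)                   # -> (batch, 1, length)
        feats = []
        for conv in self.convs:
            if conv.kernel_size[0] <= x.size(-1):      # skip kernels longer than the curve
                feats.append(conv(x).amax(dim=-1))     # global max pooling per filter
        return torch.cat(feats, dim=-1)          # learned curve representation

# toy usage: batch of 4 partial curves with 10 observed epochs
emb = LearningCurveEncoder()(torch.rand(4, 10))
```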

Architecture component

In this paper we consider two experiments for learning curve prediction. In one we deal with NAS in the NASNet search space (Zoph et al., 2018); in the other we work on the joint optimization of architecture and hyperparameters on tabular data (Klein and Hutter, 2019). In the former experiment, the search space consists of CNNs composed of two cells with five blocks each, which means that 60 decisions about the architecture have to be made. In the second experiment, the fully connected neural networks (FCNNs) consist of only two layers, so the number of decisions is only 6. For more details refer to the appendix. We learn an embedding for every option. An LSTM takes this embedding and generates the architecture embedding.
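A sketch of the architecture component under the same assumptions as above (vocabulary size and embedding dimensions are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

class ArchitectureEncoder(nn.Module):
    """Embed each architecture decision (a token from a small vocabulary)
    and summarize the sequence with an LSTM, as in the architecture
    component of Figure 2."""

    def __init__(self, vocab_size=32, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                   # tokens: (batch, seq_len) integer ids
        _, (h_n, _) = self.lstm(self.embedding(tokens))
        return h_n[-1]                           # final hidden state as architecture embedding

# toy usage: batch of 4 NASNet-style architectures, 60 decisions each
arch_emb = ArchitectureEncoder()(torch.randint(0, 32, (4, 60)))
```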

Dataset component

As we intend to learn from learning curves of other datasets, we include dataset embeddings as one of the components. Every dataset has its own embedding and it will be used whenever a learning curve observed on this dataset is selected. If the dataset is new, then the corresponding embedding is initialized at random. As the model observes more learning curves from this dataset, it improves the dataset embedding. The embedding helps us to model dataset-specific peculiarities in order to adjust the ranking if necessary.

Hyperparameter component

Other hyperparameters, if available, are passed directly to the last layer.
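Putting the four components together, a sketch of how their representations might be concatenated and mapped to the ranking score $f(m)$; the dataset-embedding size and the single fully connected output layer are our assumptions. The encoder sketches from above can be plugged in as `curve_enc` and `arch_enc`.

```python
import torch
import torch.nn as nn

class LCRankNetScore(nn.Module):
    """Concatenate learning-curve, architecture, dataset, and hyperparameter
    representations and map them to a single ranking score f(m) (cf. Figure 2).
    Dimensions are illustrative."""

    def __init__(self, curve_enc, arch_enc, n_datasets, ds_emb_dim=8):
        super().__init__()
        self.curve_enc, self.arch_enc = curve_enc, arch_enc
        self.ds_embedding = nn.Embedding(n_datasets, ds_emb_dim)
        self.head = nn.LazyLinear(1)   # fully connected output layer, in_features inferred

    def forward(self, curve, arch_tokens, dataset_id, hparams):
        parts = [
            self.curve_enc(curve),
            self.arch_enc(arch_tokens),
            self.ds_embedding(dataset_id),
            hparams,                               # remaining hyperparameters, passed directly
        ]
        return self.head(torch.cat(parts, dim=-1)).squeeze(-1)
```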

Technical details

During the development process, we found that the architecture component leads to instabilities during training. To avoid this, we regularize the output of the LSTM layer by using it as input to an autoregressive model, similar to a sequence-to-sequence model (Sutskever et al., 2014), that recreates the original description of the architecture. In addition, we use the attention mechanism (Bahdanau et al., 2015) to facilitate this process. All parameters of the layers in $f$ are trained jointly by means of Adam (Kingma and Ba, 2015) by minimizing

$\mathcal{L} = \mathcal{L}_{\text{rank}} + \lambda \, \mathcal{L}_{\text{rec}}$,    (4)

a weighted linear combination of the ranking loss (Equation (3)) and the reconstruction loss $\mathcal{L}_{\text{rec}}$. The threshold hyperparameter $\tau$ (Algorithm 1) allows for trading precision vs. recall, or search time vs. regret, respectively. There are two extreme cases: if we set $\tau = 1$, every run is terminated immediately; if we set $\tau = 0$, we never terminate any run. For our experiments we set $\tau = 0.45$, which means that if the predicted probability that the new model is better than the best one is below 45%, the run is terminated early.
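A sketch of this combined objective; `lambda_rec` is a placeholder weight (the paper's value is not reproduced here), and the cross-entropy reconstruction term stands in for the attention-based sequence decoder.

```python
import torch

def lcranknet_loss(score_i, score_j, targets, recon_logits, arch_tokens, lambda_rec=1.0):
    """Weighted combination of the pairwise ranking loss (Eq. 3) and the
    architecture reconstruction loss used to regularize the LSTM (Eq. 4).
    lambda_rec is illustrative, not the paper's weighting."""
    rank = torch.nn.functional.binary_cross_entropy(
        torch.sigmoid(score_i - score_j), targets)
    # sequence reconstruction: predict each original architecture token
    rec = torch.nn.functional.cross_entropy(
        recon_logits.flatten(0, 1), arch_tokens.flatten())
    return rank + lambda_rec * rec
```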

4 Experiments

In this section, we first discuss how to create the meta-knowledge and then analyze our model in terms of learning curve ranking and the ability to use it as a way to accelerate NAS. Finally, we examine its individual components and behavior in certain scenarios. Furthermore, we show that LCRankNet can be combined with various optimization methods and used for hyperparameter optimization tasks.


Method                            CIFAR-10      CIFAR-100     Fashion       Quickdraw     SVHN
                                  Regr.  Time   Regr.  Time   Regr.  Time   Regr.  Time   Regr.  Time
No Early Termination              0.00   1023   0.00   1021   0.00   1218   0.00   1045   0.00   1485
Domhan et al. (2015)              0.56    346   0.82    326   0.00    460   0.44    331   0.28    471
Li et al. (2017)                  0.22    106   0.78    102   0.32    132   0.54    109   0.00    156
Baker et al. (2018)               0.00     89   0.00     77   0.00    129   0.00    107   0.00    241
Jamieson and Talwalkar (2016)     0.62     62   0.00     54   0.18     70   0.40     60   0.28     88
Chandrashekaran and Lane (2017)   0.62     30   0.00     35   0.28     41   0.30     82   0.06    164
LCRankNet                         0.22     20   0.00     11   0.10     19   0.00     28   0.10     74
Table 1: Results obtained by the different methods on five datasets. For both metrics, smaller is better. Regret is reported in percent, time in GPU hours.

4.1 Meta-knowledge

We compare our method to similar methods on five different datasets: CIFAR-10, CIFAR-100, Fashion-MNIST, Quickdraw, and SVHN. We use the original train/test splits where available. Quickdraw has a total of 50 million data points and 345 classes. To reduce the training time, we select a subset of this dataset: we randomly select 100 classes and choose 300 examples per class for the training split and 100 per class for the test split. For all datasets, 5,000 random data points from the training split serve as the validation split.
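A sketch of this kind of class subsampling, assuming the Quickdraw labels are available as an integer array (array names and the seed are illustrative):

```python
import numpy as np

def subsample_quickdraw(labels, n_classes=100, n_train=300, n_test=100, seed=0):
    """Pick `n_classes` random classes and, per class, `n_train` training and
    `n_test` test examples (returned as indices into the original dataset)."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=n_classes, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:n_train + n_test])
    return np.array(train_idx), np.array(test_idx)
```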

To create the meta-knowledge, we choose 200 architectures per dataset at random from the NASNet search space (Zoph et al., 2018), such that we train a total of 1,000 architectures. We would like to point out that these are 1,000 unique architectures; no architecture has been trained on several different datasets. Each architecture is trained for 100 epochs with stochastic gradient descent and a cosine learning rate schedule without restart (Loshchilov and Hutter, 2017).

We use standard image preprocessing and augmentation: for every image, we first subtract the channel mean and then divide by the channel standard deviation. Images are padded by a margin of four pixels and randomly cropped back to the original dimension. For all datasets but SVHN we apply random horizontal flipping. Additionally, we use Cutout (DeVries and Taylor, 2017).
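A sketch of such a pipeline with torchvision; the normalization statistics are placeholders, and `RandomErasing` is used here as a stand-in for Cutout.

```python
import torchvision.transforms as T

# Pad-4 + random crop, optional horizontal flip (disabled for SVHN), and
# per-channel normalization; mean/std values are placeholders.
def build_transform(image_size=32, flip=True,
                    mean=(0.5, 0.5, 0.5), std=(0.25, 0.25, 0.25)):
    ops = [T.RandomCrop(image_size, padding=4)]
    if flip:                                   # skipped for SVHN
        ops.append(T.RandomHorizontalFlip())
    ops += [
        T.ToTensor(),
        T.Normalize(mean, std),
        T.RandomErasing(),                     # stand-in for Cutout (DeVries & Taylor, 2017)
    ]
    return T.Compose(ops)
```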

For the experiments in Section 4.6 we rely on the tabular benchmark (Klein and Hutter, 2019). This benchmark contains meta-data for four regression tasks. In total, 62,208 different FCNNs with different architectures and hyperparameters are evaluated. 100 settings per dataset are chosen at random as meta-knowledge.

The following experiments are conducted in a leave-one-dataset-out cross-validation: when considering one dataset, all meta-knowledge except that for this particular dataset is used.

4.2 Ranking performance

First, we analyze the quality of the learning curve rankings produced by different learning curve prediction methods. In this experiment we choose 50 different learning curves at random as a test set. Five random learning curves are used as a training set for every repetition. Each learning curve prediction method ranks the 50 architectures by observing the partial learning curve, whose length varies from 0 to 30 percent of the final learning curve. We repeat this experiment ten times and report the mean and standard deviation of the correlation between the true and the predicted ranking in Figure 3. As a correlation metric, we use Spearman's rank correlation coefficient; the correlation is 1 for a perfect ranking and 0 for an uncorrelated, random ranking. Our method LCRankNet shows higher correlation for all datasets. When there are no or only very short partial learning curves available, our method shows the biggest difference to the existing methods. The reason for this is the combination of considering the network architecture together with additional meta-knowledge. We analyze the impact of each component in detail in Section 4.5.
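The ranking evaluation itself reduces to a Spearman correlation between predicted and true orderings; a sketch, with `predict_score` standing in for any of the compared predictors:

```python
from scipy.stats import spearmanr

def ranking_correlation(partial_curves, final_performances, predict_score):
    """Compare the predicted ranking of test learning curves against their
    true final performances; 1.0 = perfect ranking, ~0.0 = random ranking."""
    predicted = [predict_score(curve) for curve in partial_curves]
    corr, _ = spearmanr(predicted, final_performances)
    return corr
```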

The method by Chandrashekaran and Lane (2017) consistently shows the second-best results and in some cases catches up to the results of our method. The method by Baker et al. (2018) stands out due to its high standard deviation. It is by far the method with the smallest squared error on the test set; however, even the smallest changes in the prediction lead to a significantly different ranking, which explains the high variance in its results. The method by Domhan et al. (2015) requires a minimum length of the learning curve to make predictions; accordingly, we observe rank correlation values starting from a learning curve length of 4%. Using the last seen value to determine the ranking of learning curves is a simple yet efficient method (Klein et al., 2017). In fact, it is able to outperform some of the more elaborate methods.

4.3 Accelerating random neural architecture search

In this experiment, we demonstrate the utility of learning curve predictors in the search for network architectures. For the sake of simplicity, we accelerate a random search in the NASNet search space (Zoph et al., 2018). The random search samples 200 models and trains each of them for 100 epochs to obtain the final accuracy. In the end, the best of all these models is returned. Each learning curve predictor then iterates over these sampled architectures in the same order and determines at every third epoch whether the training should be aborted. For Successive Halving (Jamieson and Talwalkar, 2016) and Hyperband (Li et al., 2017) we follow the algorithm defined by the authors and use the recommended settings. The best model discovered after iterating over all 200 architectures is returned as the best model for this learning curve predictor. The goal of the methods is to minimize the regret compared to a random search without early stopping and at the same time reduce the computational effort.

Figure 4: LCRankNet speeds up architecture search on SVHN. The differences between learning curves are small, making this the hardest task.

One of our observations is that the method of Domhan et al. (2015) underestimates performance when the learning curve is short. As a result, it would end each training process after only a few epochs. Therefore, we follow the recommendation by Domhan et al. (2015) and do not end processes before we have seen the first 30 epochs. We summarize the results in Table 1. Our first observation is that all methods accelerate the search with little regret. Here we define regret as the difference between the accuracy of the model found by the random search and the accuracy of the model found by one of the methods. Not surprisingly, the method of Domhan et al. (2015) takes up most of the time, as it requires significantly longer learning curves to make its decisions. In addition, we can confirm the results of Baker et al. (2018) and Chandrashekaran and Lane (2017), both of which report better results than Domhan et al. (2015). Our method requires the least amount of time for each dataset. For CIFAR-100 we observe no regret, but a reduction of the search time by a factor of 100. In some cases we observe an insignificantly higher regret than some of the other methods; in our opinion, the time saved makes up for it.

In Figure 4 we visualize the random search for SVHN. As the figure shows, many curves have not only similar behavior but also similar error. For this reason, it is difficult to decide whether a model can safely be discarded, which explains the increased runtime. The only methods that do not show this behavior are Successive Halving and Hyperband. The simple reason is that the number of runs terminated and the times at which they are discarded are fixed by the algorithm and do not depend on the properties of the learning curve. The disadvantages are obvious: promising runs may be terminated and unpromising runs may run longer than needed.

Finally, we compare our method with DARTS (Liu et al., 2019), a method that speeds up the search by parameter sharing. For a fair comparison, we train the architecture discovered in the previous experiment with a training setup equivalent to that of DARTS. This architecture achieved a classification error of 2.99% on CIFAR-10 after a search of only 20 GPU hours. In comparison, DARTS (1st order) needs 36 GPU hours for a similarly good architecture (3.00% error). DARTS (2nd order) can improve these results to 2.76%, but it takes 96 GPU hours. One of the major disadvantages of DARTS and related methods is that they cannot generalize to arbitrary architecture search spaces (Yu et al., 2020) and do not allow hyperparameter optimization. Our proposed method does not suffer from these problems.

Figure 5: Analysis of the predicted probability of LCRankNet with growing learning curve length. The plots in the top row show examples for correct decisions by LCRankNet. The bottom row shows two examples where its decision was wrong.

4.4 LCRankNet prediction analysis

We saw in the previous section that LCRankNet does not perform perfectly. There are cases where its regret is greater than zero or where the search time is higher, which indicates that early stopping was applied later than might have been possible. In this section we shed some light on the decisions made by the model and give the reader some insight into its behavior.

We provide four example decisions of LCRankNet in Figure 5. The plots in the top row show cases where LCRankNet made a correct decision; the plots in the bottom row are examples of incorrect decisions.

In both of the correct cases (top row), LCRankNet assigns a higher probability to begin with, based on the meta-knowledge. In the top left case, however, it becomes evident after only very few epochs that the current best model is consistently better than the new one, so the probability drops sharply to values close to zero, which would correctly stop training early on. The case in the top right plot is trickier. The probability increases until a learning curve length of 12%, as the learning curve seems to indicate that the new model is better than the current best one. However, the learning curves then approach each other, the probability decreases, and training is stopped early.

We now discuss the two examples in the bottom row, which LCRankNet erroneously terminated early and which in turn cause a regret greater than zero in Table 1. The bottom left shows an example where sometimes one model is better for a longer period and sometimes the other. This is arguably a very difficult problem, and it is hard to predict which curve will eventually be better. However, LCRankNet shows reasonable behavior. As the learning curve of the new model is consistently worse than that of the current best model in the segment from 15% to 42%, the probability decreases. Starting from a length of 57%, where the new model shows superior performance for several epochs, the probability starts to rise. From a learning curve length of 70% onward, both learning curves are very close, and the difference in the final accuracy is only 0.0022. The example visualized in the bottom right plot is a very interesting one. The learning curve of the current best model is consistently better than or almost equal to that of the new model up to the very end. Towards the end, the learning curves get very close; in fact, from a learning curve length of 63% onward, the maximum difference between the two curves per epoch is only 0.008. Hence, in the beginning it seems like a trivial decision to reject the new model. However, it eventually turns out to be better.

Figure 6: Analysis of the different components of LCRankNet. Every single component (metadata, consideration of the learning curve, and the architecture description) is vital.

In conclusion, deciding whether one model will be better than another one based on a partial learning curve is a challenging task. A model that turns out to be (slightly) better than another one can be dominated consistently for most of the learning curve. This makes it a very challenging problem for automated methods and human experts.

4.5 Analysis of LCRankNet’s components

We briefly mentioned before which components of our learning curve ranker have an essential influence on its quality. We would like to deepen the analysis at this point and compare the configuration we have proposed with different variants in Figure 6. We consider variants with and without metadata, architecture description or learning curve consideration. In addition, we compare our configuration trained with pairwise ranking loss to one trained with a pointwise ranking loss.

One of the most striking observations is that the metadata is essential for the model. This is not surprising, since the learning of the architecture embedding in particular needs sufficient data. Sufficient data is not available in this setup, so we observe a much higher variance for these variants. Even the variant that only considers the learning curve benefits from additional meta-knowledge. This is also not surprising, since stagnating learning processes show similar behavior regardless of the dataset. Using the meta-knowledge, both components achieve good results on their own. It can be clearly seen that these components are orthogonal to one another: the variant which only considers the architecture shows very good results for short learning curves, whereas if only the learning curve is considered, the results are not very good at first but improve significantly with the length of the learning curve. Combining both components ultimately leads to our method and further improves the results. Finally, we compare the use of a pointwise ranking loss (L2 loss) versus a pairwise ranking loss. Although our assumption was that the latter would have to be fundamentally better, since it optimizes the model parameters directly for the task at hand, in practice this does not necessarily seem to be the case. Especially for short learning curves, the simpler method achieves better results. However, once the learning curve is long enough, the pairwise ranking loss pays off.

4.6 Accelerating hyperparameter optimization

LCRankNet is not limited to NAS but can be used for hyperparameter optimization methods as well. We demonstrate this by combining LCRankNet with TPE (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019), and Reinforcement Learning (RL) (Zoph and Le, 2017) to optimize the hyperparameters of an FCNN using a tabular data benchmark (Klein and Hutter, 2019). The objective of this benchmark is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. All experiments are repeated ten times. Each optimization method can evaluate up to 100 different configurations and chooses the best one based on its validation MSE. We report results with respect to test regret, following Klein and Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE. In our experiments, we observed that early termination generally improves the method (Figure 7; Appendix). All other results are listed in the Appendix.
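A sketch of this evaluation protocol (selection by validation MSE, reporting test regret); names are illustrative:

```python
def test_regret(evaluated_configs, best_possible_test_mse):
    """evaluated_configs: list of (validation_mse, test_mse) tuples for all
    configurations an optimizer evaluated; the configuration with the lowest
    validation MSE is chosen, and its distance to the best achievable test MSE
    in the benchmark is reported as regret."""
    _, chosen_test_mse = min(evaluated_configs, key=lambda c: c[0])
    return chosen_test_mse - best_possible_test_mse
```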

Figure 7: TPE and RE can leverage LCRankNet to accelerate optimization. Results for reinforcement learning, more datasets and further insights can be found in the Appendix.

5 Conclusion

In this paper we present LCRankNet, a method to automatically terminate unpromising model configurations early. The two main novelties of the underlying model are that it is able to consider learning curves from other datasets and that it uses a pairwise ranking loss. The former allows predictions for relatively short learning curves, and in extreme cases even without any. The latter directly models the probability that one configuration is better than another. We analyze our method on five different datasets against three alternatives. In an experiment to optimize network architectures, we obtain the fastest results. In the best case, LCRankNet is 100 times faster without sacrificing accuracy. We also examine the components and predictions of our method to give the reader a better understanding of its design choices and functionality. Finally, we demonstrate its use in combination with various hyperparameter optimizers.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.1.
  • B. Baker, O. Gupta, R. Raskar, and N. Naik (2018) Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, External Links: Link Cited by: §1, §2, §2, §4.2, §4.3, Table 1.
  • J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pp. 2546–2554. External Links: Link Cited by: 1st item, §4.6.
  • C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender (2005) Learning to rank using gradient descent. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pp. 89–96. External Links: Link, Document Cited by: §2, §3.
  • A. Chandrashekaran and I. R. Lane (2017) Speeding up hyper-parameter optimization by extrapolation of learning curves using previous builds. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part I, Lecture Notes in Computer Science, Vol. 10534, pp. 477–492. Cited by: §1, §2, §2, §4.2, §4.3, Table 1.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552. External Links: Link, 1708.04552 Cited by: §4.1.
  • T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3460–3468. Cited by: §1, §2, §2, §4.2, §4.3, Table 1.
  • S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1436–1445. External Links: Link Cited by: §2.
  • K. G. Jamieson and A. Talwalkar (2016) Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 240–248. External Links: Link Cited by: §2, §4.3, Table 1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.1.
  • A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter (2017) Learning curve prediction with bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §2, §2, §4.2.
  • A. Klein and F. Hutter (2019) Tabular benchmarks for joint architecture and hyperparameter optimization. CoRR abs/1905.04970. External Links: Link, 1905.04970 Cited by: §B.1, Appendix B, Appendix C, §3.1, §4.1, §4.6.
  • L. Li, K. G. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18, pp. 185:1–185:52. External Links: Link Cited by: §2, §4.3, Table 1.
  • H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §2, §4.3.
  • T. Liu (2011) Learning to rank for information retrieval. Springer. External Links: Link, Document, ISBN 978-3-642-14266-6 Cited by: §2.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §4.1.
  • R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 7827–7838. External Links: Link Cited by: Appendix C.
  • H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 4092–4101. External Links: Link Cited by: §2.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 4780–4789. External Links: Link, Document Cited by: 1st item, §4.6.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112. External Links: Link Cited by: §3.1.
  • M. Wistuba, A. Rawat, and T. Pedapati (2019) A survey on neural architecture search. CoRR abs/1905.01392. External Links: Link, 1905.01392 Cited by: §2.
  • K. Yu, C. Sciuto, M. Jaggi, C. Musat, and M. Salzmann (2020) Evaluating the search phase of neural architecture search. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings, External Links: Link Cited by: §2, §4.3.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: 1st item, §4.6.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697–8710. External Links: Link, Document Cited by: Appendix C, §3.1, §4.1, §4.3.

Appendix A Ablation study: learning curve component

One of the most important design decisions of LCRankNet was the learning curve component. We experimented with several models to find the one that generates the best learning curve embedding. In the following discussion of the different models, we assume that the learning curve represents the validation accuracy over time and thus higher values are better.

To begin with, passing the entire learning curve directly to be concatenated with the architecture embedding would result in a large number of predictors and thereby overfitting. It was also clear that the best or last value of the learning curve alone (assuming that the learning curve is constantly improving, i.e. it increases monotonically, these are the same) is a very good predictor. After all, it is the only one used by methods like Hyperband and Successive Halving. In our model, this would be achieved through a simple global max pooling layer. Any further information regarding the development of the learning curve (improvement since epoch 1, first- and second-order gradients, etc.) would, however, be disregarded. Feature engineering would be one way to create such features, but convolutions allow us to automatically learn which of these features are helpful. In Figure 8 we consider two different models with a max pooling layer. The version with a global max pooling layer reduces the number of predictors to one per filter, while the version with strides reduces the number to four per filter. In our experiments, we did not notice a big difference between these two versions, but a significant improvement over the version that only uses the best value.

Furthermore, we tried, without success, to apply LSTMs to this problem: neither an LSTM applied directly to the learning curve nor one applied to the output of the convolution achieved better results than not considering the learning curve at all ("architecture only").

Appendix B Joint architecture and hyperparameter optimization

As briefly discussed in the main paper, we want to show that LCRankNet

  • can be combined with optimization methods such as Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019) and Reinforcement Learning (RL) (Zoph and Le, 2017),

  • can be applied to different search spaces (convolutional vs. fully connected neural networks) with and without hyperparameters, and

  • works with different machine learning tasks (classification vs. regression).

For this reason we carry out an additional experiment on a publicly available tabular regression benchmark (Klein and Hutter, 2019). This benchmark was created by performing a grid search with a total of 62,208 different architecture and hyperparameter settings for four different tabular regression datasets: slice localization, protein structure, parkinsons telemonitoring, and naval propulsion. Each dataset is split into 60% train, 20% validation, and 20% test. Each architecture has two fully connected layers. Settings vary with respect to the initial learning rate (0.0005, 0.001, 0.005, 0.01, 0.05, 0.1), the batch size (8, 16, 32, 64), the learning rate schedule (cosine or fixed), the activation per layer (ReLU or tanh), the number of units per layer (16, 32, 64, 128, 256, 512), and the dropout per layer (0.0, 0.3, 0.6).
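As a sanity check on the grid size, the options above can be enumerated directly (two layers, so the per-layer choices are squared):

```python
from itertools import product

learning_rates = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
batch_sizes = [8, 16, 32, 64]
schedules = ["cosine", "fixed"]
activations = ["relu", "tanh"]        # chosen per layer
units = [16, 32, 64, 128, 256, 512]   # chosen per layer
dropouts = [0.0, 0.3, 0.6]            # chosen per layer

per_layer = list(product(activations, units, dropouts))    # 2 * 6 * 3 = 36 options
grid = len(learning_rates) * len(batch_sizes) * len(schedules) * len(per_layer) ** 2
print(grid)   # 62208, matching the benchmark size
```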

We are again following the leave-one-dataset-out cross-validation protocol: We optimize the architecture and hyperparameter setting for one dataset and transfer the knowledge from the others. 100 settings per dataset are randomly selected as additional data for LCRankNet.

B.1 Accelerating hyperparameter optimization

The goal is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. In this experiment we compare the optimization methods TPE, RE, and RL with versions using early termination via LCRankNet. All experiments are repeated ten times. Each optimization method can take up to 100 different settings into account. We report results with respect to test regret, following Klein and Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE; finding the best setting results in a regret of 0. The setting chosen by an optimization method is based on the validation MSE. It is therefore possible that the test regret increases again, which can be seen as overfitting on the validation set. In Figures 10 to 12 we report the results up to the search time required for the shortest of all repetitions. We find that early termination generally improves the method; in only one case is it not better, though it is not worse either.

B.2 Technical details

In contrast to random search, Successive Halving, or Hyperband, the intermediate performance is not sufficient for other optimization methods; the final performance of a model is required. To take this into account, a simple change to LCRankNet is needed: an additional output layer is added that predicts the final performance in addition to the ranking score. We continue to use the ranking score only to decide whether to terminate a run early or not. The predicted final performance is used only in the event of early termination, to provide feedback to the optimization method. If a run is not terminated early, the actual final performance is used. To ensure that the predicted final performance is in a reasonable range, we define a lower and an upper bound. We explain them here in the context of MSE, where lower values are better. The lower bound is defined by the mean MSE observed for previous runs; the motivation for this decision is that each terminated run should be below average. We define the best observed MSE of the partial learning curve as an upper bound. If the upper and lower bounds conflict, we prefer the upper bound because it is based directly on observed data.
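A sketch of this clipping rule (function and variable names are ours):

```python
def bounded_final_mse(predicted_mse, partial_curve_mses, previous_final_mses):
    """Clip the surrogate's predicted final MSE for an early-terminated run.

    Lower bound: mean final MSE of previous runs (a terminated run should be
    below average). Upper bound: best MSE observed on the partial curve.
    On conflict, the upper bound wins because it is based on observed data.
    """
    lower = sum(previous_final_mses) / len(previous_final_mses)
    upper = min(partial_curve_mses)
    if lower > upper:                     # bounds conflict
        return upper
    return min(max(predicted_mse, lower), upper)
```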

Figure 9: One example block used in an architecture from the NASNet search space: a 3x3 convolution and a 3x3 max pooling whose outputs are combined by addition.

Appendix C Architecture representation

In this work we consider two different search spaces: the NASNet search space for CNNs (Zoph et al., 2018) and a search space for FCNNs (Klein and Hutter, 2019). In the NASNet search space, an architecture is completely described by two cells, each consisting of several blocks. The entire architecture is defined by selecting the design of all blocks. A block is designed by selecting two operations, their corresponding inputs, and how these two operations are combined. A sample block is shown in Figure 9. We follow Luo et al. (2018) and model every combination of input and operation through three embeddings. The first embedding specifies the input choice, the second the type of operation (convolution, max pooling, etc.), and the third the kernel size. In this way, an architecture with two cells, each with five blocks, is fully described by a sequence of 60 decisions.
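A sketch of such a block encoding; the vocabularies below are illustrative and incomplete, not the full NASNet operation set.

```python
# Encode one NASNet-style block as (input, operation, kernel-size) token triples,
# following the three-embedding scheme described above.
INPUTS = {"prev_cell": 0, "prev_prev_cell": 1, "block_0": 2, "block_1": 3}
OPS = {"conv": 0, "max_pool": 1, "avg_pool": 2, "identity": 3}
KERNELS = {1: 0, 3: 1, 5: 2}

def encode_block(op_a, op_b):
    """op_a, op_b: (input_name, op_name, kernel_size) for the block's two operations."""
    tokens = []
    for inp, op, k in (op_a, op_b):
        tokens += [INPUTS[inp], OPS[op], KERNELS[k]]
    return tokens   # 6 tokens per block; 10 blocks -> 60 decisions

# example: the block from Figure 9 (conv3x3 and max3x3, both reading the previous cell)
print(encode_block(("prev_cell", "conv", 3), ("prev_cell", "max_pool", 3)))
```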

We describe the FCNNs in a very similar way. For each layer we learn an embedding for the activation function, the dropout rate and the number of units. The batch size, the learning rate (after log transformation) and the learning rate schedule (after one-hot encoding) are taken into account as other hyperparameters.

Figure 10: TPE benefits from early termination on all datasets.
Figure 11: Regularized Evolution benefits from early termination on all datasets.
Figure 12: Reinforcement Learning benefits from early termination on all datasets.