1 Introduction
A method commonly used by human experts to speed up the optimization of neural architectures or hyperparameters is the early termination of iterative training processes that are unlikely to improve the current solution. A common technique to determine the likelihood of no improvement is to compare the learning curve of a new configuration to the one of the currently best configuration. This idea can also be used to speed up automated machine learning processes. For this purpose, it is common practice to extrapolate the partial learning curve in order to predict the final performance of the currently investigated model. Current extrapolation techniques have several weaknesses that make them unable to realize their full potential in practice. Many of the methods require sufficient sample learning curves to make reliable predictions (Chandrashekaran and Lane, 2017; Klein et al., 2017; Baker et al., 2018). Thus, the extrapolation method for the first candidates can not be used yet, which means more computational effort. Other methods do not have this disadvantage, but require sufficiently long learning curves to make reliable predictions which again means unnecessary overhead (Domhan et al., 2015). Many of these methods also do not take into account other information such as the hyperparameters of the model being examined or its network architecture.
We address the need for sample learning curves by devising a transfer learning technique that uses learning curves from other problems. Since the range of accuracy varies from dataset to dataset, we are forced to consider this in our modeling. But since we are not interested in predicting the performance of a model anyway, we use a ranking model that models the probability that the model currently being investigated surpasses the best solution so far.
In order to be able to make reliable predictions for short learning curves, we consider further characteristics of the model such as its network architecture. We compare our ranking method with respect to a ranking measure against different methods on five different image classification and four tabular regression datasets. We also show that our method is capable of significantly accelerating neural architecture search (NAS) and hyperparameter optimization. Furthermore, we conduct several ablation studies to provide a better motivation of our model and its behavior.
2 Related work
Most of the prior work for learning curve prediction is based on the idea of extrapolating the partial learning curve by using a combination of continuously increasing basic functions.
Domhan et al. (2015)
define a set of 11 parametric basic functions, estimate their parameters and combine them in an ensemble.
Klein et al. (2017)propose a heteroscedastic Bayesian model which learns a weighted average of the basic functions.
Chandrashekaran and Lane (2017) do not use basic functions but use previously observed learning curves of the current dataset. An affine transformation for each previously seen learning curve is estimated by minimizing the mean squared error with respect to the partial learning curve. The best fitting extrapolations are averaged as the final prediction. Baker et al. (2018)use a different procedure. They use support vector machines as sequential regressive models to predict the final accuracy based on features extracted from the learning curves, its gradients, and the neural architecture itself.
The predictor by Domhan et al. (2015)
is able to forecast without seeing any learning curve before but requires observing more epochs for accurate predictions. The model by
Chandrashekaran and Lane (2017) requires seeing few learning curves to extrapolate future learning curves. However, accurate forecasts are already possible after few epochs. Algorithms proposed by Klein et al. (2017); Baker et al. (2018)need to observe many fulllength learning curves before providing any useful forecasts. However, this is prohibiting in the scenarios where learning is timeconsuming such as in large convolutional neural networks (CNN).
All previous methods for automatically terminating iterative learning processes are based on methods that predict the learning curve. Ultimately, however, we are less interested in the exact learning curve but rather whether the current learning curve leads to a better result. This way of obtaining a ranking is referred to as pointwise ranking methods (Liu, 2011). They have proven to be less efficient than pairwise ranking methods which directly optimize for the objective function (Burges et al., 2005). Yet, we are the first to consider a pairwise ranking loss for this application.
There are some banditbased methods which leverage early termination as well. Successive Halving (Jamieson and Talwalkar, 2016) is a method that trains multiple models with settings chosen at random simultaneously and terminates the worst performing half in predefined intervals. Hyperband (Li et al., 2017) identifies that the choice of the intervals is vital and therefore proposes to run Successive Halving with different intervals. BOHB (Falkner et al., 2018) is an extension of Hyperband which proposes to replace the random selection of settings with Bayesian optimization.
The use of methods that terminate less promising iterative training processes early are of particular interest in the field of automated machine learning, as they can, for example, significantly accelerate hyperparameter optimization. The most computationally intensive subproblem of automated machine learning is Neural Architecture Search (Wistuba et al., 2019), the optimization of the neural network topology. Our method is not limited to this problem, but as it is currently one of the major challenges in automated machine learning, we use this problem as a sample application in our evaluation. For this particular application, optimization methods that leverage parameter sharing (Pham et al., 2018) have become established as a standard method. Here, a search space spanning, overparameterized neural network is learned and finally used to determine the best architecture. DARTS (Liu et al., 2019) is one of the last extensions of this idea which uses a continuous relaxation of the search space, which allows learning of the shared parameters as well as of the architecture by means of gradient descent. These methods can be a strong alternative to early termination but do not transfer to general machine learning and recent research implies that this approach does not work for arbitrary architecture search spaces (Yu et al., 2020).
3 Learning curve ranking
With learning curve we refer to the function of qualitative performance with growing number of iterations of an iterative learning algorithm. We use the term final learning curve to explicitly denote the entire learning curve, , reflecting the training process from beginning to end. Here, is a measure of the performance of the model (e.g., classification accuracy), which is determined at regular intervals. On the contrary, a partial learning curve, , refers to learning curves that are observed only up to a time . We visualize the concepts of the terms in Figure 1.
There is a set of automated early termination methods which follow broadly the same principle. The first model is trained to completion and is considered as the current best model with its performance being . For each further model , the probability that it is better than the current best model, , is monitored at periodic intervals during training. If it is below a given threshold, the model’s training is terminated (see Algorithm 1
). All existing methods rely on a twostep approach to determine this probability which involves extrapolating the partial learning curve and a heuristic measure. Instead of the twostep process to determine the probability, we propose LCRankNet to predict the probability that a model
is better than directly. LCRankNet is based on a neural network which considers model characteristics and the partial learning curve as input. We define the probability that is better than as(1) 
Using the logistic function is an established modelling approach in the learningtorank community (Burges et al., 2005). Given a set of final learning curves for some models, the estimation of the values between these models is trivial since we know whether is better than or not. Therefore, we set
(2) 
We minimize the crossentropy loss
(3) 
to determine the parameters of . Now can be predicted for arbitrary pairs and using Equation (1). A model is trained on all partial learning curves of length to predict the probabilities of curves of length .
3.1 Ranking model to learn across datasets
We have defined the prediction of our model in Equation (1). It depends on the outcome of the function which takes a model representation as input. The information contained in this representation depends on the task at hand. The representation consists of up to four different parts. First, the partial learning curve . Second, the description of the model’s architecture which is a sequence of strings representing the layers it is comprised of. Third, a dataset ID to indicate on which dataset the corresponding architecture and learning curve was trained and observed. Fourth, all remaining hyperparameters, e.g. learning rate or batch size.
We model using a neural network and use special layers to process the different parts of the representation. The learned representation for the different parts are finally concatenated and fed to a fully connected layer. The architecture is visualized in Figure 2. We will now describe the different components in detail.
The xaxis indicates the length observed of the learning curve. We report the mean of ten repetitions. The shaded area is the standard deviation. Our method LCRankNet outperforms its competitors on all datasets.
Learning curve component
A CNN is used to process the partial learning curve. We consider up to four different convolutional layers with different kernel sizes. The exact number depends on the length of the learning curve since we consider no kernel sizes larger than the learning curve length. Each convolution is followed by a global max pooling layer and their outputs are concatenated. This results in the learned representation for the learning curve. In the appendix we analyze alternative modeling options.
Architecture component
In this paper we consider two experiments for learning curve prediction. In one we deal with NAS in the NASNet search space (Zoph et al., 2018), in the other we work on the joint optimization of architecture and hyperparameters on tabular data (Klein and Hutter, 2019). In the former experiment, the search space consists of CNNs, which consist of two cells with five blocks each. This means that 60 decisions about architecture have to be made. In the second experiment, the fully connected neural networks (FCNN) consists of only two layers, so the number of decisions is only 6. For more details refer to the appendix. We learn an embedding for every option. An LSTM takes this embedding and generates the architecture embedding.
Dataset component
As we intend to learn from learning curves of other datasets, we include dataset embeddings as one of the components. Every dataset has its own embedding and it will be used whenever a learning curve observed on this dataset is selected. If the dataset is new, then the corresponding embedding is initialized at random. As the model observes more learning curves from this dataset, it improves the dataset embedding. The embedding helps us to model datasetspecific peculiarities in order to adjust the ranking if necessary.
Hyperparameter component
Other hyperparameters, if available, are passed directly to the last layer.
Technical details
During the development process, we found that the architecture component leads to instabilities during training. To avoid this, we regularize the output of the LSTM layer by using it as an input to an autoregressive model, similar to a sequencetosequence model
(Sutskever et al., 2014) that recreates the original description of the architecture. In addition, we use the attention mechanism (Bahdanau et al., 2015) to facilitate this process. All parameters of the layers in are trained jointly by means of Adam (Kingma and Ba, 2015) by minimizing(4) 
a weighted linear combination of the ranking loss (Equation (3)) and the reconstruction loss with . The hyperparameter allows for trading precision vs. recall or search time vs. regret, respectively. There are two extreme cases: if we set , every run is terminated immediately. If we set , we never terminate any run. For our experiment we set which means that if the predicted probability that the new model is better than the best one is below 45%, the run is terminated early.
4 Experiments
In this section, we first discuss how to create the metaknowledge and then analyze our model in terms of learning curve ranking and the ability to use it as a way to accelerate NAS. Finally, we examine its individual components and behavior in certain scenarios. Furthermore, we show that LCRankNet can be combined with various optimization methods and used for hyperparameter optimization tasks.
Method  CIFAR10  CIFAR100  Fashion  Quickdraw  SVHN  

Regr.  Time  Regr.  Time  Regr.  Time  Regr.  Time  Regr.  Time  
No Early Termination  0.00  1023  0.00  1021  0.00  1218  0.00  1045  0.00  1485 
Domhan et al. (2015)  0.56  346  0.82  326  0.00  460  0.44  331  0.28  471 
Li et al. (2017)  0.22  106  0.78  102  0.32  132  0.54  109  0.00  156 
Baker et al. (2018)  0.00  89  0.00  77  0.00  129  0.00  107  0.00  241 
Jamieson and Talwalkar (2016)  0.62  62  0.00  54  0.18  70  0.40  60  0.28  88 
Chandrashekaran and Lane (2017)  0.62  30  0.00  35  0.28  41  0.30  82  0.06  164 
LCRankNet  0.22  20  0.00  11  0.10  19  0.00  28  0.10  74 
4.1 Metaknowledge
We compare our method to similar methods on five different datasets: CIFAR10, CIFAR100, FashionMNIST, Quickdraw, and SVHN. We use the original train/test splits if available. Quickdraw has a total of 50 million data points and 345 classes. To reduce the training time, we select a subset of this dataset. We use 100 different randomly selected classes and choose 300 examples per class for the training split and 100 per class for the test split. 5,000 random data points of the training dataset serve as validation split for all datasets.
To create the metaknowledge, we choose 200 architectures per dataset at random from the NASNet search space (Zoph et al., 2018)
such that we train a total of 1,000 architectures. We would like to point out that these are 1,000 unique architectures, there is no architecture that has been trained on several different datasets. Each architecture is trained for 100 epochs with stochastic gradient descent and cosine learning rate schedule without restart
(Loshchilov and Hutter, 2017).We use standard image preprocessing and augmentation: for every image, we first subtract the channel mean and then divide by the channel standard deviation. Images are padded by a margin of four pixels and randomly cropped back to the original dimension. For all datasets but SVHN we apply random horizontal flipping. Additionally, we use Cutout
(DeVries and Taylor, 2017).For the experiments in Section 4.6 we rely on the tabular benchmark (Klein and Hutter, 2019). This benchmark contains metadata for four regression tasks. In total 62,208 different FCNN with different architecture and hyperparameters are evaluated. 100 settings per dataset are chosen at random as metaknowledge.
The following experiments are conducted in a leaveonedatasetout crossvalidation. That means when considering one dataset, all metaknowledge but the one for this particular dataset is used.
4.2 Ranking performance
First, we analyze the quality of the learning curve rankings by different learning curve prediction methods. In this experiment we choose 50 different learning curves at random as a test set. Five random learning curves are used as a training set for every repetition. Each learning curve prediction method ranks the 50 architectures by observing the partial learning curve whose length varies from 0 to 30 percent of the final learning curve. We repeat the following experiment ten times and report the mean and standard deviation of the correlation between the true and the predicted ranking in Figure 3. As a correlation metric, we use Spearman’s rank correlation coefficient. Thus, the correlation is 1 for a perfect ranking and 0 for an uncorrelated, random ranking. Our method LCRankNet shows for all datasets higher correlation. If there are no or only very short partial learning curves available, our method shows the biggest difference to the existing methods. The reason for this is a combination of the consideration of the network architecture together with additional metaknowledge. We analyze the impact of each component in detail in Section 4.5.
The method by Chandrashekaran and Lane (2017) consistently shows the second best results and in some cases can catch up to the results of our method. The method by Baker et al. (2018)
stands out due to the high standard deviation. It is by far the method with the smallest squared error on test. However, the smallest changes in the prediction lead to a significantly different ranking, which explains the high variance in their results. The method by
Domhan et al. (2015) requires a minimum length of the learning curve to make predictions. Accordingly, we observe rank correlation values starting from a learning curve length of 4%. Using the last seen value to determine the ranking of learning curves is a simple yet efficient method (Klein et al., 2017). In fact, it is able to outperform some of the more elaborate methods.4.3 Accelerating random neural architecture search
In this experiment, we demonstrate the utility of learning curve predictors in the search for network architectures. For the sake of simplicity, we accelerate a random search in the NASNet search space (Zoph et al., 2018).
The random search samples 200 models and trains each of them for 100 epochs to obtain the final accuracy. In the end, the best of all these models is returned. Now each learning curve predictor iterates over these sampled architectures in the same order and determines at every third epoch if the training should be aborted. For Successive Halving (Jamieson and Talwalkar, 2016) and Hyperband (Li et al., 2017) we follow the algorithm defined by the authors and use the recommended settings. The current best model discovered after iterating over all 200 architectures is returned as the best model for this learning curve predictor. The goal of the methods is to minimize the regret compared to a random search without early stopping and at the same time reduce the computational effort.
One of our observations is that Domhan et al. (2015)’s method underestimates performance when the learning curve is short. As a result, the method ends each training process early after only a few epochs. Therefore, we follow the recommendation by Domhan et al. (2015) and do not end processes before we have seen the first 30 epochs. We summarize the results in Table 1. Our first observation is that all methods accelerate the search with little regret. Here we define regret as the difference between the accuracy of the model found by the random search and the accuracy of the model found by one of the methods. Not surprisingly, Domhan et al. (2015)’s method takes up most of the time, as it requires significantly longer learning curves to make its decision. In addition, we can confirm the results by Baker et al. (2018); Chandrashekaran and Lane (2017), both of which report better results than Domhan et al. (2015). Our method requires the least amount of time for each dataset. For CIFAR100 we do not observe any regret, but a reduction of the time by a factor of 100. In some cases we observe an insignificantly higher regret than some of the other methods. In our opinion, the time saved makes up for it.
In Figure 4 we visualize the random search for SVHN. As you can see, many curves not only have similar behavior but also similar error. For this reason, it is difficult to decide whether to discard a model safely, which explains the increased runtime. The only methods that do not show this behavior are Successive Halving and Hyperband. The simple reason is that the number of runs terminated and the time they are discarded are fixed by the algorithm and do not rely on the properties of the learning curve. The disadvantages are obvious: promising runs may be terminated and unpromising runs may run longer than needed.
Finally, we compare our method with DARTS (Liu et al., 2019), a method that speeds up the search by parameter sharing. We train the architecture discovered in the previous experiment with an equivalent training setup like DARTS for a fair comparison. This architecture achieved a classification error of 2.99% on CIFAR10 after only 20 GPU hours were searched. In comparison, DARTS (1st order) needs 36 GPU hours for a similarly good architecture (3.00% error). DARTS (2nd order) can improve these results to 2.76%, but it takes 96 GPU hours. One of the major disadvantages of DARTS and related methods is that they cannot generalize to any architecture search space (Yu et al., 2020) and do not allow hyperparameter optimization. Our proposed method does not suffer from these problems.
4.4 LCRankNet prediction analysis
We saw in the previous selection that LCRankNet does not perform perfectly. There are cases where its regret is greater than zero or where the search time is higher which indicates that early stopping was applied later than maybe possible. In this section we try to shed some light on the decisions made by the model and try to give the reader some insight of the model’s behavior.
We provide four example decisions of LCRankNet in Figure 5. The plots in the top row show cases where LCRankNet made a correct decision, the plots in the bottom row are examples for incorrect decisions.
In both of the correct cases (top row), LCRankNet assigns a higher probability to begin with, using metaknowledge. However, in the top left case it becomes evident after only very few epochs that is consistently better than such that the probability reduces sharply to values close to zero which would correctly stop training early on. The case in the top right plot is more tricky. The probability is increasing until a learning curve length of 12% as the learning curve seem to indicate that is better than . However, the learning curve of approaches the values of and then the probability decreases and training is stopped early.
We continue now with the discussion of the two examples in the bottom row which LCRankNet erroneously terminated early, which causes in turn a regret higher than zero in Table 1. In the bottom left we show an example where sometimes is better for a longer period and sometimes . This is arguable a very difficult problem and it is hard to predict which curve will eventually be better. However, LCRankNet shows a reasonable behavior. As the learning curve of is consistently worse than in the segment from 15% to 42%, the probability is decreasing. Starting from a length of 57%, where shows superior performance than for several epochs, the probability starts to raise. From learning curve length of 70% onward, both learning curves are very close and the difference in the final accuracy is only 0.0022. The example visualized in the bottom right plot is a very interesting one. The learning curve of is consistently better than or almost equal to up to the very end. Towards the end, learning curves are getting very close. In fact, from learning curve length 63% onward, the maximum difference between and per epoch is only 0.008. Hence, in the beginning it seems like a trivial decision to reject . However, eventually turns out to be better than .
In conclusion, deciding whether one model will be better than another one based on a partial learning curve is a challenging task. A model that turns out to be (slightly) better than another one can be dominated consistently for most of the learning curve. This makes it a very challenging problem for automated methods and human experts.
4.5 Analysis of LCRankNet’s components
We briefly mentioned before which components of our learning curve ranker have an essential influence on its quality. We would like to deepen the analysis at this point and compare the configuration we have proposed with different variants in Figure 6. We consider variants with and without metadata, architecture description or learning curve consideration. In addition, we compare our configuration trained with pairwise ranking loss to one trained with a pointwise ranking loss.
One of the most striking observations is that the metadata is essential for the model. This is not surprising since in particular the learning of architecture embedding needs sufficient data. Sufficient data is not available in this setup, so we observe a much higher variance for these variants. Even the variant that only considers the learning curve benefits from additional metaknowledge. But even this is not surprising, since stagnating learning processes show similar behavior regardless of the dataset. Using the metaknowledge, both components achieve good results on their own. It can be clearly seen that these components are orthogonal to one another. The variant, which only considers the architecture, shows very good results for short learning curves. If only the learning curve is considered, the results are not very good at first, but improve significantly with the length of the learning curve. A combination of both methods ultimately leads to our method and further improves the results. Finally, we compare the use of a pointwise ranking loss (L2 loss) versus a pairwise ranking loss. Although our assumption was that the latter would have to be fundamentally better, since it optimizes the model parameters directly for the task at hand, in practice this does not necessarily seem to be the case. Especially for short learning curves, the simpler method achieves better results. However, once the learning curve is long enough, the pairwise ranking loss pays off.
4.6 Accelerating hyperparameter optimization
LCRankNet is not limited to NAS, but can be used for hyperparameter optimization methods as well. We demonstrate this by combining LCRankNet with TPE (Bergstra et al., 2011), Regularized Evolution (RE) (Real et al., 2019)
), and Reinforcement Learning (RL)
(Zoph and Le, 2017) to optimize the hyperparameters of an FCNN using a tabular data benchmark (Klein and Hutter, 2019). The objective of this benchmark is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. All experiments are repeated ten times. Each optimization method can evaluate up to 100 different configurations and chooses the best one based on its validation MSE. We report results with respect to test regret following Klein and Hutter (2019). This number indicates the difference between the test MSE actually achieved and the best possible test MSE. In our experiments, we observed that early termination generally improves the method (Figure 7, Appendix). All other results are listed in the Appendix.5 Conclusion
In this paper we present LCRankNet, a method to automatically terminate unpromising model configurations early. The two main novelties of the underlying model are that it is able to consider learning curves from other datasets and that it uses a pairwise ranking loss. The former allows to predict for relatively short, and in extreme cases even without, learning curves. The latter directly allows to model the probability that one configuration is better than the another. We analyze our method on five different datasets against three alternatives. In an experiment to optimize network architectures, we obtain the fastest results. In the best case, LCRankNet is 100 times faster without sacrificing accuracy. We also examine the components and predictions of our method to give the reader a better understanding of the design choices and functionalities. Finally, we demonstrate its use in combination with various hyperparameter optimizers.
References
 Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.1.
 Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Workshop Track Proceedings, External Links: Link Cited by: §1, §2, §2, §4.2, §4.3, Table 1.
 Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 1214 December 2011, Granada, Spain, pp. 2546–2554. External Links: Link Cited by: 1st item, §4.6.
 Learning to rank using gradient descent. In Machine Learning, Proceedings of the TwentySecond International Conference (ICML 2005), Bonn, Germany, August 711, 2005, pp. 89–96. External Links: Link, Document Cited by: §2, §3.
 Speeding up hyperparameter optimization by extrapolation of learning curves using previous builds. In Machine Learning and Knowledge Discovery in Databases  European Conference, ECML PKDD 2017, Skopje, Macedonia, September 1822, 2017, Proceedings, Part I, Lecture Notes in Computer Science, Vol. 10534, pp. 477–492. Cited by: §1, §2, §2, §4.2, §4.3, Table 1.
 Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552. External Links: Link, 1708.04552 Cited by: §4.1.

Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves.
In
Proceedings of the TwentyFourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 2531, 2015
, pp. 3460–3468. Cited by: §1, §2, §2, §4.2, §4.3, Table 1.  BOHB: robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, pp. 1436–1445. External Links: Link Cited by: §2.
 Nonstochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 911, 2016, pp. 240–248. External Links: Link Cited by: §2, §4.3, Table 1.
 Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, External Links: Link Cited by: §3.1.
 Learning curve prediction with bayesian neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §2, §2, §4.2.
 Tabular benchmarks for joint architecture and hyperparameter optimization. CoRR abs/1905.04970. External Links: Link, 1905.04970 Cited by: §B.1, Appendix B, Appendix C, §3.1, §4.1, §4.6.
 Hyperband: A novel banditbased approach to hyperparameter optimization. J. Mach. Learn. Res. 18, pp. 185:1–185:52. External Links: Link Cited by: §2, §4.3, Table 1.
 DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 69, 2019, External Links: Link Cited by: §2, §4.3.
 Learning to rank for information retrieval. Springer. External Links: Link, Document, ISBN 9783642142666 Cited by: §2.
 SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: §4.1.
 Neural architecture optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 38 December 2018, Montréal, Canada, pp. 7827–7838. External Links: Link Cited by: Appendix C.
 Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, pp. 4092–4101. External Links: Link Cited by: §2.

Regularized evolution for image classifier architecture search
. In The ThirtyThird AAAI Conference on Artificial Intelligence, AAAI 2019, The ThirtyFirst Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27  February 1, 2019, pp. 4780–4789. External Links: Link, Document Cited by: 1st item, §4.6.  Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 813 2014, Montreal, Quebec, Canada, pp. 3104–3112. External Links: Link Cited by: §3.1.
 A survey on neural architecture search. CoRR abs/1905.01392. External Links: Link, 1905.01392 Cited by: §2.
 Evaluating the search phase of neural architecture search. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2630, 2020, Conference Track Proceedings, External Links: Link Cited by: §2, §4.3.
 Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: 1st item, §4.6.

Learning transferable architectures for scalable image recognition.
In
2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 1822, 2018
, pp. 8697–8710. External Links: Link, Document Cited by: Appendix C, §3.1, §4.1, §4.3.
Appendix A Ablation study: learning curve component
One of the most important design decisions of LCRankNet was the learning curve component. We experimented with several models that would generate the best learning curve embedding. In the following discussion of different models, we assume that the learning curve represents the validation accuracy over time and thus higher values are better.
To begin with, passing the entire learning curve directly to be concatenated with architecture embedding would result in large number of predictors thereby overfitting. It was also clear that the best or last value of the learning curve alone (assuming that the learning curve is constantly improving, i.e. it increases monotonically, it is the same) is a very good predictor. After all, it is the only one used by methods like Hyperband and Successive Halving. In our example, this would be achieved through a simple global max pooling layer. Any further information regarding the development of the learning curve (improvement since epoch 1, 1st and 2nd order gradients, etc.) would, however, be disregarded. Feature engineering would be one way to create such features but convolutions allow to automatically learn which of these features are helpful. In Figure 8
we take two different models with max pooling layer into account. The version with global max pooling layer reduces the number of predictors to one per filter, while the version with strides reduces the number to 4 per filter. In our experiments, we did not notice a big difference between these two versions, but a significant improvement over the version that only uses the best value.
Furthermore, we tried in vain to apply LSTMs to this problem. But both, LSTMs directly using the learning curve and using the output of the convolution did not achieve better results than if the learning curve had not been taken into account at all (”architecture only”).
Appendix B Joint architecture and hyperparameter optimization
As briefly discussed in the main paper, we want to show that LCRankNet

can be applied to different search spaces (convolutional vs. fully connected neural networks) with and without hyperparameters, and

work with different machine learning tasks (classification vs. regression).
For this reason we are carrying out an additional experiment on a publicly available tabular regression benchmark (Klein and Hutter, 2019)
. This benchmark was created by performing a grid search with a total of 62,208 different architecture and hyperparameter settings for four different tabular regression datasets: slice localization, protein structure, parkinsons telemonitoring, and naval propulsion. Each dataset is split into 60% train, 20% validation, and 20% test. Each architecture has two fully connected layers. Settings vary with respect to the initial learning rate (0.0005, 0.001, 0.005, 0.01, 0.05, 0.1), the batch size (8, 16, 32, 64), the learning rate schedule (cosine or fixed), the activation per layer (ReLU or tanh), the number of units per layer (16, 32, 64, 128, 256, 512), and dropout per layer (0.0, 0.3, 0.6).
We are again following the leaveonedatasetout crossvalidation protocol: We optimize the architecture and hyperparameter setting for one dataset and transfer the knowledge from the others. 100 settings per dataset are randomly selected as additional data for LCRankNet.
b.1 Accelerating hyperparameter optimization
The goal is to minimize the mean squared error (MSE) by choosing the right architecture and hyperparameter settings. In this experiment we compare the optimization methods TPE, RE and RL with a version with early termination using LCRankNet. All experiments are repeated ten times. Each optimization method can take up to 100 different settings into account. We report results with respect to test regret following Klein and Hutter (2019). This number indicates the difference between test MSE actually achieved and best possible test MSE. Finding the best setting will result in a regret of 0. The setting chosen by an optimization method is based on the validation MSE. It is therefore possible that the test regret may increase again. This can be seen as an overfitting on the validation set. In Figure 10 to 12 we report the results up to the search time required for the shortest of all repetitions. We find that early termination generally improves the method. Only in one case are they not better, but not worse either.
b.2 Technical details
In contrast to random search, Successive Halving or Hyperband, the intermediate performance is not sufficient for other optimization methods and the final performance of a model is required. To take this into account, a simple change to LCRankNet is required. An additional output layer is added that predicts the final performance in additional to the ranking score. We continue to use the ranking score only to decide whether to terminate a run early or not. The predicted final performance is only used in the event of early termination and is only used to provide feedback to the optimization method. If a run is not terminated early, the actual final performance is used. To ensure that the predicted final performance is in a reasonable range, we define a lower and upper bound. We are now explaining them in the context of MSE so lower values are better and vice versa. The lower bound is defined by the mean MSE observed for previous runs. The motivation for this decision is that each terminated run should be below average. We define the best observed MSE of the partial learning curve as an upper bound. If the upper and lower bounds conflict, we prefer the upper bound because it is based directly on observed data.
Appendix C Architecture representation
In this work we consider two different search spaces. The NASNet search space that takes CNNs (Zoph et al., 2018) into account and a search space for FCNNs (Klein and Hutter, 2019). In the NASNet search space, an architecture is completely described by two cells, each consisting of several blocks. The entire architecture is defined by selecting the design of all blocks. A block is designed by selecting two operations and their corresponding inputs and how these two operations are combined. A sample block is shown in Figure 9. We follow Luo et al. (2018) and model every combination of input and operation through three embeddings. The first embedding specifies the input choice, the second the type of operation (convolution, maximum pooling, etc.), and the third the kernel size. In this way, an architecture with two cells, each with five blocks, is clearly described by a sequence of 60 decisions.
We describe the FCNNs in a very similar way. For each layer we learn an embedding for the activation function, the dropout rate and the number of units. The batch size, the learning rate (after log transformation) and the learning rate schedule (after onehot encoding) are taken into account as other hyperparameters.
Comments
There are no comments yet.