1 Introduction
At present, significant human expertise and labor are required to design high-performing neural network architectures and successfully train them for different applications. Ongoing research in two areas, metamodeling and hyperparameter optimization, attempts to reduce the amount of human intervention required for these tasks. Hyperparameter optimization methods (e.g., Hutter et al. (2011); Snoek et al. (2015); Li et al. (2017)) focus primarily on obtaining good optimization hyperparameter configurations for training human-designed networks, whereas metamodeling algorithms (Bergstra et al., 2013; Verbancsics & Harguess, 2013; Baker et al., 2017; Zoph & Le, 2017) aim to design neural network architectures from scratch. Both sets of algorithms require training a large number of neural network configurations to identify the right set of hyperparameters or the right network architecture, and are hence computationally expensive.
When sampling many different model configurations, it is likely that many subpar configurations will be explored. Human experts are quite adept at recognizing and terminating suboptimal model configurations by inspecting their partial learning curves. In this paper we seek to emulate this behavior and automatically identify and terminate subpar model configurations in order to speed up both metamodeling and hyperparameter optimization methods. Our method parameterizes learning curve trajectories with simple features derived from model architectures, training hyperparameters, and early time-series measurements from the learning curve. We use these features to train a set of frequentist regression models that predict the final validation accuracy of partially trained neural network configurations, using a small training set of fully trained curves from both image classification and language modeling domains. We use these predictions, together with uncertainty estimates obtained from small model ensembles, to construct a simple early stopping algorithm that can speed up both metamodeling and hyperparameter optimization methods.
While there is some prior work on neural network performance prediction using Bayesian methods (Domhan et al., 2015; Klein et al., 2017), our proposed method is significantly more accurate, accessible, and efficient. We hope that our work leads to the inclusion of neural network performance prediction and early stopping in practical neural network training pipelines.
2 Related Work
Neural Network Performance Prediction: There has been limited work on predicting neural network performance during the training process. Domhan et al. (2015) introduce a weighted probabilistic model for learning curves and use this model to speed up hyperparameter search in small convolutional neural networks (CNNs) and fully-connected networks (FCNs). Building on Domhan et al. (2015), Klein et al. (2017) train Bayesian neural networks to predict unobserved learning curves using a training set of fully and partially observed learning curves. Both methods rely on expensive Markov chain Monte Carlo (MCMC) sampling procedures and hand-crafted learning curve basis functions. We also note that Swersky et al. (2014) develop a Gaussian process kernel for predicting individual learning curves, which they use to automatically stop and restart configurations.

Metamodeling:
We define metamodeling as an algorithmic approach for designing neural network architectures from scratch. The earliest metamodeling approaches were based on genetic algorithms
(Schaffer et al., 1992; Stanley & Miikkulainen, 2002; Verbancsics & Harguess, 2013) or Bayesian optimization (Bergstra et al., 2013; Shahriari et al., 2016). More recently, reinforcement learning methods have become popular. Baker et al. (2017) use Q-learning to design competitive CNNs for image classification. Zoph & Le (2017) use policy gradients to design state-of-the-art CNN and recurrent cell architectures. Several methods for architecture search (Cortes et al., 2017; Negrinho & Gordon, 2017; Zoph et al., 2017; Brock et al., 2017; Suganuma et al., 2017) have been proposed this year, since the publication of Baker et al. (2017) and Zoph & Le (2017).

Hyperparameter Optimization: We define hyperparameter optimization as an algorithmic approach for finding optimal values of design-independent hyperparameters such as learning rate and batch size, along with a limited search through the network design space. Bayesian hyperparameter optimization methods include those based on sequential model-based optimization (SMAC) (Hutter et al., 2011), Gaussian processes (GP) (Snoek et al., 2012), TPE (Bergstra et al., 2013), and neural networks (Snoek et al., 2015). However, random search or grid search is most commonly used in practical settings (Bergstra & Bengio, 2012). Recently, Li et al. (2017) introduced Hyperband, an efficient multi-armed bandit-based random search technique that outperforms state-of-the-art Bayesian optimization methods.
3 Neural Network Performance Prediction
We first describe our model for neural network performance prediction, followed by a description of the datasets used to evaluate our model, and finally present experimental results.
3.1 Modeling Learning Curves
Our goal is to model the validation accuracy of a neural network configuration x at epoch T using previous performance observations y_1, ..., y_t. For each configuration x, trained for T epochs, we record a time-series y = (y_1, ..., y_T) of validation accuracies, where y_T is the final accuracy. We train a population of n configurations, obtaining a set S = {(x_1, y(x_1)), ..., (x_n, y(x_n))}. Note that this problem formulation is very similar to that of Klein et al. (2017).

We propose to use a set of features u_x, derived from the neural network configuration x, along with a subset of time-series accuracies y_1, ..., y_t (where t < T) from y, to train a regression model for estimating y_T. Our model predicts y_T of a neural network configuration x using the feature set (u_x, y_1, ..., y_t). For clarity, we train T − 1 separate regression models, where each successive model uses one more point of the time-series validation data. As we shall see in subsequent sections, this use of sequential regression models (SRMs) is both more computationally efficient and more precise than methods that train a single Bayesian model.
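As a concrete illustration, the SRM training loop can be sketched with scikit-learn's SVR on synthetic saturating learning curves; the curve generator, all constants, and the 25% observation point below are stand-ins for illustration, not the paper's data or settings:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data: 100 configurations, each a saturating learning
# curve of T = 20 validation accuracies (the paper uses real training runs).
T = 20
n_train = 100
rates = rng.uniform(0.2, 1.0, size=n_train)
ceilings = rng.uniform(0.5, 0.95, size=n_train)
epochs = np.arange(1, T + 1)
curves = ceilings[:, None] * (1 - np.exp(-rates[:, None] * epochs))

# One regressor per prefix length t: features are y_1..y_t, target is y_T.
models = []
for t in range(1, T):
    m = SVR(kernel="rbf", C=1.0)
    m.fit(curves[:, :t], curves[:, -1])
    models.append(m)

# Predict the final accuracy of a new configuration from 25% of its curve.
new_curve = 0.9 * (1 - np.exp(-0.5 * epochs))
t = T // 4
pred = models[t - 1].predict(new_curve[:t].reshape(1, -1))[0]
print(f"predicted y_T from {t} epochs: {pred:.3f} (true {new_curve[-1]:.3f})")
```

In practice one would also append the architecture and hyperparameter features described below to each row before fitting.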
Features: We use features based on time-series (TS) validation accuracies, architecture parameters (AP), and hyperparameters (HP). (1) TS: These include the validation accuracies y_1, ..., y_t (where t < T), the first-order differences of validation accuracies (i.e., y'_t = y_t − y_{t−1}), and the second-order differences of validation accuracies (i.e., y''_t = y'_t − y'_{t−1}). (2) AP: These include the total number of weights and the number of layers. (3) HP: These include all hyperparameters used for training the neural networks, e.g., initial learning rate and learning rate decay (full list in Appendix Table 2).
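The TS block of this feature vector is straightforward to assemble; the sketch below uses illustrative AP and HP values (number of weights and layers, initial learning rate and decay) rather than the paper's full list, and the helper name is ours:

```python
import numpy as np

def curve_features(val_acc, arch_params, hyper_params):
    """Build a TS + AP + HP feature vector for one partial learning curve.

    val_acc: validation accuracies y_1..y_t observed so far.
    arch_params / hyper_params: e.g. [n_weights, n_layers] and
    [initial_lr, lr_decay] (illustrative choices, not the paper's full list).
    """
    y = np.asarray(val_acc, dtype=float)
    d1 = np.diff(y)        # first-order differences y'_t = y_t - y_{t-1}
    d2 = np.diff(y, n=2)   # second-order differences y''_t = y'_t - y'_{t-1}
    return np.concatenate([y, d1, d2, arch_params, hyper_params])

feats = curve_features([0.40, 0.55, 0.63, 0.66],
                       arch_params=[1.2e6, 12],
                       hyper_params=[0.1, 0.9])
print(feats.shape)  # 4 + 3 + 2 + 2 + 2 = 13 features
```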
3.2 Datasets and Training Procedures
We experiment with small and very deep CNNs (e.g., ResNet, CudaConvnet) trained on image classification datasets, and with LSTMs trained on Penn Treebank (PTB), a language modeling dataset. Figure 1 shows example learning curves from three of the datasets considered in our experiments. We provide a brief summary of the datasets below. Please see Appendix Section A for further details on the search space, preprocessing, hyperparameters, and training settings of all datasets.
Datasets with Varying Architectures:
Deep Resnets (TinyImageNet): We sample 500 ResNet architectures and train them on the TinyImageNet dataset (https://tinyimagenet.herokuapp.com/), containing 200 classes with 500 training images of 64×64 pixels each, for 140 epochs. We vary depths, filter sizes, and the number of convolutional filter block outputs. The network depths vary between 14 and 110.
Deep Resnets (CIFAR10): We sample 500 39layer ResNet architectures from a search space similar to Zoph & Le (2017), varying kernel width, kernel height, and number of kernels. We train these models for 50 epochs on CIFAR10.
MetaQNN CNNs (CIFAR10 and SVHN): We sample 1,000 model architectures from the search space detailed by Baker et al. (2017), which allows for varying the numbers and orderings of convolution, pooling, and fully connected layers. The models are between 1 and 12 layers for the SVHN experiment and between 1 and 18 layers for the CIFAR10 experiment. Each architecture is trained on SVHN and CIFAR10 datasets for 20 epochs.
LSTM (PTB): We sample 300 LSTM models and train them on the Penn Treebank dataset for 60 epochs, evaluating perplexity on the validation set. We vary the number of LSTM cells and hidden layer inputs between 10 and 1400.
Datasets with Varying Hyperparameters:
CudaConvnet (CIFAR10 and SVHN): We train the CudaConvnet architecture (Krizhevsky, 2012) with varying values of initial learning rate, learning rate reduction step size, weight decay for convolutional and fully connected layers, and scale and power of local response normalization layers. We train models on CIFAR10 for 60 epochs and on SVHN for 12 epochs.
3.3 Prediction Performance
Choice of Regression Method: We now describe our results for predicting final neural network performance. For all experiments, we train our SRMs on 100 randomly sampled neural network configurations. We obtain the best performing method using random hyperparameter search with 3-fold cross-validation. We then compute the regression performance over the remainder of the dataset using the coefficient of determination, R². We repeat each experiment 10 times and report the results with standard errors. We experiment with a few different frequentist regression models, including ordinary least squares (OLS), random forests, and support vector machine regression (SVR). As seen in Table 1, SVR with a linear or RBF kernel performs best on most datasets, though not by a large margin. For the rest of this paper, we use SVR (RBF) unless otherwise specified.

Table 1: R² of SVR (RBF), SVR (Linear), Random Forest, and OLS on the MetaQNN (CIFAR10), Resnet (TinyImageNet), and LSTM (Penn Treebank) datasets.
Ablation Study on Feature Sets: In Table 2, we compare the predictive ability of different feature sets, training SVR (RBF) with time-series (TS) features obtained from 25% of the learning curve, along with architecture parameter (AP) and hyperparameter (HP) features. TS features explain the largest fraction of the variance in all cases. For datasets with varying architectures, AP features are more important than HP features; for hyperparameter search datasets, HP features are more important than AP features, as expected. AP features almost match TS features on the Resnet (TinyImageNet) dataset, indicating that the choice of architecture has a large influence on accuracy for ResNets. Figure 2 shows the true vs. predicted performance for all test points in three datasets, trained with TS, AP, and HP features.

Generalization Between Depths: We also test whether SRMs can accurately predict the performance of out-of-distribution neural networks. In particular, we train SVR (RBF) with 25% of the TS features, along with AP and HP features, on the Resnet (TinyImagenet) dataset, using 100 models whose number of layers is below a threshold, and test on models whose number of layers is above that threshold, averaging over 10 runs. The threshold varies from 14 to 110.
Table 2: R² of SVR (RBF) trained with each feature subset (TS, AP, HP, TS+AP, AP+HP, and TS+AP+HP) on the MetaQNN (CIFAR10), ResNets (TinyImageNet), LSTM (Penn Treebank), and CudaConvnet (CIFAR10) datasets.
3.3.1 Comparison with Existing Methods
We now compare the neural network performance prediction ability of SRMs with three existing learning curve prediction methods: (1) the Bayesian neural network (BNN) (Klein et al., 2017), (2) the learning curve extrapolation (LCE) method (Domhan et al., 2015), and (3) the last seen value (LastSeenValue) heuristic (Li et al., 2017). When training the BNN, we present it not only with the subset of fully observed learning curves but also with all other partially observed learning curves from the training set. While we do not present the partially observed curves to the SVR SRM for training, we consider this a fair comparison because the SVR uses the entire partially observed learning curve during inference. Methods (2) and (3) do not incorporate prior learning curves during training. Figure 3 shows the R² obtained by each method for predicting the final performance versus the percent of the learning curve used for training the model. We see that in all neural network configuration spaces and across all datasets, either one or both SRMs outperform the competing methods. The LastSeenValue heuristic only becomes viable when the configurations are near convergence, and its performance is worse than an SRM's for very deep models. We also find that the SRMs outperform the LCE method in all experiments, even after we remove a few extreme prediction outliers produced by LCE. Finally, while the BNN outperforms the LastSeenValue and LCE methods when only a few iterations have been observed, it does worse than our proposed method. In summary, we show that our simple, frequentist SRMs outperform existing Bayesian approaches at predicting neural network performance for modern, very deep models in computer vision and language modeling tasks.
Since most of our experiments perform stepwise learning rate decay, it is conceivable that the performance gap between SRMs and both LCE and BNN results from the lack of sharp jumps in their basis functions. We therefore experimented with exponential learning rate decay (ELRD), which the basis functions in LCE are designed for. We trained 630 random nets with ELRD, drawn from the 1,000 MetaQNN CIFAR10 nets. Predicting from 25% of the learning curve, the R² is 0.95 for SVR (RBF), 0.48 for LCE (with extreme outlier removal, and negative without), and 0.31 for BNN. This comparison illuminates another benefit of our method: we do not require hand-crafted basis functions to model new learning curve types.
Training and Inference Speed Comparison: Another advantage of our regression approach is speed. SRMs are much faster to train and run inference with than previously proposed Bayesian methods (Domhan et al., 2015; Klein et al., 2017). On one core of an Intel 6700k CPU, an SVR (RBF) with 100 training points trains in 0.006 seconds, and each inference takes 0.00006 seconds. In comparison, the LCE code takes 60 seconds and the BNN code takes 0.024 seconds on the same hardware for each inference.
4 Applying Performance Prediction For Early Stopping
To speed up hyperparameter optimization and metamodeling methods, we develop an algorithm that uses our sequential regression models to determine whether to continue training a partially trained model configuration. If we would like to sample N total neural network configurations, we begin by sampling and training n < N configurations to create a training set S. We then train a model f to predict y_T. Now, given the current best observed performance y_best, we would like to terminate training a new configuration x, given its partial learning curve y_1, ..., y_t, if the prediction ŷ_T = f(x, y_1, ..., y_t) satisfies ŷ_T < y_best, so as not to waste computational resources exploring a suboptimal configuration.

However, in case f has poor out-of-sample generalization, we may mistakenly terminate the optimal configuration. If we assume that our estimate ŷ_T can be modeled as a Gaussian perturbation of the true value y_T, then we can compute the probability p(y_T ≥ y_best) = 1 − Φ((y_best − ŷ_T)/σ), where Φ is the CDF of the standard normal distribution. Note that in general the uncertainty σ will depend on both the configuration x and t, the number of points observed from the learning curve. Because frequentist models do not admit a natural estimate of uncertainty, we assume that σ is independent of x yet still dependent on t, and estimate it via leave-one-out cross-validation.

Now that we can estimate the model uncertainty, given a new configuration x and an observed partial learning curve y_1, ..., y_t, we may set our termination criterion to be p(y_T ≥ y_best) ≤ Δ. Δ balances the tradeoff between increased speedups and the risk of prematurely terminating good configurations. In many cases, one may want several configurations that are close to optimal, for the purpose of ensembling. We offer two modifications for this case. First, one may relax the termination criterion to p(y_T ≥ y_best − ε) ≤ Δ, which allows configurations within ε of optimal performance to complete training. One can alternatively set the criterion based on the k-th best configuration observed, guaranteeing that with high probability the top k configurations will be fully trained.
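Under the Gaussian assumption, the termination rule reduces to a normal-CDF test. A minimal sketch follows; the function names, the 1-nearest-neighbour stand-in predictor used to demonstrate the leave-one-out estimate, and all numbers are ours for illustration:

```python
import numpy as np
from math import erf, sqrt

def loocv_sigma(fit_predict, X, y):
    """Estimate the prediction-error standard deviation by leave-one-out
    cross-validation. fit_predict(X_train, y_train, x_test) returns one
    prediction for the held-out point."""
    errs = [fit_predict(np.delete(X, i, axis=0), np.delete(y, i), X[i]) - y[i]
            for i in range(len(y))]
    return float(np.std(errs))

def p_better(y_hat, y_best, sigma, epsilon=0.0):
    """P(final accuracy >= y_best - epsilon) under a Gaussian model
    centered at the prediction y_hat with std dev sigma."""
    z = (y_best - epsilon - y_hat) / sigma
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

def should_terminate(y_hat, y_best, sigma, delta=0.05, epsilon=0.0):
    # Terminate when the configuration is confidently below the best.
    return p_better(y_hat, y_best, sigma, epsilon) <= delta

# Illustrative sigma estimate with a 1-nearest-neighbour stand-in predictor.
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 3))
y = X.sum(axis=1) + 0.01 * rng.normal(size=30)
sigma = loocv_sigma(
    lambda Xt, yt, x: yt[np.argmin(((Xt - x) ** 2).sum(axis=1))], X, y)

print(should_terminate(0.70, y_best=0.92, sigma=0.05))  # clearly subpar
print(should_terminate(0.91, y_best=0.92, sigma=0.05))  # too close to call
```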
4.1 Early Stopping for Metamodeling
Baker et al. (2017) train a learning agent to design convolutional neural networks. In this method, the agent samples architectures from a large, finite space by traversing a path from the input layer to a termination layer. However, the MetaQNN method uses 100 GPU-days to train 2,700 neural architectures, and the similar experiment by Zoph & Le (2017) utilized 10,000 GPU-days to train 12,800 models on CIFAR10. The amount of computing resources required by these approaches makes them prohibitively expensive for large datasets (e.g., Imagenet) and larger search spaces. The main computational expense of reinforcement learning-based metamodeling methods is training each neural network configuration to T epochs (where T is typically a large number at which the network stabilizes to peak accuracy).

We now detail the performance of an SVR (RBF) SRM in speeding up architecture search using sequential configuration selection. First, we take 1,000 random models from the MetaQNN (Baker et al., 2017) search space. We simulate the MetaQNN algorithm by taking 10 random orderings of each set and running our early stopping algorithm. We compare against the LCE early stopping algorithm (Domhan et al., 2015) as a baseline, which has a similar probability threshold termination criterion. Our SRM trains on the first 100 fully observed curves, while the LCE model trains on each individual partial curve and can begin early termination immediately. Despite this "burn in" time needed by an SRM, it is still able to significantly outperform the LCE model (Figure 4). In addition, fitting the LCE model to a learning curve takes between 1 and 3 minutes on a modern CPU due to expensive MCMC sampling, and it is necessary to fit a new LCE model each time a new point on the learning curve is observed. Therefore, on a full metamodeling experiment involving thousands of neural network configurations, our method could be faster than LCE by several orders of magnitude based on current implementations.
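The simulation protocol above can be sketched as follows. The exponential curve family and the closed-form stand-in predictor are ours (the paper instead trains an SVR on the burn-in curves), so the savings printed are only illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy simulation of sequential configuration selection: 300 synthetic
# learning curves over T = 20 epochs; the first 100 are trained fully
# ("burn-in"), and later ones are stopped once the predicted final
# accuracy is confidently below the best accuracy seen so far.
T, n_burn, n_total = 20, 100, 300
ceilings = rng.uniform(0.5, 0.95, n_total)
epochs = np.arange(1, T + 1)
curves = ceilings[:, None] * (1 - np.exp(-0.4 * epochs))

def predict_final(partial, horizon=T):
    # Stand-in predictor: extrapolate the known exponential form.
    c = partial[-1] / (1 - np.exp(-0.4 * len(partial)))
    return c * (1 - np.exp(-0.4 * horizon))

sigma, y_best, epochs_used = 0.03, 0.0, 0
for i in range(n_burn):                      # burn-in: train fully
    epochs_used += T
    y_best = max(y_best, curves[i, -1])

for i in range(n_burn, n_total):
    stopped = False
    for t in range(1, T + 1):
        epochs_used += 1
        if t >= 4 and predict_final(curves[i, :t]) + 2 * sigma < y_best:
            stopped = True                   # confidently subpar: terminate
            break
    if not stopped:
        y_best = max(y_best, curves[i, -1])

print(f"epochs used: {epochs_used} / {n_total * T}, best = {y_best:.3f}")
```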
We furthermore simulate early stopping for ResNets trained on CIFAR10. We found that only the most conservative probability threshold resulted in recovering the top model consistently. However, even with such a conservative threshold, the search was sped up by a factor of 3.4 over the baseline. While we do not have the computational resources to run the full experiment from Zoph & Le (2017), our method could provide similar gains in large scale architecture searches.
It is not enough, however, to simply simulate the speedup, because metamodeling algorithms typically use the observed performance to update an acquisition function that informs future sampling. In the reinforcement learning setting, the performance is given to the agent as a reward, so we also empirically verify that substituting the predicted final accuracy ŷ_T for the observed accuracy y_T does not cause the MetaQNN agent to converge to a subpar policy. Replicating the MetaQNN experiment on CIFAR10 (see Figure 5), we find that integrating early stopping with the learning procedure does not disrupt learning, and it results in a speedup of 3.8x. The speedup is relatively modest because the probability threshold was set conservatively. After training the top models to 300 epochs, we also find that the resulting performance (just under 93%) is on par with the original results of Baker et al. (2017).
4.2 Early Stopping for Hyperparameter Optimization
Recently, Li et al. (2017) introduced Hyperband, a random search technique based on multi-armed bandits that obtains state-of-the-art performance in hyperparameter optimization in a variety of settings. The Hyperband algorithm trains a population of models with different hyperparameter configurations and iteratively discards models below a certain performance percentile within the population, until the computational budget is exhausted or satisfactory results are obtained.
4.2.1 Fast Hyperband
We present a Fast Hyperband (fHyperband) algorithm based on our early stopping scheme. During each iteration of successive halving, Hyperband trains a set of configurations for a given number of epochs. In fHyperband, we train an SRM to predict performance at the end of each such iteration and use it to perform early stopping within the iteration. We initialize fHyperband in exactly the same way as vanilla Hyperband, except that once we have trained 100 models to a given number of iterations, we begin early stopping in all future successive halving iterations that train to that number of iterations. By doing this, we incur no initial slowdown relative to Hyperband due to a "burn-in" phase. We also introduce a parameter that denotes the proportion of the models in each iteration that must be trained for the full number of iterations. This is similar to setting the criterion based on the k-th best model in the previous section. See Appendix Section C for an algorithmic description of fHyperband.
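For context, one bracket of Hyperband's successive halving (the loop into which fHyperband's per-iteration early stopping is inserted) can be sketched as below; the synthetic train function, the constants, and the hook comment are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def train(ceiling, r):
    # Synthetic stand-in: accuracy after r epochs of a configuration
    # whose learning curve saturates at `ceiling`.
    return ceiling * (1 - np.exp(-0.3 * r))

def successive_halving(configs, R=27, eta=3):
    """One Hyperband bracket: evaluate all configs on a small budget,
    keep the top 1/eta, multiply the budget by eta, and repeat."""
    survivors = list(configs)
    r = max(1, R // eta ** 2)
    while r <= R and len(survivors) > 1:
        scores = [train(c, r) for c in survivors]
        # fHyperband hook: an SRM trained on earlier rungs could terminate
        # configurations here before they consume the full budget r.
        order = np.argsort(scores)[::-1]
        survivors = [survivors[i] for i in order[:max(1, len(survivors) // eta)]]
        r *= eta
    return survivors[0]

ceilings = rng.uniform(0.5, 0.95, 27)
best = successive_halving(ceilings)
print(f"selected configuration ceiling: {best:.3f}")
```

Because the synthetic curves are monotone in their ceiling, the bracket always promotes the configuration with the highest asymptote.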
We empirically evaluate fHyperband using CudaConvnet trained on the CIFAR10 and SVHN datasets. Figure 6 shows that fHyperband evaluates the same number of unique configurations as Hyperband in half the compute time, while achieving the same final accuracy within standard error. When reinitializing hyperparameter searches, one can reuse a previously trained set of SRMs to achieve even larger speedups. Figure 8 in the Appendix shows that one can achieve up to a 7x speedup in such cases.
5 Conclusion
In this paper we introduce a simple, fast, and accurate model for predicting future neural network performance using features derived from network architectures, hyperparameters, and time-series performance data. We show that the performance of drastically different network architectures can be jointly learned and predicted, in both image classification and language modeling settings. Using our simple algorithm, we can speed up hyperparameter search techniques with complex acquisition functions, such as a learning agent, by a factor of 3x to 6x, and Hyperband, a state-of-the-art hyperparameter search method, by a factor of 2x, without disturbing the search procedure. We outperform all competing methods for performance prediction in terms of accuracy, training and test time, and speedups obtained on hyperparameter search methods. We hope that the simplicity and success of our method will allow it to be easily incorporated into current hyperparameter optimization pipelines for deep neural networks. With the advent of large scale automated architecture search (Baker et al., 2017; Zoph & Le, 2017), methods such as ours will be vital for exploring even larger and more complex search spaces.
References
 Baker et al. (2017) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. International Conference on Learning Representations, 2017.
 Bergstra & Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. JMLR, 13(Feb):281–305, 2012.
 Bergstra et al. (2013) James Bergstra, Daniel Yamins, and David D Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML (1), 28:115–123, 2013.
 Brock et al. (2017) Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Smash: Oneshot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.

 Cortes et al. (2017) Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. AdaNet: Adaptive structural learning of artificial neural networks. International Conference on Machine Learning, 70:874–883, 2017.
 Domhan et al. (2015) Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. IJCAI, 2015.
 Hutter et al. (2011) Frank Hutter, Holger H Hoos, and Kevin LeytonBrown. Sequential modelbased optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.
 Klein et al. (2017) Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. International Conference on Learning Representations, 17, 2017.
 Krizhevsky (2012) Alex Krizhevsky. Cudaconvnet. https://code.google.com/p/cudaconvnet/, 2012.
 Li et al. (2017) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel banditbased approach to hyperparameter optimization. International Conference on Learning Representations, 2017.
 Negrinho & Gordon (2017) Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
 Schaffer et al. (1992) J David Schaffer, Darrell Whitley, and Larry J Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. International Workshop on Combinations of Genetic Algorithms and Neural Networks, pp. 1–37, 1992.
 Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. NIPS, pp. 2951–2959, 2012.
 Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180, 2015.
 Stanley & Miikkulainen (2002) Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
 Suganuma et al. (2017) Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. arXiv preprint arXiv:1704.00764, 2017.
 Swersky et al. (2014) Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freezethaw bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.
 Verbancsics & Harguess (2013) Phillip Verbancsics and Josh Harguess. Generative neuroevolution for deep learning. arXiv preprint arXiv:1312.5355, 2013.
 Zoph & Le (2017) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. International Conference on Learning Representations, 2017.
 Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.
Appendix
Appendix A Datasets and architectures
Deep Resnets (TinyImageNet): We sample 500 ResNet architectures and train them on the TinyImageNet dataset (https://tinyimagenet.herokuapp.com/), containing 200 classes with 500 training images of 64×64 pixels each, for 140 epochs. We vary depths, filter sizes, and the number of convolutional filter block outputs; filter sizes and the number of filters are each sampled from a fixed set. Each ResNet block is composed of three convolutional layers followed by batch normalization and summation layers. We vary the number of blocks from 2 to 18, giving networks with depths between 14 and 110. Each network is trained for 140 epochs using the Nesterov momentum optimizer, with the learning rate set to 0.1 and the learning rate reduction and momentum set to 0.1 and 0.9, respectively.
Deep Resnets (CIFAR10): We sample 500 39-layer ResNet architectures from a search space similar to that of Zoph & Le (2017), varying kernel width, kernel height, and number of kernels. Each architecture consists of 39 layers: 12 conv layers, a 2x2 max pool, 9 conv layers, a 2x2 max pool, 15 conv layers, and a softmax. Each conv layer is followed by batch normalization and a ReLU nonlinearity. Each block of 3 conv layers is densely connected via residual connections and shares the same kernel width, kernel height, and number of learnable kernels; kernel height, kernel width, and the number of kernels are each independently sampled from a fixed set. Finally, we randomly sample residual connections between each block of conv layers. Each network is trained for 50 epochs on the CIFAR10 dataset using the RMSProp optimizer, with weight decay, an initial learning rate of 0.001, and a learning rate reduction at epoch 30.

MetaQNN CNNs (CIFAR10 and SVHN): We sample 1,000 model architectures from the search space detailed by Baker et al. (2017), which allows for varying the numbers and orderings of convolution, pooling, and fully connected layers. The models are between 1 and 12 layers for the SVHN experiment and between 1 and 18 layers for the CIFAR10 experiment. Each architecture is trained on the SVHN and CIFAR10 datasets for 20 epochs. Table 3 displays the state space of the MetaQNN algorithm.
Table 3: State space of the MetaQNN algorithm, listing the layer parameters and allowed parameter values for convolution (C), pooling (P), fully connected (FC), and termination layers.
For each layer type, we list the relevant parameters and the values each parameter is allowed to take. The networks are sampled beginning from the starting layer. Convolutional layers are allowed to transition to any other layer. Pooling layers are allowed to transition to any layer other than pooling layers. Fully connected layers are only allowed to transition to fully connected or softmax layers. A convolutional or pooling layer may only go to a fully connected layer if the current image representation size is below 8. We use this space to both randomly sample and simulate the behavior of a MetaQNN run as well as directly run the MetaQNN with early stopping.
LSTM (PTB):
We sample 300 LSTM models and train them on the Penn Treebank dataset for 60 epochs. The number of hidden layer inputs and LSTM cells is varied from 10 to 1400 in steps of 20. Each network is trained for 60 epochs with a batch size of 50 using stochastic gradient descent. A dropout ratio of 0.5 is used to prevent overfitting, and a dictionary size of 400 words is used to generate embeddings when vectorizing the data.
CudaConvnet (CIFAR10 and SVHN): We train the CudaConvnet architecture (Krizhevsky, 2012) with varying values of initial learning rate, learning rate reduction step size, weight decay for convolutional and fully connected layers, and scale and power of local response normalization layers. We train models on CIFAR10 for 60 epochs and on SVHN for 12 epochs. Table 4 shows the hyperparameter ranges for the CudaConvnet experiments.
Table 4: Hyperparameter ranges for the CudaConvnet experiments: initial learning rate (log scale), number of learning rate reductions (integer, 0 to 3), and Conv1 and Conv2 weight penalties (log scale) for the CIFAR10, Imagenet, and SVHN experiments; and Conv3 and FC4 weight penalties and response normalization scale (log scale), and response normalization power (linear scale), for the CIFAR10 and SVHN experiments.
Appendix B Hyperparameter selection in Random Forest and SVM based experiments
When training the SVM and Random Forest models, we divided the data into training and validation sets and used cross-validation to select optimal hyperparameters. The SVM and RF models were then trained on the full training data using the best hyperparameters. For random forests, we varied the number of trees between 10 and 800 and the ratio of features considered from 0.1 to 0.5. For SVR, we performed a random search over 1,000 hyperparameter configurations, drawing the penalty parameter from LogUniform( , 10), the epsilon-tube width from Uniform(0, 1), and the kernel bandwidth from LogUniform( , 10) (when using the RBF kernel).
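A comparable SVR search can be written with scikit-learn's RandomizedSearchCV. The lower bounds of the log-uniform ranges are not preserved in the text above, so the 1e-5 used here is an assumption, and the toy regression data is purely illustrative:

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(3)

# Toy regression data standing in for (features, final accuracy) pairs.
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=120)

# C and gamma drawn log-uniformly (lower bound 1e-5 assumed), epsilon
# uniformly on [0, 1]; scored by 3-fold cross-validation. 50 samples are
# drawn here instead of 1,000 to keep the sketch fast.
search = RandomizedSearchCV(
    SVR(kernel="rbf"),
    param_distributions={
        "C": loguniform(1e-5, 10),
        "epsilon": uniform(0, 1),
        "gamma": loguniform(1e-5, 10),
    },
    n_iter=50,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```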
Appendix C fHyperband
Algorithm 1 of this text replicates Algorithm 1 from Li et al. (2017), except that we initialize two dictionaries: one to store training data and one to store performance prediction models. The training-data dictionary maps each prediction target epoch to the datasets used for predicting performance at that epoch from shorter, observed portions of the learning curve, and the model dictionary holds the corresponding performance prediction models. We assume that each performance prediction model has a train function and a predict function, where predict returns both the prediction and the standard deviation of the prediction. In addition to the standard Hyperband hyperparameters R and η, we include the probability threshold and performance offset for early termination described in Section 4, the number of points required to train the performance predictors, and the proportion of models to train fully. During each iteration of successive halving, we train a set of configurations for a given number of epochs; the proportion parameter denotes the fraction of the top models that must be run for the full number of iterations. This is similar to setting the criterion based on the k-th best model in the previous section.

We also detail the run_then_return_validation_loss function in Algorithm 2. This algorithm runs a set of configurations, adds training data from the observed learning curves, trains the performance prediction models once enough training data is present, and then uses the models to terminate poor configurations. It assumes we have a function max_k, which returns the k-th largest value in a list, or a default value if the list has fewer than k elements.
–  R (Max resources allocated to any configuration)  
–  η (default η = 3)  
–  Δ (Probability threshold for early termination)  
–  ε (Performance offset for early termination)  
–  N (# points required to train performance predictors)  
–  ρ (Proportion of models to train)
–  T: hyperparameter configurations  
–  r: resources to use for training  
–  k: # configurations in the next iteration of successive halving  
–  D: dictionary storing training data  
–  M: dictionary storing performance prediction models