1 Introduction
The design of an effective machine learning system typically involves making many design choices that reflect the nature of data at hand and the inferences we wish to make. The techniques at our disposal, as designers of machine learning systems, are
intuition: repeating what worked in other settings that seem to be similar (appealing to our expertise and intuition), and search: model selection by trial and error search using e.g. crossvalidation.In common practice, both intuition and search are typically carried out informally. A practitioner may design a complete system by a semiautomated search process in which smallscale searches (e.g. grid search) update the practitioner’s own implicit beliefs regarding what constitutes a good model for the task at hand. Those beliefs inform the choice of future smallscale searches in an iterative process that makes progressive improvements to the system. (We may see many empirical results published in machine learning conferences as evidence of this process unfolding on an international scale over a course of years.)
One practical problem that arises from this common practice is that algorithms which have been demonstrated to work on particular data sets are notoriously difficult to adapt to new data sets. The trouble is that the implicit beliefs of the practitioner play a crucial role in the process of model selection. This difficulty has been widely recognized by domain experts, but the status quo remains because so many experts feel that their search is sufficiently efficient, and the insights gained from the model selection process are valuable. We hope that our results, taken together with other recent work on hyperparameter optimization such as Bergstra et al. (2011); Snoek et al. (2012); Thornton et al. (2012), challenge these beliefs and induce more researchers to recognize automatic hyperparameter optimization as an important technique for model evaluation.
1.1 Hyperparameter Optimization and Ensemble Construction
While hyperparameter optimization is important, it is not the only standard performanceenhancing technique used to improve the scores of a model family on a given benchmark task. Ensemble methods such as Bagging (Breiman, 1996a), Boosting (Demiriz et al., 2002), Stacking (Wolpert, 1992; Breiman, 1996b; Sill et al., 2009), and Bayesian Model Averaging (Hoeting et al., 1999) are also commonly employed to exact every last drop of accuracy from a given set of algorithmic technology. One of the goals of automating hyperparameter optimization is to assess, in an objective way, how good a set of classification system components can be. In pursuit of that goal, the automation of ensemble creation is also critical.
This paper presents a first attempt to provide a fully automated algorithm for model selection and ensemble construction. It starts from a palette of configurable preprocessing strategies, and classification algorithms, and proceeds to creates the most accurate ensemble of optimized components that it can. Our algorithm uses a Boosting approach to ensemble construction, in which hyperparameter optimization plays the role of a base learner. This setting creates unique challenges that motivate a new Boosting algorithm, which we call SVM HyperBoost.
2 Null Model for Image Classification
Our basic approach is described in Bergstra et al. (2013a). We use Hyperopt (Bergstra et al., 2013b)
to describe a configuration space that includes onelayer, twolayer, and threelayer convolutional networks. The elements of our image classification model are standard scaling (image resolution), affine warp (rigid image deformation), filterbank normalized crosscorrelation, local spatial pooling, dihistogram spatial pooling, and an L2SVM classifier. For each layer in each architecture, hyperparameters govern the size of filters, the volume of pooling regions, constants that modulate local normalization, and so on. The filters themselves are either chosen randomly from a centered Gaussian distribution, or are random projections of PCA components of training data (as it appears as input to each layer), or are random projections of input patches (again, as input arrives to each layer). Features for classification are derived from the output layer by either signed (dihistogram) or unsigned pooling over some topographically local partitioning of output features. This configuration space was chosen to span the model space investigated by
Pinto & Cox (2011) and the randomfilter models of Coates & Ng (2011). Relative to Bergstra et al. (2013a) we add the possibility of affine warping of input images and remove the inputcropping step. In total, the configuration space includes 238 hyperparameters, although no configuration uses all 238 at once. Many of the hyperparameters are conditional hyperparameters because they are only active in certain conditions; for example, the hyperparameters governing the creation of a third layer are inactive for twolayer models. The details of this metamodel are described in Bergstra et al. (2013a) and implemented in the hyperoptconvnet software available from http://github.com/jaberg/hyperoptconvnet.Notable omissions from the model space include: backpropagation
(Rumelhart et al., 1986), unsupervised learning for filters such as RBMs
(Hinton, 2002; Hinton et al., 2006), Sparse Coding (Coates & Ng, 2011), DAAs (Vincent et al., 2008), and recent highperformance regularization strategies such as dropout (Hinton et al., 2012) and maxout (Goodfellow et al., 2013).Hyperparameter optimization within this Null model was carried out using the TPE algorithm (Bergstra & Bengio, 2012), as implemented by the hyperopt software (available from http://jaberg.github.com/hyperopt).
3 SVM HyperBoosting
This section describes an ensemble construction method (SVM HyperBoost, or just HyperBoost) that is particularly wellsuited to the use of a hyperparameter optimization algorithm as an inner loop. This algorithm is presented in the context of models which have the form of a feature extractor and a linear classifier. In this context, the ensemble is simply a larger linear function that can be seen as the concatenation of ensemble members. The HyperBoost algorithm can be understood as a piecewise training of this single giant linear classifier.
To derive the HyperBoost algorithm, suppose that we commit to using an ensemble of size . (No such commitment is necessary in practice, but it makes the development clearer.) The ideal ensemble weights and hyperparameter configuration settings for a binary classification task would optimize generalization error:
(1) 
Here stands for a joint density over inputs and labels (, ). We have used
to denote the feature vector associated to input
by hyperparameter configuration . The expression denotes the indicator function that is one for values of which are negative, but zero for values of which are not negative. We use to stand for the set of possible hyperparameter configurations, so that the argmin means “choose optimal hyperparameter configurations” (one for each ensemble member). We use the notation in the argmin to indicate that the final set of weights will be a vector, but it is not known apriori how many elements it will have. Rather, will be logically divided into pieces corresponding to ensemble elements and each piece will have a dimensionality that matches .The joint optimization of and implied by Equation 1 is challenging because of

the complicated effect of each on ,

the expectation over unknown , and

the nondifferentiable indicator function.
Our strategy for dealing with the complicated relationship between and is to select the configurations greedily, using the algorithm illustrated in Figure 1
. Our strategy for dealing with the expectation is to estimate it from what is typically called validation data, so that each argmin for
is what previous work has called hyperparameter optimization. Our strategy for dealing with the nondifferentiability of the indicator function is to use a gradientfree optimization method, namely TPE (Bergstra et al., 2011).Normally, Boosting (functional gradient methods) on a Hinge loss or ZeroOne loss would quickly run into trouble because once the training margins are pushed past the decision boundary, subsequent rounds have nothing to do (the training criterion is completely satisfied). We avoid this nonsense using two techniques. First, Boosting on sufficient validation data helps because models fit to training data are seldom perfect for validation data by random chance (incidentally, we are interested in collaborations that might shed light on exactly how much validation data is necessary). Second, each round of HyperBoost is free to scale the contribution of previous ensemble components (see in Figure 1), so standard SVM regularization techniques (i.e. ) allow us to meaningfully add features and improve even if the Hinge loss had been reduced to 0 at a previous HyperBoosting iteration. The regularization parameter governing the entire SVM is a hyperparameter that is reoptimized on every round of HyperBoost. This technique makes HyperBoost a partially corrective Boosting algorithm.
3.1 Weak Learners vs. Strong Learners
HyperBoost is suitable for Boosting strong base learners. In fact, when Boosting and model fitting are conducted on statistically independent example sets, the distinction between distinction between “weak” vs. “strong” learners is no longer important. Instead, any learners (weak or strong) simply provide models, and HyperBoosting chooses the model that most improves the validation set performance of the ensemble. While strong learners generally require additional regularization compared with weak learners in order to generalize correctly from training data, strong and weak base learners are equally useful for HyperBoosting.
4 Results on Facial Expression
In support of the ICML2013 workshop on representation learning,
PierreLuc Carrier and Aaron Courville released a data set for facial expression recognition as a Kaggle competition (
http://www.kaggle.com/c/challengesin
representationlearningfacialexpression
recognitionchallenge).
The data consist of 48x48 pixel grayscale images of faces, and labels for the expressions of those faces.
The faces have been automatically registered so that each face is approximately centered and occupies about the same amount of area within each image.
The task is to categorize each face as one of seven categories (anger, disgust, fear, happy, sad, surprised, neutral).
The set distributed by Kaggle consists of 28,709 training examples examples, and 7,178 test examples.
Our protocol for model selection was simple crossvalidation on the training examples. We partitioned the training data into 20709 SVMfitting examples and 8000 validation examples, and performed hyperparameter optimization with regard to the performance on this validation set. The test examples were not used for model selection. The Kaggle website only provided the images for test examples, to prevent cheating in the contest. The test scores listed for the Facial Expression Recognition Challenge were obtained by uploading our predictions to Kaggle’s website, which computed the test set accuracy on our behalf.
On each round of HyperBoosting, we evaluated 1000 nondegenerate hyperparameter proposals in search of the best feature set to add to the ensemble. These proposals ranged in accuracy from chance baseline of 20% up to a relatively strong 62%. Experiments were done using a single computer with four NVidia Tesla 2050 GPUs and a slow file system so typically 2 or 3 jobs would run simultaneously. Many configurations were invalid (e.g. downsampling so much that 0 features remain) but these are recognized relatively quickly. Valid (nondegenerate) trials typically took 10  25 minutes to complete, so each round of HyperBoost took two or three days using this one machine.
Rank  Team  Accuracy (%) 

1  “RBM”  
2  “Unsupervised”  
3  “Maxim Milakov”  
4  “Radu+Marius+Cristi”  
  HyperBoost Round 4  
5  “Lor.Voldy”  
11  “jaberg”  
  HyperBoost Round 1  
12  “bulbugoglu”  
56  “dstarerstor” 
The accuracies of the models chosen by HyperBoost are shown in Figure 2. HyperBoost creates a small ensemble whose combined accuracy (65.5%) is significantly better than the best individual model (60%). The ranking relative to other models in the Kaggle competition is shown in Table 1. The ensemble of size 4 ranks among the top 5 competition entries.
It is worth noting that the model and training programs used for HyperBoosting in this model space were entirely designed prior to the release of the data set. The model space was chosen to span the models of Pinto & Cox (2011) and Coates & Ng (2011). Pinto & Cox (2011) reported excellent match verification performance on the Labeled Faces in the Wild (LFW) data set (Huang et al., 2007)), and Coates & Ng (2011) advanced the state of the art at the time on the CIFAR10 object recognition data set. Our approach was developed prior to the release of the Facial Expression Recognition data set, so the good performance speaks directly to the ability of our metamodeling approach to generalize to new image classification tasks.
It is also worth noting that the training accuracy (not shown) of all models in the ensemble is much higher than the generalization accuracy (from up to ). The size of individual feature sets was capped at 9000 (to stay within the available memory on the GPU cards), and all of the best models approached this maximum number of features. Although these large feature sets demonstrated significant overfitting of the training data (these feature sets represent strong base learners for Boosting), HyperBoost selected ensemble members that brought steady improvement on the test set. This is a familiar story for Boosting algorithms based on an exponential loss, but HyperBoost produces the effect while operating on the more representative hinge loss.
5 Conclusion
Hyperparameter optimization within large model classes is difficult. We have shown that hyperparameter optimization within a Null model achieves over 60% accuracy in the workshop’s Facial Expression Recognition Challenge, which ranks 12th of 56 contest submissions. Further use of an ensembleconstruction mechanism raises that accuracy to 65.5%, which would have ranked 5th / 56 had it been ready by the contest closing date. These performances underscore the importance and difficulty of fully leveraging known algorithmic technology for image classification. We can only conjecture that a future version of our model space that includes a wider range of algorithms for feature initialization and refinement (e.g. backpropagation, dropout, maxout, sparse coding, sparsity regularization, RBMs, DAAs) could perform better yet. By the same token, until such a search is carried out, it is difficult to make quantitative claims regarding the value added by such algorithms over and above a wellconfigured set of simpler components.
The software used in these experiments is publicly available from github:
 Hyperopt

The TPE hyperparameter optimization algorithm and distributed optimization infrastructure. http://jaberg.github.com/hyperopt
 HyperoptConvNet

The HyperBoost algorithm and hyperoptsearchable representation of the image classification model. http://github.com/jaberg/hyperoptconvnet
Acknowledgments
This project has been supported by the Rowland Institute of Harvard, the United States’ National Science Foundation (IIS 0963668), and Canada’s National Science and Engineering Research Council through the Banting Fellowship program.
References
 Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
 Bergstra et al. (2011) Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyperparameter optimization. In NIPS*24, pp. 2546–2554, 2011.
 Bergstra et al. (2013a) Bergstra, J., Yamins, D., and Cox, D. D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In In Proc. ICML, 2013a.
 Bergstra et al. (2013b) Bergstra, J., Yamins, D., and Cox, D. D. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In SciPy’13, 2013b.
 Breiman (1996a) Breiman, L. Bagging predictors. Machine Learning, 24:123–140, 1996a.
 Breiman (1996b) Breiman, L. Stacked regressions. Machine Learning, 24:49–64, 1996b.
 Coates & Ng (2011) Coates, A. and Ng, A. Y. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML28, 2011.
 Demiriz et al. (2002) Demiriz, A., Bennett, K.P., and ShaweTaylor, J. Linear programming boosting via column generation. Machine Learning, 46:225–254, 2002.
 Goodfellow et al. (2013) Goodfellow, I., WardeFarley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. CoRR, abs/1302.4398, 2013.

Hinton (2002)
Hinton, G. E.
Training products of experts by minimizing contrastive divergence.
Neural Computation, 14:1771–1800, 2002.  Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
 Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing coadaptation of feature detectors. CoRR, abs/1207.0580, 2012.
 Hoeting et al. (1999) Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. Bayesian model averaging: A tutorial. Statistical Science, pp. 382–401, 1999.

Huang et al. (2007)
Huang, G. B., Ramesh, M., Berg, T., and LearnedMiller, E.
Labeled faces in the wild: A database for studying face recognition in unconstrained environments.
Technical Report 0749, University of Massachusetts, Amherst, October 2007.  Pinto & Cox (2011) Pinto, N. and Cox, D. D. Beyond simple features: A largescale feature search approach to unconstrained face recognition. In Proc. Face and Gesture Recognition, 2011.
 Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing, volume 1, chapter 8, pp. 318–362. MIT Press, Cambridge, 1986.
 Sill et al. (2009) Sill, J., Takács, G., Mackey, L., and Lin, D. Featureweighted linear stacking. CoRR, abs/0911.0460, 2009.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
 Thornton et al. (2012) Thornton, C., Hutter, F., Hoos, H. H., and LeytonBrown, K. AutoWEKA: Automated selection and hyperparameter optimization of classification algorithms. CoRR, abs/1208.3719, 2012.

Vincent et al. (2008)
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, PA.
Extracting and composing robust features with denoising autoencoders.
In Proc. of the 25th ICML, pp. 1096–1103. ACM, 2008.  Wolpert (1992) Wolpert, D. H. Stacked generalization. Neural Networks, 5:241–259, 1992.
Comments
There are no comments yet.