The design of an effective machine learning system typically involves making many design choices that reflect the nature of data at hand and the inferences we wish to make. The techniques at our disposal, as designers of machine learning systems, areintuition: repeating what worked in other settings that seem to be similar (appealing to our expertise and intuition), and search: model selection by trial and error search using e.g. cross-validation.
In common practice, both intuition and search are typically carried out informally. A practitioner may design a complete system by a semi-automated search process in which small-scale searches (e.g. grid search) update the practitioner’s own implicit beliefs regarding what constitutes a good model for the task at hand. Those beliefs inform the choice of future small-scale searches in an iterative process that makes progressive improvements to the system. (We may see many empirical results published in machine learning conferences as evidence of this process unfolding on an international scale over a course of years.)
One practical problem that arises from this common practice is that algorithms which have been demonstrated to work on particular data sets are notoriously difficult to adapt to new data sets. The trouble is that the implicit beliefs of the practitioner play a crucial role in the process of model selection. This difficulty has been widely recognized by domain experts, but the status quo remains because so many experts feel that their search is sufficiently efficient, and the insights gained from the model selection process are valuable. We hope that our results, taken together with other recent work on hyperparameter optimization such as Bergstra et al. (2011); Snoek et al. (2012); Thornton et al. (2012), challenge these beliefs and induce more researchers to recognize automatic hyperparameter optimization as an important technique for model evaluation.
1.1 Hyperparameter Optimization and Ensemble Construction
While hyperparameter optimization is important, it is not the only standard performance-enhancing technique used to improve the scores of a model family on a given benchmark task. Ensemble methods such as Bagging (Breiman, 1996a), Boosting (Demiriz et al., 2002), Stacking (Wolpert, 1992; Breiman, 1996b; Sill et al., 2009), and Bayesian Model Averaging (Hoeting et al., 1999) are also commonly employed to exact every last drop of accuracy from a given set of algorithmic technology. One of the goals of automating hyperparameter optimization is to assess, in an objective way, how good a set of classification system components can be. In pursuit of that goal, the automation of ensemble creation is also critical.
This paper presents a first attempt to provide a fully automated algorithm for model selection and ensemble construction. It starts from a palette of configurable pre-processing strategies, and classification algorithms, and proceeds to creates the most accurate ensemble of optimized components that it can. Our algorithm uses a Boosting approach to ensemble construction, in which hyper-parameter optimization plays the role of a base learner. This setting creates unique challenges that motivate a new Boosting algorithm, which we call SVM HyperBoost.
2 Null Model for Image Classification
to describe a configuration space that includes one-layer, two-layer, and three-layer convolutional networks. The elements of our image classification model are standard scaling (image resolution), affine warp (rigid image deformation), filterbank normalized cross-correlation, local spatial pooling, di-histogram spatial pooling, and an L2-SVM classifier. For each layer in each architecture, hyperparameters govern the size of filters, the volume of pooling regions, constants that modulate local normalization, and so on. The filters themselves are either chosen randomly from a centered Gaussian distribution, or are random projections of PCA components of training data (as it appears as input to each layer), or are random projections of input patches (again, as input arrives to each layer). Features for classification are derived from the output layer by either signed (di-histogram) or unsigned pooling over some topographically local partitioning of output features. This configuration space was chosen to span the model space investigated byPinto & Cox (2011) and the random-filter models of Coates & Ng (2011). Relative to Bergstra et al. (2013a) we add the possibility of affine warping of input images and remove the input-cropping step. In total, the configuration space includes 238 hyperparameters, although no configuration uses all 238 at once. Many of the hyperparameters are conditional hyperparameters because they are only active in certain conditions; for example, the hyperparameters governing the creation of a third layer are inactive for two-layer models. The details of this meta-model are described in Bergstra et al. (2013a) and implemented in the hyperopt-convnet software available from http://github.com/jaberg/hyperopt-convnet.
Notable omissions from the model space include: backpropagation(Rumelhart et al., 1986)
, unsupervised learning for filters such as RBMs(Hinton, 2002; Hinton et al., 2006), Sparse Coding (Coates & Ng, 2011), DAAs (Vincent et al., 2008), and recent high-performance regularization strategies such as dropout (Hinton et al., 2012) and maxout (Goodfellow et al., 2013).
3 SVM HyperBoosting
This section describes an ensemble construction method (SVM HyperBoost, or just HyperBoost) that is particularly well-suited to the use of a hyperparameter optimization algorithm as an inner loop. This algorithm is presented in the context of models which have the form of a feature extractor and a linear classifier. In this context, the ensemble is simply a larger linear function that can be seen as the concatenation of ensemble members. The HyperBoost algorithm can be understood as a piecewise training of this single giant linear classifier.
To derive the HyperBoost algorithm, suppose that we commit to using an ensemble of size . (No such commitment is necessary in practice, but it makes the development clearer.) The ideal ensemble weights and hyperparameter configuration settings for a binary classification task would optimize generalization error:
Here stands for a joint density over inputs and labels (, ). We have used
to denote the feature vector associated to inputby hyperparameter configuration . The expression denotes the indicator function that is one for values of which are negative, but zero for values of which are not negative. We use to stand for the set of possible hyperparameter configurations, so that the argmin means “choose optimal hyperparameter configurations” (one for each ensemble member). We use the notation in the argmin to indicate that the final set of weights will be a vector, but it is not known a-priori how many elements it will have. Rather, will be logically divided into pieces corresponding to ensemble elements and each piece will have a dimensionality that matches .
The joint optimization of and implied by Equation 1 is challenging because of
the complicated effect of each on ,
the expectation over unknown , and
the non-differentiable indicator function.
Our strategy for dealing with the complicated relationship between and is to select the configurations greedily, using the algorithm illustrated in Figure 1
. Our strategy for dealing with the expectation is to estimate it from what is typically called validation data, so that each argmin foris what previous work has called hyperparameter optimization. Our strategy for dealing with the non-differentiability of the indicator function is to use a gradient-free optimization method, namely TPE (Bergstra et al., 2011).
Normally, Boosting (functional gradient methods) on a Hinge loss or Zero-One loss would quickly run into trouble because once the training margins are pushed past the decision boundary, subsequent rounds have nothing to do (the training criterion is completely satisfied). We avoid this nonsense using two techniques. First, Boosting on sufficient validation data helps because models fit to training data are seldom perfect for validation data by random chance (incidentally, we are interested in collaborations that might shed light on exactly how much validation data is necessary). Second, each round of HyperBoost is free to scale the contribution of previous ensemble components (see in Figure 1), so standard SVM regularization techniques (i.e. ) allow us to meaningfully add features and improve even if the Hinge loss had been reduced to 0 at a previous HyperBoosting iteration. The regularization parameter governing the entire SVM is a hyperparameter that is re-optimized on every round of HyperBoost. This technique makes HyperBoost a partially corrective Boosting algorithm.
3.1 Weak Learners vs. Strong Learners
HyperBoost is suitable for Boosting strong base learners. In fact, when Boosting and model fitting are conducted on statistically independent example sets, the distinction between distinction between “weak” vs. “strong” learners is no longer important. Instead, any learners (weak or strong) simply provide models, and HyperBoosting chooses the model that most improves the validation set performance of the ensemble. While strong learners generally require additional regularization compared with weak learners in order to generalize correctly from training data, strong and weak base learners are equally useful for HyperBoosting.
4 Results on Facial Expression
In support of the ICML2013 workshop on representation learning,
Pierre-Luc Carrier and Aaron Courville released a data set for facial expression recognition as a Kaggle competition (
recognition-challenge). The data consist of 48x48 pixel grayscale images of faces, and labels for the expressions of those faces. The faces have been automatically registered so that each face is approximately centered and occupies about the same amount of area within each image. The task is to categorize each face as one of seven categories (anger, disgust, fear, happy, sad, surprised, neutral). The set distributed by Kaggle consists of 28,709 training examples examples, and 7,178 test examples.
Our protocol for model selection was simple cross-validation on the training examples. We partitioned the training data into 20709 SVM-fitting examples and 8000 validation examples, and performed hyperparameter optimization with regard to the performance on this validation set. The test examples were not used for model selection. The Kaggle website only provided the images for test examples, to prevent cheating in the contest. The test scores listed for the Facial Expression Recognition Challenge were obtained by uploading our predictions to Kaggle’s website, which computed the test set accuracy on our behalf.
On each round of HyperBoosting, we evaluated 1000 non-degenerate hyperparameter proposals in search of the best feature set to add to the ensemble. These proposals ranged in accuracy from chance baseline of 20% up to a relatively strong 62%. Experiments were done using a single computer with four NVidia Tesla 2050 GPUs and a slow file system so typically 2 or 3 jobs would run simultaneously. Many configurations were invalid (e.g. downsampling so much that 0 features remain) but these are recognized relatively quickly. Valid (non-degenerate) trials typically took 10 - 25 minutes to complete, so each round of HyperBoost took two or three days using this one machine.
|-||HyperBoost Round 4|
|-||HyperBoost Round 1|
The accuracies of the models chosen by HyperBoost are shown in Figure 2. HyperBoost creates a small ensemble whose combined accuracy (65.5%) is significantly better than the best individual model (60%). The ranking relative to other models in the Kaggle competition is shown in Table 1. The ensemble of size 4 ranks among the top 5 competition entries.
It is worth noting that the model and training programs used for HyperBoosting in this model space were entirely designed prior to the release of the data set. The model space was chosen to span the models of Pinto & Cox (2011) and Coates & Ng (2011). Pinto & Cox (2011) reported excellent match verification performance on the Labeled Faces in the Wild (LFW) data set (Huang et al., 2007)), and Coates & Ng (2011) advanced the state of the art at the time on the CIFAR-10 object recognition data set. Our approach was developed prior to the release of the Facial Expression Recognition data set, so the good performance speaks directly to the ability of our meta-modeling approach to generalize to new image classification tasks.
It is also worth noting that the training accuracy (not shown) of all models in the ensemble is much higher than the generalization accuracy (from up to ). The size of individual feature sets was capped at 9000 (to stay within the available memory on the GPU cards), and all of the best models approached this maximum number of features. Although these large feature sets demonstrated significant over-fitting of the training data (these feature sets represent strong base learners for Boosting), HyperBoost selected ensemble members that brought steady improvement on the test set. This is a familiar story for Boosting algorithms based on an exponential loss, but HyperBoost produces the effect while operating on the more representative hinge loss.
Hyperparameter optimization within large model classes is difficult. We have shown that hyperparameter optimization within a Null model achieves over 60% accuracy in the workshop’s Facial Expression Recognition Challenge, which ranks 12th of 56 contest submissions. Further use of an ensemble-construction mechanism raises that accuracy to 65.5%, which would have ranked 5th / 56 had it been ready by the contest closing date. These performances underscore the importance and difficulty of fully leveraging known algorithmic technology for image classification. We can only conjecture that a future version of our model space that includes a wider range of algorithms for feature initialization and refinement (e.g. backpropagation, dropout, maxout, sparse coding, sparsity regularization, RBMs, DAAs) could perform better yet. By the same token, until such a search is carried out, it is difficult to make quantitative claims regarding the value added by such algorithms over and above a well-configured set of simpler components.
The software used in these experiments is publicly available from github:
This project has been supported by the Rowland Institute of Harvard, the United States’ National Science Foundation (IIS 0963668), and Canada’s National Science and Engineering Research Council through the Banting Fellowship program.
- Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
- Bergstra et al. (2011) Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In NIPS*24, pp. 2546–2554, 2011.
- Bergstra et al. (2013a) Bergstra, J., Yamins, D., and Cox, D. D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In In Proc. ICML, 2013a.
- Bergstra et al. (2013b) Bergstra, J., Yamins, D., and Cox, D. D. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In SciPy’13, 2013b.
- Breiman (1996a) Breiman, L. Bagging predictors. Machine Learning, 24:123–140, 1996a.
- Breiman (1996b) Breiman, L. Stacked regressions. Machine Learning, 24:49–64, 1996b.
- Coates & Ng (2011) Coates, A. and Ng, A. Y. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML-28, 2011.
- Demiriz et al. (2002) Demiriz, A., Bennett, K.P., and Shawe-Taylor, J. Linear programming boosting via column generation. Machine Learning, 46:225–254, 2002.
- Goodfellow et al. (2013) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. CoRR, abs/1302.4398, 2013.
Hinton, G. E.
Training products of experts by minimizing contrastive divergence.Neural Computation, 14:1771–1800, 2002.
- Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
- Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
- Hoeting et al. (1999) Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. Bayesian model averaging: A tutorial. Statistical Science, pp. 382–401, 1999.
Huang et al. (2007)
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E.
Labeled faces in the wild: A database for studying face recognition in unconstrained environments.Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
- Pinto & Cox (2011) Pinto, N. and Cox, D. D. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Proc. Face and Gesture Recognition, 2011.
- Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing, volume 1, chapter 8, pp. 318–362. MIT Press, Cambridge, 1986.
- Sill et al. (2009) Sill, J., Takács, G., Mackey, L., and Lin, D. Feature-weighted linear stacking. CoRR, abs/0911.0460, 2009.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
- Thornton et al. (2012) Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. CoRR, abs/1208.3719, 2012.
Vincent et al. (2008)
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P-A.
Extracting and composing robust features with denoising autoencoders.In Proc. of the 25th ICML, pp. 1096–1103. ACM, 2008.
- Wolpert (1992) Wolpert, D. H. Stacked generalization. Neural Networks, 5:241–259, 1992.