Spearmint Salad is a combination of Spearmint and Agnostic Bayes to create an ensemble of predictors over the hyperparameter space while using the fast hyperparameter search approach of Spearmint.
One of the most tedious tasks in the application of machine learning is model selection, i.e. hyperparameter selection. Fortunately, recent progress has been made in the automation of this process, through the use of sequential model-based optimization (SMBO) methods. This can be used to optimize a cross-validation performance of a learning algorithm over the value of its hyperparameters. However, it is well known that ensembles of learned models almost consistently outperform a single model, even if properly selected. In this paper, we thus propose an extension of SMBO methods that automatically constructs such ensembles. This method builds on a recently proposed ensemble construction paradigm known as agnostic Bayesian learning. In experiments on 22 regression and 39 classification data sets, we confirm the success of this proposed approach, which is able to outperform model selection with SMBO.
The automation of hyperparameter selection is an important step towards making the practice of machine learning more approachable to the non-expert and increasing its impact on data-reliant sciences. Significant progress has been made recently, with many methods reporting success in tuning a large variety of algorithms (Bergstra et al., 2011; Hutter et al., 2011; Snoek et al., 2012). One successful general paradigm is known as Sequential Model-Based Optimization (SMBO). It is based on a process that alternates between the proposal of a new hyperparameter configuration to test and the update of an adaptive model of the relationship between hyperparameter configurations and their holdout set performances. Thus, as the model learns about this relationship, it increases its ability to suggest improved hyperparameter configurations and gradually converges to the best solution.
While finding the single best model configuration is useful, better performance is often obtained by, instead, combining several (good) models into an ensemble. This was best illustrated by the winning entry of the Netflix competition, which combined a variety of models (Bell et al., 2007). Even if one concentrates on a single learning algorithm, combining models produced by using different hyperparameters is also helpful. Intuitively, models with comparable performances are still likely to generalize differently across the input space and produce different patterns of errors. By averaging their predictions, we can hope that the majority of models actually perform well on any given input and will move the ensemble towards better predictions globally, by dominating the average. In other words, the averaging of several comparable models reduces the variance of our predictor compared to each individual in the ensemble, while not sacrificing too much in terms of bias.
However, constructing such ensembles is just as tedious as performing model selection and at least as important in the successful deployment of machine-learning-based systems. Moreover, unlike the model selection case for which SMBO can be used, no comparable automatic ensemble construction methods have been developed thus far. The current methods of choice remain trial and error or exhaustive grid search for exploring the space of models to combine, followed by a selection or weighting strategy which is often a heuristic.
In this paper, we propose a method for leveraging the recent research on SMBO in order to generate an ensemble of models, as opposed to the single best model. The proposed approach builds on the agnostic Bayes framework (Lacoste et al., 2014), which provides a successful strategy for weighting a predetermined and finite set of models (already trained) into an ensemble. Using a successful SMBO method, we show how we can effectively generalize this framework to the case of an infinite space of models (indexed by its hyperparameter space). The resulting method is simple and highly efficient. Our experiments on 22 regression and 39 classification data sets confirm that it outperforms the regular SMBO model selection method.
The paper develops as follows. First, we describe SMBO and its use for hyperparameter selection (Section 2). We follow with a description of the agnostic Bayes framework and present a bootstrap-based implementation of it (Section 3). Then, we describe the proposed algorithm for automatically constructing an ensemble using SMBO (Section 4). Finally, related work is discussed (Section 5) and the experimental comparisons are presented (Section 6).
Let us first lay down the notation we will be using to describe the task of model selection for a machine learning algorithm. In this setup, a task $D$ corresponds to a probability distribution over the input-output space $\mathcal{X} \times \mathcal{Y}$. Given a set of $m$ examples $S \sim D^m$ (which will be our holdout validation set), the objective is to find, among a set $\mathcal{H}$, the best function $h^\star \in \mathcal{H}$. In general, $\mathcal{H}$ can be any set and we refer to a member $h \in \mathcal{H}$ as a predictor. In the context of hyperparameter selection, $\mathcal{H}$ corresponds to the set of models trained on a training set $T$ (disjoint from $S$), for different configurations $\gamma$ of the learning algorithm's hyperparameters, taken from a hyperparameter space $\Gamma$. Namely, let $A_\gamma$ be the learning algorithm with a hyperparameter configuration $\gamma \in \Gamma$; we will denote by $h_\gamma = A_\gamma(T)$ the predictor obtained after training on $T$. The set $\mathcal{H}$ contains all predictors obtained from each $\gamma \in \Gamma$ when $A_\gamma$ is trained on $T$, i.e. $\mathcal{H} = \{h_\gamma \mid \gamma \in \Gamma\}$.
To assess the quality of a predictor, we use a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ that quantifies the penalty incurred when $h_\gamma$ predicts $h_\gamma(x)$ while the true target is $y$. Then, we can define the risk $R_D(h_\gamma)$ as being the expected loss of $h_\gamma$ on task $D$, i.e. $R_D(h_\gamma) = \mathbb{E}_{(x,y) \sim D}[L(h_\gamma(x), y)]$. Finally, the best function is simply the one minimizing the risk, i.e. $h^\star = \arg\min_{h_\gamma \in \mathcal{H}} R_D(h_\gamma)$ (the best solution may not be unique, but any minimizer is equally good). Here, estimating $h^\star$ thus corresponds to hyperparameter selection.
For most of machine learning history, the state of the art in hyperparameter selection has been testing a list of predefined configurations and selecting the best according to the loss function $L$ on some holdout set of examples $S$. When a learning algorithm has more than one hyperparameter, a grid search is required, forcing the number of configurations $|\Gamma|$ to grow exponentially with the number of hyperparameters. In addition, the search may yield a suboptimal result when the minimum lies outside of the grid or when there is not enough computational power for an appropriate grid resolution. Recently, randomized search has been advocated as a better replacement for grid search (Bergstra and Bengio, 2012). While it tends to be superior to grid search, it remains inefficient since its search is not informed by the results of the hyperparameter configurations already tested.
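To make the exponential growth concrete, here is a minimal sketch that counts the configurations of a full grid and contrasts it with a fixed-budget randomized search. The helper names, the resolution of 10 values per hyperparameter and the uniform sampling are illustrative assumptions, not details taken from the paper.

```python
import itertools
import random


def grid_size(values_per_dim, n_dims):
    """Number of configurations in a full grid: resolution ** dimensions."""
    return values_per_dim ** n_dims


def full_grid(axes):
    """Enumerate every configuration, given one list of values per hyperparameter."""
    return list(itertools.product(*axes))


def random_search(bounds, budget, seed=0):
    """Draw `budget` configurations uniformly within per-dimension (low, high) bounds."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(budget)]


if __name__ == "__main__":
    for n_dims in range(1, 6):
        print(n_dims, "hyperparameters ->", grid_size(10, n_dims), "grid points")
    # The random-search budget stays fixed whatever the dimension.
    print(len(random_search([(0.0, 1.0)] * 5, budget=150)), "random configurations")
```

With 10 values per axis, a 5-dimensional space already needs 100,000 training runs for a full grid, whereas a randomized search spends whatever budget it is given.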
To address these limitations, there has been an increasing amount of work on automatic hyperparameter optimization (Bergstra et al., 2011; Hutter et al., 2011; Snoek et al., 2012). Most rely on an approach called sequential model-based optimization (SMBO). The idea consists in treating the holdout risk $R_S(h_\gamma)$ as a learnable function $f(\gamma)$ of $\gamma$, which we can learn from the observations collected during the hyperparameter selection process.
We must thus choose a model family for $f$. A common choice is a Gaussian process (GP) representation, which allows us to represent our uncertainty about $f$, i.e. our uncertainty about the value of $f(\gamma^*)$ at any unobserved hyperparameter configuration $\gamma^*$. This uncertainty can then be leveraged to determine an acquisition function that suggests the most promising hyperparameter configuration to test next.
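Namely, let functions $\mu : \Gamma \to \mathbb{R}$ and $K : \Gamma \times \Gamma \to \mathbb{R}$ be the mean and covariance kernel functions of our GP over $f(\gamma) = R_S(h_\gamma)$. Let us also denote the set of the $M$ previous evaluations as
$$\mathcal{R} = \{(\gamma_i, R_S(h_{\gamma_i}))\}_{i=1}^{M}, \tag{1}$$
where $R_S(h_{\gamma_i})$ is the empirical risk of $h_{\gamma_i}$ on set $S$, i.e. the holdout set error for hyperparameter $\gamma_i$.
The GP assumption on $f$ implies that the conditional distribution $p(f(\gamma^*) \mid \mathcal{R})$ is Gaussian, that is
$$p(f(\gamma^*) \mid \mathcal{R}) = \mathcal{N}\big(f(\gamma^*);\, \mu(\gamma^*; \mathcal{R}),\, \Sigma(\gamma^*; \mathcal{R})\big),$$
$$\mu(\gamma^*; \mathcal{R}) = \mu(\gamma^*) + \mathbf{k}^\top \mathbf{K}^{-1} (\mathbf{r} - \boldsymbol{\mu}),$$
$$\Sigma(\gamma^*; \mathcal{R}) = K(\gamma^*, \gamma^*) - \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{k},$$
where $\mathcal{N}(f; \mu, \Sigma)$ is the Gaussian density function with mean $\mu$ and variance $\Sigma$. We also have vectors $\boldsymbol{\mu} = [\mu(\gamma_1), \dots, \mu(\gamma_M)]^\top$, $\mathbf{k} = [K(\gamma^*, \gamma_1), \dots, K(\gamma^*, \gamma_M)]^\top$, $\mathbf{r} = [R_S(h_{\gamma_1}), \dots, R_S(h_{\gamma_M})]^\top$, and matrix $\mathbf{K}$ is such that $\mathbf{K}_{ij} = K(\gamma_i, \gamma_j)$.
There are several choices for the acquisition function. One that has been used with success is the one maximizing the expected improvement:
$$\mathrm{EI}(\gamma^*; \mathcal{R}) = \mathbb{E}\big[\max\{r_{\text{best}} - f(\gamma^*),\, 0\} \mid \mathcal{R}\big], \tag{2}$$
which can be shown to be equal to
$$\sqrt{\Sigma(\gamma^*; \mathcal{R})}\,\Big(d(\gamma^*; \mathcal{R})\,\Phi\big(d(\gamma^*; \mathcal{R})\big) + \mathcal{N}\big(d(\gamma^*; \mathcal{R});\, 0,\, 1\big)\Big), \tag{3}$$
where $r_{\text{best}} = \min_i R_S(h_{\gamma_i})$ is the lowest holdout risk observed so far, $\Phi$ is the cumulative distribution function of the standard normal and
$$d(\gamma^*; \mathcal{R}) = \frac{r_{\text{best}} - \mu(\gamma^*; \mathcal{R})}{\sqrt{\Sigma(\gamma^*; \mathcal{R})}}.$$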
The acquisition function thus maximizes Equation 3 and returns its solution. This optimization can be performed by gradient ascent initialized at points distributed across the hyperparameter space according to a Sobol sequence, in order to maximize the chance of finding a global optimum. One advantage of expected improvement is that it directly offers a solution to the exploration-exploitation trade-off that hyperparameter selection faces.
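The equality between the expectation defining the expected improvement and its closed form can be checked numerically. The sketch below assumes the GP posterior at a candidate configuration is summarized by a mean `mu` and a variance `var`, evaluates the closed form, and compares it with a Monte Carlo average of the improvement over the best observed risk `r_best`. The function names and the numerical values are hypothetical.

```python
import math

import numpy as np


def ei_closed_form(mu, var, r_best):
    """Closed-form expected improvement for a Gaussian posterior N(mu, var)."""
    sigma = math.sqrt(var)
    d = (r_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)   # standard normal density
    return sigma * (d * cdf + pdf)


def ei_monte_carlo(mu, var, r_best, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[max(r_best - f, 0)] with f ~ N(mu, var)."""
    rng = np.random.default_rng(seed)
    f = rng.normal(mu, math.sqrt(var), size=n_samples)
    return float(np.maximum(r_best - f, 0.0).mean())


if __name__ == "__main__":
    print(ei_closed_form(0.30, 0.04, 0.25), ei_monte_carlo(0.30, 0.04, 0.25))
```

The two values agree up to sampling noise. Note that the expected improvement stays positive even when the posterior mean is worse than `r_best`, which is what drives exploration of uncertain regions.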
An iteration of SMBO requires fitting the GP to the current set of tested hyperparameters $\mathcal{R}$ (initially empty), invoking the acquisition function, running the learning algorithm with the suggested hyperparameters and adding the result to $\mathcal{R}$. This procedure is expressed in Algorithm 1. Fitting the GP corresponds to fitting the hyperparameters of the mean and covariance functions to the collected data. This can be performed either by maximizing the data's marginal likelihood or by defining priors over these hyperparameters and sampling from their posterior (see Snoek et al. (2012) for more details).
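The loop described above can be sketched as follows. This is a simplified illustration and not the Spearmint implementation: the GP uses a fixed length scale instead of fitting its hyperparameters, risks are standardized so that a zero mean and unit prior variance are reasonable, and the expected improvement is maximized over random candidates rather than by gradient ascent from a Sobol sequence. The names `smbo`, `holdout_risk` and `expected_improvement` are hypothetical.

```python
import math

import numpy as np


def matern52(a, b, length_scale):
    """Matérn 5/2 covariance between configurations a (n, d) and b (m, d)."""
    dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    s = math.sqrt(5.0) * dist / length_scale
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)


def expected_improvement(cands, gammas, risks, length_scale=0.2, jitter=1e-6):
    """EI at each candidate under a GP conditioned on the history (gammas, risks)."""
    r = (risks - risks.mean()) / (risks.std() + 1e-12)      # standardized risks
    K = matern52(gammas, gammas, length_scale) + jitter * np.eye(len(gammas))
    k = matern52(gammas, cands, length_scale)               # shape (M, n_candidates)
    mu = k.T @ np.linalg.solve(K, r)                        # posterior mean
    var = 1.0 - np.einsum("ij,ij->j", k, np.linalg.solve(K, k))
    sigma = np.sqrt(np.maximum(var, 1e-12))                 # posterior std
    d = (r.min() - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(d / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * d ** 2) / math.sqrt(2.0 * math.pi)
    return sigma * (d * cdf + pdf)


def smbo(holdout_risk, bounds, budget=30, n_candidates=256, seed=0):
    """Return the best configuration found and its holdout risk."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    gammas, risks = [], []                                  # the history, initially empty
    for _ in range(budget):
        cands = rng.uniform(lo, hi, size=(n_candidates, lo.size))
        if gammas:
            ei = expected_improvement(cands, np.array(gammas), np.array(risks))
            gamma = cands[int(np.argmax(ei))]               # acquisition step
        else:
            gamma = cands[0]                                # nothing to condition on yet
        gammas.append(gamma)
        risks.append(float(holdout_risk(gamma)))            # train on T, evaluate on S
    best = int(np.argmin(risks))
    return gammas[best], risks[best]
```

Here `holdout_risk` stands for training the learning algorithm with the suggested configuration and measuring its empirical risk on the holdout set; any callable mapping a configuration vector to a scalar can be plugged in to try the loop.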
While SMBO hyperparameter optimization can produce very good predictors, it can also suffer from overfitting on the validation set, especially for high-dimensional hyperparameter spaces. This is in part why an ensemble of predictors is often preferable in practice. Properly extending SMBO to the construction of ensembles is, however, not obvious. Here, we propose one such successful extension, building on the framework of Agnostic Bayes learning, described in the next section.
In this section, we offer a brief overview of the Agnostic Bayes learning paradigm presented in Lacoste et al. (2014) and serving as a basis for the algorithm we present in this paper. Agnostic Bayes learning was used in Lacoste et al. (2014) as a framework for successfully constructing ensembles when the number of predictors in $\mathcal{H}$ (i.e. the number of potential hyperparameter configurations in $\Gamma$) was constrained to be finite (e.g. by restricting the space to a grid). In our context, we can thus enumerate the possible hyperparameter configurations from $\gamma_1$ to $\gamma_N$. This paper will generalize this approach to the infinite case later.
Agnostic Bayes learning attempts to directly address the problem of inferring what is the best function $h^\star$ in $\mathcal{H}$, according to the loss function $L$. It infers a posterior $p(h^\star = h \mid S)$, i.e. a distribution over how likely each member of $\mathcal{H}$ is the best predictor. This is in contrast with standard Bayesian learning, which implicitly assumes that $\mathcal{H}$ contains the true data-generating model and infers a distribution for how likely each member of $\mathcal{H}$ has generated the data (irrespective of what the loss is). From $p(h^\star = h \mid S)$, by marginalizing $h^\star$ and selecting the most probable prediction, we obtain the following ensemble decision rule:
$$E^\star(x) = \arg\max_{y \in \mathcal{Y}} \; \mathbb{E}_{h \sim p(h^\star = h \mid S)} \big[ I[h(x) = y] \big], \tag{4}$$
where $I[\cdot]$ is the indicator function.
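To estimate $p(h^\star = h \mid S)$, Agnostic Bayes learning uses the set of losses of each example as evidence for inference. In Lacoste et al. (2014), a few different approaches are proposed and analyzed. A general strategy is to assume a joint prior $p(\mathbf{r})$ over the risks $r_\gamma$ of all possible hyperparameter configurations $\gamma$ and choose a joint observation model for the losses given $\mathbf{r}$. From Bayes rule, we obtain the posterior $p(\mathbf{r} \mid S)$ from which we can compute
$$p(h^\star = h_\gamma \mid S) = \mathbb{E}_{\mathbf{r} \sim p(\mathbf{r} \mid S)} \big[ I[r_\gamma < r_{\gamma'},\ \forall \gamma' \neq \gamma] \big] \tag{5}$$
with a Monte Carlo estimate. This would result in repeatedly sampling $\mathbf{r}$ from $p(\mathbf{r} \mid S)$ and counting the number of times each $\gamma$ has the smallest sampled risk $r_\gamma$ to estimate $p(h^\star = h_\gamma \mid S)$. Similarly, samples from $p(h^\star = h \mid S)$ could be obtained by sampling a risk vector $\mathbf{r}$ from $p(\mathbf{r} \mid S)$ and returning the predictor $h_\gamma$ with the lowest sampled risk $r_\gamma$. The ensemble decision rule of Equation 4 could then be implemented by repeatedly sampling from $p(h^\star = h \mid S)$ to construct the ensemble of predictors and using their average as the ensemble's prediction.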
Among the methods explored in Lacoste et al. (2014) to obtain samples from $p(\mathbf{r} \mid S)$, the bootstrap approach stands out for its efficiency and simplicity. Namely, to obtain a sample from $p(\mathbf{r} \mid S)$, we sample with replacement from $S$ to obtain $\tilde{S}$ and return the vector of empirical risks $[R_{\tilde{S}}(h_{\gamma_1}), \dots, R_{\tilde{S}}(h_{\gamma_N})]^\top$ as a sample. While bootstrap only serves as a "poor man's" posterior, it can be shown to be statistically related to a proper model with Dirichlet priors and its empirical performance was shown to be equivalent (Lacoste et al., 2014).
When the bootstrap method is used to obtain samples from $p(h^\star = h \mid S)$, the complete procedure for generating each ensemble member can be summarized by
$$\tilde{h} = \arg\min_{h_\gamma \in \mathcal{H}} R_{\tilde{S}}(h_\gamma), \tag{6}$$
where $\tilde{h}$ is a returned sample.
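For a finite set of already trained predictors, the bootstrap procedure can be sketched in a few lines. The sketch assumes the per-example holdout losses are available as a matrix with one row per hyperparameter configuration and one column per holdout example. The function names and the weighted-vote helper for classification are illustrative assumptions and are not taken from Lacoste et al. (2014).

```python
import numpy as np


def bootstrap_posterior(losses, n_samples=1000, seed=0):
    """Estimate how often each predictor is the best one on a bootstrap of the holdout set.

    losses: array of shape (N configurations, m holdout examples).
    """
    rng = np.random.default_rng(seed)
    n_models, m = losses.shape
    wins = np.zeros(n_models)
    for _ in range(n_samples):
        idx = rng.integers(0, m, size=m)        # resample the holdout set with replacement
        risks = losses[:, idx].mean(axis=1)     # empirical risk vector on the bootstrap
        wins[np.argmin(risks)] += 1             # keep the lowest-risk predictor
    return wins / n_samples


def ensemble_vote(posterior, predictions):
    """Weighted majority vote; predictions has shape (N, n_points) with integer labels."""
    n_classes = predictions.max() + 1
    votes = np.zeros((n_classes, predictions.shape[1]))
    for weight, pred in zip(posterior, predictions):
        votes[pred, np.arange(pred.size)] += weight
    return votes.argmax(axis=0)
```

Predictors that are never the best on any bootstrap receive zero weight, so the ensemble concentrates on the configurations that remain plausible candidates for the lowest true risk. Ties in the risk vector are broken in favor of the first configuration, which is an arbitrary choice of this sketch.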
We now present our proposed method for automatically constructing an ensemble, without having to restrict $\Gamma$ (or, equivalently, $\mathcal{H}$) to a finite subset of hyperparameters.
As described in Section 3, to sample a predictor from the Agnostic Bayes bootstrap method, it suffices to obtain a bootstrap $\tilde{S}$ from $S$ and solve the optimization problem of Equation 6. In our context, where $\mathcal{H}$ is possibly an infinite set of models trained on the training set $T$ for any hyperparameter configuration $\gamma$, Equation 6 corresponds in fact to hyperparameter optimization where the holdout set is $\tilde{S}$ instead of $S$.
This suggests a simple procedure for building an ensemble of $B$ predictors according to agnostic Bayes, i.e., one that reflects our uncertainty about the true best model $h^\star$. We could repeat the full SMBO hyperparameter optimization process $B$ times, with a different bootstrap $\tilde{S}_j$ for each $j \in \{1, \dots, B\}$. However, for large ensembles, performing $B$ runs of SMBO can be computationally expensive, since each run would need to train its own sequence of models.
We can notice, however, that predictors are always trained on the same training set $T$, no matter which run of SMBO produced them. We propose a handy trick that exploits this observation to greatly accelerate the construction of the ensemble, by almost a factor of $B$. Specifically, we propose to simultaneously optimize all $B$ problems in a round-robin fashion. Thus, we maintain $B$ different histories of evaluation $\mathcal{R}_j$, for $j \in \{1, \dots, B\}$, and when a new predictor $h_\gamma = A_\gamma(T)$ is obtained, we update all $\mathcal{R}_j$ with $(\gamma, R_{\tilde{S}_j}(h_\gamma))$. Notice that the different histories contain the empirical risks on different bootstrap holdout sets, but they are all updated at the cost of training only a single predictor. Also, to avoid recalculating the per-example losses $L(h_\gamma(x_i), y_i)$ multiple times, these values can be cached and shared in the computation of each $R_{\tilde{S}_j}(h_\gamma)$. This leaves the cost of updating all $\mathcal{R}_j$ insignificant compared to the computational time usually required for training a predictor. This procedure is detailed in Algorithm 2.
By updating all $\mathcal{R}_j$ at the same time, we trick each SMBO run by updating its history with points it did not suggest. This implies that the GP model behind each SMBO run will be able to condition on more observations than it would if the runs had been performed in isolation. This can only benefit the GPs and improve the quality of their suggestions.
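The caching trick and the round-robin schedule can be sketched as follows. Each bootstrap is stored as a vector of example counts, so all bootstrap risks of a new predictor follow from a single matrix-vector product with its cached per-example losses. The callables `suggest` (the acquisition step of one SMBO run, given its history) and `train_and_losses` (one training run returning the predictor and its holdout losses) are hypothetical placeholders, and the whole block is a sketch under these assumptions rather than the authors' implementation.

```python
import numpy as np


def make_bootstrap_counts(m, n_bootstraps, seed=0):
    """Row j counts how often each of the m holdout examples appears in bootstrap j."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, m, size=(n_bootstraps, m))
    return np.stack([np.bincount(row, minlength=m) for row in idx])


def update_histories(histories, gamma, example_losses, counts):
    """Append (gamma, risk on bootstrap j) to every history from one cached loss vector."""
    risks = counts @ example_losses / counts.shape[1]   # all bootstrap risks at once
    for history, risk in zip(histories, risks):
        history.append((gamma, float(risk)))
    return risks


def esmbo(suggest, train_and_losses, m, n_bootstraps, budget, seed=0):
    """Round-robin over the bootstrap problems; one model is trained per iteration."""
    counts = make_bootstrap_counts(m, n_bootstraps, seed)
    histories = [[] for _ in range(n_bootstraps)]
    predictors = []
    for i in range(budget):
        j = i % n_bootstraps                            # whose turn it is to suggest
        gamma = suggest(histories[j])
        predictor, losses = train_and_losses(gamma)     # single training run on T
        predictors.append(predictor)
        update_histories(histories, gamma, losses, counts)
    # Ensemble member j: the trained predictor with the lowest risk on bootstrap j.
    return [predictors[int(np.argmin([r for _, r in h]))] for h in histories]
```

Since every history receives every evaluation, the cost per iteration is one training run plus a product of size $B \times m$, and the same predictor may be selected by several bootstraps, which acts as an implicit weighting of the ensemble.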
In the Bayesian learning literature, a common way of dealing with hyperparameters in probabilistic predictors is to define hyperpriors and perform posterior inference to integrate them out. This process often results in also constructing an ensemble of predictors with different hyperparameters, sampled from the posterior. Powerful MCMC methods have been developed in order to accommodate different types of hyperparameter spaces, including infinite spaces.
However, this approach requires that the family of predictors in question be probabilistic in order to apply Bayes rule. Moreover, even if the predictor family is probabilistic, the construction of the ensemble will entirely ignore the nature of the loss function that determines the measure of performance. The comparative advantage of the proposed Agnostic Bayes SMBO approach is thus that it can be used for any predictor family (probabilistic or not) and is loss-sensitive.
We now compare the SMBO ensemble approach (ESMBO) to three alternative methods for building a predictor from a machine learning algorithm with hyperparameters:
A single model, whose hyperparameters were selected by hyperparameter optimization with SMBO (SMBO).
A single model, whose hyperparameters were selected by a randomized search (RS), which in practice is often superior to grid search (Bergstra and Bengio, 2012).
An Agnostic Bayes ensemble constructed from a randomly selected set of hyperparameters (ERS).
Both ESMBO and SMBO used GP models of the holdout risk. A constant was used for the mean function, while the Matérn 5/2 kernel was used for the covariance function, with a different length scale parameter for each dimension. The GP's hyperparameters were obtained by maximizing the marginal likelihood. We used the implementation provided by spearmint (https://github.com/JasperSnoek/spearmint).
Each method is allowed to evaluate 150 hyperparameter configurations. To compare their performances, we perform statistical tests on several different hyperparameter spaces over two different collections of data sets.
Here, we describe the hyperparameter spaces of all learning algorithms we employ in our experiments. Except for a custom implementation of the multilayer perceptron, we used scikit-learn (http://scikit-learn.org/) for the implementation of all other learning algorithms.
For the support vector machine, we explore the soft margin parameter over a range of values on a logarithmic scale. We use the RBF kernel and explore its width parameter over a range of values, also on a logarithmic scale.
For the random forest, we fix the number of trees to 100 and we explore two different ways of producing them: either the original Breiman (2001) method or the extremely randomized trees method of Geurts et al. (2006). We also explore the choice of bootstrapping or not the training set before generating a tree. Finally, the ratio of randomly considered features at each split for the construction of the trees is varied over a range on a linear scale.
Gradient boosting is a tree-based algorithm using boosting (Friedman, 2001). We fix the set of weak learners to 100 trees and vary the maximum depth of each tree over a small set of values. The learning rate is explored over a range on a logarithmic scale. Finally, the ratio of randomly considered features at each split for the construction of the trees is varied over a range on a linear scale.
For the multilayer perceptron, we use two hidden layers with the tanh activation function and a softmax function on the last layer. We minimize the negative log-likelihood using the L-BFGS algorithm; thus, there is no learning rate parameter. However, we use a different L2 regularizer weight for each of the 3 layers, each explored over a range on a logarithmic scale. The number of neurons on each hidden layer is also a hyperparameter. In total, this yields a 5-dimensional hyperparameter space.
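The different methods presented in this paper are generic and are meant to work across different tasks. It is thus crucial that we evaluate them on several data sets using metrics that do not assume commensurability across tasks (Demšar, 2006). The metrics of choice are thus the expected rank and the pairwise winning frequency. Let $A_i$ be one of our $k$ model selection/ensemble construction algorithms run on the $j$-th data set, with training set $T_j$ and validation set $S_j$. When comparing $k$ algorithms, the rank of the (best or ensemble) predictor $h_{i,j} = A_i(T_j, S_j)$ on test set $S_j^{\text{test}}$ is defined as
$$\text{Rank}_{i,j} = \sum_{l=1}^{k} I\big[ R_{S_j^{\text{test}}}(h_{l,j}) \le R_{S_j^{\text{test}}}(h_{i,j}) \big].$$
Then, the expected rank of the $i$-th method is obtained from the empirical average over the $n$ data sets, i.e. $\mathbb{E}[\text{Rank}]_i = \frac{1}{n} \sum_{j=1}^{n} \text{Rank}_{i,j}$. When comparing algorithm $A_i$ against algorithm $A_l$, the winning frequency of $A_i$ is
$$\rho_{i,l} = \frac{1}{n} \sum_{j=1}^{n} I\big[ R_{S_j^{\text{test}}}(h_{i,j}) < R_{S_j^{\text{test}}}(h_{l,j}) \big].$$
We deal with ties by attributing 0.5 to each method, except for the sign test, where the sample is simply discarded.
In the case of the expected rank, lower is better and for the winning frequency, it is the converse. Also, when $k = 2$ and in the absence of ties, $\mathbb{E}[\text{Rank}]_i = 2 - \rho_{i,l}$.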
When the winning frequency $\rho_{i,l} > 0.5$, we say that method $A_i$ is better than method $A_l$. However, to make sure that this is not the outcome of chance, we use statistical tests such as the sign test and the Poisson Binomial test (PB test) (Lacoste et al., 2012). The PB test derives a posterior distribution over $\rho_{i,l}$ and integrates the probability mass above $0.5$, denoted as $\Pr(A_i \succ A_l)$. When this probability exceeds a first threshold, we say that the result is significant, and when it exceeds a second, stricter threshold, we say that it is highly significant. Similarly for the sign test, when the $p$-value is lower than 0.1, the result is significant, and when it falls below a stricter level, it is highly significant.
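The rank, the winning frequency and the sign test can be computed directly from a matrix of test risks with one row per method and one column per data set. The sketch below is an assumption-laden illustration: the function names are hypothetical, the sign test is implemented as a one-sided exact binomial test (the text does not say which variant was used), and the PB test is left out because its details are given in Lacoste et al. (2012) rather than here.

```python
import math

import numpy as np


def ranks(test_risks):
    """Rank of each method on each data set (1 is best); input shape (k methods, n data sets)."""
    # Entry [i, j] counts the methods l whose risk on data set j is <= that of method i.
    return (test_risks[None, :, :] <= test_risks[:, None, :]).sum(axis=1)


def expected_rank(test_risks):
    """Average rank of each method over the data sets."""
    return ranks(test_risks).mean(axis=1)


def winning_frequency(risks_i, risks_l):
    """Fraction of data sets where method i beats method l; a tie counts 0.5."""
    return float(np.mean((risks_i < risks_l) + 0.5 * (risks_i == risks_l)))


def sign_test_p_value(risks_i, risks_l):
    """One-sided exact sign test for 'i wins more often than chance'; ties are discarded."""
    wins = int(np.sum(risks_i < risks_l))
    n = wins + int(np.sum(risks_i > risks_l))
    return sum(math.comb(n, w) for w in range(wins, n + 1)) / 2 ** n
```

For two methods and no ties, the expected rank of a method equals 2 minus its winning frequency, which gives a quick sanity check on the two metrics.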
To build a substantial collection of data sets, we used the AYSU collection (Ulaş et al., 2009), coming from the UCI and the Delve repositories, and we added the MNIST data set. We also converted the multiclass data sets to binary classification by either merging classes or selecting pairs of classes. The resulting benchmark contains 39 data sets. We have also collected 22 regression data sets from the Louis Torgo collection (http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html).
The result tables present the winning frequency for each pair of methods, where grayed out values represent redundant information. As a complement, we also add the expected rank of each method in the rightmost column and sort the table according to this metric. To report the conclusion of the Sign test and the PB test, we use colored dots, where orange means significant and green means highly significant. The first dot reports the result of the PB test and the second one, the Sign test. For more stable results, we average the values obtained during the last 15 iterations.
Looking at the overall results over 7 different hyperparameter spaces in Table 1 and Table 2, we observe that ESMBO is never significantly outperformed by any other method and often outperforms the others. More precisely, it is either ranked first or closely follows ERS. Looking more closely, we see that the cases where ESMBO does not significantly outperform ERS concern hyperparameter spaces of low complexity. For example, most hyperparameter configurations of Random Forest yield good generalization performances. Thus, these cases do not require an elaborate hyperparameter search method. On the other hand, when looking at more challenging hyperparameter spaces such as Support Vector Regression and Multilayer Perceptrons, we clearly see the benefits of combining SMBO with Agnostic Bayes.
As described in Section 4, ESMBO alternates between different SMBO optimizations and deviates from the natural sequence of SMBO. To see if this aspect of ESMBO can influence its convergence rate, we present a temporal analysis of the methods in Figure 1 and Figure 2. The left columns depict the pairwise comparison for selected pairs of methods and the right columns present the expected rank of each method over time.
A general analysis clearly shows that there is no significant degradation in terms of convergence speed. In fact, we generally observe the opposite. More precisely, the green curve of the left columns usually reaches a significantly better state right at the beginning or within the first few iterations. A notable exception to that trend occurs with the Multilayer Perceptrons, where SMBO is significantly better than ESMBO for a few iterations at the beginning. Then, it gets quickly outperformed by ESMBO.
We described a successful method for automatically constructing ensembles without requiring hand-selection of models or a grid search. The method adapts the SMBO hyperparameter optimization algorithm so that it produces an ensemble instead of a single model. Theoretically, the method is motivated by an Agnostic Bayesian paradigm which attempts to construct ensembles that reflect the uncertainty over which model actually has the smallest true risk. The resulting method is easy to implement and comes with no extra computational cost at learning time. Its generalization performance and convergence speed are also dominant according to experiments on 22 regression and 39 classification data sets.