1 Introduction
Algorithm selection and hyper-parameter optimization are core parts of automatic machine learning (AutoML). Previous AutoML approaches often define the search space as the algorithm selection space [Brazdil et al.2003, Adankon and Cheriet2009, Biem2003], the hyper-parameter space [Hu et al.2018, Hu et al.2019], or the joint of both spaces (the CASH problem) [Feurer et al.2015, Thornton et al.2013]. While the joint space allows a more thorough search that could cover potentially better configurations, its huge size is a barrier to effective search in limited time. Moreover, the joint space is quite redundant when considering any single algorithm, since the hyper-parameters of the other algorithms are irrelevant to it. The joint space therefore contains redundancy and can even be misleading.
The cascaded algorithm selection has two levels [Jamieson and Talwalkar2016]. The first level is the hyper-parameter optimization, which only needs to focus on the space of the selected algorithm, not the hyper-parameters of all algorithms. The second level is the algorithm selection. However, previous methods of this kind commonly carry out a full hyper-parameter optimization on every candidate algorithm, making the algorithm evaluations slow and expensive.
In this paper, we propose a cascaded algorithm selection approach that avoids a full-space hyper-parameter optimization. Hyper-parameter optimization usually employs stepping search methods, which can be paused after every search step and resumed later. The selection level receives feedback and allocates the next search step to one of the algorithms. Thus, the cascaded algorithm selection is naturally modeled as a multi-armed bandit problem [Auer et al.2002]. However, most classical bandits maximize the average feedback, whereas in AutoML only the best feedback matters. A variant of the bandit, the extreme bandit [Carpentier and Valko2014], models this situation by trying to identify the arm with the maximal (or equivalently minimal) feedback value. However, as the extreme bandit relies on extreme-value distributions, it is not only unstable but often requires knowing the distribution type, which makes the extreme bandit approach impractical.
In this paper, we propose the extreme-region UCB bandit (ER-UCB), which focuses on the extreme region of the feedback distributions. Unlike the extreme bandit, ER-UCB considers a region instead of the extreme point, which leads to a better mathematical condition. Moreover, in machine learning, where the test data commonly differs from the training data, the extreme region can be more robust for generalization. With $K$ arms and $n$ trials, our analysis proves that ER-UCB has an $O(\ln n)$ regret upper bound, which has the same order as the classical UCB strategy. Experiments on synthetic and real AutoML tasks reveal that ER-UCB can find the best algorithm precisely, and exploit it with the majority of the trial budget.
The remaining sections present the background and related work, the extreme-region UCB bandit, the experiments, and the conclusion.
2 Background & Related Works
We consider algorithm selection and hyper-parameter optimization on classification tasks. Let $D_{train}$ and $D_{test}$ denote the training and testing datasets. Let $\mathcal{A} = \{A_1, A_2, \dots, A_K\}$ denote the algorithm set with $K$ candidates. For $A_i \in \mathcal{A}$, $\theta \in \Theta_i$ denotes a hyper-parameter setting, where $\Theta_i$ is the hyper-parameter space of $A_i$. Let $f(A_i, \theta; D_{train}, D_{test})$ denote a performance criterion for a configuration $(A_i, \theta)$, e.g., accuracy, AUC score, etc. The AutoML problem can be formulated as follows:
$$(A^*, \theta^*) = \mathop{\arg\max}_{A_i \in \mathcal{A},\, \theta \in \Theta_i} f(A_i, \theta; D_{train}, D_{test}), \qquad (1)$$
where $A^* \in \mathcal{A}$ and $\theta^* \in \Theta_{A^*}$. This also covers the CASH problem formulation [Feurer et al.2015].
Because of its non-convex, non-continuous and non-differentiable properties, derivative-free optimization [Yu et al.2016, Hu et al.2017] is usually applied to solve it. For example, a tree-structured Bayesian optimization method (SMAC) [Hutter et al.2011] is employed by AutoWEKA [Thornton et al.2013] and AutoSKLEARN [Feurer et al.2015], the popular open-source AutoML tools. Derivative-free optimization explores the search space by sampling and evaluating, but the high time cost restricts the total number of evaluations in AutoML. With limited trials, the performance of derivative-free optimization is very sensitive to the search space. In the above formulation, the search space is the joint space $\bigcup_{i=1}^{K} \{A_i\} \times \Theta_i$. This space is obviously redundant, because the best configuration is only relevant to the hyper-parameter space of the best algorithm.
Hence, we consider an easier formulation, i.e., optimizing the hyper-parameters of the algorithms separately:
$$A^* = \mathop{\arg\max}_{A_i \in \mathcal{A}} \max_{\theta \in \Theta_i} f(A_i, \theta; D_{train}, D_{test}). \qquad (2)$$
The $K$ hyper-parameter optimization processes can be seen as arms, and the algorithm selection level is then a multi-armed bandit problem. The bandit is a classical formulation of the resource allocation problem. In [Felício et al.2017], the authors formulated cold-start user recommendation as a multi-armed bandit problem, in which user information was unavailable at the beginning and the feedback of users had to be obtained by trials. In this situation, the bandit cares more about the average feedback of the arms. In [Cicirello and Smith2005], the authors proposed the max $k$-armed bandit, which focuses on the maximum feedback over trials. But it assumed that the reward distribution was Gaussian, and it was designed for heuristic search, in which more than one arm can be selected at a trial step.
In this paper, we customize the extreme-region UCB (ER-UCB) bandit for AutoML problems.
3 Extreme-Region UCB Bandit
In this section, we present the details of ER-UCB: the bandit formulation for AutoML, the derivation of the ER-UCB strategy, and the theoretical analysis of the strategy.
3.1 Bandit formulation for AutoML
In the classical multi-armed bandit, the feedbacks of an arm obey an underlying distribution. In this paper, we employ random search for the hyper-parameter optimization. A trial on algorithm $A_i$ uniformly samples a hyper-parameter setting from $\Theta_i$, and its performance is the feedback of this trial. Thus, $x \sim \mathcal{D}_i$, where $x$ denotes the feedback of a trial on $A_i$, and $\mathcal{D}_i$ is the underlying performance distribution of $A_i$. Because of the random search, $\mathcal{D}_i$ is fixed. With $K$ algorithm candidates, let $\{\mathcal{D}_1, \dots, \mathcal{D}_K\}$ denote the performance distribution set. The $K$-armed bandit formulation for AutoML is: at the $t$-th trial, an algorithm is selected from the $K$ candidates, and a feedback is drawn independently from its distribution.
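As a concrete illustration, the following minimal Python sketch shows what one arm pull looks like under this formulation. The names pull_arm, hp_space and evaluate are hypothetical helpers, not from the paper: hp_space maps each hyper-parameter to a numeric range, and evaluate returns a validation score.

```python
import random

def pull_arm(algorithm, hp_space, evaluate):
    """One bandit trial: uniformly sample a setting theta from the
    algorithm's hyper-parameter space and return its feedback.

    hp_space maps each hyper-parameter name to a (low, high) range;
    evaluate scores a configuration, e.g., by validation accuracy.
    Both are hypothetical placeholders, not names from the paper."""
    theta = {name: random.uniform(low, high)
             for name, (low, high) in hp_space.items()}
    return evaluate(algorithm, theta), theta  # a draw from D_i, plus the setting
```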
In AutoML tasks, the selected algorithm is required to achieve the maximum performance. For this requirement, we first present the extreme-region target for the proposed bandit, and then show the derivation details of the extreme-region UCB strategy.
3.2 The Extreme-Region UCB Strategy
3.2.1 Extreme-region target
The target of the hyper-parameter optimization is to find the hyper-parameters with the maximum performance. In the bandit, with a fixed threshold $v$, we want the probability $\Pr_{x \sim \mathcal{D}_i}(x \geq v)$ to be as large as possible. With the Chebyshev inequality $\Pr(|x - \mu| \geq \epsilon) \leq \frac{\sigma^2}{\epsilon^2}$, let $\epsilon = v - \mu_i$ for $v > \mu_i$:
$$\Pr_{x \sim \mathcal{D}_i}(x \geq v) \leq \frac{\sigma_i^2}{(v - \mu_i)^2}, \qquad (3)$$
where $\mu_i$ and $\sigma_i^2$ are the expectation and variance of $\mathcal{D}_i$. In other words, fixing the probability upper bound in (3) to the same value $\delta$ for all arms, i.e., solving $\frac{\sigma_i^2}{(v - \mu_i)^2} = \delta$ for $v$, the arm whose extreme region reaches the highest threshold is the best, and the best arm selection is:
$$i^* = \mathop{\arg\max}_{i \in \{1, \dots, K\}} \left( \mu_i + \frac{1}{\sqrt{\delta}} \sigma_i \right). \qquad (4)$$
Given $\mu_i$ and $\sigma_i$, the ground-truth selection strategy is (4). In the bandit process, however, these quantities must be estimated from observations. The expectation can be directly bounded together with its estimation, but for the variance it is hard to find such a relationship. With the variance definition:
$$\sigma_i^2 = \mathbb{E}[x^2] - \mu_i^2, \qquad (5)$$
$\mathbb{E}[x^2]$ is the expectation of the random variable $x^2$, so the Markov inequality can be applied to it easily, and $\mathbb{E}[x^2]$ can partly represent $\sigma_i^2$ according to (5). Thus, we try to replace $\sigma_i$ with $\sqrt{\mathbb{E}[x^2]}$:
$$i^* = \mathop{\arg\max}_{i \in \{1, \dots, K\}} \left( \mu_i + \frac{1}{\sqrt{\delta}} \sqrt{\mathbb{E}[x^2]} \right). \qquad (6)$$
Compared with (4), because $\sqrt{\mathbb{E}[x^2]} = \sqrt{\sigma_i^2 + \mu_i^2}$, (6) magnifies the effect of the expectation item on the selection strategy. To tackle this issue, we introduce a hyper-parameter $b$ and construct a new random variable $y = x - b$. Furthermore, let $\mu'_i = \mathbb{E}[y] = \mu_i - b$ and $\nu_i = \mathbb{E}[y^2]$. Thus, the extreme-region target is:
$$i^* = \mathop{\arg\max}_{i \in \{1, \dots, K\}} \left( \mu'_i + \frac{1}{\sqrt{\delta}} \sqrt{\nu_i} \right). \qquad (7)$$
We prove that introducing $b$ into $x$ reduces the effect of the expectation on the algorithm selection. According to the definitions of $y$, $\mu'_i$ and $\nu_i$:
$$\mu'_i + \frac{1}{\sqrt{\delta}} \sqrt{\nu_i} = (\mu_i - b) + \frac{1}{\sqrt{\delta}} \sqrt{\sigma_i^2 + (\mu_i - b)^2}. \qquad (8)$$
Comparing with (4), because $0 \leq \mu_i - b < \mu_i$ for $0 < b \leq \mu_i$, the item of the expectation is reduced, but the item of the variance stays the same. This concludes the proof. ∎
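To make the effect of $b$ concrete, here is a small numeric check with two hypothetical arms whose parameters are invented for illustration: criterion (6), i.e., $b = 0$, lets the mean dominate and flips the ranking of (4), while a shift $b$ near the means restores it.

```python
from math import sqrt

def score(mu, sigma, delta, b):
    """Extreme-region score (mu - b) + sqrt(sigma^2 + (mu - b)^2) / sqrt(delta);
    b = 0 recovers criterion (6), since E[x^2] = sigma^2 + mu^2."""
    m = mu - b
    return m + sqrt(sigma ** 2 + m ** 2) / sqrt(delta)

delta = 0.1
arms = [(0.70, 0.05), (0.65, 0.10)]              # hypothetical (mu, sigma) pairs
print([mu + s / sqrt(delta) for mu, s in arms])  # criterion (4): arm 2 is best

print([score(mu, s, delta, 0.0) for mu, s in arms])   # (6): arm 1 wrongly wins
print([score(mu, s, delta, 0.67) for mu, s in arms])  # (7), b near means: arm 2 wins again
```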
3.2.2 Extreme-region UCB strategy
We apply the upper confidence bound (UCB) strategy to the extreme-region target. In this paper, we assume that the feedback random variables satisfy the following moment condition: there exists a convex function $\psi$ on the reals such that, for all $\lambda \geq 0$,
$$\ln \mathbb{E}\left[ e^{\lambda (x - \mathbb{E}[x])} \right] \leq \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E}\left[ e^{\lambda (\mathbb{E}[x] - x)} \right] \leq \psi(\lambda). \qquad (9)$$
If we let $x \in [0, 1]$ and $\psi(\lambda) = \frac{\lambda^2}{8}$, (9) is known as Hoeffding's lemma. We apply this assumption to construct an upper bound for the estimated expectations at some fixed confidence level. Let $\psi^*(\epsilon) = \sup_{\lambda} (\lambda \epsilon - \psi(\lambda))$ denote the Legendre-Fenchel transform of $\psi$. With $s$ observations of arm $i$, let $\hat{\mu}'_{i,s}$ and $\hat{\nu}_{i,s}$ denote the estimated expectations of $y$ and $y^2$. For any $\epsilon > 0$ with a fixed $\lambda \geq 0$, using the Markov inequality:
$$\Pr\left( \mu'_i - \hat{\mu}'_{i,s} > \epsilon \right) \leq e^{-s \psi^*(\epsilon)}. \qquad (10)$$
The same deduction holds for $\nu_i$, and since $\sqrt{\cdot}$ is a monotonically increasing function:
$$\Pr\left( \sqrt{\nu_i} > \sqrt{\hat{\nu}_{i,s} + \epsilon} \right) \leq e^{-s \psi^*(\epsilon)}. \qquad (11)$$
Because the events in (10) and (11) each fail with probability at most $e^{-s \psi^*(\epsilon)}$, with the union bound we combine $\hat{\mu}'_{i,s}$ and $\hat{\nu}_{i,s}$ as follows:
$$\Pr\left( \mu'_i + \frac{1}{\sqrt{\delta}} \sqrt{\nu_i} > \hat{\mu}'_{i,s} + \epsilon + \frac{1}{\sqrt{\delta}} \sqrt{\hat{\nu}_{i,s} + \epsilon} \right) \leq 2 e^{-s \psi^*(\epsilon)}. \qquad (12)$$
Let $\epsilon = (\psi^*)^{-1}\left( \frac{\alpha \ln t}{s} \right)$ for some $\alpha > 0$. With probability at least $1 - 2t^{-\alpha}$,
$$\mu'_i + \frac{1}{\sqrt{\delta}} \sqrt{\nu_i} \leq \hat{\mu}'_{i,s} + \frac{1}{\sqrt{\delta}} \sqrt{\hat{\nu}_{i,s} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{s} \right). \qquad (13)$$
Within $n$ total trials, let $T_i(n)$ denote the number of times that the $i$-th arm is selected, with $\sum_{i=1}^{K} T_i(n) = n$. At trial $t$, the $(\alpha, \psi)$-ER-UCB strategy selects:
$$I_t = \mathop{\arg\max}_{i \in \{1, \dots, K\}} \left[ \hat{\mu}'_{i, T_i(t-1)} + \frac{1}{\sqrt{\delta}} \sqrt{\hat{\nu}_{i, T_i(t-1)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right)} + (\psi^*)^{-1}\!\left( \frac{\alpha \ln t}{T_i(t-1)} \right) \right]. \qquad (14)$$
The first two terms and the last term of (14) are the exploitation and exploration items, respectively. With Hoeffding's lemma, taking $\psi(\lambda) = \frac{\lambda^2}{8}$, then $\psi^*(\epsilon) = 2\epsilon^2$. And let $\alpha = 4$. The exploration item can be re-written as:
$$(\psi^*)^{-1}\!\left( \frac{4 \ln t}{T_i(t-1)} \right) = \sqrt{\frac{2 \ln t}{T_i(t-1)}}.$$
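As a quick check of this step, the Legendre-Fenchel transform can be worked out directly:
$$\psi^*(\epsilon) = \sup_{\lambda} \left( \lambda \epsilon - \frac{\lambda^2}{8} \right), \qquad \frac{\partial}{\partial \lambda}\left( \lambda \epsilon - \frac{\lambda^2}{8} \right) = 0 \;\Rightarrow\; \lambda = 4\epsilon \;\Rightarrow\; \psi^*(\epsilon) = 4\epsilon^2 - 2\epsilon^2 = 2\epsilon^2,$$
so $(\psi^*)^{-1}(u) = \sqrt{u/2}$, and with $\alpha = 4$ indeed $(\psi^*)^{-1}\left( \frac{4 \ln t}{s} \right) = \sqrt{\frac{2 \ln t}{s}}$.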
Thus, the Hoeffding's ER-UCB strategy is:
$$I_t = \mathop{\arg\max}_{i \in \{1, \dots, K\}} \left[ \hat{\mu}'_{i, T_i(t-1)} + \frac{1}{\sqrt{\delta}} \sqrt{\hat{\nu}_{i, T_i(t-1)} + \sqrt{\frac{2 \ln t}{T_i(t-1)}}} + \sqrt{\frac{2 \ln t}{T_i(t-1)}} \right]. \qquad (15)$$
Because feedbacks on AutoML tasks are bounded in $[0, 1]$, the exploitation item is often much smaller than the exploration item. To further trade off exploration and exploitation, we introduce a hyper-parameter $\eta$. The practical Hoeffding's ER-UCB strategy is:
$$I_t = \mathop{\arg\max}_{i \in \{1, \dots, K\}} \left[ \eta \left( \hat{\mu}'_{i, T_i(t-1)} + \frac{1}{\sqrt{\delta}} \sqrt{\hat{\nu}_{i, T_i(t-1)} + \sqrt{\frac{2 \ln t}{T_i(t-1)}}} \right) + \sqrt{\frac{2 \ln t}{T_i(t-1)}} \right]. \qquad (16)$$
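The practical strategy (16) is cheap to compute per arm. Below is a hedged Python sketch of the per-arm index under the reconstruction above; the function name and argument layout are our own, and the placement of the exploration term inside the square root follows the high-probability bound (13).

```python
from math import log, sqrt

def er_ucb_index(mean_y, mean_y2, count, t, delta=0.1, eta=10.0):
    """Per-arm score of the practical Hoeffding's ER-UCB strategy (16).

    mean_y  : empirical mean of y = x - b for this arm
    mean_y2 : empirical mean of y^2 for this arm
    count   : T_i(t-1), times this arm has been selected so far
    t       : current trial index
    The exploration term sqrt(2 ln t / count) also enters under the
    square root of the second-moment estimate, following bound (13)."""
    explore = sqrt(2.0 * log(t) / count)
    exploit = mean_y + sqrt(max(mean_y2 + explore, 0.0)) / sqrt(delta)
    return eta * exploit + explore
```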
The cascaded algorithm selection and hyper-parameter optimization with the ER-UCB bandit is presented in Algorithm 1. Lines 2 and 7 are the procedures of uniformly sampling hyper-parameters for the selected algorithm and obtaining the feedbacks. Lines 1 to 4 are the initialization steps. In the main loop (lines 5 to 10), the algorithm is selected by the ER-UCB strategy (line 6). Lines 7 to 9 are the procedures for updating the exploitation item of the selected algorithm.
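Since Algorithm 1 itself is not reproduced here, the following Python sketch mirrors its structure as described: initialize every arm once, then repeatedly select by the ER-UCB index and update the statistics. It reuses the hypothetical pull_arm and er_ucb_index helpers from the earlier sketches; it is our reading of the algorithm, not the authors' exact pseudo-code.

```python
def er_ucb_automl(algorithms, hp_spaces, evaluate, budget,
                  delta=0.1, eta=10.0, b=0.5):
    """Sketch of the cascaded loop in Algorithm 1 (our reading of it).
    Feedbacks x are shifted to y = x - b before the per-arm statistics."""
    K = len(algorithms)
    stats = [{"n": 0, "sy": 0.0, "sy2": 0.0} for _ in range(K)]
    best = (float("-inf"), None)  # best feedback seen, and its configuration

    def update(i):
        nonlocal best
        x, theta = pull_arm(algorithms[i], hp_spaces[i], evaluate)
        y = x - b
        s = stats[i]
        s["n"] += 1
        s["sy"] += y
        s["sy2"] += y * y
        if x > best[0]:
            best = (x, (algorithms[i], theta))

    for i in range(K):                  # lines 1-4: initialize every arm once
        update(i)
    for t in range(K + 1, budget + 1):  # lines 5-10: main ER-UCB loop
        i = max(range(K), key=lambda j: er_ucb_index(
            stats[j]["sy"] / stats[j]["n"], stats[j]["sy2"] / stats[j]["n"],
            stats[j]["n"], t, delta, eta))
        update(i)                       # lines 7-9: update exploitation item
    return best
```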
We now discuss the hyper-parameters $\delta$, $\eta$ and $b$ of the ER-UCB bandit. $\delta$ is employed to control the size of the extreme region; it is usually a small real number, e.g., 0.1 or 0.01. $\eta$ is the exploration-and-exploitation trade-off hyper-parameter; in AutoML tasks, $\eta$ is used to magnify the exploitation item, so it is usually a big number such as 10 or 20. $b$ is applied to reduce the impact of the expectation item in the selection strategy; it should be tuned according to the task. In the experiments, we investigate them empirically.
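For concreteness, a hypothetical invocation of the sketch above with hyper-parameters of the suggested magnitudes (a small $\delta$, a large $\eta$, and a guessed task-dependent $b$); the evaluator is a random placeholder standing in for real training plus validation.

```python
import random

def evaluate(algorithm, theta):   # placeholder evaluator: replace with real
    return random.random()        # training plus validation accuracy in [0, 1]

algorithms = ["RF", "SGD"]                                   # hypothetical arms
hp_spaces = [{"max_depth": (2.0, 20.0)}, {"alpha": (1e-5, 1e-1)}]
best_x, best_conf = er_ucb_automl(algorithms, hp_spaces, evaluate,
                                  budget=1000, delta=0.1, eta=10.0, b=0.6)
```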
3.3 Theoretical Analysis
We present the analysis of the extreme-region regret upper bounds for the $(\alpha, \psi)$-ER-UCB strategy (14) and the Hoeffding's ER-UCB strategy (15). For an arbitrary arm $i$ and a fixed $\delta$, we define $g_i = \mu'_i + \frac{1}{\sqrt{\delta}} \sqrt{\nu_i}$. Thus, the best arm is $i^* = \arg\max_i g_i$. According to (7), let $g^* = \max_i g_i$, thus $\Delta_i = g^* - g_i$ and $\Delta_i \geq 0$. We assume the best arm is unique, i.e., $\Delta_i > 0$ for all $i \neq i^*$, by choosing an appropriate $b$. The extreme-region regret is given in Definition 1.
Definition 1 (Extreme-region regret).
At the $n$-th trial, let event A denote that a feedback falls into the extreme region when the best arm is always selected, and let event B denote the same event under a given strategy. Let $N_A(n)$ and $N_B(n)$ be the numbers of times that A and B occur within $n$ trials. The extreme-region regret is:
$$R_n = \mathbb{E}[N_A(n)] - \mathbb{E}[N_B(n)].$$
Introducing $\Delta_i$ and $T_i(n)$, the extreme-region regret can be re-written as:
$$R_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)].$$
We can prove the following simple upper regret bound for the $(\alpha, \psi)$-ER-UCB strategy:
Theorem 1 (Regret of $(\alpha, \psi)$-ER-UCB).
Assume the feedback distribution of every arm satisfies (9). With $\alpha > 2$, $(\alpha, \psi)$-ER-UCB satisfies:
$$R_n \leq \sum_{i \neq i^*} \Delta_i \left( \frac{\alpha \ln n}{\psi^*(\Delta_i / 2)} + \frac{\alpha}{\alpha - 2} \right).$$
Due to the page limit, we present the proof details in the supplementary material. Based on Theorem 1, we can easily prove the extreme-region regret bound of the Hoeffding's ER-UCB strategy:
Corollary 1 (Regret of Hoeffding’s ER-UCB).
Assume the feedback distribution of every arm satisfies (9). With $\alpha = 4$ and $\psi^*(\epsilon) = 2\epsilon^2$, Hoeffding's ER-UCB satisfies:
$$R_n \leq \sum_{i \neq i^*} \left( \frac{8 \ln n}{\Delta_i} + 2 \Delta_i \right).$$
According to the theoretical analysis, the ER-UCB bandit has an $O(\ln n)$ upper bound on the extreme-region regret.
4 Experiments
In this section, we empirically investigate the effectiveness of the ER-UCB bandit on synthetic and real-world AutoML tasks. Several representative bandit strategies are selected as the compared methods, including the classical UCB (C-UCB) [Bubeck et al.2012], $\epsilon$-greedy [Sutton and Barto2018], the Softmax strategy [Tokic and Palm2011], and the random strategy, which allocates the budget by selecting arms randomly. In addition, we apply random search on the joint hyper-parameter space of all algorithms (Joint) to compare with the cascaded hyper-parameter optimization.
4.1 Synthetic problem
We construct a 7-armed bandit problem in this section. The feedbacks of the arms obey Gaussian distributions with different expectations and variances. The best arm is determined not only by the expectation but also by the variance: the arm whose extreme region reaches the highest values is the one most likely to yield the best feedback when exploited. We first study the three hyper-parameters of ER-UCB, and then compare ER-UCB with the other methods.
4.1.1 Hyper-parameter study
We investigate $\delta$, $\eta$ and $b$ for the ER-UCB. Fixing two of them, we study the remaining one: with $\eta$ and $b$ fixed, we study $\delta$; with $\delta$ and $b$ fixed, we study $\eta$; with $\delta$ and $\eta$ fixed, we study $b$. For every hyper-parameter, we evenly sample 1000 settings from its setting region. The core question we care about is how the methods allocate the budget to the arms. Let $r_i = T_i(n)/n$ define the exploitation rate of arm $i$; a large $r_i$ means the arm is selected in a large number of trials. The trial budget is set to 1000. The experiment for every hyper-parameter setting is repeated 3 times independently, and the average results are presented. A minimal simulation harness of this protocol is sketched below.
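The sketch below measures exploitation rates for any selection strategy; the seven Gaussian parameters are invented stand-ins, since the paper's exact settings are not given above, chosen so that one arm has a slightly lower mean but a larger variance.

```python
import random

# Invented stand-ins for the seven Gaussian arms (the paper's exact
# parameters are omitted above): the fourth arm has a lower mean than the
# third but a larger variance, so its extreme region reaches higher values.
ARMS = [(0.45, 0.02), (0.50, 0.03), (0.60, 0.02), (0.58, 0.08),
        (0.40, 0.05), (0.55, 0.01), (0.30, 0.10)]

def simulate(select, budget=1000):
    """Run a selection strategy and report exploitation rates r_i = T_i(n)/n.
    select(history) maps the trial history to an arm index."""
    counts = [0] * len(ARMS)
    history = []
    for _ in range(budget):
        i = select(history)
        mu, sigma = ARMS[i]
        history.append((i, random.gauss(mu, sigma)))
        counts[i] += 1
    return [c / budget for c in counts]

# e.g., uniform random selection as a baseline:
print(simulate(lambda history: random.randrange(len(ARMS))))
```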
Figures 1:a.1, a.2 and a.3 show the study results for $\delta$, $\eta$ and $b$, respectively. The green line corresponds to the best arm, so the larger its exploitation rate the better. For $\delta$ (Figure 1:a.1), the green line approaches 1 as $\delta$ nears 0; in practice, $\delta$ should be set to a small value. For $\eta$ (Figure 1:a.2), when $\eta$ is small, the exploitation rates of the arms are similar, and the green line increases as $\eta$ increases. It means that a small $\eta$ encourages exploration and a large $\eta$ encourages exploitation of the observations. For $b$ (Figure 1:a.3), the exploitation rates are sensitive to $b$ when $b$ is around the expectations of the reward distributions. Thus, $b$ should be carefully tuned according to the task.
4.1.2 Investigation with compared methods
According to the hyper-parameter study results, we set $\delta$, $\eta$ and $b$ for the ER-UCB accordingly, and compare it with C-UCB, $\epsilon$-greedy, the Softmax strategy and the random selection strategy. The trial budget is 1000. Every experiment is repeated 3 times independently, and the average performances are presented in Table 1.
Table 1 shows that the ER-UCB outperforms the compared methods. Furthermore, the ER-UCB can find the best arm and allocates most of the budget to it (its average exploitation rate is 0.9). Because the C-UCB depends only on the mean of the observations to make decisions, it wrongly allocates budget to the arm with the best average feedback. The exploitation rate of $\epsilon$-greedy is very unstable, which means $\epsilon$-greedy cannot find the best arm effectively. In general, the ER-UCB can effectively discover the best arm and reasonably allocate the budget between exploration and exploitation in this synthetic problem.
4.2 Real-world AutoML tasks
We apply the ER-UCB to real-world classification tasks. We select 10 frequently-used algorithms from SKLEARN [Pedregosa et al.2011] as the candidates, including DecisionTree (DT), AdaBoost (Ada), QuadraticDiscriminantAnalysis (QDA), GaussianNB (GNB), BernoulliNB (BNB), K-Neighbors (KN), ExtraTree (ET), PassiveAggressive (PA), RandomForest (RF) and SGD. And 12 classification datasets from UCI are selected as the AutoML tasks. The evaluation criterion of each configuration is the accuracy score. The compared methods are C-UCB, $\epsilon$-greedy, the Softmax strategy, the random strategy and Joint. The trial budget is 1000. We fix $\delta$ and $\eta$ for the ER-UCB on all datasets; $b$ is set according to the task and shown in Table 2. For each method and each dataset, we run every experiment 3 times independently, and the average performances are presented. In addition, we apply random search with 1000 trials to explore every algorithm candidate; according to (2), this reveals the ground-truth best algorithm for each dataset.
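As an illustration of what one arm pull costs on a real task, the sketch below evaluates a RandomForest candidate with uniformly sampled hyper-parameters; the dataset (iris) and the sampled ranges are stand-ins for the paper's UCI tasks and search spaces.

```python
import random
from sklearn.datasets import load_iris          # stand-in for a UCI dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def pull_rf_arm():
    """One trial of the RandomForest arm: uniformly sample a setting from
    an illustrative subspace (not the paper's exact ranges) and return the
    accuracy feedback."""
    model = RandomForestClassifier(n_estimators=random.randint(10, 200),
                                   max_depth=random.randint(2, 20),
                                   random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

print(pull_rf_arm())
```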
(Table 2: per-dataset results; columns: Dataset, Methods, V-Eval, B. Alg., T-Eval.)
“No free lunch” is confirmed again in these experiments: the best-performing algorithm differs across datasets. In particular, tree-based ensemble algorithms, e.g., AdaBoost and RandomForest, show outstanding performance on most of the datasets. This indicates that algorithm selection is necessary for making the hyper-parameter search easier.
The cascaded algorithm selection and hyper-parameter optimization make the search problem easier to solve. Comparing the random strategy with Joint, the random strategy beats Joint on most of the datasets (8/12). It indicates that the large joint search space makes the optimization more difficult.
Selecting algorithms only according to the average performance can mislead the strategy into wrong choices. In Table 2, the random strategy is not always the worst across the datasets. Strategies such as C-UCB, $\epsilon$-greedy and Softmax, which focus on the average performance, easily select wrong algorithms whose average performance is good.
The proposed ER-UCB bandit strategy can effectively find the best-performing algorithm (B. Alg. matches the ground-truth algorithm on 9/12 datasets), and reasonably allocates the trial budget to the best algorithm (ER-UCB achieves the highest exploitation rate on the best algorithm on 12/12 datasets).
5 Conclusion
This paper proposes the extreme-region upper confidence bound (ER-UCB) bandit for cascaded algorithm selection and hyper-parameter optimization. We employ random search at the hyper-parameter optimization level, and formulate the algorithm selection level as a multi-armed bandit problem, in which bandit strategies allocate the limited search budget to the hyper-parameter optimization processes of the algorithm candidates. Because the algorithm selection focuses on the maximum rather than the average performance, we propose the extreme-region UCB strategy, which selects the arm with the largest extreme region of the underlying distribution. The theoretical study shows that ER-UCB has an $O(\ln n)$ extreme-region regret upper bound, which has the same order as the classical UCB strategy. Experiments on synthetic and real-world AutoML problems empirically verify that ER-UCB can precisely discover the algorithm with the best performance, and reasonably allocates the trial budget to the algorithm candidates.
References
- [Adankon and Cheriet2009] Mathias M Adankon and Mohamed Cheriet. Model selection for the LS-SVM. Application to handwriting recognition. Pattern Recognition, 42(12):3264–3270, 2009.
- [Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- [Biem2003] Alain Biem. A model selection criterion for classification: Application to hmm topology optimization. In Proceedings of the 7th International Conference on Document Analysis and Recognition, pages 104–108, 2003.
- [Brazdil et al.2003] Pavel B Brazdil, Carlos Soares, and Joaquim Pinto Da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3):251–277, 2003.
- [Bubeck et al.2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [Carpentier and Valko2014] Alexandra Carpentier and Michal Valko. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089–1097, 2014.
- [Cicirello and Smith2005] Vincent A Cicirello and Stephen F Smith. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In Proceedings of the 20th AAAI Conference on Artificial Intelligence, pages 1355–1361, 2005.
- [Felício et al.2017] Crícia Z Felício, Klérisson VR Paixão, Celia AZ Barcelos, and Philippe Preux. A multi-armed bandit model selection for cold-start user recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, pages 32–40. ACM, 2017.
- [Feurer et al.2015] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
- [Hu et al.2017] Yi-Qi Hu, Hong Qian, and Yang Yu. Sequential classification-based optimization for direct policy search. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2029–2035, 2017.
- [Hu et al.2018] Yi-Qi Hu, Yang Yu, and Zhi-Hua Zhou. Experienced optimization with reusable directional model for hyper-parameter search. In Proceeding of the 27th International Joint Conference on Artificial Intelligence, pages 2276–2282, 2018.
- [Hu et al.2019] Yi-Qi Hu, Yang Yu, Wei-Wei Tu, Qiang Yang, Yuqiang Chen, and Wenyuan Dai. Multi-fidelity automatic hyper-parameter tuning via transfer series expansion. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.
- [Hutter et al.2011] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization (LION), pages 507–523, 2011.
- [Jamieson and Talwalkar2016] Kevin G Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 240–248, 2016.
- [Pedregosa et al.2011] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [Thornton et al.2013] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2013.
- [Tokic and Palm2011] Michel Tokic and Günther Palm. Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence, pages 335–346. Springer, 2011.
- [Yu et al.2016] Yang Yu, Hong Qian, and Yi-Qi Hu. Derivative-free optimization via classification. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2286–2292, 2016.