The goal of many important real-world problems is to optimize an underlying function whose evaluation is expensive, so it is desirable to minimize the number of queries to that function. Problems of this kind include experimental design (chaloner1995bayesian; srinivas2009gaussian; martinez2014bayesopt), online recommendation (ricci2015recommender), data-efficient system control (calandra2016bayesian; lizotte2007automatic; mukadam2017continuous), and tuning neural network hyperparameters (kernel size, pooling size, learning rate and its decay rate). Tuning hyperparameters of neural networks is not simply a gradient-free optimization problem, since some hyperparameter settings are significantly more expensive to evaluate than others. When the user can control the computational resources dedicated to the tuning, algorithms within the stochastic bandit framework should leverage this control to obtain cost-effective performance, as is also suggested in the Bayesian Optimization setting (kandasamy2017multi; song2018general).
A stochastic bandit problem assumes that payoffs are noisy and are drawn from an unchanging distribution. The study of stochastic bandit problems started with the discrete arm setting, where the agent is faced with a finite set of choices. Classic works on this problem date back to the Thompson Sampling problem (thompson1933likelihood), Gittins index (gittins1979bandit), and some Upper Confidence Bound (UCB) methods (lai1985asymptotically; agrawal1995sample). Common solutions to this problem include the ε-greedy algorithms (sutton1998introduction), the UCB-based algorithms (auer2002finite), and the Thompson Sampling algorithms (agrawal2012analysis). These bandit strategies have led to powerful real-life applications. For example, Deep Q-Network (mnih2015human) uses ε-greedy for action exploration, and AlphaGo (silver2016mastering) uses UCT (kocsis2006bandit), which is built on the UCB strategy, for action searching. One recent line of work on stochastic bandit problems considers the case where the arm space is infinite. In this setting, the arms are usually assumed to be in a subset of the Euclidean space (or a more general metric space), and the expected payoff function is assumed to be a function of the arms. Some works along this line model the expected payoff as a linear function of the arms (auer2002using; dani2008stochastic; li2010contextual; abbasi2011improved; agrawal2013thompson; abeille2017linear); some algorithms model the expected payoff as Gaussian processes over the arms (srinivas2009gaussian; contal2014gaussian; de2012exponential; vazquez2007convergence); some algorithms assume that the expected payoff is a Lipschitz function of the arms (slivkins2011contextual; kleinberg2008multi; bubeck2011x); and some assume locally Hölder payoffs on the real line (auer2007improved).
When the arms are continuous and equipped with a metric, and the expected payoff is Lipschitz continuous in the arm space, we refer to the problem as a stochastic Lipschitz bandit problem. In addition, when the agent’s decisions are made with the aid of contextual information, we refer to the problem as a contextual stochastic Lipschitz bandit problem. In this paper, we focus our study on the (contextual) stochastic Lipschitz bandit problems and their application to neural network tuning. While our analysis focuses on the case of Lipschitz expected payoffs, our empirical results demonstrate that our methods can adapt to the landscape of the payoff and leverage properties other than Lipschitzness. This means our methods in practice have much better performance than analyses for Lipschitz bandit problems suggest. Few of the methods listed above consider computational resources as part of the algorithm, even though this is important for neural network tuning and other important problems.
Neural network tuning is challenging, and in recent years, the best performance has relied heavily on human tuning of the network’s hyperparameters. Given a training and validation set, the validation performance metric (e.g. validation accuracy) of the neural network can be viewed as a noisy function (payoff) of hyperparameters (arms), which include the architecture of the network, the learning rate, initialization, the number of training iterations, etc. This turns a neural network tuning problem into a stochastic bandit problem. The arms have different costs: if we train only for a small number of iterations, we use fewer resources. Training for a small number of iterations is often useful to judge initial conditions and architecture. Ideally, the method should be able to choose how long (how many iterations) to train in order to balance between exploration, exploitation, and cost of evaluation. As shown in Section 3, our methods balance these three considerations for the problem of neural network tuning.
Our proposed algorithm, (Contextual) PartitionUCB, maintains a finite partition of the (context-)arm space and uses an Upper Confidence Bound strategy (auer2002finite), as if the partition represents a finite set of (context-)arms. As we observe more data, the partition grows finer. To better deploy (Contextual) PartitionUCB to problems such as neural network tuning, we provide fast implementations using regression trees, since a regression tree corresponds to a partition. We show empirically that our proposed tuning methods outperform existing state-of-the-art methods on benchmark datasets. In summary, our contributions are twofold: 1) we develop a novel stochastic Lipschitz bandit algorithm PartitionUCB and its contextual counterpart Contextual PartitionUCB. For compact domains in , both algorithms exhibit a
regret bound with high probability; 2) we develop TreeUCB (TUCB) and Contextual TreeUCB (CTUCB) as fast implementations of PartitionUCB and Contextual PartitionUCB. We apply TUCB and CTUCB to tuning neural networks. They can achieve the same satisfactory level of accuracy as the average of existing state-of-the-art approaches while saving over 50% of resources (e.g. time, total training iterations) on the MNIST dataset (lecun1998mnist), the SVHN dataset (netzer2011reading) and the CIFAR-10 dataset (krizhevsky2012imagenet).
Related works: From a bandit perspective, the closest works are the Zooming bandit algorithm (and the contextual Zooming bandit algorithm) (slivkins2011contextual; kleinberg2008multi) and the HOO algorithm (bubeck2011x), which inspired our work. Both the Zooming bandit algorithm and the HOO algorithm have excellent theoretical analyses, but suffer from practical inefficiency. The Zooming bandit algorithm (and the contextual Zooming bandit algorithm) keeps a cover of the arm space. As the Zooming bandit runs, some subsets in the cover shrink, resulting in some points of the arm space becoming uncovered. Once this happens, the algorithm needs to introduce new subsets of arms to maintain the cover. The operation of checking whether each arm is covered can be expensive and hard to implement in practice. The HOO algorithm does not require Zooming bandit’s covering oracle, but it is non-contextual. This makes it less general than the Contextual PartitionUCB algorithm. In addition, it requires a fine discretization of the arm space to start with. This initialization requirement makes HOO impractical, since problems like tuning neural networks can have arm spaces with millions of points in discrete spaces or infinitely many points in a parameter space. The practical efficiency of our method is partially a result of maintaining a careful partition of the space. This allows us to use a fast implementation of regression trees (geurts2006extremely).
Empirical results on tuning neural networks for the MNIST dataset (lecun1998mnist), the SVHN dataset (netzer2011reading) and the CIFAR-10 dataset (krizhevsky2012imagenet) indicate that our methods tend to be more efficient and/or effective than 1) the state-of-the-art Bayesian optimization methods: the SPEARMINT package (snoek2012practical), which integrates several Bayesian optimization methods (swersky2013multi; bergstra2011algorithms; snoek2014input; snoek2013bayesian; gelbart2014bayesian), the TPE algorithm (bergstra2011algorithms) and the SMAC algorithm (hutter2011sequential); and 2) methods that focus explicitly on tuning neural networks: the Hyperband algorithm (li2016hyperband) (and the SuccessiveHalving algorithm (jamieson2016non)), the Harmonica algorithm (hazan2017hyperparameter), and random search (bergstra2012random). As suggested by the results in this paper, common bandit algorithms, such as UCB algorithms, can be competitive with the state-of-the-art methods for neural network tuning.
2.1 The PartitionUCB algorithm
Given a stochastic bandit problem, our goal is to locate the global maximum (minimum) of the payoff function via querying the payoff function at a sequence of points. The performance of the algorithm is typically measured by how quickly the algorithm is able to locate the maximum. In this paper, we focus our study on the following setting. A payoff function is defined over an arm space that is a compact subset , and the payoff function of interest is and the actual observations are given by , where . (Footnote 4: PartitionUCB and Contextual PartitionUCB can naturally handle sub-Gaussian noise with zero mean.) For the theory, we assume that the payoff function is Lipschitz in the sense that , for some constant . An agent is interacting with this environment in the following fashion. At each round , based on past observations , the agent makes a query at point and observes the noisy payoff . The observation is revealed only after the agent has made a decision . The agent repeats this procedure with the goal of locating the maximum of . To measure the performance of the agent’s strategy, the concepts of regret and cumulative regret are defined. The regret at round is defined to be
where is the global maximizer of ; and the cumulative regret up to time is defined to be
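For orientation, the setting and the two regret notions can be written out under the standard conventions for Lipschitz bandits. The symbols below ($f$ for the payoff function, $\mathcal{A}$ for the arm space, $a_t$ for the arm played at round $t$, $y_t$ for the observation, and the choice of norm) are assumed notation for this sketch and are not taken from the text above.

```latex
% Assumed notation: f (payoff), \mathcal{A} (arm space), a_t (arm), y_t (observation).
\begin{align*}
  y_t &= f(a_t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2), \\
  |f(a) - f(a')| &\le L \,\|a - a'\| \quad \text{for all } a, a' \in \mathcal{A}, \\
  r_t &= f(a^*) - f(a_t), \qquad a^* \in \arg\max_{a \in \mathcal{A}} f(a), \\
  R_T &= \sum_{t=1}^{T} r_t .
\end{align*}
```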
The PartitionUCB algorithm runs by maintaining a sequence of adaptive finite partitions of the arm space. Intuitively, at each step , PartitionUCB treats the problem as a finite-arm problem with respect to the partition bins at . The partition bins become smaller and smaller as the algorithm runs. At each time , we maintain partition of the input space: for any ,
Each bin in the partition is called a region and by convention .
Before formulating our strategy, we need to put forward several definitions. First of all, based on the partition at time , we define an auxiliary function, the Region Selection function, in Definition 1 to aid our discussion.
Definition 1 (Region Selection function).
Given the partition , a function is called a Region Selection function with respect to if for any , is the region in that contains .
For example, if the arm space , and the partition , then is defined on [0,2] and
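A Region Selection function is easy to sketch for a one-dimensional arm space. The snippet below assumes a hypothetical partition of [0, 2] into [0, 1) and [1, 2], stored as sorted breakpoints; the helper name `make_region_selector` and the interval representation are illustrative choices, not the paper's.

```python
from bisect import bisect_right

def make_region_selector(breakpoints):
    """Return a region selection function for a 1-D partition.

    breakpoints = [0.0, 1.0, 2.0] describes the regions [0, 1) and [1, 2].
    (The interval representation is an illustrative assumption.)
    """
    def select(x):
        if not breakpoints[0] <= x <= breakpoints[-1]:
            raise ValueError("arm lies outside the arm space")
        # Index of the region whose left endpoint is the largest one <= x;
        # the right endpoint of the arm space belongs to the last region.
        i = min(bisect_right(breakpoints, x) - 1, len(breakpoints) - 2)
        return (breakpoints[i], breakpoints[i + 1])
    return select

p = make_region_selector([0.0, 1.0, 2.0])
```

Calling `p(0.3)` returns the interval with endpoints 0 and 1, while `p(1.0)` and `p(2.0)` both return the interval with endpoints 1 and 2.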
As the name PartitionUCB suggests, our algorithm is an Upper Confidence Bound (UCB) strategy. In order to define our Upper Confidence Bound, we first define the count function, the corrected count function, and the corrected average function in Definition 2.
Let be the partition of at time () and let be the Region Selection function associated with . Let be the observations received up to time (). We define
the count function , such that
and by convention, ;
the corrected count function , such that
the corrected average function , such that
In words, is the number of points among that are in the same region as arm , with regions defined by . It is important to note that the domain of the function is the arm space – although when computing with , we need to locate the region that contains first (using the region selection function), the function always takes an arm as input. The functions and are defined in a similar fashion – all are defined over the arm space , with respect to the partition and the data . When , we simplify the notations to , to , and to . We also denote by the diameter of , and .
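The three functions can be illustrated with a small sketch. The count follows the description in words above; the two corrections (taking the maximum of 1 and the count, and returning a fixed value `empty_average` for a region with no observations) are assumptions made for illustration, as are all names in the snippet.

```python
def region_stats(arms, payoffs, select, a, empty_average=1.0):
    """Count, corrected count and corrected average at arm `a`.

    `select` maps an arm to its region. The corrections used here
    (max(1, count) and a fixed average for empty regions) are
    assumptions made for illustration.
    """
    region = select(a)
    in_region = [y for x, y in zip(arms, payoffs) if select(x) == region]
    count = len(in_region)
    corrected_count = max(1, count)
    corrected_average = sum(in_region) / count if count > 0 else empty_average
    return count, corrected_count, corrected_average

# Hypothetical two-region partition of [0, 2]: [0, 1) and [1, 2].
select = lambda x: 0 if x < 1.0 else 1
arms = [0.2, 0.7, 0.9]
payoffs = [0.5, 0.25, 0.75]
```

Note that every function takes an arm as input; the region is looked up internally, which mirrors the remark that the domain is the arm space rather than the set of regions.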
At time , based on the partition and observations , our bandit algorithm uses
for some () as the Upper Confidence Bound of arm ; and we play an arm with the highest value (with ties broken uniformly at random).
Since is a piece-wise constant function in the arm space and is constant within each region, playing an arm with the highest with random tie-breaking is equivalent to selecting the best region (under UCB) and randomly selecting an arm within the region. This strategy (4) takes a similar form to the classic UCB1 algorithm (auer2002finite). After we decide which arm to play, we update the partition into a finer one if necessary. This strategy, PartitionUCB, is summarized in Algorithm 1.
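One round of region-level UCB selection might look as follows. This is a minimal sketch on a fixed one-dimensional partition: the index form (average plus a confidence width plus a diameter term), the constants `C` and `M`, the default average of 1.0 for unexplored regions, and tie-breaking uniformly over regions are all assumptions of the sketch, and partition refinement is omitted.

```python
import math
import random

def ucb_index(mean, corrected_count, diameter, t, C=1.0, M=1.0):
    # Assumed index: average + confidence width + a diameter term that
    # accounts for variation of a Lipschitz payoff inside the region.
    return mean + C * math.sqrt(math.log(t) / corrected_count) + M * diameter

def play_round(regions, stats, t, rng):
    """Pick the region with the highest index, then an arm uniformly in it.

    regions: list of (low, high) intervals; stats[i] = (sum_payoff, count).
    """
    def index(i):
        total, count = stats[i]
        mean = total / count if count > 0 else 1.0  # assumed default
        low, high = regions[i]
        return ucb_index(mean, max(1, count), high - low, t)
    scores = [index(i) for i in range(len(regions))]
    best = max(scores)
    # Ties between regions are broken at random (uniformly over regions,
    # a simplification of tie-breaking over arms).
    i = rng.choice([j for j, s in enumerate(scores) if s == best])
    low, high = regions[i]
    return i, rng.uniform(low, high)

regions = [(0.0, 1.0), (1.0, 2.0)]
stats = [(4.0, 8), (0.0, 0)]        # region 1 is still unexplored
chosen, arm = play_round(regions, stats, t=9, rng=random.Random(0))
```

With these inputs the unexplored region wins because its confidence width is larger, which is the exploration behaviour a UCB rule is meant to produce.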
In order for the algorithm to be well-behaved, i.e., having , we need the partitions to follow some regularization conditions. One regularization condition is defined in Definition 3.
Definition 3 (Legal Partition).
For a set of points , a partition at time of is said to be an -legal partition (with and ) with respect to if for all
where is defined with respect to , and is defined with respect to and .
In addition, we require that for any two partitions and consecutive in time, for any , there exists such that . In words, at round , some regions of the partition are split into multiple regions to form the partition at round . We say that the partition grows finer. Practical versions of Algorithm 1 are discussed in Section 3. A high probability regret bound of Algorithm 1 is stated in Theorem 1.
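The "grows finer" requirement can be checked mechanically. The sketch below does so for one-dimensional partitions stored as lists of intervals; the representation and the function name `is_refinement` are illustrative assumptions.

```python
def is_refinement(new_partition, old_partition):
    """True if every new region lies inside some old region.

    Regions are 1-D intervals (low, high); this representation is an
    illustrative assumption.
    """
    return all(
        any(o_low <= n_low and n_high <= o_high for o_low, o_high in old_partition)
        for n_low, n_high in new_partition
    )

old = [(0.0, 1.0), (1.0, 2.0)]
finer = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]    # splits the first region
shifted = [(0.0, 1.5), (1.5, 2.0)]              # moves a boundary instead
```

Here `finer` passes the check because it only splits an existing region, while `shifted` fails because its first region straddles two old regions.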
[Algorithm 1 (PartitionUCB). Parameter noted in the listing: an upper bound on the variance of the noise.]
Suppose that the payoff function defined on a compact domain satisfies for all and is Lipschitz. Then for any given , with probability at least , the cumulative regret of Algorithm 1 satisfies .
In order to prove Theorem 1, we first need Lemmas 1, 2 and 3. Lemma 1 is a consequence of Hoeffding’s inequality and the Lipschitzness of the payoff function. Lemma 2 states that the regret at time can be bounded in terms of the inverse of the square-root corrected count and the size of the regions in the legal partition with high probability. Lemma 3 is an extension of Lemma 2 and it bounds the cumulative regret by the inverse of the square-root corrected count.
Let and be the corrected count and the corrected average defined with respect to the legal partition and observations . At any time , with probability greater than or equal to ,
where is the standard deviation of the Gaussian noise in the observation, is the Lipschitz constant, is the dimension of the arm space, and and define the legal partition.
Let be the corrected count defined with respect to the legal partition and observations up to time . At any time , with probability greater than or equal to , the regret satisfies
Let be the corrected count defined with respect to the legal partition and observations up to time . For , with probability at least , the cumulative regret satisfies
To prove Theorem 1, it remains to bound
For any , is not necessarily increasing with , since the partition grows finer. This results in difficulty in bounding (8). Next, we present a constructive trick to bound (8). For each , we can construct a hypothetical noisy degenerate Gaussian process to bound (8). We are not assuming our payoffs are drawn from these Gaussian processes. We only use the construction to bound (8). To construct these noisy degenerate Gaussian processes, we define the kernel functions ,
where is the region selection function defined with respect to . The kernel is positive semi-definite as shown in Proposition 1.
The kernel defined in (9) is positive semi-definite for any .
For any in where the kernel is defined, the Gram matrix
can be written into block diagonal form where diagonal blocks are all-one matrices and off-diagonal blocks are all zeros with proper permutations of rows and columns. Thus without loss of generality, for any vector, where the first summation is taken over all diagonal blocks and is the total number of diagonal blocks in the Gram matrix. ∎
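The block-diagonal argument can be checked numerically. The sketch below assumes the kernel is the same-region indicator (value 1 if two arms share a region, 0 otherwise), which is what the all-one diagonal blocks in the proof suggest; the regions, points and coefficient vector are hypothetical. It compares the quadratic form of the Gram matrix with the sum, over regions, of squared coefficient sums.

```python
from collections import defaultdict

def gram_matrix(points, select):
    # k(x, x') = 1 if x and x' fall in the same region, else 0.
    return [[1.0 if select(x) == select(z) else 0.0 for z in points] for x in points]

def quadratic_form(K, v):
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

def sum_of_block_squares(points, select, v):
    # Group coefficients by region; each all-ones block contributes (sum v_i)^2.
    sums = defaultdict(float)
    for x, vi in zip(points, v):
        sums[select(x)] += vi
    return sum(s * s for s in sums.values())

select = lambda x: int(x)            # hypothetical regions [0,1), [1,2), [2,3)
points = [0.1, 1.4, 0.8, 2.5, 1.9]
v = [1.0, -2.0, 0.5, 3.0, 1.0]
lhs = quadratic_form(gram_matrix(points, select), v)
rhs = sum_of_block_squares(points, select, v)
```

Because the right-hand side is a sum of squares, agreement of the two quantities shows the quadratic form is non-negative for this vector, in line with the proposition.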
Now, at any time , let us consider the model where is drawn from a Gaussian process and . Suppose that the arms and hypothetical payoffs are observed from this Gaussian process. The posterior variance for this Gaussian process after the observations at is
where , and
is the identity matrix. In other words, is the posterior variance using points up to time with the kernel defined by the partition at time . After some matrix manipulation, we know that
where . By the Sherman-Morrison formula, . Thus the posterior variance is . Now, we can link the sum of variances in the constructed Gaussian processes to (8), since the posterior variances in these Gaussian processes can be used to bound the term . Lemma 4 bounds the sum of posterior variances in the constructed Gaussian process in terms of the cardinality of the partition . Lemma 5 bounds (8) using the fact that (See Appendix A.5) and Lemma 4. Therefore, the constructed Gaussian processes bridge (8) and the cardinality of the partition . When the partition is a Legal Partition, we can bound its cardinality. This sketches a proof for Lemma 5. Lemmas 3 and 5 lead directly to a proof for Theorem 1, since all terms in (7) are .
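The Sherman-Morrison step can be verified on a single block. The sketch assumes prior variance 1 at the query arm, noise variance `s2`, and `n` earlier points in the same region; these names and the resulting closed form `s2 / (s2 + n)` follow from those assumptions rather than from a formula recovered from the text.

```python
def sherman_morrison_inverse(n, s2):
    # (s2*I + 1 1^T)^{-1} = (1/s2) * (I - 1 1^T / (s2 + n))
    return [[((1.0 if i == j else 0.0) - 1.0 / (s2 + n)) / s2 for j in range(n)]
            for i in range(n)]

def posterior_variance(n, s2):
    # 1 - 1^T (s2*I + 1 1^T)^{-1} 1, with prior variance k(x, x) = 1.
    inv = sherman_morrison_inverse(n, s2)
    return 1.0 - sum(inv[i][j] for i in range(n) for j in range(n))

def max_inverse_error(n, s2):
    # Largest entry of |(s2*I + 1 1^T) * inv - I|; should be close to zero.
    inv = sherman_morrison_inverse(n, s2)
    A = [[(s2 if i == j else 0.0) + 1.0 for j in range(n)] for i in range(n)]
    return max(
        abs(sum(A[i][k] * inv[k][j] for k in range(n)) - (1.0 if i == j else 0.0))
        for i in range(n) for j in range(n)
    )

n, s2 = 5, 0.25
closed_form = s2 / (s2 + n)
```

Under these assumptions the posterior variance shrinks like the inverse of the region's count, which is why sums of posterior variances can stand in for sums of inverse corrected counts.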
For data generated from the noisy Gaussian process and with , if we query at points , then
where and is the cardinality of the partition associated with .
Detailed proofs of all lemmas are in Appendix A.
2.2 The Contextual PartitionUCB algorithm
In this section, we present an extension of Algorithm 1 for the contextual stochastic bandit problem. The contextual stochastic bandit problem is an extension to the stochastic bandit problem. In this problem, at each time, context information is revealed, and the agent chooses an arm based on past experience as well as the contextual information. Formally, the payoff function is defined over the product of the context space and the arm space and takes values from . Similar to the previous discussions, compactness of the product space and Lipschitzness of the payoff function are assumed. At each time , a contextual vector is revealed and the agent plays an arm . The performance of the agent is measured by the cumulative contextual regret
where is the maximal value of given contextual information . Here, is the maximizer of . Surprisingly, a simple extension of Algorithm 1 can solve the contextual version of the problem. In the contextual case, we partition the joint space instead of the arm space . As an analog to (2) and (3), we define the corrected count and the corrected average over the joint space with respect to the partition of the joint space , and observations in the joint space .
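A contextual round can be sketched by restricting attention to the regions of the joint partition that are compatible with the revealed context. The snippet uses axis-aligned boxes over a one-dimensional context and a one-dimensional arm; the box representation, the index form, the constants `C` and `M`, and the default average of 1.0 are assumptions made for illustration.

```python
import math
import random

def contextual_round(boxes, stats, z, t, rng, C=1.0, M=1.0):
    """One round on a partition of the joint context-arm space.

    boxes[i] = ((z_low, z_high), (a_low, a_high)); stats[i] = (sum_payoff, count).
    """
    best_i, best_score = None, -math.inf
    for i, ((z_low, z_high), (a_low, a_high)) in enumerate(boxes):
        if not z_low <= z <= z_high:
            continue                      # region is incompatible with the context
        total, count = stats[i]
        mean = total / count if count > 0 else 1.0   # assumed default
        diameter = math.hypot(z_high - z_low, a_high - a_low)
        score = mean + C * math.sqrt(math.log(t) / max(1, count)) + M * diameter
        if score > best_score:
            best_i, best_score = i, score
    (_, _), (a_low, a_high) = boxes[best_i]
    return best_i, rng.uniform(a_low, a_high)

boxes = [((0.0, 0.5), (0.0, 1.0)), ((0.0, 0.5), (1.0, 2.0)), ((0.5, 1.0), (0.0, 2.0))]
stats = [(6.0, 10), (1.0, 10), (0.0, 0)]
chosen, arm = contextual_round(boxes, stats, z=0.2, t=21, rng=random.Random(0))
```

The third box is never considered for the context 0.2, even though it is unexplored, because its context range does not contain the revealed context.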
Suppose that the payoff function defined on a compact domain satisfies for all and is Lipschitz. Then for any given , with probability at least , the cumulative contextual regret of Algorithm 2 satisfies .
3 Implementation and Experiments
3.1 Regression Tree Implementation
One nice property of Algorithm 1 is that it does not impose constraints on how to construct the partition. Therefore we can use a greedy criterion for constructing regression trees to construct the partition. Leaves in a regression tree form a partition of the space in the sense that (1) is satisfied. At the same time, a regression tree is designed to fit the underlying function. This property tends to result in an adaptive partition where the underlying function values within each region are relatively close to each other. For this paper, we use the Mean Absolute Error (MAE) reduction criterion (breiman1984classification) to adaptively construct a regression tree. More specifically, a node containing data samples is split along a feature (which can be randomly selected for scalability) into and (where and ) such that the following reduction in MAE is maximized:
where and . The nodes are recursively split until the maximal possible reduction in MAE is smaller than . The leaves are then used to form a partition. Each region is again associated with a corrected mean and corrected count. Using regression trees, we develop the TreeUCB algorithm (TUCB) and the Contextual TreeUCB algorithm (CTUCB), as summarized in Algorithms 3 and 4. Although the tree implementation may result in partitions that are not legal, the empirical results show that in practice the heuristic developed based on the regression tree implementation outperforms most state-of-the-art methods in tuning neural networks, as we will see in Section 3.2. The implementation of TUCB and CTUCB is based on a modified scikit-learn package (scikit-learn).
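A greedy split by MAE reduction can be sketched for a single feature. The snippet assumes that a node's MAE is the mean absolute deviation of its payoffs around their median and that the children are weighted by their share of the samples; both conventions, and the function names, are assumptions of this sketch rather than the paper's stated formula.

```python
from statistics import median

def mae(values):
    # Mean absolute deviation around the median (an assumed convention).
    if not values:
        return 0.0
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)

def best_split(xs, ys):
    """Best threshold on one feature by weighted MAE reduction.

    Returns (threshold, reduction); threshold is None if no split helps.
    """
    pairs = sorted(zip(xs, ys))
    parent = mae([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * mae(left) + len(right) * mae(right)) / len(pairs)
        reduction = parent - weighted
        if reduction > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, reduction)
    return best

xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
threshold, reduction = best_split(xs, ys)
```

On this toy data the chosen threshold separates the low-payoff arms from the high-payoff arms, so each resulting region contains payoffs that are close to each other. Recursing on each child until the best reduction falls below a threshold would yield the leaves used as the partition.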
3.2 Application to Neural Network tuning
Experimental design: Given fixed training and validation sets, the validation accuracy of a neural network can be viewed as a noisy function of hyperparameters. Typical hyperparameters for tuning include the network architecture, learning rate, training iterations, etc. The number of training iterations is itself a hyperparameter, but it is a special one: training accuracy tends to increase with the number of training iterations, and we may only care about the performance at a large enough training iteration. In addition, some tuning methods have a special scheme to leverage the training iterations as a special dimension, while some do not. Therefore we divide the experiments into two settings: A) the algorithms are agnostic to the training resource and thus tune the training iterations together with all other hyperparameters; B) the algorithms leverage the number of training iterations as a special dimension using their own schemes (if they have one). We will refer to these two settings as Setting A and Setting B from now on; we compare TUCB against other methods in Setting A, and compare CTUCB against other methods in Setting B.
Using training iterations as contextual information. Context is usually observed from the environment, but in this case, we are able to choose it. We let CTUCB start with a small number of training iterations and progressively increase the number of iterations. By doing this, CTUCB can use information obtained at smaller iterations (cheaper to obtain) to help infer the performance at larger iterations (more expensive to obtain). This assumes that a configuration that is good at smaller iterations tends to be good at larger iterations as well, which tended to be true in practice for our experiments. As shown in Figures 3, 6 and 9, by using training iterations as the contextual information, CTUCB can outperform existing state-of-the-art methods.
Modeling the Lipschitz constant with the number of training iterations. In Setting B, for a hyperparameter configuration , as the training iteration increases, the validation accuracy converges. Therefore, on a local scale, the Lipschitz constant decreases as the training iteration (epoch) increases. As stated in Algorithm 2, the confidence coefficient is with . Instead of using the global Lipschitz constant , we can use to model the local Lipschitz constant if there is enough prior knowledge. Since neural networks converge as training resource increases, should be positive and decrease with .
In this section, we compare the performance of TUCB and CTUCB on the MNIST dataset, the SVHN dataset, and the CIFAR-10 dataset. In Setting A we compare SMAC, random search, TPE, Spearmint, Harmonica-r (Harmonica using random search as the base algorithm) and TUCB. In Setting B, we compare SMAC, random search, TPE, Spearmint, CTUCB, Hyperband and Harmonica-h (Harmonica using Hyperband as the base algorithm). In Setting B, CTUCB, Hyperband and Harmonica-h set the number of training iterations using their own schemes, while the rest of the algorithms fix the number of training iterations and tune the rest of the hyperparameters. In particular, CTUCB uses the number of training iterations as the contextual information. How we set the training iterations for CTUCB will be specified later.
3.3.1 MLP for MNIST
In this section, we tune a simple Multi-Layer Perceptron (MLP) and compare TUCB and CTUCB with state-of-the-art methods. For TUCB, we use , where is a hyperparameter of the tuning algorithm; and for CTUCB in tuning neural networks, we use , where is the number of training iterations, and models the local Lipschitz constant. In this case, , and are all hyperparameters of the tuning algorithm. Although TUCB and CTUCB have their own hyperparameters, the performance of both TUCB and CTUCB is not sensitive to these hyperparameters. The architecture of this MLP is as follows: in the feed-forward direction, there are the input layer, the fully connected hidden layer with dropout ensemble, and then the output layer. The hyperparameter search space is
number of hidden neurons (range ), learning rate (), dropout rate (), batch size (), number of iterations (). In Setting B: all algorithms except for Hyperband, CTUCB and Harmonica-h always use 243 for the number of iterations. (Footnote 5: We choose the number 243 so that it helps Hyperband avoid rounding when using their example down-sampling rate 3.) Hyperband, CTUCB and Harmonica-h use their own specific mechanisms to alter the number of iterations. The results are shown in Figure 3. For CTUCB in Setting B, we choose the training iterations so that
By setting the training iterations this way, CTUCB repeatedly starts with small iterations and proceeds to larger iterations (and repeats). As shown in Figure 3, our methods find good configurations faster than other methods, in both Setting A and Setting B.
3.4 AlexNet CNN for SVHN
In this section, we tune an AlexNet-type (krizhevsky2012imagenet) CNN (lecun1998gradient) for the SVHN dataset and compare TUCB and CTUCB with state-of-the-art methods. The architecture of this CNN and the corresponding hyperparameters are summarized in Table LABEL:tab:svhn-arch and LABEL:tab:svhn-params. The results are shown below in Figure 6. For CTUCB in Setting B, we pick context (the training iterations) so that
By setting the context in this way, CTUCB starts with a small number of training iterations and progressively increases it (and repeats). In this set of experiments, TUCB and CTUCB usually find good configurations at least as fast as other methods.
3.4.1 AlexNet CNN for CIFAR-10
In this section, we tune an AlexNet-type (krizhevsky2012imagenet) CNN (lecun1998gradient) for the CIFAR-10 dataset (krizhevsky2009learning) and compare TUCB and CTUCB with state-of-the-art methods. The architecture of this CNN and the corresponding hyperparameters are summarized in Table LABEL:tab:cifar-arch and LABEL:tab:cifar-params. The results are shown below in Figure 9. For CTUCB in Setting B, we pick context (the training iterations) so that CTUCB increases the number of training iterations in exactly the same way as Hyperband. Hyperband and CTUCB increase the number of training iterations as shown in Table 7. This progression of training iterations is determined by Hyperband’s hyperparameters. Please refer to the Hyperband paper (li2016hyperband) for more details on how Hyperband selects training resources.
As shown in Figure 9, our methods find good configurations faster than other methods most of the time. In particular, in Setting A, TUCB reaches 70% accuracy using 30,400 iterations, while other methods on average require 42,800 iterations; in Setting B, CTUCB reaches 70% accuracy using 11,943 iterations, while other methods on average require 71,934 total iterations. (Footnote 6: In our averages for Setting A, for methods that required over 45,000 iterations to reach 70% accuracy, we simply denoted that they required 45,000 iterations. In our averages for Setting B, for methods that required over 120,000 iterations to reach 70% accuracy, we simply denoted that they required 120,000 iterations.) In addition, TUCB does not achieve above 0.754 accuracy after 45,000 iterations in Setting A, while CTUCB achieves above 0.76 accuracy within 45,000 iterations in Setting B. This shows that leveraging the training iterations as a special dimension can help increase the accuracy on an absolute scale.
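The iteration counts quoted above translate into relative savings with a line of arithmetic each. The sketch below uses only the four numbers reported in the text; the helper names are hypothetical, and `capped_mean` merely illustrates the capping convention described in the footnote, since the per-method counts are not given here.

```python
def relative_saving(ours, others):
    # Fraction of iterations saved relative to the comparison average.
    return (others - ours) / others

def capped_mean(iterations, cap):
    # Average in which any run exceeding the cap is recorded as the cap.
    return sum(min(v, cap) for v in iterations) / len(iterations)

saving_a = relative_saving(30_400, 42_800)   # Setting A, roughly 0.29
saving_b = relative_saving(11_943, 71_934)   # Setting B, roughly 0.83
```

Because of the capping, the comparison averages are lower bounds on the true averages, so these savings are, if anything, conservative.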
We propose the PartitionUCB and the Contextual PartitionUCB algorithms that successively partition the arm space and the context-arm space and play the Upper Confidence Bound strategy based on the partition. We also provide high probability regret upper bounds for both algorithms. Since a decision tree corresponds to a partition of the space, fast implementations using regression trees called TUCB and CTUCB are provided. Empirical studies show that TUCB and CTUCB are competitive with the state-of-the-art methods for tuning neural networks, and could save substantial computing resources. As suggested by the results in the paper, more bandit algorithms could be considered as benchmarks for the problem of neural network tuning.
The authors are grateful to Aaron J Fisher, Tiancheng Liu and Weicheng Ye for their comments and insights. The project is partially supported by the Alfred P. Sloan Foundation through the Duke Energy Data Analytics fellowship.
Appendix A Proofs
In all the proofs that follow, we denote . In the case of the non-contextual bandit problem, is the empty set so .
A.1 Proof of Lemma 1
The statement is true when since a probability is non-negative. For , let . This expectation is taken with respect to the tie-breaking scheme, which is uniformly random. By Lipschitz continuity, . Since by our legal partition restrictions, we have, for any ,
Case 1. :
Since the underlying function , it is sub-Gaussian with the sub-Gaussian parameter (at most) . Since both the underlying function and the Gaussian noise are sub-Gaussian, we can apply Hoeffding’s inequality:
are independent sub-Gaussian random variables with sub-Gaussian parameter being . Thus, using Hoeffding’s inequality with a change of variables, we know with probability at least . We can apply Hoeffding’s inequality since the observed rewards within the same region are independent of each other. This is because the partition grows finer and we break ties uniformly at random.
Case 2. :
When , since and . ∎
Therefore we can choose or to trade off exploration and exploitation. In addition, with probability at least , we can remove the absolute value and bound from below by
or bound from above by