We consider an unknown function , where and
denotes the probability space representing some uncontrollable variables. For any fixed
is a random variable of law and we consider with , a real-valued functional defined on probability measures. We assume that there exists at least one such that . Using a set of sequential observations , our goal is to minimize the simple regret with the value returned after using a budget .
Different families of algorithms have been developed to treat this problem. Some are Bayesian in flavor (see, for instance, [Shahriari et al., 2016]), while others are inspired by the bandit literature. Here we focus on the bandit framework.
In the classical $\mathcal{X}$-armed bandit problem, a forecaster repeatedly selects a point in the input space and receives a reward distributed according to an unknown distribution . Historically, the main goal was to minimize the cumulative regret, i.e. the sum of the differences between the collected rewards and those that optimal actions would have brought. Over the last decade, other works have focused on the simple regret. These can be divided into two groups: algorithms that optimize an unknown function with knowledge of its smoothness, for example StoOO [Munos et al., 2014], HOO [Bubeck et al., 2011], Zooming [Kleinberg et al., 2008] or HCT [Azar et al., 2014], and algorithms that optimize unknown functions without knowledge of the smoothness, such as POO [Grill et al., 2015], StroquOOL [Bartlett et al., 2018], GPO [Shang et al., 2019], StoSOO [Valko et al., 2013] or the method of [Locatelli and Carpentier, 2018].
Those algorithms focus on the optimization of the conditional expectation of
. This choice is questionable in some situations. For example, if the shape and variance of the reward distribution depend on the input, a forecaster may be interested in different aspects of the unknown distribution in order to modulate their risk exposure. In the literature, several risk measures have been proposed to replace the expectation: for instance quantiles (also referred to as Value-at-Risk, see [Artzner et al., 1999, McNeil and Frey, 2000]), the Conditional Value-at-Risk [Rockafellar et al., 2000], the entropic Value-at-Risk [Ahmadi-Javid, 2012], or expectiles [Bellini and Di Bernardino, 2017]. The purpose of this paper is to present a framework for optimizing a risk measure of an unknown stochastic function, with knowledge of the smoothness, using only pointwise sequential observations and a finite budget .
$\mathcal{X}$-armed bandit algorithms rely on optimistic strategies that associate with each point of the space an upper confidence bound (UCB), that is, an “optimistic” prediction of the outcome. Adapting the classical setting to the optimization of risk measures requires building high-probability confidence bounds for that particular measure. This problem has been tackled in the multi-armed bandit setting (i.e. when the input space is discrete and finite). For instance, [Audibert et al., 2009, Sani et al., 2012] focused on the empirical variance, [Galichet et al., 2013, Kolla et al., 2019, Hepworth, 2017] on the CVaR, while in [David and Shimkin, 2016, Szorenyi et al., 2015] the authors based their policies on the quantile. However, the literature is scarce in the case of continuous input spaces.
In this paper we provide a new version of the Stochastic Optimistic Optimization (StoOO) algorithm [Munos et al., 2014], named StoROO (Stochastic Risk Optimistic Optimization), which is designed to optimize any function . First, we provide an analysis of the simple regret from a generic point of view (that is, for any ). Then, we apply StoROO to optimize the conditional quantile. Using only the assumption that the support of the output distribution is connected and bounded in and that the distribution admits a continuous density, we first propose an upper bound on the simple regret using Hoeffding’s inequality. Next, we derive confidence intervals that take into account the order of the quantile, based respectively on Bernstein’s and Chernoff’s inequalities. Finally, we present numerical experiments that illustrate the ability of our method to optimize conditional quantiles of a black-box function and the relevance of using confidence bounds derived from Chernoff’s inequality. Due to space limitations, technical proofs are deferred to the Supplementary Material.
2 Problem setup
2.1 Hierarchical partitioning
The upper confidence bounds on which optimistic algorithms are based are surrogate functions larger than the objective (in a sense detailed below) with high probability. At each round , the point having the highest UCB is sampled and a reward is collected.
In the classical multi-armed bandit problem, computing and sorting the UCBs can be done without major issues. But dealing with continuous input spaces (i.e. infinitely many arms) implies maximizing a UCB function over a continuous space, which can be both computationally intensive and algorithmically challenging. For example, Piyavskii’s algorithm (see [Bouttier, 2017] and references therein) defines
using a global Lipschitz assumption on the targeted function. Because of the Lipschitz hypothesis, the UCB maximizer is at an intersection of hyperplanes, i.e. where the UCB is non-differentiable. Thus a gradient-based algorithm cannot be used, implying that finding the point with the highest UCB is a very hard problem to solve.
To overcome the computational difficulties, a popular alternative is to rely on hierarchical partitions [Bubeck et al., 2011, Munos et al., 2014]. Let us consider an infinite hierarchical space structure of such that
with the number of sub-regions obtained after expanding a cell and the -th cell at depth . In the following we assume that:
Assumption : There exists a decreasing sequence , such that for any and for any cell , , with the center of .
Assumption : There exists such that all cells of depth contain a ball of radius .
Starting with and following an optimistic strategy, at time the algorithm has expanded some cells and the result is a tree that is a subset of and a partition of . In this setting is taken as a piecewise constant function. Indeed for any we define such that for all , .
In the literature on $\mathcal{X}$-armed bandits there are two ways to select a cell of at each round. In [Bubeck et al., 2011], the algorithm follows an optimistic path from the root to the leaves. In [Munos et al., 2014], StoOO selects the cell having the highest UCB among all the cells of that have not been expanded, that is, among the set of leaves of . We consider here this second alternative. Hence, to find the maximizer of at time , we only need to evaluate and sort a finite number of values .
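The leaf-selection step described above can be sketched in a few lines. This is only an illustration under assumptions that are not part of the paper: the input space is taken to be [0, 1], each cell is split into three equal children, and the names `Cell`, `split` and `select_leaf` as well as the toy UCB are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A cell of a hierarchical partition of [0, 1] (hypothetical helper type)."""
    depth: int
    low: float
    high: float

    @property
    def center(self):
        return 0.5 * (self.low + self.high)

    def split(self, k=3):
        """Expand the cell into k children of equal width."""
        width = (self.high - self.low) / k
        return [
            Cell(self.depth + 1, self.low + i * width, self.low + (i + 1) * width)
            for i in range(k)
        ]


def select_leaf(leaves, ucb):
    """Return the leaf with the highest upper confidence bound."""
    return max(leaves, key=ucb)


root = Cell(depth=0, low=0.0, high=1.0)
leaves = root.split()            # three cells of depth 1
middle = leaves.pop(1)           # expand the middle cell ...
leaves.extend(middle.split())    # ... so the tree now has five leaves

# Toy UCB, for illustration only: it prefers cells whose center is close to 0.4.
best = select_leaf(leaves, ucb=lambda cell: -abs(cell.center - 0.4))
```

Because only the leaves are compared, the cost of one selection is linear in the number of leaves, whatever the dimension of the underlying continuous space.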
2.2 Upper and lower confidence bounds, bias
To create confidence bounds for , the idea of StoOO is to get a sample of every node cell center . Thanks to the fact that all observed values are independent, we can use a deviation inequality to create a UCB for , that we denote . Finally to create the UCB over the cell , a bias term is added that takes into account how can potentially increase from the center of the cell to its edges.
To ensure the convergence of StoOO (and StoROO), the function only needs to be a UCB of for the cell containing , as is detailed in the proof of Proposition 1 (see also [Munos et al., 2014]). Bounding by how much can potentially increase from the center to the edge of the optimal cell requires a regularity assumption on . Following [Munos et al., 2014, Azar et al., 2014], we assume the following smoothness property:
Note that this condition is less restrictive than a global Lipschitz condition. It does not exclude functions that are very irregular (possibly discontinuous), except close to global maxima. Based on (1) we define
The algorithm also needs a quantity that bounds from below in order to provide guarantees on the value of over each cell. We thus construct a lower confidence bound, termed , for , and use it as an LCB for the maximum of on . In particular, on the cell containing the optimum , it holds that
with high probability. To summarize, the estimation of is altered by two sources of error: the local estimation error made at the center of the cell, and the bias term . Balancing those two terms naturally provides a trade-off between exploration and exploitation.
3 Stochastic Risk Optimistic Optimization
3.1 The StoROO algorithm
StoROO starts by sampling each sub-region of the root node once. Then, at each time the algorithm selects having the highest UCB. To reduce the estimation error, StoROO can either get more samples from (to reduce the variance), or split the cell in order to reduce its diameter (to reduce the bias). A good balance between these two options is found by dividing a cell as soon as the local estimation error is smaller than the bias, that is, when
If Condition (2) is satisfied, StoROO expands and requires a new sample at the center of each sub-region. If Condition (2) is not satisfied, then StoROO requires a new sample at the center which is used to update and .
When the budget is exhausted, several choices are possible for the return value: they have the same theoretical guarantees. Following [Munos et al., 2014], one can return the deepest node among those that have been expanded. Here we propose a different, more conservative choice. Denoting by the set of nodes having the highest LCB among those that have been expanded after a budget , StoROO returns the node with the highest value (an estimator of ) among the deepest nodes of .
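The sample-or-split loop described in this section can be illustrated as follows. This is a minimal sketch under several assumptions, not the authors' implementation: the input space is [0, 1] with ternary splits, the expansion test `ucb - lcb <= bias(depth)` is one possible reading of Condition (2), the return rule is the simpler deepest-expanded-node rule of [Munos et al., 2014] rather than the more conservative one proposed above, and the demo optimizes a mean with Hoeffding bounds on a toy objective. All function names are hypothetical.

```python
import math
import random


def storoo_sketch(sample, bounds, bias, budget, k=3):
    """Optimistic optimization on [0, 1] with a k-ary hierarchical partition.

    sample(x)      -> one noisy observation at x
    bounds(values) -> (lcb, estimate, ucb) for the functional at a cell center
    bias(depth)    -> bound on the increase of the functional inside a cell
    All three callables are placeholders supplied by the caller.
    """
    def new_leaf(depth, low, high):
        center = 0.5 * (low + high)
        return {"depth": depth, "low": low, "high": high,
                "center": center, "values": [sample(center)]}

    # Initialization: one sample in each sub-region of the root node.
    leaves = [new_leaf(1, i / k, (i + 1) / k) for i in range(k)]
    used, expanded = k, []

    while used < budget:
        # Optimistic choice: leaf with the highest cell-level UCB.
        leaf = max(leaves, key=lambda c: bounds(c["values"])[2] + bias(c["depth"]))
        lcb, _, ucb = bounds(leaf["values"])
        # Assumed form of the expansion rule: estimation error below the bias.
        if ucb - lcb <= bias(leaf["depth"]) and used + k <= budget:
            leaves.remove(leaf)
            expanded.append(leaf)
            width = (leaf["high"] - leaf["low"]) / k
            leaves += [new_leaf(leaf["depth"] + 1,
                                leaf["low"] + i * width,
                                leaf["low"] + (i + 1) * width) for i in range(k)]
            used += k
        else:
            leaf["values"].append(sample(leaf["center"]))
            used += 1

    # Deepest expanded node, ties broken by the highest estimate.
    candidates = expanded or leaves
    best = max(candidates, key=lambda c: (c["depth"], bounds(c["values"])[1]))
    return best["center"], used


def hoeffding_mean_bounds(values, delta=0.05):
    """Two-sided Hoeffding interval for the mean of [0, 1]-valued rewards."""
    n = len(values)
    mean = sum(values) / n
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - eps, mean, mean + eps


rng = random.Random(0)


def noisy_reward(x):
    # Toy objective (an assumption for the demo): mean 0.8 - |x - 0.5|, rewards in [0.2, 0.9].
    return 0.8 - abs(x - 0.5) + rng.uniform(-0.1, 0.1)


x_hat, used = storoo_sketch(noisy_reward, hoeffding_mean_bounds,
                            bias=lambda depth: 3.0 ** (-depth), budget=2000)
```

Swapping `hoeffding_mean_bounds` for a function that returns confidence bounds on another functional, such as a quantile, leaves the loop unchanged, which is the point of the generic formulation.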
3.2 Analysis of the algorithm
In this section we provide a theoretical analysis of StoROO. It is inspired by [Munos et al., 2014], but differs most notably by the fact that the analysis is suited for any and not only for the conditional expectation.
The analysis relies on the possibility to construct, for any , upper and lower confidence bounds and such that the event
has probability at least . We defer to Section 4 their specific expression for the case of the quantile.
Contrary to the framework of [Munos et al., 2014], in our setting the magnitude of the confidence bound () associated with each node is not explicit. We thus need to introduce the following definition to quantify how many times a node needs to be sampled before satisfying the expansion condition (Eq. 2).
The vector of safe constants is composed of the constants and such that the event
has probability at least .
Note that in the case of the conditional expectation, [Munos et al., 2014] take , and
We first prove (Proposition 1) that any point at the center of an expanded cell of depth belongs to
Next, we show that using a budget , the tree reaches at least a depth given below (Proposition 2). This implies that the point returned by the algorithm belongs to (Proposition 3). Finally, using an assumption on the size of that can be formalized by the so-called near-optimality dimension [Bubeck et al., 2011, Munos et al., 2014], we provide an upper bound on the regret (Theorem 1).
Conditionally on , StoROO only expands cells such that .
Given the value and the total budget , the deeper the algorithm builds the tree, the better the guarantees on the final point returned. The goal of the following proposition is therefore to provide a lower bound on the depth of .
Define the largest such that
with the cardinality of . The deepest node expanded by StoROO is such that
Intuitively, is the budget needed to expand all the nodes in for all . It may be that some of these nodes will not be visited, but in the worst case they are, and they need to be considered in order to obtain a valid bound. Putting Propositions 1 and 2 together yields a first upper bound on the simple regret:
Running StoROO with budget , with probability the regret is bounded as
A more explicit bound for the regret can be obtained by quantifying the volume of
for small values of . Introducing the Hölderian semi-metric
The -near optimality dimension is the smallest such that for all , there exists such that the maximal number of disjoint -balls of radius with center in is less than .
To evaluate we need to bound for all . The following proposition makes the link between the near optimality dimension and .
Let be the -near-optimality dimension, and the corresponding constant. Then
Assume that for some and , and assume that . Then with probability , the regret of StoROO is bounded as
where is the near optimality dimension and the corresponding near optimality constant.
Remark: In the particular case where each cell is a hypercube and the sub-regions are created by the division of the parent-cell into sub-regions of equal size, then , is equal to and is equal to .
4 Optimizing Quantiles
In this section, we focus on the optimization of quantiles, which are well-established tools in (risk-averse) decision theory (see [Rostek, 2010]
for instance). In particular, they benefit from interesting robustness properties, with respect to outliers or heavy tails. Let
now denote the -quantile of , where
is the cumulative distribution function (CDF) of.
In this section we detail how to construct the UCB and LCB for quantiles. First, we provide bounds based on Hoeffding’s inequality and we use them to adapt the regret bounds of Theorem 3. Then we provide two more refined bounds that take into account the order of the quantile, based respectively on Bernstein’s inequality and on the Kullback-Leibler divergence.
Let us first introduce some notation. For all , , and we denote
the empirical CDF of the reward inside the cell , where is the (random) number of times the cell was sampled up to time (see Definition 1). The generalized inverse of the piecewise constant function is defined as
that is the order statistic of the sample that has been collected from the node until time .
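As an illustration of these two objects, the sketch below computes an empirical CDF and its generalized inverse for the rewards collected in one cell. The function names are hypothetical, and the rank `ceil(n * tau)` is the usual convention for the generalized inverse of a step function, assumed here to match the definition above.

```python
import math


def empirical_cdf(sample, y):
    """Fraction of observations that are <= y."""
    return sum(1 for v in sample if v <= y) / len(sample)


def empirical_quantile(sample, tau):
    """Generalized inverse of the empirical CDF: the ceil(n * tau)-th order statistic."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    ordered = sorted(sample)
    rank = math.ceil(len(ordered) * tau)  # 1-based rank of the order statistic
    return ordered[rank - 1]


rewards = [0.8, 0.1, 0.5, 0.3, 0.9]
q = empirical_quantile(rewards, 0.5)  # third smallest of the five rewards
```

By construction, `empirical_cdf(rewards, q)` is at least `tau`, and no smaller observed value has this property.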
To define confidence bounds on the conditional quantile we proceed in two steps. First we propose confidence bounds on
. To do so, we simply use deviation bounds for Bernoulli distributions, since for all , for all , the random variables are independent and identically distributed with a Bernoulli law of parameter , if denotes the time when the node has been sampled for the -th time. Then we use the properties
to create confidence bounds on using bounds on . The first equivalence is illustrated in Figure 1.
4.1 Hoeffding’s bound and regret analysis
Let , and let
The next proposition motivates the choice of the above quantities as a UCB and a LCB for the quantile of order at the points .
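One common way to turn Hoeffding's inequality into quantile confidence bounds is to shift the order of the empirical quantile by the Hoeffding deviation. The sketch below follows that construction; the exact constants, the two-sided split of the confidence level, the `support` argument and the function name are assumptions made for illustration and may differ from the quantities defined above.

```python
import math


def hoeffding_quantile_bounds(sample, tau, delta, support=(0.0, 1.0)):
    """Return (lcb, ucb) for the tau-quantile from an i.i.d. sample.

    The bounds are empirical quantiles of orders tau - eps and tau + eps,
    with eps the Hoeffding deviation of a mean of n Bernoulli indicators.
    Outside (0, 1], the known support bounds are used instead.
    """
    n = len(sample)
    ordered = sorted(sample)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    def order_stat(p):
        # ceil(n * p)-th order statistic, i.e. the empirical quantile of order p.
        return ordered[math.ceil(n * p) - 1]

    lcb = order_stat(tau - eps) if tau - eps > 0.0 else support[0]
    ucb = order_stat(tau + eps) if tau + eps <= 1.0 else support[1]
    return lcb, ucb


# Deterministic sample 0.01, 0.02, ..., 1.00, used only to make the output easy to read.
grid = [i / 100 for i in range(1, 101)]
lcb, ucb = hoeffding_quantile_bounds(grid, tau=0.5, delta=0.05)
```

With 100 observations and a confidence parameter of 0.05 the deviation is about 0.136, so the bounds are the 37th and 64th order statistics.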
Now, analyzing the regret requires a high probability bound on the number of times a node is sampled before being expanded:
According to the previous proposition, if we have sampled a node at depth more than
times, then with probability Condition (2) is satisfied and thus the node is expanded.
Equation (8) reflects that the smaller the minimum (taken over the whole support) of the density, the larger the upper bound on the number of samples needed before being expanded. Actually the bound is crude. It is rather clear, in fact, that the local minimum of around is the crucial quantity. Here we chose to write the results in terms of the global minimum to simplify the proof of Proposition (6). A more precise way to understand the behaviour of StoROO is that the number of times a node needs to be sampled before expansion depends on the pdf value in a neighborhood (of decreasing size with ) of the targeted quantile.
Assume that for some and , then with probability , the regret of StoROO for minimizing the quantile is bounded as
with the near-optimality dimension and the corresponding near-optimality constant.
Note that the speed of convergence is the same as the one obtained in the conditional expectation optimization setting; only the constant varies.
4.2 Tight bounds
Using Hoeffding’s inequality is convenient because it leads to explicit lower and upper confidence bounds, which simplifies the derivation of bounds on the regret. However, it implicitly upper-bounds the variance of all -valued random variables by , which is overly pessimistic when the inequality is applied to variables whose expectations are far from . This is in particular the case for quantile estimation, when the quantile is of order close to or . To take into account the order of the quantile, following [David and Shimkin, 2016], a first possibility is to derive confidence intervals from Bernstein’s inequality as presented in the following theorem.
For any , for all , and , define
Then the event has probability at least .
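To see how the order of the quantile enters a Bernstein-type bound, the sketch below compares the deviation allowed on the empirical CDF under Hoeffding's and Bernstein's inequalities. It is an illustration under assumptions: the indicators are treated as Bernoulli variables of parameter `tau` with variance `tau * (1 - tau)`, the standard Bernstein tail bound is inverted exactly, and the confidence level is split over two sides. The function names are hypothetical and the theorem above may use different constants.

```python
import math


def hoeffding_deviation(n, delta):
    """Deviation eps such that exp(-2 * n * eps**2) = delta / 2."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def bernstein_deviation(tau, n, delta):
    """Positive root t of t**2 / (2 * (var + t / 3)) = log(2 / delta) / n."""
    level = math.log(2.0 / delta) / n
    var = tau * (1.0 - tau)  # variance of a Bernoulli(tau) indicator
    return level / 3.0 + math.sqrt((level / 3.0) ** 2 + 2.0 * var * level)


n, delta = 100, 0.05
dev_hoeffding = hoeffding_deviation(n, delta)
dev_bernstein_extreme = bernstein_deviation(0.1, n, delta)  # quantile of order 0.1
dev_bernstein_median = bernstein_deviation(0.5, n, delta)   # quantile of order 0.5
```

In this setting the Bernstein deviation is smaller than the Hoeffding one for the order 0.1 and larger for the median, which is consistent with the variance argument above.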
Although Bernstein’s inequality takes into account the order of the quantile, it is possible to do better. In order to create tighter confidence bounds, we go back to Chernoff’s inequality and derive less explicit, but more accurate upper and lower confidence bounds on the -quantiles. We follow here [Garivier and Cappé, 2011], but a close inspection of the proofs shows, however, a difference in the order of the marginals of the KL functions. Recall that the binary relative entropy is defined for as:
with by convention, , and
For any , for all , and , define
Then the event has probability at least .
Contrary to Bernstein’s inequality, Chernoff’s bound is always tighter than Hoeffding’s inequality, which follows from Pinsker’s inequality (see e.g. [Garivier et al., 2018]):
$$\mathrm{kl}(p, q) \geq 2 (p - q)^2 .$$
For example, given and an i.i.d. sample of size , one can see that
with (resp. ) the UCB associated to Chernoff’s inequality (resp. Hoeffding’s inequality). Bernstein’s inequality is tighter than Hoeffding’s when is different from and sufficiently large, but always looser than Chernoff. It follows in particular that the regret of StoROO using confidence bounds derived from Chernoff’s inequality has, at least, the guarantees presented in Theorem 3.
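The comparison between Chernoff-based and Hoeffding-based bounds can be checked numerically. The sketch below computes, by bisection, the largest order `p` such that `n * kl(p, tau) <= log(1 / delta)`, and compares it with the Hoeffding order. It rests on assumptions: a one-sided confidence level, the argument order `kl(p, tau)` obtained from Chernoff's bound for Bernoulli indicators of parameter `tau`, and hypothetical function names; the theorem above may be stated with different constants.

```python
import math


def kl_bernoulli(p, q):
    """Binary relative entropy kl(p, q), with the convention 0 * log(0) = 0."""
    tiny = 1e-15
    q = min(max(q, tiny), 1.0 - tiny)  # keep the logarithms finite
    out = 0.0
    if p > 0.0:
        out += p * math.log(p / q)
    if p < 1.0:
        out += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return out


def chernoff_upper_order(tau, n, delta, iters=60):
    """Largest p in [tau, 1] with n * kl(p, tau) <= log(1 / delta), by bisection."""
    level = math.log(1.0 / delta) / n
    if kl_bernoulli(1.0, tau) <= level:
        return 1.0
    lo, hi = tau, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # kl(., tau) is increasing on [tau, 1], so the feasible set is an interval.
        if kl_bernoulli(mid, tau) <= level:
            lo = mid
        else:
            hi = mid
    return lo


def hoeffding_upper_order(tau, n, delta):
    """Order tau + eps from Hoeffding's inequality, capped at 1."""
    return min(1.0, tau + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


n, delta = 100, 0.05
orders = {tau: (chernoff_upper_order(tau, n, delta), hoeffding_upper_order(tau, n, delta))
          for tau in (0.1, 0.5, 0.9)}
```

For the order 0.9 the Hoeffding order is capped at 1, so the corresponding UCB is uninformative, while the Chernoff order stays strictly below 1. This is the kind of gain that matters for extreme quantiles.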
The online setting we consider in this article implies that, after steps, the set of nodes and the number of observations in each node are random. To cope with this, we need deviation bounds for random size samples. The simplest way to obtain such inequalities is to use a union bound on the possible number of observations in each node, as presented above. Tighter results can be obtained from a more thorough analysis (sometimes called the peeling trick): this is what is presented below.
For any let
Then the event has probability at least .
We empirically highlight the capacity of StoROO to optimize the conditional quantile of a black-box function. Four versions of StoROO are compared: (StoROO using confidence bounds derived from Hoeffding’s inequality), (StoROO using confidence bounds derived from Bernstein’s inequality), (StoROO using confidence bounds derived from Chernoff’s inequality) and (StoROO using confidence bounds derived from Chernoff’s inequality and the peeling trick).
As a test-case, we use the function
following a log-normal distribution of parameter and truncated at its
-quantile with the truncated mass following a uniform distribution between and . Figure 2 (left) shows the shape of the and quantiles of , while Figure 2 (right) shows samples of .
The performance of each version of StoROO is evaluated for different values of and quantified according to the simple regret. In our experiments we fix the values and such that the condition (1) is satisfied. Note that these values do not correspond to the actual regularity conditions at the optimum. In addition we fix and we choose to expand the nodes into three sub-regions of equal size.
Figure 2 reports the average of the simple regret over runs for and . For both values of all the variants of StoROO have a regret that decreases with the budget. However from our experiments a ranking can be created.
The least efficient method is . For its simple regret decreases more slowly than that of the three other methods and for does not reach the performance of the other variants. Sometimes, to reach a fixed accuracy, needs a much larger budget than the other variants. For example taking , needs a budget of to reach a simple regret of order , while and need a budget equal to .
Then there is . Using the maximal budget, in both experiments this variant reaches the same accuracy as and but its simple regret decreases more slowly. For some levels of performance needs a much larger budget than . For example, taking , to reach the value needs the budget while is enough for .
Finally, the most efficient methods are and . Both methods are always better than or equal to and . The variants and are often equivalent but sometimes the regret of decreases slightly faster than the version without the peeling trick. This behaviour provides a small gain for .
In this work, we extended StoOO to a generic algorithm applicable to any functional of the reward distribution. We proposed a tailored application to the problem of quantile optimization, with four variants: one based on the classical Hoeffding’s inequality, one based on Bernstein’s inequality, and two others based on Chernoff’s inequality. We showed that using Chernoff’s inequality to build confidence intervals resulted in a substantial improvement, both in theory and practice.
For simplicity, we assumed in this paper that the local regularity (or at least, an upper bound) of the target function at the optimum was known to the user. However, we believe that it is possible to combine our results with the procedure defined in [Grill et al., 2015, Xuedong et al., 2019] so as to obtain an algorithm able to optimize without the knowledge of the smoothness near an optimal point: this is left for future work. A second possible extension is to leverage the results proposed here to design an algorithm for the cumulative regret, in the spirit of HOO [Bubeck et al., 2011] for example.
- [Ahmadi-Javid, 2012] Ahmadi-Javid, A. (2012). Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123.
- [Artzner et al., 1999] Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. (1999). Coherent measures of risk. Mathematical finance, 9(3):203–228.
- [Audibert et al., 2009] Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.
- [Azar et al., 2014] Azar, M. G., Lazaric, A., and Brunskill, E. (2014). Online stochastic optimization under correlated bandit feedback. In ICML, pages 1557–1565.
- [Bartlett et al., 2018] Bartlett, P. L., Gabillon, V., and Valko, M. (2018). A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption. arXiv preprint arXiv:1810.00997.
- [Bellini and Di Bernardino, 2017] Bellini, F. and Di Bernardino, E. (2017). Risk management with expectiles. The European Journal of Finance, 23(6):487–506.
- [Bouttier, 2017] Bouttier, C. (2017). Optimisation globale sous incertitudes: algorithmes stochastiques et bandits continus avec application à la planification de trajectoires d’avions.
- [Bubeck et al., 2011] Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2011). X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695.
- [David and Shimkin, 2016] David, Y. and Shimkin, N. (2016). Pure exploration for max-quantile bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 556–571. Springer.
- [Galichet et al., 2013] Galichet, N., Sebag, M., and Teytaud, O. (2013). Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260.
- [Garivier and Cappé, 2011] Garivier, A. and Cappé, O. (2011). The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual conference on learning theory, pages 359–376.
- [Garivier et al., 2018] Garivier, A., Ménard, P., and Stoltz, G. (2018). Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research.
- [Grill et al., 2015] Grill, J.-B., Valko, M., and Munos, R. (2015). Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems, pages 667–675.
- [Hepworth, 2017] Hepworth, A. J. (2017). A multi-armed bandit approach to superquantile selection. PhD thesis, Monterey, California: Naval Postgraduate School.
- [Kleinberg et al., 2008] Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM.
- [Kolla et al., 2019] Kolla, R. K., Jagannathan, K., et al. (2019). Risk-aware multi-armed bandits using conditional value-at-risk. arXiv preprint arXiv:1901.00997.
- [Locatelli and Carpentier, 2018] Locatelli, A. and Carpentier, A. (2018). Adaptivity to smoothness in x-armed bandits. In Conference on Learning Theory, pages 1463–1492.
- [McNeil and Frey, 2000] McNeil, A. J. and Frey, R. (2000). Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach. Journal of empirical finance, 7(3-4):271–300.
- [Munos et al., 2014] Munos, R. et al. (2014). From bandits to monte-carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129.
- [Rockafellar et al., 2000] Rockafellar, R. T., Uryasev, S., et al. (2000). Optimization of conditional value-at-risk. Journal of risk, 2:21–42.
- [Rostek, 2010] Rostek, M. (2010). Quantile maximization in decision theory. The Review of Economic Studies, 77(1):339–371.
- [Sani et al., 2012] Sani, A., Lazaric, A., and Munos, R. (2012). Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283.
- [Shahriari et al., 2016] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
- [Shang et al., 2019] Shang, X., Kaufmann, E., and Valko, M. (2019). General parallel optimization without a metric. In 30th International Conference on Algorithmic Learning Theory.
- [Szorenyi et al., 2015] Szorenyi, B., Busa-Fekete, R., Weng, P., and Hüllermeier, E. (2015). Qualitative multi-armed bandits: A quantile-based approach. In 32nd International Conference on Machine Learning, pages 1660–1668.
- [Valko et al., 2013] Valko, M., Carpentier, A., and Munos, R. (2013). Stochastic simultaneous optimistic optimization. In International Conference on Machine Learning, pages 19–27.
- [Xuedong et al., 2019] Xuedong, S., Kaufmann, E., and Valko, M. (2019). General parallel optimization without a metric. In Algorithmic Learning Theory, pages 762–787.
Appendix A Proofs related to the generic analysis of StoROO
Proof of Proposition 1
Let us define the partition containing . Assume that the partition has been selected, thus
By definition , thus . Conditionally on , that implies
Note that the last inequality is obtained because the partition is expanded, which implies that
thus belongs to .
Proof of Proposition 2
There is at least an expanded node of depth after a budget was used.
Proof of Proposition 4
According to Assumption , each cell contains a ball of radius centered in , that is, a ball of radius centered in . If is the near-optimality dimension, then there are at most disjoint balls of radius inside . Thus if , this implies that there are more than disjoint balls of radius with center in , which is a contradiction.
Proof of Theorem 1