We consider the following setting of online convex prediction. Let
be a collection of random convex sub-differentiable loss functions sequentially observed. At each time step, a learner forms a prediction based on past observations . The learner aims at minimizing its average risk
with respect to all in some reference set . By considering the Dirac masses, one obtains and the average risk matches the definition of the average regret more commonly used in the online learning literature. We will first consider a finite set . Then we will show how to extend the results to the unit -ball , providing sparsity guarantees for sparse .
The case of a finite reference set corresponds to the setting of prediction with expert advice (see Section 2.2 or (Cesa-BianchiLugosi2006; freund1997decision; vovk1998game)), where a learner makes sequential predictions over a series of rounds with the help of experts. littlestone1994weighted and vovk1990aggregating introduced the exponentially weighted average algorithm (Hedge), which achieves the optimal rate of convergence for the average regret for general convex functions. Several works focused on improving the rate of convergence under favorable properties of the loss or the data. For instance, Hedge ensures a rate for exp-concave loss functions. We refer to van2015fast for a thorough review of fast-rate type assumptions on the losses.
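To make the exponentially weighted average forecaster concrete, here is a minimal sketch of Hedge for losses in [0, 1]. The function names and the tuning of the learning rate `eta` are illustrative assumptions and are not taken from the paper.

```python
import math

def hedge_weights(cumulative_losses, eta):
    """Weights proportional to exp(-eta * cumulative loss) of each expert."""
    shift = min(cumulative_losses)  # shift for numerical stability only
    raw = [math.exp(-eta * (c - shift)) for c in cumulative_losses]
    total = sum(raw)
    return [r / total for r in raw]

def run_hedge(loss_rounds, eta):
    """Run Hedge on a list of rounds, each a list of expert losses in [0, 1].

    Returns the cumulative (mixture) loss of the learner and the
    cumulative loss of every expert.
    """
    n_experts = len(loss_rounds[0])
    cumulative = [0.0] * n_experts
    learner_loss = 0.0
    for losses in loss_rounds:
        weights = hedge_weights(cumulative, eta)
        # The learner suffers the weighted average of the expert losses.
        learner_loss += sum(w * l for w, l in zip(weights, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return learner_loss, cumulative
```

With the classical tuning `eta = sqrt(8 * ln(K) / T)`, the cumulative regret of this forecaster is at most `sqrt(T * ln(K) / 2)`, which is the slow rate that the fast-rate results discussed here improve upon.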
The extension from finite reference sets to convex sets is natural. The seminal paper KivinenWarmuth1997 introduced the Exponentiated Gradient algorithm (EG), a version of Hedge that uses a linearized (gradient) version of the losses. The latter guarantees a average regret uniformly over the unit -ball . Another approach consists in projecting gradient descent steps (see Zinkevich2003 for general convex sets, DuchiEtAl2008 for the -ball, or AgarwalNegahbanWainwright2012 for fast rates under sparsity).
The first works on i.i.d. online convex optimization under sparsity were done by AgarwalNegahbanWainwright2012; GaillardWintenberger2017; Steinhardt2014, which obtained sparse rates of order . [Footnote 1: Throughout the paper, denotes an approximate inequality which holds up to universal constants and denotes an asymptotic inequality up to logarithmic terms in and dependence on parameters not clarified.] Their settings are very close to that of bunea2007, used for studying the convergence properties of the LASSO batch procedure. Their methods differ: that of Steinhardt2014 uses a -penalized gradient descent, whereas those of AgarwalNegahbanWainwright2012 and GaillardWintenberger2017 are based on restarting a subroutine centered around the current estimate, on sessions of exponentially growing length. These works compete with the optima over , assumed to be (approximately in AgarwalNegahbanWainwright2012) sparse with a known -bound. In contrast, we only compete here with optima over , which are more likely to be sparse.
Little work has been done on sparsity under adversarial data. The papers langford2009sparse; Xiao2010; duchi2010composite focus on providing sparse estimators with rates of order or a linear dependency on the dimension . Recent work (see foster2016online; kale2017adaptive and references therein) considers the problem where the learner only observes a sparse subset of coordinates at each round. Though they also compete with sparse parameters, they suffer a bound larger than . Fast-rate sparse regret bounds involving were, to our knowledge, only obtained through inefficient procedures (see Gerchinovitz2011 or rakhlin2015online).
Contributions and outline of the paper
In this paper we focus on providing fast rate regret bounds involving the sparsity of the objective .
In Section 2, we show in Theorem 2.1 that the Bernstein Online Aggregation (BOA) and Squint algorithms achieve a fast rate with high probability, i.e., for arbitrary data. The theorem also provides a quantile bound on the risk which improves the dependency on if many experts are performing well. This is the first quantile-like bound on the average risk that provides a fast rate with high probability. mehta2016fast developed high-probability quantile bounds, but these degrade with an additional gap term.
In Section 3, we consider the case . The standard reduction using the “gradient trick” of KivinenWarmuth1997 loses the fast-rate guarantee obtained under Assumption 2. Considering BOA on a discretization grid of and applying Theorem 2.1 yields the optimal convergence rate under Assumption 2. Yet, the complexity of the discretization is prohibitive. We thus investigate how an a priori discretization grid may be used to improve the regret bound. We provide in Theorem 3.2 a bound of the form which we call accelerable, i.e., the rate may decrease if decreases with . Here is a pseudo-metric that we call averaging accelerability and is the distance from to in this pseudo-metric. Our bound yields an oracle bound of the form which was recently studied by Foster2017. Sections 3.3 and 3.4 then build the grid adaptively in order to ensure a small regret under a sparsity scenario: Section 3.3 in the adversarial setting and Section 3.4 for i.i.d. losses.
In Section 3.3, we work under the strong convexity assumption on the losses in the adversarial setting. Using a doubling trick, we show that including sparse versions of the leader of the last session in is enough to ensure that for all . The rate is faster than the usual rate of convergence obtained by online gradient descent or Online Newton Step hazan2007. The gain is significant for sparse parameters . The numerical and space complexities of the algorithm, called BOA+, are . Notice that the rate can be decreased to whenever the leaders and the parameter are -sparse. This favorable case is not likely to happen in the adversarial setting but does happen in the i.i.d. setting treated in Section 3.4.
A new difficulty arises in the i.i.d. setting: we accept only assumptions on the risk and not on the losses . To do so, we need to enrich the grid with good approximations of the optima of the risk . However, the risk is not observed and the minimizer of the empirical risk (the leader) suffers a rate of convergence linear in . Thus, we develop another algorithm, called SABOA, that sequentially enriches by averaging the estimations of the algorithms on the last session. We extend the strong convexity setting of the preceding results of Steinhardt2014; GaillardWintenberger2017; AgarwalNegahbanWainwright2012 to the weaker Łojasiewicz’s Assumption 3 on the -ball only. The latter was introduced by Loja63; Loja93 and states that there exist and such that for all , there exists a minimizer of the risk over satisfying
The Łojasiewicz’s assumption depends on a parameter that ranges from general convex functions () to strongly convex functions (). Under this condition our algorithm achieves a fast-rate upper-bound on the average risk of order when the optimal parameters have -norm bounded by . When some optimal parameters lie on the border of the ball, the bound suffers an additional factor . Łojasiewicz’s Assumption 3 also allows multiple optima, which is crucial when dealing with degenerate collinear designs (allowing zero eigenvalues in the Gram matrix). The complexity of the algorithm, called SABOA, is and it is fully adaptive to all parameters except for the Lipschitz constant.
To summarize our contributions, we provide
2. Finite reference set
In this section, we focus on a finite reference set . This is the case of the setting of prediction with expert advice presented in Section 2.2. We will consider the following two assumptions on the loss:
Lipschitz loss: are sub-differentiable and for all , . [Footnote 2: Throughout the paper, we assume that the Lipschitz constant in Assumption 1 is known. It can be calibrated online with standard tricks such as the doubling trick (see CesaBianchiMansourStoltz2007 for instance) under sub-Gaussian conditions.]
Weak exp-concavity: There exist and such that for all , for all , almost surely
For convex losses , Assumption 2 is satisfied with and . Fast rates are obtained for . It is worth pointing out that Assumption 2 is weak even in the strongest case . It is implied by several common assumptions such as:
Strong convexity of the risk: under the boundedness of the gradients, Assumption 2 with is implied by the -strong convexity of the risks .
Exp-concavity of the loss: Lemma 4.2 of Hazan2016 states that Assumption 2 with is implied by -exp-concavity of the loss functions . Our assumption is slightly weaker since it needs to hold in conditional expectation only.
2.1. Fast-rate quantile bound with high probability
For prediction with expert advice, Wintenberger2014 showed that a fast rate can be obtained by the BOA algorithm under the LIST condition (i.e., Lipschitz and strongly convex losses) and i.i.d. estimators. Here, we show that Assumption 2 is enough. By using the Squint algorithm of vanErvenKoolen2015 (see Algorithm 1), we also replace the dependency on the total number of experts with a quantile bound. The latter is smaller when many experts perform well. Note that Algorithm 1 uses Squint with a discrete prior over a finite set of learning rates. It corresponds to BOA of Wintenberger2014, where each expert is replicated multiple times with different constant learning rates. The proof (with the exact constants) is deferred to Appendix C.1.
predict and observe ,
update component-wise for all
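For intuition, the following sketch shows a second-order exponential weight update in the spirit of BOA, combined with the gradient trick. It is not a transcription of Algorithm 1: it uses a single fixed learning rate `eta` rather than a prior over several learning rates, and all names (`boa_step`, `predict`, `grid`) are hypothetical.

```python
import math

def predict(weights, grid):
    """Prediction = convex combination of the grid points (experts)."""
    dim = len(grid[0])
    return [sum(w * p[j] for w, p in zip(weights, grid)) for j in range(dim)]

def boa_step(weights, grid, theta_hat, grad, eta):
    """One second-order exponential weight update.

    grad is a sub-gradient of the current loss at theta_hat. Each grid
    point p gets the linearized instantaneous regret r = grad . (p - theta_hat),
    and its weight is multiplied by exp(-eta * r * (1 + eta * r)).
    """
    regrets = [
        sum(g * (p_j - t_j) for g, p_j, t_j in zip(grad, p, theta_hat))
        for p in grid
    ]
    raw = [w * math.exp(-eta * r * (1.0 + eta * r))
           for w, r in zip(weights, regrets)]
    total = sum(raw)
    return [x / total for x in raw]
```

The quadratic correction `eta * r` inside the exponential is what distinguishes this update from plain Hedge on linearized losses; second-order terms of this kind are what allow variance-type (Bernstein) bounds.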
A fast rate of this type (without the quantile property) can be obtained in expectation by using the exponential weight algorithm (Hedge) for exp-concave loss functions. However, Theorem 2.1 is stronger. First, Assumption 2 only needs to hold on the risks , which is much weaker than exp-concavity of the losses . It can hold for absolute loss or quantile regression under regularity conditions. Second, the algorithm uses the so-called gradient trick. Therefore, simultaneously with upper-bounding the average risk with respect to the experts , the algorithm achieves the slow rate with respect to any convex combination (similarly to EG). Finally, we recall that our result holds with high probability, which is not the case for Hedge (see Audibert2008).
If the algorithm is run with a uniform prior , Theorem 2.1 implies that for any subset , with high probability
One only pays the proportion of good experts instead of the total number of experts . This is the advantage of quantile bounds. We refer to vanErvenKoolen2015 for more details, who obtained a similar result for the regret (not the average risk). Such quantile bounds on the risk were studied by mehta2016fast in a batch i.i.d. setting (i.e., are i.i.d.). A standard online to batch conversion of our results shows that in this case, Theorem 2.1 yields with high probability for any
This improves the bound obtained by mehta2016fast who suffers the additional gap
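The gain of a quantile bound under a uniform prior can be illustrated numerically. The sketch below assumes, as is usual for quantile bounds, that the complexity term enters through the logarithm of the inverse prior mass of the subset of good experts; the function name is hypothetical.

```python
import math

def complexity_terms(n_experts, n_good):
    """Compare the worst-case and quantile complexity terms.

    Under a uniform prior, a subset of n_good experts has prior mass
    n_good / n_experts, so log(1 / mass) = log(n_experts / n_good).
    """
    worst_case = math.log(n_experts)
    quantile = math.log(n_experts / n_good)
    return worst_case, quantile
```

For instance, with 1000 experts of which 100 perform well, the complexity term drops from ln 1000 to ln 10.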
2.2. Prediction with expert advice
The framework of prediction with expert advice is widely considered in the literature (see Cesa-BianchiLugosi2006 for an overview). We now recall this setting and how it can be included in our framework. At the beginning of each round , a finite set of experts forms predictions that are included into the history . The learner then chooses a weight vector in the simplex and produces a prediction as a linear combination of the experts. Its performance at time is evaluated by means of a loss function . [Footnote 4: For instance, can be the square loss with respect to some observation .] The goal of the learner is to approach the performance of the best expert in the long run. This can be done by minimizing the average risk with respect to all experts .
This setting reduces to our framework with dimension . Indeed, it suffices to choose the -dimensional loss function and the canonical basis in as the reference set. Denoting by the -th element of the canonical basis, we see that , so that . Therefore, matches our definition of in Equation (1) and we get under the assumptions of Theorem 2.1 a bound of order:
It is worth pointing out that, though the parameters of the reference set are constant, this method can be used to compare the player with arbitrary strategies that may evolve over time and depend on recent data. This is why we do not want to assume here that there is a single fixed expert which is always the best, i.e., . Hence, we cannot replace Assumption 2 with the closely related Bernstein assumption (see Ass. (A2’) or (koolen2016combining, Cond. 1)).
In this setting, Assumption 2 can be reformulated on the one dimensional loss functions as follows: there exist and such that for all , for all ,
It holds with for -strongly convex risk . For instance, the square loss satisfies it with and .
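The reduction of this subsection, from prediction with expert advice to a loss defined on weight vectors, can be sketched as follows for the square loss. The helper names are hypothetical; the point is only that evaluating the K-dimensional loss at a canonical basis vector recovers the loss of the corresponding expert.

```python
def make_loss(expert_predictions, observation):
    """K-dimensional loss: theta -> (theta . x_t - y_t)^2 for one round,
    where x_t stacks the predictions of the K experts."""
    def loss(theta):
        prediction = sum(t * x for t, x in zip(theta, expert_predictions))
        return (prediction - observation) ** 2
    return loss

def canonical_basis_vector(k, n_experts):
    """k-th element of the canonical basis, i.e. all weight on expert k."""
    return [1.0 if i == k else 0.0 for i in range(n_experts)]
```

Evaluating the loss at a point of the simplex gives the loss of the corresponding weighted combination of experts, so the finite reference set of basis vectors is exactly the set of experts.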
3. Online optimization in the unit -ball
The aim of this section is to extend the preceding results to the reference set instead of a finite . A classical reduction from the expert advice setting to the -ball is the so-called “gradient trick”. A direct analysis of BOA applied to the 2d corners of the -ball suffers a slow rate on the average risk. The goal is to exhibit algorithms that go beyond . In view of the fast rate in Theorem 2.1, the program is clear: in order to accelerate BOA, one has to add to the grid of experts some points of the -ball in addition to the 2d corners. In Section 3.1 we investigate the case of non-adaptive grids that are optimal but yield an infeasible (NP) algorithm. In Section 3.2 we introduce a pseudo-metric in order to bound the regret of grids consisting of the 2d corners and some arbitrary fixed points. From this crucial step, we then derive the form of the adaptive points we have to add to the 2d corners, in the adversarial case (Section 3.3) and in the i.i.d. case (Section 3.4).
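The reduction to the 2d corners rests on the fact that every point of the unit L1-ball is a convex combination of the signed canonical basis vectors. The sketch below makes this explicit; the function names and the way the slack is spread uniformly over the corners are illustrative choices, not the paper's construction.

```python
def corners(dim):
    """The 2d corners of the unit L1-ball: +e_1..+e_d, then -e_1..-e_d."""
    points = []
    for sign in (1.0, -1.0):
        for i in range(dim):
            points.append([sign if j == i else 0.0 for j in range(dim)])
    return points

def corner_weights(theta):
    """Simplex weights over corners(len(theta)) whose combination is theta."""
    dim = len(theta)
    norm1 = sum(abs(t) for t in theta)
    if norm1 > 1.0 + 1e-12:
        raise ValueError("theta must lie in the unit L1-ball")
    # Leftover mass is split evenly; the +e_i and -e_i shares cancel out.
    slack = (1.0 - norm1) / (2 * dim)
    plus = [max(t, 0.0) + slack for t in theta]
    minus = [max(-t, 0.0) + slack for t in theta]
    return plus + minus
```

Any expert algorithm run on these 2d points with linearized losses therefore produces predictions in the ball and competes with every point of the ball, at the price of the slow rate mentioned above.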
3.1. Warmup: fast rate by discretizing the space
As a warmup, we show how to use Theorem 2.1 in order to obtain a fast rate on for any . Basically, if the parameter could be included into the grid , Theorem 2.1 would turn into a bound on the regret with respect to . However, this is not possible as we do not know in advance. A solution consists in approaching with , a fixed finite -covering in -norm of minimal cardinality. In dimension , it is known that . We obtain the following near optimal rate for the regret on .
Let and and be its -approximation in . The proof follows from Lipschitzness of the loss: ; followed by applying Theorem 2.1 on . ∎
Following this method and inspired by the work of RigolletTsybakov2011, one can improve to by carefully choosing the prior ; see Appendix A for details. The obtained rate is optimal up to log-factors. However, the complexity of the discretization is prohibitive (of order ) and unrealistic for practical purposes.
3.2. Regret bound for arbitrary fixed discretization grid
for any . We say that a regret bound is accelerable if it provides a fast rate except for a term depending on the distance to the grid (i.e., the term in in (2)), which vanishes to zero. This property will be crucial in obtaining fast rates by enriching the grid . The regret bound (2) is thus not accelerable, because its second term is constant. In order to find an accelerable regret bound, we introduce the notion of averaging accelerability, a pseudo-metric that replaces the -norm in (2). We define it now formally and give its intuition in the sketch of proof of Theorem 3.2.
Definition 3.1 (averaging accelerability).
For any , we define
This averaging accelerability has several nice properties. In Appendix B, we provide a few concrete upper-bounds in terms of classical distances. For instance, Lemma B.1 provides the upper-bound . We are now ready to state our regret bound, when Algorithm 1 is applied with an arbitrary approximation grid .
Sketch of proof.
The complete proof can be found in Appendix C.2; we give here only the high-level idea. Let be the unknown parameter the algorithm will be compared with. Let be a point in the grid minimizing . Then one can decompose for a unique point and . In the analysis, the regret bound with respect to can be decomposed into two terms:
The first one quantifies the cost of picking , bounded using Theorem 2.1;
The second one is the cost of learning rescaled by . Using a classical slow-rate bound in , it is of order .
The average risk is thus of the order
Note that the bound of Theorem 3.2 is accelerable as it vanishes to zero, in contrast to Inequality (2). Theorem 3.2 provides an upper-bound which may improve the rate if the distance is small enough. By using the properties of the averaging accelerability (see Lemma B.1 in Appendix B), Theorem 3.2 provides some interesting properties of the rate in terms of distance. By including into our approximation grid , we get an oracle bound of order for any . Furthermore, it also yields for any , a bound of order for all .
It is also interesting to notice that the bound on the gradient can be substituted with the averaged gradient observed by the algorithm. This allows replacing with the level of the noise in certain situations with vanishing gradients (see for instance Theorem 3 of GaillardWintenberger2017).
3.3. Fast-rate sparsity regret bound under adversarial data
In this section, we focus on the adversarial case where are -strongly convex deterministic functions. In this case, Assumption 2 is satisfied with and . Our algorithm, called BOA+, is defined as follows. For , it predicts from time step to , by restarting Algorithm 1 with uniform prior, parameter and updated discretization grid indexed by :
where is the empirical risk minimizer (or the leader) until time . The notation denotes the hard-truncation with non-zero values. Note that for can be efficiently computed approximately as the solution of a strongly convex optimization problem.
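Two ingredients of the restart scheme lend themselves to a short illustration: the hard-truncation operator and the session schedule. The sketch below assumes that hard-truncation keeps the k entries of largest magnitude and that sessions start at powers of two, as in a standard doubling trick; both the names and these conventions are assumptions rather than the paper's exact definitions.

```python
def hard_truncate(theta, k):
    """Keep the k entries of largest absolute value, set the others to zero."""
    if k <= 0:
        return [0.0] * len(theta)
    order = sorted(range(len(theta)), key=lambda i: abs(theta[i]), reverse=True)
    keep = set(order[:k])
    return [t if i in keep else 0.0 for i, t in enumerate(theta)]

def doubling_sessions(horizon):
    """Yield (start, end) time ranges, end exclusive, for sessions starting
    at times 1, 2, 4, 8, ... up to the given horizon."""
    start = 1
    while start <= horizon:
        end = min(2 * start, horizon + 1)
        yield (start, end)
        start *= 2

def sparse_versions(leader):
    """All hard-truncations of the leader, from 1-sparse to fully dense."""
    return [hard_truncate(leader, k) for k in range(1, len(leader) + 1)]
```

At the start of each session, one would recompute the leader on the past data and add its sparse versions to the grid before restarting the expert algorithm; only a logarithmic number of restarts occurs over the horizon.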
Assume the losses are -strongly convex on with gradients bounded by in -norm. The average regret of BOA+ is upper-bounded for all as:
The proof is deferred to the appendix. It is worth noticing that the bound can be rewritten as follows:
It provides an intermediate rate between known optimal rates without sparsity and and known optimal rates with sparsity and but with inefficient procedures only. If all are approximately -sparse, it is possible to achieve a rate of order , for any . This can be achieved in particular in the i.i.d. setting (see next section). However, we leave for future work whether it is possible to achieve it in full generality and efficiently in the adversarial setting.
The strong convexity assumption on the losses can be relaxed by only assuming Inequality (3): there exist and such that for all and
The rates will then depend on as it was the case in Theorem 2.1. A specific interesting case is when . Then is very likely to be sparse. Denote its support. Assumption (3) can be weakened in this case. Indeed any satisfies , which from Lemma 6 of AgarwalNegahbanWainwright2012 yields where . One can thus restrict Assumption (3) to hold on the support of only. Such restricted conditions for are common in the sparse learning literature and essentially necessary to hold for the existence of efficient and optimal sparse procedures, see Zhang2014. In the online setting, the restricted condition (3) with should hold at any time , which is unlikely.
3.4. Fast-rate sparsity risk bound under i.i.d. data
In this section, we provide an algorithm with fast-rate sparsity risk-bound on under i.i.d. data. This is obtained by regularly restarting Algorithm 1 with an updated discretization grid approaching the set of minimizers .
This fast-rate type stochastic condition is equivalent to the central condition (see (van2015fast, Condition 5.2)) and was already considered to obtain faster rates of convergence for the regret (see (koolen2016combining, Condition 1)).
The Łojasiewicz’s assumption
In order to obtain sparse oracle inequalities we also need the Łojasiewicz’s Assumption 3 which is a relaxed version of strong convexity of the risk.
Łojasiewicz’s inequality: is i.i.d. and there exist and such that, for all with , there exists satisfying
This assumption is fairly mild. It is indeed satisfied with and as soon as the loss is convex. For , this assumption is implied by the strong convexity of the risk . One should mention that our framework is more general than this classical case because
multiple optima are allowed, which seems to be new when combined with sparsity bounds;
in contrast to Steinhardt2014 or GaillardWintenberger2017, our framework does not compete with the minimizer over with a known upper-bound on the -norm . We consider the minimizer over the -ball only. The latter is more likely to be sparse and Assumption 3 only needs to hold over .
is more restrictive because it is heavily design dependent. In linear regression for instance, the constant corresponds to the smallest non-zero eigenvalue of the Gram matrix while . If is a singleton then Assumption 3 implies Assumption (A2’) with .
Algorithm and risk bound
Our new procedure is described in Algorithm 2. It is based on the following fact: the bound of Theorem 3.2 is small if one of the estimators in is close to . Thus, our algorithm regularly restarts BOA by adding current estimators of into an updated grid . The estimators are built by averaging past iterates and are truncated to be sparse and to ensure small -distance. Note that restart schemes under Łojasiewicz’s Assumption are natural and were already used, for instance, in optimization by roulet2017sharpness. A stochastic version of the algorithm (sampling randomly a subset of gradient coordinates at each time step) can be implemented as in the experiments of DuchiEtAl2008. We get the following upper-bound on the average risk. The proof, which computes the exact constants, is postponed to Appendix C.7.
define if and otherwise.
Define a set of hard-truncated and dilated soft-thresholded versions of as in (42);
At time step , restart Algorithm 1 in with parameters (denote by its elements), and uniform prior over . In other words, for time steps :
predict and observe
define component-wise for all
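The two operations used to build new grid points at each restart, averaging the iterates of the last session and sparsifying the average, can be sketched as follows. This is only an illustration: the exact construction of the paper (including the dilation step and the choice of thresholds) is not reproduced here, and the function names and the threshold `tau` are hypothetical.

```python
import math

def average_iterates(iterates):
    """Coordinate-wise average of the predictions made during a session."""
    n = len(iterates)
    dim = len(iterates[0])
    return [sum(it[j] for it in iterates) / n for j in range(dim)]

def soft_threshold(theta, tau):
    """Shrink every coordinate towards zero by tau; small ones become zero."""
    return [math.copysign(max(abs(t) - tau, 0.0), t) for t in theta]
```

Averaging is what replaces the leader in the i.i.d. setting: by convexity of the risk, the risk of the averaged iterate is at most the average risk of the iterates, which the algorithm controls.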
Let , . Under Assumptions (A1-3), if , , Algorithm 2 satisfies with probability at least the bound on the average risk
Approximately sparse optima. Our results can be extended to a unique approximately sparse optimum . We get for any ; see AgarwalNegahbanWainwright2012; bunea2007.
On the radius of the L1 ball. We only performed the analysis in , the -ball of radius 1. However, one might need to compare with parameters in the -ball of radius . This can be done by simply rescaling the losses and applying our results to the loss functions instead of . If lies on the border of the -ball, we could not avoid a factor . In that situation, our algorithm needs to recover the support of without the Irrepresentability Condition (wainwright2009sharp) (see configuration 3 of Figure 1). In this case, we can actually relax Assumption 3 to hold in sup-norm.
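The rescaling argument can be written as a one-line wrapper. The sketch below assumes the rescaled loss is obtained by composing the original loss with a multiplication by the radius, so that the unit ball in the new parametrization corresponds to the ball of the desired radius; the names are hypothetical.

```python
def rescale_loss(loss, radius):
    """Return theta -> loss(radius * theta), a loss defined on the unit ball."""
    def rescaled(theta):
        return loss([radius * t for t in theta])
    return rescaled

def to_original_scale(theta, radius):
    """Map a prediction in the unit ball back to the ball of given radius."""
    return [radius * t for t in theta]
```

By the chain rule, the gradients of the rescaled loss are those of the original loss multiplied by the radius, so the Lipschitz constant entering the bounds scales accordingly.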
In this paper, we show that BOA is an optimal online algorithm for aggregating predictors under very weak conditions on the loss. Then we aggregate sparse versions of the leader (BOA+) or of the averaging of BOA’s weights (SABOA) in the adversarial or in the i.i.d. setting, respectively. Aggregating both achieves sparse fast rates of convergence in any case. These rates are deteriorated compared with the optimal ones, which require restrictive assumptions. Our weaker conditions are very sensitive to the radius of the -ball we consider. The optimal choice of the radius, if it is not imposed by the application, is left for future research.
Appendix A Sparse oracle inequality by discretizing the space
Inspired by the work of RigolletTsybakov2011, one can improve to in Proposition 3.1 by carefully choosing the prior . To do so, we cover by the subspaces
where denotes a sparsity pattern which determines the non-zero components of . For each sparsity pattern , the subspace can be approximated in -norm by an -cover of size . In order to obtain the optimal rate of convergence, we apply Algorithm 1 with with a non-uniform prior . The latter penalizes non-sparse to reflect their respective complexities. We assign to any the prior, depending on ,
Note that the sum over and is one. Therefore, Theorem 2.1 yields
by noting that