1 Introduction
In adversarial online learning (Cesa-Bianchi and Lugosi, 2006; Hazan, 2016), a player interacts with an unknown and arbitrary adversary in a sequence of rounds. At each round, the player chooses an action from an action space and incurs a loss associated with that chosen action. The loss functions are determined by the adversary and are fixed at the beginning of each round. After choosing an action, the player observes some feedback, which can help guide the choice of actions in subsequent rounds. The most common feedback model is the
full information model, where the player has access to the entire loss function at the end of each round. Another, more challenging feedback model is the partial information or bandit feedback model, where at the end of the round the player observes only the loss associated with the action chosen in that particular round. There are also other feedback models in between and beyond the full and bandit information models, many of which have also been studied in detail. A figure of merit that is often used to judge online learning algorithms is the notion of regret, which compares the player’s actions to the best single action in hindsight (defined formally in Section 1.2).
When the underlying action space is a continuous and compact (possibly convex) set and the losses are linear or convex functions over this set, many algorithms are known that attain sublinear and sometimes optimal regret in both these feedback settings. In this work we present a generalization of the well studied adversarial online linear learning framework. In our paper, at each round the player selects an action . This action is mapped to an element in a reproducing kernel Hilbert space (RKHS) generated by a mapping . The function is a kernel map, that is, it can be thought of as the inner product of an appropriate Hilbert space . The kernel map can be expressed as , where is the associated feature map.
At each round the loss is , where is the adversary’s action. In the full information setting, as feedback, the player has access to the entire adversarial loss function . In the bandit setting the player is only presented with the value of the loss, .
Notice that this class of losses is much more general than ordinary linear losses and includes potentially nonlinear and nonconvex losses like:

Linear Losses: . This loss is well studied in both the bandit and full information settings. We shall see that our regret bounds will match the bounds established in the literature for these losses.

Quadratic Losses: , where is a symmetric (not necessarily positive semidefinite) matrix and is a vector. Convex quadratic losses have been well studied under full information feedback as the online eigenvector decomposition problem. Our work establishes regret bounds in the full information setting and also in the previously unexplored bandit feedback setting.

Gaussian Losses: . We provide regret bounds for kernel losses that have not been commonly studied before, such as Gaussian losses, whose loss profile differs from that of a linear or convex loss.

Polynomial Losses: for example. We also provide regret bounds for polynomial kernel losses, which are potentially nonconvex, under both partial and full information settings. Specifically, in the full information setting we study posynomial losses (discussion in Appendix F).
1.1 Related Work
Adversarial online convex bandits were introduced and first studied by Kleinberg (2005); Flaxman et al. (2005). The problem most closely related to our work is the case when the losses are linear, which was introduced earlier by McMahan and Blum (2004); Awerbuch and Kleinberg (2004). To improve the dimension dependence in the regret, Dani et al. (2007); Cesa-Bianchi and Lugosi (2012); Bubeck et al. (2012) proposed the EXP2 (Expanded Exp) algorithm under different choices of exploration distributions. Dani et al. (2007) worked with the uniform distribution over the barycentric spanner of the set, Cesa-Bianchi and Lugosi (2012) used the uniform distribution over the set, and Bubeck et al. (2012) set the exploration distribution to be the one given by John’s theorem, which leads to a regret bound of . Here is the number of actions, is the number of rounds and is the dimension of the losses. In the case of linear bandits when the set is convex and compact, Abernethy et al. (2008) analyzed mirror descent to get a regret bound of for some . For the case of general convex losses with bandit feedback, Bubeck et al. (2017) recently proposed a polynomial-time algorithm that has a regret guarantee of , which is optimal in its dependence on the number of rounds . Previous work on this problem includes Agarwal et al. (2010); Saha and Tewari (2011); Hazan and Levy (2014); Dekel et al. (2015); Bubeck et al. (2015); Hazan and Li (2016) in the adversarial setting under different assumptions on the structure of the convex losses, and Agarwal et al. (2013), who studied this problem in the stochastic setting (for an extended bibliography of the work on online convex bandits see Bubeck et al., 2017). Valko et al. (2013) study stochastic kernelized contextual bandits with a modified Upper Confidence Bound (UCB) algorithm to obtain a regret bound similar to ours, where is the effective dimension dependent on the eigendecay of the kernel. This problem was also studied previously for loss functions drawn from Gaussian processes by Srinivas et al. (2010). Online learning under bandit feedback has also been studied when the losses are nonparametric, for example when the losses are Lipschitz (Rakhlin and Sridharan, 2014; Cesa-Bianchi et al., 2017).
In the full information case, the online optimization framework with convex losses was first introduced by Zinkevich (2003). The conditional gradient descent algorithm (a modification of which we study in this work) for convex losses in this setting was introduced and analyzed by Jaggi (2011) and then improved subsequently by Hazan and Kale (2012). The exponential weights algorithm, which we modify and use multiple times in this paper, has a rich history and has been applied to various online as well as offline settings. The case of convex quadratic losses has been well studied in the full information setting. This problem is called online eigenvector decomposition or online Principal Component Analysis (PCA). Recently Zhu and Li (2017) established a regret bound of for this problem by presenting an efficient algorithm that achieves this rate: a modified exponential weights strategy termed “follow the compressed leader”. Previous results for this problem were established in both adversarial and stochastic settings by modifications of exponential weights, gradient descent and follow the perturbed leader algorithms (Tsuda et al., 2005; Kalai and Vempala, 2005; Warmuth and Kuzmin, 2006; Arora and Kale, 2016; Warmuth and Kuzmin, 2008; Garber et al., 2015).
In the full information setting there has also been work on analyzing gradient descent and mirror descent in an RKHS (McMahan and Orabona, 2014; Balandat et al., 2016). However, in these papers, the player is allowed to play any action in a bounded set in a Hilbert space, while in our paper the player is constrained to only play rank-one actions, that is, the player chooses an action in which gets mapped to an action in the RKHS.
Contributions
Our primary contribution is to extend the linear bandits framework to more general classes of kernel losses. We present an exponential weights algorithm in this setting and establish a regret bound on its performance. We provide a more detailed analysis of the regret under assumptions on the eigendecay of the kernel. When we assume polynomial eigendecay of the kernel () we can guarantee the regret is bounded as . Under exponential eigendecay we can guarantee an even sharper bound on the regret of . We also provide an exponential weights algorithm and a conditional gradient algorithm for the full information case, where we do not need to assume any conditions on the eigendecay. Finally we provide two applications of our framework: (i) general quadratic losses (not necessarily convex) with linear terms, which we can solve efficiently both in the full information setting and the bandit setting, and (ii) a computationally efficient algorithm when the underlying losses are posynomials (a special class of polynomials).
Organization of the Paper
In the next section we introduce the notation and definitions. In Section 2 we present our algorithm under bandit feedback along with regret bounds for it. In Section 3 we study the problem in the full information setting. In Section 4 we apply our framework to general quadratic losses and prove that our algorithms are computationally efficient in this setting. All the proofs and technical details are relegated to the appendix. The appendix also contains the example of our framework applied to posynomial losses and experimental evidence verifying our claims.
1.2 Notation, main definitions and setting
Here we introduce definitions and notational conventions used throughout the paper.
In each round , the player chooses an action vector . The underlying kernel function at each round is which is a map from such that it is a kernel map and has an associated separable reproducing kernel Hilbert space (RKHS) with an inner product (for more properties of kernel maps and RKHS see Scholkopf and Smola, 2001). Let denote a feature map of such that for every we have . Note that the dimension of the RKHS, could be infinite (for example in the Gaussian kernel over ).
We let the adversary choose a vector in , and at each round the loss incurred by the player is . We assume that the adversary is oblivious, that is, it is a function of the previous actions of the player but unaware of the randomness used to generate . We let the size of the sets be bounded in kernel norm (we set the bound on the size of both sets to be the same for ease of exposition; they could be different, which would only change the constants in our results), that is,
(1) 
Throughout this paper we assume a rank-one learner, that is, in each round the player can pick a vector , such that for some . We now formally define the notion of expected regret.
Definition 1 (Expected regret).
The expected regret of an algorithm after rounds is defined as
(2) 
where and the expectation is over the randomness in the algorithm.
Essentially this amounts to comparing against the best single action in hindsight. Our hope will be to find a randomized strategy such that the regret grows sublinearly with the number of rounds . In what follows we will omit the subscript from the inner product whenever it is clear from the context that it refers to the RKHS inner product.
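As an informal illustration of Definition 1, the sketch below computes the realized regret of one run (the quantity inside the expectation) on a finite action set. It assumes that the adversary plays rank-one actions, so that each loss is a single kernel evaluation. The Gaussian kernel with unit bandwidth, the toy action set and the function names are assumptions made for this example and are not taken from the paper.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on R^d; the bandwidth parametrization is an assumption."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def realized_regret(played, adversary_points, action_set, kernel):
    """Cumulative loss of the played actions minus that of the best fixed action.

    The adversary is assumed to play rank-one actions, so the loss of action a
    against adversary point y is simply kernel(a, y).
    """
    player_loss = sum(kernel(a, y) for a, y in zip(played, adversary_points))
    best_fixed = min(
        sum(kernel(a, y) for y in adversary_points) for a in action_set
    )
    return player_loss - best_fixed

actions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
adversary = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
played = [actions[0], actions[2], actions[2]]
regret = realized_regret(played, adversary, actions, gaussian_kernel)
```

In this toy run the best fixed action is the third one, and the player pays extra only in the first round. Averaging `realized_regret` over the algorithm's internal randomness would give an estimate of the expected regret.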
To establish regret guarantees we will find that it is essential to work with finite dimensional kernels when working under bandit feedback (more details regarding this are in the proof of the regret bound of Algorithm 1). General kernel maps can have infinite dimensional feature maps, so we will require the construction of a finite dimensional kernel that uniformly approximates the original kernel . This motivates the definition of approximate kernels.
Definition 2 (approximate kernels).
Let and be two kernels over and let . We say is an approximation of if for all ,
2 Bandit Feedback Setting
In this section we present our results on kernel bandits. In the bandit setting we assume the player knows the underlying kernel function , however, at each round after the player plays a vector only the value of the loss associated with that action is revealed to the player – – and not the action of the adversary . We also assume that the player’s action set has finite cardinality. (This assumption can be relaxed to let be a compact set when is Lipschitz continuous. In this setting we can instead work with an appropriately fine approximating cover of the set .) This is a generalization of the well studied adversarial linear bandits problem. As we will see in subsequent sections, to guarantee a bound on the regret in the bandit setting our algorithm will build an estimate of the adversary’s action . This becomes impossible if is infinite dimensional (). To circumvent this, we will construct a finite dimensional proxy kernel that is an approximation of .
Whenever no approximate kernel is needed, for example when we allow the adversary to be able to choose an action without imposing extra requirements on the set other than it being bounded in norm. When is infinite we impose an additional constraint on the adversary to also select rank-one actions at each round, that is, for some . Next we present a procedure to construct a finite kernel that approximates the original kernel well.
2.1 Construction of the finite dimensional kernel
To construct the finite dimensional kernel we will rely crucially on Mercer’s theorem. We first recall a couple of useful definitions.
Definition 3.
Let and be a probability measure supported over . Let denote square integrable functions over and measure , .
Definition 4.
A kernel is square integrable with respect to measure over , if .
Theorem 5 (Mercer’s Theorem).
Let be compact and be a finite Borel measure with support . Suppose is a continuous square integrable positive definite kernel on , and define a positive definite operator by
Then there exists a sequence of eigenfunctions that form an orthonormal basis of consisting of eigenfunctions of , and an associated sequence of nonnegative eigenvalues such that for . Moreover the kernel function can be represented as
(3)
where the convergence of the series holds uniformly.
Mercer’s theorem suggests a natural way to construct a feature map for by defining the component of the feature map to be . Under this choice of the feature map the eigenfunctions are orthogonal with respect to the inner product . (To see this, observe that the function can be expressed as a vector in the RKHS with in the coordinate and zeros everywhere else. So for any two and with we have .) Armed with Mercer’s theorem we first present a simple deterministic procedure to obtain a finite dimensional approximate kernel of . When the eigenfunctions of the kernel are uniformly bounded, for all , and if the eigenvalues decay at a suitable rate we can truncate the series in Equation (3) to get a finite dimensional approximation.
Lemma 6.
Given , let be the Mercer operator eigenvalues of under a finite Borel measure with support and eigenfunctions with . Further assume that for some . Let be such that . Then the truncated feature map,
(4) 
induces a kernel , for all that is an approximation of .
The Hilbert space induced by is a subspace of the original Hilbert space . The proof of this lemma is a simple application of Mercer’s theorem and is relegated to Appendix C. If we have access to the eigenfunctions of we can construct and work with because as Lemma 6 shows is an approximation to . Additionally, also has the same first Mercer eigenvalues and eigenfunctions under as . Unfortunately, in most applications of interest the analytical computation of the eigenfunctions is not possible. We can get around this by building an estimate of the eigenfunctions using samples from by leveraging results from kernel principal component analysis (PCA).
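The truncation idea behind Lemma 6 can be sketched on a toy kernel whose Mercer expansion is known in closed form. Everything in this sketch is an assumption chosen for illustration: the eigenvalues are taken to be 2^(-j), the eigenfunctions are sqrt(2) cos(2 pi j x) on [0, 1] under the uniform measure (so they are bounded by sqrt(2)), and a 60-term series stands in for the full kernel. The check is that the uniform error of the m-term truncation stays below the squared eigenfunction bound times the tail sum of the eigenvalues.

```python
import math

def eigenvalue(j):
    """Assumed Mercer eigenvalue mu_j = 2^(-j), for j = 1, 2, ..."""
    return 2.0 ** (-j)

def eigenfunction(j, x):
    """sqrt(2) * cos(2*pi*j*x): orthonormal in L2([0, 1]) and bounded by sqrt(2)."""
    return math.sqrt(2.0) * math.cos(2.0 * math.pi * j * x)

def mercer_kernel(x, y, num_terms):
    """Kernel defined by the first num_terms terms of the Mercer series."""
    return sum(
        eigenvalue(j) * eigenfunction(j, x) * eigenfunction(j, y)
        for j in range(1, num_terms + 1)
    )

def truncated_feature_map(x, m):
    """m-dimensional feature map with coordinates sqrt(mu_j) * phi_j(x)."""
    return [math.sqrt(eigenvalue(j)) * eigenfunction(j, x) for j in range(1, m + 1)]

def sup_error(m, reference_terms=60, grid_size=41):
    """Largest gap on a grid between the reference kernel and its m-term truncation."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(
        abs(mercer_kernel(x, y, reference_terms) - mercer_kernel(x, y, m))
        for x in grid for y in grid
    )

m = 10
# Squared eigenfunction bound (equal to 2 here) times the eigenvalue tail sum.
tail_bound = 2.0 * sum(eigenvalue(j) for j in range(m + 1, 61))
error = sup_error(m)
```

With exponentially decaying eigenvalues, ten features already bring the uniform error down to about 2e-3 in this example; slower decay would require a larger m for the same accuracy.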
Definition 7.
Let be the subspace of spanned by the first eigenvectors of the covariance matrix .
This corresponds to the span of the eigenfunctions in (this holds because the eigenvector of the covariance matrix has as the coordinate and zero everywhere else, combined with the fact that are orthonormal under the inner product). Define the linear projection operator that projects onto the subspace ; where , if and .
Remark 8.
The feature map is a projection of the complete feature map to this subspace, .
Let be i.i.d. samples and construct the sample (kernel) covariance matrix, . Let be the subspace spanned by the top eigenvectors of . Define the stochastic feature map, , the feature map defined by projecting to the random subspace . Intuitively we would expect that if the number of samples is large enough, then the kernel defined by the feature map , will also be an approximation for the original kernel . Formalizing this claim is the following theorem.
Theorem 9.
Let be defined as in Lemma 6 and let the th level eigengap be . Further let , and . Then the finite dimensional kernels and satisfy the following properties with probability ,

.

The Mercer eigenvalues and of and are close, .
Theorem 9 shows that given the finite dimensional proxy is a approximation of with high probability as long as a sufficiently large number of samples is used. Furthermore, the top eigenvalues of the second moment matrix of are at most away from the eigenvalues of the second moment matrix of under .
To construct we need to calculate the top eigenvectors of the sample covariance matrix , however, it is equivalent to calculate the top eigenvectors of the sample Gram matrix and use them to construct the eigenvectors of (for more details see Appendix B where we review the basics of kernel PCA).
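The equivalence between the eigenvectors of the sample covariance matrix and those of the sample Gram matrix can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it uses an explicit three-dimensional feature map (so that the covariance matrix can be formed and compared), six hand-picked sample points, and plain power iteration for the leading eigenpair. None of these choices come from the paper.

```python
import math

def feature_map(x):
    """Explicit 3-dimensional feature map, chosen only for illustration."""
    return [x[0], x[1], x[0] * x[1]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def top_eigenpair(matrix, iterations=500):
    """Power iteration for the leading eigenpair of a symmetric PSD matrix."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        new = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(new, new))
        vec = [c / norm for c in new]
    value = dot(vec, [dot(row, vec) for row in matrix])
    return value, vec

samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (-1.0, 2.0), (0.5, -1.0)]
features = [feature_map(x) for x in samples]
p, dim = len(samples), len(features[0])

# Gram matrix: only needs kernel evaluations k(x_i, x_j) = <phi(x_i), phi(x_j)>.
gram = [[dot(fi, fj) for fj in features] for fi in features]
gram_value, alpha = top_eigenpair(gram)

# Lift the Gram eigenvector to feature space: v = sum_i alpha_i phi(x_i) / sqrt(lambda).
lifted = [
    sum(alpha[i] * features[i][k] for i in range(p)) / math.sqrt(gram_value)
    for k in range(dim)
]

# Sample covariance (second moment) matrix: (1/p) * sum_i phi(x_i) phi(x_i)^T.
cov = [
    [sum(f[r] * f[c] for f in features) / p for c in range(dim)]
    for r in range(dim)
]
cov_times_lifted = [dot(row, lifted) for row in cov]
residual = max(
    abs(cov_times_lifted[k] - (gram_value / p) * lifted[k]) for k in range(dim)
)
```

The lifted vector is a unit-norm eigenvector of the covariance matrix with eigenvalue equal to the Gram eigenvalue divided by the number of samples. In the infinite dimensional case only the Gram side of this computation is available, which is why the Gram route is the practical one.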
2.2 Bandits Exponential Weights
In this section we present a modified version of exponential weights adapted to work with kernel losses. Exponential weights has been analyzed extensively when applied to linear losses under bandit feedback (Dani et al., 2007; Cesa-Bianchi and Lugosi, 2012; Bubeck et al., 2012). Two technical challenges make it difficult to directly adapt their algorithms to our setting.
The first challenge is that at each round we need to estimate the adversarial action . If the feature map of the kernel is finite dimensional this is easy to handle; however, when the feature map is infinite dimensional, this becomes challenging and we need to build an approximate feature map using the construction described in Section 2.1. This introduces bias in our estimate of the adversarial action and we will need to control the contribution of the bias in our regret analysis. The second challenge will be to lower bound the minimum eigenvalue of the kernel covariance matrix, as we will need to invert this matrix to estimate . For general kernels which are infinite dimensional, the minimum eigenvalue is zero. To resolve this we will again turn to our construction of a finite dimensional proxy kernel.
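To see why an invertible covariance matrix matters, consider the one-point estimator that is standard in the linear bandit literature cited above: multiply the observed loss by the inverse covariance applied to the played feature vector. The sketch below assumes this standard form (the paper's exact estimator is not reproduced in this text) together with a hypothetical three-dimensional feature map and sampling distribution. It checks that the estimate averages to the adversary's action when the covariance is invertible.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular; estimator undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - dot(aug[i][i + 1:n], x[i + 1:])) / aug[i][i]
    return x

def covariance(probs, feats):
    """Sigma = sum_a p(a) * phi(a) phi(a)^T for a finite action set."""
    dim = len(feats[0])
    return [
        [sum(p * f[r] * f[c] for p, f in zip(probs, feats)) for c in range(dim)]
        for r in range(dim)
    ]

def loss_estimate(feat, observed_loss, cov):
    """One-point estimate of the adversary's action: Sigma^{-1} phi(a) * loss."""
    return [observed_loss * s for s in solve(cov, feat)]

# Hypothetical finite-dimensional features phi(a) for four actions.
feats = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [1.0, 1.0, 1.0], [-1.0, 0.5, 0.0]]
probs = [0.4, 0.3, 0.2, 0.1]          # sampling distribution over actions
w = [0.3, -0.2, 0.7]                  # adversary's (hidden) action
cov = covariance(probs, feats)

# Average the estimate over the sampling distribution to check unbiasedness.
mean_estimate = [0.0, 0.0, 0.0]
for p, f in zip(probs, feats):
    estimate = loss_estimate(f, dot(f, w), cov)
    mean_estimate = [m + p * e for m, e in zip(mean_estimate, estimate)]
```

If the features did not span the whole space, `solve` would hit a zero pivot and the estimator would be undefined, which mirrors the second challenge described above. A small minimum eigenvalue also inflates the variance of the estimate.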
2.3 Bandit Algorithm and Regret Bound
In our exponential weights algorithm we first build the finite dimensional proxy kernel using the construction described in Section 2.1. The rest of the algorithm is then almost identical to the exponential weights algorithm (EXP2) studied for linear bandits. In Algorithm 1 we set the exploration distribution to be such that it induces John’s distribution () over (first introduced as an exploration distribution by Bubeck et al. (2012); a short discussion is presented in Appendix G.1). Note that for finite sets it is possible to build an approximation to the minimal volume ellipsoid containing (John’s ellipsoid) and to John’s distribution in polynomial time (Grötschel et al., 2012). (It is thus possible to construct over in polynomial time. However, as is a finite set, using and it is also possible to construct over efficiently.) In this section we assume that the set is such that John’s ellipsoid is centered at the origin.
Crucially, note that we construct the finite dimensional feature map only once before the first round. In the algorithm presented above we build using the uniform distribution over , assuming that the kernel has fast eigendecay under this measure. However, any other distribution, say (with support ), could also be used instead of in Algorithm 1 if the kernel enjoys fast eigendecay under .
In our algorithm we build and invert the exact covariance matrix , however this can be relaxed and we can instead work with a sample covariance matrix. We analyze the required sample complexity and error introduced by this additional step in Appendix D. We now state the main result of this paper which is an upper bound on the regret of Algorithm 1.
Theorem 10.
We prove this theorem in Appendix A. Note that this is similar to the regret rate attained for adversarial linear bandits (Dani et al., 2007; Cesa-Bianchi and Lugosi, 2012; Bubeck et al., 2012) with additional terms that account for the bias in our loss estimates . In our regret bounds the parameter plays the role of the effective dimension and will be determined by the rate of the eigendecay of the kernel. When the underlying Hilbert space is finite dimensional (as is the case when the losses are linear or quadratic) our regret bound recovers exactly the results of previous work (that is, and ).
We note that the exploration distribution can also be chosen to be the uniform distribution over the barycentric spanner of . But this choice leads to slightly worse bounds on the regret and we omit a detailed discussion here for the sake of clarity. Next we define two characteristic eigenvalue decay profiles.
Definition 11 (Eigenvalue decay).
Let the Mercer operator eigenvalues of a kernel with respect to a measure over a set be denoted by .

is said to have polynomial eigenvalue decay (with ) if for all we have .

is said to have exponential eigenvalue decay (with ) if for all we have .
Under assumptions on the eigendecay we can establish bounds on the effective dimension and , so that the condition stated in Lemma 6 is satisfied and we are guaranteed to build an approximate kernel . We establish bounds on in Proposition 30 presented in Appendix C.1. Under the eigendecay profiles stated above we can now invoke Theorem 10.
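The effect of the two decay profiles on the effective dimension can be illustrated numerically. The sketch below assumes the usual forms of the two profiles (eigenvalues bounded by C j^(-beta) and by C exp(-beta j)) and searches for the smallest dimension whose eigenvalue tail bound falls below a tolerance. The constants, the tolerance and the function names are hypothetical, and the precise condition and constants of Lemma 6 and Proposition 30 are not reproduced here.

```python
import math

def poly_tail_bound(m, c, beta):
    """Integral bound on sum_{j > m} c * j^(-beta), valid for beta > 1 and m >= 1."""
    return c * m ** (1.0 - beta) / (beta - 1.0)

def exp_tail(m, c, beta):
    """Exact value of sum_{j > m} c * exp(-beta * j) (a geometric series)."""
    return c * math.exp(-beta * (m + 1)) / (1.0 - math.exp(-beta))

def smallest_dimension(tail, tolerance, max_m=10**7):
    """Smallest m >= 1 whose tail bound is at most the tolerance."""
    m = 1
    while tail(m) > tolerance:
        m += 1
        if m > max_m:
            raise ValueError("tolerance not reached within max_m terms")
    return m

tolerance = 1.5e-3
m_poly = smallest_dimension(lambda m: poly_tail_bound(m, 1.0, 2.0), tolerance)
m_exp = smallest_dimension(lambda m: exp_tail(m, 1.0, 2.0), tolerance)
```

For the same tolerance, the polynomial profile needs hundreds of dimensions in this example while the exponential profile needs only a handful. This is the gap between a dimension that grows polynomially and one that grows logarithmically in the inverse tolerance.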
Corollary 12.
Let the conditions stated in Theorem 10 hold and let . Then Algorithm 1 has its regret bounded by the following rates with probability .

If has polynomial eigenvalue decay under the uniform measure , with , then by choosing the step size where and , with large enough such that and , the expected regret is bounded by

If has exponential eigenvalue decay under the uniform measure , then by choosing the step size where and , with large enough so that , the expected regret is bounded by
Remark 13.
Under the polynomial eigendecay condition the regret is upper bounded by . Under exponential eigendecay we almost recover the adversarial linear bandits regret rate (up to logarithmic factors), with .
In the corollary above we assume that for ease of exposition; results follow in a similar vein for other values of with the constants altered. One way to interpret the results of Corollary 12 in contrast to the regret bounds obtained for linear losses is the following. We introduce additional parameters into our analysis to handle the infinite dimensionality of our feature vectors: the effective dimension and the bias of our estimate . When the effective dimension is chosen to be large we can build an estimate of the adversarial action which has low bias; however, this estimate would have large variance (). On the other hand, if we choose to be small we can build a low variance estimate of the adversarial action but with high bias ( is large). We trade these off optimally to get the regret bounds established above. In the case of exponential decay we obtain that the choice is optimal, hence the regret bound degrades only by a logarithmic factor in terms of as compared to linear losses (where would be a constant). When we have polynomial decay, the effective dimension is higher, which leads to worse bounds on the expected regret. Note that asymptotically as the regret bound goes to , which aligns well with the intuition that the effective dimension is small. While when (the effective dimension is large) the regret upper bound becomes close to linear in .
3 Full Information Setting
3.1 Full information Exponential Weights
We begin by presenting a version of the exponential weights algorithm, Algorithm 2, adapted to our setup. In each round we sample an action vector (a compact set) from the exponential weights distribution . After observing the loss, we update the distribution by a multiplicative factor, . In the algorithm presented we choose the initial distribution to be uniform over the set ; however, alternate initial distributions with support over the whole set could also be considered. We can establish a sublinear regret of for the exponential weights algorithm.
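The sample-then-reweight loop described above can be sketched as follows. This is a minimal sketch, not Algorithm 2 itself: it assumes a finite action set in place of a compact one, a rank-one adversary so that the full loss function is a kernel evaluated at every action, and an inhomogeneous polynomial kernel. The step size and all names are hypothetical.

```python
import math
import random

def polynomial_kernel(x, y, degree=2):
    """Inhomogeneous polynomial kernel (1 + <x, y>)^degree; an illustrative choice."""
    return (1.0 + sum(xi * yi for xi, yi in zip(x, y))) ** degree

def exponential_weights(action_set, adversary_points, kernel, eta, seed=0):
    """Full-information exponential weights over a finite action set.

    Returns the actions played and the final distribution over action_set.
    """
    rng = random.Random(seed)
    probs = [1.0 / len(action_set)] * len(action_set)  # uniform initial distribution
    played = []
    for y in adversary_points:
        played.append(rng.choices(action_set, weights=probs, k=1)[0])
        # Full information: the loss of every action is revealed after the round.
        losses = [kernel(a, y) for a in action_set]
        weights = [p * math.exp(-eta * loss) for p, loss in zip(probs, losses)]
        total = sum(weights)
        probs = [wgt / total for wgt in weights]
    return played, probs

actions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
adversary = [(1.0, 0.2)] * 50
played, final_probs = exponential_weights(actions, adversary, polynomial_kernel, eta=0.5)
```

Against this fixed adversary the third action has zero loss, so the distribution concentrates on it after a few rounds. Renormalizing in every round keeps the weights from underflowing over long horizons.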
3.2 Conditional Gradient Descent
Next we present an online conditional gradient (Frank-Wolfe) method (Hazan and Kale, 2012) adapted for kernel losses. The conditional gradient method is a well studied algorithm in both the online and offline settings (for a review see Hazan, 2016). The main advantage of the conditional gradient method is that, as opposed to projected gradient descent and related methods, the projection step is avoided. At each round the conditional gradient method involves the optimization of a linear (kernel) objective function over to get a point . Next we update the optimal mean action by reweighting the previous mean action by and weighting our new action by . Note that this construction also automatically suggests a distribution over such that is a convex combination of . For this algorithm we can prove a regret bound of .
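The bookkeeping behind the mean update and the induced distribution can be sketched on a finite set of explicit feature vectors. The objective that is linearized here (the scaled cumulative linear loss plus a squared distance to the starting point) and the step size schedule follow the general style of online Frank-Wolfe methods. Both are assumptions of this sketch and are not the exact choices of the paper's algorithm.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def online_conditional_gradient(features, adversary_vectors, eta):
    """Projection-free updates over the convex hull of a finite set of feature vectors.

    Returns the final mean action and the convex weights it places on each action.
    """
    dim = len(features[0])
    mean = list(features[0])                 # start at the first action
    start = list(mean)
    weights = [0.0] * len(features)
    weights[0] = 1.0
    grad_sum = [0.0] * dim
    for t, w in enumerate(adversary_vectors, start=1):
        grad_sum = [g + wi for g, wi in zip(grad_sum, w)]
        # Gradient at the current mean of eta * <grad_sum, X> + ||X - X_1||^2.
        grad = [eta * g + 2.0 * (m - s) for g, m, s in zip(grad_sum, mean, start)]
        # Linear optimization step: no projection is needed.
        best = min(range(len(features)), key=lambda i: dot(features[i], grad))
        gamma = min(1.0, 2.0 / math.sqrt(t))
        mean = [(1.0 - gamma) * m + gamma * f for m, f in zip(mean, features[best])]
        weights = [(1.0 - gamma) * wt for wt in weights]
        weights[best] += gamma
    return mean, weights

features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
adversary = [[1.0, 0.5]] * 30
mean, weights = online_conditional_gradient(features, adversary, eta=0.1)
reconstructed = [sum(wt * f[k] for wt, f in zip(weights, features)) for k in range(2)]
```

The `weights` vector is the distribution over actions mentioned above: it stays on the probability simplex by construction, and the mean action is exactly the corresponding convex combination of feature vectors. A rank-one player can therefore sample an action from it.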
4 Application: General Quadratic Losses
The first example of losses that we present are general quadratic losses. At each round the adversary can choose a symmetric (not necessarily positive semidefinite) matrix , and a vector , with a constraint on the norm of the matrix and vector such that . If we embed this pair into a Hilbert space defined by the feature map we get a kernel loss defined as , where is the associated feature map for any and the inner product in the Hilbert space is defined as the concatenation of the trace inner product on the first coordinate and the Euclidean inner product on the second coordinate. The cumulative loss that the player aspires to minimize is . The setting without the linear term, that is when with positive semidefinite matrices , has previously been well studied in Warmuth and Kuzmin (2006, 2008); Garber et al. (2015); Zhu and Li (2017). When the matrix is not positive semidefinite (making the losses nonconvex) and there is a linear term, regret guarantees and tractable algorithms have not been studied even in the full information case.
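The embedding described above can be verified numerically. The sketch assumes that the feature map sends an action to the pair formed by its outer product with itself and the action, which is the natural reading of the trace-plus-Euclidean inner product in the text. It checks that the inner product with the adversary's matrix and vector pair reproduces the quadratic loss. The matrix, vector and action values are arbitrary.

```python
def quadratic_loss(a, matrix, vector):
    """Direct evaluation of a^T W a + b^T a."""
    n = len(a)
    quad = sum(a[i] * matrix[i][j] * a[j] for i in range(n) for j in range(n))
    return quad + sum(vector[i] * a[i] for i in range(n))

def feature_map(a):
    """Feature pair (a a^T, a): an outer product and the action itself."""
    return [[ai * aj for aj in a] for ai in a], list(a)

def pair_inner_product(left, right):
    """Trace inner product on the matrix part plus Euclidean on the vector part."""
    (m1, v1), (m2, v2) = left, right
    trace_part = sum(x * y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))
    return trace_part + sum(x * y for x, y in zip(v1, v2))

W = [[1.0, -2.0], [-2.0, -0.5]]   # symmetric and indefinite, so the loss is nonconvex
b = [0.3, -0.7]
a = [0.6, -0.8]
direct = quadratic_loss(a, W, b)
kernelized = pair_inner_product(feature_map(a), (W, b))
```

Both evaluations agree, so a nonconvex quadratic in the action becomes a linear function of the feature pair. This is what allows the regret bounds for kernel losses to apply to this class.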
As this is a kernel loss we have regret bounds for these losses. We demonstrate in the subsequent sections that in the full information case it is also possible to run our algorithms efficiently. First, for exponential weights, we show that sampling is efficient for these losses.
Lemma 16 (Proof in Appendix E.1).
Let be a symmetric matrix and . Sampling from for is tractable in time.
4.1 Guarantees for Conditional Gradient Descent
We now demonstrate that conditional gradient descent also can be run efficiently when the adversary plays a general quadratic loss. At each round the conditional gradient descent requires the player to solve the optimization problem, . When the set of actions is then under quadratic losses this problem becomes,
(5) 
for an appropriate matrix and that can be calculated by aggregating the adversary’s actions up to step . Observe that the optimization problem in Equation (5) is a quadratically constrained quadratic program (QCQP) given our choice of . The dual problem is the semidefinite program (SDP),
For this particular program with a norm ball constraint set it is known that the duality gap is zero provided Slater’s condition holds, that is, strong duality holds (see Appendix B.1 of Boyd and Vandenberghe, 2004).
Another example of losses where our framework is computationally efficient is when the underlying losses are posynomials (a special class of polynomials). We present this discussion in Appendix F.
5 Conclusions
Under bandit feedback it would be interesting to explore whether it is possible to establish lower bounds on the regret under the eigendecay conditions stated. Another interesting technical challenge is to see if Lemma 20, which we use to control the bias in our estimators, can be sharpened to provide nontrivial regret guarantees even for slow eigendecay (). Finding more kernel losses where our algorithms are provably computationally efficient is another exciting direction. Finally, analyzing a mirror descent type algorithm under this framework could be useful to efficiently solve a wider class of problems.
Acknowledgments
We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).
References
 Abernethy et al. (2008) Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008.
 Agarwal et al. (2010) Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010. URL http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=36.
 Agarwal et al. (2013) Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. volume 23, pages 213–240, 2013.
 Arora and Kale (2016) Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. volume 63, 2016. URL http://doi.acm.org/10.1145/2837020.
 Arora et al. (2012) Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
 Awerbuch and Kleinberg (2004) Baruch Awerbuch and Robert D Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 45–53. ACM, 2004.
 Balandat et al. (2016) Maximilian Balandat, Walid Krichene, Claire Tomlin, and Alexandre M. Bayen. Minimizing regret on reflexive Banach spaces and Nash equilibria in continuous zero-sum games. pages 154–162, 2016. URL http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems292016.
 Ball (1997) Keith Ball. An elementary introduction to modern convex geometry. 31:1–58, 1997. URL https://doi.org/10.2977/prims/1195164788.
 Bartlett (2014) Peter Bartlett. Learning in sequential decision problems. Lecture Notes Stat 260/CS 294-102, 2014.
 Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
 Boyd et al. (2007) Stephen Boyd, Seung-Jean Kim, Lieven Vandenberghe, and Arash Hassibi. A tutorial on geometric programming. 8(1):67–127, 2007. URL http://dx.doi.org/10.1007/s1108100790017.
 Bubeck et al. (2012) Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pages 41–1. Microtome, 2012.

 Bubeck et al. (2015) Sébastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. Bandit convex optimization: regret in one dimension. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 266–278, Paris, France, 2015. URL http://proceedings.mlr.press/v40/Bubeck15a.html.
 Bubeck et al. (2017) Sébastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. pages 72–85, 2017. URL http://doi.acm.org/10.1145/3055399.
 Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. ISBN 9780521841085.
 Cesa-Bianchi and Lugosi (2012) Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. J. Comput. Syst. Sci, 78(5):1404–1422, 2012.
 Cesa-Bianchi et al. (2017) Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 465–481, Amsterdam, Netherlands, 2017. PMLR. URL http://proceedings.mlr.press/v65/cesabianchi17a.html.

 Cristianini and Taylor (2000) Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. URL http://www.supportvector.net/.
 Dani et al. (2007) Varsha Dani, Thomas P. Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, pages 345–352, 2007. URL http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems202007.
 Dani et al. (2008) Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. pages 355–366, 2008.
 Dekel et al. (2015) Ofer Dekel, Ronen Eldan, and Tomer Koren. Bandit smooth convex optimization: Improving the bias-variance tradeoff. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 2926–2934, 2015. URL http://papers.nips.cc/book/advancesinneuralinformationprocessingsystems282015.
 Flaxman et al. (2005) Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
 Garber et al. (2015) Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In International Conference on Machine Learning, pages 560–568, 2015.

 Grötschel et al. (2012) Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
 Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
 Hazan and Kale (2012) Elad Hazan and Satyen Kale. Projection-free online learning. arXiv preprint arXiv:1206.4657, 2012.
 Hazan and Levy (2014) Elad Hazan and Kfir Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, pages 784–792, 2014.
 Hazan and Li (2016) Elad Hazan and Yuanzhi Li. An optimal algorithm for bandit convex optimization. arXiv preprint arXiv:1603.04350, 2016.

 Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 Hoffman and Wielandt (1953) A. J. Hoffman and H. W. Wielandt. The variation of the spectrum of a normal matrix. Duke Math. J., 20:37–39, 1953. URL http://projecteuclid.org/euclid.dmj/1077465062.
 Jaggi (2011) Martin Jaggi. Convex optimization without projection steps. arXiv preprint arXiv:1108.1170, 2011.
 Kalai and Vempala (2005) Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
 Kleinberg (2005) Robert D Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.
 Lovász and Vempala (2007) László Lovász and Santosh Vempala. The geometry of log-concave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.

 McMahan and Blum (2004) H. Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Learning Theory: 17th Annual Conference on Learning Theory, COLT 2004, Banff, Canada, July 1–4, 2004. Proceedings, volume 3120 of Lecture Notes in Artificial Intelligence, pages 109–123. Springer, 2004.
 McMahan and Orabona (2014) H Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pages 1020–1039, 2014.

 Mendelson and Pajor (2006) Shahar Mendelson and Alain Pajor. On singular values of matrices with independent rows. Bernoulli, 12(5):761–773, 2006.
 Mercer (1909) James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical transactions of the royal society of London. Series A, containing papers of a mathematical or physical character, 209:415–446, 1909.
 Nie et al. (2013) Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online PCA with optimal regrets. In International Conference on Algorithmic Learning Theory, pages 98–112. Springer, 2013.
 Rakhlin and Sridharan (2014) Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. 35:1232–1264, 2014. URL http://proceedings.mlr.press/v35/rakhlin14.html.
 Saha and Tewari (2011) Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 636–642, 2011.
 Scholkopf and Smola (2001) Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
 Shalev-Shwartz and Singer (2007) Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2–3):115–142, 2007.
 Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias W. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. pages 1015–1022, 2010. URL http://www.icml2010.org/papers/422.pdf.
 Tsuda et al. (2005) Koji Tsuda, Gunnar Rätsch, and Manfred K Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6(Jun):995–1018, 2005.
 Valko et al. (2013) Michal Valko, Nathaniel Korda, Remi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.

 Warmuth and Kuzmin (2006) Manfred K Warmuth and Dima Kuzmin. Online variance minimization. In International Conference on Computational Learning Theory, pages 514–528. Springer, 2006.
 Warmuth and Kuzmin (2008) Manfred K Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.
 Zhu and Li (2017) Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster algorithms for matrix multiplicative weight updates. CoRR, abs/1701.01722, 2017. URL http://arxiv.org/abs/1701.01722.
 Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

 Zwald and Blanchard (2006) Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 1649–1656, 2006.
Appendix A Bandits Exponential Weights Regret Bound
In this section we prove the regret bound stated in Section 2.3. Here we borrow all the notation from Section 2. As defined before, the expected regret for Algorithm 1 after rounds is
where is the exponential weights distribution described in Algorithm 1, is the optimal action, and is the sigma-field that conditions on (), the events up to the end of round . We will prove the regret bound for the case when the kernel is infinite-dimensional, that is, the feature map , where . When is finite the proof is identical with . Recall that when is infinite we constrain the adversary to play rank-1 actions. We refer to the adversarial action as for some . Expanding the definition of regret, we get
Here is the regret when we play the distribution in Algorithm 1 but are hit with losses that are governed by the kernel – (with the same as before). Observe that in only the component of in the subspace contributes to the inner product, thus every term is of the form
As the proxy kernel is uniformly close by Theorem 9, we have
(6) 
A.1 Proof of Theorem 10
We will now attempt to bound and prove Theorem 10. First we define the unbiased estimator (conditioned on ) of at each round,
(7)
where . We cannot build as we do not receive as feedback. Thus we also have
(8) 
If then the bias would be zero. We now present some estimates involving . In the following section we sometimes denote and more explicitly as and where there may be room for confusion.
Lemma 17.
For any fixed we have,
We also have for all ,
Proof. The first claim follows from Equation (7) and the linearity of expectation:
where the expectation is taken over . To prove the second part of the lemma, we use the tower property. Observe that conditioned on , and are measurable.
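The unbiasedness argument can be checked numerically in a simplified setting. The following is a minimal sketch, assuming a finite action set in two dimensions and no proxy kernel, so it is a finite-dimensional analogue and not the estimator of Algorithm 1 itself. It uses the standard least-squares estimate from the linear bandits literature (inverse second-moment matrix times the played action times the observed scalar loss) and verifies with exact rational arithmetic that its expectation under the sampling distribution equals the true loss vector. The function names and the toy numbers are hypothetical.

```python
from fractions import Fraction


def second_moment(actions, probs):
    """Second-moment matrix sum_i p_i * x_i x_i^T for 2-d actions."""
    s = [[Fraction(0), Fraction(0)], [Fraction(0), Fraction(0)]]
    for x, p in zip(actions, probs):
        for a in range(2):
            for b in range(2):
                s[a][b] += p * x[a] * x[b]
    return s


def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix; fails if the matrix is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("second-moment matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def estimate(sigma_inv, x, loss):
    """Least-squares estimate Sigma^{-1} x * loss from one observed scalar loss."""
    return [loss * (sigma_inv[a][0] * x[0] + sigma_inv[a][1] * x[1])
            for a in range(2)]


def expected_estimate(actions, probs, w):
    """Exact expectation of the estimate when x is drawn from probs."""
    sigma_inv = inverse_2x2(second_moment(actions, probs))
    mean = [Fraction(0), Fraction(0)]
    for x, p in zip(actions, probs):
        loss = x[0] * w[0] + x[1] * w[1]  # bandit feedback: only <x, w> is seen
        est = estimate(sigma_inv, x, loss)
        mean = [mean[0] + p * est[0], mean[1] + p * est[1]]
    return mean


# Toy instance (hypothetical values chosen only for illustration).
actions = [(Fraction(1), Fraction(0)),
           (Fraction(0), Fraction(1)),
           (Fraction(1), Fraction(1))]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
w = (Fraction(3, 10), Fraction(-7, 10))
mean = expected_estimate(actions, probs, w)
```

In this simplified setting `mean` coincides with `w` because the expectation collapses to the inverse second-moment matrix times the second-moment matrix times `w`. A bias term can only appear when the estimator is built from a matrix other than the true second-moment matrix of the sampled features.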
We are now ready to prove Theorem 10 and establish the claimed regret bound.
Proof of Theorem 10. The proof is similar to the regret analysis of exponential weights for linear bandits. We proceed in four steps. In Step 1 we decompose the cumulative loss into an exploration cost and an exploitation cost. In Step 2 we control the exploitation cost using Hoeffding’s inequality, as is standard in the linear bandits literature; in addition, we need to control the terms arising from the bias of our estimate. In Step 3 we bound the exploration cost. Finally, in Step 4 we combine the pieces to establish the claimed regret bound.
Step 1: Using Lemma 17 and the fact that is an unbiased estimate of we can decompose the cumulative loss, the first term in as
where the second line follows by the definition of .
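The split into an exploitation cost and an exploration cost can be illustrated with a small sketch. It assumes, as is standard for exponential weights in linear bandits, that the sampling distribution is a mixture of an exponential weights distribution and a fixed exploration distribution with mixing weight `gamma`; the finite action set, the names `exp_weights`, `mix`, `expected_loss`, and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def exp_weights(cum_loss_estimates, eta):
    """Distribution with q(a) proportional to exp(-eta * estimated cumulative loss)."""
    shift = min(cum_loss_estimates)  # subtract the minimum for numerical stability
    weights = [math.exp(-eta * (c - shift)) for c in cum_loss_estimates]
    total = sum(weights)
    return [wt / total for wt in weights]


def mix(q, nu, gamma):
    """Sampling distribution p = (1 - gamma) * q + gamma * nu."""
    return [(1 - gamma) * qi + gamma * ni for qi, ni in zip(q, nu)]


def expected_loss(dist, losses):
    """Expected loss of an action drawn from dist."""
    return sum(d * l for d, l in zip(dist, losses))


# Toy instance with three actions (hypothetical values).
cum = [0.4, 1.3, 0.9]        # estimated cumulative losses so far
losses = [0.2, -0.5, 0.7]    # losses of the current round
nu = [1 / 3, 1 / 3, 1 / 3]   # exploration distribution
gamma, eta = 0.1, 0.5

q = exp_weights(cum, eta)
p = mix(q, nu, gamma)

# Linearity of expectation splits the round's expected loss into two parts.
exploitation = (1 - gamma) * expected_loss(q, losses)
exploration = gamma * expected_loss(nu, losses)
total = expected_loss(p, losses)
```

Here `total` equals `exploitation + exploration` up to floating-point rounding, which is the per-round identity that a decomposition of this kind sums over all rounds.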