1 Introduction
The least squares method forms a cornerstone of statistics and machine learning. It is used as the main component of many stochastic sequential decision problems, such as multi-armed bandits, linear bandits, and other linear control problems. However, the analysis of least squares in these online settings is nontrivial because of the correlations between data points. Fortunately, there is a connection between online least squares estimation and the area of self-normalized processes. The study of self-normalized processes has a long history that goes back to Student and is treated in detail in the recent book by de la Peña et al. (2009). Using these tools we provide a proof of a bound on the deviation for vector-valued martingales. A less general version of the bound can already be found in de la Peña et al. (2004, 2009). Additionally, our proof, based on the method of mixtures, is new, simpler and self-contained. The bound improves the previous bound of Rusmevichientong and Tsitsiklis (2010) and it is applicable to virtually any online least squares problem.
The bound that we derive immediately gives rise to tight confidence sets for the online least squares estimate that can replace the confidence sets in existing algorithms. In particular, the confidence sets can be used in the UCB algorithm for the multi-armed bandit problem, the ConfidenceBall algorithm of Dani et al. (2008) for the linear bandit problem, and the LinRel algorithm of Auer (2003) for the associative reinforcement learning problem. We show that this leads to improved performance of these algorithms. Our hope is that the new confidence sets can be used to improve the performance of algorithms for other similar linear decision problems.
The multi-armed bandit problem, introduced by Robbins (1952), is a game between the learner and the environment. At each time step, the learner chooses one of actions and receives a reward which is generated independently at random from a fixed distribution associated with the chosen arm. The objective of the learner is to maximize his total reward. The performance of the learner is evaluated by the regret, which is defined as the difference between his total reward and the total reward of the best action. Lai and Robbins (1985) prove a lower bound on the expected regret of any algorithm, where is the number of time steps, and are the reward distributions of the optimal arm and arm respectively, and is the KL-divergence.
Auer et al. (2002) designed the UCB algorithm and proved a finite-time logarithmic bound on its regret. They used Hoeffding’s inequality to construct confidence intervals and obtained a bound on the expected regret, where is the difference between the expected rewards of the best and the second best action. We modify UCB so that it uses our new confidence sets and we show a stronger result. Namely, we show that with probability , the regret of the modified algorithm is . Seemingly, this result contradicts the lower bound of Lai and Robbins (1985); however, our algorithm depends on which it receives as an input. The expected regret of the modified algorithm with matches the regret of the original algorithm.
In the linear bandit problem, the learner repeatedly chooses actions from a fixed subset of and receives a random reward, the expectation of which is a linear function of the action. Dani et al. (2008) proposed the ConfidenceBall algorithm and showed that its regret is at most with probability at least . We modify their algorithm so that it uses our new confidence sets and we show that its regret is at most . Additionally, the constants in our bound are smaller, and our bound holds for all , as opposed to the previous one, which holds only for sufficiently large . Dani et al. (2008) also prove a problem-dependent regret bound. Namely, they show that the regret of their algorithm is where is the “gap” as defined in (Dani et al., 2008). For our modified algorithm we prove an improved bound.
1.1 Notation
We use to denote the 2-norm. For a positive definite matrix , the weighted norm is defined by , where . The inner product is denoted by and the weighted inner product . We use to denote the minimum eigenvalue of the positive definite matrix . We use to denote that is positive definite, while we use to denote that it is positive semidefinite. The same notation is used to denote the Loewner partial order of matrices. We shall use to denote the unit vector, i.e., for all , and .
2 Vector-Valued Martingale Tail Inequalities
Let be a filtration, be an valued stochastic process adapted to , be a real-valued martingale difference process adapted to . Assume that is conditionally sub-Gaussian in the sense that there exists some such that for any , ,
(1) 
Consider the martingale
(2) 
and the matrixvalued processes
(3) 
where is an measurable, positive definite matrix. In particular, assume that with probability one, the eigenvalues of are larger than and that holds a.s. for any .
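To make the setup concrete, the following is a minimal numerical sketch, not part of the original derivation. It assumes the usual forms S_t = sum_k eta_k m_{k-1} and Vbar_t = V + sum_k m_{k-1} m_{k-1}^T for the martingale and the regularized matrix process, and the usual moment-generating-function form of the sub-Gaussian condition. The function name `self_normalized_norm`, the i.i.d. covariates, and the Rademacher noise are illustrative assumptions only.

```python
import numpy as np

def self_normalized_norm(m, eta, V):
    """Return (S_t, Vbar_t, ||S_t||_{Vbar_t^{-1}}).

    Assumed forms (hypothetical stand-ins for the displayed equations):
      S_t    = sum_k eta_k * m_{k-1}
      Vbar_t = V + sum_k m_{k-1} m_{k-1}^T
    """
    m = np.asarray(m, dtype=float)      # shape (t, d); row k-1 holds m_{k-1}
    eta = np.asarray(eta, dtype=float)  # shape (t,)
    S = m.T @ eta
    Vbar = V + m.T @ m
    norm = float(np.sqrt(S @ np.linalg.solve(Vbar, S)))
    return S, Vbar, norm

rng = np.random.default_rng(0)
t, d = 200, 3
# Illustrative i.i.d. covariates with ||m_k|| <= 1 (the analysis allows adaptive ones).
m = rng.uniform(-1.0, 1.0, size=(t, d)) / np.sqrt(d)
# Rademacher noise: a bounded, zero-mean example of sub-Gaussian noise.
eta = rng.choice([-1.0, 1.0], size=t)
V = np.eye(d)
S, Vbar, norm = self_normalized_norm(m, eta, V)

# For Rademacher noise, E[exp(g * eta)] = cosh(g) <= exp(g^2 / 2),
# i.e. the assumed sub-Gaussian condition holds with constant 1.
gammas = np.linspace(-5.0, 5.0, 101)
mgf_ok = bool(np.all(np.cosh(gammas) <= np.exp(gammas**2 / 2) + 1e-12))
```

The quantity `norm` is the self-normalized deviation that the tail inequalities of this section control.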
The following standard inequality plays a crucial role in the subsequent developments:
Lemma 1.
Consider , as defined above and let be a stopping time with respect to the filtration . Let be arbitrary and consider
Then is almost surely well-defined and
Proof.
The proof is standard (and is given only for the sake of completeness). We claim that is a supermartingale. Let
Observe that by (1), we have . Clearly, is adapted, as is . Further,
showing that is indeed a supermartingale.
Now, this immediately leads to the desired result when for some deterministic time . This is based on the fact that the mean of any supermartingale can be bounded by the mean of its first element. In the case of , for example, we have .
Now, in order to consider the general case, let . (Footnote 1: is a shorthand notation for .) It is well known that is still a supermartingale with . Further, since was nonnegative, so is . Hence, by the convergence theorem for nonnegative supermartingales, almost surely, exists, i.e., is almost surely well-defined. Further, by Fatou’s Lemma. ∎
Before stating our main results, we give some recent results, which can essentially be extracted from the paper by Rusmevichientong and Tsitsiklis (2010).
Theorem 2.
Consider the processes , as defined above and let
(4) 
Then, for any , , with probability at least ,
(5) 
We now show how to strengthen the previous result using the method of mixtures, originally used by Robbins and Siegmund (1970) to evaluate boundary crossing probabilities for Brownian motion.
Theorem 3 (Self-normalized bound for vector-valued martingales).
Let , , , , and be as before and let be a stopping time with respect to the filtration . Assume that is deterministic. Then, for any , with probability ,
(6) 
Proof.
Without loss of generality, assume that (by appropriately scaling , this can always be achieved). Let
Notice that by Lemma 1, the mean of is not larger than one.
Let be a Gaussian random variable which is independent of all the other random variables and whose covariance is . Define
Clearly, we still have .
Let us calculate : Let denote the density of and for a positive definite matrix let . Then,
Elementary calculation shows that if , ,
Therefore,
which gives
Now, from , we obtain
thus finishing the proof. ∎
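The "elementary calculation" in the proof is a Gaussian integral. As a hedged sanity check, the sketch below verifies its scalar analogue numerically. It assumes the mixed quantity has the form exp(lam*s - lam^2*v_t/2) and the mixing distribution is N(0, 1/v), in which case the integral equals sqrt(v/(v+v_t)) * exp(s^2/(2(v+v_t))). The function names and the test values of s, v_t, v are hypothetical.

```python
import numpy as np

def mixture_numeric(s, v_t, v, half_width=40.0, n=400_001):
    """Riemann-sum approximation of the integral of
    exp(lam*s - 0.5*v_t*lam^2) against the N(0, 1/v) density (scalar case)."""
    lam = np.linspace(-half_width, half_width, n)
    dlam = lam[1] - lam[0]
    # Combine all exponents before exponentiating to avoid overflow.
    log_integrand = lam * s - 0.5 * v_t * lam**2 - 0.5 * v * lam**2
    integrand = np.sqrt(v / (2.0 * np.pi)) * np.exp(log_integrand)
    return float(np.sum(integrand) * dlam)

def mixture_closed_form(s, v_t, v):
    """Closed form: sqrt(v / vbar) * exp(s^2 / (2 * vbar)) with vbar = v + v_t."""
    vbar = v + v_t
    return float(np.sqrt(v / vbar) * np.exp(s**2 / (2.0 * vbar)))

numeric = mixture_numeric(s=3.0, v_t=5.0, v=1.0)
closed = mixture_closed_form(s=3.0, v_t=5.0, v=1.0)
```

In the vector-valued case the square-root factor becomes a ratio of determinants, which is where the determinant term of the self-normalized bound comes from.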
Corollary (Uniform Bound). Under the same assumptions as in the previous theorem, for any , with probability ,
(7) 
Proof.
Let us now turn our attention to understanding the determinant term on the right-hand side of (6).
Lemma 4.
We have that
Further, we have that
Finally, if then
Proof.
Elementary algebra gives
(9) 
where we used that all the eigenvalues of a matrix of the form are one except one eigenvalue, which is and which corresponds to the eigenvector . Using , we can bound by
Combining , which holds when , and (9), we get
The trace of is bounded by , assuming . Hence, and therefore,
finishing the proof of the second inequality. The sum can itself be upper bounded as a function of provided that is large enough. Notice . Hence, we get that if ,
∎
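The linear-algebra facts behind this kind of argument can be checked numerically. The sketch below is an illustration under assumptions, not a reproduction of the lemma. It checks three standard facts: (a) the matrix I + x x^T has all eigenvalues equal to one except a single eigenvalue 1 + ||x||^2, (b) the rank-one update det(V + x x^T) = det(V)(1 + x^T V^{-1} x), and (c) the determinant-trace inequality det(A) <= (trace(A)/d)^d for positive definite A. The regularizer `V`, the bound `L`, and the random covariates are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, L = 4, 50, 1.0
V = 2.0 * np.eye(d)  # hypothetical positive definite regularizer

# Random covariates rescaled so that every row has norm at most L.
M = rng.normal(size=(t, d))
norms = np.linalg.norm(M, axis=1, keepdims=True)
M = M * np.minimum(1.0, L / norms)

# (a) Eigenvalues of I + x x^T (returned in ascending order).
x = M[0]
eigs = np.linalg.eigvalsh(np.eye(d) + np.outer(x, x))

# (b) Rank-one determinant update.
det_lhs = np.linalg.det(V + np.outer(x, x))
det_rhs = np.linalg.det(V) * (1.0 + x @ np.linalg.solve(V, x))

# (c) Determinant-trace inequality applied to Vbar = V + sum_k m_k m_k^T,
# using trace(Vbar) <= trace(V) + t * L^2.
Vbar = V + M.T @ M
logdet = np.linalg.slogdet(Vbar)[1]
trace_bound = d * np.log((np.trace(V) + t * L**2) / d)
```

Fact (c) is what turns the determinant term into a quantity that grows only logarithmically in the number of rounds.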
Most of this argument can be extracted from the paper of Dani et al. (2008). However, the idea goes back at least to Lai et al. (1979), Lai and Wei (1982) (a similar argument is used around Theorem 11.7 in the book by Cesa-Bianchi and Lugosi (2006)). Note that Lemmas B.9–B.11 of Rusmevichientong and Tsitsiklis (2010) also give a bound on , with an essentially identical argument. Alternatively, one can use the bounding technique of Auer (2003) (see the proof of Lemma 13 there on pages 412–413) to derive a bound like for a suitably chosen constant .
Remark 5.
By combining Corollary 3 and Lemma 4, we get a simple worst-case bound that holds with probability :
(10) 
Still, the new bound is considerably better than the previous one given by Theorem 2. Note that the factor cannot be removed, as shown by Problem 3, page 203 in the book by de la Peña et al. (2009).
3 Optional Skipping
Consider the case when , , i.e., the case of an optional skipping process. Then, using again , and thus the expression studied becomes
We also have
Thus, we get, with probability
(11) 
If we apply Doob’s optional skipping and Hoeffding-Azuma, with a union bound (see, e.g., the paper of Bubeck et al. (2008)), we would get, for any , , with probability ,
(12) 
The major difference between these bounds is that (12) depends explicitly on , while (11) does not. This has the positive effect that one need not recompute the bound if does not grow, which helps e.g. in the paper of Bubeck et al. (2008) to improve the computational complexity of the HOO algorithm. Also, the coefficient of the leading term in (11) under the square root is , whereas in (12) it is .
Instead of a union bound, it is possible to use a “peeling device” to replace the conservative factor in the above bound by essentially . This is done e.g. in Garivier and Moulines (2008) in their Theorem 22. (Footnote 2: They give their theorem as ratios, which they should not, since their inequality then fails to hold for . However, this is easy to remedy by reformulating their result as we do it here.) From their derivations, the following one-sided, uniform bound can be extracted (see Remark 24, page 19): For any , , with probability ,
(13) 
As noted by Garivier and Moulines (2008), due to the law of the iterated logarithm, the scaling of the right-hand side as a function of cannot be improved in the worst case. However, this leaves open the possibility of deriving a maximal inequality which depends on only through .
4 The Multi-Armed Bandit Problem
Now we turn our attention to the multi-armed bandit problem. Let denote the expected reward of action and , where is the expected reward of the optimal action. We assume that if we choose action in round , we obtain reward . Let denote the number of times that we have played action up to time , and denote the average of the rewards received by action up to time . From (11) with instead of and a union bound over the actions, we have the following confidence intervals that hold with probability at least :
(14) 
where
Modify the UCB Algorithm of Auer et al. (2002) to use the confidence intervals (14) and change the action selection rule accordingly. Hence, at time , we choose the action
(15) 
We call this algorithm UCB().
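The selection rule can be sketched in code as an optimistic index policy: play the arm maximizing the empirical mean plus a confidence width. The sketch below is a simulation under stated assumptions, not the authors' implementation. The width in `assumed_width` is a hypothetical stand-in for the width of (14), chosen only because it shrinks with the pull count and depends on the confidence parameter rather than on the round index. The names `ucb_delta` and `assumed_width`, the two-armed instance, and the bounded noise are all illustrative.

```python
import numpy as np

def assumed_width(n, num_arms, delta):
    """Hypothetical confidence width (a stand-in, not the width of (14)).
    It depends on the pull count n and on delta, but not on the round index."""
    return np.sqrt((1 + n) / n**2
                   * (1 + 2 * np.log(num_arms * np.sqrt(1 + n) / delta)))

def ucb_delta(means, horizon, delta, rng, width=assumed_width):
    """Simulate an optimistic index policy; return pull counts and pseudo-regret."""
    means = np.asarray(means, dtype=float)
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(horizon):
        if t < k:
            arm = t  # initialization: play every arm once
        else:
            index = sums / counts + width(counts, k, delta)
            arm = int(np.argmax(index))
        # Bounded zero-mean noise (illustrative choice of reward model).
        reward = means[arm] + 0.5 * rng.choice([-1.0, 1.0])
        counts[arm] += 1
        sums[arm] += reward
    regret = float(np.sum((means.max() - means) * counts))
    return counts, regret

rng = np.random.default_rng(0)
counts, regret = ucb_delta([0.8, 0.2], horizon=2000, delta=0.05, rng=rng)
```

Because the assumed width does not grow with the round index, a suboptimal arm stops being selected once its index falls below the optimal mean, which is the mechanism behind a regret bound that does not grow with the horizon.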
Theorem 6.
With probability at least , the total regret of the UCB() algorithm with the action selection rule (15) is constant and is bounded by
where is the index of the optimal action.
Proof.
Suppose the confidence intervals do not fail. If we play action , the upper estimate of the action is above . Hence,
Substituting and squaring gives
By using Lemma 8 of Antos et al. (2010), we get that
Thus, using , we get that with probability at least , the total regret is bounded by
∎
Remark 7.
Lai and Robbins (1985) prove that for any suboptimal arm ,
where and are the reward densities of the optimal arm and arm respectively, and is the KL-divergence. This lower bound does not contradict Theorem 6, as Theorem 6 only states a high probability upper bound for the regret. Note that UCB() takes delta as its input. Because with probability , the regret in time can be , in expectation, the algorithm might have a regret of . Now if we select , then we get an upper bound on the expected regret.
5 Application to Least Squares Estimation and Linear Bandit Problem
In this section we first apply Theorem 3 to derive confidence intervals for least-squares estimation, where the covariate process is an arbitrary process and then use these confidence intervals to improve the regret bound of Dani et al. (2008) for the linear bandit problem. In particular, our assumption on the data is as follows:

Assumption A1 Let be a filtration, , , be a sequence of random variables over such that is measurable, and is measurable . Assume that there exists such that , i.e., is a martingale difference sequence (, ) and that is sub-Gaussian: There exists such that for any ,
We shall call the random variables covariates and the random variables the responses. Note that the assumption allows any sequential generation of the covariates.
Let be the regularized leastsquares estimate of with regularization parameter :
(16) 
where is the matrix whose rows are and . We further let .
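As an illustration, the regularized least-squares estimate can be computed as follows. This is a minimal sketch assuming the standard ridge form theta_hat = (X^T X + lam*I)^{-1} X^T Y, with X the matrix of covariates (one per row) and Y the vector of responses. The helper names, the synthetic data, and the parameter `theta_star` are hypothetical. The sketch also evaluates both sides of the weighted Cauchy-Schwarz inequality |<x, u>| <= ||x||_{A^{-1}} ||u||_A, which is valid for any positive definite A.

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Assumed standard ridge form: theta_hat = (X^T X + lam*I)^{-1} X^T Y.
    Also returns the regularized design matrix A = X^T X + lam*I."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(A, X.T @ Y)
    return theta_hat, A

def weighted_norm(x, A):
    """Weighted norm ||x||_A = sqrt(x^T A x) for positive definite A."""
    return float(np.sqrt(x @ A @ x))

rng = np.random.default_rng(2)
n, d, lam = 100, 3, 1.0
theta_star = np.array([0.5, -0.3, 0.2])        # hypothetical true parameter
X = rng.uniform(-1.0, 1.0, size=(n, d))        # covariates, one per row
Y = X @ theta_star + 0.1 * rng.normal(size=n)  # noisy linear responses

theta_hat, A = ridge_estimate(X, Y, lam)

# First-order optimality of the ridge objective: X^T (X theta - Y) + lam * theta = 0.
grad = X.T @ (X @ theta_hat - Y) + lam * theta_hat

# Prediction error at a covariate x versus its weighted Cauchy-Schwarz bound.
x = np.array([1.0, 0.0, -1.0])
err = abs(x @ (theta_hat - theta_star))
bound = weighted_norm(x, np.linalg.inv(A)) * weighted_norm(theta_hat - theta_star, A)
```

The factor `weighted_norm(x, np.linalg.inv(A))` is small in directions that the covariates have explored well, which is why confidence sets of this shape adapt to the data.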
We are interested in deriving a confidence bound on the error of predicting the mean response at an arbitrarily chosen random covariate using the least-squares predictor . Using
we get
where . Note that is positive definite (thanks to ) and hence so is , so the above inner product is well-defined. Using the Cauchy-Schwarz inequality, we get
where we used that . Fix any . By Corollary 3, with probability at least ,
Therefore, on the event where this inequality holds, one also has
Similarly, we can derive a worst-case bound. The result is summarized in the following statement:
Theorem 8.
Let , , satisfy the linear model Assumption A1 with some , and let be the associated filtration. Assume that w.p.1 the covariates satisfy , and . Consider the regularized least-squares parameter estimate with regularization coefficient (cf. (16)). Let be an arbitrary, valued random variable. Let be the regularized design matrix underlying the covariates. Then, for any , with probability at least ,
(17) 
Similarly, with probability ,
(18) 
Remark 9.
We see that increases the second term (the “bias term”) in the parenthesis of the estimate. In fact, for fixed gives (as it should be). Decreasing , on the other hand increases