1 Introduction
The multi-armed bandit (MAB) problem (or simply bandit problem) has received much attention due to a wide range of applications, e.g., clinical trials Thompson (1933), economics Schlag (1998), routing Awerbuch & Kleinberg (2004), and ranking Radlinski et al. (2008). The MAB problem is a sequential decision problem: at each round, a learner selects an action (or arm) from a set of candidates and receives a reward for the selected action. The learner makes decisions based on observations such as the sequence of past rewards and selected actions, and would like to maximize the cumulative reward, or equivalently to minimize regret, defined as the difference between the learner's cumulative reward and that achieved by always playing the best arm/action.
The learner can often access contextual information in addition to rewards and selected actions, a setting referred to as the contextual MAB Langford & Zhang (2008). Examples include personalized recommendation Bouneffouf et al. (2012), web server defense Jung et al. (2012) and information retrieval Hofmann et al. (2011). For instance, the learner may see a feature vector associated with each arm at every round. To address the problem, one assumes a hypothesis set consisting of functions that map the feature vectors to an action expected to give the best reward. A simple and widely used choice is the linear hypothesis set, where each hypothesis is defined by a coefficient vector and predicts as optimal the action whose feature vector has the largest inner product with that coefficient vector. The linear hypothesis set assumes that the expected reward of an action is the inner product of its feature vector and a hidden coefficient vector, which is referred to as the linear payoff assumption and also called linear realizability. The resulting problem is an online linear regression task balancing the trade-off between exploration and exploitation.
In this paper, we study the contextual MAB problem with linear payoffs under the assumption of uncertainty, or noise, on features. Specifically, we assume that the learner cannot observe the true feature vector but only a noisy one, where the random noise is independently drawn from some distribution. This setting can incorporate statistical uncertainties of linear hypotheses and relax the strong linear assumption on rewards, i.e., enhance the power of linear models. Furthermore, it can incorporate recent remarkable progress in Bayesian deep learning techniques Gal & Ghahramani (2016) that estimate feature uncertainties, i.e., provide the knowledge of noise distributions.
There are two main challenges in the noisy contextual MAB problem:

The learner might not be able to extract the true hypothesis from any sequence of observations using policies designed for the noiseless contextual MAB problem, e.g., LinRel Auer (2002).

Even if the learner could learn the true hypothesis, it is still hard to design a good exploitation policy, since every arm carries some uncertainty due to the noise.
Therefore, we need to redesign learning policies to account for the noisy feature vectors. To the best of our knowledge, this is the first work that aims to solve the contextual MAB problem under such feature uncertainties.
Contribution. We first study the simplest, but non-trivial, case in which every action shares an identical noise vector at each round. This eliminates the exploitation issue above: the learner can find the best action after extracting the hidden coefficient vector, since the common noise shifts the estimated rewards of all actions by the same amount. However, the estimation issue remains, e.g., LinRel might fail to find the true coefficient vector. Furthermore, one has to design a new confidence interval accounting for the noise in order to balance the exploration and exploitation trade-off. To address these, we propose a noisy version of LinRel, coined NLinRel, which has a sublinear regret bound in the number of rounds. For the regret analysis, we use tail inequalities for sums of random matrices induced by the noise vectors and bound the resulting random matrix perturbation.
We next consider non-identical noise vectors, but assume that the feature vectors at each round are independently drawn from some distribution. The underlying reasoning for this statistical assumption is our finding that it eliminates the estimation issue. Specifically, if the feature distribution is Gaussian, we derive a closed-form formula for the optimal coefficient vector using Bayesian analysis. Somewhat interestingly, the optimal coefficient vector is not equal to the true one, which cannot occur in the noiseless setting. We further design a simple greedy algorithm that achieves a sublinear regret bound with respect to the optimal linear hypothesis. Here, one can easily observe that no linear hypothesis, including the greedy algorithm and NLinRel, can achieve a sublinear regret with respect to the optimal sequence of actions, and thus we analyze such a 'relative' regret. Our study of Gaussian features naturally motivates the question of whether the closed form is also optimal for general feature distributions. To this end, we derive an optimization formulation (i.e., a non-closed form) for the optimal coefficient vector in the general, possibly non-Gaussian, setting, and numerically find that the closed form is no longer optimal in this case. Finally, we design a new algorithm, coined UniversalNLinRel, for arbitrary distributions on features: it estimates the true coefficient vector using NLinRel and adjusts its parameter along a gradient direction of the optimization objective. In our experiments, UniversalNLinRel outperforms LinUCB Chu et al. (2011), a representative linear-hypothesis algorithm designed for the noiseless contextual MAB problem, on both noisy synthetic and real-world datasets.
Related works. Although the name, contextual multi-armed bandit, first appeared in Langford & Zhang (2008), the problem setting has been studied under different names, e.g., bandit with covariates Woodroofe (1979); Sarkar (1991), associative reinforcement learning Kaelbling (1994), associative bandit Auer (2002); Strehl et al. (2006) and bandit with expert advice Auer et al. (2002). This paper, in particular, focuses on the linear hypothesis set and the linear payoff model, which was originally introduced in Abe & Long (1999) and developed in Auer (2002). Our algorithm design of NLinRel is motivated by LinRel Auer (2002) and LinUCB Chu et al. (2011). Both LinRel and LinUCB compute the expected rewards and their confidence intervals for controlling the exploration and exploitation trade-off. Thompson sampling has also been studied for the linear payoff model Agrawal & Goyal (2013). The stochastic linear bandit optimization problem studied in Dani et al. (2008) and many following works are special cases of the contextual bandit with the linear payoff model having infinitely many arms. However, all these studies assume that the feature vectors are noiseless, and we cannot directly apply their algorithms to our noisy setting. One can discretize the linear hypothesis set into a fine net of hypotheses and use EXP4-type algorithms Auer et al. (2002); Beygelzimer et al. (2011) for the noisy contextual MAB problem studied in this paper. However, the computation costs of these algorithms are prohibitively expensive: the size of the net grows exponentially in the feature dimension, and EXP4-type algorithms have to update the weights of all elements of the net at every round. One could also use EpochGreedy Langford & Zhang (2008) for our Gaussian setting mentioned earlier, but it requires a huge amount of computation and memory when computing the maximum-likelihood hypothesis within the hypothesis set. If one uses the net, EpochGreedy has the same issue as the EXP4-type algorithms; otherwise, the entire sequence of observations must be memorized to compute the maximum likelihood.
2 Preliminaries
We study a noisy version of the contextual multi-armed bandit (MAB) problem with linear payoffs. At each time t, there are K possible actions, and the learner observes a d-dimensional feature vector x_{t,i} for each action i; i.e., the dimension of features is d and the number of arms is K. We assume that the observed features are noisy in the sense that x_{t,i} = z_{t,i} + η_{t,i}, where z_{t,i} denotes the true hidden feature vector of action i at time t and η_{t,i} is an independent random noise vector with zero mean. The learner selects an action a_t and observes the reward r_t of the selected action at time t. We assume that
r_t is an independent sub-Gaussian random variable with finite variance, and its mean is determined by the true feature vector of the selected arm as E[r_t] = z_{t,a_t}·θ*.
This is called the linear payoff assumption, where θ* is a hidden coefficient vector unknown to the learner a priori.
The learner uses an algorithm, or policy, that selects an action at each time given the currently observed feature vectors of all actions and the past information of all previous times, so as to maximize the cumulative reward up to time T. If the learner knew the hidden information z_{t,i} and θ*, where z_{t,i} is the true feature vector of action i at time t and θ* is the hidden coefficient vector, the best choice to maximize the cumulative reward would be a*_t = argmax_i z_{t,i}·θ*. We define the regret of an algorithm compared to this oracle as follows:
(1) R(T) = Σ_{t=1}^{T} ( z_{t,a*_t}·θ* − z_{t,a_t}·θ* )
The objective of the learner’s algorithm is to minimize the above regret.
Notation. Here, we define the matrix notation used throughout this paper. For any matrix A, A^T and A^{-1} denote the transpose and the inverse of A, respectively. The i-th eigenvalue and the i-th singular value of A are denoted by λ_i(A) and σ_i(A), respectively. We write A_i for the i-th column of A, A_{ii} for the i-th diagonal entry of A, and diag(a_1, …, a_d) for the diagonal matrix with diagonal entries a_1, …, a_d.
3 Identical Feature Uncertainty
In this section, we assume every action shares the same noise vector at each time t, i.e., η_{t,i} = η_t for all i. We also assume that the η_t are i.i.d. random vectors with zero mean and that the covariance of η_t, denoted by Σ_η, is known to the learner. As in Auer (2002), for the analysis, we assume that all feature vectors have bounded norm and the rewards are bounded by a finite constant. Furthermore, we assume that the distribution of η_t has finite support.
Under the identical uncertainty assumption (writing x_{t,i} = z_{t,i} + η_t for the observed features), we have
argmax_i x_{t,i}·θ* = argmax_i ( z_{t,i}·θ* + η_t·θ* ) = argmax_i z_{t,i}·θ*.
Hence, one can find the best action in terms of the expected reward after finding the hidden coefficient vector θ*, even with noise on the feature vectors. Namely, one can reduce the regret by learning θ* accurately. However, the regret can increase quickly if we spend too much time learning θ*, which is the well-known exploration-exploitation trade-off of the bandit problem.
3.1 Issues of LinRel under Noisy Features
When there is no noise on the feature vectors, LinRel by Auer (2002) controls the exploration-exploitation trade-off very efficiently and guarantees a regret sublinear with respect to the number of rounds T. It executes the following procedures at round/time t:

Calculate a decomposition of each candidate feature vector, where the components come from the eigenvalue decomposition of the empirical second-moment matrix of the previously selected feature vectors, and an index separates the eigenvalues above a fixed threshold from those below it.

Compute an estimator of the hidden coefficient vector θ*. Then, compute for every action the estimated expected reward and the width of its confidence interval.

Select the action maximizing the sum of the estimated expected reward and the confidence width. The estimated expected reward controls the exploitation and the width of the confidence interval controls the exploration.
The following two facts about LinRel make the regret sublinear. First, one can check that, with high probability, the true expected reward of every action lies within its confidence interval at every round; from this and the selection rule, the per-round regret is bounded by the confidence width of the selected arm. Second, the sum of the confidence widths of the selected arms increases sublinearly in T. In other words, the sum of uncertainties of the observed arms increases sublinearly. This is because LinRel obtains a better estimate of θ* as it plays actions with high uncertainty. Intuitively, the confidence width is the amount of information revealed by playing the action.
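As a concrete illustration, a LinRel-style round can be sketched in a few lines. This is a minimal reconstruction: the eigenvalue threshold, the confidence-scale constant, and all names are our assumptions rather than the exact specification of Auer (2002).

```python
import numpy as np

def linrel_select(X_sel, rewards, X_cand, thresh=1.0, delta=0.1):
    """One round of a LinRel-style selection (sketch).

    X_sel:   (t, d) features of previously selected arms
    rewards: (t,)   observed rewards
    X_cand:  (K, d) features of the current candidate arms
    """
    t, d = X_sel.shape
    # Eigenvalue decomposition of the empirical second-moment matrix.
    M = X_sel.T @ X_sel
    lam, U = np.linalg.eigh(M)
    # Keep only directions whose eigenvalue exceeds the threshold;
    # weakly explored directions contribute a pessimistic width instead.
    keep = lam >= thresh
    inv = np.where(keep, 1.0 / np.maximum(lam, 1e-12), 0.0)
    theta_hat = U @ (inv * (U.T @ (X_sel.T @ rewards)))  # truncated LS estimate
    scale = np.sqrt(2 * np.log(2 * len(X_cand) * (t + 1) / delta))
    ucb = []
    for x in X_cand:
        proj = U.T @ x
        width = scale * np.sqrt(np.sum(proj**2 * inv)) + np.sqrt(np.sum(proj[~keep]**2))
        ucb.append(x @ theta_hat + width)  # estimated reward + confidence width
    return int(np.argmax(ucb))
```

With enough data in every direction, the widths shrink and the rule reduces to playing the empirically best arm.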
However, LinRel has very bad regret under noisy feature vectors. First, its estimator does not converge to θ*: the empirical second-moment matrix of the observed features concentrates around that of the true features plus the accumulated noise covariance, which biases the least-squares estimate. We have to remove the effect of Σ_η for the estimator to converge to θ*. Second, the uncertainty indicator strongly depends on the noise and no longer indicates the amount of information we can obtain from an action.
3.2 Redesigning LinRel for Noisy Features
In this section, we redesign LinRel, which is referred to as NLinRel and described formally in what follows.
We also prove that the above algorithm achieves a sublinear regret bound.
Theorem 1
The proof of the above theorem is given in Appendix B.
In what follows, we provide our strategies on the algorithm design and the regret proof.
Redesigning components. For designing NLinRel, we introduce a threshold sequence that vanishes appropriately as the number of rounds grows. We subtract the accumulated noise covariance from the empirical second-moment matrix of the observed features and take the eigenvalue decomposition of the corrected matrix. For a given round, we use the threshold for the eigenvalues and keep the largest index whose eigenvalue is at least the threshold. We then estimate the coefficient vector and the expected rewards of the actions using the retained eigen-directions, with the remaining components handled as in LinRel.
Since the width of the confidence interval for a single action is no longer a good indicator for controlling the exploration due to the noise vectors, we redefine the width of the confidence interval for each pair of actions, which approximates the uncertainty of the difference between the two actions' expected rewards. More precisely, we compute the pairwise width from the difference of the two actions' feature vectors.
Note that the common noise vector is removed by taking this difference, and thus the pairwise width is independent of the noise.
The exploration and exploitation trade-off is controlled by the estimated rewards and the pairwise widths. A candidate set of actions is generated so that every action in it has an estimated reward within the pairwise confidence width of the best estimate, i.e., each candidate is still plausibly optimal. Then, an action is selected uniformly at random from the candidate set.
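The noise-corrected estimation and the pairwise confidence widths can be sketched as follows. This is an illustrative reconstruction: the subtraction of the accumulated noise covariance, the threshold, the constants, and all names are our assumptions, not the exact NLinRel specification.

```python
import numpy as np

def nlinrel_candidates(X_sel, rewards, X_cand, cov_noise, thresh, delta=0.1):
    """One round of an NLinRel-style candidate generation (sketch).

    The accumulated noise covariance t * cov_noise is subtracted from the
    empirical second-moment matrix before the eigenvalue truncation, and
    confidence widths are computed for feature *differences* so that the
    common noise vector cancels.
    """
    t, d = X_sel.shape
    M = X_sel.T @ X_sel - t * cov_noise      # noise-corrected moment matrix
    lam, U = np.linalg.eigh(M)
    keep = lam >= thresh
    inv = np.where(keep, 1.0 / np.maximum(lam, 1e-12), 0.0)
    theta_hat = U @ (inv * (U.T @ (X_sel.T @ rewards)))
    est = X_cand @ theta_hat                 # estimated expected rewards
    scale = np.sqrt(2 * np.log(2 * len(X_cand)**2 * (t + 1) / delta))
    best = int(np.argmax(est))
    cand = []
    for i, x in enumerate(X_cand):
        diff = U.T @ (X_cand[best] - x)      # pairwise: common noise cancels
        width = scale * np.sqrt(np.sum(diff**2 * inv)) + np.sqrt(np.sum(diff[~keep]**2))
        if est[best] - est[i] <= width:      # i is still plausibly optimal
            cand.append(i)
    return cand  # the algorithm then picks uniformly at random from cand
```

With no noise and well-explored directions, only the empirically best arm survives in the candidate set.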
Proof strategy for Theorem 1. In NLinRel, the pairwise width indicates the uncertainty between the best arm and the selected arm at each round. This uncertainty is roughly proportional to the amount of information revealed by playing an arm. Based on this, we first bound the expected sum of uncertainties over the T rounds.
We then connect the expected sum of uncertainties to the regret.
In the resulting bound, we have an additional error term, which stems from the fact that the pairwise width is only an approximation of the true uncertainty. From the definition of the pairwise width, the approximation error involves a term that was not accounted for when computing the width. We can bound this last term using a tail bound for sums of random matrices Tropp (2012). More precisely, using the matrix Azuma inequality, we show that it is negligible.
4 Non-identical Feature Uncertainty
When the noise vectors are not identical across actions, no algorithm based on a linear hypothesis can guarantee a sublinear regret: the regret can grow linearly even if we know the hidden coefficient vector θ* exactly. To see why, suppose each noise vector is drawn from a normal distribution. Then there exists some constant such that, for any given set of feature vectors, with constant probability the noise makes a suboptimal arm look best under any linear hypothesis, so the expected per-round regret is bounded below by a constant. For a coefficient vector θ, let R_θ denote the corresponding expected regret when the learner always plays the action maximizing the inner product of θ with the observed feature vectors. In this section, we do not aim to design an algorithm with a sublinear regret with respect to the optimal sequence of actions, but study how to find an optimal linear hypothesis minimizing R_θ. Somewhat interestingly, we found that the choice θ = θ* is not always the best, i.e., there can exist θ such that R_θ < R_{θ*}. In order to describe the intuition for why θ* is not the optimal choice, we first consider a noisy Gaussian contextual MAB model in Section 4.1. Under this model, the θ minimizing R_θ admits a closed form, and we prove that a very simple algorithm has a 'relative' sublinear regret bound with respect to the optimal linear hypothesis. The closed form that is optimal for Gaussian models might no longer be optimal for non-Gaussian ones, which is discussed in Section 4.2.
4.1 Gaussian Features
In this section, we consider the following noisy Gaussian contextual MAB model. The true feature vectors are i.i.d. random vectors drawn from a normal distribution with mean μ and positive-definite covariance Σ_z. The noise vectors are i.i.d. multivariate Gaussian as well: for all t and i, η_{t,i} follows a zero-mean normal distribution with positive-definite covariance Σ_η. Since both Σ_z and Σ_η are positive definite, the inverses exist not only for Σ_z and Σ_η but also for the sums of these matrices appearing below.
Optimal linear hypothesis. We would like to find a coefficient vector whose expected regret is no larger than that of any other coefficient vector at every time. The following theorem obtains a closed form for such an optimal choice.
Theorem 2
Under the noisy Gaussian contextual MAB model, the expected regret of the coefficient vector θ̄ below is no larger than that of any other coefficient vector at every time, where
(2) θ̄ = (Σ_z + Σ_η)^{-1} Σ_z θ*,
with Σ_z and Σ_η denoting the covariances of the true features and of the noise, respectively.
The proof of the above theorem is provided in Appendix C. Here we provide a high-level sketch. At each round, the learner receives the noisy feature vectors of all actions. When one knows θ* and the distributions of the true features and the noise, the optimal decision given the observed feature vectors is the action maximizing the conditional expectation of the true reward,
a_t = argmax_i E[ z_{t,i}·θ* | x_{t,i} ],
where the independence between actions and the linearity of expectation allow the maximization to be carried out per action. In the proof of Theorem 2, we obtain via Bayesian analysis that E[z_{t,i} | x_{t,i}] = μ + Σ_z (Σ_z + Σ_η)^{-1} (x_{t,i} − μ), with μ, Σ_z and Σ_η the feature mean, feature covariance and noise covariance. Therefore, one can easily find the optimal action by applying the coefficient vector defined in (2) to the observed features, since the remaining terms are common to all actions.
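For completeness, the standard Gaussian conditioning identity behind this computation can be written out as follows. This is a reconstruction in our own notation: μ, Σ_z and Σ_η denote the feature mean, feature covariance and noise covariance.

```latex
% Joint distribution of a true feature z and its noisy observation x = z + \eta:
\begin{pmatrix} z \\ x \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu \\ \mu \end{pmatrix},
  \begin{pmatrix} \Sigma_z & \Sigma_z \\ \Sigma_z & \Sigma_z + \Sigma_\eta \end{pmatrix}
\right)
\quad\Longrightarrow\quad
\mathbb{E}[z \mid x] \;=\; \mu + \Sigma_z \left(\Sigma_z + \Sigma_\eta\right)^{-1} (x - \mu).
```

Hence the conditional expected reward is affine in the observed feature with slope (Σ_z + Σ_η)^{-1} Σ_z θ*; the affine constant is common to all arms and drops out of the argmax, consistently with the closed form referenced in (2).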
From the optimal action, we define the following ‘relative’ regret function:
Greedy algorithm. We now propose a very simple greedy algorithm that operates only on the observed feature vectors and rewards, and can find the optimal coefficient vector very accurately. The algorithm consists of two parts, one for exploration and one for exploitation, as stated formally in what follows.
The first selections of the algorithm are used for exploration, to estimate the quantities appearing in the closed form (2). The remaining selections exploit the estimated coefficient vector for their decisions.
Observe that the above algorithm does not utilize the knowledge of θ*, the feature distribution, or the noise distribution. Nevertheless, we show that it finds the optimal coefficient vector and achieves a sublinear relative regret.
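The explore-then-exploit scheme can be sketched as follows. The interface (a list of per-round context matrices and a reward callback), the names, and the use of ordinary least squares are our assumptions for illustration, not the paper's exact algorithm; the point is that regressing rewards on the *observed* features is enough, since under the Gaussian model the conditional expected reward is linear in the observed features.

```python
import numpy as np

def greedy_bandit(contexts, reward_fn, T0, seed=0):
    """Explore-then-exploit sketch (our reconstruction of the greedy scheme).

    contexts:  list of (K, d) arrays of *observed* (noisy) feature vectors.
    reward_fn: reward_fn(t, a) returns the reward of playing arm a at round t.
    T0:        number of uniformly random exploration rounds.
    """
    rng = np.random.default_rng(seed)
    X, r, actions = [], [], []
    theta_hat = None
    for t, feats in enumerate(contexts):
        if t < T0:
            a = int(rng.integers(len(feats)))         # exploration: uniform arm
        else:
            if theta_hat is None:                     # fit once after exploration
                theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(r), rcond=None)
            a = int(np.argmax(feats @ theta_hat))     # exploitation: greedy arm
        X.append(feats[a])
        r.append(reward_fn(t, a))
        actions.append(a)
    return actions, theta_hat
```

In the noiseless special case the least-squares fit recovers the true coefficient vector exactly, and the exploitation phase plays the oracle arm.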
Theorem 3
Under the above noisy Gaussian contextual MAB model, the simple greedy algorithm achieves a sublinear relative regret with high probability.
The proof of the above theorem is provided in Appendix D. Here we provide a high-level sketch. In the exploration part, at each time instance the learner selects an action uniformly at random, so that the selection and the noisy feature vectors are independent. Then the selected observed features are i.i.d. random vectors, as are the corresponding rewards, and their first and second moments match those prescribed by the model.
From the matrix Azuma inequality, we show that the residual matrices are negligible, with respect to their spectral norms, compared with the population quantities they approximate. From these facts, we derive that the estimated coefficient vector converges to the optimal one.
This leads to the conclusion of Theorem 3.
4.2 Non-Gaussian Features
In this section, we consider true feature vectors drawn from an arbitrary, possibly non-Gaussian, distribution. In this case, the proof of Theorem 2 is no longer valid and it is not easy to analyze whether θ̄ defined in (2) is optimal in any sense. Formally, we assume that the true feature vectors are i.i.d. random vectors drawn from some (possibly non-Gaussian) distribution with mean μ and positive-definite covariance Σ_z. The noise model is the same as that of the Gaussian contextual MAB model in the previous section. We focus on verifying numerically whether θ̄ defined in (2) is optimal under the non-Gaussian setting. To this end, one can observe that the optimal coefficient vector minimizes the following objective given the knowledge of θ*, the feature distribution and the noise distribution:
(3) 
The solution of this optimization might not admit a closed form like (2) unless the feature distribution is Gaussian. Furthermore, computing a gradient is a non-trivial task depending on the feature distribution, and knowledge of that distribution might not be available in practical scenarios. Hence, we estimate the gradient via the following Monte Carlo method:
(4) 
where the samples are randomly generated from the feature distribution, or are real feature vectors observed in practice. It is elementary to check that each gradient in (4) can be expressed in integral form with respect to the probability density function of the feature distribution.
Under several different choices of the feature distribution, we compute (4) at θ̄ to confirm whether it is optimal or not. In all the experiments, the number of arms, the dimension of features and the number of samples are fixed, and each element of the feature vector is an i.i.d. random variable. We also choose each element of θ* uniformly at random in the interval [−1, 1], i.e., Uniform(−1, 1). For the distribution of the noise, we use a zero-mean Gaussian. The numerical results are reported in Table 1, which implies that (2) might be far from optimal unless the feature distribution is Gaussian.
Feature distribution      norm of gradient
Gaussian(0, 1)            0.000
Uniform(−1, 1)            0.013
Laplace(0, 1)             0.032
Exponential(1)            0.413
LogNormal(0, 1)           0.648
Mixture of Gaussian*      0.320
Mixture of Uniform**      0.273

*  0.3 · Gaussian(10, 1) + 0.7 · Gaussian(−10, 1)
** 0.3 · Uniform(9, 11) + 0.7 · Uniform(−11, −9)
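The Monte Carlo evaluation of the objective (3) can be sketched as follows; as a stand-in for the integral-form gradient in (4), we use central finite differences, which is our assumption for illustration only.

```python
import numpy as np

def per_round_loss(theta, theta_star, Z, E):
    """Monte Carlo estimate of the per-round regret of hypothesis theta.

    Z: (n, K, d) samples of true features; E: (n, K, d) noise samples.
    The learner sees X = Z + E and plays argmax_i x_i . theta; the regret
    compares against the arm that is best under the true features.
    """
    X = Z + E
    chosen = np.argmax(X @ theta, axis=1)              # (n,) arms played
    true_vals = Z @ theta_star                         # (n, K) true rewards
    return float(np.mean(true_vals.max(axis=1)
                         - true_vals[np.arange(len(Z)), chosen]))

def numerical_gradient(theta, theta_star, Z, E, h=1e-2):
    """Central finite differences on the Monte Carlo loss (a stand-in for
    the integral-form gradient in (4), not the paper's estimator)."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (per_round_loss(theta + e, theta_star, Z, E)
                - per_round_loss(theta - e, theta_star, Z, E)) / (2 * h)
    return g
```

With no noise, playing theta_star is exactly optimal, so the estimated per-round loss is zero.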
This motivates the design of a new algorithm, completely different from Algorithm 2, for non-Gaussian feature distributions. For this purpose, we propose the following algorithm, called UniversalNLinRel.
In the above, the algorithm uses a UCB-like constant as in LinUCB Chu et al. (2011). The main idea of the algorithm design is that it runs NLinRel to estimate the true coefficient vector θ*, and then updates the current coefficient vector along a stochastic gradient direction, replacing the unknown θ* in the objective by its estimate. Although NLinRel has its theoretical value, UniversalNLinRel uses a practical variant of NLinRel, introduced via an additional parameter, since too many initial explorations might hurt its regret unless an extremely large number of time instances is allowed. In the following section, we measure the regret performance of UniversalNLinRel.
5 Experimental Results
In this section, we report experimental results comparing the regret performance of UniversalNLinRel with the following algorithms. First, LinUCB Chu et al. (2011) represents known algorithms designed for the noiseless contextual MAB problem. (Footnote: We choose the same UCB constant for both LinUCB and UniversalNLinRel in all our experiments; the choice is not sensitive for their performance in any of our settings.) Second, OracleGD is identical to UniversalNLinRel, except that it uses the true coefficient vector θ* instead of the estimated one. Finally, OracleTC and OracleCF are linear hypotheses choosing arms using the true coefficient vector θ* and the closed form θ̄ defined in (2), respectively.
Synthetic dataset. We follow the same synthetic setups described in Section 4.2, and the experimental comparisons among UniversalNLinRel, LinUCB, OracleGD, OracleTC and OracleCF are reported in Figure 1. In the case of the Gaussian distribution, as reported in Figure 1 (a), one can observe that both LinUCB and UniversalNLinRel are close to the optimal OracleCF in this setting. The near-optimality of LinUCB can be explained by its similarity to the simple greedy algorithm in Section 4.1. In the case of the mixture of uniform distributions, as reported in Figure 1 (c), one can observe that LinUCB has the worst regret and is significantly outperformed by UniversalNLinRel. Figure 1 (b) and (d) show that NLinRel finds the true coefficient vector well in both Gaussian and non-Gaussian setups. This explains why UniversalNLinRel can perform well, since it uses NLinRel as its subroutine for tracking the true parameter.
Yahoo dataset. We use the Yahoo Webscope R6A dataset Li et al. (2010), which contains the interaction history of the Yahoo! Front Page Module. The "Featured" tab of the Module highlights one article from a human-edited candidate set of size 20. The log contains the user context, arm contexts, candidate set, chosen arm, and reward (click or not). We consider an article as an arm. As a preprocessing step, we removed incomplete lines (those containing an arm whose context is not recorded). We then clustered the lines by user, so that each user observes several candidate arms whose rewards are calculated as their empirical CTRs (Click-Through Rates). For example, if a user observed an arm, say, n times and clicked it m times, we set the reward of the corresponding context vector to m/n. We only consider users with more than 2 candidate arms. The number of users after this pruning is 11,352, and the MAB algorithms iterated over 10,000 of them without duplication. Both users and articles are represented by six-dimensional real vectors, and we used the inner product of the two features as the context of each arm, as in Li et al. (2010). We remark that our reported CTRs differ from those in Li et al. (2010): the authors use a different coefficient vector for each article, but we instead use a single universal one, under which our algorithms and their theoretical reasoning have been developed.
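The CTR preprocessing described above can be sketched as follows. The tuple-based log format and all names here are simplifications of ours, not the Webscope R6A schema.

```python
from collections import defaultdict

def empirical_ctrs(log_lines):
    """Compute per-(user, arm) empirical CTRs from a click log (sketch).

    log_lines: iterable of (user_id, arm_id, clicked) with clicked in {0, 1}.
    Returns {user_id: {arm_id: ctr}} restricted to users with more than 2
    arms, mirroring the pruning step described above.
    """
    shown = defaultdict(int)    # (user, arm) -> number of displays
    clicked = defaultdict(int)  # (user, arm) -> number of clicks
    for user, arm, c in log_lines:
        shown[(user, arm)] += 1
        clicked[(user, arm)] += int(c)
    per_user = defaultdict(dict)
    for (user, arm), n in shown.items():
        per_user[user][arm] = clicked[(user, arm)] / n  # empirical CTR
    return {u: arms for u, arms in per_user.items() if len(arms) > 2}
```

Users with too few candidate arms are dropped, exactly as in the pruning described above.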
We run LinUCB, NLinRel and UniversalNLinRel on the preprocessed Yahoo dataset. Compared to our synthetic setting, computing gradients in UniversalNLinRel becomes more expensive due to the larger number of candidate arms; hence, we estimate each integral in the gradients by Monte Carlo with 100 samples. In addition, since we do not know the noise and feature distributions, we use the current context as a random sample in UniversalNLinRel, and set the noise variance to 10% of the sample variance of the contexts in the entire dataset. On the Yahoo data, UniversalNLinRel performs best, followed by LinUCB and then NLinRel, as reported in Figure 1 (e).
Mushroom dataset. We use the mushroom dataset Bache & Lichman (2013), which was used in the contextual MAB experiment of Blundell et al. (2015). Each mushroom has 22 categorical features and is labeled as edible or poisonous. As in Blundell et al. (2015), we used 126-dimensional binary vectors as features. At each round, we sample one edible mushroom and 4 poisonous ones; thus, the learner searches for the one edible mushroom among 5 candidates. If the agent chooses an edible mushroom, the regret does not change; if the agent chooses a poisonous one, the regret increases by 1. We experimented with two settings. The first uses the raw data and assumes the noise variance of the context is a fraction of the sample variance (as we do for the Yahoo dataset). The second adds artificial noise to the features, similar to the synthetic experiment: to each dimension, we added zero-mean Gaussian noise of a fixed variance. The results are reported in Figure 1 (f). In the first experiment without added noise, we observe that LinUCB performs quite well, with almost zero regret, since this data is almost linearly separable, i.e., the best setting for LinUCB. However, in the second experiment with artificial noise, UniversalNLinRel clearly outperforms LinUCB. This experiment shows that in some scenarios it is important to learn/know the statistical information of the noise for the performance of UniversalNLinRel. We leave this for further exploration in the future.
6 Conclusion
In this paper, we studied contextual multi-armed bandit problems under the assumptions of linear payoffs and uncertainty on features. Based on our theoretical understanding of the special cases of identical noise and Gaussian features, we developed UniversalNLinRel for general scenarios. We believe that utilizing model uncertainties, as addressed in this paper, provides an important direction for designing more practical algorithms for the bandit task.
References
 Abe & Long (1999) Abe, Naoki and Long, Philip M. Associative reinforcement learning using linear probabilistic concepts. In ICML, pp. 3–11, 1999.
 Agrawal & Goyal (2013) Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In ICML, pp. 127–135, 2013.

 Auer (2002) Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolo, Freund, Yoav, and Schapire, Robert E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

 Awerbuch & Kleinberg (2004) Awerbuch, Baruch and Kleinberg, Robert D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 45–53. ACM, 2004.
 Bache & Lichman (2013) Bache, Kevin and Lichman, Moshe. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

 Beygelzimer et al. (2011) Beygelzimer, Alina, Langford, John, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandit algorithms with supervised learning guarantees. In AISTATS, pp. 19–26, 2011.
 Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural network. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1613–1622, 2015.
 Bouneffouf et al. (2012) Bouneffouf, Djallel, Bouzeghoub, Amel, and Gançarski, Alda Lopes. A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, pp. 324–331. Springer, 2012.
 Chu et al. (2011) Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pp. 208–214, 2011.
 Dani et al. (2008) Dani, Varsha, Hayes, Thomas P, and Kakade, Sham M. Stochastic linear optimization under bandit feedback. In COLT, pp. 355–366, 2008.
 Gal & Ghahramani (2016) Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059, 2016.
 Hofmann et al. (2011) Hofmann, Katja, Whiteson, Shimon, de Rijke, Maarten, et al. Contextual bandits for information retrieval. In NIPS 2011 Workshop on Bayesian Optimization, Experimental Design, and Bandits, Granada, volume 12, pp. 2011, 2011.
 Jung et al. (2012) Jung, Tobias, Martin, Sylvain, Ernst, Damien, and Leduc, Guy. Contextual multiarmed bandits for web server defense. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pp. 1–8. IEEE, 2012.
 Kaelbling (1994) Kaelbling, Leslie Pack. Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15(3):279–298, 1994.
 Langford & Zhang (2008) Langford, John and Zhang, Tong. The epochgreedy algorithm for multiarmed bandits with side information. In Advances in neural information processing systems, pp. 817–824, 2008.
 Li et al. (2010) Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
 Paulsen (2002) Paulsen, Vern. Completely bounded maps and operator algebras, volume 78. Cambridge University Press, 2002.
 Radlinski et al. (2008) Radlinski, Filip, Kleinberg, Robert, and Joachims, Thorsten. Learning diverse rankings with multiarmed bandits. In Proceedings of the 25th international conference on Machine learning, pp. 784–791. ACM, 2008.
 Sarkar (1991) Sarkar, Jyotirmoy. Onearmed bandit problems with covariates. The Annals of Statistics, pp. 1978–2002, 1991.
 Schlag (1998) Schlag, Karl H. Why imitate, and if so, how?: A boundedly rational approach to multiarmed bandits. Journal of economic theory, 78(1):130–156, 1998.
 Strehl et al. (2006) Strehl, Alexander L, Mesterharm, Chris, Littman, Michael L, and Hirsh, Haym. Experienceefficient learning in associative bandit problems. In ICML, pp. 889–896, 2006.
 Thompson (1933) Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Tropp (2012) Tropp, Joel A. Userfriendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
 Woodroofe (1979) Woodroofe, Michael. A onearmed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.
Appendix A Tail Bounds of Sums of Random Matrices
In the proof of Theorem 1, we encounter matrix martingales and need to bound their spectral norms to complete our proofs. When a matrix martingale is a sum of random matrices with bounded spectral norms, we can use the matrix Azuma inequality, which is Theorem 7.1 of Tropp (2012).
Theorem 4 (Matrix Azuma)
Let {X_k} be a finite adapted sequence of self-adjoint matrices in dimension d, and {A_k} a fixed sequence of self-adjoint matrices, such that E_{k−1}[X_k] = 0 and X_k² ≼ A_k² almost surely. Set σ² := ‖ Σ_k A_k² ‖.
Then, for all t ≥ 0,
P( λ_max( Σ_k X_k ) ≥ t ) ≤ d · e^{−t² / (8σ²)}.
For the proof of Theorem 1, we need to study the matrix martingales defined in (5) and (6) below:
(5)  
(6) 
We cannot directly apply the matrix Azuma inequality to bound the spectral norms of these martingales, since their summands are not self-adjoint. To resolve this, we introduce the self-adjoint dilation operator 𝒟, in the sense of Paulsen (2002), defined for a real matrix A by
(7) 𝒟(A) = [ 0 A ; A^T 0 ].
It is known that the dilation preserves the spectral norm, i.e.,
(8) λ_max(𝒟(A)) = ‖𝒟(A)‖ = ‖A‖.
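Numerically, the dilation and its norm-preserving property can be checked in a few lines; the function name and the restriction to real matrices are our choices.

```python
import numpy as np

def dilation(A):
    """Self-adjoint dilation of a (possibly rectangular) real matrix A:
    D(A) = [[0, A], [A^T, 0]], which is symmetric and has the same
    spectral norm as A (its eigenvalues are +/- the singular values of A)."""
    m, n = A.shape
    return np.block([[np.zeros((m, m)), A],
                     [A.T, np.zeros((n, n))]])
```

This is exactly why bounds on the eigenvalues of the dilation, such as the matrix Azuma inequality, translate into bounds on the spectral norm of the original rectangular martingale.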
Using the matrix Azuma inequality and the dilation operator, we can bound the spectral norms of the matrix martingales defined in (5) and (6) as follows:

Let the summands of the martingale be as above. Then they form a sequence of self-adjoint matrices in dimension d that satisfy