Contextual Multi-armed Bandits under Feature Uncertainty

03/03/2017 ∙ by Se-Young Yun, et al. ∙ 0

We study contextual multi-armed bandit problems under linear realizability on rewards and uncertainty (or noise) on features. For the case of identical noise on features across actions, we propose an algorithm, coined NLinRel, having an O(T^{7/8}(log(dT) + K√d)) regret bound for T rounds, K actions, and d-dimensional feature vectors. Next, for the case of non-identical noise, we observe that popular linear hypotheses, including NLinRel, cannot achieve such a sub-linear regret. Instead, under the assumption of Gaussian feature vectors, we prove that a greedy algorithm has an O(T^{2/3}√(log d)) regret bound with respect to the optimal linear hypothesis. Utilizing our theoretical understanding of the Gaussian case, we also design a practical variant of NLinRel, coined Universal-NLinRel, for arbitrary feature distributions. It first runs NLinRel to find the `true' coefficient vector using feature uncertainties and then adjusts it to minimize its regret using the statistical feature information. We justify the performance of Universal-NLinRel on both synthetic and real-world datasets.




1 Introduction

The multi-armed bandit (MAB) problem (or simply bandit problem) has received much attention due to a wide range of applications, e.g., clinical trials Thompson (1933), economics Schlag (1998), routing Awerbuch & Kleinberg (2004), and ranking Radlinski et al. (2008). The MAB problems are sequential decision problems, where at each round, a learner selects an action (or arm) from candidates and receives a reward for the selected action. The learner makes decisions based on the observations such as the sequence of past rewards and the selected actions, and would like to maximize the cumulative reward, or equivalently to minimize regret, defined as the difference between the cumulative reward and that achieved by always playing the best arm/action.

The learner can often access contextual information in addition to rewards and selected actions; this setting is referred to as contextual MAB Langford & Zhang (2008). Examples include personalized recommendation Bouneffouf et al. (2012), web server defense Jung et al. (2012) and information retrieval Hofmann et al. (2011). For instance, the learner sees feature vectors associated with each of the arms at every round . To address the problem, one has to assume a hypothesis set consisting of functions from the feature vectors to an action that will give the best expected reward. The linear hypothesis set is simple and widely used, where each hypothesis is defined by a coefficient vector and predicts an optimal action as The linear hypothesis set assumes that the expected reward of action at round is defined by with a hidden coefficient vector , which is referred to as linear payoff and also called linear realizability. It is an online linear regression task balancing the trade-off between exploration and exploitation.

In this paper, we study the contextual MAB problem with linear payoffs under uncertainty or noise on features. Specifically, we assume that the learner cannot observe the true feature vector but only a noisy vector , where the random noise is independently drawn from some distribution. This model can incorporate statistical uncertainties of linear hypotheses and relax the strong linear assumption on rewards, i.e., enhance the power of linear models. Furthermore, it can incorporate recent remarkable progress in Bayesian deep learning techniques Gal & Ghahramani (2016) that estimate feature uncertainties, i.e., the knowledge of noise distributions.

There are two main challenges in the noisy contextual MAB problem:

  • The learner might not extract the true hypothesis from any sequence of observations using policies defined for the noiseless contextual MAB problem, e.g. LinRel Auer (2002).

  • Even if the learner could learn , it is still hard to design a good exploitation policy since every arm has some uncertainty due to the noise.

Therefore, we need to redesign learning policies that take the noisy feature vectors into account. To the best of our knowledge, this is the first work that aims to solve contextual MAB problems under such uncertainties on features.

Contribution. We first study the simplest, but non-trivial, case in which every action has the identical noise vector at each round. This eliminates the second issue: the learner can find the best action after extracting the hidden coefficient vector, since adding the same noise vector to the features of every arm does not change which arm maximizes the linear score. However, the first issue remains, e.g., LinRel might not find the hidden coefficient vector. Furthermore, one has to design a new confidence interval due to the noise for balancing the exploration and exploitation trade-off. To address these, we propose a noisy version of LinRel, coined NLinRel, having a sub-linear regret bound over T rounds. For the regret analysis, we use tail inequalities for random matrices induced by noise vectors and bound the random matrix perturbation.

We next consider non-identical noise vectors, but assume that the feature vector at each round is independently drawn from some distribution . The reason for this statistical assumption is our finding that it eliminates the issue . Specifically, if is Gaussian, we derive a closed-form formula for the optimal coefficient vector using Bayesian analysis. Somewhat interestingly, the optimal is not equal to the true , which cannot occur under noiseless settings. We further design a simple greedy algorithm that achieves a sub-linear regret bound with respect to the optimal linear hypothesis. Here, one can easily observe that no linear hypothesis, including the greedy algorithm and NLinRel, can achieve a sub-linear regret with respect to the optimal sequence of actions, and thus we analyze such a ‘relative’ regret. Our study on Gaussian features naturally motivates the question of whether is also optimal for general feature distributions. To this end, we derive an optimization formulation (i.e., non-closed form) for the optimal coefficient vector in the general, possibly non-Gaussian, setting, and find numerically that is no longer optimal in this case. Finally, we design a new algorithm, coined Universal-NLinRel, for arbitrary distributions on features, which searches for the true using NLinRel and adjusts the parameter along a gradient direction of the optimization objective. In our experiments, Universal-NLinRel outperforms LinUCB Chu et al. (2011), representing the known linear hypotheses designed for the noiseless contextual MAB problem, on both noisy synthetic and real-world datasets.

Related works. Although the name, contextual multi-armed bandit, first appeared in Langford & Zhang (2008), the problem setting has been studied under different names, e.g., bandit with covariates Woodroofe (1979); Sarkar (1991), associative reinforcement learning Kaelbling (1994), associative bandit Auer (2002); Strehl et al. (2006) and bandit with expert advice Auer et al. (2002). This paper, in particular, focuses on the linear hypothesis set and the linear payoff model, which was originally introduced in Abe & Long (1999) and developed in Auer (2002). Our algorithm design of NLinRel is motivated by LinRel Auer (2002) and LinUCB Chu et al. (2011). Both LinRel and LinUCB compute the expected rewards and their confidence intervals for controlling the exploration and exploitation trade-off. Thompson sampling was also studied for the linear payoff model Agrawal & Goyal (2013). The stochastic linear bandit optimization problem studied in Dani et al. (2008) and many following works are special cases of the contextual bandit with the linear payoff model having infinitely many arms. However, all these studies assume that the feature vectors are noiseless, and we cannot directly apply their algorithms to our noisy setting.

One can discretize the linear hypothesis set into an ε-net, , such that for all . With the ε-net, it is possible to use EXP4-type algorithms Auer et al. (2002); Beygelzimer et al. (2011) for the noisy contextual MAB problem studied in this paper. However, these algorithms are computationally too expensive to use in practice: the size of the ε-net is and EXP4-type algorithms have to update the weights of all elements of the ε-net at every round. One can also possibly use Epoch-Greedy Langford & Zhang (2008) for our Gaussian setting mentioned earlier, but it also requires a huge amount of computation and memory when computing the maximum-likelihood hypothesis among the hypothesis set. If one uses the ε-net, Epoch-Greedy has the same issue as the EXP4-type algorithms; otherwise, the whole sequence of observations must be memorized to compute the maximum likelihood.

2 Preliminaries

We study a noisy version of the contextual multi-armed bandit (MAB) problem with linear payoffs. At each time t, there are K possible actions and the learner observes a feature vector for each possible action , i.e., the dimension of features is d and the number of arms is K. We assume that the observed features are noisy in the sense that the true hidden feature vector of action at time is denoted by , for some independent random vector with . The learner selects an action and observes the reward for the selected action at time . We assume that is an independent sub-Gaussian random variable with finite variance and its distribution is determined by the true feature vector of the selected arm as

This is called the linear payoff assumption, where the coefficient vector with is unknown to the learner a priori.

The learner uses some algorithm or policy selecting an action at each time given the currently observed feature vectors for all actions and the past information for all times , so that it maximizes the cumulative reward up to time . If the learner knew the hidden information and , the best choice of action to maximize the cumulative reward would be . We define a regret function of algorithm compared to the oracle algorithm as follows:


The objective of the learner’s algorithm is to minimize the above regret.
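The interaction protocol above can be sketched as a small simulation. This is only an illustration under assumptions that are not taken from the paper: features and noise are standard Gaussians with hypothetical scales, the regret is accumulated in expected reward against the oracle that sees the true features and coefficient vector, and the names `simulate`, `uniform_policy` and `plug_in_policy` are invented for this sketch.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simulate(policy, theta, T=2000, K=5, noise_std=0.5, seed=0):
    """Run `policy` for T rounds and return its cumulative expected regret
    against the oracle that sees the true features and theta."""
    rng = random.Random(seed)
    d = len(theta)
    regret = 0.0
    for _ in range(T):
        # True (hidden) features and the noisy versions shown to the learner.
        x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
        z = [[xi + rng.gauss(0, noise_std) for xi in xa] for xa in x]
        a = policy(z, rng)
        means = [dot(theta, xa) for xa in x]  # expected reward of each arm
        regret += max(means) - means[a]
    return regret

def uniform_policy(z, rng):
    """Ignore the features and pick an arm uniformly at random."""
    return rng.randrange(len(z))

def plug_in_policy(theta):
    """Pick the arm whose noisy feature vector scores highest under theta."""
    return lambda z, rng: max(range(len(z)), key=lambda a: dot(theta, z[a]))

theta_star = [1.0, -0.5, 0.25]  # hypothetical hidden coefficient vector
regret_uniform = simulate(uniform_policy, theta_star)
regret_plug_in = simulate(plug_in_policy(theta_star), theta_star)
print(regret_uniform, regret_plug_in)
```

Neither policy learns from rewards; the harness only makes the roles of the hidden features, the observed features and the oracle concrete.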

Notation. Here, we define necessary matrix notation used throughout this paper. For any matrix , and denote the transpose and inverse of , respectively. The -th eigenvalue and the -th singular value of are denoted by and , respectively. Let denote the -th column of and for . We mean by the -th diagonal value of and by the diagonal matrix consisting of .

3 Identical Feature Uncertainty

In this section, we assume every action shares the same noise feature vector , i.e., for all . We also assume that are i.i.d. random vectors with and the covariance of , denoted by , is known to the learner. As in Auer (2002), for the analysis, we assume that all feature vectors satisfy and the rewards are bounded by a finite constant. Furthermore, we assume that the distribution of has a finite support.

Under the identical uncertainty assumption, we have

Hence, one can find the best action in terms of the expected reward after finding even with noise on the feature vectors. Namely, one can reduce the regret by learning the hidden coefficient vector accurately. However, could increase quickly if we spend too much time learning , which is the well-known exploration-exploitation trade-off in the bandit problem.
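The invariance behind this observation can be checked numerically. The sketch below assumes additive noise (observed feature equals true feature plus noise) and uses hypothetical dimensions and noise scale; it draws one noise vector per round, adds it to every arm, and counts how often the best arm under a fixed coefficient vector is unchanged.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

rng = random.Random(1)
d, K = 4, 6  # hypothetical feature dimension and number of arms
theta = [rng.uniform(-1, 1) for _ in range(d)]

agree = 0
trials = 1000
for _ in range(trials):
    x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
    eps = [rng.gauss(0, 2.0) for _ in range(d)]  # one noise vector shared by all arms
    z = [[xi + ei for xi, ei in zip(xa, eps)] for xa in x]
    best_true = argmax([dot(theta, xa) for xa in x])
    best_noisy = argmax([dot(theta, za) for za in z])
    agree += best_true == best_noisy
print(agree, "of", trials, "rounds keep the same best arm")
```

The shared noise shifts every score by the same amount, so the ranking of the arms is preserved whatever the noise scale.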

3.1 Issues of LinRel under Noisy Features

When there is no noise on feature vectors, LinRel by Auer (2002) controls the exploration-exploitation trade-off very efficiently and guarantees a regret that is sub-linear in the number of rounds. It executes the following procedures at round/time :

  • Calculate

    where is from the eigenvalue decomposition of , and is the index such that and .

  • Compute , which is an estimator of . Then, compute the expected reward and the width of the confidence interval as follows: for all ,

  • Select . The expected reward controls the exploitation and the width of the confidence interval controls the exploration.

The following two facts about LinRel make the regret sub-linear. First, one can check that for all and . From this and the selection rule, we have . Second, it holds that . In other words, the sum of uncertainties of the observed arms increases sub-linearly. This is because LinRel obtains a better estimate of by playing actions having high uncertainties. Intuitively, is the amount of information revealed by playing action .

However, LinRel has very bad regret under noisy feature vectors. First, does not converge to . One can easily check that

We have to remove to expect to converge to . Second, the uncertainty indicator strongly depends on and does not indicate the amount of information we can obtain from the action.
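The bias can be seen in a one-dimensional toy regression. The sketch below is not LinRel itself: it assumes a scalar feature, additive Gaussian noise with known variance, and plain least squares without eigenvalue thresholding. It compares the naive estimate with one that first subtracts the accumulated noise variance from the Gram term, which is one natural reading of the correction described above.

```python
import random

rng = random.Random(0)
theta = 2.0        # true coefficient (scalar for readability)
noise_var = 0.5    # variance of the feature noise, assumed known
n = 50_000

szz = 0.0  # running sum of z * z
szr = 0.0  # running sum of z * r
for _ in range(n):
    x = rng.gauss(0, 1)                     # true feature
    z = x + rng.gauss(0, noise_var ** 0.5)  # observed noisy feature
    r = theta * x + rng.gauss(0, 0.1)       # reward depends on the true feature
    szz += z * z
    szr += z * r

naive = szr / szz                        # least squares on the noisy features
corrected = szr / (szz - n * noise_var)  # remove the noise contribution first
print(naive, corrected)
```

With unit feature variance, the naive estimate concentrates around theta / (1 + noise_var), while the corrected one concentrates around theta.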

3.2 Redesigning LinRel for Noisy Features

In this section, we redesign LinRel, which is referred to as NLinRel and described formally in what follows.

  for  to  do
      Eigenvalue decomposition of
     for all  do
     end for
     while  do
     end while
      select from uniformly at random
  end for
Algorithm 1 NLinRel

We also prove that the above algorithm achieves the sub-linear regret bound.

Theorem 1

Under the above identical noisy contextual MAB model, NLinRel has

with probability at least


The proof of the above theorem is given in Appendix B. For instance, when , the regret is

In what follows, we provide our strategies on the algorithm design and the regret proof.

Redesigning components. For designing NLinRel, we introduce

so that and as . Let be the eigenvalue decomposition of . For given , we use as a threshold for the eigenvalues and let be the largest index such that . We then estimate the coefficient vector and expected rewards of actions as follows:

where and as in LinRel.

Since the width of the confidence interval for a single action is not a good indicator for controlling the exploration due to noise vectors, we re-define the width of the confidence interval for each action pair , which is referred to as and approximates . More precisely, we compute

Note that is removed by and thus is independent of the noise.

The exploration and exploitation tradeoff is controlled by and . A candidate set is generated so that for all and there exists such that for all . Then, is selected uniformly at random from .

Proof strategy for Theorem 1. In NLinRel, indicates the uncertainty between the best arm and the selected arm at . The uncertainty is roughly proportional to the amount of revealed information by playing an arm . From that, we first bound the expected sum of uncertainties as follows:

We then connect the expected sum of uncertainties to the regret that has

In the above equation, we have an additional term , which stems from the fact that is just an approximation of . From the definition of , we have

where the last term was not considered when we compute . We can bound the last term using a tail bound for sums of random matrices Tropp (2012). More precisely, using the matrix Azuma inequality, we show that .

4 Non-identical Feature Uncertainty

When the noise feature vectors are not identical, i.e., , no algorithm based on a linear hypothesis can guarantee a sub-linear regret. The regret can grow linearly even if we know the hidden coefficient vector exactly. To see why, suppose each is drawn from a normal distribution. Then, there exists some constant such that for any given set of feature vectors , with probability ,

For coefficient vector , let be the corresponding expected regret function when the learner decides actions as follows:

In this section, we do not aim to design an algorithm with sub-linear regret with respect to the optimal sequence , but study how to find an optimal linear hypothesis that minimizes . Somewhat interestingly, we found that the choice is not always the best, i.e., there could exist such that for all . In order to describe the intuition why is not the optimal choice, we first consider a noisy Gaussian contextual MAB model in Section 4.1. Under the model, that minimizes is represented in a closed form and we prove that a very simple algorithm has a ‘relative’ sub-linear regret bound with respect to the optimal linear hypothesis. The optimal closed form for Gaussian models might no longer be true for non-Gaussian ones, which is discussed in Section 4.2.
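The linear growth of the regret under independent per-arm noise can be observed in a short simulation. The sketch assumes standard Gaussian features, unit-variance Gaussian noise drawn independently for every arm, and a learner that already knows the true coefficient vector; all numerical values are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def plug_in_regret(T, theta, K=5, noise_std=1.0, seed=0):
    """Cumulative expected regret of argmax_a theta^T z_a against the oracle
    argmax_a theta^T x_a, with independent Gaussian noise on every arm."""
    rng = random.Random(seed)
    d = len(theta)
    total = 0.0
    for _ in range(T):
        x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
        z = [[xi + rng.gauss(0, noise_std) for xi in xa] for xa in x]
        means = [dot(theta, xa) for xa in x]
        chosen = max(range(K), key=lambda a: dot(theta, z[a]))
        total += max(means) - means[chosen]
    return total

theta_star = [1.0, 0.5]
for T in (1000, 2000, 4000):
    # The per-round regret stays roughly constant, so the total grows linearly.
    print(T, plug_in_regret(T, theta_star) / T)
```

Because the noise differs across arms, it can reorder the scores in every round, and knowing the coefficient vector exactly does not remove this per-round loss.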

4.1 Gaussian Features

In this section, we consider the following noisy Gaussian contextual MAB model. The true feature vectors are i.i.d. random vectors drawn from the normal distribution where is a positive-definite matrix. The noises are defined by i.i.d. multivariate Gaussian random vectors as well: for all and , follows with a positive-definite matrix . Since both and are positive-definite matrices, we have inverse matrices not only for and but also for and .
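For orientation, the standard Gaussian conditioning calculation for this kind of model is sketched below. The notation is assumed for this sketch and may differ from the paper's: $x \sim \mathcal{N}(0, \Sigma_x)$ is a true feature vector (zero mean is an assumption), $\varepsilon \sim \mathcal{N}(0, \Sigma_\varepsilon)$ is independent noise, $z = x + \varepsilon$ is the observed vector, and $\theta_*$ is the true coefficient vector.

```latex
\begin{aligned}
\begin{pmatrix} x \\ z \end{pmatrix}
  &\sim \mathcal{N}\!\left( 0,\;
     \begin{pmatrix} \Sigma_x & \Sigma_x \\ \Sigma_x & \Sigma_x + \Sigma_\varepsilon \end{pmatrix} \right),
  \qquad z = x + \varepsilon, \\
\mathbb{E}[x \mid z] &= \Sigma_x (\Sigma_x + \Sigma_\varepsilon)^{-1} z, \\
\mathbb{E}[\theta_*^\top x \mid z]
  &= \big( (\Sigma_x + \Sigma_\varepsilon)^{-1} \Sigma_x \theta_* \big)^\top z .
\end{aligned}
```

Under these assumptions the expected reward given the observation is again linear in $z$, but with a coefficient vector that generally differs from $\theta_*$ whenever the noise covariance is non-zero. The paper's own closed form is the one stated in Theorem 2.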

Optimal linear hypothesis. We would like to find such that for all and . The following theorem obtains a closed form of an optimal choice .

Theorem 2

Under the noisy Gaussian contextual MAB model, for all and , where


The proof of the above theorem is provided in Appendix C. Here we provide its high-level sketch. At each round, the learner receives noisy feature vectors . When one knows and the distributions of and for all , the optimal decision from the given feature vectors can be computed as:

where the last equality comes from the independence between actions and the linearity of the expectation. In the proof of Theorem 2, we obtain using Bayesian analysis that

Therefore, one can easily find the optimal action with under

From the optimal action, we define the following ‘relative’ regret function:

Greedy algorithm. We now propose a very simple greedy algorithm that operates only with observations , , and can find very accurately. The simple greedy algorithm consists of two parts, each for exploration and exploitation, as stated formally in what follows.

  , ,
  for  to  do
     Randomly select
  end for
  for  to  do
  end for
Algorithm 2 Simple greedy algorithm

The first selections of the above algorithm are used for the exploration to learn where and . The remaining selections exploit for their decision:

Observe that the above algorithm does not utilize the information , and . Nevertheless, we indeed show that it finds the optimal and achieves a sub-linear regret .

Theorem 3

Under the above noisy Gaussian contextual MAB model, the simple greedy algorithm has

with probability .

The proof of the above theorem is provided in Appendix D. Here we provide its high-level sketch. In the exploration part, at each time instance, the learner selects an action uniformly at random so that the selection and noisy feature vectors become independent. Then, are i.i.d. random vectors following and are also i.i.d. random vectors following . Therefore, we have

From the matrix Azuma inequality, we show that residual matrices and are negligible compared with and , respectively, with respect to their spectral norms. From the facts, we derive

This leads to the conclusion of Theorem 3.
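A minimal explore-then-exploit sketch in the spirit of the simple greedy algorithm is given below. It is not the paper's Algorithm 2: the dimension is fixed to two, both covariances are taken to be diagonal with hypothetical values, the exploration length is arbitrary, and the helper names are invented. It fits least squares on the noisy features of uniformly chosen arms and then compares exploiting the fitted vector with exploiting the true one.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve2(A, b):
    """Solve the 2x2 linear system A w = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

rng = random.Random(0)
theta_star = [1.0, 1.0]
feat_var = [1.0, 1.0]     # diagonal feature covariance (illustrative)
noise_var = [4.0, 0.01]   # diagonal noise covariance (illustrative)
K, explore_rounds = 5, 20_000

def draw_round():
    """True features x and noisy observations z for all K arms."""
    x = [[rng.gauss(0, v ** 0.5) for v in feat_var] for _ in range(K)]
    z = [[xi + rng.gauss(0, v ** 0.5) for xi, v in zip(xa, noise_var)] for xa in x]
    return x, z

# Exploration: play uniformly at random, accumulate sum z z^T and sum z r.
A = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
for _ in range(explore_rounds):
    x, z = draw_round()
    a = rng.randrange(K)
    r = dot(theta_star, x[a]) + rng.gauss(0, 0.1)
    for i in range(2):
        b[i] += z[a][i] * r
        for j in range(2):
            A[i][j] += z[a][i] * z[a][j]
theta_hat = solve2(A, b)

# Limit of the least-squares estimate when both covariances are diagonal.
theta_limit = [fv / (fv + nv) * t for fv, nv, t in zip(feat_var, noise_var, theta_star)]

def average_reward(theta, rounds=5000):
    """Exploitation: always play argmax_a theta^T z_a; return the mean expected reward."""
    total = 0.0
    for _ in range(rounds):
        x, z = draw_round()
        a = max(range(K), key=lambda k: dot(theta, z[k]))
        total += dot(theta_star, x[a])
    return total / rounds

print("estimate:", theta_hat, "limit:", theta_limit)
print("reward with estimate:", average_reward(theta_hat))
print("reward with true vector:", average_reward(theta_star))
```

In this toy setting the fitted vector down-weights the heavily corrupted first coordinate, and exploiting it collects more reward than exploiting the true coefficient vector on the noisy features.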

4.2 Non-Gaussian Features

In this section, we consider true feature vectors drawn from an arbitrary, possibly non-Gaussian, distribution . In this case, the proof of Theorem 2 no longer holds and it is not easy to analyze whether defined in (2) is optimal in any sense. Formally, we assume that are i.i.d. random vectors drawn from some (possibly non-Gaussian) distribution with mean and covariance where is a positive-definite matrix. The noise model is the same as that of the Gaussian contextual MAB model in the previous section. We focus on verifying numerically whether defined in (2) is optimal under the non-Gaussian setting.

To this end, one can observe that the optimal minimizes the following given the information , and :


The solution of this optimization might not admit a closed form like (2) unless is Gaussian/normal. Furthermore, computing a gradient is a non-trivial task depending on , and the knowledge of might not be given in practical scenarios. Hence, we estimate it via the following Monte Carlo method:


where are randomly generated samples from the distribution or real feature vectors observed in practice. It is elementary to check that each gradient in (4) can be expressed as an integral form with respect to the probability density function of
Under several different choices of feature distribution , we compute (4) at to confirm whether it is optimal or not. In all the experiments, the number of arms , the dimension of features , the number of samples , and each element in the feature vector is an i.i.d. random variable. We also choose each element of uniformly at random in the interval , i.e., Uniform(-1,1). For the distribution of noise, we use . The numerical results are reported in Table 1, which implies that (2) might be far from being optimal unless is Gaussian.
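The feature distributions compared in Table 1 can be sampled with the standard library alone. The sketch below only illustrates how the i.i.d. feature entries and a noisy observation could be generated: the dictionary `samplers`, the helper names, the dimension and the noise scale are placeholders, not the values used in the paper.

```python
import random

rng = random.Random(0)

def mixture(sample_a, sample_b, weight_a=0.3):
    """Draw from sample_a with probability weight_a, otherwise from sample_b."""
    return lambda: sample_a() if rng.random() < weight_a else sample_b()

samplers = {
    "Gaussian(0,1)": lambda: rng.gauss(0, 1),
    "Uniform(-1,1)": lambda: rng.uniform(-1, 1),
    # The difference of two independent Exponential(1) draws is Laplace(0,1).
    "Laplace(0,1)": lambda: rng.expovariate(1) - rng.expovariate(1),
    "Exponential(1)": lambda: rng.expovariate(1),
    "LogNormal(0,1)": lambda: rng.lognormvariate(0, 1),
    "Mixture of Gaussian": mixture(lambda: rng.gauss(10, 1), lambda: rng.gauss(-10, 1)),
    "Mixture of Uniform": mixture(lambda: rng.uniform(9, 11), lambda: rng.uniform(-11, -9)),
}

def feature_vector(name, d):
    """One feature vector whose d entries are i.i.d. draws from the named distribution."""
    return [samplers[name]() for _ in range(d)]

def observe(x, noise_std):
    """Noisy observation of x; the noise scale is a placeholder, not the paper's value."""
    return [xi + rng.gauss(0, noise_std) for xi in x]

x = feature_vector("Mixture of Uniform", d=5)
z = observe(x, noise_std=1.0)
print(x)
print(z)
```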

Feature distribution       Norm of gradient
Gaussian(0,1)              0.000
Uniform(-1,1)              0.013
Laplace(0,1)               0.032
Exponential(1)             0.413
LogNormal(0,1)             0.648
Mixture of Gaussian (a)    0.320
Mixture of Uniform (b)     0.273

(a) 0.3 * Gaussian(10,1) + 0.7 * Gaussian(-10,1)
(b) 0.3 * Uniform(9,11) + 0.7 * Uniform(-11,-9)

Table 1: Norms of gradients at the closed-form coefficient vector defined in (2)

This motivates us to design a new algorithm, completely different from Algorithm 2, for non-Gaussian feature distributions . For this purpose, we propose the following algorithm, called Universal-NLinRel.

  , Randomly select initial
  for  to  do
      Eigenvalue decomposition of
     Randomly sample from
  end for
Algorithm 3 Universal-NLinRel

In the above, is some UCB-like constant, as in LinUCB Chu et al. (2011). The main idea of the algorithm design is that it runs NLinRel to estimate the true coefficient vector , and then updates the current along a stochastic gradient direction, replacing by the estimate . Although NLinRel has theoretical value, Universal-NLinRel uses a practical variant of NLinRel by introducing the parameter , since too many initial explorations might hurt its regret unless an extremely large number of time instances is allowed. In the following section, we measure the regret performance of Universal-NLinRel.

Figure 1: Comparisons of algorithms on synthetic (a)/(b)/(c)/(d) and real-world (e)/(f) datasets. (a)/(b) and (c)/(d) are measured under the choices of feature distributions as Gaussian(0,1) and 0.3 * Uniform(9,11) + 0.7 * Uniform(-11,-9), respectively. (a)/(c) report cumulative regrets of algorithms deducted by that of Oracle-TC. (b)/(d) report the cosine distances between coefficient vectors maintained by algorithms and the true one . (e) and (f) are for Yahoo and mushroom datasets, respectively.

5 Experimental Results

In this section, we report experimental results comparing the regret performance of Universal-NLinRel with the following algorithms. First, LinUCB Chu et al. (2011) represents known algorithms designed for the noiseless contextual MAB problem. (Footnote 1: We choose for both LinUCB and Universal-NLinRel in all our experiments; their performance is not sensitive to this choice in any of our settings.) Second, Oracle-GD is identical to Universal-NLinRel, except that it uses the true coefficient vector instead of the estimated one . Finally, Oracle-TC and Oracle-CF are linear hypotheses choosing arm where they consider the true coefficient vector and the closed form defined in (2), respectively.

Synthetic dataset. We follow the same synthetic setups described in Section 4.2, and the experimental comparisons among Universal-NLinRel, LinUCB, Oracle-GD, Oracle-TC and Oracle-CF are reported in Figure 1. In the case of the Gaussian distribution, as reported in Figure 1 (a), one can observe that both LinUCB and Universal-NLinRel are close to the optimal Oracle-CF. The near-optimality of LinUCB can be explained by its similarity to the simple greedy algorithm in Section 4.1. In the case of the mixture of uniform distributions, as reported in Figure 1 (c), one can observe that LinUCB has the worst regret and is significantly outperformed by Universal-NLinRel. Figure 1 (b) and (d) show that NLinRel finds the true coefficient vector well in both Gaussian and non-Gaussian setups. This explains why Universal-NLinRel can perform well (since Universal-NLinRel uses NLinRel as its subroutine for tracking the true parameter).

Yahoo dataset. We use the Yahoo Webscope R6A dataset Li et al. (2010), which contains the history of the Yahoo! Front Page Module. The “Featured” tab of the Module highlights one article from a human-edited candidate set of size 20. The log contains user context, arm context, candidate set, chosen arm, and reward (click or not). We consider an article as an arm. As a pre-processing step, we removed the lines that are incomplete (i.e., that contain an arm whose context is not recorded). Then, we clustered the lines by user, so that each user observes several candidate arms whose rewards can be calculated as their empirical CTRs (Click-Through Rates). For example, if user observed an arm for times and clicked it times, we assumed the reward of the context vector is . We only consider users whose candidate sets contain more than 2 arms. The number of users after this pruning is 11,352, and the MAB algorithms iterated over 10,000 of them without duplication. Both user and article are represented by a six-dimensional real vector. We used the inner product of the two features as the context of each arm, as in Li et al. (2010). We remark that our reported CTRs are different from those in Li et al. (2010); the authors use a different parameter for each article, but we instead use a single universal under which our algorithms and their theoretical reasoning have been developed.

We run LinUCB, NLinRel and Universal-NLinRel on the pre-processed Yahoo dataset. Compared to our synthetic setting, computing gradients in Universal-NLinRel becomes more expensive due to the larger number of candidate arms. Hence, we estimate each integral in the gradients by Monte Carlo with 100 samples. In addition, since we do not have knowledge of the noise and feature distributions, we use the current context as a random sample in Universal-NLinRel, and set the noise variance to 10% of the sample variance of contexts in the entire dataset. On the Yahoo data, Universal-NLinRel performs best, followed by LinUCB and then NLinRel, as reported in Figure 1 (e).

Mushroom dataset. We use the mushroom dataset Bache & Lichman (2013), which was used in the contextual MAB experiment in Blundell et al. (2015). Each mushroom has 22 categorical features and is labeled as edible or poisonous. As in Blundell et al. (2015), we used 126-dimensional binary vectors as features. At each round, we sample one edible mushroom and 4 poisonous ones. Thus, the learner searches for one edible mushroom among 5 candidate mushrooms. If the agent chooses an edible mushroom, the regret does not change, and if the agent chooses a poisonous one, the regret increases by 1. We experimented with two different settings. The first one uses raw data and assumes the noise variance of the context is 10% of the sample variance (as we do for the Yahoo dataset). The second one adds artificial noise to the features, similar to the synthetic experiment. For each dimension, we added Gaussian noise with mean 0 and variance , where . The results are reported in Figure 1 (f). In the first experiment without noise, we observe that LinUCB performs quite well, with almost zero regret, since this data is almost linearly separable, i.e., the best setting for LinUCB. However, in the second experiment with artificial noise, Universal-NLinRel clearly outperforms LinUCB. This experiment shows that in some scenarios, it is important to learn/know the statistical information on noise for the performance of Universal-NLinRel. We leave this for further exploration in the future.

6 Conclusion

In this paper, we study contextual multi-armed bandit problems under linear payoffs and uncertainty on features. Based on our theoretical understanding of the special cases of identical noise and Gaussian features, we develop Universal-NLinRel for general scenarios. We believe that utilizing model uncertainties as addressed in this paper provides an important direction for designing more practical algorithms for the bandit task.


  • Abe & Long (1999) Abe, Naoki and Long, Philip M. Associative reinforcement learning using linear probabilistic concepts. In ICML, pp. 3–11, 1999.
  • Agrawal & Goyal (2013) Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In ICML, pp. 127–135, 2013.
  • Auer (2002) Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolo, Freund, Yoav, and Schapire, Robert E. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • Awerbuch & Kleinberg (2004) Awerbuch, Baruch and Kleinberg, Robert D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 45–53. ACM, 2004.
  • Bache & Lichman (2013) Bache, Kevin and Lichman, Moshe. UCI machine learning repository, 2013. URL
  • Beygelzimer et al. (2011) Beygelzimer, Alina, Langford, John, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandit algorithms with supervised learning guarantees. In AISTATS, pp. 19–26, 2011.
  • Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural network. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1613–1622, 2015.
  • Bouneffouf et al. (2012) Bouneffouf, Djallel, Bouzeghoub, Amel, and Gançarski, Alda Lopes. A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, pp. 324–331. Springer, 2012.
  • Chu et al. (2011) Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pp. 208–214, 2011.
  • Dani et al. (2008) Dani, Varsha, Hayes, Thomas P, and Kakade, Sham M. Stochastic linear optimization under bandit feedback. In COLT, pp. 355–366, 2008.
  • Gal & Ghahramani (2016) Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1050–1059, 2016.
  • Hofmann et al. (2011) Hofmann, Katja, Whiteson, Shimon, de Rijke, Maarten, et al. Contextual bandits for information retrieval. In NIPS 2011 Workshop on Bayesian Optimization, Experimental Design, and Bandits, Granada, volume 12, pp. 2011, 2011.
  • Jung et al. (2012) Jung, Tobias, Martin, Sylvain, Ernst, Damien, and Leduc, Guy. Contextual multi-armed bandits for web server defense. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pp. 1–8. IEEE, 2012.
  • Kaelbling (1994) Kaelbling, Leslie Pack. Associative reinforcement learning: Functions in k-DNF. Machine Learning, 15(3):279–298, 1994.
  • Langford & Zhang (2008) Langford, John and Zhang, Tong. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pp. 817–824, 2008.
  • Li et al. (2010) Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
  • Paulsen (2002) Paulsen, Vern. Completely bounded maps and operator algebras, volume 78. Cambridge University Press, 2002.
  • Radlinski et al. (2008) Radlinski, Filip, Kleinberg, Robert, and Joachims, Thorsten. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning, pp. 784–791. ACM, 2008.
  • Sarkar (1991) Sarkar, Jyotirmoy. One-armed bandit problems with covariates. The Annals of Statistics, pp. 1978–2002, 1991.
  • Schlag (1998) Schlag, Karl H. Why imitate, and if so, how?: A boundedly rational approach to multi-armed bandits. Journal of economic theory, 78(1):130–156, 1998.
  • Strehl et al. (2006) Strehl, Alexander L, Mesterharm, Chris, Littman, Michael L, and Hirsh, Haym. Experience-efficient learning in associative bandit problems. In ICML, pp. 889–896, 2006.
  • Thompson (1933) Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Tropp (2012) Tropp, Joel A. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • Woodroofe (1979) Woodroofe, Michael. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.

Appendix A Tail Bounds of Sums of Random Matrices

In the proof of Theorem 1, we have matrix martingales and need to bound their spectral norms to complete our proofs. When a matrix martingale is a sum of random matrices having bounded spectral norms, we can use the matrix Azuma inequality, which is Theorem 7.1 of Tropp (2012).

Theorem 4 (Matrix Azuma)

Let be a finite sequence of self-adjoint matrices in dimension that satisfy

Then, for all ,

For the proof of Theorem 1, we should study matrix martingales , , and , which are defined as follows:


We cannot directly apply matrix Azuma inequality to bound the spectral norms of and , since they are not self-adjoint. To resolve this problem, we introduce an operator , called dilations by Paulsen (2002), so that


It is known that dilations preserve the spectral norm.
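In the standard construction used with matrix concentration inequalities, the dilation of a rectangular matrix A is the self-adjoint block matrix with A and its transpose in the off-diagonal blocks and zeros elsewhere, and its spectral norm equals that of A. The sketch below checks this numerically for one small real matrix; the helper names are invented, and the spectral norm is computed by plain power iteration rather than a library routine.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def dilation(A):
    """Self-adjoint dilation [[0, A], [A^T, 0]] of a real m x n matrix A."""
    m, n = len(A), len(A[0])
    top = [[0.0] * m + list(row) for row in A]
    bottom = [list(row) + [0.0] * n for row in transpose(A)]
    return top + bottom

def spectral_norm(M, iters=500):
    """Largest singular value of M, via power iteration on M^T M."""
    G = matmul(transpose(M), M)
    v = [1.0 / (i + 1) for i in range(len(G))]
    for _ in range(iters):
        w = [sum(g * x for g, x in zip(row, v)) for row in G]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    Gv = [sum(g * x for g, x in zip(row, v)) for row in G]
    return sum(a * b for a, b in zip(v, Gv)) ** 0.5

A = [[3.0, 0.0, 4.0],
     [1.0, 2.0, 0.0]]   # a 2 x 3 matrix, not self-adjoint
D = dilation(A)         # a 5 x 5 symmetric matrix
print(spectral_norm(A), spectral_norm(D))
```

The dilation is what allows an inequality stated for self-adjoint matrices to be applied to rectangular or non-symmetric sums.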


Let and . Let and . Using matrix Azuma inequality and dilations operator, we can bound the spectral norm of , , and as follows:

  1. () Let . Then, is a sequence of self-adjoint matrices in dimension that satisfy