1 Introduction
Combinatorial optimization is a mature field (Papadimitriou & Steiglitz, 1998) with countless practical applications. One of the most studied problems in combinatorial optimization is maximization of a modular function subject to combinatorial constraints. Many important problems, such as minimum spanning tree (MST), shortest path, and maximum-weight bipartite matching, can be viewed as instances of this problem.
In practice, the optimized modular function is often unknown and needs to be learned while repeatedly solving the problem. This class of learning problems was recently formulated as a combinatorial bandit/semi-bandit, depending on the feedback model (Audibert et al., 2014). Since then, many combinatorial bandit/semi-bandit algorithms have been proposed: for the stochastic setting (Gai et al., 2012; Chen et al., 2013; Russo & Van Roy, 2014; Kveton et al., 2015b); for the adversarial setting (Cesa-Bianchi & Lugosi, 2012; Audibert et al., 2014; Neu & Bartók, 2013); and for subclasses of combinatorial problems, matroid and polymatroid bandits (Kveton et al., 2014a, b, c), submodular maximization (Wen et al., 2013; Gabillon et al., 2013), and cascading bandits (Kveton et al., 2015a). Many regret bounds have been established for the combinatorial semi-bandit algorithms. To achieve an $O(\sqrt{n})$ dependence on time $n$, all of the regret bounds are $\Omega(\sqrt{L})$, where $L$ is the number of items. The dependence on $L$ is intrinsic because the algorithms estimate the weight of each item separately, and matching lower bounds have been established (Section 3.2).
However, in many real-world problems, the number of items $L$ is intractably large. For instance, online advertising in a mainstream commercial website can be viewed as a bipartite matching problem with millions of users and products; routing in the Internet can be formulated as a shortest path problem with billions of edges. Thus, learning algorithms with $\Omega(\sqrt{L})$ regret are impractical in such problems. On the other hand, in many problems, items have features and their weights are similar when the features are similar. In movie recommendation, for instance, the expected ratings of movies that are close in the latent space are also similar. In this work, we show how to leverage this structure to learn to make good decisions more efficiently. More specifically, we assume a linear generalization across the items: conditioned on the features of an item, the expected weight of that item can be estimated using a linear model. Our goal is to develop more efficient learning algorithms for combinatorial semi-bandits with linear generalization.
It is relatively easy to extend many linear bandit algorithms, such as Thompson sampling (Thompson, 1933; Agrawal & Goyal, 2012; Russo & Van Roy, 2013) and Linear UCB (LinUCB; see Auer (2002); Dani et al. (2008); Abbasi-Yadkori et al. (2011)), to combinatorial semi-bandits with linear generalization. In this paper, we propose two learning algorithms, Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), based on Thompson sampling and LinUCB. Both CombLinTS and CombLinUCB are computationally efficient, as long as the offline version of the combinatorial problem can be solved efficiently. The first major contribution of the paper is that we establish a Bayes regret bound on CombLinTS and a regret bound on CombLinUCB, under reasonable assumptions. Both bounds are $L$-independent and sublinear in time. The second major contribution of the paper is that we evaluate CombLinTS on a variety of problems with thousands of items, two of which are based on real-world datasets. We only evaluate CombLinTS since recent literature (Chapelle & Li, 2011) suggests that Thompson sampling algorithms usually outperform UCB-like algorithms in practice. Our experimental results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines. It is worth mentioning that our derived $L$-independent regret bounds also hold in cases with $L = \infty$. Moreover, as we will discuss in Section 7, our proposed algorithms and their analyses can be easily extended to contextual combinatorial semi-bandits.
Finally, we briefly review some relevant papers. Gabillon et al. (2014) and Yue & Guestrin (2011) focus on submodular maximization with linear generalization. Our paper differs from these two papers in the following two aspects: (1) our paper allows general combinatorial constraints while they do not; (2) our paper focuses on maximization of modular functions while they focus on submodular maximization.
2 Combinatorial Optimization
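We focus on a class of combinatorial optimization problems that aim to find a maximum-weight set from a given family of sets. Specifically, one such combinatorial optimization problem can be represented as a triple $(E, \mathcal{A}, w)$, where (1) $E = \{1, \ldots, L\}$ is a set of $L$ items, called the ground set, (2) $\mathcal{A} \subseteq \{A \subseteq E : |A| \le K\}$ is a family of subsets of $E$ with up to $K$ items, where $K \le L$, and (3) $w : E \to \mathbb{R}$ is a weight function that assigns each item $e$ in the ground set $E$ a real number. The total weight of all items in a set $A \subseteq E$ is defined as:
$$f(A, w) = \sum_{e \in A} w(e), \tag{1}$$
which is a linear functional of $w$ and a modular function in $A$. A set $A^{\mathrm{opt}}$ is a maximum-weight set in $\mathcal{A}$ if:
$$A^{\mathrm{opt}} \in \arg\max_{A \in \mathcal{A}} f(A, w) = \arg\max_{A \in \mathcal{A}} \sum_{e \in A} w(e). \tag{2}$$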
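To make Equations (1) and (2) concrete, the following is a minimal sketch of the total-weight function and an exhaustive-search oracle on a toy instance. The instance (four items, all subsets of size two) and the names `total_weight` and `brute_force_oracle` are illustrative assumptions and are not part of the paper; exhaustive search is only viable for tiny feasible families.

```python
from itertools import combinations

def total_weight(A, w):
    """f(A, w): sum of the weights of the items in A (Equation (1))."""
    return sum(w[e] for e in A)

def brute_force_oracle(feasible_sets, w):
    """Return a maximum-weight feasible set by exhaustive search (Equation (2))."""
    return max(feasible_sets, key=lambda A: total_weight(A, w))

# Hypothetical instance: L = 4 items, feasible sets are all subsets of size K = 2.
E = range(4)
feasible = [frozenset(c) for c in combinations(E, 2)]
w = {0: 0.1, 1: 0.7, 2: 0.4, 3: 0.9}

A_opt = brute_force_oracle(feasible, w)
print(sorted(A_opt), total_weight(A_opt, w))
```

In practice the oracle would be a problem-specific solver (for example a matching or shortest-path routine) rather than enumeration.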
Many classical combinatorial optimization problems, such as finding an MST, bipartite matching, the shortest path problem and the traveling salesman problem (TSP), have form (2). Though some of these problems can be solved efficiently (e.g. bipartite matching), others (e.g. TSP) are known to be NP-hard. However, for many such NP-hard problems, there exist computationally efficient approximation algorithms and/or randomized algorithms that achieve near-optimal solutions with high probability. Similarly to Chen et al. (2013), in this paper, we allow the agent to use any approximation / randomized algorithm ORACLE to solve (2), and denote its solution as $A^* = \mathrm{ORACLE}(E, \mathcal{A}, w)$. To distinguish it from a learning algorithm, we refer to a combinatorial optimization algorithm as an oracle in this paper.
3 Combinatorial Semi-Bandits with Linear Generalization
Many real-world problems are combinatorial in nature. In recommender systems, for instance, the user is typically recommended $K$ items out of $L$. The value of an item, such as the expected rating of a movie, is never known perfectly and has to be refined while repeatedly recommending to the pool of users. Recommender problems are known to be highly structured. In particular, it is well known that the user-item matrix is typically low-rank (Koren et al., 2009) and that the value of an item can be written as a linear combination of its position in the latent space. In this work, we propose a learning algorithm for combinatorial optimization that leverages this structure. In particular, we assume that the weight of each item is a linear function of its features and then we learn the parameters of this model, jointly for all items.
3.1 Combinatorial Semi-Bandits
We formalize our learning problem as a combinatorial semi-bandit. A combinatorial semi-bandit is a triple $(E, \mathcal{A}, P)$, where $E$ and $\mathcal{A}$ are defined in Section 2 and $P$ is a probability distribution over the weights $w \in \mathbb{R}^L$ of the items in the ground set $E$. We assume that the weights $w$ are drawn i.i.d. from $P$. The mean weight is denoted by $\bar{w} = \mathbb{E}[w]$. Each item $e$ is associated with an arm and we assume that multiple arms can be pulled. A subset of arms $A \subseteq E$ can be pulled if and only if $A \in \mathcal{A}$. The return of pulling arms $A$ is $f(A, w)$ (Equation (1)), the sum of the weights of all items in $A$. After the arms $A$ are pulled, we observe the individual return of each arm, $\{(e, w(e)) : e \in A\}$. This feedback model is known as semi-bandit (Audibert et al., 2014).
We assume that the combinatorial structure $(E, \mathcal{A})$ is known and the distribution $P$ is unknown. We would like to stress that we do not make any structural assumptions on $P$. The optimal solution to our problem is a maximum-weight set in expectation:
$$A^{\mathrm{opt}} \in \arg\max_{A \in \mathcal{A}} \mathbb{E}_w[f(A, w)] = \arg\max_{A \in \mathcal{A}} f(A, \bar{w}). \tag{3}$$
This objective is equivalent to the one in Equation (2).
Our learning problem is episodic. In each episode $t$, the learning agent adaptively chooses $A^t \in \mathcal{A}$ based on its observations of the weights up to episode $t$, gains $f(A^t, w_t)$, and observes the weights of all chosen items in episode $t$, $\{(e, w_t(e)) : e \in A^t\}$. The learning agent interacts with the combinatorial semi-bandit $n$ times and its goal is to maximize the expected cumulative return in $n$ episodes, $\mathbb{E}\left[\sum_{t=1}^{n} f(A^t, w_t)\right]$, where the expectation is over (1) the random weights $w_t$, (2) possible randomization in the learning algorithm, and (3) $\bar{w}$ if it is randomly generated. Notice that the choice of $A^t$ impacts both the return and the observations in episode $t$, so we need to trade off exploration and exploitation, as in other bandit problems.
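The episodic interaction and the semi-bandit feedback model can be illustrated with a small simulated environment. This is a sketch under stated assumptions: the class name `SemiBanditEnv`, the Gaussian noise model, and the toy mean weights are hypothetical choices made for illustration, since the paper places no structural assumptions on the weight distribution.

```python
import numpy as np

class SemiBanditEnv:
    """Stochastic combinatorial semi-bandit with Gaussian weights (illustrative choice)."""

    def __init__(self, w_bar, noise_std, seed=0):
        self.w_bar = np.asarray(w_bar, dtype=float)   # mean weight of each item
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def pull(self, A):
        """Draw the episode's weights; return f(A, w_t) and the per-item observations."""
        noise = self.noise_std * self.rng.standard_normal(self.w_bar.shape)
        w_t = self.w_bar + noise
        # Semi-bandit feedback: only the weights of the chosen items are revealed.
        feedback = {e: w_t[e] for e in A}
        return sum(feedback.values()), feedback

env = SemiBanditEnv(w_bar=[0.1, 0.7, 0.4, 0.9], noise_std=0.1)
ret, obs = env.pull({1, 3})
print(ret, obs)
```

Under full-bandit feedback the agent would see only `ret`; the per-item dictionary `obs` is what makes the feedback "semi-bandit".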
3.2 Linear Generalization
As we have discussed in Section 1, many provably efficient algorithms have been developed for various combinatorial semi-bandits of form (3) (Chen et al., 2013; Gai et al., 2012; Russo & Van Roy, 2014; Kveton et al., 2014b, 2015b). However, since there are $L$ parameters to learn and these algorithms do not consider generalization across items, the derived upper bounds on the expected cumulative regret and/or the Bayes cumulative regret of these algorithms are at least $\Omega(\sqrt{L})$. Furthermore, Audibert et al. (2014) have derived an $\Omega(\sqrt{LKn})$ lower bound on adversarial combinatorial semi-bandits, while Kveton et al. (2014b, 2015b) have derived asymptotic gap-dependent lower bounds on stochastic combinatorial semi-bandits, stated in terms of an appropriate "gap".
However, in many modern combinatorial semi-bandit problems, $L$ tends to be enormous. Thus, an $\Omega(\sqrt{L})$ regret is unacceptably large in these problems. On the other hand, in many practical problems, there exists a generalization model based on which the weight of one item can be (approximately) inferred from the weights of other items. By exploiting such generalization models, an $o(\sqrt{L})$ or even an $L$-independent cumulative regret might be achieved.
In this paper, we assume that there is a (possibly imperfect) linear generalization model across the items. Specifically, we assume that the agent knows a generalization matrix $\Phi \in \mathbb{R}^{L \times d}$ s.t. $\bar{w}$ either lies in or is "close" to the subspace $\mathrm{span}[\Phi]$. We use $\phi_e$ to denote the transpose of the $e$-th row of $\Phi$, and refer to it as the feature vector of item $e$. Without loss of generality, we assume that $\mathrm{rank}[\Phi] = d$.
Similar to some existing literature (Wen & Van Roy, 2013; Van Roy & Wen, 2014), we distinguish between the coherent learning cases, in which $\bar{w} \in \mathrm{span}[\Phi]$, and the agnostic learning cases, in which $\bar{w} \notin \mathrm{span}[\Phi]$. Like existing literature on linear bandits (Dani et al., 2008; Abbasi-Yadkori et al., 2011), the analysis in this paper focuses on coherent learning cases. However, we would like to emphasize that both of our proposed algorithms, CombLinTS and CombLinUCB, are also applicable to the agnostic learning cases. As is demonstrated in Section 6, CombLinTS performs well in the agnostic learning cases.
Finally, we define $\theta^* = \arg\min_{\theta} \|\bar{w} - \Phi\theta\|_2$. Since $\mathrm{rank}[\Phi] = d$, $\theta^*$ is uniquely defined. Moreover, in coherent learning cases, we have $\bar{w} = \Phi\theta^*$.
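The distinction between coherent and agnostic cases can be checked numerically: the best coefficient vector is the least-squares projection of the mean weights onto the column space of the generalization matrix, and the residual is zero exactly in the coherent case. The sketch below uses randomly generated data; the dimensions, the noise level, and the helper name `best_coefficients` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 50, 5
Phi = rng.standard_normal((L, d))   # generalization matrix; rank d almost surely

def best_coefficients(Phi, w_bar):
    """Least-squares coefficients: argmin over theta of ||w_bar - Phi @ theta||_2."""
    theta, *_ = np.linalg.lstsq(Phi, w_bar, rcond=None)
    return theta

theta_true = rng.standard_normal(d)
w_coherent = Phi @ theta_true                              # lies in span[Phi]
w_agnostic = w_coherent + 0.1 * rng.standard_normal(L)     # only close to span[Phi]

theta_c = best_coefficients(Phi, w_coherent)
theta_a = best_coefficients(Phi, w_agnostic)
res_c = np.linalg.norm(w_coherent - Phi @ theta_c)   # ~0 in the coherent case
res_a = np.linalg.norm(w_agnostic - Phi @ theta_a)   # > 0 in the agnostic case
print(res_c, res_a)
```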
3.3 Performance Metrics
Let $A^* = \mathrm{ORACLE}(E, \mathcal{A}, \bar{w})$. In this paper, we measure the performance loss of a learning algorithm with respect to $A^*$. Recall that the learning algorithm chooses $A^t$ in episode $t$; we define $R_t = f(A^*, w_t) - f(A^t, w_t)$ as the realized regret in episode $t$. If the expected weight $\bar{w}$ is fixed but unknown, we define the expected cumulative regret of the learning algorithm in $n$ episodes as
$$R(n) = \sum_{t=1}^{n} \mathbb{E}\left[R_t \mid \bar{w}\right], \tag{4}$$
where the expectation is over random weights and possible randomization in the learning algorithm. If necessary, we denote $R(n)$ as $R(n; \bar{w})$ to emphasize the dependence on $\bar{w}$. On the other hand, if $\bar{w}$ is randomly generated or the agent has a prior belief in $\bar{w}$, then from Russo & Van Roy (2013), the Bayes cumulative regret of the learning algorithm in $n$ episodes is defined as
$$R_{\mathrm{Bayes}}(n) = \sum_{t=1}^{n} \mathbb{E}\left[R_t\right], \tag{5}$$
where the expectation is also over $\bar{w}$. That is, $R_{\mathrm{Bayes}}(n)$ is a weighted average of $R(n; \bar{w})$ under the prior on $\bar{w}$.
4 Learning Algorithms
In this section, we propose two learning algorithms for combinatorial semi-bandits: Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), which are respectively motivated by Thompson sampling and LinUCB. Both algorithms maintain a mean vector $\bar{\theta}_t$ and a covariance matrix $\Sigma_t$, and use Kalman filtering to update $\bar{\theta}_t$ and $\Sigma_t$. They differ in how they choose $A^t$ (i.e. how they explore) in each episode $t$: CombLinTS chooses $A^t$ based on a randomly sampled coefficient vector $\theta_t$, while CombLinUCB chooses $A^t$ based on the optimism in the face of uncertainty (OFU) principle.
4.1 Combinatorial Linear Thompson Sampling
The pseudocode of CombLinTS is given in Algorithm 2, where $(E, \mathcal{A})$ is the combinatorial structure, $\Phi$ is the generalization matrix, ORACLE is a combinatorial optimization algorithm, and $\lambda$ and $\sigma$ are two algorithm parameters controlling the learning rate. Specifically, $\lambda$ is an inverse-regularization parameter and a smaller $\lambda$ makes the covariance matrix closer to $0$. Thus, a too small $\lambda$ will lead to insufficient exploration and significantly reduce the performance of CombLinTS. On the other hand, $\sigma$ controls the decrease rate of the covariance matrix $\Sigma_t$. In particular, a large $\sigma$ will lead to slow learning, while a too small $\sigma$ will make the algorithm quickly converge to some suboptimal coefficient vector.
In each episode $t$, Algorithm 2 consists of three steps. First, it randomly samples a coefficient vector $\theta_t$ from a Gaussian distribution. Second, it computes $A^t$ based on $\theta_t$ and the pre-specified oracle. Finally, it updates the mean vector and the covariance matrix based on Kalman filtering (Algorithm 1).
It is worth pointing out that if (1) $\bar{w} = \Phi\theta^*$, (2) the prior on $\theta^*$ is $N(0, \lambda^2 I)$, and (3) for all $(t, e)$, the noise $w_t(e) - \bar{w}(e)$ is independently sampled from $N(0, \sigma^2)$, then in each episode $t$, the CombLinTS algorithm samples $\theta_t$ from the posterior distribution of $\theta^*$. We henceforth refer to a case satisfying conditions (1)-(3) as a coherent Gaussian case. Obviously, the CombLinTS algorithm can be applied to more general cases, even to cases with no prior and/or agnostic learning cases.
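The three steps described above (sample, act through the oracle, update by Kalman filtering) can be sketched as follows. Algorithms 1 and 2 are not reproduced in this text, so the update below is the standard rank-one Gaussian posterior update for a linear model with prior $N(0, \lambda^2 I)$ and noise variance $\sigma^2$, which is an assumption about their exact form. The function names, the top-$K$ oracle, and the toy problem sizes are hypothetical.

```python
import numpy as np

def kalman_update(theta_bar, Sigma, phi, y, sigma):
    """Rank-one posterior update after observing y = <phi, theta> + N(0, sigma^2) noise."""
    Sphi = Sigma @ phi
    denom = phi @ Sphi + sigma ** 2
    theta_bar = theta_bar + Sphi * (y - phi @ theta_bar) / denom
    Sigma = Sigma - np.outer(Sphi, Sphi) / denom
    return theta_bar, Sigma

def comb_lin_ts(Phi, oracle, pull, n, lam, sigma, seed=0):
    """Sketch of a combinatorial linear Thompson sampling loop.

    oracle(w) maps estimated item weights to a feasible set of item indices;
    pull(A) returns a dict {item: observed weight} for the chosen items.
    """
    rng = np.random.default_rng(seed)
    d = Phi.shape[1]
    theta_bar, Sigma = np.zeros(d), lam ** 2 * np.eye(d)
    for _ in range(n):
        theta_t = rng.multivariate_normal(theta_bar, Sigma)   # step 1: sample
        A_t = oracle(Phi @ theta_t)                           # step 2: call the oracle
        for e, y in pull(A_t).items():                        # step 3: update per item
            theta_bar, Sigma = kalman_update(theta_bar, Sigma, Phi[e], y, sigma)
    return theta_bar, Sigma

# Hypothetical usage: feasible sets are all subsets of size K, so the oracle is top-K.
rng = np.random.default_rng(1)
L, d, K = 30, 4, 3
Phi = rng.standard_normal((L, d))
w_bar = Phi @ rng.standard_normal(d)          # coherent case
top_k = lambda w: np.argsort(w)[-K:]
pull = lambda A: {int(e): w_bar[e] + 0.1 * rng.standard_normal() for e in A}
theta_hat, Sigma = comb_lin_ts(Phi, top_k, pull, n=200, lam=1.0, sigma=0.1)
```

Note that the per-episode cost is dominated by one oracle call plus at most $K$ rank-one updates of a $d \times d$ matrix, independent of $L$ apart from forming the estimated weights.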
4.2 Combinatorial Linear UCB
The pseudocode of CombLinUCB is given in Algorithm 3, where $(E, \mathcal{A})$, $\Phi$, and ORACLE are defined the same as in Algorithm 2, and $\lambda$, $\sigma$, and $c$ are three algorithm parameters. Similarly, $\lambda$ is an inverse-regularization parameter, $\sigma$ controls the decrease rate of the covariance matrix, and $c$ controls the degree of optimism (exploration). Specifically, if $c$ is too small, the algorithm might converge to some suboptimal coefficient vector due to insufficient exploration; on the other hand, a too large $c$ will lead to excessive exploration and slow learning.
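The role of the optimism parameter can be illustrated with the usual linear-UCB index, in which each item's estimated weight is inflated by a multiple of its posterior standard deviation before the oracle is called. Algorithm 3 is not reproduced in this text, so the exact index below is an assumption, and the function name `ucb_weights` and the toy numbers are hypothetical.

```python
import numpy as np

def ucb_weights(Phi, theta_bar, Sigma, c):
    """Optimistic weight per item: <phi_e, theta_bar> + c * sqrt(phi_e^T Sigma phi_e)."""
    mean = Phi @ theta_bar
    var = np.einsum("ij,jk,ik->i", Phi, Sigma, Phi)   # phi_e^T Sigma phi_e for every e
    return mean + c * np.sqrt(np.maximum(var, 0.0))

# Toy example: three items with two features.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta_bar = np.array([0.5, 0.2])
Sigma = np.diag([0.04, 1.0])      # second coordinate is still very uncertain

greedy = ucb_weights(Phi, theta_bar, Sigma, c=0.0)
optimistic = ucb_weights(Phi, theta_bar, Sigma, c=1.0)
print(greedy, optimistic)
```

With `c = 0` the index reduces to the greedy estimate; larger `c` favors items whose features point in poorly explored directions, which matches the trade-off described above.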
5 Regret Bounds
In this section, we present a Bayes regret bound on CombLinTS, and a regret bound on CombLinUCB. We will also briefly discuss how these bounds are derived, as well as their tightness. The detailed proofs are left to the appendices. Without loss of generality, throughout this section, we assume that $\|\phi_e\|_2 \le 1$, $\forall e \in E$.
5.1 Bayes Regret Bound on CombLinTS
We have the following upper bound on $R_{\mathrm{Bayes}}(n)$ when CombLinTS is applied to a coherent Gaussian case with the right parameter.
Theorem 1.
If (1) , (2) the prior on is , (3) the noises are i.i.d. sampled from , and (4) , then under algorithm with parameter , we have
(6) 
Notice that conditions (1)-(3) ensure that it is a coherent Gaussian case, and condition (4) almost always holds (Footnote 1: Condition (4) is not essential; please refer to Theorem 3 in Appendix A for a Bayes regret bound without condition (4)). The $\tilde{O}$ notation hides the logarithmic factors. We also note that Equation (6) is a minimum of two bounds. The first bound is $L$-dependent, but it is only $\tilde{O}(\sqrt{d})$; on the other hand, the second bound is $L$-independent, but is $\tilde{O}(d)$ instead of $\tilde{O}(\sqrt{d})$. We would like to emphasize that Theorem 1 holds even if ORACLE is an approximation/randomized algorithm.
We now outline the proof of Theorem 1, which is motivated by Russo & Van Roy (2013) and Dani et al. (2008). Let denote the “history” (i.e. all the available information) by the start of episode . Note that from the Bayesian perspective, conditioning on , and are i.i.d. drawn from (Russo & Van Roy, 2013). This is because that conditioning on , the posterior belief in is and based on Algorithm 2, is independently sampled from . Since is a fixed combinatorial optimization algorithm (even though it can be independently randomized), and are all fixed, then conditioning on , and are also i.i.d., furthermore, is conditionally independent of , and is conditionally independent of .
To simplify the exposition, and , we define
where is an alternative notation for inner product. Thus we have . We also define a UCB function as
where is a constant to be specified. Notice that conditioning on , is a deterministic function and are i.i.d., then and
(7) 
Theorem 1 follows by respectively bounding the two terms on the righthand side of Equation (7). Two key observations are (1) if , then
and (2)
and we have a worstcase bound (see Lemma 4 in Appendix A) on . Please refer to Appendix A for the detailed proof for Theorem 1.
Finally, we briefly discuss the tightness of our bound. Without loss of generality, we assume that . For the special case when (i.e. no generalization), Russo & Van Roy (2014) provides an upper bound on when Thompson sampling is applied, and Audibert et al. (2014) provides an lower bound (Footnote 2: Audibert et al. (2014) focuses on the adversarial setting but the lower bound is stochastic, so it is a reasonable lower bound to compare with). Since when , the above results indicate that for general , the best upper bound one can hope is . Hence, our bound is at most larger. It is well known that the factor is due to linear generalization (Dani et al., 2008; Abbasi-Yadkori et al., 2011), and as is discussed in the appendix (see Remark 1), the extra factor is also due to linear generalization. They might be intrinsic, but we leave the final word and tightness analysis to future work.
5.2 Regret Bound on CombLinUCB
Under the assumptions that (1) the support of is a subset of (i.e. and ), and (2) the oracle exactly solves the offline optimization problem (Footnote 3: If ORACLE is an approximation algorithm, a variant of Theorem 2 can be proved; see Appendix D), we have the following upper bound on $R(n)$ when CombLinUCB is applied to coherent learning cases:
Theorem 2.
For any , any , and any satisfying
(8) 
if and the above two assumptions hold, then under algorithm with parameter , we have
Generally speaking, the proof of Theorem 2 proceeds as follows. We first construct a confidence set of based on the "self-normalized bound" developed in Abbasi-Yadkori et al. (2011). Then we decompose the regret over the high-probability "good" event and the low-probability "bad" event , where is the complement of . Finally, we bound the term associated with the event based on the same worst-case bound on used in the analysis of CombLinTS (see Lemma 4 in Appendix A), and bound the term associated with the event based on a naive bound. Please refer to Appendix B for the detailed proof of Theorem 2.
Notice that if we choose , , and as the lower bound specified in Inequality (8), then the regret bound derived in Theorem 2 is also . Compared with the lower bound derived in Audibert et al. (2014), this bound is at most larger. Similarly, the extra and factors are also due to linear generalization.
Finally, we would like to clarify that the assumption that the support of $P$ is bounded is not essential. By slightly modifying the analysis, we can achieve a similar high-probability bound on the realized cumulative regret as long as $P$ is sub-Gaussian. We also want to point out that the $L$-independent bounds derived in both Theorems 1 and 2 will still hold even if $L = \infty$.
6 Experiments
In this section, we evaluate CombLinTS on three problems. The first problem is synthetic, but the last two problems are constructed based on real-world datasets. As we have discussed in Section 1, we only evaluate CombLinTS since in practice Thompson sampling algorithms usually outperform UCB-like algorithms. Our experiment results in the synthetic problem demonstrate that CombLinTS is both scalable and robust to the choice of algorithm parameters. They also suggest that the Bayes regret bound derived in Theorem 1 is likely to be tight. On the other hand, our experiment results in the last two problems show the value of linear generalization in real-world settings: with domain-specific but imperfect linear generalization (i.e. agnostic learning), CombLinTS can significantly outperform state-of-the-art learning algorithms that do not exploit linear generalization, which serve as baselines in these two problems.
In all three problems, the oracle exactly solves the offline combinatorial optimization problem. Moreover, in the two real-world problems, we demonstrate the experiment results using a new performance metric, the expected per-step return in $n$ episodes, which is defined as
$$\frac{1}{n} \mathbb{E}\left[\sum_{t=1}^{n} f(A^t, w_t)\right]. \tag{9}$$
Obviously, it is the expected cumulative return in $n$ episodes divided by $n$. We demonstrate experiment results using the expected return rather than the cumulative regret since it is more illustrative.
6.1 Longest Path
We first evaluate CombLinTS on a synthetic problem. Specifically, we experiment with a stochastic longest path problem on an square grid (Footnote 4: That is, each side has edges and nodes. Notice that the longest path problem and the shortest path problem are mathematically equivalent). The items in the ground set are the edges in the grid, in total. The feasible set consists of all paths in the grid from the upper left corner to the bottom right corner that follow the directions of the edges. The length of these paths is . In this problem, we focus on coherent Gaussian cases and randomly sample the linear generalization matrix $\Phi$ to weaken the dependence on a particular choice of $\Phi$.
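Because every feasible path only moves right or down, an exact oracle for this problem can be written as a simple dynamic program. The sketch below is illustrative and not taken from the paper: the grid size symbol `m`, the split of edge weights into `right` and `down` arrays, and the function name `longest_path_oracle` are all assumptions.

```python
import numpy as np

def longest_path_oracle(right, down):
    """Max-weight monotone path from the upper-left to the bottom-right node.

    right[i, j]: weight of edge (i, j) -> (i, j + 1); shape (m + 1, m)
    down[i, j]:  weight of edge (i, j) -> (i + 1, j); shape (m, m + 1)
    Returns (total weight, list of edges as ('R' or 'D', i, j)).
    """
    m = down.shape[0]
    best = np.full((m + 1, m + 1), -np.inf)
    best[0, 0] = 0.0
    move = {}
    for i in range(m + 1):
        for j in range(m + 1):
            if j > 0 and best[i, j - 1] + right[i, j - 1] > best[i, j]:
                best[i, j] = best[i, j - 1] + right[i, j - 1]
                move[i, j] = ("R", i, j - 1)
            if i > 0 and best[i - 1, j] + down[i - 1, j] > best[i, j]:
                best[i, j] = best[i - 1, j] + down[i - 1, j]
                move[i, j] = ("D", i - 1, j)
    # Walk back from the bottom-right node to recover the chosen edges.
    path, node = [], (m, m)
    while node != (0, 0):
        kind, i, j = move[node]
        path.append((kind, i, j))
        node = (i, j)
    return float(best[m, m]), path[::-1]

# Toy 2 x 2 grid (3 x 3 nodes, 12 edges); every feasible path uses 4 edges.
right = np.array([[1.0, 1.0], [5.0, 1.0], [1.0, 1.0]])
down = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
value, path = longest_path_oracle(right, down)
print(value, path)
```

The run time is linear in the number of edges, so a learning algorithm that calls this oracle once per episode stays efficient even on large grids.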
Our experiments are parameterized by a sextuple , where , , , and are defined before and and are respectively the true standard deviations of and the observation noises. In each round of simulation, we first construct a problem instance as follows: (1) generate by sampling each component of i.i.d. from ; (2) sample independently from and set ; and (3) , the observation noise is i.i.d. sampled from . Then we apply CombLinTS with parameter to the constructed instance for episodes. Notice that in general . We average the experiment results over simulations to estimate the Bayes cumulative regret $R_{\mathrm{Bayes}}(n)$.
We start with a "default case" with , , and . Notice in this case and . We choose since in the default case, the Bayes per-episode regret of CombLinTS vanishes far before period . In the default case . In the experiments, we vary one and only one parameter while keeping all the other parameters fixed to their "default values" specified above to demonstrate the scalability and robustness of CombLinTS.
First, we study how the Bayes cumulative regret of CombLinTS scales with the size of the problem by varying , and show the result in Figure 1(a). The experiment results show that roughly increases linearly with , which indicates that CombLinTS is scalable with respect to the problem size . We also experiment with , in this case we have , , and , which is only times of in the default case. It is worth mentioning that this result also suggests that the Bayes regret bound derived in Theorem 1 is (almost) tight in this problem (Footnote 5: Recall that Theorem 1 requires . It can be easily extended to cases with by scaling the Bayes regret bound by . However, in this problem is not bounded since it is sampled from a Gaussian distribution. We believe that Theorem 1 can be extended to this case by exploiting the properties of the Gaussian distribution. Roughly speaking, in this problem, with high probability, .). To see it, notice that and , and hence the Bayes regret bound derived in Theorem 1 is .
Second, we study how the Bayes cumulative regret of CombLinTS scales with , the dimension of the feature vectors, by varying , and demonstrate the result in Figure 1(b). The experiment results indicate that also roughly increases linearly with , and hence CombLinTS is also scalable with the feature dimension . This result also suggests that the bound in Theorem 1 is (almost) tight (see Footnote 5).
Finally, we study the robustness of CombLinTS with respect to the algorithm parameters and . In Figure 1(c), we vary and in Figure 1(d), we vary . We would like to emphasize again that we only vary the algorithm parameters and fix and . The experiment results show that CombLinTS is robust to the choice of algorithm parameters and performs well for a wide range of and . However, too small or too large , or too small , can significantly reduce the performance of CombLinTS, as we have discussed in Section 4.1.
6.2 Online Advertising
In the second experiment, we evaluate CombLinTS on an advertising problem. Our objective is to identify people that are most likely to accept an advertisement offer, subject to the targeting constraint that exactly half of them are females. Specifically, the ground set includes representative people from the Adult dataset (Asuncion & Newman, 2007), which was collected in the US census. A feasible solution is any subset of with and satisfying the targeting constraint mentioned above. We assume that person accepts an advertisement offer with probability
and people accept offers independently of each other. The features in the generalization matrix are the age, which is binned into groups; gender; whether the person works more than hours per week; and the length of education in years. All these features can be constructed based on the Adult dataset.
CombLinTS is compared to three baselines. The first baseline is the optimal solution . The second baseline is CombUCB1 (Kveton et al., 2015b). This algorithm estimates the probability that person accepts the offer independently of the other probabilities. The third baseline is CombLinTS without linear generalization, which we simply refer to as CombTS. As in CombUCB1, this algorithm estimates the probability that person accepts the offer independently of the other probabilities. The posterior of is modeled as a beta distribution.
Our experiment results are reported in Figure 2. We observe two major trends. First, CombLinTS learns extremely quickly. In particular, its per-step return at episode is of the optimum, and its per-step return at episode k is of the optimum. These results are remarkable since the linear generalization is imperfect in this problem. Second, both CombUCB1 and CombTS perform poorly due to insufficient observations with respect to the model complexity. Specifically, in k episodes, the people in are observed k times, which implies that each person is observed only times on average. This is not enough to discriminate the people who are likely to accept the advertisement offer from those who are not.
6.3 Artist Recommendation
In the last experiment, we evaluate CombLinTS on a problem of recommending music artists that are most likely to be chosen by an average user of a music recommendation website. Specifically, the ground set includes artists from the Last.fm music recommendation dataset (Cantador et al., 2011). The dataset contains tagging and music artist listening information from a set of users from the Last.fm online music system (Footnote 6: http://www.lastfm.com). The tagging part includes the tag assignments of all artists provided by the users. For each user, the artists to whom she listened and the number of listening events are also available in the dataset.
We choose as the set of artists that were listened to by at least two users and had at least one tag assignment among the top most popular tags, and k. For each artist , we construct its feature vector by setting its th component as the fraction of users who assigned tag to this artist. We assume that each artist is chosen by an average user with probability , where is the set of users that listened to artist , and is the probability that user likes artist . We estimate based on a Naïve Bayes classifier with respect to the number of person/artist listening events.
As in Section 6.2, we also compare CombLinTS to three baselines: the optimal solution , the CombUCB1 algorithm and the CombTS algorithm. Our experiment results are reported in Figure 3. As in Figure 2, the expected per-step return of CombLinTS approaches that of much faster than CombUCB1 and CombTS. Moreover, both CombUCB1 and CombTS perform poorly due to insufficient observations with respect to the model complexity: in k episodes, each artist is observed less than times on average, which is not enough to discriminate the most popular artists from less popular artists.
7 Conclusion
We have proposed two learning algorithms, CombLinTS and CombLinUCB, for stochastic combinatorial semi-bandits with linear generalization. The main contribution of this work is twofold. First, we have established $L$-independent regret bounds for these two algorithms under reasonable assumptions, where $L$ is the number of items. Second, we have evaluated CombLinTS on a variety of problems. The experiment results in the first problem show that CombLinTS is scalable and robust, and the experiment results in the other two problems demonstrate the value of exploiting linear generalization in real-world settings.
It is worth mentioning that our results can be easily extended to contextual combinatorial semi-bandits with linear generalization. In a contextual combinatorial semi-bandit, the probability distribution $P$ (and hence the expected weight $\bar{w}$) also depends on a context , which either follows an exogenous stochastic process or is adaptively chosen by an adversary. Assuming that each context-item pair is associated with a feature vector , then similar to Agrawal & Goyal (2013), both CombLinTS and CombLinUCB, as well as their analyses, can be generalized to contextual combinatorial semi-bandits.
We leave open several questions of interest. One interesting open question is how to derive regret bounds for CombLinTS and CombLinUCB in the agnostic learning cases. Another interesting open question is how to extend the results to combinatorial semi-bandits with nonlinear generalization. We believe that our results can be extended to combinatorial semi-bandits with generalized linear generalization (Footnote 7: That is, , where is a strictly monotone function), but leave it to future work.
References
 AbbasiYadkori et al. (2011) AbbasiYadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312–2320, 2011.
 Agrawal & Goyal (2012) Agrawal, Shipra and Goyal, Navin. Analysis of thompson sampling for the multiarmed bandit problem. In COLT 2012  The 25th Annual Conference on Learning Theory, June 2527, 2012, Edinburgh, Scotland, pp. 39.1–39.26, 2012.

Agrawal & Goyal (2013)
Agrawal, Shipra and Goyal, Navin.
Thompson sampling for contextual bandits with linear payoffs.
In
Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 1621 June 2013
, pp. 127–135, 2013.  Asuncion & Newman (2007) Asuncion, A. and Newman, D.J. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/$∼$mlearn/{MLR}epository.html.
 Audibert et al. (2014) Audibert, JeanYves, Bubeck, Sebastien, and Lugosi, Gabor. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
 Auer (2002) Auer, Peter. Using confidence bounds for exploitationexploration tradeoffs. Journal of Machine Learning Research, 3:397–422, 2002.
 Cantador et al. (2011) Cantador, Iván, Brusilovsky, Peter, and Kuflik, Tsvi. Second workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the ACM conference on Recommender systems, RecSys 2011. ACM, 2011.
 CesaBianchi & Lugosi (2012) CesaBianchi, Nicolò and Lugosi, Gábor. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
 Chapelle & Li (2011) Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems, pp. 2249–2257, 2011.
 Chen et al. (2013) Chen, Wei, Wang, Yajun, and Yuan, Yang. Combinatorial multiarmed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pp. 151–159, 2013.
 Dani et al. (2008) Dani, Varsha, Hayes, Thomas, and Kakade, Sham. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355–366, 2008.
 Gabillon et al. (2013) Gabillon, Victor, Kveton, Branislav, Wen, Zheng, Eriksson, Brian, and Muthukrishnan, S. Adaptive submodular maximization in bandit setting. In Advances in Neural Information Processing Systems 26, pp. 2697–2705, 2013.

Gabillon et al. (2014)
Gabillon, Victor, Kveton, Branislav, Wen, Zheng, Eriksson, Brian, and
Muthukrishnan, S.
Largescale optimistic adaptive submodularity.
In
Proceedings of the 28th AAAI Conference on Artificial Intelligence
, 2014.  Gai et al. (2012) Gai, Yi, Krishnamachari, Bhaskar, and Jain, Rahul. Combinatorial network optimization with unknown variables: Multiarmed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.
 Koren et al. (2009) Koren, Yehuda, Bell, Robert, and Volinsky, Chris. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
 Kveton et al. (2014a) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Eydgahi, Hoda. Matroid bandits: Practical large-scale combinatorial bandits. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014a.
 Kveton et al. (2014b) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Eriksson, Brian. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 420–429, 2014b.
 Kveton et al. (2014c) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Valko, Michal. Learning to act greedily: Polymatroid semi-bandits. CoRR, abs/1405.7752, 2014c.
 Kveton et al. (2015a) Kveton, Branislav, Szepesvari, Csaba, Wen, Zheng, and Ashkan, Azin. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015a.
 Kveton et al. (2015b) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Szepesvari, Csaba. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015b.
 Neu & Bartók (2013) Neu, Gergely and Bartók, Gábor. An efficient algorithm for learning with semi-bandit feedback. In Jain, Sanjay, Munos, Rémi, Stephan, Frank, and Zeugmann, Thomas (eds.), Algorithmic Learning Theory, volume 8139 of Lecture Notes in Computer Science, pp. 234–248. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-40934-9.
 Papadimitriou & Steiglitz (1998) Papadimitriou, Christos and Steiglitz, Kenneth. Combinatorial Optimization. Dover Publications, Mineola, NY, 1998.
 Russo & Van Roy (2013) Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.
 Russo & Van Roy (2014) Russo, Daniel and Van Roy, Benjamin. An information-theoretic analysis of Thompson sampling. CoRR, abs/1403.5341, 2014.
 Thompson (1933) Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Van Roy & Wen (2014) Van Roy, Benjamin and Wen, Zheng. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
 Wen & Van Roy (2013) Wen, Zheng and Van Roy, Benjamin. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems 26, pp. 3021–3029, 2013.
 Wen et al. (2013) Wen, Zheng, Kveton, Branislav, Eriksson, Brian, and Bhamidipati, Sandilya. Sequential Bayesian search. In Proceedings of the 30th International Conference on Machine Learning, pp. 977–983, 2013.
 Yue & Guestrin (2011) Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pp. 2483–2491, 2011.
Appendix A Proof for Theorem 1
To prove Theorem 1, we first prove the following theorem:
Theorem 3.
If (1) , (2) the prior on is , and (3) the noises are sampled i.i.d. from , then under algorithm with parameter , we have
(10) 
We now outline the proof of Theorem 3, which builds on Russo & Van Roy (2013) and Dani et al. (2008). Let denote the “history” (i.e., all the information available) at the start of episode . Note that, from the Bayesian perspective, conditioning on , and are drawn i.i.d. from (see Russo & Van Roy, 2013). This is because, conditioning on , the posterior belief in is and, by Algorithm 2, is sampled independently from . Since is a fixed combinatorial optimization algorithm (even though it can be independently randomized), and are all fixed, conditioning on , and are also i.i.d. Furthermore, is conditionally independent of , and is conditionally independent of .
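The symmetry argument above can be checked numerically. The following is a minimal sketch, not the paper's algorithm: it assumes a hypothetical diagonal Gaussian posterior, three hypothetical action feature vectors, and a brute-force `oracle` standing in for the fixed optimization algorithm. It draws the "true" parameter and the Thompson sample independently from the same posterior, passes both through the same oracle, and compares the resulting action frequencies, as well as the averages of a function `g` that is fixed given the history.

```python
import random


def oracle(theta, actions):
    """Fixed optimization oracle: index of the action with the largest linear value."""
    return max(
        range(len(actions)),
        key=lambda i: sum(x * t for x, t in zip(actions[i], theta)),
    )


def sample_posterior(rng, mean, std):
    """Draw one parameter vector from an independent (diagonal) Gaussian posterior."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def simulate(num_draws=20000, seed=0):
    rng = random.Random(seed)
    # Hypothetical posterior given the history (values chosen for illustration only).
    mean, std = [0.5, 0.3], [1.0, 0.8]
    # Hypothetical feature vectors of three candidate actions.
    actions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
    # Any function of the action that is fixed given the history (stand-in for a UCB).
    g = [1.7, 1.2, 2.1]

    counts_star = [0] * len(actions)
    counts_ts = [0] * len(actions)
    sum_star = 0.0
    sum_ts = 0.0
    for _ in range(num_draws):
        theta_star = sample_posterior(rng, mean, std)  # "true" parameter under the posterior
        theta_t = sample_posterior(rng, mean, std)     # independent Thompson sample
        a_star = oracle(theta_star, actions)
        a_t = oracle(theta_t, actions)
        counts_star[a_star] += 1
        counts_ts[a_t] += 1
        sum_star += g[a_star]
        sum_ts += g[a_t]

    freq_star = [c / num_draws for c in counts_star]
    freq_ts = [c / num_draws for c in counts_ts]
    return freq_star, freq_ts, sum_star / num_draws, sum_ts / num_draws


if __name__ == "__main__":
    freq_star, freq_ts, mean_star, mean_ts = simulate()
    print("optimal-action frequencies:", freq_star)
    print("sampled-action frequencies:", freq_ts)
    print("average g:", mean_star, mean_ts)
```

Up to Monte Carlo error, the two frequency vectors and the two averages should agree, which is the property the proof outline relies on.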
To simplify the exposition, and , we define
(12) 
then we have and , hence we have . We also define the upper confidence bound (UCB) function as
(13) 
where is a constant to be specified. Notice that, conditioning on , is a deterministic function and are i.i.d.; hence and
(14) 
One key observation is that
(15) 
where (b) follows from the fact that and are conditionally independent, and (c) follows from . Hence . We can show that (1) if we choose
(16) 
and (2) . Thus, the bound in Theorem 3 holds. Please refer to the remainder of this section for the full proof.
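For orientation, the decomposition behind the key observation can be written in generic notation, following the posterior-sampling argument of Russo & Van Roy (2013). The symbols below are assumptions made for this sketch and may differ from the notation of the full proof: $\mathcal{H}_t$ is the history, $A^*$ and $A^t$ are the optimal and chosen solutions, $f$ is the expected weight of a solution under the true parameter, and $U_t$ is the UCB function.

```latex
\begin{align*}
\mathbb{E}\left[ f(A^*) - f(A^t) \mid \mathcal{H}_t \right]
&= \mathbb{E}\left[ f(A^*) - U_t(A^*) \mid \mathcal{H}_t \right]
 + \mathbb{E}\left[ U_t(A^t) - f(A^t) \mid \mathcal{H}_t \right] \\
&\quad + \underbrace{\mathbb{E}\left[ U_t(A^*) - U_t(A^t) \mid \mathcal{H}_t \right]}_{=\,0}
\end{align*}
```

The last term vanishes because $U_t$ is deterministic given $\mathcal{H}_t$, while $A^*$ and $A^t$ are identically distributed given $\mathcal{H}_t$. The regret is therefore controlled by how often the UCB fails to dominate $f$ at $A^*$ and by how much it overestimates $f$ at $A^t$.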