# Efficient Learning in Large-Scale Combinatorial Semi-Bandits

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.


## 1 Introduction

Combinatorial optimization is a mature field (Papadimitriou & Steiglitz, 1998), which has countless practical applications. One of the most studied problems in combinatorial optimization is maximization of a modular function subject to combinatorial constraints. Many important problems, such as minimum spanning tree (MST), shortest path, and maximum-weight bipartite matching, can be viewed as instances of this problem.

In practice, the optimized modular function is often unknown and needs to be learned while repeatedly solving the problem. This class of learning problems was recently formulated as a combinatorial bandit/semi-bandit, depending on the feedback model (Audibert et al., 2014). Since then, many combinatorial bandit/semi-bandit algorithms have been proposed: for the stochastic setting (Gai et al., 2012; Chen et al., 2013; Russo & Van Roy, 2014; Kveton et al., 2015b); for the adversarial setting (Cesa-Bianchi & Lugosi, 2012; Audibert et al., 2014; Neu & Bartók, 2013); and for subclasses of combinatorial problems, such as matroid and polymatroid bandits (Kveton et al., 2014a, b, c), submodular maximization (Wen et al., 2013; Gabillon et al., 2013), and cascading bandits (Kveton et al., 2015a). Many regret bounds have been established for the combinatorial semi-bandit algorithms. To achieve an $O(\sqrt{n})$ dependence on time $n$, all of the regret bounds are $\Omega(\sqrt{L})$, where $L$ is the number of items. The dependence on $L$ is intrinsic because the algorithms estimate the weight of each item separately, and matching lower bounds have been established (Section 3.2).

However, in many real-world problems, the number of items $L$ is intractably large. For instance, online advertising on a mainstream commercial website can be viewed as a bipartite matching problem with millions of users and products; routing in the Internet can be formulated as a shortest path problem with billions of edges. Thus, learning algorithms with $\Omega(\sqrt{L})$ regret are impractical in such problems. On the other hand, in many problems, items have features and their weights are similar when the features are similar. In movie recommendation, for instance, the expected ratings of movies that are close in the latent space are also similar. In this work, we show how to leverage this structure to learn to make good decisions more efficiently. More specifically, we assume a linear generalization across the items: conditioned on the features of an item, the expected weight of that item can be estimated using a linear model. Our goal is to develop more efficient learning algorithms for combinatorial semi-bandits with linear generalization.

It is relatively easy to extend many linear bandit algorithms, such as Thompson sampling (Thompson, 1933; Agrawal & Goyal, 2012; Russo & Van Roy, 2013) and Linear UCB (LinUCB; see Auer (2002); Dani et al. (2008); Abbasi-Yadkori et al. (2011)), to combinatorial semi-bandits with linear generalization. In this paper, we propose two learning algorithms, Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), based on Thompson sampling and LinUCB. Both CombLinTS and CombLinUCB are computationally efficient, as long as the offline version of the combinatorial problem can be solved efficiently. The first major contribution of the paper is that we establish a Bayes regret bound on CombLinTS and a regret bound on CombLinUCB, under reasonable assumptions. Both bounds are $L$-independent and sublinear in time. The second major contribution of the paper is that we evaluate CombLinTS on a variety of problems with thousands of items, two of which are based on real-world datasets. We only evaluate CombLinTS since recent literature (Chapelle & Li, 2011) suggests that Thompson sampling algorithms usually outperform UCB-like algorithms in practice. Our experimental results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines. It is worth mentioning that our derived $L$-independent regret bounds also hold in cases with $L = \infty$. Moreover, as we will discuss in Section 7, our proposed algorithms and their analyses can be easily extended to contextual combinatorial semi-bandits.

Finally, we briefly review some relevant papers. Gabillon et al. (2014) and Yue & Guestrin (2011) focus on submodular maximization with linear generalization. Our paper differs from these two papers in the following two aspects: (1) our paper allows general combinatorial constraints while they do not; (2) our paper focuses on maximization of modular functions while they focus on submodular maximization.

## 2 Combinatorial Optimization

We focus on a class of combinatorial optimization problems that aim to find a maximum-weight set from a given family of sets. Specifically, one such combinatorial optimization problem can be represented as a triple $(E, \mathcal{A}, w)$, where (1) $E = \{1, \ldots, L\}$ is a set of $L$ items, called the ground set, (2) $\mathcal{A} \subseteq \{A \subseteq E : |A| \le K\}$ is a family of subsets of $E$ with up to $K$ items, where $K \le L$, and (3) $w : E \to \mathbb{R}$ is a weight function that assigns each item in the ground set a real number. The total weight of all items in a set $A \subseteq E$ is defined as:

$$f(A, w) = \sum_{e \in A} w(e), \tag{1}$$

which is a linear functional of $w$ and a modular function in $A$. A set $A^{\mathrm{opt}}$ is a maximum-weight set in $\mathcal{A}$ if:

$$A^{\mathrm{opt}} \in \arg\max_{A \in \mathcal{A}} f(A, w) = \arg\max_{A \in \mathcal{A}} \sum_{e \in A} w(e). \tag{2}$$

Many classical combinatorial optimization problems, such as finding an MST, bipartite matching, the shortest path problem and the traveling salesman problem (TSP), have form (2). Though some of these problems can be solved efficiently (e.g. bipartite matching), others (e.g. TSP) are known to be NP-hard. However, for many such NP-hard problems, there exist computationally efficient approximation algorithms and/or randomized algorithms that achieve near-optimal solutions with high probability. Similarly to Chen et al. (2013), in this paper, we allow the agent to use any approximation / randomized algorithm to solve (2), and denote its solution as $A^* = \mathrm{ORACLE}(E, \mathcal{A}, w)$. To distinguish it from a learning algorithm, we refer to a combinatorial optimization algorithm as an oracle in this paper.
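To make the notion of an oracle concrete, the following is a minimal sketch of two exact oracles for Equation (2). The function names (`total_weight`, `brute_force_oracle`, `top_k_oracle`) and the toy weights are illustrative assumptions, not part of the paper; the brute-force version is only viable for tiny families of sets.

```python
from itertools import combinations


def total_weight(A, w):
    """Equation (1): sum of the weights w[e] over the items e in A."""
    return sum(w[e] for e in A)


def brute_force_oracle(feasible_sets, w):
    """Exact oracle for Equation (2) by enumerating every feasible set."""
    return max(feasible_sets, key=lambda A: total_weight(A, w))


def top_k_oracle(w, K):
    """Exact oracle when the family is 'all subsets with exactly K items'."""
    ranked = sorted(range(len(w)), key=lambda e: w[e], reverse=True)
    return frozenset(ranked[:K])


# Hypothetical toy instance: 4 items, feasible sets are all pairs.
w = [0.2, 0.9, 0.5, 0.7]
family = [frozenset(c) for c in combinations(range(len(w)), 2)]
best = brute_force_oracle(family, w)
```

For this family the two oracles agree; for structured families such as paths or matchings, a problem-specific solver would take the place of `top_k_oracle`.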

## 3 Combinatorial Semi-Bandits with Linear Generalization

Many real-world problems are combinatorial in nature. In recommender systems, for instance, the user is typically recommended $K$ items out of $L$. The value of an item, such as the expected rating of a movie, is never known perfectly and has to be refined while repeatedly recommending to the pool of users. Recommender problems are known to be highly structured. In particular, it is well known that the user-item matrix is typically low-rank (Koren et al., 2009) and that the value of an item can be written as a linear function of its position in the latent space. In this work, we propose learning algorithms for combinatorial optimization that leverage this structure. In particular, we assume that the weight of each item is a linear function of its features and then we learn the parameters of this model, jointly for all items.

### 3.1 Combinatorial Semi-Bandits

We formalize our learning problem as a combinatorial semi-bandit. A combinatorial semi-bandit is a triple $(E, \mathcal{A}, P)$, where $E$ and $\mathcal{A}$ are defined in Section 2 and $P$ is a probability distribution over the weights $w \in \mathbb{R}^L$ of the items in the ground set $E$. We assume that the weights $w$ are drawn i.i.d. from $P$. The mean weight is denoted by $\bar{w} = \mathbb{E}[w]$. Each item $e$ is associated with an arm and we assume that multiple arms can be pulled. A subset of arms $A \subseteq E$ can be pulled if and only if $A \in \mathcal{A}$. The return of pulling arms $A$ is $f(A, w)$ (Equation (1)), the sum of the weights of all items in $A$. After the arms $A$ are pulled, we observe the individual return of each arm, $\{w(e) : e \in A\}$. This feedback model is known as semi-bandit (Audibert et al., 2014).

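We assume that the combinatorial structure $(E, \mathcal{A})$ is known and the distribution $P$ is unknown. We would like to stress that we do not make any structural assumptions on $P$. The optimal solution to our problem is a maximum-weight set in expectation:

$$A^{\mathrm{opt}} \in \arg\max_{A \in \mathcal{A}} \mathbb{E}_w[f(A, w)] = \arg\max_{A \in \mathcal{A}} \sum_{e \in A} \bar{w}(e). \tag{3}$$

This objective is equivalent to the one in Equation (2) with the weight function set to the mean weight $\bar{w}$.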

Our learning problem is episodic. In each episode $t$, the learning agent adaptively chooses $A^t \in \mathcal{A}$ based on its observations of the weights up to episode $t$, gains $f(A^t, w_t)$, and observes the weights of all chosen items in episode $t$, $\{(e, w_t(e)) : e \in A^t\}$. The learning agent interacts with the combinatorial semi-bandit $n$ times and its goal is to maximize the expected cumulative return in $n$ episodes, $\mathbb{E}\left[\sum_{t=1}^{n} f(A^t, w_t)\right]$, where the expectation is over (1) the random weights $w_t$, (2) possible randomization in the learning algorithm, and (3) $\bar{w}$ if it is randomly generated. Notice that the choice of $A^t$ impacts both the return and the observations in episode $t$. So we need to trade off exploration and exploitation, similarly to other bandit problems.

### 3.2 Linear Generalization

As we have discussed in Section 1, many provably efficient algorithms have been developed for various combinatorial semi-bandits of form (3) (Chen et al., 2013; Gai et al., 2012; Russo & Van Roy, 2014; Kveton et al., 2014b, 2015b). However, since there are $L$ parameters to learn and these algorithms do not consider generalization across items, the derived upper bounds on the expected cumulative regret and/or the Bayes cumulative regret of these algorithms are at least $O(\sqrt{L})$. Furthermore, Audibert et al. (2014) has derived an $\Omega(\sqrt{LKn})$ lower bound on adversarial combinatorial semi-bandits, while Kveton et al. (2014b, 2015b) have derived asymptotic $\Omega(L \log(n) / \Delta)$ gap-dependent lower bounds on stochastic combinatorial semi-bandits, where $\Delta$ is an appropriate “gap”.

However, in many modern combinatorial semi-bandit problems, $L$ tends to be enormous. Thus, an $O(\sqrt{L})$ regret is unacceptably large in these problems. On the other hand, in many practical problems, there exists a generalization model based on which the weight of one item can be (approximately) inferred from the weights of other items. By exploiting such generalization models, an $o(\sqrt{L})$ or even an $L$-independent cumulative regret might be achieved.

In this paper, we assume that there is a (possibly imperfect) linear generalization model across the items. Specifically, we assume that the agent knows a generalization matrix $\Phi \in \mathbb{R}^{L \times d}$ s.t. $\bar{w}$ either lies in or is “close” to the subspace $\mathrm{span}[\Phi]$. We use $\phi_e$ to denote the transpose of the $e$-th row of $\Phi$, and refer to it as the feature vector of item $e$. Without loss of generality, we assume that $\mathrm{rank}[\Phi] = d$.

Similar to some existing literature (Wen & Van Roy, 2013; Van Roy & Wen, 2014), we distinguish between the coherent learning cases, in which $\bar{w} \in \mathrm{span}[\Phi]$, and the agnostic learning cases, in which $\bar{w} \notin \mathrm{span}[\Phi]$. Like existing literature on linear bandits (Dani et al., 2008; Abbasi-Yadkori et al., 2011), the analysis in this paper focuses on coherent learning cases. However, we would like to emphasize that both of our proposed algorithms, CombLinTS and CombLinUCB, are also applicable to the agnostic learning cases. As is demonstrated in Section 6, CombLinTS performs well in the agnostic learning cases.

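Finally, we define $\theta^* = \arg\min_{\theta} \|\bar{w} - \Phi\theta\|_2$. Since $\mathrm{rank}[\Phi] = d$, $\theta^*$ is uniquely defined. Moreover, in coherent learning cases, we have $\bar{w} = \Phi\theta^*$.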
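As a short worked equation, and assuming that the coefficient vector (written $\theta^*$ here) is defined as the least-squares fit of the mean weight vector $\bar{w}$ onto the columns of a full-column-rank generalization matrix $\Phi$, the uniqueness claim follows from the familiar normal equations:

```latex
\theta^*
  = \arg\min_{\theta \in \mathbb{R}^d} \left\| \bar{w} - \Phi\theta \right\|_2
  = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top \bar{w},
\qquad
\text{since } \operatorname{rank}[\Phi] = d \text{ makes } \Phi^\top \Phi \text{ invertible.}
```

In a coherent case the residual $\bar{w} - \Phi\theta^*$ is zero, while in an agnostic case it is the part of $\bar{w}$ that no linear model over these features can capture.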

### 3.3 Performance Metrics

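Let $A^* = \mathrm{ORACLE}(E, \mathcal{A}, \bar{w})$ denote the oracle's solution under the mean weights. In this paper, we measure the performance loss of a learning algorithm with respect to $A^*$. Recall that the learning algorithm chooses $A^t$ in episode $t$; we define $R_t = f(A^*, w_t) - f(A^t, w_t)$ as the realized regret in episode $t$. If the expected weight $\bar{w}$ is fixed but unknown, we define the expected cumulative regret of the learning algorithm in $n$ episodes as

$$R(n) = \sum_{t=1}^{n} \mathbb{E}\left[R_t \mid \bar{w}\right], \tag{4}$$

where the expectation is over random weights and possible randomization in the learning algorithm. If necessary, we denote $R(n)$ as $R(n; \bar{w})$ to emphasize the dependence on $\bar{w}$. On the other hand, if $\bar{w}$ is randomly generated or the agent has a prior belief in $\bar{w}$, then from Russo & Van Roy (2013), the Bayes cumulative regret of the learning algorithm in $n$ episodes is defined as

$$R_{\mathrm{Bayes}}(n) = \mathbb{E}_{\bar{w}}\left[R(n; \bar{w})\right] = \sum_{t=1}^{n} \mathbb{E}\left[R_t\right], \tag{5}$$

where the expectation is also over $\bar{w}$. That is, $R_{\mathrm{Bayes}}(n)$ is a weighted average of $R(n; \bar{w})$ under the prior on $\bar{w}$.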
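The realized regret is simple to compute once the benchmark set and the chosen sets are known. The following is a minimal sketch; the function names and the toy numbers are illustrative assumptions. It takes the realized regret in an episode to be the return of the benchmark set minus the return of the chosen set, both evaluated on that episode's realized weights, so individual terms can be negative even though the expectation is not.

```python
def realized_regret(A_star, A_t, w_t):
    """Return of the benchmark set minus return of the chosen set, same weights."""
    return sum(w_t[e] for e in A_star) - sum(w_t[e] for e in A_t)


def cumulative_regret(A_star, choices, weights):
    """Sum of realized regrets over all episodes (one sample path)."""
    return sum(realized_regret(A_star, A, w) for A, w in zip(choices, weights))


# Hypothetical two-episode run over three items.
A_star = {0, 1}
choices = [{1, 2}, {0, 1}]
weights = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.2]]
total = cumulative_regret(A_star, choices, weights)
```

Averaging `total` over many independent runs would estimate the expected cumulative regret; additionally resampling the mean weights in each run would estimate the Bayes cumulative regret.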

## 4 Learning Algorithms

In this section, we propose two learning algorithms for combinatorial semi-bandits: Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), which are respectively motivated by Thompson sampling and LinUCB. Both algorithms maintain a mean vector $\bar{\theta}_t$ and a covariance matrix $\Sigma_t$, and use Kalman filtering to update $\bar{\theta}_t$ and $\Sigma_t$. They differ in how to choose $A^t$ (i.e. how to explore) in each episode $t$: CombLinTS chooses $A^t$ based on a randomly sampled coefficient vector $\theta_t$, while CombLinUCB chooses $A^t$ based on the optimism in the face of uncertainty (OFU) principle.

### 4.1 Combinatorial Linear Thompson Sampling

The pseudocode of CombLinTS is given in Algorithm 2, where $(E, \mathcal{A})$ is the combinatorial structure, $\Phi$ is the generalization matrix, ORACLE is a combinatorial optimization algorithm, and $\lambda$ and $\sigma$ are two algorithm parameters controlling the learning rate. Specifically, $\lambda$ is an inverse-regularization parameter and a smaller $\lambda$ makes the covariance matrix $\Sigma_t$ closer to $0$. Thus, a too small $\lambda$ will lead to insufficient exploration and significantly reduce the performance of CombLinTS. On the other hand, $\sigma$ controls the decrease rate of the covariance matrix $\Sigma_t$. In particular, a large $\sigma$ will lead to slow learning, while a too small $\sigma$ will make the algorithm quickly converge to some sub-optimal coefficient vector.

In each episode $t$, Algorithm 2 consists of three steps. First, it randomly samples a coefficient vector $\theta_t$ from the Gaussian distribution $N(\bar{\theta}_t, \Sigma_t)$. Second, it computes $A^t$ based on $\theta_t$ and the pre-specified oracle. Finally, it updates the mean vector and the covariance matrix based on Kalman filtering (Algorithm 1).

It is worth pointing out that if (1) $\bar{w} = \Phi\theta^*$, (2) the prior on $\theta^*$ is $N(0, \lambda^2 I)$, and (3) for all $(t, e)$, the noise $\eta_t(e) = w_t(e) - \bar{w}(e)$ is independently sampled from $N(0, \sigma^2)$, then in each episode $t$, the CombLinTS algorithm samples $\theta_t$ from the posterior distribution of $\theta^*$. We henceforth refer to a case satisfying conditions (1)-(3) as a coherent Gaussian case. Obviously, the CombLinTS algorithm can be applied to more general cases, even to cases with no prior and/or agnostic learning cases.
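The three steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's Algorithms 1 and 2, which are not reproduced in this text: it assumes a zero-mean Gaussian prior with covariance `lam**2 * I`, Gaussian noise with standard deviation `sigma`, and the standard rank-one Gaussian posterior (Kalman) update applied once per observed item. All function names, the toy features, and `true_theta` are hypothetical.

```python
import math
import random


def kalman_update(theta, Sigma, phi, y, sigma):
    """Rank-one Gaussian posterior update for one observation y = <phi, theta*> + noise."""
    d = len(theta)
    S_phi = [sum(Sigma[i][j] * phi[j] for j in range(d)) for i in range(d)]
    denom = sum(phi[i] * S_phi[i] for i in range(d)) + sigma ** 2
    err = y - sum(phi[i] * theta[i] for i in range(d))
    new_theta = [theta[i] + S_phi[i] * err / denom for i in range(d)]
    new_Sigma = [[Sigma[i][j] - S_phi[i] * S_phi[j] / denom for j in range(d)]
                 for i in range(d)]
    return new_theta, new_Sigma


def sample_gaussian(theta, Sigma, rng):
    """Draw from N(theta, Sigma) via a Cholesky factor, clamping tiny pivots."""
    d = len(theta)
    C = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = Sigma[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / C[j][j]
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [theta[i] + sum(C[i][k] * z[k] for k in range(i + 1)) for i in range(d)]


def comb_lin_ts(features, oracle, draw_weight, n, lam, sigma, seed=0):
    """Sample a coefficient vector, call the oracle, then update on observed items."""
    rng = random.Random(seed)
    d = len(features[0])
    theta = [0.0] * d
    Sigma = [[lam ** 2 if i == j else 0.0 for j in range(d)] for i in range(d)]
    history = []
    for _ in range(n):
        theta_t = sample_gaussian(theta, Sigma, rng)
        w_hat = [sum(p * th for p, th in zip(phi, theta_t)) for phi in features]
        A_t = oracle(w_hat)
        for e in A_t:  # semi-bandit feedback: one observation per chosen item
            y = draw_weight(e, rng)
            theta, Sigma = kalman_update(theta, Sigma, features[e], y, sigma)
        history.append(A_t)
    return theta, Sigma, history


# Hypothetical toy problem: 4 items, 2 features, choose the 2 best items.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
true_theta = [1.0, -0.5]


def draw_weight(e, rng):
    mean = sum(p * th for p, th in zip(features[e], true_theta))
    return mean + rng.gauss(0.0, 0.1)


def top2(w_hat):
    return sorted(range(len(w_hat)), key=lambda e: w_hat[e], reverse=True)[:2]


theta_hat, Sigma_hat, history = comb_lin_ts(
    features, top2, draw_weight, n=200, lam=1.0, sigma=0.1)
```

Because each update touches only a `d`-dimensional mean and a `d x d` covariance, the per-episode learning cost in this sketch does not depend on how many items exist beyond scoring them for the oracle.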

### 4.2 Combinatorial Linear UCB

The pseudocode of CombLinUCB is given in Algorithm 3, where $(E, \mathcal{A})$, $\Phi$, and ORACLE are defined the same as in Algorithm 2, and $\lambda$, $\sigma$, and $c$ are three algorithm parameters. Similarly, $\lambda$ is an inverse-regularization parameter, $\sigma$ controls the decrease rate of the covariance matrix, and $c$ controls the degree of optimism (exploration). Specifically, if $c$ is too small, the algorithm might converge to some sub-optimal coefficient vector due to insufficient exploration; on the other hand, a too large $c$ will lead to excessive exploration and slow learning.

In each episode $t$, Algorithm 3 also consists of three steps. First, for each $e \in E$, it computes an upper confidence bound (UCB) $\hat{w}_t(e)$. Second, it computes $A^t$ based on $\hat{w}_t$ and the pre-specified oracle. Finally, it updates $\bar{\theta}_t$ and $\Sigma_t$ based on Kalman filtering (Algorithm 1).
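The first step can be illustrated with a small sketch. It assumes the per-item index has the same form as the summand of the UCB function used in the analysis of Section 5.1, namely the estimated mean plus `c` times the posterior standard deviation of the item's weight; the function name and the toy numbers are hypothetical.

```python
import math


def ucb_weights(features, theta, Sigma, c):
    """Per-item index: <phi_e, theta> + c * sqrt(phi_e^T Sigma phi_e)."""
    out = []
    for phi in features:
        d = len(phi)
        mean = sum(phi[i] * theta[i] for i in range(d))
        var = sum(phi[i] * Sigma[i][j] * phi[j] for i in range(d) for j in range(d))
        out.append(mean + c * math.sqrt(max(var, 0.0)))
    return out


# Hypothetical state: the second direction is far more uncertain than the first.
features = [[1.0, 0.0], [0.0, 1.0]]
theta_bar = [0.5, 0.2]
Sigma = [[0.04, 0.0], [0.0, 1.0]]
indices = ucb_weights(features, theta_bar, Sigma, c=1.0)
```

Here the second item has the lower estimated mean but the higher index, so an oracle fed with `indices` would prefer it; that is the optimism-driven exploration that `c` controls.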

## 5 Regret Bounds

In this section, we present a Bayes regret bound on CombLinTS, and a regret bound on CombLinUCB. We will also briefly discuss how these bounds are derived, as well as their tightness. The detailed proofs are left to the appendices. Without loss of generality, throughout this section, we assume that $\|\phi_e\|_2 \le 1$, $\forall e \in E$.

### 5.1 Bayes Regret Bound on CombLinTS

We have the following upper bound on $R_{\mathrm{Bayes}}(n)$ when CombLinTS is applied to a coherent Gaussian case with the right parameter.

###### Theorem 1.

If (1) $\bar{w} = \Phi\theta^*$, (2) the prior on $\theta^*$ is $N(0, \lambda^2 I)$, (3) the noises are i.i.d. sampled from $N(0, \sigma^2)$, and (4) $\lambda \ge \sigma$, then under algorithm CombLinTS with parameter $(\Phi, \lambda, \sigma)$, we have

$$R_{\mathrm{Bayes}}(n) \le \tilde{O}\left(K\lambda\sqrt{d\,n\,\min\{\ln(L),\, d\}}\right). \tag{6}$$

Notice that conditions (1)-(3) ensure it is a coherent Gaussian case, and condition (4) almost always holds (condition (4) is not essential; please refer to Theorem 3 in Appendix A for a Bayes regret bound without condition (4)). The $\tilde{O}$ notation hides the logarithmic factors. We also note that Equation (6) is a minimum of two bounds. The first bound is $L$-dependent, but it is only $O(\sqrt{\ln(L)})$; on the other hand, the second bound is $L$-independent, but is $\tilde{O}(d)$ instead of $\tilde{O}(\sqrt{d})$. We would like to emphasize that Theorem 1 holds even if ORACLE is an approximation/randomized algorithm.

We now outline the proof of Theorem 1, which is motivated by Russo & Van Roy (2013) and Dani et al. (2008). Let $H_t$ denote the “history” (i.e. all the available information) by the start of episode $t$. Note that from the Bayesian perspective, conditioning on $H_t$, $\theta^*$ and $\theta_t$ are i.i.d. drawn from $N(\bar{\theta}_t, \Sigma_t)$ (Russo & Van Roy, 2013). This is because, conditioning on $H_t$, the posterior belief in $\theta^*$ is $N(\bar{\theta}_t, \Sigma_t)$ and, based on Algorithm 2, $\theta_t$ is independently sampled from $N(\bar{\theta}_t, \Sigma_t)$. Since ORACLE is a fixed combinatorial optimization algorithm (even though it can be independently randomized), and $E$, $\mathcal{A}$, and $\Phi$ are all fixed, conditioning on $H_t$, $A^*$ and $A^t$ are also i.i.d. Furthermore, $A^*$ is conditionally independent of $\theta_t$, and $A^t$ is conditionally independent of $\theta^*$.

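To simplify the exposition, $\forall \theta \in \mathbb{R}^d$ and $\forall A \subseteq E$, we define

$$g(A, \theta) = \sum_{e \in A} \langle \phi_e, \theta \rangle,$$

where $\langle \cdot, \cdot \rangle$ is an alternative notation for inner product. Thus we have $\mathbb{E}[R_t \mid H_t] = \mathbb{E}[g(A^*, \theta^*) - g(A^t, \theta^*) \mid H_t]$. We also define a UCB function $U_t$ as

$$U_t(A) = \sum_{e \in A} \left[ \langle \phi_e, \bar{\theta}_t \rangle + c\sqrt{\phi_e^T \Sigma_t \phi_e} \right],$$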

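where $c > 0$ is a constant to be specified. Notice that conditioning on $H_t$, $U_t$ is a deterministic function and $A^*$, $A^t$ are i.i.d., so $\mathbb{E}[U_t(A^t) - U_t(A^*) \mid H_t] = 0$ and

$$\mathbb{E}[R_t \mid H_t] = \mathbb{E}\left[g(A^*, \theta^*) - U_t(A^*) \mid H_t\right] + \mathbb{E}\left[U_t(A^t) - g(A^t, \theta^*) \mid H_t\right]. \tag{7}$$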

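Theorem 1 follows by respectively bounding the two terms on the right-hand side of Equation (7). Two key observations are (1) if $c$ is chosen appropriately, then

$$\mathbb{E}\left[g(A^*, \theta^*) - U_t(A^*) \mid H_t\right] = O(1),$$

and (2) since the posterior mean of $\theta^*$ given $H_t$ is $\bar{\theta}_t$ and $A^t$ is conditionally independent of $\theta^*$,

$$\mathbb{E}\left[U_t(A^t) - g(A^t, \theta^*) \mid H_t\right] = c\,\mathbb{E}\left[\sum_{e \in A^t} \sqrt{\phi_e^T \Sigma_t \phi_e} \,\Big|\, H_t\right],$$

and we have a worst-case bound (see Lemma 4 in Appendix A) on $\sum_{t=1}^{n} \sum_{e \in A^t} \sqrt{\phi_e^T \Sigma_t \phi_e}$. Please refer to Appendix A for the detailed proof of Theorem 1.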

Finally, we briefly discuss the tightness of our bound. Without loss of generality, we assume that . For the special case when $\Phi = I$ (i.e. no generalization), Russo & Van Roy (2014) provides an upper bound on $R_{\mathrm{Bayes}}(n)$ when Thompson sampling is applied, and Audibert et al. (2014) provides a lower bound (Audibert et al. (2014) focuses on the adversarial setting but the lower bound is stochastic, so it is a reasonable lower bound to compare with). Since $L = d$ when $\Phi = I$, the above results indicate that for general $\Phi$, the best upper bound one can hope is . Hence, our bound is at most larger. It is well-known that the factor is due to linear generalization (Dani et al., 2008; Abbasi-Yadkori et al., 2011), and as is discussed in the appendix (see Remark 1), the extra factor is also due to linear generalization. They might be intrinsic, but we leave the final word and tightness analysis to future work.

### 5.2 Regret Bound on CombLinUCB

Under the assumptions that (1) the support of $P$ is a subset of $[0, 1]^L$ (i.e. $w_t(e) \in [0, 1]$ for all $t$ and all $e \in E$), and (2) the oracle ORACLE exactly solves the offline optimization problem (if ORACLE is an approximation algorithm, a variant of Theorem 2 can be proved; see Appendix D), we have the following upper bound on $R(n)$ when CombLinUCB is applied to coherent learning cases:

###### Theorem 2.

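For any $\lambda, \sigma > 0$, any $\delta \in (0, 1)$, and any $c$ satisfying

$$c \ge \frac{1}{\sigma}\sqrt{d\ln\left(1 + \frac{nK\lambda^2}{d\sigma^2}\right) + 2\ln\left(\frac{1}{\delta}\right)} + \frac{\|\theta^*\|_2}{\lambda}, \tag{8}$$

if $\bar{w} = \Phi\theta^*$ and the above two assumptions hold, then under algorithm CombLinUCB with parameter $(\Phi, \lambda, \sigma, c)$, we have

$$R(n) \le 2cK\lambda\sqrt{\frac{d\,n\,\ln\left(1 + \frac{nK\lambda^2}{d\sigma^2}\right)}{\ln\left(1 + \frac{\lambda^2}{\sigma^2}\right)}} + nK\delta.$$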

Generally speaking, the proof for Theorem 2 proceeds as follows. We first construct a confidence set of $\theta^*$ based on the “self normalized bound” developed in Abbasi-Yadkori et al. (2011). Then we decompose the regret over the high-probability “good” event $\mathcal{G}$ and the low-probability “bad” event $\bar{\mathcal{G}}$, where $\bar{\mathcal{G}}$ is the complement of $\mathcal{G}$. Finally, we bound the term associated with the event $\mathcal{G}$ based on the same worst-case bound on $\sum_{t=1}^{n} \sum_{e \in A^t} \sqrt{\phi_e^T \Sigma_t \phi_e}$ used in the analysis for CombLinTS (see Lemma 4 in Appendix A), and bound the term associated with the event $\bar{\mathcal{G}}$ based on a naive bound. Please refer to Appendix B for the detailed proof of Theorem 2.
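To get a feel for the magnitudes in Theorem 2, the two expressions can be evaluated numerically. The sketch below transcribes the right-hand side of Inequality (8) and the regret bound as reconstructed from the flattened formulas in this text; the function names and the parameter values are illustrative assumptions.

```python
import math


def c_lower_bound(n, K, d, lam, sigma, delta, theta_norm):
    """Smallest c allowed by Inequality (8)."""
    log_term = d * math.log(1 + n * K * lam ** 2 / (d * sigma ** 2))
    return math.sqrt(log_term + 2 * math.log(1 / delta)) / sigma + theta_norm / lam


def regret_bound(n, K, d, lam, sigma, delta, c):
    """Right-hand side of the regret bound in Theorem 2."""
    num = d * n * math.log(1 + n * K * lam ** 2 / (d * sigma ** 2))
    den = math.log(1 + lam ** 2 / sigma ** 2)
    return 2 * c * K * lam * math.sqrt(num / den) + n * K * delta


# Hypothetical setting; note that the number of items never enters the formulas.
n, K, d = 1_000_000, 10, 5
delta = 1.0 / (n * K)
c = c_lower_bound(n, K, d, lam=1.0, sigma=1.0, delta=delta, theta_norm=1.0)
bound = regret_bound(n, K, d, lam=1.0, sigma=1.0, delta=delta, c=c)
```

With these values the bound is well below the trivial bound `n * K`, and quadrupling `n` roughly doubles it, which is the square-root growth in time up to logarithmic factors.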

Notice that if we choose , , and as the lower bound specified in Inequality (8), then the regret bound derived in Theorem 2 is also . Compared with the lower bound derived in Audibert et al. (2014), this bound is at most larger. Similarly, the extra and factors are also due to linear generalization.

Finally, we would like to clarify that the assumption that the support of $P$ is bounded is not essential. By slightly modifying the analysis, we can achieve a similar high-probability bound on the realized cumulative regret as long as $P$ is sub-Gaussian. We also want to point out that the $L$-independent bounds derived in both Theorem 1 and Theorem 2 will still hold even if $L = \infty$.

## 6 Experiments

In this section, we evaluate CombLinTS on three problems. The first problem is synthetic, but the last two problems are constructed based on real-world datasets. As we have discussed in Section 1, we only evaluate CombLinTS since in practice Thompson sampling algorithms usually outperform the UCB-like algorithms. Our experiment results in the synthetic problem demonstrate that CombLinTS is both scalable and robust to the choice of algorithm parameters. They also suggest the Bayes regret bound derived in Theorem 1 is likely to be tight. On the other hand, our experiment results in the last two problems show the value of linear generalization in real-world settings: with domain-specific but imperfect linear generalization (i.e. agnostic learning), CombLinTS can significantly outperform state-of-the-art learning algorithms that do not exploit linear generalization, which serve as baselines in these two problems.

In all three problems, the oracle ORACLE exactly solves the offline combinatorial optimization problem. Moreover, in the two real-world problems, we demonstrate the experiment results using a new performance metric, the expected per-step return in $n$ episodes, which is defined as

$$\frac{1}{n}\,\mathbb{E}_{w_1, \ldots, w_n}\left[\sum_{t=1}^{n} f(A^t, w_t) \,\Big|\, \bar{w}\right]. \tag{9}$$

Obviously, it is the expected cumulative return in $n$ episodes divided by $n$. We report this metric rather than $R(n)$ since it is more illustrative.

### 6.1 Longest Path

We first evaluate CombLinTS on a synthetic problem. Specifically, we experiment with a stochastic longest path problem on an $(m \times m)$ square grid (that is, each side has $m$ edges and $m + 1$ nodes; notice that the longest path problem and the shortest path problem are mathematically equivalent). The items in the ground set $E$ are the edges in the grid, $L = 2m(m+1)$ in total. The feasible set $\mathcal{A}$ consists of all paths in the grid from the upper left corner to the bottom right corner that follow the directions of the edges. The length of these paths is $K = 2m$. In this problem, we focus on coherent Gaussian cases and randomly sample the linear generalization matrix $\Phi$ to weaken the dependence on a particular choice of $\Phi$.
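An exact oracle for this kind of grid problem can be written as a short dynamic program. The sketch below is an illustration, assuming a grid with `m` edges per side whose edges point right and down; the edge encoding and function names are hypothetical choices, not taken from the paper.

```python
def grid_edges(m):
    """All directed edges of a grid with m edges per side (m + 1 nodes per side)."""
    right = [("r", i, j) for i in range(m + 1) for j in range(m)]
    down = [("d", i, j) for i in range(m) for j in range(m + 1)]
    return right + down


def longest_path_oracle(m, w):
    """Max-weight right/down path from node (0, 0) to node (m, m); w maps edge -> weight."""
    best = {(0, 0): (0.0, [])}
    for i in range(m + 1):
        for j in range(m + 1):
            if (i, j) == (0, 0):
                continue
            cands = []
            if j > 0:  # arrive from the left via a right-pointing edge
                val, path = best[(i, j - 1)]
                e = ("r", i, j - 1)
                cands.append((val + w[e], path + [e]))
            if i > 0:  # arrive from above via a down-pointing edge
                val, path = best[(i - 1, j)]
                e = ("d", i - 1, j)
                cands.append((val + w[e], path + [e]))
            best[(i, j)] = max(cands, key=lambda c: c[0])
    return best[(m, m)][1]


# Hypothetical 2 x 2 grid with two heavy edges along the left and bottom sides.
m = 2
w = {e: 1.0 for e in grid_edges(m)}
w[("d", 0, 0)] = 5.0
w[("r", 2, 0)] = 5.0
path = longest_path_oracle(m, w)
```

The edge count `len(grid_edges(m))` equals `2 * m * (m + 1)` and every returned path has `2 * m` edges, so the ground set grows quadratically in `m` while the path length grows only linearly.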

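Our experiments are parameterized by a sextuple $(m, d, \lambda_{\mathrm{true}}, \sigma_{\mathrm{true}}, \lambda, \sigma)$, where $m$, $d$, $\lambda$, and $\sigma$ are defined before and $\lambda_{\mathrm{true}}$ and $\sigma_{\mathrm{true}}$ are respectively the true standard deviations of $\theta^*$ and the observation noises. In each round of simulation, we first construct a problem instance as follows: (1) generate $\Phi$ by sampling each component of $\Phi$ i.i.d. from a Gaussian distribution; (2) sample $\theta^*$ independently from $N(0, \lambda_{\mathrm{true}}^2 I)$ and set $\bar{w} = \Phi\theta^*$; and (3) for all $(t, e)$, the observation noise $w_t(e) - \bar{w}(e)$ is i.i.d. sampled from $N(0, \sigma_{\mathrm{true}}^2)$. Then we apply CombLinTS with parameter $(\Phi, \lambda, \sigma)$ to the constructed instance for $n$ episodes. Notice that in general $(\lambda, \sigma) \ne (\lambda_{\mathrm{true}}, \sigma_{\mathrm{true}})$. We average the experiment results over simulations to estimate the Bayes cumulative regret $R_{\mathrm{Bayes}}(n)$.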
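The instance construction described above can be sketched as follows. The sampling distribution of the generalization matrix is not recoverable from this text, so the sketch assumes standard normal entries; the function names, argument names, and sizes are likewise illustrative assumptions.

```python
import random


def make_instance(L, d, lam_true, seed=0):
    """Sample a feature matrix, a coefficient vector, and the implied mean weights."""
    rng = random.Random(seed)
    # Assumption: standard normal feature entries.
    Phi = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
    theta_star = [rng.gauss(0.0, lam_true) for _ in range(d)]
    w_bar = [sum(p * t for p, t in zip(phi, theta_star)) for phi in Phi]
    return Phi, theta_star, w_bar


def observe(w_bar, e, sigma_true, rng):
    """One noisy observation of item e's weight."""
    return w_bar[e] + rng.gauss(0.0, sigma_true)


# Hypothetical sizes: 12 items (a 2 x 2 grid has 12 edges) and 3 features.
Phi, theta_star, w_bar = make_instance(L=12, d=3, lam_true=10.0, seed=1)
sample = observe(w_bar, e=0, sigma_true=1.0, rng=random.Random(2))
```

Because the mean weights are generated as an exact linear function of the features, every instance built this way is a coherent case; running the learner with parameters that differ from `lam_true` and the true noise level is what tests robustness to misspecification.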

We start with a “default case” with , , and . Notice in this case and . We choose since in the default case, the Bayes per-episode regret of CombLinTS vanishes far before period . In the default case . In the experiments, we vary exactly one parameter at a time while keeping all the other parameters fixed to their “default values” specified above, to demonstrate the scalability and robustness of CombLinTS.

First, we study how the Bayes cumulative regret of CombLinTS scales with the size of the problem by varying , and show the result in Figure 1(a). The experiment results show that $R_{\mathrm{Bayes}}(n)$ roughly increases linearly with , which indicates that CombLinTS is scalable with respect to the problem size . We also experiment with , in this case we have , , and , which is only times of in the default case. It is worth mentioning that this result also suggests that the Bayes regret bound derived in Theorem 1 is (almost) tight in this problem. (Footnote 5: Recall that Theorem 1 requires . It can be easily extended to cases with by scaling the Bayes regret bound by . However, in this problem is not bounded since it is sampled from a Gaussian distribution. We believe that Theorem 1 can be extended to this case by exploiting the properties of the Gaussian distribution. Roughly speaking, in this problem, with high probability, .) To see it, notice that and , and hence the Bayes regret bound derived in Theorem 1 is .

Second, we study how the Bayes cumulative regret of CombLinTS scales with $d$, the dimension of the feature vectors, by varying $d$, and demonstrate the result in Figure 1(b). The experiment results indicate that $R_{\mathrm{Bayes}}(n)$ also roughly increases linearly with $d$, and hence CombLinTS is also scalable with the feature dimension $d$. This result also suggests that the bound in Theorem 1 is (almost) tight (see footnote 5).

Finally, we study the robustness of CombLinTS with respect to the algorithm parameters $\lambda$ and $\sigma$. In Figure 1(c), we vary $\lambda$ and in Figure 1(d), we vary $\sigma$. We would like to emphasize again that we only vary the algorithm parameters and fix $\lambda_{\mathrm{true}}$ and $\sigma_{\mathrm{true}}$. The experiment results show that CombLinTS is robust to the choice of algorithm parameters and performs well for a wide range of $\lambda$ and $\sigma$. However, a too small or too large $\sigma$, or a too small $\lambda$, can significantly reduce the performance of CombLinTS, as we have discussed in Section 4.1.

### 6.2 Online Advertising

In the second experiment, we evaluate CombLinTS on an advertising problem. Our objective is to identify $K$ people that are most likely to accept an advertisement offer, subject to the targeting constraint that exactly half of them are females. Specifically, the ground set $E$ includes representative people from the Adult dataset (Asuncion & Newman, 2007), which was collected in the US census. A feasible solution is any subset $A$ of $E$ with $|A| = K$ that satisfies the targeting constraint mentioned above. We assume that person $e$ accepts an advertisement offer with probability

$$\bar{w}(e) = \begin{cases} 0.15 & \text{income is at least 50k} \\ 0.05 & \text{otherwise,} \end{cases}$$

and people accept offers independently of each other. The features in the generalization matrix $\Phi$ are the age, which is binned into groups; gender; whether the person works more than hours per week; and the length of education in years. All these features can be constructed based on the Adult dataset.
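For a targeting constraint of this form, an exact oracle is easy to write because the constraint decouples across the two groups: the best feasible set is the top half of the budget among women plus the top half among men. The sketch below is illustrative; the function name, the boolean gender encoding, and the toy data are assumptions.

```python
def balanced_top_k(w, is_female, K):
    """Pick K people, exactly K/2 of them female, maximizing the total weight."""
    if K % 2 != 0:
        raise ValueError("K must be even to split the budget in half")
    people = range(len(w))
    women = sorted((e for e in people if is_female[e]), key=lambda e: w[e], reverse=True)
    men = sorted((e for e in people if not is_female[e]), key=lambda e: w[e], reverse=True)
    half = K // 2
    if len(women) < half or len(men) < half:
        raise ValueError("not enough people in one of the groups")
    return set(women[:half]) | set(men[:half])


# Hypothetical acceptance probabilities (0.15 or 0.05) and genders for six people.
w = [0.15, 0.05, 0.15, 0.05, 0.15, 0.05]
is_female = [True, True, False, False, False, True]
chosen = balanced_top_k(w, is_female, K=4)
```

A learning algorithm would call this oracle with sampled or optimistic weight estimates in place of the true probabilities `w`.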

CombLinTS is compared to three baselines. The first baseline is the optimal solution $A^*$. The second baseline is CombUCB1 (Kveton et al., 2015b). This algorithm estimates the probability that person $e$ accepts the offer independently of the other probabilities. The third baseline is CombLinTS without linear generalization, which we simply refer to as CombTS. As in CombUCB1, this algorithm estimates the probability that person $e$ accepts the offer independently of the other probabilities. The posterior of $\bar{w}(e)$ is modeled as a beta distribution.
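A Thompson sampling baseline without generalization keeps one independent beta posterior per item. The following is a minimal sketch of such per-item bookkeeping; the class name and the uniform Beta(1, 1) prior are assumptions, since the prior used in the experiments is not stated in this text.

```python
import random


class BetaItemPosterior:
    """Independent Beta posteriors over per-item acceptance probabilities."""

    def __init__(self, num_items, rng=None):
        # Assumption: uniform Beta(1, 1) prior for every item.
        self.alpha = [1.0] * num_items
        self.beta = [1.0] * num_items
        self.rng = rng or random.Random(0)

    def sample(self):
        """One posterior sample per item, to be passed to the oracle."""
        return [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]

    def update(self, e, accepted):
        """Record one Bernoulli observation for item e."""
        if accepted:
            self.alpha[e] += 1.0
        else:
            self.beta[e] += 1.0

    def mean(self, e):
        return self.alpha[e] / (self.alpha[e] + self.beta[e])


post = BetaItemPosterior(num_items=3)
for outcome in (True, True, True, False):
    post.update(0, outcome)
draw = post.sample()
```

Each item's posterior moves only when that item itself is observed, which is why such a baseline needs many observations per item before it can separate good items from bad ones.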

Our experiment results are reported in Figure 2. We observe two major trends. First, CombLinTS learns extremely quickly. In particular, its per-step return at episode is of the optimum, and its per-step return at episode k is of the optimum. These results are remarkable since the linear generalization is imperfect in this problem. Second, both CombUCB1 and CombTS perform poorly due to insufficient observations with respect to the model complexity. Specifically, in k episodes, the people in $E$ are observed k times, which implies that each person is observed only times on average. This is not enough to discriminate the people who are likely to accept the advertisement offer from those who are not.

### 6.3 Artist Recommendation

In the last experiment, we evaluate CombLinTS on a problem of recommending music artists that are most likely to be chosen by an average user of a music recommendation website. Specifically, the ground set $E$ includes artists from the Last.fm music recommendation dataset (Cantador et al., 2011). The dataset contains tagging and music artist listening information from a set of users of the Last.fm online music system. The tagging part includes the tag assignments of all artists provided by the users. For each user, the artists to whom she listened and the number of listening events are also available in the dataset.

We choose $E$ as the set of artists that were listened to by at least two users and had at least one tag assignment among the top most popular tags, and k. For each artist $e$, we construct its feature vector $\phi_e$ by setting its $j$-th component as the fraction of users who assigned tag $j$ to this artist. We assume that each artist $e$ is chosen by an average user with probability , where is the set of users that listened to artist $e$, and is the probability that user likes artist $e$. We estimate this probability based on a Naïve Bayes classifier with respect to the number of person/artist listening events.

As in Section 6.2, we also compare CombLinTS to three baselines: the optimal solution $A^*$, the CombUCB1 algorithm, and the CombTS algorithm. Our experiment results are reported in Figure 3. Similarly to Figure 2, the expected per-step return of CombLinTS approaches that of $A^*$ much faster than CombUCB1 and CombTS. Moreover, both CombUCB1 and CombTS perform poorly due to insufficient observations with respect to the model complexity: in k episodes, each artist is observed less than times on average, which is not enough to discriminate the most popular artists from less popular artists.

## 7 Conclusion

We have proposed two learning algorithms, CombLinTS and CombLinUCB, for stochastic combinatorial semi-bandits with linear generalization. The main contribution of this work is two-fold. First, we have established $L$-independent regret bounds for these two algorithms under reasonable assumptions, where $L$ is the number of items. Second, we have also evaluated CombLinTS on a variety of problems. The experiment results in the first problem show that CombLinTS is scalable and robust, and the experiment results in the other two problems demonstrate the value of exploiting linear generalization in real-world settings.

It is worth mentioning that our results can be easily extended to contextual combinatorial semi-bandits with linear generalization. In a contextual combinatorial semi-bandit, the probability distribution $P$ (and hence the expected weight $\bar{w}$) also depends on a context $x$, which either follows an exogenous stochastic process or is adaptively chosen by an adversary. Assume that each state-item pair $(x, e)$ is associated with a feature vector $\phi_{x,e}$; then, similar to Agrawal & Goyal (2013), both CombLinTS and CombLinUCB, as well as their analyses, can be generalized to contextual combinatorial semi-bandits.

We leave open several questions of interest. One is how to derive regret bounds for CombLinTS and CombLinUCB in the agnostic learning case. Another is how to extend the results to combinatorial semi-bandits with nonlinear generalization. We believe that our results can be extended to combinatorial semi-bandits with generalized linear generalization (that is, , where is a strictly monotone function), but leave this to future work.

## References

• Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312–2320, 2011.
• Agrawal & Goyal (2012) Agrawal, Shipra and Goyal, Navin. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pp. 39.1–39.26, 2012.
• Agrawal & Goyal (2013) Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 127–135, 2013.
• Asuncion & Newman (2007) Asuncion, A. and Newman, D.J. UCI machine learning repository, 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html.
• Audibert et al. (2014) Audibert, Jean-Yves, Bubeck, Sebastien, and Lugosi, Gabor. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
• Auer (2002) Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
• Cantador et al. (2011) Cantador, Iván, Brusilovsky, Peter, and Kuflik, Tsvi. Second workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the ACM conference on Recommender systems, RecSys 2011. ACM, 2011.
• Cesa-Bianchi & Lugosi (2012) Cesa-Bianchi, Nicolò and Lugosi, Gábor. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
• Chapelle & Li (2011) Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems, pp. 2249–2257, 2011.
• Chen et al. (2013) Chen, Wei, Wang, Yajun, and Yuan, Yang. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pp. 151–159, 2013.
• Dani et al. (2008) Dani, Varsha, Hayes, Thomas, and Kakade, Sham. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355–366, 2008.
• Gabillon et al. (2013) Gabillon, Victor, Kveton, Branislav, Wen, Zheng, Eriksson, Brian, and Muthukrishnan, S. Adaptive submodular maximization in bandit setting. In Advances in Neural Information Processing Systems 26, pp. 2697–2705, 2013.
• Gabillon et al. (2014) Gabillon, Victor, Kveton, Branislav, Wen, Zheng, Eriksson, Brian, and Muthukrishnan, S. Large-scale optimistic adaptive submodularity. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014.
• Gai et al. (2012) Gai, Yi, Krishnamachari, Bhaskar, and Jain, Rahul. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.
• Koren et al. (2009) Koren, Yehuda, Bell, Robert, and Volinsky, Chris. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
• Kveton et al. (2014a) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Eydgahi, Hoda. Matroid bandits: Practical large-scale combinatorial bandits. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014a.
• Kveton et al. (2014b) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Eriksson, Brian. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 420–429, 2014b.
• Kveton et al. (2014c) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Valko, Michal. Learning to act greedily: Polymatroid semi-bandits. CoRR, abs/1405.7752, 2014c.
• Kveton et al. (2015a) Kveton, Branislav, Szepesvari, Csaba, Wen, Zheng, and Ashkan, Azin. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015a.
• Kveton et al. (2015b) Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Szepesvari, Csaba. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015b.
• Neu & Bartók (2013) Neu, Gergely and Bartók, Gábor. An efficient algorithm for learning with semi-bandit feedback. In Jain, Sanjay, Munos, Rémi, Stephan, Frank, and Zeugmann, Thomas (eds.), Algorithmic Learning Theory, volume 8139 of Lecture Notes in Computer Science, pp. 234–248. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-40934-9.
• Papadimitriou & Steiglitz (1998) Papadimitriou, Christos and Steiglitz, Kenneth. Combinatorial Optimization. Dover Publications, Mineola, NY, 1998.
• Russo & Van Roy (2013) Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.
• Russo & Van Roy (2014) Russo, Daniel and Van Roy, Benjamin. An information-theoretic analysis of Thompson sampling. CoRR, abs/1403.5341, 2014.
• Thompson (1933) Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
• Van Roy & Wen (2014) Van Roy, Benjamin and Wen, Zheng. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
• Wen & Van Roy (2013) Wen, Zheng and Van Roy, Benjamin. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems 26, pp. 3021–3029, 2013.
• Wen et al. (2013) Wen, Zheng, Kveton, Branislav, Eriksson, Brian, and Bhamidipati, Sandilya. Sequential Bayesian search. In Proceedings of the 30th International Conference on Machine Learning, pp. 977–983, 2013.
• Yue & Guestrin (2011) Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pp. 2483–2491, 2011.

## Appendix A Proof for Theorem 1

To prove Theorem 1, we first prove the following theorem:

###### Theorem 3.

If (1) , (2) the prior on is , and (3) the noises are i.i.d. sampled from , then under the CombLinTS algorithm with parameter , we have

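$$
R_{\mathrm{Bayes}}(n) \le 1 + K\lambda \min\left\{ \sqrt{\ln\left(\frac{\lambda L n}{\sqrt{2\pi}}\right)},\ \sqrt{d \ln\left(\frac{2dKn\lambda}{\sqrt{2\pi}}\right)} \right\} \sqrt{\frac{2dn \ln\left(1 + \frac{nK\lambda^2}{d}\right)}{\ln\left(1 + \frac{\lambda^2}{\sigma^2}\right)}}. \tag{10}
$$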

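Notice that Theorem 1 follows immediately from Theorem 3. Specifically, if $\lambda \ge \sigma$, then $\ln\left(1 + \frac{\lambda^2}{\sigma^2}\right) \ge \ln 2$, and we have

$$
R_{\mathrm{Bayes}}(n) \le 1 + K\lambda \min\left\{ \sqrt{\ln\left(\frac{\lambda L n}{\sqrt{2\pi}}\right)},\ \sqrt{d \ln\left(\frac{2dKn\lambda}{\sqrt{2\pi}}\right)} \right\} \sqrt{2dn \log_2\left(1 + \frac{nK\lambda^2}{d}\right)} = \tilde{O}\left(K\lambda\sqrt{dn \min\{\ln(L), d\}}\right). \tag{11}
$$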

We now outline the proof of Theorem 3, which is based on Russo & Van Roy (2013) and Dani et al. (2008). Let $H_t$ denote the “history” (i.e., all the available information) by the start of episode $t$. Note that from the Bayesian perspective, conditioning on $H_t$, and are i.i.d. drawn from (see Russo & Van Roy, 2013). This is because, conditioning on $H_t$, the posterior belief in is and, based on Algorithm 2, is independently sampled from . Since is a fixed combinatorial optimization algorithm (even though it can be independently randomized), and are all fixed, conditioning on $H_t$, and are also i.i.d. Furthermore, is conditionally independent of , and is conditionally independent of .

To simplify the exposition, and , we define

$$
g(A, \theta) = \sum_{e \in A} \langle \phi_e, \theta \rangle, \tag{12}
$$

then we have and , hence we have . We also define the upper confidence bound (UCB) function as

$$
U_t(A) = \sum_{e \in A} \left[ \langle \phi_e, \bar{\theta}_t \rangle + c\sqrt{\phi_e^T \Sigma_t \phi_e} \right], \tag{13}
$$

where $c$ is a constant to be specified. Notice that conditioning on $H_t$, $U_t$ is a deterministic function and are i.i.d., then and

 (14)
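As a concrete illustration of definitions (12) and (13), the following Python sketch evaluates $g(A, \theta)$ and $U_t(A)$ for a given set of items. It is only a sketch: the function names, the representation of the feature vectors as a dictionary of lists, and the representation of $\Sigma_t$ as a list of rows are assumptions made for the example, not part of the paper.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quad_form(x, M):
    """Quadratic form x^T M x, with M given as a list of rows."""
    return dot(x, [dot(row, x) for row in M])


def g(A, phi, theta):
    """Equation (12): sum over e in A of <phi_e, theta>."""
    return sum(dot(phi[e], theta) for e in A)


def ucb(A, phi, theta_bar, Sigma, c):
    """Equation (13): posterior-mean payoff of A plus c times the
    per-item widths sqrt(phi_e^T Sigma phi_e)."""
    return sum(
        dot(phi[e], theta_bar) + c * math.sqrt(quad_form(phi[e], Sigma))
        for e in A
    )
```

By construction, `ucb(A, phi, theta_bar, Sigma, c) - g(A, phi, theta_bar)` equals `c` times the sum of the widths over the items in `A`, which is the quantity that remains on the right-hand side of (15).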

One key observation is that

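$$
\begin{aligned}
\mathbb{E}\left[ U_t(A_t) - g(A_t, \theta^*) \,\middle|\, H_t \right]
&\overset{(a)}{=} \sum_{e \in E} \mathbb{E}\left[ \mathbf{1}\{e \in A_t\} \left[ \langle \phi_e, \bar{\theta}_t - \theta^* \rangle + c\sqrt{\phi_e^T \Sigma_t \phi_e} \right] \,\middle|\, H_t \right] \\
&\overset{(b)}{=} \sum_{e \in E} \mathbb{E}\left[ \mathbf{1}\{e \in A_t\} \,\middle|\, H_t \right] \mathbb{E}\left[ \langle \phi_e, \bar{\theta}_t - \theta^* \rangle \,\middle|\, H_t \right] + c\,\mathbb{E}\left[ \sum_{e \in A_t} \sqrt{\phi_e^T \Sigma_t \phi_e} \,\middle|\, H_t \right] \\
&\overset{(c)}{=} c\,\mathbb{E}\left[ \sum_{e \in A_t} \sqrt{\phi_e^T \Sigma_t \phi_e} \,\middle|\, H_t \right],
\end{aligned}
\tag{15}
$$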

where (b) follows from the fact that and are conditionally independent, and (c) follows from . Hence . We can show that (1) if we choose

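$$
c \ge \min\left\{ \sqrt{\ln\left(\frac{\lambda L n}{\sqrt{2\pi}}\right)},\ \sqrt{d \ln\left(\frac{2dKn\lambda}{\sqrt{2\pi}}\right)} \right\}, \tag{16}
$$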

and (2) . Thus, the bound in Theorem 3 holds. Please refer to the remainder of this section for the full proof.