# Multinomial Logit Bandit with Linear Utility Functions

The multinomial logit bandit is a sequential subset selection problem that arises in many applications. In each round, the player selects a K-cardinality subset from N candidate items and receives a reward governed by a multinomial logit (MNL) choice model, which accounts for both item utility and the substitution property among items. The player's objective is to dynamically learn the parameters of the MNL model and maximize the cumulative reward over a finite horizon T. This problem faces the exploration-exploitation dilemma, and its combinatorial nature makes it non-trivial. In recent years, several algorithms have been developed that exploit specific characteristics of the MNL model, but all of them estimate the parameters of the MNL model separately and incur a regret no better than Õ(√(NT)), which is undesirable when the candidate set size N is large. In this paper, we consider the linear utility MNL choice model, whose item utilities are represented as linear functions of d-dimension item features, and propose an algorithm, titled LUMB, to exploit the underlying structure. It is proven that the proposed algorithm achieves Õ(dK√(T)) regret, which is free of the candidate set size. Experiments show the superiority of the proposed algorithm.


## 1 Introduction

In the traditional stochastic multi-armed bandit (MAB) problem [Bubeck and Cesa-Bianchi2012], the player selects one of $N$ items in each round and receives a reward corresponding to that item. The objective is to maximize the cumulative reward over a finite horizon of length $T$, or alternatively, to minimize the regret relative to an oracle. Typically, algorithms are designed around an appropriate exploration-exploitation tradeoff, which allows the player to identify the best item through exploration whilst not spending too much on sub-optimal ones; the family of upper confidence bound (UCB) algorithms [Auer2002, Chu et al.2011] and Thompson sampling (TS) [Thompson1933, Agrawal and Goyal2012] are representative examples. This paper studies a combinatorial variant of MAB in which, in each round, the player offers a subset of cardinality $K$ to a user and receives the reward associated with one of the items in the selected subset. (This problem is known as dynamic assortment selection in the literature [Caro and Gallien2007, Rusmevichientong et al.2010], where the selected subset of items forms an assortment.) The player faces the problem of determining which subset of items to present to users who arrive sequentially, whilst not knowing user preference. As in MAB, we need to resolve the exploration-exploitation tradeoff in this problem. However, a naive translation of this problem to MAB is prohibitive, since the number of possible $K$-cardinality subsets is exponentially large and cannot be efficiently explored within a reasonably sized time horizon. To tackle this issue, different strategies have been proposed in the literature, e.g., [Kveton et al.2015, Lagrée et al.2016, Agrawal et al.2017a].

In recent literature, the multinomial logit (MNL) choice model [Luce2005, Plackett1975], which is widely used in economics, is utilized to model user choice behavior by specifying the probability that a user selects an item given the offered set, and the above exploration-exploitation problem is referred to as the MNL-Bandit problem [Rusmevichientong et al.2010, Agrawal et al.2017a, Agrawal et al.2017b, Cheung and Simchi-Levi2017]. Unlike other combinatorial bandit problems, the MNL-Bandit problem considers the substitution property among items, which leads to a non-monotonic reward function. By exploiting specific characteristics of the MNL model, UCB-style [Agrawal et al.2017a] and TS-style [Agrawal et al.2017b] algorithms have been developed to dynamically learn the parameters of the MNL model, which are a priori unknown, achieving a regret of $\tilde{O}(\sqrt{NT})$ under a mild assumption. Also, a lower regret bound is established in [Agrawal et al.2017a], by showing that any algorithm based on the MNL choice model must incur a regret of $\Omega(\sqrt{NT/K})$. The regret thus depends on the number of candidate items $N$, making these algorithms less preferred in many large-scale applications such as online advertising.

In this paper, we use the linear utility MNL choice model (formulated in Section 3) to model user choice behavior given a set of items, rather than the traditional MNL model. Specifically, it is assumed that each item in the candidate set is described by a $d$-dimension feature vector, and that the item utilities of the MNL model can be formulated as linear functions of the item features. Based on this, the problem of estimating item utilities (i.e., parameters of the MNL model) becomes that of estimating the underlying parameters of the linear functions. Since the number of parameters is independent of the number of items, it is possible to achieve a more efficient solution when the number of items is large. By taking the UCB approach, we propose an algorithm, titled LUMB (short for Linear Utility MNL-Bandit), to dynamically learn the parameters and narrow the regret. The main contributions of this work include:

• To the best of our knowledge, this is the first work to use the linear utility MNL choice model in the sequential subset selection problem. Also, a UCB-style algorithm, LUMB, is proposed to learn the model parameters dynamically.

• An upper regret bound is established for the proposed LUMB algorithm. This regret bound is free of the candidate item set size, which means that LUMB can be applied to large item sets.

• Empirical studies demonstrate the superiority of the proposed LUMB algorithm over existing algorithms.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. Sections 3 and 4 present the problem formulation and the LUMB algorithm. Section 5 establishes the regret analysis. Section 6 summarizes the experiments, and Section 7 concludes this work with future directions.

## 2 Related Work

Classical bandit algorithms aim to find the best arm with an exploration-exploitation strategy. Auer [Auer2002] first proposes a UCB approach in the linear payoff setting. Dani et al. [Dani et al.2008] and Abbasi-Yadkori et al. [Abbasi-Yadkori et al.2011] propose improved algorithms which bound the linear parameters directly. Agrawal and Goyal [Agrawal and Goyal2013] propose a Thompson sampling approach. However, because the reward of a subset is not a linear function of the features of the items in the subset, these works cannot be directly applied to our problem.

Another class of bandit works related to ours is the combinatorial bandit, where the player selects a subset of arms and receives a collective reward in each round. Researchers study the problem mainly in two settings, the stochastic setting [Gai et al.2012, Russo and Van Roy2014, Kveton et al.2015] and the adversarial setting [Cesa-Bianchi and Lugosi2012, Audibert et al.2013]. Gai et al. [Gai et al.2012] first study the problem in the linear reward setting, and Kveton et al. [Kveton et al.2015] prove a tight regret bound. It is generalized to non-linear rewards in [Chen et al.2016, Chen et al.2013]. Wen et al. [Wen et al.2015] and Wang et al. [Wang et al.2017] propose contextual algorithms which can handle large item sets. However, these works assume that the reward is monotonic, which is not satisfied in MNL-Bandit (Section 3). In practice, as clarified in [Cheung and Simchi-Levi2017], a low-reward item may divert the attention of the user and lead to a lower subset reward.

Rusmevichientong et al. [Rusmevichientong et al.2010] solve the MNL-Bandit problem and achieve an instance-dependent upper regret bound, and Sauré and Zeevi [Sauré and Zeevi2013] extend this to a wider class of choice models. Agrawal et al. [Agrawal et al.2017a] propose a UCB approach and achieve an instance-independent upper regret bound of $\tilde{O}(\sqrt{NT})$. Agrawal et al. [Agrawal et al.2017b] propose a Thompson sampling approach with better empirical performance. Recently, some works have begun to study variants of the classic MNL-Bandit problem. Some works study the personalized MNL-Bandit problem [Kallus and Udell2016, Bernstein et al.2017, Golrezaei et al.2014]. Cheung and Simchi-Levi [Cheung and Simchi-Levi2017] study the problem with resource constraints. However, as clarified in Section 1, these works model item utilities separately, which is not feasible for large item sets.

## 3 Problem Formulation

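Suppose there is a candidate item set, $\mathcal{S} = \{1, \dots, N\}$, to offer to users. Each item $i \in \mathcal{S}$ corresponds to a reward $r_i$ and a feature vector $x_i \in \mathbb{R}^d$. Let $X = [x_1, \dots, x_N]$ be the feature matrix. In MNL-Bandit, at each time step $t$, the player selects a subset $\bar{S}_t \in C(\mathcal{S})$, where $C(\mathcal{S})$ denotes the set of subsets of $\mathcal{S}$ with cardinality at most $K$, and observes the user choice $c_t \in \bar{S}_t \cup \{0\}$, where $c_t = 0$ represents that the user chooses nothing from $\bar{S}_t$. The objective is to design a bandit policy to approximately maximize the expected cumulative reward of chosen items, i.e., $E\left(\sum_t r_{c_t}\right)$.

According to the above setting, in each time step $t$, a bandit policy is only allowed to exploit the item features $X$, the historical selected subsets $\{\bar{S}_\tau \mid \tau < t\}$, and the user choice feedback $\{c_\tau \mid \tau < t\}$.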

The user choice follows the MNL model [Luce2005, Plackett1975]. MNL assumes a substitution property among items in a selected subset. Specifically, for item $i$, the larger the utilities of the other items, the smaller the probability that item $i$ is chosen. The choice probability follows a multinomial distribution,

$$p_i(\bar{S}_t, v) = \begin{cases} \dfrac{v_i}{1 + \sum_{j \in \bar{S}_t} v_j}, & i \in \bar{S}_t \\[2ex] \dfrac{1}{1 + \sum_{j \in \bar{S}_t} v_j}, & i = 0 \\[2ex] 0, & \text{otherwise} \end{cases} \tag{1}$$

where $v_i$ is the item utility, which is a priori unknown to the player. The choice probability of item $i$ in the selected subset $\bar{S}_t$ is proportional to its utility $v_i$. Besides, it is possible that nothing is chosen, which is realized by adding a virtual item $0$ with utility $v_0 = 1$. With the MNL model, we have the expected reward under a given utility vector $v = (v_1, \dots, v_N)$,

$$R(\bar{S}_t, v) = \sum_{i \in \bar{S}_t} p_i(\bar{S}_t, v)\, r_i = \frac{\sum_{i \in \bar{S}_t} v_i r_i}{1 + \sum_{i \in \bar{S}_t} v_i}. \tag{2}$$
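As a small illustration of Eq. (1) and Eq. (2), the sketch below computes MNL choice probabilities and the expected reward of a subset. The function names and the toy utilities and rewards are assumptions made for this example, not part of the paper. It also shows numerically that adding a low-reward item to the offered subset can reduce the expected reward.

```python
from typing import Dict, Sequence


def choice_probabilities(subset: Sequence[int], v: Dict[int, float]) -> Dict[int, float]:
    """Eq. (1): MNL choice probabilities; key 0 is the no-purchase option."""
    denom = 1.0 + sum(v[j] for j in subset)  # virtual item 0 has utility 1
    probs = {0: 1.0 / denom}
    for i in subset:
        probs[i] = v[i] / denom
    return probs


def expected_reward(subset: Sequence[int], v: Dict[int, float], r: Dict[int, float]) -> float:
    """Eq. (2): expected reward of offering `subset` under utilities v."""
    probs = choice_probabilities(subset, v)
    return sum(probs[i] * r[i] for i in subset)


# Toy instance (hypothetical values): item 2 has a low reward.
v = {1: 1.0, 2: 1.0}
r = {1: 1.0, 2: 0.1}

reward_single = expected_reward([1], v, r)    # 1.0 * 1.0 / (1 + 1)
reward_pair = expected_reward([1, 2], v, r)   # (1.0 + 0.1) / (1 + 2)
print(reward_single, reward_pair)
```

Offering item 2 alongside item 1 diverts probability mass away from the high-reward item, so the pair earns less in expectation than item 1 alone.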

Note that the expected reward is non-monotonic; that is, both adding a low-reward item to the selected subset and increasing the utility of a low-reward item may lead to a lower reward. The expected cumulative reward is

$$E\Big(\sum_t r_{c_t}\Big) = \sum_t E(r_{c_t}) = \sum_t E\big(R(\bar{S}_t, v)\big). \tag{3}$$

Since direct analysis of (3) is not tractable when $v$ is unknown, we analyze the regret instead,

$$Reg(T, v) = \sum_{t=1}^{T} \Big(R(S^*, v) - E\big(R(\bar{S}_t, v)\big)\Big), \tag{4}$$

where $T$ is the length of the time horizon and $S^*$ is the optimal subset,

$$S^* = \arg\max_{S \in C(\mathcal{S})} R(S, v).$$

Naturally, the objective is to approximately minimize the expected cumulative regret, $Reg(T, v)$, with an appropriate bandit policy. In particular, after enough time steps, an appropriate solution should almost always offer the subsets with the highest reward, which implies that the cumulative regret, $Reg(T, v)$, should be sub-linear in $T$. As each item corresponds to a utility which needs to be estimated separately, the lower bound on the cumulative regret depends on the item number $N$, which is not feasible for large item sets.
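To make the regret in Eq. (4) concrete, the following sketch finds $S^*$ by brute-force enumeration and sums the per-round gaps for a given sequence of offered subsets. It is only feasible for tiny instances; the helper names, the 0-based item indexing, and the toy numbers are assumptions for illustration.

```python
from itertools import combinations


def expected_reward(subset, v, r):
    """Eq. (2) with items indexed from 0 (the no-purchase option is implicit)."""
    denom = 1.0 + sum(v[i] for i in subset)
    return sum(v[i] * r[i] for i in subset) / denom


def best_subset(v, r, K):
    """Enumerate all non-empty subsets with at most K items and return the best."""
    best, best_value = (), 0.0
    for k in range(1, K + 1):
        for subset in combinations(range(len(v)), k):
            value = expected_reward(subset, v, r)
            if value > best_value:
                best, best_value = subset, value
    return best, best_value


def cumulative_regret(offered, v, r, K):
    """Eq. (4) for a fixed (non-random) sequence of offered subsets."""
    _, optimum = best_subset(v, r, K)
    return sum(optimum - expected_reward(subset, v, r) for subset in offered)


# Hypothetical instance with N = 3 items and K = 2.
v = [1.0, 1.0, 0.5]
r = [1.0, 0.1, 0.8]
s_star, r_star = best_subset(v, r, K=2)
regret = cumulative_regret([(0, 1), (0, 2)], v, r, K=2)
print(s_star, r_star, regret)
```

Only the first offered subset contributes to the regret here, because the second one coincides with $S^*$.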

Therefore, linear item utility is introduced, where item utility is a linear function of the item feature,

$$v_i = \theta^{*\top} x_i, \tag{5}$$

where $\theta^* \in \mathbb{R}^d$ is a linear parameter vector unknown to the player. Thus, estimating item utilities becomes estimating the parameters of the utility function, which can exploit the correlation between items through their features. This makes it possible to achieve a regret bound free of the item number $N$.

## 4 Algorithm

We propose an algorithm, called Linear Utility MNL-Bandit (LUMB), which takes a UCB approach to sequentially estimate the linear utility function and approach the highest reward. LUMB first estimates the linear parameters of the utility function, then constructs the UCB of item utility and subset reward, and finally offers the subset with the highest reward UCB. Algorithm 1 gives the details of LUMB.

### 4.1 Estimation of Linear Utility Function

As the choice probability of an item is non-linear in the parameters of the utility function, it is difficult to estimate the parameters directly from user choice feedback. Instead, we split the time horizon into epochs as in [Agrawal et al.2017a]. Let $L$ be the number of epochs. In each epoch $l$, the selected subset, $S_l \in C(\mathcal{S})$, is offered repeatedly until the user chooses nothing from the offered subset. Then, we can obtain the number of times each item $i \in S_l$ is chosen,

$$\hat{v}_{i,l} = \sum_{t \in E_l} \mathbb{I}(c_t = i), \quad \text{s.t. } \mathbb{I}(c_t = i) = \begin{cases} 1, & c_t = i \\ 0, & c_t \neq i \end{cases} \tag{6}$$

where $E_l$ is the set of time steps in epoch $l$, and $c_t$ is the item chosen in time step $t$.
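The epoch mechanism can be simulated in a few lines. The sketch below, with hypothetical names and toy utilities, offers a fixed subset until the no-purchase option is drawn and counts how often each item is picked, as in Eq. (6). Averaging the counts over many epochs gives values close to the true utilities, which is the property the estimator relies on.

```python
import random


def run_epoch(subset, v, rng):
    """Offer `subset` until the no-purchase option (index 0) is chosen.

    Returns the pick counts per item (Eq. 6) and the epoch length.
    Items must be labelled with positive integers; 0 is reserved.
    """
    options = [0] + list(subset)
    weights = [1.0] + [v[i] for i in subset]  # no-purchase utility is 1
    counts = {i: 0 for i in subset}
    length = 0
    while True:
        length += 1
        choice = rng.choices(options, weights=weights, k=1)[0]
        if choice == 0:
            return counts, length
        counts[choice] += 1


rng = random.Random(0)           # fixed seed for reproducibility
v = {1: 0.5, 2: 0.8}             # hypothetical true utilities
n_epochs = 20000
totals = {1: 0, 2: 0}
for _ in range(n_epochs):
    counts, _length = run_epoch([1, 2], v, rng)
    for item, count in counts.items():
        totals[item] += count

v_hat_mean = {item: totals[item] / n_epochs for item in totals}
print(v_hat_mean)  # close to the true utilities
```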

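It can be proven that $E(\hat{v}_{i,l}) = v_i$ (Lemma 1), which means that the empirical average of $\hat{v}_{i,l}$ is almost equal to the real item utility and does not depend on the other items in the subset. Thus, the estimation of the utility function can be formulated as a linear regression fitted directly to the empirical samples of $\hat{v}_{i,l}$. Specifically,

$$\theta_l = \arg\min_{\theta} \sum_{\tau \le l} \sum_{i \in S_\tau} \left\| \theta^\top x_i - \hat{v}_{i,\tau} \right\|_2^2 + \lambda \|\theta\|_2^2,$$

where $\lambda$ is a constant regularization coefficient. Then, we can obtain the closed-form solution

$$A_l = \lambda I_d + \sum_{\tau \le l} \sum_{i \in S_\tau} x_i x_i^\top, \tag{7}$$

$$b_l = \sum_{\tau \le l} \sum_{i \in S_\tau} \hat{v}_{i,\tau}\, x_i, \tag{8}$$

$$\theta_l = A_l^{-1} b_l, \tag{9}$$

where $I_d$ is the $d$-by-$d$ identity matrix.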
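The closed form in Eq. (7) to Eq. (9) is ordinary ridge regression over all (feature, count) pairs collected so far. The following sketch uses only the Python standard library; the small Gaussian-elimination solver, the function names, and the noise-free toy data are assumptions for illustration, and a numerical library would normally be used instead.

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for c in range(col, n + 1):
                M[row][c] -= factor * M[col][c]
    y = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(M[k][c] * y[c] for c in range(k + 1, n))
        y[k] = (M[k][n] - tail) / M[k][k]
    return y


def ridge_estimate(samples, d, lam=1.0):
    """samples: (x_i, v_hat) pairs pooled over epochs. Returns (A, b, theta)."""
    A = [[lam if p == q else 0.0 for q in range(d)] for p in range(d)]  # Eq. (7)
    b = [0.0] * d                                                       # Eq. (8)
    for x, v_hat in samples:
        for p in range(d):
            b[p] += v_hat * x[p]
            for q in range(d):
                A[p][q] += x[p] * x[q]
    return A, b, solve(A, b)                                            # Eq. (9)


# Noise-free toy data (hypothetical): v_hat equals theta_star^T x exactly.
theta_star = [0.6, 0.3]
features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
samples = [(x, sum(t * xi for t, xi in zip(theta_star, x))) for x in features] * 200
A, b, theta = ridge_estimate(samples, d=2, lam=1.0)
print(theta)  # slightly shrunk towards zero by the regularizer
```

With noisy geometric counts in place of the exact values, the same code applies unchanged; only the accuracy of `theta` differs.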

### 4.2 Construction of Upper Confidence Bound

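We construct the UCB of item utility, which is proven in Lemma 2, as

$$v^{\mathrm{UCB}}_{i,l} = \theta_l^\top x_i + (\sqrt{2} + \alpha)\,\sigma_{i,l}, \quad \text{s.t. } \sigma_{i,l} = \sqrt{x_i^\top A_l^{-1} x_i}, \tag{10}$$

where $\alpha$ is a constant. Let $v^{\mathrm{UCB}}_l = (v^{\mathrm{UCB}}_{1,l}, \dots, v^{\mathrm{UCB}}_{N,l})$. Then, the UCB of the highest reward, $R(S^*, v)$, is constructed as the highest reward with $v^{\mathrm{UCB}}_l$ (Lemma 5). The corresponding subset is

$$S_{l+1} = \arg\max_{S \in C(\mathcal{S})} R(S, v^{\mathrm{UCB}}_l), \tag{11}$$

and we offer the subset $S_{l+1}$ in epoch $l+1$.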
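A minimal sketch of Eq. (10) and Eq. (11) follows, restricted to $d = 2$ so that $A_l^{-1}$ can be written explicitly. The function names and numbers are hypothetical, and the subset is chosen by enumeration rather than by the linear program discussed next, so this is only practical for very small $N$.

```python
import math
from itertools import combinations


def inverse_2x2(A):
    """Explicit inverse of a 2x2 matrix (illustration only, d = 2)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def ucb_utilities(X, theta, A, alpha):
    """Eq. (10): theta^T x_i + (sqrt(2) + alpha) * sqrt(x_i^T A^{-1} x_i)."""
    A_inv = inverse_2x2(A)
    ucb = []
    for x in X:
        A_inv_x = [A_inv[0][0] * x[0] + A_inv[0][1] * x[1],
                   A_inv[1][0] * x[0] + A_inv[1][1] * x[1]]
        sigma = math.sqrt(x[0] * A_inv_x[0] + x[1] * A_inv_x[1])
        mean = theta[0] * x[0] + theta[1] * x[1]
        ucb.append(mean + (math.sqrt(2.0) + alpha) * sigma)
    return ucb


def optimistic_subset(v_ucb, r, K):
    """Eq. (11) by enumeration over subsets with at most K items."""
    def reward(subset):
        total = sum(v_ucb[i] for i in subset)
        return sum(v_ucb[i] * r[i] for i in subset) / (1.0 + total)

    candidates = [s for k in range(1, K + 1)
                  for s in combinations(range(len(r)), k)]
    return max(candidates, key=reward)


X = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # hypothetical unit-norm features
theta = [0.6, 0.3]                          # current estimate theta_l
A = [[4.0, 0.0], [0.0, 4.0]]                # current A_l
r = [1.0, 0.1, 0.8]
v_ucb = ucb_utilities(X, theta, A, alpha=1.0)
next_subset = optimistic_subset(v_ucb, r, K=2)
print(v_ucb, next_subset)
```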

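It is hard to obtain $S_{l+1}$ by directly solving the above optimization problem. According to [Davis et al.2013], the above optimization problem can be translated into a linear program,

$$\begin{aligned} \max\ & \sum_{i=1}^{N} r_i w_i, \\ \text{s.t.}\ & w_0 + \sum_{i=1}^{N} w_i = 1, \quad \sum_{i=1}^{N} \frac{w_i}{v^{\mathrm{UCB}}_i} \le K w_0, \\ & \forall i \in \mathcal{S},\ 0 \le \frac{w_i}{v^{\mathrm{UCB}}_i} \le w_0. \end{aligned} \tag{12}$$

Then, $S_{l+1} = \{ i \in \mathcal{S} \mid w_i > 0 \}$.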
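One direction of the correspondence between subsets and the linear program can be checked by hand. The derivation below is our own sketch of why every subset $S$ with $|S| \le K$ yields a feasible point of (12) whose objective equals $R(S, v^{\mathrm{UCB}})$; the converse, that an optimal solution of the program has this form, is the part that relies on [Davis et al.2013] and is not reproduced here.

```latex
% Change of variables for a subset S with |S| <= K (sketch, assumes v_i^{UCB} > 0)
\begin{align*}
w_0 &= \frac{1}{1 + \sum_{j \in S} v^{\mathrm{UCB}}_j}, &
w_i &= \begin{cases}
  \dfrac{v^{\mathrm{UCB}}_i}{1 + \sum_{j \in S} v^{\mathrm{UCB}}_j}, & i \in S, \\[2ex]
  0, & i \notin S.
\end{cases}
\intertext{Objective:}
\sum_{i=1}^{N} r_i w_i
  &= \frac{\sum_{i \in S} r_i v^{\mathrm{UCB}}_i}{1 + \sum_{j \in S} v^{\mathrm{UCB}}_j}
   = R(S, v^{\mathrm{UCB}}).
\intertext{Constraints:}
w_0 + \sum_{i=1}^{N} w_i
  &= \frac{1 + \sum_{i \in S} v^{\mathrm{UCB}}_i}{1 + \sum_{j \in S} v^{\mathrm{UCB}}_j} = 1, \\
\sum_{i=1}^{N} \frac{w_i}{v^{\mathrm{UCB}}_i}
  &= |S|\, w_0 \le K w_0, \\
\frac{w_i}{v^{\mathrm{UCB}}_i}
  &\in \{0,\, w_0\} \quad \text{for all } i \in \mathcal{S}.
\end{align*}
```

The ratio $w_i / v^{\mathrm{UCB}}_i$ thus acts as a scaled inclusion indicator, which is why the items with $w_i > 0$ are read off as the selected subset.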

## 5 Regret Analysis

In this section, we analyze the upper regret bound of Algorithm 1 to characterize its convergence theoretically. Without loss of generality, we first state the assumption used in the following analysis.

###### Assumption 1.

$\|\theta^*\|_2 \le 1$, and $\|x_i\|_2 \le 1$, $r_i \le 1$ for all $i \in \mathcal{S}$.

According to the assumption, we have that $v_i \le 1$. Moreover, we let $\lambda = 1$. We then state the upper bound on the regret in Theorem 1, which is proven in Section 5.3. A result similar to Theorem 1 can be achieved when the quantities in Assumption 1 are bounded by finite constants.

###### Theorem 1.

Following the process in Algorithm 1, let

$$\beta = 2\log_2 T, \qquad \alpha = \beta \sqrt{2 \log\left( 2\sqrt{T} \left(1 + \frac{T}{d}\right)^{d/2} \right)},$$

then the upper bound of $Reg(T, v)$ is

$$O\left( dK\sqrt{T} (\log T)^2 \right). \tag{13}$$

The proof is separated into three steps. We first prove the correctness of the constructed UCB of utility, and show that the cumulative deviation between the UCB of utility and the real utility is sub-linear with respect to the number of epochs $L$. Then we prove that the deviation between the UCB of reward and the real reward can be bounded by the deviation of utility. Finally, the upper bound of the cumulative regret is proved by combining the above two results.

### 5.1 Analysis of Utility

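We first prove the distribution of $\hat{v}_{i,l}$.

###### Lemma 1.

With the definition in Eq. (6), for every epoch $l$ and item $i \in S_l$, $\hat{v}_{i,l}$ follows a geometric distribution, that is

$$P(\hat{v}_{i,l} = \beta) = \frac{1}{1 + v_i} \left( \frac{v_i}{1 + v_i} \right)^{\beta}, \quad \forall \beta \ge 0, \tag{14}$$

$$E(\hat{v}_{i,l}) = v_i. \tag{15}$$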
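The statement of Lemma 1 can be sanity-checked numerically. The sketch below (names and the utility value are assumptions for illustration) sums the probability mass function in Eq. (14) to confirm that it is normalized, that its mean equals $v_i$ as in Eq. (15), and that the tail probability decays geometrically.

```python
def pmf(k: int, v: float) -> float:
    """Eq. (14): P(v_hat = k) for an item with utility v."""
    p = v / (1.0 + v)
    return (1.0 - p) * p ** k   # note 1 - p = 1 / (1 + v)


v = 0.8                          # hypothetical utility
p = v / (1.0 + v)
support = range(2000)            # truncation; the remaining mass is negligible

total_mass = sum(pmf(k, v) for k in support)
mean = sum(k * pmf(k, v) for k in support)
tail_from_5 = sum(pmf(k, v) for k in range(5, 2000))   # P(v_hat >= 5)

print(total_mass, mean, tail_from_5, p ** 5)
```

The tail identity $P(\hat{v}_{i,l} \ge k) = p_i^{k}$ with $p_i = v_i / (1 + v_i)$ is what allows the counts to be confined to a small interval with high probability.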

The proof is in Appendix A.1. According to the above lemma, the deviation between $\hat{v}_{i,l}$ and the real utility $v_i$ is unbounded, which makes the proof of the utility UCB difficult. Fortunately, the probability decays exponentially as $\hat{v}_{i,l}$ increases, so we can bound $\hat{v}_{i,l}$ within a relatively small interval with high probability. Then, we can prove the utility UCB as below.

###### Lemma 2.

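With the definition of $v^{\mathrm{UCB}}_{i,l}$ in Eq. (10) and the definition of $\hat{v}_{i,l}$ in Eq. (6), if $\hat{v}_{i,\tau} \le \beta$ for all $\tau \le l$ and $i \in S_\tau$, then for every item $i$,

$$0 \le v^{\mathrm{UCB}}_{i,l} - v_i \le 2(\sqrt{2} + \alpha)\,\sigma_{i,l}, \tag{16}$$

with probability at least

$$1 - \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right).$$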

Below is a brief proof, and detailed proof is in Appendix A.2.

###### Proof.

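Let $\Delta_{i,l} = |\theta_l^\top x_i - v_i|$; we just need to prove

$$\Delta_{i,l} \le (\sqrt{2} + \alpha)\,\sigma_{i,l}.$$

According to Lemma 1, when $\hat{v}_{i,l} \le \beta$,

$$E(\hat{v}_{i,l} - v_i) = -\frac{(1 + \beta)\, p_i^{\beta+1}}{1 - p_i^{\beta+1}}, \quad \text{s.t. } p_i = \frac{v_i}{1 + v_i}.$$

Note that $E(\hat{v}_{i,l} - v_i)$ does not depend on $l$. Let $\epsilon_i = E(\hat{v}_{i,l} - v_i)$. Then

$$\Delta_{i,l} \le \Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j (\hat{v}_{j,\tau} - v_j - \epsilon_j) \Big| + \Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j \epsilon_j - x_i^\top A_l^{-1} \theta^* \Big|.$$

We bound the two parts separately. Let $u_j = \hat{v}_{j,\tau} - v_j - \epsilon_j$; for any $\gamma \in \mathbb{R}$, it is easy to prove that

$$E(\exp(\gamma u_j)) \le \exp\left( \frac{\gamma^2 \beta^2}{2} \right),$$

then, with Lemma 9 in [Abbasi-Yadkori et al.2011], we can prove that with probability at least

$$1 - \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right),$$

we have that

$$\Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j (\hat{v}_{j,\tau} - v_j - \epsilon_j) \Big| \le \alpha\,\sigma_{i,l}. \tag{17}$$

Moreover, with the Cauchy–Schwarz inequality, the other part is bounded as

$$\Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j \epsilon_j - x_i^\top A_l^{-1} \theta^* \Big| \le \sqrt{2}\,\sigma_{i,l}. \tag{18}$$

The lemma can be proven by combining Eq. (17) and Eq. (18). ∎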

Moreover, we prove the bound of cumulative utility deviation.

###### Lemma 3.

Following the process in Algorithm 1, the cumulative deviation between the utility UCB and the real utility can be bounded as

$$\sum_{l=1}^{L} \sum_{i \in S_l} \sigma_{i,l} \le K \sqrt{70\, d L \log L}. \tag{19}$$

The proof is similar to the proof of Lemma 3 in [Chu et al.2011]. Because of the space limitation, the proof will be attached in a longer version. Lemma 3 shows that the bound on the cumulative deviation is sub-linear in the number of epochs, so the average deviation in each epoch vanishes after enough epochs.

### 5.2 Analysis of Reward

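We first estimate the deviation between the estimated reward and the real reward of $S_l$ with the result of Lemma 3.

###### Lemma 4.

In each epoch $l$ of Algorithm 1, if for all $i \in S_l$,

$$0 \le v^{\mathrm{UCB}}_i - v_i \le 2(\sqrt{2} + \alpha)\,\sigma_{i,l},$$

then the cumulative deviation between the estimated reward and the real reward of $S_l$ is

$$E\Big( \sum_{t \in E_l} \big( R(S_l, v^{\mathrm{UCB}}_l) - r_{c_t} \big) \Big) \le 2(\sqrt{2} + \alpha) \sum_{i \in S_l} r_i\, \sigma_{i,l}. \tag{20}$$

The proof is in Appendix A.4. This lemma means that the deviation of the subset reward is bounded by the deviation of the utilities of the items in the subset.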

###### Lemma 5.

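(Lemma 4.2 in [Agrawal et al.2017a]) With the reward defined in Eq. (2), suppose there are two subsets, $\tilde{S}^{\mathrm{UCB}}$ and $\tilde{S}$, such that

$$\tilde{S}^{\mathrm{UCB}} = \arg\max_{S \in C(\mathcal{S})} R(S, v^{\mathrm{UCB}}), \qquad \tilde{S} = \arg\max_{S \in C(\mathcal{S})} R(S, v).$$

If $v^{\mathrm{UCB}}_i \ge v_i$ for all $i$, then the rewards satisfy the inequality

$$R(\tilde{S}, v) \le R(\tilde{S}, v^{\mathrm{UCB}}) \le R(\tilde{S}^{\mathrm{UCB}}, v^{\mathrm{UCB}}). \tag{21}$$

A brief proof is in Appendix A.4. Lemma 5 shows that the estimated reward of the subset $\tilde{S}^{\mathrm{UCB}}$ is an upper bound of the real highest reward. We can then easily bound the regret in each epoch with Lemma 4 and Lemma 5.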

###### Lemma 6.

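(Regret bound in a single epoch) In each epoch $l$ of Algorithm 1, if

$$0 \le v^{\mathrm{UCB}}_i - v_i \le 2(\sqrt{2} + \alpha)\,\sigma_{i,l}, \tag{22}$$

then the regret of epoch $l$ is

$$E\Big( \sum_{t \in E_l} \big( R(S^*, v) - r_{c_t} \big) \Big) \le 2(\sqrt{2} + \alpha) \sum_{i \in S_l} r_i\, \sigma_{i,l}. \tag{23}$$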

### 5.3 Upper Bound of Regret

We first prove a more general version of Theorem 1 with parameters $\alpha$ and $\beta$.

###### Lemma 7.

Following the process in Algorithm 1, if , the cumulative regret, defined in Eq. (4), can be bounded,

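$$Reg(T, v) \le 2TK \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right) + \frac{T^2 K}{2^{\beta+2}} + 2(\sqrt{2} + \alpha) K \sqrt{70\, d T \log T}. \tag{24}$$

###### Proof.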

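We bound the regret separately in two situations: the item utility inequality in Lemma 2 is satisfied, or it is not. We model the event that the item utility inequality in Lemma 2 is not satisfied as

$$A_l = U_l \cup B_l, \tag{25}$$

$$\text{s.t. } U_l = \big\{ v^{\mathrm{UCB}}_{i,l} > v_i + 2(\sqrt{2} + \alpha)\,\sigma_{i,l} \ \text{or}\ v^{\mathrm{UCB}}_{i,l} < v_i,\ \exists i \big\}, \quad B_l = \big\{ \hat{v}_{i,\tau} > \beta,\ \exists \tau \le l,\ i \in S_\tau \big\}.$$

Then, it is easy to bound the probability of $A_l$,

$$P(A_l) \le 2K \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right) + \frac{lK}{2^{\beta+2}}. \tag{26}$$

Let $\tilde{A}_l$ be the complement of $A_l$. Then, the regret can be split into two parts, that is,

$$Reg(T, v) = E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} \mathbb{I}(A_{l-1}) \big( R(S^*, v) - r_{c_t} \big) \Big) + E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} \mathbb{I}(\tilde{A}_{l-1}) \big( R(S^*, v) - r_{c_t} \big) \Big),$$

where $\mathbb{I}(A)$ is an indicator random variable whose value is $1$ when $A$ happens, and $0$ otherwise. We first consider the situation in which $A_{l-1}$ happens. According to Eq. (26),

$$\begin{aligned} E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} \mathbb{I}(A_{l-1}) \big( R(S^*, v) - r_{c_t} \big) \Big) &\le E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} \mathbb{I}(A_{l-1}) \Big) \\ &= E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} E\big(\mathbb{I}(A_{l-1})\big) \Big) \\ &= E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} P(A_{l-1}) \Big) \\ &\le E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} 1 \Big) \left( 2K \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right) + \frac{TK}{2^{\beta+2}} \right) \\ &= 2TK \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right) + \frac{T^2 K}{2^{\beta+2}}. \end{aligned} \tag{27}$$

Then, we consider the situation in which $A_{l-1}$ does not happen. According to Lemma 6 and Lemma 3,

$$\begin{aligned} E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} \mathbb{I}(\tilde{A}_{l-1}) \big( R(S^*, v) - r_{c_t} \big) \Big) &\le E\Big( \sum_{l=1}^{L} \sum_{t \in E_l} \big( R(S^*, v) - r_{c_t} \big) \Big) \\ &\le 2(\sqrt{2} + \alpha)\, E\Big( \sum_{l=1}^{L} \sum_{i \in S_l} r_i\, \sigma_{i,l} \Big) \\ &\le 2(\sqrt{2} + \alpha)\, E\Big( \sum_{l=1}^{L} \sum_{i \in S_l} \sigma_{i,l} \Big) \\ &\le 2(\sqrt{2} + \alpha) K \sqrt{70\, d T \log T}. \end{aligned} \tag{28}$$

Finally, we can finish the proof by adding Eq. (27) and Eq. (28). ∎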

With Lemma 7, Theorem 1 can be proven by setting

$$\beta = 2\log_2 T, \qquad \alpha = \beta \sqrt{2 \log\left( 2\sqrt{T} \left(1 + \frac{T}{d}\right)^{d/2} \right)}.$$
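For readers who want to see the substitution spelled out, the following is our own worked sketch (not taken from the paper) of how these choices of $\alpha$ and $\beta$ turn the three terms of Eq. (24) into the rate of Eq. (13).

```latex
\begin{align*}
% First term: the exponential cancels the (1 + T/d)^{d/2} factor
\exp\!\left(-\frac{\alpha^2}{2\beta^2}\right)
  &= \frac{1}{2\sqrt{T}\,(1 + T/d)^{d/2}}
  &&\Longrightarrow&
2TK\left(1+\frac{T}{d}\right)^{d/2}\exp\!\left(-\frac{\alpha^2}{2\beta^2}\right)
  &= K\sqrt{T}, \\
% Second term: 2^beta = T^2
2^{\beta} &= 2^{2\log_2 T} = T^2
  &&\Longrightarrow&
\frac{T^2 K}{2^{\beta+2}} &= \frac{K}{4}, \\
% Third term: alpha grows like sqrt(d) (log T)^{3/2}
\alpha &= 2\log_2 T \sqrt{2\log 2 + \log T + d\log\!\left(1+\frac{T}{d}\right)}
  = O\!\left(\sqrt{d}\,(\log T)^{3/2}\right)
  &&\Longrightarrow&
2(\sqrt{2}+\alpha)K\sqrt{70\,dT\log T}
  &= O\!\left(dK\sqrt{T}(\log T)^2\right).
\end{align*}
```

The third term dominates, which gives the bound stated in Theorem 1.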

As our method follows a framework similar to that of MNL-Bandit [Agrawal et al.2017a], whose lower bound is $\Omega(\sqrt{NT/K})$, our regret bound matches the lower bound up to logarithmic terms with respect to $T$.

## 6 Experiments

In this section, we evaluate LUMB on synthetic data and compare it with three existing alternative algorithms. We demonstrate the superiority of LUMB in terms of cumulative regret. Moreover, we show that the estimated linear parameters of the utility function and the estimated utilities asymptotically converge to their real values.

### 6.1 Setting

The synthetic data is generated randomly. Rewards are sampled from interval uniformly. The $d$-dimension parameter vector of the utility function, $\theta^*$, is sampled from uniformly, then is normalized to . The $d$-dimension feature vectors are sampled from uniformly. To follow the experiment setting in [Agrawal et al.2017b], feature vectors are normalized so that item utilities distribute uniformly on . All experiments are performed on ten randomly generated data sets, and the results shown below are averages over these data sets.

Three alternative algorithms are compared:

• UCB-MNL [Agrawal et al.2017a]: This algorithm proposes a UCB approach with MNL choice model.

• Thompson-Beta [Agrawal et al.2017b]: This algorithm proposes a Thompson sampling approach with MNL choice model.

• Thompson-Corr [Agrawal et al.2017b]: This algorithm is a variant of Thompson-Beta which samples item utilities with correlated sampling.

### 6.2 Results

We conduct empirical experiments on synthetic data sets with ,. Subset size is set to . Figure 1 shows the cumulative regret on the synthetic data sets, which is normalized by best reward, i.e., . Note that the axis of cumulative regret is plotted on a logarithmic scale to make the trend of the LUMB regret over the time horizon easier to observe. The cumulative regrets increase more slowly as the time step increases. Besides, we can see that the cumulative regret of LUMB is much smaller than that of the alternative algorithms throughout the time horizon.

We evaluate the convergence of the utility vector on synthetic data sets with ,. Figure 2 shows the deviation between the estimated mean utility and the real utility, which is normalized by the norm of the real utility, i.e., . The deviation of LUMB decreases quickly in the early stage and reaches a smaller value than that of the alternative algorithms.

Moreover, we evaluate the deviation of the estimated linear parameter vector in Figure 2. The deviation is normalized by the norm of the real parameters, i.e., . Note that the deviation also decreases quickly in the early stage and asymptotically converges to zero. This demonstrates that LUMB can correctly estimate the linear parameters.

## 7 Conclusion

We study the sequential subset selection problem with the linear utility MNL choice model, and propose a UCB-style algorithm, LUMB. Also, an upper regret bound, $\tilde{O}(dK\sqrt{T})$, is established, which is free of the candidate item number. Experiments show the performance of LUMB. In future work, we are interested in extending the idea to other choice models such as nested MNL.

## References

• [Abbasi-Yadkori et al.2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320, 2011.
• [Agrawal and Goyal2012] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In COLT, pages 39–1, 2012.
• [Agrawal and Goyal2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, pages 127–135, 2013.
• [Agrawal et al.2017a] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Mnl-bandit: A dynamic learning approach to assortment selection. arXiv preprint arXiv:1706.03880, 2017.
• [Agrawal et al.2017b] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Thompson sampling for the mnl-bandit. arXiv preprint arXiv:1706.00977, 2017.
• [Audibert et al.2013] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
• [Auer2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3(Nov):397–422, 2002.
• [Bernstein et al.2017] Fernando Bernstein, Sajad Modaresi, and Denis Sauré. A dynamic clustering approach to data-driven assortment personalization. 2017.
• [Bubeck and Cesa-Bianchi2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR, abs/1204.5721, 2012.
• [Caro and Gallien2007] Felipe Caro and Jérémie Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53(2):276–292, 2007.
• [Cesa-Bianchi and Lugosi2012] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
• [Chen et al.2013] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, pages 151–159, 2013.
• [Chen et al.2016] Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In NIPS, pages 1659–1667, 2016.
• [Cheung and Simchi-Levi2017] Wang Chi Cheung and David Simchi-Levi. Assortment optimization under unknown multinomial logit choice models. arXiv preprint arXiv:1704.00108, 2017.
• [Chu et al.2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In IJCAI, pages 208–214, 2011.
• [Dani et al.2008] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
• [Davis et al.2013] James Davis, Guillermo Gallego, and Huseyin Topaloglu. Assortment planning under the multinomial logit model with totally unimodular constraint structures. Technical Report, Cornell University, pages 335–357, 2013.
• [Gai et al.2012] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. TON, 20(5):1466–1478, 2012.
• [Golrezaei et al.2014] Negin Golrezaei, Hamid Nazerzadeh, and Paat Rusmevichientong. Real-time optimization of personalized assortments. Management Science, 60(6):1532–1551, 2014.
• [Kallus and Udell2016] Nathan Kallus and Madeleine Udell. Dynamic assortment personalization in high dimensions. arXiv preprint arXiv:1610.05604, 2016.
• [Kveton et al.2015] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.
• [Lagrée et al.2016] Paul Lagrée, Claire Vernade, and Olivier Cappe. Multiple-play bandits in the position-based model. In NIPS, pages 1597–1605. 2016.
• [Luce2005] Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005.
• [Plackett1975] Robin Plackett. The analysis of permutations. Applied Statistics, pages 193–202, 1975.
• [Rusmevichientong et al.2010] Paat Rusmevichientong, Zuo-Jun Max Shen, and David Shmoys. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research, 58(6):1666–1680, 2010.
• [Russo and Van Roy2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
• [Sauré and Zeevi2013] Denis Sauré and Assaf Zeevi. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387–404, 2013.
• [Thompson1933] William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
• [Wang et al.2017] Yingfei Wang, Hua Ouyang, Chu Wang, Jianhui Chen, Tsvetan Asamov, and Yi Chang. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In AAAI, pages 2746–2753, 2017.
• [Wen et al.2015] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In ICML, pages 1113–1122, 2015.

## Appendix A Proof of Lemmas

### A.1 Proof of Lemma 1

###### Proof.

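Let $V = \sum_{j \in S_l} v_j$. Then

$$\begin{aligned} P(\hat{v}_{i,l} = \beta) &= \sum_{n \ge \beta} P(|E_l| = n + 1)\, P\big(\hat{v}_{i,l} = \beta \,\big|\, |E_l| = n + 1\big) \\ &= \sum_{n \ge \beta} \frac{1}{1 + V} \left( \frac{V}{1 + V} \right)^{n} \binom{n}{\beta} \left( \frac{v_i}{V} \right)^{\beta} \left( \frac{V - v_i}{V} \right)^{n - \beta}. \end{aligned}$$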

Let , we have .

With , we have

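$$P(\hat{v}_{i,l} = \beta) = \frac{v_i}{1 + v_i}\, P(\hat{v}_{i,l} = \beta - 1).$$

Finally,

$$P(\hat{v}_{i,l} = \beta) = \frac{1}{1 + v_i} \left( \frac{v_i}{1 + v_i} \right)^{\beta}, \qquad E(\hat{v}_{i,l}) = v_i.$$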

### A.2 Proof of Lemma 2

###### Lemma 8.

Suppose a random variable $X$ satisfies $E(X) = 0$ and $a \le X \le b$, then

$$E(e^{\lambda X}) \le e^{\lambda^2 (b - a)^2 / 2}. \tag{29}$$

###### Proof.

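Let $X'$ be a random variable with the same distribution as $X$ but independent of $X$. According to Jensen's inequality:

$$E_X(e^{\lambda X}) = E_X\big(e^{\lambda (X - E_{X'}(X'))}\big) \le E_{X, X'}\big(e^{\lambda (X - X')}\big).$$

As $X$ and $X'$ have the same distribution, we have $E_{X,X'}\big(e^{\lambda (X - X')}\big) = E_{X,X'}\big(e^{\lambda (X' - X)}\big)$.

$$\begin{aligned} E_{X, X'}\big(e^{\lambda (X - X')}\big) &= \frac{1}{2} E_{X, X'}\big(e^{\lambda (X - X')} + e^{\lambda (X' - X)}\big) \\ &= \frac{1}{2} E_{X, X'}\left( \sum_{k=0}^{\infty} \frac{\lambda^k (X - X')^k}{k!} + \sum_{k=0}^{\infty} \frac{\lambda^k (X' - X)^k}{k!} \right) \\ &= E_{X, X'}\left( \sum_{k=0}^{\infty} \frac{\lambda^{2k} (X - X')^{2k}}{(2k)!} \right) \\ &\le E_{X, X'}\left( \sum_{k=0}^{\infty} \frac{\lambda^{2k} (X - X')^{2k}}{2^k\, k!} \right) \\ &\le E_{X, X'}\big(e^{\lambda^2 (X - X')^2 / 2}\big) \\ &\le e^{\lambda^2 (b - a)^2 / 2}. \end{aligned} \tag{30}$$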

We give the proof of Lemma 2 below.

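###### Proof.

$$v^{\mathrm{UCB}}_{i,l} - v_i = (\theta_l^\top x_i - v_i) + (\sqrt{2} + \alpha)\,\sigma_{i,l}. \tag{31}$$

We just need to prove the bound of $|\theta_l^\top x_i - v_i|$.

According to Lemma 1, when $\hat{v}_{i,l} \le \beta$,

$$E(\hat{v}_{i,l} - v_i) = -\frac{(1 + \beta)\, p_i^{\beta+1}}{1 - p_i^{\beta+1}}, \quad \text{s.t. } p_i = \frac{v_i}{1 + v_i}. \tag{32}$$

According to Assumption 1, we have

$$v_i = \theta^{*\top} x_i \le 1,$$

then, with $\beta = 2\log_2 T$ as in Theorem 1,

$$|E(\hat{v}_{i,l} - v_i)| \le \frac{1 + \beta}{2^{\beta}} \le \frac{1 + 2\log_2 T}{T^2}. \tag{33}$$

Note that $E(\hat{v}_{i,l} - v_i)$ does not depend on $l$ and is very small when $T$ is large. Let $\epsilon_i = E(\hat{v}_{i,l} - v_i)$ and $\Delta_{i,l} = |\theta_l^\top x_i - v_i|$.

$$\begin{aligned} \Delta_{i,l} &= \big| x_i^\top A_l^{-1} b_l - x_i^\top A_l^{-1} A_l \theta^* \big| \\ &= \Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j \hat{v}_{j,\tau} - x_i^\top A_l^{-1} \Big( I_d + \sum_{\tau \le l} \sum_{j \in S_\tau} x_j x_j^\top \Big) \theta^* \Big| \\ &\le \Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j (\hat{v}_{j,\tau} - v_j - \epsilon_j) \Big| + \Big| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j \epsilon_j - x_i^\top A_l^{-1} \theta^* \Big|. \end{aligned}$$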

We prove the bound of two parts respectively.

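Let $u_j = \hat{v}_{j,\tau} - v_j - \epsilon_j$. With Lemma 9 in [Abbasi-Yadkori et al.2011] and Lemma 8, we can prove that with probability at least $1 - \left(1 + \frac{T}{d}\right)^{d/2} \exp\left( -\frac{\alpha^2}{2\beta^2} \right)$,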

 |x⊤iA−1l∑τ≤l∑i∈Sτxi(^vi,τ−vi−ϵi)| ≤|x⊤iA