 # Multinomial Logit Bandit with Low Switching Cost

We study multinomial logit bandit with limited adaptivity, where the algorithms change their exploration actions as infrequently as possible when achieving almost optimal minimax regret. We propose two measures of adaptivity: the assortment switching cost and the more fine-grained item switching cost. We present an anytime algorithm (AT-DUCB) with O(N log T) assortment switches, almost matching the lower bound Ω(N log T/loglog T). In the fixed-horizon setting, our algorithm FH-DUCB incurs O(N loglog T) assortment switches, matching the asymptotic lower bound. We also present the ESUCB algorithm with item switching cost O(N log^2 T).


## 1 Introduction

The dynamic assortment selection problem with the multinomial logit (MNL) choice model, also called MNL-bandit, is a fundamental problem in online learning and operations research. In this problem we have $N$ distinct items, each of which is associated with a known reward $r_i$ and an unknown preference parameter $v_i$. In the MNL choice model, given a subset $S \subseteq [N] = \{1, \dots, N\}$, the probability that a user chooses item $i$ is given by

$$p_i(S) = \begin{cases} \dfrac{v_i}{v_0 + \sum_{j \in S} v_j} & \text{if } i \in S \cup \{0\}, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

where “$0$” stands for the case that the user does not choose any item, and $v_0$ is the associated preference parameter. As a convention (see, e.g., Agrawal et al., 2019), we assume that no-purchase is the most frequent choice, which is very natural in retailing. W.l.o.g., we assume $v_0 = 1$, and $v_i \le 1$ for all $i \in [N]$. The expected reward of the set $S$ under the preference vector $v = (v_1, \dots, v_N)$ is defined to be

$$R(S, v) = \sum_{i \in S} r_i \, p_i(S) = \sum_{i \in S} \frac{r_i v_i}{1 + \sum_{j \in S} v_j}. \tag{2}$$

For any online policy that selects a subset $S_t \subseteq [N]$ ($|S_t| \le K$, where $K$ is a predefined capacity parameter) at each time step $t$, observes the user’s choice to gradually learn the preference parameters $\{v_i\}$, and runs for a horizon of $T$ time steps, we define the regret of the policy to be

$$\mathrm{Reg}_T \overset{\mathrm{def}}{=} \sum_{t=1}^{T} \left( R(S^\star, v) - R(S_t, v) \right), \tag{3}$$

where $S^\star = \arg\max_{S \subseteq [N], |S| \le K} R(S, v)$ is the optimal assortment in hindsight. The goal is to find a policy to minimize the expected regret for all MNL-bandit instances.
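As a concrete illustration of (1) to (3), the following minimal sketch computes choice probabilities, expected rewards, and regret on a toy instance. The function names, the dictionary representation of $v$ and $r$, and the numbers are illustrative assumptions, not part of the paper; the brute-force search is only sensible for very small $N$.

```python
from itertools import combinations

def choice_probs(S, v):
    """MNL choice probabilities (1); key 0 is the no-purchase option with v_0 = 1."""
    denom = 1.0 + sum(v[i] for i in S)
    probs = {0: 1.0 / denom}
    for i in S:
        probs[i] = v[i] / denom
    return probs

def expected_reward(S, v, r):
    """Expected reward R(S, v) as in (2)."""
    denom = 1.0 + sum(v[i] for i in S)
    return sum(r[i] * v[i] for i in S) / denom

def best_assortment(v, r, K):
    """Brute-force maximizer of R(S, v) over all S with |S| <= K."""
    best, best_val = (), 0.0
    for k in range(1, K + 1):
        for S in combinations(sorted(v), k):
            val = expected_reward(S, v, r)
            if val > best_val:
                best, best_val = S, val
    return set(best), best_val

def regret(assortments, v, r, K):
    """Regret (3) of a sequence of offered assortments S_1, ..., S_T."""
    _, opt = best_assortment(v, r, K)
    return sum(opt - expected_reward(S, v, r) for S in assortments)

# Hypothetical toy instance with N = 3 items and capacity K = 2.
v = {1: 1.0, 2: 0.5, 3: 0.25}
r = {1: 0.2, 2: 0.6, 3: 1.0}
S_star, opt_value = best_assortment(v, r, K=2)
```

On this instance the popular but low-reward item 1 is left out of the optimal assortment, which shows why the best set is not simply the $K$ items with the largest $v_i$ or the largest $r_i$ taken separately.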

To motivate the definition of the MNL-bandit problem, let us consider a fast fashion retailer such as Zara or Mango. Each of its products corresponds to an item in $[N]$, and by selling the $i$-th item the retailer takes a profit of $r_i$. At each specific time in each of its shops, the retailer can only present a certain number of items (say, at most $K$) on the shelf due to space constraints. As a consequence, customers who visit the store can only pick items from the presented assortment (or just buy nothing, which corresponds to item $0$), following a choice model. A number of choice models have been proposed in the literature (see, e.g., (Train, 2009; Luce, 2012) for overviews), and the MNL model is arguably the most popular one. The retailer certainly wants to maximize its profit by identifying the best assortment to present. However, it does not know customers’ preferences for the items in $[N]$ (i.e., the preference vector $v$) in advance, and has to learn them from customers’ actual choices. More precisely, the retailer needs to develop a policy to choose at each time step $t$ an assortment $S_t$ based on the previously presented assortments and customers’ choices in the past time steps. The retailer’s expected reward in a time horizon $T$ can be expressed by $\mathbb{E}\left[\sum_{t=1}^{T} R(S_t, v)\right]$, which is typically reformulated as the regret compared with the best policy in the form of (3).

The MNL-bandit problem has attracted considerable attention in the past decade (Rusmevichientong et al., 2010; Sauré & Zeevi, 2013; Agrawal et al., 2016, 2017; Chen & Wang, 2018). However, none of these works considers an important practical issue for regret minimization: in reality it is often impossible to frequently change the assortment display. For example, in retail stores it may not be possible to change the display in the middle of the day, not to mention doing so after each purchase. We thus hope to minimize the number of assortment switches in the selling time horizon without increasing the regret by much. Another advantage of achieving a small number of assortment switches is that such algorithms are easier to parallelize, which enables us to learn users’ preferences much faster. This feature is particularly useful in applications such as online advertising, where it is easy to show the same assortment (i.e., a set of ads) on a large number of end users’ displays simultaneously.

We are interested in two kinds of switching costs under a time horizon $T$. The first is the assortment switching cost, defined as

$$\Psi^{(\mathrm{asst})}_T \overset{\mathrm{def}}{=} \sum_{t=1}^{T} \mathbb{I}[S_t \ne S_{t+1}].$$

The second is the item switching cost, defined as

$$\Psi^{(\mathrm{item})}_T \overset{\mathrm{def}}{=} \sum_{t=1}^{T} |S_t \oplus S_{t+1}|,$$

where the binary operator $\oplus$ computes the symmetric difference of the two sets. In comparison, the item switching cost is more fine-grained and puts less penalty on a switch if two neighboring assortments are “almost the same”. As a straightforward observation, we always have that

$$\Psi^{(\mathrm{asst})}_T \le \Psi^{(\mathrm{item})}_T \le \min\{2K, N\} \cdot \Psi^{(\mathrm{asst})}_T. \tag{4}$$
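The two switching costs and inequality (4) can be checked on a short sequence of assortments. The sketch below is illustrative only: the function names and the example sequence are assumptions, and the sums run over the consecutive pairs of the sequence that is passed in.

```python
def assortment_switches(seq):
    """Number of consecutive pairs (S_t, S_{t+1}) that differ."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def item_switches(seq):
    """Total size of the symmetric differences between consecutive assortments."""
    return sum(len(a ^ b) for a, b in zip(seq, seq[1:]))

# Hypothetical sequence with capacity K = 2 over N = 5 items.
K, N = 2, 5
seq = [{1, 2}, {1, 2}, {1, 3}, {4, 5}, {4, 5}]
asst_cost = assortment_switches(seq)   # switches at {1,2}->{1,3} and {1,3}->{4,5}
item_cost = item_switches(seq)         # |{2,3}| + |{1,3,4,5}|
```

The switch from $\{1,2\}$ to $\{1,3\}$ costs one assortment switch but only two item switches, whereas the switch to $\{4,5\}$ costs the maximum of $2K = 4$ item switches; this is the distinction that the item switching cost is designed to capture.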

#### Our results.

In this paper we obtain the following results for MNL-bandit with low switching cost. By default all $\log$’s are of base $2$.

We first introduce an algorithm, AT-DUCB, that achieves almost optimal regret (up to a logarithmic factor) and incurs an assortment switching cost of $O(N \log T)$; this algorithm is anytime, i.e., it does not need to know the time horizon $T$ in advance. We then show that the AT-DUCB algorithm achieves almost optimal assortment switching cost. In particular, we prove that every anytime algorithm that achieves almost optimal regret must incur an assortment switching cost of at least $\Omega(N \log T / \log\log T)$. These results are presented in Section 2.

When the time horizon is known beforehand, we obtain an algorithm, FH-DUCB, that achieves almost optimal regret (up to a logarithmic factor) and incurs an assortment switching cost of $O(N \log\log T)$. We also prove the optimality of this switching cost by establishing a matching lower bound. See Section 3.

For item switches, while the trivial application of (4) leads to $O(N^2 \log T)$ and $O(N^2 \log\log T)$ item switching cost bounds for AT-DUCB and FH-DUCB respectively, in Section 4, we design a new algorithm, ESUCB, to achieve an item switching cost of $O(N \log^2 T)$. In Appendix F, we show that a more careful modification to the algorithm further improves the item switching cost.

We make two interesting observations from the results above: (1) there is a separation between the assortment switching complexities when knowing the time horizon and when not; in other words, the time horizon is useful for achieving a smaller assortment switching cost; (2) the item switching cost is only at most a logarithmic factor higher than the assortment switching cost.

#### Technical contributions.

We combine the epoch-based offering algorithm for MNL-bandits (Agrawal et al., 2019) and a natural delayed update policy in the design of AT-DUCB. Although a similar delayed update rule has been recently analyzed for multi-armed bandits and Q-learning (Bai et al., 2019), and such a result does not seem surprising, we present it in the paper as a warm-up to help the readers get familiar with a few algorithmic techniques commonly used for the MNL-bandit problem.

Our first main technical contribution comes from the design of the FH-DUCB algorithm, where we invent a novel delayed update policy that uses the horizon information to improve the switching cost from $O(N \log T)$ to $O(N \log\log T)$. We note that for the ordinary multi-armed bandit problem, recent works (Gao et al., 2019) and (Simchi-Levi & Xu, 2019) managed to show a similar switching cost with known horizon. However, their update rules do not have to utilize the learned parameters for the arms, and a straightforward conversion of such update rules to the MNL-bandit problem does not produce the desired guarantees. In contrast, our update rule, formally described in (6), carefully exploits the structure of the MNL-bandits and uses the information of the partially learned preference parameters (more specifically, $\hat v_{i,\tau_i}$ in (6)) to adaptively decide when to switch to a different assortment.

Our second main technical contribution is the ESUCB algorithm for the low item switching cost. The technical challenge here stems from the fact that the low item switching cost is a much stronger requirement than the low assortment switching cost: simple lazy updates with the doubling trick and the straightforward analysis only show that the item switching cost is at most $\min\{2K, N\}$ times the assortment switching cost (see (4)), leading to a total item switching cost of $O(N^2 \log T)$. To remove this extra factor, we propose the idea of decoupling the learning for the optimal revenue and the assortment, so that the offering of the assortment is decided via optimizing a new objective function based on the (usually) fixed revenue estimate. Since the revenue estimates are fixed, the offered assortments enjoy improved stability, and the item switching cost can be upper bounded by careful analysis.

We remark that the item switching cost is a particularly interesting goal that arises in online learning problems when the actions are sets of elements, which is very different from traditional MAB and linear bandits. Thanks to our novel technical ingredients, we are able to bring the item switching cost down to almost the same order as the assortment switching cost. We hope our results will inspire future study of the switching costs in both settings for other online learning problems with set actions.

#### Related work.

MNL-bandit was first studied in (Rusmevichientong et al., 2010) and (Sauré & Zeevi, 2013), where the authors took the “explore-then-commit” approach, and proposed algorithms whose regret guarantees hold under the assumption that the gap between the best and second-to-the-best assortments is known. (Agrawal et al., 2016) removed this assumption using a UCB-type algorithm. An almost tight regret lower bound was later given by (Chen & Wang, 2018). (Agrawal et al., 2017) proposed an algorithm using Thompson Sampling, which achieves a regret bound comparable to the UCB-type algorithms while demonstrating better numerical performance.

Learning with low policy switches (also called learning in the batched model or limited adaptivity) has recently been studied in reinforcement learning for several other problems, including stochastic multi-armed bandits (Perchet et al., 2015; Jun et al., 2016; Agarwal et al., 2017; Gao et al., 2019; Esfandiari et al., 2019; Simchi-Levi & Xu, 2019), Q-learning (Bai et al., 2019), and online learning (Cesa-Bianchi et al., 2013). This research direction is motivated by the fact that in many practical settings, the change of learning policy is very costly. For example, in clinical trials, every treatment policy switch would trigger a separate approval process. In crowdsourcing, it takes time for the crowd to answer questions, and thus a small number of rounds of interactions with the crowd is desirable. The performance of the learning would be much better if the data is processed in batches and during each batch the learning policy is fixed.

## 2 Warm-up: An anytime algorithm with O(N log T) assortment switches

As a warm-up, we begin with a simple anytime algorithm using at most $O(N \log T)$ assortment switches. Our algorithm combines the epoch-based offering framework introduced in (Agrawal et al., 2016) and a deferred update policy. We will first briefly explain the epoch-based offering procedure, and then present and analyze our algorithm.

#### The epoch-based offering.

In the epoch-based offering framework, whenever we are to offer an assortment $S$, instead of offering it for only one time period, we keep offering $S$ until a no-purchase decision (item $0$) is observed, and refer to all the consecutive time periods involved in this procedure as an epoch. The detailed offering procedure, Exploration($S$), maintains the global counter $t$ for the time period and returns $\{\Delta_i\}$, which records the number of purchases made for each item in the epoch.

The following key observation for Exploration($S$) states that $\{\Delta_i\}$ forms an unbiased estimate for the utility parameters of all items in $S$.

###### Observation 1.

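Let $\{\Delta_i\}$ be returned by Exploration($S$). For each $i \in S$, $\Delta_i$ is an independent geometric random variable with mean $v_i$. Moreover, one can verify that

$$\Pr[\Delta_i = k] = \left(\frac{v_i}{1 + v_i}\right)^{k} \left(\frac{1}{1 + v_i}\right), \quad \forall k \in \mathbb{N}.$$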
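Observation 1 can be checked empirically with a small simulation of one epoch. The sketch below is an illustration under stated assumptions rather than the paper's listing: the function names, the use of Python's `random` module, and the toy parameters are all hypothetical.

```python
import random

def exploration(S, v, rng):
    """Offer S until a no-purchase (item 0) occurs.

    Returns the per-item purchase counts Delta and the epoch length.
    """
    items = sorted(S)
    population = [0] + items
    weights = [1.0] + [v[i] for i in items]   # v_0 = 1 for the no-purchase option
    delta = {i: 0 for i in items}
    steps = 0
    while True:
        steps += 1
        choice = rng.choices(population, weights=weights)[0]
        if choice == 0:
            return delta, steps
        delta[choice] += 1

def average_over_epochs(S, v, n_epochs, seed=0):
    """Average Delta_i and average epoch length over n_epochs independent epochs."""
    rng = random.Random(seed)
    totals = {i: 0 for i in S}
    total_steps = 0
    for _ in range(n_epochs):
        delta, steps = exploration(S, v, rng)
        for i in S:
            totals[i] += delta[i]
        total_steps += steps
    return {i: totals[i] / n_epochs for i in S}, total_steps / n_epochs

# Hypothetical preference parameters for two items.
v = {1: 0.8, 2: 0.3}
means, mean_len = average_over_epochs({1, 2}, v, n_epochs=20000)
```

With these parameters the empirical means should be close to $v_1 = 0.8$ and $v_2 = 0.3$, and the average epoch length should be close to $1 + v_1 + v_2 = 2.1$.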

At any time of the algorithm when an epoch has ended, for each item $i$, we let $\bar v_i = n_i / T_i$, where $T_i$ is the number of the past epochs in which $i$ is included in the offered assortment, and $n_i$ is the total number of purchases for item $i$ during all past epochs. By Observation 1, we know that $\bar v_i$ is also an unbiased estimate of $v_i$. In (Agrawal et al., 2016), the following upper confidence bound (UCB) is constructed for each $i$ after the $\ell$-th epoch,

$$\hat v_i = \bar v_i + \sqrt{\frac{48\, \bar v_i \ln(\sqrt{N} \ell + 1)}{T_i}} + \frac{48 \ln(\sqrt{N} \ell + 1)}{T_i}. \tag{5}$$

We will compute the assortment for the next epoch based on the vector of UCB values $\hat v = (\hat v_1, \dots, \hat v_N)$.
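The UCB value in (5) is straightforward to evaluate. The following sketch assumes the empirical mean is $n_i / T_i$ and uses a hypothetical function name; it is meant only to show how the two confidence terms shrink as $T_i$ grows.

```python
import math

def ucb_value(n_i, T_i, N, ell):
    """UCB (5) for one item after epoch ell, from n_i purchases over T_i epochs."""
    v_bar = n_i / T_i
    log_term = math.log(math.sqrt(N) * ell + 1)
    return v_bar + math.sqrt(48 * v_bar * log_term / T_i) + 48 * log_term / T_i

# Same empirical mean 0.5 and same epoch index, but ten times more observations.
few_obs = ucb_value(n_i=50, T_i=100, N=10, ell=2000)
many_obs = ucb_value(n_i=500, T_i=1000, N=10, ell=2000)
```

Both values upper bound the empirical mean $0.5$, and the value computed from more observations is the tighter of the two.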

We describe our algorithm, AT-DUCB, which can be seen as an adaptation of the one in (Agrawal et al., 2016). The main difference from (Agrawal et al., 2016) is that the UCB values (and hence the assortment) are updated only when $T_i$ reaches an integer power of $2$ for any item $i$. Also note that instead of directly evaluating (5), the deferred update makes sure that $\hat v_i$ is non-increasing as the algorithm proceeds. We comment that the optimization task of computing the next assortment can be done efficiently, as studied in, for example, (Rusmevichientong et al., 2010).

###### Theorem 2.

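For any time horizon $T$, the expected regret incurred by AT-DUCB is

$$\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT \log T},$$

and the expected number of assortment switches is $O(N \log T)$.¹

¹ For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if there exists a universal constant $C > 0$ such that $a_n \le C\, b_n$. Similarly, we write $a_n = \Omega(b_n)$ or $a_n \gtrsim b_n$ if there exists a universal constant $c > 0$ such that $a_n \ge c\, b_n$.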

The proof of the regret upper bound in Theorem 2 is similar to that of (Agrawal et al., 2016), except for a more careful analysis about the deferred update rule. For completeness, we prove this part in Appendix A.

###### Proof of the assortment switch upper bound in Theorem 2.

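Let $D_i^{(\ell)}$ be the event that the deferred UCB update is executed in AT-DUCB for item $i$ at the $\ell$-th epoch. Recall that the assortment $S_\ell$ offered in epoch $\ell$ is computed by $\arg\max_{S \subseteq [N], |S| \le K} R(S, \hat v)$, and is updated after epoch $\ell$ only when $D_i^{(\ell)}$ happens for some $i$. Let $L$ be the total number of epochs at or before time $T$; we thus have $L \le T$. Since $T_i \le L \le T$ and an update for item $i$ is triggered only when $T_i$ reaches a power of $2$, we have $\sum_{\ell=1}^{L} \mathbb{I}[D_i^{(\ell)}] \le \log_2 T + 1$ for each $i$. We then have that

$$\mathbb{E}\left[\Psi^{(\mathrm{asst})}_T\right] = \mathbb{E} \sum_{t=1}^{T-1} \mathbb{I}[S_t \ne S_{t+1}] \le \sum_{\ell=1}^{L} \sum_{i=1}^{N} \mathbb{I}[D_i^{(\ell)}] = \sum_{i=1}^{N} \sum_{\ell=1}^{L} \mathbb{I}[D_i^{(\ell)}] \lesssim N \log T.$$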
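The counting step behind this bound can be made explicit. The sketch below assumes that an update for item $i$ fires exactly when its epoch counter $T_i$ hits a power of two, and considers the worst case in which the item is offered in every epoch; the function names are hypothetical.

```python
def is_power_of_two(x):
    """True when x is 1, 2, 4, 8, ..."""
    return x >= 1 and (x & (x - 1)) == 0

def count_updates(L):
    """Updates triggered for one item that is offered in each of L epochs."""
    return sum(1 for T_i in range(1, L + 1) if is_power_of_two(T_i))

updates_1000 = count_updates(1000)   # powers of two up to 512
updates_1024 = count_updates(1024)   # powers of two up to 1024
```

Each item therefore contributes at most $\lfloor \log_2 L \rfloor + 1$ updates, and summing over the $N$ items gives the $O(N \log T)$ bound on assortment switches.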

#### The lower bound.

We complement our algorithmic result with the following almost matching lower bound. The theorem states that the number of assortment switches has to be $\Omega(N \log T / \log\log T)$, if the algorithm is anytime and incurs only almost optimal regret. The proof of Theorem 3 can be found in Appendix E.1.

###### Theorem 3.

There exist universal constants such that the following holds. For any constant , if an anytime algorithm achieves expected regret at most for all and all instances with items, then for any , and greater than a sufficiently large constant that only depends on , there exists an instance with items and a time horizon , such that the expected number of assortment switches before time is at least .

## 3 Achieving O(N loglog T) assortment switches with known horizons

When the time horizon $T$ is known to the algorithm, we can exploit this advantage via a more carefully designed update policy to achieve only $O(N \log\log T)$ assortment switches. For convenience of presentation, we first introduce some notation.

For each item $i$, we divide the time periods into consecutive stages where the boundaries between any two neighboring stages are marked by the UCB updates for item $i$. Note that the division for the stages may be different for different items. For any $\tau \ge 1$, let $\mathcal{T}(i, \tau)$ be the set of epochs to offer item $i$ in stage $\tau$ for the item. Let $T_i^{(\tau)}$ be the total number of epochs to offer item $i$ before stage $\tau$ for the item, and let $n_i^{(\tau)}$ be the total number of purchases for item $i$ in the epochs counted by $T_i^{(\tau)}$. We can therefore define $\bar v_{i,\tau} = n_i^{(\tau)} / T_i^{(\tau)}$ as an unbiased estimate of $v_i$ based on the observations before stage $\tau$. Similarly to (5), we can define $\hat v_{i,\tau}$ as a UCB for $v_i$. The procedure Update($i$) is invoked whenever the main algorithm decides to conclude the current stage for item $i$ and update the UCB for $v_i$ together with the quantities defined above, where $\tau_i$ is the counter for the number of stages for item $i$, and $n_{i,\tau}$ is the number of purchases observed in stage $\tau$ for item $i$.

The key to the design of our main algorithm for the fixed time horizon setting is a new trigger for updating the UCB values. Given the stage threshold $\tau_0$, for each item $i$, we will conclude the current stage and invoke Update($i$) whenever the following condition $P(i, \tau_i)$ is satisfied. Note that $P(i, \tau_i)$ is adaptive to the estimated parameters to customize the number of epochs between assortment switches for each item. More specifically, the smaller $\hat v_{i,\tau_i}$ is, the less regret may be incurred by offering item $i$, and therefore the longer we can offer item $i$ without switching and incurring too large regret, and this is reflected in the design of $P(i, \tau_i)$.

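$$P(i, \tau_i) \overset{\mathrm{def}}{=} \begin{cases} |\mathcal{T}(i, \tau_i)| \ge 1 + \sqrt{\dfrac{T \cdot T_i^{(\tau_i)}}{N}} & \text{if } \tau_i < \tau_0, \\[2ex] |\mathcal{T}(i, \tau_i)| \ge 1 + \sqrt{\dfrac{T \cdot T_i^{(\tau_i)}}{N \cdot \hat v_{i,\tau_i}}} \quad \text{and} \quad \hat v_{i,\tau_0} > 1/\sqrt{NT} & \text{if } \tau_i \ge \tau_0. \end{cases} \tag{6}$$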
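To see why a trigger of the form (6) yields so few updates, the sketch below evaluates the condition and counts how many stages the first branch produces before the epoch counter of an item reaches $T/N$. It is a simplified illustration under stated assumptions: the threshold `tau0` is treated as a given parameter, each stage is assumed to end as soon as its threshold is met, the item is assumed to be offered in every epoch, and all function names are hypothetical.

```python
import math

def stage_should_end(stage_len, T, N, T_i, v_hat, v_hat_tau0, tau_i, tau0):
    """Evaluate the trigger P(i, tau_i) in (6).

    stage_len  : |T(i, tau_i)|, epochs offered so far in the current stage
    T_i        : epochs offered before the current stage
    v_hat      : UCB of item i for the current stage
    v_hat_tau0 : UCB of item i at stage tau0 (unused while tau_i < tau0)
    """
    if tau_i < tau0:
        return stage_len >= 1 + math.sqrt(T * T_i / N)
    return (stage_len >= 1 + math.sqrt(T * T_i / (N * v_hat))
            and v_hat_tau0 > 1 / math.sqrt(N * T))

def stages_to_reach(T, N):
    """Stages needed under the first branch of (6) until T_i >= T / N."""
    T_i, stages = 0, 0
    while T_i < T / N:
        T_i += math.ceil(1 + math.sqrt(T * T_i / N))
        stages += 1
    return stages

fh_stages = stages_to_reach(T=10**6, N=10)
doubling_updates = math.floor(math.log2(10**6 / 10)) + 1
```

Because each stage length is roughly the geometric mean of $T/N$ and the epochs observed so far, the counter $T_i$ grows doubly exponentially, so the stage count stays far below the number of updates a doubling schedule needs to cover the same range.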

For each epoch $\ell$, we use $\tau_i(\ell)$ to denote the stage (in terms of item $i$) to which epoch $\ell$ belongs. Our main algorithm for this setting is FH-DUCB. The algorithm is terminated whenever the time step reaches the horizon $T$.

###### Theorem 4.

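For any given time horizon $T$, we have the following upper bound for the expected regret:

$$\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT \ln(\sqrt{N} T^2 + 1)} \cdot \log\log T,$$

and the following upper bound for the expected number of assortment switches:

$$\mathbb{E}\left[\Psi^{(\mathrm{asst})}_T\right] \lesssim N \log\log T.$$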

To prove Theorem 4, we first define the desired events. Let

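$$\mathcal{E}^{(1)}_{i,\tau} \overset{\mathrm{def}}{=} \left\{ \hat v_{i,\tau} \ge v_i \ \text{ and } \ \hat v_{i,\tau} \le v_i + \sqrt{\frac{144\, v_i \ln(\sqrt{N} T^2 + 1)}{T_i^{(\tau)}}} + \frac{144 \ln(\sqrt{N} T^2 + 1)}{T_i^{(\tau)}} \right\},$$

and

$$\mathcal{E}^{(1)} \overset{\mathrm{def}}{=} \bigcap_{i,\tau} \mathcal{E}^{(1)}_{i,\tau}.$$

We also let

$$\mathcal{E}^{(2)}_{i,\tau} \overset{\mathrm{def}}{=} \left\{ n_{i,\tau} \ge \frac{1}{2} v_i\, |\mathcal{T}(i,\tau)|, \ \text{ if } v_i \ge \frac{1}{2}\sqrt{\frac{1}{NT}} \ \text{ and } \ |\mathcal{T}(i,\tau)| \ge \frac{T}{4N \cdot v_i} \right\},$$

and

$$\mathcal{E}^{(2)} \overset{\mathrm{def}}{=} \bigcap_{i,\tau} \mathcal{E}^{(2)}_{i,\tau}.$$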

Finally, let . In Appendix B.1, we prove the following lemma.


###### Lemma 5.

If and is greater than a large enough universal constant, then .

#### Bounds for the stage lengths.

When happens, we can infer the following useful lower bound for the lengths of the stages after . The lemma is proved in Appendix B.2.

###### Lemma 6.

Assume that and is greater than a sufficiently large universal constant. Conditioned on , for each , if is not the last stage for item , we have that . Additionally, if , then for all such that is not the last stage for , we have that .

#### Upper bounding the number of assortment switches.

Suppose that there are epochs before the algorithm terminates. We only need to upper bound which upper bounds the number of assortment switches . For each , if and , we easily deduce that because of the condition . Otherwise, assuming that , by Lemma 6, conditioned on , we have that and for all . Because of , we have for all . Therefore, we know that there are no more than pairs of satisfying . In total, conditioned on , we have that

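$$\begin{aligned} \mathbb{E} \sum_{i=1}^{N} \tau_i(L) &\lesssim N \tau_0 + \sum_{i=1}^{N} \mathbb{I}\left[\hat v_{i,\tau_0} > 1/\sqrt{NT}\right] \log\log \frac{T}{2N v_i} + \mathbb{E} \sum_{i=1}^{N} \max\left\{ \tau_i(L) - \tau_0 - \log\log \frac{T}{2N v_i},\, 0 \right\} \\ &\lesssim N \log\log T + \sum_{i=1}^{N} \log\log \frac{T^{3/2}}{N^{1/2}} \lesssim N \log\log T, \end{aligned} \tag{7}$$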

where the second inequality is because of Lemma 6. Finally, since the contribution to the expected number of assortment switches when fails is at most (because of Lemma 5), we prove the upper bound for the number of assortment switches in Theorem 4.

#### Upper bounding the expected regret.

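Let $E^{(\ell)}$ be the length of epoch $\ell$, i.e., the number of time steps taken in epoch $\ell$. Note that $E^{(\ell)}$ is a geometric random variable with mean value $1 + \sum_{i \in S_\ell} v_i$. Also recall that there are $L$ epochs in total. Letting $S^\star$ be the optimal assortment, conditioned on event $\mathcal{E}^{(1)}$, we have that

$$\begin{aligned} \mathbb{E}[\mathrm{Reg}_T] &= \mathbb{E} \sum_{\ell=1}^{L} E^{(\ell)} \left( R(S^\star, v) - R(S_\ell, v) \right) \\ &= \mathbb{E} \sum_{\ell=1}^{L} \Big( 1 + \sum_{i \in S_\ell} v_i \Big) \left( R(S^\star, v) - R(S_\ell, v) \right) \\ &\le \mathbb{E} \sum_{\ell=1}^{L} \sum_{i \in S_\ell} \left( \hat v_{i,\tau_i(\ell)} - v_i \right) \\ &= \mathbb{E} \sum_{i=1}^{N} \sum_{\ell : i \in S_\ell} \left( \hat v_{i,\tau_i(\ell)} - v_i \right) \\ &= \mathbb{E} \sum_{i=1}^{N} \sum_{\tau=1}^{\tau_i(L)} \sum_{\ell \in \mathcal{T}(i,\tau)} \left( \hat v_{i,\tau} - v_i \right), \end{aligned} \tag{8}$$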

where the inequality is due to Lemma 17. In the next lemma, we upper bound the contribution from each item and stage to the upper bound in (8). The lemma is proved in Appendix B.3.

###### Lemma 7.

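Conditioned on event $\mathcal{E}^{(1)}$, for any item $i$ and any stage $\tau$, we have that

$$\sum_{\ell \in \mathcal{T}(i,\tau)} \left( \hat v_{i,\tau} - v_i \right) \lesssim \sqrt{T \ln(\sqrt{N} T^2 + 1) / N}.$$

Combining Lemma 5, Lemma 7, inequalities (7) and (8), we have that

$$\begin{aligned} \mathbb{E}[\mathrm{Reg}_T] &\le T \cdot \Pr\left[\overline{\mathcal{E}^{(1)}}\right] + \mathbb{E}\left[\mathrm{Reg}_T \mid \mathcal{E}^{(1)}\right] \\ &\lesssim 1 + \mathbb{E} \sum_{i=1}^{N} \tau_i(L) \times \sqrt{\frac{T \ln(\sqrt{N} T^2 + 1)}{N}} \\ &\lesssim \sqrt{NT \ln(\sqrt{N} T^2 + 1)} \cdot \log\log T, \end{aligned}$$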

proving the expected regret upper bound in Theorem 4.

#### The lower bound.

We prove the following matching lower bound in Appendix E.2.

###### Theorem 8.

For any constant and time horizon , if an algorithm achieves expected regret at most for all -item instances, then there exists an -item instance such that the expected number of assortment switches is

$$\mathbb{E}\left[\Psi^{(\mathrm{asst})}_T\right] = \Omega(N \log\log T).$$

## 4 Optimizing the number of item switches

In this section, we study how to minimize the item switching cost while still achieving almost optimal regret.


We now propose a new algorithm, Exponential Stride UCB (ESUCB), to achieve an item switching cost that is linear in $N$ and poly-logarithmic in $T$. The specific guarantee of the ESUCB algorithm is presented in Theorem 10, the main theorem of this section. The key idea of the algorithm is to decouple the learning of the optimal expected revenue and the optimal assortment, which is made possible by the following lemma.

###### Lemma 9.

Define where . There exists a unique such that

$$G(\theta^\star) = \theta^\star = \max_{|S| \le K} R(S, v).$$

Moreover,

• for any , we have that , and

• for any , we have that .

The proof of Lemma 9 is deferred to Appendix D.1. Motivated by the lemma, we present our ESUCB algorithm in Algorithm LABEL:alg:trisection. The algorithm learns the optimal revenue in the main loop, using a sequence of exponentially decreasing learning step size . For each estimate , the Check procedure (Algorithm LABEL:alg:check) learns the assortment via the UCB method with deferred updates. (More precisely speaking, the algorithm learns and , and at Line LABEL:line:if, chooses one of them based on the UCB estimation for the expected revenue of .) In the Check procedure, the variable keeps the count of time steps and is updated in Exploration. We also make the following notes: 1) The ESUCB algorithm needs the horizon as input, and uses a confidence parameter , which is usually set as . The whole algorithm terminates whenever the horizon is reached. 2) At the optimization steps (Lines LABEL:line:erm-with-fixed-theta-0 and LABEL:line:erm-with-fixed-theta of Algorithm LABEL:alg:check), we have to adopt a deterministic tie breaking rule, e.g., we let the operator to return the such that is minimized among multiple maximizers.

###### Theorem 10.

Setting , we have the following upper bound for the expected regret of ESUCB:

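$$\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT} \cdot \log^{1.5}(NT),$$

and the item switching cost for ESUCB is

$$\mathbb{E}\left[\Psi^{(\mathrm{item})}_T\right] \lesssim N \log^2 T.$$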

To prove Theorem 10, we upper bound the item switching cost and the expected regret separately.

#### Upper bounding the item switch cost.

Since the estimate of is fixed in Check, the outcome of (corresponding to Lines LABEL:line:erm-with-fixed-theta-0 and LABEL:line:erm-with-fixed-theta of Algorithm LABEL:alg:check) becomes more stable compared to that of in previous algorithms. Exploiting this advantage, we upper bound the number of item switches incurred by each call of Check as follows. The lemma is proved in Appendix D.2.


###### Lemma 11.

The item switching cost incurred by any invocation of Check is $O(N \log T)$.

Since the main loop of ESUCB iterates for only $O(\log T)$ times, Lemma 11 easily implies an $O(N \log^2 T)$ item switching cost upper bound for ESUCB. We also note that this bound can be improved via a slight modification to the algorithm, which is elaborated in Appendix F.

#### Upper bounding the expected regret.

We first provide the following guarantees for Check.

###### Lemma 12 (Main Lemma for Check).

For any invocation , with probability at least , the following statements hold.

• If Check returns , then .

• If Check returns , then

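$$\theta^\star \ge \theta_r - \frac{2}{t_{\max}} \left( c_2 \sqrt{N t_{\max} \ln^3 \frac{NT}{\delta}} + c_3 N \ln^3 \frac{NT}{\delta} \right).$$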
• Let be the reward at time step in this invocation. If , then we have that

$$t_{\max} \theta_l - \mathbb{E}\left[ \sum_{t=1}^{t_{\max}} r^{(t)}_{\mathrm{Check}} \right] \lesssim \sqrt{N t_{\max} \ln^3(NT/\delta)} + N \ln^3(NT/\delta).$$

Proof of Lemma 12 is built upon Lemma 9 and deferred to Appendix D.3.

Let be the event that the statements hold for the invocation of Check at iteration of Algorithm LABEL:alg:trisection, and let be the event that holds every all . By Lemma 12 and a union bound, we immediately have that . The next lemma, built upon Lemma 9 and Lemma 12, shows that in Algorithm LABEL:alg:trisection is always an upper confidence bound for the true parameter , and converges to with a decent rate.

###### Lemma 13.

Let be the value of at the beginning of iteration of Algorithm LABEL:alg:trisection. Conditioned on event , for any iteration , we have that .

###### Proof.

Recall that for every , we need to prove

$$\hat\theta^{(\tau)} - 3\epsilon_\tau \le \theta^\star \le \hat\theta^{(\tau)}. \tag{9}$$

We prove this by induction. For iteration , (9) trivially holds since and therefore .

Now suppose (9) holds for iteration , we will establish (9) for iteration . Consider the invocation of at iteration , where and . We discuss the following two cases.

Case 1. When the Check procedure returns true, by Lemma 12 we have that By Lemma 9, we have that Therefore, by Line LABEL:line:first-invocation and the induction hypothesis we have that and , proving (9).

Case 2. When the Check procedure returns false, by Lemma 12, we have that

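$$\theta^\star \ge \theta_r - \frac{1}{t_{\max}} \left( (c_2 + 8) \sqrt{N t_{\max} \ln^3 \frac{NT}{\delta}} + c_3 N \ln^3 \frac{NT}{\delta} \right).$$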

Recall that at Line LABEL:line:tmax we set . For large enough , this implies that

$$\theta^\star \ge \theta_r - \epsilon_\tau = \hat\theta^{(\tau)} - 2\epsilon_\tau = \hat\theta^{(\tau+1)} - 3\epsilon_{\tau+1}.$$

By Line LABEL:line:first-invocation and the induction hypothesis we have that , finishing the proof of (9). ∎
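The invariant (9) can be illustrated with a noiseless version of the search over $\theta$. The sketch below rests on several assumptions that are not taken from the algorithm listing: $G(\theta)$ is assumed to have the fixed-point form $\max_{|S| \le K} \sum_{i \in S} v_i (r_i - \theta)$, which is consistent with $G(\theta^\star) = \theta^\star$; rewards are assumed to lie in $[0, 1]$ so that the search starts from $\hat\theta = 1$ and $\epsilon = 1/3$; the step size is assumed to shrink by a factor $2/3$ per iteration, as suggested by the chain of equalities in Case 2; and the call to Check is replaced by an exact comparison using the true $v$.

```python
def G(theta, v, r, K):
    """Assumed form: max over |S| <= K of sum_{i in S} v_i * (r_i - theta)."""
    gains = sorted((v[i] * (r[i] - theta) for i in v), reverse=True)
    return sum(g for g in gains[:K] if g > 0)

def noiseless_search(v, r, K, iters=40):
    """Maintain theta_hat - 3*eps <= theta_star <= theta_hat, as in (9)."""
    theta_hat, eps = 1.0, 1.0 / 3.0
    for _ in range(iters):
        theta_r = theta_hat - eps
        # Exact stand-in for Check returning true: G(theta_r) < theta_r
        # holds exactly when theta_star < theta_r.
        if G(theta_r, v, r, K) < theta_r:
            theta_hat = theta_r          # upper end of the interval moves down
        # Otherwise theta_star >= theta_r, so the lower end moves up instead.
        eps *= 2.0 / 3.0
    return theta_hat, eps

# Hypothetical toy instance; its optimal revenue with K = 2 is 0.55 / 1.75.
v = {1: 1.0, 2: 0.5, 3: 0.25}
r = {1: 0.2, 2: 0.6, 3: 1.0}
theta_hat, eps = noiseless_search(v, r, K=2)
```

In either branch the interval $[\hat\theta - 3\epsilon, \hat\theta]$ loses one third of its length while still containing $\theta^\star$, which is the geometric convergence that Lemma 13 establishes for the noisy procedure.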

Finally we upper bound the expected regret of Algorithm LABEL:alg:trisection.

###### Lemma 14.

With probability at least , the expected regret incurred by Algorithm LABEL:alg:trisection is . Therefore, if we set , we have that

$$\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT} \log^{1.5}(NT).$$

###### Proof.

Throughout the proof we condition on the event , which happens probability at least . We first prove that at iteration of Algorithm LABEL:alg:trisection, the expected regret for this iteration is bounded by Consider the invocation at Line LABEL:line:first-invocation. Recall that we define Combining with statement (c) of Lemma 12 and Lemma 13, the expected regret of this invocation is bounded by (where the term is due to the last epoch that might run over time ),

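$$\begin{aligned} \mathbb{E}\left[ \theta^\star \cdot t_{\max} - \sum_{t=1}^{t_{\max}} r^{(t)}_{\mathrm{Check}} \right] + O(N) &\lesssim t_{\max} (\theta^\star - \theta_l) + \mathbb{E}\left[ \theta_l \cdot t_{\max} - \sum_{t=1}^{t_{\max}} r^{(t)}_{\mathrm{Check}} \right] + O(N) \\ &\lesssim t_{\max} (\theta^\star - \theta_l) + N \ln^3(NT/\delta) / \epsilon_\tau. \end{aligned} \tag{10}$$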

By Lemma 13, we have that . Therefore, (10) is upper bounded by .

Since runs for at least time steps, the second to the last iteration satisfies that , which means that

$$\epsilon_{\tau_{\max}} \gtrsim \sqrt{N \log^3(NT/\delta) / T}.$$

Since is an exponential sequence, the overall expected regret is bounded by the order of

$$\sum_{\tau=1}^{\tau_{\max}} N \log^3(NT/\delta) / \epsilon_\tau \lesssim \sqrt{NT \log^3(NT/\delta)}.$$

#### Refined and non-trivial item switching cost upper bound for the AT-DUCB algorithm.

Since an assortment switch may incur at most $2K$ item switches, Theorem 2 trivially implies that AT-DUCB incurs at most $O(KN \log T)$ item switches, which is upper bounded by $O(N^2 \log T)$ since $K \le N$.

In Appendix C, we present a refined analysis that improves on this trivial bound for the item switching cost of AT-DUCB. While it is not clear to us whether the dependence on $N$ delivered by this analysis is optimal, we also discuss the relationship between the analysis and an extensively studied (but not yet fully resolved) geometry problem, namely the maximum number of planar $k$-sets. We hope that further study of this relationship might lead to improvement of both upper and lower bounds of the item switching cost of AT-DUCB. Please refer to Appendix C for more details.

## 5 Conclusion

In this paper, we present algorithms for MNL-bandits that achieve both almost optimal regret and assortment switching cost, in both anytime and fixed-horizon settings. We also design the ESUCB algorithm that achieves the almost optimal regret and item switching cost . For future directions, it is interesting to study whether it is possible to achieve an item switching cost of in the anytime setting and in the fixed-horizon setting. Also, as mentioned in Section 4 (and Appendix C), given the simplicity of our AT-DUCB algorithm, it is worthwhile to further refine the bounds for its item switching cost.

## Acknowledgement

Part of the work was done while Kefan Dong was a visiting student at UIUC. Kefan Dong and Yuan Zhou were supported in part by a Ye Grant and a JPMorgan Chase AI Research Faculty Research Award. Qin Zhang was supported in part by NSF IIS-1633215, CCF-1844234 and CCF-2006591.

## References

• Agarwal et al. (2017) Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In COLT, pp. 39–75, 2017.
• Agrawal et al. (2016) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. A near-optimal exploration-exploitation approach for assortment selection. In EC, pp. 599–600, 2016.
• Agrawal et al. (2017) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the MNL-bandit. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pp. 76–78, 2017.
• Agrawal et al. (2019) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.
• Bai et al. (2019) Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. Provably efficient Q-learning with low switching cost. In NeurIPS, 2019.
• Cesa-Bianchi et al. (2013) Cesa-Bianchi, N., Dekel, O., and Shamir, O. Online learning with switching costs and other adaptive adversaries. In NIPS, pp. 1160–1168, 2013.
• Chen & Wang (2018) Chen, X. and Wang, Y. A note on a tight lower bound for capacitated MNL-bandit assortment selection models. Oper. Res. Lett., 46(5):534–537, 2018.
• Dey (1998) Dey, T. K. Improved bounds for planar k-sets and related problems. Discrete & Computational Geometry, 19(3):373–382, 1998.
• Esfandiari et al. (2019) Esfandiari, H., Karbasi, A., Mehrabian, A., and Mirrokni, V. S. Batched multi-armed bandits with optimal regret. CoRR, abs/1910.04959, 2019.
• Gao et al. (2019) Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. In NeurIPS, 2019.
• Jin et al. (2019) Jin, Y., Li, Y., Wang, Y., and Zhou, Y. On asymptotically tight tail bounds for sums of geometric and exponential random variables. arXiv preprint arXiv:1902.02852, 2019.
• Jun et al. (2016) Jun, K., Jamieson, K. G., Nowak, R. D., and Zhu, X. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pp. 139–148, 2016.
• Luce (2012) Luce, R. D. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.
• Perchet et al. (2015) Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bandit problems. In COLT, pp. 1456, 2015.
• Pinsker (1964) Pinsker, M. S. Information and information stability of random variables and processes. Holden-Day, 1964.
• Rusmevichientong et al. (2010) Rusmevichientong, P., Shen, Z. M., and Shmoys, D. B. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680, 2010.
• Sauré & Zeevi (2013) Sauré, D. and Zeevi, A. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387–404, 2013.
• Simchi-Levi & Xu (2019) Simchi-Levi, D. and Xu, Y. Phase transitions and cyclic phenomena in bandits with switching constraints. In NeurIPS, 2019.
• Tóth (2001) Tóth, G. Point sets with many k-sets. Discrete & Computational Geometry, 26(2):187–194, 2001.
• Train (2009) Train, K. E. Discrete Choice Methods with Simulation. Cambridge University Press, 2009.

## Appendix A Proof of the regret upper bound in Theorem 2

In this section we prove the regret upper bound in Theorem 2 for completeness. The proof is almost identical to that in (Agrawal et al., 2017) except for the handling of the deferred UCB value updates.

The following lemma proves that is indeed an upper confidence bound of true parameter with high probability, and converges to the true value with decent rate.

###### Lemma 15 (Lemma 4.1 of (Agrawal et al., 2017)).

For any , in Algorithm LABEL:alg:doubling-ucb, at Line LABEL:line:doubling-if immediately after the -th epoch, the following two statements hold,

• With probability at least , for any

• With probability at least , for any

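$$\frac{n_i}{T_i} + \sqrt{\frac{48\, (n_i / T_i) \ln(\sqrt{N} \ell + 1)}{T_i}} + \frac{48 \ln(\sqrt{N} \ell + 1)}{T_i} - v_i \le \sqrt{\frac{144\, v_i \ln(\sqrt{N} \ell + 1)}{T_i}} + \frac{144 \ln(\sqrt{N} \ell + 1)}{T_i}.$$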

By the update rule, Lemma 16 can be extended to as follows.

###### Lemma 16.

For any , the following two statements hold at the end of the -th iteration of the outer for-loop of Algorithm LABEL:alg:doubling-ucb.

• With probability at least , for any

• With probability at least , for any

$$\hat v_i - v_i \lesssim \sqrt{\frac{v_i \log(\sqrt{N} \ell + 1)}{T_i}} + \frac{\log(\sqrt{N} \ell + 1)}{T_i}.$$

###### Proof.

For any epoch , let and be the value of and at the last update. Then we have, and . Inherited from Lemma 15, we have . And

 ^vi−vi=^v′i−vi≲ ⎷vilog(√Nℓ+1)T′i+log(√Nℓ+1)T′i≲√vilog(