# Odds-Ratio Thompson Sampling to Control for Time-Varying Effect

Multi-armed bandit methods have been used for dynamic experiments, particularly in online services. Among these methods, Thompson sampling is widely used because it is simple yet shows desirable performance. Many Thompson sampling methods for binary rewards use a logistic model written in a specific parameterization. In this study, we reparameterize the logistic model with odds ratio parameters. This shows that Thompson sampling can be carried out with only a subset of the parameters. Based on this finding, we propose a novel method, "Odds-Ratio Thompson Sampling", which is expected to be robust to time-varying effects. Use of the proposed method in continuous experiments is described, along with a desirable property of the method. In simulation studies, the novel method is robust to a temporal background effect, while its loss of performance is only marginal when no such effect exists. Finally, using a dataset from a real service, we show that the novel method would gain greater rewards in a practical environment.


## 1 Introduction

An experimentation platform is now an essential part of online services, providing the best service by evaluating and comparing multiple candidates in real services [3]. A/B testing and multi-armed bandits are the main methodologies that direct the design and decisions of an experiment. Multi-armed bandit methods provide a simple but flexible experimental framework compared to A/B testing [3, 4]. For example, Thompson sampling [5], a popular multi-armed bandit method, outputs the result of an experiment in terms of the allocation of the next experiment, which allows sequential experimentation [6].

In real services, it is common to face a type of non-stationary environment, that is, a time-varying effect [3, 7, 8, 9, 10, 11]. However, multi-armed bandits, including Thompson sampling, are by nature sensitive to this irregular condition compared to A/B testing, where sample sizes for all variants do not change during an experiment. In a continuous experiment, where arms are added and dropped in the middle of the experiment, the time-varying effect causes the multi-armed bandit policy to produce even more severely sub-optimal results, because each arm is tested over a different period.

In this study, we propose a natural way of dealing with the time-varying effect by reparameterizing the base model for Thompson sampling. First, we describe different parameterizations of the logistic model and propose a novel Thompson sampling policy (Odds-Ratio Thompson Sampling) based on a specific parameterization. Then its use in continuous experiments is described. Finally, we evaluate its empirical regret together with other methods on simulated data.

## 2 Problem Setting

Many multi-armed bandit applications adopt batch update, where arms are played multiple times and then the policy and related parameters are updated with the aggregated rewards [2, 6]. Batch update, sometimes called delayed update, is a practical setup because it requires much less computational resources than online or real-time update. There are many chances that a temporal effect changes concurrently with a batch update, making reward probabilities change across rounds. Our study discusses Thompson sampling applied with batch update, while a common effect is present within each round.

The mean reward of the $i$-th arm is represented by $p_i = E(r_i)$, where $r_i$ is a random variable for the binary reward. Later we will discuss how these mean rewards, or performances, can drift across rounds. Reward observations are aggregated at each round $t$ and given as $(n_{i,t}, y_{i,t})$, where $n_{i,t}$ is the number of trials of the $i$-th arm and $y_{i,t}$ is its number of successes. For simplicity of exposition, we suppress the subscript $t$ when it is not necessary. For the task of finding the arm with the maximum click-through rate (CTR), page view counts (impressions) and click counts can be used for $n_i$ and $y_i$, respectively.

In batch update, we allocate samples (or traffic) to each arm according to their Thompson probabilities, instead of choosing single actions with those probabilities. In the context of batch update, we call the Thompson probability the allocation proportion.
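As a concrete sketch of the batch allocation just described, one round's traffic can be split across arms by a multinomial draw from the allocation proportions (the function name and the multinomial assumption are ours, not from the paper's implementation):

```python
import numpy as np

def allocate_batch(proportions, batch_size, rng=None):
    """Split a batch of trials across arms according to Thompson proportions.

    `proportions` sums to 1; each trial in the batch is independently
    assigned to an arm, so per-arm counts are multinomial.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.multinomial(batch_size, proportions)

counts = allocate_batch([0.5, 0.3, 0.2], batch_size=10_000,
                        rng=np.random.default_rng(0))
```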

## 3 Full Rank Thompson Sampling

In this section, we parameterize the logistic model and introduce the first version of our Thompson sampling policy based on it, which we denote full rank Thompson sampling (Full-TS). The logistic model for a multi-armed bandit with $K$ arms is given by:

$$\log\left(\frac{p_i}{1-p_i}\right) = \sum_{j=1}^{K-1}\beta_j x_j + \beta_K \qquad (1)$$

where $x_j$ is an indicator variable that equals 1 when the $j$-th arm is played (for $j = 1,\dots,K-1$) and 0 otherwise. We can see that each parameter except $\beta_K$ is written as:

$$\beta_i = \log\left(\frac{p_i}{1-p_i} \Big/ \frac{p_K}{1-p_K}\right) \quad \text{for } i=1,\cdots,K-1$$

which represents the (log of the) odds ratio of the $i$-th arm's reward probability with respect to the reference $K$-th arm in logit scale. Note $\beta_i > 0$ if and only if $p_i > p_K$. We denote the parameter vector as follows:

$$\tilde\beta = \begin{pmatrix}\tilde\beta_{OR}\\ \beta_K\end{pmatrix}, \qquad \tilde\beta_{OR} = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_{K-1}\end{pmatrix}$$

Our logistic model (1) is written differently from the typical setting, in which the right-hand side of (1) is replaced by one parameter per arm, $\log\big(p_i/(1-p_i)\big) = \beta^{ind}_i$. Let us denote this parameterization as $\tilde\beta^{ind}$. In fact, the two parameterizations have a linear-transformation relationship, $\tilde\beta^{ind} = C^{ind}\tilde\beta$, where $C^{ind}$ is a $K \times K$ matrix with entries:

$$C^{ind}_{i,j} = \begin{cases}1, & \text{if } i=j \text{ or } j=K\\ 0, & \text{otherwise}\end{cases} \qquad (2)$$

This transformation does not change the model's representation; thus the two parameterizations are equivalent to each other.
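A minimal sketch of the transformation matrix in (2), assuming 0-based arrays with the reference arm in the last position; it illustrates that $C^{ind}$ adds the intercept $\beta_K$ to each odds-ratio parameter:

```python
import numpy as np

def c_ind(K):
    """Transformation matrix of equation (2): beta_ind = C_ind @ beta.

    Identity plus a column of ones in the last (reference) position, so
    beta_ind_i = beta_i + beta_K for i < K and beta_ind_K = beta_K.
    """
    C = np.eye(K)
    C[:, K - 1] = 1.0
    return C

K = 4
C = c_ind(K)
beta = np.array([0.3, -0.1, 0.5, -2.0])   # (beta_1 .. beta_{K-1}, beta_K)
beta_ind = C @ beta
```

Since the matrix is triangular with unit diagonal, it is invertible, confirming the equivalence of the two parameterizations.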

Once our Thompson sampling begins with an initial proportion for each arm, it alternates between two steps, posterior update and allocation, each of which is described in detail in a separate section.

### 3.1 Prior and Initial Proportions

We use a non-informative, uniform prior, an improper distribution with constant density on $\mathbb{R}^K$. We found that this improper prior has many advantages, described below.

We manually set the initial proportions to be equal, $p_{TS,i} = 1/K$. We found that setting the initial precision matrix $\Sigma_0^{-1}$ to the zero matrix, which follows from the uniform prior, makes the following computation simpler.

### 3.2 Posterior Update

Thompson sampling begins with the prior; then, as rewards are observed, the posterior is updated using Bayes' rule:

$$P(\tilde\beta\,|\,r) \propto P(r\,|\,\tilde\beta)\, P(\tilde\beta)$$

where $P(\tilde\beta)$ is the posterior at the previous round (or the prior if $t=1$), $P(r\,|\,\tilde\beta)$ is the likelihood, and $P(\tilde\beta\,|\,r)$ is the posterior updated in the current round. For a given round $t$, we call $P(\tilde\beta)$ the prior at that round.

We follow the general Bayesian logistic regression described in [12]. At each update we perform a Laplace approximation to keep the posterior a Gaussian distribution:

$$\tilde\beta\,|\,r \sim N(\mu_t, \Sigma_t)$$

where $\mu_t$ and $\Sigma_t$ are the mean and covariance parameters of the Gaussian distribution, respectively. Note that our parameters $\beta_i$ are not independent of each other, so the posterior is a multivariate distribution that does not factorize into univariate distributions.

The negative log of the posterior is

$$l_t(\mu) = -\big(l(r\,|\,\mu) + l(\mu, \Sigma_{t-1})\big) \qquad (3)$$

where $l(r\,|\,\mu)$ is the log of the binomial likelihood and $l(\mu, \Sigma_{t-1})$ is the log of the prior, that is, the log probability density function of the Gaussian distribution $N(\mu_{t-1}, \Sigma_{t-1})$.

The two parameters are updated sequentially,

$$\mu_t = \operatorname*{arg\,min}_{\mu}\, l_t(\mu), \qquad \Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \Lambda \qquad (4)$$

where $\Lambda$ is the second derivative of (3) at $\mu_t$. Using $\hat p_i$, which is derived from $\mu_t$ through (1), $\Lambda$ is represented with its $(i,j)$-th entry:

$$\Lambda_{ij} = \begin{cases} n_i\,\hat p_i(1-\hat p_i), & \text{for } i=j\neq K \ \text{or}\ j=K\neq i\\ n_j\,\hat p_j(1-\hat p_j), & \text{for } i=K\neq j\\ \sum_{k=1}^{K} n_k\,\hat p_k(1-\hat p_k), & \text{for } i=j=K\\ 0, & \text{otherwise}\end{cases}$$
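The update (3)-(4) can be sketched as follows, in the spirit of the Bayesian logistic regression of [12]. This is a simplified illustration under our stated design matrix, not the authors' released code; with an improper uniform prior, the prior precision is the zero matrix:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_update(mu_prev, prec_prev, n, y):
    """One batch Laplace-approximation update, sketching eqs. (3)-(4).

    mu_prev, prec_prev: prior mean and precision (Sigma^{-1}); an improper
    uniform prior corresponds to prec_prev = 0. n, y: per-arm trial and
    success counts for the round.
    """
    K = len(n)
    X = np.eye(K)
    X[:, K - 1] = 1.0            # every arm shares the intercept beta_K

    def neg_log_post(mu):
        logit = X @ mu
        # binomial negative log-likelihood (up to an additive constant)
        nll = np.sum(n * np.logaddexp(0.0, logit) - y * logit)
        d = mu - mu_prev
        return nll + 0.5 * d @ prec_prev @ d

    mu_t = minimize(neg_log_post, mu_prev, method="BFGS").x
    p_hat = 1.0 / (1.0 + np.exp(-(X @ mu_t)))
    w = n * p_hat * (1.0 - p_hat)
    Lam = X.T @ np.diag(w) @ X   # matches the closed-form entries of Lambda
    prec_t = prec_prev + Lam
    return mu_t, prec_t

# with a flat prior, the posterior mode recovers the per-arm MLEs
mu_t, prec_t = laplace_update(np.zeros(3), np.zeros((3, 3)),
                              n=np.array([100, 100, 100]),
                              y=np.array([60, 50, 40]))
```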

### 3.3 Getting Allocation Proportion

The order among the mean rewards of the arms is determined by the order among the $\beta_i$'s and $0$. That is, we set

$$\beta'_i = \begin{cases}\beta_i, & \text{for } i=1,\cdots,K-1\\ 0, & \text{for } i=K\end{cases}$$

Then, Thompson sampling states that the probability with which each arm is pulled (or selected) is given by

$$p_{TS,i} = \int I\big(i = \operatorname*{arg\,max}_{j}\,\beta'_j\big)\, P(\tilde\beta\,|\,r)\, d\tilde\beta \qquad (5)$$

where $I(\cdot)$ is an indicator function and $\arg\max_j \beta'_j$ maps $\tilde\beta$ to the index of the best arm, which is derived from (1). This quantity can be obtained at round $t$ by generating a number of multivariate samples from $N(\mu_t, \Sigma_t)$ and computing the winning proportion of the $i$-th arm; the $i$-th arm is then randomly allocated the proportion $p_{TS,i}$ of the next round's traffic.
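A Monte Carlo sketch of (5), assuming the posterior mean is ordered $(\beta_1,\dots,\beta_{K-1},\beta_K)$; function and argument names are ours:

```python
import numpy as np

def thompson_proportions(mu, cov, n_draws=10_000, rng=None):
    """Monte Carlo estimate of equation (5).

    Draw beta ~ N(mu, cov), replace the reference arm's entry with
    beta'_K = 0, and count how often each arm attains the maximum.
    """
    rng = np.random.default_rng() if rng is None else rng
    draws = rng.multivariate_normal(mu, cov, size=n_draws)
    draws = np.hstack([draws[:, :-1], np.zeros((n_draws, 1))])  # beta'_K = 0
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(mu)) / n_draws

props = thompson_proportions(np.array([0.8, 0.0, -0.4]),
                             0.01 * np.eye(3),
                             rng=np.random.default_rng(1))
```

Note that $\beta_K$ never enters the argmax, which is the property exploited in the next section.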

### 3.4 Linear Transformation and Invariance

In our logistic model, it may be less intuitive whether the result is affected by the indexing of arms, for example, by changing the reference arm.

Any indexing of the same arm set can be transformed to the independent parameterization $\tilde\beta^{ind}$, and its posterior is accordingly transformed through $C^{ind}$. Once transformed to that parameterization, the univariate parameters in $\tilde\beta^{ind}$ are independent of each other, from which it is obvious that different indexings give the identical posterior.

We call this the "reference-invariance" property. In other domains, it has been understood that different encodings of a categorical variable do not matter. Here, we see that this holds for a multi-armed bandit using the logistic model with a uniform prior.

## 4 Odds-Ratio Logistic Thompson Sampling

In the previous section, the logistic model (1) showed an interesting property: the order among the reward probabilities $p_i$ is determined only by comparing the $\beta_i$'s with 0, rather than through $\beta_K$. In other words, the parameters required for step (5) are $\tilde\beta_{OR}$, and $\beta_K$ does not contribute any information.

This property motivated us to devise a bandit model where rounds share only the OR parameters $\tilde\beta_{OR}$, rather than the full parameters $\tilde\beta$. We name this model Odds-Ratio Thompson Sampling (OR-TS). Freeing the intercept parameter $\beta_K$ at each round corresponds to allowing every arm's reward probability to drift by the same amount in logit scale at each round.

We now describe the odds-ratio logistic bandit. The posterior update at round $t$ produces a posterior for $\tilde\beta$ to be used as the prior at round $t+1$. Updating only the odds-ratio parameters can be symbolically described as changing the carried-over equality from one on the full $\tilde\beta$ to one on $\tilde\beta_{OR}$.

We can obtain the prior for $\tilde\beta_{OR}$ as a $(K-1)$-variate Gaussian distribution by marginalizing the full rank prior with respect to $\beta_K$, where $\mu_{-K}$ and $\Sigma_{-K}$ are

$$\mu_{-K} = \mu_{(:K-1)}, \qquad \Sigma_{-K} = \Sigma_{(:K-1,\,:K-1)} \qquad (6)$$

We use a uniform prior for $\beta_K$, as we did for all parameters at $t=0$. Then, Bayes' rule for the posterior update is written:

$$P(\tilde\beta\,|\,r) \propto P(r\,|\,\tilde\beta)\, P(\tilde\beta_{OR}) \qquad (7)$$

Once we replace the prior term of (3) with this marginalized prior, we follow the same downstream process as in Full-TS.
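The marginalization in (6) is a one-line operation for a Gaussian, since dropping $\beta_K$ amounts to dropping its row and column; a sketch (names ours):

```python
import numpy as np

def or_prior(mu, cov):
    """Marginal prior for the odds-ratio block (equation (6)).

    For a Gaussian, marginalizing out beta_K simply drops its
    entry from the mean and its row/column from the covariance.
    """
    return mu[:-1], cov[:-1, :-1]

mu = np.array([0.2, -0.1, 1.5])
cov = np.array([[0.04, 0.01, 0.02],
                [0.01, 0.05, 0.01],
                [0.02, 0.01, 0.30]])
mu_or, cov_or = or_prior(mu, cov)
```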

It is worth noting that this marginalization does not depend on the choice of reference arm. In other words, the OR update also has the reference-invariance property. This can be seen from the fact that the marginalized prior under any indexing transforms to the identical (degenerate) Gaussian distribution in the independent parameterization.

Note that OR-TS and Full-TS are defined per round; thus, one can switch between OR-TS and Full-TS at each round throughout an experiment, sharing only $\tilde\beta_{OR}$ at one round and the full $\tilde\beta$ at another.

## 5 Application in Continuous Experiment

Multi-armed bandits have been discussed in a strict setting, where the experiment period and the arm set are assumed to be fixed during an experiment. However, multi-armed bandits can be used in a more flexible way, such as without fixing the experiment period beforehand [7]. We define a continuous experiment as one where the set of arms to be allocated (denoted by $A_t$) changes over rounds and the period is not fixed. Simple scenarios include the case where new arms are added in the middle of an experiment.

Odds-Ratio Thompson Sampling is more desirable in a continuous experiment: because the periods over which each arm has been observed differ, a non-stationary environment affects the full rank bandit more severely.

We denote the set of arms observed up to round $t$ as $C_t$. Generally it can be represented as a cumulative set: $C_t = \bigcup_{s \le t} A_s$. For simplicity, we assume all arms in $A_t$ are pulled (i.e., observed) at round $t$. Note that information from previous observations is delivered only when

$$n(A_t \cap C_{t-1}) \ge 2 \qquad (8)$$

where $n(\cdot)$ is the number of items in a set. Therefore, to continue the OR bandit, one should design $A_t$ to satisfy condition (8). Nevertheless, when the condition fails, we can simply begin a new multi-armed bandit by re-initializing the prior and proportions, or we can use the full rank prior.
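Condition (8) reduces to a set-intersection check; a sketch with hypothetical arm labels:

```python
def can_continue_or_bandit(arms_now, arms_seen):
    """Condition (8): at least two current arms overlap with past arms,
    so odds-ratio information can be carried into the new round."""
    return len(set(arms_now) & set(arms_seen)) >= 2

# overlap {A, B} has size 2, so the OR bandit can continue
ok = can_continue_or_bandit({"A", "B", "E"}, {"A", "B", "C", "D"})
# no overlap at all, so a fresh bandit must be started
bad = can_continue_or_bandit({"E", "F"}, {"A", "B", "C", "D"})
```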

### 5.1 Getting Allocation Proportions in Continuous Experiment

The parameter, its prior, and its posterior can be freely transformed for any indexing of arms due to the reference-invariance property. This transformation is required when the previous reference arm is not included in $A_t$, for example, because it was removed due to low performance. In this case, we need to change the reference arm to one from $A_t$ and reparameterize accordingly. Let $f$ be the mapping function, that is, $f(i)$ is the new index for the previous $i$-th arm. The transformation matrix can be represented as a chain of transformations through the independent parameterization, where $C^f$ is the matrix corresponding to $f$, with entries:

$$C^f_{i,j} = \begin{cases}1, & \text{if } j = f^{-1}(i)\\ 0, & \text{otherwise}\end{cases}$$
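The permutation matrix $C^f$ can be built directly from the mapping $f$; a sketch with a hypothetical 0-based re-indexing:

```python
import numpy as np

def c_f(f, K):
    """Permutation matrix for a re-indexing map f (new = f(old)).

    Row i has a 1 in column f^{-1}(i), so (C_f @ beta)[i] = beta[f^{-1}(i)].
    """
    C = np.zeros((K, K))
    for old, new in f.items():
        C[new, old] = 1.0
    return C

f = {0: 2, 1: 0, 2: 1}            # hypothetical re-indexing, 0-based
beta = np.array([10.0, 20.0, 30.0])
beta_new = c_f(f, 3) @ beta        # entry i holds the old beta[f^{-1}(i)]
```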

Getting allocation proportions for $A_t$ in a continuous experiment is straightforward using the rules applied previously. Assume that $A_t$ comprises arms already observed, thus in $C_{t-1}$, and new ones not in $C_{t-1}$. Denote the two groups as $A_t^{old}$ and $A_t^{new}$, respectively. First, we manually set the proportions for the arms in $A_t^{new}$. We transform the posterior from its previous indexing to a new indexing over the arms in $A_t^{old}$. Then, we obtain $p_{TS,i}$ for each arm in $A_t^{old}$ from the transformed posterior as described in Section 3.3, which gives the allocation proportions for the arms in $A_t^{old}$. Note this step is identical in Full-TS.

### 5.2 Posterior Update in Continuous Experiment

The posterior at round $t$ is updated as we observe rewards for $A_t$. Again, if the reference arm in the given parameterization of the previous round's posterior is not in $A_t$, we change the reference arm to one in $A_t$ and transform the parameterization accordingly. Note that in the previous allocation step we transform to the arms in $A_t^{old}$, but for the posterior update we transform the posterior over all arms in $C_t$. This previous posterior is marginalized with respect to the reference arm, as in equation (7). Since all parameters are correlated, parameters for arms not in $A_t$ are also updated. As long as $C_t$ grows while satisfying (8) at each round, the posterior for all arms in $C_t$ is successfully accumulated.

Information about the relative performance among arms in $C_t$ is efficiently summarized in the posterior. Note that this includes arm pairs that have never been directly observed in the same round. For example, assume arms $\{A, B, C\}$ are played in one round and $\{B, C, D\}$ in a later round. Even though the two arms $A$ and $D$ are never directly compared in the same round, indirect information from their odds ratios with respect to a common reference, that is, $B$ or $C$, is summarized in the posterior.

## 6 Experiment

### 6.1 Simulation Study

We compared three different Thompson sampling policies for binary rewards: Full-TS, OR-TS, and Beta-Bernoulli Thompson Sampling (Beta-TS). We simulated various environments and investigated the behaviours of the three policies.

We set $K = 10$ arms, one of which has a greater mean reward than the other nine. Specifically, the nine sub-optimal arms are set to a common reward probability, and the optimal arm to a higher one. At each round, all arms share a common background effect $\delta_t$, generated from a Gaussian distribution:

$$\beta_{i,t} = \beta_i + \delta_t, \qquad \delta_t \sim N(0, \sigma^2)$$

The level of the time-varying effect is controlled by the variance $\sigma^2$ of the Gaussian distribution. We represent the level of the time-varying effect with a scaled metric that gives its size relative to the mean reward difference between the optimal and sub-optimal arms. We repeated the simulation 100 times with 50 rounds per simulation. Each round consists of 10,000 trials.
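The reward-generating process above can be sketched as follows; the specific logit values are hypothetical, since the paper's exact probabilities are not recoverable from this extraction:

```python
import numpy as np

def simulate_round(beta, n_per_arm, sigma, rng):
    """Draw one round's binomial rewards with a shared background shift.

    All arms' logits move by the same delta_t ~ N(0, sigma^2), the
    time-varying effect studied in the simulation.
    """
    delta = rng.normal(0.0, sigma)
    p = 1.0 / (1.0 + np.exp(-(beta + delta)))
    n = np.full(len(beta), n_per_arm)
    y = rng.binomial(n, p)
    return n, y

rng = np.random.default_rng(42)
beta = np.array([-3.0] * 9 + [-2.5])   # hypothetical logits; one better arm
n, y = simulate_round(beta, n_per_arm=1000, sigma=0.5, rng=rng)
```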

The mean cumulative regret is shown in Figure 1. In both panels, Beta-TS and Full-TS have very similar regrets, as expected. In the environment with no time-varying effect (left panel), OR-TS has slightly greater regret than the other two policies. However, when there is a time-varying effect (right panel), OR-TS is robust to the effect, while Beta-TS and Full-TS are affected severely.

### 6.2 Real Data Study

We investigated the performance of the three policies with real digital advertisement data from online messenger users. Equal numbers of users were exposed to four different listings, which are sub-modules in a mobile page, for 18 days. Setting each listing as an arm, we consider a multi-armed bandit problem of finding the listing with the maximum CTR among the four. Partway through the experiment, the listings were shifted up to the very top of the screen. Due to this change, the performance of the arms changed: page views more than doubled, while CTRs decreased because of the increase in views. Therefore, the data is expected to contain two types of varying effects: a minor daily effect and a major effect between the two periods.

We set the ground truth CTRs to those estimated from the real data and simulated the multi-armed bandit. To investigate the behaviors of the policies more efficiently, we scale the total page views per round down to a fraction of the real data's. We then record the expected rewards, i.e., click counts, at each round for the three policies. Figure 2 shows the click counts for Beta-TS, Full-TS, and OR-TS.

From the figure, we indirectly see a time-varying effect through the period. OR-TS gained more clicks than the other policies, especially during the two days after the major event occurred. We may infer that the temporal effects in days 1-10 (especially a day of low performance) biased the posteriors of Beta-TS and Full-TS, while OR-TS was robust to the effects. Over the whole period, OR-TS is estimated to gain more clicks than Beta-TS and Full-TS.

## 7 Conclusion and Discussion

In this study, we investigated an alternative expression of the logistic model for Thompson sampling. Based on this expression, we showed that an odds-ratio-only update makes a multi-armed bandit policy robust in many practical environments. Therefore, we believe one may consider Odds-Ratio Thompson Sampling as a standard policy for binary reward data in batch update settings.

This study focused on the case of a common background effect, which does not change the optimal arm. However, real data can be confounded such that the optimal arm itself changes over time, in which case even plain OR-TS would be affected. In this case, we can use discounted TS [8] or an aggressiveness parameter [3]. For example, we can multiply a decay parameter, ranging between 0 and 1, into the prior term of the update.

We have discussed the base logistic model. Its extension to other generalized linear models, or to contextual bandits, is straightforward. Our implementation of OR-TS and Full-TS is available at http://github.com/sulgik/orts.

## References

• [1] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
• [2] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
• [3] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
• [4] Giuseppe Burtini, Jason Loeppky, and Ramon Lawrence. A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757, 2015.
• [5] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
• [6] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
• [7] Rocío Cañamares, Marcos Redondo, and Pablo Castells. Multi-armed recommender system bandit ensembles. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 432–436, 2019.
• [8] Vishnu Raj and Sheetal Kalyani. Taming non-stationary bandits: A bayesian approach. arXiv preprint arXiv:1707.09727, 2017.
• [9] Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 2025–2034. ACM, 2016.
• [10] Giuseppe Burtini, Jason L Loeppky, and Ramon Lawrence. Improving online marketing experiments with drifting multi-armed bandits. In ICEIS (1), pages 630–636, 2015.
• [11] Neha Gupta, Ole-Christoffer Granmo, and Ashok Agrawala. Thompson sampling for dynamic multi-armed bandits. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1, pages 484–489. IEEE, 2011.
• [12] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.