1 Introduction
Experimentation platforms are now an essential part of online services, enabling the best variant to be identified by evaluating and comparing multiple candidates in a live service [3]. A/B testing and multi-armed bandits are the main methodologies that guide the design and the decisions of an experiment. Multi-armed bandit methods provide a simple but flexible experimental framework compared to A/B tests [3, 4]. For example, Thompson sampling [5], a popular multi-armed bandit method, outputs the result of an experiment in the form of the allocation for the next experiment, which enables sequential experimentation [6].
In real services, it is common to encounter a form of non-stationary environment, namely a time-varying effect [3, 7, 8, 9, 10, 11]. However, multi-armed bandits, including Thompson sampling, are by nature more sensitive to this irregular condition than A/B tests, in which the sample sizes of all variants do not change during the experiment. In a continuous experiment, where arms are added and dropped in the middle of the experiment, a time-varying effect makes the multi-armed bandit policy even more severely suboptimal, because each arm is tested over a different period.
In this study, we propose a natural way of dealing with time-varying effects by reparameterizing the base model for Thompson sampling. First, we describe different parameterizations of the logistic model and propose a novel Thompson sampling policy (Odds-Ratio Thompson Sampling) based on a specific parameterization. Then its use in continuous experiments is described. Finally, we evaluate its empirical regret against other methods on simulated data.
2 Problem Setting
Many multi-armed bandit applications adopt batch updates, where arms are played multiple times and the policy and its parameters are then updated with the aggregated rewards [2, 6]. The batch update, sometimes called a delayed update, is a practical setup because it requires far fewer computational resources than an online, real-time update. The temporal effect can easily change between batch updates, making the reward probabilities drift across rounds. Our study discusses Thompson sampling applied with batch updates when there is a common effect within each round.
The mean reward of the $i$-th arm is represented by $p_i = E[Y_i]$, where $Y_i$ is a binary reward random variable. Later we discuss how these mean rewards, or performances, can drift across rounds. Reward observations are aggregated at each round $t$ and given as $(n_{i,t}, y_{i,t})$, where $n_{i,t}$ is the number of trials of the $i$-th arm and $y_{i,t}$ is its number of successes. For simplicity of exposition, we suppress the subscript $t$ when it is not necessary. For the task of finding the arm with the maximum click-through rate (CTR), page view counts (impressions) and click counts can be used for $n_i$ and $y_i$, respectively. In a batch update, we allocate samples (traffic) to the arms according to their Thompson probabilities, instead of choosing individual actions with those probabilities. In the context of batch updates, we call the Thompson probability a proportion.
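As a concrete illustration of this batch-update loop, the following minimal sketch uses a Beta-Bernoulli Thompson sampler (the BetaTS baseline compared later); the traffic size and the toy reward model are illustrative, not from the paper.

```python
import numpy as np

def thompson_proportions(alpha, beta, n_samples=10_000, rng=None):
    """Estimate each arm's Thompson probability (allocation proportion)
    by Monte Carlo: draw from each arm's Beta posterior and count how
    often each arm has the largest sampled reward probability."""
    rng = np.random.default_rng(rng)
    draws = rng.beta(alpha, beta, size=(n_samples, len(alpha)))
    best = draws.argmax(axis=1)
    return np.bincount(best, minlength=len(alpha)) / n_samples

# One batch round: compute proportions, allocate traffic accordingly,
# then update the Beta posteriors with the aggregated (trials, successes).
alpha = np.array([1.0, 1.0, 1.0])            # prior successes + 1
beta = np.array([1.0, 1.0, 1.0])             # prior failures + 1
props = thompson_proportions(alpha, beta, rng=0)
n = np.round(props * 10_000).astype(int)     # traffic allocated per arm
y = np.array([int(0.05 * k) for k in n])     # toy aggregated clicks
alpha, beta = alpha + y, beta + (n - y)      # batch posterior update
```

Note that the posterior is touched only once per round, with the aggregated counts, rather than after every pull.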
3 Full Rank Thompson Sampling
In this section, we parameterize the logistic model and introduce the first version of the Thompson sampling policy based on it, which we denote full-rank Thompson sampling (FullTS). The logistic model for a multi-armed bandit with $K$ arms is given by
$$\operatorname{logit}(p_i) = \beta_0 + \beta_i, \quad i = 1, \dots, K, \qquad (1)$$
where $\operatorname{logit}(p) = \log\{p/(1-p)\}$ and $\beta_1 \equiv 0$.
We can see that each parameter except $\beta_0$ is written as $\beta_i = \operatorname{logit}(p_i) - \operatorname{logit}(p_1)$, which represents the (log of the) odds ratio of the $i$-th arm's reward probability with respect to the reference ($1$st) arm in logit scale. Note that $\beta_i = 0$ if and only if $p_i = p_1$. We denote the parameter vector by $\beta = (\beta_0, \beta_2, \dots, \beta_K)$. Our logistic model (1) is written differently from the typical setting, in which $\beta_0 + \beta_i$ in (1) is replaced by $\theta_i = \operatorname{logit}(p_i)$. Let us denote this parameterization by $\theta = (\theta_1, \dots, \theta_K)$.
In fact, the two parameterizations are related by a linear transformation, $\theta = A\beta$, where $A$ is a $K \times K$ matrix with entries
$$A_{i1} = 1 \text{ for all } i, \quad A_{ii} = 1 \text{ for } i \ge 2, \quad A_{ij} = 0 \text{ otherwise.} \qquad (2)$$
Since $A$ is invertible, this transformation does not change the model's representation; the two parameterizations are equivalent to each other.
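The transformation (2) can be checked numerically. The sketch below assumes arm 1 as the reference, so that $\beta = (\beta_0, \beta_2, \dots, \beta_K)$; the example probabilities are illustrative.

```python
import numpy as np

def transform_matrix(K):
    """Matrix A with theta = A @ beta, where beta = (beta_0, beta_2, ..., beta_K)
    is the odds-ratio parameterization (arm 1 as reference) and
    theta_i = logit(p_i) is the independent parameterization."""
    A = np.zeros((K, K))
    A[:, 0] = 1.0                 # every theta_i contains the intercept beta_0
    A[1:, 1:] = np.eye(K - 1)     # theta_i = beta_0 + beta_i for i >= 2
    return A

def logit(p):
    return np.log(p / (1 - p))

p = np.array([0.04, 0.05, 0.06])  # example reward probabilities
beta = np.concatenate(([logit(p[0])], logit(p[1:]) - logit(p[0])))
theta = transform_matrix(3) @ beta  # recovers logit(p) coordinate-wise
```

Because $A$ is invertible, a Gaussian posterior $N(m, \Sigma)$ over $\beta$ maps to $N(Am, A\Sigma A^\top)$ over $\theta$, which is used repeatedly below.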
Once our Thompson sampling begins with an initial proportion for each arm, it alternates two steps, posterior update and allocation, each of which is described in detail in a separate section.
3.1 Prior and Initial Proportions
We use a non-informative, uniform prior: an improper distribution with constant density on $\mathbb{R}^K$. We found that this improper prior has several advantages, described below.
We manually give the initial proportions $w_i$; a natural choice is equal proportions, $1/K$ per arm. We found that setting the prior precision to the zero matrix, which follows from the uniform prior, makes the subsequent computation simpler.
3.2 Posterior Update
Thompson sampling begins with the prior; then, as rewards are observed, the posterior is updated by Bayes' rule:
$$f_t(\beta) \propto L_t(\beta)\, f_{t-1}(\beta),$$
where $f_{t-1}$ is the posterior at the previous round (or the prior if $t = 1$), $L_t$ is the likelihood, and $f_t$ is the posterior updated in the current round. For a given round $t$, we call $f_{t-1}$ the prior at that round.
We follow the general Bayesian logistic regression described in [12]. At each update we apply a Laplace approximation to keep the posterior Gaussian:
$$f_t(\beta) \approx N(m_t, \Sigma_t),$$
where $m_t$ and $\Sigma_t$ are the mean and covariance parameters of the Gaussian distribution, respectively. Note that our parameters $\beta_i$ are not independent of each other, so the posterior is a multivariate distribution that does not factor into univariate distributions.
The negative log of the posterior is
$$-\log f_t(\beta) = -\ell_t(\beta) - \log g_{t-1}(\beta) + \text{const}, \qquad (3)$$
where $\ell_t$ is the log of the binomial likelihood and $\log g_{t-1}$ is the log of the prior, that is, the log probability density of the Gaussian distribution.
The two parameters are updated sequentially,
$$m_t = \operatorname*{arg\,min}_{\beta}\, \{-\log f_t(\beta)\}, \qquad \Sigma_t^{-1} = H(m_t), \qquad (4)$$
where $H$ is the second derivative of (3) at $m_t$. Using the reward probabilities $p_i$ derived from $\beta$ through (1), $H$ is represented with its $(j,k)$-th entry,
$$H_{jk}(\beta) = \sum_{i=1}^{K} n_i\, p_i (1 - p_i)\, x_{ij} x_{ik} + (\Sigma_{t-1}^{-1})_{jk},$$
where $x_i$ is the indicator vector that maps $\beta$ to $\operatorname{logit}(p_i)$ in (1).
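The update (4) can be sketched with Newton's method on the negative log posterior. The code below is a minimal illustration under the notation above (arm 1 as reference; a zero precision matrix stands in for the improper uniform prior), not the authors' implementation; the counts are illustrative.

```python
import numpy as np

def laplace_update(n, y, m0, prec0, iters=25):
    """One batch posterior update for the odds-ratio logistic model.
    n, y: trials/successes per arm (arm 1 is the reference).
    m0, prec0: Gaussian prior mean and precision (inverse covariance);
    a zero precision matrix encodes the improper uniform prior.
    Returns the Laplace approximation N(m, Sigma) of the posterior."""
    K = len(n)
    X = np.zeros((K, K))
    X[:, 0] = 1.0                 # intercept beta_0 enters every arm
    X[1:, 1:] = np.eye(K - 1)     # beta_i enters arm i only (i >= 2)
    beta = m0.copy()
    for _ in range(iters):        # Newton's method on the negative log posterior
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (n * p - y) + prec0 @ (beta - m0)
        H = X.T @ ((n * p * (1 - p))[:, None] * X) + prec0
        beta = beta - np.linalg.solve(H, grad)
    p = 1 / (1 + np.exp(-X @ beta))
    H = X.T @ ((n * p * (1 - p))[:, None] * X) + prec0
    return beta, np.linalg.inv(H)  # posterior mean and covariance

n = np.array([1000.0, 1000.0, 1000.0])
y = np.array([40.0, 50.0, 60.0])
m, Sigma = laplace_update(n, y, np.zeros(3), np.zeros((3, 3)))
```

With the uniform prior, the posterior mode coincides with the maximum-likelihood estimate, so `m[0]` recovers $\operatorname{logit}(\hat p_1)$ and `m[0] + m[i]` recovers $\operatorname{logit}(\hat p_i)$.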
3.3 Getting Allocation Proportion
The order among the arms' mean rewards is determined by the order among the $\beta_i$'s and $0$; that is, we set
$$i^{*} = \operatorname*{arg\,max}_i\, p_i = \operatorname*{arg\,max}_i\, \beta_i, \quad \text{with } \beta_1 = 0.$$
Then, Thompson sampling states that the probability with which each arm is pulled (or selected) is given by
$$w_i = E\!\left[\mathbb{1}\{\, i = \operatorname*{arg\,max}_j h_j(\beta) \,\}\right], \qquad (5)$$
where $\mathbb{1}$ is an indicator function and $h$ is the function that maps $\beta$ to $(p_1, \dots, p_K)$, derived from (1). At round $t$, this can be obtained by generating a number of multivariate samples from $N(m_t, \Sigma_t)$, computing the $i$-th arm's proportion among the sampled maxima, and then randomly allocating that proportion of the next round's traffic to the $i$-th arm.
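A Monte Carlo implementation of (5) might look as follows; this is a sketch assuming a Gaussian posterior $N(m_t, \Sigma_t)$ as in the previous section, with illustrative posterior values. Since $\beta_0$ shifts all arms equally, the argmax only needs the odds-ratio coordinates.

```python
import numpy as np

def allocation_proportions(m, Sigma, n_samples=20_000, rng=None):
    """Monte Carlo estimate of (5): draw beta ~ N(m, Sigma), and record
    how often each arm is the sampled best.  The intercept beta_0 shifts
    all arms by the same amount in logit scale, so the comparison uses
    only the odds-ratio coordinates (beta_2..beta_K versus 0)."""
    rng = np.random.default_rng(rng)
    draws = rng.multivariate_normal(m, Sigma, size=n_samples)
    # scores: 0 for the reference arm, beta_i for the others
    scores = np.concatenate([np.zeros((n_samples, 1)), draws[:, 1:]], axis=1)
    best = scores.argmax(axis=1)
    return np.bincount(best, minlength=m.shape[0]) / n_samples

m = np.array([-3.18, 0.23, 0.43])   # illustrative posterior mean
Sigma = 0.05 * np.eye(3)            # illustrative posterior covariance
props = allocation_proportions(m, Sigma, rng=1)
```

The resulting vector `props` is the next round's allocation proportion over the arms.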
3.4 Linear Transformation and Invariance
In our logistic model, it may be unclear whether the result is affected by the indexing of the arms, for example by changing the reference arm.
Any indexing of the same arm set can be transformed to the independent parameterization $\theta$, and its posterior is accordingly transformed to $N(Am_t, A\Sigma_t A^\top)$. Once transformed to this parameterization, the univariate parameters in $\theta$ are independent of each other, and it is obvious that different indexings give the identical posterior. We call this property "reference invariance." In other domains it is well understood that the encoding of a categorical variable does not matter; here we see that the same holds for a multi-armed bandit using the logistic model with a uniform prior.
4 Odds-Ratio Logistic Thompson Sampling
In the previous section, the logistic model (1) showed an interesting property: the order among the reward probabilities $p_i$ is determined only by comparing the $\beta_i$'s ($i \ge 2$) with $0$, not by $\beta_0$. In other words, the parameters required for step (5) are the odds-ratio parameters, and $\beta_0$ contributes no information. This property motivated us to devise a bandit model in which successive rounds share only the odds-ratio (OR) parameters $(\beta_2, \dots, \beta_K)$ rather than the full parameter vector $\beta$. We name this model Odds-Ratio Thompson Sampling (ORTS). Freeing the intercept parameter $\beta_0$ at each round corresponds to allowing every arm's reward probability to drift by the same amount in logit scale at each round.
We now describe the odds-ratio logistic bandit. The posterior update at round $t$ produces the posterior to be used as the prior at round $t+1$. Updating only the odds-ratio parameters can be described symbolically as replacing the carried-over prior $f_t(\beta)$ by its marginal $f_t(\beta_2, \dots, \beta_K)$.
We can obtain the prior for $(\beta_2, \dots, \beta_K)$ as a $(K-1)$-variate Gaussian distribution by marginalizing the full-rank prior with respect to $\beta_0$; its mean and covariance are the corresponding subvector and submatrix,
$$m_{t-1}^{OR} = (m_{t-1})_{2:K}, \qquad \Sigma_{t-1}^{OR} = (\Sigma_{t-1})_{2:K,\,2:K}. \qquad (6)$$
We use a uniform prior for $\beta_0$, as we did for all parameters at $t = 1$. Bayes' rule for the posterior update is then written
$$f_t(\beta) \propto L_t(\beta)\, g_{t-1}(\beta_2, \dots, \beta_K), \qquad (7)$$
where $g_{t-1}$ is the marginal Gaussian prior of (6).
Once we replace the prior term of (3) by this marginal prior, we follow the same downstream process as in FullTS.
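A sketch of (6) in code: marginalizing a Gaussian over $\beta_0$ amounts to dropping its coordinate, and the fresh uniform prior on $\beta_0$ can be encoded as a zero row and column in the prior precision. The function name and example values are illustrative, not from the paper's implementation.

```python
import numpy as np

def or_prior(m, Sigma):
    """ORTS prior for the next round, per (6)-(7): marginalize the
    full-rank Gaussian posterior over the intercept beta_0 (drop its
    coordinate) and pair the result with a fresh uniform prior on beta_0.
    Returned as (mean, precision), so the flat beta_0 direction is
    encoded as zero precision."""
    K = len(m)
    m_or, Sigma_or = m[1:], Sigma[1:, 1:]  # Gaussian marginal: subvector/submatrix
    mean = np.concatenate(([0.0], m_or))   # beta_0 entry is arbitrary (flat prior)
    prec = np.zeros((K, K))
    prec[1:, 1:] = np.linalg.inv(Sigma_or)
    return mean, prec

m = np.array([-3.2, 0.2, 0.4])
Sigma = np.array([[0.10, 0.02, 0.02],
                  [0.02, 0.05, 0.01],
                  [0.02, 0.01, 0.05]])
mean, prec = or_prior(m, Sigma)
```

Because the precision's $\beta_0$ row and column are zero, the $\beta_0$ entry of the prior mean never enters the downstream Laplace update.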
It is worth noting that this marginalization does not depend on the reference arm; in other words, the OR update also has the reference-invariance property. This can be seen from the fact that the marginalized prior under any indexing transforms to the same (degenerate) Gaussian distribution in the independent parameterization. Note also that ORTS and FullTS are defined per round; thus one can switch between ORTS and FullTS at each round of an experiment, carrying over only the OR parameters at one round and the full parameter vector at another.
5 Application in Continuous Experiment
Multi-armed bandits have usually been discussed in a strict setting where the experiment period and the arm set are assumed fixed throughout an experiment. However, multi-armed bandits can be used in a more flexible way, for example without fixing the experiment period beforehand [7]. We define a continuous experiment as one where the set of arms to be allocated (denoted $A_t$) changes over rounds and the period is not fixed. A simple scenario is one where new arms are added in the middle of an experiment. Odds-Ratio Thompson Sampling is particularly desirable in the continuous experiment: the periods over which the arms have been observed differ, so a non-stationary environment hurts the full-rank bandit even more.
We denote the set of arms observed up to round $t$ by $S_t$; generally it can be represented as the cumulative set $S_t = \bigcup_{s \le t} A_s$. For simplicity, we assume all arms in $A_t$ are pulled (i.e., observed) at round $t$. Note that information from previous observations is delivered only when
$$|A_t \cap S_{t-1}| \ge 1, \qquad (8)$$
where $|\cdot|$ is the number of items in a set. Therefore, to continue the OR bandit, one should design $A_t$ to satisfy condition (8). Nevertheless, when $A_t \cap S_{t-1} = \emptyset$, we can simply begin a new multi-armed bandit by re-initializing the proportions and the prior, or we can use the full-rank prior.
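Condition (8) is a simple set check; in code (a trivial sketch with illustrative names):

```python
def carries_information(arms_now, arms_seen):
    """Condition (8): odds-ratio information propagates to the current
    round only if at least one currently played arm was observed in a
    previous round."""
    return len(set(arms_now) & set(arms_seen)) >= 1
```

For example, adding a new arm D while keeping B satisfies (8), whereas replacing the whole arm set at once does not.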
5.1 Getting Allocation Proportions in Continuous Experiment
A parameter, together with its prior and posterior, can be freely transformed under any indexing of the arms thanks to the reference-invariance property. This transformation is required when the previous reference arm is not included in $A_t$, for example because it was removed due to low performance. In this case, we need to change the reference arm to one in $A_t \cap S_{t-1}$ and reparameterize accordingly. Let $\sigma$ be the mapping function, that is, $\sigma(i)$ is the new index of the previous $i$-th arm. The transform matrix can be represented as a chain of transformations, $A^{-1} P_{\sigma} A$, where $P_{\sigma}$ is the permutation matrix corresponding to $\sigma$, with entries $(P_{\sigma})_{\sigma(i),\,i} = 1$ and $0$ otherwise.
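The chained transformation $A^{-1} P_\sigma A$ can be applied directly to the Gaussian posterior. The sketch below (illustrative names and values, assuming the matrix $A$ of (2)) maps the posterior to the independent parameterization, permutes the arm coordinates, and maps back.

```python
import numpy as np

def change_reference(m, Sigma, perm):
    """Reparameterize a full-rank posterior N(m, Sigma) under a new arm
    indexing `perm` (perm[old] = new position of the old arm): map to the
    independent parameterization theta = A beta, permute the coordinates,
    and map back, i.e. apply T = A^{-1} P A."""
    K = len(m)
    A = np.zeros((K, K))
    A[:, 0] = 1.0
    A[1:, 1:] = np.eye(K - 1)
    P = np.zeros((K, K))
    for old, new in enumerate(perm):
        P[new, old] = 1.0
    T = np.linalg.inv(A) @ P @ A          # beta_new = T beta_old
    return T @ m, T @ Sigma @ T.T

m = np.array([-3.2, 0.2, 0.4])            # illustrative posterior mean
Sigma = np.diag([0.10, 0.05, 0.05])       # illustrative posterior covariance
m_new, Sigma_new = change_reference(m, Sigma, [1, 0, 2])  # swap arms 1 and 2
```

Swapping arms 1 and 2 makes the old arm 2 the reference: the new intercept is $\beta_0 + \beta_2$, the new $\beta_2$ is $-\beta_2$, and the new $\beta_3$ is $\beta_3 - \beta_2$.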
Getting the allocation proportions for $A_t$ in a continuous experiment is straightforward using the rules previously applied. Assume that $A_t$ comprises arms already observed (i.e., in $S_{t-1}$) and new arms not in $S_{t-1}$; denote the two groups $E_{old}$ and $E_{new}$, respectively. First, we manually set a proportion for each arm in $E_{new}$. We then transform the posterior from its previous indexing to a new indexing of the arms in $E_{old}$, obtain the Thompson proportions $w_i$ for the arms in $E_{old}$ from the transformed posterior as described in Section 3.3, and scale them to the traffic remaining after the manual allocations. Note that this step is identical in FullTS.
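The allocation step above can be sketched as follows, assuming each new arm receives a manually chosen proportion `eps` and the old arms' Thompson proportions are rescaled to the remaining traffic (the rescaling rule and names are our illustrative reading, not the paper's exact formula).

```python
import numpy as np

def continuous_proportions(props_old, n_new, eps=0.1):
    """Allocation for round t in a continuous experiment: each
    never-observed arm gets a manually chosen proportion eps, and the
    Thompson proportions of the previously observed arms are rescaled
    to the remaining traffic budget."""
    assert n_new * eps < 1.0, "new arms cannot take the whole budget"
    rest = 1.0 - eps * n_new
    return np.concatenate([props_old * rest, np.full(n_new, eps)])

# Three observed arms with Thompson proportions (0.5, 0.3, 0.2),
# plus one brand-new arm given 10% of the traffic.
props = continuous_proportions(np.array([0.5, 0.3, 0.2]), n_new=1, eps=0.1)
```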
5.2 Posterior Update in Continuous Experiment
The posterior at round $t$ is updated as we observe rewards for the arms in $A_t$. Again, if the reference arm of the posterior's parameterization at the previous round is not in $A_t$, we change the reference arm to one in $A_t$ and transform the parameterization accordingly. Note that in the preceding allocation step we transform with respect to the arms in $A_t$, whereas for the posterior update we transform the posterior over all arms in $S_t$. The previous posterior is marginalized with respect to the intercept, as in equation (7). Since all parameters are correlated, the parameters of arms in $S_{t-1} \setminus A_t$ are also updated. As long as $S_t$ grows while condition (8) holds at each round, the posterior over all arms in $S_t$ accumulates successfully.
Information about the relative performance of the arms in $S_t$ is efficiently summarized in the posterior, including for arm pairs that have never been observed in the same round. For example, assume $A_1 = \{A, B, C\}$ and $A_2 = \{B, C, D\}$. Even though the two arms $A$ and $D$ are never compared directly in the same round, the indirect information from their odds ratios with respect to a common reference, B or C, is summarized in the posterior.
6 Experiment
6.1 Simulation Study
We compared three Thompson sampling policies for binary rewards: FullTS, ORTS, and Beta-Bernoulli Thompson sampling (BetaTS). We simulated various environments and investigated the behaviour of the three policies.
We set ten arms, one of which has a greater mean reward than the other nine. Specifically, the nine suboptimal arms share a common mean reward and the optimal arm a higher one. At each round, all arms receive a common background effect $\epsilon_t$ generated from a Gaussian distribution, $\epsilon_t \sim N(0, \sigma^2)$, in logit scale. The level of the time-varying effect is controlled by the variance $\sigma^2$; we report it as a scaled metric representing its size relative to the mean-reward difference between the optimal and suboptimal arms. We repeated the simulation 100 times with 50 rounds per simulation, each round consisting of 10,000 trials. The mean cumulative regret is shown in Figure 1.
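The simulated environment can be reproduced in outline as follows; this is a sketch, and since the exact reward levels are not restated here, the numbers below are placeholders.

```python
import numpy as np

def simulate_round(base_logit, sigma, n_alloc, rng):
    """One simulated round: a common background effect eps ~ N(0, sigma^2)
    shifts every arm's reward probability in logit scale (so the arm
    ordering is preserved), then binomial rewards are drawn for the
    allocated trials."""
    eps = rng.normal(0.0, sigma)
    p = 1 / (1 + np.exp(-(base_logit + eps)))
    return rng.binomial(n_alloc, p)

rng = np.random.default_rng(0)
p_true = np.array([0.04] * 9 + [0.05])   # placeholder levels, 10 arms
base_logit = np.log(p_true / (1 - p_true))
clicks = simulate_round(base_logit, sigma=0.3, n_alloc=np.full(10, 1000), rng=rng)
```

Because the shift is common to all arms, the identity of the best arm never changes; only policies that carry the intercept across rounds are misled by it.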
In both panels, BetaTS and FullTS have very similar regret, as expected. In the environment with no time-varying effect (left panel), ORTS has slightly greater regret than the other two policies. However, when there is a time-varying effect (right panel), ORTS is robust to the effect, while BetaTS and FullTS are affected severely.
6.2 Experiment Based on Real Digital Advertisement Data
We investigated the performance of the three policies on real digital advertisement data from online messenger users. Equal numbers of users were exposed to four different listings, sub-modules of a mobile page, for 18 days. Setting each listing as an arm, we consider the multi-armed bandit problem of finding the listing with the maximum CTR among the four. Partway through the experiment, the listings were moved up to the very top of the screen. Due to this change, the arms' performance changed: page views more than doubled, while CTRs decreased because of the increase in views. The data are therefore expected to contain two kinds of varying effects: a minor daily effect and a major effect between the two periods.
We set the ground-truth CTRs to the values estimated from the real data and simulated the multi-armed bandit. To investigate the behaviour of the policies more efficiently, we assume a total sample size per round that is a fraction of the real data's page views. We then record the expected rewards, i.e., click counts, per round for the three policies. Figure 2 shows the click counts for BetaTS, FullTS, and ORTS. From the figure, we can indirectly see the time-varying effect through the period. ORTS gained more clicks than the other policies, especially during the two days after the major event occurred. We infer that the temporal effects in days 1-10 (especially the low performance on one day) biased the posterior for BetaTS and FullTS, while ORTS remained robust to the effects. Over the period, ORTS is estimated to gain more clicks than BetaTS and FullTS.
7 Conclusion and Discussion
In this study, we investigated an alternative expression of the logistic model for Thompson sampling. Based on this expression, we showed that an odds-ratio-only update makes the multi-armed bandit policy robust in many practical environments. We therefore believe that Odds-Ratio Thompson Sampling can be considered a standard policy for binary-reward data in batch-update settings.
This study focused on the case of a common background effect, which does not change the optimal arm. Real data, however, can be confounded by the optimal arm itself changing over time, in which case plain ORTS would also be affected. In this case, we can use discounted TS [8] or an aggressiveness parameter [3]; for example, we can multiply the prior term of (3) by a decay parameter $\gamma$ ranging in $(0, 1]$.
We have discussed a base logistic model; its extension to other generalized linear models or to contextual bandits is straightforward. Our implementation of ORTS and FullTS is available at http://github.com/sulgik/orts.
References
[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
[2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[3] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
[4] Giuseppe Burtini, Jason Loeppky, and Ramon Lawrence. A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757, 2015.
[5] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[6] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
[7] Rocío Cañamares, Marcos Redondo, and Pablo Castells. Multi-armed recommender system bandit ensembles. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 432–436, 2019.
[8] Vishnu Raj and Sheetal Kalyani. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727, 2017.
[9] Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2025–2034. ACM, 2016.
[10] Giuseppe Burtini, Jason L Loeppky, and Ramon Lawrence. Improving online marketing experiments with drifting multi-armed bandits. In ICEIS (1), pages 630–636, 2015.
[11] Neha Gupta, Ole-Christoffer Granmo, and Ashok Agrawala. Thompson sampling for dynamic multi-armed bandits. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1, pages 484–489. IEEE, 2011.
[12] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.