Many scenarios arise where a decision-making agent must learn the best action to present to a user, while maximizing the cumulative reward [Teo et al.2016, Sawant et al.2018]. Examples of such applications are present in e-commerce, recommender systems, and the travel industry, to name a few. A main challenge in these scenarios is the trade-off between the exploration required to learn the unknown environment, and the exploitation needed to maximize reward.
One way of solving this explore/exploit trade-off is by using the multi-armed bandit (MAB) approach, originally proposed by Robbins [Robbins1952]. Since then, many algorithms have been proposed to solve the MAB problem [Gittins1989, Auer et al.2002, Garivier and Cappé2011, Bubeck et al.2012, Cesa-Bianchi et al.2017, Riquelme et al.2018].
The focus of this work is on Bayesian MABs. Most such algorithms are based on Thompson Sampling (TS) [Thompson1933], where an action is selected proportionally to its probability of being optimal, conditioned on previous observations. TS has an optimal regret bound [Agrawal and Goyal2013a, Kaufmann et al.2012b] and has demonstrated promising empirical and theoretical guarantees [Chapelle and Li2011, Agrawal and Goyal2013b, Korda et al.2013].
Such Bayesian bandits often assume a non-informative prior [Graepel et al.2010, Scott2010, Kaufmann et al.2012a]. On day 1, their behavior is random. Our objective is to leverage Empirical Bayes (EB) techniques to extract better priors from such early randomized data. Such informative priors can improve MAB optimization and potentially lead to higher cumulative reward and shorter convergence time.
In essence, EB is a statistical inference procedure where the prior distribution is estimated empirically, in a frequentist fashion, from the data. It exploits the finding that large datasets of parallel situations carry within them their own Bayesian information [Efron and Hastie2016].
This work applies EB to Bayesian bandits to compute an informative prior in hindsight, then uses this prior to improve cumulative reward and convergence time. We apply EB on the first few days of random (or pseudo-random) MAB traffic to compute an empirical prior. We then rewind the bandit, re-training it on the same traffic, augmented with the empirical prior. Note that although we have a bandit use-case, our method can also be applied to Bayesian optimization in a standard classification setting, as we also show.
This approach was motivated by our production setting at Amazon, where we aim at optimizing a web-page layout with multiple components [Hill et al.2017]. In the simplest case, we have one feature per component value (e.g. “image2”), and one interaction feature between each pair of components (e.g. “image2 AND title1”). We refer to the first type of features as “first order features” and to the second type as “second order features”. These two feature types are functionally distinct; we expect the 1st order features to have a more pronounced effect on the outcome in most use-cases.
Let $N$ be the number of values per component (e.g. $N$ possible images). As time progresses, we observe each 1st order feature value at a rate of $1/N$, while we observe each 2nd order feature value at a slower rate of $1/N^2$. Is it possible to learn the 1st order weights first, then move to the 2nd order weights as more data becomes available? Can we decouple the learning rates of these two categories of features? It turns out that our EB method can achieve exactly that.
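As an illustration of these rates (a toy sketch with hypothetical numbers, not production data), consider two components with $N$ values each, with layouts shown uniformly at random:

```python
import itertools

# Toy sketch (hypothetical numbers): two components, N values each,
# layouts shown uniformly at random.
N = 3
layouts = list(itertools.product(range(N), repeat=2))

# Fraction of layouts exhibiting a given 1st order feature value
# (component 1 shows value 0): 1/N.
p_first = sum(1 for a, b in layouts if a == 0) / len(layouts)

# Fraction exhibiting a given 2nd order feature value
# (component 1 shows value 0 AND component 2 shows value 0): 1/N^2.
p_second = sum(1 for a, b in layouts if a == 0 and b == 0) / len(layouts)

print(p_first, p_second)  # 1/N vs 1/N^2
```

Each 2nd order feature value is thus seen $N$ times more rarely than a 1st order one, which is what motivates decoupling their learning rates.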
2 Empirical Prior Estimation
Our problem of interest is one where features can be grouped into two or more groupings. For example, in a recommendation setting, one can distinguish between item and user features. In a personalization setting, one can distinguish between non-interaction features (e.g. “gender”) and interaction features (e.g. “female user likes action movies”).
Bayesian generalized linear bandits model each feature effect using an underlying (typically Gaussian) distribution, starting with a (usually standard normal) non-informative prior [Filippi et al.2010, Chapelle and Li2011]. Let $\mathcal{N}(\mu_{i,t}, \sigma_{i,t}^2)$ be the model weight distribution associated with feature $i$ at time $t$. As time progresses, the bandit learns from its interactions with the environment, updating the features’ weight distributions accordingly. In a stochastic setting [Bubeck and Cesa-Bianchi2012], as $t \to \infty$, we have $\mu_{i,t} \to \mu_i^*$ and $\sigma_{i,t}^2 \to 0$, where $\mu_i^*$ is the true feature effect.
By grouping the features into non-overlapping categories, we impose a Bayesian hierarchical model. We assume that each category $c$ has a distinct hyperparameter meta-prior distribution $\mathcal{N}(\mu_c, \sigma_c^2)$. We assume that each feature’s true effect is drawn from that feature’s category meta-prior:

$\mu_i^* \sim \mathcal{N}(\mu_c, \sigma_c^2) .$

We also assume that the observed feature effect $\mu_{i,t}$ of feature $i$ is drawn from a Gaussian with a mean equal to the true effect $\mu_i^*$, and a variance equal to its observed variance:

$\mu_{i,t} \sim \mathcal{N}(\mu_i^*, \sigma_{i,t}^2) .$

Based on our assumptions, the following holds for any feature $i$ in category $c$:

$\mu_{i,t} \sim \mathcal{N}(\mu_c, \sigma_c^2 + \sigma_{i,t}^2) .$
For each category $c$, we perform a variance decomposition of the observed feature effects:

$\mathrm{Var}_c(\mu_{i,t}) = \sigma_c^2 + \mathbb{E}_c[\sigma_{i,t}^2] \approx \sigma_c^2 + \frac{1}{n_c}\sum_{i \in c}\sigma_{i,t}^2 .$

Here $n_c$ is the number of features in category $c$, and the expectations are taken over random feature draws from $c$. On the other hand, as the model interacts with the environment and collects data, it updates its estimates of $\mu_{i,t}$ and $\sigma_{i,t}^2$. Using the basic variance formula $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, we obtain another estimate of $\mathrm{Var}_c(\mu_{i,t})$:

$\mathrm{Var}_c(\mu_{i,t}) \approx \frac{1}{n_c}\sum_{i \in c}\mu_{i,t}^2 - \Big(\frac{1}{n_c}\sum_{i \in c}\mu_{i,t}\Big)^2 .$

Equating the two estimates and solving for $\sigma_c^2$ yields, per category $c$:

$\hat{\sigma}_c^2 = \frac{1}{n_c}\sum_{i \in c}\mu_{i,t}^2 - \hat{\mu}_c^2 - \frac{1}{n_c}\sum_{i \in c}\sigma_{i,t}^2 , \quad \text{(Equation 6)}$

where $\hat{\mu}_c$ is the unbiased estimate of the true parameter $\mu_c$, with estimation noise shrinking as $n_c$ grows. One can solve for $\mu_c$ by an empirical mean estimation:

$\hat{\mu}_c = \frac{1}{n_c}\sum_{i \in c}\mu_{i,t} . \quad \text{(Equation 7)}$
Nevertheless, we set $\mu_c = 0$ to ensure the model is invariant to input feature sign changes. As long as one includes a bias (intercept) term in the generalized linear model, setting $\mu_c$ to any value has little effect, as its value will be absorbed into the bias term. We indeed confirmed this hypothesis in preliminary experiments, where we compared setting $\mu_c = 0$ to setting it using Equation 7. By fixing $\mu_c = 0$, the estimator $\hat{\sigma}_c^2$ gains one degree of freedom, ensuring the sample variance denominator is $n_c$ and not $n_c - 1$. Equation 6 simplifies to:

$\hat{\sigma}_c^2 = \frac{1}{n_c}\sum_{i \in c}\left(\mu_{i,t}^2 - \sigma_{i,t}^2\right) . \quad \text{(Equation 8)}$
To ensure a non-degenerate $\hat{\sigma}_c^2 > 0$, one can enforce a minimum value threshold, such as a small value divided by the number of features. Our method is a parametric (zero-mean Gaussian) g-modeling EB approach [Efron and Hastie2016], where we aim at estimating the prior variance $\sigma_c^2$.
In practice, we start the model with a non-informative prior. At some small timestep $T$ (even at the end of the first batch in a batch setting, where all the data is random), we compute the empirical Bayes priors using Equation 8. We then restart the model using the new informative prior, and retrain it using the data from the elapsed timesteps. One can repeat these steps at multiple timesteps, each time re-computing a new $\hat{\sigma}_c^2$ in an expectation-maximization fashion, but the approximation after one round is likely sufficient.
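The estimation step above can be sketched in a few lines (a minimal illustration of the zero-mean simplification in Equation 8; the feature posteriors below are hypothetical, and `min_var` plays the role of the degeneracy threshold discussed earlier):

```python
import numpy as np

def empirical_prior_variance(mu, sigma2, min_var=1e-3):
    """Zero-mean EB prior variance for one feature category:
    sigma_c^2 = mean(mu_i^2) - mean(sigma_i^2), floored at min_var
    to avoid the degenerate (negative) case."""
    mu, sigma2 = np.asarray(mu), np.asarray(sigma2)
    return max(float(np.mean(mu ** 2) - np.mean(sigma2)), min_var)

# Hypothetical posterior means/variances for a 4-feature category:
var_c = empirical_prior_variance([0.8, -0.6, 0.4, -1.0],
                                 [0.1, 0.1, 0.2, 0.2])
print(var_c)  # 0.54 - 0.15 = 0.39

# Early in training the posteriors still hug the N(0,1) prior and the
# raw estimate turns negative; the floor kicks in (returns 1e-3):
print(empirical_prior_variance([0.01, -0.02], [0.98, 1.01]))
```

One such call per feature category yields the per-category variances used to restart the model.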
3 Simulation Data and Pre-processing
3.1 Simulation Data set
In order to validate our method and test its generalization, we first report simulations on a public optimization dataset from a different domain. We pick the Adult dataset, available from the UCI Machine Learning Repository [Blake and Merz1998]. The target is to predict whether one’s income exceeds $50,000 per year based on census data. The train and test datasets have 30,162 and 15,060 observations respectively, after removing rows with empty feature values. The dataset has 13 categorical features (see Table 1).
[Table 1: the 13 categorical features of the Adult dataset and their number of values, e.g. capital gain (3) and capital loss (2).]
The original 13 features constitute our 1st order features. We mimic our production setting by generating second order features through pairwise combinations of first order features. We add all these 2nd order interaction terms (78 in total) to the Adult dataset, forming our raw feature vector. These 1st order and 2nd order feature groupings form a natural categorization for the $\hat{\sigma}_c^2$ computation given in Equation 8.
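The feature expansion can be sketched as follows (an illustrative helper with made-up feature names; the paper’s exact encoding may differ):

```python
from itertools import combinations

def add_second_order(features):
    """Augment a dict of 1st order categorical features with all
    pairwise 2nd order interaction features."""
    expanded = dict(features)
    for (f1, v1), (f2, v2) in combinations(sorted(features.items()), 2):
        expanded[f"{f1} AND {f2}"] = f"{v1} AND {v2}"
    return expanded

row = {"image": "image2", "title": "title1", "button": "button3"}
print(add_second_order(row))  # 3 first order + 3 pairwise features

# For the Adult dataset's 13 features, this yields 13 + C(13,2) = 91:
adult = {f"f{i}": f"v{i}" for i in range(13)}
print(len(add_second_order(adult)))  # 91
```

The 78 pairwise terms match the $\binom{13}{2}$ combinations mentioned above.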
To simulate daily updates, we divide the training set equally into 6 batches of 5,027 examples each. We train the probit linear model on the day 1 batch, starting with a $\mathcal{N}(0,1)$ prior. We then apply our EB approach described in Section 2 to compute an empirical prior $\mathcal{N}(0, \hat{\sigma}_1^2)$ over the first order features, and $\mathcal{N}(0, \hat{\sigma}_2^2)$ over the second order features. This resulted in degenerate negative $\hat{\sigma}_c^2$ values.
Here we are suffering from an inappropriate initial prior, which is precisely our point. As we had only observed a small number of data points, the default $\mathcal{N}(0,1)$ prior still dominated the posterior used to compute the empirical priors: $\mu_{i,t}$ was still close to $0$, and $\sigma_{i,t}^2$ close to $1$. With $\mu_{i,t}^2 \approx 0$ and $\sigma_{i,t}^2 \approx 1$, Equation 8 returns $\hat{\sigma}_c^2 \approx -1 < 0$. This observation suggests that our first batch is too small for empirical prior estimation.
3.2 Data Pre-processing
Here lies the essence of our challenge: starting from an informative prior helps most in small traffic cases, yet small traffic hinders proper computation of the empirical prior. On the other hand, we know that the values of $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ increase with the number of samples, due to the reduction of the posterior variances $\sigma_{i,t}^2$. We resolve our challenge by bootstrapping the day 1 batch (to a total of 40,000 instances) and running the model on the bootstrapped data, in order to perform our empirical prior computations. We keep $\mathcal{N}(0,1)$ as the initial prior, as this is the default non-informative prior used in many applications.
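The bootstrapping step itself is plain resampling with replacement (a simple sketch; the 5,027 and 40,000 sizes come from the text, the seed is arbitrary):

```python
import random

def bootstrap(batch, target_size, seed=0):
    """Resample a batch with replacement up to target_size, so the
    posteriors move far enough away from N(0,1) for EB estimation."""
    rng = random.Random(seed)
    return [rng.choice(batch) for _ in range(target_size)]

day1 = list(range(5027))       # stand-in for the day 1 batch
boot = bootstrap(day1, 40000)  # bootstrapped to 40,000 instances
print(len(boot))               # 40000
```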
One may suspect that not all features in a model are relevant. To that effect, and to counterbalance the bootstrapping, we prune our bootstrapped model by applying the adaptive lasso. We pick the adaptive lasso for its oracle properties: it can identify the right subset of features to retain, as if the true underlying model were given in advance [Zou2006].
The objective function for the adaptive lasso is:

$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_{j} \hat{w}_j |\beta_j| ,$

where $\lambda$ is the shrinkage parameter estimated using cross validation, $\beta$ is the model weight vector where each $\beta_j$ is the weight coefficient of feature $j$, $y$ is the response vector, and $X$ the design matrix. $\hat{w}$ is the adaptive weight vector, defined as $\hat{w}_j = 1/|\hat{\beta}_j^{\,init}|^{\gamma}$, where $\hat{\beta}_j^{\,init}$ is an initial estimate of coefficient $\beta_j$ obtained by performing ridge regression. Moreover, $\gamma$ is a positive constant that adjusts the adaptive lasso weight vector, and is set to a fixed constant in our experiments. Figure 1 shows an example of adaptive lasso feature selection using the glmnet package [Friedman et al.2010].
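A self-contained numerical sketch of this two-step procedure (ridge initial estimates, then a reweighted lasso) on synthetic data; this is a NumPy toy rather than the glmnet call, with an arbitrary fixed $\lambda$ instead of cross validation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain lasso (0.5*||y - Xb||^2 + lam*||b||_1) via cyclic
    coordinate descent with soft-thresholding."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # residual without feature j
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (X[:, j] @ X[:, j])
    return beta

def adaptive_lasso(X, y, lam, gamma=1.0, ridge=1e-2):
    """Adaptive lasso as a reweighted lasso: ridge regression gives
    initial coefficients, each column is rescaled by |b_ridge|^gamma,
    a plain lasso runs on the rescaled design, and the scaling is
    undone on the way out."""
    p = X.shape[1]
    b_ridge = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ y)
    w = np.abs(b_ridge) ** gamma + 1e-12
    return lasso_cd(X * w, y, lam) * w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(200)
beta = adaptive_lasso(X, y, lam=5.0)
print(np.round(beta, 2))  # only the two true features survive
```

The adaptive weights penalize features with small initial estimates more heavily, which is what drives the irrelevant coefficients exactly to zero.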
4 Simulation Experiments
In this section, we use the terms “batch” and “day” interchangeably, as it is more intuitive to think in terms of a temporal framework. One could use any arbitrary time period; “day” is simply an example. Although we focus on batch updates, our method works equally well with online updates. We ran our simulations on the Adult dataset.
4.1 Scenarios Description
For our generalized linear model, we use the probit regression of [Graepel et al.2010]. We consider three scenarios that differ only in what happens at the end of a pre-specified timestep $T$. At the start of the experiment, we initialize the three models using a standard normal prior $\mathcal{N}(0,1)$. At the end of each day, the models are trained in batch with the day’s observed data. At the end of day $T$, the following three scenarios are applied:
BLIP: Our base model, which stands for Bayesian LInear Probit. We simply update the model in batch with day $T$ data.
BLIPBayes: We reset the model. We bootstrap all the data observed until day $T$ (as per Section 3.2) and train the model on the bootstrapped data. We then use the Empirical Bayes computation of Section 2 to compute separate informative priors for the 1st and 2nd order features. We then restart the model with a $\mathcal{N}(0, \hat{\sigma}_1^2)$ prior for the 1st order features, and a $\mathcal{N}(0, \hat{\sigma}_2^2)$ prior for the 2nd order features. We update the new EB model with the original data observed up until day $T$.
BLIPTwice: We update the model twice, first with the same bootstrapped data as BLIPBayes, and second with the day $T$ data. The rationale is that BLIPTwice uses the same amount of data as BLIPBayes.
We train the three models on the same batches. At the end of day $T$, the adaptive lasso prunes the feature space identically for the three scenarios. We start evaluating the models after the day $T$ update. At the end of each batch, we evaluate the models on the holdout testing set using binary log loss (cross-entropy). Let $n$ be the number of observed data points, $y_i$ the true binary label of instance $i$, and $p_i$ the model’s predicted probability of $y_i = 1$; then:

$\mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right] .$
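The evaluation metric translates directly into code (a one-to-one sketch of the formula above):

```python
import math

def binary_log_loss(y_true, p_pred):
    """Binary cross-entropy: -(1/n) * sum(y*log(p) + (1-y)*log(1-p))."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

# Confident, correct predictions give a low loss:
print(binary_log_loss([1, 0], [0.9, 0.1]))   # ~0.105
# Uninformative predictions give log(2):
print(binary_log_loss([1, 0], [0.5, 0.5]))   # ~0.693
```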
4.2 Simulation Results
4.2.1 First Order Feature Effect
Based on the findings of Section 3.2, after day 1 we compute the hierarchical empirical prior over bootstrapped data pruned using the adaptive lasso. The adaptive lasso retained 7 first order and 11 second order features. As we surmise that the first order features may hold more predictive power during the first batches, we also test retaining all 13 first order features alongside the 11 pruned second order features. Each variant yields its own pair of empirical estimates $(\hat{\sigma}_1^2, \hat{\sigma}_2^2)$.
Figure 2 plots the log loss of our scenarios. We observe that BLIPBayes outperforms BLIP and BLIPTwice in both cases. We suspect that the poor performance of BLIPTwice is due to overfitting to the first batches. Retaining all first order features improves all three methods’ prediction accuracies.
We also note that keeping all first order features results in markedly better performance for BLIPBayes. This may be due to a better estimate of $\hat{\sigma}_1^2$. In our subsequent experiments, we use the adaptive lasso to prune the second order features, while retaining all first order features.
4.2.2 Effect of Prior Reset Time
In this experiment, we reset the empirical prior at the end of day 3, after observing roughly 15,000 samples. Adaptive lasso pruning retains 11 second order features, and EB yields new empirical estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$.
Figure 3 compares the log loss for BLIP, BLIPBayes with a day 1 reset, and BLIPBayes with a day 3 reset. BLIPBayes gives lower log loss values, and therefore higher prediction accuracy, when using more data for constructing the prior. As we observed when comparing all vs. selected 1st order features, more data improves performance.
4.2.3 Small Batch Dataset
In this experiment, we divide the train dataset into thirty batches of 1,000 data points each. Our objective is to observe the performance of EB when training occurs over a longer period with smaller data batches. We reset the prior at the end of an early batch, but this time we bootstrap only 12,000 instances. The adaptive lasso retains 21 second order features, and EB yields new estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$.
Figure 4 plots the log loss over the thirty days. BLIPBayes outperforms both other methods. What is remarkable is that the BLIPBayes advantage persists over the whole range, indicating that such a method can be especially valuable for small batch training.
4.2.4 Effect of Prior Variance
Could it be that our improvements simply stem from the fact that our empirical prior variances are below 1, their non-informative counterpart? Recall the EB estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ from our first experiment setting. We hereby experiment with additional manually-set $(\sigma_1^2, \sigma_2^2)$ values, both above and below the EB estimates.
Figure 5 plots the log loss for the aforementioned settings, alongside the EB values (“optimal”) and BLIP. We observe that the optimal BLIPBayes consistently outperforms all other variations. We also note that BLIP (with its $\mathcal{N}(0,1)$ prior) outperforms the non-optimal BLIPBayes versions (with a negligible overlap after day 5). This suggests that one needs to set the hierarchical priors in a principled manner, and that a smaller prior variance is not necessarily better.
We also note that the smallest variance setting achieves the worst results by far, and that settings below the EB estimates underperform those above. This suggests that erring towards a large prior variance is more easily overcome by data than erring towards a small one. In fact, in a Bayesian setting, the relative effect of the observed data on the posterior increases with a larger prior variance.
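This last point follows directly from the conjugate Gaussian update. A quick sketch (hypothetical numbers: true effect 1.0, unit observation noise) shows how a wide prior lets the data dominate the posterior far faster than a tight one:

```python
def gaussian_posterior_mean(mu0, var0, xbar, var_obs, n):
    """Posterior mean of a Gaussian mean with known observation
    variance: a precision-weighted blend of prior and data."""
    precision = 1.0 / var0 + n / var_obs
    return (mu0 / var0 + n * xbar / var_obs) / precision

# After n = 10 observations averaging 1.0 (unit noise):
tight = gaussian_posterior_mean(0.0, 0.01, 1.0, 1.0, 10)
wide = gaussian_posterior_mean(0.0, 100.0, 1.0, 1.0, 10)
print(round(tight, 3), round(wide, 3))  # 0.091 vs 0.999
```

With a tight prior the posterior barely moves off zero after ten observations, while with a wide prior it has essentially reached the data mean.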
5 MAB Live Experiments
We now examine the performance of EB on the live production system in Amazon described in [Hill et al.2017].
5.1 Experimental Settings
We aim at optimizing a message that promotes the purchase of an Amazon service. The message had 4 components with 2-3 possible options per component, for a total of 24 distinct combinatorial layouts. The target is binary, whether the customer purchased the service or not. The messages were shown to the selected customers during a browsing session on Amazon.com on desktop browsers.
Our model is a TS generalized linear MAB with a probit link function and a $\mathcal{N}(0,1)$ prior [Graepel et al.2010, Teo et al.2016]; its core is the same model used for classification in Section 4.1. Although our formulation can take user context into account, this work investigates the case where the features only reflect layout content. Our 1st order and 2nd order feature groupings form a natural categorization for the $\hat{\sigma}_c^2$ computation.
We performed A/B tests with three treatments, a production baseline algorithm, the standard probit bandit, and EB applied to the probit bandit. We randomly diverted a constant subset of our traffic to this experiment, with the standard and EB bandits receiving equal shares. The baseline algorithm has a different pool of messages, and dynamically adjusts to seasonal shifts in an adversarial manner. We disabled seasonality adjustment for both standard and EB bandits, deploying them as stochastic MABs.
We start both bandits with a random phase, where the bandits allocate traffic equally among their 24 layouts. We then compute the empirical prior and re-train the EB MAB on the random data. We neither prune features nor bootstrap the data for the EB computation. As we do not know ground truth, we cannot compute regret or log loss. Instead, we compute the daily and cumulative success rate of each bandit. We exclude the random phase from our plots and analysis.
5.2 Live Results
5.2.1 Traffic Effect
In the first experiment, we diverted a constant traffic percentage to our MABs, with a fixed-length random phase. We ran the experiment for 15 time units during a seasonally stable period. EB yielded empirical estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$. Figure 6 plots the cumulative success rate of each bandit, relative to the baseline’s final cumulative success rate.
The EB MAB clearly dominates the standard MAB in cumulative reward. We notice that the EB MAB stabilizes after four time units, while the standard MAB needs seven to plateau: an indication that the EB MAB converged faster. Afterwards, both bandits attain the same performance, maintaining a steady success rate.
We followed this experiment with a shorter one (4 time units total), also during a seasonally stable period, where we reduced the traffic by a quarter and shortened the random phase. EB again yielded empirical estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$.
At the end of the experiment, we compared the cumulative performance of both MABs against baseline using a two-tailed proportion z-test with pooled variance. The EB MAB significantly outperformed baseline, and converged faster than the standard MAB to the optimal layout: another indication that EB can be most valuable for small traffic cases.
5.2.2 Effect of Seasonality
To test the impact of seasonality, we ran our final experiment over 56 time units encompassing the holiday season, with pre- and post-season changes. We dialed up the traffic assigned to each MAB, doubling it on average, while tripling it in the first two weeks. The random phase length remained the same, and EB yielded empirical estimates $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$.
At this high level of traffic, both bandits converged at the same time and behaved indistinguishably (see Figure 7); the empirical prior had no effect. We also note that seasonality had an adverse effect on the stochastic bandits, as both lagged behind the seasonality-aware baseline.
6 Discussion and Future Work
6.1 Discussion
In all our simulation and live experiments, we always had $\hat{\sigma}_1^2 > \hat{\sigma}_2^2$, reflecting a difference in the effects of the 1st and 2nd order features. This confirms our initial conjecture that the second order features are likely to be less important. This result is likely to generalize to many applications.
By grouping the 1st order and 2nd order features together and imposing a hierarchical prior, we effectively clamped each category’s weights together. Since the relative effect of the observed data on the posterior increases with a larger prior variance, our method is putting more weight on the 1st order features, and is shrinking the 2nd order effect. The model thus focuses on learning the 1st order effects first. This may explain the increased stability and convergence speed of the EB model.
Of interest is how the improvement is correlated with the amount of available data. Our findings suggest that the EB improvement is most marked in cases of low to medium traffic, and is lost at high traffic. This is promising, as low traffic cases are the hardest to optimize.
At very low traffic, direct computation of the empirical prior variances may fail, with $\hat{\sigma}_c^2 \le 0$. One may either wait longer before computing the prior, add a threshold, bootstrap from the available data, or perform transfer learning.
6.2 Future Work
Our findings raise multiple questions and opportunities for future work. Our results suggest faster bandit convergence: does the empirical prior affect the bandit regret bounds?
So far, we computed the empirical prior using either random bandit data, or a full information (standard classification) setting. Can we compute such a prior from an active bandit (or another interactive logging policy)? And how should we debias such data?
We restricted our EB application to stochastic MABs, which effectively failed in a seasonal adversarial setting. Can our EB formulation be extended to adversarial MABs?
Finally, we used EB to effectively bootstrap an existing model using its own data. Can we also use EB on a different but related use-case, and use its priors as a transfer learning technique? This may be valuable if the related use-case has a large data volume, while the target scenario is highly sparse.
7 Conclusion
In this study, we present an informative prior estimation framework using empirical Bayes. Our method can be used to decouple the learning rates of feature groupings in any Bayesian optimization procedure. We demonstrate our technique using first and second order features in a generalized linear model.
Our empirical results reveal that initiating bandits with an empirical Bayes prior leads to higher cumulative reward and lower convergence time. We also show a similar improvement in prediction accuracy on a classification problem. In both cases, we found that the 1st order features tend to have a more pronounced effect on the target variable. Of special note are the observed improvements in cases of small traffic, leading us to believe that empirical Bayes may offer an adequate solution for the challenges of sparse-data optimization.
- [Agrawal and Goyal2013a] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. In AISTATS, pages 99–107, 2013.
- [Agrawal and Goyal2013b] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 127–135, Atlanta, Georgia, 2013. JMLR.
- [Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- [Berger1985] James O Berger. Statistical decision theory and Bayesian analysis. Springer Series in Statistics. Springer, New York, 2 edition, 1985.
- [Blake and Merz1998] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
- [Bubeck and Cesa-Bianchi2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [Bubeck et al.2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pages 1–14, 2012.
- [Carlin and Louis2010] Bradley P Carlin and Thomas A Louis. Bayes and empirical Bayes methods for data analysis. Chapman and Hall/CRC, 2010.
- [Cesa-Bianchi et al.2017] Nicolò Cesa-Bianchi, Claudio Gentile, Gábor Lugosi, and Gergely Neu. Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pages 6284–6293, 2017.
- [Chapelle and Li2011] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
- [Efron and Morris1972] Bradley Efron and Carl Morris. Limiting the risk of bayes and empirical bayes estimators–Part II: The empirical bayes case. Journal of the American Statistical Association, 67(337):130–139, 1972.
- [Efron and Hastie2016] Bradley Efron and Trevor Hastie. Computer age statistical inference, volume 5. Cambridge University Press, 2016.
- [Efron2012] Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.
- [Filippi et al.2010] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NIPS), pages 586–594. 2010.
- [Friedman et al.2010] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
- [Garivier and Cappé2011] Aurélien Garivier and Olivier Cappé. The kl-ucb algorithm for bounded stochastic bandits and beyond. In COLT, pages 359–376, 2011.
- [Gittins1989] J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley, Chichester, NY, 1989.
- [Graepel et al.2010] Thore Graepel, Joaquin Q Candela, Thomas Borchert, and Ralf Herbrich. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
- [Hill et al.2017] Daniel N. Hill, Houssam Nassif, Yi Liu, Anand Iyer, and S.V.N. Vishwanathan. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1813–1821, 2017.
- [Kaufmann et al.2012a] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On bayesian upper confidence bounds for bandit problems. In AISTATS, pages 592–600, 2012.
- [Kaufmann et al.2012b] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.
- [Korda et al.2013] Nathaniel Korda, Emilie Kaufmann, and Remi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456, 2013.
- [Maritz2018] Johannes S Maritz. Empirical Bayes Methods with Applications. Chapman and Hall/CRC, 2018.
- [Morris1983] Carl N. Morris. Parametric empirical bayes inference: Theory and applications. Journal of the American Statistical Association, 78(381):47–55, 1983.
- [Riquelme et al.2018] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In International Conference on Learning Representations (ICLR), 2018.
- [Robbins1952] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
- [Sawant et al.2018] N Sawant, C B Namballa, N Sadagopan, and H Nassif. Contextual multi-armed bandits for causal marketing. In Proceedings of the International Conference on Machine Learning (ICML’18) Workshops, Stockholm, Sweden, 2018.
- [Scott2010] Steven L Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
- [Teo et al.2016] Choon Hui Teo, Houssam Nassif, Daniel Hill, Sriram Srinivasan, Mitchell Goodman, Vijai Mohan, and S.V.N. Vishwanathan. Adaptive, personalized diversity for visual discovery. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), pages 35–38, Boston, MA, 2016.
- [Thompson1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- [Zou2006] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.