The multi-armed bandit paradigm is well known for managing exploration-exploitation tradeoffs in online learning. Over the past decade, bandit-based schemes have proven successful in optimizing many Web experiences. In particular, Contextual Multi-Armed Bandit (CMAB) schemes were shown to be well-suited to online experiences, with context often representing traffic or user characteristics (Li et al., 2010, 2011; Tang et al., 2015, 2013; Tewari and A. Murphy, 2017).
In the stochastic contextual bandits model, the world presents a sequence of requests to an algorithm. Each request is accompanied by a context vector of some fixed dimension. The algorithm responds to each request with one of K possible actions (arm pulls), whose reward distributions are (1) unknown; and (2) dependent on the context. After an action is chosen, its reward is observed and the algorithm can adapt. Given a context and an action, rewards are assumed to be independent and identically distributed.
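The protocol above can be sketched as a simple interaction loop. The toy environment, the ε-greedy learner, and all names below are illustrative stand-ins chosen for this sketch, not part of the system described in this paper:

```python
import random


class ToyEnv:
    """Two contexts, two arms; the best arm flips with the context."""
    means = {("mobile", 0): 0.7, ("mobile", 1): 0.2,
             ("desktop", 0): 0.1, ("desktop", 1): 0.8}

    def draw_context(self, rng):
        return rng.choice(["mobile", "desktop"])

    def draw_reward(self, ctx, arm, rng):
        # Bernoulli reward, i.i.d. given (context, arm).
        return 1.0 if rng.random() < self.means[(ctx, arm)] else 0.0


class EpsilonGreedy:
    """Per-(context, arm) mean-reward estimates; explores with probability eps."""
    def __init__(self, n_arms, eps=0.1, seed=1):
        self.n_arms, self.eps = n_arms, eps
        self.rng = random.Random(seed)
        self.counts, self.sums = {}, {}

    def choose(self, ctx):
        if self.rng.random() < self.eps:
            return self.rng.randrange(self.n_arms)  # explore
        est = [self.sums.get((ctx, a), 0.0) / max(self.counts.get((ctx, a), 0), 1)
               for a in range(self.n_arms)]
        return max(range(self.n_arms), key=est.__getitem__)  # exploit

    def update(self, ctx, arm, reward):
        self.counts[(ctx, arm)] = self.counts.get((ctx, arm), 0) + 1
        self.sums[(ctx, arm)] = self.sums.get((ctx, arm), 0.0) + reward


def run_bandit(policy, env, rounds=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rounds):
        ctx = env.draw_context(rng)              # world presents a request
        arm = policy.choose(ctx)                 # one of K possible actions
        reward = env.draw_reward(ctx, arm, rng)  # reward depends on context
        policy.update(ctx, arm, reward)          # algorithm adapts
        total += reward
    return total / rounds
```

A context-aware learner should beat the ~0.45 average reward of uniformly random pulls in this toy environment.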
Much of the literature on multi-armed bandits in general and contextual bandits in particular has focused on theoretical guarantees of the various algorithms, such as regret bounds. In contrast, relatively little has been published on the challenges facing the productization and ongoing operational application of contextual bandits schemes in large scale Web services. We address such aspects by discussing (1) tailoring the actual context to be used to the traffic volume of the optimized experience; (2) sanity-testing a running optimization process; (3) performing efficient offline analysis of CMAB schemes; (4) adding actions to already running optimizations; (5) respecting constraints that limit the actions that may be taken at each point in time; and (6) designing the system so as to enable iterative improvement of models. We explain the importance of each challenge and how we addressed it within a concrete system whose use cases and architecture we describe. Our solutions are practical, and in specific cases – e.g. the methodologies for monitoring continuity and stability of CMAB processes (Section 4.2) and the extension of the replay method (Section 4.3) – constitute novel contributions in and of themselves.
This paper is organized as follows. Section 2 surveys related work. Section 3 describes the use cases which we solve with contextual bandits, and the architecture of our contextual bandits system. Section 4 enumerates the challenges that any product-grade usage of contextual bandits should address, and presents how our system addresses those challenges. We conclude in Section 5.
2. Related Work
The contextual bandits setting appears in the literature under many different names and flavours, including bandit problems with side observations (WAN, 2005), bandit problems with side information (Lu et al., 2010), and bandit problems with covariates (Sarkar et al., 1991). The term contextual multi-armed bandits was coined by Langford and Zhang (Langford and Zhang, 2008). CMAB algorithms have been leveraged in many applications, from recommendation engines and advertising (Lai and Robbins, 1985; Li et al., 2010; Tang et al., 2013, 2015) to medicine and healthcare (Tewari and A. Murphy, 2017). See (Burtini et al., 2015; Tewari and A. Murphy, 2017) for detailed surveys.
Previous work examined how the performance of bandit schemes, which are inherently online, may be accurately evaluated in an offline manner. A commonly used technique is called Replay (Li et al., 2011; Mary et al., 2014; Langford et al., 2009). Swaminathan and Joachims framed this as a counterfactual risk minimization problem (Swaminathan and Joachims, 2015).
3. Use Cases and Architecture
We apply CMAB schemes – LinUCB (Li et al., 2010) in particular – with a unified learning and serving architecture, to two business problems:
- UI Optimization:
we serve billions of discovery widgets each day (see Figure 1). We have observed that seemingly small changes in the appearance or rendering of the widgets may dramatically impact user engagement. We thus have multiple designs of widget styles, modeled as bandit arms, that are selected and served given the context of the request (e.g. device type and screen size).
- Feed Optimization:
in addition to serving standalone widgets, we also support discovery feeds. A discovery feed is an infinite scroll experience that serves additional recommendations as the user scrolls through previous ones. In practice, our feeds are composed of a sequence of typed cards, where each card type is a coherent set of recommendations (e.g. about some common theme, or belonging to a certain vertical). The optimization problem here is to select the next card type to serve (those are the arms), given the context of the request and the last few cards already served in the feed. In essence, we are solving for order and frequency of card types.
To satisfy both these needs, we designed a system – depicted in Figure 2 – comprised of two layers: an offline training layer using aggregated data, and an online serving layer. A similar architecture was proposed by the FAME system (Lempel et al., 2012), which was also designed to optimize, among others, rendering and layout use-cases.
The offline layer is comprised of an Aggregations Database, a Training Service, and a Task Queue. The Task Queue is aware of all the different CMAB-instances running in the system (all optimization use cases), and periodically (every couple of minutes) enqueues requests to the Training Service to update the model of each active instance. Each request contains the set of actions (arms) that are active in that instance.
The Training Service holds in memory the models of all active CMAB-instances. It pulls update requests from the Task Queue and updates the relevant model, namely the weights of each of that model's arms. It does so by reading aggregated tuples stored in the Aggregations Database. Unlike theoretical bandit models, which are sequential decision processes where the model makes one decision at a time, immediately observing its reward and updating itself, the Web reality is different: our serving layer makes thousands of decisions per second. Rewards, mostly in the form of user clicks, arrive asynchronously several minutes after our serving decision has been made, and in particular after the model may have been called upon to perform tens or hundreds of thousands of subsequent decisions. We thus aggregate decisions and rewards in mini-batches, each spanning a couple of minutes of accumulated data, where each mini-batch includes tuples of the form ⟨context, arm, pulls, reward⟩: the number of times an arm was pulled in a given context, and the overall reward resulting from those pulls.
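The join-and-aggregate step described above might look as follows in a minimal sketch; the function and field names here are hypothetical, not the system's actual schema:

```python
from collections import defaultdict


def aggregate_minibatch(decisions, rewards):
    """Join logged serving decisions with asynchronously arriving rewards and
    aggregate them into <context, arm, pulls, total_reward> tuples, as stored
    in the Aggregations Database for the next model update.

    decisions: iterable of (request_id, context, arm)
    rewards:   iterable of (request_id, reward), e.g. clicks arriving minutes late
    """
    reward_by_request = defaultdict(float)
    for request_id, reward in rewards:
        reward_by_request[request_id] += reward

    batch = defaultdict(lambda: [0, 0.0])  # (context, arm) -> [pulls, reward]
    for request_id, context, arm in decisions:
        entry = batch[(context, arm)]
        entry[0] += 1                                        # one more pull
        entry[1] += reward_by_request.get(request_id, 0.0)   # its reward, if any
    return {key: tuple(value) for key, value in batch.items()}
```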
Once the Training Service updates a model of a CMAB-instance, it stores the output in the Model Holder Database, which acts as a data interface between the offline and serving layers.
The serving layer handles requests that correspond to active CMAB-instances. Per request, it computes, in real time, scores for all available arms and returns them. The decision or action corresponding to the highest-scoring arm is then served, and both the context and the decision are logged. Subsequent user interactions, such as clicks, are also logged with that same metadata, and the join of serving decisions and resulting user interactions is aggregated into the Aggregations Database, to be leveraged in the next batch of model updates.
4.1. Determining Context
Contextual Bandits literature assumes that the context of each arm pull is well defined. One of the biggest challenges in applications of CMABs is determining the context vector to apply to actual requests. There is often some broad world context which needs to be projected into a simpler context that will actually be plugged into the model. That projection is essentially a form of feature engineering that should consider the interplay between the context space and the amount of traffic (arm pulls and rewards) that the model will face. Optimizing a small-traffic experience with a large feature space is a recipe for over-fitting, especially when using popular yet simple models such as LinUCB (Li et al., 2010), which do not inherently reduce dimensionality as part of their internal operation.
In our setting – prior to starting the incremental learning of our CMAB model – no reward data is available to guide feature selection and engineering, nor to tune hyper-parameters. Since our (full) projected context might itself be too sparse to enable effective generalization in cases of limited traffic, we further project it into a coarser-grained context, in which each dimension is a binned representation of the corresponding original contextual dimension. We then plug the unified (concatenated) context into our CMAB model, enabling early generalization based on the coarse dimensions, followed by further refinement based on the full ones, as additional arms are pulled and rewards are observed.
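One way such a coarse-plus-full unified context might be built is sketched below; the bin boundaries and function names are illustrative assumptions, not the feature engineering actually used in our system:

```python
def coarsen(context, bins):
    """Project each numeric context dimension into a coarse bin index.
    bins[i] is a sorted list of right-open bucket boundaries for dimension i."""
    coarse = []
    for value, edges in zip(context, bins):
        bin_index = sum(1 for edge in edges if value >= edge)
        coarse.append(bin_index)
    return tuple(coarse)


def unified_context(context, bins):
    """Concatenate the coarse-grained projection with the full context, so the
    model can generalize early on the coarse part and refine on the full part."""
    return coarsen(context, bins) + tuple(context)
```

For example, with hypothetical screen-width buckets [100, 500), a raw width of 320 lands in coarse bin 1 while the exact value is retained alongside it.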
To further reduce overfitting due to a large context relative to limited traffic data, we employ regularization. In the LinUCB algorithm, regularization is introduced by a single parameter. Since that parameter cannot be tuned in advance of model initiation, we set it to some initial value that is later periodically adjusted based on replays of the model on recently logged data (see Section 4.3).
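For concreteness, a minimal disjoint-model LinUCB sketch in the spirit of Li et al. (2010), showing where the regularization parameter enters: it scales the identity matrix initializing each arm's covariance matrix. The parameter values (`lam`, `alpha`) are placeholders, not the tuned values used in production:

```python
import numpy as np


class LinUCBArm:
    """Disjoint-model LinUCB for a single arm; `lam` is the ridge
    regularization parameter discussed above (A is initialized to lam * I)."""
    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.A = lam * np.eye(dim)   # regularized design matrix
        self.b = np.zeros(dim)
        self.alpha = alpha           # width of the upper confidence bound

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                      # ridge-regression estimate
        ucb = self.alpha * np.sqrt(x @ A_inv @ x)   # exploration bonus
        return float(theta @ x + ucb)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


def choose_arm(arms, x):
    """Serve the highest-scoring arm for context x."""
    return max(range(len(arms)), key=lambda a: arms[a].score(x))
```

Larger `lam` shrinks the per-arm estimates toward zero, which is the knob the periodic replay-based adjustment would turn.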
Lastly, different forms of unsupervised dimensionality reduction may be employed, such as PCA or Random Projection (Yu et al., 2017). Such methods may also require matching the reduced context cardinality to the amount of available traffic, but are inherently less susceptible to a potential "curse of dimensionality".
4.2. Sanity Testing a Running Process
Imagine a CMAB process that has been running for a while. Especially if the context is rich, it may be difficult to determine whether the process is converging to a state that "makes sense", or whether there is some over-fitting or instability in its results. Two tests which we use to validate whether the algorithm has picked up some signal are (1) checking continuity; and (2) checking stability.
When checking for continuity, the basic assumption is that if the model outputs a distribution P(c) over the arms when given context c, then P(c) should be similar to P(c') whenever the distance between contexts c and c' is small. When checking for stability, the basic assumption is that if the CMAB outputs a distribution P_t(c) over the arms when given context c at time t, then P_t(c) should be similar to P_t'(c) whenever |t - t'| is small.
To illustrate how one might test for continuity, we took a use case whose context is defined by a one-hot encoded vector of length 9. We defined the distance between contexts as the Hamming Distance between their vectors; the observed non-zero distances span a range of values. We then used KL-Divergence to measure the distance between the distributions of arms pulled by the algorithm in each context.
Figure 3 presents the average KL-Divergence value for each value of Hamming Distance between contexts. The thickness of each point represents the number of observed context pairs having the given Hamming Distance. As expected, we observe that the closer two contexts are, the closer the distributions over the served arms are.
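Such a continuity check can be sketched as follows; the smoothing constant and the shape of the inputs are assumptions of this sketch, not our production monitoring code:

```python
import math
from itertools import combinations


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two arm distributions given as dicts, with light
    smoothing so that arms unseen in one distribution do not blow up."""
    arms = set(p) | set(q)
    return sum(p.get(a, eps) * math.log(p.get(a, eps) / q.get(a, eps))
               for a in arms)


def continuity_report(dist_by_context, distance):
    """Average KL divergence between served-arm distributions, grouped by the
    pairwise context distance (e.g. Hamming distance), as in Figure 3."""
    by_distance = {}
    for c1, c2 in combinations(dist_by_context, 2):
        d = distance(c1, c2)
        by_distance.setdefault(d, []).append(
            kl_divergence(dist_by_context[c1], dist_by_context[c2]))
    return {d: sum(v) / len(v) for d, v in by_distance.items()}
```

A healthy process should show the average divergence growing with the context distance.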
Regarding the evaluation of stability, denote by P_[t,t+d](c) the distribution of arms served (pulled) during the time span [t, t+d] given context c. Figure 4 plots the average, over all contexts, of the KL-Divergence between the arm distributions of two such windows, one hour apart, as a function of t, the age of the learning instance. We used a window length d of a few minutes and a lag of one hour. We can see that when the instance is young and mostly exploring, one hour of observing rewards can result in large changes in how arms are pulled. As the instance matures, it shifts towards more exploitation, and the hourly changes in the distribution of arm pulls per context become smaller.
When employing LinUCB, another way to assess its stability is to examine its exploitation ratio. We define this as the fraction of arms, pulled by the instance, that would also have been pulled by a greedy scheme that selects the arm with the highest expected reward, ignoring confidence interval (standard deviation) considerations. Figure 5 plots the exploitation ratio as a function of the age of the LinUCB instance. As Figure 5 shows, initially there is little agreement between the actual arms pulled and the greedy leader, meaning that the instance is exploring and that upper confidence bound considerations heavily influence the choice of the arm to pull. Conversely, after 2 days, the instance stabilizes and its arm pulls mostly agree with the greedy choice that maximizes the expected reward, i.e. the instance is mostly exploiting.
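The exploitation-ratio computation reduces to a simple comparison against the greedy leader; the log format below is a hypothetical simplification:

```python
def exploitation_ratio(pull_log):
    """Fraction of logged pulls where the arm the instance actually chose
    (via its full UCB score) agrees with the purely greedy choice that
    maximizes the expected reward alone, ignoring confidence intervals.

    pull_log: iterable of (chosen_arm, expected_reward_per_arm_dict) pairs.
    """
    agree = total = 0
    for chosen_arm, expected in pull_log:
        greedy_arm = max(expected, key=expected.get)  # the greedy leader
        agree += (chosen_arm == greedy_arm)
        total += 1
    return agree / total if total else 0.0
```

A ratio climbing toward 1 over the instance's lifetime matches the exploration-to-exploitation shift seen in Figure 5.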
4.3. Offline Analysis
Many machine learned models are trained offline and are measured by offline loss functions. Models exhibiting high loss are shelved. The more promising models can be pushed to production gradually, starting slow (affecting small percentages of traffic) and accelerating as online measurements arrive and show that the new model is performing well. However, this tried and tested methodology does not translate well to online algorithms such as CMAB processes. In particular, it is counter-productive to deploy such models to production on a small fraction of traffic, since that effectively limits the algorithm's learning.
To be able to evaluate the performance of CMAB models offline, Langford et al. proposed the replay approach (Langford et al., 2009). To perform replay, one must serve an unbiased portion of traffic by pulling arms uniformly at random. Then, a CMAB model to be evaluated offline is fed the same stream of requests that was served randomly, and a learning step is performed whenever the algorithm selects the same arm that was randomly pulled, using the reward that was observed. This happens with probability 1/K, where K is the number of available arms. The evaluated model's metrics are measured on the rewards of the subset of matching arm pulls.
The main drawback of the replay approach is that it requires serving a lot of random traffic, especially when the number of arms is high. To address this issue, we modified Replay as follows. If the evaluated algorithm decides to pull arm a given context c at time t, we sample an observed reward from the set of random pulls of arm a given context c during the time window [t - w1, t + w2], where w1 and w2 are parameters. In a stationary stochastic setting, w1 and w2 can be set to infinity. In most practical situations, they should be set to some finite application-dependent values, such that the distribution of rewards in the window is believed to model the reward that would have been observed had arm a been pulled at time t. The sampling can be with or without repetitions; sampling without repetitions might sometimes "run out" of rewards and then be unable to leverage a given time t.
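The window-based sampling step of this extended Replay might look as follows; the log format is a hypothetical simplification, and this sketch shows only sampling with repetitions:

```python
import random


def sample_replay_reward(random_log, arm, context, t, w_before, w_after,
                         rng=random):
    """Extended Replay: rather than requiring an exact random-pull match at
    time t, sample a reward from random pulls of `arm` under `context` within
    the window [t - w_before, t + w_after].

    random_log: list of (time, context, arm, reward) tuples collected from
                the uniformly-random serving slice.
    Returns a reward, or None if no eligible random pull falls in the window.
    """
    eligible = [r for (ts, c, a, r) in random_log
                if a == arm and c == context
                and t - w_before <= ts <= t + w_after]
    if not eligible:
        return None
    return rng.choice(eligible)  # with repetitions across successive calls
```

Sampling without repetitions would additionally remove the chosen log entry, which is what can "run out" of rewards.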
4.4. Adding Arms on the Fly
In many cases, a long-running CMAB process has matured, and is serving with little regret per context. The product then admits a new potential action/decision, modeled by the introduction of a new arm. While one can always stop the current CMAB process and start afresh with an expanded set of arms, that is highly inefficient as it loses the accrued learnings of the current model. It is preferable to dynamically add arms on the fly to a running process, continuing to leverage its historical learnings.
We address this challenge by leveraging the Task Queue. It fetches the list of available arms from an external service and passes them to the Training Service. Upon encountering a new arm, the training layer applies the initial model of its learning algorithm (e.g. LinUCB) and saves all arms' models to the Model Holder. In the serving layer, the new arm then competes with the other arms.
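The essential mechanism, lazily initializing a model only for arms not seen before while preserving all existing arms' state, can be sketched as follows; the class and its names are illustrative, not our Training Service's actual interface:

```python
class ArmRegistry:
    """Attach a freshly-initialized per-arm model whenever the arm list
    fetched from the external service contains an unseen arm, keeping every
    existing arm's accrued learnings untouched."""
    def __init__(self, new_model):
        self.new_model = new_model   # factory producing an arm's initial model
        self.models = {}

    def sync(self, active_arms):
        active = set(active_arms)
        for arm in active:
            if arm not in self.models:
                self.models[arm] = self.new_model()  # fresh, unlearned model
        # arms that are no longer active are simply not served
        return {arm: model for arm, model in self.models.items()
                if arm in active}
```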
4.5. Subjecting Decisions to Constraints
In real-life use-cases, some decisions (i.e. arms) may be forbidden at some points in time due to business rules. For example, in our Feed Optimization use-case, it may be forbidden to serve consecutive cards of the same type in the feed, or we may be required to serve at least one instance of a certain card type within the first k cards.
Kleinberg et al.'s Sleeping Bandits model (Kleinberg et al., 2010) addresses stochastic MAB settings where some arms may not be available at certain times. There, the authors proposed the Awake Upper Estimated Reward (AUER) algorithm and proved its effectiveness. We follow the same intuition in our contextual bandits setting: considering the constraints, the serving layer selects the highest-scoring eligible arm as its action.
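The eligible-arm selection amounts to masking "asleep" arms before taking the maximum; the feed rule below (no two consecutive cards of the same type) is one of the business rules mentioned above, while the card names are made up for the sketch:

```python
def choose_eligible_arm(scores, is_eligible):
    """AUER-style serving: among the arms that business rules allow right
    now, pick the one with the highest model score; None if all are asleep."""
    awake = [arm for arm in scores if is_eligible(arm)]
    if not awake:
        return None
    return max(awake, key=scores.get)


def choose_next_card(scores, recent_cards):
    """Feed Optimization example: never serve two consecutive cards
    of the same type."""
    last = recent_cards[-1] if recent_cards else None
    return choose_eligible_arm(scores, lambda card: card != last)
```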
4.6. Iteratively Improving the Model
We have a certain CMAB model in production, and have an idea for an improvement we believe would drive our metrics even further. Normally, we would run a controlled experiment, a.k.a. A/B test, pitting the production version (control) against the new version (treatment). In a traditional controlled experiment, we would split traffic in an unbiased manner between the control and treatment, annotate in logs which variant served each request, and report on the metrics resulting from each variant. However, that will not suffice for online learning algorithms: each CMAB process must derive its reward only from the requests that it served, and should not have access to the results of decisions made by the other variant. Neglecting this runs the risk of vicarious reinforcement, wherein a variant learns based on the arm pulls of its competitors.
We address this need by logging the test-id and variant-id of any controlled experiment on both its arm pulls and rewards. Our real-time aggregations are subsequently grouped by test and variant ids, and are stored in the Aggregations Database with the test and variant ids being part of the keyspace. Each model of the test reads aggregations from its corresponding keyspace, thus exposing to each variant only its own rewards.
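The keyspace isolation described above can be sketched as follows; the key layout and function names are hypothetical stand-ins for the Aggregations Database schema:

```python
def aggregation_key(test_id, variant_id, context, arm):
    """Keyspace for the Aggregations Database: prefixing every aggregate with
    (test_id, variant_id) guarantees that each variant's model reads back only
    the rewards of its own arm pulls, never its competitor's."""
    return (test_id, variant_id, context, arm)


def rows_for_variant(db, test_id, variant_id):
    """All aggregates a given variant's model is allowed to train on."""
    return {key: value for key, value in db.items()
            if key[0] == test_id and key[1] == variant_id}
```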
5. Conclusions and Future Work
This paper presented several underexplored challenges that must be addressed when productizing Contextual Multi-Armed Bandits schemes. While there is an abundance of literature on theoretical aspects of CMABs and on their performance in practice, little has been written about what it takes to promote, monitor, augment and improve such models in a high-scale production environment. This paper covered six such topics, presenting practical solutions to each. In particular, our extension of the replay method and the methodologies for monitoring continuity and stability of CMAB processes are, to the best of our knowledge, novel in and of themselves.
- WAN (2005) 2005. Arbitrary side observations in bandit problems. Advances in Applied Mathematics 34, 4 (2005), 903 – 938. https://doi.org/10.1016/j.aam.2004.10.004 Special Issue Dedicated to Dr. David P. Robbins.
- Burtini et al. (2015) Giuseppe Burtini, Jason Loeppky, and Ramon Lawrence. 2015. A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757 (2015).
- Kleinberg et al. (2010) Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. 2010. Regret bounds for sleeping experts and bandits. Machine learning 80, 2-3 (2010), 245–272.
- Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6, 1 (1985), 4–22.
- Langford et al. (2009) John Langford, Alexander Strehl, and Jennifer Wortman. 2009. Exploration Scavenging. In Proceedings of the 25th international conference on Machine learning. ICML, 528–535.
- Langford and Zhang (2008) John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems. 817–824.
- Lempel et al. (2012) Ronny Lempel, Ronen Barenboim, Edward Bortnikov, Nadav Golbandi, Amit Kagian, Liran Katzir, Hayim Makabee, Scott Roy, and Oren Somekh. 2012. Hierarchical composable optimization of web pages. In Proceedings of the 21st International Conference on World Wide Web. ACM, 53–62.
- Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web. ACM, 661–670.
- Li et al. (2011) Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 297–306.
- Lu et al. (2010) Tyler Lu, Dávid Pál, and Martin Pál. 2010. Contextual multi-armed bandits. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 485–492.
- Mary et al. (2014) Jérémie Mary, Philippe Preux, and Olivier Nicol. 2014. Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques. In International Conference on Machine Learning. 172–180.
- Sarkar et al. (1991) Jyotirmoy Sarkar et al. 1991. One-armed bandit problems with covariates. The Annals of Statistics 19, 4 (1991), 1978–2002.
- Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning. 814–823.
- Tang et al. (2015) Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. 2015. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 323–332.
- Tang et al. (2013) Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. 2013. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. ACM, 1587–1594.
- Tewari and A. Murphy (2017) Ambuj Tewari and Susan A. Murphy. 2017. From Ads to Interventions: Contextual Bandits in Mobile Health. Mobile Health: Sensors, Analytic Methods, and Applications (07 2017), 495–517. https://doi.org/10.1007/978-3-319-51394-2_25
- Yu et al. (2017) Xiaotian Yu, Michael R. Lyu, and Irwin King. 2017. CBRAP: Contextual Bandits with RAndom Projection. In AAAI.