Python implementations of contextual bandits algorithms
This work explores adaptations of successful multi-armed bandits policies to the online contextual bandits scenario with binary rewards using binary classification algorithms such as logistic regression as black-box oracles. Some of these adaptations are achieved through bootstrapping or approximate bootstrapping, while others rely on other forms of randomness, resulting in more scalable approaches than previous works, and the ability to work with any type of classification algorithm. In particular, the Adaptive-Greedy algorithm shows a lot of promise, in many cases achieving better performance than upper confidence bound and Thompson sampling strategies, at the expense of more hyperparameters to tune.
Contextual bandits, also known as associative reinforcement learning or multi-armed bandits with covariates, is a problem characterized by the following iterative process: there is a number of choices (known as “arms”) from which an agent can choose, each of which yields stochastic rewards. At the beginning of each round, the world generates a set of covariates of fixed dimensionality (the so-called “context”), and rewards for each arm which are related to the covariates. The agent chooses an arm for that round, and the world reveals the reward for that arm, but not for the others. The goal is for the agent to maximize the rewards it obtains in the long term, using the history of its previous actions.
This is related to the simpler multi-armed bandits problem (MAB), which has no covariates but faces the same dilemma between exploring unknown alternatives and exploiting known good arms, and for which many approaches have been proposed, such as upper confidence bounds, Thompson sampling, or other heuristics.
This work proposes adaptations of some successful strategies and baselines from the MAB setting and its variations to the contextual bandits setting, using supervised learning algorithms as black-box oracles. It also addresses other considerations, such as exploration in the early phases in the absence of non-zero rewards, and benchmarks the resulting policies in an empirical evaluation on multilabel classification datasets with different characteristics, from which the studied problem is simulated.
More formally, this work is concerned with a scenario as follows: there is a fixed number of choices or arms, from which an agent must choose one as its action in each round. At the beginning of each round, the world generates a set of covariates (a.k.a. the “context”) of fixed dimensionality, and stochastic binary rewards for each arm through a function of the covariates which is different for each arm but is the same throughout all rounds. The world reveals the covariates to the agent, which must then choose an arm based on its previous knowledge, and the reward for the arm that was chosen is revealed, while the rewards for the other arms remain unknown. The agent shall use its history of previously seen covariates, chosen actions and observed rewards during each round in order to make a choice. Note that this work is only concerned with Bernoulli-distributed rewards, a setting encountered in domains such as recommender systems or online advertising, in which a user either clicks or doesn’t click what she is presented with.
The objective is to maximize the rewards obtained in the long term, without a defined time limit. MAB evaluates arm-selection policies based on upper bounds on regret, defined there as the difference between the reward from the arm selected at each round and the highest expected reward of any arm; in online contextual bandits this is a less clear objective. Alternative definitions of regret based on the true reward-generating function have been proposed, and regret bounds have been studied for the case of linear functions. Other works have also tried to define regret differently, but their definitions are not applicable in this scenario. While some methods that enjoy theoretical guarantees on their regret have been proposed before for contextual bandits, either using another algorithm as oracle or not, such methods tend to be computationally intractable, and this work tries to explore more practical and scalable approaches at the expense of theoretical guarantees.
Alternative objectives such as sums of weighted rewards discounting later gains have also been proposed in the MAB setting, but this does not necessarily result in a good definition of “long term”, and is subject to variations in the discount rate used.
As such, this work is concerned with cumulative reward throughout the rounds instead of accumulated regret, with time periods running up to the available number of rows in the given datasets. While this has some chance of not being able to reflect asymptotic behaviors in an infinite-time scenario with all-fixed arms, it provides some insight on what happens during typical timelines of interest.
The simpler multi-armed bandits scenario has been extensively studied, and many good solutions have been proposed which enjoy theoretical limits on their regrets, as well as demonstrated performance in empirical tests. Among the solutions with best theoretical guarantees and empirical performance are upper confidence bounds, also known as “optimism in the face of uncertainty”, which try to establish an upper bound on the expected reward (such bound gets closer to the observed mean as more observations are accumulated, thereby balancing exploration and exploitation), and Thompson sampling, which takes a Bayesian perspective aiming to choose an arm according to its probability of being the best arm. Typical comparison baselines are Epsilon-Greedy algorithms and variations thereof, whose idea is to select the empirical best action with some probability or a random action otherwise. In the time-limited setting, other logical strategies have also been evaluated, such as choosing random actions at the beginning but then shifting to always playing the empirically best action after an optimal turning point (known as Explore-Then-Exploit).
Variations of the multi-armed bandit setting have also seen other interesting proposals, such as the Adaptive-Greedy algorithm proposed for the mortal (expiring) arms case, which selects the empirically best arm if its estimated expected reward is above a certain threshold, or a random arm otherwise.
The contextual bandits setting has been studied under different variations of the problem formulation, some differing a lot from the one presented here, such as the bandits with “expert advice”, and some presenting a similar scenario in which the rewards are assumed to be continuous (usually within a bounded range) and the reward-generating functions linear. Particularly, LinUCB, which as its name suggests uses a linear function estimator with an upper bound on the expected rewards (one estimator per arm, all independent of each other), has proved to be a popular approach, and many works build upon it in variations of its proposed scenario, such as when adding similarity information.
Approaches taking a supervised learning algorithm as an oracle for a setting similar to the one presented here, but with continuous rewards, have been studied before. In them, these oracles are fit to the covariates and rewards from each arm separately, and the same strategies from multi-armed bandits have also resulted in good strategies in this setting. Other related problems, such as building an optimal oracle or policy with data collected from a past policy, have also been studied, but this work only focuses on online policies that start from scratch and continue ad infinitum.
Another related problem is active learning, which assumes a scenario in which a supervised learning algorithm must make predictions about many observations, but revealing the true label of an observation in order to incorporate it in its training repertoire is costly, thus it must actively select which observations to label in order to improve predictions. In the classification scenario with differentiable models, a simple yet powerful technique that has been tried is to look at the gradient that an observation would have on the loss or likelihood function if its true label were known (there are only two possible outcomes in this case, and some methods provide a probability of each being the correct one), following the idea that larger gradients, e.g. as measured by some vector norm, will lead to faster learning.
The aim of upper confidence bounds for an oracle (fit to rewards and covariates coming from a single arm) is to be able to establish a bound for the expected reward of an arm given the covariates or features, below which lies the true expected reward with a high probability. The tighter the bound at the same probability, the better. As the number of observations grows, this bound should get closer to the point estimate of the expected reward generated by the oracle.
Previous works have tried to use the upper confidence bound strategy in contextual bandits by upper-bounding the standard error of predictions under assumptions on the reward-generating functions. However, methods typically used in the statistics and econometrics literature for the same purpose have been overlooked, such as Bayesian sampling, estimations of the predictors’ covariance under normality assumptions, and bootstrapping, which can result in tighter upper confidence bounds and more scalable approaches than those works.
The Bayesian approach would restrict the oracles to certain classes of models and might not result in a very scalable strategy (albeit stochastic variational inference might in some cases result in an online and fast-enough procedure), while the covariance estimations restrict the class of oracles to generalized linear models, but the bootstrapping approach results in a very scalable strategy that can work with any class of supervised learning algorithms (including collaborative filtering ones) and which makes almost no assumptions on the reward-generating functions.
To recall, bootstrapping consists in taking resamples of the data of the same size as the original by picking observations at random from the available pool but with replacement. These resamples are drawn from the same distribution as the originals, thus can be used to construct confidence intervals of parameters or other statistics of interest, such as the estimation of the expected reward for an arm given its covariates, by simply taking the statistic of interest estimated under each resample and calculating quantiles on it.
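The resampling procedure just described can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function name `bootstrap_upper_bound`, the toy reward vector, and the defaults of 200 resamples and an 80% quantile are assumptions chosen for the example, and the statistic is a plain mean instead of an oracle's prediction.

```python
import numpy as np

def bootstrap_upper_bound(values, statistic=np.mean, n_resamples=200,
                          quantile=0.8, seed=0):
    """Upper confidence bound for `statistic` via quantiles over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = values.shape[0]
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample of the same size as the original, drawn with replacement.
        idx = rng.integers(0, n, size=n)
        estimates[i] = statistic(values[idx])
    # The chosen quantile of the resampled estimates acts as the upper bound.
    return float(np.quantile(estimates, quantile))

# Toy binary rewards observed for a single arm (hypothetical data).
rewards = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
ucb = bootstrap_upper_bound(rewards)
```

With more observations the resampled estimates concentrate, so the bound moves towards the point estimate.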
Obtaining upper confidence bounds this way is straightforward, but it has one inconvenience: it requires access to the whole dataset. In some cases, it might be desirable to have oracles that use online learning methods (i.e. stochastic optimization), which are fit to online data streams or small batches of observations incrementally instead of being refit to the whole data every time. In theory, as the sample size grows to infinity, the number of times that an observation appears in a resample should be a Poisson-distributed random number, and one possibility is to take each observation as it arrives a random number of times drawn from that distribution. Alternative approaches that use a different distribution for the number of times that an observation appears have also been proposed. This work found a more practical but less theoretically correct method for dealing with this problem: assigning sample weights at random, which are passed to the classification oracle provided that it supports them. This produces a more stable effect at smaller sample sizes, as it avoids the problem of ending up with observations that have only one label. For more information on it see the appendix.
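The two online alternatives can be contrasted with a short sketch. The Poisson rate of 1 is the standard large-sample limit for bootstrap counts; the Gamma(shape=1, scale=1) weights are an assumption made here for illustration, since the exact weight distribution is not stated in the text above. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_resamples = 1000, 10

# Option 1: each resample includes each incoming observation a Poisson(1)
# number of times (possibly zero times).
poisson_counts = rng.poisson(lam=1.0, size=(n_obs, n_resamples))

# Option 2 (assumed distribution): continuous random weights with mean 1,
# to be passed to an oracle that accepts per-observation sample weights.
gamma_weights = rng.gamma(shape=1.0, scale=1.0, size=(n_obs, n_resamples))

# Poisson counts leave out roughly exp(-1), about 37%, of the observations
# from each resample, whereas strictly positive weights keep all of them.
frac_dropped = float((poisson_counts == 0).mean())
all_kept = bool((gamma_weights > 0).all())
```

Because no observation is dropped entirely, a resample of an arm whose history contains both labels always contains both labels, which is the stability property mentioned above.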
For some classes of algorithms such as decision trees, it should be possible to calculate an upper bound also by looking at the data on the terminal nodes, and methods such as random forests that implicitly perform bootstrapping should be able to produce an upper confidence bound without additional bootstrapping. Such methods however were not explored in this work.
One issue that makes the contextual bandits scenario harder is that most supervised learning algorithms used as oracles cannot be fit to data that has only one value or only one label (e.g. only observations which had a zero reward), and typical domains of interest involve a scenario in which the non-zero reward rate for any arm is rather small regardless of the covariates (e.g. clicks). In the MAB setting, this is usually solved by incorporating a prior or some smoothing criterion, and it’s possible to think of a similar fix for the scenario proposed in this work if the classifier is able to output probabilities: if there are only observations with one label for a given arm, always predict that label for that arm, and add a smoothing criterion regardless, which adjusts the expected reward estimated by the oracle according to the number of observations for that arm and two smoothing constants.
One might also think of incorporating artificial observations with an unseen label, but this can end up doing more harm than good. A recalibration can also be applied to the outputs of classifiers that don’t produce probabilities in order to make this heuristic work.
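As an illustration of such a smoothing criterion, the sketch below assumes the form (n·r + a) / (n + b), with r the oracle's estimate, n the number of observations for the arm, and a < b the smoothing constants. Both the functional form and the default values a=1, b=2 are assumptions made for this example.

```python
def smooth_prediction(r_hat, n, a=1.0, b=2.0):
    """Pull the oracle's estimate r_hat towards a/b when the arm has few observations.

    The formula and the defaults for a and b are illustrative assumptions.
    """
    return (n * r_hat + a) / (n + b)

# With no observations the smoothed estimate equals a/b regardless of r_hat.
unseen_arm = smooth_prediction(0.0, n=0)
# With many observations the smoothing has almost no effect.
well_known_arm = smooth_prediction(0.2, n=10**6)
```

Under this form an unexplored arm starts with a comparatively optimistic estimate, which is what makes the policy try every arm early on.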
However, as the number of arms grows and the problem starts to resemble the many-armed bandits scenario, in which there might be more arms than rounds or time steps, it’s easy to see that this smoothing criterion will lead to sampling each arm once or twice, which in turn will lead to low rewards until all arms have been sampled a certain number of times. Taking this into consideration, another logical solution would be to start with a Bayesian multi-armed bandit policy for each arm that uses a Beta prior and ignores the covariates, then switch to a contextual bandit policy once a minimum number of observations from each label has been obtained for that arm (as there is randomness involved, it is highly unlikely that it will start by sampling each arm as the smoothing would do in a mostly-zero reward scenario).
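A minimal sketch of this MAB-first idea follows. The function name, the Beta(1, 1) prior and the minimum of 2 observations per label are hypothetical choices for illustration, not the constants used in the experiments.

```python
import numpy as np

def mab_first_score(n_pos, n_neg, oracle_pred, rng,
                    alpha=1.0, beta=1.0, min_per_label=2):
    """Score one arm: Beta-posterior draw until both labels have been seen enough."""
    if n_pos < min_per_label or n_neg < min_per_label:
        # Too few observations of some label to fit a classifier: ignore the
        # covariates and sample from the Beta posterior of the arm's reward rate.
        return float(rng.beta(alpha + n_pos, beta + n_neg))
    # Enough observations of each label: use the contextual oracle's estimate.
    return float(oracle_pred)

rng = np.random.default_rng(0)
cold_arm = mab_first_score(n_pos=0, n_neg=5, oracle_pred=0.9, rng=rng)
warm_arm = mab_first_score(n_pos=3, n_neg=3, oracle_pred=0.9, rng=rng)
```

The score returned here would then be fed to whichever arm-selection rule the policy uses.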
The proposed algorithms were evaluated by complementing them with both the smoothing technique and the MAB-first technique, which usually proved to be a better choice. For more information on this comparison, see the appendix. Some generic suggestions for these constants are as follows: ; ; .
Another logical solution for the problem of having too many arms is of course to limit oneself to a randomly chosen subset of them, perhaps adding more with time, but it’s hard to determine what would be the optimal number given some timeframe, especially in situations in which arms expire and/or new arms are available later. See the appendix for such comparison.
An idea to incorporate more observations into arms is to add each observation which results in a reward for one arm as a non-reward observation for all other arms. In scenarios in which only one arm per round tends to have a reward, or mostly one arm per round only, this can provide a small lift, but in scenarios in which this is not the case, it can do more harm in the long term.
Before determining the quality of a policy or strategy for contextual bandits, it’s a good idea to establish simple comparison points that any good policy or strategy should be able to beat in order to ensure that it is indeed a good policy.
MAB has also seen many proposals in this regard, with the simplest one being Epsilon-Greedy algorithms, which consist in playing the empirical best arm with some high probability or a random arm otherwise. Variations of it have also been proposed, such as decreasing the probability of choosing a random arm with each successive round, or dropping the probability of picking a random arm to zero after some turning point. This algorithm lends itself to an easy adaptation to the problem studied here, in which the estimates come from one oracle per arm.
The oracles would consist of separate and independent binary classifiers (such as XGBoost or logistic regression), each fit only to the observations and rewards from the rounds in which its respective arm was chosen, wrapped inside the MAB-first or the smoothing criterion as described in the previous section, and with ties broken arbitrarily. While ideally the oracles should be updated after every iteration, and stochastic optimization techniques allow doing this for many classification algorithms, they might also be updated only after a certain number of rounds, or every time each one’s history accumulates a certain number of new cases, at the expense of some decrease in the predictive power of the oracles. This becomes less of an issue as the histories’ length increases.
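The selection rule and the per-arm bookkeeping can be sketched as follows. The names `epsilon_greedy_choice` and `PerArmHistory` are hypothetical, and the arm scores are hard-coded stand-ins for the outputs of the per-arm classifiers, which are left out to keep the example short.

```python
import numpy as np

def epsilon_greedy_choice(arm_scores, epsilon, rng):
    """Play a random arm with probability epsilon, else the highest-scoring arm."""
    arm_scores = np.asarray(arm_scores, dtype=float)
    if rng.random() < epsilon:
        return int(rng.integers(arm_scores.shape[0]))
    # Ties are broken arbitrarily, here uniformly at random among the maxima.
    best = np.flatnonzero(arm_scores == arm_scores.max())
    return int(rng.choice(best))

class PerArmHistory:
    """Keeps, for each arm, only the rounds in which that arm was played."""
    def __init__(self, n_arms):
        self.X = [[] for _ in range(n_arms)]
        self.r = [[] for _ in range(n_arms)]

    def add(self, arm, x, reward):
        self.X[arm].append(x)
        self.r[arm].append(reward)

rng = np.random.default_rng(0)
history = PerArmHistory(n_arms=3)
scores = [0.10, 0.40, 0.25]   # stand-ins for each arm's estimated reward
arm = epsilon_greedy_choice(scores, epsilon=0.2, rng=rng)
history.add(arm, x=np.zeros(4), reward=1)
```

Each arm's classifier would be refit, or incrementally updated, from `history.X[arm]` and `history.r[arm]` at whatever frequency is affordable.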
Following the idea of alternating between exploring new alternatives under new contexts and exploiting the accrued knowledge from previous exploration, another logical option is to alternate between periods of choosing actions at random, and periods of playing the actions with the highest estimated expected reward (a more extreme case of the Epoch-Greedy policy). In the most extreme case, if there is a timeline defined, this strategy would consist of playing an arm at random up to an optimal turning point, after which it will only play the arm with the highest estimation.
In practice, we don’t have a set time limit, but since the sample size is known beforehand when using existing datasets, it can be added as another baseline.
Another logical idea that has been used in other problem domains when making a decision from uncertain estimations is to choose not by a simple $\arg\max$, but with a probability proportional to the estimates, e.g. by sampling the arm from the softmax of the estimates, where $\mathrm{softmax}(x)_i = \exp(x_i) / \sum_j \exp(x_j)$. As the estimates here are probabilities bounded between zero and one, it might make more sense to apply an inverse sigmoid function on these probabilities before applying the softmax function. In order to make such a policy converge towards an optimal strategy in the long term, a typical trick is to inflate the estimates before applying the softmax function by a multiplier that gets larger with the number of rounds, so that the policy would tend to $\arg\max$ with later iterations.
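A small sketch of this selection rule, under the assumption that the multiplier is applied to the inverse-sigmoid (logit) values; the function name and the clipping constant `eps` are hypothetical.

```python
import numpy as np

def softmax_probs(pred_probs, multiplier=1.0, eps=1e-6):
    """Arm-selection probabilities: softmax over inflated logits of the estimates."""
    p = np.clip(np.asarray(pred_probs, dtype=float), eps, 1 - eps)
    logits = np.log(p / (1 - p))   # inverse sigmoid of each estimate
    z = multiplier * logits        # larger multiplier -> closer to argmax
    z = z - z.max()                # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

estimates = [0.10, 0.40, 0.25]
early = softmax_probs(estimates, multiplier=1.0)
late = softmax_probs(estimates, multiplier=50.0)

rng = np.random.default_rng(0)
arm = int(rng.choice(len(estimates), p=early))
```

With a multiplier of 1 the selection probabilities are proportional to the odds p / (1 - p) of each estimate; as the multiplier grows, nearly all probability mass moves to the arm with the highest estimate.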
Finally, another good baseline is to always select the arm with the highest average reward ignoring the context. A good MAB policy should quickly converge towards always choosing such arm.
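Following the previous sections, a natural adaptation of the upper confidence bound strategy takes, for each arm, an upper quantile of the expected rewards predicted by oracles fit to bootstrapped resamples of that arm’s history, and plays the arm with the highest such bound.
Its online variant updates the resampled oracles incrementally, using the approximate bootstrapping described earlier.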
Just like before, the oracles (binary classifiers) might not be updated after every new observation is incorporated, but only after a reasonable number of them has been incorporated, or after a fixed number of rounds.
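The arm-selection step of a bootstrapped upper confidence bound policy can be sketched as below. The function name and the prediction matrix are hypothetical; in practice each row would hold the predictions, for the current context, of the classifiers fit to the different resamples of one arm's history.

```python
import numpy as np

def bootstrapped_ucb_choice(resample_preds, percentile=80):
    """resample_preds: (n_arms, n_resamples) predicted reward probabilities."""
    # Upper bound per arm: a high percentile across that arm's resampled oracles.
    bounds = np.percentile(resample_preds, percentile, axis=1)
    return int(np.argmax(bounds)), bounds

# Arm 0 has consistent predictions; arm 1 is uncertain (toy numbers).
preds = np.array([
    [0.30, 0.32, 0.31, 0.29, 0.30],
    [0.10, 0.45, 0.20, 0.05, 0.50],
])
arm, bounds = bootstrapped_ucb_choice(preds)
```

In this toy case arm 0 has the higher average prediction, but the disagreement among arm 1's resampled oracles gives it the higher bound, so it is the one played.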
Thompson sampling results in an even more straightforward adaptation based on the same bootstrapped oracles, and similarly has an online variant.
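One common way to turn bootstrapped oracles into a Thompson-sampling-like rule, assumed here for illustration, is to pick one resampled oracle at random per arm and play the arm whose picked oracle predicts the highest reward. The function name and the toy prediction matrix are hypothetical.

```python
import numpy as np

def bootstrapped_ts_choice(resample_preds, rng):
    """resample_preds: (n_arms, n_resamples) predicted reward probabilities."""
    n_arms, n_resamples = resample_preds.shape
    # One randomly chosen resampled oracle per arm plays the role of a posterior draw.
    picked = rng.integers(n_resamples, size=n_arms)
    sampled = resample_preds[np.arange(n_arms), picked]
    return int(np.argmax(sampled))

preds = np.array([
    [0.30, 0.32, 0.31, 0.29, 0.30],
    [0.10, 0.45, 0.20, 0.05, 0.50],
])
rng = np.random.default_rng(0)
chosen = [bootstrapped_ts_choice(preds, rng) for _ in range(200)]
```

Unlike the upper confidence bound rule, the choice is random from round to round: with these numbers the uncertain arm is played only when one of its two high predictions happens to be drawn.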
Outside of algorithms relying on bootstrapping, some algorithms that use a random selection criterion also result in easy adaptations with reasonably good performance without requiring multiple oracles per arm, such as the Adaptive-Greedy algorithm, which plays the arm with the highest estimated reward when that estimate is above a threshold, and a random arm otherwise.
The choice of threshold is problematic though, and it might be a better idea to base it instead on the estimations produced by the oracles, for example by keeping a moving average window of the last highest estimated rewards of the best arm.
This moving window in turn might also be replaced with a non-moving window, i.e. compute the average over an initial block of observations, leave it unchanged for a further block of rounds, and then update it using only the observations from that latest block.
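The threshold rule with a moving-average window can be sketched as follows. The class name, the initial threshold of 0.3 and the window length are hypothetical, and the arm scores stand in for the per-arm oracle estimates.

```python
import numpy as np
from collections import deque

class AdaptiveGreedySketch:
    """Exploit the best arm if its estimate exceeds a threshold, else explore."""

    def __init__(self, n_arms, threshold=0.3, window=50, seed=0):
        self.n_arms = n_arms
        self.threshold = threshold
        self.recent_best = deque(maxlen=window)   # last best-arm estimates
        self.rng = np.random.default_rng(seed)

    def choose(self, arm_scores):
        arm_scores = np.asarray(arm_scores, dtype=float)
        best = int(np.argmax(arm_scores))
        best_score = float(arm_scores[best])
        exploit = best_score > self.threshold
        # Once the window is full, the threshold follows the moving average
        # of the most recent best-arm estimates.
        self.recent_best.append(best_score)
        if len(self.recent_best) == self.recent_best.maxlen:
            self.threshold = sum(self.recent_best) / len(self.recent_best)
        return best if exploit else int(self.rng.integers(self.n_arms))

policy = AdaptiveGreedySketch(n_arms=3, threshold=0.3, window=2)
first = policy.choose([0.1, 0.9, 0.2])    # 0.9 > 0.3, exploits arm 1
second = policy.choose([0.1, 0.5, 0.2])   # 0.5 > 0.3, exploits arm 1
third = policy.choose([0.1, 0.6, 0.2])    # threshold is now 0.7, explores
```

The non-moving variant would recompute the threshold only once per block of rounds instead of at every call.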
Instead of relying on choosing arms at random for exploration, active learning heuristics might be chosen for faster learning. Strategies such as Epsilon-Greedy are easy to convert into active learning, for example by assuming a differentiable and smooth model such as logistic regression or artificial neural networks (depending on the particular activation functions) and, in exploration rounds, selecting the arm according to the gradient that the observation would produce under each possible label.
Intuitively, it might also be a good idea to take the arm with the smallest or largest gradient for either label instead of a weighted average according to the estimated probabilities, as each alternative (max, min, weighted) seeks something that adds value, but in practice a weighted average tends to give slightly better results. For a comparison see the appendix.
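For logistic regression the three criteria have a simple closed form, sketched below under the assumption of an unregularized log-loss and no intercept term. For a context x and estimated probability p, the gradient with respect to the coefficients is (p - y)·x, so its norm is |p - y|·‖x‖ for each candidate label y. The function name is hypothetical.

```python
import numpy as np

def gradient_norm_criteria(w, x):
    """Gradient norms of the log-loss for one observation under each label."""
    p = 1.0 / (1.0 + np.exp(-float(np.dot(w, x))))   # estimated P(reward = 1)
    x_norm = float(np.linalg.norm(x))
    g_pos = abs(p - 1.0) * x_norm   # gradient norm if the reward were 1
    g_neg = abs(p - 0.0) * x_norm   # gradient norm if the reward were 0
    return {
        "min": min(g_pos, g_neg),
        "max": max(g_pos, g_neg),
        # Average over the two labels, weighted by their estimated probabilities.
        "weighted": p * g_pos + (1.0 - p) * g_neg,
    }

crit = gradient_norm_criteria(w=np.zeros(2), x=np.array([3.0, 4.0]))
```

An active-learning exploration step would compute one of these criteria for every arm's oracle and play the arm with the largest value; the weighted version simplifies to 2·p·(1 - p)·‖x‖, which is largest for the arms whose oracle is least certain.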
The previously defined ContextualAdaptiveGreedy2, for example, can also be enriched with this simple heuristic.
The algorithms above (implementations are open-source and freely available at https://github.com/david-cortes/contextualbandits) were benchmarked and compared to the simpler baselines by simulating contextual bandits scenarios using multi-label classification datasets, where the arms become the classes and the rewards are whether the chosen label for a given observation was correct or not. This was done by feeding them observations in rounds, letting each algorithm make its choice but presenting the same context to all, and revealing to each one whether the label that it chose in that round was correct or not.
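The simulation described above amounts to a replay loop over the rows of a multi-label dataset. The sketch below is a simplified illustration, not the benchmarking code of this work: `replay_multilabel`, `choose_arm` and `update` are hypothetical names, and the toy data is made up.

```python
import numpy as np

def replay_multilabel(X, Y, choose_arm, update=None, seed=0):
    """Simulate bandit feedback from features X and a binary label matrix Y."""
    rng = np.random.default_rng(seed)
    rewards = []
    for t in rng.permutation(X.shape[0]):     # visit rows in shuffled order
        arm = choose_arm(X[t])
        reward = int(Y[t, arm])               # only the chosen arm's label is revealed
        if update is not None:
            update(X[t], arm, reward)         # the policy may learn from this round
        rewards.append(reward)
    return np.array(rewards)

# Toy data: 4 observations, 2 arms (labels).
X = np.eye(4)
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
rewards = replay_multilabel(X, Y, choose_arm=lambda x: 0)
```

A policy that always plays arm 0 collects exactly the number of rows that carry label 0, which is the "best arm without context" type of baseline when that label is the most common one.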
The datasets used are the BibTeX tags, Del.icio.us tags, Mediamill, and EURLex (due to speed reasons, this last dataset was only used for the simulations in appendix B). These were taken from the Extreme Classification Repository (http://manikvarma.org/downloads/XC/XMLRepository.html) and represent a variety of problem domains and datasets with different properties, such as having a dominant arm to which most observations belong (Mediamill), having a large number of labels in relation to the number of available rounds with some never offering a reward (EURLex), or a more balanced scenario with no dominant label (BibTeX). The same Mediamill dataset was also used without the 5 most common labels, which results in a very different scenario. The dataset sizes, average number of labels per row, average number of rows per label, and percent of observations having the most common label are as follows:
The classification oracles were refit every 50 rounds, and the experiments were run until iterating throughout all observations in a dataset, after which it was done again with the data shuffled differently, and the results averaged over 10 runs. Both full-refit and mini-batch-update versions were evaluated. The classifier algorithm used was logistic regression, with the same regularization parameter for every arm.
Contextual bandits policies were evaluated by their plots of cumulative mean reward over time (that is, the average reward per round obtained up to a given round), with time being the number of rounds or observations, and the reward being whether they choose a correct label (arm) for an observation (context).
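The evaluation metric is simple to compute from the sequence of observed rewards; the function name below is hypothetical.

```python
import numpy as np

def cumulative_mean_reward(rewards):
    """Average reward per round obtained up to each round."""
    rewards = np.asarray(rewards, dtype=float)
    rounds = np.arange(1, rewards.shape[0] + 1)
    return np.cumsum(rewards) / rounds

curve = cumulative_mean_reward([1, 0, 1, 1])
```

Plotting `curve` against the round number gives the kind of plot used to compare policies.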
No feature engineering or dimensionality reduction was performed, as the point was to compare metaheuristics.
Unfortunately, algorithms such as LinUCB don’t scale to these problem sizes, and it was not possible to compare against them. Additionally, they are intended for the case of continuous rather than binary rewards, and as such their performance might not be as good.
The MAB-first technique was used in all cases except for Explore-Then-Exploit, but its hyperparameters were not tuned, nor were the hyperparameters of the contextual bandit policies. The results in the appendix suggest that good tuning of the MAB-first hyperparameters can have a large impact. The hyperparameters were for the BibTeX and Del.icio.us datasets, and for the Mediamill datasets.
The other policies’ hyperparameters were set as follows: 10 resamples for bootstrapped methods, 80% confidence interval for UCB, 20% explore probability for Epsilon-Greedy and 15% for ActiveExplorer, decay rate for Epsilon-Greedy, for ContextualAdaptiveGreedy, ContextualAdaptiveGreedy2 and AdaptiveActiveGreedy, multiplier of for SoftmaxExplorer with inflation rate , percentile 30 for ContextualAdaptiveGreedy2, threshold for ContextualAdaptiveGreedy. The turning point for Explore-Then-Exploit was set at for BibTeX, for Del.icio.us, and for Mediamill.
This work presented adaptations of the most common strategies or policies from multi-armed bandits to online contextual bandits scenarios with binary rewards through classification oracles.
Techniques such as bootstrapping or approximate bootstrapping were proposed for obtaining upper confidence bounds and for Thompson sampling, which resulted in more scalable approaches than previous works such as LinUCB, along with techniques that allow starting from zero rather than requiring an undefined earlier exploration phase as in some previous works.
An empirical evaluation of adapted multi-armed bandits policies was performed, which in many cases showed better results compared to simpler baselines or to discarding the context. A further comparison with similar works meant for the regression setting was not feasible due to the lack of scalability of other algorithms.
Just like in MAB, the upper confidence bound approach proved to be a reasonably good strategy throughout all datasets despite the small number of resamples used, while having fewer hyperparameters to tune. The overall best-performing policy however seems to be ContextualAdaptiveGreedy, which is also a faster approach. Enhancing it by incorporating active learning heuristics did not seem to have much of an effect, and it seems that setting a given initial threshold provides better results compared to setting the threshold as a moving percentile of the predictions.
While theoretically sound, using stochastic optimization to update classifiers with small batches of data resulted in severely degraded performance compared to full refits across all metaheuristics, even in the later rounds of larger datasets, with no policy managing to outperform choosing the best arm without context, at least with the hyperparameters experimented with for the MAB-first trick.
It shall be noted that all arms were treated as being independent from each other, which in reality might not be the case and other models incorporating similarity information might result in improved performance.
In order to evaluate how often the expected value of the target variable falls below the upper confidence bounds estimated by each method, simulations were performed at different sample sizes with randomly-generated data as follows:
1) A set of coefficients is set fixed (real coefficients), for all sample sizes and iterations.
2) Many iterations are performed in which covariates of a fixed sample size are generated at random, always from the same data-generating distribution, then the expected value of the target variable is calculated from the real coefficients, and random noise is added to it. This represents real samples from the data-generating process.
3) A test set is generated the same way as the samples in 2), but without adding noise to the target variable. This represents the real expected values.
4) Upper confidence bounds are generated for the observations in the test set under each real sample through the resampling methods described before.
5) The number of times that the real expected values are lower than the estimated upper bounds is calculated over all samples by taking the proportion of cases in the test set that were lower than the estimated bound.
Classification problems were also simulated similarly, but sampling the value of the target variable from a Bernoulli distribution instead, after applying a logistic transformation.
The confidence bound was defined at 80% with number of resamples set at 10, which is a rather low and insufficient number, but is the kind of number that could be used without much of a speed penalty for the algorithms described in this work. In general, the statistic of interest in bootstrapping tends to be the standard error of coefficients, but in this case the estimations of expected values of the target variable are more directly related to the problem.
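One iteration of this procedure, for the linear regression case with the real bootstrap, can be sketched as follows. The sample size, dimensionality, noise level and standard-normal covariates are assumptions made for the example; only the 10 resamples and the 80% bound come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_resamples, q = 200, 5, 10, 0.8

coefs = rng.normal(size=d)                   # fixed "real" coefficients
X = rng.normal(size=(n, d))                  # one real sample of covariates
y = X @ coefs + rng.normal(size=n)           # expected value plus random noise
X_test = rng.normal(size=(500, d))           # test set from the same distribution
expected = X_test @ coefs                    # real expected values, no noise

# Refit least squares on each bootstrap resample and predict on the test set.
preds = np.empty((n_resamples, X_test.shape[0]))
for b in range(n_resamples):
    idx = rng.integers(0, n, size=n)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds[b] = X_test @ beta

upper = np.quantile(preds, q, axis=0)        # estimated upper bound per test case
coverage = float((expected < upper).mean())  # share of cases below the bound
```

Repeating this over many samples and sample sizes, and swapping the resampling step for Poisson counts or random weights, gives the comparison discussed in this appendix; a well-calibrated method would yield a coverage close to the nominal 80%.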
Another perhaps intuitive choice of random weighting would be , but as can be seen, it severely underestimates the bounds. One might also think that less-variable Gamma weights could also work better at small sample sizes at the expense of worse results in larger sizes, so weights were also evaluated, but turned out to provide rather similar instability as the others in the proportion of estimations falling below the confidence bound. The weights provide almost the same results on average as the real bootstrap and the number of samples, and is perhaps a better alternative for classification as it avoids bad resamples resulting in only observations of one class. For a more intuitive explanation of the choice, recall that a distribution would indicate, for example, expected time between events that happen at rate , thereby acting as a mostly equivalent but smoother weighting than full inclusion/exclusion.
The results are rather different in the case of independent and correlated components for logistic regression, but for linear regression (not shown here) they are pretty much the same. Although the numbers here suggest systematic underestimation of the bounds regardless of the method, changing the bias term has a large effect on the numbers, e.g. using a small bias (not shown here) results in the upper bounds being severely overestimated in the case of linear regression, and changing the variance of the random noise also leads to large changes in the variance of the estimated bounds.
While this small simulation does not constitute any rigorous proof, it suggests that random weights can be a more stable alternative for online bootstrapping than a Poisson-distributed number of occurrences, without any loss of precision in the estimated bound.
Both the smoothing criterion and the MAB-first trick can help with estimations in arms that have seen few or no examples of the positive class for that classifier. These were compared with different values of their hyperparameters on the BibTeX, EURLex and Mediamill datasets, using as metaheuristics Epsilon-Greedy, BootstrappedTS and BootstrappedUCB. The base classifier used is logistic regression. These statistics represent just one run, rather than an average over multiple runs as in the other plots.
The number of arms likely plays a role in the effectiveness of these heuristics. It’s reasonable to think that limiting the number of arms would bring better results when the number of rounds is not infinite, so experiments were run limiting the number of arms to random subsets of varying cardinality. If there is a dominant arm that tends to perform better than the others and it doesn’t get included in the subset being used, performance should suffer significantly - this is precisely the case in the Mediamill dataset.
It’s not possible to conclude that one approach is always better than the other, but MAB-first seems to have an edge. It might be the case that smoothing works better in situations in which the number of arms is very large, but it’s not possible to generalize from only these datasets.
The choice of hyperparameters for both the smoothing and MAB-first has a very large impact on the results, perhaps even more than the selection of contextual bandit strategy. Nevertheless, regardless of the choice of hyperparameters, they both provide a significant improvement compared to not using them at all when the contextual bandit policy does not choose arms uniformly at random.
These results seem to indicate that, for BootstrappedUCB and BootstrappedTS, when the choice of hyperparameters for MAB-first and smoothing is good, the more arms that are considered, the better the results, despite the relatively large number of arms compared to the number of rounds in these experiments. For Epsilon-Greedy, there seemed to be a lot more variance in this regard, but overall, a larger number of available arms also seems to lead to better results.
A small comparison of the ActiveExplorer and the ActiveAdaptiveGreedy algorithms using as active learning selection heuristic either the minimum, maximum, or weighted average of the gradient norms under each label for a given observation was done on the BibTeX dataset. Just like in the main comparison, the base classifier used is logistic regression, with the MAB-first trick, and models refit to the full dataset every 50 rounds. The comparison is an average of 10 runs with the dataset shuffled differently in each.
As can be seen, using the weighted average of the norms (as described in the algorithm) provides slightly better results under both metaheuristics experimented with compared to using the minimum or the maximum of either label.