1. Introduction
Contextual bandits are a class of online learning algorithms that can be used to learn targeting strategies efficiently. They extend the multi-armed bandit problem (Slivkins, 2019) with the concept of a context: given a sampled context, the goal of the learning algorithm is to pick the action that maximizes the reward defined by the environment dynamics. We assume all episodes to be uncorrelated with one another (i.e. each state-action-reward triple is a separate episode). In a bandit problem, we can only observe the outcome of the action that was actually selected in a given state. The goal of a bandit formulation is to minimize the "regret", i.e. the difference between the cumulative reward of the optimal policy and the cumulative reward of the trained agent (Eq. 3).
In this work we show that it is possible to build an efficient contextual bandit system by using an off-the-shelf meta-learning product (Google Cloud AutoML Tables) to learn policies without any algorithmic coding or feature engineering. At a high level, AutoML is similar to the Neural Architecture Search (NAS) proposed by Zoph and Le (2016), which uses an autoregressive controller to generate the architectural hyperparameters of a neural network. Rather than experimenting with and hand-crafting the best hierarchical arrangement of deep learning layers, these are learned automatically by the system (Elsken et al., 2018). We aim to use this meta-learning approach to approximate the Q-function for a contextual bandit, i.e. the expected reward for a given action in a given state:
Q(s, a) = E[r | s, a]    (1)
where r is the reward, s the state and a the action. Armed with a Q-function approximated automatically by AutoML, we can easily create an exploratory policy using the ε-greedy exploration schedule. The Q-function, and thus the policy, is updated periodically as new batches of data are accumulated. Being a function of the state space, the Q-function is able to generalize across it. Note that we use the Reinforcement Learning term "Q-function" here even though the system we study is a purely contextual bandit one. We believe this notation and this work can be extended to some multi-action problems by replacing the immediate reward with a long-term discounted reward, implementing a simple form of Q-Learning. We plan to address this idea in future work.
Furthermore, while we limit ourselves to the ε-greedy exploration schedule here, this work can potentially be applied to other exploration strategies as well. The goal is to show that, using an off-the-shelf AutoML product, we can obtain a functioning bandit system with minimal tuning, so we focused on the simplest exploration scheme.
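The ε-greedy action choice described above can be sketched as follows. The `q_values` list stands in for the per-action predictions of the trained reward model; it is an illustrative assumption of this sketch, not the actual AutoML Tables API.

```python
import random

def egreedy_action(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise the action with the highest predicted reward.

    q_values: list of predicted expected rewards, one entry per action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties broken by the first maximal action.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy.
print(egreedy_action([0.1, 0.7, 0.3], 0.0))  # -> 1
```

With ε annealed toward zero over time, this single function interpolates from pure exploration to pure exploitation.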
2. Related Work
Our work is inspired by that of Li et al. (2010), who proposed a contextual-bandit-based algorithm that can be evaluated offline. Their model achieved state-of-the-art results on the Yahoo! Front Page Today Module dataset. Langford and Zhang (2007) proposed an epoch-greedy exploration model that requires no knowledge of the time horizon, and determined an upper bound on the regret for their formulation.
Agarwal et al. (2014) utilized an approach that guarantees an upper bound on the number of calls to an oracle while achieving the statistically optimal regret guarantee. However, they evaluated their model on a non-public dataset, which we could not use to test our technique.
There has been prior work on meta-learning for bandits. Sharaf and Daumé (Sharaf and III, 2019) proposed an imitation-learning-based approach, the "MÊLÉE" algorithm based on AggreVaTe. It provides the benefit of moving away from hand-engineered model architectures.
3. Mathematical Foundation
3.1. Bandit Formulation
The setup of a contextual bandit problem is that an agent repeatedly observes a context s_t, performs an action a_t and receives a reward r_t(s_t, a_t) that depends (typically stochastically) on both s_t and a_t through the environment (from now on we simplify the notation to r_t to keep it slim, implicitly intending it to be a stochastic variable).
The goal is to optimize the cumulative reward across a given sequence of episodes.
Σ_{t=1}^{T} r_t(s_t, a_t)    (2)
where in the bandit problem we assume that the states at different times are all independent of each other and of the actions taken in previous episodes.
A common way of evaluating the performance of a contextual bandit algorithm is to estimate its "regret". Regret is the difference between the cumulative reward of the optimal policy over a period and the cumulative reward of the actions actually taken by our model over the same period. The goal is to minimize this cost as quickly as possible:
Regret_T = Σ_{t=1}^{T} [ r_t(s_t, a_t*) − r_t(s_t, a_t) ]    (3)

where a_t* denotes the action the optimal policy would take in state s_t.
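Eq. (3) can be computed directly whenever the reward of every action is known for each episode, as in offline evaluation on supervised datasets. A minimal sketch, with an illustrative reward table:

```python
def cumulative_regret(all_rewards, chosen):
    """Compute the regret of Eq. (3).

    all_rewards[t][a]: reward of action a at episode t (known for all arms).
    chosen[t]: the action the policy actually took at episode t.
    Returns optimal cumulative reward minus achieved cumulative reward.
    """
    optimal = sum(max(r) for r in all_rewards)
    achieved = sum(r[a] for r, a in zip(all_rewards, chosen))
    return optimal - achieved

# Three episodes, two actions; the policy errs only in episode 2.
rewards = [[1, 0], [0, 2], [1, 1]]
print(cumulative_regret(rewards, [0, 0, 1]))  # -> 2
```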
3.2. Meta-Learning Setup
Google AutoML is inspired by the work on automatic architecture selection (Zoph and Le, 2016). AutoML can help optimize models that predict the expected reward (payoff) of a given action in a given context.
At a high level, a vanilla AutoML Tables implementation consists of the following steps in the pipeline. The core pillar of this work is that models are built with the available out-of-the-box tools, with no hand-crafted feature engineering or tuning (Lu, 2019).
Under the hood, a multi-stage TensorFlow pipeline is automatically instantiated, consisting of:
- Automated feature engineering applied to the raw input data.
- Architecture search to find the best architecture(s) for our bandit formulation task, i.e. the best predictor model for the expected reward of each episode.
- Hyperparameter tuning through search.
- Model selection: models that have achieved promising results so far are passed on to the next stage.
- Model tuning and ensembling.
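The model-selection stage above can be illustrated with a toy analogue: several candidate predictors are scored on held-out data and only the best survives. The candidates and data below are illustrative stand-ins, not the actual AutoML search space.

```python
def mse(model, data):
    """Mean squared error of model(x) against observed targets y."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def select_model(candidates, val_data):
    """Keep the candidate with the lowest validation error."""
    return min(candidates, key=lambda name: mse(candidates[name], val_data))

# Toy held-out set: the target is roughly 2 * x.
val = [(0.0, 0.0), (1.0, 2.1), (2.0, 3.9)]
candidates = {
    "constant": lambda x: 2.0,
    "linear":   lambda x: 2.0 * x,
}
print(select_model(candidates, val))  # -> linear
```

The real pipeline performs this competition over learned architectures and ensembles rather than two fixed functions, but the selection criterion, held-out loss, is the same idea.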
4. Datasets
As stated, we ran experiments with multiple datasets to test the efficacy of our meta-learning approach on a variety of tasks.
4.1. Synthetic Dataset
We initially created our own synthetic dataset simulation as an experimental platform to test our early results. The simulation sampled an underlying multi-dimensional, multi-factor probability model whose complexity we could tune artificially. This allowed us to experiment in a controlled environment with the state space, the reward distribution, and the number and type of actions. See Fig. 1.
4.2. Public Datasets
In addition to the synthetic environment, we also compared our model's performance on four public datasets to benchmark the contribution. Some of these datasets have been used in popular work in this field (Bietti et al., 2018):
- A scientific dataset simulating gamma particles collected by an atmospheric Cherenkov telescope, taken from Bock et al. (Bock, 2004).
- A chess dataset containing a list of features describing the board setup, together with a class denoting whether or not White can win from that position, from Dua and Graff (Dua and Graff, 2017).
- A dataset on forest cover type, containing cartographic information about the forest along with a classification denoting whether the forest coverage consists primarily of Spruce-Fir or Lodgepole Pine, taken from Blackard and Anderson's (Blackard and Anderson, 1998) work.
- A dataset describing the different states of the board in the game Dou Shou Qi, with a classification denoting which player won the game, showcased in van Rijn and Vis (van Rijn and Vis, 2016).
Each of these real-world datasets was partitioned into blocks of 500 to 1000 episodes. During each block we ran the current policy and recorded the (context, action, reward) triple for each trial. At the end of the block we retrained our Q-function model on the combined data of all previous blocks, then ran the updated model and its derived policy in the subsequent block. In the absence of any data, we always start the first block with a random policy. This process continues block by block until the end, progressively refining the models.
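The block-wise collect-and-retrain cycle just described can be sketched as the following loop. Here `env_step` and `train_q_model` are illustrative stand-ins for the environment and for an AutoML Tables training run; the one-dimensional context is likewise a toy assumption.

```python
import random

def run_blocks(env_step, n_blocks, block_size, train_q_model,
               epsilon=0.1, n_actions=2, rng=random):
    """Iterate the collect-then-retrain cycle described above.

    env_step(state, action) -> reward (the environment).
    train_q_model(history) -> q, where q(state, action) predicts reward.
    The first block uses a uniformly random policy, since no model exists yet.
    """
    history = []   # accumulated (context, action, reward) triples
    q = None       # no Q-function before the first retraining
    for _ in range(n_blocks):
        for _ in range(block_size):
            state = rng.random()                   # toy 1-d context
            if q is None or rng.random() < epsilon:
                action = rng.randrange(n_actions)  # explore / cold start
            else:
                action = max(range(n_actions), key=lambda a: q(state, a))
            history.append((state, action, env_step(state, action)))
        q = train_q_model(history)  # retrain on all data collected so far
    return history
```

In our setting the retraining step is a fresh AutoML Tables run on the accumulated table of triples, so no incremental-learning machinery is needed.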
5. Exploration Strategies and Metrics
Our approach involves using an off-the-shelf product combined with an exploration strategy. We found that an annealing schedule for ε-greedy exploration, with ε decaying as a function of the current iteration number n, gave much better results than a fixed ε value. Our meta-learning approach was built on top of this baseline exploration strategy. We compare the performance of this approach against the Online Cover algorithm. We also showcase the results of a random A/B-testing baseline to compare the performance gains with respect to the simplest possible approach. We used the regret metric defined in Section 3.1 to benchmark our meta-learning approach.
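A minimal sketch of such an annealing schedule, assuming a linear decay from 0.9 to 0.0 (the endpoint values stated later for the public-dataset runs); the horizon `n_total` is illustrative:

```python
def annealed_epsilon(n, n_total, eps_start=0.9, eps_end=0.0):
    """Linearly anneal epsilon from eps_start to eps_end over n_total steps,
    then hold it at eps_end."""
    if n >= n_total:
        return eps_end
    frac = n / n_total
    return eps_start + frac * (eps_end - eps_start)

print(annealed_epsilon(0, 100))    # -> 0.9
print(annealed_epsilon(50, 100))   # -> 0.45
print(annealed_epsilon(100, 100))  # -> 0.0
```

Early episodes thus explore heavily, while later episodes exploit the increasingly reliable Q-function almost exclusively.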
6. Pipeline
Our data collection approach has been explained in Section 4.2. We sampled these batches of experiences guided by our exploration strategy.
We applied the necessary preprocessing steps via Google Cloud BigQuery, which also dynamically handles the data splits. The results are then fed into the input pipeline of the AutoML Tables API. We maintained a held-out inference set to verify performance on unseen data. To validate our approach we ran multiple runs of the same parameter configurations, experimenting with varying numbers of training samples, standard noise and factor variance. With more training samples, we expect our model to map the relationship between the state and action spaces more accurately. We also examined how the Mean Absolute Error varies as the factor size and the number of actions change.
We used the Online Cover implementation of contextual bandits in the Vowpal Wabbit library (John Langford and Strehl, 2007; Agarwal et al., 2014) and compared the performance of both approaches on the synthetic data. Finally, to benchmark our model's performance, we deployed it on the supervised learning datasets proposed in (Bietti et al., 2018) and compared it against other state-of-the-art models. While the Online Cover algorithm has shown promising results and is easy to reproduce with the open-source implementation, (Riquelme et al., 2018) provide a survey of multiple algorithms that could also be used to benchmark our model. For these datasets, the ε value of the ε-greedy policy was annealed over time from 0.9 down to 0.0, which is reflected in its decreasing average regret. The ε-greedy approach achieves a balance between exploration and exploitation, as described more abstractly in the introduction.
7. Experimental Results
Figure 2 showcases the performance of our bandit model powered by AutoML Tables. It performs well, and sometimes markedly better than previous work, on the different datasets. It is worth noting that this low regret prompted us to suspect data leakage, but that turned out not to be the case: inspecting the feature importances, we found no proxy for the classification. The results shown have been averaged over multiple runs, with the datasets randomly reshuffled in each run. We experimented with some of the hyperparameters, with AutoML handling most of these operations under the hood, as mentioned in Section 3.2.
8. Conclusion and Next Steps
We propose the use of an off-the-shelf meta-learning approach to solve the contextual bandit problem with no custom feature engineering required. Our internally generated synthetic environment allowed us to iterate quickly and experiment with different environment conditions and policies. We showcase competitive results on various public datasets, converging to low regret quickly compared to the Online Cover algorithm.
We have showcased our meta-learning model guided by a given ε-greedy policy. As mentioned, our approach is agnostic to the exploration strategy used, and future work would involve experimenting with strategies such as UCB, Thompson sampling and bootstrapped models. An interesting question is whether our meta-learning approach can adapt to time-dependent environments; incrementally adding a noise parameter to the environment would be a telling experiment to mimic real-world scenarios.
References
 Agarwal et al. (2014) Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. CoRR abs/1402.0555 (2014). arXiv:1402.0555 http://arxiv.org/abs/1402.0555
 Bietti et al. (2018) Alberto Bietti, Alekh Agarwal, and John Langford. 2018. A Contextual Bandit Bake-off. https://www.microsoft.com/en-us/research/publication/a-contextual-bandit-bake-off/
 Blackard and Anderson (1998) Dean Blackard and Anderson. 1998. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Bock (2004) R. K. Bock, A. Chilingarian, M. Gaug, F. Hakl, T. Hengstebeck, M. Jirina, J. Klaschka, E. Kotrc, P. Savicky, S. Towers, A. Vaicilius, and W. Wittek. 2004. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Dua and Graff (2017) Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Elsken et al. (2018) Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2018. Neural Architecture Search: A Survey.
 John Langford and Strehl (2007) John Langford, Lihong Li, and Alexander Strehl. 2007. Vowpal Wabbit open source project. Technical Report, Yahoo!, 2007. http://hunch.net/?p=309
 Langford and Zhang (2007) John Langford and Tong Zhang. 2007. The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS'07). Curran Associates Inc., USA, 817–824. http://dl.acm.org/citation.cfm?id=2981562.2981665
 Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. CoRR abs/1003.0146 (2010). arXiv:1003.0146 http://arxiv.org/abs/1003.0146
 Lu (2019) Yifeng Lu. 2019. An End-to-End AutoML Solution for Tabular Data at KaggleDays. http://ai.googleblog.com/2019/05/an-end-to-end-automl-solution-for.html
 Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. 2018. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In International Conference on Learning Representations. https://openreview.net/forum?id=SyYe6kCW
 Sawant et al. (2018) Neela Sawant, Chitti Babu Namballa, Narayanan Sadagopan, and Houssam Nassif. 2018. Contextual MultiArmed Bandits for Causal Marketing. CoRR abs/1810.01859 (2018). arXiv:1810.01859 http://arxiv.org/abs/1810.01859
 Sharaf and III (2019) Amr Sharaf and Hal Daumé III. 2019. Meta-Learning for Contextual Bandit Exploration. CoRR abs/1901.08159 (2019). arXiv:1901.08159 http://arxiv.org/abs/1901.08159
 Slivkins (2019) Aleksandrs Slivkins. 2019. Introduction to Multi-Armed Bandits. CoRR abs/1904.07272 (2019). arXiv:1904.07272 http://arxiv.org/abs/1904.07272
 van Rijn and Vis (2016) J. N. van Rijn and J. K. Vis. 2016. UCI Machine Learning Repository. https://arxiv.org/abs/1604.07312
 Zoph and Le (2016) Barret Zoph and Quoc V. Le. 2016. Neural Architecture Search with Reinforcement Learning. CoRR abs/1611.01578 (2016). arXiv:1611.01578 http://arxiv.org/abs/1611.01578