Recent advances in reinforcement learning have sparked renewed interest in sequential decision making with deep neural networks. Neural networks have proven to be powerful and flexible function approximators, allowing one to learn mappings directly from complex states (e.g., pixels) to estimates of expected return. While such models can be accurate on data they have been trained on, quantifying model uncertainty on new data remains challenging. However, having an understanding of what is not yet known or well understood is critical to some central tasks of machine intelligence, such as effective exploration for decision making.
A fundamental aspect of sequential decision making is the exploration-exploitation dilemma: in order to maximize cumulative reward, agents need to trade off what is expected to be best at the moment (i.e., exploitation) with potentially sub-optimal exploratory actions. Solving this trade-off efficiently is a significant challenge, as it requires uncertainty estimates. Furthermore, exploratory actions should be coordinated throughout the entire decision making process, known as deep exploration, rather than performed independently at each state.
Thompson Sampling (Thompson, 1933) and its extension to reinforcement learning, known as Posterior Sampling, provide an elegant approach that tackles the exploration-exploitation dilemma by maintaining a posterior over models and choosing actions in proportion to the probability that they are optimal. Unfortunately, maintaining such a posterior is intractable for all but the simplest models. As such, significant effort has been dedicated to approximate Bayesian methods for deep neural networks. These range from variational methods (Graves, 2011; Blundell et al., 2015; Kingma et al., 2015) to stochastic minibatch Markov Chain Monte Carlo (Neal, 1994; Welling & Teh, 2011; Li et al., 2016; Ahn et al., 2012; Mandt et al., 2016), among others. Because the exact posterior is intractable, evaluating these approaches is hard. Furthermore, these methods are rarely compared on benchmarks that measure the quality of their estimates of uncertainty for downstream tasks.
To address this challenge, we develop a benchmark for exploration methods using deep neural networks. We compare a variety of well-established and recent Bayesian approximations under the lens of Thompson Sampling for contextual bandits, a classical task in sequential decision making. All code and implementations to reproduce the experiments will be available open-source, to provide a reproducible benchmark for future development. (Footnote 1: Available in Python and Tensorflow at https://sites.google.com/site/deepbayesianbandits/.)
Exploration in the context of reinforcement learning is a highly active area of research. Simple strategies such as epsilon-greedy remain extremely competitive (Mnih et al., 2015; Schaul et al., 2016). However, a number of promising techniques have recently emerged that encourage exploration through carefully adding random noise to the parameters (Plappert et al., 2017; Fortunato et al., 2017; Gal & Ghahramani, 2016) or bootstrap sampling (Osband et al., 2016) before making decisions. These methods rely explicitly or implicitly on posterior sampling for exploration.
In this paper, we investigate how different posterior approximations affect the performance of Thompson Sampling from an empirical standpoint. For simplicity, we restrict ourselves to one of the most basic sequential decision making scenarios: that of contextual bandits.
No single algorithm bested the others in every bandit problem; however, we observed some general trends. We found that dropout, injecting random noise, and bootstrapping did provide a strong boost in performance on some tasks, but were not able to solve challenging synthetic exploration tasks. Other algorithms, like Variational Inference, Black Box $\alpha$-divergence, and minibatch Markov Chain Monte Carlo approaches, strongly couple their complex representation and uncertainty estimates. This proves problematic when decisions are made based on partial optimization of both, as online scenarios usually require. On the other hand, making decisions according to a Bayesian linear regression on the representation provided by the last layer of a deep network offers a robust and easy-to-tune approach. It would be interesting to try this approach on more complex reinforcement learning domains.
In Section 2 we discuss Thompson Sampling, and present the contextual bandit problem. The different algorithmic approaches that approximate the posterior distribution fed to Thompson Sampling are introduced in Section 3, while the linear case is described in Section 4. The main experimental results are presented in Section 5, and discussed in Section 6. Finally, Section 7 concludes.
2 Decision-Making via Thompson Sampling
The contextual bandit problem works as follows. At time $t = 1, \dots, n$ a new context $X_t \in \mathbb{R}^d$ arrives and is presented to algorithm $\mathcal{A}$. The algorithm, based on its internal model and $X_t$, selects one of the $k$ available actions, $a_t$. Some reward $r_t = r_t(X_t, a_t)$ is then generated and returned to the algorithm, which may update its internal model with the new data. At the end of the process, the reward for the algorithm is given by $r = \sum_{t=1}^{n} r_t$, and cumulative regret is defined as $R_{\mathcal{A}} = \mathbb{E}[r^* - r]$, where $r^*$ is the cumulative reward of the optimal policy (i.e., the policy that always selects the action with highest expected reward given the context). The goal is to minimize $R_{\mathcal{A}}$.
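The interaction protocol described above can be sketched as a simple loop. This is a minimal illustration and not the benchmark code: the policy interface (`select`, `update`), the toy two-action reward function, and the noise level are hypothetical choices made for the example. Regret is accumulated against the best expected reward for each context.

```python
import random


class UniformPolicy:
    """Hypothetical baseline: ignores the context and picks an action at random."""

    def select(self, context, num_actions, rng):
        return rng.randrange(num_actions)

    def update(self, context, action, reward):
        pass  # a learning algorithm would refit its internal model here


class OraclePolicy:
    """Hypothetical optimal policy: knows the expected reward of every action."""

    def __init__(self, expected_reward):
        self.expected_reward = expected_reward

    def select(self, context, num_actions, rng):
        return max(range(num_actions),
                   key=lambda a: self.expected_reward(context, a))

    def update(self, context, action, reward):
        pass


def toy_expected_reward(context, action):
    # Assumed toy problem: action 0 pays x on average, action 1 pays -x.
    return context if action == 0 else -context


def run_bandit(policy, expected_reward, contexts, num_actions,
               noise_std=0.1, seed=0):
    """Return (cumulative reward, cumulative regret) after one pass."""
    rng = random.Random(seed)
    total_reward, total_regret = 0.0, 0.0
    for x in contexts:
        action = policy.select(x, num_actions, rng)
        means = [expected_reward(x, a) for a in range(num_actions)]
        reward = means[action] + rng.gauss(0.0, noise_std)
        policy.update(x, action, reward)
        total_reward += reward
        # Regret is measured on expected rewards, not on realized ones.
        total_regret += max(means) - means[action]
    return total_reward, total_regret


context_rng = random.Random(1)
contexts = [context_rng.uniform(-1.0, 1.0) for _ in range(1000)]
_, uniform_regret = run_bandit(UniformPolicy(), toy_expected_reward, contexts, 2)
_, oracle_regret = run_bandit(OraclePolicy(toy_expected_reward),
                              toy_expected_reward, contexts, 2)
```

In this toy setting the oracle incurs zero regret by construction, while the uniform policy accumulates regret linearly in the number of rounds.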
The main research question we address in this paper is how approximated model posteriors affect the performance of decision making via Thompson Sampling (Algorithm 1) in contextual bandits. We study a variety of algorithmic approaches to approximate a posterior distribution, together with different empirical and synthetic data problems that highlight several aspects of decision making. We consider distributions $\pi$ over the space of parameters that completely define a problem instance $\theta \in \Theta$. For example, $\theta$ could encode the reward distributions of a set of arms in the multi-armed bandit scenario, or, more generally, all the parameters of an MDP in reinforcement learning.
Thompson Sampling is a classic algorithm (Thompson, 1933) which requires only that one can sample from the posterior distribution over plausible problem instances (for example, values or rewards). At each round, it draws a sample and takes a greedy action under the optimal policy for the sample. The posterior distribution is then updated after the result of the action is observed. Thompson Sampling has been shown to be extremely effective for bandit problems both in practice (Chapelle & Li, 2011; Granmo, 2010) and theory (Agrawal & Goyal, 2012). It is especially appealing for deep neural networks as one rarely has access to the full posterior but can often approximately sample from it.
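The sample, act greedily, update cycle described above is easiest to see in a setting where the posterior is exact. The sketch below uses a non-contextual Bernoulli bandit with independent Beta(1, 1) priors, which is an assumption made for illustration and not one of the problems studied here; the function name and arm probabilities are hypothetical.

```python
import random


def thompson_bernoulli(true_probs, num_rounds, seed=0):
    """Thompson Sampling on a Bernoulli bandit; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [1] * k  # Beta(1, 1) prior: alpha parameters
    failures = [1] * k   # Beta(1, 1) prior: beta parameters
    pulls = [0] * k
    for _ in range(num_rounds):
        # 1. Draw one plausible problem instance from the posterior.
        sampled = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        # 2. Act greedily with respect to the sampled instance.
        arm = max(range(k), key=lambda i: sampled[i])
        # 3. Observe the reward and update the posterior of the pulled arm.
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_bernoulli([0.2, 0.5, 0.8], num_rounds=2000)
```

As the posteriors concentrate, the arm with the highest success probability receives the large majority of pulls, while the other arms are still sampled occasionally early on.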
In the following sections we rely on the idea that, if we had access to the actual posterior $\pi_t$ given the observed data at all times $t$, then choosing actions using Thompson Sampling would lead to near-optimal cumulative regret or, more informally, to good performance. It is important to remark that in some problems this is not necessarily the case; for example, when actions that have no chance of being optimal still convey useful information about other actions. Thompson Sampling (or UCB approaches) would never select such actions, even if they are worth their cost (Russo & Van Roy, 2014). In addition, Thompson Sampling does not take into account the time horizon where the process ends, and if known, exploration efforts should be tuned accordingly (Russo et al., 2017). Nonetheless, under the assumption that very accurate posterior approximations lead to efficient decisions, the question is: what happens when the approximations are not so accurate? In some cases, the mismatch in posteriors may not hurt in terms of decision making, and we will still end up with good decisions. Unfortunately, in other cases, this mismatch together with its induced feedback loop will degenerate into a significant loss of performance. We would like to understand the main aspects that determine which way it goes. This is an important practical question as, in large and complex systems, computational sacrifices and statistical assumptions are made to favor simplicity and tractability. But what is their impact?
In this section, we describe the different algorithmic design principles that we considered in our simulations of Section 5. These algorithms include linear methods, Neural Linear and Neural Greedy, variational inference, expectation-propagation, dropout, Monte Carlo methods, bootstrapping, direct noise injection, and Gaussian Processes. In Figure 6 in the appendix, we visualize the posteriors of the nonlinear algorithms on a synthetic one dimensional problem.
Linear Methods We apply well-known closed-form updates for Bayesian linear regression for exact posterior inference in linear models (Bishop, 2006). We provide the specific formulas below, and note that they admit a computationally-efficient online version. We consider exact linear posteriors as a baseline; i.e., these formulas compute the posterior when the data was generated according to $Y = X^T \beta + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and $Y$ represents the reward. Importantly, we model the joint distribution of $\beta$ and $\sigma^2$ for each action. Sequentially estimating the noise level $\sigma^2$ for each action allows the algorithm to adaptively improve its understanding of the volume of the hyperellipsoid of plausible $\beta$'s; in general, this leads to a more aggressive initial exploration phase (in both $\beta$ and $\sigma^2$).
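The posterior at time $t$ for action $i$, after observing $X, Y$, is $\pi_t(\beta, \sigma^2) = \pi_t(\beta \mid \sigma^2)\, \pi_t(\sigma^2)$, where we assume $\sigma^2 \sim \mathrm{IG}(a_t, b_t)$ and $\beta \mid \sigma^2 \sim \mathcal{N}(\mu_t, \sigma^2 \Sigma_t)$, an Inverse Gamma and Gaussian distribution, respectively. Their parameters are given by

$$\Sigma_t = \left(X^T X + \Lambda_0\right)^{-1}, \qquad \mu_t = \Sigma_t \left(\Lambda_0 \mu_0 + X^T Y\right), \tag{1}$$

$$a_t = a_0 + t/2, \qquad b_t = b_0 + \tfrac{1}{2}\left(Y^T Y + \mu_0^T \Lambda_0 \mu_0 - \mu_t^T \Sigma_t^{-1} \mu_t\right). \tag{2}$$

We set the prior hyperparameters to $\mu_0 = 0$ and $\Lambda_0 = \lambda\, \mathrm{Id}$, while $a_0 = b_0 = \eta > 1$. It follows that initially, for $\sigma_0^2 \sim \mathrm{IG}(\eta, \eta)$, we have the prior $\beta \mid \sigma_0^2 \sim \mathcal{N}(0, (\sigma_0^2 / \lambda)\, \mathrm{Id})$, where $\mathbb{E}[\sigma_0^2] = \eta / (\eta - 1)$. Note that we independently model and regress each action's parameters, $\beta_i, \sigma_i^2$ for $i = 1, \dots, k$.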
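To make the closed-form update concrete, the following sketch implements the standard normal-inverse-gamma conjugate update for the scalar-context special case ($d = 1$), followed by one joint posterior draw of the kind Thompson Sampling would use. The function names and the hyperparameter values (`lam=0.25`, `eta=6.0`) are illustrative assumptions, as is the synthetic data.

```python
import random


def nig_posterior_1d(xs, ys, lam=0.25, eta=6.0):
    """Posterior parameters (mu, sigma, a, b) for y = beta * x + noise, scalar x.

    Assumed prior: sigma^2 ~ IG(eta, eta), beta | sigma^2 ~ N(0, sigma^2 / lam).
    """
    precision0 = lam          # prior precision (scalar version of Lambda_0)
    mu0 = 0.0                 # prior mean
    precision = sum(x * x for x in xs) + precision0
    sigma = 1.0 / precision   # scalar version of Sigma_t
    mu = sigma * (precision0 * mu0 + sum(x * y for x, y in zip(xs, ys)))
    a = eta + len(xs) / 2.0
    b = eta + 0.5 * (sum(y * y for y in ys)
                     + mu0 * precision0 * mu0
                     - mu * precision * mu)
    return mu, sigma, a, b


def sample_beta_sigma2(mu, sigma, a, b, rng):
    """Draw sigma^2 ~ IG(a, b), then beta | sigma^2 ~ N(mu, sigma^2 * sigma)."""
    # If G ~ Gamma(shape=a, scale=1/b), then 1/G ~ InverseGamma(a, b).
    sigma2 = 1.0 / rng.gammavariate(a, 1.0 / b)
    beta = rng.gauss(mu, (sigma2 * sigma) ** 0.5)
    return beta, sigma2


rng = random.Random(0)
true_beta, noise_std = 1.5, 0.1
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [true_beta * x + rng.gauss(0.0, noise_std) for x in xs]

mu, sigma, a, b = nig_posterior_1d(xs, ys)
beta_draw, sigma2_draw = sample_beta_sigma2(mu, sigma, a, b, rng)
```

In a bandit, one such model would be kept per action; at each round a `(beta, sigma2)` pair is drawn for every action and the action with the highest sampled prediction is played.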
We consider two approximations to (1) motivated by function approximators where $d$ is large. While posterior distributions or confidence ellipsoids should capture dependencies across parameters as shown above (say, a dense $\Sigma_t$), in practice, computing the correlations across all pairs of parameters is too expensive, and diagonal covariance approximations are common. For linear models it may still be feasible to exactly compute (1), whereas in the case of Bayesian neural networks, unfortunately, this may no longer be possible. Accordingly, we study two linear approximations where $\Sigma_t$ is diagonal. Our goal is to understand the impact of such approximations in the simplest case, to properly set our expectations for the loss in performance of equivalent approximations in more complex approaches, like mean-field variational inference or Stochastic Gradient Langevin Dynamics.
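Assume for simplicity the noise standard deviation is known. In Figure 2(a), for $d = 2$, we see the posterior distribution $\beta_t \sim \mathcal{N}(\mu_t, \Sigma_t)$ of a linear model based on (1), in green, together with two diagonal approximations. Each approximation tries to minimize a different objective. In blue, the PrecisionDiag posterior approximation finds the diagonal $\hat{\Sigma}$ minimizing $\mathrm{KL}(\mathcal{N}(\mu_t, \hat{\Sigma}) \,\|\, \mathcal{N}(\mu_t, \Sigma_t))$, like in mean-field variational inference. In particular, $\hat{\Sigma} = \mathrm{Diag}(\Sigma_t^{-1})^{-1}$. On the other hand, in orange, the Diag posterior approximation finds the diagonal matrix $\bar{\Sigma}$ minimizing $\mathrm{KL}(\mathcal{N}(\mu_t, \Sigma_t) \,\|\, \mathcal{N}(\mu_t, \bar{\Sigma}))$ instead. In this case, the solution is simply $\bar{\Sigma} = \mathrm{Diag}(\Sigma_t)$.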
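A small numerical example may help to see how differently the two diagonal approximations behave. The sketch assumes the usual minimizers of the two KL directions for Gaussians: Diag keeps the diagonal of the covariance, while PrecisionDiag inverts the diagonal of the precision matrix. The function name and the example covariance are hypothetical.

```python
def diag_approximations(cov):
    """Return (Diag, PrecisionDiag) variances for a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Diag: keep the marginal variances and drop the correlations.
    diag = (s11, s22)
    # PrecisionDiag: the precision matrix has diagonal (s22/det, s11/det);
    # inverting each diagonal entry gives the approximate variances.
    precision_diag = (det / s22, det / s11)
    return diag, precision_diag


cov = [[1.0, 0.8], [0.8, 1.0]]  # strongly correlated parameters
diag, precision_diag = diag_approximations(cov)
```

With correlation 0.8, Diag keeps unit variances, whereas PrecisionDiag shrinks each variance by a factor of $1 - 0.8^2$ to roughly 0.36, so it is much more concentrated. When the covariance is already diagonal the two approximations coincide.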
We add linear baselines that do not model the uncertainty in the action noise $\sigma^2$. In addition, we also consider simple greedy and epsilon-greedy linear baselines (i.e., not based on Thompson Sampling).
Neural Linear The main problem linear algorithms face is their lack of representational power, which they complement with accurate uncertainty estimates. A natural attempt at getting the best of both worlds consists in performing a Bayesian linear regression on top of the representation of the last layer of a neural network, similarly to Snoek et al. (2015). The predicted value $v_i$ for each action $a_i$ is given by $v_i = \beta_i^T z_x$, where $z_x$ is the output of the last hidden layer of the network for context $x$. While linear methods directly try to regress values $v$ on $x$, we can independently train a deep net to learn a representation $z$, and then use a Bayesian linear regression to regress $v$ on $z$, obtain uncertainty estimates on the $\beta$'s, and make decisions accordingly via Thompson Sampling. Note that we do not explicitly consider the weights of the linear output layer of the network to make decisions; further, the network is only used to find good representations $z$. In addition, we can update the network and the linear regression at different time-scales. It makes sense to keep an exact linear regression (as in (1) and (2)) at all times, adding each new data point as soon as it arrives. However, we only update the network after a number of points have been collected. In our experiments, after updating the network, we perform a forward pass on all the training data to obtain $z_x$, which is then fed to the Bayesian regression. In practice this may be too expensive, and $z$ could be updated periodically with online updates on the regression. We call this algorithm Neural Linear.
Neural Greedy We refer to the algorithm that simply trains a neural network and acts greedily (i.e., takes the action whose predicted score for the current context is highest) as RMS, as we train it using the RMSProp optimizer. This is our non-linear baseline, and we tested several versions of it (based on whether the training step was decayed, reset to its initial value for each re-training or not, and how long the network was trained for). We also tried the $\epsilon$-greedy version of the algorithm, where a random action was selected with probability $\epsilon$ for some decaying schedule of $\epsilon$.
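For reference, greedy and epsilon-greedy action selection on top of a vector of predicted scores can be sketched as follows. The inverse-time decay schedule and its constants are assumptions chosen for illustration; the text does not specify which schedule was used.

```python
import random


def epsilon_greedy_action(scores, epsilon, rng):
    """Pick a uniformly random action w.p. epsilon, else the top-scoring one."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])


def decayed_epsilon(step, start=0.1, decay=0.001):
    """Hypothetical inverse-time decay: start / (1 + decay * step)."""
    return start / (1.0 + decay * step)


rng = random.Random(0)
predicted_scores = [0.1, 0.9, 0.3]  # e.g., network outputs for one context
actions = [epsilon_greedy_action(predicted_scores, decayed_epsilon(t), rng)
           for t in range(1000)]
```

Setting `epsilon=0.0` recovers the purely greedy baseline.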
Variational Inference Variational approaches approximate the posterior by finding a distribution within a tractable family that minimizes the KL divergence to the posterior (Hinton & Van Camp, 1993). These approaches formulate and solve an optimization problem, as opposed, for example, to sampling methods like MCMC (Jordan et al., 1999; Wainwright et al., 2008). Typically (and in our experiments), the posterior is approximated by a mean-field or factorized distribution where strong independence assumptions are made. For instance, each neural network weight can be modeled via a (conditionally independent) Gaussian distribution whose mean and variance are estimated from data. Recent advances have scaled these approaches to estimate the posterior of neural networks with millions of parameters (Blundell et al., 2015). A common criticism of variational inference is that it underestimates uncertainty (e.g., Bishop, 2006), which could lead to under-exploration.
Expectation-Propagation The family of expectation-propagation algorithms (Opper & Winther, 2000; Minka, 2001b, a) is based on the message passing framework (Pearl, 1986). They iteratively approximate the posterior by updating a single approximation factor (or site) at a time, which usually corresponds to the likelihood of one data point. The algorithm sequentially minimizes a set of local KL divergences, one for each site. Most often, and for computational reasons, likelihoods are chosen to lie in the exponential family. In this case, the minimization corresponds to moment matching. See Gelman et al. (2014) for further details. We focus on methods that directly optimize the global EP objective via stochastic gradient descent, as, for instance, Power EP (Minka, 2004). In particular, in this work, we implement the black-box $\alpha$-divergence minimization algorithm (Hernández-Lobato et al., 2016), where local parameter sharing is applied to the Power EP energy function. Note that different values of $\alpha$ correspond to common algorithms: $\alpha = 1$ to EP, and $\alpha \to 0$ to Variational Bayes. The optimal $\alpha$ value is problem-dependent (Hernández-Lobato et al., 2016).
Dropout Dropout is a training technique where the output of each neuron is independently zeroed out with probability $p$ at each forward pass (Srivastava et al., 2014). Once the network has been trained, dropout can still be used to obtain a distribution of predictions for a specific input. Following the best action with respect to the random dropout prediction can be interpreted as an implicit form of Thompson sampling. Dropout can be seen as optimizing a variational objective (Kingma et al., 2015; Gal & Ghahramani, 2016; Hron et al., 2017).
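The idea of acting on a single random dropout prediction can be sketched with a tiny hand-written network. Everything below is hypothetical: the weights are fixed by hand rather than trained, there is one hidden layer instead of two, and the usual inverted-dropout rescaling by the keep probability is assumed.

```python
import random


def dropout_forward(x, w_hidden, w_out, keep_prob, rng):
    """One stochastic forward pass of a one-hidden-layer ReLU network."""
    hidden = []
    for row in w_hidden:
        pre_activation = sum(w * xi for w, xi in zip(row, x))
        activation = max(0.0, pre_activation)
        # Keep each hidden unit independently with probability keep_prob.
        mask = 1.0 if rng.random() < keep_prob else 0.0
        hidden.append(activation * mask / keep_prob)
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def dropout_thompson_action(x, w_hidden, w_out, keep_prob, rng):
    """Act greedily with respect to a single random dropout prediction."""
    scores = dropout_forward(x, w_hidden, w_out, keep_prob, rng)
    return max(range(len(scores)), key=lambda i: scores[i])


# Hand-picked weights: 2 inputs, 3 hidden units, 2 actions.
w_hidden = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
context = [1.0, 2.0]

rng = random.Random(0)
sampled_actions = [dropout_thompson_action(context, w_hidden, w_out, 0.5, rng)
                   for _ in range(200)]
```

With `keep_prob=1.0` the prediction is deterministic and the second action is always chosen; with dropout active, the first action is also selected on a fraction of the passes, which is the source of exploration.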
Monte Carlo Monte Carlo sampling remains one of the simplest and most reliable tools in the Bayesian toolbox. Rather than parameterizing the full posterior, Monte Carlo methods estimate the posterior through drawing samples. This is naturally appealing for highly parameterized deep neural networks for which the posterior is intractable in general and even simple approximations such as multivariate Gaussian are too expensive (i.e., require computing and inverting a covariance matrix over all parameters). Among Monte Carlo methods, Hamiltonian Monte Carlo (HMC) (Neal, 1994) is often regarded as a gold standard algorithm for neural networks as it takes advantage of gradient information and momentum to more effectively draw samples. However, it remains infeasible for larger datasets as it involves a Metropolis accept-reject step that requires computing the log likelihood over the whole data set. A variety of methods have been developed to approximate HMC using mini-batch stochastic gradients. These Stochastic Gradient Langevin Dynamics (SGLD) methods (Neal, 1994; Welling & Teh, 2011) add Gaussian noise to the model gradients during stochastic gradient updates in such a manner that each update results in an approximate sample from the posterior. Different strategies have been developed for augmenting the gradients and noise according to a preconditioning matrix. Li et al. (2016) show that a preconditioner based on the RMSprop algorithm performs well on deep neural networks. Patterson & Teh (2013) suggested using the Fisher information matrix as a preconditioner in SGLD. Unfortunately, the approximations of SGLD hold only if the learning rate is asymptotically annealed to zero. Ahn et al. (2012) introduced Stochastic Gradient Fisher Scoring (SGFS) to elegantly remove this requirement by preconditioning according to the Fisher information (or a diagonal approximation thereof). Mandt et al.
(2016) develop methods for approximately sampling from the posterior using a constant learning rate in stochastic gradient descent and develop a prescription for a stable version of SGFS. We evaluate the diagonal-SGFS and constant-SGD algorithms from Mandt et al. (2016) in this work. Specifically for constant-SGD we use a constant learning rate for stochastic gradient descent, where the learning rate is given by where is the batch size, the number of data points and is an online average of the diagonal empirical Fisher information matrix. For Stochastic Gradient Fisher Scoring we use the following stochastic gradient update for the model parameters at step :
where we take the noise covariance to also be and .
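To illustrate the basic mechanism shared by this family of samplers, the sketch below runs plain SGLD (in the style of Welling & Teh, 2011) on a one-dimensional Gaussian mean model. It is not the diagonal-SGFS or constant-SGD variant evaluated in this work: there is no preconditioner, and the step size, batch size, prior variance, and data are all assumptions made for the example.

```python
import random


def sgld_gaussian_mean(data, num_steps, step_size=0.001, batch_size=10,
                       prior_var=100.0, seed=0):
    """Plain SGLD for y_i ~ N(theta, 1) with prior theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n = len(data)
    theta, samples = 0.0, []
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior plus a rescaled minibatch estimate of the
        # gradient of the log likelihood.
        grad_prior = -theta / prior_var
        grad_lik = (n / batch_size) * sum(y - theta for y in batch)
        # Injected Gaussian noise with variance equal to the step size.
        noise = rng.gauss(0.0, step_size ** 0.5)
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
samples = sgld_gaussian_mean(data, num_steps=3000)
kept = samples[500:]  # discard an initial burn-in period
posterior_mean_estimate = sum(kept) / len(kept)
```

After burn-in, the iterates fluctuate around the posterior mean, which here is very close to the sample mean of the data; each iterate can be used as an approximate posterior draw for Thompson Sampling.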
Bootstrap A simple empirical approach to approximate the sampling distribution of any estimator is the Bootstrap (Efron, 1982). The main idea is to simultaneously train $q$ models, where each model $i$ is based on a different dataset $D_i$. When all the data $D$ is available in advance, $D_i$ is typically created by sampling $|D|$ elements from $D$ at random with replacement. In our case, however, the data grows one example at a time. Accordingly, we set a parameter $p \in (0, 1]$, and append the new datapoint to each $D_i$ independently at random with probability $p$. In order to emulate Thompson Sampling, we sample a model uniformly at random (i.e., with probability $1/q$) and take the action predicted to be best by the sampled model. We mainly tested a few values of $q$ and $p$, with neural network models. Note that even when $p = 1$ and the datasets are identical, the random initialization of each network, together with the randomness from SGD, leads to different predictions.
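The online variant of the bootstrap described above amounts to a few lines of bookkeeping. In this sketch the class and method names are hypothetical, and model training is omitted: only the per-model datasets and the uniform choice of a model are shown.

```python
import random


class OnlineBootstrap:
    """Maintains one growing dataset per ensemble member."""

    def __init__(self, num_models, p, seed=0):
        self.rng = random.Random(seed)
        self.p = p
        self.datasets = [[] for _ in range(num_models)]

    def add(self, datapoint):
        # Each dataset receives the new point independently w.p. p.
        for dataset in self.datasets:
            if self.rng.random() < self.p:
                dataset.append(datapoint)

    def sample_model_index(self):
        # Emulate Thompson Sampling: pick one ensemble member uniformly.
        return self.rng.randrange(len(self.datasets))


ensemble = OnlineBootstrap(num_models=5, p=0.8)
for t in range(100):
    ensemble.add(("context_%d" % t, "action", "reward"))  # placeholder record
chosen = ensemble.sample_model_index()
```

At decision time, the model with index `chosen` (trained on `ensemble.datasets[chosen]`) would score the current context, and its top-scoring action would be played.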
Direct Noise Injection Parameter-Noise (Plappert et al., 2017) is a recently proposed approach for exploration in deep RL that has shown promising results. The training updates for the network are unchanged, but when selecting actions, the network weights are perturbed with isotropic Gaussian noise. Crucially, the network uses layer normalization (Ba et al., 2016), which ensures that all weights are on the same scale. The magnitude of the Gaussian noise is adjusted so that the overall effect of the perturbations is similar in scale to $\epsilon$-greedy with a linearly decaying schedule (see Plappert et al. (2017) for details). Because the perturbations are done on the model parameters, we might hope that the actions produced by the perturbations are more sensible than $\epsilon$-greedy.
Bayesian Non-parametric Gaussian processes (Rasmussen & Williams, 2005) are a gold-standard method for modeling distributions over non-linear continuous functions. It can be shown that, in the limit of infinite hidden units and under a Gaussian prior, a Bayesian neural network converges to a Gaussian process (Neal, 1994). As such, GPs would appear to be a natural baseline. Unfortunately, standard GPs computationally scale cubically in the number of observations, limiting their applicability to relatively small datasets. There are a wide variety of methods to approximate Gaussian processes using, for example, pseudo-observations (Snelson & Ghahramani, 2006) or variational inference (Titsias, 2009). We implemented both standard and sparse GPs but only report the former due to similar performance. For the standard GP, due to the scaling issue, we stop adding inputs to the GP after 1000 observations. This performed significantly better than randomly sampling inputs. Our implementation is a multi-task Gaussian process (Bonilla et al., 2008) with a linear and Matern product kernel over the inputs and an exponentiated quadratic kernel over latent vectors for the different tasks. The hyperparameters of this model and the latent task vectors are optimized over the GP marginal likelihood. This allows the model to learn correlations between the outputs of the model. Specifically, the covariance function of the GP is given by:
and the task kernel between tasks and are where indexes the latent vector for task and . The length-scales, and , and amplitude parameters , are optimized via the log marginal likelihood. For the sparse version we used a Sparse Variational GP (Hensman et al., 2015) with the same kernel and with 300 inducing points, trained via minibatch stochastic gradient descent (Matthews et al., 2017).
4 Feedback Loop in the Linear Case
In this section, we illustrate some of the subtleties that arise when uncertainty estimates drive sequential decision-making using simple linear examples.
There is a fundamental difference between static and dynamic scenarios. In a static scenario, e.g., supervised learning, we are given a model family $\Theta$ (like the set of linear models, trees, or neural networks with specific dimensions), a prior distribution $\pi_0$ over $\Theta$, and some observed data $D$ that, importantly, is assumed i.i.d. Our goal is to return an approximate posterior distribution: $\hat{\pi} \approx \pi = \mathbb{P}(\theta \mid D)$. We define the quality of our approximation by means of some distance $d(\hat{\pi}, \pi)$.
On the other hand, in dynamic settings, our estimate at time $t$, say $\hat{\pi}_t$, will be used via some mechanism $\mathcal{M}$, in this case Thompson sampling, to collect the next data-point, which is then appended to $D_t$. In this case, the data-points in $D_t$ are no longer independent. $D_t$ will now determine two distributions: the posterior given the data that was actually observed, $\pi_{t+1} = \mathbb{P}(\theta \mid D_t)$, and our new estimate $\hat{\pi}_{t+1}$. When the goal is to make good sequential decisions in terms of cumulative regret, the distance $d_t = d(\hat{\pi}_t, \pi_t)$ is in general no longer a definitive proxy for performance. For instance, a poorly-approximated decision boundary could lead an algorithm, based on $\hat{\pi}$, to get stuck repeatedly selecting a single sub-optimal action $a$. After collecting lots of data for that action, $\hat{\pi}_t$ and $\pi_t$ could start to agree (to their capacity) on the models that explain what was observed for $a$, while both would stick to something close to the prior regarding the other actions. At that point, $d_t$ may show relatively little disagreement, but the regret would already be terrible.
Let $\pi_t^*$ be the posterior distribution under Thompson Sampling's assumption, that is, data was always collected according to $\pi_j^*$ for $j < t$. We follow the idea that $\hat{\pi}_t$ being close to $\pi_t^*$ for all $t$ leads to strong performance. However, this concept is difficult to formalize: once different decisions are made, data for different actions is collected and it is hard to compare posterior distributions.
We illustrate the previous points with a simple example, see Figure 1. Data is generated according to a bandit with arms. For a given context , the reward obtained by pulling arm follows a linear model with . The posterior distribution over can be exactly computed using the standard Bayesian linear regression formulas presented in Section 3. We set the contextual dimension , and the prior to be , for .
In Figure 1, we show the posterior distribution for two dimensions of $\beta_i$ for each arm $i$ after $n$ pulls. In particular, in Figure 1(a), two independent runs of Thompson Sampling with their posterior distribution are displayed in red and green. While strongly aligned, the estimates for some arms disagree (especially for arms that are best only for a small fraction of the contexts, like Arms 2 and 3, where fewer data-points are available). In Figure 1(b), we also consider Thompson Sampling with an approximate posterior with diagonal covariance matrix, Diag in red, as defined in Section 3. Each algorithm collects its own data based on its current posterior (or approximation). In this case, the posterior disagreement after $n$ decisions is certainly stronger. However, as shown in Figure 1(c), if we computed the approximate posterior with a diagonal covariance matrix based on the data collected by the actual posterior, the disagreement would be reduced as much as possible within the approximation capacity (i.e., it still cannot capture correlations in this case). Figure 1(b) then shows the effect of the feedback loop. We look next at the impact that this mismatch has on regret.
We illustrate with a similar example how inaccurate posteriors sometimes lead to quite different behaviors in terms of regret. In Figure 2(a), we see the posterior distribution of a linear model in green, together with the two diagonal linear approximations introduced in Section 3: the Diag (in orange) and the PrecisionDiag (in blue) approximations, respectively. We now assume there are $k$ linear arms, $\beta_i \in \mathbb{R}^d$ for $i = 1, \dots, k$, and decisions are made according to the posteriors in Figure 2(a). In Figures 2(b) and 2(c) we plot the regret of Thompson Sampling for a fixed number of arms and two values of the dimension $d$. We see that, while the PrecisionDiag approximation even outperforms the actual posterior, the diagonal covariance approximation truly suffers poor regret when we increase the dimension $d$, as it is heavily penalized by simultaneously over-exploring in a large number of dimensions and repeatedly acting according to implausible models.
5 Empirical Evaluation
In this section, we present the simulations and outcomes of several synthetic and real-world data bandit problems with each of the algorithms introduced in Section 3. In particular, we first explain how the simulations were set up and run, and the metrics we report. We then split the experiments according to how data was generated, and the underlying models fit by the algorithms from Section 3.
5.1 The Experimental Framework
We run the contextual bandit experiments as described at the beginning of Section 2, and discuss below some implementation details of both experiments and algorithms. A detailed summary of the key parameters used for each algorithm can be found in Table 2 in the appendix.
Neural Network Architectures
All algorithms based on neural networks as function approximators share the same architecture. In particular, we fit a simple fully-connected feedforward network with two hidden layers with 100 units each and ReLU activations. The input of the network has dimension $d$ (same as the contexts), and there are $k$ outputs, one per action. Note that for each training point only one action was observed (and algorithms usually only take into account the loss corresponding to the prediction for the observed action).
Updating Models A key question is how often and for how long models are updated. Ideally, we would like to train after each new observation and for as long as possible. However, this may limit the applicability of our algorithms in online scenarios where decisions must be made immediately. We update linear algorithms after each time-step by means of (1) and (2). For neural networks, the default behavior was to train for a fixed number $t_s$ of mini-batches every $t_f$ timesteps. (Footnote 2: For reference, the standard strategy for Deep Q-Networks on Atari is to make one model update after every 4 actions performed (Mnih et al., 2015; Osband et al., 2016; Plappert et al., 2017; Fortunato et al., 2017).) The size of each mini-batch was 512. We experimented with increasing values of $t_s$, and it proved essential for some algorithms like variational inference approaches. See the details in Table 2.
Metrics We report two metrics: cumulative regret and simple regret. We approximate the latter as the mean cumulative regret in the last 500 time-steps, a proxy for the quality of the final policy (see further discussion on pure exploration settings, Bubeck et al. (2009)). Cumulative regret is computed based on the best expected reward, as is standard. For most real datasets (Statlog, Covertype, Jester, Adult, Census, and Song), the rewards were deterministic, in which case, the definition of regret also corresponds to the highest realized reward (i.e., possibly leading to a hard task, which helps to understand why in some cases all regrets look linear). We reshuffle the order of the contexts, and rerun the experiment 50 times to obtain the cumulative regret distribution and report its statistics.
Hyper-Parameter Tuning Deep learning methods are known to be very sensitive to the selection of a wide variety of hyperparameters, and many of the algorithms presented are no exception. Moreover, that choice is known to be highly dataset dependent. Unfortunately, in the bandits scenario, we commonly do not have access to each problem a priori to perform tuning. For the vast majority of algorithms, we report the outcome for three versions of the algorithm defined as follows. First, we use one version where hyper-parameters take values we guessed to be reasonable a priori. Then, we add two additional instances whose hyper-parameters were optimized on two different datasets via Bayesian Optimization. For example, in the case of Dropout, the former version is named Dropout, while the optimized versions are named Dropout-MR (using the Mushroom dataset) and Dropout-SL (using the Statlog dataset) respectively. Some algorithms truly benefit from hyper-parameter optimization, while others do not show remarkable differences in performance; the latter are more appropriate in settings where access to the real environment for tuning is not possible in advance.
Buffer After some experimentation, we decided not to use a data buffer as evidence of catastrophic forgetting was observed, and datasets are relatively small. Accordingly, all observations are sampled with equal probability to be part of a mini-batch. In addition, as is standard in bandit algorithms, each action was initially selected a fixed number of times using round-robin, independently of the context.
5.2 Real-World Data Problems with Non-Linear Models
We evaluated the algorithms on a range of bandit problems created from real-world data. In particular, we test on the Mushroom, Statlog, Covertype, Financial, Jester, Adult, Census, and Song datasets (see Appendix Section A for details on each dataset and bandit problem). They exhibit a broad range of properties: small and large sizes, one dominating action versus more homogeneous optimality, learnable or little signal, stochastic or deterministic rewards, etc. For space reasons, the outcomes of some simulations are presented in the Appendix. The Statlog, Covertype, Adult, and Census datasets were originally tested in Elmachtoub et al. (2017). We summarize the final cumulative regret for the Mushroom, Statlog, Covertype, Financial, and Jester datasets in Table 1. In Figure 5 in the appendix, we show a box plot of the ranks achieved by each algorithm across the suite of bandit problems (see Appendix Tables 6 and 7 for the full results).
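Table 1: Final cumulative regret on the Mushroom, Statlog, Covertype, Financial, and Jester datasets. Results are relative to the cumulative regret of the Uniform algorithm. We report the mean and standard error of the mean over 50 trials.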
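The reporting convention in this caption can be sketched as below. The exact normalization is an assumption: here each trial's cumulative regret is divided by the mean cumulative regret of the Uniform algorithm and expressed as a percentage. The function names are hypothetical.

```python
import math
import statistics


def relative_regret(algorithm_regrets, uniform_regrets):
    """Per-trial cumulative regret as a percentage of the mean Uniform regret
    (assumed normalization)."""
    baseline = statistics.mean(uniform_regrets)
    return [100.0 * r / baseline for r in algorithm_regrets]


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical regrets over four trials (the paper uses 50).
relative = relative_regret([50.0, 100.0, 75.0, 75.0], [200.0, 200.0, 200.0, 200.0])
mean, sem = mean_and_sem(relative)
```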
5.3 Real-World Data Problems with Linear Models
As most of the algorithms from Section 3 can be implemented for any model architecture, in this subsection we use linear models as a baseline comparison across algorithms (i.e., neural networks that contain a single linear layer). This allows us to directly compare the approximate methods against methods that can compute the exact posterior. The specific hyper-parameter configurations used in the experiments are described in Table 3 in the appendix. Datasets are the same as in the previous subsection. The cumulative and simple regret results are provided in appendix Tables 4 and 5.
5.4 The Wheel Bandit
Some of the real-data problems presented above do not require significant exploration. We design an artificial problem where the need for exploration is smoothly parameterized. The wheel bandit is defined as follows (see Figure 3). Set $d = 2$, and $\delta \in (0, 1)$, the exploration parameter. Contexts are sampled uniformly at random in the unit circle in $\mathbb{R}^d$, $X = (X_1, X_2)$ with $\|X\| \le 1$. There are $k = 5$ possible actions. The first action $a_1$ always offers reward $r \sim N(\mu_1, \sigma^2)$, independently of the context. On the other hand, for contexts such that $\|X\| \le \delta$, i.e. inside the blue circle in Figure 3, the other four actions are equally distributed and sub-optimal, with $r \sim N(\mu_2, \sigma^2)$ for $\mu_2 < \mu_1$. When $\|X\| > \delta$, we are outside the blue circle, and only one of the actions $a_2, \dots, a_5$ is optimal depending on the sign of context components $X_1, X_2$. If $X_1, X_2 > 0$, action 2 is optimal. If $X_1 > 0, X_2 < 0$, action 3 is optimal, and so on. Non-optimal actions still deliver $r \sim N(\mu_2, \sigma^2)$ in this region, except $a_1$ whose mean reward is always $\mu_1$, while the optimal action provides $r \sim N(\mu_3, \sigma^2)$, with $\mu_3 \gg \mu_1$. We set fixed values for $\mu_1$, $\mu_2$, $\mu_3$, and $\sigma$. Note that the probability of a context randomly falling in the high-reward region is $1 - \delta^2$ (not blue). The difficulty of the problem increases with $\delta$, and we expect algorithms to get stuck repeatedly selecting action $a_1$ for large $\delta$. The problem can be easily generalized for $d > 2$. Results are shown in Table 9.
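A minimal sketch of a sampler for the wheel bandit described above may make the construction concrete. Several details are assumptions made for illustration only: the default reward means and noise level are placeholders, the mapping of the last two quadrants to actions 4 and 5 is one possible reading of "and so on", and all function names are hypothetical.

```python
import math
import random


def sample_context(rng):
    """Uniform draw from the unit disk via rejection sampling."""
    while True:
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x1 * x1 + x2 * x2 <= 1.0:
            return x1, x2


def optimal_action(context, delta):
    """0-based index of the action with the highest mean reward."""
    x1, x2 = context
    if math.hypot(x1, x2) <= delta:
        return 0  # inside the inner circle the first action is best
    if x1 > 0 and x2 > 0:
        return 1
    if x1 > 0:
        return 2  # x1 > 0, x2 <= 0
    if x2 <= 0:
        return 3  # assumed ordering for the remaining quadrants
    return 4


def sample_rewards(context, delta, rng, mu1=1.2, mu2=1.0, mu3=50.0, sigma=0.01):
    """Noisy reward of each of the five actions (placeholder parameter values)."""
    means = [mu1, mu2, mu2, mu2, mu2]
    best = optimal_action(context, delta)
    if best != 0:
        means[best] = mu3  # a single high-reward action outside the circle
    return [rng.gauss(m, sigma) for m in means]


rng = random.Random(0)
contexts = [sample_context(rng) for _ in range(20000)]
# Empirical share of contexts in the high-reward region for delta = 0.5.
outside = sum(math.hypot(*c) > 0.5 for c in contexts) / len(contexts)
```

With an inner radius of 0.5, `outside` should be close to the area ratio of the outer ring to the unit disk, which is why larger radii make the informative region rarer and the problem harder.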
Figure 4: Bayes By Backprop (BBB) applied to a linear model and an exact mean field solution, denoted PrecisionDiag, with a linear bandit (left) and with the Statlog bandit (right). The suffix of the BBB legend label indicates the number of training epochs in each training step. We emphasize that in this evaluation, all algorithms use the same family of models (i.e., linear). While PrecisionDiag exactly solves the mean field problem, BBB relies on partial optimization via SGD. As the number of training epochs increases, BBB improves performance, but is always outperformed by PrecisionDiag.
Overall, we found that there is significant room for improvement in uncertainty estimation for neural networks in sequential decision-making problems. First, unlike in supervised learning, sequential decision-making requires the model to be frequently updated as data is accumulated. As a result, methods that converge slowly are at a disadvantage because we must truncate optimization to make the method practical for the online setting. In these cases, we found that partially optimized uncertainty estimates can lead to catastrophic decisions and poor performance. Second, although it deserves further investigation, it seems that decoupling representation learning and uncertainty estimation improves performance. The NeuralLinear algorithm is an example of this decoupling. With such a model, the uncertainty estimates can be solved for in closed form (but may be erroneous due to the simplistic model), so there is no issue with partial optimization. We suspect that this may be the reason for the improved performance. In addition, we observed that many algorithms are sensitive to their hyper-parameters, so the best configurations are problem-dependent.
Finally, we found that in many cases, the inherent randomness in Stochastic Gradient Descent provided sufficient exploration. Accordingly, in some scenarios it may be hard to justify the use of complicated (and less transparent) variations of simple methods. However, Stochastic Gradient Descent is by no means always enough: in our synthetic exploration-oriented problem (the Wheel bandit) additional exploration was necessary.
Next, we discuss our main findings for each class of algorithms.
Linear Methods. Linear methods offer a reasonable baseline, surprisingly strong in many cases. While their representation power is certainly a limiting factor, their ability to compute informative uncertainty measures seems to pay off and balance their initial disadvantage. They do well on several datasets, and are able to react fast to unexpected or extreme rewards (perhaps because single points can have a heavy impact on fitted models, and their updates are immediate, deterministic, and exact). Some datasets clearly need more complex non-linear representations, and linear methods are unable to solve those efficiently. In addition, linear methods obviously offer computational advantages, and it would be interesting to investigate how their performance degrades when a finite data buffer (instead of all collected data) feeds the estimates, as various real-world online applications may require.
In terms of the diagonal linear approximations described in Section 3, we found that diagonalizing the precision matrix (as in mean-field Variational Inference) performs dramatically better than diagonalizing the covariance matrix.
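To see how the two diagonal approximations differ, consider a small Gaussian with correlated coordinates. The sketch below (a toy two-dimensional example with a hypothetical covariance, not the experimental setup) compares the per-coordinate variances obtained by keeping the diagonal of the covariance with those obtained by keeping the diagonal of the precision and inverting it.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def diag_covariance_variances(cov):
    """Keep the diagonal of the covariance: marginal variances."""
    return [cov[i][i] for i in range(2)]


def diag_precision_variances(cov):
    """Keep the diagonal of the precision and invert it: conditional variances."""
    precision = invert_2x2(cov)
    return [1.0 / precision[i][i] for i in range(2)]


# Hypothetical posterior covariance with strongly correlated weights.
cov = [[1.0, 0.9], [0.9, 1.0]]
from_covariance = diag_covariance_variances(cov)  # [1.0, 1.0]
from_precision = diag_precision_variances(cov)    # roughly [0.19, 0.19]
```

The precision-based approximation yields the variance of each weight conditional on the others, which is never larger than the marginal variance kept by the covariance-based one; the two coincide only when the weights are uncorrelated.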
NeuralLinear. The NeuralLinear algorithm sits near a sweet spot that is worth studying further. In general it seems to improve on the RMS neural network it is based on, suggesting its exploration mechanisms add concrete value. We believe its main strength is that it is able to simultaneously learn a data representation that greatly simplifies the task at hand, and to accurately quantify the uncertainty over linear models that explain the observed rewards in terms of the proposed representation. While the former process may be noisier and heavily dependent on the number of training steps taken and the available data, the latter always offers the exact solution to its approximate parent problem. This, together with the partial success of linear methods with poor representations, may explain its promising results. In some sense, it knows what it knows. In the Wheel problem, which requires increasingly good exploration mechanisms, NeuralLinear is probably the best algorithm. Its performance is almost an order of magnitude better than any RMS algorithm (and its spinoffs, like Bootstrapped NN, Dropout, or Parameter Noise), and all greedy linear approaches. On the other hand, it is able to successfully solve problems that require non-linear representations (such as Statlog or Covertype) where linear approaches fail. In addition, the algorithm is remarkably easy to tune, and robust in terms of hyper-parameter configurations. While conceptually simple, its deployment to large scale systems may involve some technical difficulties, mainly updating the Bayesian estimates when the network is re-trained. We believe, however, that standard solutions to similar problems (like running averages) could greatly mitigate these issues. In our experiments and compared to other algorithms, as shown in Table 8, NeuralLinear is fast from a computational standpoint.
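The decoupling described above can be sketched as a Bayesian linear regression per action on top of a fixed feature map. This is a simplified illustration under stated assumptions, not the implementation evaluated here: `features` is a hypothetical stand-in for the last hidden layer of a trained network, the noise variance is treated as known, and the class and function names are invented for the example.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i, row in enumerate(low):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def solve_lower_transposed(low, b):
    """Solve L^T x = b by back substitution."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / low[i][i]
    return x


def features(context):
    """Hypothetical stand-in for the learned last-layer representation."""
    x1, x2 = context
    return [1.0, math.tanh(x1), math.tanh(x2)]


class BayesianLinearHead:
    """Exact Gaussian posterior over linear weights, known noise (assumption)."""

    def __init__(self, dim, prior_precision=1.0, noise_var=1.0):
        self.noise_var = noise_var
        self.precision = [[prior_precision if i == j else 0.0
                           for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def update(self, phi, reward):
        """Closed-form posterior update with one (features, reward) pair."""
        for i, fi in enumerate(phi):
            self.b[i] += fi * reward / self.noise_var
            for j, fj in enumerate(phi):
                self.precision[i][j] += fi * fj / self.noise_var

    def posterior_mean(self):
        low = cholesky(self.precision)
        return solve_lower_transposed(low, solve_lower(low, self.b))

    def sample_weights(self, rng):
        """Draw w ~ N(mean, precision^{-1}) using the Cholesky factor."""
        low = cholesky(self.precision)
        mean = solve_lower_transposed(low, solve_lower(low, self.b))
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        noise = solve_lower_transposed(low, z)  # covariance (L L^T)^{-1}
        return [m + e for m, e in zip(mean, noise)]


def thompson_action(heads, context, rng):
    """Sample one weight vector per action and act greedily on the samples."""
    phi = features(context)
    scores = [sum(w * f for w, f in zip(h.sample_weights(rng), phi)) for h in heads]
    return max(range(len(heads)), key=scores.__getitem__)
```

Because the posterior over the linear weights is available in closed form, the only quantity that depends on truncated gradient-based training in this sketch is the feature map itself.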
Variational Inference. Overall, Bayes By Backprop performed poorly, ranking in the bottom half of algorithms across datasets (Table 1). To investigate if this was due to underestimating uncertainty (as variational methods are known to do (Bishop, 2006)), to the mean field approximation, or to stochastic optimization, we applied BBB to a linear model, where the mean field optimization problem can be solved in closed form (Figure 4). We found that the performance of BBB slowly improved as the number of training epochs increased, but underperformed compared to the exact mean field solution. Moreover, the difference in performance due to the number of training steps dwarfed the difference between the mean field solution and the exact posterior. This suggests that it is not sufficient to partially optimize the variational parameters when the uncertainty estimates directly affect the data being collected. In supervised learning, optimizing to convergence is acceptable; in the online setting, however, optimizing to convergence at every step incurs unreasonable computational cost.
Expectation-Propagation. The performance of Black Box $\alpha$-divergence algorithms was poor. Because this class of algorithms is similar to BBB (in fact, as $\alpha \to 0$, it converges to the BBB objective), we suspect that partial convergence was also the cause of their poor performance. We found these algorithms to be sensitive to the number of training steps between actions, requiring a large number to achieve marginal performance. Their terrible performance in the Mushroom bandit is remarkable, while in the other datasets they perform slightly worse than their variational inference counterpart. Given the successes of Black Box $\alpha$-divergence in other domains (Hernández-Lobato et al., 2016), investigating approaches to sidestep the slow convergence of the uncertainty estimates is a promising direction for future work.
Monte Carlo. Constant-SGD comes out as the winner on Covertype, which requires non-linearity and exploration as evidenced by performance of the linear baseline approaches (Table 1). The method is especially appealing as it does not require tuning learning rates or exploration parameters. SGFS, however, performs better on average. The additional injected noise in SGFS may cause the model to explore more and thus perform better, as shown in the Wheel Bandit problem where SGFS strongly outperforms Constant-SGD.
Bootstrap. The bootstrap offers significant gains with respect to its parent algorithm (RMS) in several datasets. Note that in Statlog one of the actions is optimal around 80% of the time, and the bootstrapped predictions may help to avoid getting stuck, something from which RMS methods may suffer. In other scenarios, the randomness from SGD may be enough for exploration, and the bootstrap may not offer important benefits. In those cases, it might not justify the heavy computational overhead of the method. We found it surprising that the optimized versions of BootstrappedNN both decided to use only a small number of networks (the manually tuned version used more, and the extra networks did not improve performance significantly). Unfortunately, Bootstrapped NNs were not able to solve the Wheel problem, and their performance was fairly similar to that of RMS. One possible explanation is that, given the sparsity of the reward, all the bootstrapped networks agreed for the most part, and the algorithm simply got stuck selecting the first action. As opposed to linear models, reacting to unusual rewards could take Bootstrapped NNs some time, as good predictions could be randomly overlooked (and useful data discarded when a network is trained on only a subsample of the observations).
Direct Noise Injection. When properly tuned, Parameter-Noise provided an important boost in performance across datasets over the learner that it was based on (RMS); for instance, the average rank of ParamNoise-SL is better than that of RMS (Table 1). However, we found the algorithm hard to tune and sensitive to the heuristic controlling the injected noise level. On the synthetic Wheel problem, where exploration is necessary, both parameter-noise and RMS suffer from underexploration and perform similarly, except ParamNoise-MR, which does a good job. In addition, developing an intuition for the heuristic is not straightforward, as it lacks transparency and a principled grounding, and thus may require repeated access to the decision-making process for tuning.
Dropout. We initially experimented with two dropout versions, each with a different fixed dropout probability. The second consistently delivered better results, and it is the one we manually picked. The optimized versions of the algorithm provided decent improvements over its base RMS (especially Dropout-MR). In the Wheel problem, dropout performance is somewhat poor: Dropout is outperformed by RMS, while Dropout-MR offers gains with respect to all versions of RMS but is not competitive with the best algorithms. Overall, the algorithm seems to depend heavily on its hyper-parameters (see the cumulative regret of the raw Dropout, for example). Dropout was used both for training and for decision-making; unfortunately, we did not add a baseline where dropout only applies during training. Consequently, it is not obvious how to disentangle the contribution of better training from that of better exploration. This remains as future work.
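Using dropout at decision time, as described above, amounts to scoring the actions through one randomly masked forward pass and acting greedily on that pass. The sketch below illustrates the idea on a tiny network with one hidden layer; the weights, the layer sizes, the inverted-dropout rescaling, and the function names are all assumptions made for the example.

```python
import random


def forward_with_dropout(context, w_hidden, w_out, keep_prob, rng):
    """One stochastic forward pass: ReLU hidden layer with a fresh dropout mask."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, context))) for row in w_hidden]
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in hidden]
    # Inverted dropout: rescale kept units so the expected activation is unchanged.
    dropped = [h * m / keep_prob for h, m in zip(hidden, mask)]
    return [sum(w * h for w, h in zip(row, dropped)) for row in w_out]


def dropout_action(context, w_hidden, w_out, keep_prob, rng):
    """Pick the action with the highest predicted reward under one sampled mask."""
    rewards = forward_with_dropout(context, w_hidden, w_out, keep_prob, rng)
    return max(range(len(rewards)), key=rewards.__getitem__)


# Hypothetical weights: two hidden units, two actions.
w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
context = [0.2, 0.7]
rng = random.Random(0)
greedy = dropout_action(context, w_hidden, w_out, keep_prob=1.0, rng=rng)
explored = {dropout_action(context, w_hidden, w_out, 0.5, rng) for _ in range(200)}
```

With `keep_prob=1.0` the choice is deterministic, whereas a smaller keep probability makes the chosen action vary across calls for the same context, which is the source of exploration in this scheme.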
Bayesian Non-parametrics. Perhaps unsurprisingly, Gaussian processes perform reasonably well on problems with little data but struggle on larger problems. While this motivated the use of sparse GP, the latter was not able to perform similarly to stronger (and definitely simpler) methods.
7 Conclusions and Future Work
In this work, we empirically studied the impact on performance of approximate model posteriors for decision making via Thompson Sampling in contextual bandits. We found that the most robust methods exactly measured uncertainty (possibly under the wrong model assumptions) on top of complex representations learned in parallel. More complicated approaches that learn the representation and its uncertainty together seemed to require heavier training, an important drawback in online scenarios, and exhibited stronger hyper-parameter dependence. Further exploring and developing the promising approaches is an exciting avenue for future work.
We are extremely grateful to Dan Moldovan, Sven Schmit, Matt Hoffman, Matt Johnson, Ramon Iglesias, and Rif Saurous for their valuable feedback and comments. We also thank the anonymous reviewers, whose suggestions truly helped improve the current work.
- Agrawal & Goyal (2012) Agrawal, Shipra and Goyal, Navin. Analysis of thompson sampling for the multi-armed bandit problem. In International Conference on Learning Theory, 2012.
- Ahn et al. (2012) Ahn, Sungjin, Balan, Anoop Korattikara, and Welling, Max. Bayesian posterior sampling via stochastic gradient fisher scoring. In International Conference on Machine Learning, 2012.
- Asuncion & Newman (2007) Asuncion, Arthur and Newman, David. UCI machine learning repository, 2007.
- Ba et al. (2016) Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bertin-Mahieux et al. (2011) Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, and Lamere, Paul. The million song dataset. In International Conference on Music Information Retrieval, 2011.
- Bishop (2006) Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
- Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622, 2015.
- Bonilla et al. (2008) Bonilla, Edwin V, Chai, Kian M., and Williams, Christopher. Multi-task gaussian process prediction. In Advances in Neural Information Processing Systems, 2008.
- Bubeck et al. (2009) Bubeck, Sébastien, Munos, Rémi, and Stoltz, Gilles. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pp. 23–37. Springer, 2009.
- Chapelle & Li (2011) Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, 2011.
- Efron (1982) Efron, Bradley. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.
- Elmachtoub et al. (2017) Elmachtoub, Adam N, McNellis, Ryan, Oh, Sechan, and Petrik, Marek. A practical method for solving contextual bandit problems using decision trees. arXiv preprint arXiv:1706.04687, 2017.
- Fortunato et al. (2017) Fortunato, Meire, Azar, Mohammad Gheshlaghi, Piot, Bilal, Menick, Jacob, Osband, Ian, Graves, Alex, Mnih, Vlad, Munos, Remi, Hassabis, Demis, Pietquin, Olivier, Blundell, Charles, and Legg, Shane. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
- Gal & Ghahramani (2016) Gal, Yarin and Ghahramani, Zoubin. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050–1059, 2016.
- Gelman et al. (2014) Gelman, Andrew, Vehtari, Aki, Jylänki, Pasi, Robert, Christian, Chopin, Nicolas, and Cunningham, John P. Expectation propagation as a way of life. arXiv preprint arXiv:1412.4869, 2014.
- Goldberg et al. (2001) Goldberg, Ken, Roeder, Theresa, Gupta, Dhruv, and Perkins, Chris. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
- Granmo (2010) Granmo, Ole-Christoffer. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2):207–234, 2010.
- Graves (2011) Graves, Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.
- Hensman et al. (2015) Hensman, James, Matthews, Alexander G. de G., and Ghahramani, Zoubin. Scalable variational gaussian process classification. In Proceedings of AISTATS, 2015.
- Hernández-Lobato et al. (2016) Hernández-Lobato, José Miguel, Li, Yingzhen, Rowland, Mark, Bui, Thang D., Hernández-Lobato, Daniel, and Turner, Richard E. Black-box alpha divergence minimization. In International Conference on Machine Learning, 2016.
- Hinton & Van Camp (1993) Hinton, Geoffrey E and Van Camp, Drew. Keeping the neural networks simple by minimizing the description length of the weights. In Computational learning theory, pp. 5–13. ACM, 1993.
- Hron et al. (2017) Hron, Jiri, Matthews, Alexander G de G, and Ghahramani, Zoubin. Variational gaussian dropout is not bayesian. arXiv preprint arXiv:1711.02989, 2017.
- Jordan et al. (1999) Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
- Kingma et al. (2015) Kingma, Diederik P, Salimans, Tim, and Welling, Max. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
- Kohavi (1996) Kohavi, Ron. In International Conference On Knowledge Discovery and Data Mining, 1996.
- Li et al. (2016) Li, Chunyuan, Chen, Changyou, Carlson, David, and Carin, Lawrence. Preconditioned stochastic gradient langevin dynamics for deep neural networks. In AAAI Conference on Artificial Intelligence, 2016.
- Mandt et al. (2016) Mandt, Stephan, Hoffman, Matthew D., and Blei, David M. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, 2016.
- Matthews et al. (2017) Matthews, Alexander G. de G., van der Wilk, Mark, Nickson, Tom, Fujii, Keisuke, Boukouvalas, Alexis, León-Villagrá, Pablo, Ghahramani, Zoubin, and Hensman, James. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 2017.
- Minka (2004) Minka, Thomas. Power ep. 2004.
- Minka (2001a) Minka, Thomas P. Expectation propagation for approximate bayesian inference. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001a.
- Minka (2001b) Minka, Thomas Peter. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001b.
- Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
- Neal (1994) Neal, Radford M. Bayesian learning for neural networks. Dept. of Computer Science, University of Toronto, 1994.
- Opper & Winther (2000) Opper, Manfred and Winther, Ole. Gaussian processes for classification: Mean-field algorithms. Neural computation, 12(11):2655–2684, 2000.
- Osband et al. (2016) Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
- Patterson & Teh (2013) Patterson, Sam and Teh, Yee Whye. Stochastic gradient riemannian langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, 2013.
- Pearl (1986) Pearl, Judea. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
- Plappert et al. (2017) Plappert, Matthias, Houthooft, Rein, Dhariwal, Prafulla, Sidor, Szymon, Chen, Richard Y., Chen, Xi, Asfour, Tamim, Abbeel, Pieter, and Andrychowicz, Marcin. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
- Rasmussen & Williams (2005) Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
- Riquelme et al. (2017) Riquelme, Carlos, Ghavamzadeh, Mohammad, and Lazaric, Alessandro. Active learning for accurate estimation of linear models. In International Conference on Machine Learning, 2017.
- Russo & Van Roy (2014) Russo, Dan and Van Roy, Benjamin. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pp. 1583–1591, 2014.
- Russo et al. (2017) Russo, Daniel, Tse, David, and Van Roy, Benjamin. Time-sensitive bandit learning and satisficing thompson sampling. arXiv preprint arXiv:1704.09028, 2017.
- Schaul et al. (2016) Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. In International Conference on Learning Representations, 2016.
- Schlimmer (1981) Schlimmer, Jeff. Mushroom records drawn from the audubon society field guide to north american mushrooms. GH Lincoff (Pres), New York, 1981.
- Snelson & Ghahramani (2006) Snelson, Edward and Ghahramani, Zoubin. Sparse gaussian processes using pseudo-inputs. In Weiss, Y., Schölkopf, P. B., and Platt, J. C. (eds.), Advances in Neural Information Processing Systems, 2006.
- Snoek et al. (2015) Snoek, Jasper, Rippel, Oren, Swersky, Kevin, Kiros, Ryan, Satish, Nadathur, Sundaram, Narayanan, Patwary, Mostofa, Prabhat, Mr, and Adams, Ryan. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.
- Srivastava et al. (2014) Srivastava, Nitish, Hinton, Geoffrey E, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
- Thompson (1933) Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Titsias (2009) Titsias, Michalis K. Variational learning of inducing variables in sparse gaussian processes. In International Conference on Artificial Intelligence and Statistics, 2009.
- Wainwright et al. (2008) Wainwright, Martin J, Jordan, Michael I, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
- Welling & Teh (2011) Welling, Max and Teh, Yee Whye. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011.
|Alpha Divergences||BB $\alpha$-divergence with , noise , , prior var . (, first 100 times linear decay from ).|
|Alpha Divergences (1)||BB $\alpha$-divergence with , noise , , prior var . (, first 100 times linear decay from ).|
|Alpha Divergences (2)||BB $\alpha$-divergence with , noise , , prior var . (, first 100 times linear decay from ).|
|Alpha Divergences (3)||BB $\alpha$-divergence with , noise , , prior var . (, first 100 times linear decay from ).|
|BBBN||BayesByBackprop with noise . (, first 100 times linear decay from ).|
|BBBN2||BayesByBackprop with noise . (, first 100 times linear decay from ).|
|BBBN3||BayesByBackprop with noise . (, first 100 times linear decay from ).|
|BBBN4||BayesByBackprop with noise . (, first 100 times linear decay from ).|
|Bootstrapped NN||Bootstrapped with models, and . Based on RMS3 net.|
|Bootstrapped NN2||Bootstrapped with models, and . Based on RMS3 net.|
|Bootstrapped NN3||Bootstrapped with models, and . Based on RMS3 net.|
|Dropout (RMS3)||Dropout with probability . Based on RMS3 net.|
|Dropout (RMS2)||Dropout with probability . Based on RMS2 net.|
|RMS1||Greedy NN approach, fixed learning rate .|
|RMS2||Learning rate decays, and it is reset every training period.|
|RMS2b||Similar to RMS2, but training for longer ().|
|RMS3||Learning rate decays, and it is not reset at all. Starts at .|
|SGFS||Burning , learning rate , EMA decay , noise .|
|ConstSGD||Burning , EMA decay , noise .|
|EpsGreedy (RMS1)||Initial . Multiplied by after every context. Based on RMS1 net.|
|EpsGreedy (RMS2)||Initial . Multiplied by after every context. Based on RMS2 net.|
|EpsGreedy (RMS3)||Initial . Multiplied by after every context. Based on RMS3 net.|
|LinDiagPost||in Eq. 1 is diagonalized. Ridge prior . Assumed noise level .|
|LinDiagPrecPost||in Eq. 1 is diagonalized. Ridge prior . Assumed noise level .|
|LinGreedy||Takes action with highest predicted reward for Ridge regression, . Noise level .|
|LinGreedy (eps = 0.01)||linGreedy that selects action uniformly at random with prob .|
|LinGreedy (eps = 0.05)||linGreedy that selects action uniformly at random with prob .|
|LinPost||Ridge prior . Assumed noise level .|
|LinFullDiagPost||in Eq. 1 is diagonalized. Noise prior . Ridge prior .|
|LinFullDiagPrecPost||in Eq. 1 is diagonalized. Noise prior . Ridge prior .|
|LinFullPost||Noise prior . Ridge prior .|
|Param-Noise||Initial noise , and level . Based on RMS3 net.|
|Param-Noise2||Initial noise , and level . Based on RMS3 net. Trained for longer: .|
|Uniform||Takes each action at random with equal probability.|