## 1 Introduction

Reinforcement Learning (RL) aims at learning how to make optimal decisions in unknown environments by solving credit assignment problems that extend in time. To be sample-efficient learners, agents are required to constantly update their own beliefs about the world, about which actions are good and which are not. Temporal difference (TD) learning (sutton1998reinforcement) and off-policy learning are the constituent elements of this kind of behavior. TD allows agents to bootstrap their current knowledge to learn from a new observation as soon as it is available. Off-policy learning gives the means for exploration and enables experience replay (lin1991er). Q-Learning (qlearning) implements both paradigms.

Algorithms based on Q-Learning are, in fact, driving Deep Reinforcement Learning (DRL) research towards solving complex problems and achieving super-human performance on many of them (mnih2015human; hessel2018rainbow). Nonetheless, Q-Learning is known to be positively biased (hasselt2010double), since it learns by using the maximum over the noisy bootstrapped TD estimates. This overoptimism can be particularly harmful in stochastic environments and when using function approximation (thrun1993issues), notably also in the case where the approximators are deep neural networks (hasselt2016ddqn). Systematic overestimation of the action-values, coupled with the inherently high variance of DRL methods, can lead to incrementally accumulating errors, causing the learning algorithm to diverge.

Among the possible solutions, the Double Q-Learning algorithm (hasselt2010double) and its DRL variant Double DQN (hasselt2016ddqn) tackle the overestimation problem by disentangling the choice of the target action and its evaluation. The resulting estimator, while achieving superior performance in many problems, is negatively biased (hasselt2013estimating). Underestimation, in fact, can lead in some environments to lower performance and slower convergence rates compared to standard Q-Learning (deramo2016wql; lan2020maxmin). Overoptimism, in general, is not uniform over the state space and may lead the agent to overestimate the value of arbitrarily bad actions, throwing it completely off. The same holds true, symmetrically, for overly pessimistic estimates that might undervalue a good course of action. Ideally, we would like DRL agents to be aware of their own uncertainty about the optimality of each action, and to be able to exploit it to make more informed estimates of the expected return. This is exactly what we achieve in this work.

We exploit recent developments in Bayesian Deep Learning to model the uncertainty of DRL agents using neural networks trained with dropout variational inference (kingma2015vardrop; gal2016dropout). We combine, in a novel way, the dropout uncertainty estimates with the Weighted Q-Learning algorithm (deramo2016wql), extending it to the DRL setting. The proposed Deep Weighted Q-Learning algorithm, or Weighted DQN (WDQN), leverages an approximate posterior distribution on Q-networks to reduce the bias of deep Q-Learning. WDQN bias is neither always positive nor always negative, but depends on the state and the problem at hand. WDQN only requires minor modifications to the baseline algorithm and its computational overhead is negligible on specialized hardware.

The paper is organized as follows. In Section 2 we define the problem setting, introducing key aspects of value-based RL. In Section 3 we analyze in depth the problem of estimation biases in Q-Learning and sequential decision making problems. Then, in Section 4, we first discuss how neural networks trained with dropout can be used for Bayesian inference in RL and, from that, we derive the WDQN algorithm. In the experimental section we empirically evaluate the proposed method against relevant baselines on several benchmarks. Finally, we provide an overview of related works, draw our conclusions, and discuss future works in the closing sections.

## 2 Preliminaries

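A Markov Decision Process (MDP) is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $\mathcal{P}$ is a Markovian transition function, $\mathcal{R}$ is a reward function, and $\gamma$ is a discount factor. A sequential decision maker ought to estimate, for each state $s$, the optimal value $Q^*(s,a)$ of each action $a$, i.e., the expected return obtained by taking action $a$ in $s$ and following the optimal policy $\pi^*$ afterwards. We can write $Q^*$ using the Bellman optimality equation (bellman1954theory):

$$Q^*(s,a) = \mathbb{E}\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a \right] \tag{1}$$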
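As a minimal sketch of how the Bellman optimality equation can be used when the model is known, the snippet below repeatedly applies the optimality backup to a small finite MDP. The function name `q_value_iteration`, the nested-list layout of `P` and `R`, and the two-state example are assumptions made only for illustration; they do not come from the paper.

```python
def q_value_iteration(P, R, gamma, n_iters=500):
    """Iterate the Bellman optimality backup on a finite MDP.

    P[s][a][s2] is the probability of reaching s2 from s under action a,
    R[s][a] is the expected immediate reward (both hypothetical layouts).
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        # Greedy value of every next state: max_a' Q(s', a').
        V = [max(q_s) for q_s in Q]
        Q = [
            [
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return Q


# Hypothetical 2-state, 2-action MDP used only for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [2.0, 0.5]]
Q_star = q_value_iteration(P, R, gamma=0.9)
```

After enough iterations the returned table is a fixed point of the backup, which is the property that Q-Learning tries to reach from samples when the model is not available.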

#### (Deep) Q-Learning

A classical approach to solve finite MDPs is the Q-Learning algorithm (qlearning), an off-policy value-based RL algorithm, based on TD. A Q-Learning agent learns the optimal value function using the following update rule:

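$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( y^{Q} - Q(s,a) \right) \tag{2}$$

where $\alpha$ is the learning rate and, following the notation introduced by hasselt2016ddqn,

$$y^{Q} = r + \gamma \max_{a'} Q(s',a'). \tag{3}$$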
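The tabular update rule can be sketched in a few lines of Python. The dictionary-based table, the function name `q_learning_update`, and the `done` flag for terminal transitions are illustrative assumptions, not part of the original formulation.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-Learning step, moving Q(s, a) toward the bootstrapped TD target."""
    # Assumption: no bootstrapping from terminal states.
    bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * bootstrap
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target


Q = defaultdict(float)       # unseen state-action pairs start at 0
actions = [0, 1]
Q[("s1", 1)] = 2.0           # pretend we already have an estimate for s1
target = q_learning_update(Q, "s0", 0, r=1.0, s_next="s1",
                           actions=actions, alpha=0.5, gamma=0.9)
```

The `max` over the next-state estimates is the step where the noise in the table turns into a positive bias.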

The popular Deep Q-Network algorithm (DQN) (mnih2015human) is a variant of Q-Learning designed to stabilize off-policy learning with deep neural networks in high-dimensional state spaces. The two most relevant architectural changes to standard Q-Learning introduced by DQN are the adoption of a replay memory, to learn offline from past experience, and the use of a target network, to reduce correlation between the current model estimate and the bootstrapped target value.

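In practice, DQN learns the Q-values online, using a neural network with parameters $\theta$, sampling the replay memory, and with a target network whose parameters $\theta^-$ are updated to match those of the online model every $C$ steps. The model is trained to minimize the loss

$$\mathcal{L}(\theta) = \mathbb{E}_{\langle s,a,r,s' \rangle \sim U(D)} \left[ \left( y^{DQN} - Q(s,a;\theta) \right)^2 \right] \tag{4}$$

where $U(D)$ is a uniform distribution over the transitions stored in the replay buffer and $y^{DQN}$ is defined as

$$y^{DQN} = r + \gamma \max_{a'} Q(s',a';\theta^-). \tag{5}$$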

#### Double DQN

Among the many studied improvements and extensions of the baseline DQN algorithm (wang2016dueling; schaul2016prioritized; distributional2017bellemare; hessel2018rainbow), Double DQN (DDQN) (hasselt2016ddqn) reduces the overestimation bias of DQN with a simple modification of the update rule. In particular, DDQN uses the target network to decouple action selection and evaluation, and estimates the target value as

$$y^{DDQN} = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s',a';\theta);\, \theta^-\right). \tag{6}$$

DDQN improves on DQN by converging to a more accurate approximation of the value function, while maintaining the same model complexity and adding a minimal computational overhead.
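The difference between the two targets can be illustrated with plain lists standing in for network outputs. The function names and the toy Q-values below are hypothetical; in an actual agent the two lists would be the predictions of the online and target networks for the next state.

```python
def dqn_target(r, gamma, q_target_next):
    """DQN: the target network both selects and evaluates the next action."""
    return r + gamma * max(q_target_next)


def ddqn_target(r, gamma, q_online_next, q_target_next):
    """Double DQN: the online network selects, the target network evaluates."""
    a_star = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return r + gamma * q_target_next[a_star]


# Toy next-state action-values (illustrative numbers only).
q_online_next = [1.0, 2.5, 2.0]
q_target_next = [1.2, 1.8, 3.0]

y_dqn = dqn_target(0.5, 0.9, q_target_next)
y_ddqn = ddqn_target(0.5, 0.9, q_online_next, q_target_next)
```

Because the evaluated action is not necessarily the one that maximizes the target network output, the DDQN target can never exceed the DQN target computed from the same estimates.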

## 3 Estimation biases in Q-Learning

Choosing a target value for the Q-Learning update rule can be seen as an instance of the Maximum Expected Value (MEV) estimation problem for a set of random variables, here the action-values $Q(s', \cdot)$. Q-Learning uses the Maximum Estimator (ME) to estimate the maximum expected return and exploits it for policy improvement (details about the estimators considered in this section are provided in the appendix). It is well known that ME is a positively biased estimator of MEV (smith2006optimizer). The divergent behaviour that may occur in Q-Learning may then be explained by the amplification over time of the overestimation bias, which introduces a positive error in the action-value estimates at each update (hasselt2010double). Double Q-Learning (hasselt2010double), on the other hand, learns two value functions in parallel and uses an update scheme based on the Double Estimator (DE). DE has been shown to be a negatively biased estimator of MEV, which helps to avoid catastrophic overestimates of the Q-values. The DDQN algorithm, introduced in Section 2, is the extension of Double Q-Learning to the DRL setting. In DDQN, the target network is used as a proxy of the second value function of Double Q-Learning to preserve sample and model complexity.
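The opposite signs of the two biases can be checked with a small simulation. The sketch below assumes unit-variance Gaussian variables with hypothetical true means, and implements DE by splitting each sample set in two halves (one to select, one to evaluate); these choices are illustrative and are not the experimental setup of the paper.

```python
import random
from statistics import fmean


def simulate_biases(true_means, n_samples=20, n_trials=2000, seed=0):
    """Estimate the bias of the Maximum and Double Estimators by simulation."""
    rng = random.Random(seed)
    best = max(true_means)  # the true maximum expected value
    me_err, de_err = [], []
    half = n_samples // 2
    for _ in range(n_trials):
        samples = [[rng.gauss(mu, 1.0) for _ in range(n_samples)] for mu in true_means]
        # Maximum Estimator: maximum over the sample means of all the data.
        me = max(fmean(x) for x in samples)
        # Double Estimator: select the argmax on one half, evaluate on the other.
        means_a = [fmean(x[:half]) for x in samples]
        means_b = [fmean(x[half:]) for x in samples]
        a_star = max(range(len(true_means)), key=means_a.__getitem__)
        de = means_b[a_star]
        me_err.append(me - best)
        de_err.append(de - best)
    return fmean(me_err), fmean(de_err)


me_bias, de_bias = simulate_biases([0.0, 0.0, 0.0, 0.0, 0.2])
```

With these settings `me_bias` comes out positive and `de_bias` negative, matching the qualitative behaviour described for ME and DE.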

In practice, as shown by deramo2016wql and lan2020maxmin, the overestimation bias of Q-Learning is not always harmful, and may also be convenient when the action-values differ significantly from one another (e.g., deterministic environments with a short time horizon, or small action spaces). Conversely, the underestimation of Double Q-Learning is effective when all the action-values are very similar (e.g., highly stochastic environments with a long or infinite time horizon, or large action spaces). In fact, depending on the problem, both algorithms have properties that can be detrimental for learning. Unfortunately, prior knowledge about the environment is not always available and, even when it is, the problem may be too complex to decide which estimator should be preferred. Given the above, it is desirable to use a methodology that can robustly deal with heterogeneous problems.

### 3.1 Weighted Q-Learning

deramo2016wql propose the Weighted Q-Learning (WQL) algorithm, a variant of Q-Learning based on the Weighted Estimator (WE) introduced therein. The WE estimates MEV as the weighted sum of the sample means of the random variables, weighted according to their probability of corresponding to the maximum. Intuitively, the amount of uncertainty, i.e., the entropy of the WE weights, will depend on the nature of the problem, the number of samples, and the variance of the mean estimator (critical when using function approximation). WE bias is bounded by the biases of ME and DE (deramo2016wql).

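The target value of WQL can be computed as

$$y^{WQL} = r + \gamma \sum_{a'} w_{a'}^{s'}\, Q(s',a') \tag{7}$$

where $w_{a'}^{s'}$ are the weights of the WE and correspond to the probability of each action-value being the maximum:

$$w_{a}^{s} = P\left( a = \arg\max_{a'} Q(s,a') \right). \tag{8}$$

The update rule of WQL can be obtained by replacing the Q-Learning target $y^{Q}$ of Equation 3 with $y^{WQL}$ in Equation 2. The weights of WQL are estimated in the tabular setting assuming the sample means to be normally distributed.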
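Under the normality assumption, the WE weights can be approximated in several ways. The sketch below uses simple Monte Carlo sampling from independent Gaussians centred on the sample means, which is a swapped-in approximation chosen for brevity and not necessarily the procedure used by deramo2016wql. The function name, the standard errors, and the example values are hypothetical.

```python
import random


def weighted_estimator(means, std_errs, n_draws=20000, seed=0):
    """Approximate the WE: weights are the probabilities of each mean being the max."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    for _ in range(n_draws):
        # One joint draw of the (assumed Gaussian) sample means.
        draws = [rng.gauss(m, s) for m, s in zip(means, std_errs)]
        counts[max(range(len(draws)), key=draws.__getitem__)] += 1
    weights = [c / n_draws for c in counts]
    estimate = sum(w * m for w, m in zip(weights, means))
    return weights, estimate


means = [1.0, 1.2, 0.4]
weights, we = weighted_estimator(means, std_errs=[0.3, 0.3, 0.3])
```

Since the weights form a probability distribution, the resulting estimate is a convex combination of the sample means and therefore never exceeds the Maximum Estimator computed on the same means.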

WE has also been studied in the Batch RL setting, with continuous state and action spaces, by using Gaussian Process regression (d2017estimating).

## 4 Deep Weighted Q-Learning

A natural way to extend the WQL algorithm to the DRL setting is to consider the uncertainty over the model parameters using a Bayesian approach. Among the possible solutions to estimate uncertainty, bootstrapping has been the most successful in RL problems, with BootstrappedDQN (BDQN) (osband2016boot; osband2018deeprandomized) achieving impressive results in environments where exploration is critical. On the other hand, bootstrapping requires significant modifications to the baseline DQN architecture and requires training a model for each sample of the approximate posterior distribution. This considerably limits the number of samples available and is a major drawback in using BDQN to approximate the WE weights. Using dropout, conversely, does not impact model complexity and allows the weights of the WE to be computed from an arbitrarily large number of samples.

In the following, we first introduce how neural networks trained with dropout can be used for approximate Bayesian inference and discuss how this approach has been used with success in RL problems. Then, we propose a novel approach to exploit the uncertainty over the model parameters for action evaluation, adapting the WE to the DRL setting. Finally, we analyze a possible shortcoming of the proposed method and identify a solution from the literature to address it.

### 4.1 Bayesian inference with dropout

Dropout (srivastava2014dropout) is a regularization technique used to train large neural networks by randomly dropping units during learning. In recent years, dropout has been analyzed from a Bayesian perspective (kingma2015vardrop; gal2016dropout), and interpreted as a variational approximation of a posterior distribution over the neural network parameters. In particular, gal2016dropout show how a neural network trained with dropout and weight decay can be seen as an approximation of a deep Gaussian process (damianou2013deepgauss). The result is a theoretically grounded interpretation of dropout and a class of Bayesian neural networks (BNNs) that are cheap to train and can be queried to obtain uncertainty estimates. In fact, a single stochastic forward pass through the BNN can be interpreted as taking a sample from the model’s predictive distribution, while the predictive mean can be computed as the average of multiple samples. This inference technique is known as Monte Carlo (MC) dropout and can be efficiently parallelized on modern GPUs.
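A minimal sketch of MC dropout on a tiny two-layer network is given below. The weights, the input, the keep probability, and the use of inverted dropout scaling on the hidden layer are all assumptions made for illustration; a practical implementation would batch the stochastic passes on a GPU with a deep learning framework.

```python
import random


def dropout_forward(x, W1, W2, keep_prob, rng):
    """One stochastic forward pass with a freshly sampled hidden-layer dropout mask."""
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # Keep each unit with probability keep_prob and rescale it (inverted dropout).
    hidden = [h / keep_prob if rng.random() < keep_prob else 0.0 for h in hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def mc_dropout(x, W1, W2, keep_prob=0.8, n_passes=100, seed=0):
    """Collect n_passes samples of the output and average them."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, W1, W2, keep_prob, rng) for _ in range(n_passes)]
    mean = [sum(s[k] for s in samples) / n_passes for k in range(len(W2))]
    return samples, mean


# Hypothetical fixed weights: 2 inputs, 4 hidden units, 2 outputs.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.7, 0.2]]
W2 = [[1.0, -0.5, 0.3, 0.2], [0.2, 0.4, -0.1, 0.6]]
samples, mean = mc_dropout([1.0, 2.0], W1, W2)
```

Each element of `samples` corresponds to one draw from the approximate predictive distribution, and `mean` is the MC estimate of the predictive mean.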

A straightforward application of Bayesian models to RL is Thompson Sampling (TS) (thompson1933likelihood). TS is an exploration technique that aims at improving the sample complexity of RL algorithms by selecting actions according to their probability of being optimal given the current agent’s beliefs. A practical way to use TS in Deep Reinforcement Learning is to take a single sample from a Q-network trained with dropout and select the action that corresponds to the maximum sampled action-value (gal2016dropout). TS based on dropout achieves superior data efficiency compared to naïve exploration strategies, such as $\epsilon$-greedy, both in sequential decision making problems (gal2016dropout; stadie2015exploration) and contextual bandits (riquelme2018deepbandits; collier2018deepcontextual). Furthermore, dropout has been successfully used in model-based RL to estimate the agent’s uncertainty over the environment dynamics (gal2015bayespilco; kahn2017uncertaintyaware; malik2019calibrated).

Here we focus on the problem of action evaluation. We show how to use approximate Bayesian inference to evaluate the WE by introducing a novel approach to exploit uncertainty estimates in DRL. Our method empirically reduces Q-Learning bias, is grounded in theory, and is simple to implement.

### 4.2 Weighted DQN

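Let $Q(\cdot,\cdot;\theta,\omega)$ be a BNN with weights $\theta$ trained with a Gaussian prior and Dropout Variational Inference to learn the optimal action-value function of a certain MDP. We indicate with $\omega$ the set of random variables that represents the dropout masks, with $\omega_i$ the $i$-th realization of the random variables, and with $\Omega$ their joint distribution:

$$\omega = \{\omega_{lk} : l = 1, \dots, L,\; k = 1, \dots, K_l\} \tag{9}$$

$$\omega_{lk} \sim \text{Bernoulli}(p_l) \tag{10}$$

where $L$ is the number of weight layers of the network, $K_l$ is the number of units in layer $l$, and $p_l$ is the probability of retaining a unit in layer $l$.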

Consider a sample $q(s,a)$ of the MDP return, obtained by taking action $a$ in $s$ and following the optimal policy afterwards. Following the GP interpretation of dropout of gal2016dropout, we can approximate the likelihood of this observation as a Gaussian such that

$$q(s,a) \sim \mathcal{N}\left( Q(s,a;\theta,\omega),\, \tau^{-1} \right) \tag{11}$$

where $\tau$ is the model precision.

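We can approximate the predictive mean of the process, and the expectation over the posterior distribution of the Q-value estimates, as the average of $T$ stochastic forward passes through the network:

$$\hat{Q}_T(s,a;\theta) = \frac{1}{T} \sum_{t=1}^{T} Q(s,a;\theta,\omega_t) \approx \mathbb{E}\left[ Q(s,a;\theta,\omega) \right] \tag{12}$$

where $Q(s,a;\theta,\omega_t)$ is the BNN prediction of the action-value associated to state $s$ and action $a$ under the $t$-th realization of the dropout masks. A similar, more computationally efficient, approximation can be obtained through weight averaging, which consists in scaling the output of the neurons in layer $l$ by $1/p_l$, with $p_l$ the probability of retaining a unit in that layer, during training and leaving them unchanged at inference time. We indicate this estimate as $\tilde{Q}(s,a;\theta)$ and we use it for action selection during training.

The model epistemic uncertainty, i.e., the model uncertainty over its parameters, can be measured similarly as the sample variance across $T$ realizations of the dropout random variables:

$$\hat{\sigma}^2_T(s,a;\theta) = \frac{1}{T} \sum_{t=1}^{T} \left( Q(s,a;\theta,\omega_t) - \hat{Q}_T(s,a;\theta) \right)^2 \tag{13}$$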

As shown in gal2016dropout, the predictive variance can be approximated with the variance of the estimator in Eq. 13 plus the model inverse precision $\tau^{-1}$.
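Given a set of sampled action-values, the predictive mean, the epistemic variance, and the predictive variance can be computed as sketched below. The function name, the list-of-lists layout (one row per stochastic forward pass, one column per action), the $1/T$ normalization of the variance, and the toy numbers are assumptions made for illustration.

```python
def predictive_moments(q_samples, tau):
    """Return per-action mean, epistemic variance, and predictive variance.

    q_samples[t][a] is the action-value of action a sampled at forward pass t,
    tau is the (assumed known) model precision.
    """
    T = len(q_samples)
    n_actions = len(q_samples[0])
    mean = [sum(s[a] for s in q_samples) / T for a in range(n_actions)]
    # Sample variance across dropout realizations (1/T normalization assumed).
    epistemic = [
        sum((s[a] - mean[a]) ** 2 for s in q_samples) / T for a in range(n_actions)
    ]
    # Predictive variance: epistemic variance plus the inverse model precision.
    predictive = [v + 1.0 / tau for v in epistemic]
    return mean, epistemic, predictive


# Four hypothetical forward passes over two actions.
q_samples = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
mean, epistemic, predictive = predictive_moments(q_samples, tau=4.0)
```

In this toy example the two actions have the same predictive mean, but only the first one carries epistemic uncertainty, since its samples disagree across passes.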

We can estimate the probability required to calculate the WE in a similar way. Given an action $a$, the probability that $a$ corresponds to the maximum expected action-value can be approximated as the number of times in which, given $T$ samples, the sampled action-value of $a$ is the maximum, divided by the number of samples:

$$w_{a}^{s}(\theta) \approx \frac{1}{T} \sum_{t=1}^{T} \left[\!\left[ a = \arg\max_{a'} Q(s,a';\theta,\omega_t) \right]\!\right] \tag{14}$$

where $[\![\cdot]\!]$ are the Iverson brackets ($[\![P]\!]$ is $1$ if $P$ is true, $0$ otherwise). The weights can be efficiently inferred in parallel with negligible impact on computation time.

We can define the WE given the Bayesian target Q-network estimates using the obtained weights as:

$$y^{WDQN} = r + \gamma \sum_{a'} w_{a'}^{s'}(\theta^-)\, \hat{Q}_T(s',a';\theta^-). \tag{15}$$
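The computation of the weights and of the resulting target can be sketched as follows, assuming the sampled next-state action-values of the target network are already available as a list of rows (one per stochastic forward pass). The function name, the data layout, and the tie-breaking rule (first maximal action wins) are illustrative assumptions.

```python
def wdqn_target(r, gamma, q_samples_next):
    """Weighted-estimator target from sampled next-state action-values.

    q_samples_next[t][a] is the target-network value of action a at pass t.
    """
    T = len(q_samples_next)
    n_actions = len(q_samples_next[0])
    # Weights: fraction of passes in which each action attains the maximum.
    counts = [0] * n_actions
    for sample in q_samples_next:
        counts[max(range(n_actions), key=sample.__getitem__)] += 1
    weights = [c / T for c in counts]
    # MC estimate of the predictive mean of each action-value.
    q_mean = [sum(s[a] for s in q_samples_next) / T for a in range(n_actions)]
    target = r + gamma * sum(w * q for w, q in zip(weights, q_mean))
    return target, weights


# Four hypothetical forward passes over three actions.
q_samples_next = [[1.0, 2.0, 0.0],
                  [3.0, 2.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [2.0, 4.0, 0.5]]
y_wdqn, weights = wdqn_target(1.0, 0.5, q_samples_next)
```

Here the second action is the maximum in three passes out of four, so it receives weight 0.75, while the third action is never the maximum and does not contribute to the target.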

Finally we report for completeness the loss minimized by WDQN, where the parameter updates are backpropagated using the dropout masks: