State-Aware Variational Thompson Sampling for Deep Q-Networks

02/07/2021 ∙ by Siddharth Aravindan, et al. ∙ National University of Singapore

Thompson sampling is a well-known approach for balancing exploration and exploitation in reinforcement learning. It requires the posterior distribution of value-action functions to be maintained; this is generally intractable for tasks that have a high dimensional state-action space. We derive a variational Thompson sampling approximation for DQNs which uses a deep network whose parameters are perturbed by a learned variational noise distribution. We interpret the successful NoisyNets method fortunato2018noisy as an approximation to the variational Thompson sampling method that we derive. Further, we propose State Aware Noisy Exploration (SANE), which seeks to improve on NoisyNets by allowing a non-uniform perturbation, where the amount of parameter perturbation is conditioned on the state of the agent. This is done with the help of an auxiliary perturbation module, whose output is state dependent and is learnt end to end with gradient descent. We hypothesize that such state-aware noisy exploration is particularly useful in problems where exploration in certain high risk states may result in the agent failing badly. We demonstrate the effectiveness of the state-aware exploration method in the off-policy setting by augmenting DQNs with the auxiliary perturbation module.




1. Introduction

Exploration is a vital ingredient in reinforcement learning algorithms and has contributed greatly to their success in various applications khalil2017learning; mnih2015human; liang2017deep; gu2017deep. Traditionally, deep reinforcement learning algorithms have used naive exploration strategies such as $\epsilon$-greedy, Boltzmann exploration or action-space noise injection to drive the agent towards unfamiliar situations. Although effective in simple tasks, such undirected exploration strategies do not perform well in tasks with high dimensional state-action spaces.

Theoretically, Bayesian approaches like Thompson sampling have been known to achieve an optimal exploration-exploitation trade-off in multi-armed bandits agrawal2012analysis; agrawal2013further; kaufmann2012thompson and have also been shown to provide near optimal regret bounds for Markov Decision Processes (MDPs) osband2017posterior; agrawal2017optimistic. Practical usage of such methods, however, is generally intractable as they require the posterior distribution of value-action functions to be maintained.

(a) A high risk state
(b) A low risk state
Figure 1. The white car, which is controlled by the agent, has to move forward while avoiding other cars. (a) In this state, any action other than moving straight will result in a crash, making it a high risk state. (b) This is a low risk state since exploring random actions will not lead to a crash.

At the same time, in practical applications, perturbing the parameters of the model with Gaussian noise to induce exploratory behaviour has been shown to be more effective than $\epsilon$-greedy and other approaches that explore primarily by randomization of the action space plappert2018parameter; fortunato2018noisy. Furthermore, adding noise to model parameters is relatively easy and introduces minimal computational overhead. NoisyNets fortunato2018noisy, in particular, has been known to achieve better scores on the full Atari suite than other exploration techniques Taiga2020On.

In this paper, we derive a variational Thompson sampling approximation for Deep Q-Networks (DQNs), where the model parameters are perturbed by a learned variational noise distribution. This enables us to interpret NoisyNets as an approximation of Thompson sampling, where minimizing the NoisyNet objective is equivalent to optimizing the variational objective with Gaussian prior and approximate posterior distributions. These Gaussian approximating distributions, however, apply perturbations uniformly across the agent’s state space. We seek to improve this by approximating the Thompson sampling posterior with Gaussian distributions whose variance is dependent on the agent’s state.

To this end, we propose State Aware Noisy Exploration (SANE), an exploration strategy that induces exploration through state dependent parameter space perturbations. These perturbations are added with the help of an augmented state aware perturbation module, which is trained end-to-end along with the parameters of the main network by gradient descent.

We hypothesize that adding such perturbations helps us mitigate the effects of high risk state exploration, while exploring effectively in low risk states. We define a high risk state as a state where a wrong action might result in adverse implications, resulting in an immediate failure or transition to states from which the agent is eventually bound to fail. Exploration in such states might result in trajectories similar to the ones experienced by the agent as a result of past failures, thus resulting in low information gain. Moreover, it may also prevent meaningful exploratory behaviour at subsequent states in the episode, that may have been possible had the agent taken the correct action at the state. A low risk state, on the other hand, is defined as a state where a random exploratory action does not have a significant impact on the outcome or the total reward accumulated by the agent within the same episode. A uniform perturbation scheme for the entire state space may thus be undesirable in cases where the agent might encounter high risk states and low risk states within the same task. An instance of a high risk state and low risk state in Enduro, an Atari game, is shown in Figure 1. We try to induce uncertainty in actions, only in states where such uncertainty is needed through the addition of a state aware perturbation module.

To test our assumptions, we experimentally compare two SANE augmented Deep Q-Network (DQN) variants, namely the simple-SANE DQN and the Q-SANE DQN, with their NoisyNet counterparts fortunato2018noisy on a suite of 11 Atari games. Eight of the games in the suite have been selected to have high risk and low risk states as illustrated in Figure 1, while the remaining three games do not exhibit such properties. We find that agents that incorporate SANE do better in most of the eight games. An added advantage of SANE over NoisyNets is that it is more scalable to larger network models. The exploration mechanism in NoisyNets fortunato2018noisy adds an extra learnable parameter for every weight to be perturbed by noise injection, thus tying the number of parameters in the exploration mechanism to the network architecture being perturbed. The noise-injection mechanism in SANE on the other hand, is a separate network module, independent of the architecture being perturbed. The architecture of this perturbation module can be modified to suit the task. This makes it more scalable to larger networks.

2. Background

2.1. Markov Decision Processes

A popular approach to solving sequential decision making tasks involves modelling them as MDPs. An MDP can be described as a 5-tuple, $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space of the task, $T$ and $R$ represent the state-transition and reward functions of the environment respectively and $\gamma$ is the discount factor of the MDP. Solving an MDP entails learning an optimal policy that maximizes the expected cumulative discounted reward accrued during the course of an episode. Planning algorithms can be used to solve for optimal policies when $T$ and $R$ are known. However, when these functions are unavailable, reinforcement learning methods help the agent learn good policies.
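The quantity an optimal policy maximizes can be made concrete with a small sketch. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the original text; the code only evaluates the discounted sum of rewards for one already-collected episode.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single episode."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with discount factor 0.5.
episode_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(episode_return)
```

A discount factor close to 1 weights late rewards almost as much as early ones, while a small discount factor makes the agent short-sighted.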

2.2. Deep Q-Networks

A DQN mnih2015human is a value based temporal difference learning algorithm that estimates the action-value function by minimizing the temporal difference between two successive predictions. It uses a deep neural network as a function approximator to compute all action values of the optimal policy, $Q^*(s, a)$, for a given state $s$. A typical DQN comprises two separate networks, the Q network and the target network. The Q network aids the agent in interacting with the environment and collecting training samples to be added to the experience replay buffer, while the target network helps in calculating target estimates of the action value function. The network parameters are learned by minimizing the loss given in Equation 1, where $\theta$ and $\theta^-$ are the parameters of the Q network and the target network respectively. The training instances $(s, a, r, s')$ are sampled uniformly from the experience replay buffer $D$, which contains the most recent transitions experienced by the agent.

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right] \qquad (1)$$
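As an illustration of the temporal difference loss described here, the sketch below averages the squared TD error over a batch. Everything in it is a simplifying assumption: `q_online` and `q_target` are hypothetical callables that map a state to a list of action values, and the `done` flag for terminal transitions is a common implementation detail that the text does not discuss.

```python
def td_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over (s, a, r, s_next, done) transitions."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The bootstrapped target uses the target network; terminal
        # transitions (an assumption of this sketch) do not bootstrap.
        target = r if done else r + gamma * max(q_target(s_next))
        total += (target - q_online(s)[a]) ** 2
    return total / len(batch)


# Toy stand-ins for the two networks: constant action values.
q_online = lambda state: [0.0, 1.0]
q_target = lambda state: [2.0, 3.0]
batch = [(0, 1, 1.0, 1, False)]
loss_value = td_loss(batch, q_online, q_target, gamma=0.9)
print(loss_value)
```

In a real agent the two callables would be the Q network and a periodically synchronized copy of it, and the batch would be drawn uniformly from the replay buffer.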


2.3. Thompson Sampling

Thompson sampling thompson1933likelihood works under the Bayesian framework to provide a well balanced exploration-exploitation trade-off. It begins with a prior distribution over the action-value and/or the environment and reward models. The posterior distribution over these models/values is updated based on the agent’s interactions with the environment. A Thompson sampling agent tries to maximize its expected value by acting greedily with respect to a sample drawn from the posterior distribution. Thompson sampling has been known to achieve optimal and near optimal regret bounds in stochastic bandits agrawal2012analysis; agrawal2013further; kaufmann2012thompson and MDPs osband2017posterior; agrawal2017optimistic respectively.
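The prior, posterior update and greedy-on-a-sample steps can be seen in the simplest setting, a Bernoulli bandit with Beta posteriors. This is a minimal sketch rather than anything used in the paper: the arm means, the Beta(1, 1) prior and the function name `thompson_bernoulli` are all illustrative assumptions.

```python
import random


def thompson_bernoulli(true_means, steps, seed=0):
    """Run Thompson sampling on a Bernoulli bandit; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) prior for every arm
    failures = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(steps):
        # Draw one plausible mean per arm from its current posterior.
        samples = [rng.betavariate(successes[k], failures[k]) for k in range(n_arms)]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda k: samples[k])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate posterior update for the pulled arm.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_bernoulli([0.2, 0.8], steps=2000)
print(pulls)
```

Exploration here comes entirely from posterior uncertainty: as an arm's posterior concentrates, its samples stop varying and the agent stops trying clearly inferior arms.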

3. Related Work

Popularly used exploration strategies like $\epsilon$-greedy exploration, Boltzmann exploration and entropy regularization sutton2018reinforcement, though effective, can be wasteful at times, as they do not consider the agent’s uncertainty estimates about the state. In tabular settings, count based reinforcement learning algorithms such as UCRL jaksch2010near; auer2007logarithmic handle this by maintaining state-action visit counts and incentivize agents with exploration bonuses to take actions that the agent is uncertain about. An alternative approach is suggested by posterior sampling algorithms like PSRL strens2000bayesian, which maintain a posterior distribution over the possible environment models, and act optimally with respect to the model sampled from it. Both count based and posterior sampling algorithms have convergence guarantees in this setting and have been proven to achieve near optimal exploration-exploitation trade-off. Unfortunately, sampling from a posterior over environment models or maintaining visit counts is computationally infeasible in most real world applications due to the high dimensional state and action spaces involved in these tasks. However, approximations of such methods that do well have been proposed in recent times.

bellemare2016unifying generalizes count based techniques to non-tabular settings by using pseudo-counts obtained from a density model of the state space, while stadie2015incentivizing follows a similar approach but uses a predictive model to derive the bonuses. ostrovski2017count builds upon bellemare2016unifying by improving the density models used for granting exploration bonuses. Additionally, surprise-based motivation achiam2017surprise learns the transition dynamics of the task, and adds a reward bonus proportional to the Kullback–Leibler (KL) divergence between the true transition probabilities and the learned model to capture the agent’s surprise on experiencing a transition not conforming to its learned model. Such methods that add exploration bonuses prove to be most effective in settings where the rewards are very sparse but are often complex to implement plappert2018parameter.

Randomized least-squares value iteration (RLSVI) osband2016generalization is an approximation of posterior sampling approaches to the function approximation regime. RLSVI draws samples from a distribution of linearly parameterized value functions, and acts according to the function sampled. osband2016deep and osband2015bootstrapped are similar in principle to osband2016generalization; however, instead of explicitly maintaining a posterior distribution, samples are procured with the help of bootstrap re-sampling. Randomized Prior Functions osband2018randomized adds untrainable prior networks with the aim of capturing uncertainties not available from the agent’s experience, while azizzadenesheli2018efficient tries to do away with duplicate networks by using Bayesian linear regression with a Gaussian prior. Even though the action-value functions in these methods are no longer restricted to be linear, maintaining a bootstrap or computing a Bayesian linear regression makes these methods computationally expensive compared to others.

Parameter perturbations, which form another class of exploration techniques, have been known to enhance the exploration capabilities of agents in complex tasks xie2018nadpex; plappert2018parameter; florensa2017stochastic. ruckstiess2010exploring show that this type of policy perturbation in the parameter space outperforms action perturbation in policy gradient methods, where the policy is approximated with a linear function. However, ruckstiess2010exploring evaluate this on tasks with low dimensional state spaces. When extended to high dimensional state spaces, black box parameter perturbations salimans2017evolution, although proven effective, take a long time to learn good policies due to their non adaptive nature and inability to use gradient information. Gradient based methods that rely on adaptive scaling of the perturbations, drawn from spherical Gaussian distributions plappert2018parameter, gradient based methods that learn the amount of noise to be added fortunato2018noisy and gradient based methods that learn dropout policies for exploration xie2018nadpex are known to be more sample efficient than black box techniques. NoisyNets fortunato2018noisy, a method in this class, has been known to demonstrate consistent improvements over $\epsilon$-greedy across the Atari game suite, unlike count-based methods Taiga2020On. Moreover, these methods are also often easier to implement and computationally less taxing than the other two classes of algorithms mentioned above.

Our exploration strategy belongs to the class of methods that perturb parameters to effect exploration. Our method has commonalities with the parameter perturbing methods above as we sample perturbations from a spherical Gaussian distribution whose variance is learnt as a parameter of the network. However, the variance learnt, unlike NoisyNets fortunato2018noisy, is conditioned on the current state of the agent. This enables it to sample perturbations from different Gaussian distributions to vary the amount of exploration when the states differ. Our networks also differ in the type of perturbations applied to the parameters. While fortunato2018noisy obtains a noise sample from possibly different Gaussian distributions for each parameter, our network, like plappert2018parameter, samples all perturbations from the same, but state aware, Gaussian distribution. Moreover, the noise injection mechanism in SANE is a separate network module that is subject to user design. This added flexibility might make it more scalable to larger network models when compared to NoisyNets, where this mechanism is tied to the network being perturbed.

4. Variational Thompson Sampling

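Bayesian methods like Thompson Sampling use a posterior distribution $p(\theta \mid D)$ to sample the weights $\theta$ of the neural network, given $D$, the experience collected by the agent. $p(\theta \mid D)$ is generally intractable to compute and is usually approximated with a variational distribution $q(\theta)$. Let $D = (X, Y)$ be the dataset on which the agent is trained, with $X$ being the set of inputs, and $Y$ being the target labels. Variational methods minimize the KL divergence between $q(\theta)$ and $p(\theta \mid D)$ to make $q(\theta)$ a better approximation. Appendix A shows that minimizing $KL(q(\theta) \,\|\, p(\theta \mid D))$ is equivalent to maximizing the Evidence Lower Bound (ELBO), given by Equation 2.

$$\text{ELBO} = \int q(\theta) \log p(Y \mid X, \theta)\, d\theta - KL(q(\theta) \,\|\, p(\theta)) \qquad (2)$$

For a dataset with $N$ datapoints, and under the i.i.d. assumption, we have:

$$\log p(Y \mid X, \theta) = \log \prod_{i=1}^{N} p(y_i \mid x_i, \theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$

So, the objective to maximize is:

$$\max_{q} \left( \sum_{i=1}^{N} \int q(\theta) \log p(y_i \mid x_i, \theta)\, d\theta - KL(q(\theta) \,\|\, p(\theta)) \right)$$

In DQNs, the inputs $x_i = (s_i, a_i)$ are state-action tuples, and the corresponding target $y_i$ is an estimate of $Q^*(s_i, a_i)$. Traditionally, DQNs are trained by minimizing the squared error, which assumes a Gaussian error distribution around the target value. Assuming the same, we define $\log p(y_i \mid x_i, \theta)$ in Equation 3, where $y_i$ is the approximate target Q value of $(s_i, a_i)$ given by $r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-)$, $\sigma_e^2$ is the variance of the error distribution and $c(\sigma_e) = -\log\left(\sqrt{2\pi}\,\sigma_e\right)$.

$$\log p(y_i \mid x_i, \theta) = -\frac{\left(y_i - Q(s_i, a_i; \theta)\right)^2}{2\sigma_e^2} + c(\sigma_e) \qquad (3)$$

We approximate the integral for each example with a Monte Carlo estimate by sampling a $\hat{\theta}_i \sim q(\theta)$, giving

$$\max_{q} \left( \sum_{i=1}^{N} \left( -\frac{\left(y_i - Q(s_i, a_i; \hat{\theta}_i)\right)^2}{2\sigma_e^2} + c(\sigma_e) \right) - KL(q(\theta) \,\|\, p(\theta)) \right) \qquad (4)$$

As $c(\sigma_e)$ is a constant with respect to $q$, maximizing the ELBO is approximately the same as optimizing the following objective.

$$\max_{q} \left( -\sum_{i=1}^{N} \frac{\left(y_i - Q(s_i, a_i; \hat{\theta}_i)\right)^2}{2\sigma_e^2} - KL(q(\theta) \,\|\, p(\theta)) \right) \qquad (5)$$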
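The link between squared error and a Gaussian log-likelihood can be checked numerically. The sketch below assumes the Gaussian error model described in the text; the function names and the numeric values are illustrative. It compares the log of the Gaussian density with the form "negative scaled squared error plus a constant" and shows that the constant depends only on the error standard deviation, not on the prediction.

```python
import math


def gaussian_log_density(y, q_value, sigma_e):
    """Log of the N(q_value, sigma_e**2) density evaluated at y."""
    density = math.exp(-((y - q_value) ** 2) / (2 * sigma_e ** 2)) / (
        math.sqrt(2 * math.pi) * sigma_e
    )
    return math.log(density)


def squared_error_form(y, q_value, sigma_e):
    """Same quantity written as -(y - Q)^2 / (2 sigma_e^2) + c(sigma_e)."""
    c = -math.log(math.sqrt(2 * math.pi) * sigma_e)
    return -((y - q_value) ** 2) / (2 * sigma_e ** 2) + c


# Hypothetical target, prediction and error standard deviation.
lhs = gaussian_log_density(1.3, 0.9, 0.5)
rhs = squared_error_form(1.3, 0.9, 0.5)
print(lhs, rhs)
```

Because the constant does not depend on the network parameters, maximizing this log-likelihood over the parameters is the same as minimizing the squared error, up to the scale factor set by the error variance.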


4.1. Variational View of NoisyNet DQNs

The network architecture of NoisyNet DQNs usually comprises a series of convolutional layers followed by some fully connected layers. The parameters of the convolutional layers are not perturbed, while every parameter of the fully connected layers is perturbed by a separate Gaussian noise whose variance is learned along with the other parameters of the network.

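For the unperturbed parameters of the convolutional layers, $\theta^c$, we consider $q(\theta^c) = \mathcal{N}(\mu^c, \epsilon I)$. The parameters of any neural network are usually stored in a floating point format. We choose a value of $\epsilon$ that is close enough to zero that, with high probability, adding noise sampled from this distribution does not change the value of a weight as represented in this format. For the parameters of the fully connected layers, $\theta^f$, we take $q(\theta^f) = \mathcal{N}(\mu^f, \Sigma^f)$, where $\Sigma^f$ is a diagonal matrix with $\Sigma^f_{ii}$ equal to the learned variance for the parameter $\theta^f_i$. We take the prior $p(\theta) = \mathcal{N}(0, I)$ for all the parameters of the network.

With this choice of $p(\theta)$ and $q(\theta)$, the value of $KL(q(\theta) \,\|\, p(\theta))$ can be computed as shown in Equation 6, where $N_c$ and $N_f$ are the number of parameters in the convolutional and fully connected layers respectively. Note that $N_c$, $N_f$ and $\epsilon$ are constants given the network architecture.

$$KL(q(\theta) \,\|\, p(\theta)) = \frac{1}{2}\left( \|\mu^c\|^2 + \|\mu^f\|^2 + \sum_{i=1}^{N_f} \Sigma^f_{ii} - \sum_{i=1}^{N_f} \log \Sigma^f_{ii} + N_c\left(\epsilon - \log \epsilon\right) - (N_c + N_f) \right) \qquad (6)$$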
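The KL term between a diagonal Gaussian and a standard normal prior has a well-known closed form, which the sketch below evaluates. The code assumes a standard normal prior and a single shared small variance (called `eps` here) for the unperturbed convolutional parameters; these choices, the function names and the numbers are assumptions made for illustration.

```python
import math


def kl_diag_gaussian_vs_standard_normal(means, variances):
    """KL( N(means, diag(variances)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(means, variances))


def kl_split_by_layer_type(mu_conv, mu_fc, var_fc, eps):
    """Same KL, grouped into convolutional (fixed variance eps) and fully connected terms."""
    n_c, n_f = len(mu_conv), len(mu_fc)
    return 0.5 * (
        sum(m * m for m in mu_conv)
        + sum(m * m for m in mu_fc)
        + sum(var_fc)
        - sum(math.log(v) for v in var_fc)
        + n_c * (eps - math.log(eps))
        - (n_c + n_f)
    )


# Hypothetical tiny network: three convolutional and two fully connected parameters.
mu_conv, mu_fc, var_fc, eps = [0.1, -0.2, 0.3], [0.5, -0.4], [0.25, 0.04], 1e-6
grouped = kl_split_by_layer_type(mu_conv, mu_fc, var_fc, eps)
generic = kl_diag_gaussian_vs_standard_normal(mu_conv + mu_fc, [eps] * 3 + var_fc)
print(grouped, generic)
```

The terms involving only `eps` and the parameter counts do not change during training, so only the means and the learned variances contribute gradients through this term.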


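As NoisyNet DQN agents are usually trained on several million interactions, we assume that the KL term is dominated by the log likelihood term in the ELBO. Thus, maximizing the objective in Equation (5) can be approximated by optimizing the following objective, where $y_i$ is the target for the $i$-th state-action pair $(s_i, a_i)$ and $\hat{\theta}_i$ is a sample from $q(\theta)$:

$$\max_{q} \; -\sum_{i=1}^{N} \left(y_i - Q(s_i, a_i; \hat{\theta}_i)\right)^2 \qquad (7)$$

which is the objective that NoisyNet DQN agents optimize. In NoisyNets, every sample $\hat{\theta}$ is obtained by a simple reparameterization of the network parameters: $\hat{\theta} = \mu + \Sigma^{1/2} \varepsilon$, where $\mu$ and $\Sigma$ are the mean and diagonal covariance of $q(\theta)$ and $\varepsilon \sim \mathcal{N}(0, I)$. This reparameterization helps NoisyNet DQNs to learn through a sampled $\hat{\theta}$.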
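The reparameterization can be sketched in a few lines: a sample is a deterministic function of the mean, the standard deviation and parameter-free noise, which is what lets gradients reach the mean and the standard deviation. The function name `sample_parameters` and the numbers below are illustrative assumptions.

```python
import random


def sample_parameters(means, stds, rng):
    """Draw theta_hat = mu + std * noise elementwise, with noise ~ N(0, 1)."""
    return [mu + sd * rng.gauss(0.0, 1.0) for mu, sd in zip(means, stds)]


rng = random.Random(0)
means = [1.0, -2.0]
stds = [0.5, 0.1]  # per-parameter standard deviations (learned in NoisyNets)
theta_hat = sample_parameters(means, stds, rng)
print(theta_hat)
```

Setting every standard deviation to zero recovers the unperturbed parameters, which is how a noise-free evaluation of such a network can be obtained.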

4.2. State Aware Approximating Distributions

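It can be seen that the approximate posterior distribution $q(\theta)$ is state agnostic, i.e., it applies perturbations uniformly across the state space, irrespective of whether the state is high risk or low risk. We thus postulate that $q(\theta \mid s)$ is potentially a better variational approximator. $q(\theta)$ is a special case of a state aware variational approximator where $q(\theta \mid s)$ is the same for all $s$. A reasonable ELBO estimate for such an approximate distribution would be to extend the ELBO in Equation 4 to accommodate $q(\theta \mid s)$, as shown in Equation 8.

$$\max_{q} \sum_{i=1}^{N} \left( \int q(\theta \mid s_i) \log p(y_i \mid x_i, \theta)\, d\theta - KL(q(\theta \mid s_i) \,\|\, p(\theta)) \right) \qquad (8)$$

Approximating the integral for each example with a Monte Carlo estimate by sampling a $\hat{\theta}_i \sim q(\theta \mid s_i)$, maximizing the ELBO is equivalent to solving Equation 9.

$$\max_{q} \sum_{i=1}^{N} \left( -\frac{\left(y_i - Q(s_i, a_i; \hat{\theta}_i)\right)^2}{2\sigma_e^2} - KL(q(\theta \mid s_i) \,\|\, p(\theta)) \right) \qquad (9)$$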


We assume that the KL term will eventually be dominated by the log likelihood term in the ELBO, given a sufficiently large dataset. This posterior approximation leads us to the formulation of SANE DQNs as described in the following sections.

5. State Aware Noisy Exploration

Figure 2. A high level view of a State Aware Noisy Exploring Network.

State Aware Noisy Exploration (SANE) is a parameter perturbation based exploration strategy which induces exploratory behaviour in the agent by adding noise to the parameters of the network. The noise samples are drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2(h(s); \Theta))$, where $\sigma(h(s); \Theta)$ is computed as a function of a hidden representation, $h(s)$, of the state $s$ of the agent by an auxiliary neural network module, i.e., $\sigma(h(s); \Theta) = g_{\Theta}(h(s; \theta))$, where $\Theta$ and $\theta$ refer to the parameters of the auxiliary perturbation network ($g$) and the parameters of the main network respectively.

5.1. State Aware Noise Sampling

To procure state aware noise samples, we first need to compute $\sigma(h(s); \Theta)$, the state dependent standard deviation of the Normal distribution from which the perturbations are sampled. As stated above, we do this by adding an auxiliary neural network module. $\sigma(h(s); \Theta)$ is then used to generate perturbations for every network parameter using noise samples from the standard Normal distribution, $\mathcal{N}(0, 1)$, in tandem with a simple reparameterization of the sampling network salimans2017evolution; plappert2018parameter; fortunato2018noisy as shown in Equation 10.

$$\hat{\varepsilon} = \sigma(h(s); \Theta)\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1) \qquad (10)$$


State aware perturbations can be added to all types of layers in the network. The standard baseline architectures used by popular deep reinforcement learning algorithms for tasks such as playing Atari games mainly consist of several convolutional layers followed by multiple fully connected layers. We pass the output of the last convolutional layer as the hidden representation to compute the state aware standard deviation, $\sigma(h(s; \beta); \Theta)$, where $\beta$ is the set of parameters of the convolutional layers. Perturbations using $\sigma(h(s; \beta); \Theta)$ are then applied to the following fully connected layers.

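Our mechanism of introducing perturbations is similar to Noisy DQNs fortunato2018noisy and adaptive parameter space noise plappert2018parameter. Given a vector $x \in \mathbb{R}^p$ as input to a fully connected layer with $q$ outputs, an unperturbed layer computes a matrix transformation of the form $y = Wx + b$, where $W \in \mathbb{R}^{q \times p}$ and $b \in \mathbb{R}^q$ are the parameters associated with the layer, and $y \in \mathbb{R}^q$. We modify such layers with state-aware perturbations, by adding noise elements sampled from $\mathcal{N}(0, \sigma^2(h(s); \Theta))$ (Equation 10). This results in the perturbed fully connected layer computing a transform equivalent to Equation 11, where $\hat{W} = W + \sigma(h(s); \Theta)\, \varepsilon^w$, $\hat{b} = b + \sigma(h(s); \Theta)\, \varepsilon^b$, and $\varepsilon^w \in \mathbb{R}^{q \times p}$, $\varepsilon^b \in \mathbb{R}^q$ are a matrix and a vector whose elements are samples from a standard normal distribution.

$$y = \hat{W} x + \hat{b} \qquad (11)$$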
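A perturbed fully connected layer of this kind can be sketched in plain Python. The sketch assumes a single scalar standard deviation `sigma` shared by every weight and bias of the layer and draws independent standard normal noise per element; the function name and the toy values are illustrative, and in SANE the scalar would come from the perturbation module rather than being passed in by hand.

```python
import random


def perturbed_linear(x, weights, bias, sigma, rng):
    """Compute (W + sigma * eps_w) x + (b + sigma * eps_b) with fresh N(0, 1) noise."""
    outputs = []
    for row, b in zip(weights, bias):
        # One noise draw per weight and per bias, all scaled by the same sigma.
        noisy_row = [w + sigma * rng.gauss(0.0, 1.0) for w in row]
        noisy_bias = b + sigma * rng.gauss(0.0, 1.0)
        outputs.append(sum(w * xi for w, xi in zip(noisy_row, x)) + noisy_bias)
    return outputs


rng = random.Random(0)
weights = [[1.0, 2.0], [3.0, 4.0]]  # q = 2 outputs, p = 2 inputs
bias = [0.5, -0.5]
x = [1.0, 1.0]
print(perturbed_linear(x, weights, bias, sigma=0.0, rng=rng))  # unperturbed layer
print(perturbed_linear(x, weights, bias, sigma=0.3, rng=rng))  # exploratory layer
```

A small `sigma` keeps the layer close to its deterministic output, which is the behaviour one would want in a high risk state, while a larger `sigma` spreads the action values and encourages exploration.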


A high level view of a neural network with the augmented state aware perturbation module is shown in Figure 2. We partition $\theta$ into $\beta$ and $\kappa$, where $\beta$ is the set of parameters used to generate the hidden state representation $h(s)$ using the neural network $f_\beta$, and $\kappa$ are the parameters to which noise is to be added. Given the hidden state representation, the perturbation module $g_\Theta$ is used to compute the state dependent standard deviation $\sigma(h(s); \Theta)$, which is used to perturb the parameters $\kappa$ of the network $f_\kappa$. $f_\kappa$ then computes action-values for all actions. Additional features that may aid in exploration, such as state visit counts or uncertainty estimates, can also be appended to $h(s)$ before being passed as input to $g_\Theta$.

fortunato2018noisy suggests two alternatives to generate the noise $\varepsilon^w$ and $\varepsilon^b$ for a layer with $p$ inputs and $q$ outputs. The more computationally expensive alternative, Independent Gaussian noise, requires the sampling of each element of $\varepsilon^w$ and $\varepsilon^b$ independently, resulting in a sampling of $pq + q$ quantities per layer. Factored Gaussian noise, on the other hand, samples two vectors $\varepsilon_p$ and $\varepsilon_q$ of sizes $p$ and $q$ respectively. These vectors are then put through a real valued function $f(x) = \mathrm{sgn}(x)\sqrt{|x|}$ before an outer product is taken to generate $\varepsilon^w$ and $\varepsilon^b$ (Equation 12), which are the required noise samples for the layer. Readers are referred to fortunato2018noisy for more details on these two noise sampling techniques. As it is less computationally taxing and has no notable impact on performance fortunato2018noisy, we select Factored Gaussian noise as our method for sampling perturbations.

$$\varepsilon^w = f(\varepsilon_q)\, f(\varepsilon_p)^\top, \qquad \varepsilon^b = f(\varepsilon_q) \qquad (12)$$
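Factored Gaussian noise can be sketched as follows, using the signed square root transform from the NoisyNets paper. The function names are illustrative; `p` and `q` stand for the number of inputs and outputs of the layer. Only `p + q` Gaussian draws are needed, compared with `p * q + q` for independent noise.

```python
import math
import random


def signed_sqrt(x):
    """f(x) = sgn(x) * sqrt(|x|)."""
    return math.copysign(math.sqrt(abs(x)), x)


def factored_gaussian_noise(p, q, rng):
    """Return (eps_w, eps_b): a q-by-p noise matrix and a length-q noise vector."""
    eps_in = [signed_sqrt(rng.gauss(0.0, 1.0)) for _ in range(p)]
    eps_out = [signed_sqrt(rng.gauss(0.0, 1.0)) for _ in range(q)]
    # Outer product: entry (j, i) is f(eps_out[j]) * f(eps_in[i]).
    eps_w = [[out_j * in_i for in_i in eps_in] for out_j in eps_out]
    eps_b = list(eps_out)
    return eps_w, eps_b


rng = random.Random(0)
eps_w, eps_b = factored_gaussian_noise(p=3, q=2, rng=rng)
print(len(eps_w), len(eps_w[0]), len(eps_b))
```

The resulting weight noise matrix has rank one, so its entries are correlated; this is the price paid for the cheaper sampling.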


5.2. Network Parameters and Loss Function

The set of learnable parameters for a SANE network is the union of the set of parameters of the main network, $\theta$, and the set of parameters of the auxiliary network perturbation module, $\Theta$. Moreover, in place of minimizing the loss over the original set of parameters, $\mathcal{L}(\theta)$, the SANE network minimizes the function $\mathcal{L}(\hat{\theta})$, which is the loss corresponding to the network parameterized by the perturbed weights $\hat{\theta}$ of the network. Furthermore, with both the main network and the perturbation module being differentiable entities, using the reparameterization trick to sample the perturbations allows the joint optimization of both $\theta$ and $\Theta$ via backpropagation.
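Why the reparameterization trick allows joint optimization can be seen on a one-parameter toy model. The sketch below is an assumption-laden illustration, not the paper's network: a single weight `w` is perturbed as `w + sigma * eps`, the loss is a squared error, and the analytic gradients with respect to both `w` and `sigma` are compared with finite differences. With the noise `eps` held fixed, the loss is an ordinary differentiable function of both quantities.

```python
def loss(w, sigma, eps, x, y):
    """Squared error of a one-weight model with a perturbed weight."""
    return (y - (w + sigma * eps) * x) ** 2


def analytic_grads(w, sigma, eps, x, y):
    """Gradients of the loss with respect to w and sigma, holding eps fixed."""
    residual = y - (w + sigma * eps) * x
    grad_w = -2.0 * residual * x
    grad_sigma = -2.0 * residual * x * eps  # chain rule through w + sigma * eps
    return grad_w, grad_sigma


def finite_difference(fn, value, step=1e-6):
    """Central difference approximation of the derivative of fn at value."""
    return (fn(value + step) - fn(value - step)) / (2.0 * step)


w, sigma, eps, x, y = 0.5, 0.3, -1.2, 2.0, 1.0
grad_w, grad_sigma = analytic_grads(w, sigma, eps, x, y)
fd_w = finite_difference(lambda v: loss(v, sigma, eps, x, y), w)
fd_sigma = finite_difference(lambda v: loss(w, v, eps, x, y), sigma)
print(grad_w, fd_w, grad_sigma, fd_sigma)
```

In SANE the standard deviation is itself the output of the perturbation module, so the gradient with respect to it would continue, by the chain rule, into the module's own parameters.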

5.3. State Aware Deep Q Learning

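We follow an approach similar to fortunato2018noisy to add state aware noisy exploration to DQNs mnih2015human. In our implementation of a SANE DQN for Atari games, $\beta$ and $\kappa$ (the parameters that generate the hidden representation and the parameters to which noise is added) correspond to the set of parameters in the convolutional layers and the fully connected layers respectively. The Q network and the target network have their own copies of the network and perturbation module parameters.

The DQN learns by minimizing the following loss, where $\theta, \theta^-$ represent the network parameters of the Q-network and the target network and $\Theta, \Theta^-$ represent the perturbation module parameters of the Q-network and the target network respectively. The training samples $(s, a, r, s')$ are drawn uniformly from the replay buffer $D$.

$$\mathcal{L}(\theta, \Theta) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \hat{\theta}^-) - Q(s, a; \hat{\theta})\right)^2\right]$$

Here $\hat{\theta}$ and $\hat{\theta}^-$ denote the parameters of the Q-network and the target network after perturbation with noise scaled by $\sigma(h(s); \Theta)$ and $\sigma(h(s'); \Theta^-)$ respectively.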

Figure 3. High risk states learnt by Q-SANE in the 8 game sub-suite
Figure 4. Low risk states learnt by Q-SANE in the 8 game sub-suite

5.4. Variational View of SANE DQNs

In SANE DQNs, we allow the network to use a different posterior approximation for different states but restrict the perturbations that are added to all parameters to be sampled from the same distribution given a state $s$. Similar to NoisyNets, for the unperturbed parameters of the convolutional layers and the perturbation module, $\theta^c$ and $\Theta$, we consider $q(\theta^c, \Theta) = \mathcal{N}(\mu, \epsilon I)$ with $\epsilon$ close to zero, and for the parameters of the fully connected layers, $\theta^f$, we take $q(\theta^f \mid s) = \mathcal{N}(\mu^f, \sigma^2(h(s); \Theta) I)$. We take the prior $p(\theta) = \mathcal{N}(0, I)$ for all the parameters of the network. It follows that the objective to maximize is the same as objective (7), but where the parameters are drawn from $q(\theta \mid s)$.

In our experiments, we compare two different SANE DQN variants, namely, the simple-SANE DQN and the Q-SANE DQN. Both these SANE DQNs have the same network structure as shown in Figure 2. Q-SANE DQNs and simple-SANE DQNs differ in the additional features that are added to the perturbation module. Simple-SANE DQNs add no additional features to aid the perturbation module. On the other hand, Q-SANE DQNs use the non-noisy Q-values of the state as additional features. The non-noisy Q-values are computed via a forward pass of the neural network with no perturbations applied to the parameters that normally receive noise. Adding Q-values as features to the perturbation module can be useful, as a state where all the action values take similar values could be an indication of a low risk state and vice versa.
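The difference between the two variants can be sketched as a perturbation module that optionally receives extra features. Several details below are assumptions made for illustration only: the tiny layer sizes, the weight values, and in particular the softplus output activation, which is used here simply to keep the standard deviation positive and is not stated in the text.

```python
import math


def linear(x, weights, bias):
    """Dense layer: one output per (row of weights, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softplus(z):
    # Assumed output activation that keeps the standard deviation positive.
    return math.log1p(math.exp(z))


def state_aware_sigma(hidden, module, extra_features=()):
    """Map h(s), plus optional extra features, to a scalar standard deviation."""
    features = list(hidden) + list(extra_features)
    activations = [max(0.0, v) for v in linear(features, module["w1"], module["b1"])]
    (pre_sigma,) = linear(activations, module["w2"], module["b2"])
    return softplus(pre_sigma)


hidden = [0.5, -1.0]        # hypothetical hidden representation h(s)
q_values = [1.0, 1.1, 0.9]  # hypothetical non-noisy Q-values of the state

# simple-SANE style module: input is h(s) only.
simple_module = {"w1": [[0.2, -0.1], [0.4, 0.3]], "b1": [0.0, 0.1],
                 "w2": [[0.5, -0.5]], "b2": [0.0]}
# Q-SANE style module: input is h(s) with the Q-values appended.
q_module = {"w1": [[0.2, -0.1, 0.1, 0.0, -0.2], [0.4, 0.3, -0.3, 0.2, 0.1]],
            "b1": [0.0, 0.1], "w2": [[0.5, -0.5]], "b2": [0.0]}

sigma_simple = state_aware_sigma(hidden, simple_module)
sigma_q = state_aware_sigma(hidden, q_module, extra_features=q_values)
print(sigma_simple, sigma_q)
```

The only structural change between the two calls is the width of the module's input layer, which reflects the text's point that the perturbation module is designed independently of the network being perturbed.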

6. Experiments

We conduct our experiments on a suite of 11 Atari games. This suite contains 8 games that exhibit both high and low risk states (see Figures 1, 3 and 4) that we expect would benefit from state aware exploratory behaviour. We expect SANE not to have any notable benefit in the other 3 games.

6.1. Atari Test Suite

The 11 game Atari test suite has an 8 game sub-suite, consisting of the games Asterix, Atlantis, Enduro, IceHockey, Qbert, Riverraid, RoadRunner and Seaquest. High risk and low risk states of these games (in order) are shown in Figures 3 and 4 respectively. The games in this sub-suite have properties that benefit agents when trained with SANE exploration. Most high risk states in these games occur when the agent is either at risk of being hit (or captured) by an enemy or at risk of going out of bounds of the play area. Figures 3(a), 3(b), 3(c), 3(e), 3(g) and 3(h) illustrate this property of high risk states. Low risk states, on the other hand, correspond to those states where the agent has a lot of freedom to move around without risking loss of life. Most of the states in Figure 4 demonstrate this.

Additionally, there may be other complex instances of high risk states. For instance, in Riverraid, states where the agent is about to run out of fuel can be considered high risk states (Figure 3(f)). Moreover, sometimes the riskiness of a state can be hard to identify. This is illustrated by the high and low risk states of IceHockey shown in Figures 3(d) and 4(d) respectively. In games like IceHockey, high risk states are determined by the positions of the players and the puck. Figure 3(d) is a high risk state as the puck’s possession is being contested by both teams, while Figure 4(d) is low risk as the opponent is certain to gain the puck’s possession in the near future.

We also include 3 games, namely, FishingDerby, Boxing and Bowling, in our suite to check the sanity of SANE agents in games where we expect SANE exploration not to have any additional benefits over NoisyNets.

6.2. Parameter Initialization

Figure 5. Learning curves of SANE DQNs, NoisyNet DQNs and $\epsilon$-greedy DQNs.

We follow the initialization scheme of fortunato2018noisy to initialize the parameters of NoisyNet DQNs. Every layer of the main network of simple-SANE and Q-SANE DQNs is initialized with the Glorot uniform initialization scheme glorot2010understanding, while every layer of the perturbation module of both SANE DQN variants is initialized with samples from a distribution that depends on $n$, the number of inputs to the layer.

Game               Q-SANE             simple-SANE        NoisyNets          $\epsilon$-greedy
Asterix            126213 ± 20478     133849 ± 49379     110566 ± 31800     15777 ± 3370
Atlantis           141337 ± 67719     265144 ± 151154    162738 ± 73271     229921 ± 41143
Enduro             2409 ± 321         2798 ± 311         2075 ± 24          1736 ± 197
IceHockey          2.99 ± 1.4         1.43 ± 1.76        -2.4 ± 0.52        3.46 ± 0.8
Qbert              17358 ± 1015       15341 ± 162        15625 ± 166        16025 ± 555
Riverraid          14620 ± 3491       14919 ± 997        11220 ± 223        12023 ± 512
RoadRunner         49598 ± 1635       45929 ± 1648       51805 ± 885        47570 ± 1651
Seaquest           8368 ± 3426        8805 ± 1392        6031 ± 3567        7682 ± 1648
FishingDerby       -19.78 ± 7.2       -12.1 ± 4.2        -11.5 ± 5.4        -33.9 ± 9.1
Boxing             95.3 ± 3           93.2 ± 4.5         95.5 ± 1.7         96.6 ± 0.73
Bowling            29 ± 1             28.08 ± 1.2        37.4 ± 3.8         20.6 ± 4.7
Score (8 games)    4.86               5.51               4.28               3.33
Score (11 games)   4.1                4.85               3.98               3.25
Table 1. Scores of DQN agents when evaluated without noise injection or exploratory action selection.
Game               Q-SANE             simple-SANE        NoisyNets
Asterix            182797 ± 51182     194547 ± 56492     134682 ± 26574
Atlantis           281189 ± 126834    230837 ± 104472    166512 ± 93945
Enduro             2849 ± 464         2855 ± 579         1946 ± 136
IceHockey          2.86 ± 1.97        1.9 ± 3.25         -1.53 ± 0.45
Qbert              16950 ± 479        15438 ± 57         13824 ± 2690
Riverraid          15168 ± 2068       15434 ± 891        11076 ± 889
RoadRunner         47434 ± 2352       47578 ± 3787       51260 ± 712
Seaquest           7184 ± 2806        7844 ± 1245        6087 ± 3654
FishingDerby       -15.92 ± 7.9       -10.83 ± 2.34      -14 ± 2.8
Boxing             96.16 ± 1.73       95.1 ± 2.1         93.6 ± 2.7
Bowling            28.8 ± 0.61        28.13 ± 1.25       34.2 ± 2.4
Score (8 games)    6.43               6.2                4.64
Score (11 games)   5.54               5.37               4.22
Table 2. Scores of DQN agents when evaluated with noise injection.

6.3. Architecture and Hyperparameters

We use the same network structure to train NoisyNet, simple-SANE, Q-SANE and $\epsilon$-greedy DQNs. This network structure closely follows the architecture suggested in mnih2015human. The inputs to all the networks are also pre-processed in a similar way.

The perturbations for NoisyNet, simple-SANE and Q-SANE DQNs are sampled using the Factored Gaussian noise setup fortunato2018noisy. The SANE perturbation module used for all games and all SANE agents is a 1-hidden layer fully connected neural network. We train simple-SANE DQNs on the games of Asterix and Seaquest to determine the size of the hidden layer. A hyperparameter search over candidate hidden layer sizes revealed that a module with 256 hidden neurons gave the best results on these games. The hidden layer uses ReLU activation, and the output layer computes one output, which corresponds to the state aware standard deviation.


We train the DQNs with an Adam optimizer with learning rate, . All other hyperparameters use the same values as used by mnih2015human. The agents are trained for a total of 25M environment interactions where each training episode is limited to 100K agent-environment interactions. Both NoisyNet and SANE DQNs use greedy action selection. Please refer to Sections B and C in the Appendix for more details about the implementation.

For each game in the test suite, we train three simple-SANE, Q-SANE, NoisyNet and $\epsilon$-greedy DQN agents. Figure 5 shows the average learning curves of all the learning agents. Each point in the learning curve corresponds to the average reward received by the agent in the last 100 episodes, averaged over 3 independent runs. Table 1 shows the mean scores achieved by simple-SANE, Q-SANE, NoisyNet and $\epsilon$-greedy DQNs when, after being trained for 25M environment interactions, they are evaluated for 500K frames with no noise injection. The scores of the best scoring agents in each game have been highlighted. We also evaluate simple-SANE, Q-SANE and NoisyNet DQNs with noise injection. These scores are presented in Table 2. Tables 1 and 2 also report the mean human-normalized scores (HNS) DBLP:journals/corr/abs-1905-12726 achieved by these methods on the 8 games which are likely to benefit from SANE exploration and on the whole 11 game suite. We also present some high-risk and low-risk states identified by Q-SANE agents in Figures 3 and 4.
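For readers unfamiliar with human-normalized scores, the sketch below uses the commonly used definition, in which an agent's score is rescaled so that a random policy maps to 0 and a human player maps to 1, and then averages over games. The game names and every baseline number in the example are hypothetical placeholders, not values from the paper or from any published baseline table.

```python
def human_normalized_score(agent, random_baseline, human_baseline):
    """(agent - random) / (human - random): 0 is random play, 1 is human level."""
    return (agent - random_baseline) / (human_baseline - random_baseline)


def mean_hns(agent_scores, baselines):
    """Average the per-game human-normalized scores over all games."""
    values = [
        human_normalized_score(agent_scores[game], *baselines[game])
        for game in agent_scores
    ]
    return sum(values) / len(values)


# Hypothetical scores and (random, human) baselines for two made-up games.
agent_scores = {"GameA": 150.0, "GameB": 900.0}
baselines = {"GameA": (50.0, 250.0), "GameB": (100.0, 500.0)}
print(mean_hns(agent_scores, baselines))
```

Because the per-game scales differ by orders of magnitude, averaging raw scores would let a single game dominate; normalizing first makes the mean comparable across games.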

We observe that when evaluated without noise injection, both Q-SANE and simple-SANE outperform NoisyNets in 6 of the 8 games in the sub-suite. NoisyNets achieve higher scores than both SANE variants in RoadRunner. In the three games not in the sub-suite, NoisyNets achieve slightly higher scores in FishingDerby and Boxing, and perform much better than the other agents in Bowling. Evaluating the agents with noise injection proves beneficial for both SANE and NoisyNet agents: all of them achieve a higher mean HNS on the 8-game sub-suite and on the whole test suite. However, simple-SANE and Q-SANE agents achieve greater gains, scoring higher than NoisyNets in 7 games in the sub-suite. SANE agents also score better in the remaining three games, but do not manage to score better than NoisyNets in Bowling. Q-SANE and simple-SANE achieve the highest mean HNS on both the 8-game sub-suite and the whole test suite when evaluated with and without noise injection, respectively.

7. Conclusion

In this paper, we derive a variational Thompson sampling approximation for DQNs, which uses the distribution over the network parameters as a posterior approximation. We interpret NoisyNet DQNs as an approximation to this variational Thompson Sampling method where the posterior is approximated using a state uniform Gaussian distribution. Using a more general posterior approximation, we propose State Aware Noisy Exploration, a novel exploration strategy that enables state dependent exploration in deep reinforcement learning algorithms by perturbing the model parameters. The perturbations are injected into the network with the help of an augmented SANE module, which draws noise samples from a Gaussian distribution whose variance is conditioned on the current state of the agent. We hypothesize that such state aware perturbations are useful to direct exploration in tasks that go through a combination of high risk and low risk situations, an issue not considered by other methods that rely on noise injection.

We test this hypothesis by evaluating two SANE DQN variants, namely simple-SANE and Q-SANE DQNs, on a suite of 11 Atari games containing a selection of games, most of which fit the above criterion and some that do not. We observe that both simple-SANE and Q-SANE perform better than NoisyNet agents in most of the games in the suite, achieving better mean human-normalized scores.

An additional benefit of the SANE noise injection mechanism is its flexibility of design. SANE effects exploration via a separate perturbation module whose size and architecture are not tied to the model being perturbed, so the module can be designed freely and tailored to the task. As a consequence, this exploration method might scale better to larger network models. Hence, SANE presents a computationally inexpensive way to incorporate state information into exploration strategies and is a step towards more effective, efficient and scalable exploration.


This work was supported by the National Research Foundation Singapore under its AI Singapore Program (Award Number: AISGRP-2018-006).


Appendix A Deriving the ELBO

In Section 4, we mentioned that minimizing the KL divergence is equivalent to maximizing the ELBO. Here we provide a derivation of that claim.


Substituting the above value,

Now, the remaining terms are fixed with respect to the variational parameters. So, minimizing the KL divergence is equivalent to maximizing the ELBO (Equation 2).
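For reference, the standard form of this argument can be written out in generic notation. The symbols below are assumptions made for this sketch and may differ from those used in Section 4 and Equation 2: $q_\theta(w)$ denotes the variational distribution with parameters $\theta$, $p(w)$ the prior, $p(D \mid w)$ the likelihood of the data $D$, and $p(w \mid D)$ the true posterior. The paper's own decomposition may group the terms differently.

```latex
\begin{align*}
\mathrm{KL}\big(q_\theta(w)\,\|\,p(w \mid D)\big)
  &= \mathbb{E}_{q_\theta(w)}\big[\log q_\theta(w) - \log p(w \mid D)\big] \\
  % Bayes' rule: p(w | D) = p(D | w) p(w) / p(D)
  &= \mathbb{E}_{q_\theta(w)}\big[\log q_\theta(w) - \log p(D \mid w) - \log p(w)\big] + \log p(D) \\
  &= \mathrm{KL}\big(q_\theta(w)\,\|\,p(w)\big)
     - \mathbb{E}_{q_\theta(w)}\big[\log p(D \mid w)\big] + \log p(D) \\
  &= -\,\mathrm{ELBO}(\theta) + \log p(D),
\end{align*}
where
\[
\mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(w)}\big[\log p(D \mid w)\big]
  - \mathrm{KL}\big(q_\theta(w)\,\|\,p(w)\big).
\]
```

In this generic form, $\log p(D)$ does not depend on $\theta$, so the value of $\theta$ that minimizes the KL divergence is the one that maximizes the ELBO.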

Appendix B The SANE-DQN Algorithm

Algorithm 1 describes the implementation of a SANE DQN. The Q network and the target network in SANE DQNs each have their own copy of the network parameters, and each also maintains its own copy of the parameters of the perturbation module. Further, the Q-network parameters are partitioned into two sets: the base parameters, which compute the hidden state representation, and the network parameters that are to be perturbed with state-aware perturbations. The target-network parameters are partitioned into similar counterparts.

With every forward pass, the network first calculates the hidden state representation, which is then passed to the perturbation module. Factored Gaussian noise samples are drawn and multiplied by the state-aware standard deviation output by the module, giving perturbations equivalent to those sampled directly from the state-aware Gaussian distribution (Equation 10). These perturbations are added to the parameters that are to be perturbed. The agent then selects the action greedily with respect to the action values computed by the perturbed Q network. While computing the batch loss, the Q-network and target-network parameters are perturbed with state-aware perturbations sampled from their respective perturbation distributions (Lines 21-25 in Algorithm 1). This loss is then backpropagated to train the Q-network parameters and the parameters of its perturbation module.

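1: Initialize the target network parameters and the Q network parameters randomly.
2: Initialize an empty experience replay buffer;
3: Initialize the noise sampling method.
4: steps ← 0
5: while steps < max_steps do
7:     Observe the initial state
8:     while the current state is not terminal do
9:         Compute the hidden representation of the current state.
10:         Sample perturbations to perturb the Q network parameters
13:         Take the selected action, observe the next state and the reward
14:         Add the transition to the replay buffer;
15:         if buffer_size > max_buffer_size then
16:              Remove oldest buffer entry;
17:         if steps mod update_frequency = 0 then
18:              Sample a batch of transitions uniformly from the replay buffer.
20:              for each sampled transition do
21:                  Compute the hidden representations for the Q network and the target network
22:                  Sample perturbations to perturb the network parameters
27:              Backpropagate to minimize the batch loss L
28:         if steps mod copy_frequency = 0 then
29:              Copy the Q network parameters to the target network;
30:         Advance to the next state; steps ← steps + 1
Algorithm 1 SANE Deep Q Learning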
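The action-selection part of the forward pass described in this appendix can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it perturbs a single linear output layer, treats the state-aware standard deviation as a given scalar, and uses hypothetical function names. The noise-shaping function sgn(x)·sqrt(|x|) is the one used by the Factored Gaussian noise setup of fortunato2018noisy.

```python
import numpy as np

def noise_shape(x):
    """Noise-shaping function of factored Gaussian noise: sgn(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def factored_noise(rng, n_in, n_out):
    """Build weight and bias noise from n_in + n_out Gaussian samples."""
    eps_in = noise_shape(rng.standard_normal(n_in))
    eps_out = noise_shape(rng.standard_normal(n_out))
    return np.outer(eps_in, eps_out), eps_out  # shapes (n_in, n_out), (n_out,)

def perturbed_q_values(h, W, b, sigma, rng):
    """Scale the noise by the state-aware sigma and add it to W and b."""
    eps_W, eps_b = factored_noise(rng, W.shape[0], W.shape[1])
    return h @ (W + sigma * eps_W) + (b + sigma * eps_b)

# Usage: hidden representation of (assumed) size 512 and 4 actions.
rng = np.random.default_rng(0)
h = rng.standard_normal(512)
W = 0.01 * rng.standard_normal((512, 4))
b = np.zeros(4)
q = perturbed_q_values(h, W, b, sigma=0.5, rng=rng)
action = int(np.argmax(q))  # greedy action under the perturbed Q values
```

With `sigma` set to zero the perturbation vanishes and the function returns the unperturbed action values, which corresponds to evaluating the agent without noise injection.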

Appendix C DQN Implementation Details

The network structure and input pre-processing closely follow the architecture and method suggested in mnih2015human. The DQN consists of three convolutional layers followed by two linear layers. The first convolutional layer has 32 filters, the second has 64 filters, and the last has 64 filters. The convolutional layers use strides of 4, 2 and 1 respectively. They are followed by 2 fully connected layers: a hidden layer with 512 neurons and an output layer whose number of outputs equals the number of actions available to the agent for the task. With the exception of the output layer, a ReLU activation function follows every layer.
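To make the layer dimensions concrete, the short sketch below computes the spatial size of each convolutional feature map. The strides (4, 2, 1) and the 64 filters in the last layer come from the text; the kernel sizes (8, 4, 3), the 84×84 input and the absence of padding are assumptions borrowed from the DQN architecture of mnih2015human, since those values do not survive in the text above.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1

def feature_map_sizes(input_size=84, kernels=(8, 4, 3), strides=(4, 2, 1)):
    """Spatial size after each convolutional layer (assumed kernel sizes)."""
    sizes = []
    size = input_size
    for kernel, stride in zip(kernels, strides):
        size = conv_out(size, kernel, stride)
        sizes.append(size)
    return sizes

sizes = feature_map_sizes()
# Number of inputs to the 512-unit hidden layer: width * height * 64 filters.
flat_features = sizes[-1] * sizes[-1] * 64
```

Under these assumptions the feature maps are 20, 9 and 7 pixels wide, so the first fully connected layer receives 3136 inputs.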

The perturbations for both the networks are sampled using the Factored Gaussian noise setup fortunato2018noisy. The state-aware perturbation module used for all games is a fully connected neural network with one hidden layer. The hidden layer consists of 256 neurons and uses ReLU activation. The output layer computes one output, which corresponds to the state-aware standard deviation.

We train the DQNs with an Adam optimizer. We use a replay buffer that can hold a maximum of 1M transitions. We populate the replay buffer with 50K transitions, obtained by performing random actions for $\epsilon$-greedy agents and by following the policy suggested by the network for NoisyNet and SANE agents. Thereafter, we train the Q network once every 4 actions, with a batch of 32 transitions sampled uniformly from the replay buffer. We copy the parameters of the Q network to the target network after every 10K transitions. We use the same discount factor for all games. Additionally, the rewards received by the agent are clipped to a fixed range.
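The replay and update schedule described above can be sketched as follows. The class and function names are hypothetical, and the ring-buffer layout is one possible way to discard the oldest transition; the numeric defaults (50K warm-up transitions, one update every 4 actions, batches of 32, a target copy every 10K transitions) are the values stated in the text.

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer that overwrites its oldest transition when full."""

    def __init__(self, capacity=1_000_000, seed=0):
        self.capacity = capacity
        self.storage = []
        self.next_idx = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_idx] = transition  # replace oldest entry
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size=32):
        """Draw a batch uniformly at random, without replacement."""
        return self.rng.sample(self.storage, batch_size)

def should_train(step, warmup=50_000, update_frequency=4):
    """Train once every `update_frequency` actions after the warm-up phase."""
    return step >= warmup and step % update_frequency == 0

def should_copy(step, copy_frequency=10_000):
    """Copy Q-network parameters to the target network at fixed intervals."""
    return step % copy_frequency == 0

# Usage with a tiny capacity so that the overwrite behaviour is visible.
buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add(t)
batch = buf.sample(batch_size=2)
```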

The agent is trained for a total of 25M agent-environment interactions. The input to the network is a concatenation of 4 consecutive frames, and we take a random number of no-op actions (up to 30) at the start of each episode, so that the agent is given a random start.

The codebase for our SANE, Q-SANE, NoisyNet and $\epsilon$-greedy DQN agents is available at gitSANE.

Appendix D Additional Experimental Details

D.1. Human Normalized Scores

The human normalized score for any agent is calculated as follows


We use the same random and human baseline scores as used in DBLP:journals/corr/abs-1905-12726. We list these baseline scores in Table 3 for easy access.

Game          Human Score   Random Score
Asterix       8503          210
Atlantis      29028         12850
Boxing        12.1          0.1
Bowling       160.7         23.1
Enduro        860.5         0
FishingDerby  -38.7         -91.7
IceHockey     0.9           -11.2
Qbert         13455         163.9
Riverraid     17118         1338.5
RoadRunner    7845          11.5
Seaquest      42054         68.4
Table 3. Baseline human and random values used to calculate Human Normalized Scores.
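As an illustration of how the baselines in Table 3 are used, the sketch below computes a human normalized score with the commonly used definition (agent score minus random score, divided by human score minus random score). This definition is an assumption here, since the paper's own equation does not survive in the text above; some papers additionally multiply the ratio by 100. The function and dictionary names are hypothetical.

```python
# (human score, random score) per game, copied from Table 3.
BASELINES = {
    "Asterix": (8503, 210),
    "Atlantis": (29028, 12850),
    "Boxing": (12.1, 0.1),
    "Bowling": (160.7, 23.1),
    "Enduro": (860.5, 0),
    "FishingDerby": (-38.7, -91.7),
    "IceHockey": (0.9, -11.2),
    "Qbert": (13455, 163.9),
    "Riverraid": (17118, 1338.5),
    "RoadRunner": (7845, 11.5),
    "Seaquest": (42054, 68.4),
}

def human_normalized_score(game, agent_score):
    """Return 0.0 at the random baseline and 1.0 at the human baseline."""
    human, rand = BASELINES[game]
    return (agent_score - rand) / (human - rand)

def mean_hns(agent_scores):
    """Average the normalized score over a dict of {game: agent score}."""
    values = [human_normalized_score(g, s) for g, s in agent_scores.items()]
    return sum(values) / len(values)

# Usage with made-up agent scores chosen to hit the two baselines.
example = mean_hns({"Asterix": 8503, "Enduro": 0})
```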

D.2. Visualizations

We present the high risk and low risk states (from Figures 3 and 4) along with the state aware standard deviation predicted by the Q-SANE agents in Figures 6 and 7 respectively.

Figure 6. High risk states learnt by Q-SANE in the 8 game sub-suite. The captions mention the standard deviation predicted for each state.
Figure 7. Low risk states learnt by Q-SANE in the 8 game sub-suite. The captions mention the standard deviation predicted for each state.