Deep reinforcement learning (RL) has enjoyed numerous recent successes in video games, board games, and robotics control [17, 3, 9, 18]. Deep RL algorithms typically apply naive exploration schemes such as $\epsilon$-greedy [12, 19], directly injecting noise into actions [10], and action-level entropy regularization [24]. However, such local perturbations of actions are unlikely to lead to systematic exploration in hard environments. Recent work on deep exploration [13] applies the bootstrap to approximate the posterior distribution of value functions, or injects noise into the value function or policy parameter space [4, 11].
We propose a framework that directly approximates the distribution of the value function parameters in a Deep Q Network. We present a surrogate objective that combines the Bellman error and an entropy term that encourages efficient exploration. The equivalence between our proposed objective and a variational inference loss allows the parameters to be optimized with powerful variational inference subroutines. Our algorithm can be interpreted as performing approximate Thompson sampling, which partially justifies the algorithm's efficiency. We demonstrate that the algorithm achieves efficient exploration with good performance on large-scale chain Markov decision processes, surpassing DQN with $\epsilon$-greedy exploration [12] and NoisyNet [4].
Markov Decision Process
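A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \rho)$ where we have state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $\mathcal{P}$, reward function $\mathcal{R}$ and initial distribution $\rho$ over states. A policy is a mapping from state to action $\pi: \mathcal{S} \mapsto \mathcal{A}$. At time $t$ in state $s_t \in \mathcal{S}$, the agent takes action $a_t = \pi(s_t)$, transitions to $s_{t+1}$ under $\mathcal{P}$, and receives reward $r_t$ under $\mathcal{R}$. Unless otherwise stated, the expectation over state $s_{t+1}$ and reward $r_t$ is with respect to transition kernel $\mathcal{P}$ and reward function $\mathcal{R}$. The objective is to find a policy $\pi$ to maximize the discounted cumulative reward
$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big],$$
where $\gamma \in (0, 1]$ is a discount factor. Being in state $s$, the action-value function $Q^\pi(s, a)$ under policy $\pi$ is defined as the expected cumulative reward that could be received by first taking action $a$ and then following policy $\pi$
$$Q^\pi(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\ a_0 = a,\ a_t = \pi(s_t) \text{ for } t \geq 1\Big].$$
Under policy $\pi$, the Bellman error is
$$J(\pi) = \mathbb{E}\Big[\big(Q^\pi(s_t, a_t) - \max_a \mathbb{E}[r_t + \gamma Q^\pi(s_{t+1}, a)]\big)^2\Big]. \tag{1}$$
Bellman's optimality condition specifies that, for optimal policy $\pi^*$, its action-value function $Q^*(s, a)$ satisfies the following condition
$$Q^*(s_t, a_t) = \max_a \mathbb{E}[r_t + \gamma Q^*(s_{t+1}, a)]$$
for any $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$. Hence the optimal action-value function has zero Bellman error, and any action-value function with zero Bellman error is also optimal.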
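To make the optimality condition concrete, the following minimal Python sketch repeatedly applies the Bellman backup on a tiny deterministic MDP and checks that the resulting action-value table has numerically zero Bellman residual. The two-state MDP, the table layout and the function names are illustrative assumptions, not part of the original experiments.

```python
def bellman_backup(q, transitions, rewards, gamma):
    """Apply one Bellman optimality backup to every (state, action) entry.

    Assumes a deterministic MDP: transitions[s][a] is the next state and
    rewards[s][a] is the reward for taking action a in state s.
    """
    return [
        [rewards[s][a] + gamma * max(q[transitions[s][a]])
         for a in range(len(q[s]))]
        for s in range(len(q))
    ]


def bellman_residual(q, transitions, rewards, gamma):
    """Largest absolute gap between q and its Bellman backup."""
    backed = bellman_backup(q, transitions, rewards, gamma)
    return max(abs(q[s][a] - backed[s][a])
               for s in range(len(q)) for a in range(len(q[s])))


# Hypothetical two-state MDP: action 0 moves to state 0, action 1 to state 1.
# Only taking action 1 in state 1 is rewarded.
transitions = [[0, 1], [0, 1]]
rewards = [[0.0, 0.0], [0.0, 1.0]]
gamma = 0.9

q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    q = bellman_backup(q, transitions, rewards, gamma)

residual = bellman_residual(q, transitions, rewards, gamma)
```

In this toy case the fixed point can be checked by hand: staying in state 1 forever yields $1 / (1 - \gamma) = 10$, so the table converges to that value for the rewarded entry.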
Deep Q Network
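Deep Q Networks (DQN) [12] proposes to approximate the action-value function $Q^\pi(s, a)$ by a neural network $Q_\theta(s, a)$ with parameters $\theta$. Let $\pi_\theta$ be the greedy policy with respect to $Q_\theta(s, a)$. The aim is to choose $\theta$ such that the Bellman error of Equation (1) is minimized.
In practice, the expectation in (1) is estimated by $K$ sample trajectories collected from the environment, each assumed to have period $T$. Let $Q_\theta(s_t^{(i)}, a_t^{(i)})$ be the approximate action-value function computed at state-action pair $(s_t^{(i)}, a_t^{(i)})$ on the $i$th sample trajectory. The approximate Bellman error is
$$\hat{J}(\theta) = \frac{1}{KT} \sum_{i=1}^{K} \sum_{t=0}^{T-1} \Big(Q_\theta(s_t^{(i)}, a_t^{(i)}) - \max_a \big(r_t^{(i)} + \gamma Q_\theta(s_{t+1}^{(i)}, a)\big)\Big)^2.$$
Equivalently, let $N = KT$ be the total number of samples and $(s_j, a_j, r_j, s_j')$ be a relabeling of $(s_t^{(i)}, a_t^{(i)}, r_t^{(i)}, s_{t+1}^{(i)})$ by sample number. Then, the error can be written as
$$\hat{J}(\theta) = \frac{1}{N} \sum_{j=1}^{N} \Big(Q_\theta(s_j, a_j) - \max_a \big(r_j + \gamma Q_\theta(s_j', a)\big)\Big)^2. \tag{2}$$
In (2) the term $\max_a (r_j + \gamma Q_\theta(s_j', a))$ is called the target value. To minimize $\hat{J}(\theta)$ is essentially to minimize the discrepancy between the target value and the prediction $Q_\theta(s_j, a_j)$. To stabilize training, [12] proposes to compute the target value by a target network with parameter $\theta^-$. The target network has the same architecture as the original network but its parameters are slowly updated, allowing the target distribution to be more stationary. The final approximate Bellman error is
$$\tilde{J}(\theta) = \frac{1}{N} \sum_{j=1}^{N} \Big(Q_\theta(s_j, a_j) - \max_a \big(r_j + \gamma Q_{\theta^-}(s_j', a)\big)\Big)^2. \tag{3}$$
The parameter $\theta$ is updated by stochastic gradient descent on the final approximate Bellman error, $\theta \leftarrow \theta - \alpha \nabla_\theta \tilde{J}(\theta)$, where $\alpha$ is the learning rate.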
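The update with a frozen target network can be illustrated with a minimal tabular sketch. A lookup table stands in for the neural network so that the gradient is easy to verify by hand; the toy transitions, table values and function names are assumptions made for illustration only.

```python
def bellman_loss(q, q_target, batch, gamma):
    """Mean squared gap between predictions Q(s, a) and target values
    r + gamma * max_a' Q_target(s', a') over a batch of transitions."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target[s_next])
        total += (q[s][a] - target) ** 2
    return total / len(batch)


def sgd_step(q, q_target, batch, gamma, lr):
    """One gradient step on the table entries touched by the batch.

    The target table is held fixed, playing the role of the target network.
    """
    grads = {}
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target[s_next])
        grad = 2.0 * (q[s][a] - target) / len(batch)
        grads[(s, a)] = grads.get((s, a), 0.0) + grad
    for (s, a), g in grads.items():
        q[s][a] -= lr * g


# Toy problem: 2 states, 2 actions, hypothetical transitions (s, a, r, s').
q = [[0.0, 0.0], [0.0, 0.0]]
q_target = [[0.0, 1.0], [0.5, 0.0]]
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]

loss_before = bellman_loss(q, q_target, batch, gamma=0.9)
sgd_step(q, q_target, batch, gamma=0.9, lr=0.5)
loss_after = bellman_loss(q, q_target, batch, gamma=0.9)
```

Because the target table does not change during the step, the loss is a simple quadratic in the updated entries and a single step with a moderate learning rate reduces it.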
1.1 Variational Inference
Given a generative model with parameter $\theta$, the samples $D$ are generated from distribution $p(D|\theta)$. Define prior $p(\theta)$ on the parameters. Given generated data $D$, the posterior of $\theta$ is computed by Bayes' rule, $p(\theta|D) = p(D|\theta)\, p(\theta) / p(D)$.
In most cases it is challenging to evaluate $p(\theta|D)$ directly. Consider using a variational family distribution $q_\phi(\theta)$ with parameter $\phi$ to approximate the posterior. One approach is to minimize the KL divergence between $q_\phi(\theta)$ and $p(\theta|D)$
$$\min_\phi \ \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta|D)\big). \tag{4}$$
In complex generative models such as Bayesian neural networks, we can approximately solve the above minimization problem using gradient descent. The gradient of (4) can be derived as an expectation, which is estimated by sample averages in practical implementations. When $\mathrm{KL}(q_\phi(\theta) \,\|\, p(\theta|D))$ is approximately minimized, we can directly infer $\theta$ from $q_\phi(\theta)$ [1, 15, 8].
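As a small illustration of minimizing a KL divergence by gradient descent, the sketch below fits a univariate Gaussian variational distribution to a known Gaussian "posterior". This is a deliberately simplified setting: the KL divergence and its gradient are available in closed form here, whereas the text above describes estimating the gradient by sample averages. The target values and function names are hypothetical.

```python
import math


def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def fit_gaussian(mu_p, sigma_p, steps=2000, lr=0.05):
    """Gradient descent on KL(q || p) over phi = (mu, rho), sigma_q = exp(rho)."""
    mu, rho = 0.0, 0.0
    for _ in range(steps):
        sigma = math.exp(rho)
        grad_mu = (mu - mu_p) / sigma_p ** 2
        # d/d rho of [-rho + exp(2 rho) / (2 sigma_p^2)]
        grad_rho = -1.0 + sigma ** 2 / sigma_p ** 2
        mu -= lr * grad_mu
        rho -= lr * grad_rho
    return mu, math.exp(rho)


# Hypothetical target posterior N(2.0, 0.5^2).
mu_hat, sigma_hat = fit_gaussian(mu_p=2.0, sigma_p=0.5)
final_kl = kl_gauss(mu_hat, sigma_hat, 2.0, 0.5)
```

Parameterizing the standard deviation through its logarithm keeps it positive without any projection step, a common choice in this kind of sketch.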
2 Related Methods
DQN [12] is one of the first successful frameworks in deep reinforcement learning. Building upon the original work, there have been numerous attempts to improve learning stability and sample efficiency, such as prioritized replay [16], double DQN [6] and dueling DQN [23], among others.
The duality between control and inference [21] encourages the application of variational inference to reinforcement learning problems. [5] propose specialized inference techniques applicable to small MDPs, yet these techniques could not be scaled to large problems.
VIME (Variational Information Maximization Exploration) [7] proposes to encourage exploration by an informational bonus. The algorithm learns a dynamics model of the environment and then computes the informational bonus based on changes in the posterior distribution of the dynamics model. The informational bonus is computed from a variational family distribution that approximates the posterior. This offers a novel approach to exploration, yet the exploration strategy is still intuitively local.
Bootstrapped DQN [14, 13] proposes to approximate the intractable posterior of value function parameters with the bootstrap. Different heads of the Bootstrapped DQN are trained with different sets of bootstrapped experience data. Multiple heads of the Bootstrapped DQN entail diverse strategies and encourage exploration. Though bootstrapping can be performed efficiently by parallel computing, this method is in general computationally costly.
Recent work on NoisyNet [4] proposes to add noise to value function parameters or policy parameters directly. The true parameters of the model are parameters that govern the distribution of value function/policy parameters. By a re-parametrization trick, the distribution parameters are updated by conventional backpropagation. NoisyNet applies randomization in parameter space, which corresponds to randomization in policy space and entails more consistent exploration.
BBQ Networks [11] also randomize in value function parameter space, and are applied to dialogue policy learning and an entire pipeline of a natural language processing system. Compared to their work, our formulation starts from a surrogate objective that explicitly encourages exploration and establishes its connection with variational inference. The variational interpretation allows us to leverage efficient black-box variational subroutines to update network parameters.
3 Proposed Algorithm
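As in the DQN formulation, the optimal action-value function is approximated by a neural network $Q_\theta(s, a)$ with parameter $\theta$. Consider $\theta$ following a parameterized distribution $\theta \sim q_\phi(\theta)$ with parameter $\phi$. The aim is to minimize an expected Bellman error
$$\mathbb{E}_{\theta \sim q_\phi(\theta)}\Big[\sum_{j=1}^{N} \Big(Q_\theta(s_j, a_j) - \max_a \big(r_j + \gamma Q_\theta(s_j', a)\big)\Big)^2\Big],$$
where we have adopted the sample estimate of the Bellman error (without the factor $\frac{1}{N}$) as in (2). The distribution $q_\phi(\theta)$ specifies a distribution over $Q_\theta$ and equivalently specifies a distribution over policy $\pi_\theta$. To entail efficient exploration, we need $q_\phi(\theta)$ to be dispersed. Let $H(\cdot)$ be the entropy of a distribution. Since large $H(q_\phi(\theta))$ implies dispersed $q_\phi(\theta)$, we encourage exploration by adding an entropy bonus $-\lambda H(q_\phi(\theta))$ to the above objective
$$\mathbb{E}_{\theta \sim q_\phi(\theta)}\Big[\sum_{j=1}^{N} \Big(Q_\theta(s_j, a_j) - \max_a \big(r_j + \gamma Q_\theta(s_j', a)\big)\Big)^2\Big] - \lambda H(q_\phi(\theta)), \tag{5}$$
where $\lambda > 0$ is a regularization constant, used to balance the expected Bellman error and the entropy bonus. The aim is to find $\phi$ that achieves low expected Bellman error while encompassing as many different policies as possible.
As in DQN [12], to stabilize training, we have a target parameter distribution $q_{\phi^-}(\theta^-)$ over $\theta^-$ with slowly updated parameters $\phi^-$. The target $\max_a (r_j + \gamma Q_{\theta^-}(s_j', a))$ is computed by a target network $\theta^-$ sampled from the target distribution $\theta^- \sim q_{\phi^-}(\theta^-)$. The final surrogate objective is
$$\mathbb{E}_{\theta \sim q_\phi(\theta),\ \theta^- \sim q_{\phi^-}(\theta^-)}\Big[\sum_{j=1}^{N} \Big(Q_\theta(s_j, a_j) - \max_a \big(r_j + \gamma Q_{\theta^-}(s_j', a)\big)\Big)^2\Big] - \lambda H(q_\phi(\theta)). \tag{6}$$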
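The surrogate objective can be estimated by Monte Carlo sampling. The following minimal sketch does so for a component-wise Gaussian distribution over the weights of a linear action-value function. The linear model, the single fixed draw of target parameters, the toy batch and all names are assumptions made to keep the example small; they are not the architecture used in the experiments.

```python
import math
import random


def gaussian_entropy(sigmas):
    """Entropy of a component-wise Gaussian with the given standard deviations."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in sigmas)


def q_value(theta, features, action):
    """Linear action-value: one weight vector per action (hypothetical model)."""
    return sum(w * x for w, x in zip(theta[action], features))


def surrogate(mus, sigmas, target_theta, batch, gamma, lam, n_samples, rng):
    """Monte Carlo estimate of expected Bellman error minus lam * entropy."""
    n_actions, dim = len(mus), len(mus[0])
    err = 0.0
    for _ in range(n_samples):
        # Draw theta ~ q_phi, with phi = (mus, sigmas).
        theta = [[rng.gauss(mus[a][k], sigmas[a][k]) for k in range(dim)]
                 for a in range(n_actions)]
        for x, a, r, x_next in batch:
            target = r + gamma * max(q_value(target_theta, x_next, b)
                                     for b in range(n_actions))
            err += (q_value(theta, x, a) - target) ** 2
    err /= n_samples
    flat_sigmas = [s for row in sigmas for s in row]
    return err - lam * gaussian_entropy(flat_sigmas)


# Toy setup: 2 actions, 1 feature, one transition (features, a, r, next features).
mus = [[0.0], [0.0]]
sigmas = [[1.0], [1.0]]
target_theta = [[0.0], [0.0]]
batch = [([1.0], 0, 1.0, [1.0])]

obj_no_bonus = surrogate(mus, sigmas, target_theta, batch, 0.9, 0.0, 100,
                         random.Random(0))
obj_bonus = surrogate(mus, sigmas, target_theta, batch, 0.9, 1.0, 100,
                      random.Random(0))
```

With the same random seed, the two estimates differ exactly by $\lambda$ times the entropy, which makes the role of the regularization constant easy to inspect.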
3.2 Variational Inference Interpretation
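Next, we offer an interpretation of minimizing surrogate objective (6) as minimizing a variational inference loss. Let the target value $d_j = \max_a (r_j + \gamma Q_{\theta^-}(s_j', a))$ be given (computed by target network $\theta^-$) and let $\sigma^2 = \lambda / 2$. The objective (6) is equivalent up to constant multiplication to
$$\mathbb{E}_{\theta \sim q_\phi(\theta)}\Big[\sum_{j=1}^{N} \frac{\big(Q_\theta(s_j, a_j) - d_j\big)^2}{2\sigma^2}\Big] - H(q_\phi(\theta)). \tag{7}$$
To bridge the gap to variational inference, consider $Q_\theta(s, a)$ as a Bayesian neural network with parameter $\theta$ and improper uniform prior $p(\theta) \propto 1$. The network generates data with Gaussian distribution $d_j \mid \theta \sim \mathcal{N}(Q_\theta(s_j, a_j), \sigma^2)$ with given standard error $\sigma$. Given data $D = \{d_j\}_{j=1}^{N}$, $p(\theta|D)$ denotes the posterior distribution of parameter $\theta$. Up to an additive constant that does not depend on $\phi$, the above objective (7) reduces to the KL divergence between $q_\phi(\theta)$ and the posterior of $\theta$
$$\mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta|D)\big). \tag{8}$$
Hence to update parameter $\phi$ based on the proposed objective (6) is equivalent to finding a variational family distribution $q_\phi(\theta)$ as an approximation to the posterior $p(\theta|D)$. In fact, from (7) we know that the posterior distribution $p(\theta|D)$ is the minimizer of (6) over all distributions $q$,
$$p(\theta|D) = \arg\min_{q} \ \mathbb{E}_{\theta \sim q}\Big[\sum_{j=1}^{N} \big(Q_\theta(s_j, a_j) - d_j\big)^2\Big] - \lambda H(q).$$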
Here we have established the equivalence between surrogate objective (6) and variational inference loss (7). In general, we only need to assume a Gaussian generative model for the target values, and any variational inference algorithm will then perform approximate minimization of the Bellman error. This interpretation allows us to apply powerful black-box variational inference packages, such as Edward [22], to update the value function, and to leverage different black-box algorithms [15, 8]. We can recover the original DQN as a special case of Variational DQN; see Appendix.
The variational inference interpretation of the proposed objective allows us to leverage powerful variational inference machinery to update the policy distribution parameter $\phi$.
We have a principal distribution parameter $\phi$ and a target distribution parameter $\phi^-$. At each time step $t$, we sample $\theta \sim q_\phi(\theta)$ and select action $a_t$ by being greedy with respect to $Q_\theta(s_t, a)$. The experience tuple $(s_t, a_t, r_t, s_{t+1})$ is added to a buffer for update. When updating parameters, we sample a mini-batch of tuples $\{(s_j, a_j, r_j, s_j')\}$ and compute target values $d_j$ using target network parameter $\theta^- \sim q_{\phi^-}(\theta^-)$. Then we evaluate the KL divergence in (8) using $d_j$ as generated data and improper uniform prior $p(\theta)$. The parameter $\phi$ is updated by taking one gradient descent step in the KL divergence. The target parameter $\phi^-$ is updated once in a while as in the original DQN. The pseudocode is summarized below. The algorithm can be interpreted as performing approximate Thompson sampling [20]. See Appendix.
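The action selection and parameter update described above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: it swaps in a hand-written reparameterization gradient for the black-box inference package, uses a linear action-value function with a component-wise Gaussian over its weights, and omits the periodic copy of the principal parameters into the target parameters. All names and the toy batch are hypothetical.

```python
import math
import random


def sample_theta(mu, rho, rng):
    """Draw theta = mu + exp(rho) * eps with eps ~ N(0, 1), component-wise."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in mu]
    theta = [[m + math.exp(r) * e for m, r, e in zip(mr, rr, er)]
             for mr, rr, er in zip(mu, rho, eps)]
    return theta, eps


def q_values(theta, x):
    """Linear action values, one weight vector per action."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta]


def act(mu, rho, x, rng):
    """Thompson-style selection: sample theta, then act greedily."""
    theta, _ = sample_theta(mu, rho, rng)
    q = q_values(theta, x)
    return max(range(len(q)), key=q.__getitem__)


def update(mu, rho, mu_tgt, rho_tgt, batch, gamma, lam, lr, rng):
    """One stochastic gradient step on squared Bellman error - lam * entropy."""
    theta, eps = sample_theta(mu, rho, rng)
    theta_tgt, _ = sample_theta(mu_tgt, rho_tgt, rng)
    g_mu = [[0.0] * len(row) for row in mu]
    # Gaussian entropy is sum(rho) + const, so -lam * H contributes -lam.
    g_rho = [[-lam] * len(row) for row in rho]
    for x, a, r, x_next in batch:
        d = r + gamma * max(q_values(theta_tgt, x_next))
        delta = 2.0 * (q_values(theta, x)[a] - d)
        for k, xk in enumerate(x):
            g_mu[a][k] += delta * xk
            g_rho[a][k] += delta * xk * math.exp(rho[a][k]) * eps[a][k]
    for a in range(len(mu)):
        for k in range(len(mu[a])):
            mu[a][k] -= lr * g_mu[a][k]
            rho[a][k] -= lr * g_rho[a][k]


# Toy check with gamma = 0, so targets equal rewards (1 for action 0, 0 for 1).
rng = random.Random(0)
mu, rho = [[0.0], [0.0]], [[0.0], [0.0]]
mu_tgt, rho_tgt = [[0.0], [0.0]], [[0.0], [0.0]]
batch = [([1.0], 0, 1.0, [1.0]), ([1.0], 1, 0.0, [1.0])]
for _ in range(3000):
    update(mu, rho, mu_tgt, rho_tgt, batch, gamma=0.0, lam=0.02, lr=0.05, rng=rng)
action = act(mu, rho, [1.0], rng)
```

In this one-dimensional toy case the stationary variance of each weight works out to $\lambda / 2$, which matches the relation between the regularization constant and the Gaussian noise level used in the variational interpretation.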
4 Testing Environments
4.1 Classic Control Tasks
These four classic control tasks are from OpenAI Gym environments [2]. They all require the agent to learn a good policy to properly control mechanical systems. Among them, MountainCar and Acrobot are considered more challenging, since solving them requires efficient exploration. For example, in MountainCar a bad exploration strategy will get the car stuck in the valley and the agent will never learn the optimal policy.
4.2 Chain MDP
The chain MDP [13] (Figure 1) serves as a benchmark environment to test if an algorithm entails deep exploration. The environment consists of $N$ states $s_1, \dots, s_N$ and each episode lasts a fixed number of time steps. The agent has two actions, left and right, at each state $s_i$, while the end states $s_1$ and $s_N$ are both absorbing. The transition is deterministic. The agent receives a small reward at state $s_1$, a larger reward at state $s_N$, and no reward anywhere else. The initial state is always close to $s_1$, making it hard for the agent to escape local optimality at $s_1$.
If the agent explores randomly (choosing left and right with equal probability), the expected number of time steps required to reach $s_N$ becomes very large as $N$ grows. For large $N$, it is almost impossible for the randomly exploring agent to reach $s_N$ in a single episode, and the optimal strategy of reaching $s_N$ by always choosing right will never be learned.
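A chain environment of this kind is straightforward to simulate. The sketch below is a minimal version under stated assumptions: states are numbered from 1, the agent starts in state 2, and the reward magnitudes and horizon are placeholder values chosen for illustration rather than the settings used in the experiments.

```python
class ChainMDP:
    """Deterministic chain with absorbing end states (illustrative sketch)."""

    LEFT, RIGHT = 0, 1

    def __init__(self, n_states, horizon, start=2,
                 left_reward=0.001, right_reward=1.0):
        # start, left_reward and right_reward are assumed placeholder values.
        self.n_states = n_states
        self.horizon = horizon
        self.start = start
        self.left_reward = left_reward
        self.right_reward = right_reward
        self.state = start
        self.t = 0

    def reset(self):
        self.state = self.start
        self.t = 0
        return self.state

    def step(self, action):
        # The end states 1 and n_states are absorbing: the agent stays put.
        if 1 < self.state < self.n_states:
            self.state += 1 if action == self.RIGHT else -1
        self.t += 1
        if self.state == 1:
            reward = self.left_reward
        elif self.state == self.n_states:
            reward = self.right_reward
        else:
            reward = 0.0
        return self.state, reward, self.t >= self.horizon


def run_episode(env, policy):
    """Roll out one episode and return the undiscounted return."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


env = ChainMDP(n_states=5, horizon=10)
always_right = run_episode(env, lambda s: ChainMDP.RIGHT)
always_left = run_episode(env, lambda s: ChainMDP.LEFT)
```

Comparing the two fixed policies shows the trap: always going left collects the small reward immediately, while the much larger return of always going right only appears after several unrewarded steps.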
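The feature $\varphi(s)$ of state $s$ is used as input to the neural network $Q_\theta(s, a)$ to compute the approximate action-value function. As suggested in [13], we consider the feature mapping $\varphi(s) = (\mathbb{1}\{x \leq s\})_{x=1}^{N}$ in $\{0, 1\}^N$, where $\mathbb{1}\{\cdot\}$ is the indicator function.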
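An indicator-based state feature of this kind can be written in one line. The sketch below assumes the "thermometer" encoding suggested in the bootstrapped DQN work, with states indexed from 1; the function name is hypothetical.

```python
def thermometer(state, n_states):
    """Return the vector (1{x <= state}) for x = 1, ..., n_states.

    Assumes states are integers in 1..n_states.
    """
    return [1 if x <= state else 0 for x in range(1, n_states + 1)]


features = thermometer(3, 5)
```

Unlike a one-hot encoding, neighbouring states share most of their active components, so a network can generalize along the chain.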
5.1 Classic Control Tasks
We compare Variational DQN and DQN on these control tasks. Both Variational DQN and DQN can solve the four control tasks within a given number of iterations, yet their training curves display quite different characteristics. On simple tasks like CartPole (Figure 2 (a) and (b)), Variational DQN makes progress faster than DQN but converges at a slower rate. This is potentially because Variational DQN optimizes the sum of the Bellman error and the exploration bonus, and the exploration bonus term in effect hinders fast convergence to the optimal strategy on simple tasks.
5.2 Chain MDP
We compare Variational DQN, DQN and NoisyNet on Chain MDP tasks. For small $N$, all algorithms converge to the optimal policy within a reasonable number of iterations, and the training curves of Variational DQN and NoisyNet are very similar. As $N$ increases, DQN barely makes progress and cannot converge to the optimal policy, while NoisyNet converges to the optimal policy more slowly and oscillates considerably. For still larger $N$, both DQN and NoisyNet barely make progress during training.
The performance of Variational DQN is fairly stable across a large range of $N$. For moderate $N$, Variational DQN converges to the optimal policy within 500 episodes (50 iterations) on average. However, as $N$ keeps increasing, Variational DQN takes a longer time to find the optimal policy, but it makes steady improvement over time.
The large discrepancy between the performance of these three algorithms on Chain MDP tasks is potentially due to their different exploration schemes. As stated previously, under random exploration the expected number of steps it takes to reach $s_N$ is very large for large $N$. Since DQN applies $\epsilon$-greedy for exploration, for large $N$ it will never even reach $s_N$ within a limited number of episodes, let alone learn the optimal policy. NoisyNet maintains a distribution over value functions, which allows the agent to consistently execute a sequence of actions under different policies, leading to more efficient exploration. However, since NoisyNet does not explicitly encourage a dispersed distribution over policies, the algorithm can still converge prematurely if the variance parameter converges quickly to zero. On the other hand, Variational DQN encourages high entropy over the policy distribution and can prevent such premature convergence.
To further investigate why Variational DQN can do systematic and efficient exploration of the environment, we plot the state visit counts of Variational DQN and DQN on a chain of fixed length $N$ in Figure 4. Let $c_n$ be the visit count to state $s_n$ for $1 \leq n \leq N$. In each episode, we set $c_n = 1$ if the agent ever visits $s_n$ and $c_n = 0$ otherwise. The running average of $c_n$ over consecutive episodes is the approximate visit probability of state $s_n$ under the current policy. In Figure 4 we show the visit probability for $s_1$ (locally optimal absorbing state), $s_N$ (optimal absorbing state) and a state in the middle of the chain. The probability for the middle state is meant to show if the agent ever explores the other half of the chain in one episode.
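The visit probability described above is a running average of per-episode binary indicators. A minimal sketch of such a tracker follows; the window length, class name and example episodes are assumptions for illustration.

```python
from collections import deque


class VisitTracker:
    """Running average of per-episode visit indicators for states 1..n_states."""

    def __init__(self, n_states, window):
        self.n_states = n_states
        # Only the most recent `window` episodes are kept.
        self.history = deque(maxlen=window)

    def record_episode(self, visited_states):
        visited = set(visited_states)
        self.history.append(
            [1 if n in visited else 0 for n in range(1, self.n_states + 1)])

    def visit_probabilities(self):
        if not self.history:
            return [0.0] * self.n_states
        count = len(self.history)
        return [sum(episode[i] for episode in self.history) / count
                for i in range(self.n_states)]


tracker = VisitTracker(n_states=4, window=2)
tracker.record_episode([2, 3])
tracker.record_episode([2])
tracker.record_episode([2, 3, 4])
probs = tracker.visit_probabilities()
```

Because each indicator records whether a state was visited at all during an episode, repeated visits within one episode do not inflate the estimate.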
At the early stage of training, Variational DQN starts with and maintains a relatively high probability of visiting all three states. This enables the agent to visit $s_N$ for a sufficient number of trials and to converge to the optimal policy of always going right and reaching $s_N$. On the other hand, DQN occasionally has a nontrivial probability of visiting the middle state due to $\epsilon$-greedy random exploration. But since DQN does not have enough momentum to consistently go beyond the middle state and visit $s_N$, visits to the middle state are finally suppressed and the agent converges to the locally optimal policy in $s_1$. See Appendix for the comparison of visit counts for other values of $N$ and for NoisyNet.
We have proposed a framework to directly tackle the distribution of the value function parameters. Assigning systematic randomness to value function parameters entails efficient randomization in policy space and allows the agent to explore efficiently. In addition, encouraging high entropy over the parameter distribution prevents premature convergence. We have also established an equivalence between the proposed surrogate objective and the variational inference loss, which allows us to leverage black-box variational inference machinery to update value function parameters.
A potential extension of our current work is to apply similar ideas to Q-learning in continuous control tasks and to policy gradient methods. We leave this as future work.
- [1] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518).
- [2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv:1606.01540.
- [3] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning.
- [4] Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. (2017). Noisy networks for exploration. arXiv:1706.10295.
- [5] Furmston, T. and Barber, D. (2010). Variational methods for reinforcement learning. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 9:241-248.
- [6] van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence (AAAI).
- [7] Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. (2016). VIME: Variational information maximizing exploration. Advances in Neural Information Processing Systems.
- [8] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1-45.
- [9] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research.
- [10] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. International Conference on Learning Representations.
- [11] Lipton, Z. C., Li, X., Gao, J., Li, L., Ahmed, F., and Deng, L. (2016). Efficient dialogue policy learning with BBQ-networks. arXiv:1608.05081.
- [12] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. NIPS Workshop on Deep Learning.
- [13] Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped DQN. arXiv:1602.04621.
- [14] Osband, I. and Van Roy, B. (2015). Bootstrapped Thompson sampling and deep exploration. arXiv:1507.00300.
- [15] Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS).
- [16] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. International Conference on Learning Representations.
- [17] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. International Conference on Machine Learning.
- [18] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-489 (28 January 2016). doi:10.1038/nature16961.
- [19] Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
- [20] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4).
- [21] Todorov, E. (2008). General duality between optimal control and estimation. Proceedings of the 47th IEEE Conference on Decision and Control.
- [22] Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., and Blei, D. M. (2017). Deep probabilistic programming. International Conference on Learning Representations.
- [23] Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. arXiv:1511.06581.
- [24] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.
7.1 Derivation of Variational Inference Interpretation
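Consider a Bayesian network $Q_\theta(s, a)$ with input $(s, a)$ and parameter $\theta$. The parameter $\theta$ has a prior $p(\theta)$. This Bayesian network produces the mean of a Gaussian distribution with variance $\sigma^2$, i.e. let $d$ be a sample
$$d \sim \mathcal{N}\big(Q_\theta(s, a), \sigma^2\big).$$
Given samples $D = \{d_j\}_{j=1}^{N}$, the posterior $p(\theta|D)$ is in general not possible to evaluate. Hence we propose a variational family distribution $q_\phi(\theta)$ with parameter $\phi$ to approximate the posterior. The variational inference literature [15, 1] has provided numerous techniques to compute $\phi$, yet for a flexible model black-box variational inference is most scalable. We consider minimizing the KL divergence between $q_\phi(\theta)$ and $p(\theta|D)$
$$\min_\phi \ \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta|D)\big). \tag{9}$$
Let $p(\theta)$ be an improper uniform prior. Also recall that under the Gaussian model
$$\log p(D|\theta) = -\sum_{j=1}^{N} \frac{\big(Q_\theta(s_j, a_j) - d_j\big)^2}{2\sigma^2} + \text{const}.$$
Decomposing the objective in (9) and omitting constants, we get
$$\mathbb{E}_{\theta \sim q_\phi(\theta)}\big[\log q_\phi(\theta)\big] + \mathbb{E}_{\theta \sim q_\phi(\theta)}\Big[\sum_{j=1}^{N} \frac{\big(Q_\theta(s_j, a_j) - d_j\big)^2}{2\sigma^2}\Big].$$
We then identify the first term as $-H(q_\phi(\theta))$ and the second as the expected Bellman error, scaled by $1 / (2\sigma^2)$.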
7.2 Variational Inference as approximate minimization of Bellman error
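Assume the Bayesian network $Q_\theta(s, a)$ produces a Gaussian sample $d \sim \mathcal{N}(Q_\theta(s, a), \sigma^2)$. Given samples $D = \{d_j\}_{j=1}^{N}$, as $N \to \infty$,
$$\arg\max_\theta \ p(\theta|D) \approx \arg\max_\theta \ p(D|\theta) = \arg\min_\theta \sum_{j=1}^{N} \big(Q_\theta(s_j, a_j) - d_j\big)^2,$$
as the information in the prior is overwhelmed by data. In fact, $p(\theta|D)$ itself will also concentrate around the MAP estimates. Using a variational family distribution $q_\phi(\theta)$ to approximate the posterior, we expect
$$\mathbb{E}_{\theta \sim q_\phi(\theta)}\Big[\sum_{j=1}^{N} \big(Q_\theta(s_j, a_j) - d_j\big)^2\Big] \approx \min_\theta \sum_{j=1}^{N} \big(Q_\theta(s_j, a_j) - d_j\big)^2,$$
therefore any variational inference procedure that updates $q_\phi(\theta)$ to approximate the posterior will converge to an approximate minimizer of the Bellman error. In particular, variational inference using $\mathrm{KL}(q_\phi(\theta) \,\|\, p(\theta|D))$ will result in an additional entropy bonus term in the surrogate objective, which in effect encourages a dispersed policy distribution and efficient exploration.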
7.3 Implementation Details
Since we have established an equivalence between the surrogate objective (Bellman error plus entropy bonus) and the variational inference loss, we can leverage highly optimized implementations in probabilistic programming packages to perform parameter updates. In our experiments, we used Edward [22] to minimize the KL divergence between the variational distribution $q_\phi(\theta)$ and the posterior $p(\theta|D)$.
In classic control tasks, we train Variational DQN, DQN and NoisyNet agents for about episodes on each task. The learning rate is or . The discount factor is .
In Chain MDPs, we train all agents for about episodes for fixed $N$. The learning rate is or . The discount factor is $\gamma = 1$ (if a discount factor $\gamma < 1$ is used, then for large $N$ going to state $s_1$ will be optimal).
For all experiments, the batch size of each mini-batch sampled from the replay buffer is . The target network is updated every time steps. Each experiment is replicated multiple times using different random seeds to start the entire training pipeline. DQN uses an exploration constant of throughout the training. Variational DQN and NoisyNet both use a component-wise Gaussian distribution to parameterize the distribution over value function parameters, and they are all initialized according to the recipes in [4]. Variational DQN is updated using the KLqp inference algorithm [22] with regularization constant .
7.4 Recover DQN
We set the variational family distribution to be the point distribution $q_\phi(\theta) = \delta(\theta - \phi)$. Hence $\phi$ has the same dimension as $\theta$ and is in effect $\theta$ itself. Then we apply Maximum a Posteriori (MAP) inference to update $\phi$. Under the Gaussian generative model of $d_j$ and the improper uniform prior, this is equivalent to minimizing the Bellman error only.
7.5 Chain MDP: Visit Counts for Other Chain Lengths
Below we present the state visit counts for other values of $N$ and for all three algorithms (Variational DQN, DQN and NoisyNet). For small $N$, Variational DQN and NoisyNet both identify the optimal path faster yet display larger variance in performance, while DQN makes progress in a more steady manner.
For medium-sized $N$, Variational DQN still manages to converge to the optimal policy, though the initial exploration stage exhibits larger variance. DQN occasionally passes the middle point (observed from the green spike) but cannot reach $s_N$. NoisyNet explores more efficiently than DQN, since it sometimes converges to the optimal policy, but is less stable than Variational DQN.
For large $N$, Variational DQN takes more time to find the optimal policy but still converges within a small number of iterations. On the other hand, both DQN and NoisyNet get stuck.
Interpretation as Approximate Thompson Sampling
Thompson sampling [20] is an efficient exploration scheme in multi-armed bandits and MDPs. At each step, Variational DQN maintains a distribution over action-value functions $Q_\theta(s, a)$. The algorithm proceeds by sampling a parameter $\theta \sim q_\phi(\theta)$ and then selecting action $a_t = \arg\max_a Q_\theta(s_t, a)$. This sampling differs from exact Thompson sampling in two aspects.
- The sampling distribution $q_\phi(\theta)$ is not the posterior distribution of $\theta$ given data but only a variational approximation.
- During mini-batch training, the data used to update the posterior $p(\theta|D)$ is not generated by the exact action-value function $Q^*$ but by the target network $Q_{\theta^-}$.
Hence the quality of this approximate Thompson sampling depends on the quality of both approximations $q_\phi(\theta) \approx p(\theta|D)$ and $Q_{\theta^-} \approx Q^*$, the second of which is in fact our very goal. Even when the approximation is not perfect, this approximate sampling scheme can still be beneficial to exploration.