1 Introduction
Most deep reinforcement learning (RL) methods assume that the state of the environment is fully observable at every time step. However, this assumption often does not hold in reality, as occlusions and noisy sensors may limit the agent’s perceptual abilities. Such problems can be formalised as partially observable Markov decision processes (POMDPs) (Astrom, 1965; Kaelbling et al., 1998). Because we usually do not have access to the true generative model of our environment, there is a need for reinforcement learning methods that can tackle POMDPs when only a stream of observations is given, without any prior knowledge of the latent state space or the transition and observation functions.
POMDPs are notoriously hard to solve: since the current observation does not, in general, carry all the information relevant for choosing an action, information must be aggregated over time and, in general, the entire history must be taken into account.
This history can be encoded either by remembering features of the past (McCallum, 1993) or by performing inference to determine the distribution over possible latent states (Kaelbling et al., 1998). However, the computation of this belief state requires knowledge of the model.
Most previous work in deep learning relies on training a recurrent neural network (RNN) to summarise the past. Examples are the deep recurrent Q-network (DRQN) (Hausknecht & Stone, 2015) and the action-specific deep recurrent Q-network (ADRQN) (Zhu et al., 2017). Because these approaches are completely model-free, they place a heavy burden on the RNN. Since performing inference implicitly requires a known or learned model, they are likely to summarise the history either by only remembering features of the past or by computing simple heuristics instead of actual belief states. This is often suboptimal in complex tasks. Generalisation is also often easier over beliefs than over trajectories since distinct histories can lead to similar or identical beliefs.
The premise of this work is that deep policy learning for POMDPs can be improved by taking less of a black-box approach than DRQN and ADRQN. While we do not want to assume prior knowledge of the transition and observation functions or the latent state representation, we want to allow the agent to learn models of them and infer the belief state using this learned model.
To this end, we propose deep variational reinforcement learning (DVRL), which implements this approach by providing a helpful inductive bias to the agent. In particular, we develop an algorithm that can learn an internal generative model and use it to perform approximate inference to update the belief state. Crucially, the generative model is not only learned based on an evidence lower bound (ELBO) objective, but also by how well it enables maximisation of the expected return. This ensures that, unlike in an unsupervised application of variational autoencoders (VAEs), the latent state representation and the inference performed on it are suitable for the ultimate control task. Specifically, we develop an approximation to the ELBO based on auto-encoding sequential Monte Carlo (AESMC) (Le et al., 2018), allowing joint optimisation with the $n$-step policy gradient update. Uncertainty in the belief state is captured by a particle ensemble. A high-level overview of our approach in comparison to previous RNN-based methods is shown in Figure 1.
We evaluate our approach on Mountain Hike and several flickering Atari games. On Mountain Hike, a low-dimensional, continuous environment, we show that DVRL is better than an RNN-based approach at inferring the information required for optimal action selection from past observations in a simple setting. Our results on flickering Atari show that this advantage extends to complex environments with high-dimensional observation spaces. Here, partial observability is introduced by (1) using only a single frame as input at each time step and (2) returning a blank screen instead of the true frame with probability 0.5.
2 Background
In this section, we formalise POMDPs and provide background on recent advances in VAEs that we use. Lastly, we describe the policy gradient loss based on $n$-step learning and A2C.
2.1 Partially Observable Markov Decision Processes
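A partially observable Markov decision process (POMDP) is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, F, U, R, b_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, and $\mathcal{O}$ the observation space. We denote $s_t \in \mathcal{S}$ as the latent state at time $t$, and the distribution over initial states $s_0$ as $b_0$, the initial belief state. When an action $a_t \in \mathcal{A}$ is executed, the state changes according to the transition distribution, $s_{t+1} \sim F(s_{t+1} \mid s_t, a_t)$. Subsequently, the agent receives a noisy or partially occluded observation $o_{t+1} \in \mathcal{O}$ according to the distribution $o_{t+1} \sim U(o_{t+1} \mid s_{t+1}, a_t)$, and a reward $r_{t+1} \in \mathbb{R}$ according to the distribution $r_{t+1} \sim R(r_{t+1} \mid s_{t+1}, a_t)$.
An agent acts according to its policy $\pi(a_t \mid o_{\le t}, a_{<t})$, which returns the probability of taking action $a_t$ at time $t$, and where $o_{\le t} = (o_1, \dots, o_t)$ and $a_{<t} = (a_0, \dots, a_{t-1})$ are the observation and action histories, respectively. The agent’s goal is to learn a policy $\pi$ that maximises the expected future return
$$J = \mathbb{E}_{p(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right] \tag{1}$$
over trajectories $\tau = (s_0, a_0, \dots, a_{T-1}, s_T)$ induced by its policy, where $0 \le \gamma < 1$ is the discount factor. The trajectory length $T$ is stochastic and depends on the time at which the agent-environment interaction ends. We follow the convention of setting $a_0$ to no-op (Zhu et al., 2017).
In general, a POMDP agent must condition its actions on the entire history $(o_{\le t}, a_{<t}) \in \mathcal{H}_t$. The exponential growth in $t$ of $\mathcal{H}_t$ can be addressed, e.g., with suffix trees (McCallum & Ballard, 1996; Shani et al., 2005; Bellemare et al., 2014; Bellemare, 2015; Messias & Whiteson, 2017). However, those approaches suffer from large memory requirements and are only suitable for small discrete observation spaces.
Alternatively, it is possible to infer the filtering distribution $p(s_t \mid o_{\le t}, a_{<t}) =: b_t$, called the belief state. This is a sufficient statistic of the history that can be used as input to an optimal policy $\pi^*(a_t \mid b_t)$. The belief space does not grow exponentially, but the belief update step requires knowledge of the model:
$$b_{t+1}(s_{t+1}) = \frac{\int b_t(s_t)\, U(o_{t+1} \mid s_{t+1}, a_t)\, F(s_{t+1} \mid s_t, a_t)\, \mathrm{d}s_t}{\int\!\!\int b_t(s_t)\, U(o_{t+1} \mid s'_{t+1}, a_t)\, F(s'_{t+1} \mid s_t, a_t)\, \mathrm{d}s_t\, \mathrm{d}s'_{t+1}} \tag{2}$$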
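For intuition, the exact belief update can be written out for a small discrete POMDP, where the integrals become sums. The following is a minimal sketch, not part of the method proposed here: the tabular transition table `F`, the observation table `U`, and the toy numbers are assumptions chosen purely for illustration.

```python
def belief_update(belief, action, observation, F, U):
    """One exact Bayes-filter step for a discrete POMDP.

    belief[s]        : probability of latent state s before acting
    F[a][s][s_next]  : transition probability (hypothetical tabular model)
    U[a][s_next][o]  : observation probability (hypothetical tabular model)
    """
    n = len(belief)
    unnormalised = []
    for s_next in range(n):
        # Predict: marginalise the previous state under the transition model.
        predicted = sum(belief[s] * F[action][s][s_next] for s in range(n))
        # Correct: weight by how well s_next explains the new observation.
        unnormalised.append(U[action][s_next][observation] * predicted)
    evidence = sum(unnormalised)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / evidence for p in unnormalised]


# Toy two-state example with a single action (index 0); numbers are made up.
F = [[[0.9, 0.1], [0.2, 0.8]]]
U = [[[0.8, 0.2], [0.3, 0.7]]]
b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=1, F=F, U=U)
```

The update needs both `F` and `U`, which is exactly the model knowledge that is unavailable when only a stream of observations is given.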
2.2 Variational Autoencoder
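We define a family of priors $p_\theta(s)$ over some latent state $s$ and decoders $p_\theta(o \mid s)$ over observations $o$, both parameterised by $\theta$. A variational autoencoder (VAE) learns $\theta$ by maximising the sum of log marginal likelihood terms $\sum_{n=1}^{N} \log p_\theta(o^{(n)})$ for a dataset $(o^{(n)})_{n=1}^{N}$, where $p_\theta(o) = \int p_\theta(o \mid s)\, p_\theta(s)\, \mathrm{d}s$ (Rezende et al., 2014; Kingma & Welling, 2014). Since evaluating the log marginal likelihood is intractable, the VAE instead maximises a sum of ELBOs, where each individual ELBO term is a lower bound on the log marginal likelihood,
$$\mathrm{ELBO}(\theta, \phi, o) = \mathbb{E}_{q_\phi(s \mid o)}\left[\log \frac{p_\theta(o \mid s)\, p_\theta(s)}{q_\phi(s \mid o)}\right] \le \log p_\theta(o), \tag{3}$$
for a family of encoders $q_\phi(s \mid o)$ parameterised by $\phi$. This objective also forces $q_\phi(s \mid o)$ to approximate the posterior $p_\theta(s \mid o)$ under the learned model. Gradients of (3) are estimated by Monte Carlo sampling with the reparameterisation trick (Kingma & Welling, 2014; Rezende et al., 2014).
2.3 VAE for Time Series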
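For sequential data, we assume that a series of latent states $s_{\le T}$ gives rise to a series of observations $o_{\le T}$. We consider a family of generative models parameterised by $\theta$ that consists of the initial distribution $p_\theta(s_0)$, transition distribution $p_\theta(s_t \mid s_{t-1})$ and observation distribution $p_\theta(o_t \mid s_t)$. Given a family of encoder distributions $q_\phi(s_t \mid s_{t-1}, o_t)$, we can also estimate the gradient of the ELBO term in the same manner as in (3), noting that:
$$p_\theta(s_{\le T}, o_{\le T}) = p_\theta(s_0) \prod_{t=1}^{T} p_\theta(s_t \mid s_{t-1})\, p_\theta(o_t \mid s_t), \tag{4}$$
$$q_\phi(s_{\le T} \mid o_{\le T}) = p_\theta(s_0) \prod_{t=1}^{T} q_\phi(s_t \mid s_{t-1}, o_t), \tag{5}$$
where we slightly abuse notation for $q_\phi$ by ignoring the fact that we sample from the model $p_\theta(s_0)$ for $t = 0$. Le et al. (2018), Maddison et al. (2017) and Naesseth et al. (2018) introduce a new ELBO objective based on sequential Monte Carlo (SMC) (Doucet & Johansen, 2009) that allows faster learning in time series:
$$\mathrm{ELBO}_{\mathrm{SMC}}(\theta, \phi, o_{\le T}) = \mathbb{E}\left[\sum_{t=1}^{T} \log\left(\frac{1}{K} \sum_{k=1}^{K} w_t^k\right)\right], \tag{6}$$
where $K$ is the number of particles and $w_t^k$ is the weight of particle $k$ at time $t$. Each particle is a tuple containing a weight $w_t^k$ and a value $s_t^k$ which is obtained as follows. Let $s_0^k$ be samples from $p_\theta(s_0)$ for $k = 1, \dots, K$. For $t = 1, \dots, T$, the weights $w_t^k$ are obtained by resampling the particle set $(s_{t-1}^k)_{k=1}^{K}$ proportionally to the previous weights and computing
$$w_t^k = \frac{p_\theta(s_t^k \mid s_{t-1}^{u_{t-1}^k})\, p_\theta(o_t \mid s_t^k)}{q_\phi(s_t^k \mid s_{t-1}^{u_{t-1}^k}, o_t)}, \tag{7}$$
where $s_t^k$ corresponds to a value sampled from $q_\phi(\cdot \mid s_{t-1}^{u_{t-1}^k}, o_t)$ and $s_{t-1}^{u_{t-1}^k}$ corresponds to the resampled particle with the ancestor index $u_0^k = k$ and $u_{t-1}^k \sim \mathrm{Discrete}\big((w_{t-1}^k / \sum_{j=1}^{K} w_{t-1}^j)_{k=1}^{K}\big)$ for $t = 2, \dots, T$.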
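The resample, propose and weight recursion can be illustrated on a toy model. The sketch below is an assumption-laden illustration rather than the implementation used in this work: the one-dimensional linear-Gaussian model, the hand-picked Gaussian proposal, and the function names are all hypothetical. It returns a single-sample estimate of the sum over time of the log mean particle weight.

```python
import math
import random


def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def smc_log_evidence(observations, num_particles, rng):
    """Return sum_t log(mean_k w_t^k) for a toy linear-Gaussian model.

    Assumed toy model (not from the paper):
      s_0 ~ N(0, 1),  s_t ~ N(0.9 * s_{t-1}, 1),  o_t ~ N(s_t, 1)
    Assumed proposal:
      q(s_t | s_{t-1}, o_t) = N((0.9 * s_{t-1} + o_t) / 2, sqrt(0.5))
    """
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    weights = [1.0] * num_particles  # uniform weights before the first observation
    total = 0.0
    for o in observations:
        # Resample ancestors proportionally to the previous weights.
        # (Resampling from uniform weights at the first step is a simplification.)
        ancestors = rng.choices(particles, weights=weights, k=num_particles)
        particles, weights = [], []
        for s_prev in ancestors:
            q_mean, q_std = 0.5 * (0.9 * s_prev + o), math.sqrt(0.5)
            s = rng.gauss(q_mean, q_std)
            log_w = (log_normal(s, 0.9 * s_prev, 1.0)   # transition density
                     + log_normal(o, s, 1.0)            # observation density
                     - log_normal(s, q_mean, q_std))    # proposal density
            particles.append(s)
            weights.append(math.exp(log_w))
        total += math.log(sum(weights) / num_particles)
    return total


estimate = smc_log_evidence([0.3, -0.1, 0.8], num_particles=200, rng=random.Random(0))
```

Because the estimate is a sum of per-step terms, it can be accumulated one observation at a time, which is the property exploited later when the objective is evaluated on short trajectory segments.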
2.4 A2C
One way to learn the parameters of an agent’s policy is to use $n$-step learning with A2C (Dhariwal et al., 2017; Wu et al., 2017), the synchronous simplification of A3C (Mnih et al., 2016). An actor-critic approach can cope with continuous actions and avoids the need to draw state-action sequences from a replay buffer. The method proposed in this paper is, however, equally applicable to other deep RL algorithms.
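For $n$-step learning, starting at time $t$, the current policy performs $n_s$ consecutive steps in $n_e$ parallel environments. The gradient update is based on this mini-batch of size $n_e \times n_s$. The target for the value function $V_\eta(s_{t+i})$, $i = 0, \dots, n_s - 1$, parameterised by $\eta$, is the appropriately discounted sum of on-policy rewards up until time $t + n_s$ and the off-policy bootstrapped value $V_\eta^-(s_{t+n_s})$. The minus sign denotes that no gradients are propagated through this value. Defining the advantage function as
$$A_\eta^{t,i}(s_{t+i}, a_{t+i}) := \left(\sum_{j=0}^{n_s - i - 1} \gamma^{j} r_{t+i+j}\right) + \gamma^{n_s - i}\, V_\eta^-(s_{t+n_s}) - V_\eta(s_{t+i}), \tag{8}$$
the A2C loss for the policy parameters $\rho$ at time $t$ is
$$\mathcal{L}_t^{A}(\rho) = -\frac{1}{n_e n_s} \sum_{\text{envs}}^{n_e} \sum_{i=0}^{n_s - 1} \log \pi_\rho(a_{t+i} \mid s_{t+i})\, A_\eta^{t,i,-}(s_{t+i}, a_{t+i}), \tag{9}$$
and the value function loss to learn $\eta$ can be written as
$$\mathcal{L}_t^{V}(\eta) = \frac{1}{n_e n_s} \sum_{\text{envs}}^{n_e} \sum_{i=0}^{n_s - 1} A_\eta^{t,i}(s_{t+i}, a_{t+i})^2. \tag{10}$$
Lastly, an entropy loss is added to encourage exploration:
$$\mathcal{L}_t^{H}(\rho) = -\frac{1}{n_e n_s} \sum_{\text{envs}}^{n_e} \sum_{i=0}^{n_s - 1} H\big(\pi_\rho(\cdot \mid s_{t+i})\big), \tag{11}$$
where $H(\cdot)$ is the entropy of a distribution.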
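As a small worked example of the $n$-step targets, the sketch below computes advantages for one environment by working backwards from a bootstrapped value, and then forms the policy and value losses as plain averages. It is an illustration under assumptions: the function names and the toy numbers are hypothetical, and in an automatic-differentiation framework the advantage used in the policy loss and the bootstrap value would be treated as constants.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """Advantages for steps i = 0..n_s-1 of one rollout in one environment.

    rewards[i]      : reward paired with step i of the rollout
    values[i]       : critic estimate for the state at step i
    bootstrap_value : critic estimate for the state after the last step
                      (no gradient would flow through it)
    """
    n_s = len(rewards)
    advantages = [0.0] * n_s
    target = bootstrap_value
    # Work backwards so that each target reuses the one after it.
    for i in reversed(range(n_s)):
        target = rewards[i] + gamma * target
        advantages[i] = target - values[i]
    return advantages


def a2c_losses(log_probs, advantages):
    """Policy loss (advantage treated as a constant) and value loss."""
    n = len(advantages)
    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    value_loss = sum(a * a for a in advantages) / n
    return policy_loss, value_loss


adv = n_step_advantages([1.0, 0.0, 2.0], [0.5, 0.5, 0.5], bootstrap_value=1.0, gamma=0.5)
policy_loss, value_loss = a2c_losses([-0.1, -0.2, -0.3], adv)
```

With these toy inputs the discounted targets are 1.625, 1.25 and 2.5, so the advantages are 1.125, 0.75 and 2.0.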
3 Deep Variational Reinforcement Learning
Fundamentally, there are two approaches to aggregating the history in the presence of partial observability: remembering features of the past or maintaining beliefs.
In most previous work, including ADRQN (Zhu et al., 2017), the current history $(a_{<t}, o_{\le t})$ is encoded by an RNN, which leads to the recurrent update equation for the latent state $\hat{h}_t$:
$$\hat{h}_t = \mathrm{RNNUpdate}_\phi(\hat{h}_{t-1}, a_{t-1}, o_t). \tag{12}$$
Since this approach is model-free, it is unlikely to approximate belief update steps, instead relying on memory or simple heuristics.
Inspired by the premise that a good way to solve many POMDPs involves (1) estimating the transition and observation model of the environment, (2) performing inference under this model, and (3) choosing an action based on the inferred belief state, we propose DVRL. It extends the RNN-based approach to explicitly support belief inference. Training everything end-to-end shapes the learned model to be useful for the RL task at hand, and not only for predicting observations.
We first explain our baseline architecture and training method in Section 3.1. For a fair comparison, we modify the original architecture of Zhu et al. (2017) in several ways. We find that our new baseline outperforms their reported results in the majority of cases.
In Sections 3.2 and 3.3, we explain our new latent belief state $\hat{b}_t$ and the recurrent update function
$$\hat{b}_t = \mathrm{BeliefUpdate}_{\theta, \phi}(\hat{b}_{t-1}, a_{t-1}, o_t), \tag{13}$$
which replaces (12). Lastly, in Section 3.4, we describe our modified loss function, which allows learning the model jointly with the policy.
3.1 Improving the Baseline
While previous work often used Q-learning to train the policy (Hausknecht & Stone, 2015; Zhu et al., 2017; Foerster et al., 2016; Narasimhan et al., 2015), we use $n$-step A2C. This avoids drawing entire trajectories from a replay buffer and allows continuous actions.
Furthermore, since A2C interleaves unrolled trajectories and performs a parameter update only every $n_s$ steps, it makes it feasible to maintain an approximately correct latent state. A small bias is introduced by not recomputing the latent state after each gradient update step.
We also modify the implementation of backpropagation through time (BPTT) for $n$-step A2C in the case of policies with latent states. Instead of backpropagating gradients only through the computation graph of the current update involving $n_s$ steps, we set the size of the computation graph independently to involve $n_g$ steps. This leads to an average BPTT-length of $(n_g - 1)/2$. (This is implemented in PyTorch using the retain_graph=True flag in the backward() function.) This decouples the bias-variance tradeoff of choosing $n_s$ from the bias-runtime tradeoff of choosing $n_g$. Our experiments show that choosing $n_g > n_s$ greatly improves the agent’s performance.
3.2 Extending the Latent State
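For DVRL, we extend the latent state to be a set of $K$ particles, capturing the uncertainty in the belief state (Thrun, 2000; Silver & Veness, 2010). Each particle consists of the triplet $(h_t^k, z_t^k, w_t^k)$ (Chung et al., 2015). The value $h_t^k$ of particle $k$ is the latent state of an RNN; $z_t^k$ is an additional stochastic latent state that allows us to learn stochastic transition models; and $w_t^k$ assigns each particle an importance weight.
Our latent state $\hat{b}_t$ is thus an approximation of the belief state in our learned model
$$p_\theta(h_{\le T}, z_{\le T}, o_{\le T} \mid a_{<T}) = p_\theta(h_0) \prod_{t=1}^{T} p_\theta(z_t \mid h_{t-1}, a_{t-1})\, p_\theta(o_t \mid h_{t-1}, z_t, a_{t-1})\, \delta_{\psi_\theta^{\mathrm{RNN}}(h_{t-1}, z_t, a_{t-1}, o_t)}(h_t), \tag{14}$$
with stochastic transition model $p_\theta(z_t \mid h_{t-1}, a_{t-1})$, decoder $p_\theta(o_t \mid h_{t-1}, z_t, a_{t-1})$, and deterministic transition function $h_t = \psi_\theta^{\mathrm{RNN}}(h_{t-1}, z_t, a_{t-1}, o_t)$, which is denoted using the delta-distribution $\delta$ and for which we use an RNN. The model is trained to jointly optimise the ELBO and the expected return.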
3.3 Recurrent Latent State Update
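To update the latent state, we proceed as follows:
$$u_{t-1}^k \sim \mathrm{Discrete}\left(\frac{w_{t-1}^k}{\sum_{j=1}^{K} w_{t-1}^j}\right) \tag{15}$$
$$z_t^k \sim q_\phi\big(z_t^k \mid h_{t-1}^{u_{t-1}^k}, a_{t-1}, o_t\big) \tag{16}$$
$$h_t^k = \psi_\theta^{\mathrm{RNN}}\big(h_{t-1}^{u_{t-1}^k}, z_t^k, a_{t-1}, o_t\big) \tag{17}$$
$$w_t^k = \frac{p_\theta\big(z_t^k \mid h_{t-1}^{u_{t-1}^k}, a_{t-1}\big)\, p_\theta\big(o_t \mid h_{t-1}^{u_{t-1}^k}, z_t^k, a_{t-1}\big)}{q_\phi\big(z_t^k \mid h_{t-1}^{u_{t-1}^k}, a_{t-1}, o_t\big)} \tag{18}$$
First, we resample particles based on their weight by drawing ancestor indices $u_{t-1}^k$. This improves model learning (Le et al., 2018; Maddison et al., 2017) and allows us to train the model jointly with the $n$-step loss (see Section 3.4).
For $k = 1, \dots, K$, new values for $z_t^k$ are sampled from the encoder $q_\phi(z_t^k \mid h_{t-1}^{u_{t-1}^k}, a_{t-1}, o_t)$, which conditions on the resampled ancestor values $h_{t-1}^{u_{t-1}^k}$ as well as the last action $a_{t-1}$ and current observation $o_t$. Latent variables $z_t^k$ are sampled using the reparameterisation trick. The values $z_t^k$, together with $h_{t-1}^{u_{t-1}^k}$, $a_{t-1}$ and $o_t$, are then passed to the transition function $\psi_\theta^{\mathrm{RNN}}$ to compute $h_t^k$.
The weights $w_t^k$ measure how likely each new latent state value $(z_t^k, h_t^k)$ is under the model and how well it explains the current observation.
To condition the policy on the belief $\hat{b}_t = (z_t^k, h_t^k, w_t^k)_{k=1}^{K}$, we need to encode the set of particles into a vector representation $\hat{h}_t$. We use a second RNN that sequentially takes in each tuple $(z_t^k, h_t^k, w_t^k)$ and its last latent state is $\hat{h}_t$.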
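The four update steps can be sketched on scalar toy quantities. Everything model-specific below is an assumption made for illustration only: the Gaussian prior, decoder and encoder, the use of `tanh` as a stand-in for the RNN transition, and the function names do not come from this work, and no learning or gradient computation is shown.

```python
import math
import random


def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def particle_belief_update(particles, action, obs, rng):
    """One resample / propose / transition / weight step on a particle set.

    particles: list of (h, z, w) triples with scalar entries (toy setting).
    """
    weights = [w for (_, _, w) in particles]
    # Step 1: resample ancestor indices proportionally to the weights.
    ancestors = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    new_particles = []
    for u in ancestors:
        h_prev = particles[u][0]
        prior_mean = 0.5 * h_prev + action          # hypothetical prior p(z | h, a)
        q_mean, q_std = 0.5 * (prior_mean + obs), math.sqrt(0.5)
        # Step 2: sample the stochastic latent from the encoder (proposal).
        z = rng.gauss(q_mean, q_std)
        # Step 3: deterministic transition; tanh stands in for the RNN cell.
        h = math.tanh(h_prev + z + action + obs)
        # Step 4: importance weight = prior * decoder / encoder.
        log_w = (log_normal(z, prior_mean, 1.0)
                 + log_normal(obs, z, 1.0)          # hypothetical decoder p(o | z)
                 - log_normal(z, q_mean, q_std))
        new_particles.append((h, z, math.exp(log_w)))
    return new_particles


rng = random.Random(0)
belief = [(0.0, 0.0, 1.0) for _ in range(5)]
belief = particle_belief_update(belief, action=0.1, obs=0.4, rng=rng)
```

The returned list has the same size as the input, so the update can be applied recurrently, one action-observation pair at a time.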
3.4 Loss Function
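To encourage learning a model, we include the term
$$\mathcal{L}_t^{\mathrm{ELBO}}(\theta, \phi) = -\frac{1}{n_e n_s} \sum_{\text{envs}}^{n_e} \sum_{i=0}^{n_s - 1} \log\left(\frac{1}{K} \sum_{k=1}^{K} w_{t+i}^k\right) \tag{19}$$
in each gradient update every $n_s$ steps. This leads to the overall loss:
$$\mathcal{L}_t^{\mathrm{DVRL}}(\rho, \eta, \theta, \phi) = \mathcal{L}_t^{A}(\rho, \theta, \phi) + \lambda^{H} \mathcal{L}_t^{H}(\rho, \theta, \phi) + \lambda^{V} \mathcal{L}_t^{V}(\eta, \theta, \phi) + \lambda^{E} \mathcal{L}_t^{\mathrm{ELBO}}(\theta, \phi). \tag{20}$$
Compared to (9), (10) and (11), the losses now also depend on the encoder parameters $\phi$ and, for DVRL, model parameters $\theta$, since the policy and value function now condition on the latent states $\hat{b}_t$ instead of $s_t$. By introducing the $n$-step approximation $\mathcal{L}_t^{\mathrm{ELBO}}$, we can learn $\theta$ and $\phi$ to jointly optimise the ELBO and the RL loss $\mathcal{L}_t^{A} + \lambda^{H} \mathcal{L}_t^{H} + \lambda^{V} \mathcal{L}_t^{V}$.
If we assume that observations and actions are drawn from the stationary state distribution induced by the policy $\pi_\rho$, then $-\mathcal{L}_t^{\mathrm{ELBO}}$ is a stochastic approximation to the action-conditioned ELBO:
$$\frac{1}{T}\, \mathbb{E}_{p(\tau)}\big[\mathrm{ELBO}_{\mathrm{SMC}}(o_{\le T} \mid a_{<T})\big] = \frac{1}{T}\, \mathbb{E}_{p(\tau)}\left[\mathbb{E}\left[\sum_{t=1}^{T} \log\left(\frac{1}{K} \sum_{k=1}^{K} w_t^k\right) \,\middle|\, a_{<T}\right]\right], \tag{21}$$
which is a conditional extension of (6) similar to the extension of VAEs by Sohn et al. (2015). The expectation over $p(\tau)$ is approximated by sampling trajectories and the sum $\sum_{t=1}^{T}$ over the entire trajectory is approximated by a sum $\sum_{i=0}^{n_s - 1}$ over only a part of it.
The importance of the resampling step (15) in allowing this approximation becomes clear if we compare (21) with the ELBO for the importance weighted autoencoder (IWAE) that does not include resampling (Doucet & Johansen, 2009; Burda et al., 2016):
$$\mathrm{ELBO}_{\mathrm{IWAE}}(o_{\le T} \mid a_{<T}) = \mathbb{E}\left[\log\left(\frac{1}{K} \sum_{k=1}^{K} \prod_{t=1}^{T} w_t^k\right)\right]. \tag{22}$$
Because this loss is not additive over time, we cannot approximate it with shorter parts of the trajectory.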
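The difference in structure between the two objectives can be checked numerically. The sketch below treats a small table of particle weights as given numbers, which is an assumption made only to expose the algebra (in the actual algorithm the weights themselves depend on whether resampling is used); the function names are hypothetical.

```python
import math


def smc_objective(weights):
    """Sum over time of the log mean weight; weights[t][k] is the weight of particle k at step t."""
    return sum(math.log(sum(w_t) / len(w_t)) for w_t in weights)


def iwae_objective(weights):
    """Log of the mean over particles of the product of weights over time."""
    num_particles = len(weights[0])
    products = [math.prod(w_t[k] for w_t in weights) for k in range(num_particles)]
    return math.log(sum(products) / num_particles)


weights = [[0.2, 0.4], [0.5, 0.1], [0.3, 0.6]]  # three time steps, two particles

# The resampling-based objective splits exactly into segment contributions ...
smc_full = smc_objective(weights)
smc_split = smc_objective(weights[:2]) + smc_objective(weights[2:])

# ... whereas the product inside the logarithm prevents this for the IWAE form.
iwae_full = iwae_objective(weights)
iwae_split = iwae_objective(weights[:2]) + iwae_objective(weights[2:])
```

Here `smc_full` and `smc_split` agree up to floating-point error, while `iwae_full` and `iwae_split` differ, which is why only the former can be estimated from short segments of a trajectory.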
4 Related Work
Most existing POMDP literature focusses on planning algorithms, where the transition and observation functions, as well as a representation of the latent state space, are known (Barto et al., 1995; McAllester & Singh, 1999; Pineau et al., 2003; Ross et al., 2008; Oliehoek et al., 2008; Roijers et al., 2015). In most realistic domains however, these are not known a priori.
There are several approaches that utilise RNNs in POMDPs (Bakker, 2002; Wierstra et al., 2007; Zhang et al., 2015; Heess et al., 2015), including multi-agent settings (Foerster et al., 2016), learning text-based fantasy games (Narasimhan et al., 2015) or, most recently, applied to Atari (Hausknecht & Stone, 2015; Zhu et al., 2017). As discussed in Section 3, our algorithm extends those approaches by enabling the policy to explicitly reason about a model and the belief state.
Another more specialised approach called QMDP-Net (Karkus et al., 2017) learns a value iteration network (VIN) (Tamar et al., 2016) end-to-end and uses it as a transition model for planning. However, a VIN makes strong assumptions about the transition function and in QMDP-Net the belief update must be performed analytically.
The idea to learn a particle filter based policy that is trained using policy gradients was previously proposed by Coquelin et al. (2009). However, they assume a known model and rely on finite differences for gradient estimation.
Instead of optimising an ELBO to learn a maximum-likelihood approximation for the latent representation and corresponding transition and observation model, previous work also tried to learn those dynamics using spectral methods (Azizzadenesheli et al., 2016), a Bayesian approach (Ross et al., 2011; Katt et al., 2017), or nonparametrically (Doshi-Velez et al., 2015). However, these approaches do not scale to large or continuous state and observation spaces. For continuous states, actions, and observations with Gaussian noise, a Gaussian process model can be learned (Deisenroth & Peters, 2012). An alternative to learning an (approximate) transition and observation model is to learn a model over trajectories (Willems et al., 1995). However, this is again only possible for small, discrete observation spaces.
Because learning in POMDPs is complex, previous work has already found benefits in using auxiliary losses. Unlike the losses proposed by Lample & Chaplot (2017), we do not require additional information from the environment. The UNREAL agent (Jaderberg et al., 2016) is, similarly to our work, motivated by the idea of improving the latent representation by utilising all the information already obtained from the environment. While their work focuses on finding unsupervised auxiliary losses that provide good training signals, our goal is to use the auxiliary loss to better align the network computations with the task at hand by incorporating prior knowledge as an inductive bias.
There is some evidence from recent experiments on the dopamine system in mice (Babayan et al., 2018) showing that their response to ambiguous information is consistent with a theory operating on belief states.
5 Experiments
We evaluate DVRL on Mountain Hike and on flickering Atari. We show that DVRL deals better with noisy or partially occluded observations and that this scales to high dimensional and continuous observation spaces like images and complex tasks. We also perform a series of ablation studies, showing the importance of using many particles, including the ELBO training objective in the loss function, and jointly optimising the ELBO and RL losses.
More details about the environments and model architectures can be found in Appendix A together with additional results and visualisations. All plots and reported results are smoothed over time and parallel executed environments. We average over five random seeds, with shaded areas indicating the standard deviation.
5.1 Mountain Hike
In this task, the agent has to navigate along a mountain ridge, but only receives noisy measurements of its current location. Specifically, we have where and are true and observed coordinates respectively and is the desired step. Transitions are given by with and is the vector with length capped to . Observations are noisy with with and . The reward at each timestep is where is shown in Figure 3. The starting position is sampled from and each episode ends after 75 steps.
DVRL used 30 particles and we set for both RNN and DVRL. The latent state $h$ for the RNN-encoder architecture was of dimension 256 and 128 for both $z$ and $h$ for DVRL. Lastly, and were used, together with RMSProp with a learning rate of for both approaches.
The main difficulty in Mountain Hike is to correctly estimate the current position. Consequently, the achieved return reflects the capability of the network to do so. DVRL outperforms RNN-based policies, especially for higher levels of observation noise (Figure 4). In Figure 3 we compare the different trajectories for RNN and DVRL encoders for the same noise, i.e. and for all and . DVRL is better able to follow the mountain ridge, indicating that its inference-based history aggregation is superior to a largely memory/heuristics-based one.
5.2 Atari
We chose flickering Atari as evaluation benchmark, since it was previously used to evaluate the performance of ADRQN (Zhu et al., 2017) and DRQN (Hausknecht & Stone, 2015). Atari environments (Bellemare et al., 2013) provide a wide set of challenging tasks with high dimensional observation spaces. We test our algorithm on the same subset of games on which DRQN and ADRQN were evaluated.
Partial observability is introduced by flickering, i.e., by returning a blank screen instead of the actual observation with probability 0.5. Furthermore, only one frame is used as the observation. This is in line with previous work (Hausknecht & Stone, 2015). We use a frameskip of four (a frameskip of one is used for Asteroids due to known rendering issues with this environment), and for the stochastic Atari environments there is a chance of repeating the current action for a second time at each transition.
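The flickering mechanism itself is simple to state in code. The following sketch is an illustration only and is not tied to any particular Atari interface: frames are represented as nested lists, and the function name `flicker` and its parameters are hypothetical.

```python
import random


def flicker(frame, rng, blank_probability=0.5):
    """Return an all-zero frame of the same shape with the given probability."""
    if rng.random() < blank_probability:
        return [[0 for _ in row] for row in frame]
    return frame


rng = random.Random(0)
frame = [[1, 2], [3, 4]]  # stand-in for a single game frame
observed = [flicker(frame, rng) for _ in range(1000)]
blank_fraction = sum(obs == [[0, 0], [0, 0]] for obs in observed) / len(observed)
```

Because a blank frame carries no information about the game state, an agent that only sees `observed` has to carry information across time steps to act well.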
DVRL used 15 particles and we set for both agents. The dimension of was 256 for both architectures, as was the dimension of . Larger latent states decreased the performance for the RNN encoder. Lastly, and was used with a learning rate of for RNN and for DVRL, selected out of a set of 6 different rates based on the results on ChopperCommand.
Table 1 shows the results for the more challenging stochastic, flickering environments. Results for the deterministic environments, including returns reported for DRQN and ADRQN, can be found in Appendix A. DVRL significantly outperforms the RNN-based policy on five out of ten games and underperforms it, narrowly but significantly, on only one. This shows that DVRL is viable for high-dimensional observation spaces with complex environmental models.
Env  DVRL  RNN 

Pong  
Chopper  
MsPacman  
Centipede  
BeamRider  
Frostbite  
Bowling  
IceHockey  
DDunk  
Asteroids 
5.3 Ablation Studies
Using more than one particle is important to accurately approximate the belief distribution over the latent state $(z, h)$. Consequently, we expect that higher particle numbers provide better information to the policy, leading to higher returns. Figure 5a shows that this is indeed the case. This is an important result for our architecture, as it also implies that the resampling step is necessary, as detailed in Section 3.4. Without resampling, we cannot approximate the ELBO on only the $n_s$ observations of a single update.
Secondly, Figure 5b shows that the inclusion of the ELBO loss $\mathcal{L}^{\mathrm{ELBO}}$ to encourage model learning is required for good performance. Furthermore, not backpropagating the policy gradients through the encoder and only learning it based on the ELBO (“No joint optim”) also deteriorates performance.
Lastly, we investigate the influence of the backpropagation length $n_g$ on both the RNN-based and DVRL-based policies. While increasing $n_g$ universally helps, the key insight here is that a short length $n_g$ (giving an average BPTT-length of 2 timesteps) has a stronger negative impact on RNN than on DVRL. This is consistent with our notion that RNN is mainly performing memory-based reasoning, for which longer backpropagation-through-time is required: the belief update (2) in DVRL is a one-step update from $b_t$ to $b_{t+1}$, without the need to condition on past actions and observations. The proposal distribution $q$ can benefit from extended backpropagation lengths, but this is not necessary. Consequently, this result supports our notion that DVRL relies more on inference computations to update the latent state.
6 Conclusion
In this paper we proposed DVRL, a method for solving POMDPs given only a stream of observations, without knowledge of the latent state space or the transition and observation functions operating in that space. Our method leverages a new ELBO-based auxiliary loss and incorporates an inductive bias into the structure of the policy network, taking advantage of our prior knowledge that an inference step is required for an optimal solution.
We compared DVRL to an RNN-based architecture and found that we consistently outperform it on a diverse set of tasks, including a number of Atari games modified to have partial observability and stochastic transitions.
We also performed several ablation studies showing the necessity of using an ensemble of particles and of joint optimisation of the ELBO and RL objective. Furthermore, the results support our claim that the latent state in DVRL approximates a belief distribution in a learned model.
Access to a belief distribution opens up several interesting research directions. Investigating the role of better generalisation capabilities and the more powerful latent state representation on the policy performance of DVRL can give rise to further improvements. DVRL is also likely to benefit from more powerful model architectures and a disentangled latent state. Furthermore, uncertainty in the belief state and access to a learned model can be used for curiosity driven exploration in environments with sparse rewards.
Acknowledgements
We would like to thank Wendelin Boehmer and Greg Farquar for useful discussions and feedback. The NVIDIA DGX1 used for this research was donated by the NVIDIA corporation. M. Igl is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. L. Zintgraf is supported by the Microsoft Research PhD Scholarship Program. T. A. Le is supported by EPSRC DTA and Google (project code DF6700) studentships. F. Wood is supported by DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA87501420006; Intel and DARPA D3M, under Cooperative Agreement FA87501720093. S. Whiteson is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).
References
 Astrom (1965) Astrom, Karl J. Optimal control of markov decision processes with incomplete state estimation. Journal of mathematical analysis and applications, 10:174–205, 1965.
 Azizzadenesheli et al. (2016) Azizzadenesheli, Kamyar, Lazaric, Alessandro, and Anandkumar, Animashree. Reinforcement learning of pomdps using spectral methods. arXiv preprint 1602.07764, 2016.
 Babayan et al. (2018) Babayan, Benedicte M, Uchida, Naoshige, and Gershman, Samuel J. Belief state representation in the dopamine system. Nature communications, 9(1):1891, 2018.
 Bakker (2002) Bakker, Bram. Reinforcement learning with long short-term memory. In Advances in neural information processing systems, pp. 1475–1482, 2002.
 Barto et al. (1995) Barto, Andrew G, Bradtke, Steven J, and Singh, Satinder P. Learning to act using real-time dynamic programming. Artificial intelligence, 72(1-2):81–138, 1995.
 Bellemare et al. (2014) Bellemare, Marc, Veness, Joel, and Talvitie, Erik. Skip context tree switching. In International Conference on Machine Learning, pp. 1458–1466, 2014.
 Bellemare (2015) Bellemare, Marc G. Count-based frequency estimation with bounded memory. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
 Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Burda et al. (2016) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. In ICLR, 2016.
 Chung et al. (2015) Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C, and Bengio, Yoshua. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, 2015.
 Coquelin et al. (2009) Coquelin, Pierre-Arnaud, Deguest, Romain, and Munos, Rémi. Particle filter-based policy gradient in pomdps. In NIPS, 2009.
 Deisenroth & Peters (2012) Deisenroth, Marc Peter and Peters, Jan. Solving nonlinear continuous state-action-observation pomdps for mechanical systems with gaussian noise. 2012.
 Dhariwal et al. (2017) Dhariwal, Prafulla, Hesse, Christopher, Klimov, Oleg, Nichol, Alex, Plappert, Matthias, Radford, Alec, Schulman, John, Sidor, Szymon, and Wu, Yuhuai. Openai baselines, 2017.
 Doshi-Velez et al. (2015) Doshi-Velez, Finale, Pfau, David, Wood, Frank, and Roy, Nicholas. Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 37(2):394–407, 2015.
 Doucet & Johansen (2009) Doucet, Arnaud and Johansen, Adam M. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
 Foerster et al. (2016) Foerster, Jakob N, Assael, Yannis M, de Freitas, Nando, and Whiteson, Shimon. Learning to communicate to solve riddles with deep distributed recurrent q-networks. arXiv preprint 1602.02672, 2016.
 Hausknecht & Stone (2015) Hausknecht, Matthew and Stone, Peter. Deep recurrent q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.
 Heess et al. (2015) Heess, Nicolas, Hunt, Jonathan J, Lillicrap, Timothy P, and Silver, David. Memory-based control with recurrent neural networks. arXiv preprint 1512.04455, 2015.
 Jaderberg et al. (2016) Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint 1611.05397, 2016.
 Kaelbling et al. (1998) Kaelbling, Leslie Pack, Littman, Michael L, and Cassandra, Anthony R. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1), 1998.
 Karkus et al. (2017) Karkus, Peter, Hsu, David, and Lee, Wee Sun. Qmdp-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pp. 4697–4707, 2017.
 Katt et al. (2017) Katt, Sammie, Oliehoek, Frans A, and Amato, Christopher. Learning in pomdps with monte carlo tree search. In International Conference on Machine Learning, 2017.
 Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In ICLR, 2014.
 Lample & Chaplot (2017) Lample, Guillaume and Chaplot, Devendra Singh. Playing fps games with deep reinforcement learning. In AAAI, pp. 2140–2146, 2017.
 Le et al. (2018) Le, Tuan Anh, Igl, Maximilian, Jin, Tom, Rainforth, Tom, and Wood, Frank. Auto-encoding sequential Monte Carlo. In ICLR, 2018.
 Maddison et al. (2017) Maddison, Chris J, Lawson, John, Tucker, George, Heess, Nicolas, Norouzi, Mohammad, Mnih, Andriy, Doucet, Arnaud, and Teh, Yee. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017.
 McAllester & Singh (1999) McAllester, David A and Singh, Satinder. Approximate planning for factored pomdps using belief state simplification. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 1999.
 McCallum & Ballard (1996) McCallum, Andrew Kachites and Ballard, Dana. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester. Dept. of Computer Science, 1996.
 McCallum (1993) McCallum, R Andrew. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, pp. 190–196, 1993.
 Messias & Whiteson (2017) Messias, João V and Whiteson, Shimon. Dynamic-depth context tree weighting. In Advances in Neural Information Processing Systems, pp. 3330–3339, 2017.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. (2016) Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
 Naesseth et al. (2018) Naesseth, Christian A, Linderman, Scott W, Ranganath, Rajesh, and Blei, David M. Variational sequential monte carlo. In AISTATS (To Appear), 2018.
 Narasimhan et al. (2015) Narasimhan, Karthik, Kulkarni, Tejas, and Barzilay, Regina. Language understanding for text-based games using deep reinforcement learning. arXiv preprint 1506.08941, 2015.
 Oliehoek et al. (2008) Oliehoek, Frans A, Spaan, Matthijs TJ, Whiteson, Shimon, and Vlassis, Nikos. Exploiting locality of interaction in factored dec-pomdps. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems, Volume 1, 2008.
 Pineau et al. (2003) Pineau, Joelle, Gordon, Geoff, Thrun, Sebastian, et al. Point-based value iteration: An anytime algorithm for pomdps. In IJCAI, volume 3, 2003.
 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 Roijers et al. (2015) Roijers, Diederik Marijn, Whiteson, Shimon, and Oliehoek, Frans A. Point-based planning for multi-objective pomdps. In IJCAI, pp. 1666–1672, 2015.
 Ross et al. (2008) Ross, Stéphane, Pineau, Joelle, Paquet, Sébastien, and Chaib-draa, Brahim. Online planning algorithms for pomdps. Journal of Artificial Intelligence Research, 32:663–704, 2008.
 Ross et al. (2011) Ross, Stéphane, Pineau, Joelle, Chaib-draa, Brahim, and Kreitmann, Pierre. A bayesian approach for learning and planning in partially observable markov decision processes. Journal of Machine Learning Research, 2011.
 Shani et al. (2005) Shani, Guy, Brafman, Ronen I, and Shimony, Solomon E. Model-based online learning of pomdps. In European Conference on Machine Learning, pp. 353–364. Springer, 2005.
 Silver & Veness (2010) Silver, David and Veness, Joel. Monte-carlo planning in large pomdps. In Advances in neural information processing systems, pp. 2164–2172, 2010.
 Sohn et al. (2015) Sohn, Kihyuk, Lee, Honglak, and Yan, Xinchen. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
 Tamar et al. (2016) Tamar, Aviv, Wu, Yi, Thomas, Garrett, Levine, Sergey, and Abbeel, Pieter. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162, 2016.
 Thrun (2000) Thrun, Sebastian. Monte carlo pomdps. In Advances in neural information processing systems, pp. 1064–1070, 2000.
 Wierstra et al. (2007) Wierstra, Daan, Foerster, Alexander, Peters, Jan, and Schmidhuber, Juergen. Solving deep memory pomdps with recurrent policy gradients. In International Conference on Artificial Neural Networks, pp. 697–706. Springer, 2007.
 Willems et al. (1995) Willems, Frans MJ, Shtarkov, Yuri M, and Tjalkens, Tjalling J. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.
 Wu et al. (2017) Wu, Yuhuai, Mansimov, Elman, Grosse, Roger B, Liao, Shun, and Ba, Jimmy. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems, pp. 5285–5294, 2017.
 Zhang et al. (2015) Zhang, Marvin, Levine, Sergey, McCarthy, Zoe, Finn, Chelsea, and Abbeel, Pieter. Policy learning with continuous memory states for partially observed robotic control. CoRR, 2015.
 Zhu et al. (2017) Zhu, Pengfei, Li, Xin, and Poupart, Pascal. On improving deep reinforcement learning for POMDPs. arXiv preprint 1704.07978, 2017.
Appendix A Experiments
A.1 Implementation Details
In our implementation, the transition and proposal distributions are multivariate normal distributions whose mean and diagonal variance are determined by neural networks. For image data, the decoder is a multivariate independent Bernoulli distribution whose parameters are again determined by a neural network. For real-valued vectors we use a normal distribution. When several inputs are passed to a neural network, they are concatenated into one vector. ReLUs are used as nonlinearities between all layers. Hidden layers are, if not otherwise stated, all of the same dimension as . Batch normalization was used between layers for experiments on Atari but not on Mountain Hike, where it significantly hurt performance. All RNNs are GRUs.
Separate encoding functions are used to encode single observations, actions and latent states before they are passed into other networks.
To encode visual observations, we use the same convolutional network as proposed by Mnih et al. (2015), but with only 32 instead of 64 channels in the final layer. The transposed convolutional network of the decoder has the reversed structure. The decoder is preceded by an additional fully connected layer which outputs the required dimension (1568 for Atari’s observations).
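The stated decoder input dimension can be checked with a little arithmetic. The sketch below assumes 84×84 input frames and the three unpadded convolutions of Mnih et al. (2015) (kernel sizes 8, 4 and 3 with strides 4, 2 and 1), and it assumes 32 channels in the final layer, the value consistent with the stated dimension of 1568. The helper name `conv_out` is purely illustrative.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

# Assumed input resolution and (kernel, stride) pairs of the encoder.
size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)

final_channels = 32  # assumption: reduced from 64 in the original network
flat_dim = size * size * final_channels
print(size, flat_dim)
```

Under these assumptions the final feature map is 7×7, so flattening it gives 7 · 7 · 32 = 1568 values, which is the size the additional fully connected layer in front of the decoder has to produce.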
For real-valued vector observations we used two fully connected layers of size 64 as encoder. As decoder we used the same structure as for the transition and proposal distributions; all three are normal distributions: one joint fully connected layer followed by two separate fully connected heads, one for the mean and one for the variance. The output of the variance layer is passed through a softplus layer to force positivity.
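The head structure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layer sizes, the uniform initialisation and all function names are hypothetical and are not taken from the paper's implementation.

```python
import math
import random

def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def linear(weights, bias, inputs):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    return [max(v, 0.0) for v in values]

def init_layer(rng, n_in, n_out):
    """Hypothetical small random init (not the paper's orthogonal init)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def gaussian_head(params, inputs):
    """Joint layer, then separate heads for the mean and the variance."""
    hidden = relu(linear(*params["joint"], inputs))
    mean = linear(*params["mean"], hidden)
    # Softplus maps the raw head output to a positive variance.
    variance = [softplus(v) for v in linear(*params["var"], hidden)]
    return mean, variance

rng = random.Random(0)
params = {
    "joint": init_layer(rng, 4, 8),   # assumed sizes, for illustration only
    "mean": init_layer(rng, 8, 2),
    "var": init_layer(rng, 8, 2),
}
mean, variance = gaussian_head(params, [0.5, -1.0, 2.0, 0.0])
print(len(mean), all(v > 0 for v in variance))
```

Softplus is used here rather than, say, an exponential because it grows only linearly for large inputs, which tends to keep the predicted variance better behaved early in training.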
Actions are encoded using one fully connected layer of size 128 for Atari and size 64 for Mountain Hike. Lastly, the latent state is encoded before being passed into networks by one fully connected layer of the same size as .
The policy is one fully connected layer whose size is determined by the action space, i.e. up to 18 outputs with softmax for Atari and only 2 outputs for the learned mean for Mountain Hike, together with a learned variance. The value function is one fully connected layer of size 1.
A2C used parallel environments and n-step learning for a total batch size of 80. Hyperparameters were tuned on Chopper Command. The learning rate of both DVRL and ADRA2C was independently tuned on the set of values with being chosen for DVRL on Atari and for DVRL on Mountain Hike and RNN on both environments. Without further tuning, we set and as is commonly used. As optimizer we use RMSProp with . We clip gradients at a value of . The discount factor of the control problem is set to and lastly, we use ’orthogonal’ initialization for the network weights.
The source code will be released in the future.
A.2 Additional Experiments and Visualisations
Table 2 shows the results on deterministic and flickering Atari, averaged over 5 random seeds. The values for DRQN and ADRQN are taken from the respective papers. Note that DRQN and ADRQN rely on Q-learning instead of A2C, so the results are not directly comparable.
Figures 6 and 7 show individual learning curves for all 10 Atari games, for the deterministic and the stochastic versions of the games respectively.
Table 2: Columns are Env, DVRL, RNN, DRQN and ADRQN; rows are Pong, Chopper, MsPacman, Centipede, BeamRider, Frostbite, Bowling, IceHockey, DDunk and Asteroids.
A.3 Computational Speed
The approximate training speed in frames per second (FPS) on one GPU of a DGX-1 for Atari is:

RNN: 124k FPS

DVRL (1 Particle): 64k FPS

DVRL (10 Particles): 48k FPS

DVRL (30 Particles): 32k FPS
A.4 Model Predictions
In Figure 8 we show reconstructed and predicted images from the DVRL model for several Atari games. The current observation is in the leftmost column. The second column (’dt0’) shows the reconstruction after encoding and decoding the current observation. For the further columns, we make use of the learned generative model to predict future observations. For simplicity we repeat the last action. Columns 2 to 7 show predicted observations for unrolled timesteps. The model was trained as explained in Section 5.2. The reconstructed and predicted images are a weighted average over all 16 particles.
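The particle averaging mentioned above can be illustrated as follows. This is a minimal sketch, assuming that each particle contributes a flat list of pixel values and that the particle weights are normalised to sum to one before averaging; the function name and the toy data are hypothetical.

```python
def weighted_average(images, weights):
    """Average flat images (lists of pixel values) with normalised weights."""
    total = sum(weights)
    normalised = [w / total for w in weights]
    n_pixels = len(images[0])
    return [sum(w * img[i] for w, img in zip(normalised, images))
            for i in range(n_pixels)]

# Three hypothetical particles, each predicting a 4-pixel image.
particle_images = [[0.0, 1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0]]
particle_weights = [2.0, 1.0, 1.0]
averaged = weighted_average(particle_images, particle_weights)
print(averaged)
```

Pixels on which all particles agree stay sharp, while pixels on which they disagree take intermediate values, so blur in an averaged prediction can be read as uncertainty across particles.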
Note that the model is able to correctly predict features of future observations, for example the movement of the cars in ChopperCommand, the (approximate) ball position in Pong or the missing pins in Bowling. Furthermore, it is able to do so even if the current observation is blank, as in Bowling. The model has also correctly learned to randomly predict blank observations.
It can remember features of the current state fairly well, like the positions of barriers (white dots) in Centipede. On the other hand, it clearly struggles with the amount of information present in MsPacman, such as the positions of all previously eaten fruits or the location of the ghosts.
Appendix B Algorithms
Algorithm 1 details the recurrent (belief) state computation (i.e. history encoder) for DVRL. Algorithm 2 details the recurrent state computation for RNN. Algorithm 3 describes the overall training algorithm that uses either one or the other to aggregate the history. Despite looking complicated, it is just a very detailed implementation of n-step A2C with the following additional changes: inclusion of and inclusion of the option to not delete the computation graph, to allow longer backpropagation in n-step A2C.
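For readers unfamiliar with n-step A2C, the return targets it trains on can be sketched as below. This is a generic illustration rather than the paper's Algorithm 3: it handles a single environment, the function name is hypothetical, and the discount factor and rewards in the example are arbitrary values chosen to make the arithmetic easy to follow.

```python
def n_step_returns(rewards, dones, bootstrap_value, gamma):
    """Discounted returns for one environment over an n-step rollout.

    rewards[t] and dones[t] describe step t; bootstrap_value is the
    critic's estimate for the state reached after the final step.
    """
    returns = []
    running = bootstrap_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A finished episode contributes no value from later steps.
        running = reward + gamma * running * (0.0 if done else 1.0)
        returns.append(running)
    returns.reverse()
    return returns

rewards = [1.0, 0.0, 2.0]
dones = [False, False, False]
returns = n_step_returns(rewards, dones, bootstrap_value=4.0, gamma=0.5)
print(returns)
```

The advantage used in the policy gradient is then the difference between these returns and the critic's value predictions at the corresponding steps.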
Results for additionally using the reconstruction loss with the RNN-based encoder are not shown in the paper, as they were reliably worse than those of the RNN without reconstruction loss.