Deep Variational Reinforcement Learning for POMDPs

06/06/2018, by Maximilian Igl et al.

Many real-world sequential decision making problems are partially observable by nature, and the environment model is typically unknown. Consequently, there is great need for reinforcement learning methods that can tackle such problems given only a stream of incomplete and noisy observations. In this paper, we propose deep variational reinforcement learning (DVRL), which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information. We develop an n-step approximation to the evidence lower bound (ELBO), allowing the model to be trained jointly with the policy. This ensures that the latent state representation is suitable for the control task. In experiments on Mountain Hike and flickering Atari we show that our method outperforms previous approaches relying on recurrent neural networks to encode the past.




1 Introduction

Most deep RL methods assume that the state of the environment is fully observable at every time step. However, this assumption often does not hold in reality, as occlusions and noisy sensors may limit the agent's perceptual abilities. Such problems can be formalised as partially observable Markov decision processes (POMDPs) (Astrom, 1965; Kaelbling et al., 1998). Because we usually do not have access to the true generative model of our environment, there is a need for reinforcement learning methods that can tackle POMDPs when only a stream of observations is given, without any prior knowledge of the latent state space or the transition and observation functions.

POMDPs are notoriously hard to solve: since the current observation in general does not carry all the information relevant for choosing an action, information must be aggregated over time and, in general, the entire history must be taken into account.

(a) RNN-based approach. The RNN acts as an encoder for the action-observation history, on which actor and critic are conditioned. The networks are updated end-to-end with an RL loss.
(b) DVRL. The agent learns a generative model which is used to update a belief distribution. Actor and critic now condition on the belief. The generative model is learned to optimise both the ELBO and the RL loss.
Figure 1: Comparison of RNN and DVRL encoders.

This history can be encoded either by remembering features of the past (McCallum, 1993) or by performing inference to determine the distribution over possible latent states (Kaelbling et al., 1998). However, the computation of this belief state requires knowledge of the model.

Most previous work in deep learning relies on training an RNN to summarise the past, for example DRQN (Hausknecht & Stone, 2015) and ADRQN (Zhu et al., 2017). Because these approaches are completely model-free, they place a heavy burden on the RNN. Since performing inference implicitly requires a known or learned model, they are likely to summarise the history either by only remembering features of the past or by computing simple heuristics instead of actual belief states. This is often suboptimal in complex tasks. Generalisation is also often easier over beliefs than over trajectories, since distinct histories can lead to similar or identical beliefs.

The premise of this work is that deep policy learning for POMDPs can be improved by taking less of a black-box approach than DRQN and ADRQN. While we do not want to assume prior knowledge of the transition and observation functions or the latent state representation, we want to allow the agent to learn models of them and to infer the belief state using this learned model.

To this end, we propose DVRL, which implements this approach by providing a helpful inductive bias to the agent. In particular, we develop an algorithm that can learn an internal generative model and use it to perform approximate inference to update the belief state. Crucially, the generative model is not only learned based on an ELBO objective, but also by how well it enables maximisation of the expected return. This ensures that, unlike in an unsupervised application of VAEs, the latent state representation and the inference performed on it are suitable for the ultimate control task. Specifically, we develop an approximation to the ELBO based on AESMC (Le et al., 2018), allowing joint optimisation with the n-step policy gradient update. Uncertainty in the belief state is captured by a particle ensemble. A high-level overview of our approach in comparison to previous RNN-based methods is shown in Figure 1.

We evaluate our approach on Mountain Hike and several flickering Atari games. On Mountain Hike, a low dimensional, continuous environment, we show that DVRL is better than an RNN-based approach at inferring from past observations the information required for optimal action selection in a simple setting. Our results on flickering Atari show that this advantage extends to complex environments with high dimensional observation spaces. Here, partial observability is introduced by (1) using only a single frame as input at each time step and (2) returning a blank screen instead of the true frame with probability 0.5.

2 Background

In this section, we formalise POMDPs and provide background on recent advances in variational autoencoders (VAEs) that we use. Lastly, we describe the policy gradient loss based on n-step learning and A2C.

2.1 Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) is a tuple (S, A, O, F, U, R, b_0), where S is the state space, A the action space, and O the observation space. We denote by s_t ∈ S the latent state at time t, and by b_0 the distribution over initial states, the initial belief state. When an action a_t ∈ A is executed, the state changes according to the transition distribution, s_{t+1} ~ F(s_{t+1} | s_t, a_t). Subsequently, the agent receives a noisy or partially occluded observation o_{t+1} ∈ O according to the distribution o_{t+1} ~ U(o_{t+1} | s_{t+1}, a_t), and a reward r_{t+1} according to the distribution r_{t+1} ~ R(r_{t+1} | s_{t+1}, a_t).

An agent acts according to its policy π(a_t | o_{≤t}, a_{<t}), which returns the probability of taking action a_t at time t, where o_{≤t} = (o_1, …, o_t) and a_{<t} = (a_0, …, a_{t−1}) are the observation and action histories, respectively. The agent's goal is to learn a policy that maximises the expected future return

J = 𝔼_π [ Σ_{t=1}^{T} γ^{t−1} r_t ]   (1)

over trajectories induced by its policy,¹ where γ ∈ [0, 1) is the discount factor. We follow the convention of setting a_0 to no-op (Zhu et al., 2017).

¹The trajectory length T is stochastic and depends on the time at which the agent–environment interaction ends.

In general, a POMDP agent must condition its actions on the entire history (o_{≤t}, a_{<t}). The exponential growth in t of the history can be addressed, e.g., with suffix trees (McCallum & Ballard, 1996; Shani et al., 2005; Bellemare et al., 2014; Bellemare, 2015; Messias & Whiteson, 2017). However, those approaches suffer from large memory requirements and are only suitable for small discrete observation spaces.

Alternatively, it is possible to infer the filtering distribution b_t(s_t) = p(s_t | o_{≤t}, a_{<t}), called the belief state. This is a sufficient statistic of the history that can be used as input to an optimal policy π(a_t | b_t). The belief space does not grow exponentially, but the belief update step requires knowledge of the model:

b_{t+1}(s_{t+1}) ∝ U(o_{t+1} | s_{t+1}, a_t) ∫ F(s_{t+1} | s_t, a_t) b_t(s_t) ds_t.   (2)
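For a finite state space the integral becomes a sum and the update can be computed exactly. A minimal sketch, with a hypothetical two-state, one-action problem whose matrices are chosen purely for illustration:

```python
import numpy as np

def belief_update(b, a, o, F, U):
    """Exact belief update for a discrete POMDP.

    b: current belief over states, shape (S,)
    F: transition probabilities, F[a, s, s'] = p(s' | s, a)
    U: observation probabilities, U[s', o] = p(o | s')
    Returns the normalised posterior belief over next states.
    """
    predicted = b @ F[a]           # prediction: sum_s F(s'|s,a) b(s)
    unnorm = U[:, o] * predicted   # correction: weight by observation likelihood
    return unnorm / unnorm.sum()

# Hypothetical two-state example (matrices are illustrative only)
F = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # one action
U = np.array([[0.8, 0.2], [0.3, 0.7]])     # U[s', o] = p(o | s')
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, o=1, F=F, U=U)
```

Observing o = 1, which is more likely under state 1, shifts the belief mass towards that state.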
2.2 Variational Autoencoder

We define a family of priors p_θ(z) over a latent variable z and decoders p_θ(x | z) over observations x, both parameterised by θ. A variational autoencoder (VAE) learns θ by maximising the sum of log marginal likelihood terms Σ_i log p_θ(x^{(i)}) for a dataset (x^{(i)})_{i=1}^N (Rezende et al., 2014; Kingma & Welling, 2014). Since evaluating the log marginal likelihood is intractable, the VAE instead maximises a sum of ELBO terms, where each individual term is a lower bound on the log marginal likelihood,

ELBO(θ, φ; x) = 𝔼_{q_φ(z | x)}[ log p_θ(x | z) ] − KL( q_φ(z | x) ∥ p_θ(z) ) ≤ log p_θ(x),   (3)

for a family of encoders q_φ(z | x) parameterised by φ. This objective also forces q_φ(z | x) to approximate the posterior p_θ(z | x) under the learned model. Gradients of (3) are estimated by Monte Carlo sampling with the reparameterisation trick (Kingma & Welling, 2014; Rezende et al., 2014).
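As a sanity check on this bound, it can be estimated by Monte Carlo with the reparameterisation trick. A toy sketch for a conjugate Gaussian model of our own choosing (not the paper's networks): p(z) = N(0, 1), p(x | z) = N(z, 1), encoder q(z | x) = N(μ_q, σ_q²). For this model the exact posterior is N(x/2, 1/2), and plugging it in makes the integrand constant, so the estimate equals log p(x) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu_q, log_sigma_q, n_samples=1000):
    """Monte Carlo ELBO estimate for the toy model above, using the
    reparameterisation trick z = mu_q + sigma_q * eps, eps ~ N(0, 1)."""
    sigma_q = np.exp(log_sigma_q)
    eps = rng.standard_normal(n_samples)
    z = mu_q + sigma_q * eps
    log_p_xz = -0.5 * ((x - z) ** 2 + np.log(2 * np.pi))        # log N(x; z, 1)
    log_p_z = -0.5 * (z ** 2 + np.log(2 * np.pi))               # log N(z; 0, 1)
    log_q_z = (-0.5 * (((z - mu_q) / sigma_q) ** 2 + np.log(2 * np.pi))
               - log_sigma_q)                                   # log q(z | x)
    return np.mean(log_p_xz + log_p_z - log_q_z)
```

At the exact posterior parameters (μ_q = x/2, σ_q² = 1/2), the estimate recovers log p(x) = log N(x; 0, 2) up to floating-point error, regardless of the samples drawn.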

2.3 VAE for Time Series

For sequential data, we assume that a series of latent states s_{1:T} gives rise to a series of observations o_{1:T}. We consider a family of generative models parameterised by θ that consists of the initial distribution p_θ(s_1), transition distribution p_θ(s_t | s_{t−1}) and observation distribution p_θ(o_t | s_t). Given a family of encoder distributions q_φ(s_t | s_{t−1}, o_t), we can also estimate the gradient of the ELBO term in the same manner as in (3), noting that

log p_θ(o_{1:T}) ≥ 𝔼_{q_φ}[ log ( p_θ(s_{1:T}, o_{1:T}) / q_φ(s_{1:T} | o_{1:T}) ) ],   (5)

where we slightly abuse notation for q_φ(s_{1:T} | o_{1:T}) by ignoring the fact that each s_{t−1} is sampled from the model. Le et al. (2018), Maddison et al. (2017) and Naesseth et al. (2018) introduce a new ELBO objective based on sequential Monte Carlo (SMC) (Doucet & Johansen, 2009) that allows faster learning in time series:


ELBO_SMC(θ, φ; o_{1:T}) = 𝔼 [ Σ_{t=1}^{T} log ( (1/K) Σ_{k=1}^{K} w_t^k ) ],   (6)

where K is the number of particles and w_t^k is the weight of particle k at time t. Each particle is a tuple containing a weight w_t^k and a value s_t^k, which is obtained as follows. Let s_1^k be samples from q_φ(s_1 | o_1) for k = 1, …, K. For t > 1, the weights are obtained by resampling the particle set proportionally to the previous weights and computing

w_t^k = ( p_θ(s_t^k | s_{t−1}^{u_{t−1}^k}) p_θ(o_t | s_t^k) ) / q_φ(s_t^k | s_{t−1}^{u_{t−1}^k}, o_t),   (7)

where s_t^k corresponds to a value sampled from q_φ(s_t | s_{t−1}^{u_{t−1}^k}, o_t), s_{t−1}^{u_{t−1}^k} corresponds to the resampled particle with the ancestor index u_{t−1}^k, and w_1^k = p_θ(s_1^k) p_θ(o_1 | s_1^k) / q_φ(s_1^k | o_1) for t = 1.
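The resample–propose–reweight loop above can be sketched as a bootstrap particle filter: the special case where the proposal equals the transition distribution, so the weight reduces to the observation likelihood. The 1-D random-walk model and all noise scales below are illustrative choices of ours, standing in for learned networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_pf_step(particles, weights, obs, trans_std=0.5, obs_std=0.5):
    """One SMC step for a toy model s_t = s_{t-1} + eps, o_t = s_t + nu.
    Resample ancestors, propose from the transition, reweight by p(o|s)."""
    K = len(particles)
    ancestors = rng.choice(K, size=K, p=weights)              # resampling step
    proposed = particles[ancestors] + trans_std * rng.standard_normal(K)
    loglik = -0.5 * ((obs - proposed) / obs_std) ** 2         # unnormalised log p(o|s)
    w = np.exp(loglik - loglik.max())                         # stable exponentiation
    return proposed, w / w.sum()

K = 100
particles = rng.standard_normal(K) * 2.0      # broad initial belief
weights = np.full(K, 1.0 / K)
for obs in [0.8, 1.0, 0.9]:                   # observations near 0.9
    particles, weights = bootstrap_pf_step(particles, weights, obs)
estimate = np.sum(weights * particles)        # posterior mean estimate
```

After a few observations near 0.9, the weighted particle mean concentrates near the true location.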

2.4 A2C

One way to learn the parameters ρ of an agent's policy is to use n-step learning with A2C (Dhariwal et al., 2017; Wu et al., 2017), the synchronous simplification of A3C (Mnih et al., 2016). An actor-critic approach can cope with continuous actions and avoids the need to draw state-action sequences from a replay buffer. The method proposed in this paper is, however, equally applicable to other deep RL algorithms.

For n-step learning, starting at time t₀, the current policy performs n consecutive steps in n_e parallel environments. The gradient update is based on this mini-batch of size n_e · n. The target for the value function V_η(h_t), parameterised by η, is the appropriately discounted sum of on-policy rewards up until time t₀ + n and the off-policy bootstrapped value V_η(h_{t₀+n})⁻. The minus sign denotes that no gradients are propagated through this value. Defining the advantage function as

A_t = Σ_{i=t}^{t₀+n−1} γ^{i−t} r_i + γ^{t₀+n−t} V_η(h_{t₀+n})⁻ − V_η(h_t),   (8)

the A2C loss for the policy parameters ρ at time t₀ is

L^A(ρ) = −(1/n) Σ_{t=t₀}^{t₀+n−1} A_t log π_ρ(a_t | h_t),   (9)

and the value function loss to learn η can be written as

L^V(η) = (1/n) Σ_{t=t₀}^{t₀+n−1} A_t².   (10)

Lastly, an entropy loss is added to encourage exploration:

L^H(ρ) = −(1/n) Σ_{t=t₀}^{t₀+n−1} H( π_ρ(· | h_t) ),   (11)

where H(·) is the entropy of a distribution.
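The discounted n-step value targets underlying the advantage can be computed with a single backward pass over the reward window; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def nstep_targets(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step value targets for one window of rewards.
    For each t, the target is r_t + gamma r_{t+1} + ... + gamma^{n-t} V(s_{t+n}),
    with the bootstrap value treated as a constant (no gradient through it)."""
    targets = np.empty(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):   # backward pass over the window
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

# Toy window: n = 3, gamma = 0.5, bootstrapped value 2.0 at the window's end
targets = nstep_targets([1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

The advantages are then `targets - values`, and the policy loss is the negative sum of advantage-weighted log-probabilities, as in the equations above.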

3 Deep Variational Reinforcement Learning

Fundamentally, there are two approaches to aggregating the history in the presence of partial observability: remembering features of the past or maintaining beliefs.

In most previous work, including ADRQN (Zhu et al., 2017), the current history is encoded by an RNN, which leads to the recurrent update equation for the latent state h_t:

h_t = RNN_update(h_{t−1}, a_{t−1}, o_t).   (12)

Since this approach is model-free, it is unlikely to approximate belief update steps, instead relying on memory or simple heuristics.

Inspired by the premise that a good way to solve many POMDPs involves (1) estimating the transition and observation model of the environment, (2) performing inference under this model, and (3) choosing an action based on the inferred belief state, we propose deep variational reinforcement learning (DVRL). It extends the RNN-based approach to explicitly support belief inference. Training everything end-to-end shapes the learned model to be useful for the RL task at hand, and not only for predicting observations.

We first explain our baseline architecture and training method in Section 3.1. For a fair comparison, we modify the original architecture of Zhu et al. (2017) in several ways. We find that our new baseline outperforms their reported results in the majority of cases.

In Sections 3.2 and 3.3, we explain our new latent belief state b̂_t and the recurrent update function

b̂_t = update(b̂_{t−1}, a_{t−1}, o_t),   (13)

which replaces (12). Lastly, in Section 3.4, we describe our modified loss function, which allows learning the model jointly with the policy.

3.1 Improving the Baseline

While previous work often used Q-learning to train the policy (Hausknecht & Stone, 2015; Zhu et al., 2017; Foerster et al., 2016; Narasimhan et al., 2015), we use n-step A2C. This avoids drawing entire trajectories from a replay buffer and allows continuous actions.

Furthermore, since A2C interleaves unrolled trajectories and performs a parameter update only every n steps, it is feasible to maintain an approximately correct latent state. A small bias is introduced by not recomputing the latent state after each gradient update step.

We also modify the implementation of backpropagation through time (BPTT) for n-step A2C in the case of policies with latent states. Instead of backpropagating gradients only through the computation graph of the current update involving n steps, we set the size of the computation graph independently to involve n_g steps, leading to an average BPTT length that grows with n_g.² This decouples the bias-variance tradeoff of choosing n from the bias-runtime tradeoff of choosing n_g. Our experiments show that choosing n_g > n greatly improves the agent's performance.

²This is implemented in PyTorch using the retain_graph=True flag in the backward() function.

3.2 Extending the Latent State

For DVRL, we extend the latent state to be a set of K particles, capturing the uncertainty in the belief state (Thrun, 2000; Silver & Veness, 2010). Each particle consists of the triplet (h_t^k, z_t^k, w_t^k) (Chung et al., 2015). The value h_t^k of particle k is the latent state of an RNN; z_t^k is an additional stochastic latent state that allows us to learn stochastic transition models; and w_t^k assigns each particle an importance weight.

Our latent state b̂_t is thus an approximation of the belief state in our learned model,

b̂_t ≈ p(z_t, h_t | o_{≤t}, a_{<t}),   (14)

with stochastic transition model p_θ(z_t | h_{t−1}, a_{t−1}), decoder p_θ(o_t | h_{t−1}, z_t, a_{t−1}), and deterministic transition function h_t = RNN(h_{t−1}, z_t, a_{t−1}, o_t), which is denoted using the delta-distribution δ(·) in the model and for which we use an RNN. The model is trained to jointly optimise the ELBO and the expected return.

3.3 Recurrent Latent State Update

Figure 2: Overview of DVRL. We do the following K times to compute our new belief b̂_t: sample an ancestor index u_{t−1}^k based on the previous weights (Eq. 15); pick the ancestor particle value and use it to sample a new stochastic latent state z_t^k from the encoder (Eq. 16); compute h_t^k (Eq. 17) and w_t^k (Eq. 18). All values are aggregated into the new belief b̂_t and summarised into a vector representation ĥ_t using a second RNN. Actor and critic can now condition on ĥ_t, and b̂_t is used as input for the next timestep. Red arrows denote random sampling, green arrows the aggregation of K values. Black solid arrows denote the passing of a value as argument to a function and black dashed ones the evaluation of a value under a distribution. Boxes indicate neural networks. Distributions are normal or Bernoulli distributions whose parameters are outputs of the neural networks.

To update the latent state, we proceed as follows. First, we resample particles based on their weights w_{t−1}^k by drawing ancestor indices

u_{t−1}^k ~ Categorical(w_{t−1}^1, …, w_{t−1}^K).   (15)

This improves model learning (Le et al., 2018; Maddison et al., 2017) and allows us to train the model jointly with the n-step loss (see Section 3.4).

For k = 1, …, K, new values for z_t^k are sampled from the encoder,

z_t^k ~ q_φ(z_t | h_{t−1}^{u_{t−1}^k}, a_{t−1}, o_t),   (16)

which conditions on the resampled ancestor values as well as the last action a_{t−1} and the current observation o_t. Latent variables z_t^k are sampled using the reparameterisation trick. The values z_t^k, together with a_{t−1} and o_t, are then passed to the transition function to compute

h_t^k = RNN(h_{t−1}^{u_{t−1}^k}, z_t^k, a_{t−1}, o_t).   (17)

The weights

w_t^k = ( p_θ(z_t^k | h_{t−1}^{u_{t−1}^k}, a_{t−1}) p_θ(o_t | h_{t−1}^{u_{t−1}^k}, z_t^k, a_{t−1}) ) / q_φ(z_t^k | h_{t−1}^{u_{t−1}^k}, a_{t−1}, o_t)   (18)

measure how likely each new latent state value is under the model and how well it explains the current observation.

To condition the policy on the belief b̂_t, we need to encode the set of particles into a vector representation ĥ_t. We use a second RNN that sequentially takes in each tuple (h_t^k, z_t^k, w_t^k), and its last latent state is ĥ_t.

Additional encoders are used for the observations, actions, and stochastic latent states; see Appendix A for details. Figure 2 summarises the entire update step.
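The summarisation of the particle set into a fixed-size vector can be sketched with a toy tanh RNN cell standing in for the learned second RNN; all weights here are random placeholders, and each particle's concatenated features are fed as one "timestep":

```python
import numpy as np

rng = np.random.default_rng(0)

def summarise_particles(particles, W_h, W_x, b):
    """Fold a set of particle feature vectors into one summary vector by
    feeding each particle through a simple tanh RNN cell in turn; the
    final hidden state is the summary."""
    h = np.zeros(W_h.shape[0])
    for p in particles:
        h = np.tanh(W_h @ h + W_x @ p + b)
    return h

K, p_dim, h_dim = 3, 4, 5                      # particles, features, summary size
particles = rng.standard_normal((K, p_dim))    # placeholder particle features
W_h = rng.standard_normal((h_dim, h_dim)) * 0.1
W_x = rng.standard_normal((h_dim, p_dim)) * 0.1
b = np.zeros(h_dim)
h_summary = summarise_particles(particles, W_h, W_x, b)
```

In DVRL the cell is learned jointly with the rest of the network; this sketch only illustrates the shape of the computation.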

3.4 Loss Function

To encourage learning a model, we include the term

L^{ELBO}_{t₀}(θ, φ) = − Σ_{t=t₀}^{t₀+n−1} log ( (1/K) Σ_{k=1}^{K} w_t^k )   (19)

in each gradient update every n steps. This leads to the overall loss

L(ρ, η, θ, φ) = L^A(ρ) + λ^V L^V(η) + λ^H L^H(ρ) + λ^E L^{ELBO}(θ, φ),   (20)

where the λ's weight the individual terms. Compared to (9), (10) and (11), the losses now also depend on the encoder parameters φ and, for DVRL, on the model parameters θ, since the policy and value function now condition on the latent state (b̂_t for DVRL, h_t for the RNN) instead of s_t. By introducing the n-step approximation L^{ELBO}, we can learn θ and φ to jointly optimise the ELBO and the RL loss.

If we assume that observations and actions are drawn from the stationary state distribution induced by the policy π, then L^{ELBO} is a stochastic approximation to the action-conditioned ELBO,

ELBO_SMC(θ, φ) = 𝔼_{o_{1:T}, a_{1:T}} [ Σ_{t=1}^{T} log ( (1/K) Σ_{k=1}^{K} w_t^k ) ],   (21)

which is a conditional extension of (6), similar to the extension of VAEs to conditional VAEs by Sohn et al. (2015). The expectation over trajectories is approximated by sampling trajectories, and the sum over the entire trajectory is approximated by a sum over only a part of it.

The importance of the resampling step (15) in allowing this approximation becomes clear if we compare (21) with the ELBO for the importance weighted autoencoder (IWAE), which does not include resampling (Doucet & Johansen, 2009; Burda et al., 2016):

ELBO_IWAE(θ, φ) = 𝔼 [ log ( (1/K) Σ_{k=1}^{K} Π_{t=1}^{T} w_t^k ) ].   (22)

Because this loss is not additive over time, we cannot approximate it with shorter parts of the trajectory.

4 Related Work

Most existing POMDP literature focusses on planning algorithms, where the transition and observation functions, as well as a representation of the latent state space, are known (Barto et al., 1995; McAllester & Singh, 1999; Pineau et al., 2003; Ross et al., 2008; Oliehoek et al., 2008; Roijers et al., 2015). In most realistic domains, however, these are not known a priori.

There are several approaches that utilise RNNs in POMDPs (Bakker, 2002; Wierstra et al., 2007; Zhang et al., 2015; Heess et al., 2015), including multi-agent settings (Foerster et al., 2016), learning to play text-based fantasy games (Narasimhan et al., 2015) or, most recently, applications to Atari (Hausknecht & Stone, 2015; Zhu et al., 2017). As discussed in Section 3, our algorithm extends those approaches by enabling the policy to explicitly reason about a model and the belief state.

Another, more specialised, approach called QMDP-Net (Karkus et al., 2017) learns a value iteration network (VIN) (Tamar et al., 2016) end-to-end and uses it as a transition model for planning. However, a VIN makes strong assumptions about the transition function, and in QMDP-Net the belief update must be performed analytically.

The idea to learn a particle filter based policy that is trained using policy gradients was previously proposed by Coquelin et al. (2009). However, they assume a known model and rely on finite differences for gradient estimation.

Instead of optimising an ELBO to learn a maximum-likelihood approximation for the latent representation and corresponding transition and observation model, previous work also tried to learn those dynamics using spectral methods (Azizzadenesheli et al., 2016), a Bayesian approach (Ross et al., 2011; Katt et al., 2017), or nonparametrically (Doshi-Velez et al., 2015). However, these approaches do not scale to large or continuous state and observation spaces. For continuous states, actions, and observations with Gaussian noise, a Gaussian process model can be learned (Deisenroth & Peters, 2012). An alternative to learning an (approximate) transition and observation model is to learn a model over trajectories (Willems et al., 1995). However, this is again only possible for small, discrete observation spaces.

Due to the complexity of learning in POMDPs, previous work has also found benefits in using auxiliary losses. Unlike the losses proposed by Lample & Chaplot (2017), we do not require additional information from the environment. The UNREAL agent (Jaderberg et al., 2016) is, similarly to our work, motivated by the idea of improving the latent representation by utilising all the information already obtained from the environment. While their work focuses on finding unsupervised auxiliary losses that provide good training signals, our goal is to use the auxiliary loss to better align the network computations with the task at hand by incorporating prior knowledge as an inductive bias.

There is also some evidence from recent experiments on the dopamine system in mice (Babayan et al., 2018) showing that its response to ambiguous information is consistent with a theory operating on belief states.

5 Experiments

We evaluate DVRL on Mountain Hike and on flickering Atari. We show that DVRL deals better with noisy or partially occluded observations and that this advantage extends to high dimensional and continuous observation spaces like images and to complex tasks. We also perform a series of ablation studies, showing the importance of using many particles, of including the ELBO training objective in the loss function, and of jointly optimising the ELBO and RL losses.

More details about the environments and model architectures can be found in Appendix A, together with additional results and visualisations. All plots and reported results are smoothed over time and across parallel environments. We average over five random seeds, with shaded areas indicating the standard deviation.

5.1 Mountain Hike

Figure 3: Mountain Hike is a continuous control task with observation noise. Background colour indicates rewards. Red line: trajectory for the RNN-based encoder. Black line: trajectory for the DVRL encoder. Dots: received observations. Both runs share the same noise values. DVRL achieves higher returns (see Fig. 4) by better estimating its current location and remaining on the high-reward mountain ridge.

In this task, the agent has to navigate along a mountain ridge, but only receives noisy measurements of its current location. Specifically, the state s_t and the observation o_t are the true and observed 2-D coordinates respectively, and the action a_t is the desired step. Transitions are given by s_{t+1} = s_t + ā_t + ε_s, where ā_t is the vector a_t with its length capped to a maximum step size and ε_s is Gaussian transition noise. Observations are noisy, o_t = s_t + ε_o, with Gaussian observation noise ε_o of standard deviation σ_o. The reward at each timestep is r_t = r(s_t), where r(·) is shown in Figure 3. The starting position is sampled from a Gaussian distribution, and each episode ends after 75 steps.
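The dynamics described above can be sketched as follows; the step cap and noise scales in this snippet are placeholders of ours, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def cap_length(a, max_step=0.5):
    """Cap the length of a 2-D step vector (the cap value is illustrative)."""
    norm = np.linalg.norm(a)
    return a if norm <= max_step else a * (max_step / norm)

def step(s, a, trans_std=0.25, obs_std=1.0):
    """One Mountain Hike transition as we read it: apply the capped step
    plus Gaussian transition noise, then return a Gaussian-noised
    observation of the new state."""
    s_next = s + cap_length(a) + trans_std * rng.standard_normal(2)
    o = s_next + obs_std * rng.standard_normal(2)
    return s_next, o

s, o = step(np.zeros(2), np.array([3.0, 4.0]))  # oversized action gets capped
```

With high observation noise, the agent must aggregate several observations to localise itself, which is exactly what the belief update is for.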

DVRL used 30 particles, and the same n-step length was set for both RNN and DVRL. The latent state for the RNN-encoder architecture was of dimension 256; for DVRL, both h and z were of dimension 128. Lastly, the same n_g and loss coefficients were used for both approaches, together with RMSProp with a shared learning rate.

The main difficulty in Mountain Hike is to correctly estimate the current position. Consequently, the achieved return reflects the capability of the network to do so. DVRL outperforms RNN-based policies, especially for higher levels of observation noise (Figure 4). In Figure 3 we compare trajectories for the RNN and DVRL encoders under the same noise realisations, i.e., the same ε_s and ε_o for all t. DVRL is better able to follow the mountain ridge, indicating that its inference-based history aggregation is superior to a largely memory/heuristics-based one.

The example in Figure 3 is representative but selected for clarity: the difference in return between the two shown trajectories is larger than the average difference (see Figure 4).

Figure 4: Returns achieved in Mountain Hike. Solid lines: DVRL. Dashed lines: RNN. Colour: noise levels. Inset: difference in performance between RNN and DVRL for the same level of noise. DVRL achieves slightly higher returns for the fully observable case and, crucially, its performance deteriorates more slowly with increasing observation noise, showing the advantage of DVRL's inference computations in encoding the history in the presence of observation noise.

5.2 Atari

We chose flickering Atari as an evaluation benchmark since it was previously used to evaluate the performance of ADRQN (Zhu et al., 2017) and DRQN (Hausknecht & Stone, 2015). Atari environments (Bellemare et al., 2013) provide a wide set of challenging tasks with high dimensional observation spaces. We test our algorithm on the same subset of games on which DRQN and ADRQN were evaluated.

Partial observability is introduced by flickering, i.e., by a probability of 0.5 of returning a blank screen instead of the actual observation. Furthermore, only one frame is used as the observation. This is in line with previous work (Hausknecht & Stone, 2015). We use a frameskip of four,³ and for the stochastic Atari environments there is a chance of repeating the current action for a second time at each transition.

³A frameskip of one is used for Asteroids due to known rendering issues with this environment.

DVRL used 15 particles, and the same n was set for both agents. The dimension of h was 256 for both architectures, as was the dimension of ĥ. Larger latent states decreased the performance of the RNN encoder. Lastly, the same n_g and loss coefficients were used, with the learning rate for each agent selected out of a set of six different rates based on the results on ChopperCommand.

Table 1 shows the results for the more challenging stochastic, flickering environments. Results for the deterministic environments, including the returns reported for DRQN and ADRQN, can be found in Appendix A. DVRL significantly outperforms the RNN-based policy on five out of ten games and significantly underperforms, and only narrowly so, on one. This shows that DVRL is viable for high dimensional observation spaces with complex environmental models.

Table 1: Returns on stochastic and flickering Atari environments, averaged over 5 random seeds. Bold numbers indicate statistical significance at the 5% level. Out of ten games, DVRL significantly outperforms the baseline on five games and underperforms narrowly on only one game. Comparisons against DRQN and ADRQN on deterministic Atari environments are in Appendix A.

5.3 Ablation Studies

(a) Influence of the particle number K on performance for DVRL. Using only one particle is not sufficient to encode enough information in the latent state.
(b) Performance of the full DVRL algorithm compared to dropping the ELBO term from the loss ("No ELBO") or not backpropagating the policy gradients through the encoder ("No joint optim").
(c) Influence of the maximum backpropagation length n_g on performance. Note that the RNN suffers most from very short lengths. This is consistent with our conjecture that the RNN relies mostly on memory, not inference.
Figure 5: Ablation studies on flickering ChopperCommand (Atari).

Using more than one particle is important to accurately approximate the belief distribution over the latent state. Consequently, we expect higher particle numbers to provide better information to the policy, leading to higher returns. Figure 5a shows that this is indeed the case. This is an important result for our architecture, as it also implies that the resampling step is necessary, as detailed in Section 3.4. Without resampling, we cannot approximate the ELBO using only short parts of the trajectory.

Secondly, Figure 5b shows that including the ELBO term to encourage model learning is required for good performance. Furthermore, not backpropagating the policy gradients through the encoder and learning it based only on the ELBO ("No joint optim") also deteriorates performance.

Lastly, we investigate the influence of the backpropagation length n_g on both the RNN and DVRL based policies. While increasing n_g universally helps, the key insight here is that a short length (corresponding to an average BPTT length of about two timesteps) has a stronger negative impact on the RNN than on DVRL. This is consistent with our notion that the RNN mainly performs memory-based reasoning, for which longer backpropagation through time is required: the belief update (2) in DVRL is a one-step update from b̂_{t−1} to b̂_t, without the need to condition on past actions and observations. The proposal distribution can benefit from extended backpropagation lengths, but this is not necessary. Consequently, this result supports our notion that DVRL relies more on inference computations to update the latent state.

6 Conclusion

In this paper we proposed DVRL, a method for solving POMDPs given only a stream of observations, without knowledge of the latent state space or of the transition and observation functions operating in that space. Our method leverages a new ELBO-based auxiliary loss and incorporates an inductive bias into the structure of the policy network, taking advantage of our prior knowledge that an inference step is required for an optimal solution.

We compared DVRL to an RNN-based architecture and found that we consistently outperform it on a diverse set of tasks, including a number of Atari games modified to have partial observability and stochastic transitions.

We also performed several ablation studies showing the necessity of using an ensemble of particles and of joint optimisation of the ELBO and RL objective. Furthermore, the results support our claim that the latent state in DVRL approximates a belief distribution in a learned model.

Access to a belief distribution opens up several interesting research directions. Investigating the role of better generalisation capabilities and the more powerful latent state representation on the policy performance of DVRL can give rise to further improvements. DVRL is also likely to benefit from more powerful model architectures and a disentangled latent state. Furthermore, uncertainty in the belief state and access to a learned model can be used for curiosity driven exploration in environments with sparse rewards.


Acknowledgements

We would like to thank Wendelin Boehmer and Greg Farquhar for useful discussions and feedback. The NVIDIA DGX-1 used for this research was donated by the NVIDIA corporation. M. Igl is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. L. Zintgraf is supported by the Microsoft Research PhD Scholarship Program. T. A. Le is supported by EPSRC DTA and Google (project code DF6700) studentships. F. Wood is supported by DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006; Intel and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093. S. Whiteson is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).


References

  • Astrom (1965) Astrom, Karl J. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
  • Azizzadenesheli et al. (2016) Azizzadenesheli, Kamyar, Lazaric, Alessandro, and Anandkumar, Animashree. Reinforcement learning of pomdps using spectral methods. arXiv preprint 1602.07764, 2016.
  • Babayan et al. (2018) Babayan, Benedicte M, Uchida, Naoshige, and Gershman, Samuel J. Belief state representation in the dopamine system. Nature communications, 9(1):1891, 2018.
  • Bakker (2002) Bakker, Bram. Reinforcement learning with long short-term memory. In Advances in Neural Information Processing Systems, pp. 1475–1482, 2002.
  • Barto et al. (1995) Barto, Andrew G, Bradtke, Steven J, and Singh, Satinder P. Learning to act using real-time dynamic programming. Artificial intelligence, 72(1-2):81–138, 1995.
  • Bellemare et al. (2014) Bellemare, Marc, Veness, Joel, and Talvitie, Erik. Skip context tree switching. In International Conference on Machine Learning, pp. 1458–1466, 2014.
  • Bellemare (2015) Bellemare, Marc G. Count-based frequency estimation with bounded memory. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Burda et al. (2016) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. In ICLR, 2016.
  • Chung et al. (2015) Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C, and Bengio, Yoshua. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, 2015.
  • Coquelin et al. (2009) Coquelin, Pierre-Arnaud, Deguest, Romain, and Munos, Rémi. Particle filter-based policy gradient in pomdps. In NIPS, 2009.
  • Deisenroth & Peters (2012) Deisenroth, Marc Peter and Peters, Jan. Solving nonlinear continuous state-action-observation pomdps for mechanical systems with gaussian noise. 2012.
  • Dhariwal et al. (2017) Dhariwal, Prafulla, Hesse, Christopher, Klimov, Oleg, Nichol, Alex, Plappert, Matthias, Radford, Alec, Schulman, John, Sidor, Szymon, and Wu, Yuhuai. Openai baselines, 2017.
  • Doshi-Velez et al. (2015) Doshi-Velez, Finale, Pfau, David, Wood, Frank, and Roy, Nicholas. Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 37(2):394–407, 2015.
  • Doucet & Johansen (2009) Doucet, Arnaud and Johansen, Adam M. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
  • Foerster et al. (2016) Foerster, Jakob N, Assael, Yannis M, de Freitas, Nando, and Whiteson, Shimon. Learning to communicate to solve riddles with deep distributed recurrent q-networks. arXiv preprint 1602.02672, 2016.
  • Hausknecht & Stone (2015) Hausknecht, Matthew and Stone, Peter. Deep recurrent q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.
  • Heess et al. (2015) Heess, Nicolas, Hunt, Jonathan J, Lillicrap, Timothy P, and Silver, David. Memory-based control with recurrent neural networks. arXiv preprint 1512.04455, 2015.
  • Jaderberg et al. (2016) Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint 1611.05397, 2016.
  • Kaelbling et al. (1998) Kaelbling, Leslie Pack, Littman, Michael L, and Cassandra, Anthony R. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1), 1998.
  • Karkus et al. (2017) Karkus, Peter, Hsu, David, and Lee, Wee Sun. Qmdp-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pp. 4697–4707, 2017.
  • Katt et al. (2017) Katt, Sammie, Oliehoek, Frans A, and Amato, Christopher. Learning in pomdps with monte carlo tree search. In International Conference on Machine Learning, 2017.
  • Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In ICLR, 2014.
  • Lample & Chaplot (2017) Lample, Guillaume and Chaplot, Devendra Singh. Playing fps games with deep reinforcement learning. In AAAI, pp. 2140–2146, 2017.
  • Le et al. (2018) Le, Tuan Anh, Igl, Maximilian, Jin, Tom, Rainforth, Tom, and Wood, Frank. Auto-encoding sequential Monte Carlo. In ICLR, 2018.
  • Maddison et al. (2017) Maddison, Chris J, Lawson, John, Tucker, George, Heess, Nicolas, Norouzi, Mohammad, Mnih, Andriy, Doucet, Arnaud, and Teh, Yee. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017.
  • McAllester & Singh (1999) McAllester, David A and Singh, Satinder. Approximate planning for factored pomdps using belief state simplification. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 1999.
  • McCallum & Ballard (1996) McCallum, Andrew Kachites and Ballard, Dana. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester. Dept. of Computer Science, 1996.
  • McCallum (1993) McCallum, R Andrew. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, pp. 190–196, 1993.
  • Messias & Whiteson (2017) Messias, João V and Whiteson, Shimon. Dynamic-depth context tree weighting. In Advances in Neural Information Processing Systems, pp. 3330–3339, 2017.
  • Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mnih et al. (2016) Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
  • Naesseth et al. (2018) Naesseth, Christian A, Linderman, Scott W, Ranganath, Rajesh, and Blei, David M. Variational sequential monte carlo. In AISTATS (To Appear), 2018.
  • Narasimhan et al. (2015) Narasimhan, Karthik, Kulkarni, Tejas, and Barzilay, Regina. Language understanding for text-based games using deep reinforcement learning. arXiv preprint 1506.08941, 2015.
  • Oliehoek et al. (2008) Oliehoek, Frans A, Spaan, Matthijs TJ, Whiteson, Shimon, and Vlassis, Nikos. Exploiting locality of interaction in factored dec-pomdps. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 1, 2008.
  • Pineau et al. (2003) Pineau, Joelle, Gordon, Geoff, Thrun, Sebastian, et al. Point-based value iteration: An anytime algorithm for pomdps. In IJCAI, volume 3, 2003.
  • Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • Roijers et al. (2015) Roijers, Diederik Marijn, Whiteson, Shimon, and Oliehoek, Frans A. Point-based planning for multi-objective pomdps. In IJCAI, pp. 1666–1672, 2015.
  • Ross et al. (2008) Ross, Stéphane, Pineau, Joelle, Paquet, Sébastien, and Chaib-Draa, Brahim. Online planning algorithms for pomdps. Journal of Artificial Intelligence Research, 32:663–704, 2008.
  • Ross et al. (2011) Ross, Stéphane, Pineau, Joelle, Chaib-draa, Brahim, and Kreitmann, Pierre. A bayesian approach for learning and planning in partially observable markov decision processes. Journal of Machine Learning Research, 2011.
  • Shani et al. (2005) Shani, Guy, Brafman, Ronen I, and Shimony, Solomon E. Model-based online learning of pomdps. In European Conference on Machine Learning, pp. 353–364. Springer, 2005.
  • Silver & Veness (2010) Silver, David and Veness, Joel. Monte-carlo planning in large pomdps. In Advances in neural information processing systems, pp. 2164–2172, 2010.
  • Sohn et al. (2015) Sohn, Kihyuk, Lee, Honglak, and Yan, Xinchen. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
  • Tamar et al. (2016) Tamar, Aviv, Wu, Yi, Thomas, Garrett, Levine, Sergey, and Abbeel, Pieter. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162, 2016.
  • Thrun (2000) Thrun, Sebastian. Monte carlo pomdps. In Advances in neural information processing systems, pp. 1064–1070, 2000.
  • Wierstra et al. (2007) Wierstra, Daan, Foerster, Alexander, Peters, Jan, and Schmidhuber, Juergen. Solving deep memory pomdps with recurrent policy gradients. In International Conference on Artificial Neural Networks, pp. 697–706. Springer, 2007.
  • Willems et al. (1995) Willems, Frans MJ, Shtarkov, Yuri M, and Tjalkens, Tjalling J. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.
  • Wu et al. (2017) Wu, Yuhuai, Mansimov, Elman, Grosse, Roger B, Liao, Shun, and Ba, Jimmy. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems, pp. 5285–5294, 2017.
  • Zhang et al. (2015) Zhang, Marvin, Levine, Sergey, McCarthy, Zoe, Finn, Chelsea, and Abbeel, Pieter. Policy learning with continuous memory states for partially observed robotic control. CoRR, 2015.
  • Zhu et al. (2017) Zhu, Pengfei, Li, Xin, and Poupart, Pascal. On improving deep reinforcement learning for POMDPs. arXiv preprint 1704.07978, 2017.

Appendix A Experiments

a.1 Implementation Details

In our implementation, the transition and proposal distributions are multivariate normal distributions over the latent state whose mean and diagonal variance are determined by neural networks. For image data, the decoder is a multivariate independent Bernoulli distribution whose parameters are again determined by a neural network. For real-valued vectors we use a normal distribution.
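As a concrete sketch of such a parametrisation (numpy stand-in for the actual networks; layer shapes are hypothetical), a network head can produce the mean and a softplus-positive diagonal variance, from which a reparameterised sample is drawn:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def gaussian_head(x, W_mu, b_mu, W_var, b_var):
    """Map features x to the mean and positive diagonal variance of a
    multivariate normal with independent dimensions."""
    mu = x @ W_mu + b_mu
    var = softplus(x @ W_var + b_var)  # softplus keeps the variance positive
    return mu, var

def sample_diag_gaussian(mu, var, rng):
    # Reparameterised sample: mu + sqrt(var) * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(var) * eps
```

In the reparameterised form, gradients can flow through `mu` and `var` into the network parameters, which is what allows the ELBO to be optimised jointly with the policy.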

When several inputs are passed to a neural network, they are concatenated into one vector. ReLUs are used as nonlinearities between all layers. Unless otherwise stated, all hidden layers have the same dimension as the latent state. Batch normalization was used between layers for the experiments on Atari, but not on Mountain Hike, as it significantly hurt performance there. All RNNs are GRUs.

Separate encoding functions are used to encode single observations, actions, and latent states before they are passed into other networks.

To encode visual observations, we use the same convolutional network as proposed by Mnih et al. (2015), but with only 32 instead of 64 channels in the final layer. The transposed convolutional network of the decoder has the reversed structure. The decoder is preceded by an additional fully connected layer which outputs the required dimension (1568 for Atari observations).
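The 1568-dimensional decoder input quoted above follows directly from the shape arithmetic of this network (assuming valid, unpadded convolutions on 84x84 frames, as in Mnih et al. (2015)):

```python
def conv_out(size, kernel, stride):
    # Spatial output size of a valid (unpadded) convolution
    return (size - kernel) // stride + 1

# Mnih et al. (2015) encoder applied to 84x84 frames:
s = conv_out(84, 8, 4)   # first conv layer: 8x8 kernel, stride 4 -> 20
s = conv_out(s, 4, 2)    # second conv layer: 4x4 kernel, stride 2 -> 9
s = conv_out(s, 3, 1)    # third conv layer: 3x3 kernel, stride 1 -> 7
flat = s * s * 32        # 32 final channels -> 7 * 7 * 32 = 1568
```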

For the low-dimensional observations of Mountain Hike we used two fully connected layers of size 64 as encoder. As decoder we used the same structure as for the transition and proposal distributions, which are all three normal distributions: one joint fully connected layer followed by two separate fully connected heads, one for the mean and one for the variance. The output of the variance head is passed through a softplus layer to enforce positivity.

Actions are encoded using one fully connected layer of size 128 for Atari and size 64 for Mountain Hike. Lastly, the latent state is encoded, before being passed into other networks, by one fully connected layer of the same dimension as the latent state itself.

The policy is one fully connected layer whose size is determined by the action space: up to 18 outputs with a softmax for Atari, and two outputs for the learned mean for Mountain Hike, together with a learned variance. The value function is one fully connected layer of size 1.
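A minimal numpy sketch of these two heads for the discrete case (weight shapes are hypothetical):

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability before exponentiating
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_and_value(h, W_pi, b_pi, W_v, b_v):
    """One fully connected layer each: a softmax policy over discrete
    actions and a scalar state-value estimate."""
    probs = softmax(h @ W_pi + b_pi)
    value = (h @ W_v + b_v).squeeze(-1)
    return probs, value
```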

A2C used 16 parallel environments and 5-step learning for a total batch size of 80. Hyperparameters were tuned on Chopper Command. The learning rate of both DVRL and the RNN baseline was tuned independently over a small set of values, with one value chosen for DVRL on Atari and another for DVRL on Mountain Hike and for the RNN baseline on both environments. The remaining loss coefficients were set to commonly used values without further tuning.

As optimizer we use RMSProp. We clip gradients by norm, the discount factor of the control problem is set to a standard value, and lastly, we use orthogonal initialization for the network weights.

The source code will be released in the future.

a.2 Additional Experiments and Visualisations

Table 2 shows the final results on deterministic and flickering Atari, averaged over 5 random seeds. The values for DRQN and ADRQN are taken from the respective papers. Note that DRQN and ADRQN rely on Q-learning instead of A2C, so the results are not directly comparable.

Figures 6 and 7 show individual learning curves for all 10 Atari games, for the deterministic and the stochastic versions of the games respectively.

Table 2: Final results on deterministic and flickering Atari environments, averaged over 5 random seeds. Bold numbers indicate statistical significance at the 5% level when comparing DVRL and RNN. The values for DRQN and ADRQN are taken from the respective papers.
Figure 6: Training curves on the full set of evaluated Atari games, in the case of flickering and deterministic environments. Panels: (a) Asteroids, (b) Beam Rider, (c) Bowling, (d) Centipede, (e) Chopper Command, (f) Double Dunk, (g) Frostbite, (h) Ice Hockey, (i) Ms. Pacman, (j) Pong.
Figure 7: Training curves on the full set of evaluated Atari games, in the case of flickering and stochastic environments. Panels: (a) Asteroids, (b) Beam Rider, (c) Bowling, (d) Centipede, (e) Chopper Command, (f) Double Dunk, (g) Frostbite, (h) Ice Hockey, (i) Ms. Pacman, (j) Pong.

a.3 Computational Speed

The approximate training speed on Atari, in frames per second (FPS) on one GPU of a DGX-1, is:

  • RNN: 124k FPS

  • DVRL (1 Particle): 64k FPS

  • DVRL (10 Particles): 48k FPS

  • DVRL (30 Particles): 32k FPS

a.4 Model Predictions

In Figure 8 we show reconstructed and predicted images from the DVRL model for several Atari games. The current observation is in the leftmost column. The second column (’dt0’) shows the reconstruction after encoding and decoding the current observation. For the further columns, we use the learned generative model to predict future observations; for simplicity, the last action is repeated. Columns 3 to 7 show the predicted observations for successive unrolled time steps. The model was trained as explained in Section 5.2. The reconstructed and predicted images are a weighted average over all 16 particles.
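The prediction procedure can be sketched as follows, where `transition` and `decoder` are hypothetical callables standing in for the learned transition and decoder networks:

```python
def rollout_predictions(z, last_action, transition, decoder, steps=5):
    """Unroll the learned generative model by repeating the last action:
    sample the next latent state from the transition distribution and
    decode each latent into a predicted observation."""
    frames = []
    for _ in range(steps):
        z = transition(z, last_action)  # next latent, z' ~ p(. | z, a)
        frames.append(decoder(z))       # predicted observation from z'
    return frames
```

In DVRL this rollout is performed per particle, and the displayed frame is the particle-weighted average of the decoded observations.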

Note that the model is able to correctly predict features of future observations, for example the movement of the cars in ChopperCommand, the (approximate) ball position in Pong, or the missing pins in Bowling. Furthermore, it is able to do so even if the current observation is blank, as in Bowling. The model has also correctly learned to randomly predict blank observations.

It can remember features of the current state fairly well, such as the positions of the barriers (white dots) in Centipede. On the other hand, it clearly struggles with the amount of information present in MsPacman, such as the positions of all previously eaten fruits or the locations of the ghosts.

Figure 8: Reconstructions and predictions using the learned generative model for several Atari games: (a) ChopperCommand, (b) Pong, (c) Bowling, (d) Centipede, (e) MsPacman, (f) BeamRider. First column: current observation (potentially blank). Second column: encoded and decoded reconstruction of the current observation. Columns 3 to 7: predicted observations using the learned generative model for successive time steps into the future.

Appendix B Algorithms

Algorithm 1 details the recurrent (belief) state computation (i.e. the history encoder) for DVRL. Algorithm 2 details the recurrent state computation for the RNN baseline. Algorithm 3 describes the overall training algorithm, which uses one or the other to aggregate the history. Despite looking complicated, it is just a very detailed implementation of n-step A2C with two additional changes: the inclusion of the encoding loss, and the inclusion of the option to not delete the computation graph, allowing longer backpropagation in n-step A2C.

Results for additionally using the reconstruction loss with the RNN based encoder are not shown in the paper, as this reliably performed worse than the RNN without reconstruction loss.
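The joint objective described in the paper (the A2C terms plus the n-step ELBO-based encoding loss, trained together so the latent representation stays useful for control) can be sketched as follows; the coefficient names and default values here are assumptions, not the tuned values:

```python
def total_loss(policy_loss, value_loss, entropy, encoding_loss,
               value_coef=0.5, entropy_coef=0.01, encoding_coef=1.0):
    """Joint objective sketch: standard A2C terms (policy gradient loss,
    value regression loss, entropy bonus) plus the ELBO-based encoding
    loss. Coefficients are hypothetical placeholders."""
    return (policy_loss
            + value_coef * value_loss
            - entropy_coef * entropy
            + encoding_coef * encoding_loss)
```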

  Input: Previous state , observation , action
  for  to  do
     Sample based on weights
  end for
  {When or is conditioned on , the summary is used.}
Algorithm 1 DVRL encoder
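The sampling step in Algorithm 1 ("sample based on weights") amounts to drawing ancestor indices in proportion to the normalised particle weights, as in a standard particle filter; a numpy sketch with log-space weights:

```python
import numpy as np

def resample(particles, log_weights, rng):
    """Multinomial resampling: draw K ancestor indices with probability
    proportional to the particle weights (normalised in log space for
    numerical stability)."""
    w = np.exp(log_weights - np.max(log_weights))
    w = w / w.sum()
    K = len(particles)
    ancestors = rng.choice(K, size=K, p=w)
    return [particles[i] for i in ancestors], ancestors
```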
  Input: Previous state , observation , action
Algorithm 2 RNN encoder
  Input: Environment , Encoder (either RNN or DVRL)
  Initialize observation from .
  Initialize encoder latent state as either (for RNN) or (for DVRL)
  Initialize action to no-op
  Set .
  {The distinction between and is necessary when the environment resets at time .}
     {Run steps forward:}
     for  to  do
        if  then
           { is still available to compute }
        end if
     end for
     {Compute targets}
     for  to  do
        if  then
        end if
     end for
     {Compute losses}
     for  to  do
     end for
     Delete or save computation graph of to determine backpropagation length
  until converged
Algorithm 3 Training Algorithm
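The "compute targets" loop in Algorithm 3 can be sketched as the standard backward n-step return computation, where environment resets mask out the bootstrapped value (function and argument names are illustrative):

```python
import numpy as np

def n_step_targets(rewards, bootstrap_value, dones, gamma=0.99):
    """Compute n-step value targets by iterating backwards from the
    bootstrapped value of the final state, zeroing the running return
    whenever the environment resets (done flag set)."""
    targets = np.zeros_like(rewards)
    R = bootstrap_value
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * (1.0 - dones[t]) * R
        targets[t] = R
    return targets
```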