Learning Belief Representations for Imitation Learning in POMDPs

We consider the problem of imitation learning from expert demonstrations in partially observable Markov decision processes (POMDPs). Belief representations, which characterize the distribution over the latent states in a POMDP, have been modeled using recurrent neural networks and probabilistic latent variable models, and shown to be effective for reinforcement learning in POMDPs. In this work, we investigate the belief representation learning problem for generative adversarial imitation learning in POMDPs. Instead of training the belief module and the policy separately as suggested in prior work, we learn the belief module jointly with the policy, using a task-aware imitation loss to ensure that the representation is more aligned with the policy's objective. To improve robustness of representation, we introduce several informative belief regularization techniques, including multi-step prediction of dynamics and action-sequences. Evaluated on various partially observable continuous-control locomotion tasks, our belief-module imitation learning approach (BMIL) substantially outperforms several baselines, including the original GAIL algorithm and the task-agnostic belief learning algorithm. Extensive ablation analysis indicates the effectiveness of task-aware belief learning and belief regularization.





Code Repositories


Code for "Learning Belief Representations for Imitation Learning in POMDPs" (UAI 2019)


1 Introduction

Recent advances in reinforcement learning (RL) have found successful applications in solving complex problems, including robotics, games, dialogue systems, and recommendation systems, among others. Despite such notable success, the application of RL is still largely limited to problems where the observation-space is rich in information and data generation is inexpensive. On the other hand, the environments in real-world problems, such as autonomous driving and robotics, are typically stochastic, complex and partially observable. To achieve robust and practical performance, RL algorithms should adapt to situations where the agent is being fed noisy and incomplete sensory information. To model these types of environments, partially observable Markov decision processes (POMDPs; Åström (1965)) have been proposed and widely studied. In a POMDP, since the current observation alone is insufficient for choosing optimal actions, the agent’s history (its past observations and actions) is encoded into a belief state, which is defined as the distribution (representing the agent’s beliefs) over the current latent state. Although belief states can be used to derive optimal policies (Kaelbling et al., 1998; Hauskrecht, 2000), maintaining and updating them requires knowledge of the transition and observation models of the POMDP, and is prohibitively expensive for high-dimensional spaces. To overcome this difficulty, several algorithms have been proposed that perform approximate inference of the belief state representation from raw observations, using recurrent neural networks (Guo et al., 2018), variational autoencoders (Igl et al., 2018; Gregor & Besse, 2018), and Predictive State Representations (Venkatraman et al., 2017). After the belief model has been learned, a policy optimization algorithm is then applied to the belief representation to optimize a predefined reward signal.

As an alternative to RL from predefined rewards, imitation learning often provides a fast and efficient way for training an agent to complete tasks. Expert demonstrations are provided to guide a learner agent to mimic the actions of the expert without the need to specify a reward function. A large volume of work has been done over the past decades on imitation learning for fully observable MDPs, including the seminal work on generative adversarial imitation learning (GAIL, Ho & Ermon (2016)), but there has been little focus on applying these ideas to partially observable environments.

In this paper, we study the problem of imitation learning for POMDPs. Specifically, we introduce a new belief representation learning approach for generative adversarial imitation learning in POMDPs. Different from previous approaches, where the belief state representation and the policy are trained in a decoupled manner, we learn the belief module jointly with the policy, using a task-aware imitation loss which helps to align the belief representation with the policy’s objective. To avoid potential belief degeneration, we introduce several informative belief regularization techniques, including auxiliary losses of predicting multi-step past/future observations and action-sequences, which improve the robustness of the belief representation. Evaluated on various partially observable continuous-control locomotion tasks built from MuJoCo, our belief-module imitation learning approach (BMIL) substantially outperforms several baselines, including the original GAIL algorithm and the task-agnostic belief learning algorithm. Extensive ablation analysis indicates the effectiveness of task-aware belief learning and belief regularization.

2 Background and Notation

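Reinforcement Learning. We consider the RL setting where the environment is modeled as a partially observable Markov decision process (POMDP). A POMDP is characterized by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{U}, r, p_0, \gamma)$, where $\mathcal{S}$ is the state-space, $\mathcal{A}$ is the action-space, and $\mathcal{O}$ is the observation-space. The true environment states $s \in \mathcal{S}$ are latent or unobserved to the agent. Given an action $a_t \in \mathcal{A}$, the next state is governed by the transition dynamics $s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a_t)$, an observation is generated as $o_{t+1} \sim \mathcal{U}(o_{t+1} \mid s_{t+1})$, and the reward is computed as $r_t = r(s_t, a_t)$. The RL objective involves maximization of the expected discounted sum of rewards, $\eta(\pi_\theta) = \mathbb{E}_{p_0, \mathcal{T}, \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $\gamma$ is the discount factor, and $p_0$ is the initial state distribution. The action-value function is $Q^\pi(s_t, a_t) = \mathbb{E}_{\mathcal{T}, \pi}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'})\right]$. We define the unnormalized $\gamma$-discounted state-visitation distribution for a policy $\pi$ by $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$, where $P(s_t = s \mid \pi)$ is the probability of being in state $s$ at time $t$, when following policy $\pi$ and starting state $s_0 \sim p_0$. The expected policy return can then be written as $\eta(\pi) = \mathbb{E}_{\rho_\pi(s,a)}[r(s,a)]$, where $\rho_\pi(s,a) = \rho_\pi(s)\,\pi(a \mid s)$ is the state-action visitation distribution (also referred to as the occupancy measure). For any policy $\pi$, there is a one-to-one correspondence between $\pi$ and its occupancy measure (Puterman, 1994). Using the policy gradient theorem (Sutton et al., 2000), the gradient of the RL objective can be obtained as $\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s,a)\right]$.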
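As a small illustration of the discounted sum that appears in the RL objective and in the action-value function, the sketch below computes the reward-to-go for every step of one sampled trajectory. The function name `discounted_returns` and the list-based interface are assumptions made for this example; they are not taken from the paper's code.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum over t' >= t of gamma^(t'-t) * r_t' for every step t.

    A single-trajectory Monte Carlo estimate of the action-value at each
    visited step (illustrative helper, not the authors' implementation).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([1.0, 1.0, 1.0], 0.5)` gives `[1.75, 1.5, 1.0]`.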

Imitation Learning. Learning in popular RL algorithms (such as policy-gradients and Q-learning) is sensitive to the quality of the reward function. In many practical scenarios, the rewards are either unavailable or extremely sparse, leading to difficulty in temporal credit assignment (Sutton, 1984). In the absence of explicit environmental rewards, a promising approach is to leverage demonstrations of the completed task by experts, and learn to imitate their behavior. Behavioral cloning (BC; Pomerleau (1991)) poses imitation as a supervised-learning problem, and learns a policy by maximizing the likelihood of expert-actions in the states visited by the expert. The policies produced with BC are generally not very robust due to the issue of compounding errors; several approaches have been proposed to remedy this (Ross et al., 2011; Ross & Bagnell, 2014). Inverse Reinforcement Learning (IRL) presents a more principled approach to imitation by attempting to recover the cost function under which the expert demonstrations are optimal (Ng et al., 2000; Ziebart et al., 2008). Most IRL algorithms, however, are difficult to scale up computationally because they require solving an RL problem in their inner loop. Recently, Ho & Ermon (2016) proposed framing imitation learning as an occupancy-measure matching (or divergence minimization) problem. Their architecture (GAIL) forgoes learning the optimal cost function in order to achieve computational tractability and sample-efficiency (in terms of the number of expert demonstrations needed). In detail, if $\rho_\pi(s,a)$ and $\rho_E(s,a)$ represent the state-action visitation distributions of the policy and the expert, respectively, then minimizing the Jensen-Shannon divergence $D_{JS}[\rho_\pi(s,a) \,\|\, \rho_E(s,a)]$ helps to recover a policy with a similar trajectory distribution as the expert. GAIL iteratively trains a policy ($\pi_\theta$) and a discriminator ($D_\omega$) to optimize the mini-max objective:

$$\min_\theta \max_\omega \; \mathbb{E}_{(s,a) \sim \mathcal{M}_E}\left[\log D_\omega(s,a)\right] + \mathbb{E}_{(s,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D_\omega(s,a)\right)\right] \qquad (1)$$

where $(s,a) \sim (p_0, \pi_\theta, \mathcal{T})$ denotes state-action pairs generated by rolling out the policy, $\mathcal{M}_E$ is the buffer with expert demonstrations, and $\mathcal{T}$ is the transition dynamics.
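The link between the discriminator's inner maximization and the Jensen-Shannon divergence can be checked numerically on a toy problem. The sketch below assumes small, normalized, discrete visitation distributions (the paper's distributions are continuous and unnormalized), and uses the standard GAN result (Goodfellow et al., 2014) that the optimal discriminator is $D^* = p_E / (p_E + p_\pi)$, at which the classification objective equals $2 D_{JS} - \log 4$. All function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def optimal_discriminator_value(p_expert, p_policy):
    """Value of E_expert[log D] + E_policy[log(1 - D)] at D = p_E / (p_E + p_pi)."""
    total = 0.0
    for pe, pp in zip(p_expert, p_policy):
        if pe + pp == 0:
            continue  # outcome never visited by either distribution
        d = pe / (pe + pp)
        if pe > 0:
            total += pe * math.log(d)
        if pp > 0:
            total += pp * math.log(1 - d)
    return total

p_expert = [0.7, 0.2, 0.1]   # toy expert visitation (assumed values)
p_policy = [0.1, 0.3, 0.6]   # toy policy visitation (assumed values)
value = optimal_discriminator_value(p_expert, p_policy)
shifted_js = 2 * js_divergence(p_expert, p_policy) - math.log(4)
```

Here `value` and `shifted_js` agree up to floating-point error, which is the sense in which minimizing the classification objective over the policy minimizes the divergence up to a constant.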

3 Methods

3.1 Belief Representation in POMDP

In a POMDP, the observations are by definition non-Markovian. A policy that chooses actions based on current observations performs sub-optimally, since $o_t$ does not contain sufficient information about the true state of the world. It is useful to infer a distribution on the true states based on the experiences thus far. This is referred to as the belief state, and is formally defined as the filtering distribution: $p(s_t \mid o_{\le t}, a_{<t})$. It combines the memory of past experiences with uncertainty about unobserved aspects of the world. Let $h_t = (o_{\le t}, a_{<t})$ denote the observation-action history, and $b_t$ be a function of $h_t$. If $b_t$ is learned such that it forms the sufficient statistics of the filtering posterior over states, i.e., $p(s_t \mid h_t) \approx p(s_t \mid b_t)$, then $b_t$ could be used as a surrogate code (or representation) for the belief state, and be used to train agents in POMDPs. Henceforth, with slight abuse of notation, we refer to $b_t$ as the belief, although it is a high-dimensional representation rather than an explicit distribution over states.
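To make the filtering distribution concrete, the following sketch performs one exact Bayes-filter update on a small discrete POMDP. It assumes the transition and observation models are known and tabular, which is exactly the knowledge that is unavailable in the setting studied here; the learned representation is meant to stand in for this quantity. The table layouts `T[action][s][s_next]` and `U[s_next][observation]` are conventions chosen for this example.

```python
def belief_update(belief, action, observation, T, U):
    """One exact filtering step: b'(s') is proportional to U(o | s') * sum_s T(s' | s, a) * b(s).

    belief: list of probabilities over states.
    T[action][s][s_next]: transition probabilities (assumed known here).
    U[s_next][observation]: observation probabilities (assumed known here).
    """
    n = len(belief)
    # Prediction step: push the current belief through the dynamics.
    predicted = [sum(belief[s] * T[action][s][s2] for s in range(n)) for s2 in range(n)]
    # Correction step: weight by the likelihood of the new observation.
    unnormalized = [U[s2][observation] * predicted[s2] for s2 in range(n)]
    z = sum(unnormalized)
    if z == 0:
        raise ValueError("observation has zero probability under the model")
    return [x / z for x in unnormalized]

# Two latent states, one 'stay' action, and a noisy binary sensor (toy values).
T = [[[1.0, 0.0], [0.0, 1.0]]]
U = [[0.8, 0.2], [0.3, 0.7]]
posterior = belief_update([0.5, 0.5], action=0, observation=0, T=T, U=U)
```

With these toy tables, observing `0` moves the uniform prior to roughly `[0.727, 0.273]`.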

An intuitive way to obtain this belief is by combining the observation-action history using aggregator functions such as recurrent or convolution networks. For instance, the intermediate hidden states in a recurrent network could represent $b_t$. In the RL setting with environmental rewards, the representation could be trained by conditioning the policy on it, and back-propagating the RL (e.g. policy gradient) loss. However, the RL signal is generally too weak to learn a rich representation that provides sufficient statistics for the filtering posterior over states. Moreno et al. (2018) provide empirical evidence of this claim by training oracle models where representation learning is supervised with privileged information in the form of the (unknown) environment states, and comparing them with learning solely using the RL loss. The problem is only exacerbated when the environmental rewards are extremely sparse. In our imitation learning setup, the belief update is incorporated into the mini-max objective for adversarial imitation of expert trajectories, and hence the representation is learned with a potentially stronger signal (Section 3.3). Prior work has shown that representations can be improved by using auxiliary losses such as reward-prediction (Jaderberg et al., 2016), depth-prediction (Mirowski et al., 2016), and prediction of future sensory data (Dosovitskiy & Koltun, 2016; Oh et al., 2015). Inspired by this, in Section 3.4, we regularize the representation with various prediction losses.

Recently, Ha & Schmidhuber (2018) proposed an architecture (World-Models) that decouples model-learning from policy-optimization. In the model-learning phase, a variational auto-encoder compresses the raw observations to latent-space vectors, which are then temporally integrated using an RNN, combined with a mixture density network. In the policy-optimization phase, a policy conditioned on the RNN hidden-states is learned to maximize the rewards. We follow a similar separation-of-concerns principle, and divide the architecture into two modules: 1) a policy module which learns a distribution over actions, conditioned on the belief; and 2) a belief module which learns a good representation of the belief $b_t$, from the history of observations and actions, $h_t$. While the policy module is trained with imitation learning, the belief module can be trained in a task-agnostic manner (like in World-Models), or in a task-aware manner. We describe these approaches in the following sections.

3.2 Policy Module

The goal of our agent is to learn a policy by imitating a few expert demonstration trajectories of the form $\{o_0, a_0, o_1, a_1, \ldots\}$. Similar to the objective in GAIL, we hope to minimize the Jensen-Shannon divergence between the state-action visitation distributions of the policy and the expert: $D_{JS}[\rho_\pi(s,a) \,\|\, \rho_E(s,a)]$. However, since the true environment state $s$ is unobserved in POMDPs, we modify the objective to involve the belief representation $b$ instead, since it characterizes the posterior over $s$ via the generative process $p(s \mid b)$. Defining the belief-visitation distribution $\rho_\pi(b)$ for a policy analogously to the state-visitation distribution, the data processing inequality for $f$-divergences provides that: $D_{JS}[\rho_\pi(s) \,\|\, \rho_E(s)] \le D_{JS}[\rho_\pi(b) \,\|\, \rho_E(b)]$. Please see Appendix 7.1 for the proof. The objective $D_{JS}[\rho_\pi(b) \,\|\, \rho_E(b)]$ thus minimizes an upper bound on the $D_{JS}$ between the state-visitation distributions of the expert and the policy. A further relaxation of this objective allows us to explicitly include the belief-conditioned policy into the divergence minimization objective: $D_{JS}[\rho_\pi(b) \,\|\, \rho_E(b)] \le D_{JS}[\rho_\pi(b,a) \,\|\, \rho_E(b,a)]$, where $\rho_\pi(b,a) = \rho_\pi(b)\,\pi(a \mid b)$ is the belief-action visitation (proof in Appendix 7.1).

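Minimizing $D_{JS}[\rho_\pi(b,a) \,\|\, \rho_E(b,a)]$. Although explicitly formulating these visitation distributions is difficult, it is possible to obtain an empirical distribution of $\rho_\pi(b,a)$ by rolling out trajectories from $\pi_\theta$, and using our belief module to produce samples of belief-action tuples $(b_t, a_t)$, where $b_t = B_\phi(h_t)$. Similarly, the expert demonstrations buffer $\mathcal{M}_E$ contains observation-action sequences, and can be used as an estimate of $\rho_E(b,a)$. Therefore, $D_{JS}[\rho_\pi(b,a) \,\|\, \rho_E(b,a)]$ (to reduce clutter, we shorthand this with just $D_{JS}$ for the remainder of this section) can be approximated (up to a constant shift) with a binary classification problem as exploited in GANs (Goodfellow et al., 2014):

$$D_{JS} \approx \max_\omega \; \mathbb{E}_{(b,a) \sim \mathcal{M}_E}\left[\log D_\omega(b,a)\right] + \mathbb{E}_{(b,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D_\omega(b,a)\right)\right] \qquad (2)$$

where $\theta$ are the parameters for the policy $\pi_\theta$, $D_\omega$ is the discriminator, and $\mathcal{T}$ is the transition dynamics. It should be noted that $D_{JS}$ is a function of the belief module parameters $\phi$ through its dependence on the belief states. The imitation learning objective for optimizing the policy is then obtained as:

$$\min_\theta D_{JS} \approx \min_\theta \max_\omega \; \mathbb{E}_{(b,a) \sim \mathcal{M}_E}\left[\log D_\omega(b,a)\right] + \mathbb{E}_{(b,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D_\omega(b,a)\right)\right] \qquad (3)$$

In Equation 2, denoting the functional maximum over $\omega$ by $D^*$, the gradient for policy optimization is: $\nabla_\theta D_{JS} \approx \nabla_\theta \mathbb{E}_{(b,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D^*(b,a)\right)\right]$. Figure 1 shows the stochastic computation graph (Schulman et al., 2015a) for this expectation term, where the stochastic nodes are represented by circles, deterministic nodes by rectangles, and we have written belief as a function of the history. Given fixed belief module parameters ($\phi$), the required gradient w.r.t. $\theta$ is obtained using the policy gradient theorem (Sutton et al., 2000) as:

$$\nabla_\theta D_{JS} \approx \mathbb{E}_{\rho_\pi(b,a)}\left[\nabla_\theta \log \pi_\theta(a \mid b)\, Q^\pi(b,a)\right], \quad Q^\pi(b_t,a_t) = \mathbb{E}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} \log\left(1 - D^*(b_{t'},a_{t'})\right)\right] \qquad (4)$$

Therefore, updating the policy to minimize $D_{JS}$ is approximately the same as applying the standard policy-gradient using the rewards obtained from a learned discriminator, $r(b,a) = -\log\left(1 - D_\omega(b,a)\right)$. As is standard practice, we do not train the discriminator to optimality, but rather jointly train the policy and discriminator using iterative gradient updates. The discriminator is updated using the gradient from Equation 2, while the policy is updated with the gradient from Equation 4. We now detail the update rule for $\phi$.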
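The discriminator-derived reward can be sketched as follows. This assumes the labeling convention in which expert tuples are the positive class, so that the policy is rewarded with $-\log(1 - D)$, and it assumes the discriminator outputs a raw logit that is passed through a sigmoid. Both the convention and the name `shaped_reward` are assumptions of this example rather than details confirmed by the text.

```python
import math

def shaped_reward(logit):
    """Reward -log(1 - D) with D = sigmoid(logit), written as a stable softplus.

    Uses the identity -log(1 - sigmoid(x)) = log(1 + exp(x)), evaluated as
    max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large |x|.
    """
    return max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

Under this convention the reward grows as the discriminator becomes more convinced that a belief-action tuple came from the expert, and it stays finite even for extreme logits, which the naive `-math.log(1 - d)` form does not.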

Figure 1: Stochastic computation graph for the expectation: $\mathbb{E}_{(b,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D^*(b,a)\right)\right]$. Both the policy and belief module parameters influence the generation of observation-action sequences, through environment interaction. $b_t = B_\phi(h_t)$ is the belief. Circles represent stochastic nodes; rectangles are deterministic nodes.

3.3 Belief Module

This module transforms the history ($h_t$) into a belief representation. Various approaches could be used to aggregate historical context, such as RNNs, masked convolutions (Gehring et al., 2017) and attention-based methods (Vaswani et al., 2017). In our implementation, we model the belief module with an RNN, such that $b_t = B_\phi(b_{t-1}, o_t, a_{t-1})$. We use GRUs (Cho et al., 2014) as they have been demonstrated to have good empirical performance. We denote by $\mathcal{R}$ a replay-buffer which stores observation-action sequences (current and past) from the agent. As stated before, the belief module could be learnt in a task-agnostic manner (similar to Ha & Schmidhuber (2018)), or with task-awareness.

Task-agnostic learning (separately from policy). An unsupervised approach to learning $\phi$, without accounting for the agent’s objective, is to maximize the joint likelihood of the observation sequence, conditioned on the actions, $p(o_{\le T} \mid a_{<T})$. This decomposes autoregressively as $\prod_t p(o_{t+1} \mid o_{\le t}, a_{\le t})$. The objective can be optimized by conditioning a generative model for $o_{t+1}$ on the RNN hidden state $b_t$ and action $a_t$, and using MLE. Using a unimodal Gaussian generative model (learned function $g$ for the mean, and fixed variance), the autoregressive loss to minimize is:

$$L^{AR}(\phi) = \mathbb{E}_{\mathcal{R}}\left[\sum_t \left\| o_{t+1} - g(b_t, a_t) \right\|_2^2\right] \qquad (5)$$
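Task-aware learning (jointly with policy). Since the policy is conditioned on the belief, an intuitive way to improve the agent’s performance is to learn the belief with an objective more aligned with policy-learning. Since the agent minimizes $D_{JS}$, as defined in Equation 2, the same imitation learning objective can naturally also be used for learning $\phi$:

$$\min_\phi D_{JS} \approx \min_\phi \; \mathbb{E}_{(h,a) \sim \mathcal{M}_E}\left[\log D^*(B_\phi(h), a)\right] + \mathbb{E}_{(h,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D^*(B_\phi(h), a)\right)\right] \qquad (6)$$

which is the same as Equation 2 except for the use of the optimal discriminator ($D^*$), and that we have written the belief in terms of history to explicitly bring out the dependence on $\phi$. The gradient of the first expectation term w.r.t. $\phi$ is straightforward; the gradient of the second expectation term w.r.t. $\phi$ (for given fixed parameters $\theta$) comprises a policy-gradient term and a pathwise-derivative term (Figure 1). Therefore, $\nabla_\phi D_{JS}$ can be approximated with:

$$\nabla_\phi D_{JS} \approx \mathbb{E}_{(h,a) \sim \mathcal{M}_E}\left[\nabla_\phi \log D^*(B_\phi(h), a)\right] + \mathbb{E}_{\rho_\pi}\left[\nabla_\phi \log \pi_\theta(a \mid B_\phi(h))\, Q^\pi(b,a)\right] + \mathbb{E}_{\rho_\pi}\left[\nabla_\phi \log\left(1 - D^*(B_\phi(h), a)\right)\right] \qquad (7)$$

where $Q^\pi$ is as defined in Equation 4. The overall mini-max objective for jointly training the policy, belief and discriminator is:

$$\min_{\theta, \phi} \max_\omega \; \mathbb{E}_{(h,a) \sim \mathcal{M}_E}\left[\log D_\omega(B_\phi(h), a)\right] + \mathbb{E}_{(h,a) \sim (p_0, \pi_\theta, \mathcal{T})}\left[\log\left(1 - D_\omega(B_\phi(h), a)\right)\right] \qquad (8)$$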

Figure 2: Schematic diagram of our complete architecture. The belief module is a recurrent network with GRU cells, and encodes trajectories from the agent (on-policy data), the replay buffer (off-policy data) and the expert demonstrations memory into belief representations ($b_t$). $B_\phi$ is updated with the imitation-loss (Equation 6) computed from the current policy and discriminator networks. It is further regularized with forward-, inverse- and action-regularization using MLPs (colored in blue in the figure). Convolution layers (colored in red) encode the past actions and future actions into compact representations, which are then fed into the MLPs. The policy is conditioned on the belief, and updated using imitation learning (Equation 4). The discriminator $D_\omega$ is a binary classifier trained on $(b_t, a_t)$ tuples from the agent and expert demonstrations (Equation

3.4 Belief Regularization

With the mini-max objective (Equation 8), it may be possible that the belief parameters ($\phi$) are driven towards a degenerate solution that ignores the history $h_t$, thereby producing constant (or similar) beliefs for policy and expert trajectories. Indeed, if we omit the actions ($a$) in the discriminator $D_\omega(b, a)$, a constant belief output is an optimal solution for Equation 8. To learn a belief representation that captures relevant historical context and is useful for deriving optimal policies, we add forward-, inverse- and action-regularization to the belief module. We define and motivate them from the perspective of mutual information maximization.

For two continuous random variables $X, Y$, mutual information is defined as $I(X; Y) = \mathcal{H}(X) - \mathcal{H}(X \mid Y)$, where $\mathcal{H}$ denotes the differential entropy (the usual notation for differential entropy, $h$, is not used to avoid confusion with the history used in previous sections). Intuitively, $I(X; Y)$ measures the dependence between $X$ and $Y$. Conditional mutual information is defined as $I(X; Y \mid Z) = \mathcal{H}(X \mid Z) - \mathcal{H}(X \mid Y, Z)$. Given $Y$, if $X$ and $Z$ are independent ($X \perp Z \mid Y$), then $X, Y, Z$ form a Markov chain ($X \rightarrow Y \rightarrow Z$), and the data processing inequality for mutual information states that $I(X; Z) \le I(X; Y)$.
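The data processing inequality can be checked numerically on a toy Markov chain. The sketch below uses discrete binary variables rather than the continuous ones discussed in the text (an assumption made so that mutual information reduces to a finite sum): $X$ is a fair bit, $Y$ is $X$ passed through a channel that flips it with probability 0.1, and $Z$ is $Y$ passed through a second channel with flip probability 0.2. The helper names and the flip probabilities are illustrative.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

def noisy_copy_joint(flip):
    """Joint table of a fair bit and a copy of it that is flipped with probability `flip`."""
    return [[0.5 * (1 - flip), 0.5 * flip], [0.5 * flip, 0.5 * (1 - flip)]]

flip_xy, flip_yz = 0.1, 0.2
# Two flip channels in series flip the bit when exactly one of them flips it.
flip_xz = flip_xy * (1 - flip_yz) + (1 - flip_xy) * flip_yz

i_xy = mutual_information(noisy_copy_joint(flip_xy))
i_xz = mutual_information(noisy_copy_joint(flip_xz))
```

Here `i_xz` comes out smaller than `i_xy`: processing $Y$ further cannot add information about $X$, which is the property used in the bounds that follow.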

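Forward regularization. As discussed in Section 3.1, an ideal belief representation completely characterizes the posterior over the true environment states $s_t$. Therefore, it ought to be correlated with future true states ($s_{t+k}$), conditioned on the intervening future actions $a_{t:t+k-1}$. We frame this objective as maximization of the following conditional mutual information: $I(b_t; s_{t+k} \mid a_{t:t+k-1})$. Since $b_t \perp o_{t+k} \mid s_{t+k}$ because of the observation generation process in a POMDP, we get the following after using the data processing inequality for mutual information:

$$I(b_t; s_{t+k} \mid a_{t:t+k-1}) \;\ge\; I(b_t; o_{t+k} \mid a_{t:t+k-1}) \;=\; \mathcal{H}(o_{t+k} \mid a_{t:t+k-1}) - \mathcal{H}(o_{t+k} \mid b_t, a_{t:t+k-1}) \;\ge\; \mathcal{H}(o_{t+k} \mid a_{t:t+k-1}) + \mathbb{E}\left[\log q(o_{t+k} \mid b_t, a_{t:t+k-1})\right] \qquad (9)$$

where the final inequality follows because we can lower bound the mutual information using a variational approximation $q$, similar to the variational information maximization algorithm (Agakov & Barber, 2004). Therefore, we maximize a lower bound to the mutual information with the surrogate objective:

$$\max_{\phi, q} \; \mathbb{E}\left[\log q(o_{t+k} \mid b_t, a_{t:t+k-1})\right]$$

With the choice of a unimodal Gaussian (learned function $g_f$ for the mean, and fixed variance) for the variational distribution $q$, the loss function for forward regularization of the belief module is:

$$L^{f}(\phi) = \mathbb{E}_{\mathcal{R}}\left[\left\| o_{t+k} - g_f(b_t, a_{t:t+k-1}) \right\|_2^2\right]$$

where the expectation is over trajectories sampled from the replay buffer $\mathcal{R}$.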
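For reference, the step from a fixed-variance Gaussian variational distribution to a squared-error loss can be written out explicitly. The notation here is assumed for illustration: $x$ is the prediction target, $c$ the conditioning variables, $g$ the learned mean function, $\sigma^2$ the fixed variance, and $d$ the dimension of $x$.

```latex
\begin{aligned}
-\log q(x \mid c)
  &= -\log \mathcal{N}\!\left(x;\; g(c),\; \sigma^2 I\right) \\
  &= \frac{1}{2\sigma^2}\,\lVert x - g(c) \rVert_2^2
     \;+\; \frac{d}{2}\,\log\!\left(2\pi\sigma^2\right)
\end{aligned}
```

Since the second term does not depend on the parameters when $\sigma$ is fixed, maximizing the expected log-likelihood is equivalent to minimizing the expected squared error, up to the positive scale factor $1/(2\sigma^2)$.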

Inverse regularization. It is desirable that the belief at time $t$ is correlated with the past true states ($s_{t-k}$), conditioned on the intervening past actions $a_{t-k:t-1}$. This should improve the belief representation by helping to capture long-range dependencies. Proceeding in a manner similar to above, the conditional mutual information between these signals, $I(b_t; s_{t-k} \mid a_{t-k:t-1})$, can be lower bounded by $I(b_t; o_{t-k} \mid a_{t-k:t-1})$ using the data processing inequality. Again, as before, this can be further lower bounded using a variational distribution $q$ for generating the past observation $o_{t-k}$. A unimodal Gaussian (mean function $g_i$) for $q$ yields the following loss for inverse regularization, which is optimized using trajectories from the replay buffer $\mathcal{R}$:

$$L^{i}(\phi) = \mathbb{E}_{\mathcal{R}}\left[\left\| o_{t-k} - g_i(b_t, a_{t-k:t-1}) \right\|_2^2\right]$$

Action regularization. We also wish to maximize $I(a_{t:t+k-1}; s_{t+k} \mid b_t)$ for the reason that, conditioned on the current belief $b_t$, a sequence of subsequent actions should provide information about the resulting true future state ($s_{t+k}$). Similar lower bounding and use of a variational distribution with mean function $g_a$ for generating action-sequences gives us the loss:

$$L^{a}(\phi) = \mathbb{E}_{\mathcal{R}}\left[\left\| a_{t:t+k-1} - g_a(b_t, o_{t+k}) \right\|_2^2\right]$$

The complete loss function for training the belief module results from a weighted combination of the imitation-loss and regularization terms. Imitation-loss uses on-policy data and expert demonstrations $\mathcal{M}_E$, while the regularization losses are computed with on-policy and off-policy data, as well as $\mathcal{M}_E$.

$$L(\phi) = D_{JS} + \lambda_f L^{f}(\phi) + \lambda_i L^{i}(\phi) + \lambda_a L^{a}(\phi)$$

We derive our final expressions for $L^{f}$, $L^{i}$ and $L^{a}$ by modeling the respective variational distributions as fixed-variance, unimodal Gaussians. We later show that using this simple model results in appreciable performance benefits for imitation learning. Other expressive model classes, such as mixture density networks and flow-based models (Rezende & Mohamed, 2015), can be readily used as well, to learn complex and multi-modal distributions over the future observations $o_{t+k}$, past observations $o_{t-k}$ and action-sequences $a_{t:t+k-1}$.
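The three regularizers differ mainly in which slice of a trajectory is used as conditioning input and which as prediction target. The sketch below lays out that indexing for a given time $t$ and offset $k$. It assumes that `actions[i]` is the action taken after `observations[i]`, that the intervening action windows are $a_{t:t+k-1}$ and $a_{t-k:t-1}$, and that the action regularizer conditions on the future observation $o_{t+k}$; these conventions and the function name are assumptions made for this example, and the belief $b_t$ (an additional input to every predictor) is omitted.

```python
def regularization_examples(observations, actions, t, k):
    """Select conditioning inputs and targets for k-step regularization at time t.

    Each predictor would additionally receive the belief b_t as input;
    only the trajectory slices are returned here.
    """
    if k < 1 or t - k < 0 or t + k >= len(observations) or t + k > len(actions):
        raise ValueError("t and k must leave k steps on both sides of t")
    future_actions = actions[t:t + k]   # a_t, ..., a_{t+k-1}
    past_actions = actions[t - k:t]     # a_{t-k}, ..., a_{t-1}
    return {
        # Forward: predict the future observation from the future actions.
        "forward": {"condition": future_actions, "target": observations[t + k]},
        # Inverse: predict the past observation from the past actions.
        "inverse": {"condition": past_actions, "target": observations[t - k]},
        # Action: predict the future action sequence from the future observation.
        "action": {"condition": observations[t + k], "target": future_actions},
    }

examples = regularization_examples(
    observations=["o0", "o1", "o2", "o3", "o4"],
    actions=["a0", "a1", "a2", "a3"],
    t=2,
    k=2,
)
```

With `t=2` and `k=2`, the forward target is `"o4"` conditioned on `["a2", "a3"]`, and the inverse target is `"o0"` conditioned on `["a0", "a1"]`.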

3.5 Learning Algorithm

Figure 2 shows the schematic diagram of our complete architecture, including an overview of the implemented neural networks. In Algorithm 1, we outline the major steps of the training procedure. In each iteration, we run the policy for a few steps and obtain shaped rewards from the current discriminator (Line 6). The policy parameters are then updated using A2C, the synchronous adaptation of asynchronous advantage actor-critic (A3C; Mnih et al. (2016)), as the policy-gradient algorithm (Line 10). Other RL algorithms, such as those based on trust-region methods (Schulman et al., 2015b), could also be readily used. Similar to the policy (actor), the baseline (critic) used for reducing the variance of the stochastic gradient estimate is also conditioned on the belief. To further reduce variance, Generalized Advantage Estimation (GAE; Schulman et al. (2015c)) is used to compute the advantage. Apart from the policy-gradient, on-policy data also enables computing the gradient for the discriminator network (Line 13) and the belief module (Line 14). The belief is further refined by minimizing the regularization losses on off-policy data from the replay buffer (Line 15).
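A minimal sketch of Generalized Advantage Estimation over a single rollout is given below, following the standard recursion $\delta_t = r_t + \gamma V_{t+1} - V_t$ and $A_t = \delta_t + \gamma\lambda A_{t+1}$. Here the rewards would be the discriminator-shaped rewards and the values would come from the belief-conditioned critic; the function name and argument layout are assumptions of this example, and episode-termination masking is left out for brevity.

```python
def gae_advantages(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimation for one rollout without terminal steps.

    rewards[t], values[t]: reward and critic estimate at step t.
    last_value: critic estimate for the state following the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `gamma = lam = 1` the estimate reduces to the Monte Carlo return minus the baseline; for instance `gae_advantages([1.0, 1.0], [0.5, 0.5], 0.0, 1.0, 1.0)` gives `[1.5, 0.5]`.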

The regularization losses described in Section 3.4 include a hyperparameter $k$ that controls the temporal offset of the predictions. For instance, for inverse regularization, the larger the $k$, the farther back in time the observation predictions are made, conditioned on the current belief and past actions. The temporal abstractions provided by multi-step predictions should help to extract more global information from the input stream into the belief representation. Our ablations (Section 5) show the performance benefit of including multi-step losses. Various strategies for selecting $k$ are possible, such as uniform sampling from a range (Guo et al., 2018) and adaptive selection based on a curriculum (Oh et al., 2015). For simplicity, we choose fixed values, and leave the exploration of the more sophisticated approaches to future work. Hence, our total regularization loss comprises single-step and multi-step forward-, inverse-, and action-prediction losses. For encoding a sequence of past or future actions into a compact representation, we use multi-layer convolution networks (Figure 2).

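Algorithm 1 Belief-module Imitation Learning (BMIL)
1   for each iteration do
2       Initialize two empty buffers
        /* Rollout steps from policy */
3       repeat
4           Get observation $o_t$ from environment
5           $a_t \sim \pi_\theta(a_t \mid b_t)$, where $b_t = B_\phi(h_t)$
6           Obtain shaped reward from the current discriminator $D_\omega$
8           If the step is terminal, add rollout to $\mathcal{R}$
9       until the rollout length is reached;
        /* Update Policy */
10      Update $\theta$ with policy-gradient (Eq. 4)
        /* Update discriminator */
11      Fetch $(o, a)$ sequences of fixed length from $\mathcal{M}_E$
12      Generate belief-action tuples $(b_t, a_t)$
13      Update $\omega$ with log-loss objective using on-policy and expert tuples
        /* Update Belief Module */
14      Update $\phi$ with the imitation loss using on-policy data and $\mathcal{M}_E$ (Eq. 10)
        /* Off-policy Updates */
15      for few update steps do
16          Fetch $(o, a)$ sequences of fixed length from $\mathcal{R}$
17          Update $\phi$ with the regularization losses
18      end for
19  end for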

4 Related Work

While we cannot do full justice to the extensive literature on algorithms for training agents in POMDPs, we here mention some recent related work. Most prior algorithms for POMDPs assume access to a predefined reward function. These include approaches based on Q-learning (DRQN; Hausknecht & Stone (2015)), policy-gradients (Igl et al., 2018), partially observed guided policy search (Zhang et al., 2016), and planning methods (Silver & Veness, 2010; Ross et al., 2008; Pineau et al., 2003). In contrast, we propose to adapt ideas from generative adversarial imitation learning to learn policies in POMDPs without environmental rewards.

Learning belief states from history was recently explored in Guo et al. (2018). The authors show that training the belief representation with a Contrastive Predictive Coding (CPC, Oord et al. (2018)) loss on future observations, conditioned on future actions, helps to infer knowledge about the underlying state of the environment. Predictive State Representations (PSRs) offer another approach to modeling the belief state in terms of observable data (Littman & Sutton, 2002). The assumption in PSRs is that the filtering distribution can be approximated with a distribution over the future observations, conditioned on future actions, . PSRs combined with RNNs have been shown to improve representations by predicting future observations (Venkatraman et al., 2017; Hefny et al., 2018). While we also make future predictions, a key difference compared to aforementioned methods is that our belief representation is additionally regularized by predictions in the past, and in action-space, which we later show benefits our approach.

State-space models (SSMs; Fraccaro et al. (2016); Goyal et al. (2017); Buesing et al. (2018)), which represent the unobserved environment states with latent variables, have also been used to obtain belief states.  Igl et al. (2018) use a particle-filtering method to train a VAE, and represent the belief state with a weighted collection of particles. The model is also updated with the RL-loss using a belief-conditioned policy.  Gregor & Besse (2018) proposed TD-VAE, which explicitly connects belief distributions at two distant timesteps, and enforces consistency between them using a transition distribution and smoothing posterior. Although we use a deterministic model for our belief module , our methods apply straightforwardly to SSMs as well.

5 Experiments

Figure 3: Mean episode-returns vs. timesteps of environment interaction. BMIL is our proposed architecture (Figure 2); BMIL w/o Reg excludes the various regularization terms (Section 3.4) from this design; Task-Agnostic learns the belief module separately from the policy using a task-agnostic loss (the autoregressive loss, Section 3.3). We plot the average and standard-deviation over 5 random seeds.

The goal in this section is to evaluate and analyze the performance of our proposed architecture for imitation learning in partially-observable environments, given some expert demonstrations. Herein, we describe our environments, provide comparisons with GAIL, and perform ablations to study the importance of the design decisions that motivate our architecture.

Figure 4: Comparison of sensor information available to the agent in the MDP (original) and the POMDP (modified) settings for Hopper-v2 from the Gym MuJoCo suite.

Partially-observable locomotion tasks. We benchmark on high-dimensional, continuous-control locomotion environments based on the MuJoCo physics simulator, available in OpenAI Gym (Brockman et al., 2016). The standard Gym MuJoCo library of tasks, however, consists of MDPs (and not POMDPs), since observations in such tasks contain sufficient state-information to learn an optimal policy conditioned on only the current observation. As such, it has been extensively used to evaluate the performance of reinforcement-learning and imitation-learning algorithms in the MDP setting (Schulman et al., 2017; Ho & Ermon, 2016). To transform these tasks into POMDPs, we follow an approach similar to Duan et al. (2016), and redact some sensory data from the observations. Specifically, from the default observations, we remove measurements for the translational and angular velocities of the torso, and also the velocities for all the link joints. We denote the original (MDP) observations by $s$, and the curtailed (POMDP) observations by $o$. Figure 4 shows the Hopper agent composed of 4 links connected via actuated joints, and describes the original MDP sensors and our POMDP modifications. Similar information for other MuJoCo tasks is included in Appendix 7.2.
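The redaction step can be sketched as a simple slice of the observation vector. This assumes that the default observation lists position-type measurements first and velocity measurements last, so that dropping all velocities amounts to keeping a prefix; that layout, the split point, and the function name are assumptions of this example, and the exact indices for each task would have to be taken from the environment definition.

```python
def redact_velocities(observation, num_position_dims):
    """Keep only the leading position-type entries of a full observation.

    Assumes velocities occupy the trailing entries of the vector; the split
    point `num_position_dims` is task-specific and supplied by the caller.
    """
    if not 0 <= num_position_dims <= len(observation):
        raise ValueError("split point outside the observation vector")
    return list(observation[:num_position_dims])

# Toy 11-dimensional observation split into 5 kept and 6 removed entries.
full_observation = [0.1 * i for i in range(11)]
partial_observation = redact_velocities(full_observation, 5)
```

In practice this would be applied to every observation returned by the environment, before it reaches the belief module, policy, or discriminator.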

For all experiments, we assume access to 50 expert demonstrations of the type $\{o_0, a_0, o_1, a_1, \ldots\}$, for each of the tasks. The policy and discriminator networks are feed-forward MLPs with two 64-unit layers. The policy network outputs include the action mean and per-action variances (i.e. actions are assumed to have an independent Gaussian distribution). In the belief module, the dimension of the GRU cell is 256, while the MLPs used for regularization have two 64-unit feed-forward layers. More details and hyperparameters are in the Appendix.


5.1 Comparison to GAIL

Our first baseline is modeled after the architecture used in the original GAIL approach (Ho & Ermon, 2016). It consists of feed-forward policy and discriminator networks, without the recurrent belief module. The policy is conditioned on the observation $o_t$, and the discriminator performs binary classification on $(o_t, a_t)$ tuples. The update rules for the policy and discriminator are obtained in a similar way to Equation 4 and Equation 2, respectively, by replacing the belief $b_t$ with the observation $o_t$. The next baseline, referred to as GAIL+Obs. stack, augments GAIL by concatenating the 3 previous observations to each $o_t$, and feeding the entire stack as input to the policy and discriminator networks. This approach has been found to extract useful historical context for a better state-representation (Mnih et al., 2015). We abbreviate our complete proposed architecture (Figure 2) by BMIL, short for Belief-Module Imitation Learning. BMIL jointly trains the policy, belief and discriminator networks using a mini-max objective (Equation 8), and additionally regularizes the belief with multi-step predictions.

Table 1 compares the performance of the different designs on POMDP MuJoCo. We show the mean episode-returns, averaged over 5 runs with random seeds, after 10M timesteps of interaction with the environment. We observe that GAIL, both with and without observation stacking, is unable to successfully imitate the expert behavior. Since the observation alone does not contain adequate information, the policy conditioned on it performs sub-optimally. Also, the discriminator trained on $(o_t, a_t)$ tuples does not provide robust shaped rewards. Using the full state $s_t$ instead of $o_t$ in our experiments leads to successful imitation with GAIL, suggesting that the performance drop in Table 1 is due to partial observability, rather than other artifacts such as insufficient network capacity, or lack of algorithmic or hyperparameter tuning.

Further, we see that techniques such as stacking past observations provide only a marginal improvement in some of the tasks. In contrast, in BMIL, the belief module curates a belief representation from the history of past observations and actions, which is used both for discriminator training and to condition the action-distribution (policy). BMIL achieves scores very close to those of the expert.
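The observation-stacking baseline described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the class name `ObservationStacker`, the zero-padding at episode start, and the oldest-first ordering are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, List, Sequence


class ObservationStacker:
    """Concatenates the current observation with the `num_prev` previous ones."""

    def __init__(self, obs_dim: int, num_prev: int = 3) -> None:
        self.obs_dim = obs_dim
        self.window_size = num_prev + 1  # previous observations plus the current one
        self.window: Deque[List[float]] = deque(maxlen=self.window_size)
        self.reset()

    def reset(self) -> None:
        # Assumption: missing history at the start of an episode is zero-padded.
        self.window.clear()
        for _ in range(self.window_size):
            self.window.append([0.0] * self.obs_dim)

    def push(self, obs: Sequence[float]) -> List[float]:
        if len(obs) != self.obs_dim:
            raise ValueError("unexpected observation size")
        # Appending to a full deque drops the oldest observation.
        self.window.append(list(obs))
        # Flatten with the oldest observation first and the current one last.
        return [x for past_obs in self.window for x in past_obs]


stacker = ObservationStacker(obs_dim=2, num_prev=3)
first = stacker.push([1.0, 2.0])
second = stacker.push([3.0, 4.0])
```

The stacked vector has a fixed size of `(num_prev + 1) * obs_dim`, so it can be fed to the same feed-forward policy and discriminator, at the cost of a hard limit on how much history is visible.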


Task | GAIL | GAIL + Obs. stack | BMIL | Expert (Avg.)
Inv.DoublePend. | 109 | 1351 | 9104 | 9300
Hopper | 157 | 517 | 2665 | 3200
Ant | 895 | 1056 | 1832 | 2400
Walker | 357 | 562 | 4038 | 4500
Humanoid | 1686 | 1284 | 4382 | 4800
Half-cheetah | 205 | -948 | 5860 | 6500
Table 1: Mean episode-returns, averaged over 5 runs with random seeds, after 10M timesteps in POMDP MuJoCo.

5.2 Comparison to GAIL with Recurrent Networks

For our next two baselines, we replace the feed-forward networks in GAIL with GRUs. GAIL-RF uses a recurrent policy and a feed-forward discriminator, while in GAIL-RR, both the policy and the discriminator are recurrent. In both these baselines, the belief is created internally in the recurrent policy module. Importantly, unlike BMIL, the belief is not shared between the policy and the discriminator. The average final performance of GAIL-RF and GAIL-RR in our POMDP environments is shown in Table 2. We observe that GAIL-RR does not perform well on most of the tasks. A plausible explanation for this is that using the adversarial binary classification loss for training the discriminator parameters does not induce a sufficient representation of the belief state in its recurrent network. The other baseline, GAIL-RF, avoids this issue with a feed-forward discriminator trained on $(o_t, a_t)$ tuples from the expert and the policy. This leads to much better performance. However, BMIL consistently outperforms GAIL-RF, most significantly in Humanoid (1.6× higher score), indicating the effectiveness of the decoupled architecture and other design decisions that motivate BMIL. Figure 7 (in Appendix) plots the learning curves for these experiments. Appendix 7.5 further shows that BMIL outperforms the strongest baseline (GAIL-RF) on additional environments with accentuated partial observability.




Task | GAIL-RR | GAIL-RF | BMIL
Inv.DoublePend. | 8965 | 9103 | 9104
Hopper | 955 | 2164 | 2665
Ant | -533 | 1612 | 1832
Walker | 400 | 3188 | 4038
Humanoid | 3829 | 2761 | 4382
Half-cheetah | -922 | 5011 | 5860
Table 2: Mean episode-returns, averaged over 5 runs with random seeds, after 10M timesteps in POMDP MuJoCo.
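As a quick check of the relative gains quoted in the text, the per-task ratio of BMIL to GAIL-RF can be computed directly from Table 2. The sketch below assumes that the second and third numeric columns of the table are GAIL-RF and BMIL, respectively; the variable names are illustrative.

```python
# Final mean episode-returns copied from Table 2.
# Assumption: second numeric column is GAIL-RF, third is BMIL.
gail_rf = {
    "Inv.DoublePend.": 9103,
    "Hopper": 2164,
    "Ant": 1612,
    "Walker": 3188,
    "Humanoid": 2761,
    "Half-cheetah": 5011,
}
bmil = {
    "Inv.DoublePend.": 9104,
    "Hopper": 2665,
    "Ant": 1832,
    "Walker": 4038,
    "Humanoid": 4382,
    "Half-cheetah": 5860,
}

# Ratio > 1 means BMIL attains a higher return than GAIL-RF on that task.
ratios = {task: bmil[task] / gail_rf[task] for task in bmil}
best_task = max(ratios, key=ratios.get)

for task, ratio in ratios.items():
    print(f"{task:16s} BMIL / GAIL-RF = {ratio:.2f}")
```

Under this column assignment the largest ratio is on Humanoid, which is consistent with the gain highlighted in the text.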

5.3 Ablation Studies

How crucial is belief regularization? To quantify this, we compare the performance of our architecture with and without belief regularization (BMIL vs. BMIL w/o Reg.). Figure 3 plots the mean episode-returns vs. timesteps of environment interaction, over the course of learning. We observe that including regularization leads to better episode-returns and sample-complexity for most of the tasks considered, indicating that it is useful for shaping the belief representations. The single- and multi-step predictions for the forward and inverse regularization terms are made in the observation space. Although we see good improvements for tasks with high-dimensional spaces, such as Humanoid and Ant, predicting an entire raw observation may be sub-optimal and computationally wasteful for some tasks, since it requires modeling the complex structures within an observation. To avoid this, predictions can be made in a compact encoding space (output of Encoder in Figure 2). In Appendix 7.4, we show the performance of BMIL with predictions in encoding-space rather than observation-space, and note that the scores are quite similar.

Task-aware vs. Task-agnostic belief learning. Next, we compare with a design in which the belief module is trained separately from the policy, using a task-agnostic loss (Section 3.3). This echoes the philosophy used in World-Models (Ha & Schmidhuber, 2018). As Figure 3 shows, this results in mixed success for imitation learning in POMDPs. While the agent achieves good scores in tasks such as Ant and HalfCheetah, the performance in Walker and Hopper is unsatisfactory. In contrast, BMIL, which uses a task-aware imitation-loss for the belief module, is consistently better.

Are all of the regularization terms useful? We introduced 3 different regularization terms for the belief module: forward, inverse, and action. To assess their benefit individually, in Figure 5, we plot learning curves for two tasks, with each of the regularizations applied in isolation. We compare them with BMIL, which includes all of them, and BMIL w/o Reg., which excludes all of them (no regularization). For the Ant task, we notice that each of the three terms provides a performance improvement over the no-regularization baseline, and combining them (BMIL) performs the best. With the Walker task, we see better mean episode-returns at the end of training with each of the three terms, compared to no-regularization; BMIL attains the best sample-complexity.

Figure 5: Ablation on components of belief regularization. Forward-, Inverse-, Action-only correspond to using the forward, inverse, and action regularization terms, respectively, in isolation, without the other two.

Are multi-step predictions useful? As argued before, making coarse-scale, multi-step predictions of forward and inverse observations and of action-sequences could improve representations by providing temporal abstractions. In Figure 6, we plot BMIL, which uses both single- and multi-step losses, and compare it with two versions: the first uses a different temporal offset, and the second predicts only at the single-step granularity. For both tasks, we get better sample-complexity and higher final mean episode-returns with multi-step predictions, suggesting their positive contribution to representation learning.
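To make the single- versus multi-step distinction concrete, the sketch below enumerates the prediction targets at several temporal offsets for one timestep of a trajectory. It is an illustration under assumptions: the function name `prediction_targets`, the scalar observations, the offsets used in the example, and the indexing convention (action `a[t]` is taken after observing `o[t]`) are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def prediction_targets(
    observations: Sequence[float],
    actions: Sequence[float],
    t: int,
    offsets: Sequence[int],
) -> Dict[int, Tuple[float, float, List[float]]]:
    """For each offset k, return (o[t+k], o[t-k], a[t:t+k]) as regression targets."""
    targets: Dict[int, Tuple[float, float, List[float]]] = {}
    for k in offsets:
        # Skip offsets that would reach outside the recorded trajectory.
        if t - k < 0 or t + k >= len(observations):
            continue
        forward_target = observations[t + k]    # future observation (forward term)
        inverse_target = observations[t - k]    # past observation (inverse term)
        action_target = list(actions[t:t + k])  # action-sequence spanning t .. t+k-1
        targets[k] = (forward_target, inverse_target, action_target)
    return targets


observations = [float(i) for i in range(10)]
actions = [10.0 + i for i in range(9)]
# Offset 1 is the single-step case; larger offsets are the multi-step additions.
targets = prediction_targets(observations, actions, t=4, offsets=(1, 3, 6))
```

With `offsets=(1,)` only the single-step targets are produced; adding larger offsets yields the coarser-scale targets that the multi-step design relies on.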

Figure 6: Ablation on the temporal-offset hyperparameter in the regularization terms. The multi-step design builds on the single-step design by adding predictions at different temporal offsets.

6 Conclusion and Future Work

In this paper, we study imitation learning for POMDPs, which has been considerably less explored than imitation learning for MDPs and learning in POMDPs with predefined reward functions. We introduce a framework comprising a belief module, and policy and discriminator networks conditioned on the generated belief. Crucially, all networks are trained jointly with a min-max objective for adversarial imitation of expert trajectories.

Within this flexible setup, many instantiations are possible, depending on the choice of networks. Both feed-forward and recurrent networks can be used for the policy and the discriminator, while for the belief module there is an expansive set of options based on the rich literature on representation learning, such as CPC (Guo et al., 2018), Z-forcing (Ke et al., 2018), and using auxiliary tasks (Jaderberg et al., 2016). Many more methods based on state-space latent-variable models are also applicable (cf. Section 4). In our instantiation of the belief module, we use the task-based imitation loss (Equation 6), and improve robustness of representations by regularizing with multi-step prediction of past/future observations and action-sequences. One benefit of our proposed framework is that in future work, it would be straightforward to substitute other methods for learning belief representations for the one we arrived at in our paper. Similarly, recent advancements in GAN and RL literature could guide the development of better discriminator and policy networks for imitation learning in POMDPs. Exploring these options, as well as their interplay, is an important direction.


7 Appendix

Figure 7: Mean episode-returns vs. timesteps of environment interaction. BMIL is our proposed architecture (Figure 2); GAIL-RF uses a recurrent policy and a feed-forward discriminator, while in GAIL-RR, both the policy and the discriminator are recurrent. We plot the average and standard-deviation over 5 random seeds.

7.1 Proof of inequalities in Section 3.2

We first prove the inequality connecting the state-visitation distribution and the belief-visitation distribution of the agent and the expert:

$$D_{JS}\big[\rho_\pi(s)\,\|\,\rho_E(s)\big] \;\le\; D_{JS}\big[\rho_\pi(b)\,\|\,\rho_E(b)\big]$$

The proof is a simple application of the data-processing inequality for $f$-divergences (Ali & Silvey, 1966), of which $D_{JS}$ is a type.

We denote the filtering posterior distribution over states, given the belief, by $p(s|b)$. Note that $p(s|b)$ is characterized by the environment, and does not depend on the policy (agent or expert). The posterior over belief, given the state, however, is policy-dependent and obtained using Bayes rule as: $p_\pi(b|s) = p(s|b)\,\rho_\pi(b) / \rho_\pi(s)$. Also, $\rho_\pi(s) = \int_b p(s|b)\,\rho_\pi(b)\,db$. Analogous definitions exist for the expert ($E$).

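We write $D_{JS}$ in terms of the template used for $f$-divergences. Let $f : (0, \infty) \rightarrow \mathbb{R}$ be the following convex function with the property $f(1) = 0$: $f(u) = -(u+1)\log\frac{1+u}{2} + u\log u$. Then,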
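A sketch of the data-processing step, in assumed notation: $\rho_\pi$ and $\rho_E$ denote the agent and expert visitation distributions, $p(s|b)$ the policy-independent filtering posterior, $p_E(b|s)$ the expert's posterior over beliefs given the state, and $D_f[P\,\|\,Q] = \int Q\, f(P/Q)$ the $f$-divergence template. These symbols and the exact sequence of steps are illustrative assumptions; the argument itself is the standard Jensen-inequality proof of the data-processing inequality.

```latex
\begin{align*}
D_f\big[\rho_\pi(s)\,\|\,\rho_E(s)\big]
&= \int_s \rho_E(s)\, f\!\left(\frac{\rho_\pi(s)}{\rho_E(s)}\right) ds \\
&= \int_s \rho_E(s)\, f\!\left(\int_b \frac{p(s|b)\,\rho_\pi(b)}{\rho_E(s)}\, db\right) ds
   && \text{(marginalize over } b\text{)} \\
&= \int_s \rho_E(s)\, f\!\left(\int_b p_E(b|s)\,\frac{\rho_\pi(b)}{\rho_E(b)}\, db\right) ds
   && \text{(Bayes rule for the expert)} \\
&\le \int_s \rho_E(s) \int_b p_E(b|s)\, f\!\left(\frac{\rho_\pi(b)}{\rho_E(b)}\right) db\, ds
   && \text{(Jensen, } f \text{ convex)} \\
&= \int_b \rho_E(b)\, f\!\left(\frac{\rho_\pi(b)}{\rho_E(b)}\right)
   \left(\int_s p(s|b)\, ds\right) db
   && \text{(since } \rho_E(s)\,p_E(b|s) = p(s|b)\,\rho_E(b)\text{)} \\
&= D_f\big[\rho_\pi(b)\,\|\,\rho_E(b)\big].
\end{align*}
```

The only place the policy-independence of $p(s|b)$ enters is the Bayes-rule step: the same $p(s|b)$ appears in both the agent's and the expert's marginals, which lets the ratio inside $f$ be rewritten as an expectation over $p_E(b|s)$.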

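Similarly, we can prove the inequality connecting the belief-visitation distribution and the belief-action-visitation distribution of the agent and the expert:

$$D_{JS}\big[\rho_\pi(b)\,\|\,\rho_E(b)\big] \;\le\; D_{JS}\big[\rho_\pi(b,a)\,\|\,\rho_E(b,a)\big]$$

Replace $s \rightarrow b$ and $b \rightarrow (b,a)$ in the previous proof. The only required condition for that result to hold is the non-dependence of the distribution $p(s|b)$ on the policy. Therefore, if it holds that $p(b|b,a)$ is independent of the policy, then we have,

$$D_{JS}\big[\rho_\pi(b)\,\|\,\rho_E(b)\big] \;\le\; D_{JS}\big[\rho_\pi(b,a)\,\|\,\rho_E(b,a)\big]$$

The independence holds under the trivial case of a deterministic mapping $(b,a) \mapsto b$. This gives us the desired inequality. ∎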

7.2 MDP and POMDP Sensors

A description of the sensor measurements given to the agent in the MDP and POMDP environments is provided in Table 3. As an example, for the Hopper task, the MDP space is 11-dimensional, which includes 6 velocity sensors and 5 position sensors, whereas the POMDP space is 5-dimensional, comprising 5 position sensors. Among the sensor categories, velocity includes the translational and angular velocities of the torso, and also the velocities of all the joints; position includes the torso position and orientation (quaternion), and the angles of the joints. The sensors in the MDP column marked in bold are removed in the POMDP setting.

Environment | MDP sensors | POMDP sensors
Hopper | velocity(6) + position(5) | position(5)
Half-Cheetah | velocity(9) + position(8) | position(8)
Walker2d | velocity(9) + position(8) | position(8)
Inv.DoublePend. | velocity(3) + position(5) + actuator forces(3) | position(5) + actuator forces(3)
Ant | velocity(14) + position(13) + external forces(84) | position(13) + external forces(84)
Humanoid | velocity(23) + center-of-mass based velocity(84) + position(22) + center-of-mass based inertia(140) + actuator forces(23) + external forces(84) | position(22) + center-of-mass based inertia(140) + actuator forces(23) + external forces(84)
Table 3: MDP and POMDP sensors (MuJoCo). The sensors in the MDP column marked in bold are removed in the POMDP setting.
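The construction of a POMDP observation from the full MDP observation can be sketched as a simple masking step. The sensor counts below come from Table 3; the ordering of the sensor categories inside the observation vector, the dictionary `MDP_LAYOUT`, and the helper names are assumptions made for this example and may not match the simulator's actual layout.

```python
from typing import Dict, List, Sequence, Tuple

# Sensor layout per environment as (category, size) pairs, following Table 3.
# Assumption: the order of categories within the vector is illustrative only.
MDP_LAYOUT: Dict[str, List[Tuple[str, int]]] = {
    "Hopper": [("velocity", 6), ("position", 5)],
    "Inv.DoublePend.": [("velocity", 3), ("position", 5), ("actuator forces", 3)],
}


def pomdp_indices(
    layout: Sequence[Tuple[str, int]], removed: Sequence[str] = ("velocity",)
) -> List[int]:
    """Indices of the full observation that remain visible in the POMDP."""
    keep: List[int] = []
    start = 0
    for category, size in layout:
        if category not in removed:
            keep.extend(range(start, start + size))
        start += size
    return keep


def mask_observation(
    full_obs: Sequence[float],
    layout: Sequence[Tuple[str, int]],
    removed: Sequence[str] = ("velocity",),
) -> List[float]:
    """Drop the removed sensor categories from a full MDP observation."""
    expected = sum(size for _, size in layout)
    if len(full_obs) != expected:
        raise ValueError(f"expected {expected} sensor values, got {len(full_obs)}")
    return [full_obs[i] for i in pomdp_indices(layout, removed)]


hopper_full = [float(i) for i in range(11)]  # placeholder 11-dimensional MDP observation
hopper_partial = mask_observation(hopper_full, MDP_LAYOUT["Hopper"])
```

Passing `removed=("position",)` instead would give a velocity-only variant of the same kind of environment.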

7.3 Hyperparameters

Hyper-parameter | Value
Parameters for convolution networks (encoding past & future action-sequences) | Layers=2, Stride=1, Padding=1, Kernel_size=3, Num_filters={5,5}
Belief regularization coefficients |
Rollout length (c) in Algorithm 1 | 5
Size of expert demonstrations | 50 (trajectories)
Size of replay buffer | 1000 (trajectories)
Optimizer, learning rate | RMSProp, 3e-4 (linear-decay)
γ, λ (GAE) | 0.99, 0.95
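For reference, the two GAE values listed above are presumably the discount factor γ and the GAE parameter λ. A minimal sketch of generalized advantage estimation with these values follows; the function name, its signature, and the assumption that no episode terminates inside the rollout are illustrative and not taken from the paper's implementation.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,  # assumed to be the first GAE value in the table
    lam: float = 0.95,    # assumed to be the second GAE value in the table
) -> List[float]:
    """Generalized advantage estimates for one rollout without terminations."""
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = last_value  # bootstrap value for the state after the rollout
    running = 0.0
    # Work backwards so each advantage reuses the one from the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Example with a short rollout of constant reward and a zero value function.
example = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0], last_value=0.0)
```

With λ close to 1 the estimate leans on longer sums of discounted rewards, while smaller λ relies more on the value function.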

7.4 Predictions in encoding-space

In our approach, we regularize with single- and multi-step predictions in the space of raw observations. For many high-dimensional, complex spaces (e.g. visual inputs), it may be more efficient to operate in a lower-dimensional, compressed encoding-space, either pre-trained, or learnt online (Cuccu et al., 2019).

The encoder in our architecture (yellow box in Figure 2) pre-processes the raw observations before passing them to the RNN for temporal integration. We now evaluate BMIL with single- and multi-step predictions in the space of this encoder output. For instance, the forward regularization loss function is:

We do not pass the gradient through the target value, i.e., the encoder output that serves as the prediction target. The encoder is trained online as part of the belief module. Table 4 indicates that, for the tasks considered, BMIL performance is fairly similar when predicting in observation-space vs. encoding-space.
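As an illustration only, one plausible form of an encoding-space forward loss with a stopped gradient on the target is shown below. All symbols are assumptions introduced for this sketch: $g$ is a prediction network, $\mathrm{Enc}$ is the encoder, $\operatorname{sg}[\cdot]$ denotes the stop-gradient operator, $k$ is the temporal offset, and the squared-error form is assumed rather than taken from the paper.

```latex
\mathcal{L}^{f}_{\mathrm{enc}}
  = \mathbb{E}_{t}\,
    \Big\| \, g\big(b_t,\, a_{t:t+k-1}\big)
           - \operatorname{sg}\!\big[\mathrm{Enc}(o_{t+k})\big] \Big\|_2^2
```

Treating the target as a constant prevents the trivial solution in which the encoder collapses to a constant output to make prediction easy through this term alone.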

Task | BMIL: Predictions in observation-space | BMIL: Predictions in encoding-space
Inv.DoublePend. | 9104 ± 134 | 8883 ± 448
Hopper | 2665 ± 70 | 2700 ± 116
Ant | 1832 ± 92 | 1784 ± 44
Walker | 4038 ± 259 | 4043 ± 113
Humanoid | 4382 ± 117 | 4322 ± 263
Half-cheetah | 5860 ± 171 | 5912 ± 128
Table 4: Mean and std. of episode-returns, averaged over 5 random seeds, after 10M timesteps in POMDP MuJoCo.

7.5 Experiments in environment variants that accentuate partial observability

To test robustness of BMIL, we evaluate it on two new POMDP environment variants designed to make inferring the true state from given sensors more challenging. These new environments are:

  • Inv.DoublePend. from velocities only - The partially observable Inverted-Double-Pendulum used in Section 5 removes the velocity sensors to achieve partial observability, and provides as sensors only the cart-position and sin/cos of link angles. In this new environment, we remove the previously shown sensors (cart-position and link angles), and instead provide only the velocity sensors (which were removed in our original environment). Note that the motivation is to exacerbate partial observability by restricting sensors such that inferring the true state is more challenging (i.e. it is easier to infer velocity from subsequent positions than to integrate position over time from only velocity information). Indeed, our experiments indicate this is a harder imitation learning scenario.

  • Walker from velocities only - In the same spirit as above. We remove all position sensors and instead provide only the velocity sensors to the agent.
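The asymmetry noted above (velocity is easy to recover from consecutive positions, whereas position cannot be fully recovered from velocities alone) can be illustrated with a one-dimensional toy example. The time step `DT`, the trajectory values, and the helper names are hypothetical.

```python
from typing import List, Sequence

DT = 0.1  # hypothetical time between sensor readings (seconds)


def velocities_from_positions(positions: Sequence[float], dt: float = DT) -> List[float]:
    """Finite-difference velocity estimate from consecutive position readings."""
    return [(b - a) / dt for a, b in zip(positions, positions[1:])]


def positions_from_velocities(
    velocities: Sequence[float], initial_position: float, dt: float = DT
) -> List[float]:
    """Integrate velocities; the result depends on an unobserved starting position."""
    positions = [initial_position]
    for v in velocities:
        positions.append(positions[-1] + v * dt)
    return positions


true_positions = [2.0, 2.1, 2.3, 2.6]
velocities = velocities_from_positions(true_positions)

# With the correct starting position, integration recovers the trajectory.
recovered = positions_from_velocities(velocities, initial_position=2.0)
# With a wrong guess, every position is off by the same constant offset,
# and nothing in the velocity readings reveals that offset.
shifted = positions_from_velocities(velocities, initial_position=0.0)
```

In the velocity-only environments the agent is in the second situation: position-dependent quantities must be inferred from integrated history and are never directly observed.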

We compare BMIL to GAIL-RF (the strongest baseline).

Task | GAIL-RF | BMIL
Inv.DoublePend. (velocity only) | 4988 | 6578
Walker (velocity only) | 1539 | 4199
Table 5: Mean episode-returns, averaged over 5 runs with random seeds, after 10M timesteps.