BMIL
Code for "Learning Belief Representations for Imitation Learning in POMDPs" (UAI 2019)
We consider the problem of imitation learning from expert demonstrations in partially observable Markov decision processes (POMDPs). Belief representations, which characterize the distribution over the latent states in a POMDP, have been modeled using recurrent neural networks and probabilistic latent variable models, and shown to be effective for reinforcement learning in POMDPs. In this work, we investigate the belief representation learning problem for generative adversarial imitation learning in POMDPs. Instead of training the belief module and the policy separately as suggested in prior work, we learn the belief module jointly with the policy, using a task-aware imitation loss to ensure that the representation is more aligned with the policy's objective. To improve robustness of the representation, we introduce several informative belief regularization techniques, including multi-step prediction of dynamics and action-sequences. Evaluated on various partially observable continuous-control locomotion tasks, our belief-module imitation learning approach (BMIL) substantially outperforms several baselines, including the original GAIL algorithm and the task-agnostic belief learning algorithm. Extensive ablation analysis indicates the effectiveness of task-aware belief learning and belief regularization.
Recent advances in reinforcement learning (RL) have found successful applications in solving complex problems, including robotics, games, dialogue systems, and recommendation systems, among others. Despite such notable success, the application of RL is still largely limited to problems where the observation-space is rich in information and data generation is inexpensive. On the other hand, the environments in real-world problems, such as autonomous driving and robotics, are typically stochastic, complex and partially observable. To achieve robust and practical performance, RL algorithms should adapt to situations where the agent is being fed noisy and incomplete sensory information. To model these types of environments, partially observable Markov decision processes (POMDPs; Åström (1965)) have been proposed and widely studied. In a POMDP, since the current observation alone is insufficient for choosing optimal actions, the agent's history (its past observations and actions) is encoded into a belief state, which is defined as the distribution (representing the agent's beliefs) over the current latent state. Although belief states can be used to derive optimal policies (Kaelbling et al., 1998; Hauskrecht, 2000), maintaining and updating them requires knowledge of the transition and observation models of the POMDP, and is prohibitively expensive for high-dimensional spaces. To overcome this difficulty, several algorithms have been proposed that perform approximate inference of the belief state representation from raw observations, using recurrent neural networks (Guo et al., 2018), variational autoencoders (Igl et al., 2018; Gregor & Besse, 2018), and Predictive State Representations (Venkatraman et al., 2017). After the belief model has been learned, a policy optimization algorithm is then applied to the belief representation to optimize a predefined reward signal.
As an alternative to RL from predefined rewards, imitation learning often provides a fast and efficient way for training an agent to complete tasks. Expert demonstrations are provided to guide a learner agent to mimic the actions of the expert without the need to specify a reward function. A large volume of work has been done over the past decades on imitation learning for fully observable MDPs, including the seminal work on generative adversarial imitation learning (GAIL; Ho & Ermon (2016)), but there has been little focus on applying these ideas to partially observable environments.
In this paper, we study the problem of imitation learning for POMDPs. Specifically, we introduce a new belief representation learning approach for generative adversarial imitation learning in POMDPs. Different from previous approaches, where the belief state representation and the policy are trained in a decoupled manner, we learn the belief module jointly with the policy, using a task-aware imitation loss which helps to align the belief representation with the policy's objective. To avoid potential belief degeneration, we introduce several informative belief regularization techniques, including auxiliary losses of predicting multi-step past/future observations and action-sequences, which improve the robustness of the belief representation. Evaluated on various partially observable continuous-control locomotion tasks built from MuJoCo, our belief-module imitation learning approach (BMIL) substantially outperforms several baselines, including the original GAIL algorithm and the task-agnostic belief learning algorithm. Extensive ablation analysis indicates the effectiveness of task-aware belief learning and belief regularization.
Reinforcement Learning. We consider the RL setting where the environment is modeled as a partially-observable Markov decision process (POMDP). A POMDP is characterized by a tuple comprising the state-space $\mathcal{S}$, the action-space $\mathcal{A}$, and the observation-space $\mathcal{O}$, together with the transition dynamics, observation model, reward function, discount factor and initial state distribution defined below. The true environment states $s \in \mathcal{S}$ are latent or unobserved to the agent. Given an action $a_t \in \mathcal{A}$, the next state is governed by the transition dynamics $s_{t+1} \sim p(s_{t+1}|s_t,a_t)$, an observation is generated as $o_{t+1} \sim p(o_{t+1}|s_{t+1})$, and reward is computed as $r_t = r(s_t,a_t)$. The RL objective involves maximization of the expected discounted sum of rewards, $\eta(\pi_\theta) = \mathbb{E}_{p_0,p,\pi}\big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\big]$, where $\gamma$ is the discount factor, and $p_0$ is the initial state distribution. The action-value function is $Q^\pi(s_t,a_t) = \mathbb{E}_{p,\pi}\big[\sum_{t'=t}^{\infty}\gamma^{t'-t} r(s_{t'},a_{t'})\big]$. We define the unnormalized discounted state-visitation distribution for a policy $\pi$ by $\rho_\pi(s) = \sum_{t=0}^{\infty}\gamma^t P(s_t = s \mid \pi, p_0)$, where $P(s_t = s \mid \pi, p_0)$ is the probability of being in state $s$ at time $t$, when following policy $\pi$ and starting state $s_0 \sim p_0$. The expected policy return can then be written as $\eta(\pi_\theta) = \mathbb{E}_{(s,a)\sim\rho_\pi(s,a)}[r(s,a)]$, where $\rho_\pi(s,a) = \rho_\pi(s)\pi(a|s)$ is the state-action visitation distribution (also referred to as the occupancy measure). For any policy $\pi$, there is a one-to-one correspondence between $\pi$ and its occupancy measure (Puterman, 1994). Using the policy gradient theorem (Sutton et al., 2000), the gradient of the RL objective can be obtained as $\nabla_\theta \eta(\pi_\theta) = \mathbb{E}_{\rho_\pi(s,a)}\big[\nabla_\theta \log\pi_\theta(a|s)\, Q^\pi(s,a)\big]$.
Imitation Learning. Learning in popular RL algorithms (such as policy-gradients and Q-learning) is sensitive to the quality of the reward function. In many practical scenarios, the rewards are either unavailable or extremely sparse, leading to difficulty in temporal credit assignment (Sutton, 1984). In the absence of explicit environmental rewards, a promising approach is to leverage demonstrations of the completed task by experts, and learn to imitate their behavior. Behavioral cloning (BC; Pomerleau (1991)) poses imitation as a supervised-learning problem, and learns a policy by maximizing the likelihood of expert-actions in the states visited by the expert. The policies produced with BC are generally not very robust due to the issue of compounding errors; several approaches have been proposed to remedy this (Ross et al., 2011; Ross & Bagnell, 2014). Inverse Reinforcement Learning (IRL) presents a more principled approach to imitation by attempting to recover the cost function under which the expert demonstrations are optimal (Ng et al., 2000; Ziebart et al., 2008). Most IRL algorithms, however, are difficult to scale up computationally because they require solving an RL problem in their inner loop. Recently, Ho & Ermon (2016) proposed framing imitation learning as an occupancy-measure matching (or divergence minimization) problem. Their architecture (GAIL) forgoes learning the optimal cost function in order to achieve computational tractability and sample-efficiency (in terms of the number of expert demonstrations needed). In detail, if $\rho_\pi(s,a)$ and $\rho_E(s,a)$ represent the state-action visitation distributions of the policy and the expert, respectively, then minimizing the Jensen-Shannon divergence $D_{JS}[\rho_\pi(s,a)\,\|\,\rho_E(s,a)]$ helps to recover a policy with a similar trajectory distribution as the expert. GAIL iteratively trains a policy ($\pi_\theta$) and a discriminator ($D$) to optimize the minimax objective:
$$\min_{\theta}\max_{D}\; \mathbb{E}_{(s,a)\sim\mathcal{M}_E}\big[\log D(s,a)\big] + \mathbb{E}_{(s,a)\sim\pi_\theta}\big[\log(1 - D(s,a))\big]$$ (1)
where $(s,a)\sim\pi_\theta$ denotes samples from rollouts with $a_t \sim \pi_\theta(\cdot|s_t)$ and $s_{t+1} \sim p(\cdot|s_t,a_t)$, $\mathcal{M}_E$ is the buffer with expert demonstrations, and $p$ is the transition dynamics.
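As a rough illustration of the GAIL minimax objective, the sketch below estimates its value from samples of discriminator outputs. The function name `gail_objective` and the toy inputs are assumptions made for illustration, and the convention that the discriminator outputs values near 1 for expert data is assumed rather than taken from the authors' code.

```python
import math

def gail_objective(d_expert, d_policy, eps=1e-8):
    """Monte-Carlo estimate of E_expert[log D] + E_policy[log(1 - D)].

    d_expert and d_policy hold discriminator outputs in (0, 1) for
    expert samples and policy samples (hypothetical inputs).
    """
    expert_term = sum(math.log(d + eps) for d in d_expert) / len(d_expert)
    policy_term = sum(math.log(1.0 - d + eps) for d in d_policy) / len(d_policy)
    return expert_term + policy_term

# An uninformative discriminator (D = 0.5 everywhere) scores -2 log 2.
chance = gail_objective([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates expert from policy data scores higher.
confident = gail_objective([0.9, 0.8], [0.1, 0.2])
```

The discriminator is trained to increase this quantity, while the policy is trained to decrease its second term by producing samples that the discriminator scores as expert-like.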
In a POMDP, the observations are by definition non-Markovian. A policy $\pi(a_t|o_t)$ that chooses actions based on current observations performs sub-optimally, since $o_t$ does not contain sufficient information about the true state of the world. It is useful to infer a distribution on the true states based on the experiences thus far. This is referred to as the belief state, and is formally defined as the filtering distribution: $p(s_t \mid o_{\le t}, a_{<t})$. It combines the memory of past experiences with uncertainty about unobserved aspects of the world. Let $h_t = (o_{\le t}, a_{<t})$ denote the observation-action history, and $b_t$ be a function of $h_t$. If $b_t$ is learned such that it forms the sufficient statistics of the filtering posterior over states, i.e., $p(s_t|h_t) = p(s_t|b_t)$, then $b_t$ could be used as a surrogate code (or representation) for the belief state, and be used to train agents in POMDPs. Henceforth, with slight abuse of notation, we refer to $b_t$ as the belief, although it is a high-dimensional representation rather than an explicit distribution over states.
An intuitive way to obtain this belief is by combining the observation-action history using aggregator functions such as recurrent or convolution networks. For instance, the intermediate hidden states in a recurrent network could represent $b_t$. In the RL setting with environmental rewards, the representation could be trained by conditioning the policy on it, and backpropagating the RL (e.g. policy gradient) loss. However, the RL signal is generally too weak to learn a rich representation that provides sufficient statistics for the filtering posterior over states. Moreno et al. (2018) provide empirical evidence of this claim by training oracle models where representation learning is supervised with privileged information in the form of the (unknown) environment states, and comparing them with learning solely using the RL loss. The problem is only exacerbated when the environmental rewards are extremely sparse. In our imitation learning setup, the belief update is incorporated into the minimax objective for adversarial imitation of expert trajectories, and hence the representation is learned with a potentially stronger signal (Section 3.3). Prior work has shown that representations can be improved by using auxiliary losses such as reward-prediction (Jaderberg et al., 2016), depth-prediction (Mirowski et al., 2016), and prediction of future sensory data (Dosovitskiy & Koltun, 2016; Oh et al., 2015). Inspired by this, in Section 3.4, we regularize the representation with various prediction losses.
Recently, Ha & Schmidhuber (2018) proposed an architecture (World-Models) that decouples model-learning from policy-optimization. In the model-learning phase, a variational autoencoder compresses the raw observations to latent-space vectors, which are then temporally integrated using an RNN, combined with a mixture density network. In the policy-optimization phase, a policy conditioned on the RNN hidden-states is learned to maximize the rewards. We follow a similar separation-of-concerns principle, and divide the architecture into two modules: 1) a policy module which learns a distribution over actions, conditioned on the belief; and 2) a belief module which learns a good representation of the belief $b_t$, from the history of observations and actions, $h_t$. While the policy module is trained with imitation learning, the belief module can be trained in a task-agnostic manner (like in World-Models), or in a task-aware manner. We describe these approaches in the following sections.
The goal of our agent is to learn a policy $\pi_\theta(a_t|b_t)$ by imitating a few expert demonstration trajectories of the form $\{o_0, a_0, o_1, a_1, \ldots\}$. Similar to the objective in GAIL, we hope to minimize the Jensen-Shannon divergence between the state-action visitation distributions of the policy and the expert: $\min_\pi D_{JS}[\rho_\pi(s,a)\,\|\,\rho_E(s,a)]$. However, since the true environment state $s$ is unobserved in POMDPs, we modify the objective to involve the belief representation $b$ instead, since it characterizes the posterior over $s$ via the generative process $p(s|b)$. Defining the belief-visitation distribution $\rho_\pi(b)$ for a policy analogously to the state-visitation distribution, the data processing inequality for $f$-divergences provides that: $D_{JS}[\rho_\pi(s)\,\|\,\rho_E(s)] \le D_{JS}[\rho_\pi(b)\,\|\,\rho_E(b)]$. Please see Appendix 7.1 for the proof. The objective $\min_\pi D_{JS}[\rho_\pi(b)\,\|\,\rho_E(b)]$ thus minimizes an upper bound on the $D_{JS}$ between the state-visitation distributions of the expert and the policy. A further relaxation of this objective allows us to explicitly include the belief-conditioned policy $\pi(a|b)$ into the divergence minimization objective: $D_{JS}[\rho_\pi(b)\,\|\,\rho_E(b)] \le D_{JS}[\rho_\pi(b,a)\,\|\,\rho_E(b,a)]$, where $\rho_\pi(b,a) = \rho_\pi(b)\pi(a|b)$ is the belief-action visitation (proof in Appendix 7.1).
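Minimizing $D_{JS}[\rho_\pi(b,a)\,\|\,\rho_E(b,a)]$. Although explicitly formulating these visitation distributions is difficult, it is possible to obtain an empirical distribution of $\rho_\pi(b,a)$ by rolling out trajectories from $\pi$, and using our belief module to produce samples of belief-action tuples $(b_t, a_t)$, where $b_t$ is the belief computed from the history $h_t$. Similarly, the expert demonstrations buffer $\mathcal{M}_E$ contains observation-action sequences, and can be used as an estimate of $\rho_E(b,a)$. Therefore, $D_{JS}[\rho_\pi(b,a)\,\|\,\rho_E(b,a)]$ (written as just $D_{JS}$ for the remainder of this section, to reduce clutter) can be approximated (up to a constant shift) with a binary classification problem as exploited in GANs (Goodfellow et al., 2014):
$$D_{JS} \approx \max_{D}\; \mathbb{E}_{(b,a)\sim\mathcal{M}_E}\big[\log D(b,a)\big] + \mathbb{E}_{(b,a)\sim\pi_\theta}\big[\log(1 - D(b,a))\big]$$ (2)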
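where $\theta$ are the parameters for the policy $\pi_\theta$, $D$ is the discriminator, and $p$ is the transition dynamics under which the policy rollouts are generated. It should be noted that $D_{JS}$ is a function of the belief module parameters $\phi$ through its dependence on the belief states. The imitation learning objective for optimizing the policy is then obtained as:
$$\min_{\theta}\max_{D}\; \mathbb{E}_{(b,a)\sim\mathcal{M}_E}\big[\log D(b,a)\big] + \mathbb{E}_{(b,a)\sim\pi_\theta}\big[\log(1 - D(b,a))\big]$$ (3)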
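In Equation 2, denoting the functional maximum over $D$ by $D^*$, the gradient for policy optimization is: $\nabla_\theta \mathbb{E}_{(b,a)\sim\pi_\theta}[\log(1 - D^*(b,a))]$. Figure 1 shows the stochastic computation graph (Schulman et al., 2015a) for this expectation term, where the stochastic nodes are represented by circles, deterministic nodes by rectangles, and we have written belief as a function of the history. Given fixed belief module parameters ($\phi$), the required gradient w.r.t. $\theta$ is obtained using the policy gradient theorem (Sutton et al., 2000) as:
$$\nabla_\theta \mathbb{E}_{(b,a)\sim\pi_\theta}\big[\log(1 - D^*(b,a))\big] = \mathbb{E}_{(b_t,a_t)\sim\pi_\theta}\big[\nabla_\theta \log\pi_\theta(a_t|b_t)\,\hat{Q}(b_t,a_t)\big], \quad \hat{Q}(b_t,a_t) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t'\ge t}\gamma^{t'-t}\log\big(1 - D^*(b_{t'},a_{t'})\big)\Big]$$ (4)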
Therefore, updating the policy to minimize $D_{JS}$ is approximately the same as applying the standard policy-gradient using the rewards obtained from a learned discriminator, $r(b_t,a_t) = -\log(1 - D(b_t,a_t))$. As is standard practice, we do not train the discriminator to optimality, but rather jointly train the policy and discriminator using iterative gradient updates. The discriminator is updated using the gradient from Equation 2, while the policy is updated with the gradient from Equation 4. We now detail the update rule for the belief module parameters $\phi$.
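The use of a learned discriminator as a reward source can be sketched as follows. This is a minimal illustration, assuming the convention that the discriminator outputs values near 1 for expert-like belief-action pairs, so that the per-step reward is the negative of log(1 - D); the function names are hypothetical and not taken from the authors' code.

```python
import math

def discriminator_reward(d_out, eps=1e-8):
    """Shaped reward -log(1 - D(b, a)) for one belief-action pair.

    The reward grows as the discriminator becomes more convinced
    that the pair was produced by the expert.
    """
    return -math.log(max(1.0 - d_out, eps))

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go for every timestep of a single rollout."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# Discriminator outputs along a short (made-up) policy rollout.
d_outputs = [0.2, 0.5, 0.9]
rewards = [discriminator_reward(d) for d in d_outputs]
returns = discounted_returns(rewards, gamma=0.99)
```

The returns computed this way play the role of the return estimate in a standard policy-gradient update; in practice a learned baseline is usually subtracted to reduce variance.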
This module transforms the history ($h_t$) into a belief representation. Various approaches could be used to aggregate historical context, such as RNNs, masked convolutions (Gehring et al., 2017) and attention-based methods (Vaswani et al., 2017). In our implementation, we model the belief module with an RNN, such that $b_t = \mathrm{GRU}(b_{t-1}, o_t, a_{t-1})$. We use GRUs (Cho et al., 2014) as they have been demonstrated to have good empirical performance. We denote by $\mathcal{R}$ a replay-buffer which stores observation-action sequences (current and past) from the agent. As stated before, the belief module could be learnt in a task-agnostic manner (similar to Ha & Schmidhuber (2018)), or with task-awareness.
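To make the recurrent belief update concrete, here is a minimal pure-Python sketch of a GRU-style cell that folds an observation-action history into a belief vector. It is an illustration under stated assumptions, not the authors' implementation: the weights are random, biases are omitted, the input at each step is assumed to be the current observation concatenated with the previous action, and the gate convention (which of the two terms the update gate multiplies) varies between references.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def init_gru(input_dim, hidden_dim, seed=0):
    """Random (input, recurrent) weight matrices for each gate."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {name: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim))
            for name in ("update", "reset", "candidate")}

def gru_step(params, b_prev, x):
    Wz, Uz = params["update"]
    Wr, Ur = params["reset"]
    Wc, Uc = params["candidate"]
    z = [sigmoid(p + q) for p, q in zip(matvec(Wz, x), matvec(Uz, b_prev))]
    r = [sigmoid(p + q) for p, q in zip(matvec(Wr, x), matvec(Ur, b_prev))]
    gated = [ri * bi for ri, bi in zip(r, b_prev)]
    c = [math.tanh(p + q) for p, q in zip(matvec(Wc, x), matvec(Uc, gated))]
    # New belief: convex combination of previous belief and candidate.
    return [(1.0 - zi) * bi + zi * ci for zi, bi, ci in zip(z, b_prev, c)]

def encode_history(params, observations, prev_actions, hidden_dim):
    """Fold a whole history into a belief, starting from a zero belief."""
    b = [0.0] * hidden_dim
    for o, a in zip(observations, prev_actions):
        b = gru_step(params, b, o + a)  # concatenate o_t with a_{t-1}
    return b

params = init_gru(input_dim=3, hidden_dim=4)
obs = [[0.1, -0.2], [0.3, 0.0], [0.5, 0.4]]
acts = [[0.0], [1.0], [-1.0]]  # a_{-1} is taken to be zero here
belief = encode_history(params, obs, acts, hidden_dim=4)
```

Because each step only needs the previous belief and the newest observation-action pair, the belief can be updated online during a rollout without re-reading the full history.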
Task-agnostic learning (separately from policy). An unsupervised approach to learning $\phi$, without accounting for the agent's objective, is to maximize the joint likelihood of the observation sequence, conditioned on the actions, $\log p(o_{\le T} \mid a_{<T})$. This decomposes autoregressively as $\sum_t \log p(o_{t+1} \mid o_{\le t}, a_{\le t})$. The objective can be optimized by conditioning a generative model for $o_{t+1}$ on the RNN hidden state $b_t$ and action $a_t$, and using MLE. Using a unimodal Gaussian generative model (learned function $g$ for the mean, and fixed variance), the autoregressive loss to minimize is:
$$\mathbb{E}_{\mathcal{R}}\Big[\sum_t \big\| o_{t+1} - g(b_t, a_t) \big\|_2^2\Big]$$ (5)
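Task-aware learning (jointly with policy). Since the policy is conditioned on the belief, an intuitive way to improve the agent's performance is to learn the belief with an objective more aligned with policy-learning. Since the agent minimizes $D_{JS}$, as defined in Equation 2, the same imitation learning objective can naturally also be used for learning $\phi$. Writing $B_\phi(h)$ for the belief produced by the belief module from history $h$:
$$\min_{\phi}\; \mathbb{E}_{(h,a)\sim\mathcal{M}_E}\big[\log D^*(B_\phi(h),a)\big] + \mathbb{E}_{(h,a)\sim\pi_\theta}\big[\log(1 - D^*(B_\phi(h),a))\big]$$ (6)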
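which is the same as Equation 2 except for the use of the optimal discriminator ($D^*$), and that we have written the belief in terms of history to explicitly bring out the dependence on $\phi$. The gradient of the first expectation term w.r.t. $\phi$ is straightforward; the gradient of the second expectation term w.r.t. $\phi$ (for given fixed parameters $\theta$) comprises a policy-gradient term and a pathwise-derivative term (Figure 1). Therefore, with $b_t = B_\phi(h_t)$ denoting the belief computed from history $h_t$, $\nabla_\phi D_{JS}$ can be approximated with:
$$\nabla_\phi D_{JS} \approx \mathbb{E}_{(h,a)\sim\mathcal{M}_E}\big[\nabla_\phi \log D^*(B_\phi(h),a)\big] + \mathbb{E}_{(h_t,a_t)\sim\pi_\theta}\big[\nabla_\phi \log\pi_\theta(a_t|B_\phi(h_t))\,\hat{Q}(b_t,a_t) + \nabla_\phi \log\big(1 - D^*(B_\phi(h_t),a_t)\big)\big]$$ (7)
where $\hat{Q}$ is as defined in Equation 4. The overall minimax objective for jointly training the policy, belief and discriminator is:
$$\min_{\theta,\phi}\max_{D}\; \mathbb{E}_{(h,a)\sim\mathcal{M}_E}\big[\log D(B_\phi(h),a)\big] + \mathbb{E}_{(h,a)\sim\pi_\theta}\big[\log(1 - D(B_\phi(h),a))\big]$$ (8)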
With the minimax objective (Equation 8), it may be possible that the belief parameters ($\phi$) are driven towards a degenerate solution that ignores the history $h_t$, thereby producing constant (or similar) beliefs for policy and expert trajectories. Indeed, if we omit the actions ($a_t$) in the discriminator $D(b_t,a_t)$, a constant belief output is an optimal solution for Equation 8. To learn a belief representation that captures relevant historical context and is useful for deriving optimal policies, we add forward, inverse and action-regularization to the belief module. We define and motivate them from the perspective of mutual information maximization.
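Notation. For two continuous random variables $X, Y$, mutual information is defined as $I(X;Y) = \mathcal{H}(X) - \mathcal{H}(X|Y)$, where $\mathcal{H}$ denotes the differential entropy (the usual notation for differential entropy, $h$, is not used to avoid confusion with the history used in previous sections). Intuitively, $I(X;Y)$ measures the dependence between $X$ and $Y$. Conditional mutual information is defined as $I(X;Y|Z) = \mathcal{H}(X|Z) - \mathcal{H}(X|Y,Z)$. Given $Z$, if $X$ and $Y$ are independent ($X \perp Y \mid Z$), then $X, Z, Y$ form a Markov Chain ($X \to Z \to Y$), and the data processing inequality for mutual information states that $I(X;Z) \ge I(X;Y)$.
Forward regularization. As discussed in Section 3.1, an ideal belief representation completely characterizes the posterior over the true environment states $p(s_t|h_t)$. Therefore, it ought to be correlated with future true states ($s_{t+k}$), conditioned on the intervening future actions $a_{t:t+k-1}$. We frame this objective as maximization of the following conditional mutual information: $I(b_t; s_{t+k} \mid a_{t:t+k-1})$. Since $b_t \perp o_{t+k} \mid s_{t+k}$ because of the observation generation process in a POMDP, we get the following after using the data processing inequality for mutual information:
$$I(b_t; s_{t+k} \mid a_{t:t+k-1}) \;\ge\; I(b_t; o_{t+k} \mid a_{t:t+k-1}) \;=\; \mathcal{H}(o_{t+k} \mid a_{t:t+k-1}) - \mathcal{H}(o_{t+k} \mid b_t, a_{t:t+k-1}) \;\ge\; \mathcal{H}(o_{t+k} \mid a_{t:t+k-1}) + \mathbb{E}\big[\log q(o_{t+k} \mid b_t, a_{t:t+k-1})\big]$$ (9)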
where the final inequality follows because we can lower bound the mutual information using a variational approximation $q(o_{t+k} \mid b_t, a_{t:t+k-1})$, similar to the variational information maximization algorithm (Agakov & Barber, 2004). Therefore, we maximize a lower bound to the mutual information with the surrogate objective:
$$\max_{\phi, q}\; \mathbb{E}\big[\log q(o_{t+k} \mid b_t, a_{t:t+k-1})\big]$$
With the choice of a unimodal Gaussian (learned function $g_f$ for the mean, and fixed variance) for the variational distribution $q$, the loss function for forward regularization of the belief module is:
$$L_f(\phi) = \mathbb{E}_{\mathcal{R}}\Big[\big\| o_{t+k} - g_f(b_t, a_{t:t+k-1}) \big\|_2^2\Big]$$
where the expectation is over trajectories sampled from the replay buffer $\mathcal{R}$.
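A fixed-variance Gaussian likelihood reduces to a squared-error loss, so forward regularization can be sketched as a multi-step prediction error. The snippet below is an illustrative sketch only: the function name, the predictor callables and the toy additive dynamics are assumptions, and a real implementation would use learned networks and backpropagate the loss into the belief module.

```python
def forward_regularization_loss(beliefs, actions, observations, predictor, k):
    """Mean squared error between o_{t+k} and predictor(b_t, a_{t:t+k-1}).

    beliefs[t], actions[t] and observations[t] are lists of floats;
    predictor is any callable returning a predicted observation.
    Requires len(observations) > k.
    """
    total, count = 0.0, 0
    for t in range(len(observations) - k):
        pred = predictor(beliefs[t], actions[t:t + k])
        target = observations[t + k]
        total += sum((p - y) ** 2 for p, y in zip(pred, target))
        count += 1
    return total / count

# Toy 1-D system (assumed for illustration): o_{t+1} = o_t + a_t,
# with the belief set equal to the current observation.
observations = [[0.0], [1.0], [3.0], [6.0]]
actions = [[1.0], [2.0], [3.0]]
beliefs = observations

def uses_actions(b, future_actions):
    return [b[0] + sum(a[0] for a in future_actions)]

def ignores_actions(b, future_actions):
    return [b[0]]

good = forward_regularization_loss(beliefs, actions, observations, uses_actions, k=2)
bad = forward_regularization_loss(beliefs, actions, observations, ignores_actions, k=2)
```

In this toy case the predictor that uses the intervening actions attains zero loss, while the one that ignores them does not, which mirrors why the prediction is conditioned on the future actions.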
Inverse regularization. It is desirable that the belief at time $t$ is correlated with the past true states ($s_{t-k}$), conditioned on the intervening past actions $a_{t-k:t-1}$. This should improve the belief representation by helping to capture long-range dependencies. Proceeding in a manner similar to above, the conditional mutual information between these signals, $I(b_t; s_{t-k} \mid a_{t-k:t-1})$, can be lower bounded by $I(b_t; o_{t-k} \mid a_{t-k:t-1})$ using the data processing inequality. Again, as before, this can be further lower bounded using a variational distribution $q(o_{t-k} \mid b_t, a_{t-k:t-1})$ for generating the past observation $o_{t-k}$. A unimodal Gaussian (mean function $g_i$) for $q$ yields the following loss for inverse regularization, which is optimized using trajectories from the replay buffer $\mathcal{R}$:
$$L_i(\phi) = \mathbb{E}_{\mathcal{R}}\Big[\big\| o_{t-k} - g_i(b_t, a_{t-k:t-1}) \big\|_2^2\Big]$$
Action regularization. We also wish to maximize $I(a_{t:t+k-1}; s_{t+k} \mid b_t)$ for the reason that, conditioned on the current belief $b_t$, a sequence of $k$ subsequent actions should provide information about the resulting true future state ($s_{t+k}$). Similar lower bounding and use of a variational distribution with mean function $g_a$ for generating action-sequences gives us the loss:
$$L_a(\phi) = \mathbb{E}_{\mathcal{R}}\Big[\big\| a_{t:t+k-1} - g_a(b_t, o_{t+k}) \big\|_2^2\Big]$$
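The complete loss function for training the belief module results from a weighted combination of the imitation-loss and regularization terms. Imitation-loss uses on-policy data and expert demonstrations $\mathcal{M}_E$, while the regularization losses are computed with on-policy and off-policy data, as well as $\mathcal{M}_E$. Denoting the forward, inverse and action regularization losses by $L_f$, $L_i$ and $L_a$, and their weights by $\lambda_f$, $\lambda_i$ and $\lambda_a$:
$$L(\phi) = D_{JS} + \lambda_f L_f + \lambda_i L_i + \lambda_a L_a$$ (10)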
We derive our final expressions for the regularization losses by modeling the respective variational distributions as fixed-variance, unimodal Gaussians. We later show that using this simple model results in appreciable performance benefits for imitation learning. Other expressive model classes, such as mixture density networks and flow-based models (Rezende & Mohamed, 2015), can be readily used as well, to learn complex and multi-modal distributions over the future observations $o_{t+k}$, past observations $o_{t-k}$ and action-sequences $a_{t:t+k-1}$.
Figure 2 shows the schematic diagram of our complete architecture, including an overview of implemented neural networks. In Algorithm 1, we outline the major steps of the training procedure. In each iteration, we run the policy for a few steps and obtain shaped rewards from the current discriminator (Line 6). The policy parameters are then updated using A2C, which is the synchronous adaptation of asynchronous advantage actor-critic (A3C; Mnih et al. (2016)), as the policy-gradient algorithm (Line 10). Other RL algorithms, such as those based on trust-region methods (Schulman et al., 2015b), could also be readily used. Similar to the policy (actor), the baseline (critic) used for reducing variance of the stochastic gradient-estimation is also conditioned on the belief. To further reduce variance, Generalized Advantage Estimation (GAE; Schulman et al. (2015c)) is used to compute the advantage. Apart from the policy-gradient, on-policy data also enables computing the gradient for the discriminator network (Line 13) and the belief module (Line 14). The belief is further refined by minimizing the regularization losses on off-policy data from the replay buffer (Line 15).
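For readers unfamiliar with Generalized Advantage Estimation, the following is a minimal sketch of the standard recursion for a single rollout segment. The function name and default hyperparameter values are illustrative assumptions, episode-termination handling is omitted, and in the setting described above the rewards would come from the discriminator and the values from the belief-conditioned critic.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    rewards[t] and values[t] are per-step rewards and critic estimates;
    last_value is the critic estimate for the state after the segment.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# With lam = 1 and a zero critic, GAE reduces to the reward-to-go.
monte_carlo = gae_advantages([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
# With lam = 0, GAE reduces to the one-step TD error.
one_step = gae_advantages([1.0, 1.0], [0.5, 0.5], 0.5, gamma=1.0, lam=0.0)
```

The parameter `lam` trades bias for variance: values near 0 rely on the critic, values near 1 rely on sampled returns.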
The regularization losses described in Section 3.4 include a hyperparameter $k$ that controls the temporal offset of the predictions. For instance, for the inverse regularization loss, the larger the $k$, the farther back in time the observation predictions are made, conditioned on the current belief and past actions. The temporal abstractions provided by multi-step predictions should help to extract more global information from the input stream into the belief representation. Our ablations (Section 5) show the performance benefit of including multi-step losses. Various strategies for selecting $k$ are possible, such as uniform sampling from a range (Guo et al., 2018) and adaptive selection based on a curriculum (Oh et al., 2015). For simplicity, we choose fixed values, and leave the exploration of the more sophisticated approaches to future work. Hence, our total regularization loss comprises single-step and multi-step forward, inverse, and action-prediction losses. For encoding a sequence of past or future actions into a compact representation, we use multi-layer convolution networks (Figure 2).
While we cannot do full justice to the extensive literature on algorithms for training agents in POMDPs, we here mention some recent related work. Most prior algorithms for POMDPs assume access to a predefined reward function. These include approaches based on Q-learning (DRQN; Hausknecht & Stone (2015)), policy-gradients (Igl et al., 2018), partially observed guided policy search (Zhang et al., 2016), and planning methods (Silver & Veness, 2010; Ross et al., 2008; Pineau et al., 2003). In contrast, we propose to adapt ideas from generative adversarial imitation learning to learn policies in POMDPs without environmental rewards.
Learning belief states from history was recently explored in Guo et al. (2018). The authors show that training the belief representation with a Contrastive Predictive Coding (CPC; Oord et al. (2018)) loss on future observations, conditioned on future actions, helps to infer knowledge about the underlying state of the environment. Predictive State Representations (PSRs) offer another approach to modeling the belief state in terms of observable data (Littman & Sutton, 2002). The assumption in PSRs is that the filtering distribution can be approximated with a distribution over the future observations, conditioned on future actions. PSRs combined with RNNs have been shown to improve representations by predicting future observations (Venkatraman et al., 2017; Hefny et al., 2018). While we also make future predictions, a key difference compared to the aforementioned methods is that our belief representation is additionally regularized by predictions in the past, and in action-space, which we later show benefits our approach.
State-space models (SSMs; Fraccaro et al. (2016); Goyal et al. (2017); Buesing et al. (2018)), which represent the unobserved environment states with latent variables, have also been used to obtain belief states. Igl et al. (2018) use a particle-filtering method to train a VAE, and represent the belief state with a weighted collection of particles. The model is also updated with the RL-loss using a belief-conditioned policy. Gregor & Besse (2018) proposed TD-VAE, which explicitly connects belief distributions at two distant timesteps, and enforces consistency between them using a transition distribution and smoothing posterior. Although we use a deterministic model for our belief module, our methods apply straightforwardly to SSMs as well.
The goal in this section is to evaluate and analyze the performance of our proposed architecture for imitation learning in partially-observable environments, given some expert demonstrations. Herein, we describe our environments, provide comparisons with GAIL, and perform ablations to study the importance of the design decisions that motivate our architecture.
Partially-observable locomotion tasks. We benchmark on high-dimensional, continuous-control locomotion environments based on the MuJoCo physics simulator, available in OpenAI Gym (Brockman et al., 2016). The standard Gym MuJoCo library of tasks, however, consists of MDPs (and not POMDPs), since observations in such tasks contain sufficient state-information to learn an optimal policy conditioned on only the current observation. As such, it has been extensively used to evaluate performance of reinforcement-learning and imitation-learning algorithms in the MDP setting (Schulman et al., 2017; Ho & Ermon, 2016). To transform these tasks into POMDPs, we follow an approach similar to Duan et al. (2016), and redact some sensory data from the observations. Specifically, from the default observations, we remove measurements for the translational and angular velocities of the torso, and also the velocities for all the link joints. We denote the original (MDP) observations by $s$, and the curtailed (POMDP) observations by $o$. Figure 4 shows the Hopper agent composed of 4 links connected via actuated joints, and describes the original MDP sensors and our POMDP modifications. Similar information for other MuJoCo tasks is included in Appendix 7.2.
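The redaction step can be pictured as dropping a fixed set of entries from each observation vector. The sketch below is illustrative only: the function name is hypothetical, and the six-dimensional observation layout (three position-like entries followed by three velocities) is made up for the example rather than taken from any specific MuJoCo task.

```python
def redact_observation(full_obs, hidden_indices):
    """Return the observation with the entries at hidden_indices removed."""
    hidden = set(hidden_indices)
    return [x for i, x in enumerate(full_obs) if i not in hidden]

# Hypothetical layout: indices 0-2 are positions/angles, 3-5 are velocities.
full_obs = [1.25, 0.10, -0.30, 0.80, -1.50, 2.00]
velocity_indices = [3, 4, 5]
partial_obs = redact_observation(full_obs, velocity_indices)
```

Applying such a function to every observation returned by the simulator yields an environment where velocities must be inferred from the history of positions, which is what makes a belief representation necessary.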
For all experiments, we assume access to 50 expert demonstrations of the type $\{o_0, a_0, o_1, a_1, \ldots\}$, for each of the tasks. The policy and discriminator networks are feed-forward MLPs with two 64-unit layers. The policy network outputs include the action mean and per-action variances (i.e. actions are assumed to have an independent Gaussian distribution). In the belief module, the dimension of the GRU cell is 256, while the MLPs used for regularization have two 64-unit feed-forward layers. More details and hyperparameters are in Appendix 7.3.
Our first baseline is modeled after the architecture used in the original GAIL approach (Ho & Ermon, 2016). It consists of feed-forward policy and discriminator networks, without the recurrent belief module. The policy is conditioned on $o_t$, and the discriminator performs binary classification on $(o_t, a_t)$ tuples. The update rules for the policy and discriminator are obtained in a similar way as Equation 4 and Equation 2, respectively, by replacing the belief $b_t$ with observation $o_t$. The next baseline, referred to as GAIL + Obs. stack, augments GAIL by concatenating 3 previous observations to each $o_t$, and feeding the entire stack as input to the policy and discriminator networks. This approach has been found to extract useful historical context for a better state-representation (Mnih et al., 2015). We abbreviate our complete proposed architecture (Figure 2) by BMIL, short for Belief-Module Imitation Learning. BMIL jointly trains the policy, belief and discriminator networks using a minimax objective (Equation 8), and additionally regularizes the belief with multi-step predictions. Table 1 compares the performance of different designs on POMDP MuJoCo. We show the mean episode-returns, averaged over 5 runs with random seeds, after 10M timesteps of interaction with the environment. We observe that GAIL, both with and without observation stacking, is unable to successfully imitate the expert behavior. Since the observation $o_t$ alone does not contain adequate information, the policy conditioned on it performs sub-optimally. Also, the discriminator trained on $(o_t, a_t)$ tuples does not provide robust shaped rewards. Using the full state $s_t$ instead of $o_t$ in our experiments leads to successful imitation with GAIL, suggesting that the performance drop in Table 1 is due to partial observability, rather than other artifacts such as insufficient network capacity, or lack of algorithmic or hyperparameter tuning.
Further, we see that techniques such as stacking past observations provide only a marginal improvement in some of the tasks. In contrast, in BMIL, the belief module curates a belief representation from the history ($h_t$), which is used both for discriminator training, and to condition the action-distribution (policy). BMIL achieves scores very close to those of the expert.
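The observation-stacking baseline can be sketched with a fixed-length buffer that concatenates the current observation with the three preceding ones. This is an illustrative sketch, not the authors' code: the class name is hypothetical, and padding the buffer with copies of the first observation at the start of an episode is an assumption.

```python
from collections import deque

class ObservationStacker:
    """Keeps the current observation plus a few previous ones."""

    def __init__(self, num_previous=3):
        self.frames = deque(maxlen=num_previous + 1)

    def reset(self, first_obs):
        # Assumed padding scheme: repeat the first observation.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(list(first_obs))
        return self.stacked()

    def step(self, obs):
        # The oldest observation is dropped automatically by the deque.
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        # Oldest first, newest last, flattened into one input vector.
        return [x for frame in self.frames for x in frame]

stacker = ObservationStacker(num_previous=3)
start = stacker.reset([1.0, 2.0])
after_one_step = stacker.step([3.0, 4.0])
```

Unlike a recurrent belief, the stacked input only covers a fixed window, so information older than the window is lost.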
Table 1: Mean episode-returns on POMDP MuJoCo after 10M timesteps, averaged over 5 runs.
Task             GAIL   GAIL + Obs. stack   BMIL (Ours)   Expert (≈ Avg.)
Inv.DoublePend.  109    1351                9104          9300
Hopper           157    517                 2665          3200
Ant              895    1056                1832          2400
Walker           357    562                 4038          4500
Humanoid         1686   1284                4382          4800
HalfCheetah      205    948                 5860          6500
For our next two baselines, we replace the feed-forward networks in GAIL with GRUs. GAIL-RF uses a recurrent policy and a feed-forward discriminator, while in GAIL-RR, both the policy and the discriminator are recurrent. In both these baselines, the belief is created internally in the recurrent policy module. Importantly, unlike BMIL, the belief is not shared between the policy and the discriminator. The average final performance of GAIL-RF and GAIL-RR in our POMDP environments is shown in Table 2. We observe that GAIL-RR does not perform well on most of the tasks. A plausible explanation for this is that using the adversarial binary classification loss for training the discriminator parameters does not induce a sufficient representation of the belief state in its recurrent network. The other baseline, GAIL-RF, avoids this issue with a feed-forward discriminator trained on $(o_t, a_t)$ tuples from the expert and the policy. This leads to much better performance. However, BMIL consistently outperforms GAIL-RF, most significantly in Humanoid (1.6× higher score), indicating the effectiveness of the decoupled architecture and other design decisions that motivate BMIL. Figure 7 (in Appendix) plots the learning curves for these experiments. Appendix 7.5 further shows that BMIL outperforms the strongest baseline (GAIL-RF) on additional environments with accentuated partial observability.
Table 2: Average final performance of GAIL-RR, GAIL-RF and BMIL in our POMDP environments.
Task             GAIL-RR   GAIL-RF   BMIL (Ours)
Inv.DoublePend.  8965      9103      9104
Hopper           955       2164      2665
Ant              533       1612      1832
Walker           400       3188      4038
Humanoid         3829      2761      4382
HalfCheetah      922       5011      5860
How crucial is belief regularization? To quantify this, we compare the performance of our architecture with and without belief regularization (BMIL vs. BMIL w/o Reg.). Figure 3 plots the mean episode-returns vs. timesteps of environment interaction, over the course of learning. We observe that including regularization leads to better episode-returns and sample-complexity for most of the tasks considered, indicating that it is useful for shaping the belief representations. The single-step and multi-step predictions for forward and inverse regularization are made in the observation space. Although we see good improvements for tasks with high-dimensional spaces, such as Humanoid and Ant, predicting an entire raw observation may be sub-optimal and computationally wasteful for some tasks, since it requires modeling the complex structures within an observation. To avoid this, predictions can be made in a compact encoding space (output of Encoder in Figure 2). In Appendix 7.4, we show the performance of BMIL with predictions in encoding-space rather than observation-space, and note that the scores are quite similar.
Task-aware vs. task-agnostic belief learning. Next, we compare with a design in which the belief module is trained separately from the policy, using a task-agnostic loss (Section 3.3). This echoes the philosophy used in World Models (Ha & Schmidhuber, 2018). As Figure 3 shows, this results in mixed success for imitation learning in POMDPs. While the agent achieves good scores in tasks such as Ant and HalfCheetah, the performance in Walker and Hopper is unsatisfactory. In contrast, BMIL, which uses a task-aware imitation-loss for the belief module, is consistently better.
Are all of the regularization terms useful? We introduced three regularization terms for the belief module: forward, inverse, and action. To assess their individual benefit, in Figure 5 we plot learning curves for two tasks, with each of the regularizations applied in isolation. We compare them with BMIL, which includes all of them, and BMIL w/o Reg., which excludes all of them (no regularization). For the Ant task, we notice that each term provides a performance improvement over the no-regularization baseline, and combining them (BMIL) performs the best. For the Walker task, each term yields better mean episode-returns at the end of training than no regularization, and BMIL attains the best sample-complexity.
Are multi-step predictions useful? As argued before, making coarse-scale, multi-step predictions for forward and inverse observations and action-sequences could improve representations by providing temporal abstractions. In Figure 6, we plot BMIL, which uses both single- and multi-step losses, and compare it with two versions: the first uses a different temporal offset, and the second predicts only at the single-step granularity. For both tasks, we get better sample-complexity and higher final mean episode-returns with multi-step predictions, suggesting their positive contribution to representation learning.
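To make the distinction between single-step and multi-step prediction concrete, the sketch below pairs each timestep of an observation sequence with a one-step-ahead target and a k-step-ahead target. It covers only the forward direction; the function name `prediction_targets` and the offset `k = 3` are assumptions chosen for illustration, not values taken from the paper.

```python
def prediction_targets(observations, k):
    """Pair each timestep t with a single-step target o[t+1] and a k-step target o[t+k].

    Timesteps that lack a full k-step lookahead are skipped.
    `k` is a hypothetical temporal offset and must be at least 1.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    targets = []
    for t in range(len(observations) - k):
        targets.append(
            {
                "t": t,
                "single_step": observations[t + 1],
                "multi_step": observations[t + k],
            }
        )
    return targets


# Toy scalar observations standing in for sensor vectors.
toy_observations = [0.0, 0.1, 0.3, 0.6, 1.0, 1.5]
toy_targets = prediction_targets(toy_observations, k=3)
```

A single-step-only variant corresponds to dropping the `multi_step` entry, and a different temporal offset corresponds to changing `k`.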
In this paper, we study imitation learning for POMDPs, which has been considerably less explored than imitation learning for MDPs and learning in POMDPs with predefined reward functions. We introduce a framework comprising a belief module, and policy and discriminator networks conditioned on the generated belief. Crucially, all networks are trained jointly with a min-max objective for adversarial imitation of expert trajectories.
Within this flexible setup, many instantiations are possible, depending on the choice of networks. Both feedforward and recurrent networks can be used for the policy and the discriminator, while for the belief module there is an expansive set of options based on the rich literature on representation learning, such as CPC (Guo et al., 2018), Z-forcing (Ke et al., 2018), and auxiliary tasks (Jaderberg et al., 2016). Many more methods based on state-space latent-variable models are also applicable (cf. Section 4). In our instantiation of the belief module, we use the task-based imitation loss (Equation 6), and improve the robustness of representations by regularizing with multi-step prediction of past/future observations and action-sequences. One benefit of our proposed framework is that, in future work, it would be straightforward to substitute other methods for learning belief representations for the one we arrived at in our paper. Similarly, recent advancements in the GAN and RL literature could guide the development of better discriminator and policy networks for imitation learning in POMDPs. Exploring these ranges, as well as their interplay, is an important direction.
Playing Atari with six neurons. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 998–1006. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
International Conference on Machine Learning, pp. 1329–1338, 2016.
Journal of Artificial Intelligence Research, 13:33–94, 2000.
We first prove the inequality relating the state-visitation distribution and the belief-visitation distribution of the agent and the expert.
The proof is a simple application of the data-processing inequality for f-divergences (Ali & Silvey, 1966), of which the divergence considered here is a type.
Consider the filtering posterior distribution over states, given the belief. This distribution is characterized by the environment, and does not depend on the policy (agent or expert). The posterior over belief, given the state, however, is policy-dependent and is obtained using Bayes' rule. Analogous definitions exist for the expert.
We write the divergence in terms of the template used for f-divergences, with a convex generator function f satisfying f(1) = 0. The inequality then follows from the convexity of f.
∎
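For reference, the data-processing argument can be sketched as follows. The notation here is assumed for illustration and is not taken from the text above: $\rho_\pi(b)$, $\rho_E(b)$ denote the belief-visitation distributions of the agent and the expert, $\rho_\pi(s)$, $\rho_E(s)$ the corresponding state-visitation distributions, $p(s \mid b)$ the policy-independent filtering posterior, $q_\pi(b \mid s)$, $q_E(b \mid s)$ the policy-dependent posteriors over belief, and $D_f[P \,\|\, Q] = \mathbb{E}_Q[f(P/Q)]$ for a convex $f$ with $f(1) = 0$.

```latex
\begin{align*}
D_f\big[\rho_\pi(b)\,\|\,\rho_E(b)\big]
&= \mathbb{E}_{b\sim\rho_E}\left[f\!\left(\frac{\rho_\pi(b)}{\rho_E(b)}\right)\right] \\
% multiply numerator and denominator by the shared, policy-independent p(s|b)
&= \mathbb{E}_{b\sim\rho_E,\; s\sim p(\cdot\mid b)}\left[f\!\left(\frac{\rho_\pi(b)\,p(s\mid b)}{\rho_E(b)\,p(s\mid b)}\right)\right] \\
% Bayes' rule: rho(b) p(s|b) = rho(s) q(b|s), for agent and expert alike
&= \mathbb{E}_{s\sim\rho_E}\,\mathbb{E}_{b\sim q_E(\cdot\mid s)}\left[f\!\left(\frac{\rho_\pi(s)\,q_\pi(b\mid s)}{\rho_E(s)\,q_E(b\mid s)}\right)\right] \\
% Jensen's inequality for the convex function f
&\ge \mathbb{E}_{s\sim\rho_E}\left[f\!\left(\frac{\rho_\pi(s)}{\rho_E(s)}\,\mathbb{E}_{b\sim q_E(\cdot\mid s)}\left[\frac{q_\pi(b\mid s)}{q_E(b\mid s)}\right]\right)\right] \\
% the inner expectation integrates q_pi(b|s) to 1
&= \mathbb{E}_{s\sim\rho_E}\left[f\!\left(\frac{\rho_\pi(s)}{\rho_E(s)}\right)\right]
= D_f\big[\rho_\pi(s)\,\|\,\rho_E(s)\big].
\end{align*}
```

As one concrete instance, the generator $f(u) = u\log u - (u+1)\log\frac{u+1}{2}$ satisfies $f(1) = 0$ and yields the Jensen-Shannon divergence up to a factor of 2 under the convention $D_{JS} = \tfrac{1}{2}\mathrm{KL}(P\,\|\,M) + \tfrac{1}{2}\mathrm{KL}(Q\,\|\,M)$ with $M = (P+Q)/2$.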
Similarly, we can prove the inequality relating the belief-visitation distribution and the belief-action-visitation distribution of the agent and the expert.
The argument follows the previous proof after the corresponding substitution of variables. The only condition required for that result to hold is that the conditional distribution linking the two variables does not depend on the policy. Therefore, if this conditional distribution is independent of the policy, the analogous inequality holds.
The independence holds in the trivial case of a deterministic mapping. This gives us the desired inequality. ∎
A description of the sensor measurements given to the agent in the MDP and POMDP environments is provided in Table 3. As an example, for the Hopper task, the MDP space is 11-dimensional, which includes 6 velocity sensors and 5 position sensors, whereas the POMDP space is 5-dimensional, comprising 5 position sensors. Among the sensor categories, velocity includes translational and angular velocities of the torso, and also the velocities of all the joints; position includes torso position and orientation (quaternion), and the angles of the joints. The sensors in the MDP column marked in bold are removed in the POMDP setting.
Environment  MDP sensors  POMDP sensors 

Hopper  velocity(6) + position(5)  position(5) 
HalfCheetah  velocity(9) + position(8)  position(8) 
Walker2d  velocity(9) + position(8)  position(8) 
Inv.DoublePend.  velocity(3) + position(5) + actuator forces(3)  position(5) + actuator forces(3) 
Ant  velocity(14) + position(13) + external forces(84)  position(13) + external forces(84) 
Humanoid  velocity(23) + center-of-mass based velocity(84) + position(22) + center-of-mass based inertia(140) + actuator forces(23) + external forces(84)  position(22) + center-of-mass based inertia(140) + actuator forces(23) + external forces(84)
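The observation dimensionalities implied by the table can be tallied with a few lines of code. The sketch below assumes, based on the table, that the POMDP variant drops every sensor group whose name contains "velocity"; the dictionary and helper names are illustrative only.

```python
# Sensor groups per environment, transcribed from the table above.
# Each entry maps a sensor category to its dimensionality.
MDP_SENSORS = {
    "Hopper": {"velocity": 6, "position": 5},
    "HalfCheetah": {"velocity": 9, "position": 8},
    "Walker2d": {"velocity": 9, "position": 8},
    "Inv.DoublePend.": {"velocity": 3, "position": 5, "actuator forces": 3},
    "Ant": {"velocity": 14, "position": 13, "external forces": 84},
    "Humanoid": {
        "velocity": 23,
        "center-of-mass based velocity": 84,
        "position": 22,
        "center-of-mass based inertia": 140,
        "actuator forces": 23,
        "external forces": 84,
    },
}


def pomdp_sensors(mdp_sensors):
    """Drop every velocity-type group (assumed rule: name contains 'velocity')."""
    return {name: size for name, size in mdp_sensors.items() if "velocity" not in name}


def dim(sensors):
    """Total observation dimensionality of a sensor dictionary."""
    return sum(sensors.values())


for env, sensors in MDP_SENSORS.items():
    print(f"{env:16s} MDP dim = {dim(sensors):3d}   POMDP dim = {dim(pomdp_sensors(sensors)):3d}")
```

For Hopper this reproduces the 11-dimensional MDP space and the 5-dimensional POMDP space quoted in the text.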
Hyperparameter  Value 

Parameters for Convolution networks (encoding past & future actionsequences)  Layers=2, Stride=1, Padding=1, Kernel_size=3, Num_filters = {5,5} 
Belief Regularization Coefficients  
Rollout length (c) in Algorithm 1  5 
Size of expert demonstrations  50 (trajectories) 
Size of replay buffer  1000 (trajectories) 
Optimizer, Learning Rate  RMSProp, 3e-4 (linear-decay)
γ, λ (GAE)  0.99, 0.95
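As a quick sanity check on the convolution settings listed above, the sketch below applies the standard output-length formula for a convolution without dilation. It assumes the convolutions are one-dimensional over the action-sequence axis, which the table does not state explicitly; both function names are hypothetical.

```python
def conv1d_output_length(length, kernel_size=3, stride=1, padding=1):
    """Output length of a 1-D convolution without dilation."""
    return (length + 2 * padding - kernel_size) // stride + 1


def stacked_output_shape(length, num_filters=(5, 5), **conv_kwargs):
    """Shape (channels, length) after one conv layer per entry in num_filters."""
    for _ in num_filters:
        length = conv1d_output_length(length, **conv_kwargs)
    return (num_filters[-1], length)


# With Kernel_size=3, Stride=1, Padding=1, each layer preserves the sequence length.
print(stacked_output_shape(10))
```

With these settings the two layers leave the sequence length unchanged and produce 5 output channels, so the encoded action-sequence has the same temporal resolution as the input.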
In our approach, we regularize with single- and multi-step predictions in the space of raw observations. For many high-dimensional, complex spaces (e.g., visual inputs), it may be more efficient to operate in a lower-dimensional, compressed encoding-space, either pre-trained or learnt online (Cuccu et al., 2019).
The encoder in our architecture (yellow box in Figure 2) pre-processes the raw observations before passing them to the RNN for temporal integration. We now evaluate BMIL with single- and multi-step predictions in the space of this encoder output. For instance, the forward regularization loss is then computed against the encoder output of the target observation rather than the raw observation itself.
We do not pass the gradient through the target value. The encoder is trained online as part of the belief module. Table 4 indicates that, for the tasks considered, BMIL performance is fairly similar when predicting in observation-space vs. encoding-space.
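One way to write such an encoding-space forward loss, purely as an illustrative sketch: the symbols below are assumptions rather than notation from the paper. $E_\phi$ denotes the encoder, $g_\theta$ a forward predictor that takes the belief $b_t$ and action $a_t$, $\operatorname{sg}(\cdot)$ the stop-gradient operator, and the squared-error form is likewise assumed.

```latex
% Hypothetical encoding-space forward loss: the target E_phi(o_{t+1}) is
% treated as a constant, so no gradient flows through it.
\mathcal{L}_{\text{fwd}}(\theta, \phi)
= \mathbb{E}_{t}\Big[\big\| g_\theta(b_t, a_t)
- \operatorname{sg}\big(E_\phi(o_{t+1})\big) \big\|_2^2\Big]
```

In this form the encoder still receives gradients through the belief $b_t$, which is built from encoded observations. A common motivation for stopping the gradient on the target is to keep the encoder from lowering the loss simply by collapsing its outputs to a constant.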
Environment  BMIL: Predictions in observation-space  BMIL: Predictions in encoding-space
Inv.DoublePend.  9104 ± 134  8883 ± 448
Hopper  2665 ± 70  2700 ± 116
Ant  1832 ± 92  1784 ± 44
Walker  4038 ± 259  4043 ± 113
Humanoid  4382 ± 117  4322 ± 263
Halfcheetah  5860 ± 171  5912 ± 128
To test robustness of BMIL, we evaluate it on two new POMDP environment variants designed to make inferring the true state from given sensors more challenging. These new environments are:
Inv.DoublePend. from velocities only: The partially observable InvertedDoublePendulum used in Section 5 removes the velocity sensors to achieve partial observability, and provides as sensors only the cart-position and the sin/cos of the link angles. In this new environment, we remove those previously provided sensors (cart-position and link angles), and instead provide only the velocity sensors (which were removed in our original environment). The motivation is to exacerbate partial observability by restricting the sensors such that inferring the true state is more challenging (i.e., it is easier to infer velocity from subsequent positions than to integrate position over time from only velocity information). Indeed, our experiments indicate this is a harder imitation learning scenario.
Walker from velocities only: In the same spirit as above, we remove all position sensors and instead provide only the velocity sensors to the agent.
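The asymmetry between the two settings can be illustrated on a toy one-dimensional point mass (an assumption made here for simplicity, unrelated to the actual MuJoCo dynamics). Consecutive positions determine velocity by finite differences, whereas velocities alone leave the starting position undetermined. The function names and the timestep `dt` are hypothetical.

```python
def velocities_from_positions(positions, dt):
    """Finite differences: consecutive positions determine the velocity."""
    return [(b - a) / dt for a, b in zip(positions, positions[1:])]


def positions_from_velocities(velocities, dt, initial_position):
    """Integration needs an initial position that velocity sensors do not reveal."""
    positions = [initial_position]
    for v in velocities:
        positions.append(positions[-1] + v * dt)
    return positions


dt = 0.5
true_positions = [1.0, 1.5, 2.5, 4.0]
velocities = velocities_from_positions(true_positions, dt)

# Two different guesses for the unobserved starting position are both
# consistent with the same velocity readings.
guess_a = positions_from_velocities(velocities, dt, initial_position=1.0)
guess_b = positions_from_velocities(velocities, dt, initial_position=0.0)
```

Both `guess_a` and `guess_b` reproduce the observed velocities exactly, but only one matches the true trajectory, which is the ambiguity a velocity-only agent has to resolve from other cues.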
We compare BMIL to GAILRF (the strongest baseline).
Environment  GAILRF  BMIL
Inv.DoublePend. (velocity only)  4988  6578
Walker (velocity only)  1539  4199