
Imitating Latent Policies from Observation

by Ashley D. Edwards, et al.

We describe a novel approach to imitation learning that infers latent policies directly from state observations. We introduce a method that characterizes the causal effects of unknown actions on observations while simultaneously predicting their likelihood. We then outline an action alignment procedure that leverages a small amount of environment interactions to determine a mapping between latent and real-world actions. We show that this corrected labeling can be used for imitating the observed behavior, even though no expert actions are given. We evaluate our approach within classic control and photo-realistic visual environments and demonstrate that it performs well when compared to standard approaches.


1 Introduction

Humans often learn from and develop experiences through mimicry. It is noteworthy that we can mirror behavior through only the observation of state trajectories without knowing the underlying actions (e.g., the exact kinematic forces) and intentions that yielded them rizzolatti2010functional . In order to be general, artificial agents should also be equipped with similar forms of mimicry; however, in the typical case, agents cannot learn to imitate behaviors from observation of states alone. Instead, learning approaches usually need both observations and actions to inform them of the policies they should follow along with extensive interaction with the environment.

There is an increasing number of scenarios where state trajectories are readily available but the corresponding actions are not. Consider the case of autonomous driving. There is a large supply of dash cam footage on the web showing expert driving, but no access to the underlying driving actions. The risks of random exploration in the real world are certainly too high, so we would like to minimize environment interactions. Or take the case of learning to navigate indoors from human demonstrations in a first-person perspective. Imitating in this setting could be difficult, as actions performed by humans are hard to obtain and to transform to the robot’s action space.

A recent approach to overcoming these issues is to first learn a self-supervised model of how to imitate by interacting with the environment pathak2018zero; torabi2018behavioral, but unguided exploration can be risky in many real-world scenarios and the required interactions can be costly to obtain. Thus, we need a mechanism for learning policies from observation alone without requiring access to expert actions and with only a few interactions within the environment.

In order to tackle this challenge, we hypothesize that predictable, though unknown, causes may describe the classes of transitions that we observe. These causes could be natural phenomena in the world, or the consequences of the actions that the agent takes. This work aims to answer the question: can an agent predict and then imitate these latent causes, even though the ground truth environmental actions are unknown? We follow a two-step approach, where the agent first learns a policy offline in a latent space that best describes the observed transitions. Then it takes a limited number of steps in the environment to ground this latent policy to the true action labels.

In particular, we first make the assumption that the transitions between states can be described through a discrete set of latent actions. We then learn a forward dynamics model that, given a state and latent action, predicts the next state and prior, supervised only by {state, next state} pairs. We use this model to greedily select the latent action that leads to the most probable next state. Because these latent actions are initially mislabeled, we use a few interactions with the environment to learn a relabeling that outputs the probability of the true action. We liken this to learning to play a video game by observing a friend play first, and then attempting to play it ourselves. By observing, we can learn the goal of the game and the types of actions we should be taking, but some interaction may be required to learn the correct mapping of controls on the joystick.

We evaluate our approach in three environments: classic control with cartpole, acrobot, and AI2-THOR ai2thor , which is a photo-realistic visual environment that is used for navigation tasks. We show that our approach is able to perform as well as the expert after just a few steps of interacting with the environment.

2 Related work

Imitation learning is the field of training artificial and real-world agents to imitate expert behavior using a set of demonstrations. This approach has an extensive breadth of applications and a long history of research, ranging from early successes in autonomous driving pomerleau1989alvinn , to applications in robotics Schaal1997 ; chernova_robot_2014 and software agents (e.g. silver2016mastering ; nair2017overcoming ). However, these approaches typically assume that the expert’s actions are known. This often requires the data to be specifically recorded for the purpose of imitation learning and drastically reduces the amount of data that is readily available. Furthermore, approaches that do not require expert actions typically must first learn behaviors in the agent’s environment through extensive interactions. As we will discuss, we aim to learn to imitate from only observations of expert states, followed by only a few necessary interactions with the environment. We now describe classic approaches to imitation learning along with more modern approaches.

2.1 Classic approaches

Arguably, the most straightforward and common approach to imitation learning is behavioral cloning pomerleau1989alvinn, which treats imitation learning as a supervised learning problem. More sophisticated methods achieve better performance by reasoning about the state-transitions explicitly, but often require extensive information about the effects of the agent’s actions on the environment. This information can come either in the form of a full, often unknown, dynamics model, or through numerous interactions with the environment. Inverse Reinforcement Learning (IRL) achieves this by using the demonstrated state-action pairs to explicitly derive the expert’s intent in the form of a reward function Ng2000; abbeel2004apprenticeship.

2.2 Direct policy optimization methods

Recently, more direct approaches have been introduced that aim to match the state-action visitation frequencies observed by the agent to those seen in demonstrations. GAIL ho2016generative learns to imitate policies from demonstrations and uses adversarial training to distinguish if a state-action pair comes from following the agent or expert’s policy while simultaneously minimizing the difference between the two. SAIL Schroecker2017 achieves a similar goal by using temporal difference learning to estimate the gradient of the normalized state-action visitation frequency directly. However, while these approaches are efficient in the amount of expert data necessary for training, they typically require a substantial number of interactions within the environment.

2.3 Learning from state observations

Increasingly, works have aspired to learn from observation alone without utilizing expert actions. Imitation from Observation liu2017imitation , for example, learns to imitate from videos without actions and translates from one context to another. However, this approach requires using learned features to compute rewards for reinforcement learning, which will thus require many environment samples to learn a policy. Similarly, time contrastive networks sermanet2017time train robots to imitate from demonstrations of humans performing tasks. But this also learns a reward signal that is later used for reinforcement learning. Therefore, while these approaches learn policies from state observations, they require an intermediary step of using a reward signal, whereas we learn the policy directly without performing reinforcement learning. Recent works have aimed to learn from observations by first learning how to imitate in a self-supervised manner, then given a task, attempt it zero-shot pathak2018zero . However, this approach again requires learning in the agent’s environment rather than initially learning from the observations. Concurrently to this work, Torabi et al. torabi2018behavioral , present an approach which utilizes learned inverse dynamics to train agents from observation. However, while this approach aims to minimize the number of necessary interactions with the environment after a demonstration has been provided, it only shifts the burden to a preprocessing step as learning the inverse dynamics model usually still requires a substantial number of interactions with the environment. Our work aims to first learn policies from demonstrations offline, and then only use a few interactions with the environment to learn the true action labels.

2.4 Multi-modal predictions

As we will discuss, we learn to predict next states given a state and latent action. This is similar to recent works that have learned action-conditional predictions for reinforcement learning environments oh2015action; chiappa2017recurrent, but those approaches utilize ground truth action labels. Rather, our approach learns a latent multi-modal distribution over future predictions. Other related works have utilized latent information to make multi-modal predictions. For example, BicycleGAN zhu2017toward learns to predict a distribution over image-to-image translations, where the modes are sampled given a latent vector. InfoGAN uses latent codes for learning interpretable representations chen2016infogan, and InfoGAIL li2017infogail uses that approach to capture latent factors of variation between different demonstrations. These works, however, do not attempt to learn direct priors over the modes, which is crucial in our formulation for deriving policies.

As such, our approach is more analogous to online clustering, as we aim to learn multiple expected next states and priors over them. However, we do not have direct access to the clusters or means. Other works have aimed to cluster demonstrations, but these approaches have traditionally segmented different types of trajectories, which represent distinct preferences, rather than next-state predictions hausman2017multi ; babes2011apprenticeship .

3 Approach

(a) Latent Policy Network
(b) Action Remapping Network
Figure 1: The latent policy network learns priors $\pi_\omega(z \mid s_t)$ and predicted next states $G_\theta(E_p(s_t), z)$. The action remapping network learns $\pi_\xi(a_t \mid z, E_a(s_t))$.

We now describe our approach for Imitating Latent Policies from Observation (ILPO). We develop this work in two steps. First, we train a multi-modal dynamics model based on expert data using only state-observations. This model predicts changes to the state of the agent and the environment in terms of latent causes that we simultaneously extract from the training data, as discussed in Section 3.2. We then describe how to use a limited number of interactions with the environment to efficiently associate the true actions the agent can take with the latent causes identified by our learned model. This step will be described in Section 3.3.

3.1 Problem formulation

We formulate the ILPO problem as a Markov Decision Process (MDP) $\langle S, A, R, T \rangle$ sutton1998reinforcement. Here, $s \in S$ denotes the states in the environment, $a \in A$ corresponds to actions, $r \in R$ are the rewards the agent receives in each state, and $T(s, a, s')$ is the transition model, which we assume is unknown. Reinforcement learning approaches aim to learn policies $\pi(a \mid s)$ that maximize the long-term expected reward. In this work, we use imitation learning to directly learn the policy and use the reward only for evaluation purposes.

The action spaces that we consider are discrete, and the state transitions are assumed to be deterministic. We are given a set of expert demonstrations described through noisy state observations $\{s_0, s_1, \dots, s_N\}$. Noise is necessary for ensuring the state transitions are properly modeled. Given two consecutive observations $\{s_t, s_{t+1}\}$, $z$ is defined as a latent action that caused this transition. Because we are dealing with MDPs, we assume that the number of actions, $|A|$, is known, and thus there are $|A|$ latent codes. Note that this assumption is not excessive and is in fact standard; otherwise we would not have a way of controlling the agent or representing any policy. Later, we empirically relax this assumption and study the effect of additional latent codes, beyond the number of environment actions.

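function ILPO($s^*_0, s^*_1, \dots, s^*_N$)
    for each training epoch do
        for $i \leftarrow 0 \dots N-1$ do (Omitting batching for clarity)
            $\theta \leftarrow \theta - \nabla_\theta \min_z \lVert G_\theta(E_p(s^*_i), z) - s^*_{i+1} \rVert_2^2$
            $\hat{s} \leftarrow \sum_z \pi_\omega(z \mid s^*_i)\, G_\theta(E_p(s^*_i), z)$
            $\omega \leftarrow \omega - \nabla_\omega \lVert s^*_{i+1} - \hat{s} \rVert_2^2$
    Observe state $s_0$
    for each environment interaction $t$ do
        Choose latent action $z_t \leftarrow \arg\max_z \pi_\omega(z \mid s_t)$
        Take $\epsilon$-greedy action $a_t \leftarrow \arg\max_a \pi_\xi(a \mid z_t, E_a(s_t))$
        Observe state $s_{t+1}$
        Infer closest latent action $z_t \leftarrow \arg\min_z \lVert E_p(s_{t+1}) - E_p(G_\theta(E_p(s_t), z)) \rVert_2$
        Update $\xi$ by training $\pi_\xi(a_t \mid z_t, E_a(s_t))$ as a classifier with label $a_t$
Algorithm 1 Imitating Latent Policies from Observation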

3.2 Step 1: Learning latent policies

Given expert observations $\{s_t, s_{t+1}\}$, we aim to learn a latent policy $\pi_\omega(z \mid s_t)$ that describes the probability that a latent action $z$ would be taken when observing $s_t$. This policy is represented using a neural network with two key components: a multi-modal forward latent dynamics model $G_\theta$ that learns to predict $s_{t+1}$, and a prior over $z$ given $s_t$, which gives us the latent policy, as shown in Figure 1.

When learning to predict next states, a single prediction will not account for the different modes of the distribution and will thus predict the mean. Instead, our approach generates multiple predictions, one for each of the latent codes $z$. To allow predictions to converge to the different modes, we only penalize the one closest to the true next observation, $s_{t+1}$. Similar to recent works that predict state dynamics edwards2018forward; goyal2018recall, we learn the change in state $\Delta_t = s_{t+1} - s_t$, rather than the absolute next state, and compute $s_{t+1} = s_t + \Delta_t$.

As such, we compute the loss on the generators as:

$$\mathcal{L}_{min} = \min_z \lVert \Delta_t - G_\theta(E_p(s_t), z) \rVert_2^2 \tag{1}$$

where $E_p$ is the state embedding of the latent policy network and $G_\theta$ is the generator. Hence the generator must learn to predict the closest mode within the multi-modal distribution conditioned on $s_t$. Since $G_\theta$ is learning to predict deltas, we will need to add each prediction to $s_t$ in order to obtain a prediction for $s_{t+1}$. For simplicity, in further discussion we will refer to $G_\theta$ directly as the predictions summed with the state input $s_t$.

Concurrently, we aim to learn the likelihood $\pi_\omega(z \mid s_t)$. This represents the probability that, given a state $s_t$, a latent transition of type $z$ will be observed in the expert data. It can be learned by computing the expectation of the generated predictions $G_\theta(E_p(s_t), z)$ under this distribution as:

$$\hat{s}_{t+1} = \mathbb{E}_{\pi_\omega}[s_{t+1} \mid s_t] = \sum_z \pi_\omega(z \mid s_t)\, G_\theta(E_p(s_t), z) \tag{2}$$

We then minimize the loss as:

$$\mathcal{L}_{exp} = \lVert s_{t+1} - \hat{s}_{t+1} \rVert_2^2 \tag{3}$$

while holding the individual predictions fixed. This training scheme essentially performs online clustering over the latent labels, as it slowly moves the closest prediction, or cluster mean, towards the current sample of the true next state. Then keeping all the predictions fixed, it computes the priors over each cluster. The latent policy, which is the probability of picking latent action $z$ in state $s_t$, then directly corresponds to these cluster priors.
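The two training signals described above, the closest-prediction loss on the generators and the expectation loss on the priors, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the generator outputs and priors are passed in as plain lists rather than produced by neural networks, and no gradient step is shown.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def latent_policy_losses(
    s_t: Vector,
    s_next: Vector,
    predicted_deltas: List[Vector],
    priors: Sequence[float],
) -> Tuple[float, float, int]:
    """Return (closest-prediction loss, expectation loss, closest generator index).

    predicted_deltas[z] stands in for the generator output for latent action z,
    and priors[z] for the latent policy probability of z in state s_t.
    Both are hypothetical inputs used only for illustration.
    """
    # Each generator predicts a change in state; add it to s_t to get a next state.
    predictions = [[s + d for s, d in zip(s_t, delta)] for delta in predicted_deltas]

    # Generator loss: only the prediction closest to the true next state is penalized.
    distances = [squared_distance(p, s_next) for p in predictions]
    closest = min(range(len(distances)), key=distances.__getitem__)
    loss_min = distances[closest]

    # Expectation loss: the prior-weighted mean of the predictions should match s_next.
    expected = [
        sum(priors[z] * predictions[z][i] for z in range(len(predictions)))
        for i in range(len(s_t))
    ]
    loss_exp = squared_distance(expected, s_next)
    return loss_min, loss_exp, closest


# Toy 1-D example: two latent actions that move the state left or right.
l_min, l_exp, z = latent_policy_losses(
    s_t=[0.0], s_next=[1.0], predicted_deltas=[[-1.0], [1.0]], priors=[0.5, 0.5]
)
print(l_min, l_exp, z)  # 0.0 1.0 1
```

In the toy example the second generator already matches the observed transition, so the generator loss is zero, while the uniform prior still incurs an expectation loss; shifting the prior mass onto the second latent action would drive that loss to zero as well.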

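The entire network is trained using the combined loss, the sum of the generator loss $\mathcal{L}_{min}$ and the expectation loss $\mathcal{L}_{exp}$ described above:

$$\mathcal{L}_{policy} = \mathcal{L}_{min} + \mathcal{L}_{exp} \tag{4}$$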

3.3 Step 2: Correcting labels

In the previous step, we described how to train a multi-modal dynamics model to predict the state-trajectories observed by the expert, and to extract the latent actions that might cause these transitions. This first step fully extracts the information provided by the observed demonstration data and the bulk of information that is needed about the dynamics. Namely, it describes each of the different ways in which the environment can change. What remains is to systematically identify the agent’s capability to effect such changes in the environment.

To achieve this goal, the agent needs to learn a mapping from the latent actions to the true action space: $\pi_\xi(a_t \mid z, E_a(s_t))$, where $E_a$ is the state embedding of the action remapping network, as shown in Figure 1. It is worth noting that such a mapping depends on the current state $s_t$, since the latent causes are not necessarily invariant across the state space. Nevertheless, the generalization capabilities and representational limitations of neural networks encourage a strong correlation between $a$ and $z$. In real-world situations it is most often the case that the dynamics in two states are more similar for the same action than they are for two different actions, thus assigning the same latent cause to the same action allows the network to generalize more easily. It is furthermore worth noting that while this intuition will allow us to learn such a mapping from only a few interactions with the environment, it is not an absolute assumption that we make and the algorithm will be able to learn to imitate the expert’s policy regardless.

To learn a mapping from latent causes to the agent’s actions, it is invariably necessary for the agent to explore the effect of its own actions. To this end, we interact with the environment and collect experiences in the form of $\{s_t, a_t, s_{t+1}\}$ triples. Note that this interaction can follow any policy, such as a random policy or one that is updated in an online way. The only stipulation is that a diverse section of the state space is explored to facilitate generalization. In this work, we choose to iteratively refine the remapped policy and collect experiences by following our current best guess, in addition to an exploration strategy such as $\epsilon$-greedy. Having collected an experience, we proceed in two steps. First, we identify the latent cause based on the observed state transition, and then we use the environmental action taken as a label to train the remapping policy in a supervised manner.
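As a rough illustration of the experience collection described above, the following sketch rolls out a current best-guess policy with epsilon-greedy exploration and records (state, action, next state) triples. The toy chain environment, the placeholder policy, and all function names are assumptions made for this example only; they are not part of the original method.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[int, int, int]  # (state, action, next_state)


def epsilon_greedy(greedy_action: int, num_actions: int,
                   epsilon: float, rng: random.Random) -> int:
    """Return a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return greedy_action


def collect_experience(
    step: Callable[[int, int], int],
    current_policy: Callable[[int], int],
    start_state: int,
    num_steps: int,
    num_actions: int,
    epsilon: float,
    rng: random.Random,
) -> List[Transition]:
    """Roll out the current best-guess policy with epsilon-greedy exploration."""
    experience: List[Transition] = []
    state = start_state
    for _ in range(num_steps):
        action = epsilon_greedy(current_policy(state), num_actions, epsilon, rng)
        next_state = step(state, action)
        experience.append((state, action, next_state))
        state = next_state
    return experience


def chain_step(state: int, action: int) -> int:
    """Hypothetical toy dynamics: action 0 moves left, action 1 moves right on 0..4."""
    move = 1 if action == 1 else -1
    return max(0, min(4, state + move))


triples = collect_experience(
    step=chain_step,
    current_policy=lambda s: 1,  # placeholder for the remapped policy
    start_state=0,
    num_steps=3,
    num_actions=2,
    epsilon=0.0,  # no exploration here, so the rollout is deterministic
    rng=random.Random(0),
)
print(triples)  # [(0, 1, 1), (1, 1, 2), (2, 1, 3)]
```

Raising `epsilon` trades off following the current guess against visiting a more diverse part of the state space, which is the stipulation the text places on the interaction policy.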

To identify the latent action from a state transition $\{s_t, s_{t+1}\}$, we predict the next state for every possible latent cause using the multi-modal dynamics model trained in Section 3.2 and identify the predicted next state that is most similar to the observed next state:

$$z_t = \arg\min_z \lVert s_{t+1} - G_\theta(E_p(s_t), z) \rVert_2 \tag{5}$$

To extend this approach to situations where Euclidean distance is not meaningful (such as high-dimensional visual domains), we propose to measure distance in the space of the embedding $E_p$ learned in Section 3.2. Note that we can do this readily as this step depends only on the predicted actions. In these domains, the latent action is thus given by:

$$z_t = \arg\min_z \lVert E_p(s_{t+1}) - E_p(G_\theta(E_p(s_t), z)) \rVert_2 \tag{6}$$

Having obtained the latent action $z_t$ most closely corresponding to the environmental action $a_t$, we then train $\pi_\xi(a_t \mid z_t, E_a(s_t))$ by treating the problem as a straightforward classification problem.
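The labeling step above can be sketched as follows: for each collected triple, the latent action whose predicted next state is closest to the observed one is paired with the environment action that was taken. To keep the example self-contained, the neural remapping classifier is replaced here by a state-independent majority-vote table, which is a deliberate simplification and not the method used in the paper; `predict_next` is a hypothetical stand-in for the learned dynamics model.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Triple = Tuple[Vector, int, Vector]  # (state, environment action, next state)


def closest_latent_action(s_next: Vector, predictions: List[Vector]) -> int:
    """Index of the predicted next state with the smallest Euclidean distance to s_next."""
    distances = [
        sum((p - q) ** 2 for p, q in zip(pred, s_next)) for pred in predictions
    ]
    return min(range(len(distances)), key=distances.__getitem__)


def fit_tabular_remapping(
    triples: List[Triple],
    predict_next: Callable[[Vector], List[Vector]],
) -> Dict[int, int]:
    """Map each latent action to the environment action it most often co-occurs with.

    This majority vote ignores the state and only stands in for the
    state-conditioned classifier described in the text.
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for s_t, a_t, s_next in triples:
        z_t = closest_latent_action(s_next, predict_next(s_t))
        counts[z_t][a_t] += 1
    return {z: c.most_common(1)[0][0] for z, c in counts.items()}


def toy_predict_next(s_t: Vector) -> List[Vector]:
    """Hypothetical learned model: latent 0 predicts a step left, latent 1 a step right."""
    return [[s_t[0] - 1.0], [s_t[0] + 1.0]]


# In this toy environment, action 0 moves right and action 1 moves left,
# so the latent labels are permuted relative to the real actions.
experience = [([0.0], 0, [1.0]), ([1.0], 1, [0.0]), ([2.0], 0, [3.0])]
remapping = fit_tabular_remapping(experience, toy_predict_next)
print(remapping)  # {1: 0, 0: 1}
```

For visual domains, the same structure applies with the distance computed between embeddings of the observed and predicted next states instead of between raw states.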

Combining the two steps into a full imitation learning algorithm, we can now use the latent policy model outlined in the previous section to identify the latent cause that is most likely to have the effect that the expert intended. We then identify the action that is most likely to cause this effect. Together, these define a policy that learns to imitate the expert’s behavior without having seen any of the actions the expert has taken. This procedure is outlined in Algorithm 1.
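The composition of the two learned components at decision time can be sketched as below: pick the most probable latent action under the latent policy, then pick the most probable environment action given that latent action and the state. Both callables are hypothetical stand-ins for the trained networks, and the probabilities in the toy example are invented for illustration.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def imitation_policy(
    state: Vector,
    latent_prior: Callable[[Vector], Sequence[float]],
    remap: Callable[[int, Vector], Sequence[float]],
) -> Tuple[int, int]:
    """Return (latent action, environment action) chosen greedily for this state."""
    z = argmax(latent_prior(state))  # most likely latent cause in the expert data
    a = argmax(remap(z, state))      # environment action most likely to produce it
    return z, a


def toy_latent_prior(state: Vector) -> Sequence[float]:
    """Hypothetical latent policy: prefer latent 1 left of the origin, latent 0 otherwise."""
    return [0.2, 0.8] if state[0] < 0 else [0.9, 0.1]


def toy_remap(z: int, state: Vector) -> Sequence[float]:
    """Hypothetical remapping: latent 0 corresponds to action 1, latent 1 to action 0."""
    return [0.0, 1.0] if z == 0 else [1.0, 0.0]


print(imitation_policy([-1.0], toy_latent_prior, toy_remap))  # (1, 0)
print(imitation_policy([1.0], toy_latent_prior, toy_remap))   # (0, 1)
```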

4 Experiments and results

Figure 2: Cartpole and Acrobot results. The trials were averaged over runs for ILPO and the policy was evaluated every steps. The reward used for training the expert and evaluation was +1 for every step that the pole was upright in cartpole, and a -1 step cost for acrobot.

The experiments aim to demonstrate that ILPO is able to imitate from state observations only and with few interactions with the environment. To assess the impact of the number of latent actions $|z|$ on the agent’s performance, we run multiple studies with different values of $|z|$.

We evaluate the approach within classic control problems as well as a more complex vision domain. We used OpenAI Baselines baselines to obtain expert policies and generate demonstrations for each environment. We compare ILPO against this expert, a random policy, and Behavioral Cloning (BC), which is given ground truth actions, averaged over trials.

4.1 Classic control environments

Figure 3: Classic control environments: Cartpole (left) and Acrobot (right).

We first evaluated our approach within classic control environments, using the standard distance metric from equation 5 to compute the distances between observed and predicted next states. We used the same network structure and hyperparameters across both domains, as described in the appendix. We used expert state observations to train ILPO, and the corresponding actions to train Behavioral Cloning (BC).

Cartpole (Figure 3 left) is a classic control environment used in reinforcement learning settings sutton1998reinforcement . In this environment, an agent must learn to balance a pole on a cart by applying forces of and on it. The state space consists of dimensions: , and the action space consists of the forces. As such, our method must predict latent actions and generate predicted next states with dimensions. Additionally, we test problems where is and .

Figure 2 (left) shows the results in this environment. Our approach learns the correct policy and is able to achieve the same performance as the expert and behavioral cloning in less than steps within the environment. Furthermore, using incorrect sizes for did not negatively impact the performance. This indicates that the policy-remapping step of our approach is able to correctly map the latent causes to meaningful policies, even when the generators do not correspond only to unique actions.

Acrobot (Figure 3 right) is another classic environment, where an agent with links must learn to swing its end-effector to the top by applying a torque of , , or to its joint. The state space consists of dimensions: , and the action space consists of the forces. As such, our method must predict latent actions and generate a predicted next state with dimensions. We additionally test problems where is and .

Figure 2 shows the results in this environment. Our approach again learns the correct policy after a few steps and is able to achieve performance as good as that of the expert and behavioral cloning, again within steps. For one additional latent code, the performance is not impacted. However, the performance drops when . It is likely that the remapping takes longer in this case. However, the agent does eventually perform well.

4.2 First-person navigation

Figure 4: AI2-Thor environment and results. The trials were averaged over runs for ILPO and the policy for was evaluated every steps. The reward used for training the expert and evaluation was the negative distance to the target.
(a) State input
(b) Next state target
(c) Generator outputs and selection
Figure 5: Generator predictions for AI2-Thor. The first column represents different state inputs, the next column shows the target next state, and the final column shows the corresponding generator predictions and prediction with the smallest distance to the target.

We also evaluated our approach in a more complex visual environment, AI2-Thor ai2thor , which is a photo-realistic first-person environment for navigation tasks. In the defined task, the agent must navigate down a hallway to a refrigerator. The environment has a large obstacle, an island, in the middle of the room that the agent must navigate around. We used expert state observations to train ILPO, and the corresponding actions to train BC.

The state space consists of xx pixels and actions: move forward, backward, and rotate. As such, our method must predict latent actions and xx dimensions for the next-state predictions. As before, we test ILPO with size of the latent action space, greater than the underlying environment action space, in this case and . We use the embedded distance metric from equation 6 to compute the distances between observed and predicted next states.

Figure 4 shows the results for imitation learning. Again, we see ILPO is able to perform as well as the expert and BC. As this environment is high-dimensional, it takes more steps to learn the remapped policy than the previous experiments. However, it still only requires around steps to achieve good performance. Interestingly, ILPO with 3 latent actions never achieves optimal performance, even though it corresponds to the exact number of environmental actions. We hypothesize that this behavior is because there may be more than 3 underlying causes of state transitions in this environment. For example, sometimes the agent may take the ‘forward’ action, but bump into a wall and remain stationary. But if there is no wall in front of the agent, the same action will successfully transition into a new state. Even though the same real action caused both transitions, they will appear as separate modes in the state-transition distribution, and hence separate latent actions. Thus, having more latent codes than real actions may help learning. But, as we see with , having too many generators may slow down learning.

Figure 5 shows a selection of predictions made by the generator for AI2-Thor. As shown, each generator learns to make different predictions for the output. For example, in the second row, the first generator predicts rotation (looking down the corridor), the second predicts moving backward (farther from the wall), and the last predicts moving forward (closer to the wall). Interestingly, we see that the roles of the second and third generator flip in another example in the first row. So the mappings of latent actions to environmental actions are not globally consistent. Furthermore, we see that the distance metric is able to choose the correct generator, even if the prediction is blurry, as in row 3. Despite these difficulties in latent action consistency across the state space and blurry predictions, the agent was able to perform well compared to the expert and behavioral cloning (Figure 4).

5 Discussion and conclusion

In this paper, we introduced ILPO and described how agents can learn to imitate latent policies from only expert state observations and very few environment interactions. In many real-world scenarios unguided exploration in the environment may be very risky but expert observations can be readily made available. Such a method of learning policies directly from observation followed by a small number of action alignment interactions with the environment can be very useful. We demonstrated that this approach recovered the expert behavior in three different domains consisting of classic control and challenging vision-based tasks. ILPO requires very few environment interactions compared to leading imitation and reinforcement learning methods, owing to the information provided by the expert observations.

There are many ways that this work can be extended. First, there are two assumptions in the current formulation of the problem: 1) it requires that actions are discrete and 2) it assumes that the state transitions are deterministic, although a small amount of noise in the predictions does not prevent learning, as we have shown in the AI2-Thor domain. Second, the action remapping step can be made even more efficient by enforcing stronger local consistencies between latent actions and generated predictions across different states. This will drastically reduce the number of samples required to train the action remapping network by decreasing variation between latent and real actions.

We hope that this work will introduce opportunities for learning to observe not only from similar agents, but from other agents with different embodiments whose actions are unknown or do not have a known correspondence. Another contribution would be to learn to transfer across different environments.

In general, our work is complementary to many of the related approaches we have discussed. Many algorithms rely on behavioral cloning as a pre-training step for more sophisticated approaches. As such, we believe ILPO could also be used for pre-training imitation, without requiring access to expert actions.


6 Appendix

We now discuss the hyperparameters and architectures used to train ILPO and BC. We used the Adam Optimizer to train all experiments.

6.1 Cartpole and Acrobot

In the latent policy network, the embedding architecture for was: , where dims was the number of state dimensions. The leak parameter for was for each case. After converting to a one-hot, the generator network encodes it with , concatenates it with and places it through , then places it into the following architecture to compute . We trained the network for epochs with a batch size of . The learning rate was .

In the action remapping network, the embedding architecture for was the same as . After converting to a one-hot, the generator network encodes it with , concatenates it with and performs lrelu, then places it into the following architecture to compute . We trained the network for steps with a batch size of . The learning rate was .

Behavior cloning used the same architecture as for encoding states, followed by lrelu and then . We trained the network for epochs with a batch size of . The learning rate was .

6.2 AI2-Thor

In the latent policy network, the embedding architecture for was: . The parameter for was for each case. After converting to a one-hot, the generator network encodes it with , concatenates it with and does , then places it into the following architecture to compute . All filters were size x with stride . We trained the network for epochs with a batch size of . The learning rate was .

In the action remapping network, the embedding architecture for was the same as . After converting to a one-hot, the generator network encodes it with , concatenates it with and places it through lrelu, then places it into the following architecture to compute . We trained the network for steps with a batch size of . The learning rate was .

Behavior cloning used the same architecture as for encoding states, followed by lrelu and then . We trained the network for epochs with a batch size of . The learning rate was .