Causal Confusion in Imitation Learning

05/28/2019 ∙ by Pim de Haan, et al.

Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive "causal confusion" phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions---either environment interaction or expert queries---to determine the correct causal model. We show that causal confusion occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.


1 Introduction

Imitation learning allows for control policies to be learned directly from example demonstrations provided by human experts. It is easy to implement, and reduces or removes the need for extensive interaction with the environment during training Widrow and Smith (1964); Pomerleau (1989); Bojarski et al. (2016); Argall et al. (2009); Hussein et al. (2017).

However, imitation learning suffers from a fundamental problem: distributional shift Daumé et al. (2009); Ross and Bagnell (2010). Training and testing state distributions are different, induced respectively by the expert and learned policies. Therefore, imitating expert actions on expert trajectories may not align with the true task objective. This problem is widely acknowledged Pomerleau (1989); Daumé et al. (2009); Ross and Bagnell (2010); Ross et al. (2011), yet with careful engineering, naïve behavioral cloning approaches have yielded good results for several practical problems Widrow and Smith (1964); Pomerleau (1989); Schaal (1999); Muller et al. (2006); Mülling et al. (2013); Bojarski et al. (2016); Mahler and Goldberg (2017); Bansal et al. (2019). This raises the question: is distributional shift really still a problem?

In this paper, we identify a somewhat surprising and very problematic effect of distributional shift: “causal confusion.” Distinguishing correlates of expert actions in the demonstration set from true causes is usually very difficult, but may be ignored without adverse effects when training and testing distributions are identical (as assumed in supervised learning), since nuisance correlates continue to hold in the test set. However, this can cause catastrophic problems in imitation learning due to distributional shift. This is exacerbated by the causal structure of sequential action: the very fact that current actions cause future observations often introduces complex new nuisance correlates.

To illustrate, consider behavioral cloning to train a neural network to drive a car. In scenario A, the model's input is an image of the dashboard and windshield, and in scenario B, the input to the model (with identical architecture) is the same image but with the dashboard masked out (see Fig 1). Both cloned policies achieve low training loss, but when tested on the road, model B drives well, while model A does not. The reason: the dashboard has an indicator light that comes on immediately when the brake is applied, and model A wrongly learns to apply the brake only when the brake light is on. Even though the brake light is the effect of braking, model A could achieve low training error by misidentifying it as the cause instead.

Figure 1: Causal confusion: more information yields worse imitation learning performance. Model A relies on the braking indicator to decide whether to brake. Model B instead correctly attends to the pedestrian.

This situation presents a give-away symptom of causal confusion: access to more information leads to worse generalization performance in the presence of distributional shift. Causal confusion occurs commonly in natural imitation learning settings, especially when the imitator’s inputs include history information.

In this paper, we first point out and investigate the causal confusion problem in imitation learning. Then, we propose a solution to overcome it by learning the correct causal model, even when using complex deep neural network policies. We learn a mapping from causal graphs to policies, and then use targeted interventions to efficiently search for the correct policy, either by querying an expert, or by executing selected policies in the environment.

2 Related Work

Imitation learning.   Imitation learning through behavioral cloning dates back to Widrow and Smith (1964), and has remained popular through today Pomerleau (1989); Schaal (1999); Muller et al. (2006); Mülling et al. (2013); Bojarski et al. (2016); Giusti et al. (2016); Mahler and Goldberg (2017); Wang et al. (2019); Bansal et al. (2019). The distributional shift problem, wherein a cloned policy encounters unfamiliar states during autonomous execution, has been identified as an issue in imitation learning Pomerleau (1989); Daumé et al. (2009); Ross and Bagnell (2010); Ross et al. (2011); Laskey et al. (2017); Ho and Ermon (2016); Bansal et al. (2019). This is closely tied to the "feedback" problem in general machine learning systems that have direct or indirect access to their own past states Sculley et al. (2015); Bagnell (2016). For imitation learning, various solutions to this problem have been proposed (Daumé et al., 2009; Ross and Bagnell, 2010; Ross et al., 2011) that rely on iteratively querying an expert based on states encountered by some intermediate cloned policy, to overcome distributional shift; DAgger Ross et al. (2011) has come to be the most widely used of these solutions.

We show evidence that the distributional shift problem in imitation learning is often due to causal confusion, as illustrated schematically in Fig 1. We propose to address this through targeted interventions on the states to learn the true causal model to overcome distributional shift. As we will show, these interventions can take the form of either environmental rewards with no additional expert involvement, or of expert queries in cases where the expert is available for additional inputs. In expert query mode, our approach may be directly compared to DAgger (Ross et al., 2011): indeed, we show that we successfully resolve causal confusion using orders of magnitude fewer queries than DAgger.

We also compare against  Bansal et al. (2019): to prevent imitators from copying past actions, they train with dropout Srivastava et al. (2014) on dimensions that might reveal past actions. While our approach seeks to find the true causal graph in a mixture of graph-parameterized policies, dropout corresponds to directly applying the mixture policy. In our experiments, our approach performs significantly better.

Causal inference.   Causal inference is the general problem of deducing cause-effect relationships among variables (Spirtes et al., 2000; Pearl, 2009; Peters et al., 2017; Spirtes, 2010; Eberhardt, 2017; Spirtes and Zhang, 2016). “Causal discovery” approaches allow causal inference from pre-recorded observations under constraints (Steyvers et al., 2003; Heckerman et al., 2006; Lopez-Paz et al., 2017; Guyon et al., 2008; Louizos et al., 2017; Maathuis et al., 2010; Le et al., 2016; Goudet et al., 2017; Mitrovic et al., 2018; Wang and Blei, 2018). Observational causal inference is known to be impossible in general (Pearl, 2009; Peters et al., 2014). We operate in the interventional regime (Tong and Koller, 2001; Eberhardt and Scheines, 2007; Shanmugam et al., 2015; Sen et al., 2017) where a user may “experiment” to discover causal structures by assigning values to some subset of the variables of interest and observing the effects on the rest of the system. We propose a new interventional causal inference approach suited to imitation learning. While ignoring causal structure is particularly problematic in imitation learning, ours is the first effort directly addressing this, to our knowledge.

3 The Phenomenon of Causal Confusion

In imitation learning, an expert demonstrates how to perform a task (e.g., driving a car) for the benefit of an agent. In each demo, the agent has access both to its $D$-dimensional state observation $x_t$ at each time $t$ (e.g., a video feed from a camera), and to the expert's action $a_t$ (e.g., steering, acceleration, braking). Behavioral cloning approaches learn a mapping $\pi$ from $x_t$ to $a_t$ using all $(x_t, a_t)$ tuples from the demonstrations. At test time, the agent observes $x_t$ and executes $\pi(x_t)$.
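For concreteness, a minimal behavioral-cloning loop under these definitions might look like the sketch below. This is an illustrative PyTorch sketch, not the authors' code: the dimensions, network size, and discrete-action assumption are hypothetical.

```python
# Minimal behavioral cloning sketch (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn

obs_dim, n_actions = 32, 4  # hypothetical dimensions

policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),            # logits over discrete expert actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def bc_update(obs_batch, act_batch):
    """One supervised update on a batch of (x_t, a_t) tuples from expert demos."""
    logits = policy(obs_batch)
    loss = loss_fn(logits, act_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(obs):
    """At test time the agent simply executes pi(x_t)."""
    with torch.no_grad():
        return policy(obs.unsqueeze(0)).argmax(dim=-1).item()
```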

Figure 2: Causal dynamics of imitation. Parents of a node represent its causes.

The underlying sequential decision process has complex causal structures, represented in Fig 2. States influence future expert actions, and are also themselves influenced by past actions and states.

In particular, expert actions are influenced by some information in the state $x_t$, and unaffected by the rest. For the moment, assume that the dimensions $x_t = [x_t^1, \ldots, x_t^D]$ of $x_t$ represent disentangled factors of variation. Then some unknown subset of these factors ("causes") affect expert actions, and the rest do not ("nuisance variables"). A confounder influences each state variable in $x_t$, so that some nuisance variables may still be correlated with $a_t$ among $(x_t, a_t)$ pairs from demonstrations. In Fig 1, the dashboard light represents a confounder.

A naïvely behavior-cloned policy might rely on nuisance correlates to select actions, producing low training error, and even generalizing to held-out $(x_t, a_t)$ pairs. However, this policy must contend with distributional shift when deployed: actions are chosen by the imitator rather than the expert, affecting the distribution of $x_t$ and $a_t$. This in turn affects the mapping from $x_t$ to $a_t$ that yields good behavior, leading to poor performance of expert-cloned policies. We define "causal confusion" as the phenomenon whereby cloned policies fail by misidentifying the causes of expert actions.

3.1 Robustness and Causality in Imitation Learning

Intuitively, distributional shift affects the relationship of the expert action to nuisance variables, but not to the true causes. In other words, to be maximally robust to distributional shift, a policy must rely solely on the true causes of expert actions, thereby avoiding causal confusion. This intuition can be formalized in the language of functional causal models (FCM) and interventions Pearl (2009).

Functional causal models: A functional causal model (FCM) over a set of variables $\{X_i\}_{i=1}^{n}$ is a tuple $(G, \theta)$ containing a graph $G$ over the variables, and deterministic functions $f_i$ with parameters $\theta$ describing how the causes of each variable determine it: $X_i = f_i(X_{\mathrm{pa}(i, G)}, N_i; \theta)$, where $N_i$ is a stochastic noise variable that represents all external influences on $X_i$, and $\mathrm{pa}(i, G)$ denotes the indices of the parent nodes of $X_i$ in $G$, which correspond to its causes.

An "intervention" on $X_i$ to set its value may now be represented by a structural change in this graph to produce the "mutilated graph" $G_{\overline{X_i}}$, in which incoming edges to $X_i$ are removed. (For a more thorough overview of FCMs, see Pearl (2009).)

Applying this formalism to our imitation learning setting, any distributional shift in the state $x_t$ may be modeled by intervening on $x_t$, so that correctly modeling the "interventional query" $p(a_t \mid \mathrm{do}(x_t))$ is sufficient for robustness to distributional shift. Now, we may formalize the intuition that only a policy relying solely on true causes can robustly model the mapping from states to optimal/expert actions under distributional shift.

In Appendix B, we prove that under mild assumptions, correctly modeling interventional queries does indeed require learning the correct causal graph $G$. In the car example, "setting" the brake light to on or off and observing the expert's actions would yield a clear signal unobstructed by confounders: the brake light does not affect the expert's braking behavior.

3.2 Causal Confusion in Policy Learning Benchmarks and Realistic Settings

Before discussing our solution, we first present several testbeds and real-world cases where causal confusion adversely influences imitation learning performance.

Control Benchmarks.   We show that causal confusion is induced with small changes to widely studied benchmark control tasks, simply by adding more information to the state, which intuitively ought to make the tasks easier, not harder. In particular, we add information about the previous action, which tends to correlate with the current action in the expert data for many standard control problems. This is a proxy for scenarios like our car example, in which correlates of past actions are observable in the state, and is similar to what we might see from other sources of knowledge about the past, such as memory or recurrence. We study three kinds of tasks: (i) MountainCar (continuous states, discrete actions), (ii) MuJoCo Hopper (continuous states and actions), (iii) Atari games: Pong, Enduro and UpNDown (states: two stacked consecutive frames, discrete actions).

Figure 3: The Atari environments, (a) Pong, (b) Enduro, and (c) UpNDown, with an indicator of the past action (white number in the lower left).

For each task, we study imitation learning in two scenarios. In scenario A (henceforth called "confounded"), the policy sees the augmented observation vector, including the previous action. In the case of low-dimensional observations, the state vector is expanded to include the previous action at an index that is unknown to the learner. In the case of image observations, we overlay a symbol corresponding to the previous action at an unknown location on the image (see Fig 3). In scenario B ("original"), the previous action variable is replaced with random noise for low-dimensional observations. For image observations, the original images are left unchanged. Demonstrations are generated synthetically as described in Appendix A. In all cases, we use neural networks with identical architectures to represent the policies, and we train them on the same demonstrations.
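As a rough illustration of how the two low-dimensional scenarios could be constructed, consider the sketch below. The helper name and the placement of the extra entry are hypothetical simplifications; in the paper, the index of the previous action within the state is unknown to the learner.

```python
# Illustrative construction of "confounded" vs "original" observations.
import numpy as np

def augment_observation(obs, prev_action, confounded, rng):
    """Build the observation actually shown to the imitator.

    confounded=True  -> scenario A: append the previous action to the state.
    confounded=False -> scenario B: append random noise of the same shape instead.
    (For image observations, the paper instead overlays a symbol for the previous
    action at a fixed but unknown location in the frame.)
    """
    prev = np.atleast_1d(prev_action).astype(np.float32)
    extra = prev if confounded else rng.standard_normal(prev.shape).astype(np.float32)
    return np.concatenate([np.asarray(obs, dtype=np.float32), extra])

rng = np.random.default_rng(0)
obs = np.zeros(2, dtype=np.float32)      # e.g. MountainCar position/velocity
x_confounded = augment_observation(obs, prev_action=1, confounded=True, rng=rng)
x_original = augment_observation(obs, prev_action=1, confounded=False, rng=rng)
```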

Fig 4 shows the rewards against varying demonstration dataset sizes for MountainCar, Hopper, and Pong. Appendix D shows additional results, including for Enduro and UpNDown. All policies are trained to near-zero validation error on held-out expert state-action tuples. original produces rewards tending towards expert performance as the size of the imitation dataset increases. confounded either requires many more demonstrations to reach equivalent performance, or fails completely to do so.

Overall, the results are clear: across these tasks, access to more information leads to inferior performance. As Fig 10 in the appendix shows, this difference is not due to different training/validation losses on the expert demonstrations—for example, in Pong, confounded produces lower validation loss than original on held-out demonstration samples, but produces lower rewards when actually used for control. These results not only validate the existence of causal confusion, but also provide us with testbeds for investigating a potential solution.

Real-World Driving.   In more realistic imitation learning settings too, symptoms of causal confusion have been observed consistently in Muller et al. (2006); Wang et al. (2019); Bansal et al. (2019), when learning to drive from histories of video frames. While these histories contain valuable information for driving, they also naturally introduce information about nuisance factors such as previous actions. In all three cases, more information led to worse results for the behavioral cloning policy, but this was neither attributed specifically to causal confusion, nor tackled using causally motivated approaches.

Methods | Perplexity (validation) | Distance (driving) | Interventions (driving) | Collisions (driving)
history | 0.834 | 144.92 | 2.94 ± 1.79 | 6.49 ± 5.72
no-history | 0.989 | 268.95 | 1.30 ± 0.78 | 3.38 ± 2.55
Table 1: Imitation learning results from Wang et al. (2019). Accessing history yields better validation performance, but worse actual driving performance.

We draw the reader’s attention to particularly telling results from Wang et al. (2019) for learning to drive in near-photorealistic GTA-V Krähenbühl (2018) environments, using behavior cloning with DAgger-inspired expert perturbation. Imitation learning policies are trained using overhead image observations with and without “history” information (history and no-history) about the ego-position trajectory of the car in the past.

As in our tests above, architectures are identical for the two methods, and once again, history has better performance on held-out demonstration data, but much worse performance when actually deployed. Tab 1 shows these results, reproduced from Wang et al. (2019), Table II. These results constitute strong evidence for the prevalence of causal confusion in realistic imitation learning settings. Bansal et al. (2019) also observe similar symptoms in a driving setting, and present a dropout Srivastava et al. (2014) approach to tackle it, which we compare to in our experiments.

Figure 4: Diagnosing causal confusion: net reward (y-axis) vs. number of training samples (x-axis) for original and confounded on (a) MountainCar, (b) Hopper, and (c) Pong, compared to expert reward (mean and stdev over 5 runs). Also see Appendix D.

4 Resolving Causal Confusion

Recall from Sec 3.1 that robustness to causal confusion can be achieved by finding the true causal model of the expert’s actions. We propose a simple pipeline to do this. First, we jointly learn policies corresponding to various causal graphs (Sec 4.1). Then, we perform targeted interventions to efficiently search over the hypothesis set for the correct causal model (Sec 4.2).

4.1 Causal Graph-Parameterized Policy Learning

Figure 5: Graph-parameterized policy.

In this step, we learn a policy corresponding to each candidate causal graph. Recall from Sec 3 that the expert's actions are based on an unknown subset of the state variables $[x^1, \ldots, x^D]$. Each $x^i$ may either be a cause or not, so there are $2^D$ possible graphs. We parameterize the structure of the causal graph as a vector $G \in \{0, 1\}^D$ of binary variables, each indicating the presence of an arrow from $x^i$ to $a$ in Fig 2. We then train a single graph-parameterized policy $\pi_\phi(a \mid x, G) = f_\phi([x \odot G, G])$, where $\odot$ is element-wise multiplication, $[\cdot, \cdot]$ denotes concatenation, and $\phi$ are neural network parameters, trained through gradient descent to minimize:

$\mathbb{E}_{G}\, \mathbb{E}_{(x, a)} \big[\, \ell\big(f_\phi([x \odot G, G]),\, a\big) \big]$,   (1)

where $G$ is drawn uniformly at random over all $2^D$ graphs and $\ell$ is a mean squared error loss for the continuous action environments and a cross-entropy loss for the discrete action environments. Fig 5 shows a schematic of the training-time architecture. The policy network mapping observations to actions represents a mixture of policies, one corresponding to each value of the binary causal graph structure variable $G$, which is sampled as a Bernoulli random vector.
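A minimal sketch of this training step, assuming a discrete-action task and illustrative dimensions: sample a random binary graph per example, mask the disentangled state with it, and condition the policy on the graph by concatenation, as in Eq 1.

```python
# Graph-parameterized policy training sketch (illustrative dimensions/architecture).
import torch
import torch.nn as nn

D, n_actions = 30, 6                       # hypothetical: 30 disentangled factors

# f_phi([x * G, G]) -> action logits
policy = nn.Sequential(
    nn.Linear(2 * D, 128), nn.ReLU(),
    nn.Linear(128, n_actions),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def train_step(x, a):
    """x: (B, D) disentangled states, a: (B,) expert actions (long tensor)."""
    G = torch.randint(0, 2, x.shape).to(x.dtype)    # graph sampled uniformly per example
    logits = policy(torch.cat([x * G, G], dim=-1))  # mask state, condition on graph
    loss = ce(logits, a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time, the same network is queried with a single fixed graph, so it acts as a mixture of $2^D$ graph-specific policies sharing parameters.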

In Appendix C, we propose an approach to perform variational Bayesian causal discovery over graphs $G$, using a latent variable model to infer a distribution over functional causal models (graphs and associated parameters)—the modes of this distribution are the FCMs most consistent with the demonstration data. This resembles the scheme above, except that instead of uniform sampling, graphs are sampled preferentially from FCMs that fit the training demonstrations well. We compare both approaches in Sec 5, finding that simple uniform sampling nearly always suffices in preparation for the next step: targeted intervention.

4.2 Targeted Intervention

Having learned the graph-parameterized policy $\pi_\phi$ as in Sec 4.1, we propose targeted intervention to compute the likelihood of each causal graph structure hypothesis $G$. In a sense, imitation learning provides an ideal setting for studying interventional causal learning: causal confusion presents a clear challenge, while the fact that the problem is situated in a sequential decision process where the agent can interact with the world provides a natural mechanism for carrying out limited interventions.

We propose two intervention modes, both of which can be carried out by interaction with the environment via the actions:

Expert query mode.   This is the standard intervention approach applied to imitation learning: intervene on $x_t$ to assign it a value, and observe the expert response $a_t$. This requires an interactive expert, as in DAgger Ross et al. (2011), but requires substantially fewer expert queries than DAgger, because: (i) the queries serve only to disambiguate among a relatively small set of valid FCMs, and (ii) we use disagreement among the mixture of policies in $\pi_\phi$ to query the expert efficiently in an active learning approach. We summarize this approach in Algorithm 1.

Algorithm 1 Expert query intervention
  Input: policy network $\pi_\phi$ s.t. $\pi_\phi(\cdot \mid x, G) \approx \pi_G(\cdot \mid x)$
  Initialize $w \leftarrow 0$.
  Collect states $S$ by executing $\pi_{\mathrm{mix}}$, the mixture of policies $\pi_\phi(\cdot \mid \cdot, G)$ for uniform samples $G$.
  For each $x$ in $S$, compute the disagreement score $D(x)$ among the policies $\pi_\phi(\cdot \mid x, G)$ for different graphs $G$.
  Select the subset $S^* \subseteq S$ with maximal disagreement $D(x)$.
  Collect state-action pairs $T$ by querying the expert on $S^*$.
  for $i = 1, \ldots, N$ do
    Sample $G_i \sim p(G) \propto \exp\langle w, G \rangle$.
    Compute the likelihood $L_i$ of $G_i$ from the agreement of $\pi_\phi(\cdot \mid \cdot, G_i)$ with the expert labels in $T$.
    Fit $w$ on $\{(G_j, L_j)\}_{j \le i}$ with linear regression.
  end for
  Return: $\arg\max_G \langle w, G \rangle$

Algorithm 2 Policy execution intervention
  Input: policy network $\pi_\phi$ s.t. $\pi_\phi(\cdot \mid x, G) \approx \pi_G(\cdot \mid x)$
  Initialize $w \leftarrow 0$.
  for $i = 1, \ldots, N$ do
    Sample $G_i \sim p(G) \propto \exp\langle w, G \rangle$.
    Collect episode return $R_i$ by executing $\pi_\phi(\cdot \mid \cdot, G_i)$.
    Fit $w$ on $\{(G_j, R_j)\}_{j \le i}$ with linear regression.
  end for
  Return: $\arg\max_G \langle w, G \rangle$
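A sketch of the disagreement-based state selection used before querying the expert. The graph-conditioned policy interface follows the sketch in Sec 4.1; using the variance of the per-graph action distributions as the disagreement score is an illustrative choice on my part, since the exact formula is not spelled out here.

```python
# Disagreement scoring sketch for expert-query interventions (illustrative).
import torch

def disagreement_scores(policy, states, D, n_graph_samples=32):
    """Score each state by how much policies for different graphs disagree on it.

    policy takes the concatenation [x * G, G] (as in the Sec 4.1 sketch) and
    returns action logits; the score is the variance of the resulting action
    distributions across uniformly sampled graphs.
    """
    with torch.no_grad():
        probs = []
        for _ in range(n_graph_samples):
            G = torch.randint(0, 2, (states.shape[0], D)).to(states.dtype)
            logits = policy(torch.cat([states * G, G], dim=-1))
            probs.append(torch.softmax(logits, dim=-1))
        probs = torch.stack(probs)                # (samples, batch, actions)
        return probs.var(dim=0).sum(dim=-1)       # high variance = high disagreement

# States with the highest scores are the ones sent to the expert for labeling.
```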

Policy execution mode.   It is not always possible to query an expert. For example, for a learner learning to drive a car by watching a human driver, it may not be possible to put the human driver into dangerous scenarios that the learner might encounter at intermediate stages of training. In cases like these, where we would like to learn from pre-recorded demonstrations alone, we propose to intervene indirectly by using environmental returns (sum of rewards over time in an episode) $R$. The policies corresponding to different hypotheses $G$ are executed in the environment and the returns collected. The likelihood of each graph is proportional to the exponentiated return $\exp(R(G))$. The intuition is simple: environmental returns contain information about optimal expert policies even when experts are not queryable. Note that we do not even assume access to per-timestep rewards as in standard reinforcement learning; just the sum of rewards for each completed run. As such, this intervention mode is much more flexible. See Algorithm 2.

Note that both of the above intervention approaches evaluate individual hypotheses in isolation, but the number of hypotheses grows exponentially in the number of state variables. To handle larger states, we infer a graph distribution $p(G)$ by assuming an energy-based model with a linear energy $E(G) = -\langle w, G \rangle$, so the graph distribution is $p(G) \propto \exp\langle w, G \rangle$, which factorizes into independent factors $p(G_i = 1) = \sigma(w_i)$, where $\sigma$ is the sigmoid. The independence assumption is sensible, as our approach collapses the distribution to its mode before returning it, and the collapsed distribution is always independent. $w$ is inferred from linear regression on the likelihoods. This process is depicted in Algorithms 1 and 2. The above method can be formalized within the reinforcement learning framework Levine (2018). As we show in Appendix G, the energy-based model can be seen as an instance of soft Q-learning Haarnoja et al. (2017).
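A sketch of the policy-execution intervention loop under the linear-energy assumption above. Here `run_episode` is a hypothetical callback that executes the graph-conditioned policy for a given binary graph and reports the episode return; the temperature and episode count are illustrative, not values from the paper.

```python
# Policy-execution intervention sketch: fit a linear energy over graph bits to returns.
import numpy as np

def policy_execution_intervention(run_episode, D, n_episodes=50, temperature=1.0):
    """Infer p(G) with a linear energy <w, G> fitted by regression on episode returns.

    run_episode(G) -> float return, executing the policy conditioned on binary
    graph G (hypothetical interface). Returns the mode of the final p(G).
    """
    w = np.zeros(D)
    graphs, returns = [], []
    for _ in range(n_episodes):
        # Sample each graph bit independently: p(G_i = 1) = sigmoid(w_i / T).
        p = 1.0 / (1.0 + np.exp(-w / temperature))
        G = (np.random.rand(D) < p).astype(np.float64)
        R = run_episode(G)
        graphs.append(G)
        returns.append(R)
        # Fit w by least-squares regression of returns on graph bits.
        X = np.stack(graphs)
        w, *_ = np.linalg.lstsq(X, np.asarray(returns), rcond=None)
    return (w > 0).astype(np.int64)   # mode of the factorized graph distribution
```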

4.3 Disentangling Observations

In the above, we have assumed access to disentangled observations $x_t$. When this is not the case, such as with image observations, $x_t$ must be set to a disentangled representation of the observation at time $t$. We construct such a representation by training a β-VAE Kingma and Welling (2013); Higgins et al. (2017) to reconstruct the original observations. To capture states beyond those encountered by the expert, we train with a mix of expert and random trajectory states. Once trained, $x_t$ is set to be the mean of the latent distribution produced at the output of the encoder. The β-VAE training objective encourages disentangled dimensions in the latent space Burgess et al. (2018); Chen et al. (2018). We employ CoordConv Liu et al. (2018) in both the encoder and the decoder architectures.
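A sketch of how image observations might be mapped to the disentangled state used above. The encoder below is a toy stand-in for a trained β-VAE encoder (it omits CoordConv and the training loop entirely); only the interface—taking the mean of the latent distribution as the state—mirrors the text.

```python
# Using a (pre-trained) VAE encoder mean as the disentangled state (illustrative stand-in).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy stand-in for a trained beta-VAE encoder: image -> (mu, logvar)."""
    def __init__(self, latent_dim=30):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(2 * latent_dim)

    def forward(self, img):
        h = self.head(self.conv(img))
        mu, logvar = h.chunk(2, dim=-1)
        return mu, logvar

encoder = Encoder()  # in practice: load weights trained on expert + random frames

def disentangled_state(img_batch):
    # x_t is taken to be the mean of the encoder's latent distribution.
    with torch.no_grad():
        mu, _ = encoder(img_batch)
    return mu
```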

5 Experiments

We now evaluate the solution described in Sec 4 on the five tasks (MountainCar, Hopper, and 3 Atari games) described in Sec 3.2. In particular, recall that confounded performed significantly worse than original across all tasks. In our experiments, we seek to answer the following questions: (1) Does our targeted intervention-based solution to causal confusion bridge the gap between confounded and original? (2) How quickly does performance improve with intervention? (3) Do both intervention modes (expert query, policy execution) described in Sec 4.2 resolve causal confusion? (4) Does our approach in fact recover the true causal graph?

In each of the two intervention modes, we compare two variants of our method: unif-intervention and disc-intervention. They differ only in the training of the graph-parameterized mixture of policies $\pi_\phi$—while unif-intervention samples causal graphs uniformly, disc-intervention uses the variational causal discovery approach mentioned in Sec 4.1, and described in detail in Appendix C.

Environment | Pong | Enduro | UpNDown
original (upper bd) | 15.0 | 39.5 | 80.9
confounded (lower bd) | -4.0 | 30.5 | 24.8
original w/ vae | 12.3 | 36.7 | 54.5
confounded w/ vae | -4.0 | 28.2 | 24.0
unif-intervention (ours) | 11.6 | 32.4 | 66.3
dropout Bansal et al. (2019) | -8.3 | 28.2 | 40.4
Table 2: Intervention by policy execution: reward of the best models produced by our approach on Atari games. unif-intervention succeeds in getting rewards close to original w/ vae, while the dropout baseline only outperforms confounded w/ vae on UpNDown.

Baselines.   We compare our method against three baselines applied to the confounded state. dropout trains the policy using Eq 1 and evaluates with the graph containing all ones, which amounts to dropout regularization Srivastava et al. (2014) during training, as proposed by Bansal et al. (2019). dagger Ross et al. (2011) addresses distributional shift by querying the expert on states encountered by the imitator, requiring an interactive expert. We compare dagger to our expert query intervention approach. Lastly, we compare to Generative Adversarial Imitation Learning (gail) Ho and Ermon (2016). gail is an alternative to standard behavioral cloning that works by matching demonstration trajectories to those generated by the imitator during roll-outs in the environment. Note that the PC algorithm Le et al. (2016), commonly used in causal discovery from passive observational data, relies on the faithfulness assumption, which makes it infeasible in our setting. See Appendices B & C for details.

Figure 6: Reward vs. number of intervention episodes (policy execution interventions). Our methods unif-intervention and disc-intervention bridge most of the causal confusion gap (between confounded (lower bound) and original (upper bound)), approaching original performance after tens of episodes. gail Ho and Ermon (2016) (on Hopper) achieves this too, but after 1.5k episodes.

Intervention by policy execution.   Fig 6 plots episode rewards versus the number of policy execution intervention episodes for MountainCar and Hopper. The reward always corresponds to the current mode of the posterior distribution over graphs, updated after each episode, as described in Algorithm 2. In these cases, both unif-intervention and disc-intervention eventually converge to models yielding similar rewards, which we verified to be the correct causal model, i.e., true causes are selected and nuisance correlates are left out. In early episodes on MountainCar, disc-intervention benefits from the prior over graphs inferred in the variational causal discovery phase. However, in Hopper, the simpler unif-intervention performs just as well. dropout does indeed help in both settings, as reported in Bansal et al. (2019), but performs significantly worse than both variants of our approach. gail requires about 1.5k episodes on Hopper to match the performance of our approaches, which only need tens of episodes. Appendix F further analyzes the performance of gail. Standard implementations of gail do not handle discrete action spaces, so we do not evaluate it on MountainCar.

Figure 7: Reward vs. expert queries (expert query interventions). Our methods partially bridge the gap from confounded (lower bd) to original (upper bd), also outperforming dagger Ross et al. (2011) and dropout Bansal et al. (2019). gail Ho and Ermon (2016) outperforms our methods on Hopper, but requires a large number of policy roll-outs (also see Fig 6 comparing gail to our policy execution-based approach).

Experiments on Atari games are more computationally expensive, so we report results after a heuristically pre-selected number of episodes (1000). As described in Sec 4.3, we use a VAE to disentangle image states in Atari games to produce 30-D representations. Requiring the policy to utilize the VAE representation without end-to-end training does result in some drop in performance, as seen in Tab 2. However, causal confusion still causes a very large drop in performance even relative to the baseline VAE performance. As Tab 2 shows, unif-intervention indeed improves significantly over confounded w/ vae in all three cases, matching original w/ vae on Pong and UpNDown, while the dropout baseline only improves on UpNDown. In our experiments thus far, gail fails to converge to above-chance performance on any of the Atari environments. These results show that our method successfully alleviates causal confusion within relatively few trials.

Intervention by expert queries.   Next, we perform direct intervention by querying the expert on samples from trajectories produced by the different causal graphs. In this setting, we can also directly compare to dagger Ross et al. (2011). Fig 7 shows results on MountainCar and Hopper. Both our approaches successfully improve over confounded within a small number of queries. Consistent with policy execution intervention results reported above, we verify that our approach again identifies the true causal model correctly in both tasks, and also performs better than dropout in both settings. It also exceeds the rewards achieved by dagger, while using far fewer expert queries. In Appendix E, we show that dagger requires hundreds of queries to achieve similar rewards for MountainCar and tens of thousands for Hopper. Finally, gail with 1.5k episodes outperforms our expert query interventions approach. Recall however from Fig 7 that this is an order of magnitude more than the number of episodes required by our policy intervention approach.

Once again, disc-intervention only helps in early interventions on MountainCar, and not at all on Hopper. Thus, our method’s performance is primarily attributable to the targeted intervention stage, and the exact choice of approach used to learn the mixture of policies is relatively insignificant.

Overall, of the two intervention approaches, policy execution converges to better final rewards. Indeed, for the Atari environments, we observed that expert query interventions proved ineffective. We believe this is because expert agreement is an imperfect proxy for true environmental rewards.

Figure 8: Samples from (top row) learned causal graph and (bottom row) random causal graph. (See text)

Interpreting the learned causal graph. Our method labels each dimension of the VAE encoding of the frame as a cause or nuisance variable. In Fig 8, we analyze these inferences in the Pong environment as follows: in the top row, a frame is encoded into the VAE latent, then for all nuisance dimensions (as inferred by our approach unif-intervention), that dimension is replaced with a sample from the prior, and new samples are generated. In the bottom row, the same procedure is applied with a random graph that has as many nuisance variables as the inferred graph. We observe that in the top row, the causal variables (the ball and paddles) are shared between the samples, while the nuisance variables (the digit) differ, being replaced either with random digits or unreadable digits. In the bottom row, the causal variables differ strongly, indicating that important aspects of the state are judged as nuisance variables. This validates that, consistent with MountainCar and Hopper, our approach does indeed identify true causes in Pong.

6 Conclusions

We have identified a naturally occurring and fundamental problem in imitation learning, “causal confusion”, and proposed a causally motivated approach for resolving it. While we observe evidence for causal confusion arising in natural imitation learning settings, we have thus far validated our solution in somewhat simpler synthetic settings intended to mimic them. Extending our solution to work for such realistic scenarios is an exciting direction for future work. Finally, apart from imitation, general machine learning systems deployed in the real world also encounter “feedback” Sculley et al. (2015); Bagnell (2016), which opens the door to causal confusion. We hope to address these more general settings in the future.

Acknowledgments:

We would like to thank Karthikeyan Shanmugam and Shane Gu for pointers to prior work early in the project, and Yang Gao, Abhishek Gupta, Marvin Zhang, Alyosha Efros, and Roberto Calandra for helpful discussions in various stages of the project. We are also grateful to Drew Bagnell and Katerina Fragkiadaki for helpful feedback on an earlier draft of this paper. This project was supported in part by Berkeley DeepDrive, NVIDIA, and Google.


Appendix A Expert Demonstrations

To collect demonstrations, we first train an expert with reinforcement learning. We use DQN Mnih et al. [2013] for MountainCar, TRPO Schulman et al. [2015] for Hopper, and PPO Schulman et al. [2017] for the Atari environments (Pong, UpNDown, Enduro). This expert policy is executed in the environment to collect demonstrations.

Appendix B Necessity of Correct Causal Model

Faithfulness: A causal model is said to be faithful when all conditional independence relationships in the distribution are represented in the graph.

We pick up the notation used in Sec 3.1, but for notational simplicity, we drop the time superscript on states and actions when we are not reasoning about multiple time-steps.

Proposition 1.

Let the expert's functional causal model be $(G_{\mathrm{exp}}, \theta_{\mathrm{exp}})$, with causal graph $G_{\mathrm{exp}}$ as in Figure 2 and function parameters $\theta_{\mathrm{exp}}$. We assume some faithful learner FCM $(G_{\mathrm{learn}}, \theta_{\mathrm{learn}})$ that agrees on the interventional query:

$p_{\mathrm{learn}}(a \mid \mathrm{do}(x)) = p_{\mathrm{exp}}(a \mid \mathrm{do}(x)).$

Then it must be that $G_{\mathrm{learn}} = G_{\mathrm{exp}}$.

Proof.

For a graph $G$, define $I(G)$ as the index set of state variables $x^i$ that are independent of the action $a$ in the mutilated graph $G_{\overline{x}}$, in which incoming edges to $x$ are removed. From the assumption of matching interventional queries and the assumption of faithfulness, it follows that $I(G_{\mathrm{learn}}) = I(G_{\mathrm{exp}})$. From the graph, we observe that $I(G)$ contains exactly the indices of state variables that are not parents of $a$ in $G$, and thus $G_{\mathrm{learn}} = G_{\mathrm{exp}}$. ∎

Appendix C Variational Causal Discovery

Figure 9: Training architecture for variational inference-based causal discovery as described in Appendix C. The policy network represents a mixture of policies, one corresponding to each value of the binary causal graph structure variable $G$. This variable in turn is sampled from the distribution produced by an inference network from an input latent $U$. Further, a network regresses back to the latent to enforce that $G$ should not be independent of $U$.

The problem of discovering causal graphs from passively observed data is called causal discovery. The PC algorithm Spirtes et al. [2000] is arguably the most widely used and easily implementable causal discovery algorithm. In the case of Fig 2, the PC algorithm would imply the absence of the arrow $x^i \to a$ if the corresponding conditional independence relation holds, which can be tested by measuring mutual information. However, the PC algorithm relies on faithfulness of the causal graph: conditional independence must imply d-separation in the graph. Faithfulness is easily violated in a Markov decision process. If, for some $i$, the state variable $x^i$ is a cause of the expert's action (the arrow $x^i \to a$ should exist), but $x^i$ is the result of a deterministic function of other variables in the graph, then conditioning on those variables always renders $x^i$ independent of $a$, and the PC algorithm would wrongly conclude that the arrow is absent. (More generally, faithfulness places strong constraints on the expert graph. For example, a visual state may contain unchanging elements such as the car frame in Fig 1, which are by definition deterministic functions of the past. As another example, goal-conditioned tasks must include a constant goal in the state variable at each time, which once again has deterministic transitions, violating faithfulness.)

We take a Bayesian approach to causal discovery [Heckerman et al., 2006] from demonstrations. Recall from Sec 3 that the expert's actions are based on an unknown subset of the state variables $[x^1, \ldots, x^D]$. Each $x^i$ may either be a cause or not, so there are $2^D$ possible graphs. We now define a variational inference approach to infer a distribution over functional causal models (graphs and associated parameters) such that its modes are consistent with the demonstration data $\mathcal{D}$.

While Bayesian inference is intractable, variational inference can be used to find a distribution that is close to the true posterior distribution over models. We parameterize the structure $G$ of the causal graph as a vector of $D$ correlated Bernoulli random variables $G_i$, each indicating the presence of a causal arrow from $x^i$ to $a$. We assume a variational family with a point estimate $\theta_\phi$ of the parameters corresponding to graph $G$, and use a latent variable model to describe the correlated Bernoulli variables, with a standard normal distribution $p(U) = \mathcal{N}(0, I)$ over the latent random variable $U$:

$q_\psi(G) = \mathbb{E}_{U \sim \mathcal{N}(0, I)}\Big[\textstyle\prod_{i=1}^{D} q_\psi(G_i \mid U)\Big].$

We now optimise the evidence lower bound (ELBO):

$\log p(\mathcal{D}) \geq \mathbb{E}_{q_\psi(G)}\big[\log p(\mathcal{D} \mid G, \theta_\phi)\big] - \mathrm{KL}\big(q_\psi(G) \,\|\, p(G)\big)$   (2)
$\phantom{\log p(\mathcal{D})} = \mathbb{E}_{q_\psi(G)}\big[\log p(\mathcal{D} \mid G, \theta_\phi) + \log p(G)\big] + H\big(q_\psi(G)\big).$   (3)

Likelihood

The term $p(\mathcal{D} \mid G, \theta_\phi)$ is the likelihood of the observations under the FCM $(G, \theta_\phi)$. It is modelled by a single neural network $\pi_\phi(a \mid x, G) = f_\phi([x \odot G, G])$, where $\odot$ is the element-wise multiplication, $[\cdot, \cdot]$ denotes concatenation, and $\phi$ are the neural network parameters.

Entropy

The entropy term of the KL divergence, $H(q_\psi(G))$, acts as a regularizer to prevent the graph distribution from collapsing to the maximum a-posteriori estimate. It is intractable to directly maximize this entropy, but a tractable variational lower bound can be formulated. Using the product rule of entropies, we may write:

$H(G) = H(G \mid U) + I(G; U).$

In this expression, $H(G \mid U)$ promotes diversity of graphs, while $I(G; U)$ encourages correlation among the $G_i$. $I(G; U)$ can be bounded below using the same variational bound used in InfoGAN Chen et al. [2016], with a variational distribution $r_\omega(U \mid G)$: $I(G; U) \geq H(U) + \mathbb{E}_{U, G \sim q_\psi}[\log r_\omega(U \mid G)]$. Thus, during optimization, in lieu of the entropy, we maximize the following lower bound:

$H(G) \geq H(G \mid U) + H(U) + \mathbb{E}_{U, G \sim q_\psi}[\log r_\omega(U \mid G)].$

Prior

The prior over graph structures is set to prefer graphs with fewer causes for the action $a$—it is thus a sparsity prior, e.g. of the form $\log p(G) = -\lambda \sum_i G_i + \mathrm{const}$.

Optimization

Note that $G$ is a discrete variable, so we cannot use the reparameterization trick [Kingma and Welling, 2013]. Instead, we use the Gumbel-Softmax trick [Jang et al., 2016, Maddison et al., 2016] to compute gradients for training the graph distribution $q_\psi$. Note that this does not affect the policy parameters $\phi$, which can be trained with standard backpropagation.
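A sketch of the relaxed sampling step using PyTorch's built-in `gumbel_softmax`, treating each graph bit as a two-class categorical. The logit shapes and the name of the inference-network output are illustrative assumptions.

```python
# Gumbel-Softmax sampling of binary graph variables (illustrative sketch).
import torch
import torch.nn.functional as F

def sample_graph_relaxed(bit_logits, tau=1.0, hard=True):
    """Differentiably sample binary graph variables G from per-bit logits.

    bit_logits: (batch, D) real-valued scores for G_i = 1 (e.g. produced by the
    inference network from the latent U). Each bit is treated as a two-class
    categorical so the Gumbel-Softmax trick applies; with hard=True the forward
    pass is discrete while gradients flow through the relaxation.
    """
    two_class = torch.stack([bit_logits, torch.zeros_like(bit_logits)], dim=-1)
    soft = F.gumbel_softmax(two_class, tau=tau, hard=hard, dim=-1)
    return soft[..., 0]   # indicator (straight-through) that the bit is on

logits = torch.randn(4, 30, requires_grad=True)   # hypothetical inference-net output
G = sample_graph_relaxed(logits)                  # (4, 30) binary graph samples
```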

The loss of Eq 3 is easily interpretable independent of the formalism of variational Bayesian causal discovery. A mixture of predictors is jointly trained, each paying attention to diverse sparse subsets (identified by $G$) of the inputs. This is related to variational dropout [Kingma et al., 2015]. Once this model is trained, $q_\psi(G)$ represents the hypothesis distribution over graphs, and $f_\phi([x \odot G, G])$ represents the imitation policy corresponding to a graph $G$. Fig 9 shows the architecture.

Usage for Targeted Interventions

In our experiments, we also evaluate the usefulness of this causal discovery process for setting a prior for the targeted interventions described in Sec 4.2. In Algorithms 1 and 2, we implement this by initializing $p(G)$ to the discovered distribution (rather than uniform).

Appendix D Additional Results: Diagnosing Causal Confusion

Figure 10: An expanded version of Fig 4 in the main paper, demonstrating diagnosis of the causal confusion problem in three settings. The final reward (shown in Fig 4) is shown in the third column. Additionally, we also show the behavior cloning training loss (first column) and validation loss (second column) on trajectories generated by the expert. The x-axis for all plots is the number of training examples used to train the behavior cloning policy.

In Fig 10 we show the causal confusion in several environments. We observe that while training and validation losses for behavior cloning are frequently near-zero for both the original and confounded policy, the confounded policy consistently yields significantly lower reward when deployed in the environment. This confirms the causal confusion problem.

Appendix E DAgger with many more interventions

Figure 11: DAgger results on (a) MountainCar and (b) Hopper, trained on the confounded state.

In the main paper, we showed that DAgger performed poorly when given an equal number of expert interventions as our method. How many more samples does it need to do well?

The results in Fig 11 show that DAgger requires hundreds of samples before reaching rewards comparable to the rewards achieved by a non-DAgger imitator trained on the original state.

Appendix F GAIL Training Curves

In Figure 12 we show the average training curves of GAIL on the original and confounded state. Error bars are 2 standard errors of the mean. The confounded and original training curves do not differ significantly, indicating that causal confusion is not an issue for GAIL. However, training requires many interactions with the environment.

Figure 12: Rewards during GAIL training.

Appendix G Intervention Posterior Inference as Reinforcement Learning

Given a method of evaluating the likelihood $p(\mathcal{O} \mid G)$ that a certain graph $G$ is optimal, and a prior $p(G)$, we wish to infer the posterior $p(G \mid \mathcal{O})$. The number of graphs is finite, so we can in principle compute this posterior exactly. However, there may be very many graphs, so impractically many likelihood evaluations would be necessary. This problem is exacerbated when only noisy samples from the likelihood can be obtained, as in the case of intervention through policy execution, where the return is noisy.

If, on the other hand, a certain structure on the policy is assumed, the sample efficiency can be drastically improved, even though the posterior can no longer be inferred exactly. This can be done in the framework of variational inference. For a certain variational family $\mathcal{Q}$, we wish to find, for some temperature $\alpha$:

$q^* = \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\Big(q(G) \,\Big\|\, \tfrac{1}{Z}\, p(G)\, p(\mathcal{O} \mid G)^{1/\alpha}\Big)$   (4)
$\phantom{q^*} = \arg\max_{q \in \mathcal{Q}} \; \mathbb{E}_{q(G)}\Big[\tfrac{1}{\alpha} \log p(\mathcal{O} \mid G) + \log p(G)\Big] + H(q)$   (5)

The variational family we assume is the family of independent distributions:

$q(G) = \prod_{i=1}^{D} q_i(G_i).$   (6)

Eq 5 can be interpreted as a 1-step entropy-regularized MDP with reward $r(G) = \tfrac{1}{\alpha} \log p(\mathcal{O} \mid G) + \log p(G)$ Levine [2018]. It can be optimized through a policy gradient, but this would require many likelihood evaluations. It is more efficient to use a value-based method. The independence assumption translates into a linear Q-function, $Q(G) = \langle w, G \rangle$, which can simply be learned by linear regression on off-policy pairs $(G, r(G))$. In Soft Q-Learning Haarnoja et al. [2017] it is shown that the policy that maximizes Eq 5 is $q(G) \propto \exp Q(G)$, which can be shown to coincide in our case with Eq 6: the distribution factorizes into independent Bernoulli factors with $q(G_i = 1) = \sigma(w_i)$.