1 Introduction
Imitation learning allows for control policies to be learned directly from example demonstrations provided by human experts. It is easy to implement, and reduces or removes the need for extensive interaction with the environment during training Widrow and Smith (1964); Pomerleau (1989); Bojarski et al. (2016); Argall et al. (2009); Hussein et al. (2017).
However, imitation learning suffers from a fundamental problem: distributional shift Daumé et al. (2009); Ross and Bagnell (2010). Training and testing state distributions are different, induced respectively by the expert and learned policies. Therefore, imitating expert actions on expert trajectories may not align with the true task objective. While this problem is widely acknowledged Pomerleau (1989); Daumé et al. (2009); Ross and Bagnell (2010); Ross et al. (2011), naïve behavioral cloning approaches have, with careful engineering, yielded good results for several practical problems Widrow and Smith (1964); Pomerleau (1989); Schaal (1999); Muller et al. (2006); Mülling et al. (2013); Bojarski et al. (2016); Mahler and Goldberg (2017); Bansal et al. (2019). This raises the question: is distributional shift really still a problem?
In this paper, we identify a somewhat surprising and very problematic effect of distributional shift: “causal confusion.” Distinguishing correlates of expert actions in the demonstration set from true causes is usually very difficult, but may be ignored without adverse effects when training and testing distributions are identical (as assumed in supervised learning), since nuisance correlates continue to hold in the test set. However, this can cause catastrophic problems in imitation learning due to distributional shift. This is exacerbated by the causal structure of sequential action: the very fact that current actions cause future observations often introduces complex new nuisance correlates.
To illustrate, consider behavioral cloning to train a neural network to drive a car. In scenario A, the model’s input is an image of the dashboard and windshield, and in scenario B, the input to the model (with identical architecture) is the same image but with the dashboard masked out (see Fig 1). Both cloned policies achieve low training loss, but when tested on the road, model B drives well, while model A does not. The reason: the dashboard has an indicator light that comes on immediately when the brake is applied, and model A wrongly learns to apply the brake only when the brake light is on. Even though the brake light is the effect of braking, model A could achieve low training error by misidentifying it as the cause instead.

This situation presents a giveaway symptom of causal confusion: access to more information leads to worse generalization performance in the presence of distributional shift. Causal confusion occurs commonly in natural imitation learning settings, especially when the imitator’s inputs include history information.
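The brake-light example can be made concrete with a toy simulation. The sketch below is an illustration only, not the experimental setup of this paper: the persistent obstacle process, the `stay` probability, and the hand-written `copy_light_policy` are all assumptions introduced here. It measures how often a policy that simply copies the brake light agrees with the expert on demonstrations (where the light reflects the expert's previous action) and in deployment (where the light reflects the policy's own previous action).

```python
import random

def obstacle_sequence(steps, rng, stay=0.9):
    """Binary obstacle signal that tends to persist between time steps."""
    seq, cur = [], 0
    for _ in range(steps):
        if rng.random() > stay:
            cur = 1 - cur
        seq.append(cur)
    return seq

def expert(obstacle):
    # Hypothetical expert: brakes exactly when an obstacle is present.
    return obstacle

def copy_light_policy(obstacle, brake_light):
    # Causally confused imitator: treats the brake light as the cause of braking.
    return brake_light

def accuracy_on_demos(policy, obstacles):
    """Open loop: the brake light reflects the EXPERT's previous action."""
    hits, prev_action = 0, 0
    for obs in obstacles:
        hits += policy(obs, prev_action) == expert(obs)
        prev_action = expert(obs)
    return hits / len(obstacles)

def accuracy_when_deployed(policy, obstacles):
    """Closed loop: the brake light reflects the POLICY's own previous action."""
    hits, prev_action = 0, 0
    for obs in obstacles:
        action = policy(obs, prev_action)
        hits += action == expert(obs)
        prev_action = action
    return hits / len(obstacles)

rng = random.Random(0)
obstacles = obstacle_sequence(2000, rng)
demo_acc = accuracy_on_demos(copy_light_policy, obstacles)
deploy_acc = accuracy_when_deployed(copy_light_policy, obstacles)
brake_fraction = sum(obstacles) / len(obstacles)
print(demo_acc, deploy_acc, brake_fraction)
```

On the demonstrations the light-copying rule agrees with the expert on most steps because obstacles persist from one step to the next. Once deployed it never brakes, since its own light never turns on, so it is correct only on the steps where no braking was needed.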
In this paper, we first point out and investigate the causal confusion problem in imitation learning. Then, we propose a solution to overcome it by learning the correct causal model, even when using complex deep neural network policies. We learn a mapping from causal graphs to policies, and then use targeted interventions to efficiently search for the correct policy, either by querying an expert, or by executing selected policies in the environment.
2 Related Work
Imitation learning. Imitation learning through behavioral cloning dates back to Widrow and Smith (1964), and has remained popular through today Pomerleau (1989); Schaal (1999); Muller et al. (2006); Mülling et al. (2013); Bojarski et al. (2016); Giusti et al. (2016); Mahler and Goldberg (2017); Wang et al. (2019); Bansal et al. (2019). The distributional shift problem, wherein a cloned policy encounters unfamiliar states during autonomous execution, has been identified as an issue in imitation learning Pomerleau (1989); Daumé et al. (2009); Ross and Bagnell (2010); Ross et al. (2011); Laskey et al. (2017); Ho and Ermon (2016); Bansal et al. (2019). This is closely tied to the “feedback” problem in general machine learning systems that have direct or indirect access to their own past states Sculley et al. (2015); Bagnell (2016). For imitation learning, various solutions to this problem have been proposed (Daumé et al., 2009; Ross and Bagnell, 2010; Ross et al., 2011) that rely on iteratively querying an expert based on states encountered by some intermediate cloned policy, to overcome distributional shift; DAgger Ross et al. (2011) has come to be the most widely used of these solutions.

We show evidence that the distributional shift problem in imitation learning is often due to causal confusion, as illustrated schematically in Fig 1. We propose to address this through targeted interventions on the states to learn the true causal model to overcome distributional shift. As we will show, these interventions can take the form of either environmental rewards with no additional expert involvement, or of expert queries in cases where the expert is available for additional inputs. In expert query mode, our approach may be directly compared to DAgger (Ross et al., 2011): indeed, we show that we successfully resolve causal confusion using orders of magnitude fewer queries than DAgger.
We also compare against Bansal et al. (2019): to prevent imitators from copying past actions, they train with dropout Srivastava et al. (2014) on dimensions that might reveal past actions. While our approach seeks to find the true causal graph in a mixture of graph-parameterized policies, dropout corresponds to directly applying the mixture policy. In our experiments, our approach performs significantly better.
Causal inference. Causal inference is the general problem of deducing cause-effect relationships among variables (Spirtes et al., 2000; Pearl, 2009; Peters et al., 2017; Spirtes, 2010; Eberhardt, 2017; Spirtes and Zhang, 2016). “Causal discovery” approaches allow causal inference from prerecorded observations under constraints (Steyvers et al., 2003; Heckerman et al., 2006; Lopez-Paz et al., 2017; Guyon et al., 2008; Louizos et al., 2017; Maathuis et al., 2010; Le et al., 2016; Goudet et al., 2017; Mitrovic et al., 2018; Wang and Blei, 2018). Observational causal inference is known to be impossible in general (Pearl, 2009; Peters et al., 2014). We operate in the interventional regime (Tong and Koller, 2001; Eberhardt and Scheines, 2007; Shanmugam et al., 2015; Sen et al., 2017) where a user may “experiment” to discover causal structures by assigning values to some subset of the variables of interest and observing the effects on the rest of the system. We propose a new interventional causal inference approach suited to imitation learning. While ignoring causal structure is particularly problematic in imitation learning, ours is the first effort directly addressing this, to our knowledge.
3 The Phenomenon of Causal Confusion
In imitation learning, an expert demonstrates how to perform a task (e.g., driving a car) for the benefit of an agent. In each demo, the agent has access both to its $n$-dim. state observations at each time $t$, $X^t = [X^t_1, X^t_2, \ldots, X^t_n]$ (e.g., a video feed from a camera), and to the expert’s action $A^t$ (e.g., steering, acceleration, braking). Behavioral cloning approaches learn a mapping $\pi$ from $X^t$ to $A^t$ using all $(X^t, A^t)$ tuples from the demonstrations. At test time, the agent observes $X^t$ and executes $\pi(X^t)$.
The underlying sequential decision process has complex causal structures, represented in Fig 2. States influence future expert actions, and are also themselves influenced by past actions and states.
In particular, expert actions $A^t$ are influenced by some information in state $X^t$, and unaffected by the rest. For the moment, assume that the dimensions $X^t_1, X^t_2, X^t_3, \ldots$ of $X^t$ represent disentangled factors of variation. Then some unknown subset of these factors (“causes”) affect expert actions, and the rest do not (“nuisance variables”). A confounder $Z^t = [X^{t-1}, A^{t-1}]$ influences each state variable in $X^t$, so that some nuisance variables may still be correlated with $A^t$ among $(X^t, A^t)$ pairs from demonstrations. In Fig 1, the dashboard light is such a nuisance variable.

A naïve behavioral cloned policy might rely on nuisance correlates to select actions, producing low training error, and even generalizing to held-out $(X^t, A^t)$ pairs. However, this policy must contend with distributional shift when deployed: actions are chosen by the imitator rather than the expert, affecting the distribution of $Z^t$ and $X^t$. This in turn affects the policy mapping from $X^t$ to $A^t$, leading to poor performance of expert-cloned policies. We define “causal confusion” as the phenomenon whereby cloned policies fail by misidentifying the causes of expert actions.
3.1 Robustness and Causality in Imitation Learning
Intuitively, distributional shift affects the relationship of the expert action to nuisance variables, but not to the true causes. In other words, to be maximally robust to distributional shift, a policy must rely solely on the true causes of expert actions, thereby avoiding causal confusion. This intuition can be formalized in the language of functional causal models (FCM) and interventions Pearl (2009).
Functional causal models: A functional causal model (FCM) over a set of variables $\{Y_i\}_{i=1}^n$ is a tuple $(G, \theta_G)$ containing a graph $G$ over $\{Y_i\}_{i=1}^n$, and deterministic functions $f_i(\cdot; \theta_G)$ with parameters $\theta_G$ describing how the causes of each variable $Y_i$ determine it: $Y_i = f_i(Y_{\mathrm{Pa}(i;G)}, E_i; \theta_G)$, where $E_i$ is a stochastic noise variable that represents all external influences on $Y_i$, and $\mathrm{Pa}(i;G)$ denotes the indices of parent nodes of $Y_i$, which correspond to its causes.
An “intervention” $do(Y_i)$ on $Y_i$ to set its value may now be represented by a structural change in this graph to produce the “mutilated graph” $G_{\bar{Y}_i}$, in which incoming edges to $Y_i$ are removed. (For a more thorough overview of FCMs, see Pearl (2009).)
Applying this formalism to our imitation learning setting, any distributional shift in the state $X^t$ may be modeled by intervening on $X^t$, so that correctly modeling the “interventional query” $p(A^t \mid do(X^t))$ is sufficient for robustness to distributional shifts. Now, we may formalize the intuition that only a policy relying solely on true causes can robustly model the mapping from states to optimal/expert actions under distributional shift.
In Appendix B, we prove that under mild assumptions, correctly modeling interventional queries does indeed require learning the correct causal graph $G$. In the car example, “setting” the brake light to on or off and observing the expert’s actions would yield a clear signal unobstructed by confounders: the brake light does not affect the expert’s braking behavior.
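The difference between an observational and an interventional query in the car example can be illustrated with a toy FCM. This is a minimal sketch under assumptions made here for illustration: binary variables, a persistent obstacle process, an expert that brakes exactly when an obstacle is present, and a light wired to the previous brake action. The function names are hypothetical and not taken from the paper.

```python
import random

def step(rng, prev_obstacle, do_light=None, stay=0.9):
    """Sample one time step of a toy FCM for the car example."""
    obstacle = prev_obstacle if rng.random() < stay else 1 - prev_obstacle
    prev_brake = prev_obstacle  # the expert braked iff an obstacle was present
    # The light is normally caused by the previous brake action; do(light = v)
    # removes that incoming edge and assigns the value directly.
    light = prev_brake if do_light is None else do_light
    brake = obstacle  # the expert's action depends on the obstacle only
    return obstacle, light, brake

def brake_rate_by_light(rng, steps, intervene):
    """Estimate P(brake | light) observationally, or P(brake | do(light))."""
    counts = {0: [0, 0], 1: [0, 0]}  # light value -> [brake count, sample count]
    obstacle = 0
    for _ in range(steps):
        do_light = rng.randint(0, 1) if intervene else None
        obstacle_next, light, brake = step(rng, obstacle, do_light)
        counts[light][0] += brake
        counts[light][1] += 1
        obstacle = obstacle_next
    return {v: b / total for v, (b, total) in counts.items()}

rng = random.Random(0)
observational = brake_rate_by_light(rng, 20000, intervene=False)
interventional = brake_rate_by_light(rng, 20000, intervene=True)
obs_gap = observational[1] - observational[0]
int_gap = abs(interventional[1] - interventional[0])
print(obs_gap, int_gap)
```

In the observational data the brake rate differs strongly between light-on and light-off steps, while under intervention the two rates are nearly equal, which is the signal that the light is not a cause of the expert's braking.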
3.2 Causal Confusion in Policy Learning Benchmarks and Realistic Settings
Before discussing our solution, we first present several testbeds and real-world cases where causal confusion adversely influences imitation learning performance.
Control Benchmarks. We show that causal confusion is induced with small changes to widely studied benchmark control tasks, simply by adding more information to the state, which intuitively ought to make the tasks easier, not harder. In particular, we add information about the previous action, which tends to correlate with the current action in the expert data for many standard control problems. This is a proxy for scenarios like our car example, in which correlates of past actions are observable in the state, and is similar to what we might see from other sources of knowledge about the past, such as memory or recurrence. We study three kinds of tasks: (i) MountainCar (continuous states, discrete actions), (ii) MuJoCo Hopper (continuous states and actions), (iii) Atari games: Pong, Enduro and UpNDown (states: two stacked consecutive frames, discrete actions).
For each task, we study imitation learning in two scenarios. In scenario A (henceforth called "confounded"), the policy sees the augmented observation vector, including the previous action. In the case of low-dimensional observations, the state vector is expanded to include the previous action at an index that is unknown to the learner. In the case of image observations, we overlay a symbol corresponding to the previous action at an unknown location on the image (see Fig 3). In scenario B ("original"), the previous action variable is replaced with random noise for low-dimensional observations. For image observations, the original images are left unchanged. Demonstrations are generated synthetically as described in Appendix A. In all cases, we use neural networks with identical architectures to represent the policies, and we train them on the same demonstrations.

Fig 4 shows the rewards against varying demonstration dataset sizes for MountainCar, Hopper, and Pong. Appendix D shows additional results, including for Enduro and UpNDown. All policies are trained to near-zero validation error on held-out expert state-action tuples. original produces rewards tending towards expert performance as the size of the imitation dataset increases. confounded either requires many more demonstrations to reach equivalent performance, or fails completely to do so.
Overall, the results are clear: across these tasks, access to more information leads to inferior performance. As Fig 10 in the appendix shows, this difference is not due to different training/validation losses on the expert demonstrations. For example, in Pong, confounded produces lower validation loss than original on held-out demonstration samples, but produces lower rewards when actually used for control. These results not only validate the existence of causal confusion, but also provide us with testbeds for investigating a potential solution.
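For low-dimensional observations, the two scenarios above differ only in what is written into one extra state dimension. The following sketch shows one way such datasets could be assembled; it is not the authors' code. The helper name `build_dataset`, the scalar action, the unit-variance Gaussian noise, and the choice of `0.0` as the "previous action" at the first step are assumptions made for illustration.

```python
import random

def build_dataset(trajectory, slot, scenario, rng):
    """Insert one extra dimension at index `slot` of every state.

    trajectory: list of (state, action) pairs, with state a list of floats.
    scenario:   "confounded" inserts the previous action,
                "original" inserts random noise instead.
    """
    dataset, prev_action = [], 0.0  # assumed initial "previous action"
    for state, action in trajectory:
        extra = prev_action if scenario == "confounded" else rng.gauss(0.0, 1.0)
        dataset.append((state[:slot] + [extra] + state[slot:], action))
        prev_action = action
    return dataset

trajectory = [([0.1, 0.2], 1.0), ([0.3, 0.4], 0.0), ([0.5, 0.6], 1.0)]
rng = random.Random(0)
confounded = build_dataset(trajectory, slot=1, scenario="confounded", rng=rng)
original = build_dataset(trajectory, slot=1, scenario="original", rng=rng)
print(confounded)
print(original)
```

Both datasets have the same dimensionality and the same action labels, so identical network architectures can be trained on either one.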
Real-World Driving. In more realistic imitation learning settings too, symptoms of causal confusion have been observed consistently in Muller et al. (2006); Wang et al. (2019); Bansal et al. (2019), when learning to drive from histories of video frames. While these histories contain valuable information for driving, they also naturally introduce information about nuisance factors such as previous actions. In all three cases, more information led to worse results for the behavioral cloning policy, but this was neither attributed specifically to causal confusion, nor tackled using causally motivated approaches.
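Table 1

| Methods | Validation: Perplexity | Driving Performance: Distance | Driving Performance: Interventions | Driving Performance: Collisions |
|---|---|---|---|---|
| history | 0.834 | 144.92 | 2.94 ± 1.79 | 6.49 ± 5.72 |
| no-history | 0.989 | 268.95 | 1.30 ± 0.78 | 3.38 ± 2.55 |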
We draw the reader’s attention to particularly telling results from Wang et al. (2019) for learning to drive in near-photorealistic GTA-V Krähenbühl (2018) environments, using behavior cloning with DAgger-inspired expert perturbation. Imitation learning policies are trained using overhead image observations with and without “history” information (history and no-history) about the ego-position trajectory of the car in the past.
Similar to our tests above, architectures are identical for the two methods. And once again, like in our tests above, history has better performance on held-out demonstration data, but much worse performance when actually deployed. Tab 1 shows these results, reproduced from Wang et al. (2019) Table II. These results constitute strong evidence for the prevalence of causal confusion in realistic imitation learning settings. Bansal et al. (2019) also observe similar symptoms in a driving setting, and present a dropout Srivastava et al. (2014) approach to tackle it, which we compare to in our experiments.
4 Resolving Causal Confusion
Recall from Sec 3.1 that robustness to causal confusion can be achieved by finding the true causal model of the expert’s actions. We propose a simple pipeline to do this. First, we jointly learn policies corresponding to various causal graphs (Sec 4.1). Then, we perform targeted interventions to efficiently search over the hypothesis set for the correct causal model (Sec 4.2).
4.1 Causal Graph-Parameterized Policy Learning
In this step, we learn a policy corresponding to each candidate causal graph. Recall from Sec 3 that the expert’s actions $A$ are based on an unknown subset of the state variables $\{X_i\}_{i=1}^n$. Each $X_i$ may either be a cause or not, so there are $2^n$ possible graphs. We parameterize the structure $G$ of the causal graph as a vector of $n$ binary variables, each indicating the presence of an arrow from $X_k$ to $A$ in Fig 2. We then train a single graph-parameterized policy $\pi_G(X) = f_\phi([X \odot G, G])$, where $\odot$ is element-wise multiplication, and $[\cdot, \cdot]$ denotes concatenation. $\phi$ are neural network parameters, trained through gradient descent to minimize:

$$\mathbb{E}_G\left[\ell\left(f_\phi([X \odot G, G]), A\right)\right] \qquad (1)$$

where the loss is averaged over the $(X, A)$ pairs in the demonstrations, $G$ is drawn uniformly at random over all $2^n$ graphs, and $\ell$ is a mean squared error loss for the continuous action environments and a cross-entropy loss for the discrete action environments. Fig 5 shows a schematic of the training time architecture. The policy network $f_\phi$ mapping observations $X$ to actions $A$ represents a mixture of policies, one corresponding to each value of the binary causal graph structure variable $G$, which is sampled as a Bernoulli random vector.
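The training scheme above, in which a single network receives the masked state concatenated with the graph and a fresh graph is sampled uniformly for every update, can be sketched in a few lines. This is a simplified illustration, not the architecture used in the paper: a linear model stands in for the neural network, the loss is a squared error, and the toy data (a single cause plus one noisy nuisance correlate) and all function names are assumptions made here.

```python
import random

def policy_input(x, g):
    """Masked state concatenated with the binary graph vector itself."""
    return [xi * gi for xi, gi in zip(x, g)] + list(g)

def predict(phi, x, g):
    # A linear model stands in for the neural network, for brevity.
    return sum(w * zi for w, zi in zip(phi, policy_input(x, g)))

def sgd_step(phi, x, a, g, lr=0.05):
    """One gradient step on the squared error for one (x, a) pair and graph g."""
    z = policy_input(x, g)
    err = predict(phi, x, g) - a
    return [w - lr * 2.0 * err * zi for w, zi in zip(phi, z)]

def expected_loss(phi, data, graphs):
    """Squared error averaged over all demonstrations and all graphs."""
    total = sum((predict(phi, x, g) - a) ** 2 for x, a in data for g in graphs)
    return total / (len(data) * len(graphs))

rng = random.Random(0)
n = 2
all_graphs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Toy demonstrations: the action equals x[0]; x[1] is a noisy nuisance correlate.
data = []
for _ in range(200):
    cause = rng.uniform(-1.0, 1.0)
    data.append(([cause, cause + rng.gauss(0.0, 0.1)], cause))

phi = [0.0] * (2 * n)
loss_before = expected_loss(phi, data, all_graphs)
for _ in range(2000):
    x, a = rng.choice(data)
    g = tuple(rng.randint(0, 1) for _ in range(n))  # uniform over the 2^n graphs
    phi = sgd_step(phi, x, a, g)
loss_after = expected_loss(phi, data, all_graphs)
print(loss_before, loss_after)
```

A linear model cannot represent a separate policy for each graph, so this sketch only demonstrates the input construction and the sampled-graph objective; a network with nonlinearities is needed for the mixture-of-policies behavior described in the text.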
In Appendix C, we propose an approach to perform variational Bayesian causal discovery over graphs $G$, using a latent variable model to infer a distribution over functional causal models (graphs and associated parameters); the modes of this distribution are the FCMs most consistent with the demonstration data. This resembles the scheme above, except that instead of uniform sampling, graphs are sampled preferentially from FCMs that fit the training demonstrations well. We compare both approaches in Sec 5, finding that simple uniform sampling nearly always suffices in preparation for the next step: targeted intervention.
4.2 Targeted Intervention
Having learned the graph-parameterized policy $\pi_G$ as in Sec 4.1, we propose targeted intervention to compute the likelihood of each causal graph structure hypothesis $G$. In a sense, imitation learning provides an ideal setting for studying interventional causal learning: causal confusion presents a clear challenge, while the fact that the problem is situated in a sequential decision process where the agent can interact with the world provides a natural mechanism for carrying out limited interventions.
We propose two intervention modes, both of which can be carried out by interaction with the environment via the actions:
Expert query mode. This is the standard intervention approach applied to imitation learning: intervene on $X^t$ to assign it a value, and observe the expert response $A$. This requires an interactive expert, as in DAgger Ross et al. (2011), but requires substantially fewer expert queries than DAgger, because (i) the queries serve only to disambiguate among a relatively small set of valid FCMs, and (ii) we use disagreement among the mixture of policies in $f_\phi$ to query the expert efficiently in an active learning approach. We summarize this approach in Algorithm 1.

Policy execution mode. It is not always possible to query an expert. For example, for a learner learning to drive a car by watching a human driver, it may not be possible to put the human driver into dangerous scenarios that the learner might encounter at intermediate stages of training. In cases like these where we would like to learn from pre-recorded demonstrations alone, we propose to intervene indirectly by using environmental returns (sum of rewards over time in an episode) $R$. The policies $\pi_G$ corresponding to different hypotheses $G$ are executed in the environment and the returns $R_G$ collected. The likelihood of each graph is proportional to the exponentiated returns $\exp R_G$. The intuition is simple: environmental returns contain information about optimal expert policies even when experts are not queryable. Note that we do not even assume access to per-timestep rewards as in standard reinforcement learning; just the sum of rewards for each completed run. As such, this intervention mode is much more flexible. See Algorithm 2.

Note that both of the above intervention approaches evaluate individual hypotheses in isolation, but the number of hypotheses grows exponentially in the number of state variables. To handle larger states, we infer a graph distribution $p(G)$, by assuming an energy-based model with a linear energy $E(G) = \langle w, G \rangle + b$, so the graph distribution is $p(G) = \prod_i p(G_i) = \prod_i \mathrm{Bernoulli}(G_i \mid \sigma(w_i))$, where $\sigma$ is the sigmoid, which factorizes in independent factors. The independence assumption is sensible as our approach collapses $p(G)$ to its mode before returning it and the collapsed distribution is always independent. $w$ is inferred from linear regression on the likelihoods. This process is depicted in Algorithms 1 and 2. The above method can be formalized within the reinforcement learning framework Levine (2018). As we show in Appendix G, the energy-based model can be seen as an instance of soft Q-learning Haarnoja et al. (2017).

4.3 Disentangling Observations
In the above, we have assumed access to disentangled observations $X^t$. When this is not the case, such as with image observations, $X^t$ must be set to a disentangled representation of the observation at time $t$. We construct such a representation by training a β-VAE Kingma and Welling (2013); Higgins et al. (2017) to reconstruct the original observations. To capture states beyond those encountered by the expert, we train with a mix of expert and random trajectory states. Once trained, $X^t$ is set to be the mean of the latent distribution produced at the output of the encoder. The β-VAE training objective encourages disentangled dimensions in the latent space Burgess et al. (2018); Chen et al. (2018). We employ CoordConv Liu et al. (2018) in both the encoder and the decoder architectures.
5 Experiments
We now evaluate the solution described in Sec 4 on the five tasks (MountainCar, Hopper, and 3 Atari games) described in Sec 3.2. In particular, recall that confounded performed significantly worse than original across all tasks. In our experiments, we seek to answer the following questions: (1) Does our targeted intervention-based solution to causal confusion bridge the gap between confounded and original? (2) How quickly does performance improve with intervention? (3) Do both intervention modes (expert query, policy execution) described in Sec 4.2 resolve causal confusion? (4) Does our approach in fact recover the true causal graph?
In each of the two intervention modes, we compare two variants of our method: unif-intervention and disc-intervention. They only differ in the training of the graph-parameterized mixture-of-policies $f_\phi$: while unif-intervention samples causal graphs uniformly, disc-intervention uses the variational causal discovery approach mentioned in Sec 4.1, and described in detail in Appendix C.
Table 2

| Environment | Pong | Enduro | UpNDown |
|---|---|---|---|
| original (upper bd) | 15.0 | 39.5 | 80.9 |
| confounded (lower bd) | 4.0 | 30.5 | 24.8 |
| original w/ vae | 12.3 | 36.7 | 54.5 |
| confounded w/ vae | 4.0 | 28.2 | 24.0 |
| unif-intervention (ours) | 11.6 | 32.4 | 66.3 |
| dropout Bansal et al. (2019) | 8.3 | 28.2 | 40.4 |
Baselines. We compare our method against three baselines applied to the confounded state. dropout trains the policy using Eq 3 and evaluates with the graph $G$ containing all ones, which amounts to dropout regularization Srivastava et al. (2014) during training, as proposed by Bansal et al. (2019). dagger Ross et al. (2011) addresses distributional shift by querying the expert on states encountered by the imitator, requiring an interactive expert. We compare dagger to our expert query intervention approach. Lastly, we compare to Generative Adversarial Imitation Learning (gail) Ho and Ermon (2016). gail is an alternative to standard behavioral cloning that works by matching demonstration trajectories to those generated by the imitator during rollouts in the environment. Note that the PC algorithm Le et al. (2016), commonly used in causal discovery from passive observational data, relies on the faithfulness assumption, which causes it to be infeasible in our setting. See Appendices B & C for details.
Intervention by policy execution. Fig 6 plots episode rewards versus number of policy execution intervention episodes for MountainCar and Hopper. The reward always corresponds to the current mode of the posterior distribution over graphs, updated after each episode, as described in Algorithm 2. In these cases, both unif-intervention and disc-intervention eventually converge to models yielding similar rewards, which we verified to be the correct causal model, i.e., true causes are selected and nuisance correlates left out. In early episodes on MountainCar, disc-intervention benefits from the prior over graphs inferred in the variational causal discovery phase. However, in Hopper, the simpler unif-intervention performs just as well. dropout does indeed help in both settings, as reported in Bansal et al. (2019), but is significantly poorer than our approach variants. gail requires about 1.5k episodes on Hopper to match the performance of our approaches, which only need tens of episodes. Appendix F further analyzes the performance of gail. Standard implementations of gail do not handle discrete action spaces, so we do not evaluate it on MountainCar.
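The per-episode update used in policy execution mode (sample a graph, execute its policy, regress returns on graphs, take the mode) can be sketched as follows. This is a rough illustration rather than the paper's Algorithm 2: `toy_return` is a hypothetical stand-in for running a graph's policy in the environment, the hidden true graph `(1, 0, 1)` and the noise level are invented for the example, returns are used directly as log-likelihoods, and a small ridge term is added so the regression is well-posed with few episodes.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def solve(a, c):
    """Solve the linear system a x = c by Gauss-Jordan elimination."""
    m = len(c)
    aug = [row[:] + [ci] for row, ci in zip(a, c)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def fit_linear_energy(graphs, returns, ridge=1e-2):
    """Ridge least squares for returns ~ <w, G> + b; gives back (w, b)."""
    feats = [list(g) + [1.0] for g in graphs]
    m = len(feats[0])
    a = [[sum(f[i] * f[j] for f in feats) + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    c = [sum(f[i] * r for f, r in zip(feats, returns)) for i in range(m)]
    sol = solve(a, c)
    return sol[:-1], sol[-1]

def toy_return(g, rng, noise=0.5):
    """Hypothetical episode return: higher when g matches a hidden true graph."""
    true_graph = (1, 0, 1)
    base = sum(1.0 if gi == ti else -1.0 for gi, ti in zip(g, true_graph))
    return base + (rng.gauss(0.0, noise) if noise > 0 else 0.0)

def policy_execution_intervention(n, episodes, rng):
    w = [0.0] * n
    graphs, returns = [], []
    for _ in range(episodes):
        # Sample each edge independently from Bernoulli(sigmoid(w_i)).
        g = tuple(int(rng.random() < sigmoid(wi)) for wi in w)
        graphs.append(g)
        returns.append(toy_return(g, rng))
        w, _ = fit_linear_energy(graphs, returns)
    # Mode of the factorized distribution: keep edge i iff w_i > 0.
    return tuple(int(wi > 0) for wi in w), w

rng = random.Random(0)
best_graph, w = policy_execution_intervention(n=3, episodes=200, rng=rng)
print(best_graph, w)
```

Because the distribution over graphs factorizes, the number of regression parameters grows linearly with the number of state variables, even though the number of candidate graphs grows exponentially.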
Experiments on Atari games are more computationally expensive, so we report results after a heuristically preselected number of episodes (1000). As described in Sec 4.3, we use a VAE to disentangle image states in Atari games to produce 30-D representations. Requiring the policy to utilize the VAE representation without end-to-end training does result in some drop in performance, as seen in Tab 2. However, causal confusion still causes a very large drop of performance even relative to the baseline VAE performance. As Tab 2 shows, unif-intervention indeed improves significantly over confounded w/ vae in all three cases, matching original w/ vae on Pong and UpNDown, while the dropout baseline only improves UpNDown. In our experiments thus far, gail fails to converge to above-chance performance on any of the Atari environments. These results show that our method successfully alleviates causal confusion within relatively few trials.

Intervention by expert queries. Next, we perform direct intervention by querying the expert on samples from trajectories produced by the different causal graphs. In this setting, we can also directly compare to dagger Ross et al. (2011). Fig 7 shows results on MountainCar and Hopper. Both our approaches successfully improve over confounded within a small number of queries. Consistent with policy execution intervention results reported above, we verify that our approach again identifies the true causal model correctly in both tasks, and also performs better than dropout in both settings. It also exceeds the rewards achieved by dagger, while using far fewer expert queries. In Appendix E, we show that dagger requires hundreds of queries to achieve similar rewards for MountainCar and tens of thousands for Hopper. Finally, gail with 1.5k episodes outperforms our expert query interventions approach. Recall however from Fig 6 that this is an order of magnitude more than the number of episodes required by our policy intervention approach.
Once again, disc-intervention only helps in early interventions on MountainCar, and not at all on Hopper. Thus, our method’s performance is primarily attributable to the targeted intervention stage, and the exact choice of approach used to learn the mixture of policies is relatively insignificant.
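The idea behind expert query mode, spending queries only on states where the candidate policies disagree and then scoring each graph by its agreement with the expert, can be illustrated on a tiny discrete example. This is a hedged sketch and not the paper's Algorithm 1: action variance is used here as a simple proxy for disagreement, and the three hand-written policies, the expert, and the state encoding `(obstacle, light)` are all hypothetical.

```python
def disagreement(state, policies):
    """Variance of the actions proposed by the candidate policies at a state."""
    actions = [pi(state) for pi in policies.values()]
    mean = sum(actions) / len(actions)
    return sum((a - mean) ** 2 for a in actions) / len(actions)

def expert_query_intervention(states, policies, expert, budget):
    """Query the expert on the most contested states and score every graph."""
    ranked = sorted(states, key=lambda s: disagreement(s, policies), reverse=True)
    queried = ranked[:budget]
    scores = {g: sum(pi(s) == expert(s) for s in queried)
              for g, pi in policies.items()}
    best = max(scores, key=scores.get)
    return best, scores, queried

# States are (obstacle, light); graphs say which of the two inputs a policy uses.
policies = {
    (1, 0): lambda s: s[0],               # brakes on obstacle (true cause)
    (0, 1): lambda s: s[1],               # brakes on light (nuisance)
    (1, 1): lambda s: int(s[0] and s[1])  # needs both, a hypothetical mixture
}
expert = lambda s: s[0]
states = [(0, 0), (1, 0), (0, 1), (1, 1)]

best, scores, queried = expert_query_intervention(states, policies, expert, budget=2)
print(best, scores, queried)
```

With a budget of two queries, the expert is asked only about the two states in which the obstacle and the light disagree, which is enough to single out the graph that uses the obstacle alone.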
Overall, of the two intervention approaches, policy execution converges to better final rewards. Indeed, for the Atari environments, we observed that expert query interventions proved ineffective. We believe this is because expert agreement is an imperfect proxy for true environmental rewards.
Interpreting the learned causal graph. Our method labels each dimension of the VAE encoding of the frame as a cause or nuisance variable. In Fig 8, we analyze these inferences in the Pong environment as follows: in the top row, a frame is encoded into the VAE latent, then for all nuisance dimensions (as inferred by our approach unif-intervention), that dimension is replaced with a sample from the prior, and new samples are generated. In the bottom row, the same procedure is applied with a random graph that has as many nuisance variables as the inferred graph. We observe that in the top row, the causal variables (the ball and paddles) are shared between the samples, while the nuisance variables (the digit) differ, being replaced either with random digits or unreadable digits. In the bottom row, the causal variables differ strongly, indicating that important aspects of the state are judged as nuisance variables. This validates that, consistent with MountainCar and Hopper, our approach does indeed identify true causes in Pong.
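The latent manipulation behind this analysis is simple to state in code. The sketch below covers only the resampling step and assumes a standard normal prior over each latent dimension; the encoder and decoder are omitted, and the example latent vector and graph are made up for illustration.

```python
import random

def resample_nuisance(latent, graph, rng):
    """Keep dimensions labeled as causes (1); redraw nuisance ones (0) from N(0, 1)."""
    return [z if g == 1 else rng.gauss(0.0, 1.0) for z, g in zip(latent, graph)]

rng = random.Random(0)
latent = [0.5, -1.2, 0.3, 2.0]   # hypothetical encoder mean for one frame
graph = (1, 0, 1, 0)             # hypothetical inferred cause / nuisance labels
samples = [resample_nuisance(latent, graph, rng) for _ in range(3)]
print(samples)
```

Decoding each resampled latent would then give frames that share the causal content of the original frame while varying in the nuisance content.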
6 Conclusions
We have identified a naturally occurring and fundamental problem in imitation learning, “causal confusion”, and proposed a causally motivated approach for resolving it. While we observe evidence for causal confusion arising in natural imitation learning settings, we have thus far validated our solution in somewhat simpler synthetic settings intended to mimic them. Extending our solution to work for such realistic scenarios is an exciting direction for future work. Finally, apart from imitation, general machine learning systems deployed in the real world also encounter “feedback” Sculley et al. (2015); Bagnell (2016), which opens the door to causal confusion. We hope to address these more general settings in the future.
Acknowledgments:
We would like to thank Karthikeyan Shanmugam and Shane Gu for pointers to prior work early in the project, and Yang Gao, Abhishek Gupta, Marvin Zhang, Alyosha Efros, and Roberto Calandra for helpful discussions in various stages of the project. We are also grateful to Drew Bagnell and Katerina Fragkiadaki for helpful feedback on an earlier draft of this paper. This project was supported in part by Berkeley DeepDrive, NVIDIA, and Google.
References
 Argall et al. (2009) Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
 Bagnell (2016) Drew Bagnell. Talk: Feedback in machine learning, 2016. URL https://youtu.be/XRSvz4UOpo4.
 Bansal et al. (2019) Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. Robotics: Science & Systems (RSS), 2019.
 Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
 Burgess et al. (2018) Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
 Chen et al. (2018) Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 Daumé et al. (2009) Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 75(3):297–325, 2009.

 Eberhardt (2017) Frederick Eberhardt. Introduction to the foundations of causal discovery. International Journal of Data Science and Analytics, 3(2):81–91, 2017.
 Eberhardt and Scheines (2007) Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.
 Giusti et al. (2016) Alessandro Giusti, Jerome Guzzi, Dan Ciresan, Fang-Lin He, Juan Pablo Rodriguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jurgen Schmidhuber, Gianni Di Caro, Davide Scaramuzza, and Luca Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 2016.
 Goudet et al. (2017) Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, David Lopez-Paz, Isabelle Guyon, Michele Sebag, Aris Tritas, and Paola Tubaro. Learning functional causal models with generative neural networks. arXiv preprint arXiv:1709.05321, 2017.
 Guyon et al. (2008) Isabelle Guyon, Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe Pellet, Peter Spirtes, and Alexander Statnikov. Design and analysis of the causation and prediction challenge. In Causation and Prediction Challenge, pages 1–33, 2008.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
 Heckerman et al. (2006) David Heckerman, Christopher Meek, and Gregory Cooper. A Bayesian Approach to Causal Discovery, pages 1–28. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 9783540334866. doi: 10.1007/3540334866_1. URL https://doi.org/10.1007/3540334866_1.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
 Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
 Hussein et al. (2017) Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575–2583. Curran Associates, Inc., 2015.

 Krähenbühl (2018) Philipp Krähenbühl. Free supervision from video games. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2955–2964, 2018.
 Laskey et al. (2017) Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327, 2017.
 Le et al. (2016) Thuc Le, Tao Hoang, Jiuyong Li, Lin Liu, Huawen Liu, and Shu Hu. A fast PC algorithm for high dimensional causal discovery with multi-core PCs. IEEE/ACM transactions on computational biology and bioinformatics, 2016.
 Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.
 Liu et al. (2018) Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. CoRR, abs/1807.03247, 2018. URL http://arxiv.org/abs/1807.03247.
 Lopez-Paz et al. (2017) D. Lopez-Paz, R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou. Discovering causal signals in images. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages 58–66, Piscataway, NJ, USA, July 2017. IEEE.
 Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446–6456, 2017.
 Maathuis et al. (2010) Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247, 2010.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mahler and Goldberg (2017) Jeffrey Mahler and Ken Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on Robot Learning, pages 515–524, 2017.
 Mitrovic et al. (2018) Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh. Causal inference via kernel deviance measures. arXiv preprint arXiv:1804.04622, 2018.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Muller et al. (2006) Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann L Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pages 739–746, 2006.
 Mülling et al. (2013) Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
 Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
 Peters et al. (2014) Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.
 Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
 Pomerleau (1989) Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.
 Ross and Bagnell (2010) Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668, 2010.
 Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
 Schaal (1999) Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Sculley et al. (2015) David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in neural information processing systems, pages 2503–2511, 2015.
 Sen et al. (2017) Rajat Sen, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. arXiv preprint arXiv:1701.02789, 2017.
 Shanmugam et al. (2015) Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.
 Spirtes (2010) Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(May):1643–1662, 2010.
 Spirtes and Zhang (2016) Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. In Applied informatics, 2016.
 Spirtes et al. (2000) Peter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper, and Thomas Richardson. Causation, prediction, and search. MIT press, 2000.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Steyvers et al. (2003) Mark Steyvers, Joshua B Tenenbaum, Eric-Jan Wagenmakers, and Ben Blum. Inferring causal networks from observations and interventions. Cognitive science, 27(3):453–489, 2003.

 Tong and Koller (2001) Simon Tong and Daphne Koller. Active learning for structure in bayesian networks. In IJCAI, 2001.
 Wang et al. (2019) Dequan Wang, Coline Devin, QiZhi Cai, Philipp Krähenbühl, and Trevor Darrell. Monocular plan view networks for autonomous driving. arXiv preprint arXiv:1905.06937, 2019.
 Wang and Blei (2018) Yixin Wang and David M Blei. The blessings of multiple causes. arXiv preprint arXiv:1805.06826, 2018.
 Widrow and Smith (1964) Bernard Widrow and Fred W Smith. Pattern-recognizing control systems, 1964.
Appendix A Expert Demonstrations
To collect demonstrations, we first train an expert with reinforcement learning. We use DQN Mnih et al. [2013] for MountainCar, TRPO Schulman et al. [2015] for Hopper, and PPO Schulman et al. [2017] for the Atari environments (Pong, UpNDown, Enduro). This expert policy is executed in the environment to collect demonstrations.
Appendix B Necessity of Correct Causal Model
Faithfulness: A causal model is said to be faithful when all conditional independence relationships in the distribution are represented in the graph.
We pick up the notation used in Sec 3.1, but for notational simplicity, we drop the time superscript for , , and when we are not reasoning about multiple timesteps.
Proposition 1.
Let the expert’s functional causal model be , with causal graph as in Figure 2 and function parameters . We assume some faithful learner that agrees on the interventional query:
Then it must be that . (Footnote 2: We drop time from the superscript when discussing states and actions from the same time.)
Proof.
For graph , define the index set of state variables that are independent of the action in the mutilated graph :
From the assumption of matching interventional queries and the assumption of faithfulness, it follows that: . From the graph, we observe that and thus . ∎
Appendix C Variational Causal Discovery
The problem of discovering causal graphs from passively observed data is called causal discovery. The PC algorithm Spirtes et al. [2000] is arguably the most widely used and easily implementable causal discovery algorithm. In the case of Fig 2, the PC algorithm would imply the absence of the arrow , if the conditional independence relation holds, which can be tested by measuring the mutual information. However, the PC algorithm relies on faithfulness of the causal graph. That is, conditional independence must imply d-separation in the graph. However, faithfulness is easily violated in a Markov decision process. If for some , is a cause of the expert’s action (the arrow should exist), but is the result of a deterministic function of , then always and the PC algorithm would wrongly conclude that the arrow is absent. (Footnote 3: More generally, faithfulness places strong constraints on the expert graph. For example, a visual state may contain unchanging elements such as the car frame in Fig 1, which are by definition deterministic functions of the past. As another example, goal-conditioned tasks must include a constant goal in the state variable at each time, which once again has deterministic transitions, violating faithfulness.)
We take a Bayesian approach to causal discovery [Heckerman et al., 2006] from demonstrations. Recall from Sec 3 that the expert’s actions are based on an unknown subset of the state variables . Each may either be a cause or not, so there are possible graphs. We now define a variational inference approach to infer a distribution over functional causal models (graphs and associated parameters) such that its modes are consistent with the demonstration data .
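To make the faithfulness violation discussed above concrete, the following is a minimal sketch under simplifying assumptions that are ours, not the paper's: a single binary state variable, an expert whose action simply copies the current value of that variable, and a plug-in estimate of conditional mutual information. All function and variable names are hypothetical. When the transition is deterministic, the estimated conditional mutual information between the current state variable and the action, given the previous state, is zero even though the variable is a true cause of the action, so a test of this kind would drop the arrow.

```python
import math
import random
from collections import Counter


def conditional_mutual_information(samples):
    """Plug-in estimate of I(X; A | Z) in nats from a list of (x, a, z) triples."""
    n = len(samples)
    c_xaz = Counter(samples)
    c_xz = Counter((x, z) for x, _, z in samples)
    c_az = Counter((a, z) for _, a, z in samples)
    c_z = Counter(z for _, _, z in samples)
    cmi = 0.0
    for (x, a, z), count in c_xaz.items():
        # p(x,a,z) * log[ p(z) p(x,a,z) / (p(x,z) p(a,z)) ], written with raw counts.
        ratio = count * c_z[z] / (c_xz[(x, z)] * c_az[(a, z)])
        cmi += (count / n) * math.log(ratio)
    return cmi


def simulate(deterministic, num_samples=5000, seed=0):
    """Return (x_cur, action, x_prev) triples from a toy one-variable process."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        x_prev = rng.randint(0, 1)
        if deterministic:
            x_cur = 1 - x_prev         # deterministic function of the previous state
        else:
            x_cur = rng.randint(0, 1)  # stochastic transition
        action = x_cur                 # x_cur is a true cause of the action
        samples.append((x_cur, action, x_prev))
    return samples


cmi_deterministic = conditional_mutual_information(simulate(deterministic=True))
cmi_stochastic = conditional_mutual_information(simulate(deterministic=False))
```

In this toy process the stochastic case yields an estimate close to log 2, so the dependence is detected, while the deterministic case yields exactly zero.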
While Bayesian inference is intractable, variational inference can be used to find a distribution that is close to the true posterior distribution over models. We parameterize the structure of the causal graph as a vector of correlated Bernoulli random variables , each indicating the presence of a causal arrow from to . We assume a variational family with a point estimate of the parameters corresponding to graph and use a latent variable model to describe the correlated Bernoulli variables, with a standard normal distribution over latent random variable :
We now optimise the evidence lower bound (ELBO):
(2)  
(3) 
Likelihood
is the likelihood of the observations under the FCM . It is modelled by a single neural network , where is the elementwise multiplication, denotes concatenation and are neural network parameters.
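The masking-and-concatenation construction described above can be sketched as follows. This is an illustrative toy, not the architecture used in the experiments: the class name, layer sizes, and random initialization are assumptions, and the network is a small two-layer perceptron written in plain Python. The point it illustrates is that a single set of parameters serves every graph, because the binary graph vector both masks the state and is appended to the input.

```python
import math
import random


def masked_input(state, graph):
    """Build the network input: the state masked elementwise by the graph,
    concatenated with the graph vector itself."""
    masked = [x * g for x, g in zip(state, graph)]
    return masked + [float(g) for g in graph]


class GraphConditionedPolicy:
    """Toy two-layer network shared across all candidate graphs (hypothetical)."""

    def __init__(self, num_state_vars, num_actions, hidden=8, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * num_state_vars  # masked state plus the mask
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(num_actions)]

    def action_logits(self, state, graph):
        inp = masked_input(state, graph)
        hidden = [math.tanh(sum(w * v for w, v in zip(row, inp))) for row in self.w1]
        return [sum(w * v for w, v in zip(row, hidden)) for row in self.w2]


policy = GraphConditionedPolicy(num_state_vars=3, num_actions=2)
graph = [1, 0, 1]  # the second state variable is treated as a non-cause
logits_a = policy.action_logits([0.5, 2.0, -1.0], graph)
logits_b = policy.action_logits([0.5, -7.0, -1.0], graph)  # only the masked variable differs
```

Because the second variable is masked out, `logits_a` and `logits_b` coincide: under this graph the policy cannot react to that variable at all.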
Entropy
The entropy term of the KL divergence, , acts as a regularizer to prevent the graph distribution from collapsing to the maximum a posteriori estimate. It is intractable to directly maximize entropy, but a tractable variational lower bound can be formulated. Using the product rule of entropies, we may write:
In this expression, promotes diversity of graphs, while encourages correlation among . can be bounded below using the same variational bound used in InfoGAN Chen et al. [2016], with a variational distribution : . Thus, during optimization, in lieu of entropy, we maximize the following lower bound:
Prior
The prior over graph structures is set to prefer graphs with fewer causes for action —it is thus a sparsity prior:
Optimization
Note that is a discrete variable, so we cannot use the reparameterization trick [Kingma and Welling, 2013]. Instead, we use the Gumbel-Softmax trick [Jang et al., 2016, Maddison et al., 2016] to compute gradients for training . Note that this does not affect , which can be trained with standard backpropagation.
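For readers unfamiliar with the Gumbel-Softmax relaxation, the sketch below shows its forward sampling step for binary graph variables. It is only an illustration under our own assumptions: the function names are hypothetical, each graph entry is parameterized by a single logit (its log-odds of being present), and no gradients are computed, since plain Python has no automatic differentiation. In a differentiable framework the soft sample would be the quantity through which gradients flow.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_gumbel(rng):
    """Draw one standard Gumbel sample via inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def relaxed_bernoulli_sample(logit, temperature, rng):
    """Two-class Gumbel-Softmax sample for one graph entry.

    Returns (soft, hard): soft lies in [0, 1] and approaches a binary value as
    the temperature goes to zero; hard is the corresponding discrete sample."""
    # Softmax over the classes {present, absent} with logits (logit, 0)
    # reduces to a sigmoid of the difference of the perturbed logits.
    noisy_logit = logit + sample_gumbel(rng) - sample_gumbel(rng)
    soft = sigmoid(noisy_logit / temperature)
    hard = 1 if noisy_logit > 0 else 0
    return soft, hard


def sample_graph(logits, temperature, rng):
    """Sample a relaxed and a discrete graph vector, one entry per state variable."""
    pairs = [relaxed_bernoulli_sample(l, temperature, rng) for l in logits]
    return [s for s, _ in pairs], [h for _, h in pairs]
```

The difference of two independent Gumbel samples follows a logistic distribution, so the hard sample equals 1 with probability `sigmoid(logit)`, which is what makes the relaxation consistent with the underlying Bernoulli variable.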
The loss of Eq 3 is easily interpretable independent of the formalism of variational Bayesian causal discovery. A mixture of predictors is jointly trained, each paying attention to diverse sparse subsets (identified by ) of the inputs. This is related to variational dropout [Kingma et al., 2015]. Once this model is trained, represents the hypothesis distribution over graphs, and represents the imitation policy corresponding to a graph . Fig 9 shows the architecture.
Usage for Targeted Interventions
Appendix D Additional Results: Diagnosing Causal Confusion
In Fig 10 we show the causal confusion in several environments. We observe that while training and validation losses for behavior cloning are frequently near-zero for both the original and confounded policy, the confounded policy consistently yields significantly lower reward when deployed in the environment. This confirms the causal confusion problem.
Appendix E DAgger with many more interventions
In the main paper, we showed that DAgger performed poorly with an equal number of expert interventions as our method. How many more samples does it need to do well?
The results in Fig 11 show that DAgger requires hundreds of samples before reaching rewards comparable to the rewards achieved by a non-DAgger imitator trained on the original state.
Appendix F GAIL Training Curves
In Figure 12 we show the average training curves of GAIL on the original and confounded state. Error bars are 2 standard errors of the mean. The confounded and original training curves do not differ significantly, indicating that causal confusion is not an issue with GAIL. However, training requires many interactions with the environment.
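As a reminder of how error bars of this kind are typically computed, the following sketch returns the mean of a set of per-run rewards together with a half-width of two standard errors of the mean. The function name and the use of the sample standard deviation (with Bessel's correction) are our assumptions; the text does not state which estimator was used.

```python
import math
import statistics


def mean_with_error_bar(values, num_standard_errors=2.0):
    """Return (mean, half_width), where half_width is a multiple of the
    standard error of the mean, estimated from the sample standard deviation."""
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, num_standard_errors * standard_error


# Hypothetical rewards from five independent training runs at one time step.
mean_reward, half_width = mean_with_error_bar([1.0, 2.0, 3.0, 4.0, 5.0])
```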
Appendix G Intervention Posterior Inference as Reinforcement Learning
Given a method of evaluating the likelihood of a certain graph to be optimal and a prior , we wish to infer the posterior . The number of graphs is finite, so we can compute this posterior exactly. However, there may be very many graphs, so that impractically many likelihood evaluations are necessary. This problem is exacerbated when only noisy samples from the likelihood can be obtained, as in the case of intervention through policy execution, where the reward is noisy.
If, on the other hand, a certain structure on the policy is assumed, the sample efficiency can be drastically improved, even though the policy can no longer be exactly inferred. This can be done in the framework of variational inference. For a certain variational family, we wish to find, for some temperature :
(4)  
(5) 
The variational family we assume is the family of independent distributions:
(6) 
Eq 5 can be interpreted as a one-step entropy-regularized MDP with reward Levine [2018]. It can be optimized through a policy gradient, but this would require many likelihood evaluations. It is more efficient to use a value-based method. The independence assumption translates into a linear Q function: , which can be learned simply by linear regression on off-policy pairs . In Soft Q-Learning Haarnoja et al. [2017] it is shown that the policy that maximizes Eq 5 is , which can be shown to coincide in our case with Eq 6.
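The claim that a linear Q function yields an independent (factorized) distribution over graphs can be checked numerically for a small number of state variables. The sketch below assumes a Q function of the form "weights dotted with the graph vector, plus a bias"; the symbol names, the example weights, and the temperature value are ours and purely illustrative. It compares the softmax policy over all binary graphs with a product of independent Bernoulli distributions whose parameters are the sigmoid of each weight divided by the temperature.

```python
import itertools
import math


def softmax_policy_over_graphs(weights, bias, temperature):
    """Exact policy pi(G) proportional to exp(Q(G) / temperature), with
    Q(G) = <weights, G> + bias, enumerated over all binary graphs."""
    graphs = list(itertools.product([0, 1], repeat=len(weights)))
    scores = [
        (sum(w * g for w, g in zip(weights, graph)) + bias) / temperature
        for graph in graphs
    ]
    max_score = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - max_score) for s in scores]
    total = sum(unnormalized)
    return {graph: u / total for graph, u in zip(graphs, unnormalized)}


def bernoulli_parameters(weights, temperature):
    """Per-variable inclusion probabilities sigmoid(w_i / temperature)."""
    return [1.0 / (1.0 + math.exp(-w / temperature)) for w in weights]


def factorized_probability(graph, probs):
    """Probability of a graph under independent Bernoulli variables."""
    result = 1.0
    for g, p in zip(graph, probs):
        result *= p if g == 1 else (1.0 - p)
    return result


weights, bias, temperature = [1.0, -0.5, 2.0], 0.3, 0.7
full_policy = softmax_policy_over_graphs(weights, bias, temperature)
probs = bernoulli_parameters(weights, temperature)
max_gap = max(
    abs(p - factorized_probability(graph, probs)) for graph, p in full_policy.items()
)
```

The two distributions agree up to floating-point error, and the bias cancels in the normalization. This is why only the per-variable weights, which could be fitted by linear regression on observed graph and reward pairs, are needed to sample new graphs.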