Resolving Spurious Correlations in Causal Models of Environments via Interventions

02/12/2020 ∙ by Sergei Volodin, et al. ∙ Google

Causal models could increase interpretability, robustness to distributional shift and sample efficiency of RL agents. In this vein, we address the question of learning a causal model of an RL environment. This problem is known to be difficult due to spurious correlations. We overcome this difficulty by rewarding an RL agent for designing and executing interventions to discover the true model. We compare rewarding the agent for disproving uncertain edges in the causal graph, rewarding the agent for activating a certain node, and rewarding the agent for increasing the causal graph loss. We show that our methods result in a better causal graph than one generated by following a random policy or a policy trained on the environment's reward. We find that rewarding for the causal graph loss works best.




1 Introduction

Causality (Halpern & Pearl, 2005) is an important concept (Pearl, 2018) for Machine Learning, since it resolves many issues in performance and Artificial Intelligence (AI) safety (Amodei et al., 2016), such as interpretability (Madumal et al., 2019; Bengio, 2017), robustness to distributional shift (de Haan et al., 2019a) and sample-efficiency (Buesing et al., 2018). It is particularly well suited for Reinforcement Learning (RL), compared to supervised learning, because in RL there is an opportunity to take actions and influence the environment in a directed way. Since causality is a cornerstone of science, an agent that reasons causally is expected to be superior to non-causal agents (Marino et al., 2019).

Spurious correlations are a major obstacle in learning causal models. If present, they make learning from purely observational data impossible (Pearl & Mackenzie, 2018). We take advantage of the fact that it is possible to uncover the causal graph by executing interventions (Halpern & Pearl, 2005), which change the data distribution. We design a method to automatically resolve spurious correlations when learning the causal graph of the environment. Since we are interested in learning the high-level dynamics of the environment, low-level actions do not directly represent interventions in the environment. As a solution, we reward an RL agent either for setting nodes or edges (selected based on uncertainty) to specific values, or for disproving the learned causal model.

Contributions. In this paper, we formulate the problem of learning the causal graph of an RL environment in an active way. We present ways to design interventions and execute them in an end-to-end fashion. We show that interventions outperform a random baseline policy and a policy trained on the environment’s reward. The main contribution is a framework to deal with spurious correlations in RL via interventions, from definitions to methods and experiments.

2 Problem

We have an RL environment consisting of a set of observations, a set of actions, a transition probability, and an initial observation distribution. We call a tuple of an observation, an action and a reward a step. A history contains the steps from one agent-environment interaction. A feature mapping takes raw steps to high-level features.

Causal learning. We use the standard definition of a Structural Causal Model for time series data (Halpern & Pearl, 2005). We create a directed graph of nodes, one for every feature. Edges represent dependencies between features at different time-steps and are labelled with the number of time-steps it takes for one node to influence another. The value of a node is determined by the values of its parents at past time-steps. The loss of the causal learner (Runge et al., 2019), evaluated on data collected with a given policy, measures how well the graph fits the features of previous episodes.

Spurious correlations and interventions. We would like to find a graph which fits the environment, without giving the agent the ground-truth graph. However, which policy should we execute in the environment to learn it? Our proposition shows that a random policy is sufficient to learn the graph in the realizable case (Shalev-Shwartz & Ben-David, 2014), where some graph perfectly fits the data [1]. In the realizable case, interventions are not required to resolve spurious correlations.

[1] Note that in practice, a random policy might take too much time to explore the environment.

Proposition 1.

For an environment and a feature mapping with a true causal graph that fits the data of every policy perfectly, while no other graph does, a random policy gives the true graph.

Intuitively, in the realizable case, more data is always better. The proof is based on two ideas. First, for a policy giving the true graph, a random policy will take the same actions as that policy with some probability. Thus, data from that policy will be in the dataset. Next, since the true graph fits any policy, the learner will find a graph with zero loss on the random policy's data. By linearity of the loss, the loss on the random policy equals a non-negative combination of losses over the policies it can imitate, including the one for the policy giving the true graph. Since this non-negative combination is zero, each particular term is zero as well.

In contrast, in cases where we cannot fit the data perfectly, it is possible that two policies produce different graphs, no matter the length of the history or the method used to learn the graph. Indeed, given data from one policy only, we can construct two environments with different correct ground-truth graphs, both of which match the existing data perfectly. Therefore, no learning method can uncover the graph, because it is not fully determined by the data obtained. In some cases, this leads to learning spurious correlations: dependencies which hold under one policy, but not under another. For example, noise in one feature might force the learner to rely on spurious features, which are irrelevant given a different policy.

We define the problem in a minimax [2] fashion: the graph should not be disproved even by the worst policy, at any number of collected episodes. This is similar to the scientific method in the real world: we want to learn causal relationships that are true no matter which experiments or actions we perform [3].

[2] Note also that the definition is dependent on the particular . It might be meaningless if different policies generate drastically different graphs. In that case, the environment has an ill-defined minimax causal structure. This is similar to the problem of defining the general fitness of an RL agent (Legg, 2008), which does not have a silver-bullet solution (it is prior-dependent). A slightly better way is to create a soft prior over policies, like the complexity prior (Rathmanner & Hutter, 2011). The minimax version, though, is easier to define and evaluate.

[3] For example, the goal of Physics is to find laws which cannot be disproved by doing experiments.


In practice, in the equation above, we use a finite sequence of policies instead of all possible policies, and we consider a finite number of episodes. Executing the next policy after the previous one can be seen as doing an intervention in the causal model, since the policy sets nodes to specific values. In that sense, we sample from the interventional distribution.

In the next section we give concrete methods for intervention design.

Figure 1: Learning the causal graph by actively interacting with the environment. Given a high-level set of features and an environment, we collect data using a policy to learn an initial causal model (hypothesis) from the feature time series. Then, we design an intervention (an experiment): a new policy aimed at disproving the current causal graph in order to learn the true one. The intervention is then executed in the environment as a standard agent-environment interaction with intrinsic reward, and the process is repeated.

3 Solution

We compare different methods for intervention design:

Intervention design via edges. We test whether selected edges in the graph are real or caused by spurious correlations. Edges where the causal graph learning algorithm is uncertain are selected more often. We measure uncertainty by how different the results are when the learner is trained on different subsets of the data. When selecting an edge with a positive coefficient in a linear causal model [4], we propose to test it by setting the features it connects to extreme values (minimal and maximal feature values). To do so, we reward the agent for setting each of the two features to its chosen value, and the total reward combines the two terms [5]. Note that in this approach, we do not need to explicitly learn the mapping between low-level actions and high-level features, since we simply use RL for the high-level task.

[4] In the non-linear case, the coefficient might depend on the current value of features. In that case, this approach will still work, but the step has to be small enough.

[5] An edge with a negative coefficient requires a different choice of values.
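The subset-based uncertainty measure can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the names `fit_slopes`, `edge_uncertainty` and `pick_edge` are hypothetical, the toy learner is a stand-in for the real causal learner, and using the standard deviation of each weight across subsets as the uncertainty score is an assumption.

```python
import random
import statistics

def fit_slopes(transitions):
    """Toy stand-in learner: one least-squares slope per (cause j, effect i) pair."""
    d = len(transitions[0][0])
    W = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            num = sum(prev[j] * nxt[i] for prev, nxt in transitions)
            den = sum(prev[j] * prev[j] for prev, nxt in transitions)
            W[i][j] = num / den if den else 0.0
    return W

def edge_uncertainty(transitions, fit, n_subsets=10, frac=0.5, seed=0):
    """Spread of each edge weight across fits on random subsets (assumed measure)."""
    rng = random.Random(seed)
    k = max(1, int(frac * len(transitions)))
    fits = [fit(rng.sample(transitions, k)) for _ in range(n_subsets)]
    d = len(fits[0])
    return [[statistics.pstdev([f[i][j] for f in fits]) for j in range(d)]
            for i in range(d)]

def pick_edge(uncertainty, rng):
    """Sample an edge (effect i, cause j) with probability proportional to uncertainty."""
    edges = [(i, j) for i in range(len(uncertainty))
             for j in range(len(uncertainty[i]))]
    # A tiny constant keeps the weights valid when every edge is certain.
    weights = [uncertainty[i][j] + 1e-12 for i, j in edges]
    return rng.choices(edges, weights=weights, k=1)[0]

# Toy data: feature 1 copies the previous value of feature 0; feature 0 is noise.
rng = random.Random(1)
state = [rng.uniform(-1, 1), 0.0]
transitions = []
for _ in range(200):
    nxt = [rng.uniform(-1, 1), state[0]]
    transitions.append((state, nxt))
    state = nxt

U = edge_uncertainty(transitions, fit_slopes)
edge = pick_edge(U, rng)
```

In this toy data the edge from feature 0 to feature 1 is deterministic, so its weight is the same on every subset and it is almost never selected, while the noisy edges are tested more often.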

Intervention design via nodes. We reward the agent for setting a target node to a target value. We also reward the agent for keeping everything else the same, by penalizing the difference between the averages and variances of feature distributions from the previous and current policies [6]. Another method for the distance part is to reward the agent with the environment reward, scaled by some coefficient, because having a common reward will keep the behavior similar. Node values are selected uniformly at random, while nodes are selected based on the average uncertainty of the edges as explained above.

[6] We set the distance to the difference between the random variables' expected values and variances.
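One possible shape for the node-intervention reward is sketched below. The exact formula is not recoverable from the text here, so everything in this block is an assumption made for illustration: the names `feature_stats` and `node_intervention_reward`, the absolute-difference form of both terms, and the coefficient `coef`.

```python
import statistics

def feature_stats(rollout):
    """Per-feature (means, variances) over a list of feature vectors."""
    columns = list(zip(*rollout))
    means = [statistics.mean(c) for c in columns]
    variances = [statistics.pvariance(c) for c in columns]
    return means, variances

def node_intervention_reward(features, node, target, old_stats, new_stats, coef=0.1):
    """Reward hitting the target value, penalize drift in all other features."""
    hit = -abs(features[node] - target)
    drift = 0.0
    for old, new in zip(old_stats, new_stats):  # means first, then variances
        for k, (a, b) in enumerate(zip(old, new)):
            if k != node:  # the intervened node is allowed to change
                drift += abs(a - b)
    return hit - coef * drift

# Previous policy, a new policy that only changes node 0, and one that also
# shifts feature 1.
old = feature_stats([[0, 1], [0, 3]])
same_rest = feature_stats([[1, 1], [1, 3]])
shifted = feature_stats([[1, 2], [1, 6]])

r_same = node_intervention_reward([1, 3], 0, 1, old, same_rest)
r_shift = node_intervention_reward([1, 6], 0, 1, old, shifted)
```

Here `r_same` is higher than `r_shift`, because the second policy reaches the target value only by also changing the statistics of the other feature.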

Intervention design via loss. We reward the agent for finding policies which give a high causal graph loss. This is similar to curiosity approaches (Pathak et al., 2017). The reasoning behind this is that we want to find data disproving our model, as in Eq. 1. Compared to the previous methods, we do not select a node or an edge explicitly.
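A loss-based intrinsic reward could look like the sketch below. It assumes a linear one-step model stored as a weight matrix and a per-transition squared error; the function name `causal_loss_reward` and the per-step (rather than per-policy) granularity are illustrative assumptions.

```python
def causal_loss_reward(W, prev_features, next_features):
    """Intrinsic reward: mean squared error of the current model on one transition.

    W[i][j] is the assumed linear weight of feature j at time t on feature i
    at time t + 1.
    """
    predicted = [sum(w * x for w, x in zip(row, prev_features)) for row in W]
    errors = [(p - y) ** 2 for p, y in zip(predicted, next_features)]
    return sum(errors) / len(errors)

# The model claims that feature 1 copies the previous value of feature 0.
W = [[0.0, 0.0],
     [1.0, 0.0]]
r_consistent = causal_loss_reward(W, [1.0, 0.0], [0.0, 1.0])
r_surprising = causal_loss_reward(W, [1.0, 0.0], [0.0, 0.0])
```

Transitions that the current model predicts well earn no reward, so the agent is pushed toward behavior that produces data contradicting the model.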

For the rest of the setup, we use the simplest methods: Granger 1-step causality (Granger, 1969) with hardcoded features and a sparsity loss. In this approach, we simply regress the features at the current time-step on those at the previous one, using a linear relationship with regularization. The model is trained in a supervised manner by aggregating data from different policies to approximate a solution to Eq. 1. We note that better methods of learning causal graphs (Runge et al., 2019) would still fail without interventions (as explained in the previous section). Other causality learning methods are compatible with our approach. Methods which discover the features end-to-end (Thomas et al., 2018; Kurutach et al., 2018; Ke et al., 2019; François-Lavet et al., 2018; Zhang et al., 2019) can be used to discover the nodes in the causal graph.
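The one-step regression with a sparsity term can be illustrated with a small pure-Python sketch. It assumes the sparsity loss is an L1 penalty and solves the resulting problem with coordinate descent; both choices, as well as the names `soft_threshold` and `fit_granger_1step`, are assumptions for illustration and not necessarily what was used in the experiments.

```python
import random

def soft_threshold(x, lam):
    """Shrink x toward zero by lam (the closed-form step for an L1 penalty)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def fit_granger_1step(series, l1=0.01, sweeps=100):
    """Regress features at t+1 on features at t with an L1 penalty.

    series: list of equal-length feature vectors, one per time-step.
    Returns W where W[i][j] is the weight of the edge j(t) -> i(t+1).
    """
    X, Y = series[:-1], series[1:]
    n, d = len(X), len(X[0])
    col_sq = [sum(x[j] * x[j] for x in X) / n for j in range(d)]
    W = [[0.0] * d for _ in range(d)]
    for i in range(d):
        w = W[i]
        resid = [Y[t][i] for t in range(n)]  # residual of the current fit
        for _ in range(sweeps):
            for j in range(d):
                if col_sq[j] == 0.0:
                    continue
                # Correlation of column j with the residual that excludes j.
                rho = sum(X[t][j] * (resid[t] + w[j] * X[t][j])
                          for t in range(n)) / n
                new = soft_threshold(rho, l1) / col_sq[j]
                delta = new - w[j]
                if delta != 0.0:
                    for t in range(n):
                        resid[t] -= delta * X[t][j]
                    w[j] = new
    return W

# Toy series: feature 1 at t+1 equals 0.8 * feature 0 at t; features 0 and 2 are noise.
rng = random.Random(0)
series = [[rng.uniform(-1, 1), 0.0, rng.uniform(-1, 1)]]
for _ in range(300):
    prev = series[-1]
    series.append([rng.uniform(-1, 1), 0.8 * prev[0], rng.uniform(-1, 1)])

W = fit_granger_1step(series)
```

On this toy series the row for feature 1 should recover a weight close to 0.8 on feature 0 (slightly shrunk by the penalty) and weights near zero elsewhere.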

All the components discussed above are combined into a causal agent that discovers the true causal graph of the environment (see Figure 1).
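The overall loop can be sketched as below. The four callables are hypothetical placeholders for the components described in this section (data collection, graph learning, intervention design, and RL training), and the stubs at the bottom exist only to make the control flow visible.

```python
def causal_agent_loop(collect, fit_graph, design_reward, train_policy,
                      initial_policy, n_interventions):
    """Alternate between fitting the graph and executing interventions.

    collect(policy)            -> list of episodes gathered with that policy
    fit_graph(data)            -> causal graph fitted on all data so far
    design_reward(graph, data) -> intrinsic reward defining the intervention
    train_policy(reward_fn)    -> policy trained on that reward
    """
    data = list(collect(initial_policy))
    graph = fit_graph(data)
    for _ in range(n_interventions):
        reward_fn = design_reward(graph, data)
        policy = train_policy(reward_fn)
        data.extend(collect(policy))  # aggregate data across policies
        graph = fit_graph(data)
    return graph, data

# Stub components, only to show how the pieces are wired together.
graph, data = causal_agent_loop(
    collect=lambda policy: [policy],
    fit_graph=lambda d: len(d),
    design_reward=lambda g, d: g,
    train_policy=lambda r: "p" + str(r),
    initial_policy="random",
    n_interventions=2,
)
```

The key design choice mirrored here is that the graph is always refitted on data aggregated from all policies so far, not only on the data from the latest intervention.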

4 The environment

We use a simple Grid-World environment. The agent needs to eat food in the environment or the episode will end. In addition, it collects keys to open chests, with each chest giving a reward. There is a button which turns a light on and off and does not give reward. Figure 2(a) represents the causal model we want to discover. Appendix C contains more details about the environment.

We use the following specific environments: (A) a 5x5 grid-world with randomly placed items; (B, Figure 2(b)) a grid-world with a fixed map, where the agent must collect the key before the food; (C, Figure 2(c)) a 10x10 grid-world with randomly placed items, where the food is close to the key and the chest is far away.

We add noise to each of the environments. With some probability, food is visible at cells not containing food.

The environment we choose is characteristic of the real world, as it contains spurious correlations that we need to uncover by changing behavior.

Figure 2: (a) The causal diagram of the environment, which the agent should learn. (b), (c) Layouts of environments B and C. The player needs to collect food and keys. Keys are used to open chests, and the number of keys is displayed above the first black line. The top row shows health, which decreases at every time-step; the episode ends when it reaches 0. The button toggles the lamp (black/white), which gives no reward.

5 Experiments

Hardcoded features. We augment the feature set with conjunctions of relevant features in order to keep the problem in the linear domain. This allowed us to keep the causal learning simple to focus on interventions. There are many techniques for learning nonlinear causal graphs that are compatible with our approach.
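Adding conjunction features can be sketched as follows. The feature names and the helper `add_conjunctions` are hypothetical, and representing a conjunction of binary features as their product is an assumption.

```python
def add_conjunctions(features, pairs):
    """Return a copy of `features` extended with one product feature per pair.

    features: dict mapping feature name -> 0/1 value at one time-step.
    pairs: list of (name_a, name_b) conjunctions to add.
    """
    extended = dict(features)
    for a, b in pairs:
        extended[a + "_and_" + b] = features[a] * features[b]
    return extended

step = {"food": 1, "keys": 0}
augmented = add_conjunctions(step, [("food", "keys")])
```

With such products in the feature set, an effect that depends on two conditions holding at once can still be expressed by a single linear weight.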

Baselines and methods. For all methods we first use a random policy for exploration of the environment. Next, we train a PPO (Schulman et al., 2017) agent to follow the intervention reward. We compare three methods for interventions: rewarding the agent proportionally to the loss of the model (Loss), disproving edges (Edge), and setting nodes to values (Node). We measure whether the correct graph was learned using cosine similarity.
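A cosine-similarity comparison between a learned and a reference graph might be computed as in this sketch. It assumes both graphs are given as weight matrices that are flattened before comparison; the text does not specify the exact representation, and `graph_cosine_similarity` is a hypothetical name.

```python
import math

def graph_cosine_similarity(A, B):
    """Cosine similarity between two weight matrices, treated as flat vectors."""
    a = [x for row in A for x in row]
    b = [x for row in B for x in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an empty graph
    return dot / (norm_a * norm_b)

same_structure = graph_cosine_similarity([[1, 0], [0, 1]], [[2, 0], [0, 2]])
disjoint_edges = graph_cosine_similarity([[1, 0], [0, 0]], [[0, 1], [0, 0]])
```

Because the measure ignores overall scale, a learned graph with the right edges but uniformly shrunk weights (for example, due to regularization) still scores close to 1.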

5.1 Results

The random and environment reward policies discover the true causal graph in environment A, because the environment is randomly generated, so there aren’t any spurious correlations. The random and environment reward policies extremely rarely discover the true causal graph in environments B and C. This is because the food presence feature is noisy and a random policy often collects the food before the keys because they are close together. To predict the health increase, it is then best to rely on the spurious feature “food and keys”.

The intervention methods discover the true causal graph in environments B and C more often (results for C in the appendix). This is because they also include data from the intervention policy which collects the food when the agent doesn’t have a key. With data from the intervention policy, it is no longer optimal to rely on the spurious feature.

The Loss intervention method outperforms the Node and Edge methods on environment B (Figure 2(b)). The main problem with the Node and Edge methods is that they have to choose the correct node or edge to intervene on. Once the correct edge or node is chosen, the true graph is learned quickly. In contrast, the Loss method doesn’t have to choose the right thing to intervene on. We expected that for harder environments, we would have to explicitly specify what node or edge to intervene on, as they are less trivial to find by random exploration. This would lead to the Node and Edge methods outperforming the Loss method. However, the experimental results show the opposite: in simple environments the hand-designed methods (edges and nodes) perform reasonably well, but if we increase the number of features or the complexity of the environment, they stop finding good policies. For this reason, we didn’t test the node and edge interventions in environment C extensively.

Sampling without replacement when selecting edges gives faster convergence to the true graph than sampling with replacement. We also found that selecting edges based on uncertainty (by training on different subsets) gives better results than selecting random edges. Results for selecting nodes based on uncertainty are similar to selecting randomly. For the nodes method, we found that keeping the new policy close to the old one by using the reward from the environment works. However, keeping the feature statistics the same doesn’t work, because the agent learns a new way to achieve the same statistics.

Figure 3: Experimental results on environment B for predicting the true causal graph. The horizontal axis shows the number of episodes (in 1000s), and plots are arranged by intervention method (Loss, Node, Edge). The vertical axis represents the number of runs (out of 10) which have converged to the true graph. Plots are arranged by the number of interventions (0, 5, 20). The green line represents the median. 0 interventions corresponds to training with the environment reward. The random policy is evaluated in a separate experiment, with spurious correlations as a result. means that the algorithm didn’t find the correct graph during training.

6 Conclusion

We design a method to learn the causal model of the environment by performing interventions, which helps prevent learning spurious correlations. This shows the potential of RL to improve causal graph learning and compares techniques to accomplish this. We state the problem of learning a causal graph in an RL setting so that other work can build on ours.

7 Future Work

We plan to combine our graph learning with one of the approaches for learning the features (Kurutach et al., 2018; Ke et al., 2019; François-Lavet et al., 2018; Zhang et al., 2019) and train the entire network end to end. The sparsity loss will help the features be disentangled because it will minimize dependencies between them (Thomas et al., 2018).

To make our method more general, we plan to use one of the advanced non-linear causality learners (Ke et al., 2019). Some of them are differentiable, which would allow backpropagating from the graph to the features.

Finally, we can utilize the high-level graph as a hierarchical RL controller (Nachum et al., 2018). Specifically, we can run a traversal algorithm on the causal graph to find chains of nodes that lead to high reward. Then, we can reward the agent for activating the nodes in the correct sequence. This might increase the robustness to distributional shift, as we will rely on the correct features for acting.
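As a rough illustration of the traversal idea, the sketch below enumerates cause-to-effect chains that end at a reward node. The graph, its node names and the function `chains_to` are hypothetical, and a plain depth-first search over parents is only one of several traversals that could be used.

```python
def chains_to(graph, target, max_len=5):
    """All simple chains of nodes ending at `target`, following cause -> effect edges.

    graph: dict mapping each cause to the list of its direct effects.
    """
    parents = {}
    for cause, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, []).append(cause)

    chains = []

    def walk(node, path):
        causes = [p for p in parents.get(node, []) if p not in path]
        if not causes or len(path) >= max_len:
            chains.append(list(reversed(path)))  # store in cause -> effect order
            return
        for p in causes:
            walk(p, path + [p])

    walk(target, [target])
    return chains

# Hypothetical high-level graph loosely based on the Grid-World description.
toy_graph = {
    "key_collected": ["chest_opened"],
    "chest_opened": ["reward"],
    "food_eaten": ["health"],
}
reward_chains = chains_to(toy_graph, "reward")
```

Each returned chain lists the nodes in the order the agent would be rewarded for activating them.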


Appendix A Relevant work

Our method to perform an intervention on an edge is similar to the method used in Marino et al. (2019) to test hypotheses. Compared to that approach, we are interested in the true causal graph of the environment rather than in testing specific hypotheses. Interventions to learn the true graph can be seen in (de Haan et al., 2019b). Our approach is focused on learning the correct graph rather than acting well. We extend the Action-Influence model (Marino et al., 2019; Everitt et al., 2019) to understand the environment. Compared to (Madumal et al., 2019), we learn the graph rather than design it by hand. The idea to reward the agent for the loss of the causal graph is taken from (Pathak et al., 2017). However, here we are interested in a very low-dimensional causal graph rather than in a black-box model of the environment.

Appendix B Hyperparameter selection

Resources and parameters. Parameters were chosen with a hyperparameter search on the task of solving the environment using PPO (Schulman et al., 2017). We run a total of 8000 episodes for B and 50000 episodes for C. We vary the number of epochs to train the causal graph (500-10000), the number of interventions (0-50), the number of training calls (5-100), the intervention method (Loss, Edge), the maximal number of episodes in the buffer (10-5000), and the method to select the edge (Constant, Weighted and Random). We learn the graph on evaluation data without noise, and update the reward for the trainer.

Appendix C The environment

We implement the environment using pycolab. All updates are delayed 1 time-step to give causal information.

Appendix D Experiments

Figure 4 shows the results for environment C. Without interventions, the correct graph is never uncovered. In contrast, with interventions, the correct graph is learned; the more interventions, the better.

Figure 4: Experimental results on environment C for predicting health, reward, keys and lamp. The description matches that of Figure 3.