Deconfounding Reinforcement Learning in Observational Settings

12/26/2018 · by Chaochao Lu, et al. · University of Cambridge · Max Planck Society

We propose a general formulation for addressing reinforcement learning (RL) problems in settings with observational data. That is, we consider the problem of learning good policies solely from historical data in which unobserved factors (confounders) affect both observed actions and rewards. Our formulation allows us to extend a representative RL algorithm, the Actor-Critic method, to its deconfounding variant, with the methodology for this extension being easily applied to other RL algorithms. In addition to this, we develop a new benchmark for evaluating deconfounding RL algorithms by modifying the OpenAI Gym environments and the MNIST dataset. Using this benchmark, we demonstrate that the proposed algorithms are superior to traditional RL methods in confounded environments with observational data. To the best of our knowledge, this is the first time that confounders are taken into consideration for addressing full RL problems with observational data. Code is available at




1 Introduction

In recent years, much attention has been devoted to the development of reinforcement learning (RL) algorithms with the goal of improving treatment policies in healthcare (Gottesman et al., 2018). RL algorithms have been proposed to infer better decision-making strategies for mechanical ventilation (Prasad et al., 2017), sepsis management (Raghu et al., 2017a, b), and treatment of schizophrenia (Shortreed et al., 2011). In healthcare, a common practice is to focus on the observational setting, in which we learn policies solely from historical data produced by real environments, instead of learning policies by actively taking actions as in the traditional RL setting. The reason for this is that we do not wish to experiment with patients' lives without evidence that the proposed treatment strategy is better than the current practice (Gottesman et al., 2018). (Another example is in finance where, due to costs in terms of time and money, it is often impractical to evaluate trading strategies by actually buying and selling stocks in the market.) As pointed out by Raghu et al. (2017a), RL also has advantages over other machine learning algorithms even in the observational setting, especially in two situations: when the optimal treatment strategy is unclear in the medical literature (Marik, 2015), and when training examples do not represent optimal behavior.

Another approach that has been explored and used is causal inference (Pearl, 2009). Causal inference is a fundamental notion in science and plays an increasingly important role in healthcare and medicine (Liu et al.; Soleimani et al., 2017; Schulam and Saria, 2017; Alaa et al., 2017; Alaa and van der Schaar, 2018; Atan et al., 2016). Especially in the observational setting, causal inference offers a powerful tool for dealing with confounding, which occurs when a hidden variable influences both the treatment and the outcome (Pearl and Mackenzie, 2018). It is widely acknowledged that confounding is the most crucial aspect of inferring the effect of a treatment on an outcome from observational data (Louizos et al., 2017), because not taking confounding into account might lead to choosing a wrong treatment. However, most approaches to tackling confounding in causal inference are restricted to fixed treatments that do not vary over time (Louizos et al., 2017). When treatments are allowed to vary over time, the space of combinatorial treatment strategies grows exponentially (Peters et al., 2017; Hernán and Robins, 2018). In this case, it is still unclear how to apply causal inference to deal with confounding in sequential data where treatments change over time.

On the basis of the discussion above, in this paper we attempt to combine the advantages of RL and causal inference to cope with an important family of RL problems in the observational setting, that is, learning good policies solely from the historical data produced by real environments with confounding bias. This type of problem will become increasingly common in future RL research with the burgeoning developments in healthcare and medicine. To the best of our knowledge, however, little work has been done in this promising area of integrating RL with causal inference (Bareinboim et al., 2015; Forney et al., 2017; Buesing et al., 2019). In contrast, confounders have been extensively studied in epidemiology, sociology, and economics. Take, for example, the well-known kidney stone example, in which the size of the kidney stone is a confounding factor affecting both the treatment and the recovery (Peters et al., 2017; Pearl, 2009). Correcting for the confounding effect of the stone size is crucial for determining how to choose an effective treatment. Similarly, in RL, if unobserved potential confounders exist, they affect both actions and rewards as an agent interacts with environments, and eventually influence the policy to be optimized.

Let us for a moment stick with the example of the kidney stones and assume that physicians need to take a series of steps to treat this disease. We further assume that, during the course of treatment, physicians have natural predilections in their treatment choices as a function of the stone size. That is, they are more likely to choose one treatment when patients have large stones and another treatment when patients have small stones. In the observational setting, given only such a set of historical records about physicians' treatments and outcomes on patients, it is extremely challenging or even impossible to learn an optimal treatment policy, due to the existence of the confounder (i.e., the size of the kidney stones).

To this end, we present a general formulation for addressing RL problems with observational data, namely deconfounding reinforcement learning (DRL). More specifically, given several common confounding assumptions, we first estimate a latent-variable model from observational data. Under some conditions for identification, through the latent-variable model we can simultaneously discover the latent confounders and infer how they affect actions and rewards. Then the confounders in the model can be adjusted for, using the causal language developed by Pearl (2009), and finally we optimize the policy based on the resulting deconfounding model.

On the basis of the proposed formulation, we extend one popular RL algorithm, the Actor-Critic method, to its corresponding deconfounding variant. Note that our procedure for obtaining this deconfounding variant can be easily applied to other RL algorithms. Due to the lack of datasets in this area, we revise the classic control toolkit in OpenAI Gym (Brockman et al., 2016), making it a benchmark for the comparison of DRL algorithms. In addition, we also devise a confounding version of the MNIST dataset (LeCun et al., 1998) to verify the performance of our causal model. Finally, we conduct extensive experiments to demonstrate the superiority of the proposed formulation compared to traditional RL algorithms in confounded environments with observational data.

To sum up, our contributions in this paper are as follows:

  1. We propose a general formulation for addressing RL problems in confounded environments with observational data, namely deconfounding reinforcement learning (DRL);

  2. We present the deconfounding variant of Actor-Critic methods, obtained through a methodology that can be easily applied to other RL methods;

  3. We develop a benchmark for DRL by revising the toolkit for classic control in OpenAI Gym (Brockman et al., 2016) and by devising a version of the MNIST dataset with confounders (LeCun et al., 1998);

  4. We perform a comprehensive comparison of our DRL algorithm with its vanilla version, showing that the proposed approach has an advantage in confounded environments.

  5. To the best of our knowledge, this is the first attempt to build a bridge between confounding and the full RL problem. This is one of the few research papers aiming at understanding the connections between causal inference and the full RL setting.

2 Background

In this section, we briefly review confounding in causal inference. We recommend Pearl’s excellent monograph for further reading (Pearl, 2009).

2.1 Simpson’s Paradox

Let us begin with an example of one of the most famous paradoxes in statistics: Simpson's paradox. Consider the previously mentioned kidney stones (Peters et al., 2017). We collect electronic patient records to investigate the effectiveness of two treatments, A and B, against kidney stones. Although the overall probability of recovery is higher for patients who took treatment B, treatment A performs better than treatment B both on patients with small kidney stones and on patients with large kidney stones. More precisely, we have

P(R = 1 | T = A) < P(R = 1 | T = B), yet P(R = 1 | T = A, S = s) > P(R = 1 | T = B, S = s) for each stone size s,

where S is the size (large or small) of the stone, T the treatment, and R the recovery (all binary). How do we cope with this change in conclusions? Which treatment would you prefer if you had kidney stones? Does the treatment cause recovery? As described in the following section, the answers to these questions depend on the causal relationship between treatment, recovery, and the size of the kidney stone.
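The flip can be reproduced numerically. The counts below come from the kidney-stone study classically used to illustrate this paradox (Charig et al., 1986) and are included purely for illustration:

```python
# Recovery counts per treatment and stone size, as (recovered, total),
# from the classic kidney-stone study (Charig et al., 1986).
data = {
    "A": {"small": (81, 87),   "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(recovered, total):
    return recovered / total

# Treatment A wins within each stone-size subgroup...
for size in ("small", "large"):
    assert rate(*data["A"][size]) > rate(*data["B"][size])

# ...yet treatment B wins overall: Simpson's paradox.
overall_a = rate(sum(r for r, _ in data["A"].values()),
                 sum(n for _, n in data["A"].values()))
overall_b = rate(sum(r for r, _ in data["B"].values()),
                 sum(n for _, n in data["B"].values()))
assert overall_b > overall_a
```

The aggregation reverses the subgroup conclusion because stone size influences which treatment is assigned, which is exactly the confounding structure discussed next.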

Figure 1: Causal diagram for kidney stones.

2.2 Confounding

An intuitive explanation for this kidney stone example of Simpson's paradox is that large stones are more severe than small stones and are much more likely to be treated with treatment A, making treatment A look worse than treatment B overall. We assume that Figure 1 depicts the true underlying causal diagram of the kidney stone example. Confounding occurs because the size of the kidney stones influences both treatment and recovery. The size of the kidney stones is called a confounder. The term "confounding" originally meant "mixing" in English, which describes how the true causal effect of T on R is "mixed" with the spurious correlation between T and R induced by the fork T ← S → R (Pearl and Mackenzie, 2018). In other words, we will not be able to disentangle the true effect of T on R from the spurious effect if we do not have data on S. If we have measurements of S or can indirectly estimate it, it is easy to deconfound the true and spurious effects. To this end, we can adjust for S by averaging the effect of T on R in each subgroup of S (i.e., the different size groups in the case of kidney stones).

2.3 Do-Operator and Adjustment Criterion

From the viewpoint of causal inference, we can use the language of intervention, namely the do-operator, to describe when confounding happens. In fact, in the example of the kidney stones, what we are interested in is how the two treatments compare when we force all patients to take treatment A or treatment B, rather than which treatment has a higher recovery rate given only the observational patient records. Mathematically, we focus on the true effect P(R | do(T = t)) (the interventional distribution obtained when patients are forced to take treatment t) instead of the conditional P(R | T = t) (the observational distribution obtained when patients are observed to take treatment t). Therefore, as described previously, confounding can be naturally described as the discrepancy between P(R | do(T = t)) and P(R | T = t).

The do-operator can be executed in two common ways: by Randomized Controlled Trials (RCTs) (Fisher, 1935) and by adjustment formulas (i.e., the back-door and front-door criteria) (Pearl, 2009). RCTs are the gold standard but are sometimes limited by practical considerations (e.g., safety, laws, ethics, physical infeasibility). The back-door and front-door criteria require knowledge of the causal diagram, which means that causal assumptions must be provided in advance. According to the back-door criterion, in the kidney stone example we can immediately obtain

P(R | do(T = t)) = Σ_s P(R | T = t, S = s) P(S = s),
Beyond the two adjustment formulas, the do-calculus provides a syntactic method of deriving claims about interventions (Pearl, 2009). It consists of three rules that can be repeatedly applied to simplify the expression for an interventional distribution. With the do-calculus we can compute interventional probabilities from observational probabilities.
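The back-door adjustment is easy to compute once the relevant conditionals are known. In the sketch below, all probabilities are hypothetical, chosen so that the stone size S confounds treatment and recovery:

```python
# Back-door adjustment on a toy kidney-stone model (all numbers hypothetical).
# S: stone size (0 small, 1 large); T: treatment; R: recovery (all binary).
p_s = {0: 0.5, 1: 0.5}                 # P(S=s)
p_t1_given_s = {0: 0.2, 1: 0.8}        # P(T=1 | S=s): large stones get T=1 more often
p_r1 = {(0, 0): 0.90, (1, 0): 0.95,    # P(R=1 | T=t, S=s), keyed by (t, s):
        (0, 1): 0.60, (1, 1): 0.70}    # T=1 is better within every stratum

def p_t_given_s(t, s):
    return p_t1_given_s[s] if t == 1 else 1 - p_t1_given_s[s]

def p_r1_do(t):
    """P(R=1 | do(T=t)): back-door adjustment, averaging over P(S)."""
    return sum(p_r1[(t, s)] * p_s[s] for s in p_s)

def p_r1_obs(t):
    """P(R=1 | T=t): the confounded observational conditional."""
    p_t = sum(p_t_given_s(t, s) * p_s[s] for s in p_s)
    return sum(p_r1[(t, s)] * p_t_given_s(t, s) * p_s[s] / p_t for s in p_s)

# Observationally T=0 looks better, but intervening shows T=1 is better.
assert p_r1_obs(0) > p_r1_obs(1)
assert p_r1_do(1) > p_r1_do(0)
```

The discrepancy between `p_r1_obs` and `p_r1_do` is precisely the confounding described above; adjusting for S removes it.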

2.4 Proxy Variables for Confounding

If confounders can be measured, then they can be adjusted for through the methods discussed in Section 2.3. However, in most cases where confounders are hidden or unmeasured, without further assumptions, it is impossible to estimate the effect of the intervention on the outcome. A common practice is then to leverage observed proxy variables that contain information about the unobserved confounders (Angrist and Pischke, 2008; Maddala and Lahiri, 1992; Montgomery et al., 2000; Schölkopf et al., 2016). However, using proxy variables to correctly recover causal effects requires strict mathematical assumptions (Louizos et al., 2017; Edwards et al., 2015; Kuroki and Pearl, 2014; Miao et al., 2018; Pearl, 2012; Wooldridge, 2009) and, in practice, we do not know whether or not the available data meets those assumptions. Hence, we decide to follow Louizos et al. (2017) and, instead of using proxy variables, we estimate a latent-variable model in which we simultaneously discover the latent confounders and infer how they affect treatment and outcome.

3 Deconfounding Reinforcement Learning

In this section, we will formally introduce deconfounding reinforcement learning (DRL). Generally speaking, DRL consists of two steps: learning a deconfounding model shown in Figure 2 and optimizing a policy based on the learned deconfounding model. The main idea in step 1 is to simultaneously discover hidden confounders and infer causal effects through the estimation of a latent-variable model. More specifically, we first discuss the time-independent confounding assumption in Section 3.1 and then, based on this assumption, we formalize the deconfounding model in Section 3.2. Section 3.3 talks about the problem of identification in our model, which is a central issue in causal inference. After that, we present details about how to learn the proposed model via variational inference in Section 3.4. Step 2 is provided in Section 3.5 where we describe how to design the deconfounding actor-critic method. The proposed approach is straightforward to apply to other RL algorithms in the same manner. Finally, the training procedure for our deconfounding actor-critic method is presented in Section 3.6.

3.1 Causal Assumptions

Without loss of generality, we assume that there exists a common confounder in the model, denoted by u in Figure 2, which is time-independent across a number of episodes. This assumption is sufficiently general to apply to various RL tasks across domains. For example, in personalized medicine, socio-economic status can be a confounder affecting both the medication strategy a patient has access to and the patient's general health (Louizos et al., 2017). In this case, socio-economic status is time-independent for each patient during the course of treatment. Another example is in agriculture, where soil fertility may serve as a time-independent confounder affecting both the application of fertilizer and the yield of each plot of land (Pearl and Mackenzie, 2018). Finally, in the stock market example, government policy may also act as a time-independent confounder.

Figure 2: The model for deconfounding reinforcement learning. Grey nodes denote observed variables and white nodes represent unobserved variables. Red and blue arrows emphasize the observed variables affected by the confounder u and by the latent states z_t, respectively. The causal effects of interest are colored in green.

3.2 The Model

Given our causal assumptions, we first fit a generative model to a sequence of observational data: observations, actions, and rewards, where actions and rewards are confounded by one or several unknown factors, as shown in Figure 2. Formally, let x_{1:T}, a_{1:T}, r_{1:T}, and z_{1:T} be the sequences of observations, actions, rewards, and corresponding latent states, respectively. As mentioned previously, the confounder is denoted by u, and it is worth noting that u may stand for more than one confounder, in which case the multiple confounders are treated as a whole represented by u (i.e., the confounder u can be a vector). We assume that u ~ p(u), z_1 ~ p(z_1), x_t ~ p(x_t | z_t), a_t ~ p(a_t | z_t, u), r_t ~ p(r_t | z_t, a_t, u), and z_{t+1} ~ p(z_{t+1} | z_t, a_t, u), where t = 1, …, T. The generative model for DRL is then given by the product of these factors over time, where we have parametrized each probability distribution as a Gaussian whose mean and variance are given by nonlinear functions represented by neural networks. Note that, in our case, Equation (4) (the action model) is not necessary when learning the model, because the data used in our experiments are generated from a policy depending only on the confounder (see Section 4.1); that is, a_t does not depend on z_t in our data. In addition, in this case z_t is not viewed as a confounder of a_t and r_t, for z_t does not influence a_t. This also provides one reason why we do not need to adjust for z_t. However, in some other cases, such as when the model is applied to medical data, physicians' treatment strategies definitely contain valuable information about z_t and a_t (i.e., z_t does influence a_t), and therefore the arrow between them is necessary when learning the model.
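As a concrete illustration, the generative process of Figure 2 can be sketched with a toy linear-Gaussian stand-in for the neural-network conditionals. All coefficients, dimensions, and the confounded behaviour policy below are assumptions, not the paper's actual parameterization:

```python
import random

# Toy linear-Gaussian sketch of the generative model in Figure 2:
# u is the time-independent confounder, z_t the latent states,
# x_t the observations, a_t the actions, r_t the rewards.
def rollout(T=5):
    u = random.gauss(0.0, 1.0)                   # u ~ p(u)
    z = random.gauss(0.0, 1.0)                   # z_1 ~ p(z_1)
    traj = []
    for _ in range(T):
        x = random.gauss(z, 0.1)                 # x_t ~ p(x_t | z_t)
        a = random.gauss(0.8 * u, 0.1)           # behaviour policy depends on u only
        r = random.gauss(z + a - 0.5 * u, 0.1)   # r_t ~ p(r_t | z_t, a_t, u): confounded
        traj.append((x, a, r))
        z = random.gauss(0.7 * z + 0.3 * a + 0.2 * u, 0.1)  # z_{t+1} ~ p(z_{t+1} | z_t, a_t, u)
    return traj

random.seed(0)
episode = rollout()
assert len(episode) == 5
```

Because a_t depends on u and r_t also depends on u in such data, naively regressing rewards on observations and actions conflates the effect of the action with the effect of the confounder; this is the bias the deconfounding model is meant to remove.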

3.3 Identification of Causal Effect

The key component of our method that allows us to address problems with confounders is the computation of the reward according to the model from Figure 2. To be more precise, assuming that an agent standing at state z_t performs action a_t, we do not use the conditional p(r_t | z_t, a_t) as the predictor for the reward, as traditional RL methods do. Instead, we use the do-operator described in Section 2.3 to compute the reward p(r_t | z_t, do(a_t)). (To give some intuition, take the example of kidney stones. One prefers treatment B when considering the overall probability of recovery. By contrast, one chooses treatment A when considering the recovery rate conditioned on each possible kidney stone size. The optimal treatment is treatment A in this example. The do-operator in Equation (2) allows us to obtain the correct solution in this type of problem with confounders. By contrast, traditional RL methods will fail in such settings since they consider only conditional probabilities, e.g., the overall probability of recovery, instead of interventional probabilities, as the do-operator does.)


where Equation (8) is obtained by applying the rules of do-calculus to the causal graph in Figure 2 (Pearl, 2009). We can also use the back-door criterion to directly obtain Equation (8). Equation (8) shows that p(r_t | z_t, do(a_t)) can be identified from the joint distribution over the observed variables, the latent states, and the confounder, because, through Equation (8), the interventional probability is converted to observational probabilities with respect to u. In other words, if we can recover this joint distribution, then we can also recover p(r_t | z_t, do(a_t)).

Now the problem is reduced to whether or not we can estimate the joint distribution from observations of x_{1:T}, a_{1:T}, and r_{1:T}. Fortunately, it is possible, because a number of works have shown that one can use the knowledge inferred from the joint distribution between proxy variables and confounders to adjust for the hidden confounders (Louizos et al., 2017; Edwards et al., 2015; Kuroki and Pearl, 2014; Miao et al., 2018; Pearl, 2012; Wooldridge, 2009). Here, we focus only on one possible case, presented in Figure 1 of Louizos et al. (2017) (see Appendix D), because their result can be directly used to show that the joint distribution can be approximately recovered solely from the observations. The main idea is that our model can be factorized into two types of 4-tuple components, each of which can be handled using the same method as in Louizos et al. (2017). More precisely, in our model, at each time step the first type of 4-tuple has exactly the structure shown in Figure 1 of Louizos et al. (2017). Since it has been proved in Louizos et al. (2017) that the joint distribution over such a 4-tuple can be recovered from observations of its proxies, it can also be approximately recovered from our observations alone. Likewise, the joint distribution over the other type of 4-tuple can be recovered in the same manner (see Appendix E). Applying this rule repeatedly to the sequential data, we are finally able to approximately recover the full joint distribution solely from the observations. Hence, in the present paper we may reasonably assume that the joint distribution can be approximately recovered solely from the observations x_{1:T}, a_{1:T}, and r_{1:T}.

We estimate the joint distribution using Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013), which can represent a very large class of distributions with latent variables and are easily implemented by solving an optimization problem (Tran et al., 2015; Louizos et al., 2017). However, VAEs do not guarantee that the true model can be identified through learning (Louizos et al., 2017). For example, it might occur that different model parameters fit the observed data equally well as long as the values of the latent confounders are adjusted accordingly (D'Amour, 2018). Despite this, identification is possible in both our model and the one described by Louizos et al. (2017) under specific conditions (Louizos et al., 2017; Kuroki and Pearl, 2014; Miao et al., 2018; Pearl, 2012; Allman et al., 2009). In addition, in our model, the confounder u and the latent states z_t have the observations as proxy variables, and the z_t can be approximately viewed as hidden states in a hidden Markov model. In this case, Allman et al. (2009) show that such a model can be identified if the latent confounders u and z_t are assumed to be categorical. However, in practice there is no guarantee for this to be the case. Hence, we prefer to use VAEs, which make substantially weaker assumptions about the data-generating process and the structure of the hidden confounders (Louizos et al., 2017). Despite the lack of general identifiability results for VAEs, we empirically show that our approach is useful for learning a better policy in the presence of confounding.

In the model from Figure 2 there are two types of confounders: the time-independent confounder u and the time-dependent confounders z_t, which play different roles in the model. We refer to the former as a "nuisance confounder" and to the latter as "advantage confounders". The nuisance confounder will negatively affect the whole course of treatment and, therefore, should be adjusted for. In the example of kidney stones, the existence of the confounder (i.e., the size of the stones) will lead to a wrong treatment if not adjusted for. By contrast, the advantage confounders act as state variables and, in principle, they are not supposed to be adjusted for, because in RL we aim to take optimal actions conditioned on the current state. (Another reason why we do not need to adjust for z_t has been discussed in Section 3.2.)

3.4 Learning

The nonlinear functions parametrized by neural networks make inference intractable. Because of this, we learn the parameters of the model by using variational inference together with an inference model, a neural network that approximates the intractable posterior distributions (Rezende et al., 2014; Kingma and Welling, 2013; Krishnan et al., 2015). We now review how to learn a simple latent-variable model using variational inference. In this simple model, x stands for the observed variables and z for the latent variables. By using the variational principle, we introduce an approximate posterior distribution q(z | x) to obtain the following lower bound on the model's marginal likelihood:

log p(x) ≥ E_{q(z | x)}[log p(x | z)] − KL(q(z | x) ‖ p(z)),    (9)

where we have used Jensen's inequality, KL denotes the Kullback-Leibler divergence between two distributions, and the parameters of the inference model q(z | x) are learned jointly with those of the generative model.
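The bound in Equation (9) can be checked numerically on a toy one-dimensional model. Below, p(z) = N(0, 1) and p(x | z) = N(z, 1), with a Gaussian approximate posterior q(z | x) = N(mu, s^2) whose parameters stand in for the inference network's output; the specific numbers are illustrative assumptions:

```python
import math, random

# Monte Carlo estimate of the variational lower bound (ELBO) for a toy
# 1-D model: p(z) = N(0, 1), p(x | z) = N(z, 1), q(z | x) = N(mu, s^2).
def log_normal(v, mean, std):
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo(x, mu, s, n_samples=1000):
    recon = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        z = mu + s * eps                                 # reparametrization trick
        recon += log_normal(x, z, 1.0)                   # E_q[log p(x | z)]
    kl = 0.5 * (s * s + mu * mu - 1.0) - math.log(s)     # KL(q(z|x) || p(z)), closed form
    return recon / n_samples - kl

# For this toy model the marginal is known, p(x) = N(x; 0, 2), so we can
# verify that the ELBO really lower-bounds log p(x).
random.seed(0)
x = 0.7
log_px = log_normal(x, 0.0, math.sqrt(2.0))
assert elbo(x, mu=0.0, s=1.0) < log_px
```

The bound becomes tight when q(z | x) matches the exact posterior, here N(x/2, 1/2); the full model applies the same principle with z_{1:T} and u as latent variables.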

3.4.1 Variational Lower Bound

By directly applying the lower bound in Equation (9) to our model, we obtain a lower bound on the marginal likelihood of the observed sequences x_{1:T}, a_{1:T}, and r_{1:T}, with z_{1:T} and u as latent variables. By using the Markov property of our model, we can factorize the full joint distribution as the product of the prior over u, the prior over z_1, and the per-step factors for x_t, a_t, r_t, and z_{t+1} given their parents in Figure 2. In addition, for simplicity, we also assume a similar factorization in the posterior approximation for z_{1:T} and u:

q(z_{1:T}, u | x_{1:T}, a_{1:T}, r_{1:T}) = q(u | x_{1:T}, a_{1:T}, r_{1:T}) ∏_t q(z_t | z_{t−1}, x_{t:T}, a_{t:T}, r_{t:T}, u).    (12)

Combining these factorizations with the bound yields the final variational objective, where we omit the subscripts of the model and inference parameters. A more detailed derivation can be found in Appendix A. The resulting objective is differentiable with respect to the model parameters and the inference parameters and, by using the reparametrization trick (Kingma and Welling, 2013), we can directly apply backpropagation to optimize it.

3.4.2 Inference Model

From the factorization in Equation (12), we can see that there are two types of inference models: the posterior over the confounder, q(u | x_{1:T}, a_{1:T}, r_{1:T}), and the posterior over the latent states, q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}, u). Similar to the generative model in Section 3.2, we parametrize both of them as Gaussians. In fact, as shown in Equation (12), the posterior over z_{1:T} can be factorized as the product of per-step terms for t = 1, …, T. By using the Markov property of our model, each per-step term can be further simplified: it depends on z_{t−1} and all the current and future observed data, while the conditional independence above means that z_{t−1} summarizes all the historical data. Therefore, it is natural to compute each per-step posterior based on the whole sequence of data. This can be done using recurrent neural networks (RNNs). Inspired by Krishnan et al. (2015, 2017), we choose a bi-directional LSTM (Zaremba and Sutskever, 2014) to parameterize the mean and variance in Equation (15). Since Equation (14) has the same structure as Equation (15), its mean and variance are parameterized by a bi-directional LSTM as well. More details about the architecture can be found in Appendix C.
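The role of the bi-directional encoder can be illustrated with a minimal two-pass RNN in plain Python. A tanh cell with scalar weights stands in for the LSTM, and the weights are arbitrary assumptions:

```python
import math

# Minimal bidirectional RNN sketch of the inference network: the code for
# step t combines a forward pass over x_{1:t} and a backward pass over
# x_{t:T}, so every per-step posterior sees the whole sequence.
def rnn_pass(xs, w_in=0.5, w_rec=0.8):
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)   # simple tanh cell in place of an LSTM
        out.append(h)
    return out

def birnn(xs):
    fwd = rnn_pass(xs)                 # h_t^fwd summarizes x_{1:t}
    bwd = rnn_pass(xs[::-1])[::-1]     # h_t^bwd summarizes x_{t:T}
    return list(zip(fwd, bwd))         # concatenated code per time step

codes = birnn([0.1, -0.4, 0.9, 0.3])
assert len(codes) == 4
```

In the actual model, a small network maps each concatenated code to the mean and variance of the corresponding Gaussian posterior.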

Note that, to generate data from the model at any time step t, we need to know the action and reward at step t before inferring the distribution over z_{t+1} conditioned on the observed sequence. Hence, we need to introduce two auxiliary distributions, one for actions and one for rewards, to perform counterfactual reasoning. This corresponds to the problem of predicting a sequence given actions unseen in the training set. The auxiliary distributions in Equations (18) and (19) are also parameterized by neural networks. To estimate the parameters of these inference models, we add extra terms to the variational lower bound.
3.5 Deconfounding RL Algorithms

We now have all the building blocks for our DRL algorithm. Once our model is learned from the observational data, it can be directly used as a dynamic environment like those in OpenAI Gym (Brockman et al., 2016). We can exploit the learned model to generate rollouts for policy evaluation and optimization. In practice, Equation (8) is approximated using Monte Carlo sampling:

p(r_t | z_t, do(a_t)) ≈ (1/N) Σ_{i=1}^{N} p(r_t | z_t, a_t, u_i),  u_i ~ p(u),    (21)

where N is the number of samples drawn from the prior p(u). Given observational data, we instead sample from the approximate posterior over u, which we compute using the inference network presented in Section 3.4.2.
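A sketch of this Monte Carlo approximation, with a hypothetical stand-in for the learned conditional reward model (the linear form and its coefficients are assumptions):

```python
import random

# Monte Carlo approximation of the deconfounded reward:
# E[r | z, do(a)] ≈ (1/N) Σ_i E[r | z, a, u_i],  u_i ~ p(u)
# (given observational data, u_i would instead come from the posterior over u).
def reward_model(z, a, u):
    return z + a - 0.5 * u                 # hypothetical E[r | z, a, u]

def deconfounded_reward(z, a, sample_u, n=5000):
    return sum(reward_model(z, a, sample_u()) for _ in range(n)) / n

random.seed(0)
prior_u = lambda: random.gauss(0.0, 1.0)
r_hat = deconfounded_reward(z=0.2, a=1.0, sample_u=prior_u)
# With u ~ N(0, 1) averaged out, the estimate approaches z + a = 1.2.
assert abs(r_hat - 1.2) < 0.1
```

Averaging over samples of u, rather than conditioning on the action as observed, is what removes the confounder's contribution from the reward estimate.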

By using this deconfounding reward function, it is straightforward to extend traditional RL algorithms to their corresponding deconfounding version. In this paper, we select and extend one representative RL algorithm: the Actor-Critic method (Sutton et al., 1998). Nevertheless, our methodology can be used to extend other algorithms as well in a straightforward manner.

Deconfounding Actor-Critic Methods

The actor-critic method is a policy-based RL method that directly parameterizes a policy function π(a | s; θ). The goal is to reduce variance in the estimate of the policy gradient by subtracting a learned function of the state, known as a baseline, from the return. The learned value function V(s) is commonly used as the baseline. Estimating the return by the one-step return, we can write the gradient of the actor-critic loss at time step t as

∇_θ log π(a_t | s_t; θ) A(s_t, a_t),    (22)

where A(s_t, a_t) is an estimate of the advantage of action a_t in state s_t. In practice, A(s_t, a_t) is usually computed from the one-step return, that is, A(s_t, a_t) = r_t + γ V(s_{t+1}) − V(s_t). The crucial difference in deconfounding actor-critic methods is to use the deconfounded reward given by Equation (21), as opposed to the conditional reward used by vanilla actor-critic methods.
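A minimal runnable sketch of this update rule on a toy one-state, two-action task (the task, learning rates, and tabular parameterization are illustrative assumptions; the deconfounding variant would substitute the reward from Equation (21) for the sampled reward):

```python
import math, random

# One-step advantage actor-critic on a toy one-state, two-action task.
theta = [0.0, 0.0]            # policy logits, one per action
v = 0.0                       # value baseline for the single state
alpha_pi, alpha_v, gamma = 0.1, 0.1, 0.99

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def step():
    global v
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if a == 0 else 0.0               # action 0 is the rewarded action
    advantage = r + gamma * v - v            # one-step return minus baseline
    for i in range(2):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha_pi * advantage * grad_log_pi
    v += alpha_v * advantage                 # TD update of the baseline

random.seed(0)
for _ in range(2000):
    step()
assert softmax(theta)[0] > 0.8               # the policy learns to prefer action 0
```

Only the reward signal feeding `advantage` changes in the deconfounding variant; the gradient computation itself is untouched.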

3.6 Training

As mentioned previously, DRL consists of two steps: learning a deconfounding model and optimizing a policy based on the learned model. In step 1, we learn the model by optimizing the objective given by Equation (20). Once the deconfounding model is learned, we have an estimate of the state transition function, as given by the model, and can also calculate the deconfounding reward function according to Equation (21). In step 2, we treat the learned deconfounding model as an RL environment, like CartPole in OpenAI Gym, and generate trajectories/rollouts using the estimated state transition function and the deconfounding reward function. These trajectories/rollouts are then used to train the policy by following the gradient given by Equation (22).

3.7 Implementation Details

We used TensorFlow (Abadi et al., 2016) for the implementation of our model and the proposed DRL algorithm. Optimization was done using Adam (Kingma and Ba, 2014). Unless stated otherwise, the settings of all hyperparameters and network architectures can be found in the appendix.

To assess the quality of the learned model, we performed two types of tasks: reconstruction and counterfactual reasoning. The reconstructions were performed by first feeding an input sequence into the learned inference network, then sampling from the resulting posterior distribution over the latent states according to Equation (15), and finally feeding those samples into the generative network described in Equation (3) to reconstruct the original observed sequence. Counterfactual reasoning, that is, the prediction of a sequence given actions that we have not seen during training, was executed in four steps:

  1. Given the new action, we estimate the corresponding auxiliary quantities using Equations (18) and (19).

  2. Once we have the observation, action, and reward at step t, we estimate z_t using Equation (15).

  3. Given the estimated z_t and another uniformly randomly selected action, we can directly compute z_{t+1} from Equation (6).

  4. The final step is to reconstruct x_{t+1} from z_{t+1} according to Equation (3).

By repeating the last two steps, we can generate an entire counterfactual sequence of data.
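A runnable toy version of the repeated steps 3 and 4, using a linear-Gaussian stand-in for the learned model (the coefficients are assumptions, and the inferred confounder u and initial latent state z are taken as given):

```python
import random

# Toy counterfactual rollout: repeatedly apply the (assumed) transition
# (step 3) and the observation model (step 4) under a new action policy.
def counterfactual_rollout(u, z, horizon=5, policy=lambda: random.uniform(-1, 1)):
    xs = []
    for _ in range(horizon):
        a = policy()                               # an action unseen in training
        z = 0.7 * z + 0.3 * a + 0.2 * u            # step 3: next latent state
        xs.append(z + random.gauss(0.0, 0.1))      # step 4: reconstruct the observation
    return xs

random.seed(0)
xs = counterfactual_rollout(u=0.5, z=0.0)
assert len(xs) == 5
```

Because u is held fixed while the actions change, the rollout answers "what would have been observed for this same episode under different actions".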

We may also be interested in estimating the approximate posterior over the confounder u from observed data. We consider two possible scenarios. In the easy one, we are given the full observed data (observations, actions, and rewards) and estimate the posterior by using Equation (14). The more challenging scenario involves estimating the posterior from the observations alone, with no action or reward data. In this case, we follow the same steps used in the counterfactual reasoning task: we first estimate the actions and rewards for each time step using Equations (18) and (19), and then estimate the posterior through Equation (14).

4 Experimental Results

The evaluation of any method that deals with confounders is always challenging due to a lack of real-world benchmark problems with known ground truth. Furthermore, little work has been done before on the task of deconfounding RL. All this creates difficulties in the evaluation of our algorithms and motivates us to develop several new benchmark problems by revising the MNIST dataset (LeCun et al., 1998) and by revising two environments in OpenAI Gym (Brockman et al., 2016), CartPole and Pendulum.

4.1 New Confounding Benchmarks

We now describe three new confounding benchmark problems, all of them including a single binary confounder. The procedure to create these benchmarks is inspired by Krishnan et al. (2015), who synthesized a dataset mimicking healthcare data under realistic conditions (e.g., noisy laboratory measurements, surgeries and drugs affected by patient age and gender, etc.). The data used is from either popular RL environments in OpenAI Gym such as Pendulum and CartPole or popular machine learning datasets such as MNIST.

Confounding Pendulum and CartPole

We revised two environments in OpenAI Gym (Brockman et al., 2016): Pendulum and CartPole. To obtain a confounding version of Pendulum, we selected different screen images of Pendulum and created a synthetic problem in which actions are joint efforts with values between −2 and 2. We added bit-flip noise to each image and then used a random policy, confounded with a binary factor, to select the action applied at each time step. This is repeated multiple times to produce a large number of 5-step sequences of noisy images. In each generated sequence, one block of three consecutive squares is superimposed on the images, with its top-left corner at a random starting location. These squares are intended to be analogous to seasonal flu or other ailments that a patient could exhibit, which are independent of the actions taken and which last several time steps (Krishnan et al., 2015). The goal is to show that our model can learn long-range patterns, since these play an important role in medical applications. We treat the generated sequences of images as the observations. The training, validation, and test sets comprise disjoint sets of sequences of length five.

The key characteristic of confounding Pendulum is the relationship between confounder, actions, and rewards. For simplicity of notation, we denote the confounder by $c$, the action by $a$, and the reward by $r$. In this case, $c$ is a binary variable mimicking socio-economic status (i.e., the rich and the poor). The range of actions is grouped into two categories, $a_1$ and $a_2$, representing different treatments (note that $a_1$ is a better treatment than $a_2$ in our setting described in Appendix F). The treatment selection depends on $c$, that is, $c$ determines the probabilities of choosing $a_1$ and $a_2$. The reward is defined as

$$r = r_o(a) + r_e(c, a),$$

where $r_o(a)$ is the original reward in Pendulum as a function of $a$ (in fact, $r_o$ is a function of both $a$ and a state; we mention only $a$ to emphasize that the confounder affects the action), and $r_e(c, a)$ is the extra reward as a function of both $c$ and $a$, defined by a two-component Gaussian mixture:

$$r_e(c, a) \sim \pi\,\mathcal{N}(\mu_1, \sigma_1^2) + (1 - \pi)\,\mathcal{N}(\mu_2, \sigma_2^2),$$

with the means and variances fixed and the mixing coefficient $\pi$ determined by both $c$ and $a$. More details are available in Appendix F.

Obviously, in the definition above, $r$ depends on both $c$ and $a$. Furthermore, $c$ has an influence on $r$ through both the action selection and the extra reward $r_e$, meaning that $r$ contains information about $c$ and, therefore, can be viewed as a proxy variable for the confounder. We assume that the influence between the confounder, action, and reward is stochastic while generating the data. The reason for this is that, in practice, we do not have an oracle telling us which treatment is better, so we may sometimes make wrong decisions. For example, take the case of kidney stones: even though treatment $a_1$ is better than treatment $a_2$, some patients in each category still choose treatment $a_2$ with positive probability. All the details about the data generation process can be found in Appendix F, where a straightforward analogy is provided as well.
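The reward construction above can be sketched in a few lines. The component means, standard deviations, and the mixing rule below are illustrative assumptions, not the values from the paper's Appendix F; the point is only that the mixing coefficient depends on both the confounder and the action, so the reward carries information about the confounder.

```python
import numpy as np

def extra_reward(c, a, rng=None):
    """Sketch of the two-component Gaussian-mixture extra reward r_e(c, a).
    Means, stds, and the mixing rule are illustrative assumptions."""
    rng = np.random.default_rng(rng)
    mu = np.array([-1.0, 1.0])     # fixed component means (assumed)
    sigma = np.array([0.3, 0.3])   # fixed component stds (assumed)
    # Mixing coefficient pi determined by both confounder c and action group a.
    pi = 0.8 if c == a else 0.2    # assumed rule: matching c and a favours component 2
    k = int(rng.random() < pi)     # k = 1 selects the second component w.p. pi
    return rng.normal(mu[k], sigma[k])

def reward(r_orig, c, a, rng=None):
    """Total reward: the original environment reward plus the confounded extra term."""
    return r_orig + extra_reward(c, a, rng)
```

Because the mixture weighting differs with `c`, averaging many samples of `extra_reward` for matching versus mismatching `(c, a)` pairs yields clearly different means, which is exactly what makes the reward a proxy for the confounder.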

The confounding CartPole problem is implemented in the exact same manner, except that the action is now binary and can be naturally divided into two categories. Accordingly, the binary confounder determines the probabilities of choosing between the two actions, and these probabilities are set to the same values as those in the confounding Pendulum given in Appendix F.

Confounding MNIST

We follow the same procedure to obtain a confounding MNIST problem. However, the definitions of the action and the original reward term are now different. Similar to the Healing MNIST dataset (Krishnan et al., 2015), the actions encode rotations of digit images, and they are divided into two categories according to the confounder $c$. The original reward term is defined as the negative of the angular distance between the upright position and the position to which the digit rotates. For example, if the digit rotates to the position of 3 o'clock or 9 o'clock, the reward is $-90$ in both cases.
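The original reward term for confounding MNIST can be written out directly. This is a minimal sketch with an assumed function name and a degree-valued input; it simply returns the negative angular distance to the upright position, taking the shorter way around the circle.

```python
def rotation_reward(angle_deg):
    """Sketch of the original reward for confounding MNIST: the negative
    angular distance (in degrees) between the digit's rotated position and
    the upright position."""
    a = angle_deg % 360
    return -min(a, 360 - a)  # shorter arc to upright, negated
```

Consistent with the example in the text, a rotation to 3 o'clock (90 degrees) or 9 o'clock (270 degrees) both yield a reward of -90.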

Figure 3: Reconstruction and counterfactual reasoning on the confounding Pendulum dataset. Top row: results from the model without the confounder. Second row: results from the model with the confounder. The last two rows are zoomed-in samples selected from the same positions in the top row (dashed boxes) and the second row (solid boxes), respectively. These results show that the model taking the confounder into account performs better in every task, producing less blurry images.
Figure 4: Reconstruction and counterfactual reasoning on the confounding MNIST dataset. Top row: results from the model without the confounder. Bottom row: results from the model with the confounder. Note that, in the counterfactual reasoning task, the second and fourth sequences produced by the model without the confounder, boxed in yellow, have non-consecutive white squares and contain more blurry images. This does not really make sense because only consecutive patterns appear in the training set. By contrast, such non-consecutive white squares do not occur in the samples from the model with the confounder.


Figure 5: (a) Data points sampled from the approximate posterior distribution of the confounder on the confounding MNIST dataset. (b) Data points sampled from the approximate posterior distribution of the confounder on the confounding CartPole dataset. We can see that, given a binary confounder, our model still identifies two clear clusters in the data even though the assumed prior over the confounder is a factorized standard Gaussian distribution.

4.2 Performance Analysis of the Deconfounding Model

We assess the performance of the deconfounding model from Figure 2 and compare it with a similar alternative model that does not include the confounder. We train the deconfounding model by optimizing Equation (20), but train the alternative model using a different loss function, which excludes the confounder and whose full derivation can be found in Appendix B. Both models are trained separately in a minibatch manner on the training set of the confounding dataset (image sequences of length five). Afterwards, following the steps described in Section 3.7, we use each trained model to perform the reconstruction task on the training set, and both the reconstruction and counterfactual reasoning tasks on the test set (also image sequences of length five).

Figure 3 presents a comparison of the two models in terms of reconstruction and counterfactual reasoning on confounding Pendulum. The second row is produced by the deconfounding model, whilst the top row comes from the model without the confounder. The results generated by the deconfounding model are superior to those produced by the model that does not take the confounder into account. To be more specific, as shown in the zoomed-in samples in the bottom rows, the model without the confounder generates more blurry images: without modelling the confounder, it is forced to average over multiple latent states, resulting in blurrier samples.

We obtain similar results for the samples produced by the two models on the confounding MNIST dataset, as shown in Figure 4. Looking closely at the squares on the generated digit samples (inside the yellow boxes), we observe that the model without the confounder generates non-consecutive white squares in the counterfactual reasoning task, which does not really make sense because only consecutive patterns appear in the training set. Its generated images are also more blurry. By contrast, this does not occur in the samples from the deconfounding model, showing that our model is able to cope with long-range patterns and describe the data better.

Figure 5 visualizes approximate posterior samples of the learned confounder. We can see that, although the prior distribution of the confounder is assumed to be a factorized standard Gaussian distribution, the model still identifies two clear clusters in the data because the ground-truth confounder is a binary variable. This demonstrates that our model can learn confounders even if the assumed prior is not accurate.

4.3 Comparison of RL Algorithms

In this section, we evaluate the proposed deconfounding actor-critic (AC) method by comparing it with its vanilla counterpart on confounding Pendulum. In the vanilla AC method, given a learned model, we optimize the policy by calculating the gradient presented in Equation (22) on the basis of trajectories/rollouts generated through the model. Equation (22) involves two functions, the actor and the critic, both of which are represented using neural networks whose hyperparameters can be found in Appendix G. It is worth noting that, in this vanilla case, each reward is sampled from the conditional distribution of the reward given the state and action. By contrast, the proposed deconfounding AC method optimizes the policy using the same gradient from Equation (22), but its trajectories/rollouts are generated by the deconfounding model, in which each reward is obtained from the interventional distribution computed according to Equation (21). For completeness, we also compare with the direct AC method, in which the AC method is trained directly on the training data instead of on trajectories/rollouts.
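The difference between the conditional and interventional reward distributions is the crux of the comparison. The sketch below contrasts the two for a binary confounder; the function name, the tabular distributions, and the numbers in the usage example are illustrative, and the adjustment shown is the standard backdoor-style reweighting that motivates replacing $P(c \mid a)$ with $P(c)$, in the spirit of Equation (21).

```python
import numpy as np

def expected_reward(p_c, p_c_given_a, r_given_ac, a, interventional):
    """Sketch contrasting conditional vs. interventional expected reward
    for a binary confounder c. Inputs are illustrative tabular quantities:
      p_c[c]           -- marginal P(c)
      p_c_given_a[a,c] -- P(c | a), encoding how c influenced action choice
      r_given_ac[a,c]  -- expected reward E[r | a, c]
    The interventional case weights by P(c) instead of P(c | a)."""
    weights = p_c if interventional else p_c_given_a[a]
    return float(np.dot(weights, r_given_ac[a]))
```

With suitably chosen tables this reproduces a Simpson-style reversal: an action can look worse under the conditional (observational) weighting yet be better under the interventional one, which is why a vanilla AC agent trained on conditional rewards can be misled while the deconfounding variant is not.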

Figure 6: Comparison of the vanilla (Alt_AC), direct (Direct_AC), and deconfounding (Decon_AC) Actor-Critic methods on the confounding Pendulum problem. (a) Total reward over episodes in the training phase. (b) Total reward over episodes in the testing phase. (c) Probability of the optimal action over episodes, corresponding to (b).

In the training phase, we run the vanilla AC and deconfounding AC algorithms over a number of episodes with a fixed number of time steps each. For a fair comparison, we also run the direct AC algorithm over the same number of episodes, each consisting of state-action-reward tuples randomly selected from the training data. In order to reduce non-stationarity and to decorrelate updates, the generated data is stored in an experience replay memory and then randomly sampled in a batch manner (Mnih et al., 2013; Riedmiller, 2005; Schulman et al., 2015; Van Hasselt et al., 2016). We sum all the rewards produced during the rollouts in each episode and further average these sums over a window of episodes to obtain a smoother curve. Figure 6(a) shows that our deconfounding AC algorithm performs significantly better than the vanilla AC algorithm in the confounded environment. The direct AC algorithm is not included here because it does not generate rollouts.
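The replay-and-smoothing machinery described above is standard; a minimal sketch (class and function names, capacity, and window size are illustrative, not the paper's code) looks like this:

```python
import random
from collections import deque

class ReplayMemory:
    """Minimal experience-replay sketch: transitions are stored up to a fixed
    capacity and sampled uniformly at random to decorrelate updates, as in
    Mnih et al. (2013)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def smooth(episode_rewards, window):
    """Average the summed per-episode rewards over a trailing window,
    as done to obtain the smoother curves in Figure 6."""
    return [sum(episode_rewards[max(0, i - window + 1): i + 1]) /
            (i - max(0, i - window + 1) + 1)
            for i in range(len(episode_rewards))]
```

Uniform sampling from the buffer is what breaks the temporal correlation between consecutive rollout transitions during gradient updates.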

In the testing phase, we first randomly select samples from the test set, each starting a new episode, and then use the learned policies to generate trajectories over the same number of time steps as during training. From the resulting episodes, we plot the total reward obtained by each method, as shown in Figure 6(b), and also the percentage of times that the optimal action is chosen in each episode, as shown in Figure 6(c). Figure 6(b) shows that our deconfounding AC obtains on average a much higher reward at test time than the baselines. Figure 6(c) further shows that our deconfounding AC is much more likely to choose the optimal action at each time step, whilst the vanilla AC and the direct AC make a wrong decision more than half of the time.
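The two test-phase summaries can be computed with a short helper. This is an illustrative sketch: the function name and the trajectory representation (a list of action-reward pairs per episode) are assumptions, not the paper's interface.

```python
def evaluate(trajectories, optimal_action):
    """Sketch of the test-phase metrics: the average total reward per episode
    (Figure 6(b)) and the fraction of steps at which the optimal action was
    chosen (Figure 6(c)). Each trajectory is a list of (action, reward) pairs."""
    totals = [sum(r for _, r in traj) for traj in trajectories]
    n_steps = sum(len(traj) for traj in trajectories)
    n_opt = sum(a == optimal_action for traj in trajectories for a, _ in traj)
    return sum(totals) / len(totals), n_opt / n_steps
```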

5 Related Work

Krishnan et al. (2015, 2017) used deep neural networks to construct nonlinear state space models and leveraged a structured variational approximation parameterized by recurrent neural networks to mimic the posterior distribution. Levine (2018) reformulated RL and control problems as probabilistic inference, which makes it possible to bring to bear a large pool of approximate inference methods and to flexibly extend the models. Raghu et al. (2017a, b) exploited continuous state-space models and deep RL to obtain improved treatment policies for septic patients from observational data. Gottesman et al. (2018) discussed problems that arise when evaluating RL algorithms in observational healthcare settings. However, none of the works mentioned above takes confounders into account.

Louizos et al. (2017) attempted to learn individual-level causal effects from observational data in the non-temporal setting. They also used a variational auto-encoder to estimate the unknown confounder given a causal graph. Paxton et al. (2013) developed predictive models based on electronic medical records without using causal inference. Saria et al. (2010) proposed a nonparametric Bayesian method to analyze clinical temporal data. Soleimani et al. (2017) represented treatment response curves using linear time-invariant dynamical systems, providing a flexible approach to modeling response over time. Although the latter two works model sequential data, neither considers RL or causal inference.

Bareinboim et al. (2015) considered the problem of bandits with unobserved confounders. Sen et al. (2016) and Ramoly et al. (2017) further studied contextual bandits with latent confounders. Forney et al. (2017) circumvented some problems caused by unobserved confounders in multi-armed bandits by using counterfactual-based decision making. Zhang and Bareinboim (2017) leveraged causal inference to tackle the problem of transferring knowledge across bandit agents. Unlike our method, all these works are restricted to bandits, which correspond to a simplified RL setting without state transitions.

Last but not least, we want to mention some connections between our method and partially observable Markov decision processes (POMDPs). To the best of our knowledge, existing work on POMDPs has not yet considered problems with confounders. Setting the confounding aside, in POMDPs the observation provides only partial information about the actual state, so the agent does not necessarily know which state it is in. By contrast, for convenience in practice, we simplify this setting and assume that the observation provides all the information about the actual state, but with some noise; hence we only need to denoise the observation to obtain the actual state. In this sense, a work more closely related to our model is probably the world model (Ha and Schmidhuber, 2018), because both models use variational inference to estimate the actual state from the observation. Treating our model as a POMDP with confounding bias would make it more complicated, but this is worth exploring in the future.

To the best of our knowledge, this is the first attempt to build a bridge between confounding and the full RL problem with observational data.

6 Conclusion and Future Work

We have introduced deconfounding reinforcement learning (DRL), a general method for addressing reinforcement learning problems with observational data in which hidden confounders affect observed rewards and actions. We have used DRL to obtain deconfounding variants of actor-critic methods and shown that these new variants perform better than the original vanilla algorithms on new confounding benchmark problems. In the future, we will collaborate with hospitals and apply our approach to real-world medical datasets. We also hope that our work will stimulate further investigation of connections between causal inference and RL.


Appendix A Variational Lower Bound for