1 Introduction
In recent years, much attention has been devoted to the development of reinforcement learning (RL) algorithms with the goal of improving treatment policies in healthcare (Gottesman et al., 2018). RL algorithms have been proposed to infer better decision-making strategies for mechanical ventilation (Prasad et al., 2017), sepsis management (Raghu et al., 2017a, b), and treatment of schizophrenia (Shortreed et al., 2011). In healthcare, a common practice is to focus on the observational setting, in which we learn policies solely from historical data produced by real environments, instead of learning policies by actively taking actions as in the traditional RL setting. The reason is that we do not wish to experiment with patients' lives without evidence that the proposed treatment strategy is better than current practice (Gottesman et al., 2018). (Another example arises in finance where, due to costs in time and money, it is often impractical to evaluate trading strategies by actually buying and selling stocks in the market.) As pointed out by Raghu et al. (2017a), RL also has advantages over other machine learning algorithms even in the observational setting, especially in two situations: when the optimal treatment strategy is unclear in the medical literature (Marik, 2015), and when the training examples do not represent optimal behavior.
Another approach that has been explored is causal inference (Pearl, 2009). Causal inference is a fundamental notion in science and plays an increasingly important role in healthcare and medicine (Liu et al.; Soleimani et al., 2017; Schulam and Saria, 2017; Alaa et al., 2017; Alaa and van der Schaar, 2018; Atan et al., 2016). Especially in the observational setting, causal inference offers a powerful tool to deal with confounding, which occurs when a hidden variable influences both the treatment and the outcome (Pearl and Mackenzie, 2018). It is widely acknowledged that confounding is the most crucial aspect of inferring the effect of a treatment on an outcome from observational data (Louizos et al., 2017), because not taking confounding into account might lead to choosing a wrong treatment. However, most approaches to tackling confounding in causal inference are restricted to fixed treatments that do not vary over time (Louizos et al., 2017). When treatments are allowed to vary over time, the space of combinatorial treatment strategies grows exponentially (Peters et al., 2017; Hernán and Robins, 2018). In this case, it is still unclear how to apply causal inference to deal with confounding in sequential data where treatments change over time.
On the basis of the discussion above, in this paper we attempt to combine the advantages of RL and causal inference to cope with an important family of RL problems in the observational setting, that is, learning good policies solely from historical data produced by real environments with confounding bias. This type of problem will become increasingly common in future RL research given the burgeoning developments in healthcare and medicine. To the best of our knowledge, however, little work has been done in this promising area of integrating RL with causal inference (Bareinboim et al., 2015; Forney et al., 2017; Buesing et al., 2019). In contrast, confounders have been extensively studied in epidemiology, sociology, and economics. Take, for example, the well-known kidney-stone case, in which the size of the kidney stone is a confounding factor affecting both the treatment and the recovery (Peters et al., 2017; Pearl, 2009). Correcting for the confounding effect of the stone size is crucial for determining how to choose an effective treatment. Similarly, in RL, if unobserved potential confounders exist, they affect both actions and rewards as an agent interacts with the environment, and eventually influence the policy to be optimized.
Let us for a moment stick with the example of the kidney stones and assume that physicians need to take a series of steps to treat this disease. We further assume that, during the course of treatment, physicians have natural predilections in their treatment choices as a function of the stone size. That is, they are more likely to choose one treatment when patients have large stones and another treatment when patients have small stones. In the observational setting, given only such a set of historical records of physicians' treatments and patient outcomes, it is extremely challenging or even impossible to learn an optimal treatment policy due to the existence of the confounder (i.e., the size of the kidney stones).
To this end, we present a general formulation for addressing RL problems with observational data, namely deconfounding reinforcement learning
(DRL). More specifically, given several common confounding assumptions, we first estimate a latent-variable model from observational data. Under some conditions for identification, through the latent-variable model we can simultaneously discover the latent confounders and infer how they affect actions and rewards. Then the confounders in the model can be adjusted for, using the causal language developed by Pearl (2009), and finally we optimize the policy based on the resulting deconfounding model.
On the basis of the proposed formulation, we extend one popular RL algorithm, the actor-critic method, to its corresponding deconfounding variant. Note that our procedure for obtaining this deconfounding variant can be easily applied to other RL algorithms. Due to the lack of suitable datasets, we revise the classic control toolkit in OpenAI Gym (Brockman et al., 2016), making it a benchmark for the comparison of DRL algorithms. In addition, we devise a confounding version of the MNIST dataset (LeCun et al., 1998) to verify the performance of our causal model. Finally, we conduct extensive experiments to demonstrate the superiority of the proposed formulation over traditional RL algorithms in confounded environments with observational data.
To sum up, our contributions in this paper are as follows:

We propose a general formulation for addressing RL problems in confounded environments with observational data, namely deconfounding reinforcement learning (DRL);

We present the deconfounding variant of actor-critic methods, obtained through a methodology that can be easily applied to other RL methods;

We perform a comprehensive comparison of our DRL algorithm with its vanilla version, showing that the proposed approach has an advantage in confounded environments.

To the best of our knowledge, this is the first attempt to build a bridge between confounding and the full RL problem, and one of the few research efforts aiming at understanding the connections between causal inference and the full RL setting.
2 Background
In this section, we briefly review confounding in causal inference. We recommend Pearl’s excellent monograph for further reading (Pearl, 2009).
2.1 Simpson’s Paradox
Let us begin with an example of one of the most famous paradoxes in statistics: Simpson's paradox. Consider the previously mentioned kidney stones (Peters et al., 2017). We collect electronic patient records to investigate the effectiveness of two treatments, a and b, against kidney stones. Although the overall probability of recovery is higher for patients who took treatment b, treatment a performs better than treatment b both on patients with small kidney stones and on patients with large kidney stones. More precisely, we have

p(R = 1 | T = a) < p(R = 1 | T = b), yet p(R = 1 | T = a, Z = z) > p(R = 1 | T = b, Z = z) for each z,   (1)

where Z is the size (large or small) of the stone, T the treatment, and R the recovery (all binary). How do we cope with this change in conclusions? Which treatment would you prefer if you had kidney stones? Does treatment b cause recovery? As described in the following section, the answers to these questions depend on the causal relationship between the treatment, the recovery, and the size of the kidney stone.
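The reversal in Equation (1) can be reproduced with a few lines of Python. The counts below are hypothetical, chosen only to exhibit the reversal; they are not taken from any study cited in this paper:

```python
# Hypothetical counts in the spirit of the classic kidney-stone data.
# counts[treatment][stone_size] = (recovered, total)
counts = {
    "a": {"small": (81, 87),   "large": (192, 263)},
    "b": {"small": (234, 270), "large": (55, 80)},
}

def overall_recovery(t):
    rec = sum(r for r, n in counts[t].values())
    tot = sum(n for r, n in counts[t].values())
    return rec / tot

def subgroup_recovery(t, z):
    r, n = counts[t][z]
    return r / n

# Treatment b looks better overall ...
assert overall_recovery("b") > overall_recovery("a")
# ... yet a is better within every stone-size subgroup: Simpson's paradox.
assert all(subgroup_recovery("a", z) > subgroup_recovery("b", z)
           for z in ("small", "large"))
```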
2.2 Confounding
An intuitive explanation for this kidney-stone instance of Simpson's paradox is that large stones are more severe than small stones and are much more likely to be treated with treatment a, with the result that treatment a looks worse than treatment b overall. We assume that Figure 1 depicts the true underlying causal diagram of the kidney-stone example. Confounding occurs because the size of the kidney stones influences both treatment and recovery. The size of the kidney stones is called a confounder. The term "confounding" originally meant "mixing" in English, which describes how the true causal effect of T on R is "mixed" with the spurious correlation between T and R induced by the fork T ← Z → R (Pearl and Mackenzie, 2018). In other words, we will not be able to disentangle the true effect of T on R from the spurious effect if we do not have data on Z. If we have measurements of Z or can indirectly estimate Z, it is easy to deconfound the true and spurious effects. To this end, we can adjust for Z by averaging the effect of T on R in each subgroup of Z (i.e., the different size groups in the case of kidney stones).
2.3 Do-Operator and Adjustment Criterion
From the viewpoint of causal inference, we can use the language of intervention, namely the do-operator, to describe when confounding happens. In fact, in the kidney-stone example, what we are interested in is how the two treatments compare when we force all patients to take treatment a or b, rather than which treatment has a higher recovery rate given only the observational patient records. Mathematically, we focus on the true effect p(R | do(T = t)) (the interventional distribution obtained when patients are forced to take treatment t) instead of the conditional p(R | T = t) (the observational distribution obtained when patients are observed to take treatment t). Therefore, as described previously, confounding can be naturally described as the discrepancy between p(R | do(T = t)) and p(R | T = t).
The do-operator can be executed in two common ways: through Randomized Controlled Trials (RCTs) (Fisher, 1935) and through adjustment formulas (i.e., the backdoor and front-door criteria) (Pearl, 2009). RCTs are the gold standard but are sometimes limited by practical considerations (e.g., safety, laws, ethics, physical infeasibility, etc.). The backdoor and front-door criteria require knowledge of the causal diagram, which typically means that causal assumptions are provided in advance. According to the backdoor criterion, in the kidney-stone example we can immediately obtain

p(R | do(T = t)) = Σ_z p(R | T = t, Z = z) p(Z = z).   (2)
Beyond the two adjustment formulas, the do-calculus provides a syntactic method of deriving claims about interventions (Pearl, 2009). It consists of three rules that can be repeatedly applied to simplify the expression for an interventional distribution. With the do-calculus we can compute interventional probabilities from observational probabilities.
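As a concrete illustration, the backdoor adjustment of Equation (2) can be computed directly from a contingency table. The counts below are hypothetical placeholders that exhibit Simpson's reversal:

```python
# Backdoor adjustment for the kidney-stone example, Eq. (2):
# p(R=1 | do(T=t)) = sum_z p(R=1 | T=t, Z=z) p(Z=z).
# counts[t][z] = (recovered, total); hypothetical numbers.
counts = {
    "a": {"small": (81, 87),   "large": (192, 263)},
    "b": {"small": (234, 270), "large": (55, 80)},
}

def p_do(t):
    total = sum(n for d in counts.values() for _, n in d.values())
    result = 0.0
    for z in ("small", "large"):
        p_z = sum(counts[s][z][1] for s in counts) / total  # p(Z=z)
        r, n = counts[t][z]
        result += (r / n) * p_z  # p(R=1 | T=t, Z=z) * p(Z=z)
    return result

# After adjusting for stone size, treatment a comes out ahead,
# reversing the naive comparison of overall recovery rates.
assert p_do("a") > p_do("b")
```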
2.4 Proxy Variables for Confounding
If confounders can be measured, then they can be adjusted for through the methods discussed in Section 2.3. However, in most cases where confounders are hidden or unmeasured, without further assumptions it is impossible to estimate the effect of the intervention on the outcome. A common practice is then to leverage observed proxy variables that contain information about the unobserved confounders (Angrist and Pischke, 2008; Maddala and Lahiri, 1992; Montgomery et al., 2000; Schölkopf et al., 2016). However, using proxy variables to correctly recover causal effects requires strict mathematical assumptions (Louizos et al., 2017; Edwards et al., 2015; Kuroki and Pearl, 2014; Miao et al., 2018; Pearl, 2012; Wooldridge, 2009) and, in practice, we do not know whether or not the available data meet those assumptions. Hence, we decide to follow Louizos et al. (2017) and, instead of using proxy variables directly, we estimate a latent-variable model in which we simultaneously discover the latent confounders and infer how they affect treatment and outcome.
3 Deconfounding Reinforcement Learning
In this section, we formally introduce deconfounding reinforcement learning (DRL). Generally speaking, DRL consists of two steps: learning a deconfounding model, shown in Figure 2, and optimizing a policy based on the learned deconfounding model. The main idea in step 1 is to simultaneously discover hidden confounders and infer causal effects through the estimation of a latent-variable model. More specifically, we first discuss the time-independent confounding assumption in Section 3.1 and then, based on this assumption, we formalize the deconfounding model in Section 3.2. Section 3.3 discusses the problem of identification in our model, which is a central issue in causal inference. After that, we present details about how to learn the proposed model via variational inference in Section 3.4. Step 2 is provided in Section 3.5, where we describe how to design the deconfounding actor-critic method. The proposed approach is straightforward to apply to other RL algorithms in the same manner. Finally, the training procedure for our deconfounding actor-critic method is presented in Section 3.6.
3.1 Causal Assumptions
Without loss of generality, we assume that there exists a common confounder in the model, denoted by u in Figure 2, which is time-independent across a number of episodes. This assumption is sufficiently general to apply to various RL tasks across domains. For example, in personalized medicine, socioeconomic status can be a confounder affecting both the medication strategy a patient has access to and the patient's general health (Louizos et al., 2017). In this case, socioeconomic status is time-independent for each patient during the course of treatment. Another example is in agriculture, where soil fertility may serve as a time-independent confounder affecting both the application of fertilizer and the yield of each plot of land (Pearl and Mackenzie, 2018). Finally, in the stock market example, government policy may also act as a time-independent confounder.
3.2 The Model
Given our causal assumptions, we first fit a generative model to a sequence of observational data: observations, actions, and rewards, where actions and rewards are confounded by one or several unknown factors, as shown in Figure 2. Formally, let x_{1:T}, a_{1:T}, r_{1:T}, and z_{1:T} be the sequences of observations, actions, rewards, and corresponding latent states, respectively. As mentioned previously, the confounder is denoted by u, and it is worth noting that u may stand for more than one confounder, in which case multiple confounders are seen as a whole represented by u (i.e., the confounder u can be a vector). We assume that x_t ∈ ℝ^{d_x}, a_t ∈ ℝ^{d_a}, r_t ∈ ℝ, z_t ∈ ℝ^{d_z}, and u ∈ ℝ^{d_u}, where t = 1, …, T. The generative model for DRL is then given by

p(x_t | z_t) = N(x_t | μ_x(z_t), σ_x^2(z_t)),   (3)
p(a_t | z_t, u) = N(a_t | μ_a(z_t, u), σ_a^2(z_t, u)),   (4)
p(r_t | z_t, a_t, u) = N(r_t | μ_r(z_t, a_t, u), σ_r^2(z_t, a_t, u)),   (5)
p(z_{t+1} | z_t, a_t, u) = N(z_{t+1} | μ_z(z_t, a_t, u), σ_z^2(z_t, a_t, u)),   (6)

where we have parametrized each probability distribution as a Gaussian with mean and variance given by nonlinear functions μ_i and σ_i^2 represented by neural networks with parameters θ_i, for i ∈ {x, a, r, z}.
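To make the generative process concrete, the following sketch samples a short trajectory from a model with the same conditional structure, with the learned networks replaced by arbitrary linear maps (all dimensions and weights here are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x, d_u, T = 4, 8, 2, 5

# Toy stand-ins for the networks mu_i, sigma_i (hypothetical linear maps;
# the paper uses learned nonlinear neural networks).
W_x = rng.normal(size=(d_x, d_z))
W_a = rng.normal(size=(d_z + d_u,))
W_r = rng.normal(size=(d_z + d_u + 1,))
W_z = rng.normal(size=(d_z, d_z + d_u + 1))

def gaussian(mean, std=0.1):
    return mean + std * rng.normal(size=np.shape(mean))

u = rng.normal(size=d_u)   # time-independent confounder, drawn once per episode
z = rng.normal(size=d_z)   # initial latent state
trajectory = []
for t in range(T):
    x = gaussian(W_x @ z)                            # observation, cf. Eq. (3)
    a = gaussian(W_a @ np.concatenate([z, u]))       # action, cf. Eq. (4)
    r = gaussian(W_r @ np.concatenate([z, u, [a]]))  # reward, cf. Eq. (5)
    trajectory.append((x, a, r))
    z = gaussian(W_z @ np.concatenate([z, u, [a]]))  # transition, cf. Eq. (6)

assert len(trajectory) == T
```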
Note that, in our case, Equation (4) is not necessary when learning the model, because the data used in our experiments are generated from a policy depending only on the confounder (see Section 4.1); that is, a_t does not depend on z_t in our data. In addition, in this case z_t is not viewed as a confounder of a_t and r_t, for z_t does not influence a_t. This also provides one reason why we do not need to adjust for z_t. However, in some other cases, such as when applied to medical data, the treatment strategies of physicians definitely contain valuable information about the patient state z_t (i.e., z_t does influence a_t), and therefore the arrow between them is necessary when learning the model.

3.3 Identification of Causal Effect
The key component of our method that allows us to address problems with confounders is the computation of the reward according to the model from Figure 2. To be more precise, assuming that an agent standing at state z_t performs action a_t, we do not use the conditional p(r_t | z_t, a_t) as the predictor for the reward, as traditional RL methods do. Instead, we use the do-operator described in Section 2.3 to compute the reward. (To give some intuition, take the case of kidney stones. One prefers treatment b when considering the overall probability of recovery. By contrast, one chooses treatment a when considering the recovery rate conditioned on each possible kidney-stone size. The optimal treatment is treatment a in this example. The do-operator in Equation (2) allows us to obtain the correct solution in this type of problem with confounders. By contrast, traditional RL methods will fail in such settings since they consider only conditional probabilities, e.g., the overall probability of recovery, instead of interventional probabilities, as the do-operator does.) Formally,

p(r_t | do(a_t), z_t) = ∫ p(r_t | a_t, z_t, u) p(u | z_t) du   (7)
= ∫ p(r_t | a_t, z_t, u) p(u) du,   (8)
where Equation (8) is obtained by applying the rules of do-calculus to the causal graph in Figure 2 (Pearl, 2009). We can also use the backdoor criterion to directly obtain Equation (8). Equation (8) shows that p(r_t | do(a_t), z_t) can be identified from the joint distribution over (z_t, a_t, r_t, u), because, through Equation (8), the interventional probability is converted into observational probabilities with regard to u. In other words, if we can recover this joint distribution then we can also recover p(r_t | do(a_t), z_t).

Now the problem is reduced to whether or not we can estimate this joint distribution from observations of (x_{1:T}, a_{1:T}, r_{1:T}). Fortunately, this is possible, because a number of works have shown that one can use the knowledge inferred from the joint distribution between proxy variables and confounders to adjust for the hidden confounders (Louizos et al., 2017; Edwards et al., 2015; Kuroki and Pearl, 2014; Miao et al., 2018; Pearl, 2012; Wooldridge, 2009). Here, we focus only on one possible case, presented in Figure 1 of Louizos et al. (2017) (see Appendix D), because their result can be directly used to show that the joint distribution can be approximately recovered solely from the observations. The main idea is that our model can be factorized into two types of 4-tuple components, each of which can be analyzed using the same method as in Louizos et al. (2017). More precisely, in our model, at each time step the 4-tuple formed by the confounder u, its proxy x_t, the action a_t, and the reward r_t has exactly the same structure as the 4-tuple shown in Figure 1 of Louizos et al. (2017). Since it has been proved in Louizos et al. (2017) that the joint distribution over such a 4-tuple can be recovered from observations of the proxy, treatment, and outcome, the joint distribution involving u can also be approximately recovered from the observations alone. Likewise, the joint distribution over the other type of 4-tuple, involving the latent state z_t, can be recovered in the same manner (see Appendix E). Applying this rule repeatedly to the sequential data, we are finally able to approximately recover the full joint distribution solely from the observations (x_{1:T}, a_{1:T}, r_{1:T}). Hence, in the present paper we may reasonably assume that the joint distribution can be approximately recovered solely from the observations.
We estimate the joint distribution using Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013), which can represent a very large class of distributions with latent variables and are easily implemented by solving an optimization problem (Tran et al., 2015; Louizos et al., 2017). However, VAEs do not guarantee that the true model can be identified through learning (Louizos et al., 2017). For example, it might occur that different model parameters fit the observed data equally well as long as the values of the latent confounders are adjusted accordingly (D'Amour, 2018). Despite this, identification is possible in both our model and the one described by Louizos et al. (2017) under specific conditions (Louizos et al., 2017; Kuroki and Pearl, 2014; Miao et al., 2018; Pearl, 2012; Allman et al., 2009). In addition, in our model, each of u and z_t has proxy variables among the observations x_{1:T}, where the z_t can be approximately viewed as hidden states in a hidden Markov model. In this case, Allman et al. (2009) show that such a model can be identified if the latent confounders u and z_t are assumed to be categorical. However, in practice there is no guarantee for this to be the case. Hence, we prefer to use VAEs, which make substantially weaker assumptions about the data-generating process and the structure of the hidden confounders (Louizos et al., 2017). Despite the lack of general identifiability results for VAEs, we empirically show that our approach is useful for learning a better policy in the presence of confounding.

In the model from Figure 2 there are two types of confounders: the time-independent confounder u and the time-dependent confounders z_t, playing different roles in the model. We refer to the former as a "nuisance confounder" and to the latter as "advantage confounders". The nuisance confounder will negatively affect the whole course of treatment and, therefore, should be adjusted for. In the example of kidney stones, the existence of the confounder (i.e., the size of the stones) will lead to a wrong treatment if not adjusted for. By contrast, the advantage confounders act as state variables and, in principle, they are not supposed to be adjusted for, because in RL we aim to take optimal actions conditioned on the current state. (Another reason why we do not need to adjust for z_t has been discussed in Section 3.2.)
3.4 Learning
The nonlinear functions parametrized by neural networks make inference intractable. Because of this, we learn the parameters of the model by using variational inference together with an inference model, a neural network that approximates the intractable posterior distributions (Rezende et al., 2014; Kingma and Welling, 2013; Krishnan et al., 2015). We now review how to learn a simple latent-variable model using variational inference. In this simple model, x stands for the observational variables and z for the latent variables. By using the variational principle, we introduce an approximate posterior distribution q_φ(z | x) to obtain the following lower bound on the model's marginal likelihood:

log p_θ(x) ≥ E_{q_φ(z|x)}[log p_θ(x | z)] − KL(q_φ(z | x) ‖ p(z)),   (9)

where we have used Jensen's inequality, KL denotes the Kullback–Leibler divergence between two distributions, and φ are the parameters of the inference model q_φ(z | x).

3.4.1 Variational Lower Bound
By directly applying the lower bound in Equation (9) to our model, we obtain

log p(x_{1:T}, a_{1:T}, r_{1:T}) ≥ E_{q(z_{1:T}, u | x_{1:T}, a_{1:T}, r_{1:T})}[log p(x_{1:T}, a_{1:T}, r_{1:T} | z_{1:T}, u)] − KL(q(z_{1:T}, u | x_{1:T}, a_{1:T}, r_{1:T}) ‖ p(z_{1:T}, u)).   (10)

By using the Markov property of our model, we can factorize the full joint distribution in the following way:

p(x_{1:T}, a_{1:T}, r_{1:T}, z_{1:T}, u) = p(u) p(z_1) ∏_{t=1}^{T} p(x_t | z_t) p(r_t | z_t, a_t, u) ∏_{t=1}^{T−1} p(z_{t+1} | z_t, a_t, u).   (11)

In addition, for simplicity, we assume a similar factorization in the posterior approximation for z_{1:T} and u:

q(z_{1:T}, u | x_{1:T}, a_{1:T}, r_{1:T}) = q(u | x_{1:T}, a_{1:T}, r_{1:T}) q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}).   (12)

Combining Equations (10), (11) and (12) yields

L(θ, φ) = E_q[∑_{t=1}^{T} (log p(x_t | z_t) + log p(r_t | z_t, a_t, u))] − E_q[KL(q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}) ‖ p(z_{1:T} | a_{1:T}, u))] − KL(q(u | x_{1:T}, a_{1:T}, r_{1:T}) ‖ p(u)),   (13)

where we omit the subscripts θ and φ. A more detailed derivation can be found in Appendix A. Equation (13) is differentiable with respect to the model parameters θ and the inference parameters φ and, by using the reparametrization trick (Kingma and Welling, 2013), we can directly apply backpropagation to optimize this objective.
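As a toy illustration of the bound in Equation (9) and the reparametrization trick, consider a one-dimensional linear-Gaussian model, for which the exact marginal likelihood is available as a sanity check (all parameter values below are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of the lower bound in Eq. (9) for a toy model with
# prior p(z) = N(0, 1), likelihood p(x|z) = N(w*z, 1), and approximate
# posterior q(z|x) = N(mu, s^2).
x, w = 1.5, 0.8
mu, s = 1.0, 0.5

z = mu + s * rng.normal(size=5000)           # reparametrized samples from q
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - w * z) ** 2
kl = 0.5 * (mu**2 + s**2 - 1.0) - np.log(s)  # KL(q || p(z)), closed form
elbo = log_lik.mean() - kl

# For this linear-Gaussian model the exact marginal is tractable:
# x ~ N(0, w^2 + 1), so we can check that the bound lies below it.
log_px = -0.5 * np.log(2 * np.pi * (w**2 + 1)) - 0.5 * x**2 / (w**2 + 1)
assert elbo <= log_px
```

Because z is written as a deterministic function of (mu, s) plus noise, gradients of the bound with respect to the variational parameters pass through the sample, which is exactly what makes backpropagation applicable.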
3.4.2 Inference Model
From the factorization in Equation (12), we can see that there are two types of inference models: q(u | x_{1:T}, a_{1:T}, r_{1:T}) and q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}). Similar to the generative model in Section 3.2, we also parametrize both of them as Gaussians:

q(u | x_{1:T}, a_{1:T}, r_{1:T}) = N(u | μ_u, σ_u^2),   (14)
q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}) = N(z_{1:T} | μ_q, σ_q^2).   (15)
In fact, as shown in Equation (12), q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}) can be factorized as the product of the terms q(z_t | z_{t−1}, x_{1:T}, a_{1:T}, r_{1:T}) for t = 1, …, T. By using the Markov property of our model, we have

q(z_{1:T} | x_{1:T}, a_{1:T}, r_{1:T}) = ∏_{t=1}^{T} q(z_t | z_{t−1}, x_{1:T}, a_{1:T}, r_{1:T}).   (16)

Therefore, we can further simplify each of these terms as follows:

q(z_t | z_{t−1}, x_{1:T}, a_{1:T}, r_{1:T}) = q(z_t | z_{t−1}, x_{t:T}, a_{t:T}, r_{t:T}).   (17)

Equation (17) tells us that z_t depends on z_{t−1} and all the current and future observed data. Meanwhile, the conditional independence above means that z_{t−1} summarizes all the historical data. Therefore, it is natural to calculate q(z_t | z_{t−1}, x_{t:T}, a_{t:T}, r_{t:T}) based on the whole sequence of data. This can be done using recurrent neural networks (RNNs). Inspired by Krishnan et al. (2015, 2017), we choose a bidirectional LSTM (Zaremba and Sutskever, 2014) to parameterize the mean and variance in Equation (15). Since Equation (14) has the same structure as Equation (15), μ_u and σ_u are parameterized by a bidirectional LSTM as well. More details about the architecture can be found in Appendix C.

Note that, to generate data from the model at any time step t, we require knowing a_t and r_t before inferring the distribution over z_{t+1} when conditioning on the observed x_{1:t}. Hence, we need to introduce two auxiliary distributions to perform counterfactual reasoning. This corresponds to the problem of predicting the reward given an action unseen in the training set. To be more precise, we have

q(a_t | x_{1:t}) = N(a_t | μ_a′, σ_a′^2),   (18)
q(r_t | x_{1:t}, a_t) = N(r_t | μ_r′, σ_r′^2),   (19)

where μ_a′, σ_a′, μ_r′, and σ_r′ are also parameterized by neural networks. To estimate the parameters of these inference models, we add extra terms to the variational lower bound given by Equation (13):

L_final(θ, φ) = L(θ, φ) + E[∑_{t=1}^{T} (log q(a_t | x_{1:t}) + log q(r_t | x_{1:t}, a_t))].   (20)
3.5 Deconfounding RL Algorithms
We now have all the building blocks for our DRL algorithm. Once our model is learned from the observational data, it can be directly used as a dynamic environment like those in OpenAI Gym (Brockman et al., 2016). We can exploit the learned model to generate rollouts for policy evaluation and optimization. In practice, Equation (8) is approximated using the Monte Carlo method as follows:

p(r_t | do(a_t), z_t) ≈ (1/N) ∑_{n=1}^{N} p(r_t | a_t, z_t, u^{(n)}),  u^{(n)} ∼ p(u),   (21)

where N is the number of samples drawn from the prior p(u). Given observational data, we sample instead from the approximate posterior over u, which we compute using the inference network presented in Section 3.4.2.
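A minimal sketch of this Monte Carlo approximation, with the learned reward network replaced by a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo estimate of the deconfounded reward, cf. Eq. (21):
# p(r | do(a), z) ≈ (1/N) Σ_n p(r | a, z, u_n),  u_n ~ p(u).
# `reward_mean` is a hypothetical stand-in for a learned reward network.
def reward_mean(z, a, u):
    return np.tanh(z.sum() + a + u.sum())

def deconfounded_reward(z, a, n_samples=1000, d_u=2):
    u = rng.normal(size=(n_samples, d_u))  # samples from the prior p(u)
    return np.mean([reward_mean(z, a, ui) for ui in u])

z = np.array([0.1, -0.2])
r_hat = deconfounded_reward(z, a=0.5)
assert -1.0 <= r_hat <= 1.0  # tanh-based stand-in is bounded
```

In the observational setting one would replace the prior samples with samples from the approximate posterior q(u | x_{1:T}, a_{1:T}, r_{1:T}) of Equation (14).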
By using this deconfounding reward function, it is straightforward to extend traditional RL algorithms to their corresponding deconfounding versions. In this paper, we select and extend one representative RL algorithm: the actor-critic method (Sutton et al., 1998). Nevertheless, our methodology can be used to extend other algorithms as well in a straightforward manner.
Deconfounding Actor-Critic Methods
The actor-critic method is a policy-based RL method that directly parameterizes a policy function π_θ(a | s). The goal is to reduce the variance in the estimate of the policy gradient by subtracting a learned function of the state, known as a baseline, from the return. The learned value function V(s) is commonly used as the baseline. Taking into consideration that the return is estimated by Q(s_t, a_t), we can write the gradient of the actor-critic loss function at time step t as

∇_θ log π_θ(a_t | s_t) A(s_t, a_t),   (22)

where A(s_t, a_t) is an estimate of the advantage of action a_t in state s_t. In practice, A(s_t, a_t) is usually replaced with the one-step return, that is, A(s_t, a_t) ≈ r_t + γ V(s_{t+1}) − V(s_t). The crucial difference in deconfounding actor-critic methods is to use the deconfounded reward given by Equation (21), as opposed to the conditional reward used by vanilla actor-critic methods.
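A schematic single update of this rule, using a linear softmax policy and a linear critic as stand-ins for the learned networks (all shapes and the learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# One actor step of the (deconfounding) actor-critic gradient, cf. Eq. (22):
# update ∝ ∇_θ log π_θ(a|s) * A(s, a),  with A ≈ r + γ V(s') − V(s).
n_actions, d_s, gamma = 3, 4, 0.99
theta = rng.normal(size=(n_actions, d_s))  # policy parameters
w = rng.normal(size=d_s)                   # critic (value-function) parameters

def policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

s, s_next = rng.normal(size=d_s), rng.normal(size=d_s)
probs = policy(s)
a = rng.choice(n_actions, p=probs)
r = 1.0  # here one would plug in the deconfounded reward of Eq. (21)

advantage = r + gamma * (w @ s_next) - (w @ s)  # one-step advantage estimate
grad_logp = -np.outer(probs, s)                 # ∇_θ log π_θ(a|s) for softmax
grad_logp[a] += s
theta += 0.01 * advantage * grad_logp           # ascend the policy objective
```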
3.6 Training
As mentioned previously, DRL consists of two steps: learning a deconfounding model and optimizing a policy based on the learned deconfounding model. In step 1 we learn the model by optimizing the objective given by Equation (20). Once the deconfounding model is learned, we have an estimate of the state-transition function given by the model, and can also calculate the deconfounding reward function according to Equation (21). In step 2, we treat the learned deconfounding model as an RL environment like CartPole in OpenAI Gym and generate trajectories/rollouts using the estimated state-transition function and deconfounding reward function. These trajectories/rollouts are then used to train the policy by following the gradient given by Equation (22).
3.7 Implementation Details
We used TensorFlow (Abadi et al., 2016) for the implementation of our model and the proposed DRL algorithm. Optimization was done using Adam (Kingma and Ba, 2014). Unless stated otherwise, the settings of all hyperparameters and network architectures can be found in Appendix G.

To assess the quality of the learned model, we performed two types of tasks: reconstruction and counterfactual reasoning. The reconstructions were performed by first feeding an input sequence into the learned inference network, then sampling from the resulting posterior distribution over z_{1:T} according to Equation (15), and finally feeding those samples into the generative network described in Equation (3) to reconstruct the original observed sequence x_{1:T}. The counterfactual reasoning, that is, the prediction of the reward given an action that we have not seen during training, was executed in the following steps:

Once we have x_{1:t}, a_{1:t}, and r_{1:t}, we estimate z_{1:t} using Equation (15).

Given the estimated z_t and u, and another uniformly randomly selected action a_t, we can directly compute z_{t+1} from Equation (6).

The final step is to reconstruct x_{t+1} from z_{t+1} according to Equation (3).
By repeating the last two steps, we can counterfactually reason out a sequence of data.
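The counterfactual rollout described in the steps above can be sketched as follows, again with hypothetical linear maps in place of the learned networks:

```python
import numpy as np

rng = np.random.default_rng(5)
d_z, d_x, d_u = 3, 6, 2

# Schematic counterfactual rollout: infer the latent state and confounder,
# then roll the learned model forward under actions never seen in the data.
# All networks are replaced by hypothetical linear maps.
W_dec = rng.normal(size=(d_x, d_z))              # stands in for Eq. (3)
W_trans = rng.normal(size=(d_z, d_z + d_u + 1))  # stands in for Eq. (6)

u = rng.normal(size=d_u)  # confounder estimate (would come from Eq. (14))
z = rng.normal(size=d_z)  # latent state inferred via Eq. (15)
counterfactual_xs = []
for _ in range(5):
    a = rng.uniform(-2.0, 2.0)                # random, possibly unseen action
    z = W_trans @ np.concatenate([z, u, [a]])  # transition, cf. Eq. (6)
    counterfactual_xs.append(W_dec @ z)        # decode observation, cf. Eq. (3)

assert len(counterfactual_xs) == 5
```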
We may also be interested in estimating the approximate posterior over the confounder u from observed data. We consider two possible scenarios for this. In the easy one, we are given the observed sequences x_{1:T}, a_{1:T}, and r_{1:T}, and we estimate the posterior by using Equation (14). The more challenging scenario involves estimating the posterior from x_{1:T} only, with no action or reward data. In this case, we follow the same steps used in the counterfactual-reasoning task: we first compute a_t and r_t for t = 1 to T using Equations (18) and (19), and then estimate the posterior through Equation (14).
4 Experimental Results
The evaluation of any method that deals with confounders is always challenging due to a lack of real-world benchmark problems with known ground truth. Furthermore, little work has been done before on the task of deconfounding RL. All this creates difficulties in the evaluation of our algorithms and motivates us to develop several new benchmark problems by revising the MNIST dataset (LeCun et al., 1998) and by revising two environments in OpenAI Gym (Brockman et al., 2016), CartPole and Pendulum.
4.1 New Confounding Benchmarks
We now describe three new confounding benchmark problems, all of them including a single binary confounder. The procedure to create these benchmarks is inspired by Krishnan et al. (2015), who synthesized a dataset mimicking healthcare data under realistic conditions (e.g., noisy laboratory measurements, surgeries and drugs affected by patient age and gender, etc.). The data used is from either popular RL environments in OpenAI Gym such as Pendulum and CartPole or popular machine learning datasets such as MNIST.
Confounding Pendulum and CartPole
We revised two environments in OpenAI Gym (Brockman et al., 2016): Pendulum and CartPole. To obtain a confounding version of Pendulum, we selected different screen images of Pendulum and created a synthetic problem in which the actions are joint efforts (see https://github.com/openai/gym/wiki/Pendulum-v0 for more details) with values between −2 and 2. We added bit-flip noise to each image and then used a random policy, confounded with a binary factor, to select the action applied at each time step. This is repeated multiple times to produce a large number of 5-step sequences of noisy images. In each generated sequence, a block of three consecutive squares is superimposed near the top-left corner of the images at a random starting location. These squares are intended to be analogous to seasonal flu or other ailments that a patient could exhibit, which are independent of the actions taken and which last several time steps (Krishnan et al., 2015). The goal is to show that our model can learn long-range patterns, since these play an important role in medical applications. We treat the generated sequences of images as the observations x_{1:T}. The training, validation, and test sets comprise disjoint collections of these sequences of length five.
The key characteristic of confounding Pendulum is the relationship between the confounder, the actions, and the rewards. For simplicity of notation, we denote the confounder by c, the action by a, and the reward by r. In this case, c is a binary variable mimicking socioeconomic status (i.e., rich vs. poor). The range of actions is grouped into two categories, a ∈ [−2, 0) and a ∈ [0, 2], representing different treatments. (One of the two is a better treatment than the other in our setting, as described in Appendix F.) The treatment selection depends on c; that is, c determines the probabilities of choosing each of the two categories. The reward is defined as

r = r_o(a) + r_c(c, a),   (23)

where r_o is the original reward in Pendulum as a function of a (in fact, r_o is a function of both a and a state; we mention only a to emphasize that the confounder affects the action), and r_c is the extra reward as a function of both c and a, defined by a two-component Gaussian mixture:

r_c(c, a) ∼ π(c, a) N(μ_1, σ_1^2) + (1 − π(c, a)) N(μ_2, σ_2^2),   (24)

with the means and variances fixed and the mixing coefficient π(c, a) determined by both c and a. More details are available in Appendix F.
Obviously, in the definition above, r depends on both c and a. Furthermore, c has an influence on the observed data through a and r, meaning that the observations contain information about c and, therefore, can be viewed as proxy variables for the confounder. We assume that the influence between the confounder, the action, and the reward is stochastic while generating the data. The reason for this is that, in practice, we do not have an oracle that tells us which treatment is better, so we might sometimes make wrong decisions. For example, take the case of kidney stones: even though one treatment is better than the other, there are still some patients choosing the worse treatment with positive probability in each category. All the details about the data-generation process can be found in Appendix F, where a straightforward analogy is provided as well.
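A sketch of this confounded data-generating process, with all probabilities and mixture parameters chosen as hypothetical placeholders rather than the values from Appendix F:

```python
import numpy as np

rng = np.random.default_rng(6)

# Confounded data generation, cf. Eqs. (23)-(24): a binary confounder c
# biases the action category, and an extra reward term is drawn from a
# two-component Gaussian mixture whose mixing weight depends on (c, a).
def sample_step(original_reward=0.0):
    c = rng.integers(2)                        # binary confounder
    # the confounder shifts the probability of each action category
    p_cat1 = 0.85 if c == 1 else 0.15
    a = rng.uniform(-2, 0) if rng.random() < p_cat1 else rng.uniform(0, 2)
    pi = 0.8 if (c == 1) == (a < 0) else 0.2   # mixing coefficient pi(c, a)
    mu = 1.0 if rng.random() < pi else -1.0    # pick a mixture component
    r_extra = rng.normal(mu, 0.1)
    return c, a, original_reward + r_extra     # Eq. (23): r = r_o + r_c

c, a, r = sample_step()
assert c in (0, 1) and -2 <= a <= 2
```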
The confounding CartPole problem is implemented in exactly the same manner, except that the action is now binary and can thus be naturally divided into two categories (more details can be found at https://github.com/openai/gym/wiki/CartPole-v0). Accordingly, the binary confounder determines the probabilities of choosing between the two actions, and these probabilities are set to the same values as those in confounding Pendulum, given in Appendix F.
Confounding MNIST
We follow the same procedure to obtain a confounding MNIST problem. However, the definitions of the action and the original reward term are now different. Similar to the Healing MNIST dataset (Krishnan et al., 2015), the actions encode rotations of digit images, with each entry of the action constrained to a fixed range, and the actions are divided into two categories according to the confounder. The original reward term is defined as minus the angle between the upright position and the position to which the digit rotates. For example, if the digit rotates to the position of 3 o'clock or 9 o'clock, then the reward is −90 in both cases.
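The original reward term for confounding MNIST can be written down directly. A minimal sketch, assuming angles are measured in degrees with 0 denoting the upright position:

```python
def rotation_reward(angle_deg):
    """Negative angular distance (in degrees) between the upright position
    and the position to which the digit has rotated."""
    # wrap to [0, 360) and take the shorter way around the circle
    a = abs(angle_deg) % 360.0
    return -min(a, 360.0 - a)
```

For instance, both 3 o'clock (+90 degrees) and 9 o'clock (−90 degrees) are 90 degrees away from upright, so both yield a reward of −90.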
4.2 Performance Analysis of the Deconfounding Model
We assess the performance of the deconfounding model from Figure 2 and compare it with a similar alternative model that does not include the confounder. We train the deconfounding model by optimizing Equation (20), but train the alternative using a different loss function which excludes the confounder and whose full derivation can be found in Appendix B. Both models are separately trained in a mini-batch manner on the training set of the confounding dataset (i.e., image sequences of length five). Afterwards, following the steps described in Section 3.7, we use each trained model to perform the reconstruction task on the training set, and both the reconstruction and counterfactual reasoning tasks on the test set (i.e., image sequences of length five).
Figure 3 presents a comparison of the two models in terms of reconstruction and counterfactual reasoning on confounding Pendulum, with each row of the figure corresponding to one of the two models. The results generated by the deconfounding model are superior to those produced by the model that does not take the confounder into account. More specifically, as shown in the zoomed samples on the bottom row, the model without the confounder generates blurrier images than the deconfounding model because, without modelling the confounder, it is forced to average over multiple latent states.
We obtain similar results for the samples produced by the two models on the confounding MNIST dataset, as shown in Figure 4. Looking closely at the squares on the generated digit samples (inside the yellow box), we observe that the model without the confounder generates non-consecutive white squares in the counterfactual reasoning task, which does not make sense because only consecutive patterns appear in the training set. Its generated images are also blurrier. By contrast, neither issue occurs in the samples from the deconfounding model, showing that our model is able to cope with long-range patterns and describes the data better.
Figure 5 visualizes approximate posterior samples of the learned confounder. We can see that, although the prior distribution of the confounder is assumed to be a factorized standard Gaussian, the model still identifies two clear clusters in the data, because the confounder is originally a binary variable. This demonstrates that our model can learn confounders even when the assumed prior is not that accurate.
4.3 Comparison of RL Algorithms
In this section, we evaluate the proposed deconfounding actor-critic (AC) method by comparing it with its vanilla version on confounding Pendulum. In the vanilla AC method, given a learned model, we optimize the policy by calculating the gradient in Equation (22) on the basis of trajectories/rollouts generated through that model. Equation (22) involves two functions, the policy and the value function, both of which are represented using neural networks whose hyperparameters can be found in Appendix G. It is worth noting that, in this vanilla case, each reward is sampled from the conditional distribution of the reward given the action. By contrast, the proposed deconfounding AC method optimizes the policy using the same gradient from Equation (22), but on trajectories/rollouts in which each reward is obtained from the interventional distribution computed according to Equation (21). For completeness, we also compare with the direct AC method, in which the AC method is trained directly on the training data instead of on trajectories/rollouts.
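The difference between sampling rewards from the conditional versus the interventional distribution can be made concrete in a toy discrete example. The probability tables below are assumptions chosen purely for illustration, not the quantities of Equation (21):

```python
import numpy as np

# Toy discrete world: binary confounder c, binary action a, reward table
# r(a, c). All numbers here are assumed for illustration.
P_C = np.array([0.5, 0.5])                 # p(c)
P_A_GIVEN_C = np.array([[0.9, 0.1],        # p(a | c = 0)
                        [0.1, 0.9]])       # p(a | c = 1)
REWARD = np.array([[1.0, 0.0],             # r(a = 0, c = 0 / 1)
                   [0.0, 2.0]])            # r(a = 1, c = 0 / 1)

def conditional_expected_reward(a):
    """E[r | a]: weights each c by p(c | a), which mixes in the confounder's
    influence on the action choice (the vanilla estimate)."""
    joint = P_A_GIVEN_C[:, a] * P_C        # p(a, c) for each value of c
    p_c_given_a = joint / joint.sum()
    return float(REWARD[a] @ p_c_given_a)

def interventional_expected_reward(a):
    """E[r | do(a)]: weights each c by its prior p(c), cutting the
    confounder-to-action arrow (the deconfounded estimate)."""
    return float(REWARD[a] @ P_C)
```

In this toy world the two estimates disagree (for example 0.9 versus 0.5 for the first action), which is exactly the gap the deconfounding AC method is designed to remove.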
In the training phase, we run the vanilla AC and the deconfounding AC algorithms over a number of episodes with a fixed number of time steps each. For a fair comparison, we also run the direct AC algorithm over the same number of episodes, each of which consists of state-action-reward triples randomly selected from the training data. In order to reduce non-stationarity and to decorrelate updates, the generated data is stored in an experience replay memory and then randomly sampled in batches (Mnih et al., 2013; Riedmiller, 2005; Schulman et al., 2015; Van Hasselt et al., 2016). We sum all the rewards produced during the rollouts in each episode and further average these sums over a window of episodes to obtain a smoother curve. Figure 6(a) shows that our deconfounding AC algorithm performs significantly better than the vanilla AC algorithm in the confounded environment. The direct AC algorithm is not included here because it does not generate rollouts.
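A minimal sketch of the replay memory and the curve smoothing described above; the capacity and window size are arbitrary here, as the paper's values are not reproduced:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience replay buffer: storing transitions and
    sampling them uniformly decorrelates updates (Mnih et al., 2013)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        # once full, the oldest transition is dropped automatically
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def smooth(episode_rewards, window):
    """Average per-episode total rewards over a sliding window to obtain
    smoother learning curves like those in Figure 6(a)."""
    return [sum(episode_rewards[i:i + window]) / window
            for i in range(len(episode_rewards) - window + 1)]
```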
In the testing phase, we first randomly select samples from the test set, each starting a new episode, and then use the learned policies to generate trajectories over the same number of time steps as during training. From the resulting episodes, we plot the total reward obtained by each method in Figure 6(b), as well as the percentage of times the optimal action is chosen in each episode, shown in Figure 6(c). Figure 6(b) shows that our deconfounding AC obtains, on average, a much higher reward at test time than the baselines. Figure 6(c) further tells us that our deconfounding AC is much more likely to choose the optimal action at each time step, whilst the vanilla AC and the direct AC make a wrong decision more than half of the time.
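The two per-episode test metrics plotted in Figures 6(b) and 6(c) can be computed as follows (a straightforward sketch; the notion of "optimal action" per time step is taken as given by the environment):

```python
import numpy as np

def evaluate_episode(actions, optimal_actions, rewards):
    """Per-episode test metrics: total reward, as in Figure 6(b), and the
    fraction of time steps at which the optimal action was chosen, as in
    Figure 6(c)."""
    actions = np.asarray(actions)
    optimal_actions = np.asarray(optimal_actions)
    total_reward = float(np.sum(rewards))
    optimal_rate = float(np.mean(actions == optimal_actions))
    return total_reward, optimal_rate
```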
5 Related Work
Krishnan et al. (2015, 2017) used deep neural networks to construct nonlinear state space models and leveraged a structured variational approximation, parameterized by recurrent neural networks, to mimic the posterior distribution. Levine (2018) reformulated RL and control problems as probabilistic inference, which allows us to bring to bear a large pool of approximate inference methods and to flexibly extend the models. Raghu et al. (2017a, b) exploited continuous state-space models and deep RL to obtain improved treatment policies for septic patients from observational data. Gottesman et al. (2018) discussed problems arising when evaluating RL algorithms in observational health settings. However, none of the works mentioned above takes confounders into account.
Louizos et al. (2017) attempted to learn individual-level causal effects from observational data in the non-temporal setting. They also used a variational autoencoder to estimate the unknown confounder given a causal graph. Paxton et al. (2013) developed predictive models based on electronic medical records without using causal inference. Saria et al. (2010) proposed a nonparametric Bayesian method to analyze clinical temporal data. Soleimani et al. (2017) represented treatment response curves using linear time-invariant dynamical systems, which provides a flexible approach to modeling responses over time. Although the latter two works model sequential data, neither considers RL or causal inference.
Bareinboim et al. (2015) considered the problem of bandits with unobserved confounders. Sen et al. (2016) and Ramoly et al. (2017) further studied contextual bandits with latent confounders. Forney et al. (2017) circumvented some of the problems caused by unobserved confounders in multi-armed bandits by using counterfactual-based decision making. Zhang and Bareinboim (2017) leveraged causal inference to tackle the problem of transferring knowledge across bandit agents. Unlike our method, all these works are restricted to bandits, which correspond to a simplified RL setting without state transitions.
Last but not least, we want to mention some connections between our method and partially observable Markov decision processes (POMDPs). To the best of our knowledge, existing work on POMDPs has not yet considered problems with confounders. Confounding aside, in POMDPs the observation only provides partial information about the actual state, so the agent does not necessarily know which state it is in. By contrast, for convenience in practice, we simplify this setting and assume that the observation provides all the information about the actual state, but corrupted by noise; hence we only need to denoise the observation to obtain the actual state. In this sense, a work more closely related to our model is probably the world model (Ha and Schmidhuber, 2018), because both models use variational inference to estimate the actual state from the observation. Treating our model as a POMDP with confounding bias would make the model more complicated, but this is worth exploring in the future. As far as we know, this is the first attempt to build a bridge between confounding and the full RL problem with observational data.
6 Conclusion and Future Work
We have introduced deconfounding reinforcement learning (DRL), a general method for addressing reinforcement learning problems with observational data in which hidden confounders affect observed rewards and actions. We have used DRL to obtain deconfounding variants of actor-critic methods and have shown that these new variants perform better than the original vanilla algorithms on new confounding benchmark problems. In the future, we will collaborate with hospitals and apply our approach to real-world medical datasets. We also hope that our work will stimulate further investigation of the connections between causal inference and RL.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 Alaa and van der Schaar (2018) Ahmed M Alaa and Mihaela van der Schaar. Bayesian nonparametric causal inference: Information rates and learning algorithms. IEEE Journal of Selected Topics in Signal Processing, 12(5):1031–1046, 2018.
 Alaa et al. (2017) Ahmed M Alaa, Michael Weisz, and Mihaela van der Schaar. Deep counterfactual networks with propensity-dropout. arXiv preprint arXiv:1706.05966, 2017.
 Allman et al. (2009) Elizabeth S Allman, Catherine Matias, John A Rhodes, et al. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099–3132, 2009.
 Angrist and Pischke (2008) Joshua D Angrist and Jörn-Steffen Pischke. Mostly harmless econometrics: An empiricist's companion. Princeton University Press, 2008.
 Atan et al. (2016) Onur Atan, William R Zame, Qiaojun Feng, and Mihaela van der Schaar. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. arXiv preprint arXiv:1612.08082, 2016.
 Bareinboim et al. (2015) Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems, pages 1342–1350, 2015.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 Buesing et al. (2019) Lars Buesing, Theophane Weber, Yori Zwols, Nicolas Heess, Sebastien Racaniere, Arthur Guez, and Jean-Baptiste Lespiau. Woulda, coulda, shoulda: Counterfactually-guided policy search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJG0voC9YQ.
 D’Amour (2018) Alexander D’Amour. (non)identification in latent confounder models. https://www.alexdamour.com/blog/public/2018/05/18/nonidentificationinlatentconfoundermodels/, 2018.
 Edwards et al. (2015) Jessie K Edwards, Stephen R Cole, and Daniel Westreich. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework. International journal of epidemiology, 44(4):1452–1459, 2015.
 Fisher (1935) Ronald Aylmer Fisher. The design of experiments. 1935.
 Forney et al. (2017) Andrew Forney, Judea Pearl, and Elias Bareinboim. Counterfactual data-fusion for online reinforcement learners. In International Conference on Machine Learning, pages 1156–1164, 2017.
 Gottesman et al. (2018) Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, et al. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
 Ha and Schmidhuber (2018) David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
 Hernán and Robins (2018) MA Hernán and JM Robins. Causal Inference. Boca Raton: Chapman and Hall/CRC, forthcoming, 2018.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krishnan et al. (2015) Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Krishnan et al. (2017) Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.
 Kuroki and Pearl (2014) Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Levine (2018) Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 (24) Qing Liu, Katharine Henry, Yanbo Xu, and Suchi Saria. Using causal inference to estimate whatif outcomes for targeting treatments.
 Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latentvariable models. In Advances in Neural Information Processing Systems, pages 6446–6456, 2017.
 Maddala and Lahiri (1992) Gangadharrao Soundaryarao Maddala and Kajal Lahiri. Introduction to econometrics, volume 2. Macmillan New York, 1992.
 Marik (2015) PE Marik. The demise of early goal-directed therapy for severe sepsis and septic shock. Acta Anaesthesiologica Scandinavica, 59(5):561–567, 2015.
 Miao et al. (2018) Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993, 2018.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Montgomery et al. (2000) Mark R Montgomery, Michele Gragnolati, Kathleen A Burke, and Edmundo Paredes. Measuring living standards with proxy variables. Demography, 37(2):155–174, 2000.
 Paxton et al. (2013) Chris Paxton, Alexandru Niculescu-Mizil, and Suchi Saria. Developing predictive models using electronic medical records: challenges and pitfalls. In AMIA Annual Symposium Proceedings, volume 2013, page 1109. American Medical Informatics Association, 2013.
 Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
 Pearl (2012) Judea Pearl. On measurement bias in causal inference. arXiv preprint arXiv:1203.3504, 2012.
 Pearl and Mackenzie (2018) Judea Pearl and Dana Mackenzie. The Book of Why. Allen Lane, 2018.
 Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT Press, 2017.
 Prasad et al. (2017) Niranjani Prasad, LiFang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300, 2017.
 Raghu et al. (2017a) Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017a.
 Raghu et al. (2017b) Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017b.
 Ramoly et al. (2017) Nathan Ramoly, Amel Bouzeghoub, and Beatrice Finance. A causal multi-armed bandit approach for domestic robots' failure avoidance. In International Conference on Neural Information Processing, pages 90–99. Springer, 2017.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Riedmiller (2005) Martin Riedmiller. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.
 Saria et al. (2010) Suchi Saria, Daphne Koller, and Anna Penn. Learning individual and population level traits from clinical temporal data. In Proceedings of Neural Information Processing Systems, pages 1–9. Citeseer, 2010.
 Schölkopf et al. (2016) B. Schölkopf, D. Hogg, D. Wang, D. Foreman-Mackey, D. Janzing, C. J. Simon-Gabriel, and J. Peters. Modeling confounding by half-sibling regression. Proceedings of the National Academy of Science, 113(27):7391–7398, 2016. URL http://www.pnas.org/content/113/27/7391.full.
 Schulam and Saria (2017) Peter Schulam and Suchi Saria. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pages 1697–1708, 2017.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 Sen et al. (2016) Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An NMF approach. arXiv preprint arXiv:1606.00119, 2016.
 Shortreed et al. (2011) Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau, and Susan A Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 84(1–2):109–136, 2011.
 Soleimani et al. (2017) Hossein Soleimani, Adarsh Subbaswamy, and Suchi Saria. Treatmentresponse models for counterfactual reasoning with continuoustime, continuousvalued interventions. arXiv preprint arXiv:1704.02038, 2017.
 Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT Press, 1998.
 Tran et al. (2015) Dustin Tran, Rajesh Ranganath, and David M Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
 Wooldridge (2009) Jeffrey M Wooldridge. On estimating firm-level production functions using proxy variables to control for unobservables. Economics Letters, 104(3):112–114, 2009.
 Zaremba and Sutskever (2014) Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
 Zhang and Bareinboim (2017) Junzhe Zhang and Elias Bareinboim. Transfer learning in multi-armed bandit: a causal approach. In Proceedings of the 16th Conference on Autonomous Agents and Multi-Agent Systems, pages 1778–1780. International Foundation for Autonomous Agents and Multiagent Systems, 2017.