1 Introduction
Estimating conditional average treatment effects (CATE) is crucial for decision-making in many application domains such as economics [Smith2005, Baum2015], marketing [Wang2015, hatt2020early], and medicine [Alaa2017]. For example, a doctor may decide on a personalized treatment plan based on patient characteristics. Extensive work focuses on using machine learning to estimate CATE [e. g., Shalit2017, Alaa2018, Yoon2018, Athey2019]. However, existing work has given little attention to settings where treatment information is missing.
Missing treatment information is common in many real-world applications [Kennedy2020]. For instance, Zhang2016 describe the Consortium on Safe Labor study, where the question of interest is the causal effect of mothers’ body mass index (BMI) on infants’ weight. In this study, BMI was missing for about half of the subjects. Another example is provided by Ahn2011, where the authors analyze the effect of physical activity on colorectal cancer using data from the Molecular Epidemiology of Colorectal Cancer study. In that study, information on physical activity was missing for around 20 % of the subjects. Further, Molinari2010 gives numerous examples of missing treatment information in survey settings. Missing treatment information can create additional challenges for treatment effect estimation. Motivated by needs in practice, the question is how one can reliably estimate CATE even when treatment information is missing.
In this paper, we analyze the problem of estimating CATE with missing treatment information. We consider a causal structure where both treatment and treatment missingness are affected by covariates. In such a setting, we have two covariate shifts: (i) a covariate shift between the treated and control population; and (ii) a covariate shift between the observed and missing treatment population. These covariate shifts increase the CATE estimation error in the covariate domains where we lack fully observed data. For instance, if low-income patients are reluctant to share information about their treatment, they can be largely underrepresented in the observed treatment population, and, hence, CATE estimation for low-income patients might be unreliable due to the lack of observed treatment data. We theoretically show the effect of these covariate shifts by deriving a generalization bound for estimating CATE in our setting with missing treatments. Our derivation shows that the expected CATE estimation error is bounded by the sum of (i) the standard estimation error; (ii) the distance between the covariate distributions of the treated and control population; and (iii) the distance between the covariate distributions of the observed and missing treatment population.
Our generalization bound reveals that we need to account for the two covariate shifts when estimating CATE in our setting with missing treatments. Motivated by our bound, we propose the missing treatment representation network (MTRNet), a novel CATE estimation algorithm for our setting with missing treatments. MTRNet makes use of representation learning [Bengio2013] and domain adaptation [Ganin2016] to address the covariate shifts while aiming at a low CATE estimation error. In particular, MTRNet uses adversarial learning to learn a balanced representation of covariates which is neither predictive of treatment nor of treatment missingness. By using balanced representations, we reduce the CATE estimation error in domains that have different covariate distributions than the one in which we fully observe data, and, thus, we improve the overall performance. In various experiments with semi-synthetic and real-world data, we demonstrate that our MTRNet yields superior CATE estimates in our setting with missing treatment information compared to the state of the art.
We list our main contributions as follows (code available at https://anonymous.4open.science/r/MTI/):

We analyze the problem of estimating CATE with missing treatment information. To the best of our knowledge, existing literature on CATE estimation has previously overlooked this setting.

We derive a generalization bound that shows different sources of error that we need to account for when estimating CATE in the setting with missing treatments.

We develop MTRNet, a novel CATE estimation algorithm based on our generalization bound. Across various experiments, we demonstrate that MTRNet provides superior CATE estimates in our setting with missing treatments compared to the state of the art.
2 Related work
We review two streams in the literature that are particularly relevant to our problem (i. e., CATE estimation with missing treatments): (i) methods for average treatment effect (ATE) estimation with missing treatments, and (ii) methods for CATE estimation in the standard setting that address the covariate shift between the treated and control population.
(i) ATE estimation with missing treatments. Only a few methods have been developed for estimating treatment effects in the setting with missing treatment information. These methods primarily focus on identification and estimation of average treatment effects. Williamson2012 proposed a doubly robust augmented inverse probability weighted estimator for ATE that deals with both confounding and missing treatments. Zhang2016 combined standard causal inference and missing data models to create a triply robust estimator for ATE. Both estimators are semiparametric and thus offer certain robustness to misspecification; however, they are restricted to standard parametric models as nuisance functions. Kennedy2020 proposed a nonparametric estimator for ATE in the missing treatment setting that can incorporate flexible machine learning models for nuisance functions.
The major difference between the existing literature on treatment effect estimation with missing treatments [Williamson2012, Zhang2016, Kennedy2020] and our work is that our focus is not on ATE but on CATE estimation. In fact, the existing methods focus only on identification and direct estimation of ATE. As such, they cannot be straightforwardly adapted to CATE estimation and are thus not applicable to our setting. To the best of our knowledge, we are the first to study CATE estimation with missing treatment information.
(ii) CATE estimation in the standard setting. Numerous methods have been proposed for estimating treatment effects [e. g., Alaa2018, Yoon2018, Athey2019, hatt2021sequential]. Here, we focus on methods that address the covariate shift between the treated and control population, as our work deals with covariate shifts for CATE estimation as well. Johansson2016 were the first to identify the covariate shift problem when estimating CATE. In order to account for the covariate shift, the authors propose an algorithm that learns a balanced representation of covariates by enforcing domain invariance through distributional distances. Shalit2017 extended their work by deriving a more flexible family of algorithms for this task. The authors also provide an intuitive generalization bound for CATE estimation that theoretically shows the effect of the covariate shift. Building on top of these works, other methods were proposed for addressing the covariate shift, some of which include learning balanced representations [Johansson2018, Assaad2021, Hatt.2022] and learning overlapping representations [Zhang2020].
In our work, we also address the covariate shift between the treated and control population since we have a CATE estimation problem. However, we consider a more general setting with missing treatments where we identify an additional covariate shift between the observed and missing treatment population that needs to be accounted for. This covariate shift, as well as the setting with missing treatments in general, was not studied by any prior work on CATE estimation. Moreover, due to having two covariate shifts, our proposed algorithm is designed to learn a covariate representation that is balanced over multiple domains. This requires a tailored approach that differs from the above methods.
3 Problem setup
Let $T \in \{0, 1\}$ denote whether a treatment is applied, and let $R \in \{0, 1\}$ denote whether the treatment information is observed ($R = 1$) or missing ($R = 0$). Further, we refer to a covariate space $\mathcal{X}$ and an outcome space $\mathcal{Y}$. We describe the outcomes of different treatments using the Rubin-Neyman potential outcomes framework [Rubin2005]. We assume a distribution $p(T, R, X, Y_0, Y_1)$ with the following variables: treatment assignment $T$, treatment missingness $R$, covariates $X \in \mathcal{X}$, and potential outcomes $Y_0, Y_1 \in \mathcal{Y}$. We observe only one potential outcome, i. e., we observe $Y$, where $Y = Y_0$ or $Y = Y_1$, depending on the assigned treatment $T$. The observed potential outcome corresponding to the assigned treatment is called the factual outcome, and the unobserved potential outcome corresponding to the other treatment possibility (i. e., $Y_{1-T}$) is called the counterfactual outcome. We have data for $n$ individuals given by $\{(x_i, r_i, t_i, y_i)\}_{i=1}^{n}$, where $t_i$ is observed only if $r_i = 1$. That is, some of the treatment information is missing.
Our objective is to estimate the conditional average treatment effect (CATE), also known as the individualized treatment effect (ITE), for an individual with covariates $x$ from data with missing treatment information. This is given by
$$\tau(x) = \mathbb{E}[Y_1 - Y_0 \mid X = x]. \tag{1}$$
We make the following assumptions about our setting with missing treatments (the causal structure of our problem is illustrated in Fig. 1):
Assumption 1 (Consistency, Positivity, Ignorability).

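(i) $Y = Y_0$ if $T = 0$, and $Y = Y_1$ if $T = 1$ (Consistency);

(ii) $0 < p(T = 1 \mid X = x) < 1$ if $p(x) > 0$ (Positivity);

(iii) $Y_0, Y_1 \perp T \mid X$ (Ignorability).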
Assumption 1 states the standard assumptions for identification of treatment effects from data. Ignorability (often referred to as the ‘no hidden confounders’ assumption [kuzmanovic2021deconfounding] or strong ignorability [hatt2021generalizing]) ensures that all variables that affect both treatment and potential outcomes are measured in the covariates.
Assumption 2 (Positivity, Ignorability).


if (Positivity);

(Ignorability).
Assumption 2 corresponds to a standard variant of the missing at random (MAR) assumption: the missingness depends only on the fully observed part of the data (in our case, on the covariates). Together, Assumption 1 and Assumption 2 allow for identification of treatment effects from data with missing treatment information [Zhang2016, Kennedy2020].
The fundamental problem of causal inference is that counterfactual outcomes (i. e., outcomes under a different treatment than the one assigned) are unobserved. Additionally, in our setting, we also have missing treatment information. Unobserved counterfactual outcomes and missing treatments preclude direct estimation of CATE from data. However, under Assumption 1, we have $\mathbb{E}[Y_t \mid X = x] = \mathbb{E}[Y \mid X = x, T = t]$, and, under Assumption 2, we have $\mathbb{E}[Y \mid X = x, T = t] = \mathbb{E}[Y \mid X = x, T = t, R = 1]$. Hence, in our setting, we can unbiasedly estimate CATE by learning a function $f_t: \mathcal{X} \to \mathcal{Y}$ for $t \in \{0, 1\}$, such that $f_t(x)$ approximates $\mathbb{E}[Y \mid X = x, T = t, R = 1]$, for which we have fully observed data. Then, we have the CATE estimator given by
$$\hat{\tau}(x) = f_1(x) - f_0(x). \tag{2}$$
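The estimation strategy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes a single real-valued covariate, a linear outcome model per treatment arm, and rows of the form `(x, r, t, y)` where `t` is `None` whenever `r == 0`. The helper names `fit_line` and `fit_cate` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y ~ a + b * x for one covariate."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_cate(data):
    """Fit one outcome model per arm on rows with observed treatment (r == 1).

    data: list of (x, r, t, y); t is None when the treatment is missing.
    Returns a function x -> f_1(x) - f_0(x).
    """
    models = {}
    for arm in (0, 1):
        rows = [(x, y) for x, r, t, y in data if r == 1 and t == arm]
        models[arm] = fit_line([x for x, _ in rows], [y for _, y in rows])
    return lambda x: models[1](x) - models[0](x)


# Toy data: control outcome y = x, treated outcome y = x + 2 (true effect 2).
data = [
    (0.0, 1, 0, 0.0), (1.0, 1, 0, 1.0), (2.0, 1, 0, 2.0),
    (0.0, 1, 1, 2.0), (1.0, 1, 1, 3.0), (2.0, 1, 1, 4.0),
    (3.0, 0, None, 5.0),  # treatment not recorded, so the row is not used
]
tau_hat = fit_cate(data)
```

Rows with missing treatment contribute nothing to either outcome model here, which is exactly why covariate regions dominated by such rows are estimated poorly.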
Learning the function for each treatment from data is a standard machine learning problem for which various methods can be used. However, while the above assumptions ensure unbiased estimation of the conditional expected outcomes under both treatments from data, the estimators could have high variance when the covariate distributions between treatment groups (treated and control) and/or between treatment missingness groups (observed and missing) differ. To illustrate this with an example, consider a job training program (treatment) offered to high- and low-skilled workers (covariate). Let us assume that low-skilled workers rarely decide to participate in the program (i. e., they are predominantly untreated) and also rarely share the information about their participation (i. e., their treatment information is predominantly missing). In this case, we can have a high error when estimating the outcomes under both treatments for low-skilled workers due to the lack of observed treatment data for this group. Moreover, we can have an even higher error when estimating the outcome under treatment since not many low-skilled workers participated in the training program (i. e., even when we observe the treatment for low-skilled workers, they are predominantly untreated).
Hence, in the presence of different covariate distributions, standard methods may give unreliable CATE estimates due to high estimation variance in the covariate domains where observed data are lacking. The problem is that we fully observe data only in the factual domain with observed treatment, but reliable CATE estimation also requires accurate outcome predictions in the missing treatment domain, as well as in the counterfactual domain. However, for both, we do not have fully observed data (i. e., we have missing treatment information and missing counterfactual outcomes, respectively). Under the causal structure in Fig. 1, the differences in the covariate distributions between these domains come from distributional differences (i) between the treated and control population, and (ii) between the observed and missing treatment population. We frame these distributional differences as covariate shifts.
Therefore, we identify two covariate shifts in our setting with missing treatments: (i) a covariate shift between the treated and control population, and (ii) a covariate shift between the observed and missing treatment population. These covariate shifts could lead to high CATE estimation errors in covariate domains where data are not fully observed. In this paper, we develop a novel CATE estimation algorithm which addresses these covariate shifts and thus provides more reliable CATE estimates by reducing the estimation error. In the following section, we first mathematically show the effect of these covariate shifts by deriving a generalization bound for CATE estimation in our setting with missing treatments. The bound then serves as a theoretical foundation for our proposed algorithm.
4 Theory: Generalization bound
Our intuition from the previous section suggests that the expected error of CATE estimation depends on three error sources: (i) the standard estimation error; (ii) the covariate shift between the treated and control population; and (iii) the covariate shift between the observed and missing treatment population. Here, we mathematically underpin this intuition and, to this end, derive a generalization bound in three steps:

Step 1. We bound the overall loss with the sum of the factual loss and the counterfactual loss (Lemma 1).

Step 2. We bound both the factual and counterfactual loss in the missing treatment domain using the corresponding losses in the observed treatment domain and the distance between the covariate distributions of the observed and missing treatment population (Lemma 2).

Step 3. We bound the counterfactual loss in the observed treatment domain using the corresponding factual loss and the distance between the covariate distributions of the treated and control population (Lemma 3).
The lemmas then imply our main theoretical result provided in Theorem 1: the expected error of CATE estimation with missing treatments is bounded by the sum of (i) the factual loss in the observed treatment domain (i. e., the standard generalization error); (ii) the covariate distribution distance between the treated and control population; and (iii) the covariate distribution distance between the observed and missing treatment population. The proofs and further details on theoretical results are in Appendix A.
In order to derive the generalization bound for CATE estimation, we define the (overall) estimation error in our setting. The standard CATE estimation error is given by the expected precision in estimation of heterogeneous effect (PEHE) [Hill2011], which is basically the mean squared error of estimating . We adjust the PEHE for our setting with missing treatments and define the PEHE loss of a function as
(3) 
We consider for , where is a representation function, and is a hypothesis defined over the representation space . Hence, we have . We further use and the pair interchangeably. We assume that the representation is a one-to-one function and define to be the inverse of , such that for all . Moreover, by mapping the covariate space with distribution onto the representation space , the representation induces a corresponding distribution over .
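As a concrete picture of this construction, the following sketch composes a shared representation with one hypothesis per treatment. All names (`phi`, `psi`, `heads`, `f`) and the affine forms are hypothetical choices made for illustration; the paper learns these functions with neural networks.

```python
def phi(x):
    """Hypothetical invertible (one-to-one) representation on the real line."""
    return 2.0 * x + 1.0


def psi(z):
    """Inverse of phi, so that psi(phi(x)) == x for every x."""
    return (z - 1.0) / 2.0


# Hypothetical hypotheses defined on the representation space, one per treatment.
heads = {
    0: lambda z: 0.5 * z,
    1: lambda z: 0.5 * z + 3.0,
}


def f(x, t):
    """Outcome prediction for treatment t: the head for t applied to phi(x)."""
    return heads[t](phi(x))


def cate(x):
    """Estimated effect as the difference of the two composed predictions."""
    return f(x, 1) - f(x, 0)
```

Both heads read the same representation, so any imbalance that the representation removes is removed for both outcome predictions at once.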
Step 1. In the first step, we bound the overall PEHE loss with a sum of losses in the factual and counterfactual domain. Let be a loss function (e. g., absolute or squared loss). Then, we define the expected loss of and for a covariates-treatment pair as
(4) 
Note that the expected loss for a given pair does not depend on treatment missingness, since is conditionally independent of given . The expected factual and counterfactual losses of and are given by
(5)  
(6) 
Lemma 1 Let be an invertible representation function and for a hypothesis. Let be the squared loss. Then, we have
(7) 
where is the minimal variance of potential outcomes as defined in Definition 8 of Appendix A.
Lemma 1 provides a bound on using the sum of the factual and counterfactual loss, i. e., and . However, the problem is that, for , we neither can estimate and from data due to missing treatment information nor can we estimate in general due to missing counterfactual outcomes. Here, our idea is to bound these inestimable terms using their estimable counterparts and corresponding distributional distances induced by the representation. Hence, in Step 2, we bound the factual and counterfactual loss in the missing treatment domain using the corresponding losses in the observed treatment domain and the distance between the observed and missing treatment population (Lemma 2). Then, in Step 3, we bound the counterfactual loss in the observed treatment domain using the factual loss in the observed treatment domain and the distance between the treated and control population (Lemma 3). The three lemmas then directly imply our final bound (Theorem 1).
Step 2. We first introduce notation for the corresponding factual and counterfactual loss in the observed and missing treatment domain. We also define a distributional distance metric. We use superscripts to denote when we condition on a given variable, e. g., . Then, the expected factual and counterfactual losses of and in the domain for (i. e., missing and observed treatment domain) are given by
(8)  
(9) 
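To measure distributional distances, we use the integral probability metric (IPM), which is a class of metrics between probability distributions [Muller1997, Sriperumbudur2012]. Let $G$ be a function family consisting of functions $g: \mathcal{S} \to \mathbb{R}$. For a pair of distributions $p_1, p_2$ over $\mathcal{S}$, the IPM is defined by
$$\mathrm{IPM}_G(p_1, p_2) = \sup_{g \in G} \left| \int_{\mathcal{S}} g(s) \, \big(p_1(s) - p_2(s)\big) \, ds \right|. \tag{10}$$
Thus, $\mathrm{IPM}_G$ is a pseudometric on the space of probability functions over $\mathcal{S}$. For a sufficiently rich function family $G$, $\mathrm{IPM}_G$ is a true metric over the corresponding set of probabilities, i. e., $\mathrm{IPM}_G(p_1, p_2) = 0 \Rightarrow p_1 = p_2$.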
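The distinction between a pseudometric and a true metric can be seen with one simple function family. As an illustrative assumption (this family is not the one used in the paper), take all linear functions with weight vector of Euclidean norm at most one. For that family, the supremum in the IPM has a closed form: it equals the Euclidean distance between the two means. The sketch below computes its empirical version; `linear_ipm` is a hypothetical helper name.

```python
import math


def mean_vector(samples):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(len(samples[0]))]


def linear_ipm(samples_p, samples_q):
    """Empirical IPM over the family {s -> <w, s> : ||w||_2 <= 1}.

    For this family the supremum is attained by w pointing along the
    difference of means, so the IPM is the norm of that difference.
    """
    mp, mq = mean_vector(samples_p), mean_vector(samples_q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mp, mq)))


observed = [(0.0, 0.0), (2.0, 0.0)]   # sample mean (1, 0)
missing = [(4.0, 4.0), (4.0, 4.0)]    # sample mean (4, 4)
spread = [(-1.0, 0.0), (3.0, 0.0)]    # sample mean (1, 0), wider spread
```

Here `linear_ipm(observed, spread)` is zero although the two samples clearly differ, which is the pseudometric behaviour: a family this small cannot separate distributions that share a mean. Richer families, such as those behind the maximum mean discrepancy or the Wasserstein distance, do separate them.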
Lemma 2 Let be an invertible representation and its inverse. Let be the distribution induced by over . Let . Let be a family of functions and the integral probability metric induced by . Let for be a hypothesis. Assume there exists a constant , such that, for , the function . Then, we have
(11)  
Step 3. The remaining inestimable term following Lemma 2 is the counterfactual loss in the observed treatment domain. However, we cannot estimate it due to missing counterfactual outcomes. Hence, in Lemma 3, we bound this term as well.
Lemma 3 Let be an invertible representation and its inverse. Let be the distribution induced by over . Let . Let be a family of functions and the integral probability metric induced by . Let for be a hypothesis. Assume there exists a constant , such that, for , the function . Then, we have
(12)  
Given the above lemmas, we state the generalization bound as the main result of our paper in Theorem 1.
Theorem 1 Let be an invertible representation and its inverse. Let be the distribution induced by over . Let . Let be a family of functions and the integral probability metric induced by . Let for be a hypothesis. Let be the squared loss function. Assume there exists a constant , such that, for , the function . Then, we have
(13)  
Theorem 1 shows that the expected CATE estimation error for a representation and hypothesis is bounded by a sum of (i) the standard generalization error for that representation; (ii) the distance between the treated and control distributions induced by the representation; and (iii) the distance between the observed and missing treatment distributions induced by the representation. The bound shows different sources of error when estimating CATE with missing treatment information, i. e., the standard generalization error and the two covariate shifts formalized using the IPM.
We make a few additional remarks regarding the derived generalization bound. The IPM terms reflect the two described covariate shifts. Both evaluate to zero when the covariate distributions are balanced with respect to treatment and treatment missingness, i. e., when covariates affect neither treatment nor treatment missingness. The choice of the function family determines how tight the bound is. For a small function family, the bound is tighter, but it could be incomputable [Shalit2017]. The IPM term that reflects the distribution imbalance with respect to treatment missingness is scaled by the probability of missingness, meaning that its relative importance depends on this probability. In other words, when we have a small probability of treatment missingness, the corresponding covariate shift between the observed and missing treatment population is relatively less important compared to the other sources of estimation error.
The derived generalization bound holds for any given representation and hypothesis that satisfy the conditions of Theorem 1. Given empirical data and a representation-hypothesis space, we can upper bound the factual loss terms in the bound with their empirical counterparts and model complexity terms by applying standard machine learning theory [Shalev2014]. This naturally leads to a CATE estimation algorithm based on representation learning that minimizes the upper bound in Eq. (13): (i) by minimizing the empirical version of the factual loss terms, and (ii) by minimizing the respective IPM terms using either the empirical IPM distances as in Shalit2017 or via adversarial learning [Ganin2016, Hatt.2022]. Here, we use adversarial learning.
5 CATE estimation algorithm
In this section, we propose the missing treatment representation network (MTRNet), our algorithm for CATE estimation in the setting with missing treatment information. The architecture of MTRNet is shown in Fig. 2. For given data , MTRNet minimizes a novel empirical loss based on our generalization bound from Theorem 1. The corresponding objective function is given by
(14)  
with , , and . In Eq. (14), we replaced the theoretical loss terms from the bound by their corresponding empirical ones. The standard generalization error is replaced by a weighted outcome prediction loss, where the weights reflect the size of the treated and control population. The IPM terms are minimized by adding a negative prediction loss for treatment, as well as for treatment missingness, each with its own prediction function. We maximize these prediction losses using adversarial learning with gradient reversal layers (GRLs) [Ganin2016]. Since the constant in the bound cannot be evaluated for a general function family [Shalit2017], we use two hyperparameters to trade off outcome prediction accuracy against reducing the respective IPM distances. This makes our objective sensitive to scaling of the representation, and as a remedy, we use batch normalization to fix the norm of the representation. We also introduce a regularization term with its own parameter for the weights of the hypothesis layers.
MTRNet outputs the optimal representation and hypothesis based on the above objective. The learning algorithm is given in Algorithm 1. To train MTRNet, we use Adam [Kingma2015] and run Algorithm 1 for a given number of iterations. The hyperparameters include: architecture (number and size of different layers), number of iterations, batch size, learning rate, dropout rate, the two trade-off hyperparameters, and the regularization parameter. We choose hyperparameters via cross-validation. Further implementation details are given in Appendix B.
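To make the shape of such an objective concrete, the sketch below evaluates it on one batch of already-computed predictions. It is a hedged illustration, not the authors' code: the dictionary keys, the names `alpha` and `beta` for the two trade-off hyperparameters, the squared outcome loss, and the sample weights (taken from the convention in Shalit2017, where each arm is weighted by the inverse of its share) are all assumptions, and the regularization term is omitted.

```python
import math


def log_loss(p, label):
    """Binary cross-entropy for a single predicted probability."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def adversarial_objective(batch, alpha, beta):
    """Weighted outcome loss minus scaled treatment and missingness losses.

    batch: list of dicts with keys
      r (1 = treatment observed), t (0/1, or None if r == 0),
      y, y_hat (factual outcome and its prediction),
      t_prob, r_prob (predicted P(T = 1) and P(R = 1) from the representation).
    """
    obs = [b for b in batch if b["r"] == 1]
    u = sum(b["t"] for b in obs) / len(obs)  # share of treated among observed
    outcome = sum(
        (b["t"] / (2 * u) + (1 - b["t"]) / (2 * (1 - u)))
        * (b["y_hat"] - b["y"]) ** 2
        for b in obs
    ) / len(obs)
    # Treatment can only be scored where it is observed.
    treat = sum(log_loss(b["t_prob"], b["t"]) for b in obs) / len(obs)
    # Missingness is known for every row.
    miss = sum(log_loss(b["r_prob"], b["r"]) for b in batch) / len(batch)
    return outcome - alpha * treat - beta * miss


batch = [
    {"r": 1, "t": 1, "y": 1.0, "y_hat": 0.0, "t_prob": 0.5, "r_prob": 0.5},
    {"r": 1, "t": 0, "y": 0.0, "y_hat": 0.0, "t_prob": 0.5, "r_prob": 0.5},
    {"r": 0, "t": None, "y": 2.0, "y_hat": 0.0, "t_prob": 0.5, "r_prob": 0.5},
]
```

The returned value is what the representation is trained to decrease. The treatment and missingness predictors are trained in the opposite direction, to decrease their own cross-entropy, and a gradient reversal layer lets both happen in a single backward pass by flipping the sign of the gradient that flows from those predictors into the representation.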
6 Experiments
In this section, we show the effectiveness of our MTRNet for CATE estimation with missing treatments using both semi-synthetic and real-world data. We demonstrate that, by addressing the covariate shifts, MTRNet reduces CATE estimation error across different covariate domains and thus provides superior overall performance compared to baseline methods.
Baselines. CATE estimation with missing treatments has been overlooked by the existing literature. Hence, appropriate baselines are missing. Instead, we need to construct baselines by combining CATE estimation methods in the standard setting with different methods for dealing with missing data. Here, we use the following CATE estimation methods: (i) linear model (OLS) fitted for each treatment group; (ii) causal forest (CF) [Athey2019]; (iii) treatment agnostic representation network (TARNet) [Shalit2017]; and (iv) counterfactual regression maximum mean discrepancy (CFRMMD) [Shalit2017]. Note that none of the above methods addresses the covariate shift between the observed and missing treatment population, since neither our setting nor this particular covariate shift was considered by the existing work.
We combine the above methods with common methods for dealing with missing data [Williamson2012]: (i) deleting data points with missing treatment (del); (ii) imputing missing treatments using a machine learning model (imp); and (iii) reweighting data points with observed treatment (after deleting those with missing treatment) by the inverse probability of treatment being observed (rew). For imputation and reweighting, we use random forests to model the respective probabilities. By combining the above CATE estimation methods with methods for dealing with missing data, we obtain 12 baselines in total. We name the baselines using the CATE method name and the method for dealing with missing data as subscript (e. g., $\text{OLS}_{\text{del}}$ means OLS combined with deletion of data points with missing treatment).
Datasets. We conduct experiments with three benchmark datasets for CATE estimation but modify them such that treatment information is partially missing. The mechanism for introducing missingness is designed such that treatment missingness depends on covariates (as in our setting, see Fig. 1). This way, we introduce both missing treatments and the covariate shift between the observed and missing treatment population. The proportion of data with missing treatment information and the magnitude of the covariate shift are each controlled by a dedicated parameter. Details are in Appendix B.
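A covariate-dependent missingness mechanism of this kind could look as follows. The actual mechanism is specified in Appendix B and is not reproduced here; the logistic form, the single covariate, and the names `missing_prob`, `mask_treatments`, `rate`, and `shift` are purely hypothetical stand-ins for the two controlling parameters.

```python
import math
import random


def missing_prob(x, rate, shift):
    """Hypothetical P(treatment missing | x), logistic in one covariate.

    rate:  missingness probability at x = 0.
    shift: slope on x; 0 means missingness is independent of the covariate,
           larger magnitudes mean a stronger covariate shift.
    """
    base = math.log(rate / (1.0 - rate))
    return 1.0 / (1.0 + math.exp(-(base + shift * x)))


def mask_treatments(rows, rate, shift, seed=0):
    """Hide treatments at random according to missing_prob.

    rows: list of (x, t, y). Returns a list of (x, r, t_or_None, y),
    where r == 0 marks a hidden treatment.
    """
    rng = random.Random(seed)
    masked = []
    for x, t, y in rows:
        r = 0 if rng.random() < missing_prob(x, rate, shift) else 1
        masked.append((x, r, t if r == 1 else None, y))
    return masked
```

With `shift = 0` every individual has the same chance of a hidden treatment, so the observed and missing populations share a covariate distribution. Increasing `shift` concentrates the missing treatments among individuals with large `x`, which produces the second covariate shift studied in the paper.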
We use the following datasets: (i) IHDP [Hill2011, Shalit2017, hatt2021estimating]: a semi-synthetic dataset with covariates from a randomized experiment and outcomes simulated using a domain-specific probabilistic model. Hence, noiseless outcomes and the true CATE are available for this dataset. (ii) Twins [Almond2005, Yoon2018, hatt2021estimating]: a semi-synthetic dataset where the treatment assignment is simulated. Here, we do not observe the true CATE but we observe both potential outcomes. (iii) Jobs [LaLonde1986, Smith2005, Shalit2017]: a real-world dataset that combines a randomized controlled trial (RCT) and a larger observational dataset. Here, we do not have information about the true CATE; however, the randomized portion of the data still allows for evaluating CATE estimation error using policy risk (explained later).
[Table 1: Results by method on IHDP, Twins, and Jobs, including MTRNet (ours). Lower is better (best in bold).]
Performance metrics. We evaluate the CATE estimation performance in different ways depending on the above datasets, i. e., depending on whether the true CATE is available. (i) IHDP: we use the empirical PEHE given by , thereby reflecting that we have access to the true CATE. (ii) Twins: we use the observed PEHE given by since we observe both potential outcomes, and , but we cannot access information on the true CATE. (iii) Jobs: we cannot evaluate the PEHE loss because we can neither access the true CATE nor the counterfactual outcomes. Instead, we use the policy risk that measures the average loss in value when treating according to the policy suggested by a CATE estimator. For a given model , we define the policy to be: treat if , and do not treat otherwise. Then, the policy risk is given by . Here, we compute the empirical policy risk using the randomized portion of the data.
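The two kinds of metric can be sketched as follows. These are illustrative versions under stated assumptions rather than the paper's exact formulas: `pehe` is the plain mean squared error between estimated and reference effects (some works report its square root), and `policy_risk` follows the form commonly used with the Jobs data after Shalit2017, which assumes outcomes in [0, 1] and data from a randomized trial.

```python
def pehe(tau_hat, tau_ref):
    """Mean squared error between estimated and reference effects.

    tau_ref is the true CATE when available (IHDP-style data) or the
    difference of the two observed potential outcomes (Twins-style data).
    """
    n = len(tau_ref)
    return sum((a - b) ** 2 for a, b in zip(tau_hat, tau_ref)) / n


def policy_risk(tau_hat, t, y):
    """One minus the estimated value of the policy 'treat if tau_hat > 0'.

    t, y: treatments and outcomes from randomized data. For each arm, the
    mean outcome is taken over individuals whose actual treatment agrees
    with the recommendation, weighted by how often the arm is recommended.
    """
    policy = [1 if th > 0 else 0 for th in tau_hat]
    value = 0.0
    for arm in (0, 1):
        recommended = [i for i, p in enumerate(policy) if p == arm]
        agree = [y[i] for i in recommended if t[i] == arm]
        if agree:
            value += (sum(agree) / len(agree)) * (len(recommended) / len(policy))
    return 1.0 - value
```

For example, with `tau_hat = [1, 1, -1, -1]`, `t = [1, 0, 0, 1]`, and `y = [1, 0, 1, 0]`, every individual whose treatment matches the recommendation has outcome 1, so the estimated policy value is 1 and the risk is 0.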
Results. Table 1 shows the performance of our MTRNet vs. the 12 baselines for different experiments using the IHDP, Twins, and Jobs datasets. We report the mean performance averaged over 10 runs with the corresponding standard deviation. For each dataset, we report the overall error, the error in the observed treatment domain, and the error in the missing treatment domain.
We make two important observations. (i) MTRNet achieves the lowest overall error across all three datasets. This shows that our algorithm is effective for CATE estimation in the setting with missing treatments. On top of that, it provides superior CATE estimates compared to the state-of-the-art baselines. (ii) The improvement in the overall CATE estimation by MTRNet comes from a substantially better performance in the missing treatment domain. Hence, by addressing the covariate shift between the observed and missing treatment population, MTRNet achieves a lower error when estimating CATE in the missing treatment domain (i. e., the covariate domain where CATE estimation is impeded due to the lack of fully observed data) compared to the baselines which ignore this covariate shift. This stresses the importance of addressing this covariate shift in settings with missing treatment information. So far, this issue has been overlooked by previous literature.
The results in Table 1 were obtained in experiments where the proportion of missing treatment data was fixed. In Fig. 3, we show the results of IHDP experiments when varying this proportion. These results show the performance of MTRNet and the four CATE estimation methods combined with a method for imputing missing treatments (similar results with deletion and reweighting are given in Appendix C). We see that, as we increase the proportion of data with missing treatment information, the performance gap between our MTRNet (in red) and the baseline methods (in blue) becomes larger. This means that addressing the covariate shift between the observed and missing treatment population becomes more important the higher the probability that treatments are missing, which is also in line with our theoretical result in Theorem 1. Hence, addressing the covariate shift between the observed and missing treatment population is essential for reliable CATE estimation in settings with missing treatments, especially in the case of large rates of missing treatments.
7 Discussion
In this paper, we analyzed CATE estimation in the setting with missing treatments, which, as shown above, presents unique challenges in the form of covariate shifts. Specifically, we identified two covariate shifts in our setting: (i) a covariate shift between the treated and control population, and (ii) a covariate shift between the observed and missing treatment population. While the covariate shift (i) has been addressed in the existing CATE estimation literature, both the setting with missing treatments and the covariate shift (ii) have been overlooked by the existing work.
We fill this research gap from both a theoretical and a practical perspective. First, we derived a generalization bound for CATE estimation with missing treatments that theoretically shows the effect of the two covariate shifts. Then, based on our bound, we proposed MTRNet, a novel CATE estimation algorithm that addresses these covariate shifts in our setting with missing treatments. We demonstrated that our MTRNet achieves superior performance in estimating CATE, especially in the missing treatment domain, since it is the only CATE estimation algorithm that addresses the covariate shift between the observed and missing treatment population. The performance gain becomes even more pronounced when the treatment missingness rate is large. The importance of our work is reflected by the omnipresence of missing treatments in real-world applications. This holds true for both observational and RCT studies. Moreover, our MTRNet has direct practical implications as it provides more reliable CATE estimates that can improve personalized decision-making in many application areas, including personalized medicine.
References
Appendix A Proof of Theorem 1
In our problem setup, we assume a distribution with the following variables: assigned treatment , treatment missingness , covariates , and potential outcomes . We observe only one of the two potential outcomes, i. e., we observe , where or , depending on the assigned treatment . The observed potential outcome corresponding to the assigned treatment is called the factual outcome, and the unobserved potential outcome corresponding to the other treatment possibility (i. e., ) is called the counterfactual outcome.
Our objective is to estimate the conditional average treatment effect (CATE) for an individual with covariates .
Definition 1 The conditional average treatment effect (CATE) for an individual with covariates is given by
We make the following assumptions needed for identification of CATE in the setting with missing treatments:
Assumption 1 (Consistency, Positivity, Ignorability).


if , and if (Consistency);

if (Positivity);

(Ignorability).
Assumption 2 (Positivity, Ignorability).


if (Positivity);

(Ignorability).
Under the above assumptions we have that . Hence, we can unbiasedly estimate CATE from data by learning a function for . However, such estimation can have high variance in the presence of covariate shifts.
In this work, we simultaneously address: (i) the covariate shift between the observed and the missing treatment population, and (ii) the covariate shift between the treated and the control population. We use a representation learning approach with , where is a representation function, and for is a hypothesis defined over the representation space . Hence, we have . Below, we define the estimator of CATE.
Definition 2 The CATE estimator for an individual with covariates is given by
The estimation error for our setting with missing treatment information is given by the expected precision in estimation of heterogeneous effect (PEHE), i. e., the mean squared error in estimating .
Definition 3 The PEHE loss of and is given by
We make the following assumption about the representation function .
Assumption 3 The representation is a differentiable, invertible function. We assume that is the image of under and define to be the inverse of , such that for all .
By mapping the covariate space onto the representation space , the representation induces a corresponding distribution .
Definition 4 For a representation function and for a distribution defined over , let be the distribution induced by over .
Let be a loss function, e. g., absolute or squared loss.
Definition 5 Let be a representation function and for a hypothesis defined over the representation space . We define the expected loss for the covariatestreatment pair as
Note that the expected loss for a given pair does not depend on treatment missingness, since we have conditional independence between and given . Next, we define losses in the factual and counterfactual domain, and the variance of with respect to the distribution .
Definition 6 The expected factual and counterfactual losses of and are given by
Definition 7 For , we define
Definition 8 The variance of with respect to the distribution is given by
and we define
Lemma 1 For any function and distribution over , we have
and
where and are with respect to the squared loss.
Proof.