1 Introduction
Randomized controlled experiments have become increasingly appealing as a way to help researchers in many fields understand the causal effect of a given treatment. For example, firms invest in A/B testing to see whether a certain homepage layout will attract more traffic. Marketers design randomized experiments to test the effectiveness of a specific incentive program in retaining customers and increasing revenue. In healthcare and sociology, scientists who wish to understand the effect of technology access and advancement on human wellbeing usually conduct their studies through randomized experiments. In general, these randomized designs guarantee that the treatment and control groups differ only randomly with regard to all background variables. This in turn allows researchers to gauge the causal impact of a treatment on a certain outcome variable. One major challenge, however, is the cost of field experiments, which tend to be expensive and occasionally harmful to participants or platforms. Additionally, randomization is impossible in many situations. For example, if we were concerned about the negative effects of some treatment, such as drugs, tobacco smoke, or alcohol, it would be unethical to deliberately assign and expose a group of individuals to this treatment. Thus, researchers must develop new techniques to estimate treatment effects using collected observational data.
Existing methods and challenges The alternatives to randomized field experiments are various computational methods that use observational data, primarily regression, matching, and weighting. In regression models, researchers adjust for confounding variables that are correlated with both the treatment and the outcome to alleviate the confounding effect. Although regression models enjoy widespread popularity, they rely heavily on the assumption of a functional form, often a linear one, that does not reflect most real-world scenarios. Matching and weighting methods were developed to tackle endogeneity issues in observational studies by balancing covariates in the treatment and control groups. While matching methods do not fully use all data, weighting overcomes this limitation by assigning every individual a weight so that all individuals can be used [40]. Both matching and weighting have been well studied as ways to replicate randomized controlled experiments as closely as possible, and their popularity is reflected in research across disciplines such as statistics, economics, political science, sociology, and business.
Current weighting methods for treatment effect estimation are often built upon the idea of propensity scores or covariate balance. In 1983, Rosenbaum and Rubin introduced the propensity score, which summarizes all covariates into one scalar representing the probability of treatment assignment conditioned on observed covariates [35]. A typical propensity score-based weighting method for treatment effect estimation is inverse probability of treatment weighting (IPTW) [29], which has been theoretically proven to yield an unbiased estimate [20]. Thus, the key challenge becomes estimating the propensity score accurately. To estimate the propensity score, researchers often assume a simple, pre-specified functional form to model the relationship of covariates to the treatment assignment; a simple logistic regression is most typically used. However, this simple assumption does not hold in real applications, and its failure causes model misspecification [11]. This misspecification in turn can easily lead to poor propensity score prediction and consequently to biased treatment effect estimation. Therefore, many researchers have proposed using machine learning methods such as Lasso, boosting regression, bagged CART, and random forest to improve propensity score estimation by modeling complicated nonlinearities.
In addition to developing propensity score-based weighting, the existing literature has also introduced statistical weighting methods that aim to directly balance covariates between treatment and control groups. These methods include entropy balancing [15], optimization-based weighting [46], empirical balancing calibration weighting [5], and many others [19, 26, 2]. One common drawback of these methods is that rather than balancing the joint distribution of covariates, their objective is only to balance the first moment (i.e., means) and sometimes the second moment or interactions of covariates. In consequence, these methods call for strong assumptions, such as linearity or specific functional forms, in order to achieve unbiased estimation of treatment effects.
Present study To address the aforementioned issue of model misspecification in weighting, in this paper we propose a new weighting-based method via distribution learning for treatment effect estimation with observational data. The fundamental idea is that when we have the true underlying distribution of covariates conditioned on treatment assignment, we can compute the ratio of the covariates' probability density in the treatment group to that in the control group, use it as a weight to balance the distribution of covariates, and thereby obtain an unbiased treatment effect estimate.
The key question then becomes how we can learn the underlying distribution of covariates in both groups. Some parametric methods can be used. For example, we can assume the data follow some commonly used probability distribution (e.g., Gaussian) and estimate the associated parameters using maximum likelihood estimation or other techniques. This can be problematic, as the true probability density function of the data is often unknown, is likely far from Gaussian or any other well-known distribution, and may not even have an analytical form. Therefore, to approximate the true distribution from data, we borrow the idea of constructing flexible, learnable probability distributions via transformations [41]. Specifically, we start with a simple base distribution and apply a series of invertible transformations, via the change-of-variables rule, to approach a distribution that is as close as possible to the truth. Previous studies have shown that the transformed distribution can converge to the truth, as long as there are infinitely many observations and the transformations have enough expressive power [41, 32]. However, finding a closed-form analytical function for a probability distribution via transformation is not easy and is sometimes impossible. Building upon universal approximation theory, we develop a data-driven method that leverages deep neural networks to learn such transformation functions.
The transformation can be computationally expensive, since it involves computing the determinant of a Jacobian. To make our method practical, we consider transformations whose Jacobian determinants are easy to compute. To this end, we choose an autoregressive transformation $x_i = \tau(z_i, c_i(x_{1:i-1}))$, where $\tau$ is the invertible transformation and $c_i$ is the autoregressive conditioner. For autoregressive transformations, Jacobians are triangular matrices whose log-determinants can be computed in linear time as the sum of the diagonal entries on a log scale, as we describe more fully in the methodology section. After approximating the distribution of covariates conditioned on whether the individual is treated, i.e., $p(x \mid T=1)$ and $p(x \mid T=0)$, we then use the density ratio $p(x \mid T=1)/p(x \mid T=0)$ as the weight for every individual in the control group to estimate the treatment effect.
To evaluate the performance of our proposed method, in this study we focus on estimating the average treatment effect on the treated (ATT) with observational data. The ATT is the average treatment effect on those subjects who ultimately received the treatment and is one of the most widely studied treatment effects; our method can also easily be extended to estimate the average treatment effect (ATE). In subsequent sections of this paper, we conduct simulation studies under model specification settings that vary with respect to the treatment assignment and outcome models. In addition, to demonstrate the effectiveness, robustness, scalability, and generalizability of our approach, we vary the dimensions of covariates, degrees of confounding, and sample sizes in our simulations. The experimental results show that our model outperforms several cutting-edge weighting-based baselines. For example, it can achieve up to 81% RMSE reduction over the best weighting-based benchmark. We explain the details of our results in the experiments section.
Unlike statistical weighting approaches, our method does not impose strong assumptions on the relationship between covariates and outcome, such as linearity and additivity. Moreover, our method calculates weights based on the learned distribution of covariates, rather than fitting a propensity score function as existing propensity score-based weighting methods do. This explains why our method performs consistently across different settings and outperforms state-of-the-art weighting-based methods. To further demonstrate the superiority and generalizability of our method, we apply our approach to a real-world dataset, the Twins data introduced by [27]. To the best of our knowledge, this is the only public dataset in which we can obtain counterfactual outcomes for all individuals and arbitrarily introduce confounding. The experimental results show that, compared to other weighting-based methods, the ATT estimated using our proposed method is more accurate, as it is closer to the actual ATT.
As previously discussed, treatment effect estimation primarily relies on weighting or outcome modeling, each of which has its own pros and cons. One natural extension of earlier research lies in combining these two approaches for better treatment effect estimation. In statistics and economics, researchers have explored this direction, developing a doubly-robust framework [34, 7] in which a weighting method is combined with outcome modeling. In this study, we also show the effectiveness of our weighting method by plugging it into the doubly-robust framework as a replacement for the weighting component. Using an advanced machine learning component, such as BART or random forest, as an alternative to regression for outcome modeling in the doubly-robust framework, our approach consistently achieves the best performance among all baseline estimators. Finally, we compare our method with existing state-of-the-art approaches, such as deep learning-based generative models [27], purely statistical methods [2], and machine learning methods [42]. Not surprisingly, our method exhibits better performance. In sum, we make four major contributions in this study.
(i) We propose a new weighting-based method via distribution learning to approximate the underlying distribution of covariates conditioned on the treatment assignment for estimating treatment effects. We believe this innovative approach could open a new perspective in studying ATT and ATE estimation using observational data.
(ii) We theoretically demonstrate that our method can be viewed as importance sampling and can obtain an unbiased estimate as long as the learned distribution is close to the truth. We also empirically validate the convergence of the learnable transformed distribution to the true distribution.
(iii) To evaluate the performance of our weighting-based method, we conduct extensive experiments on synthetic data across various settings, as well as on real data. The results show the superiority and robustness of our method compared to cutting-edge baselines.
(iv) In addition to weighting-only ATT estimation, we further demonstrate the efficacy and superiority of our method under the paradigm of the doubly-robust framework.
2 Literature Review
A large body of work in the literature addresses treatment effect estimation with observational data, especially via weighting-based methods. The methods studied can be primarily categorized into propensity score-based and covariate balance-based approaches. In this section we provide an overview of some pertinent studies; we then review the doubly-robust framework and machine learning methods used for treatment effect estimation.
Propensity score-based weighting estimators: The most well-known and widely used propensity score weighting method is inverse probability of treatment weighting (IPTW) [8, 33, 29, 17]. The weight in IPTW is set to $w_i = T_i/\hat e_i + (1-T_i)/(1-\hat e_i)$, where $\hat e_i$ is the estimated propensity score for individual $i$; this weight covers both the treatment ($T_i=1$) and control ($T_i=0$) groups and is used to estimate the ATE. Alternatively, the weight $w_i = T_i + (1-T_i)\,\hat e_i/(1-\hat e_i)$ is used to estimate the ATT [17, 20]. Often, $\hat e_i$ is estimated by a regression (e.g., logistic regression) over covariates. However, most regression models impose a strong assumption about the functional form of the true propensity score model, which may not hold in real applications. This leads to a common issue in causal inference, called model misspecification, which can easily result in poor propensity score predictions, an unbalanced covariate distribution after weighting, and finally a biased treatment effect estimate [25]. To overcome this issue, researchers have developed various advanced machine learning-based methods as alternatives to regression approaches, including Lasso [12], boosting regression [31], bagged CART, and random forest [25, 43]. These approaches have all been developed in the context of propensity score estimation to improve propensity score weighting performance. Due to the imperfect nature of prediction models, however, unbiased propensity score estimation remains unguaranteed both theoretically and practically.
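As an illustration, the IPTW weights described above take only a few lines to compute. This is our own sketch (the function names are ours, not the paper's): the ATE weight is the inverse probability of the treatment each unit actually received, and the ATT weight keeps treated units at 1 while giving control units the odds $\hat e_i/(1-\hat e_i)$.

```python
import numpy as np

def iptw_weights(t, e_hat, estimand="ATT"):
    """Inverse probability of treatment weights.

    t     : binary treatment indicators
    e_hat : estimated propensity scores P(T=1 | X)
    ATE   : every unit is weighted by the inverse probability of the
            treatment it actually received.
    ATT   : treated units keep weight 1; control units get the odds
            e/(1-e), reweighting them toward the treated covariates.
    """
    t = np.asarray(t, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    if estimand == "ATE":
        return t / e_hat + (1.0 - t) / (1.0 - e_hat)
    return t + (1.0 - t) * e_hat / (1.0 - e_hat)

def weighted_att(y, t, w):
    """ATT estimate: treated mean minus weighted control mean."""
    y, t, w = (np.asarray(a, dtype=float) for a in (y, t, w))
    return y[t == 1].mean() - np.average(y[t == 0], weights=w[t == 0])
```

Logistic regression, or any of the machine learning models mentioned above, can supply `e_hat`; misspecifying that model is exactly the failure mode discussed here.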
Covariate balance-based weighting estimators: Another stream of literature on weighting-based treatment effect estimation focuses on balancing covariates. For example, Hainmueller introduced entropy balancing, which relies on a maximum entropy reweighting scheme that calibrates unit weights to achieve prespecified balance conditions [15]. Imai and Ratkovic proposed a covariate balancing propensity score (CBPS) method, which combines treatment assignment modeling and covariate balance optimization [19]. Zubizarreta proposed stable weights of minimum variance that precisely balance the means of the observed covariates and other features of their distributions [46]. Chan et al. considered a wide class of calibration weights constructed to balance the moments of observed covariates [5]. All these statistical weighting estimators aim to achieve covariate balance between the treatment and control groups. However, in practice they can only match the first, second, and possibly higher moments of covariates, and they are not able to adjust for discrepancies in the covariates' joint distribution. Therefore, they rely on strong assumptions, such as linearity and additivity, to obtain unbiased estimates of treatment effects.
Doubly-robust framework: Existing studies have demonstrated that treatment effect estimation can be tackled either by properly controlling the treatment assignment mechanism (e.g., propensity score weighting) or by modeling the outcome function (e.g., regressions) [20, 16]. Doubly-robust estimators combine propensity score weighting with outcome modeling. They are “doubly robust” in the sense that they achieve consistency as long as either the propensity score or the outcome model is estimated consistently, and they achieve efficiency if these two components are estimated at a sufficiently fast rate [20, 2]. A simple doubly-robust method is augmented inverse probability weighting (AIPW) [34], which estimates the ATT as follows:
$$\hat\tau_{ATT} = \frac{1}{n_t}\sum_{i:\,T_i=1}\big[Y_i-\hat\mu_0(X_i)\big] \;-\; \frac{1}{n_t}\sum_{i:\,T_i=0}\frac{\hat e(X_i)}{1-\hat e(X_i)}\big[Y_i-\hat\mu_0(X_i)\big] \qquad (1)$$
where $n_t$ is the size of the treatment group and $\hat\mu_0(\cdot)$ is the fitted outcome function (e.g., a linear regression). In fact, the propensity score weighting component $\hat e(X_i)/(1-\hat e(X_i))$
in the AIPW can be substituted by other weighting estimators. For example, [2] proposed an approximate residual balancing method,
$$\hat\tau_{ATT} = \frac{1}{n_t}\sum_{i:\,T_i=1} Y_i \;-\; \Big(\frac{1}{n_t}\sum_{i:\,T_i=1} X_i^\top\hat\beta \;+\; \sum_{i:\,T_i=0}\gamma_i\big(Y_i - X_i^\top\hat\beta\big)\Big) \qquad (2)$$
where the weights $\gamma_i$ are obtained by covariate balance optimization and $\hat\beta$ is fitted by a Lasso or an elastic net. The outcome model can also be fitted using highly data-adaptive machine learning algorithms, such as random forest, Bayesian additive regression trees (BART), and boosting [24]. Hence, the doubly-robust framework can be seen as a generic framework in which different weighting methods (including propensity score weighting and covariate balance-based weighting) and different outcome modeling approaches (including linear regression, regularized regression, and other machine learning methods) can be employed.
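A minimal sketch of an AIPW-style ATT estimator in this spirit (our own illustration; the names and the exact normalization are our assumptions, not notation from the paper): residuals against a fitted control-outcome model are averaged over the treated units, and odds-weighted control residuals are subtracted.

```python
import numpy as np

def aipw_att(y, t, e_hat, mu0_hat):
    """AIPW-style estimate of the ATT.

    y       : observed outcomes
    t       : binary treatment indicators
    e_hat   : estimated propensity scores
    mu0_hat : predictions of the control-outcome model mu0(x) for all units
    Control residuals y - mu0_hat are reweighted toward the treated
    covariate distribution by the odds e/(1-e).
    """
    y, t, e_hat, mu0_hat = (np.asarray(a, dtype=float)
                            for a in (y, t, e_hat, mu0_hat))
    n_t = t.sum()
    resid = y - mu0_hat
    odds = e_hat / (1.0 - e_hat)
    return (resid[t == 1].sum() - (odds[t == 0] * resid[t == 0]).sum()) / n_t
```

If `mu0_hat` reproduces the control outcomes exactly, the control residuals vanish and the estimate equals the treated residual mean regardless of `e_hat`, which is the "doubly robust" property in miniature.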
Machine learning-based methods: As machine learning has advanced, and especially as deep neural networks have become successful in many domains, various algorithms have been developed for causal inference with observational data. Tree-based algorithms such as causal forest have been extended in the potential outcomes framework to estimate treatment effects. For example, Wager and Athey developed a nonparametric causal forest for estimating heterogeneous treatment effects that extends the widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, they showed that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution [42]. Athey and Imbens developed a recursive partitioning approach to construct trees for causal effects that allows valid inference for causal effects in randomized experiments and in observational studies satisfying unconfoundedness. Their method provides valid confidence intervals without restrictions on the number of covariates or the complexity of the data-generating process [3]. Yahav et al. introduced a tree-based approach adjusting for observable self-selection bias in intervention studies [44]. Chen et al. applied causal forest to analyze patient-level treatment effects and facilitate the implementation of a targeted outreach program [6]. Deep learning-based methods such as the balancing neural network (BNN) [37], counterfactual regression (CFR) [21], and the causal effect variational autoencoder (CEVAE) [27] have been proposed for causal effect inference by leveraging the expressive power and high flexibility of neural networks. Among these methods, CEVAE uses a variational autoencoder to learn a latent variable representation for the observed covariates in both treatment and control groups, and it estimates the treatment effect based on counterfactual prediction. To the best of our knowledge, CEVAE is the best deep neural network-based model for treatment effect estimation.
3 Methodology
In this section, we first formally define our research problem and state the necessary assumptions. We then propose a general framework for treatment effect estimation via density approximation and show that it allows us to obtain an unbiased estimator in theory. Finally, we introduce our weighting approach, DLW, under this framework.
3.1 Problem definitions and assumptions
Definition 1: Propensity score. The propensity score for an individual with covariates $X=x$ is the probability that the individual would be assigned to the treatment group ($T=1$), i.e., $e(x) = P(T=1 \mid X=x)$, where $X$ and $T$ are random variables representing an individual's covariates and treatment/control group assignment, respectively.
Definition 2: Average treatment effect on the treated (ATT). Let $X$ denote covariates, $Y(0)$ be the potential outcome when not treated, $Y(1)$ be the potential outcome when treated, and $T \in \{0, 1\}$ represent a binary treatment variable. The treatment effect for an individual is $\tau = Y(1) - Y(0)$. The average treatment effect for the entire population is $\tau_{ATE} = E[Y(1) - Y(0)]$. If focusing only on the treatment group, the goal becomes estimating the average treatment effect on the treated (ATT), formally defined as $\tau_{ATT} = E[Y(1) - Y(0) \mid T=1]$. As previously stated, we focus on ATT estimation in this study, but our proposed method can easily be extended to estimate the ATE.
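A toy numeric illustration of these definitions (entirely hypothetical numbers): with both potential outcomes written down for four units, the ATE averages the unit-level effects over everyone, while the ATT averages them over the treated units only.

```python
import numpy as np

# Hypothetical potential-outcomes table; in real data only one of
# y0/y1 is ever observed per unit, which is what makes estimation hard.
y0 = np.array([1.0, 2.0, 3.0, 4.0])     # outcomes without treatment
y1 = np.array([2.0, 4.0, 3.5, 4.5])     # outcomes with treatment
t = np.array([1, 1, 0, 0])              # who actually got treated

ate = (y1 - y0).mean()                  # population-average effect
att = (y1 - y0)[t == 1].mean()          # effect among the treated only
```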
Assumptions: When weighting methods are used to estimate treatment effects in nonexperimental studies, strongly ignorable treatment assignment and overlap are always assumed [35, 20]. That is: (1) treatment assignment $T$ is independent of the potential outcomes $(Y(0), Y(1))$ given the covariates $X$, i.e., $(Y(0), Y(1)) \perp T \mid X$; and (2) there is a nonzero probability of an individual with covariates $x$ receiving each treatment, i.e., $0 < P(T=1 \mid X=x) < 1$ for all $x$.
3.2 A generic framework of treatment effect estimation via density approximation
In this section, we first demonstrate that ATT estimation can be derived under the strongly ignorable treatment assignment assumption as follows:
$$\tau_{ATT} = E[Y(1)\mid T=1] - E[Y(0)\mid T=1] = E[Y\mid T=1] - \int E[Y(0)\mid X=x, T=1]\,p(x\mid T=1)\,dx \qquad (3)$$
By the strongly ignorable treatment assignment assumption, $E[Y(0)\mid X=x, T=1]$ should be equal to $E[Y\mid X=x, T=0]$, so the above equation can be rewritten as:
$$\tau_{ATT} = E[Y\mid T=1] - \int E[Y\mid X=x, T=0]\,\frac{p(x\mid T=1)}{p(x\mid T=0)}\,p(x\mid T=0)\,dx \qquad (4)$$
The last step tells us that $\int E[Y\mid X=x, T=0]\,p(x\mid T=1)\,dx$ can be viewed as Monte Carlo numerical integration, specifically importance sampling, where the density ratio $p(x\mid T=1)/p(x\mid T=0)$ can be used as the weight on the control group. Thus, the ATT can be obtained as long as this density ratio can be estimated. There are many parametric and nonparametric methods for probability density estimation [9, 28, 39, 30]. In this paper, we develop a data-driven method to learn this density ratio, and several important aspects of it are worth noting here. (1) $E[Y\mid T=1]$ can be estimated via the sample mean of $Y$ in the treatment group, which has been proven to be a consistent nonparametric estimator [38]. (2) In theory, using the true density ratio as the weight can be viewed as inverse probability of treatment weighting for the ATT (IPTW), based on Bayes' theorem:
$$\frac{p(x\mid T=1)}{p(x\mid T=0)} = \frac{P(T=1\mid X=x)}{P(T=0\mid X=x)}\cdot\frac{P(T=0)}{P(T=1)} = \frac{e(x)}{1-e(x)}\cdot\frac{P(T=0)}{P(T=1)} \qquad (5)$$
(3) As mentioned, the density ratio can easily be extended to ATE estimation:
$$\tau_{ATE} = \int E[Y\mid X=x, T=1]\,\frac{p(x)}{p(x\mid T=1)}\,p(x\mid T=1)\,dx - \int E[Y\mid X=x, T=0]\,\frac{p(x)}{p(x\mid T=0)}\,p(x\mid T=0)\,dx \qquad (6)$$
Similarly, $E[Y(1)]$ and $E[Y(0)]$ can be viewed as Monte Carlo integration (importance sampling), where $p(x)/p(x\mid T=1)$ and $p(x)/p(x\mid T=0)$ can be used as weights for the treatment group and the control group, respectively. By the law of total probability,
$$\frac{p(x)}{p(x\mid T=1)} = P(T=1) + P(T=0)\,\frac{p(x\mid T=0)}{p(x\mid T=1)} \qquad (7)$$
$$\frac{p(x)}{p(x\mid T=0)} = P(T=0) + P(T=1)\,\frac{p(x\mid T=1)}{p(x\mid T=0)} \qquad (8)$$
As $P(T=1)$ and $P(T=0)$ can be estimated straightforwardly using the proportions of treated and control units, the estimation of the importance weights $p(x)/p(x\mid T=1)$ and $p(x)/p(x\mid T=0)$ relies on the estimation of the density ratio $p(x\mid T=1)/p(x\mid T=0)$.
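The identity above can be checked numerically when the two conditional densities are known. In the hypothetical setup below (distributions and names are our own illustration), covariates are $N(1,1)$ among the treated and $N(0,1)$ among the controls, and the outcome is $Y = 2T + X + \varepsilon$, so the true ATT is 2; weighting controls by the exact density ratio recovers it, while the naive difference in means is biased.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

n = 200_000
x_t = rng.normal(1.0, 1.0, n)                 # covariates | T=1
x_c = rng.normal(0.0, 1.0, n)                 # covariates | T=0
y_t = 2.0 + x_t + rng.normal(0.0, 0.1, n)     # true ATT = 2
y_c = x_c + rng.normal(0.0, 0.1, n)

# Importance-sampling weights: density ratio p(x|T=1)/p(x|T=0) on controls,
# self-normalized inside np.average for numerical stability.
w = normal_pdf(x_c, 1.0) / normal_pdf(x_c, 0.0)
att_hat = y_t.mean() - np.average(y_c, weights=w)
naive = y_t.mean() - y_c.mean()               # ignores confounding
```

Here `naive` lands near 3 because treated units have systematically larger covariates; the density-ratio weights tilt the control sample toward the treated covariate distribution and remove that gap.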
3.3 DLW: our proposed weighting approach via distribution learning
Now we turn to probability density estimation for both treatment and control groups using observational data. In theory, any probability distribution can be constructed via the method of change of variables, i.e., through a sequence of invertible mappings starting from a simple base distribution like a Gaussian or a Uniform [32]. Specifically, let $x$ be a random variable (e.g., covariates) and express $x = f(z)$ as an invertible mapping from a random variable $z$ with base density $p_Z$. Then the density of $x$ can be obtained by a change of variables via the following operation [36, 4]:
$$p_X(x) = p_Z(z)\,\left|\det\frac{\partial f}{\partial z}\right|^{-1}, \qquad z = f^{-1}(x) \qquad (9)$$
Note that the computational cost of the transformation depends on the determinant of the Jacobian $\partial f/\partial z$, which can be very expensive to compute.
By repeatedly applying the rule of change of variables, the base distribution “flows” through a sequence of invertible mappings $f_1, \ldots, f_K$ and finally reaches a much more complex distribution that is close to the true underlying distribution:
$$z_K = f_K \circ \cdots \circ f_1(z_0), \qquad \log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial z_{k-1}}\right| \qquad (10)$$
Our goal is to estimate the distribution of the covariates $x$, updating the parameters involved in the mappings by maximum likelihood estimation (MLE). As previous studies have proven, when there are infinitely many observations and the transformations have enough expressive power, the transformed distribution can converge to the true distribution of $x$, and the likelihood function converges to a constant [41, 32].
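A one-dimensional sketch of this change-of-variables machinery (the transforms are chosen by us purely for illustration): flowing a standard normal through an affine map $f_1(z) = 2z + 1$ and then $f_2(u) = \sinh(u)$ gives a density whose log-likelihood is the base log-density minus the accumulated log-Jacobian terms, exactly as in the composed rule above.

```python
import numpy as np

def log_normal_pdf(z):
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_density(x):
    """log p_X(x) after flowing N(0,1) through f1(z)=2z+1, then f2(u)=sinh(u).

    Change of variables: log p_X(x) = log p_Z(z) - log|f1'(z)| - log|f2'(u)|
    with u = arcsinh(x), z = (u - 1)/2, f1' = 2, f2'(u) = cosh(u).
    """
    u = np.arcsinh(x)
    z = (u - 1.0) / 2.0
    log_det = np.log(2.0) + 0.5 * np.log1p(x ** 2)  # log cosh(u) = 0.5*log(1+x^2)
    return log_normal_pdf(z) - log_det
```

Because both transforms are monotone, the inversion is exact, and the resulting density still integrates to one even though it is far from Gaussian.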
A number of transformations have been introduced in the machine learning community, and of those we choose the autoregressive transformation for the ease of calculating the determinant of its Jacobian. The autoregressive transformation is formalized as
$$x_i = \tau(z_i,\, h_i), \qquad h_i = c_i(x_{1:i-1}) \qquad (11)$$
where $\tau$ is the invertible transformation and $c_i$ is the autoregressive conditioner. As $x_i$ depends only on $z_i$ and $x_{1:i-1}$ in autoregressive transformations, the Jacobians are triangular matrices whose log-determinants can be computed in linear time as the sum of the diagonal entries on a log scale. Previous studies have proven that this kind of transformation is sufficiently expressive to form a universal approximator for probability distributions if the autoregressive conditioner $c_i$ is characterized by a masked autoregressive neural network [14, 22] and the invertible transformation $\tau$ is characterized by another neural network [18, 23]. The outputs of $c_i$ are the weights and biases of $\tau$. Specifically, the “neural autoregressive transformation” is formalized as follows:
$$x_i = \mathrm{DNN}\big(z_i;\ \phi_i = c_i(x_{1:i-1})\big), \qquad i = 1, \ldots, d \qquad (12)$$
where $d$ is the dimension of the covariates and $i$ indexes the dimensions.
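The triangular-Jacobian claim can be verified with a toy affine autoregressive transform (our stand-in: a hand-written conditioner in place of the masked neural network, and an affine map in place of the neural transformer $\tau$): the log-determinant accumulated in one linear-time pass matches the log-determinant of the full numerical Jacobian.

```python
import numpy as np

def conditioner(x_prev):
    """Toy autoregressive conditioner for dimension i: a shift m and a
    log-scale s computed only from x_1..x_{i-1} (a masked neural net in
    the actual model)."""
    s = 0.25 * np.tanh(x_prev.sum())
    m = np.tanh(x_prev.sum())
    return m, s

def forward(x):
    """Map x -> z with an affine autoregressive transform; return z and
    log|det dz/dx|, accumulated as a sum of diagonal log-entries."""
    d = len(x)
    z = np.empty(d)
    log_det = 0.0
    for i in range(d):
        m, s = conditioner(x[:i])
        z[i] = (x[i] - m) * np.exp(-s)   # invertible given x_1..x_{i-1}
        log_det += -s                    # log of the i-th diagonal entry
    return z, log_det
```

Since $z_i$ depends only on $x_1, \ldots, x_i$, the Jacobian is lower triangular and its determinant is the product of the diagonal entries, which is what makes density evaluation cheap.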
Using the transformations described above, we can construct a universal approximator for probability distributions by repeatedly applying the rule of change of variables starting from the base distribution. Then, by maximum likelihood estimation, we can approximate $p(x \mid T=1)$ and $p(x \mid T=0)$ using all training data in the treatment and control groups, respectively. To summarize, the procedure for our distribution learning-based weighting for ATT estimation is outlined below, and the overall framework is shown in Figure 1.

I) For the treatment group, the distribution of covariates $p(x \mid T=1)$ is learned by “neural autoregressive transformations” from a simple base distribution (a $d$-dimensional Gaussian with unit variance), i.e., $x = f_K \circ \cdots \circ f_1(z)$ with $z \sim \mathcal{N}(0, I_d)$. All associated parameters are learned by MLE using data in the treatment group. Specifically, we choose the widely used stochastic gradient-based optimization method [22] to optimize the likelihood function.
II) For the control group, we follow a similar procedure, where the distribution $p(x \mid T=0)$ is learned by “neural autoregressive transformations” from a $d$-dimensional Gaussian with unit variance.
III) The density ratios $r_i = \hat p(x_i \mid T=1)/\hat p(x_i \mid T=0)$ for all individuals in the control group are calculated. Based on the derivation in Section 3.2, $E[Y(0) \mid T=1]$ can be estimated by taking the weighted average of $Y$ in the control group, with the $r_i$ being the weights.
IV) $E[Y(1) \mid T=1]$ can be estimated nonparametrically via the sample mean of $Y$ in the treatment group, and finally we estimate the ATT by:
$$\hat\tau_{ATT} = \frac{1}{n_t}\sum_{i:\,T_i=1} Y_i \;-\; \frac{\sum_{i:\,T_i=0} r_i\,Y_i}{\sum_{i:\,T_i=0} r_i} \qquad (13)$$
Note that this procedure can be applied for ATE estimation as illustrated in Section 3.2.
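To make steps I–IV concrete, here is an end-to-end sketch on one-dimensional synthetic data, with a plain Gaussian kernel density estimate standing in for the learned neural autoregressive flows (the data-generating process, bandwidth, and names are our assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

def kde_pdf(query, sample, bw=0.2):
    """Gaussian kernel density estimate -- a stand-in for the flow-based
    density estimates p(x|T=1) and p(x|T=0)."""
    diff = (query[:, None] - sample[None, :]) / bw
    return np.exp(-0.5 * diff ** 2).mean(axis=1) / (bw * np.sqrt(2.0 * np.pi))

# Synthetic observational data: confounded covariates, true ATT = 2.
n = 2000
x_t = rng.normal(1.0, 1.0, n)
x_c = rng.normal(0.0, 1.0, n)
y_t = 2.0 + x_t + rng.normal(0.0, 0.1, n)
y_c = x_c + rng.normal(0.0, 0.1, n)

# Steps I-II: estimate each group's covariate density from its own sample.
# Step III: density-ratio weights for every control unit.
w = kde_pdf(x_c, x_t) / kde_pdf(x_c, x_c)

# Step IV: treated sample mean minus the weighted control mean.
att_hat = y_t.mean() - np.average(y_c, weights=w)
```

Swapping the KDE for the learned flow densities changes only how the two densities are evaluated; the weighting and averaging steps are unchanged.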
4 Experiments
In this section, we evaluate the performance of our proposed distribution learning-based weighting method in comparison with existing methods for ATT estimation, with the goal of addressing the following questions: (Q1) How does our proposed weighting method perform compared with other weighting methods? (Q2) Can our method consistently achieve good ATT estimation performance under different model specifications with respect to treatment assignment and outcome models? (Q3) Can our method really approximate the underlying true distribution of covariates conditioned on treatment assignment? All experiments are conducted using Python and some R packages (https://CRAN.R-project.org/package=WeightIt) [40] on a machine with 32 GB of RAM. The details of hyperparameter tuning are described in the Appendix. To facilitate reproducibility, we release our code and data online (https://github.com/DLweighting/DistributionLearningbasedweighting).
4.1 Evaluation metrics
We evaluate all methods using two widely used metrics: bias and root mean squared error (RMSE), which are defined as
$$\mathrm{Bias} = \frac{1}{S}\sum_{s=1}^{S}\widehat{ATT}_s - ATT, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{S}\sum_{s=1}^{S}\big(\widehat{ATT}_s - ATT\big)^2} \qquad (14)$$
where $S$ is the number of experiments (e.g., 100 in our study), $\widehat{ATT}_s$ is the estimated ATT in the $s$-th experiment, and $ATT$ represents the true average treatment effect on the treated.
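These two metrics are straightforward to compute over the $S$ repeated experiments (a small helper of our own):

```python
import numpy as np

def bias_and_rmse(att_estimates, att_true):
    """Bias: mean estimate minus the truth. RMSE: root mean squared
    deviation of the per-experiment estimates from the truth."""
    est = np.asarray(att_estimates, dtype=float)
    bias = est.mean() - att_true
    rmse = np.sqrt(np.mean((est - att_true) ** 2))
    return bias, rmse
```

Note that an estimator can be nearly unbiased yet have a large RMSE if its estimates vary widely across experiments, which is why both metrics are reported.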
4.2 Baselines
We compare our method with other weighting-based estimators and several methods commonly used in machine learning and statistics. We also include the unadjusted estimator of the ATT, which evaluates the ATT by directly taking the average outcome of the control group as the counterfactual for the treatment group. It completely ignores the confounding bias and is defined mathematically as $\frac{1}{n_t}\sum_{i:\,T_i=1} Y_i - \frac{1}{n_c}\sum_{i:\,T_i=0} Y_i$, where $n_c$ and $n_t$ are the numbers of units in the control and treatment groups, respectively. The comparison baselines are organized as follows:

(1) First, in the weighting-only framework for ATT estimation, we compare our distribution learning-based weighting estimator with other weighting estimators, including propensity score-based and covariate balance-based weighting estimators.
(2) Second, in the doubly-robust framework, we replace the weighting component with each weighting estimator, including ours and the baselines. Combining weighting with a widely used machine learning component for outcome modeling, i.e., BART or random forest, we again show the effectiveness of our weighting estimator.
(3) Finally, we compare our method with existing frequently used methods in statistics and machine learning.
I. Weighting-only methods
1) Covariate balance-based weighting methods:
Entropy balancing estimates the ATT by using a maximum entropy reweighting scheme that balances the covariates in the treatment and control groups [15].
Covariate balancing propensity score (CBPS) weighting combines treatment assignment modeling and covariate balance optimization for ATT estimation [19].
Optimization-based weighting estimates the ATT via stable weights obtained by optimizing the balance of observed covariates [46].
Empirical balancing calibration weighting estimates the ATT by constructing a wide class of calibration weights based on the balance of the moments of observed covariates [5].
2) Propensity score-based weighting methods:
As stated in Section 2, propensity score weighting (IPTW) sets the weight $w_i = T_i + (1-T_i)\,\hat e_i/(1-\hat e_i)$ to estimate the ATT, where the propensity score is often modeled by logistic regression [40, 17]. Advanced machine learning models, including random forest, bagged CART, gradient boosting, and Lasso, are also used as alternatives to logistic regression to improve the performance of propensity score estimation.
II. Doubly-robust framework
As previously mentioned, the doubly-robust framework is another common framework that adopts weighting for ATT estimation:
$$\hat\tau_{ATT} = \frac{1}{n_t}\sum_{i:\,T_i=1}\big[Y_i-\hat\mu_0(X_i)\big] \;-\; \sum_{i:\,T_i=0} w_i\big[Y_i-\hat\mu_0(X_i)\big] \qquad (15)$$
where $n_t$ is the treatment group size, $w_i$ is the weight based on either covariate balance or the propensity score, and $\hat\mu_0(\cdot)$ is the outcome function, which can be fitted by regression or other data-adaptive algorithms. Various advanced machine learning approaches can be used to model the outcome function, and in this study we choose two widely used ones: random forest and BART. We compare the performance of different weighting methods under the doubly-robust framework, with the outcome function modeled by random forest and BART, respectively. These estimators are indexed by the outcome model (random forest or BART), with ‘*’ representing the weighting component implemented by the various weighting approaches, including our distribution learning-based weighting DLW.
III. Frequently used methods
To further demonstrate the effectiveness and usefulness of our proposed method, we also compare it against current frequently used methods in statistics and machine learning, including:

Ordinary least squares (OLS) regresses the outcome variable on the treatment assignment variable while controlling for covariates to estimate the ATT.
Propensity score matching (PSM) uses logistic regression to calculate the probability of an individual being treated and then performs 1:1 nearest-neighbor matching to select, for each treated individual, the control individual with the smallest distance from that individual [35].
Approximate residual balancing (ARB) combines balancing weights with a regularized regression adjustment [2].
Causal forest extends the widely used random forest algorithm to estimate treatment effects [42].
CEVAE estimates the ATT by latent variable representation learning and counterfactual prediction, employing a deep learning-based generative model, i.e., a variational autoencoder [27].
4.3 Experiments using synthetic data
In this section, we first introduce our generation of synthetic data and then demonstrate the effectiveness of our distribution learningbased weighting method for ATT estimation under various settings.
Dataset To demonstrate that our method is capable of handling data that are complex and close to reality, we generate synthetic data with relatively complex distributions. We first follow prior studies and draw each covariate from a three-component Gaussian mixture distribution with means of (−3, 0, 3), unit variance, and equal mixture proportions [2]. We then standardize the covariates in each dimension to ensure all covariates have zero mean and unit variance, as is common in the literature [10, 25, 2]. To show the generalizability and robustness of our method, we simulate three different model specification scenarios (i.e., linear, moderately nonlinear, and strongly nonlinear) based on existing studies [10, 46, 13, 1, 11]. We also experiment with small and large sample sizes, weak and strong confounding strengths, and low and high dimensions of covariates.
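The covariate generation just described can be sketched as follows (reading the mixture means as (−3, 0, 3); the function name and the RNG seed are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_covariates(n, d):
    """Draw an n x d covariate matrix: every entry comes from a
    three-component Gaussian mixture with means (-3, 0, 3), unit
    variance and equal mixing proportions; each dimension is then
    standardized to zero mean and unit variance."""
    means = np.array([-3.0, 0.0, 3.0])
    comp = rng.integers(0, 3, size=(n, d))   # mixture component per entry
    x = rng.normal(means[comp], 1.0)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```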
Setting 1 (Linear and additive): This very simple setting has been widely used in many prior studies. It uses a correctly specified logistic function to generate the binary treatment variable and a linear function to generate the outcome as
(16)  
Setting 2 (Moderately nonlinear and non-additive): This setting uses a moderately misspecified logistic function to generate the binary treatment variable and a moderately nonlinear function to generate the outcome as
(17)  
Setting 3 (Strongly nonlinear and non-additive): This setting uses a strongly misspecified logistic function to generate the binary treatment variable and a strongly nonlinear function to generate the outcome as
(18)  
where , and .
Results Tables 1–4 summarize our results and compare our method to the baselines in terms of ATT estimation performance under Setting 2. From these tables, we draw the following observations and findings.
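The Bias and RMSE columns in the tables are standard Monte Carlo summaries of an estimator's error across replications; a minimal sketch, assuming absolute bias and root-mean-squared error relative to the true ATT (the paper's exact aggregation may differ):

```python
import numpy as np

def bias_rmse(estimates, true_att):
    """Absolute bias and RMSE of ATT estimates across Monte Carlo replications."""
    err = np.asarray(estimates, dtype=float) - true_att
    return abs(err.mean()), np.sqrt((err ** 2).mean())
```

An unbiased but noisy estimator can thus show near-zero Bias while its RMSE stays large, which is why both columns are reported.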

Overall, compared to other covariate balance-based and propensity score-based weighting methods, our distribution learning-based weighting estimator performs best, both with and without the doubly-robust framework. It also outperforms other frequently used state-of-the-art methods that are not purely based on weighting. Specifically, our method achieves up to approximately 81% RMSE reduction over the best baseline, which answers Q1. Furthermore, it is noteworthy that doubly-robust methods outperform outcome modeling-based methods in most cases, which demonstrates the benefit of combining outcome modeling with a weighting component.
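The doubly-robust combination of an outcome model with a weighting component can be illustrated with a generic AIPW-style sketch for the ATT; this is a schematic stand-in, not the paper's exact estimator, and both the weights and the control-outcome predictions `mu0_hat` are assumed to come from whatever components are being combined.

```python
import numpy as np

def doubly_robust_att(y, t, weights, mu0_hat):
    """Doubly robust ATT: average outcome-model residual on the treated,
    minus a weighted average of residuals on the controls. `weights` may
    come from any weighting component (propensity- or distribution-based);
    `mu0_hat` is a control-outcome model's prediction for every unit."""
    treated, control = t == 1, t == 0
    w = weights[control] / weights[control].sum()     # normalize control weights
    return (y[treated] - mu0_hat[treated]).mean() - w @ (y[control] - mu0_hat[control])
```

The estimate is consistent if either the outcome model or the weights are correct, which is the usual motivation for the doubly-robust framework.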

Statistical weighting estimators based on covariate balance, i.e., , , and , all perform poorly. Their performance is similar to that of , which means they are unable to address the confounding problem. This is not surprising, because their basic assumptions are violated under nonlinear and non-additive settings. In contrast, our method balances the distribution of the covariates rather than just their moments, and thus it does not rely on such strong assumptions. Under the doubly-robust framework, these results also hold when comparing (or ) with random forest (or BART), where ‘*’ represents the weighting component.

As expected, traditional propensity score weighting based on logistic regression, denoted as , fails due to model misspecification. Machine learning-based propensity score weighting estimators, such as , and , achieve performance that is better than but clearly worse than that of our method, . The reason for this difference is that our method calculates weights based on the learned distribution of the covariates rather than on a predicted propensity score, as these machine learning methods do. These comparative results for the weighting methods also hold consistently under the doubly-robust framework.
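The idea of weighting by learned covariate distributions rather than propensity scores can be illustrated with a simple parametric stand-in: weight each control unit by the density ratio p(x | T = 1) / p(x | T = 0), so the reweighted control covariate distribution matches the treated one. The diagonal-Gaussian model below is purely illustrative; the paper learns these conditional distributions with a far more flexible model.

```python
import numpy as np

def gaussian_logpdf(X, mean, var):
    """Log-density of independent Gaussians fitted per covariate dimension."""
    return (-0.5 * ((X - mean) ** 2 / var + np.log(2 * np.pi * var))).sum(axis=1)

def distribution_ratio_weights(X, t):
    """Weight each control unit by p(x | T=1) / p(x | T=0). Both conditional
    distributions are modeled here as diagonal Gaussians for illustration only."""
    X1, X0 = X[t == 1], X[t == 0]
    lp1 = gaussian_logpdf(X0, X1.mean(axis=0), X1.var(axis=0))
    lp0 = gaussian_logpdf(X0, X0.mean(axis=0), X0.var(axis=0))
    w = np.exp(lp1 - lp0)
    return w / w.mean()                 # normalize to mean one
```

With these weights, the weighted mean of any control covariate moves toward the corresponding treated mean, which is exactly the distributional balance a propensity score model only targets indirectly.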

Furthermore, our weighting estimator improves as the sample size increases from 5,000 to 10,000, indicating a good asymptotic property. In addition, we find that the estimation error increases when (i) more confounders exist or (ii) confounding becomes stronger.

While the results reported here are for Setting 2, we obtain similar results for Setting 3 (strongly nonlinear and non-additive). For Setting 1 (linear and additive), which is rarely realistic in practice, our distribution learning-based weighting estimator also achieves robust and sufficiently good performance, though not as good as estimators specifically designed for the linear case. Please refer to the Appendix for more details.
Based on these observations and in response to Q2, we conclude that our method provides consistent and robust performance across the different settings and outperforms several state-of-the-art weighting-based methods, because it is able to learn the underlying true distribution of the covariates conditioned on the treatment assignment and thereby alleviate the model misspecification issue. In Tables 1–4, the best performance is shown in bold and the second best is underlined.
Table 1
                      d = 8                              d = 16
             N = 5,000      N = 10,000         N = 5,000      N = 10,000
Estimator    Bias   RMSE    Bias   RMSE       Bias   RMSE    Bias   RMSE
Confounding strength 0.2
2.674  2.677  2.673  2.674  6.073  6.076  6.057  6.058
2.674  2.677  2.673  2.674  6.074  6.077  6.059  6.060  
2.674  2.676  2.673  2.674  6.074  6.077  6.059  6.060  
2.674  2.676  2.673  2.674  6.075  6.078  6.058  6.060  
2.674  2.677  2.673  2.674  6.074  6.077  6.059  6.060  
2.675  2.678  2.673  2.674  6.075  6.078  6.059  6.061  
1.575  1.576  1.469  1.470  4.999  5.001  4.842  4.843  
1.456  1.458  1.324  1.325  4.866  4.868  4.663  4.665  
0.587  0.590  0.427  0.430  2.359  2.361  2.148  2.149  
2.675  2.677  2.673  2.674  6.073  6.077  6.058  6.059  
0.132  0.195  0.124  0.177  0.679  1.197  0.347  0.416  
Confounding strength 0.4
3.744  3.746  3.744  3.745  7.767  7.769  7.761  7.763
3.745  3.746  3.743  3.744  7.767  7.769  7.761  7.762  
3.745  3.746  3.743  3.744  7.766  7.769  7.761  7.762  
3.745  3.747  3.743  3.745  7.769  7.771  7.761  7.763  
3.745  3.746  3.743  3.744  7.767  7.769  7.761  7.762  
3.745  3.747  3.744  3.745  7.769  7.771  7.761  7.763  
2.369  2.370  2.217  2.218  6.720  6.721  6.574  6.576  
2.216  2.218  2.042  2.042  6.609  6.611  6.414  6.416  
0.980  0.982  0.845  0.847  3.907  3.908  3.563  3.565  
3.745  3.746  3.744  3.745  7.767  7.769  7.761  7.763  
0.198  0.221  0.177  0.188  0.843  1.396  0.398  0.690 
Table 2
                      d = 8                              d = 16
             N = 5,000      N = 10,000         N = 5,000      N = 10,000
Estimator    Bias   RMSE    Bias   RMSE       Bias   RMSE    Bias   RMSE
Confounding strength 0.2
2.674  2.677  2.673  2.674  6.073  6.076  6.057  6.058
1.426  1.427  1.265  1.265  4.863  4.865  4.638  4.639  
1.428  1.429  1.270  1.271  4.812  4.814  4.593  4.594  
1.428  1.429  1.270  1.271  4.812  4.814  4.593  4.594  
1.428  1.429  1.270  1.271  4.813  4.815  4.594  4.595  
1.428  1.429  1.270  1.271  4.812  4.814  4.593  4.594  
1.428  1.429  1.270  1.271  4.813  4.814  4.594  4.595  
1.167  1.168  1.014  1.014  4.482  4.484  4.230  4.231  
1.139  1.140  0.987  0.987  4.448  4.449  4.190  4.191  
0.948  0.950  0.832  0.833  3.584  3.586  3.413  3.414  
1.428  1.429  1.270  1.271  4.813  4.815  4.594  4.595  
0.828  0.830  0.735  0.735  2.962  2.988  2.817  2.819  
Confounding strength 0.4
3.744  3.746  3.744  3.745  7.767  7.769  7.761  7.763
2.142  2.143  1.921  1.921  6.528  6.529  6.286  6.287  
2.140  2.141  1.923  1.923  6.465  6.466  6.233  6.234  
2.140  2.141  1.923  1.923  6.465  6.466  6.233  6.234  
2.140  2.141  1.923  1.923  6.466  6.468  6.234  6.235  
2.140  2.141  1.923  1.923  6.465  6.466  6.233  6.234  
2.140  2.141  1.923  1.923  6.466  6.467  6.233  6.234  
1.785  1.786  1.566  1.567  6.124  6.125  5.856  5.857  
1.746  1.747  1.527  1.528  6.097  6.098  5.821  5.822  
1.421  1.422  1.267  1.268  5.102  5.103  4.837  4.838  
2.140  2.141  1.923  1.923  6.466  6.467  6.234  6.235  
1.237  1.238  1.112  1.112  3.946  3.988  3.746  3.753 
Table 3
                      d = 8                              d = 16
             N = 5,000      N = 10,000         N = 5,000      N = 10,000
Estimator    Bias   RMSE    Bias   RMSE       Bias   RMSE    Bias   RMSE
Confounding strength 0.2
2.674  2.677  2.673  2.674  6.073  6.076  6.057  6.058
0.128  0.136  0.085  0.089  1.409  1.415  0.926  0.929  
0.128  0.136  0.085  0.089  1.410  1.415  0.926  0.929  
0.128  0.136  0.085  0.089  1.410  1.415  0.926  0.929  
0.128  0.136  0.085  0.089  1.410  1.416  0.926  0.929  
0.128  0.136  0.085  0.089  1.410  1.415  0.926  0.929  
0.128  0.136  0.085  0.089  1.410  1.415  0.926  0.929  
0.099  0.109  0.060  0.065  1.302  1.308  0.824  0.828  
0.096  0.107  0.059  0.066  1.289  1.295  0.815  0.818  
0.081  0.094  0.054  0.060  1.029  1.036  0.614  0.618  
0.128  0.136  0.085  0.089  1.410  1.416  0.926  0.929  
0.062  0.085  0.034  0.043  0.834  0.919  0.439  0.448  
Confounding strength 0.4
3.744  3.746  3.744  3.745  7.767  7.769  7.761  7.763
0.172  0.180  0.105  0.112  2.148  2.153  1.388  1.392  
0.172  0.180  0.105  0.112  2.148  2.154  1.388  1.392  
0.172  0.180  0.105  0.112  2.148  2.154  1.388  1.392  
0.172  0.180  0.105  0.112  2.149  2.154  1.388  1.392  
0.172  0.180  0.105  0.112  2.148  2.154  1.388  1.392  
0.172  0.180  0.105  0.112  2.148  2.154  1.388  1.392  
0.133  0.143  0.070  0.079  2.020  2.026  1.265  1.269  
0.127  0.139  0.068  0.078  2.002  2.008  1.245  1.249  
0.105  0.118  0.061  0.072  1.695  1.701  0.984  0.989  
0.172  0.180  0.105  0.112  2.149  2.155  1.388  1.392  
0.085  0.102  0.038  0.054  1.202  1.242  0.611  0.640 
Table 4
                      d = 8                              d = 16
             N = 5,000      N = 10,000         N = 5,000      N = 10,000
Estimator    Bias   RMSE    Bias   RMSE       Bias   RMSE    Bias   RMSE
Confounding strength 0.2