M3E2: Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation

by Raquel Aoki, et al.

This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 is robust to multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the true treatment effects. The code is available at github.com/raquelaoki/M3E2.






1 Introduction

Consider an experiment to verify a drug’s efficacy for an aggressive cancer type. It would be extremely unethical to split the patients into two groups and give only a placebo as drug therapy to one of them, considering that the lack of proper medication could lead the patients to death. A more suitable approach for such experiments is adopting causal inference methods for observational studies. In observational studies, the counterfactual data is missing due to several factors, such as cost or ethical concerns. The core idea of causal inference methods for observational data is very similar: overcome the lack of counterfactual data by estimating the conditional treatment effects. However, different methods are based on different assumptions. Some of the common assumptions among these methods concern the nature of the treatment (binary versus continuous), the number of treatments (single versus multiple), the occurrence of a single treatment at any given time, and causal sufficiency.

Our work focuses on developing a causal inference method to estimate treatment effects in observational studies with realistic assumptions. A typical example is an application with millions of covariates, and yet missing confounders, and several continuous (or sometimes binary) treatments applied simultaneously, such as studies about adverse side effects in patients under cancer treatment. These patients often receive a drug cocktail, which is considered a set of treatments in the causal inference context. Sometimes, it is possible to observe an additive effect: a single drug would not result in adverse side effects; however, the adverse side effect is observed when considering all the drugs together. Such an effect is observed because many of the drugs adopted in cancer therapy contain heavy metals, and their accumulation in the patients’ bodies might be one of the causes of the adverse side effects. Inspired by such applications, we propose a method that uses proxies to decrease confounding biases, and most importantly, can estimate the effect of multiple treatments.

There are only a few methods that can estimate the effect of multiple treatments. Hi-CI (Sharma et al. (2020)) considers and models multiple treatments but assumes that only one is assigned to a unit at any given time. The Deconfounder Algorithm (DA) (Wang and Blei (2019)), a probabilistic graphical model, is our main baseline. DA works for multiple continuous and binary treatments. Single-treatment effect methods, such as Dragonnet (Shi et al. (2019)), and CEVAE (Louizos et al. (2017)), can be adapted to the multiple-treatments context by fitting an independent model for each treatment. Nevertheless, similarly to the single-task versus multi-task learning approach (Ruder (2017); Vandenhende et al. (2020)), single-treatment methods may fail to capture the synergy of multiple treatments, e.g. an additive effect.

Contributions: In this work, we focus on applications with multiple treatments and potential confounding biases. The main contributions of this paper are as follows:

  • We propose M3E2, a method to estimate multiple treatment effects with a multi-task learning neural network architecture.

  • We validate M3E2 in three synthetic datasets where the true causal effects are known. We also compare our method with three existing methods.

2 Related Work

This work combines estimation of treatment effects and multi-task learning (MTL).

Estimating Treatment Effects BART (Chipman et al. (2010); Hill (2011)), CEVAE (Louizos et al. (2017)), and Dragonnet (Shi et al. (2019)) have explored the estimation of a single treatment effect, using Bayesian additive regression trees, VAEs, and neural networks (NN), respectively.

Hernán and Robins (2006) present an inverse propensity weighting based method, also focused on single treatments. Other methods such as the Deconfounder Algorithm (Wang and Blei (2019)), Hi-CI (Sharma et al. (2020)), and approaches based on the propensity score (Lechner (2001); Lopez and Gutman (2017); McCaffrey et al. (2013)) aim to estimate multiple treatment effects. However, many of these methods assume that only one treatment is applied at any given time. The causal sufficiency assumption also poses a challenge. Causal sufficiency assumes that all confounders, variables that affect both treatment assignment and outcomes, are observed. This assumption often fails in real-world settings. The Deconfounder Algorithm, Hi-CI, and CEVAE are robust to unobserved confounders under certain conditions. Using an approach similar to Hi-CI and CEVAE, M3E2 adopts latent variables to obtain treatment effect estimates robust to hidden confounders. Finally, M3E2 estimates the multiple treatment effects through an outcome model in a multi-task learning neural network architecture. The outcome model approach is similar to the one adopted by the Deconfounder Algorithm.

Multi-task learning MTL neural network architectures aim to simultaneously optimize a single model for two or more tasks. Hard-parameter sharing NNs, first proposed by Caruana (1993), are one of the MTL pillars. Such an architecture is composed of a set of layers shared among all tasks and a set of task-specific layers on top. From the MTL perspective, Dragonnet (Shi et al. (2019)) has a hard-parameter sharing architecture. Building upon hard-parameter sharing architectures, Ma et al. (2018) proposed the Multi-gate Mixture-of-Experts (MMoE) architecture, where each expert can be seen as a hard-parameter sharing NN, and all the experts are combined through a gate function, which is also trainable. The core idea of such an approach is to improve the model’s generalization; in addition, it allows experts to specialize in one of the tasks. To put it into perspective, an MMoE is to a hard-parameter sharing NN what a Random Forest is to a Decision Tree. Our proposed method M3E2 is built upon an MMoE (Ma et al. (2018)) architecture with an autoencoder to address hidden confounders. For more details on MTL architectures and optimization methods, Ruder (2017) and Vandenhende et al. (2020) present an overview of the most recent works. Our work expands the MMoE architecture to satisfy causal inference assumptions and estimate multiple treatment effects simultaneously.

3 Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation (M3E2)

This section describes our proposed method, the Multi-gate Mixture-of-experts for Multi-treatment Effect Estimation (M3E2). Its multi-task learning architecture allows simultaneous prediction of the multiple treatment assignments and the outcome Y.

Figure 1 illustrates the proposed neural network architecture, with an MMoE (Ma et al. (2018)) and an autoencoder as subcomponents. The autoencoder is responsible for learning latent variables, used as proxies of hidden confounders. In the MTL context, each propensity score prediction is a task, and the prediction of the outcome of interest is also a task. Therefore, if a given application has K treatments, M3E2 would have K + 1 tasks. Note that, in the testing phase, the propensity score predictions are not used (Section 1 - Sup. Material).

Figure 1: M3E2 training architecture. Shows the architecture for the training phase, where the input data and the treatment assignments are used to predict the outcome and estimate the treatment effects.

One of the strengths of M3E2 is its capacity to estimate the combined effect of a large number of treatments: the M3E2 network only grows linearly with the number of treatments, handling all potential combinations, something that other multi-treatment methods typically struggle to accomplish. Furthermore, the autoencoder can handle different data types by dividing the input covariates into two groups, X_A and X_B. While the autoencoder handles the covariates in X_A, the covariates in X_B are fed directly to the experts. For example, consider an application whose input data mixes clinical information (few columns) and gene expression (thousands of columns). In this case, splitting the data into clinical covariates and genomic information can improve the results and better handle the different data types. In applications with only one data type, assigning all covariates to either group is acceptable. In summary, the proposed architecture of M3E2 extends the MMoE architecture by adding meaning to the tasks, incorporating causal sufficiency assumptions through suitable regularizers, and adding the autoencoder component.

The following sections explain each component of the architecture in a bottom-up approach.

3.1 Autoencoder

In observational studies, it is always important to describe how one is addressing confounders. Some works assume no unobserved confounders (Shi et al. (2019)); others try to reduce the bias through latent variables (Louizos et al. (2017); Wang and Blei (2019); Sharma et al. (2020)); while others question whether the latent variables are solving the problem at all (Zheng et al. (2021)). We believe that the choice of the best approach for handling confounders is still an open question. In this paper, our focus is to propose a methodology to estimate the effects of several simultaneous treatments that is robust to large datasets. A user of M3E2 may use the autoencoder component to estimate latent variables that reduce confounding biases (Louizos et al. (2017); Wang and Blei (2019); Sharma et al. (2020)). Alternatively, a user can choose not to adopt the autoencoder (by setting the latent dimension to zero), which is the same as assuming no unobserved confounders or not using latent variables.

Following Louizos et al. (2017), Wang and Blei (2019), Sharma et al. (2020), and others (D’Amour (2019); Li et al. (2020); Ranganath and Perotte (2018); Sharma et al. (2018)), M3E2 also builds its robustness to hidden confounders through the adoption of latent variables. The overall idea of CEVAE, DA, Hi-CI, and M3E2 is very similar: learn latent variables to replace hidden confounders. The latent features Z and the treatment assignments T are used to predict the outcome Y, as shown in Equation 1, and to estimate the treatment effects (Section 3.6).

E[Y | T, Z] = f_Y(T, Z)    (1)
While these methods’ core idea is very alike, their assumptions and paths to Equation 1 differ. The DA, a multi-treatment method, uses a probabilistic factor model to estimate the latent variables. Its main assumption is that its robustness to unobserved confounders arises from them affecting multiple causes. CEVAE uses Pearl’s back-door adjustment: p(Y | do(T = t)) = ∫ p(Y | T = t, Z) p(Z) dZ, where Z represents the estimated latent variables. Through a variational autoencoder, CEVAE obtains the required distributions to estimate an approximate posterior for Z, later used to estimate the outcomes Y. CEVAE’s main limitation is that it considers only one binary treatment. Hi-CI, another multi-treatment method, uses an autoencoder to estimate Z with a loss function that enforces decorrelation. M3E2 is somewhere in between CEVAE and Hi-CI. Like CEVAE, M3E2 aims to predict p(Y | T, Z), with the difference that M3E2 does not use the propensity score as an auxiliary distribution. Instead, similarly to Hi-CI, M3E2 uses Z to predict the propensity score.

The use of latent variables to replace unobserved confounders still has its limitations. Similar to CEVAE and Hi-CI, M3E2 assumes robustness only to observed confounders in the input data or to unobserved confounders correlated with observed variates in X. Otherwise, it would be impossible for the factor models to estimate proxies for the unobserved confounders. To the best of our knowledge, DA is the only method that claims to be robust to unobserved confounders even if they are uncorrelated with observed variates in X, under the condition that these unobserved confounders affect multiple causes. However, there has been intense debate on the validity of its assumptions (Ogburn et al. (2019)).

M3E2 adopts an autoencoder with two linear encoder layers and two linear decoder layers. Consider an application with n samples, d columns in the autoencoder input X_A, and latent variable size k, so the input data is a matrix of dimension n × d. The encoder enc(X_A) returns Z, a representation of X_A in a lower-dimensional n × k space. Finally, the decoder dec(Z) returns the reconstructed data, back in the n × d space. The autoencoder’s loss is the mean squared error between the input X_A and the reconstructed input dec(enc(X_A)).
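As a rough sketch of this component (forward pass only, with hypothetical layer sizes and untrained random weights; the actual implementation in the authors’ repository may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_autoencoder(d, k, hidden=32):
    """Two linear encoder layers and two linear decoder layers."""
    return {
        "enc1": rng.normal(0.0, 0.1, (d, hidden)),
        "enc2": rng.normal(0.0, 0.1, (hidden, k)),
        "dec1": rng.normal(0.0, 0.1, (k, hidden)),
        "dec2": rng.normal(0.0, 0.1, (hidden, d)),
    }

def encode(params, X):
    # Z: lower-dimensional (n x k) representation, used as confounder proxies
    return X @ params["enc1"] @ params["enc2"]

def decode(params, Z):
    # Reconstruction, back in the original n x d space
    return Z @ params["dec1"] @ params["dec2"]

def ae_loss(X, X_hat):
    # Mean squared error between input and reconstruction
    return float(np.mean((X - X_hat) ** 2))

n, d, k = 100, 20, 5
X_a = rng.normal(size=(n, d))
params = init_autoencoder(d, k)
Z = encode(params, X_a)
loss = ae_loss(X_a, decode(params, Z))
```

In a full implementation the weights would be trained by gradient descent on this reconstruction loss jointly with the rest of the network.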

3.2 Experts and Gates

In multi-task learning, a hard-parameter sharing network is a combination of shared layers and task-specific layers. In a multi-gate mixture-of-experts (MMoE) architecture, a hard-parameter sharing network can be interpreted as a single expert. Therefore, a model with several experts is a model with several independent shared-layer components.

Ma et al. (2018) show that a set of experts can generalize better than hard-parameter sharing architectures. The MMoE architecture is more flexible than a traditional hard-parameter sharing architecture, as it allows some of the experts to specialize in a single task. The user defines the number of experts and their architecture, with the possibility of adopting experts with different architectures. In the context of multiple treatment effect estimation, the tasks are the treatment assignments and the outcome prediction. The experts’ input data is the covariates in X_B together with the latent variables Z learned by the autoencoder. The ideal number of experts depends strongly on the tasks’ nature. Homogeneous tasks might not benefit from many experts and might overfit if the number of experts is too large; heterogeneous tasks, on the other hand, tend to benefit from a larger number of experts.

The gates control the contribution of each expert to each task. There is a gate per treatment, defined as:

g^k(x) = softmax(W_k x),

where W_k is a trainable E × d matrix of weights, E is the number of experts defined by the user, and d the number of columns in the gate input x.
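A minimal sketch of one gate (names and dimensions are illustrative; the weights would be trained with the rest of the network):

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gate(x, W_k):
    """Mixture weights over the E experts for one task.
    x: (n, d) gate input; W_k: (d, E) trainable weights."""
    return softmax(x @ W_k)

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 10))    # n=8 samples, d=10 inputs
W_k = rng.normal(size=(10, 3))  # E=3 experts
g = gate(x, W_k)                # each row is a distribution over experts
```

The softmax guarantees that, for every sample, the expert weights are non-negative and sum to one.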

3.3 Task-specific layers

The task-specific layers are responsible for predicting the propensity scores and the outcome of interest Y. Each treatment’s task-specific layer receives as input a weighted average of the experts, where the weights come from the gate associated with that given task. This relationship is formally defined as:

h^k(x) = Σ_e g^k(x)_e · expert_e(x),

where expert_e(x) is the output of the e-th expert. In the training phase (Figure 1), the treatment assignment T_k is predicted with the propensity score, estimated through a classification head for discrete treatments or through the conditional density for continuous treatments (Hirano and Imbens (2004); Nie et al. (2021)). To estimate the treatment assignment of T_k, we only use h^k(x). For binary treatments, a softmax activation function outputs, for each sample, the probabilities of T_k = 0 and T_k = 1. These predictions are used to calculate the loss of the neural network, as described in Section 3.4. The propensity score losses are used to drive the learned representation to be sufficient (Assumption 1 - Section 3.5). Note that each task-specific layer can be a combination of one or more layers.

Finally, a layer with trainable weights is used to predict the final outcome. Consider the input data of this layer as [T, H], where T are the observed treatment assignments and H is the representation produced for the outcome task. The trainable layer estimates the final outcome as Ŷ = [T, H] w + b. In the testing set, if the true treatment assignment is unknown, it is also possible to adopt the propensity score predictions as a replacement for the observed treatment assignments. This final layer works as an outcome model, and the treatment effects are estimated as in Wang and Blei (2019). In our context of treatment effect estimation, the weight w_k is the treatment effect of treatment T_k.
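Putting the pieces together, a hedged sketch of how a task input is formed as an expert mixture and how the final layer exposes treatment effects as weights (all names, sizes, and the linear experts are illustrative, not the authors’ exact implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, E, K, m = 50, 10, 3, 2, 4  # samples, inputs, experts, treatments, expert output size

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = rng.normal(size=(n, d))
experts = [rng.normal(size=(d, m)) for _ in range(E)]      # one linear expert each
expert_out = np.stack([x @ We for We in experts], axis=0)  # (E, n, m)

g_y = softmax(x @ rng.normal(size=(d, E)))                 # outcome task's gate, (n, E)
# Weighted average of the experts for the outcome task:
H = np.einsum("ne,enm->nm", g_y, expert_out)               # (n, m)

T = rng.integers(0, 2, size=(n, K)).astype(float)          # observed assignments
w_t = np.array([1.5, -0.5])                                # treatment-effect weights
w_h = rng.normal(size=(m,))
y_hat = T @ w_t + H @ w_h                                  # final outcome layer
# After training, w_t[k] is read off as the effect of treatment k.
```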

Note that our approach targets additive effects, which are fairly common in real-world applications. For instance, consider adverse side effects in cancer therapy. Cancer therapy usually involves several treatments (drugs) simultaneously or over a certain period. Many of these drugs contain heavy metals, and their accumulation can result in or contribute to observed adverse side effects. The inclusion of non-linear effects on the outcome model would require a different approach to estimate the treatment effects. Furthermore, it could result in a combinatorial explosion if all potential combinations of treatments are considered, something avoided in M3E2. Therefore, the consideration of non-linear effects is left as future work.

3.4 Loss function

M3E2’s loss function is composed of:

  1. Root mean square error loss for continuous outcomes and binary cross-entropy for binary outcomes.

  2. Similar to the outcome loss functions, we adopt binary cross-entropy and/or a conditional-density loss as the propensity score losses, depending on each treatment’s data type.

  3. The mean squared reconstruction error as the autoencoder loss function.

  4. A regularization term.

Replacing the outcome and propensity score losses with other loss functions is quite straightforward.

As a reminder, while our architecture minimizes the propensity score and the outcome losses, our main target is to obtain estimates of the treatment effects that are robust to hidden confounders. The treatment effects are a co-product of this model, namely the weights associated with the treatments in the final trainable layer. This last layer is responsible for estimating the outcome Y; therefore, poor estimates of Y would directly reflect on the estimated treatment effects. Furthermore, M3E2’s robustness to confounders comes from obtaining unbiased estimates of the outcome by conditioning on the treatment assignments T and the representation H. Therefore, it is fundamental that the propensity score components ensure that the experts learn a meaningful representation of the confounders. The model also learns weights in the final layer associated with the confounders; however, these are not considered treatment effects. The final loss is shown in Equation 2.


L = L_Y + Σ_k α_k L_{T_k} + β L_AE + γ L_reg    (2)

The α_k, β, and γ in Equation 2 are weights. There are two possible ways to define these weights: adopting them as hyperparameters or through an MTL task-balancing approach. For example, Liu et al. (2019) dynamically set these weights based on the loss values to minimize negative transfer between tasks. For more heterogeneous treatments, such as a mix of continuous and discrete treatments, adopting one weight per propensity score can also help with task balancing.
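The weighted loss above can be sketched as follows (the component losses and weight names are illustrative stand-ins for the symbols in Equation 2):

```python
import math

def rmse(y, y_hat):
    """Root mean square error for a continuous outcome."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def bce(t, p, eps=1e-7):
    """Binary cross-entropy for a binary treatment's propensity score."""
    return -sum(ti * math.log(max(pi, eps)) + (1 - ti) * math.log(max(1 - pi, eps))
                for ti, pi in zip(t, p)) / len(t)

def m3e2_loss(l_y, l_props, l_ae, l_reg, alphas, beta, gamma):
    # Outcome loss + weighted per-treatment propensity losses
    # + autoencoder loss + regularization, as in Equation 2.
    return l_y + sum(a * lp for a, lp in zip(alphas, l_props)) + beta * l_ae + gamma * l_reg

total = m3e2_loss(
    l_y=rmse([1.0, 0.0], [0.5, 0.5]),
    l_props=[bce([1, 0], [0.9, 0.1])],
    l_ae=0.2, l_reg=0.01,
    alphas=[1.0], beta=0.5, gamma=0.1,
)
```

Swapping RMSE for BCE (binary outcomes) or the propensity loss for a conditional-density loss (continuous treatments) only changes the component functions, not the weighted sum.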

3.5 Assumptions

We aim to develop a method that makes more realistic assumptions, such as considering multiple treatments and being robust to their data type (binary or continuous). However, like any other causal inference method, there is no free lunch.

Assumption 1 Sufficiency of the Propensity Score (Rosenbaum and Rubin (1983); Shi et al. (2019); Nie et al. (2021)): If the average treatment effect τ is identifiable from observational data by adjusting for X, i.e., τ = E[E[Y | X, T = 1] − E[Y | X, T = 0]], then adjusting for the propensity score g(X) = P(T = 1 | X) also suffices: τ = E[E[Y | g(X), T = 1] − E[Y | g(X), T = 0]].

In other words, this assumption means that it suffices to adjust only for the information in X that is relevant for predicting the treatment T, and this information is the output of g(X). For multiple treatments, the generalization adjusts for the generalized propensity score g_k(x) = P(T_k = t_k | X = x) of each treatment k (Imbens (2000); Imai and Van Dyk (2004)).

Assumption 2 Stable Unit Treatment Value Assumption (SUTVA): the response of a particular unit depends only on the treatment(s) assigned, not the treatments of other units.

Assumption 3 Common Confounders and conditional independence: Treatments share confounders. Given the shared confounders, the treatments are independent. (Ranganath and Perotte (2018))

Assumption 4 (Weak) Ignorability Assumption: There are no uncorrelated unobserved confounders.

Assumption 2 guarantees that the units are independent and do not interfere with each other. Assumption 3 is specific to multiple treatment settings, where dependence between the treatments might result in biased treatment effect estimates. Finally, the original ignorability assumption considers that all confounders are observed. However, due to the adoption of proxies to account for correlated unobserved confounders, we make a weak ignorability assumption. The “weak” qualifier accounts for unobserved confounders correlated with observed covariates; these unobserved confounders are accounted for through the autoencoder and experts. Therefore, our method is not robust to unobserved confounders uncorrelated with observed covariates, or whose strong effect is not sufficiently captured by the learned latent variables.

3.6 Estimation of the treatment effect

There are several metrics to measure the treatment effect. The best metric choice depends on the nature of the treatment, the method’s architecture, and the outcome’s datatype. Among our baselines, CEVAE and Dragonnet use the average treatment effect (ATE), defined as ATE = E[Y(1)] − E[Y(0)]. While widely adopted, ATE is suitable for applications with only one binary treatment. As previously mentioned, the intent of M3E2 is to be a method with realistic assumptions and a wide range of possible applications. Therefore, similarly to the Deconfounder Algorithm, we adopt an outcome model (Wang and Blei (2019)), which is a more flexible alternative to estimate the treatment effects. The outcome model fits a predictive model on the augmented observed data to estimate the quantity E[Y | T, Z]. In our architecture, this is done by our last layer, as previously described in Section 3.3.
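To illustrate the outcome-model reading of the effects under the additive assumption, here is a small simulation with hypothetical data, using ordinary least squares as a stand-in for the trained final layer:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, k = 2000, 3, 4
tau = np.array([2.0, -1.0, 0.5])           # true additive treatment effects
T = rng.integers(0, 2, size=(n, K)).astype(float)
Z = rng.normal(size=(n, k))                # stands in for the confounder proxies
w_z = rng.normal(size=(k,))
y = T @ tau + Z @ w_z + 0.1 * rng.normal(size=n)

# Outcome model: regress y on the augmented data [T, Z];
# the coefficients on T estimate the treatment effects.
A = np.hstack([T, Z])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
tau_hat = coef[:K]
```

With independent T and Z as simulated here, `tau_hat` recovers `tau` up to noise; the point of M3E2 is to make this reading reliable when the proxies Z must first be learned from confounded data.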

4 Experiments

Figure 2: MAE barplots of the M3E2 and baseline methods. Smaller MAE values indicate more assertive models. The black line indicates a 95% confidence interval.

Figure 3: Copula results for one simulated dataset with 24 independent repetitions of each model. The baseline results are shown in orange, our results are in blue, and the red line shows the true effect (c-f).

In causal inference, the lack of ground truth for real-world applications poses a challenge to evaluation. Therefore, we adopt three synthetic benchmark datasets (the code to generate the datasets is available at github.com/raquelaoki/M3E2) that have known treatment effects:

  • GWAS (Wang and Blei (2019); Aoki and Ester (2021)): this semi-simulated dataset explores sparse settings, with a large number of confounders, 3-10 binary treatments, and continuous outcome.

  • Copula (Zheng et al. (2021)): this synthetic dataset contains four continuous treatments, fewer confounders, and a binary outcome. We adopt the same treatment effects described in Zheng et al. (2021).

  • IHDP (Hill (2011); Louizos et al. (2017); Shi et al. (2019)): this is a traditional benchmark dataset for single binary treatments. It has 24 covariates and a continuous outcome. We adopt this dataset to compare with some of our single-treatment estimation baselines that have been evaluated on the IHDP dataset in their publications.

For more details about the synthetic benchmark datasets, check Section 2 - Sup. Material.

We adopted the mean absolute error (MAE) to compare the real treatment effects τ with the estimated values τ̂:

MAE = (1/K) Σ_{k=1}^{K} |τ_k − τ̂_k|
The goal is to minimize the difference between estimated and true treatment effects; therefore, low MAE values are desirable.
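In code, the metric is simply (a sketch):

```python
def mae(tau_true, tau_hat):
    """Mean absolute error between true and estimated treatment effects."""
    return sum(abs(t - e) for t, e in zip(tau_true, tau_hat)) / len(tau_true)

# Example: one effect off by 0.5, the other exact.
score = mae([2.0, -1.0], [1.5, -1.0])  # 0.25
```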

We adopt an experimental setting similar to multi-task learning settings (Ma et al. (2018)), where the proposed multi-task learning method is compared with other multi-task learning methods and single-task learning models, specialized and designed to optimize a single task. Among our baselines, the DA (Wang and Blei (2019)) is the only method that can estimate the effect of multiple treatments. Similarly to M3E2, there is one model for each simulated dataset that estimates all treatment effects. CEVAE (Louizos et al. (2017)) and Dragonnet (Shi et al. (2019)), on the other hand, are single-treatment methods. In this case, we ran one independent model for each treatment. We utilized the authors’ implementations of the baselines when available. We also performed experiments with BART. However, since CEVAE and Dragonnet achieved better performance in recent publications (Louizos et al. (2017); Shi et al. (2019)), and since BART performed poorly on the GWAS and Copula datasets, we decided not to discuss BART in the experimental section. We also performed experiments with Hi-CI, a multi-treatment method. We re-implemented Hi-CI since, unfortunately, no implementation is publicly available. In our experiments, Hi-CI produced very poor results and frequently did not converge. Consequently, we did not include Hi-CI in our experimental comparison.

We generate four datasets using different seeds for each configuration explored in the GWAS and Copula datasets. For the IHDP, we adopted the ten replications previously generated by Louizos et al. (2017). We fitted the models eight times in each dataset generated, also using different seeds. These repetitions were important to remove potential biases from random seeds and to construct error bars. All methods were run on the same virtual environment setting, with 1 GPU and 16 GB. Table 1 shows the settings explored in the experiments.

Setting Data Sample Size #T #C
a GWAS 2500 - 10000 5 995
b GWAS 10000 5 100 - 1000
c GWAS 10000 3, 6, 9 1000
d Copula 2500 - 10000 4 10
e Copula 10000 4 5 - 1000
f IHDP 747 1 24
Table 1: Dataset settings explored. Example: Setting a indicates a study of the sample size effect on the GWAS dataset. We compared the models’ MAE with 2500, 5000, and 10000 samples, with a fixed number of treatments (#T = 5) and covariates (#C = 995).
Figure 4: Analysis of the settings proposed in Table 1. Small MAE is desirable.

4.1 Overall performance

Figure 2 shows, for each dataset, the average MAE across all settings. Our proposed method, M3E2, clearly outperforms all baselines on the multi-treatment datasets GWAS and Copula. DA obtains a smaller error bar, indicating more consistent treatment effect estimates. On IHDP, a single-treatment dataset, M3E2 was outperformed by Dragonnet, but was comparable to or better than the other two baselines. We note that our results for Dragonnet and CEVAE on IHDP match the results previously reported by Shi et al. (2019). The larger variance on the IHDP dataset can be explained by the larger scale of the true treatment effect.

To better illustrate the methods’ performance, Figure 3 shows a deeper analysis of the results on one of the Copula datasets. Figure 3.a shows that M3E2 has the lowest MAE values in comparison with the other baselines. Figure 3.b shows the total run time of each method in seconds. As a reminder, both DA and M3E2 fit one model for all treatments; Dragonnet and CEVAE, on the other hand, fit one model for each treatment. DA, a probabilistic model, has the fastest running time; among the NN methods, M3E2 has the lowest running time. A comparison between the true (red line) and the estimated treatment effects (dots) is shown in Figures 3.c-f. As previously mentioned, M3E2 has more assertive treatment effect estimates; on the other hand, it also has more variability when compared with the baselines. For one of the treatments, M3E2 is the only method capable of recovering the true effect; for the remaining treatment effects, CEVAE and M3E2 both show strong results, with each method closer to the true effect on some treatments than the other.

To investigate the fit of the baselines and M3E2, we checked their F1 score or the root mean square error (RMSE) when predicting the outcome of interest on the testing set. We observed fair values for all methods, indicating the methods were well fitted.

4.2 Impact of dataset parameters

The following experiments explore the impact of the dataset parameters in the estimation of the treatment effects. As Table 1 shows, we explored three parameters: the sample size, number of treatments, and number of covariates.

Figure 4 shows, in detail, the average MAE and the 95% confidence interval (colored area) for the settings shown in Table 1. Settings a and d, for instance, explore the sample size, and the results are shown in Figures 4.a and 4.d, respectively. Our proposed method, M3E2, is the method that benefits the most from increasing the sample size. Increasing the number of treatments and confounders on the GWAS simulated dataset (Figures 4.b and 4.c) did not seem to affect the baselines’ MAE or M3E2’s MAE. On the other hand, in the Copula synthetic dataset, when the number of covariates grows (Figure 4.e), there was an increase in CEVAE’s variability and in Dragonnet’s MAE. In this same setting, both DA and M3E2 are robust to the number of covariates. Note that we followed the experimental design of Zheng et al. (2021) for the Copula dataset, which did not vary the number of treatments.

5 Discussion and Conclusion

In this work, we proposed M3E2, a multi-task learning neural network model to estimate multiple treatment effects with an autoencoder component to improve its robustness to hidden confounders. The proposed neural network extends the MMoE model by adding meaning to the tasks and by incorporating causal inference assumptions in the regularizers. Compared to existing methods, M3E2 makes only a weaker ignorability assumption, which is more realistic in many applications. We experimentally compared our method against three baselines on three synthetic benchmark datasets and demonstrated its superior performance. The online repository github.com/raquelaoki/M3E2 contains the code to replicate all the experiments. We put extra effort into making the M3E2 implementation agnostic to the application; therefore, its adoption in other applications should be straightforward.

A potential negative societal impact of estimating treatment effects from observational data is the lack of a standard approach to evaluate such results in real-world applications. Most publications evaluate their methods using the method’s predictive power or expert knowledge. However, both are susceptible to mistakes because they are an indirect and non-objective measure of the treatment effect, which opens room for potential misuse. Furthermore, the success of observational causal inference methods might lead researchers to rely on such methods even in settings where they are not needed. Randomized controlled trials are still the gold standard for treatment effect estimation, and if available, should always be the first choice.

M3E2 demonstrated promising experimental results and strong evidence that the joint learning contributed to better estimates of the treatment effects; however, it still has a few limitations. As discussed in Section 3.5, our method is robust only to observed confounders and to unobserved confounders correlated with observed covariates. M3E2 inherits the limitations of other multi-task learning models, in particular the susceptibility to imbalanced tasks and overfitting. Furthermore, like many other machine learning methods, its generalization depends on the quality of its data. All strengths and limitations considered, we believe that M3E2 has a very good use case with manageable limitations. In future research, we want to explore applications with very heterogeneous treatments, such as temporal versus non-temporal treatments, and non-linear effects. We also want to apply our proposed method to a real-world dataset that explores adverse side effects in therapies for treating cancer in infants.

