Learning Continuous Treatment Policy and Bipartite Embeddings for Matching with Heterogeneous Causal Effects

04/21/2020 ∙ by Will Y. Zou, et al. ∙ 4

Causal inference methods are widely applied in medicine, policy, and economics. Central to these applications is the estimation of treatment effects to make decisions. Current methods make binary yes-or-no decisions based on the treatment effect of a single outcome dimension. These methods are unable to capture continuous-space treatment policies with a measure of intensity. They also lack the capacity to consider the complexity of treatment, such as matching candidate treatments with the subject. We propose to formulate the effectiveness of treatment as a parametrizable model, expanding to a multitude of treatment intensities and complexities through the continuous policy treatment function and the likelihood of matching. Our proposal to decompose treatment effect functions into effectiveness factors presents a framework to model a rich space of actions using causal inference. We utilize deep learning to optimize the desired holistic metric space instead of predicting a single-dimensional treatment counterfactual. This approach employs a population-wide effectiveness measure and significantly improves the overall effectiveness of the model. The performance of our algorithms is demonstrated with experiments. When using generic continuous-space treatments and matching architecture, we observe a 41% improvement in cost-effectiveness and a 68% improvement in treatment effect. The algorithms capture subtle variations in treatment space, structure efficient optimization techniques, and open up the arena for many applications.




1 Introduction

Given a large set of patients and medications, how does one recommend the best medication and its dosage to each patient through observation of treatment effects on a multitude of outcomes?

Past work in causal inference provides solutions through estimation of treatment effect (Künzel et al., 2017; Shalit et al., 2017; Nie and Wager, 2017). The effect on a treatment subject is addressed with a ‘counterfactual’ (Pearl, 2009) argument of what would happen if the subject were given a different treatment, quantified by the Individual Treatment Effect (ITE) function (Künzel et al., 2017; Nie and Wager, 2017).

This approach can estimate counterfactual outcomes to measure the effectiveness of treatments. However, the approach requires a prohibitively large number of models, since a subject can be matched with an exponentially large number of actions or choices, each leading to a multitude of outcomes. The complexity introduces the Curse of Dimensionality (Friedman, 1997; Bishop, 2006), resulting in data sparsity in the large parameter space. The solution calls for a generalization to continuous-space treatment variables, and accounting for shared latent factors.

We propose a solution to capture both the complexity of actions and outcomes. At the heart of this solution is Bayesian decomposition combined with deep learning methods.

We address the intractability of continuous-space integration of the Average Treatment Effect (ATE) function with Bayesian methods. Instead of integrating over the continuous space, we formulate probability density weighting with Bayesian decomposition. With multiple ATE functions, we construct a combined objective to account for the multitude of outcomes in a joint optimization problem.

We are the first to propose continuous-space treatment policy for causal inference. Instead of evaluating the counterfactual of a binary decision, we assume the treatment is given at continuous levels, such as the dosage of medication or the price of a product. We estimate the likelihood of the treatment intensity using continuous distributions and re-weight the prior. This method allows us to answer questions such as ‘what is the optimal level of treatment for a subject?’ and ‘how effective is the treatment if given at 53% as opposed to another level?’. We leverage deep learning to formulate a model for matching subjects to treatment types. Activations from hidden layers in neural networks form embeddings for subject and treatment candidates. The distance measure across embeddings is used as the matching score, as applied by learning-to-rank (Huang et al., 2013).

We demonstrate from first principles how the Average Treatment Effect (ATE) functions in causal inference can be decomposed into multiple factors, namely, a continuous policy likelihood, a matching likelihood and a subject instance prior. We also parameterize the model using deep learning techniques and construct the objective using multiple ATE functions. The combination of Bayesian decomposition and deep learning provides a highly flexible and effective framework to optimize for holistic treatment effectiveness with arbitrary objective functions.

Our key contributions can be summarized as follows:

Continuous Treatment Policy - We are the first to incorporate continuous treatment intensity into the causal machine learning framework. We formulate treatment policies as continuous variables and propose probability distributions to model them. The effectiveness measures can then be defined with arbitrary distributions that factor into the objective. The eventual aggregated objective offers training signals to determine the optimal continuous treatment level.

Bipartite Embeddings for Relevance Matching - Compared with previous research, our proposal decides which type of treatment a subject instance is matched with, rather than only scoring the subject instance for treatment. We consider the likelihood of a match from the perspective of relevance matching in bipartite instances, and formulate embedding spaces for both the subject instance and the matched instance. For example, this can be patients with medications, or customers with products.

Bayesian Decomposition for Chained Causal Factors - Instead of integrating over a series of latent variables, Bayesian decomposition offers a generic way to incorporate a chain of factors into the aggregated treatment effect function. The holistic objective can eventually be optimized with deep learning. This offers a framework for incorporating additional factors beyond continuous treatment and bipartite matching, making it adaptable to a wide range of application domains.

Deep Learning with Causal Effectiveness Objective - We derive a causal objective function to optimize for treatment effectiveness. Instead of estimating treatment effects of a single outcome, the objective aggregates contributions from multiple outcome dimensions. Effectiveness can thus be evaluated with regard to the combined objective. Further, the loss function can be utilized in deep learning models with flexible architectures. We demonstrate the algorithmic flexibility of this modeling framework through experimentation.

2 Prior Work

Causal inference has been studied in the context of treatment effect estimation (Shalit et al., 2017; Künzel et al., 2017; Nie and Wager, 2017; Johansson et al., 2016; Swaminathan and Joachims, 2015b, a), which covers a range of algorithms from meta-learners (Künzel et al., 2017) to balancing the representation space (Shalit et al., 2017). Louizos et al. (2017) discover hidden confounding factors using techniques such as variational auto-encoders, and Parbhoo et al. (2018) investigate an information bottleneck specific to causal algorithms. Lim (2018) adopted recurrent networks to study the effects of sequential treatments. There are many applications to precision medicine (Shalit et al., 2017; Lim, 2018). It is worth noting the classic work of Rubin (1974), which proposed a paradigm for treatment effect estimation whereby the outcome of the subject treatment is observed and subsequently used to fit a model for counterfactual estimation. Statistical viewpoints have been taken by Künzel et al. (2017) to decompose the learning algorithm into composite models with meta-learners. Notably, quasi-oracle estimation by Nie and Wager (2017) is effective when estimating the treatment effect in a single outcome. Recently, decision trees and random forests (Chen and Guestrin, 2016) have been applied (Rzepakowski and Jaroszewicz, 2012) as another mainstream methodology for treatment effect estimation. This includes causal trees and random forests (Wager and Athey, 2018; Athey and Imbens, 2016) and boosting (Powers et al., 2017; Hill, 2011).

Most previous methods consider the Average Treatment Effect (ATE) of a single treatment, indicated by a binary variable across the population and commonly considered with respect to a single dimension of outcome. Previous approaches are not suited for dealing with continuous treatments, multi-dimensional outcomes, or matching with treatment types. We propose a way to cope with continuous treatment variables and create an architecture for matching treatment types to subjects, integrating into a framework that is generalizable to a chain of factors using Bayesian decomposition. The framework eventually leverages deep learning to optimize for aggregated treatment effect.

3 Algorithms

3.1 Problem Statement: Matching Algorithm with Continuous Space Policy

Consider a patient who, based on her medical history and symptoms, needs to be matched with a treatment. The treatment is characterized by the symptoms it suits and by its properties. When prescribed, there is a continuous measure for how intense the treatment should be. The goal is to match the best treatment with the optimal treatment intensity. Another example is pricing at an internet company, where a user needs to be matched with any product in the on-line store, and the continuous treatment is the price to allocate for the specific user-product match.

The quasi-oracle estimation algorithm, among other meta-learner algorithms (Künzel et al., 2017), is capable of estimating treatment effects in a multitude of dimensions. However, this is inefficient due to exponentially large outcome and action spaces. The approach in this paper is to maximize an overall effectiveness measure, a key theme in applications of causal inference. We propose a causal inference paradigm to maximize the combined effectiveness of heterogeneous treatments.

The objective for our algorithm is to identify a collection of treatment sessions, each composed of a pair of subject and treatment candidate, that achieves the highest treatment effectiveness considering many outcomes. To formulate the problem, we introduce a notation of the overlined treatment effect , which represents the treatment effect function across the collection of subjects . The models will be parametrized by and effectiveness measures per session . The models output the effectiveness measures that correlate with the expectation across the collection, i.e. which can be combined into the same expression using .

Note the critical step to represent the objective in the causal deep learning paradigm as a treatment effect function parametrized by multiple deep learning models. This allows us to represent the eventual objective function as a combination of multiple .

Suppose one tries to solve the following optimization problems:

maximize (1)
minimize (2)

The first problem tries to maximize a combined treatment effect of reward minus cost weighted by an overall factor . The second represents a cost efficiency measure, with an extra weighting factor . We aim to optimize for the holistic effectiveness exemplified in these equations. The holistic objective allows us to take all subjects, treatments, and outcomes into consideration. The unconstrained optimization allows us to utilize deep learning to build flexible models and efficiently train with large-scale data.

Bayesian Decomposition with Causal Inference. Without loss of generality, we include per sample treatment propensity from causal statistics (Lunceford and Davidian, 2004) and derive without superscript as it could extend to both and . Starting from the fundamental definition of treatment effect:

The treatment effect term can be conditioned on a treatment policy. Given a policy , we can write:

Different from prior work (Lunceford and Davidian, 2004), we distinguish the Policy from the treatment cohort indicator ; the latter random variable indicates whether an instance is in the treatment cohort or in the control cohort. The Policy is applied only within the treatment cohort. The Policy can be optimized to produce the best possible outcome in the treatment cohort, while the treatment cohort variable is correlated with whether instances are assigned to the control or held-out cohort. The propensity, in the context of a possible policy evaluation, is the expected value of the treatment cohort indicator.

Following Lunceford and Davidian (2004), the expected value of the outcome of any instance can be written as the following equations. This takes the treatment cohort indicator and propensities into account:


is a propensity function . This quantity is estimated given features of the subject instance, and thus can be specific per instance, with a learnable propensity function fitted with the features and treatment cohort indicator labels. Detailed proof of the above equation is given for case: (The third step follows from the unconfoundedness assumption (Lunceford and Davidian, 2004; Nie and Wager, 2017).)
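The role of the propensity in the expected outcome can be illustrated with a short sketch of the standard inverse-propensity-weighted estimate discussed by Lunceford and Davidian (2004). The function name `ipw_ate`, its argument names, and the toy data are assumptions made for illustration; this is not the paper's derivation or code.

```python
def ipw_ate(outcomes, treated, propensities):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    outcomes:      observed outcome per instance
    treated:       1 if the instance is in the treatment cohort, else 0
    propensities:  estimated probability of being in the treatment cohort
    """
    n = len(outcomes)
    rows = list(zip(outcomes, treated, propensities))
    # Treated outcomes are up-weighted by 1 / e(x).
    treated_term = sum(t * y / e for y, t, e in rows)
    # Control outcomes are up-weighted by 1 / (1 - e(x)).
    control_term = sum((1 - t) * y / (1 - e) for y, t, e in rows)
    return (treated_term - control_term) / n


# Hypothetical toy data: randomized assignment with constant propensity 0.5.
y = [3.0, 1.0, 4.0, 2.0]
t = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
estimate = ipw_ate(y, t, e)
```

With a constant propensity and balanced cohorts, as in this toy data, the estimate coincides with the plain difference in cohort means.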

Substitute into the treatment effect definition:


The last step follows from the law of total expectation. We then expand the equation and eliminate zero terms:


We expand the expectation in the following equations. Every match is decomposed into two instances of the items being matched, which we denote with random variables and indexed by and . We define to be the overall propensity, or likelihood of being in the treatment cohort across all instances. For each user, the term is estimated from a pre-trained propensity function. In the last equation, we introduce matches of instances denoted by random variable , and use the tuple notation and to indicate each match across two distinct items.


The last two lines are obtained using Bayes Rule to rewrite the posterior with , prior multiplied by likelihood, divided by the normalization factor or partition function .

As stated before, once we parameterize the above function and combine the Treatment Effect Functions , we can compose an objective function for the learning algorithm.

Continuous treatment policy model. The key to defining an objective function composed of conditional Average Treatment Effect (ATE) representations with is finding representations for the likelihood , the probability of a policy given a match. We solve the problem with a continuous treatment policy problem formulation. That is, we assume the policy random variable to be a continuous random variable, assuming its scope to be . represents intensity of treatment. The likelihood is defined to be the family of distributions with scope that can be parameterized by :


For the quantity, given any specific match, it is then formulated as a parameterized regression model:


Feature vectors and denote features to specify the match subject and object , e.g. user and product, patient and treatment. The intuition behind this formulation is that the regression model determines hyper-parameters of the continuous distribution, which then measures the likelihood of any continuous policy value. The distribution is distinct per match, and offers a measure for the goodness of policy values. For example, when denotes the mean of a bell-shaped distribution, the regression model should give the optimal treatment intensity for the specific match of subject and object. During training, if the actual data deviates from this optimal value, its likelihood is penalized with respect to the amount of deviation. This is shown in Figure 1. Further, the functional form of can be any differentiable regression algorithm, such as a logistic regression or a multi-layer neural network.

Figure 1: Illustration of penalty of a sub-optimal policy.

For the likelihood function, we can use a distribution whose density function can be specified by the derivative of the sigmoid:


denotes the sigmoid function and is its derivative. This formulation gives the likelihood a probabilistic interpretation, such that the cumulative density of the random variable takes the form of the sigmoid (the distribution has the scope and is re-normalized), and the probability density takes a bell-shaped form. This means whenever the intensity deviates from the optimal, the likelihood function penalizes by lowering its value, positioning the highest value at the top of the bell curve, determined by Equation 11.
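The bell shape of the sigmoid-derivative density can be checked numerically. The sketch below is an illustration under stated assumptions: the names `policy_likelihood` and `mu` are hypothetical, the density is written as the sigmoid derivative centered at a predicted optimal intensity, and the re-normalization over a bounded scope mentioned in the text is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy_likelihood(p, mu):
    """Sigmoid-derivative density at intensity p, centered at mu.

    Uses the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
    No re-normalization to a bounded range is applied (assumption).
    """
    s = sigmoid(p - mu)
    return s * (1.0 - s)


mu = 0.3  # hypothetical optimal intensity from the regression model
at_optimum = policy_likelihood(0.3, mu)  # peak of the bell curve
near = policy_likelihood(0.5, mu)        # small deviation, mild penalty
far = policy_likelihood(2.0, mu)         # large deviation, strong penalty
```

The density is symmetric around `mu`, so observed intensities above and below the predicted optimum are penalized equally in this simplified form.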

This likelihood function can also be defined with the Beta distribution:


Where the random variable assumes a density of a Beta distribution with parameters which are parameterized regression functions from feature sets and . This second formulation has the interpretation for variable to be in the range of . If both parameters are limited to be above , the distribution also takes the bell-shaped form with maximum density at .
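As a rough illustration of the Beta formulation, the sketch below evaluates a Beta density and its mode. The names `beta_pdf`, `beta_mode`, `a` and `b` are assumptions; in the model the two shape parameters would come from regression functions on the match features, whereas here they are fixed constants. The mode formula (a - 1) / (a + b - 2) is the standard result for a Beta distribution with both shape parameters above 1.

```python
import math


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def beta_mode(a, b):
    """Location of maximum density; valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


# Hypothetical shape parameters, both above 1 so the density is bell-shaped.
a, b = 3.0, 2.0
mode = beta_mode(a, b)
peak_density = beta_pdf(mode, a, b)
```

Treatment intensities away from `mode` receive lower density, which plays the same penalizing role as in the sigmoid-derivative formulation.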

Parameterize subject-candidate matching model. We next parameterize the matching model. We can decompose the following equation:


A simple form of is a sigmoid-compressed version of an arbitrary differentiable regression function, and forms the base instance weighting model. With sigmoid as the non-linearity, the outputs are positive and are well-controlled below .

For the functional form of we adopt a bipartite embedding formulation. First, the input features are projected with a neural network to an embedding space for both the subject instance and the matching candidate: . In this manner, the embedding space can be learned using the eventual effectiveness measures, across the treatment subjects, e.g. patients, as well as the treatment candidates, e.g. medications. The eventual objective helps to learn the embedding sub-space in which each subject and candidate lies. The matching algorithm here is closely related to learning-to-rank models such as DSSM (Huang et al., 2013), where the candidates are first projected into a vector space before the metric distance is used to define the relevance of terms. Note that we use the same projection function in the definition of . Then, we can formulate the likelihood function:
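A minimal sketch of the bipartite embedding idea follows. It is not the paper's architecture: the single tanh layer, the weight matrices `W_subject` and `W_candidate`, the feature vectors, and the choice of a sigmoid over the inner product as the matching score are all assumptions made for illustration.

```python
import math


def project(features, weights):
    """One tanh layer mapping a feature vector to an embedding."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]


def match_score(subject_emb, candidate_emb):
    """Sigmoid of the inner product of two embeddings, a value in (0, 1)."""
    dot = sum(s * c for s, c in zip(subject_emb, candidate_emb))
    return 1.0 / (1.0 + math.exp(-dot))


# Hypothetical weights: 2-dimensional embeddings from
# 3 subject features and 2 candidate features.
W_subject = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_candidate = [[1.0, -1.0], [0.4, 0.6]]

patient = [1.0, 0.0, 2.0]
medication_a = [1.0, 0.0]
medication_b = [-1.0, 0.0]

patient_emb = project(patient, W_subject)
score_a = match_score(patient_emb, project(medication_a, W_candidate))
score_b = match_score(patient_emb, project(medication_b, W_candidate))
```

Because both sides are mapped into the same embedding space, any subject can be scored against any candidate, including pairs never observed together.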


Normalization into effectiveness measures. With the previously defined likelihood functions for both treatment policy and matching, the functions need to be normalized to offer a probabilistic interpretation.

Concretely, all output measures from the distributions and functions proposed in the above sections are positive. We sum the scores together to form the partition function, then normalize the product of prior and likelihood. This is done across all possible data instances, or matches across subject and candidate, each of which is indexed by .

Written as , the partition function normalizes parameterized likelihood. Thus, Equation 6 can be computed to define the overall objective function with:
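The normalization step can be sketched in a few lines. The function name `normalize` and the example scores are hypothetical; the sketch only shows the product of prior and likelihood being divided by its sum over all matches, as described in the text.

```python
def normalize(priors, likelihoods):
    """Return prior * likelihood / Z per match, where Z sums the products."""
    products = [pr * lk for pr, lk in zip(priors, likelihoods)]
    z = sum(products)  # partition function over all matches
    return [p / z for p in products]


# Hypothetical positive scores for three subject-candidate matches.
priors = [0.2, 0.5, 0.3]
likelihoods = [0.9, 0.1, 0.4]
weights = normalize(priors, likelihoods)
```

The resulting weights sum to one, so they can act as a distribution over matches when per-match outcomes are averaged.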


Objective Function. Combining the parameterized policy model, matching model, and normalization, we can write the eventual average treatment effect function as follows. Note that we denote all the parameterized functions with a top bar, and all of them are differentiable with respect to their parameters.


Then we substitute the expression of the ATE function back into Equations 1 and 2 to obtain the eventual objective function. Since all functions are differentiable, we can obtain parameter gradients and optimize the objective.

4 Empirical Experiments

Benchmark Models

(Our model code and data sources will be made public.)

We benchmark our method against treatment effect estimation algorithms that estimate multiple outcomes with separate models. In this context, two mainstream methodologies in treatment effect estimation are meta-learners (Künzel et al., 2017; Nie and Wager, 2017) and causal trees and forests (Wager and Athey, 2018). For each of these, we compare with the most representative algorithm known in the literature: the quasi-oracle estimation algorithm and causal forests, respectively.

Quasi-oracle estimation (R-Learner). We use linear regression (ridge regression with zero regularization) as the base estimator. Since the experiment treatments are randomly given, we use the constant treatment percentage as the propensity in the algorithm. We use the R-learner to model the gain value incrementality across treatment and control with a conditional average treatment effect function for each outcome dimension. Each sample in the test set is evaluated for the Individual Treatment Effect (ITE), and the eventual metric is computed by combining the ITE of all outcomes. For instance, in the case of maximizing Equation 1, we would train an R-learner estimator for each of the , and dimensions, then for each sample in the dataset, we compute the predictions for each of ITE , and , then compute the score for evaluation.
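To make the R-learner baseline concrete, the following sketch fits a linear treatment effect function by minimizing the R-loss of Nie and Wager (2017) with a constant propensity. It is a simplified illustration, not the benchmark code: it uses a single feature, takes the conditional mean outcome `m_hat` as given instead of estimating it with cross-fitting, and the function name `r_learner_linear` and the toy data are assumptions.

```python
def r_learner_linear(x, y, t, m_hat, e):
    """Fit tau(x) = a + b * x by least squares on the R-loss

        sum(((y - m_hat) - (t - e) * tau(x)) ** 2)

    where e is a constant propensity and m_hat estimates E[y | x].
    """
    y_res = [yi - mi for yi, mi in zip(y, m_hat)]
    # Regressors of the residualized problem: (t - e) and (t - e) * x.
    z0 = [ti - e for ti in t]
    z1 = [(ti - e) * xi for ti, xi in zip(t, x)]
    # Solve the 2x2 normal equations in closed form.
    s00 = sum(u * u for u in z0)
    s01 = sum(u * v for u, v in zip(z0, z1))
    s11 = sum(v * v for v in z1)
    r0 = sum(u * r for u, r in zip(z0, y_res))
    r1 = sum(v * r for v, r in zip(z1, y_res))
    det = s00 * s11 - s01 * s01
    a = (r0 * s11 - r1 * s01) / det
    b = (s00 * r1 - s01 * r0) / det
    return a, b


# Hypothetical noiseless data: baseline x, true effect tau(x) = 1 + 2x,
# constant propensity 0.5, so E[y | x] = x + 0.5 * (1 + 2x).
x = [0.0, 1.0, 2.0, 3.0]
t = [1, 0, 1, 0]
y = [xi + ti * (1 + 2 * xi) for xi, ti in zip(x, t)]
m_hat = [xi + 0.5 * (1 + 2 * xi) for xi in x]
a, b = r_learner_linear(x, y, t, m_hat, e=0.5)
```

One such estimator would be fitted per outcome dimension, and the per-sample predictions combined into the evaluation score.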

Causal Forest. We leverage the generalized random forest (grf) library in R (Wager and Athey, 2018) (GRF, ). In detail, we apply causal forest with 50 trees, 0.2 as alpha, 3 as the minimum node size, and 0.5 as the sample fraction. We apply the ratio of two average treatment effect functions in ranking by training two causal forests. To evaluate matches with respect to the effectiveness objective, we estimate the conditional treatment effect function for e.g. gain (), cost (), utility factor (), i.e. we train multiple Causal Forest models. For evaluation, as with the R-learner, we compute the score according to the pre-defined metric by combining ITE estimates in Equations 1 and 2. For hyper-parameters, we perform a search on deciles for parameters num_trees, min.node.size, and at 0.05 intervals for the alpha, sample.fraction parameters. We also leverage the tune.parameters option of the grf package. Eventually, we select the best parameters by performance on the validation set. (Best parameters we experimented: num_trees, alpha, min.node.size, sample.fraction.)

Simplified CT Model. We use a simple parameterization to compute one score for any match as a benchmark model. To align with the baseline and other methods in our experiments, we use a scoring function similar to logistic regression, i.e. . Note and are feature sets for subject and matching candidates, respectively. The model is trained without weight regularization. We use the Adam optimizer with learning rate 0.01 and default beta values. We compute gradients over the entire batch of data for optimization. To account for variance in parameter initialization, we take the best validation set performance out of 6 random initializations.

Continuous Treatment Policy Matching Model (CTPM). We implement our deep learning based models with TensorFlow (Abadi et al., 2016). The graph construction utilizes Bayesian decomposition and implements the subject model, bipartite embedding model, and continuous policy model, as well as normalizations using partition functions. We use two-layer neural networks with the same number of first-layer units in the subject, matching and policy models. (The number of hidden units is determined by validation results: 15 units for the Ponpare dataset and 8 units for USCensus.) The Adam optimizer is applied with learning rate 0.01 and default beta values. The batch gradient is used to run the same number of iterations as the simple CT model. (Due to the variance-bias trade-off across datasets, both CTPM and simple CT models are run for 2500 iterations for the Ponpare dataset and 650 iterations for USCensus.) We take the best validation set performance out of 6 random initializations.


Ponpare Data. The Ponpare dataset is a public dataset from a coupon website (Ponpare, ). The dataset is well-suited to evaluate our proposed methodology since it offers the discount levels of the coupons, which serve as continuous treatment intensities. Each session in the dataset records a user browsing a specific coupon, so the different coupons serve as treatment types to match with the subject. The dataset contains sessions as row items, where each instance contains the customer, the coupon browsed, a continuous discount percentage, whether or not a purchase was made, and auxiliary features on customer and coupon. We leverage the open-source feature engineering code provided by (Pon-Features, ). The causal inference scenario focuses on estimating the combined benefits when we offer a continuous and variable discount percentage given a user-coupon match. We pre-filter to the sessions where customers are below the age of 45. Due to the disproportion of positive and negative samples, we subsample 4.0% of the sessions that do not result in purchase. The eventual dataset is around samples. We utilize discount level as the continuous treatment policy, and use the median of the level to segment sessions into treatment and control groups, indicated by binary variable . Discount level is subsequently used as continuous policy . For this dataset, we apply the optimization problem in Equation 2 with as treatment effect for absolute discount amount with reference to cost, the purchase boolean variable with reference to benefit, and the geographical distance from user to the product location for the coupon as extra cost related to delivery or travel. The variable is chosen to be fixed at 0.1 across all models with the goal of adding the distance factor into the objective.

US Census 1990. The US Census (1990) Dataset (Asuncion & Newman, 2007) contains data for people in the census. Each sample contains a number of personal features (native language, education…). The features are pre-screened for confounding variables; we left out dimensions such as other types of income, marital status, age and ancestry. This reduces the features to d = 46 dimensions. Before constructing experiment data, we first filter with several constraints. We select people with one or more children (‘iFertil’ 2), born in the U.S. (‘iCitizen’ = 0) and less than 50 years old (‘dAge’ 5), resulting in a dataset with samples. (The ‘iFertil’ field is offset by 1: ‘iFertil’ indicating 15 year old male, ‘iFertil’=1 no children.) We select the ‘treatment’ label as whether the person works more hours than the median of everyone else, and select the income (‘dIncome1’) as the gain dimension of outcome for , then the number of children (‘iFertil’) multiplied by as the cost dimension for estimating . The hypothetical meaning of this experiment is to measure cost effectiveness, and to evaluate for whom in the dataset working more hours is effective. We apply the optimization problem in Equation 1, as a comparison with the Ponpare dataset, with as treatment effect in income, as treatment effect in negative value of number of offspring, and as effect on married or not as an overall weighting factor across the objective. This gives the objective the hypothetical meaning of utility. The variable is chosen to be fixed at 3.0 across all models to add a fixed cost weighting factor across income and offspring cost.

For both datasets, we split training, validation and test with ratios 60%, 20%, 20%.

Evaluation Methodology

Our algorithm computes a matching score, so we evaluate how well it differentiates across instances. This means the average treatment effect of highly-scored instances should be better for the designed objective. We evaluate the algorithms in two ways. The first evaluation is Average Treatment Effect To Percentage (ATETP). This measure computes the effectiveness measure on the test set, then takes an increasing percentage of the test set to evaluate the average treatment effect according to the pre-defined causal metric in Equations 1 and 2. If the model scores the matches and treatment policies well, the ATETP should be high across the lower spectrum of percentages. We also use the ATETP area under curve (termed a-AUC) as a numerical measure. The secondary metric is a cost curve, i.e. a plot of the treatment effect on reward versus cost as we increase the percentage of coverage in the test set. This measure sees cost versus reward as the main concern, and we also compute the area under curve (termed c-AUC) to numerically measure performance. (For both a-AUC and c-AUC, the higher the measure the better.)
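The ATETP evaluation can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single outcome and a plain difference in cohort means instead of the combined metrics of Equations 1 and 2, the area is computed with the trapezoidal rule without any further normalization, and the function names and toy data are hypothetical.

```python
def cohort_ate(outcomes, treated):
    """Difference in mean outcome between treatment and control cohorts."""
    t_vals = [y for y, t in zip(outcomes, treated) if t == 1]
    c_vals = [y for y, t in zip(outcomes, treated) if t == 0]
    if not t_vals or not c_vals:
        return None  # undefined when either cohort is empty
    return sum(t_vals) / len(t_vals) - sum(c_vals) / len(c_vals)


def atetp_curve(scores, outcomes, treated, percentages):
    """ATE over the top-scored fraction of the test set, per fraction."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    for pct in percentages:
        k = max(1, round(pct * len(order)))
        top = order[:k]
        curve.append(cohort_ate([outcomes[i] for i in top],
                                [treated[i] for i in top]))
    return curve


def area_under_curve(xs, ys):
    """Trapezoidal area under the points (xs, ys)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


# Hypothetical test set: higher-scored instances have larger effects.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcomes = [5.0, 1.0, 4.0, 1.0, 2.0, 1.0, 1.0, 1.0]
percentages = [0.25, 0.5, 0.75, 1.0]

curve = atetp_curve(scores, outcomes, treated, percentages)
a_auc = area_under_curve(percentages, curve)
```

A model that ranks the most responsive instances first produces a curve that starts high and decays toward the overall ATE at full coverage, which raises the area under the curve.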

Experiment Results

Figure 2 and Figure 3 show results of causal models on the Ponpare dataset. (Standard deviations across 6 runs are indicated for both Ponpare and USCensus.) The simplified version of the model without continuous policy or bipartite matching is proposed by Du et al. (2019). The CTPM outperforms the R-learner and the simplified CT model on both the ATETP curve and the cost curve. With a peak at 10-20% treatment, the CTPM produces ATE improvement at the most effective match instances across users and coupons. For the cost curve, CTPM also outperforms the other models.

Figure 2: Average treatment effect to percentage for Ponpare data.
Figure 3: Secondary measure cost curve for Ponpare data.

Figure 4 and Figure 5 show results of the CTPM on US Census. We observe higher ATE for the CTPM model in high-scored instances. CTPM could identify the most incremental instances without significant differences in cost. The model outperforms baseline R-learner and simplified CT model significantly both on the ATETP and Cost Curve measures.

Figure 4: Average treatment effect to percentage for US Census.
Figure 5: Secondary measure cost curve for US Census.

Table 1 summarizes results of the continuous treatment policy matching model. The CTPM outperforms prior models. On the Ponpare dataset, CTPM outperforms the R-learner by more than 3× and improves 67% upon the Simple CT model in a-AUC. For c-AUC, CTPM improves 41% upon the R-learner and 24% upon the Simple CT model. On the USCensus dataset, CTPM performs more than 8× better than the R-learner in terms of a-AUC, and outperforms the Simple CT model by around 42%. CTPM is more cost effective in terms of c-AUC by 28% compared with the R-learner, and shows a 13% improvement upon the Simple CT model.

Algo/Dataset     Ponpare           USCensus
Eval. Metric     a-AUC    c-AUC    a-AUC    c-AUC
Random           1.15     0.50     0.31     0.50
R-learner        5.06     0.65     0.40     0.54
Causal Forest    6.58     0.61     0.53     0.51
Simple CT        11.12    0.74     2.47     0.61
CTPM             18.57    0.92     3.51     0.69
Table 1: Summary of results across models and datasets.

Analysis and Interpretation

The continuous policy model is able to predict the optimal treatment intensity. In Figure 6, we visualize the optimal discount per session for the genre ‘Health’ in the Ponpare test set. Compared with the original treatment intensities, the optimal intensities from model prediction show an apparent segregation of low versus high intensity recommendations. This is shown by the data clusters near zero percentage (orange oval) and near full percentage (green oval). As the age of the user increases, we see a higher number of sessions with a recommendation for high intensity of treatment. This implies that the older age group may have a higher return for increased discount in the ‘Health’ genre, measured by the purchase probability.

Figure 6: Scatter plot comparison across optimal predictions from model (left) and original treatment intensities (right).

Figure 7 shows the embeddings learned by the causal models on the Ponpare dataset. The bipartite embedding space, in this case, is jointly learned across the treatment subjects (users who received coupons) and the treatment candidates (coupons). Figure 7 plots the subject embeddings projected by the model, after dimensionality reduction using 2D t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008). We used a learning rate of 30 with a perplexity of 20. The idea is that the learned subject embeddings, normalized against different subject dimensions, will be mapped to nearby points based on similarity in context. We can see that the learned subject embeddings are organized longitudinally by gender with two separable clusters.

Figure 7: Visualization of CTPM user embeddings for age group 30 (left) and 44 (right) from the Ponpare dataset using t-SNE, with color indicating user gender.
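
The t-SNE projection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn's `TSNE` as the implementation, and it uses synthetic random vectors in place of the learned CTPM subject embeddings, which are not available here. Only the learning rate of 30 and the perplexity of 20 come from the text. The function name, embedding width, and sample counts are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_embeddings_2d(embeddings, perplexity=20.0, learning_rate=30.0, seed=0):
    """Reduce an (n_samples, dim) embedding matrix to 2D with t-SNE.

    The perplexity and learning rate defaults follow the values reported in
    the text. The fixed seed is an assumption, added for reproducibility.
    """
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        learning_rate=learning_rate,
        random_state=seed,
    )
    return tsne.fit_transform(embeddings)


# Synthetic stand-in for learned subject embeddings (assumption): two Gaussian
# blobs of 40 points each in a hypothetical 16-dimensional space.
rng = np.random.default_rng(0)
fake_embeddings = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(40, 16)),
    rng.normal(loc=2.0, scale=1.0, size=(40, 16)),
]).astype(np.float32)

points = project_embeddings_2d(fake_embeddings)
print(points.shape)  # (80, 2)
```

The resulting 2D points could then be drawn as a scatter plot colored by a subject attribute such as gender, as in Figure 7. Note that t-SNE requires more samples than the perplexity value, so very small subgroups would need a lower perplexity.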

5 Conclusion and Discussion

In this paper, we proposed a model that combines two methodologies: continuous-space treatment policy, and matching of treatment subjects to candidates. These methods categorize the intensity and complexity of the treatment action space and thus significantly improve the performance of decision models. The models are able to make better decisions that maximize holistic average treatment effects and cost-versus-reward effectiveness. Further, the CTPM is able to predict the optimal treatment policy per matching instance, based on the contextual features. Finally, the matching algorithm offers latent variable models with an embedding space that characterizes the joint space of the subject and candidate instances and further improves results on applicable datasets.

The proposed algorithms optimize holistically with respect to action spaces, for a flexible objective that combines multiple treatment effect functions. For future work, our proposal could be combined with other deep learning techniques, such as sequential or recurrent models and generative models, and could potentially be extended and applied to other scientific domains.

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §4.
  • S. Athey and G. Imbens (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), pp. 7353–7360. Cited by: §2.
  • C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: §1.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §2.
  • S. Du, J. Lee, and F. Ghaffarizadeh (2019) Improve user retention with causal learning. In The 2019 ACM SIGKDD Workshop on Causal Discovery, pp. 34–49. Cited by: §4.
  • J. H. Friedman (1997) On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1 (1), pp. 55–77. Cited by: §1.
  • [7] GRF: generalized random forests. Note: https://grf-labs.github.io/grf/ Accessed: 2019-11-15. Cited by: §4.
  • J. L. Hill (2011) Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20 (1), pp. 217–240. Cited by: §2.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. Cited by: §1, §3.1.
  • F. Johansson, U. Shalit, and D. Sontag (2016) Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020–3029. Cited by: §2.
  • S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu (2017) Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461. Cited by: §1, §2, §3.1, §4.
  • B. Lim (2018) Forecasting treatment responses over time using recurrent marginal structural networks. In Advances in Neural Information Processing Systems, pp. 7483–7493. Cited by: §2.
  • C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling (2017) Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pp. 6446–6456. Cited by: §2.
  • J.K. Lunceford and M. Davidian (2004) Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Cited by: §3.1, §3.1, §3.1, footnote 1.
  • X. Nie and S. Wager (2017) Quasi-oracle estimation of heterogeneous treatment effects. Working Paper. Cited by: §1, §2, §4, footnote 1.
  • S. Parbhoo, M. Wieser, and V. Roth (2018) Causal deep information bottleneck. arXiv preprint arXiv:1807.02326. Cited by: §2.
  • J. Pearl (2009) Causality. Cambridge University Press. Cited by: §1.
  • [18] Pon-Features: Reference code for coupon purchase prediction. Note: https://github.com/threecourse/kaggle-coupon-purchase-prediction.git Accessed: 2020-02-06. Cited by: §4.
  • [19] Ponpare: coupons purchase prediction dataset. Note: https://www.kaggle.com/c/coupon-purchase-prediction/data Accessed: 2020-01-17. Cited by: §4.
  • S. Powers, J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani (2017) Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint arXiv:1707.00102. Cited by: §2.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5), pp. 688. Cited by: §2.
  • P. Rzepakowski and S. Jaroszewicz (2012) Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems 32 (2), pp. 303–327. Cited by: §2.
  • U. Shalit, F. D. Johansson, and D. Sontag (2017) Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3076–3085. Cited by: §1, §2.
  • A. Swaminathan and T. Joachims (2015a) Counterfactual risk minimization: learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814–823. Cited by: §2.
  • A. Swaminathan and T. Joachims (2015b) The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pp. 3231–3239. Cited by: §2.
  • [26] US Census 1990 dataset on UCI Machine Learning Repository. Note: https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990) Accessed: 2019-11-15. Cited by: §4.
  • L.J.P. van der Maaten and G.E. Hinton (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, pp. 2579–2605. Cited by: §4.
  • S. Wager and S. Athey (2018) Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), pp. 1228–1242. Cited by: §2, §4, §4.