Heterogeneous Causal Learning for Effectiveness Optimization in User Marketing

04/21/2020 ∙ by Will Y. Zou, et al. ∙ Uber

User marketing is a key focus of consumer-based internet companies. Learning algorithms are effective at optimizing marketing campaigns, which increase user engagement and facilitate cross-marketing to related products. By attracting users with rewards, marketing methods boost user activity in the desired products. Rewards incur significant cost that can be offset by an increase in future revenue. Most methodologies make marketing decisions from churn predictions aimed at preventing user loss, which cannot capture up-lift across counterfactual outcomes with business metrics. Other predictive models are capable of estimating heterogeneous treatment effects, but fail to capture the balance of cost versus benefit. We propose a treatment effect optimization methodology for user marketing. This algorithm learns from past experiments and uses novel optimization methods to optimize cost efficiency with respect to user selection. The method optimizes decisions to treat and reward users using deep learning optimization models, producing cost-effective, impactful marketing campaigns. Our methodology demonstrates algorithmic flexibility through integration with deep learning methods and handling of business constraints. The effectiveness of our model surpasses the quasi-oracle estimation (R-learner) model and causal forests. We also establish evaluation metrics that reflect cost efficiency and real-world business value. Our proposed constrained and direct optimization algorithms outperform baseline methods by 24.6%. The methodology is useful in many product scenarios such as optimal treatment allocation, and it has been deployed in production worldwide.




1. Introduction

Improving user marketing efficacy has become an important focus for many internet companies. Customer growth and engagement are critical in a fast-changing market, and the cost of acquiring new users is rising. New product areas are under particular pressure to acquire customers. Across industries, companies provide various ways to market to users and cross-sell new products; examples include ride-sharing (Uber, Lyft), accommodation (Airbnb), and e-commerce (Amazon, eBay).

As suggested in previous research [1] from Uber, providing a user with a reward without explicit apology after an unsatisfactory trip experience has a positive treatment effect on future billings. This is consistent with the finding in [2], where researchers conducted a similar experiment on Via (a ride-sharing company in NYC). Marketing campaigns in internet companies offer similar rewards to encourage users to engage or use new products. The treatment has positive effects on desired business growth, but also leads to additional cost. To study the outcome of these rewards, the research perspective originates from treatment effect estimation (Rubin, 1974) in a population of users. Previous research and common practice rely on non-causal churn prediction or heuristics based on frustrating experiences for reward decisions instead of directly optimizing for users’ treatment effects under a cost constraint. In this paper, we apply the treatment effect estimation perspective to user marketing scenarios.

The goal of our work is to provide a business decision methodology that optimizes the effectiveness of treatments. This methodology has the combined effect of minimizing cost and creating uplift in user engagement. Compared to existing work, the novel contributions of this paper are:

  • Heterogeneous Treatment Effect based Business Decisions - A common approach for user reward decisions relies on regular predictions, redemption or heuristics, which are tied to a specific scenario and require rich background context. In this paper we propose a general methodology that directly optimizes the heterogeneous treatment effect and can be applied to various business use cases with minimal change. This approach can be evaluated effectively and gives guidance to decisions.

  • Cost versus Benefit for Aggregated Efficiency - Most research studies focus on the treatment effect of one single outcome. However, in real-world applications it is necessary to also consider the treatment effect on cost, i.e. the efficiency ratio of cost to value, when making the resource allocation decision. The common approach also considers only point estimates, whereas our objective is to maximize effectiveness from the aggregated treatment effect. Our proposed framework addresses these two challenges together.

  • Deep Learning Integration and Joint Objective - Previous methodologies have focused on greedily estimating the treatment effect across multiple outcomes. Their algorithmic approaches rely on statistical regression methods or linear models. We develop methodologies that incorporate various dimensions of outcomes in the learning objective, so a desired, holistic metric can be optimized through deep learning. This makes the algorithm flexible to integrate with deep learning algorithms.

  • Barrier Function for Constrained Optimization - Constraints such as budgets and geographic limitations affect user behavior in sophisticated ways. User state variations under barrier constraints form a novel problem space. We formulate a constrained ranking algorithm to learn the combined effect of actions and constraints in production. This is an all-purpose model that can be used to model both market-wide efficiency and treatment effects with limited resources.

The structure of this paper is as follows: in Section 2, we will cover related work in optimization of treatment effect. In Section 3, we make the problem statement and introduce effectiveness measures and our modeling approaches for treatment effect optimization. In Section 4, we will cover experimentation, results, comparisons across models and real-world performance from the product we launched. Finally we briefly cover future research steps.

2. Background

Methods optimizing for user marketing, rewards and retention have been widely studied. Two recent studies by Halperin et al. [1] and Cohen et al. [2] look into the effect of apology treatments when the user’s trust is compromised. Andrews et al. [19] studied factors that affect coupon redemption. Hanna et al. [20] and Manzoor and Akoglu [21] investigated factors that influence redemption of time limited incentives. These studies focus on redemption or exploratory average treatment effect and do not explore the optimization of user selection.

The above methods attempt to solve the business problem, and do not yet apply a causal learning approach. Rubin (1974) first brought forward a framework for studying treatment effects. User instances are treated with an action, and when the outcome is observed it is used in model fitting. One significant area is the application of statistical methods such as (Künzel et al., 2017), which decomposes the learning algorithm into composite models with meta-learners. The study of meta-learners has developed into a variety of models. Another area is the application of decision trees and random forests (Chen and Guestrin, 2016); for instance, uplift trees (Rzepakowski and Jaroszewicz, 2012), causal trees and random forests (Wager and Athey, 2017) (Athey and Imbens, 2016), and boosting (Powers et al., 2017) are powerful components for building causal inference models. Recently, another widely adopted framework for learning heterogeneous treatment effects is quasi-oracle estimation (Nie and Wager, 2017), which is proven to be effective when estimating the treatment effect on a single outcome. These methods consider both the Conditional Treatment Effect (CTE) and the Average Treatment Effect (ATE). The CTE is the treatment effect predicted by the model per sample, conditional on its features, while the ATE is the overall treatment effect. However, these algorithms are designed to estimate ATE and CTE for a single outcome and cannot deal with multiple outcomes and the benefit-cost trade-off. In this work we propose a set of algorithms that are not only able to predict the effect of treatment, but also combine multiple outcomes into effectiveness measures that can be optimized jointly.

2.1. Estimation of Treatment Effect

We start with the estimation of treatment effects with the potential outcomes framework (Neyman, 1923; Rubin, 1974), consistent with prior work (Nie and Wager, 2017). In the user retention case, users are $n$ independent and identically distributed examples indexed by $i = 1, \dots, n$, where $x^{(i)}$ denotes per-sample features for user $i$ while $X$ is the entire dataset, $Y_1^{(i)}$ is the observed outcome if treated, and $Y_0^{(i)}$ is the observed outcome if not treated. $T^{(i)}$ is the treatment assignment and is binary for a particular treatment type, i.e. $T^{(i)} \in \{0, 1\}$.

We assume the treatment assignment is unconfounded, i.e., the outcome pair is independent of the treatment label given the user features, or treatment assignment is as good as random once we control for the features (Rosenbaum and Rubin, 1983):

$$\{Y_0^{(i)}, Y_1^{(i)}\} \perp T^{(i)} \mid x^{(i)}$$

This is the assumption we make for all causal models we explore in the paper. The treatment propensity, the probability of a user receiving treatment, is defined as

$$e^*(x) = \mathbb{P}(T^{(i)} = 1 \mid x^{(i)} = x)$$


With experiments, outcomes are observed given the treatment assignments. For each user we observe only one outcome, the one corresponding to the assigned treatment. This historical data can be used to fit a model. For treatment effect estimation, we seek to estimate the treatment effect function given that we observe user features $x$:

$$\tau^*(x) = \mathbb{E}\left[Y_1^{(i)} - Y_0^{(i)} \mid x^{(i)} = x\right]$$


2.2. Quasi-oracle Estimation (R-learner)

Closely related to our work, we briefly review the quasi-oracle estimation algorithm (Nie and Wager, 2017) for heterogeneous treatment effects, also known as ‘R-learner’. The quasi-oracle estimation algorithm is a two-step algorithm for observational studies of treatment effects. The marginal effects and treatment propensities are first evaluated to form an objective function that isolates the causal component of the signal. Then the algorithm optimizes for the up-lift or causal component using regression.

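Concretely, the conditional mean outcomes given user features are $\mu^*_{(t)}(x) = \mathbb{E}[Y_t^{(i)} \mid x^{(i)} = x]$ for $t \in \{0, 1\}$, thus the expected value of the outcome from the model is $m^*(x) = \mathbb{E}[Y^{(i)} \mid x^{(i)} = x] = \mu^*_{(0)}(x) + e^*(x)\,\tau^*(x)$, where $Y^{(i)}$ is the observed outcome, $e^*(x)$ is the treatment propensity, and $\tau^*(x)$ is the treatment effect function. The expected value of the error between the observed outcome and its conditional expectation is zero given the unconfoundedness assumption:

$$\epsilon^{(i)} = Y^{(i)} - \left(\mu^*_{(0)}(x^{(i)}) + T^{(i)}\,\tau^*(x^{(i)})\right), \qquad \mathbb{E}\left[\epsilon^{(i)} \mid x^{(i)}, T^{(i)}\right] = 0$$

Replacing $\mu^*_{(0)}$ in the error by substituting the conditional mean outcome, $\mu^*_{(0)}(x) = m^*(x) - e^*(x)\,\tau^*(x)$, we arrive at the decomposition:

$$Y^{(i)} - m^*(x^{(i)}) = \left(T^{(i)} - e^*(x^{(i)})\right)\tau^*(x^{(i)}) + \epsilon^{(i)} \tag{3}$$

This equation balances the difference between the outcome and a ‘mean’ model against the conditional average treatment effect function. In a simple formulation of the quasi-oracle estimation algorithm, a regression is used to fit the $m^*$ and $e^*$ models as the first step. The prediction result of the regression is then used to determine the regression target of the $\tau^*$ model, which is then also fitted as a regression. After the learning, the function $\tau^*(x)$ can be used to estimate the treatment effect for a user with features $x$.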
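The two-step fit described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: both the mean model and the treatment effect model are taken to be linear, the propensity is passed in as a known constant or array, cross-fitting is omitted, and the function names (`fit_ridge`, `r_learner_linear`, `predict_tau`) and the `l2` parameter are hypothetical.

```python
import numpy as np

def _with_intercept(X):
    """Append a constant column so the linear models can learn a bias."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge(X, y, l2=1e-3):
    """Closed-form L2-regularized least squares; the last weight is the bias."""
    Xb = _with_intercept(X)
    return np.linalg.solve(Xb.T @ Xb + l2 * np.eye(Xb.shape[1]), Xb.T @ y)

def r_learner_linear(X, T, Y, propensity, l2=1e-3):
    """Fit a linear tau(x) from Y - m(x) = (T - e(x)) * tau(x) + eps."""
    Xb = _with_intercept(X)
    m_hat = Xb @ fit_ridge(X, Y, l2)   # step 1: mean model m(x) = E[Y | x]
    y_res = Y - m_hat                  # outcome residual
    t_res = T - propensity             # treatment residual (assumed-known propensity)
    Z = Xb * t_res[:, None]            # regressors (T - e) * [x, 1]
    # Step 2: regularized least squares of y_res on Z gives the weights of tau(x).
    return np.linalg.solve(Z.T @ Z + l2 * np.eye(Z.shape[1]), Z.T @ y_res)

def predict_tau(w_tau, X):
    """Estimated treatment effect for each row of X."""
    return _with_intercept(X) @ w_tau
```

With randomized assignment, as in the experiments later in the paper, `propensity` can simply be the constant treatment fraction.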

3. Algorithms

The quasi-oracle estimation algorithm is efficient for estimating conditional treatment effects. However, sometimes the different outcomes incurred by a treatment cannot be converted to the same unit. For example, if we want to boost trip growth by increasing the dollar spend on rewards, trip number and dollar spend cannot be converted to a single value. So the eventual goal is to maximize gains under a cost constraint. In this paper, we propose a causal inference paradigm to maximize the cost effectiveness of heterogeneous treatments.

Concretely, we make the problem statement. Instead of estimating the treatment effect function $\tau^*(x)$, we propose to solve the problem below to maximize the gain outcome given a cost constraint.

$$\max_{z} \; \sum_{i=1}^{n} \tau^r(x^{(i)})\, z_i \quad \text{subject to} \quad \sum_{i=1}^{n} \tau^c(x^{(i)})\, z_i \leq B, \quad z_i \in \{0, 1\}$$

The variables $z_i$ represent whether we offer a reward to user $i$ during a campaign and $B$ is the cost constraint. We represent retention treatment effects as $\tau^r$ and cost treatment effects as $\tau^c$. It is important to note these treatment effect values are part of the optimization objective and are implicitly modeled as intermediate quantities. They are not strictly regression functions, and we holistically solve the stated problem.

3.1. Duality R-learner

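We describe the duality method with Lagrangian multipliers to solve the constrained optimization problem of maximizing gain (minimizing negative gain) subject to a budget ($B$) constraint, relaxing the previous binary variables $z_i \in \{0, 1\}$ to continuous values:

$$\min_{z} \; -\sum_{i=1}^{n} \tau^r(x^{(i)})\, z_i \quad \text{subject to} \quad \sum_{i=1}^{n} \tau^c(x^{(i)})\, z_i \leq B, \quad 0 \leq z_i \leq 1 \tag{4}$$

First, we assume the CATE functions are fixed, so we solve Problem 4 assuming $\tau^r$ and $\tau^c$ are given. Applying one Lagrangian multiplier $\lambda$, the Lagrangian for Problem 4 is:

$$L(z, \lambda) = -\sum_{i=1}^{n} \tau^r(x^{(i)})\, z_i + \lambda \left( \sum_{i=1}^{n} \tau^c(x^{(i)})\, z_i - B \right)$$

The optimization in Problem 4 can then be rewritten in its dual form to maximize the Lagrangian dual function $g(\lambda) = \inf_{0 \leq z_i \leq 1} L(z, \lambda)$:

$$\max_{\lambda \geq 0} \; g(\lambda) \tag{7}$$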


We need to address the caveats for solving the problem with duality, and determine whether the dual problem has the same optimum as the original problem.

  • If $\lambda \geq 0$, then for the optimal values $p^*$ of Problem 4 and $d^*$ of Problem 7, $d^* \leq p^*$ holds from convex optimization. Equality holds if the objective $-\sum_i \tau^r(x^{(i)})\, z_i$ and the constraint function $\sum_i \tau^c(x^{(i)})\, z_i - B$ are convex, and the Slater constraint qualification holds, which requires the problem to be strictly feasible.

  • For any values of $\tau^c(x^{(i)})$, if we consider very small values of some $z_i$, the strict inequality $\sum_i \tau^c(x^{(i)})\, z_i < B$ can always hold. Further, $B$ is usually large for a marketing campaign. Thus the Slater qualification holds.

From the analysis above, Problem 4 and its dual Problem 7 are equivalent, and we can solve Problem 7 by iteratively optimizing with respect to $z_i$ and $\lambda$.

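Optimize $z_i$: Keeping $\lambda$ fixed, as $\lambda$ and $B$ are constants, Problem 7 becomes:

$$\max_{z} \; \sum_{i=1}^{n} s_i\, z_i, \quad 0 \leq z_i \leq 1$$

Where we define the effectiveness score $s_i = \tau^r(x^{(i)}) - \lambda\, \tau^c(x^{(i)})$. This optimization problem has a straightforward solution: assign $z_i = 1$ when the ranking score $s_i \geq 0$ and assign $z_i = 0$ when the ranking score $s_i < 0$.

Optimize $\lambda$: Take the derivative of $L$ with regard to $\lambda$, $\frac{\partial L}{\partial \lambda} = \sum_{i=1}^{n} \tau^c(x^{(i)})\, z_i - B$. We can update $\lambda$ by Eq. (9), where $\alpha$ is the learning rate.

$$\lambda \leftarrow \lambda + \alpha \left( \sum_{i=1}^{n} \tau^c(x^{(i)})\, z_i - B \right) \tag{9}$$

Based on the two steps above, we can iteratively solve for both $z_i$ and $\lambda$ (Bertsekas, 1999).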
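The alternating updates can be sketched as follows, assuming the two treatment effect vectors are already estimated and held fixed. The function name `solve_dual`, the default step size, and the iteration count are hypothetical, and clipping the multiplier at zero is an added assumption that keeps it non-negative; it is not stated in the text.

```python
import numpy as np

def solve_dual(tau_r, tau_c, budget, lr=1e-3, n_iter=500, lam0=0.0):
    """Alternate the closed-form z step with a gradient step on lambda."""
    lam = lam0
    z = np.zeros_like(tau_r, dtype=float)
    for _ in range(n_iter):
        s = tau_r - lam * tau_c                # effectiveness score per user
        z = (s >= 0).astype(float)             # z_i = 1 exactly when s_i >= 0
        grad = float(tau_c @ z) - budget       # dL/dlambda = spent cost - budget
        lam = max(0.0, lam + lr * grad)        # ascent step, clipped at 0 (assumption)
    return z, lam
```

Because the z step is a hard threshold, the multiplier typically settles at a value where the selected users just fit the budget; with a coarse step size it can also oscillate around that value.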

In the next part, we solve for the $\tau^r$ and $\tau^c$ functions, then finally connect the components together to form the eventual algorithm. We can leverage quasi-oracle estimation of the CATE function (Nie and Wager, 2017). Concretely, the mean outcome function $m^*$, and optionally the propensity function $e^*$, are fitted with L2 regularized linear regression, then the $\tau$ functions are fitted with Eq. 3. The problems are convex and have deterministic solutions.

In our Duality R-learner algorithm, we take an approach that combines the two functions into one model. Instead of learning $\tau^r$ and $\tau^c$ separately, we fit a single scoring model $\tau^S$ in Eq. 10. Note the duality solution suggests we should include any sample with $\tau^r - \lambda\,\tau^c > 0$. The larger this value, the more the sample contributes and thus the higher it should be ranked.

$$\tau^S(x) = \tau^r(x) - \lambda\, \tau^c(x) \tag{10}$$

This form is linear, so we can use $Y^S$ instead of the original $Y^r$, $Y^c$ (single outcomes for value and cost respectively) in the estimators above. Specifically, Eq. (11).

$$Y^S = Y^r - \lambda\, Y^c \tag{11}$$

Then we train a regression model through the quasi-oracle estimation method with this $Y^S$, and the output becomes $\tau^S$, which can be used directly. This has two benefits: first, we optimize a joint model across $Y^r$ and $Y^c$ so that the parameters can capture correlations jointly; second, for production and online service, we arrive at one single model to perform prediction.

We iteratively solve the Duality R-learner algorithm. This duality method lightens the production burden of having multiple models, and the algorithm can jointly improve cost and benefit by directly solving the constrained optimization problem for balanced effectiveness.
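As a short check of why a single model suffices, the derivation below applies the quasi-oracle decomposition to the value and cost outcomes separately and then combines them. The symbols $m^r$, $m^c$, $m^S$ (mean models per outcome) and $\epsilon^r$, $\epsilon^c$ (error terms per outcome) are notation introduced here for illustration, and $\lambda$ is treated as a fixed constant.

```latex
\begin{align*}
Y^{S} &= Y^{r} - \lambda\, Y^{c},
  \qquad m^{S}(x) = \mathbb{E}[Y^{S} \mid x] = m^{r}(x) - \lambda\, m^{c}(x) \\
Y^{r} - m^{r}(x) &= \bigl(T - e^{*}(x)\bigr)\,\tau^{r}(x) + \epsilon^{r} \\
Y^{c} - m^{c}(x) &= \bigl(T - e^{*}(x)\bigr)\,\tau^{c}(x) + \epsilon^{c} \\
Y^{S} - m^{S}(x) &= \bigl(T - e^{*}(x)\bigr)\,\bigl(\tau^{r}(x) - \lambda\,\tau^{c}(x)\bigr)
  + \bigl(\epsilon^{r} - \lambda\,\epsilon^{c}\bigr)
\end{align*}
```

The last line has the same form as the single-outcome decomposition, with the combined effect $\tau^r(x) - \lambda\,\tau^c(x)$ in the place of the treatment effect function and a combined error that still has zero conditional mean, so regressing on the combined outcome targets the scoring function directly.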

3.2. Direct Ranking Model

The approach described in the previous section contains two separate steps, treatment effect prediction and constrained optimization. The ultimate business objective is to identify a portfolio of users for which we can achieve the highest incremental user cross-sell or up-sell within a cost budget. This does not rely on a perfect individual prediction (point estimate) of the treatment effect, but rather on achieving overall market-wide effectiveness. This is similar to search ranking algorithms that optimize a holistic ranking objective rather than a Click Through Rate (CTR) point estimate (Huang et al., 2013) (Shen et al., 2014). We aim to achieve better performance by combining these two steps, and this is the algorithm we propose: the Direct Ranking Model (DRM).

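This model tries to solve an unconstrained optimization problem where we minimize the cost per unit of gain:

$$\min \; \frac{\bar{\tau}^c}{\bar{\tau}^r}$$

where $\bar{\tau}^c$ and $\bar{\tau}^r$ denote the aggregated cost and gain treatment effects of the selected user portfolio.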


Model and Effectiveness Objective.

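We can then construct our model and the loss function as follows. In Eq. (14), $\tilde{S}$ is the function the model will learn with tunable parameters $\theta$. This function outputs an effectiveness score, indicating how efficient the sample is based on its features $x^{(i)}$. $\tilde{S}$ can be in any differentiable form such as linear or a neural network structure.

$$\tilde{s}_i = \tilde{S}(x^{(i)}; \theta) \tag{14}$$

We use a standard hyperbolic tangent as the non-linear activation for the neural network (Eq. (15)).

$$s_i = \tanh(\tilde{s}_i) \tag{15}$$

We then normalize the effectiveness scores using the softmax function to arrive at $p_i$ for each user (Eq. (16)). The $p_i$ sum to 1 in each cohort respectively, for $T^{(i)} = 1$ and $T^{(i)} = 0$.

$$p_i = \frac{e^{s_i}}{\sum_{j=1}^{n} \mathbb{I}_{T^{(j)} = T^{(i)}}\, e^{s_j}} \tag{16}$$

Here $\mathbb{I}_{T^{(j)} = T^{(i)}}$ is the indicator function for whether sample $j$ is in the same group (treatment or control) as sample $i$. Based on this, we can calculate the expected treatment effect of our user portfolio. We can write the effectiveness-weighted sample treatment effect for retention and cost with Eq. (17) and Eq. (18), where $Y^{r(i)}$ and $Y^{c(i)}$ are the observed retention and cost outcomes of user $i$.

$$\bar{\tau}^r = \sum_{i=1}^{n} p_i\, Y^{r(i)} \left( \mathbb{I}_{T^{(i)} = 1} - \mathbb{I}_{T^{(i)} = 0} \right) \tag{17}$$

$$\bar{\tau}^c = \sum_{i=1}^{n} p_i\, Y^{c(i)} \left( \mathbb{I}_{T^{(i)} = 1} - \mathbb{I}_{T^{(i)} = 0} \right) \tag{18}$$

Finally, we have our loss function in Eq. (19), which is the ratio of treatment effects as the holistic efficiency measure plus a regularization term $R(\theta)$.

$$\hat{L}(\theta) = \frac{\bar{\tau}^c}{\bar{\tau}^r} + R(\theta) \tag{19}$$

Since all the operations above are differentiable, we can use any off-the-shelf optimization method to minimize the loss function and learn the function $\tilde{S}$. Because the direct optimization is well suited for deep learning, we incorporated this method with deep learning architectures and frameworks, implemented our approach using TensorFlow (Abadi et al., 2016), and used the Adam optimizer (Kingma and Ba, 2014). The definition of the function $\tilde{S}$ is flexible; for instance, it can be a multi-layer neural network, or a convolutional or recurrent network.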
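The forward computation of this objective can be sketched in plain NumPy to make the steps concrete. This is an illustration under assumptions, not the authors' TensorFlow code: `raw_scores` stands for the output of the scoring function before the tanh, `T` is a 0/1 array, the regularization term is left out, and the name `drm_loss` is hypothetical. The ratio is undefined when the weighted gain effect is zero, which the sketch does not guard against.

```python
import numpy as np

def drm_loss(raw_scores, T, y_gain, y_cost):
    """Cost-per-gain ratio of softmax-weighted treatment effects (no regularizer)."""
    s = np.tanh(raw_scores)                   # bounded effectiveness scores
    treated = T == 1
    p = np.empty_like(s)
    for mask in (treated, ~treated):          # softmax within each cohort
        e = np.exp(s[mask] - s[mask].max())   # shift by the max for stability
        p[mask] = e / e.sum()
    sign = np.where(treated, 1.0, -1.0)       # +1 for treatment, -1 for control
    tau_gain = np.sum(sign * p * y_gain)      # weighted gain effect
    tau_cost = np.sum(sign * p * y_cost)      # weighted cost effect
    return tau_cost / tau_gain, p
```

With all scores equal, the weights are uniform within each cohort and the two weighted effects reduce to ordinary differences in means; raising the score of a treated user with high gain per unit of cost lowers the loss.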

3.3. Constrained Ranking Models

Constraints are inherent in retention and engagement products, such as a fixed cost budget or product limitations that restrict sends to only a 30% quantile of the users. Although the previous model is able to directly optimize for market-wide effectiveness and utilize powerful deep learning models, the algorithm is at a disadvantage under constraints and may not find the best solution.

There is also difficulty in leveraging deep learning models to solve hard-constrained optimization problems (Marquez Neila et al., 2017). To address these difficulties, we develop methods to turn hard constraints into soft constraints applicable to the deep learning methodology. Concretely, we enable this by developing two novel deep learning components: Quantile pooling and constraint annealing.

Quantile Pooling Many deep learning algorithms apply the critical step of pooling. Pooling applies a mathematical operator such as $\max$ or $\mathrm{mean}$ to selectively retain values from the previous layer. These operations create useful sparsity in deep learning architectures, which eases pressure on the numerical optimization process and increases invariance in the top layer representations (LeCun et al., 1995) (Goodfellow et al., 2009) (Zou et al., 2012) (Jarrett et al., 2009) (Le et al., 2011) (Le, 2013). In this section, we describe a new pooling method for selecting a quantile of effectiveness measures from the whole population using a sorting operator. This pooling component enables us to systematically select outputs satisfying constraints and dynamically construct an efficiency objective focused on those selections. We propose this method with the deep learning architecture in our causal learning framework.

We assume either a quantile $q$ or a cost budget $B$ is given as a fixed hyper-parameter. For the former, we are constrained to offer treatment to the top $q\%$ of the users; for the latter, we cannot exceed the budget $B$.

Leveraging methodologies developed in the previous section (Eq. 14, Eq. 15), at optimization iteration $t$, for user $i$ in the dataset, the user's effectiveness score is calculated by the scoring model; we denote this original score by $s_i^t$.

The treatment decision depends on the value of $s_i^t$ and its mathematical relationship with our constraints. We abstract this treatment decision with a fall-off function $\sigma$ (chosen to be a sigmoid function) and an input offset $d^t$, shown in Eq. 20. It illustrates how this offset leads to a fall-off variable $\delta_i^t$ which discounts output scores. In this equation the variable $\kappa$ is a hyperparameter called temperature that controls the softness of the fall-off.

$$\delta_i^t = \sigma\left(\kappa\,(s_i^t - d^t)\right) \tag{20}$$


Here the offset $d^t$ is determined by both the constraints and the population of scores $\{s_i^t\}_{i=1}^{N}$ at iteration $t$. In this paper, we give two definitions of this offset transform:

Top Quantile Constraint: For optimization constrained to a fixed quantile $q$, we relate the offset to a quantile function $Q_q$, where $q$ is the quantile percentage above which we decide to offer treatment:

$$d^t = Q_q\left(\{s_i^t\}_{i=1}^{N}\right)$$

The function $Q_q$ is implemented using a sorting operator $\mathrm{sort}$ and a take-$k$th operator $\mathrm{take}_k$, where $k$ is the rank corresponding to $q\%$ of $N$ and $N$ is the total number of users in the population:

$$Q_q(s) = \mathrm{take}_k\left(\mathrm{sort}(s)\right)$$

Semantically, we first sort user effectiveness scores and then take the $q\%$ quantile value as the offset $d^t$.
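A minimal NumPy sketch of this sort-and-take step, assuming the quantile is passed as a fraction in (0, 1] and that the boundary rank is rounded up; the function name `top_quantile_offset` and the rounding rule are illustrative choices, not taken from the paper.

```python
import numpy as np

def top_quantile_offset(scores, q):
    """Score of the user sitting at the boundary of the top-q fraction."""
    ordered = np.sort(scores)[::-1]                     # sort, best score first
    k = max(int(np.ceil(q * scores.size)) - 1, 0)       # 0-based boundary rank
    return ordered[k]                                   # take the k-th element
```

For example, with five users and `q = 0.4`, the offset is the second-highest score, so exactly two users have scores at or above it.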

Fixed Cost Constraint: For optimization constrained to a fixed cost $B$, we relate the offset to a cost-limiting function $C_B$:

$$d^t = C_B\left(\{s_i^t\}_{i=1}^{N}\right)$$

Similarly, the function $C_B$ is implemented using the sorting operator $\mathrm{sort}$, a cumulative sum operator $\mathrm{cumsum}$ applied to the costs of the users in sorted order, and an operator $\mathrm{last}_B$ that returns the effectiveness score corresponding to the input's last element that is smaller than $B$.

Semantically, we sort users by their effectiveness scores, then take as the offset $d^t$ the score of the user ranked just before the point where the cumulative cost exceeds $B$.
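The cost-limited offset can be sketched the same way. This assumes a per-user cost array with non-negative entries (so the running total is non-decreasing), treats a running total equal to the budget as still within budget, and returns infinity when not even the top user fits; the name `cost_limit_offset` and these conventions are assumptions for illustration.

```python
import numpy as np

def cost_limit_offset(scores, costs, budget):
    """Score of the last user, in ranked order, whose running cost stays within budget."""
    order = np.argsort(-scores)                          # best score first
    running = np.cumsum(costs[order])                    # cumulative cost down the ranking
    k = np.searchsorted(running, budget, side="right")   # count of users within budget
    if k == 0:
        return np.inf                                    # nobody fits: gate everyone out
    return scores[order][k - 1]
```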

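Despite the sophistication of these definitions, all the operators defined are differentiable, and thus can be easily incorporated into the deep learning framework. This Quantile Pooling mechanism deactivates or nullifies outputs that do not satisfy the constraints with the equation below, where $\hat{s}_i^t$ is the score after pooling, $s_i^t$ is the original score and $\delta_i^t$ is the fall-off variable of Eq. 20:

$$\hat{s}_i^t = \delta_i^t\, s_i^t$$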


The intuition for quantile pooling is analogous to max-pooling. The model dynamically creates sparse connection patterns in the neural network to focus on the largest activations across a population of neurons. This structure reduces model variance and helps optimizers find better local minima.

We replace the effectiveness score $s_i$ in Eq. 16 with the score after pooling, $\hat{s}_i^t$. Quantile pooling ensures that on every optimization iteration, the eventual effectiveness objective is focused on users that are valid according to the constraints. Finally, because the constraints are soft, we translate them into the architecture of the model, and the user effectiveness scoring function is eventually learned through direct and unconstrained optimization.

Constraint annealing The temperature term $\kappa$ in Eq. 20 determines how hard the fall-off function is, and thus the hardness of the constraints. We observed difficulties optimizing the model with constrained ranking when $\kappa$ is set large and the constraint is hard. The early stages of optimization could not find local minima because the gradients are small with a sharp cut-off sigmoid. At the same time, when we set $\kappa$ too small, the performance is similar to the Direct Ranking Model (Eq. 19).

We propose an annealing process on the parameter $\kappa$ with a schedule of rising temperature (the exact annealing parameters are in the Empirical Results section). This allows gradient methods to be effective at the early stages of optimization, and once the model settles in a better local minimum, the constraints can be tightened so that solutions that fit those constraints can be found.
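A small sketch of how a rising temperature hardens the sigmoid fall-off, assuming a step-wise schedule that adds a fixed increment every fixed number of optimizer steps and a fall-off of the form sigmoid(temperature × (score − offset)). The function names and all numeric values in the usage lines are hypothetical.

```python
import numpy as np

def annealed_temperature(step, start, increment, every):
    """Temperature that rises by `increment` once every `every` optimizer steps."""
    return start + increment * (step // every)

def fall_off(scores, offset, temperature):
    """Sigmoid gate: close to 1 well above the offset, close to 0 well below it."""
    return 1.0 / (1.0 + np.exp(-temperature * (scores - offset)))

# Usage: the same scores gated early (soft) and late (hard) in training.
scores = np.array([0.2, 0.5, 0.8])
soft_gate = fall_off(scores, offset=0.5, temperature=1.0)
hard_gate = fall_off(scores, offset=0.5, temperature=50.0)
```

At low temperature every user keeps a gate value near one half, so gradients reach all scores; at high temperature the gate approaches a step function at the offset.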

3.4. Evaluation Methodology

The business objective is to achieve the most incremental user retention with a given cost budget. Retention and cost are the two critical values to trade off.

Cost Curve. With the two treatment outcomes $\tau^r$ and $\tau^c$, we draw a curve with cost as the X-axis and retention as the Y-axis, as illustrated below.

Figure 1. Illustration of the Cost-Curve.

Samples are ordered by the effectiveness score on the cost curve. For each point on the curve, we take the number of treatment samples at this point on the curve, multiplied by ATE (Average Treatment Effect) of this group.

Therefore each point represents aggregated incremental cost and value, usually both increasing from left to right. From the origin to the right-most point of the curve, points on the curve represent the outcome if we include $p\%$ of the population for treatment, $p \in [0, 100]$.

If the score is randomly generated, the cost curve should be a straight line. If the score is generated by a good model, then the curve should be above the benchmark line, meaning for the same level of incremental cost, samples are selected to achieve higher incremental value.

Area Under Cost Curve (AUCC). Similar to the Area Under Curve of the ROC curve, we define the normalized area under the cost curve as the area under the curve divided by the area of the rectangle extended by the maximum incremental value and cost, or the area ratio $\frac{A+B}{2B}$. A and B are the areas shown in the cost curve figure. So the AUCC value should be bounded within [0.5, 1), and the larger the AUCC, generally the better the model.
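The cost curve and its normalized area can be sketched as below. Each point uses the top-k users by score and multiplies the number of treated users in that prefix by the treated-minus-control difference in mean outcome. The function names, the `n_points` parameter, the trapezoid rule, and the skipping of prefixes that lack either cohort are assumptions for illustration; the area computation also assumes the incremental cost increases along the curve.

```python
import numpy as np

def cost_curve(scores, T, y_gain, y_cost, n_points=20):
    """Incremental (cost, gain) when treating the top-k users, for growing k."""
    order = np.argsort(-scores)                        # most effective users first
    T, y_gain, y_cost = T[order], y_gain[order], y_cost[order]
    xs, ys = [0.0], [0.0]
    cutoffs = np.rint(np.linspace(len(T) / n_points, len(T), n_points)).astype(int)
    for k in cutoffs:
        treated, control = T[:k] == 1, T[:k] == 0
        if not treated.any() or not control.any():
            continue                                   # ATE undefined for this prefix
        n_treated = treated.sum()
        xs.append(n_treated * (y_cost[:k][treated].mean() - y_cost[:k][control].mean()))
        ys.append(n_treated * (y_gain[:k][treated].mean() - y_gain[:k][control].mean()))
    return np.array(xs), np.array(ys)

def aucc(xs, ys):
    """Area under the curve divided by the bounding rectangle."""
    area = np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs))   # trapezoid rule
    return area / (xs.max() * ys.max())
```

A straight-line curve gives 0.5 under this definition, and a curve that rises faster early on gives a larger value.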

4. Empirical Results

In this section, we cover the empirical results that compare the proposed algorithms with prior art approaches (Causal Forest, R-Learner) on marketing and public datasets. We first describe the experiment datasets and experiment setup. Then we analyze both offline and online test results. In summary, the offline cost curve evaluation is consistent with real-world online results, and our proposed methods perform significantly better than previous methods.

4.1. Experiments

The application goal of our model is to rank users from most effective to least effective, so that the overall market-wide metrics are optimized. As stated in algorithm section, we train our model on data logged from previous experiments with treatment assignment logs and the actual outcomes.

4.1.1. Experiment with Marketing Data

We adopt an explore and exploit experimental set-up in the paradigm of reinforcement learning (Liu et al., 2018) and multi-armed bandits (Katehakis and Veinott Jr, 1987) (Li et al., 2010). We launch experiment algorithms in a cyclic fashion. For each cycle we have 2 experiments, explore and exploit, which contain non-overlapping sets of users. The explore experiment is randomly assigned, and serves to collect data from all possible scenarios. On the other hand, exploit applies the model and other product features to optimize performance. The experiment design is illustrated in the following chart.

Explore. Users are randomly selected, without any product-specific algorithm, into explore experiments from the predefined user candidate pool. This allows us to collect an unbiased dataset which represents the whole population. Once users are selected, we then randomly give treatment / control assignment with a fixed probability (the number of samples in explore is solely determined by the budget).

Exploit. Excluding users already in explore experiments, based on model and budget we select users into exploit experiments. This exploit group is for product application.

We use explore for model training and offline performance evaluation, and exploit for online performance evaluation. We collect data from experiments following this design. For each sample, we log its features constructed with data from before the experiment starts, the experiment label (explore or exploit), the treatment / control assignment and the outcomes (value and cost). Outcomes are aggregated within the experiment period. The value outcome can be any desired business value, the specific definition of which is unrelated to the algorithm, while the cost outcome can likewise be any undesired cost.

Marketing Data To obtain data for model training and offline evaluation, we utilize a randomized explore online experiment. We first randomly allocate users to control and treatment cohorts (A/B). For the treatment cohort, we give all users treatment. In this experiment we collected millions of user-level samples in multiple experiment periods. Following is an illustrative table for the dataset we collected.

user id strategy
A explore
B exploit
Table 1. Example marketing dataset

4.1.2. Causal Experiments Designed with Public Datasets

The effectiveness of our proposed causal models is mainly experimented with marketing data. To ensure reproducibility we also experiment on public datasets. We make assumptions to select treatment assignment and outcomes on available data vectors to design simulated experiments for our proposed causal models.

US Census 1990 The US Census (1990) Dataset (Asuncion & Newman, 2007) contains data for people in the census. Each sample contains a number of personal features (native language, education…). The features are pre-screened for confounding variables; we left out dimensions such as other types of income, marital status, age and ancestry. This reduces the features to d = 46 dimensions. Before constructing experiment data, we first filter with several constraints. We select people with one or more children (‘iFertil’ 2), born in the U.S. (‘iCitizen’ = 0) and less than 50 years old (‘dAge’ 5), resulting in a dataset with samples. We select the ‘treatment’ label as whether the person works more hours than the median of everyone else, and select the income (‘dIncome1’) as the gain dimension of outcome for $\tau^r$, then the number of children (‘iFertil’) multiplied by as the cost dimension for estimating $\tau^c$. The hypothetical meaning of this experiment is to measure cost effectiveness, and to evaluate for whom in the dataset it is effective to work more hours.

Covertype Data The Covertype Dataset (Asuncion & Newman, 2007) contains the cover type of northern Colorado forest areas with tree classes, distance to hydrology, distance to wild fire ignition points, elevation, slope, aspect, and soil type. We pre-filter and only consider two types of forests, ‘Spruce-Fir’ and ‘Lodgepole Pine’, and use data for all forests above the median elevation. This results in a total of samples. After processing and screening for confounding variables, we use 51 features for model input. With the filtered data, we build experiment data by assuming we are able to re-direct and create water sources in certain forests to fight wild fires, but would also like to ensure that the cover type trees are not imbalanced by changing the hydrology, with preference for ‘Spruce-Fir’. Thus, the treatment label is selected as whether the forest is close to hydrology; concretely, whether the distance to hydrology is below the median of the filtered data. The gain outcome is a binary variable for whether the distance to wild fire points is smaller than the median, and the cost outcome is the indicator for ‘Lodgepole Pine’ (1.0, undesired) as opposed to ‘Spruce-Fir’ (0.0, desired).

The marketing data and the public US Census and Covertype datasets are each split into 3 parts: train, validation and test sets, with respective percentages 60%, 20%, 20%. We use the train and validation sets to perform hyper-parameter selection for each model type. The model is then evaluated on the test set.
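A minimal sketch of such a three-way split over row indices, assuming a uniformly random shuffle with a fixed seed; the text does not say how rows were assigned to the parts, and the function name `split_indices` and the seed are hypothetical.

```python
import numpy as np

def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle row indices and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    # Whatever remains after train and validation goes to the test part.
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```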

4.1.3. Model implementation details

In this section we briefly give the implementation details of our models.

Quasi-oracle estimation (R-Learner). We use Linear Regression (SKLearn library’s Ridge Regression with 0.0 as the regularization weight) as the base estimator. Since the experiment cohorts are randomly selected, we use the constant treatment percentage as the propensity in the algorithm. Since we need to define one CATE function for R-learner, we use the R-learner to model the gain value incrementality with $\tau^r$.

Causal Forest. We leverage the generalized random forest (grf) library in R (Wager and Athey, 2018) [6] (Athey and Imbens, 2016). Specifically, we apply causal forest with 100 trees, 0.2 as alpha, 3 as the minimum node size, and 0.5 as the sample fraction. To rank users (or other units) by cost-versus-gain effectiveness, we estimate the conditional treatment effect function for both gain and cost, i.e. we train two Causal Forest models. For evaluation, we compute the ranking score as the ratio of the two estimates. For hyper-parameters, we search on deciles for num_trees and min.node.size, and at 0.05 intervals for alpha and sample.fraction. We also leverage the tune.parameters option of the grf package; eventually, we select the parameters with the best performance on the validation set. (Footnote 4: The best parameters (num_trees, with 50 trees for each of the two CATE functions, alpha, min.node.size and sample.fraction) are the same for all three datasets we experimented with.)
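The ratio-based ranking can be illustrated with a short sketch. It assumes the two forests have already produced per-user effect estimates, and that the score is gain effect divided by cost effect so that higher is better; the text does not spell out the direction of the ratio. The function name and the `eps` floor that guards against zero or negative cost estimates are hypothetical additions.

```python
def rank_by_effect_ratio(gain_effects, cost_effects, eps=1e-6):
    """Rank units by estimated gain effect per unit of estimated cost effect."""
    scores = []
    for gain, cost in zip(gain_effects, cost_effects):
        # Assumption: floor the cost effect at eps to avoid dividing by zero
        # or flipping the sign when the estimated cost effect is not positive.
        scores.append(gain / max(cost, eps))
    # Highest score first: most incremental gain per unit of incremental cost.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical estimates for three users.
order, scores = rank_by_effect_ratio([0.10, 0.30, 0.20], [1.0, 6.0, 1.0])
```

Here the third user ranks first: its estimated gain effect is smaller than the second user's, but it costs far less per unit of gain.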

Duality R-learner. Similar to the R-learner, we use Ridge Regression as the base estimator with constant propensity, and apply the model stated in Eq. 9 for ease of online deployment. The iterative process to solve Eq. 10 is inefficient because the value function is piece-wise linear in the variable being solved for. Since Ridge Regression is lightweight to train, in practice we select the value with the best performance on the validation set. (Footnote 5: We determine this value through hyper-parameter search on deciles and mid-deciles; the best values for the marketing data and for the US Census and Covertype data are chosen this way.)

Direct Ranking. We implement our deep-learning-based models with Tensorflow (Abadi et al., 2016). To align with the baseline and other methods in our experiments, we use a single layer with linear parameterization as the scoring function, without weight regularization, and use the objective functions stated in the algorithm section. We use the Adam optimizer with learning rate 0.01 and default beta values. We compute gradients on the entire batch of data and run for 600 iterations.

Constrained Ranking. We experiment with the Top Quantile operator. In addition to using a linear scoring function, we use a consistent quantile target of 40%, apply a starting sigmoid temperature, and use constraint annealing at increments of 0.1 every 100 steps of the Adam optimizer. For constraint annealing, we validate different schedules and select among them. We note that quantile pooling offers a flexible lever for minimizing the objective function, which makes the optimization unstable. We therefore stop the gradient on the corresponding value so that it does not change too quickly.

4.2. Results on Causal Learning Models

Figure 2 shows the cost curve for each model on the marketing data test set. The baseline R-learner optimized for incremental gain cannot account for the cost outcome and under-performs on our task. We therefore use Duality R-learner as the benchmark for all our experiments. Causal Forest also performs reasonably well. Direct Ranking out-performs the previous models with a 22.1% AUCC improvement over Duality R-learner, and the Constrained Ranking algorithm is the best performing model on the marketing dataset, out-performing Duality R-learner by 24.6% in terms of AUCC.

Figure 2. Cost-Curve results for marketing data.

Figure 3 shows the results of the causal models on US Census. The baseline R-learner on gain performs slightly better here because cost has less impact. Duality R-learner still works reasonably well. Direct Ranking and Constrained Ranking out-perform Duality R-learner by 2.8% and 21.2% respectively, reaching AUCC of 0.58 and 0.69.

Figure 3. Cost-Curve results for public US Census data.

Figure 4 shows the results of the causal models on the Covtype dataset. The optimization problem on this dataset is easier, as multiple models achieve better results. Direct Ranking and Constrained Ranking out-perform Duality R-learner by 7.3% and 17.9%, with the AUCC of the Constrained Ranking algorithm as high as 0.92.

Figure 4. Cost-Curve results for public Covtype data.

Table 2 shows that the Constrained Ranking and Direct Ranking algorithms are significantly better than R-learner on the gain outcome, by more than 25% in terms of AUCC, and out-perform Duality R-learner by around 10%. As one example of cost effectiveness, at the vertical dashed line marking a fixed share of total incremental cost, our causal models achieve 2X more incremental retention than random selection. Put differently, reaching the same incremental retention can require about 50% less cost.

Algorithm             Prod.    % imp.   USCensus   Covtype
Random                0.500    -        0.500      0.500
R-learner G           0.464    -        0.533      0.779
Duality R-learner     0.544    0.0%     0.567      0.783
Causal Forest         0.628    15.4%    0.510      0.832
Direct Ranking        0.664    22.1%    0.583      0.840
Constrained Ranking   0.678    24.6%    0.687      0.915
Table 2. Summary of AUCC results across models and datasets.
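The "% imp." column is consistent with the relative AUCC improvement over Duality R-learner on the production (marketing) data. The short sketch below recomputes it from the Prod. values in Table 2; the helper name `percent_improvement` is ours, and the reading of the column is an inference from the numbers rather than a definition given in the text.

```python
# Production AUCC values copied from Table 2.
aucc = {
    "Duality R-learner": 0.544,
    "Causal Forest": 0.628,
    "Direct Ranking": 0.664,
    "Constrained Ranking": 0.678,
}


def percent_improvement(value, baseline):
    """Relative improvement over the baseline, in percent, to one decimal."""
    return round(100.0 * (value - baseline) / baseline, 1)


baseline = aucc["Duality R-learner"]
improvements = {
    name: percent_improvement(value, baseline) for name, value in aucc.items()
}
```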

The models we proposed have been deployed in production and operate in multiple regions. The model is developed using data from previous experiments and launched as a prediction model in production to rank users. In this section we describe the challenges and learnings of putting the model in production.

Engineering system for production

The first challenge is designing an engineering system to support our causal learning approach. Unlike traditional machine learning models, our approach learns from the observed outcomes of predefined treatments in previous launches, so we build a Heterogeneous Causal Learning Workflow (HCLW) system. Each production launch provides training data for the subsequent launch, so the product improves its decisions and model settings across a sequence of launches before forming the eventual production model. The design of this engineering system is shown in Figure 5. The data are collected from previous launches in the form of Table 1 and stored in offline storage before being fed into the causal learning pipeline. The pipeline produces the trained model, evaluation, and service components. The service components provide model decisions on users and interact with the launch pipelines and the product serving system through APIs to issue rewards through the user interface in the production system.

Figure 5. Production system for causal learning.

Production and offline evaluation

Another important consideration when deploying to production is the alignment of evaluation results. Unlike offline evaluation, which uses the full cost-curve metrics, in the online case we can only measure one specific point on the cost curve for each model. The slope of the straight line between that point and the origin measures the general cost effectiveness. This slope is given in Eq. 22. If both models have a similar spend level (a similar value on the x-axis), this slope sufficiently captures the performance.


As mentioned in Section 4.1.1, we have both explore and exploit in our online setup. When we test two models, for example to compare DRM and Causal Forest, we have one explore cohort (random selection) and two exploit cohorts (model-based selection). Within the selected users we then randomly split them into treatment and control. To make the numerical metric uniform, we use the slope for explore as the benchmark and derive Eq. 23, which represents the relative efficiency gain compared to the benchmark.
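The online metric can be illustrated with a short sketch. Eq. 22 and Eq. 23 are not reproduced in this copy, so the forms used here are assumptions inferred from the prose: the slope is taken as incremental gain divided by incremental cost, each measured as the treatment mean minus the control mean within a cohort, and the relative efficiency is the ratio of a model's slope to the explore slope minus one. All function names and numbers are hypothetical.

```python
def incremental_effect(treated, control):
    """Average effect per user: treatment mean minus control mean."""
    return sum(treated) / len(treated) - sum(control) / len(control)


def cost_effectiveness_slope(gain_t, gain_c, cost_t, cost_c):
    """Slope of the line from the origin to one point on the cost curve."""
    return incremental_effect(gain_t, gain_c) / incremental_effect(cost_t, cost_c)


def relative_efficiency(model_slope, explore_slope):
    """Efficiency gain of a model cohort relative to the explore benchmark."""
    return model_slope / explore_slope - 1.0


# Hypothetical outcomes: gain is retention (0/1), cost is reward spend.
explore_slope = cost_effectiveness_slope(
    gain_t=[1, 0, 1, 0], gain_c=[0, 0, 1, 0],
    cost_t=[2, 2, 2, 2], cost_c=[0, 0, 0, 0],
)
model_slope = cost_effectiveness_slope(
    gain_t=[1, 1, 1, 0], gain_c=[0, 0, 1, 0],
    cost_t=[2, 2, 2, 2], cost_c=[0, 0, 0, 0],
)
efficiency_gain = relative_efficiency(model_slope, explore_slope)
```

In this toy example both cohorts spend the same amount per treated user, so the comparison of slopes is meaningful in the sense described above; the model cohort retains twice as many incremental users per unit of cost as explore.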


Online results are consistent with the offline results: all models perform significantly better than explore, and DRM consistently out-performs quasi-oracle estimation (R-learner) and Causal Forest.

5. Conclusion and Future Work

5.1. Conclusion

We propose a novel ranking method to optimize heterogeneous treatment effects for user retention. The method combines prediction and optimization into a single stage and provides a loss layer that can be incorporated into any deep learning structure. We also provide an empirical evaluation metric and adjustments to existing estimators for the treatment effect optimization problem. We evaluate various methods empirically, both offline and online. Our proposed method achieves significantly better performance than the explore benchmark and existing estimators. After successful testing, this method has been deployed to production and is live in many regions all over the world.

5.2. Future Work

Smart Explore/Exploit. In the current work we use epsilon-greedy explore, where we set aside a fixed percentage of the budget to spend on fully randomized explore to collect data for model training. However, this sacrifices overall performance and is suboptimal. As a better approach, we will try to use a multi-armed bandit or Bayesian optimization framework to guide smart explore based on the model uncertainty.
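The current fixed-percentage explore can be sketched as follows. This is a simplified illustration that splits users rather than budget, which is equivalent only if spend per user is roughly uniform; the function name, the `explore_fraction` value and the seed are hypothetical.

```python
import random


def assign_cohorts(user_ids, explore_fraction, seed=0):
    """Randomly reserve a fixed fraction of users for fully random explore."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    n_explore = int(round(explore_fraction * len(shuffled)))
    explore = shuffled[:n_explore]   # treated at random, used as training data
    exploit = shuffled[n_explore:]   # ranked and selected by the model
    return explore, exploit


explore, exploit = assign_cohorts(range(100), explore_fraction=0.1)
```

A bandit or Bayesian approach would replace the fixed `explore_fraction` with an allocation driven by model uncertainty.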

Deep Embedding. Raw time and geo features are extremely sparse. Various embedding techniques have been used for sparse features but none of them is specifically for treatment effect. As treatment effect is different from its underlying outcome, the embedding should also be different. Now that we have a general loss layer which could be incorporated with any deep learning structure, we could start to work on the embeddings specifically for treatment effects.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning.. In OSDI, Vol. 16, pp. 265–283. Cited by: §3.2, §4.1.3.
  • S. Athey and G. Imbens (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), pp. 7353–7360. Cited by: §2, §4.1.3.
  • D. P. Bertsekas (1999) Nonlinear programming. Athena scientific Belmont. Cited by: §3.1.
  • T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794. Cited by: §2.
  • I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng (2009) Measuring invariances in deep networks. In Advances in neural information processing systems, pp. 646–654. Cited by: §3.3.
  • [6] Grf: generalized random forests. Note: https://grf-labs.github.io/grf/ Accessed: 2019-11-15 Cited by: §4.1.3.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. External Links: Link Cited by: §3.2.
  • K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009) What is the best multi-stage architecture for object recognition?. In 2009 IEEE 12th international conference on computer vision, pp. 2146–2153. Cited by: §3.3.
  • M. N. Katehakis and A. F. Veinott Jr (1987) The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research 12 (2), pp. 262–268. Cited by: §4.1.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu (2017) Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461. Cited by: §2.
  • Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR 2011, pp. 3361–3368. Cited by: §3.3.
  • Q. V. Le (2013) Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8595–8598. Cited by: §3.3.
  • Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §3.3.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. Cited by: §4.1.1.
  • H. Liu, A. Kumar, W. Yang, and B. Dumoulin (2018) Explore-exploit: a framework for interactive and online learning. arXiv preprint arXiv:1812.00116. Cited by: §4.1.1.
  • P. Marquez Neila, M. Salzmann, and P. Fua (2017) Imposing hard constraints on deep networks: promises and limitations. In CVPR Workshop on Negative Results in Computer Vision, Cited by: §3.3.
  • J. Neyman (1923) Sur les applications de la theorie des probabilites aux experiences agricoles: essai des principes.. Master Thesis. Cited by: §2.1.
  • X. Nie and S. Wager (2017) Quasi-oracle estimation of heterogeneous treatment effects. Working Paper. Cited by: §2.1, §2.2, §2, §3.1.
  • P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Cited by: §2.1.
  • S. Powers, J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani (2017) Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint arXiv:1707.00102. Cited by: §2.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies.. Journal of educational Psychology 66 (5), pp. 688. Cited by: §1, §2.1, §2.
  • P. Rzepakowski and S. Jaroszewicz (2012) Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems 32 (2), pp. 303–327. Cited by: §2.
  • Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014) A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, External Links: Link Cited by: §3.2.
  • [25] US census 1990 dataset on uci machine learning repository. Note: https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990) Accessed: 2019-11-15 Cited by: §4.1.2.
  • S. Wager and S. Athey (2017) Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association (just-accepted). Cited by: §2.
  • S. Wager and S. Athey (2018) Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), pp. 1228–1242. Cited by: §4.1.3.
  • W. Zou, S. Zhu, K. Yu, and A. Y. Ng (2012) Deep learning of invariant features via simulated fixations in video. In Advances in neural information processing systems, pp. 3203–3211. Cited by: §3.3.