I. Introduction
Uplift modeling [8, 10, 11, 15, 21, 22, 24, 25, 26, 27], also known as heterogeneous treatment effect estimation or incremental modeling, is a technique for estimating the individual treatment effect (ITE) of an intervention. It can be used to optimize user targeting and personalization in many areas, including promotion, advertisement, customer service, recommendation systems, and product design. Most typically, such optimization is achieved by first estimating the treatment effect of the intervention or product experience on each user and then delivering the treatment condition to the users with the largest estimated uplift.
Using informative and predictive features is key to the performance of an uplift model. In practice, there is often a rich set of features available for building a model. However, using all of the available features can lead to computational inefficiency, overfitting, a high maintenance workload, and model interpretation challenges. Consequently, feature selection becomes an essential step for leveraging the benefits of a rich feature set while reducing the associated cost. A feature selection method calculates an importance score for each feature and then ranks the features by that score. An uplift model can then be built on the most important features. Focusing only on the important features has multiple benefits for uplift modeling applications: (1) faster model training; (2) more accurate prediction by avoiding overfitting; (3) lower maintenance cost for data pipelines; and (4) easier model interpretation and diagnostics.
Although feature selection is an important topic for uplift modeling, it has rarely been discussed in the literature. Feature selection methods for classic machine learning problems have been well studied [4, 5, 23]. However, as we will show, these methods are ineffective at solving the feature selection problem for uplift modeling. Therefore, it is necessary to develop and discuss feature selection methods specifically for uplift modeling.
We contribute to this area from both methodological and empirical evaluation perspectives. Specifically:

We propose multiple feature selection methods for uplift modeling.

We evaluate feature selection methods with various uplift models, in both synthetic and real data settings, in order to provide empirical evidence of method performance.

We demonstrate that important features for uplift modeling are different from important features for standard machine learning problems, and that feature selection methods for standard machine learning problems are suboptimal in the uplift modeling context.

We make the proposed filter methods available in the CausalML Python package [6].
We focus on the uplift modeling classification problem, where the outcome variable is categorical; this covers many common use cases such as advertisement click-through, new user conversion, and existing user retention. However, the ideas can be generalized to uplift modeling regression problems.
The structure of the paper is as follows. In Section II, we review the key concepts of uplift modeling and describe why feature selection for uplift modeling is a unique challenge. In Section III, we introduce a list of feature selection methods for uplift modeling. In Section IV, we evaluate these methods with both synthetic and real-world data. Finally, in Section V, we summarize the findings and make recommendations for choosing and using proper feature selection methods in uplift modeling applications.
II. Background
II-A Uplift Models
Uplift modeling can be viewed as a way to estimate heterogeneous treatment effects at a user level using machine learning. It is helpful to frame the problem and introduce uplift modeling from a causal inference perspective. Following the commonly used Neyman-Rubin causal model [19, 16, 20, 13], the treatment effect for user $i$ can be expressed as:

$$\tau_i = Y_i(1) - Y_i(0) \qquad (1)$$

where $Y_i(1)$ and $Y_i(0)$ denote the outcome variable for individual $i$ under the treatment condition and the control condition, respectively.
The treatment effect can vary from user to user. The conditional average treatment effect (CATE) is defined as:

$$\tau(x) = E\left[Y(1) - Y(0) \mid X = x\right] \qquad (2)$$

where $X$ is a feature vector and $x$ is the feature value for user $i$. The CATE quantifies how treatment effects vary among users depending on the observed user features, and it is the target quantity that uplift modeling tries to estimate [11]. Based on the estimated CATE, different treatment conditions can be selected and applied to users to achieve the preferred outcome. If a model estimates the CATE at an individual level, we also refer to this quantity as the individual treatment effect (ITE).
There are two main types of uplift modeling frameworks. The first category is known as "meta-learners" ([17, 15]), which combine standard machine learning models to estimate the CATE. For example, the "Two Model" approach ([12]), also known as the T-learner, is constructed by fitting separate models for the control and treatment observations and then taking the difference between the predicted treatment outcome and the predicted control outcome to estimate the CATE. More complex meta-learners include the X-learner proposed by [15] and the R-learner proposed by [17]. The other category is based on modifying components of existing machine learning algorithms such as classification and regression trees [3, 24, 10, 9, 21, 2]. For example, [21] proposes modifying the splitting criterion of a classification tree algorithm so that each split maximizes the heterogeneity of treatment effects between the resulting subgroups. In this paper, we evaluate feature selection methods using models from both categories.
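As a minimal sketch of the Two Model idea, the snippet below uses a deliberately trivial base learner (a per-feature-value mean model standing in for any real classifier); the function names and data are illustrative, not part of any library:

```python
from collections import defaultdict

def fit_mean_model(rows):
    """Toy base learner: predicts the mean outcome for each feature value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in rows:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

def t_learner_uplift(data):
    """data: list of (x, w, y) with w=1 for treatment, w=0 for control.
    Returns a function mapping x to the estimated CATE."""
    mu1 = fit_mean_model([(x, y) for x, w, y in data if w == 1])
    mu0 = fit_mean_model([(x, y) for x, w, y in data if w == 0])
    return lambda x: mu1.get(x, 0.0) - mu0.get(x, 0.0)

# Example: treatment lifts the outcome only for users with x == 1.
data = [(0, 0, 0.2), (0, 1, 0.2), (1, 0, 0.1), (1, 1, 0.6)]
cate = t_learner_uplift(data)
print(round(cate(1), 3))  # 0.5
print(round(cate(0), 3))  # 0.0
```

Any regressor or classifier could replace the toy base learner; the Two Model structure (fit per arm, subtract predictions) is unchanged.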
II-B Relation to Standard Feature Selection Methods
There are various feature selection methods available for standard classification and regression problems. They can be roughly divided into three categories: filter methods, wrapper methods, and embedded methods ([4, 5, 23]). However, these standard methods perform poorly on the feature selection task for uplift modeling. The reason is that, in a classification problem, the modeling goal is to predict the outcome probability of each class based on the features, so feature importance is usually measured in terms of a feature's relationship with the class probability.
In contrast to the standard classification problem, the goal of uplift modeling is to predict the CATE. Consequently, a good feature should be predictive of the treatment effect rather than of a class probability. These two prediction targets do not necessarily coincide: an important feature for standard classification is not necessarily an important feature for uplift modeling, and vice versa. The same argument applies to regression problems.
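A tiny numeric illustration of this distinction, with made-up conversion rates for two user segments:

```python
# Hypothetical conversion rates by (segment x, treatment w): the feature x
# separates the outcome strongly, yet the uplift is 0.1 in both segments.
rates = {
    (0, 0): 0.1, (0, 1): 0.2,   # segment x=0: uplift = 0.1
    (1, 0): 0.6, (1, 1): 0.7,   # segment x=1: uplift = 0.1
}
outcome_gap = (rates[(1, 0)] + rates[(1, 1)]) / 2 - (rates[(0, 0)] + rates[(0, 1)]) / 2
uplift_gap = abs((rates[(1, 1)] - rates[(1, 0)]) - (rates[(0, 1)] - rates[(0, 0)]))
print(round(outcome_gap, 3))  # 0.5 -> x is highly predictive of the outcome
print(round(uplift_gap, 3))   # 0.0 -> x carries no uplift information
```

A standard importance score would rank x highly here, even though it is useless for targeting the treatment.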
To address the feature selection problem for uplift modeling, we propose both filter methods, which are easy and fast to use as a preprocessing step for uplift modeling, and embedded methods, which are a byproduct of training an uplift model. We compare the performance of these proposed methods and standard feature selection methods in Section IV.
III. Feature Selection Methods for Uplift Modeling
III-A Filter Methods
In an uplift modeling task, a feature’s importance depends on how well it predicts the treatment effect. A filter method calculates the importance score for each feature based on the marginal relationship between the treatment effect and the feature. It is a fast preprocessing step because only simple metrics are calculated for one feature at a time.
The first proposed filter method, called the F filter, is based on a linear regression model for the outcome variable with the treatment indicator, the feature of interest, and their interaction term as predictors. The importance score is defined as the F-statistic for the coefficient of the interaction term: a large value implies the feature is associated with a strong heterogeneous treatment effect. The second filter method, called the LR filter for "likelihood ratio", defines the importance score as the likelihood ratio test statistic for the interaction term coefficient in a logistic regression model.
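A minimal dependency-free sketch of the F filter: fit $y \sim 1 + w + x + wx$ by ordinary least squares and score the feature by the squared t-statistic of the interaction coefficient (the F-statistic with one numerator degree of freedom). In practice a statistics package such as statsmodels would be used; `f_filter_score` and the simulated data are illustrative, and a continuous outcome is used for simplicity:

```python
import random

def f_filter_score(x, w, y):
    """F-statistic of the w*x interaction term in OLS of y on [1, w, x, w*x]."""
    n, p = len(y), 4
    X = [[1.0, float(w[i]), float(x[i]), float(w[i]) * x[i]] for i in range(n)]
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Gauss-Jordan with partial pivoting on [A | I] yields [I | A^-1].
    M = [A[j][:] + [1.0 if k == j else 0.0 for k in range(p)] for j in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(p):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    Ainv = [row[p:] for row in M]
    beta = [sum(Ainv[j][k] * c[k] for k in range(p)) for j in range(p)]
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - p)
    t_stat = beta[3] / (sigma2 * Ainv[3][3]) ** 0.5  # interaction coefficient
    return t_stat ** 2

random.seed(1)
n = 400
w = [i % 2 for i in range(n)]                      # randomized assignment
x_uplift = [random.gauss(0, 1) for _ in range(n)]  # drives the effect
x_noise = [random.gauss(0, 1) for _ in range(n)]   # irrelevant
y = [0.5 * w[i] * x_uplift[i] + random.gauss(0, 0.3) for i in range(n)]
print(f_filter_score(x_uplift, w, y) > f_filter_score(x_noise, w, y))  # True
```

The feature driving the heterogeneous effect receives a much larger F-statistic than the irrelevant one.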
The third filter method has three variants and is motivated by the split criteria for uplift trees proposed by [21]. For a given feature, this method first divides the samples into $B$ bins based on the percentiles of the feature, where $B$ is a hyperparameter for this method. The importance score is defined as a divergence measure of the treatment effect over these $B$ bins. Specifically, assuming there are $K$ classes in the outcome variable, let $p_{bk}$ and $q_{bk}$ denote the sample proportion of class $k$ in the $b$-th ($b = 1, \dots, B$) bin for the treatment group and control group, respectively. The importance score is defined as:

$$\Delta = \sum_{b=1}^{B} \frac{N_b}{N} D(P_b, Q_b) \qquad (3)$$

where $N_b$ is the sample size in the $b$-th bin, $N$ is the total sample size, $P_b = (p_{b1}, \dots, p_{bK})$ and $Q_b = (q_{b1}, \dots, q_{bK})$ are the outcome class distributions in bin $b$ for the treatment and control groups, and the distribution divergence $D$ is one of the three measures proposed by [21], namely the Kullback-Leibler divergence (denoted as KL), the squared Euclidean distance (denoted as ED), and the chi-squared divergence (denoted as Chi):

$$KL(P : Q) = \sum_{k} p_k \log \frac{p_k}{q_k} \qquad (4)$$

$$ED(P : Q) = \sum_{k} (p_k - q_k)^2 \qquad (5)$$

$$\chi^2(P : Q) = \sum_{k} \frac{(p_k - q_k)^2}{q_k} \qquad (6)$$
The time complexity of the filter methods is linear in the sample size $n$ and the number of features $p$: $O(np)$.
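The bin-based filter can be sketched for a binary outcome as follows, assuming equal-frequency binning; `bin_filter_score` and its arguments are illustrative names rather than the CausalML API:

```python
import math
import random

def divergence(p, q, kind):
    """Divergence between binary outcome distributions (p, 1-p) and (q, 1-q)."""
    eps = 1e-6  # guard against empty classes
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    if kind == "KL":
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    if kind == "ED":
        return 2 * (p - q) ** 2               # (p_k - q_k)^2 summed over both classes
    if kind == "Chi":
        return (p - q) ** 2 / q + (p - q) ** 2 / (1 - q)
    raise ValueError(kind)

def bin_filter_score(x, w, y, n_bins=10, kind="KL"):
    """Eq. (3): population-weighted divergence between treatment and
    control outcome distributions across equal-frequency bins of x."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    score = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        t = [i for i in idx if w[i] == 1]
        c = [i for i in idx if w[i] == 0]
        if not t or not c:
            continue  # skip bins missing one experiment arm
        p = sum(y[i] for i in t) / len(t)  # treated conversion rate in bin
        q = sum(y[i] for i in c) / len(c)  # control conversion rate in bin
        score += len(idx) / n * divergence(p, q, kind)
    return score

random.seed(0)
n = 2000
w = [i % 2 for i in range(n)]
x_uplift = [random.random() for _ in range(n)]
x_noise = [random.random() for _ in range(n)]
y = [1 if random.random() < 0.2 + (0.5 * x_uplift[i] if w[i] else 0.0) else 0
     for i in range(n)]
for kind in ("KL", "ED", "Chi"):
    print(kind, bin_filter_score(x_uplift, w, y, kind=kind) >
                bin_filter_score(x_noise, w, y, kind=kind))
```

With all three divergences, the feature whose bins show a varying treatment-vs-control gap scores far above the irrelevant feature.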
III-B Embedded Methods
The embedded methods obtain feature importance as a byproduct of training an uplift model and can be derived for both meta-learners and uplift trees. For meta-learners, feature importances can be obtained from the base learners, which are the composite models making up a meta-learner. For example, for the Two Model approach, a feature's importance score can be defined as the sum of its embedded importance scores produced by the two base learners. For uplift trees, the importance score for a feature can be defined as its cumulative contribution to the loss function over the tree node splits. This is similar to the well-known embedded feature importance for standard classification trees, except that the score is obtained from an uplift tree with a special splitting criterion. At each split, we calculate the gain in the distribution divergence:

$$\Delta_{gain} = \sum_{a \in \{l, r\}} \frac{N_a}{N} D(P_a, Q_a) - D(P, Q) \qquad (7)$$

where $D$ is one of the divergence measures defined in Eqs. (4) to (6), $P$ and $Q$ denote the outcome distributions of the treatment group and control group at the current node, and $P_a$, $Q_a$, and $N_a$ are the corresponding distributions and sample sizes in the left ($l$) and right ($r$) child nodes. The feature importance score is calculated by summing all the $\Delta_{gain}$ values from the tree node splits where the feature is used.
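The gain in Eq. (7) can be sketched for a single candidate split and a binary outcome, using the squared Euclidean distance as the divergence; the data layout and function names are illustrative:

```python
def ed(p, q):
    """Squared Euclidean distance between binary distributions (p,1-p), (q,1-q)."""
    return 2 * (p - q) ** 2

def split_gain(left, right):
    """Eq. (7): child-weighted divergence minus parent divergence.
    left/right: lists of (w, y) pairs with w=1 treatment, w=0 control."""
    def rates(pairs):
        t = [y for w, y in pairs if w == 1]
        c = [y for w, y in pairs if w == 0]
        return sum(t) / len(t), sum(c) / len(c)

    parent = left + right
    p_t, p_c = rates(parent)
    gain = -ed(p_t, p_c)  # subtract the parent-node divergence
    for child in (left, right):
        q_t, q_c = rates(child)
        gain += len(child) / len(parent) * ed(q_t, q_c)
    return gain

# The left child isolates a subgroup with a strong treatment effect.
left = [(1, 1), (1, 1), (0, 0), (0, 0)]
right = [(1, 0), (1, 0), (0, 0), (0, 0)]
print(split_gain(left, right))  # 0.5
```

A split that separates heterogeneous treatment effects yields a positive gain, which would be credited to the splitting feature's importance score.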
The time complexity of the embedded methods depends on the learners used; for random forest algorithms it is on the order of $O(T \cdot m \cdot n \log n)$, where $T$ is the number of trees and $m$ is the maximum number of features considered at each split.

IV. Empirical Evaluation
In this section our goal is to answer the following questions: (1) Which feature selection methods work better than others? (2) Is their performance consistent across different scenarios? (3) How does the feature selection step affect the accuracy of uplift modeling? (4) How does the number of bins, as a hyperparameter of the bin-based uplift filter methods, affect their performance?
We use both synthetic and real-world data to evaluate the performance of the feature selection methods. The advantage of synthetic data is that the true individual treatment effect and the true important features are known, while the advantage of real-world data is in helping us understand how the feature selection methods work in practice.
One approach for evaluating the performance of a feature selection method is to feed the top features selected by this method to an uplift model, and then report the accuracy of the uplift model output. We would expect a good feature selection method to identify the truly important features and increase the predictive performance of an uplift model.
IV-A Experiment 1: Evaluation with Synthetic Data
We consider a binary conversion problem in the study with synthetic data [1]. The generated data has three types of features: (1) uplift features influencing the treatment effect on the conversion probability; (2) classification features affecting the conversion probability but independent of the treatment effect; and (3) irrelevant features that are independent of both the conversion probability and the treatment effect. To model the relationship between the uplift features and the treatment effect, and between the classification features and the outcome probability, we implement six types of association patterns in the data generation process: linear, quadratic, cubic, ReLU (rectified linear unit [7]), and the trigonometric functions sine and cosine. Example feature patterns are plotted in Figure 1. The data generating process is composed of the following steps:

1. Suppose there are $n$ users and $p$ features, with $p_c$ classification features, $p_u$ uplift features, and $p_0$ irrelevant features ($p = p_c + p_u + p_0$).

2. Generate the feature value $x_{ij}$ for the $i$-th user and the $j$-th feature from a standard normal distribution: $x_{ij} \sim N(0, 1)$, where $i = 1, \dots, n$ and $j = 1, \dots, p$.

3. Transform each feature to represent an association pattern by applying one of the transformation functions $f \in \{\text{linear}, \text{quadratic}, \text{cubic}, \text{ReLU}, \sin, \cos\}$ to the feature. (In this simulation study, the transformation function is selected in natural order for the first six uplift features and the first six classification features; if there are more than six features of a type, a random transformation function is selected from the set for each additional feature.) The transformed feature values are then standardized by subtracting the mean and dividing by the standard error, and are denoted by $\tilde{x}_{ij}$.

4. Generate the conversion probability based on a logistic model:

$$p(\tilde{x}_i, w_i) = \frac{1}{1 + \exp\left(-\left(b_0 + \sum_{j \in \mathcal{C}} \beta_j \tilde{x}_{ij} + w_i \big(b_1 + \sum_{j \in \mathcal{U}} \beta_j \tilde{x}_{ij}\big) + \epsilon_i\right)\right)}$$

where $\mathcal{C}$ and $\mathcal{U}$ index the classification and uplift features, $\tilde{x}_i$ denotes the transformed feature vector, $w_i$ is the realized value of the treatment indicator variable $W$ (with $W = 1$ for treatment and $W = 0$ for control), $b_0$ is a constant controlling the baseline conversion probability for the control group, $b_1$ is a constant controlling the average treatment effect, $\beta_j$ are nonzero coefficients for the classification and uplift features and zero for the irrelevant features, and $\epsilon_i$ is an error term from a normal distribution with mean $0$. Note that the classification features affect the conversion probability regardless of the treatment group, while the uplift features only affect the conversion probability for the treatment group, which causes the heterogeneous treatment effect. For each user, we generate a counterfactual conversion probability under both control and treatment: $p(\tilde{x}_i, 0)$ and $p(\tilde{x}_i, 1)$.

5. Randomly assign the control and treatment labels to users with equal probability.

6. According to the observed experiment group $w_i$, generate the observed conversion $y_i$ from a Bernoulli distribution with probability $p(\tilde{x}_i, w_i)$.
Note that for each user, the true CATE is $\tau_i = p(\tilde{x}_i, 1) - p(\tilde{x}_i, 0)$. For feature selection and model training, only the feature values $x_i$, the experiment group $w_i$, and the corresponding outcome $y_i$ are observed as the training data set.
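The steps above can be condensed into a small generator with one feature of each type; the constants `b0`, `b1`, and the coefficient value are assumed for illustration (not the paper's settings), and the feature transformation and noise terms are omitted for brevity:

```python
import math
import random

def generate(n=1000, b0=-1.5, b1=0.5, beta=0.5, seed=0):
    """Returns rows (x_c, x_u, x_irr, w, y, tau) following the DGP sketch."""
    random.seed(seed)
    rows = []
    for _ in range(n):
        x_c = random.gauss(0, 1)    # classification feature
        x_u = random.gauss(0, 1)    # uplift feature
        x_irr = random.gauss(0, 1)  # irrelevant feature
        w = random.randint(0, 1)    # randomized 50/50 assignment

        def prob(t):  # conversion probability under treatment indicator t
            z = b0 + beta * x_c + t * (b1 + beta * x_u)
            return 1.0 / (1.0 + math.exp(-z))

        tau = prob(1) - prob(0)     # true CATE, known only in simulation
        y = 1 if random.random() < prob(w) else 0
        rows.append((x_c, x_u, x_irr, w, y, tau))
    return rows

rows = generate()
avg_tau = sum(r[5] for r in rows) / len(rows)
print(avg_tau > 0)  # True: b1 > 0 yields a positive average effect
```

Only `x_c`, `x_u`, `x_irr`, `w`, and `y` would be visible to the feature selection methods; `tau` serves as the evaluation ground truth.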
In this study, the generated data contains classification features, uplift features, and irrelevant features of all six pattern types. The values of the constants $b_0$ and $b_1$ are set so that the data has a realistic average control conversion probability and average treatment effect.
We evaluate eight feature selection methods, including five filter methods (F filter, LR filter, KL filter, Chi filter, and ED filter), two embedded methods (Two Model embedded and KL embedded), and one standard embedded method for classification as a benchmark (feature importance based on a random forest classifier, denoted as "outcome embedded"). The embedded methods associated with uplift random forests (KL embedded, Chi embedded, ED embedded) are very similar to each other; therefore, we use the KL embedded method to represent the performance of this class of methods. For the three uplift filter methods (KL filter, Chi filter, and ED filter), we use the default number of bins. We use four uplift models to evaluate the performance of the feature selection methods: Two Model, X-learner, R-learner, and the KL uplift random forest. As the uplift random forests have similar performance ([21, 27]), we use the KL model to represent this model family. For all the meta-learners, we use a random forest classifier as the base learner. In the simulation, all the random forest classifiers in the meta-learners and the uplift random forest share the same hyperparameter values for the number of trees, the maximum tree depth, the minimum sample size in a leaf required to perform a split, and the maximum number of features per split. If the number of features fed into the model is smaller than the maximum number of features per split, then we set the maximum number of features per split equal to the number of features.
Each simulation trial consists of four steps. First, we use the data generator to simulate the data with a new random seed and randomly split the data into training and testing sets. Second, we apply each feature selection method to the training data and rank the features from most important to least important. Third, for each feature selection method, we collect the top $k$ features selected and build uplift models based on these features using the training data. Fourth, we use the testing data to evaluate the accuracy of the uplift models built on the top $k$ features selected by each method. The simulation study consists of repeated independent trials.
As the main functional goal of uplift modeling is to estimate the CATE or ITE, we expect a good feature selection method to improve an uplift model's accuracy in estimating these effects. Figure 2 summarizes the RMSE (root mean square error) of the ITE estimates for different model and feature selection combinations. The four plots are divided by uplift model. Within each plot, the x-axis shows the number of top features used from the ranked feature list produced by each feature selection method, and the y-axis shows the RMSE of the ITE. We use the mean RMSE across trials to make the dot plot and calculate confidence intervals based on the standard error of the RMSE across trials. We provide a benchmark line showing the mean RMSE of the uplift model with all features included. The results show that the three uplift filter methods (KL filter, Chi filter, ED filter) have consistently top performance in all scenarios, followed by the F filter, LR filter, and KL embedded methods. The outcome embedded method has the poorest performance in nearly all scenarios. This observation supports the argument that a standard feature selection method (outcome embedded) fails for feature selection tasks in uplift modeling. A potential reason for the KL filter method outperforming the KL embedded method is that the binning in the filter method provides richer information compared with the binary node splits in the uplift trees.
Except in the cases where only a handful of features are selected, there is a clear advantage to performing feature selection compared with including all of the features. Peak model performance is achieved at the top six features by the three uplift filter methods. This is expected, since there are six uplift features in the data generation process by design. It also shows that the uplift filter methods are able to rank the true uplift features as the most important ones. As a comparison, the accuracy of the other methods keeps improving beyond six features, which means they missed some true uplift features in the top positions.
The F filter and LR filter methods perform similarly to the three top-performing filter methods for the first few top features. However, their performance declines as more features are included. The reason is that the F filter and LR filter are good at picking features with a linear uplift pattern but miss features with a nonlinear uplift pattern.
The relative performance of the feature selection methods is consistent across different uplift models. Although the purpose of this study is not to compare uplift model performance, the X-learner, R-learner, and KL model perform better than the Two Model approach (consistent with [21, 27]).
TABLE I: Proportion of uplift features captured in the top positions, by uplift pattern.

Method  All Uplift  Linear  Quadratic  Cubic  ReLU  Sin  Cos

ED filter  93.3%  99%  97%  78%  97%  94%  95% 
KL filter  85%  92%  90%  61%  92%  85%  90% 
Chi filter  81.7%  91%  86%  53%  90%  84%  86% 
KL embedded  59.8%  77%  90%  65%  62%  25%  40% 
F filter  54.8%  100%  11%  100%  100%  8%  10% 
LR filter  53.8%  100%  7%  100%  98%  7%  11% 
Two Model embedded  42.7%  74%  9%  35%  76%  24%  38% 
Outcome embedded  27.5%  61%  35%  23%  37%  5%  4% 
To better understand what explains the increase in uplift model accuracy, we report the proportion of uplift features selected in the top positions in Table I. The proportion is averaged across the trials. Note that in each trial there are six uplift features, one in each pattern category. For example, on average, the ED filter captures 93.3% of the uplift features in the top positions, and 99% of the time the linear uplift feature is captured in the top positions.
The table shows that the three filter methods perform best at capturing the uplift features, with the ED approach as the strongest method. The order of the feature selection methods in the table is consistent with their order based on uplift modeling performance, which shows the connection between selecting the true uplift features and achieving good uplift modeling performance. Consistent with the previous results, we again see poorer performance from standard feature selection methods such as "outcome embedded".
The detailed breakdown by uplift feature type explains why some methods do not perform well. The F filter and LR filter fail to capture quadratic, sine, and cosine features: by design, these methods have limitations in selecting nonlinear uplift features. The KL embedded method also performs poorly at recognizing sine and cosine features.
The three top-performing uplift filter methods share one common hyperparameter: the number of bins. In the study above, we used the default number of bins for these methods. It is interesting to study the sensitivity of feature selection performance with respect to this hyperparameter, so we perform an additional simulation study for the KL filter, Chi filter, and ED filter. The simulation setting is similar to the one above, except that the number of bins varies over a range of values. Figure 3 summarizes the results. The plots are divided by the number of top features selected and the uplift model type. Within each plot, the x-axis shows the number of bins used by each filter method, and the y-axis shows the RMSE of the ITE with a confidence interval. Across these scenarios, the common pattern is that too few bins fail to fully capture feature importance, while a moderate number of bins is generally a good choice; adding more bins beyond that does not necessarily improve performance.
IV-B Experiment 2: Evaluation with Real Data
TABLE II: Computation time of the feature selection methods.

Category  Filter  Filter  Filter  Filter  Filter  Embedded  Embedded  Embedded
Method  Chi  ED  KL  F  LR  KL  Two Model  Outcome
Time (seconds)  144  161  161  56  502  6,643  43  58
In this example, we evaluate the proposed methods using real-world data from an experiment conducted in a mobile phone application. The business context is that a product team would like to increase user conversion for a paid product feature in the application by offering users a discount. Conversion is defined as whether the user chooses to click and use this feature or not. The default control experience shows the original price without a discount, and the treatment experience shows the discounted price. The intervention is tested in a randomized experiment, and a chi-squared test shows that the average treatment effect on conversion is statistically significant. We train an uplift model on this data and historical user features to predict which customers would have the highest expected lift if they were given a promotion. The data set contains an equal split between the treatment group and control group, and we randomly split the observations into training and testing data.
To test the performance and generalizability of the feature selection methods on uplift models beyond random forest learners, different sets of base learners are tested within the meta-learner approaches. The uplift model variants considered are: (1) TwoModel-LR, X-learner-LR, and R-learner-LR, using a logistic regression classifier and a linear regression regressor as base learners; (2) TwoModel-LGBM, X-learner-LGBM, and R-learner-LGBM, using the gradient boosting classifier and regressor from the LightGBM implementation [14] as base learners; (3) TwoModel-RF, X-learner-RF, and R-learner-RF, using a random forest classifier and regressor as base learners; and (4) KL-RF, the uplift random forest using the KL divergence splitting criterion. The results are summarized in Figure 4, reporting the AUUC (area under the uplift curve) scores [22, 26, 21, 11] of the uplift models using the top features selected by each feature selection method. The relative performance of different feature selection methods can be compared within each column given the same uplift model. Generally speaking, the three bin-based uplift filter methods (Chi filter, ED filter, and KL filter) continue to perform well. The KL embedded method also has competitive performance. In contrast, the F filter, LR filter, and outcome embedded methods show poorer performance than the methods above. In addition, most uplift models perform more accurately with a feature selection method than without one. The logistic/linear-regression-based meta-learners perform worse than more complex models such as LGBM and random forests. Despite the differences in uplift models, the relative order of feature selection method performance is quite consistent across uplift models.
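A simplified sketch of an AUUC-style score: rank users by predicted uplift and accumulate the treatment-minus-control conversion gap over the top-ranked users. Real implementations (e.g., CausalML's uplift-curve utilities) differ in normalization details:

```python
def auuc(pred, w, y):
    """Area under a simple uplift curve (unnormalized sketch).
    pred: predicted uplift scores; w: treatment indicators; y: outcomes."""
    order = sorted(range(len(pred)), key=lambda i: -pred[i])
    nt = nc = yt = yc = 0
    area = 0.0
    for i in order:
        if w[i] == 1:
            nt += 1
            yt += y[i]
        else:
            nc += 1
            yc += y[i]
        if nt and nc:
            area += yt / nt - yc / nc  # uplift among the top-ranked users so far
    return area / len(pred)

# Uplift exists only among the first four users; a model that ranks them
# first earns a higher AUUC than one that ranks them last.
w = [1, 0, 1, 0, 1, 0, 1, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0]
pred_good = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
pred_bad = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(auuc(pred_good, w, y) > auuc(pred_bad, w, y))  # True
```

A feature selection method that retains the true uplift features lets the downstream model rank high-uplift users earlier, which this metric rewards.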
Computation time for feature selection is reported in Table II. All filter methods have moderate running time, while the Two Model embedded method and outcome embedded method benefit from the Cython implementation of the underlying models in scikit-learn [18]. In comparison, the KL embedded method has the highest time cost due to the pure Python implementation of its tree algorithm.
V. Conclusion
We have discussed seven feature selection methods designed for uplift modeling, including filter methods and embedded methods. Our experiments demonstrate that the proposed methods are able to select important features based on their association with heterogeneous treatment effects and thereby improve the ability of uplift models to predict individual treatment effects. In the empirical evaluation on synthetic and real-world data, the three bin-based filter methods, namely the Chi filter, ED filter, and KL filter, stand out with consistently good performance. The embedded method based on the uplift random forest also shows competitive results. Our experiments also indicate that standard feature selection methods for classification and regression cannot effectively solve the feature selection problem for uplift modeling.
One assumption of the proposed feature selection methods is that the data is collected from randomized experiments, where the treatment assignment mechanism breaks any systematic relationship between the features and whether a unit is in the treatment or control group. If the data is observational and the collected features differ between the treatment and control groups, then the methods proposed here may not improve the accuracy of ITE estimation. The reason is that accurate ITE estimation in observational studies requires us to condition on confounding variables, which are not guaranteed to survive the variable selection process. Extending the approaches proposed here into the observational setting is a promising area of future research.
References
[1]
[2] (2015) Recursive partitioning for heterogeneous causal effects. arXiv:1504.01132.
[3] (2016) Generalized random forests. arXiv:1610.01271.
[4] (2013) A review of feature selection methods on synthetic data. Knowledge and Information Systems 34(3), pp. 483–519.
[5] (2014) A survey on feature selection methods. Computers & Electrical Engineering 40(1), pp. 16–28.
[6] (2020) CausalML: Python package for causal machine learning. arXiv:2002.11631.
[7] (2013) Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8609–8613.
[8] (2017) Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods. Political Analysis 25(4), pp. 413–434.
[9] (2012) Random forests for uplift modeling: an insurance customer retention case. In Modeling and Simulation in Engineering, Economics and Management, pp. 123–133.
[10] (2015) Uplift random forests. Cybernetics and Systems 46(3–4), pp. 230–248.
[11] (2016) Causal inference and uplift modeling: a review of the literature. JMLR: Workshop and Conference Proceedings 67.
[12] (2001) Incremental value modeling. Research Council Journal.
[13] (1986) Statistics and causal inference. Journal of the American Statistical Association 81(396), pp. 945–960.
[14] (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
[15] (2017) Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv:1706.03461.
[16] (1923) Sur les applications de la théorie des probabilités aux expériences agricoles: essai des principes. Roczniki Nauk Rolniczych 10, pp. 1–51.
[17] (2017) Quasi-oracle estimation of heterogeneous treatment effects. arXiv:1712.04912.
[18] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[19] (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), pp. 688–701.
[20] (2005) Causal inference using potential outcomes. Journal of the American Statistical Association 100(469), pp. 322–331.
[21] (2012) Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems 32(2), pp. 303–327.
[22] (2015) Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery 29(6), pp. 1531–1559.
[23] (2014) Feature selection for classification: a review. In Data Classification: Algorithms and Applications, pp. 37.
[24] (2015) Estimation and inference of heterogeneous treatment effects using random forests. arXiv:1510.04342.
[25] (2013) Support vector machines for uplift modeling. In 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 131–138.
[26] (2017) Uplift modeling with multiple treatments and general response types. arXiv:1705.08492.
[27] (2019) Uplift modeling for multiple treatments with cost optimization. arXiv:1908.05372.