The explosion of data science in modern technology firms has created a new class of workers with the technical backgrounds needed to solve a wide array of statistical problems using a diverse set of machine learning (ML) techniques. However, the most important decisions made by such firms are typically policy questions such as “How much should we invest in R&D?”, “Should we cut prices?”, or “Which product would benefit most from an aggressive marketing campaign?” These are all questions that hinge on understanding the causal effect of various policy interventions and, as such, cannot be answered (or even well informed) by purely statistical approaches. Instead, they require econometric techniques that can yield answers with a clear causal interpretation.
Causal inference is about understanding the true effect of a treatment, call it $D$, on an outcome, call it $Y$: how would $Y$ change if we changed $D$? ML, on the other hand, is usually about building a good predictor function of $Y$ using many features (which may include $D$). These are fundamentally different tasks, and one should therefore be careful when moving from one domain to the other or combining the two.
When $D$ is randomly assigned, establishing a causal interpretation is straightforward and can be done using even very simple statistical techniques. However, most business decisions involve situations where experimental data is impossible to obtain or at least not immediately available. This introduces considerable additional difficulty. One has to control for factors that independently affect both $D$ and $Y$. These are called confounders (call them $X$). Omitting them will cause our causal estimates to be wrong; this is often referred to as “omitted variable bias”. As a simple example, suppose we were to analyze the impact of price on sales volume for a retailer of children’s toys, but we were unaware of the important role of the holiday season. We might naively conclude that the relatively small discounts observed at Christmas time were very effective drivers of sales, and we might erroneously advise the retailer to consider lowering its prices more often. Should they follow our advice, and should Christmas “fail to come in July”, our error could be quite embarrassing.
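The toy-retailer story can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, effect sizes, and noise levels are all assumptions chosen for this example, not figures from any real retailer. It compares a regression that omits the holiday indicator with one that controls for it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: `holiday` plays the role of the confounder X.
holiday = rng.binomial(1, 0.1, size=n).astype(float)
# Treatment D: discounts are only slightly deeper during the holidays.
discount = 0.10 + 0.05 * holiday + 0.05 * rng.standard_normal(n)
# Outcome Y: the true discount effect is 2.0, the holiday effect is 5.0.
sales = 10.0 + 2.0 * discount + 5.0 * holiday + rng.standard_normal(n)

def ols(y, *columns):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

naive_effect = ols(sales, discount)[1]              # omits the confounder
adjusted_effect = ols(sales, discount, holiday)[1]  # controls for it

print(f"naive: {naive_effect:.2f}, adjusted: {adjusted_effect:.2f}, truth: 2.00")
```

In this setup the naive coefficient absorbs the holiday lift and greatly overstates the effect of discounts, while the adjusted coefficient lands near the true value.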
Even a naive analyst could be expected to recognize the confounding role of Christmas, but this is an extreme case. Often confounding variables are much more subtle, and an analyst may be unsure whether a particular variable should be considered a confounder or how it interacts with other elements of the system. This process of carefully selecting and modeling the impact of confounders is referred to by economists as model selection. In practice, this is often a time-consuming, tenuous and somewhat arbitrary process that can make a significant difference in the final result. Economists are often loath to trust empirical work done by non-experts because they feel that this model selection stage has been performed in error.
It is this last issue, the difficulty of model selection, that motivates our package. We seek to automate the process of model selection by placing machine learning techniques within a carefully constructed econometric framework that can deliver robust causal estimates. For economists, this offers the attraction of automating model selection to generate significantly improved (and less arbitrary) models. For machine learning practitioners, this has the advantage of placing the necessary structure around their tools to ensure a clean causal interpretation of business-relevant parameters. For a business person, this can yield an all-in-one solution that generates treatment-aware forecasts, providing forward-looking predictions for future outcomes of interest (e.g. sales) and how (on average) those outcomes can be modified as a function of planned treatments (e.g. prices).
2 Double ML Preliminaries
The econometric framework that allows us to use ML in causal inference is referred to as “Double ML” [1]. Its application to the problem of controlling for confounders rests on two key ideas: the Frisch–Waugh–Lovell (FWL) Theorem and the “cross-fitting” procedure. The FWL theorem is a simple application of linear algebra to ordinary least squares (OLS) regression. Suppose we want to estimate the effect of $D$ on $Y$ while controlling for $X$ in a regression without any ML. The standard way would be to estimate the full regression model:
$$Y = \alpha + \theta D + X'\beta + \varepsilon \qquad (1)$$
The FWL Theorem states that we will recover the same estimate of $\theta$ as from estimating equation 1 if we first remove the effect of $X$ from both $Y$ and $D$. That is, if we:
1. Regress $Y$ on $X$. Then generate fitted values $\hat{Y}$ and residuals $\tilde{Y} = Y - \hat{Y}$.
2. Regress $D$ on $X$. Then generate fitted values $\hat{D}$ and residuals $\tilde{D} = D - \hat{D}$.
3. Regress $\tilde{Y}$ on $\tilde{D}$. The resulting coefficient equals the estimate of $\theta$ from equation 1.
The first two steps can be thought of as “baseline” estimates of $Y$ and $D$ that just use $X$. Notice that we do not care about the coefficients in the baseline stage. All we care about is how to predict the outcome and treatments as a function of the potential confounders. However, using OLS to fit the regressions in steps (1-2) has two key weaknesses: first, it does not allow for a non-linear relationship between confounders and outcomes/treatments, and second, if the number of confounders is not small (when compared to the sample size), OLS will substantially overfit the model in a way that has poor out-of-sample predictive properties.
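The FWL equivalence is easy to verify numerically. The following minimal sketch uses simulated data (the coefficients and sample size are arbitrary assumptions) and checks that the coefficient on the treatment from the full regression matches the coefficient from the residual-on-residual regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.standard_normal((n, 3))                                      # confounders
D = X @ np.array([0.5, -1.0, 0.3]) + rng.standard_normal(n)          # treatment
Y = 1.5 * D + X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)  # outcome

def ols_fit(y, design):
    """Return least-squares coefficients of y on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

controls = np.hstack([np.ones((n, 1)), X])   # intercept plus confounders

# Full regression: the first coefficient is the one on D.
theta_full = ols_fit(Y, np.hstack([D[:, None], controls]))[0]

# FWL route: residualize Y and D on the controls, then regress residual on residual.
Y_res = Y - controls @ ols_fit(Y, controls)
D_res = D - controls @ ols_fit(D, controls)
theta_fwl = ols_fit(Y_res, D_res[:, None])[0]

print(theta_full, theta_fwl)
```

The two numbers agree up to floating-point error, which is what allows the first two regressions to be swapped for more flexible predictors later on.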
As such, we may prefer to use more general ML techniques to fit the baseline estimations. However, without further accommodation, overfitting can still lead to poor statistical performance even if our ML algorithm has strong predictive performance. Overfitting, though, is not a new problem for the ML literature. The key then is to re-purpose the existing solution of cross-validation into a new algorithm called “cross-fitting”.
[$K$-fold Cross-fitting] Cross-fitting is a procedure for fitting and predicting a model type $M$ on data $(X, y)$ using multiple sub-models of the same type in such a way that predictions for each observation are made using sub-models that were not trained on that observation.
1. Split the data into a $K$-fold partition.
2. For each partition $k$, fit a sub-model $M_{-k}$ by excluding the data from partition $k$.
3. Prediction: For each observation $i$ with features $X_i$, the prediction $\hat{y}_i$ is the average prediction of all sub-models that were not trained on observation $i$.
(a) If observation $i$ was used in fitting $M$, then $\hat{y}_i = M_{-k}(X_i)$ for the $k$th fold that contained observation $i$.
(b) If observation $i$ was not used in fitting $M$ (e.g. we are looking at a hold-out sample), then $\hat{y}_i = \frac{1}{K} \sum_{k=1}^{K} M_{-k}(X_i)$.
Cross-fitting can be thought of as the first phase of cross-validation. Cross-validation would normally continue and look at average predictive performance on the test folds, for example, as part of a larger algorithm to tune a hyper-parameter.
By using cross-fitting in our baseline predictive stages, our third regression is performed on residuals that are calculated using models trained on entirely independent data. These are called “honest” residuals and have much better statistical properties.
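A minimal sketch of cross-fitting is given below. The helper names (`fit_ridge`, `cross_fit`, `predict_new`) and the choice of ridge regression as the sub-model are assumptions made for illustration; any learner that returns a prediction function could be substituted.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Hypothetical stand-in learner: ridge regression with an unpenalized intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(design.shape[1])
    penalty[0, 0] = 0.0
    coef = np.linalg.solve(design.T @ design + penalty, design.T @ y)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def cross_fit(X, y, fit, n_folds=5, seed=0):
    """Fit one sub-model per fold and return out-of-fold predictions for every row."""
    rng = np.random.default_rng(seed)
    fold_of = rng.permutation(len(y)) % n_folds     # random, balanced fold labels
    models = []
    oof_pred = np.empty(len(y))
    for k in range(n_folds):
        held_out = fold_of == k
        model_k = fit(X[~held_out], y[~held_out])   # trained without fold k
        models.append(model_k)
        oof_pred[held_out] = model_k(X[held_out])   # predict only the excluded fold
    return models, oof_pred, fold_of

def predict_new(models, X_new):
    """For rows never used in training, average the predictions of all sub-models."""
    return np.mean([m(X_new) for m in models], axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.standard_normal(200)
models, y_hat, fold_of = cross_fit(X, y, fit_ridge)
honest_residuals = y - y_hat   # each residual uses a model that never saw that row
```

Each entry of `honest_residuals` is computed from a sub-model trained without the corresponding observation, which is the property the text relies on.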
We are now ready to pose the full ML problem. This involves structuring the overall causal inference question so as to split out the parts that are pure prediction problems and, as such, can be handed over to ML. To be concrete, imagine that we are interested in some parameter $\theta$ which gives the average effect of a treatment ($D$) on an outcome ($Y$). Further, suppose that we observe a high-dimensional set of potential confounds ($X$). Each element of $X$ might be related to the outcome, to the assignment of treatment, or both. As such, we will not be able to learn $\theta$ until we have modeled the impact of these confounding variables. Suppose we place absolutely no restrictions on the impact of our confounders; then we would write
$$Y = \theta D + g(X) + \varepsilon \qquad (2)$$
$$D = m(X) + \eta \qquad (3)$$
$$E[\eta \cdot \varepsilon] = 0 \qquad (4)$$
where $g$ and $m$ are unrestricted functions. These are the nuisance parameters of the system, and we collect them as $(g, m)$. One important note is that (4) has imposed an important restriction on our model. Often referred to as an independence or orthogonality condition, this implies that once we have correctly modeled the impact of our confounding variables on our treatment, any remaining variation in treatment must be uncorrelated with the unexplained variation in the outcome. If $Y$ was sales and $D$ was price, this condition would require that idiosyncratic variation in sales (over and above what we could forecast in our first regression) was not driven by the same unobservables that drove idiosyncratic variation in prices. If there are major demand shocks of which price-setters were aware, but which are not recorded in our data, then we may find this assumption to be questionable and may need to consider other approaches. But if we accept this condition (as we might if $X$ contains all the important data available to our price-setters), then we can perform robust causal inference on $\theta$ by using the following steps. First, ignore treatment and estimate a reduced form relationship ($q$) between $X$ and our outcome $Y$. (Note that when we are estimating the reduced form relationship between $X$ and $Y$, we do not recover the structural impact of $X$ on $Y$ (as given by $g$), but rather the related object $q$, which also includes the downstream impact our confounders have on the outcome that is channeled through treatment.)
Second, estimate $m$ as the reduced form relationship between our confounds $X$ and our treatment $D$. These two steps produce “baseline” (or treatment-blind) forecasts of both our outcome and our expected assignment of treatment. They are analogous to steps 1 and 2 in the FWL Theorem, and we refer to them collectively as first-stage regressions. Arbitrary ML methods can be used to estimate these objects with the maximum possible out-of-sample precision. The residuals from these regressions can then be considered to be “surprises” in the evolution of treatment/outcome, and any remaining correlation between them can then be estimated in a second-stage regression (analogous to the final step of the FWL Theorem) and interpreted as robust evidence of a causal effect of treatment on outcome. We formalize this recipe (and also describe how cross-fitting is used) in the Double ML recipe below.
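To see why regressing residuals on residuals isolates the treatment effect, it may help to write out the algebra. The derivation below is a sketch under assumed notation: the model is $Y = \theta D + g(X) + \varepsilon$ and $D = m(X) + \eta$, with the additional assumption that both error terms have mean zero given $X$, and $q(X)$ denotes the reduced form $E[Y \mid X]$.

```latex
\begin{align*}
q(X) &\equiv E[Y \mid X] = \theta\, E[D \mid X] + g(X) = \theta\, m(X) + g(X), \\
\tilde{Y} \equiv Y - q(X) &= \theta D + g(X) + \varepsilon - \theta\, m(X) - g(X)
   = \theta \bigl(D - m(X)\bigr) + \varepsilon, \\
\tilde{D} \equiv D - m(X) &= \eta
   \quad\Longrightarrow\quad \tilde{Y} = \theta\, \tilde{D} + \varepsilon .
\end{align*}
```

The unknown function $g$ cancels, so a simple regression of $\tilde{Y}$ on $\tilde{D}$ recovers $\theta$ as long as $\tilde{D}$ is uncorrelated with $\varepsilon$, which is exactly the orthogonality condition discussed above.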
[Double ML Recipe] The generic Double ML recipe is:
1. Split the data into a $K$-fold partition.
2. Estimate $\hat{q}$ (the baseline model for $Y$) and $\hat{m}$ (the baseline model for $D$) by cross-fitting using the common data partition.
3. Compute first-stage residuals $\tilde{Y} = Y - \hat{q}(X)$ and $\tilde{D} = D - \hat{m}(X)$.
4. Pool the first-stage residuals from all partitions and estimate the causal effect ($\theta$) by a simple regression of $\tilde{Y}$ onto $\tilde{D}$. Additionally, the OLS standard errors computed from this regression can be interpreted as a valid standard error for $\hat{\theta}$.
This recipe was originally proposed by Chernozhukov et al. [1] and proven to be consistent for inference on a low-dimensional treatment parameter $\theta$. It was extended to the case where treatment is high dimensional (i.e. we may want to learn a full matrix of cross-price elasticities) [2].
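The recipe can be sketched end to end in a few lines. This is an illustration under stated assumptions rather than the package's implementation: `ridge_learner` is a hypothetical first-stage learner (ridge regression on the confounders and their squares), the data are simulated with a true effect of 1.5, and the standard error is the classical homoskedastic OLS formula.

```python
import numpy as np

def ridge_learner(X, y, alpha=1.0):
    """Hypothetical first-stage learner: ridge on X and X**2 with an intercept."""
    def expand(Z):
        return np.column_stack([np.ones(len(Z)), Z, Z ** 2])
    A = expand(X)
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0                       # do not shrink the intercept
    coef = np.linalg.solve(A.T @ A + penalty, A.T @ y)
    return lambda X_new: expand(X_new) @ coef

def double_ml(Y, D, X, learner, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    fold_of = rng.permutation(len(Y)) % n_folds    # step 1: common partition
    Y_res = np.empty(len(Y))
    D_res = np.empty(len(D))
    for k in range(n_folds):
        test = fold_of == k
        q_hat = learner(X[~test], Y[~test])        # step 2: cross-fit both models
        m_hat = learner(X[~test], D[~test])
        Y_res[test] = Y[test] - q_hat(X[test])     # step 3: honest residuals
        D_res[test] = D[test] - m_hat(X[test])
    theta = (D_res @ Y_res) / (D_res @ D_res)      # step 4: pooled OLS, no intercept
    resid = Y_res - theta * D_res
    se = np.sqrt(resid @ resid / (len(Y) - 1) / (D_res @ D_res))
    return theta, se

rng = np.random.default_rng(42)
n = 2000
X = rng.standard_normal((n, 2))
D = 0.5 * X[:, 0] ** 2 - X[:, 1] + rng.standard_normal(n)          # non-linear m(X)
Y = 1.5 * D + X[:, 0] ** 2 + X[:, 1] + rng.standard_normal(n)      # non-linear g(X)
theta_hat, se_hat = double_ml(Y, D, X, ridge_learner)
print(f"theta_hat = {theta_hat:.3f} (se {se_hat:.3f}); true value 1.5")
```

Because both $D$ and $Y$ depend on the square of the first confounder, a regression with only linear controls would be biased here, whereas the residualized regression recovers a value close to the truth.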
2.1 Extension to heterogeneous treatment effects
Suppose we are interested in understanding how the impact of $D$ varies with some other covariate. For example, we might want to test the hypothesis that consumer demand responds to some treatment (perhaps a price cut) by more in the US as compared to other markets. Thus we might modify our previous model to take the slightly altered form given by
$$Y = \theta_1 D + \theta_2 \, (D \cdot US) + g(X) + \varepsilon$$
where $US$ is an indicator for the US market. Here it is sufficient to follow the same algorithm as in the Double ML recipe, but to alter the final stage by regressing $\tilde{Y}$ jointly onto $\tilde{D}$ and $\tilde{D} \cdot US$ to learn both a baseline treatment effect and a heterogeneous impact of treatment in the US market. More generally, [2] shows that this algorithm may be used to learn any set of heterogeneous treatment effects which are an affine modification of a single core treatment which is residualized. In addition to the simple interaction demonstrated above, this can include higher order interactions, the impact of a peer’s treatment, or the average impact of peer treatment averaged over a broad set of peers (e.g. an average cross-price elasticity over some range of competing products). However, this method cannot be applied to learn the impacts of non-linear transformations of the core treatment. In such cases we must perform an entirely separate residualization, as demonstrated in the next subsection.
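The altered final stage can be illustrated with simulated data. In this sketch the market indicator `us`, the effect sizes (a baseline effect of -1.0 and an additional -0.5 in the US), and the use of plain OLS for the first stage are all assumptions; cross-fitting is omitted for brevity because the first stage has only four features.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
x = rng.standard_normal(n)                         # a continuous confounder
us = rng.binomial(1, 0.4, size=n).astype(float)    # market indicator, part of X
D = 0.8 * x + rng.standard_normal(n)               # treatment
Y = (-1.0 - 0.5 * us) * D + 2.0 * x + 0.3 * us + rng.standard_normal(n)

# First-stage features include the interaction so the reduced forms are well specified.
features = np.column_stack([np.ones(n), x, us, us * x])

def residualize(target, design):
    """Return the part of target not explained by a least-squares fit on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

Y_res = residualize(Y, features)
D_res = residualize(D, features)   # the single core treatment is residualized once

# Second stage: regress jointly on the residual and its interaction with the indicator.
second_stage = np.column_stack([D_res, D_res * us])
(theta_base, theta_us), *_ = np.linalg.lstsq(second_stage, Y_res, rcond=None)
print(f"baseline effect {theta_base:.2f}, additional US effect {theta_us:.2f}")
```

Note that only one treatment residual is computed; the heterogeneous term is built by multiplying that residual by the indicator, which is the affine modification the text describes.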
2.2 Extension to multiple treatments
Often we want to estimate the impact of multiple treatments. For example, we may wish to model sales as a function of both pricing and marketing treatments. Often pricing and marketing decisions are correlated (it may make sense to run a price cut contemporaneously with a big ad purchase) and as such we must model these two treatments simultaneously. To do otherwise, estimating their effects separately, would attribute the impacts of both treatments to whichever one was being modeled. More generally, suppose we want to estimate $J$ different treatment effects. Let our environment be
$$Y = \sum_{j=1}^{J} \theta_j D_j + g(X) + \varepsilon$$
$$D_j = m_j(X) + \eta_j, \qquad j = 1, \dots, J$$
Then our previous procedure is modified so that we train an independent predictive function $\hat{m}_j$ for each treatment $D_j$. Then, following the established cross-fitting formula, we compute residuals $\tilde{D}_j = D_j - \hat{m}_j(X)$ and regress $\tilde{Y}$ jointly onto the residuals $\tilde{D}_1, \dots, \tilde{D}_J$.
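The cost of modeling correlated treatments separately can be seen in a short simulation. All names and numbers below are assumptions for illustration: price and advertising share a common planning shock, the true effects are -2.0 and 1.0, and OLS stands in for the first-stage learner (without cross-fitting, for brevity).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
X = rng.standard_normal((n, 2))
common = rng.standard_normal(n)     # price cuts and ad buys are planned together
price = X @ np.array([0.5, 0.2]) - 0.7 * common + 0.5 * rng.standard_normal(n)
ads = X @ np.array([-0.3, 0.4]) + 0.7 * common + 0.5 * rng.standard_normal(n)
sales = -2.0 * price + 1.0 * ads + X @ np.array([1.0, -1.0]) + rng.standard_normal(n)

def residualize(target, controls):
    """Residual of target after a least-squares fit on an intercept and controls."""
    design = np.column_stack([np.ones(len(target)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

Y_res = residualize(sales, X)
# One residualization per treatment, then a joint second-stage regression.
T_res = np.column_stack([residualize(price, X), residualize(ads, X)])
joint, *_ = np.linalg.lstsq(T_res, Y_res, rcond=None)

# For comparison: modeling price alone attributes part of the ad effect to price.
price_only = (T_res[:, 0] @ Y_res) / (T_res[:, 0] @ T_res[:, 0])
print(f"joint: price {joint[0]:.2f}, ads {joint[1]:.2f}; price alone: {price_only:.2f}")
```

The joint regression recovers both effects, while the price-only regression overstates the magnitude of the price effect because price cuts coincide with advertising.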
3 Dynamic DML
On its own, Double ML doesn’t incorporate any explicit knowledge of how data or effects are related across time. But in a business context, this can be very important. There is often a significant gap between when actions need to be planned and when they can be executed, but the size of this gap can vary dramatically across contexts. Supply chain decisions often require an initial purchase to be executed months in advance, marketing budgets can be adjusted weeks in advance, and in some cases pricing can be adjusted at a moment’s notice. Furthermore, we may often want to understand the impact of a price chosen tomorrow on consumer demand into the future (did we cannibalize some future demand?). These concerns lead us to develop the Dynamic DML algorithm, which is an extension of Double ML to a setting where we must model outcomes at a variety of lead times.
Dynamic DML incorporates this into both the baseline forecasting and causal model stages. At the baseline stage, we need not just one pair of baseline forecasting models ($\hat{q}$, $\hat{m}$), but rather a range of forecasting models for different lead times. For example, a forecast with a lead of one is a forecast that looks one period ahead. Formally, for each lead $l$, we want to train baseline models over data from each unit $i$ and time period $t$:
$$\hat{Y}_{i,t+l} = \hat{q}_l(I_{i,t}), \qquad \hat{D}_{i,t+l} = \hat{m}_l(I_{i,t})$$
where $I_{i,t}$ is all the information known at time $t$, such as $Y_{i,t}$, $D_{i,t}$, $X_{i,t}$, their past values, and anything predetermined (e.g. dates of holidays). The analogy is to a forecaster at some reference date trying to predict values at some future outcome date; the lead time is the difference between the two. Each variable’s lead-time-specific forecast is estimated across all possible values of the reference date. We formalize this view in the following definitions.
Reference date – the date from which we sit when we train our first-stage baseline models to predict outcome and treatment.
Outcome date – the date for which the forecaster wants to predict outcomes.
Lead time – the gap between the outcome date and the reference date.
With multiple forecasts at different leads, we can extend the causal model to identify delayed effects of some treatment (e.g. the “pull-forward” effect of a sale). Suppose we trained forecast models for leads one to four. The residuals $\tilde{D}_{t+1}$ and $\tilde{Y}_{t+1}$ are the one-period-ahead surprises in treatment and outcome. The relation between them gives us evidence for the contemporaneous treatment effect. With more leads, we can go further and look at how a surprise in treatment at one date affects the surprise in outcome at a later date, which helps us identify a delayed treatment effect. In economic terms, this gives inter-temporal substitution (or the “pull-forward” cannibalization of demand). Figure 3 shows this graphically for this example setup, from the perspective of a forecaster at date $t$.
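The relationship between reference date, outcome date, and lead time can be made concrete by building one training set per lead. The helper `lead_pairs` below is hypothetical (it is not part of the package) and uses only the current and previous values of a single series as the information set.

```python
import numpy as np

def lead_pairs(series, lead, n_lags=2):
    """Pair the information known at reference date t with the value at t + lead.

    Features at t are the latest `n_lags` values of the series (newest first);
    the target is the value at the outcome date t + lead.
    """
    rows, targets, ref_dates = [], [], []
    for t in range(n_lags - 1, len(series) - lead):
        rows.append(series[t - n_lags + 1 : t + 1][::-1])
        targets.append(series[t + lead])
        ref_dates.append(t)
    return np.array(rows), np.array(targets), np.array(ref_dates)

series = np.arange(10.0)   # stand-in for one unit's outcome (or treatment) history

# One dataset per lead; a separate baseline model would be trained on each.
datasets = {lead: lead_pairs(series, lead) for lead in range(1, 5)}

features, targets, ref_dates = datasets[3]
print(features[0], targets[0], ref_dates[0])   # info at t=1 paired with value at t=4
```

Longer leads leave fewer usable reference dates, since the outcome date must still fall inside the observed sample.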
The DynamicDML object takes a set of leads, builds the accompanying baseline forecasts, and then allows the user to estimate intertemporal effects. If one wishes to ignore these dynamic complexities, the DoubleML class (which defaults to a single lead of zero) may offer a simpler user experience. Or one can simply set min_lead=max_lead=1 in DynamicDML, which trains a single model using the maximum available information set (which typically results in the narrowest confidence intervals on the resulting causal parameters).
4 Modeling Recommendations
4.1 Baseline Stage
In this section we recommend what should be included in the base set of features $X$ and how to construct the unknown $q$ and $m$ functions. This latter part entails both the feature generation and picking the ML algorithm, which are closely related (e.g. tree models will automatically try to detect interactions and non-linearities, whereas a Lasso will need these explicitly featurized).
In general it is better to add anything that might be a confounder or predictor to $X$, but technically we must avoid what the econometrics literature calls bad controls. These are variables whose values were affected by the value of treatment. They are, therefore, a type of ancillary outcome rather than a real control, and their inclusion in $X$ will bias our estimated treatment effects (they capture part of the overall effect of $D$ on $Y$). In most settings, however, the forecaster perspective used by Dynamic DML will be sufficient to avoid the problem, as only previous values of variables are used as features in the baseline stage.
For feature generation, we provide the following built-in featurizers:
default_panel_featurizer: Includes $X$, dummy variables for each unit of time, and a dummy variable for each panel/unit variable (e.g. “product”). This featurizer attempts to mimic standard “panel data” analysis.
default_dynamic_featurizer: Includes $X$, past values of $Y$ and $D$, and trends of those variables. This featurizer attempts to detect trends in product popularity.
In addition, both of these featurizers can absorb additional features of the reference date that may help us make forward-looking predictions (e.g. how many Google searches do we observe for our product?) or features of the outcome date that may help predict seasonal patterns (e.g. is the outcome date Christmas? What is the usual intensity of sales during that time of year?).
For ML algorithms, the following are some general recommendations from Chernozhukov et al. [1]:
If the set of true confounders is sparse (i.e. only a few are truly important), use sparsity-based techniques such as Lasso, post-Lasso (where we first run a Lasso to get selected variables and then run OLS with just the selected variables), or $L_2$-boosting.
If confounders have sharply different behavior on different subsets of our data, it may be best to use trees or random forests.
If $q$ and $m$ are well approximated by a sparse (deep) neural net, then use a penalized (deep) neural network.
If any of the above are true, then one can also use an ensemble method over the methods mentioned above.
Following these guidelines, the Pricing Engine has built-in models for Lasso, boosted trees, Random Forests, Neural Nets, as well as simple ensemble methods such as Bucket of Models and Stacking. Alternatively, the Pricing Engine can also take (as inputs to a PrePredicted class) first-stage forecasts generated by some other ML tool.
4.2 Causal algorithm
The Double ML theory provides statistical guarantees for using OLS as the causal algorithm, and this should be the default choice for most problems. OLS tends to perform badly, however, with many highly collinear features. In these cases Ridge regression may provide more stable second-stage estimates. The basic Ridge algorithm, however, has the downside that, since it penalizes its parameters, the estimates will be biased towards zero. We therefore provide a modified Ridge algorithm that in practice captures the majority of the benefits of Ridge regression while retaining only a fraction of its downsides. The key is that the treatment effects can usually be clustered so that each group contains a high-level main effect and then numerous secondary effects. For example, we may look for the average effect of a price discount and then check for heterogeneous effects by sales region. Given that we are checking for heterogeneous effects across all regions, we should be more skeptical about each of those than about the main effect. The modified Ridge algorithm therefore allows certain features to be left unpenalized, and we pair it with TreatmentBuilder objects that can be configured to penalize just the “secondary” treatment effects.
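One way to leave selected features unpenalized is to use a diagonal penalty matrix with zeros in the positions of the main effects. The sketch below illustrates that idea only; `partial_ridge` is a hypothetical helper, not the package's modified Ridge algorithm, and the regional effects are simulated.

```python
import numpy as np

def partial_ridge(A, y, alpha, penalized):
    """Ridge regression that penalizes only the columns flagged True.

    Solves min_b ||y - A b||^2 + alpha * sum_{j penalized} b_j^2.
    """
    penalty = alpha * np.diag(np.asarray(penalized, dtype=float))
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

rng = np.random.default_rng(5)
n, n_regions = 3000, 5
region = rng.integers(0, n_regions, size=n)
D_res = rng.standard_normal(n)                       # residualized treatment
deviation = np.array([0.2, -0.2, 0.1, -0.1, 0.0])    # regional departures from -1.0
Y_res = (-1.0 + deviation[region]) * D_res + rng.standard_normal(n)

onehot = np.eye(n_regions)[region]
# Main effect plus one interaction per region. These columns are perfectly
# collinear, so plain OLS has no unique solution here.
A = np.column_stack([D_res, D_res[:, None] * onehot])
penalized = [False] + [True] * n_regions             # shrink only the secondary effects
coef = partial_ridge(A, Y_res, alpha=10.0, penalized=penalized)
print("main effect:", round(coef[0], 3), "regional deviations:", np.round(coef[1:], 3))
```

The unpenalized main effect settles near the average effect across regions, while the regional coefficients are estimated as shrunken deviations from it.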
Evaluating the performance of a model’s predictive ability is usually straightforward: retain a hold-out “test/validation” sample and check the fitted model’s performance on that sample. This will give an unbiased estimate of the true predictive performance. Evaluating the validity of a model’s causal estimates is much more difficult. In observational data, there is usually no “ground truth” that cleanly indicates true causal effects. (One can know the ground truth by generating artificial data with known treatment parameters and seeing how well the estimation strategy does at recovering the effects. Since this type of validation is quite narrow, in that it only gives assurance about the particular data generating process specified and not necessarily about what is going on in the real world, it may be best suited to cases where there is a particular concern about the estimation setting and the model choices under consideration.) This problem is not new, and there is an extensive literature on validating and interpreting the output of OLS-type methods. We briefly mention here some of the key points and note how these methods can be applied to the Double ML setup.
First, we note that estimated coefficients may occasionally have an unintuitive sign. As an example, if treatment is price and our outcome is sales, a positive coefficient would indicate that a price increase would result in more sales. This is not necessarily cause for concern. A user should first check the estimated standard errors to see if the result is statistically significant. If the estimate is statistically insignificant, it may be the case that there was simply too little residual variation in the treatments to obtain useful causal estimates. If, however, coefficients are estimated “with the wrong sign” and are statistically significant, this may be cause to re-evaluate: (1) the first-stage models and whether or not they are appropriately specified to model the impacts of confounding variables, or (2) the independence assumption discussed in section 2.
Second, we may want to see how robust our estimates are to small changes in our estimation strategy. Here we suggest diagnostics that analyze how model estimates change as either variables or observations are dropped. One of the key metrics at the variable level is the variance inflation factor (VIF). For each treatment variable, this measures how much the variance of its estimated coefficient is inflated by its correlation with the other treatment variables. Highly correlated treatments will cause the estimate of one parameter to depend on the inclusion of the other. This will also mean that the two treatments will “compete” for the effect, often causing one to take an unintuitive sign. A Ridge regression in the causal stage can be used to deal with highly correlated treatments. A good baseline stage should also lower the VIF by projecting out common drivers of both treatments so that effects are easier to estimate. One may also be concerned that outlier observations are driving the causal estimates. It may be wise to individually inspect those observations whose values of treatment and outcome are fit the most poorly in the first-stage regressions and to see how omitting these data points impacts model estimates. (There are multiple statistics at the observation level that characterize how the model changes with the inclusion of each observation. General outlier analysis at baseline is helpful for determining problems in the underlying data. After baseline, one can look at outliers (in terms of the outcome or treatment variables) to assess how good the model is at finding the “mini-experiments”. Are the large residuals periods where we think there was some new change to the variable, or is there some confounding that the model is picking up? This can be extended to looking at specific measures of “influence”, such as Cook’s D (how the overall fit of a regression changes with the inclusion of each point) and DFBETA (how the coefficients change with the inclusion of a point) in the causal stage.)
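As a concrete illustration of the variable-level diagnostic, the sketch below computes the standard VIF, $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing treatment $j$ on the other treatments. The function name and the simulated treatments are assumptions for this example.

```python
import numpy as np

def variance_inflation_factors(T):
    """VIF for each column of T, from regressing it on the remaining columns."""
    n, p = T.shape
    vifs = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(T, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, T[:, j], rcond=None)
        resid = T[:, j] - others @ coef
        centered = T[:, j] - T[:, j].mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

rng = np.random.default_rng(11)
n = 2000
shared = rng.standard_normal(n)
treatments = np.column_stack([
    shared + 0.3 * rng.standard_normal(n),   # two treatments driven by a common factor
    shared + 0.3 * rng.standard_normal(n),
    rng.standard_normal(n),                  # an unrelated treatment
])
vifs = variance_inflation_factors(treatments)
print(np.round(vifs, 2))
```

The first two treatments show inflated values because they move together, while the unrelated treatment stays close to one. Applying the same function to residualized treatments shows how much the baseline stage has reduced the overlap.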
Finally, even if our causal estimates are of reasonable sign and magnitude and are not overly sensitive to outliers/model specification, we will still want to validate them. Fundamentally, this can only be done by randomizing the value of treatment (i.e. performing an experiment). In some cases, experimental variation may already be present in some subset of the data. Then one can compare estimates on this subset versus on its complement. Alternatively, one can use the model’s estimates as a suggestion of where to target new experiments and in the process validate the model estimates.
5 Implementing Double ML in the Pricing Engine SDK
The Pricing Engine SDK enables the user to flexibly apply this structure to their problem of choice. The major choices available to the user are:
Given the data, what features comprise the high-dimensional set of potential confounds? Do we want to construct derived features to capture particular dynamics, interactions, or non-linearities? (See section 4.1)
What ML algorithm should be used for the first-stage regressions of $Y$ and $D$ onto $X$? (See section 4.1)
Exactly how should residualized treatments be manipulated so that we can learn interesting patterns of heterogeneous, or peer, treatment effects? This will depend on the stakeholders.
What second-stage algorithm should be used to infer the causal parameter $\theta$? (See section 4.2)
How far into the future do we wish to forecast outcomes and understand causal impacts?
We encourage the reader to review the accompanying OJ Demand Model Jupyter notebook to see an example of how these choices are specified. A relevant code snippet is produced below. As you can see, the user can flexibly enter different values for:
feature_builders: This is a list of VarBuilder structures specifying how first-stage features are generated. In this example, we use a default list of VarBuilders created by another class, but this can just as easily be flexibly specified by the user to contain a preferred set of forecasting features. You can swap in your own featurizer function in place of default_panel_featurizer.
baseline_model: The model used to estimate first-stage (ML) regressions. New base models can be added by inheriting from our Model class (which will be automatically wrapped for cross-fitting) and new ensemble methods can be added by inheriting from our SampleSplitModel class. If you would like to generate the predictions offline, potentially in a completely different environment, we provide the PrePredicted class with utilities for integrating those predictions into DynamicDML.
treatment_builders: A list of VarBuilder objects specifying how residualized treatments will be modified before second-stage regression. In this case we have used the interaction_levels parameter to get heterogeneous effects across a number of dimensions and used the PToPVar class to specify peer (cross-price) treatment effects.
causal_model: The second-stage regression model used to learn the causal effects.
Options: Where we have passed min_lead and max_lead, which govern the range of leads for which we perform baseline forecasting.
6 Typical Analysis Process
In this section, we briefly outline the general process of using the Pricing Engine for causal estimation and show how it is implemented in our OJ demand example.
Identify the main variables of interest (outcome and treatments), time granularity (e.g. week), and unit (panel) identifying variables (e.g. region × channel × SKU). This will be influenced by data availability, desired causal estimates, and useful variation in the data.
OJ: We use ln(sales) as our outcome and ln(price) and featured as our treatments. Our data is weekly and individual units are at the store id × brand level.
Determine what information decision makers used when modifying the treatment in the past (e.g. competitor actions, product life-cycles, and holidays). Divide this set into those elements that could independently be affecting demand, which are potential confounders, and those that could not. When in doubt, assume an element is potentially a confounder. Only potential confounders should be included in the first-stage regression of treatment.
OJ: We found that previous trends in the outcome and treatments were used when setting new values of the treatments. We considered these as potential confounders as they may be related to overall demand changes.
Identify any additional variables that may be useful in predicting the outcome variable. Any variables (excluding bad controls) that improve prediction of the outcome can be usefully added to the first-stage regression of outcomes.
OJ: We included the same features as with treatment.
Collect and prepare data. Make sure to collect data on all important potential confounders.
OJ: Already done.
With these in place you should be able to use the Pricing Engine as outlined above.
OJ: See section 5.
Evaluate model results and potentially revise the model. (See section 4.3)
[1] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins, “Double/debiased machine learning for treatment and causal parameters,” arXiv:1608.00060, 2017.
[2] V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy, “Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels,” arXiv:1712.09988, 2017.