Regret Minimization for Causal Inference on Large Treatment Space

06/10/2020 ∙ by Akira Tanimoto, et al. ∙ Kyoto University ∙ NEC Global

Predicting which action (treatment) will lead to a better outcome is a central task in decision support systems. To build a prediction model in real situations, learning from biased observational data is a critical issue due to the lack of randomized controlled trial (RCT) data. To handle such biased observational data, recent efforts in causal inference and counterfactual machine learning have focused on debiased estimation of the potential outcomes on a binary action space and the difference between them, namely, the individual treatment effect. When it comes to a large action space (e.g., selecting an appropriate combination of medicines for a patient), however, the regression accuracy of the potential outcomes is no longer sufficient in practical terms to achieve a good decision-making performance. This is because the mean accuracy on the large action space does not guarantee the nonexistence of a single potential outcome misestimation that might mislead the whole decision. Our proposed loss minimizes a classification error of whether or not the action is relatively good for the individual target among all feasible actions, which further improves the decision-making performance, as we prove. We also propose a network architecture and a regularizer that extracts a debiased representation not only from the individual feature but also from the biased action for better generalization in large action spaces. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the superiority of our method for large combinatorial action spaces.







1 Introduction

Predicting individualized causal effects is an important issue in many domains for decision-making. For example, a doctor considers which medication would be the most effective for a patient, a teacher considers which problems are most effective for improving the achievement of a student, and a retail store manager considers which assortment would improve the overall store sales. To support such decision-making, we consider providing a prediction of which actions will lead to better outcomes.

Recent efforts in causal inference and counterfactual machine learning have focused on predicting the potential outcomes that correspond to each action for each individual target based on observational data. Observational data consist of features of targets, past actions actually taken, and their outcomes. We have no direct access to the past decision-makers’ policies, i.e., the mechanism by which an action is chosen given a target's features. Unlike in normal prediction problems, pursuing high-accuracy predictions only with respect to the historical data carries the risk of incorrect estimates due to the biases in the past policies. These biases are also known as spurious correlation (Simon, 1954; Pearl, 2009), which might mislead the decision-making. For those cases where real-world experiments such as randomized controlled trials (RCTs) or multi-armed bandits are infeasible or too expensive, causal inference methods provide debiased estimation of potential outcomes from observational data.

While most of the existing approaches assume limited action spaces such as binary ones, as in individual treatment effect (ITE) estimation, there are many real-world situations where the number of options is large. For example, doctors need to consider which combination of medicines will best suit a patient. For such cases, it is difficult to apply existing methods (Shalit et al., 2017; Yoon et al., 2018; Schwab et al., 2018) for two reasons. First, since the sample size for each action would be limited, building a model for each action (or a multi-head neural network), which existing methods adopt, is not sample-efficient. Second, even if we manage to achieve the same level of regression accuracy as when the action space is limited, the same decision-making performance is no longer guaranteed in a large action space, as we prove in Section 3. This is because, in short, even though overestimating the potential outcome of a single action among many alternatives has only a small impact on the overall regression accuracy, it can mislead the whole decision toward a bad action and result in poor decision performance.

To achieve informative causal inference for decision-making in a large action space, we propose solutions for the above two problems. For sample efficiency, we directly formulate the observational bias problem as a domain adaptation from a biased policy to a uniform random policy, which enables the extraction of debiased representations from both the individual features and the actions. Thereby, we can build a debiased single-head model, aiming at better generalization for the large action space. For the second issue, we analyze our decision-focused performance metric, the “regret”, and find that we can further improve the decision performance by minimizing the classification error of being in the top-$k$ best actions among the feasible actions for each target, in addition to the regression error (MSE). We cannot directly observe whether an action is in the top-$k$, since only one action and its outcome are observed for each target, so we propose a proxy loss that compares the observed outcome to the estimated conditional average performance of the past decision-makers.

In summary, our proposed method minimizes both the classification error and the MSE using debiased representations of both the features and the actions. We demonstrate the effectiveness of our method through extensive experiments with synthetic and semi-synthetic datasets.

(a) Sample for ITE
(b) Sample for causal inference on a combinatorial action space
Figure 1: Example data tables for the ITE estimation and our setting on a combinatorial action space. Dashes indicate missing entries. Both tasks can be seen as learning from missing-at-random data. Only factual outcomes are observed (when an action is taken, only the outcome for that action is observed) and the counterfactual records are missing.

2 Problem setting

In this section, we formulate our problem and define a decision-focused performance metric. Our aim is to build a predictive model to inform decision-making. Given a feature vector $x \in \mathcal{X}$, the learned predictive model $f(x, a)$ is expected to correctly predict which action $a \in \mathcal{A}_x$ leads to a better outcome $y$, where $\mathcal{A}_x$ is a feasible subset of the finite action space $\mathcal{A}$ given $x$. We hereafter assume that the feasible action space does not depend on the feature, i.e., $\mathcal{A}_x = \mathcal{A}$, for simplicity. As a typical case of large action spaces, we assume an action consisting of multiple causes, i.e., $\mathcal{A} = \{0, 1\}^m$ (combinatorial action space).

We assume there exists a joint distribution

$$p(x, a, y_1, \ldots, y_{|\mathcal{A}|}) = p(x)\,\mu(a \mid x)\,p(y_1, \ldots, y_{|\mathcal{A}|} \mid x),$$

where $\mu(a \mid x)$ is the unknown decision-making policy of past decision-makers, called the propensity, and $y_1, \ldots, y_{|\mathcal{A}|}$ are the potential outcomes corresponding to each action. The observed (factual) outcome is the one corresponding to the observed action, i.e., a training instance is $(x^{(n)}, a^{(n)}, y^{(n)})$ with $y^{(n)} = y^{(n)}_{a^{(n)}}$, where $n$ denotes the instance index, and the other (counterfactual) potential outcomes are regarded as missing, as shown in Fig. 1. Note that the joint distribution is assumed to have the conditional independence $(y_1, \ldots, y_{|\mathcal{A}|}) \perp a \mid x$ (unconfoundedness). In addition, we assume $\mu(a \mid x) > 0$ for all $x$ and $a$ (overlap). These are commonly required to identify causal effects (Imbens and Wooldridge, 2009; Pearl, 2009).

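To define a performance measure of a model $f$, we utilize a simple prediction-based decision-making policy: given a parameter $k \in \{1, \ldots, |\mathcal{A}|\}$,

$$\pi^f_k(a \mid x) := \begin{cases} 1/k & \text{if } \mathrm{rank}(f(x, a)) \le k, \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathrm{rank}(f(x, a))$ denotes the rank of the prediction among those for all the feasible actions, i.e., $\mathrm{rank}(f(x, a)) := |\{a' \in \mathcal{A} : f(x, a') \ge f(x, a)\}|$. This means choosing an action uniformly at random from the predicted top-$k$ actions.

Here we define the performance of $\pi^f_k$ by its expected outcome, which can be written as the following mean cumulative gain (mCG), and we also define its difference from the oracle’s performance (regret):

$$\mathrm{mCG}@k(f) := \mathbb{E}_x\Big[\frac{1}{k}\sum_{a:\,\mathrm{rank}(f(x,a)) \le k} \bar{y}_a(x)\Big], \tag{1}$$

$$\mathrm{Regret}@k(f) := \mathbb{E}_x\Big[\frac{1}{k}\sum_{a:\,\mathrm{rank}(\bar{y}_a(x)) \le k} \bar{y}_a(x)\Big] - \mathrm{mCG}@k(f), \tag{2}$$

where $\bar{y}_a(x) := \mathbb{E}[y_a \mid x]$ is the expected potential outcome and $\mathrm{rank}(\bar{y}_a(x))$ is its rank among all the feasible actions. A closely related quantity is known as the policy risk (Shalit et al., 2017). Since the first term in (2) is constant with respect to $f$, the mCG and the regret are two sides of the same coin as performance metrics of a model. We regard the mCG (or the regret) as the metric in this paper.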
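The top-$k$ policy, the mCG, and the regret described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the array layout (one row per target, one column per action), and the stable tie-breaking are assumptions made here.

```python
import numpy as np

def topk_policy(scores, k):
    """Probability of each action under the top-k policy, one row per target."""
    scores = np.asarray(scores, dtype=float)
    # Column indices of the k highest scores in each row (stable tie-breaking).
    top = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    policy = np.zeros_like(scores)
    np.put_along_axis(policy, top, 1.0 / k, axis=1)
    return policy

def mean_cumulative_gain(pred, expected_outcomes, k):
    """Expected outcome of the top-k policy induced by the predictions."""
    expected_outcomes = np.asarray(expected_outcomes, dtype=float)
    policy = topk_policy(pred, k)
    return float(np.mean(np.sum(policy * expected_outcomes, axis=1)))

def regret(pred, expected_outcomes, k):
    """Oracle mCG (ranking by the true expected outcomes) minus the model's mCG."""
    oracle = mean_cumulative_gain(expected_outcomes, expected_outcomes, k)
    return oracle - mean_cumulative_gain(pred, expected_outcomes, k)

# Toy example (hypothetical numbers): two targets, four actions.
y_bar = np.array([[1.0, 3.0, 2.0, 0.0],
                  [0.5, 0.1, 0.9, 0.4]])
f_hat = np.array([[0.8, 2.5, 2.9, 0.1],   # swaps the two best actions
                  [0.6, 0.0, 1.0, 0.3]])  # ranks the actions correctly
print(regret(f_hat, y_bar, k=1))
print(regret(f_hat, y_bar, k=2))
```

In this toy case, a single swapped pair costs regret at $k=1$ but none at $k=2$, because both actions remain inside the predicted top-2 set.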

3 Relation between prediction accuracy and precision in decision-making

In this section, we analyze our decision-focused performance metric, the regret. Our analysis reveals a difficulty of causal inference in a large action space: the regret bound gets worse for the same regression accuracy. At the same time, however, we show that the bound can be improved by simultaneously minimizing a classification error, which leads to our proposed method.

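A typical performance measure in existing causal inference studies that is applicable to large action spaces is the following uniform MSE (Schwab et al., 2018; Yoon et al., 2018):

$$\mathrm{MSE}^u(f) := \mathbb{E}_x\Big[\frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} \big(\bar{y}_a(x) - f(x, a)\big)^2\Big], \tag{3}$$

where $\bar{y}_a(x)$ denotes the expected potential outcome of action $a$ for feature $x$. Note that $\mathrm{MSE}^u$ is different from the normal MSE in the supervised machine learning context, in which the expectation is taken over the same distribution as the training, i.e., with actions drawn from the observational policy rather than uniformly. We refer to $\mathrm{MSE}^u$ as MSE, or specifically the uniform MSE, in this paper.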
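To make the distinction concrete, the following sketch contrasts a uniform average of squared errors over all actions with an average weighted by the observational policy. The function names and the toy numbers are hypothetical; the propensity is assumed to be known here purely for illustration.

```python
import numpy as np

def uniform_mse(pred, expected_outcomes):
    """Squared error averaged uniformly over all targets and all actions."""
    return float(np.mean((pred - expected_outcomes) ** 2))

def observational_mse(pred, expected_outcomes, propensity):
    """Squared error averaged over actions drawn from the observational policy."""
    per_target = np.sum(propensity * (pred - expected_outcomes) ** 2, axis=1)
    return float(np.mean(per_target))

# One target, two actions; the model is wrong only on the rarely taken action.
y_bar = np.array([[0.0, 0.0]])
pred = np.array([[0.0, 2.0]])
mu = np.array([[0.9, 0.1]])
print(uniform_mse(pred, y_bar))            # error counted with weight 1/2
print(observational_mse(pred, y_bar, mu))  # error counted with weight 0.1
```

An error on a rarely chosen action is down-weighted under the observational distribution, which is why a model can look accurate on historical data while being poor under the uniform measure.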

Here the relation between the uniform MSE and the regret is the following (proof is in Appendix A).

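Proposition 3.1.

The regret in (2) is bounded with the uniform MSE in (3) as

$$\mathrm{Regret}@k(f) \le \frac{|\mathcal{A}|}{k}\sqrt{\mathrm{ER}^u_k(f) \cdot \mathrm{MSE}^u(f)}, \tag{4}$$

where $\mathrm{ER}^u_k(f)$ is the top-$k$ classification error rate, i.e.,

$$\mathrm{ER}^u_k(f) := \mathbb{E}_x\Big[\frac{1}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} I\big((\mathrm{rank}(\bar{y}_a(x)) \le k) \oplus (\mathrm{rank}(f(x, a)) \le k)\big)\Big],$$

where $\bar{y}_a(x)$ is the expected potential outcome, $\mathrm{rank}(\cdot)$ is the rank among all the feasible actions, and $\oplus$ denotes the logical XOR.

Since $\mathrm{ER}^u_k(f) \le 1$ for any $f$, we see that minimizing only the uniform MSE, as in existing causal inference methods, leads to minimizing the regret. However, when $|\mathcal{A}|$ is large, the bound is loose, and only an unrealistically small MSE provides a meaningful guarantee for the regret. At the same time, we see that the bound can be further improved by simultaneously minimizing the uniform top-$k$ classification error rate, which leads to our proposed method.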
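The relation between the regret, the classification error rate, and the uniform MSE can be checked numerically. The sketch below assumes the bound takes the product form $\frac{|\mathcal{A}|}{k}\sqrt{\mathrm{ER} \cdot \mathrm{MSE}}$, which is what the Cauchy–Schwarz step in Appendix A yields; the helper names and the random data are illustrative only.

```python
import numpy as np

def topk_mask(values, k):
    """Boolean mask marking the k largest entries of each row."""
    top = np.argsort(-values, axis=1, kind="stable")[:, :k]
    mask = np.zeros(values.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def regret_and_bound(pred, y_bar, k):
    num_actions = y_bar.shape[1]
    true_top, pred_top = topk_mask(y_bar, k), topk_mask(pred, k)
    gain_oracle = np.sum(y_bar * true_top, axis=1) / k
    gain_model = np.sum(y_bar * pred_top, axis=1) / k
    regret = float(np.mean(gain_oracle - gain_model))
    # Fraction of (target, action) pairs whose top-k membership is misclassified.
    error_rate = float(np.mean(true_top ^ pred_top))
    mse = float(np.mean((pred - y_bar) ** 2))
    bound = num_actions / k * np.sqrt(error_rate * mse)
    return regret, bound

rng = np.random.default_rng(0)
y_bar = rng.normal(size=(500, 32))                 # synthetic expected outcomes
pred = y_bar + 0.5 * rng.normal(size=y_bar.shape)  # noisy predictions
for k in (1, 4, 16):
    r, b = regret_and_bound(pred, y_bar, k)
    print(k, round(r, 3), round(b, 3))
```

The factor `num_actions / k` is what makes the guarantee loose for large action spaces when only the MSE is controlled.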

4 Regret minimization network: debiased potential outcome regression and classification on a large action space

Our proposed method, regret minimization network (RMNet), consists of two parts. First, we introduce our loss, which aims to minimize the regret by minimizing both the classification error rate and the MSE. Then, we introduce a sample-efficient network architecture, in which a representation is extracted from both the feature $x$ and the action $a$, and a representation-based debiasing regularizer that performs domain adaptation according to the structure.

4.1 Uniform regret minimization loss

As we saw in Section 3, we can improve the decision-making performance by minimizing the uniform top-$k$ classification error rate. Notice that, from the inequality of arithmetic and geometric means, the r.h.s. of Eq. (4) is bounded from above by a positively weighted sum of the classification error rate and the MSE, and the equality holds for a suitable choice of the weights. We thus aim to minimize the weighted sum of the classification error rate and the MSE.
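As a worked version of this step, assume the bound in Eq. (4) has the product form $\frac{|\mathcal{A}|}{k}\sqrt{\mathrm{ER} \cdot \mathrm{MSE}}$ and introduce a weight $\beta > 0$ (the symbol $\beta$ and this parameterization are our assumptions; the original may use a different one):

```latex
\frac{|\mathcal{A}|}{k}\sqrt{\mathrm{ER}\cdot\mathrm{MSE}}
  = \frac{|\mathcal{A}|}{k}\sqrt{(\beta\,\mathrm{ER})\cdot\frac{\mathrm{MSE}}{\beta}}
  \le \frac{|\mathcal{A}|}{2k}\left(\beta\,\mathrm{ER} + \frac{\mathrm{MSE}}{\beta}\right),
\qquad
\text{with equality iff } \beta\,\mathrm{ER} = \frac{\mathrm{MSE}}{\beta},
\ \text{i.e. } \beta = \sqrt{\mathrm{MSE}/\mathrm{ER}} .
```

For instance, with $\mathrm{ER} = 0.04$ and $\mathrm{MSE} = 0.25$, the geometric mean is $\sqrt{0.01} = 0.1$; choosing $\beta = 2.5$ gives $\tfrac{1}{2}(0.1 + 0.1) = 0.1$, so the bound is tight, whereas $\beta = 1$ gives $\tfrac{1}{2}(0.04 + 0.25) = 0.145$.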

Since we observe only one action and its outcome for each target, we cannot directly estimate the top-$k$ classification error rate, which is based on the ranked list of potential outcomes, from the data alone. Therefore, we recast the minimization of this error rate into a simple classification.

First, we rewrite with the 0-1 classification risk as follows (the derivation is in Appendix D):


where and is the 0-1 classification loss. Here the terms are constant with respect to , and thus holds. Therefore, we optimize the 0-1 loss with respect to .

Next, we replace the unobservable $k$-th best outcome in (5) with the conditional average outcome under the observational policy, which can be estimated by a model $g(x)$ trained on the observational data. This means that we do not optimize for an arbitrary $k$ but for a specific $k$ that corresponds to the average performance of the observational policy ($k$ may depend on $x$). The resulting numerical label, the observed outcome minus the estimated conditional average outcome, is called the residual (also known as the advantage in reinforcement learning (Mnih et al., 2016)). A positive residual means that the action outperformed the conditional average performance of the observational policy; thus, ranking such an action higher leads to performance superior to that of the past decision-makers.

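Considering the noise on the residual due to the noise on $y$ and the estimation error of the conditional average outcome model $g$, we train our model with an estimate of the true label called a soft label (Peng et al., 2014; Nguyen et al., 2011), $s = \sigma(y - g(x))$, where $\sigma$ is the sigmoid function, instead of a naive plug-in label $I(y - g(x) > 0)$. The proposed proxy risk for the classification error rate is the following cross-entropy:

$$L_{\mathrm{CE}}(f) := -\mathbb{E}\big[s \log v + (1 - s) \log(1 - v)\big], \tag{6}$$

where $s = \sigma(y - g(x))$ and $v = \sigma(f(x, a) - g(x))$. Note that the loss for each instance is minimized when $f(x, a) = y$ regardless of $g(x)$, as illustrated in Fig. 5 in the appendix.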

Finally, our risk is defined as the weighted sum of the classification risk and the MSE:


where .
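The soft-label cross-entropy and its combination with the MSE can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the soft label is taken to be the sigmoid of the residual, the model's score is taken to be the sigmoid of the prediction minus the same baseline, and the two terms are mixed by a convex weight (`weight=0.5` mirrors the equal weighting mentioned in the experiments). All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_label_cross_entropy(pred, outcome, baseline, eps=1e-12):
    """Cross-entropy between the soft label and the model's score.

    baseline: estimated conditional average outcome for the same target.
    """
    soft_label = sigmoid(outcome - baseline)  # soft version of 1[outcome > baseline]
    prob = sigmoid(pred - baseline)           # model's belief that the action beats the baseline
    return -(soft_label * np.log(prob + eps)
             + (1.0 - soft_label) * np.log(1.0 - prob + eps))

def combined_loss(pred, outcome, baseline, weight=0.5):
    """Assumed convex combination of the classification term and the squared error."""
    ce = soft_label_cross_entropy(pred, outcome, baseline)
    mse = (pred - outcome) ** 2
    return float(np.mean(weight * ce + (1.0 - weight) * mse))

# The per-instance cross-entropy is minimized at pred == outcome for any baseline.
grid = np.linspace(-2.0, 4.0, 601)
for baseline in (-1.0, 0.3, 2.0):
    losses = soft_label_cross_entropy(grid, 1.0, baseline)
    print(baseline, grid[np.argmin(losses)])
```

Because both the label and the score are shifted by the same baseline, an error in the baseline changes how strongly an instance is weighted but not where its loss is minimized.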

4.2 Debiasing by representation-based domain adaptation to the RCT policy

While the accessible observational data are biased by the propensity, the expected risk is averaged over all actions uniformly. In this section, therefore, we construct an empirical risk that is debiased against this sampling bias. We also propose an architecture that extracts representations from both the feature and the action for better generalization in a large action space.

There are two major approaches for debiased learning in individual-level causal inference. One is a density estimation-based method called inverse probability weighting using the propensity score (IPW) (Austin, 2011), in which each instance is weighted by the inverse of its propensity. Since the weighted expected risk matches that of an RCT, good performance can be expected asymptotically when the propensity is accurately estimated or when it is recorded, as in logged bandit problems. However, in observational studies, where the propensity has to be estimated and plugged in, its efficacy easily drops (Kang et al., 2007). The other approach is representation balancing (Shalit et al., 2017; Johansson et al., 2016), in which a model consists of a representation extractor and hypotheses, as in Fig. 2(a), and the conditional distributions of the representations are encouraged to be similar to each other by means of a so-called integral probability metric (IPM) regularizer. We also take this approach and extend it to large action spaces.
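The effect of inverse probability weighting can be illustrated on a single target with two actions. The sketch assumes the propensity is known and normalizes the weights by the number of actions so that the reweighted risk equals the uniform average; the function name and the numbers are hypothetical.

```python
import numpy as np

def ipw_weights(propensity_of_taken_action, num_actions):
    """Weights that turn an average under the propensity into a uniform average."""
    p = np.asarray(propensity_of_taken_action, dtype=float)
    return 1.0 / (num_actions * p)

propensity = np.array([0.8, 0.2])       # past policy strongly prefers action 0
loss_per_action = np.array([1.0, 3.0])  # the model is worse on the rare action
weights = ipw_weights(propensity, num_actions=2)

observational_risk = float(np.sum(propensity * loss_per_action))
reweighted_risk = float(np.sum(propensity * weights * loss_per_action))
uniform_risk = float(np.mean(loss_per_action))
print(observational_risk, reweighted_risk, uniform_risk)
```

The weight on the rare action is four times that on the common one here; with an estimated propensity close to zero such weights become very large, which is one reason plug-in IPW can be unstable.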

It is difficult to naively extend these methods to a large action space. A reason is, as in Fig. 2(a), constructing hypothesis layers for each action is not sample-efficient. Also, representation balancing of each pair of actions is computationally and statistically infeasible. Therefore, we propose extracting representations from both the features and the action as in Fig. 2(b).

(a) Counterfactual Regression (CFR)
(b) Proposed architecture
Figure 2: Network structures of counterfactual regression for ITE (Shalit et al., 2017; Schwab et al., 2018) and our proposed method. A broken line indicates no backpropagation.

Footnote 3: As for the representation balancing regularizer, Shalit et al. (2017) assumed a binary action space and Schwab et al. (2018) extended it to multiple actions, but both assumed the existence of a special action (e.g., no intervention), and only the distances between the special action and the other actions are taken into account. We assume no such special action, and thus the pairwise comparison is a reasonable extension.

We want to minimize the risk under the joint distribution of features and actions with the uniform policy, in which the action follows the discrete uniform distribution over the action space, using a sample from the observational joint distribution, in which the action follows the propensity. This can be seen as an unsupervised domain adaptation task from the training distribution to the joint distribution with the uniform policy. From this observation, we directly apply the representation regularizer to these distributions. That is, we encourage the distribution of the representation of the feature-action pair under the observational policy to match that under the uniform policy.

The resulting objective function (8) is the sum of the empirical instance-wise version of (7), evaluated on the observed feature-action pairs, and a regularizer, scaled by its strength, that measures the discrepancy between the representations of the observed pairs and those of the same features paired with actions sampled from the discrete uniform distribution over the action space. We utilize the Wasserstein distance, which is an instance of the IPM, as the discrepancy measure of the representation distributions, as in (Shalit et al., 2017). Specifically, we use an entropy relaxation of the exact Wasserstein distance, called the Sinkhorn distance (Cuturi, 2013), for compatibility with gradient-based optimization. The resulting learning flow is shown in Algorithm 1. A theoretical analysis of our representation balancing regularization can be found in Appendix B.
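For intuition, an entropy-regularized transport cost between two sets of representations can be computed with plain Sinkhorn iterations. The sketch below is a generic NumPy version under our own assumptions (Euclidean ground cost, uniform sample weights, a fixed number of iterations, no log-domain stabilization); it is not the regularizer implementation used in the experiments.

```python
import numpy as np

def sinkhorn_distance(xs, xt, reg=0.1, n_iters=500):
    """Approximate transport cost between two point clouds with uniform weights."""
    xs, xt = np.asarray(xs, dtype=float), np.asarray(xt, dtype=float)
    # Pairwise Euclidean ground cost, shape [len(xs), len(xt)].
    cost = np.linalg.norm(xs[:, None, :] - xt[None, :, :], axis=2)
    a = np.full(len(xs), 1.0 / len(xs))
    b = np.full(len(xt), 1.0 / len(xt))
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match the column marginals
        u = a / (kernel @ v)    # rescale to match the row marginals
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))

# Hypothetical one-dimensional representations.
phi_obs = np.array([[0.0], [1.0], [2.0]])
phi_same = phi_obs.copy()
phi_shifted = phi_obs + 1.0
print(sinkhorn_distance(phi_obs, phi_same))
print(sinkhorn_distance(phi_obs, phi_shifted))
```

Every operation in the loop is differentiable, which is what makes this relaxation convenient as a training-time regularizer. For large costs relative to `reg`, `np.exp(-cost / reg)` can underflow, so a log-domain variant is preferable in practice.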

Input: Observational data (features, actions, and outcomes), hyperparameters
Output: Trained network parameters
1:  Train the conditional average outcome model by an arbitrary supervised learning method on the observed features and outcomes
2:  while the convergence criterion is not met do
3:     Sample a mini-batch from the observational data.
4:     Calculate the gradient of the supervised loss in (8).
5:     Sample a uniformly random action for each instance in the mini-batch.
6:     Calculate the gradient of the representation balancing regularizer (e.g., Sinkhorn distance (Cuturi, 2013)).
7:     Obtain the step size with an optimizer (e.g., Adam (Kingma and Ba, 2015)).
8:     Update the network parameters.
9:     Check the convergence criterion.
10:  end while
11:  return the trained network parameters
Algorithm 1 Unbiased learning of outcomes
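Steps 3 and 5 of the algorithm, drawing a mini-batch and pairing each of its features with a uniformly random action, can be sketched as below. The sketch assumes a binary combinatorial action encoding (each action is a 0/1 vector) and uses hypothetical array names; the gradient and update steps are omitted because they depend on the network library.

```python
import numpy as np

def sample_minibatch_with_uniform_actions(features, actions, outcomes, batch_size, rng):
    """Return an observed mini-batch plus one uniformly random action per instance."""
    idx = rng.choice(len(features), size=batch_size, replace=False)
    # Each binary cause is switched on with probability 1/2, independently,
    # which is uniform over all 2^m combinations.
    uniform_actions = rng.integers(0, 2, size=(batch_size, actions.shape[1]))
    return features[idx], actions[idx], outcomes[idx], uniform_actions

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))
actions = rng.integers(0, 2, size=(100, 4))
outcomes = rng.normal(size=100)
x_b, a_b, y_b, a_u = sample_minibatch_with_uniform_actions(
    features, actions, outcomes, batch_size=16, rng=rng)
print(x_b.shape, a_b.shape, y_b.shape, a_u.shape)
```

The observed pairs `(x_b, a_b)` feed the supervised loss, while the representations of `(x_b, a_b)` and `(x_b, a_u)` are the two samples compared by the balancing regularizer.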

5 Experiments

We investigated the effectiveness of our method through synthetic and semi-synthetic experiments. Both datasets were newly designed by us for the problem setting with a large action space.

5.1 Experimental setup

Compared methods.

We compared our proposed method (RMNet) with ridge linear regression (OLS), random forests (Breiman, 2001), Bayesian additive regression trees (BART) (Hill, 2011), a naive deep neural network (S-DNN), a naive DNN with a multi-head architecture for each action (M-DNN) (a.k.a. TARNET (Shalit et al., 2017)), and a straightforward extension of the existing action-wise representation balancing method (counterfactual regression network (CFRNet)) (Shalit et al., 2017). We also made comparisons with variants in which one component of our proposed method was removed from the loss function, i.e., the MSE term (“w/o MSE”), the classification (ER) term (“w/o ER”), and the representation balancing regularizer (“w/o ”), to clarify the contribution of each component. For the main proposed method (RMNet), we equally weighted ER and MSE. The strength of the representation balancing regularizer in CFRNet and the proposed method was selected from a set of candidate values. Other specifications of the DNN parameters can be found in Appendix C.

Evaluation. We used the normalized mean CG (nmCG) as the main metric. The normalized mean CG is proportional to the mean CG (1) except that the expected outcomes are replaced with the actual ones. We can see from its definition that the nmCG is at most one. Since we have standardized the outcome, the chance rate is nmCG = 0. In addition to the nmCG, we have also evaluated the methods with respect to the uniform MSE. The validation and the model selection were based on the mean CG, including for the results in MSE.
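One way to realize a metric with these properties is to divide the gain achieved by the model's top-$k$ choices by the gain of the oracle's top-$k$ choices, both measured on actual outcomes. The sketch below is our own reading of that description, not the paper's exact definition; it assumes fully observed, standardized test outcomes and a positive oracle gain.

```python
import numpy as np

def normalized_mean_cg(pred, outcomes, k=1):
    """Model's top-k gain divided by the oracle's top-k gain on actual outcomes."""
    pred = np.asarray(pred, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    chosen = np.argsort(-pred, axis=1, kind="stable")[:, :k]
    best = np.argsort(-outcomes, axis=1, kind="stable")[:, :k]
    gain_model = np.take_along_axis(outcomes, chosen, axis=1).mean()
    gain_oracle = np.take_along_axis(outcomes, best, axis=1).mean()
    return float(gain_model / gain_oracle)

# Standardized toy outcomes (zero mean): two targets, three actions.
outcomes = np.array([[-1.0, 0.0, 1.0],
                     [1.0, -1.0, 0.0]])
print(normalized_mean_cg(outcomes, outcomes))   # perfect ranking
print(normalized_mean_cg(-outcomes, outcomes))  # reversed ranking
```

With zero-mean outcomes, a policy that picks actions at random has an expected gain of zero, so values near zero indicate chance-level decisions and negative values indicate systematically misleading predictions.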

Infrastructure. All the experiments were run on a machine with 28 CPUs (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz), 250GB memory, and 8 GPUs.

5.2 Synthetic experiment

Dataset. We prepared seven biased datasets in total to examine the robustness of the proposed and baseline methods. For the detailed generation process, see Appendix C. The feature space and the action space are fixed to and , respectively. The sample sizes for were 1,000 for training, 100 for validation, and 200 for testing. For training, only one action and the corresponding outcome are sampled as follows. Six of the settings have generalized linear models , where denotes one-dimensional representations of and . The function is linear in three of them () and quadratic with respect to in the rest (). The last setting is a bilinear model. We set sampling biases as where denotes another set of representations of and . The three settings for the linear and quadratic patterns correspond to the relation between and as illustrated in Fig. 3(a) to (c), i.e., () in Setup-A and C, and () in Setup-B and C. These relations of variables were designed to reproduce spurious correlation, which may mislead the decision-making as follows. In Setup-A, would depend on through its dependence on , even though itself has no causal relationship to . Similarly, in Setup-B, would depend on through , and the causal effect of may appear discounted. In Setup-C, the causal effect of might appear to be the opposite, as illustrated in Fig. 3(d).

(a) Setup-A
(b) Setup-B
(c) Setup-C
(d) Illustration of Linear-C
Figure 3: (a) to (c) The data generation models for the synthetic experiment. Shaded variables denote the variables accessible in training. Non-shaded variables are latent one-dimensional representations of the feature and the action. (d) Illustration of how the Linear-C setting can mislead learners via sampling bias.

Result. As can be seen in Table 1, our proposed method achieved the best performance or compared favorably under all settings. The results in MSE are also shown in Appendix E. In the linear settings, OLS performed on par with the oracle (nmCG = 1), since the model class is correctly specified, and its performance dropped under the nonlinear settings. In real-world situations where the class of the true function is unknown, versatility across true function classes would be a significant strength of the proposed method. Under the setting of Linear-C, some of the compared methods performed below the chance rate (nmCG = 0). This may be because Linear-C is designed to mislead, as illustrated in Fig. 3(d). Here we would like to mention that Linear-C is not unrealistic; e.g., doctors (the past decision-makers) are likely to give stronger medicines ( is large) to more serious patients ( is large), and the stronger medicines might then appear to worsen the patients’ health (). In Linear-C and Quadratic-C, the performance of RMNet was worsened by using the representation balancing regularizer. This might be because, in Setup-C, the empirical loss in (8) and the regularizer might conflict, i.e., extracting both and is needed for better prediction, which increases the regularizer due to the bias. In recent studies (Hassanpour and Greiner, 2020; Zhang et al., 2020; Johansson et al., 2019), it is argued that requiring a low IPM is unnecessarily strong, and alternatives to the IPM are proposed. Thus, there is room for further improvement in this direction in future work.

| Method | Linear-A | Linear-B | Linear-C | Quadratic-A | Quadratic-B | Quadratic-C | Bilinear |
|---|---|---|---|---|---|---|---|
| OLS | 0.99 ± .00 | 1.00 ± .00 | 1.00 ± .00 | 0.20 ± .10 | 0.68 ± .11 | 0.80 ± .13 | 0.00 ± .01 |
| Random Forest | 0.52 ± .10 | 0.46 ± .06 | 0.89 ± .11 | 0.71 ± .10 | 0.24 ± .03 | 0.90 ± .06 | 0.64 ± .04 |
| BART | 0.69 ± .12 | 0.99 ± .00 | 1.03 ± .08 | 0.54 ± .15 | 0.87 ± .07 | 0.99 ± .00 | 0.02 ± .04 |
| M-DNN | 0.40 ± .16 | 0.76 ± .09 | 0.07 ± .09 | 0.77 ± .09 | 0.45 ± .14 | 0.62 ± .13 | 0.25 ± .10 |
| S-DNN | 0.83 ± .11 | 0.85 ± .08 | 0.64 ± .18 | 0.78 ± .09 | 0.52 ± .08 | 0.70 ± .08 | 0.66 ± .08 |
| CFRNet | 0.10 ± .23 | 0.72 ± .16 | 0.06 ± .10 | 0.52 ± .18 | 0.30 ± .12 | 0.53 ± .15 | 0.09 ± .08 |
| RMNet | 0.96 ± .01 | 0.98 ± .01 | 0.76 ± .07 | 0.95 ± .02 | 0.87 ± .03 | 0.90 ± .05 | 0.83 ± .02 |
| RMNet (w/o MSE) | 0.94 ± .01 | 0.89 ± .09 | 0.47 ± .09 | 0.93 ± .02 | 0.86 ± .03 | 0.83 ± .05 | 0.75 ± .07 |
| RMNet (w/o ER) | 0.90 ± .05 | 0.84 ± .08 | 0.60 ± .13 | 0.88 ± .05 | 0.55 ± .07 | 0.71 ± .08 | 0.56 ± .09 |
| RMNet (w/o ) | 0.91 ± .03 | 0.98 ± .01 | 0.87 ± .05 | 0.94 ± .01 | 0.86 ± .03 | 0.92 ± .04 | 0.83 ± .02 |

Table 1: Synthetic results on normalized mean CG@1 (larger is better, with a maximum of one) and its standard error over ten data generations. Best and second-best methods are in bold.

5.3 Semi-synthetic experiment

Dataset (GPU kernel performance). For semi-synthetic experiment, we used the SGEMM GPU kernel performance dataset (Nugteren and Codreanu, 2015; Ballester-Ripoll et al., 2017), which has 14 feature attributes of GPU kernel parameters and four target attributes of elapsed times in milliseconds for four independent runs for each combination of parameters. We used the inverse of the mean elapsed times as the outcome. Then we had 241.6k instances in total. By treating some of the feature attributes as action dimensions, we got a complete dataset, which has all the entries (potential outcomes) in Fig. 1(b) observed. Then we composed our semi-synthetic dataset by biased subsampling of only one action and the corresponding potential outcome for each . The details of this preprocess can be found in Appendix C.

The sampling policy in the training data was where is sampled from . This policy reproduced a spurious correlation; that is, a random projection of the feature and the action is likely to have little causal relationship with the outcome but does have a strong correlation with it due to the sampling policy. This policy also depends on the outcome, which violates the unconfoundedness assumption. However, the dataset we used has a low noise level, i.e., for some function , and thus

We split the feature set into 80% for training, 5% for validation, and 15% for testing. Then, for the training set, only one action and the corresponding outcome were taken for each feature. The resulting training sample size for each setting is listed in Table 3 in Appendix C. We repeated the training and evaluation process ten times for different splits and samplings of the training actions.

Result. As shown in Table 2, our proposed method outperformed the other methods in nmCG@1 in all cases. In terms of MSE, S-DNN with the same backbone also achieved high performance, which demonstrates that the structure in Fig. 2(b) efficiently modeled the data. The performance gains compared to “w/o ER” and “w/o ” demonstrate the effectiveness of both of the components proposed in Section 4. The superior performance of RMNet without MSE in the 16 and 32 settings indicates that there is room for optimizing the weight between ER and MSE, which we fixed to 0.5.

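| Method | nmCG@1 (8) | nmCG@1 (16) | nmCG@1 (32) | nmCG@1 (64) | MSE (8) | MSE (16) | MSE (32) | MSE (64) |
|---|---|---|---|---|---|---|---|---|
| OLS | 0.04 ± .15 | 0.08 ± .20 | 0.10 ± .13 | 0.01 ± .10 | 1.12 ± .12 | 1.89 ± .26 | 1.70 ± .26 | 5.86 ± 1.10 |
| Random Forest | 0.23 ± .08 | 0.33 ± .07 | 0.32 ± .05 | 0.37 ± .05 | 1.03 ± .11 | 0.87 ± .08 | 0.93 ± .09 | 1.07 ± .18 |
| BART | 0.00 ± .13 | 0.17 ± .13 | 0.11 ± .10 | 0.04 ± .09 | 1.06 ± .08 | 1.04 ± .08 | 1.19 ± .12 | 1.63 ± .23 |
| M-DNN | 0.41 ± .05 | 0.48 ± .06 | 0.31 ± .07 | 0.37 ± .05 | 0.78 ± .05 | 0.84 ± .02 | 0.83 ± .02 | 0.84 ± .02 |
| S-DNN | 0.29 ± .09 | 0.26 ± .10 | 0.32 ± .07 | 0.46 ± .05 | 0.75 ± .12 | 0.60 ± .09 | 0.74 ± .06 | 0.74 ± .04 |
| CFRNet | 0.50 ± .06 | 0.39 ± .14 | 0.39 ± .10 | 0.35 ± .05 | 0.79 ± .02 | 0.81 ± .02 | 0.87 ± .01 | 0.86 ± .01 |
| RMNet | 0.68 ± .00 | 0.60 ± .05 | 0.60 ± .05 | 0.51 ± .05 | 0.77 ± .00 | 0.76 ± .09 | 0.84 ± .02 | 0.73 ± .07 |
| RMNet (w/o MSE) | 0.68 ± .00 | 0.66 ± .01 | 0.67 ± .01 | 0.50 ± .05 | 0.76 ± .00 | 0.75 ± .06 | 0.85 ± .01 | 0.80 ± .08 |
| RMNet (w/o ER) | 0.68 ± .00 | 0.45 ± .08 | 0.56 ± .05 | 0.49 ± .05 | 0.77 ± .00 | 0.67 ± .08 | 0.88 ± .02 | 0.75 ± .05 |
| RMNet (w/o ) | 0.33 ± .09 | 0.27 ± .10 | 0.40 ± .07 | 0.48 ± .06 | 0.72 ± .12 | 0.81 ± .18 | 0.78 ± .08 | 0.71 ± .06 |

Table 2: Semi-synthetic results on normalized mean CG@1 and MSE with the standard error in ten different samplings of the training data. The number in parentheses in each column header is the experimental setting (8, 16, 32, or 64). Best and second-best methods are in bold.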

6 Summary

In this paper, we have investigated causal inference on a large action space with a focus on the decision-making performance. We first defined and analyzed the decision-making performance of a model through a simple prediction-based decision-making policy. Then we showed that a bound based only on the regression accuracy (MSE) gets looser as the action space gets larger, which illustrates the difficulty of utilizing causal inference for decision-making in a large action space. At the same time, however, our bound indicates that minimizing not only the regression loss but also the classification loss leads to better performance. From this viewpoint, our proposed method minimizes both the regression and classification losses, specifically, a soft cross-entropy with a teacher label indicating whether an observed outcome is better than the estimated conditional average outcome in the observational distribution for a given feature. Experiments on synthetic and semi-synthetic datasets, which were designed to have misleading spurious correlations, demonstrated the superior performance of the proposed method with respect to the decision performance and the regression accuracy.


References

  • P. C. Austin (2011) An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research 46 (3), pp. 399–424.
  • R. Ballester-Ripoll, E. G. Paredes, and R. Pajarola (2017) Sobol tensor trains for global sensitivity analysis. arXiv preprint arXiv:1712.00233.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32.
  • M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300.
  • N. Hassanpour and R. Greiner (2020) Learning disentangled representations for counterfactual regression. In International Conference on Learning Representations.
  • J. L. Hill (2011) Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20 (1), pp. 217–240.
  • G. W. Imbens and J. M. Wooldridge (2009) Recent developments in the econometrics of program evaluation. Journal of economic literature 47 (1), pp. 5–86.
  • F. D. Johansson, D. Sontag, and R. Ranganath (2019) Support and invertibility in domain-invariant representations. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Vol. 89, pp. 527–536.
  • F. Johansson, U. Shalit, and D. Sontag (2016) Learning representations for counterfactual inference. In International conference on machine learning, pp. 3020–3029.
  • J. D. Kang, J. L. Schafer, et al. (2007) Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science 22 (4), pp. 523–539.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In ICLR 2015: International Conference on Learning Representations 2015.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937.
  • Q. Nguyen, H. Valizadegan, and M. Hauskrecht (2011) Learning classification with auxiliary probabilistic information. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 477–486. ISBN 9780769544083, ISSN 15504786.
  • C. Nugteren and V. Codreanu (2015) CLTune: a generic auto-tuner for opencl kernels. In Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015 IEEE 9th International Symposium on, pp. 195–202.
  • J. Pearl (2009) Causality. Cambridge university press.
  • P. Peng, R. C. Wong, and P. S. Yu (2014) Learning on probabilistic labels. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 307–315.
  • P. Schwab, L. Linhardt, and W. Karlen (2018) Perfect match: a simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656.
  • U. Shalit, F. D. Johansson, and D. Sontag (2017) Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3076–3085.
  • H. A. Simon (1954) Spurious correlation: a causal interpretation. Journal of the American statistical Association 49 (267), pp. 467–479.
  • J. Yoon, J. Jordon, and M. van der Schaar (2018) GANITE: estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations.
  • Y. Zhang, A. Bellot, and M. van der Schaar (2020) Learning overlapping representations for the estimation of individualized treatment effects. arXiv preprint arXiv:2001.04754.

Appendix A Proof of Proposition 3.1

Proposition A.1.

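The expected regret will be bounded with uniform MSE in (3) as

$$\mathrm{Regret@}k(f) \le \frac{|\mathcal{A}|}{k}\sqrt{\mathrm{ER}^u_k(f)\cdot\mathrm{MSE}^u(f)},$$

where $\mathrm{ER}^u_k(f)$ is the top-$k$ classification error rate, i.e.,

$$\mathrm{ER}^u_k(f) := \mathbb{E}_x\left[\frac{1}{|\mathcal{A}|}\sum_{a\in\mathcal{A}} I\big((\mathrm{rank}(y_a)\le k)\oplus(\mathrm{rank}(f(x,a))\le k)\big)\right],$$

where $\oplus$ denotes the logical XOR.


Here we denote the true and the predicted $i$-th best action by $a^*_i$ and $\hat{a}^*_i$, respectively; i.e., $\mathrm{rank}(y_{a^*_i}) = \mathrm{rank}(f(x,\hat{a}^*_i)) = i$. For all $x$, the target-wise regret can be bounded as follows.

$$\begin{aligned}
\mathrm{Regret@}k(x) &= \frac{1}{k}\sum_{i=1}^{k}\left(y_{a^*_i} - y_{\hat{a}^*_i}\right) \\
&\le \frac{1}{k}\sum_{i=1}^{k}\left(y_{a^*_i} - y_{\hat{a}^*_i}\right) + \frac{1}{k}\sum_{i=1}^{k}\left(f_{\hat{a}^*_i} - f_{a^*_i}\right) \qquad (9) \\
&= \frac{1}{k}\sum_{i=1}^{k}\left\{\left(f_{\hat{a}^*_i} - y_{\hat{a}^*_i}\right) - \left(f_{a^*_i} - y_{a^*_i}\right)\right\},
\end{aligned}$$

where $f_a := f(x,a)$. Inequality (9) is from the definition of $\hat{a}^*_i$; i.e., $\sum_i f_{\hat{a}^*_i}$ is the summation of the top-$k$ $f_a$s out of $\{f_a\}_{a\in\mathcal{A}}$, which must be larger than or equal to the summation of $f_a$s that are not necessarily top-$k$, $\sum_i f_{a^*_i}$. Let $s := \sum_{i=1}^{k}\left(\mathrm{onehot}(\hat{a}^*_i) - \mathrm{onehot}(a^*_i)\right)$, where $\mathrm{onehot}(a)\in\{0,1\}^{|\mathcal{A}|}$ is the one-hot encoding of $a$, and let $e$ be the error vector that consists of $e_a = f_a - y_a$. The r.h.s. is bounded as

$$\frac{1}{k}s^\top e \le \frac{1}{k}\|s\|\,\|e\| = \frac{|\mathcal{A}|}{k}\sqrt{\mathrm{ER}_k(x)\cdot\mathrm{MSE}(x)},$$

where $\mathrm{ER}_k(x)$ and $\mathrm{MSE}(x)$ are the target-wise error rate and MSE, respectively. The inequality comes from the Cauchy–Schwarz inequality. By taking the expectation with respect to $x$ and applying Jensen’s inequality, we get the proposition. ∎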
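The target-wise bound can be checked numerically. The following is a minimal sketch, assuming that the regret at k for one target is the mean true outcome of the truly best k actions minus that of the k actions ranked best by the model, that the error rate is the fraction of actions whose top-k membership differs between truth and prediction, and that the MSE is averaged uniformly over actions. The function names and the simulated data are illustrative and not taken from the paper.

```python
import numpy as np

def topk_mask(values, k):
    """Boolean mask marking the k largest entries of a 1-D array."""
    mask = np.zeros(values.shape[0], dtype=bool)
    mask[np.argsort(-values)[:k]] = True
    return mask

def regret_and_bound(y, f, k):
    """Return (regret at k, upper bound) for one target.

    y: true outcomes per action, f: predicted outcomes per action.
    """
    n_actions = y.shape[0]
    true_top = topk_mask(y, k)   # truly best k actions
    pred_top = topk_mask(f, k)   # k actions the model would pick
    regret = (y[true_top].sum() - y[pred_top].sum()) / k
    error_rate = np.mean(true_top ^ pred_top)   # top-k classification error (XOR)
    mse = np.mean((f - y) ** 2)                 # uniform MSE over actions
    bound = n_actions / k * np.sqrt(error_rate * mse)
    return regret, bound

# Simulated targets (assumption: Gaussian outcomes with noisy predictions).
rng = np.random.default_rng(0)
results = []
for _ in range(1000):
    y = rng.normal(size=16)
    f = y + rng.normal(scale=0.5, size=16)
    results.append(regret_and_bound(y, f, k=3))

# Count targets whose regret exceeds the bound (small tolerance for rounding).
violations = sum(r > b + 1e-12 for r, b in results)
print(violations)
```

Because the bound holds for every target, no simulated target should violate it; averaging over targets then gives the statement in expectation.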

Appendix B Error analysis for representation-based domain adaptation from observational data to the uniform average on action space

By performing the representation balancing regularization, our method enjoys better generalization through minimizing the upper bound of the error on the test distribution (under uniform random policy). We briefly show why minimizing the combination of empirical loss on training and the regularization of distribution (8) results in minimizing the test error. First, we define the point-wise loss function under a hypothesis and an extractor , which defines the representation , as

Then, the expected losses for the training (source) and the test distribution (target) are

We assume there exists such that for the given function space . Then the integral probability metric is defined for as

The difference between the expected losses under the training and test distributions is then bounded as

For , we use the 1-Lipschitz function class, for which is the Wasserstein distance . Although is unknown, tuning the regularization strength in (8) as a hyperparameter effectively tunes .
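To give a feel for what the Wasserstein distance measures, here is a minimal one-dimensional sketch. It relies on the standard fact that, for two equal-size samples on the real line, the empirical Wasserstein-1 distance is the mean absolute difference between the sorted samples. The representations in this paper are multi-dimensional, and the estimator used during training is not described in this appendix, so this is an illustration of the quantity only; `wasserstein1_1d` is a hypothetical helper name.

```python
import numpy as np

def wasserstein1_1d(u, v):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    if u.shape != v.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # Optimal 1-D transport matches the i-th smallest point of u to that of v.
    return float(np.mean(np.abs(u - v)))

rng = np.random.default_rng(0)
source = rng.normal(size=500)
shifted = source + 0.7                       # same shape, translated by 0.7

d_same = wasserstein1_1d(source, rng.permutation(source))  # identical samples
d_shift = wasserstein1_1d(source, shifted)                 # pure translation
print(d_same, d_shift)
```

A pure translation by a constant yields a distance equal to that constant, which is why driving this distance toward zero pulls the two representation distributions together.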

Appendix C Experimental details

Synthetic data generation process. Our synthetic datasets are built as follows.

  • Sample where .

  • Sample , where , from where and are the following.

    • In settings other than Setup-B, where .

    • In Setup-B, , i.e., only the first dimension in is used to bias .

    • where .

  • Calculate the expected outcome where we examine three types of functions , namely, Linear, Quadratic, and Bilinear. In the Linear and Quadratic types, where and are one-dimensional representations of and , respectively.

    • In Setup-B, where denotes all dimensions other than the first dimension ().

    • In settings other than Setup-B, .

    • In Setup-A, where

    • In settings other than Setup-A, .

    • In the Linear setting,

    • In the Quadratic setting,

    • In the Bilinear setting, where

  • Sample the observed outcome

Details of semi-synthetic data. We transformed the target attributes of elapsed times into the average speed as the outcome, i.e., , where are the original elapsed times. Then we standardized and the features. Each feature takes either binary values or up to four distinct powers of two. Out of 1,327k total parameter combinations, only the 241.6k feasible combinations are recorded. We split the original feature dimensions into and as follows. The dimension of the action space ranged from three to six, and the 8th, 11th, 12th, 13th, 14th, and 3rd dimensions are regarded as , taken from the head in order (e.g., for , the 8th, 11th, and 12th dimensions in the original features are regarded as ). This split was chosen to maximize the overlap of among .
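The column split described above can be sketched as follows. This is a minimal illustration, assuming the features are held in a NumPy array with at least 14 columns and that the dimension indices in the text are 1-indexed; `split_features` and `ACTION_DIM_ORDER` are hypothetical names, and the demo matrix is dummy data.

```python
import numpy as np

# 1-indexed original feature dimensions used as action dimensions,
# in the order given in the text.
ACTION_DIM_ORDER = [8, 11, 12, 13, 14, 3]

def split_features(features, action_dim):
    """Split an (n, d) feature matrix into (context, action) column blocks."""
    if not 3 <= action_dim <= len(ACTION_DIM_ORDER):
        raise ValueError("action_dim is expected to be between 3 and 6")
    # Take the first `action_dim` entries and convert to 0-indexed columns.
    action_cols = [i - 1 for i in ACTION_DIM_ORDER[:action_dim]]
    context_cols = [j for j in range(features.shape[1]) if j not in action_cols]
    return features[:, context_cols], features[:, action_cols]

# Two dummy rows with 14 feature columns; values equal their flat index.
demo = np.arange(2 * 14).reshape(2, 14)
context, action = split_features(demo, action_dim=3)
print(action[0])
```

With an action dimension of three, the action block holds the 8th, 11th, and 12th original columns and the remaining eleven columns form the context.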

Other DNN parameters. The detailed parameters we used for the DNN-based methods (S-DNN, M-DNN, CFRNet, and the proposed method) were as follows. The backbone DNN had four layers for representation extraction and three layers for the hypothesis, with a width of 64 for the middle layers and a width of 10 for the representation . The batch size was 64, except for CFRNet, for which it was 512 because the distributions must be approximated for each action. The strength of the L2 regularizer was . We used Adam [Kingma and Ba, 2015] as the optimizer with a learning rate of .
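One reading of these layer counts and widths is sketched below in NumPy. Several details are assumptions made for illustration and are not stated in the text: ReLU activations, no bias terms, a linear final layer in each block, a 14-dimensional input, a scalar output, and random weights. `layer_shapes` and `forward` are hypothetical helpers.

```python
import numpy as np

def layer_shapes(input_dim, hidden=64, rep_dim=10, output_dim=1):
    """Weight shapes for a 4-layer extractor followed by a 3-layer hypothesis head."""
    extractor = [input_dim, hidden, hidden, hidden, rep_dim]
    head = [rep_dim, hidden, hidden, output_dim]

    def pairs(widths):
        return list(zip(widths[:-1], widths[1:]))

    return pairs(extractor), pairs(head)

def forward(inputs, weights):
    """Plain MLP forward pass with ReLU between layers (assumed; no biases)."""
    hidden = inputs
    for i, w in enumerate(weights):
        hidden = hidden @ w
        if i < len(weights) - 1:
            hidden = np.maximum(hidden, 0.0)
    return hidden

rng = np.random.default_rng(0)
extractor_shapes, head_shapes = layer_shapes(input_dim=14)  # input size assumed
extractor_w = [rng.normal(scale=0.1, size=s) for s in extractor_shapes]
head_w = [rng.normal(scale=0.1, size=s) for s in head_shapes]

batch = rng.normal(size=(64, 14))              # batch size 64, as in the text
representation = forward(batch, extractor_w)   # shape (64, 10)
prediction = forward(representation, head_w)   # shape (64, 1)
print(representation.shape, prediction.shape)
```

The sketch only fixes shapes; training, the regularizer, and the optimizer settings are left out because their values are not recoverable from this section.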

Action dimension | Number of actions | Training sample size
--- | --- | ---
3 | 8 | 24,160
4 | 16 | 12,080
5 | 32 | 6,040
6 | 64 | 3,591

Table 3: Training sample size for each setting.

Appendix D Derivation of Eq. 5



we have

Here satisfies the condition , i.e., the -th largest prediction of should be equal to . However, since is unobservable, we relax the optimization of in the function space that satisfies the condition into the optimization in the general function space. Assuming that our function space includes the optimal function that minimizes

Appendix E Additional experimental results

Method | Linear-A | Linear-B | Linear-C | Quadratic-A | Quadratic-B | Quadratic-C | Bilinear
--- | --- | --- | --- | --- | --- | --- | ---
OLS | 0.01 ± 0.00 | 0.01 ± 0.00 | 0.01 ± 0.00 | 2.70 ± 0.62 | 9.91 ± 4.53 | 10.89 ± 4.16 | 0.28 ± 0.03
Random Forest | 20.29 ± 5.34 | 3.19 ± 0.54 | 17.48 ± 5.14 | 16.83 ± 5.58 | 12.59 ± 5.18 | 19.80 ± 4.87 | 0.24 ± 0.03
BART | 14.59 ± 3.80 | 0.67 ± 0.23 | 14.30 ± 3.33 | 14.62 ± 3.82 | 11.58 ± 4.51 | 18.64 ± 3.77 | 0.50 ± 0.10
M-DNN | 10.70 ± 2.23 | 3.70 ± 2.07 | 12.05 ± 2.25 | 12.79 ± 2.26 | 16.44 ± 5.99 | 19.20 ± 3.66 | 0.36 ± 0.07
S-DNN | 0.64 ± 0.23 | 1.04 ± 0.46 | 2.61 ± 1.21 | 5.18 ± 2.25 | 16.75 ± 5.87 | 16.43 ± 3.12 | 0.13 ± 0.03
CFRNet | 10.09 ± 2.22 | 5.01 ± 2.14 | 12.64 ± 2.42 | 10.04 ± 2.18 | 13.26 ± 4.09 | 16.25 ± 3.58 | 0.40 ± 0.07
RMNet | 0.30 ± 0.08 | 0.48 ± 0.09 | 2.45 ± 0.46 | 1.75 ± 0.96 | 10.59 ± 4.69 | 13.82 ± 3.87 | 0.10 ± 0.01
RMNet (w/o MSE) | 0.46 ± 0.14 | 2.31 ± 1.46 | 3.14 ± 0.79 | 2.07 ± 1.00 | 10.55 ± 4.69 | 13.52 ± 4.17 | 0.13 ± 0.04
RMNet (w/o ER) | 0.46 ± 0.14 | 1.65 ± 0.62 | 3.71 ± 1.10 | 3.55 ± 1.82 | 16.68 ± 5.87 | 16.33 ± 3.10 | 0.19 ± 0.03
RMNet (w/o ) | 0.50 ± 0.14 | 0.54 ± 0.10 | 1.29 ± 0.27 | 1.71 ± 0.69 | 11.87 ± 4.38 | 14.43 ± 3.87 | 0.08 ± 0.01

Table 4: Synthetic results on MSE and its standard error in ten data generations. Best and second-best methods are in bold.
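Each cell of Table 4 reports a mean and a standard error over repeated data generations. A minimal sketch of that summary is given below, assuming the standard error is the sample standard deviation (with the n − 1 denominator) divided by the square root of the number of runs; the paper does not state which convention it used. The input values are placeholders, not results from the paper.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of per-run scores."""
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Placeholder per-run MSE values for ten data generations (not from the paper).
mse_runs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mean, sem = mean_and_standard_error(mse_runs)
print(f"{mean:.2f} ± {sem:.2f}")
```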

Elapsed times compared to CFR. Figure 4 compares the training time of the proposed method and CFRNet. For CFRNet, the elapsed time grew as the action space became larger. The main reason is the calculation of the distance between the representation distributions for each pair of actions in Fig. 2(a). The decrease in elapsed time for RMNet is mainly due to the training sample sizes shown in Table 3, which shrink as the action space grows.
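The growth of the pairwise computation can be made concrete by counting unordered action pairs. This is a small sketch, assuming action-space sizes of 8, 16, 32, and 64 (two to the power of the action dimension) and one distance evaluation per unordered pair; the actual cost per evaluation is not modeled.

```python
from math import comb

def num_action_pairs(num_actions):
    """Number of unordered pairs of distinct actions."""
    return comb(num_actions, 2)

# Assumed action-space sizes for action dimensions three to six.
pair_counts = {n: num_action_pairs(n) for n in (8, 16, 32, 64)}
print(pair_counts)
```

The pair count grows quadratically in the number of actions, so each doubling of the action space roughly quadruples the number of distribution distances to evaluate.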

Figure 4: Elapsed time for training. Error bars indicate standard deviation.

Figure 5: Comparison of MSE and our proposed soft cross-entropy (denoted as ) for a training instance. denotes the estimation of the conditional expectation on the observational distribution . The soft cross-entropy also takes the minimum value when . The asymmetry of the loss works as follows. In this case, the actual outcome was larger than the estimated conditional expectation, which means that the action of this instance performed “relatively well” compared to the estimated average performance in the observational data under . Therefore, predicting it as a better-than-average action is regarded as “successfully classified” and is penalized less than the failed case (predicting it as worse than average) for the same regression error.