Predicting individualized causal effects is important for decision-making in many domains. For example, a doctor considers which medication would be most effective for a patient, a teacher considers which problems would best improve a student's achievement, and a retail store manager considers which assortment would improve overall store sales. To support such decision-making, we consider providing predictions of which actions will lead to better outcomes.
Recent efforts in causal inference and counterfactual machine learning have focused on predicting the potential outcomes that correspond to each action for each individual target based on observational data. Observational data consist of the features of targets, the actions actually taken in the past, and their outcomes. We have no direct access to the past decision-makers' policies, i.e., the mechanism by which an action was chosen given a target's features. Unlike in standard prediction problems, pursuing high prediction accuracy solely with respect to the historical data carries the risk of incorrect estimates due to biases in the past policies. These biases are also known as spurious correlations (Simon, 1954; Pearl, 2009), which may mislead decision-making. For cases where real-world experiments such as randomized controlled trials (RCTs) or multi-armed bandits are infeasible or too expensive, causal inference methods provide debiased estimation of potential outcomes from observational data.
While most existing approaches assume limited action spaces, such as the binary one in individual treatment effect (ITE) estimation, there are many real-world situations where the number of options is large. For example, doctors need to consider which combination of medicines will best suit a patient. In such cases, it is difficult to apply existing methods (Shalit et al., 2017; Yoon et al., 2018; Schwab et al., 2018) for two reasons. First, since the sample size for each action would be limited, building models for each action (or a multi-head neural network), as existing methods do, is not sample-efficient. Second, even if we manage to achieve the same level of regression accuracy as when the action space is limited, the same decision-making performance is no longer guaranteed in a large action space, as we prove in Section 3. In short, even though overestimating the potential outcome of a single action among many alternatives has only a small impact on the overall regression accuracy, it can mislead the whole decision toward a bad action and result in poor decision performance.
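To make the second point concrete, the following sketch (illustrative numbers of our own, not the paper's) shows how overestimating a single action among a thousand barely moves the MSE yet flips the argmax decision:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 1000

# True expected potential outcomes for one target; action 0 is the best.
y_true = rng.uniform(0.0, 1.0, size=n_actions)
y_true[0] = 2.0

# A predictor that is exact everywhere except one action, which it overestimates.
y_pred = y_true.copy()
y_pred[1] = 3.0

mse = float(np.mean((y_pred - y_true) ** 2))  # one error averaged over 1000 actions
chosen = int(np.argmax(y_pred))               # the decision goes to the wrong action
regret = float(y_true[0] - y_true[chosen])
```

The single overestimate contributes less than 0.01 to the MSE, yet the induced regret exceeds 1.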
To achieve informative causal inference for decision-making in a large action space, we propose solutions for the above two problems. For the sample efficiency, we directly formulate the observational bias problem as domain adaptation from a biased policy to a uniform random policy, which enables the extraction of debiased representations from both the individual features and the actions. We can thereby build a debiased single-head model, aiming at better generalization over the large action space. For the second issue, we analyze our decision-focused performance metric, the "regret", and find that we can further improve the decision performance by minimizing the classification error of being among the top-k best feasible actions for each target, in addition to the regression error (MSE). We cannot directly observe whether an action is in the top-k, since only one action and its outcome are observed for each target; we therefore propose a proxy loss that compares the observed outcome to the estimated conditional average performance of the past decision-makers.
In summary, our proposed method minimizes both the classification error and the MSE using debiased representations of both the features and the actions. We demonstrate the effectiveness of our method through extensive experiments with synthetic and semi-synthetic datasets.
2 Problem setting
In this section, we formulate our problem and define a decision-focused performance metric. Our aim is to build a predictive model to inform decision-making. Given a feature vector x, the learned predictive model is expected to correctly predict which action a leads to a better outcome y, where a belongs to A(x), a feasible subset of the finite action space A given x. We hereafter assume the feasible action space does not depend on the feature, i.e., A(x) = A, for simplicity. As a typical case of large action spaces, we assume an action consists of multiple causes, i.e., a = (a_1, ..., a_m) (combinatorial action space).
We assume there exists a joint distribution p(x, a, y_1, ..., y_|A|) = p(x) π(a|x) p(y_1, ..., y_|A| | x), where π(a|x) is the unknown decision-making policy of the past decision-makers, called the propensity, and y_1, ..., y_|A| are the potential outcomes corresponding to each action. The observed (factual) outcome y is the one corresponding to the observed action a, i.e., a training instance is (x^(i), a^(i), y^(i)), where i denotes the instance index, and the other (counterfactual) potential outcomes are regarded as missing, as shown in Fig. 1. Note that the joint distribution is assumed to satisfy the conditional independence of the potential outcomes and the action given the feature (unconfoundedness). In addition, we assume π(a|x) > 0 for all a and x (overlap). These assumptions are commonly required to identify causal effects (Imbens and Wooldridge, 2009; Pearl, 2009).
To define a performance measure of a model f, we utilize a simple prediction-based decision-making policy: given a parameter k, the policy chooses an action uniformly at random from the predicted top-k actions, i.e., π_f^k(a|x) = 1/k if rank_f(a|x) ≤ k and 0 otherwise, where rank_f(a|x) denotes the rank of f(x, a) among all the feasible actions A. We also denote π_f^1 as π_f for short.
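As a sketch (function and variable names are ours), the top-k policy can be written as:

```python
import numpy as np

def topk_policy(scores: np.ndarray, k: int) -> np.ndarray:
    """Probability over actions: uniform on the k actions with the
    highest predicted outcomes, zero elsewhere."""
    topk = np.argsort(-scores)[:k]   # indices of the k best predictions
    probs = np.zeros(len(scores))
    probs[topk] = 1.0 / k
    return probs

probs = topk_policy(np.array([0.1, 0.9, 0.4, 0.7]), k=2)
```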
Here we define the performance of π_f^k by its expected outcome, which can be written as the following mean cumulative gain (mCG), and we also define its difference from the oracle's performance (regret):

mCG_k(f) = E_x[ (1/k) Σ_{a: rank_f(a|x) ≤ k} ȳ(a|x) ]   (1)
regret_k(f) = E_x[ (1/k) Σ_{a: rank_ȳ(a|x) ≤ k} ȳ(a|x) ] − mCG_k(f)   (2)

where ȳ(a|x) is the expected potential outcome and rank_ȳ(a|x) is its rank among all the feasible actions. Here the negative mCG is known as the policy risk (Shalit et al., 2017). Since the first term in (2) is constant with respect to f, the mCG and the regret are two sides of the same coin as performance metrics of a model. We regard the mCG (or, equivalently, the regret) as the metric in this paper.
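In code form (a sketch with our own naming; ties in the ranking are ignored), the mCG and regret for a single target are:

```python
import numpy as np

def mean_cg(y_bar: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Mean of the true expected outcomes y_bar over the k actions the
    model ranks highest -- the expected outcome of the top-k policy."""
    topk = np.argsort(-scores)[:k]
    return float(y_bar[topk].mean())

def regret(y_bar: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Oracle's mCG (ranking by the true outcomes) minus the model's mCG."""
    return mean_cg(y_bar, y_bar, k) - mean_cg(y_bar, scores, k)

y_bar = np.array([1.0, 2.0, 3.0, 4.0])
```

A model whose scores reverse the true ranking attains the maximal regret, while a model ranked identically to the oracle attains zero regret.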
3 Relation between prediction accuracy and precision in decision-making
In this section, we analyze our decision-focused performance metric, the regret. Our analysis reveals a difficulty of causal inference in a large action space: the regret bound gets worse for the same regression accuracy. At the same time, however, we show that the bound can be improved by simultaneously minimizing a classification error, which leads to our proposed method.
Note that this MSE is different from the normal MSE in the supervised machine learning context, in which the expectation is taken over the same distribution as the training one, i.e., actions drawn from π(·|x); here the expectation over actions is taken under the uniform distribution on A. We refer to it as MSE, or specifically the uniform MSE, in this paper.
Here the relation between the uniform MSE and the regret is the following (proof is in Appendix A).
Since the classification error rate is at most one for any f, we see that minimizing only the uniform MSE, as in existing causal inference methods, also minimizes the regret bound. However, when the action space is large, the bound becomes loose, and only an unrealistically small MSE provides a meaningful guarantee for the regret. At the same time, we see that the bound can be further improved by simultaneously minimizing the uniform top-k classification error rate, which leads to our proposed method.
4 Regret minimization network: debiased potential outcome regression and classification on a large action space
Our proposed method, the regret minimization network (RMNet), consists of two parts. First, we introduce a loss that aims to minimize the regret by minimizing both the classification error and the MSE. Then, we introduce a sample-efficient network architecture, in which a representation is extracted from both the feature x and the action a, and a representation-based debiasing regularizer that performs domain adaptation according to this structure.
4.1 Uniform regret minimization loss
By the inequality of arithmetic and geometric means, the product of the classification error and the MSE appearing in the regret bound is bounded by their weighted sum, with equality when the two weighted terms coincide. We thus aim to minimize the weighted sum of the classification error and the MSE.
Since we observe only one action and its outcome for each target, we cannot directly estimate the top-k classification error, which is based on the ranked list of potential outcomes, from the data alone. Therefore, we recast its minimization as a simple classification problem.
First, we rewrite the classification error with the 0-1 classification risk as follows (the derivation is in Appendix D):
where the label indicates whether the action's outcome is at least the k-th best one, and the loss is the 0-1 classification loss. Here the omitted terms are constant with respect to f, and thus minimizing the 0-1 risk minimizes the classification error. Therefore, we optimize the 0-1 loss with respect to f.
Next, we replace the unobservable k-th best outcome in (5) with the conditional average outcome under the observational policy, which can be estimated by a model trained on the observational data.
This means that we do not optimize for an arbitrary k but for the specific k that corresponds to the average performance of the observational policy, i.e., the k whose k-th best outcome equals the conditional average outcome (this k may depend on x).
The replaced numerical label is called the residual (also known as the advantage in reinforcement learning (Mnih et al., 2016)). A positive residual means that the action outperformed the conditional average performance of the observational policy; thus, ranking such actions higher leads to performance superior to that of the past decision-makers.
Considering the noise on the residual due to the noise on y and the estimation error of the conditional average outcome, we train our model with an estimation of the true label called a soft label (Peng et al., 2014; Nguyen et al., 2011), obtained by applying the sigmoid function to the residual, instead of a naive plug-in label. The proposed proxy risk for the classification error is the following cross-entropy:
where the teacher is the soft label and the model's prediction is passed through the same sigmoid. Note that the loss for each instance is minimized when the model's predicted probability equals the soft label, regardless of the label's value, as illustrated in Fig. 5 in the appendix.
Finally, our risk is defined as the weighted sum of the classification risk and the MSE:
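A minimal sketch of the combined risk follows; the names, the weight alpha, and the plug-in estimate `y_bar_hat` of the conditional average outcome are our assumptions, not the paper's exact specification:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def rmnet_risk(y_pred, y_obs, y_bar_hat, alpha=0.5):
    """Weighted sum of the soft-label cross-entropy and the MSE.
    y_bar_hat estimates the conditional average outcome of the past
    policy; sigmoid(y_obs - y_bar_hat) is the residual-based soft label."""
    soft_label = sigmoid(y_obs - y_bar_hat)   # teacher: better than average?
    p = sigmoid(y_pred - y_bar_hat)           # model's predicted probability
    ce = -(soft_label * np.log(p) + (1 - soft_label) * np.log(1 - p))
    mse = (y_pred - y_obs) ** 2
    return float(np.mean(alpha * ce + (1 - alpha) * mse))
```

Both terms are minimized by predicting the observed outcome exactly; the cross-entropy term additionally penalizes predictions on the wrong side of the estimated average.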
4.2 Debiasing by representation-based domain adaptation to the RCT policy
While the accessible observational data are biased by the propensity π, the expected risk is averaged over all actions uniformly. In this section, we therefore construct an empirical risk that is debiased against this sampling bias. We also propose an architecture that extracts representations from both the feature and the action for better generalization in a large action space.
There are two major approaches for debiased learning in individual-level causal inference. One is the density estimation-based method of inverse probability weighting (IPW) with the propensity score (Austin, 2011), in which each instance is weighted by the inverse of the propensity. Since the weighted expected risk matches that of an RCT, good performance can be expected asymptotically when the propensity is estimated accurately or is recorded, as in logged bandit problems. However, in observational studies, where the propensity has to be estimated and plugged in, its efficacy can easily drop (Kang et al., 2007). The other approach is representation balancing (Shalit et al., 2017; Johansson et al., 2016), in which the model consists of a representation extractor and hypotheses, as in Fig. 2(a), and the conditional distributions of the representations are encouraged to be similar to each other by means of a so-called integral probability metric (IPM) regularizer. We take this approach and extend it to large action spaces.
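For reference, a minimal IPW-weighted risk looks as follows (a sketch; the clipping threshold is our own safeguard, not from the paper):

```python
import numpy as np

def ipw_risk(losses: np.ndarray, propensity: np.ndarray, clip: float = 1e-3) -> float:
    """Reweight each observed instance by the inverse propensity 1/pi(a|x),
    so the empirical risk matches the RCT (uniform-policy) expectation.
    Clipping guards against near-zero propensities that blow up the variance."""
    w = 1.0 / np.clip(propensity, clip, None)
    return float(np.sum(w * losses) / np.sum(w))

# An instance logged under a rare action (propensity 0.25) counts twice as
# much as one logged under a common action (propensity 0.5).
risk = ipw_risk(np.array([1.0, 2.0]), np.array([0.5, 0.25]))
```

The weakness noted above is visible here: the weights are only as good as the plugged-in propensity estimates.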
It is difficult to naively extend these methods to a large action space. One reason is that, as in Fig. 2(a), constructing hypothesis layers for each action is not sample-efficient. Moreover, representation balancing over each pair of actions is computationally and statistically infeasible. Therefore, we propose extracting representations from both the features and the action, as in Fig. 2(b).
Figure 2: Network architectures of the existing action-wise methods and our proposed method. A broken line indicates no backpropagation.
We want to minimize the risk under the joint distribution with the uniform policy, in which actions follow the discrete uniform distribution over A, using samples from the observational joint distribution. This can be seen as an unsupervised domain adaptation task from the training distribution to the joint distribution with the uniform policy. From this observation, we directly apply the representation regularizer to these two distributions. That is, we encourage the representation distributions under the observational policy and under the uniform policy to match.
The resulting objective function is
where the first term is the empirical instance-wise version of (7), the uniform actions are sampled from the discrete uniform distribution over A, and the last term is the representation balancing regularizer. We utilize the Wasserstein distance, an instance of the IPM, as the discrepancy measure between the representation distributions, as in (Shalit et al., 2017). Specifically, we use an entropy relaxation of the exact Wasserstein distance, called the Sinkhorn distance (Cuturi, 2013), for compatibility with gradient-based optimization. The resulting learning flow is shown in Algorithm 1. A theoretical analysis of our representation balancing regularization can be found in Appendix B.
5 Experiments
We investigated the effectiveness of our method through synthetic and semi-synthetic experiments. Both datasets were newly designed by us for the problem setting with a large action space.
5.1 Experimental setup
Compared methods. We compared our method with ordinary least squares (OLS), random forests (Breiman, 2001), Bayesian additive regression trees (BART) (Hill, 2011), a naive deep neural network (S-DNN), a naive DNN with a multi-head architecture for each action (M-DNN) (a.k.a. TARNET (Shalit et al., 2017)), and a straightforward extension of the existing action-wise representation balancing method, the counterfactual regression network (CFRNet) (Shalit et al., 2017). We also made comparisons with variants in which one component of our proposed loss function was removed, i.e., the MSE ("w/o MSE"), the classification error ("w/o ER"), or the representation balancing regularizer ("w/o IPM"), to clarify the contribution of each component. For the main proposed method (RMNet), we weighted ER and MSE equally. The strength of the representation balancing regularizer in CFRNet and the proposed method was selected by validation. Other specifications of the DNN parameters can be found in Appendix C.
Evaluation. We used the normalized mean CG (nmCG) as the main metric, defined as follows.
The normalized mean CG is proportional to the mean CG (1), except that the expected outcomes are replaced with the actual ones and the score is normalized so that the oracle attains one; nmCG ≤ 1 follows from the definition. Since we have standardized the outcome, the chance rate is zero. In addition to nmCG, we also evaluated the uniform MSE. The validation and model selection were based on the mean CG, including for the results reported in MSE.
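A sketch of how nmCG@k could be computed for one target; since the paper's exact normalization is not reproduced above, the oracle-normalized form here is our assumption:

```python
import numpy as np

def nmcg_at_k(y_actual: np.ndarray, scores: np.ndarray, k: int = 1) -> float:
    """Actual outcome achieved by the model's top-k actions, normalized by
    the oracle's top-k value so that a perfect ranking scores one."""
    pred_topk = np.argsort(-scores)[:k]
    oracle_topk = np.argsort(-y_actual)[:k]
    return float(y_actual[pred_topk].mean() / y_actual[oracle_topk].mean())
```

With standardized outcomes, a model no better than random scores around zero, matching the chance rate stated above.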
Infrastructure. All the experiments were run on a machine with 28 CPUs (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz), 250GB memory, and 8 GPUs.
5.2 Synthetic experiment
Dataset. We prepared seven biased datasets in total to examine the robustness of the proposed and baseline methods. For the detailed generation process, see Appendix C. The feature space and the action space are fixed across settings. The sample sizes were 1,000 for training, 100 for validation, and 200 for testing. For training, only one action and the corresponding outcome are sampled for each target as follows. Six of the settings have generalized linear models for the expected outcome, built on one-dimensional representations of the feature x and the action a. The outcome function is linear in three of them and quadratic in the rest. The last setting is a bilinear model. The sampling bias is induced through other representations of x and a. The three settings for the linear and quadratic patterns correspond to different relations between the outcome-relevant and bias-relevant representations, as illustrated in Fig. 3(a)-(c). These relations were designed to reproduce spurious correlations, which may mislead decision-making as follows. In Setup-A, the bias-relevant part of a would appear dependent on y through its dependence on x, despite having no causal relationship to y. Similarly, in Setup-B, the bias-relevant part of x would appear dependent on y through a, and the causal effect of a may appear discounted. In Setup-C, the causal effect of a might appear to be the opposite, as illustrated in Fig. 3(d).
Result. As can be seen in Table 1, our proposed method achieved the best performance or compared favorably under all settings. The results in MSE are shown in Appendix E. In the linear settings, OLS performed on par with the oracle (nmCG = 1), since the model class is correctly specified, but its performance dropped under the nonlinear settings. In real-world situations where the class of the true function is unknown, versatility with respect to the true function class is a significant strength of the proposed method. Under the Linear-C setting, some of the compared methods performed below the chance rate. This is likely because Linear-C is designed to be misleading, as illustrated in Fig. 3(d). We note that Linear-C is not unrealistic: for example, doctors (the past decision-makers) are likely to give stronger medicines to more serious patients, and the stronger medicines might then appear to worsen the patients' health. In Linear-C and Quadratic-C, the performance of RMNet was worsened by the representation balancing regularizer. This might be because, in Setup-C, the empirical loss in (8) and the regularizer conflict: extracting both the outcome-relevant and bias-relevant parts is needed for better prediction, which increases the IPM due to the bias. Recent studies (Hassanpour and Greiner, 2020; Zhang et al., 2020; Johansson et al., 2019) argue that requiring a low IPM is unnecessarily strong and propose alternatives to the IPM; there is thus room for further improvement in this direction in future work.
| Method | Linear-A | Linear-B | Linear-C | Quadratic-A | Quadratic-B | Quadratic-C | Bilinear |
|---|---|---|---|---|---|---|---|
| OLS | 0.99 ± .00 | 1.00 ± .00 | 1.00 ± .00 | 0.20 ± .10 | 0.68 ± .11 | 0.80 ± .13 | 0.00 ± .01 |
| Random Forest | 0.52 ± .10 | 0.46 ± .06 | 0.89 ± .11 | 0.71 ± .10 | 0.24 ± .03 | 0.90 ± .06 | 0.64 ± .04 |
| BART | 0.69 ± .12 | 0.99 ± .00 | 1.03 ± .08 | 0.54 ± .15 | 0.87 ± .07 | 0.99 ± .00 | 0.02 ± .04 |
| M-DNN | 0.40 ± .16 | 0.76 ± .09 | 0.07 ± .09 | 0.77 ± .09 | 0.45 ± .14 | 0.62 ± .13 | 0.25 ± .10 |
| S-DNN | 0.83 ± .11 | 0.85 ± .08 | 0.64 ± .18 | 0.78 ± .09 | 0.52 ± .08 | 0.70 ± .08 | 0.66 ± .08 |
| CFRNet | 0.10 ± .23 | 0.72 ± .16 | 0.06 ± .10 | 0.52 ± .18 | 0.30 ± .12 | 0.53 ± .15 | 0.09 ± .08 |
| RMNet | 0.96 ± .01 | 0.98 ± .01 | 0.76 ± .07 | 0.95 ± .02 | 0.87 ± .03 | 0.90 ± .05 | 0.83 ± .02 |
| RMNet (w/o MSE) | 0.94 ± .01 | 0.89 ± .09 | 0.47 ± .09 | 0.93 ± .02 | 0.86 ± .03 | 0.83 ± .05 | 0.75 ± .07 |
| RMNet (w/o ER) | 0.90 ± .05 | 0.84 ± .08 | 0.60 ± .13 | 0.88 ± .05 | 0.55 ± .07 | 0.71 ± .08 | 0.56 ± .09 |
| RMNet (w/o IPM) | 0.91 ± .03 | 0.98 ± .01 | 0.87 ± .05 | 0.94 ± .01 | 0.86 ± .03 | 0.92 ± .04 | 0.83 ± .02 |
Table 1: Synthetic results on normalized mean CG@1 (larger is better, with a maximum of one) and its standard error over ten data generations. The best and second-best methods are in bold.
5.3 Semi-synthetic experiment
Dataset (GPU kernel performance). For the semi-synthetic experiment, we used the SGEMM GPU kernel performance dataset (Nugteren and Codreanu, 2015; Ballester-Ripoll et al., 2017), which has 14 feature attributes of GPU kernel parameters and four target attributes of elapsed times in milliseconds for four independent runs of each parameter combination. We used the inverse of the mean elapsed time as the outcome, yielding 241.6k instances in total. By treating some of the feature attributes as action dimensions, we obtained a complete dataset in which all the entries (potential outcomes) in Fig. 1(b) are observed. We then composed our semi-synthetic dataset by biased subsampling of only one action and the corresponding potential outcome for each target. The details of this preprocessing can be found in Appendix C.
The sampling policy for the training data was a softmax whose logits are given by a random projection of the feature and the action. This policy reproduces a spurious correlation: a random projection of the feature and the action is likely to have little causal relationship with the outcome but acquires a strong correlation with it due to the sampling policy. This policy also depends on the outcome, which violates the unconfoundedness assumption; however, the dataset we used has a low noise level, i.e., the outcome is nearly a deterministic function of the feature and the action, and thus the violation is expected to be minor.
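A sketch of such a biased logging policy (the projection vector, temperature, and exact functional form are our assumptions, not the paper's):

```python
import numpy as np

def biased_policy(x: np.ndarray, actions: np.ndarray, w: np.ndarray,
                  beta: float = 1.0) -> np.ndarray:
    """Softmax sampling probabilities over actions, with logits given by a
    random projection w of the concatenated (feature, action) pair."""
    logits = beta * np.array([w @ np.concatenate([x, a]) for a in actions])
    p = np.exp(logits - logits.max())   # numerically stabilized softmax
    return p / p.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)
actions = rng.normal(size=(8, 3))
w = rng.normal(size=7)                  # random projection inducing the bias
p = biased_policy(x, actions, w)
```

Sampling one action per target from `p` yields the biased training set; actions aligned with the random projection are systematically over-represented.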
We split the feature set into 80% for training, 5% for validation, and 15% for testing. Then, for the training set, only one action and the corresponding outcome were taken for each target. The resulting training sample size for each setting of the action space is listed in Table 3 in Appendix C. We repeated the training and evaluation process ten times with different splits and samplings.
Result. As shown in Table 2, our proposed method outperformed the others in nmCG@1 in all cases. In terms of MSE, S-DNN with the same backbone also achieved high performance, which demonstrates that the structure in Fig. 2(b) models the data efficiently. The performance gains over "w/o ER" and "w/o IPM" demonstrate the effectiveness of both components proposed in Section 4. The superior performance of RMNet without MSE in some settings indicates room for optimizing the weight between the two losses, which we fixed to 0.5.
| Method | nmCG@1 (3) | nmCG@1 (4) | nmCG@1 (5) | nmCG@1 (6) | MSE (3) | MSE (4) | MSE (5) | MSE (6) |
|---|---|---|---|---|---|---|---|---|
| OLS | 0.04 ± .15 | 0.08 ± .20 | 0.10 ± .13 | 0.01 ± .10 | 1.12 ± .12 | 1.89 ± .26 | 1.70 ± .26 | 5.86 ± 1.10 |
| Random Forest | 0.23 ± .08 | 0.33 ± .07 | 0.32 ± .05 | 0.37 ± .05 | 1.03 ± .11 | 0.87 ± .08 | 0.93 ± .09 | 1.07 ± .18 |
| BART | 0.00 ± .13 | 0.17 ± .13 | 0.11 ± .10 | 0.04 ± .09 | 1.06 ± .08 | 1.04 ± .08 | 1.19 ± .12 | 1.63 ± .23 |
| M-DNN | 0.41 ± .05 | 0.48 ± .06 | 0.31 ± .07 | 0.37 ± .05 | 0.78 ± .05 | 0.84 ± .02 | 0.83 ± .02 | 0.84 ± .02 |
| S-DNN | 0.29 ± .09 | 0.26 ± .10 | 0.32 ± .07 | 0.46 ± .05 | 0.75 ± .12 | 0.60 ± .09 | 0.74 ± .06 | 0.74 ± .04 |
| CFRNet | 0.50 ± .06 | 0.39 ± .14 | 0.39 ± .10 | 0.35 ± .05 | 0.79 ± .02 | 0.81 ± .02 | 0.87 ± .01 | 0.86 ± .01 |
| RMNet | 0.68 ± .00 | 0.60 ± .05 | 0.60 ± .05 | 0.51 ± .05 | 0.77 ± .00 | 0.76 ± .09 | 0.84 ± .02 | 0.73 ± .07 |
| RMNet (w/o MSE) | 0.68 ± .00 | 0.66 ± .01 | 0.67 ± .01 | 0.50 ± .05 | 0.76 ± .00 | 0.75 ± .06 | 0.85 ± .01 | 0.80 ± .08 |
| RMNet (w/o ER) | 0.68 ± .00 | 0.45 ± .08 | 0.56 ± .05 | 0.49 ± .05 | 0.77 ± .00 | 0.67 ± .08 | 0.88 ± .02 | 0.75 ± .05 |
| RMNet (w/o IPM) | 0.33 ± .09 | 0.27 ± .10 | 0.40 ± .07 | 0.48 ± .06 | 0.72 ± .12 | 0.81 ± .18 | 0.78 ± .08 | 0.71 ± .06 |
6 Conclusion
In this paper, we have investigated causal inference in a large action space with a focus on decision-making performance. We first defined and analyzed the decision-making performance brought about by a model through a simple prediction-based decision-making policy. We then showed that the bound based only on the regression accuracy (MSE) gets looser as the action space grows, which illustrates the difficulty of utilizing causal inference for decision-making in a large action space. At the same time, however, our bound indicates that minimizing not only the regression loss but also the classification loss leads to better performance. From this viewpoint, our proposed method minimizes both the regression and classification losses, specifically a soft cross-entropy with a teacher label indicating whether an observed outcome is better than the estimated conditional average outcome of the observational distribution given the feature. Experiments on synthetic and semi-synthetic datasets, which are designed to contain misleading spurious correlations, demonstrated the superior performance of the proposed method with respect to both the decision performance and the regression accuracy.
References
- Austin (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46(3), pp. 399-424.
- Ballester-Ripoll et al. (2017). Sobol tensor trains for global sensitivity analysis. arXiv preprint arXiv:1712.00233.
- Breiman (2001). Random forests. Machine Learning 45(1), pp. 5-32.
- Cuturi (2013). Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292-2300.
- Hassanpour and Greiner (2020). Learning disentangled representations for counterfactual regression. In International Conference on Learning Representations.
- Hill (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20(1), pp. 217-240.
- Imbens and Wooldridge (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47(1), pp. 5-86.
- Johansson et al. (2019). Support and invertibility in domain-invariant representations. In Proceedings of Machine Learning Research, Vol. 89, pp. 527-536.
- Johansson et al. (2016). Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020-3029.
- Kang et al. (2007). Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22(4), pp. 523-539.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Mnih et al. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937.
- Nguyen et al. (2011). Learning classification with auxiliary probabilistic information. In IEEE International Conference on Data Mining (ICDM), pp. 477-486.
- Nugteren and Codreanu (2015). CLTune: a generic auto-tuner for OpenCL kernels. In IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 195-202.
- Pearl (2009). Causality. Cambridge University Press.
- Peng et al. (2014). Learning on probabilistic labels. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 307-315.
- Schwab et al. (2018). Perfect match: a simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656.
- Shalit et al. (2017). Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, pp. 3076-3085.
- Simon (1954). Spurious correlation: a causal interpretation. Journal of the American Statistical Association 49(267), pp. 467-479.
- Yoon et al. (2018). GANITE: estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations.
- Zhang et al. (2020). Learning overlapping representations for the estimation of individualized treatment effects. arXiv preprint arXiv:2001.04754.
Appendix A Proof of Proposition 3.1
The expected regret is bounded using the uniform MSE in (3) as follows,
where the additional factor is the top-k classification error rate, i.e., the expected rate of disagreement between the true and predicted top-k memberships,
where the disagreement is the logical XOR of the two membership indicators.
Here we denote the true and the predicted k-th best action for a target by the oracle's and the model's rank-k actions, respectively. For every feature vector, the target-wise regret can be bounded as follows.
Inequality (9) follows from the definition of the true top-k actions: the sum of the top-k expected outcomes out of all actions must be larger than or equal to the sum of the expected outcomes of the predicted top-k actions, which are not necessarily the true top-k. Let the membership vector be formed from the one-hot encodings of the predicted top-k actions, and let the error vector consist of the per-action prediction errors. The r.h.s. is then bounded as
where the two factors are the target-wise error rate and MSE, respectively. The inequality comes from the Cauchy-Schwarz inequality. Taking the expectation with respect to the feature and applying Jensen's inequality yields the proposition. ∎
Appendix B Error analysis for representation-based domain adaptation from observational data to the uniform average on action space
By performing the representation balancing regularization, our method enjoys better generalization through minimizing an upper bound of the error on the test distribution (under the uniform random policy). We briefly show why minimizing the combination of the empirical loss on the training data and the distribution regularization in (8) results in minimizing the test error. First, we define the point-wise loss under a hypothesis and an extractor, which defines the representation, as
Then, the expected losses for the training (source) and the test distribution (target) are
We assume there exists a scaling constant such that the scaled point-wise loss belongs to the given function space. The integral probability metric for this function space is then defined as
The difference between the expected losses under the training and test distributions is then bounded as
For the function space, we use the 1-Lipschitz class, with which the IPM becomes the Wasserstein distance. Although the scaling constant is unknown, tuning the regularization strength hyperparameter in (8) effectively tunes it.
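The omitted bound can be written as follows (our reconstruction from the definitions above; $\epsilon_\pi$ and $\epsilon_U$ denote the expected losses under the observational and uniform policies, $p^\pi$ and $p^U$ the induced representation distributions, $\ell_{h,\Phi}$ the point-wise loss, $G$ the function space, and $B$ the assumed scaling constant):

```latex
\epsilon_U(h,\Phi) - \epsilon_\pi(h,\Phi)
  = \int \ell_{h,\Phi}(z)\,\bigl(p^{U}(z) - p^{\pi}(z)\bigr)\,dz
  \le B \sup_{g \in G}\Bigl|\int g(z)\,\bigl(p^{U}(z) - p^{\pi}(z)\bigr)\,dz\Bigr|
  = B\,\mathrm{IPM}_{G}\bigl(p^{\pi}, p^{U}\bigr).
```

Minimizing the empirical training loss together with the IPM term therefore minimizes an upper bound on the test loss under the uniform policy.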
Appendix C Experimental details
Synthetic data generation process. Our synthetic datasets are built as follows.
Sample where .
Sample , where , from where and are the following.
In settings other than Setup-B, where .
In Setup-B, , i.e., only the first dimension in is used to bias .
Calculate the expected outcome, where we examine three types of functions, namely, Linear, Quadratic, and Bilinear. In the Linear and Quadratic types, the outcome depends on one-dimensional representations of the feature and the action, respectively.
In Setup-B, where denotes all dimensions other than the first dimension ().
In settings other than Setup-B, .
In Setup-A, where
In settings other than Setup-A, .
In the Linear setting,
In the Quadratic setting,
In the Bilinear setting, where
Sample the observed outcome by adding noise to the expected outcome of the chosen action.
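The steps above can be sketched as follows for the Linear setting (dimensions, projection vectors, bias strength, and noise scale are illustrative assumptions of ours, not the paper's values):

```python
import numpy as np

def generate_linear(n=1000, d_x=10, n_actions=16, bias=2.0, noise=0.1, seed=0):
    """Sample features, biased actions, and outcomes from a linear model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d_x))                    # step 1: features
    A = rng.normal(size=(n_actions, d_x))            # action embeddings
    w_y = rng.normal(size=d_x)                       # outcome-relevant projection
    w_b = rng.normal(size=d_x)                       # bias-relevant projection
    y_bar = (X @ w_y)[:, None] + (A @ w_y)[None, :]  # expected outcomes (n, n_actions)
    # step 2: biased logging policy via a softmax on another representation
    logits = bias * (X @ w_b)[:, None] * (A @ w_b)[None, :]
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    a_obs = np.array([rng.choice(n_actions, p=pi) for pi in p])
    # final step: observed (factual) outcomes with additive noise
    y_obs = y_bar[np.arange(n), a_obs] + rng.normal(scale=noise, size=n)
    return X, a_obs, y_obs, y_bar

X, a_obs, y_obs, y_bar = generate_linear(n=100)
```

Only `(X, a_obs, y_obs)` would be visible to a learner; `y_bar` plays the role of the full potential-outcome table used for evaluation.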
Details of the semi-synthetic data. We transformed the target attributes of elapsed times into the average speed as the outcome, i.e., the inverse of the mean of the four original elapsed times. Then we standardized the outcome and the features. Each feature can take binary values or up to four different power-of-two values. Out of 1,327k total parameter combinations, only 241.6k feasible combinations are recorded. We split the original feature dimensions into features and actions as follows. The dimension of the action space ranged from three to six, and the 8th, 11th, 12th, 13th, 14th, and 3rd dimensions were used as action dimensions in this order (e.g., for a three-dimensional action space, the 8th, 11th, and 12th dimensions of the original features are regarded as the action). This split was chosen to maximize the overlap of actions across features.
Other DNN parameters. The detailed parameters we used for the DNN-based methods (S-DNN, M-DNN, CFRNet, and the proposed method) were as follows. The backbone DNN had four layers for representation extraction and three layers for the hypothesis, with a width of 64 for the middle layers and a width of 10 for the representation. The batch size was 64, except for CFRNet, for which it was 512 owing to the need to approximate the distributions for each action. An L2 regularizer of fixed strength was applied. We used Adam (Kingma and Ba, 2015) as the optimizer with a fixed learning rate.
Appendix D Derivation of Eq. 5
Here k satisfies the condition that the k-th largest prediction equals the threshold value. Since this quantity is unobservable, we relax the optimization over the function space satisfying the condition into optimization over the general function space, assuming that our function space includes the optimal function that minimizes the risk.
Appendix E Additional experimental results
| Method | Linear-A | Linear-B | Linear-C | Quadratic-A | Quadratic-B | Quadratic-C | Bilinear |
|---|---|---|---|---|---|---|---|
| OLS | 0.01 ± 0.00 | 0.01 ± 0.00 | 0.01 ± 0.00 | 2.70 ± 0.62 | 9.91 ± 4.53 | 10.89 ± 4.16 | 0.28 ± 0.03 |
| Random Forest | 20.29 ± 5.34 | 3.19 ± 0.54 | 17.48 ± 5.14 | 16.83 ± 5.58 | 12.59 ± 5.18 | 19.80 ± 4.87 | 0.24 ± 0.03 |
| BART | 14.59 ± 3.80 | 0.67 ± 0.23 | 14.30 ± 3.33 | 14.62 ± 3.82 | 11.58 ± 4.51 | 18.64 ± 3.77 | 0.50 ± 0.10 |
| M-DNN | 10.70 ± 2.23 | 3.70 ± 2.07 | 12.05 ± 2.25 | 12.79 ± 2.26 | 16.44 ± 5.99 | 19.20 ± 3.66 | 0.36 ± 0.07 |
| S-DNN | 0.64 ± 0.23 | 1.04 ± 0.46 | 2.61 ± 1.21 | 5.18 ± 2.25 | 16.75 ± 5.87 | 16.43 ± 3.12 | 0.13 ± 0.03 |
| CFRNet | 10.09 ± 2.22 | 5.01 ± 2.14 | 12.64 ± 2.42 | 10.04 ± 2.18 | 13.26 ± 4.09 | 16.25 ± 3.58 | 0.40 ± 0.07 |
| RMNet | 0.30 ± 0.08 | 0.48 ± 0.09 | 2.45 ± 0.46 | 1.75 ± 0.96 | 10.59 ± 4.69 | 13.82 ± 3.87 | 0.10 ± 0.01 |
| RMNet (w/o MSE) | 0.46 ± 0.14 | 2.31 ± 1.46 | 3.14 ± 0.79 | 2.07 ± 1.00 | 10.55 ± 4.69 | 13.52 ± 4.17 | 0.13 ± 0.04 |
| RMNet (w/o ER) | 0.46 ± 0.14 | 1.65 ± 0.62 | 3.71 ± 1.10 | 3.55 ± 1.82 | 16.68 ± 5.87 | 16.33 ± 3.10 | 0.19 ± 0.03 |
| RMNet (w/o IPM) | 0.50 ± 0.14 | 0.54 ± 0.10 | 1.29 ± 0.27 | 1.71 ± 0.69 | 11.87 ± 4.38 | 14.43 ± 3.87 | 0.08 ± 0.01 |
Elapsed times compared to CFRNet. Figure 4 shows the comparison of training times between the proposed method and CFRNet. For CFRNet, the elapsed time grows as the action space becomes large, mainly because of the calculation of the distances between the representation distributions for each pair of actions in Fig. 2(a). The decrease in the elapsed time for RMNet is mainly due to the sample sizes shown in Table 3.