# Causality and Robust Optimization

A decision-maker must consider confounding bias when attempting to apply machine learning prediction, and, while feature selection is widely recognized as an important process in data analysis, it can itself cause confounding bias. A causal Bayesian network is a standard tool for describing causal relationships, and if the relationships are known, then adjustment criteria can determine with which features confounding bias disappears. A natural modification would thus utilize causal discovery algorithms to prevent confounding bias in feature selection. Causal discovery algorithms, however, essentially rely on the faithfulness assumption, which turns out to be easily violated in practical feature selection settings. In this paper, we propose a meta-algorithm that can remedy existing feature selection algorithms in terms of confounding bias. Our algorithm is induced from a novel adjustment criterion that requires, rather than faithfulness, an assumption which can be derived from the well-known causal sufficiency assumption. We further prove that the features added through our modification convert confounding bias into prediction variance. With the aid of existing robust optimization technologies that regularize risky strategies with high variance, we are then able to successfully improve the throughput performance of decision-making optimization, as is shown in our experimental results.


## 1 Introduction

With recent advances in machine learning technology, decision-making optimization aided by prediction has become ubiquitous in a variety of industries. This paper considers decision-making conducted on the basis of batch learning and mathematical optimization, a data-analysis pipeline often called predictive optimization [14]. Let us first introduce an example of the pipeline in price optimization [13]. Let $x$, $y$, and $z$ denote, respectively, decision variables (product prices), target variables to be predicted (product demand), and external features (weather, temperature, etc.), and suppose that one would like to maximize the revenue function $r(x, y) = x^\top y$, which is the inner product of the price and demand vectors. For this aim, given historical daily point-of-sale data, a typical learner would first apply a feature selection algorithm to compute a subset of the external features so as to improve prediction performance, and would next estimate a sales demand prediction formula $\hat{f}$. At the beginning of each day, the learner would then input a specific realization $\tilde{z}$ of the external features into the system, which would compute the pricing strategy maximizing the revenue function on the basis of $\tilde{z}$ and the prediction formula $\hat{f}$. The general predictive optimization framework is applicable to a variety of applications, such as portfolio optimization [28], inventory optimization [5], and electricity auctions [18].

For decision-making optimization on the basis of a prediction formula, one must be careful about confounding bias, which might make target variables unpredictable under optimization. Figure 1 shows a simple motivating example with three variables in predictive price optimization, where feature selection causes confounding bias. A storekeeper is to decide the price of an umbrella, and weather is a confounding variable affecting both price and demand; on rainy days, the demand for umbrellas is high, and the storekeeper, knowing this, will raise prices accordingly. Though increased prices have a negative effect on demand, this effect is smaller than the positive effect of rain. Suppose we are given historical data and run a demand prediction algorithm. A prediction model relating increased prices to increased demand would be simple, accurate, and, thus, preferable for a machine predictor. If an optimizer increased the price on the basis of this prediction model, however, the demand might decrease unexpectedly; this is called confounding bias. The confounding bias occurs because the rain node is deleted by the feature selector, is invisible to the optimizer, and thus behaves as a virtually unobserved confounder. This example demonstrates that a prediction model may not indicate the consequences of optimization under the existence of confounding bias, and that this can be caused by a feature selection algorithm.
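To make the example concrete, the following sketch (with illustrative parameter values, not taken from the paper) simulates the umbrella scenario and fits least-squares demand models with and without the weather feature; dropping the confounder flips the sign of the estimated price effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Weather is a confounder: rain raises both price and demand.
rain = rng.binomial(1, 0.5, n)
price = 10 + 5 * rain + rng.normal(0, 1, n)                  # storekeeper raises price on rainy days
demand = 20 + 8 * rain - 0.5 * price + rng.normal(0, 1, n)   # price itself lowers demand

# Regressing demand on price alone: confounded, the slope comes out positive.
X1 = np.column_stack([price, np.ones(n)])
slope_confounded = np.linalg.lstsq(X1, demand, rcond=None)[0][0]

# Adjusting for the confounder recovers the negative causal coefficient (-0.5).
X2 = np.column_stack([price, rain, np.ones(n)])
slope_adjusted = np.linalg.lstsq(X2, demand, rcond=None)[0][0]

print(slope_confounded > 0, slope_adjusted < 0)  # confounding flips the sign
```

An optimizer using the first model would raise prices expecting higher demand, exactly the failure the example describes.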

A causal Bayesian network [24, 16] is a well-known tool for describing causal relationships between variables, and if such relationships are known, then adjustment criteria [23, 21, 27] can identify the sets of features with which one can avoid confounding bias. A significant amount of effort has thus been exerted on the study of causal discovery [29, 8, 9], which aims to recover the structure of unknown networks from observational data. In our predictive optimization setting, the temporal context of the analysis pipeline indicates that the set of direct causes of the target variables satisfies the adjustment criteria, and direct cause discovery algorithms [26, 11] are thus applicable. In terms of feature selection, a set of direct causes is known to maximize the performance of subsequent prediction [7, 12]. Direct cause discovery might thus be one of the most promising approaches for feature selection in predictive optimization that simultaneously avoids confounding bias and improves prediction performance.

Existing causal discovery algorithms essentially rely on the faithfulness assumption, but it is easily violated in typical feature selection settings. Faithfulness requires that every conditional independence can be read from the structure of the causal network, and this is often justified in the study of causal discovery since unfaithful parameterizations have measure zero and are thus considered unnatural. In preprocessing for feature selection, however, one often generates artificial features from the original features via arithmetic operations (quadratic features computed from pairs of original features, for example), and this artificial generation violates the faithfulness condition. Further, in practice a causal discovery algorithm requires "enough faithfulness" for independence to be verified with a given limited number of data. The notion of strong faithfulness has been studied under the normality assumption [15, 33, 35], and it characterizes the relationships among the amount of faithfulness, the number of samples, and the number of features. These studies have shown that the number of samples should be considerably larger than the number of features, while this might not be the case in practical feature selection settings. Thus, the faithfulness assumption cannot be justified in predictive optimization, and, in fact, causal discovery algorithms cause confounding bias, as is shown in our experiments.

#### Our contributions

Our contributions are mainly two-fold. First, we present a novel adjustment criterion that directly leads us to propose a meta-algorithm which can modify an existing feature selection algorithm so as to prevent confounding bias, or, in other words, to be admissible from the viewpoint of causality. Our approach does not rely on the faithfulness assumption, but instead on the assumption that the entire feature set satisfies the adjustment criteria, which can be derived from the nonexistence of unobserved confounders, called the causal sufficiency assumption. Intuitively speaking, our approach benefits from a larger number of feature candidates, which tends to include all confounders, while existing approaches relying on faithfulness might suffer from a large number of candidates. Our approach can thus be naturally applied to practical feature selection settings dealing with large numbers of features.

Secondly, we reveal the role, in the predictive optimization pipeline, of the features additionally selected through the modification of our meta-algorithm. Our meta-algorithm requires a feature selector to adopt additional features which, though useless for improving prediction accuracy, can reduce the confounding bias. Our theoretical analysis proves that, under certain assumptions, the sum of confounding bias and variance is constant. Thus the additional features convert confounding bias into prediction variance, which is measurable in practice and thus rather tractable. With the aid of existing robust optimization technologies that regularize risky strategies with high variance, we are then able to successfully improve the throughput performance of a predictive optimization pipeline, as is shown in our experimental results.

Because of space limitations, all proofs are presented in the supplementary material.

## 2 Predictive Optimization Problem

Our general predictive optimization problem, consisting of feature selection, prediction, and optimization, is introduced in this section. Let $x \in \mathcal{X}$ be a vector of decision variables, where $\mathcal{X}$ is an optimization domain. Also, let $y$ be a vector of target variables, and $z$ be a vector of external features. Let $r(x, y)$ be a given objective function. The aim of predictive optimization here is to select an optimum decision $x$ that minimizes $r$ on the basis of the features $\tilde{z}$ obtained in advance. If the exact prediction formula $f$ is known, then the problem can be formulated as a mere mathematical optimization problem:

$$\min_{x \in \mathcal{X}} r(x, y) \quad \text{s.t.}\quad y = f(x, \tilde{z}). \tag{1}$$

Since the exact prediction formula is unavailable in practice, we have to estimate it from historical data $D = \{(x_i, y_i, z_i)\}_{i=1}^{d}$. We assume that each $(x_i, y_i, z_i)$ is an independent and identically distributed realization of the random variables $X$, $Y$, and $Z$. Our predictive optimization is summarized in the three phases below.

#### Feature selection phase

Feature selection is widely recognized as an important preprocessing phase in machine learning prediction, in which a feature selector discards useless or redundant features in order to improve prediction performance [19]. We consider here supervised and batch feature selection, which aims to select a subset of feature indices on the basis of the given data $D$. Let $\mathrm{MB}$ denote a feature selection algorithm that outputs a subset of features useful for predicting $Y$:

$$S = \mathrm{MB}(Y).$$

The selected external features are then denoted by $Z^\kappa$, where $\kappa$ is the set of indices of the selected features. Examples of feature selection algorithms include feature selection on the basis of mutual information [6] and that on the basis of sparse regression [30, 20]. For a thorough review of existing feature selection algorithms, see [19]. Here $\mathrm{MB}$ is named after the Markov blanket, and our discussion of how the feature selection algorithms above can be regarded as Markov blanket discovery algorithms is presented in Section 3.4.
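As an illustration of this phase, the sketch below implements a deliberately simple stand-in for such a selector: greedy forward selection by residual correlation. The function name `mb_select` and all thresholds are our own illustrative choices, not the paper's algorithm.

```python
import numpy as np

def mb_select(D_Z, D_y, k_max=5, tol=0.1):
    """Greedy forward selection: a crude stand-in for a Markov-blanket-style
    selector MB(Y).  Repeatedly adds the feature most correlated with the
    current least-squares residual until the gain becomes negligible."""
    n, p = D_Z.shape
    selected, residual = [], D_y - D_y.mean()
    for _ in range(k_max):
        scores = [0.0 if j in selected else
                  abs(np.corrcoef(D_Z[:, j], residual)[0, 1]) for j in range(p)]
        j = int(np.argmax(scores))
        if scores[j] < tol:
            break
        selected.append(j)
        A = np.column_stack([D_Z[:, selected], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, D_y, rcond=None)
        residual = D_y - A @ coef                     # refit and update residual
    return sorted(selected)

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 10))
y = 2.0 * Z[:, 3] - 1.5 * Z[:, 7] + 0.1 * rng.normal(size=500)
print(mb_select(Z, y))  # should contain the true parents 3 and 7
```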

#### Prediction phase

Given the selected feature indices $\kappa$ and a hypothesis space $\mathcal{F}_\kappa$, the prediction phase in general computes a regression function in $\mathcal{F}_\kappa$ that minimizes the empirical loss:

$$\hat{f}_\kappa := \operatorname*{arg\,min}_{f_\kappa \in \mathcal{F}_\kappa} \sum_{i=1}^{d} \ell\bigl(y_i, f_\kappa(x_i, z_i^\kappa)\bigr).$$

In our experiments, we adopt the least-squares loss and linear regression functions.

#### Optimization phase

We assume that, before optimization, a specific realization $\tilde{z}$ of the external features is available. In a price optimization setting, for example, after such external features as weather, temperature, etc. are revealed, a storekeeper decides on prices for the day. The optimization phase thus computes optimized strategies on the basis of an estimated prediction formula $\hat{f}_\kappa$ and the realization $\tilde{z}$. Though a simple non-robust formulation can be given by replacing $f$ with $\hat{f}_\kappa$ and $\tilde{z}$ with $\tilde{z}^\kappa$ in (1), we present a more general robust optimization formulation:

$$\min_{x \in \mathcal{X}} r(x, y) + \lambda\, g(x, \tilde{z}^\kappa) \quad \text{s.t.}\quad y = \hat{f}_\kappa(x, \tilde{z}^\kappa). \tag{2}$$

Here $\lambda \geq 0$ is a scale of robustness, in which $\lambda = 0$ corresponds to non-robust optimization, and $g$ is referred to as a robust regularizer. A discussion of robust optimization is found in Section 5.

## 3 Preliminary

As noted in our introduction, a simple application of feature selection can cause confounding bias. This section introduces the language of causality, which enables us to characterize the conditions under which confounding bias disappears. For simplicity of presentation, this section assumes the well-known causal sufficiency assumption [35], stating that no unobserved confounder exists, but our main discussion in the subsequent section does not rely on this.

### 3.1 Causal Bayesian network

We introduce here the notion of a causal Bayesian network, which is a standard tool for describing causal relationships. Let $\mathcal{V}$ be a set of random variables. A Bayesian network for $\mathcal{V}$ is a pair $(G, p)$, where $G$ is a directed acyclic graph and $p$ is a joint distribution over $\mathcal{V}$, satisfying the following factorization [24]:

$$p(\mathcal{V}) = \prod_{V \in \mathcal{V}} p\bigl(V \mid \mathrm{Pa}(V)\bigr).$$

Here vertices are associated with the random variables in $\mathcal{V}$, and for $V \in \mathcal{V}$, $\mathrm{Pa}(V)$ denotes the set of random variables that are parents of $V$ in $G$. We call the network a causal Bayesian network (or causal network) if all edges represent causal effects. Let $do(X = x)$ (or $do(x)$ for short) denote an intervention, which is an operation fixing the realization of the random variable $X$ to $x$ regardless of the joint distribution $p$. In our predictive optimization setting, an optimizer intervenes on the decision variables $X$. Let $v_S$ denote the projection of a vector $v$ onto the coordinates in $S$. Given an intervention $do(X = x)$, the post-interventional distribution can also be factorized to accord with the network [24]:

$$p(\mathcal{V} = v \mid do(X = x)) = \begin{cases} \displaystyle\prod_{V \in \mathcal{V} \setminus X} p\bigl(V = v_V \mid \mathrm{Pa}(V)\bigr) & \text{if } v_X = x, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

In general, the post-interventional conditional distribution $p(Y \mid do(X), Z)$ might not be equivalent to the corresponding conditional distribution $p(Y \mid X, Z)$, and such a gap can cause confounding bias in predictive optimization.

Given a causal network, we can compute a post-interventional distribution on the basis of the factorization formula (3). Specifically, this can characterize the conditions under which confounding bias disappears; such conditions are called adjustment criteria [24]. We here introduce one of the most basic criteria, referred to as the back-door criterion. We call an ordered tuple of vertices $(V_1, \ldots, V_n)$ a path if either $V_i \to V_{i+1}$ or $V_i \leftarrow V_{i+1}$ holds for every $i$. It is specifically called a directed path if $V_i \to V_{i+1}$ holds for every $i$. For a triplet of consecutive nodes $(V_{i-1}, V_i, V_{i+1})$ in a path, $V_i$ is called a collider if $V_{i-1} \to V_i \leftarrow V_{i+1}$. A node which is not a collider is called a noncollider. If there exists a directed path from $V$ to $W$, then $V$ is called an ancestor of $W$, and $W$ is called a descendant of $V$. The following d-separation is a standard notion in the study of causal inference.

###### Definition 1 (See [24]).

A path is d-separated by a set of vertices $S$ if one of the following holds: (i) there exists a noncollider on the path that is in $S$, or (ii) there exists a collider on the path that is neither in $S$ nor an ancestor of a node in $S$.
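Definition 1 can be checked mechanically on small graphs. The following sketch (our own illustrative code, with edges given as directed pairs) tests whether a single path is d-separated by a conditioning set:

```python
def ancestors(node, edges):
    """All nodes with a directed path into `node` (edges are (u, v) pairs u -> v)."""
    anc, frontier = set(), {node}
    while frontier:
        parents = {u for (u, v) in edges if v in frontier} - anc
        anc |= parents
        frontier = parents
    return anc

def is_collider(a, b, c, edges):
    return (a, b) in edges and (c, b) in edges          # pattern a -> b <- c

def d_separated(path, S, edges):
    """Definition 1: the path is d-separated by S if some noncollider on it
    lies in S, or some collider on it is neither in S nor an ancestor of S."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if is_collider(a, b, c, edges):
            if b not in S and all(b not in ancestors(s, edges) for s in S):
                return True                             # blocked at the collider
        elif b in S:
            return True                                 # blocked at a noncollider
    return False

# Umbrella example: rain -> price, rain -> demand, price -> demand.
E = {("rain", "price"), ("rain", "demand"), ("price", "demand")}
print(d_separated(("price", "rain", "demand"), {"rain"}, E))  # → True
```

The back-door path price ← rain → demand is blocked by conditioning on rain; conditioning on a collider would instead open a path, e.g. `d_separated(("rain", "demand", "price"), {"demand"}, E)` returns `False`.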

The back-door criterion is then introduced using d-separation.

###### Definition 2 (The back-door criterion, see [24]).

A set of variables $S$ satisfies the back-door criterion relative to an ordered pair $(X, Y)$ if no nodes in $S$ are descendants of $X$, and every path between $X$ and $Y$ which contains a directed edge into $X$ is d-separated by $S$.

The back-door criterion characterizes the condition under which a conditional distribution and a post-interventional distribution coincide, and, thus, the confounding bias disappears.

###### Theorem 3 (The back-door adjustment, see [24]).

If $S$ satisfies the back-door criterion relative to $(X, Y)$, then we have

$$p(Y \mid do(X), S) = p(Y \mid X, S). \tag{4}$$

A set of nodes $S$ satisfying (4) is called an adjustment set (relative to $(X, Y)$). For a more general discussion of adjustment criteria, see [21, 27].

### 3.3 Direct cause discovery

An adjustment set can be computed given the structure of a causal graph, but in practice this structure is unknown. Causal discovery algorithms can then help us estimate it from observational data. In general, causal discovery algorithms estimate the entire structure of a causal graph, but for our purposes we can focus on direct cause discovery, since direct causes are desirable both from the viewpoint of confounding bias and from that of prediction performance, as explained below. In a predictive optimization setting, the target variable $Y$ is revealed after the realization of $X$ and $Z$, and this temporal context implies the following restriction on the network structure.

###### Assumption 4.

No nodes in $X$ and $Z$ are descendants of $Y$ in $G$.

This assumption, together with the back-door adjustment, implies that any set that includes the direct causes of $Y$ is an adjustment set.

###### Fact 5.

If Assumption 4 holds and $S \subseteq Z$ includes all the direct causes of $Y$ within $Z$, then $S$ satisfies the back-door criterion relative to $(X, Y)$. In particular, the set of direct causes of $Y$ within $Z$ is an adjustment set.

Such direct causes are known to be desirable also in terms of feature selection for achieving good prediction performance [1, 17], and among the various adjustment sets, the set of direct causes is one of the most promising candidates for predictive optimization.

The majority of existing causal discovery algorithms are based on the following faithfulness assumption.

###### Definition 6.

A causal network $(G, p)$ is faithful if every conditional independence in $p$ can be read from a d-separation in $G$; in other words, for any $A, B \subseteq \mathcal{V}$ and conditioning set $S$, $A \perp\!\!\!\perp B \mid S$ holds if and only if every path between $A$ and $B$ is d-separated by $S$.

Given faithfulness, the direct causes are characterized by the following conditional independence.

###### Proposition 7 ([25]; see also [26, Theorem 3]).

If Assumption 4 and faithfulness hold, then a set $S$ includes the direct causes of $Y$ if and only if

$$Y \perp\!\!\!\perp \mathcal{V} \setminus S \mid S. \tag{5}$$

Given faithfulness, thus, the direct causes are characterized as a minimal set satisfying the conditional independence (5) in our setting. In general, such a minimal set satisfying (5) is called a Markov blanket, and we can utilize existing Markov blanket discovery algorithms: examples include [32, 26, 22, 31].

One drawback of the approach on the basis of direct cause discovery is that it is essentially dependent on the faithfulness assumption.

###### Remark 8.

We here show an example consisting of three variables for which, without faithfulness, conditional independence (5) cannot guarantee the discovery of direct causes. Let us consider a causal network with three variables $Z$, $X$, $Y$ and three edges $Z \to X$, $Z \to Y$, $X \to Y$, as seen in Figure 1. Suppose that it identically holds that $X = Z$. It then holds that $Y \perp\!\!\!\perp Z \mid X$. However, the singleton set $\{X\}$ does not include the direct cause $Z$ of $Y$. Also, in larger networks in practice, a similar setting might occur when a decision variable can be completely explained by external features.

The above example also illustrates that, even if the underlying distribution is faithful, if it is almost unfaithful then a direct cause discovery algorithm might in practice incur difficulty in correctly determining conditional independence. Such a practical requisite condition is successfully characterized by the notion of strong faithfulness [15, 33, 35] in a normal distribution setting.

### 3.4 Relationship between Markov blanket discovery and feature selection

In predictive optimization, Assumption 4 reduces the problem of finding a set of direct causes to that of finding a Markov blanket. The relationship between Markov blanket discovery and feature selection has been studied [1, 17]. We here briefly review this relationship, and demonstrate that some feature selection algorithms can be regarded as approximate Markov blanket discovery algorithms.

In the context of causal discovery, Markov blanket discovery algorithms try to compute a set of variables that achieves the conditional independence (5). In practice, such an algorithm can find a correct Markov blanket only when a sufficient amount of data is available, so that a series of conditional independence tests can be computed correctly. An example of such an algorithm is IAMB [32], for which it is shown that a simple greedy forward and backward algorithm can compute a Markov blanket given a sufficient amount of data.

In the context of feature selection, a sparse feature selection algorithm [30, 20] can find a minimal set that is linearly dependent on the target variable, given a sufficient number of samples. Mutual-information-based feature selection [6] greedily finds a minimal set of variables beyond which the remaining features carry no additional mutual information about the target, and one variant adopts a forward and backward search. These algorithms can be regarded as approximate Markov blanket discovery algorithms, in which statistical dependence is approximated by, respectively, linear dependence and positive mutual information.

With these observations in mind, we denote an algorithm (possibly approximate) for finding a Markov blanket of its argument by $\mathrm{MB}$; such algorithms include the above sparse feature selection and mutual-information-based feature selection. Note that all these algorithms also fail in direct cause discovery without faithfulness, as is shown in Remark 8.

## 4 Causally admissible feature selection

This section presents our first contribution: we prove a novel adjustment criterion and present a meta-algorithm which utilizes an existing feature selection algorithm so as to avoid confounding bias even under unfaithfulness. Our approach relies on the following assumption, rather than on faithfulness.

###### Assumption 9.

(i) No nodes in $Z$ are descendants of $X$, and (ii) $Z$ is an adjustment set relative to $(X, Y)$.

The first half (i) of this assumption is implied by the temporal context of the predictive optimization pipeline, similarly to Assumption 4. The second half (ii) is implied by Fact 5, which holds under causal sufficiency, as was assumed in the previous section. We state this property as an assumption so as to maintain the validity of our discussion even without the causal sufficiency assumption.

###### Theorem 10.

If Assumption 9 holds and $S \subseteq Z$ satisfies $(X \cup Y) \perp\!\!\!\perp Z \setminus S \mid S$, then $S$ is an adjustment set relative to $(X, Y)$.

This criterion leads us to propose Algorithm 1. Recall that a Markov blanket discovery algorithm $\mathrm{MB}$ maps a set of target nodes to a set which (possibly approximately) satisfies the corresponding conditional independence. Given such an algorithm, the proposed algorithm computes $\mathrm{MB}(X \cup Y)$, rather than the $\mathrm{MB}(Y)$ of the previous direct cause discovery approach. Observe that, in contrast to Proposition 7 given for the previous approach, Theorem 10 does not require the faithfulness assumption for characterizing an adjustment set by conditional independence. This enables our algorithm to compute an adjustment set even under unfaithfulness. Note that, according to Theorem 10, $\mathrm{MB}(X \cup Y)$ is a smaller adjustment set than $Z$, and we compute it so as to simultaneously improve prediction performance, where this replacement is justified by the following implication [25] (see also [26, Theorem 1]):

$$(X \cup Y) \perp\!\!\!\perp (Z \setminus Z^\kappa) \mid Z^\kappa \;\Longrightarrow\; X \perp\!\!\!\perp (Z \setminus Z^\kappa) \mid Z^\kappa.$$
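Operationally, the proposed modification is a thin wrapper around any Markov blanket selector: call it on the stacked targets $X \cup Y$ rather than on $Y$ alone. The sketch below (the least-squares selector and the toy network are our own illustrations, not the paper's implementation) shows how a feature that only drives the historical decisions is retained by $\mathrm{MB}(X \cup Y)$ but discarded by $\mathrm{MB}(Y)$:

```python
import numpy as np

def select_features(D_T, D_cand, D_keep=None, thresh=0.1):
    """Least-squares stand-in for a Markov blanket selector MB(·): regress the
    (standardized) targets on all columns and keep the candidate features whose
    coefficient is non-negligible for some target."""
    A = D_cand if D_keep is None else np.column_stack([D_keep, D_cand])
    As = (A - A.mean(0)) / A.std(0)
    Ts = (D_T - D_T.mean(0)) / D_T.std(0)
    W, *_ = np.linalg.lstsq(As, Ts, rcond=None)
    cand_coef = W[-D_cand.shape[1]:]            # rows for the candidate columns
    return sorted(j for j in range(D_cand.shape[1])
                  if np.abs(cand_coef[j]).max() > thresh)

def causally_admissible_select(D_X, D_Y, D_Z):
    """Proposed meta-algorithm: run the same selector on X ∪ Y instead of Y."""
    return select_features(np.column_stack([D_X, D_Y]), D_Z)

# Toy network: Z0 -> X (historical pricing policy), X -> Y, Z1 -> Y.
rng = np.random.default_rng(2)
n = 2000
Z = rng.normal(size=(n, 4))
X = Z[:, 0] + 0.3 * rng.normal(size=n)
Y = X + Z[:, 1] + 0.1 * rng.normal(size=n)

plain = select_features(Y[:, None], Z, D_keep=X[:, None])     # MB(Y): drops Z0
ours = causally_admissible_select(X[:, None], Y[:, None], Z)  # MB(X ∪ Y): keeps Z0
print(plain, ours)
```

Here $Z_0$ is redundant for predicting $Y$ given $X$, so the plain selector discards it; because it explains the historical decision $X$, the wrapper keeps it as one of the additional features discussed in Section 5.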

We conclude this section with a discussion on Assumption 9 when the causal sufficiency does not hold.

###### Remark 11.

If causal sufficiency holds, then Assumption 9 (ii) is implied by Fact 5. Suppose that causal sufficiency does not hold, and that the observable $Z$ is a subset of an entire external feature set which is causally sufficient. According to our proof of Theorem 10, with an additional conditional independence assumption, our adjustment criterion is still valid. Intuitively speaking, this additional condition requires that the features which influenced the human decision-maker producing the historical data are successfully collected in $Z$. Such features have at least been noticed by the decision-maker, and the condition is thus more plausible than the causal sufficiency assumption.

## 5 Robust optimization using adjustment sets

The previous section proposed utilizing the features $\mathrm{MB}(X \cup Y)$ instead of $\mathrm{MB}(Y)$ for avoiding confounding bias, but the additional features are discarded by $\mathrm{MB}(Y)$ since they are redundant and useless from the viewpoint of prediction. This section then presents our second contribution: revealing the role of these redundant features in a predictive optimization pipeline.

### 5.1 Generalized bias-variance decomposition

This section slightly generalizes the well-known bias-variance decomposition for an explicit representation of confounding bias. Let us define $\bar{Y}_{do(x),z} := \mathbb{E}[Y \mid do(X=x), Z=z]$ and $\bar{Y}_{x,z^\kappa} := \mathbb{E}[Y \mid X=x, Z^\kappa=z^\kappa]$. We also define the optimal predictor by $f^*_\kappa := \mathbb{E}_D[\hat{f}_\kappa]$. For each $(x, z)$, then, a well-known bias-variance decomposition shows

$$\begin{aligned} \mathbb{E}_{Y,D}\bigl[\|Y - \hat{f}_\kappa(X, Z^\kappa)\|^2 \mid do(X = x), Z\bigr] ={}& \underbrace{\mathbb{E}_{Y}\bigl[\|Y - \bar{Y}_{do(x),z}\|^2 \mid do(X), Z\bigr]}_{\text{noise}} + \underbrace{\|\bar{Y}_{do(x),z} - f^*_\kappa(x, z^\kappa)\|^2}_{\text{bias}} \\ &+ \underbrace{\mathbb{E}_{D}\bigl[\|f^*_\kappa(x, z^\kappa) - \hat{f}_\kappa(x, z^\kappa)\|^2\bigr]}_{\text{variance}}. \end{aligned}$$

Here $\mathbb{E}_D$ is the expectation with respect to the historical data $D$. Let us consider a further decomposition of the bias term into confounding bias and prediction bias, described as

$$\|\bar{Y}_{do(x),z} - f^*_\kappa(x, z^\kappa)\|^2 = \Bigl\| \underbrace{(\bar{Y}_{do(x),z} - \bar{Y}_{x,z^\kappa})}_{\text{confounding bias}} + \underbrace{(\bar{Y}_{x,z^\kappa} - f^*_\kappa(x, z^\kappa))}_{\text{prediction bias}} \Bigr\|^2.$$

### 5.2 Transforming causality bias into statistical variance using redundant features

We define the sum of the confounding bias and the variance, given $x$ and $z$, as:

$$C_{x,z}(\kappa) := \|\bar{Y}_{do(x),z} - \bar{Y}_{x,z^\kappa}\|^2 + \mathbb{E}_{D}\bigl[\|f^*_\kappa(x, z^\kappa) - \hat{f}_\kappa(x, z^\kappa)\|^2\bigr].$$

The following statement reveals that this sum is constant under certain assumptions.

###### Proposition 12.

For $\kappa$ and $\kappa'$, assume that (i) for every $x$ and $z$, $f^*_\kappa(x, z^\kappa) = f^*_{\kappa'}(x, z^{\kappa'})$, and (ii) both $\hat{f}_\kappa$ and $\hat{f}_{\kappa'}$ are unbiased estimators (having no prediction bias). It holds, then, that $C_{x,z}(\kappa) = C_{x,z}(\kappa')$.

Although the above assumptions cannot be precisely satisfied in reality, the statement offers important qualitative observations. Assume that $\kappa = \mathrm{MB}(Y)$, which follows existing approaches, and $\kappa' = \mathrm{MB}(X \cup Y)$, which follows our proposed approach. Assumption (i) requires that both $Z^\kappa$ and $Z^{\kappa'}$ are sufficient for prediction, so that the predictors $f^*_\kappa$ and $f^*_{\kappa'}$ are the same, and assumption (ii) requires unbiasedness of the predictors, which is a common assumption in prediction. Under the assumptions of sufficiency and unbiasedness, the statement confirms that the redundant features are in fact useless in terms of prediction error, which is the sum of bias and variance. In terms of optimization, however, variance is far preferable to bias: variance is measurable in practice by means of such methods as bootstrap sampling [10], and risky strategies with high variance can be avoided with the aid of a robust optimization technique. The useless features can thus exchange causality bias for variance, and are in fact useful in terms of optimization.
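The claim that variance is measurable in practice can be illustrated with bootstrap sampling [10]: refit the predictor on resampled data and measure the spread of its predictions at a candidate strategy. The linear model and function names below are our own illustrative choices.

```python
import numpy as np

def bootstrap_prediction_variance(D_x, D_y, x_query, n_boot=200, rng=None):
    """Estimate Var_D[f̂(x_query)] for a least-squares predictor by
    resampling the historical data with replacement (the bootstrap)."""
    rng = rng or np.random.default_rng(0)
    n = len(D_y)
    A = np.column_stack([D_x, np.ones(n)])
    q = np.append(x_query, 1.0)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # bootstrap resample
        coef, *_ = np.linalg.lstsq(A[idx], D_y[idx], rcond=None)
        preds.append(q @ coef)
    return float(np.var(preds))

rng = np.random.default_rng(3)
x = rng.normal(size=(200, 1))
y = 1.5 * x[:, 0] + rng.normal(size=200)
v_near = bootstrap_prediction_variance(x, y, np.array([0.0]), rng=rng)
v_far = bootstrap_prediction_variance(x, y, np.array([5.0]), rng=rng)
print(v_near < v_far)  # extrapolated strategies carry higher variance
```

Strategies far from the historical data have high prediction variance, which is exactly the signal a robust optimizer penalizes.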

### 5.3 Avoiding high-variance strategies by means of robust optimization

Our meta-algorithm transforms confounding bias into variance, and, thus, is effective only when combined with a robust optimization technology that can regularize high-variance strategies. This section briefly introduces existing robust optimization technologies applicable to predictive optimization. One of the most standard formulations of robust optimization is given by defining $g$ in (2) as the variance of the objective function:

$$g_{\mathrm{Var}}(x) := \mathrm{Var}_{D}\bigl[r\bigl(x, \hat{f}(x, \tilde{z})\bigr)\bigr].$$

For linear programming [3] and a certain case of quadratic programming [34] on the basis of linear regression, explicit forms of $g_{\mathrm{Var}}$ and efficient optimization algorithms are available. For a survey of robust optimization technologies, see [2, 4].
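A minimal instance of formulation (2) with the variance regularizer is a mean-variance grid search over candidate prices, with the variance taken across bootstrap demand models. The demand models and all numbers below are hypothetical placeholders, not the explicit forms of [34].

```python
import numpy as np

def robust_price(prices, demand_models, lam):
    """Pick the price maximizing bootstrap-mean revenue minus a variance
    penalty: a grid-search instance of the robust formulation (2)."""
    best, best_val = None, -np.inf
    for x in prices:
        revenues = [x * m(x) for m in demand_models]   # r(x, y) = x * y per model
        val = np.mean(revenues) - lam * np.var(revenues)
        if val > best_val:
            best, best_val = x, val
    return best

# Hypothetical bootstrap demand models: they agree at moderate prices but
# disagree (high variance) at high, rarely observed prices.
models = [lambda x, s=s: 10 - s * x for s in (0.8, 1.0, 1.2)]
prices = np.linspace(1, 9, 81)

print(robust_price(prices, models, lam=0.0), robust_price(prices, models, lam=1.0))
```

With $\lambda = 0$ the optimizer picks the aggressive price maximizing mean revenue; with $\lambda > 0$ it retreats to a lower price where the models agree, avoiding the high-variance strategy.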

## 6 Experiments

This section shows the performance of our causally admissible feature selection framework through experiments on a predictive price optimization problem [34] using synthetic data.

### 6.1 Problem setting of price optimization

Let $M$ be the number of products, and let $x$ denote the price vector (the decision variables), $y$ denote the demand vector (the target variables), and $z$ denote the external features (temperature, weather, weekday or not, etc.). The goal is to maximize the revenue function, which is the inner product of price and demand: $r(x, y) = x^\top y$. We are given a set of historical point-of-sale data of size $d$ that consists of i.i.d. realizations of $(X, Y, Z)$, generated according to an unknown causal network.

#### Synthetic generation of causal network

We assume the temporal context of predictive optimization given in Assumption 4 and Assumption 9 (i). We also assume here that none of $Z$, $X$, and $Y$ has internal edges. For each $k$, $m$, and $n$, the candidate edges $(Z_k, X_m)$, $(Z_k, Y_n)$, and $(X_m, Y_n)$ are then generated independently at random.

#### Generation of a linear SEM

Given a causal network generated as above, we define the joint distribution by linear structural equation modeling (linear SEM) [24], which is one of the most standard models of Bayesian networks. Let $\mathrm{Ber}(p)$ denote the Bernoulli distribution with mean $p$, and $N(\mu, \sigma^2)$ denote the normal distribution with mean $\mu$ and variance $\sigma^2$. For each $k$, $m$, and $n$, we define our linear SEM as:

$$Z_k \sim \mathrm{Ber}(p_k), \qquad X_m = 1 - 0.1 \sum_{k : (Z_k, X_m) \in E} Z_k - 0.1\,\varepsilon_m,$$
$$Y_n = \sum_{m=1}^{M} a_{n,m} x_m + b_n + \sum_{k : (Z_k, Y_n) \in E} c_{n,k} Z_k + \delta_n,$$

where $\varepsilon_m$ and $\delta_n$ are independent noise variables. The parameterization of $p_k$, $a_{n,m}$, $b_n$, and $c_{n,k}$, and its interpretation, are presented in the supplementary material.
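The generation process above can be sketched as follows. Edge probabilities, coefficient ranges, and noise distributions are placeholders of our own choosing, since the paper's parameterization is given in its supplementary material.

```python
import numpy as np

def generate_sem_data(d, K=10, M=5, N=5, p_edge=0.3, rng=None):
    """Sample d i.i.d. rows from a linear SEM on a graph with edges only
    from Z to X, Z to Y, and X to Y (no edges inside Z, X, or Y).
    All numeric parameters here are illustrative placeholders."""
    rng = rng or np.random.default_rng(0)
    E_zx = (rng.random((K, M)) < p_edge).astype(float)   # adjacency Z_k -> X_m
    E_zy = (rng.random((K, N)) < p_edge).astype(float)   # adjacency Z_k -> Y_n
    a = rng.uniform(-1.0, -0.1, size=(N, M))             # price effects a_{n,m}
    b = rng.uniform(5.0, 10.0, size=N)                   # base demands b_n
    c = rng.uniform(0.5, 1.5, size=(N, K)) * E_zy.T      # external effects c_{n,k}
    p = rng.uniform(0.2, 0.8, size=K)                    # Bernoulli means p_k

    Z = rng.binomial(1, p, size=(d, K)).astype(float)    # Z_k ~ Ber(p_k)
    eps = rng.binomial(1, 0.5, size=(d, M))              # Bernoulli pricing noise
    X = 1.0 - 0.1 * (Z @ E_zx) - 0.1 * eps               # historical pricing policy
    delta = rng.normal(size=(d, N))
    Y = X @ a.T + b + Z @ c.T + delta                    # demand equation
    return X, Y, Z
```

A call such as `generate_sem_data(500)` yields a historical dataset on which the feature selection and optimization phases below can be exercised.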

### 6.2 Algorithms

We here specify the feature selection, prediction, and robust optimization phases of the general problem setting in Section 2.

For the feature selection phase, we here adopt the sparse feature selection algorithm of [20] and its implementation as presented by the authors of [19]. Given target variables $U$ and candidate variables $V$, let $D_U$ and $D_V$ be the respective historical data matrices for $U$ and $V$ extracted from $D$, and let $W_V$ be a coefficient matrix indexed by $U$ and $V$. The output of feature selection is then defined as the list of nonzero columns in the solution of the following sparse regression:

$$\min_{W_V} \|D_U - W_V D_V\|_F + \mu \|W_V\|_{1,2},$$

where $\mu > 0$ is the scale of the regularizer. The selected external features $Z^\kappa$ are then defined accordingly. We compute the output for the cases $U = Y$ and $U = X \cup Y$ in our experiments.

For the prediction phase, we adopt the least-squares estimator. Given the selected feature indices $\kappa$, we estimate a linear prediction model by the least-squares method.

For the optimization phase, we apply the robust optimization technique of [34] for defining $g$ in (2). They defined $g$ as the variance of the objective function, and proved an explicit form involving the covariance matrix of the estimated prediction formula. Note that, while the original formulation does not deal with external features, this extension can be directly obtained by first regarding the external features also as decision variables and then fixing them to $\tilde{z}$.

### 6.3 Experimental results

For each setting, we conducted 50 randomized experiments and took the average over them. The size of the problem was fixed throughout.

#### Comparison of prediction accuracy

We first compared the original feature selection (denoted by $\mathrm{MB}(Y)$) and our causally admissible feature selection (denoted by $\mathrm{MB}(X \cup Y)$) in terms of prediction accuracy in Figure 3. We observed that the prediction accuracy of $\mathrm{MB}(Y)$ is better than that of $\mathrm{MB}(X \cup Y)$ with every choice of the regularization parameter $\mu$, and that the gap is huge when the number of available samples is small. This indicates that the redundant features of $\mathrm{MB}(X \cup Y)$ are in fact useless in terms of prediction accuracy, and that our modification would not improve, or might even degrade, prediction accuracy.

#### Efficiency of robust optimum strategy

We fixed $\mu$ on the basis of the previous experiment, and we then compared $\mathrm{MB}(Y)$ and $\mathrm{MB}(X \cup Y)$ in terms of optimization. After computing a prediction formula, we generated $\tilde{z}$ 10 times and conducted robust optimization for computing an optimized strategy with several scales $\lambda$ of the robust regularizer in (2). Here, $\lambda = 0$ corresponds to non-robust optimization, and the largest $\lambda$ computes the most conservative pricing strategy. We computed the true objective value of the optimized strategy, and Figure 3 shows the average of the performance normalized by the true optimum value, for two parameter settings (left and right). We observed that:


• In the first setting, $\mathrm{MB}(X \cup Y)$ without the robust formulation ($\lambda = 0$) is only as good as $\mathrm{MB}(Y)$ with its best parameterization. With the robust formulation ($\lambda > 0$), however, the performance of $\mathrm{MB}(X \cup Y)$ drastically improves, while the performance of $\mathrm{MB}(Y)$ rarely improves. This demonstrates that the redundant features enable a robust optimizer to distinguish stable strategies from risky ones with high variance.

• For , although outperforms , the performance gap is not as large as with . In particular, with , the performance of steadily improves as the number of samples increases, in contrast to the little improvement in . Recall that controls the Bernoulli independent random noise on historical pricing strategies. With , the causal network is unfaithful; with , which makes the noise have the largest entropy, the network is the most faithful. In this setting, feature selection relying on the faithfulness assumption is less affected by confounding bias, and thus the performance gap between and is not large.

• For , the conservative parameterization outperforms , while mild robustness is basically the best for . Estimation of the regression parameter is much more difficult in because of the small independent noise, and in such scenarios it has been observed in [34] that a conservative parameterization is preferable. In fact, even with , outperforms with the smallest number of samples ().

Thus our modified feature selection algorithm, together with robust optimization technology, achieves efficient predictive optimization.
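A one-dimensional toy analogue of this sweep over the robustness parameter (entirely our own construction; the actual experiments optimize a multi-product pricing objective) shows how a larger regularizer shrinks the strategy toward a conservative one when the estimated coefficient is noisy:

```python
import numpy as np

A_TRUE, SIGMA = 2.0, 1.0  # true slope of the objective and estimation noise level

def true_value(x):
    """True objective a*x - x^2; the optimum is x* = 1 with value 1."""
    return A_TRUE * x - x ** 2

def robust_strategy(a_hat, lam):
    """Maximize a_hat*x - x^2 - lam*SIGMA*x over x >= 0 (risk grows with x)."""
    return max(0.0, (a_hat - lam * SIGMA) / 2.0)

rng = np.random.default_rng(2)
results = {}
for lam in (0.0, 0.5, 1.0):
    # average normalized performance over noisy re-estimations of the slope,
    # normalized by the true optimum value true_value(1.0) = 1
    vals = [true_value(robust_strategy(A_TRUE + SIGMA * rng.normal(), lam))
            for _ in range(1000)]
    results[lam] = np.mean(vals) / true_value(1.0)
```

Here lam = 0 is the non-robust strategy and larger lam values price more conservatively; the realized normalized performance can never exceed 1.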

## 7 Summary and Future Work

This paper has proposed a meta-algorithm for causally admissible feature selection that can avoid confounding bias in predictive optimization. Our algorithm is based on a contextual restriction of the causal network structure and a novel adjustment criterion that remains effective even under a causally unfaithful condition. Features that are useless in terms of prediction turn out to be useful in optimization, transforming intractable confounding bias into more tractable prediction variance. A variance-based regularization technique from robust optimization can then provide safe and effective strategies, as shown in our experiments. Future research directions include reducing confounding bias in other optimization scenarios, again without relying on the faithfulness assumption.

## References

• [1] Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11(Jan):171–234, 2010.
• [2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust optimization. Princeton University Press, 2009.
• [3] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations research letters, 25(1):1–13, 1999.
• [4] D. Bertsimas, D. B. Brown, and C. Caramanis. Theory and applications of robust optimization. SIAM review, 53(3):464–501, 2011.
• [5] Daniel Bienstock and Nuri Özbay. Computing robust basestock levels. Discrete Optimization, 5(2):389–414, 2008.
• [6] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13(Jan):27–66, 2012.
• [7] Gavin C Cawley. Causal & non-causal feature selection for ridge regression. In Causation and Prediction Challenge, pages 107–128, 2008.
• [8] David Maxwell Chickering. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2(Feb):445–498, 2002.
• [9] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
• [10] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
• [11] Tian Gao and Qiang Ji. Local causal discovery of direct causes and effects. In Advances in Neural Information Processing Systems, pages 2512–2520, 2015.
• [12] Isabelle Guyon, Constantin Aliferis, and André Elisseeff. Causal feature selection. Computational methods of feature selection, pages 63–82, 2007.
• [13] S. Ito and R. Fujimaki. Large–scale price optimization via network flow. In Advances in Neural Information Processing Systems, 2016.
• [14] Shinji Ito, Akihiro Yabe, and Ryohei Fujimaki. Unbiased objective estimation in predictive optimization. In International Conference on Machine Learning, pages 2181–2190, 2018.
• [15] Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.
• [16] Markus Kalisch and Peter Bühlmann. Causal structure learning and inference: a selective review. Quality Technology & Quantitative Management, 11(1):3–21, 2014.
• [17] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, pages 284–292. Morgan Kaufmann Publishers Inc., 1996.
• [18] Roy H Kwon and Daniel Frances. Optimization-based bidding in day-ahead electricity auction markets: A review of models for power producers. In Handbook of Networks in Power Systems I, pages 41–59. Springer, 2012.
• [19] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2017.
• [20] Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 339–348. AUAI Press, 2009.
• [21] Marloes H Maathuis, Diego Colombo, et al. A generalized back-door criterion. The Annals of Statistics, 43(3):1060–1088, 2015.
• [22] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in neural information processing systems, pages 505–511, 2000.
• [23] Judea Pearl. [bayesian analysis in expert systems]: comment: graphical models, causality and intervention. Statistical Science, 8(3):266–269, 1993.
• [24] Judea Pearl. Causality. Cambridge university press, 2009.
• [25] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.
• [26] Jose M Pena, Roland Nilsson, Johan Björkegren, and Jesper Tegnér. Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45(2):211–232, 2007.
• [27] Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H Maathuis. A complete generalized adjustment criterion. In Uncertainty in Artificial Intelligence, pages 682–691. AUAI Press, 2015.
• [28] Huitong Qiu, Fang Han, Han Liu, and Brian Caffo. Robust portfolio optimization. In Advances in Neural Information Processing Systems, pages 46–54, 2015.
• [29] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
• [30] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
• [31] Ioannis Tsamardinos, Constantin F Aliferis, and Alexander Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678. ACM, 2003.
• [32] Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for large scale Markov blanket discovery. In FLAIRS Conference, volume 2, pages 376–380, 2003.
• [33] Caroline Uhler, Garvesh Raskutti, Peter Bühlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, pages 436–463, 2013.
• [34] Akihiro Yabe, Shinji Ito, and Ryohei Fujimaki. Robust quadratic programming for price optimization. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.
• [35] Jiji Zhang and Peter Spirtes. Strong faithfulness and uniform consistency in causal inference. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 632–639. Morgan Kaufmann Publishers Inc., 2002.

## Appendix A Proofs

###### Proof of Fact 5.

Since every path from to that contains a directed edge into must include a node , and such a node is a noncollider, satisfies the back-door criterion relative to . Thus is an adjustment set. ∎

###### Proof of Proposition 7.

Since have no children in the network by Assumption 4, the statement directly follows from Theorem 3 of [26]. ∎

###### Proof of Theorem 10.

Let . We have

 p(Y∣do(X),Zκ)=∑Up(Y∣do(X),Zκ,U)p(U∣do(X),Zκ).

Since is an adjustment set relative to , we have

 p(Y∣do(X),Zκ,U)=P(Y∣X,Zκ,U).

Since no node in is a descendant of and since , we have

 p(U∣do(X),Zκ)=P(U∣Zκ).

Further, by the conditional independence, we have P(U∣Zκ)=P(U∣X,Zκ). Thus, we have

 P(Y∣do(X),Zκ) =∑UP(Y∣X,Zκ,U)P(U∣X,Zκ) =P(Y∣X,Zκ). ∎
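The adjustment identity proved above can be checked numerically on a toy linear SEM with a single confounder (the coefficients below are our own illustrative choices): regressing Y on X alone is confounded, while adjusting for the confounder recovers the causal coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
# Linear SEM: U -> X, U -> Y, and X -> Y with true causal effect 1.0.
U = rng.normal(size=n)
X = U + rng.normal(size=n)
Y = 1.0 * X + 2.0 * U + rng.normal(size=n)

# Unadjusted slope of Y on X: Cov(X,Y)/Var(X) = 4/2 = 2, biased away from 1.0.
naive = np.cov(X, Y)[0, 1] / np.var(X)

# Adjusting for the confounder: regress Y on [X, U, 1]; the X-coefficient
# then estimates the interventional effect p(Y | do(X)).
A = np.column_stack([X, U, np.ones(n)])
adjusted = np.linalg.lstsq(A, Y, rcond=None)[0][0]
```

The gap between `naive` and `adjusted` is exactly the confounding bias that an admissible adjustment set removes.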

###### Proof of Proposition 12.

By the assumption (i), it holds that

 EY,D[Y−^fκ1(X,Zκ1)∣do(X),Z]=EY,D[Y−^fκ2(X,Zκ2)∣do(X),Z].

Since, by assumption (ii), there is no prediction bias, it holds that

 EY,D[∥Y−^fκi(X,Zκi)∥2∣do(X),Z]=EY[∥Ydo(x),z−¯Ydo(x),z∥2]+Cx,z(κi)

for i=1,2. These equalities imply the statement. ∎

## Appendix B Parameterization of SEM in experiments

Let denote the uniform distribution over , Ber(p) denote the Bernoulli distribution with mean p, and denote the normal distribution with mean and variance . For each , , and , we generate parameters by , and then define a linear SEM as:

 Zk ∼Ber(pk), Xm =1−0.1∑k:(Zk,Xm)∈EZk−0.1εm, Yn =∑Mm=1an,mxm+bn+∑k:(Zk,Yn)∈Ecn,k+δn,

where and . Intuitively speaking, the list price of is , and a storekeeper decides a discounting strategy according to the realization of the relevant features satisfying . The parameter controls the amount of independent noise on each product, and thus controls the degree of faithfulness in this model, as discussed in our experimental results. The coefficient matrix and constant vector are generated in the same way as in the experiments of [34], and each element of is generated from the same distribution as that of the nondiagonal elements of .
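A minimal data-generating sketch of this SEM (the graph, sizes, and parameter distributions below are our own stand-ins; the actual parameters follow [34] as stated above):

```python
import numpy as np

rng = np.random.default_rng(4)
M, K, N = 5, 4, 5       # numbers of prices X_m, features Z_k, demands Y_n (illustrative)
beta = 0.5              # Bernoulli noise level on historical pricing strategies

# Random bipartite edge indicators for (Z_k, X_m) and (Z_k, Y_n) in E.
E_zx = (rng.random((K, M)) < 0.5).astype(float)
E_zy = (rng.random((K, N)) < 0.5).astype(float)
p = rng.uniform(0.1, 0.9, size=K)   # Bernoulli means for the Z_k
a = rng.normal(size=(N, M))         # coefficient matrix
b = rng.normal(size=N)              # constant vector
c = rng.normal(size=(K, N))         # feature effects on demand

def sample():
    """Draw one (Z, X, Y) sample from the linear SEM."""
    Z = rng.binomial(1, p).astype(float)        # Z_k ~ Ber(p_k)
    eps = rng.binomial(1, beta, size=M)         # independent pricing noise
    X = 1.0 - 0.1 * (E_zx.T @ Z) - 0.1 * eps    # discounts from the list price 1
    Y = a @ X + b + (c * E_zy).T @ Z + 0.1 * rng.normal(size=N)
    return Z, X, Y
```

Each price is the list price 1 minus 0.1 per active relevant feature and per noise event, so all prices lie in [1 − 0.1(K + 1), 1].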