1 Introduction
The ultimate goal of meaningful data analysis is to understand how the data was generated, by reasoning in terms of cause and effect. Towards this goal, rule mining [agrawal:93:associationrules, wrobel:97:sgd, friedman:99:bumphunting, furnkranz:12:rulebook] has been studied extensively over the years. Most rule miners measure the effect of a rule in terms of correlation or dependence. Correlation, however, does not imply causation. As a result, rules that maximise such effect measures are in no way guaranteed to reflect the underlying data-generating process.
The gold standard for establishing a causal relationship between variables is a controlled experiment, such as a randomized controlled trial (RCT) [hernan:18:book]. In many cases, however, it is impossible, or at the very least impractical, to perform an RCT. We hence most often have to infer causal dependencies from observational data, i.e., data that was collected without full control. In this work, we study the discovery of causal rules that maximise causal effect from observational data. Though simple to state, this is a very hard task. Not only do we have to cope with an intricate combination of two semantic problems—one statistical and one structural—but in addition the task is also computationally difficult.
The structural problem is often referred to as Simpson’s paradox. Even strong and confidently measured effects of a rule might not actually reflect true domain mechanisms, but can be mere artefacts of the effect of other variables. Notably, such confounding effects can not only attenuate or amplify the marginal effect of a rule on the target variable; in the most misleading cases they can even result in sign reversal, i.e., when interpreted naively, the data might indicate a negative effect even though in reality there is a positive effect [pearl:09:book, Chap. 6]. For example, a drug might appear to be effective for the treatment of a disease in the overall population. However, if the treatment assignment was affected by sex, which also affects recovery (say males, who recover—regardless of the drug—more often than females, are also more likely to use the drug than females), we may find that the treatment is in fact not effective in either the male or the female subpopulation.
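The sign reversal above can be reproduced with a few lines of arithmetic. The following sketch uses entirely hypothetical counts (chosen only to make the reversal visible); it is not data from the paper.

```python
# Hypothetical counts illustrating Simpson's paradox (sign reversal).
# Within each sex the drug lowers the recovery rate, yet pooled over
# both sexes it appears to help.
# (recoveries, total) for each (sex, treated) combination
data = {
    ("male", True):    (192, 240),   # 80% recover
    ("male", False):   (54, 60),     # 90% recover
    ("female", True):  (12, 60),     # 20% recover
    ("female", False): (72, 240),    # 30% recover
}

def rate(recoveries, total):
    return recoveries / total

# Stratified effect: negative in both subpopulations
for sex in ("male", "female"):
    diff = rate(*data[(sex, True)]) - rate(*data[(sex, False)])
    print(sex, round(diff, 2))   # -0.1 in both strata

# Pooled effect: positive, because males (who recover more often
# anyway) are over-represented among the treated
treated = [data[(s, True)] for s in ("male", "female")]
control = [data[(s, False)] for s in ("male", "female")]
pooled = (sum(r for r, _ in treated) / sum(n for _, n in treated)
          - sum(r for r, _ in control) / sum(n for _, n in control))
print(round(pooled, 2))          # 0.26
```

The naive pooled estimate is misleading precisely because treatment assignment depends on a variable (sex) that also affects the outcome.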
The statistical problem is the well-known phenomenon of overfitting. This phenomenon results from the high variance of the naive empirical (or “plug-in”) estimator of causal effect for rules that cover, or exclude, too few instances. Combined with the maximization task over a usually very large rule language, this variance turns into a strong positive bias that dominates the search and causes essentially random results of either extremely specific or extremely general rules.
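A minimal simulation makes this bias tangible. In the sketch below (sample size, rule coverage, and number of candidate rules are arbitrary choices for illustration), every candidate rule is generated independently of the target, so every true effect is exactly zero; yet the maximum plug-in estimate over the candidates is far above zero.

```python
import random

random.seed(0)
n = 200
# Binary target with p(y) = 0.5, independent of all rules below
y = [random.random() < 0.5 for _ in range(n)]

def plugin_effect(cover, y):
    """Naive plug-in estimate: p̂(y | σ) − p̂(y | ¬σ)."""
    inside = [yi for ci, yi in zip(cover, y) if ci]
    outside = [yi for ci, yi in zip(cover, y) if not ci]
    if not inside or not outside:
        return 0.0
    return sum(inside) / len(inside) - sum(outside) / len(outside)

# 1000 random "rules", each covering ~10% of the data and independent
# of y: the population effect of every single one is 0, but maximising
# the plug-in estimate over all of them yields a clearly positive score.
best = max(
    plugin_effect([random.random() < 0.1 for _ in range(n)], y)
    for _ in range(1000)
)
print(best)   # well above 0, purely from variance plus maximisation
```

This is exactly the variance-turned-bias effect described above: the search returns whichever spurious rule happened to look best in the sample.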
The computational problem is that the rule space over which we maximise causal effect is exponential in size and exhibits no trivially exploitable structure. We therefore need an efficient optimization algorithm. In this paper, we present a theoretically sound approach to discovering causal rules that remedies each of these problems.

To address the structural problem, we propose to measure the causal effect of a rule from observational data. To this end, we control for the effect of a given set of potential confounder variables. In particular, we give a graphical criterion under which it is possible to discover causal rules. While in practice the set of control variables will rarely be complete, i.e., it will not contain all potential confounders, this approach can rule out specific alternative explanations of findings, as well as eliminate misleading conclusions caused by selected observed variables that are known to be strong confounders. In fact, this pragmatic approach is usually a necessity due to data sparsity.

To address the overfitting problem, we propose to measure and optimise the reliable effect of a rule. In contrast to the plug-in estimator, we propose a conservative empirical estimate of the population effect that is not prone to overfitting. Additionally, and in contrast to other known rule optimisation criteria, it is consistent: with increasing amounts of evidence (data), the measure converges to the actual population effect of a rule.

We develop a practical algorithm for efficiently discovering the strongest reliable causal rules. In particular, we show how the optimisation function can be cast into a branch-and-bound approach based on a computationally efficient and tight optimistic estimator.
We support our claims by experiments on both synthetic and real-world datasets, as well as by reporting the required computation times on a large set of benchmark datasets.
2 Related Work
Association rules. In rule-based classification, the goal is to find a set of rules that optimally predict the target label. Classic approaches include CN2 [lavrac:04:cn2] and FOIL [quinlan:95:foil]. In more recent work, attention has shifted from accuracy to optimising more reliable scores, such as area under the curve (AUC) [furnkranz:05:roc].
In association rule mining [agrawal:93:associationrules], we can impose hard constraints on the relative occurrence frequency to get reliable rules. In emerging and contrast pattern mining [dong:99:emergingpatterns, bay:01:contrast], we can get reliable patterns whose supports differ significantly between datasets by performing a statistical hypothesis test. Most subgroup discovery [wrobel:97:sgd] methods optimise a surrogate function based on some null hypothesis test. The resulting objective functions are usually a multiplicative combination of coverage and effect.
All these methods optimise associational effect measures that are based on the observed joint distribution. They thus capture correlation or, more generally, dependence between variables, and do not reflect what the effect would be if we were to intervene in the system.
Causal rules. Although much of the literature is devoted to mining reliable association rules, a few proposals have been made towards mining causal rules. Silverstein et al. [silverstein:00:causalassocrule] test for pairwise dependence and conditional independence relationships to discover causal association rules that consist of a univariate antecedent given a univariate control variable. Li et al. [li:15:causalrule] discover causal rules from observational data given a target by first mining association rules with the target as a consequent, and then performing a cohort study per rule.
Atzmueller & Puppe [atzmueller:09:causalsubgroup] propose a semi-automatic approach to discovering causal interactions by mining subgroups using a chosen quality function, inferring a causal network over these, and visually presenting this to the user. Causal falling rule lists [wang:15:fallingrule], learned from experimental data, are sequences of “if-then” rules over the covariates such that the effect of a specific intervention decreases monotonically down the list. Shamsinejadbabaki et al. [blockeel:13:causalactionmining] discover actions from a partially directed acyclic graph for which the post-intervention probability of the target differs from the observational probability. While all these methods have opened up the research direction, we still lack a theoretical understanding. Roughly speaking, all these methods propose to condition “some” effect measure on “some” covariates. In this work, we present a theoretical result showing which covariates to condition upon, under what conditions causal rule discovery is possible, and how an effect measure must be constructed to capture causal effect. Overall, despite the importance of the problem, to the best of our knowledge there does not exist a theoretically well-founded, efficient approach to discovering reliable causal rules from observational data.
3 Reliable Causal Rules
We consider a system of discrete random variables with a designated target variable $Y$ and a number of covariates, which we differentiate into actionable variables $\mathbf{X} = \{X_1, \dots, X_k\}$ and control variables $\mathbf{Z} = \{Z_1, \dots, Z_l\}$. (Although an actionable variable, e.g. blood group, may not be directly physically manipulable, a causal model such as a structural equation model [pearl:09:book] permits us to compute the effect of intervening on such variables.) For example, $Y$ might indicate recovery from a disease, $\mathbf{X}$ different medications that can be administered to a patient, and $\mathbf{Z}$ might be attributes of patients, such as blood group. Let $\mathit{dom}(X_i)$ denote the domain of $X_i$, and $\mathit{dom}(Z_j)$ that of $Z_j$. As such, the domain of $\mathbf{X}$ is the Cartesian product $\mathit{dom}(\mathbf{X}) = \mathit{dom}(X_1) \times \dots \times \mathit{dom}(X_k)$, and that of $\mathbf{Z}$ is $\mathit{dom}(\mathbf{Z}) = \mathit{dom}(Z_1) \times \dots \times \mathit{dom}(Z_l)$. We use Pearl’s do-notation [pearl:09:book, Chap. 3] $do(X_i = x_i)$, or $do(x_i)$ for short, to represent the atomic intervention on variable $X_i$ which changes the system by assigning $X_i$ the value $x_i$, keeping everything else in the system fixed. The distribution of $Y$ after the intervention is represented by the post-intervention distribution $p(y \mid do(x_i))$. This may not be the same as the observed conditional distribution $p(y \mid x_i)$: as we observe without controlling the system, other variables might have influenced $Y$, unlike in the case of $do(x_i)$. Therefore, to capture the underlying data-generating mechanism, we have to use post-intervention distributions.
Let $\mathcal{V}$ be the set of all possible value vectors of all possible subsets of actionable variables. More formally, we have the following definition:

$$\mathcal{V} = \bigcup_{\mathbf{X}' \in \mathcal{P}(\mathbf{X})} \mathit{dom}(\mathbf{X}') \, ,$$

where $\mathcal{P}$ is the powerset function. In this work, we are concerned with rules $\sigma : \mathcal{V} \to \{0, 1\}$ that for a given value $\mathbf{x} \in \mathcal{V}$ evaluate to either true ($\sigma(\mathbf{x}) = 1$) or false ($\sigma(\mathbf{x}) = 0$). Specifically, we investigate the rule language $\Sigma$ of conjunctions of propositions that can be formed from inequality and equality conditions on actionable variables $X_i$ (e.g. $X_1 = 1 \wedge X_2 \geq 2$).
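Such a conjunctive rule is simply a predicate over (partial) variable assignments. The following sketch makes this concrete; the variable names and thresholds are illustrative and not taken from the paper.

```python
# A rule from the conjunctive language: a conjunction of
# equality/inequality propositions on actionable variables that maps
# an assignment to True (satisfied) or False (not satisfied).
def make_rule(propositions):
    """propositions: list of (variable, op, value) triples."""
    ops = {"==": lambda a, b: a == b,
           "<=": lambda a, b: a <= b,
           ">=": lambda a, b: a >= b}
    def sigma(assignment):
        # Conjunction: every proposition must hold
        return all(ops[op](assignment[var], val)
                   for var, op, val in propositions)
    return sigma

# e.g. the (hypothetical) rule "drug = 1 AND dose >= 2"
sigma = make_rule([("drug", "==", 1), ("dose", ">=", 2)])
print(sigma({"drug": 1, "dose": 3}))   # True
print(sigma({"drug": 1, "dose": 1}))   # False
```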
Let $\mathbf{X}_\sigma \subseteq \mathbf{X}$ denote the subset of actionable variables, with their joint domain $\mathit{dom}(\mathbf{X}_\sigma)$, on which the propositions of a rule $\sigma$ are defined. Most rule miners measure the effect of a rule using the observed conditional distribution $p(y \mid \sigma)$, which captures the correlation, or more generally the dependence, between the rule and the target. To understand the underlying data-generating mechanism, however, we need post-intervention distributions.
One caveat with rules is that, in general, many values can satisfy a rule (e.g., $X_1 \geq 2$ with $\mathit{dom}(X_1) = \{1, 2, 3\}$ is satisfied by both $X_1 = 2$ and $X_1 = 3$). As a result, we have a multitude of atomic interventions to consider (e.g. for $X_1 \geq 2$, we have $do(X_1 = 2)$ and $do(X_1 = 3)$). Depending on the atomic intervention we choose, we may get different answers. This ambiguity can be avoided by considering the average of all post-intervention distributions, where the probability of each atomic intervention is defined by some stochastic policy $\pi$ [pearl:09:book, Chap. 4]. In reinforcement learning, for instance, a stochastic policy is the conditional probability of an action given some state. Formally, the post-intervention distribution of $Y$ under the stochastic policy $\pi$ is given by

$$p(y \mid do(\pi)) = \sum_{\mathbf{x} \in \mathit{dom}(\mathbf{X}_\sigma)} p(y \mid do(\mathbf{X}_\sigma = \mathbf{x})) \, \pi(\mathbf{x}) \, . \qquad (1)$$
Let $\neg\sigma$ denote the logical negation of $\sigma$. Our goal is to identify rules that have a high causal effect on a specific outcome $Y = y$ of the target variable, which we define as the difference in the post-intervention probabilities of $y$ under the stochastic policies $\pi_\sigma$ and $\pi_{\neg\sigma}$ corresponding to $\sigma$ and $\neg\sigma$, i.e.,

$$e(\sigma) = p(y \mid do(\pi_\sigma)) - p(y \mid do(\pi_{\neg\sigma})) \, , \qquad (2)$$

where $p$ represents the probability mass function. Next, we show how to compute the above from observational data, and state the stochastic policy we use to this end.
3.1 Causal Effect from Observational Data
In observational data, we have observed conditional distributions $p(y \mid \mathbf{x})$, which may not be the same as the post-intervention distributions $p(y \mid do(\mathbf{x}))$. A well-known reason for this discrepancy is the potential presence of confounders, i.e., variables that influence both our desired intervention variable(s) and the target. More generally, to measure the causal effect, we have to eliminate the influence of all spurious paths in the causal graph $G$, i.e., the directed graph that describes the conditional independences of our random variables (with respect to all post-intervention distributions).

In more detail, when estimating the causal effect of $\mathbf{X}'$ on $Y$, any undirected path connecting $\mathbf{X}'$ and $Y$ that has an incoming edge towards $\mathbf{X}'$ is a spurious path. A node (variable) is a collider on a path if its in-degree on the path is 2, e.g., $W$ is a collider on the path $X \rightarrow W \leftarrow Z$. A spurious path is blocked by a set of nodes $\mathbf{Z}$ if the path contains a collider that is not in $\mathbf{Z}$, or a non-collider on the path is in $\mathbf{Z}$ [pearl:09:book, Def. 1.2.3]. A set of nodes $\mathbf{Z}$ satisfies the back-door criterion for a set of nodes $\mathbf{X}'$ and a node $Y$ if it blocks all spurious paths from any $X$ in $\mathbf{X}'$ to $Y$, and there is no directed path from any $X$ in $\mathbf{X}'$ to any node in $\mathbf{Z}$ [pearl:09:book, Def. 3.3.1]. For $\mathbf{X}'$ and $Y$, if a set $\mathbf{Z}$ satisfies the back-door criterion, then observational and post-intervention probabilities are equal within each stratum $\mathbf{z}$ of $\mathbf{Z}$:

$$p(y \mid do(\mathbf{x}), \mathbf{z}) = p(y \mid \mathbf{x}, \mathbf{z}) \, , \qquad (3)$$

and averaging the observational probabilities over $\mathbf{Z}$ gives $p(y \mid do(\mathbf{x})) = \sum_{\mathbf{z}} p(y \mid \mathbf{x}, \mathbf{z}) \, p(\mathbf{z})$ [pearl:09:book, Thm. 3.3.2].
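The adjustment formula is a weighted average of stratum-wise conditional probabilities. The following numeric sketch uses hypothetical probability tables for a single binary action, a binary control, and a binary outcome; the numbers are invented for illustration.

```python
# Back-door adjustment: p(y | do(x)) = Σ_z p(y | x, z) p(z),
# illustrated with made-up tables for binary X, Z and outcome Y = 1.
p_z = {0: 0.5, 1: 0.5}              # p(z)
p_y_given_xz = {                    # p(Y = 1 | x, z)
    (0, 0): 0.30, (0, 1): 0.90,
    (1, 0): 0.20, (1, 1): 0.80,
}

def p_do(x):
    """Post-intervention probability of Y = 1 via adjustment over Z."""
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in p_z)

effect = p_do(1) - p_do(0)
print(round(effect, 2))   # -0.1: intervening with X = 1 lowers p(Y=1)
```

Note that within each stratum the action lowers the outcome probability by 0.10, and the adjustment recovers exactly this, regardless of how strongly $Z$ influences who receives the action in the observed data.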
Therefore, to compute the post-intervention probability of $y$ under the stochastic policy for a rule $\sigma$, i.e. $p(y \mid do(\pi_\sigma))$, we need a set of variables that satisfies the back-door criterion for the actionable variables $\mathbf{X}_\sigma$ and $Y$. As we consider the rule language over all actionable variables $\mathbf{X}$, we require a set of control variables $\mathbf{Z}$ that satisfies the back-door criterion for all the actionable variables $\mathbf{X}$. This also implies that there are no other spurious paths via potentially unobserved variables. In the special case when $\mathbf{Z}$ is empty, there must be no spurious paths between the actionable variables and $Y$ at all. We formalise these conditions in the definition below.
Definition 1 (Admissible Input to Causal Rule Discovery)
The causal system of actionable variables $\mathbf{X}$, target variable $Y$, and control variables $\mathbf{Z}$ is an admissible input to causal rule discovery if the underlying causal graph of the variables satisfies the following:

(a) there are no outgoing edges from $Y$ to any $X_i$ in $\mathbf{X}$,

(b) no outgoing edges from any $X_i$ in $\mathbf{X}$ to any $Z_j$ in $\mathbf{Z}$,

(c) no edges between actionable variables $X_i, X_j \in \mathbf{X}$, and

(d) no edges between any unobserved variable $U$ and any $X_i$ in $\mathbf{X}$.
In Fig. 1, we show a skeleton causal graph of an admissible input to causal rule discovery. The proposition below shows that the control variables block all spurious paths between any subset of actionable variables and $Y$ if the input is admissible.

Proposition 1. Let $(\mathbf{X}, Y, \mathbf{Z})$ be an admissible input to causal rule discovery. Then the control variables $\mathbf{Z}$ block all spurious paths between any subset of actionable variables $\mathbf{X}' \subseteq \mathbf{X}$ and $Y$.

We postpone the proof to the Appendix.
Using admissible control variables $\mathbf{Z}$, we can then compute $p(y \mid do(\pi))$ for any rule $\sigma$ from the rule language $\Sigma$ as

$$p(y \mid do(\pi)) = \sum_{\mathbf{x} \in \mathit{dom}(\mathbf{X}_\sigma)} \pi(\mathbf{x}) \sum_{\mathbf{z}} p(y \mid \mathbf{x}, \mathbf{z}) \, p(\mathbf{z}) \qquad (4)$$

$$= \sum_{\mathbf{z}} p(\mathbf{z}) \sum_{\mathbf{x} \in \mathit{dom}(\mathbf{X}_\sigma)} p(y \mid \mathbf{x}, \mathbf{z}) \, \pi(\mathbf{x}) \, , \qquad (5)$$
where the first expression is obtained by applying the back-door adjustment formula [pearl:09:book, Thm. 3.3.2] and the second is obtained from the first by exchanging the inner summation with the outer one. What is left now is to define the stochastic policy $\pi$, which in some sense we have treated as an oracle so far. The following theorem shows that, with a specific choice of $\pi$, we can compute the causal effect of any rule on the target, from observational data, in terms of simple conditional expectations (akin to the conditional average treatment effect [imai:13:heterogenity]).
Theorem 3.1
Given an admissible input to causal rule discovery $(\mathbf{X}, Y, \mathbf{Z})$ and a stochastic policy $\pi_\sigma(\mathbf{x}) = p(\mathbf{x} \mid \sigma, \mathbf{z})$, the causal effect of any rule $\sigma$, from the rule language $\Sigma$, on $Y$ in observational data is given by

$$e(\sigma) = \sum_{\mathbf{z}} \left( p(y \mid \sigma, \mathbf{z}) - p(y \mid \neg\sigma, \mathbf{z}) \right) p(\mathbf{z}) \, . \qquad (6)$$
We postpone the proof to the Appendix.
That is, for an admissible input $(\mathbf{X}, Y, \mathbf{Z})$, the expression on the r.h.s. of Eq. (6) gives us the causal effect of any rule from the rule language $\Sigma$ on $Y$ from observational data. Importantly, we have shown that causal rule discovery is a difficult problem in practice—any violation of Def. 1 would render Eq. (6) non-causal. Having said that, criterion (a) is an implicit assumption in rule discovery, and criteria (b) and (d) are a form of causal sufficiency [scheines:97:intro], which is a fairly standard assumption in the causal inference literature.
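From a finite sample, the r.h.s. of Eq. (6) can be estimated by stratifying on the control values and weighting each stratum by its relative frequency. The sketch below is a plain plug-in estimate (i.e., without the reliability correction developed later); the sample records are made up for illustration.

```python
from collections import Counter

# Plug-in estimate of Eq. (6):
#   ê(σ) = Σ_z ( p̂(y | σ, z) − p̂(y | ¬σ, z) ) p̂(z)
# Each record is a triple (satisfies_sigma, z, y).
def effect_hat(records, y=1):
    n = len(records)
    z_counts = Counter(z for _, z, _ in records)
    total = 0.0
    for z, nz in z_counts.items():
        in_y = [yi for s, zi, yi in records if s and zi == z]
        out_y = [yi for s, zi, yi in records if not s and zi == z]
        if not in_y or not out_y:
            continue   # stratum carries no evidence (cf. overfitting)
        diff = (sum(1 for yi in in_y if yi == y) / len(in_y)
                - sum(1 for yi in out_y if yi == y) / len(out_y))
        total += diff * nz / n   # weight by p̂(z)
    return total

# Made-up sample: two strata z ∈ {0, 1}, 8 records each
records = ([(True, 0, 1)] * 3 + [(True, 0, 0)] * 1 +
           [(False, 0, 1)] * 1 + [(False, 0, 0)] * 3 +
           [(True, 1, 1)] * 2 + [(True, 1, 0)] * 2 +
           [(False, 1, 1)] * 1 + [(False, 1, 0)] * 3)
print(effect_hat(records))   # 0.375
```

As discussed in the introduction, this naive estimator has high variance in small strata; it is shown here only to make the structure of Eq. (6) concrete.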
Exceptional cases aside, in practice we often do not know the complete causal graph. While under some assumptions we can discover a partially directed graph from observational data [spirtes:00:book], a rather pragmatic approach is to leverage domain knowledge to eliminate certain variables following the guidelines in Def. 1. For instance, smoking causes tar deposits in a person’s lungs; therefore smoking and tar deposits cannot both be in $\mathbf{X}$, which ensures that criterion (c) of Def. 1 is not violated. Moreover, smoking may affect a person’s blood pressure, so it is unsafe to include blood pressure in $\mathbf{Z}$, as criterion (b) would otherwise be violated. This way, we can get a practical solution that is closer to the truth.