Subgroup discovery is a well-established KDD technique (klosgen1996explora; friedman1999bump; bay2001detecting; see atzmueller2015subgroup for a recent survey) with applications, e.g., in Medicine (schmidt2010interpreting), Social Science (grosskreutz2010subgroup), and Materials Science (goldsmith2017uncovering). In contrast to global modeling, which is concerned with the complete characterization of some variable defined for a given population, subgroup discovery aims to detect intuitive descriptions or selectors of subpopulations in which, locally, the target variable takes on a useful distribution. In scientific domains, like the ones mentioned above, such local patterns are typically considered useful if they are not too specific (in terms of subpopulation size) and indicate insightful facts about the underlying physical process that governs the target variable. Such facts could for instance be: ‘patients of specific demographics experience a low response to some treatment’ or ‘materials with specific atomic composition exhibit a high thermal conductivity’. For numeric (metric) variables, subgroups need to satisfy two criteria to truthfully represent such statements: the local distribution of the target variable must have a shifted central tendency (effect), and group members must be described well by that shift (consistency). The second requirement is captured by the group’s dispersion, which determines the average error of associating group members with the central tendency value (see also song2016subgroup).
Despite all three parameters—size, central tendency, and dispersion—being important, the only known approach for the efficient discovery of globally optimal subgroups, branch-and-bound search (webb1995opus; wrobel1997algorithm), is restricted to objective functions that only take into account size and central tendency. That is, if we denote by some subpopulation of our global population then the objective functions currently available to branch-and-bound can be written as
where is some measure of central tendency (usually mean or median) and is a function that is monotonically increasing in the subpopulation size . A problem with all such functions is that they inherently favor larger groups with scattered target values over smaller more focused groups with the same central tendency. That is, they favor the discovery of inconsistent statements over consistent ones—surprisingly often identifying groups with a local error that is almost as high or even higher than the global error (see Fig. 1 for an illustration of this problem that abounded from the authors’ research in Materials Science). Although dispersion-corrected objective functions that counter-balance size by dispersion have been proposed (e.g., ‘-score’ by klosgen2002data or ‘mmad’ by pieters2010subgroup
), it remained unclear how to employ such functions outside of heuristic optimization frameworks such as greedy beam search(lavravc2004subgroup) or selector sampling (boley2012linear; li2016sampling). Despite often finding interesting groups, such frameworks do not guarantee the detection of optimal results, which can not only be problematic for missing important discoveries but also because they therefore can never guarantee the absence of high quality groups—which often is an insight equally important as the presence of a strong pattern. For instance, in our example in Fig. 1, it would be remarkable to establish that long-range interactions are to a large degree independent of nanocluster geometry.
Therefore, in this paper (Sec. 3), we extend branch-and-bound search to objective functions of the form
where is monotonically increasing in the subpopulation size, monotonically decreasing in any dispersion measure around the median, and, besides that, depends only (but in arbitrary form) on the subpopulation median. This involves developing an efficient algorithm for computing the tight optimistic estimator given by the optimal value of the objective function among all possible subsets of target values:
which has been shown to be a crucial ingredient for the practical applicability of branch-and-bound (grosskreutz2008tight; lemmerich2016fast). So far, the most general approach to this problem (first codified in lemmerich2016fast; generalized here in Sec. 3.1) is to maintain a sorted list of target values throughout the search process and then to compute Eq. (3) as the maximum of all subsets that contain all target values of down to target value —an algorithm that does not generalize to objective functions depending on dispersion. This paper presents an alternative idea (Sec. 3.2) where we do not fix the size of subset as in the previous approach but instead fix its median to target value . It turns out that this suffices to efficiently compute the tight optimistic estimator for all objective functions of the form of Eq. (2). Moreover, we end up with a linear time algorithm (Sec. 3.3) in the important special case where the dependence on size and dispersion is determined by the dispersion-corrected coverage defined by
where amd denotes the mean absolute deviation from the median. This is the same computational complexity as the objective function itself. Consequently, this new approach can discover subgroups according to a more refined selection criterion without increasing the worst-case computational cost. Additionally, as demonstrated by empirical results on a wide range of datasets (Sec. 4), it is also highly efficient and successfully reduces the error of result subgroups in practice.
2 Subgroup Discovery
Before developing the novel approach to tight optimistic estimator computation, we recall in this section the necessary basics of optimal subgroup discovery with numeric target attributes. We focus on concepts that are essential from the optimization point of view (see, e.g., duivesteijn2011exploiting and references therein for statistical considerations). As notional convention, we are using the symbol for a positive integer to denote the set of integers . Also, for a real-valued expression we write to denote . A summary of the most important notations used in this paper can be found in Appendix C.
2.1 Description languages, objective functions, and closed selectors
Let denote our given global population of entities, for each of which we know the value of a real target variable and additional descriptive information that is captured in some abstract description language of subgroup selectors . Each of these selectors describes a subpopulation defined by
that is referred to as the extension of . Subgroup discovery is concerned with finding descriptions that have a useful (or interesting) distribution of target values in their extension . This notion of usefulness is given by an objective function . That is, the formal goal is to find elements with maximal . Since we assume to be a function of the multiset of -values, let us define to be used interchangeably for convenience. One example of a commonly used objective function is the impact measure (see webb2001discovering; here a scaled but order-equivalent version is given) defined by
where denotes the coverage or relative size of (here—and wherever else convenient—we identify a subpopulation with the multiset of its target values).
The standard description language in the subgroup discovery literature333In this article we remain with this basic setting for the sake of simplicity. It is, however, useful to note that several generalizations of this concept have been proposed (e.g., parthasarathy1999incremental; huan2003efficient), to which the contributions of this paper remain applicable. is the language consisting of logical conjunctions of a number of base propositions (or predicates). That is, are of the form
where the are taken from a pool of base propositions . These propositions usually correspond to equality or inequality constraints with respect to one variable out of a set of description variables that are observed for all population members (e.g., ). However, for the scope of this paper it is sufficient to simply regard them as abstract Boolean functions . In this paper, we focus in particular on the refined language of closed conjunctions (pasquier1999efficient), which is defined as by the fixpoints of the closure operation given by
These are selectors to which no further proposition can be added without reducing their extension, and it can be shown that contains at most one selector for each possible extension. While this can reduce the search space for finding optimal subgroups by several orders of magnitude, closed conjunctions are the longest (and most redundant) description for their extension and thus do not constitute intuitive descriptions by themselves. Hence, for reporting concrete selectors (as in Fig. 1), closed conjunctions have to be simplified to selectors of approximately minimum length that describe the same extension (boley2009non).
2.2 Branch-and-bound and optimistic estimators
The standard algorithmic approach for finding optimal subgroups with respect to a given objective function is branch-and-bound search—a versatile algorithmic puzzle solving framework with several forms and flavors (see, e.g., mehlhorn2008algorithms, Chap. 12.4). At its core, all of its variants assume the availability and efficient computability of two ingredients:
A refinement operator that is monotone, i.e., for with it holds that , and that non-redundantly generates . That is, there is a root selector such that for every there is a unique sequence of selectors with . In other words, the refinement operator implicitly represents a directed tree (arborescence) on the description language rooted in .
An optimistic estimator (or bounding function) that bounds from above the attainable subgroup value of a selector among all more specific selectors, i.e., it holds that for all with .
Based on these ingredients, a branch-and-bound algorithm simply enumerates all elements of starting from using (branch), but—based on —avoids expanding descriptions that cannot yield an improvement over the best subgroups found so far (bound). Depending on the order in which language elements are expanded, one distinguishes between depth-first, breadth-first, breadth-first iterating deepening, and best-first search. In the last variant, the optimistic estimator is not only used for pruning the search space, but also to select the next element to be expanded, which is particularly appealing for informed, i.e., tight, optimistic estimators. An important feature of branch-and-bound is that it effortlessly allows to speed-up the search in a sound way by relaxing the result requirement from being -optimal to just being an -approximation. That is, the found solution satisfies for all that for some approximation factor . The pseudo-code given in Alg. 1 summarizes all of the above ideas. Note that, for the sake of clarity, we omitted here some other common parameters such as a depth-limit and multiple solutions (top-), which are straightforward to incorporate (see lemmerich2016fast).
An efficiently computable refinement operator has to be constructed specifically for the desired description language. For example for the language of conjunctions , one can define by
where we identify a conjunction with the set of base propositions it contains. For the closed conjunctions , let us define the lexicographical prefix of a conjunction and a base proposition index as . Moreover, let us denote with the minimal index such that the -prefix of is extension-preserving, i.e., . With this we can construct a refinement operator (uno2004efficient) as
That is, a selector is among the refinements of if can be generated by an application of the closure operator given in Eq. (5) that is prefix-preserving.
How to obtain an optimistic estimator for an objective function of interest depends on the definition of that objective. For instance, the coverage function cov is a valid optimistic estimator for the impact function as defined in Eq. (4), because the second factor of the impact function is upper bounded by . In fact there are many different optimistic estimators for a given objective function. Clearly, the smaller the value of the bounding function for a candidate subpopulation, the higher is the potential for pruning the corresponding branch from the enumeration tree. Ideally, one would like to use , which is the most strict function that still is a valid optimistic estimator. In general, however, computing this function is as hard as the original subgroup optimization problem we started with. Therefore, as a next best option, one can disregard selectability and consider the (selection-unaware) tight optimistic estimator given by
This leaves us with a new combinatorial optimization problem: given a subpopulation, find a sub-selection of that maximizes . In the following section we will discuss strategies for solving this optimization problem efficiently for different classes of objective functions—including dispersion-corrected objectives.
3 Efficiently Computable Tight Optimistic Estimators
We are going to develop an efficient algorithm for the tight optimistic estimator in three steps: First, we review and reformulate a general algorithm for the classic case of non-dispersion-aware objective functions. Then we transfer the main idea of this algorithm to the case of dispersion-corrected objectives based on the median, and finally we consider a subclass of these functions where the approach can be computed in linear time. Throughout this section we will identify a given subpopulation with the multiset of its target values and assume that the target values are indexed in ascending order, i.e., for . Also, it is helpful to define the following partial order defined on finite multisets. Let and be two multisets such that their elements are indexed in ascending order. We say that is element-wise less or equal to and write if for all .
3.1 The standard case: monotone functions of a central tendency measure
The most general previous approach for computing the tight optimistic estimator for subgroup discovery with a metric target variable is described by lemmerich2016fast where it is referred to as estimation by ordering. Here, we review this approach and give a uniform and generalized version of that paper’s results. For this, we define the general notion of a measure of central tendency as follows.
We call a mapping a (monotone) measure of central tendency if for all multisets with it holds that .
One can check that this definition applies to the standard measures of central tendency, i.e., the arithmetic and geometric mean as well as themedian444In this paper, we are using the simple definition of the median as the
-quantile (as opposed to defining it asfor even ), which simplifies many of the definitions below and additionally is well-defined in settings where averaging of target values is undesired. , and also to weighted variants of them (note, however, that it does not apply to the mode). With this we can define the class of objective functions for which the tight optimistic estimator can be computed efficiently by the standard approach as follows. We call a monotone level 1 objective function if it can be written as
where is some measure of central tendency and is a function that is non-decreasing in both of its arguments. One can check that the impact measure falls under this category of functions as do many of its variants.
The central observation for computing the tight optimistic estimator for monotone level 1 functions is that the optimum value must be attained on a sub-multiset that contains a consecutive segment of elements of from the top element w.r.t. down to some cut-off element. Formally, let us define the top sequence of sub-multisets of as for and note the following observation:
Let be a monotone level 1 objective function. Then the tight optimistic estimator of can be computed as the maximum value on the top sequence, i.e., .
Let be of size with . Since , we have for the top sequence element that and, hence, implying
It follows that for each sub-multiset of there is a top sequence element of at least equal objective value. ∎
From this insight it is easy to derive an algorithm for computing the tight optimistic estimator under the additional assumption that we can compute and the “incremental central tendency problem” in constant time. Note that computing the incremental problem in constant time implies to only access a constant number of target values and of the previously computed central tendency values. This can for instance be done for via the incremental formula or for through direct index access of either of the two values or . Since, according to Prop. 1, we have to evaluate only for the candidates to find we can do so in time by solving the problem incrementally for . The same overall approach can be readily generalized for objective functions that are monotonically decreasing in the central tendency or those that can be written as the maximum of one monotonically increasing and one monotonically decreasing level 1 function. However, it breaks down for objective functions that depend on more than just size and central tendency—which inherently is the case when we want to incorporate dispersion-control.
3.2 Dispersion-corrected objective functions based on the median
We will now extend the previous recipe for computing the tight optimistic estimator to objective functions that depend not only on subpopulation size and central tendency but also on the target value dispersion in the subgroup. Specifically, we focus on the median as measure of central tendency and consider functions that are both monotonically increasing in the described subpopulation size and monotonically decreasing in some dispersion measure around the median. To precisely describe this class of functions, we first have to formalize the notion of dispersion measure around the median. For our purpose the following definition suffices. Let us denote by the multiset of absolute differences to the median of a multiset , i.e., .
We call a mapping a dispersion measure around the median if is monotone with respect to the multiset of absolute differences to its median , i.e., if then .
One can check that this definition contains the measures median absolute deviation around the median , the root mean of squared deviations around the median , as well as the mean absolute deviation around the median .555We work here with the given definition of dispersion measure because of its simplicity. Note, however, that all subsequent arguments can be extended in a straightforward way to a wider class of dispersion measures by considering the multisets of positive and negative deviations separately. This wider class also contains the interquartile range and certain asymmetric measures, which are not covered by Def. 2. Based on Def. 2 we can specify the class of objective functions that we aim to tackle as follows: we call a function a dispersion-corrected or level 2 objective function (based on the median) if it can be written as
where is some dispersion measure around the median and is a real function that is non-decreasing in its first argument and non-increasing in its third argument (without any monotinicity requirement for the second argument).
Our recipe for optimizing these functions is then to consider only subpopulations that can be formed by selecting all individuals with a target value in some interval. Formally, for a fixed index define as the maximal cardinality of a sub-multiset of the target values that has median index , i.e.,
Now, for , let us define as the set with consecutive elements around index . That is
With this we can define the elements of the median sequence as those subsets of the form of Eq. (8) that maximize for some fixed index . That is, where is minimal with
Thus, the number is the smallest cardinality that maximizes the trade-off of size and dispersion encoded by (given the fixed median for all ). Fig. 2 shows an exemplary median sequence based on random target values. In the following proposition we note that, as desired, searching the median sequence is sufficient for finding optimal subsets of .
Let be a dispersion-corrected objective function based on the median. Then the tight optimistic estimator of can be computed as the maximum value on the median sequence, i.e., .
For a sub-multiset let us define the gap count as
Let be an -maximizer with minimal gap count, i.e., for all with . Assume that . That means there is a such that . Define
Per definition we have and . Additionally, we can check that , and, hence, . This implies that
However, per definition of it also holds that , which contradicts that is an -optimizer with minimal gap count. Hence, any -maximizer must have a gap count of zero. In other words, is of the form as in Eq. (8) for some median and some cardinality and per definition we have as required. ∎
Consequently, we can compute the tight optimistic estimator for any dispersion-corrected objective function based on the median in time for subpopulations of size —again, given a suitable incremental formula for . While this is not generally a practical algorithm in itself, it is a useful departure point for designing one. In the next section we show how it can be brought down to linear time when we introduce some additional constraints on the objective function.
3.3 Reaching linear time—objectives based on dispersion-corrected coverage
Equipped with the general concept of the median sequence, we can now address the special case of dispersion-corrected objective functions where the trade-off between the subpopulation size and target value dispersion is captured by a linear function of size and the sum of absolute differences from the median. Concretely, let us define the dispersion-corrected coverage (w.r.t. absolute median deviation) by
where denotes the sum of absolute deviations from the median. We then consider objective functions based on the dispersion-corrected coverage of the form
where is non-decreasing in its first argument. Let us note, however, that we could replace the dcc function by any linear function that depends positively on and negatively on smd. It is easy to verify that function of this form also obey the more general definition of level-2 objective functions given in Sec. 3.2, and, hence can be optimized via the median sequence.
The key to computing the tight optimistic estimator in linear time for functions based on dispersion-corrected coverage is then that the members of the median sequence can be computed incrementally in constant time. Indeed, we can prove the following theorem, which states that the optimal size for a multiset around median index is within of the optimal size for a multiset around median index —a fact that can also be observed in the example given in Fig. 2.
Let be of the form of Eq. (9). For it holds for the size of the -optimal multiset with median that
One idea to prove this theorem is to show that a) the gain in for increasing the multiset around a median index is alternating between two discrete concave functions and b) that the gains for growing multisets between two consecutive median indices are bounding each other. For an intuitive understanding of this argument, Fig. 3 shows for four different median indices the dispersion-corrected coverage for the sets as a function in . On closer inspection, we can observe that when considering only every second segment of each function graph, the corresponding dcc-values have a concave shape. A detailed proof, which is rather long and partially technical, can be found in Appendix A.
It follows that, after computing the objective value of trivially as , we can obtain for by checking the at most seven candidate set sizes given by Eq. (10) as
with and . It remains to see that we can compute individual evaluations of in constant time (after some initial pre-processing step). As a general data structure for quickly computing sums of absolute deviations from a center point, we can define for the left error and the right error as
Note that we can compute these error terms for all in time via the recursions
and . Subsequently, we can compute sums of deviations from center points of arbitrary subpopulations in constant time, as the following statement shows (see Appendix B for a proof).
Let be a multiset with and for . Then the sum of absolute deviations to of all elements of the submultiset can be expressed as
4 Dispersion-corrected Subgroup Discovery in Practice
|5||california||med. h. value||72||0.4|
|16||mortgage||30 y. rate||128||6.71||2.373||0.097||11.61||2.081||1|
|21||treasury||1 m. def. rate||128||6.61||2.473||0.182||1|
The overall result of Sec. 3 is an efficient algorithm for dispersion-corrected subgroup discovery which, e.g., allows us to replace the coverage term in standard objective functions by the dispersion-corrected coverage. To evaluate this efficiency claim as well as the value of dispersion-correction, let us consider as objective the normalized and dispersion-corrected impact function based on the median, i.e., where is the positive relative median shift
This function obeys Eq. (9); thus, its tight optimistic estimator can be computed using the linear time algorithm from Sec. 3.3. The following empirical results were gathered by applying it to a range of publicly available real-world datasets.666Datasets contain all regression datasets from the KEEL repository (alcala2010keel) with at least 5 attributes and two materials datasets from the Nomad Repository nomad-coe.eu/; see Tab. 1. Implementation available in open source Java library realKD bitbucket.org/realKD/. Computation times determined on MacBook Pro 3.1 GHz Intel Core i7. We will first investigate the effect of dispersion-correction on the output before turning to the effect of the tight optimistic estimator on the computation time.
4.1 Selection Bias of Dispersion-Correction and its Statistical Merit
To investigate the selection bias of let us also consider the non-dispersion corrected variant where we simply replace the dispersion-corrected coverage by the ordinary coverage. This function is a monotone level 1 function, hence, its tight optimistic estimator can be computed in linear time using the top sequence approach. Fig. 4 shows the characteristics of the optimal subgroups that are discovered with respect to both of these objective functions (see also Tab. 1 for exact values) where for all datasets the language of closed conjunctions has been used as description language.
The first observation is that—as enforced by design—for all datasets the mean absolute deviation from the median is lower for the dispersion-corrected variant (except in one case where both functions yield the same subgroup). On average the dispersion for is percent of the global dispersion, whereas it is percent for , i.e., when not optimizing the dispersion it is on average higher in the subgroups than in the global population. When it comes to the other subgroup characteristics, coverage and median target value, the global picture is that discovers somewhat more specific groups (mean coverage versus for ) with higher median shift (on average normalized median deviations higher). However, in contrast to dispersion, the behavior for median shift and coverage varies across the datasets. In Fig. 4, the datasets are ordered according to the difference in subgroup medians between the optimal subgroups w.r.t. and those w.r.t. . This ordering reveals the following categorization of outcomes: When our description language is not able to reduce the error of subgroups with very high median value, settles for more coherent groups with a less extreme but still outstanding central tendency. On the other end of the scale, when no coherent groups with moderate size and median shift can be identified, the dispersion-corrected objective selects very small groups with the most extreme target values. The majority of datasets obey the global trend of dispersion-correction leading to somewhat more specific subgroups with higher median that are, as intended, more coherent.
Effect of dispersion correction on lower bound of 95-percent confidence interval of target variable;(left)
improvement over global lower bound in standard deviations of dispersion-corrected objective () and non-dispersion-corrected objective () with annotations showing ids of datasets where either method provides no improvement; (right)
posterior joint probabilities of the events that normalized differenceis larger than (), less than (), or within () according to Bayesian sign-test in barycentric coordinates (sections correspond to regions where corresponding event is maximum a posteriori outcome).
To determine based on these empirical observations, whether we should generally favor dispersion correction, we have to specify an application context that specifies the relative importance of coverage, central tendency, and dispersion. For that let us consider the common statistical setting in which we do not observe the full global population but instead subgroup discovery is performed only on an i.i.d. sample yielding subpopulations . While has been optimized w.r.t. the statistics on that sample we are actually interested in the properties of the full subpopulation . For instance, a natural question is what is the minimal -value that we expect to see in a random individual with high confidence. That is, we prefer subgroups with an as high as possible threshold such that a random satisfies with probability777The probability is w.r.t. to the distribution with which the sample is drawn. that
. This criterion gives rise to a natural trade-off between the three evaluation metrics through theempirical Chebycheff inequality (see kaban2012non, Eq. (17)), according to which we can compute such a value as where
is the sample variance. Note that this expression is only defined for sample subpopulations with a size of at least. For smaller subgroups our best guess for a threshold value would be the one derived from the global sample (which we assume to be large enough to determine an -value). This gives rise to the following standardized lower confidence bound score that evaluates how much a subgroup improves over the global value:
The plot on the left side of Fig. 5 shows the score values of the optimal subgroup w.r.t. to () and () using confidence parameter . Except for three exceptions (datasets 3,4, and 12), the subgroup resulting from provides a higher lower bound than those from the non-dispersion corrected variant . That is, the data shows a strong advantage for dispersion correction when we are interested in selectors that mostly select individuals with a high target value from the underlying population . In order to test the significance of these results, we can employ the Bayesian sign-test (benavoli2014bayesian)
, a modern alternative to classic frequentist null hypothesis tests that avoids many of the well-known disadvantages of those(see demvsar2008appropriateness; benavoli2016time)
. With Bayesian hypothesis tests, we can directly evaluate the posterior probabilities of hypotheses given our experimental data instead of just rejecting a null hypothesis based on some arbitrary significance level. Moreover, we differentiate between sample size and effect size by the introduction of a region of practical equivalence (rope). Here, we are interested in the relative differenceon average for random subgroup discovery problems. Using a conservative choice for the rope, we call the two objective functions practically equivalent if the mean -value is at most . Choosing the prior belief that is superior, i.e., , with a prior weight of , the procedure yields based on our 25 test datasets the posterior probability of approximately that on average (see the right part of Fig. 5 for in illustration of the posterior belief). Hence, we can conclude that dispersion-correction improves the relative lower confidence bound of target values on average by more than 10 percent when compared to the non-dispersion-corrected function.