Separate and conquer heuristic allows robust mining of contrast sets from various types of data

04/01/2022
by   Adam Gudyś, et al.

Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas like medicine, industry, or economics. In the paper we present RuleKit-CS, an algorithm for contrast set mining based on sequential covering, a well-established heuristic for decision rule induction. Multiple passes accompanied by an attribute penalization scheme allow generating contrast sets that describe the same examples with different attributes, unlike standard sequential covering. The ability to identify contrast sets in regression and survival data sets, a feature not provided by the existing algorithms, further extends the usability of RuleKit-CS. Experiments on a wide range of data sets confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm is a part of the RuleKit suite available at GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Keywords: Contrast sets, Sequential covering, Rule induction, Regression, Survival, Knowledge discovery


1 Introduction

In knowledge discovery from tabular data, rules are the most intuitive, and thus the most popular, knowledge representation. Many knowledge discovery tasks can be considered rule induction problems, for instance association rule learning, subgroup discovery, contrast set mining, or identifying emerging patterns. While the descriptive capabilities of rules are indisputable, they can also be used for predictive purposes, i.e., for building classification systems. As Novak et al. noticed (Novak et al., 2009b), these two directions were initially investigated independently by the data mining (descriptive) and machine learning (predictive) communities. The aforementioned perspectives are, however, tightly related. In fact, any rule induction algorithm can be oriented towards one or both of these purposes. The differences lie in the applied strategy of exploring the search space, the methods of assessing the rules, and their post-processing.

For knowledge discovery purposes, the induction often tries to find all rules fulfilling assumed quality constraints, e.g., precision (confidence) or support (coverage). This can be followed by filtering based on rule interestingness (Geng and Hamilton, 2006). When classification is the main aim, the induction is oriented towards the highest predictive power. Ensembles of rules (Gu et al., 2018) are particularly effective in this field. The possibility of interpreting the resulting models is often illusory, though. Among many rule learning approaches, sequential covering (a.k.a. separate and conquer) is a reasonable compromise allowing induction of a moderate number of rules with good predictive power. Importantly, the procedure can be straightforwardly adjusted towards interpretability or classification abilities of the model by using different rule quality measures (Janssen and Fürnkranz, 2010; Wróbel et al., 2016).

In this paper, we present RuleKit-CS, an algorithm for contrast set (CS) mining based on sequential covering. Our approach follows the observation that contrast set mining is a special case of classification rule learning with particular stress put on maximizing the support difference between groups (Webb et al., 2003). By selecting an appropriate quality measure to control the learning process and by introducing support constraints, separate and conquer was adapted to discovering CS. Multiple passes accompanied by a novel penalization scheme allow generating contrast sets that describe the same examples with different attributes, a feature not ensured by standard sequential covering. Additionally, we generalized the problem of contrast set mining to regression and survival data and included support for those in the presented algorithm.

RuleKit-CS is a part of the RuleKit (Gudyś et al., 2020) suite available at GitHub under GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

2 Related work

Contrast set mining was formulated by Bay and Pazzani (Bay and Pazzani, 2001) as the problem of identifying differences between contrasting groups in multivariate data. Let $D$ indicate a set of examples (observations) described by conditional attributes $A_1, \ldots, A_m$ and assigned to one of the groups from $\{G_1, \ldots, G_k\}$. A contrast set was originally defined for sets with categorical attributes only as a conjunction of attribute-value pairs with each attribute appearing at most once: $cs = (A_{i_1} = v_1) \wedge \ldots \wedge (A_{i_r} = v_r)$. This definition can be straightforwardly generalized to continuous attributes by replacing attribute values with intervals. The support of a contrast set $cs$ in a group $G_i$, denoted $\mathrm{supp}(cs, G_i)$, is calculated as the fraction of examples from $G_i$ covered by $cs$. The aim of contrast set mining is to find contrast sets with high support in the group of interest (here, referred to as positive) and low support in the remaining groups (negative). The requirements were formally defined (Bay and Pazzani, 2001) as:

$\exists\, i, j:\ P(cs \mid G_i) \neq P(cs \mid G_j)$   (1)
$\max_{i, j} \left| \mathrm{supp}(cs, G_i) - \mathrm{supp}(cs, G_j) \right| \geq \delta$   (2)

with $\delta$ being a user-defined minimum support difference. Contrast sets fulfilling conditions (1) and (2) are referred to as significant and large, respectively.
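The support and the "large" condition (2) can be illustrated with a short Python sketch. It assumes categorical data stored as dictionaries and a contrast set stored as an attribute-to-value mapping; the names `covers`, `support`, and `is_large` are hypothetical and not part of any library. The statistical test behind condition (1) is not covered here.

```python
from typing import Dict, List

Example = Dict[str, str]      # attribute name -> categorical value
ContrastSet = Dict[str, str]  # conjunction of attribute = value pairs


def covers(cs: ContrastSet, example: Example) -> bool:
    """True if the example satisfies every attribute-value pair of cs."""
    return all(example.get(attr) == value for attr, value in cs.items())


def support(cs: ContrastSet, group: List[Example]) -> float:
    """Fraction of examples from the group covered by cs."""
    return sum(covers(cs, x) for x in group) / len(group)


def is_large(cs: ContrastSet, groups: Dict[str, List[Example]], delta: float) -> bool:
    """Condition (2): the largest pairwise support difference is at least delta.

    The maximum of |supp_i - supp_j| over all pairs equals max(supp) - min(supp).
    """
    supports = [support(cs, g) for g in groups.values()]
    return max(supports) - min(supports) >= delta
```

With one group in which three of four examples satisfy the contrast set and another group in which none do, the supports are 0.75 and 0, so the contrast set is large for a minimum difference of 0.5 but not for 0.8.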

Bay and Pazzani proposed a contrast set mining algorithm, STUCCO, based on set-enumeration trees (Bayardo Jr, 1998). The method investigates all potential contrast sets and selects those which (i) pass the statistical test of independence w.r.t. group membership and (ii) fulfill the minimum support difference. Since the number of contrast sets grows exponentially with the number of attributes, pruning strategies were incorporated to limit the search space. CIGAR (Hilderman and Peckham, 2005) extended the idea of STUCCO by introducing three additional constraints: (i) minimum support, (ii) minimum correlation, (iii) minimum correlation difference.

Webb et al. (Webb et al., 2003) showed that contrast set mining is a special case of rule learning and confirmed that Magnum Opus, an implementation of the general-purpose association rule learner OPUS_AR (Webb, 2000), was suitable for contrast set identification. Another analogy was shown by Kralj et al., who successfully applied subgroup discovery to mining contrast sets in various brain conditions (Kralj et al., 2007b, a). The equivalence between contrast set mining, subgroup discovery, and identifying emerging patterns was formally shown in (Novak et al., 2009b). The idea was followed in the CSM-SD (Novak et al., 2009a) package, which employed the subgroup discovery algorithm CN2-SD (Lavrač et al., 2004) for contrast set mining. As the authors showed, the weighted relative accuracy used by CN2-SD corresponds to the support difference criterion from STUCCO. Other subgroup discovery algorithms, like pysubgroup (Lemmerich and Becker, 2018), are also suitable for the identification of contrast sets.

Alternative approaches to contrast set mining include COSINE (Simeon and Hilderman, 2011), DIFF (Liu et al., 2014), SciCSM (Zhu et al., 2015), or Exceptional Contrast Set Mining (Nguyen et al., 2016). There were also attempts to generate contrast sets in temporal data (Magalhães and Azevedo, 2015) or to introduce fuzzy contrast-based models (Ahmed et al., 2022).

While the majority of contrast set research concerned medical data (Kralj et al., 2007b, a; Novak et al., 2009a; Ahmed et al., 2022), other areas of application like aircraft incidents (Nazeri et al., 2008), software crashes (Qian et al., 2020), or folk music (Neubarth and Conklin, 2016) were also investigated in the literature.

3 Methods

3.1 Separate and conquer contrast sets learning

As presented in Webb et al. (2003), contrast set mining can be considered a special case of classification rule learning with the group being the label attribute. This idea is followed by our algorithm, which employs sequential covering, a well-established method for rule induction. By selecting an appropriate quality measure to control the learning process and by introducing support constraints, the heuristic was adapted to discovering contrast sets.

The general idea of separate and conquer is the iterative addition of rules to an initially empty set until all positive examples become covered. To enable the generation of many contrast sets describing the same subset of examples with different attributes, RuleKit-CS performs multiple sequential covering passes. Contrast set redundancy between passes is prevented by an attribute penalization mechanism and/or different coverage requirements. By default, contrast sets are established in a one vs all scheme, i.e., the examples from the investigated group are differentiated from all the others. Optionally, the induction can be performed in a one vs one variant. In this mode, a single group is considered negative, while examples from the remaining groups are discarded.

Given a group of interest $G$, let $\mathcal{P}$ and $\mathcal{N}$ indicate the subsets of, respectively, positive and negative examples from $D$. By denoting the subset of examples from $D$ covered by a contrast set $cs$ as $D_{cs}$, we can define the elements of the confusion matrix as: $p = |\mathcal{P} \cap D_{cs}|$, $n = |\mathcal{N} \cap D_{cs}|$, $P = |\mathcal{P}|$, $N = |\mathcal{N}|$. Additionally, let $\mathcal{U} \subseteq \mathcal{P}$ be the subset of yet uncovered positive examples and $p_{new} = |\mathcal{U} \cap D_{cs}|$.

The pseudocode of the contrast set mining for a group is presented in Algorithm 1. The procedure is controlled by the following parameters:

  • minsupp-all: a minimum positive support of a contrast set ($p / P \geq$ minsupp-all). By repeating the induction for minsupp-all $\in \{0.8, 0.5, 0.2, 0.1\}$ and aggregating the results, the algorithm renders contrast sets from most general to most detailed.

  • minsupp-new: a minimum positive support of a contrast set calculated w.r.t. previously uncovered examples ($p_{new} / P \geq$ minsupp-new; 0.1 by default). The parameter ensures the convergence of a separate and conquer pass and is an equivalent of mincov in our algorithm for rule induction (Gudyś et al., 2020). Note that minsupp-new also affects the stop condition: when the fraction of uncovered positives falls below its value, no more contrast sets fulfilling the requirement can be generated and the pass ends.

  • max-neg2pos: a maximum ratio of negative to positive supports ($\frac{n / N}{p / P} \leq$ max-neg2pos; 0.5 by default). The parameter ensures the basic requirement of a contrast set, i.e., high support in the group of interest and low support in the remaining ones.

  • max-passes: a maximum number of sequential covering passes for a single minsupp-all (5 by default).

  • quality: the rule quality measure that drives the learning process. The measure is expressed as a function of the confusion matrix elements and allows balancing support and precision of the resulting contrast sets. Many rule quality measures with various characteristics have been defined (Kamber and Shinghal, 1996; Hilderman and Hamilton, 2013; Tan et al., 2002; Greco et al., 2004). One of them is the correlation between predicted and target variables defined as:

    $\mathrm{corr}(p, n, P, N) = \dfrac{pN - Pn}{\sqrt{PN(p + n)(P - p + N - n)}}$   (3)

    Due to its valuable properties, the measure was employed in rule induction, subgroup discovery, and evaluation of association rules (Xiong et al., 2004; Geng and Hamilton, 2006; Janssen and Fürnkranz, 2010). In particular, the correlation is monotonic (increasing in $p$ for fixed $n$, decreasing in $n$ for fixed $p$), symmetric (if we negate the premise or the consequence, the correlation value becomes the additive inverse), and takes values from the $[-1, 1]$ interval. Additionally, it belongs to the confirmation measures (Eells and Fitelson, 2002): it is positive if the contrast set precision $p / (p + n)$ exceeds the a priori group precision $P / (P + N)$, and negative otherwise. We experimentally established correlation as the most convenient support-precision trade-off and set it as the default measure for traditional contrast sets.

    Note that a different quality measure is used for regression/survival data, where the procedure aims at extracting, within groups, subsets of examples uniform w.r.t. the label attribute/survival prognosis (see Subsection 3.2 for details).
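The correlation measure can be written down directly from the confusion matrix counts. The following Python sketch is illustrative only: the function name and the convention of returning 0 for a degenerate (zero) denominator are assumptions, not taken from RuleKit.

```python
import math


def correlation(p: float, n: float, P: float, N: float) -> float:
    """Correlation between contrast set coverage and group membership.

    p, n: covered positive / negative examples,
    P, N: all positive / negative examples.
    Returns 0.0 when the denominator vanishes (assumed convention).
    """
    denominator = math.sqrt(P * N * (p + n) * (P - p + N - n))
    if denominator == 0:
        return 0.0
    return (p * N - P * n) / denominator
```

A contrast set covering all positives and no negatives reaches the upper bound of 1, one covering all negatives and no positives reaches -1, and one whose precision equals the a priori group precision evaluates to 0, in line with the confirmation property described above.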

A single separate and conquer pass consists of two steps: growing and pruning. As shown in Algorithm 2, the former starts from an empty premise and adds conditions iteratively, each time selecting the one optimizing quality (correlation for classical contrast sets, label consistency for regression, survival function consistency for survival problems). Conditions which cause the contrast set to violate the minsupp-new or minsupp-all constraints are discarded. Growing stops when there are no more conditions fulfilling the covering requirements. Since, in principle, a quality measure used for condition evaluation rewards covered positives and penalizes covered negatives, the ratio of negative to positive supports decreases during the growing stage. Therefore, the max-neg2pos requirement is verified only for the fully grown contrast set.

If growing produces an empty contrast set (no conditions fulfilling minsupp-new or minsupp-all) or the max-neg2pos constraint is violated, the contrast set is discarded and the current separate and conquer pass ends. Otherwise, pruning starts. This procedure, in contrast to growing, removes conditions from the premise, each time making the elimination that leads to the largest improvement in the rule quality, with the restriction that the max-neg2pos requirement cannot be violated. The iteration stops when no such elimination exists.

Note that in the original contrast set definition (Bay and Pazzani, 2001), the difference between supports was controlled rather than their ratio. This, however, would filter out contrast sets with very good discriminating capabilities but moderate support. For instance, if we set the minimum support difference to 30%, the contrast set that covers 80% of positives and 50% of negatives would be accepted, but the one covering 25% of positives and no negatives would not. As we believe the latter is also an interesting contrast set, we decided to control the ratio of supports with max-neg2pos instead of their difference. This way, the required support difference becomes dependent on the support value itself (which is controlled by minsupp-all).
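The two acceptance criteria can be compared on the numbers from this example. The sketch below is a minimal illustration; the function names are hypothetical, and supports are given in percent to keep the arithmetic exact.

```python
def passes_support_difference(pos_supp: float, neg_supp: float, min_diff: float) -> bool:
    """Original criterion: positive support exceeds negative support by at least min_diff."""
    return pos_supp - neg_supp >= min_diff


def passes_neg2pos(pos_supp: float, neg_supp: float, max_neg2pos: float = 0.5) -> bool:
    """Ratio criterion: negative support is at most max_neg2pos times the positive support."""
    return pos_supp > 0 and neg_supp / pos_supp <= max_neg2pos


# Supports in percent: (positive, negative)
general_but_impure = (80, 50)
specific_but_pure = (25, 0)

for pos_supp, neg_supp in (general_but_impure, specific_but_pure):
    print(
        pos_supp,
        neg_supp,
        passes_support_difference(pos_supp, neg_supp, min_diff=30),
        passes_neg2pos(pos_supp, neg_supp),
    )
```

Under the 30% difference criterion only the first contrast set is accepted. Under the ratio criterion with the default max-neg2pos of 0.5 the outcome is reversed: the ratio 50/80 = 0.625 exceeds the limit, while the pure contrast set with ratio 0 is kept.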

Algorithm 1 Separate-and-conquer contrast set induction.
Input: D: data set consisting of positive (P) and negative (N) examples, minsupp-all: minimum positive support, minsupp-new: minimum positive support for yet uncovered examples, max-neg2pos: maximum ratio of negative to positive support, max-passes: maximum number of sequential covering passes for a single minsupp-all, quality: quality measure that drives the learning process
Output: CS: contrast sets.

    CS ← ∅                                              ▷ start from an empty contrast set collection
    for minsupp-all ∈ {0.8, 0.5, 0.2, 0.1} do
        reset attribute penalties
        for i ← 1 to max-passes do                      ▷ multiple passes
            CS_i ← ∅                                    ▷ contrast sets for current pass
            U ← P                                       ▷ set of uncovered positives
            repeat
                cs ← Grow(D, U, minsupp-all, minsupp-new, quality)
                if cs ≠ ∅ then
                    if negative-to-positive support ratio of cs ≤ max-neg2pos then
                        cs ← Prune(cs, D, max-neg2pos, quality)
                        CS_i ← CS_i ∪ {cs}
                        U ← U \ Cov(D, cs)
                        update attribute penalties with cs
                    else
                        cs ← ∅                          ▷ discard contrast set, end pass
            until cs = ∅ or |U| / |P| < minsupp-new     ▷ end current pass
            if CS_i ⊆ CS then                           ▷ no more new contrast sets
                break
            CS ← CS ∪ CS_i                              ▷ add contrast sets from current pass

Algorithm 2 Growing a contrast set.
Input: D: training data set, U: set of uncovered positive examples, minsupp-all: minimum positive support, minsupp-new: minimum positive support for yet uncovered examples, quality: quality measure that drives the learning process
Output: cs: grown contrast set.

    function Grow(D, U, minsupp-all, minsupp-new, quality)
        cs ← ∅                                          ▷ start from an empty premise
        repeat                                          ▷ iteratively add conditions
            c_best ← ∅                                  ▷ current best condition
            q_best ← −∞, cov_best ← 0                   ▷ best quality and coverage
            D_cs ← Cov(D, cs)                           ▷ examples covered by cs
            for each candidate condition c built from D_cs do
                cs_c ← cs ∧ c                           ▷ extended contrast set
                D_c ← Cov(D, cs_c)                      ▷ updated coverage
                                                        ▷ verify support constraints
                if |D_c ∩ P| / |P| ≥ minsupp-all and |D_c ∩ U| / |P| ≥ minsupp-new then
                    q ← Evaluate(D, cs_c, quality)      ▷ evaluate contrast set
                    if q > q_best or (q = q_best and |D_c| > cov_best) then
                        c_best ← c, (q_best, cov_best) ← (q, |D_c|)
            cs ← cs ∧ c_best                            ▷ extend contrast set with best condition
        until c_best = ∅
        return cs
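The growing step and a single covering pass can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the RuleKit-CS implementation: attributes are categorical only, pruning and attribute penalties are omitted, and the helper names (`covers`, `grow`, `covering_pass`) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Example = Dict[str, str]
Condition = Tuple[str, str]  # (attribute, value)


def covers(cs: List[Condition], x: Example) -> bool:
    return all(x[attr] == value for attr, value in cs)


def correlation(p: int, n: int, P: int, N: int) -> float:
    denom = math.sqrt(P * N * (p + n) * (P - p + N - n))
    return (p * N - P * n) / denom if denom else 0.0


def grow(pos, neg, uncovered, minsupp_all, minsupp_new) -> List[Condition]:
    """Greedily add the best condition until no candidate meets the support constraints."""
    cs: List[Condition] = []
    P, N = len(pos), len(neg)
    while True:
        best, best_q, best_cov = None, -math.inf, 0
        used = {attr for attr, _ in cs}  # each attribute appears at most once
        covered_pos = [x for x in pos if covers(cs, x)]
        candidates = {(a, v) for x in covered_pos for a, v in x.items() if a not in used}
        for cond in sorted(candidates):
            extended = cs + [cond]
            p = sum(covers(extended, x) for x in pos)
            n = sum(covers(extended, x) for x in neg)
            p_new = sum(covers(extended, x) for x in uncovered)
            if p / P < minsupp_all or p_new / P < minsupp_new:
                continue  # support constraints violated
            q = correlation(p, n, P, N)
            if q > best_q or (q == best_q and p + n > best_cov):
                best, best_q, best_cov = cond, q, p + n
        if best is None:
            return cs
        cs.append(best)


def covering_pass(pos, neg, minsupp_all=0.1, minsupp_new=0.1, max_neg2pos=0.5):
    """One separate-and-conquer pass (no pruning, no penalties)."""
    P, N = len(pos), len(neg)
    uncovered, result = list(pos), []
    while len(uncovered) / P >= minsupp_new:
        cs = grow(pos, neg, uncovered, minsupp_all, minsupp_new)
        if not cs:
            break  # nothing could be grown, end pass
        pos_supp = sum(covers(cs, x) for x in pos) / P
        neg_supp = sum(covers(cs, x) for x in neg) / N if N else 0.0
        if neg_supp > max_neg2pos * pos_supp:
            break  # discard contrast set, end pass
        result.append(cs)
        uncovered = [x for x in uncovered if not covers(cs, x)]
    return result
```

Because pruning is left out, the contrast sets returned by this sketch may contain conditions that do not improve quality; in the described algorithm such conditions would be candidates for removal in the pruning step.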

3.2 Regression and survival data

The problem of contrast set mining can be generalized to regression and survival data sets. For this purpose, we assume the presence of a label variable apart from the group attribute. For regression, the label is continuous, while in survival problems, it represents a binary censoring status with 0 and 1 representing censored (event-free) and non-censored (event-subjected) observations, respectively. The status variable is accompanied by a survival time, i.e., the time of the observation for event-free examples or the time before the occurrence of an event.

The analysis of regression and survival data sets by RuleKit-CS differs from classical contrast set mining, which is based only on the group attribute. Instead of directly diversifying supports across groups by optimizing the correlation measure, our algorithm takes the label into account. In particular, for regression problems it establishes the absolute difference between the mean label of the positive examples covered by a contrast set $cs$, denoted $\bar{y}_{pos}(cs)$, and the mean label of all examples covered by $cs$, denoted $\bar{y}_{all}(cs)$. The quality measure to be maximized is then defined as the opposite of this difference:

$q(cs) = -\left| \bar{y}_{pos}(cs) - \bar{y}_{all}(cs) \right|$   (4)

Consequently, the cost of covering a negative example by a CS in regression data depends on its label and is greater for negatives whose labels deviate more from those of the positives constituting the contrast set. Therefore, the algorithm tends to extract, within a group of interest, subsets of examples uniform w.r.t. the label.
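The regression quality measure can be sketched in a few lines of Python. The function name and the input layout (two lists with the labels of covered positive and covered negative examples) are assumptions made for illustration.

```python
from statistics import mean
from typing import List


def regression_quality(pos_labels: List[float], neg_labels: List[float]) -> float:
    """Negated absolute difference between the mean label of covered positives
    and the mean label of all covered examples (higher is better, 0 is the maximum)."""
    if not pos_labels:
        raise ValueError("a contrast set must cover at least one positive example")
    return -abs(mean(pos_labels) - mean(pos_labels + neg_labels))
```

For positives labelled 10 and 12, covering one negative labelled 11 leaves the quality at its maximum of 0, a negative labelled 20 lowers it to -3, and a negative labelled 50 lowers it to -13, which shows how the cost of a covered negative grows with the deviation of its label.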

Survival data sets are handled analogously, but instead of taking the label as an outcome, the algorithm considers Kaplan-Meier survival function estimates (Kaplan and Meier, 1958). The difference between the survival curves of positive versus all examples covered by a CS is established with the use of a log-rank test. Since the aim is to minimize this difference, the algorithm maximizes the opposite of the test statistic, here denoted $\mathrm{LR}(cs)$:

$q(cs) = -\mathrm{LR}(cs)$   (5)

As a result, the procedure identifies within a group of interest subsets of examples uniform w.r.t. the survival prognosis.
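A minimal sketch of such a survival quality measure is given below. It uses the standard two-sample log-rank chi-square statistic and applies it literally to "positives versus all covered examples", as the text describes; how RuleKit-CS handles the overlap between the two samples is not specified here, so this reading is an assumption, as are the function names and the `(time, event)` tuple layout.

```python
from typing import List, Tuple

Observation = Tuple[float, int]  # (survival time, status: 1 = event, 0 = censored)


def logrank_statistic(a: List[Observation], b: List[Observation]) -> float:
    """Two-sample log-rank chi-square statistic (one degree of freedom)."""
    event_times = sorted({t for t, e in a + b if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x, _ in a if x >= t)  # at risk in sample a
        n_b = sum(1 for x, _ in b if x >= t)  # at risk in sample b
        d_a = sum(1 for x, e in a if x == t and e == 1)  # events in a at time t
        d_b = sum(1 for x, e in b if x == t and e == 1)  # events in b at time t
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def survival_quality(pos: List[Observation], neg: List[Observation]) -> float:
    """Negated log-rank statistic between covered positives and all covered examples."""
    return -logrank_statistic(pos, pos + neg)
```

When a contrast set covers no negatives, both samples coincide and the quality reaches its maximum of 0; covering negatives whose survival differs from that of the positives makes the quality negative.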

3.3 Contrast set diversity

The crucial feature of contrast sets is their descriptive capability, i.e., the ability to represent interesting and unknown relationships between conditional attributes and a group label. In particular, we want different contrast sets to represent different concepts in the attribute space. Consequently, contrast sets which, at the same time, contain similar attributes and cover a similar group of examples as previously generated contrast sets can be considered redundant and should be avoided. As the classical separate and conquer algorithm aims at maximizing the quality measure while disregarding the aforementioned aspects, we introduce a novel heuristic which prevents redundancy in the generated contrast sets.

The mechanism consists of two components: (i) the penalty for reusing already utilized attributes, which can be compensated by (ii) the reward for covering previously uncovered examples. The components are incorporated into quality evaluation of the contrast set candidate according to the formula:

(6)

with and representing the input and the modified quality, respectively, and being the penalty strength. The penalty for the current contrast set reflects to what extent the attributes employed by contrast sets are reused in . For each attribute we define an attribute penalty as

(7)

with returning if contrast set contains attribute and otherwise. The penalty for contrast set is a sum of attribute penalties for all attributes contained in . For instance, let us assume that there are four conditional attributes . After inducing two contrast sets and containing attributes and , respectively, the attribute penalties equal to , , , . The contrast set candidate built upon attributes would be penalized with . The proposed penalization scheme has a property of the tabu search. With a small number of already used attributes at the beginning, there is a strong pressure for selecting different features. This results in less redundant and, potentially, more interesting contrast sets. As consecutive contrast sets are induced, the attribute penalties become more even reducing the effect and allowing algorithm to cover remaining positive examples.

The penalty alone does not take into account the fact that a contrast set built upon already used attributes may still be interesting as long as it covers previously uncovered examples. For this purpose, the reward was introduced. The value of the reward depends on the contribution of previously uncovered positive examples to all positives covered by the contrast set. The reward decreases linearly from full penalty compensation, when the contrast set covers only new examples, down to no compensation at some boundary value of this contribution. Note that the proposed penalty-reward scheme does not allow the modified quality to exceed the initial value. The penalization procedure is incorporated into both the growing and pruning stages and handles multiple occurrences of an attribute in a contrast set. For example, if a candidate already contains a condition on a numerical attribute, an attempt to close the interval by adding a second condition on the same attribute does not affect the penalty, as the component for this attribute has already been included.

In RuleKit-CS, the penalty is reset for every investigated value of minsupp-all. Consequently, it does not prevent the induction of similar contrast sets across different minsupp-all values. Therefore, at the very end of the induction, we quantify the redundancy of every contrast set. For this purpose, let us define the similarity between two contrast sets as:

(8)

with being the Jaccard index, and representing the set of attributes. The redundancy of contrast set is defined as a similarity to its most similar predecessor:

(9)
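The redundancy computation can be sketched as follows. The way the two Jaccard indices are combined is an assumption made for this illustration (a product of the Jaccard index over attribute sets and the Jaccard index over covered examples); the class and function names, as well as the attribute names in the usage note, are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class ContrastSet(NamedTuple):
    attributes: FrozenSet[str]  # attributes used in the premise
    covered: FrozenSet[int]     # identifiers of covered examples


def jaccard(a: FrozenSet, b: FrozenSet) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(cs1: ContrastSet, cs2: ContrastSet) -> float:
    """Assumed form: attribute overlap multiplied by coverage overlap."""
    return jaccard(cs1.attributes, cs2.attributes) * jaccard(cs1.covered, cs2.covered)


def redundancy(contrast_sets: List[ContrastSet]) -> List[float]:
    """Similarity of each contrast set to its most similar predecessor (0 for the first)."""
    return [
        max((similarity(cs, prev) for prev in contrast_sets[:i]), default=0.0)
        for i, cs in enumerate(contrast_sets)
    ]
```

Under this assumed form, an exact duplicate of an earlier contrast set gets redundancy 1, while a contrast set that shares no attributes with any predecessor gets redundancy 0 regardless of which examples it covers. A redundancy threshold such as 50% then keeps only the contrast sets whose value does not exceed 0.5.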

3.4 Synthetic example

The synthetic data set consists of 320 examples from two groups (70 red, 250 blue) described by two numerical attributes and one categorical attribute. As presented in Figure 1, the elements from group red, which was selected as the group of interest, are arranged in two clusters in the attribute space. The left cluster is almost perfectly separable from blue examples by the categorical attribute, and well separable when using the two numerical attributes together. The right cluster is perfectly separable with the categorical attribute and poorly separable when using one or both numerical attributes. All the analyses presented below were performed for minsupp-all = 0.1, which is the smallest among the values investigated by RuleKit-CS (the algorithm aggregates results from minsupp-all ∈ {0.8, 0.5, 0.2, 0.1}).

Figure 1: Synthetic data set with 320 examples from two groups (red and blue) described by two numerical attributes and one categorical attribute.

As a preliminary step, we applied a single separate and conquer pass without attribute penalties to the data set, as in classification rule induction. This rendered the following results for the red group (the numbers of positive and negative examples are given on the right):

cs-1:

cs-2:

The contrast sets describe the right (cs-1) and left (cs-2) red clusters with the use of the nominal attribute and are characterized by very good quality. Since the number of uncovered examples was lower than the minsupp-new threshold, the sequential covering stopped at this point without attempting to cover the remaining instances. The contrast set describing the left cluster with the two numerical attributes remained undiscovered, which was expected as only one sequential covering pass was run.

In order to learn more contrast sets, we applied 5 covering passes with attribute penalization to ensure the CS diversity. The penalty strength was set to , while the reward for covering different examples was disabled. The red group was described by the contrast sets below (for convenience, we provide pass numbers for each CS):

cs-3:

cs-4:

cs-5:

cs-6:

cs-7:

As previously, the first (highest quality) contrast set is the one describing the right cluster with the categorical attribute. The second best candidate was the contrast set describing the left cluster with the same attribute. However, as this was the only already used attribute, its penalty resulted in a quality modifier of 0.5. Therefore, the algorithm preferred cs-4, which describes the left cluster with the use of the two numerical attributes. Contrast sets cs-3 and cs-4 left 20 uncovered red examples. As this was more than the minsupp-new threshold, an attempt was made to cover the remaining examples with cs-5. This contrast set describes the entire left cluster and a substantial part of the right cluster, but at the same time it covers a large number of negative examples, so it is characterized by low quality. After that, the second sequential covering pass started with the attribute penalties accumulated in the first pass. This led to the contrast set cs-6, a duplicate of cs-3. This contrast set was filtered out in the post-processing step. However, in order to reduce the chance of generating the same CS later on, it contributed to the attribute penalty. The next candidate was, as in the first pass, the contrast set describing the left cluster with the categorical attribute. Due to the lower penalty, the contrast set was accepted (cs-7), finishing the second sequential covering pass. The third pass did not introduce any novel contrast sets, fulfilling the stop condition.

Eventually, the multiple sequential covering passes together with attribute penalties allowed discovering all interesting contrast sets. However, discarding the information about examples covered by CS in the penalization leads to the undesired situation where very good sets (cs-7) are learned after those of low quality (cs-5). To prevent this, we enabled rewards (with saturation set at ) which resulted in the following contrast sets:

cs-8:

cs-9:

cs-10:

cs-11:

cs-12:

As cs-9 covered different examples than cs-8, the penalty for reusing the categorical attribute was entirely compensated by the reward. The second pass started from the last interesting contrast set, cs-10, based on the two numerical attributes. This was followed by a duplicate of cs-8 and the low quality contrast set cs-12 (same as cs-5), induced in order to cover the remaining red examples.

Clearly, the multiple sequential covering passes combined with attribute penalties and rewards rendered the most convenient list of contrast sets. Therefore, all these mechanisms are enabled by default in the presented algorithm. One must keep in mind, though, that rewards work effectively only in the first sequential covering pass, as few uncovered observations (at most minsupp-new) are left for the following passes.

4 Results

4.1 Experimental setting

The experiments were performed on 50 classification data sets downloaded from the UCI Machine Learning Repository (Dua and Graff, 2017) with the class label being used as the group attribute (the sets are available in the RuleKit repository).

RuleKit-CS was compared with two well-known algorithms capable of inducing contrast sets: the CN2-SD (Lavrač et al., 2004) implementation from the Orange 3 package (Demšar et al., 2013) and pysubgroup (Lemmerich and Becker, 2018). For completeness, traditional classification rules induced by the RuleKit suite (Gudyś et al., 2020) were also included in the analysis. CN2-SD and pysubgroup were executed with default settings, while RuleKit was configured so that parameters having their equivalents in RuleKit-CS were set as in the latter (among others, correlation as the quality measure). This was to limit the effect of different parameters on the induction and to investigate the novel algorithmic features introduced in the presented method. As RuleKit-CS requires the negative support of a contrast set to be at most half of its positive support (max-neg2pos = 0.5), the CN2-SD, pysubgroup, and RuleKit results were subject to analogous filtering. All the experiments were performed in the one vs all scheme (a group against all the others).

4.2 Comparison with other algorithms

Executing multiple covering passes with attribute penalties is a crucial feature of RuleKit-CS that makes the separate and conquer approach suitable for contrast set induction. For this reason, as an initial step, we investigated the effect of attribute penalization in RuleKit-CS against the competitors.

Figure 2 presents the algorithms' performance, i.e., the number of contrast sets together with their average support and precision, summarized (averaged) over all 50 data sets. As the redundancy of individual contrast sets was established (Equation 9), we also show the performance indicators of contrast sets not exceeding the assumed redundancy thresholds: 70%, 50%, and 20%.

Figure 2: The analysis of penalty strength in RuleKit-CS against other contrast set induction algorithms: CN2-SD, pysubgroup, and RuleKit. The comparison includes the following performance indicators averaged over 50 data sets: (a) number of contrast sets, (b) average CS support, (c) average CS precision. The results for different contrast set redundancy thresholds (all CS, 70%, 50%, 20%) are represented with increasing brightness.

The results of CN2-SD, with an average CS support and precision of 71% and 75%, respectively, were considered our baseline. Pysubgroup produced slightly fewer contrast sets, which were substantially less general (58% support) and slightly less precise (73%). Importantly, both methods generated CS from the entire redundancy range. Very different behaviour was observed for RuleKit, which produced several times fewer contrast sets than CN2-SD or pysubgroup, with almost no redundancy. This was due to the fact that the algorithm performed a single sequential covering pass, so the observations were usually covered by only one rule. While the support of the RuleKit contrast sets fell between CN2-SD and pysubgroup (67%), the precision was the best among all investigated packages (82%), which was expected for an algorithm specialized in classification rule induction.

The results of RuleKit-CS strongly depended on the penalization scheme. When penalties were disabled, the algorithm rendered slightly more contrast sets than RuleKit. This was caused by aggregating results from four investigated values of the minsupp-all parameter, i.e., 0.8, 0.5, 0.2, and 0.1, while RuleKit performed only one covering pass. The resulting contrast sets were characterized by superior support (above 72%) and precision only slightly inferior to that of RuleKit (80%). Running additional passes for each minsupp-all had no effect in this scenario because, without penalties, all of them rendered duplicates of the contrast sets from the initial pass. Increasing the penalty strength enforced different induction paths in the consecutive covering passes for a given minsupp-all. As a result, the average number of contrast sets increased and eventually saturated at a level slightly above CN2-SD. As for the support, for small penalties it was superior by a small margin to the non-penalty variant and started to deteriorate noticeably for larger penalties. Precision behaved differently: it decreased consistently with the growing penalty. After the analysis of Figure 2, we selected the penalty strength offering the best trade-off between the number of contrast sets (slightly inferior to that of CN2-SD) and their quality (support and precision greater than CN2-SD by approx. 2 pp.).

The aforementioned RuleKit-CS configuration was subject to further analysis. To discard less original contrast sets, the redundancy threshold was set to 50%. For each data set we established: (i) the number of contrast sets generated by CN2-SD and pysubgroup relative to RuleKit-CS, (ii) the differences in the average support and precision between the competitors and our algorithm. As Figure 3a shows, for almost of cases, CN2-SD rendered more contrast sets than our algorithm (median of the relative count equal to ), while the pysubgroup advantage was observed in less than half of the data sets (median of the relative count equal to ). When considering average supports (Figure 3b), RuleKit-CS was superior to both competitors, in particular to pysubgroup, which rendered less general contrast sets in 80% of the data sets. As for the precision, the advantage of our software was smaller, though still visible.

Figure 3: The comparison of the contrast set induction algorithms on the individual data sets (represented as points) at redundancy threshold 50%. The chart presents: (a) the number of contrast sets generated by CN2-SD and pysubgroup relative to RuleKit-CS, (b) the differences of the average support and precision between CN2-SD/pysubgroup and RuleKit-CS.

In Figure 4 we present detailed results for selected data sets. All the contrast sets were visualized on the support/precision plane with markers representing the algorithms. Additionally, each package is assigned a colour, which indicates how closely a contrast set induced by a given algorithm resembles its most similar counterparts generated by the two other methods. For instance, a RuleKit-CS contrast set which does not resemble any of the CN2-SD contrast sets and is very similar to one of the pysubgroup contrast sets is indicated by a circle which is half light green and half dark blue. The similarity was established analogously to the redundancy (Equation 9).

As shown in Figure 4, the relations between CS induced by the investigated algorithms varied significantly across data sets. In the case of the sonar data set, each package rendered unique contrast sets dissimilar to those of the competitors. CN2-SD and RuleKit-CS induced a similar number of CS (39 and 46, respectively) with support ranging from 50 to 80% and precision from 70 to 100%. Interestingly, RuleKit-CS produced the Pareto-best contrast sets: at a given support level they were characterized by the best precision, and vice versa. Against this background, pysubgroup was noticeably worse, with only 20 contrast sets spanning the 20–40% support and 70–90% precision ranges.
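The notion of Pareto-best contrast sets used above can be illustrated with a short sketch. It assumes that each contrast set is summarized by a (support, precision) pair and that higher values are better for both; the function name `pareto_front` and the sample values are hypothetical and are not taken from the experiments.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (support, precision), both "higher is better"


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point.

    A point is dominated when another point is at least as good on both
    coordinates and strictly better on at least one of them.
    """
    front = []
    for i, (s, p) in enumerate(points):
        dominated = any(
            s2 >= s and p2 >= p and (s2 > s or p2 > p)
            for j, (s2, p2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((s, p))
    return sorted(front)


# Made-up (support, precision) pairs, for illustration only.
contrast_sets = [(0.50, 1.00), (0.60, 0.90), (0.55, 0.85), (0.80, 0.70), (0.70, 0.70)]
print(pareto_front(contrast_sets))
# [(0.5, 1.0), (0.6, 0.9), (0.8, 0.7)]
```

In this toy input, (0.55, 0.85) is dominated by (0.60, 0.90) and (0.70, 0.70) by (0.80, 0.70), so only three contrast sets remain on the front.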

The situation was very different for the ionosphere data set, where our algorithm rendered 9 contrast sets, significantly fewer than the competitors (CN2-SD: 25, pysubgroup: 15). When investigating particular contrast sets, one can see that CN2-SD and RuleKit-CS produced, respectively, 2 and 5 very good contrast sets (support and precision above 90%) with mild similarity between algorithms. Additionally, there were a few contrast sets with moderate support (50–60%) and good precision (85–95%) that were very similar across all the algorithms. The remaining CS were induced mostly by CN2-SD and were unique, with few exceptions that exhibited moderate similarity to both pysubgroup and RuleKit-CS contrast sets at the same time.

Figure 4: The visualization of individual contrast sets on support/precision plane for selected data sets at redundancy threshold 50%. For every CS, its resemblances to the most similar contrast sets induced by two other algorithms are represented by colouring of the left and right half of the marker (the darker the color, the higher the similarity).

4.3 Case study

The analysis was performed on the Statlog (Heart) data set from the UCI Machine Learning Repository. The set consists of 270 instances described by 7 numerical and 6 nominal attributes. A binary class label indicating the presence (120) or absence (150) of a heart disease was used as a group attribute.

CN2-SD rendered 54 contrast sets, of which 47 had a positive support at least twice as large as the negative one. At the same time, this condition was met by all 20 pysubgroup and all 104 RuleKit-CS contrast sets (the latter controls this requirement during induction). The average support and precision of the contrast sets are presented in Table 1. An interesting observation concerns the number of examples not covered by any contrast set or covered by one CS only. While for CN2-SD and RuleKit-CS there were no or few such cases, pysubgroup left 19 uncovered examples and 6 examples covered by one contrast set.
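The two checks described above, the support-ratio condition and the count of examples covered by zero or one contrast set, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `passes_support_ratio` and `coverage_histogram` are hypothetical, supports are assumed to be expressed on the same scale in both groups, and each contrast set is represented simply by the set of example indices it covers.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def passes_support_ratio(pos_support: float, neg_support: float, ratio: float = 2.0) -> bool:
    """True if the positive support is at least `ratio` times the negative one."""
    return pos_support >= ratio * neg_support


def coverage_histogram(n_examples: int, covered: Iterable[Set[int]]) -> Dict[str, int]:
    """Count examples covered by exactly 0 and exactly 1 contrast set."""
    times_covered = Counter()
    for example_ids in covered:
        times_covered.update(example_ids)
    per_example = Counter(times_covered.get(i, 0) for i in range(n_examples))
    return {"0-cov": per_example[0], "1-cov": per_example[1]}


# Toy data: 6 examples and 3 contrast sets given by the indices they cover.
print(passes_support_ratio(0.7, 0.3))                            # True
print(coverage_histogram(6, [{0, 1, 2}, {1, 2, 3}, {2}]))        # {'0-cov': 2, '1-cov': 2}
```

In the toy input, examples 4 and 5 are uncovered, while examples 0 and 3 are covered by exactly one contrast set.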

In order to investigate the most interesting contrast sets, the filtering at the redundancy threshold 50% was then performed. For CN2-SD and pysubgroup it retained approximately of CS (30 and 14, respectively), while in the case of RuleKit-CS it decreased the number of contrast sets four times (to 24). The average support of contrast sets retained for the presented algorithm (71.9%) was noticeably larger than that of competitors (61.9% for CN2-SD, 64.2% for pysubgroup). The precision of all methods was, on the other hand, similar and ranged from 77.2% for CN2-SD to 80.0% for pysubgroup (Table 1). Importantly, the filtering did not affect the number of uncovered examples. The increase from 1 to 6 was only observed in the number of examples covered by exactly one RuleKit-CS contrast set.

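| Algorithm | # (initial) | Supp. (initial) | Prec. (initial) | 0-cov (initial) | 1-cov (initial) | # (filtered) | Supp. (filtered) | Prec. (filtered) | 0-cov (filtered) | 1-cov (filtered) |
|---|---|---|---|---|---|---|---|---|---|---|
| CN2-SD | 47 | 64.6 | 77.4 | 0 | 0 | 30 | 61.9 | 77.2 | 0 | 0 |
| pysubgroup | 20 | 61.8 | 81.3 | 19 | 6 | 14 | 64.2 | 80.0 | 19 | 6 |
| RuleKit-CS | 104 | 68.7 | 80.0 | 1 | 1 | 24 | 71.9 | 77.4 | 1 | 6 |

Table 1: Performance of the contrast set induction algorithms on the Statlog (Heart) data set before (initial) and after (filtered) redundancy filtering at threshold 50%. Support and precision are given in percent. Seven CN2-SD contrast sets whose positive support was less than twice the negative one were removed prior to the analysis. Columns "0-cov" and "1-cov" indicate the number of examples covered by 0 and 1 contrast set, respectively.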

When we investigated the presence of attributes in the resulting contrast sets, the most widespread feature was thal, which appeared in 11 CN2-SD (3rd place), 8 pysubgroup (1st place), and 10 RuleKit-CS (1st place) contrast sets. The attribute indicates the presence of thalassemia (3: normal; 6: fixed defect; 7: reversible defect) and is strongly correlated with a heart disease (Kremastinos et al., 2010). As previously, contrast sets induced by RuleKit-CS were characterized by significantly higher average support (70.1%) than the competitors (CN2-SD: 63.9%, pysubgroup: 62.2%). Our method was also the best in terms of precision (82.9%), although the difference was less pronounced (CN2-SD: 78.5%, pysubgroup: 82.2%). The most comprehensible CS were generated by pysubgroup (1.8 conditions on average), followed by RuleKit-CS (2.7) and CN2-SD (3.2).

Interestingly, the contrast sets based upon the thal attribute differed noticeably across algorithms. E.g., as presented in Table 2, all CS describing the present group induced by pysubgroup shared a condition that was absent in the competitors. More resemblances could be observed in the contrast sets induced by CN2-SD and RuleKit-CS. E.g., the first contrast sets reported by these methods shared two conditions. The remaining condition was based on the chest attribute, though the value sets were different, probably due to different support-precision trade-offs in CN2-SD and RuleKit-CS.

No. Contrast sets for present group
CN2-SD
1 84 20
2 65 ⅇ6
3 67 27
4 87 31
5 68 34
pysubgroup
1 79 25
2 63 ⅇ7
3 68 19
4 68 23
RuleKit-CS
1 67 ⅇ9
2 71 13
3 82 25
4 86 24
Table 2: The analysis of Statlog (Heart) data set. After applying redundancy filtering at threshold 50%, contrast sets describing present group and containing thal attribute were selected.

4.4 Regression and survival contrast sets

The ability to induce contrast sets on regression and censored data is a unique feature of RuleKit-CS, not provided by any other algorithm. Therefore, we investigated the differences between such contrast sets and the traditional contrast sets rendered by our approach. The experiments were performed on 48 regression and 35 survival data sets from the UCI repository. In the case of the regression problems, the examples were divided into two groups: (1) those with a label lower than the median and (2) the remaining ones. For survival data the groups were defined as follows: (1) observations that were subject to an event with survival time lower than the median and (2) observations with survival time greater than or equal to the median (both censored and uncensored). The remaining examples were removed from the sets. When inducing traditional contrast sets, the label and survival time/status attributes were discarded. All the data sets are available at the RuleKit repository.
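The group definitions above can be written down as a short sketch. It assumes that the median is computed over all observations of a data set and that an event is encoded as status 1 (censoring as 0); the function names `regression_groups` and `survival_groups` are hypothetical and do not refer to the RuleKit API.

```python
from statistics import median
from typing import List, Optional, Sequence


def regression_groups(labels: Sequence[float]) -> List[int]:
    """Group 1: label below the median; group 2: all remaining examples."""
    m = median(labels)
    return [1 if y < m else 2 for y in labels]


def survival_groups(times: Sequence[float], events: Sequence[int]) -> List[Optional[int]]:
    """Group 1: event observed before the median survival time.
    Group 2: survival time >= median (censored or not).
    None: censored before the median, i.e., removed from the data set.
    """
    m = median(times)
    groups: List[Optional[int]] = []
    for t, e in zip(times, events):
        if t >= m:
            groups.append(2)
        elif e == 1:
            groups.append(1)
        else:
            groups.append(None)
    return groups


# Toy data, for illustration only.
print(regression_groups([1.0, 3.0, 2.0, 4.0]))                 # [1, 2, 1, 2]
print(survival_groups([2, 5, 7, 9, 12], [1, 0, 1, 0, 1]))      # [1, None, 2, 2, 2]
```

In the survival example the second observation is censored before the median time (7), so its outcome relative to the median is unknown and it is dropped.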

In Table 3 we summarize the RuleKit-CS contrast sets induced in the mode dedicated to regression/survival data and the contrast sets based on the group only. The induction was followed by redundancy filtering at threshold 50%. As one can see, the dedicated variant rendered twice as many contrast sets as the classical mode, with the average support lower by approximately 30 percentage points. On the other hand, the precision was higher by 7 (regression) and 6 (survival) pp. The numbers of observations not covered by any contrast set were very similar for both algorithm variants.

To get a deeper insight into the differences between regression/survival and classical contrast sets, we investigated in detail the Bone marrow transplant: children (Sikora et al., 2019) survival data set. The set describes pediatric patients with hematologic diseases who were subject to unmanipulated allogeneic unrelated donor hematopoietic stem cell transplantation. The original set consists of 187 examples characterized by 26 nominal and 9 numeric attributes (excluding survival time and status). After the previously described division w.r.t. the survival time, two groups of sizes 80 and 94 were produced, while 13 examples were removed.

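| Problem | # (regression/survival mode) | Supp. (regression/survival mode) | Prec. (regression/survival mode) | 0-cov (regression/survival mode) | # (classical mode) | Supp. (classical mode) | Prec. (classical mode) | 0-cov (classical mode) |
|---|---|---|---|---|---|---|---|---|
| Regression | | | | | | | | |
| Survival | | | | | | | | |

Table 3: Comparison of the dedicated (regression/survival) RuleKit-CS mode with the classical (group-only) variant on 48 regression and 35 survival data sets. The contrast sets were subject to redundancy filtering at threshold 50%. Column "0-cov" contains the number of examples not covered by any contrast set.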

Running the RuleKit-CS mode dedicated to censored data induced 54 contrast sets, 48 of which fell below the 50% redundancy threshold. The average support and precision equaled 23.1% and 90.6%, respectively. The classical mode produced 52 contrast sets, of which only 22 fulfilled the redundancy requirements. This was due to the substantially higher support (62.1%), which increased the chance of covering the same examples with the same attributes. The precision was, on the other hand, lower (81.3%). Clearly, the indicators followed the general tendency presented in Table 3.

In Figure 5 we show survival function estimates for contrast sets (after redundancy filtering) describing group (2), i.e., the patients with survival time greater than or equal to the median (both censored and uncensored). As shown by the bold line, the examples from this group have a very good survival prognosis. In contrast, the examples from group (1), i.e., the patients who died in a time shorter than the median, are characterized by a pessimistic prognosis. This bimodality of survival between groups is crucial from the point of view of our algorithm. In particular, the cost of covering a negative example in the survival RuleKit-CS mode is much larger than in the classical mode. As a result, the former induced sets of higher precision and noticeably lower support, which translated to substantially more CS in total (32 vs 9), as more contrast sets are needed to complete the covering passes.

As presented in Figure 5, all survival curves in both algorithm modes were below the estimate corresponding to the entire group. In some cases this was due to covering negative examples, which "pulled down" the curves. However, in the survival mode there were 7 contrast sets with 100% precision, thus not affected by negatives. One of the corresponding curves saturated at a survival probability below 0.8, which differed significantly from the average group estimate. This contrast set, representing potentially interesting knowledge, was not identified by the classical mode.
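For readers who want to reproduce curves of this kind, a survival function estimate for the examples covered by a contrast set can be obtained with the Kaplan-Meier estimator (Kaplan and Meier, 1958). The sketch below is a plain-Python illustration and not the RuleKit implementation; it assumes that status 1 denotes an event and 0 denotes censoring, and the function name `kaplan_meier` is hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0  # events plus censored observations at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data: five covered examples, two of them censored (status 0).
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1])
print([(t, round(s, 3)) for t, s in curve])
# [(1, 0.8), (2, 0.6), (4, 0.0)]
```

A contrast set covering only long-surviving or censored examples yields a curve that stays high, whereas covered negative examples with early events lower it, which matches the "pulled down" effect described above.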

The substantial differences between the RuleKit-CS modes were also visible in the attribute profiles. While the survival contrast sets were of similar length to the classical ones (6.2 vs 5.1 attributes per CS), some attributes abundant in the former, like PLT_recovery, ANC_recovery, and HLA_group_1 (frequencies of 0.62, 0.48, and 0.31, respectively), were much less frequent in the latter (0.36, 0.14, 0.14). This was probably due to the fact that the subsets of examples uniform w.r.t. the survival prognosis, represented by the survival contrast sets, were more correlated with the aforementioned attributes than the main contrast group.

Figure 5: Comparison of survival and traditional contrast sets describing group (2) of the Bone marrow transplant: children data set (observations with survival time greater than or equal to the median). The figure presents contrast sets after redundancy filtering at 50% threshold. The survival estimate for the entire group is represented as a bold line.

5 Conclusions

In our study we showed that sequential covering can be successfully applied for identifying contrast sets. In order to make this possible, two main challenges had to be faced:

  1. ensuring the major contrast set requirement, i.e., high support in the group of interest and low support in the remaining groups,

  2. providing the ability to identify contrast sets describing the same examples but built upon different attributes.

The first aim was achieved by using the correlation between the predicted and target variable to drive the induction process, combined with additional support constraints. The second point was addressed by multiple sequential covering passes together with the attribute penalization mechanism, which ensures different induction paths between passes.

As presented in the experimental section, RuleKit-CS was able to successfully analyze various data sets. When compared with CN2-SD, one of its main competitors, our algorithm identified a similar number of contrast sets with average support and precision superior by a small margin. When investigating particular contrast sets, the outcome strongly depended on the data set. In some cases both methods rendered similar contrast sets, in others the results differed substantially, which was due to distinct induction strategies. Therefore, if one is interested in revealing as many unknown relationships in the data as possible, a meta-analysis including RuleKit-CS and other existing algorithms may be a wise choice.

An essential contribution of RuleKit-CS is the ability to analyze regression and survival data by identifying within groups subsets of examples uniform with respect to the label or survival prognosis. This type of analysis is out of reach of the existing algorithms and has the potential to discover new, interesting dependencies in the data.

The possible improvements of RuleKit-CS include a modification of the attribute penalization scheme. Currently, contrast set candidates are compared against global collections of already used attributes and covered examples. As a result, a candidate can be penalized even if it does not resemble any of the previous contrast sets. For instance, consider two contrast sets disjoint in both the attribute and example spaces. If a candidate covers the same examples as the first one but uses attributes from the second, it gets full penalty and no reward, which, intuitively, should not be the case. A potential alternative is to verify the candidates' redundancy with individual contrast sets, as defined by Equation 9, during induction. However, tracking contrast set similarity while growing is problematic, as its final form is unknown (two contrast sets may share almost the entire induction path and split into disjoint sets at the very end). Moreover, the procedure would significantly increase the computational effort. Therefore, a more detailed investigation would be required to verify the advantage of this solution over the currently used penalization scheme.
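The limitation discussed above can be made concrete with a toy sketch. The scoring below is a deliberate simplification and an assumption, not the RuleKit-CS penalization formula: the penalty is taken as the fraction of the candidate's attributes already present in the global attribute collection, and the reward as the fraction of its examples not yet in the global collection of covered examples. All names and values are hypothetical.

```python
from typing import List, Set, Tuple

ContrastSet = Tuple[Set[str], Set[int]]  # (attributes used, example ids covered)


def global_penalty_and_reward(candidate: ContrastSet,
                              previous: List[ContrastSet]) -> Tuple[float, float]:
    """Score a candidate against global collections (simplified assumption)."""
    used_attrs = set().union(*(a for a, _ in previous))
    covered = set().union(*(e for _, e in previous))
    attrs, examples = candidate
    penalty = len(attrs & used_attrs) / len(attrs)
    reward = len(examples - covered) / len(examples)
    return penalty, reward


def pairwise_overlaps(candidate: ContrastSet,
                      previous: List[ContrastSet]) -> List[Tuple[float, float]]:
    """(attribute overlap, example overlap) with each previous contrast set."""
    attrs, examples = candidate
    return [(len(attrs & a) / len(attrs), len(examples & e) / len(examples))
            for a, e in previous]


# Two previous contrast sets, disjoint in attributes and in covered examples.
first = ({"age", "thal"}, {0, 1, 2})
second = ({"chest", "sex"}, {3, 4, 5})
# The candidate covers the examples of `first` with the attributes of `second`.
candidate = ({"chest", "sex"}, {0, 1, 2})

print(global_penalty_and_reward(candidate, [first, second]))  # (1.0, 0.0)
print(pairwise_overlaps(candidate, [first, second]))          # [(0.0, 1.0), (1.0, 0.0)]
```

The global view assigns full penalty and no reward, although the pairwise view shows that the candidate never overlaps with a single previous contrast set in both attributes and examples at once.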

We believe that RuleKit-CS could be a useful tool for identifying differences between groups. Thanks to its integration with the RuleKit package, it can be straightforwardly applied to real-life problems in various areas.

Acknowledgements

This work was partially supported by Łukasiewicz Research Network (ROLAPML, R&D grant) and Computer Networks and Systems Department at Silesian University of Technology within the statutory research project.

References

  • U. Ahmed, J. C. Lin, and G. Srivastava (2022) Fuzzy Contrast Set Based Deep Attention Network for Lexical Analysis and Mental Health Treatment. ACM Trans. Asian Low-Resour. Lang. Inf. Process. Cited by: §2, §2.
  • S. D. Bay and M. J. Pazzani (2001) Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov. 5 (3), pp. 213–246. Cited by: §2, §3.1.
  • R. J. Bayardo Jr (1998) Efficiently mining long patterns from databases. In Proc. of the 1998 ACM SIGMOD international conference on Management of data, pp. 85–93. Cited by: §2.
  • J. Demšar, T. Curk, A. Erjavec, et al. (2013) Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res. 14 (1), pp. 2349–2353. Cited by: §4.1.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §4.1.
  • E. Eells and B. Fitelson (2002) Symmetries and asymmetries in evidential support. Philos. Stud. 107 (2), pp. 129–142. Cited by: item 5.
  • L. Geng and H. J. Hamilton (2006) Interestingness measures for data mining: A survey. ACM Comput. Surv. 38 (3), pp. 9–es. Cited by: §1, item 5.
  • S. Greco, Z. Pawlak, and R. Słowiński (2004) Can Bayesian confirmation measures be useful for rough set decision rules?. Eng. Appl. Artif. Intell. 17 (4), pp. 345–361. Cited by: item 5.
  • X. Gu, P. P. Angelov, C. Zhang, and P. M. Atkinson (2018) A massively parallel deep rule-based ensemble classifier for remote sensing scenes. IEEE Geosci. Remote Sens. Lett. 15 (3), pp. 345–349. Cited by: §1.
  • A. Gudyś, M. Sikora, and Ł. Wróbel (2020) RuleKit: a comprehensive suite for rule-based learning. Knowl.-Based Syst. 194, pp. 105480. Cited by: §1, item 2, §4.1.
  • R. J. Hilderman and H. J. Hamilton (2013) Knowledge discovery and measures of interest. Vol. 638, Springer Science & Business Media. Cited by: item 5.
  • R. J. Hilderman and T. Peckham (2005) A statistically sound alternative approach to mining contrast sets. In Proc. of the 4th Australasian Data Mining Conference, pp. 157–172. Cited by: §2.
  • F. Janssen and J. Fürnkranz (2010) On the quest for optimal rule learning heuristics. Mach. Learn. 78 (3), pp. 343–379. Cited by: §1, item 5.
  • M. Kamber and R. Shinghal (1996) Evaluating the Interestingness of Characteristic Rules. In Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 263–266. Cited by: item 5.
  • E. L. Kaplan and P. Meier (1958) Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc. 53 (282), pp. 457–481. Cited by: §3.2.
  • P. Kralj, N. Lavrač, D. Gamberger, and A. Krstačić (2007a) Contrast set mining for distinguishing between similar diseases. In Proc. of the 11th Conference on Artificial Intelligence in Medicine, pp. 109–118. Cited by: §2, §2.
  • P. Kralj, N. Lavrač, D. Gamberger, and A. Krstačić (2007b) Contrast set mining through subgroup discovery applied to brain ischaemina data. In Proc. of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 579–586. Cited by: §2, §2.
  • D. T. Kremastinos, D. Farmakis, A. Aessopos, G. Hahalis, E. Hamodraka, D. Tsiapras, and A. Keren (2010) β-Thalassemia cardiomyopathy: history, present considerations, and future perspectives. Circ. Heart Fail. 3 (3), pp. 451–458. Cited by: §4.3.
  • N. Lavrač, B. Kavšek, P. Flach, and L. Todorovski (2004) Subgroup discovery with CN2-SD. J. Mach. Learn. Res. 5, pp. 153–188. Cited by: §2, §4.1.
  • F. Lemmerich and M. Becker (2018) Pysubgroup: easy-to-use subgroup discovery in python. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 658–662. Cited by: §2, §4.1.
  • H. Liu, Y. Yang, Z. Chen, and Y. Zheng (2014) A tree-based contrast set-mining approach to detecting group differences. INFORMS J. Comput. 26 (2), pp. 208–221. Cited by: §2.
  • A. Magalhães and P. J. Azevedo (2015) Contrast set mining in temporal databases. Expert Syst. 32 (3), pp. 435–443. Cited by: §2.
  • Z. Nazeri, D. Barbara, K. D. Jong, G. Donohue, and L. Sherry (2008) Contrast-Set Mining of Aircraft Accidents and Incidents. In Advances in Data Mining, LNAI, Vol. 5077, pp. 313–322. Cited by: §2.
  • K. Neubarth and D. Conklin (2016) Contrast pattern mining in folk music analysis. In Computational Music Analysis, D. Meredith (Ed.), Cham, pp. 393–424. Cited by: §2.
  • D. Nguyen, W. Luo, D. Phung, and S. Venkatesh (2016) Exceptional contrast set mining: moving beyond the deluge of the obvious. In Proc. of the 29th Australasian Joint Conference on Artificial Intelligence, pp. 455–468. Cited by: §2.
  • P. K. Novak, N. Lavrač, D. Gamberger, and A. Krstačić (2009a) CSM-SD: methodology for contrast set mining through subgroup discovery. J. Biomed. Inform. 42 (1), pp. 113–122. Cited by: §2, §2.
  • P. K. Novak, N. Lavrač, and G. I. Webb (2009b) Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. J. Mach. Learn. Res. 10 (2), pp. 377–403. Cited by: §1, §2.
  • R. Qian, Y. Yu, W. Park, V. Murali, S. Fink, and S. Chandra (2020) Debugging crashes using continuous contrast set mining. In Proc. of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice, pp. 61–70. Cited by: §2.
  • M. Sikora, Ł. Wróbel, and A. Gudyś (2019) GuideR: A guided separate-and-conquer rule learning in classification, regression, and survival settings. Knowl.-Based Syst. 173, pp. 1–14. Cited by: §4.4.
  • M. Simeon and R. Hilderman (2011) COSINE: a vertical group difference approach to contrast set mining. In Proc. of the 24th Canadian conference on Advances in artificial intelligence, pp. 359–371. Cited by: §2.
  • P. Tan, V. Kumar, and J. Srivastava (2002) Selecting the right interestingness measure for association patterns. In Proc. of the 8th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 32–41. Cited by: item 5.
  • G. I. Webb, S. Butler, and D. Newlands (2003) On detecting differences between groups. In Proc. of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 256–265. Cited by: §1, §2, §3.1.
  • G. I. Webb (2000) Efficient Search for Association Rules. In Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 99–107. External Links: ISBN 1581132336 Cited by: §2.
  • Ł. Wróbel, M. Sikora, and M. Michalak (2016) Rule quality measures settings in classification, regression and survival rule induction—an empirical approach. Fundam. Inform. 149 (4), pp. 419–449. Cited by: §1.
  • H. Xiong, S. Shekhar, P. Tan, and V. Kumar (2004) Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proc. of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 334–343. Cited by: item 5.
  • G. Zhu, Y. Wang, and G. Agrawal (2015) SciCSM: novel contrast set mining over scientific datasets using bitmap indices. In Proc. of the 27th International Conference on Scientific and Statistical Database Management, pp. 1–6. Cited by: §2.