GuideR
User-guided separate-and-conquer rule learning in classification, regression, and survival settings
This article presents GuideR, a user-guided rule induction algorithm, which overcomes the largest limitation of the existing methods: the lack of the possibility to introduce user preferences or domain knowledge into the rule learning process. Automatic selection of attributes and attribute ranges often leads to a situation in which the resulting rules do not contain interesting information. We propose an induction algorithm which takes the user's requirements into account. Our method uses the sequential covering approach and is suitable for classification, regression, and survival analysis problems. The effectiveness of the algorithm in all these tasks has been verified experimentally, confirming guided rule induction to be a powerful data analysis tool.
Sequential covering rule induction algorithms can be used for both predictive and descriptive purposes blaszczynski2011 ; furnkranz1999 ; grzymala2003 ; kaufman1991 . In spite of the development of increasingly sophisticated versions of those algorithms liu2018induction ; valmarska2017 , the main principle remains unchanged and involves two phases: rule growing and rule pruning. In the former, elementary conditions are determined and added to the rule premise. In the latter, some of these conditions are removed.
In comparison to other machine learning methods, rule sets obtained by the sequential covering algorithm, also known as the separate-and-conquer strategy (SnC), are characterized by good predictive as well as descriptive capabilities. Considering only the former, superior results can often be obtained using other methods, e.g., neuro-fuzzy networks, support vector machines, or ensembles of classifiers boser1992 ; czogala2000 ; rokach2010 ; siminski , especially ensembles of rules dembczynski2010 . However, data models obtained this way are much less comprehensible than rule sets. In the case of rule learning for descriptive purposes, algorithms of association rule induction agrawal1994 ; kavsek2006 ; stefanowski2001 or subgroup discovery lavravc2004 ; valmarska2017 are applied. The former leads to a very large number of rules, which must then be limited by filtering according to rule interestingness measures geng2006 ; greco2016 ; bayardo . On the other hand, rule sets obtained by subgroup discovery are characterized by worse predictive abilities than those generated by the standard sequential covering approach.
Therefore, if creating a prediction system with a comprehensible data model is the main objective, the application of sequential covering rule induction algorithms provides the most sensible solution.
In previous works wrobel2017 ; wrobel2016 ; sikora2012 ; sikora2013data , we presented our version of the sequential covering algorithm for generating classification, regression, and survival rules and confirmed its effectiveness on dozens of benchmark datasets. This article presents a semi-interactive version of that algorithm, which overcomes the largest limitation of the existing rule induction methods: the lack of the possibility to introduce the user's (or expert's) knowledge into the learning process. Automatic selection of attributes and attribute ranges often leads to a situation in which the induced rules do not contain the most important information from the user's point of view. We propose a rule induction algorithm which takes the user's requirements into account. The possibility to specify an initial set of rules, preferred and forbidden conditions/attributes, etc., together with the multiplicity of options and modes, makes our algorithm the most flexible solution for user-guided rule induction. It allows testing various hypotheses concerning data dependencies which are expected or of interest. In particular, the algorithm enables making such hypotheses more specific or more general.
The effectiveness of guided (semi-automatic) rule induction has been investigated on three test cases concerning various data analysis tasks. Classification is illustrated by the problem of predicting seismic hazards in coal mines (seismic-bumps dataset Sikora2010 ); regression, by the problem of methane forecasting (methane dataset githubMethane ); survival analysis, by the problem of analysing the factors which impact patients' survival following bone marrow transplants (BMT-Ch dataset kalwak2010 ; sikora2013 ).
The paper is organized as follows. Section 2 gives an overview of work in the area of user-guided rule induction. Section 3 presents the algorithm for the induction of classification, regression, and survival rules, with special stress put on its semi-automatic capabilities. Section 4 is devoted to the analysis of the test cases, together with a discussion of the obtained results. Section 5 contains a summary and conclusions.
The GuideR software as well as the datasets used in this article are available at https://github.com/adaapolsl/GuideR or http://www.adaa.polsl.pl. All the datasets were prepared by the authors of this article. The seismic-bumps dataset is also available in the UCI repository.
The induction of classification rules with the sequential covering approach has been known for many years furnkranz1999 ; grzymala2003 ; kaufman1991 ; clark1989 . As it has proved its effectiveness in terms of both classification accuracy and the descriptive abilities of the induced rules (e.g., sikora2011pattern ; moshkov2008 ; tsumoto2004 ), a number of interesting extensions of this approach have been presented napierala2012 ; huhn2009 ; mozina2007 ; riza2014 ; sikora2013redef ; liu2018induction ; valmarska2017 . In contrast, rule induction algorithms have rarely been applied to regression and survival analysis, although the comprehensibility of the resulting data models in these problems is often a key issue.
Regression rules can be straightforwardly derived from regression trees such as CART breiman1984 and M5 quinlan1992 by generating one rule for each path from the root of the tree to a leaf. These algorithms use the divide-and-conquer strategy. Another approach to regression rule induction is to use a generalization of sequential covering (e.g., PCR vzenko2005 , rule lists janssen2011 ). The work of Janssen and Fürnkranz janssen2011 , describing the dynamic reduction of the regression problem to classification, is of particular importance in the context of the results presented in this paper. The most advanced methods of regression rule induction are based on ensemble techniques (e.g., RuleFit friedman2008 , RegENDER dembczynski2008 ). To supervise the induction of subsequent rules, these algorithms apply gradient-based optimization methods. The resulting rule sets are characterized by good prediction quality, though they are usually composed of a large number of rules.
Equally few attempts have been made to apply rules to survival analysis. Pattaraintakorn and Cercone pattaraintakorn2008 described a rough-set-based intelligent system for analyzing survival data. Another approach employing rough sets was presented by Bazan et al. bazan2002 . The idea was to divide examples into three decision classes on the basis of a prognostic index (PI) calculated with the use of Cox's proportional hazards model. The division of a survival dataset into three classes was also made by Sikora et al. sikora2013survival , who applied a rule induction algorithm to the analysis of patients who underwent bone marrow transplantation. The dataset was divided into the following groups: patients who underwent transplantation at least five years before, patients who died within five years after transplantation, and patients who are still alive but whose observation time is less than five years. The two former classes were used for rule generation, while the latter was used for model post-processing. Kronek and Reddy kronek2008 proposed an extension of the Logical Analysis of Data (LAD) crama1988 for survival analysis. The LAD algorithm is a combinatorial approach to rule induction. It was originally developed for the analysis of data containing binary attributes; therefore, a discretization and binarization step is usually required. Liu et al. liu2004 adapted the patient rule induction method to the analysis of survival data. The method uses the bump hunting heuristic, which creates rules by searching for regions in the attribute space with a high average value of the target variable. To deal with censoring, the authors use deviance residuals as the outcome variable. The idea of a residual-based approach to censored outcomes is derived from survival trees leblanc1992 .

In comparison to rule-based techniques, tree-based methods have received much more attention in survival analysis. The key idea behind the application of tree-based techniques to survival data lies in the splitting criterion. The most popular approaches are based on residuals leblanc1992 ; therneau1990 or use the log-rank statistic leblanc1993 to maximize the difference between the survival distributions of child nodes. We employed the latter idea in our latest separate-and-conquer rule induction algorithm, which uses the log-rank statistic as a rule search heuristic (rule quality measure) wrobel2017 . We showed that, in spite of some similarities between rules and trees, our approach renders different models than the divide-and-conquer strategy of tree building.
To date, few studies have concerned rule induction algorithms which take the user's preferences into account. Stefanowski and Vanderpooten stefanowski2001 presented the Explore algorithm. Based on the idea of the Apriori method, it allows the user to specify requirements for the attributes and/or their values appearing in the rule premises. Other studies on the induction of association rules describe examples of interactive construction of rules rafea2004 ; kliegr and the generation of unexpected rules padmanabhan1998 . The latter are created on the basis of user-defined templates indicating the attributes included in the so-called typical rules. Gamberger and Lavrac gamberger2002 made a similar proposal for a decision rule induction algorithm intended for descriptive purposes.
Adomavicius and Tuzhilin adomavicius2001 presented expert-driven methods of validating rule-based data models obtained via association rule induction algorithms. The approach limits the number of rules by applying rule grouping and filtering techniques which are based on interaction with the user instead of the traditional calculation of rule attractiveness. Blanchard et al. blanchard proposed an interactive methodology for the visual post-processing of association rules. It allows the user to explore large sets of rules freely by focusing his/her attention on limited subsets of rules. Neither of the aforementioned methods interferes with the induction process.
Algorithms using the paradigm of argument-based learning mozina2007 allow the user to provide, for each example, an explanation of why it has been assigned a particular decision class. Examples of medical applications show that this approach can significantly reduce the set of generated rules. However, the argument-based learning approach does not verify hypotheses that represent the dependencies which, in the user's opinion, might occur in the data. Partially, this possibility was introduced by Chen and Liu chen2001 , where the user defines a set of rules expected to be found in the dataset. Then, the rule-based version of the C4.5 algorithm is executed and three types of rules are generated: consistent, inconsistent, and not related to the user's rules. A rule r is considered consistent with the knowledge if the set of user-defined rules contains at least one rule r′ such that r and r′ indicate the same decision class and the set of examples covered by r is a subset of the examples covered by r′.
The IBM SPSS Modeler analytical package ibmspss contains a module of interactive decision trees in which the user can determine the attribute and split value to be included in a given tree node. Moreover, the algorithm allows the induction of a given subtree to be stopped at a specific level, or started from a certain level when the nodes above have been defined by the user.
Even though trees can be straightforwardly translated into rules, inducing the latter directly from data has an important advantage: the rules can be treated independently. The user or domain expert can alter existing rules or add new ones without affecting the rest of the model. A tree, in contrast, must be treated as a whole: a change of a condition in a node involves the need to modify the conditions in all its child nodes. Another difference is that the divide-and-conquer tree generation strategy forbids examples to be covered by multiple rules, while the separate-and-conquer approach to rule induction lacks this limitation. This often leads to discovering stronger or completely new dependencies in the data. Finally, the generation of rules from a tree by following the paths from the root to the leaves always leads to condition redundancy, which is often undesirable.
Let D be a dataset of examples (observations, instances), each characterized by a set of attributes and a label y. The meaning of the label depends on the problem. For classification tasks it corresponds to a discrete class identifier. In regression, it is a continuous value: y ∈ ℝ. In survival analysis, it represents the binary censoring status: y ∈ {0, 1}. In particular, the value 0 indicates censored observations, also referred to as event-free (e.g., patients without disease recurrence), while 1 indicates non-censored examples that were subject to an event (e.g., patients with recurrence). In survival datasets, an additional variable t representing the survival time, i.e., the time of observation for event-free examples or the time before the occurrence of an event, must be specified.
The i-th example of a classification/regression dataset can be represented as a vector of attribute values together with a label; in survival problems it must be extended by the survival time. For simplicity, however, all types of datasets will be denoted as D; the dependence of survival datasets on t does not affect the idea of the presented algorithm.
Let R be the set of rules generated by the induction algorithm, referred to later as a rule-based data model or, simply, a model. Each rule r ∈ R has the form:

IF w₁ ∧ w₂ ∧ … ∧ wₙ THEN conclusion
The premise of a rule is a conjunction of conditions w of the form (a rel v), with v being an element of the domain of attribute a and rel representing a relation (= for nominal attributes; <, ≤, ≥, > for numerical ones). The conclusion of the rule can be a nominal value (classification), a numerical value (regression), or a Kaplan–Meier estimator kaplan1958 of the survival function (survival analysis). The corresponding rules will be referred to as classification, regression, and survival rules, respectively. An example satisfying the conditions specified in the rule premise is said to be covered by the rule.

Rule sets induced by our separate-and-conquer heuristic are unordered. Therefore, applying a model to an observation (e.g., a test example) requires evaluating the set of rules covering the example and aggregating the results. This differs from ordered rule sets (decision lists), where the first rule covering the investigated observation determines the model response. The method of aggregation depends on the problem. In classification, the output class label is obtained as a result of voting: each covering rule votes with the value of the quality measure used during the induction wrobel2016 . In regression, the model response is the average of the conclusions of the covering rules sikora2012 . The situation is similar for survival rules, but the averaging concerns not numbers, but survival estimator functions wrobel2017 .
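The aggregation schemes described above can be sketched as follows. The `Rule` structure and function names are ours, for illustration only, and do not reflect the GuideR API:

```python
# Applying an unordered rule set to a single observation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    covers: Callable[[dict], bool]  # premise test on an example
    conclusion: object              # class label / number / survival estimate
    quality: float                  # quality measure value from the induction

def predict_classification(rules: List[Rule], x: dict):
    # Each covering rule votes for its class with its quality measure value.
    votes = {}
    for r in rules:
        if r.covers(x):
            votes[r.conclusion] = votes.get(r.conclusion, 0.0) + r.quality
    return max(votes, key=votes.get) if votes else None

def predict_regression(rules: List[Rule], x: dict):
    # The response is the average conclusion of the covering rules.
    vals = [r.conclusion for r in rules if r.covers(x)]
    return sum(vals) / len(vals) if vals else None
```

For survival rules, the same averaging would be applied pointwise to the Kaplan–Meier estimates instead of single numbers.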
The presented algorithm induces rules according to the separate-and-conquer principle furnkranz1999 ; michalski1973discovering . Here we describe the fully automatic procedure; the user-guided variant is presented in the next subsection. An important factor determining the performance and comprehensibility of the resulting model is the selection of a rule quality measure bruha1997 ; an2001rule ; yao1999 ; wrobel2016 (rule learning heuristic furnkranz2005 ; janssen2010quest ; minnaert ) that supervises the rule induction process. In the case of classification problems, our software provides the user with a number of state-of-the-art measures calculated on the basis of the rule confusion matrix. Let r be the considered classification rule. The examples whose labels are the same as the conclusion of r will be referred to as positive, while the others will be called negative. The confusion matrix consists of the numbers of positive and negative examples in the entire training set (P and N) and the numbers of positive and negative examples covered by the rule (p and n). The idea can be straightforwardly generalized to weighted examples by replacing the numbers of examples in the confusion matrix with the sums of their weights. The measures built into the algorithm, e.g., C2 bruha1997 , Correlation furnkranz2005 , Lift bayardo , RSS sikora2013data , or s-Bayesian confirmation greco , evaluate rules using various criteria, resulting in very different models. For instance, RSS (also known as WRA furnkranz2005 ) considers equally the sensitivity (p/P) and specificity ((N − n)/N) of the rule according to the formula RSS = p/P − n/N. Another common measure is conditional entropy, which describes the entropy of an outcome variable Y given a random variable X
as:

(1)  H(Y | X) = − Σ_x P(x) Σ_y P(y | x) log₂ P(y | x)

In our case, Y indicates the class of an example (positive/negative) and X denotes whether the rule covers the example (covered/uncovered). Therefore,

(2)  P(x = covered) = (p + n) / (P + N)

(3)  P(y = positive | x = covered) = p / (p + n)

(4)  P(y = positive | x = uncovered) = (P − p) / (P + N − p − n)
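As a rough illustration, two of the listed measures can be computed directly from the confusion matrix counts. The sketch below uses the p, n, P, N notation (covered/total positives and negatives); the function names are ours, not GuideR's:

```python
import math

def rss(p, n, P, N):
    # RSS: sensitivity + specificity - 1, i.e., p/P - n/N.
    return p / P - n / N

def conditional_entropy(p, n, P, N):
    # H(Y | X): entropy of the class given the rule's coverage indicator,
    # with probabilities estimated from the confusion matrix.
    def h(*probs):
        # entropy of a discrete distribution; zero-probability terms are skipped
        return -sum(q * math.log2(q) for q in probs if q > 0.0)
    cov = (p + n) / (P + N)                              # P(x = covered)
    h_cov = h(p / (p + n), n / (p + n)) if p + n else 0.0
    rest = P + N - p - n                                 # uncovered examples
    h_unc = h((P - p) / rest, (N - n) / rest) if rest else 0.0
    return cov * h_cov + (1.0 - cov) * h_unc
```

A rule covering all positives and no negatives yields a conditional entropy of 0, the best possible value under this measure (lower is better, in contrast to RSS).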
The opposite probabilities, i.e., P(x = uncovered), P(y = negative | x = covered), and P(y = negative | x = uncovered), can be calculated straightforwardly by subtracting the appropriate value from 1.

The aforementioned measures are also used for evaluating regression rules, as regression is transformed by the algorithm into a binary classification problem. The transformation is done similarly as in janssen2011 . Namely, the median m and the standard deviation σ of the labels of the instances covered by the rule are established. Observations from the entire set with labels in the interval [m − σ, m + σ] are assigned a positive class. This allows determining the elements of the confusion matrix and calculating all the aforementioned quality measures. Note, however, that in contrast to classification problems, the P and N values may change as the rule coverage is modified.

A different situation arises in survival analysis, where rule outcomes are survival function estimates rather than numerical values. Thus, it is desirable for a rule to cover examples whose survival distributions differ significantly from those of the other instances. For this purpose, we use the log-rank statistic harrington1982class as the measure of survival rule quality. It is calculated as (O − E)²/V, where:
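A minimal sketch of this regression-to-classification transformation follows; the paper does not specify whether the sample or population standard deviation is used, so the population variant is assumed here, and the function name is illustrative:

```python
import statistics

def regression_confusion_matrix(covered_labels, all_labels):
    # Examples whose labels fall within [median - std, median + std] of the
    # labels covered by the rule are treated as positives.
    m = statistics.median(covered_labels)
    s = statistics.pstdev(covered_labels)       # population std (assumption)
    lo, hi = m - s, m + s
    positive = lambda y: lo <= y <= hi
    P = sum(1 for y in all_labels if positive(y))       # positives in whole set
    N = len(all_labels) - P
    p = sum(1 for y in covered_labels if positive(y))   # positives covered
    n = len(covered_labels) - p
    return p, n, P, N
```

Note how P and N depend on the interval, and hence on the current coverage of the rule, as remarked above.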
(5)  O = Σ_{t ∈ T} d_t ,  E = Σ_{t ∈ T ∪ T′} r_t (d_t + d′_t) / (r_t + r′_t)

(6)  V = Σ_{t ∈ T ∪ T′} r_t r′_t (d_t + d′_t)(r_t + r′_t − d_t − d′_t) / [(r_t + r′_t)² (r_t + r′_t − 1)]

T (T′) is the set of event times of the observations covered (uncovered) by the rule, d_t (d′_t) is the number of covered (uncovered) observations which experienced an event at time t, and r_t (r′_t) is the number of covered (uncovered) instances at risk, i.e., still observable at time t.
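The statistic can be sketched as follows, in the common (O − E)²/V form of the log-rank test; the data layout (lists of time/event pairs) and function name are illustrative:

```python
from collections import Counter

def logrank(covered, uncovered):
    """Each argument: list of (time, event) pairs, event in {0, 1}."""
    events_c = Counter(t for t, e in covered if e == 1)
    events_u = Counter(t for t, e in uncovered if e == 1)
    times = sorted(set(events_c) | set(events_u))
    O = E = V = 0.0
    for t in times:
        r1 = sum(1 for tt, _ in covered if tt >= t)    # covered at risk
        r2 = sum(1 for tt, _ in uncovered if tt >= t)  # uncovered at risk
        d1, d2 = events_c[t], events_u[t]              # events at time t
        r, d = r1 + r2, d1 + d2
        O += d1                                        # observed events (covered)
        E += r1 * d / r                                # expected events (covered)
        if r > 1:                                      # hypergeometric variance
            V += r1 * r2 * d * (r - d) / (r * r * (r - 1))
    return (O - E) ** 2 / V if V > 0 else 0.0
```

A larger statistic indicates a larger difference between the survival distributions of the covered and uncovered examples, i.e., a better survival rule.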
The separate-and-conquer heuristic adds rules iteratively to the initially empty set until the entire dataset becomes covered (Algorithm 1). To ensure convergence, every rule must cover at least mincov previously uncovered examples. The induction of a single rule consists of two stages: growing and pruning. In the former (presented in Algorithm 2), elementary conditions are added to the initially empty premise. When extending the premise, the algorithm considers all possible conditions built upon all attributes (line 6: GetPossibleConditions function call) and selects those leading to the rule of highest quality (lines 10–12). In the case of nominal attributes, conditions of the form (a = v) for all values v from the attribute domain are considered. For continuous attributes, the values that appear in the observations covered by the rule are sorted. The possible split points s are then determined as the arithmetic means of subsequent values, and the conditions (a < s) and (a ≥ s) are evaluated. If several conditions render the same quality, the one covering more examples is chosen. Pruning can be considered the opposite of growing. It iteratively removes conditions from the premise, each time making the elimination leading to the largest improvement in the rule quality. The procedure stops when no condition can be deleted without decreasing the quality of the rule or when the rule contains only one condition. Finally, for comprehensibility, the rule is post-processed by merging conditions based on the same numerical attributes, e.g., a conjunction of overlapping conditions on an attribute is replaced by the single condition describing their intersection.
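The generation of candidate split points for a continuous attribute during growing can be sketched as follows (the function name and the condition encoding are illustrative):

```python
def candidate_conditions(attr, covered_values):
    # Sort the distinct values present in the covered examples and take the
    # arithmetic means of consecutive values; each mean s yields two candidate
    # conditions: (attr < s) and (attr >= s).
    values = sorted(set(covered_values))
    conditions = []
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2.0
        conditions.append((attr, "<", s))
        conditions.append((attr, ">=", s))
    return conditions
```

Each candidate would then be scored by the configured quality measure, and ties broken in favor of the condition covering more examples, as described above.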
In regression and survival problems, the algorithm is performed once on the entire dataset. For classification tasks, rules are induced independently for all classes. In particular, when a class c is investigated, the set is binarized with respect to it: examples with labels equal to c are positives, while the others are negatives. Detailed information about our algorithm for classification, regression, and survival rule induction using the separate-and-conquer strategy can be found in wrobel2016 ; wrobel2017 . The most important limitation of the presented approach is that the induction is fully automatic: the user may control the shape of the model only by selecting the quality measure and adjusting the mincov parameter.
In order to allow user-guided rule induction, the separate-and-conquer heuristic explained in the previous subsection was extended. The preliminary step of the procedure is specifying the user's requirements. They consist of several elements, ordered by priority (highest first):
Initial rules—the set of user-specified rules which have to appear in the model. Depending on the parameters, initial rules are immutable or can be extended with other conditions (existing conditions cannot be altered, though).
Preferred conditions/attributes—multisets of conditions and attributes which, when a rule is derived, are used before automatically induced conditions. The user may specify the multiplicity of each preferred element, allowing it to be used in a given number of rules.
Forbidden conditions/attributes—sets of conditions and attributes which cannot appear in the automatically generated rules.
In the classification problems, the requirements can be defined for each class separately. Additional parameters controlling guided rule induction are:
extendUsingPreferred/extendUsingAutomatic—booleans indicating whether the initial rules should be extended with the use of preferred/automatic conditions and attributes.
induceUsingPreferred/induceUsingAutomatic—booleans indicating whether new rules should be induced with the use of preferred/automatic conditions and attributes.
preferredConditionsPerRule/preferredAttributesPerRule—the maximum number of preferred conditions/attributes per rule.
considerOtherClasses—boolean indicating whether automatic induction should be performed for classes for which no user requirements have been defined (classification mode only).
The guided separate-and-conquer heuristic is presented in Algorithm 3. It starts by processing the initial rules in the order specified by the user. If the extendUsingPreferred flag is enabled, an attempt is made to extend an initial rule with at most preferredConditionsPerRule preferred conditions and preferredAttributesPerRule preferred attributes (lines 4–5). After that, if the extendUsingAutomatic flag is enabled, the algorithm adds automatically induced conditions using the standard separate-and-conquer strategy (lines 6–9). When all initial rules have been processed, new rules are generated analogously; the corresponding boolean parameters are called induceUsingPreferred and induceUsingAutomatic (lines 11–20).
For regression and survival problems, the described procedure is performed once, similarly as in the fully automatic mode. For classification tasks, the algorithm is executed for each class for which knowledge has been specified. If the considerOtherClasses parameter is set, this is followed by fully automatic induction of rules for the classes without user preferences.
An important assumption concerning the semi-automatic induction is that the knowledge elements are prioritized, i.e.:
Initial rules and preferred conditions/attributes are more important than forbidden conditions/attributes. Therefore, if an initial rule contains a condition c built upon an attribute a, it will appear in the model regardless of a being marked as forbidden or c intersecting one of the forbidden conditions. The same holds for preferred conditions and attributes: forbidden knowledge applies to the automatic induction only.
A requirement of higher priority cannot be altered by one of lower priority. For instance, if an initial rule contains a condition c built upon an attribute a, c cannot be modified, i.e., no other condition concerning a can be added to this rule, neither preferred (condition- or attribute-based), nor automatically induced. Similarly, preferred conditions cannot be overridden by preferred-attribute-based or automatic conditions, etc.
Requirements of the same category are prioritized by the order in which they are specified by the user.
User-defined knowledge cannot be subject to pruning.
The prioritization determines how a single rule is grown taking the user's preferences into account (Algorithm 4). At the beginning, the attributes already present in the rule are excluded (line 2). Then, at most preferredConditionsPerRule preferred conditions fulfilling the coverage requirement are added to the rule (lines 4–13). At each step, the condition rendering the rule of highest quality is selected (lines 6–8). After that, preferred attributes are processed similarly (lines 14–24). For each preferred attribute, the condition leading to the rule of highest quality is considered (line 17: InduceBestCondition function call). When a preferred condition/attribute is used, its multiplicity in the corresponding multiset is decreased (lines 10, 21). Moreover, already employed attributes cannot be used again in the rule (lines 11, 22).
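The prioritized consumption of preferred conditions during growing can be sketched as follows. All names are illustrative, and `quality` stands for whatever rule quality measure has been configured:

```python
def add_preferred_conditions(rule, preferred, max_preferred, quality, attribute_of):
    # rule: list of conditions; preferred: dict condition -> remaining multiplicity.
    used = {attribute_of(c) for c in rule}     # attributes already in the rule
    for _ in range(max_preferred):
        candidates = [c for c, mult in preferred.items()
                      if mult > 0 and attribute_of(c) not in used]
        if not candidates:
            break
        # pick the preferred condition maximizing the quality of the extended rule
        best = max(candidates, key=lambda c: quality(rule + [c]))
        rule = rule + [best]
        preferred[best] -= 1                   # consume one multiset occurrence
        used.add(attribute_of(best))
    return rule
```

Preferred attributes would be handled analogously, except that the best condition over each attribute is induced automatically before the comparison.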
The algorithm was evaluated on three test cases representing classification, regression, and survival problems. The analysis of each dataset concerned:
the validation of models rendered by automatic and user-guided rule induction; depending on the problem, this was done by 10-fold cross-validation or a train/test split,
the analysis of rule sets induced on the entire datasets in the context of domain knowledge.
Table 1 presents the problem-specific details of the experimental procedures, e.g., model validation methods, quality criteria, statistical tests used for determining rule significance, etc.
The rule set descriptive statistics were common for all investigated datasets and consisted of: the number of rules (#rules), the average number of conditions per rule (#conditions), the average rule precision (p/(p + n)), and the average rule support ((p + n)/(P + N)). Note that the interpretation of the indicators based on the confusion matrix varies between problems. In particular, for classification, P and N are fixed for each analyzed class; for regression, P and N are determined for each rule on the basis of the covered examples; for survival analysis, all examples are considered positive, thus n and N equal 0.

For all investigated models we report the fraction of statistically significant rules at the assumed significance level (%significant). To control the false discovery rate (FDR) in multiple testing, the Benjamini–Hochberg correction was applied benjamini1995 .
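The Benjamini–Hochberg step-up procedure used here can be sketched as follows (a minimal version, assuming independent tests; the function name is ours):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    # Returns the (sorted) indices of hypotheses rejected at FDR level alpha.
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value passes the step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

The fraction of rejected hypotheses among the rules of a model gives the %significant statistic reported in the result tables.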
Table 1: Problem-specific details of the experimental procedures.

Problem | classification | regression | survival
Dataset | seismic-bumps | methane | BMT-Ch
Model validation method | 10-fold CV | train/test split | 10-fold CV
Quality criteria | geometric mean (Gm) of sensitivity and specificity | root relative squared error (y — observed label, ŷ — expected, ȳ — average) | integrated Brier score (see wrobel2017 for details)
Quality difference significance test | Student's t-test | — | Student's t-test
Rule significance test | Fisher's exact test for comparing confusion matrices | χ² test for comparing the label variance of covered vs. uncovered examples | log-rank test for comparing the survival functions of covered vs. uncovered examples
Another analysis step was the comparison of the similarity between guided and automatic rule sets. The similarity between two rule sets R₁ and R₂ on the dataset D is expressed as:

(7)  Sim(R₁, R₂) = (a + d) / M

where a is the number of pairs of examples in D for which there exists some rule in R₁ and some rule in R₂ covering both examples; d is the number of pairs of examples in D for which there exists neither a rule in R₁ nor a rule in R₂ covering both examples; and M is the number of all pairs of examples in D.
The measure might be interpreted as the probability of agreement between two rule sets for a randomly chosen pair of examples. The agreement between rule sets R₁ and R₂ for a pair of examples means that:
if both examples satisfy the premise of some rule in R₁, then they also both satisfy the premise of some rule in R₂,
if the examples are not covered by a common rule in R₁, then they are also not covered by a common rule in R₂,
if both examples are not covered by any of the rules in R₁, then they are also not covered by any of the rules in R₂,
if one of the examples is covered by some rule in R₁ and the other one is not covered by any of the rules in R₁, then the same applies to R₂.
The rule set similarity measure takes values between 0 and 1. The value 0 indicates that the two rule sets do not agree on any pair of examples from the given dataset. The value 1 means perfect agreement, i.e., there exists no pair of examples which is covered by a common rule in one of the rule sets and not covered by a common rule in the other. Since the proposed measure evaluates the similarity between the subsets of examples covered by the rule sets, it is not influenced by the overlap of rules within a set. In particular, if rule sets R₁ and R₂ have a similarity score equal to 1, then extending these rule sets with additional rules does not change the value of the score.
The proposed similarity score can also be considered a variant of the Rand measure rand1971objective used for the evaluation of clustering performance. However, our proposal takes into account that a single example can satisfy the premises of several rules, as well as that it may not be covered by any of the rules.
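Following equation (7), the score can be sketched as a pairwise agreement count. Each rule is represented by the set of indices of the examples it covers; all names are illustrative:

```python
from itertools import combinations

def covered_together(rule_coverages, i, j):
    # True if some single rule covers both examples i and j
    return any(i in cov and j in cov for cov in rule_coverages)

def similarity(cov1, cov2, n_examples):
    # cov1, cov2: per-rule sets of covered example indices for the two models.
    agree = 0
    pairs = 0
    for i, j in combinations(range(n_examples), 2):
        pairs += 1
        if covered_together(cov1, i, j) == covered_together(cov2, i, j):
            agree += 1   # the pair is treated consistently by both rule sets
    return agree / pairs if pairs else 1.0
```

The counter `agree` accumulates both the a-type pairs (covered together in both sets) and the d-type pairs (covered together in neither), so the returned ratio corresponds to (a + d)/M.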
The following subsections contain detailed analysis of classification, regression, and survival experiments.
Classification experiments were performed on the seismic-bumps dataset from the UCI Machine Learning Repository uci . The dataset was prepared and made available by the authors of this paper and concerns the problem of forecasting high-energy seismic bumps in coal mines kabiesz . It contains 2 584 instances (170 positives and 2 414 negatives) and 19 attributes characterizing the seismic activity in the rock mass within one 8-hour shift (see Table 2 for a description of the crucial features). The value 1 of the class attribute indicates the presence of a seismic bump with energy higher than 10⁴ J in the next shift.
attribute  description 

seismic
(seismoacoustic) 
result of shift seismic (seismoacoustic) hazard assessment in the mine working obtained by the seismic (seismoacoustic) method developed by mine experts (a—lack of hazard, b—low hazard, c—high hazard, d—danger state) 
genergy  seismic energy recorded within the previous shift by the most active geophone (GMax) out of geophones monitoring the longwall 
gimpuls  a number of pulses recorded within the previous shift by GMax 
goenergy  a deviation of energy recorded within the previous shift by GMax from the average energy recorded during eight previous shifts 
goimpuls  a deviation of a number of GMax pulses within the previous shift from the average number of pulses within eight previous shifts 
ghazard  result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only 
nbumpsX  a number of seismic bumps with energy in the range [10^X, 10^(X+1)) J, registered within the previous shift 
senergy  total energy of seismic bumps registered within the previous shift 
maxenergy  the maximum energy of the seismic bumps registered within the previous shift 
The model validation was carried out according to stratified 10-fold cross-validation. To establish the algorithm parameters, automatic rule induction was performed as an initial step. Due to the strong imbalance of the problem, the geometric mean (Gm) of sensitivity and specificity was used for assessment. Among the examined quality measures (C2, Correlation, Conditional entropy, Lift, RSS, s-Bayesian), Conditional entropy was selected for further investigation.
To demonstrate the flexibility of our algorithm, guided rule induction was performed in several variants with different algorithm parameters. The variants marked guided-c1, guided-c2, guided-c3, and guided-c4 attempt to use in the classifier the attributes that, according to the domain knowledge, should have the greatest significance for bump forecasting kabiesz . The guided-c5 and guided-c6 variants attempt to define the classifier only on the basis of data coming from a single measurement system: in the former, it is forbidden to use attributes containing data from geophones (i.e., the seismoacoustic system); in the latter, it is forbidden to use attributes containing data from seismometers (i.e., the seismic system).
The variants together with the corresponding algorithm parameters are listed below. Class-specific requirements are defined with superscripts (e.g., a superscript 0 denotes preferred attributes for class 0; lack of a superscript indicates that the knowledge applies to both classes). Only the important parameters are specified.
The model consists of two initial rules:
.
Attribute gimpuls is used in rules for both classes at least once:
.
At least of rules contain gimpuls, genergy, and senergy attributes together:
.
At least one of seismic, seismoacoustic, and ghazard attributes is used in each rule, with an additional requirement on value sets—class 0 may use values a, b, class 1 may use values b, c, d:
.
Attributes gimpuls, goimpuls, ghazard, and seismoacoustic are forbidden:
.
Attributes from nbumps family as well as senergy, maxenergy, and seismic are forbidden: analogous to guidedc5.
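As an illustration, such variant definitions could be encoded as a small configuration object. This is a hypothetical structure sketched for clarity, not GuideR's actual parameter format:

```python
from dataclasses import dataclass, field

# Hypothetical encoding of user requirements, sketched for illustration;
# GuideR's actual parameter format may differ.
@dataclass
class InductionKnowledge:
    preferred_attributes: dict = field(default_factory=dict)   # class label -> set of attributes
    forbidden_attributes: dict = field(default_factory=dict)
    initial_rules: list = field(default_factory=list)

# guidedc5-style variant: attributes carrying geophone (seismoacoustic)
# data are forbidden for both classes.
banned = {"gimpuls", "goimpuls", "ghazard", "seismoacoustic"}
guided_c5 = InductionKnowledge(forbidden_attributes={0: banned, 1: banned})

def allowed(attribute, label, knowledge):
    # Check whether a candidate elementary condition on `attribute`
    # may enter a rule for class `label`.
    return attribute not in knowledge.forbidden_attributes.get(label, set())

print(allowed("gimpuls", 1, guided_c5))   # -> False
print(allowed("genergy", 1, guided_c5))   # -> True
```

During the growing phase, only conditions passing such a check would be considered as candidates, which is how forbidden-attribute requirements constrain the search.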
Table 3. Classification results: validation (stratified 10-fold CV) and descriptive statistics (full dataset). p: Student's t-test p-value comparing the Gm of the user's variants w.r.t. auto.

variant     SE    SP     Gm              p      #rules  #conditions  support  precision  %significant  similarity
auto        0.67  0.76   0.708 ± 0.071   —        67       7.2         0.14      0.51         94           —
guidedc1    0.49  0.813  0.627 ± 0.064   0.01      2       1.0         0.50      0.55        100          0.74
guidedc2    0.62  0.82   0.711 ± 0.062   0.90     39       5.0         0.30      0.57         95          0.92
guidedc3    0.58  0.82   0.690 ± 0.046   0.42    155       5.0         0.38      0.78         97          0.88
guidedc4    0.42  0.81   0.580 ± 0.057           121       5.9         0.31      0.80         99          0.51
guidedc5    0.64  0.78   0.701 ± 0.082   0.65     43       4.1         0.30      0.48         93          0.88
guidedc6    0.56  0.73   0.622 ± 0.070   0.02     55       4.9         0.20      0.49         96          0.88

The summary of the results for the automatic and guided classification variants is given in Table 3. Below, we also analyse the rule sets obtained by means of the automatic, guidedc1, guidedc2, guidedc4, and guidedc6 methods on the entire dataset.
The rule set induced automatically consisted of 67 rules with average support and precision equal to 0.14 and 0.51, respectively (taking into account the dataset imbalance, this is an acceptable result). The attributes goimpuls, gimpuls, ghazard, and seismoacoustic occurred in 49, 47, 18, and 13 rules, respectively.
Below we present, for each decision class, the strongest rule generated automatically. The confusion matrix elements given in brackets allow the support and precision to be calculated. The rules indicating decision class 1 were more specific, less precise, and covered fewer examples.
(, , , )
(, , , )
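Support and precision are commonly computed from these confusion matrix elements, with p and n the covered positive and negative examples and P and N their totals. A small sketch with illustrative counts (the paper's per-rule values are not reproduced here):

```python
def rule_support(p, n, P, N):
    # Fraction of all examples (positive and negative) covered by the rule
    return (p + n) / (P + N)

def rule_precision(p, n):
    # Fraction of the covered examples that belong to the rule's class
    return p / (p + n)

# Illustrative counts (not taken from the paper's rules):
p, n, P, N = 30, 20, 170, 2414
print(round(rule_support(p, n, P, N), 3))   # -> 0.019
print(round(rule_precision(p, n), 2))       # -> 0.6
```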
The characteristics of rules obtained in the guidedc1 experiment are as follows:
(, , , )
(, , , )
The classifier based on these two rules had significantly worse classification quality. It is worth noticing, however, that the first rule was only 3.3% less precise than the best rule generated automatically for this class, while its support was 3.6 times larger. The rule pointing at class 1 had a precision of 0.15, over twice the 0.065 a priori precision of this class.
The guidedc2 experiment aimed at forcing the occurrence of the gimpuls attribute in each rule, as well as adding other elementary conditions to the rule premise. As can be seen in Table 3, this led to the model with the best classification ability. In addition, the number of rules decreased compared to the automatically generated rule set, while their average support and average precision increased by over 214% and 11%, respectively.
The results achieved for the guidedc6 experiment show that it is impossible to obtain a good quality classifier only on the basis of data coming from a seismic system; thus, it is indispensable to use geophones (sensors which register seismoacoustic emission).
In all cases, a majority of the induced rules were statistically significant. Rule sets generated in the guided mode (particularly the guidedc2 and guidedc3 variants) were less numerous than those generated automatically, and they also contained fewer elementary conditions. According to the value of the similarity measure, the rule sets induced in the auto, guidedc2, and guidedc3 modes were very similar. However, the rules generated in the guided mode represented knowledge which is more in compliance with the user's requirements and intuition. Additionally, the analysis of standard deviations shows that rule sets generated in the guided mode were more stable in their classification abilities.
The experiments we carried out show that the guided (interactive) model definition allows verifying certain research hypotheses and, in particular, obtaining classifiers superior to those generated automatically. The induction of successive rule sets may contribute to further analyses; for example, one could attempt to develop a classifier made of the first rule from the guidedc1 model supplemented with automatic rules. Our software enables performing many variants of such analyses.
The usability of the presented algorithm for regression problems was verified on the methane dataset, which concerns the problem of predicting methane concentration in a coal mine. The set contains 13 368 training and 5 728 test instances characterized by 7 attributes. The features indicate methane concentrations (MM116, MM31 [%]), air velocity (AS038 [m/s]), airflow (PG072 [m/s]), atmospheric pressure (BA13 [hPa]), and whether the coal production process (PD) is being carried out. The locations of the sensors are depicted in Fig. 1. The attributes represent measurements averaged over 30-second periods. The task is to predict the maximal methane concentration registered by MM116 during the next 3 minutes.
As in the previous case, automatic induction was performed first in order to adjust the parameters. Eventually, the RSS quality measure was selected as providing the best trade-off between the root relative squared error (RRSE) and model complexity, expressed by the number of rules and conditions. The following variants of the user's knowledge were investigated:
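RRSE relates the model's squared error to that of a predictor that always returns the mean of the dependent variable, so values below 1 beat the naive prognosis. A minimal sketch on toy data:

```python
import math

def rrse(y_true, y_pred):
    # Root relative squared error: square root of the model's sum of
    # squared errors divided by that of the mean predictor.
    mean = sum(y_true) / len(y_true)
    model_sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    naive_sse = sum((t - mean) ** 2 for t in y_true)
    return math.sqrt(model_sse / naive_sse)

# Toy methane-like concentrations (illustrative values only)
y_true = [0.4, 0.5, 1.0, 1.2, 0.9]
y_pred = [0.45, 0.5, 0.9, 1.1, 1.0]
print(round(rrse(y_true, y_pred), 3))  # -> 0.266
```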
The model contains and conditions, both appearing in three rules:
.
The conjunction appears in five rules:
.
The conjunction appears in five rules: analogous to guidedr2.
Attributes DMM116, MM116, and PD appear in every rule:
.
Table 4. Regression results: validation (train/test) and descriptive statistics (entire dataset).

variant     RRSE   #rules  #conditions  support  precision  %significant  similarity
auto        0.918     9       3.5         0.26      0.64         88           —
guidedr1    0.811    19       4.4         0.17      0.66         95          0.70
guidedr2    0.793    11       3.3         0.18      0.69        100          0.93
guidedr3    0.863     8       2.9         0.18      0.78         87          0.93
guidedr4    1.174    41       5.5         0.10      0.70        100          0.60
The automatic induction produced 9 rules, which allowed achieving an RRSE of 0.918, i.e., smaller than that of the naive prognosis based on the average value of the dependent variable (see Table 4 for all the results). The MM116 and MM31 attributes dominated in the rule premises, which means that the currently registered methane concentration has the largest impact on the future concentration. This is illustrated, for example, by the following rule:
which shows that if the concentration of methane in the middle of the longwall is low, the predicted concentration at the longwall exit will be about twice as high (it will remain in the range [0.39, 0.41]).
Another rule presents an interesting dependence: if the methane concentration is at an average level (about 1%), too high an air velocity can lead to eddies of the gas mixture at the longwall exit and, at the same time, increase the methane concentration (methane in the range [0.92, 1.28]).
In the automatically generated rules, the PD attribute occurred only twice. Within the guidedr1 experiment, the use of the elementary conditions PD = 1 or PD = 0 (the cutter-loader is not working) was obligatory. This reflects the hypothesis, supported by domain knowledge, that the emission of methane is larger while the cutter-loader is working. In this way, a significant error reduction was achieved at the cost of an increased number of rules and conditions occurring in their premises.
Within the guidedr3 experiment, the simultaneous occurrence of the two conditions was obligatory. A rule containing only those conditions covered 14% of all examples and looks as follows:
The induction of the above rule allowed better identification of rules indicating higher methane concentration, e.g.:
and, as a result of that, caused further decrease of RRSE.
Apart from the last case (guidedr4), rule sets induced in the guided mode produced smaller RRSE values than the automatically generated set. The guidedr1 settings enforce the use of the PD = 0 and PD = 1 conditions in three rules; they reflect an attempt to make the methane level dependent on the coal production process. The regression errors of guidedr1 and of the automatically generated set were close to each other, while the value of the similarity measure was relatively low. This means that these two rule sets generated different coverages of the example space.
Generally, in the case of regression rule induction, the definition of the user's requirements and the analysis of the rules can be difficult because there are no explicitly defined decision classes. However, as we can see, an interactive analysis allows reducing the estimation error. In addition, it is possible to identify interesting regularities in the data, such as the negative effects of too high an air velocity.
Another area in which our algorithm can be applied is survival analysis. The corresponding experiments were performed on the BMTCh dataset, which describes 187 pediatric patients (75 females and 112 males) with several hematologic diseases: 155 malignant disorders (among others, 67 patients with acute lymphoblastic leukemia, 33 with acute myelogenous leukemia, 25 with chronic myelogenous leukemia, and 18 with myelodysplastic syndrome) and 32 nonmalignant cases (among others, 13 patients with severe aplastic anemia, 5 with Fanconi anemia, and 4 with X-linked adrenoleukodystrophy). All patients underwent unmanipulated allogeneic unrelated-donor hematopoietic stem cell transplantation. Instances are described by 37 conditional attributes; the meaning of selected ones is as follows: relapse (reoccurrence of the disease), PLTRecovery (time to platelet recovery, defined by a platelet-count threshold), ANCRecovery (time to neutrophil recovery, defined by a neutrophil-count threshold), aGvHD_III_IV (development of acute graft versus host disease stage III or IV), extcGvHD (development of extensive chronic graft versus host disease), CD34 (CD34+ cell dose per kg of recipient body weight), and CD3 (CD3+ cell dose per kg of recipient body weight). The patient's death is considered an event in the survival analysis.
The remaining attributes concern coexisting diseases/infections (e.g. cytomegalic inclusion disease) and describe matching between the bone marrow donor and recipient.
The experiments were performed with different initial knowledge variants (note that in survival analysis, class labels for initial rules cannot be specified):
Every rule contains CD34 and does not contain ANCRecovery and PLTRecovery attributes:
.
The model consists of four expert rules:
.
As in the previous case, but the CD34 ranges may be altered and the rules can be extended with automatic conditions:
.
The model consists of two initial rules:
.
Table 5. Survival analysis results: validation (10-fold CV) and descriptive statistics (entire dataset). p: p-value comparing the IBS of the user's variants w.r.t. auto.

variant     IBS             p      #rules  #conditions  support  precision  %significant  similarity
auto        0.212 ± 0.048   —         4       3.0         0.49      1.00        100           —
guideds1    0.235 ± 0.069   0.31     14       4.1         0.14      1.00         71          0.30
guideds2    0.221 ± 0.033   0.48      4       2.0         0.21      1.00         50          0.27
guideds3    0.225 ± 0.036   0.43      4       3.0         0.36      1.00        100          0.49
guideds4    0.223 ± 0.026   0.38      2       2.0         0.50      1.00        100          0.48
Detailed results can be found in Table 5. The automatic method generated four rules in which the survival function depended on such factors as patient age, donor age, gender match, disease relapse, and the number of days to platelet recovery (PLTRecovery). The variants of guided rule induction refer to the verification of the research hypothesis that an increased dosage of CD34+ cells/kg extends the overall survival time without the simultaneous occurrence of undesirable events affecting the patients' quality of life.
The guideds2 experiment was based on an arbitrary definition of rules. These rules try to make the survival function dependent on the CD34 dosage and the occurrence of extensive chronic graft versus host disease. The average p-value of the rules after FDR correction was 0.229, which shows that, considered separately, the rules did not contain statistically useful information. As can be observed, the IBS value was also worse. Better results were achieved for the rules containing only the CD34 attribute (guideds4), which were characterized by an average p-value of 0.019 after correction.
In the guideds3 experiment, the use of the CD34 and extcGvHD attributes was obligatory, but in the former case the division point in the attribute domain was not defined. It was also admissible to add other attributes to the rule premises. The algorithm generated four rules with an average p-value after correction equal to 0.12.
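The FDR correction mentioned above can be realized with the Benjamini-Hochberg procedure (assuming this standard FDR variant; the exact method is not restated here). A sketch:

```python
def benjamini_hochberg(p_values):
    # Benjamini-Hochberg adjusted p-values (step-up FDR procedure)
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative raw p-values of four rules (not from the experiments):
raw = [0.01, 0.04, 0.03, 0.20]
print([round(q, 3) for q in benjamini_hochberg(raw)])  # -> [0.04, 0.053, 0.053, 0.2]
```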
Figure 2 shows survival curves corresponding to the three rules presented below. Survival curve for the entire set of examples is also given.
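Survival curves of this kind are typically estimated with the Kaplan-Meier method. A minimal sketch on toy data (not the BMTCh dataset):

```python
def kaplan_meier(times, events):
    # events[i] = 1 for an observed death, 0 for a censored observation
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    surv, curve = 1.0, []
    for t in event_times:
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        at_risk = sum(1 for ti in times if ti >= t)  # still at risk just before t
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Toy data: six patients, four deaths, two censored observations
times  = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

In a rule-based survival model, each rule's curve would be estimated from the subset of examples the rule covers, which is how the per-rule curves in such a figure arise.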
There was practically no difference between the survival curves corresponding to the second and third rules, although the second rule was more specific than the third one. The CD34 dosage does not have any impact on the survival function if the patient has not developed extensive chronic graft versus host disease.
According to medical knowledge, chronic graft versus host disease remains a dangerous complication of allogeneic stem cell transplantation. However, mild forms of this disease are often manageable and, if the disease is under control, can extend the overall survival time, as the disease causes the elimination of cancer cells (blasts) remaining in the blood. On the other hand, the first rule shows that patients with small doses of CD34 who developed extensive chronic graft versus host disease have a significantly shorter survival time in spite of fast neutrophil recovery.
As mentioned before, in the case of survival rule induction there are no negative examples; therefore, the precision of all rules was equal to 1. Similarly to the classification case, the majority of survival rule sets generated by guided induction were more stable (smaller standard deviation) than the rule sets generated in the automatic mode. Statistically, there were no differences between the IBS values of the guided and automatically generated rule sets. The values of the similarity measure were very low, which demonstrates that it is possible to define a rule set compliant with the user's needs that differs from the automatically generated model but preserves its prediction abilities.
The next step in the doctor's analysis could be a deeper investigation of the induced rules. For example, our algorithm could be used for further analysis of the sR1 rule: one can remove the conditions that may be too specific according to medical knowledge and analyze the quality of the modified rule, separately or together with other rules induced in the automatic way.
The presented example shows that the visualization of rule conclusions is very helpful in the survival analysis. Furthermore, similarly to the previous cases, an interactive analysis of data and induced rules rendered interesting results. The models showed better compliance with the user’s (e.g. doctor’s) requirements than those achieved by means of an automatic method.
The article presents a rule induction algorithm in which the learning process can be guided by the user (domain expert). GuideR can be used in classification, regression, and survival settings in an interactive way, enabling the user to adjust the final rule set to their own preferences. Rule induction algorithms are known to be unstable, as a small change in the set of training examples may cause significant changes in the resulting rule set. The underlying cause is often related to the boundary areas of elementary conditions covering only a small number of examples. A user-guided definition of those ranges usually preserves the predictive abilities of the final rule set while making it more stable, clearer, and closer to the user's intuition. For example, in the analysed case studies, the survival rule sR1 contained a condition with the range limit 13.055; limiting this range to 13 makes the rule more intuitive with an insignificant decrease in quality.
GuideR can influence the attributes, elementary conditions, or even entire rules of which the rule sets are composed, directing the induction towards the models most interesting to the user. Thus, the algorithm can be considered a tool for knowledge discovery and for testing certain hypotheses concerning dependencies which are expected to occur in the data. In particular, the algorithm is able to find modifications of user-defined hypotheses, provided in the form of rules, that improve their quality. Certainly, an automatic rule induction can be the starting point of a thorough dependency analysis: a set of automatically induced rules, or selected rules from this set, can be the basis for further interactive experiments. Moreover, the guided induction can be an iterative process, i.e., the successive rule sets may be built on the basis of the insights from the previous iterations.
The efficiency of our algorithms for automatic rule induction has been confirmed on dozens of benchmark datasets wrobel2017 ; wrobel2016 ; sikora2012 ; sikora2013data . In the experimental part of this article we focused on showing the efficiency of, and benefits coming from, the guided version of the algorithm. For this purpose, the analysis of three real-life datasets was presented. It shows that guided rule induction may produce data models of similar generalization abilities (e.g., classification accuracy) to those of automatic induction, containing attributes, elementary conditions, and rules complying with the user's requirements.
Further work will concern two directions. The first is extending the algorithm with the possibility to induce so-called action rules and interventions. Action rules and interventions specify recommendations which should be taken in order to transfer objects from an undesirable concept to a desirable one (e.g., moving a client from the churn group to the group of regular customers). The second direction will focus on the development of a graphical user interface for GuideR, to make it easier to apply in real-life analyses.
M. Sikora, A. Skowron, Ł. Wróbel, Rule quality measurebased induction of unordered sets of regression rules, in: A. Ramsay, G. Agre (Eds.), Artificial Intelligence: Methodology, Systems, and Applications, Vol. 7557 of LNAI, Springer, Berlin Heidelberg, 2012, pp. 162–171.
M. Sikora, A. Gruca, Induction and selection of the most interesting gene ontology based multiattribute rules for descriptions of gene groups, Pattern Recognit. Lett. 32 (2) (2011) 258–269.