RuleKit: A Comprehensive Suite for Rule-Based Learning

08/02/2019
by   Adam Gudyś, et al.

Rule-based models are often used for data analysis as they combine interpretability with predictive power. We present RuleKit, a versatile tool for rule learning. Based on a sequential covering induction algorithm, it is suitable for classification, regression, and survival problems. User-guided induction facilitates verifying hypotheses concerning data dependencies which are expected or of interest. The powerful and flexible experimental environment allows straightforward investigation of different induction schemes. The analysis can be performed in batch mode, through a RapidMiner plug-in, or from an R package. A documented Java API is also provided for convenience. The software is publicly available on GitHub under the GNU AGPL-3.0 license.


1 Introduction

Thanks to the combination of predictive and descriptive capabilities, rules have been applied in machine learning, and especially in knowledge discovery, for decades. Amongst many rule induction strategies, sequential covering is one of the most popular (Fürnkranz et al., 2012). It consists in iteratively adding rules that explain part of the training set until all the examples are covered. This approach leads to different models than those obtained by extracting rules from trees induced with a divide-and-conquer strategy (Breiman et al., 1984). In our previous research we confirmed the effectiveness of our variant of the sequential covering strategy on dozens of data sets in classification, regression, and survival analysis (Wróbel et al., 2016, 2017). We also showed the usefulness of user-guided induction, which allows introducing the user's preferences or domain knowledge into the learning process (Sikora et al., 2019), a feature particularly valuable in medical applications.
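
To make the covering scheme concrete, the fragment below sketches a deliberately simplified separate-and-conquer loop in R for binary classification: rules are restricted to a single condition of the form attribute >= threshold and are scored by precision. It illustrates the general strategy only; RuleKit itself grows and prunes multi-condition rules guided by configurable quality measures.

    # simplified sequential covering: repeatedly grow the best single-condition
    # rule "x >= t THEN positive" on the still-uncovered examples, then remove
    # ("separate") the positives it explains, until no positives remain
    sequential_covering <- function(x, y, max_rules = 10) {
      rules <- list()
      uncovered <- rep(TRUE, length(y))
      while (length(rules) < max_rules && any(y[uncovered] == 1)) {
        best <- NULL
        for (t in sort(unique(x[uncovered]))) {
          covered <- uncovered & (x >= t)
          p <- sum(covered & y == 1)   # positives covered by the candidate rule
          n <- sum(covered & y == 0)   # negatives covered by the candidate rule
          if (p > 0 && (is.null(best) || p / (p + n) > best$precision)) {
            best <- list(threshold = t, p = p, n = n, precision = p / (p + n))
          }
        }
        if (is.null(best)) break
        rules[[length(rules) + 1]] <- best
        uncovered <- uncovered & !(x >= best$threshold & y == 1)
      }
      rules
    }

    # toy usage: positive examples tend to have larger attribute values
    set.seed(1)
    x <- c(rnorm(25, mean = 0), rnorm(25, mean = 3))
    y <- rep(c(0, 1), each = 25)
    str(sequential_covering(x, y))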

In spite of numerous advantages, relatively few sequential covering rule induction algorithms are available as ready-to-use software. Examples include CN2 (Clark and Niblett, 1989), included in the Orange suite (Demšar et al., 2013), AQ (Michalski, 1969), implemented in Rseslib 3 (Wojna and Latkowski, 2019), and RIPPER (Cohen, 1995) and M5Rules (Holmes et al., 1999), both contained in Weka (Witten et al., 2016).

We present RuleKit, a comprehensive suite for training and evaluating rule-based data models. Equipped with multiple useful features such as user-guided induction, it is the first tool of this kind suitable for classification, regression, and survival analysis problems. It additionally stands out from its competitors in ease of use: besides the batch experimental environment, it can be integrated with RapidMiner and R.

2 RuleKit Features

The following features make RuleKit a powerful data analysis tool:

  1. Ability to solve different types of problems: classification, regression, and survival analysis.

  2. Various ways of running the analysis: batch mode, RapidMiner plug-in, or R package.

  3. Multiplicity of algorithm parameters. For instance, over 40 rule quality measures are available, with the additional possibility of defining custom formulas (see the sketch after this list).

  4. Integrated experimental environment: the software facilitates automated investigation of various algorithm configurations over multiple data sets. Different experimental schemes (train-test, cross-validation) are supported, and tens of performance metrics are provided for model assessment.

  5. User-guided induction: the possibility to specify an initial set of rules, as well as preferred and forbidden conditions/attributes, together with a multiplicity of options and modes, allows tailoring the model to the user's requirements. This may be useful, e.g., in verifying hypotheses concerning data dependencies which are expected or of interest.

  6. Computational scalability: independent steps of the induction algorithms (e.g., the evaluation of different conditions) are distributed over multiple threads, allowing RuleKit to take advantage of multi-core CPUs as well as high-performance clusters. Bit-level parallelism is also employed for maximum computational performance.

  7. Portability: the suite is distributed as a Java application, thus it can be run on the majority of operating systems, including Windows, Linux, and OS X.

  8. Extensibility: the software, together with the source code, is publicly available on GitHub under the GNU AGPL-3.0 licence: https://github.com/adaa-polsl/RuleKit. The documented API allows straightforward integration of the library with other projects and/or extending its functionality.
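
As an illustration of the parameter and measure configuration mentioned in item 3, the sketch below sets quality measures and a user-defined pruning formula through the control list of the R package. The parameter names mirror those of the XML configuration in Figure 1a (Section 3); whether the R interface accepts exactly these names is an assumption that should be verified against the RuleKit documentation.

    library(rulekit)
    # select quality measures and a custom pruning formula; parameter names
    # follow the XML configuration of Figure 1a and are assumed (not verified)
    # to be accepted unchanged by the R control list
    control <- list(
      min_rule_covered      = 8,               # minimal rule coverage
      induction_measure     = "BinaryEntropy", # measure guiding rule growing
      pruning_measure       = "UserDefined",   # prune using the formula below
      user_pruning_equation = "2 * p / n",     # p, n: positive/negative coverage
      voting_measure        = "C2"             # measure weighting rule votes
    )

A control list defined this way can be passed to learn_rules in the same manner as in the R example of Section 3.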

3 Case Studies

Batch mode. This example demonstrates running a RuleKit batch analysis on the deals classification data set (predicting whether a person making a purchase will become a future customer). The batch mode is run with the command java -jar RuleKit experiments.xml, where the XML file describes the parameter sets and data sets to be investigated (Figure 1a).

As a result of the training, a text report is produced (Figure 1b). It contains a list of generated rules (with corresponding confusion matrices and statistical significance), information about example coverage, model characteristics (number of rules/conditions, average rule precision/coverage, etc.), and performance metrics calculated on the training set (accuracy, error, etc.). Depending on the problem, the significance of rules is established with different tests (Fisher's exact, χ², or log-rank). The training may be followed by applying the model to unlabelled data. At this stage, a comma-separated table with the values of performance metrics evaluated on the test set is produced.

    <experiment>
       <parameter_sets>
          <parameter_set name="mincov=8, Entropy_User_C2">
             <param name="min_rule_covered">8</param>
             <param name="induction_measure">BinaryEntropy</param>
             <param name="pruning_measure">UserDefined</param>
             <param name="user_pruning_equation">2 * p / n</param>
             <param name="voting_measure">C2</param>
          </parameter_set>
       </parameter_sets>
       <datasets>
          <dataset>
             <label>Future Customer</label>
             <out_directory>./results-deals</out_directory>
             <training>
                <report_file>training.log</report_file>
                <train>
                   <in_file>../data/deals/deals-train.arff</in_file>
                   <model_file>deals.mdl</model_file>
                </train>
             </training>
             <prediction>
                <performance_file>performance.csv</performance_file>
                <predict>
                   <model_file>deals.mdl</model_file>
                   <test_file>../deals/data/deals-test.arff</test_file>
                   <predictions_file>deals-pred.arff</predictions_file>
                </predict>
             </prediction>
          </dataset>
       </datasets>
    </experiment>
(a)
    Rules:
    r1: IF Gender = {male} AND Age = (-inf, 34.5)
        THEN Future Customer = {yes}
        (p=176.0, n=0.0, P=473.0, N=527.0, weight=0.69, pval=3.8E-67)
    r2: IF Payment Method = {credit card} AND Age = (-inf, 30.5)
        THEN Future Customer = {yes}
        (p=183.0, n=0.0, P=473.0, N=527.0, weight=0.69, pval=2.9E-70)
    r3: IF Gender = {male} AND Age = (-inf, 35.5)
        THEN Future Customer = {yes}
        (p=185.0, n=0.0, P=473.0, N=527.0, weight=0.70, pval=3.6E-71)
    ...

    Best rules covering examples from training set (1-based):
    5*,12;3*,12;2*,14,15;6,11*,12,13,14,15;5*,12;13*,14,15;2*,15;...

    Model characteristics:
    time_total_s: 0.916438051
    time_growing_s: 0.684499623
    time_pruning_s: 0.365774315
    #rules: 15.0
    #conditions_per_rule: 2.0
    #induced_conditions_per_rule: 22.533333333333335
    avg_rule_coverage: 0.26126666666666665
    avg_rule_precision: 0.9517768845699609
    ...

    Training set performance:
    accuracy: 0.954
    classification_error: 0.04600000000000004
    kappa: 0.9072689080712335
    balanced_accuracy: 0.9513742071881607
    #rules_per_example: 3.919
    ...
(b)
Figure 1: (a) XML configuration of a batch experiment on the deals data set. (b) The resulting training report.
Figure 2: (a) Process for analyzing the methane data set with the RuleKit RapidMiner plug-in. (b) Wizard for specifying user's knowledge in the guided induction.

RapidMiner plug-in. An alternative way of performing an experiment is to integrate RuleKit with RapidMiner. The plug-in provides the user with two operators: RuleKit Generator and RuleKit Performance. The former is a RapidMiner learner that induces various types of rule models. The latter extends the standard RapidMiner Performance operator and allows calculating performance metrics as well as gathering model characteristics. In Figure 2 we present an example RapidMiner process which performs a regression analysis on the methane data set (predicting methane concentration in a coal mine), together with the wizard for specifying user's knowledge in the guided induction.

R package. As the last case study, we present the application of the RuleKit R package to analyzing factors contributing to patients' survival following bone marrow transplants. The corresponding data set (BMT-Ch) is integrated with the package in the form of a standard R data frame. Training and applying a model is performed by the function learn_rules, which returns a named list containing the induced rules, survival function estimates, test set performance metrics, etc. In Figure 3(a) we provide example R code for training the model and visualizing the corresponding survival function estimates.

    library(ggplot2)
    library(reshape2)
    library(rulekit)

    formula <- survival::Surv(survival_time, survival_status) ~ .
    control <- list(min_rule_covered = 5)
    results <- rulekit::learn_rules(formula, control, bone_marrow)

    # extract outputs:
    rules <- results[["rules"]]             # list of rules
    cov   <- results[["train-coverage"]]    # coverage information
    surv  <- results[["estimator"]]         # survival function estimates
    perf  <- results[["test-performance"]]  # testing performance

    # melt data set for automatic plotting of multiple series
    melted_surv <- reshape2::melt(surv, id.var = "time")

    # plot survival function estimates
    ggplot(melted_surv, aes(x = time, y = value, color = variable)) +
       geom_line(size = 1.0) +
       xlab("time") + ylab("survival probability") +
       theme_bw() + theme(legend.title = element_blank())
(a)
Figure 3: Analyzing the BMT-Ch survival data set using the RuleKit R package: (a) the code for training the model and visualizing the results, (b) survival function estimates of the generated rules.
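
As a quick point of reference for the per-rule curves of Figure 3b, one may additionally compute the Kaplan-Meier estimate for the whole data set with the survival package. The short sketch below does this, assuming only the column names already used in the formula above.

    library(survival)
    # Kaplan-Meier estimate for the entire BMT-Ch data set, for comparison
    # with the per-rule survival curves of Figure 3b
    km <- survival::survfit(survival::Surv(survival_time, survival_status) ~ 1,
                            data = bone_marrow)
    plot(km, conf.int = FALSE, xlab = "time", ylab = "survival probability")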

4 Conclusions and Future Work

We demonstrated that RuleKit can be successfully applied to training and evaluating rule-based models in classification, regression, and survival tasks. The multiplicity of options and modes, together with the powerful and flexible experimental environment, makes the presented suite a useful tool for data analysis and knowledge discovery. In the future, we plan to extend RuleKit with algorithms for inducing action rules (Hajja et al., 2014) and oblique rules (Sikora and Gudyś, 2013). The applicability of the suite could be further enhanced by providing a Python wrapper or a standalone graphical interface.

This work was supported by the Polish National Centre for Research and Development (NCBiR) within the Operational Programme Intelligent Development (grant no. POIR.04.01.02-00-0024/17-00), by the Rector of the Silesian University of Technology (grant no. 02/020/RGJ18/0126), and by the Institute of Informatics at the Silesian University of Technology within the statutory research project (BKM18/RAU2/556).

References

  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (1984) Classification and regression trees. Chapman & Hall/CRC, Boca Raton, London, New York, Washington. Cited by: §1.
  • P. Clark and T. Niblett (1989) The CN2 induction algorithm. Mach. Learn. 3 (4), pp. 261–283. Cited by: §1.
  • W. W. Cohen (1995) Fast Effective Rule Induction. In ICML 1995, pp. 115–123. Cited by: §1.
  • J. Demšar, T. Curk, A. Erjavec, et al. (2013) Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res. 14 (1), pp. 2349–2353. Cited by: §1.
  • J. Fürnkranz, D. Gamberger, and N. Lavrač (2012) Foundations of Rule Learning. Springer-Verlag, Berlin, Heidelberg. Cited by: §1.
  • A. Hajja, Z. W. Ras, and A. Wieczorkowska (2014) Hierarchical object-driven action rules. J. Intell. Inf. Syst. 42 (2), pp. 207–232. Cited by: §4.
  • G. Holmes, M. Hall, and E. Frank (1999) Generating Rule Sets from Model Trees. In Australian Joint Conference on Artificial Intelligence (AI 1999), pp. 1–12. Cited by: §1.
  • R. S. Michalski (1969) On the quasi-minimal solution of the general covering problem. In FCIP 69, Vol. A3, pp. 125–128. Cited by: §1.
  • M. Sikora and A. Gudyś (2013) CHIRA—Convex Hull Based Iterative Algorithm of Rules Aggregation. Fundam. Inform. 123 (2), pp. 143–170. Cited by: §4.
  • M. Sikora, Ł. Wróbel, and A. Gudyś (2019) GuideR: A guided separate-and-conquer rule learning in classification, regression, and survival settings. Knowl.-Based Syst. 173, pp. 1–14. Cited by: §1.
  • I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal (2016) Data Mining: Practical Machine Learning Tools and Techniques. 4th edition, Morgan Kaufmann, San Francisco. Cited by: §1.
  • A. Wojna and R. Latkowski (2019) Rseslib 3: Library of Rough Set and Machine Learning Methods with Extensible Architecture. In Transactions on Rough Sets XXI, LNCS, Vol. 10810, pp. 301–323. Cited by: §1.
  • Ł. Wróbel, A. Gudyś, and M. Sikora (2017) Learning rule sets from survival data. BMC Bioinformatics 18 (1), pp. 285. Cited by: §1.
  • Ł. Wróbel, M. Sikora, and M. Michalak (2016) Rule Quality Measures Settings in Classification, Regression and Survival Rule Induction—an Empirical Approach. Fundam. Inform. 149 (4), pp. 419–449. Cited by: §1.