1 Introduction

Thanks to the combination of predictive and descriptive capabilities, rules have been applied in machine learning (especially in knowledge discovery) for decades. Amongst many rule induction strategies, sequential covering is one of the most popular (Fürnkranz et al., 2012). It consists in iteratively adding rules, each explaining a part of the training set, until all the examples are covered. This approach leads to different models than those obtained by extracting rules from trees induced with a divide-and-conquer strategy (Breiman et al., 1984). In previous research we confirmed the effectiveness of our variant of the sequential covering strategy on dozens of data sets in classification, regression, and survival analysis (Wróbel et al., 2016, 2017). We also showed the usefulness of user-guided induction, which allows introducing the user's preferences or domain knowledge into the learning process (Sikora et al., 2019), a feature particularly valuable in medical applications.
In spite of numerous advantages, relatively few sequential covering rule induction algorithms are available as ready-to-use software. Examples include CN2 (Clark and Niblett, 1989) included in the Orange suite (Demšar et al., 2013), AQ (Michalski, 1969) implemented in Rseslib 3 (Wojna and Latkowski, 2019), and RIPPER (Cohen, 1995) and M5Rules (Holmes et al., 1999) contained in Weka (Witten et al., 2016).
We present RuleKit, a comprehensive suite for training and evaluating rule-based data models. Equipped with multiple useful features such as user-guided induction, it is the first tool suitable for classification, regression, and survival analysis problems alike. It additionally stands out from its competitors in ease of use: besides a batch experimental environment, it can be integrated with RapidMiner and R.
2 RuleKit Features
The following features make RuleKit a powerful data analysis tool:
- Ability to solve different problems: classification, regression, and survival analysis.
- Various ways to run the analysis: batch mode, RapidMiner plug-in, R package.
- Multiplicity of algorithm parameters. For instance, there are over 40 rule quality measures available, with the additional possibility of defining custom formulas.
- Integrated experimental environment: the software facilitates automated investigation of various algorithm configurations over multiple data sets. Different experimental schemes (train-test, cross-validation) are supported, and dozens of performance metrics are provided for model assessment.
- User-guided induction: the possibility of specifying an initial set of rules, as well as preferred and forbidden conditions/attributes, together with the multiplicity of options and modes, allows tailoring the model to the user's requirements. This may be useful, e.g., in verifying hypotheses concerning data dependencies which are expected or of interest.
- Computational scalability: independent steps of the induction algorithms (e.g., the evaluation of different conditions) are distributed over multiple threads, allowing RuleKit to take advantage of multi-core CPUs as well as high-performance clusters. Bit-level parallelism is also employed for maximum computational performance.
- Portability: the suite is distributed as a Java application, thus it can be run on the majority of operating systems, including Windows, Linux, and OS X.
- Extensibility: the software, together with its source code, is publicly available on GitHub under the GNU AGPL-3.0 licence: https://github.com/adaa-polsl/RuleKit. The documented API allows straightforward integration of the library with other projects and/or extending its functionality.
3 Case Studies
Batch mode. This example demonstrates running a RuleKit batch analysis on the deals classification data set (predicting whether a person making a purchase will become a future customer). The batch mode is run with the java -jar RuleKit experiments.xml command, where the XML file describes the parameter sets and data sets to be investigated (Figure 1a).
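The configuration below is a minimal sketch of such a file. The element names (parameter_sets, datasets, training, etc.) follow our reading of the RuleKit documentation only approximately and should be treated as illustrative, as should the parameter values and file names:

```xml
<!-- Hypothetical experiments.xml sketch: one parameter set, one data set.
     Element and parameter names are illustrative, not guaranteed verbatim. -->
<experiment>
  <parameter_sets>
    <parameter_set name="mincov-5">
      <param name="min_rule_covered">5</param>
      <param name="induction_measure">Correlation</param>
    </parameter_set>
  </parameter_sets>
  <datasets>
    <dataset>
      <label>Future Customer</label>    <!-- name of the decision attribute -->
      <training>
        <report_file>deals-report.txt</report_file>
        <train>
          <in_file>deals-train.arff</in_file>
        </train>
      </training>
    </dataset>
  </datasets>
</experiment>
```

Each parameter set is evaluated against each listed data set, which is what makes the batch mode convenient for grid-style comparisons of algorithm configurations.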
As a result of the training, a text report is produced (Figure 1b). It contains a list of the generated rules (with corresponding confusion matrices and statistical significance), information about example coverage, model characteristics (number of rules/conditions, average rule precision/coverage, etc.), and performance metrics calculated on the training set (accuracy, error, etc.). Depending on the problem, the significance of rules is established with different tests (Fisher's exact, χ², or log-rank). The training may be followed by applying the model to unlabelled data. In this stage, a comma-separated table with the values of performance metrics evaluated on the test set is produced.
RapidMiner plug-in. An alternative way of performing an experiment is integrating RuleKit with RapidMiner. The plug-in provides the user with two operators: RuleKit Generator and RuleKit Performance. The former is a RapidMiner learner that induces various types of rule models. The latter extends the standard RapidMiner Performance operator and allows calculating performance metrics as well as gathering model characteristics. In Figure 2 we present an example RapidMiner process which performs regression analysis on the methane data set (predicting methane concentration in a coal mine) and a wizard for specifying the user's knowledge in guided induction.
R package. As the last case study, we present the application of the RuleKit R package for analyzing factors contributing to patients' survival following bone marrow transplants. The corresponding data set (BMT-Ch) is integrated with the package in the form of a standard R data frame. Training and applying a model are performed by the learn_rules function, which returns a named list containing the induced rules, survival function estimates, test set performance metrics, etc. In Figure 2(a) we provide an example R code for training the model and visualizing the corresponding survival function estimates.
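Since the listing itself is not reproduced here, the following sketch illustrates such a call. Only the learn_rules function name comes from the text; the package name, the data frame name, the argument names, and the result list elements are assumptions for illustration:

```r
library(rulekit)  # assumed package name

# Train a survival rule model on the bundled BMT-Ch data frame
# (assumed object name: bmt; assumed column names for time/status).
# Argument names are illustrative, not the exact API.
model <- learn_rules(
  formula    = survival::Surv(survival_time, survival_status) ~ .,
  train_data = bmt,
  test_data  = bmt
)

model$rules            # induced survival rules (illustrative element name)
plot(model$estimator)  # per-rule survival function estimates (illustrative)
```

The named-list return value makes it easy to pass the per-rule survival estimates to standard R plotting functions, as done in the figure.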
4 Conclusions and Future Work
We demonstrated that RuleKit can be successfully applied for training and evaluating rule-based models in classification, regression, and survival tasks. The multiplicity of options and modes, together with the powerful and flexible experimental environment, makes the presented suite a useful tool for data analysis and knowledge discovery. In the future, we plan to extend RuleKit with algorithms for inducing action rules (Hajja et al., 2014) and oblique rules (Sikora and Gudyś, 2013). The applicability of the suite could be further enhanced by providing a Python wrapper or a standalone graphical interface.
Acknowledgments

This work was supported by the Polish National Centre for Research and Development (NCBiR) within the Operational Programme Intelligent Development (grant no. POIR.04.01.02-00-0024/17-00), by the Rector of the Silesian University of Technology (grant no. 02/020/RGJ18/0126), and by the Institute of Informatics at the Silesian University of Technology within the statutory research project (BKM18/RAU2/556).
References

- Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton.
- Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Mach. Learn. 3(4), pp. 261–283.
- Cohen, W. W. (1995). Fast effective rule induction. In ICML 1995, pp. 115–123.
- Demšar, J. et al. (2013). Orange: data mining toolbox in Python. J. Mach. Learn. Res. 14(1), pp. 2349–2353.
- Fürnkranz, J., Gamberger, D., and Lavrač, N. (2012). Foundations of Rule Learning. Springer-Verlag, Berlin, Heidelberg.
- Hajja, A., Raś, Z. W., and Wieczorkowska, A. A. (2014). Hierarchical object-driven action rules. J. Intell. Inf. Syst. 42(2), pp. 207–232.
- Holmes, G., Hall, M., and Frank, E. (1999). Generating rule sets from model trees. In Australian Joint Conference on Artificial Intelligence (AI 1999), pp. 1–12.
- Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In FCIP 69, Vol. A3, pp. 125–128.
- Sikora, M. and Gudyś, A. (2013). CHIRA—convex hull based iterative algorithm of rules aggregation. Fundam. Inform. 123(2), pp. 143–170.
- Sikora, M., Wróbel, Ł., and Gudyś, A. (2019). GuideR: a guided separate-and-conquer rule learning in classification, regression, and survival settings. Knowl.-Based Syst. 173, pp. 1–14.
- Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016). Data Mining: Practical Machine Learning Tools and Techniques. 4th edition, Morgan Kaufmann, San Francisco.
- Wojna, A. and Latkowski, R. (2019). Rseslib 3: library of rough set and machine learning methods with extensible architecture. In Transactions on Rough Sets XXI, LNCS Vol. 10810, pp. 301–323.
- Wróbel, Ł., Gudyś, A., and Sikora, M. (2017). Learning rule sets from survival data. BMC Bioinformatics 18(1), 285.
- Wróbel, Ł., Sikora, M., and Michalak, M. (2016). Rule quality measures settings in classification, regression and survival rule induction—an empirical approach. Fundam. Inform. 149(4), pp. 419–449.