The widespread use of machine learning (ML) in sensitive areas such as healthcare, justice, and asset management has underlined the importance of interpretability in the decision-making process. In recent years, the number of publications on interpretability has increased sharply. Two main approaches can usually be distinguished for producing interpretable predictive models. The first relies on an uninterpretable machine learning algorithm to create predictive models, which are then explained by a so-called post-hoc interpretable model, for example LIME , DeepLIFT , or SHAP . These explanatory models try to measure the importance of each feature in the prediction process (see  for an overview of existing methods). However, as outlined in , such explanations may not be sufficient for a sensitive decision-making process.
The other way is to use an intrinsically interpretable algorithm to directly generate an interpretable model, such as the decision tree algorithms CART , ID3  and C4.5 , or rule-based algorithms such as RIPPER , FORS , M5 Rules , RuleFit , Ender , Node Harvest , or, more recently, SIRUS  and RICE .
These algorithms are based on the notion of a rule. A rule is an If-Then statement of the form

$$\text{If } c_1 \wedge c_2 \wedge \dots \wedge c_k \text{ Then } p.$$

The condition part (If) is a logical conjunction in which the $c_i$'s are tests that check whether the observation has the specified properties or not. The number $k$ is called the length of the rule. If all the $c_i$'s are fulfilled, the rule is said to be activated, and the conclusion part (Then) is the prediction $p$ of the rule when it is activated. Usually, if the feature space is $\mathcal{X} \subseteq \mathbb{R}^d$, each $c_i$ checks whether one specific feature lies in an interval (e.g. $c_i : x^{(j)} \in [a, b]$).
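The If-Then structure above can be made concrete with a small sketch. The following Python snippet is an illustration, not the implementation used in the paper; all class and attribute names are hypothetical. It represents a rule as a conjunction of interval tests:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Test:
    """Checks whether feature number `feature` lies in [low, high]."""
    feature: int
    low: float
    high: float

    def is_satisfied(self, x):
        return self.low <= x[self.feature] <= self.high

@dataclass(frozen=True)
class Rule:
    """If every test in `tests` holds, predict `prediction`."""
    tests: tuple
    prediction: float

    @property
    def length(self):
        # The length of a rule is the number of tests in its conjunction.
        return len(self.tests)

    def is_activated(self, x):
        # The rule is activated when all tests are fulfilled.
        return all(t.is_satisfied(x) for t in self.tests)

# A rule of length 2: If x[0] in [0, 1] and x[2] in [-1, 0] Then 3.5
rule = Rule(tests=(Test(0, 0.0, 1.0), Test(2, -1.0, 0.0)), prediction=3.5)
```

When the rule is activated for an observation, its prediction applies; otherwise the rule abstains.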
In , the author emphasizes that there is no rigorous mathematical foundation for the concept of interpretability. In this paper, a rigorous, quantitative and objective measure of interpretability is proposed as a comparison criterion for any rule-based algorithm. This measure is based on the triptych predictability, computability, stability presented in : predictability measures the accuracy of the predictive model; stability quantifies the noise sensitivity of an algorithm; finally, the notion of computability has been replaced by a notion of simplicity. Computability is important in practice, but it does not reflect the model's ability to be interpreted, whereas the simplicity of the model proves to be more effective at this task.
The aim of a predictive model is to predict the value of a variable of interest $Y$, given features $X$. Formally, we set the standard regression setting as follows: let $(X, Y)$ be a couple of random variables in $\mathcal{X} \times \mathcal{Y}$ of unknown distribution $Q$ such that

$$Y = g^*(X) + \varepsilon,$$

where $X \in \mathcal{X} \subseteq \mathbb{R}^d$ and $Y \in \mathcal{Y} \subseteq \mathbb{R}$, and $g^*$ is a measurable function from $\mathcal{X}$ to $\mathcal{Y}$.
We denote by $\mathcal{G}$ the set of all measurable functions from $\mathcal{X}$ to $\mathcal{Y}$. The accuracy of a regression function $g \in \mathcal{G}$ is measured by its risk, defined as

$$\mathcal{L}(g) := \mathbb{E}_{(X,Y) \sim Q}\left[\gamma\left(g, (X, Y)\right)\right], \quad (3)$$

where $\gamma$ is called a contrast function. The risk measures the average discrepancy, given a new observation $(X, Y)$ from the distribution $Q$, between $g(X)$ and $Y$.
Given a sample $\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, we aim at predicting $Y$ conditionally on $X$. The observations are independent and identically distributed (i.i.d.) from the distribution $Q$.
To do so, we consider a statistical algorithm $\mathcal{A}$, which is a measurable mapping defined by

$$\mathcal{A} : \bigcup_{n \geq 1} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{G}.$$

The purpose of an algorithm $\mathcal{A}$ is to generate a measurable function $g \in \mathcal{G}$ that minimizes the risk (3). To carry out this minimization, algorithms use the Empirical Risk Minimization (ERM) principle , meaning that

$$\hat{g}_n \in \underset{g \in \mathcal{G}}{\arg\min}\; \mathcal{L}_n(g),$$

where $\mathcal{L}_n(g) = \frac{1}{n} \sum_{i=1}^{n} \gamma\left(g, (X_i, Y_i)\right)$ is the empirical risk.
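The empirical risk is straightforward to evaluate for a candidate function. The following Python sketch is illustrative (function names are hypothetical; the quadratic contrast is just one possible choice of $\gamma$):

```python
def quadratic_contrast(prediction, y):
    """Quadratic contrast: squared discrepancy between prediction and truth."""
    return (prediction - y) ** 2

def empirical_risk(g, xs, ys, contrast=quadratic_contrast):
    """Empirical risk: average contrast of g over the sample (x_i, y_i)."""
    return sum(contrast(g(x), y) for x, y in zip(xs, ys)) / len(xs)

# A candidate predictor g(x) = 2x evaluated on a toy sample:
risk = empirical_risk(lambda x: 2 * x, [1.0, 2.0], [2.0, 5.0])  # -> 0.5
```

An ERM-based algorithm searches its model class for the function with the smallest such empirical risk.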
The choice of $\gamma$ depends on the nature of $\mathcal{Y}$. For example, if $\mathcal{Y} = \mathbb{R}$, one generally uses the quadratic contrast $\gamma(g, (x, y)) = (g(x) - y)^2$. The minimizer of the risk (3) with the quadratic contrast is called the regression function, defined by

$$g^*(x) = \mathbb{E}[Y \mid X = x].$$

If $\mathcal{Y} = \{0, 1\}$, one uses the $0\text{-}1$ contrast function $\gamma(g, (x, y)) = \mathbb{1}_{g(x) \neq y}$. The minimizer of the risk (3) with the $0\text{-}1$ contrast function is called the Bayes classifier, defined by

$$g^*(x) = \mathbb{1}_{\mathbb{E}[Y \mid X = x] \geq 1/2}.$$
Hence, according to the ERM principle, the choice of the contrast $\gamma$ determines the function $g^*$ that an algorithm tries to estimate.
The notion of predictivity is based on the ability of an algorithm to provide an accurate predictive model. This notion has been well studied for years. In this paper, the predictivity is defined as follows:

$$\text{Pred}(\mathcal{A}) := 1 - \frac{\mathcal{L}(\hat{g}_n)}{\mathcal{L}(g_0)}, \quad (6)$$

where $g_0$ is the trivial constant predictor according to $\gamma$: for example, the mean of $Y$ for the quadratic contrast and the majority class for the $0\text{-}1$ contrast.
This quantity, as a measure of accuracy, is independent of the range of $Y$. We may assume that it is a number between $0$ and $1$: indeed, the risk (3) is a positive function, and if $\mathcal{L}(\hat{g}_n) > \mathcal{L}(g_0)$, the predictor is worse than the trivial constant predictor.
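As a sanity check of formula (6), the predictivity can be approximated on a sample by replacing both risks by their empirical counterparts. A minimal Python sketch with the quadratic contrast (function names are hypothetical):

```python
def predictivity(predictions, ys):
    """Empirical predictivity: 1 - risk(model) / risk(trivial mean predictor)."""
    n = len(ys)
    y_bar = sum(ys) / n  # trivial constant predictor for the quadratic contrast
    risk_model = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / n
    risk_trivial = sum((y_bar - y) ** 2 for y in ys) / n
    return 1.0 - risk_model / risk_trivial

# Perfect predictions give predictivity 1; always predicting the mean gives 0.
```

A negative value signals a predictor worse than the trivial constant one, as noted above.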
In , the authors have proposed a measure of the stability of a rule-based algorithm built upon the following definition:
A rule learning algorithm is stable if two independent estimations based on two independent samples result in two similar lists of rules.
The notion of $q$-stability is based on the same definition. This notion appears to be fairer to algorithms that do not use feature discretization and operate on real rather than integer values. In fact, the probability that a decision tree algorithm cuts at exactly the same value for the same rule, given two independent samples, is null. For this reason, pure stability appears too penalizing in this case.
Feature discretization is a common solution for controlling the complexity of a rule generator. In , for example, the authors use the entropy minimization heuristic to discretize the features, while for the algorithms BRL (Bayesian Rule Lists) , SIRUS  and RICE , the authors have used the empirical quantiles of the features to discretize them. See  for an overview of usual discretization methods.
Let $q$ be the number of quantiles considered for the discretization and let $X^{(j)}$ be a feature. An integer $b \in \{1, \dots, q\}$, named a bin, is associated with each interval $[q_{b-1}, q_b)$, where $q_b$ is the $b$-th $q$-quantile of $X^{(j)}$. A discrete version of the feature $X^{(j)}$, denoted $\tilde{X}^{(j)}$, is designed by replacing each value by its corresponding bin; in other words, the value $b$ is associated with all $x$ such that $x \in [q_{b-1}, q_b)$.
This discretization process can be extended to a rule set by replacing, for all rules, the interval bounds of each test by their corresponding bins. For example, the test $x^{(j)} \in [a, b]$ becomes $\tilde{x}^{(j)} \in [b_a, b_b]$, where $b_a$ and $b_b$ are such that $a \in [q_{b_a - 1}, q_{b_a})$ and $b \in [q_{b_b - 1}, q_{b_b})$.
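The bin-replacement step can be sketched in Python as follows. This is an illustration under assumed conventions (helper names are hypothetical, and ties at interval bounds may be handled differently in practice):

```python
def to_bin(value, cut_points):
    """Return the index of the first quantile interval containing `value`.
    `cut_points` are the q-1 empirical quantile cut points of the feature."""
    for b, cut in enumerate(cut_points):
        if value <= cut:
            return b
    return len(cut_points)  # value falls in the last interval

def discretize_test(low, high, cut_points):
    """Replace a test's interval bounds [low, high] by their bins."""
    return (to_bin(low, cut_points), to_bin(high, cut_points))

# With cut points [1.0, 2.0, 3.0] (q = 4), the test x in [0.5, 2.5]
# becomes the bin interval (0, 2).
```

Two rules whose cut values fall in the same quantile intervals are thus mapped to the same discretized rule, which is what the $q$-stability compares.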
The formula of the $q$-stability is based on the so-called Dice-Sorensen index. Let $\mathcal{A}$ be a rule-based algorithm, and let $\mathcal{D}_n$ and $\mathcal{D}'_n$ be two independent samples of $n$ i.i.d. observations drawn from the same distribution $Q$. Let $R_n$ and $R'_n$ be the sets of rules generated by $\mathcal{A}$, given $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively. Then, the $q$-stability is calculated by

$$\text{Stab}_q(\mathcal{A}) := \frac{2\,\lvert \tilde{R}_n \cap \tilde{R}'_n \rvert}{\lvert \tilde{R}_n \rvert + \lvert \tilde{R}'_n \rvert}, \quad (7)$$

where $\tilde{R}$ denotes the discretized version of the rule set $R$, with the convention that $0/0 = 0$. The discretization process is performed using $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively.
This quantity is a number between $0$ and $1$: if $\tilde{R}_n$ and $\tilde{R}'_n$ have no common rules, then $\text{Stab}_q(\mathcal{A}) = 0$, while if they are identical, then $\text{Stab}_q(\mathcal{A}) = 1$.
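Formula (7) reduces to a Dice-Sorensen index between two sets of discretized rules, which is straightforward to compute once the rules are hashable. A sketch (the rule representation is left abstract here):

```python
def q_stability(rules_1, rules_2):
    """Dice-Sorensen index between two discretized rule sets (0/0 := 0)."""
    set_1, set_2 = set(rules_1), set(rules_2)
    denominator = len(set_1) + len(set_2)
    if denominator == 0:
        return 0.0  # convention: two empty rule sets give stability 0
    return 2 * len(set_1 & set_2) / denominator

# Example: one rule out of two in common on each side gives 2*1/(2+2) = 0.5.
```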
In , the authors have introduced a notion of interpretability score based on the sum of the lengths of all the rules constituting the predictive model.

The interpretability score of an estimator $\hat{g}_n$ generated by a set of rules $R_n$ is defined by

$$\text{Int}(\hat{g}_n) := \sum_{r \in R_n} \text{length}(r). \quad (8)$$
The measure of simplicity is based on Definition 4.1. The idea is to compare (8) relative to a set of algorithms $\mathfrak{A}$. Hence, the simplicity of an algorithm $\mathcal{A} \in \mathfrak{A}$ is defined in relative terms as follows:

$$\text{Simp}(\mathcal{A}) := \frac{\min_{\mathcal{A}' \in \mathfrak{A}} \text{Int}(\hat{g}_{n,\mathcal{A}'})}{\text{Int}(\hat{g}_{n,\mathcal{A}})}. \quad (9)$$
Like the previous ones, this quantity is a number between $0$ and $1$: if $\mathcal{A}$ generates the simplest predictor among the set of algorithms $\mathfrak{A}$, then $\text{Simp}(\mathcal{A}) = 1$. The simplicity of the other algorithms in $\mathfrak{A}$ is then evaluated relative to $\mathcal{A}$.
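One natural reading of this relative definition is that the simplicity of each algorithm is the smallest interpretability score in the panel divided by that algorithm's own score. A Python sketch under that assumption (names are illustrative):

```python
def interpretability_score(rule_lengths):
    """Interpretability score (8): sum of the lengths of all rules."""
    return sum(rule_lengths)

def simplicity(scores_by_algorithm):
    """Simplicity: best (smallest) score divided by each algorithm's score,
    so the simplest algorithm gets 1 and the others get values in (0, 1)."""
    best = min(scores_by_algorithm.values())
    return {name: best / score for name, score in scores_by_algorithm.items()}

# An algorithm with score 4 against one with score 8: simplicities 1.0 and 0.5.
```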
The main idea underlying the definition of the interpretability of a rule-based algorithm is the use of a weighted sum of the predictivity (6), the stability (7) and the simplicity (9). Let $\mathfrak{A}$ be a set of rule-based algorithms; the interpretability of any algorithm $\mathcal{A} \in \mathfrak{A}$ is defined by:

$$\mathcal{I}(\mathcal{A}) := \alpha_1\,\text{Pred}(\mathcal{A}) + \alpha_2\,\text{Stab}_q(\mathcal{A}) + \alpha_3\,\text{Simp}(\mathcal{A}), \quad (10)$$

where the coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$ are chosen according to the statistician's desiderata, such that $\alpha_1 + \alpha_2 + \alpha_3 = 1$. If a statistician tries to understand and describe a phenomenon, then simplicity and predictivity are more important than stability.
It is important to notice that the definition of interpretability (10) depends on the set of rule-based algorithms $\mathfrak{A}$ and on a regression setting. Therefore, the interpretability value only makes sense within that set of algorithms and for a specific regression setting.
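Putting the three terms together, formula (10) is a convex combination of the three criteria. A minimal sketch (the default weights shown are an arbitrary example, not the paper's choice):

```python
def interpretability(pred, stab, simp, alphas=(1/3, 1/3, 1/3)):
    """Interpretability (10): weighted sum of predictivity, q-stability and
    simplicity, with weights summing to 1."""
    a1, a2, a3 = alphas
    assert abs((a1 + a2 + a3) - 1.0) < 1e-9, "weights must sum to 1"
    return a1 * pred + a2 * stab + a3 * simp

# Equal weights average the three criteria:
# interpretability(0.9, 0.6, 0.3) is approximately 0.6.
```

Shifting weight onto predictivity and simplicity, as suggested above for descriptive use, simply amounts to changing `alphas`.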
The aim of this application is to compare four rule-based algorithms: the Decision Tree algorithm (DT) , RuleFit (RF) , the Covering Algorithm (CA)  and RICE . Their parametrizations are summarized in Table 1. For this application, the same model as in  is considered. Two samples $\mathcal{D}_n$ and $\mathcal{D}'_n$ of $n$ data points are generated following the regression setting (11), where $d$ is the dimension of $X$, $X^{(1)}$ is the first component of $X$, and $\varepsilon$ is a centered Gaussian noise.
Table 1: Parametrization of the algorithms.

| Algorithm | Parameters |
|---|---|
| DT | Maximal number of rules |
| RF | Maximal number of rules; cross-validation |
| CA | Number of rules per tree; number of trees |
| RICE | Number of candidates; maximal length |
Predictivity (6) is approximated using test observations and by averaging the errors of the predictors generated from $\mathcal{D}_n$ and $\mathcal{D}'_n$. The $q$-stability (7) is measured by choosing a value of $q$ and discretizing with respect to $\mathcal{D}_n$ and $\mathcal{D}'_n$. Simplicity (9) of an algorithm is computed by averaging the measure over the predictors generated from $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively. Finally, the interpretability (10) is calculated with the chosen coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$. The results are summarized in Table 2.
Remark: for the sake of simplification, the linear relationships generated by RuleFit have been considered as rules for the evaluation of the $q$-stability. This algorithm generates four linear relationships from $\mathcal{D}_n$, whereas a single relationship is produced from $\mathcal{D}'_n$. Regarding the linear relationships generated by RuleFit on the two datasets, they only show one "rule" in common.
RICE and the Covering Algorithm seem to be the most interpretable algorithms for this setting. However, the predictivity value of RICE is very poor compared to the other algorithms. Therefore, the Covering Algorithm is the most interpretable algorithm in this panel for the setting (11). Even if RuleFit is the best algorithm of this panel in predictivity and $q$-stability, it generates too many rules and therefore has a weaker simplicity.
7 Conclusion and perspectives
In this paper, a quantitative criterion for the interpretability of rule-based algorithms was presented. This measure is based on the triptych: predictivity (6), stability (7) and simplicity (9). This new concept of interpretability has been designed to be fair and rigorous. It can be adapted to the various desiderata of the statistician by choosing appropriate coefficients in the interpretability formula (10). An application to four rule-based algorithms, the Decision Tree algorithm , RuleFit , the Covering Algorithm  and RICE , shows how to use and analyse the interpretability value. This application will be extended to other well-known rule-based algorithms such as C4.5 , RIPPER , Ender  and SIRUS  in further work.
This methodology seems to make the interpretability comparison of rule-based algorithms quite fair. However, according to Definition 4.1, $k$ rules of length $1$ have the same simplicity as one single rule of length $k$, which is debatable. Moreover, the stability measure is purely syntactic and rather restrictive. Indeed, if some features are duplicated, two rules may have two different syntactic conditions but be otherwise identical in their activations. One way of relaxing this stability criterion could be to compare the rules based on their activation sets (i.e. by looking at the observations for which all conditions are met simultaneously). Finally, this comparison of interpretability between a set of algorithms only makes sense for rule-based algorithms. It could be interesting to extend it to other types of algorithms.
- M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
- A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
- S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
- R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, and F. Giannotti. A survey of methods for explaining black box models. arXiv preprint arXiv:1802.01933, 2018.
- C. Rudin. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154, 2018.
- L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. CRC Press, 1984.
- J. R. Quinlan. Induction of decision trees. Machine Learning, pages 81–106. Springer, 1986.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- W. W. Cohen. Fast effective rule induction. In Machine Learning Proceedings, pages 115–123, 1995.
- A. Karalič and I. Bratko. First order regression. Springer, pages 147–176, 1997.
- G. Holmes, M. Hall, and E. Frank. Generating rule sets from model trees. Springer, pages 1–12, 1999.
- J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, pages 916–954, 2008.
- K. Dembczyński, W. Kotłowski, and R. Słowiński. Solving regression by learning an ensemble of decision rules. In International Conference on Artificial Intelligence and Soft Computing, pages 533–544, 2008.
- N. Meinshausen. Node harvest. The Annals of Applied Statistics, pages 2049–2072. Institute of Mathematical Statistics, 2010.
- C. Bénard, G. Biau, S. Da Veiga, and E. Scornet. SIRUS: making random forests interpretable. arXiv preprint arXiv:1908.06852, 2019.
- V. Margot, J.-P. Baudry, F. Guilloux, and O. Wintenberger. Rule Induction Covering Estimator: a new data-dependent covering algorithm, 2020.
- Z. C. Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2017.
- B. Yu and K. Kumbier. Three principles of data science: predictability, computability, and stability (PCS). arXiv preprint arXiv:1901.08152, 2019.
- V. Vapnik and S. Kotz. Estimation of Dependences Based on Empirical Data. Springer-Verlag New York, 1982.
- V. Margot, J.-P. Baudry, F. Guilloux, and O. Wintenberger. Consistent regression using data-dependent coverings. arXiv preprint arXiv:1907.02306, 2019.
- U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. 13th Int. Joint Conf. on Artificial Intelligence, pages 1022–1027, 1993.
- B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
- J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings, pages 194–202, 1995.