1 Introduction
The widespread use of machine learning (ML) in many sensitive areas such as healthcare, justice, and asset management has underlined the importance of interpretability in the decision-making process. In recent years, the number of publications on interpretability has increased exponentially. Usually, two main ways can be distinguished for producing interpretable predictive models. The first relies on using an uninterpretable machine learning algorithm to create predictive models, and then building on top of them a so-called post-hoc interpretable model, for example LIME [1], DeepLIFT [2], or SHAP [3]. These explanatory models try to measure the importance of a feature in the prediction process (see [4] for an overview of existing methods). However, as outlined in [5], such explanations may not be sufficient for a sensitive decision-making process.
The other way is to use an intrinsically interpretable algorithm to directly generate an interpretable model, such as the decision tree algorithms CART [6], ID3 [7], and C4.5 [8], or rule-based algorithms such as RIPPER [9], FORS [10], M5 Rules [11], RuleFit [12], Ender [13], Node Harvest [14], or more recently SIRUS [15] and RICE [16]. These algorithms are based on the notion of a rule. A rule is an If-Then statement of the form
IF $c_1 \wedge c_2 \wedge \dots \wedge c_k$ THEN $\hat{y} = p$. (1)
The condition part If is a logical conjunction, where the $c_i$'s are tests that check whether the observation has the specified properties or not. The number $k$ is called the length of the rule. If all the $c_i$'s are fulfilled, the rule is said to be activated, and the conclusion part Then gives the prediction $p$ of the rule when it is activated. Usually, if the feature space is $\mathcal{X} \subseteq \mathbb{R}^d$, each $c_i$ checks whether one specific feature lies in an interval (e.g. $c_i = \{x^{(j)} \in [a, b]\}$).
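Such a rule can be sketched in a few lines of Python. The `Test` and `Rule` classes below are illustrative names, not taken from any of the cited algorithms:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Test:
    feature: int   # index j of the feature x^(j)
    low: float     # lower bound of the interval
    high: float    # upper bound of the interval

    def check(self, x):
        # does the observation x satisfy this test?
        return self.low <= x[self.feature] <= self.high

@dataclass
class Rule:
    tests: list        # logical conjunction of tests (the If part)
    prediction: float  # the Then part

    def length(self):
        # length of the rule = number of tests in the conjunction
        return len(self.tests)

    def activated(self, x):
        # the rule is activated iff every test is fulfilled
        return all(t.check(x) for t in self.tests)

rule = Rule(tests=[Test(0, 0.0, 0.5), Test(1, 0.2, 1.0)], prediction=3.0)
print(rule.length())               # 2
print(rule.activated([0.3, 0.4]))  # True
print(rule.activated([0.7, 0.4]))  # False
```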
In [17], the author emphasizes that there is no rigorous mathematical foundation for the concept of interpretability. In this paper, a rigorous, quantitative and objective measure of interpretability is proposed as a comparison criterion for any rule-based algorithm. This measure is based on the triptych predictability, computability, stability presented in [18]: predictability measures the accuracy of the predictive model; stability quantifies the noise sensitivity of an algorithm; finally, the notion of computability has been replaced by a notion of simplicity. Computability is important in practice, but it does not reflect the model's ability to be interpreted, whereas the simplicity of the model proves to be more suited to this task.
2 Predictivity
The aim of a predictive model is to predict the value of a variable of interest $Y$, given features $X$. Formally, we set the standard regression setting as follows: let $(X, Y)$ be a couple of random variables in $\mathcal{X} \times \mathcal{Y}$ of unknown distribution $\mathcal{P}$ such that
$$Y = g^*(X) + Z, \qquad (2)$$
where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$, $\mathbb{E}[Z \mid X] = 0$, and $g^*$ is a measurable function from $\mathcal{X}$ to $\mathcal{Y}$.
We denote by $\mathcal{F}$ the set of all measurable functions from $\mathcal{X}$ to $\mathcal{Y}$. The accuracy of a regression function $g \in \mathcal{F}$ is measured by its risk, defined as
$$\mathcal{L}(g) = \mathbb{E}_{(X, Y) \sim \mathcal{P}}\big[\gamma\big(g, (X, Y)\big)\big], \qquad (3)$$
where $\gamma$ is called a contrast function. The risk measures the average discrepancy, given a new observation $(X, Y)$ from the distribution $\mathcal{P}$, between $g(X)$ and $Y$.
Given a sample $\mathcal{D}_n = \big((X_1, Y_1), \dots, (X_n, Y_n)\big)$, we aim at predicting $Y$ conditionally on $X$. The observations are independent and identically distributed (i.i.d.) from the distribution $\mathcal{P}$.
To do so, we consider a statistical algorithm $A$, which is a measurable mapping defined by
$$A : \bigcup_{n \geq 1} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{F}, \qquad A(\mathcal{D}_n) = \hat{g}_n, \qquad (4)$$
where $\hat{g}_n \in \mathcal{F}$.
The purpose of an algorithm $A$ is to generate a measurable function $\hat{g}_n$ that minimizes the risk (3). To carry out this minimization, algorithms use the Empirical Risk Minimization (ERM) principle [19], meaning that
$$\hat{g}_n \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \mathcal{L}_n(g), \qquad (5)$$
where $\mathcal{L}_n(g) = \frac{1}{n} \sum_{i=1}^{n} \gamma\big(g, (X_i, Y_i)\big)$ is the empirical risk and $\mathcal{G} \subseteq \mathcal{F}$ is the class of functions the algorithm can generate.
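The ERM principle can be illustrated with a minimal sketch, assuming a toy finite candidate class of three predictors (all names here are illustrative):

```python
# Empirical risk L_n(g) = (1/n) * sum_i gamma(g, (x_i, y_i))
def empirical_risk(g, contrast, sample):
    return sum(contrast(g, xy) for xy in sample) / len(sample)

def quadratic_contrast(g, xy):
    # quadratic contrast gamma(g, (x, y)) = (g(x) - y)^2
    x, y = xy
    return (g(x) - y) ** 2

sample = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
# a toy candidate class; ERM picks the minimizer of the empirical risk
candidates = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
best = min(candidates, key=lambda g: empirical_risk(g, quadratic_contrast, sample))
print(empirical_risk(best, quadratic_contrast, sample))  # 0.0 (g(x) = 2x fits exactly)
```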
The choice of $\gamma$ depends on the nature of $\mathcal{Y}$. For example, if $\mathcal{Y} = \mathbb{R}$, one generally uses the quadratic contrast $\gamma(g, (x, y)) = (g(x) - y)^2$. The minimizer of the risk (3) with the quadratic contrast is called the regression function, defined by
$$g^*(x) = \mathbb{E}[Y \mid X = x].$$
If $\mathcal{Y} = \{0, 1\}$, one uses the 0-1 contrast function $\gamma(g, (x, y)) = \mathbf{1}_{g(x) \neq y}$. The minimizer of the risk (3) with the 0-1 contrast function is called the Bayes classifier, defined by
$$g^*(x) = \mathbf{1}_{\mathbb{P}(Y = 1 \mid X = x) \geq 1/2}.$$
Hence, according to the ERM principle, the choice of $\gamma$ determines the function $g^*$ that an algorithm tries to estimate, and thus the estimator $\hat{g}_n$.
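The two minimizers can be illustrated on a toy sample, under the simplifying assumption that the labels below are observations of $Y$ for a fixed value of $X$:

```python
def regression_function(ys):
    # quadratic contrast -> conditional mean E[Y | X = x]
    return sum(ys) / len(ys)

def bayes_classifier(ys):
    # 0-1 contrast -> 1 iff P(Y = 1 | X = x) >= 1/2 (majority vote)
    return 1 if sum(ys) / len(ys) >= 0.5 else 0

ys_at_x = [0, 1, 1, 1]  # toy labels observed "at" some x
print(regression_function(ys_at_x))  # 0.75
print(bayes_classifier(ys_at_x))     # 1
```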
The notion of predictivity is based on the ability of an algorithm to provide an accurate predictive model. This notion has been well studied for years. In this paper, the predictivity is defined as follows:
$$P(A, \mathcal{D}_n) = 1 - \frac{\mathcal{L}(\hat{g}_n)}{\mathcal{L}(g_0)}, \qquad (6)$$
where $g_0$ is the trivial constant predictor according to $\gamma$: for example, $g_0 = \mathbb{E}[Y]$ for the quadratic contrast, and $g_0$ is the most common class for the 0-1 contrast.
This quantity, as a measure of accuracy, is independent of the range of $Y$. We may assume that it is a positive number between $0$ and $1$. Indeed, the risk (3) is a positive function, and if $P(A, \mathcal{D}_n) < 0$, it means that the predictor is worse than the trivial constant predictor.
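Predictivity (6) can be approximated empirically. The sketch below assumes the quadratic contrast and estimates the trivial predictor $g_0 = \mathbb{E}[Y]$ by the sample mean:

```python
# Empirical sketch of the predictivity: 1 - risk(predictor) / risk(trivial predictor)
def predictivity(y_true, y_pred):
    n = len(y_true)
    risk = sum((p - y) ** 2 for p, y in zip(y_pred, y_true)) / n
    g_0 = sum(y_true) / n  # trivial constant predictor E[Y] for the quadratic contrast
    trivial_risk = sum((g_0 - y) ** 2 for y in y_true) / n
    return 1 - risk / trivial_risk

y_true = [1.0, 2.0, 3.0, 4.0]
print(predictivity(y_true, [1.1, 1.9, 3.2, 3.8]))  # close to 1: accurate predictor
print(predictivity(y_true, y_true))                # 1.0: perfect predictor
```

A predictor whose squared error exceeds that of the constant mean would yield a negative value, matching the remark above.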
3 Stability
In [15], the authors have proposed a measure of the stability of a rule-based algorithm built upon the following definition:
A rule learning algorithm is stable if two independent estimations based on two independent samples result in two similar lists of rules.
The notion of stability used here is based on the same definition. This notion appears to be fairer for algorithms that do not use feature discretization and operate on real rather than integer values. In fact, the probability that a decision tree algorithm cuts at the exact same value for the same rule, given two independent samples, is null. For this reason, pure stability appears too penalizing in this case.
Feature discretization is a common solution for controlling the complexity of a rule generator. In [21], for example, the authors use the entropy minimization heuristic to discretize the features, while for the algorithms BRL (Bayesian Rule Lists) [22], SIRUS [15] and RICE [16], the authors have used the empirical quantiles of the features. See [23] for an overview of usual discretization methods.
Let $q$ be the number of quantiles considered for the discretization and let $X^{(j)}$ be a feature. An integer $b \in \{1, \dots, q\}$, named bin, is associated to each interval $[q_{b-1}^{(j)}, q_b^{(j)})$, where $q_b^{(j)}$ is the $b$-th empirical $q$-quantile of $X^{(j)}$. A discrete version of the feature $X^{(j)}$, denoted $\tilde{X}^{(j)}$, is designed by replacing each value by its corresponding bin; in other words, the value $b$ is associated to all $x^{(j)}$ such that $x^{(j)} \in [q_{b-1}^{(j)}, q_b^{(j)})$.
This discretization process can be extended to a rule set by replacing, for all rules, the interval bounds of each test by their corresponding bins. For example, the test $x^{(j)} \in [a, b]$ becomes $\tilde{x}^{(j)} \in [b_a, b_b]$, where $b_a$ and $b_b$ are such that $a \in [q_{b_a - 1}^{(j)}, q_{b_a}^{(j)})$ and $b \in [q_{b_b - 1}^{(j)}, q_{b_b}^{(j)})$.
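The quantile-based discretization described above can be sketched with NumPy; `to_bins` is an illustrative helper name, and a uniform grid stands in for an empirical sample:

```python
import numpy as np

def to_bins(values, quantiles):
    # np.searchsorted maps each value to the index of the first quantile >= value,
    # i.e. its bin among {0, ..., q-1}
    return np.searchsorted(quantiles, values, side="left")

q = 4
feature = np.array([0.05, 0.3, 0.55, 0.9])
# empirical q-quantiles of the feature (here computed on a uniform grid for illustration)
quantiles = np.quantile(np.linspace(0, 1, 101), [i / q for i in range(1, q + 1)])
bins = to_bins(feature, quantiles)
print(bins)  # [0 1 2 3]: each value replaced by the index of its quantile interval
```

The bounds of a rule's tests are then mapped through the same function, so two rules with slightly different real-valued cuts fall into the same discretized rule.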
The formula of the stability is based on the so-called Dice-Sorensen index. Let $A$ be a rule-based algorithm and let $\mathcal{D}_n$ and $\mathcal{D}'_n$ be two independent samples of $n$ i.i.d. observations drawn from the same distribution $\mathcal{P}$. And let $R_n$ and $R'_n$ be the sets of rules generated by $A$, given $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively. Then, the stability is calculated by
$$S(A, \mathcal{D}_n, \mathcal{D}'_n) = \frac{2\,|\tilde{R}_n \cap \tilde{R}'_n|}{|\tilde{R}_n| + |\tilde{R}'_n|}, \qquad (7)$$
where $\tilde{R}_n$ is the discretized version of the rule set $R_n$, and with the convention that $0/0 = 0$. The discretization process is performed using the empirical quantiles of $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively.
This quantity is a positive number between $0$ and $1$: if $\tilde{R}_n$ and $\tilde{R}'_n$ have no common rules, then $S(A, \mathcal{D}_n, \mathcal{D}'_n) = 0$, while if $\tilde{R}_n$ and $\tilde{R}'_n$ are the same, then $S(A, \mathcal{D}_n, \mathcal{D}'_n) = 1$.
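The Dice-Sorensen stability (7) reduces to a few set operations once rules are discretized. In this sketch, each discretized rule is represented, by assumption, as a frozenset of (feature, lower bin, upper bin) tests:

```python
def stability(rules_1, rules_2):
    # Dice-Sorensen index between two discretized rule sets
    if len(rules_1) == 0 and len(rules_2) == 0:
        return 0.0  # convention 0/0 = 0
    common = len(set(rules_1) & set(rules_2))
    return 2 * common / (len(rules_1) + len(rules_2))

r1 = {frozenset({(0, 1, 3)}), frozenset({(1, 0, 2), (2, 1, 1)})}
r2 = {frozenset({(0, 1, 3)}), frozenset({(3, 2, 4)})}
print(stability(r1, r2))        # 0.5: one common rule out of 2 + 2
print(stability(r1, r1))        # 1.0: identical rule sets
print(stability(set(), set()))  # 0.0 by convention
```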
4 Simplicity
In [20], the authors have introduced a notion of interpretability score based on the sum of the lengths of all the rules constituting the predictive model.
Definition 4.1.
The interpretability score of an estimator $\hat{g}_n$ generated by a set of rules $R_n$ is defined by
$$\mathrm{Int}(\hat{g}_n) = \sum_{r \in R_n} \mathrm{length}(r). \qquad (8)$$
Furthermore, the value (8), which is a positive number, cannot be directly compared to the values from (6) and (7), which are between $0$ and $1$.
The measure of simplicity is based on Definition 4.1. The idea is to evaluate (8) relatively to a set of algorithms $\mathfrak{A}$. Hence, the simplicity of an algorithm $A \in \mathfrak{A}$ is defined in relative terms as follows:
$$\mathrm{Simp}(A, \mathcal{D}_n) = \frac{\min_{A' \in \mathfrak{A}} \mathrm{Int}(\hat{g}_n^{A'})}{\mathrm{Int}(\hat{g}_n^{A})}. \qquad (9)$$
Like the previous ones, this quantity is a positive number between $0$ and $1$: if $A$ generates the simplest predictor among the set of algorithms $\mathfrak{A}$, then $\mathrm{Simp}(A, \mathcal{D}_n) = 1$. The simplicity of the other algorithms in $\mathfrak{A}$ is then evaluated relatively to $A$.
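Definition 4.1 and the relative simplicity (9) can be sketched as follows, assuming each rule is represented as a tuple of tests and `scores` collects the interpretability scores of a hypothetical panel of algorithms:

```python
def int_score(rules):
    # interpretability score (8): sum of the lengths of all rules in the predictor
    return sum(len(r) for r in rules)

def simplicity(scores, name):
    # relative simplicity (9): minimal score in the panel divided by this algorithm's score
    return min(scores.values()) / scores[name]

scores = {"DT": int_score([("t1",), ("t1", "t2")]),        # 1 + 2 = 3
          "RF": int_score([("t1", "t2"), ("t3", "t4")])}   # 2 + 2 = 4
print(simplicity(scores, "DT"))  # 1.0: simplest algorithm of the panel
print(simplicity(scores, "RF"))  # 0.75
```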
5 Interpretability
The main idea underlying the definition of the interpretability of a rule-based algorithm is the use of a weighted sum of the predictivity (6), the stability (7) and the simplicity (9). Let $\mathfrak{A}$ be a set of rule-based algorithms; the interpretability of any algorithm $A \in \mathfrak{A}$ is defined by:
$$\mathcal{I}(A, \mathcal{D}_n) = \alpha_1 P(A, \mathcal{D}_n) + \alpha_2 S(A, \mathcal{D}_n, \mathcal{D}'_n) + \alpha_3 \mathrm{Simp}(A, \mathcal{D}_n), \qquad (10)$$
where the coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$ are chosen according to the statistician's desiderata, such that $\alpha_1 + \alpha_2 + \alpha_3 = 1$. For example, if a statistician tries to understand and to describe a phenomenon, then simplicity and predictivity are more important than stability.
It is important to notice that the definition of interpretability (10) depends on the set of rule-based algorithms $\mathfrak{A}$ and on a regression setting. Therefore, the interpretability value only makes sense within that set of algorithms and for a specific regression setting.
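The weighted sum (10) is straightforward to compute; the weight values below are illustrative, chosen to favour predictivity and simplicity over stability as in the example above:

```python
def interpretability(pred, stab, simp, alpha=(1 / 3, 1 / 3, 1 / 3)):
    # weighted sum (10); the weights must sum to 1
    assert abs(sum(alpha) - 1) < 1e-9
    return alpha[0] * pred + alpha[1] * stab + alpha[2] * simp

# illustrative weights favouring predictivity and simplicity over stability
print(interpretability(0.8, 0.4, 0.9, alpha=(0.4, 0.2, 0.4)))  # 0.76
```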
6 Application
The aim of this application is to compare four rule-based algorithms: the Decision Tree algorithm (DT) [6], RuleFit (RF) [12], the Covering Algorithm (CA) [20] and RICE [16]. Their parametrization is summarized in Table 1. For this application, the same model as in [12] is considered. Two samples $\mathcal{D}_n$ and $\mathcal{D}'_n$ of data are generated following the regression setting (2), with $d$ the dimension of $X$ and the regression function given by
(11)
where $x^{(1)}$ is the 1st component of $x$. The value of $\sigma$ was chosen to produce a two-to-one signal-to-noise ratio. The variables were generated from a uniform distribution.
Table 1. Parametrization of the algorithms.
Algorithm | Parameters
DT        | Maximal number of rules
RF        | Maximal number of rules; cross-validation
CA        | Number of rules by tree; number of trees
RICE      | Number of candidates; maximal length
Predictivity (6) is approximated using test observations, by averaging the error of the predictors generated from $\mathcal{D}_n$ and $\mathcal{D}'_n$. The stability (7) is measured by setting the number of quantiles $q$ and discretizing with respect to $\mathcal{D}_n$ and $\mathcal{D}'_n$. The simplicity (9) of an algorithm is computed by averaging the measure on the predictors generated from $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively. Finally, the interpretability (10) is calculated with the chosen coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$. The results are summarized in Table 2.
Remark: For the sake of simplification, linear relationships generated by RuleFit have been considered as rules for the evaluation of the stability. This algorithm generates four linear relationships using $\mathcal{D}_n$, whereas a single relationship is produced using $\mathcal{D}'_n$. Regarding the linear relationships generated by RuleFit on the two datasets $\mathcal{D}_n$ and $\mathcal{D}'_n$, they only show one "rule" in common.
Table 2. Predictivity, stability, simplicity and interpretability values for the algorithms DT, RF, CA and RICE.
RICE and the Covering Algorithm seem to be the most interpretable algorithms for this setting. However, the predictivity value of RICE is very poor compared to the other algorithms. Therefore, the Covering Algorithm is the most interpretable algorithm in this panel for the setting (11). Even if RuleFit is the best algorithm of this panel in predictivity and stability, it generates too many rules and therefore has a weaker simplicity.
7 Conclusion and perspectives
In this paper, a quantitative criterion for the interpretability of rule-based algorithms was presented. This measure is based on the triptych: predictivity (6), stability (7) and simplicity (9). This new concept of interpretability has been designed to be fair and rigorous. It can be adapted to the various desiderata of the statistician by choosing the coefficients in the interpretability formula (10) appropriately. An application on four rule-based algorithms, the Decision Tree algorithm [6], RuleFit [12], the Covering Algorithm [20] and RICE [16], shows how to use and analyse the interpretability value. This application will be extended to other well-known rule-based algorithms such as C4.5 [8], RIPPER [9], Ender [13] and SIRUS [15] in further work.
This methodology seems to make the interpretability comparison of rule-based algorithms quite fair. However, according to Definition 4.1, $k$ rules of length $1$ have the same simplicity as one single rule of length $k$, which is debatable. Moreover, the stability measure is purely syntactic and rather restrictive. Indeed, if some features are duplicated, two rules may have different syntactic conditions but be otherwise identical in their activations. One way of relaxing this stability criterion could be to compare the rules based on their activation sets (i.e. by looking at the observations where their conditions are met simultaneously). Finally, this comparison of interpretability between a set of algorithms only makes sense for rule-based algorithms. It could be interesting to extend it to other types of algorithms.
References
 [1] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
 [2] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
 [3] S. M. Lundberg and S.I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
 [4] R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, and F. Giannotti. A survey of methods for explaining black box models. arXiv preprint arXiv:1802.01933, 2018.
 [5] C. Rudin. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154, 2018.
 [6] L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. CRC Press, 1984.
 [7] J. R. Quinlan. Induction of decision trees. Springer, pages 81–106, 1986.
 [8] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 [9] W. W. Cohen. Fast effective rule induction. In Machine Learning Proceedings, pages 115–123, 1995.
 [10] A. Karalič and I. Bratko. First order regression. Springer, pages 147–176, 1997.
 [11] G. Holmes, M. Hall, and E. Frank. Generating rule sets from model trees. Springer, pages 1–12, 1999.
 [12] J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, pages 916–954, 2008.
 [13] K. Dembczyński, W. Kotłowski, and R. Słowiński. Solving regression by learning an ensemble of decision rules. In International Conference on Artificial Intelligence and Soft Computing, pages 533–544, 2008.
 [14] N. Meinshausen. Node harvest. The Annals of Applied Statistics, pages 2049–2072, 2010.
 [15] C. Bénard, G. Biau, S. Da Veiga, and E. Scornet. SIRUS: making random forests interpretable. arXiv preprint arXiv:1908.06852, 2019.
 [16] V. Margot, J.-P. Baudry, F. Guilloux, and O. Wintenberger. Rule Induction Covering Estimator: a new data dependent covering algorithm, 2020.
 [17] Z. C. Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2017.
 [18] B. Yu and K. Kumbier. Three principles of data science: predictability, computability, and stability (PCS). arXiv preprint arXiv:1901.08152, 2019.
 [19] V. Vapnik and S. Kotz. Estimation of Dependences Based on Empirical Data. Springer-Verlag New York, 1982.
 [20] V. Margot, J.-P. Baudry, F. Guilloux, and O. Wintenberger. Consistent regression using data-dependent coverings. arXiv preprint arXiv:1907.02306, 2019.
 [21] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. 13th Int. Joint Conf. on Artificial Intelligence, pages 1022–1027, 1993.
 [22] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan. Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
 [23] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings, pages 194–202, 1995.