A rigorous method to compare interpretability of rule-based algorithms

04/03/2020 ∙ by Vincent Margot, et al.

Interpretability is becoming increasingly important in predictive model analysis. Unfortunately, as many authors have pointed out, there is still no consensus on what this notion means. The aim of this article is to propose a rigorous mathematical definition of the concept of interpretability, allowing fair comparisons between any rule-based algorithms. This definition is built from three notions, each of which is quantitatively measured by a simple formula: predictivity, stability and simplicity. While predictivity has been widely studied to measure the accuracy of predictive algorithms, stability is based on the Dice-Sorensen index to compare two sets of rules generated by an algorithm from two independent samples. Simplicity is based on the sum of the lengths of the rules in the generated model. The final objective measure of the interpretability of any rule-based algorithm is a weighted sum of the three aforementioned notions. The paper concludes with a comparison of the interpretability of four rule-based algorithms.


1 Introduction

The widespread use of machine learning (ML) in sensitive areas such as healthcare, justice and asset management has underlined the importance of interpretability in the decision-making process. In recent years, the number of publications on interpretability has increased exponentially. Usually, two main ways can be distinguished to produce interpretable predictive models. The first relies on a non-interpretable machine learning algorithm to create a predictive model, and then builds a so-called post-hoc explanatory model on top of it, for example LIME [1], DeepLIFT [2] or SHAP [3]. These explanatory models try to measure the importance of each feature in the prediction process (see [4] for an overview of existing methods). However, as outlined in [5], such explanations may not be sufficient for a sensitive decision-making process.

The other way is to use an intrinsically interpretable algorithm to directly generate an interpretable model, such as the decision tree algorithms CART [6], ID3 [7], C4.5 [8] and RIPPER [9], or the rule-based algorithms FORS [10], M5 Rules [11], RuleFit [12], Ender [13], Node Harvest [14], or more recently SIRUS [15] and RICE [16].

These algorithms are based on the notion of rule. A rule is an If-Then statement of the form

IF $c_1 \wedge c_2 \wedge \dots \wedge c_k$     (1)
THEN $\hat{y} = p$.

The condition part (If) is a logical conjunction $c_1 \wedge \dots \wedge c_k$, where the $c_i$'s are tests that check whether the observation has the specified properties or not. The number $k$ is called the length of the rule. If all the $c_i$'s are fulfilled, the rule is said to be activated, and the conclusion part (Then) gives the prediction $p$ of the rule when it is activated. Usually, if the feature space is $\mathbb{R}^d$, each $c_i$ checks whether one specific feature lies in a given interval (e.g. $x_j \in [a, b]$).
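To fix ideas, here is a minimal Python sketch of such a rule, represented as a conjunction of interval tests; the class name and attributes are illustrative choices for this example, not notation from the paper.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class Rule:
        """A rule: IF all interval tests hold THEN predict `prediction`."""
        # Each test maps a feature index j to an interval [a, b] for x_j.
        tests: Dict[int, Tuple[float, float]]
        prediction: float

        @property
        def length(self) -> int:
            # The length of the rule is its number of tests.
            return len(self.tests)

        def activated(self, x: List[float]) -> bool:
            # The rule is activated when every test is fulfilled.
            return all(a <= x[j] <= b for j, (a, b) in self.tests.items())

    # Example: IF x_0 in [0.2, 0.8] AND x_3 in [0.0, 0.5] THEN predict 1.4
    r = Rule(tests={0: (0.2, 0.8), 3: (0.0, 0.5)}, prediction=1.4)
    print(r.length, r.activated([0.5, 9.9, 9.9, 0.1]))  # 2 True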

In [17], the author emphasizes that there is no rigorous mathematical foundation for the concept of interpretability. In this paper, a rigorous, quantitative and objective measure of interpretability is proposed as a comparison criterion for rule-based algorithms. This measure is based on the triptych predictability, computability, stability presented in [18]: predictability measures the accuracy of the predictive model; stability quantifies the noise sensitivity of an algorithm; finally, the notion of computability has been replaced by a notion of simplicity. Computability is important in practice, but it does not reflect the model's ability to be interpreted, whereas the simplicity of the model proves to be better suited to this task.

2 Predictivity

The aim of a predictive model is to predict the value of a variable of interest $Y$, given features $X$. Formally, we set the standard regression setting as follows: let $(X, Y)$ be a couple of random variables in $\mathbb{R}^d \times \mathbb{R}$ of unknown distribution $Q$ such that

$Y = g^*(X) + Z$,     (2)

where $Z$ is a noise variable satisfying $\mathbb{E}[Z \mid X] = 0$ and $\mathbb{V}[Z \mid X] = \sigma^2 < \infty$, and $g^*$ is a measurable function from $\mathbb{R}^d$ to $\mathbb{R}$.

We denote by $\mathcal{G}$ the set of all measurable functions from $\mathbb{R}^d$ to $\mathbb{R}$. The accuracy of a regression function $g \in \mathcal{G}$ is measured by its risk, defined as

$\mathcal{L}(g) = \mathbb{E}_{(X,Y) \sim Q}\left[\gamma\left(g, (X, Y)\right)\right]$,     (3)

where $\gamma$ is called a contrast function. The risk measures the average discrepancy between $g(X)$ and $Y$, given a new observation $(X, Y)$ drawn from the distribution $Q$.

Given a sample $\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ of observations that are independent and identically distributed (i.i.d.) from the distribution $Q$, we aim at predicting $Y$ conditionally on $X$.

To do so, we consider a statistical algorithm $\mathcal{A}$, which is a measurable mapping defined by

$\mathcal{A} : \bigcup_{n \geq 1} \left(\mathbb{R}^d \times \mathbb{R}\right)^n \to \mathcal{G}, \quad \mathcal{D}_n \mapsto \hat{g}_n$,     (4)

where $\hat{g}_n := \mathcal{A}(\mathcal{D}_n) \in \mathcal{G}$.

The purpose of an algorithm $\mathcal{A}$ is to generate a measurable function $\hat{g}_n$ that minimizes the risk (3). To carry out this minimization, algorithms use the Empirical Risk Minimization (ERM) principle [19], meaning that

$\hat{g}_n \in \operatorname{arg\,min}_{g \in \mathcal{G}} \ \mathcal{L}_n(g)$,     (5)

where $\mathcal{L}_n(g) = \frac{1}{n} \sum_{i=1}^{n} \gamma\left(g, (X_i, Y_i)\right)$ is the empirical risk.

The choice of $\gamma$ depends on the nature of $Y$. For example, if $Y$ takes values in $\mathbb{R}$, one generally uses the quadratic contrast $\gamma(g, (x, y)) = (y - g(x))^2$. The minimizer of the risk (3) with the quadratic contrast is called the regression function, defined by

$g^*(x) = \mathbb{E}[Y \mid X = x]$.

If $Y$ takes values in a finite set of labels, one uses the 0-1 contrast function $\gamma(g, (x, y)) = \mathbb{1}_{\{g(x) \neq y\}}$. The minimizer of the risk (3) with the 0-1 contrast function is called the Bayes classifier, defined by

$g^*(x) = \operatorname{arg\,max}_{y} \ \mathbb{P}(Y = y \mid X = x)$.

Hence, according to the ERM principle, the choice of $\gamma$ determines the function $g^*$ that an algorithm tries to estimate, and thus the target of $\hat{g}_n$.

The notion of predictivity is based on the ability of an algorithm to provide an accurate predictive model. This notion has been widely studied for years. In this paper, the predictivity is defined as follows:

$\mathcal{P}(\mathcal{A}, \mathcal{D}_n) := 1 - \dfrac{\mathcal{L}(\hat{g}_n)}{\mathcal{L}(g_0)}$,     (6)

where $g_0$ is the trivial constant predictor associated with $\gamma$: for example, $g_0 \equiv \bar{Y}$, the empirical mean of $Y$, for the quadratic contrast, and $g_0 \equiv$ the most frequent class of $Y$ for the 0-1 contrast.

This quantity, as a measure of accuracy, is independent of the range of $Y$. We may assume that it is a positive number between 0 and 1. Indeed, the risk (3) is a positive function, and if $\mathcal{P}(\mathcal{A}, \mathcal{D}_n) < 0$, it means that the predictor is worse than the trivial constant predictor.
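As an illustration, the sketch below computes an empirical counterpart of the predictivity (6) for the quadratic contrast, comparing the empirical risk of a predictor to that of the trivial constant predictor (the empirical mean of the responses); the function names are assumptions made for this example.

    import numpy as np

    def quadratic_contrast(y_pred: np.ndarray, y: np.ndarray) -> np.ndarray:
        # gamma(g, (x, y)) = (y - g(x))^2
        return (y - y_pred) ** 2

    def empirical_risk(y_pred: np.ndarray, y: np.ndarray) -> float:
        # L_n(g) = (1/n) sum_i gamma(g, (x_i, y_i))
        return float(np.mean(quadratic_contrast(y_pred, y)))

    def predictivity(y_pred: np.ndarray, y: np.ndarray) -> float:
        # Trivial constant predictor for the quadratic contrast: the mean of y.
        trivial = np.full_like(y, y.mean(), dtype=float)
        # 1 - L(g_hat) / L(g_0); a negative value means the predictor is
        # worse than the trivial constant predictor.
        return 1.0 - empirical_risk(y_pred, y) / empirical_risk(trivial, y)

    # Toy usage
    rng = np.random.default_rng(0)
    y = rng.normal(size=200)
    print(predictivity(0.9 * y, y))        # close to 1: accurate predictor
    print(predictivity(np.zeros(200), y))  # near 0: no better than trivial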

3 q-Stability

In [15], the authors have proposed a measure of the stability of a rule-based algorithm built upon the following definition:

A rule learning algorithm is stable if two independent estimations based on two independent samples result in two similar lists of rules.

The notion of q-stability is based on the same definition. It appears to be fairer for algorithms that do not use feature discretization and operate on real rather than integer values. Indeed, the probability that a decision tree algorithm cuts at exactly the same value for the same rule, given two independent samples, is null. For this reason, pure stability appears too penalizing in this case.

Feature discretization is a common solution for controlling the complexity of a rule generator. In [21], for example, the authors use the entropy minimization heuristic to discretize the features, while for the algorithms BRL (Bayesian Rule Lists) [22], SIRUS [15] and RICE [16], the authors have used the empirical quantiles of the features to discretize them. See [23] for an overview of common discretization methods.

Let $q$ be the number of quantiles considered for the discretization and let $X_j$ be a feature. An integer $b \in \{1, \dots, q\}$, named bin, is associated to each interval between two consecutive empirical $q$-quantiles of $X_j$. A discrete version of the feature $X_j$, denoted $\tilde{X}_j$, is designed by replacing each value by its corresponding bin; in other words, the value $b$ is associated with all $x$ lying between the $(b-1)$-th and the $b$-th empirical $q$-quantile of $X_j$.

This discretization process can be extended to a rule set by replacing, for all rules, the interval bounds of each test by their corresponding bins. For example, the test $x_j \in [a, b]$ becomes $\tilde{x}_j \in [b_a, b_b]$, where $b_a$ and $b_b$ are the bins of the intervals containing $a$ and $b$ respectively.
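A minimal sketch of this quantile-based discretization, assuming a rule stores its tests as a mapping from feature index to interval bounds (as in the earlier rule sketch); the bin convention and helper names are illustrative.

    import numpy as np
    from typing import Dict, Tuple

    def to_bin(value: float, feature_values: np.ndarray, q: int) -> int:
        # Empirical q-quantiles of the feature; bin b collects the values lying
        # between the (b-1)-th and the b-th quantile.
        quantiles = np.quantile(feature_values, np.linspace(0.0, 1.0, q + 1))
        # Bin index in {1, ..., q}, clipped at the edges.
        return int(np.clip(np.searchsorted(quantiles, value, side="right"), 1, q))

    def discretize_rule(tests: Dict[int, Tuple[float, float]],
                        X: np.ndarray, q: int) -> Dict[int, Tuple[int, int]]:
        # Replace each interval bound by the bin of the corresponding feature.
        return {j: (to_bin(a, X[:, j], q), to_bin(b, X[:, j], q))
                for j, (a, b) in tests.items()}

    # Toy usage: discretize the test x_0 in [0.2, 0.8] with q = 10 bins.
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(500, 3))
    print(discretize_rule({0: (0.2, 0.8)}, X, q=10))  # bins of the two bounds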

The formula of the q-stability is based on the so-called Dice-Sorensen index. Let $\mathcal{A}$ be a rule-based algorithm and let $\mathcal{D}_n$ and $\mathcal{D}'_n$ be two independent samples of $n$ i.i.d. observations drawn from the same distribution $Q$. Let $R_n$ and $R'_n$ be the sets of rules generated by $\mathcal{A}$, given $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively. Then, the q-stability is calculated by

$\mathcal{S}_q(\mathcal{A}, \mathcal{D}_n, \mathcal{D}'_n) := \dfrac{2\,\left|\tilde{R}_n \cap \tilde{R}'_n\right|}{\left|\tilde{R}_n\right| + \left|\tilde{R}'_n\right|}$,     (7)

where $\tilde{R}_n$ (resp. $\tilde{R}'_n$) is the discretized version of the rule set $R_n$ (resp. $R'_n$), $|\cdot|$ denotes the cardinality, and with the convention that $\frac{0}{0} = 1$. The discretization process is performed using $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively.

This quantity is a positive number between 0 and 1: if $\tilde{R}_n$ and $\tilde{R}'_n$ have no common rules, then $\mathcal{S}_q = 0$, while if $\tilde{R}_n$ and $\tilde{R}'_n$ are identical, then $\mathcal{S}_q = 1$.
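The sketch below computes the Dice-Sorensen index (7) between two sets of discretized rules, each rule being reduced to a hashable key built from its discretized conditions; this representation and the handling of the empty case are assumptions made for the example.

    from typing import Dict, FrozenSet, List, Tuple

    # A discretized rule is identified by its conditions: a frozenset of
    # (feature index, lower bin, upper bin) triples, which is hashable.
    DiscretizedRule = FrozenSet[Tuple[int, int, int]]

    def as_key(tests: Dict[int, Tuple[int, int]]) -> DiscretizedRule:
        return frozenset((j, lo, hi) for j, (lo, hi) in tests.items())

    def dice_sorensen(rules_a: List[DiscretizedRule],
                      rules_b: List[DiscretizedRule]) -> float:
        set_a, set_b = set(rules_a), set(rules_b)
        denom = len(set_a) + len(set_b)
        if denom == 0:
            # Degenerate case: both rule sets are empty (convention used here).
            return 1.0
        return 2.0 * len(set_a & set_b) / denom

    # Toy usage: two rule sets sharing one rule out of two.
    r1 = as_key({0: (2, 8)})
    r2 = as_key({1: (1, 5)})
    r3 = as_key({1: (1, 6)})
    print(dice_sorensen([r1, r2], [r1, r3]))  # 2 * 1 / (2 + 2) = 0.5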

4 Simplicity

In [20], the authors have introduced a notion of interpretability score based on the sum of the lengths of all the rules constituting the predictive model.

Definition 4.1.

The interpretability score of an estimator $\hat{g}_n$ generated by a set of rules $R_n$ is defined by

$\operatorname{Int}(\hat{g}_n) := \sum_{r \in R_n} \operatorname{length}(r)$.     (8)

Furthermore, the value (8), which is a positive number, cannot be directly compared to the values from (6) and (7), which lie between 0 and 1.

The measure of simplicity is based on Definition 4.1. The idea is to compare (8) relative to a set of algorithms $\mathfrak{A}$. Hence, the simplicity of an algorithm $\mathcal{A} \in \mathfrak{A}$ is defined in relative terms as follows:

$\mathcal{S}im(\mathcal{A}, \mathcal{D}_n) := \dfrac{\min_{\mathcal{A}' \in \mathfrak{A}} \operatorname{Int}\left(\mathcal{A}'(\mathcal{D}_n)\right)}{\operatorname{Int}\left(\mathcal{A}(\mathcal{D}_n)\right)}$.     (9)

Like the previous ones, this quantity is a positive number between 0 and 1: if $\mathcal{A}$ generates the simplest predictor among the set of algorithms $\mathfrak{A}$, then $\mathcal{S}im(\mathcal{A}, \mathcal{D}_n) = 1$. The simplicity of the other algorithms in $\mathfrak{A}$ is then evaluated relative to $\mathcal{A}$.
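A short sketch of the relative simplicity (9), assuming each algorithm's model is summarized by its interpretability score (8), i.e. the sum of its rule lengths; the dictionary-based interface is an illustrative choice.

    from typing import Dict, List

    def interpretability_score(rule_lengths: List[int]) -> int:
        # Definition 4.1: sum of the lengths of the rules in the model.
        return sum(rule_lengths)

    def simplicity(scores: Dict[str, int]) -> Dict[str, float]:
        # Simplicity of each algorithm relative to the simplest one in the
        # panel: the algorithm with the smallest score gets simplicity 1.
        best = min(scores.values())
        return {name: best / score for name, score in scores.items()}

    # Toy usage: three algorithms with rule sets of different total lengths.
    scores = {"A1": interpretability_score([2, 3, 1]),   # 6
              "A2": interpretability_score([2, 2]),      # 4
              "A3": interpretability_score([4, 4, 4])}   # 12
    print(simplicity(scores))  # {'A1': 0.67, 'A2': 1.0, 'A3': 0.33} (rounded)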

5 Interpretability

The main idea underlying the definition of the interpretability of a rule-based algorithm is to use a weighted sum of the predictivity (6), the stability (7) and the simplicity (9). Let $\mathfrak{A}$ be a set of rule-based algorithms; the interpretability of any algorithm $\mathcal{A} \in \mathfrak{A}$ is defined by:

$\mathcal{I}(\mathcal{A}, \mathcal{D}_n, \mathcal{D}'_n) := \alpha_1\, \mathcal{P}(\mathcal{A}, \mathcal{D}_n) + \alpha_2\, \mathcal{S}_q(\mathcal{A}, \mathcal{D}_n, \mathcal{D}'_n) + \alpha_3\, \mathcal{S}im(\mathcal{A}, \mathcal{D}_n)$,     (10)

where the coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$ are chosen according to the statistician's desiderata, such that $\alpha_1 + \alpha_2 + \alpha_3 = 1$. If a statistician tries to understand and describe a phenomenon, then simplicity and predictivity are more important than stability.

It is important to notice that the definition of interpretability (10) depends on the set of rule-based algorithms $\mathfrak{A}$ and on a regression setting. Therefore, the interpretability value only makes sense within that set of algorithms and for a specific regression setting.
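Putting the three components together, here is a minimal sketch of the weighted sum (10); the weights are left to the statistician, and the values used below are only an example.

    def interpretability(predictivity: float, stability: float,
                         simplicity: float,
                         weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
        # Weighted sum of the three components; the coefficients are assumed
        # non-negative and to sum to one, so the result stays in [0, 1].
        a1, a2, a3 = weights
        assert abs(a1 + a2 + a3 - 1.0) < 1e-9
        return a1 * predictivity + a2 * stability + a3 * simplicity

    # Toy usage: a statistician emphasizing predictivity and simplicity.
    print(interpretability(0.8, 0.5, 0.9, weights=(0.4, 0.2, 0.4)))  # ~0.78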

6 Application

The aim of this application is to compare four rule-based algorithms: the Decision Tree algorithm (DT) [6], RuleFit (RF) [12], the Covering Algorithm (CA) [20] and RICE [16]. Their parametrizations are summarized in Table 1. For this application, the same model as in [12] is considered. Two independent samples $\mathcal{D}_n$ and $\mathcal{D}'_n$ are generated following the regression setting:

where $d$ is the dimension of $X$ and the regression function $g^*$ is given by

(11)

where $x_j$ denotes the $j$-th component of $x$. The value of $\sigma$ was chosen to produce a two-to-one signal-to-noise ratio, and the variables were generated from a uniform distribution.

Algorithm   Parameters
DT          Maximal number of rules
RF          Maximal number of rules; cross-validation
CA          Number of rules per tree; number of trees
RICE        Number of candidates; maximal length
Table 1: Algorithm parameters.

Predictivity (6) is approximated using test observations and by averaging the errors of the predictors generated from $\mathcal{D}_n$ and $\mathcal{D}'_n$. The q-stability (7) is measured by choosing a number of quantiles $q$ and discretizing with respect to $\mathcal{D}_n$ and $\mathcal{D}'_n$. The simplicity (9) of an algorithm is computed by averaging the measure over the predictors generated from $\mathcal{D}_n$ and $\mathcal{D}'_n$ respectively. Finally, the interpretability (10) is calculated with the chosen coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$. The results are summarized in Table 2.

Remark: For the sake of simplification, the linear relationships generated by RuleFit have been considered as rules for the evaluation of the q-stability. This algorithm generates four linear relationships from one sample, whereas a single relationship is produced from the other. Moreover, the linear relationships generated by RuleFit on the two datasets only show one "rule" in common.

Algorithm   Predictivity   q-Stability   Simplicity   Interpretability
DT
RF
CA
RICE
Table 2: Details of the interpretability value for each algorithm.

RICE and the Covering Algorithm seem to be the most interpretable algorithms for this setting. However, the predictivity value of RICE is very poor compared to the other algorithms. Therefore, the Covering Algorithm is the most interpretable algorithm in this panel for the setting (11). Even if RuleFit is the best algorithm of this panel in terms of predictivity and q-stability, it generates too many rules and therefore has a weaker simplicity.

7 Conclusion and perspectives

In this paper, a quantitative criterion for the interpretability of rule-based algorithms was presented. This measure is based on the triptych: predictivity (6), stability (7) and simplicity (9). This new concept of interpretability has been designed to be fair and rigorous. It can be adapted to the various desiderata of the statistician by choosing appropriate coefficients in the interpretability formula (10). An application to four rule-based algorithms, the Decision Tree algorithm [6], RuleFit [12], the Covering Algorithm [20] and RICE [16], shows how to use and analyse the interpretability value. This application will be extended to other well-known rule-based algorithms such as C4.5 [8], RIPPER [9], Ender [13] and SIRUS [15] in further work.

This methodology seems to make the interpretability comparison of rule-based algorithms quite fair. However, according to Definition 4.1, several rules of length one have the same simplicity as a single rule whose length equals their total, which is debatable. Moreover, the stability measure is purely syntactic and rather restrictive. Indeed, if some features are duplicated, two rules may have different syntactic conditions but be otherwise identical in their activations. One way of relaxing this stability criterion could be to compare the rules based on their activation sets (i.e., by looking at the observations where their conditions are met simultaneously). Finally, this comparison of interpretability only makes sense within a set of rule-based algorithms. It could be interesting to extend it to other types of algorithms.

References