Generalized Linear Rule Models

06/05/2019 · by Dennis Wei, et al.

This paper considers generalized linear models using rule-based features, also referred to as rule ensembles, for regression and probabilistic classification. Rules facilitate model interpretation while also capturing nonlinear dependences and interactions. Our problem formulation accordingly trades off rule set complexity and prediction accuracy. Column generation is used to optimize over an exponentially large space of rules without pre-generating a large subset of candidates or greedily boosting rules one by one. The column generation subproblem is solved using either integer programming or a heuristic optimizing the same objective. In experiments involving logistic and linear regression, the proposed methods obtain better accuracy-complexity trade-offs than existing rule ensemble algorithms. At one end of the trade-off, the methods are competitive with less interpretable benchmark models.


1 Introduction

Decision rules have served as important building blocks for supervised learning models. They are often combined via logical operations into decision lists (Rivest, 1987; Angelino et al., 2017; Yang et al., 2017) and rule sets (Lakkaraju et al., 2016; Wang et al., 2017), models whose appeal stems from the human-interpretability of their constituent rules.

In this paper, we consider models that are linear combinations of decision rules, also referred to as rule ensembles, within the framework of generalized linear models (GLM) (McCullagh & Nelder, 1989). Rule ensembles retain interpretability while allowing modelling flexibility since rules are able to capture nonlinear dependences and interactions. Our problem formulation accordingly trades off predictive accuracy against the complexity of the rule ensemble, measured in terms of the number of rules as well as their lengths, i.e., the number of elementary conditions on individual variables in the conjunction. We make use of GLMs to address the real-valued prediction tasks of regression and probabilistic classification, where class probability estimates are desired.

The main challenge in learning rule-based models is due to the exponential size of the space of rules, corresponding to all possible conjunctions of the input features. Predominant approaches include pre-selecting an (often large) subset of candidate rules and greedy optimization in which rules are added one by one but not revised. The former includes optimization methods that select from among the candidates (Friedman & Popescu, 2008; Lakkaraju et al., 2016; Wang et al., 2017) while the latter includes sequential covering (Cohen, 1995; Clark & Boswell, 1991; Fürnkranz et al., 2014) and boosting.

Our main contribution herein is to propose an approach that avoids both of the above alternatives. The technique of column generation (CG) is used to intelligently search the space of rules and produce useful ones as needed, as opposed to using a large, fixed set of candidates. Instead of boosting, a GLM is re-fit as rules are generated, which allows existing rules to be reweighted and even discarded. The CG subproblem is formulated as an integer program (not of exponential size) and is solved using either integer programming or a heuristic that targets the same objective. We also discuss a non-CG algorithm that uses only first-degree rules, i.e., those with a single condition, and optionally numerical features as-is. This latter algorithm produces a kind of generalized additive model (GAM) (Hastie & Tibshirani, 1990).

Experiments are presented involving the two most common cases of GLMs, logistic and linear regression. The proposed methods are seen to yield better performance-complexity trade-offs than existing rule ensemble algorithms. At the performance-maximizing end of the trade-off, the methods are competitive with less interpretable benchmark models such as tree ensembles and nonlinear support vector machines (SVM). The trade-off curves also suggest that substantially simpler models are often available at a relatively small cost in performance.

1.1 Related Work

Of the rule ensemble algorithms that have been proposed, the most closely related to the current proposal is RuleFit (Friedman & Popescu, 2008), which also fits linear models to a set of rules. This set however is predetermined by first inducing a large number (hundreds) of decision trees from the data and then extracting rules corresponding to nodes of the trees. As will be seen in Section 5, this approach yields more complex rule ensembles for the same performance. Rückert & Kramer (2006) also propose fitting linear models to an increasing set of rules but the rules are generated in an unsupervised manner.

Boosting algorithms, which do not modify previously added rules, are represented by SLIPPER (Cohen & Singer, 1999), MLRules and ENDER (Dembczyński et al., 2010). Other algorithms include HKL (Jawanpuria et al., 2011), which is also able to effectively optimize over an exponential number of conjunctions but relies on a regularizer with a special group structure. While often cited as a rule ensemble method, Weiss & Indurkhya (2000) actually learn ensembles of disjunctive normal forms (DNF), which are more akin to tree ensembles.

Even though branch-and-bound and CG have been used before in ML, e.g., for boosting (Demiriz et al., 2002) and in particular for rule learning (Angelino et al., 2017; Rudin & Ertekin, 2018), we believe our work is the first application of CG to a nonlinear optimization problem in ML. Our CG approach is inspired by its recent use in learning disjunctive/conjunctive normal form rules (Dash et al., 2018), with similar benefits. While that work applied CG to a linear program, here we have a (convex) nonlinear problem. For nonlinear problems, even though the general framework has been discussed for OR problems (Garcia et al., 2003), there are only a few practical applications, mostly nonlinear variants of the vehicle routing problem (VRP) (Borndörfer et al., 2013; Fortz et al., 2010). We also note that whether CG can be successfully applied to MINLPs has been posed as an open problem in a 2018 Dagstuhl Seminar (Bonami et al., 2018).

2 Generalized Linear Rule Models

We consider the standard supervised learning problem of predicting a target variable $Y \in \mathcal{Y}$ using input features $X_1, \dots, X_d$, given a training dataset of i.i.d. samples $(x_i, y_i)$, $i = 1, \dots, n$. The output space $\mathcal{Y}$ may be discrete or continuous. We assume that all features have been binarized to take values in $\{0, 1\}$. For categorical features, this is achieved through the usual “one-hot” coding into indicators $X_j = c$ for all categories $c$ as well as their negations $X_j \neq c$. Numerical features are binarized through bi-directional comparisons to a set of thresholds, e.g., $X_j \leq \theta_1$, $X_j \leq \theta_2$ as well as $X_j > \theta_1$, $X_j > \theta_2$. Further details on binarization are given in Section 5.
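As an illustration of this binarization, the following is a minimal sketch under our own naming conventions (not the authors' code), using sample deciles as thresholds for numerical features as described later in Section 5.

```python
import numpy as np
import pandas as pd

def binarize(df, num_quantiles=9):
    """Binarize a DataFrame into {0,1} features: indicators and negations for
    categorical columns, bi-directional threshold comparisons for numerical ones."""
    cols = {}
    for name in df.columns:
        col = df[name]
        if col.dtype == object or str(col.dtype) == "category":
            for c in col.unique():
                cols[f"{name}=={c}"] = (col == c).astype(int)
                cols[f"{name}!={c}"] = (col != c).astype(int)
        else:
            # sample deciles as thresholds (duplicates dropped for skewed columns)
            thresholds = np.unique(np.quantile(col, np.linspace(0.1, 0.9, num_quantiles)))
            for t in thresholds:
                cols[f"{name}<={t:g}"] = (col <= t).astype(int)
                cols[f"{name}>{t:g}"] = (col > t).astype(int)
    return pd.DataFrame(cols)
```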

In this section, we recall aspects of generalized linear models (GLMs) and introduce notation needed to define them over a feature space of rules. Let $\mathcal{K}$ denote the set of conjunctions of $X_1, \dots, X_d$ to be considered. We defer discussion of the choice of $\mathcal{K}$ to Section 3 but note that it is not necessary to limit its size. Denote by $z_k$ the binary variable corresponding to conjunction $k \in \mathcal{K}$, and by $z_{i,k}$ the value taken by $z_k$ in instance $i$. Let $k = 0$ be the index of the empty conjunction, $z_0 \equiv 1$.

For a GLM, we posit that $Y$ conditioned on $X = x$ follows an exponential family distribution given by

$p(y \mid x) = b(y) \exp\bigl( y\,\theta(x) - a(\theta(x)) \bigr), \qquad (1)$

where the canonical parameter $\theta(x)$ is a linear combination of the conjunctions of $x$,

$\theta(x) = \sum_{k \in \mathcal{K}} \beta_k z_k, \qquad (2)$

and $a(\theta)$ is the log-partition function. The terms with nonzero coefficients $\beta_k$ define an ensemble of rules mapping conjunctions $z_k$ to real values $\beta_k z_k$, which are then linearly combined. The distribution may have parameters in addition to $\theta$ (e.g., the variance in the Gaussian case), but these are either assumed known or their estimation can be separated from that of the $\beta_k$. The prediction function is given by the conditional mean of $Y$,

$\hat{y}(x) = \mathbb{E}[Y \mid X = x] = a'(\theta(x)), \qquad (3)$

where the second equality holds for (1).

The coefficients $\beta_k$ in (2) are determined by minimizing the negative log-likelihood corresponding to (1) on the training data. Since $\mathcal{K}$ is potentially large, a sparse solution is essential and is obtained through $\ell_1$ regularization. The optimization problem is therefore

$\min_{\beta} \; \sum_{i=1}^{n} \Bigl[ -y_i \sum_{k \in \mathcal{K}} \beta_k z_{i,k} + a\Bigl(\sum_{k \in \mathcal{K}} \beta_k z_{i,k}\Bigr) \Bigr] + \sum_{k \in \mathcal{K}} \lambda_k \lvert \beta_k \rvert. \qquad (4)$

The factor $b(y)$ in (1) is not a function of $\beta$ and is omitted. Each regularization parameter $\lambda_k \geq 0$ depends on the number of literals of conjunction $k$ as specified later. It is a property of exponential families that the log-partition function $a(\theta)$ is convex; by the affine composition property of convex functions, problem (4) is therefore convex.

We specialize the foregoing to the two most common cases of GLMs, logistic and linear regression. For logistic regression with $y \in \{0, 1\}$, the log-partition function is $a(\theta) = \log(1 + e^{\theta})$. Substituting this into (4), the quantity in square brackets becomes the familiar expression

$\log\bigl(1 + e^{\theta_i}\bigr) - y_i \theta_i, \qquad (5)$

where $\theta_i = \sum_{k \in \mathcal{K}} \beta_k z_{i,k}$. For linear regression, $a(\theta) = \theta^2/2$ and the bracketed quantity becomes (after adding back the constant $y_i^2/2$)

$\tfrac{1}{2} \bigl( y_i - \theta_i \bigr)^2. \qquad (6)$
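To make the objective concrete, here is a minimal NumPy sketch (our illustration, not the authors' implementation) of the regularized logistic objective given by (4) and (5). It assumes a binary rule-feature matrix `Z` whose columns are the conjunctions currently under consideration and a vector `lam` of per-conjunction penalties $\lambda_k$.

```python
import numpy as np

def glrm_logistic_objective(beta, Z, y, lam):
    """Regularized negative log-likelihood (4) for the logistic case (5).

    Z   : (n, K) binary matrix, Z[i, k] = z_{i,k}; column 0 is the empty conjunction (all ones)
    y   : (n,) labels in {0, 1}
    lam : (K,) per-conjunction penalties lambda_k
    """
    theta = Z @ beta                               # linear predictor theta_i, eq. (2)
    loss = np.logaddexp(0.0, theta) - y * theta    # log(1 + e^theta_i) - y_i * theta_i, eq. (5)
    return loss.sum() + np.dot(lam, np.abs(beta))  # plus weighted l1 penalty, eq. (4)

def residuals(beta, Z, y):
    """Prediction residuals r_i = y_i - a'(theta_i), used by the CG pricing step of Section 3.2."""
    theta = Z @ beta
    return y - 1.0 / (1.0 + np.exp(-theta))        # a'(theta) is the sigmoid for logistic regression
```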

3 Model Instantiations

We discuss two instantiations of generalized linear rule models (GLRM). In Section 3.1, the set of conjunctions $\mathcal{K}$ is restricted to first-degree or singleton conjunctions, i.e., those with a single condition on an individual feature. Section 3.2 considers the general case with no restriction on $\mathcal{K}$.

3.1 Generalized Additive Model Using First-Degree Rules

In the first case, the conjunctions correspond to the binarized features themselves. In terms of the original unbinarized features, conditions are placed on only one feature at a time and so the resulting GLRM is free of interaction terms. On the other hand, if features are binarized as discussed at the beginning of Section 2, then the GLRM is a type of generalized additive model (GAM), i.e., a sum of univariate functions. For numerical features, first-degree rules correspond to step functions, which can be linearly combined into arbitrary piecewise-constant functions with discontinuities at the binarization thresholds. For categorical features, any function can be realized.

Friedman & Popescu (2008) discuss a similar type of model obtained by restricting their decision trees to depth 1 (decision stumps). There are two differences however. First, the singleton rules herein are systematically enumerated as opposed to generated by a randomized tree induction procedure. Second, from every pair of complementary singleton rules (e.g., $X_j \leq \theta$, $X_j > \theta$), we remove one member, as otherwise the pair together with the empty conjunction $z_0 \equiv 1$ is collinear. The results in Section 5 suggest that this removal of linearly dependent rules contributes toward sparser, simpler rule ensembles.

In addition to first-degree rules, following Friedman & Popescu (2008) we may also include in the feature space any numerical features as they are, without binarization. This model variant is also evaluated in Section 5.
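In this restricted case, fitting the model reduces to an $\ell_1$-regularized GLM over the singleton rule features, optionally augmented with the raw numerical features. A minimal sketch using scikit-learn's LibLinear-backed solver follows; this is our own illustration of the variant (the authors call LibLinear directly), and `Z_singletons` is assumed to already have one member of each complementary pair removed as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lr1n(Z_singletons, X_numerical, y, C=1.0):
    """Fit an LR1N-style model: l1-regularized logistic regression on
    first-degree (singleton) rules plus raw numerical features."""
    features = np.hstack([Z_singletons, X_numerical])
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(features, y)
    return model
```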

3.2 General Rule Ensemble Using Column Generation

We now let $\mathcal{K}$ be the set of all possible conjunctions of $X_1, \dots, X_d$ to obtain rule ensembles with no restrictions. Since $\mathcal{K}$ is now exponentially large, it is intractable even to enumerate the variables in problem (4) (unless the feature dimension $d$ is very small). We exploit the technique of column generation (CG) to tackle this problem.

Column generation was originally developed to solve linear programs (LPs) with a very large number of columns (Gilmore & Gomory, 1961; Conforti et al., 2014). Using the fact that LPs admit sparse (basic) optimal solutions, the main idea is to first solve a restricted problem with a small number of candidate columns and then generate some of the missing columns based on the optimal dual solution of this restricted problem. Using the dual solution, one can compute the marginal benefit (or partial derivative) of introducing a missing column to the restricted problem. If the partial derivative for the most promising missing column is non-negative, then the procedure terminates. The crucial component of this approach is to formulate a column generation subproblem that can search through all of the missing columns without complete enumeration.

We next describe how to adapt this idea to solve the generalized linear model (4). To derive the column generation subproblem, it is helpful to express $\beta_k$ as $\beta_k = \beta_k^+ - \beta_k^-$ for $\beta_k^+, \beta_k^- \geq 0$, $k \in \mathcal{K}$. Problem (4) becomes

$\min_{\beta^+, \beta^- \geq 0} \; \sum_{i=1}^{n} \Bigl[ -y_i \theta_i + a(\theta_i) \Bigr] + \sum_{k \in \mathcal{K}} \lambda_k \bigl( \beta_k^+ + \beta_k^- \bigr), \qquad \theta_i = \sum_{k \in \mathcal{K}} \bigl( \beta_k^+ - \beta_k^- \bigr) z_{i,k}. \qquad (7)$

Suppose that a restricted version of (7) has been solved for a subset $\hat{\mathcal{K}} \subseteq \mathcal{K}$ of the set of conjunctions, yielding $\beta_k^+, \beta_k^-$ for $k \in \hat{\mathcal{K}}$. We extend this to a solution for (7) by setting $\beta_k^+ = \beta_k^- = 0$ for $k \notin \hat{\mathcal{K}}$ and wish to determine the optimality of the extended solution. Since (7) is also a convex problem with non-negativity constraints, a necessary and sufficient condition of optimality is for the partial derivatives of the objective with respect to $\beta_k^+$, $\beta_k^-$ to be zero if $\beta_k^{\pm} > 0$ and non-negative if $\beta_k^{\pm} = 0$. This condition is true for $k \in \hat{\mathcal{K}}$ due to optimality for the restricted problem. For $k \notin \hat{\mathcal{K}}$, $\beta_k^+ = \beta_k^- = 0$ and we are thus required to check non-negativity of the derivatives.

The partial derivative w.r.t. $\beta_k^+$ of the objective in (7) is

$\sum_{i=1}^{n} \bigl[ -y_i + a'(\theta_i) \bigr] z_{i,k} + \lambda_k = \lambda_k - \sum_{i=1}^{n} r_i z_{i,k},$

using (2) and (3) to obtain the first expression and defining the prediction residual $r_i = y_i - \hat{y}_i$ to obtain the second. The partial derivative with respect to $\beta_k^-$ is the same except that the residuals are negated. Non-negativity of all partial derivatives can thus be determined by solving the pair of problems

$\min_{k \in \mathcal{K} \setminus \hat{\mathcal{K}}} \;\; \lambda_k \mp \sum_{i=1}^{n} r_i z_{i,k}. \qquad (8)$

If both optimal values in (8) are non-negative, then all derivatives are indeed non-negative and we conclude that the extended solution is optimal for (7). On the other hand, if the objective in (8) is negative for some $k$ and the ‘$-$’ sign (say), then the partial derivative w.r.t. $\beta_k^+$ is negative and $\beta_k^+$ can be added to the restricted problem (i.e., $k$ added to $\hat{\mathcal{K}}$) to potentially improve the current solution.
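For concreteness, specializing the mean function $a'(\theta)$ from Section 2 (our own worked example) gives the residuals and the reduced cost being checked in (8):

```latex
r_i = y_i - \frac{1}{1 + e^{-\theta_i}} \;\;\text{(logistic)}, \qquad
r_i = y_i - \theta_i \;\;\text{(linear)}, \qquad
\text{reduced cost of } \beta_k^{\pm}: \;\; \lambda_k \mp \sum_{i=1}^{n} r_i z_{i,k}.
```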

We now use the fact that $\mathcal{K}$ is a set of conjunctions to avoid solving (8) by enumeration. This involves encoding a conjunction by the set of participating features and relating the values $z_i$ to the feature values $x_{ij}$. Let $w_j \in \{0,1\}$ represent whether feature $j$ is selected in a conjunction. We assume that the regularization parameter is an affine function of the degree of the conjunction, $\lambda_k = \lambda_0 + \lambda_1 \sum_{j=1}^{d} w_j$ with $\lambda_0, \lambda_1 \geq 0$ (other affine functions of the degree are possible). Define $Z_i = \{j : x_{ij} = 0\}$ and, for a given sign in (8), let $\mathcal{M}$ be the set of instances whose objective coefficient $\mp r_i$ is non-negative and $\mathcal{M}^c$ its complement. Then (8) can be reformulated as

$\min_{w, z} \;\; \lambda_0 + \lambda_1 \sum_{j=1}^{d} w_j \mp \sum_{i=1}^{n} r_i z_i \qquad (9)$

$\text{s.t.} \quad z_i \geq 1 - \sum_{j \in Z_i} w_j, \;\; i \in \mathcal{M}; \qquad z_i + w_j \leq 1, \;\; j \in Z_i, \;\; i \in \mathcal{M}^c; \qquad w_j \in \{0,1\}, \;\; 0 \leq z_i \leq 1,$

where the subscript $k$ has been dropped in favor of the encoding by $w$. The constraints in (9) ensure that $z_i$ acts as the conjunction of the selected features. For $i \in \mathcal{M}$, $z_i = 1$ only if all selected features have $x_{ij} = 1$ (i.e., no $j \in Z_i$ is selected); otherwise $z_i = 0$, since $z_i$ is minimized in the objective. For $i \in \mathcal{M}^c$, $z_i$ is maximized, and the corresponding constraints enforce the same behavior for $z_i$.
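The pricing problem (9) can be set up directly with an off-the-shelf MIP modeler. The sketch below uses PuLP with the bundled CBC solver purely for illustration (the paper's implementation uses the CPLEX callable library), and assumes a binary data matrix `X` and a residual vector `r`.

```python
import numpy as np
import pulp

def price_conjunction(X, r, lam0, lam1, sign=+1):
    """Solve the pricing IP (9): minimize lam0 + lam1 * degree - sign * sum_i r_i z_i
    over conjunctions encoded by w in {0,1}^d."""
    n, d = X.shape
    prob = pulp.LpProblem("pricing", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w_{j}", cat="Binary") for j in range(d)]
    z = [pulp.LpVariable(f"z_{i}", lowBound=0, upBound=1) for i in range(n)]
    coef = -sign * r  # coefficient of z_i in the objective of (9)
    prob += lam0 + lam1 * pulp.lpSum(w) + pulp.lpSum(coef[i] * z[i] for i in range(n))
    for i in range(n):
        zero_feats = [j for j in range(d) if X[i, j] == 0]
        if coef[i] >= 0:
            # z_i is minimized: force z_i = 1 when no selected feature is zero in instance i
            prob += z[i] >= 1 - pulp.lpSum(w[j] for j in zero_feats)
        else:
            # z_i is maximized: cap it whenever a selected feature is zero in instance i
            for j in zero_feats:
                prob += z[i] + w[j] <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    selected = [j for j in range(d) if w[j].value() > 0.5]
    return pulp.value(prob.objective), selected
```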

Therefore, our column generation algorithm alternates between solving the restricted log-likelihood problem (4) and searching for new columns by solving (9) for both signs. We initialize the restricted set $\hat{\mathcal{K}}$ to be the set of first-degree rules discussed in Section 3.1, optionally including original numerical features as well. The algorithm terminates with a certificate that problem (4) is solved to optimality if the optimal value of problem (9) is non-negative for both signs. For practical reasons we also have a secondary termination criterion that depends on the total number of column generation iterations and the total CPU time spent.
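Putting the pieces together, the overall procedure can be outlined as follows. This is a schematic sketch rather than the authors' code: `fit_restricted_glm` stands in for the weighted $\ell_1$-penalized GLM fit (e.g., LibLinear with the column scaling described in Section 4), while `residuals` and `price_conjunction` are the helpers sketched earlier.

```python
import numpy as np

def column_generation(X, y, lam0, lam1, max_iters=50):
    """Alternate between refitting the restricted GLM (4) and pricing new
    conjunctions via (9) for both signs, until no negative reduced cost remains."""
    n, d = X.shape
    # Start from the empty conjunction plus all singleton (first-degree) rules.
    conjunctions = [()] + [(j,) for j in range(d)]
    for _ in range(max_iters):
        Z = np.column_stack([np.all(X[:, list(c)], axis=1) if c else np.ones(n)
                             for c in conjunctions]).astype(float)
        lam = np.array([lam0 + lam1 * len(c) for c in conjunctions])
        beta = fit_restricted_glm(Z, y, lam)   # placeholder: weighted l1-penalized fit of (4)
        r = residuals(beta, Z, y)
        best_val, best_conj = min(
            price_conjunction(X, r, lam0, lam1, sign=s) for s in (+1, -1))
        if best_val >= 0:                      # optimality certificate for (4)
            break
        conjunctions.append(tuple(best_conj))  # add the priced column and refit
    return conjunctions, beta
```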

This algorithm is guaranteed to terminate in finite time, as there are only a finite number of candidate columns (conjunctions) and the column generation procedure cannot generate the same conjunction more than once, since the partial derivatives of the conjunctions already in the restricted problem are guaranteed to be non-negative. We note that finite termination is not guaranteed for models with other regularization penalties, for example $\ell_2$ regularization, as repeating a conjunction with a non-zero weight would improve the optimal value of (7) by simply splitting the weight of an original conjunction into two equal parts. After splitting, the first term in (7) would stay the same whereas the regularization penalty would decrease.

Once the column generation algorithm terminates, we solve the log-likelihood problem (4) one last time to de-bias the solution. In this final run we restrict the conjunctions to the ones with nonzero coefficients in the last round and we drop the regularization term from the objective.

4 Column Generation Approaches

The column generation subproblem (9) is solved using either integer programming (IP) or a heuristic algorithm. We have implemented our column generation procedure in Java, using LibLinear (Fan et al., 2008) to solve the regularized logistic regression problem (4) and the CPLEX callable library (version 12.7.1) to solve the integer program (9) for column generation. As the LibLinear package only allows a single $\ell_1$ regularization parameter (as opposed to weights $\lambda_k$ that depend on the complexity of conjunction $k$), we rescale the column values $z_{i,k}$ in inverse proportion to $\lambda_k$, which makes the uniform penalty equivalent to the weighted one.

The proposed heuristic algorithm performs a limited search of the rule space, proceeding in order of increasing conjunction degree from 1 up to a maximum degree. To describe the heuristic, we define the children of a conjunction as those conjunctions that involve one additional feature, i.e., have one additional $w_j = 1$ in terms of the representation in (9). At each degree, only those conjunctions that are children of a chosen parent (of the previous degree) are evaluated. These children are evaluated for two purposes: 1) to determine whether any improve upon the incumbent solution, defined as the best solution observed thus far; 2) to select a parent conjunction for the next higher degree. Evaluation for the first purpose is based only on the objective value achieved in (9), while evaluation for the second also considers a lower bound on future objective values, as discussed next.

We illustrate the objective value and lower bound calculations for degree-1 conjunctions, i.e., children of the initial empty conjunction. Higher degrees are analogous. Objective values of children can be computed via their increments and decrements relative to the value of the parent. For degree 1, setting $w_j = 1$ forces $z_i = 0$ for $i$ such that $x_{ij} = 0$ (i.e., $j \in Z_i$). The change in value of child $j$ relative to the parent is thus

$\delta_j = \lambda_1 \pm \sum_{i:\, x_{ij} = 0} r_i. \qquad (10)$

A lower bound on future objective values resulting from setting $w_j = 1$, i.e., values of descendants of child $j$, can be obtained by optimistically assuming that, with the addition of one more feature (at a further cost of $\lambda_1$), all remaining terms that increase the objective of (9) can be eliminated while no terms that decrease it are eliminated beyond those due to setting $w_j = 1$ itself. Expressed in the same relative terms as (10), this lower bound is

$\ell_j = \delta_j + \lambda_1 - \sum_{i:\, x_{ij} = 1} \bigl[ \mp r_i \bigr]_+, \qquad (11)$

where $[\cdot]_+$ denotes the positive part.

To determine the parent for the next degree, we first eliminate all children of the current parent whose lower bounds are not less than the value of the incumbent solution, since these cannot lead to improvement. Any remaining children are evaluated using the average of (10) and (11), and the child with the lowest such average is selected. The motivation is to consider not only the children's current values but also a crude but easily computed estimate of potential future values. Other convex combinations of (10) and (11) have not been explored.

Once a new parent conjunction is chosen, corresponding to setting a $w_j = 1$, two operations are performed to reduce the dimensions of the problem and to render it in the same form as for degree 1. First, indices (rows) $i$ with $x_{ij} = 0$ are removed, since $z_i$ is forced to zero as noted above. Second, setting $w_j = 1$ may make other features redundant and these can be removed (by setting the corresponding $w_{j'} = 0$). For example, if binary feature $j$ corresponds to one category of an original categorical feature, then we may set $w_{j'} = 0$ for the other categories $j'$ of the same feature, because the conjunction of $X_j$ and $X_{j'}$ would be identically zero. We refer to Su et al. (2016) for a fuller discussion of these redundancies.
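A condensed sketch of this search follows, under our own simplifications: beam width 1, a fixed maximum degree, and no categorical-redundancy elimination. It reuses the sign convention of the pricing sketch above, and the Boolean mask `active` plays the role of the row removal just described.

```python
import numpy as np

def heuristic_pricing(X, r, lam0, lam1, sign=+1, max_degree=3):
    """Greedy search for the pricing problem (9): grow one parent conjunction per
    degree using the increments (10) and lower bounds (11)."""
    coef = -sign * r                      # coefficient of z_i in the objective of (9)
    n, d = X.shape
    active = np.ones(n, dtype=bool)       # rows whose z_i is not yet forced to 0
    value = lam0 + coef.sum()             # objective value of the empty conjunction
    best_value, best_conj = 0.0, None     # only negative objectives (descent directions) are useful
    conj = []
    for _ in range(max_degree):
        deltas, bounds = np.empty(d), np.empty(d)
        for j in range(d):
            drop = active & (X[:, j] == 0)           # rows zeroed out by adding feature j
            deltas[j] = lam1 - coef[drop].sum()      # increment (10) relative to the parent
            keep = active & (X[:, j] == 1)
            # lower bound (11): one more feature optimistically removes all remaining
            # positive-coefficient rows while keeping the negative-coefficient ones
            bounds[j] = deltas[j] + lam1 - np.clip(coef[keep], 0, None).sum()
        for j in range(d):                           # incumbent update
            if value + deltas[j] < best_value:
                best_value, best_conj = value + deltas[j], conj + [j]
        viable = np.where(value + bounds < best_value)[0]
        if viable.size == 0:                         # no child can lead to improvement
            break
        j_star = viable[np.argmin((deltas[viable] + bounds[viable]) / 2)]
        conj, value = conj + [j_star], value + deltas[j_star]
        active &= (X[:, j_star] == 1)                # reduce rows as in the text above
    return best_value, best_conj
```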

We have explored additional variations of the heuristic algorithm as discussed in Appendix A.

5 Numerical Evaluation

We report on numerical experiments involving both logistic regression (5) (i.e., classification) and linear regression (6). We tested the methods discussed in Section 3: logistic/linear regression on singleton rules (without CG, abbreviated LR1), logistic/linear regression on general rules (with CG, abbreviated LRR), and the same two with the addition of any numerical features originally present in the data (LR1N, LRRN). Column generation is done using the heuristic in Section 4 but we also report a preliminary result using an IP version of LRR (LRRI). We compared these methods to RuleFit (Friedman & Popescu, 2008), which is the closest existing method, specifically a Python implementation (https://github.com/christophM/rulefit) because the original R code is no longer supported. We also tried to test HKL (Jawanpuria et al., 2011) but encountered a toolbox that is no longer supported as well as numerical problems. Beyond rule ensembles, we also compared to gradient boosted classification/regression trees (GBT) and support vector machines (SVM) with radial basis function (RBF) kernels. These are less interpretable models intended to provide a benchmark for prediction performance.

Cross-validation (CV) is used to estimate all test performance metrics.

Categorical and numerical features were binarized as described at the beginning of Section 2, using sample deciles as thresholds for numerical features. To control for the effect of discretization, numerical features were discretized using the same quantile thresholds for all rule- and tree-based methods in the comparison. Excluded from this treatment are SVMs and the variant of RuleFit that uses numerical features in addition to rules (abbreviated RuleFitN).

Figure 1: Trade-offs between Brier score and weighted number of rules on (a) Pima Indians diabetes, (b) FICO Explainable Machine Learning Challenge, (c) MAGIC gamma telescope, and (d) Musk molecules. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.

dataset LR1 LRR RuleFit LR1N LRRN RuleFitN GBT SVM
banknote () E-3 () E-3 () E-3 () E-5 () E-3 () E-3 () E-3 () E-4
heart () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1
ILPD () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1
ionosphere () E-2 () E-2 () E-2 () E-2 () E-2 () E-2 () E-2 () E-2
liver () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1
pima () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1
tic-tac-toe () E-2 () E-2 () E-7 () E-2 () E-2 () E-7 () E-3 () E-2
transfusion () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1
WDBC () E-2 () E-2 () E-2 () E-2 () E-2 () E-2 () E-2 () E-2
adult () E-1 () E-1 () E-1 () E-2 () E-2 () E-2 () E-1 () E-1
bank-mkt () E-1 () E-2 () E-1 () E-1 () E-2 () E-2 () E-2 () E-2
gas () E-3 () E-3 () E-3 () E-3 () E-3 () E-3 () E-3 () E-3
magic () E-1 () E-1 () E-1 () E-1 () E-1 () E-2 () E-2 () E-2
mushroom () E-7 () E-7 () E-7 () E-7 () E-7 () E-7 () E-4 () E-4
musk () E-2 () E-2 () E-2 () E-2 () E-2 () E-2 () E-2 () E-2
FICO () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1 () E-1
mean rank
Table 1: Mean test Brier scores (standard error in parentheses). Best values in bold.
dataset LR1 LRR RuleFit LR1N LRRN RuleFitN
banknote () () () () () ()
heart () () () () () ()
ILPD () () () () () ()
ionosphere () () () () () ()
liver () () () () () ()
pima () () () () () ()
tic-tac-toe () () () () () ()
transfusion () () () () () ()
WDBC () () () () () ()
adult () () () () () ()
bank-mkt () () () () () ()
gas () () () () () ()
magic () () () () () ()
mushroom () () () () () ()
musk () () () () () ()
FICO () () () () () ()
mean rank
Table 2: Mean weighted number of rules (standard error in parentheses) corresponding to Table 1. Best values in bold.

5.1 Classification

For classification, we used the same datasets considered in (Dash et al., 2018), which also appeared in other recent works on rule-based models (Su et al., 2016; Wang et al., 2017). One of these datasets comes from the recent FICO Explainable Machine Learning Challenge (FICO, 2018).

In the first experiment, we evaluated the performance-complexity trade-offs of the four proposed methods (LR1, LRR, LR1N, LRRN) as well as RuleFit. For performance metrics, we report both accuracy and Brier score (i.e., mean squared error (MSE)), the latter a well-known metric for probabilistic outputs (Hernández-Orallo et al., 2012) as produced by logistic regression-based models. To measure rule ensemble complexity, we consider not only the number of rules (with nonzero coefficients) but also their lengths in terms of the number of conditions. Specifically, we define the weight of a rule, similarly to the regularization parameter $\lambda_k$, as an affine function of its degree; coefficients corresponding to numerical features receive a fixed weight. For consistency with this definition, the parameters $\lambda_k$ used by all methods in this comparison are set proportional to the rule weights. By varying the remaining free scale parameter $\lambda$, we sweep out trade-offs between performance and complexity. RuleFit has an additional parameter, the mean tree size, that is recommended for tuning in (Friedman & Popescu, 2008). We have done so based on test set results, which gives RuleFit a slight advantage.

Figure 2: Trade-offs between coefficient of determination $R^2$ and weighted number of rules on (a) Communities and Crime, (b) Wine Quality, and (c) Bike Sharing.

Figure 1 shows the resulting trade-offs with Brier score for 4 of the datasets. The other plots as well as those for accuracy are in Appendix B. Pareto-efficient points, i.e., those not dominated by points with both lower Brier score and lower complexity, have been connected with line segments for the sole purpose of visualization. RuleFit obtains inferior trade-offs on most of the tested datasets. Even in cases such as Figure 1(c) where RuleFit eventually attains a lower Brier score, the initial part of the trade-off is worse. Note that the variants employing numerical features generally fare better, as expected. Figure 1(c) indicates that IP CG (LRRI) can improve upon the heuristic in some cases.

Next we discuss the differences between LR1(N) and LRR(N). A general observation is that LRR, which uses CG to produce higher-degree rules, tends to achieve better trade-offs on larger datasets, as exemplified by musk in Figure 1(d). This can be explained by the fact that LRR effectively considers a much larger feature space than LR1. If the training sample size is sufficient to support this, then the greater power of the higher-degree rules found by LRR generalizes to test data. The presence of strong interactions in the data also favors LRR. On the other hand, on smaller datasets such as in Figure 1(a), LRR may overfit and achieve a worse trade-off relative to LR1. In Figure 1(b), LRR outperforms LR1 but the advantage disappears for LRRN compared to LR1N, a pattern that also occurs on other datasets. In Figure 1(c), the advantage of LRR(N) lies in attaining a slightly lower minimal Brier score.

In a second experiment, we aim to maximize performance (minimize Brier score or maximize accuracy) by performing nested CV on the training set to select $\lambda$ and applying the resulting model to the test set. Since performance is now the primary criterion, we broaden the comparison to include GBT and SVM. For GBT, the maximum tree depth was tuned and the number of trees was also determined via a stopping criterion on a validation set, up to a maximum number of trees. For SVM, the regularization parameter and kernel width were tuned and Platt scaling (Platt, 1999) was used to calibrate the output scores as probabilities.

Table 1 shows the resulting mean Brier scores while Table 2 shows the weighted number of rules at which the Brier scores were achieved. To help summarize these results, we report the mean rank of each method as well as Friedman tests on these mean ranks, as suggested by a reviewer and following (Demšar, 2006). The overall conclusion is that when tuned for maximum performance, the proposed methods, especially LRRN, compete well with benchmark models and do so using significantly fewer rules than RuleFit.

For Table 1, the Friedman test (using the $F$-distribution approximation) rejects the null hypothesis of no significant differences at the weaker of the significance levels considered but not at the stricter one. A post-hoc test, comparing all other algorithms to LRRN and correcting for multiple comparisons using Holm's procedure, shows that the only significant difference is with LR1, i.e., when excluding both higher-degree rules and numerical features. The mean ranks also show that LRR(N) outperforms LR1(N) although most of the differences are not statistically significant.

In Table 2, it is clear that RuleFit produces much more complex classifiers (3-4 times) than all four proposed methods. The Friedman statistic is significant, as expected. Post-hoc comparisons to LRRN confirm that RuleFit and RuleFitN are significantly more complex, again with Holm's correction. Note that including original numerical features does not significantly increase classifier complexity (comparing LR1 with LR1N and LRR with LRRN).

5.2 Regression

dataset LR1 LRR RuleFit LR1N LRRN RuleFitN GBT SVM
abalone () () () () () () () ()
boston () () () () () () () ()
bike () () () () () () () ()
california () () () () () () () ()
crime () () () () () () () ()
parkinsons () () () () () () () ()
wine () () () () () () () ()
MEPS () () () () () () () ()
mean rank
Table 3: Mean test $R^2$ (%, standard error in parentheses). Best values in bold.
dataset LR1 LRR RuleFit LR1N LRRN RuleFitN
abalone () () () () () ()
boston () () () () () ()
bike () () () () () ()
california () () () () () ()
crime () () () () () ()
parkinsons () () () () () ()
wine () () () () () ()
MEPS () () () () () ()
mean rank
Table 4: Mean weighted number of rules (standard error in parentheses) corresponding to Table 3. Best values in bold.

For regression, we experimented with an additional 8 datasets, 7 of which are drawn from previous works on rule ensembles (Friedman & Popescu, 2008; Dembczyński et al., 2010) and the UCI repository (Dua & Karra Taniskidou, 2017). The last dataset comes from the Medical Expenditure Panel Survey (MEPS) (Agency for Healthcare Research and Quality, 2018) of the US Department of Health and Human Services, specifically panel 19 from the year 2015. The task is to predict the annual healthcare expenditure of individuals based on demographics and self-reported medical conditions.

The same two experiments are conducted for regression, using the linear regression variants (6) of both the proposed approaches and RuleFit. The coefficient of determination $R^2$ is chosen as the performance metric and the model complexity metric is the same as in Section 5.1.

In Figure 2, we show trade-offs between $R^2$ and weighted number of rules for 3 of the datasets; the others can be found in Appendix B. RuleFit is a stronger competitor in regression due to sometimes achieving higher $R^2$ at higher complexities. On some datasets, the curves for RuleFit(N) remain below their LRR(N) counterparts, as in Figure 2(a). On others, the curves cross at moderate to high complexities, as in Figure 2(b), while in Figure 2(c), RuleFit obtains much higher $R^2$. Among the proposed methods, the advantage of LRR(N) vs. LR1(N) is generally larger in regression than in classification (cf. Figure 1), indicating the benefit of generating higher-degree rules. A possible explanation may be that interaction terms matter more in regression where the output range is wider.

Tables 3 and 4 show the results of selecting the parameter $\lambda$ through nested CV to maximize $R^2$. The same overall conclusion as in Section 5.1 holds for LRRN in particular, namely that it yields highly competitive $R^2$ values while using fewer rules than RuleFit. The Friedman test for Table 3 is significant, and in post-hoc comparisons to LRRN, the only significant difference is again with LR1. Among the proposed methods, advantages due to CG and/or numerical features in Figure 2 carry over into Table 3. In Table 4, RuleFit(N) again yields more complex solutions on average, although the difference is not as large as in Table 2. The Friedman test is also significant; however, no post-hoc comparisons with LRRN (which occupies a middle position) as the reference show statistically significant differences.

6 Discussion

The numerical results in Section 5 may raise the question of the interpretability of rule ensembles with hundreds of rules. First we note that because of the shape of many of the trade-off curves, with steep improvement at low complexity followed by a flatter region, it may be possible to obtain substantially simpler models, with say a few tens of rules, that are not too far from maximum performance. Apart from model simplification, the fact that a rule ensemble is also a linear model facilitates model inspection, for example by focusing on the rules corresponding to the largest coefficients $\lvert \beta_k \rvert$. Friedman & Popescu (2008) discuss a slightly more refined measure of rule importance, which is also used to assess the importance of (original) input features. Based on the discussion in Section 3, we also suggest a division between singleton rules and linear terms on the one hand, and higher-degree rules on the other. Since the former constitute a GAM, their effect can be summarized visually by univariate plots. The higher-degree rules can be ranked and the interactions studied further as described in (Friedman & Popescu, 2008).

The following extensions are suggested for future work: 1) We have used a fixed binarization of the features to facilitate formulation of the column generation subproblem (9) as an IP. However, if CG is done using a heuristic, it may be possible to refine the binarization with each CG iteration. Doing so may improve results, particularly for regression. 2) In the experiments in Section 5, we have fixed the ratio between $\lambda_0$ and $\lambda_1$ to match the definition of rule weight. One may also vary this ratio to encourage or discourage longer rules, or tune it to the level of interaction present in a dataset.

Acknowledgements

We thank Karthikeyan Natesan Ramamurthy for help with the MEPS dataset.

References

  • Agency for Healthcare Research and Quality (2018) Agency for Healthcare Research and Quality. Medical Expenditure Panel Survey (MEPS). http://www.ahrq.gov/data/meps.html, 2018. Last accessed 2019-01.
  • Angelino et al. (2017) Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., and Rudin, C. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 35–44, 2017.
  • Bonami et al. (2018) Bonami, P., Gleixner, A. M., Linderoth, J., and Misener, R. Designing and implementing algorithms for mixed-integer nonlinear optimization. In Report from Dagstuhl Seminar 18081, 2018.
  • Borndörfer et al. (2013) Borndörfer, R., Löbel, A., Reuther, M., Schlechte, T., and Weider, S. Rapid branching. Public Transport, 5:3–23, 01 2013. doi: 10.1007/s12469-013-0066-8.
  • Clark & Boswell (1991) Clark, P. and Boswell, R. Rule induction with CN2: Some recent improvements. In Proceedings of the European Working Session on Machine Learning (EWSL), pp. 151–163, 1991.
  • Cohen (1995) Cohen, W. W. Fast effective rule induction. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 115–123, 1995.
  • Cohen & Singer (1999) Cohen, W. W. and Singer, Y. A simple, fast, and effective rule learner. In Proc. Conf. Artif. Intell. (AAAI), pp. 335–342, 1999.
  • Conforti et al. (2014) Conforti, M., Cornuejols, G., and Zambelli, G. Integer programming. Springer, 2014.
  • Dash et al. (2018) Dash, S., Günlük, O., and Wei, D. Boolean decision rules via column generation. In Proc. Thirty-second Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • Dembczyński et al. (2010) Dembczyński, K., Kotłowski, W., and Słowiński, R. ENDER: a statistical framework for boosting decision rules. Data Mining and Knowledge Discovery, 21(1):52–90, Jul 2010.
  • Demiriz et al. (2002) Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. Linear programming boosting via column generation. Mach. Learn., 46(1–3):225–254, January 2002.
  • Demšar (2006) Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30, January 2006.
  • Dua & Karra Taniskidou (2017) Dua, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Fan et al. (2008) Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • FICO (2018) FICO. FICO Explainable Machine Learning Challenge. https://community.fico.com/community/xml, 2018. Last accessed 2019-05.
  • Fortz et al. (2010) Fortz, B., Labbe, M., and Poss, M. A branch-and-cut-and-price framework for convex MINLP applied to a stochastic network design problem. In Proc. European Workshop on MINLP, pp. 131–139, 2010.
  • Friedman & Popescu (2008) Friedman, J. H. and Popescu, B. E. Predictive learning via rule ensembles. Annals of Applied Statistics, 2(3):916–954, Jul 2008.
  • Fürnkranz et al. (2014) Fürnkranz, J., Gamberger, D., and Lavrač, N. Foundations of Rule Learning. Springer-Verlag, Berlin, 2014.
  • Garcia et al. (2003) Garcia, R., Marin, A., and Patriksson, M. Column generation algorithms for nonlinear optimization, I: Convergence analysis. Optimization, 52(2):171–200, 2003.
  • Gilmore & Gomory (1961) Gilmore, P. C. and Gomory, R. E. A linear programming approach to the cutting-stock problem. Operations Research, 9:849–859, 1961.
  • Hastie & Tibshirani (1990) Hastie, T. J. and Tibshirani, R. J. Generalized Additive Models. Chapman and Hall/CRC, 1990. ISBN 9780412343902.
  • Hernández-Orallo et al. (2012) Hernández-Orallo, J., Flach, P., and Ferri, C. A unified view of performance metrics: Translating threshold choice into expected classification loss. J. Mach. Learn. Res., 13:2813–2869, October 2012.
  • Jawanpuria et al. (2011) Jawanpuria, P., Nath, J. S., and Ramakrishnan, G. Efficient rule ensemble learning using hierarchical kernels. In Proc. Int. Conf. Mach. Learn. (ICML), 2011.
  • Lakkaraju et al. (2016) Lakkaraju, H., Bach, S. H., and Leskovec, J. Interpretable decision sets: A joint framework for description and prediction. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining (KDD), pp. 1675–1684, 2016.
  • McCullagh & Nelder (1989) McCullagh, P. and Nelder, J. Generalized Linear Models, Second Edition. Chapman & Hall, 1989.
  • Platt (1999) Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74, 1999.
  • Rivest (1987) Rivest, R. L. Learning decision lists. Machine Learning, 2(3):229–246, 1987.
  • Rückert & Kramer (2006) Rückert, U. and Kramer, S. A statistical approach to rule learning. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 785–792, 2006.
  • Rudin & Ertekin (2018) Rudin, C. and Ertekin, Ş. Learning customized and optimized lists of rules with mathematical programming. Mathematical Programming Computation, 10(4):659–702, Dec 2018.
  • Su et al. (2016) Su, G., Wei, D., Varshney, K. R., and Malioutov, D. M. Learning sparse two-level Boolean rules. In Proc. IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), pp. 1–6, September 2016.
  • Wang et al. (2017) Wang, T., Rudin, C., Doshi-Velez, F., Liu, Y., Klampfl, E., and MacNeille, P. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research, 18(70):1–37, 2017.
  • Weiss & Indurkhya (2000) Weiss, S. M. and Indurkhya, N. Lightweight rule induction. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1135–1142, 2000.
  • Yang et al. (2017) Yang, H., Rudin, C., and Seltzer, M. Scalable Bayesian rule lists. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1013–1022, 2017.

Appendix A Variations on the Column Generation Heuristic

We have explored the following three variations of the heuristic algorithm for column generation described in Section 4:

  1. The algorithm can return the best $m > 1$ solutions that it finds instead of a single incumbent solution, potentially reducing the number of CG iterations needed. We have observed however that these solutions tend to correspond to very similar conjunctions and are hence highly correlated. By allowing multiple such columns to enter together, sparsity suffers because the regularization in (4) has difficulty favoring sparse linear combinations of highly correlated columns over dense ones. For this reason we kept $m = 1$.

  2. The algorithm can be generalized to a beam search by considering the children of $B > 1$ parent conjunctions at each degree instead of a single parent. The best $B$ children according to a combination of metrics (10) and (11) are then chosen to become the next parents. To date however, we have not found setting $B > 1$ to be beneficial.

  3. The algorithm can be terminated early once a solution with negative objective value is found since any such solution corresponds to a descent direction for problem (7). Termination can be immediate or occur after the current degree. While early termination speeds up each CG iteration, the number of iterations tends to increase because the generated columns are of lower quality.

Appendix B Additional Numerical Results

B.1 Classification

Figures 3–6 show trade-offs between Brier score and weighted rules and between accuracy and weighted rules for all classification datasets.

Tables 5 and 6 show mean test accuracies and corresponding complexities when the methods are optimized for accuracy. For Table 5, the Friedman test computed from the mean ranks indicates no statistically significant differences in accuracy among the methods. For Table 6, the Friedman test is significant. Post-hoc comparisons with LRRN as the reference show that RuleFit and RuleFitN are significantly more complex, using Holm's step-down procedure.

Figure 3: Trade-offs between Brier score and weighted number of rules on classification datasets (a) banknote, (b) heart, (c) ILPD, (d) ionosphere, (e) liver, (f) pima, (g) tic-tac-toe, and (h) transfusion. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.
Figure 4: Trade-offs between Brier score and weighted number of rules on classification datasets (a) WDBC, (b) adult, (c) bank-marketing, (d) gas, (e) magic, (f) mushroom, (g) musk, and (h) FICO. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.
Figure 5: Trade-offs between accuracy and weighted number of rules on classification datasets (a) banknote, (b) heart, (c) ILPD, (d) ionosphere, (e) liver, (f) pima, (g) tic-tac-toe, and (h) transfusion. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.
Figure 6: Trade-offs between accuracy and weighted number of rules on classification datasets (a) WDBC, (b) adult, (c) bank-marketing, (d) gas, (e) magic, (f) mushroom, (g) musk, and (h) FICO. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.
dataset LR1 LRR RuleFit LR1N LRRN RuleFitN GBT SVM
banknote () () () () () () () ()
heart () () () () () () () ()
ILPD () () () () () () () ()
ionosphere () () () () () () () ()
liver () () () () () () () ()
pima () () () () () () () ()
tic-tac-toe () () () () () () () ()
transfusion () () () () () () () ()
WDBC () () () () () () () ()
adult () () () () () () () ()
bank-mkt () () () () () () () ()
gas () () () () () () () ()
magic () () () () () () () ()
mushroom () () () () () () () ()
musk () () () () () () () ()
FICO () () () () () () () ()
mean rank
Table 5: Mean test accuracies (%, standard error in parentheses). Best values in bold.
dataset LR1 LRR RuleFit LR1N LRRN RuleFitN
banknote () () () () () ()
heart () () () () () ()
ILPD () () () () () ()
ionosphere () () () () () ()
liver () () () () () ()
pima () () () () () ()
tic-tac-toe () () () () () ()
transfusion () () () () () ()
WDBC () () () () () ()
adult () () () () () ()
bank-mkt () () () () () ()
gas () () () () () ()
magic () () () () () ()
mushroom () () () () () ()
musk () () () () () ()
FICO () () () () () ()
mean rank
Table 6: Mean weighted number of rules (standard error in parentheses) corresponding to Table 5. Best values in bold.

B.2 Regression

Figure 7 shows the trade-offs between $R^2$ and weighted number of rules for all regression datasets.

Figure 7: Trade-offs between coefficient of determination $R^2$ and weighted number of rules on regression datasets (a) abalone, (b) boston, (c) bike, (d) california, (e) crime, (f) parkinsons, (g) wine, and (h) MEPS. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.