1 Introduction
Decision rules have served as important building blocks for supervised learning models. They are often combined via logical operations into decision lists
(Rivest, 1987; Angelino et al., 2017; Yang et al., 2017) and rule sets (Lakkaraju et al., 2016; Wang et al., 2017), models whose appeal stems from the humaninterpretability of their constituent rules.In this paper, we consider models that are linear combinations of decision rules, also referred to as rule ensembles, within the framework of generalized linear models (GLM) (McCullagh & Nelder, 1989)
. Rule ensembles retain interpretability while allowing modelling flexibility since rules are able to capture nonlinear dependences and interactions. Our problem formulation accordingly trades off predictive accuracy against the complexity of the rule ensemble, measured in terms of the number of rules as well as their lengths, i.e., the number of elementary conditions on individual variables in the conjunction. We make use of GLMs to address the realvalued prediction tasks of regression and probabilistic classification, where class probability estimates are desired.
The main challenge in learning rulebased models is due to the exponential size of the space of rules, corresponding to all possible conjunctions of the input features. Predominant approaches include preselecting a (often large) subset of candidate rules and greedy optimization in which rules are added one by one but not revised. The former includes optimization methods that select from among the candidates (Friedman & Popescu, 2008; Lakkaraju et al., 2016; Wang et al., 2017) while the latter includes sequential covering (Cohen, 1995; Clark & Boswell, 1991; Fürnkranz et al., 2014) and boosting.
Our main contribution herein is to propose an approach that avoids both of the above alternatives. The technique of column generation (CG) is used to intelligently search the space of rules and produce useful ones as needed, as opposed to using a large, fixed set of candidates. Instead of boosting, a GLM is refit as rules are generated, which allows existing rules to be reweighted and even discarded. The CG subproblem is formulated as an integer program (not of exponential size) and is solved using either integer programming or a heuristic that targets the same objective. We also discuss a nonCG algorithm that uses only firstdegree rules, i.e., those with a single condition, and optionally numerical features asis. This latter algorithm produces a kind of generalized additive model (GAM) (Hastie & Tibshirani, 1990).
Experiments are presented involving the two most common cases of GLMs, logistic and linear regression. The proposed methods are seen to yield better performancecomplexity tradeoffs than existing rule ensemble algorithms. At the performancemaximizing end of the tradeoff, the methods are competitive with less interpretable benchmark models such as tree ensembles and nonlinear support vector machines (SVM). The tradeoff curves also suggest that substantially simpler models are often available at a relatively small cost in performance.
1.1 Related Work
Of the rule ensemble algorithms that have been proposed, the most closely related to the current proposal is RuleFit (Friedman & Popescu, 2008)
, which also fits linear models to a set of rules. This set however is predetermined by first inducing a large number (hundreds) of decision trees from the data and then extracting rules corresponding to nodes of the trees. As will be seen in Section
5, this approach yields more complex rule ensembles for the same performance. Rückert & Kramer (2006) also propose fitting linear models to an increasing set of rules but the rules are generated in an unsupervised manner.Boosting algorithms, which do not modify previously added rules, are represented by SLIPPER (Cohen & Singer, 1999), MLRules and ENDER (Dembczyński et al., 2010). Other algorithms include HKL (Jawanpuria et al., 2011), which is also able to effectively optimize over an exponential number of conjunctions but relies on a regularizer with a special group structure. While often cited as a rule ensemble method, Weiss & Indurkhya (2000) actually learn ensembles of disjunctive normal forms (DNF), which are more akin to tree ensembles.
Even though branchandbound and CG have been used before in ML, e.g., for boosting (Demiriz et al., 2002) and in particular for rule learning (Angelino et al., 2017; Rudin & Ertekin, 2018), we believe our work is the first application of CG to a nonlinear optimization problem in ML. Our CG approach is inspired by its recent use in learning disjunctive/conjunctive normal form rules (Dash et al., 2018)
, with similar benefits. While that work applied CG to a linear program, here we have a (convex) nonlinear problem. For nonlinear problems, even though the general framework has been discussed for OR problems
(Garcia et al., 2003), there are only a few practical applications, mostly nonlinear variants of the VRP (Borndörfer et al., 2013; Fortz et al., 2010). We also note that whether CG can be successfully applied to MINLPs has been posed as an open problem in a 2018 Dagstuhl Seminar (Bonami et al., 2018).2 Generalized Linear Rule Models
We consider the standard supervised learning problem of predicting a target variable using input features , given a training dataset of i.i.d. samples , , . The output space may be discrete or continuous. We assume that all features
have been binarized to take values in
. For categorical features, this is achieved through the usual “onehot” coding into indicators for all categories as well as their negations . Numerical features are binarized through bidirectional comparisons to a set of thresholds, e.g., , and , . Further details on binarization are given in Section 5.In this section, we recall aspects of generalized linear models (GLMs) and introduce notation needed to define them over a feature space of rules. Let denote the set of conjunctions of to be considered. We defer discussion of the choice of to Section 3 but note that it is not necessary to limit its size. Denote by the variable corresponding to conjunction , and the value taken by in instance . Let be the index of the empty conjunction .
For a GLM, we posit that conditioned on follows an exponential family distribution given by
(1) 
where the canonical parameter is a linear combination of the conjunctions of ,
(2) 
and is the logpartition function. The terms with nonzero coefficients define an ensemble of rules mapping conjunctions to real values , which are then linearly combined. The distribution may have parameters in addition to
(e.g., the variance in the Gaussian case) but these are either assumed known or their estimation can be separated from that of
. The prediction function is given by the conditional mean of ,(3) 
where the second equality holds for (1).
The coefficients in (2) are determined by minimizing the negative loglikelihood corresponding to (1) on the training data. Since is potentially large, a sparse solution is essential and is obtained through regularization. The optimization problem is therefore
(4) 
The factor in (1) is not a function of and is omitted. Each regularization parameter depends on the number of literals of conjunction as specified later. It is a property of exponential families that the logpartition function is convex. By affine composition property of convex functions the problem (4) is therefore convex.
We specialize the foregoing to the two most common cases of GLMs, logistic and linear regression. For logistic regression, the logpartition function
. Substituting this into (4), the quantity in square brackets becomes the familiar expression(5) 
where . For linear regression, and the bracketed quantity becomes (after adding back )
(6) 
3 Model Instantiations
We discuss two instantiations of generalized linear rule models (GLRM). In Section 3.1, the set of conjunctions is restricted to firstdegree or singleton conjunctions, i.e., those with a single condition on an individual feature. Section 3.2 considers the general case with no restriction on .
3.1 Generalized Additive Model Using FirstDegree Rules
In the first case, the conjunctions correspond to the binarized features themselves. In terms of the original unbinarized features, conditions are placed on only one feature at a time and so the resulting GLRM is free of interaction terms. On the other hand, if features are binarized as discussed at the beginning of Section 2, then the GLRM is a type of generalized additive model (GAM), i.e., a sum of univariate functions. For numerical features, firstdegree rules correspond to step functions, which can be linearly combined into arbitrary piecewiseconstant functions with discontinuities at the binarization thresholds. For categorical features, any function can be realized.
Friedman & Popescu (2008) discuss a similar type of model obtained by restricting their decision trees to depth (decision stumps). There are two differences however. First, the singleton rules herein are systematically enumerated as opposed to generated by a randomized tree induction procedure. Second, from every pair of complementary singleton rules (e.g., , ), we remove one member as otherwise the pair together with the empty conjunction are collinear. The results in Section 5 suggest that this removal of linearly dependent rules contributes toward sparser, simpler rule ensembles.
3.2 General Rule Ensemble Using Column Generation
We now let be the set of all possible conjunctions of to obtain rule ensembles with no restrictions. Since is now exponentially large, it is intractable even to enumerate the variables in problem (4) (unless the feature dimension is very small). We exploit the technique of column generation (CG) to tackle this problem.
Column generation was originally developed to solve linear programs (LPs) with a very large number of columns (Gilmore & Gomory, 1961; Conforti et al., 2014). Using the fact that optimal solutions of LPs are sparse, the main idea is to first solve a restricted problem with a small number of candidate columns and then generate some of the missing columns based on the optimal dual solution of this restricted problem. Using the dual solution, one can compute the marginal benefit (or, partial derivative) of introducing a missing column to the restricted problem. If partial derivative for the most promising missing column is nonnegative, then the procedure terminates. The crucial component of this approach is to formulate a column generation problem that can search through all of the missing columns without complete enumeration.
We next describe how to adopt this idea to solve the generalized linear model (4). To derive the column generation subproblem, it is helpful to express as for , . Problem (4) becomes
(7) 
Suppose that a restricted version of (7) has been solved for a subset of the set of conjunctions, yielding for . We extend this to a solution for (7) by setting for and wish to determine the optimality of the extended solution. Since (7) is also a convex problem with nonnegativity constraints, a necessary and sufficient condition of optimality is for the partial derivatives of the objective with respect to to be zero if and nonnegative if . This condition is true for due to optimality for the restricted problem. For , and we are thus required to check nonnegativity of the derivatives.
The partial derivative w.r.t. of the objective in (7) is
using (2) and (3) to obtain the first equality and defining the prediction residual in the second. The partial derivative with respect to is the same except that the residuals are negated. Nonnegativity of all partial derivatives can thus be determined by solving the pair of problems
(8) 
If both optimal values in (8) are nonnegative, then all derivatives are indeed nonnegative and we conclude that the extended solution is optimal for (7). On the other hand, if the objective in (8) is negative for some and the ‘’ sign (say), then the partial derivative w.r.t. is negative and can be added to the restricted problem (i.e., added to ) to potentially improve the current solution.
We now use the fact that is a set of conjunctions to avoid solving (8) by enumeration. This involves encoding a conjunction by the set of participating features and relating the values to the feature values . Let represent whether feature is selected in a conjunction. We assume that the regularization parameter is an affine function of the degree of the conjunction, with , (other affine functions of are possible). Define , , and . Then (8) can be reformulated as
(9) 
where the subscript has been dropped in favor of encoding by . The constraints in (9) ensure that acts as the conjunction of the selected ’s. For , only if all selected features have (), otherwise since is minimized in the objective. For , , is maximized, and the corresponding constraint enforces the same behavior for .
Therefore, our column generation algorithm alternates between solving the restricted loglikelihood problem (4) and searching for new columns by solving (9) for both signs. We initialize the restricted set to be the set of firstdegree rules discussed in Section 3.1, optionally including original numerical features as well. The algorithm terminates with a certificate that problem (4) is solved to optimality if the optimal value of problem (9) is nonnegative for both signs. For practical reasons we also have a secondary termination criteria that depends on the total number of column generation iterations and the total CPU time spent.
This algorithm is guaranteed to terminate in finite time as there are only a finite number of candidate columns (conjunctions) and the column generation procedure would not generate the same conjunction more than once as the partial derivative of the conjunctions in the restricted problem are guaranteed to be nonnegative. We note that finite termination is not guaranteed for models with different regularization parameters, for example for regularization, as repeating a conjunction with a nonzero weight would improve the optimal value of (7) by simply splitting the weight of an original conjunction into two equal parts. After splitting the first term in (7) would stay the same whereas the regularization penalty would decrease.
Once the column generation algorithm terminates, we solve the loglikelihood problem (4) one last time to debias the solution. In this final run we restrict conjunctions to the ones with in the last round and we drop the regularization term in the objective.
4 Column Generation Approaches
The column generation subproblem (9) is solved using either integer programming (IP) or a heuristic algorithm. We have implemented our column generation procedure in Java using LibLinear (Fan et al., 2008) to solve the regularized logistic regression problem (4) and using Cplex callable library (version 12.7.1) to solve the integer program (9) for column generation. As LibLinear package only allows simple regularization (as opposed to using different weights that depend on the complexity of the conjunction ), we scale the values by .
The proposed heuristic algorithm performs a limited search of the rule space, proceeding in order of increasing conjunction degree from up to a maximum . To describe the heuristic, we define the children of a conjunction as those conjunctions that involve one additional feature, i.e., have one additional in terms of the representation in (9). At each degree , only those conjunctions that are children of a chosen parent (of degree ) are evaluated. These children are evaluated for two purposes: 1) To determine whether any improve upon the incumbent solution, defined as the best solution observed thus far; 2) to select a parent conjunction for the next highest degree. Evaluation for the first purpose is based only on the objective value achieved in (9), while evaluation for the second also considers a lower bound on future objective values, as discussed next.
We illustrate the objective value and lower bound calculations for degree conjunctions, i.e., children of the initial empty conjunction. Higher degrees are analogous. Objective values of children can be computed via their increments and decrements relative to the value of the parent. For , setting forces for such that (). The change in value of child is thus
(10) 
A lower bound on future objective values resulting from setting , i.e., values of descendants of child , can be obtained by optimistically assuming that with the addition of one more feature (at a further cost of ), all positive can be eliminated from the objective function in (9) while no negative are eliminated beyond those due to setting itself. Expressed in the same relative terms as (10), this lower bound is
(11) 
To determine the parent for the next degree, we first eliminate all children of the current parent whose lower bounds are not less than the value of the incumbent solution, since these cannot lead to improvement. Any remaining children are evaluated using the average of and and the child with the lowest such average is selected. The motivation is to consider not only the children’s current values but also a crude but easily computed estimate of potential future values. Other convex combinations of and have not been explored.
Once a new parent conjunction is chosen, corresponding to setting a , two operations are performed to reduce the dimensions of the problem and to render it in the same form as for . First, indices (rows) where are removed since is forced to zero as noted above. Second, setting may make other features redundant and these can be removed (by setting ). For example, if binary feature corresponds to one category of an original categorical feature, then we may set for other categories of the same feature because the conjunction of and would be identically zero. We refer to Su et al. (2016) for a fuller discussion of these redundancies.
We have explored additional variations of the heuristic algorithm as discussed in Appendix A.
5 Numerical Evaluation
We report on numerical experiments involving both logistic regression (5) (i.e., classification) and linear regression (6). We tested the methods discussed in Section 3: logistic/linear regression on singleton rules (without CG, abbreviated LR1), logistic/linear regression on general rules (with CG, abbreviated LRR), and the same two with the addition of any numerical features originally present in the data (LR1N, LRRN). Column generation is done using the heuristic in Section 4 but we also report a preliminary result using an IP version of LRR (LRRI). We compared these methods to RuleFit (Friedman & Popescu, 2008), which is the closest existing method, and specifically a Python implementation^{1}^{1}1https://github.com/christophM/rulefit because the original R code is no longer supported. We also tried to test HKL (Jawanpuria et al., 2011)
but encountered a no longer supported toolbox as well as numerical problems. Beyond rule ensembles, we also compared to gradient boosted classification/regression trees (GBT) and support vector machines (SVM) with radial basis function (RBF) kernels. These are less interpretable models intended to provide a benchmark for prediction performance.
fold crossvalidation (CV) is used to estimate all test performance metrics.Categorical and numerical features were binarized as described at the beginning of Section 2
, using sample deciles as thresholds for numerical features. To control for the effect of discretization, numerical features were discretized using the same quantile thresholds for all rule and treebased methods in the comparison. Excluded from this treatment are SVMs and the variant of RuleFit that uses numerical features in addition to rules (abbreviated RuleFitN).
Tradeoffs between Brier score and weighted number of rules. Pareto efficient points are connected by line segments. Horizontal and vertical bars represent standard errors in the means.
dataset  LR1  LRR  RuleFit  LR1N  LRRN  RuleFitN  GBT  SVM 
banknote  () E3  () E3  () E3  () E5  () E3  () E3  () E3  () E4 
heart  () E1  () E1  () E1  () E1  () E1  () E1  () E1  () E1 
ILPD  () E1  () E1  () E1  () E1  () E1  () E1  () E1  () E1 
ionosphere  () E2  () E2  () E2  () E2  () E2  () E2  () E2  () E2 
liver  () E1  () E1  () E1  () E1  () E1  () E1  () E1  () E1 
pima  () E1  () E1  () E1  () E1  () E1  () E1  () E1  () E1 
tictactoe  () E2  () E2  () E7  () E2  () E2  () E7  () E3  () E2 
transfusion  () E1  () E1  () E1  () E1  () E1  () E1  () E1  () E1 
WDBC  () E2  () E2  () E2  () E2  () E2  () E2  () E2  () E2 
adult  () E1  () E1  () E1  () E2  () E2  () E2  () E1  () E1 
bankmkt  () E1  () E2  () E1  () E1  () E2  () E2  () E2  () E2 
gas  () E3  () E3  () E3  () E3  () E3  () E3  () E3  () E3 
magic  () E1  () E1  () E1  () E1  () E1  () E2  () E2  () E2 
mushroom  () E7  () E7  () E7  () E7  () E7  () E7  () E4  () E4 
musk  () E2  () E2  () E2  () E2  () E2  () E2  () E2  () E2 
FICO  () E1  () E1  () E1  () E1  () E1  () E1  () E1  () E1 
mean rank 
dataset  LR1  LRR  RuleFit  LR1N  LRRN  RuleFitN 

banknote  ()  ()  ()  ()  ()  () 
heart  ()  ()  ()  ()  ()  () 
ILPD  ()  ()  ()  ()  ()  () 
ionosphere  ()  ()  ()  ()  ()  () 
liver  ()  ()  ()  ()  ()  () 
pima  ()  ()  ()  ()  ()  () 
tictactoe  ()  ()  ()  ()  ()  () 
transfusion  ()  ()  ()  ()  ()  () 
WDBC  ()  ()  ()  ()  ()  () 
adult  ()  ()  ()  ()  ()  () 
bankmkt  ()  ()  ()  ()  ()  () 
gas  ()  ()  ()  ()  ()  () 
magic  ()  ()  ()  ()  ()  () 
mushroom  ()  ()  ()  ()  ()  () 
musk  ()  ()  ()  ()  ()  () 
FICO  ()  ()  ()  ()  ()  () 
mean rank 
5.1 Classification
For classification, we used the same datasets considered in (Dash et al., 2018), which also appeared in other recent works on rulebased models (Su et al., 2016; Wang et al., 2017). One of these datasets comes from the recent FICO Explainable Machine Learning Challenge (FICO, 2018).
In the first experiment, we evaluated the performancecomplexity tradeoffs of the four proposed methods (LR1, LRR, LR1N, LRRN) as well as RuleFit. For performance metrics, we report both accuracy and Brier score (i.e., mean squared error (MSE)), the latter a wellknown metric for probabilistic outputs (HernándezOrallo et al., 2012) as produced by logistic regressionbased models. To measure rule ensemble complexity, we consider not only the number of rules (with nonzero coefficients) but also their lengths in terms of number of conditions. Specifically we define the weight of a rule similarly to the regularization parameter as , where we take as the weight on the degree. Coefficients corresponding to numerical features receive a weight of . For consistency with this definition, the parameters used by all methods in this comparison are set proportional to the weights, i.e., with . By varying the remaining free parameter , we sweep out tradeoffs between performance and complexity. RuleFit has an additional parameter, the mean tree size, that is recommended for tuning in (Friedman & Popescu, 2008). We have done so based on test set results, which gives RuleFit a slight advantage.
Figure 1 shows the resulting tradeoffs with Brier score for of the datasets. The other plots as well as those for accuracy are in Appendix B. Paretoefficient points, i.e., those not dominated by points with both lower Brier score and lower complexity, have been connected with line segments for the sole purpose of visualization. RuleFit obtains inferior tradeoffs on most of the tested datasets. Even in cases such as Figure 0(c) where RuleFit eventually attains a lower Brier score, the initial part of the tradeoff is worse. Note that the variants employing numerical features generally fare better, as expected. Figure 0(c) indicates that IP CG (LRRI) can improve upon the heuristic in some cases.
Next we discuss the differences between LR1(N) and LRR(N). A general observation is that LRR, which uses CG to produce higherdegree rules, tends to achieve better tradeoffs on larger datasets, as exemplified by musk in Figure 0(d). This can be explained by the fact that LRR effectively considers a much larger feature space than LR1. If the training sample size is sufficient to support this, then the greater power of the higherdegree rules found by LRR generalizes to test data. The presence of strong interactions in the data also favors LRR. On the other hand, on smaller datasets such as in Figure 0(a), LRR may overfit and achieve a worse tradeoff relative to LR1. In Figure 0(b), LRR outperforms LR1 but the advantage disappears for LRRN compared to LR1N, a pattern that also occurs on other datasets. In Figure 0(c), the advantage of LRR(N) lies in attaining a slightly lower minimal Brier score.
In a second experiment, we aim to maximize performance (minimize Brier score or maximize accuracy) by performing nested CV on the training set to select and applying the resulting model to the test set. Since performance is now the primary criterion, we broaden the comparison to include GBT and SVM. For GBT, the maximum tree depth was tuned and the number of trees was also determined via a stopping criterion on a validation set, up to a maximum of trees. For SVM, the regularization parameter and kernel width were tuned and Platt scaling (Platt, 1999) was used to calibrate the output scores as probabilities.
Table 1 shows the resulting mean Brier scores while Table 2 shows the weighted number of rules at which the Brier scores were achieved. To help summarize these results, we report the mean rank of each method as well as Friedman tests on these mean ranks, as suggested by a reviewer and following (Demšar, 2006). The overall conclusion is that when tuned for maximum performance, the proposed methods, especially LRRN, compete well with benchmark models and do so using significantly fewer rules than RuleFit.
For Table 1, the Friedman statistic is , corresponding to a pvalue of using the
distribution approximation. Hence the null hypothesis of no significant differences is rejected at the
level but not at . A posthoc test, comparing all other algorithms to LRRN and correcting for multiple comparisons using Holm’s procedure, shows that the only significant difference at the level is with LR1, i.e., when excluding both higherdegree rules and numerical features. The mean ranks also show that LRR(N) outperforms LR1(N) although most of the differences are not statistically significant.In Table 2
, it is clear that RuleFit produces much more (34 times) complex classifiers than all four proposed methods. The Friedman statistic of
(pvalue ) is significant as expected. Posthoc comparisons to LRRN confirm that RuleFit and RuleFitN are significantly more complex, this time at the level and again with Holm’s correction. Note that including original numerical features does not significant increase classifier complexity (comparing LR1 with LR1N and LRR with LRRN).5.2 Regression
dataset  LR1  LRR  RuleFit  LR1N  LRRN  RuleFitN  GBT  SVM 

abalone  ()  ()  ()  ()  ()  ()  ()  () 
boston  ()  ()  ()  ()  ()  ()  ()  () 
bike  ()  ()  ()  ()  ()  ()  ()  () 
california  ()  ()  ()  ()  ()  ()  ()  () 
crime  ()  ()  ()  ()  ()  ()  ()  () 
parkinsons  ()  ()  ()  ()  ()  ()  ()  () 
wine  ()  ()  ()  ()  ()  ()  ()  () 
MEPS  ()  ()  ()  ()  ()  ()  ()  () 
mean rank 
dataset  LR1  LRR  RuleFit  LR1N  LRRN  RuleFitN 

abalone  ()  ()  ()  ()  ()  () 
boston  ()  ()  ()  ()  ()  () 
bike  ()  ()  ()  ()  ()  () 
california  ()  ()  ()  ()  ()  () 
crime  ()  ()  ()  ()  ()  () 
parkinsons  ()  ()  ()  ()  ()  () 
wine  ()  ()  ()  ()  ()  () 
MEPS  ()  ()  ()  ()  ()  () 
mean rank 
For regression, we experimented with an additional datasets, of which are drawn from previous works on rule ensembles (Friedman & Popescu, 2008; Dembczyński et al., 2010) and the UCI repository (Dua & Karra Taniskidou, 2017). The last dataset comes from the Medical Expenditure Panel Survey (MEPS) (Agency for Healthcare Research and Quality, 2018) of the US Department of Health and Human Services, specifically panel 19 from the year 2015. The task is to predict the annual healthcare expenditure of individuals based on demographics and selfreported medical conditions.
The same two experiments are conducted for regression, using the linear regression variants (6) of both the proposed approaches as well as RuleFit. The coefficient of determination is chosen as the performance metric and the model complexity metric is the same as in Section 5.1.
In Figure 2, we show tradeoffs between and weighted number of rules for of the datasets; the other can be found in Appendix B. RuleFit is a stronger competitor in regression due to sometimes achieving higher at higher complexities. On of datasets, the curves for RuleFit(N) remain below their LRR(N) counterparts as in Figure 1(a). On another datasets, the curves cross at moderate to high complexities as in Figure 1(b), while in Figure 1(c), RuleFit obtains much higher . Among the proposed methods, the advantage of LRR(N) vs. LR1(N) is generally larger in regression than in classification (cf. Figure 1), indicating the benefit of generating higherdegree rules. A possible explanation may be that interaction terms matter more in regression where the output range is wider.
Tables 3 and 4 show the results of selecting parameter through nested CV to maximize . The same overall conclusion as in Section 5.1 holds for LRRN in particular, namely that it yields highly competitive values while using fewer rules than RuleFit. The Friedman statistic for Table 3 is (pvalue ), and in posthoc comparisons to LRRN, the only significant difference ( level) is again with LR1. Among the proposed methods, advantages due to CG and/or numerical features in Figure 2 carry over into Table 3. In Table 4, RuleFit(N) again yields more complex solutions on average, although the difference is not as large as in Table 2. The Friedman statistic is (pvalue ); however, no posthoc comparisons with LRRN (which occupies a middle position) as the reference show statistically significant differences.
6 Discussion
The numerical results in Section 5 may raise the question of the interpretability of rule ensembles with hundreds of rules. First we note that because of the shape of many of the tradeoff curves, with steep improvement at low complexity followed by a flatter region, it may be possible to obtain substantially simpler models, with say a few tens of rules, that are not too far from maximum performance. Apart from model simplification, the fact that a rule ensemble is also a linear model facilitates model inspection, for example by focusing on the rules corresponding to the largest coefficients . Friedman & Popescu (2008) discuss a slightly more refined measure of rule importance, which is also used to assess the importance of (original) input features. Based on the discussion in Section 3, we also suggest a division between singleton rules and linear terms on the one hand, and higherdegree rules on the other. Since the former constitute a GAM, their effect can be summarized visually by univariate plots. The higherdegree rules can be ranked and the interactions studied further as described in (Friedman & Popescu, 2008).
The following extensions are suggested for future work: 1) We have used a fixed binarization of the features to facilitate formulation of the column generation subproblem (9) as an IP. However, if CG is done using a heuristic, it may be possible to refine the binarization with each CG iteration. Doing so may improve results, particularly for regression. 2) In the experiments in Section 5, we have fixed the ratio to match the definition of rule weight. One may also vary this ratio to encourage or discourage longer rules, or tune it to the level of interaction present in a dataset.
Acknowledgements
We thank Karthikeyan Natesan Ramamurthy for help with the MEPS dataset.
References
 Agency for Healthcare Research and Quality (2018) Agency for Healthcare Research and Quality. Medical Expenditure Panel Survey (MEPS). http://www.ahrq.gov/data/meps.html, 2018. Last accessed 201901.
 Angelino et al. (2017) Angelino, E., LarusStone, N., Alabi, D., Seltzer, M., and Rudin, C. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 35–44, 2017.
 Bonami et al. (2018) Bonami, P., Gleixner, A. M., Linderoth, J., and Misener, R. Designing and implementing algorithms for mixedinteger nonlinear optimization. In Report from Dagstuhl Seminar 18081, 2018.
 Borndörfer et al. (2013) Borndörfer, R., Löbel, A., Reuther, M., Schlechte, T., and Weider, S. Rapid branching. Public Transport, 5:3–23, 01 2013. doi: 10.1007/s1246901300668.
 Clark & Boswell (1991) Clark, P. and Boswell, R. Rule induction with CN2: Some recent improvements. In Proceedings of the European Working Session on Machine Learning (EWSL), pp. 151–163, 1991.
 Cohen (1995) Cohen, W. W. Fast effective rule induction. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 115–123, 1995.
 Cohen & Singer (1999) Cohen, W. W. and Singer, Y. A simple, fast, and effective rule learner. In Proc. Conf. Artif. Intell. (AAAI), pp. 335–342, 1999.
 Conforti et al. (2014) Conforti, M., Cornuejols, G., and Zambelli, G. Integer programming. Springer, 2014.
 Dash et al. (2018) Dash, S., Günlük, O., and Wei, D. Boolean decision rules via column generation. In Proc. Thirtysecond Conference on Neural Information Processing Systems (NeurIPS), 2018.
 Dembczyński et al. (2010) Dembczyński, K., Kotłowski, W., and Słowiński, R. ENDER: a statistical framework for boosting decision rules. Data Mining and Knowledge Discovery, 21(1):52–90, Jul 2010.
 Demiriz et al. (2002) Demiriz, A., Bennett, K. P., and ShaweTaylor, J. Linear programming boosting via column generation. Mach. Learn., 46(1–3):225–254, January 2002.
 Demšar (2006) Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30, January 2006.
 Dua & Karra Taniskidou (2017) Dua, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Fan et al. (2008) Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., and Lin, C.J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
 FICO (2018) FICO. FICO Explainable Machine Learning Challenge. https://community.fico.com/community/xml, 2018. Last accessed 201905.
 Fortz et al. (2010) Fortz, B., Labbe, M., and Poss, M. A branchandcutandprice framework for convex MINLP applied to a stochastic network design problem. In Proc. European Workshop on MINLP, pp. 131–139, 2010.
 Friedman & Popescu (2008) Friedman, J. H. and Popescu, B. E. Predictive learning via rule ensembles. Annals of Applied Statistics, 2(3):916–954, Jul 2008.
 Fürnkranz et al. (2014) Fürnkranz, J., Gamberger, D., and Lavrač, N. Foundations of Rule Learning. SpringerVerlag, Berlin, 2014.
 Garcia et al. (2003) Garcia, R., Marin, A., and Patriksson, M. Column generation algorithms for nonlinear optimization, I: Convergence analysis. Optimization, 52(2):171–200, 2003.
 Gilmore & Gomory (1961) Gilmore, P. C. and Gomory, R. E. A linear programming approach to the cuttingstock problem. Operations Research, 9:849–859, 1961.
 Hastie & Tibshirani (1990) Hastie, T. J. and Tibshirani, R. J. Generalized Additive Models. Chapman and Hall/CRC, 1990. ISBN 9780412343902.
 HernándezOrallo et al. (2012) HernándezOrallo, J., Flach, P., and Ferri, C. A unified view of performance metrics: Translating threshold choice into expected classification loss. J. Mach. Learn. Res., 13:2813–2869, October 2012.
 Jawanpuria et al. (2011) Jawanpuria, P., Nath, J. S., and Ramakrishnan, G. Efficient rule ensemble learning using hierarchical kernels. In Proc. Int. Conf. Mach. Learn. (ICML), 2011.
 Lakkaraju et al. (2016) Lakkaraju, H., Bach, S. H., and Leskovec, J. Interpretable decision sets: A joint framework for description and prediction. In Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining (KDD), pp. 1675–1684, 2016.
 McCullagh & Nelder (1989) McCullagh, P. and Nelder, J. Generalized Linear Models, Second Edition. Chapman & Hall, 1989.
 Platt (1999) Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74, 1999.
 Rivest (1987) Rivest, R. L. Learning decision lists. Machine Learning, 2(3):229–246, 1987.
 Rückert & Kramer (2006) Rückert, U. and Kramer, S. A statistical approach to rule learning. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 785–792, 2006.
 Rudin & Ertekin (2018) Rudin, C. and Ertekin, Ş. Learning customized and optimized lists of rules with mathematical programming. Mathematical Programming Computation, 10(4):659–702, Dec 2018.
 Su et al. (2016) Su, G., Wei, D., Varshney, K. R., and Malioutov, D. M. Learning sparse twolevel Boolean rules. In Proc. IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), pp. 1–6, September 2016.
 Wang et al. (2017) Wang, T., Rudin, C., DoshiVelez, F., Liu, Y., Klampfl, E., and MacNeille, P. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research, 18(70):1–37, 2017.
 Weiss & Indurkhya (2000) Weiss, S. M. and Indurkhya, N. Lightweight rule induction. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1135–1142, 2000.
 Yang et al. (2017) Yang, H., Rudin, C., and Seltzer, M. Scalable Bayesian rule lists. In Proc. Int. Conf. Mach. Learn. (ICML), pp. 1013–1022, 2017.
Appendix A Variations on the Column Generation Heuristic
We have explored the following three variations of the heuristic algorithm for column generation described in Section 4:

The algorithm can return the best solutions that it finds instead of a single incumbent solution, potentially reducing the number of CG iterations needed. We have observed however that these solutions tend to correspond to very similar conjunctions and are hence highly correlated. By allowing multiple such columns to enter together, sparsity suffers because the regularization in (4) has difficulty favoring sparse linear combinations of highly correlated columns over dense ones. For this reason we kept .

The algorithm can be generalized to a beam search by considering the children of parent conjunctions at each degree instead of a single parent. The best children according to a combination of metrics (10) and (11) are then chosen to become the next parents. To date however, we have not found setting to be beneficial.

The algorithm can be terminated early once a solution with negative objective value is found since any such solution corresponds to a descent direction for problem (7). Termination can be immediate or occur after the current degree. While early termination speeds up each CG iteration, the number of iterations tends to increase because the generated columns are of lower quality.
Appendix B Additional Numerical Results
b.1 Classification
Figures 3–6 show tradeoffs between Brier score and weighted rules and between accuracy and weighted rules for all classification datasets.
Tables 5 and 6 show mean test accuracies and corresponding complexities when the methods are optimized for accuracy. For Table 5, the Friedman statistic computed from the mean ranks is with a pvalue of , indicating no statistically significant differences in accuracy among the methods. For Table 6, the Friedman statistic is (pvalue ). Posthoc comparisons with LRRN as the reference show that RuleFit and RuleFitN are significantly more complex at the level using Holm’s stepdown procedure.
dataset  LR1  LRR  RuleFit  LR1N  LRRN  RuleFitN  GBT  SVM 

banknote  ()  ()  ()  ()  ()  ()  ()  () 
heart  ()  ()  ()  ()  ()  ()  ()  () 
ILPD  ()  ()  ()  ()  ()  ()  ()  () 
ionosphere  ()  ()  ()  ()  ()  ()  ()  () 
liver  ()  ()  ()  ()  ()  ()  ()  () 
pima  ()  ()  ()  ()  ()  ()  ()  () 
tictactoe  ()  ()  ()  ()  ()  ()  ()  () 
transfusion  ()  ()  ()  ()  ()  ()  ()  () 
WDBC  ()  ()  ()  ()  ()  ()  ()  () 
adult  ()  ()  ()  ()  ()  ()  ()  () 
bankmkt  ()  ()  ()  ()  ()  ()  ()  () 
gas  ()  ()  ()  ()  ()  ()  ()  () 
magic  ()  ()  ()  ()  ()  ()  ()  () 
mushroom  ()  ()  ()  ()  ()  ()  ()  () 
musk  ()  ()  ()  ()  ()  ()  ()  () 
FICO  ()  ()  ()  ()  ()  ()  ()  () 
mean rank 
dataset  LR1  LRR  RuleFit  LR1N  LRRN  RuleFitN 

banknote  ()  ()  ()  ()  ()  () 
heart  ()  ()  ()  ()  ()  () 
ILPD  ()  ()  ()  ()  ()  () 
ionosphere  ()  ()  ()  ()  ()  () 
liver  ()  ()  ()  ()  ()  () 
pima  ()  ()  ()  ()  ()  () 
tictactoe  ()  ()  ()  ()  ()  () 
transfusion  ()  ()  ()  ()  ()  () 
WDBC  ()  ()  ()  ()  ()  () 
adult  ()  ()  ()  ()  ()  () 
bankmkt  ()  ()  ()  ()  ()  () 
gas  ()  ()  ()  ()  ()  () 
magic  ()  ()  ()  ()  ()  () 
mushroom  ()  ()  ()  ()  ()  () 
musk  ()  ()  ()  ()  ()  () 
FICO  ()  ()  ()  ()  ()  () 
mean rank 
b.2 Regression
Figure 7 shows the tradeoff between and weighted rules for all regression datasets.