1 Introduction
The last decade has witnessed an unprecedented adoption of machine learning techniques to make sense of available data and make predictions that support decision making in a wide variety of applications, ranging from healthcare analytics to customer churn prediction, movie recommendation, and macroeconomic policy. The focus in the machine learning literature has been on increasingly sophisticated systems with the paramount goal of improving the accuracy of their predictions, at the cost of making such systems essentially black boxes. While in certain tasks, such as ad prediction, accuracy is the main objective, in other domains, e.g., legal, medical, and government, it is essential that human decision makers who may not have been trained in machine learning can interpret and validate the predictions [20, 33].
The most popular interpretable techniques that tend to be adopted and trusted by decision makers include classification rules, decision trees, and decision lists [10, 29, 8, 30]. In particular, decision rules with a small number of Boolean clauses tend to be the most interpretable. Such models can be used both to learn interpretable models from the start and as proxies that provide post-hoc explanations of pre-trained black-box models [12, 1].
On the theoretical front, the problem of rule learning was shown to be computationally intractable [32]. Consequently, the earliest practical efforts, such as decision list and decision tree approaches, relied on a combination of heuristically chosen optimization objectives and greedy algorithmic techniques, and the size of the rule was controlled by either early stopping or ad-hoc rule pruning. Only recently have there been formulations that attempt to balance the accuracy and the size of the rule in a principled optimization objective, through combinatorial optimization, linear programming (LP) relaxations, submodular optimization, or Bayesian methods [4, 25, 24, 7, 34], as we review in Section 5.
Motivated by the significant progress in the development of combinatorial solvers (in particular, MaxSAT), we ask: can we design a combinatorial framework to efficiently construct interpretable classification rules that takes advantage of these recent advances? The primary contribution of this paper is a combinatorial framework that enables precise control of accuracy vs. interpretability, together with evidence that the computational advances in the MaxSAT community make it practical to solve large-scale classification problems.
In particular, this paper makes the following contributions:

A MaxSAT-based framework, MLIC, that provably trades off accuracy vs. interpretability of the rules.

A prototype implementation of MLIC based on MaxSAT that is capable of finding optimal (or high-quality near-optimal) classification rules from modern large-scale datasets.

We show that in many classification problems interpretability can be achieved at only a minor loss of accuracy and, furthermore, that MLIC, which specifically looks for interpretable rules, can learn from far fewer samples than black-box ML techniques.
Furthermore, we hope to share our excitement about applications of constraint programming/MaxSAT in machine learning, and to encourage researchers in both the interpretable classification and the CSP/SAT communities to consider this topic further: both in developing new SAT-based formulations for interpretable ML, and in designing bespoke solvers attuned to the problem of interpretable ML.
The rest of the paper is organized as follows. We discuss notation and preliminaries in Section 2. We then present MLIC, which is the primary contribution of this paper, in Section 3, and follow up with the experimental setup and results over a large set of benchmarks in Section 4. We then discuss related work in Section 5 and extensions in Section 6, and finally conclude in Section 7.
2 Preliminaries
We use capital boldface letters such as $\mathbf{X}$ to denote matrices, while lowercase boldface letters such as $\mathbf{y}$ are reserved for vectors/sets. For a matrix $\mathbf{X}$, $\mathbf{X}_i$ represents the $i$th row of $\mathbf{X}$, while for a vector/set $\mathbf{y}$, $y_i$ represents the $i$th element of $\mathbf{y}$.
Let $F$ be a Boolean formula and $\mathbf{b} = \{b_1, b_2, \ldots, b_m\}$ be the set of variables appearing in $F$. A literal is a variable ($b_i$) or its complement ($\neg b_i$). A satisfying assignment or a witness of $F$ is an assignment of the variables in $\mathbf{b}$ that makes $F$ evaluate to true. If $\sigma$ is an assignment of variables and $b_i \in \mathbf{b}$, we use $\sigma(b_i)$ to denote the value assigned to $b_i$ in $\sigma$. $F$ is in Conjunctive Normal Form (CNF) if $F := C_1 \wedge C_2 \wedge \cdots \wedge C_k$, where each clause $C_i$ is represented as a disjunction of literals. We use $|C_i|$ to denote the number of literals in $C_i$. For two vectors $\mathbf{u}$ and $\mathbf{v}$ over propositional variables/constants, we define $\mathbf{u} \circ \mathbf{v} = \bigvee_i (u_i \wedge v_i)$, where $u_i$ and $v_i$ denote the variables/constants at the $i$th index of $\mathbf{u}$ and $\mathbf{v}$ respectively. In this context, note that the operation $\wedge$ between a variable and a constant follows the standard interpretation, i.e., $0 \wedge b = 0$ and $1 \wedge b = b$.
We consider standard binary classification, where we are given a collection of training samples $\{\mathbf{X}_i, y_i\}$, where each vector $\mathbf{X}_i$ contains the valuation of the features $\mathbf{x} = \{x_1, x_2, \ldots, x_m\}$ for sample $i$, and $y_i \in \{0, 1\}$ is the binary label for sample $i$. A classifier $R$ is a mapping that takes in a feature vector $\mathbf{x}$ and returns a class $y \in \{0, 1\}$, i.e., $y = R(\mathbf{x})$. The goal is not only to design $R$ to approximate our training set, but also to generalize to unseen samples arising from the same distribution. In this work, we restrict $\mathbf{x}$ and $y$ to be Boolean (we discuss in Section 3 that such a restriction can be achieved without loss of generality) and focus on classifiers that can be expressed compactly in Conjunctive Normal Form (CNF). We use $C_i$ to denote the $i$th clause of $R$. Furthermore, we use $|R|$ to denote the sum of the counts of literals in all the clauses, i.e., $|R| = \sum_i |C_i|$.
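To make the notation concrete, the following minimal Python sketch evaluates a CNF rule over Boolean features and computes its size. The representation is an assumption chosen for illustration: a rule is a list of clauses, and each clause is a 0/1 selector vector marking which features appear in it. The helper names `circ`, `evaluate_rule`, and `rule_size` are hypothetical and are not taken from the paper's implementation.

```python
from typing import List

def circ(u: List[int], v: List[int]) -> int:
    """Return 1 iff some position i has u[i] = v[i] = 1 (an OR of pairwise ANDs)."""
    return int(any(a and b for a, b in zip(u, v)))

def evaluate_rule(x: List[int], rule: List[List[int]]) -> int:
    """A CNF rule holds iff every clause selects at least one feature that is 1 in x."""
    return int(all(circ(x, clause) for clause in rule))

def rule_size(rule: List[List[int]]) -> int:
    """Total number of literals, summed over all clauses."""
    return sum(sum(clause) for clause in rule)

# Hypothetical rule over three Boolean features: (x1 OR x3) AND (x2)
rule = [[1, 0, 1], [0, 1, 0]]
print(evaluate_rule([1, 1, 0], rule))  # both clauses are satisfied
print(evaluate_rule([1, 0, 0], rule))  # the second clause is violated
print(rule_size(rule))                 # literals counted across both clauses
```

Storing each clause as a selector vector keeps rule evaluation a direct application of the OR-of-ANDs operation between a sample and a clause.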
In this work, we focus on a weighted variant of CNF wherein a weight function $W(\cdot)$ is defined over clauses. For a clause $C_i$ and weight function $W(\cdot)$, we use $W(C_i)$ to denote the weight of clause $C_i$. We say that a clause $C_i$ is hard if $W(C_i) = -\infty$; otherwise $C_i$ is called a soft clause. To avoid notational clutter, we overload $W(\cdot)$ to denote the weight of an assignment or a clause, depending on the context. We define the weight of an assignment $\sigma$ as the sum of the weights of the clauses that $\sigma$ does not satisfy. Formally, $W(\sigma) = \sum_{i \mid \sigma \not\models C_i} W(C_i)$.
Given $F$ and weight function $W(\cdot)$, the problem of MaxSAT is to find an assignment $\sigma^*$ that has the maximum weight, i.e., $\sigma^* = \mathsf{MaxSAT}(F, W)$ if $\forall \sigma \neq \sigma^*,\ W(\sigma^*) \geq W(\sigma)$. Our formulation will have negative clause weights; hence MaxSAT corresponds to satisfying as many clauses as possible, and picking the weakest clauses among the unsatisfied ones. Note that the above formulation is different from the typical definition of MaxSAT, but the difference is only syntactic. Borrowing the terminology of the community focused on developing MaxSAT solvers, we are solving a partial weighted MaxSAT instance wherein we mark all the clauses with weight $-\infty$ as hard, negate the weight of all the other clauses, and ask for a solution that optimizes the partial weighted MaxSAT formula. Knowledge of the inner workings of MaxSAT solvers and of the encoding of our representation into weighted MaxSAT is not required for this paper, and we defer the details to the release of the source code post-publication.
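The weight convention above can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: clauses are written as lists of DIMACS-style signed integers, hard clauses are marked with weight `float("-inf")`, and the hypothetical function `brute_force_maxsat` enumerates all assignments, which is feasible only for a handful of variables and is not how a real MaxSAT solver works.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Clause = List[int]  # +v stands for variable v, -v for its negation

def satisfies(assignment: Dict[int, bool], clause: Clause) -> bool:
    """A clause is satisfied if at least one of its literals is true."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def weight(assignment: Dict[int, bool],
           weighted_clauses: List[Tuple[Clause, float]]) -> float:
    """Sum of the weights of the clauses that the assignment does NOT satisfy."""
    return sum(w for clause, w in weighted_clauses
               if not satisfies(assignment, clause))

def brute_force_maxsat(num_vars: int,
                       weighted_clauses: List[Tuple[Clause, float]]
                       ) -> Tuple[Optional[Dict[int, bool]], float]:
    """Return an assignment of maximum weight (None if every one violates a hard clause)."""
    best, best_w = None, float("-inf")
    for values in product([False, True], repeat=num_vars):
        assignment = dict(zip(range(1, num_vars + 1), values))
        w = weight(assignment, weighted_clauses)
        if w > best_w:
            best, best_w = assignment, w
    return best, best_w

# Hypothetical instance: hard clause (x1 OR x2); soft clauses (NOT x1) and (NOT x2)
clauses = [([1, 2], float("-inf")), ([-1], -1.0), ([-2], -3.0)]
best, best_w = brute_force_maxsat(2, clauses)
print(best, best_w)
```

Because all soft weights are negative, maximizing the weight amounts to violating the cheapest set of soft clauses while keeping every hard clause satisfied.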
3 MLIC: MaxSAT-based Learning of Interpretable Classifiers
We now discuss the primary technical contribution of this paper, MLIC: MaxSAT-based Learning of Interpretable Classifiers. We first describe a metric for the interpretability of CNF rules. Since our formulation employs binary features, we discuss how non-binary features, such as categorical and continuous features, can be represented as binary features. We then formulate the problem of learning interpretable classification rules as a MaxSAT query and provide a proof of its theoretical soundness with respect to controlling the sparsity of the rules. As discussed in Section 5, prior work does not provide a sound procedure for controlling sparsity and accuracy. We then discuss the representational power of our CNF framework; in particular, we demonstrate that the proposed framework generalizes to handle complex objective functions and rules in forms other than CNF.
3.1 Balancing Accuracy and Interpretability
While in general interpretability may be hard to define precisely, in the context of decision rules an effective proxy is simply the count of clauses or literals used in the rule. Rules involving few clauses with few literals are natural for humans to evaluate and understand, while complex rules involving hundreds of clauses will not be interpretable even if the individual clauses are. In addition to interpretability, such sparsity also controls model complexity and gives a handle on the generalization error. (The framework proposed in this paper allows generalization to other forms of rules, as we discuss in Section 3.5.)
First, suppose that there exists a rule $R$ that perfectly classifies all the examples, i.e., $\forall i,\ y_i = R(\mathbf{X}_i)$. Among all possible functions that satisfy this, we would like to find the most interpretable (sparse) one:
$$\min_{R} |R| \quad \text{such that} \quad R(\mathbf{X}_i) = y_i,\ \forall i$$
Since most ML datasets do not allow perfect classification, we introduce a penalty on classification errors. We balance the two terms by a parameter $\lambda$, where large $\lambda$ gives more accurate but more complex rules, and smaller $\lambda$ gives smaller rules at the cost of reduced accuracy. Let $\mathcal{E}_R$ be the set of examples on which our classifier $R$ makes an error; then our objective is (cost-sensitive classification is defined analogously by allowing a separate parameter for false positives and false negatives):
$$\min_{R} |R| + \lambda |\mathcal{E}_R| \quad \text{such that} \quad \forall i \notin \mathcal{E}_R,\ y_i = R(\mathbf{X}_i) \tag{1}$$
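The trade-off between rule size and classification error can be made concrete with a short Python sketch. The data, the two candidate rules, and the function names `evaluate_rule` and `objective` are hypothetical; rules are assumed to be lists of 0/1 selector vectors over Boolean features, and the third feature is taken to be the complement of the second.

```python
from typing import List, Tuple

def evaluate_rule(x: List[int], rule: List[List[int]]) -> int:
    """CNF rule: every clause must select at least one feature that is 1 in x."""
    return int(all(any(a and b for a, b in zip(x, clause)) for clause in rule))

def objective(rule: List[List[int]], X: List[List[int]], y: List[int],
              lam: float) -> Tuple[float, List[int]]:
    """Return (rule size + lam * number of errors, indices of misclassified samples)."""
    errors = [i for i, (x, label) in enumerate(zip(X, y))
              if evaluate_rule(x, rule) != label]
    size = sum(sum(clause) for clause in rule)
    return size + lam * len(errors), errors

# Hypothetical data with three Boolean features
X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 0, 1, 1]
short_rule = [[1, 0, 0]]    # x1: one literal, misclassifies one sample
exact_rule = [[1, 0, 1]]    # x1 OR x3: two literals, no errors

for lam in (0.5, 5):
    print(lam, objective(short_rule, X, y, lam)[0], objective(exact_rule, X, y, lam)[0])
```

With the smaller penalty the one-literal rule has the lower objective despite its error, while the larger penalty favors the longer, error-free rule.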
3.2 Discretization of Features
In our MaxSAT-based formulation, we focus on learning rules based on Boolean variables. We also allow categorical and continuous features for our classifier, which are preprocessed before being presented to the MaxSAT formulation. To handle categorical features one may use the common 'one-hot' encoding, where a Boolean vector variable is introduced with cardinality equal to the number of categories. For example, a categorical feature with values 'red', 'green', 'blue' would be converted to three binary variables, which take values 100, 010, and 001 for the three categorical values.
For continuous features, we introduce discretization by comparing feature values to a collection of thresholds. The thresholds may be chosen, for example, based on quantiles of the feature distribution or, alternatively, on a uniform partition of the range of feature values. Specifically, for a continuous feature $x_c$ we consider a number of thresholds $\{\tau_1, \ldots, \tau_t\}$ and define two separate Boolean features $\mathbb{I}[x_c \geq \tau_i]$ and $\mathbb{I}[x_c < \tau_i]$ for each $\tau_i$. The number of thresholds may vary by feature. Thus, each continuous feature is represented using a collection of $2t$ Boolean features, where $t$ is the number of thresholds.
In principle, one could use all the values occurring in the data as thresholds, and this would be equivalent to the original continuous features. In practice, however, such granularity is typically not necessary, and a handful of thresholds can be used, e.g., fixed-width age groups to discretize a continuous age variable. This typically leads to only a very minor (if any) loss in accuracy, and in fact improves the presentation and understanding of the rules for human users. In our experiments, we used 10 thresholds based on the quantiles of the feature distribution (10th, 20th, …, 100th percentile), unless the number of unique values of the feature was less than 10, in which case we kept all of them.
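The preprocessing described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, quantiles are computed with a simple nearest-rank rule, and the two Boolean features per threshold are assumed to be the complementary comparisons "value >= threshold" and "value < threshold".

```python
import math
from typing import List, Sequence

def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """One Boolean feature per category; exactly one of them is 1."""
    return [int(value == c) for c in categories]

def quantile_thresholds(values: Sequence[float], num_thresholds: int = 10) -> List[float]:
    """Thresholds at evenly spaced quantiles, or all distinct values if there are few."""
    distinct = sorted(set(values))
    if len(distinct) < num_thresholds:
        return distinct
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank quantile positions for 1/q, 2/q, ..., q/q of the data
    picks = [ordered[max(0, math.ceil(n * q / num_thresholds) - 1)]
             for q in range(1, num_thresholds + 1)]
    return sorted(set(picks))

def binarize(value: float, thresholds: Sequence[float]) -> List[int]:
    """Two Boolean features per threshold: (value >= t) and (value < t)."""
    features = []
    for t in thresholds:
        features.append(int(value >= t))
        features.append(int(value < t))
    return features

print(one_hot("green", ["red", "green", "blue"]))
print(quantile_thresholds(list(range(1, 101))))
print(binarize(5, [3, 7]))
```

Emitting both comparisons for each threshold lets a rule built only from positive occurrences of Boolean features still express upper and lower bounds on the original value.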
We note that we could easily define arbitrary other Boolean functions of continuous or categorical variables within our framework. For example, categorical variables with many possible values (e.g., states or countries) may be grouped into more interpretable coarser units (e.g., regions or continents). Such groupings are application specific and would typically require relevant domain knowledge. They could perhaps be learned from data, but this is outside the scope of the current paper.
3.3 Transformation to MaxSAT query
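We now describe our MaxSAT formulation for learning interpretable rules. MLIC takes four inputs: (i) a (0,1)-matrix $\mathbf{X}$ of dimension $n \times m$ describing the values of all $m$ features for $n$ samples, with $\mathbf{X}_i$ corresponding to the feature vector for sample $i$; (ii) a (0,1)-vector $\mathbf{y}$ containing the class label $y_i$ for sample $i$; (iii) $k$, the desired number of clauses in the CNF rule; and (iv) the regularization parameter $\lambda$. MLIC then constructs a MaxSAT query and invokes a MaxSAT solver to compute the underlying rule $R$, as we now describe.
The key idea of MLIC is to define a MaxSAT query over $k \times m$ propositional variables, denoted by $\{b_1^1, b_2^1, \ldots, b_m^1, \ldots, b_m^k\}$, such that every truth assignment $\sigma$ defines a $k$-clause CNF rule $R$, where feature $x_j$ appears in clause $C_i$ if $\sigma(b_j^i) = 1$. Corresponding to every sample $q$, we introduce a noise variable $\eta_q$ that is employed to distinguish whether the labeling for sample $q$ should be considered as noise or not. Let $\mathbf{B}_i = \{b_j^i \mid j \in [m]\}$.
The MaxSAT query constructed by MLIC consists of the following three sets of constraints:

$V_j^i := (\neg b_j^i)$; $W(V_j^i) = -1$

$N_q := (\neg \eta_q)$; $W(N_q) = -\lambda$

$D_q := (\neg \eta_q \rightarrow (y_q \leftrightarrow \bigwedge_{i=1}^{k} (\mathbf{X}_q \circ \mathbf{B}_i)))$; $W(D_q) = -\infty$

Please refer to Section 2 for the interpretation of the $\circ$ operator. Finally, the set of constraints $Q_k$ constructed by MLIC is defined as follows:
$$Q_k := \bigwedge_{i=1}^{k} \bigwedge_{j=1}^{m} V_j^i \;\wedge\; \bigwedge_{q=1}^{n} N_q \;\wedge\; \bigwedge_{q=1}^{n} D_q \tag{2}$$
Note that the elements of $\mathbf{X}$ and $\mathbf{y}$ are not variables but constants whose values (0 or 1) are provided as inputs. Therefore, the set of variables for $Q_k$ is $\{b_j^i\} \cup \{\eta_q\}$. We now explain the intuition behind the design of $Q_k$.
We assign a weight of $-\lambda$ to every $N_q$ as we would like to satisfy as many $N_q$, i.e., falsify as many $\eta_q$, as possible. Similarly, we assign a weight of $-1$ to every clause $V_j^i$ as we are, again, interested in sparse solutions (i.e., ideally, we would prefer as many $V_j^i$ to be satisfied as possible). Every clause $D_q$ can be read as follows: if $\eta_q$ is assigned to false, i.e., sample $q$ is not considered as noise, then $y_q = R(\mathbf{X}_q)$. As noted in Section 2, the equivalent representation of the MaxSAT query described above for MaxSAT solvers involves the usage of hard clauses.
Next, we extract $R$ from the solution of $Q_k$ as follows.
Construction 3.1
Let $\sigma^* = \mathsf{MaxSAT}(Q_k, W)$; then feature $x_j$ appears in clause $C_i$ of $R$ iff $\sigma^*(b_j^i) = 1$.
Before proceeding further, it is important to discuss CNF encodings for the above sets of constraints. The constraints arising from $V_j^i$ and $N_q$ are unit clauses and do not require further processing. Furthermore, note that $y_q$ is already known and is a constant. Therefore, when $y_q$ is 1, the constraint $D_q$ can be directly encoded as CNF by using the equivalence $(a \rightarrow b) \equiv (\neg a \vee b)$. Finally, when $y_q$ is 0, we use the Tseitin encoding wherein we introduce an auxiliary variable $z_q^i$ corresponding to each clause $C_i$. Formally, we replace $D_q$ with $\bigwedge_{i=0}^{k} D_q^i$, where $D_q^0 := (\neg \eta_q \rightarrow \bigvee_{i=1}^{k} z_q^i)$, and $D_q^i := (z_q^i \rightarrow \neg(\mathbf{X}_q \circ \mathbf{B}_i))$ for $i \geq 1$. Furthermore, $W(D_q^0) = W(D_q^i) = -\infty$. The following lemma establishes the theoretical soundness of the parameter $\lambda$.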
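The construction of the weighted clause set can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `build_query`, the variable numbering, the use of `float("-inf")` to mark hard clauses, and the exact form of the auxiliary-variable clauses for label-0 samples are assumptions. Clauses are lists of DIMACS-style signed integers paired with a weight.

```python
from typing import List, Tuple

HARD = float("-inf")  # assumed marker for hard clauses

def build_query(X: List[List[int]], y: List[int], k: int, lam: float
                ) -> Tuple[List[Tuple[List[int], float]], int]:
    """Return (weighted clauses, number of variables) for a k-clause CNF rule."""
    n, m = len(X), len(X[0])

    def b(i: int, j: int) -> int:   # variable: feature j is selected in clause i
        return i * m + j + 1

    def eta(q: int) -> int:         # variable: sample q is treated as noise
        return k * m + q + 1

    next_aux = k * m + n + 1        # auxiliary variables are numbered after the rest
    clauses: List[Tuple[List[int], float]] = []

    for i in range(k):              # sparsity: prefer not selecting a feature
        for j in range(m):
            clauses.append(([-b(i, j)], -1.0))
    for q in range(n):              # accuracy: prefer not marking a sample as noise
        clauses.append(([-eta(q)], -float(lam)))

    for q in range(n):              # consistency: non-noisy samples must be classified correctly
        active = [j for j in range(m) if X[q][j] == 1]
        if y[q] == 1:
            # every clause of the rule must contain a feature that is 1 in the sample
            for i in range(k):
                clauses.append(([eta(q)] + [b(i, j) for j in active], HARD))
        else:
            # some clause must contain no feature that is 1 in the sample
            aux = []
            for i in range(k):
                z = next_aux
                next_aux += 1
                aux.append(z)
                for j in active:
                    clauses.append(([-z, -b(i, j)], HARD))
            clauses.append(([eta(q)] + aux, HARD))
    return clauses, next_aux - 1

clauses, num_vars = build_query([[1, 0], [0, 1]], [1, 0], k=1, lam=2)
for clause, w in clauses:
    print(clause, w)
```

The resulting list could be handed to any partial weighted MaxSAT solver after negating the soft weights and marking the infinite-weight clauses as hard, as described in Section 2.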
Lemma 1
For all , if and , then and .
Proof
First, note that construction of depends only on and . Furthermore, the parameter influences only the associated weight function. We denote weight functions corresponding to and as and respectively. Furthermore, let and . If , the lemma trivially holds. We now complete proof by contradiction argument for the case when .
Let . As , we have . Since , where is extracted from as stated above. Therefore, we have . But we also have , which implies that . Since , we have contradiction. Therefore, it must be the case that .
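The sparsity-control role of the regularization parameter can be sanity-checked on a tiny instance. The sketch below is an illustration, not a proof: it replaces the MaxSAT solver with exhaustive search over all k-clause rules, and the data and the function names `evaluate_rule` and `best_rule` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

def evaluate_rule(x: List[int], rule: List[List[int]]) -> int:
    """CNF rule: every clause must select at least one feature that is 1 in x."""
    return int(all(any(a and b for a, b in zip(x, clause)) for clause in rule))

def best_rule(X: List[List[int]], y: List[int], k: int, lam: float
              ) -> Tuple[int, int, List[List[int]]]:
    """Exhaustively minimize (rule size + lam * errors); return (size, errors, rule)."""
    m = len(X[0])
    best = None
    for bits in product([0, 1], repeat=k * m):
        rule = [list(bits[i * m:(i + 1) * m]) for i in range(k)]
        size = sum(bits)
        errors = sum(evaluate_rule(x, rule) != label for x, label in zip(X, y))
        cost = size + lam * errors
        if best is None or cost < best[0]:
            best = (cost, size, errors, rule)
    return best[1], best[2], best[3]

# Hypothetical data with three Boolean features
X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 0, 1, 1]
for lam in (0.3, 0.75, 5):
    size, errors, rule = best_rule(X, y, k=1, lam=lam)
    print(lam, size, errors, rule)
```

On this instance, increasing the penalty on errors never decreases the size of the optimal rule and never increases its number of errors.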
3.4 Illustrative Example
We illustrate our encoding with the help of a toy example. Let and and . Then we have following clauses:
; ;
; ; ;
; ; ;
;
3.5 Beyond CNF Rules
While CNF formulas are general enough to express every Boolean formula, the length of the representation may not be of polynomial size. Therefore, one might wonder whether we can extend MLIC to learn rules in other canonical forms as well. In fact, early CSP-based approaches to rule learning focused on rules in DNF form. We now show that, with a minor change, we are able to learn rules expressible in DNF. Suppose that we are interested in learning a rule $S$ that is expressible in DNF, such that $y = S(\mathbf{x})$, where $S$ is a DNF formula. We note that $(y = S(\mathbf{x})) \leftrightarrow (\neg y = \neg S(\mathbf{x}))$. And if $S$ is a DNF formula, then $\neg S$ is a CNF formula. Therefore, to learn the rule $S$, we simply call MLIC with $\neg \mathbf{y}$ as input and negate the learned rule.
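The negation step can be sketched in Python as follows. The representation is an assumption made for illustration: a literal is a pair (feature index, polarity), a CNF rule is a list of clauses, and a DNF rule is a list of terms. The parameter `learn_cnf` is a hypothetical stand-in for any CNF learner; it is not an API of the paper's implementation.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Literal = Tuple[int, int]  # (feature index, required value 0 or 1)

def eval_cnf(x: Sequence[int], cnf: List[List[Literal]]) -> bool:
    """Every clause needs at least one literal that matches x."""
    return all(any(x[j] == pol for j, pol in clause) for clause in cnf)

def eval_dnf(x: Sequence[int], dnf: List[List[Literal]]) -> bool:
    """At least one term needs all of its literals to match x."""
    return any(all(x[j] == pol for j, pol in term) for term in dnf)

def negate_cnf(cnf: List[List[Literal]]) -> List[List[Literal]]:
    """De Morgan: NOT (AND of ORs) becomes an OR of ANDs over flipped literals."""
    return [[(j, 1 - pol) for j, pol in clause] for clause in cnf]

def learn_dnf(X: List[List[int]], y: List[int],
              learn_cnf: Callable[[List[List[int]], List[int]], List[List[Literal]]]
              ) -> List[List[Literal]]:
    """Learn a CNF rule for the complemented labels, then negate it to obtain a DNF rule."""
    flipped = [1 - label for label in y]
    return negate_cnf(learn_cnf(X, flipped))

# Check the negation on a hypothetical CNF rule: (x1 OR x3) AND (x2)
cnf = [[(0, 1), (2, 1)], [(1, 1)]]
dnf = negate_cnf(cnf)
equivalent = all(eval_dnf(x, dnf) == (not eval_cnf(x, cnf))
                 for x in product([0, 1], repeat=3))
print(dnf, equivalent)
```

Note that the negated rule uses complemented literals; with the discretization of Section 3.2, both a threshold comparison and its complement are available as features.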
3.6 Complex Objective Functions
We now discuss how MLIC can be easily extended to handle complex objective functions. The objective function for MLIC, as defined in Equation 1, treats all features equally. In some cases, the user might prefer rules that contain certain features. Such an extension is fairly easy to achieve, as we need only change the weight function for the sparsity clauses, i.e., the unit clauses that penalize each selected feature. Furthermore, in certain cases, one might want to minimize the total number of different features across different clauses rather than the total number of terms. Such an extension is also fairly easy to handle, by replacing the per-clause sparsity constraints with constraints defined per feature. It is worth noting that the proposed modifications impact only the MaxSAT query and do not require any modifications to the underlying MaxSAT solver. We believe that this is a key strength of MLIC, as it separates modeling and solving completely.
4 Evaluation
To evaluate the performance of MLIC, we implemented a prototype in Python that employs MaxHS [15] to handle MaxSAT instances. We also experimented with LMHS [3], another state-of-the-art MaxSAT solver; MaxHS outperformed LMHS on our benchmarks (a detailed evaluation among different MaxSAT solvers is beyond the scope of this work and left for future work). We conducted an extensive set of experiments on diverse publicly available benchmarks, seeking to answer the following questions (the source code of MLIC and the benchmarks can be viewed at https://github.com/meelgroup/mlic):

Do advancements in MaxSAT solving enable MLIC to be run with datasets involving tens of thousands of variables with thousands of binary features?

How does the accuracy of MLIC compare to that of state-of-the-art but typically non-interpretable classifiers?

How does the accuracy of MLIC vary with the size of the training set?

How does the accuracy of MLIC vary with $\lambda$?

How does the size of the rules learnt by MLIC vary with $\lambda$?
In summary, our experiments demonstrate that MLIC can handle datasets involving tens of thousands of variables with thousands of binary features. Furthermore, MLIC can generate rules that are not only interpretable but also have accuracy comparable to that of other competitive classifiers, which often produce hard-to-interpret rules/models. We demonstrate that MLIC is able to achieve sufficiently high accuracy with very few samples.
Table 1: Test accuracy of each classifier, with average training time in seconds in parentheses.
Dataset | Size | # Features | RIPPER | Log Reg | NN | RF | SVC | MLIC
TomsHardware | 28170 | 830 | 0.968 (92.8) | 0.976 (0.2) | 0.977 (3.4) | 0.976 (64.9) | Timeout | 0.969 (2000)
Twitter | 49990 | 1050 | 0.938 (187.3) | 0.963 (0.2) | 0.965 (6.8) | 0.962 (250.9) | 0.962 (1010.0) | 0.958 (2000)
adultdata | 32560 | 262 | 0.852 (0.5) | 0.801 (0.3) | 0.866 (3.0) | 0.844 (41.8) | Timeout | 0.755 (2000)
creditcard | 30000 | 334 | 0.811 (0.7) | 0.781 (0.1) | 0.822 (3.9) | 0.82 (25.5) | Timeout | 0.82 (2000)
ionosphere | 350 | 564 | 0.886 (0.1) | 0.909 (0.1) | 0.926 (1.2) | 0.909 (1.3) | 0.886 (0.1) | 0.889 (15.04)
PIMA | 760 | 134 | 0.774 (0.1) | 0.749 (0.1) | 0.764 (1.3) | 0.761 (1.3) | 0.77 (21.4) | 0.736 (2000)
parkinsons | 190 | 392 | 0.868 (0.1) | 0.884 (0.1) | 0.921 (1.2) | 0.895 (1.1) | 0.879 (1.6) | 0.895 (245)
Trans | 740 | 64 | 0.78 (0.0) | 0.759 (0.0) | 0.788 (1.2) | 0.788 (1.2) | 0.765 (372.3) | 0.797 (1177)
WDBC | 560 | 540 | 0.961 (0.1) | 0.936 (0.0) | 0.961 (1.3) | 0.943 (1.4) | 0.955 (3.0) | 0.946 (911)
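Table 2: Median size of the rules returned by RIPPER and MLIC.
Dataset | Size | # Features | RIPPER | MLIC
TomsHardware | 28170 | 830 | 57.5 | 4
Twitter | 49990 | 1050 | 78.5 | 15
adultdata | 32560 | 262 | 74.5 | 51.5
creditcard | 30000 | 334 | 7.5 | 4
ionosphere | 350 | 564 | 3 | 5.5
PIMA | 760 | 134 | 5 | 9
parkinsons | 190 | 392 | 6.5 | 6
Trans | 740 | 64 | 6 | 4
WDBC | 560 | 540 | 7.5 | 3.5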
4.1 Experimental Methodology
We conducted extensive experiments on publicly available data sets obtained from the UCI repository [6]. The data sets involve both real-valued and categorical features. Specifically, the datasets are: buzz events from two different social networks (Twitter and Tom's Hardware), Adult Data (adult_data), Credit Approval Data Set (credit_data), Ionosphere (Ionos), Pima Indians Diabetes (PIMA), Parkinsons, connectionist bench sonar (Sonar), blood transfusion service center (Trans), and breast cancer Wisconsin diagnostic (WDBC).
To compare the accuracy of MLIC, we considered a variety of popular classifiers: penalized logistic regression (LogReg), nearest-neighbors classifier (NN), and the black-box random forests (RF) and support vector classification (SVC).
We perform 10-fold cross-validation to assess accuracy on a validation set. We compute the mean across the 10 folds for each choice of a regularization (or complexity control) parameter for each technique (baselines and MLIC), and report the best cross-validation accuracy. The number of parameter values is comparable (about 10) for each technique. For RF and RIPPER we control complexity via a cutoff on the number of examples in a leaf node. For SVC and LogReg we discretize the regularization parameter on a logarithmic grid. In the case of MLIC we have 2 choices of $\lambda$ and number of clauses, and the type of rule as {CNF, DNF}. We set the training time cutoff for each classifier (on each fold) to 2000 seconds. Note that some classifiers can be much faster than others, but in this paper we focus on the best tradeoff of accuracy vs. interpretability in mission-critical settings, and the training time (which can be offline) is secondary, as long as it is realistic. In this context, note that the testing time for each of these techniques is less than 0.01 seconds for a given set of labels.
4.2 Illustrative Example
We illustrate the interpretable rules computed by MLIC on the iris data set, which is a simple benchmark widely used by the machine learning community to illustrate new classification techniques. We consider the binary problem of classifying iris versicolor from the other two species, setosa and virginica. Using the four features, sepal length, sepal width, petal length, and petal width, we learn the following rule: $R$ :=

(sepal length 6.3 sepal width 3.0 petal width 1.5 )

( sepal width 2.7 petal length 4.0 petal width 1.2 )

( petal length 5.0)
Let us pause to understand how to apply the above rule. The rule implies that when all three clauses are satisfied, the flower is classified as iris versicolor; otherwise, it is classified as not versicolor. The size of the above rule is $|R| = 7$.
4.3 Results
Table 1 presents the results of comparing MLIC vis-à-vis typical non-interpretable classifiers. The first three columns list the name, size (number of samples), and number of binary features for each dataset. The next five columns present the test accuracy of the classifiers RIPPER, Logistic Regression (Log Reg), Nearest Neighbor (NN), Random Forest (RF), and SVC. The final column contains the median test accuracy for MLIC. For every cell in the last six columns, the first value represents the accuracy, while the value in parentheses represents the average training time. We draw the following two conclusions from the table. First, MLIC is able to handle datasets with tens of thousands of examples with hundreds of features. The scalability of MLIC demonstrates the potential presented by the remarkable progress in SAT solving. Recent research efforts have often used the hardness of the problem to justify the usage of heuristics, but our experience with MLIC shows that SAT solving is able to solve many large-scale problems directly. Note that when MLIC times out, it is able to provide the best solution found so far. In this context, it is worth noting that for some of the benchmarks, even state-of-the-art classifiers such as SVC time out. Second, MLIC is often able to achieve accuracy that is sufficiently close to the accuracy achieved by typical non-interpretable classifiers, while producing easy-to-state rules that often have just a few literals.
To demonstrate MLIC's ability to compute easy-to-state rules in comparison to state-of-the-art classifiers such as RIPPER, we computed the size of the rules returned by RIPPER and MLIC. Table 2 presents the results of this comparison. The first three columns list the name, size (number of samples), and number of binary features for each dataset. The next two columns state the median size of the rules returned by RIPPER and MLIC. The size of a rule is computed as the number of terms involved in the rule. First, note that except for two cases where RIPPER produced marginally shorter rules than MLIC, MLIC produces significantly shorter rules, and RIPPER's rules can sometimes be an order of magnitude larger than those produced by MLIC. For example, for Toms hardware, the rule produced by RIPPER has 57 terms compared to just 4 literals for MLIC. Note that MLIC also has better accuracy than RIPPER on this dataset. One might wonder if the rule learned by RIPPER could have been simply transformed into a sparser rule; it is not the case here. Furthermore, it is worth noting that RIPPER does not provide a sound handle to tune rule size; therefore, the user is left to try out combinations of input parameters without any guarantee of improving the interpretability of the generated rules, which we experienced in this case. An in-depth study of the failure of RIPPER to generate sparser rules than MLIC is beyond the scope of this work.
To measure the accuracy of MLIC w.r.t. the size of the training data, we consider test errors when only a fraction of the training data is available (we vary it from % to % in steps of %). Due to lack of space, we present results for only one benchmark, WDBC, for and 5 and in Figure 1. We plot the median training and test accuracy of MLIC over 10 trials, which is also known as a learning curve in the machine learning literature. The y-axis represents the error as the ratio of incorrect predictions to total examples, while the x-axis represents the size of the training set. The plot shows how training and test error vary for and 5. Note that MLIC is able to achieve sufficiently high test accuracy with just 40% of the complete dataset. We observe similar behavior for other benchmarks as well.
Figures 2 and 3 illustrate how training accuracy and rule size vary with $\lambda$ for one representative benchmark, parkinsons. CNF1, CNF2, DNF1, and DNF2 refer to invocations of MLIC with (rule type, k) set to (CNF, 1), (CNF, 2), (DNF, 1), and (DNF, 2) respectively. For each of the plots, the x-axis refers to the value of $\lambda$, while the y-axis represents rule size (i.e., $|R|$) and accuracy for Figure 3 and Figure 2 respectively. First, note that for both CNF and DNF, the accuracy of the rules is generally higher for larger k. Significantly, the plots clearly demonstrate the monotonicity of rule size and accuracy with respect to $\lambda$. In contrast, the state-of-the-art interpretable classifier RIPPER can produce rules that are an order of magnitude larger than those produced by MLIC (see Table 2), and it provides no sound handle for tuning rule size.
5 Related Work
There is a long history of learning interpretable classification models from data, including popular approaches such as decision trees [5, 29], decision lists [30], and classification rules [10]. While the form of such classifiers is highly amenable to human interpretation, unfortunately, most of the objective functions that arise for these problems are intractable combinatorial optimization problems. Hence, most popular existing approaches rely on various greedy heuristics, pruning, and ad-hoc local criteria such as maximizing information gain, coverage, etc. For example, various popular decision rule approaches, such as C4.5 rules [29], CN2 [9], RIPPER [10], and SLIPPER [11], all make different tradeoffs in how they use these heuristic criteria for growing and pruning the rules.
Recent advances in large-scale optimization and scalable Bayesian inference gave rise to state-of-the-art black-box models. However, many of the same advances can also be used in the context of interpretable machine learning models. Such recent proposals include Bayesian approaches [23, 35], constraint programming [2], integer programming approaches to learn decision trees [4], and a quadratic programming relaxation with a variance-penalized margin objective [31]. Greedy approaches are used with a principled objective function in ENDER [17] and set covering machines [25]. A hierarchical kernel learning approach is proposed in [22], and optimization is used in [21] to combine basic Boolean clauses obtained from decision trees. Linear programming relaxations based on a Boolean compressed sensing formulation have been used to learn sparse interpretable rules and checklists in [24, 18] (note, however, that the objective functions for the integer program and the LP relaxation in these papers are not the same as the sparsity-penalized cost-sensitive classification error). Prior work has considered applications of constraint programming to learning Bayesian networks [2] and itemset mining [16, 28]. In contrast, we focus on learning sparse interpretable classification rules allowing control of accuracy vs. interpretability.
6 Extensions
In the paper, we have focused on decision rules in DNF or CNF form, which are among the most interpretable classification methods available. We now describe a few related classification formulations, which are also amenable to being learned from data using a SAT-based framework. A simple AND-clause can be considered as a requirement that all of the N literals in the clause are satisfied, while a simple OR-clause requires that at least 1 of the N literals is satisfied. A useful generalization is a "K-of-N" clause [13], which is true when at least K of the N literals are satisfied. In particular, it leads to a very popular decision rubric called checklists or scorecards, widely used in medicine and finance, where a questionnaire asks some questions (e.g., about risk factors), and the total number of positive answers is compared to a predetermined threshold. LP relaxations have been considered for learning scorecards from data [19], and our MaxSAT-based framework can be directly extended. In the case of multi-class classification, a decision rule may be ambiguous, as it does not specify which label to use when several contradictory clauses pointing to different labels are satisfied simultaneously. Decision lists [30] enforce an order of evaluation of the rules, resolving this ambiguity. Bayesian frameworks for learning decision lists have been considered recently [23]. Perhaps the most well-known interpretable classification scheme is a decision tree, where literals are arranged as nodes in a binary tree, and a decision is made by following the path from the root node to one of the leaves. A decision tree can be converted to an equivalent set of classification rules corresponding to all the paths from the root to the leaves, which is a more expensive representation. On the other hand, certain small decision rules can lead to very complex decision trees; for example, the "K-of-N" rule cannot be efficiently encoded using a decision tree.
Recent work has considered combinatorial optimization to learn compact interpretable decision trees [4]. Beyond simple Boolean expressions, a variety of weighted classification methods can be used, for example, a weighted linear combination of simple AND clauses, obtained for instance by using boosting on a set of classifiers based on simple logical clauses. In future work, we plan to extend our MaxSAT-based framework to all these related interpretable classification approaches.
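The relationship between AND-clauses, OR-clauses, and K-of-N clauses discussed in this section can be written down directly. The following Python sketch is illustrative only; the function name `k_of_n` and the checklist data are hypothetical.

```python
from typing import List, Sequence

def k_of_n(x: Sequence[int], selected: List[int], k: int) -> int:
    """Return 1 iff at least k of the selected Boolean features are 1 in x."""
    return int(sum(x[j] for j in selected) >= k)

def and_clause(x: Sequence[int], selected: List[int]) -> int:
    """An AND-clause is the special case K = N."""
    return k_of_n(x, selected, len(selected))

def or_clause(x: Sequence[int], selected: List[int]) -> int:
    """An OR-clause is the special case K = 1."""
    return k_of_n(x, selected, 1)

# Hypothetical checklist: four yes/no risk factors, flagged when at least 3 are present
answers = [1, 0, 1, 1]
factors = [0, 1, 2, 3]
print(k_of_n(answers, factors, 3))
print(and_clause(answers, factors), or_clause(answers, factors))
```

A scorecard of this kind is a single threshold on a count of positive answers, which is why it stays easy to read even when N is large.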
7 Conclusion
We proposed a new approach to learning interpretable classification rules via reduction to maximum satisfiability (MaxSAT). Due to the impressive advances in MaxSAT solving, our formulation can find optimal or near-optimal rules balancing accuracy and interpretability (sparsity) for large datasets involving tens or hundreds of thousands of data points, and hundreds or thousands of features. Furthermore, the approach separates the modeling from the optimization, and this framework could be used to solve a wide variety of interpretable classification formulations, including decision lists, decision trees, and decision rules with different cost functions (including group sparsity, sharing of the variables, and prior knowledge on variable importance). Finally, we demonstrate experimentally that for many classification problems interpretability does not have to come at a high cost in terms of accuracy.
Furthermore, we hope to share our excitement about applications of constraint programming/MaxSAT in machine learning, and to encourage researchers in both the interpretable classification and the CSP/SAT communities to consider this topic further: both in developing new SAT-based formulations for interpretable ML, and in designing bespoke solvers attuned to the problem of interpretable ML.
Acknowledgements
This work was supported in part by NUS ODPRT Grant, R252000685133 and an IBM PhD Fellowship. The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).
References
 [1] Andrews, R., Diederich, J., Tickle, A.: Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems 8(6), 373–389 (1995)
 [2] van Beek, P., Hoffmann, H.F.: Machine learning of Bayesian networks using constraint programming. In: Proc. of CP. pp. 429–445 (2015)
 [3] Berg, J., Saikko, P., Järvisalo, M.: Improving the effectiveness of SAT-based preprocessing for MaxSAT. In: Proc. of IJCAI (2015)
 [4] Bertsimas, D., Chang, A., Rudin, C.: An integer optimization approach to associative classification. In: Adv. Neur. Inf. Process. Syst. 25, pp. 269–277 (2012)
 [5] Bessiere, C., Hebrard, E., O’Sullivan, B.: Minimising decision tree size as combinatorial optimisation. In: Proc. of CP. pp. 173–187. Springer (2009)
 [6] Blake, C., Merz, C.J.: UCI repository of machine learning databases (1998)
 [7] Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering 12(2), 292–306 (2000)
 [8] Breiman, L., Friedman, J., Stone, C., Olshen, R.: Classification and regression trees. CRC press (1984)
 [9] Clark, P., Niblett, T.: The CN2 induction algorithm. Mach. Learn. 3(4), 261–283 (Mar 1989)
 [10] Cohen, W.W.: Fast effective rule induction. In: Proc. Int. Conf. Mach. Learn. pp. 115–123. Tahoe City, CA (Jul 1995)
 [11] Cohen, W.W., Singer, Y.: A simple, fast, and effective rule learner. In: Proc. Nat. Conf. Artif. Intell. pp. 335–342. Orlando, FL (Jul 1999)
 [12] Craven, M.W., Shavlik, J.W.: Extracting tree-structured representations of trained networks. Proc. of NIPS pp. 24–30 (1996)
 [13] Craven, M.W., Shavlik, J.W.: Extracting tree-structured representations of trained networks. Proc. of NIPS pp. 24–30 (1996)
 [14] Davies, J.: Solving MAXSAT by Decoupling Optimization and Satisfaction. Ph.D. thesis, University of Toronto (2013)
 [15] Davies, J., Bacchus, F.: Solving MaxSAT by solving a sequence of simpler SAT instances. In: Proc. of CP. pp. 225–239 (2011)
 [16] De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for itemset mining. In: Proc. of KDD. pp. 204–212 (2008)
 [17] Dembczyński, K., Kotłowski, W., Słowiński, R.: Ender: a statistical framework for boosting decision rules. Data Mining and Knowledge Discovery 21(1), 52–90 (2010)
 [18] Emad, A., Varshney, K.R., Malioutov, D.M.: A semiquantitative group testing approach for learning interpretable clinical prediction rules. In: Proc. Signal Process. Adapt. Sparse Struct. Repr. Workshop, Cambridge, UK (2015)
 [19] Emad, A., Varshney, K.R., Malioutov, D.M.: A semiquantitative group testing approach for learning interpretable clinical prediction rules. In: Proc. Signal Process. Adapt. Sparse Struct. Repr. Workshop, Cambridge, UK (2015)
 [20] Freitas, A.: Comprehensible classification models: a position paper. ACM SIGKDD explorations newsletter 15(1), 1–10 (2014)
 [21] Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. The Annals of Applied Statistics pp. 916–954 (2008)
 [22] Jawanpuria, P., Jagarlapudi, S.N., Ramakrishnan, G.: Efficient rule ensemble learning using hierarchical kernels. In: Proc. of ICML (2011)
 [23] Letham, B., Rudin, C., McCormick, T.H., Madigan, D.: Building interpretable classifiers with rules using Bayesian analysis. Tech. Rep. 609, Dept. Stat., Univ. Washington (Dec 2012)
 [24] Malioutov, D.M., Varshney, K.R.: Exact rule learning via Boolean compressed sensing. In: Proc. of ICML. pp. 765–773 (2013)
 [25] Marchand, M., Shawe-Taylor, J.: The set covering machine. Journal of Machine Learning Research 3(Dec), 723–746 (2002)
 [26] Morgado, A., Heras, F., Liffiton, M., Planes, J., Marques-Silva, J.: Iterative and core-guided MaxSAT solving: A survey and assessment. Constraints 18(4), 478–534 (Oct 2013)
 [27] Narodytska, N., Bacchus, F.: Maximum satisfiability using core-guided MaxSAT resolution. In: AAAI. pp. 2717–2723 (2014)
 [28] Nijssen, S., Guns, T., De Raedt, L.: Correlated itemset mining in ROC space: a constraint programming approach. In: KDD. pp. 647–656. ACM (2009)
 [29] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
 [30] Rivest, R.L.: Learning decision lists. Mach. Learn. 2(3), 229–246 (Nov 1987)
 [31] Rückert, U., Kramer, S.: Margin-based first-order rule learning. Mach. Learn. 70(2–3), 189–206 (Mar 2008)
 [32] Valiant, L.G.: Learning disjunctions of conjunctions. In: Proc. Int. Joint Conf. Artif. Intell. pp. 560–566. Los Angeles, CA (Aug 1985)
 [33] Varshney, K.R.: Data science of the people, for the people, by the people: A viewpoint on an emerging dichotomy. In: Proc. Data for Good Exchange Conf (2015)
 [34] Wang, T., Rudin, C., Doshi-Velez, F., Liu, Y., Klampfl, E., MacNeille, P.: Or’s of And’s for interpretable classification, with application to context-aware recommender systems. arXiv preprint arXiv:1504.07614 (2015)
 [35] Wang, T., Rudin, C., Liu, Y., Klampfl, E., MacNeille, P.: Bayesian Or’s of And’s for interpretable classification with application to context-aware recommender systems (2015)