1 Introduction
Predictive models are used in an increasingly high-stakes set of applications, from bail decisions in the criminal justice system [2, 23] to treatment recommendations in personalized medicine [4]. As the stakes have risen, so has the negative impact of incorrect predictions, which could be due to a poorly trained model or to undetected confounding patterns within the data itself [31].
As machine learning models influence a growing fraction of everyday life, individuals often want to understand the reasons for the decisions that affect them. Many governments now recognize a “right to explanation” for significant decisions, for instance as part of the European Union’s General Data Protection Regulation [19]. However, state-of-the-art machine learning methods such as random forests and neural networks are black boxes: their complex structure makes it difficult for humans, including domain experts, to understand their predictive behavior [8, 17].

1.1 Interpretable Machine Learning
According to Leo Breiman [9], machine learning has two objectives: prediction, i.e., determining the value of the target variable for new inputs, and information, i.e., understanding the natural relationship between the input features and the target variable. Studies have shown that many decision makers exhibit an inherent distrust of automated predictive models, even if they are proven to be more accurate than human forecasters [12]. One way to overcome “algorithm aversion” is to give decision makers agency to modify the model’s predictions [13]. Another is to provide them with understanding.
Many studies in machine learning seek to train more interpretable models in lieu of complex black boxes. Decision trees [7, 3] are considered interpretable for their discrete structure and graphical visualization, as are close relatives including rule lists [27, 36], decision sets [25], and case-based reasoning [21]. Other approaches include generalized additive models [30], i.e., linear combinations of single-feature models, and score-based methods [35], where integer point values for each feature can be summed up into a final “score”. In the case of linear models, interpretability often comes down to sparsity (a small number of nonzero coefficients), a topic of extensive study over the past twenty years [20]. Sparse regression models can be trained using heuristics such as LASSO, stagewise regression or least-angle regression [34, 33, 15], or scalable mixed-integer approaches [5, 6].

Many practitioners are hesitant to give up the high accuracy of black box models in the name of interpretability, and prefer to construct ex post
explanations for a model’s predictions. Some approaches create a separate explanation for each prediction in the dataset, e.g., by approximating the nonlinear decision boundary of a neural network with a hyperplane [32]. Others define metrics of feature importance to quantify the effect of each feature in the overall model [17, 11]. Finally, some approaches seek to approximate a large, complex model such as a neural network or a random forest with a simpler one – a decision tree [1], a two-level rule list [26], or a smaller neural network [10]. Such global explanations can help human experts detect systemic biases or confounding variables. However, even if these approximations are almost as accurate as the original model, they may have very different behavior on some inputs and can thus provide a misleading assessment of the model’s behavior [18].
The interpretability of linear models and the resulting tradeoff with predictive accuracy are of significant interest to the machine learning community [16]. However, a major challenge in this line of research is that the very concept of interpretability is hard to define and even harder to quantify [29]. Many definitions of interpretability have a “know it when you see it” aspect which impedes quantitative analysis: though some aspects of interpretability are easy to measure, others may be difficult to evaluate without human input [14].
1.2 Contributions
We introduce the framework of interpretable paths, in which models are decomposed into simple building blocks. An interpretable path is a sequence of models of increasing complexity which can represent a sequential process of “reading” or “explaining” a model. Using examples of several machine learning model classes, we show that the framework of interpretable paths is relevant and intuitively captures properties associated with interpretability.
To formalize which paths are more interpretable, we introduce path interpretability metrics. We define coherence conditions that such metrics should satisfy, and derive a parametric family of coherent metrics.
This study of interpretable paths naturally leads to a family of model interpretability metrics. The proposed metrics generalize a number of proxies for interpretability from the literature, such as sparsity in linear models and number of splits for decision trees, and also encompass other desirable characteristics.
The model interpretability metrics can be used to select models that are both accurate and interpretable. To this end, we formulate the optimization problem of computing models that are on the Pareto front of interpretability and predictive accuracy (price of interpretability). We give examples in various settings, and discuss computational challenges.
We study an in-depth application to linear models on real and synthetic datasets. We discuss both the modeling aspect (the choice of the interpretability metric) and the computational aspect, for which we propose exact mixed-integer formulations and scalable local improvement heuristics.
2 A Sequential View of Model Construction
2.1 Selecting a Model
Most machine learning problems can be viewed through the lens of optimization. Given a set of models $\mathcal{M}$, each model $m \in \mathcal{M}$ is associated with a cost $c(m)$, typically derived from data, representing the performance of the model on the task at hand (potentially including a regularization term). Training a machine learning model means choosing the appropriate $m$ from $\mathcal{M}$ (for example the one that minimizes $c(m)$). To make this perspective more concrete, we will use the following examples throughout the paper.
Linear models.
Given the feature matrix $X \in \mathbb{R}^{n \times d}$ of a dataset of size $n$ with feature space in $\mathbb{R}^d$, and the corresponding vector of labels $y \in \mathbb{R}^n$, a linear model corresponds to a set of linear coefficients $\beta \in \mathbb{R}^d$. In this example, $\mathcal{M} = \mathbb{R}^d$, and the cost $c(\beta)$ depends on the application: for ordinary least squares (OLS), $c(\beta) = \frac{1}{n} \lVert y - X\beta \rVert_2^2$ (mean squared error).

Classification trees (CART).
In this case, each model corresponds to a binary decision tree structure [7], so $\mathcal{M}$ is the set of all possible tree structures of any size. Given a tree $T$ and an input $x$, let $T(x)$ designate the tree’s estimate of the corresponding label. Then a typical performance metric $c(T)$ is the number of misclassified points: if we have a dataset with points $x_1, \dots, x_n$ associated with classification labels $y_1, \dots, y_n$, then $c(T) = \sum_{i=1}^{n} \mathbb{1}\{T(x_i) \neq y_i\}$.

Clustering.
We consider the k-means clustering problem for a dataset of $n$ points in dimension $d$. Our model space $\mathcal{M}$ is the set of all partitions of the dataset, each element of a partition representing a cluster. To evaluate a partition $(C_1, \dots, C_k)$, we can use the within-cluster sum of squares $c(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $C_j$.

2.2 Interpretable Steps
For our guiding examples, typical proxies for interpretability include sparsity in linear models [5], a small number of nodes in a classification tree [3], or a small number of clusters.
As we try to rationalize why these are good proxies for interpretability, one possible approach is to consider how humans read and explain these models. For example, a linear model is typically introduced coefficient by coefficient, a tree is typically read node by node from the root to the leaves, and clusters are typically examined one by one. During this process, we build a model that is more and more complex. In other words, the human process of understanding a model can be viewed as a decomposition into simple building blocks.
We introduce the notion of an interpretable step to formalize this sequential process. We define a step neighborhood function $N$ that associates each model $m \in \mathcal{M}$ with a set of models $N(m) \subseteq \mathcal{M}$, such that $m'$ is one interpretable step away from $m$ if and only if $m' \in N(m)$. Interpretable steps represent simple model updates that can be chained to build increasingly complex models.
For linear models, one possible interpretable step is modifying a single coefficient, i.e., $\beta'$ belongs to $N(\beta)$ if $\beta$ and $\beta'$ differ in at most one coefficient. For CART, an interpretable step could be adding a split to an existing tree, i.e., $T' \in N(T)$ if $T'$ can be obtained by splitting a leaf node of $T$ into two leaves. For clustering, we could use the structure of hierarchical clustering and choose a step that increases the number of clusters by one by splitting an existing cluster into two. These examples are illustrated in Figure 1.

Choosing the step neighborhood function $N$ is a modeling choice, and for the examples considered there may be many other ways to define it. To simplify the analysis, we only impose that $N(m) \neq \emptyset$ for all $m$ (there must always be a feasible next step from any model), which can trivially be satisfied by ensuring $m \in N(m)$ (an interpretable step can involve no changes to the model).
Given the choice of an interpretable step $N$, we can define an interpretable path of length $K$ as a sequence of models $p = (m_0, m_1, \dots, m_K)$ such that $m_k \in N(m_{k-1})$ for all $1 \le k \le K$, i.e., a sequence of interpretable steps starting from a base model $m_0$. The choice of $m_0$, the “simplest” model, is usually obvious: in our examples, $m_0$ could be a linear model with $\beta = 0$, an empty classification tree, or a single cluster containing all data points. Given the model space $\mathcal{M}$, we call $\mathcal{P}_K$ the set of all interpretable paths of length $K$ and $\mathcal{P} = \cup_K \mathcal{P}_K$ the set of all interpretable paths of any length.
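To make the definitions concrete, the step and path conditions can be checked mechanically. A minimal sketch (the function names are ours, purely for illustration) for the linear-model step, where consecutive coefficient vectors differ in at most one coordinate:

```python
def is_step(beta, beta_next):
    """One interpretable step: coefficient vectors differ in at most one coordinate."""
    return sum(a != b for a, b in zip(beta, beta_next)) <= 1

def is_interpretable_path(path):
    """Check that a sequence of models starts at the base model (all zeros)
    and moves by one interpretable step at a time."""
    base = [0.0] * len(path[0])
    return path[0] == base and all(
        is_step(m, m_next) for m, m_next in zip(path, path[1:])
    )

# A sparse linear model built coefficient by coefficient.
path = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.5, -0.2, 0.0]]
```

Each call to `is_step` corresponds to one application of the neighborhood $N$; swapping in a different predicate changes the notion of interpretability without changing the framework.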
Let us consider an example with classification trees to build intuition about interpretable paths. The iris dataset is a small dataset often used to illustrate classification problems. It records the petal length and width and sepal length and width of various iris flowers, along with their species (setosa, versicolor and virginica). For simplicity, we only consider two of the four features (petal length and width) and subsample 50 total points from two of the three classes (versicolor and virginica).
We define an interpretable step as splitting one leaf node into two. Given the iris dataset, we consider two classification trees $T^1$ and $T^2$. Both trees have a depth of 2, exactly 3 splits, and a misclassification cost of 2. However, when we consider interpretable paths leading to these two trees, we notice some differences. An interpretable path leading to $T^1$ is shown in Figure 2, and an interpretable path leading to $T^2$ is detailed in Figure 3.
For $T^1$, the first split results in an intermediate tree with a high classification error, which could be less intuitive for flower connoisseurs. In contrast, the first split of $T^2$ gives a much more accurate intermediate tree. We will introduce a way to formally identify which of the two paths is more interpretable.
3 The Tradeoffs of Interpretability
In this section, we consider the choice of an interpretability loss $\ell(p)$, defined for all interpretable paths $p \in \mathcal{P}$, such that a path $p$ is considered more interpretable than a path $p'$ if and only if $\ell(p) < \ell(p')$. We first motivate this formalism by showing it can lead to a notion of interpretability loss for models as well. We then use a simple example to build intuition about the choice of $\ell$.
3.1 From paths to models
Defining a loss function for the interpretability of a path can naturally lead to an interpretability loss on the space of models, with the simple idea that more interpretable paths should lead to more interpretable models. Given a path interpretability loss $\ell$, we can define a corresponding model interpretability loss as

(1)  $\ell(m) = \min_{p \in \mathcal{P}(m)} \ell(p),$

where $\mathcal{P}_K(m)$ designates the set of interpretable paths of length $K$ leading to $m$, and $\mathcal{P}(m) = \cup_K \mathcal{P}_K(m)$ designates the set of finite interpretable paths leading to $m$. In other words, the interpretability loss of a model is the interpretability loss of the most interpretable path leading to it.
As an example, consider the following path interpretability loss, which we call path complexity and define as $\ell^0(p) = K$ (the number of steps in the path). Under this metric, paths are considered less interpretable if they are longer. From (1) we can then define the interpretability loss of a given model $m$ as

$\ell^0(m) = \min \{ K : \mathcal{P}_K(m) \neq \emptyset \},$

which corresponds to the minimal number of interpretable steps required to reach $m$.
In the context of the examples from Section 2, the function $\ell^0$ recovers typical interpretability proxies. For a linear model $\beta$, $\ell^0(\beta)$ is the sparsity of the model (number of nonzero coefficients). For a classification tree $T$, $\ell^0(T)$ is the number of splits. In a clustering context, $\ell^0$ is just the number of clusters. We refer to this candidate loss function as the model complexity.
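For the linear case this correspondence is easy to verify in code. A small sketch (our own illustration): starting from the zero model and changing one coefficient per step, the minimal number of steps needed to reach a coefficient vector is exactly its number of nonzero coefficients.

```python
def model_complexity(beta):
    """Minimal number of one-coefficient steps from the zero model to beta,
    i.e., the sparsity (number of nonzero coefficients) of the model."""
    return sum(1 for b in beta if b != 0.0)

# Two nonzero coefficients: the model can be reached in two steps, no fewer.
complexity = model_complexity([0.0, 1.5, 0.0, -2.0])
```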
A fundamental problem of interpretable machine learning is finding the highest-performing model at a given level of interpretability [9]. Defining an interpretability loss function $\ell$ on the space of models is important because it allows us to formulate this problem generally as follows:

(2)  $\min_{m \in \mathcal{M}} c(m) \quad \text{s.t.} \quad \ell(m) \le \bar{\ell},$

where $\bar{\ell}$ is the desired level of interpretability. Problem (2) produces models on the Pareto front of accuracy and interpretability. If we compute this Pareto front by solving problem (2) for every value of $\bar{\ell}$, then we can mathematically characterize the price of our definition of interpretability on our dataset given a class of models, making the choice of a final model easier.

In the case of model complexity, for $\ell = \ell^0$ problem (2) can be written as

(3)  $\min_{m \in \mathcal{M}} c(m) \quad \text{s.t.} \quad \ell^0(m) \le K.$

Problem (3) generalizes existing problems in interpretable machine learning: best subset selection (constrained sparse regression) for linear models [5], finding the best classification tree of a given size [3], or the k-means problem of finding the best possible clusters.
Thus, the framework of interpretable paths naturally gives rise to a general definition of model complexity via the loss function $\ell^0$, and our model generalizes many existing approaches. By some counts, however, model complexity remains an incomplete interpretability loss. For instance, it does not differentiate between the trees $T^1$ and $T^2$: both models have a complexity of 3 because they can be reached in three steps. More generally, $\ell^0$ does not differentiate between paths of the same length, or between models that can be reached by paths of the same length.
3.2 Incrementality
In the decision tree example from Figures 2 and 3, we observed that the intermediate trees leading to $T^2$ were more accurate than the intermediate trees leading to $T^1$. Evaluating the costs of intermediate models along a path may provide clues as to the interpretability of the final model.
Consider the following toy example, where the goal is to estimate a child’s age given height $h$ and weight $w$. The normalized features $h$ and $w$ are correlated with each other, and both are positively correlated with the objective. Solving the OLS problem yields the optimal coefficients $\beta^*$, i.e.,

(4)  $\text{age} = \beta^*_h h + \beta^*_w w + \epsilon,$

with $\epsilon$ the error term. This model attains the minimal mean squared error (MSE).
As in Section 2, we define an interpretable step to be modifying a single coefficient in the linear model, keeping all other coefficients constant. In this case, consider the three interpretable paths $p^1$, $p^2$ and $p^3$ in Table 1. When using the complexity loss $\ell^0$, the first two paths in the table are considered equally interpretable because they have the same length. But are they? Both reach the same final model, yet the costs of their intermediate models differ substantially. Indeed, the intermediate model of $p^2$ is a particularly inaccurate model, as it assigns weight a negative coefficient even though weight is positively correlated with age. And furthermore, if having an accurate first step matters to the user, then path $p^3$ may be preferred even though it is longer.
(Table 1: three interpretable paths $p^1$, $p^2$, $p^3$ for the toy example, with the cost of each intermediate model.)
As discussed in Section 2, an interpretable path leading to a model $m$ can be viewed as a decomposition of $m$ into a sequence of easily understandable steps. The costs of intermediate models should play a role in quantifying the interpretability loss of a path; higher costs should be penalized, as we want to avoid nonsensical intermediate models such as the intermediate model of $p^2$.
One way to ensure that every step of an interpretable path adds value is a greedy approach, where the next model at each step is chosen by minimizing the cost $c$:

(5)  $m_{k+1} \in \arg\min_{m \in N(m_k)} c(m).$
In our toy example, restricting ourselves to paths of length 2, this means selecting the best possible $m_1$, and then the best possible $m_2$ given $m_1$, as in stagewise regression [33]. This will not yield the best possible model achievable in two steps as in (3), but the first step is guaranteed to be the best one possible. The greedy path has a more accurate first model, but a less accurate final model: the improvement of the first model comes at the expense of the second step.
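The greedy rule (5) admits a direct implementation for OLS, since the best update along a single coordinate has a closed form. A rough sketch with NumPy (our own code, following the stagewise idea rather than any specific algorithm from the paper):

```python
import numpy as np

def greedy_step(X, y, beta):
    """One greedy interpretable step: change the single coefficient that most
    reduces the mean squared error (closed-form optimum per coordinate)."""
    r = y - X @ beta                      # current residual
    best_j, best_delta, best_mse = None, 0.0, float(np.mean(r ** 2))
    for j in range(X.shape[1]):
        col = X[:, j]
        delta = col @ r / (col @ col)     # optimal update for coordinate j
        mse = float(np.mean((r - delta * col) ** 2))
        if mse < best_mse:
            best_j, best_delta, best_mse = j, delta, mse
    if best_j is not None:
        beta = beta.copy()
        beta[best_j] += best_delta
    return beta

def greedy_path(X, y, n_steps):
    """Chain greedy steps starting from the zero model."""
    path = [np.zeros(X.shape[1])]
    for _ in range(n_steps):
        path.append(greedy_step(X, y, path[-1]))
    return path
```

Each iterate changes at most one coefficient, so the resulting sequence is an interpretable path in the sense of Section 2, and the MSE is nonincreasing along it.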
Deciding which of the two paths is more interpretable is a hard question. It highlights the tradeoff between the desirable incrementality of the greedy approach and the cost of the final model. For paths of length 2, there is a continuum of paths between the greedy path and the cost-optimal path, corresponding to the Pareto front between the cost of the first model and the cost of the final model, shown in Figure 4.
4 Coherent Interpretability Losses
In the previous section, we developed intuition regarding the interpretability of different paths. We now formalize this intuition in order to define a suitable interpretability loss.
4.1 Coherent Path Interpretability Losses
According to the loss $\ell^0$ defined in Section 3.1, which generalizes many notions of interpretability from the literature, a path is more interpretable if it is shorter. In Section 3.2, we saw that the cost of individual models along the path matters as well.
Sometimes, comparing the costs of intermediate models between two paths is easy, because the cost of each step along one path is at least as good as the cost of the corresponding step in the other path. In Table 1, it is reasonable to consider $p^1$ more interpretable than $p^2$, because every step of $p^1$ costs no more than the corresponding step of $p^2$. In contrast, comparing the interpretability of $p^1$ and $p^3$ is more difficult and user-specific, because $p^3$ has more accurate intermediate models but requires an additional step.
We now formalize this intuition into desirable properties of interpretability loss functions. We first introduce the notion of a cost sequence, which provides a concise way to refer to the costs of all the steps in an interpretable path. We then propose axioms for coherent interpretability losses.
Definition 1 (Cost sequence).
Given an interpretable path $p = (m_0, m_1, \dots, m_K)$ of length $K$, the cost sequence of $p$ is the infinite sequence $C = (C_k)_{k \ge 1}$ such that $C_k = c(m_k)$ for $1 \le k \le K$, and $C_k = 0$ for $k > K$.
Definition 2 (Coherent Interpretability Loss).
A path interpretability loss $\ell$ is coherent if the following conditions hold for any two interpretable paths $p, p'$ with respective cost sequences $C$ and $C'$.

(a) If $C = C'$, then $\ell(p) = \ell(p')$.

(b) (Weak Pareto dominance) If $C_k \le C'_k$ for all $k$ (which we write as $C \preceq C'$), then $\ell(p) \le \ell(p')$.
Condition (a) means that the interpretability of a path depends only on the sequence of costs along that path. Condition (b) formalizes the intuition described before, that paths with fewer steps or better steps are more interpretable. For instance, if we improve the cost of one step of a path while leaving all other steps unchanged, we can only make the path more interpretable. Under any coherent interpretability loss, in Table 1, $p^1$ is more interpretable than $p^2$, but $p^1$ may be more or less interpretable than $p^3$ depending on the specific choice of coherent interpretability loss.
In addition, consider a path $p$ of length $K$ and remove its last step to obtain a new path $p'$. This is equivalent to setting the $K$-th element of the cost sequence to zero. Since costs are nonnegative, we have $C' \preceq C$, which implies $\ell(p') \le \ell(p)$. In other words, under a coherent interpretability loss, removing a step from an interpretable path can only make the path more interpretable.
Remark.
The path complexity $\ell^0$ is a coherent path interpretability loss.
Proof.
If $C$ and $C'$ verify $C = C'$, then trivially the two cost sequences become zero after the same number of steps, so $\ell^0(p) = \ell^0(p')$. If $C \preceq C'$ and $C'$ becomes zero after exactly $K$ steps, then $C$ must become zero after at most $K$ steps, so $\ell^0(p) \le \ell^0(p')$. ∎
4.2 A Coherent Model Interpretability Loss
Axiom (b) of Definition 2 states that a path that dominates another path in terms of the costs of each step must be at least as interpretable. This notion of weak Pareto dominance suggests a natural path interpretability loss:

$\ell_\alpha(p) = \sum_{k=1}^{\infty} \alpha_k C_k, \quad \alpha_k \ge 0.$

In other words, the interpretability loss of a path is a weighted sum of the costs of all steps in the path. This loss function is trivially coherent and extremely general. It is parametrized by the infinite sequence of nonnegative weights $(\alpha_k)_{k \ge 1}$, which specifies the relative importance of the accuracy of each step in the model for the particular application at hand.
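Evaluating this loss for a concrete path is a one-liner once the cost sequence is in hand. A minimal sketch (helper names are ours); the infinite sum is truncated at the weight horizon, which is exact whenever the horizon covers the path, since the cost sequence is zero beyond the final step:

```python
def cost_sequence(path, cost, horizon):
    """C_k = cost of the k-th model for steps k = 1..K, and 0 for k > K."""
    costs = [cost(m) for m in path[1:]]   # skip the base model m_0
    return costs + [0.0] * max(0, horizon - len(costs))

def path_loss(path, cost, alpha):
    """Weighted sum of step costs: sum_k alpha_k * C_k."""
    return sum(a * c for a, c in zip(alpha, cost_sequence(path, cost, len(alpha))))

# Toy usage: models are strings, cost is just the string length.
loss = path_loss(["m0", "m1x", "final"], len, [1.0, 0.5, 0.25])
```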
Defining a family of interpretability losses with infinitely many parameters allows for significant modeling flexibility, but it is also cumbersome and overly general. We therefore propose to select $\alpha_k = \gamma^k$ for all $k$, replacing the infinite sequence of parameters with a single parameter $\gamma > 0$. In this case, following (1), we propose the following coherent interpretability loss function on the space of models.
Definition 3 (Model interpretability).
Given a model $m$, its interpretability loss is given by

(6)  $\ell_\gamma(m) = \min_{p \in \mathcal{P}(m)} \sum_{k=1}^{\infty} \gamma^k C_k(p).$
By definition, $\ell_\gamma$ is a coherent interpretability loss, which favors more incremental models or models with a low complexity. The parameter $\gamma$ captures the tradeoff between these two aspects of interpretability. Theorem 1 shows that with a particular choice of $\gamma$ one can recover the notion of model complexity introduced in Section 3.1, or models that can be built in a greedy way.
Theorem 1 (Consistency of interpretability measure).
Assume that the cost $c$ is bounded. We consider the two limit cases $\gamma \to \infty$ and $\gamma \to 0$:

(a) Let $m, m'$ be models with $\ell^0(m) < \ell^0(m')$ (i.e., $m$ requires fewer interpretable steps than $m'$), or with $\ell^0(m) = \ell^0(m')$ and $c(m) < c(m')$. Then

(7)  $\ell_\gamma(m) < \ell_\gamma(m')$ for $\gamma$ large enough.

(b) Given paths $p, p'$, if $C(p) <_{\mathrm{lex}} C(p')$, where $<_{\mathrm{lex}}$ represents the lexicographic order on cost sequences, then

(8)  $\ell_\gamma(p) < \ell_\gamma(p')$ for $\gamma$ small enough.

Consequently, given models $m, m'$, if there is a path leading to $m$ whose cost sequence is lexicographically smaller than that of every path leading to $m'$, then

(9)  $\ell_\gamma(m) < \ell_\gamma(m')$ for $\gamma$ small enough.
Intuitively, in the limit $\gamma \to \infty$, part (a) states that the most interpretable models are the ones with minimal complexity, or minimal costs if their complexity is the same. Part (b) states that in the limit $\gamma \to 0$ the most interpretable models can be constructed with greedy steps. Definition 3 therefore generalizes existing approaches and provides a good framework to model the tradeoffs of interpretability.
5 Interpretability Losses in Practice
Defining an interpretability loss brings a new perspective to the literature on interpretability in machine learning. In this section, we discuss the applications of this framework. For the sake of generality, in the early part of this section we work with the more general interpretability loss $\ell_\alpha$, defined by an arbitrary nonnegative weight sequence $(\alpha_k)$.
5.1 The Price of Interpretability
Given the metric of interpretability defined above, we can quantitatively discuss the price of interpretability, i.e., the tradeoff between a model’s interpretability loss $\ell_\alpha(m)$ and its cost $c(m)$. To evaluate this tradeoff, we want to compute models that are Pareto optimal with respect to $\ell_\alpha$ and $c$, as in (2).
Computing these Pareto-optimal solutions can be challenging, as our definition of model interpretability requires optimizing over paths of any length. Fortunately, the only optimization problem we need to be able to solve is finding the most interpretable path of a fixed length $K$, i.e.,

(10)  $\min_{p \in \mathcal{P}_K} \; \sum_{k=1}^{K} \alpha_k \, c(m_k).$
Indeed, the following proposition shows that we can compute Pareto-optimal solutions by solving a sequence of optimization problems of the form (10) for various lengths and weights.
Proposition 1 (Price of interpretability).
Pareto-optimal models that minimize the interpretability loss $\ell_\alpha$ and the cost $c$ can be computed by solving the following optimization problem:

(11)  $\min_{K \in \mathbb{N}} \; \min_{p \in \mathcal{P}_K} \; \sum_{k=1}^{K} \alpha_k \, c(m_k) + \lambda \, c(m_K),$

where $\lambda \ge 0$ is a tradeoff parameter between cost and interpretability.

The (simple) proof of the proposition is provided in the appendix. Notice that the inner minimization problem in (11) is simply problem (10) with the modified coefficients $\tilde{\alpha}_k = \alpha_k$ for $k < K$ and $\tilde{\alpha}_K = \alpha_K + \lambda$.
By defining the general framework of interpretable paths and a natural family of coherent interpretability loss functions, we can understand exactly how much we gain or lose in terms of accuracy when we choose a more or less interpretable model. Our framework thus provides a principled way to answer a central question of the growing literature on interpretability in machine learning.
Readers will notice that the weighted sum of the objectives optimized in Proposition 1 does not necessarily recover the entire Pareto front, and in particular cannot recover any nonconvex parts [22].
Using Proposition 1, we can compute the price of interpretability for a range of models and interpretability losses. As an example, Figure 5 shows all Pareto-optimal models with respect to performance cost and interpretability for our toy problem from Section 3.2, for a fixed value of $\gamma$. Figure 6 shows the Pareto front in the same setting for other values of $\gamma$. We notice that, as in Theorem 1, when $\gamma$ grows large our notion of interpretability reduces to sparsity (discrete Pareto curve), whereas when $\gamma$ grows small our notion of interpretability favors a larger number of incremental steps.
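On small instances, the scalarization in Proposition 1 can be explored by brute force over an enumerated set of candidate paths. A toy sketch (the path set and costs below are entirely hypothetical, chosen only to show how the tradeoff parameter moves the optimum between an incremental path and a path with a better final model):

```python
def scalarized_best(paths, cost, gamma, lam):
    """Minimize  sum_k gamma**k * c(m_k)  +  lam * c(m_K)  over candidate paths."""
    def objective(p):
        steps = p[1:]  # m_1, ..., m_K (m_0 is the base model)
        interp = sum(gamma ** (k + 1) * cost(m) for k, m in enumerate(steps))
        return interp + lam * cost(steps[-1])
    return min(paths, key=objective)

# Hypothetical costs: path A reaches a better final model via a poor first step;
# path B is more incremental but ends slightly worse.
costs = {"m0": 1.0, "a1": 0.9, "a2": 0.3, "b1": 0.5, "b2": 0.4}
path_a = ["m0", "a1", "a2"]
path_b = ["m0", "b1", "b2"]
```

With a small tradeoff parameter only the interpretability loss matters and the incremental path wins; as it grows, the cost of the final model dominates and the optimum switches.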
5.2 Computational Considerations
To solve (11) we consider a sequence of problems of type (10). However, this sequence is possibly infinite, which poses a computational problem. Proposition 2 provides a bound on the number of problems of type (10) we need to consider in the general case.
Proposition 2.
Assume there exist $\epsilon$ and $M$ such that $0 < \epsilon \le c(m) \le M$ for all $m \in \mathcal{M}$ (positive and bounded cost function), and consider the interpretability loss $\ell_\gamma$. If $K^*$ denotes the optimal path length in (11), then

(12)  $K^* \le \bar{K},$

where

(13)  $\bar{K}$ is a finite bound depending only on $\epsilon$, $M$, $\gamma$ and $\lambda$.

In other words, under the interpretability loss $\ell_\gamma$, we can find the optimal solution of (11) by solving at most $\bar{K}$ problems of type (10). The proof of Proposition 2 is provided in the appendix.
A corollary of Proposition 2 is that we can write an optimization formulation of problem (11) with a finite number of decision variables: we can formulate the inner minimization problem with finitely many decision variables for each $K \le \bar{K}$, and then solve finitely many such problems. The tractability of this optimization problem is application-dependent.
For example, by adapting the mixed-integer optimization formulation from Bertsimas and Dunn [3], we can compute the price of interpretability for decision trees of bounded depth by writing the following mixed-integer formulation of the inner minimization problem in (11):
(14a)  
s.t.  (14b)  
(14c)  
(14d)  
(14e)  
(14f) 
where, for each tree $k$ in the path, the variables $d^k$, $a^k$ and $b^k$ define trees of depth at most $D$, and constraints (14c)–(14f) impose an interpretable path structure on the trees. The set $\mathcal{T}_B$ indicates the set of branching nodes of the trees, the variable $d^k_t$ indicates whether branching node $t$ in tree $k$ is active, $a^k_t$ selects the variable along which to perform the split at branching node $t$ in tree $k$, and $b^k_t$ is the split value at branching node $t$ in tree $k$. The function $c$ is the objective value of the tree defined by these split variables, and the set $\mathcal{F}$ designates all the constraints that impose the tree structure for each $k$ (constraint (14b) is equivalent to (24) from [3]). Constraint (14c) imposes that tree $k$ must have exactly $k$ active splits, constraint (14e) forces each tree to keep all the branching nodes of the previous tree, and constraints (14d) and (14f) force the splits at these common branching nodes to be the same.
5.3 Interpretable Paths and Human-in-the-Loop Analytics
Motivated by the idea that humans read and explain models sequentially, we have used the framework of interpretable paths to evaluate the interpretability of individual models. Viewing an interpretable path as a nested sequence of models of increasing complexity can also be useful in the context of human-in-the-loop analytics.
Consider the problem of customer segmentation via clustering. Choosing the number of customer types $K$ is not always obvious in practice and has to be selected by a decision-maker. Solving the clustering problem with $K$ clusters and with $K+1$ clusters may lead to very different clusters. Alternatively, using interpretable steps, we can force a hierarchical structure on the clusters, i.e., the solution with $k+1$ clusters results from the splitting of one of the clusters of the solution with $k$ clusters, for all $k < K$. The change between $k$ clusters and $k+1$ clusters becomes simpler and may facilitate the choice of $K$.
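This nested structure is exactly the cluster-splitting step from Section 2. A rough sketch (our own code, using a crude median split along the highest-variance coordinate as a stand-in for a local 2-means) of building such a hierarchical sequence of partitions:

```python
import numpy as np

def wcss(cluster):
    """Within-cluster sum of squares around the centroid."""
    center = cluster.mean(axis=0)
    return float(((cluster - center) ** 2).sum())

def split_cluster(cluster):
    """Crude two-way split: separate points at the median of the
    highest-variance coordinate (a stand-in for a local 2-means)."""
    j = int(cluster.var(axis=0).argmax())
    mask = cluster[:, j] <= np.median(cluster[:, j])
    return cluster[mask], cluster[~mask]

def hierarchical_path(points, k_max):
    """Interpretable path: each step splits the highest-WCSS cluster in two."""
    partitions = [[points]]
    while len(partitions[-1]) < k_max:
        clusters = sorted(partitions[-1], key=wcss)
        worst = clusters.pop()            # cluster with the largest WCSS
        partitions.append(clusters + list(split_cluster(worst)))
    return partitions
```

Each partition in the returned path refines the previous one by splitting a single cluster, so moving from $k$ to $k+1$ clusters changes only one cluster from the decision-maker’s point of view.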
If we assume each number of clusters $K$ can be chosen with equal probability for $1 \le K \le \bar{K}$, the problem of finding the nested sequence of clusterings that minimizes the expected cost is:

(15)  $\min_{p \in \mathcal{P}_{\bar{K}}} \; \frac{1}{\bar{K}} \sum_{k=1}^{\bar{K}} c(m_k),$
which is exactly the decision problem (10) with the weights $\alpha_k = 1/\bar{K}$ for $k \le \bar{K}$, and $\alpha_k = 0$ otherwise. This problem is related to studies in incremental approximation algorithms [28] and prioritization [24], which are typically motivated by a notion of interpretability that simplifies implementation for practitioners.
More generally, we can use interpretable paths to facilitate human-in-the-loop model selection. Given a discrete distribution $(q_k)$ on the choice of the number of steps $k$, we can choose $\alpha_k = q_k$ and solve (10) to find paths that minimize the expected cost.
6 Application: Linear Regression
So far, we have presented a mathematical framework to formalize the discussion of interpretability. We now study in detail how it can be used in practice, focusing on the single application of linear regression.
6.1 Modeling interpretability
In the example of linear regression, we defined the following interpretable steps:

(16)  $N(\beta) = \{ \beta' \in \mathbb{R}^d : \beta \text{ and } \beta' \text{ differ in at most one coefficient} \}.$

These steps are a modeling choice. They lead to decompositions of linear models where the coefficients are introduced or modified one at a time, and we have seen that they are intimately linked to sparsity. We wish to obtain models that can easily be introduced coefficient by coefficient, allowing ourselves to modify coefficients that have already been set.
Choosing a different step function can lead to other notions of interpretability. For instance, each step could add one feature while allowing all the weights (not only one coordinate) to be modified. This boils down to ordering the features of a linear model, i.e., finding the most interpretable order. We could also impose integer coordinate updates at each step, which is related to the notion of interpretability introduced by score-based methods [35]. Another way to think about score-based methods is to choose steps that change a single coefficient by exactly one unit, which imposes that each step adds one point to the scoring system.
6.2 Algorithms
Optimal.
Problem (10) can be written as a convex integer optimization problem using special ordered sets of type 1 (SOS-1 constraints):

(17a)  $\min_{\beta^1, \dots, \beta^K} \; \sum_{k=1}^{K} \alpha_k \, c(\beta^k)$
(17b)  s.t. $\beta^k - \beta^{k-1}$ has at most one nonzero component for each step $k$ (encoded with SOS-1 constraints).

For reasonable problem sizes, and for any choice of weights $\alpha$, this problem can be solved exactly using a standard solver such as Gurobi or CPLEX.
Local improvement.
In higher-dimensional settings, or when $K$ is too large, the formulation above may no longer scale. Thus it is of interest to develop a fast heuristic for such instances.
A feasible solution to problem (17) can be written as a vector of indices $(i_1, \dots, i_K)$ and a vector of values $(v_1, \dots, v_K)$, such that for $1 \le k \le K$,

$\beta^k = \beta^{k-1} + v_k e_{i_k},$

where $e_i$ designates the $i$-th unit vector. The vector of indices encodes which regression coefficient is modified at each step in the interpretable path, while the sequence of values encodes the value of each modified regression coefficient. Thus problem (17) can be rewritten as

(18)  $\min_{i_1, \dots, i_K} \; \min_{v_1, \dots, v_K} \; \sum_{k=1}^{K} \alpha_k \, c\Big( \sum_{j=1}^{k} v_j e_{i_j} \Big),$

where the inner minimization problem is an “easy” convex quadratic optimization problem, while the outer minimization problem is a “hard” combinatorial optimization problem. We propose the following local improvement heuristic for the outer problem: given a first sequence of indices, we randomly sample one step $k$ in the interpretable path. Keeping $i_j$ constant for all $j \neq k$, we iterate through all possible values of $i_k$ and obtain candidate index vectors. For each candidate, we solve the inner minimization problem, and we keep the candidate with the lowest cost. The method is described in full detail as Algorithm 1, in the more general case where we sample not one but $b$ steps from the interpretable path.

In order to empirically evaluate the local improvement heuristic, we run it with different batch sizes $b$
on a small real dataset, with 100 rows and 6 features (after one-hot encoding of categorical features). The goal is to predict the perceived prestige (from a survey) of a job occupation given features about it, including education level, salary, etc.
Given this dataset, we first compute the optimal coordinate path of length $K$. We then test our local improvement heuristic on the same dataset. Given the small size of the problem, a provably global optimum of the complete formulation is found by Gurobi in about 5 seconds. To be useful, we would like our local improvement heuristic to find a good solution significantly faster. We show convergence results of the heuristic for different values of the batch size parameter $b$ in Table 2. For both batch sizes, the local improvement heuristic converges two orders of magnitude faster than Gurobi, and with the larger batch size the solution found is optimal.
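The heuristic can be sketched as follows (a simplified single-step variant of Algorithm 1, with helper names of our own; we exploit the fact that, for fixed indices, every intermediate model is linear in the update values, so the inner problem reduces to a single least-squares solve):

```python
import numpy as np

def inner_cost(X, y, idx, alpha):
    """Given the index sequence, optimize the update values exactly:
    beta^k is linear in the values v, so the weighted sum of MSEs
    is one least-squares problem over v."""
    n, d = X.shape
    K = len(idx)
    rows, targets = [], []
    for k in range(K):
        E = np.zeros((d, K))
        for j in range(k + 1):
            E[idx[j], j] = 1.0            # beta^k = E @ v uses updates 1..k
        w = np.sqrt(alpha[k] / n)
        rows.append(w * (X @ E))
        targets.append(w * y)
    A, b = np.vstack(rows), np.concatenate(targets)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.sum((A @ v - b) ** 2)), v

def local_improvement(X, y, idx, alpha, n_iters=20, seed=0):
    """Resample one step at a time and try every alternative index there."""
    rng = np.random.default_rng(seed)
    best, _ = inner_cost(X, y, idx, alpha)
    idx = list(idx)
    for _ in range(n_iters):
        k = int(rng.integers(len(idx)))
        for j in range(X.shape[1]):
            cand = idx[:k] + [j] + idx[k + 1:]
            c, _ = inner_cost(X, y, cand, alpha)
            if c < best:
                best, idx = c, cand
    return idx, best
```

Because each sweep only re-solves the inner problem for candidate index vectors, the objective is nonincreasing across iterations, mirroring the convergence behavior reported in Table 2.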
6.2.1 Results
We now explore the results of the presented approach on a dataset of test scores in California from 1998–1999. Each data point represents a school, and the variable of interest is the average standardized test score of students from that school. All features are continuous, and a full list is presented in Table 8(b). Both the features and the target variable are centered and rescaled to have unit variance.
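This preprocessing step can be written out explicitly (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def standardize(X, y):
    """Center each feature and the target, and rescale to unit variance."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_std = (y - y.mean()) / y.std()
    return X_std, y_std
```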
In our example, we assume that a regression model to predict the average test score is already available: it was trained using only the percentage of students qualifying for a reduced-price lunch. This model has an MSE of 0.122 (compared to an optimal MSE of 0.095). We would like to update this model in an interpretable way given the availability of all features in the dataset. This corresponds to the problem of constructing an interpretable path, as before, with the simple modification that the path no longer starts from the null regression model but from an arbitrary starting model (in particular, the one we have been provided).
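The gap between the provided single-feature model and the best attainable model can in principle be reproduced with ordinary least squares. This is only a sketch: `lunch_col` is a hypothetical column index, and the MSE values quoted above come from the actual dataset.

```python
import numpy as np

def ols_mse(X, y):
    """Fit least squares and return the in-sample mean squared error."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return float(np.mean(residuals ** 2))

# mse_start = ols_mse(X[:, [lunch_col]], y)  # single-feature starting model
# mse_best  = ols_mse(X, y)                  # unrestricted optimal model
```

Since the single-feature model is a restriction of the full model, its in-sample MSE can never be lower.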
The first thing we can do is explore the price of interpretability in this setting. We use the method presented in Section 5.1 to compute Pareto-efficient interpretable models. The resulting price curve is shown in Figure 8.
Method            Time (s)    Gap (%)
Exact
Local imp. ()
Local imp. ()
Given this price curve, we choose a path of four steps because it yields an accurate final model while avoiding diminishing interpretability returns. This yields the new model (and associated interpretable path) shown in Figure 8(b). The new model can be obtained from the old one in just four steps: first we add the district average income with a positive coefficient, then we correct the coefficient for reduced-price lunch students to account for this new feature, and finally we add the percentage of English learners and the school's per-student spending. The final model has an MSE of 0.097, which is near-optimal. When we compare this path to other methods (see Figure 8(c)), we see that our interpretable formulation finds a good tradeoff between a greedy, “every step must improve” formulation and a formulation that simply sets the coefficients to their final values one by one.
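The four-step update can be replayed as a sequence of single-coordinate edits, tracking the MSE after each step. This is an illustrative sketch: the feature indices and coefficient values passed to `replay_path` are placeholders, not the fitted values from Figure 8(b).

```python
import numpy as np

def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))

def replay_path(X, y, beta_start, steps):
    """Apply one coefficient edit per step and record the resulting MSE.

    steps : list of (feature_index, new_coefficient) pairs, one per step
    """
    beta = np.array(beta_start, dtype=float)
    history = []
    for j, value in steps:
        beta[j] = value          # a single coordinate changes at each step
        history.append(mse(X, y, beta))
    return history
```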

7 Conclusions
In this paper, we have presented a simple optimization-based framework to model the interpretability of machine learning models. Our framework provides a new way to think about what interpretability means to users in different applications, and to quantify how this meaning affects the tradeoff with predictive accuracy. The framework is general; each application brings its own modeling and optimization challenges, which are opportunities for further research.
Acknowledgements
Research funded in part by ONR grant N000141812122.
References
 [1] Hamsa Bastani, Osbert Bastani, and Carolyn Kim. Interpreting Predictive Models for Human-in-the-Loop Analytics. arXiv preprint arXiv:1705.08504, pages 1–45, 2018.
 [2] Richard Berk. An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. Journal of Experimental Criminology, 13(2):193–216, 2017.
 [3] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
 [4] Dimitris Bertsimas, Nathan Kallus, Alexander M. Weinstein, and Ying Daisy Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210–217, 2017.
 [5] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. Annals of Statistics, 44(2):813–852, 2016.

 [6] Dimitris Bertsimas and Bart Van Parys. Sparse High-Dimensional Regression: Exact Scalable Algorithms and Phase Transitions. Annals of Statistics, to appear, 2019.
 [7] Leo Breiman. Classification and regression trees. New York: Routledge, 1984.
 [8] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
 [9] Leo Breiman. Statistical modeling: The two cultures. Statistical science, 16(3):199–231, 2001.
 [10] Cristian Bucilǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06), page 535, New York, NY, USA, 2006. ACM Press.
 [11] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In 2016 IEEE Symposium on Security and Privacy, 2016.
 [12] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1):114, 2015.
 [13] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them. Management Science, 64(3):1155–1170, 2018.
 [14] Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608, pages 1–13, 2017.
 [15] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least Angle Regression. Annals of Statistics, 32(2):407–499, 2004.
 [16] Alex A. Freitas. Comprehensible classification models. ACM SIGKDD Explorations Newsletter, 15(1):1–10, 2014.
 [17] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics New York, NY, USA:, 2001.
 [18] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining Explanations : An Approach to Evaluating Interpretability of Machine Learning. arXiv preprint arXiv:1806.00069, 2018.
 [19] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a “right to explanation”. pages 1–9, 2016.
 [20] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.
 [21] Been Kim, Cynthia Rudin, and Julie Shah. The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification. In Neural Information Processing Systems (NIPS), 2014.
 [22] I. Y. Kim and O. L. De Weck. Adaptive weighted-sum method for bi-objective optimization: Pareto front generation. Structural and Multidisciplinary Optimization, 29(2):149–158, 2005.
 [23] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2017.
 [24] Ali Koç and David P. Morton. Prioritization via Stochastic Optimization. Management Science, 61(3):586–603, 2014.
 [25] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: a joint framework for description and prediction. KDD ’16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1:1675–1684, 2016.
 [26] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Interpretable & Explorable Approximations of Black Box Models. FAT/ML, jul 2017.

 [27] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 9(3):1350–1371, 2015.
 [28] Guolong Lin and David Williamson. A general approach for incremental approximation and hierarchical clustering. SIAM Journal on Computing, 39(8):3633–3669, 2010.
 [29] Zachary C. Lipton. The Mythos of Model Interpretability. arXiv preprint arXiv:1606.03490, 2016.
 [30] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible Models for Classification and Regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–158. ACM, 2012.
 [31] Sendhil Mullainathan and Ziad Obermeyer. Does machine learning automate moral hazard and error? American Economic Review, 107(5):476–480, 2017.
 [32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
 [33] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
 [34] Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [35] Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.
 [36] Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian Rule Lists. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Appendix A Appendix
A.1 Proof of Theorem 1
Proof of part 1.
As is bounded, we have such that .
Let be a path of optimal length to the model , i.e., . Let be any path leading to (not necessarily of optimal length). By assumption, we have , and by definition of model interpretability, we have . Therefore we obtain:
(19)  
(20)  
(21)  
(22) 
where (20) follows from the definition of model interpretability, (21) expands the previous expression, and (22) bounds the first sum and uses the boundedness assumption for the middle term.
If , we have , and therefore the last sum in (22) is not empty and for we can bound it:
(23) 
Therefore, for we have:
(24) 
This bound is valid for any path leading to the model, in particular the one with optimal interpretability loss; therefore we have (for ):
(25) 
which implies (as ):
(26) 
We now look at the case and . For , we can easily bound parts of equation (22):
(27) 
Putting it back into (22), we obtain (for )