1 Introduction
As machine learning models influence a growing fraction of everyday life, from cancer diagnoses to parole decisions to loan applications [1, 2, 3], individuals often want to understand the reasons for the decisions that affect them [4]. Model interpretability is of significant interest to the machine learning community [5], even though the lack of a well-defined concept of interpretability [6] means researchers often focus on proxies (e.g. sparsity or coefficient integrality in linear models). Building “simple” models by optimizing interpretability proxies is a helpful but incomplete way to enhance interpretability. In some cases, practitioners prefer more complex models (e.g. deeper decision trees) because they can be explained and justified in a more compelling way [7]. Recent EU legislation [8] guarantees citizens a “right to explanation,” not a right to be affected only by sparse models. But explaining a model in simple terms is a major challenge in interpretable machine learning, because it is typically an ad hoc, audience-specific process. Formalizing the process of model explanation can yield ways to create models that are easier to explain, and can rigorously quantify the tradeoff between interpretability and accuracy.
In this paper, we focus on linear models and explore ways to decompose them into a sequence of interpretable coordinate updates. We propose a general optimization framework to measure and optimize the interpretability of these sequences. We then discuss how to create linear models with better explanations, leading to a natural set of interpretability metrics, and show that we can generalize various aspects of linear model interpretability. In particular,

Section 2 introduces coordinate paths and motivates their use to explain linear models.

Section 3 presents a set of metrics to evaluate the interpretability of coordinate paths and extends them into interpretability metrics for models. This allows us to study the price of interpretability, i.e., the Pareto front between accuracy and interpretability. We show that our metrics are consistent with existing approaches and exhibit desirable properties.

Section 4 presents both optimal and scalable algorithms to compute coordinate paths and interpretable models.

Section 5 discusses various practical uses of our framework and other extensions.
1.1 Related work
Many interpretable machine learning approaches involve optimizing some characteristic of the model as a proxy for interpretability. Examples include sparsity for linear models [9], number of splits for decision trees [10], number of subspace features for case-based reasoning [11], or depth for rule lists [12, 13]. Some approaches optimize these proxies directly, while others fit auxiliary simple models to more complex black-box models [14, 15, 16, 17, 18, 19].
In the specific case of linear models, the typical interpretability proxy of sparsity (a small number of nonzero coefficients) has been a topic of extensive study over the past twenty years [9]. Sparse regression models can be trained using heuristics such as LASSO [20], stagewise regression [21] or least-angle regression [22], or using scalable mixed-integer approaches [23]. More recently, another factor of interpretability in linear models has been the imposition of integrality on the coefficients [24, 25], which allows one to think of the output as tallying up points from each feature into a final score.
Training low-complexity models often affects predictive accuracy, and the tradeoff between the two can be difficult to quantify [26]. Similarly, the limitations of an ex post explanation relative to the original black-box model can be difficult to convey to users [27]. And it is not clear that practitioners always find models that optimize these proxies more interpretable [7]. Recent landmark works [28, 6, 27] have argued that any study of interpretability must include input from human users. The framework we propose is both human-driven and mathematically rigorous, as users can define their own understanding of interpretability and quantify the resulting tradeoff with accuracy.
2 A Sequential View of Model Construction
Given a dataset with feature matrix $X \in \mathbb{R}^{n \times d}$ and labels $y \in \mathbb{R}^n$, a linear model is a vector of coefficients $\beta \in \mathbb{R}^d$, associated with a cost $c(\beta)$ that measures how well it fits the data, such as the mean squared error (potentially augmented with a regularization term to control out-of-sample error).
We motivate our approach to explaining linear models with a toy example. The goal is to predict a child's age given height and weight. The normalized features are correlated with each other, and both are positively correlated with the normalized target. Solving the ordinary least squares problem yields the optimal coefficients $\beta^*$:
(1)
with $\epsilon$ the error term; $c(\beta)$ denotes the mean squared error (MSE) of a model $\beta$.
2.1 Coordinate paths
We propose a framework to construct an explanation of $\beta^*$ by decomposing the model into a sequence of interpretable building blocks. In particular, we consider sequences of linear models leading to $\beta^*$, where each model is obtained by changing one coefficient of the preceding model. We choose these coordinate steps because they correspond to the natural idea of adding a feature or updating an existing coefficient, and we will show they have interesting properties; we discuss other potential steps in Section 5. Table 1 shows three possible decompositions of $\beta^*$ into coordinate steps. This is a natural way to decompose $\beta^*$: notice, for example, that decomposition 1(a) corresponds to introducing the model coefficient by coefficient.



We refer to these sequences of models as coordinate paths. Formally, we define a coordinate path of length $K$ as a sequence of models $p = (\beta_1, \ldots, \beta_K)$ such that $\beta_t \in N(\beta_{t-1})$ for all $1 \le t \le K$, where $\beta_0 = 0$, and $N(\beta) = \{\beta + \delta e_j : \delta \in \mathbb{R},\ 1 \le j \le d\}$ is the set of linear models that are one coordinate step away from $\beta$. $\mathcal{P}_K$ is the set of all coordinate paths of length $K$, and $\mathcal{P}$ is the set of all finite coordinate paths. An explanation of a model $\beta$ is a coordinate path whose last model is $\beta$. $\mathcal{E}_K(\beta)$ is the set of explanations of $\beta$ of length $K$ (potentially empty), and $\mathcal{E}(\beta)$ is the set of all possible explanations of $\beta$ (typically infinite).
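The definition above is easy to operationalize. The following sketch (our own illustration; the function name and representation are not from the paper) stores a path as a list of coefficient vectors and checks the one-coordinate-step property, starting from the zero model.

```python
import numpy as np

def is_coordinate_path(path, d, tol=1e-12):
    """Check that `path` (a list of length-d coefficient vectors) is a valid
    coordinate path: starting from the zero model, each model differs from
    its predecessor in at most one coordinate."""
    prev = np.zeros(d)
    for beta in path:
        beta = np.asarray(beta, dtype=float)
        if beta.shape != (d,):
            return False
        # count coordinates that changed relative to the previous model
        changed = np.sum(np.abs(beta - prev) > tol)
        if changed > 1:
            return False
        prev = beta
    return True

# A Table-1(a)-style explanation: introduce the model coefficient by coefficient.
path = [np.array([1.0, 0.0]), np.array([1.0, -0.5])]
print(is_coordinate_path(path, d=2))  # True: each step edits one coordinate
```

A path that changes two coordinates in a single step fails the check, which matches the formal requirement $\beta_t \in N(\beta_{t-1})$.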
The examples in Table 1 are all explanations of $\beta^*$, and it is natural to ask which is the most useful or interpretable one. Formally, we would like to define an interpretability loss $\Lambda$ on the space of coordinate paths $\mathcal{P}$, such that $\Lambda(p_1) \le \Lambda(p_2)$ when $p_1$ is more interpretable than $p_2$. Finding the best possible explanation for a model $\beta$ can then be written as the optimization problem $\min_{p \in \mathcal{E}(\beta)} \Lambda(p)$.
For any path interpretability loss $\Lambda$, it is then easy to define the interpretability loss of a model $\beta$ as the interpretability loss of its best explanation, i.e.,
(2) $\Omega(\beta) = \min_{p \in \mathcal{E}(\beta)} \Lambda(p)$.
How should we select a path interpretability loss $\Lambda$? A natural choice is to consider an explanation better if it is shorter. Formally, we define the path complexity loss $\Lambda_0(p) = |p|$, the length of the coordinate path. For any model $\beta$, we can define the corresponding interpretability loss
$\Omega_0(\beta) = \min_{p \in \mathcal{E}(\beta)} |p|$,
which we call model complexity (the minimum number of coordinate steps required to reach $\beta$). Interestingly, for any model $\beta$, $\Omega_0(\beta)$ equals the number of nonzero coefficients of $\beta$. The natural metric of coordinate path length thus recovers the usual interpretability proxy of model sparsity.
Consider the different coordinate paths in Table 1. If we use the interpretability loss $\Lambda_0$, i.e., if we consider shorter paths (and thus sparser models) to be more interpretable, then paths 1(a) and 1(b) are equally interpretable. Though both paths have the same length, their intermediate models have very different costs: the intermediate model of path 1(b) is particularly inaccurate, as weight is actually positively correlated with age. Since a coordinate path represents an explanation of the final model, the costs of intermediate models should play a role in quantifying the interpretability of a path; higher costs should be penalized. The path complexity loss does not consider intermediate model costs at all, and therefore cannot capture this effect.
2.2 Incrementality
To explore alternatives to path complexity, consider the example of greedy coordinate paths, where the next model at each step is chosen by minimizing the cost $c$:
(3) $\beta_{t+1} \in \operatorname*{arg\,min}_{\beta \in N(\beta_t)} c(\beta)$.
This approach is appealing from an explanation standpoint, because we always select the coordinate step that most improves the model. However, many steps may be required to obtain an accurate model (slow convergence). Returning to toy example (1) and considering only paths of length 2, we compute the greedy coordinate path by solving (3) twice [21]. Comparing the greedy path to the path from Table 1(a), the greedy path has a lower cost after the first step but a higher cost after the second: the improvement of the first model comes at the expense of the second step.
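For a mean-squared-error cost, the greedy rule (3) reduces to exact single-coordinate minimization at every step. Below is a minimal sketch of this rule (our own illustrative code under the MSE assumption, not the authors' implementation):

```python
import numpy as np

def greedy_coordinate_path(X, y, K):
    """Build a length-K greedy coordinate path for the cost
    c(beta) = mean((y - X beta)^2), following the greedy rule (3)."""
    n, d = X.shape
    beta = np.zeros(d)
    path = []
    for _ in range(K):
        best = None
        for j in range(d):
            # residual with coordinate j's current contribution removed
            r = y - X @ beta + X[:, j] * beta[j]
            bj = (X[:, j] @ r) / (X[:, j] @ X[:, j])  # 1-D exact minimizer
            cand = beta.copy()
            cand[j] = bj
            cost = np.mean((y - X @ cand) ** 2)
            if best is None or cost < best[0]:
                best = (cost, cand)
        beta = best[1]
        path.append(beta.copy())
    return path

# Tiny demo on synthetic data: greedy costs never increase along the path.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
costs = [np.mean((y - X @ b) ** 2) for b in greedy_coordinate_path(X, y, 3)]
print(costs[0] >= costs[1] >= costs[2])  # True
```

Because each step re-optimizes one coordinate exactly, the cost sequence is non-increasing, but as noted above the path may converge slowly to an accurate model.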
Deciding which of these two paths is more interpretable is a hard question. It highlights the tradeoff between the desirable incrementality of the greedy approach and the optimality of the second model. For paths of length 2, there is a continuum of solutions that trade off MSE in the first and second steps, shown in Figure 1. The next section introduces interpretability losses that formalize this tradeoff.
3 Defining an Interpretability Loss
3.1 Coherent Interpretability Losses
In example (1), comparing the interpretability of paths 1(a) and 1(b) is easy, because they have the same length and one has a better cost than the other at each step. In contrast, comparing path 1(a) with the greedy path is not trivial, because the former has a better final cost, but the latter has a better initial cost.
We can define the cost sequence of a coordinate path $p$ of length $K$ as the infinite sequence $(c^p_t)_{t \ge 1}$ such that $c^p_t = c(\beta_t)$ if $t \le K$, and $c^p_t = 0$ otherwise. Then we call a path interpretability loss $\Lambda$ coherent if the following conditions hold for any two paths $p_1, p_2$ with cost sequences $c^{p_1}$ and $c^{p_2}$:

(a) If $c^{p_1}_t = c^{p_2}_t$ for all $t$, then $\Lambda(p_1) = \Lambda(p_2)$.

(b) If $c^{p_1}_t \le c^{p_2}_t$ for all $t$, then $\Lambda(p_1) \le \Lambda(p_2)$.
Condition (a) means that in our modeling framework, the interpretability of a path depends only on the sequence of costs along that path. Condition (b) formalizes the intuition that paths with fewer steps or better steps are more interpretable. Under any coherent interpretability loss in toy example (1), path 1(a) is more interpretable than path 1(b), but it may be more or less interpretable than the greedy path depending on the specific choice of coherent interpretability loss.
In addition, consider a path $p$ of length $K$ and remove its last step to obtain a new path $p'$. This is equivalent to setting the $K$th element of the cost sequence to zero. Since costs are nonnegative, we have $c^{p'}_t \le c^p_t$ for all $t$, which implies $\Lambda(p') \le \Lambda(p)$. In other words, under a coherent interpretability loss, removing a step from a coordinate path can only make the path more interpretable. We also notice that the path complexity (sparsity) loss $\Lambda_0$ is a coherent path interpretability loss.
3.2 A Coherent Model Interpretability Loss
Condition (b) states that a path with at least as good a cost at each step as another path must be at least as interpretable. This notion of Pareto dominance suggests a natural family of path interpretability losses:
$\Lambda_\alpha(p) = \sum_{t \ge 1} \alpha_t c^p_t$.
In other words, the interpretability loss of a path is a weighted sum of the costs of all steps in the path. This loss function is trivially coherent (for nonnegative weights) and extremely general. It is specified by the infinite sequence of parameters $(\alpha_t)_{t \ge 1}$, which specify the relative importance of the accuracy of each step for the particular application at hand.
Defining a family of interpretability losses with infinitely many parameters allows for significant modeling flexibility, but it is also cumbersome and overly general. We therefore propose to select $\alpha_t = \gamma^t$ for all $t$, replacing the infinite sequence of parameters with a single parameter $\gamma > 0$. In this case, following (2), we propose the following interpretability loss function on the space of models.
Definition 1 (Model interpretability).
Given a model $\beta$, its interpretability loss is given by
(4) $\Omega_\gamma(\beta) = \min_{p \in \mathcal{E}(\beta)} \sum_{t=1}^{|p|} \gamma^t c(\beta_t)$.
By definition, $\Lambda_\gamma$ is a coherent interpretability loss. The parameter $\gamma$ captures the tradeoff between favoring more incremental models and models with a low complexity, as formalized in Theorem 1.
Theorem 1 (Consistency of interpretability measure).
Assume that $c$ is bounded and nonnegative.
(a) Let $\beta_1, \beta_2$ be models with $\Omega_0(\beta_1) < \Omega_0(\beta_2)$, or with $\Omega_0(\beta_1) = \Omega_0(\beta_2)$ and $c(\beta_1) < c(\beta_2)$. Then
(5) $\Omega_\gamma(\beta_1) < \Omega_\gamma(\beta_2)$ for $\gamma$ large enough.
(b) Given models $\beta_1, \beta_2$, if there are explanations $p_1 \in \mathcal{E}(\beta_1)$ and $p_2 \in \mathcal{E}(\beta_2)$ whose cost sequences satisfy $c^{p_1} \le_{\mathrm{lex}} c^{p_2}$, then
(6) $\Omega_\gamma(\beta_1) \le \Omega_\gamma(\beta_2)$ for $\gamma$ small enough.
Intuitively, in the limit of large $\gamma$, part (a) states that the most interpretable models are the ones with minimal complexity, or minimal costs if their complexity is the same. Part (b) states that in the limit of small $\gamma$ the most interpretable models are the ones that can be constructed with greedy steps. All proofs are provided in the supplement.
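The effect of $\gamma$ in the weighted loss $\sum_t \gamma^t c_t$ can be checked numerically. The sketch below is our own illustration on hypothetical cost sequences (the numbers are made up, not from the paper): a large $\gamma$ heavily penalizes extra steps, while a small $\gamma$ rewards good early steps.

```python
def path_loss(costs, gamma):
    """Weighted path interpretability loss: sum over steps t = 1..K of
    gamma**t times the cost of the t-th model (cf. Definition 1)."""
    return sum(gamma ** t * c for t, c in enumerate(costs, start=1))

# Hypothetical cost sequences: a short path with a worse first step,
# and a longer, more incremental path with better early costs.
short = [0.8, 0.1]            # length 2
incremental = [0.4, 0.3, 0.1] # length 3

# Large gamma penalizes the extra step: the short path wins.
print(path_loss(short, 10.0) < path_loss(incremental, 10.0))  # True
# Small gamma emphasizes early costs: the incremental path wins.
print(path_loss(short, 0.1) < path_loss(incremental, 0.1))    # False
```

This mirrors the tradeoff that $\gamma$ is meant to encode between low-complexity and incrementally constructed models.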
3.3 The Price of Interpretability
Given the metric of interpretability defined above, we want to compute models that are Pareto-optimal with respect to the cost $c$ and the interpretability loss $\Omega_\gamma$. Computing these models can be challenging, as our definition of model interpretability requires optimizing over paths of any length. We can get around this if we can at least find the most interpretable path of a fixed length $K$, i.e., solve
(7) $\min_{p \in \mathcal{P}_K} \sum_{t=1}^{K} \gamma^t c(\beta_t)$.
Indeed, the following result shows that we can compute Pareto-optimal solutions by solving a sequence of optimization problems (7) for various lengths $K$.
Proposition 1 (Price of interpretability).
Pareto-optimal models that minimize the interpretability loss $\Omega_\gamma$ and the cost $c$ can be computed by solving the following optimization problem:
(8) $\min_{K \ge 0}\; \min_{p \in \mathcal{P}_K}\; \lambda\, c(\beta_K) + \sum_{t=1}^{K} \gamma^t c(\beta_t),$
where $\lambda > 0$ is a tradeoff parameter between cost and interpretability.
Notice that the inner minimization problem in (8) is simply problem (7) with appropriate modifications of the coefficients $\gamma^t$. We can use this decomposition to compute the price of interpretability in the toy problem (1), for a fixed choice of the parameter $\gamma$. Figure 2 shows all Pareto-optimal models with respect to cost (MSE) and interpretability loss.
By defining the general framework of coordinate paths and a natural family of coherent interpretability loss functions, we can understand exactly how much we gain or lose in terms of accuracy when we choose a more or less interpretable model. Our framework thus provides a principled way to answer a central question of the growing literature on interpretability in machine learning.
4 Computing the Price of Interpretability
4.1 Algorithms
Optimal.
Given the step neighborhoods $N(\cdot)$ and a convex quadratic cost function $c$, problem (7) can be written as a convex integer optimization problem using special ordered sets of type 1 (SOS1 constraints), and solved using Gurobi or CPLEX for small problems:
(9)
where $\beta_0$ designates the starting linear model.
Local improvement.
In higher-dimensional settings, or when the path length $K$ grows large, the formulation above may no longer scale. It is thus of interest to develop a fast heuristic for such instances.
A feasible solution to problem (9) can be written as a vector of indices $(j_1, \ldots, j_K)$ and a vector of values $(v_1, \ldots, v_K)$, such that for $1 \le t \le K$, model $\beta_t$ is obtained from $\beta_{t-1}$ by setting coordinate $j_t$ to the value $v_t$. The vector of indices encodes which coefficients are modified at each step, while the vector of values encodes the value of each modified coefficient. Thus problem (9) can be rewritten as
(10) 
where $e_j$ designates the $j$th unit vector. The inner minimization problem is an “easy” convex quadratic optimization problem, while the outer minimization problem is a “hard” combinatorial optimization problem. We propose the following local improvement heuristic for the outer problem: given a current vector of indices $(j_1, \ldots, j_K)$, we randomly sample one step $t$ of the coordinate path. Keeping the indices $j_s$ constant for $s \neq t$, we iterate through all possible values of $j_t$ and obtain candidate index vectors. For each candidate, we solve the inner minimization problem and keep the candidate with the lowest cost. A general version of this algorithm, where we sample not one but several steps at each iteration, is provided in the supplement.

4.2 Results
Optimal vs heuristic.
To empirically evaluate the local improvement heuristic, we run it with different batch sizes on a small real dataset with 100 rows and 6 features (after one-hot encoding of categorical features). The goal is to predict the perceived prestige (from a survey) of a job occupation given features about it, including education level, salary, etc.
Given this dataset, we first compute the optimal coordinate path of a fixed length. We then test our local improvement heuristic on the same dataset. Given the small size of the problem, a provable global optimum of the complete formulation is found by Gurobi in about 5 seconds. To be useful, the local improvement heuristic should find a good solution significantly faster. We show convergence results of the heuristic for different values of the batch size parameter in Table 2. For both batch sizes, the local improvement heuristic converges two orders of magnitude faster than Gurobi. With the larger batch size, the solution found is optimal.
Method  Time to convergence (s)  Optimality gap (%) 

Exact  
Local improvement ()  
Local improvement () 
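A minimal version of the local improvement heuristic (batch size 1) can be sketched as follows. This is our own illustrative Python under the assumption of an MSE cost; it is not the authors' implementation. `inner_cost` solves the inner convex problem exactly as a weighted least-squares problem, parametrizing each step by the increment it applies to its coordinate (equivalent to choosing the new value).

```python
import numpy as np

def inner_cost(X, y, idx, gamma):
    """Given step coordinates idx = (j_1, ..., j_K), solve the inner convex
    problem over step values and return the optimal weighted path loss
    sum_t gamma**t * MSE(beta_t), where beta_t applies the first t updates."""
    n, K = X.shape[0], len(idx)
    rows, rhs = [], []
    for t in range(1, K + 1):
        M = np.zeros((n, K))
        for s in range(t):
            M[:, s] = X[:, idx[s]]  # increment delta_s acts on coordinate idx[s]
        w = np.sqrt(gamma ** t / n)  # weight sqrt so squared residual = gamma^t * MSE
        rows.append(w * M)
        rhs.append(w * y)
    A, b = np.vstack(rows), np.concatenate(rhs)
    delta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.sum((A @ delta - b) ** 2)

def local_improvement(X, y, K, gamma, iters=50, seed=0):
    """Randomly pick one step, try every coordinate for it, keep the best."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    idx = list(rng.integers(0, d, size=K))
    best = inner_cost(X, y, idx, gamma)
    for _ in range(iters):
        t = rng.integers(0, K)
        for j in range(d):
            cand = idx.copy()
            cand[t] = j
            cost = inner_cost(X, y, cand, gamma)
            if cost < best:
                best, idx = cost, cand
    return idx, best
```

Each iteration only re-solves small least-squares problems, which is why this scales better than the exact mixed-integer formulation.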
Insights from a real dataset.
We now explore the results of our approach on the 1998–1999 California test score dataset. Each data point represents a school, and the variable of interest is the average standardized test score of students from that school. The ten continuous features and the target variable are centered and rescaled to have unit variance.
In our example, we assume that we already have a regression model available to predict test scores: it was trained using only the percentage of students qualifying for a reduced-price lunch. This model has an MSE of 0.122 (compared to an optimal MSE of 0.095). We would like to update this model in an interpretable way given the availability of all features in the dataset. In our framework, this corresponds to problem (9) where $\beta_0$ is no longer 0 but the available starting model.

In Figure 3 we study one particular coordinate path on the Pareto front of accuracy and interpretability. The path (and the new model it leads to) is shown in Figure 3(a). It can be obtained from the original model in just four steps. First we add the district average income with a positive coefficient, then we correct the coefficient for reduced-price lunch students to account for this new feature, and finally we add the percentage of English learners and the per-student spending. The final model has an MSE of 0.097, which is near-optimal. When we compare this path to other methods (see Figure 3(b)), we see that our interpretable formulation strikes a good tradeoff between a greedy formulation and a formulation that simply sets the coefficients to their final values one by one.
5 A General Framework
5.1 Different Steps for Different Notions of Interpretability
So far, we have focused exclusively on paths in which linear models are constructed in a series of coordinate steps. However, the choice of what constitutes a step is ultimately a modeling choice which encodes what a user in a particular application considers a simple building block. Choosing a different step set $N(\cdot)$ can lead to other notions of interpretability. For instance, restricting each coordinate update to be integer is related to the notion of interpretability introduced by the score-based methods of Ustun and Rudin [25]. Another way to think about score-based methods is to restrict each step to changing one coordinate by a unit amount, which imposes that each step adds one point to the scoring system. The fundamental idea of optimally decomposing models into a sequence of simple building blocks is general, and can be applied not only to more general linear models (e.g. ridge regression or logistic regression, by suitably modifying the cost $c$) but also to other machine learning models (for example, a decision tree can be decomposed into a sequence of successive splits).

5.2 Human-in-the-loop Model Selection
Viewing a coordinate path as a nested sequence of models of increasing complexity can be useful in the context of human-in-the-loop analytics. Consider the problem of selecting a linear model by a human decision maker: for example, a city planner who would like to understand bike-sharing usage in Porto by training a linear model on a dataset from the UCI ML repository [30], where each of the 731 data points represents a particular day (18 features about weather, time of year, etc.), and the variable of interest is the number of trips recorded by the bike-sharing system on that day. The decision-maker may prefer a sparse model, but may not know the exact desired level of sparsity. Given a discrete distribution over the desired sparsity level $k$, we can choose the weights $\alpha_t = \mathbb{P}(k = t)$ and solve (7) to find paths that minimize the expected cost of the selected model.
In Table 3(a), we show the path obtained by assuming that the desired level of sparsity is uniformly distributed between 1 and 7. We can compare the result to a sequence of linear models obtained using LASSO to select an increasing set of features, shown in Table 3(b). The expected costs are respectively 0.145 and 0.150, and the MSE is essentially the same at each step. The two sequences also use almost the same features, in a similar order. However, the coordinate path can be read much more easily: because only one coefficient changes at each step, the whole path can be described with one parameter per step, while the sequence constructed using LASSO/sparse regression requires specifying every coefficient of every model.


Table 3: Comparison of the interpretable path and the sequence of models of increasing sparsity selected by LASSO. Ftemp is the “feels like” temperature, Day is the number of days since data collection began, Hum is humidity, Wind is wind speed. Season, Weather and Weekday are categorical variables.
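The expected-cost objective of this section is simple to evaluate for a given nested sequence. The sketch below is our own illustration with made-up cost numbers; with weights $\alpha_t = \mathbb{P}(k = t)$, the weighted path loss is exactly the expected MSE of the model the decision-maker ends up using.

```python
def expected_cost(costs, probs):
    """Expected cost of a nested sequence of models when the desired
    sparsity level t is drawn with probability probs[t-1]; equivalently,
    the weighted path loss with alpha_t = P(k = t)."""
    assert len(costs) == len(probs) and abs(sum(probs) - 1.0) < 1e-9
    return sum(p * c for p, c in zip(probs, costs))

# Uniform prior over sparsity levels 1..4, with a hypothetical cost sequence:
costs = [0.30, 0.20, 0.15, 0.12]
print(round(expected_cost(costs, [0.25] * 4), 4))  # 0.1925
```

Optimizing this objective over coordinate paths of length $K$ is precisely an instance of problem (7) with the weights above.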
Acknowledgements
Research funded in part by ONR grant N000141812122.
References
 [1] Sendhil Mullainathan and Ziad Obermeyer. Does machine learning automate moral hazard and error? American Economic Review, 107(5):476–480, 2017.
 [2] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2017.
 [3] Richard Berk. An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. Journal of Experimental Criminology, 13(2):193–216, 2017.
 [4] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them. Management Science, 64(3):1155–1170, nov 2016.
 [5] Alex A. Freitas. Comprehensible classification models. ACM SIGKDD Explorations Newsletter, 15(1):1–10, 2014.
 [6] Zachary C. Lipton. The Mythos of Model Interpretability. arXiv preprint arXiv:1606.03490, 2016.
 [7] Nada Lavrač. Selected techniques for data mining in medicine. Artificial intelligence in medicine, 16(1):3–23, 1999.
 [8] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". pages 1–9, 2016.
 [9] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.
 [10] Leo Breiman. Classification and regression trees. New York: Routledge, 1984.
 [11] Been Kim, Cynthia Rudin, and Julie Shah. The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification. In Neural Information Processing Systems (NIPS), 2014.

 [12] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 9(3):1350–1371, 2015.
 [13] Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian Rule Lists. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 [14] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
 [15] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics New York, NY, USA:, 2001.
 [16] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic Transparency via Quantitative Input Influence. In 2016 IEEE Symposium on Security and Privacy, 2016.
 [17] Hamsa Bastani, Osbert Bastani, and Carolyn Kim. Interpreting Predictive Models for HumanintheLoop Analytics. arXiv preprint arXiv:1705.08504, pages 1–45, 2018.
 [18] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Interpretable & Explorable Approximations of Black Box Models. FAT/ML, jul 2017.
 [19] Cristian Bucilǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06), page 535, New York, NY, USA, 2006. ACM Press.
 [20] Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [21] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, jun 2015.
 [22] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least Angle Regression. Annals of Statistics, 32(2):407–499, apr 2004.
 [23] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. Annals of Statistics, 44(2):813–852, 2016.
 [24] Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel, and Daniel G Goldstein. Simple rules for complex decisions. feb 2017.
 [25] Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning, 102(3):349–391, 2016.
 [26] Leo Breiman. Statistical modeling: The two cultures. Statistical science, 16(3):199–231, 2001.
 [27] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining Explanations: An Approach to Evaluating Interpretability of Machine Learning. arXiv preprint arXiv:1806.00069, 2018.
 [28] Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608, pages 1–13, 2017.
 [29] I. Y. Kim and O. L. De Weck. Adaptive weightedsum method for biobjective optimization: Pareto front generation. Structural and Multidisciplinary Optimization, 29(2):149–158, 2005.
 [30] Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2–3):113–127, 2014.
Appendix A Proof of Theorem 1
Proof of part (a).
As $c$ is bounded, there exists $M > 0$ such that $c(\beta) \le M$ for all models $\beta$.
Let $p^*$ be a path of optimal length to the model $\beta$, i.e., a path in $\mathcal{E}(\beta)$ of length $\Omega_0(\beta)$. Let $p$ be any path leading to $\beta$ (not necessarily of optimal length). By assumption, the costs along any path are bounded by $M$, and by definition of model interpretability, we have $\Omega_\gamma(\beta) \le \Lambda_\gamma(p^*)$. Therefore we obtain:
(11)  
(12)  
(13)  
(14) 
where (12) follows from the definition of model interpretability, (13) is a development of the previous equation, and (14) bounds the first sum and uses the bound $M$ for the middle term.
If , we have , and therefore the last sum in (14) is not empty and for we can bound it:
(15) 
Therefore, for we have:
(16) 
This bound is valid for any path leading to $\beta$, in particular the one with optimal interpretability loss; therefore we have:
(17) 
which implies (as ):
(18) 
We now look at the case $\Omega_0(\beta_1) = \Omega_0(\beta_2)$ and $c(\beta_1) < c(\beta_2)$. We can easily bound parts of equation (14):
(19) 
Putting it back into (14), we obtain
(20) 
This bound is independent of the path leading to the model, therefore we have
(21) 
which ends the proof. ∎
Proof of part (b).
Consider two paths $p_1, p_2$ such that the cost sequence of $p_1$ precedes that of $p_2$ in the lexicographic order. By definition of the lexicographic order, either the two sequences are the same (in which case the result is trivial), or there exists an index $t_0$ such that:
We have:
(22)  
(23)  
(24)  
(25) 
where (23) applies the definition of the cost sequences, and (25) uses the fact that the two sequences agree before step $t_0$.
The term inside the parenthesis in (25) converges to the (negative) difference of the costs at step $t_0$ as $\gamma \to 0$, as the paths are finite. Therefore
(26) 
which completes the proof; the last statement of the theorem is an immediate consequence. ∎
Appendix B Proof of Proposition 1
Proof.
First, a solution of problem (8) is Pareto-optimal with respect to the cost and the interpretability, as it corresponds to the minimization of a weighted sum of the two objectives. Furthermore, we can write
∎