DALEX: explainers for complex predictive models

06/23/2018, by Przemyslaw Biecek, et al.

Predictive modeling is increasingly dominated by flexible yet complex methods such as neural networks or ensembles (model stacking, boosting or bagging). Such methods are usually described by a large number of parameters or hyperparameters - a price that one needs to pay for flexibility. The sheer number of parameters makes models hard to understand. This paper describes a consistent collection of explainers for predictive models, a.k.a. black boxes. Each explainer is a technique for the exploration of a black-box model. The presented approaches are model-agnostic, which means that they extract useful information from any predictive method regardless of its internal structure. Each explainer is linked with a specific aspect of a model. Some are useful in decomposing predictions, some serve better in understanding performance, while others are useful in understanding the importance and conditional responses of a particular variable. Every explainer presented in this paper works for a single model or for a collection of models. In the latter case, models can be compared against each other. Such comparison helps to find strengths and weaknesses of different approaches and gives additional possibilities for model validation. The presented explainers are implemented in the DALEX package for R. They are based on a uniform, standardized grammar of model exploration which may be easily extended. The current implementation supports the most popular frameworks for classification and regression.

1 Introduction

In this section we present the importance of model interpretability and the state of the art in this domain. Then we show how our methodology helps to better understand complex predictive models, a.k.a. black boxes.

Predictive modeling has a large number of applications in almost every area of human activity, from medicine, marketing and logistics to banking and many others. Due to the increasing amount of collected data, models become more sophisticated and complex.

It is believed that there is a trade-off between the interpretability and accuracy of a model (see e.g., Johansson et al. (2011)). It comes from the observation that the most flexible models usually have higher accuracy, but in turn they are also more complex. Complexity here means a large number of parameters that affect the final prediction. That number is often large enough to make the model incomprehensible to a human.

Interpretability may be introduced naturally in the modeling framework, see an example in Figure 1. In many areas the interpretability of a model is very important, see for example Lundberg and Lee (2017), Murad and Tarr, Puri (2017). The reason is that interpretability allows one to confront the model structure with domain knowledge. This may bring multiple benefits, such as:

  • Domain validation. Very flexible models may be overfitted to the training data and may pick up biases that result from the manner in which the data was collected (sample bias) or from surrogate variables (variable bias). Validation of the model structure helps to identify these biases. In Figure 1 this feature is marked as C.

  • Model improvement. Identification of subsets of observations in which a model has lower performance allows us to correct the model in these subsets and leads to further improvements of the model. In Figure 1 this feature is marked as D.

  • Trust. If the model is used to assist people in activities such as selection of proper therapy, understanding key factors that drive model predictions is very important. See more examples in Ribeiro et al. (2016).

  • Hidden debt. Sculley et al. (2015) argue that lack of interpretability leads to hidden debt in machine learning models. Despite initial high performance, the real model performance may deteriorate quickly. Model explainers help to control this debt.

    Figure 1: Points A and B come from the typical workflow of data modeling. A) Domain knowledge and data are turned into models. B) Models are used to generate predictions. The presented methodology extends this framework with two new processes: C) model understanding increases our knowledge and, in consequence, may lead to a better model; D) understanding predictions helps to correct wrong decisions and, in consequence, leads to better models.
  • New insights. It is hard to increase knowledge about a domain on the basis of black boxes. They may be useful, but by themselves they do not lead to new knowledge about a given discipline. Understanding the model structure may lead to new interesting discoveries.

In this paper we present a consistent general framework for local and global model interpretability. This framework covers the best known approaches to model explanation, such as Partial Dependence Plots (Greenwell, 2017), Accumulated Local Effects Plots (Apley, 2017), Merging Path Plots (Sitko and Biecek, 2017), Break Down Plots (Biecek, 2018), Permutational Variable Importance Plots (Fisher et al., 2018) and Ceteris Paribus Plots.

All these explainers are extended in a way that allows us to compare different models against each other on the same scale. Model comparison is very important, since in model building one often obtains a collection of competing models. Comparing these models and exploring the structures learned by flexible models gives new insights that may be used to construct better features for new models (assisted training with surrogate models). A lot of effort was also put into the graphical side of the explainers. Solutions such as Visualizations for Convolutional Networks (Zeiler and Fergus, 2014) or conditional visualization for statistical models (O’Connell et al., 2017) show that well-prepared visualizations boost actionability. We have deliberately not included approaches that do not fit into our grammar of model exploration, such as Individual Conditional Expectation Plots (Goldstein et al., 2015) and (Apley, 2017). Nevertheless, they are still available, for example in the ICEbox package.

The presented methodology is available as an open source package DALEX for R. The R language (R Core Team, 2017) is one of the most popular software systems for statistical and machine learning modeling. The current implementation of DALEX supports models generated with the most popular frameworks for classification or regression, such as caret (from Jed Wing et al., 2016), mlr (Bischl et al., 2016), Random Forest (Liaw and Wiener, 2002), Gradient Boosting Machine (Ridgeway, 2017) and Generalized Linear Models (Dobson, 1990). It can also be easily extended to other frameworks and other techniques for model exploration. The DALEX package is available under the GPL-3 license on CRAN (https://cran.r-project.org/package=DALEX) and on GitHub (https://github.com/pbiecek/DALEX), along with technical documentation (https://pbiecek.github.io/DALEX) and extended documentation (https://pbiecek.github.io/DALEX_docs).

Example explainers presented in this paper were recorded with the archivist package (Biecek and Kosinski, 2017). Each explainer is an R object which can be downloaded directly into the R console via hooks attached to the relevant sections. To save space, we present in this paper only the graphical representation of explainers; the tabular representation is available through the attached hooks.
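For example, the model performance explainer from Section 3.1 can be retrieved with the hook given there. A minimal sketch (the hook string is taken verbatim from that section; internet access is required):

    library(archivist)
    library(DALEX)    # provides the plot() method for the downloaded explainer

    mp <- aread("pbiecek/DALEX_arepo/b4eb1")   # model_performance explainer from Section 3.1
    mp         # tabular representation
    plot(mp)   # graphical representation, as in Figure 3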

2 Architecture

Figure 2 presents the general architecture of the DALEX package. The presented methodology is model-agnostic and works for any predictive model that returns a numeric score, such as classification and regression models.

To achieve a truly model-agnostic solution, explainers cannot be based on model parameters or model structure. The only assumption is that we can call a predict function for any selected data points. This function, together with the model and a validation dataset, is wrapped into a single object that serves as a unified interface to the model.
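A minimal sketch of this wrapping step, assuming the apartments and apartmentsTest datasets shipped with DALEX and two example models (a linear model and a random forest); variable names follow that dataset:

    library(DALEX)
    library(randomForest)

    # Fit two competing models on the same data
    model_lm <- lm(m2.price ~ construction.year + surface + floor +
                     no.rooms + district, data = apartments)
    model_rf <- randomForest(m2.price ~ construction.year + surface + floor +
                               no.rooms + district, data = apartments)

    # explain() wraps a model together with validation data and a predict function,
    # creating the unified interface used by all explainers below
    explainer_lm <- explain(model_lm, data = apartmentsTest,
                            y = apartmentsTest$m2.price, label = "linear model")
    explainer_rf <- explain(model_rf, data = apartmentsTest,
                            y = apartmentsTest$m2.price, label = "random forest")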

Methods for better understanding of the global structure of a model (a.k.a. model explainers) and of the local structure of a model (a.k.a. prediction explainers) are implemented in separate functions. We call these functions explainers, since each is designed to explain a single aspect of a model. As a result, they return numerical summaries in a tabular format. Results from each explainer may be summarized with the generic plot function. The plot function works with any number of models and overlays all models in a single chart for cross-examination.

Figure 2: The architecture of the DALEX package is based on a simple yet unified grammar. A) Any predictive model with defined input and output may be used. B) Models are first enriched with additional metadata, such as a function that calculates predictions and validation data. The explain() function creates a wrapper over a model that can be used in further processing. C) Various explainers may be applied to a model. Each explainer calculates numerical summaries that can be plotted with the generic plot() function.

3 Model understanding

In this section we present explainers that increase understanding of the global structure of a model. The primary goal of these explainers is to answer the following questions: How good is the model? Which variables are the most important? How are the variables linked with the model response?

3.1 Explainers for model performance

Model performance is often summarized with a single number such as precision, recall, F1, average loss or accuracy. Such an approach is handy in model selection. It is easy to construct a ranking of models and choose the best one on the basis of a single statistic. However, more descriptive statistics are better when it comes to understanding a model.

The descriptive statistic most often used for classification models is the ROC (Receiver Operating Characteristic) curve. It has many implementations; in R, the most widely used is the ROCR package (Sing et al., 2005). ROC plots also have extensions for regression models; see an overview of Regression ROC curves in Hernández-Orallo (2013).

Figure 3: Both plots compare distributions of residuals for two models. The left plot shows 1 - Empirical Cumulative Distribution Function for absolute values of residuals, while the right plot shows boxplots for absolute values of residuals. Red dots in the right plot stand for root mean square loss.

The DALEX package offers a selection of tools for the exploration of model residuals. Figure 3 presents example explainers for model performance created with the model_performance() function (access this explainer with archivist::aread("pbiecek/DALEX_arepo/b4eb1")). Here the distribution of absolute residuals is compared between two models. The average mean square loss is equal for both models, yet we can see that the random forest model has more small residuals and only a small fraction of large residuals; 10% of the residuals in the random forest model are larger than the largest residual in the linear model.
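A minimal sketch of this comparison, reusing the explainers created in Section 2 (the geom argument of the plot method is an assumption based on the DALEX documentation):

    # Compare residual distributions of the two wrapped models
    mp_lm <- model_performance(explainer_lm)
    mp_rf <- model_performance(explainer_rf)

    plot(mp_lm, mp_rf)                    # reversed ECDF of |residuals| (left panel of Figure 3)
    plot(mp_lm, mp_rf, geom = "boxplot")  # boxplots of |residuals| (right panel of Figure 3)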

More diagnostic plots are available through the auditor package (Gosiewska and Biecek, 2018), which is closely integrated with DALEX.

3.2 Explainers for conditional effect of a single variable

The DALEX package offers a selection of tools for better understanding of a model’s conditional response to a single variable. The current implementation covers:

  • Partial Dependence Plot (Greenwell, 2017), as implemented in the pdp package.

  • Accumulated Local Effects Plot (Apley, 2017), as implemented in the ALEPlot package.

  • Merging Path Plot (Sitko and Biecek, 2017) as implemented in the factorMerger package.

The first two methods were designed to deal with continuous variables, while the third one is designed for categorical variables.

Examples of these explainers, created with the variable_response() function, are presented in Figure 4 (access them with archivist::aread("pbiecek/DALEX_arepo/3b150") and archivist::aread("pbiecek/DALEX_arepo/6cbf4")). On the basis of these explainers it is easy to see that the random forest model learns the nonlinear relation between price and construction year. The linear model is unable to handle such a relation without some prior feature engineering. For the categorical variable we can see that both models divide the district variable into three groups: downtown (largest responses), three districts close to downtown (medium responses) and all remaining districts.
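A minimal sketch of computing such responses, reusing the explainers from Section 2; variable names follow the apartments data assumed there, and the type values are taken from the explainer types listed above:

    # Partial Dependence Plots for the quantitative variable construction.year
    pdp_year_lm <- variable_response(explainer_lm, variable = "construction.year", type = "pdp")
    pdp_year_rf <- variable_response(explainer_rf, variable = "construction.year", type = "pdp")
    plot(pdp_year_lm, pdp_year_rf)        # upper right panel of Figure 4

    # Merging Path (Factor Merger) Plots for the categorical variable district
    fm_district_lm <- variable_response(explainer_lm, variable = "district", type = "factor")
    fm_district_rf <- variable_response(explainer_rf, variable = "district", type = "factor")
    plot(fm_district_lm, fm_district_rf)  # lower right panel of Figure 4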

Figure 4: Example explainers for variable responses. The two top plots show responses for the quantitative variable construction year (Partial Dependence Plots), while the bottom plots show responses for the factor variable district (Factor Merger Plots). The left plots show explainers for a single model, while the right panels show plots in which two models are compared. Explainers for the quantitative variable show the expected response for a selected value of the variable. Explainers for the factor variable present the similarity of responses across its possible values.

3.3 Explainers for variable importance

The DALEX package offers a model-agnostic procedure to calculate variable importance. The model-agnostic approach is based on the permutation approach introduced initially for random forests (Breiman, 2001) and then extended to other models by Fisher et al. (2018).

An example of these explainers, created with the variable_importance() function, is presented in Figure 5 (access this explainer with archivist::aread("pbiecek/DALEX_arepo/9378c")). The initial performance of both models is similar, and for that reason these intervals are left-aligned. For both models the district and surface variables are the most important. The largest difference between the models is the effect of construction year: for the linear model the length of the corresponding interval is almost 0, while for the random forest model it is far from 0. This observation is aligned with the variable effects presented in Figure 4.
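A minimal sketch, reusing the explainers from Section 2; loss_root_mean_square is assumed to be the loss function bundled with DALEX:

    # Permutation-based variable importance for both models
    vi_lm <- variable_importance(explainer_lm, loss_function = loss_root_mean_square)
    vi_rf <- variable_importance(explainer_rf, loss_function = loss_root_mean_square)

    plot(vi_lm, vi_rf)  # intervals from initial performance to performance after shuffling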

The usual practice in variable importance charts is to present only the length of the interval, which is related to the loss in the performance metric after the selected variable is shuffled. Bars on such plots are anchored at 0. In the DALEX package we propose to present not only the drop in model performance but also the initial model performance. In this way one can compare variables between models with different initial performance.

Figure 5: Example explainers for variable importance. The left panel shows explainers for a single model (random forest), while the right one compares two models. The importance of every variable is presented as an interval. One end of this interval is the model performance on the validation data, while the other end is the performance on a dataset with the single variable shuffled. The longer the interval, the more important the corresponding variable is.

4 Prediction understanding

In this section we present explainers that increase understanding of a prediction for a single observation. The primary goal of these explainers is to answer the following questions: How stable is the prediction? Which variables influence the prediction? How can the effects of particular variables be attributed to a single model prediction?

4.1 Explainers for robustness of predictions

Ceteris Paribus Plots show how the model response changes as a function of a single variable. These plots resemble the Partial Dependence Plots presented in Section 3.2; the only difference between them is that Ceteris Paribus Plots are focused on a single observation.

CP Plots have many applications. The derivative is related to local variable importance (as measured in LIME), and the profile may be used to verify constraints related to a variable (such as a monotonic relation) or to assess variable contribution.

An example of this explainer, created with the ceterisParibus package (https://github.com/pbiecek/ceterisParibus), is presented in Figure 6 (access it with archivist::aread("pbiecek/DALEX_arepo/c8989")). We can read from it that the variable surface has the largest effect on the model predictions and that it lowers the prediction for large apartments. We can also read that small changes in the variable construction year will not affect model predictions.
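A minimal sketch, reusing the random forest explainer from Section 2; the ceteris_paribus() call and its arguments are an assumption based on the ceterisParibus package documentation, not prescribed by this paper:

    library(ceterisParibus)

    # A single observation to explain (first row of the assumed test data)
    new_apartment <- apartmentsTest[1, ]

    # Ceteris Paribus profiles: vary one variable at a time, keep the rest fixed
    cp_rf <- ceteris_paribus(explainer_rf, new_apartment)
    plot(cp_rf)  # profiles of the model response around the selected observation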

Figure 6: Ceteris Paribus Plots - explainers for a single observation. The left plot shows how the model response for a single observation (predicted y on the OY axis) changes when a single variable is varied while all other variables remain constant (the ceteris paribus principle). The right plot shows the effects of all variables in the same coordinate system; on the OX axis, values are normalized through a quantile transformation.

4.2 Explainers for variable attribution

The best known approaches to explanations of a single prediction are the LIME method (Ribeiro et al., 2016), working best for local explanations, and Shapley values (Štrumbelj and Kononenko, 2010, 2014; Lundberg and Lee, 2017), working best for variable attribution. Break Down Plots are fast approximations of Shapley values. The methodology behind this method and a comparison of these three methods are presented in Staniak and Biecek (2018).

An example of BDP explainers, created with the prediction_breakdown() function, is presented in Figure 7 (access this explainer with archivist::aread("pbiecek/DALEX_arepo/72b47")). As one can read from the graph, in both models the largest increase in the prediction is due to the variable district = Srodmiescie (downtown). A large surface lowers the prediction in the random forest model, while the variable number of rooms has a larger impact in the random forest model than in the linear model.
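A minimal sketch, reusing the explainers from Section 2 and the observation from the sketch in Section 4.1; the observation argument name is an assumption based on the package documentation:

    # Attribute the prediction for a single observation to individual variables
    bd_lm <- prediction_breakdown(explainer_lm, observation = new_apartment)
    bd_rf <- prediction_breakdown(explainer_rf, observation = new_apartment)

    plot(bd_lm, bd_rf)  # Break Down Plots comparing both models, as in Figure 7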

Figure 7: Break Down Plots - explainers for a single observation that attribute parts of the model prediction to variables. The left plot shows how the random forest model’s response is decomposed into contributions of five variables. The right plot shows decompositions for two models. The gray rectangles show how the single model prediction differs from the population average (the reference), the blue rectangles show variables that increase the model prediction, while the yellow rectangles are related to variables that lower the model prediction.

5 Summary

Thinking about data modeling is currently dominated by feature engineering and model training. Kaggle competitions turn data modeling into a process that returns a single model with the highest accuracy. Tasks of that type may be easily automated. Such thinking about modeling is popular due to a lack of tools that can be used for model validation and richer domain verification.

In this article we have introduced a consistent methodology and a set of tools for model-agnostic explanations. The presented global explainers for model understanding and local explainers for prediction understanding are based on the uniform grammar introduced in Figure 2. Every explainer is constructed in a way that allows for a numerical summary, a visual summary and the comparison of multiple models.

The methodology is developed in a way that is easy to extend and is accompanied by broad technical documentation and rich training materials (https://pbiecek.github.io/DALEX_docs). The code is properly maintained and tested with tools for continuous integration.

6 Acknowledgments

The work was partially supported by the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 691152 (project RENOIR) and by NCN Opus grant 2016/21/B/ST6/02176.


References