1 Introduction
Modern applications of machine learning (ML) involve a complex web of social, political, and regulatory issues as well as the standard technical issues in ML such as covariate shift or training dataset bias. Although the most common diagnostic is simply test set accuracy, most of these issues cannot be resolved by test set accuracy alone—and sometimes these issues directly oppose high test set accuracy (e.g., privacy concerns). Other simple numeric diagnostics such as the variance of the residual in regression do not provide much insight into the ML system. Thus, to develop safe and robust systems, more diagnostics are needed.
In response to these issues, the literature has recently focused on the problem of interpretability and explainability in production ML systems, and the dangers that arise when these considerations are not properly taken into account. In some settings, model explanations are used as diagnostic tools to uncover hidden problems in the model that are not evident from test accuracy alone (Ribeiro et al., 2016). A common thread in much of this work is the reliance on local approximations to a model (e.g., Ribeiro et al., 2016, 2018; Guidotti et al., 2018a, b), which presents difficulties to end users in practice. These difficulties include questions regarding the quality and validity of an approximation, and how rapidly it breaks down or becomes unreliable. Thus, the use of these local approximations for specific, human-interpretable diagnostics to check model properties is tenuous at best, despite their continued utility in terms of providing high-level understanding. For these reasons, it is desirable to provide the model’s exact output to users (see Fig. 2 for a comparison).
In this paper, we propose a diagnostic framework for validating various model properties of complex machine learning models that goes significantly beyond test accuracy. We seek to find a single univariate curve in the input space whose model output maximally or minimally violates certain properties, which we encode by defining two key classes of curve utility functions. For example, we may seek a curve that is least monotonic (a surprising and often undesirable property in certain applications such as loan decisions). Once this curve is found, we display the model’s output along this univariate diagnostic curve for inspection so that users can visually understand the severity of the issue and quickly get a qualitative feel for the model along this curve. To aid in interpreting this curve visualization, we also allow the curve to pass through a real data point so that users can understand the context for this curve in high dimensional space if needed.
Since there are many possible curves, the key challenge is finding interesting or useful diagnostic curves that encode various model properties and displaying them in a useful way to users. Addressing these challenges is the core contribution of this paper. More specifically, we do the following: (1) We illustrate how diagnostic curves are simple to interpret, exact, nonlocal, and provide intuitive interpretations of the behavior of a model. (2) We develop a framework for finding interesting diagnostic curves by optimizing a utility measure over the space of curves, and define two broad families of utility measures that can be used to validate or invalidate various model properties. Our approach gracefully handles mixed continuous and discrete datasets. (3) We propose an optimization algorithm for finding sparse optimal diagnostic curves and evaluate its performance on synthetic models for which the true optimal values are known. (4) We demonstrate our diagnostic curve framework for several use cases, including nondifferentiable models, on multiple datasets such as loan data and traffic sign recognition.
Example.
To build some intuition, an example of the visualization of our diagnostic curve can be seen in Fig. 1, in which the exact model output is shown along the curve. The curve shows how the model score changes for the given target loan application (the red dot) as the categorical value of credit history changes from “critical account” to “all credits paid back” (note the solid line shifting downwards from where the target originally is), and as the numerical values of duration and age change (along the axis). This curve is found by searching for the curve that maximally differs from the constant model (i.e., it tests the simple property that the model is close to constant). To formalize this and develop our framework, there are two complications to overcome: 1) defining a proper notion of “curve” in the categorical feature space, and 2) defining diagnostic curves which maximally or minimally violate some interesting model property such as monotonicity, by proposing curve utility functions that can encode various model properties. The appropriate definitions are made in Section 2, while Sections 3 and 4 are devoted to the problem of selecting appropriate diagnostic curves.
Related work.
The importance of safety and robustness in ML is now well-acknowledged (Varshney and Alemzadeh, 2017; Varshney, 2016; Saria and Subbaswamy, 2019), especially given the current dynamic regulatory environment surrounding the use of AI in production (Hadfield and Clark, 2019; Wachter et al., 2017). A popular approach to auditing and checking models before deployment is “explaining” a black-box model post hoc. Both early and recent work in explainable ML relies on local approximations (e.g. Ribeiro et al., 2016, 2018; Guidotti et al., 2018a, b). Other recent papers have studied the influence of training samples or features (Datta et al., 2016; Dhurandhar et al., 2018). These have been extended to subsets of informative features (Lundberg and Lee, 2017; Datta et al., 2016; Chen et al., 2018) to explain a prediction. Other approaches employ influence functions (Koh and Liang, 2017) and prototype selection (Yeh et al., 2018; Zhang et al., 2018). A popular class of approaches takes the perspective of local approximations, such as linear approximations (Ribeiro et al., 2016; Lundberg and Lee, 2017) and saliency maps (Simonyan et al., 2013; Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017). A crucial caveat with such approximation-based methods is that the quality of the approximation is often unknown, and is typically only valid in small neighborhoods of an instance (although we note recent work on global approximations (Guo et al., 2018)). In contrast to these previous feature selection methods, our approach leverages a utility measure to select features (i.e., axis-aligned curves) or more general curves (as opposed to, e.g., mutual information or Shapley values).
Individual conditional expectation (ICE) plots show how the model prediction scores change when one feature is modified for every point in a set, such that each line of the plot corresponds to a point (Goldstein et al., 2015). The more classical partial dependence plots (PDP) show the average of the ICE plots over all data points (Friedman, 2001). The output of our method is similar to ICE and PDP plots because we show the model output along a curve. However, PDP and ICE do not define ways of selecting the most interesting plots, and in practice, they are often sorted by variance or range; we greatly generalize the way to select or sort the curves with two broad classes of curve utilities to validate (or invalidate) properties beyond variance, such as monotonicity or Lipschitz smoothness. Moreover, PDP and ICE only consider axis-aligned curves while we generalize to linear, nonlinear, and transformation curves.
2 Diagnostic Curves
We decompose the input space into a numeric feature space $\mathcal{X}$ (e.g. $\mathcal{X} = \mathbb{R}^d$) and a categorical feature space $\mathcal{C}$ (e.g. binary or multinomial features). Curves will typically be defined with respect to a target instance $(x_0, c_0) \in \mathcal{X} \times \mathcal{C}$; however, this is not necessary in our framework. For example, in many cases it may be best to search over curves based on a sample of possible target instances. Let $f : \mathcal{X} \times \mathcal{C} \to \mathbb{R}$ be a black-box function, i.e. we can query $f$ to obtain pairs $((x, c), f(x, c))$. We do not assume that $f$ is differentiable or even continuous, in order to allow nondifferentiable models such as random forests.
First, we define an appropriate class of candidate curves in the numeric feature space $\mathcal{X}$. Recall that a curve is a continuous function $\gamma : [0, 1] \to \mathcal{X}$. Here we assume $\mathcal{X}$ is a metric space (e.g. $\mathbb{R}^d$), so that the notion of continuity is well-defined. To enhance interpretability and help ensure the curves are near realistic points, we anchor diagnostic curves to a point of interest $x_0$, usually a real data point from the dataset, so that the curve passes through a point in the training data. To incorporate categorical features, consider a point $(x_0, c_0)$. As a curve $\gamma$ represents a continuous perturbation of the point $x_0$, we allow for a single discrete perturbation of $c_0$. More precisely, we define a set of categorical curves by $$\Gamma = \{(\gamma, \tilde{c}) : \gamma \text{ is a curve in } \mathcal{X},\ \tilde{c} \in \mathcal{C}\}.$$ In other words, we augment the curve $\gamma$ with a categorical alternative $\tilde{c}$. The idea is to find both a curve and a discrete alternative that maximally violate some desirable property of the model. For notational simplicity, we will use $\gamma$ to denote either a curve or a categorical curve in the rest of the paper. An example visualization of a diagnostic curve can be seen in Fig. 1, where the user can read off various predictions as a result of changing both categorical and numeric features. We conclude this section by illustrating several examples of curve parametrizations used in practice.
Linear curves.
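A natural family of curves to consider is the family of linear curves, i.e. straight lines passing through $x_0$. Let $x_{\min}$ and $x_{\max}$ be vectors containing the minimum and maximum of each feature over the training dataset. Let $\hat{p}$ denote the density fitted to the training data (this helps to encourage realistic curves; see the paragraph on realistic curves below for discussion). We parametrize linear curves by a unit vector $v$: $$\gamma_v(t) = x_0 + \big(a + t\,(b - a)\big)\, v,$$ where $t \in [0, 1]$, $\|v\|_2 = 1$, and $a \le 0 \le b$ are the most extreme step sizes $s$ for which $x_0 + s v$ stays within $[x_{\min}, x_{\max}]$ and in the region where $\hat{p}$ exceeds a threshold.
Nonlinear curves.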
The above linear parameterization could be simply extended to the nonlinear case given access to a nonlinear generative model $G : \mathbb{R}^k \to \mathcal{X}$ that generates input points in $\mathcal{X}$ given vectors in $\mathbb{R}^k$; examples of such models include VAEs (Kingma and Welling, 2014; Rezende et al., 2014) and deep generative models via normalizing flows (Germain et al., 2015; Papamakarios et al., 2017; Oliva et al., 2018; Dinh et al., 2017; Inouye and Ravikumar, 2018). We then allow for linear curves in the input space of the generative model rather than in $\mathcal{X}$ itself, so that we can parameterize nonlinear curves by a unit vector $v \in \mathbb{R}^k$: $$\gamma_v(t) = G(z_0 + t\, v),$$ where $z_0$ is an (approximate) input to the generative model that would have generated $x_0$, i.e., $G(z_0) \approx x_0$. Essentially, we optimize over linear curves in the latent space, which correspond to nonlinear curves in the input space. For VAEs (Kingma and Welling, 2014; Rezende et al., 2014), the decoder network acts as $G$ and the encoder network acts as an approximate $G^{-1}$. For normalizing flow-based models, $G^{-1}$ can be computed exactly by construction (Germain et al., 2015; Papamakarios et al., 2017; Oliva et al., 2018; Dinh et al., 2017; Inouye and Ravikumar, 2018).
Known transformations and their compositions.
In certain domains, there are natural classes of transformations that are semantically meaningful. Examples include adding white noise to an audio file or blurring an image, removing semantic content like one letter from a stop sign, or changing the style of an image using style transfer (Johnson et al., 2016). We now consider more general curves based on compositions of such known transformations, which could be simple or complex and nonlinear. We will denote each transformation as $T_\theta$ with parameter $\theta \in [0, 1]$, where $\theta = 0$ corresponds to the identity transformation. For example, if $T_\theta$ is a rotation operator, then $\theta$ would represent the angle of rotation, where $\theta = 0$ is 0 degrees and $\theta = 1$ is 180 degrees. Given an ordered set of $k$ different transformations denoted by $T^{(1)}, \dots, T^{(k)}$, we can define a general transformation curve using a composition of these simpler transformations: $$\gamma_v(t) = \big(T^{(k)}_{t v_k} \circ \cdots \circ T^{(1)}_{t v_1}\big)(x_0),$$ where $t \in [0, 1]$ and $v \in [0, 1]^k$ is the optimization parameter. We emphasize that the transformations can be arbitrarily nonlinear and complex, even deep neural networks.
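The composition of known transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy transformations (`brighten`, `reduce_contrast`), the flattened-list image representation, and the function name `transformation_curve` are hypothetical and are not taken from the paper. Each transformation takes a parameter in $[0, 1]$ and acts as the identity at 0.

```python
from typing import Callable, List, Sequence

Image = List[float]                          # flattened grayscale image, values in [0, 1]
Transform = Callable[[Image, float], Image]  # T(x, theta); theta = 0 is the identity


def brighten(x: Image, theta: float) -> Image:
    """Move every pixel towards white; theta = 1 gives an all-white image."""
    return [min(1.0, p + theta * (1.0 - p)) for p in x]


def reduce_contrast(x: Image, theta: float) -> Image:
    """Move every pixel towards the mean intensity; theta = 1 gives a flat image."""
    mean = sum(x) / len(x)
    return [p + theta * (mean - p) for p in x]


def transformation_curve(x0: Image, transforms: Sequence[Transform],
                         v: Sequence[float]) -> Callable[[float], Image]:
    """Return gamma(t), which applies each transform in order with parameter t * v_i."""
    def gamma(t: float) -> Image:
        x = list(x0)
        for transform, weight in zip(transforms, v):
            x = transform(x, t * weight)
        return x
    return gamma


x0 = [0.2, 0.5, 0.9]
gamma = transformation_curve(x0, [brighten, reduce_contrast], v=[0.5, 1.0])
print(gamma(0.0))  # the anchor image itself
print(gamma(1.0))  # half-brightened, then flattened to its mean intensity
```

The vector `v` plays the role of the optimization parameter: setting an entry to zero switches the corresponding transformation off, so a sparse `v` yields a curve that varies only a few transformations.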
Realistic curves.
In some contexts, it is desirable to explicitly audit the behavior of $f$ off its training data manifold. This is helpful for detecting unwanted bias and checking robustness to rare, confusing, or adversarial examples, especially in safety-critical applications such as autonomous vehicles. In other contexts, we may wish to ensure that the selected diagnostic curves are realistic, that is, that they respect the training data. For example, it may not be useful to show an example of a high-income person who also receives social security benefits. Fortunately, it is not hard to enforce this constraint: First, bound the curve using a box constraint based on the minimum and maximum of each feature. Then train a density model on the training data to account for feature dependencies, and further restrict the bounds of the curve to lie only in regions with a density value above a threshold.
For selecting among these varied parametrizations, if the data already has semantically interpretable features, we suggest using a linear parametrization with a sparse parameter vector so that the curves are directly interpretable. On the other hand, when the raw features are not directly interpretable, such as in computer vision, speech recognition, or medical imaging, we suggest using a linear parameterization of the input to a deep generative model, or simply domain-specific transformations that are meaningful and useful for a domain expert.
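The linear parametrization with the box and density restrictions described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`step_bounds`, `shrink_to_density`, `linear_curve`), the grid-based outward walk, and the assumption that the anchor point lies inside the box are all choices made here for illustration. The density model is passed in as an arbitrary callable.

```python
from typing import Callable, List, Sequence, Tuple

Point = List[float]


def step_bounds(x0: Sequence[float], v: Sequence[float],
                x_min: Sequence[float], x_max: Sequence[float]) -> Tuple[float, float]:
    """Largest interval [a, b] of step sizes s with x_min <= x0 + s * v <= x_max."""
    a, b = float("-inf"), float("inf")
    for xi, vi, lo, hi in zip(x0, v, x_min, x_max):
        if vi == 0.0:
            continue  # this feature never changes along the curve
        s1, s2 = (lo - xi) / vi, (hi - xi) / vi
        a, b = max(a, min(s1, s2)), min(b, max(s1, s2))
    if a == float("-inf"):
        raise ValueError("direction v must have at least one nonzero coordinate")
    return a, b


def shrink_to_density(x0: Sequence[float], v: Sequence[float], a: float, b: float,
                      density: Callable[[Point], float], threshold: float,
                      n_grid: int = 50) -> Tuple[float, float]:
    """Walk outwards from s = 0 and stop where the density drops below threshold.

    Assumes a <= 0 <= b, i.e. the anchor x0 lies inside the box.
    """
    def point(s: float) -> Point:
        return [xi + s * vi for xi, vi in zip(x0, v)]

    new_a, new_b = 0.0, 0.0
    for k in range(1, n_grid + 1):
        s = b * k / n_grid
        if density(point(s)) < threshold:
            break
        new_b = s
    for k in range(1, n_grid + 1):
        s = a * k / n_grid
        if density(point(s)) < threshold:
            break
        new_a = s
    return new_a, new_b


def linear_curve(x0: Sequence[float], v: Sequence[float],
                 a: float, b: float) -> Callable[[float], Point]:
    """Return gamma(t) = x0 + (a + t * (b - a)) * v for t in [0, 1]."""
    def gamma(t: float) -> Point:
        s = a + t * (b - a)
        return [xi + s * vi for xi, vi in zip(x0, v)]
    return gamma
```

With a sparse direction `v` (one to three nonzero coordinates), the resulting curve changes only a few named features, which is what keeps it directly interpretable.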
3 Diagnostic Curve Utility Measures
Given these curve definitions and parametrizations, we now consider the problem of optimizing over curves to find diagnostic curves which maximally or minimally violate a property along the curve. We define the diagnostic utility of a curve as follows:
Definition 1.
A diagnostic utility measure is a function $u : \Gamma \to \mathbb{R}$ that depends on the model $f$ and the target point $x_0$.
In this section, we carefully develop and define two general classes of diagnostic utility measures that can be used to validate (or invalidate) various model properties. We of course do not claim that our proposed set of utility measures is exhaustive, and indeed, this paper will hopefully lead to more creative and broader classes of utility measures that are practically useful.
3.1 Model contrast utility measures.
A simple and natural way to measure the utility of a curve is to contrast the model’s output along it with the output of a different model, such as a constant model. Given another model $g$, we define the contrast utility measure:
$$u(\gamma; f, g) = \mathbb{E}_{t \sim \mathrm{Unif}[0,1]}\Big[\ell\big(f(\gamma(t)),\, g(\gamma(t))\big)\Big], \tag{1}$$
where $\ell$ is a loss function, e.g. squared or absolute loss. Recall that an arbitrary curve in $\Gamma$ is represented by the tuple $(\gamma, \tilde{c})$. The simplest case is where the contrast model is held constant at the original prediction of the target instance: $g(x) = f(x_0)$. The comparison model could be another simple model like logistic regression or some local approximation of the model around the target instance, such as explanations produced by LIME (Ribeiro et al., 2016); this would provide a way to critique local approximation methods and show where and how they differ from the true model.
Example 1 (Contrast with Validated Linear Model).
Suppose an organization has already deployed a carefully validated linear model, i.e., the linear parameters were carefully checked by domain experts to make sure the model behaves reasonably with respect to all features. The organization would like to improve the model’s performance by using a new model, but wants to see how the new model compares to their carefully validated linear model to see where it differs the most. In this case, the organization could let the contrast model be their previous model, i.e., $g = f_{\text{linear}}$, where $g$ does not depend on the target point $x_0$; see Fig. 1 for an example of a constant contrast model.
Example 2 (Contrast Random Forest and DNN).
An organization may want to compare two different model classes, such as random forests and deep neural networks (DNN), to diagnose if there are significant differences between these model classes or if they are relatively similar. Either the random forest or the DNN model could be the contrast model $g$; again, $g$ does not depend on the target point $x_0$.
Example 3 (Contrast with Constant Model).
Sometimes we may want to create a contrast model that is dependent on the target point $x_0$. The simplest case is where the contrast model is a constant fixed to the original prediction value, i.e., $g(x) = f(x_0)$. Note that in this case the comparison function depends on $x_0$. This contrast to a constant model can find curves that deviate the most from the prediction across the whole curve; this implicitly finds features or curves that are not flat and thus finds curves that significantly affect the prediction value.
Example 4 (Contrast with local approximations used for explanations).
We could also use our diagnostic curves to contrast the true model with explanation methods based on local approximation, such as LIME (Ribeiro et al., 2016) or gradient-based explanation methods (Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017). We can simply use the local approximation to the model centered at the target point as the contrast model, i.e., $g = \hat{f}_{x_0}$, where $\hat{f}_{x_0}$ is the local approximation centered around $x_0$. Thus, the found diagnostic curve will show the maximum deviation of the true model from the local model. Importantly, this allows our diagnostic method to assess the goodness of local approximation explanation methods, showing when they are reasonable and when they may fail; see Fig. 2.
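The contrast utility can be estimated by averaging the loss over a grid of points along the curve. The sketch below assumes a uniform grid on $[0, 1]$ and squared loss by default; the function names `contrast_utility` and `constant_contrast` and the grid size are illustrative choices, not part of the paper.

```python
from typing import Callable, List

Point = List[float]
Model = Callable[[Point], float]


def contrast_utility(f: Model, g: Model, gamma: Callable[[float], Point],
                     loss: Callable[[float, float], float] = lambda a, b: (a - b) ** 2,
                     n_grid: int = 101) -> float:
    """Average loss between f and the contrast model g along the curve gamma."""
    ts = [i / (n_grid - 1) for i in range(n_grid)]
    return sum(loss(f(gamma(t)), g(gamma(t))) for t in ts) / n_grid


def constant_contrast(f: Model, x0: Point) -> Model:
    """Contrast model that always returns the prediction at the target point x0."""
    y0 = f(x0)
    return lambda x: y0


# Toy example: f is quadratic in its first feature, and the curve moves that feature.
f = lambda x: x[0] ** 2
gamma = lambda t: [t]
g = constant_contrast(f, [0.0])
print(contrast_utility(f, g, gamma))  # grows as f moves away from f(x0) along the curve
```

Swapping `g` for a previously validated model or for a local surrogate gives the other examples above without changing the utility code.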
3.2 Functional property validation utility measures.
In many contexts, a user may be more interested in validating or invalidating expected functional properties of a model, such as monotonicity or smoothness. For example, if it is expected that a model should be increasing with respect to a feature (e.g. income in a loan scoring model), then we would like to check that this property holds true. Let $\mathcal{H}$ be a class of univariate functions that encodes acceptable or expected behaviors. To measure deviation from this class, take the minimum expected loss over all $h \in \mathcal{H}$:
$$u(\gamma; f, \mathcal{H}) = \min_{h \in \mathcal{H}}\ \mathbb{E}_{t \sim \mathrm{Unif}[0,1]}\Big[\ell\big(f(\gamma(t)),\, h(t)\big)\Big], \tag{2}$$
where, as usual, $\ell$ is a loss function. The minimization in (2) is a univariate regression problem that can be solved using standard techniques. This utility will find diagnostic curves that maximally or minimally violate a functional property along the curve with respect to some given class of univariate functions.
Example 5 (Monotonic property validation via isotonic regression).
In many applications, it may be known that the model output should behave simply with respect to certain features. For example, one might expect that the score is monotonic in terms of income in a loan scoring model. In this case, $\mathcal{H}$ should be the set of all monotonic functions. The resulting problem can be efficiently solved using isotonic regression (Best and Chakravarti, 1990), and this is what we do in our experiments; see Fig. 4 for an example of validating (or invalidating) the monotonic property.
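As a rough illustration of this utility, the sketch below fits the best monotone function to the model outputs sampled along a curve using the pool-adjacent-violators algorithm, and reports the mean squared residual. It is a self-contained stand-in for a library isotonic regression routine; the function names, the uniform grid, and the choice to take the better of the increasing and decreasing fits are assumptions made here.

```python
from typing import Callable, List, Sequence


def isotonic_fit(y: Sequence[float]) -> List[float]:
    """Least-squares nondecreasing fit to y via pool-adjacent-violators."""
    blocks: List[List[float]] = []  # each block stores [sum of values, count]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit: List[float] = []
    for total, count in blocks:
        fit.extend([total / count] * int(count))
    return fit


def monotonicity_violation(f: Callable[[List[float]], float],
                           gamma: Callable[[float], List[float]],
                           n_grid: int = 101) -> float:
    """Mean squared distance from f along gamma to the closest monotone function."""
    ys = [f(gamma(i / (n_grid - 1))) for i in range(n_grid)]
    best = float("inf")
    for sign in (1.0, -1.0):  # try nondecreasing, then nonincreasing
        signed = [sign * y for y in ys]
        fit = isotonic_fit(signed)
        err = sum((a - b) ** 2 for a, b in zip(signed, fit)) / len(ys)
        best = min(best, err)
    return best


print(isotonic_fit([1, 3, 2, 4]))  # the violating pair (3, 2) is pooled to its mean
```

A value of zero means the model is monotone along the sampled curve; larger values indicate stronger violations, which is what the "least monotonic" search maximizes.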
Example 6 (Lipschitz-bounded property validation via constrained least squares).
Another property that an organization might want to validate is whether the function is Lipschitz-bounded along the curve. Formally, they may wish to check if the following condition holds:
$$\big|f(\gamma(t_1)) - f(\gamma(t_2))\big| \le L\,|t_1 - t_2| \quad \text{for all } t_1, t_2 \in [0, 1], \tag{3}$$
where $L$ is a fixed Lipschitz constant. Thus, the corresponding class of functions $\mathcal{H}$ is the set of Lipschitz continuous functions with a Lipschitz constant of $L$. In practice, we can solve this problem via constrained least squares optimization, similar to isotonic regression (details in supplement). An example of using Lipschitz-bounded functions for $\mathcal{H}$ can be seen in Fig. 3. This utility will find the curve that maximally violates the Lipschitz condition; this curve may also be useful in finding where the model is particularly sensitive to changes in the input, since the derivative along the curve will be larger than the Lipschitz constant.
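One plausible way to write the constrained least squares problem, assuming squared loss and an ordered grid $t_1 < t_2 < \dots < t_n$ of evaluation points along the curve (the supplement's exact formulation may differ), is:

```latex
\min_{h_1, \dots, h_n} \ \frac{1}{n} \sum_{i=1}^{n} \bigl( h_i - f(\gamma(t_i)) \bigr)^2
\quad \text{subject to} \quad
|h_{i+1} - h_i| \le L \, (t_{i+1} - t_i), \qquad i = 1, \dots, n - 1 .
```

Here $h_i$ stands for the fitted value $h(t_i)$. Constraining only adjacent grid points is enough, because for any $i < j$ the triangle inequality gives $|h_j - h_i| \le \sum_{k=i}^{j-1} |h_{k+1} - h_k| \le L\,(t_j - t_i)$. The constraints are linear in $h$, so this is a convex quadratic program, structurally similar to isotonic regression, where the constraints are $h_{i+1} - h_i \ge 0$.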
4 Optimization for Diagnostic Curves
So far we have discussed: 1) how to parametrize curves and 2) how to evaluate the diagnostic utility of a curve. With these pieces in place, the objective is to find diagnostic curves that are simultaneously interpretable and useful, by restricting the curves under consideration and optimizing for the curve that shows the best or worst behavior with respect to certain properties. Formally, let $x_0$ be the target instance, $u$ a utility measure, and $\Gamma$ a class of candidate curves. For instance, $\Gamma$ could be the class of linear or transformation curves outlined in Section 2. A natural method for obtaining high-utility diagnostic curves is to simply optimize over $\Gamma$:
$$\max_{\gamma \in \Gamma_{\mathcal{X}},\ \tilde{c} \in \Gamma_{\mathcal{C}}} \ u(\gamma, \tilde{c}), \tag{4}$$
where $\Gamma_{\mathcal{X}}$ and $\Gamma_{\mathcal{C}}$ are the corresponding subsets of $\Gamma$ (i.e. projections onto $\mathcal{X}$ and $\mathcal{C}$, respectively). In our experiments with linear curves $\gamma_v$, we will make the following restrictions for the optimization to enforce sparsity and interpretability: $\|v\|_0 \le k$ for $k \in \{1, 2, 3\}$ and $\|\tilde{c} - c_0\|_0 \le 1$, i.e., only one, two, or three numeric features can change, and zero or one categorical features can change.
The optimization problem in (4) is usually nonconvex and nonsmooth because $f$ is assumed to be an arbitrary black box, where only function evaluations are possible. In fact, this is why we did not make any smoothness assumptions on the utility measure $u$ in Def. 1: The potential lack of smoothness in $f$ obviates the advantages of smoothness in $u$. This model-agnostic setup generalizes to many interesting settings, including ones in which the model is private, distributed, or nondifferentiable (e.g. boosted trees), such that exact gradient information would be impossible or difficult to compute. Given these restrictions, we must resort to zeroth-order optimization. While we could use general zeroth-order optimization techniques, we require that curves are sparse so that the resulting diagnostic curves are interpretable. To accomplish this, we propose a greedy optimization scheme called greedy coordinate pairs (GCP) that adds nonzero coordinates in a greedy fashion, as outlined in Algorithm 1. We initialize this algorithm by finding the single feature with the highest utility.
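Algorithm 1 is not reproduced in this section, so the following is only a hypothetical, simplified sketch of a greedy zeroth-order search in the spirit described above, not the GCP algorithm itself. It assumes the utility is available as a black-box function of a unit direction `v`, initializes with the best single coordinate, and then tries to add one coordinate at a time by rotating the current direction towards that coordinate over a small grid of angles. The function name, the angle grid, and the stopping rule are all assumptions.

```python
import math
from typing import Callable, List, Tuple

Utility = Callable[[List[float]], float]  # black-box score of a unit direction v


def greedy_sparse_direction(utility: Utility, dim: int, max_nonzero: int = 3,
                            n_angles: int = 16) -> Tuple[List[float], float]:
    """Greedily grow the support of a unit direction using only utility evaluations."""
    # Initialization: the single coordinate (axis-aligned curve) with the highest utility.
    best_v: List[float] = []
    best_u = -math.inf
    for j in range(dim):
        v = [0.0] * dim
        v[j] = 1.0
        u = utility(v)
        if u > best_u:
            best_v, best_u = v, u

    # Greedy rounds: rotate the current direction towards one unused coordinate.
    for _ in range(max_nonzero - 1):
        current, improved = best_v, False
        for j in range(dim):
            if current[j] != 0.0:
                continue  # coordinate is already in the support
            for k in range(1, n_angles):
                angle = math.pi * k / n_angles
                c, s = math.cos(angle), math.sin(angle)
                if abs(c) < 1e-9:
                    continue  # would discard the current support entirely
                candidate = [c * w for w in current]
                candidate[j] = s  # stays a unit vector since current[j] == 0
                u = utility(candidate)
                if u > best_u:
                    best_v, best_u, improved = candidate, u, True
        if not improved:
            break  # adding another coordinate does not help
    return best_v, best_u
```

Because each round only adds a coordinate when it strictly improves the utility, the returned direction has at most `max_nonzero` nonzero entries, which mirrors the sparsity restriction used in the experiments.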
Optimization evaluation.
To test the effectiveness of GCP (Algorithm 1), we ran two tests. First, we compared the optimal utility values returned by GCP to the utility of 10,000 randomly selected curves based on a random forest classifier. In all cases, GCP returned the highest values. For example, GCP found a diagnostic curve with , compared with for the best random curve. In the second experiment, we generated random synthetic models in which the highest utility curves could be determined in advance, making it possible to evaluate how often GCP selects the optimal curves. We evaluated its performance on curves with at most one, two, or three nonzero coordinates, and found that in 100%, 97%, and 98% of simulations, respectively, GCP found the optimal curve. In the few cases where GCP did not find the optimal curve, this was due to the randomness in generating examples whose optimal curves are more difficult to identify (e.g., the optimal curve was nearly constant). Details and further results on these experiments can be found in the appendix.
5 Experiments
We present four concrete use cases: 1) selecting among several models, 2) evaluating robustness to covariate shift, 3) out-of-sample behavior in computer vision, and 4) bias detection. We defer some details of the experiments to the appendix.
5.1 Selecting a model for loan prediction.
Suppose we have trained multiple models, each with similar test accuracies. Which of these models should be deployed? To answer this question, diagnostic curves can be used to detect undesirable or unexpected behaviors. We explore this use case of qualitative model selection on a dataset of German loan application data in order to find nonmonotonic curves. For example, it may be undesirable to deploy a model that decreases a candidate’s score in response to an increase in their income. We train a decision tree, a random forest, a kernel SVM, a gradient boosted tree, and a deep neural network. For this simple example, we consider linear curves with only one nonzero coordinate to help ensure interpretability; additionally, we optimize the utility over 100 random target points, thus providing an estimate of the worst-case behavior of the model. The test accuracies of these models were all close to 62%; thus, a stakeholder would not be able to select a model based on accuracy alone. The curves can be seen in Fig. 4. In addition to a single number that quantifies nonmonotonicity, our diagnostic curve visualizations show the user both the location and severity of the worst-case nonmonotonicity. For example, the visualizations suggest that gradient boosting may be preferable since its worst-case behavior is nearly monotonic, whereas other models are far from monotonic.
5.2 Evaluating robustness under covariate shift in loan prediction.
Using the same dataset as in the previous example, we compare the behavior of the same model over different regions of the input space. The motivation is covariate shift: A bank has successfully deployed a loan prediction model but has historically only focused on high-income customers, and is now interested in deploying this model on low-income customers. Is the model robust? Once again we use diagnostic curves to detect undesirable behavior such as nonmonotonicity. We trained a deep neural network that achieves 80% accuracy on high-income data and a comparable 76% accuracy on the unseen low-income data. Moving beyond accuracy, we generated diagnostics using the least monotonic utility, optimizing over all target points in the training data (i.e. high-income) and the unseen test data (i.e. low-income), as can be seen in Fig. 5. The curves explicitly display the difference between the worst-case nonmonotonicity for high-income (Fig. 5, left) and low-income (Fig. 5, right), which appears to be minimal, giving stakeholders confidence for deploying this model.
5.3 Understanding out-of-sample behavior for traffic sign detection.
For the safety of autonomous cars, understanding how the model behaves outside of the training dataset is often critical. For example, will the model predict correctly if a traffic sign is rotated even though the training data does not have rotated signs? In this use case, we train a convolutional neural network on the German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2011), which achieves 97% test accuracy. We consider transformation curves based on five image transformations: rotate, blur, brighten, desaturate, and contrast. Each of these transformations creates images outside of the training data, and we are interested in finding the least and most sensitive combinations of transformations. For this, it is reasonable to use both the constant model comparison and functional property validation utilities. Fig. 6 depicts the resulting diagnostic curves. The most sensitive direction (top) simultaneously adjusts contrast and brightness, which is expected since this transformation gradually removes almost all information about the stop sign. The least sensitive direction (middle) adjusts saturation and contrast, which may be surprising since it suggests that the model ignores color. Finally, the least monotonic direction (bottom) is rotation, which suggests that the model identifies a white horizontal region but ignores the actual letters “STOP”, since it still predicts correctly when the stop sign is flipped over.
5.4 Bias detection in recidivism model prediction.
In many real-world applications, it is essential to check models for bias. A contemporary example of significant interest in recent years concerns recidivism prediction instruments (RPIs), which use data to predict the likelihood of a convicted criminal to reoffend (Chouldechova, 2017; Chouldechova and G’Sell, 2017). In this setting, a simple diagnostic for viewing model bias is the following: Given an instance $(x_0, c_0)$, consider what the output of $f$ would have been had the protected attribute (e.g. race or gender) been flipped. In certain cases, such protected attributes might not be explicitly available, in which case we could use proxy attributes or factors, though we do not focus on that here. We also emphasize that this does not formally assess the fairness of the model or system, which is a much deeper question, but is meant as a simple validation check, potentially elucidating biases in the model itself introduced because of dataset bias. A model that ignores the protected attributes would see little or no change in $f$ as a result of this change. In this simple example, we explicitly ignore dependencies between the protected attribute and other features, though this would be important to consider for any significant conclusions to be made. Given this situation, we select the model contrast utility (Section 3.1) with a special definition for the comparison model, defined as follows: $g(x, c) = f(x, \tilde{c})$, where $\tilde{c}_j = 1 - c_j$ for the protected attribute $j$ and $\tilde{c}_k = c_k$ for all $k \neq j$, essentially flipping only the protected attribute and leaving all other features untouched.
There are two cases: Either (a) no such curve deviates too much, in which case this is evidence that $f$ is not biased, or (b) there is a dimension along which $f$ is significantly biased. A diagnostic curve for flipping race from white to black, based on data from Schmidt and Witte (1988) and using a kernel SVM model, can be seen in Fig. 7. One can clearly see that the model behaves in biased ways with respect to race: The effect of time served on the risk score clearly depends on race and even switches bias direction. For this high-stakes application, the use of such a model could be problematic, and our diagnostic curves highlight this fact easily even for nontechnical users. Finally, these diagnostic curves avoid averaging the data into a single metric and offer more insight into the location and form of the model bias, which may depend on the inputs in complex ways.
6 Conclusion and Discussion
We have introduced a framework for finding useful, interpretable diagnostic curves to check various properties of a black-box model. These diagnostic curves are intuitive to understand and can be tailored to different use cases by adopting different utility measures or classes of curves. Furthermore, the methodology is flexible and is shown to handle discrete features as well as nondifferentiable models seamlessly. Through several use cases, we have demonstrated their effectiveness in diagnosing, understanding, and auditing the behavior of ML systems. We conclude by discussing a few challenges and practical points in the use of diagnostic curves. A key decision in practice is selecting an appropriate utility measure, for which we have introduced two broad classes that cover many relevant cases. Since it is easy to apply the method to new utilities, we emphasize this is a feature that allows practitioners to tailor the curves as needed to their use cases. This is important since the desiderata for validating models are hardly universal, and will change depending on the scenario. Given that these diagnostic curves can give evidence that validates or invalidates desirable model properties, the natural next question is “How can models be designed to ensure they have certain of these desirable properties, i.e., no diagnostic curve exists that violates the desirable model property?” While answering this question is out of the scope of this paper, our framework will likely inspire further investigation into designing models to have these desirable properties by construction. More generally, our novel diagnostic curve framework provides a foundation for diagnostics for black-box models that can aid in building more robust and safe ML systems.
Acknowledgments
The authors acknowledge the support of DARPA via FA87501720152, and Accenture. D. I. Inouye acknowledges support from Northrop Grumman.
References
 Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In KDD, 2016.
 Ribeiro et al. [2018] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI, 2018.
 Guidotti et al. [2018a] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Dino Pedreschi, Franco Turini, and Fosca Giannotti. Local rule-based explanations of black box decision systems, 2018a.
 Guidotti et al. [2018b] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2018b. doi: https://doi.org/10.1145/3236009.
 Varshney and Alemzadeh [2017] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big data, 5(3):246–255, 2017.
 Varshney [2016] Kush R Varshney. Engineering safety in machine learning. In 2016 Information Theory and Applications Workshop (ITA), pages 1–5. IEEE, 2016.
 Saria and Subbaswamy [2019] Suchi Saria and Adarsh Subbaswamy. Tutorial: Safe and reliable machine learning. arXiv preprint arXiv:1904.07204, 2019.
 Hadfield and Clark [2019] Gillian Hadfield and Jack Clark. Regulatory markets for AI safety. Safe Machine Learning workshop at ICLR, 2019.
 Wachter et al. [2017] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. SSRN Electronic Journal, Nov. 2017. doi: 10.2139/ssrn.3063289.
 Datta et al. [2016] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
 Dhurandhar et al. [2018] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. arXiv preprint arXiv:1802.07623, 2018.
 Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NIPS, 2017.
 Chen et al. [2018] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. Learning to explain: An information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814, 2018.
 Koh and Liang [2017] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
 Yeh et al. [2018] Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K Ravikumar. Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems, pages 9291–9301, 2018.
 Zhang et al. [2018] Xin Zhang, Armando Solar-Lezama, and Rishabh Singh. Interpreting neural network judgments via minimal, stable, and symbolic corrections. In Advances in Neural Information Processing Systems, pages 4874–4885, 2018.
 Simonyan et al. [2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
 Shrikumar et al. [2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
 Smilkov et al. [2017] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
 Guo et al. [2018] Wenbo Guo, Sui Huang, Yunzhe Tao, Xinyu Xing, and Lin Lin. Explaining deep learning models–a bayesian nonparametric approach. In Advances in Neural Information Processing Systems, pages 4514–4524, 2018.
 Goldstein et al. [2015] Alex Goldstein, Adam Kapelner, Justin Bleich, and Emil Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, jan 2015.
 Friedman [2001] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
 Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 Germain et al. [2015] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In ICML, 2015.
 Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In NIPS, 2017.
 Oliva et al. [2018] Junier B Oliva, Avinava Dubey, Manzil Zaheer, Barnabás Póczos, Ruslan Salakhutdinov, Eric P Xing, and Jeff Schneider. Transformation autoregressive networks. In ICML, 2018.
 Dinh et al. [2017] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In ICLR, 2017.
 Inouye and Ravikumar [2018] D. I. Inouye and P. Ravikumar. Deep density destructors. In International Conference on Machine Learning (ICML), pages 2172–2180, jul 2018.
 Johnson et al. [2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 694–711, Cham, 2016. Springer International Publishing. ISBN 9783319464756.
 Best and Chakravarti [1990] Michael J Best and Nilotpal Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47(1-3):425–439, 1990.
 Stallkamp et al. [2011] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pages 1453–1460, 2011.
 Chouldechova [2017] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
 Chouldechova and G’Sell [2017] Alexandra Chouldechova and Max G’Sell. Fairer and more accurate, but for whom? arXiv preprint arXiv:1707.00046, 2017.
 Schmidt and Witte [1988] Peter Schmidt and Ann D Witte. Predicting Recidivism in North Carolina, 1978 and 1980. Inter-university Consortium for Political and Social Research, 1988.
Appendix A Lipschitz-bounded property validation formulated as a constrained least squares problem
Suppose we have a grid of values and corresponding to model outputs along the curve . Now let and let be defined as follows:
(5) 
Now we solve the following simple least squares problem:
(6) 
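The symbols of this appendix are not reproduced here, so the following is only a sketch of one bounded least-squares formulation that matches the verbal description: the first coefficient is a free intercept, and each remaining coefficient is the slope between two neighbouring grid points, constrained to lie in [-L, L]. The names `lipschitz_fit`, `t`, `y`, and `lipschitz_bound` are illustrative assumptions, and SciPy's `lsq_linear` is used as a convenient solver, not necessarily the one used by the authors.

```python
import numpy as np
from scipy.optimize import lsq_linear


def lipschitz_fit(t, y, lipschitz_bound):
    """Fit a piecewise-linear curve to (t, y) whose slopes lie in [-L, L].

    Returns (beta, fitted): beta[0] is the fitted value at t[0] and
    beta[1:] are the slopes of the segments between consecutive grid points.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    k = len(t)
    delta = np.diff(t)  # spacing between consecutive grid points

    # Row i reconstructs the fitted value at t[i]:
    #   fitted[i] = beta[0] + sum_{j < i} beta[j + 1] * delta[j]
    A = np.zeros((k, k))
    A[:, 0] = 1.0
    for j in range(k - 1):
        A[j + 1:, j + 1] = delta[j]

    # The intercept is unconstrained; every slope is bounded by L.
    lower = np.full(k, -float(lipschitz_bound))
    upper = np.full(k, float(lipschitz_bound))
    lower[0], upper[0] = -np.inf, np.inf

    result = lsq_linear(A, y, bounds=(lower, upper))
    return result.x, A @ result.x


# Toy check: a steep line (slope 3) fitted under a Lipschitz bound of 1.
t = np.linspace(0.0, 1.0, 11)
beta, fitted = lipschitz_fit(t, 3.0 * t, lipschitz_bound=1.0)
# Mean squared gap between the model outputs and the best L-Lipschitz fit;
# it is zero when the outputs already satisfy the bound on this grid.
violation = np.mean((3.0 * t - fitted) ** 2)
```

A nonzero `violation` on a given curve would then serve as evidence that the model output along that curve is not L-Lipschitz on the chosen grid.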
Notice that the first coordinate is unconstrained and represents . The rest of correspond to the slope of a line connecting each point; thus and , etc. Thus,
and our approximation is merely a linear interpolation using
and .
Appendix B Synthetic Optimization Figure
See Fig. 8.
Appendix C Expanded figure for bias detection
See Fig. 9.
Appendix D Optimization evaluation details
Synthetic experiment.
We create a synthetic model to test our optimization algorithm. Consider a function for , and . We consider two utilities: 1) the model comparison utility with the first-order Taylor series approximation to , i.e. , and 2) the least monotonic utility. The ground-truth best linear curves for this model with respect to these two utilities have directions along and , respectively. Sinusoidal functions are indeed less monotonic than linear functions, and they also deviate from the first-order Taylor series approximation, which is linear. We verify that these directions are correctly found using our optimization algorithm, as seen in Figure 8.
To verify more generally that GCP finds the correct direction that maximizes the least monotonic utility, we simulated random model behaviors that are non-monotonic in certain directions and compared the diagnostic curves found with GCP against the ground-truth curves. To introduce non-monotonicity, a random set of polynomial functions was used, with the additional constraint that the utilities of the curves along the ground-truth directions are nonzero and above a certain threshold, while the utilities along the non-ground-truth directions are relatively small or almost zero.
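As a rough illustration of why a sinusoidal direction scores higher than a linear one under a least-monotonic utility, the sketch below measures the mean squared distance between the model outputs along a line and their best monotone fit. The toy model `sin(x_0) + x_1`, the function names, and this particular definition of the utility are assumptions made for illustration; the monotone fit uses a plain pool-adjacent-violators pass rather than any specific implementation from the paper.

```python
import numpy as np


def isotonic_fit(y):
    """Best non-decreasing least-squares fit to y (pool adjacent violators)."""
    blocks = []  # each block stores [mean, count]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while neighbouring blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return np.concatenate([np.full(n, m) for m, n in blocks])


def nonmonotonicity(y):
    """Mean squared distance from y to its closest monotone sequence."""
    y = np.asarray(y, dtype=float)
    increasing = np.sum((y - isotonic_fit(y)) ** 2)
    # The best non-increasing fit to y is -isotonic_fit(-y).
    decreasing = np.sum((y + isotonic_fit(-y)) ** 2)
    return min(increasing, decreasing) / len(y)


def toy_model(X):
    """Assumed toy model: sinusoidal in x_0, linear in x_1."""
    return np.sin(X[:, 0]) + X[:, 1]


def line_outputs(model, x0, direction, t):
    """Model outputs along the line x0 + t * direction (unit direction)."""
    direction = direction / np.linalg.norm(direction)
    return model(x0[None, :] + t[:, None] * direction[None, :])


x0 = np.zeros(2)
t = np.linspace(-4.0, 4.0, 81)
u_sin = nonmonotonicity(line_outputs(toy_model, x0, np.array([1.0, 0.0]), t))
u_lin = nonmonotonicity(line_outputs(toy_model, x0, np.array([0.0, 1.0]), t))
```

In this toy setting `u_lin` is zero because the output along the second axis is already monotone, while `u_sin` is strictly positive, so a search over directions should prefer the first axis.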
Utility histogram.
Another empirical way to check our optimization method is to randomly sample curves and compute their utilities; we can then compare these to the utility of our optimized curve. We generate diagnostic curves using a random forest on the loan application data (see Section 5 for more data details). For interpretability, we restrict the curve parameter to have at most three nonzeros. We sample uniformly from all directions that have at most three nonzeros, i.e. . The histogram of utility values (log scale) in Fig. 10 shows that the utility of our optimized curve (red line) is clearly better than the utility of random directions. In addition, even if we do not find the global optimum, our optimized diagnostic curves can still be useful in highlighting interesting parts of the model (see the use-case experiments). This shows that even though the optimization problem is quite difficult, we can perform well empirically.
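The random-baseline check described above could be sketched as follows. This is a minimal sketch under stated assumptions: directions are drawn with exactly three nonzeros (a random support filled with Gaussian entries and normalized), `toy_utility` is a placeholder rather than one of the paper's utilities, and `optimized_dir` stands in for the output of the actual optimizer.

```python
import numpy as np


def sample_sparse_directions(rng, dim, n_samples, n_nonzeros=3):
    """Draw unit-norm directions with `n_nonzeros` nonzero coordinates."""
    directions = np.zeros((n_samples, dim))
    for row in directions:  # each row is a view, so edits happen in place
        support = rng.choice(dim, size=n_nonzeros, replace=False)
        # Normalized Gaussian entries are uniform on the sphere of the support.
        row[support] = rng.standard_normal(n_nonzeros)
        row /= np.linalg.norm(row)
    return directions


def fraction_beaten(optimized_utility, random_utilities):
    """Share of random directions whose utility is below the optimized one."""
    return float(np.mean(np.asarray(random_utilities) < optimized_utility))


def toy_utility(direction):
    """Placeholder utility for illustration: peaks along the first axis."""
    return direction[0] ** 2


rng = np.random.default_rng(0)
random_dirs = sample_sparse_directions(rng, dim=7, n_samples=1000)
random_utils = np.array([toy_utility(d) for d in random_dirs])

optimized_dir = np.eye(7)[0]  # stand-in for the optimizer's result
share = fraction_beaten(toy_utility(optimized_dir), random_utils)

# Log-scale histogram of the random utilities, as in the comparison plot.
log_counts, log_edges = np.histogram(np.log10(random_utils + 1e-12), bins=20)
```

A value of `share` close to one indicates that the optimized curve beats nearly all randomly sampled sparse directions, which is the qualitative claim the histogram supports.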
Appendix E More experiment details
Selecting model for loan prediction.
The data used in this experiment is the German loan application data (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). This dataset has 7 numeric attributes and 13 categorical attributes, ranging from the amount of the loan requested to the status of the applicant’s checking account.
We train a decision tree, a gradient boosted tree model, and a deep neural network on only the numeric attributes (since monotonicity isn’t well-defined for categorical attributes). We tune each model via cross-validation using scikit-learn. We optimize each model over the parameters in Table 1 (all other parameters are the scikit-learn defaults).
Table 1: Parameters optimized for each model via cross-validation.
Model Name | Parameters
Decision tree | Max leaf nodes , max depth
Gradient boosted trees | Learning rate , number of estimators
Deep NN | Max epoch = 1000, learning rate = 0.0001, batch size , two hidden layers of size 128 with ReLU activations and softmax final activation, Adam optimizer, and BCE loss.
Evaluating robustness under covariate shift in loan prediction.
To simulate covariate shift, we split the German loan dataset at a loan amount of 1,000 DMR: a training dataset of 884 users on one side of the threshold and a separate, unseen test dataset of 116 users on the other. These two subsets therefore follow different data distributions. We train the deep NN with the parameters in Table 1, tuned via cross-validation.
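A threshold split of this kind can be sketched in a few lines. The data below is synthetic, column 0 merely plays the role of the loan amount, and the choice of which side of the threshold is used for training is an assumption of this sketch.

```python
import numpy as np


def split_on_feature(X, y, feature_index, threshold):
    """Split rows into (above-threshold, at-or-below-threshold) groups."""
    above = X[:, feature_index] > threshold
    return (X[above], y[above]), (X[~above], y[~above])


# Synthetic stand-in for the loan data: column 0 plays the role of "amount".
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(250, 5000, size=200), rng.normal(size=200)])
y = rng.integers(0, 2, size=200)

# Assumption: train on the larger loans, hold out the smaller ones as shifted data.
(train_X, train_y), (shift_X, shift_y) = split_on_feature(X, y, 0, 1000.0)
```

Because the two groups never overlap in the split feature, any diagnostic curve that moves along that feature past the threshold probes the model outside its training distribution.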
Understanding out-of-sample behavior for traffic sign detection.
We train a convolutional neural network on the German Traffic Sign Recognition Benchmark (GTSRB) dataset [Stallkamp et al., 2011], which achieves 97% test accuracy. We consider diagnostic curves based on five image transformations: rotate, blur, brighten, desaturate, and increase contrast. Each image transformation creates images outside of the training distribution; hence, we can view the behavior of the model outside of the training data.
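To make the idea of a transformation-based curve concrete, the sketch below builds a one-parameter family of images for three of the five transformations and evaluates a model along it. The parameterizations (additive brightening, linear blending toward grayscale, scaling around the mean), the function names, and `stand_in_model` are all assumptions for illustration; rotation and blur are omitted because they would need an image-processing library.

```python
import numpy as np


def brighten(image, t):
    """Add t to every pixel; t = 0 leaves the image unchanged."""
    return np.clip(image + t, 0.0, 1.0)


def desaturate(image, t):
    """Blend toward the per-pixel gray value; t = 1 is fully gray."""
    gray = image.mean(axis=-1, keepdims=True)
    return (1.0 - t) * image + t * gray


def increase_contrast(image, t):
    """Stretch pixel values away from the image mean by a factor 1 + t."""
    mean = image.mean()
    return np.clip(mean + (1.0 + t) * (image - mean), 0.0, 1.0)


def transformation_curve(image, transform, ts):
    """Stack the transformed images for each curve parameter in ts."""
    return np.stack([transform(image, t) for t in ts])


def stand_in_model(images):
    """Hypothetical scorer (mean intensity); a real classifier would go here."""
    return images.reshape(len(images), -1).mean(axis=1)


rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(32, 32, 3))  # random stand-in for a sign
ts = np.linspace(0.0, 1.0, 11)

curve = transformation_curve(image, brighten, ts)
outputs = stand_in_model(curve)  # model output along the diagnostic curve
```

Plotting `outputs` against `ts` gives the kind of univariate view discussed in the paper, with the curve starting at a real image and moving away from the training distribution.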