Diagnostic Curves for Black Box Models

In safety-critical applications of machine learning, it is often necessary to look beyond standard metrics such as test accuracy in order to validate various qualitative properties such as monotonicity with respect to a feature or combination of features, checking for undesirable changes or oscillations in the response, and differences in outcomes (e.g. discrimination) for a protected class. To help answer this need, we propose a framework for approximately validating (or invalidating) various properties of a black box model by finding a univariate diagnostic curve in the input space whose output maximally violates a given property. These diagnostic curves show the exact value of the model along the curve and can be displayed with a simple and intuitive line graph. We demonstrate the usefulness of these diagnostic curves across multiple use-cases and datasets including selecting between two models and understanding out-of-sample behavior.




1 Introduction

Modern applications of machine learning (ML) involve a complex web of social, political, and regulatory issues as well as the standard technical issues in ML such as covariate shift or training dataset bias. Although the most common diagnostic is simply test set accuracy, most of these issues cannot be resolved by test set accuracy alone—and sometimes these issues directly oppose high test set accuracy (e.g., privacy concerns). Other simple numeric diagnostics such as the variance of the residual in regression do not provide much insight into the ML system. Thus, to develop safe and robust systems, more diagnostics are needed.

In response to these issues, the literature has recently focused on the problem of interpretability and explainability in production ML systems, and the dangers that arise when these considerations are not properly taken into account. In some settings, model explanations are used as diagnostic tools to uncover hidden problems in the model that are not evident from test accuracy alone (Ribeiro et al., 2016). A common thread in much of this work is the reliance on local approximations to a model (e.g., (Ribeiro et al., 2016, 2018; Guidotti et al., 2018a, b)), which presents difficulties to end users in practice. These difficulties include questions regarding the quality and validity of an approximation, and how rapidly it breaks down or becomes unreliable. Thus, the use of these local approximations for specific, human-interpretable diagnostics to check model properties is tenuous at best, despite their continued utility in terms of providing high-level understanding. For these reasons, it is desirable to provide the model’s exact output to users (see Fig 2 for a comparison).

In this paper, we propose a diagnostic framework for validating various model properties of complex machine learning models that goes significantly beyond test accuracy. We seek to find a single univariate curve in the input space whose model output maximally or minimally violates certain properties, which we encode by defining two key classes of curve utility functions. For example, we may seek a curve that is least monotonic (a surprising and often undesirable property in certain applications such as loan decisions). Once this curve is found, we display the model’s output along this univariate diagnostic curve for inspection so that users can visually understand the severity of the issue and quickly get a qualitative feel for the model along this curve. To aid in interpreting this curve visualization, we also allow the curve to pass through a real data point so that users can understand the context for this curve in high dimensional space if needed.

Since there are many possible curves, the key challenge is finding interesting or useful diagnostic curves that encode various model properties and displaying this in a useful way to users. Addressing these challenges is the core contribution of this paper. More specifically, we do the following: (1) We illustrate how diagnostic curves are simple to interpret, exact, nonlocal, and provide intuitive interpretations to the behavior of a model. (2) We develop a framework for finding interesting diagnostic curves by optimizing a utility measure over the space of curves, and define two broad families of utility measures that can be used to validate or invalidate various model properties. Our approach gracefully handles mixed continuous and discrete datasets. (3) We propose an optimization algorithm for finding sparse optimal diagnostic curves and evaluate its performance on synthetic models for which the true optimal values are known. (4) We demonstrate our diagnostic curve framework for several use-cases, including non-differentiable models, on multiple datasets such as loan data and traffic sign recognition.


To build some intuition, an example of the visualization of our diagnostic curve can be seen in Fig. 1, in which the exact model output is shown along the curve. The curve shows how the model score changes for the given target loan application (the red dot) as the categorical value of credit history changes from “critical account” to “all credits paid back” (note the solid line shifting downwards from the target's original position), and as the numerical values of duration and age change (along the $x$-axis). This curve is found by searching for the curve that maximally differs from the constant model (i.e., it tests the simple property that the model is close to constant). To formalize this and develop our framework, there are two complications to overcome: 1) defining a proper notion of “curve” in the categorical feature space, and 2) defining diagnostic curves that maximally or minimally violate some interesting model property, such as monotonicity, by proposing curve utility functions that can encode various model properties. The appropriate definitions are made in Section 2, while Sections 3-4 are devoted to the problem of selecting appropriate diagnostic curves.

Related work.

The importance of safety and robustness in ML is now well-acknowledged (Varshney and Alemzadeh, 2017; Varshney, 2016; Saria and Subbaswamy, 2019), especially given the current dynamic regulatory environment surrounding the use of AI in production (Hadfield and Clark, 2019; Wachter et al., 2017). A popular approach to auditing and checking models before deployment is “explaining” a black-box model post-hoc. Both early and recent work in explainable ML rely on local approximations (e.g. Ribeiro et al., 2016, 2018; Guidotti et al., 2018a, b). Other recent papers have studied the influence of training samples or features (Datta et al., 2016; Dhurandhar et al., 2018). These have been extended to subsets of informative features (Lundberg and Lee, 2017; Datta et al., 2016; Chen et al., 2018) to explain a prediction. Other approaches employ influence functions (Koh and Liang, 2017) and prototype selection (Yeh et al., 2018; Zhang et al., 2018). A popular class of approaches takes the perspective of local approximations, such as linear approximations (Ribeiro et al., 2016; Lundberg and Lee, 2017) and saliency maps (Simonyan et al., 2013; Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017). A crucial caveat with such approximation-based methods is that the quality of the approximation is often unknown, and the approximation is typically only valid in small neighborhoods of an instance (although we note recent work on global approximations (Guo et al., 2018)). In contrast to these previous feature selection methods, our approach leverages a utility measure to select features (i.e., axis-aligned curves) or more general curves (as opposed to, e.g., mutual information or Shapley values).

Individual conditional expectation (ICE) plots show how the model prediction scores change when one feature is modified for every point in a set, such that each line of the plot corresponds to a point (Goldstein et al., 2015). The more classical partial dependence plots (PDP) show the average of the ICE plots over all data points (Friedman, 2001). The output of our method is similar to ICE and PDP plots because we show the model output along a curve. However, PDP and ICE do not define ways of selecting the most interesting plots, and in practice, they are often sorted by variance or range; we greatly generalize the way to select or sort the curves with two broad classes of curve utilities to validate (or invalidate) properties beyond variance such as monotonicity or Lipschitz smoothness. Moreover, PDP and ICE only consider axis-aligned curves while we generalize to linear, non-linear, and transformation curves.

2 Diagnostic Curves

We decompose the input space into a numeric feature space $\mathcal{X}$ (e.g. $\mathcal{X} = \mathbb{R}^d$) and a categorical feature space $\mathcal{C}$ (e.g. binary or multinomial features). Curves will typically be defined with respect to a target instance $(x_0, c_0) \in \mathcal{X} \times \mathcal{C}$; however, this is not necessary in our framework. For example, in many cases it may be best to search over curves based on a sample of possible target instances. Let $f : \mathcal{X} \times \mathcal{C} \to \mathbb{R}$ be a black-box function, i.e. we can query $f$ to obtain pairs $((x, c), f(x, c))$. We do not assume that $f$ is differentiable or even continuous, in order to allow non-differentiable models such as random forests.

First, we define an appropriate class of candidate curves in the numeric feature space $\mathcal{X}$. Recall that a curve is a continuous function $\gamma : [a, b] \to \mathcal{X}$. Here we assume $\mathcal{X}$ is a metric space (e.g. $\mathbb{R}^d$), so that the notion of continuity is well-defined. To enhance interpretability and help ensure the curves are near realistic points, we anchor diagnostic curves to a point of interest $x_0$, usually a real data point from the dataset, so that the curve passes through a point in the training data. To incorporate categorical features, consider a point $(x_0, c_0) \in \mathcal{X} \times \mathcal{C}$. As a curve $\gamma$ represents a continuous perturbation of the point $x_0$, we allow for a single discrete perturbation of $c_0$. More precisely, we define a set of categorical curves by

$$\{(\gamma, c) : \gamma \text{ is a curve through } x_0,\ c \in \mathcal{C}\}.$$

In other words, we augment the curve $\gamma$ with a categorical alternative $c$. The idea is to find both a curve $\gamma$ and a discrete alternative $c$ that maximally violate some desirable property of the model. For notational simplicity, we will use $\gamma$ to denote either a curve or a categorical curve in the rest of the paper. An example visualization of a diagnostic curve can be seen in Fig. 1, where the user can read off various predictions as a result of changing both categorical and numeric features. We conclude this section by illustrating several examples of curve parametrizations used in practice.

Figure 1: For a target loan application (designated by the red point), the diagnostic curve (solid line) shows the change in model scores ($y$-axis) when the categorical variable of credit history is flipped (all credits paid back) and two numeric features (duration and age) are changed. The curve was optimized so that the scores along the curve differ most from a constant model (dotted line). This effectively highlights the three features that would change this applicant’s score the most (see Sec. 3 for more details).

Linear curves.

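A natural family of curves to consider is the family of linear curves, i.e. straight lines passing through $x_0$. Let $x_{\min}$ and $x_{\max}$ be vectors containing the minimum and maximum of each feature over the training dataset. Let $\hat{p}$ denote the density fitted to the training data (this helps to encourage realistic curves; see the paragraph on realistic curves below). We parametrize linear curves by a unit vector $v$:

$$\gamma_v(t) = x_0 + t\,v, \qquad t \in [t_{\min}, t_{\max}],$$

where $\|v\|_2 = 1$, $t_{\min} \le 0 \le t_{\max}$, and the endpoints are chosen so that $\gamma_v(t)$ stays inside the box $[x_{\min}, x_{\max}]$ and, when realistic curves are desired, inside the region where $\hat{p}$ exceeds a threshold.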
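As a rough illustration of the linear parametrization, the sketch below walks along a straight line through a target point in a unit direction and computes the largest interval of the curve parameter that keeps the line inside the per-feature minimum/maximum box. The function names (`unit`, `box_bounds`, `linear_curve`) and the toy numbers are assumptions made for this example; they are not taken from the paper.

```python
import math

def unit(v):
    """Scale v to unit Euclidean length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]

def box_bounds(x0, v, x_min, x_max):
    """Largest [t_lo, t_hi] such that x0 + t * unit(v) stays inside the box."""
    t_lo, t_hi = -math.inf, math.inf
    for xi, ui, lo, hi in zip(x0, unit(v), x_min, x_max):
        if ui > 0:
            t_lo, t_hi = max(t_lo, (lo - xi) / ui), min(t_hi, (hi - xi) / ui)
        elif ui < 0:
            t_lo, t_hi = max(t_lo, (hi - xi) / ui), min(t_hi, (lo - xi) / ui)
        # ui == 0: this feature never changes, so it imposes no bound
    return t_lo, t_hi

def linear_curve(x0, v, t_lo, t_hi, n_points=5):
    """Evenly spaced parameter values and the matching points x0 + t * unit(v)."""
    u = unit(v)
    ts = [t_lo + (t_hi - t_lo) * i / (n_points - 1) for i in range(n_points)]
    return ts, [[xi + t * ui for xi, ui in zip(x0, u)] for t in ts]

# Hypothetical two-feature example: only the first feature is allowed to move.
x0 = [2.0, 5.0]
bounds = box_bounds(x0, [1.0, 0.0], x_min=[0.0, 0.0], x_max=[10.0, 10.0])
ts, points = linear_curve(x0, [1.0, 0.0], *bounds)
```

A black-box model can then be queried at each entry of `points` and plotted against `ts` to obtain the line graph; a density threshold could shrink the interval further.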

Non-linear curves.

The above linear parameterization can be extended to the non-linear case given access to a non-linear generative model $G : \mathbb{R}^k \to \mathcal{X}$ that generates input points in $\mathcal{X}$ given vectors in $\mathbb{R}^k$; examples of such models include VAEs (Kingma and Welling, 2014; Rezende et al., 2014) and deep generative models via normalizing flows (Germain et al., 2015; Papamakarios et al., 2017; Oliva et al., 2018; Dinh et al., 2017; Inouye and Ravikumar, 2018). We then allow for linear curves in the input space of the generative model rather than in $\mathcal{X}$ itself, so that we can parameterize non-linear curves by a unit vector $v \in \mathbb{R}^k$:

$$\gamma_v(t) = G(z_0 + t\,v),$$

where $z_0$ is an (approximate) input to the generative model that would have generated $x_0$, i.e., $G(z_0) \approx x_0$. Essentially, we optimize over linear curves in the latent space, which correspond to non-linear curves in the input space. For VAEs, the decoder network acts as $G$ and the encoder network acts as an approximate $G^{-1}$. For normalizing flow-based models, $G^{-1}$ can be computed exactly by construction.
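The latent-space construction can be sketched with a toy, hand-written generator in place of a trained VAE or flow. Here `decode` and `encode` are hypothetical stand-ins for the generative model and its inverse, chosen only so that the example runs without any trained network.

```python
def decode(z):
    """Hypothetical generator: maps a latent vector to an input point."""
    return [z[0], z[0] ** 2 + z[1]]

def encode(x):
    """Exact inverse of decode for this toy model.

    A VAE encoder would only approximate the inverse.
    """
    return [x[0], x[1] - x[0] ** 2]

def latent_curve(x0, v):
    """Curve t -> decode(z0 + t * v): a straight line in latent space."""
    z0 = encode(x0)

    def gamma(t):
        return decode([zi + t * vi for zi, vi in zip(z0, v)])

    return gamma

gamma = latent_curve([1.0, 3.0], v=[1.0, 0.0])
samples = [gamma(t) for t in (0.0, 1.0, 2.0)]
```

The curve passes through the target point at `t = 0`, and the second coordinate of `samples` grows quadratically, so the line in latent space is a non-linear curve in input space.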

Known transformations and their compositions.

In certain domains, there are natural classes of transformations that are semantically meaningful. Examples include adding white noise to an audio file or blurring an image, removing semantic content like one letter from a stop sign, or changing the style of an image using style transfer (Johnson et al., 2016). We now consider more general curves based on compositions of such known transformations, which could be simple or complex and non-linear. We will denote each transformation as $T_\theta$ with parameter $\theta \in [0, 1]$, where $\theta = 0$ corresponds to the identity transformation. For example, if $T_\theta$ is a rotation operator, then $\theta$ would represent the angle of rotation, where $\theta = 0$ is 0 degrees and $\theta = 1$ is 180 degrees. Given an ordered set of $m$ different transformations denoted by $(T^{(1)}, \dots, T^{(m)})$, we can define a general transformation curve using a composition of these simpler transformations:

$$\gamma_v(t) = T^{(m)}_{t v_m} \circ \cdots \circ T^{(1)}_{t v_1}(x_0),$$

where $t \in [0, 1]$ and $v = (v_1, \dots, v_m)$ is the optimization parameter. We emphasize that the transformations can be arbitrarily non-linear and complex, even deep neural networks.
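A minimal sketch of a transformation curve, assuming each transformation takes an input and a strength parameter that equals zero at the identity. The two toy transformations on a list of pixel intensities (`brighten`, `flatten_contrast`) and the weight vector `v` are hypothetical choices made for this example.

```python
def brighten(pixels, theta):
    """Add brightness; theta = 0 leaves the input unchanged."""
    return [min(1.0, p + 0.5 * theta) for p in pixels]

def flatten_contrast(pixels, theta):
    """Pull intensities toward mid-gray; theta = 0 leaves the input unchanged."""
    return [0.5 + (p - 0.5) * (1.0 - theta) for p in pixels]

def transformation_curve(x0, transforms, v):
    """Curve t -> composition of transforms, the i-th applied with strength t * v[i]."""
    def gamma(t):
        x = list(x0)
        for transform, weight in zip(transforms, v):
            x = transform(x, t * weight)
        return x

    return gamma

# With v = [1, 0] only the first transformation is active along the curve.
gamma = transformation_curve([0.25, 0.5], [brighten, flatten_contrast], v=[1.0, 0.0])
start, end = gamma(0.0), gamma(1.0)
```

Optimizing over `v` then amounts to choosing which transformations, and in what proportion, are varied along the curve.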

Realistic curves.

In some contexts, it is desirable to explicitly audit the behavior of $f$ off its training data manifold. This is helpful for detecting unwanted bias and checking robustness to rare, confusing, or adversarial examples, especially in safety-critical applications such as autonomous vehicles. In other contexts, we may wish to ensure that the selected diagnostic curves are realistic, that is, that they respect the training data. For example, it may not be useful to show an example of a high-income person who also receives social security benefits. Fortunately, it is not hard to enforce this constraint: First, bound the curve using a box constraint based on the minimum and maximum of each feature. Then train a density model on the training data to account for feature dependencies and further restrict the bounds of the curve to only lie in regions with a density value above a threshold.
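The density restriction can be sketched as follows, assuming the curve has already been discretized on a grid and bounded by the box constraint. The unnormalized Gaussian `toy_density` stands in for a density model fitted to training data, and `realistic_segment` is a hypothetical helper name.

```python
import math

def toy_density(x):
    """Unnormalized standard Gaussian; a stand-in for a fitted density model."""
    return math.exp(-0.5 * sum(xi * xi for xi in x))

def realistic_segment(ts, points, density, threshold):
    """Keep the contiguous run of grid points around the target (t nearest 0)
    whose density is at least the threshold."""
    anchor = min(range(len(ts)), key=lambda i: abs(ts[i]))
    if density(points[anchor]) < threshold:
        return []
    lo = hi = anchor
    while lo > 0 and density(points[lo - 1]) >= threshold:
        lo -= 1
    while hi < len(ts) - 1 and density(points[hi + 1]) >= threshold:
        hi += 1
    return ts[lo:hi + 1]

ts = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
points = [[t, 0.0] for t in ts]
kept = realistic_segment(ts, points, toy_density, threshold=0.1)
```

Growing the segment outward from the target keeps the restricted curve connected, which is one reasonable design choice; other ways of applying the threshold are possible.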

For selecting among these varied parametrizations, if the data already has semantically interpretable features, we suggest using a linear parametrization with a sparse parameter vector so that the curves are directly interpretable. On the other hand, when the raw features are not directly interpretable such as in computer vision, speech recognition, or medical imaging, we suggest using a linear parameterization of the input to a deep generative model or simply domain-specific transformations that are meaningful and useful for a domain expert.

3 Diagnostic Curve Utility Measures

Given these curve definitions and parametrizations, we now consider the problem of optimizing over curves to find diagnostic curves which maximally or minimally violate a property along the curve. We define the diagnostic utility of a curve as follows:

Definition 1.

A diagnostic utility measure is a function $U : \Gamma \to \mathbb{R}$ on a class of curves $\Gamma$ that depends on the model $f$ and the target point $(x_0, c_0)$.

In this section, we develop and define two general classes of diagnostic utility measures that can be used to validate (or invalidate) various model properties. We do not claim that our proposed set of utility measures is exhaustive, and indeed, this paper will hopefully lead to more creative and broader classes of utility measures that are practically useful.

3.1 Model contrast utility measures.

A simple and natural way to measure the utility of a curve is to contrast it with a curve based on a different model, such as a constant model. Given another model $g$, we define the contrast utility measure:

$$U_{\mathrm{contrast}}(\gamma, c) = \mathbb{E}_t\big[\ell\big(f(\gamma(t), c),\; g(\gamma(t), c)\big)\big], \tag{1}$$

where the expectation is over $t$ along the curve and $\ell$ is a loss function, e.g. squared or absolute loss. Recall that an arbitrary curve in $\mathcal{X} \times \mathcal{C}$ is represented by the tuple $(\gamma, c)$. The simplest case is where the model is held constant at the original prediction of the target instance: $g(x, c) = f(x_0, c_0)$. The comparison model could be another simple model like logistic regression or some local approximation of the model around the target instance, such as explanations produced by LIME (Ribeiro et al., 2016). This would provide a way to critique local approximation methods and show where and how they differ from the true model.
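A contrast utility of this kind can be approximated on a grid by averaging a loss between the model and the comparison model along the curve. The sketch below assumes squared loss, a numeric-only toy model, and a constant comparison model fixed at the target prediction; all names are hypothetical.

```python
def squared_loss(a, b):
    return (a - b) ** 2

def contrast_utility(f, g, curve, ts, loss=squared_loss):
    """Average loss between model f and comparison model g along the curve."""
    return sum(loss(f(curve(t)), g(curve(t))) for t in ts) / len(ts)

x0 = [1.0]

def f(x):
    """Toy black-box model."""
    return x[0] ** 2

def g(x):
    """Constant comparison model fixed at the target prediction f(x0)."""
    return f(x0)

def curve(t):
    """Axis-aligned line through the target point."""
    return [x0[0] + t]

u = contrast_utility(f, g, curve, ts=[-1.0, 0.0, 1.0])
```

Swapping `g` for a previously validated model or a local linear approximation changes what the curve is contrasted against without changing the rest of the computation.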

Example 1 (Contrast with Validated Linear Model).

Suppose an organization has already deployed a carefully validated linear model, i.e., the linear parameters were carefully checked by domain experts to make sure the model behaves reasonably with respect to all features. The organization would like to improve the model’s performance by using a new model, but wants to see where the new model differs the most from the validated linear model. In this case, the organization could let the contrast model $g$ be their previous model, which does not depend on the target point $(x_0, c_0)$; see Fig. 1 for an example of a constant contrast model.

Example 2 (Contrast Random Forest and DNN).

An organization may want to compare two different model classes such as random forests and deep neural networks (DNN) to diagnose whether there are significant differences between these model classes or whether they are relatively similar. Either the random forest or the DNN model could be the contrast model $g$; again, $g$ does not depend on the target point $(x_0, c_0)$.

Example 3 (Contrast with Constant Model).

Sometimes we may want to create a contrast model that is dependent on the target point $(x_0, c_0)$. The simplest case is where the contrast model is a constant fixed to the original prediction value, i.e., $g(x, c) = f(x_0, c_0)$. Note that in this case the comparison function depends on $(x_0, c_0)$. This contrast to a constant model can find curves that deviate the most from the prediction across the whole curve; this implicitly finds features or curves that are not flat, and thus curves that significantly affect the prediction value.

Figure 2: This diagnostic curve illustrates using the model contrast utility where $g$ (dotted curve) is an explanation model based on the gradient, similar to the local linear explanation models in LIME (Ribeiro et al., 2016). Notice how our diagnostic curves show where the approximation may be appropriate (duration > 46) and where it might be far from the true model (duration < 46).

Example 4 (Contrast with local approximations used for explanations).

We could also use our diagnostic curves to contrast the true model with explanation methods based on local approximation such as LIME (Ribeiro et al., 2016) or gradient-based explanation methods (Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017). We can simply use the local approximation to the model centered at the target point as the contrast model, i.e., $g = \hat{f}_{x_0}$, where $\hat{f}_{x_0}$ is the local approximation centered around $x_0$. Thus, the diagnostic curve that is found will show the maximum deviation of the true model from the local model. Importantly, this allows our diagnostic method to assess the quality of local approximation explanation methods, showing when they are reasonable and when they may fail; see Fig. 2.

3.2 Functional property validation utility measures.

In many contexts, a user may be more interested in validating or invalidating expected functional properties of a model, such as monotonicity or smoothness. For example, if it is expected that a model should be increasing with respect to a feature (e.g. income in a loan scoring model), then we would like to check that this property holds. Let $\mathcal{H}$ be a class of univariate functions that encodes acceptable or expected behaviors. To measure deviation from this class, take the minimum expected loss over all $h \in \mathcal{H}$:

$$U_{\mathcal{H}}(\gamma, c) = \min_{h \in \mathcal{H}} \mathbb{E}_t\big[\ell\big(f(\gamma(t), c),\; h(t)\big)\big], \tag{2}$$

where, as usual, $\ell$ is a loss function. The minimization in (2) is a univariate regression problem that can be solved using standard techniques. This utility will find diagnostic curves that maximally or minimally violate a functional property along the curve with respect to some given class of univariate functions.

Example 5 (Monotonic property validation via isotonic regression).

In many applications, it may be known that the model output should behave simply with respect to certain features. For example, one might expect that the score is monotonic in terms of income in a loan scoring model. In this case, the class of functions $\mathcal{H}$ should be the set of all monotonic functions. The resulting problem can be efficiently solved using isotonic regression (Best and Chakravarti, 1990), and this is what we do in our experiments; see Fig. 4 for an example of validating (or invalidating) the monotonic property.
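To make the monotonic case concrete, the sketch below fits the best non-decreasing and best non-increasing sequence to model outputs sampled along a curve, using a small pool-adjacent-violators routine, and reports the smaller mean squared error. Treating "monotonic" as either direction and using squared loss are assumptions of this example, and the function names are hypothetical.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

def mean_squared_error(y, fit):
    return sum((a - b) ** 2 for a, b in zip(y, fit)) / len(y)

def monotonic_utility(y):
    """Smallest MSE over non-decreasing and non-increasing fits.

    Zero means the sampled outputs are already monotonic.
    """
    increasing = isotonic_fit(y)
    decreasing = [-v for v in isotonic_fit([-v for v in y])]
    return min(mean_squared_error(y, increasing), mean_squared_error(y, decreasing))

monotone_scores = monotonic_utility([0.0, 1.0, 2.0, 3.0])
wiggly_scores = monotonic_utility([0.0, 2.0, 1.0, 3.0])
```

A larger value indicates outputs that are further from any monotonic function, so maximizing it over curves searches for the least monotonic curve.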

Figure 3: This illustrates the functional property validation utility when the class of functions $\mathcal{H}$ is Lipschitz-bounded by a fixed constant $L$.

Example 6 (Lipschitz-bounded property validation via constrained least squares).

Another property that an organization might want to validate is whether the function is Lipschitz-bounded along the curve. Formally, they may wish to check if the following condition holds:

$$\big|f(\gamma(t_1), c) - f(\gamma(t_2), c)\big| \le L\,|t_1 - t_2| \quad \text{for all } t_1, t_2, \tag{3}$$

where $L$ is a fixed Lipschitz constant. Thus, the corresponding class of functions $\mathcal{H}$ is the set of Lipschitz continuous functions with a Lipschitz constant of $L$. In practice, we can solve this problem via constrained least squares optimization, similar to isotonic regression (details in supplement). An example of using Lipschitz-bounded functions for $\mathcal{H}$ can be seen in Fig. 3. This utility will find the curve that maximally violates the Lipschitz condition; this curve may also be useful in finding where the model is particularly sensitive to changes in the input, since the derivative along the curve will be larger than the Lipschitz constant.
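A simple grid-based check of the Lipschitz condition can be sketched as follows. This is not the constrained least squares utility described in the text (whose details are in the supplement); it only tests whether sampled outputs along a curve exceed a given slope bound. On a sorted grid the steepest secant is always attained between adjacent points, so only those are compared. The names and numbers are hypothetical.

```python
def max_slope(ts, ys):
    """Largest absolute slope between adjacent samples (ts strictly increasing)."""
    return max(
        abs(y2 - y1) / (t2 - t1)
        for t1, y1, t2, y2 in zip(ts, ys, ts[1:], ys[1:])
    )

def violates_lipschitz(ts, ys, lipschitz_constant):
    """True if some pair of sampled points breaks the slope bound."""
    return max_slope(ts, ys) > lipschitz_constant

ts = [0.0, 0.5, 1.0, 1.5]
ys = [0.0, 0.25, 1.25, 1.5]  # hypothetical model outputs along a curve
steepest = max_slope(ts, ys)
```

Here the jump between `t = 0.5` and `t = 1.0` is the sensitive region: it breaks a bound of 1 but not a bound of 3.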

4 Optimization for Diagnostic Curves

So far we have discussed: 1) how to parametrize curves, and 2) how to evaluate the diagnostic utility of a curve. With these pieces in place, the objective is to find diagnostic curves that are simultaneously interpretable and useful, by restricting the curves under consideration and optimizing for the curve that shows the best or worst behavior with respect to certain properties. Formally, let $(x_0, c_0)$ be the target instance, $U$ a utility measure, and $\Gamma$ a class of candidate curves $(\gamma, c)$. For instance, $\Gamma$ could be the class of linear or transformation curves outlined in Section 2. A natural method for obtaining high-utility diagnostic curves is to simply optimize over $\Gamma$:

$$\max_{\gamma \in \Gamma_{\mathcal{X}},\; c \in \Gamma_{\mathcal{C}}} U(\gamma, c), \tag{4}$$

where $\Gamma_{\mathcal{X}}$ and $\Gamma_{\mathcal{C}}$ are the corresponding subsets of $\Gamma$ (i.e. projections onto $\mathcal{X}$ and $\mathcal{C}$, respectively), and the maximum is replaced by a minimum when seeking the curve that least violates a property. In our experiments with linear curves $\gamma_v$, we make the following restrictions to enforce sparsity and interpretability: $\|v\|_0 \le k$ for $k \in \{1, 2, 3\}$ and $\|c - c_0\|_0 \le 1$, i.e., only one, two or three numeric features can change, and zero or one categorical features can change.

The optimization problem in (4) is usually nonconvex and nonsmooth because $f$ is assumed to be an arbitrary black box, where only function evaluations are possible. In fact, this is why we did not make any smoothness assumptions on the utility measure in Def. 1: The potential lack of smoothness in $f$ obviates the advantages of smoothness in $U$. This model-agnostic setup generalizes to many interesting settings, including ones in which the model is private, distributed or nondifferentiable (e.g. boosted trees) such that exact gradient information would be impossible or difficult to compute. Given these restrictions, we must resort to zeroth-order optimization. While we could use general zeroth-order optimization techniques, we require that curves are sparse so that the resulting diagnostic curves are interpretable. To accomplish this, we propose a greedy optimization scheme called greedy coordinate pairs (GCP) that adds non-zero coordinates in a greedy fashion, as outlined in Algorithm 1. We initialize this algorithm by finding the single feature with the highest utility.

Optimization evaluation.

To test the effectiveness of GCP (Algorithm 1), we ran two tests. First, we compared the optimal utility values returned by GCP to the utility of 10,000 randomly selected curves based on a random forest classifier. In all cases, GCP returned the highest values. For example, GCP found a diagnostic curve with a higher utility than the best random curve. In the second experiment, we generated random synthetic models in which the highest utility curves could be determined in advance, making it possible to evaluate how often GCP selects the optimal curves. We evaluated its performance on curves with at most one, two, or three nonzero coordinates, and found that GCP found the optimal curve in 100%, 97%, and 98% of simulations, respectively. In the few cases where GCP did not find the optimal curve, this was due to the randomness in generating examples whose optimal curves are more difficult to identify (e.g., the optimal curve was nearly constant). Details and further results on these experiments can be found in the appendix.

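Input: Curve utility $U$, target point $x_0$, max number of features $k$, grid size $M$
Output: Optimized diagnostic curve $\gamma_{v^*}$
   $G_{i,j}(\theta)$: Givens rotation matrix for coordinates $i$ and $j$ with angle $\theta$
   Initialize $v$ to the single coordinate direction with the highest utility {Coordinate-wise optimization}
  while $\|v\|_0 \le k$ and not converged do
     Greedily rotate $v$ toward a new coordinate, using the grid angle that maximizes $U$
     while not converged do
        Re-optimize the rotation angles between pairs of non-zero coordinates of $v$
     end while
  end while
Algorithm 1 Greedy coordinate pairs for optimizing utility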
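The following is a sketch of one plausible reading of greedy coordinate pairs, not the authors' implementation: start from the best single coordinate, then repeatedly rotate the direction toward a new coordinate with a Givens rotation whose angle is chosen from a grid, refining pairs of active coordinates after each addition. The function names, the angle grid, the stopping rules, and the toy utility are all assumptions made for this example.

```python
import math

def givens_rotate(v, i, j, theta):
    """Rotate v in the (i, j) coordinate plane; the Euclidean norm is preserved."""
    out = list(v)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * v[i] - s * v[j]
    out[j] = s * v[i] + c * v[j]
    return out

def gcp(utility, d, k, grid_size=16, max_sweeps=10):
    """Greedy search for a unit direction with at most k non-zero coordinates."""
    angles = [math.pi * m / grid_size for m in range(grid_size)]
    basis = [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]
    v = max(basis, key=utility)          # best single feature
    best = utility(v)
    support = [v.index(1.0)]             # coordinates allowed to be non-zero
    while len(support) < k:
        # Greedy step: best rotation from an active coordinate to a new one.
        candidates = [
            (utility(givens_rotate(v, i, j, th)), i, j, th)
            for i in support
            for j in range(d) if j not in support
            for th in angles
        ]
        if not candidates:
            break
        u, i, j, th = max(candidates)
        if u <= best:
            break                        # no new coordinate helps
        v, best = givens_rotate(v, i, j, th), u
        support.append(j)
        # Refinement: re-optimize angles between pairs of active coordinates.
        for _ in range(max_sweeps):
            improved = False
            for a in support:
                for b in support:
                    if a >= b:
                        continue
                    for angle in angles:
                        w = givens_rotate(v, a, b, angle)
                        if utility(w) > best + 1e-12:
                            v, best, improved = w, utility(w), True
            if not improved:
                break
    return v, best

def toy_utility(v):
    """Hypothetical utility whose maximum over unit vectors is 25."""
    return (3.0 * v[0] + 4.0 * v[2]) ** 2

direction, value = gcp(toy_utility, d=4, k=2)
```

Because only function evaluations of `utility` are used, the same loop applies to non-differentiable models; the grid size trades accuracy of the angle search against the number of model queries.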

5 Experiments

We present four concrete use cases: 1) Selecting among several models, 2) Evaluating robustness to covariate shift, 3) Out-of-sample behavior in computer vision, and 4) Bias detection. We push some details of the experiments to the appendix.

5.1 Selecting a model for loan prediction.

Suppose we have trained multiple models, each with similar test accuracies. Which of these models should be deployed? To answer this question, diagnostic curves can be used to detect undesirable or unexpected behaviours. We explore this use-case of qualitative model selection on a dataset of German loan application data in order to find non-monotonic curves. For example, it may be undesirable to deploy a model that decreases a candidate’s score in response to an increase in their income. We train a decision tree, a random forest, a kernel SVM, a gradient boosted tree, and a deep neural network. For this simple example, we consider linear curves with only one non-zero coordinate to help ensure interpretability; additionally, we optimize the utility over 100 random target points, thus providing an estimate of the worst-case behavior of the model. The test accuracies of these models were all close to 62%; thus, a stakeholder would not be able to select a model based on accuracy alone. The curves can be seen in Fig. 4. In addition to a single number that quantifies non-monotonicity, our diagnostic curve visualizations show the user both the location and severity of the worst-case non-monotonicity. For example, the visualizations suggest that gradient boosting may be preferable since its worst-case behavior is nearly monotonic, whereas other models are far from monotonic.

Figure 4: The least monotonic diagnostic curve for each loan prediction model optimized over all features of 100 randomly sampled target points. The dotted line (labeled “BestPartial”) is the best isotonic (i.e., monotonic) regression model.

5.2 Evaluating robustness under covariate shift in loan prediction.

Using the same dataset as in the previous example, we compare the behavior of the same model over different regions of the input space. The motivation is covariate shift: A bank has successfully deployed a loan prediction model but has historically only focused on high-income customers, and is now interested in deploying this model on low-income customers. Is the model robust? Once again we use diagnostic curves to detect undesirable behaviour such as non-monotonicity. We trained a deep neural network that achieves 80% accuracy on high-income data and a comparable 76% accuracy on the unseen low-income data. Moving beyond accuracy, we generated diagnostics using the least monotonic utility optimizing over all target points in the training data (i.e. high-income) and the unseen test data (i.e. low-income) as can be seen in Fig. 5. The curves explicitly display the difference between the worst-case non-monotonicity for high-income (Fig. 5, left) and low-income (Fig. 5, right), which appears to be minimal, giving stakeholders confidence for deploying this model.

Figure 5: Least monotonic diagnostic curves for a deep neural network model trained on large loan applications (amount > 1,000) when optimizing over (left) the training distribution (i.e. amount > 1,000) and over (right) the unseen novel small loan distribution (i.e. amount ≤ 1,000). The dotted line (labeled “BestPartial”) is the best isotonic regression model.

5.3 Understanding out-of-sample behavior for traffic sign detection.

For the safety of autonomous cars, understanding how the model behaves outside of the training dataset is often critical. For example, will the model predict correctly if a traffic sign is rotated even though the training data does not have rotated signs? In this use-case, we train a convolutional neural network on the German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2011), which achieves 97% test accuracy. We consider transformation curves based on five image transformations: rotate, blur, brighten, desaturate, and contrast. Each of these transformations creates images outside of the training data, and we are interested in finding the least and most sensitive combinations of transformations. For this, it is reasonable to use both the constant model comparison and functional property validation utilities. Fig. 6 depicts the resulting diagnostic curves. The most sensitive direction (top) simultaneously adjusts contrast and brightness, which is expected since this transformation gradually removes almost all information about the stop sign. The least sensitive direction (middle) adjusts saturation and contrast, which may be surprising since it suggests that the model ignores color. Finally, the least monotonic direction (bottom) is rotation, which suggests that the model identifies a white horizontal region but ignores the actual letters “STOP” since it still predicts correctly when the stop sign is flipped over.

Figure 6: The diagnostic curves for traffic sign detection show which transformation is the most sensitive (top), the least sensitive (middle), and the least monotonic (bottom). These curves highlight both expected model trends (top) and unexpected trends (middle and bottom), where the model seems to ignore color and fails when the stop sign is rotated partially but works again when the stop sign is almost flipped over.

5.4 Bias detection in recidivism model prediction.

In many real-world applications, it is essential to check models for bias. A contemporary example of significant interest in recent years concerns recidivism prediction instruments (RPIs), which use data to predict the likelihood that a convicted criminal will re-offend (Chouldechova, 2017; Chouldechova and G’Sell, 2017). In this setting, a simple diagnostic for viewing model bias is the following: given an instance, consider what the output of the model would have been had the protected attribute (e.g. race or gender) been flipped. In certain cases, such protected attributes might not be explicitly available, in which case we could use proxy attributes or factors, though we do not focus on that here. We also emphasize that this does not formally assess the fairness of the model or system, which is a much deeper question, but is meant as a simple validation check, potentially elucidating biases in the model itself introduced because of dataset bias. A model that ignores the protected attributes would see little or no change in its output as a result of this flip. In this simple example, we explicitly ignore dependencies between the protected attribute and other features, though these would be important to consider before drawing any significant conclusions. Given this situation, we select the model contrast utility (section 3) with a special comparison model: the same model evaluated on a copy of the input in which only the protected attribute is flipped and all other features are left untouched.

Figure 7: A diagnostic curve using the model contrast utility where the comparison model is the same model but with the race attribute flipped. Notice that the bias between races is far from uniform and even switches direction. An expanded figure that flips both gender and race for two target points can be found in Appendix C.

There are two cases: either (a) no such curve deviates too much, in which case this is evidence that the model is not biased, or (b) there is a dimension along which the model is significantly biased. A diagnostic curve for flipping race from white to black, based on data from Schmidt and Witte (1988) and using a kernel SVM model, can be seen in Fig. 7. One can clearly see that the model behaves in biased ways with respect to race: the effect of time served on the risk score clearly depends on race and even switches bias direction. For this high-stakes application, the use of such a model could be problematic, and our diagnostic curves highlight this fact easily even for non-technical users. Finally, these diagnostic curves avoid averaging the data into a single metric and offer more insight into the location and form of the model bias, which may depend on the inputs in complex ways.
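The flipped-attribute comparison described in this section can be sketched in a few lines. This is an illustrative sketch only: it assumes a binary protected attribute encoded as 0/1, an axis-aligned curve that sweeps a single feature, and a hypothetical `toy_risk_model` with an interaction term chosen so that the gap changes sign. None of these names come from the paper.

```python
def flip_protected(model, protected_idx):
    """Return the comparison model: the same model with one binary feature flipped."""
    def comparison(x):
        x_flipped = list(x)
        x_flipped[protected_idx] = 1 - x_flipped[protected_idx]
        return model(x_flipped)
    return comparison


def contrast_curve(model, comparison, x0, feature_idx, grid):
    """Signed gap model(x) - comparison(x) while one feature sweeps over grid."""
    gaps = []
    for value in grid:
        x = list(x0)
        x[feature_idx] = value
        gaps.append(model(x) - comparison(x))
    return gaps


def toy_risk_model(x):
    """Hypothetical risk score; x = [time_served, protected_attribute]."""
    time_served, protected = x
    return 0.05 * time_served + 0.02 * protected * (time_served - 5.0)


comparison = flip_protected(toy_risk_model, protected_idx=1)
grid = [0.0, 2.5, 5.0, 7.5, 10.0]
gaps = contrast_curve(toy_risk_model, comparison, x0=[0.0, 0],
                      feature_idx=0, grid=grid)
```

Plotting `gaps` against `grid` gives a one-dimensional view of where the two outputs diverge. A model that ignores the protected attribute would produce a flat line at zero.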

6 Conclusion and Discussion

We have introduced a framework for finding useful, interpretable diagnostic curves to check various properties of a black-box model. These diagnostic curves are intuitive to understand and can be tailored to different use-cases by adopting different utility measures or classes of curves. Furthermore, the methodology is flexible and is shown to handle discrete features as well as non-differentiable models seamlessly. Through several use-cases, we have demonstrated their effectiveness in diagnosing, understanding, and auditing the behavior of ML systems. We conclude by discussing a few challenges and practical points in the use of diagnostic curves. A key decision in practice is selecting an appropriate utility measure, for which we have introduced two broad classes that cover many relevant cases. Since it is easy to apply the method to new utilities, we emphasize that this is a feature that allows practitioners to tailor the curves as needed to their use cases. This is important since the desiderata for validating models are hardly universal and will change depending on the scenario. Given that these diagnostic curves can give evidence that validates or invalidates desirable model properties, the natural next question is “How can models be designed to ensure they have certain of these desirable properties, i.e., that no diagnostic curve exists that violates the desirable model property?” While answering this question is out of the scope of this paper, our framework may inspire further investigation into designing models to have these desirable properties by construction. More generally, our novel diagnostic curve framework provides a foundation for diagnostics for black box models that can aid in building more robust and safe ML systems.


The authors acknowledge the support of DARPA via FA87501720152, and Accenture. D. I. Inouye acknowledges support from Northrop Grumman.


Appendix A Lipschitz-bounded property validation formulated as a constrained least squares problem

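Suppose we have a grid of values $t_1 < t_2 < \dots < t_m$ and $y_1, \dots, y_m$ corresponding to model outputs along the curve at these grid points. Now let $\Delta_i = t_{i+1} - t_i$ and let $A \in \mathbb{R}^{m \times m}$ be defined as follows:

$$A = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & \Delta_1 & 0 & \cdots & 0 \\ 1 & \Delta_1 & \Delta_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \Delta_1 & \Delta_2 & \cdots & \Delta_{m-1} \end{bmatrix}$$

Now we solve the following simple least squares problem:

$$\min_{w \in \mathbb{R}^m} \; \|y - Aw\|_2^2 \quad \text{subject to} \quad |w_i| \le L, \quad i = 2, \dots, m,$$

where $L$ is the Lipschitz bound. Notice that the first coordinate $w_1$ is unconstrained and represents $\hat{y}_1$. The rest of $w$ correspond to the slope of a line connecting each point; thus $\hat{y}_2 = w_1 + w_2 \Delta_1$ and $\hat{y}_3 = w_1 + w_2 \Delta_1 + w_3 \Delta_2$, etc. Thus,

$$|\hat{y}_{i+1} - \hat{y}_i| = |w_{i+1}| \, \Delta_i \le L \, \Delta_i,$$

and our approximation is merely a linear interpolation using $t$ and $\hat{y} = Aw$.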
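As a rough illustration of this constrained least squares fit, the sketch below uses projected gradient descent in plain Python. The notation is an assumption made for the example: `t` is the grid, `y` the model outputs, `lipschitz` the slope bound, `w[0]` the unconstrained intercept, and `w[k + 1]` the slope on the segment between `t[k]` and `t[k + 1]`. The solver choice, step size, and iteration count are also illustrative and are not taken from the paper.

```python
def lipschitz_fit(t, y, lipschitz, n_iter=20000):
    """Least squares fit to (t, y) whose segment slopes lie in [-L, L]."""
    m = len(t)
    deltas = [t[i + 1] - t[i] for i in range(m - 1)]
    # Row i of A is [1, d_0, ..., d_{i-1}, 0, ..., 0], so that
    # (A w)_i = w[0] + sum_{k < i} w[k + 1] * d_k.
    A = [[1.0] + [deltas[k] if k < i else 0.0 for k in range(m - 1)]
         for i in range(m)]
    # 1 / ||A||_F^2 is a safe step size because ||A||_F^2 >= lambda_max(A^T A).
    step = 1.0 / sum(a * a for row in A for a in row)
    w = [y[0]] + [0.0] * (m - 1)
    for _ in range(n_iter):
        resid = [sum(a * wj for a, wj in zip(row, w)) - yi
                 for row, yi in zip(A, y)]
        grad = [sum(A[i][j] * resid[i] for i in range(m)) for j in range(m)]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
        # Projection: the intercept w[0] is free, the slopes are clipped.
        w[1:] = [max(-lipschitz, min(lipschitz, wj)) for wj in w[1:]]
    y_hat = [sum(a * wj for a, wj in zip(row, w)) for row in A]
    return w, y_hat


t = [0.0, 0.25, 0.5, 0.75, 1.0]
y_jump = [0.0, 0.0, 1.0, 1.0, 1.0]   # slope 4 on one segment
w, y_hat = lipschitz_fit(t, y_jump, lipschitz=1.0)
```

The gap between `y_jump` and `y_hat` measures how far the outputs are from any curve with the requested Lipschitz bound. For larger grids, a bounded least squares routine such as `scipy.optimize.lsq_linear` would be a natural replacement for the hand-written loop.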

Appendix B Synthetic Optimization Figure

See Fig. 8.

Figure 8: Diagnostic curves when optimizing the synthetic function defined in Appendix D using the model comparison utility with a first-order Taylor series approximation (top) and using the least monotonic utility (bottom). In both cases, our optimization algorithm finds the correct ground-truth direction.

Appendix C Expanded figure for bias detection

See Fig. 9.

(a) Flipping gender
(b) Flipping race
Figure 9: Diagnostic curves showing top two biased features (rows) of two target instances (columns) for flipping (a) gender and (b) race. Notice that bias between groups is quite evident and is far from uniform; sometimes the bias even switches depending on the feature values (top left of subfigure (b)).

Appendix D Optimization evaluation details

Synthetic experiment.

We create a synthetic model to test our optimization algorithm. Consider a function for , and . We consider two utilities: 1) the model comparison utility with the first-order Taylor series approximation to , i.e. , and 2) the least monotonic utility. It can be seen that the ground-truth best linear curve, for the model above, with respect to these two utilities, will have directions along and respectively. Sinusoidal functions are indeed less monotonic than linear functions, and they also deviate away from the first-order Taylor series approximation, which is also linear. We verify that these directions are correctly found using our optimization algorithm as seen in Figure 8.
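The intuition that a sinusoidal direction is less monotonic than a linear one can be checked with a small sketch. It assumes the least monotonic utility is the mean squared deviation from the best monotone (isotonic) fit, taking the better of the increasing and decreasing fits. The `toy_model` below is a hypothetical stand-in and is not the synthetic function used in the experiment.

```python
import math


def isotonic_fit(y):
    """Pool-adjacent-violators: best non-decreasing least squares fit to y."""
    blocks = []  # each block is [sum of values, count]
    for v in y:
        blocks.append([v, 1])
        # Merge while the previous block mean exceeds the current block mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


def least_monotonic_utility(y):
    """Deviation from the better of the increasing and decreasing isotonic fits."""
    increasing = isotonic_fit(y)
    decreasing = [-v for v in isotonic_fit([-v for v in y])]

    def mse(fit):
        return sum((a - b) ** 2 for a, b in zip(y, fit)) / len(y)

    return min(mse(increasing), mse(decreasing))


def toy_model(x):
    """Hypothetical synthetic model: sinusoidal in x[0], linear in x[1]."""
    return math.sin(3.0 * x[0]) + 0.5 * x[1]


def utility_along(direction, n_steps=41):
    """Utility of the linear curve t * direction for t in [-2, 2]."""
    ts = [-2.0 + 4.0 * k / (n_steps - 1) for k in range(n_steps)]
    outputs = [toy_model([t * d for d in direction]) for t in ts]
    return least_monotonic_utility(outputs)


u_sinusoidal = utility_along([1.0, 0.0])
u_linear = utility_along([0.0, 1.0])
```

Along the linear coordinate the outputs are already monotone, so the utility is zero, while the sinusoidal coordinate yields a clearly positive value.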

To verify more generally that GCP finds the correct direction that maximizes the least monotonic utility, we simulated random model behaviors that are non-monotonic in certain directions and compared the diagnostic curves found by GCP with the ground-truth curves. To introduce non-monotonicity, we used a random set of polynomial functions, additionally constraining the utilities of the curves along the ground-truth directions to be non-zero and above a certain threshold, while the utilities along the non-ground-truth directions are relatively small or almost zero.

Utility histogram.

Another empirical way to check our optimization method is to randomly sample curves and compute their utilities; we can then compare these to the utility of our optimized curve. We generate diagnostic curves using a random forest on the loan application data (see section 5 for more data details). For interpretability, we restrict the curve parameter to have at most three non-zeros. We sample uniformly from all directions that have at most three non-zeros. We can see in the histogram of utility values (log scale) shown in Fig. 10 that the utility of our optimized curve (red line) is clearly better than the utility of random directions. In addition, we note that even if we do not find the global optimum, our optimized diagnostic curves can still be useful in highlighting interesting parts of the model (see the use-case experiments). This shows that even though the optimization problem is quite difficult, our method performs well empirically.
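The random baseline can be sketched as follows. The sampling scheme is an assumption: three coordinates are chosen at random and filled with normalized Gaussian values, which is one common way to draw a uniform direction on a sparse support. The `toy_utility` function is a hypothetical placeholder for the utility of the curve along a given direction.

```python
import math
import random


def sample_sparse_direction(dim, max_nonzeros=3, rng=random):
    """Unit-norm direction supported on max_nonzeros randomly chosen coordinates."""
    support = rng.sample(range(dim), max_nonzeros)
    values = [rng.gauss(0.0, 1.0) for _ in support]
    norm = math.sqrt(sum(v * v for v in values))
    direction = [0.0] * dim
    for idx, v in zip(support, values):
        direction[idx] = v / norm
    return direction


def fraction_beaten(optimized_utility, utility_fn, dim, n_samples=1000, seed=0):
    """Fraction of random sparse directions whose utility is below the optimized one."""
    rng = random.Random(seed)
    utilities = [utility_fn(sample_sparse_direction(dim, rng=rng))
                 for _ in range(n_samples)]
    return sum(u < optimized_utility for u in utilities) / n_samples


def toy_utility(direction):
    """Hypothetical utility, maximized (value 1) along the first coordinate axis."""
    return direction[0] ** 2


beaten = fraction_beaten(optimized_utility=1.0, utility_fn=toy_utility, dim=7)
```

A value of `beaten` close to 1 corresponds to the optimized curve lying at the far right of the utility histogram.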

Figure 10: The utility found by our optimization (red line) is large compared to the utilities of random directions (blue histogram, with counts in log scale), demonstrating that our optimization method performs well empirically.

Appendix E More experiment details

Selecting model for loan prediction.

The data used in this experiment is the German loan application data (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). This dataset has 7 numeric attributes and 13 categorical attributes ranging from the amount of the loan requested to the status of the applicant’s checking account.

We train a decision tree, a gradient boosted tree model, and a deep neural network on only the numeric attributes (since monotonicity isn’t well-defined for categorical attributes). We tune each model via cross validation using scikit-learn. We optimize each model over the parameters in Table 1 (all other parameters are left at their scikit-learn defaults).

Model name              Parameters
Decision tree           Max leaf nodes, max depth
Gradient boosted trees  Learning rate, number of estimators
Deep NN                 Max epoch = 1000, learning rate = 0.0001, batch size, two hidden layers of size 128 with ReLU activations and softmax final activation, Adam optimizer and BCE loss

Table 1: Parameter values for models.

Evaluating robustness under covariate shift in loan prediction.

To simulate this setup, we split the German loan dataset based on amount: a training dataset of 884 users with amount > 1,000 DMR and a separate unseen test dataset of 116 users with amount ≤ 1,000 DMR. Note that these give two different data distributions. We train the deep NN using the parameters and cross validation described in Table 1.

Understanding out-of-sample behavior for traffic sign detection.

We train a convolutional neural network on the German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2011), which achieves 97% test accuracy. We consider diagnostic curves based on five image transformations: rotate, blur, brighten, desaturate, and increase contrast. Each image transformation creates images outside of the training distribution; hence, we can view the behavior of the model outside of the training data.