DIVINE: Diverse Influential Training Points for Data Visualization and Model Refinement

07/13/2021
by   Umang Bhatt, et al.
University of Cambridge
Amazon

As the complexity of machine learning (ML) models increases, resulting in a lack of prediction explainability, several methods have been developed to explain a model's behavior in terms of the training data points that most influence the model. However, these methods tend to mark outliers as highly influential points, limiting the insights that practitioners can draw from points that are not representative of the training data. In this work, we take a step towards finding influential training points that also represent the training data well. We first review methods for assigning importance scores to training points. Given importance scores, we propose a method to select a set of DIVerse INfluEntial (DIVINE) training points as a useful explanation of model behavior. As practitioners might not only be interested in finding data points influential with respect to model accuracy, but also with respect to other important metrics, we show how to evaluate training data points on the basis of group fairness. Our method can identify unfairness-inducing training points, which can be removed to improve fairness outcomes. Our quantitative experiments and user studies show that visualizing DIVINE points helps practitioners understand and explain model behavior better than earlier approaches.



1 Introduction

Training point importance is a useful form of explainability for practitioners when reasoning about a machine learning (ML) model's behavior (jeyakumar2020can). This form of explanation identifies which training points are most important to an ML model (ghorbani2019data; koh2017understanding; yeh2018representer). To compute training point importance, popular methods include calculating Data Shapley values (ghorbani2019data; kwon2020efficient) or using influence functions to estimate the impact on the model of dropping one or more points from the training data (koh2017understanding; koh2019accuracy). However, the top-$k$ most important points returned by these methods are often redundant, in the sense that several may be very similar, limiting the extent of explanation provided (barshan2020relatif; bhatt2020explainable). To address this shortcoming, we devise an approach for selecting a set of DIVINE (DIVerse INfluEntial) training points.

Figure 1(a) shows that the top-$k$ influential points with respect to the approximate leave-one-out estimate of koh2017understanding and with respect to Data Shapley (ghorbani2019data) are all located in a small vicinity (red and blue diamonds respectively). Due to this lack of diversity, practitioners may miss key insights from underrepresented data points. Some regions, such as the cluster of points in the top left corner, are ignored by both methods. In Figure 1(a), our DIVINE points, denoted by yellow circles, not only lie in regions of high influence but also across the feature space. Our method provides the flexibility, under a common assumption, to operate on top of training point importance scores from a wide range of methods, including Data Shapley (DS) (ghorbani2019data) and influence functions (IF) (koh2017understanding). Beyond the synthetic setting of Figure 1(a), consider a misclassified test point, as in Figure 1(c). The influential training points according to two competitive methods (Influence (koh2017understanding) and RelatIF (barshan2020relatif)) are very similar to the test point. The resulting explanation contains redundant information. An explanation containing diverse points is more useful: notice the coat that appears in the DIVINE points but not in the others. Our user studies (Section 6) show that this additional diversity allows practitioners to be more accurate in simulating model behavior and enhances trustworthiness in the model.

Moreover, existing methods for training point importance focus on the impact of a data point on loss or accuracy, but practitioners may also want to value data points with respect to other important metrics, such as fairness. To bridge this gap, we develop an efficient method for computing importance scores with respect to group fairness metrics (hardt2016equality). Practitioners can use these scores to visualize data points that harm model fairness. We describe how to refine a model and improve fairness outcomes by removing unfairness-inducing points, while minimizing impact on other metrics.

(a) Synthetic Data
(b) Global: FashionMNIST
(c) Local: FashionMNIST
Figure 1: In 1(a), we show that our method selects DIVINE points (yellow circles) that are spread across the feature space. This contrasts with IF (red diamonds) and Data Shapley (blue diamonds), which select points located in one region. Note the overlap between IF and DIVINE points in the top right. In 1(b) and 1(c), we show that DIVINE points (third row) are more diverse than ones selected by IF (first row) or other methods (second row). DIVINE is calculated by trading off IF importance and diversity ($F_{\text{div}}$). The predicted label is listed under each point.

Our main contributions are:

  1. We devise a method for finding a diverse set of training points that are influential to a model. Our top DIVINE points, when trading off influence with diversity objectives, can provide a more comprehensive overview of model behavior (Section 3).

  2. We discuss how to value a training point’s influence on group fairness metrics (Section 4).

  3. Experiments on synthetic and real-world datasets show that DIVINE can help explore diverse, influential regions of the feature space and can also help improve fairness in model outcomes by identifying the most unfairness-inducing points (Section 5).

  4. Extensive user studies show that DIVINE leads to enhanced user trust and better task simulatability as compared to existing approaches (Section 6).

2 Background: Assigning Training Point Importance

We start by reviewing earlier methods for obtaining training point importance scores. Consider an ML model parameterized by $\theta$. Given training data $D = \{z_i\}_{i=1}^n$ and a loss function $L(z, \theta)$, weighted empirical risk minimization (ERM) estimates $\hat{\theta} = \arg\min_\theta \sum_{i=1}^n w_i L(z_i, \theta)$, where $w_i$ is the weight given to training point $z_i$. Usually, each training point has equal weight, e.g., $w_i = \frac{1}{n}$. The leave-one-out (LOO) model obtained as a result of dropping the $i$-th training point (i.e., setting $w_i = 0$) has LOO parameters denoted by $\hat{\theta}_{-i}$. Dropping a set of training points $S$ can be done by setting all respective $w_i$'s in the set to zero; the resulting model is denoted by $\hat{\theta}_{-S}$. Within the LOO framework, the importance of the $i$-th training point can be written as $s_i = f(\hat{\theta}_{-i}) - f(\hat{\theta})$, where $f$ measures a quantity of interest (e.g., loss). We call $f$ the evaluation function. Instead of assigning importance by a difference between the value of $f$ with LOO and original parameters, we can also take an absolute difference, squared difference, sigmoid, etc., depending on the application. In this paper, we focus on cases where $f$ is the loss on data points, or $f$ is a group fairness metric, like equal accuracy (Section 4).

Unless otherwise specified, we assume that $f$ is a non-negative scalar and that lower $f$ is desirable. Let $\theta_{\text{new}}$ denote the new parameters (e.g., $\hat{\theta}_{-i}$) and let $\theta_{\text{old}}$ denote the old ERM parameters (e.g., $\hat{\theta}$). Thus, a positive $s_i$ implies that including the $i$-th point is helpful for lowering $f$ when learning $\theta$: upon removing the $i$-th point, the value of $f$ at the new parameters increased, which is undesirable. A negative $s_i$ implies that including the $i$-th point is harmful for lowering $f$. A large absolute magnitude of $s_i$ implies that a point is influential.
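To make the LOO definition above concrete, the following is a minimal brute-force sketch of the score $s_i = f(\hat{\theta}_{-i}) - f(\hat{\theta})$, assuming a scikit-learn-style binary classification setup; the names (loo_importance, eval_fn) and the synthetic data are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loo_importance(X, y, eval_fn):
    """s_i = f(theta_{-i}) - f(theta): positive means point i helps lower f."""
    base = LogisticRegression().fit(X, y)
    f_old = eval_fn(base)
    scores = np.zeros(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # drop the i-th training point
        model_i = LogisticRegression().fit(X[mask], y[mask])
        scores[i] = eval_fn(model_i) - f_old   # change in f from removing z_i
    return scores

# Example: value points by their effect on training loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
eval_fn = lambda m: log_loss(y, m.predict_proba(X))
scores = loo_importance(X, y, eval_fn)
```

Retraining once per point is only feasible for small datasets, which is exactly the motivation for the approximations reviewed next.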

2.1 Influence Functions

Retraining for different weight configurations can be computationally expensive. koh2017understanding develop algorithms to approximate the effect of removing a training point on the loss at a test point by re-weighting its contribution. Suppose we modify the weight of $z_i$ from $w_i$ to $w_i + \epsilon_i$. Let $\hat{\theta}_{\epsilon_i}$ be the parameters obtained upon re-weighting. If we let $\epsilon_i = -\frac{1}{n}$, this amounts to dropping $z_i$ from the training data. Influence functions (IF) from robust statistics can be used to approximate $\hat{\theta}_{\epsilon_i}$ (influence; hampel1974influence). Assuming the loss is twice differentiable and convex in $\theta$, we can linearly approximate the parameters upon dropping $z_i$ as $\hat{\theta}_{-i} \approx \hat{\theta} + \frac{1}{n} H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta})$, where $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian of the loss. For details, see (koh2017understanding). We let $s_i$ be the importance score of the $i$-th training point according to IF. Per (koh2019accuracy), we can estimate the influence of dropping the $i$-th training point on any evaluation function $f$ as:

$s_i = f(\hat{\theta}_{-i}) - f(\hat{\theta}) \approx \frac{1}{n} \nabla_\theta f(\hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta})$   (1)

When $f$ is loss, koh2019accuracy note that influence is additive, which means that importance scores are additive. This implies that the importance of the training points in a set $S$ is given by $s_S = \sum_{i \in S} s_i$.
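Below is a hedged numpy sketch of the approximation in Equation 1 for L2-regularized logistic regression. It is an illustration under stated assumptions, not the authors' released implementation; the names (influence_scores, grad_f, lam) are placeholders, and grad_f is assumed to be the gradient of the chosen evaluation function at the trained parameters (e.g., the gradient of mean validation loss).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def influence_scores(X, y, theta, grad_f, lam=1e-3):
    """Approximate s_i ~ (1/n) * grad_f(theta)^T H^{-1} grad L(z_i, theta)."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    # Hessian of the mean logistic loss plus an L2 term for invertibility.
    H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
    # Per-example gradients of the logistic loss: (p_i - y_i) * x_i, with y in {0, 1}.
    grads = (p - y)[:, None] * X
    H_inv_v = np.linalg.solve(H, grad_f)   # H^{-1} grad_f, solved once and reused
    return (grads @ H_inv_v) / n           # one approximate score per training point
```

The key computational saving is that the Hessian is inverted (or solved against) once, after which every training point's score is a single inner product.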

2.2 Data Shapley

Instead of computing LOO parameters to obtain an importance score with respect to an evaluation function $f$, techniques like Data Shapley (DS) aim to directly compute importance scores (ghorbani2019data). Shapley values are a game-theoretic way to attribute value to players in a game. ghorbani2019data apply Shapley values to training point importance. They propose to compute the importance of $z_i$ as its average marginal contribution, $s_i^{\text{DS}} = C \sum_{S \subseteq D \setminus \{z_i\}} \binom{n-1}{|S|}^{-1} \big(f(\hat{\theta}_S) - f(\hat{\theta}_{S \cup \{z_i\}})\big)$, where $C$ is a constant, $S$ is a subset of the training data, and $\hat{\theta}_S$ denotes parameters learned on $S$. Most works regarding DS take $f$ to be loss, accuracy, or AUC. We can efficiently approximate DS using Monte Carlo sampling (ghorbani2019data; kwon2020efficient).
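The following is a simplified permutation-sampling sketch in the spirit of the Monte Carlo approximation used for Data Shapley, assuming accuracy on a validation set as the value function; it omits the truncation heuristics of ghorbani2019data, and the names (data_shapley_mc, n_perms) are illustrative assumptions rather than their API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def data_shapley_mc(X, y, X_val, y_val, n_perms=50, seed=0):
    """Average marginal contribution of each point over random permutations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev_score = 0.5                        # baseline performance: random guessing
        for k in range(n):
            idx = perm[:k + 1]
            if len(np.unique(y[idx])) < 2:      # cannot fit until both classes appear
                continue
            model = LogisticRegression().fit(X[idx], y[idx])
            score = accuracy_score(y_val, model.predict(X_val))
            values[perm[k]] += score - prev_score   # marginal contribution of perm[k]
            prev_score = score
    return values / n_perms
```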

2.3 Other Methods

khanna2019interpreting use Fisher kernels to select influential training points efficiently. Their method recovers IF when the loss is the negative log-likelihood. Specific to neural networks, yeh2018representer decompose the (pre-activation) prediction for a test point into a linear combination of activations for training points, using a modified representer theorem. The resulting training point weights correspond to a point's importance with respect to that test point. bhatt2020counterfactual assign training point importance based on how much loss is incurred when finding a set of alternative parameters that classify a specific point differently from the original model but that perform similarly on the training data. Their setup is a special case of IF (koh2017understanding). The methods discussed thus far are model-specific and depend on a specified evaluation function $f$. Another line of work has searched for prototypes, which are representative points that summarize a dataset independent of model parameters (bien2011prototype; gurumoorthy2019efficient). kim2016MMD use maximum mean discrepancy (MMD) to find prototypes but do not assign importance scores to the selected points. A large MMD implies that the samples are likely from different distributions (gretton2012kernel). Since prototypes are model-agnostic, we omit them from our evaluation, as we are interested in finding diverse training points important to a model.

3 Selecting Diverse Samples

As shown in Figure 1(a), the top-$k$ influential points based on importance scores can result in a set of points that are similar to each other. We desire points that are simultaneously influential (high importance) and diverse across the feature space to serve as an explanation of model behavior. To achieve these desiderata, we propose the following objective:

$\max_{S \subseteq D,\, |S| = k} \; F_{\text{imp}}(S) + \lambda F_{\text{div}}(S)$   (2)

where $S$ is a subset of $k$ important points from the dataset $D$, $F_{\text{imp}}$ is a normalized function that captures the importance of the points in $S$, $F_{\text{div}}$ is a function that captures the diversity of the points in $S$, and $\lambda$ controls the trade-off between the two terms. Solving the optimization problem in Equation 2 yields a set of DIVerse and INfluEntial points which we call DIVINE points. Setting $\lambda = 0$ recovers the traditional setup of selecting the points with the highest importance. Our setup is reminiscent of combining loss functions (e.g., one to penalize training error and one to regularize for sparsity, smoothness, etc.): we effectively regularize for diversity in the influential points we select. Our formulation in Equation 2 is similar to that of lin-bilmes-2011-class, who select relevant yet diverse sentences to summarize a document, and that of prasad2014submodular, who scale diverse set selection to exponentially large datasets.

To this end, we take $F_{\text{imp}}(S)$ to be the sum of the importance scores of the points in $S$. We propose three submodular choices of $F_{\text{div}}$: A. Sum-Redundancy: $F_{\text{div}}^{SR}(S) = -\sum_{i, j \in S} K(z_i, z_j)$ (libbrecht2018choosing); B. Facility-Location: $F_{\text{div}}^{FL}(S) = \sum_{i \in D} \max_{j \in S} K(z_i, z_j)$ (krause2014submodular); and C. MMD: $F_{\text{div}}^{MMD}(S) = \frac{2}{n|S|}\sum_{i \in D}\sum_{j \in S} K(z_i, z_j) - \frac{1}{|S|^2}\sum_{i, j \in S} K(z_i, z_j)$ (kim2016MMD), where $K(z_i, z_j)$ is the similarity between two points. We let $K$ be the radial basis function kernel.

1:  Input: Dataset $D$, trade-off parameter $\lambda$, number of diverse influential points $k$
2:  for all $z_i \in D$ do
3:     $s_i \leftarrow$ influence($z_i$)
4:  end for
5:  $S \leftarrow \emptyset$
6:  while $|S| < k$ do
7:     $z^* \leftarrow \arg\max_{z \in D \setminus S} \; F_{\text{imp}}(S \cup \{z\}) + \lambda F_{\text{div}}(S \cup \{z\})$
8:     $S \leftarrow S \cup \{z^*\}$
9:  end while
10:  Output: Set of $k$ DIVINE points $S$
Algorithm 1 Greedy DIVINE selection

While $F_{\text{div}}^{SR}$ encourages us to find influential points that are diverse from each other, both $F_{\text{div}}^{FL}$ and $F_{\text{div}}^{MMD}$ encourage our influential points to be representative of the training data. $F_{\text{div}}^{SR}$ is known as penalty-based diversity and penalizes similarity between points in $S$ (lin-bilmes-2011-class; tschiatschek2014learning). $F_{\text{div}}^{FL}$ maximizes the average similarity between a training point and its most similar point in $S$. $F_{\text{div}}^{MMD}$ ensures the selected influential points are similar to the training data while being different from each other. We use $F_{\text{div}}^{SR}$ in the main text, but practitioners can select the $F_{\text{div}}$ that is appropriate for their use case.
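For concreteness, here is a hedged numpy sketch of the three diversity functions above, written against an RBF kernel. The exact normalizations and kernel bandwidth used in the paper may differ; treat these as illustrative, and note that the function names are our own.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise RBF similarities between rows of X and rows of Y."""
    return np.exp(-gamma * cdist(X, Y, "sqeuclidean"))

def sum_redundancy(K_SS):
    """Penalize pairwise similarity within the selected set S."""
    return -K_SS.sum()

def facility_location(K_DS):
    """Reward each training point's similarity to its closest selected point."""
    return K_DS.max(axis=1).sum()

def mmd_diversity(K_DS, K_SS):
    """Reward representativeness of S while penalizing within-S similarity."""
    n, m = K_DS.shape
    return 2.0 * K_DS.sum() / (n * m) - K_SS.sum() / (m ** 2)
```

Here K_SS is the kernel matrix among selected points and K_DS is the kernel matrix between all training points and the selected points.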

3.1 Optimization Procedure

It is well known that a modular function plus a submodular function is submodular, rendering the overall objective submodular (bach2011learning). As such, we take a greedy approach to performing the optimization in Equation 2, as outlined in Algorithm 1. Greedy selection returns a set that typically performs very well and is guaranteed to achieve at worst a $(1 - \frac{1}{e})$ fraction of the performance of the optimal set (nemhauser1978analysis). We can also take a stochastic greedy approach per mirzasoleiman2015lazier. Instead of using the entire dataset to find the element with maximum marginal gain, we would randomly sample a subset of points at each iteration, calculate the marginal gain for each of the sampled points, and add the one with the highest gain to $S$ until $|S| = k$. In practice, stochastic greedy is preferred on large datasets, where the computational cost of full greedy alone can be high. Moreover, some may find our additivity assumption, which lets $F_{\text{imp}}$ be modular, too restrictive. However, note that, by construction, DS satisfies linearity, which implies modularity. For IF, we find in Appendix C that modularity holds for various $f$ as long as the number of removed points is not too large. Furthermore, instead of calculating importance scores for the entire dataset and then performing greedy selection, we could select the first point greedily, recalculate importance scores for the remaining points, greedily select the next point, and repeat until we have $k$ points. Appendix D shows that this works similarly to our approach. Future work can develop other selection methods.
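The sketch below illustrates the greedy selection in Algorithm 1 with the sum-redundancy diversity term, using precomputed importance scores and an RBF similarity matrix (e.g., from the sketches above). The names (greedy_divine, lam) are illustrative assumptions, and this is a sketch of the technique rather than the released implementation.

```python
import numpy as np

def greedy_divine(scores, K, k, lam=1.0):
    """Greedily maximize F_imp(S) + lam * F_div(S) with F_div(S) = -sum_{i,j in S} K_ij."""
    n = len(scores)
    selected, remaining = [], set(range(n))
    for _ in range(k):
        best, best_gain = None, -np.inf
        for j in remaining:
            # Marginal gain of adding j: its importance minus the similarity penalty
            # to already-selected points (the constant diagonal term is dropped,
            # which does not change the argmax for an RBF kernel).
            penalty = K[j, selected].sum() if selected else 0.0
            gain = scores[j] - lam * 2.0 * penalty
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
        remaining.remove(best)
    return selected
```

A stochastic-greedy variant would evaluate the gain only on a random subsample of `remaining` at each iteration, trading a small loss in the guarantee for a large reduction in compute.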

3.2 Local Explanations for Individual Points

In addition to obtaining a global diverse set of influential training points that explains a model's behavior, our framework is amenable to obtaining local explanations: a diverse set of points that explains a model's prediction for a specific point, $z_{\text{test}}$. To accomplish this, we can let $f$ be the loss at $z_{\text{test}}$: $f(\theta) = L(z_{\text{test}}, \theta)$. For IF, the impact of the $i$-th training point on $f$ can be approximated as $s_i \approx \frac{1}{n} \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z_i, \hat{\theta})$ (koh2017understanding). barshan2020relatif notice that the top-$k$ influential points selected by Equation 1 tend to be outliers, and add locality constraints to this objective. However, they solve a slightly different problem from ours: their method, RelatIF, is concerned with data points being atypical, whereas DIVINE focuses on providing a diverse explanation by ensuring that the same region does not get marked as influential repeatedly. Our method would select a diverse set of outliers (if those are indeed influential) whereas the constraints of barshan2020relatif would not permit it. We compare our local DIVINE explanations to RelatIF in Section 6.1.

4 Fairness Valuation

Existing approaches to valuing data points usually take $f$ to be a model's training loss (barshan2020relatif; koh2017understanding), accuracy, or AUC (ghorbani2019data). We propose that $f$ can also be any group fairness criterion. This allows us to value training data based on their helpfulness or harmfulness for achieving various notions of fairness. (This problem can be seen as the inverse of fair data augmentation or fair active learning (anahideh2020fair), approaches which identify additional data to collect.) In the feature importance literature, datta2016algorithmic use quantitative input influence to evaluate the effect of removing a feature on a model's fairness. lundberg2020explaining use Shapley values to decompose demographic parity in terms of feature contributions. Our work can be seen as an influential-point analog: while they extend feature importance to identify features that contribute to unfairness, we extend training point importance to identify which points contribute to unfairness.

Let our model have parameters $\theta$ and a binary predicted outcome $\hat{y}$ for some input $x$. Let $y$ be the actual outcome for $x$. Let $A$ be a binary sensitive attribute that is contained explicitly in $x$ or encoded implicitly in $x$. When we refer to subgroups, we mean partitions of $D$ based on $A$. Let the training points in Group 1 be given by $D_1 = D_1^{+} \cup D_1^{-}$, where $D_1^{+}$ are positives in Group 1, and the training points in Group 2 be given by $D_2 = D_2^{+} \cup D_2^{-}$. We can further partition each set by predicted outcome, e.g., $D_1^{+} = D_1^{\mathrm{TP}} \cup D_1^{\mathrm{FN}}$, where $D_1^{\mathrm{FN}}$ captures false negatives in Group 1. In this work, we define unfairness as the difference in accuracy between subgroups; this is sometimes referred to as (un)equal accuracy. (Extension to other group fairness notions is straightforward.) berk2018fairness says $\theta$ is fair (with respect to equal accuracy) if the following $f_{\text{unfair}}(\theta)$ is close to $0$:

$f_{\text{unfair}}(\theta) = \big|\mathrm{TPR}_1(\theta) - \mathrm{TPR}_2(\theta)\big| + \big|\mathrm{TNR}_1(\theta) - \mathrm{TNR}_2(\theta)\big|$   (3)

where $\mathrm{TPR}_1(\theta)$ is the true positive rate for Group 1 under $\theta$. We take the sum of the absolute difference in true positive rates between subgroups and the absolute difference in true negative rates between subgroups as a measure of unfairness, per Equation 3; the larger its magnitude, the more unfair. Note $f_{\text{unfair}}(\theta) \in [0, 2]$. Practitioners can calculate $f_{\text{unfair}}$ on training, validation, or test data. They may leverage importance scores with respect to $f_{\text{unfair}}$ to identify points hurting their model's fairness. We refer to the points harmful to $f_{\text{unfair}}$ as unfairness-inducing points. By removing such points from their datasets, practitioners can potentially improve model fairness and accuracy.
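A minimal sketch of the equal-accuracy unfairness metric in Equation 3, assuming binary labels, binary predictions, and a binary sensitive attribute; the function names are illustrative, not the authors' API.

```python
import numpy as np

def rate(y_true, y_pred, positive):
    """TPR if positive=1, TNR if positive=0, within one subgroup."""
    mask = y_true == positive
    return np.mean(y_pred[mask] == positive) if mask.any() else 0.0

def equal_accuracy_unfairness(y_true, y_pred, a):
    """|TPR_1 - TPR_2| + |TNR_1 - TNR_2| over the two subgroups defined by a."""
    g1, g2 = (a == 0), (a == 1)
    tpr_gap = abs(rate(y_true[g1], y_pred[g1], 1) - rate(y_true[g2], y_pred[g2], 1))
    tnr_gap = abs(rate(y_true[g1], y_pred[g1], 0) - rate(y_true[g2], y_pred[g2], 0))
    return tpr_gap + tnr_gap   # in [0, 2]; larger means more unfair
```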

Figure 2: Toy Example

To check if our method can detect erroneous, unfairness-inducing points, we construct a toy example. In Figure 2, a set of four 2-dimensional data points are drawn on the four corners of a square. The top two points share one label and the bottom two points the other. We assign sensitive attributes by saying the left side is from Group 1 (orange) and the right side is from Group 2 (blue). A fifth point is added farther away in the top left (orange). A logistic regression model is fit to these five points (dashed line), obtaining perfect accuracy and an ideal unfairness of 0. We then inject a poisonous point into the dataset (top right, orange). Note that, with respect to the original model, this point is incorrectly classified and has an inconsistent sensitive attribute. A logistic regression model is fit to all six points (solid line). This poisoned model no longer classifies every point correctly and has nonzero unfairness. We find importance scores for all six points with respect to loss and with respect to $f_{\text{unfair}}$. The most influential point with respect to loss is the correctly classified outlier (yellow diamond); however, the most influential point with respect to $f_{\text{unfair}}$ is the poisonous point (red diamond). This demonstrates that valuing data with respect to $f_{\text{unfair}}$ can detect unfairness-inducing points and does not simply find outliers.

5 Experiments

We evaluate our approach on multiple datasets and identify DIVINE points with respect to multiple measures. We first visualize the set of DIVINE points found by running Algorithm 1. We compare importance scores found with respect to loss and with respect to $f_{\text{unfair}}$. We then learn fairer models by removing low-value points, i.e., those that contribute most to unfairness. We qualitatively review the unfairness-inducing points.

To validate our method, we run experiments on the following datasets: synthetic data, LSAT (kusner2017counterfactual), Bank Marketing (Dua:2019), COMPAS (compas), Adult (Dua:2019), a two-class variant of MNIST (lecun1998mnist), and FashionMNIST (xiao2017fashion). Code for our experiments can be found at https://github.com/umangsbhatt/divine-release. We primarily consider logistic loss, $L(z_i, \theta) = \log(1 + \exp(-y_i \theta^\top x_i))$, such that the logistic likelihood is given by $\sigma(y_i \theta^\top x_i)$, where $\sigma(u) = \frac{1}{1 + e^{-u}}$ and $y_i \in \{-1, +1\}$. For image datasets, we train multilayer perceptrons and convolutional neural networks, both with cross-entropy loss.

For synthetic experiments, we follow zafar2017fairness in generating a synthetic dataset. First, we generate binary labels uniformly at random. We then assign a feature vector to each label by sampling from one of two Gaussian distributions, one per class. We draw a sensitive attribute for each sample from a Bernoulli distribution whose parameter depends on $x'$, a rotated version of the feature vector $x$.
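The sketch below generates data in the spirit of the zafar2017fairness-style setup described above. The Gaussian means, covariances, rotation angle, and the link between the rotated features and the Bernoulli parameter are illustrative placeholders, not the exact values used in the paper.

```python
import numpy as np

def make_synthetic(n=400, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                       # binary labels, uniform at random
    means = {1: np.array([2.0, 2.0]), 0: np.array([-2.0, -2.0])}
    covs = {1: np.array([[5.0, 1.0], [1.0, 5.0]]),
            0: np.array([[10.0, 1.0], [1.0, 3.0]])}
    X = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in y])
    # Sensitive attribute: Bernoulli whose parameter depends on a rotated copy of x.
    phi = np.pi / 4.0
    R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
    Xr = X @ R.T
    p = 1.0 / (1.0 + np.exp(-(Xr[:, 0] + Xr[:, 1])))     # placeholder link function
    a = rng.binomial(1, p)
    return X, y, a
```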

(a) Influence-Diversity Tradeoff
(b) DIVINE
(c) DIVINE
(d) DIVINE
Figure 3: We characterize the trade-off between influence and diversity in 3(a) by varying $\lambda$. We also visualize the top-5 DIVINE points for select values of $\lambda$. The red diamond in 3(a) is $\lambda = 0$, which recovers the top points from IF alone, plotted in 3(b). The orange diamond in 3(a) is the $\lambda$ we find such that our DIVINE points sacrifice a specified amount of influence relative to the IF points; these are visualized in 3(c). The yellow diamond in 3(a) is the $\lambda$ we find such that we maximize the average pairwise distance between our DIVINE points, seen in 3(d).

5.1 Selecting Diverse Influential Points

Figure 4: Tradeoff curves for various $k$

First, we validate our approach to greedily select DIVINE points as a global explanation for a logistic regression model trained on our synthetic data. In Figure 3, we show how DIVINE values data points using IF importance scores (with $f$ as training loss) and the sum-redundancy diversity term. In Figure 3(a), we characterize the trade-off between influence and diversity. We obtain the black line by varying $\lambda$. We normalize influence such that the y-axis shows how much less influence the DIVINE points contain than the top IF points. The red diamond represents $\lambda = 0$, which maximizes influence, i.e., the top IF points. We suggest two ways to select $\lambda$. One option is to specify a specific amount of influence to sacrifice. In Figure 3(c), we find $\lambda$ by specifying that we want our top-$k$ DIVINE points to have a fixed amount less influence than the top-$k$ IF points; we indicate the corresponding point in orange on the trade-off curve in Figure 3(a). Another option is to find the $\lambda$ that maximizes the average pairwise distance between points in $S$, which is depicted by the yellow diamonds in Figure 3(d). In both selection mechanisms, we select $\lambda$ by running a log sweep over candidate values. Our mechanisms for selecting $\lambda$ ensure that our set of points has high diversity at the expense of little influence. In Figure 4, we show how our trade-off curves vary as we add more DIVINE points (increasing $k$). Each trade-off curve has the same shape as Figure 3(a), but due to scaling might appear linear when compared to curves for larger $k$; for example, the rightmost curve in Figure 4 is the same as the curve shown in Figure 3(a). As we increase $k$, the diversity ($F_{\text{div}}$) of the IF points (red) decreases, implying redundancy in the selected points. This confirms the findings of barshan2020relatif. Practitioners can select any place along the black curve to identify positions that trade off influence and diversity. Our suggested selection strategies are shown as yellow and orange diamonds. In Appendix D, we show similar curves to Figure 4 when valuing points with other methods like DS, when using other diversity measures like $F_{\text{div}}^{FL}$ and $F_{\text{div}}^{MMD}$, and when using other datasets. We find that when optimizing Equation 2 and varying $\lambda$, we maintain high influence while achieving the desired diversity.
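A hedged sketch of the second selection heuristic described above: sweep $\lambda$ on a log grid and keep the value whose DIVINE set has the largest average pairwise distance. It reuses greedy_divine from the earlier sketch; the names (select_lambda, lambdas) are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def select_lambda(scores, K, X, k, lambdas=np.logspace(-3, 3, 13)):
    """Pick the lambda whose greedy DIVINE set maximizes average pairwise distance."""
    best_lam, best_spread = None, -np.inf
    for lam in lambdas:
        idx = greedy_divine(scores, K, k, lam=lam)   # DIVINE set for this lambda
        spread = pdist(X[idx]).mean()                # average pairwise distance in S
        if spread > best_spread:
            best_lam, best_spread = lam, spread
    return best_lam
```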

5.2 Modes within Unfairness-Inducing Points

Studying the diverse modes of data that contribute to canonical model behavior, herein with respect to unfairness, can help practitioners analyze their models. In Table 1 for LSAT, we qualitatively compare the top-$k$ diverse unfairness-inducing points found by Equation 2 (left) and the top-$k$ most influential points (right). We use IF importance scores computed with respect to $f_{\text{unfair}}$, combined with the sum-redundancy diversity term. Sex is the sensitive attribute: male (M) or female (F). FYA is first-year average, which is binarized to pass/fail. The maximum possible LSAT score is 48, and the maximum GPA is 4.0. Notice the lack of diversity in the points on the right: most points are males with poor LSAT scores and low GPAs yet pass their course. DIVINE points include not only points with poor LSAT scores and low GPAs that pass but also points with high LSAT scores and high GPAs (which are mostly female) yet fail. By trading off influence and diversity (left), we identify an unfairness-inducing "mode" of the dataset, female participants with high LSAT scores and GPAs who nonetheless fail, which is not identified by influence alone (right). We visualize this diversity in low dimensions via TSNE (van2014accelerating) in Figure 5: notice how all the IF points are clustered. This ability to detect modes missed by the top IF points highlights the utility of DIVINE. With DIVINE, we find multiple modes that lead to unfairness in our model. Quantitatively, we show that DIVINE does a better job of covering clusters of data in input space. In Figure 6, we cluster the full dataset using KMeans and then find the smallest $k$ such that one point from each cluster is in the top-$k$ points of IF and of DIVINE. The black line is the lower bound, where $k$ equals the number of clusters. DIVINE requires a smaller $k$ than IF to represent all clusters of the data in the top-$k$ points. The redundancy of the top IF points makes it difficult to get a holistic picture of model behavior, as top IF points lie in clusters where we already have important points identified. Unlike IF points, DIVINE points allow us to identify various modes of data that contribute to a model's unfairness.

  DIVINE (Equation 2)            Top IF points
  LSAT  GPA  Sex  FYA            LSAT  GPA  Sex  FYA
  14    2.9  F    Pass           20    2.8  M    Pass
  25    3.6  M    Pass           25    3.6  M    Pass
  20    2.8  M    Pass           20    3.2  M    Pass
  41    3.8  F    Fail           14    2.9  F    Pass
  33    4.0  F    Fail           21    3.1  M    Pass
  45    3.9  F    Fail           23    2.8  M    Pass
  29    3.7  F    Fail           22    2.9  M    Pass
  37    3.8  F    Fail           26    3.7  M    Pass
Table 1: LSAT Unfairness-Inducing Points (DIVINE, left; IF, right)
Figure 5: TSNE
Figure 6: LSAT Clustering

5.3 Removing Unfairness-Inducing Points

Once we have detected unfairness-inducing points, we may hope to improve our model's fairness outcomes. We now consider removing unfairness-inducing points identified with DIVINE. We first calculate importance scores for each training point with respect to $f_{\text{unfair}}$. To use Algorithm 1 to find points to remove, we negate each importance score (harmful points now have positive importance), allowing us to perform submodular maximization via Equation 2. We iteratively select sets of unfairness-inducing points to remove based on Equation 2, where each set is a small fixed fraction of the training data. After removing the selected points, we retrain. In Figure 7, we plot accuracy and unfairness ($f_{\text{unfair}}$) as increasing fractions of the training data are removed. For all tabular datasets, $f_{\text{unfair}}$ remains stable or decreases until a large fraction of the dataset has been removed, and the corresponding drop in accuracy is minor. Consider IF on COMPAS: dropping an initial small fraction of the training points reduces unfairness substantially while only incurring a slight drop in accuracy. This implies that in many cases, a significant drop in unfairness can be achieved by dropping a small fraction of the training data points.
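A hedged sketch of this removal-and-retraining loop: score points with respect to unfairness, negate the scores so harmful points rank highest, greedily pick a batch via Equation 2, drop it, and retrain. It reuses greedy_divine and equal_accuracy_unfairness from earlier sketches; the names and batching scheme are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def remove_and_retrain(X, y, a, K, scores_unfair, batch_size, n_rounds, lam=1.0):
    keep = np.arange(len(y))
    history = []
    for _ in range(n_rounds):
        # Negate so unfairness-inducing (harmful) points get positive importance.
        batch = greedy_divine(-scores_unfair[keep], K[np.ix_(keep, keep)],
                              batch_size, lam=lam)
        keep = np.delete(keep, batch)                # drop the selected positions
        model = LogisticRegression().fit(X[keep], y[keep])
        y_hat = model.predict(X)
        history.append(equal_accuracy_unfairness(y, y_hat, a))
    return keep, history
```

This variant keeps the original scores fixed across rounds; recalculating them after each batch corresponds to the removal-with-recalculation experiments reported in Appendix D.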

In Figure 7, we remove sets of points based on importance scores calculated with respect to the original model. Instead, we can recalculate importance scores after every set of points is dropped; results for removal with recalculation are in Appendix D. In Figure 7, we report performance metrics on the training data after removing unfairness-inducing points. However, we can calculate performance metrics on the test data to report the effects of removal on accuracy and unfairness in generalization as well. We can also calculate $f_{\text{unfair}}$ on held-out validation data (instead of on training data as in Figure 7) when scoring training points and can even report performance metrics on test data. In Appendix D, we conduct experiments with validation and test data.

The benefit of removing DIVINE points is reflected in the performance on LSAT in Figure 7. The composition of the first batch of points removed via IF with respect to $f_{\text{unfair}}$ (green), in terms of sex and of whether the original model classified them correctly, differs substantially from that of the training data, and the model performs poorly after these points are removed. In contrast, the first batch of points removed with DIVINE (blue) more closely resembles the composition of the training data in both respects. Thus, we find that we can value data points with respect to unfairness and then remove harmful points to improve fairness outcomes.

(a) COMPAS
(b) LSAT
(c) Adult
(d) Bank
Figure 7: Training data performance on four datasets after removal of unfairness-inducing points. The first row shows accuracy; the second row shows unfairness. Methods for selecting points to remove are denoted by line color. Blue selects DIVINE points per Equation 2 with IF, $F_{\text{div}}^{SR}$, and $\lambda$ tuned via pairwise distance. Orange denotes randomly selected points. Green uses IF to select points ($\lambda = 0$). Red chooses points by maximizing $F_{\text{div}}$ alone. Grey indicates the original model's performance. A diamond indicates the point after which all unfairness-inducing points have been removed; after this point, we expect unfairness to increase, as we start removing low-importance (but helpful) points. With both IF and DIVINE, valuing data with respect to $f_{\text{unfair}}$ identifies harmful data to remove; after removal, fairness outcomes improve greatly, though accuracy may drop slightly.

6 User Studies

We conduct user studies to validate how useful DIVINE points are for explanation and for simulatability. Details about all user studies can be found in Appendix E. For all experiments in this section, we take DIVINE to be IF with $F_{\text{div}}^{SR}$ and find $\lambda$ by maximizing average pairwise distance.

6.1 Manually Examining DIVINE Explanations

We consider how DIVINE points can be used as global and local training point-based explanations. We first assess whether DIVINE points provide sufficient diversity and are not dominated by outliers like IF points (barshan2020relatif). While Section 5 primarily focuses on DIVINE points in tabular data and convex models, here we use image data and non-convex models. We train a convolutional neural network on FashionMNIST. The top images obtained by IF and DS are similar to each other, as shown in Figure 1(b), while DIVINE points are more varied. Notably, the amount of influence contained in the top-$k$ points from IF and from DIVINE is roughly similar: for a small sacrifice in influence, we obtain much more diversity in the selected set. Similar results hold for a multi-layer perceptron trained on a two-class subset of MNIST in Appendix D: DIVINE points consider both digit classes, in various modes (differing thickness or style), to be influential. In Figure 1(c), when explaining a misclassified test point, DIVINE finds visually different training points to serve as an explanation. The average pairwise distance between the top DIVINE points is nearly double the average pairwise distance between the top IF points (Appendix D). As demonstrated on MNIST and FashionMNIST, DIVINE points have more diversity in feature space, which can help practitioners see which input regions influence their model.

Next, we conduct a user study to validate the utility of displaying DIVINE points as an explanation for practitioners. We asked participants with computer science experience to rank the diversity of the top-$k$ influential points from various methods (details in Appendix E). When shown the top-$k$ FashionMNIST points from IF and DIVINE, for two different values of $k$, most participants said DIVINE was more diverse than IF. One participant noted that "[DIVINE] seemed to be more distinct and varied," while another said that "with [DIVINE] a more representative selection was used."

In practice, influential training points can be useful for displaying the trustworthiness of a model (cai2019human). zhou2019effects display the top IF points and ask participants to rate the trustworthiness of the resultant model. Similarly, we displayed FashionMNIST points and asked participants to rate the trustworthiness of the resultant model. We showed one set of points from DIVINE and another from IF, and participants provided a trustworthiness rating from 1 (Least) to 5 (Most) for each set of points. DIVINE was deemed more trustworthy than IF (t-test). A participant said "[DIVINE] has more variety so it is more trustworthy." Participants were then asked to decide whether IF, RelatIF, or DIVINE provided a more useful local explanation for a misclassification on FashionMNIST. After seeing the misclassification and the top points from each of the three methods, most participants preferred DIVINE over RelatIF and IF. One participant said that "the mistake made by the model is made obvious since the shape between shirt and coat are shown in [DIVINE]," which suggests the modes detected by DIVINE (per Section 5.2) are useful in practice. This confirms that DIVINE points are not only quantitatively more diverse than IF points, but are also qualitatively perceived to be more diverse, trustworthy, and useful.

6.2 DIVINE for Simulatability

Figure 8: User-Drawn Decision Boundaries

Many user studies for explainability consider how users can perform forward simulation, i.e., where a user uses an explanation to predict the model's behavior on an unseen test point (doshi2017towards; hase-bansal-2020-evaluating). Another important consideration of explanations is simulatability (hoffman2018metrics; lipton2018mythos), which measures how well a user can reason about an entire model given an explanation. We posit that diversity in influential samples will help practitioners with simulatability. To test the simulatability of DIVINE points, we ask practitioners to reconstruct a model decision boundary given a set of points from our synthetic data. Survey details can be found in Appendix E. Our goal is to measure the similarity between the user-drawn decision boundary and the true decision boundary. To a new set of participants, we display points on a grid, each colored by its predicted class; participants see either a smaller or a larger set of points. We then ask each participant to draw a decision boundary that separates the two classes after seeing the top-$k$ IF points, the top-$k$ DIVINE points, or $k$ random points. In both settings, the cosine similarity between the user-drawn boundary and the true boundary is higher with DIVINE than with IF. We show the average user-drawn decision boundary in Figure 8. Notice that the average decision boundary drawn after observing DIVINE points is closer to the true decision boundary. We find that DIVINE points were considerably more helpful than IF points to participants (t-test). While DIVINE points are not optimized for decision boundary reconstruction and might be misleading in some cases, it is reassuring to know that they provide sufficiently diverse explanations such that users can reconstruct the model decision boundary. Though our study was performed on 2D synthetic data with linear decision boundaries, our findings are promising. We hope in future work to extend this study to higher dimensions. We conclude that practitioners find DIVINE points helpful for simulatability.

7 Conclusions, Limitations, and Future Work

Method                      Metric valued
Prototypes (kim2016MMD)     N/A
IF (koh2017understanding)   Loss
DS (ghorbani2019data)       Accuracy and AUC
DIVINE (Ours)               Loss, Unfairness, etc.
Table 2: Comparison

In this work, we propose an approach for finding DIVerse INfluEntial (DIVINE) training points. We note that existing training point importance methods tend to assign high importance to similar points; hence, we propose a method to select a diverse and influential subset of the data using submodular optimization. In Table 2, we summarize how DIVINE compares to existing training point importance methods. Our method enables practitioners to inject diversity into explanations of model behavior, derived from training point importance scores. Additionally, previous work has mainly investigated influential points with respect to a model’s loss. We go further by considering valuation of data points with respect to model fairness. We then examine how unfairness can be reduced. We use DIVINE to detect and remove unfairness-inducing points, leading to improvements in model fairness. Our experiments on synthetic and real-world datasets demonstrate that, using DIVINE, practitioners can visualize a diverse summary of influential training points and thus understand the possible modes of data that contribute to their model’s behavior. In our user studies, we find that practitioners perceive DIVINE points to be more diverse, more useful, and more trustworthy. Practitioners also find DIVINE helpful for model simulatability.

We acknowledge that practitioners may not want diversity in their influential training points if they do not desire a complete picture of model behavior. For local explanations, one participant in our user study (Section 6.1) noted that DIVINE points "may include conflicting and uncomparable [sic] items." As such, practitioners may want to clarify their goals for using training point importance and leverage diversity accordingly. In Appendix A, we provide a detailed guide for how practitioners can leverage DIVINE and our codebase. Specific to removing points, we note that removing unfairness-inducing points might be too harsh. Future work might learn to down-weight those points ($0 < w_i < \frac{1}{n}$) as opposed to dropping them ($w_i = 0$). Future work could also value data using alternative metrics, such as robustness or privacy. Nonetheless, practitioners can leverage DIVINE to value training points based on their effect on model-specific evaluation metrics and to summarize model behavior either locally or globally. We hope DIVINE is a helpful intervention for practitioners to generate data visualizations and refine their models.

Acknowledgements

The authors thank Matthew Ashman, Krishna Gummadi, Weiyang Liu, Aditya Nori, Elre Oldewage, Richard E. Turner, and Timothy Ye for their comments on this manuscript. UB acknowledges support from the Mozilla Foundation and from DeepMind and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI). IC acknowledges support from Microsoft Research. AW acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via CFI.

References

Appendix

This appendix is formatted as follows.

  1. We present a practitioner guide in Appendix A. We discuss how one would go about selecting the various parameters used to find DIVINE points.

  2. We discuss extensions of counterfactual prediction for training point importance in Appendix B.

  3. We provide additional details about our experimental setup in Appendix C.

  4. We report additional experimental results in Appendix D.

  5. We discuss details of our user studies in Appendix E.

Appendix A Practitioner Guide

Throughout the paper, we use the word "practitioners" to refer to data scientists who can use DIVINE in practical ML settings where explainability is valued, or those who hope to refine their models by better understanding their training data. In this guide, we explain how practitioners can select the parameters used in DIVINE: the importance measure, the evaluation function $f$, the diversity function $F_{\text{div}}$, the influence-diversity tradeoff $\lambda$, and the number of DIVINE points, $k$. Our code can be extended to support additional influence measures, submodular diversity functions, evaluation functions, and selection strategies.

A.1 Influence Measure

Within our work, the influence measure assigns importance to individual data points and to groups of data points. We aim to find an importance score $s_i$ for the $i$-th training point. Under our additivity assumption per koh2019accuracy, we let the importance of a set of points $S$ be $s_S = \sum_{i \in S} s_i$. We can obtain importance scores from various methods. In our main paper and in our additional experiments (Appendix D), we compute importance with influence functions [koh2017understanding], Data Shapley [ghorbani2019data], counterfactual prediction [bhatt2020counterfactual], or leave-one-out (LOO) [hastie2009elements]. In Table 3, we compare various methods for finding valuable training points with respect to a model and an evaluation function $f$.

Method                                              Metric valued
Prototypes/Criticisms [kim2016MMD]                  N/A
ProtoDash [gurumoorthy2019efficient]                N/A
Influential Points [koh2017understanding]           Loss
Representer Points [yeh2018representer]             Loss
Seq. Bayesian Quadrature [khanna2019interpreting]   Loss
Data Shapley [ghorbani2019data]                     Accuracy and AUC
RelatIF [barshan2020relatif]                        Loss
DIVINE (Ours)                                       Any (Loss, Unfairness, etc.)
Table 3: Practitioners can leverage DIVINE to value data points based on their contributions to model-specific evaluation metrics and then select a diverse, influential subset of points as a summary. We list other methods based on (1) their dependence on model parameters (e.g., prototypes are model independent), (2) the diversity in the points to which they assign high importance, and (3) the metric with respect to which the data points are valued.

One would select influence functions if they prefer a fast computation (assuming the Hessian computation is done once for a reasonably sized set of parameters). One could use Data Shapley if they want importance scores that adhere to the game-theoretic guarantees of the Shapley value. Counterfactual prediction is a training point importance method that is robust to label noise: we discuss this at length in Appendix B. LOO scores are also possible to compute, but note these may be computationally expensive as retraining is required for every point that is dropped. We show how all methods behave in comparison to each other on our synthetic data in Appendix D. In our package, influence functions, Data Shapley, LOO, and counterfactual prediction are supported.

Moreover, when applying an importance scoring measure, a practitioner may have to select an evaluation function $f$ with respect to which they want to value data points. While we primarily let $f$ be loss or $f_{\text{unfair}}$ in our experiments, we report additional evaluation functions that can be used to score data points in Table 4. For every evaluation function, we write the function such that a large $f$ is undesirable: a negative importance score (if our score is taken to be the difference between $f$ evaluated at the new and old parameters, i.e., trained without and with the point, respectively) implies that a point is harmful to $f$. Moreover, we can also consider functions that do not simply take a difference between the evaluation function's value at the new and old parameters. These functions might be independent of $f(\theta_{\text{old}})$, e.g., the norm of the parameter change itself. However, from a model debugging perspective, these definitions may not be relevant when the dimensionality of $\theta$ is even moderately large. We can also define importance scores that jointly consider unfairness and loss: a weighted combination of the two would tell us how important a point is for both. The evaluation functions in Table 4 represent other popular group fairness metrics. We hope that future work considers finding DIVINE points with respect to robustness and privacy, which requires devising a new $f$.

Metric                              Evaluation Function ($f$)
Loss (e.g., wrt training data)      $\frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta)$
Equal Accuracy [berk2018fairness]   $|\mathrm{TPR}_1 - \mathrm{TPR}_2| + |\mathrm{TNR}_1 - \mathrm{TNR}_2|$
Equal Opportunity [hardt2016]       $|\mathrm{TPR}_1 - \mathrm{TPR}_2|$
Equalized Odds [hardt2016]          $|\mathrm{TPR}_1 - \mathrm{TPR}_2| + |\mathrm{FPR}_1 - \mathrm{FPR}_2|$
Table 4: Various candidate evaluation functions $f$; note that lower values are desirable

Practitioners should select the $f$ that captures the property for which they wish to test their model. If one wants to see the impact of data points on performance, loss would be a good option. If one wants to see the impact of data points on fairness, $f_{\text{unfair}}$ would be a good option.

A.2 Diversity Function

One main ingredient in our DIVINE point selection is a submodular diversity function $F_{\text{div}}$. This allows us to perform greedy selection when adding points to our DIVINE set. While other non-submodular diversity functions are possible, they would not benefit from the ease of greedy selection. Future work might benefit from more clever set-selection methodologies. Throughout the paper, we mostly let our diversity function be the sum-redundancy function, $F_{\text{div}}^{SR}$. This function ensures that our selected points differ from each other, i.e., have low similarity [lin-bilmes-2011-class, libbrecht2018choosing]. In Appendix D, we demonstrate how the facility location function, $F_{\text{div}}^{FL}$, and maximum mean discrepancy, $F_{\text{div}}^{MMD}$, perform. The equations for each appear in Section 3. $F_{\text{div}}^{FL}$ is the submodular facility location function [krause2014submodular] from the sensor-placement literature; it selects points that are similar to as many points in the entire dataset as possible and does not explicitly prohibit redundancy between the points selected. If a practitioner does not mind some potential redundancy in the selected points and wants a set of points representative of the dataset, then $F_{\text{div}}^{FL}$ may be suitable. On the other hand, $F_{\text{div}}^{MMD}$ from kim2016MMD selects a set of points that summarizes the entire dataset and penalizes similarity between the chosen points. When $\lambda$ is large and $F_{\text{div}}^{MMD}$ is maximized, the prototypes of kim2016MMD are recovered. If a practitioner wants representativeness without much redundancy, $F_{\text{div}}^{MMD}$ might suffice. All three submodular diversity functions are implemented in our package.

A.3 Influence-Diversity Tradeoff

In Section 5, we introduce the influence-diversity tradeoff curve. This illustrates how $\lambda$ controls how much influence to forgo in favor of diversity with respect to $F_{\text{div}}$. Practitioners can select any $\lambda$ along the curve; we suggest two ways to pick it. A practitioner could specify the maximum amount of influence they are willing to sacrifice. A practitioner could also find the $\lambda$ that maximizes the average pairwise distance between the selected points. By default, our divine package favors the latter. Practitioners could also implement other selection strategies as they see fit.

A.4 Number of DIVINE Points

Selecting the number of DIVINE points to find and visualize will be use-case dependent. If the goal of finding DIVINE points is to display them as an explanation of model behavior, we suggest displaying only a handful of points, in line with the number of cognitive chunks a user can handle at any given moment [doshi2017towards]. We suggest curating the size of the explanation to the needs of the stakeholders who will be analyzing the DIVINE points [bhatt2020explainable]. When selecting points to remove, a practitioner may consider checking our additivity assumption, i.e., verifying that removing a large number of low-value points at once does not affect other metrics of interest. We discuss how one would go about doing this analysis in Appendix C. We find that, for small $k$, our additivity assumption is valid. Therefore, one might consider recalculating importance scores after removing a few batches of $k$ points. We hope future work develops additional methodology for choosing $k$.

A.5 Using Our Code

Our code is publicly available at https://github.com/umangsbhatt/divine-release, with a comprehensive README describing our implementation of DIVINE. We intend our code to be usable out of the box. We describe typical use-cases for our DIVINE codebase in the README. Practitioners can use our codebase by importing necessary files (as shown in tutorial.py) into their own code. All use-cases are runnable from tutorial.py. More details are available in our README.

Appendix B Counterfactual Prediction

We extend the work of bhatt2020counterfactual to be compatible with our approach to finding training point importance. We restrict ourselves to standard binary classification tasks, where our goal is to find a parameter $\theta \in \Theta$ such that $h_\theta$ learns a mapping between inputs $x$ and labels $y$. Given a training dataset $D = \{z_i = (x_i, y_i)\}_{i=1}^n$ from some underlying, unknown distribution $P$ and a non-negative loss function $L(z, \theta)$, our goal is to learn $\theta$ that minimizes training error yet performs well on unseen test data. The expected loss of $\theta$ is given by $\mathbb{E}_{z \sim P}[L(z, \theta)]$. Since we do not know $P$, we calculate the average loss over the training dataset, $\hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^n L(z_i, \theta)$. In an ERM setup, we find the optimal parameter as $\hat{\theta} = \arg\min_{\theta \in \Theta} \hat{L}(\theta)$.

Given a point $x$ and its predicted label $h_{\hat{\theta}}(x)$, we want to find alternate parameters $\theta^c$ such that we minimize empirical risk with the condition that the predicted label of $x$ is flipped: $h_{\theta^c}(x) \neq h_{\hat{\theta}}(x)$. We find the alternative classifier via $\theta^c = \arg\min_{\theta \in \Theta} \hat{L}(\theta)$ subject to $h_\theta(x) \neq h_{\hat{\theta}}(x)$. We can view this problem as optimizing over $\Theta$ with an added constraint, or as optimizing over a subspace $\Theta^c = \{\theta \in \Theta : h_\theta(x) \neq h_{\hat{\theta}}(x)\}$; note $\Theta^c \subset \Theta$. If $f$ is training loss, the quantity $f(\theta^c) - f(\hat{\theta})$ tells us how much our loss suffers when we introduce a constraint to conflict the predictions of $\theta^c$ and $\hat{\theta}$ on the point $x$. The importance of the $i$-th training point is given by the extra loss incurred when the constraint is placed on $x_i$.

The expected loss of $\theta^c$ is given by $\mathbb{E}_{z \sim P}[L(z, \theta^c)]$, and the average loss of $\theta^c$ over the training data is given by $\hat{L}(\theta^c)$. The difference in loss between $\theta^c$ and $\hat{\theta}$ tells us how much our loss suffers when we introduce a constraint to conflict the predictions of $\theta^c$ and $\hat{\theta}$ on a point $x$. Since we do not know $P$, we calculate an empirical variant over our training dataset. We write the empirical extra loss as the difference in training loss between $\theta^c$ and $\hat{\theta}$: we call this the counterfactual prediction (CFP), denoted by $\Delta(x) = \hat{L}(\theta^c) - \hat{L}(\hat{\theta})$. For any point $x$ and a given parameter space $\Theta$, we can find its corresponding counterfactual prediction. "Counterfactual" is not used in the causal sense of pearl2009causality but captures what would happen to loss (or any $f$) if we were to constrain our objective to alter the prediction of $x$.

B.1 CFP in Prior Work

breiman2001statistical noted that there can exist multiple hypotheses that fit a training dataset equally well, leading to different stories about the relationship between the input features and the output response. A few recent works relate to our formulation. Firstly, rashomon define the empirical $\epsilon$-Rashomon set as $\hat{R}(\epsilon) = \{\theta \in \Theta : \hat{L}(\theta) \le \hat{L}(\hat{\theta}) + \epsilon\}$, which can be seen as the set of all classifiers in $\Theta$ that have an average loss no more than $\epsilon$ greater than the average loss of $\hat{\theta}$. pred_mult study predictive multiplicity within the Rashomon set (calling it the $\epsilon$-level set): they define a metric called ambiguity, which is the proportion of training data points whose prediction is flipped by some classifier in the set. To calculate ambiguity, they solve a mixed integer program (MIP) for each point. While these works study how to deal with varying predictions within $\hat{R}(\epsilon)$, we essentially solve a dual problem where we want to find the minimum $\epsilon$ such that the empirical $\hat{R}(\epsilon)$ contains at least one model with a different prediction for $x$. letham2016prediction look across the Rashomon set to identify classifiers which have maximally different predictions (similar to discrepancy defined in pred_mult). We do something similar but different: we ask how much the average loss needs to suffer in order to change the prediction of a test point.

agnostic use a similar quantity to create a selection function for rejection-option classification: they call it the disbelief index. Specifically, if the disbelief index for $x$ is large enough, they proceed to predict on $x$ using $\hat{\theta}$; otherwise, they abstain from providing a prediction for $x$. They choose the threshold as the slack of a uniform deviation bound in terms of the training set size, a confidence parameter, and the VC-dimension of the hypothesis class. Using a weighted SVM, they find the constrained classifier by adding $x$ with the opposite predicted label to the training data and then upweighting the penalty of misclassifying it (the penalty is set to ten times the weight of all other training points combined). Since the disbelief index is a noisy statistic and depends heavily on the samples chosen, they use bootstrap sampling and then take the median of all measurements as the final value of the disbelief index, which is closely related to the disagreement coefficient from hanneke2009theoretical.

Earlier, dasgupta2008general used an ERM oracle with an example-based constraint in the context of active learning. In a setup similar to agnostic and ours, dasgupta2008general decide to request a label for $x$ when the gap in empirical error between the constrained and unconstrained classifiers is small enough. They select this threshold in terms of the empirical errors of both classifiers and the training set size.

                     High Density                                        Low Density
Low $\Delta(x)$      $\hat{\theta}$ is uncertain in its prediction for $x$   $\hat{\theta}$ is uncertain in its prediction for $x$
High $\Delta(x)$     $\hat{\theta}$ is certain in its prediction for $x$     $x$ is a potential outlier
Table 5: The interplay between CFP and the data density around $x$

B.2 Intuition

In Figure 9, we compare CFP to IF in two settings: CFP respects the data density more than IF, which correlates heavily with the distance to the decision boundary. When visualizing, we normalize the importance scores to a common scale. We now provide intuition behind CFP. Recall that we simply care about the absolute difference in loss between the training loss of the ERM parameter $\hat{\theta}$ and the training loss of the constrained parameter $\theta^c$, where the constraint mandates that $h_{\theta^c}(x) \neq h_{\hat{\theta}}(x)$. We denote this difference by $\Delta(x)$.

If $\Delta(x)$ is large, the parameters must change a lot in order to fit an opposite label for $x$; therefore, we see model performance drop and can be confident that $\hat{\theta}$ correctly classified $x$, given $D$ and $\Theta$: dasgupta2008general would not request a label for $x$ and agnostic would accept $x$. If $\Delta(x)$ is small, the parameters learned are similar and it was easy to fit an opposite label for $x$, so we cannot be sure of the label for $x$: dasgupta2008general would request a label for $x$ and agnostic would reject $x$.

We expect $\Delta(x)$ to change rapidly based on the data density. We can expect $\Delta(x)$ to be high in dense regions. In Table 5, we discuss the interplay between $\Delta(x)$ and data density. If $\Delta(x)$ is high and we are in a high-density region, then we know the prediction for $x$ is correct and certain. If $\Delta(x)$ is low and we are in a high-density region, then we know the prediction for $x$ is uncertain: we may have an incorrect label or noise in our covariates. This happens in regions of high class overlap (high aleatoric uncertainty). If $\Delta(x)$ is low and we are in a low-density region, then we know the prediction for $x$ is uncertain: we may have an outlier or simply high epistemic uncertainty. If $\Delta(x)$ is high and we are in a low-density region, $x$ is a potential outlier, since the parameters changed considerably to alter the prediction of $x$.

(a) CFP
(b) IF
(c) CFP
(d) IF
Figure 9: Comparing various training point importance methods based on how they assign influence; we normalize influence when we visualize.

B.3 Connecting CFP to Neural Networks

For many complex models (e.g., large neural networks), re-computing the model with an added constraint is computationally expensive. As such, we propose an approximation of CFP that uses influence functions to find the set of perturbed parameters. We want to estimate the effect on the model parameters of training with a flipped label. In a weighted ERM setup, we find the $\hat{\theta}$ that minimizes empirical risk:
$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} w_i \, \ell(z_i, \theta).$$

To weight every training data point equally, we set $w_i = \frac{1}{n}$ for all $i$. Like koh2017understanding, we assume that $\ell$ is twice differentiable and convex in $\theta$. As such, we assume the Hessian exists and is given by:
$$H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla^2_{\theta} \ell(z_i, \hat{\theta}).$$

We assume $H_{\hat{\theta}}$ is positive definite to guarantee the existence of $H_{\hat{\theta}}^{-1}$. koh2017understanding define the perturbed parameters obtained when upweighting a single training point $z$ by a small $\epsilon$ as follows:
$$\hat{\theta}_{\epsilon, z} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta) + \epsilon\, \ell(z, \theta).$$

Assuming binary labels $y_i \in \{-1, +1\}$, let $z_i' = (x_i, -y_i)$ be the label-flipped point. We can analogously get the following:
$$\hat{\theta}_{\epsilon, z_i'} = \arg\min_{\theta} \frac{1}{n} \sum_{j=1}^{n} \ell(z_j, \theta) + \epsilon\, \ell(z_i', \theta).$$

koh2017understanding study the effect of an input perturbation on the model's parameters. They define this as follows:
$$\left.\frac{d\hat{\theta}_{\epsilon, z_\delta, -z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \left( \nabla_{\theta} \ell(z_\delta, \hat{\theta}) - \nabla_{\theta} \ell(z, \hat{\theta}) \right).$$

We can define the effect of the label-flipped point as:
$$\left.\frac{d\hat{\theta}_{\epsilon, z_i', -z_i}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \left( \nabla_{\theta} \ell(z_i', \hat{\theta}) - \nabla_{\theta} \ell(z_i, \hat{\theta}) \right).$$

Thus, the new parameters are approximately given by:
$$\tilde{\theta}_i \approx \hat{\theta} - \frac{1}{n} H_{\hat{\theta}}^{-1} \left( \nabla_{\theta} \ell(z_i', \hat{\theta}) - \nabla_{\theta} \ell(z_i, \hat{\theta}) \right).$$

In the case of logistic regression, we can approximate the CFP parameters in closed form. We know that for logistic regression $p(y \mid x; \theta) = \sigma(y\, \theta^{\top} x)$, where $\sigma(a) = 1/(1 + \exp(-a))$ and $y \in \{-1, +1\}$. The loss is given by $\ell(z, \theta) = \log\left(1 + \exp(-y\, \theta^{\top} x)\right)$ and its derivative is given by $\nabla_{\theta} \ell(z, \theta) = -\sigma(-y\, \theta^{\top} x)\, y\, x$.

The difference between the loss gradients for the flipped-label point and the original point can be re-written as
$$\nabla_{\theta} \ell(z_i', \hat{\theta}) - \nabla_{\theta} \ell(z_i, \hat{\theta}) = \left(\sigma(y_i\, \hat{\theta}^{\top} x_i) + \sigma(-y_i\, \hat{\theta}^{\top} x_i)\right) y_i\, x_i = y_i\, x_i.$$
Since $\sigma(a) + \sigma(-a) = 1$, we can write the updated parameters after flipping the label of $z_i$ as:
$$\tilde{\theta}_i \approx \hat{\theta} - \frac{1}{n} H_{\hat{\theta}}^{-1}\, y_i\, x_i.$$
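A minimal sketch of this closed-form approximation for logistic regression with labels in {-1, +1} is given below; the function names and the use of NumPy are illustrative, and the $1/n$ scaling follows the weighted-ERM convention above.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cfp_params_approx(X, y, theta_hat, i):
    n = len(y)
    # Hessian of the average logistic loss at theta_hat.
    p = sigmoid(X @ theta_hat)
    H = (X * (p * (1 - p))[:, None]).T @ X / n
    # Gradient difference between the flipped and original point is y_i * x_i.
    grad_diff = y[i] * X[i]
    # Approximate parameters after flipping the label of point i.
    return theta_hat - np.linalg.solve(H, grad_diff) / n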

Appendix C Experimental Setup

c.1 Dataset Metadata

We employ 6 datasets in our experiments, 4 tabular and 2 image. All are publicly available, with details given in Table 6. For all datasets, we use a train, validation, and test split.

Name            Targets       Input Type                  # Features    # Total Samples
LSAT            Continuous    Continuous & Categorical
COMPAS*         Binary        Continuous & Categorical
Adult           Binary        Continuous & Categorical
Bank            Binary        Continuous & Categorical
MNIST           Categorical   Image (greyscale)
FashionMNIST    Categorical   Image (greyscale)
Table 6: Summary of datasets used in our experiments. (*) We use a 7-feature version of COMPAS; however, other versions exist.

We use the LSAT loading script from Cole2019AvoidingRV’s github page. The raw data can be downloaded from https://raw.githubusercontent.com/throwaway20190523/MonotonicFairness/master/data/law_school_cf_test.csv and https://raw.githubusercontent.com/throwaway20190523/MonotonicFairness/master/data/law_school_cf_train.csv. We let “sex” be our protected attribute and drop “race” from the dataset when running our experiments. Features used are undergraduate grade point average, LSAT score, and sex. The predicted label is first year law school performance.

For the COMPAS criminal recidivism prediction dataset, we use a modified version of zafar2017fairness's loading and pre-processing script. It can be found at https://github.com/mbilalzafar/fair-classification/blob/master/disparate_mistreatment/propublica_compas_data_demo/load_compas_data.py. We add an additional feature, "days served", which we compute as the difference, measured in days, between the "c_jail_in" and "c_jail_out" variables. The raw data can be found at https://github.com/propublica/compas-analysis/blob/master/compas-scores-two-years.csv. We let "race" be our protected attribute. Other features used are age, sex, charge degree (felony or misdemeanor), and priors count. The predicted label is recidivism within 2 years.

The Adult dataset [Dua:2019] can be obtained from, and is described in detail at, https://archive.ics.uci.edu/ml/datasets/adult. The features we used are age, work class, education, education number, marital status, occupation, relationship, capital gain, capital loss, hours per week, and native country. More details are available at the link. We let "sex" be our protected attribute. The predicted label is whether the person makes more than 50K a year.

The bank marketing dataset [Dua:2019] can be obtained from, and is described in detail at, https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The features we used are described in detail at the link. We let "age" be our protected attribute. The predicted label is whether a client will subscribe to a term deposit.

The MNIST handwritten digit image dataset [lecun1998mnist] can be obtained from http://yann.lecun.com/exdb/mnist/.

The FashionMNIST image dataset [xiao2017fashion] can be obtained from https://github.com/zalandoresearch/fashion-mnist.

(a) LSAT
(b) COMPAS
Figure 10: Linearity of importance scores as points are removed, when the evaluation function is taken to be equal accuracy.

c.2 Models

In Section 5, we primarily use logistic regression for our tabular data experiments. For all tabular datasets, we append an intercept to the input features before learning the parameters; this is customary in such settings, e.g., zafar2017fairness has a similar setup. We learn the classifier's parameters with scipy.optimize, using the SLSQP (Sequential Least SQuares Programming) solver. For image datasets, we use tensorflow to learn a three-layer multilayer perceptron (for MNIST) and a three-layer convolutional neural network (for FashionMNIST). We then leverage our DIVINE codebase (Appendix A) to calculate the DIVINE points for each model.
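For concreteness, a minimal sketch of the logistic regression fitting procedure is given below; the exact loss formulation and intercept handling are standard choices rather than our exact project code.

import numpy as np
from scipy.optimize import minimize

def fit_logistic_regression(X, y):
    # Append an intercept column, as in the tabular experiments.
    Xb = np.hstack([X, np.ones((len(X), 1))])

    def nll(theta):
        logits = Xb @ theta
        # Numerically stable negative log-likelihood for labels in {0, 1}.
        return np.mean(np.logaddexp(0.0, logits) - y * logits)

    theta0 = np.zeros(Xb.shape[1])
    res = minimize(nll, theta0, method="SLSQP")
    return res.x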

c.3 Additivity implies Modularity

We next comment on the modularity of the influence term in our objective. If all importance scores are non-negative, then the influence term is monotone. To show its modularity, it suffices to show that each importance measure, with a selected evaluation function, is additive, which implies the importance scores are linear in the set of removed points. When the evaluation function is loss, koh2019accuracy find that influence functions are approximately linear: the influence of a group of points is approximately the sum of the individual influences. Since we can recast counterfactual prediction in terms of influence functions (Appendix B), importance scores from counterfactual prediction are also approximately linear. Furthermore, irrespective of the evaluation function, Shapley values satisfy linearity by construction: see [shapley52, ghorbani2019data] for a thorough treatment. While we know the influence term will be modular when the evaluation function is a function of loss, we also examine linearity when using influence functions with equal accuracy as the evaluation function. In Figure 10, we plot how a linearity approximation of importance performs as we increase the number of points removed from the dataset: we report the average difference between the predicted importance score (the sum of individual scores) and the actual importance score (the change in the evaluation function upon retraining), over 1000 sampled sets of each size. On LSAT, shown in Figure 10(a), all three scoring methods maintain linearity as the set size increases, implying that the influence term is modular. Linearity is also satisfied on Adult (Figure 10(b)) with LOO scores. For IF, the importance score is linear for small set sizes; as the set size increases, the approximation no longer maintains linearity. If we keep the explanation size relatively small, we can use the linearity approximation, as we desire simple explanations with few cognitive chunks [doshi2017towards]. However, since we can assume additivity for larger set sizes when using LOO, we expect LOO to perform better at identifying large sets of important points on high-dimensional data. Practitioners might find these graphs useful when deciding how large to make the explanation size; if it is too large, then DIVINE points for unfairness might not be trustworthy.
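A minimal sketch of this linearity check is given below; fit_model, evaluate, and individual_scores are placeholders for the training routine, evaluation function, and importance method, and the sign convention is illustrative.

import numpy as np

def linearity_gap(X, y, individual_scores, fit_model, evaluate,
                  set_size, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    full_value = evaluate(fit_model(X, y))
    gaps = []
    for _ in range(n_samples):
        S = rng.choice(len(y), size=set_size, replace=False)
        predicted = individual_scores[S].sum()          # additivity assumption
        keep = np.setdiff1d(np.arange(len(y)), S)
        actual = full_value - evaluate(fit_model(X[keep], y[keep]))
        gaps.append(abs(predicted - actual))
    return float(np.mean(gaps))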

Appendix D Additional Experiments

d.1 Analyzing Different Diversity Functions

First, we replicate Figure 3 with various diversity functions on our synthetic data. Here the importance measure is taken to be influence functions [koh2017understanding]. We notice similar trends for the other two diversity functions, shown in Figures 11, 12, and 13, as we did for the one used in the main paper. We also visualize the top-5 DIVINE points for select values of the trade-off parameter γ. The red diamonds correspond to the value of γ that recovers the top points from IF alone; this is the same irrespective of the diversity function. The orange diamonds mark the value of γ at which our DIVINE points forgo a fixed fraction of the influence of the IF points. The yellow diamonds mark the value of γ at which the average pairwise distance between our DIVINE points is maximized. Notice how two of the diversity functions encourage representativeness by selecting a DIVINE point near the center of the Gaussians. However, the diversity function in Figure 12 does not penalize redundancy between points, so it selects three points close to each other in the top right. In the limit of large γ, the MMD-based diversity function recovers the prototypes of kim2016MMD. In Figure 14, we show how varying these choices affects our trade-off curves.
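A minimal sketch of greedy DIVINE selection under such settings is given below; the additive combination of influence and γ-weighted diversity, with average pairwise distance as the diversity term, is an illustrative choice rather than the exact objective used in our codebase.

import numpy as np

def divine_greedy(X, influence, k, gamma):
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            cand = selected + [i]
            div = 0.0
            if len(cand) > 1:
                # Average pairwise distance among the candidate set.
                P = X[cand]
                D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
                div = D.sum() / (len(cand) * (len(cand) - 1))
            gain = influence[cand].sum() + gamma * div
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    # With gamma = 0, this reduces to picking the top-k influential points.
    return selected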

Figure 11: (a) influence-diversity trade-off; (b)-(d) DIVINE points for selected values of γ. Note this figure is the same as Figure 3 from the main paper.
Figure 12: (a) influence-diversity trade-off; (b)-(d) DIVINE points for selected values of γ. Some points are representative (near a cluster center), but others are redundant (top right).
Figure 13: (a) influence-diversity trade-off; (b)-(d) DIVINE points for selected values of γ. Our selected points are representative, though the top cluster is missed. There are no redundant points, in contrast to the points selected in Figure 12.
Figure 14: The DIVINE trade-off on our synthetic data, reported across the settings shown in panels (a)-(c).

d.2 Analyzing Different Influence Measures

We next consider the effect of varying the underlying influence measure, keeping the diversity function fixed for all experiments herein. In Figures 15, 16, and 17, we show how DIVINE points are selected for Data Shapley, Counterfactual Prediction, and Leave-one-out, respectively. Note that when no diversity is used, the DIVINE points are simply the highest-scoring points from each method alone; every method then selects similar points (all high-importance points are located in a small cluster). As we trade off diversity with influence, we obtain similar trade-off plots. In Figure 18, we find that the other influence measures exhibit behavior similar to what we obtained with influence functions. For all influence measures, we use the same evaluation function.
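A minimal sketch of leave-one-out (LOO) importance is given below; fit_model and evaluate are placeholders for the training and evaluation routines.

import numpy as np

def loo_importance(X, y, fit_model, evaluate):
    full_value = evaluate(fit_model(X, y))
    scores = np.zeros(len(y))
    for i in range(len(y)):
        # Retrain without point i and record the change in the evaluation function.
        keep = np.delete(np.arange(len(y)), i)
        scores[i] = full_value - evaluate(fit_model(X[keep], y[keep]))
    return scores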

Figure 15: Data Shapley: (a) influence-diversity trade-off; (b)-(d) DIVINE points for selected values of γ.
Figure 16: Counterfactual Prediction: (a) influence-diversity trade-off; (b)-(d) DIVINE points for selected values of γ.
Figure 17: Leave-one-out: (a) influence-diversity trade-off; (b)-(d) DIVINE points for selected values of γ.
Figure 18: The DIVINE trade-off on our synthetic data for various influence measures: (a) Influence Functions, (b) DS, (c) CFP, (d) LOO.

d.3 Analyzing Different Datasets

We show how the trade-off curves look for various explanation sizes on four datasets: LSAT, COMPAS, Adult, and FashionMNIST. We use IF as our influence measure, together with a fixed diversity function and evaluation function. In Figure 19, we report the trade-off curves for a single explanation size. In Figure 20, we further illustrate the flexibility of our approach to obtain DIVINE points under multiple changes: in model type, in input dimensions, and in explanation size.

(a) LSAT
(b) COMPAS
(c) Adult
(d) FashionMNIST
Figure 19: For four datasets and a fixed explanation size, we characterize the influence-diversity trade-off. In panel (d), we show the trade-off curve for FashionMNIST; notice that it has a similar shape to Figure 3(a) in the main paper, even though the model type is a CNN rather than LR and the data type is image rather than tabular.
(a) LSAT
(b) COMPAS
(c) Adult
(d) FashionMNIST
Figure 20: For four datasets and four explanation sizes, we characterize the influence-diversity trade-off. The red diamond indicates where the top IF points lie. The orange diamond marks where a fixed fraction of the influence has been foregone for diversity. The yellow diamond marks where the average pairwise distance between DIVINE points is maximized. With COMPAS, we find that the orange and yellow diamonds coincide for multiple explanation sizes. Note that even though the lines look linear, they still resemble the curve shown in Figure 19. Even on FashionMNIST, where average pairwise distance in input space might not be meaningful, our curves hold, so the gamma selection strategies remain applicable.

d.4 DIVINE for Image Classifiers

In the main paper, we discuss how to find DIVINE points for a CNN trained on FashionMNIST. In Figure 21, we show the top DIVINE points as we sweep over γ. We select the γ that maximizes the average pairwise distance between the points. These DIVINE points are used in one of our user studies.

Figure 21: We show the top DIVINE points for FashionMNIST when trading off influence (computed with IF) against diversity.

While we report results for a CNN trained on FashionMNIST in the main text, here we compare the top influential and DIVINE points from an MLP and from a simpler logistic regression classifier trained on MNIST. In Figure 22, we show the points most influential to an entire test set for our logistic regression classifier; herein we value points with respect to test loss. Note how IF alone does not contain label diversity: it simply captures two canonical sevens (ones with a line through the middle and ones without). As a sanity check, we train a logistic regression classifier on only the top points selected by each method and report accuracies. We use a validation set to select important points via IF and IF+diversity methods, and report the test accuracy of the full-data LR classifier, of the model trained on points selected by IF alone, and of the models trained on DIVINE points from IF+MMD, IF+FL, and IF+SR. Our method thus allows us to select a subset of points important for model performance. In Figure 23, we find that the average pairwise distance between DIVINE points exceeds that of IF for MNIST. In Figure 24, we show how explanations for a specific test point differ by model (LR and MLP) and by method (IF and IF+diversity). Notice that we achieve label diversity with all three of our diversity functions, and mode diversity with one of them.
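A minimal sketch of this sanity check is given below; selected_idx would come from IF or one of the IF+diversity methods, and scikit-learn is an illustrative choice.

from sklearn.linear_model import LogisticRegression

def accuracy_of_selection(X_train, y_train, X_test, y_test, selected_idx):
    # Retrain on only the selected points and report test accuracy.
    clf = LogisticRegression(max_iter=1000).fit(X_train[selected_idx],
                                                y_train[selected_idx])
    return clf.score(X_test, y_test)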

(a) Logistic Regression
(b) MLP
Figure 22: We show that the top DIVINE points for LR and an MLP differ. The most important point for each method is in the leftmost position. Note that the last three rows show three potential sets of DIVINE points, all with IF as the influence measure but with varying diversity functions.
(a) Logistic Regression
(b) MLP
Figure 23: The average pairwise distance between DIVINE points exceeds that of IF alone for both models. Explanations are computed for random test points and then averaged.
(a) Logistic Regression
(b) MLP
Figure 24: We show that the influential samples selected by IF are less diverse than the ones we select, when locally explaining a test point, for both logistic regression and an MLP. Note that both test points are correctly classified. The last three rows show three potential sets of DIVINE points, all with IF as the influence measure but with varying diversity functions.

d.5 Additional Fairness Experiments

d.5.1 Generalization

While practitioners might be interested in understanding the effect of training data points with respect to metrics evaluated on the training data itself, they may also wish to achieve better generalization. In the top row of Figure 25, we report the effect on test data of removing points scored with respect to training unfairness. In this section, we use unfairness as our evaluation function and influence functions as our influence measure. Notice that, on all datasets, we achieve lower unfairness in generalization by removing either points selected on importance scores alone or DIVINE-selected points, which incorporate a diversity term. We observe unexpected results on LSAT when removing by the influence score alone; this is likely due to the low dimensionality and high redundancy of the dataset. We posit that only a handful of unfairness-inducing points need to be removed.

Moreover, practitioners may be interested in scoring points specifically for generalization. We can score training data points based on their impact on the unfairness of a held-out validation dataset (i.e., the evaluation function is calculated on validation data) and then measure the impact of removing unfairness-inducing points on a separate test set. We demonstrate this approach in the bottom row of Figure 25. As expected, the first few percentages of points removed lead to a decrease in model unfairness. For every dataset, removing DIVINE points (blue line) outperforms removing points at random, by importance score alone, or by diversity alone.
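A minimal sketch of this validation-based removal protocol is given below; fit_model and unfairness are placeholders for the training routine and group-fairness metric, val and test are (features, labels, groups) tuples consumed by that metric, and the removal fraction is illustrative.

import numpy as np

def remove_unfair_points(X_tr, y_tr, val, test, fit_model, unfairness, frac=0.05):
    base = unfairness(fit_model(X_tr, y_tr), *val)
    scores = np.zeros(len(y_tr))
    for i in range(len(y_tr)):
        keep = np.delete(np.arange(len(y_tr)), i)
        # Negative score: removing point i reduces validation unfairness.
        scores[i] = unfairness(fit_model(X_tr[keep], y_tr[keep]), *val) - base
    n_drop = int(frac * len(y_tr))
    order = np.argsort(scores)                 # most unfairness-inducing first
    keep = np.sort(order[n_drop:])             # drop the n_drop most harmful points
    return unfairness(fit_model(X_tr[keep], y_tr[keep]), *test)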

A practitioner can define a stopping criterion for the removal of unfairness-inducing points. In Table 7, we report the number of unfairness-inducing points highlighted by various methods. We include a column denoting the number of correctly classified points our method has identified as unfair to show that it does not simply recommend misclassified points for removal. Unfairness-inducing points can also be correctly classified points that change the decision boundary significantly upon removal, such that unfairness is reduced while accuracy is maintained.

(a) COMPAS
(b) LSAT
(c) Adult
(d) Bank
Figure 25: Impact of removing unfairness-inducing points on generalization. In the top row, we score each training point based on the change in unfairness on the training dataset upon removing the point; we report unfairness on a test dataset after removing the most unfairness-inducing points (selected by methods differing by color) a fraction at a time. In the bottom row, we score each training point based on the change in unfairness on a separate validation dataset upon removing the point; we then report unfairness on a held-out test dataset.

d.5.2 Removal with Recalculation

Instead of calculating the importance scores only once (with respect to the entire training dataset), we can recalculate them after the removal of each set of points. Although computationally expensive, a practitioner may elect to do this to avoid divergence from additivity at large set sizes (as described in Appendix C.3) or to improve generalization. To improve generalization, we may also calculate importance scores with respect to a validation dataset and report performance on a test dataset (as described in Appendix D.5.1). In Figure 26, we use both of these approaches and report removal-with-recalculation results for the first several training points removed, with importance scores recalculated after every single point removed. For LSAT, Adult, and Bank, unfairness decreases steadily as we remove the most harmful data point according to the newly calculated influence scores. For COMPAS, however, after a number of iterations our algorithm no longer identifies unfair points (importance scores are all 0 or greater), so accuracy and unfairness both remain constant.
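A minimal sketch of removal with recalculation is given below; score_points, fit_model, and unfairness are placeholders, and the sign convention (unfairness-inducing points have negative scores) follows the description above.

import numpy as np

def remove_with_recalculation(X, y, score_points, fit_model, unfairness,
                              val, test, max_removals=100):
    X_cur, y_cur = X.copy(), y.copy()
    history = []
    for _ in range(max_removals):
        scores = score_points(X_cur, y_cur, val)   # recalculated each iteration
        worst = int(np.argmin(scores))
        if scores[worst] >= 0:                     # no unfair points remain
            break
        X_cur = np.delete(X_cur, worst, axis=0)
        y_cur = np.delete(y_cur, worst, axis=0)
        history.append(unfairness(fit_model(X_cur, y_cur), *test))
    return history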

Dataset    Importance Method    Unfairness-Inducing Points    Correct
LOO