1 Introduction
Machine learning systems [1] are becoming more prevalent thanks to a vast number of success stories. However, the data tools for interpreting and debugging models have not caught up yet, and many important challenges exist to improve our model understanding after training [2]. One such key problem is to understand if a model performs poorly on certain parts of the data, hereafter referred to as a slice.
Example 1.
Consider a Random Forest classifier that predicts whether a person’s income is above or below $50,000 (UCI Census data [3]). Looking at Table I, the overall metrics may be considered acceptable, since the overall log loss (a widely used loss metric for binary classification problems) is low for all the data (see the “All” row). However, the individual slices tell a different story. When slicing data by gender, the model is more accurate for Female than for Male (the effect size defined in Section 2 captures this relation by measuring the normalized loss metric difference between the Male slice and its counterpart, the Female slice). The Prof-specialty slice is interesting because its average loss metric is on par with Male, but its effect size is much smaller. A small effect size means that the loss metric on Prof-specialty is similar to the loss metric on other demographics (defined as counterparts in Section 2). Hence, if the log loss of a slice and that of its counterpart are both unacceptable, then it is likely that the model is bad overall, not just on a particular subset. Lastly, we see that people with higher education degrees (Bachelors, Masters, Doctorate) suffer from worse model performance: their losses are higher than those of their counterparts, and thus they have higher error concentration. Thus, slices with high effect size are important for model validation, to make sure that the model does not underperform on certain parts of the data.

| Slice | Log Loss | Size | Effect Size |
|---|---|---|---|
| All | 0.35 | 30k | n/a |
| Sex = Male | 0.41 | 20k | 0.28 |
| Sex = Female | 0.22 | 10k | -0.29 |
| Occupation = Prof-specialty | 0.45 | 4k | 0.18 |
| Education = HS-grad | 0.33 | 9.8k | -0.05 |
| Education = Bachelors | 0.44 | 5k | 0.17 |
| Education = Masters | 0.49 | 1.6k | 0.23 |
| Education = Doctorate | 0.56 | 0.4k | 0.33 |
The problem is that the overall model performance can fail to reflect that of smaller data slices. Thus, it is important that the performance of a model is analyzed at a more granular level. While this is a well-known problem [4], current techniques to determine underperforming slices largely rely on domain experts to define important sub-populations (or at least specify a feature dimension to slice by) [5, 6]. Unfortunately, machine learning practitioners do not necessarily have the domain expertise to know all important underperforming slices in advance, even after spending a significant amount of time exploring the data. An underlying assumption here is that the dataset is so large that enumerating all possible data slices and validating model performance for each is not practical due to the sheer number of possible slices. Worse yet, simply searching for the most underperforming slices can be misleading because the model performance on smaller slices can be noisy, and without any safeguard, this leads to slices that are too small for meaningful impact on the model quality or that are false discoveries (i.e., non-problematic slices appearing as problematic). Ideally, we want to identify the largest truly problematic slices among the smaller slices that are not fully reflected by the overall model performance metric.
There are more generic clustering-based algorithms in model understanding [7, 8, 9] that group similar examples together as clusters and analyze model behavior locally within each cluster. Similarly, we can cluster similar examples and treat each cluster as an arbitrary data slice; if a model underperforms on any of the slices, then the user can analyze the examples within. However, clusters of similar examples can still have high variance and high cardinality of feature values, which are hard to summarize and interpret. In comparison, a data slice with a few common feature values (e.g., the Female slice contains all examples with Sex = Female) is much easier to interpret. In practice, validating and reporting model performance on interpretable slices is much more useful than validating on arbitrary non-interpretable slices (e.g., a cluster of similar examples with mixed properties).
A good technique to detect problematic slices for model validation thus needs to find easy-to-understand subsets of data and ensure that the model performance on the subsets is meaningful and not attributed to chance. Each problematic slice should be immediately understandable to a human without guesswork. The problematic slices should also be large enough so that their impact on the overall model quality is non-negligible. Since the model may have a high variance in its prediction quality, we also need to be careful not to choose slices that are false discoveries. Finally, since the slices have an exponentially large search space, it is infeasible to manually go through each slice. Instead, we would like to guide the user to a handful of slices that satisfy the conditions above. In this paper we propose Slice Finder, which efficiently discovers large, possibly overlapping slices that are both interpretable and problematic.
A slice is defined as a conjunction of feature-value pairs, where having fewer pairs is considered more interpretable. A problematic slice is identified by testing for a significant difference in model performance metrics (e.g., loss function) between the slice and its counterpart. That is, we treat each problematic slice as a hypothesis and check that the difference is statistically significant and that the magnitude of the difference is large enough according to the effect size. We discuss the details in Section 2. One problem with performing many statistical tests (due to a large number of candidate slices) is an increased number of false positives. This is known as the Multiple Comparisons Problem (MCP) [10]: imagine a test with a Type I error (false positive: recommending a non-problematic slice as problematic) rate of 0.05 (a common α-level for statistical significance testing); the probability of having at least one false positive grows rapidly with the number of comparisons (e.g., \(1 - (1 - 0.05)^8 \approx 0.34\) even for just 8 tests, whereas we may end up exploring hundreds or thousands of slices even for a modest number of examples). We address this issue in Section 3.2.
In addition to testing, the slices found by Slice Finder
can be used to evaluate model fairness or in applications such as fraud detection, business analytics, and anomaly detection, to name a few. While there are many definitions of fairness, a common notion of unfairness is that a model performs poorly (e.g., lower accuracy) on slices defined by certain sensitive features, but not on others. Fraud detection also involves identifying classes of activities where a model is not performing as well as it previously did. For example, some fraudsters may have gamed the system with unauthorized transactions. In business analytics, finding the most promising marketing cohorts can be viewed as a data slicing problem. Although Slice Finder evaluates each slice based on its losses on a model, we can also generalize the data slicing problem by assuming a general scoring function to assess the significance of a slice. For example, data validation is the process of identifying training or validation examples that contain errors (e.g., values are out of range, features are missing, and so on). By scoring each slice based on the number or type of errors it contains, we can summarize the data errors through a few interpretable slices rather than showing users an exhaustive list of all erroneous examples.
The main contribution of this paper is applying data management techniques to the model validation problem in machine learning. This application is part of a larger integration of the areas of Big Data and Artificial Intelligence (AI) where data management plays a role in almost all aspects of machine learning [11, 12]. This paper extends our previous work on Slice Finder [13, 14]. In particular, we provide a full description of the slice finding algorithms and provide extensive experiments.
In summary, we make the following contributions:

We describe the Slice Finder system and propose three automated data slicing approaches, including a naïve clustering-based approach as a baseline for automated data slicing (Section 3).

We present model fairness as a potential use case for Slice Finder (Section 4).

We evaluate the automated data slicing approaches using real and synthetic datasets (Section 5).
2 Data Slicing Problem
2.1 Preliminaries
We assume a dataset \(D\) with \(n\) examples and a model \(h\) that needs to be tested. Following common practice, we assume that each example \(x_F^{(i)}\) contains features \(F = \{F_1, F_2, \ldots, F_m\}\) where each feature \(F_j\) (e.g., country) has a list of values (e.g., {US, DE}) or discretized numeric value ranges (e.g., {[0, 50), [50, 100)}). We also have a ground truth label \(y^{(i)}\) for each example, such that \(D = \{(x_F^{(1)}, y^{(1)}), (x_F^{(2)}, y^{(2)}), \ldots, (x_F^{(n)}, y^{(n)})\}\). The test model \(h\) is an arbitrary function that maps an input example to a prediction using \(F\), and the goal is to validate whether \(h\) is working properly for different subsets of the data. For ease of exposition, we focus on a binary classification problem (e.g., UCI Census income classification) with \(h\) that takes an example \(x_F^{(i)}\) and outputs a prediction \(h(x_F^{(i)}) \in [0, 1]\) of the true label \(y^{(i)} \in \{0, 1\}\) (e.g., a person’s income is above or below $50,000). Without loss of generality, we also assume that the model uses all the features in \(F\) for classification.
A slice \(S\) is a subset of examples in the dataset \(D\) with common features and can be described as a predicate that is a conjunction of literals \(\bigwedge_j F_j \; \mathit{op} \; v_j\) where the features \(F_j\) are distinct (e.g., country = DE ∧ gender = Male), and \(\mathit{op}\) can be one of \(=\), \(<\), \(\le\), \(>\), \(\ge\), or \(\ne\). For numeric features, we can discretize their values (e.g., quantiles or equi-height bins) and generate ranges so that they are effectively categorical features (e.g., age = [20,30)). Numeric features with large domains tend to have fewer examples per value, and hence do not appear as significant. By discretizing numeric features into a set of continuous ranges, we can effectively avoid searching through tiny slices of minimal impact on model quality and group them into more sizable and meaningful slices.
We also assume a classification loss function \(\psi(S, h)\) that returns a performance score for a set of examples \(S\) by comparing the prediction \(h(x_F^{(i)})\) of the model \(h\) with the true label \(y^{(i)}\). A common classification loss function is logarithmic loss (log loss), which in the case of binary classification is defined as:
\[ \psi(S, h) = -\frac{1}{|S|} \sum_{(x_F^{(i)}, y^{(i)}) \in S} \left[ y^{(i)} \ln h(x_F^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h(x_F^{(i)})\right) \right] \]
The log loss is non-negative and grows with the number of classification errors. A perfect classifier would have a log loss of zero, and a random guesser (\(h(x_F^{(i)}) = 0.5\)) a log loss of \(-\ln 0.5 \approx 0.693\). Also note that our techniques and the problem setup can easily generalize to other machine learning problem types (e.g., multi-class classification, regression, etc.) with proper loss functions/performance metrics.
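The slice predicate and the log loss above can be illustrated with a minimal Python sketch. It assumes equality-only literals and a tiny made-up dataset; the helper names `in_slice` and `log_loss`, the clipping constant `eps`, and the data are illustrative assumptions rather than part of Slice Finder.

```python
import math

def in_slice(example, predicate):
    """True if the example satisfies every (feature = value) literal."""
    return all(example[feature] == value for feature, value in predicate.items())

def log_loss(examples, labels, model, eps=1e-15):
    """Average binary log loss of `model` over the given examples."""
    total = 0.0
    for x, y in zip(examples, labels):
        p = min(max(model(x), eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(examples)

# Hypothetical toy data: four examples with two categorical features.
data = [
    {"country": "DE", "gender": "Male"},
    {"country": "DE", "gender": "Female"},
    {"country": "US", "gender": "Male"},
    {"country": "US", "gender": "Female"},
]
labels = [1, 0, 1, 0]

def random_guesser(example):
    """A model that always predicts probability 0.5."""
    return 0.5

predicate = {"country": "DE", "gender": "Male"}  # country = DE AND gender = Male
slice_idx = [i for i, x in enumerate(data) if in_slice(x, predicate)]
slice_loss = log_loss([data[i] for i in slice_idx],
                      [labels[i] for i in slice_idx], random_guesser)
overall_loss = log_loss(data, labels, random_guesser)
```

For the random guesser, both the slice loss and the overall loss equal ln 2, which matches the value of roughly 0.693 quoted above.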
2.2 Model Validation
We consider the model validation scenario of pointing the user to “problematic” slices on which a single model performs relatively poorly. That is, we would like to find slices where the loss function returns a significantly higher loss than on the rest of the examples in the dataset \(D\). At the same time, we prefer these slices to be large as well. For example, the slice country = DE may be too large for a model to perform significantly worse than on other countries. On the other hand, the slice country = DE ∧ gender = Male ∧ age = 30 may have a high loss, but may also be too specific and thus too small to have much impact on the overall performance of the model. Finally, we would like the slices to be interpretable in the sense that they can be expressed with a few literals. For example, country = DE is more interpretable than country = DE ∧ age = 20-40 ∧ zip = 12345.
A straightforward extension of this scenario is to compare two models on the same data and point out whether certain slices would experience a degradation in performance if the second model were used. For example, a user may be using an existing model and want to determine if a newly trained model is safe to push to production. Here we can consider the two models as a single model where the loss is defined as the loss of the second model minus the loss of the first model. Since the extension does not fundamentally change the problem, for the rest of the paper, we focus on the original scenario of validating a single model.
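The two-model extension can be sketched by computing, for each example, the loss of the second model minus the loss of the first; these differences then play the role of the per-example losses of a single model. The function names and the toy predictions below are hypothetical.

```python
import math

def example_log_loss(p, y, eps=1e-15):
    """Log loss of one predicted probability p against a binary label y."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_deltas(old_preds, new_preds, labels):
    """Per-example loss of the new model minus that of the old model.

    A positive value means the new model is worse on that example.
    """
    return [example_log_loss(pn, y) - example_log_loss(po, y)
            for po, pn, y in zip(old_preds, new_preds, labels)]

labels = [1, 0]
deltas = loss_deltas(old_preds=[0.9, 0.2], new_preds=[0.6, 0.1], labels=labels)
```

In this toy case the new model is worse on the first example (positive difference) and better on the second (negative difference); a slice whose average difference is significantly positive would be flagged as degraded.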
Finding the most problematic slices is challenging because it requires a balance between how significant the difference in loss is and how large the slice is. Simply finding a slice with many classification errors will not work because there may also be many correct classifications within the same slice (recall that a slice is always of the form \(\bigwedge_j F_j \; \mathit{op} \; v_j\)). Another solution would be to score each slice based on some weighted sum of its size and difference in average losses. However, this weighting function is hard for the user to tune because it is not clear how size relates to loss. Instead, we envision the user fixing either the significance or the size.
2.3 Problematic Slice as Hypothesis
We now discuss what we mean by significance in more detail. For each slice \(S\), we define its counterpart \(S'\) as \(D - S\), which is the rest of the examples. We then compute the relative loss as the difference \(\psi(S, h) - \psi(S', h)\). Without loss of generality, we only look for positive differences where the loss of \(S\) is higher than that of \(S'\).
A key question is how to determine if a slice \(S\) has a significantly higher loss than its counterpart \(S'\). Our solution is to treat each slice as a hypothesis and perform two tests: determine if the difference in loss is statistically significant and if the effect size [15] of the difference is large enough. Using both tests is a common practice [16] and necessary because statistical significance measures the existence of an effect (i.e., the slice indeed has a higher loss than its counterpart) while the effect size complements statistical significance by measuring the magnitude of the effect (i.e., how large the difference is).
To measure statistical significance, we use hypothesis testing with the following null (\(H_0\)) and alternative (\(H_a\)) hypotheses:
\[ H_0: \psi(S, h) \le \psi(S', h) \qquad H_a: \psi(S, h) > \psi(S', h) \]
Here both the slice \(S\) and its counterpart \(S'\) should be viewed as samples of all the possible examples in the world, including the training data and even the examples that the model might serve in the future. We then use Welch’s t-test [17], which is used to test the hypothesis that two populations have equal means and is defined as follows:
\[ t = \frac{\mu_S - \mu_{S'}}{\sqrt{\sigma_S^2 / |S| + \sigma_{S'}^2 / |S'|}} \]
where \(\mu_S\) is the average loss of \(S\), \(\sigma_S^2\) is the variance of the individual example losses in \(S\), and \(|S|\) is the size of \(S\). In comparison to Student’s t-test, Welch’s t-test is more reliable when the two samples have unequal variances and unequal sample sizes, which fits our setting.
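As a rough illustration, the Welch statistic can be computed from per-example losses with the Python standard library. The sketch assumes sample variances (`statistics.variance`) and additionally returns the Welch-Satterthwaite degrees of freedom, a standard companion quantity that the text above does not spell out; turning the statistic into a p-value would further require a t-distribution CDF, which is omitted here. The function name `welch_t` is hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(losses_s, losses_rest):
    """Welch's t statistic and approximate degrees of freedom.

    losses_s:    per-example losses in the slice S
    losses_rest: per-example losses in the counterpart S'
    """
    n1, n2 = len(losses_s), len(losses_rest)
    v1, v2 = variance(losses_s), variance(losses_rest)  # sample variances
    se2 = v1 / n1 + v2 / n2                             # squared standard error
    t = (mean(losses_s) - mean(losses_rest)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation (assumption: not given in the text)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A large positive statistic supports the alternative hypothesis that the slice has a higher loss than its counterpart; in the toy call above the slice has the lower mean, so the statistic is negative.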
To measure the magnitude of the difference between the distributions of losses of a slice \(S\) and its counterpart \(S'\), we compute the effect size [15] \(\phi\), which is defined as follows:
\[ \phi = \sqrt{2} \times \frac{\psi(S, h) - \psi(S', h)}{\sqrt{\sigma_S^2 + \sigma_{S'}^2}} \]
Intuitively, if the effect size is 1.0, we know that the two distributions differ by one standard deviation. According to Cohen’s rule of thumb [18], an effect size of 0.2 is considered small, 0.5 is medium, 0.8 is large, and 1.3 is very large.
2.4 Problem Definition
For two slices \(S_1\) and \(S_2\), we say that \(S_1 \prec S_2\) if \(S_1\) precedes \(S_2\) when ordering the slices by increasing number of literals, decreasing slice size, and decreasing effect size. Then the goal of Slice Finder is to identify problematic slices as follows:
Definition 1.
Given a positive integer \(k\), an effect size threshold \(T\), and a significance level \(\alpha\), find the top-\(k\) slices sorted by the ordering \(\prec\) such that:

Each slice \(S\) has an effect size of at least \(T\),

The slice is statistically significant,

No slice can be replaced with one that has a strict subset of literals and satisfies the above two conditions.
The top-\(k\) slices do not have to be disjoint, e.g., country = DE and education = Bachelors overlap in the demographic of Germany with a Bachelors degree. From a user’s point of view, setting the effect size threshold \(T\) may be challenging, so Slice Finder provides a slider for \(T\) that can be used to explore slices with different degrees of problematicness (see Section 3.3).
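The effect size and the selection criteria of Definition 1 can be sketched as follows. This is a simplified illustration, not Slice Finder's algorithm: it assumes that each candidate slice already carries a precomputed effect size and p-value, and it tests significance against a fixed level `alpha` instead of the α-investing procedure of Section 3.2. The `Candidate` class and all values are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean, variance

def effect_size(losses_s, losses_rest):
    """sqrt(2) * (mean loss of S - mean loss of S') / sqrt(var_S + var_S')."""
    diff = mean(losses_s) - mean(losses_rest)
    return math.sqrt(2) * diff / math.sqrt(variance(losses_s) + variance(losses_rest))

@dataclass(frozen=True)
class Candidate:
    literals: frozenset   # set of (feature, value) pairs
    size: int
    effect_size: float
    p_value: float

def top_k_slices(candidates, k, T, alpha):
    """Return up to k slices that satisfy the three conditions of Definition 1."""
    qualifying = [c for c in candidates
                  if c.effect_size >= T and c.p_value <= alpha]
    # Ordering: fewer literals first, then larger size, then larger effect size.
    qualifying.sort(key=lambda c: (len(c.literals), -c.size, -c.effect_size))
    result = []
    for c in qualifying:
        # Minimality: skip c if a kept slice uses a strict subset of its literals.
        if any(kept.literals < c.literals for kept in result):
            continue
        result.append(c)
        if len(result) == k:
            break
    return result

candidates = [
    Candidate(frozenset({("country", "DE")}), 500, 0.50, 0.01),
    Candidate(frozenset({("country", "DE"), ("gender", "Male")}), 200, 0.90, 0.001),
    Candidate(frozenset({("education", "Masters")}), 800, 0.45, 0.02),
    Candidate(frozenset({("age", "[20,30)")}), 900, 0.10, 0.001),
]
top = top_k_slices(candidates, k=3, T=0.4, alpha=0.05)
```

Here the two-literal slice is dropped even though it has the largest effect size, because country = DE alone already qualifies; the age slice is dropped because its effect size is below the threshold.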
3 System Architecture
Underlying the Slice Finder system is an extensible architecture (Figure 1) that combines automated data slicing and interactive visualization tools. Slice Finder loads the validation data set into a Pandas DataFrame [19]. The DataFrame supports indexing individual examples, and each data slice keeps a subset of indices instead of a copy of the actual data examples. Slice Finder provides basic slice operators (e.g., intersect) based on the indices; only when evaluating the machine learning model on a given slice does Slice Finder access the actual data by the indices to test the model. The Pandas library also provides a number of options to deal with dirty data and missing values. For example, one can drop NaN (missing values) or any values that deviate from the column types as necessary.
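The index-based representation can be illustrated with a small sketch. Slice Finder itself uses a Pandas DataFrame; to stay dependency-free, the sketch below uses a plain dictionary keyed by row index as a stand-in, and the helper names (`slice_indices`, `intersect`, `materialize`) are hypothetical.

```python
rows = [
    {"country": "DE", "gender": "Male", "age": 34},
    {"country": "DE", "gender": "Female", "age": 29},
    {"country": "US", "gender": "Male", "age": None},   # missing value
    {"country": "US", "gender": "Male", "age": 51},
]

# Drop rows with missing values while keeping the original row indices.
table = {i: row for i, row in enumerate(rows)
         if all(value is not None for value in row.values())}

def slice_indices(table, feature, value):
    """A slice stores only the indices of its examples, not copies of them."""
    return frozenset(i for i, row in table.items() if row[feature] == value)

def intersect(a, b):
    """Slice operator on index sets: examples that belong to both slices."""
    return a & b

def materialize(table, indices):
    """Fetch the actual examples only when the model has to be evaluated."""
    return [table[i] for i in sorted(indices)]

de = slice_indices(table, "country", "DE")
male = slice_indices(table, "gender", "Male")
de_male = intersect(de, male)
examples = materialize(table, de_male)
```

Keeping only index sets makes slice operators cheap set operations, and the data itself is touched only at evaluation time.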
Once the data is loaded into a DataFrame, Slice Finder processes it to identify the problematic slices and allows the user to explore them. Slice Finder searches for problematic slices either by training a CART decision tree around misclassified examples or by performing a more exhaustive search on a lattice of slices. Both search strategies progress in a top-down manner until they find the top-k problematic slices. The decision tree approach materializes the tree model and traverses the tree to find problematic slices. In lattice searching, Slice Finder traverses a lattice of slices to find the slices. This top-down approach allows Slice Finder to quickly respond to new request queries that use different k, as described in Section 3.3. As Slice Finder searches through a large number of slices, some slices might appear problematic by chance (i.e., the multiple comparisons problem [20]). Slice Finder controls such a risk by applying a marginal false discovery rate (mFDR) controlling procedure called α-investing [20, 21] in order to find statistically significant slices among a stream of slices. Lastly, even a handful of problematic slices can be overwhelming to the user, since she may need to take action (e.g., deeper analyses or model debugging) on each slice. Hence, it is important to enable the user to quickly browse through the slices by slice size and effect size. To this end, Slice Finder provides interactive visualization tools for the user to explore the recommended slices.
The following subsections describe the Slice Finder components in detail. Section 3.1 introduces the automated data slicing approaches without false discovery control, Section 3.2 discusses the false discovery control, and Section 3.3 describes the interactive visualization.
3.1 Automated Data Slicing
As mentioned earlier, the goal of this component is to automatically identify problematic slices for model validation. To motivate the development of the two techniques that we mentioned (decision tree and lattice searching), let us first consider a simple baseline approach that identifies the problematic slices through clustering. We then discuss the two automated data slicing approaches used in Slice Finder that improve on the clustering approach.
3.1.1 Clustering
The idea is to cluster similar examples together and take each cluster as an arbitrary data slice. If a test model fails on any of the slices, then the user can examine the data examples within or run a more complex analysis to fix the problem. This is an intuitive way to understand the model and its behavior (e.g., predictions) [7, 8, 9]; we can take a similar approach to the automated data slicing problem. The hope is that similar examples would behave similarly even in terms of data or model issues.
Clustering is a reasonable baseline due to its ease of use, but it has major drawbacks. First, it is hard to cluster and explain high-dimensional data. We can reduce the dimensionality using principal component analysis (PCA) before clustering, but many features of the clustered examples (in their original feature vectors) still have high variance or high cardinality of values. Unlike an actual data slice filtered by certain features, such a cluster is hard to interpret unless the user manually goes through the examples and summarizes the data in a meaningful way. Second, the user has to specify the number of clusters, which crucially affects the quality of the clusters in both metrics and size. As we want slices that are problematic and large (more impact on model quality), this is a key parameter, which is hard to tune. The two techniques that we present next overcome these deficiencies.
3.1.2 Decision Tree Training
To identify more interpretable problematic slices, we train a decision tree that can classify which slices are problematic. The output is a partitioning of the examples into the slices defined by the tree. For example, a decision tree could produce the slices {A > v, A ≤ v ∧ B > w, A ≤ v ∧ B ≤ w}. For numeric features, this kind of partitioning is natural. For categorical features, a common approach is to use one-hot encoding where all possible values are mapped to columns, and the selected value results in the corresponding column having a value of 1. We can also directly handle categorical features by splitting a node using tests of the form A = v and A ≠ v.
To use a decision tree, we start from the root slice (i.e., the entire dataset) and go down the decision tree to find the top-k problematic slices in a breadth-first traversal. The decision tree can be expanded one level at a time where each leaf node is split into two children that minimize impurity. The slices of each level are sorted by the ordering ≺ (Section 2.4) and then filtered based on whether they have large enough effect sizes and are statistically significant. The details of the filtering are similar to lattice searching, which we describe in Section 3.1.3. The searching terminates when either k slices are found or there are no more slices to explore.
The decision tree approach has the advantage that it has a natural interpretation, since the leaves directly correspond to slices. In addition, if the decision tree only needs to be expanded a few levels to find the top-k problematic slices, then the slice searching can be efficient. On the other hand, the decision tree approach optimizes on the classification results and may not find all problematic slices according to Definition 1. For example, if some feature is split on the root node, then it will be difficult to find single-feature slices for other features. In addition, a decision tree always partitions the data, so even if there are two problematic slices that overlap, at most one of them will be found. Another downside is that, if a decision tree gets too deep with many levels, then it starts to become uninterpretable as well [22].
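The way tree nodes map to slices can be sketched with a hand-built tree. The sketch below only shows the breadth-first traversal that turns each root-to-node path into a conjunction of literals; training the tree and filtering the slices by effect size and significance are left out. The nested-dictionary tree format and the function name are assumptions made for illustration.

```python
from collections import deque

def slices_breadth_first(root):
    """Yield (depth, literals) for every non-root node, level by level."""
    queue = deque([(root, 0, ())])
    while queue:
        node, depth, literals = queue.popleft()
        if depth > 0:
            yield depth, literals
        if "feature" not in node:   # leaf node: nothing to split further
            continue
        feature, threshold = node["feature"], node["threshold"]
        queue.append((node["left"], depth + 1,
                      literals + (f"{feature} <= {threshold}",)))
        queue.append((node["right"], depth + 1,
                      literals + (f"{feature} > {threshold}",)))

# Hypothetical tree: split on A at 5, then split the left child on B at 2.
tree = {
    "feature": "A", "threshold": 5,
    "left": {"feature": "B", "threshold": 2, "left": {}, "right": {}},
    "right": {},
}
found = list(slices_breadth_first(tree))
```

The three leaves correspond to the partition A > 5, A ≤ 5 ∧ B ≤ 2, and A ≤ 5 ∧ B > 2, which also shows why overlapping slices cannot be produced by this approach.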
3.1.3 Lattice Searching
The lattice searching approach considers a larger search space where the slices form a lattice, and problematic slices can overlap with one another. We assume that slices only have equality literals, i.e., \(\bigwedge_j F_j = v_j\). In contrast to the decision tree training approach, lattice searching can be more expensive because it searches overlapping slices.
Figure 2 illustrates how slices can be organized as a lattice. Lattice searching performs a breadth-first search and efficiently identifies problematic slices as shown in Algorithm 1. As a pre-processing step, Slice Finder takes the training data and discretizes numeric features. For categorical features that contain too many values (e.g., IDs are unique for each example), Slice Finder uses a heuristic where it considers up to a fixed number of the most frequent values and places the rest into an “other values” bucket. The possible slices of these features form a lattice where a slice is a parent of every slice that extends it with one more literal.
Slice Finder finds the top-k interpretable and large problematic slices sorted by the order ≺ (Section 2.4) by traversing the slice lattice in a breadth-first manner, one level at a time. Initially, Slice Finder considers the slices that are defined with one literal. For each slice, Slice Finder checks whether it has an effect size of at least T and, if so, adds it to a priority queue of candidate slices that are sorted by the order ≺. Next, Slice Finder pops slices from the candidate queue and tests them for statistical significance. The testing can be done using α-investing, which we discuss in Section 3.2. Sorting the slices in the middle of the process using ≺ is important for the α-investing policy used by Slice Finder, as we explain later.
Each slice that has a large enough effect size and is statistically significant is added to the list of problematic slices. The remaining slices are later expanded: we generate each new slice by adding a literal, but only if the resulting slice is not subsumed by a previously identified problematic slice. The intuition is that any subsumed (expanded) slice contains a subset of the examples of its parent and is smaller with more filter predicates (less interpretable); thus, we do not expand larger and already problematic slices. By starting from the slices whose predicates are single literals and expanding only non-problematic slices with one additional literal at a time (i.e., top-down search from lower-order slices to higher-order slices), we can generate a superset of all candidate slices. Depending on whether each slice satisfies the two conditions, Slice Finder updates the α-wealth accordingly (details on the updating strategy are discussed in Section 3.2).
Example 2.
Suppose there are three features A, B, and C with the possible values {a1}, {b1, b2}, and {c1}, respectively. Also say k = 2, and the effect size threshold is T. Initially, the root slice is expanded to the slices A = a1, B = b1, B = b2, and C = c1, which are inserted into the set of expanded slices. Among them, suppose that only A = a1 has an effect size of at least T while the others do not. Then A = a1 is added to the candidate queue for significance testing while the rest are added to the set of non-problematic slices. Next, A = a1 is popped from the candidate queue and is tested for statistical significance. Suppose the slice is significant and is thus added to the list of problematic slices. Since the candidate queue is now empty, the non-problematic slices are expanded to B = b1 ∧ C = c1 and B = b2 ∧ C = c1, which are not subsumed by the problematic slice A = a1. If B = b1 ∧ C = c1 is larger and has both an effect size of at least T and is statistically significant, then the final result is [A = a1, B = b1 ∧ C = c1].
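The breadth-first lattice search walked through in Example 2 can be approximated with the sketch below. It is not Algorithm 1 itself: it assumes equality-only literals, takes a hypothetical `evaluate` callback that returns (size, effect size, p-value) for a slice, sorts the candidates of a level in a list instead of using a priority queue, and tests significance against a fixed level `alpha` instead of α-investing. The toy statistics are made up to mirror Example 2.

```python
def expand(parents, problematic, domains):
    """Add one literal to each parent; skip children subsumed by a problematic slice."""
    children = set()
    for parent in parents:
        used = {feature for feature, _ in parent}
        for feature, values in domains.items():
            if feature in used:
                continue
            for value in values:
                child = parent | {(feature, value)}
                if not any(p <= child for p in problematic):
                    children.add(child)
    return children

def lattice_search(domains, evaluate, k, T, alpha):
    """Level-by-level search for up to k problematic slices."""
    problematic = []
    frontier = expand([frozenset()], problematic, domains)  # one-literal slices
    while frontier and len(problematic) < k:
        candidates, non_problematic = [], []
        for s in frontier:
            size, phi, p_value = evaluate(s)
            if size == 0:
                continue                      # empty slice: nothing to validate
            if phi >= T:
                candidates.append((-size, -phi, sorted(s), s, p_value))
            else:
                non_problematic.append(s)
        # Within one level: larger slices first, then larger effect sizes.
        candidates.sort(key=lambda c: c[:3])
        for _, _, _, s, p_value in candidates:
            if p_value <= alpha:
                problematic.append(s)
                if len(problematic) == k:
                    return problematic
            else:
                non_problematic.append(s)
        frontier = expand(non_problematic, problematic, domains)
    return problematic

# Hypothetical statistics: (size, effect size, p-value) per slice.
stats = {
    frozenset({("A", "a1")}): (50, 0.6, 0.01),
    frozenset({("B", "b1")}): (60, 0.1, 0.50),
    frozenset({("B", "b2")}): (40, 0.1, 0.50),
    frozenset({("C", "c1")}): (100, 0.2, 0.30),
    frozenset({("B", "b1"), ("C", "c1")}): (30, 0.7, 0.01),
    frozenset({("B", "b2"), ("C", "c1")}): (20, 0.5, 0.02),
}
domains = {"A": ["a1"], "B": ["b1", "b2"], "C": ["c1"]}
result = lattice_search(domains, lambda s: stats.get(s, (0, 0.0, 1.0)),
                        k=2, T=0.4, alpha=0.05)
```

With these numbers the search returns A = a1 followed by B = b1 ∧ C = c1, as in Example 2; slices that extend A = a1 are never generated because they are subsumed.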
The following theorem formalizes the correctness of this algorithm for the slice-identification problem. The proof is a straightforward proof by contradiction and is omitted.
3.1.4 Scalability
Slice Finder optimizes its search by expanding the filter predicate by one literal at a time. Unfortunately, this strategy does not solve the scalability issue of the data slicing problem completely, and Slice Finder could still search through an exponential number of slices, especially for large high-dimensional data sets. To this end, Slice Finder also takes the following two approaches for speeding up search.
Parallelization: For lattice searching, evaluating a given model on a large number of slices one by one (sequentially) can be very expensive. In particular, computing the effect sizes is the performance bottleneck. So instead, Slice Finder can distribute effect size evaluation jobs (lines 8–12 in Algorithm 1) by keeping a separate queue of expanded slices for each number of literals. The idea is that workers take slices from the current queue in a round-robin fashion and evaluate them asynchronously; the workers push slices that have high effect sizes to the priority queue of candidates (for hypothesis testing) as they finish evaluating the slices. The significance testing on the candidate slices can be done by a single worker because the slices have already been filtered by effect size, and the significance testing can be done efficiently. In addition, the added memory and communication overheads are negligible compared to the time for computing the effect sizes. If the current queue is empty but fewer than k problematic slices have been found, Slice Finder moves on to the queue for the next number of literals and continues searching until k slices are found.
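A simplified version of this idea can be sketched with Python's `concurrent.futures`. The sketch evaluates the effect sizes of one level concurrently and then splits the slices by the threshold T; it uses a blocking `map` rather than the asynchronous pushes described above, and the function name and toy scores are hypothetical. Threads are used for brevity; for CPU-bound model evaluation in pure Python, a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def split_by_effect_size(slices, effect_size_fn, T, max_workers=4):
    """Evaluate effect sizes concurrently, then split slices by the threshold T."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(effect_size_fn, slices))  # keeps input order
    candidates = [(s, phi) for s, phi in zip(slices, scores) if phi >= T]
    rest = [s for s, phi in zip(slices, scores) if phi < T]
    return candidates, rest

# Hypothetical precomputed effect sizes standing in for real model evaluation.
toy_scores = {"country = DE": 0.55, "country = US": 0.10, "gender = Male": 0.42}
candidates, rest = split_by_effect_size(list(toy_scores), toy_scores.get, T=0.4)
```

The candidates would then be handed to a single worker for significance testing, while the remaining slices are kept for expansion at the next level.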
For DT, our current implementation does not support parallel learning algorithms for constructing trees. However, there exist a number of highly parallelizable learning processes for decision trees [23], which Slice Finder could implement to make DT more scalable.
Sampling: Slice Finder can also scale by running on a sample of the entire dataset. The runtime of Slice Finder is proportional to the sample size, assuming that the runtime for the test model is constant for each example. By taking a sample, however, we also run the risk of false positives (non-problematic slices that appear problematic) and false negatives (problematic slices that appear non-problematic or completely disappear from the sample) due to a decreased number of examples. Since we are interested in large slices that are more impactful to model quality, we can disregard smaller false negatives that may have disappeared from the sample. In Section 5.5, we show that even for small sample sizes, most of the problematic slices can still be found. In Section 3.2, we perform significance testing to filter slices that falsely appear as problematic or non-problematic.
3.2 False Discovery Control
As Slice Finder finds more slices for testing, there is also the danger of finding more “false positives” (Type I errors), which are slices that appear problematic by chance. Slice Finder controls false positives in a principled fashion using α-investing [20]. Given an α-wealth (overall Type I error rate) α, α-investing spends this wealth over multiple comparisons, while increasing the budget for subsequent tests with each rejected hypothesis. This so-called payout (increase in α-wealth) helps the procedure become less conservative and puts more weight on null hypotheses that are more likely to be false. More specifically, an α-investing rule determines the wealth for the next test in a sequence of tests. This effectively controls the marginal false discovery rate (mFDR) at level α:
\[ \mathrm{mFDR}_\eta = \frac{E[V]}{E[R] + \eta} \le \alpha \]
Here, \(V\) is the number of false discoveries, \(R\) is the number of total discoveries returned by the procedure, and \(\eta\) is a positive constant. Slice Finder uses α-investing mainly because it allows more interactive multiple hypothesis error control with an unspecified number of tests in any order. In contrast, more restricted multiple hypothesis error control techniques, such as the Bonferroni correction and the Benjamini-Hochberg procedure [10], fall short as they require the total number of tests in advance or become too conservative as the number of tests grows large.
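To see why some form of correction is needed, the following short calculation shows how quickly uncorrected testing accumulates false positives and how strict a Bonferroni-style per-test level becomes. It assumes independent tests, and the function names are hypothetical.

```python
def prob_any_false_positive(alpha, m):
    """Probability of at least one false positive in m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m

p_eight = prob_any_false_positive(0.05, 8)        # about 0.34
p_hundred = prob_any_false_positive(0.05, 100)    # close to 1
per_test = bonferroni_level(0.05, 1000)           # 0.00005
```

With a thousand candidate slices, the Bonferroni per-test level is so small that only very strong effects survive, which illustrates the conservativeness mentioned above.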
While there are different α-investing policies [21] for testing a sequence of hypotheses, we use a policy called Best-foot-forward. Recall that our exploration strategy orders slices by decreasing slice size and effect size. As a result, the initial slices also tend to be statistically significant. The Best-foot-forward policy likewise assumes that many of the true discoveries are found early on and aggressively invests all wealth on each hypothesis instead of saving some for subsequent hypotheses.
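A minimal sketch of such a policy is shown below. It assumes the update rule commonly given for α-investing: a test run at level a earns a fixed payout when it rejects and costs a / (1 − a) when it does not. Investing all wealth W then means testing at level W / (1 + W), so a single failed test exhausts the budget. The initial wealth, the payout, and the function name are assumptions for illustration and need not match Slice Finder's implementation.

```python
def best_foot_forward(p_values, initial_wealth=0.05, payout=0.05):
    """Decide, in order, which hypotheses to reject while investing all wealth."""
    wealth = initial_wealth
    decisions = []
    for p in p_values:
        if wealth <= 0:
            decisions.append(False)          # no budget left: cannot reject
            continue
        level = wealth / (1 + wealth)        # level whose failure cost equals the wealth
        if p <= level:
            wealth += payout                 # rejection earns the payout
            decisions.append(True)
        else:
            wealth = 0.0                     # cost level / (1 - level) equals all wealth
            decisions.append(False)
    return decisions

decisions = best_foot_forward([0.001, 0.01, 0.2, 0.0001])
```

In this run the first two hypotheses are rejected and the budget grows; the third test fails and uses up all wealth, so the fourth hypothesis cannot be rejected despite its small p-value. This is why presenting the most promising slices first matters for this policy.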
3.3 Interactive Visualization Tool
Slice Finder interacts with users through the GUI in Figure 3. A: On the left side is a scatter plot that shows the (size, effect size) coordinates of all slices. This gives an overview of the top-k problematic slices, which allows the user to quickly browse through large and problematic slices and compare slices to each other. B: Whenever the user hovers the mouse over a dot, the slice description, size, effect size, and metric (e.g., log loss) are displayed next to it. If a set of slices is selected, their details appear in the table on the right-hand side. C: In the table view, the user can sort slices by any metric in the table. D: At the bottom, Slice Finder provides configurable sliders for adjusting k and T. Slice Finder materializes all the problematic slices as well as the non-problematic slices searched already. If T decreases, we just need to reiterate over the slices explored until now to find the top-k slices. If T increases, then the current slices may not be sufficient, depending on k, so we continue searching the slice tree or lattice. This interaction is possible because Slice Finder looks for the top-k slices in a top-down manner.
4 Using Slice Finder for Model Fairness
In this section, we look at model fairness as a potential use case of Slice Finder where identifying problematic slices can be a preprocessing step before more sophisticated analyses on fairness on the slices. As machine learning models are increasingly used in sensitive applications, such as predicting whether individuals will default on loans [24], commit crime [25], or survive intensive hospital care [26], it is essential to make sure the model performs equally well for all demographics to avoid discrimination. However, models may fail this property for various reasons: bias in data collection, insufficient data for certain slices, limitations in the model training, to name a few cases.
Model fairness has various definitions depending on the application and is thus non-trivial to formalize [27]. While many metrics have been proposed [24, 28, 29, 30], there is no widely-accepted standard, and some definitions are even at odds. In this paper, we focus on a relatively common definition, which is to find the data where the model performs relatively worse using some of these metrics, which fits into the Slice Finder framework.
Using our definition of fairness, Slice Finder can be used to quickly identify interpretable slices that have fairness issues without having to specify the sensitive features in advance. Here, we demonstrate how Slice Finder can be used to find any unfairness of the model with equalized odds [24]. Namely, we explain how our definition of a problematic slice using effect size also conforms to the definition of equalized odds. Slice Finder is also generic and supports any fairness metric that can be expressed as a scoring function. Any subsequent analysis of fairness on these slices can be done afterwards.
Equalized odds requires a predictor $\hat{Y}$ (e.g., a classification model in our case) to be independent of a protected or sensitive feature $A$ (e.g., gender = Male or gender = Female) conditional on the true outcome $Y$ [24]. In binary classification ($y \in \{0, 1\}$), this is equivalent to:
$$\Pr\{\hat{Y} = 1 \mid A = 0, Y = y\} = \Pr\{\hat{Y} = 1 \mid A = 1, Y = y\}, \quad y \in \{0, 1\}$$
Notice that equalized odds essentially matches true positive rates (tpr) in the case of $y = 1$ and false positive rates (fpr) otherwise.
Slice Finder can be used to identify slices where the model is potentially discriminatory; a machine learning practitioner can easily identify feature dimensions of the data on which a deeper analysis and potential model fairness adjustments are needed, without having to manually consider all feature-value pair combinations. The problematic slices, whose effect size is at least the threshold $T$, suffer from higher loss (lower model accuracy in case of log loss) compared to their counterparts. If one group enjoys better accuracy than the other, then it is a good indication that the model is biased. Namely, accuracy is a weighted sum of tpr and the true negative rate (tnr) by the proportions of positive and negative examples, and thus a difference in accuracy indicates a difference in tpr or in false positive rate (fpr = 1 - tnr), assuming there are any positive examples. As equalized odds requires matching tpr and fpr between the two demographics (a slice and its counterpart), Slice Finder can use effect size to identify slices on which the model is potentially discriminatory. In case of the gender = Male slice above, we flag this as a signal for discriminatory model behavior because the slice is defined over a sensitive feature and has a high effect size.
There are other standards, but equalized odds ensures that the prediction is non-discriminatory with respect to a specified protected attribute (e.g., gender) without sacrificing the target utility (i.e., maximizing model performance) too much [24].
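Once a slice has been flagged, a follow-up equalized odds check compares the true positive and false positive rates of the slice against its counterpart. The following is a minimal sketch of such a check; the function names and the boolean membership mask are assumptions introduced for illustration and are not part of Slice Finder.

```python
from typing import List, Sequence, Tuple


def tpr_fpr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """True positive rate and false positive rate for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    tpr = tp / pos if pos else float("nan")
    fpr = fp / neg if neg else float("nan")
    return tpr, fpr


def equalized_odds_gaps(y_true: List[int], y_pred: List[int],
                        in_slice: List[bool]) -> Tuple[float, float]:
    """Absolute tpr and fpr differences between a slice and its counterpart."""
    s_true = [t for t, m in zip(y_true, in_slice) if m]
    s_pred = [p for p, m in zip(y_pred, in_slice) if m]
    c_true = [t for t, m in zip(y_true, in_slice) if not m]
    c_pred = [p for p, m in zip(y_pred, in_slice) if not m]
    s_tpr, s_fpr = tpr_fpr(s_true, s_pred)
    c_tpr, c_fpr = tpr_fpr(c_true, c_pred)
    return abs(s_tpr - c_tpr), abs(s_fpr - c_fpr)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
    in_slice = [True, True, True, True, False, False, False, False]
    print(equalized_odds_gaps(y_true, y_pred, in_slice))
```

Both gaps are zero exactly when equalized odds holds between the slice and its counterpart; how large a gap is tolerable is an application-specific choice.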
5 Experiments
In this section, we compare the two Slice Finder approaches (decision tree and lattice search) with the baseline (clustering). For the clustering approach, we use the $k$-means algorithm. We address the following key questions:

How accurate and efficient is Slice Finder?

What are the tradeoffs between the slicing techniques?

What is the impact of adjusting the effect size threshold $T$?

Are the identified slices interpretable enough to understand the model’s performance?

How effective is false discovery control using $\alpha$-investing?
5.1 Experimental Setup
We used the following two problems with different datasets and models to compare how the three different slicing techniques – lattice search (LS), decision tree (DT), and clustering (CL) – perform in terms of recommended slice quality as well as their interpretability.

Credit Card Fraud Detection: We trained a random forest classifier to predict fraudulent transactions among credit card transactions [32]. This dataset contains transactions that occurred over two days, where we have 492 frauds out of 284k transactions (examples), each with 29 features. Because the dataset is heavily imbalanced, we first undersample non-fraudulent transactions to balance the data. This leaves a total of 984 transactions in the balanced dataset (492 fraudulent and 492 non-fraudulent).
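The balancing step can be sketched as random undersampling of the majority class. The code below is a generic illustration, not the preprocessing script used for the experiments; the function name and the fixed seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def undersample(examples: Sequence, labels: Sequence[int],
                seed: int = 0) -> Tuple[List, List[int]]:
    """Keep every minority-class example and an equal-sized random sample
    of the majority class (binary labels in {0, 1})."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [examples[i] for i in kept], [labels[i] for i in kept]


if __name__ == "__main__":
    labels = [1] * 5 + [0] * 95
    examples = list(range(100))
    x, y = undersample(examples, labels)
    print(len(x), sum(y))  # balanced: as many 1s as 0s
```

With 492 fraudulent transactions, this procedure keeps 492 non-fraudulent ones as well, which gives the 984 examples mentioned above.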
As we shall see, the two datasets – Census Income and Credit Card Fraud – have different characteristics and are thus useful for comparing the behaviors of the decision tree and lattice search algorithms. In addition, we also use a synthetic dataset when necessary. The main advantage of using synthetic data is that it gives us more insights into the operations of Slice Finder. In Sections 5.2–5.6, we assume that all slices are statistically significant for simplicity and separately evaluate statistical significance in Section 5.7.
Accuracy Measure: Since problematic slices may overlap, we define precision to be the fraction of examples in the union of the slices identified by the algorithm being evaluated that also appear in actual problematic slices. Similarly, recall is defined as the fraction of the examples in the union of actual problematic slices that are also in the identified slices. Finally, accuracy is the harmonic mean of precision and recall.
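The accuracy measure can be written down directly in terms of sets of example identifiers. The sketch below assumes each slice is represented as a Python set of example ids; the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def slice_accuracy(identified: Iterable[Set[int]],
                   actual: Iterable[Set[int]]) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean over unions of slices."""
    found = set().union(*identified)   # examples covered by identified slices
    truth = set().union(*actual)       # examples covered by true problematic slices
    overlap = len(found & truth)
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    accuracy = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy


if __name__ == "__main__":
    identified = [{1, 2, 3}, {3, 4}]   # overlapping slices are fine
    actual = [{2, 3, 4, 5}]
    print(slice_accuracy(identified, actual))
```

Taking unions first means that an example covered by several overlapping slices is counted only once, which is the reason the measure is defined over examples rather than over slices.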
5.2 Problematic Slice Identification
An important question to answer is whether Slice Finder can indeed find the most problematic slices from the user’s point of view. Unfortunately, for the real datasets, we do not know what the true problematic slices are, which makes our evaluation challenging. Instead, we add new problematic slices by randomly perturbing labels and focus on finding those slices. While Slice Finder may find both new and existing problematic slices, our evaluation only considers whether Slice Finder finds the new problematic slices.
We first experiment on a synthetic dataset and compare the performances of LS, DT, and CL. We then experiment on the real datasets and show similar results.
5.2.1 Synthetic Data
We generate a simple synthetic dataset where the generated examples have two discretized features $F_1$ and $F_2$ and can be classified into two classes – 0 and 1 – perfectly. We make the model use this decision boundary and do not change it further. Then we add problematic slices by choosing random possibly-overlapping slices of the form $F_1 = A$, $F_2 = B$, or $F_1 = A \wedge F_2 = B$. For each slice, we flip the labels of the examples with 50% probability. Note that this perturbation results in the worst model accuracy possible: the perturbed labels are independent of the features, so no classifier can do better than random guessing on them.
Figure 4(a) shows the accuracy comparison of LS, DT, and CL on synthetic data. As the number of recommendations increases, LS consistently has a higher accuracy than DT because LS is able to better pinpoint the problematic slices, including overlapping ones, while DT is limited in the sense that it only searches non-overlapping slices. For CL, we only evaluated the clusters with effect sizes of at least the threshold $T$. Even so, the accuracy is much lower than those of LS and DT.
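A generator in the spirit of this setup could look as follows. This is a sketch under stated assumptions: the feature names `F1` and `F2`, the number of discrete values, and the decision boundary (class 1 when `F1` falls in the upper half of its range) are all invented here, and an example covered by several overlapping slices is flipped at most once, which still leaves its label uniformly random.

```python
import random
from typing import Dict, List


def in_slice(row: Dict[str, int], predicate: Dict[str, int]) -> bool:
    """A slice is a conjunction of feature = value literals."""
    return all(row[name] == value for name, value in predicate.items())


def make_synthetic(n: int, n_values: int,
                   problematic: List[Dict[str, int]],
                   seed: int = 0) -> List[Dict[str, int]]:
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {"F1": rng.randrange(n_values), "F2": rng.randrange(n_values)}
        # Assumed decision boundary; the fixed model predicts exactly this.
        row["pred"] = 1 if row["F1"] >= n_values // 2 else 0
        row["label"] = row["pred"]
        row["perturbed"] = any(in_slice(row, p) for p in problematic)
        # Flip the label with 50% probability inside problematic slices.
        if row["perturbed"] and rng.random() < 0.5:
            row["label"] = 1 - row["label"]
        rows.append(row)
    return rows


if __name__ == "__main__":
    slices = [{"F1": 3}, {"F2": 7}, {"F1": 3, "F2": 7}]
    data = make_synthetic(5000, 10, slices, seed=1)
    noisy = [r for r in data if r["perturbed"]]
    wrong = sum(r["label"] != r["pred"] for r in noisy)
    print(len(noisy), wrong / len(noisy))
```

Outside the planted slices the model is always correct, so any slice with a high loss concentration must intersect a planted slice. That property is what makes the ground truth for precision and recall well defined on synthetic data.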
5.2.2 Real Data
We also perform a similar experiment using the Census Income dataset where we generate new problematic slices on top of the existing data by randomly choosing slices and flipping labels with 50% probability. Compared to the synthetic data, the existing data may also have problematic slices, which we do not evaluate because we do not know what they are. Figure 4(b) shows similar comparison results between LS, DT, and CL. The accuracies of LS and DT are lower than those in the synthetic data experiments because some of the identified slices may be problematic slices in the existing data, but are considered incorrect when evaluated.
5.3 Large Problematic Slices
Figures 5 and 6 show how LS and DT outperform CL in terms of average slice size and average effect size on the real datasets. CL starts with the entire dataset where the number of clusters (i.e., recommendations) is 1. CL produces large clusters that have very low effect sizes where the average is around 0.0 and sometimes even negative, which means some slices are not problematic. The CL results show that grouping similar examples does not necessarily guide users to problematic slices. In comparison, LS and DT find smaller slices with effect sizes above the threshold $T$.
LS and DT show different behaviors depending on the given dataset. When running on the Census Income data, both LS and DT are able to easily find up to $k$ problematic slices with similar effect sizes. Since LS generally has a larger search space than DT, in that it also considers overlapping slices, it is able to find larger slices as a result. When running on the Credit Card Fraud data, DT has a harder time finding enough problematic slices. The reason is that DT initially finds a large problematic slice, but then needs to generate many levels of the decision tree to find additional problematic slices because it only considers non-overlapping slices. Since a decision tree is designed to partition data to minimize impurity, the slices found deeper down the tree tend to be smaller and “purer,” which means the problematic ones have higher effect sizes. Lastly, DT could not find more than 7 problematic slices because the leaf nodes were too small to split further. These results show that, while DT may search a level of a decision tree faster than LS searches a level of a lattice, it may have to search more levels of the tree to make the same number of recommendations.
5.4 Adjusting Effect Size Threshold
Figure 7 shows the impact of adjusting the effect size threshold $T$ on LS and DT. For a low $T$ value, there are more slices that can be problematic. Looking at the Census Income data, LS indeed finds larger slices than those found by DT, although they have relatively smaller effect sizes as a result. As $T$ increases, LS is forced to search smaller slices that have high-enough effect sizes. Since LS still has a larger search space than DT, it does find slices with higher effect sizes when $T$ is at least 0.4. The Credit Card Fraud data shows a rather different comparison. For small $T$ values, recall that DT initially finds a large problematic slice, which means the average size is high and the effect size small. As $T$ increases, DT has to search many levels of the decision tree to find additional problematic slices. These additional slices are much smaller, which is why there is an abrupt drop in average slice size. However, the slices have higher effect sizes, which is why there is also a corresponding jump in the average effect size.
5.5 Scalability
We evaluate the scalability of LS and DT against different sample fractions, degrees of parallelization, and numbers of top slices to recommend. All experiments were performed on the Census Income dataset.
Figure 8 shows how the runtimes of LS and DT change versus the sampling fraction. For both algorithms, the runtime increases almost linearly with the sample size. We also measure the relative accuracy of the two algorithms where we compare the slices found in a sample with the slices found in the full dataset. For a sample fraction of 1/128, both LS and DT maintain a high relative accuracy of 0.88. These results show that it is possible to find most of the problematic slices using a fraction of the data, about two orders of magnitude faster.
Figure 9(a) illustrates how Slice Finder can scale with parallelization. LS can distribute the evaluation (e.g., effect size computation) of the slices with the same number of filter predicates to multiple workers. As a result, for the full Census Income data, increasing the number of workers results in better runtime. Notice that the marginal runtime improvement decreases as we add more workers. The results for DT are not shown here because the current implementation does not support parallel DT model training.
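The idea of distributing the evaluation of one lattice level can be sketched with a worker pool. This is only an illustration: the effect size formula below (a difference of mean losses normalized by the pooled standard deviation) is an assumed stand-in for the definition in Section 2, the function names are hypothetical, and a thread pool is used purely to keep the sketch short. In CPython, threads do not speed up CPU-bound work, so a real deployment would use processes or a cluster.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def effect_size(losses: List[float], member: List[bool]) -> float:
    """Assumed form: sqrt(2) * (mean_S - mean_C) / sqrt(var_S + var_C)."""
    inside = [l for l, m in zip(losses, member) if m]
    outside = [l for l, m in zip(losses, member) if not m]
    if not inside or not outside:
        return 0.0
    pooled = math.sqrt(pvariance(inside) + pvariance(outside))
    if pooled == 0:
        return 0.0
    return math.sqrt(2) * (mean(inside) - mean(outside)) / pooled


def evaluate_level(rows: List[Dict[str, int]], losses: List[float],
                   candidates: List[Dict[str, int]],
                   workers: int = 4) -> List[Tuple[Dict[str, int], int, float]]:
    """Score all candidate slices of one lattice level on a worker pool."""
    def score(predicate: Dict[str, int]) -> Tuple[Dict[str, int], int, float]:
        member = [all(r[f] == v for f, v in predicate.items()) for r in rows]
        return predicate, sum(member), effect_size(losses, member)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the candidates.
        return list(pool.map(score, candidates))


if __name__ == "__main__":
    rows = [{"a": 0}, {"a": 0}, {"a": 1}, {"a": 1}]
    losses = [1.0, 0.8, 0.1, 0.3]
    for predicate, size, phi in evaluate_level(rows, losses, [{"a": 0}, {"a": 1}]):
        print(predicate, size, round(phi, 3))
```

Slices with the same number of literals are independent of each other, so each one can be scored without coordination. The only synchronization point is the end of a level, which is consistent with the diminishing returns observed as workers are added.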
Figure 9(b) compares the runtimes of LS and DT when the number of top recommendations $k$ increases. For small $k$ values less than 5, DT is faster because it searches fewer slices to find problematic ones. However, as $k$ increases, DT needs to search through many levels of a decision tree and starts to run relatively slower than LS. Meanwhile, LS only searches the next level of the lattice if $k$ is at least 70, at which point DT is again relatively faster. Thus, whether LS or DT is faster depends on $k$ and how frequently problematic slices occur.
5.6 Interpretability
Table II: Top-5 problematic slices found by LS and DT on the two datasets.

| Slice | # Literals | Size | Effect Size |
|---|---|---|---|
| LS slices from Census Income data | | | |
| Marital Status = Married-civ-spouse | 1 | 14065 | 0.58 |
| Relationship = Husband | 1 | 12463 | 0.52 |
| Relationship = Wife | 1 | 1406 | 0.46 |
| Capital Gain = 3103 | 1 | 94 | 0.87 |
| Capital Gain = 4386 | 1 | 67 | 0.94 |
| DT slices from Census Income data | | | |
| Marital Status = Married-civ-spouse | 1 | 14065 | 0.58 |
| Marital Status Married-civ-spouse Capital Gain 7298 Capital Gain 8614 EducationNum 13 | 4 | 7 | 0.58 |
| Marital Status Married-civ-spouse Capital Gain 7298 EducationNum 13 Age 28 Hours per week 44 | 5 | 855 | 0.43 |
| Marital Status Married-civ-spouse Capital Gain 7298 EducationNum 13 Age 28 Capital Loss 2231 | 5 | 5 | 1.07 |
| Marital Status Married-civ-spouse Capital Gain 7298 EducationNum 13 Age 28 Hours per week 44 EducationNum 15 | 6 | 101 | 0.47 |
| LS slices from Credit Card Fraud data | | | |
| V14 = 3.69 – 1.00 | 2 | 98 | 0.45 |
| V7 = 0.94 – 23.48 V10 = 2.16 – 0.87 | 3 | 29 | 0.41 |
| V1 = 1.13 – 1.74 V25 = 0.48 – 0.71 | 4 | 28 | 0.54 |
| V7 = 0.94 – 23.48 Amount = 270.54 – 4248.34 | 4 | 28 | 0.53 |
| V10 = 2.16 – 0.87 V17 = 0.92 – 6.74 | 5 | 27 | 0.44 |
| DT slices from Credit Card Fraud data | | | |
| V14 V10 | 2 | 31 | 0.60 |
| V14 V4 V12 | 3 | 59 | 0.48 |
| V14 V4 V14 V2 | 4 | 23 | 0.42 |
| V14 V4 V14 Amount | 4 | 18 | 0.52 |
| V14 V4 V12 Amount V17 | 5 | 6 | 0.63 |
An important feature of Slice Finder is that it can find interpretable slices that can help a user understand and describe the model’s behavior using a few common features. A user without Slice Finder may have to go through all the misclassified examples (or clusters of them) manually to see if the model is biased or failing.
Table II shows the top-5 problematic slices from the two datasets using LS and DT. Looking at the top-5 slices found by LS from the Census Income data, the slices are easy to interpret with a small number of common features. We see that the Marital Status = Married-civ-spouse slice has the largest size as well as a large effect size, which indicates that the model can be improved for this slice. It is also interesting to see that the model fails for the people who are husbands or wives, but not for other relationships: Own-child, Not-in-family, Other-relative, and Unmarried. We also see that slices with high capital gains tend to be problematic in comparison to the common case where the value is 0. In addition, the top-5 slices found by DT from the Census Income data can also be interpreted in a straightforward way, although having more literals makes the interpretation more tedious. Finally, the top-5 slices from the Credit Card Fraud data are harder (but still reasonable) to interpret because many feature names are anonymized (e.g., V14).
5.7 False Discovery Control
Even for a small dataset (or sample), there can be an overwhelming number of problematic slices. The goal of Slice Finder is to bring the user’s attention to a handful of large problematic slices; however, if the sample size is small, most slices would contain fewer examples, and thus it is likely that many slices and their effect size measures are seen by chance. In such a case, it is important to prevent false discoveries (e.g., non-problematic slices that appear problematic, with effect size at least $T$, due to sampling bias). For evaluation, we use the Census Income data and compare the results of the Bonferroni correction (BF), the Benjamini-Hochberg procedure (BH), and $\alpha$-investing (AI) using two standard measures: false discovery rate, which was described in Section 3.2, and power [21], which is the probability that the tests correctly reject the null hypothesis.
The results in Figure 10 show that, as the $\alpha$ value (or wealth when using AI) increases up to 0.01, AI and BH have higher FDR results than BF, but higher power results as well. When measuring the accuracy of slices, AI slightly outperforms both BH and BF because it invests its $\alpha$-wealth more effectively using the Best-foot-forward policy. In comparison, BF is conservative and has a high false-negative rate (which results in lower accuracy), and BH does not exploit the fact that the earlier slices are more likely to be problematic, as AI does. The more important advantage of AI is that it is the only technique that works in an interactive setting.
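For reference, the two batch baselines can be written in a few lines each. The sketch below implements the textbook Bonferroni correction and Benjamini-Hochberg step-up procedure [10]; the function names are chosen here, and the p-values in the example are made up for illustration rather than taken from the experiments.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float) -> List[bool]:
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float) -> List[bool]:
    """Step-up procedure: find the largest rank i with p_(i) <= i * alpha / m
    and reject the i smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.012, 0.025, 0.038, 0.6]
    print(bonferroni(p, alpha=0.05))          # the more conservative of the two
    print(benjamini_hochberg(p, alpha=0.05))
```

Both functions need the complete list of p-values, and therefore the number of tests `m`, before they can decide anything. This is the practical obstacle to using them while the user is still adjusting the sliders and new slices keep arriving.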
6 Related Work
In practice, the overall performance metrics can mask issues at a more granular level, and it is important to validate the model accordingly on smaller subsets/subpopulations of data (slices). While this is a well-known problem, the existing tools are still primitive in that they rely on domain experts to predefine important slices. State-of-the-art tools for machine learning model validation include Facets [33], which can be used to discover bias in the data, TensorFlow Model Analysis (TFMA), which slices data by an input feature dimension for a more granular performance analysis [6], and MLCube [5], which provides manual exploration of slices and can either evaluate a single model or compare two models. While the above tools are manual, Slice Finder complements them by automatically finding slices useful for model validation.
There are several other lines of work related to this problem, and we list the work most relevant to Slice Finder.
Data Exploration: Online Analytical Processing (OLAP) has been tackling the problem of slicing data for analysis, and the techniques deal with the problem of a large search space (i.e., how to efficiently identify data slices with certain properties). For example, Smart Drilldown [34] proposes an OLAP drill-down process that returns the top-$k$ most “interesting” rules such that the rules cover as many records as possible while being as specific as possible. Intelligent rollups [35] goes in the other direction, where the goal is to find the broadest cube that shares the same characteristics as a problematic record. In comparison, Slice Finder finds slices on which the model underperforms without having to evaluate the model on all the possible slices. This is different from general OLAP operations based on cubes with pre-summarized aggregates, and the OLAP algorithms cannot be directly used.
Model Understanding: Understanding a model and its behavior is a broad topic that is being studied extensively [22, 8, 9, 36, 37, 38]. For example, LIME [8] trains interpretable linear models on local data and random noise to see which features are prominent. Anchors [9] are high-precision rules that provide local and sufficient conditions for a black-box model to make predictions. In comparison, Slice Finder is a complementary tool that identifies the parts of the data where the model performs relatively worse than on other parts. As a result, there are certain applications (e.g., model fairness) that benefit more from slices. PALM [7] isolates a small set of training examples that have the greatest influence on the prediction by approximating a complex model with an interpretable meta-model that partitions the training data and a set of sub-models that approximate the patterns within each partition. PALM expects as input the problematic example and a set of features that are explainable to the user. In comparison, Slice Finder finds large, significant, and interpretable slices without requiring user input. Influence functions [39] have been used to compute how each example affects model behavior. In comparison, Slice Finder identifies interpretable slices instead of individual examples. An interesting research direction is to extend influence functions to slices and quantify the impact of slices on the overall model quality.
Feature Selection: Slice Finder is a model validation tool, which comes after model training. It is important to note that this is different from feature selection [40, 41] in model training, where the goal is often to identify and (re)train on the most correlated features (dimensions) to the target label (i.e., finding representative features that best explain model predictions). Instead, Slice Finder identifies a few common feature values that describe subsets of data with significantly high error concentration for a given model; this, in turn, could help the user to interpret hidden model performance issues that are masked by good overall model performance metrics.
7 Conclusion
We have proposed Slice Finder as a tool for efficiently and accurately finding large, significant, and interpretable slices. The techniques are relevant to model validation in general, but also to model fairness and fraud detection, where human interpretability is critical to understand model behavior. We have proposed two complementary approaches for slice finding: decision tree training, which finds non-overlapping slices, and lattice searching, which finds possibly-overlapping slices. We also provide an interactive visualization front-end to help users quickly browse through a handful of problematic slices.
In the future, we would like to improve Slice Finder to better discretize numeric features and support the merging and summarization of slices. We would also like to deploy Slice Finder to production machine learning platforms and conduct a user study on how helpful the slices are for explaining and debugging models.
Acknowledgments
Steven Euijong Whang and Ki Hyun Tae were supported by a Google AI Focused Research Award and by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF2018R1A5A1059921).
References
 [1] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc et al., “TFX: A TensorFlow-based production-scale machine learning platform,” in KDD, 2017, pp. 1387–1395.
 [2] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” ArXiv e-prints, Feb. 2017.
 [3] M. Lichman, “UCI machine learning repository,” 2013.
 [4] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin et al., “Ad click prediction: a view from the trenches,” in KDD, 2013, pp. 1222–1230.
 [5] M. Kahng, D. Fang, and D. H. P. Chau, “Visual exploration of machine learning results using data cube analysis,” in HILDA. ACM, 2016, p. 1.
 [6] “Introducing tensorflow model analysis,” https://medium.com/tensorflow/introducingtensorflowmodelanalysisscaleableslicedandfullpassmetrics5cde7baf0b7b, 2018.
 [7] S. Krishnan and E. Wu, “Palm: Machine learning explanations for iterative debugging,” in HILDA, 2017, pp. 4:1–4:6.
 [8] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in KDD, 2016, pp. 1135–1144.
 [9] ——, “Anchors: High-precision model-agnostic explanations,” in AAAI, 2018, pp. 1527–1535.
 [10] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.
 [11] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data lifecycle challenges in production machine learning: A survey,” SIGMOD Rec., vol. 47, no. 2, pp. 17–28, Jun. 2018.
 [12] ——, “Data management challenges in production machine learning,” in SIGMOD, 2017, pp. 1723–1726.
 [13] Y. Chung, T. Kraska, N. Polyzotis, K. H. Tae, and S. E. Whang, “Slice finder: Automated data slicing for model validation,” ICDE, 2019.
 [14] Y. Chung, T. Kraska, N. Polyzotis, and S. E. Whang, “Slice finder: Automated data slicing for model interpretability,” SysML Conference, 2018.
 [15] “Effect size,” https://en.wikipedia.org/wiki/Effect_size.
 [16] G. M. Sullivan and R. Feinn, “Using effect size—or why the p value is not enough,” Journal of Graduate Medical Education, vol. 4, no. 3, pp. 279–282, 2012.

 [17] “Welch’s t-test,” https://en.wikipedia.org/wiki/Welch%27s_ttest.
 [18] J. Cohen, “Statistical power analysis for the behavioral sciences,” 1988.
 [19] W. McKinney, “pandas: a foundational python library for data analysis and statistics,” Python for High Performance and Scientific Computing, pp. 1–9, 2011.
 [20] D. Foster and B. Stine, “Alpha-investing: A procedure for sequential control of expected false discoveries,” Journal of the Royal Statistical Society Series B (Methodological), vol. 70, no. 2, pp. 429–444, 2008.
 [21] Z. Zhao, L. D. Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska, “Controlling false discoveries during interactive data exploration,” in SIGMOD, 2017, pp. 527–540.
 [22] A. A. Freitas, “Comprehensible classification models: A position paper,” SIGKDD Explor. Newsl., vol. 15, no. 1, pp. 1–10, Mar. 2014.
 [23] A. Srivastava, E.-H. Han, V. Kumar, and V. Singh, “Parallel formulations of decision-tree classification algorithms,” in High Performance Data Mining. Springer, 1999, pp. 237–261.
 [24] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in NIPS, 2016, pp. 3315–3323.
 [25] “Machine bias,” https://www.propublica.org/article/machinebiasriskassessmentsincriminalsentencing, 2016.
 [26] M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits, “Unfolding physiological state: Mortality modelling in intensive care units,” in KDD, 2014, pp. 75–84.
 [27] S. Barocas and M. Hardt, “Fairness in machine learning,” NIPS Tutorial, 2017.
 [28] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” in ITCS, 2012, pp. 214–226.
 [29] J. M. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent tradeoffs in the fair determination of risk scores,” in ITCS, 2017, pp. 43:1–43:23.
 [30] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, “Certifying and removing disparate impact,” in KDD, 2015, pp. 259–268.

 [31] R. Kohavi, “Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid,” in KDD, vol. 96, 1996, pp. 202–207.
 [32] A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, “Calibrating probability with undersampling for unbalanced classification,” in SSCI. IEEE, 2015, pp. 159–166.
 [33] “Facets overview,” https://research.googleblog.com/2017/07/facetsopensourcevisualizationtool.html, 2017.
 [34] M. Joglekar, H. Garcia-Molina, and A. Parameswaran, “Interactive data exploration with smart drill-down,” in ICDE. IEEE, 2016, pp. 906–917.
 [35] G. Sathe and S. Sarawagi, “Intelligent rollups in multidimensional olap data,” in VLDB, 2001, pp. 531–540.
 [36] P. Tamagnini, J. Krause, A. Dasgupta, and E. Bertini, “Interpreting black-box classifiers using instance-level visual explanations,” in HILDA, 2017, pp. 6:1–6:6.
 [37] O. Bastani, C. Kim, and H. Bastani, “Interpreting black-box models via model extraction,” CoRR, vol. abs/1705.08504, 2017.
 [38] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec, “Interpretable & explorable approximations of black box models,” CoRR, vol. abs/1707.01154, 2017.
 [39] P. W. Koh and P. Liang, “Understanding black-box predictions via influence functions,” in ICML, 2017, pp. 1885–1894.
 [40] M. Charikar, V. Guruswami, R. Kumar, S. Rajagopalan, and A. Sahai, “Combinatorial feature selection problems,” in Foundations of Computer Science, 2000, pp. 631–640.
 [41] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of machine learning research, vol. 3, no. Mar, pp. 1157–1182, 2003.