DataExposer: Exposing Disconnect between Data and Systems

As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debugging, which aims to find bugs in the mechanism (source code or runtime conditions), our goal is to debug the data to identify potential sources of disconnect between the assumptions about the data and the systems that operate on that data. Specifically, we seek to identify which properties of the data cause a data-driven system to malfunction. We propose DataExposer, a framework to identify data properties, called profiles, that are the root causes of performance degradation or failure of a system that operates on the data. Such identification is necessary to repair the system and resolve the disconnect between data and system. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataExposer alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataExposer reports causally verified root causes, in terms of data profiles, of the system malfunction. We empirically evaluate DataExposer on three real-world and several synthetic data-driven systems that fail on datasets for a diverse set of reasons. In all cases, DataExposer identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.



1. Introduction

Traditional software debugging aims to identify errors and bugs in the mechanism—such as source code, configuration files, and runtime conditions—that may cause a system to malfunction (Hailpern and Santhanam, 2002; Liblit et al., 2005; Fariha et al., 2020). However, in modern systems, data has become a central component that itself can cause a system to fail. Data-driven systems comprise complex pipelines that rely on data to solve a target task. Prior work addressed the problem of debugging machine-learning models (Cadamuro et al., 2016) and finding root causes of failures in computational pipelines (Lourenço et al., 2020), where certain values of the pipeline parameters—such as a specific model and/or a specific dataset—cause the pipeline failure. However, just knowing that a pipeline fails for a certain dataset is not enough; naturally, we ask: what properties of the dataset caused the failure?

Two common reasons for malfunctions in data-driven systems are: (1) incorrect data, and (2) a disconnect between the assumptions about the data and the design of the system that operates on the data. Such disconnects may happen when the system is not robust, i.e., it makes strict assumptions about metadata (e.g., data format, domains, ranges, and distributions), and when new data drifts from the data on which the system was tested before deployment (Rawal et al., 2020) (e.g., when a system expects a data stream to have a weekly frequency, but the data provider suddenly switches to daily data).

Therefore, in light of a failure, one should investigate potential issues in the data. Some specific examples of commonly observed system malfunctions caused by data include: (1) decline of a machine-learned model’s accuracy (due to out-of-distribution data), (2) unfairness in model predictions (due to imbalanced training data), (3) excessive processing time (due to a system’s failure to scale to large data), and (4) system crash (due to invalid input combinations in the data tuples beyond what the system was designed to handle). These examples indicate a common problem: a disconnect or mismatch between the data and the system design. Once the mismatch is identified, possible fixes could be either to repair the data to suit the system design, or to adjust the system design (e.g., modify source code) to accommodate data with different properties.

A naïve approach to deal with potential issues in the data is to identify outliers: report tuples as potentially problematic based on how atypical they are with respect to the rest of the tuples in the dataset. However, without verifying whether the outliers actually cause unexpected outcomes, we can never be certain about the actual root causes. As pointed out in prior work (Barowy et al., 2014): “With respect to a computation, whether an error is an outlier in the program’s input distribution is not necessarily relevant. Rather, potential errors can be spotted by their effect on a program’s output distribution.” To motivate our work, we start with an example taken from a real-world incident, where Amazon’s delivery service was found to be racist (Is Amazon same-day delivery service racist?, 2016).

Example 1 (Biased Classifier).

An e-commerce company wants to build an automated system that suggests who should get discounts. To this end, they collect information from the customers’ purchases over one year and build a dataset over the attributes name, gender, age, race, zip_code, phone, products_purchased, etc. Anita, a data scientist, is then asked to develop a machine learning (ML) pipeline over this dataset to predict whether a customer will spend over a certain amount, and, subsequently, should be offered discounts. Within this pipeline, Anita decides to use a logistic regression classifier for prediction and implements it using an off-the-shelf ML library. To avoid discrimination over any group and to ensure that the classifier trained on this dataset is fair, Anita decides to drop the sensitive attributes—race and gender—during the pre-processing step of the ML pipeline, before feeding the data to the classifier. However, despite this effort, the trained classifier turns out to be highly biased against African American people and women. After a close investigation, Anita discovers that: (1) In the training data, race is highly correlated with zip_code, and (2) The training dataset is imbalanced: a larger fraction of the people who purchase expensive products are male. Now she wonders: if these two properties did not hold in the dataset, would the learned classifier be fair? Have either (or both) of these properties caused the observed unfairness?
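To make the two suspected properties concrete, the sketch below computes a crude proxy-correlation measure and the gender imbalance on a toy stand-in for Anita's dataset. All values, the scoring heuristics, and the function names are our own illustrations, not part of DataExposer:

```python
from collections import Counter

# Toy rows mimicking the training data of Example 1; all values are
# illustrative, not taken from any real dataset.
rows = [
    # (gender, race, zip_code, high_spender)
    ("male",   "white", "01002", 1),
    ("male",   "white", "01002", 1),
    ("male",   "black", "01007", 0),
    ("female", "white", "01002", 1),
    ("female", "black", "01007", 0),
    ("female", "black", "01007", 0),
]

def proxy_strength(rows, a_idx, b_idx):
    """Fraction of tuples whose attribute-b value is the most common one for
    their attribute-a value -- a crude stand-in for correlation."""
    by_a = {}
    for r in rows:
        by_a.setdefault(r[a_idx], Counter())[r[b_idx]] += 1
    covered = sum(c.most_common(1)[0][1] for c in by_a.values())
    return covered / len(rows)

def positive_share(rows, g_idx, y_idx, group):
    """Share of the positively labeled tuples that belong to one group."""
    pos = [r for r in rows if r[y_idx] == 1]
    return sum(1 for r in pos if r[g_idx] == group) / len(pos)

print(proxy_strength(rows, 1, 2))          # race perfectly determines zip_code here
print(positive_share(rows, 0, 3, "male"))  # male share of high spenders
```

On this toy data, zip_code is a perfect proxy for race and men hold most of the positive labels—exactly the two properties Anita suspects.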

Unfortunately, existing tools (e.g., CheckCell (Barowy et al., 2014)) that blame individual cells (values) for unexpected outcomes cannot help here: no single cell in the training data is responsible for the observed discrimination; rather, global statistical properties (e.g., correlation) that involve multiple attributes over the entire data are the actual culprits. Furthermore, Anita only identified two potential, correlated data issues that may or may not be the actual cause of the unfairness. To distinguish mere correlation from true causation, and to verify whether there is indeed a causal connection between the data properties and the observed unfairness, we need to dig deeper.

Example 1 is one among many incidents in real-world applications where issues in the data caused systems to malfunction (Bias in Amazon Hiring; Vision Racism). A recent study of 112 high-severity incidents in Microsoft Azure services showed that 21% of the bugs were due to inconsistent assumptions about data format by different software components or versions (Liu et al., 2019). The study further found that 83% of the data-format bugs were due to inconsistencies between data producers and data consumers, while 17% were due to mismatches between interpretations of the same data by different data consumers. Similar incidents happened due to misspellings and incorrect date-time formats (Rezig et al., 2020b), and issues pertaining to data fusion where schema assumptions break for a new data source (Dong et al., 2014; Wang et al., 2015). We provide another illustrative example where a system times out when the distribution of the data over which the system operates exhibits significant skew.

Example 2 (Process Timeout).

A toll-collection software EZGo checks if vehicles passing through a gate have an electronic toll pass installed. If it fails to detect a toll pass, it uses an external software, OCR, to extract the registration number from the vehicle’s license plate. EZGo operates in a batch mode, processing each batch of vehicles by reserving AWS machines for one hour, under the assumption that one hour is sufficient for processing a batch. However, for some batches, EZGo fails. After a close investigation, it turns out that the external software OCR uses an algorithm that is extremely slow for images of black license plates captured in low illumination. As a result, when a batch contains a large number of such cases (a significantly skewed distribution), EZGo fails.

The aforementioned examples bring forth two key challenges. First, we need to correctly identify potential causes of unexpected outcomes and generate hypotheses that are expressive enough to capture the candidate root causes. For example, “outliers cause unexpected outcomes” is just one of the many possible hypotheses, which offers very limited expressivity. Second, we need to verify the hypotheses to confirm or refute them, which enables us to pinpoint the actual root causes, eliminating false positives.

Data profile as root cause.

Towards solving the first challenge, our observation is that data-driven systems often function properly for certain datasets, but malfunction for others. Such malfunction is often rooted in certain properties of the data, which we call data profiles, that distinguish passing and failing datasets. Examples include size of a dataset, domains and ranges of attribute values, correlations between attribute pairs, conditional independence (Yan et al., 2020), functional dependencies and their variants (Papenbrock et al., 2015a; Koudas et al., 2009; Fan et al., 2009; Ilyas et al., 2004; Caruccio et al., 2016), and other more complex data profiles (Song and Chen, 2011; Langer and Naumann, 2016; Papenbrock et al., 2015b; Chu et al., 2013).

Oracle-guided root cause identification.

Our second observation is that if we have access to an oracle that can indicate whether the system functions desirably or not, we can verify our hypotheses. Access to an oracle allows us to precisely isolate the correct root causes of the undesirable malfunction from a set of candidate causes. Here, an oracle is a mechanism that can characterize whether the system functions properly over the input data. The definition of proper functioning is application-specific; for example, achieving a certain accuracy may indicate proper functioning for an ML pipeline. Such oracles are often available in many practical settings, and have been considered in prior work (Lourenço et al., 2020; Fariha et al., 2020).

Solution sketch. In this paper, we propose DataExposer, a framework that identifies and exposes data profiles that cause a data-driven system to malfunction. Our framework involves two main components: (1) an intervention-based mechanism that alters the profiles of a dataset, and (2) a mechanism that speeds up analysis by carefully selecting appropriate interventions. Given a scenario where a system malfunctions (fails) over a dataset but functions properly (passes) over another, DataExposer focuses on the discriminative profiles, i.e., data profiles that significantly differ between the two datasets. DataExposer’s intervention mechanism modifies the “failing” dataset to alter one of the discriminative profiles; it then observes whether this intervention causes the system to perform desirably, or the malfunction persists. DataExposer speeds up this analysis by favoring interventions on profiles that are deemed more likely causes of the malfunction. To estimate this likelihood, we leverage three properties of a profile: (1) coverage: the more tuples an intervention affects, the more likely it is to fix the system behavior; (2) discriminating power: the bigger the difference between the failing and the passing datasets over a profile, the more likely that the profile is a cause of the malfunction; and (3) attribute association: if a profile involves an attribute that is also involved in a large number of other discriminative profiles, then that profile is highly likely to be a root cause, because altering such a profile is likely to passively repair other discriminative profiles as a side effect (through the associated attribute). We also provide a group-testing-based technique that allows group intervention, which helps expedite the root-cause analysis further.
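A minimal sketch of how these three signals might be combined into an intervention priority. The profile representation, the field names, and the weights below are our own illustrative choices, not DataExposer's actual scoring formula:

```python
# Each candidate profile is described by: the fraction of tuples an
# intervention on it would touch (coverage), the gap between its violation
# on the failing vs. the passing dataset (discrimination), and the set of
# attributes it involves. All fields, names, and weights are illustrative.
profiles = {
    "corr(race, zip_code)": {"coverage": 0.9, "discrimination": 0.8,
                             "attrs": {"race", "zip_code"}},
    "imbalance(gender)":    {"coverage": 0.6, "discrimination": 0.7,
                             "attrs": {"gender"}},
    "missing(phone)":       {"coverage": 0.1, "discrimination": 0.2,
                             "attrs": {"phone"}},
}

def association(name, profiles):
    """Number of *other* discriminative profiles sharing an attribute."""
    attrs = profiles[name]["attrs"]
    return sum(1 for other, p in profiles.items()
               if other != name and attrs & p["attrs"])

def priority(name, profiles):
    """Additive blend of the three signals (weights are arbitrary)."""
    p = profiles[name]
    return p["coverage"] + p["discrimination"] + 0.5 * association(name, profiles)

ranked = sorted(profiles, key=lambda n: priority(n, profiles), reverse=True)
print(ranked[0])  # the profile to intervene on first
```

Here the high-coverage, highly discriminative correlation profile is tried first, mirroring the intuition in the paragraph above.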

Scope. In this work, we assume knowledge of the classes of (domain-specific) data profiles that encompass the potential root causes. E.g., in Example 1, we assume the knowledge that correlation between attribute pairs and disparity between the conditional probability distributions (the probability of belonging to a certain gender, given the price of items bought) are potential causes of malfunction. This assumption is realistic because: (1) For a number of tasks there exists a well-known set of relevant profiles: e.g., class imbalance and correlation between sensitive and non-sensitive attributes are common causes of unfairness in classification (Bellamy et al., 2018); and violation of conformance constraints (Fariha et al., 2021), missing values, and out-of-distribution tuples are well-known causes of an ML model’s performance degradation. (2) Domain experts are typically aware of the likely class of data profiles for the specific task at hand and can easily provide this additional knowledge as a conservative approximation, i.e., they can include extra profiles just to err on the side of caution. Notably, this assumption is also extremely common in software debugging techniques (Fariha et al., 2020; Liblit et al., 2005; Zheng et al., 2006), which rely on the assumption that the “predicates” (traps to extract certain runtime conditions) are expressive enough to encode the root causes, and software testing (Mesbah et al., 2011), validation (Kumar et al., 2013), and verification (Henzinger et al., 2003) approaches, which rely on the assumption that the test cases, specifications, and invariants reasonably cover the codebase and correctness constraints.

To support a data profile, DataExposer further needs the corresponding mechanisms for discovery and intervention. In this work, we assume knowledge of the profile discovery and intervention techniques, as they are orthogonal to our work. Nevertheless, we discuss some common classes of data profiles supported in DataExposer and the corresponding discovery and intervention techniques. For data profile discovery, we rely on prior work on pattern discovery (Python, Rexpy package), statistical-constraint discovery (Yan et al., 2020), data-distribution learning (Hellerstein, 2008), knowledge-graph-based concept identification (Galhotra et al., 2019), etc. While our evaluation covers specific classes of data profiles (for which there exist efficient discovery techniques), our approach is generic and works for any class of data profiles, as long as the corresponding discovery and intervention techniques are available.

Limitations of prior work. To find potential issues in data, Dagger (Rezig et al., 2020b, a) provides data debugging primitives for human-in-the-loop interactions with data-driven computational pipelines. Other explanation-centric efforts (Wang et al., 2015; Bailis et al., 2017; Chirigati et al., 2016; El Gebaly et al., 2014) report salient properties of historical data based only on observations. In contrast with observational techniques, the presence of an oracle allows for interventional techniques (Lourenço et al., 2020) that can query the oracle with additional, system-generated test cases to identify root causes of system malfunction more accurately. One such approach is CheckCell (Barowy et al., 2014), which presents a ranked list of data cells that unusually affect the output of a given target function. CheckCell uses a fine-grained approach: it removes one cell of the data at a time and observes changes in the output distribution. While it is suitable for small datasets, where it is reasonable to expect a human-in-the-loop paradigm to fix cells one by one, it is not suitable for large datasets, where no individual cell is significantly responsible; rather, a holistic property of the entire dataset (a profile) causes the problem.

Interpretable machine learning is related to our problem, as its goal is to explain the behavior of machine-learned models. However, prior work on interpretable machine learning (Ribeiro et al., 2016, 2018) typically provides local (tuple-level) explanations, as opposed to global (dataset-level) explanations. While some approaches provide feature importance as a global explanation for model behavior (Casalicchio et al., 2018), they do not model feature interactions as possible explanations.

Software testing and debugging techniques (Gulzar et al., 2018; Attariyan and Flinn, 2011; Attariyan et al., 2012; Chen et al., 2002; Fraser and Arcuri, 2013; Godefroid et al., 2008; Holler et al., 2012; Johnson et al., 2020; Zheng et al., 2006; Liblit et al., 2005) are either application-specific, require user-defined test suites, or rely only on observational data. The key contrast between software debugging and our setting is that the former focuses on white-box programs: interventions, runtime conditions, program invariants, control-flow graphs, etc., all revolve around program source code and execution traces. Unlike programs, where lines have logical and semantic connections, tuples in data do not have similar associations. Data profiles also differ significantly from program profiles in their semantics and in their discovery and intervention techniques; thus, techniques for program profiling do not trivially apply here. We treat data as a first-class citizen in computational pipelines, while considering the program as a black box.

Contributions. In this paper, we make the following contributions:

  • We formalize the novel problem of identifying root causes (and fixes) of the disconnect between data and data-driven systems in terms of data profiles (and interventions). (Sec 2)

  • We design a set of data profiles that are common root causes of data-driven system malfunctions, and discuss their discovery and intervention techniques based on available technology. (Sec 3)

  • We design and develop a novel interventional approach to pinpoint causally verified root causes. The approach leverages a few properties of the data profiles to efficiently explore the space of candidate root causes with a small number of interventions. Additionally, we develop an efficient group-testing-based algorithm that further reduces the number of required interventions. (Sec 4)

  • We evaluate DataExposer on three real-world applications, where data profiles are responsible for causing system malfunction, and demonstrate that DataExposer successfully explains the root causes with a very small number of interventions. Furthermore, DataExposer requires fewer interventions when compared against two state-of-the-art techniques for root-cause analysis: BugDoc (Lourenço et al., 2020) and Anchors (Ribeiro et al., 2018). Through an experiment over synthetic pipelines, we further show that the number of interventions required by DataExposer increases sub-linearly with the number of discriminative profiles, thanks to our group-testing-based approach. (Sec 5)

2. Preliminaries & Problem Definition

In this section, we first formalize the notions of system malfunction, data profile, profile violation, and the transformation function used for intervention. We then define the explanation (cause and corresponding fix) of system malfunction and formulate the problem of data-profile-centric explanation of system malfunction.

Basic notations. We use R to denote a relation schema over attributes A = {A1, A2, …, Am}, where Ai denotes the i-th attribute. We use Dom(Ai) to denote the domain of attribute Ai. Then the set Dom(A1) × Dom(A2) × ⋯ × Dom(Am) specifies the domain of all possible tuples. A dataset D is a specific instance of the schema R. We use t ∈ D to denote a tuple in the schema R. We use t[Ai] to denote the value of the attribute Ai of the tuple t, and D[Ai] to denote the multiset of values all tuples in D take for attribute Ai.

2.1. Quantifying System Malfunction

To measure how much the system malfunctions over a dataset, we use the malfunction score.

Definition 1 (Malfunction score).

Let D be a dataset, and S be a system operating on D. The malfunction score m(S, D) ∈ [0, 1] is a real value that quantifies how much S malfunctions when operating on D.

The malfunction score m(S, D) = 0 indicates that S functions properly over D, and a higher value indicates a higher degree of malfunction, with m(S, D) = 1 indicating extreme malfunction. A threshold parameter τ defines the acceptable degree of malfunction and translates the continuous notion of malfunction to a Boolean value. If m(S, D) ≤ τ, then S is considered to pass with respect to D; otherwise, there exists a mismatch between S and D, whose cause (and fix) we aim to expose.

Example 2 (Malfunction score of a classifier).

For a binary classifier, its misclassification rate (additive inverse of accuracy) over a dataset can be used as a malfunction score. Given a dataset D of N tuples, if a classifier S makes correct predictions for N_c tuples in D, and incorrect predictions for the remaining N − N_c tuples, then S achieves accuracy N_c / N, and, thus, m(S, D) = 1 − N_c / N.
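As a concrete sketch (the function name and the toy labels are ours), the misclassification-rate score can be computed as:

```python
def malfunction_score(y_true, y_pred):
    """Misclassification rate: 0 means the classifier is perfect on the
    dataset, 1 means every prediction is wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# 8 of 10 predictions correct -> accuracy 0.8, malfunction score 0.2
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(malfunction_score(y_true, y_pred))  # 0.2
```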

Example 3 (Malfunction score of an unfair classifier).

In fair classification, we can use disparate impact (IBM, AIF 360), defined as the ratio between the fraction of tuples with favorable outcomes within the unprivileged group and that within the privileged group, to measure malfunction.
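A minimal illustration of disparate impact on toy data (the group labels and the helper name are ours). Note that, unlike the malfunction score, lower values here indicate more bias:

```python
def disparate_impact(groups, favorable):
    """Ratio of favorable-outcome rates: unprivileged over privileged.
    Values far below 1 signal bias; 0.8 is a common rule-of-thumb floor."""
    def rate(group):
        members = [f for g, f in zip(groups, favorable) if g == group]
        return sum(members) / len(members)
    return rate("unprivileged") / rate("privileged")

groups    = ["privileged"] * 4 + ["unprivileged"] * 4
favorable = [1, 1, 1, 0,          1, 0, 0, 0]
print(disparate_impact(groups, favorable))  # 0.25 / 0.75 = 1/3
```

To fit the malfunction-score convention (higher is worse), one could, for instance, use 1 minus this ratio, capped at 0.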

2.2. Profile-Violation-Transformation (PVT)

Once we detect the existence of a mismatch, the next step is to investigate its cause. We characterize the issues in a dataset that are responsible for the mismatch between the dataset and the system using data profiles. The structure or schema of data profiles is given by profile templates, which contain holes for parameters. Parameterizing a profile template gives us a concretization of the corresponding profile (P). Given a dataset D, we use existing data-profiling techniques to find parameter values to obtain concretized data profiles, such that D satisfies the concretized profiles. To evaluate how much a dataset satisfies or violates a data profile, we need a corresponding violation function (V). Violation functions provide the semantics of the data profiles. Finally, to alter a dataset D with respect to a data profile and the corresponding violation function, we need a transformation function (T). Transformation functions provide the intervention mechanism to alter data profiles of a dataset and suggest fixes to remove the cause of malfunction. DataExposer requires the following three components over the schema ⟨Profile, Violation function, Transformation function⟩, PVT in short:

  1. P: a (concretized) profile along with its parameters, which follows the schema ⟨profile type, parameters⟩.

  2. V: a violation function V(D, P) that computes how much the dataset D violates the profile P and returns a violation score.

  3. T: a transformation function T(D, P, V) that transforms the dataset D to another dataset D′ such that D′ no longer violates the profile P with respect to the violation function V. (When clear from the context, we omit the parameters P and V and write T(D) for transformation functions.)

For a PVT triplet X = ⟨P, V, T⟩, we define X.P as its profile, X.V as its violation function, and X.T as its transformation function. We provide examples and additional discussions on data profiles, violation functions, and transformation functions in Section 3.
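The PVT schema can be sketched as a small data structure. The encoding below—a dataclass holding the profile and two callables, instantiated with a numeric-range Domain profile and a winsorizing transformation—is purely illustrative, not DataExposer's internal representation:

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Illustrative encoding of a PVT triplet; all names are ours.
@dataclass
class PVT:
    profile: Tuple[str, Any]                   # (profile type, parameters)
    violation: Callable[[List, Tuple], float]  # V(D, P) -> violation score
    transform: Callable[[List, Tuple], List]   # T(D), with V(T(D), P) < V(D, P)

# A concrete triplet: numeric values of a column must lie in [lo, hi].
def range_violation(D, P):
    _, (lo, hi) = P
    return sum(1 for v in D if not lo <= v <= hi) / len(D)

def winsorize(D, P):
    _, (lo, hi) = P
    return [min(max(v, lo), hi) for v in D]

pvt = PVT(("Domain", (0, 100)), range_violation, winsorize)
D = [5, 50, 150, -10]
print(pvt.violation(D, pvt.profile))              # fraction of out-of-range values
fixed = pvt.transform(D, pvt.profile)
print(pvt.violation(fixed, pvt.profile))          # 0.0 after the transformation
```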

2.2.1. Data Profile

Intuitively, data profiles encode dataset characteristics. They can refer to a single attribute (e.g., mean of an attribute) or multiple attributes (e.g., correlation between a pair of attributes, functional dependencies, etc.).

Definition 4 (Data Profile).

Given a dataset D, a data profile P denotes properties or constraints that the tuples in D (collectively) satisfy.

2.2.2. Profile Violation Function

To quantify the degree of violation a dataset incurs with respect to a data profile, we use a profile violation function that returns a numerical violation score.

Definition 5 (Profile violation function).

Given a dataset D and a data profile P, a profile violation function V(D, P) returns a real value that quantifies how much D violates P.

V(D, P) = 0 implies that D fully complies with P (does not violate it at all). In contrast, V(D, P) > 0 implies that D violates P. The higher the value of V(D, P), the higher the profile violation.

Figure 1 lists the PVT triplets we consider; each row gives the profile (with the data type it supports), how it is discovered over the passing dataset, its interpretation, and the corresponding transformation function(s).

Row 1. Domain (categorical attribute). Interpretation: values are drawn from a specific domain. Transformation: map values outside the domain to values in the domain using domain knowledge.

Row 2. Domain (numerical attribute). Interpretation: values lie within a bound. Transformations: (1) use a monotonic linear transformation and transform all values; (2) use winsorization techniques to replace the violating values only.

Row 3. Domain (textual attribute). Discovery: a regular expression over the attribute, learned via pattern discovery (Python, Rexpy package). Interpretation: values satisfy a regular expression, or lengths of values lie within a bound. Transformation: minimally alter the data to satisfy the regular expression; for example, insert (remove) characters to increase (reduce) text length.

Row 4. Outlier. Discovery: an outlier detection function learned from the attribute’s distribution (Hellerstein, 2008). Interpretation: the fraction of outliers within an attribute does not exceed a threshold. Transformations: (1) replace outliers with the expected value (mean, median, mode) of the attribute; (2) map all values above (below) the maximum (minimum) limit to the highest (lowest) valid value.

Row 5. Missing. Interpretation: the fraction of missing values within an attribute does not exceed a threshold. Transformation: use missing-value imputation techniques.

Row 6 (thresholded by data coverage). Interpretation: the fraction of tuples satisfying a given constraint (selection predicate) does not exceed a threshold. Transformation: undersample tuples that satisfy the predicate.

Row 7. Indep (Chi-squared). Interpretation: the Chi-squared statistic between a pair of attributes is below a threshold, with a p-value of 0.05. Transformation: modify attribute values to remove/reduce dependence.

Row 8. Indep (Pearson). Interpretation: the Pearson correlation coefficient (PCC) between a pair of attributes is below a threshold, with a p-value of 0.05. Transformation: add noise to remove/reduce dependence between attributes.

Row 9. Indep (causal). Discovery: learn a causal graph and causal coefficients using TETRAD (Scheines et al., 1998). Interpretation: a causal relationship between a pair of attributes is unlikely (causal coefficient below a parameter threshold). Transformation: change the data distribution to modify the causal relationship.

Figure 1. A list of PVT triplets that we consider in this paper, their syntax, and semantics.

2.2.3. Transformation Function

In our work, we assume knowledge of a passing dataset for which the system functions properly, and a failing dataset for which the system malfunctions. Our goal is to identify which profiles of the failing dataset caused the malfunction. We seek an answer to the question: how do we “fix” the issues within the failing dataset such that the system no longer malfunctions on it (i.e., the mismatch is resolved)? To this end, we apply interventional causal reasoning: we intervene on the failing dataset by altering its attributes such that the profile of the altered dataset matches the corresponding correct profile of the passing dataset. To perform intervention, we need transformation functions with the property that they push the failing dataset “closer” to the passing dataset in terms of the profile that we are interested in altering. More formally, after the transformation, the profile violation score should decrease.

Definition 6 (Transformation function).

Given a dataset D, a data profile P, and a violation function V, a transformation function T alters D to produce D′ = T(D) such that V(D′, P) < V(D, P).

A dataset can be transformed by applying a series of transformation functions, for which we use the composition operator (∘).

Definition 7 (Composition of transformations).

Given a dataset D, and two PVT triplets ⟨P1, V1, T1⟩ and ⟨P2, V2, T2⟩, the composition (T2 ∘ T1)(D) = T2(T1(D)).
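A sketch of transformation composition over a single column; the two transformations (mean imputation and clamping) and the compose helper are illustrative choices of ours:

```python
# Two illustrative transformations, composed left-to-right as in T2 ∘ T1.
def t1_impute(D):
    """Replace missing (None) values with the mean of the present ones."""
    present = [v for v in D if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in D]

def t2_clip(D, lo=0, hi=100):
    """Clamp values into [lo, hi]."""
    return [min(max(v, lo), hi) for v in D]

def compose(*ts):
    """Apply the given transformations in order: compose(t1, t2) = t2 ∘ t1."""
    def apply(D):
        for t in ts:
            D = t(D)
        return D
    return apply

T = compose(t1_impute, t2_clip)   # (t2 ∘ t1)(D) = t2(t1(D))
print(T([None, 50, 150]))         # impute mean 100, then clip -> [100, 50, 100]
```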

2.3. Problem Definition

We expose a set of PVT triplets for explaining the system malfunction. The explanation contains both the cause and the corresponding fix: the profile within a PVT triplet indicates the cause of system malfunction, and the corresponding transformation function suggests the fix.

Definition 8 (Explanation of system malfunction).


Given

  1. a system S with a mechanism to compute the malfunction score m(S, ·),

  2. an allowable malfunction threshold τ,

  3. a passing dataset D_pass for which m(S, D_pass) ≤ τ,

  4. a failing dataset D_fail for which m(S, D_fail) > τ, and

  5. a set 𝒳 of candidate PVT triplets such that D_pass satisfies the profile of each triplet in 𝒳,

the explanation of the malfunction of S for D_fail, but not for D_pass, is a set of PVT triplets E ⊆ 𝒳 such that m(S, T_E(D_fail)) ≤ τ, where T_E denotes the composition of the transformation functions of the triplets in E.

Informally, E explains the cause: why S malfunctions for D_fail, but not for D_pass. More specifically, failing to satisfy the profiles of the PVT triplets in E is the cause of malfunction. Furthermore, the transformation functions of the PVT triplets in E suggest the fix: how we can repair D_fail to eliminate the system malfunction. However, there could be many possible such sets E, and we seek a minimal set such that the transformation of every triplet in E is necessary to bring the malfunction score below the threshold τ.

Definition 9 (Minimal explanation of system malfunction).

Given a system S that malfunctions for D_fail and an allowable malfunction threshold τ, an explanation E of S’s malfunction for D_fail is minimal if no proper subset of E is also an explanation.

Note that there could be multiple such minimal explanations and we seek any one of them, as any minimal explanation exposes the causes of mismatch and suggests minimal fixes.

Problem 10 (Discovering explanation of mismatch between data and system).

Given a system S that malfunctions for D_fail but functions properly for D_pass, the problem of discovering the explanation of mismatch between data and system is to find a minimal explanation E that captures (1) the cause why S malfunctions for D_fail but not for D_pass, and (2) how to repair D_fail to remove the malfunction.
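The problem statement can be illustrated with a brute-force search over subsets of candidate transformations. DataExposer itself avoids such exhaustive intervention (it needs only a small number of interventions, per Section 5), so this sketch—with a toy system, toy transformations, and names of our own—only clarifies what a minimal explanation is:

```python
from itertools import combinations

def minimal_explanation(D_fail, transforms, malfunction, tau):
    """Smallest set of named transformations whose composition brings the
    malfunction score to at most tau; exhaustive, for illustration only."""
    names = list(transforms)
    for k in range(1, len(names) + 1):          # smallest subsets first
        for subset in combinations(names, k):
            D = D_fail
            for name in subset:
                D = transforms[name](D)
            if malfunction(D) <= tau:
                return set(subset)
    return None

# Toy "system": it malfunctions when the data contains values above 10.
malfunction = lambda D: sum(1 for v in D if v > 10) / len(D)
transforms = {
    "clip":    lambda D: [min(v, 10) for v in D],
    "shuffle": lambda D: list(reversed(D)),     # an irrelevant intervention
}
print(minimal_explanation([3, 20, 30, 4], transforms, malfunction, tau=0.0))
# -> {"clip"}: clipping alone suffices, so the irrelevant profile is excluded
```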

3. Data Profiles, Violation Functions, & Transformation Functions

We now provide an overview of the data profiles we consider, how we discover them, how we compute the violation scores for a dataset w.r.t. a data profile, and how we apply transformation functions to alter profiles of a dataset. While a multitude of data-profiling primitives exist in the literature, we consider a carefully chosen subset of them that are particularly suitable for modeling issues in data that commonly cause malfunction or failure of a system. We focus on profiles that, by design, can better “discriminate” a pair of datasets as opposed to “generative” profiles (e.g., data distribution) that can profile the data better, but nonetheless are less useful for the task of discriminating between two datasets. However, the DataExposer framework is generic, and other profiles can be plugged into it.

As discussed in Section 2, a PVT triplet encapsulates a profile and the corresponding violation and transformation functions. Figure 1 provides a list of profiles along with the data types they support, how to learn their parameters from a given dataset, how to interpret them intuitively, and the corresponding violation and transformation functions. In this work, we assume that a profile can be associated with multiple transformation functions (e.g., rows 2 and 4), but each transformation function can be associated with at most one profile. This assumption helps us blame a unique profile as the cause of the system malfunction when at least one of the transformation functions associated with that profile is verified to be a fix.

PVT triplets can be classified in different ways. Based on the strictness of the violation function, they can be classified as follows:

  • Strict: All tuples are expected to satisfy the profile (rows 1–3).

  • Thresholded by data coverage: A certain fraction of the data tuples is allowed to violate the profile (rows 4–6).

  • Thresholded by a parameter: Some degree of violation is allowed with respect to a specific parameter (rows 7–9).
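For concreteness, the strict and coverage-thresholded violation classes above can be sketched as small scoring functions over a column of values (an illustrative sketch of ours, not DataExposer's implementation; the function names are hypothetical):

```python
# Illustrative sketch: two violation-strictness classes as scoring functions.

def strict_violation(values, satisfies):
    """Strict: every tuple must satisfy the profile; the score is the
    fraction of violating tuples (0 means the profile holds)."""
    bad = sum(1 for v in values if not satisfies(v))
    return bad / len(values)

def coverage_violation(values, satisfies, tau):
    """Thresholded by data coverage: a fraction tau of tuples may violate
    the profile; the score is the normalized excess over tau."""
    frac_bad = sum(1 for v in values if not satisfies(v)) / len(values)
    return 0.0 if frac_bad <= tau else (frac_bad - tau) / (1 - tau)

ages = [45, 40, 60, 22, 41, 32, 25, 35, 25, 20]
print(strict_violation(ages, lambda v: v < 50))         # 0.1: one tuple violates
print(coverage_violation(ages, lambda v: v < 50, 0.2))  # 0.0: within allowance
```

A parameter-thresholded violation function is analogous, except the excess is measured against a profile parameter (e.g., a dependency threshold) rather than a tuple fraction.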

Further, PVT triplets can be classified in two categories based on the nature of the transformation functions:

  • Local transformation functions can transform a tuple in isolation without the knowledge of how other tuples are being transformed (e.g., rows 1–3). Some local transformation functions only transform the violating tuples (e.g., row 2, transformation (2)), while others transform all values (e.g., row 2, transformation (1)). For instance, in case of unit mismatch (kilograms vs. lbs), it is desirable to transform all values and not just the violating ones.

  • Global transformation functions are holistic, as they need the knowledge of how other tuples are being transformed while transforming a tuple (e.g., rows 6 and 9).

Example 1 (Domain).

Domain requires two parameters: (1) an attribute A, and (2) a set V specifying its domain. A dataset D satisfies Domain(A, V) if every value of A in D belongs to V. The profile is minimal w.r.t. D if there is no proper subset V' ⊂ V s.t. D satisfies the profile Domain(A, V'). The technique for discovering a domain varies depending on the data type of the attribute. Rows 1–3 of Figure 1 show three different domain-discovery techniques for different data types.

D_fail (Figure 2) satisfies Domain(gender, {M, F}), as all tuples draw values from {M, F} for the attribute gender. Our case studies of Sentiment Prediction and Cardiovascular Disease Prediction show the application of the profile Domain (Section 5).
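The Domain PVT triplet can be sketched end-to-end as follows (our illustration, not the paper's code; `learn_domain`, `domain_violation`, and `domain_transform` are hypothetical names for the profile-discovery, violation, and transformation functions):

```python
# Illustrative sketch of the Domain PVT (profile, violation, transformation).

def learn_domain(values):
    """Profile discovery: the domain of a categorical attribute is its value set."""
    return set(values)

def domain_violation(values, domain):
    """Violation: fraction of tuples whose value falls outside the learned domain."""
    return sum(1 for v in values if v not in domain) / len(values)

def domain_transform(values, mapping):
    """Transformation: map out-of-domain values into the expected domain."""
    return [mapping.get(v, v) for v in values]

gender_pass = ["M", "F", "F", "M"]
dom = learn_domain(gender_pass)            # {"M", "F"}
gender_fail = ["M", "female", "F", "male"]
print(domain_violation(gender_fail, dom))  # 0.5
fixed = domain_transform(gender_fail, {"female": "F", "male": "M"})
print(domain_violation(fixed, dom))        # 0.0
```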

Example 2 (Outlier).

Outlier requires three parameters: (1) an attribute A, (2) an outlier detection function f that returns 1 if a value of A is an outlier w.r.t. the values within A, and 0 otherwise, and (3) a threshold τ. A dataset D satisfies Outlier(A, f, τ) if the fraction of outliers within the attribute A—according to f—does not exceed τ. Otherwise, we compute how much the fraction of outliers exceeds the allowable fraction of outliers (τ) and then normalize it by dividing by 1 − τ. The profile is minimal if there is no τ' < τ s.t. D satisfies Outlier(A, f, τ').

Consider an outlier detection function f that identifies values more than two standard deviations away from the mean as outliers. In D_fail, age has a mean of 34.5 and a standard deviation of about 11.8. According to f, only the tuple with age 60—which is a 0.1 fraction of the tuples—is an outlier in terms of age, as 60 > 34.5 + 2 × 11.8. Therefore, D_fail satisfies Outlier(age, f, τ) for any τ ≥ 0.1.
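Under the assumptions of this example (a k-standard-deviations detector; k and τ are parameters we pick for illustration, not values fixed by the paper), the Outlier violation score can be sketched as:

```python
# Illustrative sketch of the Outlier violation score with a k-sigma detector.
import statistics

def outlier_violation(values, tau, k=2.0):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    frac = sum(1 for v in values if abs(v - mu) > k * sigma) / len(values)
    # Normalized excess over the allowed fraction tau (0 if within allowance).
    return 0.0 if frac <= tau else (frac - tau) / (1 - tau)

ages = [45, 40, 60, 22, 41, 32, 25, 35, 25, 20]  # ages from Figure 2
print(outlier_violation(ages, tau=0.1))  # 0.0: only age 60 (a 0.1 fraction) is flagged
```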

Example 3 (Indep).

Indep requires three parameters: two attributes A1 and A2, and a real value τ. A dataset D satisfies the profile if the dependency between A1 and A2 does not exceed τ. Different techniques exist to quantify the dependency, and rows 7–9 of Figure 1 show three different ways to model dependency, where the first two are correlational and the last one is causal.

Indep(race, high_expenditure, τ) is satisfied by D_pass using the PVT triplet of row 7, as the χ²-statistic between race and high_expenditure over D_pass does not exceed τ. We show the application of the profile Indep in our case study involving the task of Income Prediction in Section 5.
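The row-7 dependency measure can be sketched as a plain χ²-statistic computed from the contingency table of the two attributes (an illustrative sketch of ours; the race and high_expenditure columns below are transcribed from Figure 2):

```python
# Sketch: chi-squared statistic between two categorical attributes.
from collections import Counter

def chi2_statistic(xs, ys):
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    stat = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

race = ["A", "A", "A", "W", "W", "W", "W", "W", "W", "W"]
high = ["no", "no", "no", "yes", "yes", "no", "yes", "yes", "yes", "yes"]
print(chi2_statistic(race, high))  # ≈ 6.43: race and high_expenditure are dependent
```

A large statistic (relative to the χ² critical value for the table's degrees of freedom) indicates dependency, i.e., a violation of Indep for small τ.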

While the profiles in Figure 1 are defined over the entire data, analogous to conditional functional dependency (Fan et al., 2009), an extension to consider is conditional profiles, where only a subset of the data is required to satisfy the profiles.

id name gender age race zip code phone high expenditure
Shanice Johnson F 45 A 01004 2088556597 no
DeShawn Bad M 40 A 01004 2085374523 no
Malik Ayer M 60 A 01005 2766465009 no
Dustin Jenner M 22 W 01009 7874891021 yes
Julietta Brown F 41 W 01009 yes
Molly Beasley F 32 W 7872899033 no
Jake Bloom M 25 W 01101 4047747803 yes
Luke Stonewald M 35 W 01101 4042127741 yes
Scott Nossenson M 25 W 01101 yes
Gabe Erwin M 20 W 4048421581 yes
Figure 2. A sample dataset with 10 entities. A logistic regression classifier trained over this dataset discriminates against African Americans (race = ‘A’) and women (gender = ‘F’) (Example 1).
id name gender age race zip code phone high expenditure
Darin Brust M 25 W 01004 2088556597 no
Rosalie Bad F 22 W 01005 no
Kristine Hilyard F 50 W 01004 2766465009 yes
Chloe Ayer F 22 A 7874891021 yes
Julietta Mchugh F 51 W 01009 9042899033 yes
Doria Ely F 32 A 01101 yes
Kristan Whidden F 25 W 01101 4047747803 no
Rene Strelow M 35 W 01101 6162127741 yes
Arial Brent M 45 W 01102 4089065769 yes
Figure 3. A sample dataset with 9 entities. A logistic regression classifier trained over this dataset does not discriminate against any specific race or gender, and, thus, is fair (Example 1).

4. Intervention Algorithms

We now describe our intervention algorithms to explain the mismatch between a dataset and a system malfunctioning on that dataset. Our algorithms consider a failing and a passing dataset as input and report a collection of PVT triplets (or simply PVTs) as the explanation (cause and fix) of the observed mismatch. To this end, we first identify a set of discriminative PVTs—whose profiles take different values in the failing and passing datasets—as potential explanation units, and then intervene on the failing dataset to alter the profiles and observe the change in system malfunction. We develop two approaches that differ in the number of PVTs considered simultaneously during an intervention. DataExposerGRD is a greedy approach that considers only one PVT at a time. However, in the worst case, the number of interventions required by DataExposerGRD is linear in the number of discriminative PVTs. Therefore, we propose a second algorithm, DataExposerGT, built on the group-testing paradigm, that considers multiple PVTs per intervention to reduce the number of interventions, which then grows only logarithmically with the number of discriminative PVTs. We start with an example scenario to demonstrate how DataExposerGRD works and then proceed to describe our algorithms.

4.1. Example Scenario

Consider the task of predicting the attribute high_expenditure to determine if a customer should get a discount (Example 1). The system calculates bias of the trained classifier against the unprivileged groups (measured using disparate impact (IBM, AIF 360)) as its malfunction score. We seek the causes of mismatch between this prediction pipeline and D_fail (Figure 2), for which the pipeline fails with a malfunction score above the acceptable threshold θ. We assume knowledge of D_pass (Figure 3), for which the malfunction score is below θ. The goal is to identify a minimal set of PVTs whose transformation functions bring the malfunction score of D_fail below θ.

(Step 1) The first goal is to identify the profiles whose parameters differ between D_fail and D_pass. To do so, DataExposerGRD identifies the exhaustive set of PVTs over D_fail and D_pass and discards the identical ones (PVTs with identical profile-parameter values). We call the PVTs of the passing dataset whose profile-parameter values differ over the failing dataset discriminative PVTs. Figure 5 lists a few profiles of the discriminative PVTs w.r.t. D_fail and D_pass.

(Step 2) Next, DataExposerGRD ranks the set of discriminative PVTs based on their likelihood of offering an explanation of the malfunction. Our intuition here is that if an attribute A is related to the malfunction, then many PVTs containing A in their profiles would differ between D_fail and D_pass. Additionally, altering A with respect to one PVT is likely to automatically “fix” other PVTs associated with A. (Altering values of A w.r.t. a PVT may also increase violation w.r.t. some other PVTs. However, for ease of exposition, we omit such issues in this example and provide a detailed discussion in the appendix.) Based on this intuition, DataExposerGRD constructs a bipartite graph, called the PVT-attribute graph, with discriminative PVTs on one side and data attributes on the other side (Figure 4). In this graph, a PVT is connected to an attribute A if its profile is defined over A. The degree of an attribute captures the number of discriminative PVTs associated with it. During intervention, DataExposerGRD prioritizes PVTs associated with high-degree attributes. For instance, since high_expenditure has the highest degree in Figure 4, PVTs associated with it are considered for intervention before others.
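The degree-based prioritization of Step 2 can be sketched in a few lines (our illustration; the PVT names below are representative of the running example, not a verbatim copy of Figure 5):

```python
# Sketch of Step 2: build the PVT-attribute bipartite graph as an adjacency
# list and rank attributes by their degree.
from collections import defaultdict

# Each discriminative PVT is listed with the attributes its profile is defined over.
pvts = {
    "Indep(high_expenditure, race)": ["high_expenditure", "race"],
    "Selectivity(high_expenditure)": ["high_expenditure"],
    "Missing(phone)": ["phone"],
}

degree = defaultdict(int)
for attrs in pvts.values():
    for a in attrs:
        degree[a] += 1

# Intervene first on PVTs touching the highest-degree attribute.
top_attr = max(degree, key=degree.get)
candidates = [p for p, attrs in pvts.items() if top_attr in attrs]
print(top_attr)  # high_expenditure
```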

(Step 3) DataExposerGRD further ranks the subset of the discriminative PVTs that are connected to the highest-degree attributes in the PVT-attribute graph based on their benefit score. The benefit score of a PVT P encodes the likelihood of reducing system malfunction when the failing dataset is altered using P's transformation. It is estimated from (1) the violation score that the failing dataset incurs w.r.t. P, and (2) the number of tuples in the failing dataset that are altered by P's transformation. For example, to break the dependence between high_expenditure and race, the transformation corresponding to Indep modifies five tuples in D_fail by perturbing (adding noise to) high_expenditure. In contrast, the transformation for Missing needs to change only one value. Since more tuples are affected by the former, it has a higher likelihood of reducing the malfunction score. The intuition is that the more tuples a transformation alters in the failing dataset, the more likely it is to reduce the malfunction score. This holds particularly in applications where the system optimizes aggregate statistics such as accuracy, recall, or F-score.

Figure 4. PVT-attribute graph. The attribute high_expenditure is associated with two discriminative PVTs. For ease of exposition, we only show the profile within a PVT to denote the entire PVT.
Figure 5. A list of PVTs that discriminate D_pass (Figure 3) and D_fail (Figure 2) based on the scenario of Example 1. We omit the violation and transformation functions for ease of exposition.

(Step 4) DataExposerGRD starts intervening on D_fail using the transformation of the PVT corresponding to the profile Indep, as its transformation offers the most likely fix. Then, it evaluates the malfunction of the system over the altered version of D_fail. Breaking the dependence between high_expenditure and race helps reduce bias in the trained classifier, and, thus, we observe a reduced malfunction score w.r.t. the altered dataset. This exposes the first explanation of malfunction.

(Step 5) DataExposerGRD then removes the processed PVT (Indep) from the PVT-attribute graph, updates the graph according to the altered dataset, and re-iterates steps 2–4. Now the PVT corresponding to the profile Selectivity is considered for intervention as it has the highest benefit score. To intervene, DataExposerGRD over-samples tuples corresponding to female customers; this time, it intervenes on the transformed dataset obtained from the previous step. After this transformation, the bias of the learned classifier further reduces and the malfunction score falls below the required threshold. Therefore, with these two interventions, DataExposerGRD exposes two issues that caused the undesirable behavior of the prediction model trained on D_fail.

(Step 6) DataExposerGRD identifies an initial explanation over two PVTs: Indep and Selectivity. However, to verify whether it is minimal, DataExposerGRD tries to drop one PVT at a time to obtain a proper subset of the initial explanation that is also an explanation. This procedure guarantees that the explanation only consists of PVTs that are necessary, and, thus, is minimal. In this case, both Indep and Selectivity are necessary, and, thus, both are part of the minimal explanation. DataExposerGRD finally reports these two PVTs as a minimal explanation of the malfunction, where failure to satisfy the profiles is the cause and the transformations indicate the fix (violation and transformation functions are omitted).
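The minimality check of Step 6 can be sketched as a drop-one loop (an illustrative sketch with stand-in names; the toy malfunction function and `apply_pvts` are ours, not the paper's):

```python
# Sketch of Step 6: drop each PVT from the initial explanation and keep the
# drop if the malfunction score stays below the threshold theta.

def make_minimal(explanation, d_fail, apply_pvts, malfunction, theta):
    minimal = list(explanation)
    for pvt in list(explanation):
        trial = [p for p in minimal if p != pvt]
        if malfunction(apply_pvts(d_fail, trial)) < theta:
            minimal = trial  # pvt is unnecessary; drop it
    return minimal

# Toy setup: malfunction is the fraction of "bad" (1) values; only the
# hypothetical 'fix' PVT actually repairs the data.
mal = lambda d: sum(d) / len(d)
apply_pvts = lambda d, ps: [0 if "fix" in ps else v for v in d]
print(make_minimal(["fix", "extra"], [1, 1, 0, 0], apply_pvts, mal, 0.25))
# ['fix']
```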

4.2. Assumptions and Observations

We now proceed to describe our intervention algorithms more formally. We first state our assumptions and then proceed to present our observations that lead to the development of our algorithms.


DataExposer makes the following assumptions:

(A1) The ground-truth explanation of malfunction is captured by at least one of the discriminative PVTs. This assumption is prevalent in software-debugging literature where program predicates are assumed to be expressive enough to capture the root causes (Fariha et al., 2020; Liblit et al., 2005).

(A2) If the fix corresponds to a composition of transformations, then the malfunction score achieved after applying the composition of transformations is less than the malfunction score achieved after applying any of the constituents, and all these scores are less than the malfunction score of the original dataset. For example, consider two discriminative PVTs P1 and P2 with transformations T1 and T2, and a failing dataset D_fail with malfunction score m(D_fail). Our assumption is that if {P1, P2} corresponds to a minimal explanation, then m(T2(T1(D_fail))) < m(T1(D_fail)) < m(D_fail) and m(T2(T1(D_fail))) < m(T2(D_fail)) < m(D_fail). Intuitively, this assumption states that T1 and T2 have consistent (independent) effects on reducing the malfunction score, regardless of whether they are applied together or individually. If this assumption does not hold, DataExposer can still work with additional knowledge about multiple failing and passing datasets. More details are in the appendix.


We make the following observations:

(O1) If the ground-truth explanation of malfunction corresponds to an attribute, then multiple PVTs that involve the same attribute are likely to differ across the passing and failing datasets. This observation motivates us to prioritize interventions based on PVTs that are associated with high-degree attributes in the PVT-attribute graph. Additionally, intervening on the data based on one such PVT is likely to result in an automatic “fix” of other PVTs connected via the high-degree attribute. For example, adding noise to high_expenditure in Example 1 breaks its dependence with not only race but also with other attributes.

(O2) The PVT for which the failing dataset incurs higher violation score is more likely to be a potential explanation of malfunction.

(O3) A transformation function that affects a large number of data tuples is likely to result in a higher reduction in the malfunction score, after the transformation is applied.

PVT-attribute graph. DataExposer leverages observation O1 by constructing a bipartite graph G, called the PVT-attribute graph, with all attributes as nodes on one side and all discriminative PVTs on the other side. An attribute A is connected to a PVT P if and only if P has A as one of its parameters. E.g., Figure 4 shows the PVT-attribute graph w.r.t. D_fail and D_pass (Example 1). In this graph, the PVT corresponding to Indep is connected to two attributes, race and high_expenditure. Intuitively, this graph captures the dependence relationship between PVTs and attributes, where an intervention with respect to a PVT modifies an attribute connected to it. If this intervention reduces the malfunction score, then it could possibly fix other PVTs that are connected to the same attribute.

Input: Failing dataset D_fail, passing dataset D_pass, malfunction score threshold θ
Output: A minimal explanation set of PVTs E
1. Compute the PVTs of D_pass and D_fail and identify the common PVTs
2. C ← discriminative PVTs (the remaining, non-common PVTs)
3. E ← ∅ /* Initialize minimal explanation set to be empty */
4. D ← D_fail /* Initialize dataset to the failing dataset */
5. while the malfunction score of D is at least θ do
6.     C_h ← PVTs adjacent to high-degree attributes in the PVT-attribute graph
7.     P ← the highest-benefit PVT in C_h
8.     Δ ← reduction in malfunction score under P's transformation /* Malfunction reduction */
9.     C ← C \ {P} /* Remove P from the candidates */
10.    if Δ > 0 then /* P reduces malfunction */
11.        D ← the transformed dataset /* Apply transformation */
12.        Update the PVT-attribute graph and the benefit scores
13.        E ← E ∪ {P} /* Add P to the explanation set */
14. E = Make-Minimal(E) /* Obtain minimality of E */
15. return E /* E is a minimal explanation */
Algorithm 1 DataExposerGRD (greedy)

Benefit score calculation. DataExposer uses the aforementioned observations to compute a benefit score for each PVT that models its likelihood of reducing system malfunction if the corresponding transformation is used to modify the failing dataset D_fail. Intuitively, it assigns a high score to a PVT that has a high violation score (O2) and whose transformation function modifies a large number of tuples in the dataset (O3). Formally, the benefit score of a PVT is defined as the product of the violation score of D_fail w.r.t. the PVT and the “coverage” of its transformation, i.e., the fraction of tuples that the transformation modifies. Note that the benefit calculation acts as a proxy for the likelihood of a PVT to offer an explanation, without actually applying any intervention.
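The product definition above amounts to a one-liner; the sketch below uses hypothetical numbers echoing the running example (the violation score 0.8 is ours, chosen for illustration):

```python
# Sketch of the benefit score: violation score times transformation coverage.

def benefit(violation_score, n_modified, n_total):
    coverage = n_modified / n_total  # fraction of tuples the transformation alters
    return violation_score * coverage

# Hypothetical numbers: the Indep transformation perturbs 5 of 10 tuples,
# the Missing transformation touches only 1 of 10.
print(benefit(0.8, 5, 10))           # 0.4
print(round(benefit(0.8, 1, 10), 3)) # 0.08
```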

4.3. Greedy Approach

Algorithm 1 presents the pseudocode of our greedy technique DataExposerGRD, which takes a passing dataset and a failing dataset as input and returns the set of PVTs that corresponds to a minimal explanation of system malfunction.

We walk through Algorithm 1 step by step:

  • Identify the two sets of PVTs satisfied by D_pass and D_fail, respectively.

  • Discard the common PVTs and consider the remaining discriminative ones as candidates for a potential explanation of system malfunction.

  • Compute the PVT-attribute graph, where the candidate PVTs correspond to nodes on one side and the data attributes correspond to nodes on the other side.

  • Calculate the benefit score of each discriminative PVT w.r.t. D_fail. This procedure relies on the violation score computed by the violation function of the PVT and the coverage of the corresponding transformation function over D_fail.

  • Initialize the solution set E to ∅ and the dataset to intervene on to the failing dataset D_fail. In subsequent steps, E converges to a minimal explanation set and the dataset is transformed into one for which the system passes.

  • Iterate over the candidate PVTs until the dataset (which is being transformed iteratively) incurs an acceptable malfunction score (less than the allowable threshold θ).

  • In each iteration, identify the subset of PVTs that are adjacent to at least one of the highest-degree attributes in the current PVT-attribute graph (Observation O1), and choose the PVT with the maximum benefit among them.

  • Calculate the reduction in malfunction score if the dataset is transformed according to the chosen PVT's transformation, and remove the PVT from the candidates as it has been explored.

  • If the malfunction score reduces, add the PVT to the solution set E, replace the dataset with its transformed version, and update the PVT-attribute graph and the benefit of each PVT. The update procedure recalculates the benefit scores of all PVTs that are connected to the attributes adjacent to the chosen PVT.

  • Finally, post-process the set E to identify a minimal subset that keeps the malfunction score below the threshold θ. This procedure iteratively removes one PVT at a time (say P) from E and recalculates the malfunction score over the failing dataset transformed according to the transformation functions of the PVTs in E \ {P}. If the transformed dataset incurs a malfunction score less than θ, then E is replaced with E \ {P}.

4.4. Group-testing-based Approach

We now present our second algorithm DataExposerGT, which performs group interventions to identify the minimal explanation that exposes the mismatch between a dataset and a system. The group intervention methodology is applicable under the following assumption, along with assumptions A1 and A2 (Section 4.2).

(A3) The malfunction score incurred after applying a composition of transformations is less than the malfunction score incurred by the original dataset if and only if at least one of the constituent transformations reduces the malfunction score. For two PVTs P1 and P2 with transformations T1 and T2, m(T2(T1(D))) < m(D) iff m(T1(D)) < m(D) or m(T2(D)) < m(D). Note that this assumption is crucial to consider group interventions and is prevalent in the group-testing literature (Du et al., 2000).

DataExposerGT follows the classical adaptive group testing (GT) paradigm (Du et al., 2000) for interventions. To this end, it iteratively partitions the set of discriminative PVTs into two “almost” equal subsets (when the number of discriminative PVTs is odd, the sizes of the two partitions differ by one). During each iteration, all PVTs in a partition are considered for intervention together (group intervention) to evaluate the change in malfunction score. If a partition does not help reduce the malfunction score, all PVTs within that partition are discarded. While traditional GT techniques (Du et al., 2000) would use a random partitioning of the PVTs, DataExposerGT can leverage the dependencies among PVTs (inferred from the PVT-attribute graph) to achieve more effective partitioning. Intuitively, it is beneficial to assign all PVTs whose transformations operate on the same attribute to the same partition, which is likely to enable aggressive pruning of spurious PVTs that do not reduce malfunction.

DataExposerGT captures the dependencies among PVTs by constructing a PVT-dependency graph G_P. Two PVTs are connected by an edge in G_P if they are connected via some attribute in the PVT-attribute graph G; G_P is equivalent to the transitive closure of G, restricted to the PVT nodes (excluding the attribute nodes). This ensures that PVTs that are associated via some attribute in G are connected in G_P. DataExposerGT partitions G_P such that the number of connections (edges) between PVTs that fall in different partitions is minimized. More formally, we aim to construct two “almost” equal-sized partitions of G_P such that the number of edges between PVTs from different partitions is minimized, which maps to the problem of finding the minimum bisection of a graph (Garey et al., 1974). The minimum bisection problem is NP-hard (Garey et al., 1974) and approximation algorithms exist (Fernandes et al., 2018; Fellows et al., 2012). In this work, we use the local search algorithm (Fellows et al., 2012) (details are in the appendix).
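A local-search heuristic for balanced bisection can be sketched as follows (our simplified illustration, not the algorithm of Fellows et al.: start from an arbitrary equal split and greedily swap cross-partition node pairs while the cut size decreases):

```python
# Sketch: local-search balanced bisection of a small (PVT-dependency) graph.

def cut_size(edges, part):
    """Number of edges whose endpoints fall in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def local_search_bisection(nodes, edges):
    part = {n: i % 2 for i, n in enumerate(nodes)}  # arbitrary balanced split
    best = cut_size(edges, part)
    improved = True
    while improved:
        improved = False
        for u in nodes:
            for v in nodes:
                if part[u] == part[v]:
                    continue
                part[u], part[v] = part[v], part[u]  # try swapping u and v
                c = cut_size(edges, part)
                if c < best:
                    best, improved = c, True
                else:
                    part[u], part[v] = part[v], part[u]  # undo the swap
    return part

nodes = ["p1", "p2", "p3", "p4"]
edges = [("p1", "p2"), ("p3", "p4")]  # two dependent PVT pairs
part = local_search_bisection(nodes, edges)
print(cut_size(edges, part))  # 0: each dependent pair stays in one partition
```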

We proceed to demonstrate the benefit of using DataExposerGT as opposed to traditional GT with the following example.

Input: Failing dataset D_fail, passing dataset D_pass, malfunction score threshold θ
Output: A minimal explanation set of PVTs E
1. Compute the PVTs of D_pass and D_fail and identify the common PVTs
2. C ← discriminative PVTs
3. (D, E) = Group-Test(C, D_fail, G_P, θ) /* Obtain an explanation */
4. E = Make-Minimal(E) /* Obtain minimality of E */
5. return E /* E is a minimal explanation */
Algorithm 2 DataExposerGT (group-testing-based)
Input: Candidate PVTs C, dataset D, PVT-dependency graph G_P, malfunction score threshold θ
Output: A transformed dataset and an explanation set of PVTs E
1. E ← ∅ /* Initialize explanation set to be empty */
2. if |C| = 1 then /* Only a single PVT is a candidate */
3.     return (D, C)
4. Partition C into C1 and C2 via min-bisection of G_P
5. m0 ← malfunction score of D /* Initial malfunction score */
6. Δ1 ← malfunction reduction by intervening on D w.r.t. all PVTs in C1
7. if the malfunction still exceeds θ then /* C1 alone is insufficient */
8.     Δ2 ← malfunction reduction by intervening on D w.r.t. all PVTs in C2
9. if C1 is sufficient OR (C1 helps AND C2 is insufficient) then
10.    Recursively call Group-Test on C1 and augment E with the result
11.    if the malfunction is acceptable then return (D, E) /* No need to check C2 */
12. if Δ2 > 0 then /* C2 reduces malfunction */
13.    Recursively call Group-Test on C2 and augment E with the result
Algorithm 3 Group-Test
Example 4.

Consider a set of PVTs where the ground-truth (minimal) explanation is a disjunction of two PVTs (either one suffices). An example of the steps of a traditional adaptive GT approach is shown in Figure 6(c). Note that adaptive GT is a randomized algorithm and this example demonstrates one such execution; however, we observed similar results for other instances. In contrast to adaptive GT, DataExposerGT constructs a min-bisection of the PVT-dependency graph during each iteration: it does not split apart a group of PVTs none of which help reduce the malfunction, and therefore requires fewer interventions.

Algorithm 2 presents the DataExposerGT algorithm. It starts with the set of discriminative PVTs and the PVT-attribute graph. All candidate PVTs are then passed to the Group-Test subroutine to identify the explanation E.

Group-Test. Algorithm 3 presents the procedure, which takes the set of discriminative PVTs, a failing dataset, the PVT-dependency graph, and the malfunction score threshold θ as input. It returns a transformed (fixed) dataset and an explanation.

We walk through Algorithm 3 step by step:

  • Initialize the explanation set E to ∅.

  • If the candidate PVT set contains a single PVT, return it.

  • Partition the candidates into C1 and C2 using min-bisection of the PVT-dependency graph.

  • Calculate the malfunction score of the input dataset.

  • Calculate the reduction in malfunction score if the dataset is intervened on w.r.t. all PVTs in C1.

  • If the malfunction still exceeds θ after intervening w.r.t. all PVTs in C1, try C2: calculate the reduction in malfunction score if the dataset is intervened on w.r.t. all PVTs in C2.

  • Recursively call Group-Test on the partition C1 if one of the following conditions holds: (1) intervening w.r.t. all PVTs in C1 brings the malfunction below θ; the explanation over C1 is returned as the final explanation. (2) Intervening w.r.t. all PVTs in C1 reduces the malfunction but it remains above θ, while intervening w.r.t. all PVTs in C2 brings the malfunction below θ; the explanation returned by the recursive call on C1 is added to E and C2 is processed next.

  • Recursively call Group-Test on C2 if intervening w.r.t. all PVTs in C2 reduces the malfunction. The PVTs returned by this recursive call are added to the solution set E.

(a) Dependency graph
(b) Execution of DataExposerGT
(c) Execution of traditional group testing algorithm
Figure 6. Comparison between DataExposerGT and adaptive group testing on a toy example.
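The recursive halving at the heart of Group-Test can be sketched as follows (a compact illustration of ours under assumption A3: it halves candidates in order instead of using min-bisection, and models "transformations" as simple value filters):

```python
# Compact sketch of the Group-Test recursion under assumption A3.

def group_test(cands, data, malfunction, apply_all, theta):
    if len(cands) == 1:
        return apply_all(data, cands), list(cands)
    half = len(cands) // 2
    c1, c2 = cands[:half], cands[half:]
    base = malfunction(data)
    expl = []
    if malfunction(apply_all(data, c1)) < base:          # C1 helps
        data, expl = group_test(c1, data, malfunction, apply_all, theta)
        if malfunction(data) < theta:
            return data, expl                            # no need to check C2
    if malfunction(apply_all(data, c2)) < malfunction(data):  # C2 helps
        data, expl2 = group_test(c2, data, malfunction, apply_all, theta)
        expl = expl + expl2
    return data, expl

# Toy setup: malfunction is the fraction of negative values; each hypothetical
# "PVT" keeps only the values satisfying its predicate.
mal = lambda d: sum(1 for v in d if v < 0) / max(len(d), 1)
preds = {"drop_neg": lambda v: v >= 0, "noop": lambda v: True}
apply_all = lambda d, names: [v for v in d if all(preds[n](v) for n in names)]
data, expl = group_test(["drop_neg", "noop"], [-1, 2, 3, -4], mal, apply_all, 0.1)
print(expl)  # ['drop_neg']
```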

Discussion on DataExposerGRD vs. DataExposerGT. DataExposerGRD intervenes by considering a single discriminative PVT at a time. Hence, in the worst case, it requires |C| interventions, where C denotes the set of discriminative PVTs. Note that DataExposerGRD requires far fewer interventions in practice and approaches this worst case only when the mentioned observations (O1–O3) do not hold. In contrast, DataExposerGT performs group interventions by recursively partitioning the set of discriminative PVTs. Thus, the maximum number of interventions required by DataExposerGT is O(k log |C|), where k denotes the number of PVTs that help reduce malfunction if the corresponding profile is altered. In expectation, DataExposerGT requires fewer interventions than DataExposerGRD whenever k ≪ |C|. DataExposerGT is particularly helpful when multiple PVTs disjunctively explain the malfunction. However, DataExposerGT requires an additional assumption, A3 (Section 4.4). We discuss the empirical impact of this assumption in Section 5.1 (Cardiovascular disease prediction). Overall, we conclude that DataExposerGT is beneficial whenever k ≪ |C| and observations O1–O3 hold (more details are in the appendix).

5. Experimental Evaluation

Our experiments involving DataExposer aim to answer the following questions: (Q1) Can DataExposer correctly identify the cause and corresponding fix of mismatch between a system and a dataset for which the system fails? (Q2) How efficient is DataExposer compared to other alternative techniques? (Q3) Is DataExposer scalable with respect to the number of discriminative PVTs?

Number of Interventions Execution Time (seconds)
Application DataExposerGRD DataExposerGT BugDoc Anchor GrpTest DataExposerGRD DataExposerGT BugDoc Anchor GrpTest
Sentiment 2 3 10 303 3 25.1 23.4 64.6 4594.9 21.2
Income 1 8 20 800 10 11.8 12.5 20.0 195.5 10.4
Cardiovascular 1 NA 100 5900 NA 7.6 NA 62.1 8602.9 NA
Figure 7. Comparison of number of interventions and running time of DataExposer with other baselines. NA denotes that the technique could not identify the cause of malfunction because assumption A3 did not hold.

Baselines. Since there is no prior work on modifying a dataset according to a PVT, we adapted state-of-the-art debugging and explanation techniques to incorporate profile transformations and explain the cause of system failure. We consider three baselines: (1) BugDoc (Lourenço et al., 2020) is a recent debugging technique that explores different parameter configurations of the system to understand its behavior. We adapt BugDoc to consider each PVT as a parameter of the system and interventions as the modified configurations of the pipeline. (2) Anchor (Ribeiro et al., 2018) is a local explanation technique for classifiers that explains individual predictions based on a surrogate model. We train Anchor with PVTs as features, and the prediction variable is Pass/Fail, where Pass (Fail) denotes the case where an input dataset incurs malfunction below (above) the threshold. In this technique, each intervention creates a new data point to train the surrogate model. (3) GrpTest (Du et al., 2000) is an adaptive group testing approach that performs group interventions to expose the mismatch between the input dataset and the system. It is similar to DataExposerGT with the difference that the recursive partitioning of PVTs is performed randomly without exploiting the PVT-dependency graph.

5.1. Real-world Case Studies

We design three case studies focusing on three different applications, where we use well-known ML models (flair (Akbik et al., 2018), AdaBoost, and Random Forest classifiers) as black-box systems. For all three case studies, we use real-world datasets. Figure 7 presents a summary of our evaluation results.

Sentiment Prediction. The system in this study predicts the sentiment of input text (reviews/tweets) and computes the misclassification rate as the malfunction score. It uses flair (Akbik et al., 2018), a pre-trained classifier, to predict the sentiment of the input records, and assumes a target attribute in the input data indicating the ground-truth sentiment: a value of 1 for the attribute target indicates positive sentiment and a value of 0 indicates negative sentiment. We test the system over the IMDb dataset (IMDb, Dataset) and the much larger twitter dataset (Sentiment140 dataset). The malfunction score of the system on the IMDb dataset is low, while on the twitter dataset it is high. We considered IMDb as the passing dataset and twitter as the failing dataset and used both DataExposerGRD and DataExposerGT to explain the mismatch between the twitter dataset and the system. The ground-truth cause of the malfunction is that the target attribute in the twitter dataset uses “4” to denote positive and “0” to denote negative sentiment (Sentiment140 dataset). DataExposerGRD identifies the discriminative PVTs between the two datasets. One such PVT includes the profile Domain of the target attribute, whose parameter is {0, 1} for IMDb and {0, 4} for the twitter dataset. DataExposerGRD performs two interventions and identifies that the malfunction score drops after mapping the domain of target in the twitter dataset to that of IMDb, i.e., by intervening w.r.t. the PVT corresponding to Domain, which is returned as an explanation of the malfunction.

DataExposerGT and GrpTest both require 3 group interventions to explain the cause of the system malfunction (Figure 7). BugDoc and Anchor require 10 and 303 interventions, respectively. Anchor calculates system malfunction on datasets transformed according to various local perturbations of the PVTs in the failing dataset.

Income Prediction. The system in this study trains a Random Forest classifier to predict the income of individuals while ensuring fairness towards marginalized groups. The pipeline returns the normalized disparate impact (IBM, AIF 360) of the trained classifier w.r.t. the protected attribute (sex) as the malfunction score. Our input data includes census records (Dua and Graff, 2017) containing demographic attributes of individuals along with information about income, education, etc. We create two datasets through a random selection of records, and manually add noise to one of them to break the dependence between target and sex. The system has a low malfunction score for the passing dataset and a high one for the failing dataset due to the dependence between target and sex. DataExposerGRD identifies the discriminative PVTs and constructs a PVT-attribute graph, in which the target attribute has the highest degree. The PVTs that include target are then intervened on in non-increasing order of benefit. The transformation w.r.t. the Indep PVT on the target attribute breaks the dependence between target and all other attributes, thereby reducing the malfunction score below the threshold. Therefore, DataExposerGRD requires one intervention to explain the cause of the malfunction. Our group testing algorithm DataExposerGT and GrpTest require 8 and 10 interventions, respectively (Figure 7). Note that group testing is not very useful here because the datasets contain few discriminative PVTs.

BugDoc and Anchor do not identify discriminative PVTs explicitly and consider all PVTs as candidates for intervention. Anchor performs 800 local interventions to explain the malfunction. BugDoc identifies the ground-truth malfunction in only a fraction of the runs when allowed fewer than 20 interventions. It identifies the mismatch with an intervention budget of 20, but the returned set of PVTs is not minimal. For instance, BugDoc returns two PVTs, Indep(target, education) and Indep(target, sex), as the explanation of malfunction.

Cardiovascular Disease Prediction. This system trains an AdaBoost classifier on patients' medical records (Cardiovascular Disease dataset) containing age, height (in centimeters), and weight, along with other attributes. It predicts whether the patient has the disease and does not optimize for false positives. Therefore, the system calculates recall over the patients having cardiovascular disease, and the goal is to achieve recall above a target threshold. The pipeline returns the additive inverse of recall as the malfunction score. We tested the pipeline with two datasets generated through a random selection of records: (1) the passing dataset satisfies the format assumptions of the pipeline; (2) for the failing dataset, we manually converted height to inches. DataExposerGRD identifies the discriminative PVTs, with height, weight, and age having the highest degree of