Multiple previous works have proposed algorithms for data repair using Denial Constraints (DCs) [DiscoveringChuIP13] or subsets thereof [RekatsinasCIR17, VolkovsCSM14, ChuIP13, BohannonFGJK07]. These approaches employ algorithms that use the constraints to detect and change erroneous values in a database table. We propose a system that explains data repairs by presenting the influence of each constraint and table cell on the repair. Such an explanation may be useful both as a means of understanding the repair process and algorithm, and as a tool for debugging the quality of the constraints used to repair this specific data.
T-REx (a video of the system is available at https://youtu.be/xPVWzHPOuAk) is a novel system for data repair explanations based on Shapley values [shapley1953value]. The notion of Shapley values was originally suggested in the context of Game Theory as a measure for quantifying the contribution of each player in a cooperative game. It was later adopted by the Machine Learning (ML) community as a tool for evaluating the contribution of each feature in the model [LundbergL17]. Given a repaired cell, T-REx computes and presents the Shapley values of the DCs and table cells that have influenced this repair. Our approach evaluates the contribution of the input directly rather than the contribution of hidden features which are used by a specific algorithm. This allows our solution to treat the repair algorithm as a black box and only query it to compute the Shapley values of DCs and cells. Explanations for the influence of DCs on the repair may assist users in correcting them and adapting them to the specific data and repair algorithm, while explanations about the influence of data cells can help in understanding the repair algorithm itself and changing specific cells to make the repair more accurate.
Consider the table in Figure 1(a) and the DCs in Figure 1 with the Shapley values of each DC on its left. C1 says that two tuples that share a team value must be in the same city, C2 says that if a pair of tuples share a city, they must have the same country, C3 says that two tuples that have the same league must have the same country, and C4 says that it is impossible for two different teams of the same league to finish in the same place in the same year. Consider the cell in the fifth row, denoted by . For simplicity, assume that we have Algorithm 1 as a naïve repair algorithm. (In practice, the repair algorithm may be more sophisticated; our solution is agnostic to the complexity of the repair algorithm.) T-REx computes the contribution of each DC and ranks the DCs accordingly, with C3 as the most influential DC. It contributed the most because the value “La Liga” appears in 3 other tuples coupled with the value “Spain” in the country attribute. C1 and C2 contributed equally, as C1 first caused the change of “Capital” to “Madrid” and then C2 caused the change of the value in the cell. C4 is not involved in the repair, so its contribution is 0.
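The four constraints described above can be illustrated with a small violation check. The following is a minimal sketch, not part of T-REx: the rows are hypothetical stand-ins for the table in Figure 1(a), the attribute names (`team`, `city`, `country`, `league`, `year`, `place`) are assumptions, and the helpers `fd_violations` and `c4_violations` are introduced here only for illustration.

```python
from itertools import combinations

# Hypothetical rows loosely modeled on the soccer example (not the paper's table).
rows = [
    {"team": "Real Madrid", "city": "Madrid", "country": "Spain",
     "league": "La Liga", "year": 2019, "place": 1},
    {"team": "Barcelona", "city": "Barcelona", "country": "Spain",
     "league": "La Liga", "year": 2019, "place": 2},
    {"team": "Real Madrid", "city": "Capital", "country": "España",
     "league": "La Liga", "year": 2018, "place": 3},
]

def fd_violations(rows, lhs, rhs):
    """Index pairs of rows that agree on `lhs` but differ on `rhs` (C1-C3 style)."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(rows), 2)
            if a[lhs] == b[lhs] and a[rhs] != b[rhs]]

def c4_violations(rows):
    """Index pairs of different teams sharing league, year, and place (C4 style)."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(rows), 2)
            if a["team"] != b["team"]
            and all(a[k] == b[k] for k in ("league", "year", "place"))]

c1 = fd_violations(rows, "team", "city")       # same team, different city
c2 = fd_violations(rows, "city", "country")    # same city, different country
c3 = fd_violations(rows, "league", "country")  # same league, different country
c4 = c4_violations(rows)
```

In this toy data, the third row conflicts with the first under C1 and with both other rows under C3, which mirrors why several constraints can push the same cell toward a repair.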
Next, we measure the influence of different data cells on this repair. Given Algorithm 1, observe that the value of has no influence on the modification of – as has no contradictions with , and the attribute does not affect in Algorithm 1. However, how can we determine if was more or less influential on the repair compared to ? Intuitively, is more influential than . This is because if had a different value, then tuple would not have any contradictions according to . While if had a different value, then according to there would have been a contradiction between and (as both tuples would have value of “Real Madrid”, and an inconsistent ) which would have been resolved by Algorithm 1. As a result T-REx will assign higher contribution to compared to .
T-REx takes as input the repair algorithm itself and its input, which is a set of DCs and a dirty database table. Another input to the system is a specific table cell of interest whose repair requires explaining. The system then ranks the influencing DCs and table cells based on their Shapley value for this cell of interest. In general, computing the Shapley value takes time exponential in the number of DCs or table cells, and thus T-REx employs different algorithms to compute the Shapley value for DCs and for table cells. For DCs, the naïve approach is feasible as the number of DCs is usually small. In contrast, the number of cells in a table can be very large, so T-REx uses a sampling algorithm based on [StrumbeljK14]. To compute the Shapley values, the system repeatedly changes the input of the repair algorithm and queries it, so it does not rely on the components or approach of a specific algorithm.
2 Technical Details
We give a short overview of the approach underlying T-REx.
2.1 Database Repair
will denote a database table with schema where is the th attribute of . For a tuple , the notation means that has the value in attribute . We denote by and the database table prior to the repair and after it respectively. Extending this, and will also be used to denote a dirty and clean cell, respectively.
We denote the repair algorithm by and its input by (1) , a set of DCs and (2) , a dirty table. Also, denote as the output table of . For our purposes, we will refer to as a binary function as follows. Given a table cell , the repair algorithm is a function , where signals that the value in is repaired to the value in , and otherwise.
2.2 Shapley Value
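In Cooperative Game Theory, the Shapley value [shapley1953value] is a way to distribute the worth of all players, assuming they cooperate. Let $N$ be a finite set of players and $v: 2^N \to \mathbb{R}$, with $v(\emptyset) = 0$, be a function (called a characteristic function). $v$ maps sets of players to the joint worth they generate according to the game. The Shapley value of a player $i \in N$ is then defined as:
\[
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left( v(S \cup \{i\}) - v(S) \right)
\]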
In our scenario, the model is a black box so the Shapley values are computed on the input itself, i.e., the constraints and the table. For constraints, we adapt the definition so that it reflects the contribution of a specific constraint to the repair of a cell, as follows.
Where is a specific cell of interest and is a constraint whose contribution we want to determine. The “set of players” is the set of DCs while the table remains constant.
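Recall the tables in Figure 2 with the DCs in Figure 1 (Shapley values are on the left) and Algorithm 1. We now compute the contribution of each DC to the repair of the cell of interest. Algorithm 1 will repair this cell only if the given DCs include C3, or both C1 and C2. According to the definition, we can compute the contribution of C1 as follows: there are 8 subsets of {C2, C3, C4}, and only for {C2} and {C2, C4} is the cell repaired when C1 is added and not repaired without it. Each of these two subsets has weight 1/12, so the Shapley value of C1 is 1/6. The same computation applies to C2. For C3, 6 out of the 8 subsets of {C1, C2, C4} result in a repair when C3 is added and no repair without it, including the empty set; the two exceptions are {C1, C2} and {C1, C2, C4}. Thus, the Shapley value of C3 is 2/3. As for C4, its presence or absence does not change the outcome of the repair, so its Shapley value is 0.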
Let us explain the intuition for the value of C3 being double that of the pair {C1, C2}. Ignore C4 for now since its contribution is 0. There are five subsets of the DCs for which the cell is repaired. These are {C3}, {C1, C3}, {C2, C3}, {C1, C2}, and {C1, C2, C3}. Four of these sets contain C3 while only two contain the pair {C1, C2} (for the subsets where one of these is present without its partner, the repair is due to C3). Thus, the contribution of C1 and C2, as a pair, is half that of C3.
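The DC example can be checked by brute force over all subsets. The sketch below assumes the repair condition of the running example (the cell is repaired when C3 is present, or when C1 and C2 are both present) and uses the standard Shapley weights; the function names `repaired` and `shapley` are hypothetical and stand in for querying the black-box repair algorithm.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

DCS = ["C1", "C2", "C3", "C4"]

def repaired(dcs):
    """Toy stand-in for the black-box repair: 1 if the cell of interest is repaired."""
    s = set(dcs)
    return int("C3" in s or {"C1", "C2"} <= s)

def shapley(players, value, player):
    """Exact Shapley value of `player`, enumerating all subsets of the other players."""
    others = [p for p in players if p != player]
    n = len(players)
    total = Fraction(0)
    for k in range(len(others) + 1):
        # Weight of a coalition of size k: k! (n - k - 1)! / n!
        weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        for subset in combinations(others, k):
            total += weight * (value(subset + (player,)) - value(subset))
    return total

values = {dc: shapley(DCS, repaired, dc) for dc in DCS}
for dc in DCS:
    print(dc, values[dc])
```

Exact enumeration queries the repair function 2^(n-1) times per player, which is why it is only practical for the small number of DCs and not for table cells.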
Similarly, we adjust the definition for the Shapley value of a cell. Given a repair of cell we define the formula for calculating the Shapley value of a cell , or intuitively, its contribution to the repair of .
Where means . Here, the “set of players” is the set of cells in the table, while the set of constraints remains constant.
Reconsider our example with the DCs from Figure 1, Algorithm 1, and the tables in Figure 2. Consider the cell whose value is changed from “España” to “Spain”.
Among all the cells, has the highest Shapley value, next we will explain why. Notice that based on C3 the inclusion of to any coalition that contains at least one of the pairs for any would result in the repair of to “Spain”. Observe that there are such coalitions (since out of the relevant cells there are options to choose a coalition such that at least one pair exists, and excluding those cells and there are remaining cells that can be either included or excluded from the coalition).
Next, we will estimate the number of coalitions that are required for the fix based on C1 and C2. According to these DCs, a coalition that contains is required. There are such coalitions. Since is more than five times larger than we conclude that has the highest influence on the repair of from “España” to “Spain”. For simplicity we overlooked the coalition sizes, though they too play a role in the evaluation of Shapley values.
2.3 Computing Shapley Values
Shapley values can be computed from the definition, but the computation time may be exponential. For constraints, we can use the formula directly as their number is typically small. However, the number of table cells can be huge. Therefore, we use a novel algorithm based on probabilistic sampling [StrumbeljK14] to approximate the contribution of a table cell.
Reconsider the table in Figure 1(a). Suppose we are interested in the effect of the cell on the repair of the cell . We initialize a variable .
We vectorize the table to get a vector of its cells. To sample a cell coalition, we take a random permutation of this vector; the coalition is the set of all cells that precede the cell whose effect we are measuring. Values of cells that are not part of the coalition are replaced with a value sampled from their column distribution. Once the cell coalition is formed, we generate two instances of the vectorized table: one with the original value of the measured cell, and a second where that value is replaced with a random value. We then compute the difference in the result of the repair algorithm for these two instances and add it to the variable. We repeat this a fixed number of times and output the average of the accumulated differences.
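The sampling procedure can be sketched as follows. This is an illustrative approximation in the spirit of [StrumbeljK14], not the T-REx implementation: the table is assumed to be a dictionary from `(row, attribute)` cells to values, `repair_indicator` is a hypothetical black-box function returning 1 when the cell of interest is repaired, and the toy table and `toy_indicator` are made up for the usage example.

```python
import random

def sample_shapley(repair_indicator, table, player, samples, rng):
    """Monte Carlo estimate of the contribution of cell `player` to a repair."""
    cells = list(table)
    # Empirical column distributions, used to replace cells outside the coalition.
    columns = {}
    for (_, attr), value in table.items():
        columns.setdefault(attr, []).append(value)
    total = 0
    for _ in range(samples):
        order = cells[:]
        rng.shuffle(order)
        # The coalition is every cell that precedes the player in the permutation.
        coalition = set(order[:order.index(player)])
        with_player = {
            cell: table[cell] if cell in coalition or cell == player
            else rng.choice(columns[cell[1]])
            for cell in cells
        }
        # Second instance: identical, except the player's value is resampled.
        without_player = dict(with_player)
        without_player[player] = rng.choice(columns[player[1]])
        total += repair_indicator(with_player) - repair_indicator(without_player)
    return total / samples

# Hypothetical toy table and black box: the country cell of row 2 counts as
# repaired when another row shares its league and has country "Spain".
toy_table = {
    (0, "league"): "La Liga", (0, "country"): "Spain",
    (1, "league"): "Serie A", (1, "country"): "Italy",
    (2, "league"): "La Liga", (2, "country"): "España",
}

def toy_indicator(table):
    league = table[(2, "league")]
    return int(any(table[(r, "league")] == league and table[(r, "country")] == "Spain"
                   for r in (0, 1)))

rng = random.Random(0)
estimate_league = sample_shapley(toy_indicator, toy_table, (2, "league"), 500, rng)
estimate_null = sample_shapley(toy_indicator, toy_table, (2, "country"), 500, rng)
```

Each sample costs two queries to the repair function, so the number of samples trades accuracy against running time independently of the table size. A cell that the black box never reads, such as `(2, "country")` in this toy, always yields a zero difference.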
3 System Overview
4 Demo Scenario
Our demonstration will show that explaining repairs through Shapley values assists in understanding the repair process and debugging it. We will use a soccer database, scraped from Wikipedia and similar to the table in Figure 1(a), into which errors will be manually added. We will start with an initial set of DCs. To get the repair, we will employ HoloClean, which will output a clean table. Then, we will indicate a repaired cell of interest and show the most influential table cells and DCs involved in this repair, ranked according to their Shapley value. We will show how removing or changing the highest ranked DCs improves the repair of the specified table cell. We will use a similar scenario for table cells, where the DCs will be appropriate but some of the cells will cause a specific cell to be repaired incorrectly. After showing the obtained repair, we will invoke T-REx to rank the influencing table cells. We will then allow users to change values in the initial table and the DCs and to choose different cells of interest. Users can then use T-REx to compute the Shapley value of the table cells and DCs that influenced the repair of their chosen cell and explore the system.
This research has been funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 804302), the Israeli Science Foundation (ISF) Grant No. 978/17, and the Google Ph.D. Fellowship. The contributions of Nave Frost and Amir Gilad are part of their respective Ph.D. thesis research conducted at Tel Aviv University.