The Shapley Value of Tuples in Query Answering

04/18/2019 ∙ by Ester Livshits, et al. ∙ Technion; RelationalAI, Inc.

We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory, and in many applications of game theory, for assessing the contribution of a player to a coalition game. It was established in the 1950s and is theoretically justified by being the unique wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it has received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers, and for the hard cases we present approximation algorithms.

1 Introduction

The Shapley value is named after Lloyd Shapley, who introduced it in a seminal 1952 article [32]. He considered a cooperative game that is played by a set $A$ of players and is defined by a wealth function $v$ that assigns, to each coalition $S \subseteq A$, the wealth $v(S)$. For instance, in our running example the players are researchers, and $v(S)$ is the total number of citations of papers with an author in $S$. As another example, $A$ might be a set of politicians, and $v(S)$ the number of votes that a poll assigns to the party that consists of the candidates in $S$. The question is how to distribute the wealth $v(A)$ among the players, or from a different perspective, how to quantify the contribution of each player to the overall wealth. For example, the removal of a researcher $a$ may have zero impact on the overall number of citations, since each paper has co-authors from $A \setminus \{a\}$. Does it mean that $a$ has no contribution at all? What if the removal in turns of every individual author has no impact? Shapley considered distribution functions that satisfy a few axioms of good behavior. Intuitively, the axioms state that the function should be invariant under isomorphism, the sum over all players should be equal to the total wealth, and the contribution to a sum of wealths is equal to the sum of separate contributions. Quite remarkably, Shapley established that there is a single such function, and this function has become known as the Shapley value.

The Shapley value is informally defined as follows. Assume that we select players one by one, randomly and without replacement, starting with the empty set. Whenever we select the player $a$, its addition to the set $S$ of players selected so far may cause a change in wealth from $v(S)$ to $v(S \cup \{a\})$. The Shapley value of $a$ is the expectation of the change that $a$ causes in this probabilistic process.

The Shapley value has been applied in various areas and fields beyond cooperative game theory (e.g., [1, 2]), such as bargaining foundations in economics [14], takeover corporate rights in law [26], pollution responsibility in environmental management [28, 20], influence measurement in social network analysis [25], and utilization of multiple Internet Service Providers (ISPs) in networks [21]. Closest to database management is the application of the Shapley value to attributing a level of inconsistency to a statement in an inconsistent knowledge base [17, 35]; the idea is natural: as wealth, adopt a measure of inconsistency for a set of logical sentences [12], and then associate to each sentence its Shapley value.

In this paper, we apply the Shapley value to quantifying the contribution of database facts (tuples) to query results. As in previous work on quantification of contribution of facts [23, 30], we view the database as consisting of two types of facts: endogenous facts and exogenous facts. Exogenous facts are taken as given (e.g., inherited from external sources) without questioning, and are beyond experimentation with hypothetical or counterfactual scenarios. On the other hand, we may have control over the endogenous facts, and these are the facts for which we reason about existence and marginal contribution. Our focus is on queries that can be viewed as mapping databases to numbers. These include Boolean queries (mapping databases to zero and one) and aggregate queries (e.g., count the number of tuples in a multiway join). As a cooperative game, the endogenous facts take the role of the players, and the result of the query is the wealth. The core computational problem for a query is then: given a database and an endogenous fact, compute the Shapley value of the fact.

We study the complexity of computing the Shapley value for Conjunctive Queries (CQs) and aggregate functions over CQs. Our main results are as follows. We first establish a dichotomy in complexity for the class of Boolean CQs without self-joins. Interestingly, our dichotomy is the same as that of query inference in probabilistic, tuple-independent databases [9]: if the CQ is hierarchical, then the problem is solvable in polynomial time, and otherwise, it is $\mathrm{FP}^{\#\mathrm{P}}$-complete (i.e., complete for the intractable class of polynomial-time algorithms with an oracle to, e.g., a counter of the satisfying assignments of a propositional formula). The proof, however, is more challenging than that of Dalvi and Suciu [9], as the Shapley value involves coefficients that do not seem to easily factor out. Since the Shapley value is a probabilistic expectation, we show how to use the linearity of expectation to extend the dichotomy to arbitrary sums over CQs without self-joins. For non-hierarchical queries (and, in fact, all unions of CQs), we show that both Boolean and summation versions are efficiently approximable (i.e., have a multiplicative FPRAS) via Monte Carlo sampling.
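The Monte Carlo idea mentioned above can be illustrated with a minimal sketch: sample random orderings of the players and average the marginal contribution of the player of interest. This is only the plain sampling estimator, not the paper's FPRAS analysis; the function name `estimate_shapley`, the toy game `wealth`, and the player names are hypothetical and chosen purely for illustration.

```python
import random


def estimate_shapley(players, v, a, samples, seed=0):
    """Average the marginal contribution of `a` over sampled random orderings."""
    rng = random.Random(seed)
    order = list(players)
    total = 0
    for _ in range(samples):
        rng.shuffle(order)                          # uniform random permutation
        before = frozenset(order[:order.index(a)])  # players placed before a
        total += v(before | {a}) - v(before)
    return total / samples


# Hypothetical Boolean-style game (an assumption, not from the paper):
# the wealth is 1 once "r1" and at least one "s" player are present.
def wealth(coalition):
    return int("r1" in coalition and any(p.startswith("s") for p in coalition))


players = ["r1", "s1", "s2", "x"]
print(estimate_shapley(players, wealth, "r1", samples=20000))
```

In this toy game the exact value for `r1` is 2/3 (it contributes unless it precedes both `s1` and `s2`), so the printed estimate should be close to that, while the estimate for the irrelevant player `x` is exactly zero.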

The general conclusion is that computing the exact Shapley value is notoriously hard, but the picture is optimistic if approximation is allowed under strong guarantees of error boundedness. Our results immediately generalize to non-Boolean CQs and group-by operators, where the goal is to compute the Shapley value of a fact for each tuple in the answer of a query. For aggregate functions other than summation (where we cannot apply the linearity of expectation), the picture is far less complete, and remains for future investigation. Nevertheless, we give some positive preliminary results about special cases of the minimum and maximum aggregate functions.

Various formal measures have been proposed for quantifying the contribution of a fact $f$ to a query answer. Meliou et al. [23] adopted the quantity of responsibility that is inversely proportional to the minimal number of endogenous facts that should be removed to make $f$ counterfactual (i.e., removing $f$ transitions the answer from true to false). This measure adopts earlier notions of formal causality by Halpern and Pearl [16]. This measure, however, is fundamentally designed for non-numerical queries, and it is not at all clear whether it can incorporate the numerical contribution of a fact (e.g., recognizing that some tuples contribute more than others due to high numerical attributes). Salimi et al. [30] proposed the causal effect: assuming endogenous facts are randomly removed independently and uniformly, what is the difference in the expected query answer between assuming the presence and the absence of $f$? Interestingly, as we show here, this value is the same as the Banzhaf power index that has also been studied in the context of wealth distribution in cooperative games [11], and is different from the Shapley value [29, Chapter 5]. While the justification for measuring tuple contribution using one measure over the other is yet to be established, we believe that the suitability of the Shapley value is backed by the aforementioned theoretical justification as well as its massive adoption in a plethora of fields. In addition, the complexity of measuring the causal effect has been left open, and we conjecture that all of our complexity results are applicable to (and, in fact, simpler to prove in) the causal-effect framework.
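To see concretely that the Banzhaf index and the Shapley value differ, the following sketch computes both on a tiny game. It assumes that "removed independently and uniformly" means each other player is kept with probability 1/2, so that the causal effect is the unweighted average of marginal contributions over all subsets. The function names and the toy game are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def marginals_by_size(players, v, a):
    """Yield (|B|, v(B | {a}) - v(B)) for every subset B that excludes a."""
    others = [p for p in players if p != a]
    for k in range(len(others) + 1):
        for B in combinations(others, k):
            B = frozenset(B)
            yield k, v(B | {a}) - v(B)


def banzhaf(players, v, a):
    """Unweighted average of marginal contributions over all 2^(n-1) subsets."""
    n = len(players)
    return Fraction(sum(d for _, d in marginals_by_size(players, v, a)), 2 ** (n - 1))


def shapley(players, v, a):
    """Same marginals, but weighted by |B|! (n-|B|-1)! / n! per subset B."""
    n = len(players)
    return sum(Fraction(factorial(k) * factorial(n - k - 1), factorial(n)) * d
               for k, d in marginals_by_size(players, v, a))


# Hypothetical game: wealth is 1 once "r1" and at least one "s" player are present.
def wealth(coalition):
    return int("r1" in coalition and any(p.startswith("s") for p in coalition))


players = ["r1", "s1", "s2"]
print(banzhaf(players, wealth, "r1"), shapley(players, wealth, "r1"))
```

Both measures aggregate the same marginal contributions; only the weighting differs, which is why they give 3/4 and 2/3 respectively for `r1` in this toy game.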

The remainder of the paper is organized as follows. In the next section, we give preliminary concepts, definitions and notation. In Section 3, we present the Shapley value to measure the contribution of a fact to a query answer, along with illustrating examples. In Section 4, we study the complexity of calculating the Shapley value. Finally, we discuss past contribution measures in Section 5 and conclude in Section 6. For lack of space, missing proofs are given in the Appendix.

2 Preliminaries

Databases

A (relational) schema $\mathbf{S}$ is a collection of relation symbols with each relation symbol $R$ in $\mathbf{S}$ having an associated arity that we denote by $\mathit{ar}(R)$. We assume a countably infinite set $\mathsf{Const}$ of constants that are used as database values. If $\vec{c} = (c_1, \dots, c_k)$ is a tuple of constants and $i \in \{1, \dots, k\}$, then we use $\vec{c}[i]$ to refer to the constant $c_i$. A relation $r$ is a set of tuples of constants, each having the same arity (length) that we denote by $\mathit{ar}(r)$. A database $D$ (over the schema $\mathbf{S}$) associates with each relation symbol $R$ a finite relation $r$, which we denote by $R^D$, such that $\mathit{ar}(R) = \mathit{ar}(R^D)$. We denote by $\mathrm{DB}(\mathbf{S})$ the set of all databases over the schema $\mathbf{S}$. Notationally, we identify a database $D$ with its finite set of facts $R(c_1, \dots, c_k)$, stating that the relation $R^D$ over the $k$-ary relation symbol $R$ contains the tuple $(c_1, \dots, c_k)$. In particular, two databases $D$ and $D'$ over $\mathbf{S}$ satisfy $D \subseteq D'$ if and only if $R^D \subseteq R^{D'}$ for all relation symbols $R$ of $\mathbf{S}$.

Following prior work on explanations and responsibility of tuples to query answers [24, 22], we view the database as consisting of two types of facts: exogenous facts and endogenous facts. Exogenous facts represent a context of information that is taken for granted and assumed not to claim any contribution or responsibility to the result of a query. Our concern is about the role of the endogenous facts in establishing the result of the query. In notation, we denote by $D_{\mathsf{x}}$ and $D_{\mathsf{n}}$ the subsets of $D$ that consist of the exogenous and endogenous facts, respectively. Hence, in our notation we have that $D = D_{\mathsf{x}} \cup D_{\mathsf{n}}$.

Figure 1: The database of the running example, with relations Author(name, affil) (endogenous), Inst(name, state) (exogenous), Pub(author, pub) (exogenous), and Citations(paper, cits) (exogenous).

Figure 1 depicts the database of our running example from the domain of academic publications. The relation Author stores authors along with their affiliations, which are stored with their states in Inst. The relation Pub associates authors with their publications, and Citations stores the number of citations for each paper. For example, publication has citations and it is written jointly by from of state, from of state, and from of state. All Author facts are endogenous, and all remaining facts are exogenous. Hence, and consists of all for and relevant . ∎

Relational and conjunctive queries

Let $\mathbf{S}$ be a schema. A relational query is a function that maps databases to relations. More formally, a relational query $q$ of arity $k$ is a function that maps every database $D$ over $\mathbf{S}$ to a finite relation $q(D)$ of arity $k$. We denote the arity of $q$ by $\mathit{ar}(q)$. Each tuple $\vec{c}$ in $q(D)$ is an answer to $q$ on $D$. If the arity of $q$ is zero, then we say that $q$ is a Boolean query; in this case, $D \models q$ denotes that $q(D)$ consists of the empty tuple $()$, while $D \not\models q$ denotes that $q(D)$ is empty.

Our analysis will focus on the special case of Conjunctive Queries (CQs). A CQ over the schema $\mathbf{S}$ is a relational query definable by a first-order formula of the form $\exists y_1 \cdots \exists y_m\, \theta(\vec{x}, y_1, \dots, y_m)$, where $\theta$ is a conjunction of atomic formulas of the form $R(\vec{t})$ with variables among those in $\vec{x}, y_1, \dots, y_m$. In the remainder of the paper, a CQ $q$ will be written shortly as a logic rule, that is, an expression of the form

$$q(\vec{x}) \mathrel{{:}{-}} R_1(\vec{t}_1), \dots, R_n(\vec{t}_n)$$

where each $R_i$ is a relation symbol of $\mathbf{S}$, each $\vec{t}_i$ is a tuple of variables and constants with the same arity as $R_i$, and $\vec{x}$ is a tuple of variables from $\vec{t}_1, \dots, \vec{t}_n$. We call $q(\vec{x})$ the head of $q$, and $R_1(\vec{t}_1), \dots, R_n(\vec{t}_n)$ the body of $q$. Each $R_i(\vec{t}_i)$ is an atom of $q$. The variables occurring in the head are called the head variables, and we make the standard safety assumption that every head variable occurs at least once in the body. The variables occurring in the body but not in the head are existentially quantified, and are called the existential variables. The answers to $q$ on a database $D$ are the tuples $\vec{c}$ that are obtained by projecting to $\vec{x}$ all homomorphisms from $q$ to $D$, and replacing each variable with the constant it is mapped to. A homomorphism from $q$ to $D$ is a mapping of the variables in $q$ to the constants of $D$, such that every atom in $q$ is mapped to a fact in $D$.

A self join in a CQ is a pair of distinct atoms over the same relation symbol. For example, in the query , the first and third atoms constitute a self join. We say that is self-join free if it has no self joins, or in other words, every relation symbol occurs at most once in the body.

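Let $q$ be a CQ. For a variable $x$ of $q$, let $A_x$ be the set of atoms $R_i(\vec{t}_i)$ of $q$ that contain $x$ (that is, $x$ occurs in $\vec{t}_i$). We say that $q$ is hierarchical if for all existential variables $x$ and $y$ it holds that $A_x \subseteq A_y$, or $A_y \subseteq A_x$, or $A_x \cap A_y = \emptyset$ [8]. For example, every CQ with at most two atoms is hierarchical. The smallest non-hierarchical CQ is the following.

$$q_{\mathsf{RST}}() \mathrel{{:}{-}} R(x), S(x, y), T(y) \qquad (1)$$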

On the other hand, the query , which has a single existential variable, is hierarchical.
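The hierarchical condition is easy to check mechanically. The sketch below assumes a hypothetical representation of a CQ as a list of `(relation, variables)` pairs in which all terms are variables, plus a collection of head variables; the function name `is_hierarchical` is likewise an illustration, not notation from the paper.

```python
from itertools import combinations


def is_hierarchical(atoms, head_vars=()):
    """Check the pairwise containment/disjointness condition on existential variables."""
    occurs = {}  # variable -> set of indices of the atoms that contain it
    for i, (_, variables) in enumerate(atoms):
        for x in variables:
            occurs.setdefault(x, set()).add(i)
    head = set(head_vars)
    existential = [x for x in occurs if x not in head]
    for x, y in combinations(existential, 2):
        ax, ay = occurs[x], occurs[y]
        if not (ax <= ay or ay <= ax or ax.isdisjoint(ay)):
            return False
    return True


# The standard smallest non-hierarchical pattern: R(x), S(x, y), T(y).
q_rst = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]
print(is_hierarchical(q_rst))                   # both x and y existential
print(is_hierarchical(q_rst, head_vars=("x",)))  # only y existential
```

With both variables existential, the atom sets of `x` and `y` overlap on the `S` atom without either containing the other, so the check fails; once `x` is a head variable, only one existential variable remains and the condition holds trivially.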

Let $q$ be a Boolean query and $D$ a database, both over the same schema, and let $f \in D_{\mathsf{n}}$ be an endogenous fact. We say that $f$ is a counterfactual cause (for $q$ w.r.t. $D$) [22, 23] if the removal of $f$ causes $q$ to become false; that is, $D \models q$ and $D \setminus \{f\} \not\models q$.

We will use the following queries in our examples.

Note that and are Boolean, whereas and are not. Also note that and are hierarchical, and and are not. Considering the database of Figure 1, none of the Author facts is a counterfactual cause for , since the query remains true even if the fact is removed. The same applies to . However, the fact is a counterfactual cause for the Boolean CQ , asking whether there is a publication with an author from UCLA, since satisfies but the removal of Alice causes to be violated by , as no other author from UCLA exists. ∎

Numerical and aggregate-relational queries

A numerical query $\alpha$ is a function that maps databases to numbers. More formally, a numerical query $\alpha$ is a function from the databases over $\mathbf{S}$ to $\mathbb{R}$ that maps every database $D$ over $\mathbf{S}$ to a real number $\alpha(D)$.

A special form of a numerical query $\alpha$ is what we refer to as an aggregate-relational query: a $k$-ary relational query $q$ followed by an aggregate function $\gamma$ that maps the resulting relation $q(D)$ into a single number $\gamma(q(D))$. We denote this aggregate-relational query as $\gamma[q]$; hence, $\gamma[q](D) = \gamma(q(D))$.

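Special cases of aggregate-relational queries include the functions of the form $\gamma = f\langle\varphi\rangle$ that transform every tuple $\vec{c}$ into a number $\varphi(\vec{c})$ via a feature function $\varphi$, and then contract the resulting bag of numbers into a single number. Formally, we define $f\langle\varphi\rangle(\{\vec{c}_1, \dots, \vec{c}_n\}) = f(\{\{\varphi(\vec{c}_1), \dots, \varphi(\vec{c}_n)\}\})$ where $\{\{\cdot\}\}$ is used for bag notation. For illustration, if we assume that an $i$th attribute of $\vec{c}$ takes a numerical value, then $\varphi$ can simply copy this number (i.e., $\varphi(\vec{c}) = \vec{c}[i]$); we denote this $\varphi$ by $[i]$. As another example, $\varphi$ can be the product of two attributes: $\varphi = [i] \cdot [j]$. We later refer to the following aggregate-relational queries.

$$\mathsf{sum}\langle\varphi\rangle[q](D) = \sum_{\vec{c} \in q(D)} \varphi(\vec{c}) \qquad \mathsf{max}\langle\varphi\rangle[q](D) = \begin{cases} \max\{\varphi(\vec{c}) \mid \vec{c} \in q(D)\} & \text{if } q(D) \neq \emptyset \\ 0 & \text{if } q(D) = \emptyset \end{cases}$$

Other popular examples include the minimum (defined analogously to maximum), average and median over the feature values. A special case of $\mathsf{sum}\langle\varphi\rangle[q]$ is $\mathsf{count}[q]$ that counts the number of answers for $q$. That is, $\mathsf{count}[q]$ is $\mathsf{sum}\langle\mathbf{1}\rangle[q]$, where "$\mathbf{1}$" is the feature function that maps every $k$-tuple to the number $1$. A special case of $\mathsf{count}[q]$ is when $q$ is Boolean; in this case, we may abuse the notation and identify $\mathsf{count}[q]$ with $q$ itself. Put differently, we view $q$ as the numerical query defined by $q(D) = 1$ if $D \models q$ and $q(D) = 0$ if $D \not\models q$.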
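The pattern of a feature function followed by an aggregate can be sketched in a few lines. The helper name `aggregate_query`, the sample answer relation, and its values are hypothetical; the answer relation stands in for the result of some relational query on some database.

```python
def aggregate_query(answers, feature, aggregate):
    """Map each answer tuple to a number, then aggregate the resulting bag."""
    bag = [feature(t) for t in answers]  # a list keeps duplicate values (bag semantics)
    return aggregate(bag)


# Hypothetical binary answer relation: (paper, number of citations).
answers = {("p1", 10), ("p2", 10), ("p3", 4)}

total = aggregate_query(answers, lambda t: t[1], sum)  # sum over the 2nd attribute
count = aggregate_query(answers, lambda t: 1, sum)     # count as a sum of ones
top = aggregate_query(answers, lambda t: t[1], max)    # max over the 2nd attribute
print(total, count, top)
```

Using a list rather than a set for the feature values matters here: two distinct papers with 10 citations each must both count toward the sum.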

Following are examples of aggregate-relational queries over the relational queries of Example 2.

  • calculates the total number of citations of all published papers.

  • counts the papers in Citations with an author in the database.

  • calculates the total number of citations of papers by Californians.

  • calculates the number of citations for the most cited paper.

For of Figure 1 we have , , and .∎

In terms of presentation, when we mention general functions and , we make the implicit assumption that they are computable in polynomial time with respect to the representation of their input. Also, observe that our modeling of an aggregate-relational query does not allow for grouping, since a database is mapped to a single number. This is done for simplicity of presentation, and all concepts and results of this paper generalize to grouping as in traditional modeling (e.g., [6]). This is explained in the next section.

Shapley value

Let $A$ be a finite set of players. A cooperative game is a function $v: \mathcal{P}(A) \to \mathbb{R}$ such that $v(\emptyset) = 0$ (and $\mathcal{P}(A)$ is the power set of $A$ that consists of all subsets of $A$). The value $v(S)$ represents a value, such as wealth, jointly obtained by $S$ when the players of $S$ cooperate. The Shapley value [32] measures the share of each individual player $a \in A$ in the gain of $A$ for the cooperative game $v$. Intuitively, the gain of $a$ is as follows. Suppose that we form a team by taking the players one by one, randomly and uniformly without replacement; while doing so, we record the change of $v$ due to the addition of $a$ as the random contribution of $a$. Then the Shapley value of $a$ is the expectation of the random contribution.

$$\mathrm{Shapley}(A, v, a) = \frac{1}{|A|!} \sum_{\sigma \in \Pi_A} \big( v(\sigma_a \cup \{a\}) - v(\sigma_a) \big) \qquad (2)$$

where $\Pi_A$ is the set of all possible permutations over the players in $A$, and for each permutation $\sigma$ we denote by $\sigma_a$ the set of players that appear before $a$ in the permutation.

An alternative formula for the Shapley value is the following.

$$\mathrm{Shapley}(A, v, a) = \sum_{B \subseteq A \setminus \{a\}} \frac{|B|! \cdot (|A| - |B| - 1)!}{|A|!} \big( v(B \cup \{a\}) - v(B) \big) \qquad (3)$$

Note that $|B|! \cdot (|A| - |B| - 1)!$ is the number of permutations over $A$ such that all players in $B$ come first, then $a$, and then all remaining players. For further reading, we refer the reader to the book by Roth [29].
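The permutation-based and the subset-based formulas can be compared directly on a small game. The sketch below implements both by brute force with exact rational arithmetic; the function names and the toy wealth function are hypothetical, and the enumeration is exponential, so it is meant for illustration only.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial


def shapley_by_permutations(players, v, a):
    """Average marginal contribution of `a` over all orderings of the players."""
    total = Fraction(0)
    for sigma in permutations(players):
        before = frozenset(sigma[:sigma.index(a)])  # players preceding a
        total += v(before | {a}) - v(before)
    return total / factorial(len(players))


def shapley_by_subsets(players, v, a):
    """Weighted sum over the subsets B of the players other than `a`."""
    others = [p for p in players if p != a]
    n = len(players)
    total = Fraction(0)
    for k in range(len(others) + 1):
        weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        for B in combinations(others, k):
            B = frozenset(B)
            total += weight * (v(B | {a}) - v(B))
    return total


# Hypothetical toy game: the wealth is 1 once any two players cooperate.
def toy_wealth(coalition):
    return 1 if len(coalition) >= 2 else 0


players = ["p1", "p2", "p3"]
print(shapley_by_permutations(players, toy_wealth, "p1"))
print(shapley_by_subsets(players, toy_wealth, "p1"))
```

In this symmetric game a player contributes exactly when it is placed second, so both functions return 1/3, and the three values add up to the total wealth of 1.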

3 Shapley Value of Database Facts

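Let $\alpha$ be a numerical query over a schema $\mathbf{S}$, and let $D$ be a database over $\mathbf{S}$, with $D_{\mathsf{x}}$ and $D_{\mathsf{n}}$ denoting its exogenous and endogenous facts. We wish to quantify the contribution of every endogenous fact $f \in D_{\mathsf{n}}$ in the result $\alpha(D)$. For that, we view $\alpha$ as a cooperative game over $D_{\mathsf{n}}$, where the value of every subset $E$ of $D_{\mathsf{n}}$ is $\alpha(E \cup D_{\mathsf{x}})$.

[Shapley Value of Facts] Let $\mathbf{S}$ be a schema, $\alpha$ a numerical query, $D$ a database, and $f$ an endogenous fact of $D$. The Shapley value of $f$ for $\alpha$, denoted $\mathrm{Shapley}(D, \alpha, f)$, is the value $\mathrm{Shapley}(A, v, a)$ as given in (2), where:

  • $A = D_{\mathsf{n}}$;

  • $v(E) = \alpha(D_{\mathsf{x}} \cup E) - \alpha(D_{\mathsf{x}})$ for all $E \subseteq A$;

  • $a = f$.

That is, $\mathrm{Shapley}(D, \alpha, f)$ is the Shapley value of $f$ in the cooperative game that has the endogenous facts as the set of players and values each team by the quantity it adds to $\alpha(D_{\mathsf{x}})$.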

As a special case, if is a Boolean query, then is the same as the value . In this case, the corresponding cooperative game takes the values and , and the Shapley value then coincides with the Shapley-Shubik index [31]. Some fundamental properties of the Shapley value [32] are reflected here as follows:

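  • $\mathrm{Shapley}(D, c \cdot \alpha + c' \cdot \beta, f) = c \cdot \mathrm{Shapley}(D, \alpha, f) + c' \cdot \mathrm{Shapley}(D, \beta, f)$.

  • $\alpha(D) = \alpha(D_{\mathsf{x}}) + \sum_{f \in D_{\mathsf{n}}} \mathrm{Shapley}(D, \alpha, f)$.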
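As an illustration of this definition, the following brute-force sketch treats the endogenous facts as players and evaluates a Boolean query as a 0/1 function. Everything concrete here is hypothetical: the tuple encoding of facts, the tiny database over relations `R` and `S`, and the query `q`, which plays the role of a Boolean CQ of the form R(x), S(x, y).

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley_of_fact(exo, endo, query, f):
    """Shapley value of endogenous fact f, with the endogenous facts as players."""
    others = [g for g in endo if g != f]
    m = len(endo)
    value = Fraction(0)
    for k in range(m):
        weight = Fraction(factorial(k) * factorial(m - k - 1), factorial(m))
        for E in combinations(others, k):
            base = set(exo) | set(E)
            # The constant query(exo) cancels in the marginal contribution.
            value += weight * (query(base | {f}) - query(base))
    return value


def q(db):
    """Hypothetical Boolean CQ: some R(x) joins with some S(x, y); returns 0 or 1."""
    r_values = {fact[1] for fact in db if fact[0] == "R"}
    return int(any(fact[0] == "S" and fact[1] in r_values for fact in db))


exo = {("S", "a", "1"), ("S", "b", "2")}   # exogenous facts
endo = [("R", "a"), ("R", "b"), ("R", "c")]  # endogenous facts (the players)

for fact in endo:
    print(fact, shapley_of_fact(exo, endo, q, fact))
```

Here `R(c)` joins with nothing and gets value zero, while `R(a)` and `R(b)` are interchangeable and split the total gain of 1 equally, which matches the property that the values sum to the query result on the full database minus its result on the exogenous facts alone.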

Note that is defined for a general numerical query . The definition is immediately extendible to queries with grouping (producing tuples of database constants and numbers [6]), where we would measure the responsibility of for an answer tuple and write something like . In that case, we treat every group as a separate numerical query. We believe that focusing on numerical queries (without grouping) allows us to keep the presentation considerably simpler while, at the same time, retaining the fundamental challenges. ∎

In the remainder of this section, we illustrate the Shapley value on our running example.

We begin with a Boolean CQ, and specifically from Example 2. Recall that the endogenous facts correspond to the authors. As Ellen has no publications, her addition to any where does not change the satisfaction of . Hence, its Shapley value is zero: . The fact changes the query result if it is either the first fact in the permutation, or it is the second fact after . There are permutations that satisfy the first condition, and permutations that satisfy the second. The contribution of to the query result is one in each of these permutations, and zero otherwise. Therefore, we have . The same argument applies to , and , and so, . We get the same numbers for , since every paper is mentioned in the Citations relation. Note that the value of the query on the database is , and it holds that ; hence, the second fundamental property of the Shapley value mentioned above is satisfied.

While Alice, Bob, Cathy and David have the same Shapley value for , things change if we consider the relation pub endogenous as well: the Shapley value of Alice and Cathy will be higher than Bob’s and David’s values, since they have more publications. Specifically, the fact , for example, will change the query result if and only if at least one of or appears earlier in the permutation, and no pair among , , , and appears earlier than . By rigorous counting, we can show that there are: such sets of size one, such sets of size two, such sets of size three, such sets of size four, such sets of size five, such sets of size six, and such sets of size seven. Therefore, the Shapley value of is:

We can similarly compute the Shapley value for the rest of the authors, concluding that and . Hence, the Shapley value is the same for Alice and Cathy, who have two publications each, and lower for Bob and David, that have only one publication. ∎

The following example, taken from Salimi et al. [30], illustrates the Shapley value on (Boolean) graph reachability. Consider the following database defined via the relation symbol .

Here, we assume that all edges are endogenous facts. Let be the Boolean query (definable in, e.g., Datalog) that determines whether there is a path from to . Let us calculate for different edges . Intuitively, we expect to have the highest value since it provides a direct path from to , while contributes to a path only in the presence of , and enables a path only in the presence of both and . We show that, indeed, it holds that .

To illustrate the calculation, observe that there are subsets of that do not contain , and among them, the subsets that satisfy are the supersets of and . Hence, we have that (the detailed computation is in the appendix). A similar reasoning shows that , and that for . ∎

Lastly, we consider aggregate functions over conjunctive queries.

We consider the queries , , and from Example 2. Ellen has no publications; hence, for . The contribution of is the same in every permutation ( for and for ) since Alice is the single author of two published papers that have a total of citations. Hence, and . The total number of citations of Cathy’s papers is also ; however, Bob and David are her coauthors on paper C. Hence, if the fact appears before and in a permutation, its contribution to the query result is for and for , while if appears after at least one of or in a permutation, its contribution is for and for . Clearly, appears before both and in one-third of the permutations. Thus, we have that and . Using similar computations we obtain that and .

Hence, the Shapley value of Alice, who is the single author of two papers with a total of citations, is higher than the Shapley value of Cathy who also has two papers with a total of citations, but shares one paper with other authors. Bob and David have the same Shapley value, since they share a single paper, and this value is the lowest among the four, as they have the lowest number of papers and citations.

Finally, consider . The contribution of in this case depends on the maximum value before adding in the permutation (which can be , or ). For example, if is the first fact in the permutation, its contribution is since . If appears after , then its contribution is , since whenever . We have that , and (we omit the computations here). We see that the Shapley value of is much higher than the rest, since Alice significantly increases the maximum value when added to any prefix. If the number of citations of paper increases to , then , hence lower. This is because the next highest value is closer; hence, the contribution of diminishes. ∎

4 Complexity Results

In this section, we give complexity results on the computation of the Shapley value of facts. We begin with exact evaluation for Boolean CQs (Section 4.1), then move on to exact evaluation on aggregate-relational queries (Section 4.2), and finally discuss approximate evaluation (Section 4.3). In the first two parts we restrict the discussion to CQs without self joins, and leave the problems open in the presence of self joins. However, the approximate treatment in the third part covers the general class of CQs (and beyond).

4.1 Boolean Conjunctive Queries

In this section, we investigate the problem of computing the (exact) Shapley value w.r.t. a Boolean CQ without self joins. Our main result in this section is a full classification of (i.e., a dichotomy in) the data complexity of the problem. As we show, the classification criterion is the same as that of query evaluation over tuple-independent probabilistic databases [9]: hierarchical CQs without self joins are tractable, and non-hierarchical ones are intractable.

Let $q$ be a Boolean CQ without self joins. If $q$ is hierarchical, then $\mathrm{Shapley}(D, q, f)$ can be computed in polynomial time, given $D$ and $f$. Otherwise, the problem is $\mathrm{FP}^{\#\mathrm{P}}$-complete.

Recall that $\mathrm{FP}^{\#\mathrm{P}}$ is the class of functions computable in polynomial time with an oracle to a $\#\mathrm{P}$-complete problem (e.g., counting the number of satisfying assignments of a propositional formula). This complexity class is considered intractable, and is known to be above the polynomial hierarchy (Toda’s theorem [34]).

Consider the query from Example 2. This query is hierarchical; hence, by Theorem 4.1, can be calculated in polynomial time, given and . On the other hand, the query is not hierarchical. Thus, Theorem 4.1 asserts that computing is -complete.∎

In the rest of this section, we discuss the proof of Theorem 4.1. While the tractability condition is the same as that of Dalvi and Suciu [9], it is not clear whether and/or how we can use their dichotomy to prove ours, in each of the two directions (tractability and hardness). The difference is mainly in that they deal with a random subset of probabilistically independent (endogenous) facts, whereas we reason about random permutations over the facts. In the next section, we discuss the algorithm for computing the Shapley value in the hierarchical case, and in the subsequent section, we discuss the proof of hardness for the non-hierarchical case.

Tractability side

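Let $D$ be a database, let $f$ be an endogenous fact, and let $q$ be a Boolean query. The computation of $\mathrm{Shapley}(D, q, f)$ easily reduces to the problem of counting the $k$-sets (i.e., sets of size $k$) of endogenous facts that, along with the exogenous facts, satisfy $q$. More formally, the reduction is to the problem of computing $|\mathrm{Sat}(D, q, k)|$, where $\mathrm{Sat}(D, q, k)$ is the set of all subsets $E$ of $D_{\mathsf{n}}$ such that $|E| = k$ and $(D_{\mathsf{x}} \cup E) \models q$. The reduction is as follows, where we denote $m = |D_{\mathsf{n}}|$ and slightly abuse the notation by viewing $q$ as a 0/1-numerical query, where $q(D_0) = 1$ if and only if $D_0 \models q$ for every database $D_0$.

$$\mathrm{Shapley}(D, q, f) = \sum_{E \subseteq D_{\mathsf{n}} \setminus \{f\}} \frac{|E|! \, (m - |E| - 1)!}{m!} \big( q(D_{\mathsf{x}} \cup E \cup \{f\}) - q(D_{\mathsf{x}} \cup E) \big) = \sum_{k=0}^{m-1} \frac{k! \, (m - k - 1)!}{m!} \big( |\mathrm{Sat}(D', q, k)| - |\mathrm{Sat}(D \setminus \{f\}, q, k)| \big)$$

In the last expression, $D'$ is the same as $D$, except that $f$ is viewed as exogenous instead of endogenous. Hence, to prove the positive side of Theorem 4.1, it suffices to show the following.

Let $q$ be a hierarchical Boolean CQ without self joins. There is a polynomial-time algorithm for computing the number $|\mathrm{Sat}(D, q, k)|$ of subsets $E$ of $D_{\mathsf{n}}$ such that $|E| = k$ and $(D_{\mathsf{x}} \cup E) \models q$, given $D$ and $k$.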
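The reduction to counting can be sketched as follows: the Shapley value of a fact is assembled from counts of satisfying subsets of each size, once with the fact treated as exogenous and once with the fact removed. The counting here is done by naive enumeration, not by the polynomial-time algorithm for hierarchical queries; the names `count_sat` and `shapley_via_counts`, the fact encoding, and the toy database are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def count_sat(exo, endo, query, k):
    """Number of size-k subsets E of endo such that exo together with E satisfies query."""
    return sum(query(set(exo) | set(E)) for E in combinations(endo, k))


def shapley_via_counts(exo, endo, query, f):
    """Combine per-size counts with the weights k! (m-k-1)! / m!."""
    others = [g for g in endo if g != f]
    m = len(endo)
    value = Fraction(0)
    for k in range(m):
        weight = Fraction(factorial(k) * factorial(m - k - 1), factorial(m))
        with_f = count_sat(set(exo) | {f}, others, query, k)  # f viewed as exogenous
        without_f = count_sat(exo, others, query, k)          # f removed altogether
        value += weight * (with_f - without_f)
    return value


def q(db):
    """Hypothetical Boolean CQ: some R(x) joins with some S(x, y); returns 0 or 1."""
    r_values = {fact[1] for fact in db if fact[0] == "R"}
    return int(any(fact[0] == "S" and fact[1] in r_values for fact in db))


exo = {("S", "a", "1"), ("S", "b", "2")}
endo = [("R", "a"), ("R", "b"), ("R", "c")]
print(shapley_via_counts(exo, endo, q, ("R", "a")))
```

For the fact `R(a)` the three sizes contribute 1/3, 1/6 and 0, giving 1/2; a polynomial-time counter for hierarchical queries could replace `count_sat` without changing the outer loop.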

We prove Theorem 4.1 in the Appendix by showing an algorithm for computing $|\mathrm{Sat}(D, q, k)|$. As expected for a hierarchical query, our algorithm is a recursive procedure that acts differently in three different cases: (a) $q$ has no variables (only constants), (b) there is a variable (called a root variable) that occurs in all atoms of $q$, or (c) $q$ consists of two (or more) sub-queries that do not share any variables. Since $q$ is hierarchical, at least one of these cases always applies [10]. The algorithm is fairly straightforward, except for case (b) where there is a root variable, and then we combine the recursive call with dynamic programming.

Hardness side

We now sketch the proof of the negative side of Theorem 4.1. (The complete proof is in the Appendix.) Membership in $\mathrm{FP}^{\#\mathrm{P}}$ is straightforward, so we omit the discussion on that. Similarly to Dalvi and Suciu [9], our proof of hardness consists of two steps. First, we prove the $\mathrm{FP}^{\#\mathrm{P}}$-hardness of computing $\mathrm{Shapley}(D, q_{\mathsf{RST}}, f)$, where $q_{\mathsf{RST}}$ is given in (1). Second, we reduce the computation of $\mathrm{Shapley}(D, q_{\mathsf{RST}}, f)$ to the problem of computing $\mathrm{Shapley}(D, q, f)$ for any non-hierarchical CQ $q$ without self joins. The second step is the same as that of Dalvi and Suciu [9], so we do not discuss it here. Hence, in what follows, we focus on the first step, the hardness of computing $\mathrm{Shapley}(D, q_{\mathsf{RST}}, f)$, as stated next by Lemma 4.1. The proof, which we discuss after the lemma, is considerably more involved than the corresponding proof of Dalvi and Suciu [9] that computing the probability of $q_{\mathsf{RST}}$ in a tuple-independent probabilistic database (TID) is $\mathrm{FP}^{\#\mathrm{P}}$-hard.

Computing $\mathrm{Shapley}(D, q_{\mathsf{RST}}, f)$ is $\mathrm{FP}^{\#\mathrm{P}}$-complete.

Figure 2: Constructions in the reduction of the proof of Lemma 4.1. Relations $R$ and $T$ consist of endogenous facts and $S$ consists of exogenous facts.

The proof of Lemma 4.1 is by a (Turing) reduction from the problem of computing the number of independent sets of a given bipartite graph , which is the same (via immediate reductions) as the problem of computing the number of satisfying assignments of a bipartite monotone -DNF formula, which we denote by . Dalvi and Suciu [9] also proved the hardness of (for the problem of query evaluation over TIDs) by reduction from . Their reduction is a simple construction of a single input database, followed by a multiplication of the query probability by a number. It is not at all clear to us how such an approach can work in our case and, indeed, our proof is more involved. Our reduction takes the general approach that Dalvi and Suciu [10] used (in a different work) for proving that the CQ is hard over TIDs: solve several instances of the problem for the construction of a full-rank set of linear equations. The problem itself, however, is quite different from ours. This general technique has also been used by Aziz et al. [2] for proving the hardness of computing the Shapley value for a matching game on unweighted graphs, which is again quite different from our problem.

In more detail, the idea is as follows. Given an input bipartite graph for which we wish to compute , we construct different input instances , for , of the problem of computing , where . Each instance provides us with an equation over the numbers of independent sets of size in for . We then show that the set of equations constitutes a non-singular matrix that, in turn, allows us to extract the in polynomial time (e.g., via Gaussian elimination). This is enough, since .

Our reduction is illustrated in Figure 2. Given the graph (depicted in the leftmost part), we construct graphs by adding new vertices and edges to . For each such graph, we build a database that contains an endogenous fact for every left vertex, an endogenous fact for every right vertex, and an exogenous fact for every edge. In each constructed database , the fact represents a new left node, and we compute . In , the node of is connected to every right vertex. We use to compute a specific value that we refer to later on. For , the database is obtained from by adding and facts of new right nodes, all connected to . We show the following for all .

where is a value computed using , and is a constant that depends on . From these equations we extract a system of equations over variables (i.e., ), where each stands for .

By an elementary algebraic manipulation of , we obtain the matrix with the coefficients that Bacher [3] proved to be non-singular (and, in fact, that is its determinant). We then solve the system as discussed earlier to obtain .

4.2 Aggregates over Conjunctive Queries

Next, we study the complexity of aggregate-relational queries, where the internal relational query is a CQ. We begin with hardness. The following theorem generalizes the hardness side of Theorem 4.1 and states that it is $\mathrm{FP}^{\#\mathrm{P}}$-complete to compute $\mathrm{Shapley}(D, \alpha, f)$ whenever $\alpha$ is of the form $\gamma[q]$, as defined in Section 2, and $q$ is a non-hierarchical CQ without self joins. The only exception is when $\alpha$ is a constant numerical query (i.e., $\alpha(D) = \alpha(D')$ for all databases $D$ and $D'$); in that case, $\mathrm{Shapley}(D, \alpha, f) = 0$ always holds.

Let be a fixed aggregate-relational query where is a non-hierarchical CQ without self joins. Computing , given and , is -complete, unless is constant. For instance, it follows from Theorem 4.2 that, whenever is a non-hierarchical CQ without self joins, it is -complete to compute the Shapley value for the aggregate-relational queries , , , and , unless for all databases and tuples . Additional examples follow.

Consider the numerical query from Example 2. Since is not hierarchical, Theorem 4.2 implies that computing is -complete. Actually, computing is -complete for any non-constant aggregate-relational query over . Hence, computing the Shapley value w.r.t.