The Complexity of Causality and Responsibility for Query Answers and non-Answers

09/10/2010 ∙ by Alexandra Meliou, et al. ∙ University of Washington

An answer to a query has a well-defined lineage expression (alternatively called how-provenance) that explains how the answer was derived. Recent work has also shown how to compute the lineage of a non-answer to a query. However, the cause of an answer or non-answer is a more subtle notion and consists, in general, of only a fragment of the lineage. In this paper, we adapt Halpern, Pearl, and Chockler's recent definitions of causality and responsibility to define the causes of answers and non-answers to queries, and their degree of responsibility. Responsibility captures the notion of degree of causality and serves to rank potentially many causes by their relative contributions to the effect. Then, we study the complexity of computing causes and responsibilities for conjunctive queries. It is known that computing causes is NP-complete in general. Our first main result shows that all causes to conjunctive queries can be computed by a relational query which may involve negation. Thus, causality can be computed in PTIME, and very efficiently so. Next, we study computing responsibility. Here, we prove that the complexity depends on the conjunctive query and demonstrate a dichotomy between PTIME and NP-complete cases. For the PTIME cases, we give a non-trivial algorithm, consisting of a reduction to the max-flow computation problem. Finally, we prove that, even when it is in PTIME, responsibility is complete for LOGSPACE, implying that, unlike causality, it cannot be computed by a relational query.


1 Introduction

When analyzing complex data sets, users are often interested in the reasons for surprising observations. In a database context, they would like to find the causes of answers or non-answers to their queries. For example, “What caused my personalized newscast to have more than 50 items today?” Or, “What caused my favorite undergrad student to not appear on the Dean’s list this year?” Philosophers have debated various notions of causality for centuries, and today it is still studied in philosophy, AI, and cognitive science. Understanding causality in a broad sense is of vital practical importance, for example in determining legal responsibility in multi-car accidents, in diagnosing malfunction of complex systems, or in scientific inquiry. A formal, mathematical study of causality was initiated by the recent work of Halpern and Pearl [13] and Chockler and Halpern [5], who gave mathematical definitions of causality and its related notion of degree of responsibility. These formal definitions have led to applications in knowledge representation and model checking [9, 10, 5]. In this paper, we adapt the notions of causality and responsibility to database queries, and study the complexity of computing the causes and their responsibilities for answers and non-answers to conjunctive queries.

Figure 1: A SQL query returning the genres of all movies directed by Burton, on the IMDB dataset (www.imdb.org). The famous director Tim Burton is known for dark, gothic themes, so the genres Fantasy and Horror are expected. But the genres Music and Musical are quite surprising. The goal of this paper is to find the causes for surprising query results.
Figure 2: Lineage (a) and causes with their responsibilities (b) for the Musical tuple in Example 1.
(a) Lineage from Directors and Movies of the Musical tuple
(b) Responsibility rankings for Musical

Responsibility  Answer tuple
0.33            Movie(526338, “Sweeney Todd”, 2007)
0.33            Director(23456, David, Burton)
0.33            Director(23468, Humphrey, Burton)
0.33            Director(23488, Tim, Burton)
0.25            Movie(359516, “Let’s Fall in Love”, 1933)
0.25            Movie(565577, “The Melody Lingers On”, 1935)
0.20            Movie(6539, “Candide”, 1989)
0.20            Movie(173629, “Flight”, 1999)
0.20            Movie(389987, “Manon Lescaut”, 1997)

Example 1 (IMDB)

Tim Burton is an Oscar nominated director whose movies often include fantasy elements and dark, gothic themes. Examples of his work are “Edward Scissorhands”, “Beetlejuice” and the recent “Alice in Wonderland”. A user wishes to learn more about Burton’s movies and queries the IMDB dataset to find out all genres of movies that he has directed (see Figure 1). Fantasy and Horror are quite expected categories. But Music and Musical are surprising. The user wishes to know the reason for these answers. Examining the lineage of a surprising answer is a first step towards finding its reason, but it is not sufficient: the combined lineage of the two categories consists of a total of 137 base tuples, which is overwhelming to the user.

Causality is related to provenance, yet it is a more refined notion: Causality can answer questions like the one in our example by returning the causes of query results ranked by their degree of responsibility. Our starting point is Halpern and Pearl’s definition of causality [13], from which we borrow three important concepts:

(1) Partitioning of variables into exogenous and endogenous: Exogenous variables define a context determined by external, unconcerned factors, deemed not to be possible causes, while endogenous variables are the ones judged to affect the outcome and are thus potential causes. In a database setting, variables are tuples in the database, and the first step is to partition them into exogenous and endogenous. For example, we may consider Director and Movie tuples as endogenous and all others as exogenous. The classification into endogenous/exogenous is application-dependent, and may even be chosen by the user at query time. For example, if erroneous data in the directors table is suspected, then only Director may be declared endogenous; alternatively, the user may choose only Movie tuples with year2008 to be endogenous, for example in order to find recent, or under production movies that may explain the surprising outputs to the query. Thus, the partition into endogenous and exogenous tuples is not restricted to entire relations. As a default, the user may start by declaring all tuples in the database as endogenous, then narrow down.

(2) Contingencies: an endogenous tuple $t$ is a cause for the observed outcome only if there is a hypothetical setting of the other endogenous variables under which the addition/removal of $t$ causes the observed outcome to change. Therefore, in order to check that a tuple is a cause for a query answer, one has to find a set of endogenous tuples (called a contingency) to remove from (or add to) the database, such that the tuple immediately affects the answer in the new state of the database. In theory, in order to compute the contingency one has to iterate over subsets of endogenous tuples. Not surprisingly, checking causality is NP-complete in general [9]. However, the first main result in this paper is to show that the causality of conjunctive queries can be determined in PTIME, and furthermore, all causes can be computed by a relational query.

(3) Responsibility, a notion first defined in [5], measures the degree of causality as a function of the size of the smallest contingency set. In applications involving large datasets, it is critical to rank the candidate causes by their responsibility, because answers to complex queries may have large lineages and large numbers of candidate causes. In theory, in order to compute the responsibility one has to iterate over all contingency sets: not surprisingly, computing responsibility in general is hard for $\mathrm{FP}^{\Sigma_2^p[\log n]}$ [5] (this is the class of functions computable by a poly-time Turing machine which makes $O(\log n)$ queries to a $\Sigma_2^p$ oracle). However, our second main result, and at the same time the strongest result of this paper, is a dichotomy result for conjunctive queries: for each query without self-joins, either its responsibility can be computed in PTIME in the size of the database (using a non-obvious algorithm), or checking if it has a responsibility below a given value is NP-hard.

Example (IMDB continued)

Continuing Example 1, we show in Fig. 2b the causes for Musical ranked by their responsibility score. (We explain in Sect. 2 how these scores are computed.) At the top of the list is the movie “Sweeney Todd”, which is, indeed, the one and only musical movie directed by Tim Burton. Thus, this tuple represents a surprising fact in the data of great interest to the user. The next three tuples in the list are directors whose last name is Burton. These tuples too are of high interest to the user because they indicate that the query was ambiguous. It is equally interesting to look at the bottom of the ranked list. The movie “Manon Lescaut” was made by Humphrey Burton, a far less known director who specializes in musicals. Clearly, the movie itself is not an interesting explanation to the user; the interesting explanation is the director, showing that he happens to have the same last name, and indeed, the director is ranked higher while the movie is (correctly) ranked lower. In our simple example Musical has a small lineage, consisting of only ten tuples. More typically, the lineage can be much larger (Music has a lineage with 127 tuples), and it is critical to rank the potential causes by their degree of responsibility.

We start by adapting the Halpern and Pearl definition of causality (HP from now on) to database queries, based on contingency sets. We define causality and responsibility both for Why-So queries (“why did the query return this answer?”) and for Why-No queries (“why did the query not return this answer?”). We then prove two fundamental results. First, we show that computing the causes to any conjunctive query can be done in PTIME in the size of the database, i.e. query causality has PTIME data complexity; by contrast, causality of arbitrary Boolean expressions is NP-complete [9]. In fact we prove something stronger: the set of all causes can be retrieved by a query expressed in First Order Logic (FO). This has important practical consequences, because it means that one can retrieve all causes to a conjunctive query by simply running a certain SQL query. In general, the latter cannot be a conjunctive query, but must have one level of negation. However, we show that if the user query has no self joins and every table is either entirely endogenous or entirely exogenous, then the Why-So causes can be retrieved by some conjunctive query. These results are summarized in Fig. 3.

Second, we give a dichotomy theorem for query responsibility. This is the strongest technical result of this paper. For every conjunctive query without self-joins, one of the following holds: either the responsibility can be computed in PTIME or it is provably NP-hard. In the first case, we give a non-obvious algorithm for computing the degrees of responsibility using Ford–Fulkerson’s max-flow algorithm. We further show that one can distinguish between the two cases by checking a property of the query expression that we call linearity. We also discuss conjunctive queries with self-joins, and finally show that, in the case of Why-No causality, one can always compute responsibility in PTIME. These results are also summarized in Fig. 3.

                      Why So?      Why No?
Causality
  w/o SJ              PTIME (CQ)   PTIME (FO)
  with SJ             PTIME (FO)   PTIME (FO)
Responsibility
  w/o SJ, linear      PTIME        PTIME
  w/o SJ, non-linear  NP-hard      PTIME
  with SJ             NP-hard      PTIME
Figure 3: Complexity of determining causality and responsibility for conjunctive queries. For queries with no self-joins we provide a complete dichotomy result. Queries with self-joins are NP-hard in general, but a similar dichotomy is not known.

Causality and provenance: Causality is related to lineage of query results, such as why-provenance [7] or where-provenance [2]. Recently, even explanations for non-answers have been described in terms of lineage [15, 3]. We make use of this prior work because the first step in computing causes and responsibilities is to determine the lineage of an answer or non-answer to a query. We note, however, that computing the lineage of an answer is only the first step, and is not sufficient for determining causality: causality needs to be established through a contingency set, and is also accompanied by a degree (the responsibility), which are both more difficult to compute than the lineage.

Contributions and outline. Our three main contributions are:


  • We define Why-So and Why-No causality and responsibility for conjunctive database queries (Sect. 2).

  • We prove that causality has PTIME data complexity for conjunctive queries (Sect. 3).

  • We prove a dichotomy theorem for responsibility and conjunctive queries (Sect. 4).

We review related work (Sect. 5) before we conclude (Sect. 6). All proofs are provided in the Appendix.

2 Query Cause and Responsibility

We assume a standard relational schema with relation names $R_1, \ldots, R_k$. We write $D$ for a database instance and $q$ for a query. We consider only conjunctive queries, unless otherwise stated. A subset of tuples $D^n \subseteq D$ represents endogenous tuples; the complement $D^x = D - D^n$ is called the set of exogenous tuples. For each relation $R_i$, we write $R_i^n$ and $R_i^x$ to denote the endogenous and exogenous tuples in $R_i$ respectively. If $\bar a$ is a tuple with the same arity as the query’s answer, then we write $D \models q(\bar a)$ when $\bar a$ is an answer to $q$ on $D$, and write $D \not\models q(\bar a)$ when $\bar a$ is a non-answer to $q$ on $D$.

Definition (Causality)

Let $t \in D^n$ be an endogenous tuple, and $\bar a$ a possible answer for $q$.

  • $t$ is called a counterfactual cause for $\bar a$ in $D$ if $D \models q(\bar a)$ and $D - \{t\} \not\models q(\bar a)$.

  • $t$ is called an actual cause for $\bar a$ if there exists a set $\Gamma \subseteq D^n$, called a contingency for $t$, such that $t$ is a counterfactual cause for $\bar a$ in $D - \Gamma$.

A tuple $t$ is a counterfactual cause if, by removing it from the database, we remove $\bar a$ from the answer. The tuple is an actual cause if one can find a contingency under which it becomes a counterfactual cause: more precisely, one has to find a set $\Gamma$ such that, after removing $\Gamma$ from the database, we bring it to a state where removing/inserting $t$ causes $\bar a$ to switch between an answer and a non-answer. Obviously, every counterfactual cause is also an actual cause, by taking $\Gamma = \emptyset$. The definition of causality extends naturally to the case when the query is Boolean: in that case, a counterfactual cause is a tuple that, when removed, causes $q$ to become false.
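The two notions can be checked by brute force on a small instance. The following is a minimal sketch, not the paper's method: the query `q :- R(x, y), S(y)`, the tuple encoding, and the helper names `is_counterfactual` and `smallest_contingency` are all assumptions made for illustration, and the search is exponential in the number of endogenous tuples.

```python
from itertools import combinations

def q(db):
    """Hypothetical Boolean query q :- R(x, y), S(y).
    Tuples are encoded as ("R", x, y) and ("S", y)."""
    s_values = {t[1] for t in db if t[0] == "S"}
    return any(t[0] == "R" and t[2] in s_values for t in db)

def is_counterfactual(query, db, t):
    # t is counterfactual if the query holds on db and fails once t is removed.
    return query(db) and not query(db - {t})

def smallest_contingency(query, db, endogenous, t):
    """Return a smallest contingency set for t, or None if t is not an actual cause.
    Only endogenous tuples may be placed in a contingency."""
    if t not in endogenous:
        return None
    others = sorted(endogenous - {t})
    for k in range(len(others) + 1):          # try smaller sets first
        for gamma in combinations(others, k):
            if is_counterfactual(query, db - set(gamma), t):
                return set(gamma)
    return None

# Assumed toy instance: two independent witnesses for the query.
db = {("R", "a", "b"), ("R", "c", "d"), ("S", "b"), ("S", "d")}
print(smallest_contingency(q, db, db, ("S", "b")))
```

On this instance `S(b)` is not counterfactual on its own, because the witness `R(c, d), S(d)` keeps the query true; it becomes counterfactual once that second witness is broken, which is what the returned contingency does.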

Example

Consider the query on the following database instance, and assume all tuples are endogenous: , . Consider the answer . The tuple is a counterfactual cause for this result, because if we remove this tuple from then is no longer an answer. Now consider the answer . Tuple is not a counterfactual cause: if we remove it from , is still an answer. But is an actual cause with contingency : once we remove we reach a state where is still an answer, but further removing makes a non-answer.

           R X Y     S Y X

For a more subtle example, consider the Boolean query (where is a constant), which is true on the given instance. Suppose only the first three tuples in are endogenous, and the last two are exogenous: . Let’s examine whether is a cause for the query being true. This tuple is not an actual cause. This is because is not a contingency for : by removing from the database we make the query false, in other words the tuple makes no difference, under any contingency. Notice that is not contingency because is exogenous.

In this paper we discuss two instantiations of query causality. In the first, called Why-So causality, we are given an actual answer $\bar a$ to the query, and would like to find the cause(s) for this answer. The definition of causality above is given for Why-So causality. In this case $D$ is the real database, and the endogenous tuples $D^n$ are a given subset, while the exogenous tuples are $D^x = D - D^n$. In the second instantiation, called Why-No causality, we are given a non-answer $\bar a$ to the query, i.e. we would like to know why $\bar a$ is not an answer. This requires some minor changes to the definition. Now the real database consists entirely of exogenous tuples, $D^x$. In addition, we are given a set of potentially missing tuples, whose absence from the database caused $\bar a$ to be a non-answer: these form the endogenous tuples, $D^n$, and we denote $D = D^x \cup D^n$. We do not discuss in this paper how to compute $D^n$: this has been addressed in recent work [15]. In this setting, the definition of Why-No causality is the dual of the Why-So definition, and we give it here briefly: a counterfactual cause for the non-answer $\bar a$ in $D^x$ is a tuple $t \in D^n$ s.t. $D^x \not\models q(\bar a)$ and $D^x \cup \{t\} \models q(\bar a)$; an actual cause for the non-answer $\bar a$ is a tuple $t \in D^n$ s.t. there exists a set $\Gamma \subseteq D^n$, called a contingency set, s.t. $t$ is a counterfactual cause for the non-answer of $\bar a$ in $D^x \cup \Gamma$.

We now define responsibility, measuring the degree of causality.

Definition (Responsibility)

Let $\bar a$ be an answer or non-answer to a query $q$, and let $t$ be a cause (either a Why-So or a Why-No cause). The responsibility of $t$ for the (non-)answer is:

$$\rho_t = \frac{1}{1 + \min_{\Gamma} |\Gamma|}$$

where $\Gamma$ ranges over all contingency sets for $t$.

Thus, the responsibility is a function of the minimal number of tuples that we need to remove from the real database (in the case of Why-So), or that we need to add to the real database (in the case of Why-No) before $t$ becomes counterfactual. The tuple $t$ is a counterfactual cause iff $\rho_t = 1$, and it is an actual cause iff $\rho_t > 0$. By convention, if $t$ is not a cause, $\rho_t = 0$.

Example (IMDB continued)

Fig. 2a shows the lineage of the answer Musical in Example 1. Consider the movie “Sweeney Todd”: its responsibility is $1/3 \approx 0.33$ because the smallest contingency is {Director(David, Burton), Director(Humphrey, Burton)} (if we remove both directors, then “Sweeney Todd” becomes counterfactual). Consider now the movie “Manon Lescaut”: its responsibility is $1/5 = 0.20$ because the smallest contingency set is {Director(David, Burton), Movie(“Flight”), Movie(“Candide”), Director(Tim, Burton)}.
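The scores in the ranking of Fig. 2 can be reproduced with a brute-force search over contingency sets. This is a sketch under stated assumptions: the lineage below pairs each movie with one Burton director, which is inferred from the text and from the scores (the pairing of the two 1930s movies with David Burton is an assumption), all other tuples are treated as exogenous, and the function names are hypothetical.

```python
from itertools import combinations

def holds(lineage, removed):
    # The answer survives if some conjunct keeps all of its tuples.
    return any(not (conjunct & removed) for conjunct in lineage)

def responsibility(lineage, t):
    """1 / (1 + size of a smallest contingency for t), or 0.0 if t is not a cause."""
    others = sorted(set().union(*lineage) - {t})
    for k in range(len(others) + 1):
        for gamma in combinations(others, k):
            g = set(gamma)
            # Contingency: answer holds without g, and fails once t is also removed.
            if holds(lineage, g) and not holds(lineage, g | {t}):
                return 1 / (1 + k)
    return 0.0

# Assumed lineage of Musical: one conjunct per (director, movie) pair.
lineage = [
    frozenset({"Tim", "Sweeney Todd"}),
    frozenset({"David", "Let's Fall in Love"}),
    frozenset({"David", "The Melody Lingers On"}),
    frozenset({"Humphrey", "Candide"}),
    frozenset({"Humphrey", "Flight"}),
    frozenset({"Humphrey", "Manon Lescaut"}),
]

for name in ["Sweeney Todd", "David", "Let's Fall in Love", "Manon Lescaut"]:
    print(name, round(responsibility(lineage, name), 2))
```

Under these assumptions the search yields contingencies of size 2, 2, 3 and 4 for the four tuples, matching the 0.33, 0.33, 0.25 and 0.20 entries of the ranking.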

We now define formally the problems studied in this paper. Let $D = D^x \cup D^n$ be a database consisting of endogenous and exogenous tuples, $q$ be a query, and $\bar a$ be a potential answer to the query.

Causality problem

Compute the set of actual causes for the answer $\bar a$.

Responsibility problem

For each actual cause $t$, compute its responsibility $\rho_t$.

We study the data complexity in this paper: the query $q$ is fixed, and the complexity is a function of the size of the database instance $D$. In the rest of the paper we restrict our discussion w.l.o.g. to Boolean queries: if $q(\bar x)$ is not Boolean, then to compute the causes or responsibilities for an answer $\bar a$ it suffices to compute the causes or responsibilities of the Boolean query $q[\bar a / \bar x]$, where all head variables are substituted with the constants in $\bar a$.

3 Complexity of Causality

We start by proving that causality can be computed efficiently; even stronger, we show that causes can be computed by a relational query. This is in contrast with the general causality problem, where Eiter [9] has shown that deciding causality for a Boolean expression is NP-complete. We obtain tractability by restricting our queries to conjunctive queries. Chockler et al. [6] have shown that causality for “read once” Boolean circuits is in PTIME. Our results are strictly stronger: for the case of conjunctive queries without self-joins, queries with read-once lineage expressions are precisely the hierarchical queries [8, 20], while our results apply to all conjunctive queries. The results in this section apply uniformly to both Why-So and Why-No causality, so we will simply refer to causality without specifying which kind. Also, we restrict our discussion to Boolean queries only.

We write positive Boolean expressions in DNF, like ; sometimes we drop , and write . A conjunct is redundant if there exists another conjunct that is a strict subset of . Redundant conjuncts can be removed without affecting the Boolean expression. In our example, is redundant, because it strictly contains ; it can be removed and simplifies to . A positive DNF is satisfiable if it has at least one conjunct; otherwise it is equivalent to false and we call it unsatisfiable.

Next, we review the definition of lineage. Fix a Boolean conjunctive query $q$ consisting of $m$ atoms, $g_1, \ldots, g_m$, and a database instance $D$; recall that $D = D^x \cup D^n$ (exogenous and endogenous tuples). For every tuple $t \in D$, let $X_t$ denote a distinct Boolean variable associated to that tuple. A valuation for $q$ is a mapping $\theta : Var(q) \to Adom(D)$, where $Adom(D)$ is the active domain of the database, such that the instantiation of every atom is a tuple in the database: $\theta(g_i) \in D$ for $i = 1, \ldots, m$. We associate to the valuation $\theta$ the following conjunct: $\Phi_\theta = X_{\theta(g_1)} \wedge \cdots \wedge X_{\theta(g_m)}$. The lineage of $q$ is:

$$\Phi = \bigvee_{\theta} \Phi_\theta$$

We will assume w.l.o.g. that $D \models q$ and $D^x \not\models q$ (otherwise we have no causes).

Definition (n-lineage)

The n-lineage of $q$ is:

$$\Phi^n = \Phi[X_t := \text{true}, \forall t \in D^x]$$

Here $\Phi[X_t := \text{true}, \forall t \in D^x]$ means substituting $X_t$ with true, for all Boolean variables $X_t$ corresponding to exogenous tuples $t \in D^x$. Thus, the n-lineage is obtained as follows. Compute the standard lineage, over all tuples (exogenous and endogenous), then set to true all exogenous tuples: the remaining expression depends only on endogenous tuples. The following technical result allows us to compute the causes to answers of conjunctive queries.

Theorem 3.1 (Causality)

Let $q$ be a conjunctive query, and $t$ be an endogenous tuple. Then the following three conditions are equivalent:

  1. $t$ is an actual cause for $q$ (by the definition of causality in Sect. 2).

  2. There exists a set of tuples $\Gamma \subseteq D^n$ such that the lineage $\Phi[X_u := \text{false}, \forall u \in \Gamma]$ is satisfiable, and $\Phi[X_u := \text{false}, \forall u \in \Gamma \cup \{t\}]$ is unsatisfiable.

  3. There exists a non-redundant conjunct in the n-lineage $\Phi^n$ that contains the variable $X_t$.

We give the proof in the Appendix. The theorem gives a PTIME algorithm for computing all causes of $q$: compute the n-lineage as described above, and remove all redundant conjuncts. All tuples that still occur in the lineage are actual causes of $q$.
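The procedure stated after the theorem can be sketched in a few lines. The representation is an assumption made for illustration: a lineage in DNF is a list of sets of tuple identifiers, exogenous tuples are given as a set, and `actual_causes` is a hypothetical helper name.

```python
def actual_causes(lineage, exogenous):
    """Compute actual causes from a DNF lineage (a list of sets of tuple ids).

    Step 1: build the n-lineage by setting exogenous variables to true,
            i.e. dropping them from every conjunct.
    Step 2: discard redundant conjuncts (strict supersets of another conjunct).
    Step 3: every tuple left in some conjunct is an actual cause.
    """
    n_lineage = {frozenset(c) - frozenset(exogenous) for c in lineage}
    minimal = [c for c in n_lineage if not any(d < c for d in n_lineage)]
    causes = set()
    for c in minimal:
        causes |= c
    return causes

# Assumed toy lineage: (r1 AND s1) OR (r2 AND s1) OR (r3 AND s2), with r1 exogenous.
lineage = [{"r1", "s1"}, {"r2", "s1"}, {"r3", "s2"}]
print(actual_causes(lineage, exogenous={"r1"}))
```

Here the conjunct `{r2, s1}` becomes redundant once `r1` is set to true, because `{s1}` alone is then a conjunct, so `r2` is not reported as a cause. If some conjunct consists only of exogenous tuples, the n-lineage contains the empty conjunct, every other conjunct is redundant, and the sketch returns no causes, consistent with the assumption that the exogenous tuples alone do not satisfy the query.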

Example

Consider over the database of Sect. 2. Its lineage is . Assume is exogenous and , are endogenous. Then the n-lineage is obtained by setting : . After removing the redundant conjunct, the n-lineage becomes ; hence, is the only actual cause for the query.

In the rest of this section we prove a stronger result. Denote the set of actual causes in the relation ; that is, , and every tuple is an actual cause. We show that can be computed by a relational query. In particular, this means that the causes to a (non-)answer can be computed by a SQL query, and therefore can be performed entirely in the database system.

Theorem 3.2 (Causality FO)

Given a Boolean query over relations , the set of all causes of can be expressed in non-recursive stratified Datalog with negation, with only two strata.

Theorem 3.2 shows that causes can be expressed in a language equivalent to a subset of first order logic [1] and that, moreover, only one level of negation is needed. The proof is in the Appendix.

Example

Continuing with the query from Sect. 3), suppose all tuples in are endogenous. Thus, we have , but . The complete Datalog program that produces the causes for is:

The role of is to remove redundant terms from the lineage. To see this, consider the database , , and assume that , , thus, ’s lineage and n-lineage are:

Thus, the only actual cause of is . Consider , which computes causes in . Without the negated term , would return (which would be incorrect). The role of the negated term is to remove the redundant terms in : in our example, returns the empty set (which is correct). Similarly, one can check that returns . Note that negation is necessary in because it is non-monotone: if we remove the tuple from the database then becomes a cause for the query , thus is non-monotone. Hence, in general, we must use negation in order to compute causes.

Example

Consider , and assume that is endogenous and is exogenous: in other words, , . The following Datalog program computes all causes:

Here, too, we can prove that is non-monotone and, hence, must use negation. Consider the database instance , . Then is not a cause; but if we remove , then becomes a cause.

As the previous examples show, the causality query is, in general, a non-monotone query: by inserting more tuples in the database, we determine some tuples to no longer be causes. Thus, negation is necessary in order to express . The following corollary gives a sufficient condition for the causality query to simplify to a conjunctive query.

Corollary

Suppose that each relation is either endogenous or exogenous (that is, either or ). Further, suppose that, if is endogenous, then the relation symbol occurs at most once in the query . Then, for each relation name , the causal query is a single conjunctive query (in particular it has no negation).

The two examples above show that the corollary is tight: the first shows that causality is non-monotone when a relation is mixed endogenous/exogenous, and the second shows that causality is non-monotone when the query has self-joins, even if all relations are either endogenous or exogenous.

To illustrate the corollary, we revisit Sect. 3, where the query is , and assume that and . Then the Datalog program becomes:

4 Complexity of Responsibility

In this section, we study the complexity of computing responsibility. As before, we restrict our discussion to Boolean queries. Thus, given a Boolean query and an endogenous tuple , compute its responsibility (Sect. 2). We say that the query is in PTIME if there exists a PTIME algorithm that, given a database and a tuple computes the value ; we say that the query is NP-hard, or simply hard, if the problem “given a database instance and a number , check whether ” is NP-hard. The strongest result in this section and the paper is a dichotomy theorem for Why-So queries without self-joins: for every query, computing the responsibility is either in PTIME or NP-hard (Sect. 4.1). The case of non-answers (Why-No) turns out to be a simpler problem as Sect. 4.2 shows.

4.1 Why So?

We assume that the conjunctive query is without self-joins, i.e. every relation occurs at most once in ; we discuss self-joins briefly at the end of the section. W.l.o.g. we further assume that each relation is either fully endogenous or exogenous ( or ). Recall that computing the Why-So responsibility of a tuple requires computing the smallest contingency set , such that is a counterfactual cause in . We start by giving three hard queries, which play an important role in the dichotomy result.

Theorem 4.1 (Canonical Hard Queries)

Each of the following three queries is NP-hard:

If the type of a relation is not specified, then the query remains hard whether the relation is endogenous or exogenous.

We give the proof in the Appendix: we prove the hardness of and directly, and that of by using a particular reduction from . Chockler and Halpern [5] have already shown that computing responsibility for Boolean circuits is hard, in general. One may interpret our theorem as strengthening that result somewhat by providing three specific queries whose responsibility is hard. However, the theorem is much more significant. We show in this section that every query that is hard can be proven to be hard by a simple reduction from one of these three queries.

Next, we illustrate PTIME queries, and start with a trivial example where is a constant. If , then its minimum contingency is simply the set of all tuples with , and one can compute ’s responsibility by simply counting these tuples. Thus, is in PTIME. We next give a much more subtle example.

Example (Ptime Query)

Let $q \text{ :- } R(x, y), S(y, z)$, let both $R$ and $S$ be endogenous, and w.l.o.g. let $t$ be a tuple in $R$. We show how to compute the size of the minimal contingency set for $t$ with a reduction to the max-flow/min-cut problem in a network. Given the database instance $D$, construct the network illustrated in Fig. 4. Its vertices are partitioned into $V_1 \cup \cdots \cup V_5$. $V_1$ contains the source, which is connected to all nodes in $V_2$. There is one edge from $V_2$ to $V_3$ for every tuple in $R$, and one edge from $V_3$ to $V_4$ for every tuple in $S$. Finally, every node in $V_4$ is connected to the target, in $V_5$. Set the capacity of all edges from the source or into the target to $\infty$. The other capacities will be described shortly. Recall that a cut in a network is a set of edges that disconnects the source from the target. A min-cut is a cut of minimum capacity, and the capacity of a min-cut can be computed in PTIME using Ford-Fulkerson’s algorithm. Now we make an important observation: any min-cut $\Gamma$ in the network corresponds to a set of tuples in the database $D$, such that $q$ is false on $D - \Gamma$. (In other words, the min-cut cannot include the extra edges connected to the source or the target, as they have infinite capacity.) We use this fact to compute the responsibility of $t$ as follows. First, set the capacity of $t$ to 0, and that of all other tuples in $D$ to 1. Then, repeat the following procedure for every path $p$ from the source to the target that goes through $t$: set the capacities of all edges in $p - \{t\}$ to $\infty$, compute the size of the min-cut, and reset their capacities back to 1. (In our example, $p - \{t\}$ contains a single other edge, namely a tuple in $S$; for longer queries it may contain additional edges, hence we refer to edges in the plural.) In Fig. 4 there are two such paths (the figure shows the capacities set for the first of them). We claim that for every min-cut $\Gamma$, the set $\Gamma - \{t\}$ is a contingency set for $t$. Indeed, $q$ is false on $D - \Gamma$ because the source is disconnected from the target, and $q$ is true on $(D - \Gamma) \cup \{t\}$, because once we add $t$ back, it will join with the other edges in $p - \{t\}$. Note that $\Gamma$ cannot include these edges, as their capacity is $\infty$. Thus, by repeating for all paths $p$ (of which there are at most $|S|$), we can compute the size of the minimal contingency set as $\min_p |\Gamma_p|$, where $\Gamma_p$ is the min-cut computed for path $p$.

Figure 4: Flow transformation for .
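The flow construction of this example can be sketched as follows. Everything here is an illustrative assumption rather than the paper's algorithm listing: the query is taken to be a two-atom chain `R(x, y), S(y, z)` with both relations endogenous and the tuple of interest in `R`, the node and function names are hypothetical, and max-flow is computed with a small breadth-first (Edmonds-Karp) variant of Ford-Fulkerson.

```python
from collections import defaultdict, deque

INF = float("inf")

def max_flow(cap, source, sink):
    """Edmonds-Karp max-flow on a residual capacity map cap[u][v] (mutated in place)."""
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:      # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        bottleneck, v = INF, sink
        while parent[v] is not None:             # find the path bottleneck
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:             # push flow along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

def responsibility(R, S, t):
    """Responsibility of tuple t in R for q :- R(x, y), S(y, z), all tuples endogenous."""
    best = None
    for partner in S:
        if partner[0] != t[1]:
            continue                             # this S tuple does not join with t
        cap = defaultdict(lambda: defaultdict(int))
        for (x, y) in R:
            cap["src"][("x", x)] = INF
            cap[("x", x)][("y", y)] = 0 if (x, y) == t else 1
        for (y, z) in S:
            # The rest of the chosen path gets infinite capacity so it is never cut.
            cap[("y", y)][("z", z)] = INF if (y, z) == partner else 1
            cap[("z", z)]["sink"] = INF
        cut = max_flow(cap, "src", "sink")
        best = cut if best is None else min(best, cut)
    return 0.0 if best is None else 1 / (1 + best)

R = {(1, 10), (2, 10), (3, 20)}
S = {(10, 100), (20, 200)}
print(responsibility(R, S, (1, 10)), responsibility(R, S, (3, 20)))
```

For the tuple `R(1, 10)` the min-cut has size 2 (one edge per remaining witness), and for `R(3, 20)` it has size 1 (the shared tuple `S(10, 100)` breaks both other witnesses). A tuple of `R` with no join partner in `S` lies on no path and gets responsibility 0.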

We next generalize the algorithm in Sect. 4.1 to the large class of linear queries. We need two definitions first.

Definition (Dual Query Hypergraph )

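The dual query hypergraph of a query $q \text{ :- } g_1, \ldots, g_m$ is a hypergraph with vertex set $V = \{g_1, \ldots, g_m\}$ and a hyperedge $E_x$ for each variable $x \in Var(q)$ such that $E_x = \{g_i \mid x \in Var(g_i)\}$.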

Note that nodes are the atoms, and edges are the variables. This is the “dual” of the standard query hypergraph [11], where nodes are variables and edges are atoms.

Figure 5: Dual query hypergraphs for (a) an easy (linear) query and (b) a hard query from Theorem 4.1.

Definition (Linear Query)

A hypergraph $H(V, E)$ is linear if there exists a total order of its vertex set $V$ such that every hyperedge $e \in E$ is a consecutive subsequence of that order. A query is linear if its dual hypergraph is linear.

In other words, a query is linear if its atoms can be ordered such that every variable appears in a continuous sequence of atoms. For example, the query in Fig. 5a is linear. Order the atoms as , and every variable appears in a continuous sequence, e.g.  occurs in . On the other hand, none of the queries in Theorem 4.1 is linear. For example, the dual hypergraph of is shown in Fig. 5b: one cannot “draw a line” through the vertices and stay inside hyperedges. Note that the definition of linearity ignores the endogenous/exogenous status of the atoms.
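Linearity can be tested directly from the definition by trying every ordering of the atoms. This is a brute-force sketch for small queries only; the encoding of a query as a map from atom names to variable sets, the function name `is_linear`, and the three example queries are assumptions made for illustration.

```python
from itertools import permutations

def is_linear(atoms):
    """atoms maps each atom name to its set of variables.
    Returns True if some ordering of the atoms makes every variable
    occur in a consecutive run of atoms."""
    names = list(atoms)
    variables = set().union(*atoms.values())
    for order in permutations(names):
        position = {a: i for i, a in enumerate(order)}
        consecutive = True
        for v in variables:
            idx = sorted(position[a] for a in names if v in atoms[a])
            # Consecutive positions span exactly len(idx) slots.
            if idx[-1] - idx[0] + 1 != len(idx):
                consecutive = False
                break
        if consecutive:
            return True
    return False

# Hypothetical queries: a chain, a star around one ternary atom, and a triangle.
chain = {"A": {"x"}, "R": {"x", "y"}, "S": {"y", "z"}}
star = {"A": {"x"}, "B": {"y"}, "C": {"z"}, "W": {"x", "y", "z"}}
triangle = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
print(is_linear(chain), is_linear(star), is_linear(triangle))
```

In the star query the atom `W` would need to sit next to three different atoms at once, and in the triangle all three pairs of atoms would need to be adjacent, so neither admits a suitable order, while the chain does.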

For every linear query, the responsibility of a tuple can be computed in PTIME by a flow-transformation algorithm. The algorithm essentially extends the construction given in the example above to arbitrary linear queries. Note that it treats endogenous and exogenous relations differently, through the weights it assigns to their tuples. Thus, we have:

Theorem 4.2 (Linear Queries)

For any linear query and any endogenous tuple , the responsibility of for can be computed in PTIME in the size of the database .


So far, Theorem 4.1 has described some hard queries, and Theorem 4.2 some PTIME queries. Neither class is complete, hence we do not yet have a dichotomy. To close the gap we need to work on both ends. We start by expanding the class of hard queries.

Definition (rewriting )

We define the following rewriting relation on conjunctive queries without self-joins: rewrites to , in notation , if can be obtained from by applying one of the following three rules:


  • Delete (): Here, denotes the query obtained by removing the variable , and thus decreasing the arity of all atoms that contained .

  • Add (): Here, denotes the query obtained by adding variable to all atoms that contain variable , and thus increasing their arity, provided there exists an atom in that contains both variables .

  • Delete (): Here, denotes an atom and denotes the query without the atom , provided that is exogenous, or there exists some other atom s.t. .
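The three rules can be sketched as operations on a small query representation. This is a hedged illustration, not the authors' code: the encoding (atom name mapped to a pair of variable set and endogenous flag) and all function names are hypothetical. The side condition of the third rule is elided in the text above; the sketch assumes it is a containment test between variable sets, and marks that assumption in the code.

```python
def delete_variable(query, x):
    """Rule 1: drop variable x from every atom (arity shrinks where x occurred)."""
    return {g: (vs - {x}, endo) for g, (vs, endo) in query.items()}


def add_variable(query, x, y):
    """Rule 2: add y to every atom containing x, if some atom has both x and y."""
    if not any({x, y} <= vs for vs, _ in query.values()):
        raise ValueError("no atom contains both variables")
    return {g: (vs | {y} if x in vs else vs, endo) for g, (vs, endo) in query.items()}


def delete_atom(query, g):
    """Rule 3: remove atom g if it is exogenous, or if (ASSUMED condition)
    the variable set of some other atom is contained in that of g."""
    vs_g, endo_g = query[g]
    contained = any(vs <= vs_g for h, (vs, _) in query.items() if h != g)
    if endo_g and not contained:
        raise ValueError("atom is endogenous and the side condition fails")
    return {h: atom for h, atom in query.items() if h != g}


# Hypothetical two-atom query; True marks an endogenous atom.
q = {"R": (frozenset("xy"), True), "S": (frozenset("yz"), True)}
q1 = add_variable(q, "y", "x")  # S gains x because R contains both x and y
q2 = delete_atom(q1, "S")       # allowed under the assumed containment test
print(sorted(q1["S"][0]), sorted(q2))
```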

Denote the transitive and reflexive closure of . We show that rewriting always reduces complexity:

Lemma (Rewriting)

If and is NP-hard, then is also NP-hard. In particular, is NP-hard if , where is one of the three queries in Theorem 4.1.

Example (Rewriting)

We illustrate how one can prove that the query is hard, by rewriting it to :

  (add )
  (add )
  (delete )
  (delete )

With rewriting we expanded the class of hard queries. Next we expand the class of PTIME queries. As notation, we say that two atoms of a conjunctive query are neighbors if they share a variable: .

Definition (Weakening )

We define the following weakening relation on conjunctive queries without self-joins: weakens to , in notation , if can be obtained from by applying one of the following two rules:


  • Dissociation If is an exogenous atom and a variable occurring in some of its neighbors, then let be obtained by adding to the variable set of (this increases its arity).

  • Domination If is an endogenous atom and there exists some other endogenous atom s.t. , then let be obtained by making exogenous, .
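The two weakening rules can be sketched in the same style. Again this is only an illustration under assumptions: the query encoding and function names are hypothetical, and the domination condition, which is elided in the text above, is assumed here to require another endogenous atom whose variable set is contained in that of the dominated atom.

```python
def dissociate(query, g, v):
    """Dissociation: add variable v to exogenous atom g, where v must occur
    in some neighbor of g (an atom sharing at least one variable with g)."""
    vs_g, endo_g = query[g]
    if endo_g:
        raise ValueError("dissociation applies to exogenous atoms only")
    neighbors = [vs for h, (vs, _) in query.items() if h != g and vs & vs_g]
    if not any(v in vs for vs in neighbors):
        raise ValueError("variable does not occur in any neighbor")
    weakened = dict(query)
    weakened[g] = (vs_g | {v}, False)
    return weakened


def dominate(query, g):
    """Domination: make endogenous atom g exogenous if (ASSUMED condition)
    another endogenous atom has its variable set contained in that of g."""
    vs_g, endo_g = query[g]
    dominated = any(
        endo and vs <= vs_g for h, (vs, endo) in query.items() if h != g
    )
    if not (endo_g and dominated):
        raise ValueError("atom is not a dominated endogenous atom")
    weakened = dict(query)
    weakened[g] = (vs_g, False)
    return weakened


# Hypothetical triangle-shaped query whose middle atom S is exogenous.
q = {
    "R": (frozenset("xy"), True),
    "S": (frozenset("yz"), False),
    "T": (frozenset("zx"), True),
}
print(sorted(dissociate(q, "S", "x")["S"][0]))  # ['x', 'y', 'z']
```

After the dissociation, every variable occurs in a contiguous run of the order R, S, T, so the weakened sample query satisfies the linearity condition.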

Intuitively, a minimum contingency never needs to contain tuples from a dominated relation, and thus the relation is effectively exogenous. Along the lines of Sect. 4.1, we show the following for weakening:

Lemma (Weakening)

If and is in PTIME, then is also in PTIME.

Thus, weakening allows us to expand the class of PTIME queries. We denote the transitive and reflexive closure of . We say that a query is weakly linear if there exists a weakening s.t. is linear. Obviously, every linear query is also weakly linear.

Corollary (Weakly Linear Queries)

If is weakly linear, then it is in PTIME.

Sect. 4.1 is based on the simple observation that a weakening produces a query over a database instance that produces the same output tuples as query on database instance . Weakening only affects exogenous and dominated atoms, which are not part of minimum contingencies, and therefore responsibility remains unaffected. This also implies an algorithm for computing responsibility in the case of weakly linear queries: find a weakening of that is linear and apply Algorithm LABEL:alg:flowTransform.

Example

We illustrate the lemma with two examples. First, we show that is in PTIME by weakening with a dissociation:

  (dissociation)

The latter is linear. Query should be contrasted with in Theorem 4.1: the only difference is that here is exogenous, and this causes to be in PTIME while is NP-hard. Second, consider . Here we weaken with a domination followed by a dissociation:

(domination)
(dissociation)

The latter is linear with the linear order .

We say that a query is final if it is not weakly linear and for every rewriting , the rewritten query is weakly linear. For example, each of in Theorem 4.1 is final: one can check that if we try to apply any rewriting to, say, we obtain a linear query. We can now state our main technical result:

Theorem 4.3 (Final Queries)

If is final, then is one of .

This is by far the hardest technical result in this paper. We give the proof in the appendix. Here, we show how to use it to prove the dichotomy result.

Corollary (Responsibility Dichotomy)

Let be any conjunctive query without self joins. Then:


  • If is weakly linear then is in PTIME.

  • If is not weakly linear then it is NP-hard.

If is weakly linear then it is in PTIME by Sect. 4.1. Suppose is not weakly linear. Consider any sequence of rewritings . Any such sequence must terminate, since every rewriting results in a simpler query. We rewrite as long as is not weakly linear and stop at the last query that is not weakly linear. That means that any further rewriting results in a weakly linear query . In other words, is a final query. By Theorem 4.3, is one of . Thus, we have proven , for some . By Sect. 4.1, the query is NP-hard.

Extensions. We have shown in Sect. 3 that causality can be computed with a relational query. This raises the question: if the responsibility of a query is in PTIME, can we somehow compute it in SQL? We answer this negatively:

Theorem 4.4 (Logspace)

Computing the Why-So responsibility of a tuple is hard for logspace for the following query:

Finally, we add a brief discussion of queries with self-joins. Here we establish the following result:

Proposition (self-joins)

Computing the responsibility of a tuple for is NP-hard. The same holds if one replaces with .

We include the proof in the appendix. Beyond this result, however, queries with self-joins are harder to analyze, and we do not yet have a full dichotomy. In particular, we leave open the complexity of the query .

4.2 Why No?

While the complexity of Why-So responsibility turned out to be quite difficult to analyze, the Why-No responsibility is much easier. This is because, for any query with subgoals and non-answer , any contingency set for a tuple will have at most tuples. Since is independent of the size of the database, we obtain the following:

Theorem 4.5 (Why-No responsibility)

Given a query over a database instance and a non-answer , computing the responsibility of over is in PTIME.
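The bounded-contingency argument can be illustrated with a brute-force search. The sketch below is an assumption-laden illustration, not the paper's procedure. It assumes the Why-No setting of the earlier sections: the real database is exogenous, the endogenous tuples are candidate insertions, and a contingency is a set of candidates whose insertion leaves the query false while additionally inserting the tuple makes it true. It also assumes responsibility is 1/(1 + size of a smallest contingency). The function name, the `holds` predicate, and the `max_size` parameter (standing in for the query-dependent bound) are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def why_no_responsibility(holds, database, candidates, t, max_size):
    """Search contingencies of size 0..max_size among the candidate tuples.

    For a fixed max_size the number of subsets tried is polynomial in the
    number of candidates, which mirrors the PTIME argument.
    """
    others = [c for c in candidates if c != t]
    for k in range(max_size + 1):
        for gamma in combinations(others, k):
            extended = set(database) | set(gamma)
            if not holds(extended) and holds(extended | {t}):
                return Fraction(1, 1 + k)
    return Fraction(0)  # t is not a cause within the size bound


def q(db):
    """Hypothetical Boolean query q :- R(x), S(x) over tagged tuples."""
    return any(("R", a) in db and ("S", a) in db for (_, a) in db)


missing = [("R", 1), ("S", 1)]
print(why_no_responsibility(q, set(), missing, ("R", 1), max_size=1))  # 1/2
```

In the sample run, inserting R(1) alone does not make the query true; it becomes counterfactual only after S(1) is inserted, giving a contingency of size one.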

5 Related Work

Our work is mainly related to and unifies ideas from work on causality, provenance, and query result explanations.

Causality. Causality is an active research area mainly in logic and philosophy with its own dedicated workshops (e.g. [23]). The idea of counterfactual causality (if X had not occurred, Y would not have occurred) can be traced back to Hume [19], and the best known counterfactual analysis of causation in modern times is due to Lewis [16]. Halpern and Pearl [13] define a variation they call actual causality which relies upon a graph structure called a causal network, and adds the crucial concept of a permissive contingency before determining causality. Chockler and Halpern [5] define the degree of responsibility as a gradual way to assign causality. Our definitions of Why-So and Why-No causality and responsibility for conjunctive queries build upon the HP definition, but simplify it and do not require a causal network. A general overview of causality in a database context is given in [17], while [18] introduces functional causality as an improved, more robust version of the HP definition.

Provenance. Approaches for defining data provenance can be mainly classified into three categories: how-, why-, and where-provenance [2, 4, 7, 12]. We point to the close connection between why-provenance and Why-So causality: both definitions concern the same tuples if all tuples in a database are endogenous. (Note that why-provenance, also called minimal witness basis, defines a set of sets. To compare it with Why-So causality, we consider the union of tuples across those sets.) However, our work extends the notion of provenance by allowing users to partition the lineage tuples into endogenous and exogenous, and presenting a strategy for constructing a query to compute all causes. (Note that, in general, Why-So tuples are not identical to the subset of endogenous tuples in the why-provenance.) In addition, we can rank tuples according to their individual responsibilities, and determine a gradual contribution with counterfactual tuples ranked first.

Missing query results. Very recent work focuses on the problem of explaining missing query answers, i.e., why a certain tuple is not in the result set. The work by Huang et al. [15] and the Artemis [14] system present provenance for potential answers by providing tuple insertions or modifications that would yield the missing tuples. This is equivalent to providing the set of endogenous tuples for Why-No causality. Alternatively, Chapman and Jagadish [3] focus on the operator in the query plan that eliminated a specific tuple, and Tran and Chan [22] suggest an approach to automatically generate a modified query whose result includes both the original query’s results as well as the missing tuple.

Our definitions of Why-So and Why-No causality highlight the symmetry between the two types of provenance (“positive and negative provenance”). Instead of treating them separately, we show how to construct Datalog programs that compute all Why-So or Why-No tuple causes given a partitioning of tuples into endogenous and exogenous. Analogously, responsibility applies to both cases in a uniform manner.

6 Conclusions

In this paper, we introduce causality as a framework for explaining answers and non-answers in a database setting. We define two kinds of causality, Why-So for actual answers, and Why-No for non-answers, which are related to the provenance of answers and non-answers, respectively. We demonstrate how to retrieve all causes for an answer or non-answer using a relational query. We give a comprehensive complexity analysis of computing causes and their responsibilities for conjunctive queries: whereas causality is shown to be always in PTIME, we present a dichotomy for responsibility within queries without self-joins.

Acknowledgements. We would like to thank Benny Kimelfeld for identifying a bug in an earlier version of one of the proofs.

References

  • [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
  • [2] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, 2001.
  • [3] A. Chapman and H. V. Jagadish. Why not? In SIGMOD, 2009.
  • [4] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.
  • [5] H. Chockler and J. Y. Halpern. Responsibility and blame: A structural-model approach. J. Artif. Intell. Res. (JAIR), 22:93–115, 2004.
  • [6] H. Chockler, J. Y. Halpern, and O. Kupferman. What causes a system to satisfy a specification? ACM Trans. Comput. Log., 9(3), 2008.
  • [7] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., 25(2):179–227, 2000.
  • [8] N. Dalvi and D. Suciu. Management of probabilistic data: Foundations and challenges. In PODS, pages 1–12, Beijing, China, 2007. (invited talk).
  • [9] T. Eiter and T. Lukasiewicz. Complexity results for structure-based causality. Artif. Intell., 142(1):53–89, 2002. (Conference version in IJCAI, 2002).
  • [10] T. Eiter and T. Lukasiewicz. Causes and explanations in the structural-model approach: Tractable cases. Artif. Intell., 170(6-7):542–580, 2006.
  • [11] G. Gottlob, N. Leone, and F. Scarcello. The complexity of acyclic conjunctive queries. J. ACM, 48(3):431–498, 2001.
  • [12] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.
  • [13] J. Y. Halpern and J. Pearl. Causes and explanations: A structural-model approach. Part I: Causes. Brit. J. Phil. Sci., 56:843–887, 2005. (Conference version in UAI, 2001).
  • [14] M. Herschel, M. A. Hernández, and W. C. Tan. Artemis: A system for analyzing missing answers. PVLDB, 2(2):1550–1553, 2009.
  • [15] J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the provenance of non-answers to queries over extracted data. PVLDB, 1(1):736–747, 2008.
  • [16] D. Lewis. Causation. The Journal of Philosophy, 70(17):556–567, 1973.
  • [17] A. Meliou, W. Gatterbauer, J. Halpern, C. Koch, K. F. Moore, and D. Suciu. Causality in databases. IEEE Data Engineering Bulletin, Sept. 2010.
  • [18] A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. Why so? or Why no? Functional causality for explaining query answers. In MUD, 2010.
  • [19] P. Menzies. Counterfactual theories of causation. Stanford Encyclopedia of Philosophy, 2008.
  • [20] D. Olteanu and J. Huang. Secondary-storage confidence computation for conjunctive queries with inequalities. In SIGMOD, 2009.
  • [21] P. Senellart and G. Gottlob. On the complexity of deriving schema mappings from database instances. PODS, 2008.
  • [22] Q. T. Tran and C.-Y. Chan. How to conquer why-not questions. In SIGMOD, 2010.
  • [23] International multidisciplinary workshop on causality. IRIT, Toulouse, June 2009. (www.irit.fr/MICRAC/colloque/articles/extended_abstract_Micrac.pdf).

Appendix A Nomenclature

database instance
relations
is a fully endogenous relation
is a fully exogenous relation
set of endogenous tuples: for Why-So
set of exogenous tuples:
set of endogenous tuples in relation
contingency:
set of causes in relation
responsibility of tuple
Boolean variable associated with tuple
variables appearing in query
active domain
set of subgoals containing variable
lineage
endogenous lineage (n-lineage)
dual hypergraph
rewriting of a query
weakening of a query
canonical hard queries of Theorem 4.1

Appendix B Proofs Causality

Assume the lineage of in . We construct the endogenous lineage , and a DNF with all the non-redundant clauses of . We will show that a variable is a cause of , iff , which means that is part of a non-redundant clause in the endogenous lineage of .

Case A: (Why-So, answer ): First of all, if is not in , then is not a cause of , as there is no assignment that makes counterfactual for (and therefore ), because of monotonicity. If , where a clause of , we select ( since , ). Then, if we write , we know that , because contains only non-redundant terms. That means that every clause has at least one variable that is not in , and therefore can be negated by the above choice of . This makes counterfactual for (and also ) with contingency . Since is also counterfactual for with contingency , meaning that is satisfiable, and is unsatisfiable. Therefore, conditions 1, 2, and 3 are equivalent.

Case B: (Why-No, non-answer ): First of all, if is not in , then is not a cause of , as there is no assignment that makes counterfactual for (and therefore ), because of monotonicity. If , where a clause of , we select , and assign . since , . Then, if we write , we know that , because contains only non-redundant terms. That means that every clause has at least one variable that is not in , and therefore can be negated by the above choice of . This makes counterfactual for (and therefore ) with contingency . Since is also counterfactual for with contingency , meaning that is unsatisfiable, and is satisfiable. Therefore, conditions 1, 2, and 3 are equivalent.
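The characterization used in Case A (a tuple is a Why-So cause exactly when its variable occurs in a non-redundant clause of the endogenous lineage) can be sketched as follows. This is a minimal illustration under stated assumptions, not the relational query of the paper: the lineage is assumed to be a positive DNF given as a list of clauses (sets of tuple identifiers), and the endogenous lineage is assumed to be obtained by setting every exogenous variable to true, i.e. dropping it from each clause. The function name and sample identifiers are hypothetical.

```python
def why_so_causes(lineage, endogenous):
    """Return the tuples occurring in non-redundant clauses of the n-lineage.

    lineage: iterable of clauses, each a set of tuple identifiers (DNF).
    endogenous: set of endogenous tuple identifiers.
    """
    endo = frozenset(endogenous)
    # Assumed construction: exogenous variables are set to true (removed).
    n_lineage = {frozenset(clause) & endo for clause in lineage}
    # A clause is redundant if another clause is a strict subset of it.
    minimal = [c for c in n_lineage if not any(o < c for o in n_lineage)]
    return set().union(*minimal) if minimal else set()


# Hypothetical lineage: r1*s1 + r1*s1*t1 + r2*x1, with x1 exogenous.
lineage = [{"r1", "s1"}, {"r1", "s1", "t1"}, {"r2", "x1"}]
print(sorted(why_so_causes(lineage, {"r1", "s1", "t1", "r2"})))
# ['r1', 'r2', 's1']
```

In the sample, t1 occurs only in a clause absorbed by r1*s1, so removing t1 can never change the truth value of the lineage and it is not reported as a cause. If some clause consists only of exogenous tuples, the empty clause absorbs all others and no cause is returned.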

Proof (Theorem 3.2). To describe the relational query we need a number of technical definitions. Recall that denote the endogenous/exogenous tuples in . Given a Boolean conjunctive query we define a refinement to be a query of the form , where each . Thus, every atom is made either exogenous or endogenous, and we call it an n- or an x-atom; there are refinements. Clearly, is logically equivalent to the union of the refinements, and its lineage is equivalent to the disjunction of the lineages of all refinements. Consider any refinement . We call a variable an n-variable if it occurs in at least one n-atom. We repeatedly apply the following two operations: (1) choose two n-variables and substitute ; (2) choose any n-variable and any constant occurring in the query and substitute . We call any query that can be obtained by applying these operations any number of times an image query; in particular, the refinement itself is a trivial image. There are strictly fewer than images, where is the total number of n-variables and constants in the query. Note that is bounded by query size and thus irrelevant to data complexity. We always minimize an image query.
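As a small illustration of the first definition, the refinements of a query can be enumerated by labeling each atom independently as an n-atom or an x-atom, which gives two choices per atom. The sketch below is hypothetical in its encoding (atoms are plain names, labels are the strings "n" and "x") and covers only the enumeration of refinements, not images or n-embeddings.

```python
from itertools import product


def refinements(atoms):
    """Yield every labeling of the atoms as endogenous ("n") or exogenous ("x")."""
    for labels in product("nx", repeat=len(atoms)):
        yield tuple(zip(atoms, labels))


# Hypothetical three-atom query: 2 * 2 * 2 = 8 refinements.
for refinement in refinements(["R", "S", "T"]):
    print(" ".join(f"{atom}^{label}" for atom, label in refinement))
```

The number of refinements depends only on the query, not on the database, which is why the enumeration does not affect data complexity.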

Fix a refinement . We define an n-embedding for as a function that maps a strict subset of the n-atoms in onto all n-atoms of , where is the image of a possibly different refinement