Computing Optimal Repairs for Functional Dependencies

12/20/2017 ∙ by Ester Livshits, et al.

We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtained by a minimum number of tuple deletions, and an optimal update repair (optimal U-repair) that is obtained by a minimum number of value (cell) updates. For computing an optimal S-repair, we present a polynomial-time algorithm that succeeds on certain sets of FDs and fails on others. We prove the following about the algorithm. When it succeeds, it can also incorporate weighted tuples and duplicate tuples. When it fails, the problem is NP-hard, and in fact, APX-complete (hence, cannot be approximated better than some constant). Thus, we establish a dichotomy in the complexity of computing an optimal S-repair. We present general analysis techniques for the complexity of computing an optimal U-repair, some based on the dichotomy for S-repairs. We also draw a connection to a past dichotomy in the complexity of finding a "most probable database" that satisfies a set of FDs with a single attribute on the left hand side; the case of general FDs was left open, and we show how our dichotomy provides the missing generalization and thereby settles the open problem.


1 Introduction

Database inconsistency arises in a variety of scenarios and for different reasons. For instance, data may be collected from imprecise sources (social encyclopedias/networks, sensors attached to appliances, cameras, etc.) via imprecise procedures (natural-language processing, signal processing, image analysis, etc.). Inconsistency may arise when integrating databases of different organizations with conflicting information, or even consistent information in conflicting formats. Arenas et al. [5] introduced a principled approach to managing inconsistency via the notions of repairs and consistent query answering. An inconsistent database is a database $D$ that violates integrity constraints, a repair is a consistent database $D'$ obtained from $D$ by a minimal sequence of operations, and the consistent answers to a query are the answers given in every repair $D'$.

Instantiations of the repair framework differ in their definitions of integrity constraints, operations, and minimality [1]. Common types of constraints are denial constraints [18] that include the classic functional dependencies (FDs), and inclusion dependencies [11] that include the referential (foreign-key) constraints. An operation can be a deletion of a tuple, an insertion of a tuple, and an update of an attribute (cell) value. Minimality can be either local—no strict subset of the operations achieves consistency, or global—no smaller (or cheaper) subset achieves consistency. For example, if only tuple deletions are allowed, then a subset repair [12] corresponds to a local minimum (restoring any deleted tuple causes inconsistency) and a cardinality repair [27] corresponds to a global minimum (consistency cannot be gained by fewer tuple deletions). The cost of operations may differ between tuples; this can represent different levels of trust that we have in the tuples [27, 24].

In this paper, we focus on global minima under FDs via tuple deletions and value updates. Each tuple is associated with a weight that determines the cost of its deletion or a change of a single value. We study the complexity of computing a minimum repair in two settings: (a) only tuple deletions are allowed, that is, we seek a (weighted) cardinality repair, and (b) only value updates are allowed, that is, we seek what Kolahi and Lakshmanan [24] refer to as an “optimum V-repair.” We refer to the two challenges as computing an optimal subset repair (optimal S-repair) and computing an optimal update repair (optimal U-repair).

The importance of computing an optimal repair arises in the challenge of data cleaning [17], that is, eliminating errors and dirt (manifested as inconsistencies) from the database. Specifically, our motivation is twofold. The obvious motivation is in fully automated cleaning, where an optimal repair is the best candidate, assuming the system is aware of only the constraints and tuple weights. The second motivation comes from the more realistic practice of iterative, human-in-the-loop cleaning [9, 6, 13, 19]. There, the cost of the optimal repair can serve as an educated estimate for the extent to which the database is dirty and, consequently, the amount of effort needed for completion of cleaning.

As our integrity constraints are FDs, it suffices to consider a database with a single relation, which we call here a table. In a general database, our results can be applied to each relation individually. A table $T$ conforms to a relational schema $R(A_1, \dots, A_k)$ where each $A_i$ is an attribute. Integrity is determined by a set $\Delta$ of FDs. Our complexity analysis focuses primarily on data complexity, where $R(A_1, \dots, A_k)$ and $\Delta$ are considered fixed and only $T$ is considered input. Hence, we have infinitely many optimization problems, one for each combination of $R(A_1, \dots, A_k)$ and $\Delta$. Table records have identifiers, as we wish to be able to determine easily which cells are updated in a repair. Consequently, we allow duplicate tuples (with distinct identifiers).

We begin with the problem of computing an optimal S-repair. The problem is known to be computationally hard for denial constraints [27]. As we discuss later, complexity results can be inferred from prior work [20] for FDs with a single attribute on the left-hand side (lhs for short). For general FDs, we present the algorithm $\mathsf{OptSRepair}$ (Algorithm 1). The algorithm seeks opportunities for simplifying the problem by eliminating attributes and FDs, until no FDs are left (and then the problem is trivial). For example, if all FDs share an attribute $A$ on the left-hand side, then we can partition the table according to $A$ and solve the problem separately on each partition, this time ignoring $A$. We refer to this simplification as “common lhs.” Two additional simplifications are the “consensus” and “lhs marriage.” Importantly, the algorithm terminates in polynomial time, even under combined complexity.

However, $\mathsf{OptSRepair}$ may fail by reaching a nonempty set of FDs where no simplification can be applied. We prove two properties of the algorithm. The first is soundness: if the algorithm succeeds, then it returns an optimal S-repair. More interesting is the property of completeness: if the algorithm fails, then the problem is NP-hard. In fact, in this case the problem is APX-complete, that is, for some $\alpha > 1$ it is NP-hard to find a consistent subset with a cost lower than $\alpha$ times the minimum, but some $\alpha' > 1$ is achievable in polynomial time. Moreover, the problem remains APX-complete if we assume that the table does not contain duplicates and all tuples have a unit weight (in which case we say that $T$ is unweighted). Consequently, we establish the following dichotomy in complexity for the space of combinations of schemas $R$ and FD sets $\Delta$.

  • If we can eliminate all FDs in $\Delta$ with the three simplifications, then an optimal S-repair can be computed in polynomial time using $\mathsf{OptSRepair}$.

  • Otherwise, the problem is APX-complete, even for unweighted tables without duplicates.

We then continue to the problem of computing an optimal U-repair. Here we do not establish a full dichotomy, but we make a substantial progress. We have found that proving hardness results for updates is far more subtle than for deletions. We identify conditions where the complexity of computing an optimal U-repair and that of computing an optimal S-repair coincide. One such condition is the common lhs (i.e., all FDs share a left-hand-side attribute). Hence, in this case, our dichotomy provides the precise test of tractability. We also show decomposition techniques that extend the opportunities of using the dichotomy. As an example, consider and . We can decompose this problem into and , and consider each , for , independently. The complexity of each is the same in both variants of optimal repairs, and so, polynomial time. Yet, these results do not cover all sets of FDs. For example, let . Kolahi and Lakshmanan [24] proved that under , computing an optimal U-repair is NP-hard. Our dichotomy shows that it is also NP-hard (and also APX-complete) to compute an S-repair under . Yet, this FD set does not fall in our coincidence cases.

The above defined is an example where an optimal U-repair can be computed in polynomial time, but computing an optimal S-repair is APX-complete. We also show an example in the reverse direction, namely . This FD set falls in the positive side of our dichotomy for optimal S-repairs, but computing an optimal U-repair is APX-complete. The proof of APX-hardness is inspired by, but considerably more involved than, the hardness proof of Kolahi and Lakshmanan [24] for .

Finally, we consider approximate repairing. For the case of an optimal S-repair, the problem easily reduces to that of weighted vertex cover, and hence, we get a polynomial-time 2-approximation due to Bar-Yehuda and Even [7]. To approximate optimal U-repairs, we show an efficient reduction to S-repairs, where the loss in approximation is linear in the number of attributes. Hence, we obtain a constant-ratio approximation, where the constant has a linear dependence on the number of attributes. Kolahi and Lakshmanan [24] also gave an approximation for optimal U-repairs, but their worst-case approximation can be quadratic in the number of attributes. We show an infinite sequence of FD sets where this gap is actually realized. On the other hand, we also show an infinite sequence where our approximation is linear in the number of attributes, but theirs remains constant. Hence, in general, the two approximations are incomparable, and we can combine the two by running both approximations and taking the best.

Stepping outside the framework of repairs, a different approach to data cleaning is probabilistic [28, 4, 20]. The idea is to define a probability space over possible clean databases, where the probability of a database is determined by the extent to which it satisfies the integrity constraints. The goal is to find a most probable database that, in turn, serves as the clean outcome. As an instantiation, Gribkoff, Van den Broeck, and Suciu [20] identify probabilistic cleaning as the “Most Probable Database” problem (MPD): given a tuple-independent probabilistic database [14, 30] and a set of FDs, find the most probable database among those satisfying the FDs (or, put differently, condition the probability space on the FDs). They show a dichotomy for unary FDs (i.e., FDs with a single attribute on the left hand side). The case of general (not necessarily unary) FDs has been left open. It turns out that there are reductions from MPD to computing an optimal S-repair and vice versa. Consequently, we are able to generalize their dichotomy to all FDs, and hence, fully settle the open problem.

2 Preliminaries

We first present some basic terminology and notation that we use throughout the paper.

2.1 Schemas and Tables

An instance of our data model is a single table where each tuple is associated with an identifier and a weight that states how costly it is to change or delete the tuple. Such a table corresponds to a relation schema that we denote by $R(A_1, \dots, A_k)$, where $R$ is the relation name and $A_1$, …, $A_k$ are distinct attributes. We say that $R$ is $k$-ary since it has $k$ attributes. When there is no risk of confusion, we may refer to $R(A_1, \dots, A_k)$ by simply $R$.

We use capital letters from the beginning of the English alphabet (e.g., $A$, $B$, $C$), possibly with subscripts and/or superscripts, to denote individual attributes, and capital letters from the end of the English alphabet (e.g., $X$, $Y$, $Z$), possibly with subscripts and/or superscripts, to denote sets of attributes. We follow the convention of writing sets of attributes without curly braces and without commas (e.g., $ABC$).

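We assume a countably infinite domain $\mathsf{Val}$ of attribute values. By a tuple we mean a sequence of values in $\mathsf{Val}$. A table $T$ over $R(A_1, \dots, A_k)$ has a collection $\mathrm{ids}(T)$ of (tuple) identifiers and it maps every identifier $i$ to a tuple in $\mathsf{Val}^k$ and a positive weight; we denote this tuple by $T[i]$ and this weight by $w_T(i)$. For $i \in \mathrm{ids}(T)$ we refer to $T[i]$ as a tuple of $T$. We denote by $T[*]$ the set of all tuples of $T$. We say that $T$ is:

  • duplicate free if distinct tuples disagree on at least one attribute, that is, $T[i] \neq T[j]$ whenever $i \neq j$;

  • unweighted if all tuple weights are equal, that is, $w_T(i) = w_T(j)$ for all identifiers $i$ and $j$.

We use $|T|$ to denote the number of tuple identifiers of $T$, that is, $|T| = |\mathrm{ids}(T)|$. Let $t = (a_1, \dots, a_k)$ be a tuple of $T$. We use $t.A_j$ to refer to the value $a_j$. If $X = A_{j_1} \cdots A_{j_\ell}$ is a sequence of attributes in $\{A_1, \dots, A_k\}$, then $t[X]$ denotes the tuple $(t.A_{j_1}, \dots, t.A_{j_\ell})$.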

Example 2.1

Our running example revolves around the tables of Figure 1. The figure shows tables over the schema $\mathsf{Office}(\mathsf{facility}, \mathsf{room}, \mathsf{floor}, \mathsf{city})$, describing the location of offices in an organization. For example, the tuple $T[1]$ corresponds to an office in room 322, on the third floor of the headquarters (HQ) building, located in Paris. The meaning of the yellow background color will be clarified later. The identifier of each tuple is shown in the leftmost (gray shaded) column, and its weight in the rightmost column (also gray shaded). Note that table $S_2$ is duplicate free and unweighted, table $T$ is duplicate free but not unweighted, and table $U_2$ is neither duplicate free nor unweighted.

2.2 Functional Dependencies (FDs)

Let $R(A_1, \dots, A_k)$ be a schema. As usual, an FD (over $R$) is an expression of the form $X \to Y$ where $X$ and $Y$ are sequences of attributes of $R$. We refer to $X$ as the left-hand side, or lhs for short, and to $Y$ as the right-hand side, or rhs for short. A table $T$ satisfies $X \to Y$ if every two tuples that agree on $X$ also agree on $Y$; that is, for all $t, s \in T[*]$, if $t[X] = s[X]$ then $t[Y] = s[Y]$. We say that $T$ satisfies a set $\Delta$ of FDs if $T$ satisfies each FD in $\Delta$; otherwise, $T$ violates $\Delta$.

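| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 1 | HQ | 322 | 3 | Paris | 2 |
| 2 | HQ | 322 | 30 | Madrid | 1 |
| 3 | HQ | 122 | 1 | Madrid | 1 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(a) Table $T$

| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 2 | HQ | 322 | 30 | Madrid | 1 |
| 3 | HQ | 122 | 1 | Madrid | 1 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(b) Consistent subset $S_1$

| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 1 | HQ | 322 | 3 | Paris | 2 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(c) Consistent subset $S_2$

| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 3 | HQ | 122 | 1 | Madrid | 1 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(d) Consistent subset $S_3$

| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 1 | F01* | 322 | 3 | Paris | 2 |
| 2 | HQ | 322 | 30 | Madrid | 1 |
| 3 | HQ | 122 | 1 | Madrid | 1 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(e) Consistent update $U_1$

| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 1 | HQ | 322 | 3 | Paris | 2 |
| 2 | HQ | 322 | 3* | Paris* | 1 |
| 3 | HQ | 122 | 1 | Paris* | 1 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(f) Consistent update $U_2$

| id | facility | room | floor | city | w |
|---|---|---|---|---|---|
| 1 | HQ | 322 | 30* | Madrid* | 2 |
| 2 | HQ | 322 | 30 | Madrid | 1 |
| 3 | HQ | 122 | 1 | Madrid | 1 |
| 4 | Lab1 | B35 | 3 | London | 2 |

(g) Consistent update $U_3$

Figure 1: For $\mathsf{Office}(\mathsf{facility}, \mathsf{room}, \mathsf{floor}, \mathsf{city})$ and FDs $\mathsf{facility} \to \mathsf{city}$ and $\mathsf{facility}\ \mathsf{room} \to \mathsf{floor}$, a table $T$, consistent subsets $S_1$, $S_2$ and $S_3$, and consistent updates $U_1$, $U_2$ and $U_3$. Changed values (yellow in the original figure) are marked with an asterisk.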

An FD $X \to Y$ is entailed by $\Delta$, denoted $\Delta \models X \to Y$, if every table $T$ that satisfies $\Delta$ also satisfies the FD $X \to Y$. The closure of $\Delta$, denoted $\mathrm{cl}(\Delta)$, is the set of all FDs over $R$ that are entailed by $\Delta$. The closure of an attribute set $X$ (w.r.t. $\Delta$), denoted $\mathrm{cl}_\Delta(X)$, is the set of all attributes $A$ such that the FD $X \to A$ is entailed by $\Delta$. Two sets $\Delta_1$ and $\Delta_2$ of FDs are equivalent if they have the same closure (or in other words, each FD in $\Delta_1$ is entailed by $\Delta_2$ and vice versa, or put differently, every table that satisfies one also satisfies the other). An FD $X \to Y$ is trivial if $Y \subseteq X$; otherwise, it is nontrivial. Note that a trivial FD belongs to the closure of every set of FDs (including the empty one). We say that $\Delta$ is trivial if $\Delta$ does not contain any nontrivial FDs (e.g., it is empty); otherwise, $\Delta$ is nontrivial.
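The notions of attribute closure, entailment, and equivalence can be sketched in a few lines of Python. This is a minimal illustration using the standard fixpoint computation of the closure; the representation of an FD as a pair of frozensets and the helper names `fd`, `closure`, `entails`, `equivalent`, and `is_trivial` are assumptions made for this example, not notation from the paper.

```python
from typing import FrozenSet, Iterable, List, Tuple

FD = Tuple[FrozenSet[str], FrozenSet[str]]  # (lhs, rhs); a hypothetical encoding


def fd(lhs: str, rhs: str) -> FD:
    """Hypothetical helper: build an FD from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def closure(attrs: Iterable[str], fds: List[FD]) -> FrozenSet[str]:
    """All attributes A such that attrs -> A is entailed by fds."""
    result = set(attrs)
    changed = True
    while changed:  # repeat until no FD contributes a new attribute
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def entails(fds: List[FD], candidate: FD) -> bool:
    """An FD X -> Y is entailed exactly when Y lies in the closure of X."""
    lhs, rhs = candidate
    return rhs <= closure(lhs, fds)


def equivalent(fds1: List[FD], fds2: List[FD]) -> bool:
    """Two FD sets are equivalent if each entails every FD of the other."""
    return all(entails(fds2, f) for f in fds1) and all(entails(fds1, f) for f in fds2)


def is_trivial(candidate: FD) -> bool:
    """An FD X -> Y is trivial if Y is a subset of X."""
    return candidate[1] <= candidate[0]


# FDs of the running example: facility -> city, facility room -> floor.
delta = [fd("facility", "city"), fd("facility room", "floor")]
print(sorted(closure({"facility", "room"}, delta)))
print(entails(delta, fd("facility room", "city")))  # True
print(entails(delta, fd("room", "floor")))          # False
```

The loop runs at most once per attribute that can be added, so the closure is computed in time polynomial in the size of the FD set.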

Next, we give some non-standard notation that we need for this paper. A common lhs of an FD set $\Delta$ is an attribute $A$ such that $A \in X$ for all FDs $X \to Y$ in $\Delta$. An FD set $\Delta$ is a chain if for every two FDs $X_1 \to Y_1$ and $X_2 \to Y_2$ it is the case that $X_1 \subseteq X_2$ or $X_2 \subseteq X_1$. Livshits and Kimelfeld [26] proved that the class of chain FD sets consists of precisely the FD sets in which the subset repairs, which we define in Section 2.3, can be counted in polynomial time (assuming $\mathrm{P} \neq \#\mathrm{P}$). The chain FD sets will arise in this work as well.
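Both properties are simple set tests on the left-hand sides. The sketch below assumes FDs are encoded as pairs of frozensets; the helper names `fd`, `common_lhs`, and `is_chain`, and the second FD set over attributes `A`, `B`, `C`, `D`, are illustrative choices for this example only.

```python
from typing import FrozenSet, List, Tuple

FD = Tuple[FrozenSet[str], FrozenSet[str]]  # (lhs, rhs); a hypothetical encoding


def fd(lhs: str, rhs: str) -> FD:
    """Hypothetical helper: build an FD from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def common_lhs(fds: List[FD]) -> FrozenSet[str]:
    """Attributes that occur in the lhs of every FD (empty if there are none)."""
    lhs_sets = [lhs for lhs, _ in fds]
    return frozenset.intersection(*lhs_sets) if lhs_sets else frozenset()


def is_chain(fds: List[FD]) -> bool:
    """True iff every two left-hand sides are comparable by set inclusion."""
    lhs_sets = [lhs for lhs, _ in fds]
    return all(x <= y or y <= x for x in lhs_sets for y in lhs_sets)


running = [fd("facility", "city"), fd("facility room", "floor")]
other = [fd("A", "B"), fd("C", "D")]  # illustrative FD set, not a chain

print(sorted(common_lhs(running)), is_chain(running))  # ['facility'] True
print(sorted(common_lhs(other)), is_chain(other))      # [] False
```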

Example 2.2

In our running example (Figure 1) the set $\Delta$ consists of the following FDs:

  • $\mathsf{facility} \to \mathsf{city}$: a facility belongs to a single city.

  • $\mathsf{facility}\ \mathsf{room} \to \mathsf{floor}$: a room in a facility does not go beyond one floor.

Note that the FDs allow for the same room number to occur in different facilities (possibly on different floors, in different cities). The attribute facility is a common lhs. Moreover, $\Delta$ is a chain FD set, since $\{\mathsf{facility}\} \subseteq \{\mathsf{facility}, \mathsf{room}\}$. Table $T$ (Figure 1(a)) violates $\Delta$, and the other tables (Figures 1(b)–1(g)) satisfy $\Delta$.
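The claim about which tables of Figure 1 satisfy the FDs can be checked mechanically. The following sketch encodes a table as a dictionary from identifiers to (tuple, weight) pairs; that encoding and the names `make_table` and `satisfies` are assumptions of this example. Only the table of panel (a) and the tables of panels (b) and (f) are spelled out here; the others can be added the same way.

```python
ATTRS = ("facility", "room", "floor", "city")

# FDs of the running example, written as (lhs attributes, rhs attributes).
FDS = [(("facility",), ("city",)), (("facility", "room"), ("floor",))]


def make_table(rows):
    """rows: iterable of (id, values, weight); returns {id: (row dict, weight)}."""
    return {i: (dict(zip(ATTRS, values)), w) for i, values, w in rows}


def satisfies(table, fds):
    """True iff every two tuples that agree on an lhs also agree on the rhs."""
    for lhs, rhs in fds:
        seen = {}  # lhs value -> the rhs value it must determine
        for row, _weight in table.values():
            key = tuple(row[a] for a in lhs)
            value = tuple(row[a] for a in rhs)
            if seen.setdefault(key, value) != value:
                return False
    return True


# Figure 1(a): the inconsistent table.
T = make_table([
    (1, ("HQ", "322", 3, "Paris"), 2),
    (2, ("HQ", "322", 30, "Madrid"), 1),
    (3, ("HQ", "122", 1, "Madrid"), 1),
    (4, ("Lab1", "B35", 3, "London"), 2),
])
# Figure 1(b): tuple 1 deleted.
S1 = {i: T[i] for i in (2, 3, 4)}
# Figure 1(f): floor and city of tuple 2, and city of tuple 3, updated.
U2 = make_table([
    (1, ("HQ", "322", 3, "Paris"), 2),
    (2, ("HQ", "322", 3, "Paris"), 1),
    (3, ("HQ", "122", 1, "Paris"), 1),
    (4, ("Lab1", "B35", 3, "London"), 2),
])

print(satisfies(T, FDS), satisfies(S1, FDS), satisfies(U2, FDS))  # False True True
```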

An FD $X \to Y$ might be such that $X$ is empty, and then we denote it by $\emptyset \to Y$ and call it a consensus FD. Satisfying the consensus FD $\emptyset \to Y$ means that all tuples agree on $Y$, or in other words, the column that corresponds to each attribute in $Y$ consists of copies of the same value. For example, $\emptyset \to \mathsf{city}$ means that all tuples have the same city. A consensus attribute (of $\Delta$) is an attribute in $\mathrm{cl}_\Delta(\emptyset)$, that is, an attribute $A$ such that $\emptyset \to A$ is implied by $\Delta$. We say that $\Delta$ is consensus free if it has no consensus attributes.

2.3 Repairs

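Let $R(A_1, \dots, A_k)$ be a schema, and let $T$ be a table. A subset of $T$ is a table $S$ that is obtained from $T$ by eliminating tuples. More formally, table $S$ is a subset of $T$ if $\mathrm{ids}(S) \subseteq \mathrm{ids}(T)$ and for all $i \in \mathrm{ids}(S)$ we have $S[i] = T[i]$ and $w_S(i) = w_T(i)$. If $S$ is a subset of $T$, then the distance from $S$ to $T$, denoted $\mathrm{dist}_{\mathrm{sub}}(S, T)$, is the weighted sum of the tuples missing from $S$; that is,

$$\mathrm{dist}_{\mathrm{sub}}(S, T) = \sum_{i \in \mathrm{ids}(T) \setminus \mathrm{ids}(S)} w_T(i).$$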

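A value update of $T$ (or just update of $T$ for short) is a table $U$ that is obtained from $T$ by changing attribute values. More formally, a table $U$ is an update of $T$ if $\mathrm{ids}(U) = \mathrm{ids}(T)$ and for all $i \in \mathrm{ids}(U)$ we have $w_U(i) = w_T(i)$. We adopt the definition of Kolahi and Lakshmanan [24] for the distance from $U$ to $T$. Specifically, if $u$ and $t$ are tuples of tables over $R$, then the Hamming distance $H(u, t)$ is the number of attributes in which $u$ and $t$ disagree, that is, $H(u, t) = |\{j \mid u.A_j \neq t.A_j\}|$. If $U$ is an update of $T$ then the distance from $U$ to $T$, denoted $\mathrm{dist}_{\mathrm{upd}}(U, T)$, is the weighted Hamming distance between $U$ and $T$ (where every changed value counts as the weight of the tuple); that is,

$$\mathrm{dist}_{\mathrm{upd}}(U, T) = \sum_{i \in \mathrm{ids}(T)} w_T(i) \cdot H(T[i], U[i]).$$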

Let $R(A_1, \dots, A_k)$ be a schema, let $T$ be a table, and let $\Delta$ be a set of FDs. A consistent subset (of $T$ w.r.t. $\Delta$) is a subset $S$ of $T$ such that $S$ satisfies $\Delta$, and a consistent update (of $T$ w.r.t. $\Delta$) is an update $U$ of $T$ such that $U$ satisfies $\Delta$. A subset repair, or just S-repair for short, is a consistent subset that is not strictly contained in any other consistent subset. An update repair, or just U-repair for short, is a consistent update that becomes inconsistent if any set of updated values is restored to the original values in $T$. An optimal subset repair of $T$, or just optimal S-repair for short, is a consistent subset $S$ of $T$ such that $\mathrm{dist}_{\mathrm{sub}}(S, T)$ is minimal among all consistent subsets of $T$. Similarly, an optimal update repair of $T$, or just optimal U-repair for short, is a consistent update $U$ of $T$ such that $\mathrm{dist}_{\mathrm{upd}}(U, T)$ is minimal among all consistent updates of $T$. When there is risk of ambiguity, we may stress that the optimal S-repair (or U-repair) is of $T$ and under $\Delta$ or under $R$ and $\Delta$.

Every (S- or U-) optimal repair is a repair, but not necessarily vice versa. Clearly, a consistent subset (respectively, update) can be transformed into a (not necessarily optimal) S-repair (respectively, U-repair), with no increase of distance, in polynomial time. In fact, we do not really need the concept of a repair per se, and the definition is given mainly for compatibility with the literature (e.g., [1]). Therefore, unless explicitly stated otherwise, we do not distinguish between an S-repair and a consistent subset, and between a U-repair and a consistent update.

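We also define approximations of optimal repairs in the obvious ways, as follows. For a number $\alpha \geq 1$, an $\alpha$-optimal S-repair is an S-repair $S$ of $T$ such that $\mathrm{dist}_{\mathrm{sub}}(S, T) \leq \alpha \cdot \mathrm{dist}_{\mathrm{sub}}(S', T)$ for all S-repairs $S'$ of $T$, and an $\alpha$-optimal U-repair is a U-repair $U$ of $T$ such that $\mathrm{dist}_{\mathrm{upd}}(U, T) \leq \alpha \cdot \mathrm{dist}_{\mathrm{upd}}(U', T)$ for all U-repairs $U'$ of $T$. In particular, an optimal S-repair (resp., optimal U-repair) is the same as a $1$-optimal S-repair (resp., $1$-optimal U-repair).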

Example 2.3

In our running example (Figure 1), tables $S_1$, $S_2$ and $S_3$ are consistent subsets, and $U_1$, $U_2$ and $U_3$ are consistent updates. For clarity, we marked with yellow shading the values that were changed for constructing each $U_i$. We have $\mathrm{dist}_{\mathrm{sub}}(S_1, T) = 2$ since the missing tuple (tuple $1$) has the weight $2$. We also have $\mathrm{dist}_{\mathrm{sub}}(S_2, T) = 2$ and $\mathrm{dist}_{\mathrm{sub}}(S_3, T) = 3$. The reader can verify that $S_1$ and $S_2$ are optimal S-repairs. However, $S_3$ is not an optimal S-repair since its distance to $T$ is greater than the minimum. Nevertheless, $S_3$ is a $1.5$-optimal S-repair (since $3/2 = 1.5$). Similarly, we have $\mathrm{dist}_{\mathrm{upd}}(U_1, T) = 2$, $\mathrm{dist}_{\mathrm{upd}}(U_2, T) = 3$, and $\mathrm{dist}_{\mathrm{upd}}(U_3, T) = 4$ (since $U_3$ is obtained by changing two values from a tuple of weight $2$).

It should be noted that the values of an update $U$ of a table $T$ are not necessarily taken from the active domain (i.e., values that occur in $T$). An example is the value F01 of table $U_1$ in Figure 1(e). This has implications on the complexity of computing optimal U-repairs. We discuss a restriction on the allowed update values in Section 5.
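The distances for the tables of Figure 1 can be recomputed directly from the two definitions. The sketch below is only an illustration: the tuple-based table encoding and the helper names `subset`, `update`, `dist_sub`, `hamming`, `dist_upd`, and `satisfies` are assumptions of this example. The last step finds the minimum deletion cost by exhaustive search over all subsets, which is feasible only because the table has four tuples.

```python
from itertools import combinations

ATTRS = ("facility", "room", "floor", "city")
FDS = [(("facility",), ("city",)), (("facility", "room"), ("floor",))]

# Figure 1(a): id -> (tuple, weight).
T = {
    1: (("HQ", "322", 3, "Paris"), 2),
    2: (("HQ", "322", 30, "Madrid"), 1),
    3: (("HQ", "122", 1, "Madrid"), 1),
    4: (("Lab1", "B35", 3, "London"), 2),
}


def subset(table, ids):
    return {i: table[i] for i in ids}


def update(table, changes):
    """changes maps id -> {attribute: new value}; weights stay unchanged."""
    out = {}
    for i, (row, w) in table.items():
        new = dict(zip(ATTRS, row))
        new.update(changes.get(i, {}))
        out[i] = (tuple(new[a] for a in ATTRS), w)
    return out


def dist_sub(s, t):
    """Sum of the weights of the tuples of t that are missing from s."""
    return sum(w for i, (_, w) in t.items() if i not in s)


def hamming(u, t):
    """Number of attributes on which the two tuples disagree."""
    return sum(1 for x, y in zip(u, t) if x != y)


def dist_upd(u, t):
    """Weighted Hamming distance between an update u and the table t."""
    return sum(w * hamming(u[i][0], row) for i, (row, w) in t.items())


def satisfies(table, fds):
    def proj(row, attrs):
        return tuple(row[ATTRS.index(a)] for a in attrs)
    for lhs, rhs in fds:
        seen = {}
        for row, _ in table.values():
            if seen.setdefault(proj(row, lhs), proj(row, rhs)) != proj(row, rhs):
                return False
    return True


S1, S2, S3 = subset(T, [2, 3, 4]), subset(T, [1, 4]), subset(T, [3, 4])
U1 = update(T, {1: {"facility": "F01"}})
U2 = update(T, {2: {"floor": 3, "city": "Paris"}, 3: {"city": "Paris"}})
U3 = update(T, {1: {"floor": 30, "city": "Madrid"}})

# Exhaustive search over all subsets of identifiers (exponential; demo only).
best = min(
    dist_sub(subset(T, ids), T)
    for k in range(len(T) + 1)
    for ids in combinations(T, k)
    if satisfies(subset(T, ids), FDS)
)
print([dist_sub(s, T) for s in (S1, S2, S3)], best)  # [2, 2, 3] 2
print([dist_upd(u, T) for u in (U1, U2, U3)])        # [2, 3, 4]
```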

2.4 Complexity

We adopt the conventional measure of data complexity, where the schema $R(A_1, \dots, A_k)$ and dependency set $\Delta$ are assumed to be fixed, and only the table $T$ is considered input. In particular, a “polynomial” running time may have an exponential dependency on $k$, as in $O(|T|^k)$. Hence, each combination of $R(A_1, \dots, A_k)$ and $\Delta$ defines a distinct problem of finding an optimal repair (of the relevant type), and different combinations may feature different computational complexities.

For the complexity of approximation, we use the following terminology. In an optimization problem $\mathcal{P}$, each input $x$ has a space of solutions $y$, each associated with a cost $\kappa(x, y)$. Given $x$, the goal is to compute a solution $y$ with a minimum cost. For $\alpha \geq 1$, an $\alpha$-approximation for $\mathcal{P}$ is an algorithm that, for input $x$, produces an $\alpha$-optimal solution $y$, which means that $\kappa(x, y) \leq \alpha \cdot \kappa(x, y')$ for all solutions $y'$. The complexity class APX consists of all optimization problems that have a polynomial-time constant-factor approximation. A polynomial-time reduction from an optimization problem $\mathcal{Q}$ to an optimization problem $\mathcal{P}$ is a strict reduction if for all $\alpha \geq 1$, any $\alpha$-optimal solution for $\mathcal{P}$ can be transformed in polynomial time into an $\alpha$-optimal solution for $\mathcal{Q}$ [25]; it is a PTAS (Polynomial-Time Approximation Scheme) reduction if for all $\alpha > 1$ there exists $\beta > 1$ such that any $\beta$-optimal solution for $\mathcal{P}$ can be transformed in polynomial time into an $\alpha$-optimal solution for $\mathcal{Q}$. A strict reduction is also a PTAS reduction, but not necessarily vice versa. A problem $\mathcal{P}$ is APX-hard if there is a PTAS reduction to $\mathcal{P}$ from every problem in APX; it is APX-complete if, in addition, it is in APX. If $\mathcal{P}$ is APX-hard, then there is a constant $\alpha > 1$ such that $\mathcal{P}$ cannot be approximated better than $\alpha$, or else $\mathrm{P} = \mathrm{NP}$.

3 Computing an Optimal S-Repair

In this section, we study the problem of computing an optimal S-repair. We begin with some conventions.

Assumptions and Notation

Throughout this section we assume that every FD has a single attribute on its right-hand side, that is, it has the form $X \to A$. Clearly, this is not a limiting assumption, since replacing $X \to YZ$ with $X \to Y$ and $X \to Z$ preserves equivalence.

Let $\Delta$ be a set of FDs. If $X$ is a set of attributes, then we denote by $\Delta - X$ the set of FDs that is obtained from $\Delta$ by removing each attribute of $X$ from every lhs and rhs of every FD in $\Delta$. Hence, no attribute in $X$ occurs in $\Delta - X$. If $A$ is an attribute, then we may write $\Delta - A$ instead of $\Delta - \{A\}$.

An lhs marriage of an FD set $\Delta$ is a pair $(X_1, X_2)$ of distinct lhs of FDs in $\Delta$ with the following properties.

  • $\mathrm{cl}_\Delta(X_1) = \mathrm{cl}_\Delta(X_2)$.

  • The lhs of every FD in $\Delta$ contains either $X_1$ or $X_2$ (or both).
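A search for an lhs marriage can be sketched as follows. The sketch assumes that, besides the containment condition stated above, the two left-hand sides must have the same closure; treat that equal-closure condition, the pair-of-frozensets encoding of FDs, and the names `fd`, `closure`, and `find_lhs_marriage` as assumptions of this example. The two FD sets over attributes `A`, `B`, `C` are illustrative inputs.

```python
from itertools import combinations


def fd(lhs, rhs):
    """Hypothetical helper: build an FD from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def closure(attrs, fds):
    """All attributes determined by attrs under fds (fixpoint iteration)."""
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def find_lhs_marriage(fds):
    """Return a pair (X1, X2) of distinct lhs forming an lhs marriage, or None.

    Assumed conditions: X1 and X2 have the same closure, and the lhs of
    every FD contains X1 or X2 (or both).
    """
    lhs_sets = sorted({lhs for lhs, _ in fds}, key=sorted)
    for x1, x2 in combinations(lhs_sets, 2):
        same_closure = closure(x1, fds) == closure(x2, fds)
        covers_all = all(x1 <= lhs or x2 <= lhs for lhs, _ in fds)
        if same_closure and covers_all:
            return x1, x2
    return None


married = [fd("A", "B"), fd("B", "A"), fd("B", "C")]  # A and B determine each other
unmarried = [fd("A", "B"), fd("B", "C")]              # B does not determine A

print(find_lhs_marriage(married))
print(find_lhs_marriage(unmarried))  # None
```

The search inspects every pair of distinct left-hand sides, so it is quadratic in the number of FDs, with one closure computation per candidate.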

Example 3.1

A simple example of an FD set with an lhs marriage is the following FD set.

(1)

As another example, consider the following FD set.

In the pair is an lhs marriage.

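Algorithm 1: $\mathsf{OptSRepair}(\Delta, T)$

if $\Delta$ is trivial then   ▷ successful termination
    return $T$
remove trivial FDs from $\Delta$
if $\Delta$ has a common lhs then
    return $\mathsf{CommonLHSRep}(\Delta, T)$
if $\Delta$ has a consensus FD then
    return $\mathsf{ConsensusRep}(\Delta, T)$
if $\Delta$ has an lhs marriage then
    return $\mathsf{MarriageRep}(\Delta, T)$
fail   ▷ cannot find a minimum repair

Subroutine 1: $\mathsf{CommonLHSRep}(\Delta, T)$

$A :=$ a common lhs of $\Delta$
return $\bigcup_{(a) \in \pi_A T[*]} \mathsf{OptSRepair}(\Delta - A, \sigma_{A=a} T)$

Subroutine 2: $\mathsf{ConsensusRep}(\Delta, T)$

select a consensus FD $\emptyset \to A$ in $\Delta$
for all $(a) \in \pi_A T[*]$ do
    $S_a := \mathsf{OptSRepair}(\Delta - A, \sigma_{A=a} T)$
$a_{\max} := \mathrm{argmax}_a \{ w_T(S_a) \mid (a) \in \pi_A T[*] \}$
return $S_{a_{\max}}$

Subroutine 3: $\mathsf{MarriageRep}(\Delta, T)$

select an lhs marriage $(X_1, X_2)$ of $\Delta$
for all $(a_1, a_2) \in \pi_{X_1 X_2} T[*]$ do
    $S_{a_1,a_2} := \mathsf{OptSRepair}(\Delta - X_1 X_2, \sigma_{X_1=a_1, X_2=a_2} T)$
    $w(a_1, a_2) := w_T(S_{a_1,a_2})$
$V_i := \pi_{X_i} T[*]$ for $i = 1, 2$
$E := \{ (a_1, a_2) \mid (a_1, a_2) \in \pi_{X_1 X_2} T[*] \}$
$G :=$ weighted bipartite graph $(V_1, V_2, E, w)$
$E_{\max} :=$ a maximum matching of $G$
return $\bigcup_{(a_1, a_2) \in E_{\max}} S_{a_1,a_2}$

Algorithm 2: $\mathsf{OSRSucceeds}(\Delta)$

while $\Delta$ is nontrivial do
    remove trivial FDs from $\Delta$
    if $\Delta$ has a common lhs $A$ then
        $\Delta := \Delta - A$
    else if $\Delta$ has a consensus FD $\emptyset \to A$ then
        $\Delta := \Delta - A$
    else if $\Delta$ has an lhs marriage $(X_1, X_2)$ then
        $\Delta := \Delta - X_1 X_2$
    else
        return false
return true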

Finally, if $S$ is a subset of a table $T$, then we denote by $w_T(S)$ the sum of weights of the tuples of $S$, that is,

$$w_T(S) = \sum_{i \in \mathrm{ids}(S)} w_T(i).$$

3.1 Algorithm

We now describe an algorithm for finding an optimal S-repair. The algorithm terminates in polynomial time, even under combined complexity, yet it may fail. If it succeeds, then the result is guaranteed to be an optimal S-repair. We later discuss the situations in which the algorithm fails. The algorithm, $\mathsf{OptSRepair}$, is shown as Algorithm 1. The input is a set $\Delta$ of FDs and a table $T$, both over the same relation schema (that we do not need to refer to explicitly). In the remainder of this section, we fix $\Delta$ and $T$, and describe the execution of $\mathsf{OptSRepair}$ on $\Delta$ and $T$.

The algorithm handles four cases. The first is where $\Delta$ is trivial. Then, $T$ is itself an optimal S-repair. The second case is where $\Delta$ has a common lhs $A$. Then, the algorithm groups the tuples by $A$, finds an optimal S-repair for each group (via a recursive call to $\mathsf{OptSRepair}$), this time ignoring $A$ (i.e., removing $A$ from the FDs of $\Delta$), and returns the union of the optimal S-repairs. The precise description is in the subroutine $\mathsf{CommonLHSRep}$ (Subroutine 1). The third case is where $\Delta$ has a consensus FD $\emptyset \to A$. Similarly to the second case, the algorithm groups the tuples by $A$ and finds an optimal S-repair for each group. This time, however, the algorithm returns the optimal S-repair with the maximal weight. The precise description is in the subroutine $\mathsf{ConsensusRep}$ (Subroutine 2).

The fourth (last) case is the most involved. This is the case where $\Delta$ has an lhs marriage $(X_1, X_2)$. In this case the problem is reduced to finding a maximum weighted matching of a bipartite graph. The graph, which we denote by $G$, consists of two disjoint node sets $V_1$ and $V_2$, an edge set $E$ that connects nodes from $V_1$ to nodes from $V_2$, and a weight function $w$ that assigns a weight $w(v_1, v_2)$ to each edge $(v_1, v_2)$. For $i = 1, 2$, the node set $V_i$ is the set of tuples in the projection of $T$ to $X_i$. (In principle, it may be the case that the same tuple occurs in both $V_1$ and $V_2$, since the tuple is in both projections. Nevertheless, we still treat the two occurrences of the tuple as distinct nodes, and so effectively assume that $V_1$ and $V_2$ are disjoint.) To determine the weight $w(v_1, v_2)$, we select from $T$ the subset $T_{v_1,v_2}$ that consists of the tuples that agree with $v_1$ and $v_2$ on $X_1$ and $X_2$, respectively. We then find an optimal S-repair for $T_{v_1,v_2}$, after we remove from $\Delta$ every attribute in either $X_1$ or $X_2$. Then, the weight $w(v_1, v_2)$ is the weight of this optimal S-repair. Next, we find a maximum matching $E_{\max}$ of $G$. Note that $E_{\max}$ is a subset of $E$ such that no node appears more than once. The returned result is then the disjoint union of the optimal S-repair of $T_{v_1,v_2}$ over all $(v_1, v_2)$ in $E_{\max}$. The precise description is in the subroutine $\mathsf{MarriageRep}$ (Subroutine 3).

The following theorem states the correctness and efficiency of .

Theorem 3.2

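Let $\Delta$ and $T$ be a set of FDs and a table, respectively, over a relation schema $R(A_1, \dots, A_k)$. If $\mathsf{OptSRepair}(\Delta, T)$ succeeds, then it returns an optimal S-repair. Moreover, $\mathsf{OptSRepair}(\Delta, T)$ terminates in polynomial time in $k$, $|\Delta|$, and $|T|$.

What about the cases where $\mathsf{OptSRepair}(\Delta, T)$ fails? We discuss this in the next section.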
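The four cases described in this section can be put together in a compact recursive sketch. Several things here are assumptions of the example rather than the paper's implementation: FDs are pairs of frozensets, a table is a dictionary from identifiers to (row dictionary, weight) pairs, an lhs marriage is taken to require equal closures of the two left-hand sides, and the maximum-weight matching is found by an exponential brute-force search that stands in for a polynomial-time bipartite matching algorithm. All function names are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def find_lhs_marriage(fds):
    lhs_sets = sorted({lhs for lhs, _ in fds}, key=sorted)
    for x1, x2 in combinations(lhs_sets, 2):
        if closure(x1, fds) == closure(x2, fds) and all(
                x1 <= lhs or x2 <= lhs for lhs, _ in fds):
            return x1, x2
    return None


def minus(fds, attrs):
    """Drop the given attributes from every lhs and rhs."""
    return [(lhs - attrs, rhs - attrs) for lhs, rhs in fds]


def weight(table):
    return sum(w for _, w in table.values())


def groups(table, attrs):
    """Partition the table by its values on the attribute list attrs."""
    out = {}
    for i, (row, w) in table.items():
        out.setdefault(tuple(row[a] for a in attrs), {})[i] = (row, w)
    return out


def max_weight_matching(edges):
    """Brute-force stand-in (exponential) for a polynomial matching algorithm."""
    left = list(dict.fromkeys(v1 for v1, _ in edges))

    def best(k, used):
        if k == len(left):
            return 0, []
        total, chosen = best(k + 1, used)  # option: leave left[k] unmatched
        for (v1, v2), w in edges.items():
            if v1 == left[k] and v2 not in used:
                sub_total, sub_chosen = best(k + 1, used | {v2})
                if w + sub_total > total:
                    total, chosen = w + sub_total, [(v1, v2)] + sub_chosen
        return total, chosen

    return best(0, frozenset())[1]


def opt_s_repair(fds, table):
    fds = [(l, r) for l, r in fds if not r <= l]  # remove trivial FDs
    if not fds or not table:  # trivial FD set (or nothing to repair)
        return dict(table)
    common = frozenset.intersection(*(l for l, _ in fds))
    if common:  # common lhs: repair each group, return the union
        a = min(common)
        result = {}
        for part in groups(table, [a]).values():
            result.update(opt_s_repair(minus(fds, {a}), part))
        return result
    consensus = [r for l, r in fds if not l]
    if consensus:  # consensus FD: keep the heaviest repaired group
        a = min(consensus[0])
        parts = groups(table, [a]).values()
        return max((opt_s_repair(minus(fds, {a}), p) for p in parts), key=weight)
    marriage = find_lhs_marriage(fds)
    if marriage:  # lhs marriage: matching between X1-values and X2-values
        x1, x2 = sorted(marriage[0]), sorted(marriage[1])
        repairs = {}
        for key, part in groups(table, x1 + x2).items():
            edge = (key[:len(x1)], key[len(x1):])
            repairs[edge] = opt_s_repair(minus(fds, set(x1 + x2)), part)
        result = {}
        for e in max_weight_matching({e: weight(s) for e, s in repairs.items()}):
            result.update(repairs[e])
        return result
    raise ValueError("fail: no simplification applies to the FD set")


F = frozenset
running = [(F({"facility"}), F({"city"})), (F({"facility", "room"}), F({"floor"}))]
T = {i: (dict(zip(("facility", "room", "floor", "city"), v)), w) for i, v, w in [
    (1, ("HQ", "322", 3, "Paris"), 2), (2, ("HQ", "322", 30, "Madrid"), 1),
    (3, ("HQ", "122", 1, "Madrid"), 1), (4, ("Lab1", "B35", 3, "London"), 2)]}
repair = opt_s_repair(running, T)
print(sorted(repair), weight(T) - weight(repair))  # kept ids, deleted weight
```

On the table of Figure 1(a) the deleted weight is 2, which agrees with the optimal distance in Example 2.3. For an FD set with no applicable simplification, such as `A -> B, B -> C`, the sketch raises `ValueError`, mirroring the failure case.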

Approximation

The computation of an optimal S-repair reduces easily to the weighted vertex-cover problem: given a graph where nodes are assigned nonnegative weights, find a vertex cover (i.e., a set of nodes that intersects all edges) with a minimal sum of weights. Indeed, given a table $T$, we construct the graph $G$ that has $\mathrm{ids}(T)$ as the set of nodes, and an edge between every $i$ and $j$ such that $T[i]$ and $T[j]$ contradict one or more FDs in $\Delta$. Given a vertex cover $C$ for $G$, we obtain a consistent subset $S$ by deleting from $T$ every tuple with an identifier in $C$. Clearly, this reduction is strict. As weighted vertex cover is 2-approximable in polynomial time [7], we conclude the same for optimal subset repairing.

Proposition

For all FD sets , a 2-optimal S-repair can be computed in polynomial time.

While Proposition 3.1 is straightforward, it is of practical importance as it limits the severity of the lower bounds we establish in the next section. Moreover, we will later show that the proposition has implications on the problem of approximating an optimal U-repair.
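The reduction to weighted vertex cover can be sketched as follows. The sketch uses the local-ratio formulation of the Bar-Yehuda and Even 2-approximation: for every conflicting pair, subtract the smaller residual weight from both endpoints, and delete every tuple whose residual weight reaches zero. The table encoding and the names `conflict`, `approx_s_repair`, and `dist_sub` are assumptions of this example.

```python
from itertools import combinations

ATTRS = ("facility", "room", "floor", "city")
FDS = [(("facility",), ("city",)), (("facility", "room"), ("floor",))]

T = {i: (dict(zip(ATTRS, v)), w) for i, v, w in [
    (1, ("HQ", "322", 3, "Paris"), 2),
    (2, ("HQ", "322", 30, "Madrid"), 1),
    (3, ("HQ", "122", 1, "Madrid"), 1),
    (4, ("Lab1", "B35", 3, "London"), 2),
]}


def conflict(r1, r2, fds):
    """True iff the two tuples jointly violate at least one FD."""
    return any(
        all(r1[a] == r2[a] for a in lhs) and any(r1[a] != r2[a] for a in rhs)
        for lhs, rhs in fds
    )


def approx_s_repair(table, fds):
    """2-approximate S-repair via local-ratio weighted vertex cover."""
    residual = {i: w for i, (_, w) in table.items()}
    for i, j in combinations(sorted(table), 2):
        if conflict(table[i][0], table[j][0], fds):
            eps = min(residual[i], residual[j])
            residual[i] -= eps  # pay eps on both endpoints of the edge;
            residual[j] -= eps  # at least one of them drops to zero
    cover = {i for i, r in residual.items() if r == 0}
    return {i: t for i, t in table.items() if i not in cover}


def dist_sub(s, t):
    return sum(w for i, (_, w) in t.items() if i not in s)


S = approx_s_repair(T, FDS)
print(sorted(S), dist_sub(S, T))
```

On the table of Figure 1(a) this sketch deletes tuples 1, 2 and 3 for a total weight of 4, exactly twice the optimum of 2, which shows that the factor 2 can be attained. The pairwise conflict test makes the graph construction quadratic in the number of tuples.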

3.2 Dichotomy

The reader can observe that the success or failure of $\mathsf{OptSRepair}(\Delta, T)$ depends only on $\Delta$, and not on $T$. The algorithm $\mathsf{OSRSucceeds}(\Delta)$, depicted as Algorithm 2, tests whether $\Delta$ is such that $\mathsf{OptSRepair}$ succeeds by simulating the cases and corresponding changes to $\Delta$. The next theorem shows that, under conventional complexity assumptions, $\mathsf{OptSRepair}$ covers all sets $\Delta$ such that an optimal S-repair can be found in polynomial time. Hence, we establish a dichotomy in the complexity of computing an optimal S-repair.

Theorem 3.3

Let $\Delta$ be a set of FDs.

  • If $\mathsf{OSRSucceeds}(\Delta)$ returns true, then an optimal S-repair can be computed in polynomial time by executing $\mathsf{OptSRepair}(\Delta, T)$ on the input $T$.

  • If $\mathsf{OSRSucceeds}(\Delta)$ returns false, then computing an optimal S-repair is APX-complete, and remains APX-complete on unweighted, duplicate-free tables.

Moreover, the execution of $\mathsf{OSRSucceeds}(\Delta)$ terminates in polynomial time in $|\Delta|$.
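The tractability test depends only on the FD set, so it can be sketched without any table. The sketch below applies the three simplifications in the order common lhs, consensus FD, lhs marriage until the FD set becomes trivial or none applies. The encoding of FDs as pairs of frozensets, the equal-closure condition assumed for an lhs marriage, and all names are assumptions of this example; the FD sets over `A`, `B`, `C`, `D` are illustrative inputs.

```python
from itertools import combinations


def fd(lhs, rhs):
    """Hypothetical helper: build an FD from space-separated attribute names."""
    return frozenset(lhs.split()), frozenset(rhs.split())


def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def find_lhs_marriage(fds):
    lhs_sets = sorted({lhs for lhs, _ in fds}, key=sorted)
    for x1, x2 in combinations(lhs_sets, 2):
        if closure(x1, fds) == closure(x2, fds) and all(
                x1 <= lhs or x2 <= lhs for lhs, _ in fds):
            return x1, x2
    return None


def osr_succeeds(fds):
    """True iff the three simplifications reduce fds to a trivial FD set."""
    fds = list(fds)
    while True:
        fds = [(l, r) for l, r in fds if not r <= l]  # remove trivial FDs
        if not fds:
            return True
        common = frozenset.intersection(*(l for l, _ in fds))
        consensus = [r for l, r in fds if not l]
        if common:
            drop = frozenset({min(common)})        # remove one common lhs attribute
        elif consensus:
            drop = frozenset({min(consensus[0])})  # remove one consensus attribute
        else:
            marriage = find_lhs_marriage(fds)
            if marriage is None:
                return False                       # no simplification applies
            drop = marriage[0] | marriage[1]       # remove both married lhs
        fds = [(l - drop, r - drop) for l, r in fds]


print(osr_succeeds([fd("facility", "city"), fd("facility room", "floor")]))  # True
print(osr_succeeds([fd("A", "B"), fd("B", "A"), fd("B", "C")]))              # True
print(osr_succeeds([fd("A", "B"), fd("B", "C")]))                            # False
```

Each iteration removes at least one attribute, so the loop runs at most as many times as there are attributes in the FD set.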

Recall that a problem in APX has a constant factor approximation and, under the assumption that $\mathrm{P} \neq \mathrm{NP}$, an APX-hard problem cannot be approximated better than some constant factor (that may depend on the problem itself).

Example 3.4

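We now illustrate the application of Theorem 3.3 to several FD sets. Consider first the FD set $\Delta$ of our running example. The execution of $\mathsf{OSRSucceeds}(\Delta)$ transforms $\Delta$ as follows.

$$\{\mathsf{facility} \to \mathsf{city},\ \mathsf{facility}\ \mathsf{room} \to \mathsf{floor}\} \xRightarrow{\text{common lhs}} \{\emptyset \to \mathsf{city},\ \mathsf{room} \to \mathsf{floor}\} \xRightarrow{\text{consensus}} \{\mathsf{room} \to \mathsf{floor}\} \xRightarrow{\text{common lhs}} \{\emptyset \to \mathsf{floor}\} \xRightarrow{\text{consensus}} \{\}$$

Hence, $\mathsf{OSRSucceeds}(\Delta)$ is true, and so an optimal S-repair can be found in polynomial time.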

Next, consider the FD set from Example 3.1. executes as follows.

Hence, this is again an example of an FD set on the tractable side of the dichotomy.

As the last positive example we consider the FD set of Example 3.1.

On the other hand, for , none of the conditions of is true, and therefore, the algorithm returns false. It thus follows from Theorem 3.3 that computing an optimal S-repair is APX-complete (even if all tuple weights are the same and there are no duplicate tuples). The same applies to .

As another example, the following corollary of Theorem 3.3 generalizes the tractability of our running example to general chain FD sets.

Corollary

If $\Delta$ is a chain FD set, then an optimal S-repair can be computed in polynomial time.

The reader can easily verify that when $\Delta$ is a chain FD set, $\mathsf{OSRSucceeds}$ will reduce it to emptiness by repeatedly removing consensus attributes and common-lhs attributes, as done in our running example.

3.3 Proof of Theorem 3.3

In this section we discuss the proof of Theorem 3.3. (The full proof is in the Appendix.) The positive side is a direct consequence of Theorem 3.2. For the negative side, membership in APX is due to Proposition 3.1. The proof of hardness is based on the concept of a fact-wise reduction [22], as previously done for proving dichotomies on sets of FDs [23, 22, 16, 26]. In our setup, a fact-wise reduction is defined as follows. Let $R$ and $R'$ be two relation schemas. A tuple mapping from $R$ to $R'$ is a function $\mu$ that maps tuples over $R$ to tuples over $R'$. We extend $\mu$ to map tables over $R$ to tables over $R'$ by defining $\mu(T)$ to be $\{\mu(t) \mid t \in T\}$. Let $\Delta$ and $\Delta'$ be sets of FDs over $R$ and $R'$, respectively. A fact-wise reduction from $(R, \Delta)$ to $(R', \Delta')$ is a tuple mapping $\Pi$ from $R$ to $R'$ with the following properties: (a) $\Pi$ is injective, that is, for all tuples $t$ and $t'$ over $R$, if $\Pi(t) = \Pi(t')$ then $t = t'$; (b) $\Pi$ preserves consistency and inconsistency; that is, $\Pi(T)$ satisfies $\Delta'$ if and only if $T$ satisfies $\Delta$; and (c) $\Pi$ is computable in polynomial time. The following lemma is straightforward.

Lemma

Let $R$ and $R'$ be relation schemas and $\Delta$ and $\Delta'$ FD sets over $R$ and $R'$, respectively. If there is a fact-wise reduction from $(R, \Delta)$ to $(R', \Delta')$, then there is a strict reduction from the problem of computing an optimal S-repair under $R$ and $\Delta$ to that of computing an optimal S-repair under $R'$ and $\Delta'$.

In the remainder of this section, we describe the way we use Lemma 3.3. Our proof consists of four steps.

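| Name | FDs |
|---|---|
| $\Delta_{A \to B \to C}$ | $A \to B$, $B \to C$ |
| $\Delta_{A \to B \leftarrow C}$ | $A \to B$, $C \to B$ |
| $\Delta_{AB \to C \to B}$ | $AB \to C$, $C \to B$ |
| $\Delta_{AB \leftrightarrow AC \leftrightarrow BC}$ | $AB \to C$, $AC \to B$, $BC \to A$ |

Table 1: FD sets over $R(A, B, C)$ used in the proof of hardness of Theorem 3.3.

  1. We first prove APX-hardness for each of the FD sets in Table 1 over $R(A, B, C)$. For $\Delta_{A \to B \to C}$ and $\Delta_{A \to B \leftarrow C}$ we adapt reductions by Gribkoff et al. [20] in a work that we discuss in Section 3.4. For $\Delta_{AB \to C \to B}$ we show a reduction from MAX-non-mixed-SAT [21]. Most intricate is the proof for $\Delta_{AB \leftrightarrow AC \leftrightarrow BC}$, where we devise a nontrivial adaptation of a reduction by Amini et al. [3] to triangle packing in graphs of bounded degree.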

  2. Next, we prove that whenever $\Delta$ simplifies into $\Delta'$, there is a fact-wise reduction from $(R, \Delta')$ to $(R, \Delta)$, where $R$ is the underlying relation schema.

  3. Then, we consider an FD set $\Delta$ that cannot be further simplified (that is, $\Delta$ does not have a common lhs, a consensus FD, or an lhs marriage). We show that $\Delta$ can be classified into one of five classes of FD sets (that we discuss next).

  4. Finally, we prove that for each FD set in one of the five classes there exists a fact-wise reduction from one of the four schemas of Table 1.

The most challenging part of the proof is identifying the classes of FD sets in Step 3 in such a way that we are able to build the fact-wise reductions in Step 4. We first identify that if an FD set $\Delta$ cannot be simplified, then there are at least two distinct local minima $X_1 \to Y_1$ and $X_2 \to Y_2$ in $\Delta$. By a local minimum we mean an FD with a set-minimal lhs, that is, an FD $X \to Y$ such that no FD $Z \to W$ in $\Delta$ satisfies that $Z$ is a strict subset of $X$. We pick any two local minima from $\Delta$. Then, we divide the FD sets into five classes based on the relationships between , , , which we denote by , and , which we denote by . The classes are illustrated in Figure 2.
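The notion of a local minimum can be sketched in a few lines. This is an illustrative sketch only: the function name `local_minima` and the encoding of an FD as a pair of `frozenset`s are assumptions made here, and the example FD set is hypothetical rather than taken from the paper.

```python
def local_minima(fds):
    """Return the FDs whose lhs has no strict subset that is the lhs of an FD in `fds`.

    Each FD is a pair (lhs, rhs) of frozensets of attribute names.
    For frozensets, `other < lhs` tests for a strict subset.
    """
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not any(other < lhs for other, _ in fds)
    ]

# Hypothetical FD set: A -> B, AB -> C, C -> D.
EXAMPLE_FDS = [
    (frozenset("A"), frozenset("B")),
    (frozenset("AB"), frozenset("C")),
    (frozenset("C"), frozenset("D")),
]
```

In this example, `AB -> C` is not a local minimum because the lhs of `A -> B` is a strict subset of `{A, B}`; the other two FDs are local minima.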

Each line in Figure 2 represents one of the sets , , or . If two lines do not overlap, it means that we assume that the corresponding two sets are disjoint. For example, the sets and in class have an empty intersection. Overlapping lines represent sets that have a nonempty intersection, an example being the sets and in class . If two dashed lines overlap, it means that we do not assume anything about their intersection. As an example, the sets and can have an empty or a nonempty intersection in each one of the classes. Finally, if a line covers another line, it means that the set corresponding to the first line contains the set corresponding to the second line. For instance, the set in class contains the set , while in class it holds that . We remark that Figure 2 well covers the important cases that we need to analyze, but it misses a few cases. (As previously said, full details are in the Appendix.)

[Figure 2: Classes of FD sets that cannot be simplified; panels (1) to (5), one per class.]

Example 3.5

For each one of the five classes of FD sets from Figure 2 we will now give an example of an FD set that belongs to this class.

Class 1.   . In this case , , and . Thus, , and and indeed the only overlapping lines in are the dashed lines corresponding to and .

Class 2.   . It holds that , , and . Hence, and , but , and the difference from is that the lines corresponding to and in overlap.

Class 3.   . Here, it holds that , , and . Thus, , but . The difference from is that now the lines corresponding to and overlap and we do not assume anything about the intersection between and .

Class 4.   . In this case we have three local minima. We pick two of them: and . Now, , , and . Thus, and . The difference from is that now the lines corresponding to and overlap. Moreover, the line corresponding to covers the entire line corresponding to and the line corresponding to covers the entire line corresponding to . This means that we assume that and .

Class 5.   . Here, , , and , therefore and . The difference from is that now we assume that .

3.4 Most Probable Database

In this section, we draw a connection to the Most Probable Database problem (MPD) [20]. A table $T$ in our setting can be viewed as a relation of a tuple-independent database [14] if each weight is in the interval $[0, 1]$. In that case, we view the weight $w(t)$ of a tuple $t$ as the probability of that tuple, and we call the table a probabilistic table. Such a table represents a probability space over the subsets of $T$, where a subset is selected by considering each tuple $t$ independently and selecting it with the probability $w(t)$, or equivalently, deleting it with the probability $1 - w(t)$. Hence, the probability of a subset $S \subseteq T$, denoted $\Pr_T(S)$, is given by:

$$\Pr_T(S) = \prod_{t \in S} w(t) \cdot \prod_{t \in T \setminus S} \bigl(1 - w(t)\bigr) \tag{2}$$

Given a constraint $\varphi$ over the schema of $T$, MPD for $\varphi$ is the problem of computing a subset $S$ of $T$ that satisfies $\varphi$ and has the maximal probability among all such subsets. Here, we consider the case where $\varphi$ is a set $\Delta$ of FDs. Hence, MPD for $\Delta$ is the problem of computing

$$\operatorname{argmax}_{S \subseteq T,\; S \models \Delta} \Pr_T(S)$$
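As a sanity check of the definition, the sketch below computes the probability of a subset under tuple independence (the product of the weights of the kept tuples and of the complements of the weights of the deleted ones) and finds a most probable consistent subset by exhaustive search. It is an exponential-time illustration for tiny inputs, not the algorithm discussed in the paper. The names `subset_probability` and `brute_force_mpd`, the dictionary encoding, and the example data are assumptions made here.

```python
from itertools import combinations
from math import prod

def satisfies(table, fds):
    """Return True if no pair of tuples in `table` violates an FD in `fds`."""
    for s, t in combinations(table, 2):
        for lhs, rhs in fds:
            if all(s[a] == t[a] for a in lhs) and any(s[a] != t[a] for a in rhs):
                return False
    return True

def subset_probability(weights, kept):
    """Probability of keeping exactly the tuple ids in `kept`, assuming independence."""
    return prod(p if i in kept else 1.0 - p for i, p in weights.items())

def brute_force_mpd(rows, weights, fds):
    """Enumerate all subsets of tuple ids and return a most probable consistent one."""
    best, best_p = None, -1.0
    ids = list(rows)
    for k in range(len(ids) + 1):
        for subset in combinations(ids, k):
            if satisfies([rows[i] for i in subset], fds):
                p = subset_probability(weights, set(subset))
                if p > best_p:  # ties keep the first subset found
                    best, best_p = set(subset), p
    return best, best_p

# Hypothetical probabilistic table over (A, B) with the single FD A -> B.
ROWS = {1: {"A": 1, "B": 1}, 2: {"A": 1, "B": 2}, 3: {"A": 2, "B": 1}}
WEIGHTS = {1: 0.9, 2: 0.6, 3: 0.3}
FDS = [({"A"}, {"B"})]
```

In this example tuples 1 and 2 conflict, and tuple 3 is more likely absent than present, so the search keeps only tuple 1, with probability 0.9 × 0.4 × 0.7.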

Gribkoff, Van den Broeck, and Suciu [20] proved the following dichotomy for unary FDs, that is, FDs with a single attribute on their lhs.

Theorem 3.6

[20]   Let $\Delta$ be a set of unary FDs over a relational schema. MPD for $\Delta$ is either solvable in polynomial time or NP-hard.

The question of whether such a dichotomy holds for general (not necessarily unary) FDs has been left open. The following corollary of Theorem 3.3 fully resolves this question.

Theorem 3.7

Let be a set of FDs over a relational schema. If