# The Complexity of Computing a Cardinality Repair for Functional Dependencies

For a relation that violates a set of functional dependencies, we consider the task of finding a maximum number of pairwise-consistent tuples, or what is known as a "cardinality repair." We present a polynomial-time algorithm that, for certain fixed relation schemas (with functional dependencies), computes a cardinality repair. Moreover, we prove that on any of the schemas not covered by the algorithm, finding a cardinality repair is, in fact, an NP-hard problem. In particular, we establish a dichotomy in the complexity of computing a cardinality repair, and we present an efficient algorithm to determine whether a given schema belongs to the positive side or the negative side of the dichotomy.


## 1 Preliminaries

We first present some basic terminology and notation that we use throughout the paper.

### 1.1 Relational Signatures and Instances

We assume three infinite collections: attributes (column names), relation symbols (table names), and constants (cell values). A heading is a sequence $(A_1, \dots, A_k)$ of distinct attributes, where $k$ is the arity of the heading. A signature $\mathcal{S}$ is a mapping from a finite set of relation symbols $R$ to headings $(A_1, \dots, A_k)$. We use the conventional notation $R(A_1, \dots, A_k)$ to denote that $R$ is a relation symbol that is assigned the heading $(A_1, \dots, A_k)$. An instance $I$ of a signature $\mathcal{S}$ maps every relation symbol $R(A_1, \dots, A_k)$ to a finite set, denoted $R^I$, of tuples $(c_1, \dots, c_k)$ where each $c_i$ is a constant. We may omit stating the signature $\mathcal{S}$ of an instance $I$ when $\mathcal{S}$ is clear from the context or irrelevant.

Let $\mathcal{S}$ be a signature, $R$ a relation symbol of $\mathcal{S}$, $I$ an instance of $\mathcal{S}$, and $t$ a tuple in $R^I$. We refer to the expression $R(t)$ as a fact of $I$. By a slight abuse of notation, we identify an instance with the set of its facts. For example, $R(t) \in I$ denotes that $t$ is a tuple in $R^I$. As another example, $J \subseteq I$ means that $R^J \subseteq R^I$ for every relation symbol $R$ of $\mathcal{S}$; in this case, we say that $J$ is a subinstance of $I$.

### 1.2 Functional Dependencies

A Functional Dependency (FD for short) over a signature is an expression of the form , where is a relation symbol and and are sets of attributes of . When is clear from the context, we simply write . We may also write and by simply concatenating the attribute symbols; for example, we may write instead of for the relation symbol . An FD is trivial if , and otherwise it is nontrivial. We say that an attribute in an FD is trivial if it holds that and . In this case, removing a trivial attribute from the FD means removing it from . For example, if we remove the trivial attributes from the FD , the result is .

An instance satisfies an FD if for every two facts and over , if and agree on (i.e., have the same constants in the position of) the attributes of , then they also agree on the attributes of . We say that satisfies a set of FDs if satisfies every FD in ; otherwise, we say that violates . Two sets of FDs over the same signature are equivalent if every instance that satisfies one also satisfies the other. For example, and are equivalent. An FD is entailed by (denoted by ) if for every instance over the schema, if satisfies , then it also satisfies . We denote by the restriction of to the FDs over (i.e., those of the form ).
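The satisfaction condition can be illustrated with a minimal Python sketch. The representation is an assumption made only for this illustration: a fact is a dictionary from attribute names to constants, an FD is a pair of attribute sets, and the helper names `agree`, `satisfies_fd`, and `satisfies` are hypothetical.

```python
from itertools import combinations

def agree(f, g, attrs):
    """True if facts f and g carry the same constant on every attribute in attrs."""
    return all(f[a] == g[a] for a in attrs)

def satisfies_fd(facts, lhs, rhs):
    """Check a single FD lhs -> rhs: facts that agree on lhs must agree on rhs."""
    return all(agree(f, g, rhs)
               for f, g in combinations(facts, 2)
               if agree(f, g, lhs))

def satisfies(facts, fds):
    """Check a set of FDs, given as (lhs, rhs) pairs of attribute sets."""
    return all(satisfies_fd(facts, lhs, rhs) for lhs, rhs in fds)

# Hypothetical three-fact instance over attributes A, B, C.
facts = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
]
print(satisfies(facts, [({"A"}, {"B"})]))  # True: A -> B holds
print(satisfies(facts, [({"A"}, {"C"})]))  # False: the first two facts agree on A but not on C
```

The check compares every pair of facts, so it runs in time quadratic in the size of the instance, which suffices for illustration.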

Let be a set of FDs. We say that an FD, , is a local minimum of , if there is no other FD, , such that . We say that the FD is a global minimum of , if it holds that for every FD .

An FD schema is a pair , where is a signature and is a set of FDs over . Two FD schemas and are equivalent if and is equivalent to . We say that an FD schema is a chain if for every two FDs and over the same relation symbol, either or  .

For example, Table 1 depicts specific schemas that we refer to throughout the paper. None of these schemas is a chain, while the schema is a chain since it holds that .

### 1.3 Repairs

Let $(\mathcal{S}, \Delta)$ be an FD schema and let $I$ be an inconsistent instance of $\mathcal{S}$. We say that $J$ is a subset repair of $I$, or s-repair for short, if $J$ is a maximal consistent subinstance of $I$ (that is, $J$ does not violate any FD in $\Delta$, and it is not possible to add another fact from $I$ to $J$ without violating consistency) [1, 2]. We say that $J$ is a cardinality repair of $I$, or C-repair for short, if $J$ is a maximum s-repair of $I$ (that is, there is no other subset repair of $I$ that contains more facts than $J$ does).
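To make the difference between the two notions concrete, the following Python sketch finds a C-repair by exhaustive search. It is exponential in the number of facts and serves only as an illustration of the definitions on tiny instances; it is not the algorithm developed in this paper. The fact representation (dictionaries) and the function names are assumptions of the sketch.

```python
from itertools import combinations

def consistent(facts, fds):
    """True if no pair of facts agrees on the lhs of an FD but differs on its rhs."""
    for f, g in combinations(facts, 2):
        for lhs, rhs in fds:
            if all(f[a] == g[a] for a in lhs) and any(f[a] != g[a] for a in rhs):
                return False
    return True

def is_subset_repair(candidate, facts, fds):
    """Consistent, and no remaining fact can be added without a violation."""
    if not consistent(candidate, fds):
        return False
    return all(not consistent(candidate + [f], fds)
               for f in facts if f not in candidate)

def cardinality_repair_bruteforce(facts, fds):
    """Try subset sizes from largest to smallest; the first consistent subset is a C-repair."""
    for size in range(len(facts), -1, -1):
        for subset in combinations(facts, size):
            if consistent(subset, fds):
                return list(subset)
    return []

# Hypothetical instance violating A -> B.
f1 = {"A": 1, "B": 1, "C": 1}
f2 = {"A": 1, "B": 2, "C": 1}
f3 = {"A": 1, "B": 2, "C": 2}
facts, fds = [f1, f2, f3], [({"A"}, {"B"})]

print(is_subset_repair([f1], facts, fds))         # True: {f1} is maximal, hence an s-repair
print(cardinality_repair_bruteforce(facts, fds))  # [f2, f3]: the only C-repair
```

In this instance both `[f1]` and `[f2, f3]` are s-repairs, but only the latter is a C-repair.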

## 2 Main Result

In this section, we present our main result, which is a dichotomy for the problem of finding a C-repair of an inconsistent database. Note that since we only consider FD schemas, conflicting facts always belong to the same relation. Thus, if the schema contains two or more relations, we can solve the problem for each relation separately. Hence, our analysis can be restricted to single-relation schemas.

Let be an FD schema, and let be the single relation in the schema. Let be a subset of . We denote by the projection of onto the attributes in . We also denote by the result of removing the attributes in from all the FDs in . In addition, we denote by the result of removing the attributes in from the FD (that is, ).

Let be an FD schema. We start by defining the following simplification steps:

Simplification 1.   If some attribute appears on the left-hand side of all the FDs in , remove the attribute from and from all the FDs in . We denote the result by .

Simplification 2.   If contains an FD of the form , remove the attributes in from and from all the FDs in . We denote the result by .

Simplification 3.   If contains two FDs and , such that and , and for each FD in it holds that or , remove the attributes in from and from all the FDs in . We denote the result by .

Let be an FD schema, such that . We can apply simplification 2 to the schema, since it contains the FD . The result will be . Next, we can apply simplification 1, since the attribute appears on the left-hand side of all the FDs in . The result will be . Finally, since contains two FDs and , and it holds that , we can apply simplification 3. The result will be .

Theorem 2. Let be an FD schema. Then, the problem can be solved in polynomial time if and only if returns true.

The algorithm , depicted in Figure 1, starts with the given FD schema , and at each step it tries to apply one of the simplifications to . If at some point no simplification can be applied to , there are two possible cases:

• is empty. In this case, there is a polynomial time algorithm for solving .

• is not empty. In this case, is NP-hard.
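The decision procedure described above can be sketched in Python under stated assumptions. The sketch reads the three simplifications as follows: (1) some attribute occurs on every left-hand side; (2) some FD has an empty left-hand side, and its right-hand side is removed; (3) there are two FDs, written here as `x1 -> y1` and `x2 -> y2`, with `x1` contained in `y2`, `x2` contained in `y1`, and every left-hand side containing `x1` or `x2`, after which the attributes of `x1` and `x2` are removed. It also assumes that trivial attributes and FDs with an empty right-hand side are dropped after every step. All function names are hypothetical.

```python
def strip_trivial(fds):
    """Drop rhs attributes that also occur in the lhs, and FDs whose rhs becomes empty."""
    out = set()
    for lhs, rhs in fds:
        rhs = rhs - lhs
        if rhs:
            out.add((lhs, rhs))
    return out

def remove_attrs(fds, attrs):
    """Remove the given attributes from both sides of every FD."""
    return strip_trivial({(lhs - attrs, rhs - attrs) for lhs, rhs in fds})

def find_simplification(fds):
    """Return the attributes removed by an applicable simplification, or None."""
    common = frozenset.intersection(*(lhs for lhs, _ in fds))
    if common:                                   # simplification 1
        return frozenset([min(common)])
    for lhs, rhs in fds:                         # simplification 2
        if not lhs:
            return rhs
    for x1, y1 in fds:                           # simplification 3 (as read above)
        for x2, y2 in fds:
            if ((x1, y1) != (x2, y2) and x1 <= y2 and x2 <= y1
                    and all(x1 <= z or x2 <= z for z, _ in fds)):
                return x1 | x2
    return None

def is_tractable(fds):
    """Apply simplifications until none applies; succeed iff no FD is left."""
    fds = strip_trivial({(frozenset(l), frozenset(r)) for l, r in fds})
    while fds:
        attrs = find_simplification(fds)
        if attrs is None:
            return False
        fds = remove_attrs(fds, attrs)
    return True

# Each FD is a pair of strings; every character is one attribute name.
print(is_tractable([("A", "B"), ("B", "A")]))   # True
print(is_tractable([("A", "B"), ("B", "C")]))   # False
```

Every successful step removes at least one attribute, so the loop runs at most as many times as there are attributes in the schema.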

In the next sections we prove Theorem 2.

## 3 Finding a Cardinality Repair

In this section we introduce a recursive algorithm for finding a C-repair for a given instance over an FD schema . The algorithm is depicted in Figures 2 and 3. If the problem can be solved in polynomial time, the algorithm will return a C-repair, otherwise it will return . The algorithm’s structure is similar to that of , and it uses the three subroutines: , and . We will now introduce these three subroutines.

The subroutine is used if simplification 1 can be applied to . It divides the given instance into blocks of facts that agree on the value of attribute , and then finds a C-repair for each block separately, using the algorithm . Then, it returns the union of all those C-repairs.

The subroutine is used if simplification 2 can be applied to . It divides the given instance into blocks of facts that agree on the values of all the attributes in , and then finds a C-repair for each block separately, using the algorithm . Then, the algorithm selects the C-repair that contains the most facts among those C-repairs and returns it.
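The two block-based subroutines described so far can be sketched as follows. This is a minimal illustration, not the code of Figures 2 and 3: the recursive call on each block is passed in as a hypothetical callback `solve`, facts are dictionaries, and the names `blocks`, `common_lhs_step`, and `consensus_step` are assumptions of the sketch.

```python
from collections import defaultdict

def blocks(facts, attrs):
    """Partition facts into blocks that agree on all attributes in attrs."""
    groups = defaultdict(list)
    for f in facts:
        groups[tuple(f[a] for a in sorted(attrs))].append(f)
    return list(groups.values())

def common_lhs_step(facts, attr, solve):
    """Simplification 1: repair each block separately and return the union."""
    repair = []
    for block in blocks(facts, {attr}):
        repair.extend(solve(block))
    return repair

def consensus_step(facts, attrs, solve):
    """Simplification 2: repair each block separately and keep the largest repair."""
    return max((solve(b) for b in blocks(facts, attrs)), key=len, default=[])

# Hypothetical instance for the single FD A -> B. Removing A leaves an FD with an
# empty left-hand side; removing B then leaves no FDs, so the innermost call
# returns its block unchanged.
facts = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 1},
    {"A": 1, "B": 2, "C": 2},
    {"A": 2, "B": 5, "C": 1},
]
repair = common_lhs_step(
    facts, "A",
    lambda block: consensus_step(block, {"B"}, lambda b: b),
)
print(len(repair))  # 3
```

The union is safe in the first step because facts from different blocks disagree on an attribute that occurs on every left-hand side, whereas in the second step facts from different blocks always conflict, so only one block can survive.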

The subroutine is used if simplification 3 can be applied to . It divides the given instance into blocks of facts that agree on the values of all the attributes in , and then finds a C-repair for each block separately, using the algorithm . Then, the algorithm uses an existing polynomial-time algorithm for finding a maximum-weight matching in a bipartite graph. This graph has a node on its left-hand side for each possible set of values such that for some fact . Similarly, it has a node on its right-hand side for each possible set of values such that for some fact . The weight of each edge is the number of facts that appear in a C-repair of the block (the block that contains all the facts such that and ). The algorithm returns the subinstance that corresponds to this maximum-weight matching (that is, the subinstance that contains the C-repair of each block such that the edge belongs to the matching).
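The matching-based subroutine can be sketched in the same style. Two assumptions are made loudly here: the two left-hand sides involved in simplification 3 are passed as attribute sets `x1` and `x2`, and the polynomial-time matching algorithm is replaced by a plain exhaustive search over edge subsets, which is exponential and only meant for tiny examples. The callback `solve` and the name `marriage_step` are hypothetical.

```python
from collections import defaultdict

def marriage_step(facts, x1, x2, solve):
    """Simplification 3: pick per-block repairs along a maximum-weight matching."""
    # One block per (values on x1, values on x2); each block is an edge of the graph.
    groups = defaultdict(list)
    for f in facts:
        key = (tuple(f[a] for a in sorted(x1)), tuple(f[a] for a in sorted(x2)))
        groups[key].append(f)
    repairs = {key: solve(block) for key, block in groups.items()}
    edges = list(repairs)

    def best(i, used_left, used_right):
        """Best (weight, edges) over edges[i:], given the already matched nodes."""
        if i == len(edges):
            return 0, []
        skip = best(i + 1, used_left, used_right)
        u, v = edges[i]
        if u in used_left or v in used_right:
            return skip
        weight, chosen = best(i + 1, used_left | {u}, used_right | {v})
        take = (weight + len(repairs[edges[i]]), [edges[i]] + chosen)
        return max(skip, take, key=lambda t: t[0])

    _, matching = best(0, frozenset(), frozenset())
    return [f for edge in matching for f in repairs[edge]]

# Hypothetical instance for the FDs A -> B and B -> A; once A and B are removed
# no FD remains, so each block is its own repair.
facts = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 2},
    {"A": 1, "B": 2, "C": 1},
    {"A": 2, "B": 2, "C": 1},
    {"A": 2, "B": 1, "C": 1},
]
repair = marriage_step(facts, {"A"}, {"B"}, lambda block: block)
print(len(repair))  # 3
```

In a polynomial-time implementation, the inner search would be replaced by a standard maximum-weight bipartite matching routine; the surrounding block construction stays the same.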

As long as there exists a simplification that can be applied to , the algorithm applies this simplification to the schema and calls the corresponding subroutine on the result. If no simplification can be applied to , then returns the instance itself if , or otherwise. In the following sections, we prove the correctness of the algorithm, and finally we prove Theorem 2.

## 4 Tractability Side

In this section, we prove, for each one of the three simplifications, that if the problem of finding a C-repair can be solved in polynomial time, using the algorithm , after applying the simplification to a schema , then it can also be solved in polynomial time for the original schema . More formally, we prove the following three lemmas.

Let be an FD schema, such that simplification 1 can be applied to . Let be an instance of . If can be solved in polynomial time using , the problem can be solved in polynomial time using as well.

###### Proof.

Assume that can be solved in polynomial time using . That is, for each , the algorithm returns a C-repair of . We contend that can also be solved in polynomial time using . Since the condition of line 4 of is satisfied, the algorithm will call subroutine and return the result. Thus, we have to prove that returns a C-repair of .

Let be the result of . We will start by proving that is consistent. Let us assume, by way of contradiction, that is not consistent. Thus, there are two facts and in that violate an FD in . Since , and agree on the value of attribute , thus they belong to the same block . By definition, there is an FD in . Clearly, the facts and agree on all the attributes in , and since they also agree on the attribute , there exists an attribute such that . Thus, and violate an FD in , which is a contradiction to the fact that returns a C-repair of that contains both and .

Next, we will prove that is a C-repair of . Let us assume, by way of contradiction, that this is not the case. That is, there is another subset repair of , such that contains more facts than . In this case, there exists at least one value of attribute , such that contains more facts for which it holds that than . Let be the set of facts from for which it holds that that appear in , and let be the set of facts from for which it holds that that appear in . It holds that . We claim that is a subset repair of , which is a contradiction to the fact that is a C-repair of . Let us assume, by way of contradiction, that is not a subset repair of . Thus, there exist two facts and in that violate an FD, , in . By definition, there is an FD in , and since and agree on the value of attribute , they clearly violate this FD, which is a contradiction to the fact that they both appear in (which is a subset repair of ).

Clearly, if the algorithm solves the problem in polynomial time, then it also solves the problem in polynomial time, and that concludes our proof of the lemma. ∎

Let be an FD schema, such that simplification 2 can be applied to . Let be an instance of . If can be solved in polynomial time using , the problem can be solved in polynomial time using as well.

###### Proof.

Assume that can be solved in polynomial time using . That is, for each , the algorithm returns a C-repair of . We contend that can also be solved in polynomial time using . Note that the condition of line 4 cannot be satisfied, since there is no attribute that appears on the left-hand side of . Since the condition of line 6 of is satisfied, the algorithm will call subroutine and return the result. Thus, we have to prove that returns a C-repair of .

Let be the result of . We will start by proving that is consistent. Let us assume, by way of contradiction, that is not consistent. Thus, there are two facts and in that violate an FD in . That is, and agree on all the attributes in , but do not agree on at least one attribute . Note that and agree on all the attributes in (since always returns a set of facts that belong to a single block). Thus, it holds that . By definition, there is an FD in . Clearly, the facts and agree on all the attributes in , but do not agree on the attribute . Thus, and violate an FD in , which is a contradiction to the fact that returns a C-repair of that contains both and .

Next, we will prove that is a C-repair of . Let us assume, by way of contradiction, that this is not the case. That is, there is another subset repair of , such that contains more facts than . Clearly, each subset repair of only contains facts that belong to a single block (since the FD implies that all the facts must agree on the values of all the attributes in ). The instance is a C-repair of some block . If , then we get a contradiction to the fact that is a C-repair of . Thus, contains facts from another block . In this case, the C-repair of contains more facts than the C-repair of , which is a contradiction to the fact that no block has a C-repair that contains more facts than does.

Clearly, if the algorithm solves the problem in polynomial time, then it also solves the problem in polynomial time, and that concludes our proof of the lemma. ∎

Let be an FD schema, such that simplification 3 can be applied to . Let be an instance of . If can be solved in polynomial time using , then can be solved in polynomial time using as well.

###### Proof.

Assume that can be solved in polynomial time using . That is, returns a C-repair of for each . We contend that can also be solved in polynomial time using . Note that the condition of line 4 cannot be satisfied. Otherwise, there is an attribute that appears on the left-hand side of both and . Since we always remove trivial attributes from the FDs in before calling , the attribute does not appear on the right-hand side of these FDs, and it does not hold that , which is a contradiction to the fact that simplification 3 can be applied to . The condition of line 6 cannot be satisfied either, since neither nor . The condition of line 8, on the other hand, is satisfied; thus the algorithm will call subroutine and return the result. Thus, we have to prove that returns a C-repair of .

Let us denote by the result of . We will start by proving that is consistent. Let and be two facts in . Note that it cannot be the case that but (or vice versa), since in this case the matching that we found for contains two edges and , which is impossible. Moreover, if it holds that and , then and do not agree on the left-hand side of any FD in (since we assumed that for each FD in it either holds that or ). Thus, satisfies all the FDs in . Now, let us assume, by way of contradiction, that is not consistent. Thus, there are two facts and in that violate an FD in . That is, and agree on all the attributes in , but do not agree on at least one attribute . As mentioned above, the only possible case is that and . In this case, and belong to the same block , and they do not agree on an attribute . The FD belongs to , and clearly and also violate this FD, which is a contradiction to the fact that only contains a C-repair of and does not contain any other facts from this block.

Next, we will prove that is a C-repair of . Let us assume, by way of contradiction, that this is not the case. That is, there is another subset repair of , such that contains more facts than . Note that the weight of the matching corresponding to is the total number of facts in (since the weight of each edge is the number of facts in the C-repair of the block , and contains the C-repair of each block , such that the edge belongs to the matching). Let and be two facts in . Note that it cannot be the case that but , since in this case, violates the FD (we recall that , thus the fact that implies that ). Hence, it either holds that and or and . Therefore, clearly corresponds to a matching of as well (the matching will contain an edge if there is a fact , such that and ).

Next, we claim that for each edge that belongs to the above matching, the subinstance contains a C-repair of the block w.r.t. . Clearly, cannot contain two facts and from that violate an FD from (otherwise, and will also violate the FD from , which is a contradiction to the fact that is a subset repair of ). Thus, contains a consistent set of facts from . If this set of facts is not a C-repair of , then we can replace this set of facts with a C-repair of . This will not break the consistency of since these facts do not agree on the attributes in either or with any other fact in , and each FD in is such that or . The result will be a repair of that contains more facts than , which is a contradiction to the fact that is a C-repair of . Therefore, for each edge that belongs to the above matching, contains exactly facts, which means that the weight of this matching is the total number of facts in . In this case, we found a matching of with a higher weight than the matching corresponding to , which is a contradiction to the fact that corresponds to the maximum-weight matching of .

Clearly, if the algorithm solves the problem in polynomial time, it also solves the problem in polynomial time, and that concludes our proof of the lemma. ∎

## 5 Hardness Side

Our proof of hardness is based on the concept of a fact-wise reduction. Let and be two FD schemas. A mapping from to is a function that maps facts over to facts over . We naturally extend a mapping to map instances over to instances over by defining to be . A fact-wise reduction from to is a mapping from to with the following properties.

1. is injective; that is, for all facts and over , if then .

2. preserves consistency and inconsistency; that is, for every instance over , the instance satisfies if and only if satisfies .

3. is computable in polynomial time.

The following lemma is straightforward. Let and be FD schemas, and suppose that there is a fact-wise reduction from to . If the problem is NP-hard, then so is .

We first prove the hardness of for all the schemas that appear in Table 1. Then, we prove the existence of fact-wise reductions from these schemas to other schemas. We will use all of these results in our proof of correctness for the algorithm .

### 5.1 Hard Schemas

We start by proving that is NP-hard for four specific FD schemas.

The problem is NP-hard.

###### Proof.

We construct a reduction from non-mixed CNF satisfiability to . The input to the first problem is a formula with the free variables , such that has the form where each is a clause. Each clause is a disjunction of variables from one of the following sets: (a) or (b) (that is, each clause either contains only positive variables or only negative variables). The goal is to determine if there exists an assignment that satisfies . Given such an input, we will construct the input for our problem as follows. For each and , will contain the following facts:

• , if contains only positive variables and appears in .

• , if contains only negative variables and appears in .

We will now prove that there exists a satisfying assignment to if and only if a C-repair of contains exactly facts.

#### The “if” direction

Assume that a C-repair of contains exactly facts. The FD implies that no subset repair of contains two facts and such that . Thus, each subset repair contains at most one fact for each . Since contains exactly facts, it contains precisely one fact for each . We will now define an assignment as follows: if there exists a fact in for some . Note that the FD implies that no subset repair contains two facts and , thus the assignment is well defined. Finally, as mentioned above, contains a fact for each . If appears in without negation, it holds that , thus and is satisfied. Similarly, if appears in with negation, it holds that , thus and is satisfied. Thus, each clause is satisfied by and we conclude that is a satisfying assignment of .

#### The “only if” direction

Assume that is an assignment that satisfies . We claim that a C-repair of contains exactly facts. Since is a satisfying assignment, for each clause there exists a variable , such that if appears in without negation or if it appears in with negation. Let us build an instance as follows. For each we will choose exactly one variable that satisfies the above and add the fact (where ) to . Since there are clauses, will contain exactly facts, thus it is only left to prove that is a subset repair. Let us assume, by way of contradiction, that is not a subset repair. As mentioned above, each subset repair can contain at most one fact for each