1 Introduction
Soft variants of database constraints (also referred to as weak or approximate constraints) have been a building block of various challenges in data management. In constraint discovery and mining, for instance, the goal is to find constraints, such as Functional Dependencies (FDs) [DBLP:journals/cj/HuhtalaKPT99, DBLP:journals/cbm/CombiMSSAMP15, DBLP:conf/apweb/LiLCJY16] and beyond [DBLP:journals/pvldb/ChuIP13, DBLP:journals/pvldb/LivshitsHIK20, DBLP:journals/pvldb/PenaAN19], that generally hold in the database but not necessarily in a perfect manner. There, the reason for the violations might be rare events (e.g., agreement on the zip code but not the state) or noise (e.g., mistyping). Soft constraints also arise when reasoning about uncertain data [DBLP:conf/icdt/SaIKRR19, DBLP:conf/pods/JhaRS08, DBLP:journals/vldb/SenDG09, GVSBUDA14]—the database is viewed as a probabilistic space over possible worlds, and the violation of a weak constraint in a possible world is viewed as evidence that affects the world’s probability.
Our investigation concerns the latter application of soft constraints. To be more precise, the semantics is that of a parametric factor graph: the probability of a possible world is the product of factors where every violation of the constraint contributes one factor; in turn, this factor is a weight that is assigned upfront to the constraint. This approach is highly inspired by successful concepts such as the Markov Logic Network (MLN) [Richardson:2006:MLN:1113907.1113910]. The computational challenges are the typical ones of probabilistic modeling: marginal inference (compute the probability of a query answer) and maximum likelihood (find the most probable world)—the problem that we focus on here.
More specifically, we investigate the complexity of finding a most probable world in the case where the constraints are FDs. By taking the logarithms of the factors, this problem can be formally defined as follows. We are given a database and a set of FDs, where every tuple and every FD has a weight (a nonnegative number). We would like to obtain a cleaner subset of by deleting tuples. The cost of includes a penalty for every deleted tuple and a penalty for every violation of (i.e., pair of tuples that violates) an FD; the penalties are the weights of the tuple and the FD, respectively. The goal is to find a subset with a minimal cost. In what follows, we refer to such as an optimal subset and to the optimization problem of finding an optimal subset as soft repairing. The optimal subset corresponds to the “most likely intention” in the Probabilistic Unclean Database (PUD) framework of De Sa, Ilyas, Kimelfeld, Ré and Rekatsinas [DBLP:conf/icdt/SaIKRR19] in a restricted case that is studied in their work, and to the “most probable world” in the probabilistic database model of Sen, Deshpande and Getoor [DBLP:journals/vldb/SenDG09]. In the special case where the FDs are hard constraints (i.e., their weight is infinite or just too large to pay), an optimal subset is simply what is known as a “cardinality repair” [DBLP:conf/icdt/LopatenkoB07] or, equivalently [DBLP:journals/tods/LivshitsKR20], a “most probable database” [GVSBUDA14].
The computational challenge of soft repairing is that there are exponentially many candidate subsets. We investigate the data complexity of the problem, where the database schema and the FD set are fixed, and the input consists of the database and all involved weights. Moreover, we assume that consists of a single relation; this is done without loss of generality, since the problem boils down to soft repairing each relation independently (since an FD does not involve more than one relation).
The complexity of the problem is very well understood in the case of hard constraints (cardinality repairs). Gribkoff, Van den Broeck and Suciu [GVSBUDA14] established complexity results for the case of unary FDs (having a single attribute on the left-hand side), and Livshits, Kimelfeld and Roy [DBLP:journals/tods/LivshitsKR20] completed the picture to a full (effective) dichotomy over all possible sets of FDs. For example, the problem is solvable in polynomial time for the FD sets , and , but is NP-hard for . In contrast, very little is known about the more general case where the FDs are soft (and violations are allowed), where the problem seems to be fundamentally harder, both to solve and to reason about. Clearly, for every where it is intractable to find a cardinality repair, the soft version is also intractable. But the other direction is false (under conventional complexity assumptions). For example, soft repairing is hard for , for the following reason. We can set the weights of and to be very high, making each of them a hard constraint in effect, and the weight of very low, making it ignorable in effect. Hence, an optimal subset is a cardinality repair for that, as said above, is hard to compute.
So, for which sets of FDs is soft repairing tractable? The only polynomial-time algorithm we are aware of is that of De Sa et al. [DBLP:conf/icdt/SaIKRR19] for the special case of a single key constraint, that is, where contain all of the schema attributes; they have left the more general case (that we study here) open. In this work, we make substantial progress in answering this question by presenting algorithms for two types of FD sets: (a) a single FD and (b) a matching constraint.
The first type generalizes the tractability of De Sa et al. [DBLP:conf/icdt/SaIKRR19] from a key constraint to an arbitrary FD (as long as it is the only FD in ). Like theirs, our algorithm employs dynamic programming, but in a more involved fashion. This is because their algorithm is based on the fact that in a key constraint , any two tuples that agree on are necessarily conflicting, whereas under a general FD two such tuples may also agree on the right-hand side and hence be consistent. We also show that our algorithm can be generalized to additional sets of FDs. For example, it turns out that the FD set is tractable as well. (Note that the address attribute on the left-hand side of the second FD is not redundant, as it would be in the ordinary semantics, since the FDs are treated as soft constraints.) In Section 4 we phrase the more general condition that this FD set satisfies.
The second type, matching constraints, refers to FD sets over a schema with the attributes , …, where . The simplest example is over the binary schema that represents a bipartite graph, and the problem is that of finding the best “almost matching” of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy. A more involved example is over the schema . Our algorithm is based on a reduction to the Minimum Cost Maximum Flow (MCMF) problem [10.1287/moor.15.3.430].
Whether our algorithms cover all of the tractable cases remains an open problem for future investigation. (In the Conclusions we discuss the simplest FD sets where the question is left unsolved.) We do show, however, that there is a polynomial-time approximation algorithm with an approximation factor of 3, that is, an algorithm that finds a subset whose penalty is at most three times the optimum.
The rest of the paper is organized as follows. We give the formal setup and the problem definition in Section 2. We then discuss the complexity of the general problem and its relationship to past results in Section 3. We describe our algorithm for soft repairing in Sections 4 and 5 for a single FD and a matching constraint, respectively, and conclude in Section 6. For lack of space, some of the proofs are given in the Appendix.
2 Formal Setup
We begin with preliminary definitions and terminology that we use throughout the paper.
2.1 Databases, FDs and Repairs
A relation schema consists of a relation symbol and a set of attributes. A database over is a set of facts of the form , where each is a constant. We denote by the value that the fact associates with attribute (i.e., ). Similarly, if is a sequence of attributes from , then is the tuple .
A Functional Dependency (FD) over the relation schema is an expression of the form where . A violation of an FD in a database is a pair of tuples from that agrees on the left-hand side (i.e., ) but disagrees on the right-hand side (i.e., ). An FD is trivial if . We denote by the set of all the violations of the FD in . We say that satisfies , denoted , if it has no violations (i.e., is empty). The database satisfies a set of FDs, denoted by , if satisfies every FD in ; otherwise, violates (denoted ).
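To make the notion of a violation concrete, the following Python sketch enumerates the violating pairs of a single FD. The representation of facts as dictionaries, the function name `violations`, and the toy flight records are assumptions made for illustration only; they are not taken from the running example.

```python
from itertools import combinations


def violations(db, lhs, rhs):
    """Return all pairs of facts that agree on every attribute in `lhs`
    but disagree on at least one attribute in `rhs`."""
    def proj(fact, attrs):
        return tuple(fact[a] for a in attrs)

    return [
        (f, g)
        for f, g in combinations(db, 2)
        if proj(f, lhs) == proj(g, lhs) and proj(f, rhs) != proj(g, rhs)
    ]


# Hypothetical toy data (not the database of Figure 1).
flights = [
    {"Flight": "UA123", "Airline": "United", "Date": "01/01"},
    {"Flight": "UA123", "Airline": "Delta", "Date": "01/01"},
    {"Flight": "DL456", "Airline": "Delta", "Date": "02/01"},
]

# The first two facts share a flight number but name different airlines.
pairs = violations(flights, lhs=["Flight"], rhs=["Airline"])
```

A database satisfies the FD exactly when the returned list is empty.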
When there is no risk of ambiguity, we may omit the specification of the relation schema and simply assume that the involved databases and constraints are all over the same schema.
Let be a database and let be a set of FDs. A repair (of w.r.t. ) is a maximal consistent subset ; that is, and , and moreover, for every such that . Note that the number of repairs can be exponential in the number of facts of . A cardinality repair is a repair of a maximal cardinality (i.e., for every repair ).
2.2 Soft Constraints
We define the concept of soft constraints (or weak constraints or weighted rules) in the standard way of “penalizing” the database for every missing fact, on the one hand, and every violation, on the other hand. This is the concept adopted in past work such as the parfactors of De Sa et al. [DBLP:conf/icdt/SaIKRR19], the soft keys of Jha et al. [DBLP:conf/pods/JhaRS08], and the PrDB model of Sen et al. [DBLP:journals/vldb/SenDG09]. The concept can be viewed as a special case of the Markov Logic Network (MLN) [Richardson:2006:MLN:1113907.1113910].
Formally, let be a database and a set of FDs. We assume that every fact and every FD have a nonnegative weight, hereafter denoted and , respectively. (The weight of a fact is sometimes viewed as the log of a validity/existence probability [DBLP:journals/vldb/SenDG09, GVSBUDA14].) The cost of a subset of a database is then defined as follows.
(1) 
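Spelled out from the verbal description above (a penalty for every missing fact and a penalty for every violation that remains), the cost can be written in the following form. The symbols are assumptions introduced for this sketch: $D$ is the database, $E \subseteq D$ the subset, $\Delta$ the FD set, $w_f$ and $w_\varphi$ the weights of a fact $f$ and an FD $\varphi$, and $\mathrm{vio}(E,\varphi)$ the set of violations of $\varphi$ in $E$.

```latex
\mathrm{cost}(E \mid D) \;=\; \sum_{f \in D \setminus E} w_f
  \;+\; \sum_{\varphi \in \Delta} w_\varphi \cdot \bigl|\mathrm{vio}(E,\varphi)\bigr|
```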
As for the computational model, we assume that every weight is a rational number that is represented using the numerator and the denominator, namely , where each of the two is an integer represented in the standard binary manner.
2.3 Problem Definition: Soft Repairing
The problem we study in this paper, referred to as soft repairing, is the optimization problem of finding a database subset with a minimal cost. Since we consider the data complexity of the problem, we associate with each relation schema and set of FDs a separate computational problem.
Problem (Soft Repairing).
Let be a relation schema and a set of FDs. Soft repairing (for and ) is the following optimization problem: Given a database , find an optimal subset of , that is, a subset of with a minimal .
Note that a cardinality repair is an optimal subset in the special case where the weight of every FD is (or just higher than the cost of deleting the entire database), and the weight of every fact is . Livshits et al. [DBLP:journals/tods/LivshitsKR20] studied the complexity of finding a weighted cardinality repair, which is the same as a cardinality repair but the weight of every fact can be arbitrary. Hence, both types of cardinality repairs are consistent (i.e., the constraints are strictly satisfied). In contrast, an optimal subset in the general case may violate one or more of the FDs. In the next section we recall the known complexity results for cardinality and weighted cardinality repairs.
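As a baseline for small inputs, the problem statement can be turned directly into an exhaustive search. The sketch below is only illustrative: facts are tuples, an FD is a hypothetical triple `(lhs_positions, rhs_positions, weight)`, and the function names are ours. It takes exponential time, which is exactly the difficulty the rest of the paper addresses.

```python
from fractions import Fraction
from itertools import combinations


def cost(subset, db, fact_weight, fds):
    """Weights of the deleted facts plus, for every FD, its weight times
    the number of violating pairs that remain in `subset`."""
    kept = set(subset)
    total = sum(fact_weight[f] for f in db if f not in kept)
    for lhs, rhs, w in fds:
        for f, g in combinations(list(kept), 2):
            if all(f[i] == g[i] for i in lhs) and any(f[i] != g[i] for i in rhs):
                total += w
    return total


def brute_force_soft_repair(db, fact_weight, fds):
    """Try every subset of `db`; return (minimal cost, one optimal subset)."""
    best = None
    for r in range(len(db) + 1):
        for subset in combinations(db, r):
            c = cost(subset, db, fact_weight, fds)
            if best is None or c < best[0]:
                best = (c, list(subset))
    return best


# Hypothetical binary relation R(A, B) with soft FDs A -> B and B -> A.
f1, f2, f3 = ("a1", "b1"), ("a1", "b2"), ("a2", "b2")
db = [f1, f2, f3]
fact_weight = {f1: 2, f2: 1, f3: 1}
fds = [((0,), (1,), Fraction(1, 2)), ((1,), (0,), 3)]

best_cost, best_subset = brute_force_soft_repair(db, fact_weight, fds)
```

In this toy instance, deleting the fact that participates in both violations is cheaper than paying for either violation.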




Our running example is based on the database of Figure 1 over the relation schema that contains information about domestic flights in the United States. The weight of each tuple appears on the rightmost column. The FD set consists of the following FDs:

: a flight is associated with a single airline.

: a flight on a certain date has a single destination.
We assume that the weight of the first FD is , and the weight of the second FD is (as the same flight number can be reused for different flights).
The database of Figure 1 is a cardinality repair of as no repair of can be obtained by removing fewer than three facts. However, is not a weighted cardinality repair, since its cost is eight, while the cost of is six. The reader can easily verify that is a weighted cardinality repair of . Finally, is not a repair of in the traditional sense as it contains a violation of the second FD, but it is an optimal subset of with . ∎
3 Preliminary Complexity Analysis
We consider the data complexity of the problem of computing an optimal subset. We assume that the schema and the set of FDs are fixed, and the input consists of the database. Livshits et al. [DBLP:journals/tods/LivshitsKR20] studied the problems of finding a cardinality repair and a weighted cardinality repair, and established a dichotomy over the space of all the sets of functional dependencies. In particular, they introduced an algorithm that, given a set of FDs, decides whether:

A weighted cardinality repair can be computed in polynomial time; or

Finding a (weighted) cardinality repair is APX-complete. (Recall that APX is the class of NP optimization problems that admit constant-ratio approximations in polynomial time. Hardness in APX is via the so-called “PTAS” reductions; cf. textbooks on approximation complexity, e.g., [DBLP:reference/crc/2007aam].)
No other possibility exists. The algorithm, which is depicted here as Algorithm 1, is a recursive procedure that attempts to simplify at each iteration by finding a removable pair of attribute sets, and removing every attribute of and from all the FDs in (which we denote by ). Note that and may be the same, and then the condition states that every FD contains on the left-hand side. If we are able to transform to an empty set of FDs by repeatedly applying simplification, then the algorithm returns true and finding an optimal consistent subset is solvable in polynomial time. Otherwise, the algorithm returns false and the problem is APX-complete. We state their result for later reference.
Theorem 1 ([DBLP:journals/tods/LivshitsKR20]). Let be a set of FDs. If can be emptied via steps, then a weighted cardinality repair can be computed in polynomial time; otherwise, finding a cardinality repair is APX-complete.
The hardness side of Theorem 1 immediately implies the hardness of the more general soft-repairing problem. Yet, the other direction (tractability generalizes) is not necessarily true. As discussed in the Introduction, if , then , as a set of hard constraints, is classified as tractable according to Algorithm 1; however, this is not the case for soft constraints. We can generalize this example by stating that if contains a subset that is hard according to Theorem 1, then soft repairing is hard. (This does not hold when considering only hard constraints, as the example shows that there exists an easy with a hard subset.) In the following sections, we are going to discuss tractable cases of FD sets. Before that, we will show that the problem becomes tractable if one settles for an approximation.
3.1 Approximation
The following theorem shows that soft repairing admits a constant-ratio approximation, for the constant three, in polynomial time. This means that there is a polynomial-time algorithm for finding a subset with a cost of at most three times the minimum.
Theorem 3.1. For all FD sets, soft repairing admits a 3-approximation in polynomial time.
Proof.
We reduce soft repairing to the problem of finding a minimum weighted set cover where every element belongs to three sets. A simple greedy algorithm finds a 3-approximation to this problem in linear time [hochbaum1982approximation].
We set the elements to be . Each element belongs to three sets: with weight , with weight , and with weight . Each minimal solution to this set cover problem can be translated to a soft repair: the selected sets that correspond to tuples are removed in the repair. Indeed, a minimal set cover of such a construction has to resolve each conflict by either paying for the removal of at least one of the tuples or paying for the violation. ∎
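The reduction can be sketched in Python as follows. The proof cites [hochbaum1982approximation] for the approximation step without spelling it out; the sketch below instead uses the standard pricing (local-ratio style) scheme for weighted set cover, which also guarantees a factor equal to the maximum number of sets per element, here three. The data layout (facts as tuples, FDs as hypothetical `(lhs_positions, rhs_positions, weight)` triples) and all names are assumptions made for illustration.

```python
from itertools import combinations


def approx_soft_repair(db, fact_weight, fds):
    """Return a subset of `db` whose cost is at most three times the optimum.

    Elements are violations (f, g, FD). Each element lies in three sets:
    "delete f", "delete g", and "pay for this violation".
    Weights should be exact numbers (int or Fraction) so that the
    comparison with zero below is reliable.
    """
    residual = {("fact", f): fact_weight[f] for f in db}
    chosen = set()
    for idx, (lhs, rhs, w) in enumerate(fds):
        for f, g in combinations(db, 2):
            if all(f[i] == g[i] for i in lhs) and any(f[i] != g[i] for i in rhs):
                sets = [("fact", f), ("fact", g), ("vio", idx, f, g)]
                residual[sets[2]] = w
                if any(s in chosen for s in sets):
                    continue  # this violation is already covered
                # Raise the price of the element until some set becomes tight.
                pay = min(residual[s] for s in sets)
                for s in sets:
                    residual[s] -= pay
                    if residual[s] == 0:
                        chosen.add(s)
    removed = {s[1] for s in chosen if s[0] == "fact"}
    return [f for f in db if f not in removed]


# Hypothetical binary relation R(A, B) with soft FDs A -> B and B -> A.
f1, f2, f3 = ("a1", "b1"), ("a1", "b2"), ("a2", "b2")
db = [f1, f2, f3]
fact_weight = {f1: 2, f2: 1, f3: 1}
fds = [((0,), (1,), 1), ((1,), (0,), 3)]

approx_subset = approx_soft_repair(db, fact_weight, fds)
```

Every violation left among the kept facts is covered by its own singleton set, so the cost of the returned subset never exceeds the weight of the computed cover.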
In terms of formal complexity, Theorem 3.1 implies that the problem of soft repairing is in APX (for every set of FDs). From this, from Theorem 1 and from the discussion that follows Theorem 1, we conclude the following.
Let be a set of FDs. Soft repairing for is in APX. Moreover, if some subset of cannot be emptied via steps, then soft repairing is APX-complete for .
4 Algorithm for a Single Functional Dependency
In this section, we consider the case of a single functional dependency, and present a polynomial-time algorithm for soft repairing. Hence, we establish the following result.
Theorem 4. In the case of a single FD, soft repairing can be solved in polynomial time.
Next, we prove Theorem 4 by presenting an algorithm. Later, we also generalize the argument and result beyond a single FD (Theorem 4.1).
We assume that the single FD is and that our input database is . We split into blocks and subblocks, as we explain next. The blocks of are the maximal subsets of that agree on the values. Denote these blocks by . Note that there are no conflicts across blocks; hence, we can solve the problem separately for each block and then an optimal subset is simply the union of optimal subsets of the blocks :
The subblocks of a block are the maximal subsets of that agree on the values (in addition to the values). We denote these subblocks by . Note that two facts from the same subblock are consistent, while two facts from different subblocks are conflicting.
From here we continue with dynamic programming. For a number , where is the number of subblocks of , and a number of facts, we define the following values that we are going to compute:

is the cost of an optimal subset of (i.e., the union of the first subblocks) with precisely facts.

is a subset of that realizes , that is,
(If multiple choices of exist, we select an arbitrary one.) Once we compute the , we are done since it then suffices to return the best subset over all :
It remains to compute and . We will focus on the former, as the latter is obtained by straightforward bookkeeping. The key observation is that if we decide to delete facts from , then we always prefer to delete the facts with the minimal weight. We use this observation as follows.
For a subblock and , denote by an arbitrary subset of with facts of the highest weight. Hence, is obtained by taking a prefix of size when sorting the tuples of from the heaviest to the lightest. Then is computed as follows.
The correctness of the above computation is due to the definition of the cost in Equation (1). In particular, in the third case, we go over all options for the number of facts taken from the subblock and choose an option with the minimum cost. This cost consists of the following components:

is the cost of the best choice of facts from the remaining subblocks.

is the cost of the violations in which the th subblock participates: any combination of a fact from and a fact from the other subblocks is a violation of .

is the cost of removing every fact that is not in from the th subblock.
This completes the description of the algorithm. From this description, the correctness should be a straightforward conclusion.
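The dynamic program described in this section can be sketched in Python as follows. This is a minimal illustration under our own assumptions: facts are tuples, the FD is given by hypothetical position lists `lhs` and `rhs`, and the table is filled in a forward manner (extending the best solutions for the subblocks seen so far) rather than by an explicit case analysis.

```python
from collections import defaultdict
from fractions import Fraction


def soft_repair_single_fd(db, fact_weight, lhs, rhs, fd_weight):
    """Return (cost, kept facts) of an optimal subset for one soft FD."""
    # Blocks agree on the lhs values; subblocks also agree on the rhs values.
    blocks = defaultdict(lambda: defaultdict(list))
    for f in db:
        x = tuple(f[i] for i in lhs)
        y = tuple(f[i] for i in rhs)
        blocks[x][y].append(f)

    total_cost, kept = 0, []
    for subblocks in blocks.values():
        # best[k] = (cost, facts): cheapest way to keep exactly k facts
        # from the subblocks processed so far.
        best = {0: (0, [])}
        for sub in subblocks.values():
            # Keeping t facts of a subblock, we keep the t heaviest ones.
            sub = sorted(sub, key=lambda f: fact_weight[f], reverse=True)
            drop = [sum(fact_weight[f] for f in sub[t:]) for t in range(len(sub) + 1)]
            new_best = {}
            for k_prev, (c_prev, facts_prev) in best.items():
                for t in range(len(sub) + 1):
                    # Each of the t new facts conflicts with all k_prev
                    # facts kept from the earlier subblocks.
                    c = c_prev + fd_weight * t * k_prev + drop[t]
                    k = k_prev + t
                    if k not in new_best or c < new_best[k][0]:
                        new_best[k] = (c, facts_prev + sub[:t])
            best = new_best
        c, facts = min(best.values(), key=lambda p: p[0])
        total_cost += c
        kept.extend(facts)
    return total_cost, kept


# Hypothetical relation R(A, B) with the single soft FD A -> B.
f1, f2, f3 = ("a1", "b1"), ("a1", "b2"), ("a2", "b2")
db = [f1, f2, f3]
fact_weight = {f1: 2, f2: 1, f3: 1}

expensive = soft_repair_single_fd(db, fact_weight, (0,), (1,), 3)
cheap = soft_repair_single_fd(db, fact_weight, (0,), (1,), Fraction(1, 2))
```

With a heavy FD the lighter conflicting fact is deleted, whereas with a light FD it is cheaper to keep both facts and pay for the violation.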
4.1 Extension
In this section, we generalize the idea from the previous section. An attribute is an lhs attribute of an FD if , and it is a consensus attribute of if and (hence, states that all tuples should have the same value). The simplification step of Algorithm 2 removes an attribute if for every FD in , it is either an lhs or a consensus attribute. We prove the following.
Theorem 4.1. Let be a set of FDs. If can be emptied via L/CSimplify() steps, then soft repairing for is solvable in polynomial time.
Note that whenever can be emptied via L/CSimplify() steps, it can also be emptied via Simplify() steps. Indeed, if L/CSimplify() eliminates the attribute , then we can take: (a) and in Algorithm 1 if is a consensus attribute of some FD, or (b) if is an lhs attribute of every FD. This is expected due to Theorems 1 and 4.1, and the observation of Section 3 that soft repairing is hard whenever computing a cardinality repair is hard.
Consider the database and the FD set of our running example (Example 2.3). This FD set, which we denote here by , can be emptied via L/CSimplify() steps, by selecting attributes in the following order:
Hence, Theorem 4.1 implies that soft repairing can be solved in polynomial time for .
Next, consider the FD set consisting of the following FDs: and . This FD set is logically equivalent to ; hence, they both entail the exact same cardinality repairs. However, these sets are no longer equivalent when considering soft repairing. In particular, two facts that agree on the values of the Flight and Date attributes, but disagree on the values of the Airline and Destination attributes, violate only one FD in but two FDs in , which affects the cost of keeping these two tuples in the database. In fact, the FD set cannot be emptied via L/CSimplify() steps, as after removing the Flight attribute, no other attribute is either an lhs or a consensus attribute of the remaining FDs. The complexity of soft repairing for remains an open problem.∎
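The elimination test behind Theorem 4.1 can be sketched as a short recursive check. The code below is an illustration only: the function name, the representation of an FD as a pair of attribute sets, and the two concrete FD sets are our assumptions, chosen to mirror the discussion above rather than to reproduce the exact FDs of the running example. Since we make no claim about whether the order of elimination matters, the sketch simply tries every eligible attribute.

```python
def lc_simplifiable(fds):
    """Decide whether the FD set can be emptied by repeatedly removing an
    attribute that is, for every nontrivial FD, either an lhs attribute or
    a consensus attribute (empty lhs, attribute on the rhs)."""
    # Drop trivial FDs, i.e., those whose rhs is contained in the lhs.
    live = [(frozenset(x), frozenset(y)) for x, y in fds if not set(y) <= set(x)]
    if not live:
        return True
    attrs = set().union(*(x | y for x, y in live))
    for a in sorted(attrs):
        if all(a in x or (not x and a in y) for x, y in live):
            reduced = [(x - {a}, y - {a}) for x, y in live]
            if lc_simplifiable(reduced):
                return True
    return False


# Hypothetical FD sets over Flight, Airline, Date, Destination.
with_airline = [
    ({"Flight"}, {"Airline"}),
    ({"Flight", "Airline", "Date"}, {"Destination"}),
]
without_airline = [
    ({"Flight"}, {"Airline"}),
    ({"Flight", "Date"}, {"Destination"}),
]

first_ok = lc_simplifiable(with_airline)
second_ok = lc_simplifiable(without_airline)
```

In the second set, once Flight is removed, Airline is a consensus attribute of one FD but neither an lhs nor a consensus attribute of the other, so the process gets stuck.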
Next, we prove Theorem 4.1 by presenting a polynomial-time algorithm for soft repairing in the case where can be emptied via L/CSimplify() steps. Our algorithm generalizes the idea of the algorithm for a single FD, and we again use dynamic programming.
The main observation is as follows. Let be an attribute chosen by L/CSimplify(), and let be the maximal subsets of that agree on the value of , which we refer to as blocks (w.r.t. ). Two facts from different blocks violate all of the FDs wherein is a consensus attribute and none of the FDs wherein is an lhs attribute. Therefore, to compute the cost of a soft repair, each pair of facts from different blocks is charged with the violation of all FDs wherein is a consensus attribute. Then, we can remove from all FDs and continue the computation separately for each block.
Now, let be an FD set that can be emptied via L/CSimplify() steps, and let be the attributes in the order of such an elimination process. For each , we denote by the FD set in line 2 of the th iteration of this execution (after removing the trivial FDs). Thus, contains every nontrivial FD of , and is empty. We also denote by the total weight of the FDs in of which is a consensus attribute (if there are no such FDs, then ).
In the algorithm for a single FD, the recursion steps were with respect to the block (which determines the value of ), and so the value of was a parameter. Here, we need to maintain the assignment to all previously handled attributes, and we use and as parameters. Given , if is an assignment to the attributes , then denotes the database (i.e., the database that contains all the tuples that agree with on the values of the attributes ). We denote by the blocks of w.r.t. . Moreover, we denote by the assignment to the attributes that agrees with block on the value assigned to and agrees with on all other values. We denote by an optimal subset of of size w.r.t. . We also denote by the cost of . According to Equation (1), our goal is to compute for .
We again focus on the computation of that can be done as follows.
The first line (where ) refers to the case where is empty. Since there are no FDs that need to be taken into account, the optimal subset of of size consists of the facts of the highest weight. In the fourth case, we go over all options for the number of facts taken from the block and choose an option with the minimum cost. This cost consists of the following components:

is the cost of the best choice of facts from the remaining blocks.

is the cost of the violations in which the th block participates: any combination of a fact from and a fact from the other blocks is a violation of the FDs in which is a consensus attribute.

is the cost of the further repairing needed following the elimination of (i.e., repairing with respect to ) applied to the current block (the facts from ) .
The given recursion can be computed in polynomial time via dynamic programming; thus, this proves Theorem 4.1.
5 Algorithm for Matching Constraints
Next, we consider the case of a “matching” constraint, where the FD set states two keys that cover all of the attributes. (We give the precise definition in Section 5.1.) We present a polynomial-time algorithm for soft repairing in this case. For the sake of presentation, we first describe the algorithm for the special case where the schema is and . Later in the section, we generalize it to the case of two keys. So, we begin by proving the following lemma.
Soft repairing is solvable in polynomial time for and .

In the remainder of this section, we assume the input over . We begin with an observation. For it holds that:
Since the value does not depend on the choice of , minimizing the value is the same as minimizing the value . We use the following notation:
To solve the problem, we construct a reduction to the Minimum Cost Maximum Flow (MCMF) problem. The input to MCMF is a flow network , that is, a directed graph with a source node having no incoming edges and a sink node having no outgoing edges. Each edge is associated with a capacity and a cost . A flow of is a function such that for every , and moreover, for every node it holds that where and are the sets of incoming and outgoing edges of , respectively. A maximum flow is a flow that maximizes the value , and a minimum cost maximum flow is a maximum flow with a minimal cost, where the cost of a flow is defined by . We say that is integral if all values are integers. It is known that, whenever the capacities are integral (i.e., natural numbers, as will be in our case), an integral minimum cost maximum flow exists and, moreover, can be found in polynomial time [DBLP:books/daglib/0069809, Chapter 9].
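The definitions above (capacity constraints, conservation at internal nodes, value and cost of a flow) can be checked mechanically. The following Python sketch validates a candidate flow and reports its value and cost; the dictionary-based encoding of the network and the name `check_flow` are assumptions made for illustration, and the tiny network is hypothetical. It does not compute a minimum cost maximum flow.

```python
def check_flow(edges, flow, source, sink):
    """edges: {(u, v): (capacity, cost)}; flow: {(u, v): amount}.

    Return (value, cost) of the flow, or raise ValueError if a capacity
    or conservation constraint is broken."""
    balance = {}
    for (u, v), (cap, _) in edges.items():
        amount = flow.get((u, v), 0)
        if not 0 <= amount <= cap:
            raise ValueError(f"capacity violated on edge {(u, v)}")
        balance[u] = balance.get(u, 0) - amount
        balance[v] = balance.get(v, 0) + amount
    for node, b in balance.items():
        if node not in (source, sink) and b != 0:
            raise ValueError(f"conservation violated at node {node}")
    # The source has no incoming edges, so its net outflow is the value.
    value = -balance.get(source, 0)
    total_cost = sum(flow.get(e, 0) * c for e, (_, c) in edges.items())
    return value, total_cost


# Hypothetical network; note that costs may be negative.
edges = {
    ("s", "a"): (1, 2),
    ("a", "t"): (1, -1),
    ("s", "t"): (1, 5),
}
value, total_cost = check_flow(edges, {("s", "a"): 1, ("a", "t"): 1}, "s", "t")
```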
From we construct instances of the MCMF problem, where is the number of facts in , in the following way.
First, we denote the FD by and the FD by . We also denote by the set of values occurring in attribute in (that is, ). We do the same for attribute and denote by the set of values that occur in attribute in . For each value we denote by the number of appearances of the value a in attribute (i.e., the number of facts such that ). Similarly, we denote by the number of appearances of the value b in attribute in . Observe that
since every fact of the form violates with every fact where . Similarly, it holds that
Next, we describe the construction of the network . Our construction for the database of Figure 1(a) is illustrated in Figure 3. Note that Figure 1(b) depicts the conflict graph of the database of Figure 1(a) w.r.t. , which contains a vertex for each fact in the database and an edge between two vertices if the corresponding facts jointly violate an FD of . The blue edges in the conflict graph are violations of the FD and the red edges are violations of the FD .
For each we construct the network that consists of the set of nodes where:
contains the following edges:

, with cost

for every , with cost

for every value , with cost

for every and such that occurs in , with cost

for every value , with cost

for every , with cost
The capacity of the edge is and the capacity of the other edges is . The intuition for the construction is as follows. A network with edges of the form that are connected to a source on one side and a target on the other corresponds to a matching, which in turn corresponds to a traditional repair. To allow violations of , we add the vertices . The cost of a violation of this FD is defined by the cost of the edges . In particular, if we keep facts of the form for some we pay for violations of . We include the vertices to similarly allow violations of . The discarding of facts is discouraged by offering gain for the edges . Finally, to prevent the case where the flow always fills the entire network (which corresponds to taking all facts and paying for all violations), we introduce the edge which limits the capacity of the network, and enables us to find the minimum cost flow of a given size . We will show that for every , the cost of the solution to the MCMF problem on will be the cost of the “cheapest” subinstance of of size . Hence, the solution to our problem is the cost of the minimal solution among all the instances .
Given an integral flow in , the repair induced by , is the set of facts corresponding to edges of the form such that . Moreover, given a subinstance of of size , we denote by the integral flow in defined as follows.


for and for for every

for and for for every

if and otherwise

for and for for every

for and for for every
The reader can easily verify that is indeed an integral flow in . Clearly, the value of the flow is .
We have the following lemmas. The first is proved in the Appendix and the second follows straightforwardly from the construction of and the definition of .
Every integral solution to MCMF on satisfies .
Every subinstance of satisfies .
Now, let be an optimal subset of w.r.t. and assume that . Let be a solution with the minimum cost among all the solutions to MCMF on . Lemma 5 implies that there is an integral flow in such that . Hence, we have that . By applying Lemma 5 on , there is another subinstance of such that . Since is an optimal subset, we have that . Overall, we have that , and we conclude that . Therefore, by taking the solution with the lowest cost among all solutions to MCMF on , we indeed find a solution to our problem, and that concludes our proof of Lemma 5.
Consider again the database of Figure 1(a). Assume that: