The presence in a database of duplicate, but non-identical, representations of the same external entity leads to uncertainty. Applications running on top of the database, or a query answering process, may not be able to tell them apart, which may lead to ambiguity, to semantic problems such as unintended inconsistencies, and to erroneous results. In this situation, the database has to be cleaned. The whole area of entity resolution (ER) deals with identifying and merging records in a database that refer to the same real-world entity [Bleiholder and Naumann 2008, Elmagarmid, Ipeirotis and Verykios 2007]. In so doing, duplicates are eliminated from the database, while at the same time new tuples are created through the merging process. ER is one of the most common and difficult problems in data cleaning.
In the last few years there has been strong and increasing interest in providing declarative and generic solutions to data cleaning problems [Bertossi and Bravo 2013], in particular, in logical specifications of the ER process. In this direction, matching dependencies (MDs) have been proposed [Fan 2008, Fan et al. 2009]. They are declarative rules that assert that certain attribute values in relational tuples have to be merged, i.e. made identical, when certain similarity conditions hold between possibly other attribute values in those tuples.
Example 1. Consider the relational predicate , with attributes and . The symbolic rule is an MD specifying that, if for any two database tuples in an instance , when -values are similar, i.e. , then their -values have to be made equal (merged), i.e. or (or both) have to be changed to a value in common.
Let us assume that is reflexive and symmetric, and that , but . The table on the left-hand side (LHS) below provides the extension for predicate in . In it some duplicates are not “resolved”, e.g. the tuples (with tuple identifiers) and have similar – actually equal – -values, but their -values are different.
does not satisfy the MD, and is a dirty instance. After applying the MD, we could get the instance on the right-hand side (RHS), where values for have been identified. is stable in the sense that the MD holds in the traditional sense of an implication and “” on , which we call a clean instance. In general, for a dirty instance and a set of MDs, multiple clean instances may exist. Notice that if we add the MD , creating a set of interacting MDs, a merging with one MD may create new similarities that enable the other MD.
A dynamic semantics for MDs was introduced in [Fan et al. 2009], that requires pairs of instances: a first one where the similarities hold, and a second where the mergings are enforced, e.g. and in Example 1. MDs, as introduced in [Fan et al. 2009], do not specify what values to use when merging two attribute values.
The semantics was refined and extended in [Bertossi, Kolahi and Lakshmanan 2012] by means of matching functions (MFs) providing values for equality enforcements. An MF induces a lattice-theoretic structure on an attribute’s domain. Actually, a chase-based semantics for MD enforcement was proposed. On this basis, given an instance and a set of MDs, wrt. which may contain duplicates, the chase procedure may lead to several different clean and stable solutions . Each of them can be obtained by means of a provably terminating, but non-deterministic, iterative procedure that enforces the MDs through application of MFs. The set of all such clean instances is denoted by . Each clean instance can be seen as the result of an uncertainty reduction process. If at the end there are several possible clean instances, uncertainty is still present, and expressed through this class of possible worlds. Identifying cases for which a single clean instance exists is particularly relevant: for them uncertainty can be eliminated.
In [Bahmani et al. 2012], a declarative specification of this procedural data cleaning semantics was proposed. More precisely, a general methodology was developed to produce, from , and the MFs, an answer set program (ASP) [Gelfond and Lifschitz 1991, Brewka, Eiter and Truszczynski 2011] whose models are exactly the clean instances in the class . The ASP enables reasoning in the presence of uncertainty due to multiple clean instances. Computational implementations of ASP can be then used for reasoning, for computing clean instances, and for computing certain query answers (aka. clean answers), i.e. those that hold in all the clean instances [Bahmani et al. 2012]. Disjunctive ASPs, aka. disjunctive Datalog programs with stable model semantics [Eiter, Gottlob and Mannila 1997], are used (and provably required) for this task.
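The notion of certain (clean) answers can be illustrated with a minimal Python sketch. It assumes that the clean instances have already been computed (e.g. as the stable models of the cleaning program) and are given explicitly as sets of tuples. The function name `certain_answers`, the toy instances and the query are hypothetical and chosen only for illustration.

```python
def certain_answers(clean_instances, query):
    """Return the answers that the query yields in every clean instance."""
    answer_sets = [set(query(d)) for d in clean_instances]
    if not answer_sets:
        return set()
    return set.intersection(*answer_sets)


# Two hypothetical clean instances of a binary relation R(A, B).
clean_1 = {("a1", "b12"), ("a2", "b3")}
clean_2 = {("a1", "b12"), ("a2", "b34")}


def select_b(instance):
    """Toy query: project every tuple on its second attribute."""
    return {b for (_a, b) in instance}


result = certain_answers([clean_1, clean_2], select_b)
```

Here only the value shared by both clean instances survives the intersection. ASP systems usually support this kind of query answering directly, as cautious reasoning over the stable models.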
For some classes of MDs, for any given initial instance , the class contains a single clean instance that can be computed in polynomial time in the size of . Some sufficient syntactic and MF-dependent conditions were identified in [Bertossi, Kolahi and Lakshmanan 2012]. In this work we identify a new important “semantic” class of MDs, where the initial instance is also considered. This is the similarity-free attribute intersection class (the SFAI class) of combinations of MDs and initial instances. Members of this class also have (polynomial-time computable) single clean instances. For all these classes, we show that the general ASP mentioned above can be automatically and syntactically transformed into an equivalent stratified Datalog program with the single clean instance as its standard model, which can be computed bottom-up from in polynomial time in the size of [Abiteboul, Hull, and Vianu 1995, Ceri, Gottlob and Tanca 1989].
Relational ER has been approached by the machine learning community [Bhattacharya and Getoor 2007]. The idea is to learn from examples a classifier that can be used to determine if an arbitrary pair of records (or tuples), , are duplicates (of each other) or not. In order to speed up the process of learning and applying the classifier, usually blocking techniques are applied [Whang et al. 2009]. They are used to group records in clusters (blocks), for further comparison of pairs within clusters, but never of two records in different clusters. Interestingly, as reported in [Bahmani, Bertossi, and Vasiloglou 2015], MDs can be used in the blocking phase. As expected, MDs were also used during the final merging phase, after the calls to the classifier. However, the use at the earlier stage is rather surprising. The kind of MDs in this case turn out to belong, together with the initial instance, to the SFAI class. Actually, this allowed the implementation of MD-based blocking by means of Datalog.
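The effect of blocking on the number of comparisons can be seen in a small Python sketch. It shows generic key-based blocking, not the MD-based blocking of [Bahmani, Bertossi, and Vasiloglou 2015]. The record layout, the function `candidate_pairs` and the blocking key (first letter of a name) are assumptions made only for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Group record ids by block and pair up ids only within a block."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[block_key(rec)].append(rid)
    pairs = []
    for ids in blocks.values():
        # Records in different blocks are never compared.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical author records.
records = {
    1: {"name": "smith j"},
    2: {"name": "smyth j"},
    3: {"name": "doe a"},
    4: {"name": "doe alice"},
}

pairs = candidate_pairs(records, lambda rec: rec["name"][0])
```

With four records there are six possible pairs, but only the two pairs inside the blocks are handed to the classifier.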
The reason for using MDs at the blocking stage is that they may convey semantic relationships between records for different entities, and can then be used to collectively block records for different entities [Bhattacharya and Getoor 2007]: blocking together two records for an entity, say of books, may depend on having blocked together related records for a different entity, say of authors. For these kinds of applications, to capture semantic relationships, MDs were extended with relational atoms (conditions) in the antecedents, leading to the class of relational MDs.
In this work we also introduce and investigate the class of relational MDs, extend the single-clean-instance classes mentioned above to the relational MD case, and obtain, in a uniform manner, Datalog programs for the enforcement of MDs in these classes. For lack of space, our presentation is based mainly on representative examples.
We consider relational schemas with a possibly infinite data domain , a finite set of database predicates, e.g. , and a set of built-in predicates, e.g. . Each has attributes, say , each of them with a domain . We may assume that the s are different, and different predicates do not share attributes. However, different attributes may share the same domain.
An instance for is a finite set of ground atoms (or tuples) of the form , with , . We will assume that tuples have identifiers, as in Example 1. They allow us to compare extensions of the same predicate in different instances, and trace changes of attribute values. Tuple identifiers can be accommodated by adding to each predicate an extra attribute, , that acts as a key. Then, tuples take the form , with a value for . Most of the time we leave the tuple identifier implicit, or we use it to denote the whole tuple. More precisely, if is a tuple identifier in an instance , then denotes the entire atom, , identified by . Similarly, if is a list of attributes of predicate , then denotes the tuple identified by , but restricted to the attributes in . We assume that tuple identifiers are unique across the entire instance.
For a schema with predicates , with lists of attributes , resp., a matching dependency (MD) [Fan et al. 2009] is an expression of the form:
Here, are sublists of , and sublists of . The lists (also ) are comparable, i.e. the attributes in them, say , are pairwise comparable in the sense that they share the same data domain on which a binary similarity (i.e. reflexive and symmetric) relation is defined.
The MD (1) intuitively states that if, for an -tuple and an -tuple in an instance the attribute values in are similar to attribute values in , then the values and have to be made identical. This update results in another instance , where holds. W.l.o.g., we may assume that the RHS of each MD contains only one conjunct (attribute).
For a set of MDs, a pair of instances satisfies if whenever satisfies the antecedents of the MDs, then satisfies the consequents (taken as equalities). If , we say that is “dirty” (wrt. ). On the other hand, an instance is stable if [Fan et al. 2009].
We now review some elements in [Bertossi, Kolahi and Lakshmanan 2012]. In order to enforce an MD on two tuples, making values of attributes identical, we assume that for each comparable pair of attributes with domain (in common) , there is a binary matching function (MF) , such that is used to replace two values whenever necessary. MFs are idempotent, commutative, and associative. Similarity relations and MFs are treated as built-in relations.
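The three algebraic requirements on MFs can be checked mechanically on sample values. The following Python sketch is only illustrative: the set-union function `mf_union`, the helper `check_mf_laws` and the sample values are assumptions, not MFs taken from the cited work.

```python
from itertools import product


def mf_union(a, b):
    """Hypothetical matching function: merge two set values by union."""
    return a | b


def mf_average(a, b):
    """Averaging is idempotent and commutative, but not associative."""
    return (a + b) / 2


def check_mf_laws(mf, samples):
    """Test idempotency, commutativity and associativity on the samples."""
    for a in samples:
        if mf(a, a) != a:
            return False
    for a, b in product(samples, repeat=2):
        if mf(a, b) != mf(b, a):
            return False
    for a, b, c in product(samples, repeat=3):
        if mf(mf(a, b), c) != mf(a, mf(b, c)):
            return False
    return True


set_samples = [frozenset(s) for s in ("", "a", "b", "ab", "abc")]
union_ok = check_mf_laws(mf_union, set_samples)
average_ok = check_mf_laws(mf_average, [0.0, 2.0, 4.0])
```

Set union passes all three checks, whereas averaging fails associativity, so it would not qualify as an MF in the sense used here. Passing the checks on a finite sample is of course evidence, not a proof, that a function satisfies the laws on its whole domain.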
A chase-based semantics for entity resolution with MDs is as follows: starting from an instance , we identify pairs of tuples that satisfy the similarity conditions on the left-hand side of an MD , i.e. (but not the identity in its RHS), and apply an MF on the values for the right-hand side attribute, , to make them both equal to . We keep doing this on the resulting instance, in a chase-like procedure [Abiteboul, Hull, and Vianu 1995], until a stable instance is reached (cf. [Bertossi, Kolahi and Lakshmanan 2012] for details), i.e. a clean instance. An instance may have several -clean instances. denotes the set of clean instances for wrt. .
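The chase procedure can be sketched in Python for the simplest case: a single relation with attributes `A` and `B`, and one MD stating that similar `A`-values force the `B`-values to be merged. Everything concrete below is an assumption for illustration: the similarity relation, the MF `pick_longer` (a maximum over a total order, hence idempotent, commutative and associative), and the toy instance. The sketch always picks the first applicable pair, so it follows one chase sequence only, whereas the chase in general is non-deterministic.

```python
from itertools import combinations

# Hypothetical similarity: equality plus one extra similar pair.
SIMILAR_PAIRS = {frozenset({"a2", "a3"})}


def similar(x, y):
    """Reflexive and symmetric similarity relation on A-values."""
    return x == y or frozenset({x, y}) in SIMILAR_PAIRS


def pick_longer(x, y):
    """Hypothetical MF: keep the longer string, ties broken lexicographically."""
    return max(x, y, key=lambda v: (len(v), v))


def is_stable(d):
    """True if no pair of tuples has similar A-values and different B-values."""
    return not any(
        similar(d[t1]["A"], d[t2]["A"]) and d[t1]["B"] != d[t2]["B"]
        for t1, t2 in combinations(sorted(d), 2)
    )


def chase(instance):
    """Enforce the MD pair by pair until a stable instance is reached."""
    d = {tid: dict(t) for tid, t in instance.items()}
    changed = True
    while changed:
        changed = False
        for t1, t2 in combinations(sorted(d), 2):
            if similar(d[t1]["A"], d[t2]["A"]) and d[t1]["B"] != d[t2]["B"]:
                merged = pick_longer(d[t1]["B"], d[t2]["B"])
                d[t1]["B"] = d[t2]["B"] = merged
                changed = True
                break  # restart the scan on the updated instance
    return d


dirty = {
    "t1": {"A": "a1", "B": "b1"},
    "t2": {"A": "a1", "B": "b22"},
    "t3": {"A": "a2", "B": "b3"},
    "t4": {"A": "a3", "B": "b444"},
}
clean = chase(dirty)
```

Termination in this sketch relies on the MF only ever moving values upwards in a finite total order. Tuple identifiers are kept, which is what allows the dirty and the clean instance to be compared tuple by tuple.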
For given and , the class of clean instances can be specified as the stable models of a logic program in , i.e. a disjunctive Datalog program with weak negation and stable model semantics [Gelfond and Lifschitz 1991, Eiter, Gottlob and Mannila 1997], with rules of the form: . Here, , and are (positive) atoms. Rules with are called program constraints and have the effect of eliminating the stable models of the program (without them) that make their bodies (RHS of the arrow) true. When and , we have (plain) Datalog programs. When and not is stratified, we have disjunctive, stratified Datalog programs, denoted . The subclass with is stratified Datalog, denoted .
We now introduce general cleaning programs by means of a representative example (for full generality and details, see [Bahmani et al. 2012]). Let be a given, possibly dirty initial instance wrt. a set of MDs. The cleaning program, , that we will introduce here, contains an -ary predicate , for each -ary database predicate . It will be used in the form , where is a variable for the tuple identifier attribute, and is a list of variables standing for the (ordinary) attribute values of .
For every attribute in the schema, with domain , the built-in ternary predicate represents the MF , i.e. means . is used as an abbreviation for . For attributes without a matching function, becomes the equality, . For lists of variables and , denotes the conjunction . Moreover, for each attribute , there is a built-in binary predicate . For two lists of variables and representing comparable attribute values, denotes the conjunction .
In intuitive terms, program has rules to implicitly simulate a chase sequence, i.e. rules that enforce MDs on pairs of tuples that satisfy certain similarities, create newer versions of those tuples by applying matching functions, and make the older versions of the tuples unavailable for other rules. The main idea is making stable models of the program correspond to valid chase sequences leading to clean instances.
When the conditions for applying an MD hold, we have the choice between matching or not (matching is merging, or making identical, two attribute values on the basis of the MDs). If we do, the tuples are updated to new versions. Old versions are collected in a predicate, and tuples that have not participated in a matching that was possible never become old versions (see the last denial constraint under 2. in Example 2, saying that the RHS of the arrow cannot be made true).
The program eliminates, using program constraints, instances (models of the program) that result from an illegal set of applications of MDs, i.e. applications that cannot be put in a linear (chronological) order representing chase steps. This occurs when matchings use old versions of tuples that have been replaced by new versions. To ensure that the matchings are enforced according to an order that correctly represents a chase, pairs of matchings are stored in an auxiliary relation, . The last two program constraints under 6. in the example make a linear order. In particular, matchings performed using old versions of tuples are disallowed.
Example 2. Consider relation with extension in as below; and assume that exactly the following similarities hold: , ; and the MFs are as follows: ,
contains the MDs:
which are interacting in that the set of attributes in the RHS of , namely , and the set of attributes in the LHS of , namely , have non-empty intersection. For the same reason, also interacts with itself. Enforcing on results in two alternative chase sequences, each enforcing the MDs in a different order, and two final stable clean instances and .
The cleaning program is as follows:
The program constraint under 2. (last in the list) ensures that all new, applicable matchings are eventually carried out. The last set of rules (one for each database predicate) collects the final, clean extensions of these predicates.
Program has two stable models, whose -atoms are shown below:
From them we can read off the two clean instances , for that were obtained from the chase.
The cleaning program allows us to reason in the presence of uncertainty as represented by the possibly multiple clean instances. Actually, it holds that there is a one-to-one correspondence between and the set of stable models of . Furthermore, the program without its program constraints belongs to the class , the subclass of programs in that have stratified negation [Eiter and Gottlob 1995]. As a consequence, its stable models can be computed bottom-up by propagating data upwards from the underlying extensional database (that corresponds to the set of facts of the program), and making sure to minimize the selection of true atoms from the disjunctive heads. Since the latter introduces a form of non-determinism, a program may have several stable models. If the program is non-disjunctive, i.e. belongs to the , it has a single stable model that can be computed in polynomial time in the size of the extensional database . The program constraints in make it unstratified [Gelfond and Kahl 2014]. However, this is not a crucial problem because they act as a filter, eliminating the models that make them true from the class of models computed with the bottom-up approach.
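Bottom-up computation of the single model of a non-disjunctive program can be illustrated with a naive fixpoint iteration in Python. This is a sketch for a positive (negation-free) Datalog program only; for a stratified program the same iteration would be run stratum by stratum. The reachability rules and the names `bottom_up`, `base_rule` and `step_rule` are assumptions for illustration and are not part of the cleaning programs discussed here.

```python
def bottom_up(edb, rules):
    """Apply all rules repeatedly until no new facts are derived."""
    facts = set(edb)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(facts)
        if derived <= facts:
            return facts  # fixpoint reached
        facts |= derived


def base_rule(facts):
    # path(X, Y) :- edge(X, Y).
    return {("path", x, y) for (p, x, y) in facts if p == "edge"}


def step_rule(facts):
    # path(X, Z) :- edge(X, Y), path(Y, Z).
    edges = [(x, y) for (p, x, y) in facts if p == "edge"]
    paths = [(y, z) for (p, y, z) in facts if p == "path"]
    return {("path", x, z) for (x, y) in edges for (y2, z) in paths if y == y2}


# Extensional database: the facts of the program.
edb = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "c", "d")}
model = bottom_up(edb, [base_rule, step_rule])
```

Each round can only add facts built from constants of the extensional database, so the number of rounds is polynomial in its size for a fixed program.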
3 Relational MDs
We now introduce a class of MDs that have found useful applications in blocking for learning a classifier for ER [Bahmani, Bertossi, and Vasiloglou 2015]. They allow bringing additional relational knowledge into the conditions of the MDs. Before doing so, notice that an explicit formulation of the MD in (1) in classical predicate logic is as follows (similarity symbols can be treated as regular, built-in, binary predicates, but the identity symbol, , would be non-classical):
with . The are variables for tuple IDs. and denote the sets of atoms on the LHS and RHS of , respectively. Atoms and contain all the variables in the MD; and similarity and identity atoms involve one variable from each of .
Now, relational MDs may have in their LHSs, in addition to the two leading atoms, as in (2), additional database atoms, from more than one relation, that are used to give context to similarity atoms in the MD, and capture additional relational knowledge via additional conditions. Relational MDs extend “classical” MDs.
Example 3. With predicates (with ID and block attributes), this MD, , is relational:
with implicit quantifiers, and underlined leading atoms (they contain the identified variables on the RHS). It contains similarity comparisons involving attribute values for both relations Author and Paper. It specifies that when the Author-tuple similarities on the LHS hold, and their papers are similar to those in corresponding Paper-tuples that are in the same block (an implicit similarity captured by the join variable ), then blocks have to be made identical. This blocking policy uses relational knowledge (the relationships between Author and Paper tuples), plus the blocking decisions already made about Paper tuples.
4 Single-Clean-Instance Classes
First we introduce some notation. For an MD , denotes the set of (non-tid) attributes (with predicates) appearing in similarities in the LHS of (including equalities, implicit or not). Similarly, contains the attributes appearing in identities in the RHS. In Example 3: