ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

02/07/2016 ∙ by Zeinab Bahmani, et al. ∙ Carleton University

Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate representations of the same external entities, and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) Building a classifier for duplicate/non-duplicate record pairs using machine learning (ML) techniques; (b) Use of MDs for supporting the blocking phase of ML; (c) Record merging on the basis of the classifier results; and (d) The use of the declarative language "LogiQL" (an extended form of Datalog supported by the "LogicBlox" platform) for all activities related to data processing, and the specification and enforcement of MDs.


1 Introduction

Entity resolution (ER) is a common and difficult problem in data cleaning that has to do with handling unintended multiple representations in a database of the same external objects. This problem is also known as deduplication, reference reconciliation, merge-purge, etc. Multiple representations lead to uncertainty in data and the problem of managing it. Cleaning the database reduces uncertainty. In more precise terms, ER is about the identification and fusion of database records (think of rows or tuples in tables) that represent the same real-world entity naumannACMCS ; elmargamid . As a consequence, ER usually goes through two main consecutive phases: (a) detecting duplicates, and (b) merging them into single representations.

For duplicate detection, one must first analyze multiple pairs of records, comparing the two records in each pair, and discriminating between pairs of duplicate records and pairs of non-duplicate records. This classification problem is approached with machine learning (ML) methods, to learn from previously known or already made classifications (a training set for supervised learning), building a classification model (a classifier) for deciding about other record pairs Christen2007 ; elmargamid .

In principle, in ER every two records (forming a pair) have to be compared through the classifier. Most of the work on applying ML to ER operates at the record level Rastgoi11 ; Christen2007 ; Christen2008 , and only some of the attributes, or their features, i.e. numerical values associated with them, may be involved in duplicate detection. The choice of relevant sets of attributes and features is application dependent.

With a classifier at hand, ER may be a task of quadratic complexity since it requires comparing every two records. To reduce the large number of two-record comparisons, blocking techniques are used surveyBlocking ; Baxter03 ; Herzog07 ; Garcia-Molina09 . Commonly, a single record attribute, or a combination of attributes, the so-called blocking key, is used to split the database records into blocks. Next, under the assumption that any two records in different blocks are unlikely to be duplicates, only every two records in a same block are compared for duplicate detection.
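To make this concrete, here is a minimal sketch in Python (not ERBlox code; the records and the blocking key are hypothetical) of standard key-based blocking and its effect on the number of comparisons:

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical records; "city" plays the role of the blocking key.
    records = [
        {"rid": 1, "name": "john doe", "city": "ottawa"},
        {"rid": 2, "name": "j. doe", "city": "ottawa"},
        {"rid": 3, "name": "mary roe", "city": "toronto"},
    ]

    # Without blocking: every two records are compared, O(n^2) pairs.
    all_pairs = list(combinations(records, 2))

    # With blocking: records sharing the blocking-key value go into one
    # block, and only records within a same block are compared.
    blocks = defaultdict(list)
    for r in records:
        blocks[r["city"]].append(r)
    candidate_pairs = [p for blk in blocks.values()
                       for p in combinations(blk, 2)]

    print(len(all_pairs), len(candidate_pairs))  # 3 vs. 1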

Although blocking will discard many record pairs that are obvious non-duplicates, some true duplicate pairs might be missed (by putting them in different blocks), due to errors or typographical variations in attribute values or the rigidity and low sensitivity of blocking keys. More interestingly, similarity between blocking key values alone may fail to capture the relationships that naturally hold in the data and could be used for blocking. Thus, entity blocking based only on similarities of blocking key values may cause low recall. This is a major drawback of traditional blocking techniques.

In this work we consider different and coexisting entities, for example Author and Paper. For each of them, there is a collection of records. For entity Author, records may have the form ⟨name, affiliation, …⟩; and for the Paper entity, records may be of the form ⟨title, year, venue, …⟩. (For all practical purposes, think of records as database tuples in a single table.)

Records for different entities may be related via attributes in common and referential constraints, something the blocking mechanism could take advantage of. Blocking can be performed on each of the participating entities, and the way records for an entity, say Author, are placed in blocks may influence the way the records for another entity, say Paper, are assigned to blocks. This is called “collective blocking”. Semantic, relational information, in addition to that provided by blocking keys for single entities, can be used to state relationships between different entities and their corresponding similarity criteria. So, blocking decision making forms a collective and intertwined process involving several entities. In the end, the records for each individual entity will be placed in blocks associated with that entity.

In our work, collective blocking is based on blocking keys and the enforcement of semantic information about the relational closeness of entities Author and Paper, which is captured by a set of matching dependencies (MDs) FanJLM09 . So, we propose “MD-based collective blocking”.

After records are divided into blocks, the proper duplicate-detection process starts; it is carried out by comparing every two records in a block, and classifying the pair as “duplicates” or “non-duplicates” using the trained ML model at hand. In the end, records in duplicate pairs are considered to represent the same external entity, and have to be merged into a single representation, i.e. into a single record. This second phase is also application dependent. MDs were originally proposed to support this kind of task, and their use in blocking is somehow unexpected.

Matching dependencies are declarative logical rules that tell us under what conditions of similarity between attribute values any two records must have certain attribute values merged (or matched), i.e. made identical Fan08 ; FanJLM09 . For example, the MD:

    R[A] ≈ R[A]  →  R[B] ⇌ R[B]      (1)

tells us that, for any two records for entity (or relation or table) R that have similar values for attribute A, their values for attribute B should be merged, i.e. made the same.

MDs as introduced in FanJLM09 do not specify how to merge values. In Bertossi12 , MDs were extended with matching functions (MFs). For a data domain, an MF specifies how to assign a value in common to two values. In this work, we adopt MDs with MFs. In the end, the enforcement of MDs with MFs should produce a duplicate-free instance (cf. Section 2 for more details).

MDs have to be specified in a declarative manner, and at some point enforced, by producing changes on the data. For this purpose, we use the LogicBlox platform, a data management system developed by the LogicBlox company (www.logicblox.com), that is centered around its declarative language, LogiQL Halpin15 . LogiQL supports relational data management and, among several other features Aref15 , an extended form of Datalog with stratified negation ceri90 . This language is expressive enough for the kind of MDs considered in this work. (For arbitrary sets of MDs, we need higher expressive power Bertossi12 , such as that provided by answer set programming Bahmani12 .)

In this paper, we describe our ERBlox system. It is built on top of the LogicBlox platform, and implements entity resolution (ER) applying LogiQL for the specification and enforcement of MDs, and built-in ML techniques for building the classifier. More specifically, ERBlox has four main components or modules:

  • MD-based collective blocking:  This phase is just about clustering together records that might be duplicates of each other. Additional comparisons between two records will be performed only within a block, and never with records from different blocks. Blocking can be used before learning a classifier, to ascribe labels (duplicate/non-duplicate, or 1/0) to pairs of records that will become training examples, or after the classifier has been learned, with new records in the database that have to be checked for duplication with other records in the database. (In our case, the training data already came with labels. So, blocking was applied to the unlabeled records before, but independently from, the learning and execution of the classifier.) It may be the case that two records in a same block end up not being considered as duplicates of each other. However, through blocking the number of pairwise record comparisons is reduced.

  • ML-based classification model construction:  At this point any supervised technique for classification, i.e. for building the mathematical model for classification, could be used. This is the proper machine learning phase. We used the support-vector machine (SVM) approach cristianini ; Vapnik98 .

  • Duplicate detection:  Having the new records in the database (as opposed to training examples) already clustered in blocks, this phase is about applying the classification model obtained in the previous phase to new pairs of records, obtaining for each pair the outcome 1 (duplicates) or 0 (non-duplicates). The classifier could be applied to any two records, or, if blocking techniques have been applied to the database, only to two records in a same block. In our case, we did the latter.

  • MD-based duplicate merging:  The output of the preceding phase is just a set of record-pairs with their newly created labels, 1 or 0, indicating that they are duplicates of each other, or not. This last phase merges duplicates into single records. In our case, when and how to merge is specified by matching dependencies, which invoke matching functions to find the values in common to be used in the records created by merging.

The blocking phase, (a) above, uses MDs to specify the blocking strategy. They express conditions in terms of blocking-key similarities and also relational closeness (the semantic knowledge) to assign two records to a same block, by making their block identifiers identical. Then, under MD-based collective blocking, different records of possibly several related entities are simultaneously assigned to blocks through the enforcement of MDs (cf. Section 5 for details). This is a non-traditional, novel use of MDs, whereas their originally intended use is in the proper merging phase, (d) above Fan08 .

It is important to emphasize that, in our work, MDs are not used for the whole ER process, but only in two of the phases above. In principle, the whole ER process could be based only on MDs. However, this would be a completely different problem; in particular, a completely different machine learning problem: the MDs for this application would have to be learned from scratch (implicitly learning similarity relationships and a classifier). Learning MDs is a rather unexplored area of research (cf. leiChen ; leiChen2 for some work in this direction), which is somehow closer to the areas of rule learning fur+ and discovery of database dependencies felix . With our approach we can leverage, at a particular phase of the ER process, available machine learning techniques that are fully integrated with database management systems, as in the case of LogicBlox.

The sets of MDs used in (a) and (d) are different, and play different roles. In both cases, they are application-dependent, and have a canonical representation in the system, as Datalog rules. The MDs are then enforced by applying (running) those rules. Although in general a set of MDs may lead to alternative final instances through its enforcement Bertossi12 , in our application of MDs both sets of MDs lead to a single instance.

In the case of (a), this means that, for each entity, a unique set of disjoint blocks is generated. The reason is that the combination of the set of MDs and the initial database instance falls into a newly identified, well-behaved class, the SFAI class, that we introduce in this work. (The main ideas and intuitions around it are presented in the main body of this paper, but more specific details are given in A.)  In the case of (d), the set of “merge” MDs also leads to a single, duplicate-free instance (as captured by the classifier and the merge MDs). This is because the MDs in the set turn out to be interaction-free Bertossi12 (cf. also A).

We use LogiQL to declaratively implement the two MD-based components of ERBlox. As shown in Bahmani12 ; BahmaniExten12 , in general sets of MDs can be expressed by means of answer-set programs (ASPs) brewka . However, both classes of MDs used by ERBlox can be expressed in computationally efficient fragments of ASP, namely Datalog with stratified negation ceri90 , which is supported by LogiQL.

On the machine learning side (item (b) above), the problem is about building and implementing a model for the detection of pairs of duplicate records. The classification model is trained using record-pairs known to be duplicates or non-duplicates. We independently used three established classification algorithms: SVM, k-nearest neighbors (K-NN) Cover67 , and the non-parametric Bayes classifier (NBC) Baudat00 . We used the Ismion (http://www.ismion.com) implementation of them due to the in-house expertise at LogicBlox. Since the emphasis of this work is on the use of LogiQL and MDs, we will refer only to our use of SVM.

For experimentation with the ERBlox system, we used as dataset a snapshot of Microsoft Academic Search (MAS) (http://academic.research.microsoft.com, as of January 2013) that includes 250K authors and 2.5M papers, and a training set. We also used, independently, datasets from DBLP and Cora Citation. The experimental results show that our system improves ER recall and precision over traditional, standard blocking techniques jaro89 , where just blocking-key similarities are used. Actually, MD-based collective blocking leads to higher precision and recall on the given datasets.

Our work also shows the integration under a single system of different forms of data retrieval, storage and transformation, on one side, and machine learning techniques, on the other. All this is enabled by the use of optimized Datalog-rule declaration and execution as supported by the LogicBlox platform.

This paper is structured as follows. Section 2 introduces background on: matching dependencies (including a brief description of the new SFAI class), classification, and collective blocking. A general overview of the ERBlox system is presented in Section 3. Specific details about the components of our methodology and ERBlox are given and discussed in Sections 4, 5, 6, and 7. Experimental results are shown in Section 8. Sections 9 and 10 present related work and conclusions, respectively. In A we provide the definitions and more details about relational MDs, the SFAI class, and other classes with the unique clean instance property. (The material in A is all new but, although important for ERBlox, departs from the main thread of the paper.) This paper is a revised and extended version of sum .

2 Preliminaries

2.1 Matching dependencies

We consider an application-dependent relational schema 𝓡, with a data domain U. For an attribute A, Dom_A ⊆ U is its domain. We assume predicates do not share attributes, but different attributes may share a domain. An instance D for 𝓡 is a finite set of ground atoms of the form R(c1, …, cn), with R ∈ 𝓡 and c1, …, cn ∈ U. The active domain of an instance D, denoted Adom(D), is the finite set of all constants from U that appear in D.

We assume that each entity is represented by a relational predicate, and the tuples or rows in its extension correspond to records for the entity. As in Bertossi12 , we assume records have unique, fixed, global record identifiers (rids), which are positive integers. This allows us to trace changes of attribute values in records. When records are represented as tuples in a database, which is usually the case, we talk about global tuple identifiers (tids). Record and tuple ids are placed in an extra, first attribute of the predicate R that acts as a key. Then, records take the form R(t, c1, …, cn), with t the identifier. Sometimes we leave tids and rids implicit.  If Ā is a sublist of the attributes of a predicate R, R[Ā] denotes the restriction of an R-tuple (or of the predicate R) to the attributes in Ā.

MDs are formulas of the form Fan08 ; FanJLM09 :

    φ:  R1[X̄1] ≈ R2[X̄2]  →  R1[Y1] ⇌ R2[Y2]      (2)

where the attributes (treated as variables) in X̄1 and X̄2 (also Y1 and Y2) are pairwise comparable, in the sense that they share the same data domain on which a binary similarity (i.e., reflexive and symmetric) relation ≈ is defined. R1 and R2 could be the same predicate. The MD in (2) states that, for every pair of tuples (one in relation R1, the other in relation R2) where the left-hand side (LHS) of the arrow is true, the attribute values in them on the right-hand side (RHS) have to be made identical. We can consider only MDs with a single identity atom (with ⇌) in the RHSs. Accordingly, an explicit formulation of the MD in (2) in classical predicate logic is (similarity symbols can be treated as regular, built-in, binary predicates, but the identity symbol, ⇌, would be non-classical):

    ∀t1 t2 ∀x̄1 x̄2 (R1(t1, x̄1) ∧ R2(t2, x̄2) ∧ x̄1′ ≈ x̄2′  →  y1 ⇌ y2)      (3)

with x̄1′ ⊆ x̄1, x̄2′ ⊆ x̄2, y1 ∈ x̄1, y2 ∈ x̄2. The t1, t2 are used as variables for tuple IDs. We usually leave the universal quantifiers implicit. LHS(φ) and RHS(φ) denote the sets of atoms on the LHS and RHS of φ, respectively. LHS(φ) contains, apart from similarity atoms, the atoms R1(t1, x̄1) and R2(t2, x̄2), which contain all the variables in the MD, including those in RHS(φ). So, similarity and identity atoms in φ involve one variable from predicate R1, and one from predicate R2.


Example 1. Consider Paper(PID, Title, Year, CID, …, Bl#), a relational predicate representing records for entity Paper. It includes a first attribute for a tuple identifier, and a last one indicating the block the tuple (record) has been assigned to. The MD

    Paper[Title] ≈ Paper[Title] ∧ Paper[Year] = Paper[Year] ∧ Paper[CID] = Paper[CID]  →  Paper[Bl#] ⇌ Paper[Bl#]      (4)

involves a similarity relation on the Title attribute, and equality as the similarity relation on attributes Year and CID. The MD specifies that, when the conditions expressed in the LHS are satisfied, the two block values have to be made the same, i.e. the two records should be (re)assigned to the same block.

A dynamic, chase-based semantics for MDs with matching functions (MFs) was introduced in Bertossi12 ; we briefly summarize it here. Given an initial instance D, the set Σ of MDs is iteratively enforced until they cannot be applied any further, at which point a resolved instance has been produced.

In order to enforce (the RHSs of) MDs, there are binary matching functions (MFs) m_A : Dom_A × Dom_A → Dom_A; and m_A(a, a′) is used to replace two values a, a′ ∈ Dom_A that have to be made identical. For example, for an attribute A, we might have an MF m_A such that m_A(a, a′) combines the information in a and a′ into a single value.

MFs are idempotent, commutative, and associative, and then induce a partial-order structure ⟨Dom_A, ⪯_A⟩, with a ⪯_A a′ :⇔ m_A(a, a′) = a′ Bertossi12 ; BenjellounGMSWW09 . It always holds: a ⪯_A m_A(a, a′) and a′ ⪯_A m_A(a, a′). Actually, the relationship a ⪯_A a′ can be thought of in terms of information contents: a′ is at least as informative as a. (Of course, this claim assumes that MFs locally assign a value at least as informative as both of the two input values. MFs are application dependent.) This partial order allows us to define a partial order ⊑ on instances Bertossi12 . Accordingly, when MDs are applied, a chain of increasingly more informative (or less uncertain) instances is generated: D0 ⊑ D1 ⊑ D2 ⊑ ⋯. In this work, MFs are treated as built-in relations.
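As an illustration of these properties, the following minimal sketch uses a hypothetical set-union MF on keyword sets (actual MFs are application dependent); union is idempotent, commutative and associative, and the induced order is the one just described:

    # Hypothetical MF on keyword sets; the union of two sets is at least
    # as informative as each of them.
    def m(a: frozenset, b: frozenset) -> frozenset:
        return a | b

    # Induced partial order: a precedes b iff m(a, b) == b.
    def preceq(a: frozenset, b: frozenset) -> bool:
        return m(a, b) == b

    a = frozenset({"databases"})
    b = frozenset({"databases", "entity resolution"})

    assert m(a, a) == a            # idempotent
    assert m(a, b) == m(b, a)      # commutative
    assert preceq(a, m(a, b))      # both inputs precede the merged value
    assert preceq(b, m(a, b))
    print(sorted(m(a, b)))         # ['databases', 'entity resolution']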

Given a database instance D and a set of MDs Σ, there may be several resolved instances for D and Σ Bertossi12 . However, there is a unique resolved instance if one of the following holds Bertossi12 ; BahmaniExten12 :

  • The MFs used by Σ are similarity-preserving, i.e., for every a, a′, a′′: if a ≈ a′ then a ≈ m_A(a′, a′′).  When MDs use similarity-preserving MFs, we also say that the MDs are similarity-preserving.

  • Σ is interaction-free, i.e. no attribute (with its predicate) appears both in a RHS and a LHS of MDs in Σ.

    For example, the set {R[A] ≈ R[A] → R[B] ⇌ R[B],  R[B] ≈ R[B] → R[C] ⇌ R[C]} is not interaction-free, due to the presence of attribute B in both a RHS and a LHS. The set {R[A] ≈ R[A] → R[B] ⇌ R[B],  R[A] ≈ R[A] → R[C] ⇌ R[C]} is interaction-free.

  • The combination of Σ and the initial instance D is similarity-free attribute intersection (we say it is SFAI) if Σ is interaction-free or, otherwise, for every pair of interacting MDs φ1, φ2 in Σ, and for every pair of tuples t1, t2, it holds that LHS(φ1) is not true of t1, t2 in instance D, or LHS(φ2) is not true of t1, t2 in instance D.

    Consider, for example, predicate R(A, B, C), the instance D = {R(a1, b1, c1), R(a2, b2, c2)}, and the set Σ of interacting MDs:

    φ1:  R[A] ≈ R[A] → R[B] ⇌ R[B],      φ2:  R[B] ≈ R[B] → R[C] ⇌ R[C].

    Assume that the only similarities that hold in the data domain are a1 ≈ a2 and b1 ≈ b3, with b3 ∉ Adom(D). Since φ2 is not applicable in D (i.e. there is no pair of tuples making its LHS true), the combination of Σ and D is SFAI. Notice that b1 ≈ b3 does not matter, because there is no tuple in D with b3 as value for attribute B.

    With general sets of MDs, different orders of MD enforcement may result in different clean instances, because tuple similarities may be broken during the chase with interacting, non-similarity-preserving MDs, without reappearing again Bertossi12 . With SFAI combinations, two similar tuples in the original instance (or becoming similar along a chase sequence) may have the similarities broken in a chase sequence, but they will reappear later on in the same and the other chase sequences. Thus, different orders of MD enforcement cannot lead in the end to different clean instances.

The SFAI class had not been investigated before. It is a semantic class, as opposed to syntactic, in that there is a dependency upon the initial instance. See A for more details on this class.

The three classes above have the unique clean instance (UCI) property, i.e. iteratively and exhaustively enforcing them leads to a single clean, stable instance. Even more, in these three cases, the single clean instance can be computed in polynomial time in data, i.e. in polynomial time in |D|, the size of the initial instance, leaving the set of MDs as a fixed, external parameter for the computational problem that here receives database instances as inputs. (In data management it is common to measure computational complexity, in our case time complexity, in terms of the size of the underlying dataset, which is usually much larger than that of any other ingredient, such as a query, a set of integrity constraints, a set of view definitions, etc. If we bring the sizes of the latter into the complexity analysis, we talk of combined complexity ahv95 .)

In this work, for collective-blocking purposes, we will introduce and use a new class of MDs, that of relational MDs, which extends the class of “classical” MDs introduced earlier in this section. Actually, the three UCI classes of classical MDs listed above can be extended to relational MDs, while preserving the UCI property (cf. A for more details).

Relational MDs, the SFAI class, and the UCI property are all relevant for this work. However, a detailed analysis of them is somewhat beyond the scope of this work. For this reason, and in order not to break the natural flow of the presentation, we provide in A, mainly for reference, some more details about all these subjects.

2.2 Classification with support-vector models

The support-vector machines technique (SVM) Vapnik98 is a form of kernel-based learning.  SVM can be used for classifying vectors in an inner-product vector space over ℝ. Vectors are classified in two classes, say with labels 1 or 0. The classification model is a hyperplane in ℝⁿ: vectors are classified depending on the side of the hyperplane on which they fall.

The hyperplane has to be learned through an algorithm applied to a training set of examples, say E = {(e1, f(e1)), …, (em, f(em))}. Here, each ei ∈ ℝⁿ, and f(ei) ∈ {0, 1} is the label assigned to ei by the classification function f.

Figure 1:   Classification hyperplane

The SVM algorithm finds an optimal hyperplane, H, in ℝⁿ that separates the two classes in which the training vectors are classified. Hyperplane H has an equation of the form w • x + b = 0, where • denotes the inner product, x is a vector variable, w is a weight-vector of real values, and b is a real number. Now, a new vector e in ℝⁿ can be classified as positive or negative depending on the side of H on which it lies. This is determined by computing h(e) := sign(w • e + b). If h(e) > 0, e belongs to class 1; otherwise, to class 0.

It is possible to compute real numbers α1, …, αm, the coefficients of the “support vectors”, such that the classifier can be computed through h(e) = sign(Σi αi (ei • e) + b) flach .

As Figure 1 shows, in our case we need to classify pairs of records, that is, our vectors are weight-vectors w(r1, r2) for record-pairs ⟨r1, r2⟩. If w(r1, r2) falls on the positive side of the hyperplane, the classifier returns as output 1, meaning that the two records are duplicates (of each other). Otherwise, it returns 0, meaning that the records are non-duplicates (of each other). For the moment we do not need more than this about the SVM technique. (A small sketch follows.)
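The following minimal sketch (not ERBlox's actual trained model or the Ismion implementation; the weight-vectors and labels are hypothetical) shows the two equivalent views of the classifier just described, the side of the hyperplane via sign(w • e + b) and the library's own prediction:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Each row is a weight-vector w(r1, r2) of per-attribute similarities
    # for a record pair; label 1 = duplicates, 0 = non-duplicates.
    X = np.array([[0.90, 0.80, 1.0],
                  [0.20, 0.10, 0.0],
                  [0.85, 0.90, 1.0],
                  [0.30, 0.20, 0.0]])
    y = np.array([1, 0, 1, 0])

    clf = LinearSVC().fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]   # the hyperplane w . x + b = 0

    new_pair = np.array([0.8, 0.75, 1.0])
    side = float(np.dot(w, new_pair) + b)    # which side of the hyperplane?
    print(int(side > 0), clf.predict([new_pair])[0])  # expected: 1 1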

2.3 Collective blocking

Entity resolution (and other machine learning tasks) uses blocking techniques to group together input values for further processing. In the case of ER, records that might be duplicates of each other are grouped into a same block, and only records within the same block are compared. Any two records in different blocks will never be declared as duplicates.

Commonly, a single attribute in records, or a combination of attributes, called a blocking key, is used to split records into blocks. If two records share the same (or have similar) values for the blocking-key attributes, they are put into the same block. For example, we could block employee records according to the name and the city. If two of them share (or have similar) name and city values, they go to the same block. Additional analysis, or the use of a classifier, will eventually determine if they are duplicates or not.

Blocking keys are rather rigid, and “local”, in that they are applied to records for a single entity (other entities may have other blocking keys). Their use may cause low recall. For this reason, it may be useful to apply blocking techniques that take advantage of additional semantics and/or domain knowledge. Actually, collective blocking creates blocks for different entities by exploiting the relational relationships between entities. Records for different entities are blocked separately, but simultaneously and in interaction. Accordingly, this approach can be called semantic collective blocking.

Figure 2:   Collective blocking

Example 2. Consider two entities, Author and Paper. For each of them, there is a set of records. For Author, they are of the form ⟨name, affiliation, …⟩, with {name, affiliation} the blocking key; and for Paper, records are of the form ⟨title, year, …⟩, with title the blocking key.

We can block together two Author records on the basis of the similarities of their values for the blocking key, in this case of authors’ names and affiliations. (This blocking policy can be specified by means of an MD of the form (4) in Example 1.)  However, if two Author records, say a1, a2, have similar names, but not similar affiliations, they will not be assigned to the same block.

An alternative approach could create and assign blocks of Author records, and also blocks of Paper records, at the same time, separately for each entity, but in an intertwined process. In this case, the same Author records a1, a2 that were assigned to different blocks may be the authors of papers, represented as Paper records, say p1, p2, resp., which have already been put in the same block (of papers) on the basis of similarities of paper titles (cf. Figure 2). With this additional information, we might assign a1 and a2 to the same block.

The additional knowledge comes in two forms: (a) semantic knowledge, about the relational relationships between records for different entities, in this case, the reference of paper titles appearing in Author records to paper titles in Paper entities, and (b) “procedural” knowledge that tells us about the blocks certain entities have been assigned to. As we will see, MDs allow us to express both simultaneously. In this case, we will be able to express that “if two papers are in the same block, then the corresponding Author records that have similar author names should be put in the same block too” (cf. the sketch below). So, we are blocking Author and Paper entities, separately, but collectively and in interaction. Similarly, and the other way around, we could block Paper records according to the blocking results for their authors (Author records).
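A procedural caricature of this intertwined process (a sketch only; the names, titles, similarity test, and max-based block merge are all hypothetical stand-ins):

    # Authors: rid -> (name, rid of one of their papers, block);
    # initially, each block number equals the rid.
    authors = {1: ("John Smith", 10, 1),
               2: ("john smith", 11, 2)}
    # Papers: rid -> (title, block); the two papers were already blocked
    # together on the basis of their similar titles.
    papers = {10: ("Data Cleaning", 10),
              11: ("data cleaning", 10)}

    def similar(s: str, t: str) -> bool:   # stand-in similarity test
        return s.casefold() == t.casefold()

    # Relational blocking rule, informally: similar author names, and their
    # papers already share a block => identify the author blocks.
    changed = True
    while changed:
        changed = False
        for a1, (n1, p1, b1) in list(authors.items()):
            for a2, (n2, p2, b2) in list(authors.items()):
                if a1 < a2 and b1 != b2 and similar(n1, n2) \
                        and papers[p1][1] == papers[p2][1]:
                    nb = max(b1, b2)
                    authors[a1] = (n1, p1, nb)
                    authors[a2] = (n2, p2, nb)
                    changed = True

    print(authors)  # both author records end up in block 2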

3 Overview of ERBlox

A high-level description of the components and workflow of ERBlox is given in Figure 3. ERBlox’s four main components are:  1.  MD-based collective blocking,  2.  Classification-model construction,  3.  Duplicate detection, and  4.  MD-based merging. All the tasks in the figure, except for the classification-model construction (that is, applying the SVM algorithm), are supported by LogiQL. (The implementation of in-house developed ML algorithms as components of the LogicBlox platform is ongoing work.)

Figure 3:   Overview of ERBlox

The initial input data is stored in structured text files, which are initially standardized and free of misspellings, etc. However, there may be duplicates. The general LogiQL program supporting the above workflow contains rules for importing data from the files into the extensions of relational predicates (tables). This results in a relational database instance T containing the training data, and an instance D to be subject to ER.

Figure 4:   Records

Entity records are represented as relational tuples as shown in Figure 4. However, we will keep referring to them as records, and they will be generally denoted with r, possibly with subscripts.

The next tasks require similarity computation for pairs of records in T and (separately) in D. Similarity computation is based on two-argument similarity functions on the domain of a record attribute, say Sim_A : Dom_A × Dom_A → [0, 1], each of which assigns a numerical value to (the comparison of) two values for attribute A, in two different records.

These similarity functions, being real-valued functions of the objects under classification, correspond to features in the general context of machine learning. They are considered only for a pre-chosen subset of record attributes. Weight-vectors w(r1, r2) are formed by applying predefined weights to the real-valued similarity functions on pairs of values for those attributes, as in Figure 5 and the sketch following it. (For more details on similarity computation see Section 4.)

Figure 5:   Feature-based similarity
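A minimal sketch of this feature-based construction (the attribute choice, similarity function, and weight are all hypothetical):

    # Hypothetical per-attribute similarity: Jaccard on keyword sets.
    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b)

    # spec: attribute -> (similarity function, predefined weight)
    def weight_vector(r1: dict, r2: dict, spec: dict) -> list:
        return [wgt * sim(r1[attr], r2[attr])
                for attr, (sim, wgt) in spec.items()]

    spec = {"keywords": (jaccard, 1.0)}
    r1 = {"keywords": {"er", "mds", "datalog"}}
    r2 = {"keywords": {"er", "mds"}}
    print(weight_vector(r1, r2, spec))  # [0.666...]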

Some record-pairs in the training dataset are considered to be duplicates and others non-duplicates, which results in a “similarity-enhanced” training database T′ of tuples of the form ⟨r1, r2, w(r1, r2), L⟩, with label L ∈ {0, 1}. Label 1 indicates that the two records are duplicates; 0, that they are not. These labels are consistent with the corresponding weight-vectors. The classifier is trained using T′, leading, through the application of the SVM algorithm, to the classification model to be used for ER.

Blocking is applied to instance D, pre-classifying records into blocks, so that only records in a same block will form input pairs for the trained classification model. Accordingly, two records in a same block may end up as duplicates (of each other) or not, but two records in different blocks will never be duplicates.

We assume each record has two extra, auxiliary attributes: a unique and global (numerical) record identifier (rid) whose value is originally assigned and never changes; and a block number that initially takes the rid as value. This block number is subject to changes.

For the records in D, similarity measures are used for blocking. To decide if two records r1, r2 go into the same block, the weight-vector w(r1, r2) can be used: one can read off from it whether their values for certain attributes are similar enough or not. However, the similarity computations required for blocking may be different from those involved in the computation of the weight-vectors w(r1, r2), which are related to the classification model. Either way, this similarity information is used by the blocking matching dependencies, which are pre-declared and domain-dependent.

Blocking-MDs specify and enforce (through their RHSs) that the blocks (block numbers) of two records have to be made identical. This happens when certain similarities between pairs of attribute values appearing in the LHSs of the MDs hold. For example, (4) is a blocking-MD that requires the computation of similarities of string values for attribute Title. The similarity atoms on the LHS of a blocking-MD are considered to be true when the similarity values are above thresholds that have been predefined for blocking purposes only.


Example 3. (Example 2 cont.) Consider the schema Author(AID, Name, PTitle, ABl#) and Paper(PID, Title, PBl#) (including ID and block attributes). The following is a relational MD that captures a blocking policy that is similar to (but more refined than) that in Example 2:

    Author(t1, x1, y1, bl1) ∧ Author(t2, x2, y2, bl2) ∧ x1 ≈ x2 ∧ Paper(t3, z1, bl3) ∧ Paper(t4, z2, bl3) ∧ y1 ≈ z1 ∧ y2 ≈ z2  →  bl1 ⇌ bl2      (5)

with the Author-atoms as “leading atoms” (they contain the identified variables on the RHS). It contains similarity comparisons involving attribute values for both relations Author and Paper. It specifies that when the Author-tuple similarities on the LHS hold, and their papers are similar to those in corresponding Paper-tuples that are in the same block (equality as an implicit similarity is captured by the join variable bl3), then the blocks have to be made identical. This blocking policy uses relational knowledge (the relationships between Author and Paper tuples), plus the blocking decisions already made about Paper tuples.

We can see from (5) that information about classifications in blocks of records for the entity at hand (Author in this case) and for other entities (Paper in this case) may simultaneously appear as conditions in the LHSs of blocking-MDs. Furthermore, blocking-MDs may involve in their LHSs inter-entity similarity conditions, such as y1 ≈ z1 in (5). All this is the basis for our “semantically-enhanced” collective blocking process.

The MD-based collective blocking stage consists in the enforcement of the blocking-MDs on D, which results in database D′, enhanced with information about the blocks to which the records are assigned. Pairs of records with the same block form candidate duplicate record-pairs.

We emphasize that some blocking-MDs, such as (5), are more general than those of the form (2) introduced in FanJLM09 or Section 2.1: in their LHSs, they may contain regular database atoms, from more than one relation, that are used to give context to the similarity atoms in the MD, and to capture additional relational knowledge. MDs of this kind are called relational MDs, and extend the so-called classical MDs of Section 2.1. (Cf. A for more details on relational MDs.)

A unique assignment of blocks to records is obtained after the enforcement of the blocking-MDs. Uniqueness is guaranteed by the properties of the class of MDs we use for blocking. Actually, blocking-MDs will turn out to have the UCI property (cf. Section 2.1). (More details on this are given in Sections 5 and A.)

After the records have been assigned to blocks, record-pairs ⟨r1, r2⟩, with r1, r2 in the same block, are considered for the duplicate test. At this point, we proceed as we did for the training data: the weight-vectors w(r1, r2), which represent the record-pairs in the “feature vector space”, are computed and passed over to the classifier. (Similarity computations are kept in appropriate program predicates. So, similarity values computed before blocking can be reused at this stage, or whenever needed.)

The result of applying the trained ML-based classifier to the record-pairs is a set of triples ⟨r1, r2, 1⟩ containing records that come from the same block and are considered to be duplicates. Equivalently, the output is a set M containing the pairs of duplicate records. The records in the pairs in M are merged by enforcing an application-dependent set of (merge-)MDs. This set of MDs is different from that used for blocking.

Since records have kept their rids, we define a “similarity” predicate ≈ᴹ on the domain of rids as follows: r1 ≈ᴹ r2 iff the corresponding records are considered to be duplicates by the classifier, i.e. ⟨r1, r2⟩ ∈ M. We informally denote a record by its rid. Using this notation, the merge-MDs are usually and informally written in the form: r1 ≈ᴹ r2 → r1 ⇌ r2. Here, the RHS is a shorthand for r1[A1] ⇌ r2[A1] ∧ ⋯ ∧ r1[Ak] ⇌ r2[Ak], where A1, …, Ak are all the record attributes, excluding the first and the last, i.e. ignoring the identifier and the block number (cf. Figure 4). Putting all together, merge-MDs take the official form:

    R(t1, x̄1, bl1) ∧ R(t2, x̄2, bl2) ∧ t1 ≈ᴹ t2  →  x̄1 ⇌ x̄2      (6)

Merging at the attribute level, as required by the RHS of (6), uses the predefined and domain-dependent matching functions m_A (cf. the sketch below).
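A sketch of such attribute-wise merging (the MFs shown, longest-string and set-union, are hypothetical stand-ins for the application-dependent m_A):

    # Hypothetical matching functions, one per attribute.
    def m_name(a: str, b: str) -> str:
        return max(a, b, key=len)        # keep the more informative name

    def m_keywords(a: frozenset, b: frozenset) -> frozenset:
        return a | b                     # union of keyword sets

    MFS = {"name": m_name, "keywords": m_keywords}

    # Enforcing r1 ~ r2 -> r1 <=> r2 at the attribute level.
    def merge(r1: dict, r2: dict) -> dict:
        return {attr: MFS[attr](r1[attr], r2[attr]) for attr in r1}

    r1 = {"name": "J. Doe", "keywords": frozenset({"ER"})}
    r2 = {"name": "John Doe", "keywords": frozenset({"MDs"})}
    print(merge(r1, r2))
    # -> name 'John Doe', keywords {'ER', 'MDs'}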

After applying the merge-MDs, a single duplicate-free instance is obtained from D′. Uniqueness is guaranteed by the fact that the classes of merge-MDs that we use in our generic approach turn out to be interaction-free. (More details are given in Section 7 and A. See also the brief discussion in Section 2.1.)

More details about the ERBlox system and our approach to ER are found in the subsequent sections.

4 Datasets and Similarity Computation

We now describe some aspects of the MAS dataset that are relevant for the description of the ERBlox system components (we also independently experimented with the DBLP and Cora Citation datasets, but we will concentrate on MAS), and the way the initial data is processed and prepared for use with the LogiQL language of LogicBlox.

4.1 Data files and relational data

In the initial, structured data files, entries (non-relational records) for the Author entity contain author names and their affiliations. The entries for entity Paper contain: paper titles, years of publication, conference IDs, journal IDs, and keywords. Entries for the PaperAuthor relationship between Paper and Author entities contain: paper IDs, author IDs, author names, and their affiliations. The entries for the Journal and Conference entities contain short names of the publication venues, their full names, and their home pages.

The dataset is preprocessed by means of Python scripts, in preparation for proper ERBlox tasks. This is necessary because the data gathering methods in general, and for the MAS dataset in particular, are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc. For example, non-word characters are replaced by blanks, some strings are converted into lower case, etc. Not solving these problems may lead to later execution problems and, in the end, to misleading ER results. This preprocessing produces updated structured data files. As expected, there is no ER at this stage, and in the new files there may be many authors who publish under several variations of their names; also the same paper may appear under slightly different titles, etc. This kind of cleaning will be performed with ERBlox.
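The scripts themselves are not shown here; a minimal stand-in for the kind of normalization just mentioned could be:

    import re

    def clean(value: str) -> str:
        """Replace non-word characters by blanks, lower-case, collapse spaces."""
        value = re.sub(r"\W+", " ", value)
        return " ".join(value.lower().split())

    print(clean("Carleton Univ.,  School of Computer Science!"))
    # -> "carleton univ school of computer science"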

Next, from the data in (the preprocessed) structured files, relational predicates and their extensions are created and computed, by means of a generic Datalog program in LogiQL Aref15 ; Halpin15 . For example, these rules are part of the program:

    _filein(x1, x2, x3) -> string(x1), string(x2), string(x3).      (7)
    lang:physical:filePath[`_filein] = "author.csv".                (8)
    +author(x1, x2, x3) <- _filein(x1, x2, x3).                     (9)

Here, (7) is a predicate schema declaration, in this case of the “filein” predicate with three string-valued attributes. It is used to automatically store the contents extracted from the source file “author.csv”, as specified in (8). In LogiQL in general, metadata declarations use “->”. (In LogiQL, each predicate’s schema has to be declared, unless it can be inferred from the rest of the program.) Derivation rules, such as (9), use “<-”, as usual in Datalog. It defines the author predicate, and the “+” in the rule head inserts the data into the predicate extension. The rule also makes the first attribute a tuple identifier.

Figure 6 shows three relational predicates that are created and populated in this way: Author(AID, Name, Affiliation, Bl#), Paper(PID, Title, Year, CID, JID, Keyword, Bl#), and PaperAuthor(PID, AID, Name, Affiliation). The (partial) tables show that there may be missing attribute values.

Figure 6: Relation extensions from MAS using LogiQL rules

4.2 Features and similarity computation

From the general description of our methodology in Section 3, a crucial component is similarity computation. It is needed for:  (a) blocking, and  (b) building the classification model. Similarity measures are related to features, which are numerical functions of the data, more precisely of the values of some specially chosen attributes. Feature selection is a fundamental task in machine learning dash ; tang ; going into detail on this subject is beyond the scope of this work.  Example 4 shows some specific aspects of this task as related to our dataset.

In relation to blocking, in order to decide if two records in D go into the same block, similarities of values for certain attributes are computed, namely for those that appear in similarity conditions in the LHSs of blocking-MDs. All that is needed is whether they are similar enough or not, which is determined by predefined numerical thresholds.

For model building, similarity values are computed to build the weight-vectors w(r1, r2) for records from the training data in T. The numerical values in those vectors depend on the values taken by some selected record attributes (cf. Figure 5).


Example 4. (Example 2 cont.) Bibliographic datasets, such as MAS, have been commonly used for the evaluation of machine learning techniques, in particular, of classification for ER. In our case, the features chosen for the classification of records for entities Paper and Author from the MAS dataset (and the other datasets) correspond to those previously used in Torvik09 ; Christen2008 . Experiments in Kopcke08 show that the chosen features enhance the generalization power of the classification model, by reducing over-fitting.

In the case of Paper-records, if the “journal ID” values are null in both records, but not their “conference ID” values, “journal ID” is not considered for feature computation, because it does not contribute to the recall or precision of the classifier under construction. Similarly, when the “conference ID” values are null. However, the values for “journal ID” and “conference ID” are replaced by “journal full name” and “conference full name” values that are found in Conference- and Journal-records, resp. Attributes Title, Year, ConfFullName or JourFullName, and Keyword are chosen for feature computation.

For feature computation in the case of Author-records, the Name attribute is split in two, the Fname and Lname attributes, to increase recall and precision of the classifier under construction. Accordingly, features are computed for attributes Fname, Lname and Affiliation.

Once the classifier has been built, weight-vectors w(r1, r2) are also computed as inputs for the classifier, but this time for records from the data under classification (in D). (In our experiments, we did not care about null values in records under classification. Learning, inference, and prediction in the presence of missing values are pervasive problems in machine learning and statistical data analysis. Dealing with missing values is beyond the scope of this work.)

Notice that the numerical values, associated to similarities, in a weight-vector for a record-pair under classification could be used as similarity information for blocking. However, the attributes and features used for blocking may be different from those used for weight-vectors. For example, in our experiments with the MAS dataset, the classification of Author-records is based on attributes Fname, Lname, and Affiliation. For blocking, the latter is reused as such (cf. MD (13) below), but also the combination of Fname and Lname is reused, as attribute Name in MDs (cf. MDs (13) and (15) below).

There is a class of well-known and widely applied similarity functions that are used in data cleaning and machine learning cohen03 . For our application with ERBlox we used three of them, depending on the attribute domains of the MAS dataset. For long-text-valued attributes, in our case, e.g. the Affiliation attribute, values are represented as lists of strings. For computing similarities between these kinds of attribute values, the “TF-IDF cosine” measure was used Salton88 . It assigns low weights to frequent strings and high weights to rare strings. For example, affiliation values usually contain multiple strings, e.g. “Carleton University, School of Computer Science”. Among them, some are frequent, e.g. “School”, and others are rare, e.g. “Carleton”.

For attributes with “short” string values, such as author names, “Jaro-Winkler” similarity was used Jaro95 ; Winkler99 . This measure counts the characters in common in two strings, even if they are misplaced by a short distance. For example, this measure gives a high similarity value to the pair of first names “Zeinab” and “Zienab”. In the MAS dataset, there are many author first names and last names presenting this kind of misspellings.

For numerical attributes, such as publication year, the “Levenshtein distance” was used Navarro . The similarity of two numbers is based on the minimum number of operations required to transform one into the other.
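For illustration, the following sketch computes the three kinds of measures with off-the-shelf ingredients (TF-IDF cosine via scikit-learn; difflib's ratio, which is not Jaro-Winkler but plays the same role for short strings here; and a textbook Levenshtein):

    from difflib import SequenceMatcher
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # TF-IDF cosine, for long text values such as affiliations.
    docs = ["Carleton University, School of Computer Science",
            "School of Computer Science, Carleton University"]
    tfidf = TfidfVectorizer().fit_transform(docs)
    print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])        # ~1.0

    # Rough stand-in for Jaro-Winkler on short strings (first names).
    print(SequenceMatcher(None, "Zeinab", "Zienab").ratio())  # high value

    # Levenshtein distance, e.g. for publication years.
    def levenshtein(s: str, t: str) -> int:
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (cs != ct)))
            prev = cur
        return prev[-1]

    print(levenshtein("2009", "2010"))                        # 2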

As already mentioned in Section 3, these similarity measures are used, but differently, for blocking and for the creation and application of the classification algorithm. In the former case, similarity values related to the LHSs of blocking-MDs are compared with user-defined thresholds, in essence making them boolean variables. In the latter case, they are used for computing the similarity vectors, which contain numerical values (in [0, 1]). Notice that similarity measures are not used beyond the output of the classification algorithm; in particular, not for MD-based record merging.

Similarity computation for ERBlox is done through LogiQL rules that define the similarity functions. In particular, similarity computations are kept in extensions of program-defined predicates. For example, if the similarity value for the pair of values ⟨a1, a2⟩ for attribute A is above the threshold, a tuple A-Sim(a1, a2) is created by the program.

5 MD-Based Collective Blocking

As described in Section 3, the Block attribute, Bl#, in records takes integer numerical values; and before the blocking process starts (or blocking-MDs are enforced), each record in the instance has a unique block number that coincides with its rid. Blocking policies are specified by blocking-MDs, all of which use the same matching function for identity enforcement, given by:

    m_Bl#(i, j) := max{i, j}.      (10)

A blocking-MD that identifies block numbers (i.e. makes them identical) in two records (tuples) for database relation R (cf. Figure 4) takes the form:

    R(t1, x̄1, bl1) ∧ R(t2, x̄2, bl2) ∧ ψ(x̄1, x̄2, z̄)  →  bl1 ⇌ bl2      (11)

Here, bl1, bl2 are variables for block numbers, R is a database predicate (representing an entity), and the lists of variables x̄1, x̄2 stand for all the attributes in R but Bl#, for which the variables bl1, bl2 are used. The MD in (11) is relational when the formula ψ in it is a conjunction of relational atoms plus comparison atoms via similarity predicates, including implicit equalities of block numbers (but not ≈-similarities between block numbers). The variables z̄ in ψ may appear among those in x̄1, x̄2 (in R), or in another database predicate, or in a similarity atom. We assume that bl1, bl2 do not appear in ψ.  (Cf. A for more details on relational MDs.)

An example is the MD in (5), where the leading R-atoms are Author tuples, and the extra conjunction ψ contains the Paper atoms, the non-block-similarities, and an implicit equality of blocks through the shared use of variable bl3. There, ψ is x1 ≈ x2 ∧ Paper(t3, z1, bl3) ∧ Paper(t4, z2, bl3) ∧ y1 ≈ z1 ∧ y2 ≈ z2.


Example 5. These are some of the blocking-MDs used with the MAS dataset. The first two are classical blocking-MDs, and the last two are properly relational blocking-MDs (x stands for Title in (12) and (14), and for Name in (13) and (15)):

    Paper(pid1, x1, y1, z1, …, bl1) ∧ Paper(pid2, x2, y2, z2, …, bl2) ∧ x1 ≈ x2 ∧ y1 = y2 ∧ z1 = z2  →  bl1 ⇌ bl2      (12)

    Author(aid1, x1, y1, bl1) ∧ Author(aid2, x2, y2, bl2) ∧ x1 ≈ x2 ∧ y1 ≈ y2  →  bl1 ⇌ bl2      (13)

    Paper(pid1, x1, …, bl1) ∧ Paper(pid2, x2, …, bl2) ∧ x1 ≈ x2 ∧ PaperAuthor(pid1, aid1, …) ∧ PaperAuthor(pid2, aid2, …) ∧ Author(aid1, …, bl3) ∧ Author(aid2, …, bl3)  →  bl1 ⇌ bl2      (14)

    Author(aid1, x1, …, bl1) ∧ Author(aid2, x2, …, bl2) ∧ x1 ≈ x2 ∧ PaperAuthor(pid1, aid1, …) ∧ PaperAuthor(pid2, aid2, …) ∧ Paper(pid1, …, bl3) ∧ Paper(pid2, …, bl3)  →  bl1 ⇌ bl2      (15)

Here, in (12), y and z stand for Year and CID, resp.; in (13), y stands for Affiliation.

In informal terms, (12) requires that, for any two Paper records whose values for attribute Title are similar, and that have the same publication year and conference ID, the values for attribute Bl# must be made identical. According to (13), whenever there are similar values for name and affiliation in two Author records, the corresponding authors should go into the same block.

The relational blocking-MDs in (14) and (15) collectively block Paper and Author entities. According to (14), a blocking-MD for Paper, if two authors are in the same block, their papers p1, p2 having similar titles must be in the same block too. Notice that if papers p1 and p2 have similar titles, but do not have the same publication year or conference ID, we cannot block them together using (12) alone. The blocking-MD (15) for Author is similar to that discussed in Example 3.

For the application-dependent set Σᴮ of blocking-MDs, we adopt the chase-based semantics Bertossi12 , which may lead, in general, to several alternative final instances. In each of them, every record is assigned to a unique block, but now records may share block numbers, which is interpreted as belonging to the same block. In principle, there might be two final instances where the same pair of records is put in the same block in one of them, but not in the other. However, with a set of relational blocking-MDs of the form (11) acting on an initial instance D (created with LogicBlox as described above), the chase-based enforcement of the MDs results in a single, final instance D′. This is because the combination of the blocking-MDs with the initial instance turns out to belong to the SFAI class, which has the UCI property (cf. Section 2.1 and A).

That the initial instance and the blocking-MDs form an SFAI combination is easy to see. In fact, initially the block numbers in tuples (or records) are all different: they are the same as their tids. Now, the only relevant attributes in records (for SFAI membership) are the “block attributes”, those appearing in the RHSs of blocking-MDs (cf. (11)). In the LHSs of blocking-MDs they may appear only in implicit equality atoms. Since all initial block numbers in D are different, no relevant similarity holds in D.

Due to the SFAI property of the blocking-MDs in combination with the initial instance, MD enforcement leads to a single instance that can be computed in polynomial time in data, which gives us hope of using a computationally well-behaved extension of plain Datalog for MD enforcement (and blocking). It turns out that the representation and enforcement of these MDs can be done by means of Datalog with stratified negation ceri90 ; ahv95 , which is supported by LogiQL. Stratified Datalog programs have a unique stable model, which can be computed bottom-up in polynomial time in the size of the extensional database. (General sets of MDs can be specified and enforced by means of disjunctive, stratified answer set programs, with the possibly multiple resolved instances corresponding to the stable models of the program Bahmani12 . These programs can be specialized, via an automated rewriting mechanism, for the SFAI case, obtaining residual programs in Datalog with stratified negation BahmaniExten12 .)

In LogiQL, blocking-MDs take the form of Datalog rules:

    R(t1, x̄1, bl), R(t2, x̄2, bl) <- R(t1, x̄1, bl1), R(t2, x̄2, bl2), ψ, bl = max{bl1, bl2}      (16)

subject to the same conditions as for (11). The condition bl = max{bl1, bl2} in the rule body corresponds to the use of the MF in (10).

An atom of the form R[t, x̄] = bl not only declares bl as an attribute value for t, x̄, but also that predicate R is functional on t, x̄ Aref15 : each record (version) in R can have only one block number.

In addition to the blocking-MDs, we need some auxiliary rules, which we introduce and discuss next. Given an initial instance D and a set of blocking-MDs Σᴮ, the LogiQL program that specifies MD-based collective blocking contains the following rules:

  • For every record R(t, x̄) in the initial instance, the fact R(t, x̄, t).  That is, initially, the block number is functionally assigned the value t, the rid.

  • Facts of the form A-Sim(a1, a2), where a1, a2 ∈ Dom_A, the finite domain of an attribute A. They state that the two values are similar, which is determined by similarity computation. (Cf.  Section 4.2 for more on similarity computation.)

  • Rules for the blocking-MDs, as in (16).

  • Rules specifying older versions of entity records (in relation R-OldVersion) after MD-enforcement:

    R-OldVersion(t, x̄, bl) <- R(t, x̄, bl), R(t, x̄′, bl′), bl < bl′.

    Here, variable t stands for the rid. Since for each rid t there could be several atoms of the form R(t, x̄, bl), corresponding to the evolution of the record identified by t through an MD-based chase sequence, the rule specifies as old those versions of the record with a block number that is smaller than the last one obtained for it.

  • Rules that collect the records’ latest versions, to form blocks:

    R-Block(t, x̄, bl) <- R(t, x̄, bl), ! R-OldVersion(t, x̄, bl).

    The rule collects the R-records that are not old versions. (LogiQL uses “!” instead of not for Datalog negation Aref15 .)

The program as above is a Datalog program with stratified negation (there is no recursion through negation). In computational terms, this means that the program first computes the old versions of records (using negation), and next the definitive blocks are computed. As expected from the SFAI property of the blocking-MDs in combination with the initial instance, the program has and computes a single model, in polynomial time in the size of the initial instance. From it, the final block numbers of records can be read off. (A small procedural analogue of this version bookkeeping is sketched below.)
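The negation-based bookkeeping can be mimicked procedurally; a small sketch with hypothetical (rid, block) versions:

    # (rid, block) versions accumulated during MD enforcement (hypothetical).
    versions = {(1, 1), (1, 3), (2, 2), (2, 3)}

    # Rule 4 analogue: a version is old if the same rid has a larger block.
    old = {(t, b) for (t, b) in versions
           if any(t2 == t and b2 > b for (t2, b2) in versions)}

    # Rule 5 analogue: the latest versions form the blocks.
    blocks = versions - old
    print(sorted(blocks))  # [(1, 3), (2, 3)] -- both records in block 3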


Example 6. (Example 5 cont.)  We consider only the blocking-MDs (12) and (14). The portion of the program that does the blocking of records for the Paper entity has the following rules (we follow the numbering used in the generic program):

  • Facts such as: Title-Sim(a1, a2), stating similarity of two title values (cf. item 2 of the generic program).
