AutoER: Automated Entity Resolution using Generative Modelling

08/16/2019 · Renzhi Wu et al. · Georgia Institute of Technology, Hamad Bin Khalifa University

Entity resolution (ER) refers to the problem of identifying records in one or more relations that refer to the same real-world entity. ER has been extensively studied by the database community, with supervised machine learning approaches achieving state-of-the-art results. However, supervised ML requires many labeled examples, both matches and unmatches, which are expensive to obtain. In this paper, we investigate an important problem: how can we design an unsupervised algorithm for ER that achieves performance comparable to supervised approaches? We propose an automated ER solution, AutoER, that requires zero labeled examples. Our central insight is that the similarity vectors for matches should look different from those of unmatches. A number of innovations are needed to translate this intuition into an actual algorithm for ER. We advocate for the use of generative models to capture the two similarity vector distributions (the match distribution and the unmatch distribution). We propose an expectation-maximization based algorithm to learn the model parameters. Our algorithm addresses many practical challenges, including feature correlations, model overfitting, class imbalance, and transitivity between matches. On six datasets from four different domains, we show that the performance of AutoER is comparable to, and sometimes even better than, supervised ML approaches.


1 Introduction

Entity resolution (ER) – also known as duplicate detection, record linkage, or record matching – refers to the problem of identifying tuples in one or more relations that refer to the same real-world entity. This problem is widely prevalent in domains as diverse as banking, insurance, and e-commerce. An e-commerce website would want to identify duplicate products (such as from different suppliers) so that they can all be listed on the same product page. Two insurance companies that are merging would want to identify and reconcile common customers. ER has been extensively studied in many research communities, including databases, statistics, NLP, and data mining (e.g., see the surveys [DBLP:conf/sigmod/KoudasSS06, DBLP:journals/tkde/ElmagarmidIV07, DBLP:journals/pvldb/DongN09, DBLP:series/synthesis/2010Naumann, DBLP:journals/pvldb/GetoorM12, herzog2007data]).

Voracious Appetite of ML Approaches. Supervised machine learning (ML) approaches provide the state-of-the-art results for ER [deeper, anhaisigmod2018, kopcke2010evaluation, dong2018data]. Powerful techniques such as deep learning are notoriously hungry for large amounts of training data. For example, both DeepER [deeper] and DeepMatcher [anhaisigmod2018] require thousands of labeled examples. Even non-deep-learning based methods require hundreds of labeled examples [kopcke2010evaluation]. It has been reported [dong2018data] that achieving F-measures of 99% with random forests can require up to 1.5M labels even for relatively clean datasets. Generating large amounts of labeled examples is extremely time-consuming even for domain experts.

AutoER Problem. In this paper, we systematically investigate an important problem: is it possible to build an effective ER algorithm that does not require any labeled data? An affirmative answer could dramatically simplify the life of domain experts by freeing them from the drudgery of generating labels. Specifically, we want the approach to have the following desirable properties:

  1. It should not require any labeled data.

  2. It should handle the notorious class imbalance problem, where unmatches significantly outnumber matches.

  3. It should leverage ER-specific properties such as transitivity to further improve performance.

Addressing any one of the aforementioned requirements is challenging on its own. In this paper, we introduce AutoER that satisfies all these properties. As we shall show in the experiments, the performance of AutoER is competitive with supervised methods using hundreds of labeled examples.

Our Approach: Generative Modeling. A pair of tuples is said to be a match if they represent the same real-world entity and an unmatch otherwise. We start with a blindingly simple yet powerful observation: the similarity vectors (aka feature vectors) for matches should look different from the similarity vectors for unmatches. This observation inspires us to use generative modelling for the ER problem: if a pair of tuples is a match, then the feature vector is assumed to be generated according to one distribution (termed the M-Distribution); otherwise, the feature vector is assumed to be generated according to a different distribution (termed the U-Distribution). If we can learn the parameters of these two distributions accurately, then determining whether a tuple pair is a match is equivalent to asking if it is more likely to be generated from the M-distribution than from the U-distribution. There are a number of challenges to be addressed in using generative modeling for AutoER:


  • Identifying the Right M-Distribution and U-Distribution. Different choices of these two distributions lead to different generative models, and there is a trade-off between model expressiveness and the ability to learn model parameters effectively. For example, assuming all features are independent simplifies the generative model but may fail to capture real-world data feature vector distributions. On the other hand, modeling all pairwise feature dependencies comes at the cost of increased model complexity.

  • Parameter Estimation using Unlabeled Data Only and Handling Class Imbalance. If we knew the ground-truth labels for all tuple pairs, we could easily use the matches and unmatches to learn the parameters of the M-Distribution and the U-Distribution, respectively. Similarly, if we knew the parameters of these two distributions, we could determine the label of each tuple pair easily. The challenge is how to learn the model parameters and the tuple pair labels simultaneously. The class imbalance problem in ER – the number of matches being significantly smaller than the number of unmatches – makes learning accurate M-Distribution parameters even more challenging.

  • Exploiting the ER-specific Transitivity Property. ER exhibits the transitivity property: if both (t_i, t_j) and (t_j, t_k) are matches, then by definition (t_i, t_k) is also a match. While leveraging this ER-specific property improves performance, it is non-trivial to incorporate it into our generative modeling without sacrificing efficiency.

We make the following contributions in addressing the above challenges by using a generative model for solving the AutoER problem.


  • We formalize the idea of using a generative model for solving the ER problem in Section 2. In Section 3, we discuss the design of the M-Distribution and U-Distribution, which are adapted from the popular Gaussian Mixture Model and include a novel feature regularization term to prevent overfitting. Our proposed distributions allow for efficient parameter learning using an Expectation-Maximization (EM) algorithm.

  • The class imbalance problem makes learning the model parameters of the minority-class M-distribution challenging. We tackle this by "borrowing" information from the majority-class U-distribution through an innovative decomposition of the covariance matrices of the M- and U-distributions in Section 4.

  • We also propose an effective approach to incorporate the transitivity constraint into AutoER in Section 5. Our core idea is to leverage transitivity as a soft constraint that calibrates the posterior probabilities in each iteration of the EM algorithm.

  • Our extensive experiments on six datasets from four different domains show that the performance of AutoER is comparable to that of supervised ML methods even though it uses zero labeled data.

The overall AutoER algorithm is presented in Section 6, which also discusses the algorithm initialization and termination conditions. We describe our extensive experiments over various datasets in Section 7. Section 8 describes the related work followed by concluding remarks in Section 9.

2 Preliminaries

We formally define entity resolution in Section 2.1 and describe the generative model for ER in Section 2.2.

2.1 Entity Resolution Problem Definition

Intuitively, an entity is a distinct real-world object such as a customer, an organization, or a publication. We are given two relations R_1 and R_2 with aligned attributes A_1, …, A_p. The entity resolution (ER) problem [DBLP:journals/tkde/ElmagarmidIV07] seeks to identify all pairs of tuples (t_1, t_2), where t_1 ∈ R_1 and t_2 ∈ R_2, that refer to the same real-world entity. A pair of tuples is said to be a match (denoted as M) (resp. unmatch (denoted as U)) when they refer to the same (resp. different) real-world entity. When R_1 = R_2, this is also known as data deduplication; and when R_1 ≠ R_2, this is also known as record linkage. Figure 1 provides an illustration where we wish to identify which pairs of records from Google Scholar and DBLP refer to the same publication.

Figure 1: Example of entity resolution: linking entries between (a) Google Scholar and (b) DBLP. (c) shows features generated by Magellan [konda2016magellan].

A typical solution consists of two phases: blocking and matching. Comparing all possible tuple pairs could be prohibitively expensive. For efficiency reasons, ER solutions usually first run blocking methods, which generate a candidate set C that excludes tuple pairs that are unlikely to match. A matcher then evaluates each tuple pair in the candidate set as a match or an unmatch. Typically, the matcher is a binary ML classifier that is trained on some labeled examples (a subset of C), and then is used to make predictions on all pairs in C. Designing blocking strategies that retain as few pairs as possible without losing many matching pairs is an orthogonal research problem. In this work, we assume blocking is already done, and compare our matcher (AutoER) with all baselines on the retained set after blocking.

Feature Engineering for ER. To build a matcher, we need to generate similarity vectors (aka feature vectors) for every pair of tuples in C. Typically, a domain expert designs different features for different datasets. Since we aim to automate ER, we leverage the automated feature generation system implemented in the popular Magellan [konda2016magellan] ER package. Under the hood, Magellan infers a type for each aligned attribute, and applies a set of pre-defined similarity functions for each type to generate features. As an example, Figure 1(c) shows features generated by Magellan. Multiple similarity functions are applied on each attribute to create features; e.g., f_1 and f_2 might be two features obtained by applying two similarity functions on the titles of the tuple pairs. Since Magellan can apply multiple similarity functions to each aligned attribute, we end up with d features with d ≥ p, where p is the number of aligned attributes. Furthermore, the number of features per attribute need not be fixed – some attributes might generate more features than others. We use x_i to denote the feature vector of the i-th tuple pair and y_i ∈ {M, U} to denote the corresponding match status. Note that while the values x_i are observed, y_i is an unobserved random variable that must be estimated.
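The feature engineering scheme above can be sketched as follows. This is a minimal illustration, not Magellan's actual API: the helper names (`qgrams`, `jaccard_qgm`, `overlap_qgm`) and the attribute list are assumptions of the example. The key point is that several similarity functions applied to one attribute produce a natural "group" of adjacent features.

```python
# Sketch of ER feature generation: two similarity functions per aligned
# attribute, so consecutive feature pairs form one group per attribute.
# Function and attribute names are illustrative, not Magellan's API.

def qgrams(s, q=3):
    """Character q-grams of a padded, lower-cased string."""
    s = " " * (q - 1) + s.lower() + " " * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard_qgm(s1, s2):
    g1, g2 = qgrams(s1), qgrams(s2)
    return len(g1 & g2) / len(g1 | g2)

def overlap_qgm(s1, s2):
    g1, g2 = qgrams(s1), qgrams(s2)
    return len(g1 & g2) / min(len(g1), len(g2))

ATTRIBUTES = ["title", "authors", "venue"]   # aligned attributes (p = 3)
SIM_FUNCS = [jaccard_qgm, overlap_qgm]       # two functions per attribute

def feature_vector(t1, t2):
    """d = 6 features; consecutive pairs belong to the same attribute group."""
    return [f(t1[a], t2[a]) for a in ATTRIBUTES for f in SIM_FUNCS]
```

Each entry lies in [0, 1], and identical tuples yield the all-ones vector.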

Formal Problem Definition. Given two relations R_1 and R_2, a candidate set C of tuple pairs after blocking, and a feature vector x_i for each pair in C, our goal is to assign a binary label y_i ∈ {M, U} to every pair in C without using any labeled data.

Probabilistic Modeling of ER. Solving the ER problem is essentially equivalent to finding the conditional probability P(y | x) over the two outcomes M and U, where P(M | x) + P(U | x) = 1. Typical supervised ML methods are discriminative modeling methods that learn the conditional distribution P(y | x) directly using training data.

2.2 Generative Models for ER

We propose to use generative models for performing unsupervised AutoER. Generative models have been shown to be very successful in dealing with training data deficiency problems [ratner2017snorkel, das2019goggles]. We begin by describing an intuitive process by which the similarity vectors are generated. Using a generative model, one can hypothetically "generate" feature vector-label pairs (x, y) from the M- and U-probability distributions by repeating the following two steps:

  1. Choose a distribution, M or U, by sampling y based on a Bernoulli distribution parameterized by π_M. Intuitively, the process tosses a coin that comes up heads with probability π_M. If it comes up heads, it chooses the M-distribution, else the U-distribution.

  2. Sample the feature vector x from the selected distribution. For example, if the M-distribution was selected, then the feature vector is sampled according to the class-conditional probability distribution P(x | M).

Formally, π_M and π_U are known as the prior probabilities that specify the proportions of M's and U's among all tuple pairs, and thus we have π_M + π_U = 1. The M- and U-probability distributions P(x | M) and P(x | U) are two class-conditional distributions of feature vectors.

Generative models compute P(M | x) by invoking Bayes' rule:

P(M | x) = π_M P(x | M) / (π_M P(x | M) + π_U P(x | U))    (1)

If P(M | x) > P(U | x), we declare x a match, and vice versa.
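As a concrete illustration, the posterior of Equation 1 can be computed directly once the two class-conditional densities are fixed. The sketch below uses one-dimensional Gaussians with made-up parameter values (the priors and moments are illustrative, not learned from data):

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian N(mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_match(x, pi_m, mu_m, var_m, mu_u, var_u):
    """P(M | x) via Bayes' rule (Equation 1)."""
    pm = pi_m * gauss_pdf(x, mu_m, var_m)        # pi_M * P(x | M)
    pu = (1 - pi_m) * gauss_pdf(x, mu_u, var_u)  # pi_U * P(x | U)
    return pm / (pm + pu)

# Matches concentrate near similarity 0.9, unmatches near 0.2; matches are
# rare (pi_M = 0.1), yet a high similarity still yields P(M | x) > 0.5.
p = posterior_match(0.85, pi_m=0.1, mu_m=0.9, var_m=0.01, mu_u=0.2, var_u=0.04)
```

Note how the small prior π_M = 0.1 penalizes the match hypothesis, but a feature value close to the match mean overcomes it.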

Training. Intuitively, we wish to learn the parameters of the model such that the likelihood that the model generates the observed similarity vectors is maximized. Let θ_M (resp. θ_U) denote the parameters governing the M-Distribution (resp. U-distribution). Then the total set of parameters is Θ = {π_M, θ_M, θ_U}. Notice that π_U is not included as it can be directly inferred from π_U = 1 − π_M. A common way to estimate Θ is by maximizing the log-likelihood function (the log is taken for computational convenience):

L(Θ; X, Y) = Σ_i [ 1(y_i = M)(log π_M + log P(x_i | θ_M)) + 1(y_i = U)(log π_U + log P(x_i | θ_U)) ]    (2)

where 1(·) is the indicator function that evaluates to 1 if the condition is true and 0 otherwise; X is the feature matrix with the i-th row being the feature vector of the i-th tuple pair; Y is the status vector with the i-th element denoting the match status of the i-th tuple pair.

Since Y is unknown, Equation 2 cannot be optimized directly. The EM algorithm is the canonical algorithm to use in the case of unobserved variables [dempster1977maximum]. Each iteration of the EM algorithm consists of two steps: an Expectation (E) step and a Maximization (M) step. Intuitively, the E-step determines the (soft) class assignment for every tuple pair based on the parameter estimates Θ^(t−1) from the last iteration. In other words, the E-step computes the posterior probability γ_i = P(y_i = M | x_i; Θ^(t−1)). The M-step takes the new class assignments and re-estimates all parameters by maximizing Equation 2. More precisely, the M-step maximizes the expected value of Equation 2, since the E-step produces soft assignments. Formally, the two steps are as follows:

  1. E Step. Given the parameter estimates Θ^(t−1) from the previous iteration, compute the posterior probabilities as follows:

    γ_i = π_M P(x_i | θ_M) / (π_M P(x_i | θ_M) + π_U P(x_i | θ_U))    (3)

  2. M Step. Given the new class assignments as defined by γ_i, re-estimate Θ by maximizing the following expected log-likelihood function:

    Q(Θ) = Σ_i [ γ_i (log π_M + log P(x_i | θ_M)) + (1 − γ_i)(log π_U + log P(x_i | θ_U)) ]    (4)

ER as Inference. Once the EM algorithm converges with final parameters Θ*, we assign a label to every tuple pair based on the posterior probabilities P(M | x_i) and P(U | x_i), choosing the distribution with the higher probability:

y_i = M if γ_i ≥ 0.5, and y_i = U otherwise.    (5)
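The E/M alternation and the final inference rule can be sketched end to end for a single feature. This is a minimal two-component mixture, not the full AutoER model (feature grouping, regularization, and transitivity come later); the initialization scheme and the variance floor are assumptions of the sketch.

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_components(xs, iters=100):
    """EM loop of Section 2.2 on one feature: E-step (Eq. 3) computes
    posteriors, M-step (Eq. 4/8) re-estimates parameters, and labels
    follow the posterior rule of Eq. 5."""
    mean = sum(xs) / len(xs)
    mu_m, mu_u = max(xs), min(xs)               # crude initialization
    var_m = var_u = max(1e-3, sum((x - mean) ** 2 for x in xs) / len(xs))
    pi_m = 0.5
    for _ in range(iters):
        # E-step: posterior probability of each point being a match.
        g = [pi_m * norm_pdf(x, mu_m, var_m) /
             (pi_m * norm_pdf(x, mu_m, var_m) +
              (1 - pi_m) * norm_pdf(x, mu_u, var_u))
             for x in xs]
        # M-step: weighted means, variances, and prior (Eq. 8).
        nm = sum(g)
        nu = len(xs) - nm
        mu_m = sum(gi * x for gi, x in zip(g, xs)) / nm
        mu_u = sum((1 - gi) * x for gi, x in zip(g, xs)) / nu
        var_m = max(1e-6, sum(gi * (x - mu_m) ** 2 for gi, x in zip(g, xs)) / nm)
        var_u = max(1e-6, sum((1 - gi) * (x - mu_u) ** 2 for gi, x in zip(g, xs)) / nu)
        pi_m = nm / len(xs)
    return ["M" if gi >= 0.5 else "U" for gi in g]
```

On synthetic similarities with matches near 0.9 and unmatches near 0.1, the loop separates the two clusters without any labels.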

3 The AutoER Generative Model

As discussed before, different choices of the M-distribution and the U-distribution will lead to different generative models. There are two main considerations in choosing which generative model to use:


  • Efficiency. Whatever distributions we use, it should be easy to estimate the parameters that maximize Equation 4 in the M-step at every iteration.

  • Expressiveness. The chosen distributions should be able to capture ER specific data characteristics.

Choosing a powerful distribution might allow us to solve challenging ER instances but might suffer from learning inefficiency, and vice versa. For example, non-parametric distributions are the most expressive, as they can capture any data distribution; however, estimating their probability density (e.g., using a kernel density estimator) is very expensive. The Gaussian Mixture Model (GMM) is a popular choice for many applications exactly because it strikes a nice balance between efficiency and expressiveness – real-world data often approximately follows Gaussian distributions, and there exist closed-form solutions for estimating the parameters of a Gaussian distribution given a set of data points.

However, as we shall show in the experiments, directly using GMM is not effective, primarily because the feature matrix generated by ER problems has certain characteristics and peculiarities. In this section, we propose two novel modifications to the naive GMM: feature grouping (Section 3.2) and feature regularization (Section 3.3). Our proposed modifications not only model the specific characteristics of ER features well, but also permit closed-form update rules for maximizing Equation 4 in the M-step under the AutoER generative model.

3.1 GMM for Entity Resolution

For the sake of completeness, we go through the mechanics of tackling ER with a GMM. As we shall show in the experiments, this approach produces inferior results.

Under GMM, the parameters for the M-distribution and U-distribution are θ_c = {μ_c, Σ_c}, where c ∈ {M, U}, μ_c is the mean vector, and Σ_c is the covariance matrix. Under GMM, P(x | θ_c) is replaced with the Gaussian probability density function N(x | μ_c, Σ_c). Thus, Equation 4 becomes (see Appendix A.1 for the detailed derivation):

Q(Θ) = Σ_{c ∈ {M, U}} [ N_c log π_c + Σ_i γ_i^c log N(x_i | μ_c, Σ_c) ]    (6)

where γ_i^M is γ_i and γ_i^U is 1 − γ_i; N_c is the sum of posterior probabilities:

N_M = Σ_i γ_i,  N_U = Σ_i (1 − γ_i)    (7)

It is known that the maximum of Equation 6 is achieved under the following parameter assignments [bishop2006pattern]:

π_M = N_M / n,  μ_c = (1 / N_c) Σ_i γ_i^c x_i,  Σ_c = (1 / N_c) Σ_i γ_i^c (x_i − μ_c)(x_i − μ_c)^T    (8)

In other words, Equation 6 is maximized when μ_c and Σ_c equal the weighted sample mean and the weighted sample covariance, respectively.
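The closed-form updates of Equation 8 translate directly into a few lines of array code. A sketch for the M-distribution (the U-distribution uses the complementary weights 1 − γ):

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form M-step of Equation 8: prior, weighted sample mean, and
    weighted sample covariance of the M-distribution.
    X: (n, d) feature matrix; gamma: (n,) posterior match probabilities."""
    n_m = gamma.sum()                               # N_M, Eq. 7
    pi_m = n_m / len(X)                             # prior
    mu_m = (gamma[:, None] * X).sum(axis=0) / n_m   # weighted mean
    diff = X - mu_m
    sigma_m = (gamma[:, None] * diff).T @ diff / n_m  # weighted covariance
    return pi_m, mu_m, sigma_m
```

With all weights equal to one, the estimates reduce to the ordinary (biased) sample mean and covariance, which is a convenient sanity check.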

3.2 Feature Grouping

Our first contribution is to improve the naive GMM by leveraging how feature engineering is done for ER.

Deficiencies of Prior Approaches. A naive invocation of GMM is problematic. The parameters for the two distributions are {μ_M, Σ_M, μ_U, Σ_U}, which contain d(d+1)/2 free parameters per covariance matrix, where d is the number of features. This is often an overkill, as most pairs of features have low covariance. If the data is insufficient, accurately estimating all those parameters is even more challenging. The most common way to alleviate this is to assume feature independence, i.e., assume Σ_c to be diagonal, which drastically reduces the number of parameters to be estimated. However, this assumption is inappropriate for ER under the feature engineering process described in Section 2.1: if an attribute generates multiple features, then those features are clearly not independent.

Best of Both Worlds by Feature Grouping. For the specific task of ER, it is possible to get the best of both worlds. The complete-dependence model is appealing as it is very expressive, but has too many parameters to estimate. The independence model is appealing as it has a smaller number of parameters to estimate, but loses the obvious feature dependencies. What we need is an alternative approach that has drastically fewer parameters but is still expressive enough for ER.

Figure 2: Heat map of correlation between features

Consider Figure 2, which shows the heat map of the correlation matrix of all features for matches from the Fodors-Zagats dataset (the heat map for unmatches is similar). We can clearly see the correlation matrix has a banding effect, with some blocks having higher values while other values are closer to 0. Not surprisingly, these blocks correspond to the sets of features generated by the same attribute. This suggests a natural modification to naive GMM. We consider an approach based on feature grouping with the following simplifications: (1) features generated from the same attribute are dependent; and (2) features generated by different attributes are independent. This approach achieves a good balance of expressiveness and performance. Instead of estimating d(d+1)/2 parameters for each Σ_c, we just need to estimate

Σ_{k=1}^{p} |G_k| (|G_k| + 1) / 2    (9)

where G_k is the set of features generated from attribute A_k. We can see that this results in a dramatically reduced number of parameters. Conceptually, the covariance matrix is a block-diagonal matrix where each block corresponds to the covariance matrix of the features obtained from the same attribute:

Σ_c = diag(Σ_c^(1), Σ_c^(2), …, Σ_c^(p))    (10)

where c ∈ {M, U} and Σ_c^(k) is the covariance matrix for the features of the k-th attribute.

Parameter Estimation Under Feature Grouping. Using feature grouping, we essentially have p independent GMMs sharing a common prior π_M. Hence, it is not hard to derive the closed-form solution for maximizing Equation 4: the updates of Equation 8 are applied to each feature group independently,

μ_c^(k) = (1 / N_c) Σ_i γ_i^c x_i^(k),  Σ_c^(k) = (1 / N_c) Σ_i γ_i^c (x_i^(k) − μ_c^(k))(x_i^(k) − μ_c^(k))^T    (11)

where x_i^(k) denotes the features of x_i belonging to group G_k.
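Under feature grouping, the covariance estimate is assembled block by block as in Equation 10, with cross-group entries pinned at zero. A sketch, where `groups` lists the feature indices produced by each attribute (the group layout is an assumption of the example):

```python
import numpy as np

def grouped_covariance(X, gamma, groups):
    """Weighted covariance under feature grouping (Equations 10-11): one
    small covariance block per attribute group on the diagonal; all
    cross-group covariances are fixed at zero."""
    n_m = gamma.sum()
    mu = (gamma[:, None] * X).sum(axis=0) / n_m
    d = X.shape[1]
    sigma = np.zeros((d, d))
    for idx in groups:                      # e.g. [[0, 1], [2, 3], [4]]
        diff = X[:, idx] - mu[idx]
        sigma[np.ix_(idx, idx)] = (gamma[:, None] * diff).T @ diff / n_m
    return sigma
```

Each diagonal block matches the corresponding block of the full weighted covariance; everything off the blocks is zero.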

3.3 Feature Regularization

AutoER relies on accurate modelling of features for the M-distribution and U-distribution. While the grouped Gaussian distributions generally provide a good fit to real-world data and have closed-form update rules in the EM algorithm, they do suffer from degenerate overfitting cases, just as one can overfit supervised ML models.

To illustrate the overfitting problem in generative modelling, consider a dataset with multiple features, one of which, f_1, takes values spread over an interval for all unmatch pairs but a single constant value for all match pairs. This simple dataset can be perfectly fit using two Gaussians, as shown in Figure 3(a1). In particular, the M-distribution is a Gaussian whose mean is that constant and whose variance is 0, which means that the density N(x | μ_M, Σ_M) → ∞ for all x in class M, as all the probability mass of the M-distribution concentrates on one point. With an infinite density, Equation 4 is already maximized. In other words, the generative model overfits such that all other features play no role in maximizing Equation 4. More formally, under the Gaussian distribution, when Equation 4 can be written as Equation 6, the expected log-likelihood at the optimal solution given by Equation 8 approaches +∞ as a variance approaches 0. This is also known as the singularity problem [bishop2006pattern]. It occurs when all data points in one component collapse to a single value in one or more dimensions, so that the corresponding rows of (x_i − μ_c) approach zero and the determinant |Σ_c| approaches zero. As demonstrated in Figure 3, these degenerate cases happen when the probability density of one class in any dimension is highly concentrated.

Figure 3: The singularity problem: the naive fit for M has infinite probability density for both feature f_1 (a1) and feature f_2 (a2). Tikhonov regularization for feature f_1 (b1) and feature f_2 (b2). AutoER adaptive regularization for feature f_1 (c1) and feature f_2 (c2).

Avoiding Overfitting by Tikhonov Regularization. One straightforward solution to the overfitting problem is to add a small constant λ to the diagonal entries of Σ_c, i.e., to the variance of every feature. This is known formally as Tikhonov regularization [Duchi_2008, Honorio2013Jul], in which each feature is regularized uniformly. In fact, this is also the solution adopted by the GMM implementation in the popular sklearn package [gmm_sklearn]. However, choosing a proper λ is non-trivial. If λ is very small, the degenerate features still dominate and cause overfitting, so λ has to be large enough. But if λ is too large, it will dominate the covariance matrix and cause underfitting. Furthermore, each feature might require a different amount of regularization instead of the constant λ. In our two-component mixture model, an improper regularization parameter can easily cause misclassification.
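The uniform fix can be sketched in two lines; sklearn's `GaussianMixture` applies the equivalent constant through its `reg_covar` parameter. The function name below is ours, not a library API:

```python
import numpy as np

def tikhonov_regularize(sigma, lam):
    """Uniform Tikhonov regularization: add the same constant lam to every
    diagonal entry of the covariance matrix."""
    return sigma + lam * np.eye(sigma.shape[0])

# A degenerate covariance: the first feature has collapsed to a single value.
sigma = np.diag([0.0, 0.04])
reg = tikhonov_regularize(sigma, 1e-2)
```

After regularization every variance is at least λ, so the density (and the log-likelihood) stays finite, which is exactly what blocks the singularity.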

Example 1

Consider two features that need regularization: f_1 in Figure 3(a1) and f_2 in Figure 3(a2). A λ is chosen that regularizes f_1 very well, as shown in Figure 3(b1): the two distributions remain well separated, but neither distribution now has its probability mass concentrated on a single point. Applying the same λ to f_2 results in Figure 3(b2), which is clearly an inferior fit, since there is too much overlap between the M-distribution and the U-distribution. This will cause some data points originally belonging to the U-distribution to be misclassified into the M-distribution, undermining the accuracy of the generative model.

Our Solution: Adaptive Regularization. To address the overfitting problem and to also avoid the challenges associated with uniform Tikhonov regularization, we propose an adaptive regularization strategy that regularizes different features differently. Formally, we modify the expected likelihood function as follows:

Q_reg(Θ) = Q(Θ) − (N_M / 2) tr(Σ_M^{-1} Λ) − (N_U / 2) tr(Σ_U^{-1} Λ)    (12)

where Λ is a diagonal matrix acting as the regularization parameter; tr denotes the trace operation on a matrix – the sum of all diagonal elements of the matrix; and N_M and N_U ensure the regularization term is "normalized" to the amount of data: without them, the first term would dominate when the number of pairs is large. Intuitively, this regularization term punishes the variances of M and U being small, and the amount of regularization depends on the "distance" between the two distributions on different features.

To maximize Equation 12, the optimal values of π_M and μ_c are still given by Equation 8, but the optimal Σ_c is given by:

Σ_c = (1 / N_c) Σ_i γ_i^c (x_i − μ_c)(x_i − μ_c)^T + Λ    (13)

The detailed derivation can be found in Appendix A.2. Equation 13 adds an adaptive factor Λ to the diagonal of the covariance matrix, enlarging the variance of every feature dimension adaptively.
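A sketch of the adaptive update of Equation 13. The per-feature amount used below (the squared gap between the two class means on each feature) is an assumption of this sketch, chosen to match the intuition that regularization should track the distance between the two distributions on each feature; the paper's exact choice of Λ may differ.

```python
import numpy as np

def adaptive_reg_cov(S, mu_m, mu_u):
    """Sketch of Equation 13: add a per-feature diagonal term to the weighted
    sample covariance S. Assumption: the per-feature regularization strength
    is the squared gap between the class means on that feature."""
    lam = (mu_m - mu_u) ** 2          # per-feature strength (assumed form)
    return S + np.diag(lam)
```

Features where the two classes are far apart receive strong regularization, while features where the means coincide are left untouched, unlike the uniform Tikhonov constant.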

Example 2

Continuing with the previous example on regularizing f_1 and f_2: using our adaptive regularization strategy, f_1 is fitted in Figure 3(c1) and f_2 is fitted in Figure 3(c2). As we can see, for both features the two distributions are now well separated and well spread out, avoiding the overfitting problem.

4 Handling Class Imbalance

One of the key characteristics that make the ER problem challenging for any ML-based approach is the extreme class imbalance: unmatches outnumber matches by many orders of magnitude. Not surprisingly, this has a huge influence on how the parameters of the generative model are estimated, and thereby impacts the performance of AutoER. We propose an elegant approach based on covariance matrix decomposition that dramatically improves the effectiveness of parameter learning.

Learning Parameters from Unbalanced Data. Recall from the previous sections that the parameters of AutoER are the prior probability π_M, the mean vectors μ_c, and the covariance matrices Σ_c. The prior probability is a scalar and each mean vector has dimensionality d (the number of features). These parameters can be learned effectively. The challenge occurs when learning the covariance matrix Σ_M, which is estimated through the weighted sample covariance matrix (Equation 8, Equation 13) of dimensionality d × d. For convenience of exposition, let us assume there is no feature grouping – the argument is the same with feature grouping. Since the matrix is symmetric, there are d(d−1)/2 distinct off-diagonal parameters that must be estimated from the tuple pairs marked as matches. For example, if a dataset has 50 features, then there are 1225 such parameters to estimate. However, in many datasets the ratio of matching tuple pairs to the number of parameters is very small. If done naively, this results in a large systematic distortion of the eigenstructure of Σ_M [Dempster1972Mar], making it a poor estimate of the true covariance matrix [Velasco]. A similar problem occurs, though with lesser severity, for Σ_U.

Our solution. Intuitively, there are two ways to tackle this problem: reduce the number of parameters to learn, and/or increase the amount of data used for learning them. AutoER leverages both ideas. The key challenge is to identify ER-specific properties that justify them. We achieve this through an appropriate decomposition of Σ_M and Σ_U that allows us both to reduce the number of parameters and to increase the amount of data used for estimation.

Specifically, we decompose Σ based on the well-known relationship between the covariance and the Pearson correlation of two random variables a and b:

cov(a, b) = ρ_ab σ_a σ_b

where σ_a and σ_b are the standard deviations of a and b, respectively. In other words, correlation can be seen as a normalized measure of covariance – while cov(a, b) can be any real number, we always have ρ_ab ∈ [−1, 1]. Based on the above observation, we can thus decompose the sample covariance matrix as follows:

Σ_c = D_c P_c D_c    (14)

where D_c is a diagonal matrix with (D_c)_jj = σ_c,j, i.e., the sample standard deviation of feature j in class c, and P_c is the sample Pearson correlation matrix of pairs of features in class c.

Superficially, it seems that we have actually increased the number of parameters to be learned. In order to obtain Σ_c directly, we only needed to estimate d(d+1)/2 parameters. Under the decomposition, we need d + d(d−1)/2 parameters: d parameters for the diagonal matrix D_c and d(d−1)/2 parameters for the Pearson correlation matrix P_c, which is also symmetric with a unit diagonal. Our core observation is that, in the ER context, the Pearson correlation matrices P_M and P_U are very similar. This is not surprising, as feature correlations are only mildly affected by the class labels. It is especially true in our case due to the way we create feature groups: features in the same group are obtained by applying different similarity functions on the same attribute, so the Pearson correlation between the corresponding features reflects the correlation of their similarity functions on this attribute, which is often consistent across matches and unmatches. This ER-specific property allows us to rewrite Equation 14 as

Σ_M = D_M P D_M,  Σ_U = D_U P D_U    (15)

where P = P_M = P_U is a single correlation matrix shared by both classes.

Estimating Σ_M and Σ_U using Equation 15 addresses the class imbalance problem for exactly the two reasons we mentioned before. First, the number of correlation parameters to learn decreases from d(d−1) for two matrices to d(d−1)/2 for one shared matrix, almost a 2x reduction for a large d. Second, since the Pearson correlation matrix is the same for both classes, we can estimate it using the entire candidate set, which is substantially larger than just the matches.
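Equation 15 can be sketched as follows: one correlation matrix estimated on the whole candidate set, per-class standard deviations estimated from the soft assignments, and the covariances rebuilt from the two. (`R` below plays the role of the shared correlation matrix P; using the unweighted `np.corrcoef` over all pairs is a simplification of this sketch.)

```python
import numpy as np

def covariances_shared_correlation(X, gamma):
    """Sketch of Equation 15: shared Pearson correlation R estimated on all
    pairs, per-class standard deviation matrices D_M and D_U from the soft
    class assignments, then Sigma_c = D_c R D_c."""
    R = np.corrcoef(X, rowvar=False)          # shared across both classes
    sigmas = {}
    for c, w in (("M", gamma), ("U", 1 - gamma)):
        n = w.sum()
        mu = (w[:, None] * X).sum(axis=0) / n
        std = np.sqrt((w[:, None] * (X - mu) ** 2).sum(axis=0) / n)
        D = np.diag(std)
        sigmas[c] = D @ R @ D                 # Equation 15
    return sigmas
```

The rare M class only has to supply d standard deviations; the d(d-1)/2 correlations come from the full, much larger candidate set.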

5 Incorporating Transitivity

Transitivity is another important characteristic of ER. Any effective solution for ER can improve its matching accuracy by exploiting transitivity. In this section, we discuss why ML-based approaches often have trouble handling this constraint and how AutoER tackles it.

Transitivity in ER. Intuitively, the transitivity property stipulates that if tuple pairs (t_i, t_j) and (t_j, t_k) are matches, then (t_i, t_k) must be a match. In a number of real-world datasets, exploiting this is an effective strategy, and hence most prior ER approaches have tried to incorporate it. The solutions can be broadly categorized into two techniques. The simplest approach is to perform the matching process in a pairwise manner and then post-process by computing the transitive closure on the matches (e.g., perform clustering). While intuitive, this approach may not always be effective: in many cases, the graphs obtained from pairwise ER can have a large diameter – as much as 20 [rastogi2013finding] – resulting in possible unmatches being marked as matches.

An alternative approach is to incorporate the transitive closure inside the statistical model used for ER, and let the model take the property into account while training. The stumbling block is that most ML models assume the similarity vectors are generated in an i.i.d. fashion. So one has to move from a pairwise ER model to a collective ER setting where the match decisions are made through a joint optimization [bhattacharya2007collective]. Relaxing the i.i.d. assumption often complicates the statistical model and makes it prohibitively expensive. For example, [mccallum2005conditional] proposed a conditional random field based approach that incorporates transitivity, but its training is very expensive.

Transitivity in AutoER. We face similar challenges when incorporating transitivity into AutoER. Retrofitting this requirement into the generative model's objective function results in the same scalability issues as prior collective ER approaches. For example, we could pose transitivity as a constraint on the posterior probabilities and perform constrained maximum likelihood estimation, which is known to be hard to scale [ganchev2008expectation]. Instead, AutoER leverages transitivity as a soft constraint that calibrates the posterior probabilities at the end of the E-step of every EM iteration. Our proposed approach has two main advantages: (1) transitivity is only treated as a "soft constraint" – this is important because the posterior probabilities are only estimates and can be incorrect, especially given that they change at every EM iteration; (2) our approach does not modify the model and thus incurs no additional computation cost other than checking the transitivity constraint. In the following, we describe how AutoER poses and uses the transitivity constraint.

Let $p_{ij}$, $p_{jk}$, and $p_{ik}$ denote the posterior probabilities of tuple pairs $(t_i, t_j)$, $(t_j, t_k)$, and $(t_i, t_k)$ being a match, respectively. Transitivity states that if both $(t_i, t_j)$ and $(t_j, t_k)$ are matches, then $(t_i, t_k)$ must be a match. However, there exist cases where $(t_i, t_k)$ is a match but not both $(t_i, t_j)$ and $(t_j, t_k)$ are matches. This observation is thus captured using the following inequality:

$p_{ik} \geq p_{ij} \cdot p_{jk}$ (16)

At the end of the E-step in every EM iteration, we check for violations of this constraint and correct them. Specifically, for three tuple pairs, if Equation 16 is violated, i.e., $p_{ik} < p_{ij} \cdot p_{jk}$, the least “confident” one among the three probabilities is adjusted. We determine the least “confident” tuple pair as the one whose posterior probability for matching is closest to 0.5. For example, when $|p_{ij} - 0.5| \leq |p_{jk} - 0.5|$ and $|p_{ij} - 0.5| \leq |p_{ik} - 0.5|$, $p_{ij}$ is adjusted by:

$p_{ij} \leftarrow p_{ik} / p_{jk}$ (17)

It is possible that no row in the feature matrix corresponds to a tuple pair because of blocking; in this case we assume its match probability to be zero, as tuple pairs excluded by blocking are unlikely to match. Of course, a naive implementation of this check would examine every triple of tuple pairs that share a tuple, which is prohibitively expensive when the number of tuple pairs is large. To improve computational efficiency, we only perform this check on tuple pairs that are likely to be matches ($p_{ij} > 0.5$ and $p_{jk} > 0.5$). Since the number of matching tuple pairs is often much smaller, and the number of matching pairs that are adjacent, i.e., share a tuple such as $(t_i, t_j)$ and $(t_j, t_k)$, is even smaller, the check can be done efficiently.
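The calibration step above can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the function name, the dictionary representation of posteriors, and the pre-filtered triple list are all assumptions; pairs pruned by blocking are looked up with a default probability of 0.

```python
def enforce_transitivity(posterior, triples):
    """Soft transitivity calibration at the end of an E-step (sketch).

    posterior : dict mapping a tuple pair (i, j) to its match probability.
    triples   : iterable of tuple triples (i, j, k) whose pairs are all
                likely matches (p > 0.5), pre-filtered for efficiency.
    """
    for i, j, k in triples:
        # Pairs removed by blocking are assumed to have probability 0.
        p_ij = posterior.get((i, j), 0.0)
        p_jk = posterior.get((j, k), 0.0)
        p_ik = posterior.get((i, k), 0.0)
        # Constraint (Eq. 16): p_ik >= p_ij * p_jk.
        if p_ik >= p_ij * p_jk:
            continue
        # Adjust the least confident pair (probability closest to 0.5)
        # so the constraint holds with equality (Eq. 17).
        conf = {(i, j): abs(p_ij - 0.5),
                (j, k): abs(p_jk - 0.5),
                (i, k): abs(p_ik - 0.5)}
        least = min(conf, key=conf.get)
        if least == (i, j) and p_jk > 0:
            posterior[(i, j)] = p_ik / p_jk
        elif least == (j, k) and p_ij > 0:
            posterior[(j, k)] = p_ik / p_ij
        else:
            posterior[(i, k)] = p_ij * p_jk
    return posterior
```

For instance, with $p_{ij} = p_{jk} = 0.9$ and $p_{ik} = 0.6$, the constraint $0.6 \geq 0.81$ is violated and $p_{ik}$ (closest to 0.5) is raised to $0.81$.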

Implementation: Deduplication vs. Record Linkage. While the above correction procedure looks straightforward, implementing it requires some care, depending on whether the two datasets $A$ and $B$ we perform ER on are the same. When $A = B$ (also known as data deduplication), computing $p_{ij}$, $p_{jk}$, and $p_{ik}$ proceeds exactly as presented before, using one generative model. However, when $A \neq B$ (also known as record linkage), computing $p_{ij}$, $p_{jk}$, and $p_{ik}$ requires multiple generative models: one for tuple pairs within $A$, one for tuple pairs within $B$, and one for cross-dataset tuple pairs. In other words, we need to run three EM tasks – one to estimate the matches across datasets and one each to identify matches within each dataset. Let $G_{AB}$ be the model for matching the two datasets, and let $G_A$ and $G_B$ be the models for matching within the “left” and “right” datasets, respectively. In the E-step of $G_{AB}$, Equation 17 can modify posterior probabilities belonging to $G_A$ and $G_B$, so the three models should be trained together, with each iteration performing the following sequence of steps: E-step of $G_{AB}$, M-step of $G_A$, E-step of $G_A$, M-step of $G_B$, E-step of $G_B$, M-step of $G_{AB}$. Notice that the M-steps for $G_A$ and $G_B$ need to be called before their E-steps to incorporate the changes made by $G_{AB}$’s E-step.
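The interleaved schedule can be illustrated with stub models that merely record the order in which their E- and M-steps run. The `StubModel` class and `em_iteration` function are illustrative assumptions, not AutoER’s actual model objects; the point is only the ordering, which follows the requirement that the within-dataset M-steps run after the cross-dataset E-step.

```python
class StubModel:
    """Hypothetical stand-in for a generative model; logs its calls."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def e_step(self):
        self.log.append("E:" + self.name)

    def m_step(self):
        self.log.append("M:" + self.name)


def em_iteration(g_ab, g_a, g_b):
    # G_AB's E-step may modify posteriors of G_A and G_B (via Eq. 17),
    # so their M-steps must run before their own E-steps.
    g_ab.e_step()
    g_a.m_step()
    g_a.e_step()
    g_b.m_step()
    g_b.e_step()
    g_ab.m_step()


log = []
em_iteration(StubModel("AB", log), StubModel("A", log), StubModel("B", log))
# log == ["E:AB", "M:A", "E:A", "M:B", "E:B", "M:AB"]
```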

6 Putting It All Together

In this section, we describe the remaining details needed to make the AutoER algorithm work. The pseudocode for AutoER is shown in Algorithm 1.

We consider blocking orthogonal to our problem. While AutoER can work with all pairs of tuples, it is often more efficient to perform blocking. Furthermore, we assume that domain experts have done feature engineering and provided the feature matrix, with one row per tuple pair and one column per feature. If not, one could use any of the practical ER systems such as Magellan [konda2016magellan] that can generate features automatically.

Different features may be on different scales. For example, Jaccard similarity is between 0 and 1, while string edit distance can be arbitrarily large. We therefore bring each feature into the $[0,1]$ range through min-max normalization: the minimum value of each feature is mapped to 0, the maximum to 1, and all other values are adjusted proportionally.
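The normalization is the standard column-wise min-max rescaling; a minimal sketch (the constant-column guard is our own assumption, since a feature with zero range would otherwise divide by zero):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column of X into [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / span
```

For example, a Jaccard column `[0.2, 0.8, 0.5]` maps to `[0, 1, 0.5]`, and an edit-distance column `[3, 7, 11]` maps to `[0, 0.5, 1]`.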

The EM algorithm requires an initialization. We initialize the class assignment of each example straightforwardly according to the relative magnitude of its feature vector: we min-max normalize the magnitudes of the feature vectors of all examples, then assign an example to the match (M) component if its normalized magnitude exceeds a threshold $\delta$ and to the unmatch (U) component otherwise. We choose 0.5 as the default value for $\delta$, and show experimentally that AutoER is robust to the choice of $\delta$.
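This initialization rule can be sketched as follows (the function name is ours, and for brevity the sketch assumes the magnitudes are not all identical):

```python
import numpy as np

def initialize_assignments(X, delta=0.5):
    """Initial M/U assignment from min-max normalized feature magnitudes.

    Returns a boolean array: True -> match (M), False -> unmatch (U).
    Assumes the feature-vector magnitudes are not all equal.
    """
    mag = np.linalg.norm(np.asarray(X, dtype=float), axis=1)
    mag = (mag - mag.min()) / (mag.max() - mag.min())  # min-max normalize
    return mag > delta
```

Intuitively, similarity features of matching pairs tend to be large, so pairs with large feature vectors are a reasonable initial guess for the match component.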

The EM iteration is terminated when the difference between the normalized log likelihoods [bishop2006pattern] of consecutive iterations falls below a threshold $\epsilon$, i.e., $|\mathcal{L}^{(t)} - \mathcal{L}^{(t-1)}| < \epsilon$. We also limit the number of EM iterations to 200, a common practice also used in popular ML packages such as sklearn [gmm_sklearn]. When EM terminates due to the iteration limit (instead of likelihood convergence), we average the likelihood results from the latest 20 iterations. The space complexity is linear in the size of the feature matrix, and the time complexity of each EM iteration is linear in the number of tuple pairs, dominated by Equation 3 and Equation 8.
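The stopping rule can be sketched as below; the function name, the default tolerance, and representing the normalized log likelihood trace as a plain list of floats are all assumptions of this sketch.

```python
def converged(loglik_history, eps=1e-4, max_iter=200):
    """Stop when the change in normalized log likelihood between
    consecutive iterations is below eps, or the iteration cap is hit."""
    if len(loglik_history) >= max_iter:
        return True          # iteration limit reached
    if len(loglik_history) < 2:
        return False         # need two iterations to compare
    return abs(loglik_history[-1] - loglik_history[-2]) < eps
```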

0:  Feature matrix $X$
0:  A binary label for every tuple pair
1:  Initialize the class assignment of each example
2:  while Not Converged do
3:     M Step
4:       Update the component parameters by Equation 8
5:       Obtain the regularization terms by Equation 15
6:       Update the regularized covariance matrices by Equation 13
7:     E Step
8:       Update each posterior probability by Equation 3
9:       Adjust posteriors that violate Equation 16 by Equation 17
10:  end while
11:  Assign a binary label to each pair by Equation 5
Algorithm 1 AutoER

7 Experiments

Domain | Dataset | Notation | #Tuples | #Matches | #Attr
Restaurants | Fodors-Zagat [uci] | Rest-FZ | 533 - 331 | 112 | 7
Publications | DBLP-ACM [erhardws] | Pub-DA | 2,616 - 2,294 | 2,224 | 4
Publications | DBLP-Scholar [erhardws] | Pub-DS | 2,616 - 64,263 | 5,347 | 4
Movies | Rotten Tomatoes-IMDB [magellandata] | Mv-RI | 558 - 556 | 190 | 8
Products | Abt-Buy [erhardws] | Prod-AB | 1,082 - 1,093 | 1,098 | 3
Products | Amazon-Google products [erhardws] | Prod-AG | 1,363 - 3,226 | 1,300 | 4
Table 1: Dataset characteristics

We conduct an extensive set of experiments to evaluate the efficacy of AutoER. Specifically, our experiments focus on three dimensions:


  • Feasibility and Performance of AutoER (Section 7.2). Is it really possible for an unsupervised method to achieve performance comparable to supervised ML algorithms? How does AutoER compare with existing unsupervised methods, such as various clustering methods?

  • Ablation Analysis (Section 7.3). How do various innovations (feature grouping, feature regularization, handling class imbalance, and incorporating transitivity) in AutoER contribute to its final accuracy?

  • Sensitivity Analysis (Section 7.4). Is AutoER sensitive to the size of the dataset, the regularization hyperparameter, and its initialization?

7.1 Experimental Setup

Hardware and Platform. All our experiments were performed on a machine with a 2.20GHz Intel Xeon(R) Gold 5120 CPU and with 96GB 2666MHz RAM.

Datasets. We conducted extensive experiments on six datasets from four diverse domains: publications, e-commerce, movies, and restaurants. Table 1 provides statistics on these datasets. All are popular benchmark datasets and have been extensively evaluated by prior ER work using both ML and non-ML based approaches.

Algorithms Evaluated. We compare AutoER against representative algorithms from supervised and unsupervised approaches. We used Magellan [konda2016magellan] to generate the features automatically, which are used by AutoER and all baseline methods. The three supervised algorithms are:


  • Logistic Regression (LR): This is a typical linear classifier. We use 5-fold cross validation to tune the regularization parameter.

  • Random Forest (RF): This is a typical tree-based classifier. The number of trees is set as 100 and the minimum number of samples required to be at a leaf node is tuned by a 5-fold cross validation to avoid overfitting.

  • Multi-layer Perceptron (MLP): This is a typical deep learning classifier. We use two hidden layers of size 50 and 10 and tune the regularization parameter by 5-fold cross validation.

We use the sklearn [scikit-learn] implementation for the three supervised methods. We randomly split each dataset into training and test sets (50%-50%) and report the average results of ten runs. Note that using 50% of the data as training data is a very generous setting, as in practice the number of labeled examples available is much smaller. The match entries in the training set are over-sampled, as is typically done when training supervised ML methods in the presence of class imbalance.
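Over-sampling the match class can be done by duplicating match examples until the classes are balanced. A minimal sketch (the function name and the exact balancing policy are our assumptions; matches are assumed to be the minority class):

```python
import numpy as np

def oversample_matches(X, y, rng=None):
    """Duplicate match examples (y == 1) until classes are balanced.

    Assumes matches are the minority class, as is typical in ER.
    """
    rng = rng or np.random.default_rng(0)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    # Sample additional match indices with replacement.
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```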

As we shall describe in Section 8, unsupervised approaches formalize the ER problem as model based clustering or an extension of Fellegi-Sunter model. We chose four representative algorithms that cover the spectrum of unsupervised approaches:


  • K-Means (SK): This baseline applies the K-Means algorithm from Scikit-Learn with two clusters. If the similarity vectors of the M- and U-distributions are very different, this method would provide good results.

  • K-Means (RL): This is an improved baseline from [RecordLinkageGH] that is calibrated for the two-cluster ER task. Traditional K-Means often fails when the sizes of the two clusters are very uneven [KMeansIssues], which is often the case in ER. This algorithm tackles the class imbalance through class weighting, so that matches get a higher weight than unmatches.

  • GMM: This applies the Gaussian Mixture Model from scikit-learn with two components.

  • ECM: The Fellegi-Sunter (FS) model [fellegi1969theory] is a seminal approach for unsupervised ER. There have been a number of improvements to the FS model that generalize it and improve its performance. We use the implementation from [RecordLinkageGH, de2015probabilistic] that provides state-of-the-art results. This approach implements an expectation conditional maximization algorithm that relaxes the simplistic independence assumption of FS.

The unsupervised methods, including AutoER, are fitted on the whole dataset without labels and evaluated on the whole dataset.

Apart from the evaluated algorithms, we also report the best-in-literature performance for all datasets whenever available. Note that the best-in-literature work typically performs dataset-specific feature engineering [kopcke2010evaluation] or uses a large amount of training data [kopcke2010evaluation, anhaisigmod2018]. Nevertheless, we report these scores as upper bounds for reference.

Performance Measures. We use F-score as the performance measure, since ER is a task with unbalanced labels. We report the average F-score over ten runs whenever a train-test split is needed for supervised methods.

AutoER Setup. By default, we set the feature regularization parameter $\lambda$ of AutoER to a fixed value; we set the initialization threshold $\delta$ to 0.5; and we use all unlabeled pairs for fitting the generative model. We report the sensitivity of AutoER to all these parameters in Section 7.4.

7.2 AutoER Performance

Overall F-score comparison. Table 2 reports the F-score for AutoER, three tuned supervised ML methods, four unsupervised methods, and the published best in literature.


  • AutoER vs. unsupervised methods: AutoER outperforms all the unsupervised benchmarks. Unmodified K-Means works well for simple datasets but fails for challenging ones. However, the modified K-Means does not always work well across domains: if the assumptions of K-Means (such as similar variance) are severely violated, it provides inferior results. AutoER dramatically outperforms GMM based methods due to a number of adaptations, including feature grouping, feature regularization, and transitivity. Finally, the ECM algorithm is simply not competitive with AutoER.

  • AutoER vs. supervised methods: The performance of AutoER is competitive with that of supervised methods, even though they have access to substantial amounts of labeled training data while AutoER requires zero training data. In fact, on 2/6 datasets AutoER even outperforms the supervised methods. This validates the various design choices made by AutoER.

  • AutoER vs. best-in-literature work: The performance of AutoER is comparable to the best-in-literature work on 3 datasets. However, on certain datasets, such as Prod-AB and Prod-AG, there is a larger gap in performance. We note that both these datasets are known to be challenging for traditional similarity based approaches: they contain long strings, such as product descriptions, on which simple string similarity approaches fail. In fact, the state-of-the-art results on them are provided by crowdsourcing and deep learning based methods that use semantic similarity. Nevertheless, as shown in Table 2, our approach is comparable to the traditional supervised ML approaches.

Unsupervised methods: AutoER, ECM, k-Means (RL), k-Means (SK), GMM. Supervised methods: RF, LR, MLP.
Dataset | AutoER | ECM | k-Means (RL) | k-Means (SK) | GMM | RF | LR | MLP | Published best in literature
Rest-FZ | 1 | 0.07 | 0.30 | 0.30 | 0.30 | 0.97 | 0.98 | 0.99 | [anhaisigmod2018]
Pub-DA | 0.95 | 0.09 | 0.95 | 0.27 | 0.53 | 0.98 | 0.96 | 0.97 | 0.984 [anhaisigmod2018]
Pub-DS | 0.85 | 0.07 | 0.85 | 0.43 | 0.28 | 0.93 | 0.88 | 0.92 | 0.947 [anhaisigmod2018]
Mv-RI | 0.85 | 0.56 | 0.81 | 0.81 | 0.81 | 0.83 | 0.81 | 0.79 | N/A
Prod-AB | 0.40 | 0.01 | 0.01 | 0.02 | 0.02 | 0.46 | 0.18 | 0.32 | 0.713 [kopcke2010evaluation]
Prod-AG | 0.40 | 0.01 | 0.02 | 0.02 | 0.02 | 0.51 | 0.18 | 0.35 | 0.693 [anhaisigmod2018]
Table 2: F-score for all methods
Dataset | AutoER F-score | LR Pct | LR Tuples | RF Pct | RF Tuples | MLP Pct | MLP Tuples
Rest-FZ | 1 | 100% | 2915 | 100% | 2915 | 100% | 2915
Pub-DA | 0.95 | 0.9% | 418 | 0.5% | 232 | 0.9% | 417
Pub-DS | 0.85 | 0.9% | 418 | 0.5% | 232 | 0.2% | 270
Mv-RI | 0.85 | 100% | 214 | 100% | 214 | 100% | 214
Prod-AB | 0.4 | 100% | 162981 | 2.6% | 4248 | 75% | 123054
Prod-AG | 0.4 | 100% | 358281 | 2.12% | 7589 | 0.8% | 2864
Table 3: Amount of labeled training data needed for supervised methods to achieve the same performance as AutoER. X Pct and X Tuples denote the percentage and the absolute number of labeled tuple pairs needed for supervised method X ∈ {LR, RF, MLP} to match AutoER’s F-score.

Labeling effort saved. We further investigate how much labeled training data the supervised methods need to match the performance of AutoER. Intuitively, this is a proxy for the labeling effort saved when AutoER is used instead of the supervised baselines. Table 3 shows the results. We can make a few observations. First, to achieve the same performance as AutoER, supervised ML methods may need as many as hundreds to thousands of training examples in the worst case; obtaining that many labels is often prohibitively expensive and error prone. Second, some baselines on some datasets need as much as 100% of the data as training data to match AutoER’s F-score. When only a limited number of labels is available, it is often preferable to use AutoER instead of prior supervised methods.

7.3 Ablation Analysis

Feature Dependence: Full, Independent, Grouped. Feature Regularization: F-Tik, I-Tik, G-Tik, F-Adp, I-Adp, G-Adp. AutoER Variants: G+A+P, G+A+P+T.
Dataset | Full | Independent | Grouped | F-Tik | I-Tik | G-Tik | F-Adp | I-Adp | G-Adp | G+A+P | G+A+P+T
Rest-FZ | 0.94 | 1 | 0.94 | 0.98 | 0.96 | 0.98 | 0.56 | 0.91 | 0.97 | 0.98 | 1
Pub-DA | 0.27 | 0.81 | 0.27 | 0.57 | 0.63 | 0.59 | 0.63 | 0.71 | 0.95 | 0.96 | 0.95
Pub-DS | 0.27 | 0.28 | 0 | 0.73 | 0.72 | 0.74 | 0.73 | 0.70 | 0.73 | 0.78 | 0.85
Mv-RI | 0.69 | 0.68 | 0.69 | 0.81 | 0.80 | 0.81 | 0.81 | 0.83 | 0.82 | 0.82 | 0.85
Prod-AB | 0.05 | 0.01 | 0 | 0 | 0.03 | 0 | 0.20 | 0.16 | 0.20 | 0.27 | 0.40
Prod-AG | 0.03 | 0.03 | 0.03 | 0 | 0 | 0 | 0.28 | 0.22 | 0.28 | 0.35 | 0.40
Table 4: Ablation analysis for AutoER. G+A+P+T is the final AutoER with feature Grouping, Adaptive feature regularization, using Pearson correlation for handling class imbalance, and incorporating Transitivity.

We next perform a series of experiments to understand the contributions of the different components of AutoER: (1) different ways of handling feature dependency (the full dependency assumption, the complete independence assumption, and our proposed grouped dependency assumption); (2) different ways of performing feature regularization (the existing Tikhonov regularization and our proposed adaptive regularization, applied under the three dependency assumptions); (3) using Pearson correlation to address the challenges associated with class imbalance; and (4) incorporating transitivity into AutoER. With different combinations of techniques, the proper value of the feature regularization parameter $\lambda$ might differ. We set $\lambda$ to a fixed value for all the AutoER variants that are not equipped with all optimizations, as it generally works well for them on most datasets. The results for the different combinations of techniques are shown in Table 4. We can make a few observations:


  • The final AutoER with all optimizations combined (last column in Table 4) consistently achieves the best performance across all datasets.

  • We compare the three feature dependency assumptions with no feature regularization at all. Surprisingly, the most naive feature independence assumption actually works best. This is because feature independence suffers the least from the singularity problem. Intuitively, the column (or row) vectors of the covariance matrix under the independence assumption are linearly independent because it is a diagonal matrix, while the column vectors of the covariance matrices under feature grouping and under full feature dependence can be linearly dependent (see Figure 2). As stated in Section 3.3, the singularity problem tends to occur when the determinant of the covariance matrix approaches zero, and the determinant of a matrix is zero when its column vectors are linearly dependent. Therefore, the singularity problem is more likely to occur under the grouped and full dependency assumptions.

  • With feature regularization, the singularity problem is resolved. We observe that feature grouping now almost always works better than the other two assumptions, regardless of the regularization strategy. Furthermore, our proposed adaptive regularization is better than the standard Tikhonov regularization, particularly on Pub-DA, Prod-AB, and Prod-AG.

  • Finally, addressing class imbalance and incorporating transitivity further increase AutoER’s performance. These two optimizations are especially important on harder datasets, such as Prod-AB and Prod-AG, where we see up to a 100% F-measure increase.
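The determinant argument in the second observation can be checked numerically. In this illustrative sketch (synthetic data, not from the paper), two perfectly correlated features make the full covariance matrix singular, while its diagonal restriction, corresponding to the independence assumption, remains invertible:

```python
import numpy as np

# One latent feature duplicated with a scale factor: perfect correlation.
x = np.random.default_rng(0).normal(size=(100, 1))
X = np.hstack([x, 2 * x])

full = np.cov(X, rowvar=False)   # full dependence: columns linearly dependent
diag = np.diag(np.diag(full))    # independence assumption: diagonal matrix

print(np.linalg.det(full))       # ~0: singular, Gaussian density blows up
print(np.linalg.det(diag))       # > 0: invertible
```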

7.4 Sensitivity Analysis

Figure 4: Sensitivity Analysis. F1 score under different (a) regularization parameters $\lambda$, (b) initialization thresholds $\delta$, and (c) unlabeled training data sizes.

In this set of experiments, we show that AutoER is robust with respect to the regularization hyperparameter, the initialization, and the size of the unlabeled data used to fit the generative model.

Sensitivity to the regularization hyperparameter. We vary the regularization parameter $\lambda$ to understand how it affects the results. Recall that when $\lambda = 0$, there is no feature regularization; in this case, our generative model suffers from two major issues – singularity and over-fitting on the degenerate features – that reduce its performance. However, when $\lambda$ is too large, the regularization term in Equation 13 begins to dominate the covariance matrix, causing the generative model to underfit the data on certain datasets. As shown in Figure 4(a), the performance of AutoER is robust for intermediate values of $\lambda$.

Sensitivity to initialization. We vary the initialization threshold $\delta$ to show that AutoER is robust to initialization. As shown in Figure 4(b), AutoER is robust to $\delta$: the result does not change at all for most datasets. When $\delta = 0$ or $1$, no data is assigned to the M or U component, so EM fails to run. It is well known that EM is only guaranteed to return a local optimum, so a poor initialization tends to give poor results [balakrishnan2017statistical]. For Mv-RI, the F-score decreases as $\delta$ approaches 0 or 1. This is because Mv-RI is a small dataset, and when $\delta$ is close to 0 or 1, too few examples are assigned to the M or U component, yielding an extremely poor initialization. As we can see, $\delta = 0.5$ offers a safe initialization for all datasets.

Sensitivity and scalability to unlabeled training data size. We demonstrate that AutoER can produce excellent results even when fitted on a subset of the dataset. Specifically, we vary the amount of data (without labels) used to fit AutoER and evaluate how it impacts AutoER’s performance on the remainder of the data. As shown in Figure 4(c), the F1 score increases as the amount of unlabeled training data increases. Figure 4(c) also shows that AutoER already gives a good F1 score with only about 10% of the training data, which saves about 90% of the training time.

Varying the training data size also demonstrates the scalability of AutoER. Figure 5 shows that the running time per EM iteration is roughly linear in the amount of training data, which means AutoER is scalable.

Figure 5: Running time per EM iteration for different amount (%) of training data.

8 Related Work

Prior work on ER can be categorized as (a) rule-based, (b) expert- or crowd-based, and (c) machine learning based. Rule-based approaches are often simple and easily interpretable [DBLP:series/synthesis/2010Naumann], but often cannot achieve state-of-the-art results. Human-involved ER has become popular [DBLP:journals/pvldb/WangKFF12, corleone], but still requires substantial involvement of the expert and/or crowd, either to perform data labeling or to do feature engineering. In this work we focus on and compare with ML based approaches, as they provide state-of-the-art solutions. A good overview of ER can be found in surveys such as [DBLP:conf/sigmod/KoudasSS06, DBLP:journals/tkde/ElmagarmidIV07, DBLP:journals/pvldb/DongN09, DBLP:series/synthesis/2010Naumann, DBLP:journals/pvldb/GetoorM12, herzog2007data, maggi2008survey] and tutorials [DBLP:journals/pvldb/GetoorM12].

Supervised ML approaches for ER rely on obtaining a significant number of tuple pairs labeled as matches and unmatches, which is cumbersome and error prone. With sufficient labeled examples, many binary classifiers have been experimented with, including Naive Bayes [Winkler99thestate], decision trees [DBLP:conf/vldb/ChaudhuriCGK07], and SVM [DBLP:conf/kdd/BilenkoM03], as well as active learning [SarawagiB02] and transfer learning [thirumuruganathan2018reuse]. [konda2016magellan] provides a comprehensive survey of current ER systems based on supervised approaches. We picked three representative and best-performing supervised ML models and compared against them.

Unsupervised approaches often model ER as a probabilistic clustering problem, as is our proposal. The Fellegi-Sunter (FS) model is generally regarded as the seminal work in unsupervised ER, and was originally proposed for deduplicating the Census data [fellegi1969theory]. Since then, many extensions have been proposed, including using EM for learning the weights of features [winkler2000using], improved decision rules [winkler1993improved], generalization to multiple datasets [sadinle2013generalized], other generative processes [larsen2001iterative], relaxing the conditional independence assumptions in FS [schurle2005method, de2015probabilistic] and leveraging auxiliary data [armstrong1992estimation]. We not only compared with some classical clustering methods (K-Means and GMM), but also a popular open-source implementation of the FS model [RecordLinkageGH, de2015probabilistic].

9 Conclusion

We have proposed AutoER, an automated solution for solving the ER problem without any labeled examples. This is achieved through a generative model that discerns the difference between the similarity vectors of matches and unmatches. Our generative model extends the popular Gaussian Mixture Model with many improvements that specifically address ER challenges. First, our feature grouping idea captures the dependencies between features generated from the same attribute, while assuming features across attributes to be independent. Second, our feature regularization idea addresses the singularity problem found in ER datasets. Third, we handle the class imbalance problem in ER by decomposing the covariance matrix, thus reducing the number of parameters that need to be estimated. Fourth, we propose a way to treat the transitivity constraint in ER as a soft constraint. With all of these optimizations together, we achieve state-of-the-art results with zero training data.

Appendix A Appendix

a.1 Proof I

Under GMM, the component density is the Gaussian probability density function:

$\mathcal{N}(x_i \mid \mu_c, \Sigma_c) = \frac{1}{(2\pi)^{d/2} |\Sigma_c|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_c)^\top \Sigma_c^{-1} (x_i - \mu_c)\right)$ (18)

Substituting the above equation into Item 2 gives Equation 6:

$Q = \sum_{c \in \{M, U\}} \sum_{i=1}^{N} \gamma_{ic} \left[\log \pi_c - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}(x_i - \mu_c)^\top \Sigma_c^{-1} (x_i - \mu_c)\right]$ (19)

where $\mu_c$ is the mean and $\Sigma_c$ is the covariance matrix of component $c$; $N_c$ is the sum of the posterior probabilities $\gamma_{ic}$:

$N_c = \sum_{i=1}^{N} \gamma_{ic}$ (20)

Actually, Equation 6 can be written more compactly as:

$Q = \sum_{c \in \{M, U\}} N_c \left[\log \pi_c - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}\operatorname{tr}\left(\Sigma_c^{-1} \tilde{S}_c\right)\right]$ (21)

where $\tilde{S}_c$ is:

$\tilde{S}_c = \frac{1}{N_c} \sum_{i=1}^{N} \gamma_{ic} (x_i - \mu_c)(x_i - \mu_c)^\top$ (22)

a.2 Proof II

Proof for Equation 13. Substituting Equation 21 into Equation 12 gives:

$Q_{\text{reg}} = \sum_{c \in \{M, U\}} N_c \left[\log \pi_c - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}\operatorname{tr}\left(\Sigma_c^{-1} \tilde{S}_c\right)\right] - \frac{\lambda}{2} \sum_{c \in \{M, U\}} \operatorname{tr}\left(\Sigma_c^{-1}\right)$ (23)

The derivatives of the regularization term with respect to $\pi_c$ and $\mu_c$ are zero, so the optimal $\pi_c$ and $\mu_c$ for Equation 23 are the same as those of Equation 21:

$\pi_c = \frac{N_c}{N}, \qquad \mu_c = \frac{1}{N_c} \sum_{i=1}^{N} \gamma_{ic} x_i$ (24)

To derive the optimal $\Sigma_c$, we write Equation 23 as:

$Q_{\text{reg}} = \sum_{c \in \{M, U\}} N_c \left[\log \pi_c - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}\operatorname{tr}\left(\Sigma_c^{-1}\left(\tilde{S}_c + \frac{\lambda}{N_c} I\right)\right)\right]$ (25)

which is equivalent to simply replacing $\tilde{S}_c$ with $\tilde{S}_c + \frac{\lambda}{N_c} I$ in Equation 21. It is known that the optimal $\Sigma_c$ for Equation 21 equals $\tilde{S}_c$ [bishop2006pattern], so the optimal $\Sigma_c$ for Equation 25 is given by:

$\Sigma_c = \tilde{S}_c + \frac{\lambda}{N_c} I$ (26)

Substituting the $\mu_c$ inside $\tilde{S}_c$ by its optimal value from Equation 24 gives:

$\Sigma_c = S_c + \frac{\lambda}{N_c} I$ (27)

where $S_c$ is the weighted sample covariance matrix:

$S_c = \frac{1}{N_c} \sum_{i=1}^{N} \gamma_{ic} (x_i - \mu_c)(x_i - \mu_c)^\top$ (28)

References