Performance Bounds for Pairwise Entity Resolution

09/10/2015 ∙ Matt Barnes, et al. ∙ Carnegie Mellon University

One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine learning tasks, when an entity resolution algorithm performs well on small hold-out datasets, there is no guarantee this performance holds on larger hold-out datasets. We prove simple bounding properties between the performance of a match function on a small validation set and the performance of a pairwise entity resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization of pairwise entity resolution algorithms for large datasets, using a small set of labeled data.


1 Introduction

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, product) across one or multiple datasets. Ironically, it has multiple names: deduplication and record linkage, among others [1]. For example, ER is used to disambiguate shopping products [2], merge datasets of users from disparate sources, or even profile potential terrorist threats. With the use of blocking techniques, entity resolution can be scaled to many millions of records [3].

The canonical example in Table 1 illustrates the usefulness of pairwise ER for these application domains. Initially, the match function may only predict $r_1 \approx r_2$ using the common phone number, where $\approx$ denotes a match. A partial name may not be a strong enough commonality to predict that either of these individually matches $r_3$. However, the merge of these records $\langle r_1, r_2 \rangle$, where $\langle \cdot, \cdot \rangle$ denotes a merge, provides the full name 'John Doe' and enables correctly merging all three records.
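To make the match-then-merge mechanics concrete, here is a minimal Python sketch of this chain on the Table 1 records. The rule-based match function (shared phone number, or agreement on both name fields) and the field-wise set-union merge are illustrative stand-ins, not the learned match function used later in the paper.

```python
# Records from Table 1 as per-field value sets: (Name1, Name2, Phone).
r1 = ({"John"}, {"D."}, {"377-8328"})
r2 = ({"J."}, {"Doe"}, {"377-8328"})
r3 = ({"John"}, {"Doe"}, set())

def match(a, b):
    # Illustrative rule: a shared phone number, or agreement on both name fields.
    shared_phone = bool(a[2] & b[2])
    full_name = bool(a[0] & b[0]) and bool(a[1] & b[1])
    return shared_phone or full_name

def merge(a, b):
    # Field-wise set union, as used for the merge function in Section 5.2.
    return tuple(x | y for x, y in zip(a, b))

assert match(r1, r2)                            # common phone number
assert not match(r1, r3) and not match(r2, r3)  # partial names are not enough
r12 = merge(r1, r2)                             # now carries the full name "John Doe"
assert match(r12, r3)                           # the merged record matches r3
```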

To design an effective entity resolution system, one would optimize over the ER merge and match functions. One might be tempted to evaluate and optimize an ER system on a small dataset with known labels, and then extend this to real-world applications. We stress that performance on small datasets does not necessarily imply similar performance on large datasets. Unlike more traditional machine learning tasks, in ER applications the number of entities often scales linearly with the size of the dataset [1]. This is not true in other clustering problems, where the number of clusters is typically constant or sublinear in the dataset size, a significantly easier problem. Further, the 'no negative evidence' assumption [1, 4] can cause a 'snowball effect,' wherein several false positives trigger many more clusters to merge, leading to a severe degradation in performance.

Consider the simple example in Figure 1, using synthetic data described in Section 5.1. First, we learned a match function using a small training dataset of 100 records. On a test dataset of comparable size, it achieved near perfect pairwise precision and recall. However, as we added new entities to the test dataset, pairwise precision significantly degraded, an extreme example of the entire dataset snowballing into a single entity. More importantly, near perfect performance on the larger datasets was possible (dotted line), just with different match function parameters. Using our approach to instead optimize over the larger dataset's estimated lower bound dramatically improves performance on the large set (solid line).

Record   Name1   Name2   Phone
$r_1$    John    D.      377-8328
$r_2$    J.      Doe     377-8328
$r_3$    John    Doe

Table 1: Canonical Entity Resolution Example
Figure 1: A simple experiment demonstrates the potential degradation of pairwise precision as the size of the dataset increases. Here the ‘Original’ algorithm (dashed line) was tuned for optimal performance on a training set of 100 records. ‘Optimized Lower Bound’ (solid line) shows our results after instead optimizing model parameters over the larger dataset’s estimated lower bound. ‘True’ (dotted line) shows the actual performance corresponding to this lower bound.

Although performance on a small labeled dataset does not directly equate to performance on an actual larger dataset, some useful information does exist which we will leverage into an estimated lower bound for ER performance on arbitrarily sized problems. Then, optimization of the estimated lower bound allows tuning of pairwise ER systems for large datasets.

In this paper, our contributions are:

  1. Theoretical Performance Bounds: We prove simple estimated lower bounds on pairwise precision, recall, and $F_1$ for arbitrarily sized datasets, under reasonable assumptions and given a small number of labeled record pairs.

  2. Empirical Tightness: We evaluate the bounds on one synthetic and three real world datasets to demonstrate the theoretical bounds are tight to the true performance.

  3. Optimal Merge Function: Given any match function, we prove a lower-bound optimal merge function and ‘wrapper’ for the match function. This conservative strategy is equivalent to finding all connected components, a key insight of the simple bounds.

The remainder of the paper is organized as follows. We begin section 2 with a quick overview of related work in the field of entity resolution. In sections 3 and 4, we derive the estimated lower bounds and optimal merge function, respectively. Lastly, in section 5 we demonstrate the empirical tightness of the bound on real world datasets.

2 Related Work

Entity resolution encompasses a broad set of approaches, including many adapted from the machine learning, optimization, and graph theory domains. Strategies appropriate for ER include hierarchical clustering [5], integer linear programming [6], latent Dirichlet allocation [7], pairwise match/merge [4], Markov logic [8] and hybrid human-machine systems [9]. Pairwise entity resolution approaches are appealing because they use an intuitive and easy to implement iterative match and merge process between pairs of records. Further, under certain assumptions, pairwise algorithms will perform the optimal number of record comparisons [4].

Perhaps the most general framework for pairwise entity resolution was presented by Benjelloun et al. [4]. They outlined a theoretically disciplined approach, wherein certain properties of the match and merge function guarantee a deterministic output in the optimal number of record comparisons. We explore the use of some of these properties in the derivation of our bounds. Collectively, these properties are referred to by their acronym ICAR:

  1. Idempotence: $\forall r$, $r \approx r$ and $\langle r, r \rangle = r$.

  2. Commutativity: $\forall r_1, r_2$, $r_1 \approx r_2$ iff $r_2 \approx r_1$, and if $r_1 \approx r_2$, then $\langle r_1, r_2 \rangle = \langle r_2, r_1 \rangle$.

  3. Associativity: $\forall r_1, r_2, r_3$ such that $\langle r_1, \langle r_2, r_3 \rangle \rangle$ and $\langle \langle r_1, r_2 \rangle, r_3 \rangle$ exist, $\langle r_1, \langle r_2, r_3 \rangle \rangle = \langle \langle r_1, r_2 \rangle, r_3 \rangle$.

  4. Representativity: If $r_3 = \langle r_1, r_2 \rangle$, then for any $r_4$ such that $r_1 \approx r_4$, we also have $r_3 \approx r_4$.

The first three properties are straightforward and reasonable to assume for most ER systems. The crux of determinism falls on the final property, representativity. We, too, will take advantage of this convenient property, leaving the interesting problem of how relaxing this assumption affects the performance bounds for future work. Intuitively, representativity means merging any two records can only monotonically increase their chance of matching with other records. This is also referred to as the ‘no negative evidence’ clause.

3 Lower Bounds of Performance

Although many metrics exist to evaluate entity resolution performance when a ground truth dataset is available, ground truth is rarely available in practice. Not surprisingly, human-generated clusterings rarely number beyond a thousand records [2], a relatively easy ER problem. Even finding publicly available datasets with ground truth so that we could objectively evaluate our results was a trying task.

In the simplest setting, we assume we have access to some pairs $(r_i, r_j)$ with known binary match/mismatch label $y_{ij}$, such that $y_{ij} = 1$ if $r_i$ and $r_j$ belong to the same entity and $y_{ij} = 0$ otherwise. For large datasets, finding all records belonging to one entity is a worst-case combinatorial problem, but finding just two matching records is relatively easy using a hybrid human-machine system [9] or with strong features (e.g. phone number, product ID).

With both match and mismatch pairs at our disposal, we created a training and validation set of labeled pairs. The remaining records form the test dataset. Note the training and validation sets will likely have significantly different class balance, cluster sizes, and overall number of samples than the test set. Though an entity resolution algorithm may perform well on the validation set, with its few samples and small cluster sizes, this may not indicate strong performance on the full dataset with millions of records and many more clusters. In practice, a developer needs performance guarantees on the test set, because this is what the deployed system operates on.

Here, we derive precise relationships between the performance of the match function on the validation record pairs and estimated lower bounds on ER pairwise precision, recall, and $F_1$ on the test set. Our notation for the following proofs, which the reader may find convenient to refer back to, is:

$\langle r_i, r_j \rangle$: Record formed by merging records $r_i$ and $r_j$.

$V$: Set of validation record pairs with known labels, $V = \{(r_i, r_j, y_{ij})\}$.

$V^+$: Set of record pairs in the validation set with positive label, $V^+ = \{(r_i, r_j) : y_{ij} = 1\}$, $V^+ \subseteq V$.

$V^{\approx}$: Set of record pairs in the validation set that are predicted to directly match, $V^{\approx} = \{(r_i, r_j) : r_i \approx r_j\}$, $V^{\approx} \subseteq V$.

$T$: Set of test records $\{r_i\}$.

$T^{\approx}$: Set of record pairs in the test set that are predicted to directly match, $T^{\approx} = \{(r_i, r_j) : r_i \approx r_j,\ r_i, r_j \in T\}$.

$T^{ER}$: Set of record pairs in the entity resolution clustering of the test set.

$T^+$: Set of record pairs in the true clustering of the test set (unknown).

$P_V$: Precision of predicted and true positive pairs, $P_V = |V^{\approx} \cap V^+| / |V^{\approx}|$.

$R_V$: Recall of predicted and true positive pairs, $R_V = |V^{\approx} \cap V^+| / |V^+|$.

$p_V$: Class balance of pairs in the validation set, $p_V = |V^+| / |V|$.

$\hat{p}_T$: Estimated class balance of pairs in the test set.

Lemma 1.

For entity resolution systems satisfying the representativity property, every record pair that directly matches will end up in the same entity:

$T^{\approx} \subseteq T^{ER}$   (1)

Additional pairs in $T^{ER}$ can occur from chains of matches (i.e. $r_1 \approx r_2$ and $r_2 \approx r_3$, thus $r_1$, $r_2$, $r_3$ resolve to the same entity even if $r_1 \not\approx r_3$) and from merging (see Table 1). However, we are unable to make strong claims about the additional matches since composite records do not occur in the validation set.

Proof.

Suppose on the contrary there exists a pair of records $(r_i, r_j)$ such that $r_i \approx r_j$ but $(r_i, r_j) \notin T^{ER}$. In other words, $r_i \approx r_j$ and they are resolved to separate entities $e_1$ and $e_2$. Since these clusters were not merged in the ER process, $e_1 \not\approx e_2$. But by repeated application of representativity a composite record must match everything its constituents match, so $e_1 \approx e_2$, a contradiction. ∎

Theorem 1.

The pairwise precision of an entity resolution result can be lower bounded by:

$P_{ER} \geq \dfrac{|T^{\approx}|}{|T^{ER}|}\,\hat{P}_{T}, \qquad \hat{P}_{T} = \left(1 + \dfrac{1 - P_V}{P_V} \cdot \dfrac{p_V}{1 - p_V} \cdot \dfrac{1 - \hat{p}_T}{\hat{p}_T}\right)^{-1}$   (2)

The bound is composed of two parts. $\frac{|T^{\approx}|}{|T^{ER}|}$ is the fraction of record pairs in the test set entity resolution that directly match, which we can make stronger claims about. $\hat{P}_{T}$ is the precision of these direct matches, adjusted for the change in class balance between the validation and test sets.

Proof.

From Lemma 1 and applying the definitions of pairwise precision for $T^{ER}$ and $T^{\approx}$:

$P_{ER} = \dfrac{|T^{ER} \cap T^+|}{|T^{ER}|} \geq \dfrac{|T^{\approx} \cap T^+|}{|T^{ER}|} = \dfrac{|T^{\approx}|}{|T^{ER}|} \cdot \dfrac{|T^{\approx} \cap T^+|}{|T^{\approx}|} \approx \dfrac{|T^{\approx}|}{|T^{ER}|}\,\hat{P}_{T}$

where the last step follows from equating the match function validation set performance to the expected match function test set performance using the change in match/mismatch class balance. ∎

Most of the values are straightforward to count from the resolution. $|T^{ER}|$ is the number of pairs in the clustering output. $|T^{\approx}|$ is the number of record pairs that directly match, which by Lemma 1 can be efficiently computed by evaluating the match function only on pairs within each resolved cluster.

The class balance of the validation set is known, but we must estimate $\hat{p}_T$. We refer the reader to state-of-the-art results for class prior estimation [10, 11].
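The bound computation itself is a few lines. Below is a minimal sketch, assuming the counts and estimates above are already in hand; the function and argument names are ours, and the rebalancing follows Eq. (2).

```python
def precision_lower_bound(n_direct, n_er_pairs, P_val, p_val, p_test_hat):
    """Estimated lower bound on pairwise ER precision (Theorem 1 sketch).

    n_direct   -- |T≈|: number of test pairs that directly match (Lemma 1)
    n_er_pairs -- |T_ER|: number of pairs in the ER clustering output
    P_val      -- match function precision on the validation pairs
    p_val      -- class balance of the validation pairs
    p_test_hat -- estimated class balance of test pairs, e.g. via [10, 11]
    """
    # Rebalance the validation precision to the test set's estimated class prior.
    odds = ((1 - P_val) / P_val) * (p_val / (1 - p_val)) * ((1 - p_test_hat) / p_test_hat)
    P_adjusted = 1.0 / (1.0 + odds)
    return (n_direct / n_er_pairs) * P_adjusted
```

The corresponding recall bound of Theorem 2 below needs no computation at all: the validation recall is itself the estimate.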

Theorem 2.

The pairwise recall of an entity resolution result can be lower bounded by:

$R_{ER} \geq R_V$   (3)

In other words, the recall on the validation set already forms a lower bound for the pairwise recall on the test resolution.

Proof.

From the definitions of pairwise recall for $T^{ER}$ and $T^{\approx}$ and then applying Lemma 1:

$R_{ER} = \dfrac{|T^{ER} \cap T^+|}{|T^+|} \geq \dfrac{|T^{\approx} \cap T^+|}{|T^+|} \approx R_V$

where the last step does not require class rebalancing because recall is not a function of class balance (unlike precision, it depends only on the positive pairs). ∎

A lower bound on pairwise $F_1$ (the harmonic mean of pairwise precision and recall) can be computed from the two former lower bounds, since the harmonic mean is monotonically increasing in both of its arguments. We will focus more on measuring both pairwise precision and recall, as they are more informative than the aggregated $F_1$ metric.

4 Optimal Merge Function

Given any match function satisfying the idempotence and commutativity properties, we will prove a merge function and 'wrapper' match function that optimize the estimated lower bounds. Since the idempotence property is trivially satisfied for any match function by checking for identical records, and the commutativity property is satisfied by checking both directions $r_1 \approx r_2$ and $r_2 \approx r_1$, this essentially holds for all pairwise match functions. These match and merge functions form a conservative strategy, but provide the lower bound optimal performance given only labeled pairs.
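As a concrete illustration of how trivially these two properties can be enforced, consider the following sketch; `raw_match` stands for any user-supplied pairwise match function.

```python
def make_idempotent_commutative(raw_match):
    """Wrap an arbitrary pairwise match function so that idempotence and
    commutativity hold by construction (a sketch of the trick described above)."""
    def match(r1, r2):
        if r1 == r2:  # idempotence: a record always matches itself
            return True
        return raw_match(r1, r2) or raw_match(r2, r1)  # commutativity: check both directions
    return match
```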

We consider the original set of base records $\{r_1, \ldots, r_n\}$ and use the notation $\bar{r}$ for a record formed by merging at least two other records; a composite record is treated as the set of base records it contains.

Theorem 3.

For any match function, the pairwise precision, recall, and $F_1$ estimated lower bounds are optimal for the merge function:

$\langle \bar{r}_1, \bar{r}_2 \rangle = \bar{r}_1 \cup \bar{r}_2$   (4)

The corresponding 'wrapper' match function between $\bar{r}_1$ and $\bar{r}_2$ is:

$\bar{r}_1 \approx \bar{r}_2 \iff \exists\, r_k \in \bar{r}_1,\ r_l \in \bar{r}_2 \ \text{such that}\ r_k \approx r_l$   (5)

where $\approx$ on the right-hand side denotes the provided match function applied to base records.
Proof.

We will show both directions: that the optimal merge function and match 'wrapper' must make at least these matches to satisfy the ICAR properties, and that any additional matches will decrease the estimated performance lower bound. By the definition of the set union operator, the merge function is associative. The rest of the proof will focus on the representativity property.

Direction 1: We are constrained by match and merge functions that satisfy the ICAR properties. In the first direction, we will show these are the minimum matches required to satisfy representativity. Assume on the contrary: there exist two composite records $\bar{r}_1$ and $\bar{r}_2$ such that $\bar{r}_1 \not\approx \bar{r}_2$, but one pair of their constituent records match, i.e. $r_k \approx r_l$ for some $r_k \in \bar{r}_1$, $r_l \in \bar{r}_2$. By definition, this contradicts the representativity property.

Direction 2: In the second direction, we will show any additional matches will increase $|T^{ER}|$ and thus decrease the estimated pairwise precision lower bound. Assume there exist two records $\bar{r}_1$ and $\bar{r}_2$ such that $\bar{r}_1 \approx \bar{r}_2$, but none of their constituent records match, i.e. $r_k \not\approx r_l$ for all $r_k \in \bar{r}_1$, $r_l \in \bar{r}_2$. The additional match may increase $|T^{ER}|$ without increasing $|T^{\approx}|$, thus decreasing $\frac{|T^{\approx}|}{|T^{ER}|}$. ∎

The simplicity of this approach is derived from only claiming performance knowledge of direct record matches from the validation set performance. Interestingly, this ER system is equivalent to finding all connected components, where each edge of the adjacency matrix is $A_{kl} = \mathbb{1}[r_k \approx r_l]$. We stress that though this may optimize the estimated lower bound performances, it does not necessarily guarantee better performance. However, if ground truth is not available for a dataset of comparable size to the deployed system, then this is now a theoretically well motivated approach.
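A sketch of this connected components view, assuming SciPy is available; `records` and `match` are user-supplied, and the quadratic loop ignores blocking for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def resolve(records, match):
    """Resolve entities as connected components of the direct-match graph."""
    n = len(records)
    rows, cols = [], []
    for i in range(n):
        for j in range(i + 1, n):  # commutativity lets us check each pair once
            if match(records[i], records[j]):
                rows.append(i)
                cols.append(j)
    adjacency = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(adjacency, directed=False)
    return labels  # labels[i] is the entity id assigned to records[i]
```

As a side benefit, $|T^{\approx}|$ is simply the number of edges found, and $|T^{ER}|$ is the number of within-component pairs, so the quantities in Theorem 1 fall out of this computation directly.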

A significant benefit of Theorem 3 is the provided match function need not satisfy the very restrictive representativity property. Further, since the idempotence and commutativity properties are trivial to satisfy, the provided match function can be essentially any match function. For example, one could use more complex machine learning based match functions (e.g. kernelized SVM, random forests) and featurizations which may not have intuitive merge operations (e.g. word2vec [12], Brown clustering [13]). Using less restrictive match functions undoubtedly enables better validation precision $P_V$ and recall $R_V$, further improving the lower bounds.

5 Experiments

We conducted experiments on multiple datasets with known ground truth to empirically demonstrate the tightness of the estimated lower bounds. Specifically, we are interested in optimizing ER model parameters over the estimated lower bounds and over the ground truth metrics to show they achieve similar results.

5.1 Datasets

We used one synthetic and three real world datasets with known ground truth for our experiments, as described in Table 2. For all these datasets, the goal of entity resolution is to find records describing the same entity (e.g. restaurant, product, or person). For the synthetic dataset, we generated each record's features using a feature vector unique to its respective entity, plus random Gaussian noise.
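A minimal sketch of this generator follows; the record and dimension counts match Table 2, while the number of entities and the noise scale are assumptions, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_records, n_dims = 100, 1000, 10  # entity count is an assumption
noise_scale = 0.1                              # assumed noise level

entity_vectors = rng.normal(size=(n_entities, n_dims))    # one latent vector per entity
entity_ids = rng.integers(0, n_entities, size=n_records)  # ground truth assignment
records = entity_vectors[entity_ids] + noise_scale * rng.normal(size=(n_records, n_dims))
```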

Dataset           # dim   # records   # matches
Synthetic         10      1000        4500
Restaurant        4       864         112
Abt-Buy           3       2173        1118
Escort (subset)   20      10000       10596

Table 2: Datasets used in the experiments
Figure 10: Experimental results demonstrate that model parameters can be tuned to optimize the estimated lower bounds on pairwise precision and recall of the test set; the resulting estimated lower bound is close to the true performance. Panels: (a) Synthetic precision, (b) Synthetic recall, (c) Restaurant precision, (d) Restaurant recall, (e) Abt-Buy precision, (f) Abt-Buy recall, (g) Escort precision, (h) Escort recall. Pairwise $F_1$ is not shown because it is the harmonic mean of the two former metrics, and is thus less informative.

Unlike general machine learning tasks, publicly available entity resolution datasets with known ground truth are extremely limited, and do not number beyond several thousand records. The Restaurant dataset is one of the earliest ER tasks discussed in the literature [14], and is still used today [9, 15]. Unfortunately, the dataset is also relatively small, numbering only 864 records and five features (name, phone number, street address, city, cuisine). We threw away the phone number feature because it made the problem too simple. The Abt-Buy dataset is more recent, larger at 2173 records, and used extensively in current research [9, 16]. It consists of product information from two retailers, including product name, description, and price.

Both the Restaurant and Abt-Buy datasets are a class of entity resolution known as clean-clean, wherein two ‘clean’ datasets with completely resolved entities are merged together [3]. This problem is easier than the more general problem of resolving entities with an unknown number of records. To formulate these datasets in a more general context, we merged them together into a single ‘dirty’ dataset and ignored the advantageous ‘clean-clean’ knowledge in our experiments.

Lastly, we evaluated a subset of a personal ads dataset scraped from escort advertising websites over the past few years [17]. We used natural language processing algorithms to extract 20 features, such as name, age, location, and hair color of the person being advertised. For ground truth, we used a subset of the data containing phone number matches as a proxy label. Although phone numbers will not allow us to discover the full ground truth, it is reasonable to assume ads with the same phone number belong to the same entity (i.e. person or group) because those numbers are the means of contact for potential customers.

5.2 Entity Resolution

We used the R-Swoosh algorithm for our ER systems [4]. For the merge function, we simply used the set union of the respective features. For example, in Table 1, $\langle r_1, r_2 \rangle$ would be [{J., John}, {D., Doe}, {377-8328}].

For the match function, we trained a binary logistic regression classifier using known matches and mismatches in the training dataset. Like all pairwise entity resolution algorithms, it operates on pairwise features, which we computed from two records' features using either a binary match (e.g. state, hair color), numerical difference (e.g. ages, weights), or Levenshtein string edit distance (e.g. name) of each feature pair. If a record had multiple values of a particular feature as a result of a merge operation, we used the closest feature match.
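The following sketch illustrates this featurization. The field names and the `Levenshtein` package are assumptions (any edit-distance implementation would do), and records store a set of values per field to accommodate set-union merges, taking the closest match across values as described.

```python
import Levenshtein  # assumed dependency: pip install python-Levenshtein

def pair_features(rec_a, rec_b):
    """Pairwise features for the logistic regression match function (a sketch)."""
    feats = []
    # Binary match for categorical fields (e.g. state, hair color).
    feats.append(1.0 if rec_a["state"] & rec_b["state"] else 0.0)
    # Closest numerical difference for numeric fields (e.g. age).
    feats.append(min(abs(x - y) for x in rec_a["age"] for y in rec_b["age"]))
    # Closest Levenshtein string edit distance for text fields (e.g. name).
    feats.append(min(Levenshtein.distance(x, y)
                     for x in rec_a["name"] for y in rec_b["name"]))
    return feats
```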

Another benefit of using a probabilistic match function is that the choice of parameters reduces to a single value: the cut-off threshold. The choice of cut-off threshold is a classic trade-off between precision and recall, an ideal setting to examine the results of our bounds.

5.3 Results

To examine the efficacy of the estimated lower bound in tuning an entity resolution system, we evaluated the true and lower bound performances across tightly spaced intervals of the match cut-off threshold, as shown in Figure 10. The tightness of the bounds demonstrates two important qualities. First, it enables the optimization of model parameters (e.g. the cut-off threshold) using the estimated lower bound. Though this may not necessarily result in the true (unknown) optimal parameters, it will result in the best estimated lower bound. Second, it enables enforcing a level of acceptable quality prior to the use of any entity resolution results.
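A sketch of this tuning loop, under the assumption that `evaluate` runs the ER system at a given threshold and returns the two estimated lower bounds; the recall floor of 0.5 is an arbitrary illustration of an application-specific constraint, not a value from the paper.

```python
import numpy as np

def tune_threshold(evaluate, thresholds=np.linspace(0.05, 0.95, 19), min_recall=0.5):
    """Pick the cut-off maximizing the estimated precision lower bound."""
    best_t, best_bound = None, -np.inf
    for t in thresholds:
        prec_bound, recall_bound = evaluate(t)  # estimated lower bounds at threshold t
        if recall_bound >= min_recall and prec_bound > best_bound:
            best_t, best_bound = t, prec_bound
    return best_t
```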

One may be surprised to see the estimated lower bound exceed the true performance. This is, indeed, possible because of uncertainty in the estimates of $P_V$, $R_V$, and $\hat{p}_T$. The 95% confidence intervals are obtained via propagation of the validation set Wilson scores for precision and recall [18]. Uncertainty increases as the gap between validation set and test set sizes widens, a phenomenon observable in Figure 1. For very small datasets such as Restaurant, we were restricted to using minimal validation samples due to the small number of labels. However, for larger experiments such as Abt-Buy and Escort, we could afford hundreds or thousands of validation samples, significantly reducing uncertainty. This is also theoretically motivated by the shift in class balance in Theorem 1.
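For reference, a minimal sketch of the Wilson score interval [18] underlying these confidence intervals, for k successes out of n Bernoulli trials at normal quantile z (1.96 for 95%):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion [18]."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. 90 correct out of 100 predicted validation matches:
# wilson_interval(90, 100) -> approximately (0.826, 0.945)
```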

The four experiments demonstrate different ER behavior. The synthetic experiment has a narrow range of model parameters with perfect precision and recall, and performance degrades dramatically outside this range. The Restaurant experiment has a more gradual tradeoff between precision and recall, though there is significant uncertainty in the lower bound estimate due to the limited number of validation samples. Precision in Abt-Buy degrades quickly, though recall degrades much more gradually. Our bounds correctly capture the need to improve the underlying ER systems for the Abt-Buy and Escort datasets. Without this lower bound, the poor performance on larger datasets would not be evident from smaller tests.

6 Conclusions

Performance optimization of scalable entity resolution systems is challenging because unlike other machine learning tasks, there is not a clear understanding of how behavior will change on larger datasets. In this paper, we developed a simple – yet effective – method for optimizing lower bound performance using a small set of labeled pairs.

Further, we showed the optimal lower bound strategy for any match function is the connected components problem from graph theory, a relatively conservative clustering approach compared to many ER systems. We understand that this does not necessarily guarantee better performance, but it does provide a better lower-bound guarantee. For instance, in our original example in Table 1, $r_3$ would have matched neither $r_1$ nor $r_2$. However, when labeled datasets of comparable size to the deployed system are not available, this is now a theoretically well motivated approach.

Our bounds specifically addressed performance of pairwise entity resolution algorithms satisfying the ICAR properties [4]. Pairwise algorithms are intuitive, easy to implement, and perform an optimal number of pairwise record comparisons. However, they are also only a subset of entity resolution approaches [1, 5, 6, 7, 8, 9]. Further, we only considered pairwise precision, recall, and $F_1$ due to their popularity, intuitive interpretation, and mathematical convenience, though other existing metrics have been shown to produce conflicting rankings [2].

Estimating the lower bounds relies on accurate estimation of several other quantities, including recall and precision on the validation set and the class prior in the test set. Especially as datasets scale to much larger sizes, our bounds depend on the quality of these estimates. As evident in Theorem 1 and in our experiments, uncertainty increases as the gap between validation and test set sizes widens.

References

  • [1] Lise Getoor and Ashwin Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012.
  • [2] David Menestrina, Steven Euijong Whang, and Hector Garcia-Molina. Evaluating entity resolution results. Proceedings of the VLDB Endowment, 3(1-2):208–219, 2010.
  • [3] Georgios Papadakis. Blocking Techniques for efficient Entity Resolution over large, highly heterogeneous Information Spaces. PhD thesis, Leibniz Universität Hannover, 2013.
  • [4] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal – The International Journal on Very Large Data Bases, 18(1):255–276, 2009.
  • [5] Mikhail Bilenko, Sugato Basu, and Mehran Sahami. Adaptive product normalization: Using online learning for record linkage in comparison shopping. In Fifth IEEE International Conference on Data Mining. IEEE, 2005.
  • [6] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008.
  • [7] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.
  • [8] Parag Singla and Pedro Domingos. Entity resolution with markov logic. In Sixth IEEE International Conference on Data Mining, pages 572–582. IEEE, 2006.
  • [9] Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012.
  • [10] Marthinus Christoffel Du Plessis and Masashi Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks, 50:110–119, 2014.
  • [11] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
  • [12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR), 2013.
  • [13] Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
  • [14] Sheila Tejada, Craig A Knoblock, and Steven Minton. Learning object identification rules for information integration. Information Systems, 26(8):607–633, 2001.
  • [15] Hanna Köpcke and Erhard Rahm. Training selection for tuning entity matching. In QDB/MUD, pages 3–12, 2008.
  • [16] Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197–210, 2010.
  • [17] Larry Greenemeier. Human Traffickers Caught on Hidden Internet. Scientific American, 2015.
  • [18] Edwin B Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927.