Entity resolution (ER, also known as record linkage or deduplication) seeks to identify which records in a data set refer to the same underlying real-world entity [Fellegi and Sunter1969, Elmagarmid, Ipeirotis, and Verykios2007, Getoor and Machanavajjhala2012, Larsen and Rubin2001, Christen2012]. Our ability to represent information about real-world entities in very diverse ways makes this a complicated problem. For example, collecting profiles of people and businesses, or specifications of products and services, from websites and social media sites can result in billions of records that need to be resolved. These entities are identified in a wide variety of ways, complicated further by language ambiguity, poor data entry, missing values, changing attributes and formatting issues. ER is a fundamental task in data processing with a wide array of applications. There is a huge literature on ER techniques; many include machine learning algorithms, such as decision trees, SVMs, ensembles of classifiers, conditional random fields, and unsupervised learning (see [Getoor and Machanavajjhala2012] for a recent survey). Yet ER remains a demanding task on which automated strategies often yield low accuracy.
ER can be cast as a clustering problem. Consider a set of elements $V$ that must be clustered into $k$ disjoint parts $V_i$, $i = 1, \dots, k$. The true underlying clusters $V_i$ are unknown to us, and so is $k$. Each of these $V_i$s represents an entity. Each element $v \in V$ has a set of attributes. A similarity function is used to estimate the similarity of the attribute sets of two nodes $u$ and $v$. If $u$ and $v$ represent the same entity, then an ideal similarity function will return $1$, and if they are different, then it will return $0$. However, in practice, it is impossible to find an ideal similarity function, or even a function close to it. Often, some attribute values may be missing or incorrect, and that leads to similarity values that are a noisy representation of the ideal similarity function. Any automated process that uses such a similarity function is thus prone to make errors. To overcome this difficulty, a relatively recent line of work proposes to use human knowledge via crowdsourcing to boost the accuracy of ER [Davidson et al.2014, Firmani, Saha, and Srivastava2016, Verroios and Garcia-Molina2015, Gruenheid et al.2015, Wang et al.2012, Wang et al.2013, Vesdapunt, Bellare, and Dalvi2014, Yi et al.2012, Whang, Lofgren, and Garcia-Molina2013]. Humans, based on domain knowledge, can match and distinguish entities with complex representations, where automated strategies fail.
Motivating example. Consider the following illustrative example shown in Figure 1. The Walt Disney Company, commonly known as Disney, is an American multinational media and entertainment company that owns and licenses theme parks around the world (see https://en.wikipedia.org/wiki/The_Walt_Disney_Company). Given the five places (1) Disney World, (2) Walt Disney World Resort, (3) Walt Disney Theme Park, Orlando, (4) Disneyland, (5) Disneyland Park, humans can determine using domain knowledge that these correspond to two entities: places 1, 2 and 3 refer to one entity, and places 4 and 5 refer to a second entity.
Answering queries by crowd could be time-consuming and costly. Therefore, a crowd-based ER strategy must attempt to minimize the number of queries to the oracle while resolving the clusters exactly. Having access to ideal crowd answers, a good ordering of comparing record pairs is (1, 2), (2, 3), (4, 5), (1, 4). After the first three pairs have been compared, we can safely infer as “matching” the remaining pair (1, 3), leveraging transitive relations. After the last pair in the ordering has been compared, we can safely infer as “non-matching” all the remaining pairs (1, 5), (2, 4), (2, 5), (3, 4), (3, 5) in the database.
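The transitive inference in this example can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the `resolve` helper, its bookkeeping, and the ground-truth `entity` table (which stands in for an ideal crowd) are assumptions introduced here.

```python
from itertools import combinations

# Hypothetical ground truth standing in for an ideal crowd oracle.
entity = {
    "Disney World": "A",
    "Walt Disney World Resort": "A",
    "Walt Disney Theme Park, Orlando": "A",
    "Disneyland": "B",
    "Disneyland Park": "B",
}

def resolve(records, ordering, oracle):
    """Ask the oracle about pairs in the given order, skipping every pair
    whose label already follows from transitivity."""
    cluster = {r: r for r in records}  # record -> label of its current cluster
    differ = set()                     # unordered pairs of clusters known to differ
    asked = []
    for a, b in ordering:
        ca, cb = cluster[a], cluster[b]
        if ca == cb or frozenset((ca, cb)) in differ:
            continue                   # label is inferred, no query needed
        asked.append((a, b))
        if oracle(a, b):
            # Merge cluster cb into ca and rewrite the stored non-match facts.
            for r in records:
                if cluster[r] == cb:
                    cluster[r] = ca
            differ = {frozenset(ca if c == cb else c for c in pair)
                      for pair in differ}
        else:
            differ.add(frozenset((ca, cb)))
    return asked, cluster

records = list(entity)
good_start = [
    ("Disney World", "Walt Disney World Resort"),
    ("Walt Disney World Resort", "Walt Disney Theme Park, Orlando"),
    ("Disneyland", "Disneyland Park"),
    ("Disney World", "Disneyland"),
]
ordering = good_start + [p for p in combinations(records, 2)
                         if p not in good_start]
asked, cluster = resolve(records, ordering,
                         lambda a, b: entity[a] == entity[b])
print(len(asked), "queries for", len(ordering), "pairs")
```

With this ordering only the first four pairs reach the oracle; the other six labels follow from transitivity.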
The work by Wang et al. [Wang et al.2013] was among the first few [Wang et al.2012, Demartini, Difallah, and Cudré-Mauroux2012, Whang, Lofgren, and Garcia-Molina2013] to propose the notion of a hybrid human-machine approach for entity resolution. Moreover, it is the first paper to leverage the transitive relationship among the entities to minimize the number of queries, which has since become a staple in every follow-up work on this topic [Firmani, Saha, and Srivastava2016, Verroios and Garcia-Molina2015, Gruenheid et al.2015, Vesdapunt, Bellare, and Dalvi2014]. Assuming there is an oracle, an abstraction of a crowd-sourcing platform that can correctly answer questions of the form “Do records $u$ and $v$ refer to the same entity?”, they presented a new algorithm for crowd-sourced ER. To minimize the number of queries to the crowd oracle, Wang et al. utilize the transitive relation, in which known match and non-match labels on some record pairs can be used to automatically infer match or non-match labels on other record pairs. In short, the heuristic algorithm by Wang et al. does the following: it orders the tuples (record pairs/edges) in nonincreasing order of similarity, and queries any edge according to that order whenever the right value of that edge cannot be transitively deduced from the already queried/inferred edges so far.
While the crowd-sourcing algorithm of Wang et al. works reasonably well on real datasets, theoretical guarantees for it were not provided. However, in [Vesdapunt, Bellare, and Dalvi2014], Vesdapunt et al. showed that in some instances this algorithm can only give an approximation, that is when an optimum algorithm may require queries, Wang et al.’s algorithm can require queries.
Vesdapunt et al. proposed an algorithm that proceeds in the following iterative manner. In each round, an element to be clustered is compared with one representative of each of the existing clusters. The order of these comparisons is defined by a descending order of the similarity measures. As soon as a positive query result is found, the element is assigned to the corresponding cluster and the algorithm moves to the next round with a new element. It is easy to see that in the worst case the number of queries made by the algorithm is $O(nk)$, where $n$ is the number of elements and $k$ is the number of clusters. It also follows that this is at least an $O(k)$ approximation.
Note that [Wang et al.2013, Vesdapunt, Bellare, and Dalvi2014] treat the answers to queries as correct, an ideal crowd abstraction that can often be guaranteed via majority voting. But it is unclear how the quality of the similarity measurements affects the total number of queries. Indeed, in typical datasets, the performances of the algorithms of Wang et al. and Vesdapunt et al. are quite similar, and they are much better than their worst case guarantees that do not take into account the existence of any meaningful similarity measures. This means the presence of the similarity measures helps reduce the query complexity significantly. Is there a way to theoretically establish that and come up with guarantees that match the experimental observations?
It is of paramount interest to characterize the query complexity (number of questions asked to the crowd) of these popular heuristics and come up with algorithms that minimize such complexity. The query complexity is directly proportional to the overall cost of a crowd-based algorithm, because crowd questions are time-consuming and often involve compensation. Designing a strategy that minimizes the query complexity can directly be seen as an alternative to the active learning problem with minimum labeling requirements [Sarawagi and Bhamidipaty2002, Bellare et al.2012]. From the perspective of lower bounding the query complexity, ER can be seen as a reinforcement learning problem. Indeed, in each step of assigning a record to one of the underlying entities, a query must be made wisely so that under any adversarial configuration, the total number of queries remains small.
Contributions. In this paper we assume the following model for the similarity measurements. Let $W = (w_{u,v})$ denote the matrix obtained by pair-wise similarity computation, where $w_{u,v}$ is drawn from a probability distribution $f_g$ if $u$ and $v$ belong to the same cluster and drawn from a probability distribution $f_r$ otherwise. The subscripts of $f_r$ and $f_g$ are chosen to respectively signify a “red edge” (or absence of a link) and a “green edge” (or presence of a link). Note that this model of the similarity matrix is by no means the only possible one; however, it captures the essential flavor of the problem.
Our main contribution in this paper is to provide a theoretical analysis of query complexities of the two aforementioned heuristics from [Wang et al.2013, Vesdapunt, Bellare, and Dalvi2014]. Our analysis quantifies the effect of the presence of similarity measures in these algorithms, establishes the superiority between these algorithms under different criteria, and derives the exact expression of query complexity under some fundamental probability models.
Next, to establish the near-optimality or sub-optimality of the above heuristics, we compare our results with an information theoretic lower bound recently proposed by us [Mazumdar and Saha2016]. As a corollary to the results of [Mazumdar and Saha2016], it can be seen that the information theoretic lower bound depends on the Hellinger divergence between $f_g$ and $f_r$. More interestingly, the quality of the similarity matrix can be characterized by the Hellinger divergence between $f_g$ and $f_r$ as well.
Finally, we show that the experimental observations of [Wang et al.2013, Vesdapunt, Bellare, and Dalvi2014] agree with our theoretical analysis of their algorithms. Moreover, we conduct a thorough experiment on the bibliographical cora [McCallum2004] dataset for ER and several synthetic datasets to validate the theoretical findings further.
2 System model and techniques
2.1 Crowdsourced Entity Resolution (Crowd-ER)
Consider a set of elements $V$ of size $n$, which is a disjoint union of $k$ clusters $V_i$, $i = 1, \dots, k$, where $k$ and the subsets $V_i$ are unknown. The crowd (or the oracle) is presented with an element pair $(u, v) \in V \times V$ for a query, which results in a binary answer denoting the event that $u$ and $v$ belong to the same cluster. Note that this perfect oracle model has been used in the prominent previous works by Wang et al. and Vesdapunt et al. [Wang et al.2013, Vesdapunt, Bellare, and Dalvi2014].
We can assume that, with some probability, the crowd gives a wrong answer to the $i$th query. However, by resampling the $i$th query multiple times, that is, by asking the same $i$th query to different users and taking the majority vote, we can drive this probability to nearly zero and return to the model of a perfect oracle. Note that we have assumed independence among the resampled queries, which can be justified since we are sampling a growing number of samples. Furthermore, repetition of the same query to the crowd may not lead to a reduction in the error probability, i.e., the error may be persistent. Even in this scenario an element can be queried with multiple elements from the same cluster to infer with certainty whether the element belongs to the cluster or not. These situations have been covered in detail in our recent work [Mazumdar and Saha2016]. Henceforth, in this paper, we only consider the perfect oracle model. All our results hold for the faulty oracle model described above with only an $O(\log n)$ blow-up in the query complexity.
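The effect of majority voting can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper: each repeated answer is assumed to be wrong independently with the same probability `p` below 1/2, and `majority_error` is a hypothetical helper name.

```python
from math import comb

def majority_error(p, m):
    """Probability that a majority vote over m independent answers is wrong,
    when each answer is wrong independently with probability p (m odd)."""
    if m % 2 == 0:
        raise ValueError("use an odd number of repetitions to avoid ties")
    # The vote is wrong when more than half of the m answers are wrong.
    return sum(comb(m, j) * p**j * (1 - p)**(m - j)
               for j in range(m // 2 + 1, m + 1))

for m in (1, 3, 11, 31):
    print(m, majority_error(0.3, m))
```

For a fixed `p` below 1/2 the printed error shrinks quickly as `m` grows, which is why a modest number of repetitions per query suffices to emulate a perfect oracle.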
Consider $W$, an $n \times n$ similarity matrix, with the $(u, v)$th entry $w_{u,v}$ a nonnegative random variable in $[0, 1]$ drawn from a probability density or mass function $f_g$ when $u$ and $v$ belong to the same cluster, and drawn from a probability density or mass function $f_r$ otherwise. $f_g$ and $f_r$ are unknown.
The problem of Crowd-ER is to design a set of queries in $V \times V$, given $V$ and $W$, such that from the answers to the queries, it is possible to recover $V_i$, $i = 1, \dots, k$.
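To make the similarity model concrete, the following sketch samples a symmetric similarity matrix from a known clustering. It is only an illustration: the two Beta distributions standing in for the within-cluster and across-cluster distributions, the diagonal convention, and the helper name `sample_similarity_matrix` are assumptions made here, not choices from the paper.

```python
import random

def sample_similarity_matrix(labels, draw_green, draw_red, seed=0):
    """Sample a symmetric similarity matrix with entries in [0, 1]:
    same-cluster pairs use draw_green, all other pairs use draw_red."""
    rng = random.Random(seed)
    n = len(labels)
    W = [[1.0] * n for _ in range(n)]  # diagonal fixed to 1 (a convention assumed here)
    for u in range(n):
        for v in range(u + 1, n):
            draw = draw_green if labels[u] == labels[v] else draw_red
            W[u][v] = W[v][u] = draw(rng)
    return W

# Hypothetical choice: "green" similarities skew high, "red" ones skew low.
labels = [0, 0, 0, 1, 1]
W = sample_similarity_matrix(
    labels,
    draw_green=lambda rng: rng.betavariate(4, 2),
    draw_red=lambda rng: rng.betavariate(2, 4),
)
for row in W:
    print(" ".join(f"{x:.2f}" for x in row))
```

An algorithm for Crowd-ER sees only `W` and the oracle's answers, never `labels`.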
2.2 The two heuristic algorithms
The Edge ordering algorithm [Wang et al.2013]. In this algorithm, we arrange the set $V \times V$ of pairs in non-increasing order of the similarity values $w_{u,v}$. We then query sequentially according to this order. Whenever possible we apply the transitive relation to infer edges. For example, if the queries $(u, v)$ and $(v, z)$ both get positive answers then there must be an edge $(u, z)$, and we do not have to make the query $(u, z)$. We stop when all the edges are either queried or inferred.
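A minimal Python sketch of this edge ordering heuristic follows. It is an illustration written for this description, not the implementation of [Wang et al.2013]; the function name `edge_ordering`, the union-find bookkeeping, and the toy similarity matrix are assumptions.

```python
from itertools import combinations

def edge_ordering(W, oracle):
    """Query pairs in non-increasing order of similarity, skipping every pair
    whose answer follows by transitivity. Returns (#queries, cluster labels)."""
    n = len(W)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    differ = {i: set() for i in range(n)}  # root -> roots known to be other entities
    pairs = sorted(combinations(range(n), 2), key=lambda e: -W[e[0]][e[1]])
    queries = 0
    for u, v in pairs:
        ru, rv = find(u), find(v)
        if ru == rv or rv in differ[ru]:
            continue                       # inferred match or non-match
        queries += 1
        if oracle(u, v):
            parent[rv] = ru                # merge the two clusters
            for r in differ.pop(rv):       # carry over known non-matches
                differ[r].discard(rv)
                differ[r].add(ru)
                differ[ru].add(r)
        else:
            differ[ru].add(rv)
            differ[rv].add(ru)
    return queries, [find(x) for x in range(n)]

# Toy instance with an idealized similarity matrix (hypothetical values).
truth = [0, 0, 0, 1, 1]
W = [[0.9 if truth[u] == truth[v] else 0.1 for v in range(5)] for u in range(5)]
queries, found = edge_ordering(W, lambda u, v: truth[u] == truth[v])
print(queries, found)
```

On this toy instance the similarities are perfectly informative, so three within-cluster queries and one across-cluster query settle all ten pairs.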
The Node ordering algorithm [Vesdapunt, Bellare, and Dalvi2014]. In this algorithm, the empirical expected size of the cluster containing element $u$ is first computed as $\sum_{v} w_{u,v}$. Then all the elements are ordered non-increasingly according to the empirical expected sizes of the clusters containing them. At any point in the execution, the algorithm maintains at most $k$ clusters. The algorithm selects the next element and issues queries involving that element and elements which are already clustered, in non-increasing order of their similarity, and applies transitivity for inference. Therefore, the algorithm issues at most one query involving the current node and an existing cluster. Trivially, this gives an $O(k)$-approximation.
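The node ordering heuristic can be sketched in the same style. Again this is an illustration rather than the code of [Vesdapunt, Bellare, and Dalvi2014]: the function name `node_ordering`, the use of the row sum of similarities as the empirical expected cluster size, and the toy matrix are assumptions made for the example.

```python
def node_ordering(W, oracle):
    """Insert nodes one at a time; compare each new node with already
    clustered nodes in non-increasing order of similarity, asking at most
    one query per existing cluster. Returns (#queries, clusters)."""
    n = len(W)
    # Assumed estimate of the expected cluster size: sum of similarities.
    order = sorted(range(n),
                   key=lambda u: -sum(W[u][v] for v in range(n) if v != u))
    cluster_of = {}  # node -> index of its cluster
    clusters = []    # list of clusters, each a list of nodes
    queries = 0
    for u in order:
        ruled_out = set()
        placed = False
        for v in sorted(cluster_of, key=lambda x: -W[u][x]):
            c = cluster_of[v]
            if c in ruled_out:
                continue         # non-match with this cluster already known
            queries += 1
            if oracle(u, v):
                clusters[c].append(u)
                cluster_of[u] = c
                placed = True
                break            # every other pair follows by transitivity
            ruled_out.add(c)
        if not placed:
            cluster_of[u] = len(clusters)
            clusters.append([u])
    return queries, clusters

# Toy instance with an idealized similarity matrix (hypothetical values).
truth = [0, 0, 0, 1, 1]
W = [[0.9 if truth[u] == truth[v] else 0.1 for v in range(5)] for u in range(5)]
queries, clusters = node_ordering(W, lambda u, v: truth[u] == truth[v])
print(queries, clusters)
```

Because a node is compared with at most one member of each existing cluster, the query count never exceeds the number of nodes times the number of clusters.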
3 Information theoretic lower bound
Note that, in the absence of the similarity matrix $W$, any optimal (possibly randomized) algorithm must make $\Omega(nk)$ queries to solve Crowd-ER. This is true because an input can always be generated that forces $\Omega(n)$ vertices to be involved in $\Omega(k)$ queries each before they can be correctly assigned. However, when we are allowed to use the similarity matrix, this bound can be significantly reduced. Indeed, the following lower bound follows as a corollary of the results of our previous work [Mazumdar and Saha2016].
Given the number of clusters $k$, any randomized algorithm that does not perform at least queries will be unable to return the correct clustering with high probability, where is the squared Hellinger divergence between the probability measures $f_g$ and $f_r$.
The main idea of proving this lower bound already appears in our recent work [Mazumdar and Saha2016], and we give a brief sketch of the proof below for the interested readers. Strikingly, the Hellinger divergence between $f_g$ and $f_r$ appears to be the right distinguishing measure even for analyzing the heuristic algorithms.
To show the lower bound we consider an input where one of the clusters is fully formed and given to us. The remaining clusters each have size . We prove the result through contradiction. Assume there exists a randomized algorithm ALG that makes a total of queries and assigns all the remaining vertices to correct clusters with high probability. However, that implies that the average number of queries ALG makes to assign each of the remaining elements to a cluster must be .
Since there are clusters, this actually guarantees the existence of an element that is not queried with the correct cluster it is from, and that completely relies on the matrix for the correct assignment. Now the probability distribution (which is a product measure) of , , can be one of two different distributions, and depending on whether this vertex belongs to or not. Therefore these two distributions must be far apart in terms of total variation distance for correct assignment.
However, the total variation distance between and . But as both are product measures that can differ in at most random variables (recall the clusters are all of size ), we must have, using the properties of the Hellinger divergence, . This means, , i.e., the two distributions are close enough to be confused with a positive probability, which leads to a contradiction. Note that, instead of recovery with positive probability, if we want to ensure exact recovery of the clusters (i.e., with probability 1) we must query each element at least once. This leads to the following corollary.
Any (possibly randomized) algorithm with the knowledge of $f_g$, $f_r$ and the number of clusters $k$ must perform at least queries, , to return the correct clustering exactly.
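For intuition about the divergence that drives this lower bound, the sketch below computes the squared Hellinger divergence between two discrete distributions and illustrates how it behaves for product measures. It uses the common normalization $H^2(p, q) = 1 - \sum_x \sqrt{p(x) q(x)}$, which may differ from the paper's convention by a constant factor; the helper names and the example distributions are assumptions.

```python
from math import sqrt

def squared_hellinger(p, q):
    """H^2(p, q) = 1 - sum_x sqrt(p(x) * q(x)) for discrete distributions
    given as equal-length lists of probabilities."""
    return 1.0 - sum(sqrt(a * b) for a, b in zip(p, q))

def squared_hellinger_product(h2, m):
    """H^2 between two m-fold product measures whose coordinates each have
    squared Hellinger divergence h2 (standard tensorization identity)."""
    return 1.0 - (1.0 - h2) ** m

# Two hypothetical similarity distributions on {low, high}, 0.1 apart in
# total variation.
p = [0.5, 0.5]
q = [0.6, 0.4]
h2 = squared_hellinger(p, q)
print(h2)
for m in (1, 10, 100, 1000):
    print(m, squared_hellinger_product(h2, m))
```

When the per-entry divergence is small, the divergence between the product measures stays small until the number of differing coordinates is on the order of the inverse of the per-entry divergence, which matches the role the cluster size plays in the proof sketch.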
4 Main results: Analysis of the heuristics
We provide expressions for query complexities for both the edge ordering and the node ordering algorithms. It turns out that the following quantity plays a crucial role in the analysis of both:
Theorem 2 (The Edge ordering).
The query complexity for Crowd-ER with the edge ordering algorithm is at most,
The proof of this theorem is provided in Section 5.
Theorem 3 (The Node ordering).
The query complexity for Crowd-ER with the node ordering algorithm is at most,
The proof of this theorem is provided in Section 6.
4.1 Illustration: $\epsilon$-biased Uniform Noise Model
We consider two distributions for $f_r$ and $f_g$ which are only $\epsilon$ far in terms of total variation distance from the uniform distribution. However, if we consider Hellinger distance, then Dist-1 is closer to the uniform distribution than Dist-2. These two distributions will be used as representative distributions to illustrate the potentials of the edge ordering and node ordering algorithms. In both cases, substituting $\epsilon$ with $0$, we get the uniform distribution, which contains no information regarding the similarities of the entries.
Dist-1. Consider the following probability density functions for $f_r$ and $f_g$, where , and
Note that . Similarly, , that is they represent valid probability density functions. We have,
Dist-2. Now consider the following probability density functions for $f_r$ and $f_g$ with .
Again, . Similarly, , that is they represent valid probability density functions. We have, .
We have the following results for these two distributions.
Proposition 1 (Lower bound).
Any (possibly randomized) algorithm for Crowd-ER, must make queries for Dist-1 and queries for Dist-2, to recover the clusters exactly (with probability 1).
The following set of results are corollaries of Theorem 2.
Proposition 2 (Uniform noise (no similarity information)).
Under the uniform noise model where $f_g = f_r$, the edge ordering algorithm has query complexity for Crowd-ER.
Proof. Since $f_g = f_r$, the similarity matrix amounts to no information at all. We know that in this situation, one must make $\Omega(nk)$ queries for the correct solution of Crowd-ER.
In this situation, a straight-forward calculation shows that, This means, ignoring the first term, from Theorem 2, the edge ordering algorithm makes at most number of queries. By bounding the harmonic series and using the concavity of log, we have the number of queries made by the edge ordering algorithm is at most where we have substituted . ∎
Proposition 3 (Dist-1).
When Dist-1, the edge ordering algorithm has query complexity for Crowd-ER.
Proof. The proof is identical to the above. For small $\epsilon$, we have (see Section 8). The algorithm queries at most edges. ∎
Proposition 4 (Dist-2).
When Dist-2, the edge ordering algorithm has query complexity for Crowd-ER.
For the Node-ordering algorithm, we have the following result as a corollary of Theorem 3.
Proposition 5 (Node-Ordering).
When Dist-1, the node ordering algorithm has query complexity for Crowd-ER. When Dist-2, node ordering has query complexity for Crowd-ER.
For Dist-1, . Therefore, when , . Thus, the total number of queries is . For Dist-2, . Therefore, when , . Thus the total number of expected queries is . ∎
Note that, there is no difference in the upper bounds given between the Edge and Node ordering algorithms for Dist-2. But Edge-ordering uses order factor more queries than the optimal () for Dist-1. Dist-1 is closer to uniform distribution by the Hellinger measure than Dist-2, which shows that Hellinger distance is the right choice for distance here. Assuming , we get a drastic reduction in query complexity by moving from Dist-1 to Dist-2.
5 Analysis of the Edge ordering algorithm: proof of Theorem 2
Let be a random variable with distribution and be identical random variables with distribution . Let be all independent. Note that,
In the interest of clarity, let us call a pair a green edge iff for some , and otherwise call the pair a red edge.
In the current graph, let there exist nodes, called , which all belong to the same cluster but no edge from the induced graph on these vertices has been queried yet. Then there are green edges within , yet to be queried. On the other hand, there are at most red edges with one end point incident on the vertices in . We now count the number of red edges incident on that the algorithm will query before querying a green edge within . We can account for all the red edges queried by the algorithm by considering one cluster at a time, and summing over the queried red edges incident on it. In fact, by doing this, we double count every red edge. Since the probability of querying a red edge incident on before querying any of the green edges incident on is , the expected number of queried red edges incident on before querying a green edge in is at most
Let be a positive integer. Consider a cluster . Suppose at some point of time, there are components of remaining to be connected. Then, again there are at least green edges, querying any of which will decrease the number of components by . Thus, the expected number of red edges that are queried incident on nodes in before there remain at most components of is at most . Therefore, the expected number of red edges that are queried until only components are left for every cluster is
Now the number of red edges across the clusters having size at most is at most Therefore, even if we query all those edges, we get the total number of queried red edges to be at most
The algorithm queries a total of green edges, exactly spanning every cluster. Thus the total number of queries is at most
6 Analysis of the Node ordering algorithm: proof of Theorem 3
The computed expected cluster size for each node can be a highly biased estimator, and may not provide any useful information. For example, the expected cluster size of a node in is where for Dist-1 and for Dist-2. Therefore, the node ordering considered by [Vesdapunt, Bellare, and Dalvi2014] can be arbitrary. Hence, for the purpose of our analysis, we ignore this ordering based on the expected size.
Consider the state of the algorithm where it needs to insert a node which truly belongs to cluster . Suppose the current size of is , that is already contains nodes when is considered. Consider another cluster , , and let its current size be . Let and denote the current subclusters of and that have been formed. Then, where is at most . Hence, . Thus the expected number of queried red edges before is correctly inserted in is at most . Hence the expected total number of queried red edges to grow the th cluster is at most , and thus the expected total number of queries, including green and red edges, is bounded by .
7 Experimental Observations
A detailed comparison of the node ordering and edge ordering methods on multiple real datasets has been shown in [Vesdapunt, Bellare, and Dalvi2014, Figures 12,14]. The numbers of queries issued by the two methods are very close on complete resolution. To validate further, we did the following experiments.
Datasets. (i) We created multiple synthetic datasets each containing nodes and clusters with the following size distribution: two clusters of size , four clusters of size , eight clusters of size , two clusters each of size and and the rest of the clusters of size . The datasets differed in the way similarity values are generated by varying and sampling the values either from Dist-1 or Dist-2. The similarity values are further discretized to take values from the set .
(ii) We used the widely used cora [McCallum2004] dataset for ER. cora is a bibliography dataset, where each record contains title, author, venue, date, and pages attributes. There are nodes in total with clusters, among which are non-singletons. The largest cluster size is , and the total number of pairs is . We used the similarity function as in [Whang, Lofgren, and Garcia-Molina2013, Wang et al.2013, Vesdapunt, Bellare, and Dalvi2014, Firmani, Saha, and Srivastava2016].
The number of queries for the node-ordering and edge-ordering algorithms are reported in Table 1 for the synthetic datasets. Clearly, the number of queries asked for Dist-2 is significantly less than that for Dist-1 at the same value of . This agrees with our theoretical findings. Interestingly, we observe that the number of queries asked by the edge-ordering algorithm is consistently higher than that of the node-ordering algorithm under Dist-1. This is also expected from Propositions 3 and 5 due to a gap of in the number of queries of the two algorithms. In a similar vein, we see the edge-ordering algorithm is more effective than the node-ordering algorithm for Dist-2, possibly because of hidden constants in the asymptotic analysis.
Figure 2(a) shows the similarity value distribution for cora which is closer to Dist-2 than Dist-1. Figure 2(b) shows the recall vs number of queries issued by the two methods. The line marked with ‘+’ sign is the curve for the ideal algorithm that will ask only the required “green” edges first to grow all the clusters and then ask just one “red” edge across every pair of clusters. Upon completion, the number of queries issued by the edge ordering and node ordering methods are respectively 21,099 and 23,243 which are very close to optimal. Interestingly, this confirms with our observation on the However, they achieve above recall in less than queries. This can also be explained by our analysis. The remaining large number of queries are mainly spent on growing small clusters, e.g. when cluster sizes are –they do not give much benefit on recall, but consume many queries.
8 Appendix: for Dist-1, Dist-2
For Dist-1 and small , we have .
Set , then . We have
For Dist-2 we have .
Acknowledgements: This research is supported in part by NSF CCF Awards 1464310, 1642658, 1642550 and a Google Research Award. The authors would like to thank Sainyam Galhotra for his extensive help with the simulation results.
- [Bellare et al.2012] Bellare, K.; Iyengar, S.; Parameswaran, A. G.; and Rastogi, V. 2012. Active sampling for entity matching. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, 1131–1139.
- [Christen2012] Christen, P. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.
- [Davidson et al.2014] Davidson, S. B.; Khanna, S.; Milo, T.; and Roy, S. 2014. Top-k and clustering with noisy comparisons. ACM Trans. Database Syst. 39(4):35:1–35:39.
- [Demartini, Difallah, and Cudré-Mauroux2012] Demartini, G.; Difallah, D. E.; and Cudré-Mauroux, P. 2012. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, 469–478.
- [Elmagarmid, Ipeirotis, and Verykios2007] Elmagarmid, A. K.; Ipeirotis, P. G.; and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19(1):1–16.
- [Fellegi and Sunter1969] Fellegi, I. P., and Sunter, A. B. 1969. A theory for record linkage. Journal of the American Statistical Association 64(328):1183–1210.
- [Firmani, Saha, and Srivastava2016] Firmani, D.; Saha, B.; and Srivastava, D. 2016. Online entity resolution using an oracle. PVLDB 9(5):384–395.
- [Getoor and Machanavajjhala2012] Getoor, L., and Machanavajjhala, A. 2012. Entity resolution: theory, practice & open challenges. PVLDB 5(12):2018–2019.
- [Gruenheid et al.2015] Gruenheid, A.; Nushi, B.; Kraska, T.; Gatterbauer, W.; and Kossmann, D. 2015. Fault-tolerant entity resolution with the crowd. CoRR abs/1512.00537.
- [Larsen and Rubin2001] Larsen, M. D., and Rubin, D. B. 2001. Iterative automated record linkage using mixture models. Journal of the American Statistical Association 96(453):32–41.
- [Mazumdar and Saha2016] Mazumdar, A., and Saha, B. 2016. Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839.
- [McCallum2004] McCallum, A. 2004. https://people.cs.umass.edu/~mccallum/data/cora-refs.tar.gz.
- [Sarawagi and Bhamidipaty2002] Sarawagi, S., and Bhamidipaty, A. 2002. Interactive deduplication using active learning. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, 269–278.
- [Verroios and Garcia-Molina2015] Verroios, V., and Garcia-Molina, H. 2015. Entity resolution with crowd errors. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, 219–230.
- [Vesdapunt, Bellare, and Dalvi2014] Vesdapunt, N.; Bellare, K.; and Dalvi, N. 2014. Crowdsourcing algorithms for entity resolution. PVLDB 7(12):1071–1082.
- [Wang et al.2012] Wang, J.; Kraska, T.; Franklin, M. J.; and Feng, J. 2012. Crowder: Crowdsourcing entity resolution. PVLDB 5(11):1483–1494.
- [Wang et al.2013] Wang, J.; Li, G.; Kraska, T.; Franklin, M. J.; and Feng, J. 2013. Leveraging transitive relations for crowdsourced joins. In SIGMOD Conference, 229–240.
- [Whang, Lofgren, and Garcia-Molina2013] Whang, S. E.; Lofgren, P.; and Garcia-Molina, H. 2013. Question selection for crowd entity resolution. PVLDB 6(6):349–360.
- [Yi et al.2012] Yi, J.; Jin, R.; Jain, A. K.; Jain, S.; and Yang, T. 2012. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In NIPS 2012, 1781–1789.