Principled Graph Matching Algorithms for Integrating Multiple Data Sources

02/03/2014
by   Duo Zhang, et al.

This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. Entity resolution (ER), the data integration problem of performing noisy joins on structured data, typically proceeds by first hashing each record into zero or more blocks, scoring pairs of co-blocked records for similarity, and then matching pairs of sufficient similarity. In the most common case of matching two sources, it is often desirable for the final matching to be one-to-one (a record may be matched with at most one other); members of the database and statistical record linkage communities accomplish such matchings in the final stage by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores, that of being nearly deduplicated, and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our algorithmic contributions are two-fold, and both approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world ER problem from Bing that is significantly larger than those typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one is much more robust under noise, while the other is simple to implement and very fast to run.
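To make the one-to-one constraint across several sources concrete, here is a minimal Python sketch of a greedy matcher. It is an illustration written under stated assumptions, not the algorithm from the paper: the abstract does not specify the greedy procedure, so the function name `greedy_multipartite_match`, the `(source, id)` record encoding, and the `threshold` parameter are all hypothetical. The sketch takes already-scored record pairs, visits them in order of decreasing similarity, and merges two clusters only if the merged cluster would still contain at most one record per source.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

# Assumed encoding: a record is (source name, record id).
Record = Tuple[str, Hashable]


def greedy_multipartite_match(
    scored_pairs: Iterable[Tuple[Record, Record, float]],
    threshold: float = 0.0,
) -> List[Set[Record]]:
    """Greedily cluster records so each cluster holds at most one record per source."""
    parent: Dict[Record, Record] = {}      # union-find parent pointers
    sources: Dict[Record, Set[str]] = {}   # cluster root -> sources in that cluster

    def find(r: Record) -> Record:
        # Register unseen records as singleton clusters, then walk to the root.
        parent.setdefault(r, r)
        sources.setdefault(r, {r[0]})
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    # Visit candidate pairs from most to least similar.
    for a, b, weight in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if weight < threshold:
            break
        root_a, root_b = find(a), find(b)
        # Skip pairs already clustered together, and merges that would put
        # two records from the same source into one cluster.
        if root_a == root_b or sources[root_a] & sources[root_b]:
            continue
        parent[root_b] = root_a
        sources[root_a] |= sources[root_b]

    clusters: Dict[Record, Set[Record]] = {}
    for r in list(parent):
        clusters.setdefault(find(r), set()).add(r)
    return [c for c in clusters.values() if len(c) > 1]


# Toy input: three sources A, B, C with made-up similarity scores.
a1, a2 = ("A", 1), ("A", 2)
b1, b2 = ("B", 1), ("B", 2)
c1 = ("C", 1)
pairs = [
    (a1, b1, 0.9),
    (b1, c1, 0.8),
    (a2, c1, 0.7),  # rejected: c1's cluster already has a record from A
    (a2, b2, 0.6),
    (a1, b2, 0.5),  # rejected: both clusters contain records from A and B
]
clusters = greedy_multipartite_match(pairs, threshold=0.4)
```

In this toy run the pair `(a2, c1)` is rejected even though its score clears the threshold, because `c1` has already been linked through `b1` to a record from source A. This is the sense in which global one-to-one constraints couple matching decisions that would otherwise be made independently for each pair of sources.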


