Em-K Indexing for Approximate Query Matching in Large-scale ER

by Samudra Herath, et al.

Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects that require integrating and processing massive data collections. Real-world applications increasingly need ER solutions that respond promptly to entity queries on large-scale databases, and some of them must match a query entity against a large-scale reference database within a short time. In this work, we define this as the query matching problem in ER. Indexing or blocking techniques reduce the search space and execution time of the ER process. However, approximate indexing techniques that scale to very large datasets remain an open research problem. In this paper, we investigate the query matching problem in ER and propose an indexing method suitable for approximate and efficient query matching. We first use spatial mappings to embed records in a multidimensional Euclidean space that preserves the domain-specific similarity. Among the various mapping techniques, we choose multidimensional scaling. Then, using a Kd-tree and nearest neighbour search, the method returns a block of records that includes potential matches for a query. Our method can process queries against a large-scale dataset using only a fraction of the data, L records out of a dataset of size N, with O(L^2) complexity, where L ≪ N. Experiments conducted on several datasets show the effectiveness of the proposed method.
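The pipeline described above (embed records with multidimensional scaling, index the embedded points in a Kd-tree, and return the nearest neighbours of a query as its block) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the edit distance as the domain-specific similarity, classical MDS on a small landmark subset with an out-of-sample mapping in the style of landmark MDS, the toy name list, and the hypothetical helper names `fit_landmarks`, `embed`, and `query_block`.

```python
import numpy as np
from scipy.spatial import cKDTree


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def fit_landmarks(landmarks, dim):
    """Classical MDS on L landmarks; needs O(L^2) distance evaluations."""
    n = len(landmarks)
    sq = np.array([[edit_distance(a, b) ** 2 for b in landmarks]
                   for a in landmarks], dtype=float)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:dim]      # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-9                        # drop non-positive directions
    projector = vecs[:, keep] / np.sqrt(vals[keep])
    return projector, sq.mean(axis=0)


def embed(record, landmarks, projector, mean_sq):
    """Out-of-sample mapping using only distances to the landmarks."""
    delta = np.array([edit_distance(record, lm) ** 2 for lm in landmarks],
                     dtype=float)
    return -0.5 * projector.T @ (delta - mean_sq)


# Toy reference database (illustrative data, not from the paper).
reference = ["catherine", "katherine", "kathryn", "jonathan", "johnathan",
             "samantha", "samudra", "michael", "michelle", "mitchell"]
landmarks = reference[::2]                    # assumed landmark choice, L = 5
projector, mean_sq = fit_landmarks(landmarks, dim=3)
points = np.array([embed(r, landmarks, projector, mean_sq) for r in reference])
tree = cKDTree(points)                        # Kd-tree over the embedded records


def query_block(query, k=3):
    """Return the k reference records nearest to the embedded query."""
    _, idx = tree.query(embed(query, landmarks, projector, mean_sq), k=k)
    return [reference[i] for i in np.atleast_1d(idx)]


print(query_block("kathrine"))
```

In this sketch, only the records in the returned block would then be compared with the query using the full similarity function, which is how a block limits the search space. The block size `k` and the embedding dimension are tuning choices assumed here, not values from the paper.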


