Fast Record Linkage for Company Entities

07/19/2019
by Thomas Gschwind, et al.

Record linkage is an essential part of almost all real-world systems that consume data from different sources, both structured and unstructured. Typically, no common key is available to connect the records, and extensive data cleaning and data integration often have to be completed before any analytics or further processing can be performed. Though record linkage is often seen as a somewhat tedious but necessary step, it can reveal valuable insights into the data at hand. These insights guide further analytic approaches over the data and support data visualization. In this work we focus on company entity matching, where company name, location and industry are taken into account. The matching is done on the fly to accommodate real-time processing of streamed data. Our contribution is a system that uses rule-based matching algorithms for scoring operations, which we extend with a machine learning approach to account for short company names. We propose an end-to-end, highly scalable, enterprise-grade system. Linkage time is greatly reduced by efficient decomposition of the search space using MinHash. High linkage accuracy is achieved through a thorough scoring of the matching candidates. Based on two real-world ground-truth datasets, we show that our approach reaches a recall of 91%. These results are achieved while scaling linearly with the number of nodes used in the system.
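
To illustrate how MinHash can decompose the search space, the following is a minimal sketch of MinHash-based candidate blocking over company names. It is not the system described in the abstract: the character 3-gram shingling, the BLAKE2b-based hash family, the number of hashes and bands, and all function names (`shingles`, `minhash_signature`, `build_index`, `candidates`) are assumptions chosen for illustration. The sketch only selects candidates; scoring them by name, location and industry would be a separate step.

```python
import hashlib
from collections import defaultdict


def shingles(name, k=3):
    """Return the set of character k-grams of a lowercased, whitespace-normalized name."""
    s = " ".join(name.lower().split())
    if len(s) <= k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects one member of the hash family."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """For each seeded hash function, keep the minimum hash value over all tokens."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def band_keys(signature, bands=16):
    """Split the signature into bands; each (band number, band values) pair is a bucket key.

    Assumes len(signature) is a positive multiple of `bands`.
    """
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]


def build_index(names, num_hashes=32, bands=16):
    """Map every bucket key to the positions of the names that fall into that bucket."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        sig = minhash_signature(shingles(name), num_hashes)
        for key in band_keys(sig, bands):
            index[key].add(i)
    return index


def candidates(query, index, num_hashes=32, bands=16):
    """Return positions of indexed names that share at least one bucket with the query."""
    sig = minhash_signature(shingles(query), num_hashes)
    found = set()
    for key in band_keys(sig, bands):
        found |= index.get(key, set())
    return found


# Hypothetical reference names, used only for illustration.
names = ["IBM Research Zurich", "International Business Machines", "Acme Corp"]
index = build_index(names)
print(candidates("IBM Research Zurich", index))
```

Only the records returned by `candidates` need to be passed to the more expensive scoring step, which is where the reduction in linkage time would come from. Names with similar shingle sets share a bucket with high probability rather than with certainty, so the number of hashes and bands trades recall against candidate-set size.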

research
01/12/2022

CompanyName2Vec: Company Entity Matching Based on Job Ads

Entity Matching is an essential part of all real-world systems that take...
research
03/07/2023

Disambiguation of Company names via Deep Recurrent Networks

Name Entity Disambiguation is the Natural Language Processing task of id...
research
06/26/2018

Record Linkage to Match Customer Names: A Probabilistic Approach

Consider the following problem: given a database of records indexed by n...
research
09/29/2017

Entity Consolidation: The Golden Record Problem

Four key processes in data integration are: data preparation (i.e., extr...
research
02/03/2014

Principled Graph Matching Algorithms for Integrating Multiple Data Sources

This paper explores combinatorial optimization for problems of max-weigh...
research
09/05/2018

Merging datasets through deep learning

Merging datasets is a key operation for data analytics. A frequent requi...
research
10/02/2008

Enhanced Integrated Scoring for Cleaning Dirty Texts

An increasing number of approaches for ontology engineering from text ar...
