Supervised machine learning techniques for data matching based on similarity metrics

07/08/2020
by Pim Verschuuren, et al.

Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame. Clean and consistent data is therefore crucial. Data matching is the field that tries to identify instances in data that refer to the same real-world entity. In this study, machine learning techniques are combined with string similarity functions and applied to data matching. A dataset of invoices from a variety of businesses and organizations was preprocessed with a grouping scheme to reduce pair dimensionality, and a set of similarity functions was used to quantify the similarity between invoice pairs. The resulting invoice pair dataset was then used to train and validate a neural network and a boosted decision tree. Their performance was compared with a solution from FISCAL Technologies, which served as a benchmark for currently available deduplication solutions. Both the neural network and the boosted decision tree performed as well as or better than the benchmark.
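
The pipeline described above (group records to limit the number of pairs, then score each candidate pair with similarity functions) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the invoice fields, the grouping key (first three letters of the supplier name), and the use of Python's `difflib.SequenceMatcher` as the string similarity function. The abstract does not state which grouping scheme or similarity functions the study actually used.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical invoice records; the field names are illustrative assumptions.
invoices = [
    {"id": 1, "supplier": "Acme Ltd", "number": "INV-1001", "amount": 250.00},
    {"id": 2, "supplier": "ACME Limited", "number": "INV1001", "amount": 250.00},
    {"id": 3, "supplier": "Globex", "number": "77-204", "amount": 99.50},
    {"id": 4, "supplier": "Acme Ltd", "number": "INV-2040", "amount": 18.75},
]


def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for the study's metrics)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records, key):
    """Group records by a key and yield pairs only within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    for members in groups.values():
        yield from combinations(members, 2)


def pair_features(a, b):
    """Turn one invoice pair into a numeric feature vector."""
    return [
        string_similarity(a["supplier"], b["supplier"]),
        string_similarity(a["number"], b["number"]),
        1.0 if a["amount"] == b["amount"] else 0.0,
    ]


# Assumed grouping key: the first three letters of the supplier name.
pairs = list(candidate_pairs(invoices, key=lambda r: r["supplier"][:3].lower()))

# One feature vector per candidate pair, keyed by the two invoice ids.
features = {(a["id"], b["id"]): pair_features(a, b) for a, b in pairs}

for ids, vector in features.items():
    print(ids, [round(value, 2) for value in vector])
```

With four invoices there are six possible pairs; the assumed grouping key keeps only the three pairs that share a supplier prefix. Feature vectors of this kind, together with duplicate/non-duplicate labels, are the sort of input on which a classifier such as a neural network or a boosted decision tree could be trained.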

