Supervised machine learning techniques for data matching based on similarity metrics

Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame. Clean and consistent data is therefore crucial. Data matching is the field that tries to identify instances in data that refer to the same real-world entity. In this study, machine learning techniques are combined with string similarity functions and applied to data matching. A dataset of invoices from a variety of businesses and organizations was preprocessed with a grouping scheme to reduce the number of invoice pairs to compare, and a set of similarity functions was used to quantify the similarity between invoice pairs. The resulting invoice pair dataset was then used to train and validate a neural network and a boosted decision tree. The performance was compared with a solution from FISCAL Technologies as a benchmark against currently available deduplication solutions. Both the neural network and the boosted decision tree performed as well as or better than the benchmark.




1 Introduction

With the rise of the information age in the past decades, the correct handling of data has become a necessity throughout society. However, many systems and processes that integrate and store data are error-prone due to human factors, errors in digitisation and poor system integration. The accounts payable department of a company is particularly affected by these problems since it has to handle invoices coming from multiple sources and add them to a single system. The result can be the duplication of an invoice record caused by small and subtle differences in the fields that define the invoice record. This is a common issue, and the field that deals with these problems is known as data matching, deduplication, record linkage, entity resolution or field matching. Many of the available solutions are rule-based models that use domain-specific knowledge to identify duplicate records. In this study a framework is presented that combines domain-specific knowledge with similarity algorithms to generalise invoice comparison. This makes it possible to analyse the data with more sophisticated tools such as those provided by the machine learning community. These tools have proven to be very good at finding hidden patterns and correlations in data, which can be exploited to label or categorise the data into different groups. The framework aims to identify as many true duplicate invoice pairs as possible (maximisation of the "true positives") with the smallest contamination by false duplicates (minimisation of the "false positives").
The paper is structured as follows. Section 2 describes the data and its preprocessing. Section 3 contains a brief summary of the chosen supervised machine learning techniques and their specific architecture in this application. Section 4 describes the training and evaluation of the models. The conclusions and an outlook on possible future studies are given in Section 5.

2 Data and preprocessing

A set of invoice records is used as the dataset for the study described in this article. The dataset was provided by FISCAL Technologies and originates from various industries. Each invoice consists of several fields (e.g. invoice number, invoice date, supplier) that can have text and/or number based values. The dataset was preprocessed by converting fields to a common format (e.g. for dates or prices) and removing invoices that are missing mandatory fields. This procedure provides a cleaner training sample, which allows for better performance. Note that the removal of invoices is valid as long as the removed fraction of the whole dataset remains low, as was the case in this study.
After data cleaning, the pairing step is performed. Naive pairing compares each invoice with all remaining invoices, which for N invoices gives N(N-1)/2 unique invoice pairs. Given the large number of invoices that FISCAL Technologies receives per company, this is not computationally feasible. Furthermore, this naive pairing approach would be very inefficient, as many invoices would clearly not match. To solve this issue, a procedure called grouping is used to reduce the dimensionality of the matching process. Duplicate invoices are very likely to have at least one shared field value. By grouping invoices that have certain field values in common, it is possible to split the dataset into non-disjoint subsets of invoices. Each invoice is then only compared further with invoices from within the same subset. This reduces the number of pairings and significantly speeds up the process.
Comparing invoices is non-trivial due to the variety of field types that make up an invoice. Dates, plain numbers, currency amounts and text fields not only have different structures and meanings, but their errors may also originate from different sources. Hence, several dedicated similarity algorithms were defined to compare different fields in several ways.
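The grouping step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the field names, the example invoices and the choice of grouping on exact field values are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(invoices, grouping_fields):
    """Return index pairs of invoices that share a value in at least one grouping field."""
    pairs = set()
    for field in grouping_fields:
        groups = defaultdict(list)
        for idx, invoice in enumerate(invoices):
            value = invoice.get(field)
            if value is not None:
                groups[value].append(idx)
        # Only invoices inside the same group are paired with each other.
        for members in groups.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical invoices; field names are illustrative only.
invoices = [
    {"number": "INV-001", "supplier": "Acme", "amount": 100.0},
    {"number": "INV-001", "supplier": "Acme Ltd", "amount": 100.0},
    {"number": "INV-002", "supplier": "Acme", "amount": 250.0},
    {"number": "INV-900", "supplier": "Globex", "amount": 75.5},
]

pairs = candidate_pairs(invoices, ["number", "supplier", "amount"])
n = len(invoices)
print(sorted(pairs))      # candidate pairs that survive grouping
print(n * (n - 1) // 2)   # number of pairs under naive all-to-all pairing
```

Because an invoice can fall into several groups, the groups overlap; collecting the pairs in a set avoids comparing the same pair twice.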
All the similarity algorithms take two values as input and try to quantify the degree of similarity with their own unique approach. All of the outputs were normalised such that 1 indicates an exact match and 0 indicates complete dissimilarity. The considered algorithms are listed below:
Jaro and Jaro-Winkler
The Jaro similarity algorithm was developed by M. Jaro [jaro]. It was initially designed to compare short strings such as names. The algorithm was extended by W. Winkler et al. [winkler] so that strings that match at the beginning get higher similarity scores.
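A compact sketch of both measures is given here for illustration. The prefix scale of 0.1 and the maximum prefix length of 4 are the commonly quoted defaults and are assumptions; the settings used in the study are not stated.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity with a bonus for a shared prefix (assumed default parameters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```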


N-gram
The N-gram similarity algorithm was proposed by G. Kondrak [ngram]. This algorithm compares contiguous sequences of n characters, also known as n-grams. Only 2-, 3- and 4-grams were used in this work.
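To make the idea of comparing n-grams concrete, the sketch below computes a Dice coefficient over character n-gram counts. Note that this is a simpler stand-in and not Kondrak's measure, which is defined differently; the function names are illustrative.

```python
from collections import Counter


def ngrams(text, n):
    """Count the contiguous character sequences of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_dice(a, b, n=2):
    """Dice coefficient over n-gram counts: 1 for identical strings, 0 for no shared n-gram."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than n: fall back to an exact comparison.
        return 1.0 if a == b else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total


for n in (2, 3, 4):
    print(n, ngram_dice("INV-2020-001", "INV-2020-010", n))
```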

Smith-Waterman
The Smith-Waterman algorithm [Smith] was developed for aligning molecular sequences such as DNA. The algorithm tries to find matching character sequences with dynamic programming, an approach that breaks a problem down into simpler subproblems.
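The dynamic programme can be sketched as below. The scoring values (match 2, mismatch -1, gap -1), the linear gap penalty and the normalisation by the best score the shorter string could reach are assumptions chosen for illustration; the study does not state its settings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def smith_waterman_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Score scaled to [0, 1] by the maximum score of the shorter string (assumed normalisation)."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 1.0 if a == b else 0.0
    return smith_waterman(a, b, match, mismatch, gap) / (match * shortest)


print(smith_waterman("INV-1001", "INV1001"))
print(smith_waterman_similarity("INV-1001", "INV1001"))
```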

Levenshtein and Damerau-Levenshtein
The Levenshtein algorithm [levenshtein] tries to find the number of single-character operations (i.e. insertion, deletion or substitution) needed to change one string into the other. Damerau proposed an extension [Damerau] in which a transposition of two adjacent characters is also considered an edit operation.
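A minimal sketch of both distances follows. With `transpositions=True` it computes the restricted "optimal string alignment" form of the Damerau-Levenshtein distance, and the normalisation by the longer string length is an assumption made here to map the distance onto the 0 to 1 scale used in this section.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance; optionally count adjacent swaps as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Swap of two adjacent characters.
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def edit_similarity(a, b, transpositions=False):
    """1 for identical strings, 0 when every character must be edited (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b, transpositions) / longest


print(edit_distance("1234", "1243"))                        # two substitutions
print(edit_distance("1234", "1243", transpositions=True))   # one transposition
```

The two printed values differ because a swapped digit pair, a typical keying error in invoice numbers, counts as two edits under Levenshtein but as one under the extension.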

Longest Common Substring
The longest common substring algorithm was proposed by G. Benson et al. [lcs]. It tries to find the longest string that is a substring of both compared strings.
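A straightforward dynamic-programming sketch of this measure is shown below. It is not the algorithm of the cited work, and dividing by the length of the longer string is an assumed normalisation.

```python
def longest_common_substring(a, b):
    """Return the longest string that occurs contiguously in both a and b."""
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, start=1):
        current = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]


def lcs_similarity(a, b):
    """Length of the longest common substring relative to the longer input (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return len(longest_common_substring(a, b)) / longest


print(longest_common_substring("INV-2020-001", "2020-001A"))
print(lcs_similarity("INV-2020-001", "2020-001A"))
```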

Binary comparison
The binary comparison compares the two field values and gives a 1 if they are completely the same and a 0 if at least 1 character is different.

Monge-Elkan
The Monge-Elkan algorithm [MonElkan] can be applied to a string that is constructed from multiple shorter strings separated by spaces, e.g. a sentence. It takes all possible combinations of the shorter strings and applies one of the aforementioned similarity algorithms to each combination. The average of these similarity scores is then passed on as the final similarity score.
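The sketch below follows the commonly cited formulation of Monge-Elkan, in which each token of the first string is scored against every token of the second, the best score per token is kept, and those best scores are averaged. Whether the study used exactly this variant is not stated. The inner similarity here is `difflib.SequenceMatcher.ratio` from the Python standard library, chosen only to keep the example self-contained.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in inner similarity in [0, 1] (an assumption, not one of the study's algorithms)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(a, b, inner=ratio):
    """Average, over the tokens of a, of the best inner similarity against the tokens of b."""
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 1.0 if tokens_a == tokens_b else 0.0
    best_scores = [max(inner(ta, tb) for tb in tokens_b) for ta in tokens_a]
    return sum(best_scores) / len(best_scores)


def symmetric_monge_elkan(a, b, inner=ratio):
    """The measure depends on argument order, so average both directions."""
    return (monge_elkan(a, b, inner) + monge_elkan(b, a, inner)) / 2


print(monge_elkan("Acme Trading Ltd", "Ltd Acme Trading"))
print(symmetric_monge_elkan("Acme Trading Ltd", "Acme Tradng"))
```

Token order does not affect the score, which is useful for supplier names whose words are reordered between systems.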

The similarity algorithms for fields involving strings are well established, but algorithms for purely numerical fields (value, time and age) are still quite underdeveloped. The above set of algorithms was selected such that as many different field types as possible were covered. The result is a vector of similarity scores for each invoice pair that is used as input for the machine learning algorithms. All invoice pairs are then labelled as either duplicate or non-duplicate with the use of customer feedback. An overview of the data and preprocessing workflow is given in figure 1.


3 Machine Learning Architectures

The deduplication problem can be treated as a binary classification problem, and the labelled invoice pairs make it possible to use supervised learning techniques. In this setting, the following two architectures were chosen.

3.1 Boosted Decision Tree

Boosting [FreudBoosting] is a method that can be applied to many machine learning algorithms. It takes multiple weak learners, e.g. classical decision trees, and combines them into one strong learner. The weak learners are trained sequentially, each with a training sample that is weighted according to the accuracy of the previous weak learner. Boosted decision trees (BDTs) are well known for their accuracy and for being less prone to overfitting. In this application, an improved form of boosting known as gradient boosting [BoostGrad] was used. In this approach each learner is fitted not only to the reweighted training data but also to the residuals of the previous learners, i.e. the difference between the predicted and target values. This addition results in a more stable and faster convergence of the fit. A set of 200 weak learners was trained with a modified least-squares fitting criterion [BoostGrad], a learning rate of 0.1 and a maximum tree depth of 4. The number of weak learners, the learning rate and the maximum depth were determined simultaneously with a cross-validated hyperparameter grid search. The gradient boosted decision tree uses the Gradient Boosting Classifier implementation of the scikit-learn library [scikit-learn].
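A configuration along these lines could look as follows in scikit-learn. The synthetic data, the grid values other than the reported optimum, the 3-fold search and the AUC scoring are assumptions made for illustration; only the values 200, 0.1 and 4 come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the similarity-score vectors (20 features per invoice pair).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

# Hypothetical search grid that contains the values reported in the text.
param_grid = {
    "n_estimators": [50, 200],
    "learning_rate": [0.1, 0.3],
    "max_depth": [2, 4],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,               # assumed; the number of folds in the search is not stated
    scoring="roc_auc",  # assumed scoring metric
)
search.fit(X, y)
print(search.best_params_)

# Model with the hyperparameters reported in the text.
bdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=4, random_state=0)
bdt.fit(X, y)
duplicate_probability = bdt.predict_proba(X)[:, 1]
```

All three hyperparameters are searched jointly because they interact: a lower learning rate typically needs more weak learners, and deeper trees need fewer.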

3.2 Neural Network

A neural network (NN) is a machine learning algorithm that is inspired by the way the human brain works. It is based on a collection of connected nodes, analogous to neurons. Feature values are passed to the nodes of the input layer, which in turn pass them on to the nodes of the next hidden layer, and so on. The input dense layer consists of 20 nodes, matching the number of considered input features. The network has two hidden layers, each with 30 nodes. All of these layers use the ReLU activation function. The output layer consists of a single node that uses a sigmoid activation to produce a duplication probability. The training is performed with the Adam optimization algorithm [adam] and a binary cross-entropy loss function.
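The text does not say which library was used for the neural network. As a rough approximation, the same layer sizes can be expressed with scikit-learn's `MLPClassifier`, which for binary problems uses a logistic (sigmoid) output and the log-loss, i.e. binary cross-entropy. Reading the 20-node input dense layer as a first layer of 20 units is an interpretation, and the data below is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 20 similarity features per invoice pair.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

nn = MLPClassifier(
    hidden_layer_sizes=(20, 30, 30),  # dense layer of 20 nodes, then two layers of 30
    activation="relu",
    solver="adam",
    max_iter=500,                     # assumed training budget
    random_state=0,
)
nn.fit(X, y)

duplicate_probability = nn.predict_proba(X)[:, 1]
print([w.shape for w in nn.coefs_])   # weight matrices between consecutive layers
print(nn.out_activation_)             # output activation chosen by scikit-learn
```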

4 Results

In this section it is shown how the models are trained and validated on the constructed dataset. Validating the model means accurately estimating the performance of the model in future predictions which can be done in a variety of ways. For all of the methods it is important that the bias towards the training data is minimized, i.e. that overfitting is avoided, and that the model generalizes well to future unseen data.

Figure 1: Dataset creation flow chart

A vital prerequisite is that the dataset is a good representation of the distribution it is drawn from and that it is large enough for the complexity of the model. The data is drawn from a customer base spanning a large variety of industries, and the number of samples relative to the number of model parameters that need to be optimized is reasonable. However, one should note that in general it is hard to quantify this a priori. Additionally, a finite training dataset will always be an approximation to the whole data space it tries to model. A rigorous model training and validation strategy is therefore crucial.

4.1 5-fold cross-validation

The first scheme undersamples the non-duplicates until the number of samples is equal in the duplicate and non-duplicate categories. The data is then split into 5 equally sized, stratified subsets, of which 4 are used for training and 1 for validation. The model is trained and validated 5 times such that each subset is used once for validation.
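This scheme can be sketched with NumPy and scikit-learn's `StratifiedKFold`. The labels and features below are hypothetical, and random undersampling is an assumption; the text does not say how the non-duplicates to keep were selected.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def undersample(y, rng):
    """Return sorted indices with the larger class randomly reduced to the smaller class size."""
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([
        rng.choice(pos, n, replace=False),
        rng.choice(neg, n, replace=False),
    ])
    return np.sort(keep)


rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)   # hypothetical, heavily imbalanced labels
X = rng.random((len(y), 20))         # hypothetical similarity-score vectors

balanced = undersample(y, rng)
X_bal, y_bal = X[balanced], y[balanced]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(folds.split(X_bal, y_bal)):
    # Train on 4 subsets and validate on the remaining one.
    print(fold, len(train_idx), len(val_idx), y_bal[val_idx].mean())
```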

4.2 Client validation

The second scheme segments the dataset based on the client each invoice originated from. One client dataset is used in its unbalanced form for validation. The data corresponding to the remaining clients is again undersampled in the non-duplicate category and used as training data. The procedure is repeated until each client has been used once for validation. The predictions for all clients are then accumulated and evaluated as one testing dataset. The performance is also compared to the predictions of the solution from FISCAL Technologies to provide a benchmark against currently available deduplication frameworks. This scheme is set up to give a more application-realistic perspective on the model performance.
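The client-wise splitting can be sketched as a leave-one-client-out loop. The client labels and duplicate labels below are hypothetical, and random undersampling is again an assumption.

```python
import numpy as np


def client_validation_splits(y, clients, rng):
    """Yield (client, train_idx, val_idx): validate on one client, train on the balanced rest."""
    y = np.asarray(y)
    clients = np.asarray(clients)
    for client in np.unique(clients):
        val_idx = np.flatnonzero(clients == client)   # kept in its unbalanced form
        rest = np.flatnonzero(clients != client)
        pos = rest[y[rest] == 1]
        neg = rest[y[rest] == 0]
        n = min(len(pos), len(neg))
        # Undersample the dominant class among the training clients.
        train_idx = np.concatenate([
            rng.choice(pos, n, replace=False),
            rng.choice(neg, n, replace=False),
        ])
        yield client, np.sort(train_idx), val_idx


rng = np.random.default_rng(0)
clients = np.repeat(["A", "B", "C"], 100)   # hypothetical client labels
y = np.tile([1] * 10 + [0] * 90, 3)         # hypothetical duplicate labels

splits = list(client_validation_splits(y, clients, rng))
for client, train_idx, val_idx in splits:
    # Predictions on val_idx would be accumulated over all clients before evaluation.
    print(client, len(train_idx), len(val_idx))
```

Keeping the validation client unbalanced is what makes this scheme closer to deployment: the model is scored on the class proportions it would actually encounter.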

4.3 Performance metrics

In a binary classification problem one can classify each test outcome as either a true positive (TP), a true negative (TN), a false positive (FP) or a false negative (FN). The total number in each category can then be used to define the performance metrics reported below: accuracy, false positive rate, false negative rate, sensitivity and specificity.
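For reference, the sketch below uses the standard textbook definitions of these five metrics in terms of the four counts. Treating a probability equal to the threshold as a positive prediction is an assumption.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Count TP, TN, FP and FN for a given probability threshold."""
    tp = tn = fp = fn = 0
    for label, p in zip(y_true, probabilities):
        predicted = 1 if p >= threshold else 0   # assumed tie handling
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Standard definitions; assumes both classes are present."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical labels and predicted duplicate probabilities.
y_true = [1, 1, 0, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.2, 0.1]
counts = confusion_counts(y_true, probabilities)
print(counts)
print(classification_metrics(*counts))
```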


The output of the models is a probability that an invoice pair is a duplicate. This means the above performance metrics depend on a user-defined probability threshold, which is set to 0.5 in this case. One can quantify the performance without this dependence by plotting the sensitivity, also known as the true positive rate, against the false positive rate for varying probability thresholds. This is known as a Receiver Operating Characteristic (ROC) curve and shows the trade-off between the true and false positive rate for a model. A larger area under the curve (AUC) therefore means better model performance. In the case of the 5-fold cross-validation, the performance metrics, ROC-curves and AUCs are summarized with the arithmetic mean and the standard deviation of the 5 values for each testing fold and stated in Table 1 and figure 2. The performance metrics, ROC-curves and AUCs of the client validation can be found in Table 2 and figure 3.
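The construction of a ROC-curve and its AUC can be sketched in plain Python as a threshold sweep followed by the trapezoidal rule. The labels and scores below are hypothetical.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over all distinct scores.

    Assumes y_true contains at least one positive and one negative label.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Hypothetical labels and predicted duplicate probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, scores)
print(list(zip(fpr, tpr)))
print(auc(fpr, tpr))
```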

Metric & NN & BDT
Accuracy & &
False positive rate & &
False negative rate & &
Sensitivity & &
Specificity & &

Table 1: Arithmetic means and standard deviation of performance metrics of the 5-fold cross-validation
Figure 2: The ROC-curves and AUCs of each fold and the average over all folds of the 5-fold cross-validation for (a) the NN and (b) the BDT

Metric & FISCAL & NN & BDT
Accuracy & & &
False positive rate & & &
False negative rate & & &
Sensitivity & & &
Specificity & & &

Table 2: Performance metrics of the client validation
Figure 3: The ROC-curves and AUCs of the client validation

The boosted decision tree and neural network both prove to be a valuable solution for the invoice deduplication problem in the client validation as well as the 5-fold cross-validation scheme. The accuracy, sensitivity, specificity, false positive rate and false negative rate for a probability threshold of 0.5 give a good indication of the performance of the model, whilst the AUC of the ROC-curves makes a more universal comparison possible. In both validation schemes the boosted decision tree has the highest AUC, which indicates that it is the most promising model for the deduplication problem. However, one should note that, because of the substantial class imbalance, the reported false positive rates would still result in a large number of false positives. Undersampling the dominant class in the training data reduced the effects of this imbalance. However, the metrics of the client validation show that it remains a non-trivial issue that should be taken into account in future studies.
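The effect of class imbalance on the number of false positives can be illustrated with a short calculation. All numbers below are hypothetical and chosen only to show the arithmetic; they are not results from this study.

```python
def expected_counts(n_pairs, duplicate_fraction, sensitivity, false_positive_rate):
    """Expected true and false positives for a given class balance and error rates."""
    duplicates = n_pairs * duplicate_fraction
    non_duplicates = n_pairs - duplicates
    true_positives = sensitivity * duplicates
    false_positives = false_positive_rate * non_duplicates
    return true_positives, false_positives


# Hypothetical scenario: 1 in 1000 pairs is a duplicate, and the model has
# a sensitivity of 95% and a false positive rate of 1%.
tp, fp = expected_counts(1_000_000, 0.001, 0.95, 0.01)
print(round(tp), round(fp))
print(round(fp / tp, 1))   # false positives flagged per true duplicate found
```

Even a false positive rate that looks small yields many more false alarms than true duplicates when non-duplicates dominate, which is why the threshold-independent ROC view alone does not settle how usable a model is in practice.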

5 Conclusions and outlook

Supervised learning techniques have been applied to identify duplicated invoices. With the use of similarity functions it was possible to construct a dataset that can be used to train and validate binary classification models. Both a neural network and a boosted decision tree were trained and validated and proved to be a valuable solution to the invoice deduplication problem. The solution of FISCAL performed well and supplied a good benchmark. However, the client validation showed performance metrics and ROC-curves indicating that the solution proposed in this paper can improve the identification of duplicate invoices. The framework presented in this study can easily be extended to other non-invoice datasets because of the general applicability of the similarity functions and machine learning techniques.

The presented work also shows promise for future studies. The similarity functions used to construct the features were primarily string-based. Adding features that quantify the similarity between numerical fields could add discriminative power to the models. Another addition would be to explore the possibilities of unsupervised learning. Removing the need for (non-)duplicate labels would substantially increase the size of the dataset at hand. Finally, the class imbalance was handled by undersampling the training data but remains an issue, resulting in a large number of false positives. Research on more advanced methods is needed here.

6 Acknowledgments

This research was conducted in partnership with FISCAL Technologies Ltd from their head office in Reading, Berkshire; FISCAL supplied facilities, equipment and test data, as well as the expertise and insight required to ratify the success of the project results.

This project has received funding from the European Union Horizon 2020 research and innovation programme under grant agreement No 765710.