Simulating Name-like Vectors for Testing Large-scale Entity Resolution

09/07/2020
by Samudra Herath, et al.

Accurate and efficient entity resolution (ER) has been a long-standing problem in data analysis and data mining projects. In our work, we are interested in developing ER methods that can handle big data. Good public datasets in this area are scarce and usually small. Simulation is one technique for generating test datasets, but existing simulation tools suffer from complexity, poor scalability, and the limitations of resampling. We address these problems with a simulation model for big data ER test data that is simple, inexpensive and fast. We avoid detail-level simulation of records by using a simple vector representation. In this paper, we discuss how to simulate simple vectors that approximate the properties of names (commonly used as identification keys).
