Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization

01/13/2021
by Adel Ardalan, et al.

Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings, then asking a user to verify and clean the clusters until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
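To make the cluster-then-verify workflow concrete, here is a minimal Python sketch. It is an illustration only, not the method proposed in the paper. The greedy clustering strategy, the use of `difflib` for string similarity, the 0.8 threshold, the sample brand strings, and the names `similarity`, `cluster_strings`, `normalize`, and `verify` are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_strings(values, threshold=0.8):
    """Greedy single-pass clustering (an assumed, simplistic strategy).

    Each value joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    """
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(value, cluster[0]) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def normalize(values, verify, threshold=0.8):
    """Map every value to the canonical string of its verified cluster.

    `verify` is a hypothetical stand-in for the human step: given one
    machine-made cluster, it returns a list of corrected clusters
    (the same cluster unchanged, or a split into several).
    """
    mapping = {}
    for cluster in cluster_strings(values, threshold):
        for verified in verify(cluster):
            canonical = verified[0]  # first member serves as the unique string
            for value in verified:
                mapping[value] = canonical
    return mapping


# Hypothetical input; here the "user" accepts every cluster as-is.
brands = ["Apple Inc.", "apple inc", "Microsoft Corp", "Microsoft Corp."]
result = normalize(brands, verify=lambda cluster: [cluster])
```

In this sketch the accuracy of the final mapping depends entirely on the `verify` callback, which mirrors the limitation noted above: the automatic clustering gives the user no guidance on how to check or repair the clusters.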


