DataVinci: Learning Syntactic and Semantic String Repairs

08/21/2023
by Mukul Singh, et al.

String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection, or they require the user to provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused on either syntactic errors or semantic errors in isolation, ignoring the fact that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such patterns as data errors. DataVinci can automatically derive edits to a data error based on the majority patterns and on constraints learned over other columns, without the need for further user interaction. To handle strings with both syntactic and semantic substrings, DataVinci uses an LLM to abstract (and re-concretize) the semantic portions of strings before learning majority patterns and deriving edits. Because not every column yields majority patterns, DataVinci leverages execution information from an existing program (which reads the target data) to identify and correct data errors that would not otherwise be identified. DataVinci outperforms 7 baselines on both error detection and repair when evaluated on 4 existing and new benchmarks.
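The detection step described in the abstract (learn a pattern that covers most values in a column, then flag the values that do not match it) can be illustrated with a minimal sketch. This is not DataVinci's pattern learner: the character-class generalization, the function names `generalize` and `find_outliers`, and the `min_coverage` threshold are all assumptions chosen for illustration, and the sketch does not attempt repair, LLM-based abstraction, or execution-guided detection.

```python
import re
from collections import Counter


def generalize(value):
    """Turn a string into a coarse regex (illustrative heuristic only).

    Runs of decimal digits become \\d+, runs of ASCII letters become
    [A-Za-z]+, and every other character is kept as an escaped literal.
    """
    tokens = []
    for ch in value:
        if ch.isdecimal():
            tok = r"\d+"
        elif ch.isascii() and ch.isalpha():
            tok = "[A-Za-z]+"
        else:
            # Literals are never collapsed, so "--" stays two characters.
            tokens.append(re.escape(ch))
            continue
        # Collapse a run of same-class characters into a single token.
        if not tokens or tokens[-1] != tok:
            tokens.append(tok)
    return "".join(tokens)


def find_outliers(values, min_coverage=0.5):
    """Return (majority_pattern, values_not_matching_it).

    If no single pattern covers more than `min_coverage` of the column
    (an assumed threshold), no value is reported as an error.
    """
    if not values:
        return None, []
    counts = Counter(generalize(v) for v in values)
    pattern, hits = counts.most_common(1)[0]
    if hits / len(values) <= min_coverage:
        return pattern, []
    regex = re.compile(pattern)
    return pattern, [v for v in values if not regex.fullmatch(v)]


column = ["AB-123", "CD-456", "EF-789", "GH_012", "IJ-345"]
pattern, outliers = find_outliers(column)
print(pattern, outliers)
```

In this toy column, four of the five values share the letters-hyphen-digits shape, so the value with an underscore is the only one reported. A column with no dominant shape returns no outliers, which is the situation the abstract addresses with execution information from a program that reads the data.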

