Unifacta: Profiling-driven String Pattern Standardization

03/02/2018
by Zhongjun Jin, et al.

Data cleaning is critical for effective analytics on data collected from the real world. One of the most challenging data cleaning tasks is pattern standardization: reformatting ad hoc data (e.g., phone numbers, human names, and addresses) from heterogeneous, non-standard patterns (formats) into a single standard pattern. The task is tedious and effort-consuming, especially for large data sets with diverse patterns. In this paper, we develop Unifacta, a technique that helps the end user standardize ill-formatted ad hoc data. With minimal user input, our technique helps the end user effectively and efficiently synthesize high-quality, explainable pattern standardization programs. We implemented Unifacta on Trifacta and experimentally compared it with a previous state-of-the-art string transformation tool, Flashfill, along with Trifacta and Blinkfill. Experimental results show that Unifacta produced programs of comparable quality that were more explainable, while requiring substantially less user effort than Flashfill and the other baseline systems. In a user effort study, Unifacta saved 30% to 70% of user effort compared to the baseline systems. In an experiment testing users' understanding of the synthesized transformation logic, Unifacta users achieved a success rate about twice that of Flashfill users.
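
To make the task concrete, the following is a minimal sketch of what pattern profiling and standardization can look like for phone numbers. It is not Unifacta's algorithm, which the abstract does not describe. The pattern notation (`<D>n` for a run of n digits, `<L>n` for a run of n letters), the function names, and the target format are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def pattern_of(value: str) -> str:
    """Summarize a string as a coarse pattern (hypothetical notation).

    Digit runs become <D>n, letter runs become <L>n, and every other
    character is kept literally.
    """
    parts = []
    for match in re.finditer(r"[0-9]+|[A-Za-z]+|.", value):
        token = match.group()
        if token.isdigit():
            parts.append(f"<D>{len(token)}")
        elif token.isalpha():
            parts.append(f"<L>{len(token)}")
        else:
            parts.append(token)
    return "".join(parts)


def profile(values):
    """Group values by their pattern so the distinct formats are visible."""
    groups = defaultdict(list)
    for value in values:
        groups[pattern_of(value)].append(value)
    return dict(groups)


def standardize_phone(value: str):
    """Rewrite a 10-digit phone number as (ddd) ddd-dddd (assumed target).

    Returns None when the value does not contain exactly 10 digits, so
    unexpected inputs are flagged rather than silently rewritten.
    """
    digits = re.sub(r"[^0-9]", "", value)
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


if __name__ == "__main__":
    raw = ["(734) 555-0199", "734.555.0199", "734-555-0123", "555-0199"]
    for pattern, members in profile(raw).items():
        print(pattern, members)
    print([standardize_phone(v) for v in raw])
```

Grouping by pattern first shows how many distinct formats a column contains before any transformation is written. That is the general idea behind a profiling-driven workflow, though the actual system may profile and transform data quite differently.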

research
03/02/2018

CLX: Towards a scalable and comprehensible design of PBE data transformations

Effective data analytics on data collected from the real world usually b...
research
08/12/2022

SeeSaw: interactive ad-hoc search over image databases

As image datasets become ubiquitous, the problem of ad-hoc searches over...
research
06/03/2019

On Modelling the Avoidability of Patterns as CSP

Solving avoidability problems in the area of string combinatorics often ...
research
04/19/2023

An Exploratory Study of Ad Hoc Parsers in Python

Background: Ad hoc parsers are pieces of code that use common string fun...
research
03/09/2022

ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]

In this paper, we propose a new system called ASET that allows users to ...
research
10/09/2020

A Generic Approach to Detect Design Patterns in Model Transformations Using a String-Matching Algorithm

Maintaining software artifacts is among the hardest tasks an engineer fa...