CLX: Towards a scalable and comprehensible design of PBE data transformations

03/02/2018
by Zhongjun Jin, et al.

Effective analytics on real-world data usually begins with a notoriously expensive pre-processing step of data transformation and wrangling. Human-in-the-loop tools based on the Programming By Example (PBE) approach have been proposed to speed up data transformation. However, two important usability issues limit the effective use of such PBE data transformation systems: (1) the user effort required grows quickly as the volume or heterogeneity of the raw data increases (prohibitive user effort), and (2) the underlying transformation process is opaque to the user and hence difficult to validate, correct and debug (incomprehensibility). In this project, we propose CLX (pronounced "clicks"), a new PBE data transformation paradigm for data normalization, to address these two issues. For the issue of prohibitive user effort, we present a pattern profiling algorithm that hierarchically clusters the raw input data by format structure, helping the user quickly identify both well-formatted and ill-formatted data and specify the desired format. After the desired transformation logic is inferred, CLX explains it as a set of simple regular expression replacement operations to improve comprehensibility. We experimentally compared the CLX prototype with FlashFill, a state-of-the-art data transformation tool. The results show improvements over the state of the art in saving user effort and enhancing comprehensibility, without loss of efficiency or expressive power. In a user effort study on data sets of various sizes, when the data size grew by a factor of 30, the user effort required by the CLX prototype grew 1.2x whereas that required by FlashFill grew 9.1x. In another test assessing the users' understanding of the transformation logic, the CLX users achieved a success rate about twice that of the FlashFill users.
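To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the CLX implementation. It assumes a hypothetical two-class token set (`<D>` for digit runs, `<L>` for letter runs), a two-level hierarchy (coarse patterns without lengths, fine patterns with lengths), and a hand-written phone-number rule standing in for an inferred transformation. The names `pattern_of`, `profile` and `normalize_phone` are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical token classes; the paper's actual token set may differ.
TOKEN_CLASSES = [
    ("<D>", re.compile(r"\d+")),        # run of digits
    ("<L>", re.compile(r"[A-Za-z]+")),  # run of letters
]

def pattern_of(value, with_lengths=True):
    """Abstract a string into a format pattern, e.g. '(<D>{3}) <D>{3}-<D>{4}'."""
    parts = []
    pos = 0
    while pos < len(value):
        for name, rx in TOKEN_CLASSES:
            m = rx.match(value, pos)
            if m:
                if with_lengths:
                    parts.append(f"{name}{{{len(m.group())}}}")
                else:
                    parts.append(f"{name}+")
                pos = m.end()
                break
        else:
            # Any other character is kept as a literal.
            parts.append(value[pos])
            pos += 1
    return "".join(parts)

def profile(values):
    """Group values into a two-level hierarchy: coarse pattern -> fine pattern -> values."""
    tree = defaultdict(lambda: defaultdict(list))
    for v in values:
        tree[pattern_of(v, with_lengths=False)][pattern_of(v, with_lengths=True)].append(v)
    return tree

def normalize_phone(value):
    """One transformation expressed as a single, readable regex replacement (illustrative rule)."""
    return re.sub(r"^\((\d{3})\) (\d{3})-(\d{4})$", r"\1-\2-\3", value)

if __name__ == "__main__":
    phones = ["(734) 645-8397", "734-422-8073", "734.236.3466", "(734) 111-2222"]
    for coarse, fine_groups in profile(phones).items():
        print(coarse)
        for fine, members in fine_groups.items():
            print("   ", fine, members)
    print([normalize_phone(p) for p in phones])
```

In this sketch the user would inspect a handful of patterns rather than every row, pick the desired target pattern, and read each transformation as a plain replacement rule; values that do not match a rule pass through unchanged.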

research
03/02/2018

Unifacta: Profiling-driven String Pattern Standardization

Data cleaning is critical for effective data analytics on many real-worl...
research
04/15/2020

On Box-Cox Transformation for Image Normality and Pattern Classification

A unique member of the power transformation family is known as the Box-C...
research
01/24/2022

Leveraging Data and Analytics for Digital Business Transformation through DataOps: An Information Processing Perspective

Digital business transformation has become increasingly important for or...
research
07/08/2023

Multi-Intent Detection in User Provided Annotations for Programming by Examples Systems

In mapping enterprise applications, data mapping remains a fundamental p...
research
09/06/2017

Learning to Compose Domain-Specific Transformations for Data Augmentation

Data augmentation is a ubiquitous technique for increasing the size of l...
research
06/07/2018

Is preprocessing of text really worth your time for online comment classification?

A large proportion of online comments present on public domains are cons...
research
04/10/2019

Constructing Clustering Transformations

Clustering is one of the fundamental tasks in data analytics and machine...
