CLX: Towards a scalable and comprehensible design of PBE data transformations

by Zhongjun Jin, et al.

Effective analytics on real-world data usually begins with a notoriously expensive pre-processing step: data transformation and wrangling. Human-in-the-loop tools based on the Programming By Example (PBE) approach have been proposed to speed up data transformation. However, two important usability issues limit the effective use of such PBE data transformation systems: (1) user effort grows quickly as the volume or heterogeneity of the raw data increases (prohibitive user effort), and (2) the underlying transformation process is opaque to the user and hence difficult to validate, correct, and debug (incomprehensibility). In this project, we propose CLX (pronounced "clicks"), a new PBE data transformation paradigm for data normalization that addresses these two issues. To address prohibitive user effort, we present a pattern profiling algorithm that hierarchically clusters the raw input data by format structure, helping the user quickly identify both well-formatted and ill-formatted data and specify the desired format. To improve comprehensibility, once the desired transformation logic is inferred, CLX explains it as a set of simple regular expression replacement operations. We experimentally compared the CLX prototype with FlashFill, a state-of-the-art data transformation tool. The results show that CLX saves user effort and improves comprehensibility relative to the state of the art, without loss of efficiency or expressive power. In a user effort study on data sets of various sizes, when the data size grew by a factor of 30, the user effort required by the CLX prototype grew 1.2x, whereas that required by FlashFill grew 9.1x. In another test assessing users' understanding of the transformation logic, CLX users achieved a success rate about twice that of FlashFill users.
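
To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the CLX algorithm. It assumes a hypothetical three-class token set (`<D>`, `<U>`, `<L>`), groups values into a single flat level of format patterns rather than a hierarchy, and uses hand-written replacement rules where CLX would infer them. The function names `pattern_of`, `profile`, and `normalize`, and the phone-number data, are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical token classes (an assumption): the abstract does not list
# the token set that CLX actually uses.
TOKEN_CLASSES = [
    ("<D>", re.compile(r"\d+")),     # run of digits
    ("<U>", re.compile(r"[A-Z]+")),  # run of uppercase letters
    ("<L>", re.compile(r"[a-z]+")),  # run of lowercase letters
]


def pattern_of(value: str) -> str:
    """Map a string to a coarse format pattern; other characters stay literal."""
    out = []
    i = 0
    while i < len(value):
        for name, rx in TOKEN_CLASSES:
            m = rx.match(value, i)
            if m:
                out.append(name)
                i = m.end()
                break
        else:
            # No token class matched: keep the character as a literal.
            out.append(value[i])
            i += 1
    return "".join(out)


def profile(values):
    """Group values by format pattern (one flat level, not a hierarchy)."""
    clusters = defaultdict(list)
    for v in values:
        clusters[pattern_of(v)].append(v)
    return dict(clusters)


# Hand-written rules (an assumption): one regular expression replacement
# per source format, all targeting the format ddd-ddd-dddd.
RULES = [
    (re.compile(r"^\((\d{3})\) (\d{3})-(\d{4})$"), r"\1-\2-\3"),
    (re.compile(r"^(\d{3})\.(\d{3})\.(\d{4})$"), r"\1-\2-\3"),
]


def normalize(value: str) -> str:
    """Apply the first matching replacement rule; leave other values as-is."""
    for rx, repl in RULES:
        if rx.match(value):
            return rx.sub(repl, value)
    return value


if __name__ == "__main__":
    phones = ["(734) 645-8397", "734-645-8397", "(313) 555-0100", "734.645.8397"]
    for pattern, members in profile(phones).items():
        print(pattern, members)
    print([normalize(p) for p in phones])
```

In this sketch the user would inspect a handful of patterns instead of every row, pick one as the target format, and read each transformation as a single replacement rule. That is the kind of effort saving and readability the abstract claims for CLX.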





Unifacta: Profiling-driven String Pattern Standardization

Data cleaning is critical for effective data analytics on many real-worl...

On Box-Cox Transformation for Image Normality and Pattern Classification

A unique member of the power transformation family is known as the Box-C...

cleanTS: Automated (AutoML) Tool to Clean Univariate Time Series at Microscales

Data cleaning is one of the most important tasks in data analysis proces...

WebRelate: Integrating Web Data with Spreadsheets using Examples

Data integration between web sources and relational data is a key challe...

Learning to Compose Domain-Specific Transformations for Data Augmentation

Data augmentation is a ubiquitous technique for increasing the size of l...

Is preprocessing of text really worth your time for online comment classification?

A large proportion of online comments present on public domains are cons...

Quda: Natural Language Queries for Visual Data Analytics

Visualization-oriented natural language interfaces (V-NLIs) have been ex...