Unifacta: Profiling-driven String Pattern Standardization
Data cleaning is critical for effective analytics on real-world data. One of the most challenging data cleaning tasks is pattern standardization, that is, reformatting ad hoc data (e.g., phone numbers, human names, and addresses) from heterogeneous non-standard patterns (formats) into a standard pattern. The task is tedious and effort-consuming, especially for large data sets with diverse patterns. In this paper, we develop Unifacta, a technique that helps the end user standardize ill-formatted ad hoc data. With minimal user input, the proposed technique helps the end user effectively and efficiently synthesize high-quality, explainable pattern standardization programs. We implemented Unifacta on Trifacta and experimentally compared it with a previous state-of-the-art string transformation tool, Flashfill, as well as with Trifacta and Blinkfill. Experimental results show that Unifacta produced programs of comparable quality that were more explainable, while requiring substantially less user effort than Flashfill and the other baseline systems. In a user effort study, Unifacta saved 30% to 70% of user effort compared to the baseline systems. In an experiment testing users' understanding of the synthesized transformation logic, Unifacta users achieved a success rate about twice that of Flashfill users.
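To make the task concrete, the following is a minimal Python sketch of what profiling and standardizing string patterns can look like for phone numbers. It is an illustration only and not Unifacta's algorithm: the pattern notation (`<D>n` for a run of n digits, `<L>n` for a run of n letters), the function names `pattern_of`, `profile`, and `standardize_phone`, and the target format `(ddd) ddd-dddd` are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: digit runs, letter runs, or any single other character.
_TOKEN = re.compile(r"(?P<d>[0-9]+)|(?P<l>[A-Za-z]+)|(?P<o>.)", re.DOTALL)

def pattern_of(s: str) -> str:
    """Abstract a string into a coarse pattern, e.g. '734-555' -> '<D>3-<D>3'."""
    parts = []
    for m in _TOKEN.finditer(s):
        if m.lastgroup == "d":
            parts.append(f"<D>{len(m.group())}")
        elif m.lastgroup == "l":
            parts.append(f"<L>{len(m.group())}")
        else:
            parts.append(m.group())  # keep punctuation and spaces literally
    return "".join(parts)

def profile(values):
    """Group values by their pattern so each distinct format can be reviewed once."""
    clusters = defaultdict(list)
    for v in values:
        clusters[pattern_of(v)].append(v)
    return dict(clusters)

def standardize_phone(s: str):
    """Rewrite a 10-digit phone number as '(ddd) ddd-dddd' (assumed target format).

    Returns None when the input does not contain exactly 10 digits.
    """
    digits = re.sub(r"[^0-9]", "", s)
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

raw = ["734-555-0199", "313-555-0100", "734.555.0199", "555-0199"]
clusters = profile(raw)
cleaned = [standardize_phone(v) for v in raw]
```

Grouping by pattern first means a user inspects one representative per format instead of every row, which is the intuition behind a profiling-driven approach; values that fit no known pattern are left as `None` for manual review rather than guessed.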