Wrangling Messy CSV Files by Detecting Row and Type Patterns

It is well known that data scientists spend the majority of their time preparing data for analysis. One of the first steps in this preparation phase is to load the data from its raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded. This is an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is determining the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real-world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library.
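
To make the notion of a "dialect" concrete, the following minimal sketch uses `csv.Sniffer` from Python's standard library to guess the delimiter and quote character of a small sample. It illustrates the kind of existing detection tool the abstract refers to, not the consistency-based method proposed in the paper; the sample data is hypothetical and chosen only for illustration.

```python
import csv

# Hypothetical sample: semicolon-delimited, with quoted cells that contain commas.
sample = (
    'name;city;comment\n'
    '"Smith, J.";Amsterdam;"likes tea, coffee"\n'
    '"Doe, A.";Edinburgh;"no comment"\n'
)

# Sniffer.sniff() inspects the text and returns a Dialect subclass.
# It raises csv.Error when it cannot determine a delimiter.
dialect = csv.Sniffer().sniff(sample)
print("delimiter:", repr(dialect.delimiter))
print("quotechar:", repr(dialect.quotechar))

# Parse the sample with the detected dialect.
rows = list(csv.reader(sample.splitlines(), dialect=dialect))
for row in rows:
    print(row)
```

A wrong guess at this stage silently changes the shape of the parsed table, for instance by splitting the quoted names on their commas, which is why dialect detection is treated as the first step of loading a CSV file.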


