DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python

04/02/2021
by Jinglin Peng, et al.

Exploratory Data Analysis (EDA) is a crucial step in any data science project. However, existing Python libraries fall short in helping data scientists complete common EDA tasks for statistical modeling. Their API design is either too low-level, optimized for plotting rather than EDA, or too high-level, making it hard to specify fine-grained EDA tasks. In response, we propose DataPrep.EDA, a novel task-centric EDA system in Python. DataPrep.EDA allows data scientists to declaratively specify a wide range of EDA tasks at different granularities with a single function call. We identify a number of challenges in implementing DataPrep.EDA and propose effective solutions to improve the scalability, usability, and customizability of the system. In particular, we discuss lessons learned from using Dask to build the data processing pipelines for EDA tasks and describe our approaches to accelerating these pipelines. We conduct extensive experiments to compare DataPrep.EDA with Pandas-profiling, the state-of-the-art EDA system in Python. The experiments show that DataPrep.EDA significantly outperforms Pandas-profiling in terms of both speed and user experience. DataPrep.EDA is open-sourced as the EDA component of DataPrep: https://github.com/sfu-db/dataprep.
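
To make the idea of specifying EDA tasks "at different granularities with a single function call" concrete, here is a minimal sketch in plain Python. It is an illustration only and does not reproduce DataPrep.EDA's actual API or its Dask-based pipelines: the function name `explore`, its helpers, and the returned dictionaries are all hypothetical. The sketch assumes the data is a list of dicts and shows how the number of column arguments could select the task (overview, univariate, or bivariate).

```python
from collections import Counter
from math import sqrt
from statistics import mean, median


def explore(rows, x=None, y=None):
    """Hypothetical task-centric entry point (NOT the DataPrep.EDA API).

    rows: list of dicts that share the same keys.
    The column arguments select the EDA task:
      explore(rows)            -> overview of every column
      explore(rows, "a")       -> univariate summary of column a
      explore(rows, "a", "b")  -> bivariate summary of columns a and b
    """
    if x is None and y is not None:
        raise ValueError("pass x before y")
    if x is None:
        columns = list(rows[0].keys()) if rows else []
        return {"task": "overview",
                "columns": {c: _univariate(rows, c) for c in columns}}
    if y is None:
        return {"task": "univariate", "column": x, **_univariate(rows, x)}
    return {"task": "bivariate", "columns": (x, y), **_bivariate(rows, x, y)}


def _is_numeric(values):
    # Treat a column as numerical only if every present value is int or float.
    return bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def _univariate(rows, col):
    values = [r[col] for r in rows if r.get(col) is not None]
    summary = {"missing": len(rows) - len(values)}
    if _is_numeric(values):
        summary.update(kind="numerical", mean=mean(values), median=median(values))
    else:
        summary.update(kind="categorical", top=Counter(values).most_common(3))
    return summary


def _bivariate(rows, x, y):
    # Keep only rows where both columns are present.
    pairs = [(r[x], r[y]) for r in rows
             if r.get(x) is not None and r.get(y) is not None]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    if _is_numeric(xs) and _is_numeric(ys):
        return {"kind": "numerical-numerical", "pearson": _pearson(xs, ys)}
    return {"kind": "contains-categorical", "counts": Counter(pairs).most_common(3)}


def _pearson(xs, ys):
    # Pearson correlation; returns None when either column has zero variance.
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    sample = [
        {"age": 20, "income": 30.0, "city": "A"},
        {"age": 30, "income": 50.0, "city": "B"},
        {"age": None, "income": 70.0, "city": "A"},
    ]
    print(explore(sample)["task"])           # coarse: whole dataset
    print(explore(sample, "age"))            # finer: one column
    print(explore(sample, "age", "income"))  # finest: a pair of columns
```

The design choice illustrated here is that the caller states what to analyze (which columns), and the function decides how to summarize it from the column types, rather than the caller assembling individual plots or receiving one fixed report.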


