Efficient Specialized Spreadsheet Parsing for Data Science

02/26/2022
by Felix Henze, et al.

Spreadsheets are widely used for data exploration. Since spreadsheet systems have limited capabilities, users often need to load spreadsheets into other data science environments to perform advanced analytics. However, current approaches for spreadsheet loading suffer from either high runtime or high memory usage, which hinders data exploration on commodity systems. To make spreadsheet loading practical on commodity systems, we introduce a novel parser that minimizes memory usage by tightly coupling decompression and parsing. Furthermore, to reduce the runtime, we introduce optimized spreadsheet-specific parsing routines and employ parallelism. To evaluate our approach, we implement a prototype for loading Excel spreadsheets into R environments. Our evaluation shows that our approach is up to 3x faster while consuming up to 40x less memory than state-of-the-art approaches.
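
The idea of coupling decompression with parsing can be illustrated with a minimal Python sketch. This is not the authors' parser (their prototype targets R and uses specialized routines and parallelism); it only shows the general principle using the standard library. An .xlsx file is a ZIP archive of XML parts, so the worksheet XML can be decompressed in chunks and fed straight to an incremental XML parser instead of being inflated into memory first. The function name `iter_cells` and the default member path `xl/worksheets/sheet1.xml` are assumptions made for this example; in a real workbook the sheet path is given by the workbook's relationship files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def iter_cells(xlsx, member="xl/worksheets/sheet1.xml"):
    """Yield (cell_reference, raw_value_text) pairs from one worksheet.

    `xlsx` is a path or a binary file-like object. The member path is an
    assumption for this sketch, not something the paper specifies.
    """
    with zipfile.ZipFile(xlsx) as archive:
        # archive.open() returns a stream that inflates data on demand,
        # so the decompressed XML never has to exist in memory as a whole.
        with archive.open(member) as stream:
            # iterparse pulls chunks from the stream and reports elements
            # as soon as their closing tag has been seen.
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag == NS + "c":
                    value = elem.find(NS + "v")
                    text = value.text if value is not None else None
                    yield elem.get("r"), text
                elif elem.tag == NS + "row":
                    # Drop the finished row's cells to keep memory low.
                    # Empty row elements still remain attached to the tree.
                    elem.clear()
```

Calling `for ref, text in iter_cells("data.xlsx"): ...` (with a hypothetical file name) processes cells one at a time. The values are raw strings: cells of type `t="s"` hold an index into the shared string table, which this sketch does not resolve.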

