Smallset Timelines: A Visual Representation of Data Preprocessing Decisions

06/10/2022
by Lydia R. Lucchesi, et al.

Data preprocessing is a crucial stage in the data analysis pipeline, with both technical and social aspects to consider. Yet it often receives too little attention in research practice and dissemination. We present the Smallset Timeline, a visualisation that helps practitioners reflect on and communicate data preprocessing decisions. A "Smallset" is a small selection of rows from the original dataset that contains instances of dataset alterations. The Timeline is composed of Smallset snapshots, which represent different points in the preprocessing stage, and captions that describe the alterations visualised at each point. Edits, additions, and deletions to the dataset are highlighted with colour. We develop an R software package, smallsets, that creates Smallset Timelines from R and Python data preprocessing scripts. Constructing the figure prompts practitioners to reflect on their decisions and revise them as necessary, while sharing it aims to make the process accessible to a diverse range of audiences. We present two case studies to illustrate the use of the Smallset Timeline for visualising preprocessing decisions: software defect data, in which preprocessing affects the level of data loss, and income survey benchmark data, in which it affects group fairness in prediction tasks. We envision Smallset Timelines as a go-to data provenance tool that enables better documentation and communication of preprocessing tasks at large.
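
To make the idea concrete, the following is a minimal Python sketch of the bookkeeping behind such a timeline: it tracks a few chosen rows through a sequence of preprocessing steps, stores a captioned snapshot after each step, and records which cells were edited, which columns were added, and which rows were deleted. It is not the smallsets package or its API. The toy data, the three preprocessing steps, and every function name here are hypothetical, and the sketch only computes the change sets that a figure would colour rather than drawing anything.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Table = Dict[int, Row]  # row id -> row (hypothetical toy representation)


def select_smallset(table: Table, row_ids: List[int]) -> Table:
    """Copy the tracked rows that still exist in the table."""
    return {rid: dict(table[rid]) for rid in row_ids if rid in table}


def diff_snapshots(before: Table, after: Table) -> Dict[str, list]:
    """Classify changes between two snapshots of the same tracked rows."""
    edited, deleted = [], []
    for rid, row in before.items():
        if rid not in after:
            deleted.append(rid)  # the row was dropped by this step
            continue
        for col, value in row.items():
            if col in after[rid] and after[rid][col] != value:
                edited.append((rid, col))  # the cell value changed
    before_cols = {c for row in before.values() for c in row}
    after_cols = {c for row in after.values() for c in row}
    added = sorted(after_cols - before_cols)  # columns new in this step
    return {"edited": edited, "added": added, "deleted": deleted}


def build_timeline(
    raw: Table,
    row_ids: List[int],
    steps: List[Tuple[str, Callable[[Table], Table]]],
) -> List[Dict[str, Any]]:
    """Apply each step to the full table and snapshot the tracked rows."""
    current = {rid: dict(row) for rid, row in raw.items()}
    previous = select_smallset(current, row_ids)
    timeline = []
    for caption, step in steps:
        current = step(current)
        snapshot = select_smallset(current, row_ids)
        timeline.append({
            "caption": caption,
            "snapshot": snapshot,
            "changes": diff_snapshots(previous, snapshot),
        })
        previous = snapshot
    return timeline


# Hypothetical preprocessing steps, chosen only for illustration.
def drop_missing_income(table: Table) -> Table:
    return {rid: dict(r) for rid, r in table.items() if r["income"] is not None}


def impute_age(table: Table) -> Table:
    known = [r["age"] for r in table.values() if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return {
        rid: {**r, "age": mean_age if r["age"] is None else r["age"]}
        for rid, r in table.items()
    }


def add_high_income(table: Table) -> Table:
    return {rid: {**r, "high_income": r["income"] > 50000} for rid, r in table.items()}


# Toy dataset (made up); rows 1-3 form the tracked selection.
raw = {
    1: {"age": 34, "income": 52000},
    2: {"age": None, "income": 48000},
    3: {"age": 29, "income": None},
    4: {"age": 41, "income": 61000},
}
steps = [
    ("Drop rows with missing income.", drop_missing_income),
    ("Replace missing ages with the mean age.", impute_age),
    ("Add a high-income indicator column.", add_high_income),
]
timeline = build_timeline(raw, [1, 2, 3], steps)

for entry in timeline:
    print(entry["caption"], entry["changes"])
```

Each timeline entry pairs a caption with a snapshot and its change set, which is the information a renderer would need to colour edited cells, added columns, and deleted rows. In this sketch the tracked rows are chosen by hand; how rows are selected in practice is a separate design decision that the sketch does not address.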
