PRESISTANT: Learning based assistant for data pre-processing

03/02/2018
by   Besim Bilalli, et al.

Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
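
The ranking idea in the abstract can be illustrated with a minimal sketch. It assumes a predictor that estimates how much an operator changes a classifier's accuracy on a given dataset; in the abstract this role is played by a learned Random Forest model, whereas here it is a hand-written placeholder. The function names, operator names, and the `missing_ratio` meta-feature are all hypothetical and are not taken from PRESISTANT itself.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: maps (dataset meta-features, operator name) to an
# estimated change in classifier accuracy. Any callable can fill this role;
# a trained model would normally be plugged in here.
ImpactPredictor = Callable[[Dict[str, float], str], float]


def rank_operators(
    meta_features: Dict[str, float],
    operators: List[str],
    predict_impact: ImpactPredictor,
) -> List[Tuple[str, float]]:
    """Return (operator, predicted impact) pairs, highest impact first."""
    scored = [(op, predict_impact(meta_features, op)) for op in operators]
    # sorted() is stable, so operators with equal scores keep input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_predictor(meta_features: Dict[str, float], operator: str) -> float:
    """Hand-written stand-in for a trained model (illustrative only)."""
    if operator == "impute_missing":
        # Assumption: imputation helps more when more values are missing.
        return 0.1 * meta_features.get("missing_ratio", 0.0)
    if operator == "discretize":
        # Assumption: a small fixed penalty, chosen only for the example.
        return -0.01
    return 0.0


ranking = rank_operators(
    {"missing_ratio": 0.3},
    ["discretize", "normalize", "impute_missing"],
    toy_predictor,
)
for operator, impact in ranking:
    print(f"{operator}: {impact:+.3f}")
```

Keeping the predictor as a plain callable separates the ranking step from the learning step, so a trained model can replace `toy_predictor` without changing `rank_operators`.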


