A Comparison of Modeling Preprocessing Techniques

02/23/2023
by Tosan Johnson, et al.

This paper compares various data preprocessing methods in terms of the predictive performance they yield on structured data, and seeks to identify and recommend preprocessing methodologies for tree-based binary classification models, with a focus on eXtreme Gradient Boosting (XGBoost) models. Three synthetic data sets of varying structure, interaction, and complexity were constructed and supplemented by a real-world data set from the Lending Club. We compare several methods in each of three groups: feature selection, categorical handling, and null imputation. Performance is assessed through relative comparisons among the chosen methodologies, including model prediction variability. The paper is organized by these three groups of preprocessing methodologies, with each section consisting of generalized observations, and each observation accompanied by a recommendation of one or more preferred methodologies. Among feature selection methods, permutation-based feature importance, regularization, and XGBoost's feature importance by weight are not recommended, and correlation-coefficient reduction also shows inferior performance; XGBoost importance by gain shows the most consistent and highest-caliber performance. Categorical feature encoding methods show greater discrimination in performance across data set structures. While no method was universally best, frequency encoding showed the strongest performance on the most complex data set (Lending Club) but the poorest performance on all synthetic (i.e., simpler) data sets. Finally, missing-indicator imputation dominated the imputation methods in performance, whereas tree imputation showed extremely poor and highly variable model performance.
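
As a rough illustration (a minimal sketch, not code from the paper), the recommended choice from each group could be combined as follows in Python; the input file loans.csv, the binary target column "default", and the top-20 feature cutoff are assumptions for illustration only.

    import pandas as pd
    import xgboost as xgb

    # Hypothetical tabular data with a binary target column.
    df = pd.read_csv("loans.csv")
    y = df.pop("default")

    # Null imputation: missing-indicator imputation adds a flag column for
    # every feature that contains nulls, then fills the original column.
    for col in df.columns[df.isna().any()]:
        df[f"{col}_missing"] = df[col].isna().astype(int)
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(0)
        else:
            df[col] = df[col].fillna("missing")

    # Categorical handling: frequency encoding replaces each category with
    # its relative frequency in the data.
    for col in df.select_dtypes(include="object").columns:
        freq = df[col].value_counts(normalize=True)
        df[col] = df[col].map(freq)

    # Feature selection: rank features by XGBoost importance measured as
    # average gain, then refit on the highest-ranked subset.
    ranker = xgb.XGBClassifier(n_estimators=200, importance_type="gain")
    ranker.fit(df, y)
    gain = pd.Series(ranker.feature_importances_, index=df.columns)
    selected = gain.sort_values(ascending=False).head(20).index
    model = xgb.XGBClassifier(n_estimators=200).fit(df[selected], y)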
