Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems

05/27/2021
by Carlos Mougan, et al.

Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, few techniques are specifically dedicated to incorporating categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
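
The idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`quantile`, `fit_quantile_encoding`, `encode`, `fit_expanded_encoding`), the linear-interpolation quantile, and the exact smoothing rule `(n * q_category + m * q_global) / (n + m)` are assumptions made for illustration; the abstract only states that a quantile of the target is used per category, that additive smoothing protects categories with small support, and that several quantiles can be combined into a set of features.

```python
from collections import defaultdict


def quantile(values, p):
    """Linear-interpolation quantile of a non-empty sequence, 0 <= p <= 1."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def fit_quantile_encoding(categories, targets, p=0.5, m=1.0):
    """Learn a smoothed p-quantile of the target for each category.

    Assumed smoothing rule (illustrative, not taken from the abstract):
        (n * q_category + m * q_global) / (n + m)
    where n is the category's support and m is the smoothing strength.
    Returns the category -> value mapping and the global quantile.
    """
    categories = list(categories)
    targets = list(targets)
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    q_global = quantile(targets, p)
    mapping = {
        cat: (len(ys) * quantile(ys, p) + m * q_global) / (len(ys) + m)
        for cat, ys in groups.items()
    }
    return mapping, q_global


def encode(categories, mapping, default):
    """Replace each category by its encoded value; unseen ones get `default`."""
    return [mapping.get(cat, default) for cat in categories]


def fit_expanded_encoding(categories, targets, quantiles=(0.25, 0.5, 0.75), m=1.0):
    """Fit one encoding per quantile, giving several features per category."""
    categories = list(categories)
    targets = list(targets)
    return {p: fit_quantile_encoding(categories, targets, p, m) for p in quantiles}


if __name__ == "__main__":
    cats = ["a", "a", "a", "b"]
    y = [1.0, 2.0, 100.0, 10.0]
    mapping, q_global = fit_quantile_encoding(cats, y, p=0.5, m=1.0)
    print(mapping, q_global)
    print(encode(["a", "b", "z"], mapping, default=q_global))
```

In this toy data the category `a` contains one extreme target value (100.0). Its median is unaffected by that value, whereas a mean-based encoding would be pulled strongly toward it, which is the kind of long-tailed situation the abstract refers to. The single-observation category `b` is shrunk toward the global median by the smoothing term. In practice the mapping should be fitted on training data only, so that target information does not leak into validation or test rows.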

research
07/10/2016

Bayesian quantile additive regression trees

Ensemble of regression trees have become popular statistical tools for t...
research
07/17/2023

A benchmark of categorical encoders for binary classification

Categorical encoders transform categorical features into numerical repre...
research
07/13/2022

Parametric quantile regression for income data

Univariate normal regression models are statistical tools widely applied...
research
11/29/2021

PCA-based Category Encoder for Categorical to Numerical Variable Conversion

Increasing the cardinality of categorical variables might decrease the o...
research
11/05/2012

Soft (Gaussian CDE) regression models and loss functions

Regression, unlike classification, has lacked a comprehensive and effect...
research
01/30/2023

Machine Learning with High-Cardinality Categorical Features in Actuarial Applications

High-cardinality categorical features are pervasive in actuarial data (e...
