On Clustering Categories of Categorical Predictors in Generalized Linear Models

10/19/2021
by Emilio Carrizosa, et al.

We propose a method to reduce the complexity of Generalized Linear Models in the presence of categorical predictors. The traditional one-hot encoding, where each category is represented by a dummy variable, can be wasteful, difficult to interpret, and prone to overfitting, especially when dealing with high-cardinality categorical predictors. This paper addresses these challenges by clustering the categories of each categorical predictor to obtain a reduced representation. The clustering is found through a numerical method that aims to preserve (or even improve) accuracy while reducing the number of coefficients to be estimated for the categorical predictors. Thanks to the design of this method, we can derive a proximity measure between the categories of a categorical predictor that can be easily visualized. We illustrate the performance of our approach on real-world classification and count-data datasets, where clustering the categories of the categorical predictors reduces complexity substantially without harming accuracy.
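To make the reduction in coefficients concrete, the following minimal Python sketch contrasts a one-hot encoding with an encoding based on clusters of categories. It is an illustration only, not the numerical method of the paper: the predictor, its six levels, and the assignment of levels to two clusters are all hypothetical and hand-picked, and the helper names `one_hot` and `clustered_one_hot` are assumptions introduced here.

```python
# Illustrative sketch only. The category-to-cluster assignment is hand-picked,
# not the output of the paper's numerical method.

def one_hot(values, levels):
    """Encode each value as a 0/1 row with one column per level."""
    index = {level: j for j, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return rows


def clustered_one_hot(values, assignment):
    """One-hot encode the cluster label of each value instead of the value itself."""
    clusters = sorted(set(assignment.values()))
    return one_hot([assignment[v] for v in values], clusters)


# Hypothetical categorical predictor with 6 categories.
levels = ["A", "B", "C", "D", "E", "F"]
values = ["A", "C", "F", "B", "E", "D", "A"]

# Hypothetical assignment of the 6 categories to 2 clusters.
assignment = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

full = one_hot(values, levels)
reduced = clustered_one_hot(values, assignment)

print(len(full[0]), "columns with one dummy per category")
print(len(reduced[0]), "columns after clustering the categories")
```

In a GLM, each column of the design matrix corresponds to a coefficient to be estimated (up to the usual reference-level convention), so categories that share a cluster also share a coefficient.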


