Modelling High-Dimensional Categorical Data Using Nonconvex Fusion Penalties

02/28/2020
by Benjamin G. Stokell, et al.

We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN.
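The penalty described above can be illustrated with a short sketch. It assumes the standard form of the minimax concave penalty (MCP), with a tuning parameter `lam` and a concavity parameter `gamma`, applied to the gaps between consecutive sorted coefficients of one categorical variable. The function names `mcp` and `fusion_penalty` are hypothetical and chosen for illustration only. This is not the CatReg implementation, and it evaluates the penalty only, without performing the optimisation.

```python
def mcp(x, lam, gamma):
    """Standard minimax concave penalty evaluated at |x|.

    Equals lam*|x| - x^2 / (2*gamma) for |x| <= gamma*lam,
    and is constant at gamma*lam^2 / 2 beyond that point.
    """
    x = abs(x)
    if x <= gamma * lam:
        return lam * x - x * x / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def fusion_penalty(coefs, lam, gamma):
    """Sum of MCP values over gaps between consecutive order statistics.

    `coefs` holds one coefficient per level of a single categorical
    variable (an illustrative assumption about the data layout).
    """
    ordered = sorted(coefs)
    return sum(mcp(b - a, lam, gamma) for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Two well-separated clusters of fused levels: only the single large
    # gap between the clusters contributes, and its cost is capped.
    print(fusion_penalty([0.0, 0.0, 5.0, 5.0], lam=1.0, gamma=3.0))
    # Fully fused levels incur no penalty.
    print(fusion_penalty([1.0, 1.0, 1.0], lam=1.0, gamma=3.0))
```

Because the MCP flattens out for large arguments, widely separated clusters of levels are not pulled toward each other, while small gaps are pushed to exactly zero. This matches the clustering behaviour the abstract describes.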


