Representing missing values through polar encoding

10/04/2022
by Oliver Urs Lenz, et al.

We propose polar encoding, a representation of categorical and numerical [0,1]-valued attributes with missing values that preserves the information encoded in the distribution of the missing values. Unlike the existing missing-indicator approach, polar encoding does not require imputation. We support our proposal with three arguments. Firstly, polar encoding ensures that missing values are equidistant from all non-missing values by mapping the latter onto the unit circle. Secondly, polar encoding lets decision trees choose how missing values should be split, providing a practical realisation of the missingness incorporated in attributes (MIA) proposal. Lastly, polar encoding corresponds to the normalised representation of categorical and [0,1]-valued attributes when these are viewed as barycentric attributes, a new concept based on traditional barycentric coordinates. In particular, we show that barycentric attributes are fuzzified categorical attributes, that their normalised representation generalises one-hot encoding, and that the polar encoding of [0,1]-valued attributes is analogous to the one-hot encoding of binary attributes. In an experiment on twenty real-life datasets with missing values, we show that polar encoding performs about as well as or better than the missing-indicator approach in terms of the resulting classification performance.
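The abstract does not state the encoding formula, so the following is only a minimal sketch under stated assumptions. It assumes that a [0,1]-valued value x is mapped to the point (cos(πx/2), sin(πx/2)) on the unit circle, that a missing value is mapped to the origin, and that a categorical value is one-hot encoded with a missing value mapped to the all-zero vector. The function names `polar_encode` and `one_hot_with_missing` are hypothetical and do not come from the paper.

```python
import math

def polar_encode(x):
    """Encode a [0,1]-valued value (or None for missing) as a 2D point.

    Assumption: non-missing values lie on the quarter of the unit circle
    between (1, 0) and (0, 1); missing values sit at the origin.
    """
    if x is None:
        return (0.0, 0.0)
    if not 0.0 <= x <= 1.0:
        raise ValueError("expected a value in [0, 1] or None")
    angle = 0.5 * math.pi * x
    return (math.cos(angle), math.sin(angle))

def one_hot_with_missing(value, categories):
    """One-hot encode a categorical value; None becomes the zero vector."""
    if value is None:
        return tuple(0.0 for _ in categories)
    if value not in categories:
        raise ValueError("unknown category")
    return tuple(1.0 if value == c else 0.0 for c in categories)

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Under these assumptions, a missing value is at distance 1 from every
# non-missing value, for numerical and categorical attributes alike.
missing = polar_encode(None)
distances = [euclidean(missing, polar_encode(x)) for x in (0.0, 0.25, 0.5, 1.0)]
```

Under this assumed mapping, the endpoints 0 and 1 land on (1, 0) and (0, 1), which matches the abstract's remark that the polar encoding of [0,1]-valued attributes is analogous to the one-hot encoding of binary attributes.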

