Two-level histograms for dealing with outliers and heavy tail distributions

06/09/2023
by Marc Boullé, et al.

Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Many approaches have been proposed in the literature to infer these parameters, either by assuming hypotheses about the underlying data distribution or by exploiting a model selection approach. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter and achieves state-of-the-art performance with respect to accuracy, parsimony, and computation time. We investigate the limits of this method in the case of outliers or heavy-tailed distributions. We suggest a two-level heuristic to deal with such cases. The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets with a controlled range of values. The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram. Extensive experiments show the benefits of the approach.
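
The two-level idea described above can be illustrated with a minimal sketch. It is not the authors' heuristic and does not implement G-Enum. It makes several assumptions that the abstract does not state: a sign-preserving transform `sign(x) * log10(1 + |x|)` so that zero and negative values are handled, a fixed band width of one unit in log space (`band_width`, a hypothetical parameter), and a plain square-root rule with equal-width bins as a stand-in for the sub-histogram method. All function names are hypothetical.

```python
import math

def signed_log10(x):
    """Sign-preserving log transform (an assumption of this sketch)."""
    return math.copysign(math.log10(1.0 + abs(x)), x)

def inverse_signed_log10(t):
    """Inverse of signed_log10, used to map band limits back to data space."""
    return math.copysign(10.0 ** abs(t) - 1.0, t)

def two_level_histogram(data, band_width=1.0):
    """Return (edges, counts) of a histogram assembled from per-band sub-histograms."""
    # Level 1: split the data into subsets whose log-transformed values
    # fall in the same band, so each subset has a controlled range.
    subsets = {}
    for x in data:
        band = math.floor(signed_log10(x) / band_width)
        subsets.setdefault(band, []).append(x)

    # Level 2: build one sub-histogram per band and concatenate them.
    edges, counts = [], []
    for band in range(min(subsets), max(subsets) + 1):
        lo = inverse_signed_log10(band * band_width)
        hi = inverse_signed_log10((band + 1) * band_width)
        subset = subsets.get(band, [])  # empty bands become a single empty bin
        # Stand-in rule (NOT G-Enum): about sqrt(n) equal-width bins.
        n_bins = max(1, math.ceil(math.sqrt(len(subset))))
        width = (hi - lo) / n_bins
        sub_counts = [0] * n_bins
        for x in subset:
            i = int((x - lo) / width)
            # Clamp to guard against rounding at the band limits.
            sub_counts[min(max(i, 0), n_bins - 1)] += 1
        edges.extend(lo + i * width for i in range(n_bins))
        counts.extend(sub_counts)
    edges.append(hi)  # close the last bin
    return edges, counts

if __name__ == "__main__":
    sample = [-50, -0.5, 0, 0.5, 3, 20, 20, 1e6]  # includes one extreme value
    edges, counts = two_level_histogram(sample)
    for left, right, c in zip(edges, edges[1:], counts):
        print(f"[{left:.6g}, {right:.6g}): {c}")
```

Because adjacent bands share their limits, the concatenated edges form one contiguous histogram, and a single extreme value only affects the bins of its own band rather than the resolution of the bulk of the data.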


