Unsupervised Discretization by Two-dimensional MDL-based Histogram

06/02/2020
by Lincen Yang, et al.

Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach cannot adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalised maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighbouring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground-truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates well a wide range of partitions outside the model class; 3) converges, in contrast to its closest competitor IPD; and 4) is self-adaptive with respect to both sample size and the local density structure of the data, despite being parameter-free. Finally, we apply our algorithm to two geographic datasets to demonstrate its real-world potential.
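
For intuition about the kind of score such methods build on, the sketch below computes the normalised-maximum-likelihood (NML) code length of an equal-width one-dimensional histogram and selects the bin count that minimises it, using the standard linear-time recurrence for the multinomial parametric complexity. The equal-width restriction, the function names, and the bin-count search range are illustrative assumptions of this sketch, not the paper's method, which infers adaptive two-dimensional partitions and also encodes the partition itself.

```python
import math
import numpy as np

def multinomial_complexity(n: int, K: int) -> float:
    """NML parametric complexity C(n, K) of a K-category multinomial on n samples,
    via the standard linear-time recurrence C_K = C_{K-1} + n/(K-2) * C_{K-2}."""
    if K == 1:
        return 1.0
    c_prev = 1.0                      # C(n, 1)
    c_curr = sum(                     # C(n, 2), by direct summation
        math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    for k in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n / (k - 2) * c_prev
    return c_curr

def nml_code_length(x: np.ndarray, K: int) -> float:
    """Code length (in nats) of x under an equal-width K-bin histogram:
    negative maximised log-likelihood plus log parametric complexity.
    (Simplified sketch: the cost of encoding the partition itself is omitted.)"""
    n = len(x)
    counts, edges = np.histogram(x, bins=K)
    widths = np.diff(edges)
    nll = -sum(h * math.log(h / (n * w)) for h, w in zip(counts, widths) if h > 0)
    return nll + math.log(multinomial_complexity(n, K))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)          # toy data; any 1-d sample works
    best_K = min(range(1, 31), key=lambda K: nml_code_length(x, K))
    print("MDL-selected number of equal-width bins:", best_K)
```

In the paper's two-dimensional setting, code lengths of this kind drive both the alternating per-dimension splitting and the subsequent merging of neighbouring regions.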

Related research:

- Causal Discovery in Hawkes Processes by Minimum Description Length (06/10/2022)
- Population structure-learned classifier for high-dimension low-sample-size class-imbalanced problem (09/10/2020)
- Model selection by minimum description length: Lower-bound sample sizes for the Fisher information approximation (08/01/2018)
- Bounded Guaranteed Algorithms for Concave Impurity Minimization Via Maximum Likelihood (11/08/2022)
- The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study (01/06/2012)
- Investigating the effect of binning on causal discovery (02/23/2022)
- Fast and fully-automated histograms for large-scale data sets (12/27/2022)
