Estimating the intrinsic dimension of datasets by a minimal neighborhood information

03/19/2018
by Elena Facco, et al.

Analyzing large volumes of high-dimensional data is of fundamental importance in data science, molecular simulations and beyond. Several approaches rest on the assumption that the important content of a dataset lies on a manifold whose intrinsic dimension (ID) is much lower than the raw number of coordinates. Such a manifold is generally twisted and curved, and the points on it are non-uniformly distributed; both factors make identifying the ID, and exploiting it, very hard. Here we propose a new ID estimator that uses only the distances to the first and second nearest neighbors of each point in the sample. This minimal choice of neighborhood reduces the effects of curvature and density variation, as well as the computational cost. The estimator is theoretically exact for uniformly distributed datasets and provides consistent measures in general. Combined with block analysis, it makes it possible to discriminate the relevant dimensions as a function of the block size. This allows the ID to be estimated even when the data lie on a manifold perturbed by high-dimensional noise, a situation often encountered in real-world datasets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.
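
The abstract does not spell out the estimator's formula, so the following is only a minimal sketch under a stated assumption: for locally uniform density, the ratio mu = r2/r1 of the second to the first nearest-neighbor distance follows a Pareto law whose exponent is the ID d, which gives the maximum-likelihood estimate d = N / sum(log mu). The function names `two_nn_id` and `sample_plane` are hypothetical, the brute-force neighbor search is chosen for clarity, not speed, and the fitting procedure used in the full paper may differ.

```python
import math
import random


def two_nn_id(points):
    """Estimate the intrinsic dimension from first and second
    nearest-neighbor distances (assumed Pareto law for r2 / r1)."""
    log_mu_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        # Brute-force search for the two smallest distances from p.
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0.0:  # skip exact duplicates, where the ratio is undefined
            log_mu_sum += math.log(r2 / r1)
            used += 1
    # Maximum-likelihood exponent of the assumed Pareto distribution.
    return used / log_mu_sum


def sample_plane(n, seed=0):
    """Hypothetical test data: a 2D square linearly embedded in 5 coordinates."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        pts.append((u, v, u + v, u - v, 2.0 * u))
    return pts


if __name__ == "__main__":
    estimate = two_nn_id(sample_plane(800))
    print(f"estimated ID: {estimate:.2f}")
```

In this toy example the data have five coordinates but only two degrees of freedom, so the estimate should come out close to 2. For large samples, the quadratic neighbor search would normally be replaced by a spatial index.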

research
01/18/2019

Estimating the effective dimension of large biological datasets using Fisher separability analysis

Modern large-scale datasets are frequently said to be high-dimensional. ...
research
02/27/2019

Clustering by the local intrinsic dimension: the hidden structure of real-world data

It is well known that a small number of variables is often sufficient to...
research
06/18/2019

Intrinsic dimension estimation for locally undersampled data

High-dimensional data are ubiquitous in contemporary science and finding...
research
10/11/2022

Intrinsic Dimension for Large-Scale Geometric Learning

The concept of dimension is essential to grasp the complexity of data. A...
research
02/11/2020

The role of intrinsic dimension in high-resolution player tracking data – Insights in basketball

A new range of statistical analysis has emerged in sports after the intr...
research
09/29/2022

Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis

Accurate estimation of Intrinsic Dimensionality (ID) is of crucial impor...
research
07/20/2022

Intrinsic dimension estimation for discrete metrics

Real world-datasets characterized by discrete features are ubiquitous: f...
