The Curse Revisited: a Newly Quantified Concept of Meaningful Distances for Learning from High-Dimensional Noisy Data

by Robin Vandaele et al.

Distances between data points are widely used in point cloud representation learning. Yet, it is no secret that under the effect of noise, these distances, and thus the models based upon them, may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we characterize such effects in high-dimensional data using an asymptotic probabilistic expression. Furthermore, while it has previously been argued that neighborhood queries become meaningless and unstable when the relative discrimination between the furthest and closest point is poor, we conclude that this is not necessarily the case when the ground truth data is explicitly separated from the noise. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be true even when we observe this discrimination to be poor. We include thorough empirical verification of our results, as well as experiments showing that the phase transition we derive, at which empirical neighbors change from random to meaningful, coincides with the phase transition at which common dimensionality reduction methods change from performing poorly to performing well at finding low-dimensional representations of high-dimensional data with dense noise.
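The "poor relative discrimination between the furthest and closest point" mentioned above is the classical distance-concentration effect. A minimal sketch of the phenomenon (not the paper's method; all names and the uniform-noise setup are illustrative assumptions) measures the relative contrast (d_max - d_min) / d_min of a query point's neighbors as the dimension grows:

```python
import random
import math

def relative_contrast(n_points, dim, seed=0):
    # Sample n_points uniformly from [0, 1]^dim (pure noise, no structure)
    # and measure how well a query's nearest and furthest neighbors are
    # discriminated: (d_max - d_min) / d_min. Concentration of distances
    # drives this ratio toward 0 as dim grows, which is the regime in
    # which neighborhood queries are traditionally called "meaningless".
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 20, 200, 2000):
    print(f"dim={dim:5d}  contrast={relative_contrast(500, dim):.3f}")
```

Running this shows the contrast shrinking by orders of magnitude between 2 and 2000 dimensions; the paper's point is that when the data is ground truth plus noise, rather than pure noise as here, a small observed contrast does not by itself imply that the empirical neighbors are wrong.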



Topological Stability: Guided Determination of the Nearest Neighbors in Non-Linear Dimensionality Reduction Techniques

In the machine learning field, dimensionality reduction is one of the import...

A Nonlinear Dimensionality Reduction Framework Using Smooth Geodesics

Existing dimensionality reduction methods are adept at revealing hidden ...

Clustering with UMAP: Why and How Connectivity Matters

Topology based dimensionality reduction methods such as t-SNE and UMAP h...

Incomplete Pivoted QR-based Dimensionality Reduction

High-dimensional big data appears in many research fields such as image ...

LLE with low-dimensional neighborhood representation

The local linear embedding algorithm (LLE) is a non-linear dimension-red...

Geodesic Learning via Unsupervised Decision Forests

Geodesic distance is the shortest path between two points in a Riemannia...

Synthesis parameter effect detection using quantitative representations and high dimensional distribution distances

Detection of effects of the parameters of the synthetic process on the m...
