The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data

01/16/2013
by Andrew Moore, et al.

This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached-sufficient-statistics accelerations of learning algorithms. It has recently been shown that for fewer than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration for a very wide range of statistical learning tasks such as kernel regression, locally weighted regression, k-means clustering, mixture modeling and Bayes Net learning. In this paper, we begin by defining the anchors hierarchy: a fast data structure and algorithm for localizing data based only on a triangle-inequality-obeying distance metric. We show how this, in its own right, gives a fast and effective clustering of data. But more importantly, we show how it can produce a well-balanced structure similar to a Ball-Tree (Omohundro, 1991) or a kind of metric tree (Uhlmann, 1991; Ciaccia, Patella, & Zezula, 1997) in a way that is neither "top-down" nor "bottom-up" but instead "middle-out". We then show how this structure, decorated with cached sufficient statistics, allows a wide variety of statistical learning algorithms to be accelerated even in thousands of dimensions.
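
To make the role of the triangle inequality concrete, the following is a minimal sketch of how a set of anchors might localize data using only a distance function. It is an illustration under stated assumptions, not the paper's full method: the names `build_anchors` and `euclidean` are hypothetical, Euclidean distance stands in for any metric that obeys the triangle inequality, the first anchor is arbitrarily taken to be point 0, and no hierarchy is built on top of the anchors. Each anchor keeps its owned points sorted by decreasing distance. When a new anchor is added at distance `gap` from an existing anchor, any point `x` within `gap / 2` of that existing anchor cannot be closer to the new one, since d(new, x) >= gap - d(anchor, x) > d(anchor, x), so the scan of that anchor's list can stop early.

```python
import math
import random


def euclidean(p, q):
    """Illustrative metric; any distance obeying the triangle inequality works."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def build_anchors(points, k, dist=euclidean):
    """Greedily choose up to k anchors (hypothetical helper, not the paper's code).

    Returns (anchors, owned, calls): anchor indices, one list per anchor of
    (distance, point_index) pairs sorted by decreasing distance, and the
    number of distance evaluations performed.
    """
    calls = 0

    def d(i, j):
        nonlocal calls
        calls += 1
        return dist(points[i], points[j])

    # Assumption: the first anchor is simply point 0 and owns every point.
    anchors = [0]
    owned = [sorted(((d(0, i), i) for i in range(len(points))), reverse=True)]

    while len(anchors) < k:
        # The next anchor is the point farthest from its current owner.
        src = max(range(len(anchors)), key=lambda a: owned[a][0][0])
        radius, new = owned[src][0]
        if radius == 0.0:
            break  # every point coincides with an anchor already
        stolen = []
        for a, anchor in enumerate(anchors):
            gap = d(new, anchor)
            kept = []
            for pos, (r, i) in enumerate(owned[a]):
                if r < gap / 2:
                    # Triangle inequality: d(new, x) >= gap - r > r, and the
                    # list is sorted, so no remaining point can change owner.
                    kept.extend(owned[a][pos:])
                    break
                dn = d(new, i)
                if dn < r:
                    stolen.append((dn, i))
                else:
                    kept.append((r, i))
            owned[a] = kept
        anchors.append(new)
        owned.append(sorted(stolen, reverse=True))
    return anchors, owned, calls


random.seed(0)
data = [tuple(random.random() for _ in range(50)) for _ in range(300)]
anchors, owned, calls = build_anchors(data, 10)
print(len(anchors), "anchors built with", calls, "distance evaluations")
```

The sketch only touches the data through `dist`, which is why the same idea carries over to non-Euclidean spaces. How much work the early exit saves depends on the data and the metric, so the printed count of distance evaluations is best read as something to experiment with rather than as a benchmark.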
