
A Triangle Inequality for Cosine Similarity
Similarity search is a fundamental problem for many data analysis techni...

Elkan's k-Means for Graphs
This paper extends k-means algorithms from the Euclidean domain to the d...

Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
This paper introduces new algorithms and data structures for quick count...

Accelerating Spherical k-Means
Spherical k-means is a widely used clustering algorithm for sparse and h...

An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality
Distances are pervasive in machine learning. They serve as similarity me...

U-statistical inference for hierarchical clustering
Clustering methods are a valuable tool for the identification of pattern...
The Anchors Hierarchy: Using the triangle inequality to survive high-dimensional data
This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached-sufficient-statistics accelerations of learning algorithms. It has recently been shown that for less than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration for a very wide range of statistical learning tasks such as kernel regression, locally weighted regression, k-means clustering, mixture modeling and Bayes net learning. In this paper, we begin by defining the anchors hierarchy: a fast data structure and algorithm for localizing data based only on a triangle-inequality-obeying distance metric. We show how this, in its own right, gives a fast and effective clustering of data. But more importantly we show how it can produce a well-balanced structure similar to a Ball-Tree (Omohundro, 1991) or a kind of metric tree (Uhlmann, 1991; Ciaccia, Patella, & Zezula, 1997) in a way that is neither "top-down" nor "bottom-up" but instead "middle-out". We then show how this structure, decorated with cached sufficient statistics, allows a wide variety of statistical learning algorithms to be accelerated even in thousands of dimensions.
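The core idea of the abstract can be sketched in code. The following is a minimal illustrative Python sketch, not the paper's implementation: each anchor keeps its owned points sorted by decreasing distance to its pivot, a new anchor is seeded at the point farthest from its current owner, and the triangle inequality lets each old anchor stop scanning as soon as a point is provably closer to the old pivot than to the new one. All names (`build_anchors`, `dist`) are hypothetical.

```python
import math

def dist(p, q):
    # Any metric obeying the triangle inequality works; Euclidean here.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_anchors(points, n_anchors):
    """Greedily grow n_anchors anchors, pruning distance computations
    with the triangle inequality (illustrative sketch)."""
    pivot = points[0]
    anchors = [{"pivot": pivot,
                "owned": sorted(points[1:],
                                key=lambda p: dist(p, pivot), reverse=True)}]
    while len(anchors) < n_anchors:
        # Seed the new anchor at the point farthest from its current owner.
        donor = max((a for a in anchors if a["owned"]),
                    key=lambda a: dist(a["owned"][0], a["pivot"]))
        new_pivot = donor["owned"][0]
        stolen_all = []
        for a in anchors:
            d_pivots = dist(a["pivot"], new_pivot)
            remaining = [p for p in a["owned"] if p is not new_pivot]
            kept, stolen = [], []
            for i, p in enumerate(remaining):
                d_old = dist(p, a["pivot"])
                if d_old < d_pivots / 2.0:
                    # Triangle inequality: d(p, new_pivot) >= d_pivots - d_old
                    # > d_old, so p, and every closer point after it in the
                    # sorted list, provably stays with its current anchor.
                    kept.extend(remaining[i:])
                    break
                if dist(p, new_pivot) < d_old:
                    stolen.append(p)
                else:
                    kept.append(p)
            a["owned"] = kept  # still sorted by decreasing distance
            stolen_all.extend(stolen)
        anchors.append({"pivot": new_pivot,
                        "owned": sorted(stolen_all,
                                        key=lambda p: dist(p, new_pivot),
                                        reverse=True)})
    return anchors
```

After construction, every point is owned by its nearest pivot, yet many point-to-new-pivot distances were never computed; the paper's "middle-out" tree building and cached-statistics decoration then operate on top of a localization step like this.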