A New Family of Near-metrics for Universal Similarity

07/21/2017
by   Chu Wang, et al.

We propose a family of near-metrics based on local graph diffusion to capture similarity for a wide class of data sets. These quasi-metametrics, as the name suggests, dispense with one or two standard axioms of metric spaces, specifically distinguishability and symmetry, so that similarity between data points of arbitrary type and form can be measured broadly and effectively. The proposed near-metric family includes the forward k-step diffusion and its reverse, typically on the graph consisting of data objects and their features. By construction, this family of near-metrics is particularly appropriate for categorical data, continuous data, and vector representations of images and text extracted via deep learning approaches. We conduct extensive experiments to evaluate the performance of this family of similarity measures, comparing them with the traditional similarity measures used for each specific application and with the ground truth when available. We show that for structured data, including categorical and continuous data, the near-metrics corresponding to normalized forward k-step diffusion (k small) are among the best-performing similarity measures; for vector representations of text and images, including those extracted from deep learning, the near-metrics derived from normalized and reverse k-step graph diffusion (k very small) exhibit an outstanding ability to distinguish data points from different classes.
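
As a rough illustration of the idea described above, the following Python sketch computes a k-step diffusion on a bipartite graph of data objects and features and turns it into an asymmetric dissimilarity. It is not the authors' definition: the abstract does not give the formulas, so the choice of one "step" as object -> feature -> object, the negative-log transform, and all function names (`k_step_diffusion`, `forward_near_metric`, `reverse_near_metric`) are assumptions made here for illustration only. The sketch also assumes every object has at least one feature and every feature occurs in at least one object.

```python
import numpy as np


def k_step_diffusion(weights, k):
    """k-step object-to-object diffusion probabilities (illustrative only).

    weights: (n_objects, n_features) non-negative array, where
    weights[i, f] is the strength of feature f in object i.
    Assumption: one step is a random walk object -> feature -> object.
    """
    weights = np.asarray(weights, dtype=float)
    # Object -> feature transition probabilities (each row sums to 1).
    obj_to_feat = weights / weights.sum(axis=1, keepdims=True)
    # Feature -> object transition probabilities (each row sums to 1).
    feat_to_obj = weights.T / weights.T.sum(axis=1, keepdims=True)
    one_step = obj_to_feat @ feat_to_obj
    return np.linalg.matrix_power(one_step, k)


def forward_near_metric(weights, k=1):
    """Hypothetical forward dissimilarity: -log of the k-step probability.

    Entries are +inf where object j is unreachable from object i in k steps.
    """
    probs = k_step_diffusion(weights, k)
    with np.errstate(divide="ignore"):
        return -np.log(probs)


def reverse_near_metric(weights, k=1):
    """Hypothetical reverse dissimilarity: diffusion measured from j to i."""
    return forward_near_metric(weights, k).T


if __name__ == "__main__":
    # Three objects described by three binary (categorical) features.
    w = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1]])
    print(forward_near_metric(w, k=1))
    print(reverse_near_metric(w, k=1))
```

Under these assumptions the resulting matrix is generally not symmetric, because objects with different numbers of features spread their probability mass differently, and its diagonal is generally nonzero, because a walk does not return to its starting object with certainty. This mirrors the two relaxed axioms mentioned in the abstract, although the paper's own construction and normalization may differ.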


