Generalized Compression Dictionary Distance as Universal Similarity Measure

10/21/2014
by Andrey Bogomolov, et al.

We present a new similarity measure, based on information-theoretic measures, that is superior to Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that Normalized Compression Dictionary Size and Normalized Compression Dictionary Entropy are computationally more efficient, because they eliminate the need to perform the compression itself. They also scale linearly as the vector size grows exponentially, and they are content independent. We show that normalized compression dictionary distance is compressor independent when limited to lossless compressors, which leaves room for optimization and faster implementations in real-time and big data applications. The introduced measure is applicable to machine learning tasks such as parameter-free unsupervised clustering, supervised learning (classification and regression), and feature selection, and to big data problems, with an order-of-magnitude speed increase.
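
For orientation, the baseline that the abstract compares against, Normalized Compression Distance, is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. The following is a minimal sketch of that baseline only, not of the dictionary-based measures, which the abstract does not define. The choice of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 suggest similar inputs; values near 1 suggest
    unrelated inputs. Real compressors only approximate the ideal,
    so results can fall slightly outside [0, 1].
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # size of the compressed concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note that this baseline runs the compressor three times per pair, which is the cost the abstract says the dictionary-based measures avoid.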


research
06/16/2010

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kol...
research
08/18/2021

Big Data in Astroinformatics – Compression of Scanned Astronomical Photographic Plates

Construction of Scanned Astronomical Photographic Plates(SAPPs) database...
research
11/26/2018

A Consolidated Approach to Convolutional Neural Networks and the Kolmogorov Complexity

The ability to precisely quantify similarity between various entities ha...
research
12/19/2003

Clustering by compression

We present a new method for clustering based on compression. The method ...
research
12/22/2012

Normalized Compression Distance of Multisets with Applications

Normalized compression distance (NCD) is a parameter-free, feature-free,...
research
10/02/2012

A fast compression-based similarity measure with applications to content-based image retrieval

Compression-based similarity measures are effectively employed in applic...
research
07/09/2014

Identifying Cover Songs Using Information-Theoretic Measures of Similarity

This paper investigates methods for quantifying similarity between audio...
