Efficient Binary Embedding of Categorical Data using BinSketch

11/13/2021
by   Bhisham Dev Verma, et al.

In this work, we present a dimensionality reduction algorithm, also known as sketching, for categorical datasets. Our proposed sketching algorithm, Cabin, constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm, Cham, computes a close approximation of the Hamming distance between any two original vectors from their sketches alone. The minimum sketch dimension that Cham requires to guarantee a good estimate depends, in theory, only on the sparsity of the data points, which makes it useful for many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world datasets, including one with over a million dimensions. We show that the Cabin and Cham duo is a fast and accurate approach for tasks such as RMSE, all-pairs similarity, and clustering when compared to working with the full dataset and with other dimensionality reduction techniques.
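To make the idea of estimating a Hamming distance from binary sketches concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' Cabin or Cham procedures, whose exact definitions are in the full text. It assumes that a category value of 0 means "missing", that each (attribute, category) pair is first mapped to a random bit, and that the resulting binary vector is compressed by OR-ing bits that fall into the same random bucket, in the spirit of BinSketch. The class name `BinarySketcher` and all of its methods and parameters are hypothetical.

```python
import math
import random


class BinarySketcher:
    """Hypothetical BinSketch-style sketcher for sparse categorical points.

    A point is a dict {attribute index: nonzero category}; absent attributes
    are treated as 0 (missing). Requires n_buckets >= 2.
    """

    def __init__(self, d, n_buckets, seed=0):
        self.n = n_buckets
        self._rng = random.Random(seed)
        # Random map shared by all points: attribute index -> bucket.
        self._bucket = [self._rng.randrange(n_buckets) for _ in range(d)]
        # (attribute, category) -> random bit, drawn lazily and then reused.
        self._bit = {}

    def _category_bit(self, i, c):
        key = (i, c)
        if key not in self._bit:
            self._bit[key] = self._rng.randrange(2)
        return self._bit[key]

    def sketch(self, point):
        """Binarise each nonzero attribute, then OR bits within each bucket."""
        s = [0] * self.n
        for i, c in point.items():
            if self._category_bit(i, c):
                s[self._bucket[i]] = 1
        return s

    def _size(self, ones):
        # Invert E[ones] = n * (1 - (1 - 1/n)^k) to estimate k, the number
        # of ones in the binary vector before bucketing.
        ones = min(ones, self.n - 1)  # avoid log(0) on a saturated sketch
        return math.log(1 - ones / self.n) / math.log(1 - 1 / self.n)

    def estimate_hamming(self, s, t):
        a = self._size(sum(s))
        b = self._size(sum(t))
        # The sketch of a bitwise OR equals the OR of the sketches.
        u = self._size(sum(x | y for x, y in zip(s, t)))
        binary_hamming = 2 * u - a - b  # |a OR b| - |a AND b|
        # Differing attributes yield differing bits with probability 1/2,
        # so the binary distance is doubled.
        return 2 * binary_hamming


if __name__ == "__main__":
    sk = BinarySketcher(d=10_000, n_buckets=2_000, seed=1)
    u = {i: 1 for i in range(100)}
    v = {i: 1 for i in range(50)}
    v.update({i: 2 for i in range(50, 100)})   # 50 attributes differ
    v.update({i: 3 for i in range(100, 150)})  # 50 more differ (u is 0 there)
    print("true Hamming distance: 100")
    print("estimate from sketches:", sk.estimate_hamming(sk.sketch(u), sk.sketch(v)))
```

In this illustration the estimate depends only on the two sketches of length `n_buckets`, not on the original dimension `d`. The random-bit step adds variance, so a single estimate can deviate noticeably from the true distance; the estimate for two identical points is exactly zero.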

research
12/01/2021

Dimensionality Reduction for Categorical Data

Categorical attributes are those that can take a discrete set of values,...
research
10/10/2019

Efficient Sketching Algorithm for Sparse Binary Data

Recent advancement of the WWW, IOT, social network, e-commerce, etc. hav...
research
06/26/2015

Clustering categorical data via ensembling dissimilarity matrices

We present a technique for clustering categorical data by generating man...
research
07/10/2018

A GPU-Oriented Algorithm Design for Secant-Based Dimensionality Reduction

Dimensionality-reduction techniques are a fundamental tool for extractin...
research
08/22/2023

Minwise-Independent Permutations with Insertion and Deletion of Features

In their seminal work, Broder et al. introduce the minHash algo...
research
09/17/2016

ADAGIO: Fast Data-aware Near-Isometric Linear Embeddings

Many important applications, including signal reconstruction, parameter ...
research
02/10/2022

Understanding Hyperdimensional Computing for Parallel Single-Pass Learning

Hyperdimensional computing (HDC) is an emerging learning paradigm that c...
