Dimensionality Reduction for Categorical Data

12/01/2021
by   Debajyoti Bera, et al.

Categorical attributes are those that can take a discrete set of values, e.g., colours. This work is about compressing vectors over categorical attributes to low-dimensional discrete vectors. Current hash-based methods for this task do not provide any guarantee on the Hamming distances between the compressed representations. Here we present FSketch to create sketches for sparse categorical data, together with an estimator that recovers the pairwise Hamming distances among the uncompressed data from their sketches alone. We claim that these sketches can be used in the usual data mining tasks in place of the original data without compromising the quality of the task. To that end, we ensure that the sketches are also categorical and sparse, and that the Hamming distance estimates are reasonably precise. Both the sketch construction and the Hamming distance estimation algorithms require just a single pass; furthermore, changes to a data point can be incorporated into its sketch efficiently. The compressibility depends on how sparse the data is and is independent of the original dimension, making our algorithm attractive for many real-life scenarios. Our claims are backed by rigorous theoretical analysis of the properties of FSketch and supplemented by extensive comparative evaluations with related algorithms on some real-world datasets. We show that FSketch is significantly faster, and that the accuracy obtained by using its sketches is among the top for the standard unsupervised tasks of RMSE, clustering and similarity search.
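
To make the idea of a hash-based categorical sketch with a Hamming distance estimator more concrete, here is a minimal Python illustration. The abstract does not spell out the construction, so everything below is an assumption made for illustration: each original coordinate is hashed to a random sketch slot and multiplied by a random coefficient modulo a prime `p`, and the estimator inverts the expected number of differing sketch slots under that scheme. The names `make_sketcher`, `estimate_hamming`, `n`, `d` and `p` are hypothetical, and this should not be read as the authors' implementation.

```python
import math
import random


def make_sketcher(n, d, p, seed=0):
    """Build a sketch function for sparse categorical vectors.

    Assumed (illustrative) scheme: n is the original dimension, d the
    sketch dimension, and p a prime larger than every category value.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(d) for _ in range(n)]  # coordinate -> sketch slot
    coeff = [rng.randrange(p) for _ in range(n)]   # random multiplier mod p

    def sketch(sparse_vec):
        # sparse_vec: dict {index: category value in 1..p-1}; absent means 0.
        out = [0] * d
        for i, v in sparse_vec.items():            # single pass over nonzeros
            out[bucket[i]] = (out[bucket[i]] + coeff[i] * v) % p
        return out

    return sketch


def estimate_hamming(s1, s2, p):
    """Estimate the Hamming distance h of the originals from two sketches.

    Under the assumed scheme, a sketch slot differs with probability
    (1 - 1/p) * (1 - (1 - 1/d)**h); this function inverts that relation.
    """
    d = len(s1)
    diff = sum(a != b for a, b in zip(s1, s2))
    ratio = diff * p / (d * (p - 1))
    if ratio >= 1.0:
        return float("inf")                        # sketch too small to tell
    return math.log(1.0 - ratio) / math.log(1.0 - 1.0 / d)


if __name__ == "__main__":
    sketch = make_sketcher(n=10_000, d=1_000, p=31, seed=1)
    x = {i: (i % 30) + 1 for i in range(0, 10_000, 100)}   # 100 nonzeros
    y = dict(x)
    for i in list(x)[:50]:
        y[i] = (x[i] % 30) + 1                     # change 50 coordinates
    print(estimate_hamming(sketch(x), sketch(y), p=31))
```

Because each slot of this sketch is a sum modulo `p`, changing one coordinate of a data point only requires adjusting a single slot, which is one way the efficient updates mentioned above could be realised. The sketch dimension `d` needs to scale with the number of nonzero entries rather than with `n`, in line with the claim that compressibility depends on sparsity.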

research
11/13/2021

Efficient Binary Embedding of Categorical Data using BinSketch

In this work, we present a dimensionality reduction algorithm, aka. sket...
research
10/10/2019

Efficient Sketching Algorithm for Sparse Binary Data

Recent advancement of the WWW, IOT, social network, e-commerce, etc. hav...
research
01/04/2023

A general framework for implementing distances for categorical variables

The degree to which subjects differ from each other with respect to cert...
research
04/16/2021

Parameterized Complexity of Categorical Clustering with Size Constraints

In the Categorical Clustering problem, we are given a set of vectors (ma...
research
10/04/2017

Massively Parallel Algorithms and Hardness for Single-Linkage Clustering Under ℓ_p-Distances

We present massively parallel (MPC) algorithms and hardness of approxima...
research
08/26/2019

Sufficient Representations for Categorical Variables

Many learning algorithms require categorical data to be transformed into...
research
08/22/2023

Minwise-Independent Permutations with Insertion and Deletion of Features

In their seminal work, Broder et al. introduce the minHash algo...
