Pb-Hash: Partitioned b-bit Hashing

06/28/2023
by Ping Li, et al.

Many hashing algorithms, including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS), generate integers of B bits. With k hashes for each data vector, the storage would be B × k bits; and when used for large-scale learning, the model size would be 2^B × k, which can be expensive. A standard strategy is to use only the lowest b bits out of the B bits and somewhat increase k, the number of hashes. In this study, we propose to re-use the hashes by partitioning the B bits into m chunks, e.g., b × m = B. Correspondingly, the model size becomes m × 2^b × k, which can be substantially smaller than the original 2^B × k. Our theoretical analysis reveals that partitioning the hash values into m chunks reduces accuracy: using m chunks of B/m bits is not as accurate as directly using B bits, due to the correlation introduced by re-using the same hash. On the other hand, our analysis also shows that the accuracy does not drop much for (e.g.) m = 2∼4, and in some regions Pb-Hash still works well even for m much larger than 4. We expect Pb-Hash to be a good addition to the family of hashing methods/applications and to benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using an embedding table for each hash. With Pb-Hash, we need an effective strategy to combine the m embeddings. Our study provides an empirical evaluation of four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definitive answer as to which pooling scheme is always better, and we leave that question for future study.
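The two ingredients described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`pb_hash_partition`, `combine_embeddings`) and the choice of B = 16, m = 4 are hypothetical. The first function splits one B-bit hash value into m chunks of b = B/m bits, each a categorical ID in [0, 2^b), which is what shrinks the model size from 2^B × k to m × 2^b × k; the second shows the four pooling schemes for combining the m per-chunk embeddings.

```python
def pb_hash_partition(hash_value: int, B: int = 16, m: int = 4) -> list[int]:
    """Split one B-bit hash into m chunks of b = B // m bits (low to high)."""
    assert B % m == 0, "B must be divisible by m"
    b = B // m
    mask = (1 << b) - 1
    # Each chunk is an ID in [0, 2^b), indexing its own small embedding table.
    return [(hash_value >> (i * b)) & mask for i in range(m)]


def combine_embeddings(embs: list[list[float]], mode: str = "concat") -> list[float]:
    """Combine m chunk embeddings via 'concat', 'max', 'mean', or 'product'."""
    if mode == "concat":
        return [x for e in embs for x in e]
    if mode == "max":
        return [max(xs) for xs in zip(*embs)]
    if mode == "mean":
        return [sum(xs) / len(embs) for xs in zip(*embs)]
    if mode == "product":
        out = [1.0] * len(embs[0])
        for e in embs:
            out = [o * x for o, x in zip(out, e)]
        return out
    raise ValueError(f"unknown pooling mode: {mode}")


# Example: a 16-bit hash split into four 4-bit chunks (low-to-high nibbles).
chunks = pb_hash_partition(0xBEEF, B=16, m=4)  # -> [0xF, 0xE, 0xE, 0xB]
```

In a learning pipeline, each of the m chunk IDs would look up its own embedding table of size 2^b, and the looked-up vectors would then be merged with one of the pooling modes above before being fed to the model.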

