Accurate Estimators for Improving Minwise Hashing and b-Bit Minwise Hashing

08/03/2011
by Ping Li, et al.

Minwise hashing is the standard technique in search and database applications for efficiently estimating similarities between sets (e.g., high-dimensional 0/1 vectors). The recently proposed b-bit minwise hashing significantly improves upon the original minwise hashing in practice by storing only the lowest b bits of each hashed value, as opposed to using 64 bits. b-bit hashing is particularly effective in applications that mainly concern highly similar sets (e.g., resemblance > 0.5). However, there are other important applications in which pairs of high similarity are not the only ones that matter. For example, many learning algorithms require all pairwise similarities, and only a small fraction of the pairs are expected to be similar. Furthermore, many applications care more about containment (e.g., how much of one object is contained in another) than about resemblance. In this paper, we show that the estimators for minwise hashing and b-bit minwise hashing used in current practice can be systematically improved, and that the improvements are most significant for set pairs of low resemblance and high containment.
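For orientation, the baseline estimators that the abstract calls "current practice" can be sketched as follows. This is a minimal illustration and not the improved estimators proposed in the paper. All function names are hypothetical, a salted BLAKE2b hash stands in for a random permutation, and the b-bit correction assumes the sparse-data simplification in which two unequal minima agree on their lowest b bits with probability 1/2^b. The containment helper relies only on the identity R = a / (f1 + f2 - a), where a is the intersection size and f1, f2 are the set sizes.

```python
import hashlib


def hash64(seed: int, item) -> int:
    """64-bit hash of `item` under a given seed (stand-in for one random permutation)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, k: int):
    """Keep the minimum hashed value under each of k seeded hash functions."""
    return [min(hash64(seed, x) for x in items) for seed in range(k)]


def estimate_resemblance(sig1, sig2) -> float:
    """Standard minwise estimator: fraction of positions where the minima collide."""
    matches = sum(h1 == h2 for h1, h2 in zip(sig1, sig2))
    return matches / len(sig1)


def estimate_resemblance_b_bit(sig1, sig2, b: int) -> float:
    """b-bit estimator under the sparse-data assumption (illustrative simplification).

    Only the lowest b bits are compared. Accidental matches occur with
    probability about 1/2^b, so the raw match rate is corrected for them.
    """
    mask = (1 << b) - 1
    matches = sum((h1 & mask) == (h2 & mask) for h1, h2 in zip(sig1, sig2))
    p_hat = matches / len(sig1)
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)


def containment_from_resemblance(r: float, f1: int, f2: int) -> float:
    """Fraction of set 1 contained in set 2, given resemblance r and set sizes f1, f2."""
    a = r * (f1 + f2) / (1.0 + r)  # intersection size implied by r
    return a / f1


# Two sets with 100 shared elements out of 200 distinct ones: true resemblance is 0.5.
set_a = set(range(0, 150))
set_b = set(range(50, 200))
sig_a = minhash_signature(set_a, k=400)
sig_b = minhash_signature(set_b, k=400)

print("full-hash estimate:", estimate_resemblance(sig_a, sig_b))
print("1-bit estimate:    ", estimate_resemblance_b_bit(sig_a, sig_b, b=1))
print("containment at R=0.5:", containment_from_resemblance(0.5, 150, 150))
```

With the same number of hashes, the 1-bit estimate is noisier than the full-hash estimate, but each stored value takes 1 bit instead of 64, which is the storage trade-off the abstract describes.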


