Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

08/15/2011
by Ping Li, et al.

We generated a 200 GB dataset with 10^9 features to test our recent b-bit minwise hashing algorithms for training very large-scale logistic regression and SVM. The results confirm our prior work: compared with the VW hashing algorithm (which has the same variance as random projections), b-bit minwise hashing is substantially more accurate at the same storage. For example, with merely 30 hashed values per data point, b-bit minwise hashing can achieve accuracy similar to that of VW with 2^14 hashed values per data point. We demonstrate that the preprocessing cost of b-bit minwise hashing is roughly on the same order of magnitude as the data loading time. Furthermore, by using a GPU, the preprocessing cost can be reduced to a small fraction of the data loading time. Minwise hashing has been widely used in industry, at least in the context of search. One reason for its popularity is that permutations can be simulated efficiently by, e.g., universal hashing, so there is no need to store the permutation matrix. In this paper, we empirically verify this practice by demonstrating that even the simplest 2-universal hashing does not degrade the learning performance.
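
As a rough illustration of the idea in the abstract, the following Python sketch simulates k permutations with a 2-universal hash family, keeps only the lowest b bits of each minimum, and expands the result into the sparse binary vector a linear learner would consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (make_hash_params, bbit_minhash, expand_sketch), the choice of prime, the hash form ((a*t + c) mod p) mod D, and the value b = 8 are all illustrative choices not given in the abstract, and each data point is assumed to be a non-empty set of nonzero feature indices.

```python
import random

# Assumption: any prime larger than the feature dimension D would do here.
PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, c) pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def bbit_minhash(nonzero_indices, params, b, dim):
    """Return k values, each the lowest b bits of a minimum hashed index.

    nonzero_indices: non-empty iterable of feature indices in [0, dim).
    """
    mask = (1 << b) - 1
    sketch = []
    for a, c in params:
        # 2-universal hash stands in for a random permutation of [0, dim).
        min_val = min(((a * t + c) % PRIME) % dim for t in nonzero_indices)
        sketch.append(min_val & mask)  # keep only the lowest b bits
    return sketch


def expand_sketch(sketch, b):
    """Map the k b-bit values to the positions of the ones in a
    binary vector of length 2^b * k (one block of 2^b slots per hash)."""
    width = 1 << b
    return [j * width + v for j, v in enumerate(sketch)]


# Illustrative use: k = 30 hashed values per data point, D = 10^9 features.
params = make_hash_params(k=30, seed=1)
example_point = {3, 17, 999_999_999}
sketch = bbit_minhash(example_point, params, b=8, dim=10**9)
feature_positions = expand_sketch(sketch, b=8)
```

With these settings each data point is stored as 30 values of 8 bits each, and the expanded vector has 2^8 × 30 = 7680 dimensions with exactly 30 ones. Only the k pairs (a, c) need to be kept in memory, rather than k explicit permutations of 10^9 indices.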

research
06/06/2011

Hashing Algorithms for Large-Scale Learning

In this paper, we first demonstrate that b-bit minwise hashing, whose es...
research
08/06/2012

One Permutation Hashing for Efficient Search and Learning

Recently, the method of b-bit minwise hashing has been applied to large-...
research
08/03/2011

Accurate Estimators for Improving Minwise Hashing and b-Bit Minwise Hashing

Minwise hashing is the standard technique in the context of search and d...
research
10/18/2019

The Bitwise Hashing Trick for Personalized Search

Many real world problems require fast and efficient lexical comparison o...
research
05/23/2011

b-Bit Minwise Hashing for Large-Scale Linear SVM

In this paper, we propose to (seamlessly) integrate b-bit minwise hashin...
research
03/30/2018

Engineering a Simplified 0-Bit Consistent Weighted Sampling

The Min-Hashing approach to sketching has become an important tool in da...
research
07/28/2020

Model Size Reduction Using Frequency Based Double Hashing for Recommender Systems

Deep Neural Networks (DNNs) with sparse input features have been widely ...