Fully Understanding the Hashing Trick

05/22/2018 · Casper Benjamin Freksen et al., Aarhus Universitet

Feature hashing, also known as the hashing trick, introduced by Weinberger et al. (2009), is one of the key techniques used in scaling-up machine learning algorithms. Loosely speaking, feature hashing uses a random sparse projection matrix $A : \mathbb{R}^n \to \mathbb{R}^m$ (where $m \ll n$) in order to reduce the dimension of the data from $n$ to $m$ while approximately preserving the Euclidean norm. Every column of $A$ contains exactly one non-zero entry, equal to either $-1$ or $1$. Weinberger et al. showed tail bounds on $\|Ax\|_2^2$. Specifically, they showed that for every $\varepsilon, \delta$, if $\|x\|_\infty / \|x\|_2$ is sufficiently small and $m$ is sufficiently large, then $\Pr\big[\,|\|Ax\|_2^2 - \|x\|_2^2| < \varepsilon \|x\|_2^2\,\big] > 1 - \delta$. These bounds were later extended by Dasgupta et al. (2010) and most recently refined by Dahlgaard et al. (2017); however, the true nature of the performance of this key technique, and specifically the correct tradeoff between the pivotal parameters $\|x\|_\infty / \|x\|_2$, $m$, $\varepsilon$, $\delta$, remained an open question. We settle this question by giving tight asymptotic bounds on the exact tradeoff between the central parameters, thus providing a complete understanding of the performance of feature hashing. We complement the asymptotic bound with empirical data, which shows that the constants "hiding" in the asymptotic notation are, in fact, very close to 1, thus further illustrating the tightness of the presented bounds in practice.


1 Introduction

Dimensionality reduction that approximately preserves Euclidean distances is a key tool used as a preprocessing step in many geometric, algebraic and classification algorithms, whose performance heavily depends on the dimension of the input. Loosely speaking, a distance-preserving dimensionality reduction is an (often random) embedding of a high-dimensional Euclidean space into a space of low dimension, such that the distance between every two points is approximately preserved (with high probability). Its applications range over nearest neighbor search [AC09, HIM12], classification and regression [RR08, MM09, PBMID14], manifold learning [HWB08], sparse recovery [CT06] and numerical linear algebra [CW09, MM13, Sár06]. For more applications see, e.g., [Vem05].

One of the most fundamental results in the field was presented in the seminal paper by Johnson and Lindenstrauss [JL84].

Lemma 1 (Distributional JL Lemma).

For every $\varepsilon, \delta \in (0,1)$ and $n \in \mathbb{N}$, there exists a random projection matrix $A \in \mathbb{R}^{m \times n}$, where $m = O(\varepsilon^{-2} \log(1/\delta))$, such that for every $x \in \mathbb{R}^n$,

$$\Pr\Big[\,\big|\|Ax\|_2^2 - \|x\|_2^2\big| < \varepsilon \|x\|_2^2\,\Big] > 1 - \delta. \qquad (1)$$

The target dimension $m = \Theta(\varepsilon^{-2} \log(1/\delta))$ in the lemma is known to be optimal [JW13, LN17].

Running Time Performances.

Perhaps the most common proof of the lemma (see, e.g., [DG03, Mat08]) samples a projection matrix by independently sampling each entry from a standard Gaussian (or Rademacher) distribution. Such matrices are by nature very dense, and thus a naïve embedding runs in $O(m \cdot \|x\|_0)$ time, where $\|x\|_0$ is the number of non-zero entries of $x$.
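To make the running-time discussion concrete, the following is a minimal NumPy sketch (not from the paper) of the dense construction: each entry is an independent Gaussian scaled by $1/\sqrt{m}$, and applying the matrix naively costs time proportional to $m$ times the number of non-zero entries of $x$. The dimensions and random seed are illustrative choices.

```python
import numpy as np

def dense_jl_matrix(m, n, rng):
    """Dense JL projection: each entry i.i.d. Gaussian with variance 1/m."""
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(m), size=(m, n))

rng = np.random.default_rng(0)
n, m = 10_000, 200
A = dense_jl_matrix(m, n, rng)

x = rng.normal(size=n)
x /= np.linalg.norm(x)          # unit vector, ||x||_2 = 1

y = A @ x                        # naive embedding: O(m * nnz(x)) time for dense A
print(np.linalg.norm(y) ** 2)    # concentrates around ||x||_2^2 = 1
```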

Due to the algorithmic significance of the lemma, much effort was invested in finding techniques to accelerate the embedding time. One fruitful approach for accomplishing this goal is to consider a distribution over sparse projection matrices. This line of work was initiated by Achlioptas [Ach03], who constructed a distribution over projection matrices in which the expected fraction of non-zero entries is at most one third, while maintaining the target dimension. The best result to date in constructing a sparse Johnson-Lindenstrauss matrix is due to Kane and Nelson [KN14], who presented a distribution over matrices satisfying (1) in which every column has at most $s = O(\varepsilon^{-1} \log(1/\delta))$ non-zero entries. Conversely, Nelson and Nguyễn [NN13] showed that this is almost asymptotically optimal. That is, every distribution over matrices satisfying (1) with $m = O(\varepsilon^{-2} \log(1/\delta))$, and such that every column has at most $s$ non-zero entries, must satisfy $s = \Omega(\varepsilon^{-1} \log(1/\delta) / \log(1/\varepsilon))$.

While the bound presented by Nelson and Nguyễn is theoretically tight, we can provably still do much better in practice. Specifically, the lower bound is attained on vectors $x$ for which, loosely speaking, the "mass" of $x$ is concentrated in few entries; formally, the ratio $\|x\|_\infty / \|x\|_2$ is large. However, in practical scenarios, such as the term frequency - inverse document frequency representation of a document, we may often assume that the mass of $x$ is "well-distributed" over many entries (that is, $\|x\|_\infty / \|x\|_2$ is small). In these common scenarios, projection matrices which are significantly sparser turn out to be very effective.

Feature Hashing.

In the pursuit of sparse projection matrices, Weinberger et al. [WDL09] introduced dimensionality reduction via Feature Hashing, in which the projection matrix $A$ is, in a sense, as sparse as possible: every column of $A$ contains exactly one non-zero entry, randomly chosen from $\{-1, 1\}$. This technique is one of the most influential mathematical tools in the study of scaling-up machine learning algorithms, mainly due to its simplicity and good performance in practice [Dal13, Sut15]. More formally, for $m, n \in \mathbb{N}$, the projection matrix $A \in \mathbb{R}^{m \times n}$ is sampled as follows. Sample $h : [n] \to [m]$ and $\sigma : [n] \to \{-1, 1\}$ independently and uniformly at random. For every $i \in [m]$ and $j \in [n]$, let $A_{ij} = \sigma(j) \cdot \mathbb{1}_{h(j) = i}$ (that is, $A_{ij} = \sigma(j)$ if $h(j) = i$ and $A_{ij} = 0$ otherwise). Weinberger et al. additionally showed exponential tail bounds on $\|Ax\|_2^2$ when the ratio $\|x\|_\infty / \|x\|_2$ is sufficiently small and $m$ is sufficiently large. These bounds were later improved by Dasgupta et al. [DKS10], and most recently Dahlgaard, Knudsen and Thorup [DKT17] further sharpened these concentration bounds. Conversely, a result by Kane and Nelson [KN14] implies that if we allow $\|x\|_\infty / \|x\|_2$ to be too large, then there exist vectors for which (1) does not hold.
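As an illustration of the construction just described, here is a minimal NumPy sketch of sampling a feature hashing matrix and applying it to a vector; the fully random tables for $h$ and $\sigma$ follow the description above, though practical implementations would usually replace them with hash functions. Parameter values are illustrative.

```python
import numpy as np

def feature_hashing_matrix(m, n, rng):
    """Sample a feature hashing matrix: one nonzero entry (+1 or -1) per column."""
    h = rng.integers(0, m, size=n)            # h : [n] -> [m]
    sigma = rng.choice([-1.0, 1.0], size=n)   # sigma : [n] -> {-1, +1}
    A = np.zeros((m, n))
    A[h, np.arange(n)] = sigma                # A_{h(j), j} = sigma(j)
    return A

rng = np.random.default_rng(1)
n, m = 10_000, 500
A = feature_hashing_matrix(m, n, rng)

x = rng.normal(size=n)
x /= np.linalg.norm(x)                        # for such x, ||x||_inf / ||x||_2 is small
print(np.linalg.norm(A @ x) ** 2)             # typically close to ||x||_2^2 = 1
```

Note that $Ax$ can also be computed directly, coordinate by coordinate, in time proportional to the number of non-zero entries of $x$ without ever materializing $A$, which is what makes feature hashing attractive at scale.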

Finding the correct tradeoffs between $m$, $\varepsilon$, $\delta$ and $\|x\|_\infty / \|x\|_2$ for which feature hashing performs well remained an open problem. Our main contribution is settling this problem, and providing a complete and comprehensive understanding of the performance of feature hashing.

1.1 Main results

The main result of this paper is a tight tradeoff between the target dimension $m$, the approximation ratio $\varepsilon$, the error probability $\delta$ and the ratio $\|x\|_\infty / \|x\|_2$. More formally, for $m \in \mathbb{N}$ and $\varepsilon, \delta \in (0,1)$, let $\nu = \nu(m, \varepsilon, \delta)$ be the maximum $\nu$ such that for every non-zero $x \in \mathbb{R}^n$, if $\|x\|_\infty / \|x\|_2 \le \nu$ then (1) holds. Our main result is the following theorem, which gives tight asymptotic bounds for the performance of feature hashing, thus closing the long-standing gap.

Theorem 2.

There exist constants such that for every and the following holds. If then

Otherwise, if then . Moreover if then .

While the bound presented in the theorem may seem surprising, due to the intricacy of the expressions involved, the tightness of the result shows that this is, in fact, the correct and "true" bound. Moreover, the proof of the theorem demonstrates how both branches in the expression are required in order to give a tight bound.

Experimental Results.

Our theoretical bounds are accompanied by empirical results that shed light on the nature of the constants in Theorem 2. Our empirical results show that in practice the constants inside the Theta-notation are significantly tighter than the theoretical proof might suggest, and in fact feature hashing performs well for a larger scope of vectors. Specifically, our empirical results indicate that the constant in front of the admissible ratio $\|x\|_\infty / \|x\|_2$ is close to 1 (except for very sparse vectors), whereas the theoretical proof only provides a smaller constant. Since feature hashing satisfies (1) whenever $\|x\|_\infty / \|x\|_2 \le \nu$, this implies that feature hashing works well on an even larger range of vectors than the theory suggests.
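The following simulation sketch is not the authors' experimental setup, but illustrates how one could probe this behavior empirically: it estimates the failure probability of feature hashing on unit vectors supported on $k$ equal coordinates, for which $\|x\|_\infty / \|x\|_2 = 1/\sqrt{k}$. All parameters ($m$, $n$, $\varepsilon$, trial counts) are arbitrary illustrative choices.

```python
import numpy as np

def failure_rate(m, n, k, eps, trials, rng):
    """Fraction of trials where | ||Ax||_2^2 - 1 | > eps for x supported on k coordinates."""
    x = np.zeros(n)
    x[:k] = 1.0 / np.sqrt(k)                  # unit vector with ||x||_inf / ||x||_2 = 1/sqrt(k)
    failures = 0
    for _ in range(trials):
        h = rng.integers(0, m, size=n)
        sigma = rng.choice([-1.0, 1.0], size=n)
        y = np.zeros(m)
        np.add.at(y, h, sigma * x)            # y = Ax, computed without forming A
        failures += abs(np.dot(y, y) - 1.0) > eps
    return failures / trials

rng = np.random.default_rng(2)
for k in (4, 16, 64, 256):
    print(k, failure_rate(m=256, n=1024, k=k, eps=0.2, trials=2000, rng=rng))
```

As $k$ grows (i.e., as the ratio $\|x\|_\infty / \|x\|_2$ shrinks), the empirical failure rate drops, which is the qualitative behavior the bounds describe.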

Proof Technique

As a fundamental step in the proof of Theorem 2, we prove tight asymptotic bounds for high-order norms of the approximation factor. (Given a random variable $X$ and $p \ge 1$, the $p$th norm of $X$, if it exists, is defined as $\|X\|_p = \big(\mathbb{E}\,|X|^p\big)^{1/p}$.) The technical crux of our results is tight bounds on these high-order moments. Note that by rescaling we may restrict our focus, without loss of generality, to unit vectors.

Notation 1.

For every denote

In these notations our main technical lemmas are the following.

Lemma 3.

For every even and unit vector , .

Lemma 4.

For every and even , , where is the unit vector whose first entries equal .

While it might seem at first glance that bounding these high-order moments is merely a technical issue, known tools and techniques could not be used to prove Lemmas 3 and 4. In particular, earlier work by Kane and Nelson [KN14, CJN18] and Freksen and Larsen [FL17] used high-order moment bounds as a step in proving probability tail bounds of random variables. The existing techniques, however, cannot be adapted to bound the high-order moments we need (see also Section 1.2), and novel approaches were needed. Specifically, our proof incorporates a novel combinatorial scheme for counting edge-labeled Eulerian graphs.

Previous Results.

Weinberger et al. [WDL09] showed that if , then . Dasgupta et al. [DKS10] showed that under similar conditions . These bounds were recently improved by Dahlgaard et al. [DKT17], who showed that . Conversely, Kane and Nelson [KN14] showed that for the restricted case of , , which matches the bound in Theorem 2 if, in addition, .

Key Tool: Counting Labeled Eulerian Graphs.

Our proof presents a new combinatorial result concerning Eulerian graphs. Loosely speaking, we give asymptotic bounds for the number of labeled Eulerian graphs containing a predetermined number of nodes and edges. Formally, let be integers such that . Let denote the family of all edge-labeled Eulerian multigraphs , such that

  1. has no isolated vertices;

  2. , and is a bijection, which assigns a label in to each edge; and

  3. the number of connected components in is .

Notation 2.

Denote .

Theorem 5.

.
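Since the formal statement above is parameter-heavy, the following brute-force sketch may help clarify the type of object being counted. It assumes edges are unordered pairs of distinct vertices (no self-loops) over a fixed vertex set and identifies an edge labeling with an ordered tuple of edges; these conventions may differ from the paper's exact multigraph definition, and the enumeration is only feasible for tiny parameters.

```python
from itertools import combinations, product


def count_labeled_eulerian(v, k, c):
    """Brute-force count of edge-labeled multigraphs on vertex set {0, ..., v-1}
    with k labeled edges, all vertex degrees even (Eulerian components),
    no isolated vertices, and exactly c connected components."""

    def find(parent, a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    pairs = list(combinations(range(v), 2))    # candidate edges: unordered, no self-loops
    total = 0
    for edges in product(pairs, repeat=k):     # an edge labeling = an ordered k-tuple of edges
        deg = [0] * v
        parent = list(range(v))
        for u, w in edges:
            deg[u] += 1
            deg[w] += 1
            parent[find(parent, u)] = find(parent, w)
        if any(d == 0 for d in deg):           # no isolated vertices
            continue
        if any(d % 2 for d in deg):            # every degree even
            continue
        if len({find(parent, u) for u in range(v)}) == c:
            total += 1
    return total


print(count_labeled_eulerian(v=3, k=4, c=1))   # tiny, purely illustrative example
```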

1.2 Related Work

The CountSketch scheme, presented by Charikar et al. [CCF04], was shown to satisfy (1) by Thorup and Zhang [TZ12]. The scheme essentially samples several independent copies of a feature hashing matrix and applies them all to $x$. The estimator for $\|x\|_2^2$ is then given by computing the median norm over all projected vectors. The CountSketch scheme thus constructs a sketching matrix in which every column has several non-zero entries. However, this construction does not provide a norm-preserving embedding into a Euclidean space (that is, the estimator of $\|x\|_2^2$ cannot be represented as a norm of a single projected vector), which is essential for some applications such as nearest-neighbor search [HIM12].
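Here is a minimal sketch of the median-of-norms estimator just described; the number of copies and rows are illustrative choices rather than the original construction's parameters, and the median of the squared norms serves as the estimate of $\|x\|_2^2$.

```python
import numpy as np

def countsketch_norm_estimate(x, m, d, rng):
    """Median, over d independent feature hashing copies with m rows each,
    of the squared norm of the projected vector."""
    n = len(x)
    estimates = []
    for _ in range(d):
        h = rng.integers(0, m, size=n)
        sigma = rng.choice([-1.0, 1.0], size=n)
        y = np.zeros(m)
        np.add.at(y, h, sigma * x)            # one feature hashing projection of x
        estimates.append(np.dot(y, y))
    return np.median(estimates)

rng = np.random.default_rng(3)
x = rng.normal(size=5_000)
print(countsketch_norm_estimate(x, m=300, d=7, rng=rng), np.dot(x, x))
```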

Kane and Nelson [KN14] presented a simple construction for the so-called sparse Johnson-Lindenstrauss transform. This is a distribution over matrices $A \in \mathbb{R}^{m \times n}$ in which every column has $s$ non-zero entries, each randomly chosen from $\{-1/\sqrt{s}, +1/\sqrt{s}\}$. Note that if $s = 1$, this distribution yields the feature hashing one. Kane and Nelson showed that for $s = \Theta(\varepsilon^{-1} \log(1/\delta))$ this construction satisfies (1). Recently, Cohen et al. [CJN18] presented two simple proofs of this result. While their proof methods give (simple) bounds for high-order moments similar to those in Lemmas 3 and 4, they rely heavily on the fact that $s$ is relatively large. Specifically, for $s = 1$ the bounds their method or an extension thereof give are trivial.
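For contrast with feature hashing ($s = 1$), the following sketch samples a column-sparse matrix with $s$ non-zero entries per column drawn from $\{-1/\sqrt{s}, +1/\sqrt{s}\}$, with the non-zero positions of each column chosen uniformly among the rows. The paper's constructions may organize the non-zero positions differently (e.g., block-wise), so treat this as an illustrative variant only.

```python
import numpy as np

def sparse_jl_matrix(m, n, s, rng):
    """Column-sparse matrix: each column has s nonzeros equal to +-1/sqrt(s),
    placed in s distinct rows chosen uniformly at random (illustrative variant)."""
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        signs = rng.choice([-1.0, 1.0], size=s)
        A[rows, j] = signs / np.sqrt(s)
    return A

rng = np.random.default_rng(4)
n = 8_000
A = sparse_jl_matrix(m=400, n=n, s=8, rng=rng)
x = rng.normal(size=n)
x /= np.linalg.norm(x)
print(np.linalg.norm(A @ x) ** 2)   # close to 1; setting s = 1 recovers feature hashing
```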

2 Counting Labeled Eulerian Graphs

In this section we prove Theorem 5. For the upper bound, we give an encoding scheme and show that every graph in the family can be encoded in a succinct manner, thus bounding the size of the family.

Encoding Argument.

Fix a graph , and let be its ordered sequence of edges. In what follows, we give an encoding algorithm that, given , produces a "short" bit-string that encodes . The string is a concatenation of three strings, encoded as follows.

Let be the set of connected components of ordered by the smallest labeled node in each component, and for every , denote the graph induced by in by . For every the encoding algorithm chooses a set of edges of a spanning tree in . Denote by the union of all trees in .

Proposition 6.

.

Proof.

For every , . Therefore . ∎

Let be the ordering of induced by . The algorithm encodes to be the list of edges in , followed by an encoding of as a set in . Next, since every connected component is Eulerian, for every , there is an edge . Let denote the set of all such edges, and let be the ordering of induced by . For every , the algorithm encodes a pair , and appends them in order together with to encode . Finally, the algorithm encodes in the ordering induced by as a list of length in . Denote this list of the rest of the edges by .

Lemma 7.

can be encoded using at most bits.

Proof.

In order to bound the length of we shall bound each of the three strings separately. One can encode an ordered list of distinct unordered pairs in using at most bits. Therefore can be encoded using at most

(2)

bits.

Next, for every , can be encoded using bits. Therefore can be encoded using at most

(3)

bits, where the last inequality follows from the AM-GM inequality, since .

Finally, note that can be encoded using bits. Since

we get that can be encoded using

(4)

bits. Summing over (2), (3) and (4) implies the lemma. ∎

Lemma 8.

Given , one can reconstruct .

Proof.

In order to prove the lemma, we give a decoding algorithm that receives and constructs . The algorithm first reads the first list of elements of from , followed by , to decode , and the restriction of to . Given the set of spanning trees, the algorithm constructs (note that the ordering on is inherent in the components themselves, and does not depend on ). Next, the algorithm reads and recovers the set of edges, along with the restriction of to . Finally, the algorithm reads and reconstructs the remaining edges, with their induced ordering. Since , the algorithm can reconstruct the restriction of to , thus reconstructing .

Corollary 9.

.

Next we turn to lower bound . To this end, we construct a subset of of size at least , thus lower bounding .

Consider the following family of labeled multigraphs over the vertex set . For every , contains connected components, where components, referred to as small, are composed of vertices each, and one large component contains the remaining nodes. The first edges (according to ) are a union of simple cycles, where each cycle contains the entire set of nodes of one connected component.

Claim 10.

.

Proof.

The number of possible ways to choose the partition of into connected components such that all but one contain exactly vertices is . Each small component has exactly one spanning cycle, while the large component has spanning cycles. Once the cycles are chosen, there are at least ways to order the edges. The number of possible edges in is . Therefore there are ways to choose the ordered sequence of remaining edges. We conclude that

Lemma 11.

.

Proof.

Every contains labeled edges, no isolated vertices and connected components. Therefore, if and only if the degree of every node in is even. Let be the set of the last edges in ; then, since is a union of disjoint cycles spanning all vertices, for every , . Hence if and only if for every , is even. Consider the set of all possible sequences of edges in . For every such sequence , let the signature of be the indicator vector , where for every , if and only if is odd. Let be of the same signature, and let be the edge sequence of length which is the concatenation of and . Then is even. Since the number of possible signatures is , there exists a set of edge sequences of length that all have the same signature such that . Therefore

We therefore conclude the following, which finishes the proof of Theorem 5.

Corollary 12.

.

3 Bounding $\nu(m, \varepsilon, \delta)$

In this section we prove Theorem 2, assuming Lemmas 3 and 4, whose proofs are deferred to Section 4. Fix and an integer . We first address the case where . Let be a unit vector. Then

Therefore by Chebyshev’s inequality .
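For reference, the standard form of Chebyshev's inequality used here is the following.

```latex
% Chebyshev's inequality: for a random variable Z with finite variance and any t > 0,
\Pr\big[\, |Z - \mathbb{E}[Z]| \ge t \,\big] \;\le\; \frac{\operatorname{Var}(Z)}{t^{2}} .
```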

We therefore continue assuming . From Lemmas 3 and 4 there exist such that for every , if then for every unit vector , . Moreover, if then

Note that in addition . Denote , and .

Lemma 13.

If then .

Proof.

Let , and let be some integer. Then

Applying the Paley-Zygmund inequality

(5)

Therefore for every , which implies . ∎
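For reference, the Paley-Zygmund inequality invoked above is, in its standard form, the following.

```latex
% Paley-Zygmund inequality: for a nonnegative random variable Z with finite
% second moment and any \theta \in [0, 1],
\Pr\big[\, Z > \theta\, \mathbb{E}[Z] \,\big] \;\ge\; (1 - \theta)^{2}\,
\frac{\big(\mathbb{E}[Z]\big)^{2}}{\mathbb{E}\big[Z^{2}\big]} .
```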

For the rest of the proof we assume that , and we start by proving a lower bound on .

Lemma 14.

.

Proof.

Let , let be a unit vector such that , and let . If , then since is convex as a function of , we have

Moreover, if , then since is convex as a function of , we have

Since clearly, , then by Lemma 3 we have , and thus

(6)

Hence . ∎

Lemma 15.

.

To this end, let , and denote

Assume first that , and let . We will show that . Since , then . If , then . Since , then . Therefore

Otherwise, . Moreover, since , then . Therefore

Applying the Paley-Zygmund inequality, we get, similarly to (5), that

Therefore .

Assume next that , and note that since , then , and since then . Let , and consider independent , and . Let be defined as follows. For every , if and only if , and otherwise. Denote . Then , and moreover, , where . Let denote the event that , and that for all , if then , and let denote the event that . By Chebyshev’s inequality, . Note that if