Fast hashing with Strong Concentration Bounds

05/01/2019 ∙ by Anders Aamand, et al.

Previous work on tabulation hashing by Pǎtraşcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash-based sums, but under some quite severe restrictions on the expected values of these sums. More precisely, the basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8 bits. The character domain Σ should be small enough that character tables of size |Σ| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|Σ|). However, the above concentration bounds only apply if the expected sums are ≪ |Σ|. To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and apply Chernoff bounds to the number of balls that land in a given bin. We are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, then the expectation n/2 is ≫ |Σ| for large data sets, e.g., data sets that don't fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing, which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.


1 Introduction

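Chernoff bounds [4] date back to the 1950s. Originating from the area of statistics, they are now one of the most basic tools of Randomized Algorithms [11]. A canonical form considers the sum $X=\sum_{i=1}^{n}X_i$ of independent random variables $X_1,\dots,X_n\in[0,1]$. Writing $\mu=\mathrm{E}[X]$, it holds for every $\delta\ge 0$ (with $\delta<1$ in (2)) that

$$\Pr[X\ge(1+\delta)\mu]\le\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}\quad\left[\le\exp(-\delta^{2}\mu/3)\ \text{for}\ \delta\le 1\right],\tag{1}$$

$$\Pr[X\le(1-\delta)\mu]\le\left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\mu}\quad\left[\le\exp(-\delta^{2}\mu/2)\ \text{for}\ \delta\le 1\right].\tag{2}$$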
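As a small numerical illustration of the classic upper-tail bound, the following C++ sketch evaluates $(e^{\delta}/(1+\delta)^{1+\delta})^{\mu}$ together with its common simplification $\exp(-\delta^{2}\mu/3)$ for $\delta\le 1$. The function names are hypothetical and chosen only for this example, and the bound is the textbook one for fully independent variables; it is not code from the paper's experiments.

```cpp
#include <cassert>
#include <cmath>

// Exponent function C(x) = (x + 1) ln(x + 1) - x, defined for x > -1.
// Note that exp(-C(x)) = e^x / (1 + x)^(1 + x).
double chernoff_exponent(double x) {
    assert(x > -1.0);
    return (x + 1.0) * std::log1p(x) - x;
}

// Upper-tail bound (e^delta / (1 + delta)^(1 + delta))^mu
// on Pr[X >= (1 + delta) * mu] for independent X_i in [0, 1].
double upper_tail_bound(double mu, double delta) {
    return std::exp(-mu * chernoff_exponent(delta));
}

// Simplified bound exp(-delta^2 * mu / 3); only claimed for 0 <= delta <= 1.
double simplified_upper_tail_bound(double mu, double delta) {
    assert(delta >= 0.0 && delta <= 1.0);
    return std::exp(-delta * delta * mu / 3.0);
}
```

For unbiased coin tossing with n keys one would call `upper_tail_bound(n / 2.0, delta)`; the exact form is never larger than the simplified one on the range where the latter applies.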

Textbook proofs of these bounds can be found in [11, §4]. (Footnote 3: The bounds in [11, §4] are stated as working only for $X_i\in\{0,1\}$, but the proofs can easily handle any $X_i\in[0,1]$.)

In practice (and in life), we rarely see completely independent random variables. A much more common scenario, which is the focus of this paper, is that each $X_i$ is a function of the hash value $h(x_i)$ of some key $x_i$, and possibly, the hash value $h(q)$ of some distinguished query key $q$. Hashing is another fundamental tool of randomized algorithms from the 1950s [8]. It is used whenever a system wants the same key to always make the same random choice over time, e.g., always be found in the same place in a hash table, or in a distributed setting. However, unless an infeasible amount of space is used to store a fully random table of hash values, the values will not be independent, yet we would like Chernoff-style bounds to hold.

One way to achieve such Chernoff-style bounds is through the $k$-independent hashing framework of Wegman and Carter [19]. Here we only ask that the hash values of any $k$ keys are independent. Using such a $k$-independent hash function, the variables $X_i$ become $k$-independent, in which case Schmidt and Siegel [16] have shown that the above Chernoff bounds hold with an additive error decreasing exponentially in $k$. Unfortunately, evaluating any $k$-independent hash function takes $\Omega(k)$ time unless we use a lot of space. A lower bound by Siegel [17] states that to evaluate a $k$-independent hash function over a key domain $[u]$ in time $t$, we need at least $u^{1/t}$ space whenever $t<k$.

Pǎtraşcu and Thorup [12, 15] have shown that Chernoff-style bounds can be achieved in constant time with simple tabulation based hashing methods. However, their results suffer from some quite severe restrictions on the expected value of the sum. In this paper, we completely lift these restrictions by proving new strong concentration bounds on simple tabulation and by introducing and analyzing a new family of fast hash functions, tabulation-permutation hashing.

1.1 Simple Tabulation Hashing

Simple tabulation hashing dates back to Zobrist [20]. For some alphabet $\Sigma$, the key domain of a simple tabulation hash function is $\Sigma^c$ for a $c=O(1)$, i.e., each key consists of $c$ characters of $\Sigma$. For instance, we could consider 32-bit keys consisting of four 8-bit characters. We always assume that the alphabet size $|\Sigma|$ is a power of two.

A simple tabulation hash function, $h\colon\Sigma^c\to[m]$, with $m=2^{\ell}$ for some $\ell$, is defined as follows. For each $i\in\{1,\dots,c\}$, we store a fully random independent character table $h_i\colon\Sigma\to[m]$ mapping characters of the alphabet, $\Sigma$, to $\ell$-bit hash values. Here we identify $[m]=\{0,\dots,m-1\}$ with $\{0,1\}^{\ell}$. A key $x=(x_1,\dots,x_c)\in\Sigma^c$ is hashed to $h(x)=h_1[x_1]\oplus\dots\oplus h_c[x_c]$, where $\oplus$ denotes bitwise XOR, an extremely fast operation. With character tables in cache, this scheme is the fastest known 3-independent hashing scheme.
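The definition above can be sketched in C++ for the example of 32-bit keys split into four 8-bit characters. This is a minimal illustration under stated assumptions: the struct name `SimpleTabulation32` is hypothetical, and `std::mt19937_64` merely stands in for the fully random table entries that the analysis assumes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <random>

// Simple tabulation for 32-bit keys viewed as c = 4 characters of 8 bits,
// hashing to 32-bit values. Table contents come from a PRNG here, which is
// an assumption of this sketch rather than the fully random tables of the theory.
struct SimpleTabulation32 {
    std::array<std::array<std::uint32_t, 256>, 4> table;

    explicit SimpleTabulation32(std::uint64_t seed) {
        std::mt19937_64 rng(seed);
        for (auto& t : table)
            for (auto& entry : t)
                entry = static_cast<std::uint32_t>(rng());
    }

    // h(x) = h_1[x_1] xor h_2[x_2] xor h_3[x_3] xor h_4[x_4]
    std::uint32_t operator()(std::uint32_t x) const {
        std::uint32_t h = 0;
        for (int i = 0; i < 4; ++i) {
            h ^= table[i][x & 0xFF];  // look up the current 8-bit character
            x >>= 8;                  // advance to the next character
        }
        return h;
    }
};
```

With 256 entries of 4 bytes per table, the four tables occupy 4 KB in total, which is the kind of footprint that fits in fast cache.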

Consider hashing weighted balls or keys into bins with a simple tabulation function, . In that setting we prove the following theorem.

Theorem 1.

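Let $h\colon\Sigma^c\to[m]$ be a simple tabulation hash function. For $S\subseteq\Sigma^c$ let $n=|S|$ denote the size of $S$ and for every ball $x\in S$ let $w_x\in[0,1]$ be the weight of $x$. Furthermore, let a bin $y\in[m]$ be arbitrary or chosen as a function of $h(q)$ for some query ball $q\in\Sigma^c\setminus S$. Define $X=\sum_{x\in S,\,h(x)=y}w_x$ to be the total weight of the balls in bin $y$ and $\mu=\mathrm{E}[X]$. Then for any constant $\gamma>0$,

$$\Pr[X\ge(1+\delta)\mu]\le 2\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\Omega(\mu)}+1/m^{\gamma},\tag{3}$$

$$\Pr[X\le(1-\delta)\mu]\le 2\left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\Omega(\mu)}+1/m^{\gamma}.\tag{4}$$

(Footnote 4: We note that the constant can be replaced by any constant making the implicit constant in larger. A constant is only necessary if the probability bounds (1) and (2) with full randomness are .)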

These bounds are the same as the classic Chernoff bounds (1) and (2) if we remove the factor 2, replace with , and remove the additive error probability . With an -independent hash function [16], we would get the same bound except with instead of so long as for some constant . In practice, however, simple tabulation is much faster than any known -independent hash function for large – in fact experiments in [12] showed simple tabulation hashing to be the fastest known 3-independent hash function.

Pǎtraşcu and Thorup [12] proved the same bounds as in this paper, but with the restriction that . Our first main result on simple tabulation is thus to remove this restriction on , which allows us to handle cases like for every and . This is accomplished through a new and much more powerful analysis of simple tabulation.

For small $m$ the error probability $1/m^{\gamma}$ is prohibitive. For instance unbiased coin tossing, corresponding to the case $m=2$, has an additive error probability of $1/2^{\gamma}$, which is useless. We will show that this additive error is real by providing examples of sets of keys having a bad error bound for small values of $m$, which can be seen in the experiments that we run. To handle such instances and support much more general Chernoff bounds, we introduce a new hash function: tabulation-permutation hashing.

1.2 Tabulation-Permutation Hashing

We start by defining tabulation-permutation hashing from to with , e.g. . Later we use it to hash into for any .

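A tabulation-permutation hash function $h\colon\Sigma^c\to\Sigma^d$ is given as a composition, $h=\tau\circ g$, of a simple tabulation hash function $g\colon\Sigma^c\to\Sigma^d$ and a permutation $\tau\colon\Sigma^d\to\Sigma^d$. The permutation is chosen as follows. For each $i\in\{1,\dots,d\}$, pick a uniformly random character permutation $\tau_i\in S_{\Sigma}$, where $S_{\Sigma}$ denotes the set of all permutations of $\Sigma$. Now, we let $\tau=(\tau_1,\dots,\tau_d)$ in the sense that for $z=(z_1,\dots,z_d)\in\Sigma^d$, $\tau(z)=(\tau_1(z_1),\dots,\tau_d(z_d))$. We note that with $d\le c$, the permutation step is at least as fast as the simple tabulation step, since it can be implemented the same way. More precisely, we precompute tables $T_i\colon\Sigma\to\Sigma^d$, where

$$T_i(z_i)=(\underbrace{0,\dots,0}_{i-1},\,\tau_i(z_i),\,\underbrace{0,\dots,0}_{d-i}),$$

and then $\tau(z)=T_1(z_1)\oplus\dots\oplus T_d(z_d)$. Thus, $\tau$ itself has the structure of a simple tabulation hash function but with a different distribution of the character tables.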
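The construction can be sketched in C++ for 32-bit keys and 32-bit outputs with four 8-bit characters on each side. This is only an illustration: the names `Tables`, `xor_lookups` and `TabulationPermutation32` are hypothetical, and `std::mt19937_64` with `std::shuffle` stands in for the fully random tables and uniformly random permutations of the definition.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>

using Tables = std::array<std::array<std::uint32_t, 256>, 4>;

// XOR of four table lookups, one per 8-bit character of v.
inline std::uint32_t xor_lookups(const Tables& t, std::uint32_t v) {
    std::uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        out ^= t[i][v & 0xFF];
        v >>= 8;
    }
    return out;
}

struct TabulationPermutation32 {
    Tables g;  // character tables of the simple tabulation step
    Tables T;  // T[i][z] holds tau_i(z) in character position i, zeros elsewhere

    explicit TabulationPermutation32(std::uint64_t seed) {
        std::mt19937_64 rng(seed);
        for (auto& table : g)
            for (auto& entry : table)
                entry = static_cast<std::uint32_t>(rng());
        for (int i = 0; i < 4; ++i) {
            std::array<std::uint32_t, 256> tau;
            std::iota(tau.begin(), tau.end(), 0u);      // identity permutation
            std::shuffle(tau.begin(), tau.end(), rng);  // random permutation tau_i
            for (int z = 0; z < 256; ++z)
                T[i][z] = tau[z] << (8 * i);
        }
    }

    // Permutation step applied to the output of the simple tabulation step.
    std::uint32_t operator()(std::uint32_t x) const {
        return xor_lookups(T, xor_lookups(g, x));
    }
};
```

Both steps use the same lookup routine, which reflects the remark that the permutation step has the structure of simple tabulation with differently distributed tables.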

Write and . For tabulation-permutation hash functions, we prove the following.

Theorem 2.

Let be a tabulation-permutation hash function and be a fixed value function that to each ball assigns the value depending on the hash value . Define and . For any constant ,4

(5)
(6)

The above bounds also hold if we condition on a distinguished query key having a specific hash value , that is, can be regarded as a fixed constant in the definition of the hash function and the value function.

As described below, tabulation-permutation hashing can be used to hash into any number of bins with the same probability bounds using an appropriate choice of value function.

Pǎtraşcu and Thorup [15] introduced twisted tabulation to get the same error probability bounds as above, but only under the restriction . To understand how serious this restriction is, consider again the case where we want to toss unbiased coins for each key in a set . This corresponds to the case . With the error probability bound of Pǎtraşcu and Thorup, we can only handle , but recall that is chosen small enough for character tables to fit in fast cache. Thus, for most normal data sets meaning that twisted tabulation cannot handle any moderately large data set and sampling probability.

Pǎtraşcu and Thorup [15] were acutely aware of the prohibitive restriction . For cases of larger they proved that with twisted tabulation hashing, with probability for any . However, one can show by considering an appropriate subset of keys , that no substantial improvement over this bound is attainable with twisted tabulation hashing. With our tabulation permutation hashing Theorem 2 yields with the same probability.

Interestingly, the idea in twisted tabulation is to randomize the keys before applying simple tabulation, whereas our permutation step applied after simple tabulation serves to break any problematic structure in the value function, e.g., ensuring that with high probability the value function cannot just depend on a single bit of the hash value from the simple tabulation step. We prove that this random permutation suffices in tandem with our new tight understanding of simple tabulation.

Tabulation-Permutation Hashing into Bins.

We now revisit the problem of throwing keys into bins for some . Later, in Section 8, we will discuss a very efficient way of hashing into a number of bins which is not a power of two. We want to get concentration bounds like in Theorem 2 with additive error probability for some given . An example where we may want strong concentration is when we distribute, say, jobs or customers on servers without overloading any server.

If , we just use simple tabulation. Setting in Theorem 1, we get an additive error bound of as desired.

If , e.g., if , we use tabulation-permutation with , that is, with a single output character. Then we throw key into bin . Here & is a bit-wise and, and since , consists of 1s in the least significant bit positions, so .

Suppose we have a key set and a distinguished query key that we all hash to bins using . We now claim that our concentration bounds apply to the number of keys from landing in the same bin as . This number can be written as . Recall from Theorem 2 that can be regarded as a fixed constant in our value function. We can thus count the number of balls from landing with as where and . Therefore the bounds from Theorem 2 apply to as desired.
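The small-range case can be sketched in C++ as follows, reading the description above as: apply simple tabulation with a single 8-bit output character, apply one random permutation of the alphabet, and keep the least significant bits through a bitwise AND with the number of bins minus one. That reading is an assumption where the extracted text has lost the formula, and the name `SmallRangeHash` and the use of `std::mt19937_64` are hypothetical choices for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <random>

// Hashes 32-bit keys into m bins, where m is a power of two and m <= 256.
struct SmallRangeHash {
    std::array<std::array<std::uint8_t, 256>, 4> g;  // simple tabulation into one character
    std::array<std::uint8_t, 256> tau;               // one random permutation of the alphabet
    std::uint32_t mask;                              // m - 1, i.e. 1s in the low bit positions

    SmallRangeHash(std::uint64_t seed, std::uint32_t m) : mask(m - 1) {
        assert(m >= 1 && m <= 256 && (m & (m - 1)) == 0);
        std::mt19937_64 rng(seed);
        for (auto& table : g)
            for (auto& entry : table)
                entry = static_cast<std::uint8_t>(rng());
        for (int z = 0; z < 256; ++z)
            tau[z] = static_cast<std::uint8_t>(z);
        std::shuffle(tau.begin(), tau.end(), rng);
    }

    std::uint32_t bin(std::uint32_t x) const {
        std::uint8_t y = 0;
        for (int i = 0; i < 4; ++i) {
            y = static_cast<std::uint8_t>(y ^ g[i][x & 0xFF]);
            x >>= 8;
        }
        return tau[y] & mask;  // permute, then keep the low bits
    }
};
```

Because the permutation is a bijection on 256 characters, exactly 256/m characters map to each bin after masking, so the final step splits the alphabet evenly among the bins.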

The Necessity of Permutations.

We defined a tabulation-permutation hash function by where is a concatenation of permutations of and is a simple tabulation hash function. The reader may wonder why we use these permutations in instead of random hash functions as in double tabulation [18]. We show below this makes a huge difference even in very simple cases.

We consider the above balls into bins case where we throw balls into bins. For our example, the keys are from , and they are all balls, that is, and . Following the description above, we use , a random simple tabulation hash function and single random permutation , set , and throw into bin .

We will compare this with the situation where we instead use a fully random , set , and throw into bin .

Consider first the effect of . By Theorem 1, it maps our balls into bins so that every bin gets balls w.h.p.

The nice concentration is trivially preserved when we apply our random permutation , and the function maps exactly random bins from into each . The deviation in is therefore only based on the deviations from the bins mapping to . By Theorem 2, every bin from ends up with balls w.h.p., and this kind of concentration works with arbitrary value functions, not just balls into bins.

If instead we applied a fully random , then the number of bins landing in a given bin would follow a Poisson distribution with mean 1. Thus would lead to a constant fraction of empty bins and bins with double or more of the expected contents. With , the map would lead to many empty bins contrasting our strong concentration around . This does not contradict the findings on double tabulation from [18] for there , while our example uses .
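The contrast between a random permutation and a fully random function on the alphabet can be checked with a small C++ simulation. It is a sketch under assumptions: the helper names are hypothetical, `std::mt19937_64` supplies the randomness, and the alphabet size is taken to be a power of two. A permutation hits every value exactly once, while for a random function the fraction of values with no preimage is close to $1/e\approx 0.37$, matching the Poisson behaviour described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Number of values in {0, ..., n-1} that no input is mapped to by f.
std::size_t count_unhit(const std::vector<std::uint32_t>& f) {
    std::vector<bool> hit(f.size(), false);
    for (std::uint32_t v : f) hit[v] = true;
    return static_cast<std::size_t>(std::count(hit.begin(), hit.end(), false));
}

// A random permutation of {0, ..., n-1}.
std::vector<std::uint32_t> random_permutation(std::size_t n, std::mt19937_64& rng) {
    std::vector<std::uint32_t> f(n);
    std::iota(f.begin(), f.end(), 0u);
    std::shuffle(f.begin(), f.end(), rng);
    return f;
}

// A random function {0, ..., n-1} -> {0, ..., n-1}.
// n must be a power of two so that masking yields a uniform value.
std::vector<std::uint32_t> random_function(std::size_t n, std::mt19937_64& rng) {
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<std::uint32_t> f(n);
    for (auto& v : f) v = static_cast<std::uint32_t>(rng() & (n - 1));
    return f;
}
```

Comparing `count_unhit(random_permutation(n, rng))` with `count_unhit(random_function(n, rng))` for a moderately large n makes the difference visible directly.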

Remarks on Universe Reduction and Amount of Randomness.

In general, the keys in question may originate from a very large universe . However, often we are only interested in the performance on an unknown set of up to keys. A standard first step is to perform a universe reduction, mapping randomly to “signatures” in , where , e.g. , so that no two keys from are expected to get the same signature [2]. In this paper, we generally assume that this universe reduction has been done, if needed, hence that we “only” need to deal with keys from the universe of size polynomial in . In this case, we can for any small constant pick such that the space for our tables, , is .

Above, we required the character tables of our simple tabulation hash function to be fully random. However, for the bounds (3) and (4), it would suffice if they were populated with a -independent pseudo-random number generator (PNG), so we only need a seed of random words to be shared among all applications that want to use the same simple tabulation hash function. Then, as a preprocessing step for fast hashing, each application can locally fill the character tables in time [5]. Likewise, for our tabulation-permutation hashing, the bounds (5) and (6) only require a -independent PNG to generate the permutations.

1.3 Related work

1.3.1 High independence and double tabulation

As mentioned earlier, the concentration bounds for tabulation permutation hashing can also be achieved using an -independent hash function. The best positive result towards high independence and constant time evaluation is Thorup’s double tabulation [18]. For double tabulation, we use two independent simple tabulation functions and , and use the double tabulation hash function . Thorup [18] has shown that if , with probability over the choice of , the double tabulation hash function is -independent for . This matches the previously mentioned lower bound of Siegel [17] in the sense that we evaluate the hash function in time using space.

At first the above scheme may seem much better and even simpler than tabulation-permutation hashing since our suggested implementation of tabulation-permutation is just like double tabulation. However, for -independence, we need . For any , this makes it hard to keep the character tables in cache since they use space . In concrete numbers using more precise formulas (more precise than the succinct, general formula ), for 32-bit keys, [18] suggests using characters of 16 bits, and derived characters to get a 100-independent hash function w.h.p. According to [18] we cannot use significantly fewer resources even if we just want 4-independence. With our tabulation-permutation hashing, we would just use , thus having instead of lookups. Indeed in our experiments below double tabulation is approximately 20 times slower than tabulation-permutation hashing. Things get much worse for double tabulation of 64-bit keys.

We also note that Dahlgaard et al. [6] have shown that double tabulation becomes fully random with high probability for a given set of size so long as . However, the whole point of our work here is to handle the case where , and by definition, since the value functions we consider take values in .

1.3.2 Small space alternatives to independence

Finally, there have been developments in faster low space hash functions that, for example, can hash balls to bins so that the maximal number of balls in any bin is . In general, from an independence perspective, this requires independence and evaluation time unless we switch to hash functions using a lot of space as above. However, in [3, 10] it is shown that we can get down to words on operations per hash function evaluation. This is impressive when only using a small amount of space and a short random seed. However, in terms of running time it still does not come close to the small, constant amount of time required by tabulation-permutation hashing.

1.4 Experiments

We run experiments on various basic hash functions. We evaluate both the speed of the hash functions and the quality of the output. Our experiments show that Tabulation-Permutation not only has theoretical guarantees but also performs very well in practice.

We consider some of the most popular hash functions: $k$-wise PolyHash [2], Multiply-Shift [7], Simple Tabulation [20], Twisted Tabulation [15] and Double Tabulation [18]. Out of these hash functions only Tabulation-Permutation, Double Tabulation, and very high degree $k$-wise PolyHash have theoretical guarantees. Our experiments show that Tabulation-Permutation is approximately 20 times faster than Double Tabulation and approximately 125 times faster than 100-wise PolyHash. Our experiments also show that there exist bad instances for Multiply-Shift, 2-wise PolyHash, Simple Tabulation, and Twisted Tabulation.

All experiments are implemented in C++11 using a random seed from https://www.random.org. The tabulation-based hashing methods are seeded using a random 100-wise PolyHash function. PolyHash is implemented using Mersenne prime , Horner’s rule, and GCC’s 128-bit integers to ensure an efficient implementation. Double Tabulation is implemented as described in [18] with .
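A PolyHash evaluation along these lines can be sketched as follows. The prime $p=2^{61}-1$ is an assumption of this sketch, since the value of the Mersenne prime is missing from the text above, and the names `kMersenne61`, `mod_mersenne61` and `poly_hash` are hypothetical. The code relies on GCC's `unsigned __int128` extension and on the identity $2^{61}\equiv 1 \pmod p$ to avoid a division.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed prime p = 2^61 - 1 (hypothetical choice for this sketch).
constexpr std::uint64_t kMersenne61 = (std::uint64_t{1} << 61) - 1;

// Reduces v modulo p, assuming v < 2^122, using 2^61 = 1 (mod p).
inline std::uint64_t mod_mersenne61(unsigned __int128 v) {
    std::uint64_t r = static_cast<std::uint64_t>(v & kMersenne61) +
                      static_cast<std::uint64_t>(v >> 61);  // r < 2^62
    r = (r & kMersenne61) + (r >> 61);                      // r <= 2^61
    return r >= kMersenne61 ? r - kMersenne61 : r;
}

// Evaluates coeffs[0] + coeffs[1] * x + ... + coeffs[k-1] * x^(k-1) mod p
// by Horner's rule. All coefficients are assumed to lie in [0, p).
std::uint64_t poly_hash(const std::vector<std::uint64_t>& coeffs, std::uint64_t x) {
    x = mod_mersenne61(x);
    std::uint64_t h = 0;
    for (auto it = coeffs.rbegin(); it != coeffs.rend(); ++it)
        h = mod_mersenne61(static_cast<unsigned __int128>(h) * x + *it);
    return h;
}
```

A $k$-wise PolyHash would draw $k$ coefficients uniformly from $[0,p)$ and then map the result to the desired output range; that final mapping is omitted here.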

Timing

The result of our time experiment is presented in Table 1. We hash the same randomly chosen integers with each hash function. We consider the case where the hash functions output 32 bits and when the hash functions output 8 bits.

First we compare with the hash functions that do not have good concentration bounds, that is, Multiply-Shift, 2-wise PolyHash, Simple Tabulation, and Twisted Tabulation. We see that when the hash functions output 32 bits, Tabulation-Permutation is slightly slower than Simple Tabulation and Twisted Tabulation, which in turn are slower than Multiply-Shift and 2-wise PolyHash. Next we consider the case where we only have a single 8-bit character output. Recall that this is the case we needed when using Tabulation-Permutation to hash into a small number of bins, e.g. . Having a small output is only an advantage for Simple Tabulation, Twisted Tabulation, and Tabulation-Permutation. For the latter it means that we only need to perform one more table lookup than Simple Tabulation, which is also reflected in the running time, where Tabulation-Permutation is only marginally slower than Simple Tabulation and faster than Twisted Tabulation with 8-bit output.

Next, comparing with the hash functions that do have strong concentration bounds, Tabulation-Permutation is approximately 20 times faster than Double Tabulation and approximately 125 times faster than 100-wise PolyHash.

Table 1: The time taken for different hash functions to hash random numbers.

Hash function                     Time
---------------------------------------------
Hashing from 32 bits to 32 bits
  Multiply-Shift                  5.3 ms
  2-wise PolyHash                 7.7 ms
  Simple Tabulation               16.1 ms
  Twisted Tabulation              21.8 ms
  Double Tabulation               522.4 ms
  "Random" (100-wise PolyHash)    3166.8 ms
  Tabulation-Permutation          24.2 ms
Hashing from 32 bits to 8 bits
  Simple Tabulation               13.2 ms
  Twisted Tabulation              20.1 ms
  Tabulation-Permutation          16.4 ms

Quality

We will now present experiments with concrete bad instances for the schemes without general concentration bounds, that is, Multiply-Shift, 2-wise PolyHash, Simple Tabulation, and Twisted Tabulation. In each case, we compare with our new Tabulation-Permutation scheme as well as 100-wise PolyHash, which is our approximation to an ideal fully random hash function. We note that all schemes considered are 2-independent, so they all have exactly the same variance as fully random hashing. Our concern is therefore the frequency of large deviations in the tails.

First we consider simple bad instances for Multiply-Shift and 2-wise PolyHash. The case is analyzed in detail in [14, Appendix B]. The instance is that we hash the arithmetic progression to bins. The results are seen in Figure 1. We see that most of the time 2-wise PolyHash and Multiply-Shift distribute exactly of the keys in each bin. This should be matched by much heavier tails, which is indeed what our experiments show. For contrast, we see that our Tabulation-Permutation is almost indistinguishable from the 100-wise PolyHash.

Figure 1: Hashing the arithmetic progression to bins for a random integer . The dotted line is a 100-wise PolyHash.

We now show that Simple Tabulation and Twisted Tabulation cannot guarantee Chernoff-style bounds when the output domain is small. We hash the discrete cube to bins using Simple Tabulation, Twisted Tabulation, and Tabulation-Permutation, the results of which can be seen in Figure 2. In general, if we hash the keyset to with Simple Tabulation, then if , each bin will get exactly the same number of keys. When we hash the keyset , then with probability the two bins get exactly the same number of keys, and with probability it corresponds to hashing keys each with weight completely independently. The variance of this is times higher, so we expect a much heavier tail than in the completely independent case. We think this instance is also one of the worst instances for Tabulation-Permutation. It is therefore expected that it performs slightly worse in this case compared with 100-wise PolyHash. This is also exactly what our experiments show. We note that no amount of experimentation can prove that Tabulation-Permutation always works well for all inputs, but we do have mathematical guarantees for the concentration bounds, and the experiments performed here give us some idea of the impact of the real constants hidden in our asymptotic bounds.

Figure 2: Hashing the discrete cube to bins. The dotted line is a 100-wise PolyHash.

2 Techniques

We apply a string of new techniques and ideas to arrive at the two main theorems of the paper. The exposition is subdivided into three parts, each yielding theorems of independent interest. Combining these yields Theorem 1 and Theorem 2, and in fact implies even stronger statements in both cases. Below we present the main theorem of each section and show how they imply Theorem 1 and Theorem 2. This is followed by a technical description of each section and of how each of the main theorems comes about.

Simple Tabulation

The first part of this paper (Section 4) presents a much tighter analysis of simple tabulation hashing, proving concentration bounds for throwing weighted balls into bins. This includes an analysis of the case of general value functions with restricted support, not just one as in Theorem 1. Our main theorem of Section 4 is the following.

Theorem 3.

Let be a simple tabulation hash function and be a set of keys. Let be a value function such that the set satisfies , where is a constant. Define the random variable

and . For any constant the following bounds hold:

(7)
(8)

for some large enough constants ,. Here maps . This result holds even when the value function depends on the hash of a query key (with and slightly larger).

The theorem can be seen as a tight analysis of the use of simple tabulation hashing towards concentration bounds, thus completing a part of the line of research [6, 12, 15, 18] started by Pǎtraşcu and Thorup in [12] on concentration with tabulation hashing.  Eq. 8 of the theorem is an important technical tool. It is proved first and then used in the proof of Eq. 7. Later it is then applied in the analysis of the performance of tabulation-permutation hashing.

Let us briefly argue that Theorem 3 implies Theorem 1. Assume that we are in the setting of Theorem 1 and let be defined by

Applying Eq. 7 of Theorem 3 with and we find that

Now and it is an easy exercise (and the content of the upcoming Lemma 10) to show that the bound above only gets worse if is replaced by a larger number. One can moreover show that for (this is the content of the upcoming Lemma 9). Combining these two observations it follows that

The bound on is proved similarly, and when is a function of the hash of a query we proceed as above using the corresponding part of Theorem 3.

Permutation

The second part of the paper (Section 5) proves that given a hash function with concentration bounds like (7) and (8) in Theorem 3, composing with a uniformly random permutation of the entire codomain yields a hash function giving Chernoff-type concentration for general value functions. The main theorem of Section 5 is the following.

Theorem 4.

For a set , an integer and a positive number , let be a 2-independent hash function satisfying the following. For every value function such that the set satisfies , and every Eqs. 8 and 7 of Theorem 3 hold.

Let be any value function, a uniformly random permutation independent of , and , then

where , and is a universal constant depending on and . This result holds (with slightly larger constants) even when the value function depends on the hash value of a query key . The only requirement is that in that case the assumptions on must also hold under this condition.

We believe the theorem to be of independent interest. From a hash function that only performs well for value functions supported on an asymptotically small subset of the bins we can construct a hash function performing well for any value function – simply by composing with a random permutation. Theorem 3 shows that simple tabulation satisfies the two conditions in the theorem above. It follows that if , e.g., if , then composing a simple tabulation hash function with a uniformly random permutation yields a hash function having Chernoff-style bounds for general value functions asymptotically matching those from the fully random setting up to an additive error inversely polynomial in the size of the universe. In particular these bounds hold for tabulation-permutation hashing from to , that is, using just a single permutation, which yields the result of Theorem 2 in the case . If we desire a range of size the permutation becomes too expensive to store. In tabulation-permutation hashing we instead use permutations , hashing

Concatenation

The third and final part of the paper (Section 6) shows that concatenating the hash values of two independent hash functions each with Chernoff-style bounds for general value functions yields a new hash function with a larger codomain and similar Chernoff-style bounds with only a constant blow up of the constants. In particular it will follow that tabulation-permutation hashing has Chernoff-style bounds for general value functions. The main theorem of Section 6 is the following.

Theorem 5.

Let be a finite set. Let and be pairwise independent families of random variables taking values in and , respectively, and satisfying that the distributions of and are independent. Suppose that , , and are such that for every choice of weight functions and and for every ,

(9)
(10)

where and . Then for every weight function and every ,

where , , , and .

As with Theorem 4, this result is of independent interest. Since it uses the input hash functions in a black box manner it functions as a general tool towards producing hash functions with Chernoff-style bounds.

We proceed to argue that Theorems 4, 3 and 5 imply Theorem 2. Suppose that we are in the setting of Theorem 2. The arguments succeeding Theorem 4 show that in the case and for any value function ,

for some constant depending on and . Applying Theorem 5, times, we find that even for ,

for some constant depending on and . At this point an argument similar to that following Theorem 3 completes the proof.

2.1 Analysis of Simple Tabulation

A large part of the technical contribution of this paper is spent on a thorough analysis of simple tabulation in Section 4. We pick up the basic approach in the analysis by Pǎtraşcu and Thorup [12] and generalize and extend it to obtain strong, general concentration bounds. The pinnacle of the section is the proof of Theorem 3.

The key idea of Pǎtraşcu and Thorup [12] was to perform the analysis of the concentration in the following way. Consider a simple tabulation function . Define a position character to be an element of . We can then consider a key as a set containing the position characters . With this view in mind, we shall write if is a position character of . If and is defined by , we overload notation by letting for a position character . For a set of position characters we further write . Suppose is an order on the set of position characters and enumerate the position characters according to this order. For a set of keys and every , we can then define the group of keys that contain the position character and satisfy that is the maximal position character of with respect to the ordering ,

We can view the process of randomly picking a simple tabulation hash function as sequentially fixing the value of for uniformly at random in the order . Immediately before fixing the hash value the internal clustering of the keys of has been determined in the sense that has been determined for all , and after fixing the hash of these values are each shifted by an XOR with . Now, Pǎtraşcu and Thorup prove that the ordering can be chosen such that for each , . This fact together with the following trick enables them to prove concentration results on the number of balls landing in a given bin . The proof is inductive, viewing the situation before each step as though the set of keys has already been hashed into bins by a simple tabulation function . The fixing of the last position character of each key of then “rotates” the image of under , , into by the operation . The inductive hypothesis applied to the groups combined with the contribution of each group being relatively small () then allows them to prove concentration bounds.

We rely on this idea and framework too. However, we generalize the framework significantly, greatly extending the theory of simple tabulation hashing and obtaining tighter concentration bounds. This makes the proofs involved a more technical endeavour, but also yields a better understanding of simple tabulation hashing. Later we use this improved understanding of simple tabulation hashing to prove that tabulation-permutation hashing gives Chernoff concentration which up to -notation matches that from the fully random setting even for arbitrary value functions. (This is the content of Theorem 2.) The hardness of the proof of Theorem 3 is captured in the case of unit weight balls; the generalisation to weight functions of bounded support is a mere technicality. Where the paper conceptually breaks new ground in this section is in applying martingale concentration results to obtain a bound on the sum of the squared deviations in the bins, which is Eq. 8 of Theorem 3. We proceed to describe this technique starting with a brief description of martingales.

Colloquially, a martingale is a sequence of random variables satisfying that for any , . (Footnote 5: A more general and formal definition of a martingale is used in the body of the paper and can be found in the preliminaries.) In the modern theory of stochastic processes, martingales are one of the most fundamental objects. A natural way to construct a martingale is through martingale differences. A martingale difference is a sequence of random variables satisfying that for any , . The name derives from the fact that the sum is a martingale. A large body of work exists generalizing concentration inequalities for sums of independent random variables to martingales. In this paper, we mainly use a generalization of Bennett’s inequality [1] by Fan, Grama, and Liu [9] that allows Chernoff-style bounds for a martingale sequence given an almost sure bound for any on the increase of the martingale at step , and on the conditional variance . Now, the applicability to the analysis of simple tabulation is very roughly the following: As discussed above, the analysis proceeds by considering the application of simple tabulation hashing as a process of fixing the hash value of the position characters under one at a time. Fixing the hash value of a position character is independent of the previous events so letting be the value added from the ball in when fixing and , is a martingale difference. Using the inductive hypothesis on the group , we obtain the bound with high probability. Combined with a bound on the variance we can apply the martingale concentration inequality described above.

The key to the success of the above strategy is a bound on the variance of each martingale difference, conditioned on the previously fixed hash values. This is accomplished using another martingale sequence as above. This sequence describes the sum of the squared deviations of each bin from its mean. Informally, this is a martingale of the variance and one would think that the approach runs into the same problem as before, since we now need a bound on the variance of the variance, leading to an endless cycle passing to higher and higher moments. However, it turns out that it is possible to get by with Markov’s inequality and a tight, combinatorial analysis of a quantity relating to the number of collisions between sets of keys. Armed with this, we get a bound on the expectation of higher moment bounds, which leads to a bound on the conditional variance of the square deviations of the bins. It thus becomes possible to prove Eq. 8 of Theorem 3. Having established this result, we apply it to obtain Eq. 7 of Theorem 3.

2.2 Permutation Yields General Value Functions

Suppose that we have a hash function satisfying the following: for any value function supported on a limited number of bins (that is, any value function that is nonzero only on the bins of some set of limited size), we have Chernoff-style bounds on the distribution of the hash-based sum. Furthermore, we have a technical requirement on a related squared sum of deviations. This is exactly what we show regarding simple tabulation hashing. To lift the requirement on the support, we propose the following solution. Pick a uniformly random permutation of the bins. We then prove (Theorem 4) that the composition of the hash function with this permutation has Chernoff-style bounds on the distribution of the sum for any weight function. We loosely sketch this proof in the following. The full proof along with the statement of the theorem can be found in Section 5.
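The construction above can be sketched in a few lines. The following is a minimal illustration, assuming 32-bit keys split into four 8-bit characters and hash values in a single 8-bit character domain; the constants and the function names are hypothetical choices for this sketch, and Python's pseudorandom generator stands in for true randomness.

```python
import random

SIGMA_BITS = 8            # assumed character width
SIGMA = 1 << SIGMA_BITS   # alphabet size |Sigma| = 256
C = 4                     # assumed number of characters per key (32-bit keys)

def make_simple_tabulation(rng, out_bits=SIGMA_BITS):
    """Return h(x) = T_0[x_0] XOR ... XOR T_{C-1}[x_{C-1}] with random tables."""
    tables = [[rng.getrandbits(out_bits) for _ in range(SIGMA)] for _ in range(C)]

    def h(key):
        value = 0
        for i in range(C):
            char = (key >> (SIGMA_BITS * i)) & (SIGMA - 1)  # i-th character
            value ^= tables[i][char]
        return value

    return h

def make_permutation(rng, m):
    """Return a uniformly shuffled list representing a permutation of range(m)."""
    perm = list(range(m))
    rng.shuffle(perm)
    return perm

def make_tabulation_permutation(rng):
    """Compose simple tabulation with a random permutation of the bins."""
    h = make_simple_tabulation(rng)
    tau = make_permutation(rng, SIGMA)
    return lambda key: tau[h(key)]
```

For example, `g = make_tabulation_permutation(random.Random(0))` yields a function mapping 32-bit integers to values in `range(256)`. The extra cost over simple tabulation is one additional table lookup per key.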

We can suppose up to loss of constant factors that for every , . Denote by , the “variational contribution” of bin . In the fully random setting, the variance is . So long as no is large, e.g.  for every , we show that the distribution of obeys Chernoff-style bounds. This allows us to complete the proof in two steps.

First, we consider the bins with the largest contributions to the variance. Fixing the permutation, we define a new value function that agrees with the original on these bins and is zero on all other bins. This weight function has support on few bins, so we can apply the assumption on the hash function to get Chernoff-style bounds on the corresponding sum.

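Second, by the choice of the bins in the first step, the remaining bins all have small variance. Thus, applying the lemma mentioned above to the value function that agrees with the original on the remaining bins and is zero on the bins handled in the first step, we get Chernoff-style bounds on the corresponding sum.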

Combining the two observations above yields Chernoff-style bounds on the entire sum as desired.
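The two-step decomposition can be illustrated with a small sketch that splits a value function into a part supported on the bins with the largest contributions and a part supported on the rest. The ranking used here, the sum of squared values in a bin, is an assumed stand-in for the variational contribution, and the function name is hypothetical; the role of the permutation is omitted.

```python
def split_value_function(w, keys, bins, num_heavy):
    """Split w(x, j) into a heavy-bin part and a light-bin part.

    Bins are ranked by sum_x w(x, j)^2, an assumed proxy for the
    "variational contribution" of bin j.
    """
    contribution = {j: sum(w(x, j) ** 2 for x in keys) for j in bins}
    ranked = sorted(bins, key=lambda j: contribution[j], reverse=True)
    heavy = set(ranked[:num_heavy])

    def w_heavy(x, j):
        # Agrees with w on heavy bins, zero elsewhere.
        return w(x, j) if j in heavy else 0

    def w_light(x, j):
        # Agrees with w on the remaining bins, zero on heavy bins.
        return 0 if j in heavy else w(x, j)

    return heavy, w_heavy, w_light
```

Since the two parts sum to the original value function pointwise, the hash-based sum of the original is the sum of the two hash-based sums, which is why bounding each part separately suffices.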

2.3 Extending the Codomain

The conclusion of the first two parts of the paper is that, for a universe of keys, composing a simple tabulation hash function with a uniformly random permutation of the bins yields a hash function that, for every value function, has Chernoff-style concentration bounds on the hash-based sum. This construction works well when the number of bins is small. However, since the table required to store the permutation has size equal to the number of bins, it is infeasible when the number of bins is large. For instance, if there were as many bins as keys in the universe, we might as well have stored a fully random hash function on the whole universe.

Consequently, we formulate and prove a general theorem capable of extending the codomain of any hash function supporting Chernoff-style bounds on general value functions. We restate the theorem informally for intuition.

Suppose that $h_1$ and $h_2$ are independently distributed hash functions on a common universe, with possibly different sets of bins, satisfying that for every choice of weight functions $w_1$ and $w_2$, the corresponding hash-based sums under $h_1$ and $h_2$ are each distributed such that they obey Chernoff-style bounds up to constant factors. The hash function $h$ given by $h(x) = (h_1(x), h_2(x))$ then satisfies, up to constant factors, Chernoff-style bounds on the distribution of the hash-based sum for every weight function $w$ on pairs of bins.

The proof of this theorem crucially relies on the fact that any weight function or can be chosen. Writing , the key trick is to choose . In this way the analysis can be divided in two. First, one proves that conditioning on the choice of , the mean value stays within an appropriate range with high probability. We also bound the number of keys that become very “heavy” after the conditioning. This is done using the assumption on . Second, we use the assumption on to show that applying to the value functions arising from conditioning on , we get Chernoff-style probability bounds and lose only constant multiplicative factors.
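A minimal sketch of the codomain extension follows, assuming the combined hash value is the pair of the two hash values encoded as a single integer. The names `combine` and `condition_on_first` are hypothetical, and `condition_on_first` is only an assumed illustration of how fixing one hash function induces a value function for the other; it is not taken from the theorem's statement.

```python
def combine(h1, h2, m2):
    """Encode the pair (h1(x), h2(x)) as h1(x) * m2 + h2(x), where h2 maps into range(m2)."""
    return lambda x: h1(x) * m2 + h2(x)

def condition_on_first(w, h1, m2):
    """Given a value function w on combined bins, fix h1 and return the
    induced value function on the bins of the second hash function."""
    return lambda x, j2: w(x, h1(x) * m2 + j2)
```

With these definitions, the sum of `w(x, h(x))` over the keys equals the sum of the induced value function evaluated at `h2(x)`, so once the first hash function is fixed, the combined sum is a hash-based sum for the second hash function alone.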

2.4 On the Addition of a Query Ball

Recall that each of the theorems stated above contained the condition that it would still hold when the value functions depended on the hash value of some query key. Call this the query condition. Reading the body of the paper, one will notice that there is no mention of the query condition. This is for two reasons. First, the exposition is already notationally and conceptually heavy without the condition, so excluding it enhances the readability of the paper. Second, adding the query condition introduces very little conceptual difficulty. Hence, we have elected instead to sketch the arguments necessary for the addition in a section at the end of the paper.

3 Preliminaries

Before proceeding, we establish some basic definitions and describe results from the literature which we will use.

3.1 Notation

Throughout the paper, we use the following general notation.

  • We let denote the set .

  • For a statement or event we let be the indicator variable on , i.e.

  • Whenever are variables and , we shall denote by the sum