To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. A classic example of this approach is the algorithm of Bar-Yossef et al.  to estimate the number of distinct elements in a stream. Using standard strongly universal hashing to process each element, we get an estimator where the probability of a too large error is, say, 1/4. By performing independent repetitions and taking the median of the estimators, the error probability falls exponentially in . However, running independent experiments increases the processing time by a factor .
Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of independent sketches, we have a single sketch that is times bigger, so the total space is essentially the same. However, we only apply a single hash function, processing each element in constant time regardless of , and the overall algorithms just get simpler.
Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. . Using their hashing schemes, we get a very fast implementation of the above streaming algorithm, suitable for online processing of high volume data streams.
To illustrate a streaming scenario where the constant in the processing time is critical, consider the Internet. Suppose we want to process packets passing through a high-end Internet router. Each application only gets very limited time to look at the packet before it is forwarded. If it is not done in time, the information is lost. Since processors and routers use some of the same technology, we never expect to have more than a few instructions available. Slowing down the Internet is typically not an option. The papers of Krishnamurthy et al.  and Thorup and Zhang  explain in more detail how high speed hashing is necessary for their Internet traffic analysis. Incidentally, the hash function we use from  is a bit faster than the ones from [19, 25], which do not provide Chernoff-style concentration bounds.
The idea is generic and can be applied to other algorithms. We will also apply it to Broder’s original min-hash algorithm  to estimate set similarity, which can now be implemented efficiently, giving the desired estimates with high probability.
Let us now be more specific about the algorithmic context. We have a key universe , e.g., 64-bit keys, and a random hash function mapping uniformly into .
For some input set and some fraction , we want to know the number of keys from that hash below . Here could be an unknown function of , but should be independent of the random hash function . Then the mean is .
If the hash function is fully random, we get the classic Chernoff bounds on (see, e.g, ):
Unfortunately, we cannot implement fully random hash functions as it requires space as big as the universe.
To get something implementable in practice, Wegman and Carter  proposed strongly universal hashing. The random hash function is strongly universal if for any given distinct keys , is uniform in . The standard implementation of a strongly universal hash function into is to pick large prime and two uniformly random numbers . Then is strongly universal from to . Obviously it is not uniform in , but for any , we have with equality if . Below we ignore this deviation from uniformity in .
Assuming we have a strongly universal hash function , we again let be the number of elements from that hash below . Then and because the hash values are 2-independent, we have . Therefore, by Chebyshev’s inequality,
As gets large, we see that the concentration we get with strongly universal hashing is much weaker than the Chernoff bounds with fully random hashing. However, Chebyshev is fine if we just aim at a constant error probability like , and then we can use the median over independent repetitions to reduce the error probability.
In this paper we discuss benefits of having hash functions with strong concentration akin to that of fully random hashing:
A hash function is strongly concentrated with added error probability if for any set and , if is the number of elements from hashing below , and , then
If , we simply say that is strongly concentrated.
Another way of viewing the added error probability is as follows. We have strong concentration as long as we do not aim for error probabilities below , so if is sufficiently low, we can simply ignore it.
What makes this definition interesting in practice is that Aamand et al.  recently presented a fast practical small constant time hash function that for is strongly concentrated with added error probability for any constant . This term is so small that we can ignore it in all our applications. The speed is obtained using certain character tables in cache that we will discuss later.
Next we consider our two streaming applications, distinct elements and set-similarity, showing how strongly concentrated hashing eliminates the need for time consuming independent repetitions. We stress that in streaming algorithms on high volume data streams, speed is of critical importance. If the data is not processed quickly, the information is lost.
Distinct elements is the simplest case, and here we will also discuss the ramifications of employing the strongly concentrated hashing of Aamand et al.  as well as possible alternatives.
2 Counting distinct elements in a data stream
We consider a sequence of keys where each element may appear multiple times. Using only little space, we wish to estimate the number of distinct keys. We are given parameters and , and the goal is to create an estimator, , such that with probability at least .
Following the classic approach of Bar-Yossef et al. , we use a strongly universal hash function . For simplicity, we assume to be collision free over .
For some , we maintain the smallest distinct hash values of the stream. We assume for simplicity that . The space required is thus , so we want to be small. Let be the key having the ’th smallest hash value under and let . As in , we use as an estimator for (we note that  suggests several other estimators, but the points we will make below apply to all of them).
The point in using a hash function is that all occurrences of a given key in the stream get the same hash value, so if is the set of distinct keys, is just the smallest hash value from . In particular, depends only on , not on the frequencies of the elements of the stream. Assuming no collisions, we will often identify the elements with the hash values, so is smaller than if .
We would like to be concentrated around . For any probability , let denote the number of elements from that hash below . Let and . Note that both and are independent of the random hash function . Now
and these observations form a good starting point for applying probabilistic tail bounds as we now describe.
2.1 Strong universality and independent repetitions
Since is strongly universal, the hash values of any two keys are independent, so for any , we have , and so by Chebyshev’s inequality,
Assuming , we thus get that
To get the desired error probability , we could now set , but if is small, e.g. , becomes way too large. As in  we instead start by aiming for a constant error probability, , say . For this value of , it suffices to set . We now run (to be determined) independent experiments with this value of , obtaining independent estimators for , . Finally, as our final estimator, , we return the median of . Now for each , and these events are independent. If , then for at least half of the . By the standard Chernoff bound (1), this probability can be bounded by
Setting , we get the desired error probability . The total number of hash values stored is .
2.2 A better world with fully random hashing
Thus, to get error probability , we just use . There are several reasons why this is much better than the above approach using 2-independence and independent repetitions.
It avoids the independent repetitions, so instead of applying hash functions to each key we just need one. We thus save a factor of in speed.
Overall we store fewer hash values: instead of .
With independent repetitions, we are tuning the algorithm depending on and , whereas with a fully-random hash function, we get the concentration from (3) for every .
The only caveat is that fully-random hash functions cannot be implemented.
2.3 Using hashing with strong concentration bounds
We now discuss the effect of relaxing the abstract full-random hashing to hashing with strong concentration bounds and added error probability . Then for ,
To obtain the error probability , we again need to store hash values. Within a constant factor this means that we use the same total number using 2-independence and independent repetitions, and we still retain the following advantages from the fully random case.
With no independent repetitions we avoid applying hash functions to each key, so we basically save a factor in speed.
With independent repetitions, we only address a given and , while with a fully-random hash function we get the concentration from (3) for every .
2.4 Implementation and alternatives
We briefly discuss how to maintain the smallest elements/hash values. The most obvious method is using a priority queue, but this takes time per element, dominating the cost of evaluating the hash function. However, we can get down to constant time per element if we have a buffer for . When the buffer gets full, we find the median in linear time with (randomized) selection and discard the bigger elements. This is standard to de-amortize if needed.
A different, and more efficient, sketch from  identifies the smallest such that the number of keys hashing below is at most . For the online processing of the stream, this means that we increment whenever . At the end, we return . The analysis of this alternative sketch is similar to the one above, and we get the same advantage of avoiding independent repetitions using strongly concentrated hashing, that is, for error probability , in , they run independent experiments with independent hash functions, each storing up to hash values, whereas we run only a single experiment with a single strongly concentrated hash function storing hash values. The total number of hash values stored is the same, but asymptotically, we save a factor in time.
Estimating the number of distinct elements in a stream began with the work of Flajolet and Martin  and has continued with a long line of research [2, 3, 4, 5, 8, 9, 11, 12, 13, 14, 15, 16, 17, 27]. In particular, there has been a lot of focus on minimizing the sketch size. Theoretically speaking, the problem finally found an asymptotically optimal, both in time and in space, solution by Kane, Nelson and Woodruff , assuming we only need probability of success. The optimal space, including that of the hash function, is bits, improving the bits needed by Bar-Yossef et al.  to store hash values. Both  and , suggest using independent repetitions to reduce the error probability to , but then both time and space blow up by a factor .
Recently Blasiok  found a space optimal algorithm for the case of small error probability . In this case, the bound from  with independent repetitions was which he reduces to , again including the space for hash functions. He no longer has hash functions, but this only helps his space, not his processing time, which he states as polynomial in and .
The above space optimal algorithms [6, 18] are very interesting, but fairly complicated, seemingly involving some quite large constants. However, here our focus is to get a fast practical algorithm to handle a high volume data stream online, not worrying as much about space. Assuming fast strongly concentrated hashing, it is then much better to use our implementation of the simple algorithm of Bar-Yossef et al.  using .
2.5 Implementing Hashing with Strong Concentration
As mentioned earlier, Aamand et al.  recently presented a fast practical small constant time hash function, Tabulation-1Permutation, that for is strongly concentrated with additive error for any constant . The scheme obtains its power and speed using certain character tables in cache.
More specifically, we view keys as consisting of a small number of characters from some alphabet , that is, . For -bit keys, this could be characters of bits each. Let’s say that hash values are also from , but viewed as bit strings representing fractions in .
Tabulation-1Permutation needs character tables mapping characters to hash values. To compute the hash value of a key, we need to look up characters in these tables. In addition we need fast AC operations to extract the characters and xor the hash values. The character tables can be populated with an independent pseudo-random number generator, needing a random seed of bits.
Computer dependent versus problem dependent view of resources for hashing
We view the resources used for Tabulation-1Permutation as computer dependent rather than problem dependent. When you buy a new computer you can decide how much cache you want to allocate for your hash functions. In the experiments performed in , using 8-bit characters and for 64-bit keys was very efficient. On two computers, it was found that tabulation-1permutation was less than 3 times slower than the fastest known strongly universal hashing scheme; namely Dietzfelbinger’s  which does just one multiplication and one shift. Also, Tabulation-1Permutation was more than 50 times faster than the fastest known highly independent hashing scheme; namely Thorup’s  double tabulation scheme which, in theory also works in constant time.
In total, the space used by all the character tables is bits which is less than 20 KB, which indeed fits in very fast cache. We note that when we have first populated the tables with hash values, they are not overwritten. This means that the cache does not get dirty, that is different computer cores can access the tables and not worry about consistency.
This is different than the work space used to maintain the sketch of the number of distinct keys represented via hash values, but let’s compare anyway with real numbers. Even with a fully random hash function with perfect Chernoff bounds, we needed , so with, say, and , we get , which is much more than the hash values stored in the character tables for the hash functions. Of course, we would be happy with a much smaller so that everything is small and fits in fast cache.
We note that any rules out the concentration of previous tabulation schemes such a simple tabulation  and twisted tabulation . The reader is referred to  for a thorough discussion of the alternatives.
Let be a tabulation-1permutation hash function with and , . Consider a key/ball set of size where each ball is assigned a weight . Choose arbitrary hash values with . Define to be the total weight of balls hashing to the interval . Write and . Then for any constant and every ,
Here is given by , so . The above also holds if we condition the random hash function on a distinguished query key having a specific hash value.
The above statement is far more general than what we need. All our weights are unit weights. We fix and . Viewing hash values as fractions in
, the random variableis the number of items hashing below . Also, since , (5) implies the same statement with instead of . Moreover, our corresponds to , and then we get
which is exactly as in our Definition 1. Only remaining difference is that Definition 1 should work for any while the bound we get only works for that are multiples of . However, this suffices by the following general lemma:
Suppose we have a hash function such that for any set and for any that is a multiple of , for the number of elements from that hash below , with and , it holds that
Then the same statement holds for all
First we note that the statement is trivially true if , so we can assume . Since , we also have .
We are given an arbitrary . Let be the nearest higher multiple of . Since and we have , implying . We also let .
It is now clear that since , it holds that . We first show that
Indeed, implies .
But and , so . The last follows from the fact that , but and so .
The exact same reasoning gives
Notice that , and and are multiples of , so we can use the bounds of the statement. Thus is upper bounded by
We note that  also presents a slightly slower scheme, Tabulation-Permutation, which offers far more general concentration bounds than those for Tabulation-1Permutation in Theorem 1. However, Tabulation-1Permutation is faster and sufficient for the strong concentration needed for our streaming applications.
3 Set similarity
We now consider Broder’s  original algorithm for set similarity. As above, it uses a hash function which we assume to be collision free. The bottom- sample of a set consists of the elements with the smallest hash values. If is fully random then is a uniformly random subset of distinct elements from . We assume here that . With , we can estimate the frequency of any subset as .
Broder’s main application is the estimation of the Jaccard similarity between sets and . Given the bottom- samples from and , we may construct the bottom- sample of their union as , and then the similarity is estimated as .
We note again the crucial importance of having a common hash function . In a distributed setting, samples and can be generated by different entities. As long as they agree on , they only need to communicate the samples to estimate the Jaccard similarity of and . As noted before, for Tabulation-1Permutation can be shared by exchanging a random seed of bits.
For the hash function , Broder  first considers fully random hashing. Then is a fully random sample of distinct elements from , which is very well understood.
Broder also sketches some alternatives with realistic hash functions, but Thorup  showed that even if we just use 2-independence, we get the same expected error as with fully random hashing, but here we want strong concentration. Our analysis follows the simple union-bound approach from .
For the analysis, it is simpler to study the case where we are sampling from a set and want to estimate the frequency of a subset . Let be the th smallest hash value from as in the above algorithm for estimating distinct elements. For any let be the number of elements from with hash value at most . Then which is our estimator for .
For , if is strongly concentrated with added error probability , then
Let . We already saw in (4) that for any , Thus, with and , we have with probability , and in that case, .
Let . By strong concentration, for any , we get that
Likewise, with , for any , we get that
To prove the theorem for , we set . Then and . Therefore
This completes the proof of (6). ∎
As for the problem of counting distinct elements in a stream, in the online setting we may again modify the algorithm above to obtain a more efficient sketch. Assuming that the elements from appear in a stream, we again identify the smallest such that the number of keys from hashing below , , is at most . We increment by one whenever and in the end we return as an estimator for . The analysis of this modified algorithm is similar to the analysis provided above.
Research of all authors partly supported by Thorup’s Investigator Grant 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation. Evangelos Kipouridis has also received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 801199.
-  Aamand, A., Knudsen, J. B. T., Knudsen, M. B. T., Rasmussen, P. M. R., and Thorup, M. Fast hashing with strong concentration bounds. CoRR abs/1905.00369 (2019). Accepted for STOC’20.
Alon, N., Matias, Y., and Szegedy, M.
The space complexity of approximating the frequency moments.Journal of Computer and System Sciences 58, 1 (1999), 209–223. Announced at STOC’96.
-  Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D., and Trevisan, L. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM) (2002), pp. 1–10.
-  Bar-Yossef, Z., Kumar, R., and Sivakumar, D. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. 13th ACM/SIAM Symposium on Discrete Algorithms (SODA) (2002), pp. 623–632.
-  Beyer, K. S., Haas, P. J., Reinwald, B., Sismanis, Y., and Gemulla, R. On synopses for distinct-value estimation under multiset operations. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007 (2007), pp. 199–210.
-  Blasiok, J. Optimal streaming and tracking distinct elements with high probability. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018 (2018), pp. 2432–2448.
-  Broder, A. Z. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES) (1997), pp. 21–29.
-  Brody, J., and Chakrabarti, A. A multi-round communication lower bound for gap hamming and some consequences. In Proceedings of the 24th Annual IEEE Conference on Computational Complexity, CCC 2009, Paris, France, 15-18 July 2009 (2009), pp. 358–368.
-  Cohen, E. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences 55, 3 (1997), 441–453. Announced at STOC’94.
-  Dietzfelbinger, M. Universal hashing and -wise independent random variables via integer arithmetic without primes. In Proc. 13th Symposium on Theoretical Aspects of Computer Science (STACS) (1996), pp. 569–580.
-  Durand, M., and Flajolet, P. Loglog counting of large cardinalities (extended abstract). In Proc. 11th European Symposium on Algorithms (ESA) (2003), pp. 605–617.
-  Estan, C., Varghese, G., and Fisk, M. E. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw. 14, 5 (2006), 925–937.
-  Flajolet, P., and Martin, G. N. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences 31, 2 (1985), 182–209.
-  Flajolet, P., Éric Fusy, Gandouet, O., and Meunier, F. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In In AOFA ’07: Proceedings of the 2007 International Conference on Analysis of Algorithms (2007), pp. 127–146.
-  Gibbons, P. B. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy (2001), pp. 541–550.
-  Gibbons, P. B., and Tirthapura, S. Estimating simple functions on the union of data streams. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA 2001, Heraklion, Crete Island, Greece, July 4-6, 2001 (2001), pp. 281–291.
-  Indyk, P., and Woodruff, D. P. Tight lower bounds for the distinct elements problem. In 44th Symposium on Foundations of Computer Science (FOCS 2003), 11-14 October 2003, Cambridge, MA, USA, Proceedings (2003), pp. 283–288.
-  Kane, D. M., Nelson, J., and Woodruff, D. P. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, June 6-11, 2010, Indianapolis, Indiana, USA (2010), pp. 41–52.
-  Krishnamurthy, B., Sen, S., Zhang, Y., and Chen, Y. Sketch-based change detection: methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Internet Measurement Conference, IMC 2003, Miami Beach, FL, USA, October 27-29, 2003 (2003), pp. 234–247.
-  Motwani, R., and Raghavan, P. Randomized Algorithms. Cambridge University Press, 1995.
-  Pǎtraşcu, M., and Thorup, M. The power of simple tabulation-based hashing. Journal of the ACM 59, 3 (2012), Article 14. Announced at STOC’11.
-  Pǎtraşcu, M., and Thorup, M. Twisted tabulation hashing. In Proc. 24th ACM/SIAM Symposium on Discrete Algorithms (SODA) (2013), pp. 209–228.
Bottom-k and priority sampling, set similarity and subset sums with
Proc. 45th ACM Symposium on Theory of Computing (STOC)(2013).
-  Thorup, M. Simple tabulation, fast expanders, double tabulation, and high independence. In Proc. 54th IEEE Symposium on Foundations of Computer Science (FOCS) (2013), pp. 90–99.
-  Thorup, M., and Zhang, Y. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing 41, 2 (2012), 293–331. Announced at SODA’04 and ALENEX’10.
-  Wegman, M. N., and Carter, L. New classes and applications of hash functions. Journal of Computer and System Sciences 22, 3 (1981), 265–279. Announced at FOCS’79.
-  Woodruff, D. P. Optimal space lower bounds for all frequency moments. In Proc. 15th ACM/SIAM Symposium on Discrete Algorithms (SODA) (2004), pp. 167–175.