Count-Min sketch is a hash-based data structure to represent a dynamically changing associative array of counters in an approximate way. The array can be seen as a mapping from some set of keys to and the goal is to support point queries about the (approximate) current value of for a key . Count-Min is especially suitable for the streaming framework, when counters associated to keys are updated dynamically. That is, updates are (key,value) pairs with the meaning that is updated to .
Count-Min sketch was proposed in , see e.g.  for a survey. A similar data structure was introduced earlier in  named Spectral Bloom filter, itself closely related to Counting Bloom filters . The difference between Count-Min sketch and Spectral Bloom filter is marginal: while a Count-Min sketch requires hash functions to have disjoint codomains (rows of Count-Min matrix), a Spectral Bloom filter has all hash functions mapping to the same array, as does the regular Bloom filter. In this paper, we will deal with the Spectral Bloom filter version but will keep the term Count-Min sketch as more common in the literature.
Count-Min sketch supports negative update values provided that, at each moment, each counter remains non-negative (the so-called strict turnstile model ). When updates are positive, the Count-Min update algorithm can be modified to a stronger version leading to smaller errors in queries. This modification, introduced in  as conservative update, is mentioned , without any formal analysis given in those papers. This variant is also discussed in  under the name minimal increase, where it is claimed that it decreases the probability of a positive error by a factor of the number of hash functions, but no proof is given. We discuss this claim in the concluding part of this paper.
The case of positive updates is widespread in practice. In particular, a very common instance is counting where all update values are . This task occurs in different scenarios in network traffic monitoring, as well as other applications related to data stream mining . In bioinformatics, we may want to maintain, on the fly, multiplicities of -mers (words of length ) occurring in a big dataset [20, 1, 24]. We refer to  for more examples of applications.
While it is easily seen that the error in conservative update can only be smaller than in Count-Min, more precise bounds are not known. Count-Min guarantees, with high probability, that the additive error can be made bounded by for any , where is the -norm of . In the counting setting, is the length of the input stream which can be very large, and therefore this bound provides a weak guarantee in practice, unless the distribution of keys is very skewed and queries are made on frequent elements (heavy hitters) [18, 3, 6]. Designing a probabilistic sketch for counting with good error guarantees is therefore an important practical question.
An attempt to analyse the conservative update algorithm for counting has been made in . Their main idea is a simulation of spectral Bloom filters by a hierarchy of ordinary Bloom filters. However, the bounds provided are not explicit but are expressed via a recursive relation based on false positive rates of involved Bloom filters.
In this paper, we provide a probabilistic analysis of the conservative update scheme for counting under the assumption of uniform distribution of keys in the input. Our main result is a demonstration that the error in count estimates undergoes a phase transition when the number of distinct keys grows relative to the size of the Count-Min array. The phase transition threshold corresponds to the peelability threshold for random -uniform hypergraphs ( number of hash functions). For the subcritical regime, when the number of distinct keys is below the threshold, we show that the relative error for a randomly chosen key tends to asymptotically, with high probability. This contrasts with the regular Count-Min algorithm producing a relative error shown to be at least with constant probability.
For the supercritical regime, we show that the average relative error is lower-bounded by a constant (depending on the number of distinct keys), with high probability. We prove this result for and conjecture that it holds for arbitrary as well. We provide computer simulations showing that the expected relative error grows fast after the threshold, with a distribution showing a peculiar multi-modal shape. In particular, keys with small (or zero) error still occur after the threshold, but their fraction quickly decreases when the number of distinct keys grows.
After defining Count-Min sketch and the conservative update strategy in Section 2 and introducing hash hypergraphs in Section 3, we formulate, in Section 4, the conservative update algorithm (or regular Count-Min, for that matter) in terms of a hypergraph augmented with counters associated to vertices. In Section 5, we state our main results and illustrate them with a series of computer simulations. All technical proofs are provided in Section 6.
In addition, in Section 7, we study a specific family of 2-regular -hypergraphs that are sparse but not peelable. For such graphs we show that while the relative error of every key is with the regular Count-Min strategy, it is for conservative update. While this result is mainly of theoretical interest, it illustrates that the peelability property is crucial for the error to be asymptotically vanishing.
2 Count-Min and Conservative Update
We consider a (counting version of) Count-Min sketch to be an array of size of counters initially set to , together with hash functions mapping keys from a given universe to . To count key occurrences in a stream of keys, regular Count-Min proceeds as follows. To process a key , each of the counters , , is incremented by . Querying the occurrence number of a key returns the estimate . It is easily seen that . A bound on the overestimate of is given by the following result adapted from .
Theorem 1 ()
For , , consider a Count-Min sketch with and size . Then with probability at least , where is the size of the input stream.
While Theorem 1 is useful in some situations, its utility is limited, as it bounds the error with respect to the stream size, which can be very large.
Conservative update strengthens Count-Min by increasing only the smallest counters among . Formally, for , is incremented by only if and is left unchanged otherwise. The estimate of , denoted , is computed as before: . It can be seen that still holds, and that . The latter follows from the observation that on the same input, an entry of counter array under conservative update can never get larger than the same entry under Count-Min.
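The two update rules can be contrasted in a minimal sketch. It assumes the single-array (Spectral Bloom filter) layout adopted in this paper; the class name `CountMinSketch`, the salted BLAKE2b hashing and the choice to increment a cell only once when two hash functions collide on it are illustrative assumptions, not part of the original description.

```python
import hashlib


class CountMinSketch:
    """Counting sketch: k hash functions sharing one array of n counters."""

    def __init__(self, n, k, conservative=False):
        self.n = n
        self.k = k
        self.conservative = conservative
        self.counters = [0] * n

    def _cells(self, key):
        # Hypothetical hashing scheme: k salted BLAKE2b digests reduced mod n.
        cells = []
        for i in range(self.k):
            digest = hashlib.blake2b(
                repr(key).encode(), salt=i.to_bytes(8, "little")
            ).digest()
            cells.append(int.from_bytes(digest[:8], "little") % self.n)
        return cells

    def add(self, key):
        # Assumption: a cell hit by several hash functions is incremented once.
        cells = set(self._cells(key))
        if self.conservative:
            # Conservative update: increment only the smallest counters.
            low = min(self.counters[c] for c in cells)
            for c in cells:
                if self.counters[c] == low:
                    self.counters[c] += 1
        else:
            # Regular Count-Min: increment every counter of the key.
            for c in cells:
                self.counters[c] += 1

    def estimate(self, key):
        # The estimate is the minimum over the counters of the key.
        return min(self.counters[c] for c in self._cells(key))
```

On any common input, the conservative estimate lies between the true count and the regular Count-Min estimate, which is the ordering stated above.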
3 Hash hypergraphs
With a counter array and hash functions we associate a -uniform hash hypergraph with vertex-set and edge-set for all distinct keys . Let be the set of -uniform hypergraphs with vertices and edges. We assume that hash functions are truly random, that is, values are independent random variables uniformly distributed on . Even if this property is not granted by hash functions used in practice, it is a reasonable and commonly used hypothesis to conduct the analysis of sketch algorithms. Under this assumption, the hash hypergraph is a uniformly random Erdős-Rényi hypergraph in , which we denote by , where is the number of distinct keys in the input (for , we use the notation ).
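Under the truly-random hashing assumption, a hash hypergraph can be sampled directly, without any hash function. The following sketch draws each hash value independently and uniformly; the function name `random_hash_hypergraph` and the representation of an edge as a tuple of vertex indices are assumptions made for illustration. Since values are drawn independently, an edge may occasionally contain a repeated vertex.

```python
import random


def random_hash_hypergraph(n, m, k, rng):
    """Sample m edges, each made of k independent uniform vertices in [0, n)."""
    return [tuple(rng.randrange(n) for _ in range(k)) for _ in range(m)]
```

For instance, `random_hash_hypergraph(1000, 400, 3, random.Random(0))` models 400 distinct keys hashed by 3 functions into an array of 1000 counters.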
Below we show that the behavior of a sketching scheme depends on the properties of the associated hash hypergraph. It is well-known that depending on the ratio, many properties of Erdős-Rényi (hyper)graphs follow a phase transition phenomenon . For example, the emergence of a giant component, of size , occurs with high probability (hereafter, w.h.p.) at the threshold .
Particularly relevant to us is the peelability property. Let be a hypergraph. The peeling process on is as follows. We define , and iteratively for , we define to be the set of leaves (vertices of degree 1) or isolated vertices in , to be the set of edges of incident to vertices in , and to be the hypergraph obtained from by deleting the vertices of and the edges of . A vertex in is said to have peeling level . The process stabilizes from some step , and the hypergraph is called the core of , which is the largest induced sub-hypergraph whose vertices all have degree at least . If is empty, then is called peelable.
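The peeling process described above can be sketched as follows. The function name `peel`, the edge representation (tuples of vertex indices) and the convention that levels start at 0 are assumptions made for this illustration.

```python
def peel(num_vertices, edges):
    """Return (levels, core_edges).

    levels maps each peeled vertex to its peeling level (starting at 0);
    core_edges lists the indices of the edges that are never removed.
    """
    incident = [set() for _ in range(num_vertices)]
    for i, e in enumerate(edges):
        for v in e:
            incident[v].add(i)
    levels = {}
    remaining = set(range(num_vertices))
    live_edges = set(range(len(edges)))
    level = 0
    while remaining:
        # Leaves and isolated vertices of the current hypergraph.
        layer = [v for v in remaining if len(incident[v]) <= 1]
        if not layer:
            break  # what is left is the (non-empty) core
        removed = set()
        for v in layer:
            levels[v] = level
            removed |= incident[v]
        remaining.difference_update(layer)
        live_edges -= removed
        # Deleting an edge lowers the degree of its other vertices.
        for i in removed:
            for v in edges[i]:
                incident[v].discard(i)
        level += 1
    return levels, sorted(live_edges)
```

The hypergraph is peelable exactly when the returned list of core edges is empty.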
It is known  that peelability undergoes a phase transition. For , there exists a positive constant such that, for , the random hypergraph is w.h.p. peelable as , while for , the core of has w.h.p. a size concentrated around for some that depends on . The first peelability thresholds are , , etc., being the largest.
For , for , w.h.p. a proportion of vertices are in trees of size , (and a proportion of the vertices are in the core), while for , the core size is w.h.p. concentrated around for that depends on .
We note that properties of hash hypergraphs determine the behavior of some other hash-based data structures, such as Cuckoo hash tables  and Cuckoo filters , Minimal Perfect Hash Functions and Static Functions , Invertible Bloom filters , and others. We refer to  for an extended study of relationships between properties of hash hypergraphs and some of those data structures. In particular, peelability is directly relevant to certain constructions of Minimal Perfect Hash Functions as well as to good functioning of Invertible Bloom filters.
4 CU-process on hypergraphs
The connection to hash hypergraphs allows us to reformulate the Count-Min algorithm with conservative updates as a process, which we call CU-process, on a random hypergraph , where correspond to counter array length, number of distinct keys, and number of hash functions, respectively. Let be a hypergraph. To each vertex we associate a counter initially set to . At each step , a CU-process on chooses an edge in , and increments by 1 those which verify . For and , will denote the value of the counter after steps, and the number of times edge has been drawn in the first steps. The counter of an edge is defined as . Clearly, for each and each , . The relative error of at time is defined as . The following Lemma can be easily proved by induction on .
Lemma 1
Let be a hypergraph on which a CU-process is run. At every step , for each vertex , there is at least one edge incident to such that .
Observe that, when is a graph (), Lemma 1 is equivalent to the property that vertex counters cannot have a strict local maximum, i.e., at every step , each vertex has at least one neighbour such that .
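A minimal sketch of the CU-process on a hypergraph may help fix the definitions. The function names are hypothetical, an input sequence is given as a list of edge indices, and two formulas lost from the text above are assumed here: the relative error is taken to be (edge counter minus number of occurrences) divided by the number of occurrences, and Lemma 1 is read as stating that every non-isolated vertex has an incident edge whose counter equals the vertex counter.

```python
def cu_process(num_vertices, edges, sequence, conservative=True):
    """Run the process on a sequence of edge indices.

    Returns (vertex counters, number of occurrences of each edge).
    """
    counters = [0] * num_vertices
    occurrences = [0] * len(edges)
    for i in sequence:
        occurrences[i] += 1
        cells = set(edges[i])
        low = min(counters[v] for v in cells)
        for v in cells:
            # Conservative update touches only the minimal counters.
            if not conservative or counters[v] == low:
                counters[v] += 1
    return counters, occurrences


def edge_counter(counters, e):
    """Counter of an edge: the minimum over its vertices."""
    return min(counters[v] for v in e)


def relative_error(counters, occurrences, edges, i):
    """Assumed definition; requires edge i to occur at least once."""
    return (edge_counter(counters, edges[i]) - occurrences[i]) / occurrences[i]


def lemma1_holds(counters, edges):
    """Check the assumed reading of Lemma 1 on all non-isolated vertices."""
    for v in {u for e in edges for u in e}:
        if not any(
            v in e and edge_counter(counters, e) == counters[v] for e in edges
        ):
            return False
    return True
```

On a triangle whose three edges are each drawn once, in order, the conservative process ends with vertex counters 2, 1, 2, so two edges have no error and one has relative error 1, whereas regular Count-Min gives relative error 1 to every edge.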
5 Phase transition of the relative error
5.1 Main results
Let be a hypergraph, . Let . We consider two closely related models of input to perform the CU-process. In the -uniform model, the CU process is performed on a random sequence of keys (edges in ) of length , each element being drawn independently and uniformly in . In the -balanced model, the CU-process is performed on a random sequence of length , such that each occurs exactly times, and the order of keys is random. In other words, the sequence of keys on which the CU-process is performed is a random permutation of the multiset made of copies of each key of . Clearly, both models are very close, since the number of occurrences of any key in the -uniform model is concentrated around (with Gaussian fluctuations of order ) as gets large. For both models, we use the notation for the resulting counter of , for the resulting counter of , and for the resulting relative error of . In the -balanced model, since each element occurs times, we have .
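The two input models can be generated as follows. This is a sketch with hypothetical function names, in which keys are represented by edge indices and the sequence length is taken to be N times the number of edges in both models.

```python
import random


def uniform_sequence(num_edges, N, rng):
    """N-uniform model: N * num_edges independent uniform draws."""
    return [rng.randrange(num_edges) for _ in range(N * num_edges)]


def balanced_sequence(num_edges, N, rng):
    """N-balanced model: random permutation of N copies of each edge."""
    sequence = [i for i in range(num_edges) for _ in range(N)]
    rng.shuffle(sequence)
    return sequence
```

In the balanced model every edge occurs exactly N times, while in the uniform model the number of occurrences of an edge fluctuates around N.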
Our main result is the following.
Theorem 2 (subcritical regime)
Let , and let , where , and for , is the peelability threshold as defined in Section 3. Consider a CU-process on a random hypergraph under either -uniform or -balanced model, and consider the relative error of a random edge in . Then w.h.p., as both and grow. (Footnote 1: formally, for any , there exists such that if and .)
Note that with the regular Count-Min algorithm (see Section 2), in the -balanced model, the counter value of a node is , and the relative error of an edge is always (whatever ) equal to , and is thus always a non-negative integer. For fixed and , and for a random edge in , the probability that all vertices belonging to have at least one incident edge apart from converges to a positive constant . Therefore, is a nonnegative integer whose probability to be non-zero converges to . Thus, Theorem 2 ensures that, for , conservative updates lead to a drastic decrease of the error, from to .
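The remark on regular Count-Min can be checked numerically. In the balanced model, each vertex counter equals N times the vertex degree, so the relative error of an edge is the smallest degree among its vertices, minus one. The sketch below compares this degree formula with a direct simulation; function names are hypothetical, and edges are assumed to be distinct and to have distinct vertices.

```python
import random
from collections import Counter


def count_min_errors_from_degrees(edges):
    """Relative error of each edge predicted by vertex degrees."""
    degree = Counter(v for e in edges for v in set(e))
    return [min(degree[v] for v in e) - 1 for e in edges]


def count_min_errors_simulated(edges, N, rng):
    """Relative error of each edge after a balanced run of regular Count-Min."""
    sequence = [i for i in range(len(edges)) for _ in range(N)]
    rng.shuffle(sequence)
    counters = Counter()
    for i in sequence:
        for v in set(edges[i]):
            counters[v] += 1
    return [(min(counters[v] for v in e) - N) / N for e in edges]
```

For a triangle with one pendant edge attached, both functions give error 1 for the three triangle edges and error 0 for the pendant edge, whose leaf has degree 1.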
For a given hypergraph with edges, we define the average error over the edges of . Theorem 2 states that a randomly chosen edge has a small error, but this does not formally exclude that a small fraction of edges may have large errors, possibly yielding larger than . However, we believe that this is not the case. From the previous remark, it follows that the error of an edge is upper-bounded by . Since the expected maximal degree in grows very slowly with , one can expect that any set of edges should have a contribution w.h.p. This is also supported by experiments given in the next section.
Based on Theorem 2 and the above discussion, we conjecture that a phase transition occurs for the average error, in the sense that it is in the subcritical regime , and in the supercritical regime , w.h.p. Regarding the supercritical regime, we are able to show that this indeed holds for in the -balanced model.
Theorem 3 (supercritical regime, case )
Let . Then there exists a positive constant such that, in the -balanced model, w.h.p., as grows. (Footnote 2: formally, for any , there exists such that if and .)
Our proof of Theorem 3 can be extended to any and such that the giant component in satisfies w.h.p. for some positive constant . For , any value has this property , while for (respectively, ), the analysis given in  ensures that this property holds for values of strictly above the peelability threshold, namely (respectively, ) . Nevertheless, based on simulations presented below, we expect that Theorem 3 holds for for all as well; proving this, however, would require a different kind of argument.
Here we provide several experimental results illustrating the phase transition stated in Theorems 2 and 3. Figure 1 shows plots for the average relative error as a function of , for for regular Count-Min and the conservative update strategies. Experiments were run for with the -independent model (each edge drawn independently with probability ) and (number of steps ). For each , an average is taken over random graphs.
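An experiment of this kind can be reproduced with a short script. The sketch below is not the code used for the figures: it assumes the balanced input model, takes the load parameter `lam` to be the ratio of the number of distinct keys to the number of counters, and uses a hypothetical function name and hypothetical parameter values.

```python
import random


def average_errors(n, lam, k, N, rng):
    """Return (average relative error of Count-Min, of conservative update)."""
    m = max(1, round(lam * n))
    edges = [tuple(rng.randrange(n) for _ in range(k)) for _ in range(m)]
    sequence = [i for i in range(m) for _ in range(N)]
    rng.shuffle(sequence)
    cm = [0] * n  # counters under regular Count-Min
    cu = [0] * n  # counters under conservative update
    for i in sequence:
        cells = set(edges[i])
        low = min(cu[v] for v in cells)
        for v in cells:
            cm[v] += 1
            if cu[v] == low:
                cu[v] += 1

    def average(counters):
        # Each key occurs exactly N times in the balanced model.
        return sum((min(counters[v] for v in e) - N) / N for e in edges) / m

    return average(cm), average(cu)
```

Calling `average_errors(2000, lam, 2, 50, random.Random(0))` for a range of values of `lam` gives one point per value for each of the two strategies; averaging over several random graphs smooths the curves.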
The phase transitions are clearly seen to correspond to the critical threshold for , and, for , to the peelability thresholds , . Observe that the transition looks sharper for , which may be explained by the fact that the core size undergoes a discontinuous phase transition for , as shown in  (e.g. for , the fraction of vertices in the core jumps from to about ).
For the supercritical regime, we first analysed the concentration around the average shown in Figure 1 by simulating CU-processes on 50 random graphs, for each . Figure 2 shows the results. We observe that the concentration tends to become tighter for growing .
Furthermore, we experimentally studied the empirical distribution of individual relative errors, which turns out to have an interesting multi-modal shape for intermediate values of . Typical distributions for are illustrated in Figure 3 where each point corresponds to an edge, and the edges are randomly ordered along the -axis. Each plot corresponds to an individual random graph.
When grows beyond the peelability threshold, a fraction of edges with small errors still remains but vanishes quickly: these include edges incident to at least one leaf (these have error ) and peelable edges (these have error , by arguments to be given in Section 6.3). For intermediate values of , the distribution presents several modes: besides the main mode (largest concentration on plots of Figure 3), we observe a few other concentration values which are typically integers. While this phenomenon is still to be analysed, we explain it by the presence in the graphs of certain structural motifs that involve disparities in node degrees. Note that the fraction of values concentrated around the main mode is dominant: for example, for (Figure 2(d)), about 90% of values correspond to the main mode (). Finally, when becomes larger, these “secondary modes” disappear, and the distribution becomes concentrated around a single value. This is consistent with the tighter concentration observed earlier in Figure 2.
Finally, we report on another experiment supporting the conjecture of a positive average error in the supercritical regime. We simulated the CU-process on sparse random non-peelable 3-hypergraphs (i.e. ), namely 2-regular 3-hypergraphs with edges and vertices ( parameter). These are the sparsest possible non-peelable 3-hypergraphs, with degree 2 of each vertex. We observed that the average error for such graphs is concentrated around a constant value of . Since the core size is linear in the supercritical regime, this experiment provides evidence of a positive error in the general case. While this remains to be proved in general, in Section 7 we provide a proof for certain families of regular hypergraphs.
6 Proofs of main results
Theorem 2 relies on properties of random hypergraphs. Case corresponds to Erdős-Rényi random graphs  which have been extensively studied . In particular, it is well known that, when and gets large, is, w.h.p., a union of small connected components most of which are constant-size trees. That is, a random edge in is, w.h.p., in a tree of size . Thus, the proof amounts to showing that, for a fixed tree and a vertex , we have w.h.p. We prove this in Section 6.1 for both -uniform and -balanced models. The proof for , given in Section 6.3, requires more ingredients. An additional difficulty is that, for , a random edge in may be in the giant component (if ). However, we rely on the fact that the peeling level of is w.h.p., and prove that for a vertex of bounded level, we have w.h.p. as , where the term does not depend on the size of the giant component.
6.1 CU-process on a fixed tree
6.1.1 Analysis in the -uniform model
Consider a graph , with edges, on which the CU-process is run, in the -uniform model. Recall that (resp. ) denotes the value of the counter for (resp. for ) after steps, and is the number of occurrences in the first steps. The aim of this Section is to prove the following result.
Lemma 2
Let be a tree, on which the CU-process is run, in the -uniform model. Let . Then, for every vertex of , there exist absolute positive constants such that, for any and , we have
Lemma 2 implies that, in the -uniform model, the final counter of every vertex of is concentrated around , with (sub-)Gaussian fluctuations of order . The same holds for the final counter of every edge of . On the other hand, the number of times is chosen follows a binomial distribution. Then, it is also concentrated around , with Gaussian fluctuations of order . This implies that w.h.p. as gets large.
We say that a family of events indexed by two parameters , is -concentrated if there are absolute constants such that, for every , we have , where . For a (possibly random) quantity depending on , we use the notation . Thus, to prove Lemma 2, we have to show that for a fixed tree with edges on which the -independent CU-process is run, and for a vertex of , the event is -concentrated.
We proceed by induction on the peeling level of vertices. A vertex at level is a leaf. Let be its incident edge. It is easy to see that the counter of increases exactly at the steps when is drawn. Hence for . Doob’s martingale maximal inequality combined with Hoeffding’s inequality ensure that, for every edge of , we have
Hence, for any leaf of , the event is -concentrated. Moreover, for a vertex and an arbitrary edge incident to , the fact that ensures that the event is -concentrated. It thus remains to show that, for vertices of positive levels, the event is -concentrated.
The following statement will be useful to treat the inductive step.
Lemma 3
Let be a graph on which a CU-process is run. Let be a vertex of , with its neighbours one of which (say ) is distinguished, and with denoting the edge . For , let . Consider the event that there exists such that , and the event that there exists such that or . Then implies .
Proof: If holds, let be such that . Let . The crucial point is that, in the interval , any step where increases occurs when is chosen (indeed, when , choosing an edge with yields no increase of ). Hence, . Since and we conclude that . Thus we also have . Hence , so that holds.
Let , and assume Lemma 2 holds for all vertices of level smaller than . Let be a vertex of level , for which we want to prove that the event is -concentrated. All neighbours of have level at most , except for one, say , with its level in (respectively corresponding to having one vertex, two vertices, or at least three vertices).
Let be the edge between and . Let be the event that for some . If this holds, then one of the two events or holds. By Lemma 3, the first event implies that or for some . Hence, the event implies that either the event (which is also the union of the events for ) or the event holds. Since these events are -concentrated, we conclude (using the union bound) that the event is also -concentrated, which concludes the proof of Lemma 2.
6.1.2 Analysis in the -balanced model
Lemma 4
Let be a tree, on which the CU-process is performed, in the -balanced model. Then, for every edge of , we have w.h.p. as .
Proof: Note that the -balanced model is just the -uniform model conditioned on all edges occurring exactly times, which happens with probability if has edges. Let be an extremity of . By Lemma 2, there exists a positive constant such that, in the -uniform model,
Hence, in the -balanced model, we have
which ensures that w.h.p. as .
6.2 Proof of Theorem 2 for
We use the well-known property  that, for fixed , a random edge in is w.h.p. in a tree of size . Precisely, if we let be the event that belongs to a tree of size at most , then we have the property that, for every , there exist and such that for . By Lemma 4, there exists (depending on ) such that, in the -balanced or -uniform model, for every tree with at most edges, and every edge , we have for . Hence, for and , if we perform the CU-process on in the -balanced model (note that the -balanced model holds separately on every connected component), and draw a random edge , we have
This means that w.h.p., when and grow. In the -uniform model, the same argument holds, using the fact that the total number of times edges in are drawn is concentrated around .
6.3 Proof of Theorem 2 for
The proof partly follows the same lines as for , but requires additional arguments, in particular a suitably extended notion of peelability. A marked hypergraph is a hypergraph where some of the vertices are marked. The subset of marked vertices is denoted . When performing a CU-process on a marked hypergraph, the counters of unmarked vertices are (as usual) initially , while the counters at marked vertices are initially (and remain) . When peeling, the marked vertices are not allowed to be peeled (even when they are incident to a unique edge). We define , and iteratively for , we define as the set of non-marked vertices that are leaves or isolated vertices in , as the set of edges of incident to vertices in , and to be the hypergraph obtained from by deleting the vertices in and the edges in . A vertex in is said to have level . Then is called peelable if every unmarked vertex is peeled at some step.
Lemma 5
Let be a peelable connected marked hypergraph, on which the CU-process is run, in the -uniform model. Let , and for , let be the value of the counter of after steps. Then, for every unmarked vertex of , there exist absolute positive constants such that, for any and , we have
In the -uniform or -balanced model, for every unmarked vertex of , we have w.h.p. as .
For a hypergraph, and for a vertex of finite level in , a vertex is called a descendant of if there is a sequence of vertices such that, for each , and are incident to the same edge, and the level of is larger than the level of .
Lemma 6
For a hypergraph , and for a vertex of of finite level, let be the set formed by and its descendants, and let be the set of edges that are incident to at least one vertex from . Let be the marked hypergraph formed by the edges in and their incident vertices, where the marked vertices are those not in . Then is peelable.
Proof: Let be a vertex in , of level . If , then is a leaf in . It is also a leaf in , and is immediately peeled. Otherwise, except for possibly one incident edge, has at least one neighbour of level smaller than in every incident edge (note that has to be in ). By induction, and are peeled in the peeling process of . Hence, is peeled as well during the peeling process on (by induction as well, the level of a vertex in is actually the same in and in ).
In a hypergraph , the distance between two vertices and is the minimum number of edges to traverse in order to reach from . For and , the ball is the set of vertices at distance at most from . Clearly, if is at level , then every vertex in is in , and every vertex of is in .
Lemma 7 (Lemma 3 in )
Let , and let . For every , there exist and such that, for a random vertex in , the probability that has level at most is at least , for .
Lemma 8
Let be a hypergraph, and let be a subset of . Let (resp. ) be the subset of vertices incident to at least one edge from (resp. incident only to edges in ). Let be the marked hypergraph where the unmarked vertices are those in . Consider a CU-process on , with the sequence of items (each item an edge of ), and for let be the final value of the counter of . Let be the subsequence of composed of the items in . Consider the CU-process on where the sequence of items is . For , let be the final value of the counter of . Then .
Proof: For an assignment of initial values to the vertex counters, let be the function giving the final vertex counters after performing the CU-process on an input sequence . It is easy to check (by induction on the length of ) that the CU-process is monotone: if for all , then for all . Now, if we define as the function assigning initial value to vertices in , and initial value to the other vertices, then for all . On the other hand, if denotes the function assigning initial value to all vertices, then for all . Hence, the monotonicity property ensures that for all .
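The monotonicity property used in this proof can be checked empirically. The sketch below, with hypothetical function names, runs the CU-process from arbitrary initial counter values and compares the outcomes for two pointwise-ordered initial assignments; it is a sanity check on small random instances, not a proof.

```python
import random


def run_cu(initial, edges, sequence):
    """Final vertex counters of the CU-process started from `initial`."""
    counters = list(initial)
    for i in sequence:
        cells = set(edges[i])
        low = min(counters[v] for v in cells)
        for v in cells:
            if counters[v] == low:
                counters[v] += 1
    return counters


def monotone_on_random_inputs(edges, num_vertices, trials, rng):
    """Compare runs started from ordered initial values on random sequences."""
    for _ in range(trials):
        low = [rng.randrange(3) for _ in range(num_vertices)]
        high = [x + rng.randrange(3) for x in low]  # high >= low pointwise
        sequence = [rng.randrange(len(edges)) for _ in range(30)]
        final_low = run_cu(low, edges, sequence)
        final_high = run_cu(high, edges, sequence)
        if any(a > b for a, b in zip(final_low, final_high)):
            return False
    return True
```

Marked vertices can be modelled in the same function by giving them the initial value `float("inf")`, which is never a strict minimum against a finite counter and is left unchanged by an increment.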
We can now conclude the proof of Theorem 2 for . Let . Let and be as given in Lemma 7. Let be such that, for a random vertex , we have for all . The existence of easily follows from the known property that converges to a Galton-Watson branching process for the local topology, see e.g. [9, Prop 2.6]. Let be a random vertex in . Let be the event that the level of is at most , and the number of vertices in is at most . Then, using Remark 1 and the union bound, we have for . From Lemma 5, there exists such that, for every marked -uniform hypergraph with at most vertices, and for every unmarked vertex of , we have for . Let be a random edge in , and let be a random vertex of . Note that is distributed as a random vertex in . With Lemma 6 and Lemma 8, this implies that, conditioned on , for we have in the -balanced model, and also (since ). Since , we conclude that, for and , we have . Hence, in the -balanced model, w.h.p., when and grow. In the -uniform model, the same argument holds again, using the fact that the number of times an edge in is drawn is concentrated around , with the number of edges of .
6.4 Proof of Theorem 3
The excess of a graph is .
Let be a graph. Then, for the -balanced model, we have .
Proof: During the CU process, each time an edge is drawn, the counter of at least one of its extremities is increased by . Hence . Hence, with the notation , we have . Now, by Lemma 1, for each , there exists an edge incident to such that (if several incident edges have this property, an arbitrary one is chosen). Hence, . Note that, in this sum, every edge occurs at most twice (since it has two extremities), thus .
7 Analysis for some non-peelable hypergraphs
Analysing the asymptotic behaviour of the relative error of the CU-process on arbitrary hypergraphs seems to be a challenging task, even if we restrict ourselves to -uniform and -balanced models, as we do in this paper. Based on simulations, we expect that, for a fixed connected hypergraph , and for , we have w.h.p. as , for an explicit constant . If is peelable, then Lemma 5 implies that this concentration holds, with . We expect that, if no vertex of is peelable, and if is “sufficiently homogeneous”, then the constants should be all equal to the same constant , and thus the relative error of every edge is concentrated around w.h.p. as . This, in particular, is supported by an experiment reported at the end of Section 5.2. In this Section, we show that this is the case for another family of regular hypergraphs which are very sparse ( edges) but have a high order (an edge contains vertices).
The dual of a hypergraph is the hypergraph where the roles of vertices and edges are interchanged: the vertices of are the edges of , and the edges of are the vertices of so that an edge of corresponding to a vertex of contains those vertices that correspond to edges incident to in .
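The dual construction is easily made explicit. In this sketch (hypothetical function name, edges given as tuples of vertex indices), the vertices of the dual are the edge indices of the original hypergraph, and the dual has one edge per original vertex.

```python
def dual(num_vertices, edges):
    """Return the edges of the dual hypergraph, one per original vertex."""
    dual_edges = [[] for _ in range(num_vertices)]
    for i, e in enumerate(edges):
        for v in set(e):
            # Edge i of H becomes a vertex of the dual edge associated with v.
            dual_edges[v].append(i)
    return [tuple(d) for d in dual_edges]
```

Applied to the complete graph on 4 vertices, this yields 4 dual edges of size 3 over 6 dual vertices, each dual vertex lying in exactly 2 dual edges.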
Here we consider the hypergraph dual to the complete graph . It is a -uniform hypergraph with edges and