We consider the problem of maintaining datasets subject to insert, delete, and membership query operations. Given a set $\mathcal{D}$ of $n$ elements from a universe $\mathcal{U}$, a membership query asks if the queried element $x \in \mathcal{U}$ belongs to the set $\mathcal{D}$. When exact answers are required, the associated data structure is called a dictionary. When one-sided errors are allowed, the associated data structure is called a filter. Formally, given an error parameter $\varepsilon > 0$, a filter always answers “yes” when $x \in \mathcal{D}$, and when $x \notin \mathcal{D}$, it makes a mistake with probability at most $\varepsilon$. We refer to such an error as a false positive event. (The probability is taken over the random choices that the filter makes.)
When false positives can be tolerated, the main advantage of using a filter instead of a dictionary is that the filter requires much less space than a dictionary [CFG78, LP10]. Let $u$ be the size of the universe $\mathcal{U}$ and let $n$ denote an upper bound on the size of the set at all points in time. (We consider the case in which the bound $n$ on the cardinality of the dataset is known in advance. The scenario in which $n$ is not known is addressed in Pagh et al. [PSW13].) The information theoretic lower bound for the space of dictionaries is $\lceil \log_2 \binom{u}{n} \rceil = n \log(u/n) + \Theta(n)$ bits. (All logarithms are base $2$ unless otherwise stated; $\ln x$ is used to denote the natural logarithm. The equality holds when $u$ is significantly larger than $n$.) On the other hand, the lower bound for the space of filters is $n \log(1/\varepsilon)$ bits [CFG78]. In light of these lower bounds, we call a dictionary space-efficient when it requires $(1+o(1)) \cdot n \log(u/n) + O(n)$ bits, where the term $o(1)$ converges to zero as $n$ tends to infinity. Similarly, a space-efficient filter requires $(1+o(1)) \cdot n \log(1/\varepsilon) + O(n)$ bits. (An asymptotic expression that mixes big-O and small-o calls for elaboration. If $\varepsilon = o(1)$, then the asymptotic expression does not require the $O(n)$ addend. If $\varepsilon$ is constant, the $O(n)$ addend only emphasizes the fact that the constant that multiplies $n$ is, in fact, the sum of two constants: one is almost $\log(1/\varepsilon)$, and the other does not depend on $\varepsilon$. Indeed, the lower bound in [LP10] excludes space $(1+o(1)) \cdot n \log(1/\varepsilon)$ for constant values of $\varepsilon$.)
When the set is fixed, we say that the data structure is static. When the data structure supports insertions as well, we say that it is incremental. Data structures that handle both deletions and insertions are called fully-dynamic (in short, dynamic).
The goal is to design dictionaries and filters that achieve “the best of both worlds” [ANS10]: they work in the fully-dynamic setting, are space-efficient and perform operations in constant time in the worst case with high probability. (By with high probability (whp), we mean with probability at least $1 - 1/n^{\Omega(1)}$. The constant in the exponent can be controlled by the designer and only affects the $o(1)$ term in the space of the dictionary or the filter.)
On the fully-dynamic front, one successful approach for designing fully-dynamic filters was suggested by Pagh et al. [PPR05]. Static (resp., incremental) filters can be obtained from static (resp., incremental) dictionaries for sets by a reduction of Carter et al. [CFG78]. This reduction does not directly lead to filters that support deletions. To extend the reduction to the fully-dynamic setting, Pagh et al. [PPR05] propose to reduce instead from a stronger dictionary that maintains multisets rather than sets (i.e., elements in multisets have arbitrary multiplicities). This extension combined with a fully-dynamic dictionary for multisets yields in [PPR05] a fully-dynamic filter that is space-efficient but performs insertions and deletions in amortized constant time, not in the worst case. It is still an open problem whether one can design a fully-dynamic dictionary on multisets that is space-efficient and performs operations in constant time in the worst case whp [ANS10].
In this paper, we present the first fully-dynamic space-efficient filter with constant time operations in the worst case whp. Our result is based on the observation that it suffices to use the reduction of Carter et al. [CFG78] on dictionaries that support random multisets rather than arbitrary multisets. We then design the first fully-dynamic space-efficient dictionary that works on random multisets and supports operations in constant time in the worst case whp. Applying the reduction to this new dictionary yields our fully-dynamic filter. We also show how a static version of our dictionary can be used to design a data structure for the static retrieval problem.
1.1 Our Contributions
Fully-Dynamic Dictionary for Random Multisets.
We consider a new setting for dictionaries in which the dataset is a random sample (with replacements) of the universe. We refer to such datasets as random multisets. We present the first space-efficient fully-dynamic dictionary for random multisets that performs inserts, deletes, and queries in constant time in the worst case whp. The motivating application for random multisets is in designing fully-dynamic filters (see Sec. 3). We note that our dictionary can also maintain regular sets.
In the following theorem, we summarize the properties of our dictionary, called the Crate Dictionary. Note that we analyze the number of memory accesses in the RAM model in which every memory access reads/writes a word of $w$ contiguous bits. All operations we perform in one memory access take constant time. We also analyze the probability that the space allocated for the dictionary does not suffice; such an event is called an overflow.
The Crate Dictionary with parameter is a fully-dynamic dictionary that maintains sets and random multisets with up to elements from a universe with the following guarantees: (1) For every polynomial in sequence of insert, delete, and query operations, the dictionary does not overflow whp. (2) If the dictionary does not overflow, then every operation (query, insert, delete) can be completed using at most memory accesses. (3) The required space is bits.
The comparable dictionary of Arbitman et al. [ANS10] also achieves constant time memory accesses, is space-efficient and does not overflow with high probability, but works only for sets. We remark that dictionary constructions for arbitrary multisets exist [CKRT04, PPR05, DadHPP06, KR10, PT14] with weaker performance guarantees.
One advantage of supporting random multisets is that it allows us to use our dictionary to construct a fully-dynamic filter. We show a reduction that transforms a fully-dynamic dictionary on random multisets into a fully-dynamic filter on sets (see Lemma 13). Applying this reduction to the Crate Dictionary in Theorem 1, we obtain the Crate Filter in Theorem 2. The Crate Filter is the first in-memory space-efficient fully-dynamic filter that supports all operations in constant time in the worst case whp. The filter does not overflow with high probability. We summarize its properties in the following theorem. (Note that the theorem holds for all ranges of , in particular, can depend on or . Moreover, even when , each operation in the filter requires a constant number of memory accesses.)
The Crate Filter with parameters and is a fully-dynamic filter that maintains a set of at most elements from a universe with the following guarantees: (1) For every polynomial in sequence of insert, delete, and query operations, the filter does not overflow whp. (2) If the filter does not overflow, then every operation (query, insert, and delete) can be completed using at most memory accesses. (3) The required space is bits. (4) For every query, the probability of a false positive event is bounded by .
We also present an application of the Crate Dictionary design for the static Retrieval Problem [CKRT04]. In the static retrieval problem, the input consists of a fixed dataset and a function . On query , the output must satisfy if (if , any output is allowed). The data structure can also support updates in which is modified for an .
We employ a variant of the Crate Dictionary design to obtain a Las Vegas algorithm for constructing a practical data structure in linear time whp. The space requirements of this data structure are better than previous constructions under certain parametrizations (see [DW19] and references therein). In particular, the term in the required space is similar to the one we obtain for our dictionary.
There exists a data structure for the static Retrieval Problem with the following guarantees: (1) Every query and update to can be completed using at most memory accesses. (2) The time it takes to construct the data structure is whp. (3) The required space is bits.
1.2 Our Model
Memory Access Model.
We assume that the data structures are implemented in the RAM model in which the basic unit of one memory access is a word. Let denote the memory word length in bits. We assume that . Performance is measured in terms of memory accesses. We do not count the number of computer word-level instructions because the computations we perform over words per memory access can be implemented with instructions using modern instruction sets and lookup tables [Rei13, PBJP17, BFG17].
We note that comparable dictionary designs assume words of bits [PPR05, DadHPP06, ANS10]. When the dictionaries are used to obtain filters [PPR05, ANS10], they assume that the word length is . If , then our designs always require a constant number of memory accesses per operation (hence, constant time). However, we prefer to describe our designs using a model in which the word length does not depend on the size of the universe [PT14].
Our constructions succeed with high probability in the following sense. Upfront, space is allocated for the data structure and its components. An overflow is the event that the space allocated for one of the components does not suffice (e.g., too many elements are hashed to the same bin). We prove that overflow occurs with probability at most $1/\mathrm{poly}(n)$ and that one can control the degree of the polynomial (the degree of the polynomial only affects the $o(1)$ term in the size bound). In the case of random multisets, the probability of an overflow is a joint probability distribution over the random choices of the dictionary and the distribution over the realizations of the multiset. In the case of sets, the probability of an overflow depends only on the random choices that the dictionary/filter makes.
We assume that we have access to perfectly random hash functions and permutations with constant evaluation time. We do not account for the space these hash functions occupy. A similar assumption is made in [BM99, RR03, CKRT04, DP08, BFG18, DW19]. See [FKS82, KNR09, ANS10, PT11, CRSW13, MRRR14] for further discussion on efficient storage and evaluation of hash functions.
Worst Case vs. Amortized.
An interesting application that emphasizes the importance of worst-case performance is that of handling search engine queries. Such queries are sent in parallel to multiple servers, whose responses are then accumulated to generate the final output. The latency of this final output is determined by the slowest response, so the average latency of the final response is effectively the worst latency among the servers. See [BM01, KM07, ANS09, ANS10] for further discussion on the shortcomings of expected or amortized performance in practical scenarios.
1.3 Main Ideas and Techniques
We review the general techniques which we employ in our design and highlight the main issues that arise when trying to design a fully-dynamic filter. We then describe how our construction addresses these issues.
1.3.1 Relevant Techniques and Issues
Dictionary Based Filters.
Carter et al. [CFG78] observed that one can design a static filter using a dictionary that stores hashes of the elements. The reduction is based on a random hash function $h : \mathcal{U} \to [n/\varepsilon]$. We refer to $h(x)$ as the fingerprint of $x$. A static dictionary over the set of fingerprints $h(\mathcal{D})$ becomes a static filter over the set $\mathcal{D}$. Indeed, the probability of a false positive event is at most $\varepsilon$ because $\Pr[h(x) = h(y)] \le \varepsilon/n$, for every $y \in \mathcal{D}$. This reduction yields a space-efficient static filter if the dictionary is space-efficient.
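The bound on the false positive probability follows from a union bound over the elements of the dataset. The derivation below is a sketch under the assumptions that the range of $h$ has size $n/\varepsilon$, that $h$ is perfectly random, and that $|\mathcal{D}| \le n$; the symbols are chosen for illustration and follow the standard presentation of the reduction.

```latex
\Pr[\text{false positive on } x \notin \mathcal{D}]
  = \Pr[\exists\, y \in \mathcal{D} : h(y) = h(x)]
  \le \sum_{y \in \mathcal{D}} \Pr[h(y) = h(x)]
  \le n \cdot \frac{\varepsilon}{n}
  = \varepsilon .
```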
Delete Operations in Filters.
Delete operations pose a challenge to the design of filters based on dictionaries [PPR05]. Consider a collision, namely $h(x) = h(y)$ for some $x \in \mathcal{D}$ and $y \neq x$. One must make sure that a $\mathsf{delete}(y)$ operation does not delete $x$ as well. We consider two scenarios. In the first scenario, $y \notin \mathcal{D}$ and a $\mathsf{delete}(y)$ causes $x$ to be deleted from the filter. A subsequent $\mathsf{query}(x)$ operation is then answered with a “no”, rendering the filter incorrect. To resolve this issue, we assume that a delete operation is only issued for elements in the dataset. (This assumption is unavoidable if one wants to design a filter using the reduction of Carter et al. [CFG78].)
Assumption 4. At the time a $\mathsf{delete}(x)$ is issued to the filter, the element $x$ is in the dataset.
In the second scenario, both $x \in \mathcal{D}$ and $y \in \mathcal{D}$ when $\mathsf{delete}(y)$ is issued. In this case, the dictionary underlying the filter must be able to store duplicate elements (or their multiplicities), so that $\mathsf{delete}(y)$ deletes only one duplicate leaving the other one intact (or decrements the counter). This is why a fully-dynamic dictionary for sets cannot be used out-of-the-box as a fully-dynamic filter. Pagh et al. [PPR05] propose a solution that requires the underlying dictionary to support a multiset rather than a set.
The space lower bound of $n \log(u/n)$ bits suggests that one should be able to save $\log n$ bits in the representation of each element. Indeed, the technique of quotienting [Knu73] saves $\log n$ bits by storing them implicitly in an array of $n$ entries. This technique has been successfully used in several filter and dictionary constructions (see [Pag01, PPR05, DadHPP06, ANS10, BFJ12, BFG18] and references therein). The idea is to divide the $\log u$ bits of an element $x$ into the first $\log n$ bits called the quotient $q(x)$ (which are stored implicitly), and the remaining $\log(u/n)$ bits called the remainder $r(x)$ (which are stored explicitly). An array $A[0 : n-1]$ is then used to store the elements, where $A[q]$ stores the multiset of remainders $r(x)$ such that $q(x) = q$. (We abuse notation and interpret $q(x)$ both as a number in $[n]$ and a binary string of $\log n$ bits.) The benefit is that one does not need to store the quotients explicitly because they are implied by the location of the remainders.
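The quotienting idea can be illustrated with a small sketch. The class below is a toy, not the construction of this paper: the name `QuotientedTable` and the parameters `log_n` and `log_u` are hypothetical, each array entry is an unbounded Python multiset rather than a fixed-size bin, and the elements are assumed to be already random so that no permutation is applied first.

```python
from collections import Counter


class QuotientedTable:
    """Toy quotienting table: the array index is the quotient, the stored value is the remainder."""

    def __init__(self, log_n: int, log_u: int):
        assert 0 < log_n <= log_u
        self.rem_bits = log_u - log_n                        # bits stored explicitly per element
        self.bins = [Counter() for _ in range(1 << log_n)]   # one multiset of remainders per quotient

    def split(self, x: int):
        """Return (quotient, remainder): the leading log_n bits and the trailing rem_bits bits of x."""
        return x >> self.rem_bits, x & ((1 << self.rem_bits) - 1)

    def insert(self, x: int) -> None:
        q, r = self.split(x)
        self.bins[q][r] += 1

    def delete(self, x: int) -> None:
        # Assumes x is currently stored (deletes are only issued for present elements).
        q, r = self.split(x)
        self.bins[q][r] -= 1
        if self.bins[q][r] == 0:
            del self.bins[q][r]

    def query(self, x: int) -> bool:
        q, r = self.split(x)
        return self.bins[q][r] > 0
```

Only the remainder is kept inside each entry; the quotient is recovered from the index of the entry, which is where the saving of $\log n$ bits per element comes from.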
Load Balancing and Spares.
When quotients are random locations in , it is common to analyze dictionaries and filters using the balls-into-bins paradigm [Wie17]. In the case of dictionary/filter design, the remainders are balls, and the quotients determine the bins. The balls that cannot be stored in the space allocated for their bin are stored in a separate structure called the spare. A balls-into-bins argument implies that, under certain parametrizations, the number of balls stored in the spare is small compared to and can be accommodated by the term in the space bounds [DadH90, DadHPP06, PP08, ANS10, BFG18]. Since the spare stores a sublinear number of elements, space inefficient dictionaries can be employed to implement the spare.
Two challenges with this approach need to be addressed: (1) Organize the spare such that it can be accessed and updated in a constant number of memory accesses. (2) Manage the spare so that it does not overflow. In particular, to avoid an overflow of the spare in the fully-dynamic setting, one must make sure that balls are not only added to the spare but also moved back to their bins. For the balls-into-bins analysis to remain valid over a sequence of insertions and deletions, an invariant is maintained that requires that a ball be stored in the spare only when its bin is full (Invariant 16). This means that whenever a ball from a full bin is deleted, one needs to move another ball from the spare back to the bin.
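The invariant and the move-back step can be sketched as follows. This is an illustrative toy under simplifying assumptions: the class name `BinsWithSpare` is hypothetical, the spare is modeled as one Python list per bin (standing in for whatever structure links spare elements to their bin), and no attempt is made to bound the number of memory accesses.

```python
class BinsWithSpare:
    """Fixed-capacity bins plus a spare; a ball is kept in the spare only while its bin is full."""

    def __init__(self, num_bins: int, capacity: int):
        self.capacity = capacity
        self.bins = [[] for _ in range(num_bins)]
        self.spare = [[] for _ in range(num_bins)]   # overflowed balls, grouped by their bin

    def insert(self, b: int, ball) -> None:
        # A ball goes to the spare only if its bin is already full.
        if len(self.bins[b]) < self.capacity:
            self.bins[b].append(ball)
        else:
            self.spare[b].append(ball)

    def delete(self, b: int, ball) -> None:
        # Assumes the ball is currently stored in bin b or in the spare of bin b.
        if ball in self.bins[b]:
            self.bins[b].remove(ball)
            if self.spare[b]:
                # The bin was full, so one ball moves from the spare back to the bin.
                self.bins[b].append(self.spare[b].pop())
        else:
            self.spare[b].remove(ball)

    def invariant_holds(self) -> bool:
        """Check that every bin with a nonempty spare is full."""
        return all(len(bin_) == self.capacity or not sp
                   for bin_, sp in zip(self.bins, self.spare))
```

Without the move-back step in `delete`, a bin could have free space while some of its balls remain in the spare, and the spare could grow beyond what the balls-into-bins analysis predicts.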
The implementation of the spare plays a crucial role in whether the dictionary supports deletions, duplicates, or operations in constant time. The dictionary of Arbitman et al. [ANS10] manages the spare as a de-amortized cuckoo hash table [ANS09] (so operations are in constant time). Each time an element is relocated within the cuckoo hash table, it checks if its corresponding bin is not full. If so, it leaves the spare and returns to its bin (hence, deletions are supported). However, the implementation of the spare in [ANS10] does not extend to filters with deletions because their implementation does not support duplicate elements.
Sparse vs. Dense Cases.
Most dictionary and filter constructions consider two cases (sparse/dense) based on the relative size of the dataset with respect to the size of the universe (see [BM99, RR03, DadHPP06, ANS10, BFG18]). In the case of filters, the two cases are separated based on the value of $\varepsilon$ (see Def. 11).
In the dense case, elements are short enough (in bits) that a bin can pack all the elements that hash into it within a word (so an element can be found by searching its corresponding bin in constant time). The challenge in the sparse case is that the remainders are long, and bins no longer fit in a constant number of words. Arbitman et al. [ANS10] employ additional structures in this case (i.e., global lookup tables) that point to the location in which the remainder of an element is (separately) stored.
1.3.2 Our Techniques
Reduction Using Random Multisets.
We introduce a relaxation of the Pagh et al. [PPR05] condition that a dictionary must support multisets to function as a fully-dynamic filter (Sec. 3). This relaxation is based on the observation that it suffices for the dictionary to support random multisets for the reduction of Carter et al. [CFG78] to succeed. Theorem 1 provides the first such dictionary. Applying the reduction to the Crate Dictionary gives Theorem 2. While random invertible permutations yield a reduction from a set to a random set [DadHPP06, ANS10] (see Assumption 25), this is not the case for multisets. Indeed, the image of a multiset with respect to a random permutation is not a random multiset.
The starting point of our dictionary construction is the idea that the bins the elements hash to (based on their random quotients) should behave like self-contained dictionaries. Moreover, these dictionaries should be space-efficient and fit in a constant number of words so that all operations on them can be executed in a constant number of memory accesses.
Specifically, the basic building blocks of our design are small local data structures that we employ in a modular, black-box fashion (Sec. 4). To this end, we construct a simple space-efficient dictionary for small general multisets of elements whose quotients belong to a limited range. We refer to this construction as a pocket dictionary. These small dictionaries function as bins; elements choose a bin based on their (random) quotient.
Pocket dictionaries are space-efficient and use at most two extra bits per element overall. This is an improvement over previous packing techniques in which the number of bits per element is [Cle84, BFJ12] and [PBJP17]. We note that the practical filter proposed in Pagh et al. [PPR05] also uses two extra bits per element. However, they use a variant of linear probing and therefore, the runtime of the operations is not constant (in the worst case) and depends on the load. Arbitman et al. [ANS10] manage the bins by employing a global lookup table for storing the encodings of all possible subsets stored in a bin.
The simplicity of the pocket dictionary design gives it flexibility. In particular, we also construct variants that support variable-length remainders() or count multiplicities of elements ().
To facilitate constant time access in the spare, we propose to manage the spare by partitioning it across intervals of pocket dictionaries (hereafter referred to as crates). Specifically, we allocate one distributed spare per crate such that each distributed spare stores the overflow only from the pocket dictionaries in its crate (see Section 5.4). With high probability, at most elements are stored in each distributed spare (however, the elements no longer form a random multiset). Each distributed spare is implemented as a space-inefficient dictionary ().
The key advantage of distributed spares is that doubly linked lists of elements from the same pocket dictionary can be implemented by storing pointers of length alongside the elements (as opposed to pointers of length in the case of a global spare). These linked lists enable us to move an element from the spare back to its bin in a constant number of memory accesses.
The first technical challenge with this approach is that the elements stored in a no longer form a random multiset. To this end, we implement each as an array of s that can maintain general multisets (See Appendix B). The second technical challenge is that the probability of overflow of the components in each must be exponentially small in their size. (Indeed, their size is and the tolerated failure probability is at most .) We formalize the properties of the in the following lemma.
The with parameter is a fully-dynamic dictionary that maintains (general) multisets of cardinality at most from a universe with the following guarantees: (1) For every polynomial in sequence of insert, delete, and query operations, the does not overflow with probability at least . (2) If the does not overflow, then every operation (query, insert, delete) can be completed using at most memory accesses. (3) The required space is bits.
Note the extra factor in the space requirement that justifies the term space-inefficient dictionary. The also supports a special delete operation called pop, with the same performance guarantees (for an elaboration on the pop operation, see Sec. 5.4.1).
Variable-Length Adaptive Remainders.
In the sparse case, reading even one remainder of an element might require more than memory accesses. (The word length might be smaller than , the length of a remainder.) We propose to solve this problem by maintaining variable-length prefixes of the remainders (Sec. 6.1). We call such prefixes adaptive remainders. We maintain the adaptive remainders dynamically such that they are minimal and prefix-free (Invariant 26).
The full remainder of an element is stored separately from its adaptive remainder but their locations are synchronized. Specifically, the key property of the adaptive remainders is that they allow us to find the location of an element $x$ using a constant number of memory accesses, regardless of the length of its full remainder. (The “location” of an element $x$ is ambiguous since $x$ might not even be in the dataset. More precisely, we can certify whether $x$ is in the dataset or not by reading at most two full remainders, and these full remainders can be located in a constant number of memory accesses; see Claim 30.) Indeed, one may view this technique as a method for dynamically maintaining a perfect hashing of the elements in the same bin with the same quotient. As opposed to typical perfect hashing schemes in which the images of the elements are of the same length, we employ variable-length images. The adaptive remainders have, on average, constant length so they do not affect space efficiency. Indeed, considering variable-length adaptive remainders allows us to bypass existing space bounds on dynamic perfect hash functions that depend on the size of the universe [MPP05].
Prefix-free variable-length adaptive fingerprints are employed in the adaptive Broom filter of Bender et al. [BFG18] to fix false positives and maintain distinct fingerprints for elements in the filter. Their adaptive fingerprints are computed by accessing an external memory reverse hash table that is not counted in the filter’s space [BFG18]. (The reverse hash table allows the filter to look up an element based on its fingerprint in the filter.) We emphasize that all our adaptive remainders are computed in-memory.
1.4 Related Work
The topic of dictionary and filter design is a fundamental theme in the theory and practice of data structures. We restrict our focus to the results that are closest to our setting.
To the best of our knowledge, the dictionary of Arbitman et al. [ANS10] is the only space-efficient fully-dynamic dictionary for sets that performs queries, insertions, and deletions in constant time in the worst case with high probability.
Several other fully-dynamic constructions support operations in constant time with high probability [DadH90, DDMM05, DadHPP06, ANS09] but are not space-efficient. On the other hand, some dictionaries are space-efficient but do not have constant time guarantees with high probability for all of their operations [RR03, FPSS05, Pan05, DW07]. For the static case, several space-efficient constructions exist that perform queries in constant time [BM99, Pag01, Păt08].
In the context of filters, to the best of our knowledge, only a few space-efficient constructions have proven guarantees. We focus here on space-efficient filters that perform operations in a constant number of memory accesses (albeit with different probabilistic guarantees). The fully-dynamic filter of Pagh et al. [PPR05] supports constant time queries and insertions and deletions in amortized expected constant time. The incremental filter of Arbitman et al. [ANS10] performs queries in constant time and insertions in constant time with high probability. The construction of Bender et al. [BFG18] describes an adaptive filter equipped with an external memory reverse hash table that supports queries, insertions and deletions in constant time with high probability. (Loosely speaking, an adaptive filter is one that fixes false positives after they occur [BFG18, MPR18]. The space of the reverse hash table is not counted in the space of their filter.) Space-efficient filters for the static case have been studied extensively [Mit02, DW07, DP08, Por09]. Several heuristics such as the cuckoo filter [FAKM14], the quotient filter [BFJ12, PBJP17] and variations of the Bloom filter [Blo70] have been reported to work well in practice.
1.5 Paper Organization
The preliminaries are in Section 2. In Section 3, we discuss the reduction from a fully-dynamic dictionary on random multisets to a fully-dynamic filter on sets. In Section 4, we describe the structure of the pocket dictionary and its variants. We also briefly mention some auxiliary data structures which we employ and whose complete description can be found in Appendix B.
The description and analysis of the Crate Dictionary are covered in two parts. In Section 5, we discuss the Crate Dictionary construction and analysis in the dense case. The distributed spares are described in Section 5.4. Section 6 covers the Crate Dictionary in the sparse case. Section 7 describes the modifications required to perform static retrieval.
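2 Preliminaries
The indicator function of a set $S$ is the function $\mathbb{1}_S : \mathcal{U} \to \{0,1\}$ defined by
$$\mathbb{1}_S(x) \triangleq \begin{cases} 1 & \text{if } x \in S, \\ 0 & \text{if } x \notin S. \end{cases}$$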
For any positive , let denote the set . For a string , let denote the length of in bits.
We define the range of a hash function to be a set of natural numbers and also treat the image as a binary string, i.e., the binary representation of using bits.
Given two strings, , the concatenation of and is denoted by . We denote the fact that is a prefix of by .
A sequence of strings $\{a_1, \ldots, a_s\}$ is prefix-free if, for every $i \neq j$, the string $a_i$ is not a prefix of $a_j$.
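As a small illustration of the definition, the following sketch checks whether a sequence of binary strings is prefix-free. The function name `is_prefix_free` is hypothetical, and strings are represented as Python `str` objects over the characters `0` and `1`.

```python
def is_prefix_free(strings) -> bool:
    """Return True if no string in the sequence is a prefix of another string in it."""
    ordered = sorted(strings)
    # In lexicographic order, a string that is a prefix of some other string
    # is immediately followed by a string that starts with it.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))
```

For example, `["00", "01", "10"]` is prefix-free, while `["0", "01", "11"]` is not because `0` is a prefix of `01`. Two equal strings at different positions also violate the definition.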
2.1 Filter and Dictionary Definitions
Let $\mathcal{U}$ denote the universe of all possible elements.
We consider three types of operations:
$\mathsf{insert}(x)$ - insert $x$ to the dataset.
$\mathsf{delete}(x)$ - delete $x$ from the dataset.
$\mathsf{query}(x)$ - is $x$ in the dataset?
Dynamic Sets and Random Multisets.
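Every sequence of operations $\{\mathrm{op}_t\}_{t \ge 1}$, where $\mathrm{op}_t$ is applied to an element $x_t$, defines a dynamic set $\mathcal{D}(t)$ over $\mathcal{U}$ as follows.
$$\mathcal{D}(t) \triangleq \begin{cases} \emptyset & \text{if } t = 0, \\ \mathcal{D}(t-1) \cup \{x_t\} & \text{if } \mathrm{op}_t = \mathsf{insert}(x_t), \\ \mathcal{D}(t-1) \setminus \{x_t\} & \text{if } \mathrm{op}_t = \mathsf{delete}(x_t), \\ \mathcal{D}(t-1) & \text{if } \mathrm{op}_t = \mathsf{query}(x_t). \end{cases} \qquad (1)$$
(The definition of state in Equation 1 does not rule out a deletion of $x_t \notin \mathcal{D}(t-1)$. However, we assume that $\mathrm{op}_t = \mathsf{delete}(x_t)$ only if $x_t \in \mathcal{D}(t-1)$. See Assumption 4 and the discussion therein.)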
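A multiset $\mathcal{M}$ over $\mathcal{U}$ is a function $\mathcal{M} : \mathcal{U} \to \mathbb{N}$. We refer to $\mathcal{M}(x)$ as the multiplicity of $x$. If $\mathcal{M}(x) = 0$, we say that $x$ is not in the multiset. We refer to $\sum_{x \in \mathcal{U}} \mathcal{M}(x)$ as the cardinality of the multiset and denote it by $|\mathcal{M}|$.
The support of the multiset is the set $\{x \in \mathcal{U} : \mathcal{M}(x) \neq 0\}$. The maximum multiplicity of a multiset is $\max_{x \in \mathcal{U}} \mathcal{M}(x)$.
A dynamic multiset is specified by a sequence of insert and delete operations. Let $\mathcal{M}_t$ denote the multiset after $t$ operations. (As in the case of fully-dynamic sets, we require that $\mathrm{op}_t = \mathsf{delete}(x_t)$ only if $\mathcal{M}_{t-1}(x_t) > 0$.)
We say that a dynamic multiset has cardinality at most $n$ if $|\mathcal{M}_t| \le n$, for every $t$.
A dynamic multiset over $\mathcal{U}$ is a random multiset if for every $t$, the multiset $\mathcal{M}_t$ is the outcome of independent uniform samples (with replacements) from $\mathcal{U}$.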
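A fully-dynamic filter is a data structure that maintains a dynamic set $\mathcal{D}(t) \subseteq \mathcal{U}$ and is parameterized by an error parameter $\varepsilon \in (0,1)$. Consider an input sequence that specifies a dynamic set $\mathcal{D}(t)$, for every $t$. The filter outputs a bit for every query operation. We denote the output that corresponds to $\mathrm{op}_t = \mathsf{query}(x_t)$ by $\mathrm{out}_t \in \{0,1\}$. We require that the output satisfy the following condition:
$$x_t \in \mathcal{D}(t) \;\Rightarrow\; \mathrm{out}_t = 1.$$
The output $\mathrm{out}_t$ is an approximation of $\mathbb{1}_{\mathcal{D}(t)}(x_t)$ with a one-sided error. Namely, if $x_t \in \mathcal{D}(t)$, then $\mathrm{out}_t$ must equal $1$.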
Definition 9 (false positive event).
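Let $\mathrm{FP}_t$ denote the event that $\mathrm{op}_t = \mathsf{query}(x_t)$, $\mathrm{out}_t = 1$, and $x_t \notin \mathcal{D}(t)$.
The error parameter $\varepsilon$ is used to bound the probability of a false positive error.
We say that the false positive probability in a filter is bounded by $\varepsilon$ if it satisfies the following property. For every sequence of operations and every $t$,
$$\Pr[\mathrm{FP}_t] \le \varepsilon.$$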
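A fully-dynamic dictionary with parameter $n$ is a fully-dynamic filter with parameters $n$ and $\varepsilon = 0$. In the case of multisets, the response $\mathrm{out}_t$ of a fully-dynamic dictionary to a $\mathsf{query}(x_t)$ operation must satisfy $\mathrm{out}_t = 1$ iff $\mathcal{M}_t(x_t) > 0$. (One may also define $\mathrm{out}_t = \mathcal{M}_t(x_t)$.)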
When we say that a filter or a dictionary has parameter $n$, we mean that the cardinality of the input set/multiset is at most $n$ at all points in time.
Success Probability and Probability Space.
We say that a dictionary (filter) works for sets and random multisets if the probability that the dictionary does not overflow is high (i.e., it is at least $1 - 1/\mathrm{poly}(n)$). The probability in the case of random multisets is taken over both the random choices of the dictionary and the distribution of the random multisets. In the case of sets, the success probability depends only on the random choices of the dictionary.
The relative sparseness [BM99] of a set (or multiset) is the ratio . We denote the relative sparseness by . In the fully-dynamic setting, we define the relative sparseness to be , where is the upper bound on .
Recall that denotes the memory word length. We differentiate between two cases in the design of dictionaries, depending on the ratio .
The dense case occurs when . The sparse case occurs when . (By , we mean that . By , we mean that there exists a constant such that , for sufficiently large values of .)
The reduction from dictionaries (see Sec 3) implies that the dictionaries used to implement the filter have a relative sparseness . Hence, in the case of filters, we differentiate between the high error case, in which to , and the low error case in which .
3 Reduction: Filters Based on Dictionaries
In this section, we extend the reduction of Carter et al. [CFG78] to construct fully-dynamic filters out of fully-dynamic dictionaries for random multisets. Our reduction can be seen as a relaxation of the reduction of Pagh et al. [PPR05]. Instead of requiring that the underlying dictionary support multisets, we only require that it support random multisets.
Consider a random hash function $h : \mathcal{U} \to [n/\varepsilon]$ and let $\mathcal{D}(t)$ denote a dynamic set whose cardinality never exceeds $n$. Then $h(\mathcal{D}(t))$ is a random multiset of cardinality at most $n$.
Since $h$ is random, an “adversary” that generates the sequence of insertions and deletions for $h(\mathcal{D}(t))$ is an oblivious adversary in the following sense. When inserting, it inserts a random element (which may be a duplicate of a previously inserted element; duplicates in $h(\mathcal{D}(t))$ are caused by collisions, i.e., $h(x) = h(y)$ for $x \neq y$, rather than by reinsertions). When deleting at time $t$, it specifies a previous time $t' < t$ in which an insertion took place, and requests to delete the element that was inserted at time $t'$.
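Lemma 13. For every dynamic set $\mathcal{D}(t)$ of cardinality at most $n$, the dictionary Dict with respect to the random multiset $h(\mathcal{D}(t))$ and universe $[n/\varepsilon]$ is a fully-dynamic filter for $\mathcal{D}(t)$ with parameters $n$ and $\varepsilon$.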
Proof. The Dict records the multiplicity of $h(x)$ in the multiset $h(\mathcal{D}(t))$ and so deletions are performed correctly. The filter outputs $1$ if and only if the multiplicity of $h(x)$ is positive. False positive events are caused by collisions in $h$. Therefore, the probability of a false positive is bounded by $\varepsilon$ because of the cardinality of the range of $h$. ∎
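The reduction can be sketched in a few lines. The code below is only an illustration under stated assumptions: the class name `FingerprintFilter` is hypothetical, a Python `Counter` stands in for the multiset dictionary (it is neither space-efficient nor the Crate Dictionary), and the perfectly random hash function is simulated by lazily sampling and memoizing fingerprints, which stores the elements themselves and is therefore useful for exposition only.

```python
import math
import random
from collections import Counter


class FingerprintFilter:
    """Filter obtained by storing fingerprints in a multiset dictionary (illustrative sketch)."""

    def __init__(self, n: int, eps: float, seed: int = 0):
        self.range_size = max(1, math.ceil(n / eps))   # fingerprints are drawn from a range of size n/eps
        self._rng = random.Random(seed)
        self._h = {}                                   # simulated random hash function (memoized samples)
        self._dict = Counter()                         # stand-in for a dictionary on multisets

    def _fingerprint(self, x) -> int:
        if x not in self._h:
            self._h[x] = self._rng.randrange(self.range_size)
        return self._h[x]

    def insert(self, x) -> None:
        self._dict[self._fingerprint(x)] += 1

    def delete(self, x) -> None:
        # Assumes x is in the dataset when the delete is issued.
        fp = self._fingerprint(x)
        assert self._dict[fp] > 0
        self._dict[fp] -= 1                            # removes one copy; colliding elements keep theirs

    def query(self, x) -> bool:
        return self._dict[self._fingerprint(x)] > 0
```

Because multiplicities are recorded, deleting one of two colliding elements leaves the fingerprint in place for the other, so no false negatives are introduced.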
4 The Pocket Dictionary
In this section, we propose a fully-dynamic dictionary construction for the special case of small multisets of elements consisting of quotient/remainder pairs $(q, r)$. This construction works provided that the quotients belong to a small interval. We emphasize that the data structure works for multisets in general, without the assumption that they are random.
Let denote the multiset of quotient/remainder pairs to be stored in the dictionary, with and , for every . Namely, is the length of the interval that contains the quotients , is an upper bound on the number of elements that will be stored in the dictionary, and is the length in bits of each remainder . We store in a data structure called a pocket dictionary denoted by .
The pocket dictionary data structure uses two binary strings, denoted by and , as follows. Let denote the number of elements that share the same quotient . The header stores the vector in unary as the string . The length of the header is . The body is the concatenation of remainders sorted in nondecreasing lexicographic order of . The length of the body is . We refer the reader to Figure 1 for a depiction.
Let denote the number of bits used to store . Recall that the word length satisfies .
The number of bits that requires is . If , and , then fits in a constant number of words.
One can decode given and . Moreover, one can modify and to reflect updates (insertions and deletions) to . When the pocket dictionary fits in a constant number of words, all operations on the multiset can, therefore, be executed using a constant number of memory accesses.
We say that a pocket dictionary overflows if more than quotient/remainder pairs are stored in it.
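To make the header/body encoding concrete, the following minimal Python sketch stores a multiset of quotient/remainder pairs using a unary header and a sorted body. It is an illustration rather than a word-level implementation: the class name `PocketDictionary`, the parameter names `m` (number of quotients), `f` (capacity), and `ell` (remainder length in bits), and the convention that each quotient contributes a run of ones terminated by a zero are assumptions of this sketch. Python lists stand in for packed bit strings.

```python
import bisect


class PocketDictionary:
    def __init__(self, m, f, ell):
        self.m, self.f, self.ell = m, f, ell
        self.header = [0] * m   # per quotient: (count) ones followed by one zero
        self.body = []          # remainders, sorted by (quotient, remainder)

    def _locate(self, q):
        """Return (pos, start, end): header position of q's run and body slice [start, end)."""
        pos = ones = zeros = 0
        while zeros < q:                     # skip the runs of quotients 0..q-1
            ones += self.header[pos]
            zeros += 1 - self.header[pos]
            pos += 1
        start = end = ones
        while self.header[pos + (end - start)] == 1:
            end += 1
        return pos, start, end

    def query(self, q, r):
        _, start, end = self._locate(q)
        i = bisect.bisect_left(self.body, r, start, end)
        return i < end and self.body[i] == r

    def insert(self, q, r):
        if len(self.body) >= self.f:
            raise OverflowError("pocket dictionary is full")
        assert 0 <= q < self.m and 0 <= r < 2 ** self.ell
        pos, start, end = self._locate(q)
        self.body.insert(bisect.bisect_left(self.body, r, start, end), r)
        self.header.insert(pos, 1)           # extend q's unary run by one

    def delete(self, q, r):
        pos, start, end = self._locate(q)
        i = bisect.bisect_left(self.body, r, start, end)
        if i == end or self.body[i] != r:
            return False
        del self.body[i]
        del self.header[pos]                 # shorten q's unary run by one
        return True
```

In this sketch the header always has `m` zeros plus one bit per stored element, and a deletion removes a single copy, so duplicates are handled as a multiset.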
4.1 Variable-Length Pocket Dictionary
The use of variable-length adaptive remainders for the sparse case requires a variant of the pocket dictionary that supports variable-length values. We refer to this construction as the variable-length pocket dictionary (). This variant shares most of its design with the pocket dictionary with fixed size remainders. We briefly describe the required modifications.
We denote this modified pocket dictionary by . This new dictionary is in charge of storing at most variable-length remainders with quotients , where . The header of the dictionary is the same as in . The body consists of the list of remainders separated by an “end-of-string” symbol. The list of remainders appears in the body in nondecreasing lexicographic order of the strings . We suggest simply using two bits to encode each symbol of the ternary alphabet, so the space requirement of the pocket dictionary for variable-length remainders at most doubles.
The number of bits that requires is . If , then fits in a constant number of words.
We say that a variable-length pocket dictionary overflows if the cardinality of the multiset exceeds or the total length of the remainders exceeds .
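The two-bits-per-symbol encoding of the body can be sketched as follows. The specific code assignment (`00`, `01`, `10`) and the helper names `encode_body` and `decode_body` are assumptions made for illustration. Remainders are given as Python strings over `{"0", "1"}` and are assumed to be already sorted by the caller.

```python
# Ternary alphabet: bit 0, bit 1, and the end-of-string symbol "$".
SYMBOL_CODE = {"0": "00", "1": "01", "$": "10"}   # assumed code assignment
CODE_SYMBOL = {code: sym for sym, code in SYMBOL_CODE.items()}


def encode_body(remainders):
    """Concatenate the remainders, each followed by the end-of-string symbol."""
    return "".join(SYMBOL_CODE[s] for r in remainders for s in r + "$")


def decode_body(bits):
    """Recover the list of variable-length remainders from the encoded body."""
    out, current = [], []
    for i in range(0, len(bits), 2):
        sym = CODE_SYMBOL[bits[i:i + 2]]
        if sym == "$":
            out.append("".join(current))
            current = []
        else:
            current.append(sym)
    return out
```

The encoded length is twice the total remainder length plus two bits per stored remainder, which matches the "at most doubles" accounting up to the separator symbols.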
4.2 Auxiliary Data Structures
We employ two additional data structures for storing elements whose pocket dictionaries overflow. The first data structure is a space-inefficient variant of the pocket dictionary, called a counting set dictionary (). Its variable-length counterpart is called a variable-length counting set dictionary (). These data structures need not be space-efficient since, across all distributed spares, they store only a sublinear number of elements. A key property of the is that its failure probability (i.e. overflow) is exponentially small in its capacity. The implementation details of these auxiliary data structures appear in Appendix B.
5 Crate Dictionary - Dense Case
In this section, we present a fully-dynamic dictionary for sets and random multisets for the case in which the relative sparseness satisfies . We refer to this construction as the Crate Dictionary for the dense case.
|extra capacity (over the average) per|
|interval of quotients per|
|exponent affecting number of crates, for|
|exponent affecting size of|
|is the number of crates.|
|capacity of the|
The Crate Dictionary is parametrized by the cardinality of the dynamic random multiset. The dictionary consists of two levels of dictionaries. The dictionaries in the first level are called crates. Each crate consists of two main parts: a set of pocket dictionaries (in which most of the elements are stored) and a space-inefficient dictionary () that stores elements whose corresponding pocket dictionary is full. The internal parameters of the Crate Dictionary are summarized in Table 1. See Fig. 2 for a depiction of the structure of each crate.
The number of crates is . The number of pocket dictionaries in each crate is , and each pocket dictionary is a . The space inefficient dictionary stores at most elements, each of length bits. Each is implemented using counting set dictionaries (Sec. B.1). We elaborate on the structure and functionality of the separately in Sec. 5.4.
5.2 Element Representation
Without loss of generality, we have that . Since the multiset is random, its elements are chosen with replacement u.a.r. and independently from . We represent each by a -tuple , where
Informally, consider the binary representation of , then is the least significant bits, is the next bits, and so on.
An element is stored in the crate of index . Within its crate, the element is stored in one of two places: (1) in the pocket dictionary of index that stores , or (2) in the that stores . (Storing in the avoids spurious collisions since elements from all the full pocket dictionaries in the crate are stored in the same .)
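The decomposition of an element into consecutive bit fields, starting from the least significant bits, can be sketched as below. The function name `split_fields` and the example widths are hypothetical. The actual field widths are determined by the parameters of the Crate Dictionary.

```python
def split_fields(x, widths):
    """Split the binary representation of x into fields, least significant first."""
    fields = []
    for w in widths:
        fields.append(x & ((1 << w) - 1))   # take the next w low-order bits
        x >>= w
    return tuple(fields)
```

For instance, with widths `[2, 3, 1]`, the first field would select the crate, the second the pocket dictionary within the crate, and the third would be the remainder stored there.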
An operation on element is forwarded to the crate whose index is . We elaborate here on how a crate supports queries, insertions, and deletions involving an element . Operations to the are described in Sec. 5.4.
A query is implemented by searching for in the pocket dictionary of index in the crate. If the pocket dictionary is full and the element has not been found, the query is forwarded to the .
An insert operation first attempts to insert in the pocket dictionary of index in the crate. If the pocket dictionary is full, it forwards the insertion to the .
To make sure that the does not overflow whp in the fully-dynamic setting, we maintain the following invariant.
An element is stored in the only if the pocket dictionary of index in crate is full.
As a consequence, an element stored in the must be moved to its corresponding pocket dictionary whenever this pocket dictionary is no longer full due to a deletion. We refer to the operation of retrieving such an element from the as a pop operation (Sec. 5.4.1).
To maintain Invariant 16, a delete operation differentiates between the following cases depending on where is stored and whether its pocket dictionary is full:
If is in the pocket dictionary of index and the pocket dictionary was not full before the deletion, we simply delete from the pocket dictionary and return.
If is in the pocket dictionary of index and the pocket dictionary was full before the deletion, we delete from the pocket dictionary and issue a pop operation to the . The pop operation returns a triple , where (if any). We then insert into the pocket dictionary of index and return.
If is found in the of crate , we delete it from the .
Note that duplicate elements must be stored to support random multisets. In particular, a deletion should erase only one duplicate (or decrease the counter in the appropriately). This description allows for copies of the same element to reside both in the and outside the . Precedence is given to storing elements in the pocket dictionaries because the is used only for elements that do not fit in their pocket dictionaries due to overflow. (One could consider a version in which deletions and insertions of duplicates are first applied to the .)
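The interplay between the pocket dictionaries, the overflow dictionary, and the invariant can be sketched in Python as follows. This is a simplified model under stated assumptions: `Counter` objects stand in for both the pocket dictionaries and the space-inefficient dictionary, elements are addressed by a pocket index `p` and an opaque `key`, and the names `Crate`, `num_pockets`, and `capacity` are hypothetical.

```python
from collections import Counter, defaultdict


class Crate:
    def __init__(self, num_pockets, capacity):
        self.capacity = capacity
        self.pockets = [Counter() for _ in range(num_pockets)]  # stand-in pocket dictionaries
        self.sid = defaultdict(Counter)   # stand-in overflow store: pocket index -> multiset

    def _full(self, p):
        return sum(self.pockets[p].values()) >= self.capacity

    def query(self, p, key):
        if self.pockets[p][key] > 0:
            return True
        # Forward to the overflow store only if the pocket dictionary is full.
        return self._full(p) and self.sid[p][key] > 0

    def insert(self, p, key):
        if not self._full(p):
            self.pockets[p][key] += 1
        else:
            self.sid[p][key] += 1

    @staticmethod
    def _remove_one(multiset, key):
        multiset[key] -= 1
        if multiset[key] == 0:
            del multiset[key]

    def delete(self, p, key):
        if self.pockets[p][key] > 0:
            was_full = self._full(p)
            self._remove_one(self.pockets[p], key)
            if was_full and self.sid[p]:
                # Pop: move one overflow element back to restore the invariant.
                popped = next(iter(self.sid[p]))
                self._remove_one(self.sid[p], popped)
                self.pockets[p][popped] += 1
            return True
        if self.sid[p][key] > 0:
            self._remove_one(self.sid[p], key)
            return True
        return False
```

In this model an element resides in the overflow store only while its pocket dictionary is full, because every deletion from a full pocket dictionary immediately refills it from the overflow store when possible.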
5.4 A Space-Inefficient Dictionary ()
The space-inefficient dictionary () is a fully-dynamic dictionary for arbitrary dynamic multisets. It is used for storing elements whose pocket dictionaries are full. A supports queries, insertions, deletions, and pop operations (see Sec. 5.4.1 for a specification of pop operations). A key property of a is that the probability of overflow is at most . Note that this bound on the probability of overflow can be exponentially small in the cardinality of the multiset stored by a in the .191919Maybe previous dictionary designs such as [PT14, DadHPP06] can be parameterized to work with exponentially small failure probabilities.202020The probability space over which overflow is bounded depends only on the random hash function of the . The explicit bits of an element stored in the are . See Fig. 3 for a depiction.
Counting Set Dictionaries.
The basic building block of the is a data structure called the counting set dictionary (). Each stores a multiset of elements. The data structure is parametrized by the cardinality of the support set of the multiset and by the maximum multiplicity of elements in the multiset. The stores and pointers of length used to support operations. More details on the can be found in Appendix B.1.
The parameters used for the design of the are summarized in Table 2.
|bound on the cardinality of the multiset and|
|the number of s per|
|cardinality of the support of the multiset in|
|length element plus pointers|
|bound on multiplicity of element in|
|c||constant affecting exponent in overflow probability|
A consists of counting set dictionaries .212121We pessimistically set the upper bound on the maximum multiplicity of any element in a to be the cardinality of the dynamic multiset stored in the . By Claim 37, each fits in a constant number of words. An independent fresh random hash function is utilized to map each element to one of the s in the . We note that must be independent of for the analysis of the maximum loads in each to work (the same hash function may be used for all s).222222We note that if are stored in the same , then .
Each insert, delete and query operation on an element is forwarded to the of index . Multiplicities are counted using counters in the . If the counter becomes zero after a deletion, the element is lazily removed from the . The pop operation is described separately in Sec. 5.4.1. Since each fits in a constant number of words, all these operations require only a constant number of memory accesses.
5.4.1 Pop Operations
To maintain Invariant 16 after a deletion, a pop operation is issued to the of crate whenever the pocket dictionary of index transitions from full to non-full. The input is the index of the pocket dictionary and the output is the explicit part of an element stored in the with the property that . A copy of the element is then deleted from the and inserted into the pocket dictionary of index .
In order to support pop operations, the keeps a doubly linked list per (see Fig. 3). The doubly linked list of contains the indexes of the s that store elements such that . The doubly linked list is implemented by storing pointers to s alongside the elements stored in the s.
The fact that each crate has its own implies that the pointers we store require only bits. A separate array is used to store the list heads.
Insertion of a new element to the is implemented as follows. Recall that is mapped to the of index . If already contains an element such that , then is inserted using the same pointers of . Otherwise, is inserted to the , its pointer is set to the head of , and the head of the list is updated to .
If a duplicate element is inserted, the counter is merely incremented. Deletion of from the decrements the counter. If there is no other remaining element in with , then we update pointers in to skip over .
A pop operation with input returns the first element from the pointed to by the head with . The counter of is decremented. If ’s counter reaches zero, we continue as if is deleted. The overhead in memory accesses required to support the doubly linked lists is constant.
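The bookkeeping behind pop operations can be sketched as follows. This is a simplified model, not the word-level layout: each counting set dictionary is a `Counter`, each per-pocket doubly linked list is replaced by an insertion-ordered `dict` of indices (new entries are appended rather than placed at the head, which does not affect correctness), and a salted built-in `hash` stands in for the fresh random hash function. Elements are tuples whose first component is the pocket dictionary index. The names `SID`, `num_csds`, and `seed` are hypothetical.

```python
import random
from collections import Counter


class SID:
    def __init__(self, num_csds, seed=0):
        self.csds = [Counter() for _ in range(num_csds)]  # stand-in counting set dictionaries
        self.lists = {}   # pocket index -> ordered set of indices of non-empty matching CSDs
        self._salt = random.Random(seed).getrandbits(64)

    def _csd_index(self, elem):
        return hash((self._salt, elem)) % len(self.csds)

    def _has_pocket(self, j, p):
        return any(e[0] == p for e in self.csds[j])

    def insert(self, elem):
        p, j = elem[0], self._csd_index(elem)
        if not self._has_pocket(j, p):
            self.lists.setdefault(p, {})[j] = None    # link CSD j into the list of p
        self.csds[j][elem] += 1                       # duplicates only bump the counter

    def _remove_one(self, elem, j):
        p = elem[0]
        self.csds[j][elem] -= 1
        if self.csds[j][elem] == 0:
            del self.csds[j][elem]
            if not self._has_pocket(j, p):            # unlink CSD j from the list of p
                del self.lists[p][j]
                if not self.lists[p]:
                    del self.lists[p]

    def delete(self, elem):
        j = self._csd_index(elem)
        if self.csds[j][elem] == 0:
            return False
        self._remove_one(elem, j)
        return True

    def pop(self, p):
        """Remove and return one stored element whose pocket index is p, or None."""
        if p not in self.lists:
            return None
        j = next(iter(self.lists[p]))                 # first CSD on the list of p
        elem = next(e for e in self.csds[j] if e[0] == p)
        self._remove_one(elem, j)
        return elem
```

The sketch maintains the property that a counting set dictionary appears on the list of a pocket index exactly when it currently stores at least one element with that index, which is what lets a pop find an element without scanning all counting set dictionaries.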
5.4.2 Analysis: Proof of Lemma 5 in the dense case
Even if the cardinality of the dynamic multiset stored in the is at most , there are two possible causes for an overflow of the : (1) more than distinct elements are stored in a or, (2) more than copies of the same element are stored.
Two technical challenges arise when we try to bound the probability of overflow in the : (1) The triples stored in the are neither independent nor distributed u.a.r. and, (2) The number of elements stored in the is , yet we seek an upper bound on the overflow probability that is .
Consider a dynamic multiset stored in the over a polynomial number of insertions and deletions to the Crate Dictionary.
If the cardinality of the dynamic multiset stored in the is at most , then the counting set dictionaries in the do not overflow whp.
Since we set , the maximum multiplicity of an element in a never exceeds the cardinality of the multiset in the . To complete the proof, we show that no more than distinct elements are stored per .
Let be the number of distinct elements that are stored in the . The probability that more than distinct elements are mapped to a specific is bounded by . Setting gives a probability that is at most a polynomial fraction of (the exponent is linear in ). The claim follows by applying a union bound over all the s in a and all time steps. ∎
If no overflows, then every query, insert, delete, and pop to the requires a constant number of memory accesses.
Every insert, delete, or query operation is executed by accessing the corresponding . The parametrization of the implies that it fits in a constant number of words (Claim 37) and therefore these operations require a constant number of memory accesses. Maintaining the doubly linked lists accesses one entry in the array and affects at most three different s: one that stores the element, its predecessor, and its successor. ∎
The number of bits required for storing a is .
The has counting set dictionaries, each of which fits in a constant number of words. Hence, the total space occupied by the s is bits. The array of list head pointers requires bits because each pointer takes bits.232323A shared data structure for all the elements that do not fit in the “bins” appears in [ANS10] and also in [BFG18]. However, in our case, such a shared would require bits just for the array of pointers. This would render the Crate Dictionary space-inefficient. ∎
5.5 Analysis: Proof of Theorem 1 in the dense case
Consider a dynamic random multiset of cardinality at most specified by a sequence of insertions and deletions. We bound the number of memory accesses of each operation:
If none of the components of the Crate Dictionary for the dense case overflow, then every insert, delete, and query operation requires a constant number of memory accesses.
Each operation accesses one pocket dictionary, forwards the operation to the , or issues a pop operation to the . By Claim 14, each pocket dictionary fits in a constant number of words, and hence each operation on a pocket dictionary requires only a constant number of memory accesses. Operations on the require only a constant number of memory accesses by Claim 18. ∎
We now show that the Crate Dictionary in the dense case is space-efficient. We set . Since (by the assumption that and given that we are in the dense case), we have that . We first observe that only a polylogarithmic fraction of the bins in a crate will overflow whp.
For every crate, at most pocket dictionaries are full in the crate whp.
Fix a crate. Let denote the number of elements that hash to the pocket dictionary of index and let be the indicator random variable that is if (i.e. the pocket dictionary of index is full) and otherwise. The total number of full pocket dictionaries is denoted by .
Since each element is drawn independently and u.a.r., we have that and are values also chosen independently and u.a.r. We get that . By Chernoff’s bound, we have that . Indeed,
The expected number of full pocket dictionaries satisfies:
Note that by our choice of , we have that
The random variables are negatively associated, hence by Chernoff’s bound [DR98],
For , we have , hence is at most . Since , the claim follows. ∎
In the following claim we bound the maximum number of elements that are assigned to the same pocket dictionary in a crate. Recall the notation of the proof of Claim 21, where denotes the number of elements assigned to pocket dictionary .
For every and any random multiset such that ,
The random variable satisfies . By Chernoff’s bound, we have that
and the claim follows. ∎
The maximum number of elements stored in the of a crate is whp. Therefore, whp, no more than elements get stored in a specific if .
We summarize the space requirements of the Crate Dictionary in the dense case.
If , then the number of bits required for storing the Crate Dictionary in the dense case with parameters and is
We have pocket dictionaries , each of which takes