Succinct Filters for Sets of Unknown Sizes

04/26/2020 ∙ by Mingmou Liu, et al. ∙ Nanjing University

The membership problem asks to maintain a set S⊆[u], supporting insertions and membership queries, i.e., testing if a given element is in the set. A data structure that computes exact answers is called a dictionary. When a (small) false positive rate ϵ is allowed, the data structure is called a filter. The space usage of standard dictionaries or filters usually depends on an upper bound on the size of S, while the actual set can be much smaller. Pagh, Segev and Wieder (FOCS'13) were the first to study filters whose space usage varies with the current |S|. They showed that, in order to match the space with the current set size n=|S|, any filter data structure must use (1-o(1))n(log(1/ϵ)+(1-O(ϵ))loglog n) bits, in contrast to the well-known lower bound of Nlog(1/ϵ) bits, where N is an upper bound on |S|. They also presented a data structure with almost optimal space of (1+o(1))n(log(1/ϵ)+O(loglog n)) bits provided that n>u^0.001, with expected amortized constant insertion time and worst-case constant lookup time. In this work, we present a filter data structure with improvements in two aspects: (1) it has constant worst-case time for all insertions and lookups with high probability; (2) it uses space (1+o(1))n(log(1/ϵ)+loglog n) bits when n>u^0.001, achieving the optimal leading constant for all ϵ=o(1). We also present a dictionary that uses (1+o(1))nlog(u/n) bits of space, matching the optimal space in terms of the current size, and performs all operations in constant time with high probability.


1 Introduction

Membership data structures are fundamental subroutines in many applications, including databases [9], content delivery networks for web caching [24], image processing [17], scanning for viruses [14], etc. The data structure maintains a set of keys from a key space [u], supporting the following two basic operations (footnote 1: throughout the paper, [u] stands for the set .):

  • insert(x): insert x into the set;

  • lookup(x): return YES if x is in the set, and NO otherwise.

When false positive errors are allowed, such a data structure is usually referred to as a filter. That is, a filter with false positive rate ϵ may answer YES with probability ϵ when the queried key is not in the set (but it must always answer YES when the key is in the set).

In the standard implementations, an initialization procedure receives the key space size u and a capacity N, i.e., an upper bound on the number of keys that can simultaneously exist in the database. It then allocates sufficient space for the data structure, e.g., a hash table consisting of buckets. Thereafter, the memory usage always stays at the maximum, as much space as N keys would take. This wastes space when only a few keys have been inserted so far. On the other hand, it could also happen that only a rough estimate of the maximum size is known (e.g., [16, 1, 22]). Therefore, to avoid overflowing, one has to set the capacity conservatively, and the capacity parameter given to the initialization procedure may be much larger than actually needed. To avoid such space losses, a viable approach is to dynamically allocate space such that at any time, the data structure occupies space depending only on the current database size (rather than the maximum possible).

For exact membership data structures, it turns out that such a promise is not too hard to obtain if one is willing to sacrifice an extra constant factor in space and accept amortization: when the current database has n keys, we set the capacity to 2n; after n more keys are inserted, we construct a new data structure with capacity equal to 4n and transfer the whole database over. The amortized cost to transfer the database is O(1) per insertion. Raman and Rao [29] showed that the extra constant factor in space is avoidable: they designed a succinct membership data structure using space , where n is the current database size, supporting insertions in expected amortized constant time and lookup queries in worst-case constant time. (Footnote 2: A succinct data structure uses space equal to the information theoretical minimum plus an asymptotically smaller term called redundancy. Footnote 3: All logarithms are base .)
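The rebuild-on-doubling trick can be sketched as follows. This is a minimal illustration and not the construction of [29]: the class name `DoublingSet`, the linear-probing table, and the `moves` counter are assumptions introduced here, and the capacity schedule (rebuild into a table of twice the size whenever the table is full) is one simple reading of the description above.

```python
class DoublingSet:
    """Exact membership whose table capacity tracks the current size.

    Illustrative assumption: the table is rebuilt at twice the capacity
    whenever it becomes full. Linear probing stands in for whatever
    fixed-capacity structure is actually used.
    """

    def __init__(self):
        self.capacity = 2
        self.slots = [None] * self.capacity
        self.size = 0
        self.moves = 0  # total number of keys copied during rebuilds

    @staticmethod
    def _place(slots, key):
        """Put key into slots by linear probing; return True if it was new."""
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i] != key:
            i = (i + 1) % len(slots)
        is_new = slots[i] is None
        slots[i] = key
        return is_new

    def insert(self, key):
        if self.size == self.capacity:  # full: transfer everything over
            new_slots = [None] * (2 * self.capacity)
            for k in self.slots:
                if k is not None:
                    self._place(new_slots, k)
                    self.moves += 1
            self.slots, self.capacity = new_slots, 2 * self.capacity
        if self._place(self.slots, key):
            self.size += 1

    def lookup(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            if self.slots[i] is None:
                return False
            if self.slots[i] == key:
                return True
            i = (i + 1) % self.capacity
        return False
```

The `moves` counter makes the amortization visible: the total copying work is a geometric sum, so it stays within a constant factor of the number of insertions, while a single insertion that triggers a rebuild takes linear time.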

For filters, the situation is more complicated. The optimal space to store at most N keys while supporting approximate membership queries with false positive rate ϵ is Nlog(1/ϵ) bits [8, 23] (Pagh, Pagh and Rao [27] achieved bits). However, the above trick to reduce the space may not work in general. This is because filter data structures do not store perfect information about the database, and therefore it is non-trivial to transfer to the new data structure with capacity , as one might not be able to recover the whole database from the previous data structure. In fact, Pagh, Segev and Wieder [28] showed an information theoretical space lower bound of (1-o(1))n(log(1/ϵ)+(1-O(ϵ))loglog n) bits, regardless of the insertion and query times. That is, one has to pay extra bits per key in order to match the space with the current database size. They also proposed a data structure with a nearly matching space of (1+o(1))n(log(1/ϵ)+O(loglog n)) bits when n>u^0.001, while supporting insertions in expected amortized constant time and lookup queries in worst-case constant time. When is at least , the extra bits per key are dominating. It was proposed as an open problem in [28] whether one can make the term succinct as well, i.e., to pin down its leading constant.

On the other hand, an amortized performance guarantee is highly undesirable in many applications. Examples include IP address lookups in the context of router hardware [7, 19] and timing attacks in cryptography [21, 20, 26, 25]. When the database size is always close to the capacity (or when the space is not a concern), it was known how to support all operations in worst-case constant time [13, 3] with high probability. That is, except for a probability of , the data structure handles every operation in a sequence of length in constant time (footnote 4: this is a stronger guarantee than expected constant time, since when the unlikely event happens, one could simply rebuild the data structure in linear time; the expected time is still a constant). However, it was not known how to obtain such a guarantee when the space is succinct with respect to the current database size, i.e., . For filters, Pagh et al. [28] showed that it is possible to get worst-case constant time with high probability, at the price of a constant factor more space . They asked if there is a data structure which enjoys succinct space usage and worst-case constant time with high probability simultaneously.

1.1 Main Results

In this paper, we design a new dynamic filter data structure that answers both questions. Our data structure achieves worst-case constant time with high probability and, at the same time, is succinct in space in terms of the current database size.

[Dynamic filter - informal] There is a data structure for approximate membership with false positive rate ϵ that uses space (1+o(1))n(log(1/ϵ)+loglog n) bits, where n is the current number of keys in the database, such that every insertion and lookup takes constant time in the worst case with high probability.

We also present a dictionary data structure with the space depending on the current . A dictionary is a generalization of membership data structures: it maintains a set of key-value pairs, supporting

  • insert(): insert a key-value pair for and -bit ;

  • lookup(): if in the database, output ; otherwise output NO.

By setting , the lookup query simply tests if is in the database.

[Dynamic dictionary - informal] There is a dictionary data structure that uses space bits, where is the current number of key-value pairs in the database, such that every insertion and lookup takes constant time in the worst case with high probability.

1.2 Related Work

Membership with Constant Time Worst-Case Guarantee.

The FKS perfect hashing [15] stores a set of fixed (i.e., static) keys using space, supporting membership queries in worst-case constant time. Dietzfelbinger, Karlin, Mehlhorn, Meyer auf der Heide, Rohnert and Tarjan [12] introduced an extension of the FKS hashing, which is the first dynamic membership data structure with worst-case constant query time and the expected amortized constant insertion time. Later, Dietzfelbinger and Meyer auf der Heide [13] improved the insertion time to worst-case constant, with an overall failure probability of . Demaine, Meyer auf der Heide, Pagh and Pǎtraşcu [11] improved the space to bits of space. Arbitman, Naor and Segev [2] proved that a de-amortized version of cuckoo hashing [19] has constant operation time in the worst case with high probability.

On the other hand, filters can be reduced to dictionaries with a hash function , and thus, all the dictionaries imply similar upper bounds for filters [8].

Succinct Membership.

Raman and Rao [29] presented the first succinct dictionary with constant time operations, although the insertion time is amortized. Arbitman, Naor and Segev [3] refined the scheme of [2] and obtained a succinct dictionary with worst-case constant operation time with high probability.

By using the reduction from [8] and the succinct dictionary from [29], Pagh, Pagh and Rao [27] provided a succinct filter with constant time, where the insertion time is amortized due to [29]. Bender, Farach-Colton, Goswami, Johnson, McCauley and Singh [5] suggested a succinct adaptive filter with constant time operation in the worst case with high probability (footnote 5: in an adaptive filter, for a negative query , the false positive event is independent of previous queries).

Membership for Sets of Unknown Sizes.

The data structure of Raman and Rao [29] can be implemented such that the size of the data structure always depends on the “current n”. Pagh, Segev and Wieder [28] were the first to study dynamic filters in this setting from a foundational perspective. As we mentioned above, they proved an information-theoretical space lower bound of (1-o(1))n(log(1/ϵ)+(1-O(ϵ))loglog n) bits for filters, and presented a filter data structure using (1+o(1))n(log(1/ϵ)+O(loglog n)) bits of space with constant operation time when n>u^0.001. The insertion time, however, is expected amortized, since the succinct dictionary of Raman and Rao is applied as a black box (it was not clear if any succinct dictionary with worst-case operational time can be generalized to this setting).

Very recently, Bercea and Even [6] proposed a succinct membership data structure for maintaining dictionaries and random multisets with constant operation time. While their data structure is originally designed for the case where an upper bound on the keys is given (and the space usage is allowed to depend on ), we note that it is possible to extend their solution and reduce the space to depend only on the current . However, their data structure assumes free randomness, and straightforward extension results in an additive term in space. The redundancy makes their data structure space-inefficient for filters, since the space lower bound is .

1.3 Previous Construction

As we mentioned earlier, for dynamic membership data structures, if we are willing to pay an extra constant factor in space, one way to match the space with the “current” is to set the capacity to be . When the data structure is full after another insertions, we double the capacity, and transfer the database to the new data structure. However, the standard way to construct an efficient filter is to hash to (where is the false positive rate) and store all hash values in a membership data structure, which takes bits of space. As we insert more keys and increase the capacity to , the range of the hash value needs to increase as well. Unfortunately, it cannot be done, because the original keys are not stored, and we have lost the information in order to save space (this is exactly the point of a filter). On the other hand, we could choose to keep the previous data structure(s), and only insert the future keys to the new data structure. For each query, if it appears in any of the (at most ) data structures, we output YES. By setting the false positive rate for the data structure with capacity to , the overall false positive rate is at most by union bound. The total space usage becomes roughly .
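The "keep all previous data structures" approach can be sketched as a chain of fingerprint sets. The concrete numbers in the text are lost, so everything specific below is an assumption made for illustration: level i has capacity 2^i, its false positive budget is 6ϵ/(π² i²) (so the budgets sum to at most ϵ), SHA-256 stands in for the hash function, and the names `fingerprint` and `FilterChain` are hypothetical.

```python
import hashlib
import math


def fingerprint(key, level, bits):
    """Hash (key, level) to a `bits`-bit integer; SHA-256 is a stand-in hash."""
    digest = hashlib.sha256(f"{level}:{key}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


class FilterChain:
    def __init__(self, eps):
        self.eps = eps
        self.levels = []  # level i (1-based) has capacity 2^i

    def _new_level(self):
        i = len(self.levels) + 1
        capacity = 2 ** i
        # Assumed budget: eps_i = 6*eps / (pi^2 * i^2), so sum_i eps_i <= eps.
        eps_i = 6 * self.eps / (math.pi ** 2 * i ** 2)
        # With at most `capacity` fingerprints of b bits each, a non-member
        # collides with probability at most capacity / 2^b <= eps_i.
        bits = math.ceil(math.log2(capacity / eps_i))
        self.levels.append({"capacity": capacity, "bits": bits,
                            "used": 0, "table": set()})

    def insert(self, key):
        if not self.levels or self.levels[-1]["used"] == self.levels[-1]["capacity"]:
            self._new_level()  # older levels are kept, never rebuilt
        level = self.levels[-1]
        level["table"].add(fingerprint(key, len(self.levels), level["bits"]))
        level["used"] += 1

    def lookup(self, key):
        # YES if the key's fingerprint appears in any level.
        return any(fingerprint(key, i, lv["bits"]) in lv["table"]
                   for i, lv in enumerate(self.levels, start=1))
```

Each lookup probes every level, which is the logarithmic query cost that the single-hash-function design discussed next is meant to avoid.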

To avoid querying all filters for each query, the previous solution by Pagh et al. [28] uses a single global hash function that maps to -bit strings for all filters. For a key in the -th data structure (with capacity ), one simply takes the first bits of as its hash value. Querying the -th data structure on then amounts to checking whether the -bit prefix of exists. Since all filters use the same hash function, the overall task is to check whether some prefix of appears in the database, which now consists of strings of various lengths. Since there are very few short strings in the database, the previous solution extends all short strings to length by duplicating the string and appending all possible suffixes, e.g., a string of length is duplicated into strings by appending all possible -bit suffixes. Then all strings are stored in one single dictionary (longer strings are stored according to their first bits), and the query reduces to checking if the -bit prefix of is in the dictionary, which is solved by invoking the dictionary of Raman and Rao [29]. One may verify that duplicating the short strings does not significantly increase the total space, and comparing only the -bit prefix of a longer string does not increase the false positive rate by much.
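The duplication step can be illustrated with a small sketch. The function names and the parameter `target_len` (standing for the common length to which all strings are normalized) are hypothetical, and a Python `set` stands in for the dictionary of [29].

```python
from itertools import product


def extend_to_length(s, target_len):
    """Return all bit strings of length target_len that have s as a prefix."""
    if len(s) > target_len:
        raise ValueError("string is already longer than the target length")
    pad = target_len - len(s)
    return [s + "".join(bits) for bits in product("01", repeat=pad)]


def build_prefix_dictionary(strings, target_len):
    """Key every string by exactly target_len bits.

    Short strings are duplicated with all possible suffixes; long strings
    are truncated to their first target_len bits.
    """
    table = set()
    for s in strings:
        if len(s) < target_len:
            table.update(extend_to_length(s, target_len))
        else:
            table.add(s[:target_len])
    return table


def some_prefix_matches(table, hashed_query, target_len):
    """One dictionary lookup on the target_len-bit prefix of the query."""
    return hashed_query[:target_len] in table
```

For example, with `target_len = 4` the string `10` becomes the four entries `1000`, `1001`, `1010`, `1011`. Truncating a long string such as `01101` to `0110` means a query starting with `01100` is also accepted, which is the small increase in false positive rate mentioned above.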

1.4 Our Techniques

Our new construction follows a similar strategy, but the “prefix matching” problem is solved differently. Given a collection of strings of various lengths, we would like to construct a data structure such that given any query , we will be able to quickly decide if any prefix of appears in the database. The first observation is that the short strings in the database can be easily handled. In fact, all strings shorter than bits can be stored in a “truth table” of size . That is, we simply store, for every -bit string, whether any of its prefixes appears in the database. For a query , by checking the corresponding entry of its -bit prefix, one immediately resolves all short strings. On the other hand, for strings longer than bits, we propose a new (exact) membership data structure, and show that it in fact automatically solves prefix matching when all strings are long. Before describing its high-level construction in Section 1.4.1, let us first see what it can do and how it is applied to our filter construction.
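A truth table of this kind can be sketched as below. This is a naive version for illustration only: the class name and the parameter `width` (the table's string length) are assumptions, one byte is used per entry for readability, and `insert` marks every entry under the inserted prefix, so it is not a constant-time operation.

```python
class TruthTable:
    """Bitmap over all width-bit strings.

    Entry p is 1 iff some stored string is a prefix of the width-bit string p.
    """

    def __init__(self, width):
        self.width = width
        self.bits = bytearray(1 << width)  # one byte per entry, for clarity

    def insert(self, s):
        """Insert a bit string of length at most width."""
        if len(s) > self.width:
            raise ValueError("only short strings belong in the truth table")
        pad = self.width - len(s)
        start = (int(s, 2) << pad) if s else 0
        # All width-bit strings extending s form one contiguous range.
        for p in range(start, start + (1 << pad)):
            self.bits[p] = 1

    def query(self, y):
        """Return True iff a stored string is a prefix of y (len(y) >= width)."""
        return self.bits[int(y[:self.width], 2)] == 1
```

A query reads a single entry, indexed by the first `width` bits of the query string, which is what makes the short strings cheap to resolve.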

When the capacity is set to , the membership data structure stores keys from using space bits, supporting insertion and membership query in worst-case constant time with high probability. When applied to prefix matching, it stores strings of length at most (and more than ) using bits. Using this data structure with the capacity set to , we are able to store the database succinctly when . As we insert more keys to the database, the capacity needs to increase. Another advantage of our membership data structure is that the data can be transferred from the old data structure with capacity to a new one with capacity in time. More importantly, the transfer algorithm runs almost “in-place”, and the data structure remains “queryable” in the middle of the execution. That is, one does not need to keep both data structures in full: at any time the total memory usage is still , and the data structure can be queried. Therefore, as is increasing from to , we gradually build a new data structure with capacity . Every time a key is inserted, the background data-transfer algorithm is run for a constant number of steps. By the time reaches , we will have already transferred everything to the new data structure, and will be ready to build the next one with capacity . Overall, the data structure is going to have stages, where the -th stage handles the -th to the -th insertion. In each stage, the database size is doubled, and the data structure also gradually doubles its capacity. This guarantees that the total space is succinct with respect to the current database size, and every operation is handled in constant time with high probability.
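The scheduling idea (a constant amount of background transfer per insertion, with queries answered throughout) can be sketched as follows. This only illustrates the de-amortization schedule: Python sets stand in for the old and new structures, so the in-place and succinctness properties are not modeled, and the names `IncrementalSet` and `STEPS_PER_INSERT` are hypothetical.

```python
class IncrementalSet:
    """Exact set that migrates to a new structure a few keys per insertion."""

    STEPS_PER_INSERT = 2  # assumed constant; one step per insertion also suffices here

    def __init__(self):
        self.old = set()     # structure being drained
        self.new = set()     # structure being filled
        self.stage_end = 1   # database size at which the current stage ends

    def __len__(self):
        return len(self.old) + len(self.new)

    def insert(self, key):
        if self.lookup(key):
            return
        self.new.add(key)
        # Background transfer: move a constant number of keys per insertion.
        for _ in range(self.STEPS_PER_INSERT):
            if self.old:
                self.new.add(self.old.pop())
        if len(self) == self.stage_end:
            # The size has doubled, so the transfer must already be complete.
            assert not self.old, "transfer must finish before the stage ends"
            self.old, self.new = self.new, set()
            self.stage_end *= 2

    def lookup(self, key):
        # Both structures stay queryable while the transfer is in progress.
        return key in self.old or key in self.new
```

Since a stage that starts with n keys lasts for n insertions and each insertion moves at least one old key, the old structure is always empty by the time the next stage begins.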

Finally, to pin down the leading constant in the extra bits, we show that for the -th inserted key for , storing the -bit prefix of balances the false positive rate and the space. Since our new membership data structure only introduces an extra bits of space per key, it is not hard to verify that the total space of our construction is .

1.4.1 Membership Data Structure

In the following, let us briefly describe how our new membership data structure works. The data structure works in the extendable array model, as the previous solution by Raman and Rao. See Section 2.2.2 or [29] for more details.

Our main technical contribution is the idea of data blocks. Without the data blocks, our data structure degenerates into a variant of the one proposed in [6]. Instead of a redundancy of bits, the degenerate version incurs a redundancy of bits, which makes the data structure space-inefficient for filters, as we discussed earlier.

For simplicity, let us for now assume that we have free randomness, and the first step is to randomly permute the universe. Thus, we may assume that at any time, the database is a uniformly random set (of certain size). We divide the universe into buckets, e.g., according to the top bits of the key. Then with high probability, every bucket will have keys. We will then dynamically allocate space for each bucket. Note that given that a key is in bucket , we automatically know that its top bits are “”. Therefore, within each bucket, we may view the keys as having length only , or equivalently, the universe size as being (recall that the goal is to store each key using bits on average).

To store the keys in a bucket, we further divide it into data blocks consisting of keys each, based on the time of insertion. That is, the first keys inserted to this bucket will form the first data block, the next keys will be the second data block, etc. Since each data block has few enough keys, it can be stored using a static construction (supporting only queries) with nearly optimal space of , which is bits per key, or using a dynamic construction with bits per key. The latest data block, into which we always insert the new key, is maintained using the dynamic construction. When it becomes full, we allocate a new data block, and at the same time, we run an in-place reorganization algorithm in the background. The reorganization algorithm runs in time, and converts the dynamic construction into the static construction, which uses less space. For each insertion in the future, the reorganization algorithm is run for a constant number of steps; thus, it finishes before the next data block becomes full. Finally, for each bucket, we maintain an adaptive prefix structure [4, 5] to navigate the query to the relevant data block. Roughly speaking, when all keys in the bucket are random, most keys will have a unique prefix of length . In fact, Bender et al. [4, 5] showed that for every keys, the shortest prefix that is unique in the bucket can be implicitly maintained in constant time, and the total space for all keys is bits with high probability (footnote 6: the -bit representation is implicit). We further store, for each such unique prefix, which data block contains the corresponding key. It costs bits per key. Given a query, the adaptive prefix structure is able to locate the prefix that matches the query in constant time, which navigates the query algorithm to the (only) relevant data block. We present the details in Section 4.

2 Preliminaries

2.1 String Notations

Let and . Given a string , we use to denote its length. We denote by the concatenation of two strings . We denote the concatenation of ones or zeros by or , respectively.

For , we use (or ) to denote that is a prefix of , formally:

(1)

Note that our notation is unconventional: we use for prefixing , to reflect that the Hamming cube identified by is contained by the Hamming cube for its prefix .

For two strings such that , to compare and in lexicographical order, we compare and in lexicographical order, where is a special symbol which is smaller than any other symbol.

Recall that an injection (code) on strings is a prefix-free code if no codeword is a prefix of another codeword. There is a prefix-free code for strings of length .

Proof.

Given any , the codeword is . ∎
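Since the codeword formula is lost in the text above, the following sketch uses one standard construction as an assumption, which may differ from the one intended in the proof: a bit string of length at most `max_len` is followed by a single 1 and then padded with 0s to length exactly `max_len + 1`. All codewords then have the same length and the map is injective, so no codeword is a proper prefix of another. The names `encode`, `decode` and `max_len` are hypothetical.

```python
def encode(x, max_len):
    """Map a bit string of length <= max_len to a codeword of length max_len + 1."""
    if len(x) > max_len or any(c not in "01" for c in x):
        raise ValueError("expected a bit string of length at most max_len")
    # The 1 marks where x ends; the zeros pad to a fixed length.
    return x + "1" + "0" * (max_len - len(x))


def decode(codeword):
    """Invert encode: drop the trailing zeros and the final 1 marker."""
    return codeword[:codeword.rindex("1")]
```

For instance, with `max_len = 3` the empty string is encoded as `1000` and `101` as `1011`.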

2.2 Computational Models

2.2.1 Random Access Machine

Throughout the paper, we use to denote the word size: each memory word is a Boolean string of bits. We assume that the total number of memory words is at most , and each memory word has a unique address from , so that any pointer fits in one memory word. We also assume that the CPU has a constant number of registers of size , and that any data point fits in a constant number of words (i.e., ). During each CPU clock tick, the CPU may load one memory word into one of its registers, write the content of some register to some memory word, or execute the basic operations on the registers. Specifically, the basic operations include four arithmetic operations (addition, subtraction, multiplication, and division), bitwise operations (AND, OR, NOT, XOR, shifting), and comparison.

2.2.2 Memory Models

We use a memory access model known as the extendable arrays [29] to model the dynamic space usage.

The extendable array is one of the most fundamental data structures in practice. It is implemented by the standard libraries of most popular programming languages, such as std::vector in C++, ArrayList in Java and list in Python.

[Extendable arrays] An extendable array of length maintains a sequence of fixed-sized elements, each assigned a unique address from , such that the following operations are supported:

  • : access the element with address ;

  • : increment the length , creating an arbitrary element with address ;

  • : decrement the length , remove the element with address .

A collection of extendable arrays supports

  • : create an empty extendable array with element of size and return its name;

  • : destroy the empty extendable array ;

  • , , : apply the corresponding operations on array .

Each of the above operations takes constant time. The space overhead of an extendable array is , where are the word size, the length of the array, and the element size respectively. Likewise, the space overhead of a collection of extendable arrays is , where , and are the set of extendable arrays, the length of array , and the element size of array respectively.
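The interface of a single extendable array can be sketched as a thin wrapper around a Python list. The operation names are lost in the text above, so `access`, `write`, `grow` and `shrink` are hypothetical names; note also that Python's `list.append` is amortized constant time rather than worst-case constant time, so this mirrors the interface only, not the cost model.

```python
class ExtendableArray:
    """Sequence of fixed-size elements with addresses 0 .. length-1."""

    def __init__(self, element_bits):
        self.element_bits = element_bits
        self._cells = []

    def __len__(self):
        return len(self._cells)

    def _check(self, address):
        if not 0 <= address < len(self._cells):
            raise IndexError("address out of range")

    def access(self, address):
        """Read the element stored at the given address."""
        self._check(address)
        return self._cells[address]

    def write(self, address, value):
        """Overwrite the element at the given address."""
        self._check(address)
        if not 0 <= value < (1 << self.element_bits):
            raise ValueError("value does not fit in one element")
        self._cells[address] = value

    def grow(self):
        """Increase the length by one; the new last element is arbitrary (0 here)."""
        self._cells.append(0)

    def shrink(self):
        """Decrease the length by one, discarding the last element."""
        if not self._cells:
            raise IndexError("array is already empty")
        self._cells.pop()
```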


We also consider the following allocate-free model.

[Allocate and free] In the allocate-free model, there are two built-in procedures:

  • : return a pointer to a block of consecutive memory words which is uninitialized;

  • : free the block of consecutive memory words which is pointed to by and has been initialized to s.

Each of the above operations takes constant time. The total space overhead is , where is the set of all memory blocks and is the length of memory block . We discuss the space usages of our data structures in the allocate-free model in Section 7.

To avoid the pointer being too expensive in the dynamic memory models, we assume .

2.3 Random Functions

[-wise independent random function] A random function is called -wise independent if for any distinct , and any ,

[[31, 10]] Let be a universe, , , , and . There exists a data structure for a random function such that

  • with probability , the data structure is constructed successfully;

  • upon successful construction of the data structure, is -wise independent;

  • the data structure uses space bits;

  • for each , is evaluated in time in the worst case in the RAM model.
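For intuition about limited independence, here is the textbook polynomial construction, which is not the data structure of [31, 10]: a polynomial of degree k-1 with uniformly random coefficients over a prime field is k-wise independent on that field, but evaluating it takes time proportional to k instead of constant time. The class name and its parameters are hypothetical.

```python
import random


class PolynomialHash:
    """h(x) = (a_0 + a_1 x + ... + a_{k-1} x^(k-1)) mod p with random a_i."""

    def __init__(self, k, prime, rng=None):
        rng = rng or random.Random()
        self.prime = prime
        # k uniformly random coefficients give k-wise independence
        # over distinct inputs in {0, ..., prime - 1}.
        self.coeffs = [rng.randrange(prime) for _ in range(k)]

    def __call__(self, x):
        if not 0 <= x < self.prime:
            raise ValueError("input must lie in the field {0, ..., prime - 1}")
        acc = 0
        for a in reversed(self.coeffs):  # Horner's rule, highest degree first
            acc = (acc * x + a) % self.prime
        return acc
```

For k = 2 and two distinct inputs, the map from the coefficient pair to the pair of hash values is a bijection, so every pair of outputs is equally likely; this is exactly pairwise independence.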

[Chernoff bound with limited independence [30]] Let be arbitrary -wise independent boolean random variables with

for any . Let , then for any , it holds that

as long as .

2.4 Adaptive Prefixes

Given a sequence of strings, let be a collection of prefixes, such that for every , the is the shortest prefix, of length at least , of the binary representation of , such that prefixes no other . Note that for any string , there is at most one such that as long as exists. In particular, does not exist if there are such that .

The prefixes are stored in lexicographical order; thus we refer to the -th prefix as the prefix with rank in lexicographical order.

[Refined from [4, 5]] Let be two constants where . For a random sequence of strings drawn from uniformly at random with replacement, with probability at least , the prefix collection exists and can be represented with at most bits, where is determined by . Furthermore, the following operations are supported in constant time:

  • : update the representation by inserting a new string to , when there is at most one such that ;

  • : given any query , return the rank of the only that prefixes , and return NO if there does not exist such a ;

  • : given any query , return the lowest rank of all that , and return if there is no in the collection;

For completeness, we prove it in Appendix B.

3 Data Structures for Sets of Unknown Sizes

In this section, we present our filter and dictionary data structures for sets of unknown sizes.

3.1 The Succinct Dynamic Filters

The following theorem is a formal restatement of Theorem 1.1.

[Dynamic filter - formal] Let , the data universe, and , where is an arbitrary constant. Assume the word size . There exists a data structure for approximate membership for subsets of unknown sizes of , such that

  1. for any and , the data structure uses bits of space after insertions of any key, and extra precomputed bits that are independent of the input, where is an arbitrary small constant;

  2. each insertion and membership query takes time in the worst case;

  3. after each insertion, a failure may be reported by the data structure with some probability, and for any sequence of insertions, the probability that a failure is ever reported is at most , where the probability is taken over the precomputed random bits;

  4. conditioned on no failure, each membership query is answered with false positive rate at most .

As we mentioned in the introduction, our data structure has stages when handling insertions. The -th stage is from the insertion of the -th key to the -th key – the database size doubles after each stage.

The main strategy is to reduce the problem of (approximate) membership to (exact) prefix matching. More formally, in the prefix matching problem, we would like to maintain a set of binary strings of possibly different lengths, supporting

  • insert(): add string to the set;

  • query(): decide if any string in the set is a prefix of .

To this end, our filter first applies a global hash function such that is -wise independent according to Theorem 2.3, where is a constant to be fixed later, and is a sufficiently large constant (which in fact, is the in Theorem 2.4). To insert a key in stage , we calculate its hash value , and then insert the -bit prefix of , for some parameter . To answer a membership query , we simply calculate and search if any prefix of is in the database. If no prefix of is in the database, we output NO; otherwise, we output YES. It is easy to see that this strategy will never output any false negatives. On the other hand, by union bound, if the query is not in the set, the probability that the query algorithm outputs YES is at most

since is -wise independent (in particular, it is pairwise independent), then the probability that the -bit prefix of matches with the prefix of the hash value of key is . Hence, by setting

(2)

the false positive rate is at most

We use to denote the distribution of the random insertion sequence for prefix matching constructed above. Formally, is the distribution of a sequence of random strings obtained from -wise independent sequence by truncating: , , where .
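The reduction from approximate membership to prefix matching can be sketched as follows. Eq. (2) is not legible above, so the prefix-length schedule used here is an assumption chosen only so that the union bound stays below ϵ: with stage i holding keys number 2^i through 2^(i+1)-1, each key in stage i stores a prefix of ⌈log(1/ϵ)⌉ + i + ⌈2 log(i+2)⌉ bits, which costs about 2 loglog n extra bits per key, not the optimal leading constant. SHA-256 stands in for the k-wise independent hash function, a Python set is a naive prefix-matching structure, and lookups scan all prefix lengths in use, so they are not constant time.

```python
import hashlib
import math

HASH_BITS = 256


def hash_bits(key):
    """Stand-in global hash: SHA-256 rendered as a 256-character bit string."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return format(int.from_bytes(digest, "big"), f"0{HASH_BITS}b")


class PrefixFilter:
    def __init__(self, eps):
        self.base = math.ceil(math.log2(1 / eps))
        self.stored = set()    # naive exact prefix-matching structure
        self.lengths = set()   # distinct prefix lengths currently in use
        self.count = 0

    def _prefix_len(self, stage):
        # Assumed schedule: each of the 2^stage keys in this stage contributes
        # 2^-len <= eps / (2^stage * (stage + 2)^2) to the union bound.
        return self.base + stage + math.ceil(2 * math.log2(stage + 2))

    def insert(self, key):
        self.count += 1
        stage = self.count.bit_length() - 1  # keys 2^i .. 2^(i+1)-1 form stage i
        length = self._prefix_len(stage)
        self.stored.add(hash_bits(key)[:length])
        self.lengths.add(length)

    def lookup(self, key):
        h = hash_bits(key)
        # YES iff some stored string is a prefix of the hash value.
        return any(h[:length] in self.stored for length in self.lengths)
```

Under this schedule the per-stage contributions sum to at most ϵ(π²/6 - 1) < ϵ, and an inserted key is always reported as present because its own stored prefix matches.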

[Prefix matching] Let , where is an arbitrary constant. There exist a constant and a deterministic data structure for prefix matching such that

  1. for any and , the data structure uses bits of space after insertions, and extra precomputed bits, where is an arbitrary small constant;

  2. each insertion and query takes time in the worst case;

  3. after each insertion, a failure may be reported by the data structure, and for a random sequence of insertions drawn from , the probability that a failure is ever reported is at most , where the probability is taken over ;

  4. every query is answered correctly if no “fail” is reported.

We present the construction in Section 4. Using this prefix matching data structure, the space usage of the filter is

  • bits,

  • and bits for storing by Theorem 2.3 and for the precomputed lookup tables described in Appendix A, both independent of the operation sequence.

Each insertion and query can be handled in constant time given the data structure does not fail. This proves Theorem 3.1.

3.2 The Succinct Dynamic Dictionaries

The data structure for prefix matching also works well as a dictionary data structure when the inserted keys are sampled uniformly at random. A worst-case instance can be converted into a random instance by a random permutation . Assuming an idealized -wise independent random permutation whose representation and evaluation are efficient, the data structure for prefix matching in Lemma 3.1 can be immediately turned into a dictionary. However, the construction of -wise independent random permutations with low space and time costs is a longstanding open problem [18].

We show that our data structure can solve the dictionary problem in the worst case unconditionally, at the expense of extra bits of space for storing random bits which are independent of the input.

[Dynamic dictionary - formal] Let be the data universe, and , where is an arbitrary constant. Assume the word size . There exists a data structure for dictionary for sets of unknown sizes of key-value pairs from , such that

  1. for any and , the data structure uses bits of space after insertions of any key-value pairs, and extra precomputed bits that are independent of the input, where is an arbitrary small constant;

  2. each insertion and query takes time in the worst case;

  3. after each insertion, a failure may be reported by the data structure with some probability, and for any sequence of insertions, the probability that a failure is ever reported is at most , where the probability is taken over the precomputed random bits;

  4. conditioned on no failure, each query is answered correctly.

The details of the data structure are postponed to Section 6.

4 Prefix Matching Upper Bound

In this section, we prove Lemma 3.1.

Recall the distribution of random insertion sequence assumed in Lemma 3.1. Given an insertion sequence , we define the core set , and its subset for any . Let denote the distribution of . We say that a random sequence of strings is drawn from if it can be obtained by permuting the random core set .

We show that Lemma 3.1 is true as long as there exists a family of deterministic data structures for prefix matching with known capacity . An instance of the data structure is parameterized by capacity , and string length upper bound . The data structure uses bits of extra space whose contents are precomputed lookup tables, and supports the following functionalities with good guarantees:

  • and : subroutines for initializing and destroying respectively. The data structure is successfully initialized (or destroyed) after invoking () consecutively for times. When successfully initialized, uses space bits. The ’s are invoked before all other subroutines and ’s are invoked after all other subroutines.

  • : insert string to , where . After insertions, uses at most bits. Each insertion may cause to fail. A failure ever occurs for a random insertion sequence with probability at most , as long as is drawn from , where is suitably determined by constant .

  • : return one bit to indicate whether there exists a prefix of in . The correct answer is always returned as long as has not failed.

  • : try to delete an arbitrary string in and return the if is deleted. An invocation may delete nothing and hence return nothing, but the total number of such empty invocations is guaranteed to be at most . Each invocation that successfully deletes a string frees space bits. The ’s are invoked after all insertions.

Given the deterministic data structures supporting the above functionalities in constant time in the worst case, Lemma 3.1 is true.

Proof.

We use an auxiliary structure called a truth table to deal with short strings. A truth table is a bitmap (i.e., an array of bits) of length and supports the required functionalities in the worst case:

  • is initialized to the all-0 string , where each invoking of extends by one 0 until is of length , and each invoking of shrinks by one bit until is fully destroyed;

  • to insert where , we set ; (Footnote 7: for , a list or array of items, we let denote the -th item of .)

  • to query where , we return YES if and return NO otherwise;

  • to decrement , we maintain a that traverses from to , and at each time set , return if , and increment by 1.
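The truth table operations above can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are ours and not part of the construction: every inserted string has exactly `length` bits, one byte is spent per cell instead of one bit, and the names `TruthTable`, `insert`, `lookup`, `decrement` and `cursor` are hypothetical. Each call to `decrement` visits a single cell, so the number of empty calls is bounded by the table length.

```python
from typing import Optional


class TruthTable:
    """Bitmap indexed by all binary strings of one fixed length (a simplification)."""

    def __init__(self, length: int) -> None:
        assert length >= 1
        self.length = length
        # One byte per cell for readability; a real bitmap would use one bit.
        self.bits = bytearray(1 << length)
        # Next cell that the decrement scan will visit.
        self.cursor = 0

    def insert(self, s: str) -> None:
        # Assumption: inserted strings have exactly `length` bits.
        assert len(s) == self.length
        self.bits[int(s, 2)] = 1

    def lookup(self, x: str) -> bool:
        """Report whether the prefix of x with `length` bits was inserted."""
        if len(x) < self.length:
            return False
        return self.bits[int(x[:self.length], 2)] == 1

    def decrement(self) -> Optional[str]:
        """Visit one cell and return its string if the cell is set, else None."""
        if self.cursor >= len(self.bits):
            return None
        i = self.cursor
        self.cursor += 1
        if self.bits[i]:
            return format(i, "0{}b".format(self.length))
        return None
```

In this sketch `decrement` leaves the visited bit set, so lookups on the table stay correct while its contents are being copied elsewhere.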

Initially, the prefix matching data structure required by Lemma 3.1 consists of and respectively with capacities , and string lengths , where is defined in Eq(2).

To insert , which is the -th insertion, we set and invoke if no prefix of has been inserted. Then we execute the following procedure for times to maintain our data structure:

  1. If is non-empty, we decrement it by invoking . If a is returned, we insert it into by invoking and .

  2. If is non-empty, we invoke . If a is returned, we insert it into by invoking when and insert into by invoking otherwise.

  3. If (or ) is empty but not destroyed yet, we invoke (or ).

  4. If (or ) has been destroyed, we invoke (or for with capacity and string length upper bound ), where is defined in Eq(2).

A failure is reported whenever a failure is reported during insertion to .
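The maintenance procedure follows a general de-amortization pattern: every insertion also performs a constant number of migration steps, so an old structure is drained gradually rather than rebuilt at once. The following Python sketch shows only this pattern. Built-in sets stand in for the truth tables and prefix matching structures, the incremental initialization and destruction steps are omitted, and the class name `TwoGenerationStore`, the doubling rule, and the parameter `steps` are assumptions made for illustration.

```python
class TwoGenerationStore:
    """Drains an old generation into a new one, a few items per insertion."""

    def __init__(self, capacity: int = 2, steps: int = 2) -> None:
        self.capacity = capacity  # hypothetical capacity of the current generation
        self.steps = steps        # migration steps performed per insertion
        self.current = set()      # generation being filled
        self.previous = set()     # generation being drained

    def insert(self, s: str) -> None:
        self.current.add(s)
        # Constant amount of migration work per insertion.
        for _ in range(self.steps):
            if self.previous:
                self.current.add(self.previous.pop())
        # Once the old generation is empty and the current one is full,
        # the current one becomes the old generation of a larger structure.
        if not self.previous and len(self.current) >= self.capacity:
            self.previous = self.current
            self.current = set()
            self.capacity *= 2

    def lookup(self, s: str) -> bool:
        # A query consults every live generation.
        return s in self.current or s in self.previous
```

With `steps = 2`, the old generation is empty well before the new one reaches its doubled capacity, which mirrors the invariant that the earlier structures have been destroyed by the time the next ones are needed.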

Clearly, for any integer , after insertions, all inserted strings are stored in either or . By the time reaches , have been initialized, all inserted strings are stored in , and have been destroyed.

Consider the insertion sequence for a fixed . Observe that the strings inserted into must be in the core set . Therefore the insertion sequence is drawn from , which means that the insertions to each ever fail with probability at most . By the union bound, a failure is ever reported with probability at most .

Overall, the data structure uses at most bits after insertions, besides the precomputed bits.

Suppose strings have been inserted, and let . To query , we invoke , , , simultaneously, and return YES if any one of the invocations returns YES.

Obviously each insertion and query takes constant time in the worst case, and it is easy to check that every query is correctly answered as long as no failure is reported. ∎

5 Succinct Prefix Matching with Known Capacity

We now describe the data structures required by Claim 4. The pseudocode is given in Appendix C.

The data structure consists of a main table and subtables.

We partition each binary string into four consecutive parts: of lengths respectively. Roughly speaking, a datapoint is distributed into a subtable according to , and is then put into a data block of size according to the order in which it is inserted; we can therefore save bits for each datapoint with a proper encoding.
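As a small illustration of this partition, the Python sketch below splits a bitstring into four consecutive parts. The helper name `split_four` and the concrete lengths `a`, `b`, `c` are hypothetical; the actual part lengths are the ones fixed by the parameters of the data structure.

```python
from typing import Tuple


def split_four(s: str, a: int, b: int, c: int) -> Tuple[str, str, str, str]:
    """Split s into consecutive parts of lengths a, b, c and len(s) - a - b - c."""
    if len(s) < a + b + c:
        raise ValueError("string is shorter than the first three parts")
    return s[:a], s[a:a + b], s[a + b:a + b + c], s[a + b + c:]


# Example with illustrative lengths 2, 3 and 1.
parts = split_four("1011001110", 2, 3, 1)
# The first part, read as an integer, can serve as a table index.
entry_index = int(parts[0], 2)
```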

  component     parameter
  input         input independence
  main table    main table length
                max-load
  subtable      max-load on
                capacity of data block
                failure probability of fingerprint collection

Table 1: The setting of parameters for the data structure.

Main Table.

The main table consists of entries, each of which contains a pointer to a subtable. Each insertion/query is distributed into an entry of the main table addressed by . Recall the word size . The main table uses bits.

Recall that is transformed from a -wise independent sequence by truncating. The insertion sequence is drawn from by permuting , the restriction of the core set to the strings whose lengths range within .

Let denote the subsequences of which contain all the strings that have prefix , respectively. By definition, . Recall that are -wise independent. Due to Theorem 2.3, the load of entry exceeds with probability

(3)

as long as . Therefore the max-load of entries of the main table is upper bounded by with probability at least . The data structure reports failure if any entry of the main table overflows. In the rest of the proof, we therefore assume for all .

Observe that a datapoint can be identified with if the entry it is distributed into is fixed. Therefore we let denote the subsequences generated from by discarding the left-most bits.
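The routing into the main table can be sketched in Python as follows. Each string is sent to the entry addressed by its leading bits, only the remaining suffix is kept, and an overflow of any entry is treated as a failure. The function name `distribute` and the parameters `prefix_bits` and `max_load` are hypothetical stand-ins for the quantities in Table 1, and plain lists stand in for the subtables.

```python
from typing import Iterable, List, Optional


def distribute(keys: Iterable[str], prefix_bits: int,
               max_load: int) -> Optional[List[List[str]]]:
    """Route each bitstring to the entry named by its first prefix_bits bits.

    Returns the per-entry lists of suffixes, or None if an entry overflows.
    """
    entries: List[List[str]] = [[] for _ in range(1 << prefix_bits)]
    for key in keys:
        entry = entries[int(key[:prefix_bits], 2)]
        if len(entry) >= max_load:
            return None  # overflow: the data structure would report failure
        # The prefix is implied by the entry, so only the suffix is stored.
        entry.append(key[prefix_bits:])
    return entries
```

Storing only the suffix is what makes the observation above useful: once the entry is fixed, the discarded leading bits carry no additional information.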

Subtable.

Each subtable consists of the following parts to be specified later:

  • a collection of fingerprints and its indicator list ;

  • an (extendable) array of navigators ;

  • an (extendable) array of data blocks ;

  • two buffers, ;

  • constant many other local variables.

All the datapoints are stored in array . Given a datapoint , it is easy to see that the addresses of the entries which contain the information of are highly correlated with the order in which it is inserted, since any insertion takes constant time in the worst case. Hence we use the fingerprints , indicators , navigators , and a careful encoding of each data block as clues to locate the entries which maintain . Recall that new insertions are put into the latest data block using a dynamic construction, and that a full dynamic data block is reorganized into a static construction. We use buffer to maintain the dynamic block, and use buffer to “de-amortize” the reorganization.

First consider a static version of our data structure. In the static version, the buffers and the indicator list are unnecessary. Let .

Fingerprints.

The collection of fingerprints is obtained by applying Theorem 2.4 on with guarantee . Note that are mutually independent as long as . Due to Theorem 2.4, there exists a constant such that a fingerprint collection for can be represented in bits with probability .

We show that there exists an injective function such that . Due to the injective function and the guarantee that is prefix-free, the fingerprint collection of can be represented with the same space and probability guarantees as above.

We define by . By definition, for any , there is such that . Recall that all the strings in have prefix . Hence for any such that , it holds that , i.e. . Thus for any . Therefore is well-defined. On the other hand, for distinct , and are disjoint, since is prefix-free and there is no such that prefix simultaneously for distinct . Therefore is injective. Recall that are generated by removing the prefix from the strings in : . Therefore works for too, i.e. .

A failure is reported if any fingerprint collection cannot be represented within bits, which occurs with probability at most .

The fingerprints are sorted lexicographically, so that by the -th fingerprint we mean the -th in lexicographical order. For simplicity, we write .

A failure is reported if more than datapoints share identical and , which occurs with probability at most

(4)

The fingerprints cost bits per subtable if no failure occurs.

Navigators.

is an array of pointers. For any datapoint, the rank of its fingerprint is synchronized with the index of its navigator. In particular, for the -th fingerprint in , is the address of the data block which maintains the datapoint with the fingerprint. A data block maintains up to datapoints, thus there are at most data blocks. The navigators cost at most bits of space.
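The synchronization between fingerprint ranks and navigator indices can be illustrated with the following Python sketch. It is a simplification under assumptions of ours: fingerprints are distinct integers kept in a sorted list, the rank is found by binary search rather than through the succinct representation of Theorem 2.4, and the name `find_block` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def find_block(fingerprints: List[int], navigators: List[int],
               fp: int) -> Optional[int]:
    """Return the data block index stored for fingerprint fp, if present.

    fingerprints is sorted, and navigators[i] is the block that holds the
    datapoint whose fingerprint has rank i.
    """
    rank = bisect_left(fingerprints, fp)
    if rank < len(fingerprints) and fingerprints[rank] == fp:
        return navigators[rank]
    return None


# Three fingerprints; the datapoints of ranks 0 and 2 live in block 1.
block = find_block([3, 8, 21], [1, 0, 1], 8)
```

Because the navigator array is indexed by rank, no fingerprint needs to be stored alongside its pointer.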

Data Blocks.

is interpreted as an array of data blocks, with each data block holding up to datapoints.

Consider the following succinct binary representation (called pocket dictionary in [6]) of a collection of datapoints : The representation consists of two parts and . Let denote the left-most bits and the right-most bits of . Let , and sorted lexicographically. Then and