C++17 implementation of memory-efficient dynamic tries
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are only efficient in the static case, it is still difficult to implement a keyword dictionary that is space efficient and dynamic. In this article, we propose such a keyword dictionary. Our main idea is to embrace the path decomposition technique, which was proposed for constructing cache-friendly tries. To store the path-decomposed trie in small memory, we design data structures based on recent compact hash trie representations. Exhaustive experiments on real-world datasets reveal that our dynamic keyword dictionary needs up to 68…
An associative array is called a keyword dictionary if its keys are strings. In this article, we study the problem of maintaining a keyword dictionary in main memory efficiently. When storing words extracted from text collections written in natural or computer languages, the size of a keyword dictionary is not of major concern. This is because, after carefully polishing the extracted strings with natural language processing tools like stemmers, the size of the dictionary grows sublinearly, as O(n^β) for some 0 < β < 1 over a text of n words, due to Heaps’ Law [books:heaps1978information, books:baeza2011modern]. However, as reported in [martinez2016practical], some natural language applications such as web search engines and machine translation systems need to handle large datasets that do not follow Heaps’ Law. Also, other recent applications, such as semantic web graphs and bioinformatics, handle massive string databases with keyword dictionaries [martinez2016practical, mavlyutov2015comparison]. Although common implementations like hash tables are fast, their memory consumption is a severe drawback in such scenarios. Here, a space-efficient implementation of the keyword dictionary is important. In this paper, we focus on the practical side of this problem.
In the static setting, which omits the insertion and deletion of keywords, a number of compressed keyword dictionaries have been developed over the past decade, some of which we highlight in the following. We start with Martínez-Prieto et al. [martinez2016practical], who proposed and evaluated a number of compressed keyword dictionaries based on techniques like hashing, front-coding, full-text indexes, and tries. They demonstrated that their implementations use up to 5% of the original dataset size, while also supporting searches of prefixes and substrings of the keywords. Subsequently, Grossi and Ottaviano [grossi2014fast] proposed a cache-friendly keyword dictionary through path decomposition of tries. Arz and Fischer [arz2018lempel] adapted the LZ78 compression to devise a keyword dictionary. Finally, Kanda et al. [kanda2017compressed] proposed a keyword dictionary based on a compressed double-array trie. As we can see from these representations, space-efficient static keyword dictionaries have been well studied because of the advancements of practical (yet static) succinct data structures collected in well-maintained libraries such as SDSL [gog2014theory] and Succinct [grossi2013design].
Under the dynamic setting, however, only a few space-efficient keyword dictionaries have been realized, probably due to the implementation difficulty. Although HAT-trie [askitis2010engineering] and Judy [manual:judy10min] are representative space-efficient dynamic implementations, as demonstrated in previous experiments (such as http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/cedar/#perf and https://github.com/Tessil/hat-trie/blob/master/README.md#benchmark), they still waste memory by maintaining many pointers. The Cedar trie [yoshinaga2014self] is a space-efficient implementation that relies heavily on 32-bit pointers to address memory, and therefore cannot be applied to massive datasets. Although its implementation makes it hard to switch to 64-bit pointers, we expect that doing so would increase its space consumption considerably. Although several practical dynamic succinct data structures [prezza2017framework, poyias2017compact, DBLP:journals/corr/PoyiasPR17] have been recently developed, modern dynamic keyword dictionaries are heavily based on pointers, which consume a large fraction of the total space. Nonetheless, there are some applications that need dynamic keyword dictionaries for massive datasets, such as search engines [software:groonga, busch2012earlybird], RDF stores [mavlyutov2015comparison], or Web crawlers [ueda2013parallel]. Consequently, realizing a practical space-efficient dynamic keyword dictionary is an important open challenge.
Common keyword dictionary implementations represent the keywords in a trie, supporting the retrieval of keywords with trie navigation operations. In this subsection, we summarize space-efficient dynamic tries.
We consider a dynamic trie with nodes over an alphabet of size . Arroyuelo et al. [arroyuelo2016succinct] introduced succinct representations that require almost optimal bits of space, while supporting insertion and deletion of a leaf in amortized time if and in amortized time otherwise. (Throughout this paper, the base of the logarithm is 2 unless explicitly indicated otherwise.) Jansson et al. [jansson2015linked] presented a dynamic trie representation that uses bits of space, while supporting insertion and deletion of a leaf in expected amortized time.
On the practical side, Poyias et al. [DBLP:journals/corr/PoyiasPR17] proposed the m-Bonsai trie, a practical dynamic compact trie representation. It is a variant of the Bonsai trie [darragh1993bonsai] that represents the trie nodes as entries in a compact hash table. It takes bits of space, while supporting update and some traversal operations in expected time. Fischer and Köppl [fischer2017practical] presented and evaluated a number of dynamic tries for LZ78 [ziv1978compression] and LZW [welch1984technique] factorization. They also proposed an efficient hash-based trie representation in a similar way to m-Bonsai, which is referred to as FK-hash. (The representation is referred to as hash or cht in their paper [fischer2017practical]. To avoid confusion, we name it FK-hash by using the initial letters of the proposers, Fischer and Köppl.) Although FK-hash uses bits of space, its update algorithm is simple and practically fast. However, we are not aware of any space-efficient approach using them as keyword dictionaries.
In another line of research, the focus has been on bounding the space of a trie in terms of the number of keywords. Suppose that we want to maintain a set of strings with a total length of on a machine, where characters fit into a single machine word . In this setting, Belazzougui et al. [belazzougui2010dynamic] proposed the (dynamic) z-fast trie, which takes bits of space and supports retrieval, insertion and deletion of a string in expected time. Takagi et al. [takagi2016packed] proposed the packed compact trie, which takes bits of space and supports the same operations in expected time. Recently, Tsuruta et al. [tsuruta2019dynamic] developed a hybrid data structure of the z-fast trie and the packed compact trie, which also takes bits of space, but improves each of these operations to run in expected time.
We propose a novel space-efficient dynamic keyword dictionary, called the dynamic path-decomposed trie (abbreviated as DynPDT). DynPDT is based on a trie formed by path decomposition [ferragina2008searching]. Path decomposition is a trie transformation technique that was proposed for constructing cache-friendly trie dictionaries. Until now, it has been used only in static applications [grossi2014fast, hsu2013space]. Here, we adapt this technique for the dynamic construction of DynPDT, which gives DynPDT two main advantages over other known keyword dictionaries.
The first is that the data structure is cache efficient because of the path decomposition. During the retrieval of a keyword, most parts of the keyword can be scanned in a cache-friendly manner, without node-to-node traversals based on random accesses.
The second is that the path decomposition allows us to plug in any dynamic trie representation for the path-decomposed trie topology. For this purpose, we choose the hash-based trie representations m-Bonsai and FK-hash, as these are fast and memory efficient in the setting where all trie nodes have to be represented explicitly (which is the case for the nodes of the path-decomposed trie).
Based on these advantages, DynPDT becomes a fast and space-efficient dynamic keyword dictionary.
From experiments using massive real-world datasets, we demonstrate that DynPDT is more space efficient compared to existing keyword dictionaries. For example, to construct a keyword dictionary from a large URI dataset of 13.8 GiB, DynPDT needs only 2.5 GiB of working space, while a HAT-trie and a Judy trie need 9.5 GiB and 7.8 GiB, respectively. Furthermore, the time performance is competitive in many cases thanks to the path decomposition. The source code of our implementation is available at https://github.com/kampersanda/poplar-trie.
In Section 2, we introduce the keyword dictionary, and review the trie data structure and the path decomposition in our preliminaries. We introduce our new data structure DynPDT in Section 3. Subsequently, we present our DynPDT representations based on m-Bonsai and FK-hash in Sections 4 and 5, respectively. In Section 6, we provide our experimental results. Finally, we conclude the paper in Section 7.
A preliminary version of this work appeared in our conference paper [kanda2017practical] and the first author’s Ph.D. thesis [kanda2018space]. This paper differs significantly from those versions as follows: (1) a fast variant of m-Bonsai is incorporated in Section 4.1; (2) an efficient implementation of the bijective hash function in m-Bonsai is incorporated in Section 4.2; (3) a growing algorithm for m-Bonsai is presented in Section 4.3; (4) FK-hash is considered in addition to m-Bonsai in Section 5; (5) the experimental results in Section 6 and all descriptions are significantly enhanced.
A string is a (finite) sequence of characters over a finite alphabet. Our strings always start at position 0. Given a string of length , denotes the substring for . Particularly, is a prefix of and is a suffix of . Let denote the length of . The same notation is also applied to arrays. The cardinality of a set is denoted by .
Our model of computation is the transdichotomous word RAM model of word size , where is the size of the problem. We can read and process bits in constant time.
A keyword is a string over an alphabet that is terminated with a special character $ at its end. In a prefix-free set of strings, no string is a prefix of another string. A set of keywords is always prefix-free due to the character $. A keyword dictionary is a dynamic associative array that maps a dynamic set of keywords to values , where belongs to a finite set . It supports the retrieval, the insertion, and the deletion of keywords while maintaining the key-value mapping. In detail, it supports the following operations:
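lookup(K) returns the value associated with the keyword K if K is stored in the dictionary, or an invalid value otherwise.
insert(K, x) inserts the keyword K into the dictionary and associates the value x with K.
delete(K) removes the keyword K from the dictionary.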
A trie [books:knuth1998art, fredkin1960trie] is a rooted labeled tree representing a set of keywords . Each edge in is labeled by a character. All outgoing edges of a node are labeled with distinct characters. The label of the edge between a node and its parent is called the branching character of . The parent and the branching character uniquely determine . Each keyword is represented by exactly one path from the root to a leaf , i.e., the keyword can be extracted by concatenating the edge labels on the path from the root to . Since is prefix-free ($ is a unique delimiter of each keyword), there is a one-to-one correspondence between leaves and keywords.
Given a keyword of length , retrieves by traversing nodes from the root to a leaf while matching the characters of with the edge labels of the traversed path. In representations storing all trie nodes explicitly, we visit nodes during this traversal. However, this traversal suffers from poor locality of reference because it relies on random accesses. In practice, this cache inefficiency is a critical bottleneck, especially for long strings such as URLs. The most successful solution to this problem is path decomposition [ferragina2008searching], reviewed in the next subsection.
The path decomposition [ferragina2008searching] of a trie is a recursive procedure that first chooses an arbitrary root-to-leaf path in , then compactifies the path to a single node, and subsequently repeats the procedure in each subtrie hanging off the path . As a result, is partitioned into a set of node-to-leaf paths because there are leaves in . This decomposition produces the path-decomposed trie , which is composed of compactified nodes.
For explaining the properties of , we call the concatenation of the labels of all edges of a node-to-leaf path in the path string of . The path strings of the compactified paths of are the node labels of . In detail, each node in is associated with a node-to-leaf path of and is labeled by the path string of , denoted by . Each edge in is labeled by a pair consisting of a branching character and an integer, which are defined as follows (see also Figure 1): Take a node in and one of its children . Suppose that and are associated with the paths and in , respectively, such that and are the path labels of and . The edge has the label if, in , the first node on the path is the node
whose branching character is , and
whose parent is the -th node visited on the path (throughout this paper, we start counting from zero).
The edge labels of are characters drawn from the alphabet , where is the longest length of all node labels.
Figure 2 illustrates a root-to-leaf path in and the corresponding root in after compactifying to . The root is labeled by the path string of , which is . The branching character of in is because in is the child of the third node on the path with branching character . Also for the subtries rooted at the nodes in , the decomposition is recursively applied to produce the children of the root in .
Given a keyword , the retrieval on can be simulated with a traversal of starting at its root: Let denote the currently visited node in . On visiting , we compare the path string with the characters of . If we find a mismatch at with , we descend to the child with branching character and drop the first characters of .
When storing the characters of each path string in consecutive memory, the number of random accesses involved in the retrieval on is bounded by , where is the height of . The following property regarding the height is satisfied by construction.
The height of cannot be larger than that of .
A way to improve this height bound in the static case is the centroid path decomposition [ferragina2008searching]. Given an inner node in , the heavy child of is the child whose subtrie has the most leaves (ties are broken arbitrarily). Given a node , the centroid path is the path from to a leaf obtained by descending only to heavy children. The centroid path decomposition yields the following property by always choosing centroid paths in the decomposition.
Through the centroid path decomposition, the height of is bounded by .
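As a rough illustration of the static case, the following Python sketch builds a plain trie, applies the centroid path decomposition, and searches the result. The nested-dict trie, the tuple layout `(path_string, children)`, the helper names, and the sample keywords are all assumptions made for this example; they are not taken from the authors' implementation.

```python
from math import log2

def build_trie(keywords):
    """Plain trie as nested dicts; a leaf is an empty dict."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
    return root

def count_leaves(node):
    return 1 if not node else sum(count_leaves(c) for c in node.values())

def decompose(node):
    """Return (path_string, children) for the centroid path starting at node.

    children maps an edge label (c, i) to the decomposed subtrie whose first
    node has branching character c and hangs off the i-th node of the path.
    """
    label, children, i = [], {}, 0
    while node:
        # Heavy child: the child whose subtrie has the most leaves.
        heavy = max(node, key=lambda c: count_leaves(node[c]))
        for c, child in node.items():
            if c != heavy:
                children[(c, i)] = decompose(child)
        label.append(heavy)
        node = node[heavy]
        i += 1
    return "".join(label), children

def height(pdt):
    _, children = pdt
    return 0 if not children else 1 + max(height(c) for c in children.values())

def contains(pdt, kw):
    """Retrieval on the path-decomposed trie, one node label at a time."""
    label, children = pdt
    while kw != label:
        i = next((j for j, (a, b) in enumerate(zip(kw, label)) if a != b), None)
        if i is None:  # one string is a proper prefix of the other
            return False
        child = children.get((kw[i], i))
        if child is None:
            return False
        (label, children), kw = child, kw[i + 1:]
    return True

keys = ["technology$", "technics$", "technique$", "technically$"]
pdt = decompose(build_trie(keys))
```

Recomputing `count_leaves` at every node keeps the sketch short but costs quadratic time in the worst case; a practical static construction would precompute the leaf counts once.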
We can implement the key-value mapping through because there is a one-to-one correspondence between nodes in and keywords in . A simple approach is to store the associated values in an array such that stores the value associated with node . If we assign each of the nodes in a unique id from the range , then has no vacant entry (i.e. ). Another approach is to embed the value of at the end of , where the node corresponds to the keyword . This approach can be used without considering the assignment of node ids. In our experiments, we used the latter approach.
Although the centroid path decomposition gives a logarithmic upper bound on the height of (cf. Section 2), it can be applied only in static settings because we have to know the complete topology of a priori to determine the centroid paths. As a matter of fact, previous data structures embracing the path decomposition [grossi2014fast, hsu2013space, ferragina2008searching] consider only static applications.
In this section, we present incremental path decomposition, a novel procedure to construct a dynamic path-decomposed trie, which we call DynPDT in the following. Our procedure incrementally chooses a node-to-leaf path in (we do not actually construct , but represent it with the DynPDT) and directly updates the DynPDT on inserting a new keyword of . This incrementally chosen path is not a centroid path in general. Thus, the incremental path decomposition does not necessarily satisfy Property 2 but always satisfies Property 1.
In this section, we drop the technical detail of storing the values to ease the explanation of DynPDT, for which we omit the second argument in the insert operation of a new keyword .
In the following, we simulate a dynamic trie by DynPDT . Suppose that is non-empty. On inserting a new keyword into , we proceed as follows:
First traverse from the root by matching characters of until reaching the deepest node whose string label is a prefix of .
Decompose into for and , which is possible since and .
Finally, insert a new child of with branching character and append, from node , new nodes corresponding to the suffix .
In other words, the task of on is to create a new node-to-leaf path representing the suffix . We call that path the incremental path of the keyword . We simulate by creating a new node in whose label is the path label of this incremental path :
If , create the root and associate the keyword with by .
Otherwise (), retrieve the keyword from the root in three steps after setting variables and :
Compare with . If , terminate because is already inserted; otherwise, proceed with Step 2.
Find such that and ( exists since and ), and search the child of with branching character . If found, go back to Step 1 after setting the variable to this child and to the remaining suffix ; otherwise, proceed with Step 3.
Insert into by creating a new child of with branching character , and store the remaining suffix in by .
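The insertion steps above can be sketched in Python as follows. This is a minimal pointer-based illustration, not the hash-based representation developed later; the class names, the `mismatch` helper, and the sample keywords are assumptions introduced for the example, and values are omitted as in the surrounding explanation.

```python
class Node:
    def __init__(self, label):
        self.label = label      # path string of the incremental path
        self.children = {}      # edge label (c, i) -> Node

def mismatch(a, b):
    """First position where a and b differ.

    Such a position exists for two distinct strings that both end with the
    unique terminator '$', because neither can be a prefix of the other.
    """
    return next(j for j, (x, y) in enumerate(zip(a, b)) if x != y)

class DynPDT:
    def __init__(self):
        self.root = None

    def insert(self, keyword):
        """Insert keyword; return False if it was already present."""
        assert keyword.endswith("$") and "$" not in keyword[:-1]
        if self.root is None:               # empty trie: create the root
            self.root = Node(keyword)
            return True
        node, rest = self.root, keyword
        while True:
            if rest == node.label:          # Step 1: already inserted
                return False
            i = mismatch(rest, node.label)  # Step 2: locate the mismatch
            edge = (rest[i], i)
            child = node.children.get(edge)
            if child is None:               # Step 3: new child with the suffix
                node.children[edge] = Node(rest[i + 1:])
                return True
            node, rest = child, rest[i + 1:]

trie = DynPDT()
for kw in ["technology$", "technics$", "technique$", "technically$"]:
    trie.insert(kw)
```

With these sample keywords, the root keeps the first keyword as its label, and every later keyword contributes exactly one new node holding the suffix that follows its last mismatch.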
Figure 3 illustrates the construction process of DynPDT when inserting the keywords , , , and in this order, where the -th created node is denoted by . The process begins with an empty trie .
In the first insertion , we create the root and associate with , that is, becomes technology$. The resulting for is shown in Figure 3(a).
In the second insertion , we define a string variable initially set to . We try to retrieve in by comparing with , but fail as there is a mismatching character i at position 5 with and . Based on this mismatch result, we search the child of with branching character . However, since there is no such child, we add a new child to with branching character and associate the remaining suffix with . The resulting for is shown in Figure 3(b).
In the third insertion , we initially set the string variable to and then compare with in the same manner as the second insertion. Since and , we descend to child with branching character . After updating , we subsequently compare with to obtain the mismatch character q at position 0 with . We search the child with branching character , but there is no such child; thus, we create the child and set to the remaining suffix . The resulting for is shown in Figure 3(c).
The fourth insertion is also conducted in the same manner. The final trie is shown in Figure 3(d).
It is left to define the operations lookup and delete to make DynPDT a keyword dictionary. Similarly to insert, the operation lookup can be performed by traversing from the root. After matching all the characters of , returns the value associated with the last visited node. It returns on a mismatch.
We provide an example for a successful and an unsuccessful search. Both examples are similar to the construction described in Example 2.
We consider for the in Figure 3(d). We define a string variable initially set to , and compare with to retrieve (a part of) the query from the root. Since and , we descend to child with branching character . Subsequently, we update to the remaining suffix as and descend to child with branching character since and . Finally, we update and compare with . As both match, we return the value stored in .
We consider for the in Figure 3(d). In the same manner as in the above case, we reach node with the prefix technica and subsequently compare and . Since and , we search a child with branching character ; however, there is no such child. As a result, returns .
The operation delete can be implemented by introducing deletion flags for each node (i.e., for each keyword), a trick that is also used in hashing with open addressing [books:knuth1998art, Chapter 6.4, Algorithm L]. In other words, retrieves and sets the deletion flag for the node corresponding to . However, this approach additionally needs one bit for each node. Another approach is to set the value associated with the deleted keyword to as an invalid value. This approach does not need additional space for the deletion flags. Although these approaches do not free up space after deletion, the space is reused for keywords inserted subsequently if the new keywords share sufficiently long prefixes with the deleted ones.
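A minimal Python sketch of lookup and of deletion by invalid value is given below. It assumes a pointer-based node layout, uses Python's `None` as the invalid value, and stores the value directly in the node; these choices, the helper names, and the sample keywords are illustrative assumptions rather than the authors' design.

```python
class Node:
    def __init__(self, label, value):
        self.label, self.value, self.children = label, value, {}

def mismatch(a, b):
    """First differing position of a and b, or None if one is a prefix."""
    return next((j for j, (x, y) in enumerate(zip(a, b)) if x != y), None)

class DynPDT:
    def __init__(self):
        self.root = None

    def insert(self, keyword, value):
        # Assumes well-formed keywords: '$' occurs exactly once, at the end.
        if self.root is None:
            self.root = Node(keyword, value)
            return
        node, rest = self.root, keyword
        while rest != node.label:
            i = mismatch(rest, node.label)
            edge = (rest[i], i)
            if edge not in node.children:
                node.children[edge] = Node(rest[i + 1:], value)
                return
            node, rest = node.children[edge], rest[i + 1:]
        node.value = value  # keyword was present (or deleted): overwrite

    def _find(self, keyword):
        node, rest = self.root, keyword
        while node is not None and rest != node.label:
            i = mismatch(rest, node.label)
            if i is None:
                return None
            node, rest = node.children.get((rest[i], i)), rest[i + 1:]
        return node

    def lookup(self, keyword):
        node = self._find(keyword)
        return None if node is None else node.value

    def delete(self, keyword):
        """Mark the keyword as deleted by overwriting its value."""
        node = self._find(keyword)
        if node is None or node.value is None:
            return False
        node.value = None  # the node itself stays, so descendants remain reachable
        return True

d = DynPDT()
for v, kw in enumerate(["technology$", "technics$", "technique$", "technically$"]):
    d.insert(kw, v)
```

Because the node of a deleted keyword is kept, keywords stored below it stay reachable, and reinserting the same keyword simply reuses the node.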
In practice, a critical problem of DynPDT is that the domain of the edge labels in and the longest length of all node labels are not constant in general. We tackle this problem by limiting the size of . To this end, we introduce a new parameter to forcibly fix the alphabet as in advance. Within this limitation, suppose that we want to create an edge labeled from node with . As this label is not in , we create dummy nodes called step nodes with a special character by repeating the following procedure until becomes less than : add a new child of with branching character and recursively set and . is the empty string if is a step node.
We consider for in Figure 3(d) with . We set and compare with . Since and , we try to create the edge label ; however, as , we instead create a step child with branching character , descend to this child, and set . Since becomes less than , we define a child of the step node with branching character and associate the remaining suffix with . The resulting DynPDT is depicted in Figure 4.
This solution creates additional nodes depending on . When is too small, many step nodes are created and extra node traversals are involved. When is too large, the alphabet size becomes large and the space usage increases significantly. Therefore, it is necessary to determine a suitable . In Section 6, we empirically determine 32 and 64 to be favorable values for .
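The effect of the parameter can be illustrated with the following Python sketch, in which the bound on the integer part of an edge label is called `lam`. The sentinel `STEP`, the edge label `(STEP, 0)` used for step nodes, the class and method names, and the sample keywords are assumptions made for this example only.

```python
STEP = "\x00"  # hypothetical stand-in for the special step character

class Node:
    def __init__(self, label=""):
        self.label = label      # empty for step nodes
        self.children = {}      # (c, i) -> Node, always with i < lam

def mismatch(a, b):
    return next((j for j, (x, y) in enumerate(zip(a, b)) if x != y), None)

class BoundedDynPDT:
    def __init__(self, lam):
        self.lam, self.root, self.num_step_nodes = lam, None, 0

    def _host(self, node, i, create):
        """Walk over step nodes until the remaining offset is below lam."""
        while node is not None and i >= self.lam:
            if create and (STEP, 0) not in node.children:
                node.children[(STEP, 0)] = Node()
                self.num_step_nodes += 1
            node, i = node.children.get((STEP, 0)), i - self.lam
        return node, i

    def insert(self, keyword):
        # Assumes keywords end with a unique '$' and never contain STEP.
        if self.root is None:
            self.root = Node(keyword)
            return True
        node, rest = self.root, keyword
        while rest != node.label:
            i = mismatch(rest, node.label)
            host, off = self._host(node, i, create=True)
            edge = (rest[i], off)
            if edge not in host.children:
                host.children[edge] = Node(rest[i + 1:])
                return True
            node, rest = host.children[edge], rest[i + 1:]
        return False

    def contains(self, keyword):
        node, rest = self.root, keyword
        while node is not None and rest != node.label:
            i = mismatch(rest, node.label)
            if i is None:
                return False
            host, off = self._host(node, i, create=False)
            if host is None:
                return False
            node, rest = host.children.get((rest[i], off)), rest[i + 1:]
        return node is not None

keys = ["technology$", "technics$", "technique$", "technically$"]
t4, t2 = BoundedDynPDT(lam=4), BoundedDynPDT(lam=2)
for kw in keys:
    t4.insert(kw)
    t2.insert(kw)
```

In this toy run, halving `lam` from 4 to 2 doubles the number of step nodes needed for the mismatch at position 5, which mirrors the trade-off described above.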
To use standard trie techniques, we split up into two parts:
a (standard) trie structure for a set of strings to represent with the difference that it assigns a node to a unique id instead of its node label, and
an associative array that maps the ids of the nodes of to their corresponding node labels, called node label map (NLM).
For example, in Figure 4, the trie built on the string set and the NLM stores node labels to be accessed by the respective node ids .
NLM dynamically manages node labels depending on the node ids assigned. As explained in Section 1, we use the m-Bonsai [DBLP:journals/corr/PoyiasPR17] and FK-hash [fischer2017practical] representations for . Moreover, we design the NLM data structures for m-Bonsai and FK-hash individually, which we respectively present in Sections 4 and 5.
To discuss the representation approaches in the next sections, we define to be a dynamic trie with nodes whose edge labels are characters drawn from the alphabet of size . Although the number of nodes depends on , we write for simplicity. supports the following operations:
adds a new child of with branching character and returns its id.
returns the id of the child of with branching character if exists, or returns otherwise.
We briefly review some common trie representations and point out their suitability for . The simplest representation is a list trie [askitis2007efficient, Chapter 2.3.2], which transforms an arbitrary trie to its first-child next-sibling representation. In this representation, each node of the list trie stores its branching character, a pointer to its first child, and a pointer to its next sibling. The list trie represents in bits and supports addchild and getchild in time; however, the operation time becomes problematic if is large. Another representation is a ternary search trie (TST) [bentley1997fast] that reduces the time complexity of the list trie to ; however, the space usage grows to bits. A well-known time- and space-efficient representation is the double array [aoe1989efficient]. Its space usage is bits in the best case, while supporting getchild in time; however, a double array for a large alphabet tends to be empirically sparse. Actually, we are only aware of dynamic double-array implementations handling byte characters (e.g., [yoshinaga2014self, kanda2017rearrangement]). Judy [manual:judy10min] and ART (adaptive radix tree) [leis2013adaptive] are trie representations that dynamically choose suitable data structures for the trie topology; however, both are also designed for byte characters. As each trie node is associated with an id, compact tries like the z-fast trie [belazzougui2010dynamic] representing only nodes explicitly become inefficient with this requirement.
Compared to these trie representations, m-Bonsai and FK-hash have better complexities. m-Bonsai can represent in bits of expected space for a constant , while supporting getchild and addchild in expected time [DBLP:journals/corr/PoyiasPR17]. Compared to that, FK-hash needs additional bits of expected space, but supports practically faster insertions.
A straightforward solution to provide the NLM for m-Bonsai and FK-hash is to store the node labels as satellite data in the respective hash table. However, by doing so, we would waste space for each unoccupied entry in the hash table. In the following, we present efficient solutions for the NLM tailored to m-Bonsai and FK-hash.
This section presents our approach based on m-Bonsai [DBLP:journals/corr/PoyiasPR17]. m-Bonsai represents trie nodes as entries in a closed hash table that, informally speaking, compactifies the stored keys with compact hashing [books:knuth1998art].
We present a plain and a compact form of the representation based on m-Bonsai. We refer to the former as PBT (Plain m-Bonsai Trie), which is a non-compact variant of m-Bonsai. PBT can be useful for fast implementations, although it has not been considered in any applications yet. We refer to the latter as CBT (Compact m-Bonsai Trie) as it uses the original m-Bonsai implementation. We describe PBT and CBT in Sections 4.1 and 4.2, respectively. In both variants, we maintain a hash table of size with the load factor to store nodes. In Section 4.3, we propose a linear-time growing algorithm based on Arroyuelo’s approach [arroyuelo2017lz78]. Finally, in Section 4.4, we propose NLM data structures designed for PBT and CBT.
PBT uses a hash function . Trie nodes are elements in the hash table. As their locations in the hash table are fixed unless the hash table is rebuilt, we use these locations as node ids. In other words, the id of a node located at is . is performed as follows. We first compose the hash key and then compute its initial address . (This paper defines as .) Let be the first vacant address from determined by linear probing. We create the new child by . That is, the id of the new child becomes . getchild can also be computed in the same manner. If is fully independent and uniformly random, the operations can be performed in expected time. PBT uses bits of space.
The table size is a power of two in order to quickly compute the modulo operation of by using the bitwise AND operation (see http://blog.teamleadnet.com/2012/07/faster-division-and-modulo-operation.html). We set the maximum load factor to . If reaches during an update, we double the size of the hash table by the growing algorithm described in Section 4.3. We set the initial capacity of the hash table to . Our hash function is an XorShift hash function (http://xorshift.di.unimi.it/splitmix64.c) derived from [steele2014fast].
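The following Python sketch illustrates the PBT idea: nodes are slots of an open-addressing table, a slot index serves as the node id, and the modulo is computed with a bitwise AND. The key composition `u * sigma + c`, the root sentinel, the default load factor, and the class name are assumptions for illustration; the mixing function follows the publicly available SplitMix64 finalizer, and growing is deliberately left out.

```python
MASK64 = (1 << 64) - 1

def splitmix64(x):
    """SplitMix64 mixing step, restricted to 64-bit arithmetic."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

class PlainBonsaiTrie:
    def __init__(self, capacity_bits=4, sigma=256, max_load=0.9):
        self.m = 1 << capacity_bits        # table size is a power of two
        self.sigma = sigma                 # alphabet size (assumed parameter)
        self.max_load = max_load           # assumed value, not from the paper
        self.table = [None] * self.m       # slot -> composed key
        self.root = 0
        self.table[self.root] = -1         # sentinel key marking the root
        self.size = 1

    def _key(self, u, c):
        return u * self.sigma + c          # assumed composition of (u, c)

    def _probe(self, key):
        """Linear probing from the initial address; stops at key or a vacancy."""
        i = splitmix64(key) & (self.m - 1)  # modulo via bitwise AND
        while self.table[i] is not None and self.table[i] != key:
            i = (i + 1) & (self.m - 1)
        return i

    def getchild(self, u, c):
        i = self._probe(self._key(u, c))
        return i if self.table[i] is not None else None

    def addchild(self, u, c):
        if self.size + 1 > self.max_load * self.m:
            raise OverflowError("load factor exceeded; growing is omitted here")
        key = self._key(u, c)
        i = self._probe(key)
        if self.table[i] is None:
            self.table[i] = key
            self.size += 1
        return i                            # the slot index is the node id

t = PlainBonsaiTrie(capacity_bits=6)
a = t.addchild(t.root, ord("a"))
b = t.addchild(a, ord("b"))
c = t.addchild(t.root, ord("b"))
```

Since a child's key contains its parent's slot index, moving any node invalidates its descendants' keys, which is why the table can only be rebuilt as a whole.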
CBT reduces the space usage of PBT with the compact hashing technique [books:knuth1998art]. Locating nodes on a compact hash table is identical to PBT with the difference that CBT uses a bijective transform that maps a key to its hash value and its quotient . Instead of , the compact hash table stores only its quotient in . The hash value can be restored from the initial address and the quotient , where is the first empty slot at or after the initial address . The original key can also be restored from the hash value since is bijective. Therefore, addchild and getchild can be performed in the same manner as PBT if the corresponding initial address can be identified from the location .
The remaining problem is how to identify the corresponding initial address from . Poyias et al. [DBLP:journals/corr/PoyiasPR17] solved this problem by introducing a displacement array such that keeps the number of probes from to , that is, . Given a location , one can compute the corresponding initial address with . Although a value in is at most , the average value becomes small if is fully independent and uniformly random and the load factor is small. Poyias et al. [DBLP:journals/corr/PoyiasPR17] demonstrated that can be represented in bits using CDRW (Compact Dynamic ReWritable) arrays. As takes bits for the quotients, CBT can represent in expected bits of space.
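To make the interplay of quotients and displacements concrete, here is a small Python sketch of a compact hash table that stores only the quotient and the number of probes per slot and can still restore the original key. The multiplicative bijection with an odd multiplier is a simple stand-in chosen for this example, the displacement array is a plain list rather than a compressed structure, and all names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
class CompactTable:
    def __init__(self, table_bits, key_bits, p=0x9E3779B1):
        self.bits, self.m = table_bits, 1 << table_bits
        self.mask = (1 << key_bits) - 1
        self.p = p | 1                             # odd, hence invertible mod 2^w
        self.p_inv = pow(self.p, -1, 1 << key_bits)
        self.Q = [None] * self.m                   # quotients only
        self.D = [0] * self.m                      # displacements (probe counts)

    def _split(self, key):
        h = (key * self.p) & self.mask             # bijective hash value
        return h & (self.m - 1), h >> self.bits    # (initial address, quotient)

    def insert(self, key):
        # Assumes the table never becomes full; growing is omitted.
        init, quo = self._split(key)
        i, d = init, 0
        while self.Q[i] is not None:
            if self.Q[i] == quo and self.D[i] == d:
                return i                           # key already stored
            i, d = (i + 1) & (self.m - 1), d + 1
        self.Q[i], self.D[i] = quo, d
        return i

    def find(self, key):
        init, quo = self._split(key)
        i, d = init, 0
        while self.Q[i] is not None:
            if self.Q[i] == quo and self.D[i] == d:
                return i
            i, d = (i + 1) & (self.m - 1), d + 1
        return None

    def restore(self, i):
        """Recover the original key stored at location i."""
        init = (i - self.D[i]) & (self.m - 1)      # undo the probing
        h = (self.Q[i] << self.bits) | init        # rebuild the hash value
        return (h * self.p_inv) & self.mask        # invert the bijection

table = CompactTable(table_bits=5, key_bits=16)
keys = [3, 1000, 65535, 42, 777]
locs = {k: table.insert(k) for k in keys}
```

A slot is matched on both its quotient and its displacement, because two keys with different initial addresses may share the same quotient.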
The representation of with the CDRW array seems impractical. Poyias et al. [DBLP:journals/corr/PoyiasPR17] gave an alternative practical representation, where is represented by three data structures , and as follows.
is a simple array of length in which each element uses bits for a constant .
is a compact hash table (CHT) described by Cleary [cleary1984compact], which stores keys from and values from for a constant . The keys are stored in a closed hash table of length through the compact hashing technique [books:knuth1998art], where is a power of two (a property that is in common with ). In detail, the hash table consists of
a bijective transform ,
an integer array of length to store the quotients of the keys (i.e., entry indices of ) representable in bits,
an integer array of length to store displacement values of representable in bits, and
two bit arrays each of length storing the displacement values of the quotients in (not to be confused with the displacement values stored in ).
On inserting a key , we store its quotient in the first vacant slot in starting at the initial address . The collisions in are therefore resolved with linear probing. However, this collision resolution poses the same problem as in CBT, as additional displacement information is required to restore the initial address of a stored quotient in . Cleary solves this problem by using two bit arrays (see [cleary1984compact] for how they work). Finally, stores the value associated with the key whose quotient is stored in . Since uses bits of space, uses bits of space in total.
is a standard associative array that maps keys from to values from . In our implementation, is a closed hash table with linear probing. Given is the capacity of , takes bits.
The representation of the entry for an integer depends on its actual value:
If , then we store in the bits of .
If , we represent by the key-value pair stored in .
Finally, if , we represent by the key-value pair stored in .
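The three cases above can be sketched as a layered store. This is an illustration under stated assumptions: the widths `kBits1` and `kBits2` and the exact thresholds are hypothetical, and `std::unordered_map` stands in for both the compact hash table and the standard associative array. It is not the representation used in the experiments.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Three-layer store for displacement values, most of which are small.
class LayeredDisplacement {
    static constexpr unsigned kBits1 = 4;                         // width of a first-layer cell (assumed)
    static constexpr unsigned kBits2 = 7;                         // width of a second-layer value (assumed)
    static constexpr std::uint64_t kEscape = (1u << kBits1) - 1;  // marks "look in a later layer"
    static constexpr std::uint64_t kLimit2 = kEscape + (1u << kBits2) - 1;

    std::vector<std::uint8_t> layer1_;                            // one small cell per slot
    std::unordered_map<std::size_t, std::uint8_t> layer2_;        // stand-in for the compact hash table
    std::unordered_map<std::size_t, std::uint64_t> layer3_;       // stand-in for the standard map

public:
    explicit LayeredDisplacement(std::size_t m) : layer1_(m, 0) {}

    void set(std::size_t i, std::uint64_t v) {
        assert(i < layer1_.size());
        layer2_.erase(i);
        layer3_.erase(i);
        if (v < kEscape) {                                        // small value: first layer only
            layer1_[i] = static_cast<std::uint8_t>(v);
            return;
        }
        layer1_[i] = static_cast<std::uint8_t>(kEscape);
        if (v < kLimit2) layer2_[i] = static_cast<std::uint8_t>(v - kEscape);
        else             layer3_[i] = v;                          // rare large value
    }

    std::uint64_t get(std::size_t i) const {
        assert(i < layer1_.size());
        if (layer1_[i] < kEscape) return layer1_[i];
        if (auto it = layer2_.find(i); it != layer2_.end()) return kEscape + it->second;
        return layer3_.at(i);
    }
};
```

The design relies on the observation stated earlier that the average displacement is small, so nearly all lookups finish in the first layer.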
In the experiments, we set and . We set the initial capacities of and to and , respectively. We set the maximum load factor of and to 0.9. If the actual load factor of (resp. ) reaches the maximum load factor 0.9, we double the size of (resp. of ).
Since we assume that , , and are powers of two, the bijective transform is for some . We design this function as the concatenation of two bijective functions , where for an integer larger than and for a large prime smaller than . is based on the XorShift random number generators [marsaglia2003xorshift], where the inverse function is given by . The inverse function of is given by , where is the multiplicative inverse of such that (see [koeppl2019separate, Sect. 2.1] for details). By construction, the inverse function of is . Our hash function is inspired by the SplitMix algorithm [steele2014fast].
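A bijection of this kind can be sketched as follows. The shift amount and the default multiplier are illustrative assumptions: the sketch accepts any odd multiplier, which suffices for invertibility modulo a power of two, whereas the text above uses a large prime. The name `BijectiveHash` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bijection on w-bit integers built from an xorshift step and an odd multiplier.
struct BijectiveHash {
    unsigned w;             // universe is [0, 2^w), 1 <= w <= 64
    unsigned shift;         // 2 * shift > w, so the xorshift step is its own inverse
    std::uint64_t mul;      // odd multiplier
    std::uint64_t mul_inv;  // inverse of mul modulo 2^64, hence also modulo 2^w

    explicit BijectiveHash(unsigned width, std::uint64_t multiplier = 0x9E3779B97F4A7C15ULL)
        : w(width), shift(width / 2 + 1), mul(multiplier | 1), mul_inv(mul) {
        assert(w >= 1 && w <= 64);
        // Newton iteration: every round doubles the number of correct low bits.
        for (int i = 0; i < 6; ++i) mul_inv *= 2 - mul * mul_inv;
    }

    std::uint64_t mask() const { return w == 64 ? ~0ULL : (1ULL << w) - 1; }

    std::uint64_t xorshift(std::uint64_t x) const { return (x ^ (x >> shift)) & mask(); }

    // Forward direction: xorshift first, then multiply modulo 2^w.
    std::uint64_t hash(std::uint64_t x) const { return (xorshift(x & mask()) * mul) & mask(); }

    // Inverse direction: undo the multiplication, then undo the xorshift.
    std::uint64_t inverse(std::uint64_t y) const { return xorshift((y * mul_inv) & mask()); }
};
```

Because both steps are bijections on the same universe, their composition is one as well, which is what allows a hash key to be reconstructed from an initial address and a quotient.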
If the load factor of the hash table of length reaches the maximum load factor , we create a new hash table (and a new displacement array for CBT) of length and relocate all nodes to . Since a node depends on the position of its parent in , we can relocate a node only after having relocated all its ancestors. This can be done in a top-down traversal (e.g., in BFS or DFS order) of the tree during which all children of a node are successively selected. However, because selecting all children of a node is performed by checking getchild for all possible characters in , the relocation based on a top-down traversal needs expected time and is therefore practical only for tiny alphabets. Here we describe a bottom-up approach that is based on the approach by Arroyuelo et al. [arroyuelo2017lz78]. This approach, called the growing algorithm, runs in expected time. Its pseudocode is shown in Algorithm 1.
Given a trie with a hash table of length , the algorithm constructs an equivalent trie with a hash table of length . To explain the algorithm, we define two operations returning the branching character of node and returning the parent id of node . They can be computed in constant time because explicitly stores the branching character and the parent id as the hash key in PBT. CBT can also restore the hash key from and .
In the growing algorithm, we initially define two auxiliary arrays Map and Done: Map is an integer array and Done is a bit array, each of length . We store in a 1 after relocating the node stored in . We keep the invariant that whenever , then stores the position in of the node stored in . All bits in Done are initialized to 0 except for the root. We scan from left to right and perform the following steps for each non-vacant slot . We first set to and to an empty string, and then climb up the path from the node to the root. We stop early when encountering a node with . In this case, all ancestors of have already been relocated, so there is no need to visit them again. Subsequently, we walk down the computed path while relocating the visited nodes. Since we do not reprocess already visited nodes, we can perform the node relocation in expected time with for a constant load factor .
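The bottom-up relocation can be sketched on a simplified trie whose node ids are slot positions. This is not Algorithm 1 itself: the types `Slot` and `PosTrie`, the hash function, and the fixed root slot 0 are assumptions made for illustration, and the climbed path is kept as a list of old slots rather than as a string of branching characters.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Slot { bool used = false; std::uint32_t parent = 0; char symb = 0; };

struct PosTrie {
    std::vector<Slot> table;  // node id == slot position; the root lives in slot 0
    std::size_t size = 1;     // number of nodes including the root

    explicit PosTrie(std::size_t m) : table(m) { assert(m >= 2); table[0].used = true; }

    static std::size_t hash(std::size_t parent, char c, std::size_t m) {
        const std::uint64_t key =
            (static_cast<std::uint64_t>(parent) << 8) | static_cast<unsigned char>(c);
        return static_cast<std::size_t>((key * 0x9E3779B97F4A7C15ULL) >> 32) % m;
    }

    // Child of `parent` with symbol c; created when `create` is set.
    // Returns table.size() if the child is absent and was not created.
    std::size_t child(std::size_t parent, char c, bool create) {
        const std::size_t m = table.size();
        assert(size < m);  // at least one vacant slot, so probing terminates
        for (std::size_t i = hash(parent, c, m);; i = (i + 1) % m) {
            Slot& s = table[i];
            if (!s.used) {
                if (!create) return m;
                s = Slot{true, static_cast<std::uint32_t>(parent), c};
                ++size;
                return i;
            }
            if (i != 0 && s.parent == parent && s.symb == c) return i;
        }
    }

    void insert(const std::string& key) {
        std::size_t v = 0;
        for (char c : key) v = child(v, c, true);
    }

    bool has_path(const std::string& key) {
        std::size_t v = 0;
        for (char c : key) {
            v = child(v, c, false);
            if (v == table.size()) return false;
        }
        return true;
    }
};

// Build an equivalent trie on a table of twice the length, bottom-up.
PosTrie grow(const PosTrie& old) {
    const std::size_t m = old.table.size();
    PosTrie bigger(2 * m);
    std::vector<std::size_t> map(m, 0);  // map[i]: new position of the node in old slot i
    std::vector<bool> done(m, false);    // done[i]: node in old slot i already relocated
    done[0] = true;                      // the root stays in slot 0
    std::vector<std::size_t> path;       // old slots visited while climbing
    for (std::size_t i = 0; i < m; ++i) {
        if (!old.table[i].used) continue;
        path.clear();
        std::size_t v = i;
        while (!done[v]) {               // climb until a relocated ancestor is found
            path.push_back(v);
            v = old.table[v].parent;
        }
        std::size_t new_parent = map[v];
        for (auto it = path.rbegin(); it != path.rend(); ++it) {  // walk back down
            new_parent = bigger.child(new_parent, old.table[*it].symb, true);
            map[*it] = new_parent;
            done[*it] = true;
        }
    }
    return bigger;
}
```

Each node is appended to a path at most once, because it is marked as done as soon as it has been relocated; this is the property behind the expected running time stated above.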
Algorithm 1 maintains the auxiliary arrays Map of bits, Done of bits and of bits, where is the height of . Thus, the extra working space is bits if we create the auxiliary arrays naively. However, the working space of Map can be shared with because for is no longer needed. In PBT, the working space of Map can be fully placed in because the space of is bits and is at least in practice (even for , a simple bit array suffices). Based on this in-place approach, the extra working space of Algorithm 1 is only bits, accounting for Done and in PBT. In practice, the space of is negligible because is bounded by the maximum length of keywords in and .
In CBT, uses only bits. As in most scenarios, it is difficult to completely store Map in ; however, we can also use the space of , which is bits. If , Map can be fully placed in and ; otherwise, the extra working space of bits for Map is needed in addition to that of Done and .
In m-Bonsai, the node ids are values drawn from the universe whose randomness depends on the hash function used. As the task of an NLM data structure is to map node ids to their respective node labels, an appropriate NLM data structure for m-Bonsai is a dynamic associative array that stores node label strings for arbitrary integer keys . In what follows, we first present a plain approach and then show how to compactify it.
The simplest approach is to use a pointer array of length such that stores the pointer to or if no node with id exists. We refer to this approach as PLM (Plain Label Map). Figure (a) shows an example of PLM. Given a node of id , PLM can obtain through in time. However, takes bits, where the word size is . This space consumption is obviously large.
We present an alternative compact approach that reduces the pointer overhead of PLM in a manner similar to Google’s sparse hash table [software:sparsehash]. In this approach, we divide the node labels into groups of labels over the ids. That is, the first group consists of , the second group consists of , and so on. Moreover, we introduce a bitmap such that iff exists. We concatenate all node labels with of the same group together, sorted in the id order. The length of becomes by maintaining, for each group, a pointer to its concatenated label string. We refer to the approach as SLM (Sparse Label Map).
With the array and the bitmap , we can access as follows: If , we are done since does not exist in this case; otherwise, we obtain the concatenated label string storing from , where . Given for the bit chunk , is the -th node label of the concatenated label string. As , counting the occurrences of 1s in chunk is supported in constant time using the popcount operation [gonzalez2005practical]. It remains to explain how to search in the respective concatenated label string. For that, we present two representations of the concatenated label strings:
If the node labels are straightforwardly concatenated (e.g., the second group in Figure (a) is cal$ue$ in ), we can sequentially count the $ delimiters to find the -th delimiter marking the ending of the -th stored string, after which starts. We can therefore extract in time, where again denotes the maximum length of all node labels.
We can shorten the scan time with the skipping technique used in array hashing [askitis2005cache]. This technique puts the length of each node label in front of it via some prefix encoding such as VByte [williams1999compressing]. Note that we can omit the terminators of each node label. The skipping technique allows us to jump ahead to the start of the next node label; therefore, the scan is supported in time. Figure (b) shows an example of SLM with the skipping technique.
Regarding the space usage of SLM, and use and bits, respectively. For , the total space usage becomes bits, which is smaller than bits in PLM; however, the access time is .
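The access path of SLM with the skipping technique can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: groups of 64 labels with one 64-bit bitmap chunk per group, a single length byte instead of VByte (so labels must be shorter than 256 bytes), and `std::string` objects in place of raw pointers. The name `SparseLabelMap` is hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

class SparseLabelMap {
    static constexpr std::size_t kGroup = 64;  // labels per group (illustrative choice)
    std::vector<std::uint64_t> bits_;          // bit j of bits_[g] is set iff id g*64+j has a label
    std::vector<std::string> groups_;          // per group: [len][bytes][len][bytes]...

    // Byte offset of the r-th length-prefixed label, found by skipping r labels.
    static std::size_t offset_of(const std::string& s, std::size_t r) {
        std::size_t pos = 0;
        while (r-- > 0) pos += 1 + static_cast<unsigned char>(s[pos]);
        return pos;
    }

    // Number of set bits strictly below position j (a popcount on a masked chunk).
    static std::size_t rank(std::uint64_t chunk, std::size_t j) {
        const std::uint64_t low = j == 0 ? 0 : chunk & (~0ULL >> (64 - j));
        return std::bitset<64>(low).count();
    }

public:
    explicit SparseLabelMap(std::size_t m)
        : bits_((m + kGroup - 1) / kGroup, 0), groups_(bits_.size()) {}

    void insert(std::size_t id, const std::string& label) {
        assert(label.size() < 256);  // one-byte length prefix in this sketch
        const std::size_t g = id / kGroup, j = id % kGroup;
        assert(g < bits_.size() && ((bits_[g] >> j) & 1) == 0);
        std::string entry(1, static_cast<char>(label.size()));
        entry += label;
        groups_[g].insert(offset_of(groups_[g], rank(bits_[g], j)), entry);
        bits_[g] |= 1ULL << j;
    }

    std::optional<std::string> get(std::size_t id) const {
        const std::size_t g = id / kGroup, j = id % kGroup;
        if (g >= bits_.size() || ((bits_[g] >> j) & 1) == 0) return std::nullopt;
        const std::string& s = groups_[g];
        const std::size_t pos = offset_of(s, rank(bits_[g], j));
        return s.substr(pos + 1, static_cast<unsigned char>(s[pos]));
    }
};
```

A lookup touches one bitmap chunk and one group string, and the number of skipped labels is bounded by the group size, which mirrors the trade-off between pointer overhead and access time discussed above.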
This section presents our DynPDT representation approaches based on FK-hash [fischer2017practical]. The basic idea of FK-hash is the same as that of m-Bonsai. The difference is that FK-hash incrementally assigns node ids and explicitly stores them as values in the hash table, while m-Bonsai uses the locations of the stored elements of the hash table as node ids. Although FK-hash uses more space than m-Bonsai, the assignment of node ids simplifies the growing algorithm.
Like m-Bonsai, FK-hash locates nodes on a closed hash table of length , but does not use the addresses of as node ids. FK-hash incrementally assigns node ids from zero and explicitly stores them in an integer array of length . In other words, when creating the -th node by storing it in , its node id is , which is stored in . In a way similar to m-Bonsai, is performed as follows: We compose the key , hash it with , and then search the first vacant slot from by linear probing. Given is the currently largest node id, we assign the id to the new child, and set and . The displacement information is maintained analogously to m-Bonsai.
In the same manner as m-Bonsai, we can think of two representations: whether is compactified or not. The non-compact one is referred to as PFKT (Plain FK-hash Trie). The compact one is referred to as CFKT (Compact FK-hash Trie). Compared to PBT and CBT, PFKT and CFKT keep an additional integer array and require additional bits of space.
An advantage of FK-hash is that growing the hash table is done in the same manner as in standard closed hash tables. In detail, can be enlarged by scanning the nodes on from left to right and relocating them into a new hash table of length . The growing algorithm takes expected time. This time complexity is identical to that of Algorithm 1; however, the growing algorithm of FK-hash is faster in practice because of its simplicity. In addition, no auxiliary data structures such as Map and Done, which Algorithm 1 uses, are needed.
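The following minimal sketch illustrates why growing is simple when node ids are stored as values. It is an assumption-laden illustration rather than PFKT or CFKT: keys are stored in full instead of as quotients, the root is implicit with id 0, and the names `IdSlot`, `IdTrie`, `add_child`, and `get_child` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kVacant = UINT32_MAX;  // id value marking a vacant slot

struct IdSlot { std::uint32_t parent = 0; char symb = 0; std::uint32_t id = kVacant; };

class IdTrie {
    std::vector<IdSlot> table_;
    std::uint32_t next_id_ = 1;  // id 0 is reserved for the implicit root

    static std::size_t hash(std::uint32_t parent, char c, std::size_t m) {
        const std::uint64_t key =
            (static_cast<std::uint64_t>(parent) << 8) | static_cast<unsigned char>(c);
        return static_cast<std::size_t>((key * 0x9E3779B97F4A7C15ULL) >> 32) % m;
    }

    // Linear probing for key (parent, c): returns the matching or the first vacant slot.
    static std::size_t probe(const std::vector<IdSlot>& t, std::uint32_t parent, char c) {
        std::size_t i = hash(parent, c, t.size());
        while (t[i].id != kVacant && !(t[i].parent == parent && t[i].symb == c))
            i = (i + 1) % t.size();
        return i;
    }

    // Keys contain parent ids, not slot positions, so a left-to-right scan suffices.
    void grow() {
        std::vector<IdSlot> bigger(2 * table_.size());
        for (const IdSlot& s : table_)
            if (s.id != kVacant) bigger[probe(bigger, s.parent, s.symb)] = s;
        table_.swap(bigger);
    }

public:
    explicit IdTrie(std::size_t m = 16) : table_(m) { assert(m >= 2); }

    std::size_t capacity() const { return table_.size(); }

    // Returns the id of the child, creating it with the next free id when absent.
    std::uint32_t add_child(std::uint32_t parent, char c) {
        std::size_t i = probe(table_, parent, c);
        if (table_[i].id != kVacant) return table_[i].id;
        if (static_cast<std::size_t>(next_id_) * 10 > table_.size() * 9) {  // load factor 0.9
            grow();
            i = probe(table_, parent, c);
        }
        table_[i] = IdSlot{parent, c, next_id_};
        return next_id_++;
    }

    // Returns the id of the child, or kVacant when it does not exist.
    std::uint32_t get_child(std::uint32_t parent, char c) const {
        return table_[probe(table_, parent, c)].id;
    }
};
```

Since the ids survive relocation unchanged, any structure indexed by node id, such as a label map, does not need to be touched when the table grows.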
Like in Section 4.4, we introduce PLM and SLM adapted to FK-hash. Figure 6 shows an example for each of them. Although PLM in FK-hash is basically identical to that in m-Bonsai, SLM can be simplified as follows.
In m-Bonsai, it is necessary to identify whether exists and the rank of in the group because node ids are randomly assigned; therefore, we introduced a bitmap of length and utilized the popcount operation. In FK-hash, however, such a bitmap is not needed because node ids are incrementally assigned. Put simply, a node label is stored in the group of id and located at the -th position in the group. When using the skipping technique, care has to be taken for the step nodes whose node labels are empty. For each of them, we put the length 0 in its corresponding concatenated label string. For example, we put a ’0’ in the second concatenated label string for the step node in Figure (b). Finally, we can insert a new node label by appending it to the last concatenated label string.
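The simplification for incrementally assigned ids can be sketched as an append-only label map. The sketch assumes a one-byte length prefix instead of VByte and a configurable group size; the name `AppendOnlyLabelMap` and its parameters are hypothetical. Empty labels, as used for step nodes, are stored as a single length byte 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

class AppendOnlyLabelMap {
    std::size_t group_;                // labels per group (hypothetical parameter)
    std::size_t count_ = 0;            // number of labels stored so far
    std::vector<std::string> groups_;  // per group: [len][bytes][len][bytes]...

public:
    explicit AppendOnlyLabelMap(std::size_t group = 16) : group_(group) { assert(group > 0); }

    // Stores the label of the next node id and returns that id.
    std::size_t append(const std::string& label) {
        assert(label.size() < 256);    // one-byte length prefix in this sketch
        if (count_ % group_ == 0) groups_.emplace_back();
        groups_.back() += static_cast<char>(label.size());
        groups_.back() += label;
        return count_++;
    }

    // The group and the in-group position follow directly from the id; no bitmap is needed.
    std::string get(std::size_t id) const {
        assert(id < count_);
        const std::string& s = groups_[id / group_];
        std::size_t pos = 0;
        for (std::size_t r = id % group_; r > 0; --r)
            pos += 1 + static_cast<unsigned char>(s[pos]);
        return s.substr(pos + 1, static_cast<unsigned char>(s[pos]));
    }
};
```

Because new ids are always the largest ones, insertion only ever extends the last group string, in contrast to the m-Bonsai variant where a label may have to be inserted in the middle of a group.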
In this section we evaluate the practical performance of DynPDT.
We conducted all experiments on one core of a quad-core Intel Xeon CPU E5-2680 v2 clocked at 2.80 GHz in a machine with 256 GB of RAM, running the 64-bit version of CentOS 6.10 based on Linux 2.6. We implemented our data structures in C++17. We compiled the source code with g++ (version 7.3.0) in optimization mode -O3. We used 4-byte integers for the values associated with the keywords.
Our benchmarks are based on the following eight real-world datasets:
GeoNames consists of the geographic names in the asciiname column of the GeoNames dump (http://download.geonames.org/export/dump/).
AOL consists of the queries in the AOL 2006 query log (http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/).
Wiki consists of all page titles from the English Wikipedia as of September 2018 (https://dumps.wikimedia.org/enwiki/).
DNA consists of all 12-mers (i.e., substrings of length 12) found in the DNA dataset from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl/texts/dna/).
LUBMS consists of URIs extracted from the RDF dataset generated by the Lehigh University Benchmark [guo2005lubm] for 1,600 universities. The dataset is distributed under the name ‘DS5’ at https://exascale.info/projects/web-of-data-uri/.
LUBML consists of URIs extracted from the RDF dataset generated by the Lehigh University Benchmark [guo2005lubm] for 7,000 universities. Although this dataset is not distributed, one can obtain the identical dataset through the LUBM data generator (called UBA) at http://swat.cse.lehigh.edu/projects/lubm/.
UK consists of URLs obtained from a 2005 crawl of the .uk domain performed by UbiCrawler [boldi2004ubicrawler] (http://law.di.unimi.it/webdata/uk-2005/).
WebBase consists of URLs of a 2001 crawl performed by the WebBase crawler [hirai2000webbase] (http://law.di.unimi.it/webdata/webbase-2001/).
|Dataset||Size||Keys||MinLen||MaxLen||AveLen||Alphabet size|
|GeoNames||109 MiB||7.3 M||2||152||15.7||99|
|AOL||224 MiB||10.2 M||2||523||23.2||85|
|Wiki||286 MiB||14.1 M||2||252||21.2||200|
|DNA||189 MiB||15.3 M||13||13||13.0||16|
|LUBMS||3.1 GiB||52.6 M||10||80||63.7||57|
|LUBML||13.8 GiB||230.1 M||10||80||64.2||57|
|UK||2.7 GiB||39.5 M||17||2,030||72.4||103|
|WebBase||6.6 GiB||118.2 M||10||10,212||60.2||223|
Table 1 summarizes relevant statistics for each dataset.
We evaluate the average height of the DynPDT built on our datasets. The average height of is the arithmetic mean of the heights of all nodes, omitting step nodes in the calculation. Although the average height is an important measure related to the average number of random accesses, we cannot predict the average height of DynPDT a priori because this number depends on the insertion order of the keywords. To reason about the quality of the average height, we study it in relation to the following known lower and upper bounds: The lower bound is the average height of the path-decomposed trie created by the centroid path decomposition [daigle2016optimal, Corollary 3]. The upper bound is the average height of the path-decomposed trie created by always choosing the child whose subtrie has the fewest leaves.
Table 2 shows the experimental results of the average heights of and for all the datasets. To analyze the performance of DynPDT in our experiments, we constructed DynPDT dictionaries by inserting keywords in random order. For that, we shuffled the dataset with the Fisher–Yates shuffle algorithm [durstenfeld1964algorithm]. Naturally, the actual average heights of are between their lower and upper bounds, and those of are the same as AveLen. The upper bounds are more than twice as large as the lower bounds for AOL, UK, and WebBase; however, the upper bounds were up to 5.4x smaller than the average heights of due to the path decomposition, especially for long keywords such as URIs. Therefore, the incremental path decomposition makes dynamic keyword dictionaries cache-friendly, especially for long keywords even if the insertion order is inconvenient and the average height is close to the upper bound.
The parameter influences the number of step nodes. We analyze the space and time performance of DynPDT when varying the parameter . In this experiment, we constructed DynPDT dictionaries for each parameter on the datasets Wiki, LUBMS and UK, and observed the working space and the construction time. For the DynPDT representation, we tested the combination of CFKT and SLM with , referred to as PDT-CFK in the following. As described in Section 6.2, the dictionary was constructed by inserting keywords in random order. The working space was measured by checking the maximum resident set size (RSS) required during the online construction.
Table 3 shows the experimental results for construction. Since has a direct impact on , which influences the space usage of , the working space depends on the value of . Although this dependency suggests that and the space taken are directly correlated, for Wiki and UK, the working spaces for (i.e., 0.36 GiB and 1.22 GiB respectively) were not the smallest. For Wiki, the reason is that many step nodes raised the load factor and caused an additional enlargement of the hash table. Specifically, the enlargements were conducted nine times with , although they were conducted eight times with . For UK, the reason is that the high load factor caused by a huge number of step nodes raised the average displacement value stored in and involved the use of and , although no additional enlargement was conducted. Regarding the time performance, this huge number of step nodes slowed down the construction. Therefore, a parameter that is too small can lead to large space requirements and long construction times. On the other hand, when , the working space and construction time do not significantly vary.
From this observation, we derive two facts for : On the one hand, the most important recommendation is not to choose a parameter that is too small. On the other hand, choosing a large parameter is not a significant problem because the space and time performances do not significantly decrease as grows. For example, when on Wiki, the proportion of step nodes is 0.12%; however, even with a larger parameter such as 512 or 1024, the working space and construction time are almost the same. Table 4 shows Steps for each parameter and the average length of the node labels (denoted by AveNLL) for all the datasets. Even for long keywords like URLs (i.e., UK), AveNLL is bounded by 18.0 and Steps is within 1% of all nodes when . Consequently, we suggest setting to 32 or 64 for keywords whose length is not much longer than that of the URL datasets.
We compared the performances of our DynPDT representations, for which we benchmarked the following six combinations:
PDT-PB is the combination of PBT and PLM,
PDT-SB is the combination of PBT and SLM,
PDT-CB is the combination of CBT and SLM,
PDT-PFK is the combination of PFKT and PLM,
PDT-SFK is the combination of PFKT and SLM, and
PDT-CFK is the combination of CFKT and SLM.
We evaluated the working space during the construction and the running times of insert and lookup. Like in Section 6.3, we constructed each dictionary and measured its working space. To measure the lookup time, we chose 1 million random keywords from each dataset. The running times are the average of 10 runs. For SLM, we tested