Space-Efficient Huffman Codes Revisited

08/12/2021 ∙ by Szymon Grabowski, et al.

Canonical Huffman code is an optimal prefix-free compression code whose codewords, enumerated in lexicographical order, form a list of binary words of non-decreasing lengths. Gagie et al. (2015) gave a representation of this coding capable of encoding or decoding a symbol in constant worst-case time. It uses σ lg ℓ_max + o(σ) + O(ℓ_max^2) bits of space, where σ and ℓ_max are the alphabet size and maximum codeword length, respectively. We refine their representation to reduce the space complexity to σ lg ℓ_max (1 + o(1)) bits while preserving the constant encode and decode times. Our algorithmic idea can be applied to any canonical code.


1 Introduction

Huffman coding [16, 23], whose compressed output closely approaches the zeroth-order empirical entropy, is nowadays perceived as a standard textbook encoder and remains the most prevalent option for statistical compression. Unlike arithmetic coding [29], the code it produces is instantaneous, meaning that a character can be decoded as soon as the last bit of its codeword is read from a compressed input stream. Thanks to this property, instantaneous codes tend to be swiftly decodable. While most research efforts focus on the achievable decompression speeds, little attention has been paid to actually representing the dictionary of codewords, which is usually done with a binary code tree, called the Huffman tree.

Given that the characters of our text are drawn from an alphabet Σ of size σ, the Huffman tree is a full binary tree, where each leaf represents a character of Σ. The tree stored in a plain pointer-based representation takes O(σ lg σ) bits of space. Here we assume that 2 ≤ σ ≤ n, where n is the number of characters of our text. A naive encoding algorithm for a character c traverses the Huffman tree top-down to the leaf representing c while writing the bits stored as labels on the traversed edges to the output. Similarly, a decoding algorithm traverses the tree top-down based on the read bits until reaching a leaf. Hence, decoding and encoding need O(h) time, where h is the height of the tree. If σ is constant, then this representation is optimal with respect to space and time (since h becomes constant). Therefore, the problem we study becomes interesting if σ = ω(1), which is the setting of this article.
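To make the naive traversal concrete, the following Python sketch encodes and decodes with a plain pointer-based code tree. It is illustrative only; the node layout and all names are ours, not taken from the paper.

```python
class Node:
    def __init__(self, char=None, left=None, right=None):
        self.char, self.left, self.right = char, left, right  # leaf iff char is not None

def build_codeword_table(root):
    """Collect the codeword of every character by a DFS over the code tree."""
    table = {}
    def dfs(node, prefix):
        if node.char is not None:
            table[node.char] = prefix
            return
        dfs(node.left, prefix + "0")
        dfs(node.right, prefix + "1")
    dfs(root, "")
    return table

def decode(root, bits):
    """Decode a stream of concatenated codewords by repeated top-down traversals."""
    out, node = [], root
    for b in bits:
        node = node.left if b == "0" else node.right
        if node.char is not None:        # reached a leaf: one codeword consumed
            out.append(node.char)
            node = root
    return out

# toy canonical tree with codewords a -> 0, b -> 10, c -> 11
root = Node(left=Node("a"), right=Node(left=Node("b"), right=Node("c")))
assert build_codeword_table(root) == {"a": "0", "b": "10", "c": "11"}
assert decode(root, "01011") == ["a", "b", "c"]
```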

The alphabet size σ is often much smaller than n. Yet in some cases the Huffman tree space becomes non-negligible: For instance, let us consider that Σ represents a set of distinct words of a language. Then its size σ can even exceed a million for covering an entire natural language (two such examples are the Polish Scrabble dictionary with over 3.0M words, https://sjp.pl/slownik/growy/, and the Korean dictionary Woori Mal Saem with over 1.1M words, https://opendict.korean.go.kr/service/dicStat). In another case, we maintain a collection of texts compressed with the same Huffman codewords under the setting that a single text needs to be shipped together with the codeword dictionary, for instance across a network infrastructure. A particular variation is given by canonical Huffman codes [28], where leaves of the same depth in the Huffman tree are lexicographically ordered with respect to the characters they represent. An advantage of canonical Huffman codes is that several techniques (see [21, 24] and the references therein) can compute the length of a codeword from the compressed stream by reading bit chunks instead of single bits, thus making it possible to decode a character faster than with an approach linear in the bit length of its codeword. Moffat and Turpin [24, Algorithm TABLE-LOOKUP] were probably the first to notice that it is enough to (essentially) perform binary search over a collection of the lexicographically smallest codewords for each possible length to decode the current symbol. The same idea is presented perhaps more lucidly in [25, Section 2.6.3], with an explicit claim that this representation requires O(σ lg σ) bits (by lg we denote the logarithm to base two throughout the paper; occasional logarithms to another base b are written log_b) and achieves O(lg ℓ_max) time per codeword. Given that ℓ_max is the maximum length of the codewords, an improvement, both in space and time, has been achieved by Gagie et al. [13], who gave a representation of the canonical Huffman tree within

  • σ lg ℓ_max + o(σ) + O(ℓ_max^2) bits of space with [13, Theorem 1], or

  • σ H₀ + o(σ) + O(ℓ_max^2) bits of space with [13, Corollary 1], where H₀ ≤ lg ℓ_max denotes the zeroth-order empirical entropy of the sequence of codeword lengths (although the space is stated with a larger last term in that paper, we could verify with the authors that it can be improved to the bounds written here),

while supporting character encoding and decoding in constant time. Their approach consists of a wavelet tree, a predecessor data structure P, and an array F storing the lexicographically smallest codeword for each depth of the Huffman tree. The last two data structures account for the last term in each of the space complexities above.

Our Contribution

Our contribution is a joint representation of the predecessor data structure P with the array F that improves their required space of O(ℓ_max^2) bits down to o(σ lg ℓ_max) bits, so that the total space is dominated by the wavelet tree. In other words, we obtain the following theorem.

Theorem 1.1.

There is a data structure using σ lg ℓ_max (1 + o(1)) bits of space, which can encode a character to a canonical Huffman codeword or restore a character from a binary input stream of codewords, both in constant time per character.

We remark that when the alphabet consists of natural language words, as in the examples shown in the beginning, we often initially map words to IDs in an arbitrary manner. For our use case, it would be convenient to assign the IDs based on the word frequencies. By doing so, the space needed for the canonical Huffman representation is greatly reduced, as the alphabet permutation can be considered as already given. In such a case, we only consider the extra space needed for the Huffman coding itself, where our improvement in the space for P and F is more pronounced (in detail, we can omit the wavelet tree from Gagie et al.'s space complexities).

Related Work

A lot of research effort has been spent on the practicality of decoding Huffman-encoded texts (e.g., [15, 1, 24, 14, 22]), where often a lookup table is employed for figuring out the length of the currently read codeword. For instance, Nekrich [26] added a lookup table for byte chunks of a codeword to figure out the length of a codeword by reading it byte-wise instead of bit-wise. Regarding different representations of the Huffman tree, Chowdhury et al. [7] came up with a solution using space linear in the alphabet size (measured in words) that still supports efficient decoding of a codeword.

In theoretical terms, important results were given by Gagie et al. [13] and Fariña et al. [10] (see also the extended version [11]). The former is heavily cited in this work and is discussed in detail in Section 2.2. The latter shows how to represent a particular optimal prefix-free code used for compressing wavelet matrices [8] compactly, allowing encoding and decoding in constant time, where the constant is selectable. Like Gagie et al. [13], they also use a wavelet tree to map each character to the length of its respective codeword. The tree topology is represented by counting, level by level, the number of nodes and leaves, resulting in O(ℓ_max lg σ) bits. With these two ingredients, this structure is already operational with O(ℓ_max) time per encoding or decoding operation. To obtain the aforementioned time bounds, they sample certain depths of the code tree with lookup tables to speed up top-down traversals.

Another line of research is on so-called skeleton trees [18, 19, 20]. The idea is to replace disjoint perfect subtrees in the canonical Huffman tree with leaf nodes. A leaf node representing a pruned perfect subtree only needs to store the height of this subtree and the characters represented by its leaves to be able to reconstruct it. Note that all leaves in a perfect binary tree are on the same level, and hence, due to the restriction on the order of the leaves of the canonical Huffman tree, we can restore the pruned subtree by knowing its height and the set of characters. Thus, a skeleton tree may use less space. Decompression can be accelerated whenever we hit a leaf v during a top-down traversal: if v is the representative of a pruned perfect subtree of height h, we know that the next h bits from the input form the suffix of the currently parsed codeword. Unfortunately, the gain depends on the shape of the Huffman tree, and since there are no non-trivial theoretical worst-case bounds known, this approach can so far be understood as a heuristic.

2 Preliminaries

Our computational model is the standard word RAM with machine word size w = Ω(lg n) bits, where n denotes the length of a given input string T (called the text) whose characters are drawn from an integer alphabet Σ whose size σ is bounded by 2 ≤ σ ≤ n. We call the elements of Σ characters. Given a string S, we define the queries S.rank_c(i) and S.select_c(k) returning the number of c's in S[1..i] and the position of the k-th c in S, respectively. There are data structures that replace S, use |S| lg σ (1 + o(1)) bits, and can answer each query in O(lg lg_w σ) time [3, Theorem 4.1]. We further stipulate that S.rank_c(0) = 0 and S.select_c(0) = 0 for any c ∈ Σ. In what follows, we assume that we have at least O(lg n) bits of working space available for storing a constant number of pointers, and omit this space in the space analysis of this article.
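As a reference for the rank/select semantics used throughout, here is a naive linear-time Python sketch; the succinct data structures of [3] answer the same queries within the space and time bounds stated above.

```python
def rank(S, c, i):
    """Number of occurrences of character c in S[1..i], i.e., among the first i characters."""
    return S[:i].count(c)

def select(S, c, k):
    """Position (1-based) of the k-th occurrence of c in S, or None if there are fewer than k."""
    seen = 0
    for pos, ch in enumerate(S, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return None

S = "abracadabra"
assert rank(S, "a", 4) == 2      # 'a' occurs twice in "abra"
assert select(S, "a", 3) == 6    # the third 'a' is at position 6
```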

2.1 Code Tree

A code tree is a full binary tree whose leaves represent the characters of Σ. A left or a right child of a node v is connected to v via an edge with label 0 or 1, respectively. We say that the string label of a node v is the concatenation of the edge labels on the path from the root to v. Then the string label of a leaf λ is the codeword of the character represented by λ. The length of the string label of a node is equal to its depth in the tree. Let C denote the set of the string labels of all leaves. The bit strings of C are called codewords. A crucial property of a code tree is that C is prefix-free, meaning that no codeword in C is a prefix of another. A consequence is that, when reading from a binary stream of concatenated codewords, we only have to match the longest prefix of this stream contained in C to decode the next codeword. Hence, prefix-free codes are instantaneous, where an instantaneous code, as described in the introduction, is a code with the property that a character can be decoded as soon as the last bit of its codeword is read from the compressed input. Instantaneous codes are also prefix-free: if a code is not prefix-free, then there are two codewords c and c' with c being a proper prefix of c', such that after reading c we still need to read further bits to judge whether the read bits represent c or the beginning of c'.

A code tree is called canonical [28] if its induced codewords read from left to right are lexicographically strictly increasing, while their lengths are non-decreasing. Consequently, a codeword of a shallow leaf is both shorter and lexicographically smaller than the codeword of a deeper leaf. We further assume that all leaves on the same depth are sorted according to the order of their respective characters.
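For illustration, the codewords of a canonical code are already determined by the (non-decreasing) sequence of codeword lengths. The following sketch shows the standard assignment; the function name and the example are ours.

```python
def canonical_codewords(lengths):
    """Assign canonical codewords given non-decreasing codeword lengths:
    the next codeword is the previous one plus one, shifted left whenever the length grows."""
    codewords, code, prev_len = [], 0, lengths[0]
    for length in lengths:
        code <<= (length - prev_len)   # append zeros when moving to a deeper level
        codewords.append(format(code, "0{}b".format(length)))
        code += 1
        prev_len = length
    return codewords

# A Huffman code for frequencies {a:45, b:13, c:12, d:16, e:9, f:5} has lengths [1, 3, 3, 3, 4, 4];
# the canonical codewords are lexicographically increasing and non-decreasing in length.
print(canonical_codewords([1, 3, 3, 3, 4, 4]))  # ['0', '100', '101', '110', '1110', '1111']
```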

Let ℓ_max denote the maximum length of the codewords in C. In the following, we fix an instance of a code tree for which ℓ_max ≤ log_φ n holds, where φ = (1+√5)/2 is the golden ratio. It has been shown by Buro [6, comment after Theorem 2.1] that canonical Huffman coding exhibits this property.

Our claims in Sections 3 and 4, expressed in terms of canonical Huffman codes and trees, are applicable to arbitrary canonical codes (resp. trees) if they obey the above bound on ℓ_max. Our focus on Huffman codes is motivated by the importance of this coding.

2.2 Former Approach

We briefly review the approach of Gagie et al. [13], which consists of three data structures:

  1. a multi-ary wavelet tree storing, for each character, the length of its corresponding codeword,

  2. an array F storing the codeword of the leftmost leaf on each depth, and

  3. a predecessor data structure P for finding the length of a codeword that is a prefix of a read bit string.

The first data structure represents an array L[1..σ] with L[c] being the codeword length of the c-th character, i.e., L is a string of length σ whose characters are drawn from the alphabet {1, …, ℓ_max}. The wavelet tree built upon L can access L[c] and answer rank and select queries on L in O(lg lg_w ℓ_max) time [3, Theorem 4.1]. While a plain representation costs σ lg ℓ_max (1 + o(1)) bits, the authors showed that they can use an empirical entropy-aware topology of the wavelet tree to get the space down to σH₀(L) + o(σ) bits, still accessing L[c] and answering rank and select in O(lg lg_w ℓ_max) time [3, Theorem 5.1], where H₀(L) denotes the zeroth-order empirical entropy of L. Since ℓ_max = O(lg n) = O(w), the time complexity is constant.

The second data structure F is a plain array of length ℓ_max. Each of its entries has ℓ_max bits, therefore F takes ℓ_max^2 bits in total. It additionally stores ℓ_max in O(lg ℓ_max) bits (for instance, in a prefix-free code such as Elias-γ [9]).

The last data structure P is represented by a fusion tree [12] storing for each depth the lexicographically smallest codeword, padded with zeros to ℓ_max bits at its right end, where we assume that the rightmost bit is the least significant one. A fusion tree storing k keys of b ≤ w bits each needs O(kb) bits of space and can answer the query pred(x), returning the predecessor of x among the stored keys, i.e., the largest stored key that is at most x, in O(log_w k) time. For our application, the fusion tree P built on at most ℓ_max codewords computes pred in O(log_w ℓ_max) time and needs O(ℓ_max^2) bits. Like above, the query time is constant since ℓ_max = O(w). Here, we want to return the depth d of the returned codeword, not the codeword itself (which we could retrieve by F[d]). Unfortunately, it is not explicitly mentioned by Gagie et al. [13] how to obtain this depth, in particular when some depths have no leaves. We address this problem in the following subsection and explain after that how the actual computation is done (Section 2.2.2).

2.2.1 Missing Leaves

To extract the depth from a predecessor query on P, we replace each right-padded codeword c by the pair (c, ℓ) as the key stored in P, where ℓ is the length of c before padding. Such a pair of codeword and length is represented by the concatenation of the binary representations of its two components, such that we can interpret the pair as a bit string of length ℓ_max + ⌈lg ℓ_max⌉. By slightly abusing the notation, we can now query P.pred((v, ℓ_max)) for a bit string v to obtain a pair (c, ℓ); i.e., the argument of pred is a pair rather than a single value and similarly the returned value is a pair, but physically the bit string representing (c, ℓ) is a predecessor of the bit string representing (v, ℓ_max). Then a query for a bit string v gives us not only the predecessor c of v, but also its length ℓ. Storing such longer bit strings poses no problem, as the time complexity of pred in a fusion tree does not change when the length of the keys is, e.g., doubled. Appendix A outlines a different strategy augmenting the fusion tree with additional methods instead of storing the codeword lengths.
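A small sketch of this key packing (our own illustrative encoding; the real structure is a fusion tree over the packed keys, while we use a linear scan): each right-padded codeword is concatenated with its length, so an ordinary integer predecessor query returns both the codeword and its depth.

```python
L_MAX = 8                              # assumed maximum codeword length (example value)
LEN_BITS = L_MAX.bit_length()          # bits reserved for storing the length

def pack(codeword, length):
    """Right-pad the codeword to L_MAX bits and append its length as the low-order bits."""
    padded = codeword << (L_MAX - length)
    return (padded << LEN_BITS) | length

def unpack(key):
    return key >> LEN_BITS, key & ((1 << LEN_BITS) - 1)   # (padded codeword, length)

# keys for the lexicographically smallest codeword of each occupied depth, e.g. depths 2, 3, 5
keys = sorted(pack(c, l) for c, l in [(0b10, 2), (0b110, 3), (0b11100, 5)])

def pred(v):
    """Predecessor of the packed query (v, L_MAX) among the stored keys."""
    q = (v << LEN_BITS) | L_MAX
    best = max((k for k in keys if k <= q), default=None)
    return unpack(best) if best is not None else None

# an 8-bit window starting with the codeword 110...:
print(pred(0b11010110))   # -> (192, 3): codeword 110 right-padded to 11000000, length 3
```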

2.2.2 Encoding and Decoding

Finally, we explain how the operational functionality is implemented, i.e., the steps of how to encode a character and how to decode a character from a binary stream. We can encode a character c by first finding the depth d = L[c] of the leaf λ representing c, which is also the length of the codeword to compute. Given that the string label of the leftmost leaf on depth d is F[d], all we have to do is increment the value of this codeword by the number of leaves between that leaf and λ on the same depth. For that we compute r = L.rank_d(c − 1), which gives us the number of leaves to the left of λ on the same depth d. Hence, F[d] + r is the codeword of c. For decoding a character, we assume to have a binary stream of concatenated codewords as an input. We first use the predecessor data structure P to figure out the length of the first codeword in the stream. To this end, we peek the next ℓ_max bits in the binary stream, i.e., we read ℓ_max bits into a variable v from the stream without removing them. Given (c, d) = P.pred(v), we know that c = F[d] and that the next codeword has length d. Hence, we can read d bits from the stream into a bit string x, which must be the codeword of the next character. The leaf having x as its string label is on depth d, and has the rank x − F[d] + 1 among all other leaves on the same depth. This rank helps us to retrieve the character having the codeword x, which we can find by L.select_d(x − F[d] + 1).
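The following self-contained Python sketch mirrors these encoding and decoding steps. It uses naive rank/select and a linear-scan predecessor in place of the wavelet tree and the fusion tree, so it illustrates the logic rather than the time bounds; all names are ours.

```python
def build(lengths):
    """lengths[c] = codeword length of character c (array L); returns first codeword and leaf count per depth."""
    l_max = max(lengths)
    count = [0] * (l_max + 1)
    for l in lengths:
        count[l] += 1
    first = [0] * (l_max + 1)           # first[d] = value of the smallest codeword of length d (array F)
    code = 0
    for d in range(1, l_max + 1):
        first[d] = code
        code = (code + count[d]) << 1
    return first, count, l_max

def encode(c, lengths, first):
    d = lengths[c]                                   # depth = L[c]
    r = lengths[:c].count(d)                         # rank: leaves of depth d to the left of c
    return format(first[d] + r, "0{}b".format(d))    # F[d] + rank gives the codeword

def decode_one(stream, pos, lengths, first, count, l_max):
    window = stream[pos:pos + l_max].ljust(l_max, "0")
    v = int(window, 2)
    # predecessor search over the right-padded smallest codewords (a fusion tree in the real structure)
    d = max(dd for dd in range(1, l_max + 1)
            if count[dd] > 0 and (first[dd] << (l_max - dd)) <= v)
    x = int(stream[pos:pos + d], 2)
    k = x - first[d] + 1                             # rank of the leaf among the depth-d leaves
    c = [i for i, l in enumerate(lengths) if l == d][k - 1]   # select: k-th character of length d
    return c, pos + d

lengths = [1, 3, 3, 3, 4, 4]              # characters 0..5 with canonical Huffman lengths
first, count, l_max = build(lengths)
bits = "".join(encode(c, lengths, first) for c in [0, 3, 5])
c, p = decode_one(bits, 0, lengths, first, count, l_max)
assert c == 0 and decode_one(bits, p, lengths, first, count, l_max)[0] == 3
```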

3 A Warm-Up

We start with a simple idea unrelated to the main contribution presented in the subsequent section. The point is that it might have a (mild) practical impact. Also, we admit that this idea is not new (Fariña et al. [11, Section 2.4] attribute it to Moffat and Turpin [24], although, in our opinion, it is presented there rather in disguise); we hope, however, that the analysis given below is original and of some value.

Navarro, in Section 2.6.3 of his textbook [25], describes a simple solution (based on an earlier work of Moffat and Turpin [24]) which gives O(lg ℓ_max) worst-case time for symbol decoding. With the modification presented below, this worst-case time remains, but the average time bound becomes O(lg lg σ). It also means we can decode the whole text in O(n lg lg σ) rather than O(n lg ℓ_max) time.

The referenced solution is based on a binary search over a collection of the lexicographically smallest codewords for each possible length; the collection is ordered ascendingly by the codeword length. We replace the binary search with the exponential search [5], which finds the predecessor of an item x in a (random-access) sorted list A in O(lg p) time, where p is the position of x's predecessor in A. Exponential search is not faster than binary search in general, but may help if x occurs close to the beginning of A.
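For reference, a sketch of the exponential (doubling) search over an ascending list: finding the predecessor of x at position p touches O(lg p) elements.

```python
import bisect

def exponential_search_pred(A, x):
    """Index of the predecessor of x in the ascending list A (largest i with A[i] <= x), or -1."""
    if not A or A[0] > x:
        return -1
    bound = 1
    while bound < len(A) and A[bound] <= x:   # grow the window until it overshoots
        bound *= 2
    # finish with a binary search inside A[bound // 2 : min(bound + 1, len(A))]
    return bisect.bisect_right(A, x, bound // 2, min(bound + 1, len(A))) - 1

A = [1, 3, 3, 5, 8, 13]
assert exponential_search_pred(A, 4) == 2
assert exponential_search_pred(A, 0) == -1
```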

The changed search strategy has an advantage whenever short codewords occur much more often than longer ones. Fortunately, we have such a distribution, which is expressed formally in the following lemma:

Lemma 3.1.

The number of occurrences of all characters in the text whose associated Huffman codewords have lengths exceeding ℓ is O(nσ/φ^ℓ).

Proof.

For each character c, let occ_c denote the number of occurrences of c in the text and ℓ_c denote the depth of its associated leaf in the Huffman tree, which is also the length of its associated codeword. Here we are interested in those characters for which ℓ_c > ℓ. Since ℓ_c ≤ log_φ(n/occ_c) [17, Thm 1] and log_φ is a strictly increasing function, we have log_φ(n/occ_c) > ℓ, or, equivalently, occ_c < n/φ^ℓ. The number of characters we deal with is upper-bounded by σ, therefore the total number of their occurrences is O(nσ/φ^ℓ), which ends the proof. ∎

For the characters specified in Lemma 3.1 the exponential search works in O(lg ℓ_max) time, but, choosing ℓ = 2 log_φ σ, they account for only an O(1/σ)-fraction of all character occurrences, so their contribution to the average is O((lg ℓ_max)/σ) = O(1). For the remaining characters of the text the exponential search works in only O(lg ℓ) = O(lg lg σ) time. Overall, the average time is O(lg lg σ), which is also the total average character decoding time, as finding the lexicographically smallest codeword with the appropriate length is, in general, more costly than all other required operations, which work in constant time.

4 Multiple Fusion Trees

In what follows, we present our space-efficient representation of F and P for achieving the result claimed in Theorem 1.1. In Section 4.1, we start with the key observation that long codewords necessarily have a long prefix of ones. This observation lets us group together codewords of roughly the same lengths, storing only the suffixes that distinguish them from each other. Hence, we proceed with partitioning the set of codewords into codewords of different lengths and, orthogonal to that, of different distinguishing suffix lengths.

4.1 Distinguishing Suffixes

Let us notice at the beginning that a long codeword has a necessarily long prefix of ones:

Lemma 4.1.

Let C be a canonical code of size σ. Each codeword c ∈ C can be represented as a binary sequence c = 1^k · b_1 ⋯ b_m for some k, m ≥ 0, where b_i ∈ {0, 1} for all i ∈ [1..m], and either (a) m = 0, or (b) b_1 = 0 and m ≤ lg σ.

Proof.

Without loss of generality, let σ be a power of two, i.e., lg σ is an integer. Let us assume there is a codeword c = 1^k · b_1 ⋯ b_m with b_1 = 0 and m ≥ lg σ + 1. The length of the suffix s = b_1 ⋯ b_m of c is m. The prefix 1^k · 0 of c is the string label of a node v in the canonical tree. The height of v is at least m − 1 since it has a leaf whose string label is c. Since a canonical tree is a full binary tree, v has a right sibling, which we call w (the node with string label 1^(k+1)). Since the depths of the leaves iterated in the left-to-right order in a canonical tree are non-decreasing, all leaves of the tree rooted at w must be at depth at least k + m. This implies that the number of leaves in the tree rooted in w is at least 2^(m−1). There are at least two leaves in the tree rooted in v. Moreover, the two trees rooted in v and w, respectively, are disjoint. Consequently, there are at least 2^(m−1) + 2 ≥ σ + 2 leaves in the tree representing C. We obtain a contradiction since the code tree has exactly σ leaves. Hence, m ≤ lg σ, i.e., the suffix following the maximal run of ones has length at most lg σ. ∎

The lemma states that, given a codeword perceived as the concatenation of a maximal run of set '1' bits and a suffix that follows it, the length of this suffix is at most lg σ. Given a length ℓ and the lexicographically smallest codeword with length ℓ, and assuming that it is of Case (b) described in Lemma 4.1 with a suffix of length m, the proof of Lemma 4.1 states that we can charge it for 2^(m−1) nodes in the following argumentation: Since the number of nodes in a code tree is 2σ − 1, the number of codewords with a suffix of Case (b) of length m, for a fixed m, is at most O(σ/2^m) by the pigeonhole principle. We formalize this observation as follows:

Corollary 4.2.

Given a length m, the number of codewords c of a canonical code given by c = 1^k · b_1 ⋯ b_m with k ≥ 0 and b_1 = 0 is bounded by O(σ/2^m).

In what follows, we want to derive a partition of the codewords stored in F. Let us recall that F stores not all codewords, but the codeword of the leftmost leaf on each depth (and omits a depth d if there is no leaf with depth d). For a codeword c stored in F, let τ(c) denote the length of the shortest suffix s of the representation c = 1^k · s of Lemma 4.1, i.e., s is either empty or starts with a '0'. In what follows, we call a codeword c long-tailed if τ(c) > lg ℓ_max, and otherwise short-tailed. A consequence of Corollary 4.2 is that only O(σ/ℓ_max) codewords can be long-tailed. In what follows, we manage long-tailed and short-tailed codewords separately.
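To illustrate the partition, the distinguishing suffix of a codeword is what remains after stripping its maximal leading run of '1's; the threshold below is a placeholder parameter of our own, standing in for the classification just defined.

```python
def tail(codeword):
    """Distinguishing suffix of a codeword given as a bit string: strip the maximal leading run of '1's."""
    return codeword.lstrip("1")      # empty for the all-ones codeword (Case (a) of Lemma 4.1)

def classify(codeword, threshold):
    """Split codewords into short-tailed and long-tailed by the length of their distinguishing suffix."""
    return "long-tailed" if len(tail(codeword)) > threshold else "short-tailed"

assert tail("1110010") == "0010"
assert tail("1111") == ""
print(classify("1110010", threshold=3))   # -> long-tailed (the suffix 0010 has length 4 > 3)
```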

4.2 Long-Tailed Codewords

We partition the long-tailed codewords into sets S_1, S_2, …, with S_i being the set of long-tailed codewords of lengths within the range ((i−1)·⌈lg σ⌉, i·⌈lg σ⌉], for each i ≥ 1. By Lemma 4.1, we know that the codewords in S_i have the shape 1^k · s with s starting with a '0' and |s| ≤ lg σ, so that all of them share a long prefix of ones determined by the range. This means that we can represent a codeword of S_i by a suffix of O(lg σ) bits. Instead of representing P with a single fusion tree storing the lexicographically smallest codeword of F for each depth, we maintain for each such set S_i a dedicated fusion tree T_i. To know which fusion tree to consult during a query with a bit string v read from a binary stream, we compute the longest prefix 1^j of v (that is, a maximal run of ones). We can do that by asking for the position of the most significant set bit of v bit-wise negated, which can be computed in constant time [12]. Given that this most significant set bit is at position j + 1 (counting from the leftmost position of v, which is position 1), we know that the next codeword has length within (j, j + lg σ], and consult the fusion trees whose length ranges intersect this interval.
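The run of leading '1's in the ℓ_max-bit window can indeed be read off from the most significant set bit of the bitwise negation; in the sketch below, Python's int.bit_length stands in for the constant-time msb operation of [12], and the window width is an assumed example value.

```python
L_MAX = 8   # assumed window width in bits (example value)

def leading_ones(v):
    """Length of the maximal run of '1's at the most significant end of the L_MAX-bit value v."""
    neg = ~v & ((1 << L_MAX) - 1)          # bitwise negation restricted to L_MAX bits
    if neg == 0:
        return L_MAX                       # the window is all ones
    msb = neg.bit_length()                 # position of the highest '0' of v, counted from the right
    return L_MAX - msb

assert leading_ones(0b11010110) == 2
assert leading_ones(0b11111111) == 8
assert leading_ones(0b01111111) == 0
```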

This works fine unless we encounter the following two border cases: Firstly, the run of ones spans all of v if and only if v = 1^ℓ_max, which happens exactly when the next codeword is the all-ones codeword of length ℓ_max, whose entry in F is the lexicographically smallest codeword of length ℓ_max. In that case, we know already that the answer is depth ℓ_max and are finished, so we treat this case individually. Secondly, the predecessor may not be stored in the queried fusion tree, but in one built on shorter codewords. To treat that problem, for each S_i with i ≥ 2, we store a dummy codeword smaller than all codewords of S_i in T_i and cache the largest value of S_{i−1} in T_i such that T_i returns this cached value instead of the dummy in the case that the dummy is the returned predecessor. Finally, we treat the border case for S_1, in which we store the overall smallest codeword of F regardless of whether it is long-tailed or not (in that way we are sure that a predecessor always exists), but store this codeword with the same number of bits as the other codewords trimmed to their long-tailed suffixes.

Since the length ranges of the codewords (before truncation) in the sets S_i are pairwise disjoint, the number of elements stored in all these fusion trees together is the number of long-tailed codewords, which is bounded by min(ℓ_max, O(σ/ℓ_max)). A key in such a fusion tree represents a long-tailed codeword c by a pair consisting of c's suffix of O(lg σ) bits and c's length packed into ⌈lg ℓ_max⌉ bits. Our total space for the long-tailed codewords is therefore O(min(ℓ_max, σ/ℓ_max) · lg σ) = o(σ lg ℓ_max) bits, which are stored in O(ℓ_max / lg σ) fusion trees.

4.3 Short-Tailed Codewords

Similar to our strategy for long-tailed codewords, we partition the short-tailed codewords by their total lengths. By doing so, we obtain a collection of sets S'_1, S'_2, …, with S'_i being the set of short-tailed codewords of lengths within the range ((i−1)·⌈lg ℓ_max⌉, i·⌈lg ℓ_max⌉]. Like in Section 4.2, we build a fusion tree on each of the codeword sets S'_i with the same logic.

But this time we know that each codeword c ∈ S'_i has the shape 1^k · s for a suffix s that is either empty or starts with a '0' and has length |s| ≤ lg ℓ_max. Additionally, we represent the length of the codeword c in the set S'_i by its local depth within the range of S'_i as an O(lg lg ℓ_max)-bit integer. Hence, the key of a fusion tree is composed of the O(lg ℓ_max)-bit suffix and the local depth of O(lg lg ℓ_max) bits. Therefore, the total space for the short-tailed codewords is O(ℓ_max lg ℓ_max) bits, which are stored in O(ℓ_max / lg ℓ_max) fusion trees.

4.4 Complexities

Summing up the space for the short- and long-tailed codewords, we obtain O(ℓ_max lg ℓ_max) + o(σ lg ℓ_max) bits, which are stored in O(ℓ_max / lg ℓ_max) fusion trees. To maintain these fusion trees in small space, we assume that we have access to a separately allocated RAM of the above stated size to fit in all the data. While we have a global pointer of w bits to point into this space, pointers inside this space take only O(lg ℓ_max) bits. Hence, we can maintain all fusion trees inside this allocated RAM with an extra space of O(ℓ_max) bits for the pointers. In total, the space for P is o(σ lg ℓ_max) bits.

To answer pred(v), we find the predecessor of v among the short-tailed and among the long-tailed codewords, and take the maximum of the two. In detail, this is done as follows: Recalling the decoding process outlined in Section 2.2.2, v is a bit string of length ℓ_max, which we delegate to the fusion trees of the (at most two) sets S_i and S_{i+1} (we try both of them) whose length ranges the next codeword can fall into, given the length of the longest unary prefix of '1's in v; this is done once for the long-tailed and once for the short-tailed partition. By delegation we mean that we remove this unary prefix, take the O(lg σ) most significant bits (long-tailed) or the O(lg ℓ_max) most significant bits (short-tailed) of the remainder, and store these bits in a variable used as the argument for the query on the respective fusion tree.

Finally, it remains to treat F for encoding a character (cf. Section 2.2.2). One way would be to partition F analogously to P. Here, we present a solution based on the already analyzed bounds for P: We create a duplicate of P, named P', whose difference to P is that the components of the stored pairs are swapped, such that a predecessor query is of the form P'.pred((d, v)) for a length d and a bit string v of length ℓ_max. Then P'.pred((d, 1^ℓ_max)) returns a pair whose second component's first/leftmost d bits are equal to F[d] if the returned length is d (otherwise, if the returned length is smaller than d, then there is no leaf at depth d).

5 Conclusion and Open Problems

Canonical codes (e.g., Huffman codes) are an interesting subclass of general prefix-free compression codes, allowing for a compact representation without sacrificing character encoding and decoding times. In this work, we refined the solution presented by Gagie et al. [13] and showed how to represent a canonical code, for an alphabet of size σ and a maximum codeword length of ℓ_max, in σ lg ℓ_max (1 + o(1)) bits, capable of encoding or decoding a symbol in constant worst-case time. Our main idea was to store codewords not in their plain form, but to partition them by their lengths and by the length of the shortest suffix covering all '0' bits of a codeword, such that we can discard the unary prefixes of '1's from all codewords.

This research spawns the following open problems: First, we wonder whether the proposed data structure works in the AC⁰ model. As far as we are aware, the fusion tree can be modeled in AC⁰ [2], and our enhancements do not involve complicated operations except finding the most significant set bit, which can also be computed in AC⁰ [4]. The missing part is the wavelet tree, for which we used the implementation of Belazzougui and Navarro [3, Thm 4.1]. We think that most bit operations used by this implementation are portable, and the used multiplications are not of a general type, but a broadword broadcast operation, i.e., storing copies of a short bit string in a machine word of w bits. However, they rely on a monotone minimum perfect hash function, for which we do not know whether there is an existing alternative solution (even allowing a slight increase in the space complexity but still within the stated space bounds) in AC⁰.

Another open question concerns the possibility of applying the presented technique in an adaptive (sometimes also called dynamic) Huffman coding scheme. There exist efficient dynamic fusion tree and multi-ary wavelet tree implementations, but it is unclear to us whether we can meet the costs (at least in the amortized sense) involved in maintaining the code tree dynamically. Also, we are curious whether, for a small enough alphabet, a canonical code can encode and decode several symbols at a time (ideally, Θ(w / lg σ) or Θ(lg n / lg σ) symbols, where the denominator could be changed to the entropy if focusing on the average case) without a major increase in the required space.

Acknowledgments

This work is funded by the JSPS KAKENHI Grant Number JP21K17701.

References

  • Aggarwal and Narayan [2000] M. Aggarwal and A. Narayan. Efficient Huffman decoding. In Proc. ICIP, pages 936–939, 2000.
  • Andersson et al. [1999] A. Andersson, P. B. Miltersen, and M. Thorup. Fusion trees can be implemented with AC⁰ instructions only. Theor. Comput. Sci., 215(1-2):337–344, 1999. doi: 10.1016/S0304-3975(98)00172-8.
  • Belazzougui and Navarro [2015] D. Belazzougui and G. Navarro. Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms, 11(4):31:1–31:21, 2015.
  • Ben-Kiki et al. [2014] O. Ben-Kiki, P. Bille, D. Breslauer, L. Gasieniec, R. Grossi, and O. Weimann. Towards optimal packed string matching. Theor. Comput. Sci., 525:111–129, 2014. doi: 10.1016/j.tcs.2013.06.013.
  • Bentley and Yao [1976] J. L. Bentley and A. C. Yao. An almost optimal algorithm for unbounded searching. Inf. Process. Lett., 5(3):82–87, 1976. doi: 10.1016/0020-0190(76)90071-5.
  • Buro [1993] M. Buro. On the maximum length of Huffman codes. Inf. Process. Lett., 45(5):219–223, 1993. doi: 10.1016/0020-0190(93)90207-P.
  • Chowdhury et al. [2002] R. A. Chowdhury, M. Kaykobad, and I. King. An efficient decoding technique for Huffman codes. Inf. Process. Lett., 81(6):305–308, 2002.
  • Claude et al. [2015] F. Claude, G. Navarro, and A. Ordóñez. The wavelet matrix: An efficient wavelet tree for large alphabets. Information Systems, 47:15–32, 2015.
  • Elias [1975] P. Elias. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21(2):194–203, 1975.
  • Fariña et al. [2016] A. Fariña, T. Gagie, G. Manzini, G. Navarro, and A. Ordóñez. Efficient and compact representations of some non-canonical prefix-free codes. In Proc. SPIRE, LNCS 9954, pages 50–60, 2016.
  • Fariña et al. [2021] A. Fariña, T. Gagie, S. Grabowski, G. Manzini, G. Navarro, and A. Ordóñez. Efficient and compact representations of some non-canonical prefix-free codes, 2021. arXiv:1605.06615.
  • Fredman and Willard [1993] M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424–436, 1993. doi: 10.1016/0022-0000(93)90040-4.
  • Gagie et al. [2015] T. Gagie, G. Navarro, Y. Nekrich, and A. O. Pereira. Efficient and compact representations of prefix codes. IEEE Trans. Inf. Theory, 61(9):4999–5011, 2015.
  • Hashemian [1995] R. Hashemian. Memory efficient and high-speed search Huffman coding. IEEE Trans. Commun., 43(10):2576–2581, 1995.
  • Hirschberg and Lelewer [1990] D. S. Hirschberg and D. A. Lelewer. Efficient decoding of prefix codes. Commun. ACM, 33(4):449–459, 1990.
  • Huffman [1952] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098–1101, 1952.
  • Katona and Nemetz [1976] G. O. H. Katona and T. O. H. Nemetz. Huffman codes and self-information. IEEE Trans. Inf. Theory, 22(3):337–340, 1976. doi: 10.1109/TIT.1976.1055554.
  • Klein [1997] S. T. Klein. Space- and time-efficient decoding with canonical Huffman trees. In Proc. CPM, volume 1264 of LNCS, pages 65–75, 1997.
  • Klein [2000] S. T. Klein. Skeleton trees for the efficient decoding of Huffman encoded texts. Inf. Retr., 3(1):7–23, 2000. doi: 10.1023/A:1009910017828.
  • Kosolobov and Merkurev [2020] D. Kosolobov and O. Merkurev. Optimal skeleton Huffman trees revisited. In Proc. CSR, volume 12159 of LNCS, pages 276–288, 2020.
  • Liddell and Moffat [2006] M. Liddell and A. Moffat. Decoding prefix codes. Softw. Pract. Exp., 36(15):1687–1710, 2006. doi: 10.1002/spe.741.
  • Mansour [2007] M. F. Mansour. Efficient Huffman decoding with table lookup. In Proc. ICASSP, pages 53–56, 2007.
  • Moffat [2019] A. Moffat. Huffman coding. ACM Comput. Surv., 52(4):85:1–85:35, 2019. doi: 10.1145/3342555.
  • Moffat and Turpin [1997] A. Moffat and A. Turpin. On the implementation of minimum redundancy prefix codes. IEEE Trans. Commun., 45(10):1200–1207, 1997.
  • Navarro [2016] G. Navarro. Compact Data Structures – A practical approach. Cambridge University Press, 2016.
  • Nekrich [2000] Y. Nekrich. Decoding of canonical Huffman codes with look-up tables. In Proc. DCC, page 566. IEEE Computer Society, 2000. doi: 10.1109/DCC.2000.838213.
  • Pǎtraşcu and Thorup [2014] M. Pǎtraşcu and M. Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In Proc. FOCS, pages 166–175, 2014.
  • Schwartz and Kallick [1964] E. S. Schwartz and B. Kallick. Generating a canonical prefix encoding. Commun. ACM, 7(3):166–169, 1964. ISSN 0001-0782. doi: 10.1145/363958.363991.
  • Witten et al. [1987] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520–540, 1987. doi: 10.1145/214762.214771.

Appendix A Fusion Tree Augmentation

Here, we provide an alternative solution to Section 2.2.1 without the need to let P store pairs of codewords with their respective lengths. Instead, the idea is to augment P with additional information to support the queries needed for retrieving the depth of a returned predecessor. For that, we follow Pǎtraşcu and Thorup [27, Section IV], who presented a solution for a dynamic fusion tree that can answer the following queries, where S is the set of keys defined in Section 2.2.

  • S.rank(x) returns the number of keys in S that are at most x, and

  • S.select(i) returns the i-th smallest key of S, i.e., the key y ∈ S with S.rank(y) = i.

Since P is a static data structure, adding these operations to P is rather simple: The idea is to augment each fusion tree node with an integer array storing the prefix sums of subtree sizes. Specifically, let v be a fusion tree node having Θ(w^ε) children, for a fixed (but initially selectable) positive constant ε. Then we augment v with an integer array A_v such that A_v[i] is the sum of the sizes of the subtrees rooted at the preceding siblings of v's i-th child. With A_v we can answer S.rank(x) along the top-down traversal performed for pred(x): starting with a counter r of the rank initially set to zero, on visiting the i-th child of a node v, we increment r by A_v[i]. On finding the predecessor we return r + 1 (+1 because the predecessor itself needs to be counted).

To answer a select query, we store the content of each A_v in a (separate) fusion tree F_v to support a predecessor query on the prefix-sum values stored in A_v. Consequently, a node v of P stores not only A_v but also a fusion tree built upon A_v. The algorithm works again in a top-down manner, but uses A_v for navigation: For answering S.select(i), suppose we are at a node v, and suppose that F_v gives us the largest prefix sum A_v[j] not exceeding i. Then we exchange i with i − A_v[j]. Now if i has been decremented to zero, we are done since the answer is associated with the j-th child of v. Otherwise, we descend to the j-th child of v and recurse.
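A sketch of the rank computation over such augmented nodes: each internal node stores prefix sums of its children's subtree sizes, and a top-down descent accumulates the number of keys at most the query value. A plain B-tree-like node stands in for a fusion tree node here, and all names are ours.

```python
class BNode:
    def __init__(self, keys, children=None):
        self.keys = keys                      # router keys (subtree maxima) or, at a leaf, the stored keys
        self.children = children or []        # empty for leaves
        # prefix[i] = total number of keys stored in the first i child subtrees
        self.prefix = [0]
        for ch in self.children:
            self.prefix.append(self.prefix[-1] + ch.size())

    def size(self):
        return len(self.keys) if not self.children else self.prefix[-1]

def rank(node, x):
    """Number of keys <= x in the subtree rooted at node."""
    if not node.children:
        return sum(1 for k in node.keys if k <= x)
    i = min(sum(1 for k in node.keys if k <= x), len(node.children) - 1)
    return node.prefix[i] + rank(node.children[i], x)

leaves = [BNode([1, 3]), BNode([5, 8]), BNode([13, 21])]
root = BNode([3, 8], leaves)                 # routers: maximum key of each of the first two children
assert rank(root, 9) == 4                    # keys 1, 3, 5, 8 are <= 9
```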

The fusion trees F_v as well as the prefix-sum arrays A_v take asymptotically the same space as their respective fusion tree node v. (Since our tree P has a branching factor of Θ(w^ε), there are asymptotically fewer internal nodes than leaves, and an internal node takes O(w^ε) words, so all internal nodes together need asymptotically no more space than the fusion tree already uses; also, a leaf v does not need to store A_v and F_v since its keys can be processed directly.) Regarding the time, F_v and A_v answer a predecessor and an access query in constant time, and therefore we can answer rank and select in the same time bounds as the pred query on P.

It is left to deal with depths having no leaves. For that, we add a bit vector D of length ℓ_max with D[d] = 1 if and only if depth d has at least one leaf. For the latter solution, P only needs to take care of the depths in which leaves are present, i.e., P will return the rank i of the predecessor during a predecessor query, which we map to the correct depth with D.select_1(i), given that we endowed D with rank- and select-support data structures in a precomputation step. In total, D with its support data structures takes ℓ_max + o(ℓ_max) bits of space and answers a query in constant time. We no longer need to store F explicitly, since F[d] = P.select(D.rank_1(d)), given there is a leaf of the Huffman tree with depth d. Consequently, our approach shown in Section 4 works with this augmented fusion tree analogously.
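A minimal sketch of the depth mapping with the bit vector D: the i-th occupied depth is recovered with a select query, where a naive linear scan stands in for the constant-time select-support structure.

```python
D = [0, 1, 0, 1, 1, 0, 1]     # D[d-1] = 1 iff depth d has at least one leaf (here depths 2, 4, 5, 7)

def select1(bv, i):
    """Position (1-based) of the i-th set bit of bv."""
    seen = 0
    for pos, bit in enumerate(bv, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    raise ValueError("fewer than i set bits")

# if the predecessor structure reports that the answer is its 3rd stored depth,
# the actual depth in the code tree is:
assert select1(D, 3) == 5
```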