Variable-Byte Encoding is Now Space-Efficient Too

04/29/2018 ∙ by Giulio Ermanno Pibiri, et al. ∙ University of Pisa

The ubiquitous Variable-Byte encoding is considered one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2× by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free, given that: we introduce an optimal partitioning algorithm that, by running in linear time and with low constant factors, does not affect indexing time; and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison against several other state-of-the-art encoders.


1 Introduction

The inverted index is the core data structure at the basis of search engines, massive database architectures and social networks [47, 27, 12, 14, 11]. In its simplicity, the inverted index can be regarded as a collection of sorted integer sequences, called inverted or posting lists.

When the index is used to support full-text search in databases, each list is associated with a vocabulary term and stores the sequence of integer identifiers of the documents that contain that term [27]. Then, identifying a set of documents containing all the terms in a user query reduces to the problem of intersecting the inverted lists associated with the terms of the query. Likewise, an inverted list can be associated with a user in a social network (e.g., Facebook) and store the sequence of all the friend identifiers of the user [14]. Moreover, database systems based on SQL often precompute the list of row identifiers matching a specific frequent predicate over a huge table, in order to speed up the execution of a query involving the conjunction of, possibly, many predicates [24, 36]. Also, finding all occurrences of twig patterns in XML databases can be done efficiently by resorting to an inverted index [9]. In recent years, a vast number of key-value stores have emerged, e.g., Apache Ignite, Redis, InfinityDB, BerkeleyDB and many others. Common to all such architectures is the organization of data elements falling into the same bucket due to a hash collision: the list of all such elements is recorded, which is nothing but an inverted list [16].

Because of the huge quantity of data available and processed on a daily basis by the mentioned systems, compressing the inverted index is indispensable, since it introduces a two-fold advantage over a non-compressed representation: it feeds faster memory levels with more data and, hence, speeds up the query processing algorithms. As a result, the design of algorithms that compress the index effectively while maintaining high decoding speed is an old problem in computer science, dating back more than 50 years, and still a very active field of research. Many representations for inverted lists are known, each exposing a different compression ratio vs. query processing speed trade-off. We point the reader to [34] and Section 2 of this paper for a concise overview of the different encoders that have been proposed through the years.

Among these, Variable-Byte [40, 44] (henceforth, VByte) is the most popular and used byte-aligned code. In particular, VByte owes its popularity to its sequential decoding speed and, indeed, it is the fastest representation to date for integer sequences. For this reason, it is widely adopted by well-known companies as a key database design technology to enable fast search of records. We mention some notable examples. Google uses VByte extensively: for compressing the posting lists of inverted indexes [15] and as a binary wire format for its protocol buffers [2]. IBM DB2 employs VByte to store the differences between successive record identifiers [8]. Amazon patented an encoding scheme, based on VByte and called Varint-G8IU, which uses SIMD (Single Instruction Multiple Data) instructions to perform decoding faster [39]. Many other storage architectures rely on VByte to support fast full-text search, like Redis [3], UpscaleDB [4] and Dropbox [1].

We now quickly review how the VByte encoding works. It was first described by Thiel and Heaps [40]. The binary representation of a non-negative integer is divided into groups of 7 bits, which are represented as a sequence of bytes. In particular, the 7 least significant bits of each byte are reserved for the data, whereas the most significant one (the 8-th), called the continuation bit, is equal to 1 to signal continuation of the byte sequence. The last byte of the sequence has its 8-th bit set to 0 to signal, instead, the termination of the byte sequence. As an example, the integer 267, whose binary representation is 100001011, is split into the two 7-bit groups 0000010 and 0001011 and represented by the two bytes [1]0000010 [0]0001011, where we have bracketed the control bits. Also notice the five padding bits in the first byte starting from the left, inserted to align the binary representation of the number to a multiple of 7 bits. In particular, VByte uses ⌈b/7⌉ bytes, i.e., 8⌈b/7⌉ bits, to represent an integer whose binary representation needs b bits. Decoding is simple: we just need to read one byte at a time until we find a value smaller than 2^7 = 128. The format is also suitable for SIMD instructions for speeding up sequential decoding, as we will review in Section 2.
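For concreteness, the following C++ fragment sketches the scheme just described (the code and its names are only illustrative); it emits the most significant 7-bit group first, as in the example above, although implementations that emit the groups in the opposite order are also common.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // 7 data bits per byte; the 8th (most significant) bit is 1 on all bytes
    // except the last one of each codeword.
    static void vbyte_encode(uint64_t x, std::vector<uint8_t>& out) {
        int groups = 1;
        while ((x >> (7 * groups)) != 0) ++groups;  // number of 7-bit groups
        for (int i = groups - 1; i > 0; --i)
            out.push_back(uint8_t(((x >> (7 * i)) & 0x7F) | 0x80));  // continuation bit = 1
        out.push_back(uint8_t(x & 0x7F));                            // last byte: bit = 0
    }

    static uint64_t vbyte_decode(const uint8_t*& in) {
        uint64_t x = 0;
        uint8_t b;
        do {
            b = *in++;
            x = (x << 7) | (b & 0x7F);
        } while (b & 0x80);  // stop at the first byte smaller than 128
        return x;
    }

    int main() {
        std::vector<uint8_t> buf;
        vbyte_encode(267, buf);  // encoded as the two bytes 10000010 00001011
        const uint8_t* p = buf.data();
        std::printf("%llu uses %zu bytes\n",
                    (unsigned long long)vbyte_decode(p), buf.size());
    }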

Figure 1: Percentage of postings belonging to Dense and Sparse regions of the posting lists for the Gov2 (a) and ClueWeb09 (b) datasets. The posting lists have been clustered by size into three categories: Short (size < 10K); Medium (10K ≤ size < 7M); Long (size ≥ 7M). Below each category we also indicate the percentage of postings belonging to its posting lists.

The main drawback of VByte lies in its byte-aligned nature, which means that the number of bits needed to encode an integer cannot be less than 8. For this reason, VByte is only suitable for large numbers. However, inverted lists are notoriously known to exhibit a clustering effect, i.e., they present regions of close identifiers that are far more compressible than highly scattered regions [30, 32, 34]. Such natural clusters are present because the indexed data itself tends to be very similar. As a simple example, consider all the Web pages belonging to the same site: these are likely to share a lot of terms. Also, the values stored in the columns of databases typically exhibit high locality: that is why column-oriented databases can achieve very good compression and high query throughput [5].

The key point is that efficient inverted index compression should exploit as much as possible the clustering effect of the inverted lists. VByte currently fails to do so and, as a consequence, it is believed to be space-inefficient for inverted indexes.

The motivating experiment. As an illustrative example, consider two sequences of five integers each, such as (1, 2, 3, 4, 5) and (10, 35, 60, 85, 110). To reduce the values of the integers, VByte compresses the differences between successive values, known as delta-gaps or d-gaps, i.e., the sequences (1, 1, 1, 1, 1) and (10, 25, 25, 25, 25) respectively (the first integer is left as it is). Now, it is easy to see that VByte will use 5 bytes to encode both sequences, but the first one can be compressed much better, with just 5 bits. To better highlight how this behavior can deeply affect compression effectiveness, we consider the statistic shown in Figure 1. This statistic reports the percentage of postings belonging to dense and sparse regions of the lists for the two widely used datasets Gov2 and ClueWeb09. More precisely, the plot originated from the following experiment: we divided each inverted list into chunks of 128 integers and we considered as sparse a chunk where VByte yielded the best space usage with respect to the characteristic bit-vector representation of the chunk (if u is the last element in the chunk, we set the i-th bit in a bitmap of size u for each integer i belonging to the chunk), regarded as the dense case. We also clustered the inverted lists by their sizes, in order to show where dense and sparse regions are most likely to be present.

The experiment clearly shows that we have a majority of dense regions, thus explaining why in this case VByte is not competitive with bit-aligned encoders and, thus, motivating the need for a better encoding strategy that adapts to such distribution without compromising the query processing speed of VByte. We can also conclude that such optimization is likely to pay off because, on both Gov2 and ClueWeb09, the majority of integers concentrate in the lists of medium size (thanks to the Zipfian distribution of words in text), where indeed more than half of them belong to dense chunks.
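The chunk classification used for this statistic can be sketched as follows: a minimal C++ fragment with illustrative names, assuming chunks of strictly increasing docIDs.

    #include <cstddef>
    #include <cstdint>

    static size_t vbyte_bits(uint64_t x) {  // cost of one integer under VByte
        size_t bits = 8;
        while (x >>= 7) bits += 8;
        return bits;
    }

    // chunk[0..n) holds increasing docIDs; 'base' is the last element of the
    // previous chunk (0 for the first one). A chunk is dense when its
    // characteristic bit-vector costs no more than VByte on its d-gaps.
    static bool is_dense(const uint64_t* chunk, size_t n, uint64_t base) {
        size_t vbyte_cost = 0;
        uint64_t prev = base;
        for (size_t i = 0; i < n; ++i) {
            vbyte_cost += vbyte_bits(chunk[i] - prev);  // cost of the i-th d-gap
            prev = chunk[i];
        }
        size_t bitmap_cost = size_t(chunk[n - 1] - base);  // one bit per universe value
        return bitmap_cost <= vbyte_cost;
    }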

Our contributions. We list here our main contributions.

  1. We disprove the folklore belief that VByte is too large to be considered space-efficient for compressing inverted indexes, by exhibiting a compression ratio improved by 2× on the standard datasets Gov2 and ClueWeb09.

    The result is achieved by partitioning the inverted lists into blocks and representing each block with the most suitable encoder, chosen between VByte and the characteristic bit-vector representation. Partitioning the lists has the potential of adapting to the distribution of the integers in the lists, such as the ones shown in Figure 1, by adopting VByte for the sparse regions where larger d-gaps are likely to be present.

  2. Since we cannot expect the dense regions of the lists to always be aligned with uniform boundaries, we consider the optimization problem of minimizing the space of representation of an inverted list of size n by representing it with variable-length partitions. To solve the problem efficiently, we introduce an algorithm that finds the optimal partitioning in Θ(n) time and O(1) space.

    We remark that the state-of-the-art dynamic programming algorithm in [32] can be used as well to find an ε-optimal solution in O(n log(1/ε)/ε) time and O(n) space for any ε > 0, but it is noticeably slower than our approach and approximate, rather than exact.

    We also remark that, although we conducted the experiments using VByte, our optimal algorithm can be applied to any point-wise encoder, that is, whenever the chosen encoder needs a number of bits to represent an integer that solely depends on the value of the integer and not on the universe and size of the chunk to which it belongs.

  3. We conduct an extensive experimental analysis to demonstrate the effectiveness of our approach on standard large datasets, such as Gov2 and ClueWeb09. More precisely, when compared to the un-partitioned VByte indexes, the optimally-partitioned counterparts are: (1) significantly smaller on average; (2) only marginally slower at computing boolean conjunctions and performing sequential decoding; (3) even faster to build on large datasets thanks to the introduced fast partitioning algorithm and improved compression ratio.

    We compare the performance of partitioned VByte indexes against several state-of-the-art encoders, such as: partitioned Elias-Fano (PEF) [32], Binary Interpolative coding (BIC) [30], the optimized PForDelta (OptPFD) [45], a recent proposal based on Asymmetric Numeral Systems (ANS) [29] and QMX [41]. The partitioned VByte representation substantially reduces the gap between the space of VByte and that of the best bit-aligned compressors, such as, for example, PEF and BIC. Moreover, it also offers the fastest query processing speed in all cases, except against the QMX mechanism, which is slightly faster but also significantly larger.

2 Related work

2.1 Compressors for inverted lists

In this subsection we overview the most important compressors devised for efficient inverted list representation. In addition to the ones we review in the following, we remark that well-known compressors like Elias' γ and δ [19] and Golomb [23] are known to obtain inferior compression ratios for inverted index storage with respect to the state-of-the-art, thus we do not consider them. We point the reader to [34] for a recent and concise survey describing such mechanisms.

Other recent techniques, instead, consider the compression of the index as a whole, by trying to represent many inverted lists together, thus reducing the implicit redundancy present in the lists [33, 13, 46]. These results are orthogonal to the work we present in this paper.

Block-based. Blocks of contiguous integers can be encoded separately, to improve both compression ratio and retrieval efficiency. This line of work finds its origin in the so-called frame-of-reference (For) [22]. A simple example of this approach, called binary packing, encodes blocks of fixed length, e.g., 128 integers [7, 25]. To reduce the value of the integers, we can subtract from each integer the previous one (the first integer is left as it is), making each block be formed by integers greater than zero, known as delta-gaps (or just d-gaps). Scanning a block then needs to re-compute the original integers by computing their prefix sums. In order to avoid the prefix sums, we can instead encode the difference between each integer and the first element of the block (base+offset encoding) [32, 43]. Using more than one compressor to represent the blocks, rather than only one, can also introduce significant improvements in query time within the same space constraints [31].
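For reference, the d-gap transformation and its inverse can be sketched as follows (a minimal C++ fragment; names are illustrative).

    #include <cstdint>
    #include <vector>

    static std::vector<uint64_t> to_gaps(const std::vector<uint64_t>& docs) {
        std::vector<uint64_t> gaps(docs.size());
        for (size_t i = 0; i < docs.size(); ++i)
            gaps[i] = i == 0 ? docs[0] : docs[i] - docs[i - 1];  // first value kept as-is
        return gaps;
    }

    static std::vector<uint64_t> from_gaps(const std::vector<uint64_t>& gaps) {
        std::vector<uint64_t> docs(gaps.size());
        uint64_t sum = 0;
        for (size_t i = 0; i < gaps.size(); ++i)
            docs[i] = sum += gaps[i];  // prefix sum restores the original values
        return docs;
    }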

Other binary packing strategies are Simple-9 [6], Simple-16 [7] and QMX [41], that combine relatively good compression ratio and high decompression speed. The key idea is to try to pack as many integers as possible in a memory register (32, 64 or 128 bits). Along with the data bits, a selector is used to indicate how many integers have been packed together in a single unit. In the QMX mechanism the selectors are run-length encoded.

PForDelta. The biggest limitation of block-based strategies is that they are inefficient whenever a block contains at least one large element, because this causes the compressor to use a number of bits per element proportional to the one needed to represent that large value. To overcome this limitation, PForDelta was proposed [48]. The main idea is to choose a proper value b for the universe of representation of the block, such that a large fraction, e.g., 90%, of its integers fall in the range [0, 2^b) and, thus, can be written with b bits each. This strategy is called patching. All integers that do not fit in b bits are treated as exceptions and encoded separately using another compressor.

The optimized variant of the encoding [45], which selects for each block the value of b and the set of exceptions that minimize its space occupancy, has been demonstrated to be more space-efficient and only slightly slower than the original PForDelta [45, 25].
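The patching idea can be sketched as follows: a minimal C++ fragment, with illustrative names, that selects b as the smallest width covering roughly 90% of a block; real implementations, and in particular the optimized variant, also weigh the cost of encoding the exceptions.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    static unsigned bits_needed(uint64_t x) {
        unsigned b = 0;
        while (x) { ++b; x >>= 1; }
        return b;
    }

    static unsigned choose_b(std::vector<uint64_t> block) {
        std::sort(block.begin(), block.end());
        size_t idx = (block.size() * 90) / 100;        // 90th percentile
        if (idx) --idx;
        return std::max(1u, bits_needed(block[idx]));  // values up to it fit in b bits
    }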

Elias-Fano. This strategy directly encodes a monotone integer sequence without a first delta encoding step. It was independently proposed by Elias [18] and Fano [20], hence its name. Given a sequence of size n drawn from a universe of size u, its Elias-Fano representation takes at most n⌈log₂(u/n)⌉ + 2n bits, which can be shown to be less than half a bit away from optimality per element [18]. The encoding has been recently applied to the representation of inverted indexes [42] and social networks [14], thanks to its excellent space efficiency and powerful search capabilities, namely random access in O(1) time and successor queries in O(1 + log(u/n)) time. The latter operation, which, given an integer x, returns the smallest integer y of the sequence such that y ≥ x, is the fundamental one when resolving boolean conjunctions over inverted lists (see [42, 32, 33] for details). As standard, we refer to this primitive with the name NextGEQ (Next Greater than or EQual to) in the whole paper.
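For reference, the semantics of NextGEQ on a plain sorted array can be sketched in a few lines of C++ (Elias-Fano answers the same query directly on the compressed representation):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    static uint64_t next_geq(const std::vector<uint64_t>& list, uint64_t x) {
        auto it = std::lower_bound(list.begin(), list.end(), x);  // smallest y >= x
        return it == list.end() ? UINT64_MAX : *it;               // sentinel if none exists
    }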

The partitioned variant of Elias-Fano (PEF) [32] splits a sequence into variable-sized partitions and represents each partition with Elias-Fano. The partitioned representation significantly improves the compression ratio of Elias-Fano while preserving its query processing speed. In particular, it currently embodies the best trade-off between index space and query processing speed.

Binary Interpolative Coding. Binary Interpolative Coding (BIC) [30] is another approach that, like Elias-Fano, directly compresses a monotonically increasing integer sequence. In short, BIC is a recursive algorithm that first encodes the middle element of the current range and then applies this encoding step to both halves. At each step of recursion, the algorithm knows the reduced ranges that will be used to write the middle elements in fewer bits during the next recursive calls.

Many papers in the literature have experimentally shown that BIC is one of the most space-efficient methods for storing highly clustered sequences, though among the slowest at decoding [45, 38, 32, 33].

Asymmetric Numeral Systems. Asymmetric Numeral Systems (ANS) is a family of entropy coders, originally developed by Jarek Duda [17]. ANS combines the excellent compression ratio of Arithmetic coding with a decompression speed comparable to that of Huffman coding. It is now widely used in commercial applications, like Facebook's ZSTD, Apple's LZFSE and in the Linux kernel.

The basic idea of ANS is to represent a sequence of symbols with a natural number x. If each symbol s belongs to the binary alphabet {0, 1}, then appending s to the end of the binary string representing x will generate the new integer x′ = 2x + s. This coding is optimal whenever the two symbols are equiprobable. ANS generalizes this concept by adapting it to a generic distribution of symbol probabilities {p_s}. In particular, appending a symbol s to x increases the information content from log₂ x bits to log₂ x + log₂(1/p_s) bits, thus the new generated natural number will be x′ ≈ x/p_s.
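The binary special case just described can be sketched as follows (a toy C++ fragment with illustrative names; actual ANS replaces the step x′ = 2x + s with one yielding a value close to x/p_s):

    #include <cstdint>
    #include <vector>

    static uint64_t append_bits(const std::vector<int>& symbols) {
        uint64_t x = 1;                        // leading 1 marks the start of the message
        for (int s : symbols) x = 2 * x + s;   // one extra bit of information per symbol
        return x;
    }

    static std::vector<int> extract_bits(uint64_t x) {
        std::vector<int> symbols;
        while (x > 1) {                        // decode in reverse: last appended, first out
            symbols.insert(symbols.begin(), int(x & 1));
            x >>= 1;
        }
        return symbols;
    }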

Recently, ANS-based techniques for inverted index compression have been proposed [28, 29]. Such investigation has led to a higher compression ratio than that of BIC, with faster query processing speed.

The Variable-Byte family. Various encoding formats for VByte have been proposed in the literature in order to improve its sequential decoding speed. By assuming that the largest represented integer fits into 4 bytes, two bits are sufficient to describe the number of bytes needed to represent an integer. In this way, groups of four integers require one control byte that has to be read once as header information. This optimization was introduced in Google's Varint-GB [15] and reduces the probability of a branch misprediction which, in turn, leads to higher instruction throughput. Working with byte-aligned codes also opens the possibility of exploiting the parallelism of SIMD (Single Instruction Multiple Data) instructions of modern processors to further enhance the decoding speed. This is the line of research taken by the recent proposals that we overview below.

Varint-G8IU [39] uses a similar idea to the one of Varint-GB but it fixes the number of compressed bytes rather than the number of integers: one control byte is used to describe a variable number of integers in a data segment of exactly 8 bytes, therefore each group can contain between two and eight compressed integers.

Masked-VByte [35] directly works on the original VByte format. The decoder first gathers the most significant bits of consecutive bytes using a dedicated SIMD instruction. Then, using previously-built look-up tables and a shuffle instruction, the data bytes are permuted to obtain the original integers.

Stream-VByte [26], instead, separates the encoding of the control bytes from the data bytes, by writing them into separate streams. This organization permits decoding multiple control bytes simultaneously and, therefore, reduces the branch mispredictions that can stall the CPU pipeline when decoding the data stream.
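As an illustration of the group-based formats, here is a sketch of a Varint-GB-style encoder in C++ (the exact placement of the 2-bit length descriptors inside the control byte may differ across implementations):

    #include <cstdint>
    #include <vector>

    // One control byte stores four 2-bit length descriptors (1 to 4 bytes),
    // followed by the data bytes of the four integers.
    static void varint_gb_encode(const uint32_t (&group)[4], std::vector<uint8_t>& out) {
        uint8_t control = 0;
        size_t control_pos = out.size();
        out.push_back(0);  // placeholder for the control byte
        for (int i = 0; i < 4; ++i) {
            uint32_t x = group[i];
            int len = x < (1u << 8) ? 1 : x < (1u << 16) ? 2 : x < (1u << 24) ? 3 : 4;
            control |= uint8_t(len - 1) << (2 * i);    // 2 bits per integer
            for (int b = 0; b < len; ++b)
                out.push_back(uint8_t(x >> (8 * b)));  // little-endian data bytes
        }
        out[control_pos] = control;
    }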

2.2 Partitioning algorithms

The simplest partitioning strategy is to fix the length of every partition, e.g., b integers, and split the list into ⌈n/b⌉ blocks, where n is the size of the list (the last partition could potentially be smaller than b integers). We call this partitioning strategy uniform. The advantage of this representation is simplicity, since no expensive calculation is needed prior to encoding. However, we cannot expect this strategy to yield the most compact indexes, because the highly clustered regions of inverted lists are likely to be broken by such fixed-size partitions.

This is the main motivation for introducing optimization algorithms that try to find the best partitioning of the list, thus minimizing its space of representation. Previous work used dynamic programming extensively [10, 38, 21, 32]. More precisely, Silvestri and Venturini [38] obtained an O(nk) construction time, where n is the length of the inverted list and k the length of its longest partition. Ferragina et al. [21] improve the result in [10] by computing a partitioning whose cost is guaranteed to be at most (1 + ε) times away from the optimal one, for any ε > 0, in O(n log_{1+ε} n) time. Their approach can be applied to any encoder E whose cost in bits can be computed (or, at least, estimated) in constant time, for any portion of the input.

Finally, Ottaviano and Venturini [32] resort to ideas similar to the ones presented in [21] to obtain a running time of O(n log(1/ε)/ε), while preserving the same approximation guarantees. Note that the complexity is O(n) as soon as ε is constant. Since their dynamic programming technique is the state-of-the-art and our algorithm aims at improving such result, we will review it in Section 3.

3 Optimal partitioning in linear time

In this section we study the problem of partitioning a monotone integer sequence S of size n to improve its compression, by adopting a 2-level representation. This data structure stores S as a sequence of partitions that are concatenated in the second level L2. The first level L1 stores, instead, a fixed amount of bits, say F, for each partition, needed to describe its size and largest element. Clearly, F can be safely upper bounded by 2⌈log₂ u⌉ bits, where u is the universe of the sequence. This representation has several important advantages over a shallow representation:

  1. it permits choosing the most suitable encoder for each partition, given its size and upper bound, hence improving the overall index space;

  2. each partition can be represented in a smaller universe, by subtracting from all its elements the base value, i.e., the largest element of the preceding partition, thus contributing a further reduction in space;

  3. it allows faster access to the individual elements of S, since we can first locate the partition to which an element belongs and, then, conclude the search in that partition only.

Now, the natural problem is how to choose the lengths and encoders of the partitions in order to minimize the space of representation of S. As already noted, the problem is not trivial, since we cannot expect the dense regions of the lists to always be aligned with fixed-size partitions. While a dynamic programming recurrence offers an optimal solution to this problem in quadratic time by trivially considering the cost of all possible splittings, this approach is clearly infeasible already for modest sizes of the input. Therefore, we need smarter methods, such as the ones we describe in the following.

3.1 Dynamic programming: slow and approximated

The core idea of this approach is to not consider all possible splittings, but only the ones whose cost is able to amortize the fixed cost F [32]. More precisely, it finds a solution whose encoding cost is at most (1 + ε) times away from the optimal one in O(n log(1/ε)/ε) time and O(n) space, for any ε > 0. Note that the time complexity is linear as soon as ε is constant. We now quickly describe the technique and highlight its main drawback.

The problem of determining the partitioning of minimum cost can be modeled as the problem of determining the path of minimum cost (shortest) in a complete, weighted and directed acyclic graph (DAG) G. This DAG has n vertices, one for each position of S, and it is complete, i.e., it has Θ(n²) edges, where the cost w(i, j) of edge (i, j) represents the number of bits needed to represent S[i..j]. Since the DAG is complete, a simple shortest path algorithm will not suffice to compute an optimal solution efficiently. Thus, we proceed by sparsification of G, as follows. We first consider a new DAG G_ε, which is obtained from G and has the following properties: (1) the number of edges is O(n log_{1+ε} U) for any given ε > 0; (2) its shortest path distance is at most (1 + ε) times the one of the original DAG G, where U represents the encoding cost of S when no partitioning is performed. It can be proven that the shortest path algorithm on G_ε finds a solution which is at most (1 + ε) times larger than an optimal one, in O(n log_{1+ε} U) time, because G_ε has O(n log_{1+ε} U) edges [21]. To further reduce the complexity while preserving the same approximation guarantees, we define two approximation parameters, ε₁ and ε₂. We first retain from G_{ε₁} all the edges whose cost is sufficiently small with respect to F and ε₂, then we apply the pruning strategy described above with ε₁ as approximation parameter. The obtained graph has O(n log_{1+ε₁}(1/ε₂)) edges, which is O(n) as soon as ε₁ and ε₂ are constant. Again, it can be proven that the shortest path distance is no more than (1 + ε) times the one in G, by suitably choosing ε₁ and ε₂ as functions of ε [32].

Despite the theoretical linear-time complexity for constant ε, the main drawback of the algorithm lies in its high constant factor. For example, even for small values of the approximation parameters the hidden constant is large, which results in a noticeable cost in practice. Although enlarging the parameters can reduce the constant at the price of reducing the compression efficacy, this remains the bottleneck of the building step of large inverted indexes.

3.2 Our solution: fast and exact

The interesting research question is whether there exists an algorithm that finds an exact solution, rather than an approximate one, in linear time and with low constant factors. This section answers this question positively, by showing that if the cost function of the chosen encoder is point-wise, i.e., the number of bits needed to represent a single posting solely depends on such posting and not on the universe and size of the partition it belongs to, the problem admits an optimal and fast solution in Θ(n) time and O(1) space.

In the following, we first overview and discuss our solution by explaining the intuition that lies at its core, then we give the full technical details along with a proof of optimality and the relative pseudo-code.

3.2.1 Overview

We are interested in computing the partitioning of S whose encoding cost is minimum, by using two different encoders that take into account the relation between the size and universe of each partition. We already motivated the potential of this strategy by explaining Figure 1, which shows the distribution of the integers in dense and sparse regions of the inverted lists. Let us consider a partition S[i..j] of relative universe u and size m. Intuitively, when m gets closer to u the partition becomes denser; vice versa, it becomes sparser whenever m diverges from u. Thus the encoding cost of the partition is chosen to be the minimum between u bits (dense case) and the sum of the costs of its elements under E (sparse case), where B, the characteristic bit-vector of the partition, is used in the dense case and E is the chosen point-wise encoder for sparse regions.
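This cost function can be sketched as follows, taking VByte as the point-wise encoder E (a minimal C++ fragment; the names, and the inclusion of the fixed header cost F, are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    static size_t vbyte_bits(uint64_t x) {  // E(x) for VByte
        size_t bits = 8;
        while (x >>= 7) bits += 8;
        return bits;
    }

    // Cost in bits of the partition S[i..j]: the cheaper of the characteristic
    // bit-vector (u bits) and the point-wise costs of the d-gaps under E,
    // plus the fixed header cost F.
    static size_t chunk_cost(const uint64_t* S, size_t i, size_t j, size_t F) {
        uint64_t base = i ? S[i - 1] : 0;
        size_t sparse = 0;
        uint64_t prev = base;
        for (size_t k = i; k <= j; ++k) {
            sparse += vbyte_bits(S[k] - prev);
            prev = S[k];
        }
        size_t dense = size_t(S[j] - base);  // relative universe u
        return F + std::min(dense, sparse);
    }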

Examples of point-wise encoders are VByte [40], Elias' γ-δ [19] and Golomb [23]. Other encoders, such as Elias-Fano [20, 18], Binary Interpolative Coding [30] and PForDelta [45], are not point-wise, since a different number of bits could be needed to represent the same integer when it belongs to partitions having different characteristics, namely different lengths and universes. To clarify what we mean, consider, for instance, the sequence (1, 2, 3, 4, 5, 100, 102, 104, 106, 108). Let us now compare the behavior of Elias-Fano (non point-wise) and VByte (point-wise), recalling that Elias-Fano spends ⌈log₂(u/n)⌉ + 2 bits per element on a partition of size n and relative universe u. By performing no splitting, Elias-Fano will use 6 bits to represent each posting. By splitting the sequence into the two halves (1, 2, 3, 4, 5) and (100, 102, 104, 106, 108), the first five values will be represented with 2 bits each, but the next five values with 7 bits each. Instead, by splitting it into the first six and the last four values, the first six values will use 7 bits each, while the next four only 3 bits each. Thus, performing different splittings changes the cost of representation of the same postings for a non point-wise encoder, such as Elias-Fano. Instead, it is immediate to see that VByte will encode each element of this sequence with 8 bits, regardless of any partitioning.

The above example gives an intuitive explanation of why it is possible to design a light-weight approach for a point-wise encoder E: we can compute the number of bits needed to represent a partition of S with E by just scanning its elements and summing up their costs, knowing that performing a splitting will change neither their cost of representation nor, therefore, the one of the partition. This means that as long as the cost of E on the current partition is less than the cost of B, we know that the partition will be represented optimally with E. Therefore, we can safely keep scanning the sequence until the difference in cost between E and B becomes more than F bits. At this point, it means that E is wasting more than F bits with respect to B, thus we should stop encoding the current partition with E, because we can afford to pay the fixed cost F and continue the encoding with B. Now, the crucial question we would like to answer is: at which position should we stop encoding with E and switch to B? The answer is simple: we should stop at the position p at which we saw the maximum difference between the costs of E and B, because splitting at any other point will yield a larger encoding cost. In other words, p represents the position at which E gains most with respect to B, so we would be wasting bits by splitting before or after position p. Observe that we must also require such gain to be more than F bits, otherwise switching encoder will actually cause a waste of bits. In other terms, we say that in such case the gain would not be sufficient to amortize the fixed cost of the partition, meaning that we should not split the sequence yet.

In conclusion, we encode the partition up to position p with E and know that the following elements will now be best represented with B, rather than with E. Figure 2 offers a pictorial representation of how the difference between the encoding costs of E and B, referred to as the gain function, changes during the scan of S. When the function is decreasing, it means that E is winning over B, i.e., its encoding cost is less; conversely, when B is more efficient than E, the function is increasing.

After encoding the first partition, the process repeats: (1) we keep scanning until B loses more than F bits with respect to E; at that point (2) we encode with B the elements up to the position p at which the maximum gain of B with respect to E was seen, provided that such gain is greater than F bits. We keep alternating compressors until the end of the sequence.

Figure 2: In case (a), we should split in correspondence of position p because there the gain is the minimum among all the points whose gain is sufficiently below the one at the beginning of the interval; in case (b) we should not split the sequence because, although an increase in gain follows, we do not have a sufficiently high gain up to that point to amortize the cost of the splitting.

Before sketching a compact pseudo-code of our algorithm, we first make some observations. First of all note that, for all partitions except the first, we need to amortize twice the fixed cost, because we could potentially merge the last formed partition with the current one: thus, in order to be beneficial, the difference in the cost of the two encoders must be larger than 2F bits. Again, refer to Figure 2a for an example. Also, for illustrative purposes, in the above discussion we have assumed that the first partition is best encoded with E: clearly, B could be better at the beginning, but the algorithm works in the very same way.

In the most general terms, call L the encoder used to represent the last encoded partition and C the current one. These will be either E or B. We also indicate with the same letters the costs in bits of their representation of the current partition. Finally, let g indicate the best gain of C with respect to L. At a high level, the skeleton of our algorithm looks as follows.

  1. Encode the first partition.

  2. If the cost of C is smaller than that of L and the gain g is greater than 2F bits, encode the current partition with C and swap the roles of C and L.

  3. Repeat step (2) until the end of the sequence.

  4. Encode the last partition.

In the above pseudo-code, the encoding of the first and last partitions is treated separately, because these must amortize a fixed cost of F bits instead of 2F bits, given that we have no partition before and after, respectively (see Figure 2b).

It is immediate to see that the described approach can be implemented using O(1) space, because we only need to keep the difference between the costs of E and B (plus some cursor variables), and that it runs in Θ(n) time, because we calculate the cost in bits of each integer exactly once. We have, therefore, eliminated the linear-space complexity of any dynamic programming approach, because we do not need to maintain the costs of the shortest paths ending in each position of S. Moreover, the introduced algorithm has very low constant factors in the time complexity, since it just performs a few comparisons and updates of some variables for each integer of S.
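To make the idea concrete, the following C++ fragment sketches the scan; it is deliberately simplified with respect to Algorithm 1 below, in that every switch, including the first and last ones, must amortize 2F bits, and names are illustrative.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    static int64_t vbyte_bits(uint64_t x) {  // E(gap) for VByte
        int64_t b = 8;
        while (x >>= 7) b += 8;
        return b;
    }

    // Returns the positions ending each partition. Each partition is then
    // encoded with the cheaper of E (VByte) and B (characteristic bit-vector).
    static std::vector<size_t> partition(const std::vector<uint64_t>& S, int64_t F) {
        std::vector<size_t> cuts;
        int64_t g = 0, g_min = 0, g_max = 0;   // gain and its extrema in the interval
        size_t p_min = 0, p_max = 0;
        uint64_t prev = 0;
        for (size_t i = 0; i < S.size(); ++i) {
            uint64_t gap = S[i] - prev;
            prev = S[i];
            g += vbyte_bits(gap) - int64_t(gap);  // E cost minus B cost of this posting
            if (g < g_min) { g_min = g; p_min = i; }
            if (g > g_max) { g_max = g; p_max = i; }
            if (g_min <= -2 * F && g - g_min >= 2 * F) {
                cuts.push_back(p_min);         // E dominated up to p_min: cut there
                g -= g_min; g_min = 0;         // re-base gains to the new interval
                g_max = g; p_max = i;
            } else if (g_max >= 2 * F && g_max - g >= 2 * F) {
                cuts.push_back(p_max);         // B dominated up to p_max: cut there
                g -= g_max; g_max = 0;
                g_min = g; p_min = i;
            }
        }
        if (!S.empty()) cuts.push_back(S.size() - 1);  // close the last partition
        return cuts;
    }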

3.2.2 Technical discussion

Let g be the gain function, defined as the difference between the number of bits spent by E and by B on the prefix S[1..i], for i = 1..n. In order to describe the properties of our solution, we first need the following definition.

Definition 1.

Given 1 ≤ i ≤ j ≤ n, the integer S[p], with i ≤ p ≤ j, is the point dominating S[i..j] for the encoder E if

g(p) ≤ g(k) for all i ≤ k ≤ j, with g(i) − g(p) ≥ T, (1)

where T = 2F if S[i..j] is an internal interval of S, or T = F otherwise, and p satisfies one of the following:

g(j) − g(p) ≥ T; (2)
j = n. (3)

Notice that the dominating point may not exist for a given sub-sequence S[i..j], but if it exists, it must be unique, provided ties are broken consistently. Clearly, the definition of dominating point for encoder B is symmetric to Definition 1.

The above definition says that, given S[i..j], we can always improve its cost of representation by splitting the interval in correspondence of the dominating point, if it exists; otherwise we should not split S[i..j]. It is easy to see that the dominating point in S[i..j] is the point at which the difference between the costs of the two compressors is maximized, thus it will only be beneficial to split at this point rather than at any other point, as we explained in the previous paragraph. It is also easy to see why we should search the dominating point among the ones whose gain is at least T bits below the gain at the beginning of the interval. The threshold T is set to the minimum amount of bits needed to amortize the cost of switching from one compressor to the other. Consider Figure 2a and suppose we are encoding with B before and after the interval. If we compress with E the partition ending at the dominating point, we are switching encoder twice, thus the gain must be at least 2F bits to be able to amortize the cost of the two switches. In Figure 2b, instead, we have no partition before, thus we only need to amortize the cost of a single switch.

Now, let pos(x) be the position of the integer x in S. Our strategy consists in splitting the sequence in correspondence of the dominating points, as defined above. More precisely, the solution P output by this strategy can be described by the following recursive equation

P = {p₀, p₁, …, p_m}, where p_t dominates S[p_{t−1}..p_{t+1}] for 0 < t < m, (4)

with p₀ = 1 and p_m = n. In other words, any p ∈ P, except for the first and the last, is the dominating point of the interval whose endpoints are dominating points as well.

Notice that, by definition, there cannot be two adjacent dominating points relative to the same encoder: they must be relative to different encoders. In fact, suppose we have a dominating point for E: it means that either we have seen an increase in gain of T bits after it and, thus, the dominating point after it (if found) will be relative to B, or the gain never rises again by T bits and a dominating point after it is not found. This means that P alternates the choice of compressors, i.e., a partition encoded with E is delimited by two partitions encoded with B and vice versa (except for the first and last). We call such behavior alternating.

In particular, our strategy will encode with compressor E all partitions ending with a dominating point for E (and starting with a dominating point for B, since P alternates the compressors). The same holds for B. As already pointed out, the only exception is made for the last partition, because the last position of S cannot be a dominating point by definition (no increase or decrease in gain is possible after the end of the sequence). In this case, the strategy selects the compressor that yields the minimum cost over the last partition.

Since a feasible solution to the problem is either a singleton partition or any sequence of strictly increasing positions, we argue that P is a feasible solution. This follows automatically from the definition of dominating point, because such points are different from one another and, therefore, their positions are strictly increasing. If no dominating points exist, then P will be empty: it is a feasible solution too and indicates that S should not be cut (singleton partition). We now show the following lemma.

Figure 3: Path of minimum cost up to a given position (thick black line) and its representation in terms of the gain function; p is the position of the dominating point involved in the exchange argument of Lemma 1.
Lemma 1.

The partitioning P is optimal.

Proof.

As already noted in Subsection 3.1, an optimal solution to the problem can be thought of as a path of minimum cost in the DAG whose vertices are the positions of the integers of S, with w(i, j) being the cost of edge (i, j). Thus, suppose that P is not a shortest path and let π be a shortest path sharing the longest common prefix with P. Refer to Figure 3 for a graphical representation: v is the largest position shared by π and P, and p is the dominating point that P selects right after v. We want to show that we can replace the edge (v, x) of π with the path (v, p), (p, x) without changing the cost of π, therefore extending the longest common prefix up to node p (the case in which p follows x is symmetric). We argue that this is only possible if P is optimal, because otherwise it would mean that π is not a shortest path sharing the longest common prefix with P, which is absurd by assumption.

First note that both edges must be encoded with the same compressor. In fact, suppose they are not, for example that (v, x) is encoded with B and (v, p) with E. Since π is optimal, we know that it is optimal to encode with B until x. However, the fact that p is a dominating point for E implies that E gains at least T bits over B in S[v..p], which is absurd because B is optimal until x. Therefore, both edges must use the same encoder. Assume that it is E (the case for B is symmetric).

The fact that (v, x) belongs to the optimal solution means that if we split the edge into two (or more) pieces, we cannot decrease the cost, i.e., w(v, x) ≤ w(v, p) + w(p, x). Since E is point-wise, the data bits on the two sides of p sum up to those of the whole edge, and thus, by imposing the above inequality, we obtain (1) that splitting at p cannot gain any bits. Vice versa, the fact that p is a dominating point for E means that from v to x the cost is higher if we keep the same encoder without splitting, i.e., w(v, p) + w(p, x) ≤ w(v, x). Again, this follows by exploiting the fact that E is point-wise, giving (2) the opposite inequality. Conditions (1) and (2) together imply that it must be w(v, x) = w(v, p) + w(p, x), thus we have no change in the cost of π by performing the exchange, which contradicts our assumption that P was not optimal.

We are now left to present a detailed algorithm that computes P, i.e., one that iteratively finds all the dominating points of S according to Equation 4. We argue that the function optimal_partitioning coded in Algorithm 1 does the job.

Input: A monotone integer sequence S[1..n] and the fixed cost F
Output: The partitioning P
1  Function optimal_partitioning(S, F)
2        P ← [ ]; ℓ ← 0
3        g ← 0
4        max_B ← 0; g_B ← 0
5        max_E ← 0; g_E ← 0
6        for i = 1 to n do
7              g ← g + E(S[i]) − B(S[i])
8              if g is non-decreasing then
9                    if g > g_B then
10                         max_B ← i; g_B ← g
11                   if g_E ≤ −T and g − g_E ≥ T then
12                         update(max_E)
13             else
14                   if g < g_E then
15                         max_E ← i; g_E ← g
16                   if g_B ≥ T and g_B − g ≥ T then
17                         update(max_B)
18       close()
19       return P
Algorithm 1 The partitioning algorithm (T denotes the amortization threshold of Definition 1: 2F for internal partitions, F for the first one)

Before proving that the algorithm is correct, let us explain the meaning of the variables used in the pseudo-code.

Call ℓ the last position added to P. Variables max_B and max_E keep track of the positions of the points dominating the interval S[ℓ..i] for, respectively, the B and E encoders. Likewise, g_B and g_E are the gains in correspondence of positions max_B and max_E; g indicates the gain at step i, for i = 1..n.

1 append p to P; ℓ ← p
2 g ← g − g_p
3 if g ≥ 0 then g_B ← g; max_B ← i; g_E ← 0; max_E ← ℓ
4 else g_E ← g; max_E ← i; g_B ← 0; max_B ← ℓ
Algorithm 2 update(p)
1 if g_B ≥ T and g_B ≥ g then
2       update(max_B)
3 if g_E ≤ −T and g_E ≤ g then
4       update(max_E)
5 if encoding S[ℓ+1..n] with E costs no more than with B then
6       encode the last partition with E
7 else
8       encode the last partition with B
Algorithm 3 close()
Lemma 2.

Algorithm 1 is correct.

Proof.

We want to show that the array P returned by the function optimal_partitioning contains all the positions of the dominating points, as recursively described by Equation 4. We proceed by induction on the elements of P.

The main loop in lines 6-17:

  1. computes the gain g at step i (line 7);

  2. updates the variables max_B and g_B (lines 9-10) and max_E and g_E (lines 14-15);

  3. adds new positions to P (lines 11-12 and 16-17).

Correctness of points 1 and 2 is immediate: the crucial point to explain is the third.

The if statements in lines 11 and 16 check whether positions max_E and max_B are the positions of a dominating point, i.e., whether they satisfy Definition 1. Since the if statements are symmetric, we prove the correctness of the first one, for non-decreasing values of the gain (line 11).

We first check whether the gain, as seen so far, is sufficiently small to be the one of a dominating point for E, as required by Equation 1. At the beginning of the algorithm, the current interval starts at position 0, where the gain is 0, therefore the test g_E ≤ −T is correct. If it is true, then we also check whether we have a sufficiently large increase in gain at the current step with respect to the previously seen gain, according to Condition 2. Again, it is immediate to see that the test g − g_E ≥ T checks such condition and, therefore, is correct. If both conditions are satisfied, then max_E is the position of the dominating point for E in the first interval, by Definition 1. If so, we can execute the update code, shown in Algorithm 2, which adds max_E to P and sets ℓ to max_E, according to Definition 1. Moreover, it updates the gain to maintain the invariant that its value is always relative to the current interval, which now begins at position max_E. In fact, since we have seen an increase of at least T bits, the gain in max_B must be the current re-based gain, whereas the gain g_E is 0 because g is non-decreasing. Thus, the first point is computed correctly.

Now, assume that we have added t points to P and that the last added is for encoder E. We want to show that the next point will be dominating for encoder B. As explained before, whenever we add a dominating point for E to P, it means that we have seen an increase of T bits with respect to the last added position, i.e., that position satisfies Equation 1 for encoder B. Therefore, the (t+1)-th point added to P will be dominating for B.

To conclude, we have to explain what happens at the end of the algorithm. Refer to the close function, coded in Algorithm 3. Lines 1-2 (3-4) check Condition 3:

if successful, then max_B (max_E) is the next dominating point for B (E) and, since compressors must alternate each other, we close the encoding of the sequence with the other compressor in lines 5-8, that is E (B);

if both are unsuccessful, i.e., no dominating point is found, then it means that the remaining part of the sequence should not be cut and is, thus, encoded with a single compressor in lines 5-8.

In conclusion, since we consider each element of S once and use a constant number of variables, Lemmas 1 and 2 imply the following result.

Theorem 1.

A monotone integer sequence of size n can be partitioned optimally in Θ(n) time and O(1) space, whenever its partitions are represented with a point-wise encoder and characteristic bit-vectors.

4 Experimental evaluation

Datasets. We performed our experiments on the following two standard datasets, whose statistics are summarized in Table 1.

  • Gov2 is the TREC 2004 Terabyte Track test collection, consisting of roughly 25 million .gov sites crawled in early 2004. Documents are truncated to 256KB.

  • ClueWeb09 is the ClueWeb 2009 TREC Category B test collection, consisting of roughly 50 million English web pages crawled between January and February 2009.

Gov2 ClueWeb09
Documents
Terms
Postings
Table 1: Basic statistics for the tested collections.

Standard text pre-processing was performed before indexing the collections: terms were extracted using Apache Tika, lower-cased and stemmed using the Porter2 stemmer; no stopwords were removed from the collections. Document identifiers (docIDs) were assigned by following the order of the URLs of the documents [37].

Gov2 ClueWeb09
space doc freq building decoding AND query space doc freq building decoding AND query
GB bpi bpi minutes ns/int ms/query GB bpi bpi minutes ns/int ms/query
Varint-GB
Varint-G8IU
Masked-VByte
Stream-VByte
Table 2: The performance of the Variable-Byte family on Gov2 and ClueWeb09: space in gigabytes (GB); average number of bits per integer (bpi) for documents (doc) and frequencies (freq); index building time in minutes; sequential decoding time in nanoseconds per integer (ns/int); and query processing time for TREC-05 AND queries in milliseconds per query (ms/query).

Experimental setting and methodology. All the experiments were run on a machine with an Intel i7-4790K CPU (4 cores, 8 threads) clocked at 4.00GHz and with 32GB of RAM DDR3, running Linux 4.13.0 (Ubuntu 17.10), 64 bits. The implementation of our partitioned indexes is in standard C++14 and is a flexible template library allowing any point-wise encoder to be used, provided that its interface exposes a method to compute the cost in bits of a single integer in constant time. The source code, which will be released upon publication of the paper to favor further research and reproducibility of results, was compiled with gcc 7.2.0 using the highest optimization setting.

To test the building time of the indexes we measure the time needed to perform the whole building process, that is: (1) fetch the posting lists from disk to main memory; (2) encode them in main memory; (3) save the whole index data structure back to a file on disk. Since the process is mostly I/O bound, we make sure to avoid disk caching effects by clearing the disk cache before building the indexes.

To test the query processing speed of the indexes, we memory map the index data structures on disk and compute boolean conjunctions over a set of random queries drawn from TREC-05 and TREC-06 Efficiency Track topics. We repeat each experiment three times to smooth fluctuations in the measurements and consider the mean value. The query algorithm runs on a single core and timings are reported in milliseconds. To test sequential decoding speed instead, we measure the time to touch each element of all the posting lists whose size is in between 0.5 and 5 million postings, for a total of 2033 lists for Gov2 (2.9 billion postings) and 3593 lists for ClueWeb09 (5.5 billion postings).

In all the experiments, we used the same value of the fixed cost F for partitioning the posting lists of both the VByte and Elias-Fano (henceforth, EF) indexes.

Organization. The aim of this section is to measure the space improvement, indexing time and query processing speed of indexes that are optimally partitioned by the algorithm described in Section 3. Since we adopt VByte as exemplar point-wise encoder, the next subsection compares the performance of all the encoders in the VByte family in order to choose the most convenient one for the subsequent experiments. Then, we measure the benefits of applying our optimization algorithm to the chosen VByte encoder, by comparing the corresponding partitioned index against the un-partitioned counterpart. Finally, we consider the comparison with many other state-of-the-art compressors for inverted indexes, as described in Subsection 2.1.

4.1 The performance of the Variable-Byte family

To help us decide which VByte encoder to choose for our subsequent analysis, we consider Table 2. The data reported in the table illustrates how different VByte stream organizations actually impact index space. Since Varint-GB and Stream-VByte take exactly the same space, given that Stream-VByte stores the very same control and data bytes but concatenated in separate streams, in the following we refer to Varint-GB to mean both of them. As we can see, the original VByte format (referred to as Masked-VByte in Table 2 because it uses this algorithm to perform decoding) is the most space-efficient of the family. This is no surprise given the distribution plotted in Figure 1: it means that the majority of the encoded d-gaps fall in the interval [1, 2^7), otherwise the compression ratio of VByte would have been worse than the one of Varint-GB and Varint-G8IU. As an example, consider a sequence of four d-gaps, each in [2^7, 2^8), such as (130, 140, 150, 160). VByte uses 8 bytes to represent such sequence, whereas Varint-GB uses 1 control byte and 4 data bytes, thus 5 bytes in total. When all four values are in [1, 2^7), instead, VByte uses 4 bytes against the 5 needed by Varint-GB. For this reason, the space usage of Varint-GB and Varint-G8IU is worse than the one of VByte on both Gov2 and ClueWeb09. The control byte of Varint-G8IU stores a bit for every one of the 8 data bytes: a 0 bit means that the corresponding byte completes a decoded integer, whereas a 1 bit means that the byte is part of a decoded integer or is wasted. Thus, Varint-G8IU compresses worse than plain VByte due to the wasted bytes. Finally, notice that Varint-GB is slightly worse than Varint-G8IU because it uses 10 bits per integer instead of 9 for all integers in [1, 2^7). In fact, the difference between these two encoders is less than 1 bit per integer on both Gov2 and ClueWeb09.

The speed of the encoders is actually very similar for all alternatives, regarding both AND queries and sequential decoding: the spread between the fastest (Varint-G8IU) and the slowest alternative (Masked-VByte) is small. The same holds true for the building of the indexes where, as expected, plain VByte is the fastest and Varint-G8IU is slower on average.

In conclusion, for the reasons discussed above, i.e., better space occupancy, fastest index building time and competitive speed, we adopt the original VByte stream organization and the Masked-VByte algorithm by Plaisance, Kurz and Lemire [35] to perform sequential decoding.

Gov2 ClueWeb09
space doc freq space doc freq
GB bpi bpi GB bpi bpi
VByte
VByte uniform
VByte ε-optimal
VByte optimal
Table 3: Space in gigabytes (GB) and average number of bits per integer (bpi) for documents (doc) and frequencies (freq).

4.2 Optimizing Variable-Byte: partitioned vs. un-partitioned indexes

In this subsection, we evaluate the impact of our solution by comparing the optimally-partitioned VByte indexes against the un-partitioned indexes and the ones obtained by using other partitioning strategies, like the ε-optimal one based on dynamic programming (see Subsection 3.1) and the uniform one. The result of the comparison shows that our optimally-partitioned VByte indexes are smaller than the original, un-partitioned, counterparts; can be built faster than with dynamic programming and offer the strongest guarantee, i.e., an exact solution rather than an ε-approximation; and, despite the significant space savings, they are as fast as the original VByte indexes.

Index space. Table 3 shows the results concerning the space of the indexes. Compared to the case of un-partitioned indexes, we observe significant gains, with a net factor of 2 of improvement with respect to the original VByte format. For the uniform partitioning we used partitions of 128 integers, for both documents and frequencies. As we can see, this simple strategy already produces significant space savings, on both the doc and freq sequences of Gov2 and ClueWeb09. This is because most d-gaps are actually very small, but any un-partitioned VByte encoder needs at least 8 bits per d-gap. In fact, notice how the average number of bits per integer on the doc sequences becomes less than 8.

We remark that the ε-optimal algorithm based on dynamic programming was proposed for Elias-Fano, whose cost in bits can be computed in constant time: we adapt the dynamic programming recurrence in order to use it for VByte too. As approximation parameters we used the same as in the experiments of the original paper [32], i.e., we set ε₁ = 0.03 and ε₂ = 0.3. The computed approximation could possibly be made worse by enlarging such parameters, while our algorithm finds an exact solution. However, we notice that the ε-approximation is good and our optimal solution is only slightly better on the doc sequences of Gov2 and ClueWeb09. Compared to uniform, the optimal partitioning pays off: indeed it produces a further saving on average, thus confirming the need for an optimization algorithm.

Gov2 ClueWeb09
VByte
VByte uniform
VByte ε-optimal
VByte optimal
Table 4: Index building timings in minutes.

Index building time. Although the un-partitioned variant would be the fastest to build in internal memory, because the posting lists are compressed in the same pass in which they are read from disk, the writing of the data structure to disk imposes a considerable overhead because of the high memory footprint of the un-partitioned index. Notice that this factor becomes dramatic for the larger ClueWeb09 dataset. This also causes the simple uniform strategy to be not faster at all, being actually a little slower on average. Despite the linear-time complexity for constant ε, the ε-optimal solution has a noticeable CPU cost due to its high constant factor, as we motivated in Section 3. The optimal solution has instead low constant factors and, as a result, is faster than the dynamic programming approach on average on both Gov2 and ClueWeb09.

Gov2 ClueWeb09

TREC 05

VByte
VByte uniform
VByte ε-optimal
VByte optimal

TREC 06

VByte
VByte uniform
VByte ε-optimal
VByte optimal
(a) AND queries (ms/query)
Gov2 ClueWeb09
VByte
VByte uniform
VByte ε-optimal
VByte optimal
(b) decoding time (ns/int)
Table 5: Timings for AND queries in milliseconds per query (ms/query) and sequential decoding time in nanoseconds per integer (ns/int).
Gov2 ClueWeb09
space doc freq space doc freq
GB bpi bpi GB bpi bpi
PEF ε-optimal
OptPFD
BIC
ANS
QMX
VByte optimal
Table 6: Space in gigabytes (GB) and average number of bits per integer (bpi) for documents (doc) and frequencies (freq).

Index speed: AND queries and decoding. Table 5 illustrates the results. The striking result of the experiment is that, despite the significant space reduction (a factor of 2 of improvement, see Table 3), the partitioned indexes are as fast as the un-partitioned ones on both the Gov2 and ClueWeb09 datasets, with only a minor slowdown in sequential decoding.

A careful explanation of this result is therefore important. The answer is provided by the plots in Figure 4, along with the ones in Figures 1 and 5. In particular, Figure 4 illustrates the average number of nanoseconds spent per NextGEQ query by VByte, the binary vector representation and Elias-Fano (EF) on a sequence of one million integers with an average gap of: (a) 2.5, as a dense case, and (b) 1850, as a sparse case. As the dense case illustrates, the binary vector representation is as fast as VByte for all jumps of size less than or equal to 8, and becomes actually faster for longer jumps. Moreover, the distribution of the jump sizes plotted in Figure 5 indicates that, when executing AND queries, the jumps of size less than 16 account for 90% of the jumps performed by NextGEQ. Furthermore, the distribution plotted in Figure 1 tells us that the majority of blocks are actually encoded with their characteristic bit-vector, thus explaining why the partitioned indexes exhibit no penalty against the un-partitioned counterparts.

Figure 4: Nanoseconds per NextGEQ query for (a) Dense and (b) Sparse sequences of one million integers, having an average gap of 2.5 and 1850, respectively. These values mimic the ones for the Gov2 dataset, which are 2.125 and 1852. The ones for the ClueWeb09 dataset are 2.137 and 963, thus the plots have a similar shape.
Figure 5: When the difference between two consecutive accessed positions is k, NextGEQ is said to make a jump of size k. The distribution of the jump sizes is divided into buckets of exponential width: all sizes between 2^i and 2^{i+1} belong to bucket i. The plot shows the jump distribution, in percentage, for the query logs used in the experiments, when performing AND queries.

However, VByte tends to be slower on longer jumps because of its block-wise organization: since a posting list is split into blocks of 128 postings that are encoded separately, a block must be completely decoded even for accessing a single integer, which is not uncommon for boolean conjunctions. Moreover, since d-gap values are encoded, we need to access the elements by a linear scan of the block after decoding, in order to compute their prefix sums. When the accessed elements per block are very few, even using SIMD instructions to perform decoding results in slower query execution. Conversely and as expected, the binary vector representation is inefficient for the sparse regions, since potentially many bits need to be scanned to perform a query, but it is still faster than VByte whenever the jump size becomes larger than 64, because it allows skipping over the bit stream by keeping samples of the positions of the set bits.

4.3 Overall comparison

In this section we compare the optimally-partitioned VByte indexes against several competitors: (1) the ε-optimal partitioned Elias-Fano (PEF) by Ottaviano and Venturini [32]; (2) the Binary Interpolative coding (BIC) by Moffat and Stuiver [30]; (3) the optimized PForDelta (OptPFD) by Yan et al. [45]; (4) the recent ANS-based technique by Moffat and Petri [29]; (5) the QMX mechanism by Trotman [41].

For all competitors, we used the C++ code from the original authors, compiled with gcc 7.2.0 using the highest optimization setting as we did for our own code, to ensure a fair comparison.

It would also be interesting to compare against the MILC framework [43], which is tailored for random access rather than AND queries, but its code is not publicly available.

Index space and building time. Table 6 shows the results concerning the space of the indexes. Clearly, the space usage of the VByte optimal indexes is higher than that of the bit-aligned encoders: this was expected, since VByte is byte-aligned. However, the important result is that the gap is not as large as it used to be: VByte optimal is now only moderately larger than BIC and PEF on the larger ClueWeb09, and on Gov2 as well. Comparing the results reported in Table 3 with the ones in Table 6, we see that, without partitioning, VByte was far larger than both PEF and BIC on Gov2 and on ClueWeb09; thus, we substantially reduced the space gap between the original VByte index representation and the one of the best bit-aligned encoders. In particular, notice that VByte optimal is only slightly larger than PEF on the docID sequences, while it is more inefficient on the frequency sequences. This is because the frequencies are made up of smaller d-gaps than the ones present in the docIDs. Very similar considerations hold for the other alternatives, such as OptPFD and ANS. In particular, we notice that on the ClueWeb09 dataset the difference between VByte optimal and OptPFD is very small; ANS is particularly better than the other methods on the freq sequences; the byte-aligned QMX is, instead, significantly larger on both Gov2 and ClueWeb09.

We now consider the time needed to build the indexes; we do not show the corresponding table for space constraints. As already noted in the previous subsection, the dynamic programming approach used for PEF imposes a severe penalty on building time with respect to VByte optimal. The penalty is due not only to the difference in speed between dynamic programming and the algorithm devised in Section 3, but also to the fact that Elias-Fano, being bit-aligned, is slower to encode than VByte. Except for the ANS indexes, which are slower to build because of the two-pass process of first collecting symbol occurrence counts and then encoding [29], the building timings of the other competitors are, instead, competitive: our optimization algorithm only takes a couple of minutes more over the whole building process. Only BIC and QMX took considerably less indexing time.

Table 7: Timings for AND queries in milliseconds (ms/query), on the TREC 05 and TREC 06 query logs, and sequential decoding time in nanoseconds per integer (ns/int), for PEF ε-optimal, OptPFD, BIC, ANS, QMX, and VByte optimal on Gov2 and ClueWeb09.

Index speed: AND queries and decoding. Table 7 shows the query processing speed of the indexes, along with the corresponding sequential decoding speed. Compared to PEF, the results are indeed very similar to the ones obtained by Ottaviano and Venturini [32], i.e., there is only a marginal gap between the speed of PEF and VByte when computing boolean conjunctions. The reason is to be found, again, in the plot illustrated in Figure 4(b). As we can see, for all jump sizes less than 32 VByte is faster than Elias-Fano, while this advantage vanishes for longer jumps thanks to the powerful skipping abilities of Elias-Fano [42, 32, 34]. However, the impact of Elias-Fano's advantage on long jumps is limited, because jumps larger than 32 are not very frequent on the tested query logs, as depicted by the distribution of Figure 5.

Compared to the other approaches, we see significant gains with respect to OptPFD on both Gov2 and ClueWeb09, as well as with respect to BIC and ANS, and only a slight penalty with respect to QMX on ClueWeb09. The same conclusions apply to the sequential decoding speed of the indexes.

5 Conclusions

We have presented an optimization algorithm for point-wise encoders that splits a posting list into variable-sized partitions to improve its compression, and that runs in linear time and constant space. We also proved that the algorithm is optimal, i.e., it finds the splitting that minimizes the space of the representation. This considerably improves upon the previous approaches based on dynamic programming in all respects: time/space complexity and practical performance.

By applying our technique to the ubiquitous Variable-Byte encoding, we improved its compression ratio by 2× and built optimally-partitioned indexes faster than the dynamic programming approach. Despite the significant space savings, the partitioned representation does not introduce penalties at query processing time, being actually the fastest when compared with many other state-of-the-art encoders.

As a last note, we mention the possibility of introducing another encoder for representing the runs of the posting lists. Clearly, a run of consecutive docIDs can be described with just the information stored in the first level of the representation, i.e., the size of the run. Although our framework can be extended to include this case, the algorithm and its analysis become much more complicated. This additional complexity does not pay off, because the space improvement was negligible on the tested datasets.

References

  • [1] Dropbox Techblog. https://blogs.dropbox.com/tech/2016/09/improving-the-performance-of-full-text-search/. Accessed on 15-04-2018.
  • [2] Protocol Buffers - Google’s data interchange format. https://github.com/google/protobuf. Accessed on 15-04-2018.
  • [3] RediSearch. https://github.com/RedisLabsModules/RediSearch/blob/master/docs/DESIGN.md. Accessed on 15-04-2018.
  • [4] UpscaleDB. https://upscaledb.com/about01.html#compression. Accessed on 15-04-2018.
  • [5] D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden, et al. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases, 5(3):197–280, 2013.
  • [6] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval Journal, 8(1):151–166, 2005.
  • [7] V. N. Anh and A. Moffat. Index compression using 64-bit words. Software: Practice and Experience, 40(2):131–147, 2010.
  • [8] B. Bhattacharjee, L. Lim, T. Malkemus, G. Mihaila, K. Ross, S. Lau, C. McArthur, Z. Toth, and R. Sherkat. Efficient index compression in DB2 LUW. Proceedings of the VLDB Endowment, 2(2):1462–1473, 2009.
  • [9] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 310–321. ACM, 2002.
  • [10] A. Buchsbaum, G. Fowler, and R. Giancarlo. Improving table compression with combinatorial optimization. Journal of the ACM, 50(6):825–851, 2003.
  • [11] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In Proceedings of the 28th International Conference on Data Engineering (ICDE), pages 1360–1369. IEEE, 2012.
  • [12] S. Büttcher, C. Clarke, and G. Cormack. Information retrieval: implementing and evaluating search engines. MIT Press, 2010.
  • [13] F. Claude, A. Fariña, M. A. Martínez-Prieto, and G. Navarro. Universal indexes for highly repetitive document collections. Information Systems, 61:1–23, 2016.
  • [14] M. Curtiss, I. Becker, T. Bosman, S. Doroshenko, L. Grijincu, T. Jackson, S. Kunnatur, S. Lassen, P. Pronin, S. Sankar, G. Shen, G. Woss, C. Yang, and N. Zhang. Unicorn: A system for searching the social graph. In Proceedings of the Very Large Database Endowment (VLDB), volume 6, pages 1150–1161, 2013.
  • [15] J. Dean. Challenges in building large-scale information retrieval systems: invited talk. In Proceedings of the 2nd International Conference on Web Search and Data Mining (WSDM), 2009.
  • [16] B. Debnath, S. Sengupta, and J. Li. Skimpystash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 25–36. ACM, 2011.
  • [17] J. Duda. Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540, 2013.
  • [18] P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246–260, 1974.
  • [19] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
  • [20] R. M. Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971.
  • [21] P. Ferragina, I. Nitto, and R. Venturini. On optimally partitioning a text to improve its compression. Algorithmica, 61(1):51–74, 2011.
  • [22] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing relations and indexes. In Proceedings of the 14th International Conference on Data Engineering (ICDE), pages 370–379, 1998.
  • [23] S. Golomb. Run-length encodings. IEEE Transactions on Information Theory, 12(3):399–401, 1966.
  • [24] V. Hristidis, Y. Papakonstantinou, and L. Gravano. Efficient IR-style keyword search over relational databases. In Proceedings 2003 VLDB Conference, pages 850–861. Elsevier, 2003.
  • [25] D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1–29, 2013.
  • [26] D. Lemire, N. Kurz, and C. Rupp. Stream-VByte: faster byte-oriented integer compression. Information Processing Letters, 130:1–6, 2018.
  • [27] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  • [28] A. Moffat and M. Petri. ANS-based index compression. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017, pages 677–686, 2017.
  • [29] A. Moffat and M. Petri. Index compression using byte-aligned ANS coding and two-dimensional contexts. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 405–413, 2018.
  • [30] A. Moffat and L. Stuiver. Binary interpolative coding for effective index compression. Information Retrieval Journal, 3(1):25–47, 2000.
  • [31] G. Ottaviano, N. Tonellotto, and R. Venturini. Optimal space-time tradeoffs for inverted indexes. In Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM), pages 47–56, 2015.
  • [32] G. Ottaviano and R. Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International Conference on Research and Development in Information Retrieval (SIGIR), pages 273–282, 2014.
  • [33] G. E. Pibiri and R. Venturini. Clustered Elias-Fano indexes. ACM Transactions on Information Systems, 36(1):1–33, 2017.
  • [34] G. E. Pibiri and R. Venturini. Inverted index compression. Encyclopedia of Big Data Technologies, pages 1–8, 2018.
  • [35] J. Plaisance, N. Kurz, and D. Lemire. Vectorized VByte decoding. In International Symposium on Web Algorithms (iSWAG), 2015.
  • [36] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive rid-list intersection, and its application to index anding. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 773–784. ACM, 2007.
  • [37] F. Silvestri. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR), pages 101–112, 2007.
  • [38] F. Silvestri and R. Venturini. VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th International Conference on Information and Knowledge Management (CIKM), pages 1219–1228, 2010.
  • [39] A. Stepanov, A. Gangolli, D. Rose, R. Ernst, and P. Oberoi. SIMD-based decoding of posting lists. In Proceedings of the 20th International Conference on Information and Knowledge Management (CIKM), pages 317–326, 2011.
  • [40] L. H. Thiel and H. Heaps. Program design for retrospective searches on large data bases. Information Storage and Retrieval, 8(1):1–20, 1972.
  • [41] A. Trotman. Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian Document Computing Symposium, page 50. ACM, 2014.
  • [42] S. Vigna. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM), pages 83–92, 2013.
  • [43] J. Wang, C. Lin, R. He, M. Chae, Y. Papakonstantinou, and S. Swanson. MILC: inverted list compression in memory. Proceedings of the VLDB Endowment, 10(8):853–864, 2017.
  • [44] H. E. Williams and J. Zobel. Compressing integers for fast file access. The Computer Journal, 42(3):193–201, 1999.
  • [45] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web (WWW), pages 401–410, 2009.
  • [46] Z. Zhang, J. Tong, H. Huang, J. Liang, T. Li, R. J. Stones, G. Wang, and X. Liu. Leveraging context-free grammar for efficient inverted index compression. In Proceedings of the 39th International Conference on Research and Development in Information Retrieval (SIGIR), pages 275–284, 2016.
  • [47] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2):1–56, 2006.
  • [48] M. Zukowski, S. Héman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), pages 59–70, 2006.