Sliced indexes in C++. https://arxiv.org/abs/1907.01032
Representing sorted integer sequences in small space is a central problem for large-scale retrieval systems such as Web search engines. Efficient query resolution, e.g., intersection or random access, is achieved by carefully partitioning the sequences. In this work we describe and compare two different partitioning paradigms: partitioning by cardinality and partitioning by universe. Although the ideas behind these paradigms have been known in the coding and algorithmic communities for many years, inverted index compression has extensively adopted the former paradigm, whereas the latter has received little attention. As a result, an experimental comparison between the two is missing for the setting of inverted index compression. We also propose and implement a solution that recursively slices the universe of representation of a sequence to achieve compact storage and fast query execution. Albeit larger than some state-of-the-art representations, this slicing approach substantially improves the performance of list intersections and unions while operating in compressed space, thus offering an excellent space/time trade-off for the problem.
Large-scale retrieval systems employ a simple, yet ingenious, data structure to support text search – the inverted index (Zobel and Moffat, 2006; Manning et al., 2008; Cambazoglu and Baeza-Yates, 2015; Witten et al., 1999). In its simplest incarnation, the inverted index is a collection of sorted integer sequences, called inverted lists. For each distinct term appearing in the textual collection, the corresponding inverted list represents the list of the identifiers of the documents where the term appears. Then, resolving a user query such as, for example, “return all documents where terms t1 and t2 appear” reduces to the problem of intersecting the inverted lists of t1 and t2. Other query operators are possible and several pruning techniques have been developed (Broder et al., 2003; Mallia et al., 2017) for the case of ranked retrieval, i.e., when the returned documents have to be ranked according to a scoring function (Robertson and Jones, 1976). Zobel and Moffat (2006) provide general background on inverted indexes.
Literature on the representation of integers and integer sequences is vast. Many solutions are known, each of them exposing a different space/time trade-off, including: Elias’ gamma and delta (Elias, 1975), Golomb (Golomb, 1966), Elias-Fano (Fano, 1971; Elias, 1974; Vigna, 2013), partitioned Elias-Fano (Ottaviano and Venturini, 2014), clustered Elias-Fano (Pibiri and Venturini, 2017a), Binary Interpolative (Moffat and Stuiver, 1996, 2000), PForDelta (Zukowski et al., 2006; Yan et al., 2009; Lemire and Boytsov, 2015), Simple (Anh and Moffat, 2005; Zhang et al., 2008; Anh and Moffat, 2010), Variable-Byte (Thiel and Heaps, 1972; Scholer et al., 2002; Dean, 2009; Stepanov et al., 2011; Plaisance et al., 2015; Lemire et al., 2018b; Pibiri and Venturini, 2019), QMX (Trotman, 2014), ANS-based (Moffat and Petri, 2017, 2018), DINT (Pibiri et al., 2019). We point the reader to the surveys by Zobel and Moffat (2006), by Moffat (2008) and by Pibiri and Venturini (2018) for a review of many techniques.
More precisely, the problem we consider is that of introducing a compressed representation for a sorted integer sequence S of size n whose values are drawn from a universe of size u. The sequence is here assumed to be strictly increasing, i.e., S[i] < S[i+1] for 0 ≤ i < n − 1, and the following operations have to be supported efficiently.
S.decode(output): decodes S sequentially to the output buffer of 32-bit integers;
AND/OR(S1, S2, output): performs the intersection/union between S1 and S2, materializing the result into the output buffer of 32-bit integers and returning the size of the result;
S.access(i): returns the integer S[i];
S.nextGEQ(x): returns the smallest integer of S greater than or equal to x (this operation is more classically known as successor). If x is larger than the largest element of S, a default value is returned, here assumed to be called limit and such that limit is larger than any value in the universe.
Except for the operations decode and OR, which need to sequentially scan the sequence, an efficient implementation of the aforementioned operations relies on partitioning the sequence, because of the following simple observation:
When we ask whether the integer x is present in the sequence S, we can safely skip all partitions of S whose maximum integer is less than x: because S is sorted, none of the integers in those partitions can be equal to x.
Classically, integer sequences have been partitioned by cardinality, i.e., consecutive elements are grouped together into fixed-size or variable-size partitions. However, partitioning a sequence by universe is also possible. In a simple implementation of this approach, a universe span s is chosen and all integers falling into the k-th bucket [ks, (k+1)s) are compressed into the same partition.
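As a minimal sketch (the function name and the use of std::map are ours, not the paper's code), bucketing by universe amounts to grouping each integer x by x / span:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Group a sorted sequence into universe-aligned buckets: integer x goes to
// bucket x / span, so bucket k covers the universe slice [k*span, (k+1)*span).
// Empty slices simply produce no bucket at all.
std::map<uint32_t, std::vector<uint32_t>>
partition_by_universe(const std::vector<uint32_t>& seq, uint32_t span) {
    std::map<uint32_t, std::vector<uint32_t>> buckets;
    for (uint32_t x : seq) buckets[x / span].push_back(x);
    return buckets;
}
```

Run on the example sequence below with span 8, the slice [8, 16) yields no bucket, matching the "empty partition" case discussed next.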
Fig. 3 shows an example of both paradigms applied to the sequence 0, 1, 4, 5, 6, 17, 18, 19, 20, 21, 22, 24, 27, 31, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 50, 52, 53, 54, 55, with partitions of size 8. In Fig. 3a, 8 consecutive integers are packed together, thus the following partitions are defined: [0, 1, 4, 5, 6, 17, 18, 19], [20, 21, 22, 24, 27, 31, 34, 35], [37, 38, 39, 40, 41, 42, 43, 44], [45, 46, 47, 50, 52, 53, 54, 55]. In Fig. 3b, 8 consecutive universe values are packed together, thus the following partitions are defined: [0, 1, 4, 5, 6], [17, 18, 19, 20, 21, 22], [24, 27, 31], [34, 35, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47], [50, 52, 53, 54, 55]. Notice how, in this latter example, partitions may have different cardinalities and that some of them may even be empty, as happens for the second one, spanning the universe slice [8, 16).
The key point is that the literature on inverted index compression has extensively adopted the partitioning-by-cardinality paradigm (PC), while little attention has been given to the other paradigm (PU). As a result: (1) no experimental comparison between these paradigms has been carried out in the setting of inverted index compression (to the best of our knowledge, only one prior work (Wang et al., 2017) considers a similar issue); (2) few solutions for the PU paradigm have been designed.
Therefore, after a detailed description of the two paradigms (Section 2), we: first, design a simple PU solution that is tailored to exploit the clustering property of inverted lists (Section 3), namely the fact that inverted lists are well known to feature clusters of very close document identifiers that can be compressed very well; then, experimentally compare the advantages and disadvantages of both paradigms in terms of achieved compression effectiveness and the efficiency of the operations introduced before (Section 4). We finally summarize the experimental findings and sketch some promising future directions (Section 5).
As already introduced, we identify two different paradigms that partition a sequence to achieve efficient query resolution, as exemplified in Fig. 3 and described in the following: partitioning by cardinality (PC) and partitioning by universe (PU).
Traditionally, an inverted list is partitioned by cardinality, i.e., consecutive integers of the list are grouped together into a partition until a given cardinality is reached. The cardinality can be fixed for every partition (besides the last one, which may contain fewer integers), e.g., 128 integers, or can vary according to the actual values of the integers being compressed, e.g., in order to achieve a more compact representation (Ottaviano and Venturini, 2014; Pibiri and Venturini, 2019).
The list also stores the (sorted) sequence formed by the maximum values of every partition. The values in such a sequence are usually called skip pointers. Skip pointers add a small space overhead to the list representation itself for reasonably-large cardinality values, but allow skipping over the inverted list’s values. Such a list organization relies on the nextGEQ operation to support efficient list intersection. We can implement nextGEQ(x) by first searching among the skip pointers to identify the partition where the wanted value lies and, then, concluding the search in that partition only. The operation is efficient because the set of skip pointers is small and searching a value in a single partition is faster than searching it in the whole list without any positional restriction.
Let us now consider how list intersection can be achieved through the nextGEQ primitive. Suppose we have to compute the intersection between the inverted lists associated to terms t1 and t2, i.e., L1 ∩ L2, where L1 is shorter than L2. We search for the first value x of L1 in L2 with y = L2.nextGEQ(x): if the value y returned by the operation is equal to x then it is a member of the intersection and we can just repeat this search step for the next value of L1; otherwise y gives us a candidate value to be searched next, indeed allowing us to skip the searches for all values of L1 between x and y. In fact, since y > x, L2.nextGEQ(z) will be equal to y also for all values z such that x < z < y, thus none of such integers can be a member of the intersection. Fig. (a) illustrates this procedure.
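The procedure can be sketched as follows, with nextGEQ implemented over plain uncompressed vectors via std::lower_bound (the names and the representation are illustrative, not the paper's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// nextGEQ over an uncompressed sorted vector: smallest element >= x,
// or `limit` (a sentinel larger than any stored value) if none exists.
uint32_t nextGEQ(const std::vector<uint32_t>& s, uint32_t x, uint32_t limit) {
    auto it = std::lower_bound(s.begin(), s.end(), x);
    return it == s.end() ? limit : *it;
}

// Intersect by probing each value of the shorter list into the longer one:
// when nextGEQ returns y > x, every value of `shorter` in [x, y) can be
// skipped, since nextGEQ would return y for all of them too.
std::vector<uint32_t> intersect(const std::vector<uint32_t>& shorter,
                                const std::vector<uint32_t>& longer,
                                uint32_t limit) {
    std::vector<uint32_t> out;
    size_t i = 0;
    while (i < shorter.size()) {
        uint32_t y = nextGEQ(longer, shorter[i], limit);
        if (y == limit) break;  // no more candidates in the longer list
        if (y == shorter[i]) {
            out.push_back(y);
            ++i;
        } else {
            // skip all values of `shorter` smaller than the candidate y
            i = std::lower_bound(shorter.begin() + i, shorter.end(), y) - shorter.begin();
        }
    }
    return out;
}
```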
Another strategy is partitioning an inverted list by universe, i.e., all integers sharing the same value of ⌊x/s⌋ belong to the same universe-aligned partition. For example, given a universe of size u we may choose a universe span of s integers, so that the following universe-aligned partitions are defined: [0, s), [s, 2s), [2s, 3s), and so on. Note that such partitions do not depend on the actual size of the list, but only on its universe of representation. All integers less than s are grouped into the first partition; all integers larger than or equal to s and less than 2s are grouped into the second partition; and so on. In general, if the k-th partition contains c integers, i.e., there are c integers x such that ks ≤ x < (k+1)s, we say that the partition has cardinality c.
Also this strategy permits skipping over the list values, because only partitions relative to the same universe slice have to be intersected. Now the skip pointers are represented by enumerating the non-empty partitions, rather than being actual list values. Therefore, list intersection proceeds by identifying all common partitions and, for each of them, by resolving a smaller intersection of at most s integers. Depending on the actual cardinality of a partition, different compression strategies and/or intersection algorithms can be adopted, thus not necessarily relying on the nextGEQ primitive. Fig. (b) illustrates this other approach.
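The same idea can be sketched over a toy bucketed representation (the std::map and the function name are ours; a real index stores the partitions contiguously and picks a specialized kernel per bucket pair):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// Intersect two sequences partitioned by universe span: only buckets sharing
// the same identifier (x / span) need to be intersected, and each small
// bucket intersection may use whatever algorithm suits its cardinality
// (here, simply std::set_intersection).
std::vector<uint32_t> intersect_by_universe(const std::vector<uint32_t>& a,
                                            const std::vector<uint32_t>& b,
                                            uint32_t span) {
    std::map<uint32_t, std::vector<uint32_t>> pa, pb;
    for (uint32_t x : a) pa[x / span].push_back(x);
    for (uint32_t x : b) pb[x / span].push_back(x);
    std::vector<uint32_t> out;
    for (auto& [k, bucket] : pa) {
        auto it = pb.find(k);
        if (it == pb.end()) continue;  // bucket k is empty in b: skip it entirely
        std::set_intersection(bucket.begin(), bucket.end(),
                              it->second.begin(), it->second.end(),
                              std::back_inserter(out));
    }
    return out;
}
```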
On the other hand, the overhead represented by the skip pointers may be excessive for very sparse inverted lists, because the partitions do not depend on the size n of the list. In fact, in the worst case, we could be maintaining a pointer for each integer in the list (each partition is a singleton).
As already claimed, this paradigm has been used in the coding and (theoretical) data structure design areas. We briefly discuss some old, yet very meaningful, examples: the Elias-Fano encoding algorithm (Fano, 1971; Elias, 1974) and the van Emde Boas data structure (van Emde Boas, 1975, 1977).
Elias-Fano represents a sequence S in at most n⌈log2(u/n)⌉ + 2n bits. It can be shown (Elias, 1974) that choosing ℓ = ⌈log2(u/n)⌉ bits for the low parts of the values minimizes the number of bits. In other words, the values of S are partitioned by universe into chunks containing at most 2^ℓ integers each. We refer to this split as parametric, because it depends on the values of u and n, thus making the intersection algorithm shown in Fig. (b) not directly applicable, because sequences having different sizes partition the universe differently.
But a non-parametric split is possible as well. For example, assuming a universe of size u = 2^32, we could partition the universe into 2^16 chunks spanning 2^16 values each. In this way, each chunk contains all integers sharing the same 16 most significant bits. This is reminiscent of the van Emde Boas data structure: a recursive tree layout solving the well-known dictionary problem (see the introduction to parts III and V of the book by Cormen et al. (2009)) in O(log log u) time per operation and O(u) words of space (actually, the space can be improved to O(n) words using bucketing; see this blog post by Mihai Pǎtraşcu: http://infoweekly.blogspot.com/2007/09/love-thy-predecessor-iii-van-emde-boas.html). In such a data structure, the universe is recursively partitioned into square-root-sized chunks.
A similar fixed-universe partitioning approach has been recently adopted by Roaring (Chambi et al., 2016; Lemire et al., 2016; Lemire et al., 2018a), a practical data structure that has been shown to outperform all previously proposed bitmap indexes (Chambi et al., 2016; Wang et al., 2017) and is widely used in commercial applications. Specifically, Roaring partitions the universe into chunks of 2^16 integers and represents all the integers falling into a chunk in two different ways according to the cardinality of the chunk: if a chunk contains less than 4096 elements, then it is represented as a sorted array of 16-bit integers; otherwise it is represented as a bitmap of 2^16 bits. Finally, extremely dense chunks can also be represented with runs, where a run encodes a maximal interval of consecutive integers belonging to the chunk.
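Roaring's container decision can be sketched as follows (a simplification of the rule just described, not the CRoaring API; all names are ours). The 4096 threshold is exactly the break-even point, since 4096 × 2 bytes equals the 8 KiB bitmap:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of Roaring's per-chunk choices. A 32-bit value is split into a
// 16-bit chunk identifier (high bits) and a 16-bit value within the chunk.
enum class Container { SortedArray, Bitmap };

inline uint16_t chunk_of(uint32_t x) { return uint16_t(x >> 16); }    // high 16 bits
inline uint16_t low_bits(uint32_t x) { return uint16_t(x & 0xFFFF); } // value in chunk

// Below 4096 elements a sorted array of 16-bit integers is smaller;
// at or above it, a fixed 2^16-bit (8192-byte) bitmap wins.
inline Container pick_container(size_t cardinality) {
    return cardinality < 4096 ? Container::SortedArray : Container::Bitmap;
}

// Space in bytes for the chosen container: 2 bytes per element vs. 8192 bytes.
inline size_t container_bytes(size_t cardinality) {
    return pick_container(cardinality) == Container::SortedArray ? 2 * cardinality
                                                                 : 8192;
}
```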
In this section we design a solution that applies a recursive universe slicing approach to achieve compact storage and good practical performance for the operations introduced in Section 1.
Let us consider a strictly increasing sequence S whose elements are drawn from a universe of size u. At a high level, we represent S using a tree of height 3 (at most), where the root has fanout at most 2^16 and its children have fanout 2^8. Refer to Fig. (a). We now detail how the data structure is concretely implemented and how the operations are supported.
The root of the tree logically corresponds to the interval [0, u), which is partitioned into slices spanning 2^16 integers each (except, possibly, the last slice, which may span fewer). In what follows, we refer to such 2^16-long slices as chunks. A header array is used to classify chunks into 4 different types according to their cardinality: full, dense, sparse and empty. Full chunks, i.e., chunks containing exactly 2^16 integers, and empty chunks (containing no integers at all) are represented implicitly by their types. A dense chunk spanning the k-th universe slice is represented with a bitmap of 2^16 bits, by setting the i-th bit if the integer k·2^16 + i belongs to the slice. In particular, we regard a chunk as dense if its cardinality is at least 2^15. Doing so guarantees that the average number of bits spent for each integer belonging to a dense chunk is at most 2. Therefore, full, empty and dense chunks have no children. Instead, sparse chunks are encoded by re-applying the same strategy: since every sparse chunk now logically corresponds to a (smaller) universe slice of size 2^16, this interval is partitioned into slices spanning 2^8 integers each (again, except possibly the last one). We refer to such 2^8-long slices as blocks. As before, a header array is used to distinguish between different block types. However, given the smaller universe slice, we only distinguish between two block types, namely dense and sparse, in order to better amortize the cost of the headers. In particular, a dense block is encoded with a bitmap of 2^8 bits; a sparse block is represented with a sorted array of 8-bit integers.
We choose a chunk span of 2^16 and a block span of 2^8. Since we consider the case where u ≤ 2^32, with this choice the number of chunks is always at most 2^16, thus we encode this quantity in 16 bits. Similarly, it follows that each sparse chunk is sliced into at most 2^8 blocks. With this choice of spans, a dense chunk is a bitmap of 8192 bytes; a dense block is a bitmap of 32 bytes; a sparse block whose cardinality is c is a sorted array of 8-bit integers, hence taking c bytes overall.
For each non-empty block, the header array stores its identifier and its cardinality. Both quantities are always at most 2^8, thus they take one byte each. Knowing the cardinality c of a block, we derive the number of bytes needed by its representation. If c ≤ 31, the block is considered to be sparse, thus it takes c bytes; otherwise it is dense and takes 32 bytes. Therefore, a sparse block consumes at most 31 × 8 = 248 bits. Note that we do not set the sparseness threshold to 32 because otherwise a sparse block could consume up to 32 × 8 = 256 bits, which is equal to the cost of a dense block, and a bitmap would suffice.
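The block cost rule reduces to a one-liner (a hypothetical helper, assuming the thresholds described above; the 2-byte header is accounted separately):

```cpp
#include <cassert>
#include <cstddef>

// Cost in bytes of one 256-wide block given its cardinality c (1 <= c <= 256):
// sparse blocks (c <= 31) are sorted arrays of 8-bit integers (c bytes),
// denser blocks are 256-bit bitmaps (32 bytes). At c = 32 both choices would
// cost 32 bytes, which is why the sparseness threshold stops at 31.
inline size_t block_bytes(size_t c) {
    return c <= 31 ? c : 32;
}
```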
For each non-empty chunk, the header array stores, instead, the following quantities: its identifier, its cardinality, and the number of bytes needed by its encoding. Similarly to the case of a sparse block, we require the encoding of a sparse chunk to take less than 2^16 bits, otherwise a bitmap of 2^16 bits would suffice. Therefore, each of these 3 quantities easily fits into a 16-bit integer.
Although we could derive the type of a chunk from its cardinality, as done for a block, we also store the type explicitly using 16 bits. When a chunk is sparse, we need to know the number of its blocks, i.e., the number of non-empty 2^8-sized slices. This number is the size of the corresponding block header array. Therefore, we write this quantity in 8 bits, interleaved with the 16 bits dedicated to the type information. In conclusion, we spend a 64-bit overhead per chunk.
Fig. (b) shows an example of this organization and how all the different data quantities (headers, bitmaps and arrays) are laid out in memory. In practice, the logical tree shown in Fig. (a) is “linearized” into an array of bytes.
The data structure described here has similarities with some previous approaches. As already discussed, the universe is exponentially reduced like in a van Emde Boas tree, i.e., from 2^32 to 2^16 to 2^8. Partitioning the universe recursively has the potential of adapting to the distribution of the integers being encoded, a crucial design choice for clustered integer sequences such as inverted lists. The choice of bitmaps to represent dense sets is a widely adopted technique, employed by, for example, partitioned Elias-Fano (Ottaviano and Venturini, 2014), hybrid Variable-Byte schemes (Pibiri and Venturini, 2019) and Roaring (Lemire et al., 2018a). However, as we are going to show, the use of bitmaps combined with universe-aligned partitions is particularly effective for fast query execution, because operations can be implemented via inexpensive bitwise instructions, hence exploiting word-level parallelism, and are suitable for even more advanced instructions, such as SIMD AVX.
The description above also opens the possibility for better compression. For example, we could use a different representation for sparse blocks, e.g., bit-aligned universal codes. Whatever representation we use, it will give rise to interesting time/space trade-offs. The choice adopted here of power-of-2 universe spans and the use of 8-bit integer arrays clearly favours time efficiency, given that both bitmaps and packed arrays are aligned to byte boundaries.
We now describe how the operations are supported by the data structure.
Decoding. The S.decode(output) operation decodes S sequentially to the output buffer of 32-bit integers. We loop through each chunk and, depending on its type, we decode it accordingly, appending the result to the output buffer. This permits writing different specialized functions that handle a slice differently based on its type.
Bitmaps can be efficiently decoded using the compiler built-in ctzll, which counts the number of trailing zeros in a 64-bit word (Lemire et al., 2018a). (We also tested a SIMD algorithm to decode larger bitmaps but obtained almost no speed improvement.)
The sorted arrays encoding sparse blocks contain at most 31 integers, each value taking 8 bits. Therefore, we can use the SIMD instruction _mm256_cvtepu8_epi32, which zero-extends packed (unsigned) 8-bit integers to 32-bit integers. Doing so, we can efficiently decode 8 values at a time. We also observed that this approach is even more efficient when paired with loop unrolling, thus we apply the instruction either 2 or 4 times after a single test on the block cardinality.
Intersection. The operation AND(S1, S2, output) performs the intersection between S1 and S2, materializing the result into the output buffer of 32-bit integers and returning the size of the result. We use the algorithm illustrated in Fig. (b): we loop through the header arrays of the two sequences, intersecting only chunks/blocks that are shared by the two (line 8). Therefore, we reduce the problem of list intersection to the smaller instances of intersecting (1) two bitmaps, (2) two arrays, or (3) a bitmap and an array (line 9).
Case (1) – the intersection between two bitmaps – translates into a sequence of inexpensive bitwise AND instructions between 64-bit words, with (usually) automatic compiler vectorization.
Case (2) has to intersect tiny 8-bit sorted arrays. While a scalar textbook intersection algorithm between uncompressed arrays would suffice, we can accelerate the process using a variation of the vectorized approach by Schlegel, Willhalm, and Lehner (2011). In short, the algorithm uses the SIMD instruction _mm_cmpestrm to compare strings of bytes. In our case we can, therefore, execute an all-versus-all comparison in parallel between two sets of 16 8-bit integers. Matching integers, i.e., integers in common between the two sets, are marked with a 32-bit bitmap returned as the result of the comparison. We can use this 32-bit value as an index into a pre-computed universal table of 1024 × 1024 bytes to obtain a permutation of byte indexes, indicating how the matching integers should be permuted to collate them at the beginning of a 128-bit register. Such a permutation is applied with the dedicated _mm_shuffle_epi8 SIMD instruction. The C++ macro INTERSECT shown in Fig. 10 illustrates this approach. Let n1 and n2 be the cardinalities of the two sets, respectively. Since in our case both n1 and n2 are less than 32, we can directly enumerate the following 3 cases: (1) n1 ≤ 16 and n2 ≤ 16, requiring only 1 string comparison; (2) n1 ≤ 16 and n2 > 16 (or vice versa), requiring 2 string comparisons; (3) n1 > 16 and n2 > 16, which would require 4 string comparisons, but for which we determined the simple scalar version to be more efficient. The C++ function sparse_blocks_and coded in Fig. 11 shows these cases and, along with the code in Fig. 10, completes our intersection algorithm for small sets.
For detailed descriptions of SIMD instructions, refer to the excellent Intel guide at https://software.intel.com/sites/landingpage/IntrinsicsGuide.
Case (3) – the intersection between a bitmap and an array – is implemented by checking if the values of the array correspond to bits set in the bitmap, using the bit-test assembler instruction.
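A portable rendition of case (3), using a shift-and-mask in place of the bt instruction (names are ours):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Intersect a bitmap with a sorted array of values: each candidate value v
// is kept iff bit v of the bitmap is set. The word index is v / 64 (v >> 6)
// and the bit position within the word is v % 64 (v & 63).
std::vector<uint32_t> bitmap_and_array(const std::vector<uint64_t>& bitmap,
                                       const std::vector<uint32_t>& values) {
    std::vector<uint32_t> out;
    for (uint32_t v : values)
        if ((bitmap[v >> 6] >> (v & 63)) & 1) out.push_back(v);
    return out;
}
```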
Union. The operation OR(, , output) performs the union between and , materializing the result into the output buffer of 32-bit integers and returning the size of the result. The algorithm follows the same skeleton described for the intersection, albeit we do not rely on specific SIMD optimizations: bitmaps are merged using bitwise OR within 64-bit words; sorted arrays using scalar code; the case with a bitmap and an array is handled by first converting the sorted array into a bitmap, then using the parallelism of bitwise OR.
Random access. The operation S.access(i) returns the integer S[i]. We scan the header array of the data structure, accumulating the cardinality of each chunk, in order to locate the chunk containing the i-th integer. To make this search faster, we build cumulative cardinality counts for groups of c non-empty universe chunks, thus skipping a whole group if the sum of its cardinalities is less than i. The parameter c is an associativity value that in our implementation we set to 32. Then we proceed recursively at the block level if a chunk is sparse (but we do not build cumulative counts at the block level).
In particular, whenever we encounter a bitmap, we rely on the efficiency of the built-in popcountll to locate the 64-bit word where the wanted integer lies. This instruction returns the number of bits set in a 64-bit word. Once we have reduced the problem to a single 64-bit word, we can use the parallel-bit-deposit assembler instruction pdep to perform a fast select-in-word operation (Pandey et al., 2017).
nextGEQ. The operation S.nextGEQ(x) returns the smallest integer of S greater than or equal to x. Since our data structure is partitioned by universe, we can directly identify the chunk comprising x, because this is the one having identifier x/2^16, i.e., we consider the 16 most significant bits of x. The wanted value lies in that partition or, if x is larger than the maximum value in the partition, it is the minimum (first) value in the partition that follows. Observe that this operation is actually faster than access for universe-aligned methods, because it does not need to search for the wanted partition.
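A toy model of universe-aligned nextGEQ (a std::map stands in for the chunk headers; the real structure scans a linearized byte array, but the chunk-jump logic is the same):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// nextGEQ over a universe-partitioned sequence: jump straight to chunk
// x >> 16 instead of searching skip pointers, then fall through to the next
// non-empty chunk if x exceeds that chunk's maximum. For any later chunk,
// all values are > x, so lower_bound simply returns its first value.
uint32_t nextGEQ_sliced(const std::map<uint16_t, std::vector<uint32_t>>& chunks,
                        uint32_t x, uint32_t limit) {
    for (auto it = chunks.lower_bound(uint16_t(x >> 16)); it != chunks.end(); ++it) {
        const auto& vals = it->second;
        auto v = std::lower_bound(vals.begin(), vals.end(), x);
        if (v != vals.end()) return *v;
    }
    return limit;  // x is larger than every stored value
}
```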
The aim of this section is twofold: establishing a solid experimental comparison between the two different paradigms described in Section 2 in order to assess the achievable space/time trade-offs and reporting on the effectiveness/efficiency of the Slicing approach introduced in Section 3.
Tested configurations. We compare the configurations summarized in Table 1 for the following reasons. For the PC paradigm with fixed-size partitions of 128 integers, we test: Variable-Byte (Thiel and Heaps, 1972) with the SIMD-ized decoding algorithm devised by Plaisance et al. (2015); Interpolative (Moffat and Stuiver, 2000); and Elias-Fano (Ottaviano and Venturini, 2014) – as representative of, respectively, highest speed, best compression effectiveness and best space/time trade-off in the literature. As representative of the PC paradigm with variable-sized partitions, we test the ε-optimal partitioned Elias-Fano mechanism (Ottaviano and Venturini, 2014). For all such representations, we use the C++ implementations provided in the ds2i library, available at https://github.com/ot/ds2i.
Concerning the paradigm partitioning by universe, we test three solutions. The first two solutions are represented by Roaring (Lemire et al., 2018a) (see Section 2.2). We test the solution without the run-container optimization, thus using two container types (bitmap and sorted array), and with the optimization, thus using three container types (bitmap, sorted array and run). We use the dedicated library written in C and available at https://github.com/RoaringBitmap/CRoaring.
Method                    Abbrev.  Characteristics
Variable-Byte             V        PC; fixed-size partitions of 128 integers; byte-aligned
Elias-Fano                EF       PC; fixed-size partitions of 128 integers; bit-aligned
Interpolative             BIC      PC; fixed-size partitions of 128 integers; bit-aligned
Elias-Fano ε-opt.         PEF      PC; variable-size partitions; bit-aligned
Roaring without run opt.  R2       PU; single-span; 2 container types; byte-aligned
Roaring with run opt.     R3       PU; single-span; 3 container types; byte-aligned
Slicing                   S        PU; multi-span; byte-aligned
Datasets. We perform the experiments on the following standard test collections.
Gov2 is the TREC 2004 Terabyte Track test collection, consisting of roughly 25 million .gov sites crawled in early 2004. The documents are truncated to 256 KB.
CW09 is the ClueWeb 2009 TREC Category B test collection, consisting of roughly 50 million English web pages crawled between January and February 2009.
CCNews is a dataset of news freely available from CommonCrawl: http://commoncrawl.org/2016/10/news-dataset-available. Precisely, the dataset consists of the news articles published from 09/01/16 to 30/03/18.
Identifiers were assigned to documents according to the lexicographic order of their URLs (Silvestri, 2007). Table 2 reports the basic statistics for the collections. We choose three different levels of list density d = n/u, i.e., the ratio between the size of a list and its maximum integer, and compress all lists whose density exceeds d. By varying the density, we highlight how compression effectiveness changes for the two partitioning paradigms, while still covering most of the integers in the collections. Refer to Table 3.
Experimental setting, methodology and testing details. The experiments are performed on a machine with 4 Intel i7-4790K CPUs clocked at 4.00 GHz, with 32 GB of RAM DDR3 and running Linux 4.13.0. All the code is compiled with gcc 7.2.0 using the highest optimization setting (compilation flags -march=native and -O3).
For the CRoaring library, we compile the code as recommended in the documentation for best performance, i.e., with full support for vectorization. The run-container optimization is enabled by calling the run_optimize function. The implementations of Elias-Fano (both with fixed and variable partitions) and Interpolative do not use explicit vectorization; the implementation of Variable-Byte makes use of the vectorized algorithm devised by Plaisance et al. (2015), called Masked-VByte.
We build the indexes in internal memory and write the corresponding data structure to a file on disk. To perform the queries, the data structure is memory mapped from the file (for CRoaring, by using the frozen_view function) and a warming-up run is executed to fetch the necessary pages from disk.
To sequentially decode the indexes, the kernel is also instructed to access the memory-mapped area sequentially using posix_madvise with flag POSIX_MADV_SEQUENTIAL. To test the speed of list queries, namely AND/OR, we generate 1000 random pairs of integers and execute the queries with the corresponding lists. For point queries, namely access and nextGEQ, we similarly execute 1000 random queries for each list of the index. In particular, the 1000 random positions for the access query are not sorted. The input integers for nextGEQ are not sorted either and are less than the maximum integer in the sequence (thus, the result is always well determined).
Each run of queries is repeated 10 times to smooth fluctuations during measurements. The time reported is the average among these runs.
Organization. We organize the experiments in three subsections. At the whole index level (Section 4.1), we are interested in the number of bits spent per represented integer and the time spent per decoded integer when decoding sequentially every list in the index. At the list level (Section 4.2), we report the time needed to compute pair-wise conjunctions (i.e., intersections or boolean AND queries) and pair-wise disjunctions (i.e., unions or boolean OR queries). Finally, at the single integer level (Section 4.3), we evaluate the time needed to decode an integer at a random position and resolve a nextGEQ query.
Table 4 reports the average number of bits per integer spent by the different methods. Clearly, the bit rate is increasing for decreasing values of density: the sparser a list is, the less clustered it is, thus more bits are needed to represent the values. In general, across all density levels, the bit-aligned methods EF, PEF and BIC offer the best compression effectiveness, with the latter being the most space-efficient of all. Adapting the sizes of the partitions to the distribution of the integers being compressed pays off: PEF is always more effective than EF. The byte-aligned methods V, R2 and R3 are always the largest, with R2 and R3 being always more effective than V on Gov2 but less effective on the other datasets CW09 and CCNews. The use of run containers for the R3 mechanism pays off on the more clustered Gov2, but has a smaller impact on CW09 and CCNews. In general, between the most effective methods and the least effective ones there is a factor of 2 in space consumption.
Lastly, the S solution stands in a middle position between these two classes, costing a few bits per integer more than the most effective methods. In Fig. 18 we report the detailed breakdown of how the integers of the test collections are covered by the different universe slices and how the bits-per-integer rate is split among them. Not surprisingly, most of the space is spent in the representation of the sparse slices, which cover the large majority of the integers of Gov2, CW09 and CCNews. Another meaningful thing to notice is that a conspicuous fraction of the integers of Gov2 is covered by runs of 2^16 consecutive elements and, thus, represented implicitly (full chunks), whereas this does not happen on the less clustered CW09 and CCNews.
Table 5 reports the average nanoseconds spent per decoded integer, measured by calling the decode operation on each list in the index. The methods V, R2, R3 and S are the fastest. However, V decodes a stream of d-gaps and we skipped the final prefix-summing scan in this experiment, whereas R2, R3 and S directly decode the values without further processing (thus, the results compare favourably for V). There is no appreciable difference between the decoding times of R2 and R3. The other bit-aligned methods EF, PEF and BIC are much slower, with the latter being the least efficient of all. In particular, the ds2i library API does not expose a decode operation, so we implemented it for the Elias-Fano-based methods. In such methods, a partition can be represented with one of three different encodings according to its characteristics, namely its relative universe of representation and its size: Elias-Fano, a bitmap, or an implicit representation whenever the relative universe of a partition is equal to its size (see (Vigna, 2013) and (Ottaviano and Venturini, 2014) for details). Efficient decoding of Elias-Fano codes thus basically reduces to reading negated unary codes; bitmaps are decoded with the same procedures used in S (using the built-in ctzll function); implicit partitions are decoded with inexpensive for loops. The sequential decoding speed of Elias-Fano-based methods is, anyway, half that of the fastest methods.
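For the bitmap partitions, the ctzll-based decoding mentioned above can be sketched as follows. This is a minimal illustration; the function and parameter names are ours, not those of the actual code bases:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode the set positions of a bitmap partition into `out`, using one
// count-trailing-zeros call per set bit. `base` is the universe offset
// of the partition (illustrative name).
void decode_bitmap(const uint64_t* words, size_t num_words, uint32_t base,
                   std::vector<uint32_t>& out) {
    for (size_t w = 0; w != num_words; ++w) {
        uint64_t x = words[w];
        while (x != 0) {
            out.push_back(base + static_cast<uint32_t>(w * 64) +
                          static_cast<uint32_t>(__builtin_ctzll(x)));
            x &= x - 1;  // clear the lowest set bit
        }
    }
}
```

The `x &= x - 1` trick removes one set bit per iteration, so the cost is proportional to the number of decoded integers rather than to the bitmap size.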
The BIC mechanism does not feature specific optimizations, except when decoding runs of consecutive integers, and is, on average, one order of magnitude slower than the fastest methods.
We now consider the two fundamental list-level queries: intersections (boolean AND) and unions (boolean OR). Again, for all methods the result of the query is materialized onto a pre-allocated output buffer of 32-bit integers; we slightly modify the ds2i code base to do so (rather than just counting matching integers). To ensure a fair comparison, we also slightly modify the pair-wise intersection and union functions of CRoaring, because these always output a new Roaring data structure resulting from the operation, thus including (potentially expensive) memory allocations during the process. Our modification avoids such allocations and accumulates the result in the pre-allocated output buffer mentioned above.
Table 6 shows the results for intersections. The net result is that the indexes partitioned by universe, R2, R3 and S, are significantly more efficient than those partitioned by cardinality, thanks to their “simpler” intersection algorithm that uses substantially fewer instructions and branches. As discussed, in this context simplicity means that, being aligned to the same relative universe, bitmap intersections can be carried out by a sequence of inexpensive bitwise AND 64-bit operations, and sorted-array intersections can be accelerated with SIMD-based algorithms. This results in faster execution for ; for ; for .
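The bitmap case of this “simpler” algorithm amounts to the following sketch (names are ours; the actual implementations additionally vectorize the sorted-array case):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Intersect two bitmap partitions aligned to the same relative universe:
// a branch-free sequence of 64-bit AND operations, with a popcount to
// accumulate the cardinality of the result.
size_t bitmap_and(const uint64_t* a, const uint64_t* b, size_t num_words,
                  uint64_t* out) {
    size_t cardinality = 0;
    for (size_t i = 0; i != num_words; ++i) {
        out[i] = a[i] & b[i];
        cardinality += static_cast<size_t>(__builtin_popcountll(out[i]));
    }
    return cardinality;
}
```

Each loop iteration processes 64 universe positions with a handful of cheap instructions, which is exactly why this path uses so few instructions and branches compared to merging cardinality-aligned partitions.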
As further evidence of this fact, we report in Table 7 some performance counters collected with the perf Linux utility when executing the queries on the Gov2 dataset. We report the counters for the PEF method because it is the one generally performing best among the PC solutions. From the numbers in the table we can see that both R2 and S execute significantly fewer instructions and branches, for example, and fewer instructions for and respectively, confirming our previous claim about the increase in performance. The PEF method is also “data hungry” compared to R2 and S, as is clear from the high number of L1 cache loads. This is explained by the frequent switching of partitions for higher density values. Observe that PEF actually exploits the data cache well (for example, only 228 misses out of 102 loads in L1 for ); however, the higher number of L1 references imposes a significant penalty. Also observe that S is generally slower than R2 because of the further slicing into smaller partitions, which induces more branches that are not easily predicted and thus partially erodes the instruction throughput. In fact, the (intentionally) simpler design of R2 is far more amenable to SIMD instructions: to confirm this, we recompiled the CRoaring library with explicit SIMD optimizations disabled and R2 scored the same as S, so vectorization makes the difference. However, notice how the difference in efficiency vanishes for lower density values, because most of the skipping happens at a coarser level. Furthermore, observe that the use of run containers in R3 prevents some SIMD optimizations (Lemire et al., 2018a), thus reducing or even annulling the performance gap between R3 and S.
|method|L1 loads ()|L1 misses ()|LL loads ()|LL misses ()|
|S with SIMD|507 / 1080 / 2370|135 / 378 / 820|11 / 60 / 159||
|S without SIMD|816 / 1959 / 5190|213 / 558 / 1344|13 / 72 / 203||
In Table 8 we also investigate the impact of SIMD instructions on the intersection of the small sorted arrays discussed in Section 3.2. The experiment highlights two important facts, one being the consequence of the other: (1) the vectorization of small arrays pays off, as the results for AND are significantly better with SIMD instructions (roughly better for sufficiently dense sequences); (2) most of the running time is actually spent intersecting small arrays (not surprisingly, since bitmaps require essentially bitwise instructions that are very cheap). The latter fact explains why the SIMD optimization is so effective and is consistent with the breakdowns reported in Fig. 18. Lastly, the effect of vectorization clearly tends to diminish for smaller sequences, which are usually the ones with lower density values, as we can see by comparing the values reported in the columns corresponding to and .
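For reference, the scalar two-pointer routine that these vectorized kernels replace can be sketched as follows; the SIMD versions compare blocks of keys at a time instead of one pair per branch (names are illustrative, not the library's API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar two-pointer intersection of two small sorted uint32 arrays,
// writing the result to a pre-allocated buffer and returning its size.
// This is the branchy baseline that block-wise SIMD comparisons beat.
size_t intersect_scalar(const uint32_t* a, size_t na,
                        const uint32_t* b, size_t nb, uint32_t* out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out[k++] = a[i]; ++i; ++j; }
    }
    return k;
}
```

Every element comparison here is a hard-to-predict branch, which is precisely the cost that data-parallel comparisons amortize away.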
Table 9 shows instead the results for unions. For the same reasons discussed above for intersections, the indexes partitioned by universe are superior. However, due to the scan-based nature of unions, the performance gap with respect to the indexes partitioned by cardinality is not as large as for intersections. It is anyway consistent, and equal to for ; for ; for . Finally, notice that the results for R2, R3 and S are very similar in this case, with R3 being slightly less efficient.
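The scan-based union over sorted-array partitions follows the classical merge pattern, again materializing into a pre-allocated buffer (a sketch with our own names):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Merge-based union of two sorted uint32 arrays into a pre-allocated
// buffer; duplicates are emitted once. Returns the size of the result.
size_t union_scalar(const uint32_t* a, size_t na,
                    const uint32_t* b, size_t nb, uint32_t* out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) out[k++] = a[i++];
        else if (b[j] < a[i]) out[k++] = b[j++];
        else { out[k++] = a[i]; ++i; ++j; }
    }
    while (i < na) out[k++] = a[i++];  // drain the leftover tail
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

Unlike intersection, every input element must be touched and written, which is why skipping-friendly layouts gain less here.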
For the methods V, EF and BIC, the access() operation returns the integer in position mod from the partition of index , for integers in this experimentation. In particular, the V method requires decoding the partition and performing a prefix-summing scan up to position mod . The PEF method first needs to locate the partition from which to return the integer, because partitions have variable sizes. Similarly, all solutions partitioned by universe, R2, R3 and S, have to take into account the cardinality of each chunk covering a universe of size in order to locate the chunk containing the -th integer. Table 10 shows the timings of such algorithms.
The EF method generally provides the fastest query time thanks to the constant-time random access algorithm of Elias-Fano, with PEF and S in close second position. The decoding performed by V imposes a performance penalty with respect to such methods, which is clearly more evident for sparser datasets. Again, notice that the access time decreases for decreasing values of density, because fewer partitions per encoded sequence are represented. Lastly, S is faster than R2 because the latter adopts a linear search for the proper chunk to access, whereas S maintains cumulative cardinality counts. Concerning the R3 variant with run containers, the linear-search approach employed absorbs roughly 90% of the time, resulting in a significant slowdown and confirming the experimental conclusions already given by the authors of Roaring (Lemire et al., 2018a).
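The cumulative-cardinality lookup used by S can be sketched as a binary search over prefix sums (names are ours; R2 instead scans the chunk cardinalities linearly):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Locate the chunk containing the i-th integer of the sequence.
// cum[k] holds the number of integers in chunks 0..k (a prefix sum
// of chunk cardinalities). Returns the chunk index and the position
// of the i-th integer inside that chunk.
std::pair<size_t, uint64_t> locate(const std::vector<uint64_t>& cum,
                                   uint64_t i) {
    // first chunk whose cumulative count exceeds i
    auto it = std::upper_bound(cum.begin(), cum.end(), i);
    size_t chunk = static_cast<size_t>(it - cum.begin());
    uint64_t before = chunk ? cum[chunk - 1] : 0;
    return {chunk, i - before};
}
```

For example, with chunk cardinalities 10, 15 and 15 (prefix sums 10, 25, 40), position 30 falls in the third chunk at offset 5.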
Table 11 shows instead the results for the nextGEQ() query. In this case, for all methods partitioned by cardinality, the query is resolved by relying on the skip pointers, as explained in Section 2.1. Precisely, the wanted partition is first identified by binary searching among the skip pointers; then the operation is concluded within that partition. Differently, the mechanisms partitioned by universe directly identify the partition by considering fields of the binary representation of the key. For this reason, and as already discussed in Section 3.2, this operation is actually faster than access for universe-aligned methods.
Again, the Elias-Fano-based methods generally provide better efficiency, but R2 and S are faster especially for lower density values: in such cases, S is the fastest thanks to the further skipping introduced within a single partition. The slowdown imposed by the runs in R3 is alleviated by the use of binary search in this case.
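The direct partition identification works on fields of the key's binary representation. Assuming chunks spanning 2^16 consecutive values (the Roaring-style layout), it is just a shift and a mask, a sketch of which is:

```cpp
#include <cassert>
#include <cstdint>

// With chunks covering 2^16 consecutive values, the high 16 bits of a
// key select its chunk and the low 16 bits give the value to search
// for inside that chunk: no binary search over skip pointers is needed.
inline uint32_t chunk_of(uint32_t key) { return key >> 16; }
inline uint32_t value_in_chunk(uint32_t key) { return key & 0xFFFF; }
```

This constant-time chunk selection is what makes nextGEQ cheaper than access for universe-aligned methods: locating the chunk requires no cardinality information at all.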
The problem of devising a compression format for sorted integer sequences with good practical intersection/union performance is old but pervasive in Computer Science, given its many applications, web search engines being a notable one. Identifying a single best solution is generally not easy; rather, the many space/time trade-offs available can satisfy different application requirements, and the “best” solution should always be determined by considering the actual data distribution. To this end, we compare the two different paradigms that partition an inverted list for efficient query processing, either by cardinality or by universe.
Figure 19 clearly summarizes this experimental comparison: it shows the different space/time trade-off points achievable for the list intersection operation, the core operation for inverted indexes. On the one hand, techniques that use a partitioning-by-cardinality approach, such as the Elias-Fano-based methods and Interpolative, offer the best space effectiveness; on the other hand, the partitioning-by-universe paradigm offers remarkably improved intersection efficiency at the expense of space effectiveness, as is apparent with the Roaring method. The Slicing solution devised here offers a leading compromise between these two extremes, combining operational efficiency with space effectiveness. Observe that the Variable-Byte mechanism is generally dominated by other space/time trade-off points: its main strength lies in the simplicity of the implementation and the remarkably compact corresponding code (as long as SIMD instructions are not considered).
Because of the maturity reached by the state-of-the-art and the specificity of the problem, identifying future research directions is not immediate. We mention some promising ones.
In general, devising “simpler” compression formats that can be decoded with algorithms using low-latency instructions (e.g., bitwise) and with as few branches as possible, is a profitable line of research, as demonstrated by the experimentation in this article. Such algorithms favour the super-scalar execution of modern CPUs and are also suitable for SIMD instructions.
Another direction could look at devising dynamic, compressed representations for integer sequences, also able to support additions and deletions. This problem is actually a specific case of the more general dictionary problem, a fundamental textbook problem. While a theoretical solution already exists with all operations supported in optimal time within succinct space (Pibiri and Venturini, 2017b), an implementation with good practical performance could be of great interest for dynamic inverted indexes.