Handling Massive N-Gram Datasets Efficiently

06/25/2018 ∙ by Giulio Ermanno Pibiri, et al. ∙ University of Pisa

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step, by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.


1. Introduction

The use of n-grams is widespread and vital for many tasks in Information Retrieval, Natural Language Processing and Machine Learning, such as: auto-completion in search engines (Bar-Yossef and Kraus, 2011; Mitra et al., 2014; Mitra and Craswell, 2015), spelling correction (Kukich, 1992), similarity search (Kondrak, 2005), identification of text reuse and plagiarism (Seo and Croft, 2008; Huston et al., 2011), automatic speech recognition (Jurafsky and Martin, 2014) and machine translation (Heafield, 2011; Pauls and Klein, 2011), to mention some of the most notable.

As an example, query auto-completion is one of the key features that any modern search engine offers to help users formulate their queries. The objective is to predict the query by saving keystrokes: this is implemented by reporting the top-k most frequently-searched n-grams that follow the words typed by the user (Bar-Yossef and Kraus, 2011; Mitra et al., 2014; Mitra and Craswell, 2015). The identification of such patterns is possible by traversing a data structure that stores the n-grams seen in previous user searches. Given the number of users served by large-scale search engines and the high query rates, it is of utmost importance that such data structure traversals are carried out in a handful of microseconds (Jurafsky and Martin, 2014; Bar-Yossef and Kraus, 2011; Croft et al., 2009; Mitra et al., 2014; Mitra and Craswell, 2015). Another notable example is spelling correction in text editors and web search. In their basic formulation, n-gram spelling correction techniques work by looking up every n-gram of the input string in a pre-built data structure, in order to assess its existence or return a statistic, e.g., a frequency count, to guide the correction (Kukich, 1992). If the n-gram is not found in the data structure, it is marked as a misspelled pattern: in that case, correction happens by suggesting the most frequent word that follows the pattern with the longest matching history (Jurafsky and Martin, 2014; Kukich, 1992; Croft et al., 2009).

At the core of all the mentioned applications lies an efficient data structure mapping n-grams to their associated satellite data, e.g., a frequency count representing the number of occurrences of the n-gram, or probability/backoff weights for word-predicting computations (Heafield, 2011; Pauls and Klein, 2011). The efficiency of the data structure should come both in time and space, because modern string search and machine translation systems make very frequent queries over databases containing several billion n-grams that often do not fit in internal memory (Jurafsky and Martin, 2014; Croft et al., 2009). To reduce the memory-access rate and, hence, speed up the execution of the retrieval algorithms, the design of an efficient compressed representation of the data structure appears mandatory. While several solutions have been proposed for the indexing and retrieval of n-grams, either based on tries (Fredkin, 1960) or hashing (Lewis and Cook, 1988), their practicality is actually limited because of some important inefficiencies that we discuss below.

Context information, such as the fact that relatively few words may follow a given context, is not currently exploited to achieve better compression ratios. When query processing speed is the main concern, space efficiency is almost completely neglected by not compressing the data structure with sophisticated encoding techniques (Heafield, 2011). In fact, space reductions are usually achieved either by lossy quantization of satellite values, or by randomized approaches with false positives allowed (Talbot and Osborne, 2007). The most space-efficient and lossless proposals still employ binary search over the compressed representation to look up an n-gram: this results in a severe inefficiency during query processing because of the lack of a compression strategy with a fast random access operation (Pauls and Klein, 2011). To support random access, current methods rely on block-wise compression, with expensive decompression of a block every time an element of the block has to be retrieved. Finally, hashing schemes based on open addressing with linear probing turn out to be extremely large for static corpora, as the tables are allocated with extra space to allow fast random access (Heafield, 2011; Pauls and Klein, 2011).

Since a solution that is compact, fast and lossless at the same time is still missing, the first aim of this paper is that of addressing the aforementioned inefficiencies by introducing compressed data structures that, despite their small memory footprint, support efficient random access to the satellite n-gram values. We refer to this problem as the one of indexing n-gram datasets.

The other, correlated, problem that we study in this paper is the one of computing the probability distribution of the n-grams extracted from large textual collections. We refer to this second problem as the one of estimation. In other words, we would like to create an efficient, compressed index that maps each n-gram of a large text to its probability of occurrence in the text. Clearly, the way such probability is computed depends on the chosen model. This is an old problem that has received a lot of attention: not surprisingly, several models have been proposed in the literature, such as Laplace, Good-Turing, Katz, Jelinek-Mercer, Witten-Bell and Kneser-Ney (see (Chen and Goodman, 1996, 1999) and references therein for a complete description and comparison).

Among the many, Kneser-Ney language models (Kneser and Ney, 1995) and, in particular, their modified version introduced by Chen and Goodman (1999), have gained popularity thanks to their relatively low-perplexity performance. This makes modified Kneser-Ney the de-facto choice for language model toolkits. The following software libraries, widely used in both academia and industry (e.g., Google (Brants et al., 2007; Chelba and Schalkwyk, 2013) and Facebook (Chen et al., 2015)), all support modified Kneser-Ney smoothing: KenLM (Heafield, 2011), BerkeleyLM (Pauls and Klein, 2011), RandLM (Talbot and Osborne, 2007), Expgram (Watanabe et al., 2009), MSRLM (Nguyen et al., 2007), SRILM (Stolcke, 2002), IRSTLM (Federico et al., 2008) and the recent approach based on suffix trees by Shareghi et al. (2015, 2016). For these reasons, Kneser-Ney is the model we consider in this work too, and the one we review in Section 4.

The current limitation of the mentioned software libraries is that estimation of such models occurs in internal memory and, as a result, they are not able to scale to the dimensions we consider in this work. An exception is represented by the work of Heafield, Pouzyrevsky, Clark, and Koehn (2013), who contributed an estimation algorithm involving three steps of sorting in external memory. Their solution embodies the current state-of-the-art solution to the problem: the algorithm uses, on average, only a small fraction of the CPU time and RAM of the cited toolkits (Heafield et al., 2013). Therefore, our work aims at improving upon the I/O efficiency of this approach.

1.1. Our contributions


  1. We introduce a compressed trie data structure in which each level of the trie is modeled as a monotone integer sequence that we encode with Elias-Fano (Elias, 1974; Fano, 1971), so as to efficiently support random access operations and successor queries over the compressed sequence. Our hashing approach leverages minimal perfect hashing in order to use tables of size equal to the number of stored grams per level, with one random access to retrieve the associated n-gram information.

  2. We describe a technique for lowering the space usage of the trie data structure by reducing the magnitude of the integers that form its monotone sequences. Our technique is based on the observation that, in any natural language, few distinct words follow a predefined context. In particular, each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context.

  3. We present an extensive experimental analysis to demonstrate that our technique offers significantly better compression with respect to the plain Elias-Fano trie, while only introducing a slight penalty at query processing time. Our data structures outperform all state-of-the-art proposals for space usage, without compromising their time performance. More precisely, the most space-efficient proposals in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to several times slower. Conversely, we are as fast as the fastest competitor, while also retaining a clear advantage in absolute space.

  4. We design a faster estimation algorithm that requires only one step of sorting in external memory, as opposed to the state-of-the-art approach (Heafield et al., 2013) that requires three steps of sorting. The result is achieved by carefully exploiting the properties of the extracted n-gram strings. Thanks to such properties, we show how it is possible to perform the whole estimation on the context-sorted strings and, yet, be able to efficiently lay out a reverse trie data structure indexing such strings in suffix order. We show that saving two steps of sorting in external memory yields a solution that is, on average, significantly faster than the fastest algorithm proposed in the literature.

  5. We introduce many optimizations to further enhance the running time of our proposal, such as: asynchronous CPU and I/O threads, parallel LSD radix sort, block-wise compression and multi-threading. With an extensive experimental analysis conducted over large textual datasets, we study the behavior of our solution at each step of estimation, quantify the impact of the introduced optimizations and compare against the state-of-the-art. The devised optimizations further reduce the running time, making our algorithm 4.5X faster on average than the state-of-the-art solution.

1.2. Paper organization

Although the two problems we address in this paper, i.e., indexing and estimation, are strictly correlated, we treat them one after the other in order to introduce the whole material incrementally, without burdening the exposition. In particular, we show the experimental evaluation right after the description of our techniques for each problem, rather than deferring it to the end of the paper. We believe this form is the most suitable to convey the results that we want to document with this paper. In our intention, each section of this document is an independent unit of exposition. Based on these considerations, the paper is structured as follows.

Section 2 fixes the notation and provides some basic notions about n-grams. More detailed background will be provided when needed in the relevant (sub-)sections of the paper.

Section 3 treats the problem of indexing. Subsection 3.1 reviews the standard data structures used to index n-gram datasets in compressed space and how these are used by the proposals in the literature. Subsections 3.2 and 3.3 describe our compressed data structures, whose efficiency is validated in Subsection 3.4 with a rich set of experiments.

Section 4 treats the problem of estimation. After reviewing the Kneser-Ney smoothing technique in Subsection 4.1, we describe the state-of-the-art approach in Subsection 4.2, because we aim at improving the efficiency of that algorithm. We present our improved estimation process in Subsection 4.3 and test its performance in Subsection 4.4. We conclude the paper in Section 5.

2. Background and Notation

A language model (LM) is a probability distribution that describes how often a string, drawn from the set of all possible strings over a vocabulary of words, appears in some domain of interest. The central goal of a language model is to compute the probability of a word w_i given its preceding history of i − 1 words, called the context, that is P(w_i | w_1 ⋯ w_{i−1}) for all i. Informally, the goal is to predict the “next” word following a given context.

When efficiency is the main concern, N-gram language models are adopted. An n-gram is a sequence of at most N tokens. A token can be either a single character or a word, the latter intended as a sequence of characters delimited by a special symbol, e.g., a whitespace character. Unless otherwise specified, throughout the paper we consider n-grams as consisting of words. Since we impose that n ≤ N, where N is a small constant (typically 5), dealing with strings of this form permits working with a context of at most N − 1 preceding words. This ultimately implies that the aforementioned probability can be approximated with P(w_i | w_{i−N+1} ⋯ w_{i−1}). The way each n-gram probability is computed depends on the chosen model.

Several models have been proposed in the literature, such as Laplace, Good-Turing, Katz, Jelinek-Mercer, Witten-Bell and Kneser-Ney (see (Chen and Goodman, 1996, 1999) and references therein for a complete description and comparison). For an N-gram backoff-smoothed language model, the probability of w_i with context w_{i−N+1} ⋯ w_{i−1} is assigned according to the following recursive equation:

P(w_i | w_{i−N+1} ⋯ w_{i−1}) = P′(w_i | w_{i−N+1} ⋯ w_{i−1}), if w_{i−N+1} ⋯ w_i was seen in the corpus;
P(w_i | w_{i−N+1} ⋯ w_{i−1}) = b(w_{i−N+1} ⋯ w_{i−1}) · P(w_i | w_{i−N+2} ⋯ w_{i−1}), otherwise;

that is: if the model has enough information we use the full distribution P′, otherwise we back off to the lower-order distribution with penalty b.

Clearly, the bigger the language model, the more accurate the computed probability will be. In other words, predictions will be more accurate when more n-grams are used to estimate the probability of a word following a given context. Therefore, we would like to handle as many n-grams as possible: this paper describes techniques to handle several billions of n-grams. Such n-gram strings are extracted from text, in any of its different incarnations, e.g., web pages, novels, code fragments and scientific articles, by adopting a sliding-window approach. A window of n words, for each 1 ≤ n ≤ N, slides over the text, counting the number of times such words appear in the text. This counting process is usually implemented using a hash data structure, whose keys are the distinct n-gram strings and whose values are the accumulated frequency counts: if the extracted n-gram is not already present in the table, a new entry is allocated with associated value 1; otherwise the corresponding value is incremented by 1. This process is repeated for the different window sizes over huge text corpora: this gives rise to colossal datasets in terms of number of distinct strings. As a concrete example, if all distinct n-grams for values of n ranging from 1 to 5 are extracted from Agner Fog's manual Optimizing software in C++ (Fog, 2014), the number of distinct grams already reaches the order of hundreds of thousands, for a technical manual of a few hundred pages written in English. Google did the same but on approximately 8 million books, or roughly 6% of all books ever published (Lin et al., 2012), yielding a dataset of several billion n-grams (see also Table 1). This motivates and helps understanding the need for efficient data structures, in both memory footprint and access speed, able to manage such a quantity of strings.
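To make the sliding-window extraction concrete, the following minimal C++ sketch counts all distinct n-grams up to a maximum order with a hash table, as described above. It is a didactic illustration only: names such as count_ngrams and max_order are ours and do not belong to any of the toolkits discussed in this paper.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Count all distinct n-grams of order 1..max_order in a tokenized text,
    // using a hash table keyed by the n-gram string (tokens joined by a space).
    std::unordered_map<std::string, uint64_t>
    count_ngrams(const std::vector<std::string>& tokens, size_t max_order) {
        std::unordered_map<std::string, uint64_t> counts;
        for (size_t i = 0; i < tokens.size(); ++i) {
            std::string gram;
            for (size_t n = 1; n <= max_order && i + n <= tokens.size(); ++n) {
                if (n > 1) gram += ' ';
                gram += tokens[i + n - 1];
                ++counts[gram];  // new entries start at 0 and become 1 on first sight
            }
        }
        return counts;
    }

    int main() {
        std::vector<std::string> text = {"A", "A", "C", "B", "B", "C"};
        auto counts = count_ngrams(text, 3);
        std::cout << counts["A A"] << "\n";  // prints 1
        std::cout << counts["B"]   << "\n";  // prints 2
    }

Running it on a real corpus would only require tokenizing the text into the tokens vector beforehand.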

3. Compressed Indexes

The problem we tackle in this section of the paper is the one of representing, in compressed space, a dataset of n-gram strings and their associated values, being either frequency counts (integers) or probabilities (floating point numbers). Given an n-gram string, the compressed data structure should allow fast random access to the corresponding value by means of the operation Lookup.

3.1. Related Work

In this subsection we first discuss the classic data structures used to represent large n-gram datasets efficiently, highlighting the advantages and disadvantages of these approaches in relation to the structural properties that n-gram datasets exhibit. Next, we consider how these approaches have been adopted by different proposals in the literature. Two different data structures are mostly used to store large and sparse n-gram datasets: tries (Fredkin, 1960) and hash tables (Lewis and Cook, 1988).

Tries. A trie is a tree data structure devised for efficient indexing and search of string dictionaries, in which the common prefixes shared by the strings are represented once to achieve compact storage. This property makes this data structure useful for storing the n-gram strings in compressed space. In this case, each constituent word of an n-gram is associated with a node in the trie and different n-grams correspond to different root-to-leaf paths. These paths must be traversed to resolve a query, which retrieves the string itself or an associated satellite value, e.g., a frequency count. Conceptually, a trie implementation has to store a triplet for each node: the associated word, its satellite value and a pointer to each child node. As N is typically very small and each node has many children, tries are of short height and dense. Therefore, they are implemented as a collection of (few) sorted arrays: for each level of the trie, a separate array is built to contain all the triplets for that level, sorted by the words. In this implementation, a pair of adjacent pointers indicates the sub-array listing all the children of a word, which can be inspected by binary search, as sketched below.
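The following simplified, uncompressed sketch illustrates the sorted-array organization just described: one struct per trie level, with a pair of adjacent pointers delimiting the children of a node, inspected by binary search. Names and field layout are ours, not those of any specific package.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One level of the trie: for each node of the level, in sorted order,
    // we store its word ID, its satellite value and a pointer delimiting
    // the block of its children in the next level.
    struct trie_level {
        std::vector<uint32_t> words;     // word IDs, sorted within each parent's block
        std::vector<uint64_t> values;    // satellite data, e.g., frequency counts
        std::vector<uint64_t> pointers;  // pointers[i], pointers[i+1] delimit the children of node i
    };

    // Locate word `w` among the children of the node at position `parent_pos`
    // of `upper`, stored in `lower`. Returns the position in `lower`, or -1.
    int64_t find_child(const trie_level& upper, const trie_level& lower,
                       uint64_t parent_pos, uint32_t w) {
        uint64_t begin = upper.pointers[parent_pos];
        uint64_t end = upper.pointers[parent_pos + 1];  // one-past-the-end dummy pointer
        auto first = lower.words.begin() + begin;
        auto last = lower.words.begin() + end;
        auto it = std::lower_bound(first, last, w);     // binary search in the block
        if (it == last || *it != w) return -1;
        return it - lower.words.begin();
    }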

Hash tables. Hashing is another way to implement associative arrays: for each value of n from 1 to N, a separate hash table stores all grams of order n. At the location indicated by the hash function the following information is stored: a fingerprint value to lower the probability of a false positive (typically a 4- or 8-byte hash of the n-gram itself) and the satellite data for the n-gram. This data structure permits accessing the specified n-gram data in expected constant time. Open addressing with linear probing is usually preferred over chaining for its better locality of access.

Tries are usually designed for space-efficiency, as the formed sorted arrays are highly compressible. However, retrieving the value of an n-gram involves exactly n searches in the constituent arrays. Conversely, hashing is designed for speed but sacrifices space-efficiency, since its keys, along with their fingerprint values, are randomly distributed and, therefore, incompressible. Moreover, hashing is a randomized solution, i.e., there is a non-null probability of retrieving a frequency count for an n-gram not really belonging to the indexed corpus (a false positive). Such probability equals 2^(−f), where f indicates the number of bits dedicated to the fingerprint values: larger values of f yield a smaller probability of false positive but also increase the space of the data structure.

State-of-the-art. The paper by Pauls and Klein (Pauls and Klein, 2011) proposes trie-based data structures in which the nodes are represented via sorted arrays or with hash tables with linear probing. The trie sorted arrays are compressed using a variable-length block encoding: a configurable radix r = 2^k is chosen, the number of digits needed to represent a number in base r is written in unary, and the representation then terminates with the digits themselves, each of which requires exactly k bits. To preserve the property of looking up a record by binary search, each sorted array is divided into blocks of a fixed number of bytes. The encoding is used to compress words, pointers and the positions that frequency counts take in a unique-value array that collects all distinct counts. The hash-based variant is likely to be faster than the sorted-array variant, but requires extra table allocation space to avoid excessive collisions.

Heafield (Heafield, 2011) improves the sorted-array trie implementation with some optimizations. The keys in the arrays are replaced by their hashes and sorted, so that they are uniformly distributed over their ranges. Now finding a word ID in a trie level of size m can be done in O(log log m) time with high probability by using interpolation search (Demaine et al., 2004) (unless otherwise specified, all logarithms are in base 2). Records in each sorted array are minimally sized at the bit level, improving the memory consumption over (Pauls and Klein, 2011). Pointers are compressed using the integer compressor devised in (Raj and Whittaker, 2003). Values can also be quantized using the binning method (Federico and Bertoldi, 2006), which sorts the values, divides them into equally-sized bins and then elects the average value of the bin as the representative of the bin. The number of chosen quantization bits directly controls the number of created bins and, hence, the trade-off between space and accuracy.

Talbot and Osborne (Talbot and Osborne, 2007) use Bloom filters (Bloom, 1970) with lossy quantization of frequency counts to achieve a small memory footprint. In particular, the raw frequency count c(g) of a gram g is quantized using a logarithmic codebook, i.e., qc(g) = 1 + ⌊log_b c(g)⌋. The scale is determined by the base b of the logarithm, which is chosen according to the quantization range used by the model. Given the quantized count qc(g) of a gram g, a Bloom filter is trained by entering composite events into the filter, represented by g with an appended integer value j, which is incremented from 1 to qc(g). Then, at query time, to retrieve qc(g), the filter is queried with a counter j = 1 appended to g. This event is hashed using the hash functions of the filter: if all of them test positive, then the counter is incremented and the process repeated. The procedure terminates as soon as any of the hash functions hits a 0, and the previous value of the counter is reported. This procedure avoids a space requirement for the counts proportional to the number of grams in the corpus because only the codebook needs to be stored. The one-sided error of the filter and the training scheme ensure that the actual quantized count cannot be larger than the reported value. As the counts are quantized using a logarithmic-scaled codebook, the counter will be incremented only a small number of times. The quantized logarithmic count is finally converted back to a linear count.
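The training and retrieval loops can be sketched as follows. This is only an illustration of the scheme described above: a std::unordered_set stands in for the actual Bloom filter (so no false positives occur here), and the codebook base used in main is an arbitrary illustrative value, not the one of the RandLM implementation.

    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_set>

    // Stand-in for the Bloom filter: a set of the inserted composite events.
    // A real filter answers membership with a small false positive rate.
    using filter_t = std::unordered_set<std::string>;

    // Training: insert the events "gram#j" for j = 1..quantized_count(gram).
    void train(filter_t& filter, const std::string& gram, uint64_t count, double base) {
        uint64_t qc = 1 + (uint64_t)std::floor(std::log((double)count) / std::log(base));
        for (uint64_t j = 1; j <= qc; ++j) filter.insert(gram + "#" + std::to_string(j));
    }

    // Query: increment j while "gram#j" tests positive; the reported quantized
    // count is the last j that succeeded, then converted back to a linear count.
    uint64_t lookup_count(const filter_t& filter, const std::string& gram, double base) {
        uint64_t j = 0;
        while (filter.count(gram + "#" + std::to_string(j + 1))) ++j;
        if (j == 0) return 0;                             // gram not in the corpus
        return (uint64_t)std::pow(base, j - 1);           // linear estimate from the codebook
    }

    int main() {
        double base = std::pow(2.0, 1.0 / 8);  // illustrative logarithmic codebook base
        filter_t filter;
        train(filter, "the cat", 100, base);
        std::cout << lookup_count(filter, "the cat", base) << "\n";  // close to 100
    }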

The use of the succinct encoding LOUDS (Level-Order Unary-Degree Sequence) (Jacobson, 1989) is advocated in (Watanabe et al., 2009) to implicitly represent the trie nodes. In particular, the pointers for a trie of m nodes are encoded using a bitvector of 2m + 1 bits. Bit-level searches on such bitvector allow forward/backward navigation of the trie structure. Words and frequency counts are compressed using Variable-Byte encoding (Thiel and Heaps, 1972; Salomon, 2007), with an additional bitvector used to indicate the boundaries of the byte sequences, so as to support random access to each element. The paper also discusses the use of block-wise compression (basically gzip on fixed-size blocks), though it is not used in the implementation for time efficiency reasons. Shareghi et al. (2015, 2016) also consider the usage of succinct data structures to represent suffix trees that can be used to compute Kneser-Ney probabilities on-the-fly. Experimental results indicate that the method is practical for large-scale language modeling, although significantly slower to query than leading toolkits for language modeling (Heafield, 2011).

Because of the importance of strings as one of the most common kinds of computerized information, the problem of trie-based storage for string dictionaries is among the most studied in computer science, with many different solutions available (Heinz et al., 2002; Navarro et al., 2001; Clark and Munro, 1996). It goes without saying that, given the properties that n-gram datasets exhibit, generic trie implementations are not suitable for their efficient treatment. However, comparing with the performance of such implementations gives useful insights about the performance gap with respect to a general solution. We mention Marisa (Yata, 2011) as the best practical general-purpose trie implementation. The core idea is to use Patricia tries (Morrison, 1968) to recursively represent the nodes of a Patricia trie. This clearly comes with a space/time trade-off: the more levels of recursion are used, the greater the space saving but also the higher the retrieval time.

3.2. Elias-Fano Tries

In this subsection we present our main result: a compressed trie data structure based on the Elias-Fano representation (Elias, 1974; Fano, 1971) of monotone integer sequences, for its efficient random access and search operations. As we will see, the constant-time random access of Elias-Fano makes it the right choice for the encoding of the sorted-array trie levels, given that we fundamentally need to randomly access the sub-array pointed to by a pair of pointers. Such a pair is retrieved in constant time too. Now every access performed during a binary search takes O(1) without requiring any block decompression, differently from currently employed strategies (Pauls and Klein, 2011).

We also introduce a novel technique to lower the memory footprint of the trie levels by losslessly reducing the magnitude of their constituent integers. This reduction is achieved by mapping a word ID conditionally to its context of fixed length k, i.e., its preceding k words.

3.2.1. Core Data Structure

This subsection contains the core description of the compressed trie data structure: we dedicate one paragraph to each of its main building components, i.e., how the grams, satellite data and pointers are represented; how searches are implemented.

Figure 1. On the left (a): example of a trie of order 3, representing the set of grams A, AA, AAC, AC, B, BB, BBC, BBD, BC, BCD, BD, CA, CD, DB, DBB, DBC, DDD. On the right (b): the sorted-array representation of the trie. Light-gray arrays represent the pointers.

As it is standard, a unique integer ID is assigned to each distinct token (uni-gram) to form the vocabulary of the indexed corpus. Uni-grams are indexed using a hash data structure that stores for each gram its ID, in order to retrieve it when needed in O(1). If we sort the n-grams following the token-ID order, we have that all the successors of a gram g, i.e., all grams whose prefix is g, form a strictly increasing integer sequence. For example, suppose we have the uni-grams A, B, C, D (throughout this subsection we consider, for simplicity, a gram as consisting of capital letters), which are assigned, say, IDs 0, 1, 2 and 3, respectively. Now consider the bi-grams AA, AC, BB, BC, BD, CA, CD, DB, DD sorted by IDs. The sequence of the successors of A, referred to as the range of A, is A, C, i.e., [0, 2]; the sequence of the successors of B is B, C, D, i.e., [1, 2, 3]; and so on. Figure 1 shows a graphical representation of what described. Concatenating the ranges, we obtain the integer sequence [0, 2, 1, 2, 3, 0, 3, 1, 3]. In order to distinguish the successors of a gram from the others, we also maintain where each range begins in a monotone integer sequence of pointers. In our example, the sequence of pointers is [0, 2, 5, 7, 9] (we also store a final dummy pointer to be able to obtain the last range length by taking the difference between the last and previous pointer). The ID assigned to a uni-gram is also used as the position at which we read the uni-gram pointer in the uni-gram pointer sequence.

Therefore, apart from uni-grams, which are stored in a hash table, each level of the trie is composed of two integer sequences: one for the representation of the gram-IDs, the other for the pointers. Now, what we need is an efficient encoding for integer sequences. Among the many integer compressors available in the literature (see the book by Salomon (Salomon, 2007) for a complete overview), we choose Elias-Fano (along with its partitioned variant (Ottaviano and Venturini, 2014)), which has been recently applied to inverted index compression showing an excellent time/space trade-off (Vigna, 2013; Ottaviano and Venturini, 2014; Pibiri and Venturini, 2017). We now describe this elegant integer encoding.

Elias-Fano. Given a monotonically increasing sequence S(n, u) of n positive integers drawn from a universe of size u (i.e., S[i] ≤ S[i + 1] for any 0 ≤ i < n − 1, with S[n − 1] ≤ u), we write each S[i] in binary using ⌈log₂ u⌉ bits. The binary representation of each integer is then split into two parts: a low part consisting in the right-most ℓ = ⌈log₂(u/n)⌉ bits, that we call low bits, and a high part consisting in the remaining ⌈log₂ u⌉ − ℓ bits, that we similarly call high bits. Let us call ℓ_i and h_i the values of the low and high bits of S[i] respectively (notice that, given ℓ: ℓ_i = S[i] & (2^ℓ − 1) and h_i = S[i] >> ℓ, where >> is the right shift operator and & the bitwise AND). The Elias-Fano representation of S is given by the encoding of the high and low parts. The array L = [ℓ_0, …, ℓ_{n−1}] is written explicitly in n·ℓ bits and represents the encoding of the low parts. Concerning the high bits, we represent them in negated unary (the negated unary representation of an integer is the bitwise NOT of its unary representation; for example, 0001 becomes 1110) using a bit vector H of n + ⌈u/2^ℓ⌉ ≤ 2n bits, as follows. We start from a 0-valued bit vector H and set the bit in position h_i + i, for all 0 ≤ i < n. Finally, the Elias-Fano representation of S is given by the concatenation of H and L and overall takes

EF(S(n, u)) ≤ n⌈log₂(u/n)⌉ + 2n bits.    (1)

Despite its simplicity, it is possible to randomly access an integer from a sequence compressed with Elias-Fano without decompressing it. The operation is supported using an auxiliary data structure built on the bit vector H, able to efficiently answer Select(i) queries, which return the position in H of the i-th 1 bit. This auxiliary data structure is succinct in the sense that it is negligibly small compared to H, requiring only o(n) additional bits (Clark, 1996; Vigna, 2008). Using the Select primitive, it is possible to implement Access(i), which returns S[i] for any 0 ≤ i < n, in O(1). We basically have to re-link together the high and low bits of an integer, previously split up during the encoding phase. While the low bits ℓ_i are trivial to retrieve, as we only need to read the range of bits L[i·ℓ, (i + 1)·ℓ), the high bits deserve a bit more care. Since we write in negated unary how many integers share the same high part, we have a bit set for every integer of S and a zero for every distinct high part. Therefore, to retrieve the high bits of the i-th integer, we need to know how many zeros are present in H before the position of its set bit. This quantity is evaluated in O(1) as h_i = Select(i) − i. Finally, linking the high and low bits is as simple as: S[i] = (h_i << ℓ) | ℓ_i, where << is the left shift operator and | the bitwise OR.
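The following didactic C++ sketch implements the encoding and the Access operation exactly as described, except that the low parts are stored unpacked and Select is computed with a linear scan instead of the o(n)-bit auxiliary structure; it is an illustration under our own naming, not the code of the library presented in this paper.

    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct elias_fano {
        uint64_t l = 0;               // number of low bits per element
        std::vector<uint64_t> low;    // low parts (unpacked here; a real coder bit-packs them)
        std::vector<bool> high;       // high parts in negated unary: a 1 per element, a 0 per bucket

        void build(const std::vector<uint64_t>& s, uint64_t u) {
            uint64_t n = s.size();
            while ((uint64_t(1) << l) * n < u) ++l;       // l = ceil(log2(u / n))
            high.assign(n + (u >> l) + 1, false);
            for (uint64_t i = 0; i < n; ++i) {
                low.push_back(s[i] & ((uint64_t(1) << l) - 1));  // right-most l bits
                high[(s[i] >> l) + i] = true;                    // set bit h_i + i
            }
        }

        // access(i) = ((select(i) - i) << l) | low[i]
        uint64_t access(uint64_t i) const {
            uint64_t seen = 0, pos = 0;
            for (;; ++pos) {                    // select(i): position of the i-th 1 bit in `high`
                if (high[pos] && seen++ == i) break;
            }
            return ((pos - i) << l) | low[i];
        }
    };

    int main() {
        std::vector<uint64_t> s = {3, 4, 7, 13, 14, 15, 21, 43};
        elias_fano ef;
        ef.build(s, 43);
        for (uint64_t i = 0; i < s.size(); ++i) assert(ef.access(i) == s[i]);
        std::cout << ef.access(3) << "\n";  // prints 13
    }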

Partitioned Elias-Fano. The crucial characteristic of the Elias-Fano space bound (1) is that it only depends on two parameters, i.e., the length n and the universe u of the sequence, which poorly describe the sequence itself. If the sequence presents regions of close identifiers, i.e., formed by integers that slightly differ from one another, Elias-Fano fails to exploit such natural clusters. Clearly, we would obtain a better space usage if such regions were encoded separately. Partitioning the sequence into chunks to better adapt to such regions of close identifiers is the key idea of the partitioned Elias-Fano representation (PEF in the following) (Ottaviano and Venturini, 2014).

The core idea is as follows. We partition the sequence into chunks of contiguous integers. The first level of the representation is made up of the last element of each chunk. This level is encoded with Elias-Fano. The second level is represented by the encoding of the chunks themselves. The main reason for introducing this two-level representation is that now the elements of each chunk are encoded with a smaller universe, namely the one delimited by the upper bounds of the chunk and of its predecessor. This is, however, a uniform-partitioning strategy that may be suboptimal, since we cannot expect clusters of integers to be aligned to such boundaries. As the problem of choosing the best possible partition is posed, an algorithm based on dynamic programming is presented in (Ottaviano and Venturini, 2014) which, in linear time, yields a partition whose cost (i.e., the space taken by the encoded sequence) is at most a factor (1 + ε) away from that of an optimal partition, for any fixed ε > 0. To support variable-size partitions, another sequence is maintained in the first level of the representation, which encodes (again with Elias-Fano) the sizes of the chunks in the second level.

This sequence organization introduces a level of indirection when resolving the queries, because a first search must be spent in the first level of the representation to identify the block in which the searched ID is located. We will return to and stress this point in the experimental Subsection 3.4.

Gram-ID sequences and pointers. While the sequences of pointers are monotonically increasing by construction and, therefore, immediately Elias-Fano encodable, the gram-ID sequences may not be. However, a gram-ID sequence can be transformed into a monotone one, though not strictly increasing, by taking range-wise prefix sums: to the values of a range we sum the last prefix sum (initially equal to 0). Then, our exemplar sequence becomes [0, 2, 3, 4, 5, 5, 8, 9, 11]: the last prefix sum is initially 0, therefore the range of A remains the same, i.e., [0, 2]; now the last prefix sum is 2, so we sum 2 to the values in the range of B, yielding [3, 4, 5]; and so on. In particular, if we sort the vocabulary IDs in decreasing order of occurrence, we make small IDs appear more often than large ones, and this is highly beneficial for containing the growth of the universe and, hence, for Elias-Fano, whose space occupancy critically depends on it. We emphasize this point again: for each uni-gram in the vocabulary we count the number of times it appears in all gram-ID sequences. Notice that the number of occurrences of an n-gram can be different from its frequency count as reported in the indexed corpus. The reason is that such corpora often do not include the n-grams appearing less than a predefined frequency threshold.
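A small sketch of the range-wise prefix-sum transformation, using the running example above (with the hypothetical ID assignment we adopted for it); prefix_sum_ranges is our own name.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Turn the concatenation of per-range increasing sequences into a single
    // monotone (non-decreasing) sequence: to each range we add the last
    // prefix-summed value of the previous range.
    std::vector<uint64_t> prefix_sum_ranges(const std::vector<uint64_t>& ids,
                                            const std::vector<uint64_t>& pointers) {
        std::vector<uint64_t> out(ids);
        uint64_t last = 0;  // last prefix sum, initially 0
        for (size_t r = 0; r + 1 < pointers.size(); ++r) {
            for (uint64_t i = pointers[r]; i < pointers[r + 1]; ++i) out[i] += last;
            if (pointers[r + 1] > pointers[r]) last = out[pointers[r + 1] - 1];
        }
        return out;
    }

    int main() {
        // Ranges of the successors of A, B, C, D from the running example.
        std::vector<uint64_t> ids = {0, 2, 1, 2, 3, 0, 3, 1, 3};
        std::vector<uint64_t> pointers = {0, 2, 5, 7, 9};
        for (uint64_t v : prefix_sum_ranges(ids, pointers)) std::cout << v << ' ';
        std::cout << "\n";  // prints: 0 2 3 4 5 5 8 9 11
    }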

Frequency counts. To represent the frequency counts, we use the unique-value array technique, i.e., each count is represented by its rank in an array, one for each separate value of n, that collects all the distinct frequency counts. The reason for this is that the distribution of the frequency counts is extremely skewed (see Table 1), i.e., relatively few n-grams are very frequent while most of them appear only a few times. Now each level of the trie, besides the sequences of gram-IDs and pointers, has also to store the sequence made up of all the frequency count ranks. Unfortunately, this sequence of ranks is not monotone, yet it follows the aforementioned highly repetitive distribution. Therefore, we assign to each count rank a codeword of variable length. As similarly done for the gram-IDs, by assigning shorter codewords to the most repetitive count ranks, we have most ranks encoded with just a few bits. More specifically, starting from codeword length ℓ = 1, we first assign all the codewords of length ℓ before increasing ℓ by 1 and repeating the process, until all count ranks have been considered. Therefore, we first assign codewords 0 and 1, then codewords 00, 01, 10, 11, 000 and so on. All codewords are then concatenated one after the other in a bitvector C. Following (Fredriksson and Nikitin, 2007), to the i-th value we give the codeword c = i + 2 − 2^ℓ, where ℓ = ⌊log₂(i + 2)⌋ is the number of bits dedicated to the codeword. From the codeword c and its length ℓ in bits, we can retrieve i by taking the inverse of the previous formula, i.e., i = c + 2^ℓ − 2. Besides the bitvector C for the codewords themselves, we also need to know where each codeword begins and ends. We can use another bitvector for this purpose, say B, which stores a 1 at the starting position of every codeword. A small additional data structure built on B allows efficient computation of Select(i), which we use to retrieve the i-th codeword: in fact, Select(i) gives us its starting position. Its length ℓ is easily computed by scanning B upward from position Select(i) + 1 until we hit the next 1, say in position p, so that ℓ = p − Select(i) and c is read from C at the corresponding positions.
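The codeword assignment and its inverse can be sketched as follows; the formulas reconstruct the enumeration just described (0, 1, 00, 01, 10, 11, 000, …), and the function names are ours.

    #include <cstdint>
    #include <iostream>
    #include <utility>

    // Assign to the i-th value (i = 0, 1, 2, ...) the i-th codeword in the
    // enumeration 0, 1, 00, 01, 10, 11, 000, ... : its length is
    // len = floor(log2(i + 2)) bits and its value is c = i + 2 - 2^len.
    std::pair<uint64_t, uint64_t> encode(uint64_t i) {
        uint64_t len = 0, x = i + 2;
        while (x >>= 1) ++len;                       // len = floor(log2(i + 2))
        return {i + 2 - (uint64_t(1) << len), len};  // {codeword, length in bits}
    }

    // Inverse mapping: from a codeword and its length back to the value index.
    uint64_t decode(uint64_t codeword, uint64_t len) {
        return codeword + (uint64_t(1) << len) - 2;
    }

    int main() {
        for (uint64_t i = 0; i < 7; ++i) {
            auto cw = encode(i);
            std::cout << i << " -> codeword " << cw.first << " in " << cw.second << " bits\n";
            // decode(cw.first, cw.second) == i for every i
        }
    }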

In conclusion, each level of the trie stores three sequences: the gram-ID sequence , the count ranks sequence and the pointer sequence . Two exceptions are represented by uni-grams and maximum-order grams, for which gram-ID and pointer sequences are missing respectively.

Lookup. We now describe how the Lookup operation is supported, i.e., how to retrieve the frequency count of a given gram w_1 ⋯ w_n, for some 1 ≤ n ≤ N. We first perform n vocabulary lookups to map the gram tokens into their constituent IDs. We write these IDs into an array ids[1, n]. This preliminary query-mapping step takes O(n). Now, the search procedure basically has to locate ids[n] in the n-th level of the trie.

If n = 1, then our search terminates: at position ids[1] we read the count rank r to finally access the count of order 1 having rank r. If, instead, n is greater than 1, the position ids[1] is used to retrieve a pair of adjacent pointers in constant time, which delimits the range of IDs in which we have to search for ids[2] in the second level of the trie. This range is inspected by binary search, taking a logarithmic number of accesses, as each access to an Elias-Fano-encoded sequence is performed in constant time. Let pos be the position at which ids[2] is found in the range. Again, if n = 2, the search terminates by accessing the count of order 2 whose rank is stored at position pos. If n is greater than 2, we fetch the pair of pointers at positions pos and pos + 1 to continue the search of ids[3] in the third level of the trie, and so on. This search step is repeated n − 1 times in total, to finally return the count of w_1 ⋯ w_n.
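The following sketch summarizes the Lookup walk over an uncompressed sorted-array layout with our own naming; in the actual data structure the ids, ranks and pointers sequences are Elias-Fano (or PEF) encoded, so each access performed inside the binary search is a constant-time operation on the compressed sequence.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct level {
        std::vector<uint32_t> ids;        // gram-ID sequence (unused for uni-grams)
        std::vector<uint32_t> ranks;      // count ranks
        std::vector<uint64_t> pointers;   // unused for the maximum order
    };

    // Return the count of the n-gram whose token IDs are in `word_ids`,
    // or -1 if it is not part of the indexed corpus. levels[0] is the
    // uni-gram level; counts[n-1] is the unique-value array of order n.
    int64_t lookup(const std::vector<level>& levels,
                   const std::vector<std::vector<uint64_t>>& counts,
                   const std::vector<uint32_t>& word_ids) {
        size_t n = word_ids.size();
        uint64_t pos = word_ids[0];  // the uni-gram ID is also its position in level 1
        for (size_t k = 1; k < n; ++k) {
            uint64_t begin = levels[k - 1].pointers[pos];
            uint64_t end = levels[k - 1].pointers[pos + 1];
            const auto& ids = levels[k].ids;
            auto it = std::lower_bound(ids.begin() + begin, ids.begin() + end, word_ids[k]);
            if (it == ids.begin() + end || *it != word_ids[k]) return -1;  // gram not found
            pos = it - ids.begin();
        }
        return counts[n - 1][levels[n - 1].ranks[pos]];
    }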

3.2.2. Context-based Identifier Remapping

In this subsection we describe a novel technique that lowers the space occupancy of the gram-ID sequences that constitute, as we have seen, the main component of the trie data structure.

The idea is to map a word w occurring after a context c to an integer whose value is bounded by the number of words that follow c, and not by the total vocabulary size. Specifically, w is mapped to the position it occupies within its siblings, i.e., the words following the gram c. We call this technique context-based identifier remapping because each ID is re-mapped to the position it takes relatively to a context.

Figure 2. On the left (a): the action performed by the context-based identifier remapping strategy. The last word ID of any sub-path of length k + 1, e.g., the blue one, is replaced with the position it takes within its sibling IDs. These sibling IDs are found at the end (the dark gray triangle) of the search for the length-k context along the same path, e.g., the red one, in the upper levels of the trie. On the right (b): effect of the context-based remapping on the average gap (ratio between universe and size) of the 3-gram, 4-gram and 5-gram ID sequences of the datasets used in the experiments (Europarl, YahooV2 and GoogleV2), for context lengths 0 (no remapping), 1 and 2.

Figure 2(a) shows a representation of the action performed by the remapping strategy: the last word ID of any sub-path of length k + 1 (e.g., the blue one in the figure) is searched along the same path occurring in the upper levels of the trie (e.g., the red one in the figure). This can be graphically interpreted as if the blue path were projected onto the red path in order to search among its sibling IDs, which are the ones occurring after the length-k context (the small dark gray triangle in the figure). We stress that this projection is always possible, i.e., we are guaranteed to find any sub-path of length k + 1 in the upper levels of the trie, because of the sliding-window extraction process described in Section 2. Figure 2(a) also highlights that using a context of length k will partition the levels of the trie into two categories: the so-called mapper levels and the mapped levels. The first k + 1 levels of the trie act, in fact, as a mapper structure whose role is to map any word ID through k searches; all the other levels are the ones formed by the remapped IDs.

The salient feature of our strategy is that it takes full advantage of the n-gram model represented by the trie structure itself, in that it does not need any redundancy to perform the mapping of IDs, because these are mapped by means of searches in the upper levels of the trie. The strategy also allows a great deal of flexibility, in that we can choose the length k of the context. In general, with an n-gram dataset of maximum order N, we can choose among N − 2 distinct context lengths k, i.e., 1 ≤ k ≤ N − 2. Clearly, the greater the context length we use, the smaller the remapped IDs will be, but the more time the searches will take. The choice of the proper context length should take into account the characteristics of the n-gram dataset, in particular the number of grams per order.

In what follows we motivate why the introduced remapping strategy offers a valuable contribution to the overall space reduction of the trie data structure, through some didactic and real examples. As we will see in the experimental Subsection 3.4, the dataset vocabulary can contain several million tokens, whereas the number of words that naturally occur after another is typically very small. Even in the case of stopwords, such as “the” or “are”, the number of words that can follow is far less than the whole number of distinct words of any n-gram dataset. This ultimately means that the remapped integers forming the gram-ID sequences of the trie will be much smaller than the original ones, which can indeed be as large as the vocabulary size. Lowering the values of the integers clearly helps in reducing the memory footprint of the levels of the trie, because any integer compressor takes advantage of encoding smaller integers, since fewer bits are needed for their representation (Moffat and Stuiver, 2000; Ottaviano and Venturini, 2014; Pibiri and Venturini, 2017). In our case the gram-ID sequences are encoded with Elias-Fano: from Subsection 3.2.1, equation (1), we know that Elias-Fano spends ⌈log₂(u/n)⌉ + 2 bits per integer, thus a number of bits that grows with the average gap u/n between its values. The remapping strategy reduces the universe of representation, thus lowering the average gap and the space of the sequence.

Figure 3. Example of a trie of order 3, representing the set of grams A, AA, AAC, AC, B, BB, BBC, BBD, BC, BCD, BD, CA, CD, DB, DBB, DBC, DDD. Vocabulary IDs are represented in blue, while the IDs of the last level are in red. The green IDs are derived by applying a context-based remapping with context length 1.

This effect is illustrated by the numbers in Figure 2(b), which shows how the average gap of the gram-ID sequences of the datasets we used in the experiments (see also Table 1) is affected by the context-based remapping. Since uni-grams and bi-grams constitute mapper levels, these are kept un-mapped: we show the statistic for the mapped levels, i.e., the third, fourth and fifth, of a trie of order 5 built from the n-grams of the datasets. For each dataset we report the average gap without remapping and with context lengths 1 and 2. As we can see by considering Europarl, our technique with a context of length 1 already achieves a substantial average reduction of the gaps. With a context of length 2, instead, we obtain an even larger reduction. Very similar considerations hold for the YahooV2 dataset as well. The reduction on the GoogleV2 dataset is less dramatic instead, both with context length 1 and with context length 2.

Example. To better understand how the remapping algorithm works, we now consider a small didactic example. We continue with the example from Subsection 3.2.1, represented in Figure 3. The blue IDs are the vocabulary IDs and the red ones are the last token IDs of the tri-grams as assigned by the vocabulary. We now explain how the remapped IDs, represented in green, are derived by the model using our technique with a context of length 1. Consider the tri-gram BCD. The default ID of D is 3. We now rewrite this ID as the position that D takes within the successors of the word preceding it, i.e., C (the context). As we can see, D appears in position 1 within the successors of C, therefore its new ID will be 1. Another example: take DBB. The default ID of B is 1, but it occurs in position 0 within the successors of its parent B, therefore its new ID is 0. The example in Figure 3 illustrates how to map tri-grams using a context of length 1: this is clearly the only choice possible, as the first two levels of the trie must be used to retrieve the mapped ID at query time. However, if we have a gram of higher order, say a 4-gram wxyz, we can choose to map z as the position it takes within the successors of y (context length 1) or within the successors of xy (context length 2).

Lookup. The described remapping strategy comes with an overhead at query time, as the search algorithm described in Subsection 3.2.1 must map the default vocabulary ID to its remapped ID before it can be searched in the proper gram sequence. If the remapping strategy is applied with a context of length k, it involves k additional searches. As an example, by looking at Figure 3, before searching the mapped ID of D for the tri-gram BCD, we have to map the vocabulary ID of D, i.e., 3, to 1. For this task, we search 3 within the successors of C. As it is found in position 1, we now know that we have to search for 1 within the successors of BC. On the one hand, the context-based remapping will assign smaller IDs as the length of the context grows; on the other hand, it will also spend more time at query processing. In conclusion, we have a space/time trade-off that we explore with an extensive experimental analysis in Subsection 3.4.
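A sketch of the remapping step for a context of length 1, again over the simplified uncompressed layout and with our own naming: the remapped ID is the position of the word within the successor range of its preceding word.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Simplified trie levels storing un-remapped vocabulary IDs.
    struct plain_level {
        std::vector<uint32_t> ids;        // unused for the uni-gram level
        std::vector<uint64_t> pointers;
    };

    // Remapped ID of `word` after the uni-gram `context` (context length 1):
    // the position of `word` within the successors of `context` in the second
    // level. Returns -1 if the bi-gram `context word` does not appear.
    int64_t remap_k1(const plain_level& unigrams, const plain_level& bigrams,
                     uint32_t context, uint32_t word) {
        uint64_t begin = unigrams.pointers[context];      // the uni-gram ID is its position
        uint64_t end = unigrams.pointers[context + 1];
        auto first = bigrams.ids.begin() + begin;
        auto last = bigrams.ids.begin() + end;
        auto it = std::lower_bound(first, last, word);
        if (it == last || *it != word) return -1;
        return it - first;  // position relative to the range, bounded by its size
    }
    // With the running example of Figure 3, remapping D (ID 3) after the
    // context C (ID 2) would return 1.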

3.3. Hashing

Since the indexed n-gram corpus is static, we obtain full hash utilization by resorting to Minimal Perfect Hashing (MPH). We index all grams of the same order n into a separate MPH table, each with its own MPH function. This introduces a twofold advantage over the linear probing approach used in the literature (Heafield, 2011; Pauls and Klein, 2011): we use a hash table of size equal to the exact number of grams per order (no extra space allocation is required) and we avoid the linear probing search phase, since a single access to the required hash location suffices. We use the publicly available implementation of MPH as described in (Belazzougui et al., 2014) and available at https://github.com/ot/emphf. This implementation requires only a few bits per key on average. At the hash location for an n-gram we store: its 8-byte hash key, so as to have a very low false positive probability (4-byte hash keys are supported as well), and the position of its frequency count in the unique-value array which keeps all the distinct frequency counts for order n. As already motivated, these unique-value arrays, one for each different order n, are negligibly small compared to the number of grams themselves and act as a direct map from the position of the count to its value. Although these unique values could be sorted and compressed, we do not perform any space optimization as these are too few to yield any improvement; rather, we store them uncompressed and byte-aligned, in order to favor lookup time. We also use this hash approach to implement the vocabulary of the previously introduced trie data structure.

Lookup. Given an n-gram g of order n, we compute its position p in the relevant table via the MPH function for order n, then we access the count rank r stored at position p and finally retrieve the count value from the unique-value array of order n.
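A sketch of the hash-based lookup follows. To keep the example self-contained we use a std::unordered_map as a stand-in for a real minimal perfect hash function (such as the one provided by the emphf library) and std::hash as a stand-in for the stored hash keys; the record layout mirrors the description above, with our own naming.

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct mph_table {
        // Stand-in for a minimal perfect hash function over the n-grams of one
        // order: it maps each indexed n-gram to a distinct position in
        // [0, number of grams). A real MPHF maps unknown keys to arbitrary
        // slots, which is why a fingerprint check is needed.
        std::unordered_map<std::string, uint64_t> position;
        std::vector<uint64_t> hash_keys;      // hash key stored at each position
        std::vector<uint32_t> count_ranks;    // rank of the count in the unique-value array
        std::vector<uint64_t> unique_counts;  // distinct counts for this order, uncompressed
    };

    // Lookup the count of `gram`: one access to the position given by the hash
    // function, a fingerprint check to bound the false positive probability,
    // then the retrieval of the count through its rank.
    int64_t lookup(const mph_table& t, const std::string& gram) {
        auto it = t.position.find(gram);
        if (it == t.position.end()) return -1;
        uint64_t pos = it->second;
        if (t.hash_keys[pos] != std::hash<std::string>()(gram)) return -1;  // not indexed
        return int64_t(t.unique_counts[t.count_ranks[pos]]);
    }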

Table 1. Number of n-grams and distinct frequency counts, for each order n = 1, …, 5 and in total, of the datasets used in the experiments (Europarl, YahooV2 and GoogleV2). We also report the average bytes per gram achieved by gzip as a useful baseline for comparison: 6.98 for Europarl, 6.45 for YahooV2 and 6.20 for GoogleV2.

3.4. Experiments

In this subsection, we first present experiments to validate the effectiveness of our compressed data structures in relation to the corresponding query processing speed; then we compare our proposals against several solutions available in the state-of-the-art.

Datasets. We performed our experiments on three standard datasets: Europarl, YahooV2 and GoogleV2. Each dataset comprises all n-grams for 1 ≤ n ≤ 5 and the associated frequency counts. Table 1 shows the basic statistics of the datasets. We choose these datasets in order to test our data structures on different corpora sizes: starting from the left of Table 1, each dataset has several times the number of n-grams of the previous one.

Compared indexes. We compare the performance of our data structures against the following software packages that use the approaches introduced in Subsection 3.1.


  • BerkeleyLM implements two trie data structures, based on sorted arrays and on hash tables, to represent the nodes of the trie (Pauls and Klein, 2011). The code is written in Java and available at: https://github.com/adampauls/berkeleylm.

  • Expgram makes use of the LOUDS succinct encoding (Jacobson, 1989) to implicitly represent the trie structure, while the frequency counts are compressed using VByte encoding (Watanabe et al., 2009). The code is written in C++ and available at: https://github.com/tarowatanabe/expgram.

  • KenLM implements a trie with interpolation search and a hash table with linear probing (Heafield, 2011). The code is written in C++ and available at: http://kheafield.com/code/kenlm.

  • Marisa is a general-purpose string dictionary implementation in which Patricia tries are recursively used to represent the nodes of a Patricia trie (Yata, 2011). The code is written in C++ and available at: https://github.com/s-yata/marisa-trie.

  • RandLM employs Bloom filters with lossy quantization of frequency counts to attain a low memory footprint (Talbot and Osborne, 2007). The code is written in C++ and available at: https://sourceforge.net/projects/randlm.

Experimental setting and methodology. All experiments have been performed on a machine with an Intel Xeon E5-2630 v3 processor (8 cores, 16 threads) clocked at 2.4 GHz and running 64-bit Linux. Our implementation is in standard C++11 and compiled with gcc 5.4.1 with the highest optimization settings. Template specialization has been preferred over inheritance to avoid the virtual method call overhead, which can be disruptive for the very fine-grained operations we consider. Except for the instructions to count the number of bits set in a word (popcount) and to find the position of the least significant bit (number of trailing zeroes), no special processor feature was used. In particular, we did not add any SIMD (Single Instruction Multiple Data) instructions to our code.

The data structures were saved to disk after construction, and loaded into main memory to be queried. For the scanning of input files we used the posix_madvise system call with the parameter POSIX_MADV_SEQUENTIAL, to instruct the kernel to optimize the sequential access to the mapped memory region. The implementation of our data structures, as well as the utilities to prepare the datasets for indexing and the unit tests, is freely available at: https://github.com/jermp/tongrams.

To test the speed of Lookup queries, we use, for each dataset, a query set of millions of n-grams drawn at random from the entire dataset. In order to smooth the effect of fluctuations during measurements, we repeat each experiment five times and consider the mean. The reported query timings are, therefore, average times. All query algorithms were run on a single core.

Table 2. Average bytes per gram (bytes/gram) and average Lookup time per query in microseconds (μsec/query) on Europarl, YahooV2 and GoogleV2, for the gram-ID sequences encoded with Elias-Fano (EF) and partitioned Elias-Fano (PEF), without and with the context-based identifier remapping. The bytes/gram cost also includes the space of representation for the pointer sequences.

3.4.1. Elias-Fano Tries

In this subsection we test the efficiency of our trie data structure. As already done for the description in Subsection 3.2.1, we dedicate one paragraph to the validation of each of the main building components of the trie, as well as to the introduced performance optimizations.

Gram-ID sequences. Table 2 shows the average number of bytes per gram, including the cost of pointers, and the lookup speed per query. The first two rows refer to the trie data structure described in Subsection 3.2.1, when the sorted arrays are encoded with Elias-Fano (EF) and partitioned Elias-Fano (PEF) (Ottaviano and Venturini, 2014). Subsequent rows indicate the space gains obtained by applying the context-based remapping strategy, using EF and PEF, for contexts of lengths 1 and 2, respectively. For GoogleV2 we use a context of length 1, as the tri-grams alone constitute a large share of the whole dataset, thus it would make little sense to optimize only the space of 4- and 5-grams, which take a small fraction of the dataset.

As expected, partitioning the gram sequences using PEF yields a better space occupancy. Though the paper by Ottaviano and Venturini (Ottaviano and Venturini, 2014) describes a dynamic programming algorithm that finds the partitioning able to minimize the space occupancy of a monotone sequence (we refer to this scheme as PEF-OPT in the following), we instead adopt a uniform partitioning strategy. Partitioning the sequence uniformly has several advantages over variable-length partitions in our setting. As we have seen in Subsection 3.2.1, trie searches are carried out by performing a preliminary random access to the endpoints of the range pointed to by a pointer pair. Then a search in the range follows to determine the position of the gram-ID. Partitioning the sequence by variable-length blocks introduces an additional search over the sequence of partition endpoints to determine the proper block in which the search must continue. While this preliminary search only introduces a minor overhead in query processing for inverted index queries (Ottaviano and Venturini, 2014) (as it has to be performed once and successive accesses are only directed to forward positions of the sequence), it is instead the major bottleneck when random access operations are very frequent, as in our case. By resorting to uniform partitions, we eliminate this first search and the cost of representation for the variable-length sizes. To speed up queries even further, we also keep the upper bounds of the blocks uncompressed and bit-aligned.
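The uniform-partitioning idea can be sketched as follows: the block of an element is obtained with a division instead of a search over partition endpoints, and each block stores values relative to the upper bound of the preceding block. Blocks are plain vectors here and the naming is ours; in the real data structure each block is Elias-Fano encoded over its reduced universe and the upper bounds are kept uncompressed and aligned, as said.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Uniformly partitioned monotone sequence: fixed block size, one upper
    // bound per block, and per-block values stored relative to the upper
    // bound of the previous block.
    struct uniform_partitioned_sequence {
        size_t block_size = 0;
        std::vector<uint64_t> upper_bounds;         // uncompressed, for fast access
        std::vector<std::vector<uint64_t>> blocks;  // values relative to the previous upper bound

        void build(const std::vector<uint64_t>& s, size_t b) {
            block_size = b;
            for (size_t begin = 0; begin < s.size(); begin += b) {
                uint64_t base = upper_bounds.empty() ? 0 : upper_bounds.back();
                size_t end = std::min(begin + b, s.size());
                blocks.emplace_back();
                for (size_t j = begin; j < end; ++j) blocks.back().push_back(s[j] - base);
                upper_bounds.push_back(s[end - 1]);
            }
        }

        // No preliminary search over partition endpoints: the block index is i / block_size.
        uint64_t access(size_t i) const {
            size_t b = i / block_size;
            uint64_t base = b ? upper_bounds[b - 1] : 0;
            return base + blocks[b][i % block_size];
        }
    };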

Figure 4. Bytes per gram (left vertical axis) and μsec per query (right vertical axis, black dashed line) obtained by varying the block size of uniform PEF on the gram-ID sequences of Europarl.

As the problem of deciding the optimal block size is posed, Figure 4 shows the space/time trade-off obtained by varying the block size on the gram-ID sequences. The plots for the YahooV2 and GoogleV2 datasets exhibit the same shape, therefore we report the one for Europarl. The dashed black line illustrates how the average Lookup time varies when all the gram-ID sequences are partitioned using the same block size. The figure suggests using small partitions for the bi-gram sequences and larger ones for all other orders n > 2, given that the space usage remains low without increasing much the query processing time. With this choice of block sizes, the loss in space with respect to PEF-OPT is small on Europarl, YahooV2 and GoogleV2 alike.

Shrinking the size of the blocks speeds up searches over plain Elias-Fano because a successor query has to be resolved over an interval potentially much smaller than a range length. This behavior is clearly highlighted by the shape of the black dashed line of Figure 4. However, excessively reducing the block size may ruin the advantage in space reduction. Therefore it is convenient to use small block sizes for the most traversed sequences, e.g., the bi-gram sequences, which indeed must be searched several times during the query-mapping phase when the context-based remapping is adopted. In conclusion, as we can see from the second row of Table 2, there is no practical difference between the query processing speed of EF and PEF: the latter sequence organization brings a negligible overhead in query processing speed on Europarl and YahooV2, while achieving a noticeable space reduction, especially on GoogleV2.

Context-based identifier remapping. Concerning the efficacy of the context-based remapping, remapping the gram IDs with a context of length already reduces the space of the sequences by on average when they are encoded with Elias-Fano, with respect to the EF cost. If we consider a context of length we double the gain, obtaining more than of space reduction without affecting the lookup time with respect to the case . As a first conclusion, when space efficiency is the main concern, it is always convenient to apply the remapping strategy with a context of length . The gain of the strategy is even more evident with PEF: this is no surprise, as the encoder can better exploit the reduced IDs by encoding all the integers belonging to a block with a universe relative to the block rather than to the whole sequence. This results in a space reduction of more than on average, and of up to on GoogleV2.
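To give a concrete, if simplified, picture of why the remapped IDs are small, the sketch below (illustrative Python; plain dictionaries replace the compressed trie levels and the names are invented) replaces each word by its rank among the words observed after its context, and maps it back at query time, which is exactly the extra step charged to lookups in the next paragraph.

```python
from bisect import bisect_left
from collections import defaultdict

def build_followers(ngrams, k):
    """For every context of length k, collect the sorted list of the
    word-IDs that follow it somewhere in the set of n-grams."""
    followers = defaultdict(set)
    for gram in ngrams:                       # each gram is a tuple of word-IDs
        for i in range(k, len(gram)):
            followers[gram[i - k:i]].add(gram[i])
    return {ctx: sorted(ids) for ctx, ids in followers.items()}

def remap(word_id, context, followers):
    # the reduced ID is the rank of the word among the (typically few)
    # words that can follow the context: a much smaller integer
    return bisect_left(followers[context], word_id)

def unmap(reduced_id, context, followers):
    return followers[context][reduced_id]

# toy usage with a context of length 1
ngrams = [(3, 7, 9), (3, 7, 12), (5, 7, 9)]
followers = build_followers(ngrams, k=1)
assert remap(12, (7,), followers) == 1        # only {9, 12} ever follow word 7
assert unmap(remap(9, (7,), followers), (7,), followers) == 9
```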

Regarding the query processing speed, as explained in Subsection 3.2.2, the remapping strategy comes with a penalty at query time, since an ID must be mapped before it can be searched in the proper gram sequence. On average, looking at Table 2, we found that more time is spent with respect to the Elias-Fano baseline. Notice that PEF does not introduce any time degradation over EF when context-based remapping is used: it is actually faster on GoogleV2.

Europarl YahooV2 GoogleV2
Variable-len. codewords
Prefix sums + EF
Prefix sums + PEF
Variable-len. block-coding
Packed
VByte
Table 3. Average bytes per count for different techniques.

Frequency counts. For the representation of frequency counts we compare three different encoding schemes: the first is the strategy described in Subsection 3.2.1, which assigns variable-length codewords to the ranks of the counts and keeps track of the codeword lengths using a binary vector (Variable-len. codewords); the other two schemes transform the sequence of count ranks into a non-decreasing sequence by taking its prefix sums and then apply EF or PEF (Prefix sums + EF/PEF).
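The two transformations underlying these schemes are sketched below (illustrative Python; the real structures store the resulting ranks with variable-length codewords or with EF/PEF, which is not shown here).

```python
from collections import Counter
from itertools import accumulate

def rank_encode(counts):
    """Map each frequency count to its rank: distinct count values are sorted
    by how often they occur, so the most frequent values get the smallest
    ranks, and the array of distinct values doubles as the decoder."""
    distinct = [value for value, _ in Counter(counts).most_common()]
    rank_of = {value: rank for rank, value in enumerate(distinct)}
    return [rank_of[c] for c in counts], distinct

def prefix_sums(ranks):
    # turn the rank sequence into a non-decreasing one, as required by
    # Elias-Fano (EF) and partitioned Elias-Fano (PEF)
    return list(accumulate(ranks))

def decode_counts(prefix, distinct):
    ranks = [cur - prev for prev, cur in zip([0] + prefix[:-1], prefix)]
    return [distinct[r] for r in ranks]

# toy usage
counts = [5, 5, 1, 7, 5, 1]
ranks, distinct = rank_encode(counts)          # ranks = [0, 0, 1, 2, 0, 1]
assert decode_counts(prefix_sums(ranks), distinct) == counts
```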

Table 3 shows the average number of bytes per count for these different strategies. The reported space also includes the storage of the arrays containing the distinct counts for each order. As already pointed out, these take a negligible amount of space because the distribution of frequency counts is highly repetitive (see Table 1). The percentages reported for Prefix sums + EF/PEF are relative to the first row of the table, i.e., Variable-len. codewords.

The time for retrieving a count was essentially the same for all three techniques. Prefix-summing the sequence and applying EF brings no advantage over the codeword assignment technique: its space is practically the same on Europarl and actually larger on both YahooV2 (by up to ) and GoogleV2, which places the codeword assignment technique at a net advantage over EF. PEF, instead, offers a better space occupancy: by more than on Europarl and by on GoogleV2. Therefore, in the following we assume this representation for frequency counts, except for YahooV2, where we adopt Variable-len. codewords.

We also report the space occupancy of the count representations of BerkeleyLM and Expgram which, differently from all other competitors, can also be used to index frequency counts. The COMPRESSED variant of BerkeleyLM uses the Variable-len. block-coding mechanism explained in Subsection 3.1 to compress count ranks, whereas its HASH variant stores bit-packed count ranks, referred to as Packed in the table, using the minimum number of bits necessary for their representation (see Table 1). Expgram, instead, does not store count ranks but directly compresses the counts themselves using Variable-Byte encoding (VByte), with an additional binary vector so as to allow random access to the counts sequence. The available RAM of our test machine ( GBs) was not sufficient to successfully build Expgram on GoogleV2; the same holds for other competitors, as we are going to see next. Therefore, we report its space only for Europarl and YahooV2.

We first observe that rank-encoding schemes are far more advantageous than compressing the counts themselves, as done by Expgram. Moreover, none of these techniques beats the three we previously introduced, except for the COMPRESSED variant, which is smaller than Variable-len. codewords on GoogleV2. However, note that this gap is completely bridged as soon as we adopt the combination Prefix sums + PEF.

Time and space breakdowns. Before concluding the subsection, we use the preceding analysis to fix two trie data structures that privilege space efficiency and query time, respectively: we call them PEF-RTrie (the R stands for remapped) and PEF-Trie. For the PEF-RTrie variant we use PEF for representing the gram-ID sequences, and Prefix sums + PEF for the counts on Europarl and GoogleV2 but Variable-len. codewords on YahooV2. We also use the maximum applicable context length for the context-based remapping technique, i.e., for Europarl and YahooV2 and for GoogleV2. For the PEF-Trie variant we use PEF for representing the gram-ID sequences and Variable-len. codewords for the counts, without remapping.

The corresponding size breakdowns are shown in Figures 5(c) and 5(d), respectively. Pointer sequences take very little space in both data structures (approximately ), while most of the difference lies, not surprisingly, in the space of the gram-ID sequences (roughly for Europarl and YahooV2; for GoogleV2). The timing breakdowns in Figures 5(a) and 5(b) clearly highlight, instead, how the context-based remapping technique raises the time spent in the query-mapping phase, during which the IDs are mapped to their reduced IDs. In that case, the query-mapping and search phases take almost the same time, whereas in the PEF-Trie the search phase dominates.

Figure 5. Trie data structure timing (a: PEF-RTrie, b: PEF-Trie) and size (c: PEF-RTrie, d: PEF-Trie) breakdowns, in percentage, on the tested datasets. For the timing breakdowns we distinguish the three phases of query mapping, ID search and final count lookup. For the space breakdowns we distinguish, instead, the contributions of the gram-ID, count and pointer sequences.

3.4.2. Hashing

We build our MPH tables using -byte hash keys, so as to yield a false positive rate of . For each order we store the distinct count values in an array, uncompressed and byte-aligned, using bytes per distinct count on Europarl and YahooV2 and bytes on GoogleV2.

For all three datasets, the number of bytes per gram, including the cost of the hash function itself ( bytes per gram), is . The number of bytes per count is given by the sum of the cost for the ranks and for the distinct counts themselves, and is equal to , and for Europarl, YahooV2 and GoogleV2, respectively. Not surprisingly, the majority of the space is taken by the hash keys: clients willing to reduce this memory impact can use -byte hash keys instead, at the price of a higher false positive rate (). It is therefore worth observing that spending additional effort on lowering the space occupancy of the counts yields only marginal improvements, as the high cost of the hash keys is paid anyway.
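A minimal sketch of the lookup logic follows, with hypothetical names and a Python dictionary standing in for the minimal perfect hash function: the point is that only a fixed-width hash key is stored per gram, so a gram that was never indexed is detected only up to the false-positive rate of the key comparison.

```python
import hashlib

def fingerprint(gram, nbytes):
    # first nbytes bytes of a strong hash of the gram; comparing keys of this
    # width gives a false-positive probability of roughly 2^(-8 * nbytes)
    return hashlib.sha1(" ".join(gram).encode("utf-8")).digest()[:nbytes]

class MPHCountTable:
    """Toy MPH table: a dict stands in for the minimal perfect hash function,
    which in the real structure maps every indexed gram to a distinct slot
    in [0, n) without storing the gram itself."""

    def __init__(self, grams, count_ranks, nbytes):
        self.nbytes = nbytes
        self.slot_of = {g: i for i, g in enumerate(grams)}      # MPHF stand-in
        self.keys = [fingerprint(g, nbytes) for g in grams]     # one key per slot
        self.ranks = list(count_ranks)                          # bit-packed in reality

    def lookup(self, gram):
        # a real MPHF returns *some* slot for any input, so membership can only
        # be checked by comparing the stored hash key with the query's key
        slot = self.slot_of.get(gram, 0)
        if self.keys[slot] != fingerprint(gram, self.nbytes):
            return None
        return self.ranks[slot]

# toy usage
table = MPHCountTable([("the", "cat"), ("a", "dog")], count_ranks=[3, 1], nbytes=8)
assert table.lookup(("the", "cat")) == 3
assert table.lookup(("the", "dog")) is None    # correct with high probability
```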

The constant-time access capability of hashing makes gram lookup extremely fast, requiring on average of a microsecond per lookup (exact numbers are reported in Table 4). In particular, almost all the time is spent in computing the hash function itself and in accessing the corresponding table location: the final count lookup is completely negligible.

Europarl YahooV2 GoogleV2
bytes/gram µsec/query bytes/gram µsec/query bytes/gram µsec/query
PEF-Trie
PEF-RTrie
BerkeleyLM C.
BerkeleyLM H.3
BerkeleyLM H.50
Expgram
KenLM T.
Marisa
RandLM
MPH
KenLM P.3
KenLM P.50
Table 4. Average bytes per gram (bytes/gram) and average Lookup time in microseconds per query (µsec/query). For our data structures, i.e., PEF-Trie and PEF-RTrie, the bytes/gram cost also includes the space of representation for the pointer sequences.

3.4.3. Overall Comparison

In this subsection we compare the performance of our selected trie-based solutions, i.e., PEF-RTrie and PEF-Trie, as well as of our minimal perfect hash approach, against the competitors introduced at the beginning of this subsection. The results of the comparison are shown in Table 4, where we report the space taken by the representation of the gram-ID sequences and the average Lookup time per query in microseconds. For the trie data structures, the reported space also includes the cost of representation for the pointers. We compare the space of representation for the n-grams excluding their associated information, because the latter varies with the chosen implementation: for example, KenLM can only store probabilities and backoffs, whereas BerkeleyLM can be used to store either counts or probabilities. For those competitors storing frequency counts, we already discussed their count representation in Subsection 3.4.1. The competitors that require too much memory to build their data structures on GoogleV2 have their entries left empty in the table for this dataset.

Except for the last two rows of the table, in which we compare the performance of our MPH table against probing (P.), for each competitor we report two percentages indicating its score against our selected trie data structures, PEF-Trie and PEF-RTrie respectively. Let us now examine each row, one by one. In the following discussion, unless explicitly stated, the cited percentages refer to average values over the different datasets.

The COMPRESSED (C.) variant of BerkeleyLM is larger than our PEF-RTrie implementation and slower by more than . It gains, instead, an advantage of roughly over our PEF-Trie data structure, but is also more than times slower. The HASH variant uses hash tables with linear probing to represent the nodes of the trie. We therefore test it with a small extra space factor of for table allocation (H.3) and with (H.50), which is also the default value in the implementation, so as to obtain different time/space trade-offs. Clearly, the space occupancy of both hash variants does not compete with that of our proposals, as they are from to times larger, but the constant-time lookup capability of hashing makes them faster than a sorted-array trie implementation: while this is no surprise, notice that our PEF-Trie data structure is anyway competitive, as it is actually faster on GoogleV2.

Expgram is larger than PEF-Trie and also and times slower on Europarl and YahooV2, respectively. Our PEF-RTrie data structure retains an advantage in space of and is still significantly faster: by about on Europarl and by times on YahooV2.

The KenLM trie (T.) is the fastest trie language model implementation in the literature. As we can see, our PEF-Trie variant takes of its space with a negligible penalty at query time. Compared to PEF-RTrie, it is a little faster, i.e., by , but also and times larger on Europarl and YahooV2, respectively.

We also tested the performance of Marisa, even though it is not a trie optimized for language models, so as to understand how our data structures compare against a general-purpose string dictionary implementation. We outperform it in both space and time: compared to PEF-RTrie, it is times larger and slower; with respect to PEF-Trie, it is more than larger and slower.

RandLM is designed for a small memory footprint and returns approximated frequency counts when queried. We build its data structures using the default settings recommended in the documentation: bits for frequency count quantization and bits per value, so as to yield a false positive rate of . While being from to times slower than our exact and lossless approach, it is quite compact because the quantized frequency counts are recomputed on the fly using the procedure described in Subsection 3.1. Still, while its space occupancy is even larger than that of our gram representation by , it is no better than the whole space of our PEF-RTrie data structure. With respect to the whole space of PEF-Trie, it retains instead an advantage of . This space advantage is, however, compensated by a loss in precision and a much higher query time (up to times slower on GoogleV2).

The last two rows of Table 4 regard the performance of our MPH table with respect to PROBING. As done for H., we also test the PROBING data structure with (P.3) and (P.50) extra space allocation factors for the tables. Besides being larger, as expected, this implementation makes use of expensive hash key recombinations that yield a slower random access with respect to our minimal perfect hashing approach.

We finally compare the total space occupancy of our trie data structures, given by the sum of the space of gram-ID sequences, frequency counts and pointers, against the gzip baseline reported in Table 1. The total average bytes per represented n-gram for PEF-Trie are , and on the three datasets Europarl, YahooV2 and GoogleV2, respectively. Table 1 shows that gzip takes, instead, , and bytes per gram. This means that our PEF-Trie is , and smaller than gzip, while also supporting efficient search of individual n-grams. Finally, our PEF-RTrie is , and smaller.

Europarl YahooV2
bytes/gram µsec/query bytes/gram µsec/query
PEF-Trie
PEF-RTrie
BerkeleyLM C.
BerkeleyLM H.3
BerkeleyLM H.50
Expgram
KenLM T.
RandLM
MPH
KenLM P.3
KenLM P.50
Table 5. Perplexity benchmark results, reporting the average number of bytes per gram (bytes/gram) and microseconds per query (µsec/query), using modified Kneser-Ney -gram language models built from Europarl and YahooV2 counts.

Perplexity benchmark. Besides the efficient indexing of frequency counts, our data structures can also be used to map n-grams to language model probabilities and backoffs. As done by some of the competitors, we use the binning method (Federico and Bertoldi, 2006) to quantize probabilities and backoffs, but allow any number of quantization bits ranging from to . Uni-gram values are stored unquantized to favor query speed: as the vocabulary size is typically very small compared to the total number of n-grams, this has a minimal impact on the space of the data structure. Our trie implementation is reversed so as to permit a more efficient computation of sentence-level probabilities, with a stateful scoring function that carries its state from one query to the next, as some competitors similarly do.
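A possible sketch of the binning method is shown below, under the simplifying assumptions that bins are populated uniformly and represented by their average; it is meant to illustrate the idea, not the exact quantizer of our implementation.

```python
def binning_quantize(values, bits):
    """Binning quantization, sketched: sort the values, split them into at
    most 2^bits bins of roughly equal population, and represent each bin by
    the average of its values (the codebook stored by the data structure)."""
    n_bins = min(1 << bits, len(values))
    ordered = sorted(values)
    bounds = [round(i * len(ordered) / n_bins) for i in range(n_bins + 1)]
    chunks = [ordered[bounds[i]:bounds[i + 1]] for i in range(n_bins)]
    centers = [sum(c) / len(c) for c in chunks if c]

    def encode(x):
        # index of the closest representative, stored with `bits` bits per value
        return min(range(len(centers)), key=lambda i: abs(centers[i] - x))

    return centers, encode

# toy usage on a handful of backoff values, with a 2-bit codebook
centers, encode = binning_quantize([0.10, 0.03, 0.50, 0.42, 0.07, 0.20], bits=2)
index = encode(0.41)         # what would be stored in the trie
approx = centers[index]      # what a query would read back
```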

For the perplexity benchmark we used the standard query dataset publicly available at http://www.statmt.org/lm-benchmark, which contains sentences, for a total of tokens (Chelba et al., 2014). We used the utilities of to build modified Kneser-Ney (Chen and Goodman, 1996, 1999) -gram language models from the counts of Europarl and YahooV2; these have an OOV (out-of-vocabulary) rate of, respectively, and on the test query file. As RandLM only builds quantized models, using quantization bits for both probabilities and backoffs, we use the same number of quantization bits for our tries and for the KenLM trie. For all its data structures, truncates the mantissa of floating-point values to bits and then stores indices to the distinct probabilities and backoffs. RandLM was built, as already said, with the default parameters recommended in the documentation.
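For reference, the perplexity figures discussed next are obtained by accumulating per-token log10 probabilities with a stateful scoring loop; the sketch below assumes a hypothetical score(state, word) interface returning the log-probability, the next state and an OOV flag, and is not tied to any particular package.

```python
def corpus_perplexity(sentences, score, initial_state):
    """Score a test corpus with a stateful language model scorer and return
    its perplexity, i.e., 10 raised to the negated average log10 probability."""
    log10_sum, n_scored = 0.0, 0
    for sentence in sentences:                # each sentence is a list of words
        state = initial_state
        for word in sentence + ["</s>"]:      # include the end-of-sentence token
            log10_prob, state, is_oov = score(state, word)
            if not is_oov:                    # OOV tokens are reported separately
                log10_sum += log10_prob
                n_scored += 1
    return 10.0 ** (-log10_sum / n_scored)
```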

Table 5 shows the results of the benchmark. As we can see, the PEF-Trie data structure is as fast as the KenLM trie while being more than more compact on average, whereas the PEF-RTrie variant doubles the space gains with a negligible loss in query processing speed ( slower). We significantly outperform all other competitors in both space and time, including the H.3 variant. In particular, notice that we are also smaller than RandLM, which is randomized and, therefore, less accurate. The query time of H.50 is lower on YahooV2; however, it also uses from to times the space of our tries.

The last two rows of the table are dedicated to the comparison of our MPH table with PROBING. While our data structure stores quantized probabilities and backoffs, PROBING stores uncompressed values for all orders. We found that storing unquantized values yields indistinguishable differences in perplexity while unnecessarily increasing the space of the data structure, as is apparent from the results. The expensive hash key recombinations needed for random access are avoided during perplexity computation thanks to the left-to-right nature of the query access pattern. This makes, not surprisingly, a linear probing implementation actually faster, by on average, than a minimal perfect hash approach when a large multiplicative factor is used for table allocation (P.50). The price to pay is, however, twice the space. The P.3 variant, on the other hand, is larger (by ) and slower (by on average).

4. Fast Estimation

The problem we tackle in this section of the paper is that of computing the Kneser-Ney probability and backoff penalty for every n-gram extracted from a large textual source.

4.1. Preliminaries and Related Work

Since the sorted orders defined over a set of n-grams are central to the description of the algorithms we are going to consider, we now define them. Consider a set of n-grams. The set is put into sorted order by sorting the n-grams on their words, considered in a specific order. If this order is (n, n-1, ..., 1), i.e., we sort the n-grams from the last word back to the first, then the block is suffix-sorted: the last word is primary. If the considered order is (n-1, n-2, ..., 1, n), then the block is context-sorted: the penultimate word is primary.
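In code, the two orders correspond to the following sort keys (a small Python illustration of the definitions above).

```python
def suffix_key(gram):
    # suffix order: compare words from the last one back to the first
    # (the last word is primary)
    return tuple(reversed(gram))

def context_key(gram):
    # context order: compare the context right-to-left starting from the
    # penultimate word, breaking ties on the last word (penultimate primary)
    return tuple(reversed(gram[:-1])) + (gram[-1],)

ngrams = [("b", "a", "c"), ("a", "c", "b"), ("a", "b", "c")]
suffix_sorted  = sorted(ngrams, key=suffix_key)   # sorted on word n, then n-1, ...
context_sorted = sorted(ngrams, key=context_key)  # sorted on word n-1, ..., 1, then n
```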

During the estimation process, we work under the following assumptions:

  1. the uncompressed n-gram strings, with the associated satellite values, do not fit in internal memory and we necessarily need to rely on disk usage;

  2. the estimation is performed without pruning (Heafield et al., 2013), thus the minimum occurrence count for an n-gram is 1;

  3. the compressed index built over the n-gram strings (e.g., the trie presented in Subsection 3.2) must reside in internal memory to allow fast query processing (perplexity and machine translation) (Pibiri and Venturini, 2017; Heafield, 2011; Pauls and Klein, 2011).

Modified Kneser-Ney smoothing. The modified version of Kneser-Ney smoothing (Kneser and Ney, 1995) was introduced by Chen and Goodman (1996) and uses, instead of a single discount value, distinct discounts for the n-grams having occurrence count equal to 1, 2 and 3 or more. Under the Kneser-Ney model, the conditional probability is computed recursively according to

\[ P(w_n \mid w_1^{n-1}) \;=\; \frac{c(w_1^{n}) - D\!\left(c(w_1^{n})\right)}{\sum_{w} c(w_1^{n-1} w)} \;+\; \gamma(w_1^{n-1})\, P(w_n \mid w_2^{n-1}) \tag{2} \]

that is, all lower-order probabilities are interpolated together, where