Handling Massive N-Gram Datasets Efficiently

06/25/2018
by Giulio Ermanno Pibiri, et al.

This paper deals with two fundamental problems in the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and their associated satellite data without compromising retrieval speed; and estimation, that is, computing the probability distribution of the strings extracted from a large textual source.

Regarding indexing, we describe compressed, exact, and lossless data structures that achieve large space reductions with no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, this lowers the space of the representation to compression levels not achieved before. Despite the significant space savings, the technique introduces a negligible penalty at query time.

Regarding estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources requires algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory; we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. In an extensive experimental analysis performed on billions of n-grams, our approach improves the total running time of the state-of-the-art approach by 4.5x on average.
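The remapping idea behind the compressed trie can be illustrated with a short Python sketch. This is our own illustration, not the authors' implementation, and all names in it are ours: each word is replaced by its rank among the distinct successors of its context, so an identifier for a word following a context with s distinct successors fits in roughly ceil(log2(s)) bits instead of the bits needed for a global vocabulary identifier.

    from collections import defaultdict
    from math import ceil, log2

    def remap_by_context(trigrams):
        """Replace the last word of each trigram by its rank among the
        distinct words observed after the same length-2 context."""
        successors = defaultdict(set)
        for w1, w2, w3 in trigrams:
            successors[(w1, w2)].add(w3)
        # Sort each successor set so that ranks are deterministic.
        rank = {ctx: {w: i for i, w in enumerate(sorted(ws))}
                for ctx, ws in successors.items()}
        remapped = [(w1, w2, rank[(w1, w2)][w3]) for w1, w2, w3 in trigrams]
        return remapped, successors

    trigrams = [("the", "cat", "sat"), ("the", "cat", "ran"),
                ("a", "dog", "ran"), ("the", "cat", "sat")]
    remapped, successors = remap_by_context(trigrams)
    for ctx, ws in successors.items():
        # A remapped id only needs enough bits to address the successor set.
        bits = max(1, ceil(log2(len(ws))))
        print(ctx, "->", len(ws), "successor(s),", bits, "bit(s) per id")

Because natural-language contexts typically have very few distinct successors, the remapped integers are small and compress far better than global vocabulary identifiers.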
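For context, the quantity the estimation pipeline computes is the interpolated modified Kneser-Ney probability. The following is the standard formulation due to Chen and Goodman, not a formula taken from this paper, and the notation is ours:

    P(w_i \mid w_{i-k}^{i-1}) =
        \frac{\max\{\, c(w_{i-k}^{i}) - D(c(w_{i-k}^{i})),\, 0 \,\}}{c(w_{i-k}^{i-1})}
        + \gamma(w_{i-k}^{i-1}) \, P(w_i \mid w_{i-k+1}^{i-1})

Here c(\cdot) denotes an n-gram count, D(\cdot) applies one of three discounts D_1, D_2, D_{3+} according to whether the count is 1, 2, or at least 3, and \gamma(\cdot) is the back-off weight that redistributes the discounted mass so the distribution sums to one. Gathering the counts and count-of-counts these formulas need over billions of n-grams is why external-memory sorting is a dominant cost of estimation.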

