Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

08/16/2016
by Ehsan Shareghi, et al.

Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can easily be held in memory while supporting the queries needed to compute language model probabilities on the fly. We present several optimisations which improve query runtimes by up to 2500x while incurring only a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package: it imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
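To make the idea of computing probabilities on the fly from a suffix index concrete, here is a minimal sketch. It is not the authors' implementation: it uses a plain, uncompressed suffix array as a stand-in for a compressed suffix tree, builds it naively, and returns unsmoothed maximum-likelihood estimates where a practical language model would use a smoothed estimator. The function names (build_suffix_array, count_occurrences, mle_probability) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return suffix start positions in lexicographic order of the suffixes.

    Naive construction for illustration only; it is far too slow and too
    memory-hungry for large corpora.
    """
    return sorted(range(len(tokens)), key=lambda i: tuple(tokens[i:]))


def count_occurrences(tokens: Sequence[str], sa: List[int],
                      pattern: Sequence[str]) -> int:
    """Count how often `pattern` occurs in `tokens` using two binary searches.

    All suffixes that start with `pattern` form one contiguous range of the
    suffix array, so the count is the width of that range.
    """
    pat = tuple(pattern)
    m = len(pat)

    def prefix(start: int) -> tuple:
        # First m tokens of the suffix beginning at `start`.
        return tuple(tokens[start:start + m])

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < pat:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


def mle_probability(tokens: Sequence[str], sa: List[int],
                    context: Sequence[str], word: str) -> float:
    """Unsmoothed estimate count(context + word) / count(context).

    Counts are looked up at query time, so the context may be of any length
    and no n-gram table has to be precomputed.
    """
    denom = count_occurrences(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count_occurrences(tokens, sa, list(context) + [word]) / denom
```

Because every count is derived from the index when it is requested, the Markov order is limited only by the length of the query context, which is the property that makes an index of this kind attractive for high-order models.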


