Comparison of Modified Kneser-Ney and Witten-Bell Smoothing Techniques in Statistical Language Model of Bahasa Indonesia

06/23/2017
by   Ismail Rusli, et al.

Smoothing is one technique for overcoming data sparsity in statistical language models. Although its mathematical definition has no explicit dependency on any specific natural language, differences between natural languages lead to different effects of smoothing techniques, as Whittaker (1998) showed for Russian. In this paper, we compare the Modified Kneser-Ney and Witten-Bell smoothing techniques in a statistical language model of Bahasa Indonesia. We used training sets totaling 22M words, extracted from the Indonesian version of Wikipedia. As far as we know, this is the largest training set used to build a statistical language model for Bahasa Indonesia. Experiments with 3-gram, 5-gram, and 7-gram models showed that Modified Kneser-Ney consistently outperforms Witten-Bell smoothing in terms of perplexity. Interestingly, with Modified Kneser-Ney smoothing the 5-gram model outperforms the 7-gram model, whereas Witten-Bell smoothing improves consistently as the n-gram order increases.
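To make the comparison criterion concrete, the following is a minimal sketch of an interpolated Witten-Bell bigram model and a perplexity computation. It is not the authors' setup: the paper uses 3-, 5-, and 7-gram models trained on 22M words, while this toy uses bigrams, three invented sentences, and an add-one smoothed unigram as the lower-order distribution. The function names `train_witten_bell_bigram` and `perplexity` and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_witten_bell_bigram(sentences):
    """Return p(word | prev) for an interpolated Witten-Bell bigram model.

    Assumption for this sketch: the lower-order (unigram) distribution is
    add-one smoothed, with all unseen words sharing a single extra slot.
    """
    unigrams = Counter()
    bigrams = Counter()
    followers = defaultdict(set)  # distinct word types seen after each context
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[1:])  # only predicted tokens, so "<s>" is excluded
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            followers[prev].add(word)

    context_counts = Counter()
    for (prev, _), count in bigrams.items():
        context_counts[prev] += count

    total = sum(unigrams.values())
    vocab_size = len(unigrams) + 1  # +1 slot for unseen words

    def p_unigram(word):
        return (unigrams[word] + 1) / (total + vocab_size)

    def p_bigram(word, prev):
        c_h = context_counts[prev]   # how often the context was seen
        t_h = len(followers[prev])   # how many distinct types followed it
        if c_h == 0:
            return p_unigram(word)   # unseen context: back off entirely
        # Witten-Bell: weight t_h / (c_h + t_h) goes to the lower-order model.
        return (bigrams[(prev, word)] + t_h * p_unigram(word)) / (c_h + t_h)

    return p_bigram


def perplexity(p_bigram, sentences):
    """Perplexity = 2 ** (negative average log2 probability per token)."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log2(p_bigram(word, prev))
            n += 1
    return 2 ** (-log_sum / n)


# Toy data (hypothetical, not from the paper's Wikipedia corpus).
train = [["saya", "makan", "nasi"],
         ["saya", "minum", "air"],
         ["dia", "makan", "roti"]]
held_out = [["kami", "makan", "ikan"]]

model = train_witten_bell_bigram(train)
print("train perplexity:   ", perplexity(model, train))
print("held-out perplexity:", perplexity(model, held_out))
```

Lower perplexity means the model assigns higher probability to the evaluated text, which is the sense in which one smoothing technique "outperforms" another in the abstract. Modified Kneser-Ney differs from this sketch in that it subtracts count-dependent discounts and uses continuation counts for the lower-order distributions.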


