1 Introduction
Statistical language models are the state of the art in modeling natural language. They are used in applications such as automatic speech recognition, information retrieval, and machine translation. A statistical language model tries to estimate the probability that a sentence appears in the modeled natural language.
To build a statistical language model, we first have to prepare a corpus. For the English language, there are several standard corpora, e.g., the Brown corpus, the WSJ (Wall Street Journal) corpus, and the NAB (North American Business news) corpus. With these corpora, researchers in language modeling can benchmark their work.
For Bahasa Indonesia, we could hardly find a standard corpus that is also freely available. Therefore, we used a body of text available for free from Wikipedia (dumps.wikimedia.org/idwiki/). We extracted the text from the Wikipedia dump file using scripts available on the Internet (blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/). Before using it, we preprocessed the text to remove as much noise as we could. The resulting text has approximately 36M words in 1.7M sentences.
From this text, we built language models that are smoothed using the Modified Kneser-Ney and Witten-Bell techniques, which are among the most widely used smoothing techniques in language modeling. We used tools from SRI International (www.sri.com), namely the SRI Language Modeling toolkit (SRILM, www.speech.sri.com/projects/srilm/), to build our language models. The toolkit is available for free for non-commercial projects.
Our motivation for this work is the fact that we did not find any study of various smoothing techniques in language modeling for Bahasa Indonesia. A smoothing technique can improve the performance of any natural language processing system that uses a statistical language model. Although the mathematical definition of a statistical language model does not depend on any specific natural language, the nature of the language itself differs across languages. For example, Russian is highly inflectional, which gives rise to different results for smoothing techniques compared to English. Therefore, our contribution here is a study of two smoothing techniques, i.e., Modified Kneser-Ney and Witten-Bell, applied to a statistical language model of Bahasa Indonesia. Various other smoothing techniques will be studied in our future research.
This paper continues as follows. In Section 2, we mention several works in natural language processing that used statistical language models of Bahasa Indonesia; we did not find any explicit reference to smoothing techniques employed in those works. We also point out that the corpora used in those works are relatively small compared to the one we use here. In Section 3, we describe the statistical language model and the smoothing techniques, i.e., the Modified Kneser-Ney and Witten-Bell techniques. Section 4 details our experiments and results, and Section 5 concludes.
2 Previous Works
For the English language model, Chen and Goodman reported an excellent study of various smoothing techniques. They also proposed a modified version of the well-known Kneser-Ney smoothing technique that is slightly better than the original.
Teh used a hierarchical Bayesian model to build a language model and reported that its performance is comparable to that of the modified Kneser-Ney technique.
As far as we know, there is no previous work studying various smoothing techniques for an Indonesian language model. We also noted that research in natural language processing that used a statistical Indonesian language model does not refer to any smoothing technique.
Furthermore, we noted that there is no standard corpus for benchmarking language models of Bahasa Indonesia that is also available for free.
One of the attempts to build a corpus for Bahasa Indonesia was reported by Larasati: the IDENTIC corpus, a morphologically enriched Indonesian-English parallel corpus with approximately 1M words for each language.
Sakti, in building a Large Vocabulary Continuous Speech Recognition (LVCSR) system, used several corpora, i.e., a Daily News task (600K sentences), a Telephone Application task (2.5K sentences), and a Basic Travel Expression Corpus (BTEC) task (160K sentences).
Pisceldo used a 14K-word corpus to build a probabilistic part-of-speech tagger for Bahasa Indonesia.
Riza built an Indonesian-English machine translation system using a collaborative P2P corpus. This corpus has 280K sentences, with a target of 1 million sentence pairs (a multilingual corpus).
None of the corpora used in the previously mentioned works is available for free. The only corpus we found freely available on the Internet is on the PAN Localization site (panl10n.net/english/OutputsIndonesia2.htm#Linguistic_Resources_), a work by Adriani. This corpus consists of 500K words.
As mentioned earlier, we used a body of text of 36M words in our research. This is larger than the corpora used in the previously mentioned works, although we admit that our body of text is not tuned to a specific topic. We do not claim that the body of text used here is standard; making it standard and freely available is one of our future goals.
3 Language Model
A language model answers the question: what is the probability that a sentence appears in a natural language? A sentence like "Saya suka buah" (in English, "I like fruits") is naturally more likely to appear in text or everyday conversation than the grammatically similar sentence "Saya suka meja" ("I like a table").
Formally, a language model approximates the probability of a sequence of words $w_1, w_2, \ldots, w_m$,

$P(w_1^m) = P(w_1 w_2 \cdots w_m)$.   (1)

Equation (1) equals the probability of the last word of the sentence given the previous words, times the probability of the previous words. By the chain rule we have

$P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1})$.   (2)

Using the Markov property, we can assume that the probability of a word depends only on the previous $n-1$ words. This is the $n$-gram language model, and we get

$P(w_1^m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})$,   (3)

where it is understood that in $w_{i-n+1}^{i-1}$, words with indices less than 1 are discarded. To calculate the conditional probability in (3), we use maximum likelihood estimation. That is,

$P(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}$,   (4)

where $C(\cdot)$ is the count of the given word sequence in the training set.
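As a minimal sketch of this maximum likelihood estimate, the following counts $n$-grams and their contexts and takes the ratio. The function name and the sentence-boundary markers are our own illustration, not part of the paper.

```python
from collections import Counter

def mle_ngram_probs(sentences, n):
    """Estimate n-gram probabilities by maximum likelihood: count ratio
    C(w_{i-n+1}..w_i) / C(w_{i-n+1}..w_{i-1})."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            ngram = tuple(tokens[i - n + 1 : i + 1])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return {g: c / context_counts[g[:-1]] for g, c in ngram_counts.items()}

probs = mle_ngram_probs(["saya suka buah", "saya suka buku"], n=2)
# P(suka | saya) = 2/2 = 1.0 ; P(buah | suka) = 1/2 = 0.5
```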
The most common metric for evaluating a language model is perplexity. Perplexity is defined by $PP = 2^{H(T)}$, where $H(T)$ is the cross-entropy of the test set $T$,

$H(T) = -\dfrac{1}{W_T} \log_2 P(T)$,

where $W_T$ is the number of words in the test set $T$.
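The perplexity computation can be sketched directly from this definition, given per-word log probabilities of the test set (the function name is ours):

```python
import math

def perplexity(word_log2_probs):
    """PP = 2^H, where H = -(1/W) * sum of log2 p(w) over the test set."""
    W = len(word_log2_probs)
    H = -sum(word_log2_probs) / W   # cross-entropy in bits per word
    return 2 ** H

# Four words, each with probability 1/8 -> H = 3 bits -> perplexity 8.
pp = perplexity([math.log2(1 / 8)] * 4)
```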
A model is relatively better when it has lower perplexity compared to that of other models.
3.1 Smoothing Techniques
The previous model suffers from one problem: it gives zero probability to any sentence containing an $n$-gram that does not appear in the corpus. To overcome this, we discount probability mass from $n$-grams that appear in the corpus and distribute it to those that have zero probability. This technique is called smoothing.
There are several smoothing techniques, e.g., Katz smoothing, Good-Turing, Witten-Bell, and Kneser-Ney. Generally, those smoothing techniques fall into two categories, backoff and interpolated techniques.
In the backoff technique, the probability of an $n$-gram that does not appear in the corpus is estimated using that of its lower-order $n$-grams:

$p_{BO}(w_i \mid w_{i-n+1}^{i-1}) = \begin{cases} p^{*}(w_i \mid w_{i-n+1}^{i-1}) & \text{if } C(w_{i-n+1}^{i}) > 0, \\ \alpha(w_{i-n+1}^{i-1})\, p_{BO}(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise,} \end{cases}$

where $p^{*}$ is the modified (discounted) probability for $n$-grams that appear in the corpus and $\alpha$ is a scaling factor chosen to make the backoff probabilities of $n$-grams that do not appear in the corpus sum to one.
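The backoff lookup described above can be sketched as a recursion over shrinking contexts. The tables `p_star` and `alpha` and the unigram floor are toy assumptions for illustration, not values from the paper.

```python
def backoff_prob(w, context, p_star, alpha):
    """p_BO(w | context): discounted probability if the n-gram was seen,
    otherwise alpha(context) times the lower-order backoff probability."""
    if not context:                       # unigram base case
        return p_star.get((w,), 1e-7)     # tiny floor for unseen words (assumption)
    ngram = context + (w,)
    if ngram in p_star:
        return p_star[ngram]
    return alpha.get(context, 1.0) * backoff_prob(w, context[1:], p_star, alpha)

# Toy tables: p_star holds discounted probabilities, alpha the backoff weights.
p_star = {("saya", "suka"): 0.8, ("suka",): 0.3}
alpha = {("saya",): 0.2}
p = backoff_prob("suka", ("saya",), p_star, alpha)   # seen bigram -> 0.8
q = backoff_prob("makan", ("saya",), p_star, alpha)  # unseen -> backs off to unigram
```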
Instead of backing off, the interpolated technique combines the probability of an $n$-gram with those of its lower orders, e.g., the combined probabilities of trigram, bigram, and unigram for a trigram language model.
In this paper, we used the backoff versions of the Modified Kneser-Ney and Witten-Bell techniques.
3.2 Modified Kneser-Ney
Initially, Kneser-Ney smoothing used the backoff technique. Chen and Goodman modified it to use interpolation and further modified it to have multiple discounts; this is called the modified Kneser-Ney smoothing technique:

$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{C(w_{i-n+1}^{i}) - D(C(w_{i-n+1}^{i}))}{\sum_{w_i} C(w_{i-n+1}^{i})} + \gamma(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1}).$

Here $D(c)$ is a discounting value applied to $n$-grams with nonzero counts, and $N_c(w_{i-n+1}^{i-1}\,\bullet)$ is the number of words that appear after the context $w_{i-n+1}^{i-1}$ exactly $c$ times. Modified Kneser-Ney uses three different discounting values, $D_1$, $D_2$, and $D_{3+}$, which are the discounting values for $n$-grams with one, two, and three or more counts, respectively.
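Chen and Goodman estimate these discounts from the counts-of-counts $n_c$ (the number of $n$-grams occurring exactly $c$ times), which is also what SRILM's -kndiscount option does by default. A sketch, with hypothetical counts-of-counts:

```python
def mkn_discounts(n1, n2, n3, n4):
    """Estimate D1, D2, D3+ from counts-of-counts n_c, following
    Chen and Goodman: Y = n1/(n1+2*n2), D_c = c - (c+1)*Y*n_{c+1}/n_c."""
    Y = n1 / (n1 + 2 * n2)
    D1 = 1 - 2 * Y * n2 / n1
    D2 = 2 - 3 * Y * n3 / n2
    D3plus = 3 - 4 * Y * n4 / n3
    return D1, D2, D3plus

# Hypothetical counts-of-counts from a small corpus:
D1, D2, D3 = mkn_discounts(n1=1000, n2=400, n3=200, n4=100)
```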
3.3 Witten-Bell
To get the backoff weight, the Witten-Bell technique considers the number of unique words following the history $w_{i-n+1}^{i-1}$. This number is formally defined as

$N_{1+}(w_{i-n+1}^{i-1}\,\bullet) = \left| \{ w_i : C(w_{i-n+1}^{i-1} w_i) > 0 \} \right|.$

With this number, the parameter $\lambda_{w_{i-n+1}^{i-1}}$ is defined by

$1 - \lambda_{w_{i-n+1}^{i-1}} = \dfrac{N_{1+}(w_{i-n+1}^{i-1}\,\bullet)}{N_{1+}(w_{i-n+1}^{i-1}\,\bullet) + \sum_{w_i} C(w_{i-n+1}^{i})}$,

and the higher-order distribution is defined as follows:

$p_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{C(w_{i-n+1}^{i}) + N_{1+}(w_{i-n+1}^{i-1}\,\bullet)\, p_{WB}(w_i \mid w_{i-n+2}^{i-1})}{\sum_{w_i} C(w_{i-n+1}^{i}) + N_{1+}(w_{i-n+1}^{i-1}\,\bullet)}.$
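The Witten-Bell estimate described above can be sketched for the bigram case, backing off to the unigram distribution; a minimal sketch in its interpolated form, with function names of our own choosing:

```python
from collections import Counter, defaultdict

def witten_bell_bigram(sentences):
    """Interpolated Witten-Bell bigram estimate:
    p(w|h) = (C(h,w) + N1+(h) * p_uni(w)) / (C(h) + N1+(h)),
    where N1+(h) is the number of unique words seen after history h."""
    bigrams, unigrams = Counter(), Counter()
    followers = defaultdict(set)
    for sent in sentences:
        tokens = sent.split()
        unigrams.update(tokens)
        for h, w in zip(tokens, tokens[1:]):
            bigrams[(h, w)] += 1
            followers[h].add(w)
    total = sum(unigrams.values())

    def p(w, h):
        n1plus = len(followers[h])                            # unique followers of h
        c_h = sum(c for (a, _), c in bigrams.items() if a == h)
        p_uni = unigrams[w] / total                           # lower-order distribution
        return (bigrams[(h, w)] + n1plus * p_uni) / (c_h + n1plus)

    return p

p = witten_bell_bigram(["saya suka buah", "saya suka buku"])
prob = p("suka", "saya")   # (2 + 1 * 2/6) / (2 + 1) = 7/9
```

Note that an unseen bigram such as p("buah", "saya") receives a small nonzero probability through the unigram term, which is the point of the smoothing.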
4 Experiment and Results
We extracted a body of text from the Indonesian version of Wikipedia. Before we fed the text into SRILM, we preprocessed it with the following steps.
Removing unwanted lines
Splitting paragraphs into sentences
Shuffling and removing duplicate sentences
Removing sentences that contain fewer than two tokens
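The four cleaning steps above can be sketched as follows. The markup filter and the sentence splitter here are simplified assumptions for illustration, not the paper's exact rules.

```python
import random
import re

def preprocess(raw_text):
    """Drop unwanted (markup) lines, split paragraphs into sentences,
    shuffle and deduplicate, and drop sentences with fewer than two tokens."""
    lines = [l.strip() for l in raw_text.splitlines()]
    lines = [l for l in lines if l and not l.startswith(("{{", "[[", "<"))]
    sentences = []
    for line in lines:
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", line) if s.strip())
    unique = list(dict.fromkeys(sentences))   # remove duplicates, keep first copy
    random.shuffle(unique)                    # shuffle, as in step 3
    return [s for s in unique if len(s.split()) >= 2]

clean = preprocess("Saya suka buah. Ok.\n{{Infobox}}\nSaya suka buah. Saya suka buku.")
```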
Table 1 shows the number of words and sentences in resulting texts.
After preprocessing the text, we did the following steps to prepare the data.
We split the text into a training set and a test set, with a 90:10 proportion between them.
We split the training set into four disjoint texts, i.e., texts of 1K, 10K, 100K, and 1M sentences. The total number of words in the training sets is about 22M. See Table 2.
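The splitting procedure can be sketched as follows (sizes are in sentences; the function name is ours):

```python
def make_splits(sentences, test_frac=0.1, sizes=(1_000, 10_000, 100_000, 1_000_000)):
    """90:10 train/test split, then carve the training portion into
    disjoint, consecutive subsets of the requested sizes."""
    cut = int(len(sentences) * (1 - test_frac))
    train, test = sentences[:cut], sentences[cut:]
    subsets, start = {}, 0
    for n in sizes:
        if start + n > len(train):
            break                       # not enough sentences left for this size
        subsets[n] = train[start:start + n]
        start += n
    return subsets, test

subsets, test = make_splits([f"kalimat {i}" for i in range(2000)])
```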
For each training set, we built an $n$-gram language model with $n$ = 3, 5, and 7. We then computed the perplexity of one common test set for all language models. We used the following SRILM commands to build the language models (examples for the 1M-sentence training set, 7-gram language model).
To build language model with Modified Kneser-Ney: ngram-count -text 1M -order 7 -lm 1M.7.kn.lm -kndiscount
To build language model with Witten-Bell: ngram-count -text 1M -order 7 -lm 1M.7.wb.lm -wbdiscount
To compute perplexity (Modified Kneser-Ney): ngram -ppl test -order 7 -lm 1M.7.kn.lm
The number of out-of-vocabulary (OOV) words for each training set is shown in Fig. 1.
It is important to note that we did not attempt to optimize the parameters of the language models; in our experiments, we used the default parameter values for both techniques.
It is interesting to note that the Modified Kneser-Ney 5-gram outperforms the Modified Kneser-Ney 7-gram. Meanwhile, the Witten-Bell technique consistently improves as the $n$-gram order increases.
5 Conclusion
We have studied the Modified Kneser-Ney and Witten-Bell smoothing techniques for a language model of Bahasa Indonesia. We used the Indonesian version of Wikipedia as the source of our training and test sets. After some data preparation, we used the SRILM toolkit to build language models and calculated the perplexity of each model. Our experiments with 3-gram, 5-gram, and 7-gram models showed that Modified Kneser-Ney consistently outperforms the Witten-Bell technique. We will study other smoothing techniques, as well as make a standard corpus for Bahasa Indonesia, in our future research.
References
-  Mirna Adriani and Hammam Riza. Research Report on Local Language Computing: Development of Indonesian Language Resources and Translation System.
-  Stanley F Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 310–318. Association for Computational Linguistics, 1996.
-  Septina Dian Larasati. IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus. In LREC, pages 902–906, 2012.
-  Femphy Pisceldo, Ruli Manurung, and Mirna Adriani. Probabilistic Part-Of-Speech Tagging for Bahasa Indonesia. In Third International MALINDO Workshop, Colocated Event ACL-IJCNLP, 2009.
-  Hammam Riza, Adiansya P Budiono, and M Henky. I/ETS: Indonesian-English Machine Translation System using Collaborative P2P Corpus. Agency for the Assessment and Application of Technology (BPPT), Indonesia, University of North Texas, 2008.
-  Sakriani Sakti, Eka Kelana, Hammam Riza, Shinsuke Sakai, Konstantin Markov, and Satoshi Nakamura. Development of Indonesian Large Vocabulary Continuous Speech Recognition System within A-STAR Project. In IJCNLP, pages 19–24, 2008.
-  Yee Whye Teh. A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics, 2006.
-  Edward WD Whittaker and Philip C Woodland. Comparison of Language Modelling Techniques for Russian and English. In ICSLP, 1998.