Bangla Word Clustering Based on Tri-gram, 4-gram and 5-gram Language Model

01/27/2017
by Dipaloke Saha, et al.

In this paper, we describe a research method that generates Bangla word clusters on the basis of semantic and contextual similarity. Word clustering is important for parts of speech (POS) tagging, word sense disambiguation, text classification, recommender systems, spell checkers, grammar checkers, knowledge discovery and many other Natural Language Processing (NLP) applications. Efficient word clustering methods have already been implemented for English and some other languages. For Bangla, however, a lack of resources means that word clustering has not yet been implemented efficiently, and work on it is still at an early stage. Some research on word clustering in English, based on the five words preceding and following a key word, has reported efficient results. We are now implementing tri-gram, 4-gram and 5-gram models of word clustering for Bangla to observe which of the three performs best. We have started our research with a fairly large corpus of approximately 1 lakh (100,000) Bangla words. We use a machine learning technique in this research. We will generate word clusters and analyze them by testing several different threshold values.
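The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of one way n-gram context clustering with a similarity threshold could look. Everything in it is an assumption rather than the authors' method: the context definition (the other words in the n-gram window around the middle word), the cosine similarity measure, the single-link grouping, the function names, and the tiny transliterated toy corpus `TOKENS`.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus (transliterated), for illustration only.
TOKENS = ("ami bhat khai . ami ruti khai . "
          "tumi bhat khao . tumi ruti khao .").split()


def context_vectors(tokens, n):
    """Map each word to a Counter of the n-gram contexts it occurs in.

    Assumption: the middle word of each n-gram window is the key word and
    the remaining n-1 words, in order, form its context.
    """
    mid = n // 2
    vectors = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        context = tuple(window[:mid] + window[mid + 1:])
        vectors[window[mid]][context] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Group words whose context similarity reaches the threshold.

    Assumption: single-link grouping via union-find, so two words share a
    cluster if a chain of sufficiently similar pairs connects them.
    """
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(groups.values())


if __name__ == "__main__":
    # Compare the three window sizes at a few threshold values.
    for n in (3, 4, 5):
        vecs = context_vectors(TOKENS, n)
        for threshold in (0.7, 0.8):
            print(n, threshold, cluster(vecs, threshold))
```

On this toy corpus with tri-gram contexts, "bhat" and "ruti" share a cluster at a threshold of 0.8 because they occur in identical contexts, while "ami" and "tumi" merge only when the threshold is lowered to 0.7. Wider windows make contexts more specific, so they usually need more data before any two words share a context, which is one reason to compare window sizes and thresholds empirically.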


research
12/18/2017

word representation or word embedding in Persian text

Text processing is one of the sub-branches of natural language processin...
research
07/27/2020

Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji

Next word prediction is an input technology that simplifies the process ...
research
12/25/2019

N-gram Statistical Stemmer for Bangla Corpus

Stemming is a process that can be utilized to trim inflected words to st...
research
08/16/2018

Computing Word Classes Using Spectral Clustering

Clustering a lexicon of words is a well-studied problem in natural langu...
research
06/23/2017

Comparison of Modified Kneser-Ney and Witten-Bell Smoothing Techniques in Statistical Language Model of Bahasa Indonesia

Smoothing is one technique to overcome data sparsity in statistical lang...
research
07/17/2022

Towards Explainability in NLP: Analyzing and Calculating Word Saliency through Word Properties

The wide use of black-box models in natural language processing brings g...
research
01/07/2021

Real-Time Optimized N-gram For Mobile Devices

With the increasing number of mobile devices, there has been continuous ...
