Building a Syllable Database to Solve the Problem of Khmer Word Segmentation

03/07/2017
by Nam Tran Van, et al.

Word segmentation is a basic problem in natural language processing. For languages with complex writing systems, such as the Khmer language spoken in southern Vietnam, the problem is particularly intractable and poses significant challenges. Although experts in Vietnam and internationally have studied it in depth, no results so far meet practical demands; in particular, ambiguity in Khmer language processing has not been handled thoroughly. This paper presents a solution that divides syllables into component clusters using two proposed syllable models, and thereby builds a Khmer syllable database, a resource that is not yet actually available. The method uses a lexical database compiled from online Khmer dictionaries and several supporting dictionaries, which serve as training data and supply complementary linguistic characteristics. Each component cluster is labelled and located by its first and last letters so that the whole syllable can be identified. The approach is workable: test results achieve high accuracy and eliminate ambiguity, contributing to solving the word segmentation problem and to more efficient Khmer language processing.
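
To make the idea of component clusters labelled by their first and last letters more concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not specify the two syllable models, so the sketch assumes a simple orthographic cluster (a base letter, optional subscript consonants introduced by the coeng sign U+17D2, then dependent vowels and signs) defined only by Unicode ranges of the Khmer block. The names `split_clusters` and `label_clusters` are hypothetical.

```python
import re

# Assumed character classes, taken from the Unicode Khmer block (U+1780-U+17FF).
CONSONANTS = "\u1780-\u17A2"
BASE = "\u1780-\u17B3"            # consonants and independent vowels
COENG = "\u17D2"                  # sign that turns the next consonant into a subscript
MARKS = "\u17B6-\u17D1\u17D3"     # dependent vowels and other signs (coeng excluded)

# Simplified cluster: base, then (coeng + consonant)*, then marks*.
# This is an illustrative approximation, not the paper's syllable model.
CLUSTER_RE = re.compile(
    "[" + BASE + "](?:" + COENG + "[" + CONSONANTS + "])*[" + MARKS + "]*"
)

def split_clusters(text):
    """Return the orthographic clusters found in text, in order."""
    return CLUSTER_RE.findall(text)

def label_clusters(text):
    """Label each cluster with its first and last code point."""
    return [(c[0], c[-1], c) for c in split_clusters(text)]

if __name__ == "__main__":
    word = "\u1781\u17D2\u1798\u17C2\u179A"   # the word "Khmer" in Khmer script
    for first, last, cluster in label_clusters(word):
        print(f"U+{ord(first):04X} U+{ord(last):04X} {cluster}")
```

Note that an orthographic cluster is generally smaller than a syllable: in the example word the final consonant forms its own cluster, so a syllable model such as the ones the paper proposes would still have to group clusters into syllables.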


research
04/11/2022

Resources for Turkish Natural Language Processing: A critical survey

This paper presents a comprehensive survey of corpora and lexical resour...
research
11/17/2018

Unsupervised Post-processing of Word Vectors via Conceptor Negation

Word vectors are at the core of many natural language processing tasks. ...
research
06/18/2019

State-of-the-Art Vietnamese Word Segmentation

Word segmentation is the first step of any tasks in Vietnamese language ...
research
09/07/2023

Word segmentation granularity in Korean

This paper describes word segmentation granularity in Korean language pr...
research
07/25/2023

Towards Resolving Word Ambiguity with Word Embeddings

Ambiguity is ubiquitous in natural language. Resolving ambiguous meaning...
research
09/07/2022

That Slepen Al the Nyght with Open Ye! Cross-era Sequence Segmentation with Switch-memory

The evolution of language follows the rule of gradual change. Grammar, v...
research
10/21/2020

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Most undeciphered lost languages exhibit two characteristics that pose s...
