Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

12/03/2018
by Yerai Doval, et al.

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter implemented either as an n-gram model or a recurrent neural network. The resulting system analyzes input text with no word boundaries one token at a time, where a token can be a character or a byte, and uses the information gathered by the language model to decide whether a boundary must be placed at the current position. Our aim is to use this system as a preprocessing step for a microtext normalization system, which means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
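To make the idea concrete, the following is a minimal sketch of beam search over boundary decisions, scored by a character-level n-gram model. It is not the authors' implementation: the abstract does not specify the smoothing, the scoring function, or the beam settings, so the additive smoothing, the trigram order, the beam width, and the names `CharNGramLM` and `segment` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class CharNGramLM:
    """Toy character n-gram model with additive smoothing (an assumption)."""

    def __init__(self, order=3, alpha=0.1):
        self.order = order
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def _context(self, history):
        # Left-pad with spaces so that short histories still give a full context.
        padded = " " * (self.order - 1) + history
        return padded[len(padded) - (self.order - 1):]

    def train(self, text):
        # The training text keeps its spaces, so boundaries are ordinary symbols.
        for i, ch in enumerate(text):
            self.counts[self._context(text[:i])][ch] += 1
            self.vocab.add(ch)

    def logprob(self, history, ch):
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen symbols
        prob = (seen.get(ch, 0) + self.alpha) / (total + self.alpha * vocab_size)
        return math.log(prob)


def segment(text, lm, beam_width=8):
    """Insert spaces into `text`, keeping the `beam_width` best hypotheses."""
    beam = [("", 0.0)]  # (segmented prefix, cumulative log-probability)
    for ch in text:
        candidates = []
        for hyp, score in beam:
            # Option 1: no boundary before the current character.
            candidates.append((hyp + ch, score + lm.logprob(hyp, ch)))
            # Option 2: place a boundary, then emit the current character.
            if hyp and not hyp.endswith(" "):
                spaced = hyp + " "
                boundary_score = lm.logprob(hyp, " ") + lm.logprob(spaced, ch)
                candidates.append((spaced + ch, score + boundary_score))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][0]


if __name__ == "__main__":
    lm = CharNGramLM(order=3)
    lm.train("the cat sat on the mat and the dog sat on the log")
    print(segment("thecatsat", lm))
```

Because every inserted boundary adds one more scored symbol, this sketch compares hypotheses of different lengths by raw log-probability, which biases it against inserting spaces. A practical system would need to address that, and would train on far more data than the toy corpus used here. The same `segment` function would work with a recurrent neural model in place of `CharNGramLM`, provided it exposes a compatible `logprob(history, ch)` method.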

