Universal Word Segmentation: Implementation and Interpretation

07/09/2018
by Yan Shao, et al.

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.
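
As an illustration of how word segmentation can be cast as sequence tagging, the sketch below assigns one boundary tag to each character and then recovers the words from those tags. The abstract does not specify the tag set or the tagging model, so the B/I/E/S scheme (begin, inside, end, single-character word) and the helper names `words_to_tags` and `tags_to_words` are assumptions made for this example, not the authors' implementation. In a real system the tags would be predicted by a trained tagger rather than derived from gold words.

```python
def words_to_tags(words):
    """Convert a list of words into one boundary tag per character.

    Assumed tag set: B (begin), I (inside), E (end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from a character sequence and its boundary tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical example sentence, already segmented into words.
words = ["我", "喜欢", "自然语言"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```

Because the tags are assigned per character, the same formulation applies to any writing system; only the character inventory and the distribution of boundary cues differ from language to language.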


