A realistic and robust model for Chinese word segmentation

05/21/2019
by Chu-Ren Huang, et al.

A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high F-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high F-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick preparation of training data by converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and training with 1 million vectors achieve comparable and reliable results. In addition, when applied to SigHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike in all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants while giving robust results. We conclude by discussing linguistic ramifications as well as future implications of the WBD approach.
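The boundary-decision framing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the contextual vectors are built, so the two-character window on each side of a gap, the padding symbol `PAD`, and the function names `boundary_examples` and `apply_decisions` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

PAD = "#"  # hypothetical padding symbol for sentence edges (an assumption)


def boundary_examples(
    words: List[str], window: int = 2
) -> List[Tuple[Tuple[str, ...], int]]:
    """Turn one segmented sentence into (context, label) pairs.

    One pair is produced per gap between adjacent characters. The label is 1
    when the gap is a word boundary in the reference segmentation, else 0.
    The window size is an illustrative choice, not taken from the paper.
    """
    text = "".join(words)
    # Character offsets where a word ends, excluding the end of the sentence.
    cuts = set()
    pos = 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    padded = PAD * window + text + PAD * window
    examples = []
    for gap in range(1, len(text)):
        centre = gap + window  # index of the gap in the padded string
        context = tuple(padded[centre - window:centre + window])
        examples.append((context, 1 if gap in cuts else 0))
    return examples


def apply_decisions(text: str, decisions: List[int]) -> List[str]:
    """Rebuild words from one binary decision per gap between characters."""
    words, start = [], 0
    for gap, is_boundary in enumerate(decisions, start=1):
        if is_boundary:
            words.append(text[start:gap])
            start = gap
    words.append(text[start:])
    return words


if __name__ == "__main__":
    pairs = boundary_examples(["我们", "喜欢", "中文"])
    labels = [label for _, label in pairs]
    print(labels)
    print(apply_decisions("我们喜欢中文", labels))
```

In a full system the contexts would be encoded numerically and passed to a classifier, and the classifier's predicted labels would be fed to `apply_decisions`; here the reference labels are reused simply to show that the two steps round-trip.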

research
04/05/2018

Word Segmentation as Graph Partition

We propose a new approach to the Chinese word segmentation problem that ...
research
10/01/2021

Span Labeling Approach for Vietnamese and Chinese Word Segmentation

In this paper, we propose a span labeling approach to model n-gram infor...
research
06/27/2019

PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation

Chinese word segmentation (CWS) is a fundamental step of Chinese natural...
research
02/18/2020

A New Clustering neural network for Chinese word segmentation

In this article I proposed a new model to achieve Chinese word segmentat...
research
11/06/2018

Fast Neural Chinese Word Segmentation for Long Sentences

Rapidly developed neural models have achieved competitive performance in...
research
08/22/2019

Active Learning for Chinese Word Segmentation in Medical Text

Electronic health records (EHRs) stored in hospital information systems ...
research
07/22/2020

When Classical Chinese Meets Machine Learning: Explaining the Relative Performances of Word and Sentence Segmentation Tasks

We consider three major text sources about the Tang Dynasty of China in ...