Character Feature Engineering for Japanese Word Segmentation

by Mike Tian-Jian Jiang, et al.

In word segmentation, the engineering of machine learning architectures often draws the most attention. The problem representation itself, however, has remained almost static for at least two decades, as either word lattice ranking or character sequence tagging. The latter often shows stronger predictive power than the former on the out-of-vocabulary (OOV) issue. When that issue escalates to a need for rapid adaptation, which is a common scenario in industrial applications, the usual remedy is active learning from partial annotations or re-training with additional lexical resources, both of which take a somewhat word-based perspective. Not only is it difficult for end users to comply with linguistically consistent word boundary decisions, but the risk and cost of permanently forking models with estimated weights is also seldom affordable. To overcome this obstacle, this work provides an alternative that uses linguistic intuition about character composition, so that a sophisticated feature set and its derived scheme can enable dynamic lexicon expansion while the model remains intact. Experimental results suggest that the proposed solution, with or without external lexemes, performs competitively in terms of F1 score and OOV recall across various datasets.
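To make the character sequence tagging representation and the idea of lexicon expansion without re-training more concrete, here is a minimal sketch. It is an illustration only and not the feature set proposed in the paper: the B/M/E/S tag scheme, the function names `words_to_tags` and `lexicon_features`, the feature names, and the `max_len` parameter are all assumptions chosen for the example.

```python
from typing import Dict, List, Set


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags.

    B = begins a word, M = middle, E = ends a word, S = single-character word.
    (Assumed tag scheme; the paper may use a different one.)
    """
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def lexicon_features(text: str, lexicon: Set[str], max_len: int = 4) -> List[Dict[str, bool]]:
    """Hypothetical per-character features derived from dictionary matches.

    Each character is flagged by the position it occupies inside any lexicon
    entry that matches the text. The features depend only on the lexicon, so
    adding an entry changes the tagger's input, not its learned weights.
    """
    feats = [
        {"lex_B": False, "lex_M": False, "lex_E": False, "lex_S": False}
        for _ in text
    ]
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            j = i + n
            if j > len(text):
                break
            if text[i:j] not in lexicon:
                continue
            if n == 1:
                feats[i]["lex_S"] = True
            else:
                feats[i]["lex_B"] = True
                feats[j - 1]["lex_E"] = True
                for k in range(i + 1, j - 1):
                    feats[k]["lex_M"] = True
    return feats


if __name__ == "__main__":
    print(words_to_tags(["東京", "に", "行く"]))
    lexicon = {"東京"}
    print(lexicon_features("東京都に", lexicon))
    lexicon.add("東京都")  # lexicon grows; no model weights are touched
    print(lexicon_features("東京都に", lexicon))
```

In a setup like this, a sequence tagger would consume the lexicon-derived features alongside the characters themselves, so new lexemes take effect at inference time by changing the features rather than the model.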







A Hybrid Word-Character Model for Abstractive Summarization

Abstractive summarization is the popular research topic nowadays. Due to...

Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation

Character-based sequence labeling framework is flexible and efficient fo...

Combining Word and Character Vector Representation on Neural Machine Translation

This paper describes combinations of word vector representation and char...

Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF

We present a character-based model for joint segmentation and POS taggin...

Improving part-of-speech tagging via multi-task learning and character-level word representations

In this paper, we explore the ways to improve POS-tagging using various ...

Neural Word Segmentation with Rich Pretraining

Neural word segmentation research has benefited from large-scale raw tex...