Effective Neural Solution for Multi-Criteria Word Segmentation

We present a simple yet elegant solution to train a single joint model on multi-criteria corpora for Chinese Word Segmentation (CWS). Our design requires no private layers in the model architecture; instead, it introduces two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. The rest of the model, including the Long Short-Term Memory (LSTM) layer and the Conditional Random Fields (CRFs) layer, remains unchanged and is shared across all datasets, keeping the parameter set minimal and constant in size. On the Bakeoff 2005 and Bakeoff 2008 datasets, our design has surpassed both single-criterion and multi-criteria state-of-the-art results. To the best of our knowledge, our design is the first to achieve state-of-the-art performance on such large-scale datasets. Source code and corpora of this paper are available on GitHub.




1 Introduction

Unlike English, which puts a space between every word, Chinese has no explicit word delimiters. Therefore, Chinese Word Segmentation (CWS) is a preliminary pre-processing step for Chinese language processing tasks. Following Xue (2003), most approaches consider this task as a sequence tagging task, and solve it with supervised learning models such as Maximum Entropy (ME) Jin et al. (2005) and Conditional Random Fields (CRFs) Lafferty et al. (2001); Peng et al. (2004). These early models require heavy handcrafted feature engineering within a fixed-size window.

With the rapid development of deep learning, neural word segmentation approaches arose to reduce the effort of feature engineering Zheng et al. (2013); Collobert et al. (2011); Pei et al. (2014); Chen et al. (2015b); Cai and Zhao (2016); Cai et al. (2017). Zheng et al. (2013) replaced raw characters with their embeddings as input and adapted the sliding-window based sequence labeling of Collobert et al. (2011). Pei et al. (2014) extended Zheng et al. (2013)’s work by exploiting tag embeddings and bigram features. Chen et al. (2015b) employed LSTM to capture long-distance preceding context. Notably, a novel word-based approach Cai and Zhao (2016); Cai et al. (2017) was proposed to model candidate segmented results directly. Despite the outstanding runtime performance, their solution required the maximum word length to be a fixed hyper-parameter and replaced words longer than that limit with a unique character. Thus their performance relies on an expurgation of long words, which is not practical.

Novel algorithms and deep models are not omnipotent. A large-scale corpus is also important for an accurate CWS system. Although there are many segmentation corpora, these datasets are annotated under different criteria, as illustrated in Table 1, making it hard to fully exploit them.

Corpora   Li Le   reaches   Benz Inc
pku       李 乐   到达      奔驰 公司
msr       李乐    到达      奔驰公司
as        李樂    到達      賓士 公司
cityu     李樂    到達      平治 公司
Table 1: Illustration of different segmentation criteria of SIGHAN bakeoff 2005.

Recently, Chen et al. (2017) designed an adversarial multi-criteria learning framework for CWS. However, their models involve several complex architectures, and their results are not competitive with the state of the art.

In this paper, we propose a smoothly joint multi-criteria learning solution for CWS that adds two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. We have conducted various experiments on 8 segmentation criteria corpora from SIGHAN Bakeoff 2005 and 2008. Our models improve performance by transfer learning across heterogeneous corpora. The final scores have surpassed previous multi-criteria learning results, and 2 out of 4 Bakeoff 2005 datasets have even surpassed previous preprocessing-heavy state-of-the-art single-criterion learning results.

The contributions of this paper could be summarized as:

  • We propose a simple yet elegant solution to perform multi-criteria learning on multiple heterogeneous segmentation criteria corpora;

  • 2 out of 4 datasets have surpassed the state-of-the-art scores on Bakeoff 2005;

  • Extensive experiments on up to 8 datasets have shown that our solution significantly improves performance.

2 Related Work

In this section, we review previous work in two directions: Chinese Word Segmentation and multi-task learning.

2.1 Chinese Word Segmentation

Chinese Word Segmentation has been a well-studied problem for decades Huang and Zhao (2007). After the pioneering work of Xue (2003) transformed CWS into a character-based tagging problem, Peng et al. (2004) adopted CRF as the sequence labeling model and showed its effectiveness. Following these pioneers, later sequence labeling based works Tseng et al. (2005); Zhao et al. (2006, 2010); Sun et al. (2012) were proposed. Recent neural models Zheng et al. (2013); Pei et al. (2014); Chen et al. (2015b); Dong et al. (2016); Chen et al. (2017) also followed this sequence labeling fashion.

2.2 Multi-Task Learning

Compared to single-task learning, multi-task learning is relatively harder due to the divergence between tasks and heterogeneous annotation datasets. Recent works have started to explore joint learning on Chinese word segmentation or part-of-speech tagging. Jiang et al. (2009) stacked two classifiers together, where the latter used the former’s prediction as additional features. Sun and Wan (2012) proposed a structure-based stacking model in which one tagger was designed to refine another tagger’s prediction. These early models lacked a unified loss function and suffered from error propagation.

Qiu et al. (2013) proposed to learn a mapping function between heterogeneous corpora. Li et al. (2015); Chao et al. (2015) proposed and utilized a coupled sequence labeling model which can directly learn and infer two heterogeneous annotations simultaneously. These works mainly focused on exploiting relationships between different tagging sets, rather than on shared features.

Chen et al. (2017) designed a complex framework involving shared layers with Generative Adversarial Nets (GANs) to extract criteria-invariant features, and dataset-related private layers to detect criteria-related features. This work did not show a clear advantage over previous state-of-the-art single-criterion learning scores.

Our solution is greatly motivated by Google’s Multilingual Neural Machine Translation System, for which Johnson et al. (2016) proposed an extremely simple solution without any complex architectures or private layers. They added an artificial token corresponding to each parallel corpus and trained on them jointly, which inspired our design.

3 Neural Architectures for Chinese Word Segmentation

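A prevailing approach to Chinese Word Segmentation is to cast it as a character-based sequence tagging problem Xue (2003); Sun et al. (2012). One commonly used tagging set is $\mathcal{T} = \{B, M, E, S\}$, representing the beginning, middle and end of a word, or a single character forming a word. Given a sequence $X$ with $n$ characters as $X = (x_1, \ldots, x_n)$, sequence tagging based CWS is to find the most probable tags $Y^* = (y_1^*, \ldots, y_n^*)$:

$$Y^* = \operatorname*{arg\,max}_{Y \in \mathcal{T}^n} p(Y \mid X) \qquad (1)$$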
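To make the tagging scheme concrete, the following minimal sketch converts a segmented sentence into one B/M/E/S tag per character and back. The function names `words_to_tags` and `tags_to_words` are illustrative and are not taken from the paper's released code; the example sentence is the msr row of Table 1.

```python
def words_to_tags(words):
    """Map a list of segmented words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single character forming a word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


msr_words = ["李乐", "到达", "奔驰公司"]
msr_tags = words_to_tags(msr_words)
restored = tags_to_words("".join(msr_words), msr_tags)
```

Under this scheme, segmentation reduces to predicting one of four labels per character, which is what the tagging model described in this section does.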


We model them jointly using a conditional random field, mostly following the architecture proposed by Lample et al. (2016), via stacking Long Short-Term Memory Networks (LSTMs) Hochreiter and Schmidhuber (1997) with a CRFs layer on top of them.

We introduce our neural framework bottom-up. The bottom layer is a character Bi-LSTM (bidirectional Long Short-Term Memory Network) Graves and Schmidhuber (2005) that takes character embeddings as input and outputs each character’s contextual feature representation:

$$f_t = \mathrm{BiLSTM}(X, t)$$


After a contextual representation is generated, it will be decoded to make a final segmentation decision. We employed a Conditional Random Fields (CRF) Lafferty et al. (2001) layer as the inference layer.

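First of all, a linear score function $s(X, t)$, a vector with one entry per tag, is used to assign a local score for each tag on the $t$-th character:

$$s(X, t) = W_s^{\top} h_t + b_s$$

where $h_t = f_t \oplus e_t$ is the concatenation of the Bi-LSTM hidden state $f_t$ and the bigram feature embedding $e_t$, and $W_s$ and $b_s$ are trainable parameters.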

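Then, for a sequence of predictions:

$$Y = (y_1, \ldots, y_n)$$

first-order linear-chain CRFs employ a Markov chain to define its global score as:

$$\mathrm{score}(X, Y) = \sum_{t=0}^{n} A_{y_t, y_{t+1}} + \sum_{t=1}^{n} s(X, t)_{y_t}$$

where $s(X, t)_{y_t}$ is the local score of tag $y_t$ on the $t$-th character, and $A$ is a transition matrix such that $A_{i,j}$ represents the score of a transition from tag $i$ to tag $j$. $y_0$ and $y_{n+1}$ are the start and end tags of a sentence, which are added to the tagset additionally. $A$ is therefore a square matrix of size $k + 2$, where $k$ is the number of tags in the original tagset.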
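As a rough illustration of this global score, the sketch below sums transition scores (including the added start and end tags) and local scores for one tag sequence. It uses plain dictionaries instead of the paper's DyNet tensors; the labels `<s>` and `</s>` for the start and end tags, the function name `sequence_score`, and all numbers are made up for the example.

```python
START, END = "<s>", "</s>"  # hypothetical labels for the added start/end tags


def sequence_score(local, trans, tags):
    """Global score of one tag sequence under a first-order linear-chain CRF.

    local[t][tag] is the local score of `tag` at position t;
    trans[(a, b)] is the transition score from tag a to tag b
    (pairs missing from the dict are scored 0.0 in this sketch).
    """
    path = [START] + list(tags) + [END]
    transition = sum(trans.get(pair, 0.0) for pair in zip(path, path[1:]))
    emission = sum(local[t][tag] for t, tag in enumerate(tags))
    return transition + emission


# Toy scores for a two-character sentence (made-up numbers).
local = [{"B": 1.0, "S": 0.5}, {"E": 2.0, "S": 0.25}]
trans = {(START, "B"): 0.5, ("B", "E"): 1.0, ("E", END): 0.25}

score_be = sequence_score(local, trans, ["B", "E"])  # one two-character word
score_ss = sequence_score(local, trans, ["S", "S"])  # two single-character words
```

With these toy numbers the sequence B, E receives the higher score, so the model would prefer to segment the two characters as a single word.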

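Finally, this global score is normalized to the probability in Equation (1) via a softmax over all possible tag sequences:

$$p(Y \mid X) = \frac{\exp(\mathrm{score}(X, Y))}{\sum_{Y' \in \mathcal{Y}_X} \exp(\mathrm{score}(X, Y'))}$$

where $\mathcal{Y}_X$ denotes all possible tag sequences for $X$.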


In the decoding phase, first-order linear-chain CRFs only model bigram interactions between output tags, so the maximum a posteriori sequence in Eq. 1 can be computed using dynamic programming.
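The dynamic program referred to here is usually the Viterbi algorithm. The following is a minimal sketch under the same assumptions as a dictionary-based CRF (local scores as `local[t][tag]`, transitions as `trans[(a, b)]`, missing transitions scored 0.0, hypothetical `<s>` and `</s>` boundary tags); it is not the paper's implementation, and the toy scores are invented.

```python
def viterbi(local, trans, tagset, start="<s>", end="</s>"):
    """Return (best tag sequence, its global score) for a non-empty sentence."""
    def step(prev, tag):
        return trans.get((prev, tag), 0.0)

    # best[t][tag] = (score of the best partial path ending in `tag`, previous tag)
    best = [{tag: (step(start, tag) + local[0][tag], None) for tag in tagset}]
    for t in range(1, len(local)):
        column = {}
        for tag in tagset:
            prev = max(tagset, key=lambda p: best[t - 1][p][0] + step(p, tag))
            column[tag] = (best[t - 1][prev][0] + step(prev, tag) + local[t][tag], prev)
        best.append(column)

    last = max(tagset, key=lambda p: best[-1][p][0] + step(p, end))
    score = best[-1][last][0] + step(last, end)

    # Follow the back-pointers from the last position to the first.
    path = [last]
    for t in range(len(local) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1], score


tagset = ["B", "M", "E", "S"]
local = [
    {"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
    {"B": 0.0, "M": 0.0, "E": 2.0, "S": 1.0},
    {"B": 0.0, "M": 0.0, "E": 0.0, "S": 3.0},
]
trans = {("B", "E"): 1.0, ("S", "E"): -10.0}

best_path, best_score = viterbi(local, trans, tagset)
```

Because each position only looks back one tag, the cost is linear in sentence length and quadratic in the number of tags, instead of exponential as in exhaustive enumeration.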

4 Elegant Solution for Multi-Criteria Chinese Word Segmentation

For closely related multi-task learning problems such as multilingual translation, Johnson et al. (2016) proposed a simple and practical solution. It only needs an artificial token added at the beginning of the input sentence to specify the required target language, with no need to design complex private encoder-decoder structures.

We follow their spirit and add two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. For instance, sentences in SIGHAN Bakeoff 2005 take the following form:

Corpora             Li Le   reaches   Benz Inc
PKU       <pku>     李 乐   到达      奔驰 公司   </pku>
MSR       <msr>     李乐    到达      奔驰公司    </msr>
AS        <as>      李樂    到達      賓士 公司   </as>
CityU     <cityu>   李樂    到達      平治 公司   </cityu>
Table 2: Illustration of adding artificial tokens to the 4 datasets of SIGHAN Bakeoff 2005. To be fair, these <dataset> and </dataset> tokens will be removed when computing scores.

These artificial tokens specify which dataset the sentence comes from. They are treated as normal tokens, or more specifically, as normal characters. With their help, instances from different datasets can be seamlessly put together and jointly trained without extra effort. These two special tokens are designed to carry criteria-related information across long dependencies, affecting the context representation of every character, and finally to produce segmentation decisions matching the target criteria. At test time, those tokens are used to specify the required segmentation criteria. Again, they are not taken into account when computing performance scores.
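The wrapping and unwrapping steps described above can be sketched as follows. The angle-bracket spelling of the tokens, the helper names, and the choice of tagging each artificial token as a single-character word are assumptions made for illustration, not details confirmed by the text.

```python
def add_criterion_tokens(chars, criterion):
    """Wrap a character sequence with opening and closing criterion tokens."""
    return [f"<{criterion}>"] + list(chars) + [f"</{criterion}>"]


def strip_criterion_tokens(sequence):
    """Drop the first and last positions (the artificial tokens) before scoring."""
    return list(sequence[1:-1])


chars = list("李乐到达奔驰公司")
gold_tags = list("BEBEBMME")  # MSR-style segmentation: 李乐 / 到达 / 奔驰公司

model_input = add_criterion_tokens(chars, "msr")
# Assumption: each artificial token is tagged "S", like a single-character word.
model_tags = ["S"] + gold_tags + ["S"]

# Before computing scores, the two extra positions are removed again.
scored_tags = strip_criterion_tokens(model_tags)
```

Switching the target criterion at test time then only requires changing the `criterion` argument; the model and its parameters stay the same.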

5 Training

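The training procedure is to maximize the log-probability of the gold tag sequence:

$$\log p(Y \mid X) = \mathrm{score}(X, Y) - \log \sum_{Y' \in \mathcal{Y}_X} \exp(\mathrm{score}(X, Y'))$$

where $\mathrm{score}(X, Y)$ is the global CRF score of tag sequence $Y$ and $\mathcal{Y}_X$ represents all possible tag sequences for a sentence $X$.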
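The sum over all tag sequences in this objective does not need to be enumerated; it can be computed with the forward algorithm in log space. The sketch below shows one way to do this with plain Python dictionaries. The boundary tag labels `<s>` and `</s>`, the function names, and the toy scores are assumptions for illustration, and transitions missing from `trans` are scored 0.0.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_partition(local, trans, tagset, start="<s>", end="</s>"):
    """Log of the softmax denominator, computed with the forward algorithm."""
    alpha = {tag: trans.get((start, tag), 0.0) + local[0][tag] for tag in tagset}
    for t in range(1, len(local)):
        alpha = {
            tag: local[t][tag]
            + logsumexp([alpha[p] + trans.get((p, tag), 0.0) for p in tagset])
            for tag in tagset
        }
    return logsumexp([alpha[p] + trans.get((p, end), 0.0) for p in tagset])


def log_likelihood(local, trans, tagset, gold, start="<s>", end="</s>"):
    """log p(gold | X): global score of the gold sequence minus the log partition."""
    path = [start] + list(gold) + [end]
    gold_score = sum(trans.get(pair, 0.0) for pair in zip(path, path[1:]))
    gold_score += sum(local[t][tag] for t, tag in enumerate(gold))
    return gold_score - log_partition(local, trans, tagset, start, end)


tagset = ["B", "E", "S"]
local = [
    {"B": 1.0, "E": -1.0, "S": 0.5},
    {"B": -0.5, "E": 2.0, "S": 0.0},
    {"B": 0.3, "E": 0.1, "S": 1.2},
]
trans = {("B", "E"): 1.0, ("E", "B"): 0.5, ("S", "E"): -2.0, ("<s>", "E"): -3.0}

gold_logp = log_likelihood(local, trans, tagset, ["B", "E", "S"])
```

Training would maximize `gold_logp` (or minimize its negative) with respect to the parameters that produce `local` and `trans`.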

6 Experiments

We conducted various experiments to verify the following questions:

  1. Is our multi-criteria solution capable of learning heterogeneous datasets?

  2. Can our solution be applied to large-scale corpus groups consisting of tiny and informal texts?

  3. More data, better performance?

Our implementation is based on DyNet Neubig et al. (2017), a dynamic neural network framework for deep learning. Additionally, we implemented the CRF layer in Python and integrated the official scoring script to verify our scores.

6.1 Datasets

To explore the first question, we experimented on the 4 prevalent CWS datasets from SIGHAN2005 Emerson (2005), as these datasets are commonly used by previous state-of-the-art research works. To address questions 2 and 3, we applied our solution to the SIGHAN2008 datasets MOE (2008), which are used to compare our approach with other state-of-the-art multi-criteria learning works at a larger scale.

All datasets are preprocessed by replacing continuous runs of English characters and digits with a unique token. For training and development sets, lines are split into shorter sentences or clauses at punctuation marks to allow faster batching.
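A minimal sketch of these two preprocessing steps on raw text is given below. The placeholder character, the inclusion of full-width letters and digits, the use of one shared placeholder for letters and digits, and the delimiter set are all assumptions; the text does not specify them, and the released code may differ.

```python
import re

PLACEHOLDER = "\uE000"  # hypothetical single-character stand-in, not from the paper
# Runs of ASCII or full-width letters and digits (assumed character set).
ALNUM_RUN = re.compile(r"[A-Za-z0-9Ａ-Ｚａ-ｚ０-９]+")
CLAUSE_END = "，。！？；"  # assumed set of splitting punctuation


def replace_alnum_runs(text):
    """Replace each continuous run of letters/digits with one placeholder."""
    return ALNUM_RUN.sub(PLACEHOLDER, text)


def split_clauses(text, delimiters=CLAUSE_END):
    """Split a line into clauses, keeping each delimiter with its clause."""
    clauses, current = [], ""
    for ch in text:
        current += ch
        if ch in delimiters:
            clauses.append(current)
            current = ""
    if current:
        clauses.append(current)
    return clauses


normalized = replace_alnum_runs("售价USD2000元")
clauses = split_clauses("李乐到达，奔驰公司。")
```

Shorter clauses have more uniform lengths, which reduces padding inside a batch; this is presumably the batching benefit mentioned above.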

Specifically, the Traditional Chinese corpora CityU, AS and CKIP are converted to Simplified Chinese using the popular Chinese NLP tool HanLP (https://github.com/hankcs/HanLP).

6.2 Results on SIGHAN bakeoff 2005

Our baseline model is a Bi-LSTM-CRF trained on each dataset separately. We then improved it with multi-criteria learning. The final F scores are shown in Table 3.

Models                   PKU    MSR    CityU  AS
Tseng et al. (2005)      95.0   96.4   -      -
Zhang and Clark (2007)   95.0   96.4   -      -
Zhao and Kit (2008)      95.4   97.6   96.1   95.7
Sun et al. (2009)        95.2   97.3   -      -
Sun et al. (2012)        95.4   97.4   -      -
Zhang et al. (2013)      96.1   97.4   -      -
Chen et al. (2015a)      94.5   95.4   -      -
Chen et al. (2015b)      94.8   95.6   -      -
Chen et al. (2017)       94.3   96.0   -      94.8
Cai et al. (2017)        95.8   97.1   95.6   95.3
baseline                 95.2   97.3   95.1   94.9
+multi                   95.9   97.4   96.2   95.4
Table 3: Comparison with previous state-of-the-art models on all four Bakeoff-2005 datasets. Some of the previous results used an external dictionary or corpus, some are from Cai and Zhao (2016)’s runs on their released implementations without a dictionary, and some were obtained with long words expurgated from the test set.

According to this table, multi-criteria learning boosts performance on every single dataset. Compared to the single-criterion learning models (baseline), the multi-criteria learning model (+multi) outperforms all of them, by up to 1.1 F-score points (on CityU). Our joint model does not trade performance on one dataset for gains on another, but shares knowledge across datasets and improves performance on all of them.
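The per-dataset gains behind this comparison can be checked directly from the baseline and +multi rows of Table 3. The short computation below is only a worked arithmetic aid; the variable names are illustrative.

```python
# F scores copied from the "baseline" and "+multi" rows of Table 3.
baseline = {"PKU": 95.2, "MSR": 97.3, "CityU": 95.1, "AS": 94.9}
multi = {"PKU": 95.9, "MSR": 97.4, "CityU": 96.2, "AS": 95.4}

# Gain of multi-criteria learning over the baseline, rounded to one decimal.
gains = {name: round(multi[name] - baseline[name], 1) for name in baseline}
largest = max(gains, key=gains.get)
```

Every gain is positive, and the smallest one (MSR) is on the largest-scoring baseline, which is consistent with the observation that the joint model helps all four datasets.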

6.3 Results on SIGHAN bakeoff 2008

SIGHAN bakeoff 2008 MOE (2008) provided several heterogeneous corpora. Together with the non-repetitive corpora from SIGHAN bakeoff 2005, they form a large-scale standard dataset for benchmarking multi-criteria CWS. We repeated our experiments on these corpora and compared our results with state-of-the-art scores, as listed in Table 4.

Single-Criterion Learning
Chen et al. (2017)   P   95.70  93.64  93.67  95.19  92.44  94.00  91.86  95.11  93.95
                     R   95.99  94.77  92.93  95.42  93.69  94.15  92.47  95.23  94.33
                     F   95.84  94.20  93.30  95.30  93.06  94.07  92.17  95.17  94.14
Ours                 P   97.17  95.28  94.78  95.14  94.55  94.86  93.43  95.75  95.12
                     R   97.40  94.53  95.66  95.28  93.76  94.16  93.74  95.80  95.04
                     F   97.29  94.90  95.22  95.21  94.15  94.51  93.58  95.78  95.08
Multi-Criteria Learning
Chen et al. (2017)   P   95.95  94.17  94.86  96.02  93.82  95.39  92.46  96.07  94.84
                     R   96.14  95.11  93.78  96.33  94.70  95.70  93.19  96.01  95.12
                     F   96.04  94.64  94.32  96.18  94.26  95.55  92.83  96.04  94.98
Ours                 P   97.38  96.01  95.37  95.69  96.21  95.78  94.26  96.54  95.82
                     R   97.32  94.94  96.19  96.00  95.27  95.43  94.42  96.44  95.64
                     F   97.35  95.47  95.78  95.84  95.73  95.60  94.34  96.49  95.73
Table 4: Results on test sets of 8 standard CWS datasets. Here, P, R, F indicate the precision, recall and F value respectively; the last column reports the average over the datasets. The maximum values are highlighted for each dataset.

In the first block for single-criterion learning, we can see that our implementation is generally more effective than Chen et al. (2017)’s. In the second block for multi-criteria learning, this disparity becomes even more significant. We further verified that every dataset benefits from our joint-learning solution. We also find that more data, even when annotated with different standards or drawn from different domains, brings better performance. Almost every dataset benefits from the larger scale of data. Compared with large datasets, tiny datasets gain more.
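For readers unfamiliar with how P, R and F are computed for word segmentation, the sketch below scores a predicted segmentation against a gold one by comparing word spans. It is a simplified single-sentence illustration, not the official scoring script used for the reported numbers; the function names are illustrative.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_words, pred_words):
    """Precision, recall and F value of predicted words against gold words."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words with identical boundaries
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Scoring a pku-style segmentation against the msr-style gold standard.
gold = ["李乐", "到达", "奔驰公司"]
pred = ["李", "乐", "到达", "奔驰", "公司"]
p, r, f = prf(gold, pred)
```

Only 到达 has matching boundaries in both segmentations, which shows how strongly a mismatch in criteria alone can depress the scores even when both segmentations are linguistically reasonable.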

7 Conclusions and Future Works

7.1 Conclusions

In this paper, we have presented a practical way to train a multi-criteria CWS model. This simple and elegant solution only needs two artificial tokens added at the beginning and end of the input sentence to specify the required target criterion. The rest of the model architecture, hyper-parameters, parameters and feature space are shared across all datasets. Experiments showed that our multi-criteria model can transfer knowledge between differently annotated corpora from heterogeneous domains. Our system is highly end-to-end, capable of learning from large-scale datasets, and outperforms the latest state-of-the-art multi-criteria CWS works.

7.2 Future Works

Our effective and elegant multi-criteria learning solution can be applied to other sequence labeling tasks such as POS tagging and NER. We plan to conduct more experiments on using this technique in various application domains.


References

  • Cai and Zhao (2016) Deng Cai and Hai Zhao. 2016. Neural Word Segmentation Learning for Chinese. ACL.
  • Cai et al. (2017) Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and Accurate Neural Word Segmentation for Chinese. arXiv preprint arXiv:1704.07047.
  • Chao et al. (2015) Jiayuan Chao, Zhenghua Li, Wenliang Chen, and Min Zhang. 2015. Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging. NLPCC.
  • Chen et al. (2015a) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated Recursive Neural Network for Chinese Word Segmentation. ACL.
  • Chen et al. (2015b) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long Short-Term Memory Neural Networks for Chinese Word Segmentation. EMNLP.
  • Chen et al. (2017) Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial Multi-Criteria Learning for Chinese Word Segmentation. arXiv preprint arXiv:1704.07556.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research.
  • Dong et al. (2016) Chuanhai Dong, Jiajun Zhang, Chengqing Zong, Masanori Hattori, and Hui Di. 2016. Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition.

  • Emerson (2005) Thomas Emerson. 2005. The second international chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea, pages 123–133.
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Huang and Zhao (2007) C. Huang and H. Zhao. 2007. Chinese word segmentation: A decade review. Journal of Chinese Information Processing 21(3):8–19.
  • Jiang et al. (2009) Wenbin Jiang, Liang Huang, and Qun Liu. 2009. Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging. In the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference. Association for Computational Linguistics, Morristown, NJ, USA, pages 522–530.
  • Jin et al. (2005) Kiat Low Jin, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to chinese word segmentation. Proceedings of the Fourth Sighan Workshop on Chinese Language Processing.
  • Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Multilingual Neural Machine Translation System - Enabling Zero-Shot Translation. cs.CL.
  • Lafferty et al. (2001) John D. Lafferty, Andrew Mccallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Eighteenth International Conference on Machine Learning. pages 282–289.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. CoRR.
  • Li et al. (2015) Zhenghua Li, Jiayuan Chao, Min Zhang, and Wenliang Chen. 2015. Coupled Sequence Labeling on Heterogeneous Annotations: POS Tagging as a Case Study. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1783–1792.
  • MOE (2008) PRC MOE. 2008. The fourth international chinese language processing bakeoff: Chinese word segmentation, named entity recognition and chinese pos tagging. In Proceedings of the sixth SIGHAN workshop on Chinese language processing.
  • Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
  • Pei et al. (2014) Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-Margin Tensor Neural Network for Chinese Word Segmentation. ACL.
  • Peng et al. (2004) Fuchun Peng, Fangfang Feng, and Andrew Mccallum. 2004. Chinese segmentation and new word detection using conditional random fields pages 562–568.
  • Qiu et al. (2013) Xipeng Qiu, Jiayi Zhao, and Xuanjing Huang. 2013. Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning. EMNLP.
  • Sun and Wan (2012) Weiwei Sun and Xiaojun Wan. 2012. Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations. ACL.
  • Sun et al. (2012) Xu Sun, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection pages 253–262.
  • Sun et al. (2009) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Junichi Tsujii. 2009. A discriminative latent variable chinese segmenter with hybrid word/character information pages 56–64.
  • Tseng et al. (2005) Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for sighan bakeoff 2005. pages 168–171.
  • Xue (2003) Nianwen Xue. 2003. Chinese Word Segmentation as Character Tagging. IJCLCLP.
  • Zhang et al. (2013) Longkai Zhang, Houfeng Wang, Xu Sun, and Mairgup Mansur. 2013. Exploring representations from unlabeled data with co-training for chinese word segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 311–321.
  • Zhang and Clark (2007) Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. Association for Computational Linguistics, Prague, Czech Republic, pages 840–847.
  • Zhao et al. (2010) Hai Zhao, Chang Ning Huang, Mu Li, and Bao Liang Lu. 2010. A unified character-based tagging framework for chinese word segmentation. Acm Transactions on Asian Language Information Processing 9(2):1–32.
  • Zhao et al. (2006) Hai Zhao, Changning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling. PACLIC.
  • Zhao and Kit (2008) Hai Zhao and Chunyu Kit. 2008. Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition. In The Sixth SIGHAN Workshop on Chinese Language Processing. pages 106–111.
  • Zheng et al. (2013) Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep Learning for Chinese Word Segmentation and POS Tagging. EMNLP.