Unlike English, which places a space between words, Chinese has no explicit word delimiters. Chinese Word Segmentation (CWS) is therefore a preliminary pre-processing step for Chinese language processing tasks. Following Xue (2003), most approaches treat this task as sequence tagging and solve it with supervised learning models such as Maximum Entropy (ME) Jin et al. (2005) and Conditional Random Fields (CRFs) Lafferty et al. (2001); Peng et al. (2004). These early models require heavy handcrafted feature engineering within a fixed-size window.
Novel algorithms and deep models are not omnipotent; a large-scale corpus is equally important for an accurate CWS system. Although many segmentation corpora exist, they are annotated under different criteria, shown in Table 1, which makes it hard to fully exploit them.
Recently, Chen et al. (2017) designed an adversarial multi-criteria learning framework for CWS. However, their models involve several complex architectures, and their results do not reach the state of the art.
In this paper, we propose a smooth joint multi-criteria learning solution for CWS: we add two artificial tokens at the beginning and end of each input sentence to specify the required target criterion. We have conducted various experiments on segmentation corpora from SIGHAN Bakeoff 2005 and 2008. Our models improve performance by transferring knowledge across heterogeneous corpora. The final scores surpass previous multi-criteria learning results, and on some datasets even surpass previous preprocessing-heavy state-of-the-art single-criterion results.
The contributions of this paper can be summarized as follows:
We propose a simple yet elegant solution for multi-criteria learning on multiple heterogeneous segmentation corpora;
On Bakeoff 2005, our scores surpass the state of the art on 2 out of 4 datasets;
Extensive experiments on multiple datasets show that our solution significantly improves performance.
2 Related Work
In this section, we review previous work from two directions: Chinese Word Segmentation and multi-task learning.
2.1 Chinese Word Segmentation
Chinese Word Segmentation has been a well-studied problem for decades Huang and Zhao (2007). After pioneer Xue (2003) transformed CWS into a character-based tagging problem, Peng et al. (2004) adopted CRF as the sequence labeling model and showed its effectiveness. Following these pioneers, later sequence labeling based works Tseng et al. (2005); Zhao et al. (2006, 2010); Sun et al. (2012) were proposed. Recent neural models Zheng et al. (2013); Pei et al. (2014); Chen et al. (2015b); Dong et al. (2016); Chen et al. (2017) also followed this sequence labeling fashion.
2.2 Multi-Task Learning
Compared to single-task learning, multi-task learning is relatively harder due to the divergence between tasks and the heterogeneity of annotated datasets. Recent works have started to explore joint learning on Chinese word segmentation or part-of-speech tagging. Jiang et al. (2009) stacked two classifiers together, where the latter used the former's predictions as additional features. Sun and Wan (2012) proposed a structure-based stacking model in which one tagger is designed to refine another tagger's predictions. These early models lacked a unified loss function and suffered from error propagation.
Qiu et al. (2013) proposed to learn a mapping function between heterogeneous corpora. Li et al. (2015); Chao et al. (2015) proposed and utilized coupled sequence labeling model which can directly learn and infer two heterogeneous annotations simultaneously. These works mainly focused on exploiting relationships between different tagging sets, but not shared features.
Chen et al. (2017) designed a complex framework involving shared layers with Generative Adversarial Nets (GANs) to extract criteria-invariant features, plus dataset-related private layers to detect criteria-related features. This work did not show a clear advantage over previous state-of-the-art single-criterion learning scores.
Our solution is greatly motivated by Google's Multilingual Neural Machine Translation System, for which Johnson et al. (2016) proposed an extremely simple solution without any complex architectures or private layers: they added an artificial token marking each parallel corpus and trained all of them jointly. This inspired our design.
3 Neural Architectures for Chinese Word Segmentation
A prevailing approach to Chinese Word Segmentation is to cast it as a character-based sequence tagging problem Xue (2003); Sun et al. (2012). One commonly used tagging set is {B, M, E, S}, representing the beginning, middle, or end of a word, or a single character forming a word. Given a sequence of n characters X = x_1 x_2 ... x_n, sequence-tagging-based CWS finds the most probable tag sequence Y* = argmax_Y p(Y | X).
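For concreteness, the {B, M, E, S} scheme can be illustrated with a short helper (a sketch for illustration only; `to_bmes` is our name, not from the paper):

```python
def to_bmes(words):
    """Convert a gold segmentation (list of words) to character-level BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                        # single-character word
        else:
            tags.append("B")                        # word-initial character
            tags.extend("M" * (len(word) - 2))      # word-internal characters
            tags.append("E")                        # word-final character
    return tags

# "李乐 / 到达 / 奔驰公司" segmented into three words:
print(to_bmes(["李乐", "到达", "奔驰公司"]))
# → ['B', 'E', 'B', 'E', 'B', 'M', 'M', 'E']
```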
We model the tags jointly using a conditional random field, mostly following the architecture proposed by Lample et al. (2016): stacked Long Short-Term Memory networks (LSTMs) Hochreiter and Schmidhuber (1997) with a CRF layer on top of them.
We introduce our neural framework bottom-up. The bottom layer is a character Bi-LSTM (bidirectional Long Short-Term Memory network) Graves and Schmidhuber (2005) that takes character embeddings as input and outputs each character's contextual feature representation h_i, the concatenation of the forward and backward hidden states.
After a contextual representation is generated, it is decoded to make the final segmentation decision. We employ a Conditional Random Fields (CRF) Lafferty et al. (2001) layer as the inference layer.
First of all, a linear scoring function is used to assign a local score s_i to each tag on the i-th character:

s_i = W a_i + b,

where a_i is the concatenation of the Bi-LSTM hidden state h_i and the bigram feature embedding, and W and b are trainable parameters.
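This scoring step can be sketched in NumPy (toy sizes; the random vector stands in for the real Bi-LSTM output, and the names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_tags = 8, 4   # toy sizes; 4 tags for {B, M, E, S}

# a_i: stand-in for the concatenated Bi-LSTM state and bigram embedding
a_i = rng.standard_normal(hidden_dim)

# Trainable parameters W and b of the linear scoring layer
W = rng.standard_normal((num_tags, hidden_dim))
b = rng.standard_normal(num_tags)

# Local (emission) score for every tag on the i-th character
s_i = W @ a_i + b
print(s_i.shape)  # → (4,)
```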
Then, for a sequence of predictions Y = y_1 y_2 ... y_n, a first-order linear-chain CRF employs a Markov chain to define its global score as:

score(X, Y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} s_i(y_i),

where A is a transition matrix such that A_{p, q} represents the score of a transition from tag p to tag q. y_0 and y_{n+1} are the start and end tags of a sentence, added to the tagset additionally; A is therefore a square matrix of size k + 2, where k is the number of distinct tags.
Finally, this global score is normalized to a probability via a softmax over all possible tag sequences:

p(Y | X) = exp(score(X, Y)) / Σ_{Y'} exp(score(X, Y')).   (1)
In the decoding phase, first-order linear-chain CRFs only model bigram interactions between output tags, so the maximum a posteriori sequence in Eq. (1) can be computed using dynamic programming.
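A minimal sketch of this dynamic program (the classic Viterbi algorithm) in NumPy, omitting the start/end transition terms for brevity:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most probable tag sequence under a first-order linear-chain CRF.

    emissions:   (n, k) local scores for each of n characters and k tags
    transitions: (k, k) transitions[a, b] = score of moving from tag a to tag b
    """
    n, k = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag
    back = np.zeros((n, k), dtype=int)     # backpointers
    for i in range(1, n):
        # candidate[a, b] = best path ending at tag a, then transition a -> b
        candidate = score[:, None] + transitions + emissions[i][None, :]
        back[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow backpointers from the best final tag
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i][best[-1]]))
    return best[::-1]

# With zero transitions, decoding just picks the best local tag per position:
em = np.array([[2., 0.], [0., 2.], [2., 0.]])
tr = np.zeros((2, 2))
print(viterbi(em, tr))  # → [0, 1, 0]
```

Strongly negative off-diagonal transitions would instead force the decoder to stay on one tag, illustrating how the transition matrix shapes the output sequence.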
4 Elegant Solution for Multi-Criteria Chinese Word Segmentation
For closely related multi-task learning, such as a multilingual translation system, Johnson et al. (2016) proposed a simple and practical solution: add an artificial token at the beginning of the input sentence to specify the required target language, with no need to design complex private encoder-decoder structures.
We follow their spirit and add two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. For instance, sentences from SIGHAN Bakeoff 2005 take the following form:
| Corpora | Li Le reaches Benz Inc |
| PKU | <pku> 李 乐 到达 奔驰 公司 </pku> |
| MSR | <msr> 李乐 到达 奔驰公司 </msr> |
| AS | <as> 李樂 到達 賓士 公司 </as> |
| CityU | <cityu> 李樂 到達 平治 公司 </cityu> |
These artificial tokens specify which dataset a sentence comes from. They are treated as normal tokens, or more specifically, as normal characters. With their help, instances from different datasets can be seamlessly put together and jointly trained, without extra effort. The two special tokens are designed to carry criterion-related information across long dependencies, affecting the contextual representation of every character, and finally to produce segmentation decisions matching the target criterion. At test time, these tokens specify the required segmentation criterion; they are not taken into account when computing performance scores.
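The token scheme can be sketched as follows (`add_criterion_tokens` is a hypothetical helper name; the token strings follow the example table above):

```python
# Wrap a character sequence with dataset-specifying artificial tokens,
# which the model treats like any other character.
def add_criterion_tokens(chars, criterion):
    return [f"<{criterion}>"] + list(chars) + [f"</{criterion}>"]

print(add_criterion_tokens("李乐到达奔驰公司", "pku"))
# → ['<pku>', '李', '乐', '到', '达', '奔', '驰', '公', '司', '</pku>']
```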
The training procedure maximizes the log-probability of the gold tag sequence:

log p(Y | X) = score(X, Y) − log Σ_{Y' ∈ Y_X} exp(score(X, Y')),

where Y_X represents all possible tag sequences for a sentence X.
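This objective can be sketched numerically: the normalizer over all tag sequences is computed with the forward algorithm, and the loss is its difference from the gold path score (a toy illustration under our own names, not the paper's DyNet implementation):

```python
import numpy as np

def log_partition(emissions, transitions):
    """log Z: log-sum-exp of the global score over all tag sequences
    (forward algorithm; start/end transitions omitted for brevity)."""
    alpha = emissions[0].copy()                      # (k,)
    for i in range(1, len(emissions)):
        # alpha[b] = logsumexp_a(alpha[a] + transitions[a, b]) + emissions[i, b]
        scores = alpha[:, None] + transitions        # (k, k)
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + emissions[i]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def path_score(emissions, transitions, tags):
    """Global score of one particular tag sequence."""
    s = emissions[0, tags[0]]
    for i in range(1, len(tags)):
        s += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return s

# Training loss for one sentence: negative log-probability of the gold tags
em = np.log(np.full((3, 2), 0.5))   # toy emissions, uniform over 2 tags
tr = np.zeros((2, 2))
gold = [0, 1, 0]
nll = log_partition(em, tr) - path_score(em, tr, gold)
print(round(nll, 4))  # → 2.0794 (i.e. 3 * log 2, uniform over 8 sequences)
```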
We conducted various experiments to answer the following questions:
Is our multi-criteria solution capable of learning heterogeneous datasets?
Can our solution be applied to large-scale corpus groups consisting of tiny and informal texts?
Does more data lead to better performance?
Our implementation is based on DyNet Neubig et al. (2017), a dynamic neural network framework for deep learning. Additionally, we implemented the CRF layer in Python and integrated the official scoring script to verify our scores.
To explore the first question, we experimented on the 4 prevalent CWS datasets from SIGHAN 2005 Emerson (2005), as these datasets are commonly used in previous state-of-the-art work. To address questions 2 and 3, we applied our solution to the SIGHAN 2008 datasets MOE (2008), which allow us to compare our approach with other state-of-the-art multi-criteria learning works at a larger scale.
All datasets are preprocessed by replacing continuous runs of English characters and digits with unique tokens. For the training and development sets, lines are split into shorter sentences or clauses at punctuation marks, to enable faster batching.
Specifically, the Traditional Chinese corpora CityU, AS and CKIP are converted to Simplified Chinese using the popular Chinese NLP tool HanLP (https://github.com/hankcs/HanLP).
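The preprocessing described above might look roughly like this (the placeholder tokens and the punctuation set are our assumptions, not the paper's exact choices):

```python
import re

def preprocess(line):
    """Replace English/digit runs with unique tokens, then split into clauses."""
    line = re.sub(r"[A-Za-z]+", "<ENG>", line)     # collapse English spans
    line = re.sub(r"[0-9０-９]+", "<NUM>", line)    # collapse digit spans
    # Split on common Chinese punctuation for faster batching
    return [c for c in re.split(r"[。！？；，]", line) if c]

print(preprocess("他在2017年加入Google，发表了3篇论文。"))
# → ['他在<NUM>年加入<ENG>', '发表了<NUM>篇论文']
```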
6.2 Results on SIGHAN bakeoff 2005
Our baseline model is a Bi-LSTM-CRF trained on each dataset separately. We then improved it with multi-criteria learning. The final F1 scores are shown in Table 3.
| Tseng et al. (2005) | 95.0 | 96.4 | - | - |
| Zhang and Clark (2007) | 95.0 | 96.4 | - | - |
| Zhao and Kit (2008) | 95.4 | 97.6 | 96.1 | 95.7 |
| Sun et al. (2009) | 95.2 | 97.3 | - | - |
| Sun et al. (2012) | 95.4 | 97.4 | - | - |
| Zhang et al. (2013) | 96.1 | 97.4 | - | - |
| Chen et al. (2015a) | 94.5 | 95.4 | - | - |
| Chen et al. (2015b) | 94.8 | 95.6 | - | - |
| Chen et al. (2017) | 94.3 | 96.0 | - | 94.8 |
| Cai et al. (2017) | 95.8 | 97.1 | 95.6 | 95.3 |
According to this table, multi-criteria learning boosts performance on every single dataset. Compared to the single-criterion (baseline) models, the multi-criteria model (+multi) outperforms all of them. Our joint model does not rob performance from one dataset to pay another; it shares knowledge across datasets and improves performance on all of them.
6.3 Results on SIGHAN bakeoff 2008
SIGHAN bakeoff 2008 MOE (2008) provided several heterogeneous corpora. Together with the non-repetitive corpora from SIGHAN bakeoff 2005, they form a large-scale standard benchmark for multi-criteria CWS. We repeated our experiments on these corpora and compared our results with state-of-the-art scores, as listed in Table 4.
| Chen et al. (2017) | P | 95.70 | 93.64 | 93.67 | 95.19 | 92.44 | 94.00 | 91.86 | 95.11 | 93.95 |
| Chen et al. (2017) | P | 95.95 | 94.17 | 94.86 | 96.02 | 93.82 | 95.39 | 92.46 | 96.07 | 94.84 |
In the first block, for single-criterion learning, our implementation is generally more effective than that of Chen et al. (2017). In the second block, for multi-criteria learning, this disparity becomes even more significant, and we further verified that every dataset benefits from our joint-learning solution. We also find that more data, even when annotated under different standards or drawn from different domains, brings better performance: almost every dataset benefits from the larger scale of data, and small datasets gain more than large ones.
7 Conclusions and Future Works
In this paper, we have presented a practical way to train a multi-criteria CWS model. This simple and elegant solution only requires adding two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. The rest of the model architecture, hyper-parameters, parameters, and feature space are shared across all datasets. Experiments showed that our multi-criteria model can transfer knowledge between differently annotated corpora from heterogeneous domains. Our system is fully end-to-end, capable of learning from large-scale datasets, and outperforms the latest state-of-the-art multi-criteria CWS works.
7.2 Future Works
Our effective and elegant multi-criteria learning solution can also be applied to other sequence labeling tasks such as POS tagging and NER. We plan to conduct more experiments applying this technique in various application domains.
- Cai and Zhao (2016) Deng Cai and Hai Zhao. 2016. Neural Word Segmentation Learning for Chinese. ACL .
- Cai et al. (2017) Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and Accurate Neural Word Segmentation for Chinese. arXiv:1704.07047.
- Chao et al. (2015) Jiayuan Chao, Zhenghua Li, Wenliang Chen, and Min Zhang. 2015. Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging. NLPCC .
- Chen et al. (2015a) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated Recursive Neural Network for Chinese Word Segmentation. ACL .
- Chen et al. (2015b) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long Short-Term Memory Neural Networks for Chinese Word Segmentation. EMNLP .
- Chen et al. (2017) Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial Multi-Criteria Learning for Chinese Word Segmentation. arXiv:1704.07556.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research.
- Dong et al. (2016) Chuanhai Dong, Jiajun Zhang, Chengqing Zong, Masanori Hattori, and Hui Di. 2016. Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition. NLPCC/ICCPOL.
- Emerson (2005) Thomas Emerson. 2005. The second international chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing. Jeju Island, Korea, pages 123–133.
- Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Huang and Zhao (2007) C. Huang and H. Zhao. 2007. Chinese word segmentation: A decade review. Journal of Chinese Information Processing 21(3):8–19.
- Jiang et al. (2009) Wenbin Jiang, Liang Huang, and Qun Liu. 2009. Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging. In the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference. Association for Computational Linguistics, Morristown, NJ, USA, pages 522–530.
- Jin et al. (2005) Kiat Low Jin, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to chinese word segmentation. Proceedings of the Fourth Sighan Workshop on Chinese Language Processing .
- Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Multilingual Neural Machine Translation System - Enabling Zero-Shot Translation. cs.CL.
- Lafferty et al. (2001) John D. Lafferty, Andrew Mccallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Eighteenth International Conference on Machine Learning. pages 282–289.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. CoRR .
- Li et al. (2015) Zhenghua Li, Jiayuan Chao, Min Zhang, and Wenliang Chen. 2015. Coupled Sequence Labeling on Heterogeneous Annotations: POS Tagging as a Case Study. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1783–1792.
- MOE (2008) PRC MOE. 2008. The fourth international chinese language processing bakeoff: Chinese word segmentation, named entity recognition and chinese pos tagging. In Proceedings of the sixth SIGHAN workshop on Chinese language processing.
- Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980 .
- Pei et al. (2014) Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-Margin Tensor Neural Network for Chinese Word Segmentation. ACL.
- Peng et al. (2004) Fuchun Peng, Fangfang Feng, and Andrew Mccallum. 2004. Chinese segmentation and new word detection using conditional random fields pages 562–568.
- Qiu et al. (2013) Xipeng Qiu, Jiayi Zhao, and Xuanjing Huang. 2013. Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning. EMNLP .
- Sun and Wan (2012) Weiwei Sun and Xiaojun Wan. 2012. Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations. ACL.
- Sun et al. (2012) Xu Sun, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection pages 253–262.
- Sun et al. (2009) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Junichi Tsujii. 2009. A discriminative latent variable chinese segmenter with hybrid word/character information pages 56–64.
- Tseng et al. (2005) Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for sighan bakeoff 2005. pages 168–171.
- Xue (2003) Nianwen Xue. 2003. Chinese Word Segmentation as Character Tagging. IJCLCLP .
- Zhang et al. (2013) Longkai Zhang, Houfeng Wang, Xu Sun, and Mairgup Mansur. 2013. Exploring representations from unlabeled data with co-training for chinese word segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, page 311–321.
- Zhang and Clark (2007) Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. Association for Computational Linguistics, Prague, Czech Republic, pages 840–847.
- Zhao et al. (2010) Hai Zhao, Chang Ning Huang, Mu Li, and Bao Liang Lu. 2010. A unified character-based tagging framework for chinese word segmentation. Acm Transactions on Asian Language Information Processing 9(2):1–32.
- Zhao et al. (2006) Hai Zhao, Changning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling. PACLIC .
- Zhao and Kit (2008) Hai Zhao and Chunyu Kit. 2008. Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition. In The Sixth SIGHAN Workshop on Chinese Language Processing. page 106–111.
- Zheng et al. (2013) Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep Learning for Chinese Word Segmentation and POS Tagging. EMNLP .