Fast Neural Chinese Word Segmentation for Long Sentences

11/06/2018 ∙ by Sufeng Duan, et al. ∙ Shanghai Jiao Tong University

Rapidly developed neural models have achieved performance in Chinese word segmentation (CWS) that is competitive with their traditional counterparts. However, most of these methods are computationally inefficient, especially for long sentences, because of increasing model complexity and slower decoders. This paper presents a simple neural segmenter that directly labels whether a gap exists between adjacent characters to alleviate this drawback. Our segmenter is fully end-to-end and performs segmentation very fast. We also show how performance differs across tag sets. The experiments show that our segmenter provides performance comparable with the state of the art.




Introduction

Word segmentation is the process of dividing text into words. Unlike alphabetic languages such as English, whose words are separated by spaces, most Asian languages such as Chinese and Japanese have no explicit word boundaries, so word segmentation is a fundamental step for most of their language processing tasks.

Since [Xue2003], most methods have treated supervised Chinese word segmentation (CWS) as a sequence labeling task with character position tags. CWS models such as Maximum Entropy (ME) [Xue2003] and Conditional Random Fields (CRF) [Lafferty, McCallum, and Pereira2001, Peng, Feng, and McCallum2004] are applied to this labeling formalization, but their performance depends heavily on handcrafted features.

With the development of neural networks and machine learning, a number of researchers have used neural networks to improve the performance of CWS and to minimize the effort spent on feature engineering.

[Zheng, Chen, and Xu2013] adapted sliding-window-based sequence labeling to take character embeddings as input instead of characters. [Pei, Ge, and Chang2014] introduced a model exploiting tag embeddings and a tensor-based transformation. [Chen et al.2015a] introduced a gated recursive neural network (GRNN). [Chen et al.2015b] proposed a model using a long short-term memory (LSTM) neural network to capture long-distance context. [Xu and Sun2016] integrated GRNN and LSTM for deeper feature extraction. [Liu et al.2016] used a semi-CRF with neural networks. [Cai and Zhao2017] proposed a framework that represents word candidates from their member characters with an LSTM for further performance improvement.

It is common for a neural segmentation model to take days to train. To achieve better performance, a neural model can become very complex, with many layers and functions, which makes training slow. Neural segmenters also run much slower than traditional ones [Cai et al.2017]. Except for models with a greedy decoder, most existing models fall into two types of decoding time complexity. Sequence labeling models, whether traditional or neural, that rely on a first-order Markov mechanism all share the same time complexity, $O(Mnk^2)$, where $n$ is the number of characters in a sentence, $k$ is the number of labeling tags ($k=2$ for the B, I and $k=4$ for the B, M, E, S tagging schemes) and $M$ is the model complexity for computing the local probability distribution. Models with a beam search decoder have a similar time complexity, $O(Mnb^2)$, where $b$ is the beam width (commonly set to 10 in most known models). Since it is usually neural models that need a beam search decoder and that have much higher model complexity than traditional ones (i.e., $M$ is much larger), the neural segmenters developed in recent years are actually much less efficient than the early traditional ones.
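The quadratic dependence on the tag set size in a first-order decoder can be made concrete with a small Viterbi sketch that counts how many transition scores it evaluates. The function name and the toy scores are illustrative assumptions and are not taken from any of the cited systems; the count covers only the decoding loop and leaves out the per-position model cost.

```python
def viterbi(emissions, transitions):
    """First-order Viterbi decoding over plain Python lists.

    emissions:   n lists of k local scores (one list per character)
    transitions: k x k matrix, transitions[prev][cur]
    Returns (best_tag_path, number_of_transition_evaluations).
    """
    n, k = len(emissions), len(transitions)
    score = list(emissions[0])
    back = []
    evaluations = 0
    for t in range(1, n):
        new_score, pointers = [], []
        for cur in range(k):
            best_prev, best_val = 0, float("-inf")
            for prev in range(k):
                evaluations += 1  # one transition score per (prev, cur) pair
                val = score[prev] + transitions[prev][cur]
                if val > best_val:
                    best_prev, best_val = prev, val
            new_score.append(best_val + emissions[t][cur])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(k), key=lambda j: score[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, evaluations


# Toy check: 5 characters, 4 tags, all scores zero.
zero_emissions = [[0.0] * 4 for _ in range(5)]
zero_transitions = [[0.0] * 4 for _ in range(4)]
_, count = viterbi(zero_emissions, zero_transitions)
print(count)  # (5 - 1) * 4 * 4 transition evaluations
```

Doubling the sentence length roughly doubles the count, while moving from 2 to 4 tags quadruples it, which is why both sentence length and tag set size matter for decoder-based segmenters.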

According to the above analysis, the longer the sentence is, the slower the segmentation is. Moreover, it is hard for a model to capture long-distance dependencies. In fact, most existing models cannot handle long sentences as well as short ones, which makes long-sentence segmentation both slow and inaccurate.

In this paper, we propose a fast neural segmenter for CWS. The segmenter is based on a recently developed attention mechanism, biaffine attention. Attention mechanisms in general lead a model to focus on the parts of the input that matter for a specific task, and have been widely used since [Bahdanau, Cho, and Bengio2015] first introduced them into neural machine translation. Unlike other labeling-based CWS models, which predict a label for each character, our segmenter models the sentence and directly predicts whether a segmentation gap exists between two adjacent characters. Our model consists of an encoder and a gap scorer, which scores gap existence, and can be trained and tested end-to-end. The model has no decoder, so it predicts quickly by skipping the decoding process. As a result, our model trains and predicts very fast, and performs better on long sentences than on short ones.
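As a minimal sketch of what gap labeling means, the snippet below converts a segmented sentence into one binary label per gap and back. The convention that 1 marks a gap (word boundary) and 0 marks no gap is an assumption made for illustration, as are the function names.

```python
def words_to_gaps(words):
    """One label per gap between adjacent characters: 1 = boundary, 0 = none."""
    gaps = []
    for index, word in enumerate(words):
        gaps.extend([0] * (len(word) - 1))  # gaps inside a word
        if index < len(words) - 1:
            gaps.append(1)                  # gap between two words
    return gaps


def gaps_to_words(chars, gaps):
    """Rebuild the word list from a character string and its gap labels."""
    if not chars:
        return []
    words, current = [], chars[0]
    for ch, gap in zip(chars[1:], gaps):
        if gap == 1:
            words.append(current)
            current = ch
        else:
            current += ch
    words.append(current)
    return words


segmented = ["我们", "喜欢", "自然语言"]
labels = words_to_gaps(segmented)
print(labels)  # [0, 1, 0, 1, 0, 0, 0]
print(gaps_to_words("".join(segmented), labels) == segmented)  # True
```

Because every sequence of 0/1 gap labels corresponds to exactly one segmentation, no decoding step is needed to repair the output under this labeling.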

Figure 1: An overview of our model.

Based on the idea of the attention mechanism, we also develop several models for different tag sets, such as BE and BEMS. In the experiments we compare these models and find performance differences between our segmenter and traditional tag-based segmenters.

The remainder of the paper is organized as follows: Section 2 introduces related work, Section 3 describes our fast neural segmenter, Section 4 presents our experiments, and Section 5 gives an analysis of the experimental results.

Related Work

In this section, we review previous work on Chinese word segmentation and biaffine attention.

Chinese word segmentation is a fundamental step for most Chinese natural language processing tasks and has been well studied for decades. [Xue2003] was the first to formalize CWS as a character-based tagging problem. [Peng, Feng, and McCallum2004] proposed a CRF-based model for CWS. Following these achievements, a number of Chinese segmenters [Tseng et al.2005, Zhao et al.2006, Zhao and Kit2008, Zhao et al.2010, Sun, Wang, and Li2012, Zhang et al.2013] were proposed and achieved better performance. The method of transforming CWS into a sequence labeling problem has also been used in neural models [Zheng, Chen, and Xu2013, Pei, Ge, and Chang2014, Chen et al.2015a, Chen et al.2015b]. [Huang and Zhao2006, Sun2010, Wang, Rob, and D2014] studied CWS models with both character-based and word-based segmenters. [Cai and Zhao2017] proposed a segmentation model that replaces the fixed-size sliding window with a feature window covering the complete input and segmentation history. Following [Cai and Zhao2017], [Cai et al.2017] presented a greedy neural word segmenter with balanced word and character embedding inputs that performs segmentation much faster.

Traditionally, labeling-based CWS finds gaps by labeling the characters in a sentence. Based on this idea, [Zhao et al.2006] compared 2-tag, 4-tag, 5-tag and 6-tag models. CWS can also predict the gaps directly: [Huang et al.2007] introduced a method that labels the gap itself instead of the character.

Our work uses the biaffine attention mechanism. [Bahdanau, Cho, and Bengio2015] introduced the traditional attention mechanism into neural machine translation (NMT). [Dozat and Manning2017] proposed using biaffine attention instead of bilinear or MLP-based attention. In this paper, the biaffine attention component scores the relationship between adjacent characters.


As shown in Figure 1, our model is divided into two modules: (1) a bidirectional LSTM (BiLSTM) encoder that takes the character embeddings of the given sentence as input and generates a dense vector for each character, and (2) a biaffine attentional scorer that takes the hidden vectors of each given character pair as input and predicts a label score vector.

Bidirectional LSTM Encoder

Character Representation

Given a sentence $c_1, c_2, \dots, c_n$, following [Bengio et al.2006] we use a lookup table to transform this sequence of characters into a sequence of character embeddings $e_1, e_2, \dots, e_n$.


BiLSTM is adopted as our sentence encoder. By combining two distinct LSTMs, the BiLSTM processes an input sequence in both the forward and backward directions, then combines the outputs of the two LSTMs as the representation of a character.

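Given a sequence of character embeddings $e_1, e_2, \dots, e_n$ as input, the $i$-th element is encoded as follows:

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(e_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(e_i, \overleftarrow{h}_{i+1}), \qquad h_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ respectively denote the forward and backward LSTM transformations, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the hidden state vectors of the forward and backward LSTMs respectively, and $\oplus$ denotes the concatenation operation.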

Biaffine Attentional Scorer

To label gap existence between two adjacent characters, a segmentation scorer is placed on top of the BiLSTM encoder. The scorer is implemented as the biaffine attention introduced by [Dozat and Manning2017] because, compared with counterparts such as bilinear or MLP-based attention, biaffine attention measures the relationship between two elementary units more effectively. Note that biaffine attention is a natural extension of the bilinear attention [Luong, Pham, and Manning2015] that is widely used in neural machine translation (NMT).

Affine Transformation

For a task like CWS, the scorer has to determine whether a gap exists between two adjacent characters. To this end, we perform two distinct affine transformations on the hidden state $h_i$, mapping it to vectors with smaller dimensionality:

$$r_i^{(f)} = W^{(f)} h_i + b^{(f)}, \qquad r_i^{(r)} = W^{(r)} h_i + b^{(r)}$$

where $r_i^{(f)}$ and $r_i^{(r)}$ are the hidden state representations of the character when it acts as the front and the rear character of a gap, respectively.

Applying these transformations to the encoder output lets the scorer benefit from deeper feature extraction. Because the features of both adjacent characters are learned by the same LSTMs, the scorer receives features composed from both recurrent states, with reduced dimensionality. The model also maps the front character and the rear character into two distinct spaces, which lets the scorer treat the two roles differently while keeping the scorer simple.

Biaffine Scoring

Traditionally, a bilinear transformation is a good tool for judging the relationship between two vectors. Given a target recurrent output vector $h_t$ and a source recurrent output vector $h_s$, a bilinear transformation calculates a score for their alignment:

$$s = h_t^\top W h_s$$

where $W$ is a learned weight matrix.

In a traditional classification task, the distribution of classes is often uneven, so the output layer of the model normally includes a fixed bias term designed to capture the prior probability $P(y_i = c)$ of each class, with the rest of the model focusing on learning the likelihood of each class given the data, $P(y_i = c \mid x_i)$. To alleviate the burden on the fixed bias term and capture the prior probability dynamically, bias terms are introduced into the bilinear attention, resulting in a biaffine transformation.

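In CWS, the distribution of gap existence is similarly uneven, so directly applying the primitive form of bilinear attention would fail to capture the prior probability of each class. The biaffine attention in our model is therefore helpful for gap existence prediction:

$$s_i = \left(r_i^{(f)}\right)^\top W \, r_{i+1}^{(r)} + U \left(r_i^{(f)} \oplus r_{i+1}^{(r)}\right) + b$$

where $r_i^{(f)}$ and $r_{i+1}^{(r)}$ are the transformed representations of the front and rear characters of the $i$-th gap, and $W$, $U$ and $b$ are parameters that are updated during the learning process.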

Given a sentence of length $n$, for every pair of adjacent characters the scorer outputs a score vector $s_i$. Our model then selects as its output the label with the highest score in each score vector: $\hat{y}_i = \arg\max_j s_{i,j}$, where $s_{i,j}$ denotes the score of the $j$-th label for the $i$-th gap.
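The scoring pipeline described in this section (two affine maps, a biaffine score per gap, then an argmax) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the parameter names (`W_f`, `W_r`, `W`, `U`, `b`), the exact form of the biaffine term, the dimensions, and the random untrained weights are all hypothetical, and the hidden vectors stand in for BiLSTM outputs.

```python
import random


def affine(weight, bias, x):
    """Return weight @ x + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def biaffine_scores(front, rear, W, U, b):
    """One score per label: front^T W[l] rear + U[l] . (front ++ rear) + b[l]."""
    concat = front + rear
    scores = []
    for W_l, U_l, b_l in zip(W, U, b):
        bilinear = sum(front[p] * W_l[p][q] * rear[q]
                       for p in range(len(front)) for q in range(len(rear)))
        linear = sum(u * v for u, v in zip(U_l, concat))
        scores.append(bilinear + linear + b_l)
    return scores


def label_gaps(hidden, params):
    """Pick the highest-scoring label for each of the len(hidden) - 1 gaps."""
    labels = []
    for i in range(len(hidden) - 1):
        front = affine(params["W_f"], params["b_f"], hidden[i])
        rear = affine(params["W_r"], params["b_r"], hidden[i + 1])
        scores = biaffine_scores(front, rear, params["W"], params["U"], params["b"])
        labels.append(max(range(len(scores)), key=lambda j: scores[j]))
    return labels


def random_params(hidden_size, reduced_size, num_labels, seed=0):
    """Untrained parameters, for shape checking only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_f": mat(reduced_size, hidden_size), "b_f": [0.0] * reduced_size,
        "W_r": mat(reduced_size, hidden_size), "b_r": [0.0] * reduced_size,
        "W": [mat(reduced_size, reduced_size) for _ in range(num_labels)],
        "U": mat(num_labels, 2 * reduced_size),
        "b": [0.0] * num_labels,
    }


rng = random.Random(1)
hidden = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(5)]  # 5 characters
predicted = label_gaps(hidden, random_params(6, 3, 2))
print(len(predicted))  # 4 gaps for 5 characters
```

Each gap is scored independently from the two neighbouring hidden vectors, so the whole sentence can be labeled in a single pass whose cost is linear in the number of characters.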

Tag Sets

Tag set Tags Words in tagging
2-tag B,E B,BE,BEE,…
4-tag B,M,E,S S,BE,BME,BMME,…
Table 1: Definitions of the 2-tag (BE) and 4-tag (BEMS) tag sets

Traditionally, most previous segmenters label every character and use a decoder to find the gaps in a sentence. How to select an effective tag set for a segmentation task is therefore an interesting and important problem. For a basic segmenter, there are two major kinds of schemes, known as BE (2-tag) and BEMS (4-tag). Details are given in Table 1.

Unlike other segmenters, our segmenter directly predicts the gaps in a sentence, which removes the decoding step. The idea is that a gap is a special relationship between a pair of adjacent characters. Biaffine attention, introduced by [Dozat and Manning2017], is a tool for scoring every pair of elements in a sequence. Our segmenter uses a classifier based on the biaffine transformation to predict gap existence directly, which is simple and needs no decoder. Inspired by the traditional method, we also find a way to combine character labeling with gap labeling. In this work, we present three methods that are all based on gap labeling and use different label sets to incorporate character labeling.

To further explore different ways of labeling gap existence, we combine the BE and BEMS labeling schemes into our model. In the combined model, the label of a gap is defined by the labels of its two adjacent characters, so the label set for the BE tag set consists of pairs of BE character tags and, similarly, the label set for the BEMS tag set consists of pairs of BEMS character tags. As with traditional BE and BEMS segmenters, the predicted labels may conflict with each other, so the segmenter needs a decoder to guarantee that the labeling of a sentence is valid.

For the BE and BEMS models, each gap label has only two candidate successors (as Tables 2 and 3 show), which limits the time complexity of the decoder. As a result, the time complexity of the decoder depends only on the beam size and the length of the sentence.
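The claim that each gap label admits only two successors can be checked with a short enumeration. The sketch below derives the gap labels as pairs of adjacent character tags from the word shapes in Table 1; the resulting label inventories are an assumption inferred from that table, since the contents of Tables 2 and 3 are not reproduced here, and all names are illustrative.

```python
# Which character tag may follow which, derived from the word shapes in Table 1
# (2-tag words: B, BE, BEE, ...; 4-tag words: S, BE, BME, BMME, ...).
BE_NEXT = {"B": "BE", "E": "BE"}
BEMS_NEXT = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}


def gap_label_set(next_tags):
    """All valid gap labels, each written as front tag + rear tag."""
    return sorted(a + b for a, followers in next_tags.items() for b in followers)


def successors(next_tags):
    """For each gap label, the labels the next gap may take without conflict.

    Two consecutive gaps share a character, so the rear tag of one label must
    equal the front tag of the following label.
    """
    labels = gap_label_set(next_tags)
    return {lab: [nxt for nxt in labels if nxt[0] == lab[1]] for lab in labels}


print(gap_label_set(BE_NEXT))    # ['BB', 'BE', 'EB', 'EE']
print(gap_label_set(BEMS_NEXT))  # ['BE', 'BM', 'EB', 'ES', 'ME', 'MM', 'SB', 'SS']
print(successors(BEMS_NEXT)["BM"])  # ['ME', 'MM']
```

Under these assumptions a sequence such as BE followed by BB is a conflict, because the shared character would have to be tagged both E and B. A decoder only has to choose between two continuations at every step, whichever of the two tag sets is used.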

Figure 2: The steps of a traditional segmenter.
Figure 3: The steps of our BEMS segmenter.
Tags Words in tagging
Table 2: Transition of the BE tag set
Tags Words in tagging
Table 3: Transition of the BEMS tag set

Here we have three different methods to label the gap, and the experiments indicate which is the more promising direction for CWS.


Datasets and Settings

Dataset AS-Train AS-Test MSR-Train MSR-Test
#sentences 709K 14K 87K 4K
#words 4741K 108K 2,368K 107K
#characters 8368K 198K 3,981K 181K
Table 4: Data statistics

Length AS-#sen AS-#char MSR-#sen MSR-#char
0-30 0.875 0.705 0.270 0.126
31-60 0.135 0.263 0.486 0.434
61-90 0.007 0.045 0.175 0.266
91-120 0.001 0.004 0.047 0.100
121-inf 0.001 0.002 0.022 0.072
Table 5: Sentence length statistics (values are fractions of each dataset)

Character embedding size 300 300 300
Depth of LSTM layers 3 3 3
Hidden state size 300 300 300
Biaffine input size 300 300 300
Learning rate 0.001 0.0012 0.002
Dropout probability 0.6 0.39 0.45
Table 6: Hyperparameter settings

We evaluate our model on two popular benchmark datasets, AS and MSR, from the second international Chinese word segmentation bakeoff [Emerson2005]; they are the two largest of all the benchmarks. The data statistics are given in Table 4.

To compare the performance of different tag sets, we evaluate the model with the 01 tag, the BE tag set and the BEMS tag set on the AS and MSR datasets. For the BE and BEMS tag sets, the model uses a beam-search decoder with a beam size of 10 to produce the result for each sentence. For the 01 tag, the model directly predicts the gaps of the input.

The statistics of the sentence length distribution are shown in Table 5. In the AS dataset, most sentences are short because its annotation includes a manual sentence-splitting preprocessing step. Even so, long sentences (with more than 30 characters) account for roughly 30% of the characters in the entire dataset. In MSR, which has no manual sentence splitting, more than 70% of the sentences are longer than 30 characters, which means that fast and accurate segmentation of long sentences is badly needed.
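The share of long sentences can be read off Table 5 by summing every length bucket above 30 characters. The short sketch below does this arithmetic with the values copied from Table 5; the dictionary layout and function name are illustrative only.

```python
# Fractions from Table 5, ordered by bucket: 0-30, 31-60, 61-90, 91-120, 121-inf.
TABLE5 = {
    "AS": {"sent": [0.875, 0.135, 0.007, 0.001, 0.001],
           "char": [0.705, 0.263, 0.045, 0.004, 0.002]},
    "MSR": {"sent": [0.270, 0.486, 0.175, 0.047, 0.022],
            "char": [0.126, 0.434, 0.266, 0.100, 0.072]},
}


def long_share(fractions):
    """Total fraction in the buckets longer than 30 characters."""
    return sum(fractions[1:])


for name, stats in TABLE5.items():
    print(name,
          "sentences: %.3f" % long_share(stats["sent"]),
          "characters: %.3f" % long_share(stats["char"]))
```

The buckets are rounded in the table, so the sums need not match one minus the first bucket exactly; for AS characters the two ways of computing the share give about 0.31 and 0.30.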

In this paper, we use the model settings shown in Table 6. These values are tuned on the development sets (following convention, the last 10% of the sentences in the training corpus are used as the development set). The character embeddings are pre-trained with the word2vec toolkit [Mikolov et al.2013] using skip-gram on a Chinese Wikipedia corpus. The learning rate and dropout probability differ across tag sets, while the other parameters are the same. During training, we apply a dropout layer before every affine transformation to alleviate overfitting. Our model is optimized using Adam [Kingma and Ba2015].

AS F-1 Train(h) Test(s)
[Wang, Rob, and D2014] 95.4 - -
[Chen et al.2017] 94.5 - -
[Cai et al.2017] 95.0 60 80
our model(01) 94.4 9 10
our model(BE) 94.5 12.25 112
our model(BEMS) 94.8 16 150
Table 7: Comparison of performance and running time
MSR F-1 Train(h) Test(s)
[Chen et al.2015a] 95.4 100 120
[Chen et al.2015b] 95.6 117 120
[Cai and Zhao2017] 96.5 96 105
[Cai et al.2017] 97.1 6 30
[Ma and Hinrichs2015] 96.6 3 28
our model(01) 96.6 2.5 5
our model(BE) 96.7 4.6 52
our model(BEMS) 96.9 10.2 90
Table 8: Comparison of performance and running time
Model AS-Long(s) AS-Short(s) MSR-Long(s) MSR-Short(s)
[Cai et al.2017] 45 37 27 8
Our model 5 7 4 3
Table 9: Prediction time on long and short sentences. Sentences with more than 30 characters are regarded as long.

Main Result

We use the same hardware setting as [Cai et al.2017] in this section.

In the experiments, we test all three tag sets in the model. Tables 7 and 8 compare our final results with prior neural models. In terms of accuracy, our proposed BEMS-based model is comparable to the state-of-the-art models, with only a slight difference, and it is faster than other methods except [Zhou et al.2017] and [Ma and Hinrichs2015]. In terms of efficiency, our 01-tag-based model is much faster than the state-of-the-art models of [Cai et al.2017] and [Ma and Hinrichs2015], which are so far the fastest reported models in training or testing. To show the efficiency of the 01-tag-based model clearly, Table 9 further compares [Cai et al.2017] and our 01-tag model on long and short sentences separately, which demonstrates that our model gains much more efficiency on long sentences.

Regarding the comparison, we are aware that many neural models exploit extra resources for further performance improvement, which belongs to the open test setting defined by the SIGHAN bakeoff shared task. To focus on model improvement, we principally follow the closed test setting of the SIGHAN bakeoff, which only allows the training dataset to be used for segmenter learning. The only exception is that our comparison additionally allows all models to use standard pre-trained embeddings, which has been common practice for neural segmentation models. To keep an explicit focus, we therefore exclude work concerned only with the open test, as well as work using complicated pre-trained embeddings such as [Yang, Zhang, and Dong2017] and [Zhou et al.2017].
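For reference, the F-1 values reported in Tables 7 and 8 are word-level scores of the kind used in the SIGHAN bakeoff: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The following is a minimal sketch of that metric and not the official scoring script; the function names are illustrative.

```python
def word_spans(words):
    """Set of (start, end) character offsets, one per word."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold_sentences, predicted_sentences):
    """Word-level precision, recall and F1 over a list of segmented sentences."""
    correct = gold_total = predicted_total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        gold_spans, predicted_spans = word_spans(gold), word_spans(predicted)
        correct += len(gold_spans & predicted_spans)
        gold_total += len(gold_spans)
        predicted_total += len(predicted_spans)
    precision = correct / predicted_total if predicted_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [["ab", "c", "de"]]
predicted = [["ab", "cd", "e"]]
print(segmentation_f1(gold, predicted))  # only "ab" matches: P = R = F1 = 1/3
```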

Model Analysis

All the results in this section are obtained on the development set.

Figure 4: Comparison of our model and [Cai et al.2017] on sentences of different lengths.

Performance in Different Length

Figure 4 compares the performance of [Cai et al.2017] and our model on sentences of different lengths in MSR. The figure shows that our models with different tag sets follow a similar trend: the longer the sentence is, the better the segmenter performs. Unlike the state-of-the-art model of [Cai et al.2017], which performs best on sentences of middle length, our model performs much better on long sentences than on short ones. That is, our model is better able to handle long-distance dependencies. Note that the trend of the curve for our model is surprising and may be a counter-example to the usual behavior of sequence-level NLP tasks, including POS tagging, named entity recognition, and syntactic or semantic parsing, where longer sentences usually lead to poorer processing effectiveness.

We attribute the good performance on long sentences to the bidirectional LSTM encoder. The longer the sentence is, the more information it carries, and this information tells the scorer how to find a gap. Because our model is based on a bidirectional LSTM encoder, the representation of each character contains information from both directions and can therefore hold information about the entire sentence, which may include potential structural and semantic information. Our scorer works on every adjacent character pair with the information of the entire sentence, which explains the difference in performance across sentence lengths.

Performance with Different Tag Set

Tables 7 and 8 also compare our model with different tag sets. The results show that the BEMS model is better than the other two, while the BE model is slightly better than the 01 model. These results are similar to those of traditional segmenters [Zhao et al.2006]. Segmenters with more tags can predict the gap type more accurately, and the model can learn features for the different kinds of gaps, whereas our basic model focuses only on gap existence. The 01-tag model predicts the gap label directly; it attends to the relationship between adjacent characters and ignores the relationship among characters in the same word, which may hurt performance. Like the 01-tag model, the BE model cannot identify the exact boundaries of a word, which may also hurt performance. Although the scorer gives a probability of gap existence using only two adjacent characters, the features of each character contain information about the entire sentence, so the segmenter can still draw on features of other related gaps.

The results also show that the model is faster with a smaller tag set during both training and testing. More tags mean more parameters for the model to optimize during training, which makes the 01-tag model faster to train than the others.

The 01-tag model predicts the gap labels directly without any conflict, so its raw output is always a legal segmentation. It therefore needs no decoder, which makes it faster than the others at test time. For the BE and BEMS models, the time complexity of the beam-search decoder is the same, because each gap label has only two candidate successors (as Tables 2 and 3 show), which limits the time complexity of the decoder. The speed difference between the BE and BEMS models is therefore caused by the complexity of the model itself.

Moderate Performance Improvement

The performance of some segmenters drops when sentences are long. The good performance of our model on long sentences can be used to improve such segmenters. One simple method is to use our model to segment a sentence whenever its length exceeds a threshold.

In the experiment, we use this method to improve the model of [Cai et al.2017] with our model. Figure 4 shows that the performance of the model of [Cai et al.2017] drops when sentences are longer than 90 characters, so we set the threshold to 90: when a sentence is longer than 90 characters, the combined segmenter selects the result of our model. The dataset is MSR.
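The threshold rule amounts to a simple dispatch on sentence length. The sketch below illustrates it with two stand-in segmenters; `base_segmenter` and `long_segmenter` are hypothetical callables that take a string and return a list of words, not the actual models compared in Table 10.

```python
def combine_segmenters(sentence, base_segmenter, long_segmenter, threshold=90):
    """Use long_segmenter for sentences longer than threshold characters."""
    if len(sentence) > threshold:
        return long_segmenter(sentence)
    return base_segmenter(sentence)


# Stand-in segmenters, for illustration only.
def toy_base(sentence):
    return [sentence]        # treats the whole sentence as one word


def toy_long(sentence):
    return list(sentence)    # splits every character apart


short_result = combine_segmenters("x" * 90, toy_base, toy_long)
long_result = combine_segmenters("x" * 91, toy_base, toy_long)
print(len(short_result), len(long_result))  # 1 91
```

Because the two segmenters never interact, either one can be retrained or replaced without touching the other.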

model F-1
[Cai et al.2017] 96.8
with our model(01) 96.9
with our model(BE) 96.9
with our model(BEMS) 96.9
Table 10: Performance improvement

Table 10 shows that the segmenter combined with the 01, BE or BEMS model achieves a moderate performance improvement over the original segmenter. In this method, the original segmenter is independent of the 01, BE and BEMS models, which makes the combination more flexible than a single segmenter.


Conclusion

This paper reports a long-sentence-oriented neural segmenter that models Chinese word segmentation directly as gap decisions made by biaffine scoring over a BiLSTM encoder. Our model has a simplified architecture and can be trained and tested end-to-end. Without a decoder, it trains and predicts fast. We also designed two other gap label types for the model. The evaluation on benchmarks shows that our model behaves in an unusual way: the longer the sentence, the better it does, in both accuracy and efficiency.


References

  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
  • [Bengio et al.2006] Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C.; Kandola, J.; Hofmann, T.; Poggio, T.; and Shawetaylor, J. 2006. A neural probabilistic language model. Springer Berlin Heidelberg.
  • [Cai and Zhao2017] Cai, D., and Zhao, H. 2017. Neural word segmentation learning for Chinese. In ACL 2016, 409–420.
  • [Cai et al.2017] Cai, D.; Zhao, H.; Zhang, Z.; Xin, Y.; Wu, Y.; and Huang, F. 2017. Fast and accurate neural word segmentation for Chinese. In ACL 2017, 608–615.
  • [Chen et al.2015a] Chen, X.; Qiu, X.; Zhu, C.; and Huang, X. 2015a. Gated recursive neural network for Chinese word segmentation. In ACL 2015, 1744–1753.
  • [Chen et al.2015b] Chen, X.; Qiu, X.; Zhu, C.; Liu, P.; and Huang, X. 2015b. Long short-term memory neural networks for Chinese word segmentation. In EMNLP 2015, 1197–1206.
  • [Chen et al.2017] Chen, X.; Shi, Z.; Qiu, X.; and Huang, X. 2017. Adversarial multi-criteria learning for Chinese word segmentation. In ACL 2017, 1193–1203.
  • [Dozat and Manning2017] Dozat, T., and Manning, C. D. 2017. Deep biaffine attention for neural dependency parsing. In ICLR 2017.
  • [Emerson2005] Emerson, T. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the fourth SIGHAN workshop on Chinese language Processing, 123–133.
  • [Huang and Zhao2006] Huang, C.-N., and Zhao, H. 2006. Which is essential for Chinese word segmentation: Character versus word. In PACLIC 2006, 1–12.
  • [Huang et al.2007] Huang, C.-R.; Šimon, P.; Hsieh, S.-K.; and Prévot, L. 2007. Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification. In ACL 2007, 69–72.
  • [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR 2015.
  • [Lafferty, McCallum, and Pereira2001] Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In (ICML 2001), 282–289.
  • [Liu et al.2016] Liu, Y.; Che, W.; Guo, J.; Qin, B.; and Liu, T. 2016. Exploring segment representations for neural segmentation models. In IJCAI 2016, 2880–2886.
  • [Luong, Pham, and Manning2015] Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015, 1412–1421.
  • [Ma and Hinrichs2015] Ma, J., and Hinrichs, E. 2015. Accurate linear-time Chinese word segmentation via embedding matching. In ACL 2015, 247–252.
  • [Mikolov et al.2013] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. version 2.
  • [Pei, Ge, and Chang2014] Pei, W.; Ge, T.; and Chang, B. 2014. Max-margin tensor neural network for Chinese word segmentation. In ACL 2014, volume 1, 293–303.
  • [Peng, Feng, and McCallum2004] Peng, F.; Feng, F.; and McCallum, A. 2004. Chinese segmentation and new word detection using conditional random fields. In COLING 2004, 562.
  • [Sun, Wang, and Li2012] Sun, X.; Wang, H.; and Li, W. 2012. Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection. In ACL 2012, 253–262. Association for Computational Linguistics.
  • [Sun2010] Sun, W. 2010. Word-based and character-based word segmentation models: Comparison and combination. In COLING 2010, 1211–1219. Association for Computational Linguistics.
  • [Tseng et al.2005] Tseng, H.; Chang, P.; Andrew, G.; Jurafsky, D.; and Manning, C. D. 2005. A conditional random field word segmenter for SIGHAN bakeoff 2005. In IJCNLP 2005.
  • [Wang, Rob, and D2014] Wang, M.; Voigt, R.; and Manning, C. D. 2014. Two knives cut better than one: Chinese word segmentation with dual decomposition. In ACL 2014, volume 2, 193–198.
  • [Xu and Sun2016] Xu, J., and Sun, X. 2016. Dependency-based gated recursive neural network for Chinese word segmentation. In ACL 2016, volume 2, 567–572.
  • [Xue2003] Xue, N. 2003. Chinese word segmentation as character tagging. International Journal of Computational Linguistics & Chinese Language Processing 8:29–48.
  • [Yang, Zhang, and Dong2017] Yang, J.; Zhang, Y.; and Dong, F. 2017. Neural word segmentation with rich pretraining. In ACL 2017, 839–849.
  • [Zhang et al.2013] Zhang, L.; Wang, H.; Sun, X.; and Mansur, M. 2013. Exploring representations from unlabeled data with co-training for Chinese word segmentation. In EMNLP 2013, 311–321.
  • [Zhao and Kit2008] Zhao, H., and Kit, C. 2008. Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition. In Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing.
  • [Zhao et al.2006] Zhao, H.; Huang, C.-N.; Li, M.; and Lu, B.-L. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In PACLIC 2006, 87–94.
  • [Zhao et al.2010] Zhao, H.; Huang, C.-N.; Li, M.; and Lu, B.-L. 2010. A unified character-based tagging framework for Chinese word segmentation. ACM Transactions on Asian Language Information Processing (TALIP) 9(2):5.
  • [Zheng, Chen, and Xu2013] Zheng, X.; Chen, H.; and Xu, T. 2013. Deep learning for Chinese word segmentation and POS tagging. In EMNLP 2013, 647–657.
  • [Zhou et al.2017] Zhou, H.; Yu, Z.; Zhang, Y.; Huang, S.; Dai, X. Y.; and Chen, J. 2017. Word-context character embeddings for Chinese word segmentation. In EMNLP 2017, 760–766.