Chinese word segmentation (CWS) is a Chinese natural language processing task that delimits word boundaries. CWS is a basic and essential task for Chinese, which is written without explicit word delimiters, unlike alphabetical languages such as English. Xue (2003) treats CWS as a sequence labeling task with character position tags, a formulation followed by Lafferty et al. (2001); Peng et al. (2004); Zhao et al. (2006). Traditional CWS models depend heavily on feature engineering, which affects model performance. To minimize the effort in feature engineering, CWS models Zheng et al. (2013); Pei et al. (2014); Chen et al. (2015a, b); Xu and Sun (2016); Cai and Zhao (2016); Liu et al. (2016); Cai et al. (2017)
have been developed following neural network architectures for sequence labeling tasks Collobert et al. (2011). Neural CWS models show a strong ability of feature representation, employing unigram and bigram character embeddings as input, and achieve good performance.
The CWS task is often modeled as a graph model based on a scoring model; that is, it is composed of two parts: an encoder, which generates the representation of characters from the input sequence, and a decoder, which performs segmentation according to the encoder scoring. Table 1 summarizes typical CWS models according to their decoding ways for both traditional and neural models. Markov models such as Ng and Low (2004) and Zheng et al. (2013) depend on the maximum entropy model or the maximum entropy Markov model, both with a Viterbi decoder. Besides, conditional random fields (CRF) or semi-CRF for sequence labeling have been used for both traditional and neural models, though with different representations Peng et al. (2004); Andrew (2006); Liu et al. (2016); Wang and Xu (2017); Ma et al. (2018). Generally speaking, the major difference between traditional and neural network models is the way input sentences are represented.

[Table 1: Typical CWS models grouped by decoding way, traditional vs. neural; entries include MMTNN (Pei et al., 2014), Zheng et al. (2013), Chen et al. (2015b), and word-based models such as Zhang and Clark (2007), Cai and Zhao (2016), and Cai et al. (2017).]
Recent works on neural CWS, which focus on the benchmark datasets of SIGHAN Bakeoff Emerson (2005), may roughly be put into the following three categories.
Encoder. Practice in various natural language processing tasks has shown that effective representation is essential to performance improvement. Thus, for better CWS, it is crucial to encode the input character, word or sentence into an effective representation. Table 2 summarizes regular feature sets for typical CWS models, including ours. The building blocks of encoders include the recurrent neural network (RNN), the convolutional neural network (CNN), and the long short-term memory network (LSTM).
Graph model. As CWS is a kind of structured learning task, the graph model determines which type of decoder should be adopted for segmentation; it may also limit the capability of defining features. As shown in Table 2, not all graph models can support word features. Thus recent work has focused on finding more general or flexible graph models to let the model learn the representation of segmentation more effectively, as in Cai and Zhao (2016); Cai et al. (2017).
External data and pre-trained embedding. Both encoder and graph model explore ways to get better performance only by improving the strength of the model itself. Using external resources such as pre-trained embeddings or language representations is an alternative for the same purpose Yang et al. (2017); Zhao et al. (2018). SIGHAN Bakeoff defines two types of evaluation settings: the closed test requires that all data for learning stay within the given training set, while the open test does not have this limitation Emerson (2005). In this work, we focus on the closed test setting, seeking a better model design for further CWS performance improvement.
As shown in Table 1, different decoders have particular decoding algorithms to match the respective CWS models. Markov models and CRF-based models often use Viterbi decoders with polynomial time complexity. In a general graph model, the search space may be too large for the model to search, which forces graph models to use an approximate beam search strategy. The beam search algorithm has a low-order polynomial time complexity O(Mnb), where n is the number of units in one sentence, b is the beam width and M is a constant representing the model complexity. Especially, when the beam width b = 1, the beam search algorithm reduces to a greedy algorithm with a better time complexity O(Mn). The greedy decoding algorithm brings the fastest decoding speed, while it is hard to guarantee decoding precision when the encoder is not strong enough.
In this paper, we focus on a more effective encoder design that offers fast and accurate Chinese word segmentation with only unigram features and greedy decoding. Our proposed encoder consists only of attention mechanisms as building blocks and nothing else. Motivated by the Transformer Vaswani et al. (2017) and its strength in capturing long-range dependencies of input sentences, we use a self-attention network to generate the representation of the input, which lets the model encode a sentence at once without feeding the input iteratively. Considering the weakness of the Transformer in modeling relative and absolute position information directly Shaw et al. (2018), and the importance of localness, position and directional information for CWS, we further improve the architecture of the standard multi-head self-attention of the Transformer with a directional Gaussian mask, obtaining a variant called Gaussian-masked directional multi-head attention. Based on this improved attention mechanism, we expand the encoder of the Transformer to capture different directional information. With this powerful encoder, our model uses only simple unigram features to generate the representation of sentences.
For the decoder, which directly performs the segmentation, we use the bi-affinal attention scorer, previously used in dependency parsing Dozat and Manning (2017) and semantic role labeling Cai et al. (2018), to implement greedy decoding for finding the boundaries of words. In our proposed model, greedy decoding ensures fast segmentation, while the powerful encoder design ensures good segmentation performance even when working with a greedy decoder. Our model is strictly evaluated on benchmark datasets from the SIGHAN Bakeoff shared task on CWS under the closed test setting, and the experimental results show that our proposed model achieves a new state-of-the-art.
The technical contributions of this paper can be summarized as follows.
We propose a CWS model built only on attention structures: both the encoder and the decoder are based on attention.
With a powerful enough encoder, we show for the first time that unigram (character) features can yield strong performance, instead of the diverse n-gram (character and word) features used in most previous work.
To capture the representation of localness information and directional information, we propose a variant of directional multi-head self-attention to further enhance the state-of-the-art Transformer encoder.
The CWS task is often modeled as a graph model based on an encoder-based scoring model. The model for the CWS task is composed of an encoder to represent the input and a decoder based on the encoder to perform the actual segmentation. Figure 1 shows the architecture of our model. The model feeds the sentence into the encoder. The embedding captures the vector representation of the input character sequence. The encoder maps the vector sequence to two sequences of vectors, the forward information H_f and the backward information H_b, as the representation of the sentence. With H_f and H_b, the bi-affinal scorer calculates the probability of each segmentation gap and predicts the word boundaries of the input. Similar to the Transformer, the encoder is an attention network with stacked self-attention and point-wise, fully connected layers, while our encoder includes three independent directional encoders.
2.1 Encoder Stacks
In the Transformer, the encoder is composed of a stack of N identical layers, and each layer has one multi-head self-attention sub-layer and one position-wise fully connected feed-forward sub-layer. A residual connection is applied around each of the two sub-layers, followed by layer normalization Vaswani et al. (2017). This architecture gives the Transformer a good ability to generate sentence representations.
With the variant of multi-head self-attention, we design a Gaussian-masked directional encoder to capture representations of different directions, improving the ability to capture localness and position information, given the importance of adjacent characters. One unidirectional encoder can capture the information of one particular direction.
For the CWS task, a gap between characters, which may form a word boundary, divides a sequence into two parts: one in front of the gap and one behind it. The forward encoder and the backward encoder are used to capture the information of the two directions corresponding to the two parts divided by the gap.
One central encoder runs in parallel with the forward and backward encoders to capture the information of the entire sentence. The central encoder is a special directional encoder for both forward and backward information of the sentence. It can fuse the information and enable the encoder to capture global information.
The encoder outputs forward information and backward information for each position. The representation of the sentence generated by the central encoder is added to this information directly:

H_f = E_f + E_c,  H_b = E_b + E_c,

where H_b is the backward information, H_f is the forward information, E_b is the output of the backward encoder, E_c is the output of the central encoder and E_f is the output of the forward encoder.
2.2 Gaussian-Masked Directional Multi-Head Attention
Similar to scaled dot-product attention Vaswani et al. (2017), Gaussian-masked directional attention can be described as a function that maps queries and key-value pairs to the representation of the input, where queries, keys and values are all vectors. Standard scaled dot-product attention is calculated by dotting the query Q with all keys K, dividing the scores by sqrt(d_k), where d_k is the dimension of the keys, and applying a softmax function to generate the weights of the attention:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.     (2)
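As a reference point, the standard scaled dot-product attention just described can be sketched in a few lines of NumPy. This is a generic sketch of the formula, not the paper's implementation:

```python
import numpy as np

# Scaled dot-product attention (Vaswani et al., 2017):
# softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax.
def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row softmax
    return weights @ V                                # (n_q, d_v)
```

Each output row is a convex combination of the value rows, weighted by query-key similarity.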
Different from scaled dot-product attention, Gaussian-masked directional attention is expected to pay attention to the adjacent characters of each position, and casts the localness relationship between characters as a fixed Gaussian weight for the attention. We assume that the Gaussian weight only relies on the distance between characters.
Firstly we introduce the Gaussian weight matrix G, which represents the localness relationship between each two characters:

g_ij = 2 - 2 Φ(dis_ij / σ),     (4)

where g_ij is the Gaussian weight between characters x_i and x_j, dis_ij = |i - j| is the distance between characters x_i and x_j, Φ is the cumulative distribution function of the standard Gaussian and σ is its standard deviation. The factor 2 in Equation (4) ensures that the Gaussian weight equals 1 when dis_ij is 0. The larger the distance between characters is, the smaller the weight is, which makes one character affect its adjacent characters more than other characters.
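A possible reading of this Gaussian weight, built from the standard Gaussian CDF so that the weight is 1 at distance 0 and decays with distance, can be sketched as follows. The exact parameterization is our assumption, reconstructed from the description:

```python
import math

def gaussian_cdf(x):
    # Phi(x) for the standard normal, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gauss_weight(dis, sigma):
    # Assumed form: 2 * (1 - Phi(|dis| / sigma)).
    # Equals 1 when dis == 0 and decays toward 0 as |dis| grows.
    return 2.0 * (1.0 - gaussian_cdf(abs(dis) / sigma))

def gauss_matrix(n, sigma):
    # Localness weight between every pair of positions in a sentence.
    return [[gauss_weight(i - j, sigma) for j in range(n)] for i in range(n)]
```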
To combine the Gaussian weight with self-attention, we take the Hadamard product of the Gaussian weight matrix G and the score matrix produced by QK^T:

A_G(Q, K, V) = softmax((Q K^T / sqrt(d_k)) ∘ G) V,

where A_G is the Gaussian-masked attention. It ensures that the relationship between two characters at a long distance is weaker than between adjacent characters.
Scaled dot-product attention models the relationship between two characters without regard to their distance in a sequence. For the CWS task, the weight between adjacent characters should matter more, while it is hard for self-attention to achieve this effect explicitly because self-attention cannot obtain the order of the sentence directly. Gaussian-masked attention adjusts the weight between a character and its adjacent characters to a larger value, which represents the effect of adjacent characters.
For the forward and backward encoders, the self-attention sub-layer needs a triangular matrix mask to make the self-attention focus on different weights:

M^f_ij = 1 if j <= i, else 0;    M^b_ij = 1 if j >= i, else 0,

where i and j are the positions of characters x_i and x_j. M^f is the triangular matrix for the forward encoder, allowing attention only to positions at or before the query position, and M^b is the triangular matrix for the backward encoder, allowing attention only to positions at or after it.
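The two directional masks can be sketched as simple triangular 0/1 matrices. Treat this as an illustrative assumption: the exact mask values (e.g. 0 vs. a large negative number before the softmax) are an implementation detail not fixed by the text:

```python
# Directional masks over an n-character sentence: 1 keeps an attention
# score, 0 masks it out.
def forward_mask(n):
    # The forward encoder may attend only to positions j <= i.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def backward_mask(n):
    # The backward encoder may attend only to positions j >= i.
    return [[1 if j >= i else 0 for j in range(n)] for i in range(n)]
```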
Similar to Vaswani et al. (2017), we use multi-head attention to capture information from different representation subspaces, as in Figure 3(a), and obtain Gaussian-masked directional multi-head attention. With the multi-head attention architecture, the representation of the input is captured by

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  head_i = A_G(Q W_i^Q, K W_i^K, V W_i^V),

where MultiHead is the Gaussian-masked multi-head attention, W_i^Q, W_i^K, W_i^V and W^O are the parameter matrices used to generate the heads, d_model is the dimension of the model and d_k is the dimension of one head.
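The head splitting and re-concatenation behind multi-head attention amounts to a pair of reshapes. A generic sketch (not the paper's code; it assumes d_model divides evenly into h heads) is:

```python
import numpy as np

def split_heads(X, h):
    """(n, d_model) -> (h, n, d_model // h): one slice per head."""
    n, d_model = X.shape
    d_k = d_model // h
    return X.reshape(n, h, d_k).transpose(1, 0, 2)

def merge_heads(Xh):
    """(h, n, d_k) -> (n, h * d_k): concatenate the heads back."""
    h, n, d_k = Xh.shape
    return Xh.transpose(1, 0, 2).reshape(n, h * d_k)
```

Attention is then applied independently inside each head, so each head can specialize in a different subspace.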
2.3 Bi-affinal Attention Scorer
Regarding word boundaries as the gaps between any adjacent words converts the character labeling task into a gap labeling task. Different from the character labeling task, the gap labeling task requires the information of two adjacent characters. The relationship between adjacent characters can be represented as the type of the gap. This characteristic of word boundaries makes bi-affine attention an appropriate scorer for the CWS task.
The bi-affinal attention scorer is the component we use to label the gaps. Bi-affinal attention is developed from bilinear attention, which has been used in dependency parsing Dozat and Manning (2017) and SRL Cai et al. (2018). The distribution of labels in a labeling task is often uneven, so the output layer often includes a fixed bias term for the prior probability of different labels Cai et al. (2018). Bi-affine attention uses bias terms to alleviate the burden of the fixed bias term and capture the prior probability, which makes it different from bilinear attention. The distribution of gap labels is uneven, similar to other labeling tasks, which fits bi-affine attention.
The bi-affinal attention scorer labels the target depending on the information of each independent unit and the joint information of two units. In bi-affinal attention, the score of characters x_i and x_j is calculated by:

s(x_i, x_j) = f_i^T W b_j + U (f_i ⊕ b_j) + bias,     (8)

where f_i is the forward information of x_i and b_j is the backward information of x_j. In Equation (8), W, U and bias are all parameters that can be updated in training: W is a tensor of shape d × k × d and U is a k × 2d matrix, where d is the dimension of the vectors and k is the number of labels.
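Under the shapes given in the text (our reading: W as a d × k × d tensor, U as a k × 2d matrix, with a per-label bias), a sketch of the bi-affinal score for one gap is:

```python
import numpy as np

# Sketch of a bi-affinal scorer for one gap between two characters.
# Shapes are our assumption from the text: W (d, k, d), U (k, 2d), bias (k,).
def biaffine_score(f, b_vec, W, U, bias):
    """f: forward info of the character before the gap, shape (d,)
       b_vec: backward info of the character after the gap, shape (d,)
       returns one score per gap label, shape (k,)."""
    bilinear = np.einsum("i,ikj,j->k", f, W, b_vec)   # f^T W b per label
    affine = U @ np.concatenate([f, b_vec])           # linear term on [f; b]
    return bilinear + affine + bias                   # per-label scores
```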
[Table 3: Statistics of the training and test data, including the maximum sentence lengths in characters and in words for each corpus.]
|Dimension of hidden vector||256|
|Number of layers||6|
|Dimension of feed-forward layer||1024|
|Number of heads||4|
|Chen et al. (2015a)||95.9||50||105||96.2||100||120|
|Chen et al. (2015b)||95.7||58||105||96.4||117||120|
|Liu et al. (2016)||94.9||-||-||94.8||-||-|
|Cai and Zhao (2016)||95.2||48||95||96.4||96||105|
|Cai et al. (2017)||95.4||3||25||97.0||6||30|
|Zhou et al. (2017)||95.0||-||-||97.2||-||-|
|Ma et al. (2018)||95.4||-||-||97.5||-||-|
|Wang et al. (2019)*||95.7*||-||-||97.4*||-||-|
|Cai et al. (2017)||95.2||-||-||95.4||-||-|
|Ma et al. (2018)||95.5||-||-||95.7||-||-|
|Wang et al. (2019)*||95.6*||-||-||95.9*||-||-|
In our model, the bi-affinal scorer uses the forward information of the character in front of the gap and the backward information of the character behind the gap to distinguish the positions of characters. Figure 4 shows an example of gap labeling. Using the bi-affinal scorer ensures that word boundaries can be determined from adjacent characters with different directional information. The score vector of a gap is formed by the probabilities of its being a word boundary. The model then generates all boundaries through the activation function in a greedy decoding way.
3.1 Experimental Settings
Table 3 shows the statistics of the training data. We use the F-score to evaluate CWS models. To train the model with pre-trained embeddings on AS and CITYU, we use OpenCC (https://github.com/BYVoid/OpenCC) to convert the data from traditional Chinese to simplified Chinese.
We only use unigram features, so we only trained character embeddings. Our embeddings are pre-trained on the Chinese Wikipedia corpus with the word2vec toolkit Mikolov et al. (2013). The corpus used for pre-training is fully converted to simplified Chinese and is not segmented. For the closed test, we use randomly initialized embeddings.
For different datasets, we use two sets of hyperparameters, presented in Table 4: one for the small corpora (PKU and CITYU) and one for the normal corpora (MSR and AS). We set the standard deviation σ of the Gaussian function in Equation (4) to 2. Each training batch contains sentences with at most 4096 tokens.
We adopt the learning rate schedule of Vaswani et al. (2017):

lr = d^{-0.5} · min(t^{-0.5}, t · t_w^{-1.5}),

where d is the dimension of the embeddings, t is the training step number and t_w is the number of warmup steps. When the step number is smaller than the warmup step number, the learning rate increases linearly and then decreases.
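Assuming this is the standard Transformer schedule (linear warmup followed by inverse-square-root decay, matching the increase-then-decrease behavior described here), it can be sketched as:

```python
# Transformer-style learning-rate schedule: during warmup the second
# term is smaller (linear growth); afterwards the first term takes over
# (decay proportional to 1/sqrt(step)).
def learning_rate(step, d_model, warmup):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```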
3.2 Hardware and Implementation
We trained our models on a single machine with an Intel i7-5960X CPU and an NVIDIA 1080 Ti GPU. We implement our model in Python with PyTorch 1.0.
Tables 5 and 6 report the performance of recent models and ours under the closed test setting. Without the assistance of the unsupervised segmentation features used in Wang et al. (2019), our model outperforms all other models on MSR and AS except Ma et al. (2018), and achieves comparable performance on PKU and CITYU. Note that all the other models in this comparison adopt various n-gram features, while only our model uses unigram features.
With the unsupervised segmentation features introduced by Wang et al. (2019), our model obtains higher results. Specifically, the results on MSR and AS achieve a new state-of-the-art, and those on CITYU and PKU approach the previous state-of-the-art. The unsupervised segmentation features are derived from the given training dataset, so using them does not violate the closed test rule of SIGHAN Bakeoff.
Table 7 compares our model with recent neural models under the open test setting, in which any external resources, especially pre-trained embeddings or language models, can be used. On MSR and AS, our model obtains comparable results, while our results on CITYU and PKU are not remarkable.
However, it is well known that models are hard to compare under the open test setting, especially with pre-trained embeddings, as not all models use the same pre-training method and data. Though a pre-trained embedding or language model can improve performance, the improvement may come from multiple sources: pre-trained embeddings often succeed in improving the result, but that does not prove the model itself is better.
|Chen et al. (2015a)||96.4||97.6||-||-|
|Chen et al. (2015b)||96.5||97.4||-||-|
|Liu et al. (2016)||96.8||97.3||-||-|
|Cai and Zhao (2016)||95.5||96.5||-||-|
|Cai et al. (2017)||95.8||97.1||95.3||95.6|
|Chen et al. (2017b)||94.3||96.0||94.6||95.6|
|Wang and Xu (2017)||95.7||97.3||-||-|
|Zhou et al. (2017)||96.0||97.8||-||-|
|Chen et al. (2017c)||-||96.5||95.17||-|
|Ma et al. (2018)||96.1||98.1||96.2||97.2|
|Wang et al. (2019)||96.1||97.5||-||-|
Compared with other LSTM models, our model performs better on AS and MSR than on CITYU and PKU. Considering the scale of the different corpora, we believe that corpus size affects our model: the larger the corpus is, the better the model performs. On a small corpus, the model tends to overfit.
4 Related Work
4.1 Chinese Word Segmentation
CWS is a task in Chinese natural language processing that delimits word boundaries. Xue (2003) for the first time formulates CWS as a sequence labeling task. Zhao et al. (2006) show that different character tag sets can have an essential impact on CWS. Peng et al. (2004) use CRFs as a model for CWS, achieving a new state-of-the-art. Work on statistical CWS has built the basis for neural CWS.
Neural word segmentation has been widely adopted to minimize the effort in feature engineering, which was important in statistical CWS. Zheng et al. (2013) introduce the neural model with sliding-window based sequence labeling. Chen et al. (2015a) propose a gated recursive neural network (GRNN) for CWS to incorporate complicated combinations of contextual character and n-gram features. Chen et al. (2015b) use LSTM to learn long-distance information. Cai and Zhao (2016) propose a neural framework that eliminates context windows and utilizes complete segmentation history. Lyu et al. (2016) explore a joint model that performs segmentation, POS tagging and chunking simultaneously. Chen et al. (2017a) propose a feature-enriched neural model for joint CWS and part-of-speech tagging. Zhang et al. (2017) present a joint model to enhance the segmentation of Chinese microtext by performing CWS and informal word detection simultaneously. Wang and Xu (2017) propose a character-based convolutional neural model to capture n-gram features automatically and an effective approach to incorporate word embeddings. Cai et al. (2017) improve the model in Cai and Zhao (2016) and propose a greedy neural word segmenter with balanced word and character embedding inputs. Zhao et al. (2018) propose a novel neural network model to incorporate unlabeled and partially-labeled data. Zhang et al. (2018) propose two methods that extend the Bi-LSTM to incorporate dictionaries into neural networks for CWS. Gong et al. (2019) propose Switch-LSTMs to segment words, providing a more flexible solution for multi-criteria CWS that makes it easy to transfer learned knowledge to new criteria.
The Transformer Vaswani et al. (2017) is an attention-based neural machine translation model. The Transformer is a kind of self-attention network (SAN), as proposed in Lin et al. (2017). The encoder of the Transformer consists of one self-attention layer and a position-wise feed-forward layer. The decoder of the Transformer contains one self-attention layer, one encoder-decoder attention layer and one position-wise feed-forward layer. The Transformer uses residual connections around the sub-layers, each followed by a layer normalization layer.
Scaled dot-product attention is the key component in the Transformer. The input of the attention contains the queries, keys, and values of the input sequence. The attention weights are generated from queries and keys as in Equation (2). The structure of scaled dot-product attention allows the self-attention layer to generate the representation of a sentence at once, containing the information of the whole sentence, unlike an RNN, which processes the characters of a sentence one by one. Standard self-attention is similar to Gaussian-masked directional attention but has no directional mask or Gaussian mask. Vaswani et al. (2017) also propose multi-head attention, which better generates sentence representations by dividing queries, keys and values into different heads and gathering information from different subspaces.
In this paper, we propose an attention-only Chinese word segmentation model. Our model uses self-attention from the Transformer encoder to take the sequence input and a bi-affinal attention scorer to predict the labels of gaps. To improve the ability of the self-attention based encoder to capture localness and directional information, we propose a variant of self-attention called Gaussian-masked directional multi-head attention to replace the standard self-attention. We also extend the Transformer encoder to capture directional features. Our model uses only unigram features instead of the multiple n-gram features of previous work. Our model is evaluated on the standard benchmark dataset, SIGHAN Bakeoff 2005, showing that our model not only performs segmentation faster than any previous model but also gives new higher or comparable segmentation performance against previous state-of-the-art models.
- Andrew (2006) Galen Andrew. 2006. A hybrid Markov/semi-Markov conditional random field for sequence segmentation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 465–472, Sydney, Australia. Association for Computational Linguistics.
- Cai and Zhao (2016) Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for Chinese. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 409–420, Berlin, Germany. Association for Computational Linguistics.
- Cai et al. (2017) Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and accurate neural word segmentation for Chinese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 608–615, Vancouver, Canada. Association for Computational Linguistics.
- Cai et al. (2018) Jiaxun Cai, Shexia He, Zuchao Li, and Hai Zhao. 2018. A full end-to-end semantic role labeler, syntactic-agnostic over syntactic-aware? In Proceedings of the 27th International Conference on Computational Linguistics, pages 2753–2765, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Chen et al. (2017a) Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. 2017a. A feature-enriched neural model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 3960–3966.
- Chen et al. (2015a) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1744–1753, Beijing, China. Association for Computational Linguistics.
- Chen et al. (2015b) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206, Lisbon, Portugal. Association for Computational Linguistics.
- Chen et al. (2017b) Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017b. Adversarial multi-criteria learning for Chinese word segmentation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1193–1203, Vancouver, Canada. Association for Computational Linguistics.
- Chen et al. (2017c) Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017c. DAG-based long short-term memory for neural word segmentation. CoRR, abs/1707.00248.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.
- Dozat and Manning (2017) Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- Emerson (2005) Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.
- Gong et al. (2019) Jingjing Gong, Xinchi Chen, Tao Gui, and Xipeng Qiu. 2019. Switch-LSTMs for multi-criteria Chinese word segmentation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 6457–6464.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- Liu et al. (2016) Yijia Liu, Wanxiang Che, Jiang Guo, Bing Qin, and Ting Liu. 2016. Exploring segment representations for neural segmentation models. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2880–2886.
- Low et al. (2005) Jin Kiat Low, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.
- Lyu et al. (2016) Chen Lyu, Yue Zhang, and Donghong Ji. 2016. Joint word segmentation, POS-tagging and syntactic chunking. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3007–3014.
- Ma et al. (2018) Ji Ma, Kuzman Ganchev, and David Weiss. 2018. State-of-the-art Chinese word segmentation with bi-LSTMs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4902–4908, Brussels, Belgium. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
- Ng and Low (2004) Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? word-based or character-based? In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 277–284, Barcelona, Spain. Association for Computational Linguistics.
- Pei et al. (2014) Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 293–303, Baltimore, Maryland. Association for Computational Linguistics.
- Peng et al. (2004) Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 562–568, Geneva, Switzerland. COLING.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.
- Sun et al. (2009) Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2009. A discriminative latent variable Chinese segmenter with hybrid word/character information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 56–64, Boulder, Colorado. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
- Wang and Xu (2017) Chunqi Wang and Bo Xu. 2017. Convolutional neural network with word embeddings for Chinese word segmentation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 163–172, Taipei, Taiwan. Asian Federation of Natural Language Processing.
- Wang et al. (2019) Xiaobin Wang, Deng Cai, Linlin Li, Guangwei Xu, Hai Zhao, and Luo Si. 2019. Unsupervised learning helps supervised neural word segmentation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7200–7207.
- Xu and Sun (2016) Jingjing Xu and Xu Sun. 2016. Dependency-based gated recursive neural network for Chinese word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–572, Berlin, Germany. Association for Computational Linguistics.
- Xue (2003) Nianwen Xue. 2003. Chinese word segmentation as character tagging. In International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, pages 29–48.
- Yang et al. (2017) Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural word segmentation with rich pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 839–849, Vancouver, Canada. Association for Computational Linguistics.
- Zhang et al. (2017) Meishan Zhang, Guohong Fu, and Nan Yu. 2017. Segmenting Chinese microtext: Joint informal-word detection and segmentation with neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4228–4234.
- Zhang et al. (2018) Qi Zhang, Xiaoyu Liu, and Jinlan Fu. 2018. Neural networks incorporating dictionaries for Chinese word segmentation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5682–5689.
- Zhang and Clark (2007) Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 840–847, Prague, Czech Republic. Association for Computational Linguistics.
- Zhao et al. (2006) Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 87–94, Huazhong Normal University, Wuhan, China. Tsinghua University Press.
- Zhao et al. (2018) Lujun Zhao, Qi Zhang, Peng Wang, and Xiaoyu Liu. 2018. Neural networks incorporating unlabeled and partially-labeled data for cross-domain Chinese word segmentation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4602–4608.
- Zheng et al. (2013) Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep learning for Chinese word segmentation and POS tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 647–657, Seattle, Washington, USA. Association for Computational Linguistics.
- Zhou et al. (2017) Hao Zhou, Zhenting Yu, Yue Zhang, Shujian Huang, Xinyu Dai, and Jiajun Chen. 2017. Word-context character embeddings for Chinese word segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 760–766, Copenhagen, Denmark. Association for Computational Linguistics.