Attention Is All You Need for Chinese Word Segmentation

10/31/2019 ∙ by Sufeng Duan, et al. ∙ 0

This paper presents a fast and accurate Chinese word segmentation (CWS) model that uses only unigram features and a greedy decoding algorithm. Our model uses only attention mechanisms as network building blocks. In detail, we adopt a Transformer-based encoder empowered by self-attention as the backbone to take the input representation. We then extend the Transformer encoder with our proposed Gaussian-masked directional multi-head attention, a variant of scaled dot-product attention. Finally, a bi-affine attention scorer makes segmentation decisions in linear time. Our model is evaluated on the SIGHAN Bakeoff benchmark datasets. The experimental results show that, with the highest segmentation speed, the proposed attention-only model achieves new state-of-the-art or comparable performance against strong baselines in the closed test setting.




1 Introduction

Chinese word segmentation (CWS) is the task in Chinese natural language processing of delimiting word boundaries. It is a basic and essential task because Chinese, unlike alphabetic languages such as English, is written without explicit word delimiters.

Xue (2003) treats CWS as a sequence labeling task with character position tags, an approach followed by Lafferty et al. (2001); Peng et al. (2004); Zhao et al. (2006). Traditional CWS models depend heavily on feature design, which affects model performance. To minimize the effort in feature engineering, some CWS models Zheng et al. (2013); Pei et al. (2014); Chen et al. (2015a, b); Xu and Sun (2016); Cai and Zhao (2016); Liu et al. (2016); Cai et al. (2017) have been developed following the neural network architecture for sequence labeling tasks of Collobert et al. (2011). Neural CWS models show a strong ability for feature representation: they take unigram and bigram character embeddings as input and achieve good performance.

Traditional Models | Neural Models
Ng and Low (2004), Low et al. (2005) | MMTNN: Pei et al. (2014), Zheng et al. (2013), LSTM: Chen et al. (2015b)
CRF: Peng et al. (2004), semi-CRF: Andrew (2006), Sun et al. (2009) | BiLSTM+semi-CRF: Liu et al. (2016), CNN+CRF: Wang and Xu (2017), BiLSTM+CRF: Ma et al. (2018)
Zhang and Clark (2007) | LSTM+GCNN: Cai and Zhao (2016), LSTM+GCNN: Cai et al. (2017), Wang et al. (2019)
Table 1: The classification of Chinese word segmentation models.

The CWS task is often modeled as a graph model built on a scoring model. Such a model is composed of two parts: an encoder, which generates character representations from the input sequence, and a decoder, which performs segmentation according to the encoder's scores. Table 1 summarizes typical CWS models according to their decoding methods for both traditional and neural models. Markov models such as Ng and Low (2004) and Zheng et al. (2013) depend on the maximum entropy model or the maximum entropy Markov model, both with a Viterbi decoder. In addition, conditional random fields (CRF) or semi-CRF for sequence labeling have been used in both traditional and neural models, though with different representations Peng et al. (2004); Andrew (2006); Liu et al. (2016); Wang and Xu (2017); Ma et al. (2018). Generally speaking, the major difference between traditional and neural network models lies in how they represent input sentences.

Models Characters Words
character based Ours -
Zheng et al. (2013), … -
Chen et al. (2015b) -
word based Zhang and Clark (2007), …
Cai and Zhao (2016); Cai et al. (2017)
Table 2: Feature windows of different models. is the index of current character(word).

Recent work on neural CWS that focuses on the benchmark dataset, namely SIGHAN Bakeoff Emerson (2005), can be roughly divided into the following three categories.

Encoder. Practice in various natural language processing tasks has shown that effective representation is essential to performance improvement. For better CWS, it is therefore crucial to encode the input character, word or sentence into an effective representation. Table 2 summarizes the regular feature sets of typical CWS models, including ours. The building blocks that encoders use include the recurrent neural network (RNN), the convolutional neural network (CNN) and the long short-term memory network (LSTM).

Graph model. As CWS is a kind of structure learning task, the graph model determines which type of decoder should be adopted for segmentation. It may also limit the features that can be defined: as shown in Table 2, not all graph models support word features. Recent work has therefore focused on finding more general or flexible graph models that let the model learn the representation of segmentation more effectively, as in Cai and Zhao (2016); Cai et al. (2017).

External data and pre-trained embedding. Both the encoder and the graph model aim to get better performance only by improving the strength of the model itself. Using external resources such as pre-trained embeddings or language representations is an alternative for the same purpose Yang et al. (2017); Zhao et al. (2018). SIGHAN Bakeoff defines two types of evaluation settings: the closed test requires that all data used for learning come from the given training set, while the open test has no such limitation Emerson (2005). In this work, we focus on the closed test setting and seek a better model design for further CWS performance improvement.

As shown in Table 1, different decoders have particular decoding algorithms to match the respective CWS models. Markov models and CRF-based models often use Viterbi decoders with polynomial time complexity. In a general graph model, the search space may be too large to search exhaustively, which forces graph models to use an approximate beam search strategy. Beam search has low-order polynomial time complexity. In particular, when the beam width $b = 1$, beam search reduces to a greedy algorithm with a better time complexity $O(Mn)$ against the general beam search time complexity $O(Mnb^2)$, where $n$ is the number of units in one sentence and $M$ is a constant representing the model complexity. Greedy decoding gives the fastest decoding speed, but it is hard to guarantee decoding precision when the encoder is not strong enough.

In this paper, we focus on a more effective encoder design that is capable of offering fast and accurate Chinese word segmentation with only unigram features and greedy decoding. Our proposed encoder consists of attention mechanisms as building blocks and nothing else. Motivated by the Transformer Vaswani et al. (2017) and its strength in capturing long-range dependencies of input sentences, we use a self-attention network to generate the representation of the input, which lets the model encode a sentence at once without feeding the input iteratively. Considering the weakness of the Transformer in modeling relative and absolute position information directly Shaw et al. (2018), and the importance of localness, position and directional information for CWS, we further improve the standard multi-head self-attention of the Transformer with a directional Gaussian mask and obtain a variant called Gaussian-masked directional multi-head attention. Based on this improved attention mechanism, we expand the encoder of the Transformer to capture different directional information. With this powerful encoder, our model uses only simple unigram features to generate the representation of sentences.

For the decoder, which directly performs the segmentation, we use the bi-affine attention scorer, which has been used in dependency parsing Dozat and Manning (2017) and semantic role labeling Cai et al. (2018), to implement greedy decoding for finding word boundaries. In our proposed model, greedy decoding ensures fast segmentation, while the powerful encoder design ensures good segmentation performance even when working with a greedy decoder. Our model is evaluated on the benchmark datasets from the SIGHAN Bakeoff shared task on CWS in the closed test setting, and the experimental results show that it achieves new state-of-the-art performance.

The technical contributions of this paper can be summarized as follows.

  • We propose a CWS model with only attention structure. The encoder and decoder are both based on attention structure.

  • With a powerful enough encoder, we show for the first time that unigram (character) features alone can yield strong performance, instead of the diverse $n$-gram (character and word) features used in most previous work.

  • To capture the representation of localness information and directional information, we propose a variant of directional multi-head self-attention to further enhance the state-of-the-art Transformer encoder.

2 Models

Figure 1: The architecture of our model.

The CWS task is often modelled as a graph model built on an encoder-based scoring model. A CWS model is composed of an encoder that represents the input and a decoder that uses the encoder output to perform the actual segmentation. Figure 1 shows the architecture of our model. The model feeds the sentence into the encoder. The embedding layer captures the vectors $e = (e_1, \dots, e_n)$ of the input character sequence $c = (c_1, \dots, c_n)$. The encoder maps the vector sequence $e$ to two sequences of vectors, $v^b = (v^b_1, \dots, v^b_n)$ and $v^f = (v^f_1, \dots, v^f_n)$, as the representation of the sentence. With $v^b$ and $v^f$, the bi-affine scorer calculates the probability of each segmentation gap and predicts the word boundaries of the input. Like the Transformer, the encoder is an attention network with stacked self-attention and point-wise, fully connected layers, but our encoder includes three independent directional encoders.

2.1 Encoder Stacks

Figure 2: The structure of Gaussian-Masked directional encoder.

In the Transformer, the encoder is composed of a stack of $N$ identical layers, each with one multi-head self-attention layer and one position-wise fully connected feed-forward layer. A residual connection is applied around each of the two sub-layers, followed by layer normalization Vaswani et al. (2017). This architecture gives the Transformer a strong ability to generate sentence representations.

With our variant of multi-head self-attention, we design a Gaussian-masked directional encoder that captures representations of different directions. Because adjacent characters are important, this improves the ability to capture localness and position information. One unidirectional encoder captures the information of one particular direction.

For CWS tasks, a gap between characters that forms a word boundary divides a sequence into two parts, one in front of the gap and one behind it. The forward encoder and backward encoder capture information from the two directions that correspond to the two parts divided by the gap.

A central encoder runs in parallel with the forward and backward encoders to capture the information of the entire sentence. The central encoder is a special directional encoder that sees both the forward and backward information of a sentence, so it can fuse the two and enable the encoder to capture global information.

The encoder outputs forward information and backward information for each position. The sentence representation generated by the central encoder is added to both directly:

$$v^b = r^b + r^c, \qquad v^f = r^f + r^c \tag{1}$$

where $v^b$ is the backward information, $v^f$ is the forward information, $r^b$ is the output of the backward encoder, $r^c$ is the output of the central encoder and $r^f$ is the output of the forward encoder.

2.2 Gaussian-Masked Directional Multi-Head Attention

Like scaled dot-product attention Vaswani et al. (2017), Gaussian-masked directional attention can be described as a function that maps queries and key-value pairs to the representation of the input. Here queries, keys and values are all vectors. Standard scaled dot-product attention is calculated by taking the dot product of the query $Q$ with all keys $K$, dividing each value by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys, and applying a softmax function to generate the attention weights:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{2}$$
Unlike scaled dot-product attention, Gaussian-masked directional attention is expected to pay attention to the characters adjacent to each position, and it casts the localness relationship between characters as a fixed Gaussian weight for attention. We assume that the Gaussian weight relies only on the distance between characters.

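First, we introduce the Gaussian weight matrix $G$, which represents the localness relationship between every two characters:

$$G = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1n} \\ g_{21} & g_{22} & \cdots & g_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ g_{n1} & g_{n2} & \cdots & g_{nn} \end{bmatrix} \tag{3}$$

$$g_{ij} = \Phi(dis_{ij}) = \sqrt{\frac{2}{\sigma^2 \pi}} \int_{-\infty}^{-dis_{ij}} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx \tag{4}$$

where $g_{ij}$ is the Gaussian weight between characters $i$ and $j$, $dis_{ij}$ is the distance between characters $i$ and $j$, $\Phi$ is the cumulative distribution function of the Gaussian, and $\sigma$ is the standard deviation of the Gaussian function, which is a hyperparameter in our method. Equation (4) ensures that the Gaussian weight equals 1 when $dis_{ij}$ is 0. The larger the distance between characters, the smaller the weight, so a character affects its adjacent characters more than other characters.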
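As a rough illustration of the Gaussian weight, the sketch below builds the weight matrix for a short sentence. It assumes the weight is twice the Gaussian cumulative distribution function evaluated at the negated distance, which is one form that equals 1 at distance 0 and decays with distance. The function names are hypothetical, and `sigma=2.0` follows the standard deviation reported in the experimental settings.

```python
import math


def gaussian_weight(distance, sigma=2.0):
    """Weight that is 1 at distance 0 and shrinks as distance grows.

    Assumed form: 2 * Phi(-distance / sigma), where Phi is the standard
    normal CDF. This equals erfc(distance / (sigma * sqrt(2))).
    """
    return math.erfc(distance / (sigma * math.sqrt(2.0)))


def gaussian_weight_matrix(n, sigma=2.0):
    """Build the n x n matrix of weights from character distances |i - j|."""
    return [[gaussian_weight(abs(i - j), sigma) for j in range(n)]
            for i in range(n)]


G = gaussian_weight_matrix(5, sigma=2.0)
for row in G:
    print(" ".join(f"{g:.3f}" for g in row))
```

The diagonal is exactly 1, and each row decays symmetrically away from the diagonal, so a character weighs its neighbours more than distant characters.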

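To combine the Gaussian weight with self-attention, we take the Hadamard product of the Gaussian weight matrix $G$ and the score matrix produced by $QK^T$:

$$AG(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T \odot G}{\sqrt{d_k}}\right) V \tag{5}$$

where $AG$ is the Gaussian-masked attention over queries $Q$, keys $K$ and values $V$, and $d_k$ is the dimension of the keys. It ensures that the relationship between two distant characters is weaker than that between adjacent characters.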
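The combination of a distance-based weight with the attention scores can be sketched in plain Python as follows. This is a minimal single-head illustration rather than the authors' implementation: the helper names are hypothetical, the Gaussian weight is assumed to be `erfc(distance / (sigma * sqrt(2)))`, and the weight multiplies the raw dot-product score before scaling and softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_weight(distance, sigma=2.0):
    # Assumed weight: 1 at distance 0, decaying as distance grows.
    return math.erfc(distance / (sigma * math.sqrt(2.0)))


def gaussian_masked_attention(Q, K, V, sigma=2.0):
    """Q, K, V are lists of equal-length vectors, one per character."""
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            dot = sum(a * b for a, b in zip(q, k))
            # Element-wise (Hadamard) weighting of the score by distance.
            weighted = dot * gaussian_weight(abs(i - j), sigma)
            scores.append(weighted / math.sqrt(d_k))
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in gaussian_masked_attention(X, X, X):
    print([round(x, 3) for x in row])
```

Note that multiplying a score by a weight below 1 pulls it toward zero rather than toward negative infinity, so distant characters are down-weighted but never fully excluded.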

Scaled dot-product attention models the relationship between two characters without regard to their distance in the sequence. For the CWS task, the weights between adjacent characters should be more important, but self-attention can hardly achieve this effect explicitly because it cannot obtain the order of a sentence directly. Gaussian-masked attention adjusts the weights between a character and its adjacent characters to larger values, which reflects the effect of adjacent characters.

(a) The architecture of Gaussian-masked directional multi-head attention.
(b) The Gaussian-masked directional attention.
Figure 3: The illustration of Gaussian-masked directional multi-head attention and Gaussian-masked directional attention.

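For the forward and backward encoders, the self-attention sublayer uses a triangular matrix mask so that self-attention focuses on different parts of the sentence:

$$\begin{cases} pos(k_j) \le pos(q_i), & \text{forward} \\ pos(k_j) \ge pos(q_i), & \text{backward} \end{cases} \tag{6}$$

where $pos(k_j)$ is the position of character $k_j$. The triangular matrices for the forward and backward encoders are:

$$\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} \qquad \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 0 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$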
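A minimal sketch of such directional masks is given below. It assumes that the forward encoder lets position i attend to positions at or before i (a lower-triangular mask), that the backward encoder attends to positions at or after i (an upper-triangular mask), and that the central encoder sees every position. The function names and this reading of "forward" and "backward" are assumptions for illustration.

```python
import math


def directional_mask(n, direction):
    """Return an n x n 0/1 mask; entry [i][j] is 1 if query i may see key j."""
    if direction == "forward":
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if direction == "backward":
        return [[1 if j >= i else 0 for j in range(n)] for i in range(n)]
    if direction == "central":
        return [[1] * n for _ in range(n)]
    raise ValueError(f"unknown direction: {direction}")


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring positions where the mask is 0."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    top = max(kept)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


forward = directional_mask(4, "forward")
backward = directional_mask(4, "backward")
for f_row, b_row in zip(forward, backward):
    print(f_row, b_row)

# Position 1 in the forward direction only distributes weight over 0 and 1.
print(masked_softmax([0.5, 0.5, 3.0, 3.0], forward[1]))
```

Because the diagonal is 1 in both triangular masks, every row keeps at least one visible position, so the masked softmax is always well defined.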

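Like Vaswani et al. (2017), we use multi-head attention to capture information from different positions and dimensions, as shown in Figure 3(a), and obtain Gaussian-masked directional multi-head attention. With the multi-head attention architecture, the representation of the input can be captured by

$$MH(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h) W^O, \qquad head_i = AG(QW_i^q, KW_i^k, VW_i^v) \tag{7}$$

where $MH$ is the Gaussian-masked multi-head attention, $AG$ is the Gaussian-masked attention, $W_i^q, W_i^k, W_i^v \in \mathbb{R}^{d_{model} \times d_k}$ are the parameter matrices that generate the heads, $W^O$ projects the concatenated heads back to $d_{model}$ dimensions, $d_{model}$ is the dimension of the model and $d_k$ is the dimension of one head.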

2.3 Bi-affinal Attention Scorer

Regarding word boundaries as gaps between adjacent words converts the character labeling task into a gap labeling task. Unlike character labeling, gap labeling requires information from two adjacent characters, and the relationship between them can be represented as the type of gap. This characteristic of word boundaries makes bi-affine attention an appropriate scorer for the CWS task.

The bi-affine attention scorer is the component that we use to label the gap. Bi-affine attention is developed from bilinear attention and has been used in dependency parsing Dozat and Manning (2017) and SRL Cai et al. (2018). The distribution of labels in a labeling task is often uneven, so the output layer often includes a fixed bias term for the prior probability of different labels Cai et al. (2018). Bi-affine attention uses bias terms to alleviate the burden of the fixed bias term and to capture the prior probability, which distinguishes it from bilinear attention. The distribution of gap labels is uneven, as in other labeling tasks, which makes bi-affine attention a good fit.

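The bi-affine attention scorer labels the target using both the information of each independent unit and the joint information of the two units. In bi-affine attention, the score $s_{ij}$ of characters $c_i$ and $c_j$ is calculated by:

$$s_{ij} = (v_i^f)^T W v_j^b + U (v_i^f \oplus v_j^b) + b \tag{8}$$

where $v_i^f$ is the forward information of $c_i$ and $v_j^b$ is the backward information of $c_j$. In Equation (8), $W$, $U$ and $b$ are all parameters that are updated in training. $W$ is a matrix with shape $(d_i \times N \times d_j)$ and $U$ is a $(N \times (d_i + d_j))$ matrix, where $d_i$ is the dimension of vector $v_i^f$, $d_j$ is the dimension of vector $v_j^b$ and $N$ is the number of labels.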
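To make the shape of a bi-affine score concrete, the following sketch computes one score per label from a forward vector and a backward vector. It assumes the common bi-affine form (a bilinear term, a linear term over the concatenated vectors, and a bias); the function name, argument layout and toy parameters are hypothetical.

```python
def biaffine_score(v_f, v_b, W, U, b):
    """Score one gap for every label.

    v_f: forward vector of the character in front of the gap (length d_i)
    v_b: backward vector of the character behind the gap (length d_j)
    W:   one d_i x d_j matrix per label (bilinear term)
    U:   one weight vector of length d_i + d_j per label (linear term)
    b:   one bias per label
    """
    concat = list(v_f) + list(v_b)
    scores = []
    for k in range(len(b)):
        bilinear = sum(v_f[x] * W[k][x][y] * v_b[y]
                       for x in range(len(v_f))
                       for y in range(len(v_b)))
        linear = sum(u * c for u, c in zip(U[k], concat))
        scores.append(bilinear + linear + b[k])
    return scores


# Toy parameters with two labels, chosen so each term is easy to check by hand.
v_f, v_b = [1.0, 2.0], [3.0, 4.0]
W = [[[1.0, 0.0], [0.0, 1.0]],   # label 0: identity, gives the dot product
     [[0.0, 0.0], [0.0, 0.0]]]   # label 1: no bilinear contribution
U = [[0.0, 0.0, 0.0, 0.0],       # label 0: no linear contribution
     [1.0, 1.0, 1.0, 1.0]]       # label 1: sum of all components
b = [0.5, -1.0]
print(biaffine_score(v_f, v_b, W, U, b))
```

The bilinear term captures the joint information of the two characters, while the linear term and bias capture what each character contributes on its own and the prior over labels.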

Figure 4: An example of bi-affinal scorer labeling the gap. The bi-affinal attention scorer only uses the forward information of front character and the backward information of character to label the gap.
 | PKU | MSR | AS | CITYU
Sentences | 19,056 | 86,924 | 708,953 | 53,019
Max length (Character) | 1019 | 581 | 188 | 350
Max length (Word) | 659 | 338 | 211 | 85
Word Types | 55,303 | 88,119 | 141,340 | 69,085
Words | 1,109,947 | 2,368,391 | 5,449,698 | 1,455,629
Character Types | 4,698 | 5,167 | 6,117 | 4,923
Characters | 1,826,448 | 4,050,469 | 8,368,050 | 2,403,355
Table 3: The statistics of SIGHAN Bakeoff 2005 datasets.
Hyperparameter | Value
dimension of hidden vector | 256
number of layers | 6
dimension of FF | 1024
dropout | 0.1
warmup | 8000
number of heads | 4
batch size | 4096
Table 4: Hyperparameters.
Models | PKU F1 | PKU Training (hours) | PKU Test (sec.) | MSR F1 | MSR Training (hours) | MSR Test (sec.)
Chen et al. (2015a) | 95.9 | 50 | 105 | 96.2 | 100 | 120
Chen et al. (2015b) | 95.7 | 58 | 105 | 96.4 | 117 | 120
Liu et al. (2016) | 94.9 | - | - | 94.8 | - | -
Cai and Zhao (2016) | 95.2 | 48 | 95 | 96.4 | 96 | 105
Cai et al. (2017) | 95.4 | 3 | 25 | 97.0 | 6 | 30
Zhou et al. (2017) | 95.0 | - | - | 97.2 | - | -
Ma et al. (2018) | 95.4 | - | - | 97.5 | - | -
Wang et al. (2019)* | 95.7* | - | - | 97.4* | - | -
Our results | 95.1 | 10 | 4 | 97.5 | 95 | 4
Our results* | 95.3* | 12 | 4 | 97.7* | 90 | 4
Table 5: Results on PKU and MSR compared with previous models in closed test. The asterisks indicate the results of models with the unsupervised label from Wang et al. (2019).
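Models | AS F1 | AS Training (hours) | AS Test (sec.) | CITYU F1 | CITYU Training (hours) | CITYU Test (sec.)
Cai et al. (2017) | 95.2 | - | - | 95.4 | - | -
Ma et al. (2018) | 95.5 | - | - | 95.7 | - | -
Wang et al. (2019)* | 95.6* | - | - | 95.9* | - | -
Our results | 95.5 | 63 | 9 | 95.4 | 17 | 1.5
Our results* | 95.7* | 69 | 9 | 95.7* | 15 | 1.5
Table 6: Results on AS and CITYU compared with previous models in closed test. The asterisks indicate the results of models with the unsupervised label from Wang et al. (2019).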

In our model, the bi-affine scorer uses the forward information of the character in front of the gap and the backward information of the character behind the gap to distinguish the positions of the characters. Figure 4 is an example of labeling a gap. Using the bi-affine scorer ensures that word boundaries are determined by adjacent characters with different directional information. The score vector of a gap gives the probability of it being a word boundary. The model then generates all boundaries using an activation function in a greedy decoding manner.
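Greedy decoding over gaps can be sketched in a few lines: each gap is decided independently from its own boundary probability, so a sentence of n characters needs a single pass over its n - 1 gaps. The function name, the 0.5 threshold and the example probabilities below are illustrative assumptions, not values from the paper.

```python
def greedy_segment(chars, boundary_probs, threshold=0.5):
    """Split characters into words using one boundary probability per gap.

    chars:          list of n characters
    boundary_probs: list of n - 1 probabilities; entry k belongs to the gap
                    between chars[k] and chars[k + 1]
    """
    if not chars:
        return []
    if len(boundary_probs) != len(chars) - 1:
        raise ValueError("need exactly one probability per gap")
    words = []
    current = chars[0]
    for ch, prob in zip(chars[1:], boundary_probs):
        if prob > threshold:
            # The gap is a word boundary: close the current word.
            words.append(current)
            current = ch
        else:
            # No boundary: the character continues the current word.
            current += ch
    words.append(current)
    return words


chars = list("我爱北京天安门")
probs = [0.9, 0.8, 0.1, 0.95, 0.2, 0.1]  # made-up scores for illustration
print(greedy_segment(chars, probs))
```

No search over alternative segmentations is performed, which is why the quality of the result depends entirely on how well the encoder and scorer estimate each gap.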

3 Experiments

3.1 Experimental Settings


We train and evaluate our model on the datasets from SIGHAN Bakeoff 2005 Emerson (2005), which has four datasets: PKU, MSR, AS and CITYU. Table 3 shows the statistics of the training data. We use F-score to evaluate CWS models. To train the model with pre-trained embeddings on AS and CITYU, we use OpenCC to convert the data from traditional Chinese to simplified Chinese.
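For reference, a word-level F-score can be computed by comparing the character spans of predicted and gold words. The sketch below is a simplified stand-in for the official Bakeoff scoring script, with hypothetical function names: a predicted word counts as correct only if both of its boundaries match a gold word.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_score(gold_words, pred_words):
    """Harmonic mean of word-level precision and recall for one sentence."""
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["我", "爱", "北京", "天安门"]
pred = ["我", "爱", "北京", "天安", "门"]
print(round(f_score(gold, pred), 4))
```

In this example three of the five predicted words match gold words, giving precision 0.6 and recall 0.75.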

Pre-trained Embedding

We use only unigram features, so we train only character embeddings. Our pre-trained embeddings are trained on the Chinese Wikipedia corpus with the word2vec Mikolov et al. (2013) toolkit. The corpus used for pre-training is converted entirely to simplified Chinese and is not segmented. In the closed test, we use randomly initialized embeddings.


For different datasets, we use two kinds of hyperparameters which are presented in Table 4. We use hyperparameters in Table 4 for small corpora (PKU and CITYU) and normal corpora (MSR and AS). We set the standard deviation of Gaussian function in Equation (4) to 2. Each training batch contains sentences with at most 4096 tokens.


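To train our model, we use the Adam Kingma and Ba (2015) optimizer with its hyperparameters $\beta_1$, $\beta_2$ and $\epsilon$. The learning rate schedule is the same as Vaswani et al. (2017):

$$lr = d_{model}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup\_step^{-1.5}\right) \tag{9}$$

where $d_{model}$ is the dimension of embeddings, $step$ is the step number of training and $warmup\_step$ is the step number of warmup. While the number of steps is smaller than the number of warmup steps, the learning rate increases linearly; afterwards it decreases.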
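The warmup schedule of Vaswani et al. (2017) is short enough to write out directly. The sketch below uses the hidden dimension (256) and warmup steps (8000) listed in Table 4 as defaults; the function name is hypothetical.

```python
def learning_rate(step, d_model=256, warmup=8000):
    """Learning rate at a given training step (steps start at 1)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    # Linear growth during warmup, inverse square-root decay afterwards.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


for step in (1, 4000, 8000, 16000, 64000):
    print(step, f"{learning_rate(step):.6f}")
```

The two arguments of `min` are equal exactly at `step == warmup`, which is where the learning rate peaks.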

3.2 Hardware and Implementation

We trained our models on a single CPU (Intel i7-5960X) with an NVIDIA 1080 Ti GPU. We implement our model in Python with PyTorch 1.0.

3.3 Results

Tables 5 and 6 report the performance of recent models and ours in the closed test setting. Without the assistance of the unsupervised segmentation features used in Wang et al. (2019), our model outperforms all the other models on MSR and AS except Ma et al. (2018) and gets comparable performance on PKU and CITYU. Note that all the other models in this comparison adopt various $n$-gram features, while only our model takes unigram ones.

With the unsupervised segmentation features introduced by Wang et al. (2019), our model gets higher results. Specifically, the results on MSR and AS achieve new state-of-the-art, and those on CITYU and PKU approach the previous state-of-the-art. The unsupervised segmentation features are derived from the given training dataset, so using them does not violate the closed test rule of SIGHAN Bakeoff.

Table 7 compares our model and recent neural models in terms of open test setting in which any external resources, especially pre-trained embeddings or language models can be used. In MSR and AS, our model gets a comparable result while our results in CITYU and PKU are not remarkable.

However, it is well known that models are hard to compare in the open test setting, especially with pre-trained embeddings, because not all models use the same method and data for pre-training. Although pre-trained embeddings or language models can improve performance, the improvement may come from multiple sources. A pre-trained embedding often succeeds in improving performance, but this does not prove that the model itself is better.

Models | PKU | MSR | AS | CITYU
Chen et al. (2015a) | 96.4 | 97.6 | - | -
Chen et al. (2015b) | 96.5 | 97.4 | - | -
Liu et al. (2016) | 96.8 | 97.3 | - | -
Cai and Zhao (2016) | 95.5 | 96.5 | - | -
Cai et al. (2017) | 95.8 | 97.1 | 95.3 | 95.6
Chen et al. (2017b) | 94.3 | 96.0 | 94.6 | 95.6
Wang and Xu (2017) | 95.7 | 97.3 | - | -
Zhou et al. (2017) | 96.0 | 97.8 | - | -
Chen et al. (2017c) | - | 96.5 | 95.17 | -
Ma et al. (2018) | 96.1 | 98.1 | 96.2 | 97.2
Wang et al. (2019) | 96.1 | 97.5 | - | -
Our Method | 95.5 | 97.6 | 95.6 | 96.3
Table 7: F1 scores of our results on four datasets in open test compared with previous models.

Compared with LSTM models, our model performs better on AS and MSR than on CITYU and PKU. Considering the scale of the different corpora, we believe that corpus size affects our model: the larger the corpus, the better the model performs. On small corpora, the model tends to overfit.

Tables 5 and 6 also show the decoding time on the different datasets. Our model finishes segmentation with the least decoding time on all four datasets, thanks to an architecture that takes only attention mechanisms as basic blocks.

4 Related Work

4.1 Chinese Word Segmentation

CWS is the task in Chinese natural language processing of delimiting word boundaries. Xue (2003) was the first to formulate CWS as a sequence labeling task. Zhao et al. (2006) show that different character tag sets can have an essential impact on CWS. Peng et al. (2004) use CRFs as a model for CWS, achieving a new state-of-the-art. Work on statistical CWS has built the basis for neural CWS.

Neural word segmentation has been widely used to minimize the effort in feature engineering, which was important in statistical CWS. Zheng et al. (2013) introduce a neural model with sliding-window based sequence labeling. Chen et al. (2015a) propose a gated recursive neural network (GRNN) for CWS to incorporate complicated combinations of contextual characters and $n$-gram features. Chen et al. (2015b) use LSTM to learn long-distance information. Cai and Zhao (2016) propose a neural framework that eliminates context windows and utilizes the complete segmentation history. Lyu et al. (2016) explore a joint model that performs segmentation, POS tagging and chunking simultaneously. Chen et al. (2017a) propose a feature-enriched neural model for joint CWS and part-of-speech tagging. Zhang et al. (2017) present a joint model that enhances the segmentation of Chinese microtext by performing CWS and informal word detection simultaneously. Wang and Xu (2017) propose a character-based convolutional neural model that captures $n$-gram features automatically, together with an effective approach to incorporating word embeddings. Cai et al. (2017) improve the model in Cai and Zhao (2016) and propose a greedy neural word segmenter with balanced word and character embedding inputs. Zhao et al. (2018) propose a novel neural network model that incorporates unlabeled and partially labeled data. Zhang et al. (2018) propose two methods that extend the Bi-LSTM to incorporate dictionaries into neural networks for CWS. Gong et al. (2019) propose Switch-LSTMs to segment words and provide a more flexible solution for multi-criteria CWS, which makes it easy to transfer the learned knowledge to new criteria.

4.2 Transformer

The Transformer Vaswani et al. (2017) is an attention-based neural machine translation model. It is one kind of self-attention network (SAN), as proposed in Lin et al. (2017). Each encoder layer of the Transformer consists of one self-attention layer and a position-wise feed-forward layer. Each decoder layer contains one self-attention layer, one encoder-decoder attention layer and one position-wise feed-forward layer. The Transformer uses residual connections around the sublayers, each followed by layer normalization.

Scaled dot-product attention is the key component of the Transformer. The input of attention contains the queries, keys and values of the input sequence. The attention is generated from queries and keys as in Equation (2). The structure of scaled dot-product attention allows the self-attention layer to generate the representation of a sentence at once, covering the information of the whole sentence, unlike an RNN, which processes the characters of a sentence one by one. Standard self-attention is similar to Gaussian-masked directional attention, but it has neither the directional mask nor the Gaussian mask. Vaswani et al. (2017) also propose multi-head attention, which generates better sentence representations by dividing queries, keys and values into different heads and gathering information from different subspaces.

5 Conclusion

In this paper, we propose a Chinese word segmentation model based only on attention mechanisms. Our model uses self-attention from the Transformer encoder to take the sequence input and a bi-affine attention scorer to predict the labels of gaps. To improve the ability of the self-attention based encoder to capture localness and directional information, we propose a variant of self-attention called Gaussian-masked directional multi-head attention to replace standard self-attention. We also extend the Transformer encoder to capture directional features. Our model uses only unigram features instead of the multiple $n$-gram features of previous work. Evaluated on the standard benchmark dataset, SIGHAN Bakeoff 2005, our model not only performs segmentation faster than any previous model but also gives higher or comparable segmentation performance against previous state-of-the-art models.