Investigating Self-Attention Network for Chinese Word Segmentation

07/26/2019
by   Leilei Gan, et al.

Neural networks have become the dominant method for Chinese word segmentation. Most existing models cast the task as sequence labeling, using BiLSTM-CRF to represent the input and make output predictions. Recently, attention-based sequence models have emerged as a highly competitive alternative to LSTMs, allowing better running speed through parallel computation. We investigate the self-attention network for Chinese word segmentation, making comparisons with BiLSTM-CRF models. In addition, the influence of contextualized character embeddings is investigated using BERT, and a method is proposed for integrating word information into SAN segmentation. Results show that SAN gives highly competitive results compared with BiLSTMs, with BERT and word information further improving both in-domain and cross-domain segmentation. Our final models give the best results on six heterogeneous benchmarks.

1 Introduction

Word segmentation is a necessary pre-processing step for Chinese language processing Zhang and Clark (2007); Sun and Xu (2011); Jiang et al. (2013); Xu and Sun (2016); Cai et al. (2017). The dominant method treats Chinese Word Segmentation (CWS) as a sequence labeling problem Xue (2003), where neural network models Yang et al. (2017); Zhou et al. (2017); Zhang et al. (2016); Cai and Zhao (2016) have achieved state-of-the-art results. A representative model Chen et al. (2015b, a) takes an LSTM Hochreiter and Schmidhuber (1997) as the feature extractor, with a standard CRF Lafferty et al. (2001) layer on top of the BiLSTM layer to predict the label sequence.

Recently, the self-attention network (SAN) Vaswani et al. (2017) has been shown to be effective for a range of natural language processing tasks, such as machine translation Tang et al. (2018), constituency parsing Kitaev and Klein (2018), and semantic role labeling Tan et al. (2018). Compared with recurrent neural networks (RNNs) Elman (1990), SAN has the advantages of capturing long-term dependencies and supporting parallel computation more easily. However, its effectiveness on CWS has not been fully investigated in the literature.

We empirically investigate SAN for CWS by building a SAN-CRF word segmentor and studying the influence of global and local attention on segmentation accuracy. Based on the SAN-CRF segmentation model, we investigate two further questions. First, Chinese characters are highly polysemous, and the same character can have different meanings in different contexts. SAN has also been shown to be a useful method for training contextualized word representations Devlin et al. (2018); Radford et al. (2018). We compare context-independent character representations Mikolov et al. (2013); Pennington et al. (2014) with contextualized character representations in both in-domain and cross-domain CWS evaluation.

Second, out-of-vocabulary words, especially domain-specific noun entities, raise a challenge for cross-domain CWS. To address this problem, domain lexicons can be used Zhang et al. (2014, 2018) for cross-domain CWS tasks. We consider a novel method for integrating lexicons into SAN for cross-domain CWS, using attention to integrate word information by generalizing words into POS tags, resulting in end-to-end neural type-supervised domain adaptation.

Results on three benchmarks show that SAN-CRF achieves competitive performance compared with BiLSTM-CRF. In addition, BERT character embeddings are used for both in-domain and cross-domain evaluation. In cross-domain evaluation, the proposed neural type-supervised method gives an average error reduction of 30.32% on three cross-domain datasets. Our method gives the best results on standard benchmarks including CTB, PKU, MSR, ZX, FR and DL. To the best of our knowledge, we are the first to investigate SAN for CWS. Code and trained models will be made available.

Figure 1: Model Overview

2 Baseline

We take BiLSTM-CRF as our baseline, which has been shown to give state-of-the-art results Chen et al. (2015b); Yang et al. (2018). Formally, given an input sentence s = c_1 c_2 \dots c_n, where c_i denotes the i-th character, the task of character-based CWS is to assign each character c_i a label y_i, where y_i \in \{B, M, E, S\} Xue (2003). The labels B, M and E represent the beginning, middle and end of a word, respectively, and S represents a single-character word.

For each character c_i, its input representation x_i is the concatenation of the unigram character embedding e(c_i) and the bigram character embedding e(c_i c_{i+1}). Following Chen et al. (2015b), we take a BiLSTM to capture the context of each character in both forward and backward directions.

The forward hidden state \overrightarrow{h}_i of character c_i is calculated as follows:

i_i = \sigma(W_i x_i + U_i \overrightarrow{h}_{i-1} + b_i)
f_i = \sigma(W_f x_i + U_f \overrightarrow{h}_{i-1} + b_f)
o_i = \sigma(W_o x_i + U_o \overrightarrow{h}_{i-1} + b_o)
\tilde{m}_i = \tanh(W_m x_i + U_m \overrightarrow{h}_{i-1} + b_m)
m_i = f_i \odot m_{i-1} + i_i \odot \tilde{m}_i
\overrightarrow{h}_i = o_i \odot \tanh(m_i)        (1)

where i_i, f_i and o_i are the input, forget and output gates, respectively, \sigma and \odot are the element-wise sigmoid and product functions, respectively, and W, U and b are model parameters to learn.

The backward hidden state \overleftarrow{h}_i can be obtained in a similar way. The hidden state of character c_i is the concatenation of \overrightarrow{h}_i and \overleftarrow{h}_i:

h_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i        (2)

In the scoring layer, a CRF is used to model the dependencies between adjacent labels. The probability of a label sequence y = y_1 y_2 \dots y_n of sentence s is:

P(y \mid s) = \frac{\exp\big(\sum_{i=1}^{n} (\psi(y_i) + \phi(y_{i-1}, y_i))\big)}{\sum_{y' \in Y(s)} \exp\big(\sum_{i=1}^{n} (\psi(y'_i) + \phi(y'_{i-1}, y'_i))\big)}        (3)

where Y(s) is the set of all possible label sequences of sentence s, \psi(y_i) is the emission score of y_i and \phi(y_{i-1}, y_i) is the transition score from y_{i-1} to y_i.
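To make the scoring concrete, below is a minimal PyTorch sketch of the sequence score and partition function in Eq. (3); the emission scores are assumed to come from the encoder output and the transition matrix is a learned parameter (all names are ours):

```python
import torch

def crf_log_prob(emissions, transitions, labels):
    """Log-probability of a label sequence under a linear-chain CRF (Eq. 3).

    emissions:   (seq_len, num_labels)  emission scores from the encoder
    transitions: (num_labels, num_labels), transitions[a, b] is the score
                 of moving from label a to label b
    labels:      (seq_len,) gold label indices
    """
    seq_len, num_labels = emissions.shape

    # Score of the given label sequence: sum of emission and transition scores.
    gold = emissions[0, labels[0]]
    for t in range(1, seq_len):
        gold = gold + transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]

    # Partition function over all label sequences via the forward algorithm.
    alpha = emissions[0]                                   # (num_labels,)
    for t in range(1, seq_len):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    log_z = torch.logsumexp(alpha, dim=0)

    return gold - log_z
```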

3 Model

Figure 1 shows our segmentor framework on an input character sequence “中国科学院院士 (Fellow of the Chinese Academy of Sciences)”. The model takes character representations and positional embeddings as input. By matching the input against a word-POS lexicon, word information is incorporated using attention for each character. Multiple layers of self-attention network Vaswani et al. (2017) are used as the feature extractor, replacing the BiLSTM in the baseline. Similar to the baseline, a CRF layer is used on top of the self-attention network to model the dependencies between adjacent labels.

3.1 Embedding Layer

As shown in Figure 1, the embedding layer consists of character embeddings and positional embeddings. The character representation of c_i is the concatenation of the unigram character embedding e(c_i) and the bigram character embedding e(c_i c_{i+1}):

x_i = e(c_i) \oplus e(c_i c_{i+1})        (4)

where \oplus represents the concatenation operation.

Because the self-attention network does not explicitly model sequence order, positional embeddings are added to its input as follows:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})
h^0_{pos} = x_{pos} + PE_{pos}        (5)

where pos is the position, i is the dimension index, d_{model} is the dimension of the output, and + denotes vector addition.
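For illustration, a minimal PyTorch sketch of the sinusoidal positional embedding of Eq. (5), following the formulation of Vaswani et al. (2017); function and variable names are ours:

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional embeddings (assumes an even d_model).

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # (d_model/2,)
    angle = pos / torch.pow(10000.0, i / d_model)                 # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# The positional embedding is added to the character representation:
# h0 = char_repr + positional_encoding(seq_len, d_model)
```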

3.2 Self-Attention Network

We extend the model of Vaswani et al. (2017) for the SAN segmentor. The model has multiple identical layers, each of which is composed of a multi-head self-attention sub-layer and a position-wise fully connected feed-forward network.

Multi-head self-attention is used to exchange information directly between positions in the sequence. First, for single-head self-attention, the representations of a sequence are computed by scaled dot-product attention as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{T}}{\sqrt{d_k}}\Big)V        (6)

where Q = HW^Q, K = HW^K and V = HW^V are the query, key and value matrices computed from the layer input H, and W^Q, W^K and W^V are parameters.

Local Self-Attention  In order to investigate the effect of long-term dependencies on the CWS task, we propose a local self-attention, which only attends to surrounding positions of each character instead of all positions in the sequence. The intuition is that long-term dependencies may bring more noise than useful information in a sequence labeling task Luong et al. (2015). The local self-attention is denoted as:

\mathrm{LocalAttention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{T}}{\sqrt{d_k}} + M\Big)V        (7)

where M is a matrix that restricts self-attention to a window, with elements:

M_{ij} = \begin{cases} 0 & |i - j| \le w \\ -\infty & \text{otherwise} \end{cases}        (8)

Here w is the window size.
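The following PyTorch sketch illustrates one way to realize the windowed attention of Eqs. (7)-(8), masking out-of-window positions with -inf before the softmax; whether the original implementation applies the mask additively or multiplicatively is not stated, so this form is an assumption:

```python
import math
import torch
import torch.nn.functional as F

def local_self_attention(x, w_q, w_k, w_v, window: int):
    """Scaled dot-product self-attention restricted to a +/- `window` neighbourhood.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    Positions outside the window receive -inf before the softmax, so they get
    zero attention weight.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.size(-1)
    scores = q @ k.transpose(0, 1) / math.sqrt(d_k)           # (seq_len, seq_len)

    idx = torch.arange(x.size(0))
    mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() > window
    scores = scores.masked_fill(mask, float("-inf"))

    return F.softmax(scores, dim=-1) @ v
```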

Multi-Head Self-Attention is used, which linearly maps Q, K and V into multiple versions Q_j, K_j and V_j, and then concatenates the outputs of the different heads as follows:

\mathrm{MultiHead}(Q, K, V) = (\mathrm{head}_1 \oplus \dots \oplus \mathrm{head}_k)W^O        (9)

where

\mathrm{head}_j = \mathrm{Attention}(QW_j^Q, KW_j^K, VW_j^V)        (10)

and W_j^Q, W_j^K, W_j^V and W^O are parameters.

On top of the multi-head attention sub-layer, a fully connected feed-forward network (FFN) is applied to each position. The FFN is composed of two linear transformations with a ReLU activation in between:

\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2        (11)

4 Rich Character and Word Features

We incorporate rich character and word features into the SAN model. Specifically, pre-trained contextualized character representations are introduced, along with a word-based neural type-supervised domain adaptation method.

4.1 BERT Character Representation

BERT Devlin et al. (2018) is trained on large-scale corpora with a deep bidirectional Transformer using a masked LM objective. Usages of BERT can be divided into feature-based and fine-tuning methods. The former fixes all model parameters and directly extracts character features from the pre-trained model, while the latter jointly fine-tunes all parameters on downstream tasks. We take the latter method, feeding the input sequence of characters into BERT and using the top-layer output as the character representation. Development experiments show that fine-tuning BERT gives better results than the feature-based method.

Formally, character c_i is represented using the pre-trained BERT model conditioned on the whole sentence:

x_i = \mathrm{BERT}(c_1 c_2 \dots c_n)_i        (12)

where \mathrm{BERT}(\cdot)_i denotes the pre-trained BERT representation of the i-th character.
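As an illustration, the sketch below extracts per-character top-layer representations with the current HuggingFace transformers API; the original implementation uses the older pytorch-pretrained-BERT package (Section 5.2), and the checkpoint name bert-base-chinese is our assumption for the Chinese Simplified BERT model:

```python
import torch
from transformers import BertModel, BertTokenizer

# Checkpoint name is an assumption; the paper's footnote points to the
# earlier pytorch-pretrained-BERT package.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "中国科学院院士"   # example sentence from Figure 1
# Chinese BERT tokenizes CJK text character by character, so each character
# maps to one wordpiece; [CLS] and [SEP] are stripped afterwards.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():           # drop no_grad() when fine-tuning BERT jointly
    outputs = bert(**inputs)
char_repr = outputs.last_hidden_state[0, 1:-1]   # (num_chars, 768), cf. Eq. (12)
```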

4.2 Integrating Word-POS Lexicon for Type-Supervision

We integrate word information into SAN to handle rare words in cross-domain settings. Following the definition of Zhang et al. (2014), we describe the model in a cross-domain setting only, where D_s denotes a set of annotated source-domain sentences and L_t denotes an annotated target-domain lexicon, in which each word is associated with one POS tag. The domain adaptation model is first trained on D_s, and makes use of L_t when performing domain adaptation. In practice, our method can also be used in in-domain settings.

As shown in Figure 1, for each character c_i in the input sentence, the set of all character subsequences w_{b,e} = c_b \dots c_e that match words in the external lexicon and contain c_i is denoted as W(c_i), where b and e are the start and end indices of the matched word in the sentence and b \le i \le e. Word embeddings should intuitively be used for encoding the matched words. However, for domain-specific words there may not be readily available embeddings. POS embeddings can be used as alternative, unlexicalized features of words. We introduce how to integrate POS embeddings as word information from both the prediction and the training perspectives; a sketch of the span-matching step is given after Table 1.

Datasets         PKU     MSR     CTB6    ZX      FR      DL
Training set
  #sent          19.1k   86.9k   23.4k   (PKU is used as the training set for ZX, FR and DL)
  #word          1.11m   2.37m   641k
  #char          1.83m   4.05m   1.06m
Testing set
  #sent          1.9k    4.0k    2.8k    0.7k    1.3k    1.3k
  #word          0.10m   0.11m   0.70m   35.2k   35.3k   31.5k
  #char          0.17m   0.18m   1.16m   50.3k   64.2k   52.2k
Table 1: Statistics of datasets
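The lexicon matching step described above can be sketched as follows; the cap on candidate word length is an assumption motivated by the 4-character limit mentioned in Section 5.2:

```python
def match_lexicon(chars, lexicon, max_word_len=4):
    """Return, for every character position i, the spans (b, e) such that
    chars[b..e] is a word in the word-POS lexicon and b <= i <= e.

    `lexicon` maps words to POS tags; `max_word_len` caps candidate length.
    """
    n = len(chars)
    matches = [[] for _ in range(n)]
    for b in range(n):
        for e in range(b, min(b + max_word_len, n)):
            word = "".join(chars[b:e + 1])
            if word in lexicon:
                for i in range(b, e + 1):
                    matches[i].append((b, e, word, lexicon[word]))
    return matches

# Toy example with a hypothetical lexicon:
# match_lexicon(list("中国科学院院士"), {"中国科学院": "NR", "院士": "NN"})
```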

Prediction  During testing, we match character subsequences in a given input sentence against the word-POS lexicon L_t. For each matched word w, we obtain a vector representation by first performing a lookup in a word embedding table, and falling back to the corresponding POS embedding if no word embedding is available for the word:

v_w = \begin{cases} E_w(w) & \text{if } w \text{ is in the word embedding table} \\ E_p(t_w) & \text{otherwise} \end{cases}        (13)

where E_w and E_p are the word and POS embedding lookup tables, respectively, and t_w is the POS tag of w in the lexicon.
Training  Training is performed on the source-domain corpus only. We do not fine-tune word embeddings. The key task for knowledge transfer is the learning of POS embeddings, which offer a generalized representation for words not in the embedding lexicon. To this end, we randomly replace words in the training data with their gold-standard POS tags as follows:

(14)

where f(w_{b,e}) is the frequency of w_{b,e} in the training data and \tau is a chosen threshold.

The representation of w_{b,e} is:

(15)

where r is a random number and t_{w_{b,e}} is the gold-standard POS tag of w_{b,e}.
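A minimal sketch of the replacement step is given below. Since the exact replacement probability of Eq. (14) is not recoverable here, the sketch assumes that words whose training-set frequency falls below the threshold are replaced with a fixed probability; the actual schedule may differ:

```python
import random

def replace_with_pos(words, pos_tags, freq, threshold, p_replace=0.5):
    """Sketch of type-supervised replacement during training (cf. Eqs. 14-15).

    Assumption: a word is replaced by its gold POS tag with probability
    `p_replace` whenever its training-set frequency is below `threshold`;
    frequent words always keep their word embedding.
    """
    out = []
    for word, tag in zip(words, pos_tags):
        if freq.get(word, 0) < threshold and random.random() < p_replace:
            out.append(("POS", tag))      # look up the POS embedding table
        else:
            out.append(("WORD", word))    # look up the word embedding table
    return out
```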

Figure 2: Two methods of learning POS embeddings. In the left method, the characters of “张小凡 (person name)” all attend to the same POS tag NR. In the right method, different characters attend to different POS tags carrying positional information.

Considering the positional information of characters within a word, the set of POS tags T can be extended in combination with segmentation labels, giving T_b = \{t_B, t_M, t_E \mid t \in T\}. The difference between T and T_b is that, for a tag t \in T and a matched word w, if c_i is the first, middle or end character of w, the corresponding tag of c_i is t_B, t_M and t_E, respectively. Figure 2 shows the difference between T and T_b through an example.

For each character c_i, we integrate dictionary word information by augmenting its embedding with a word context vector a_i, which is the weighted sum of v_w over all spans w \in W(c_i) that contain c_i. In particular,

a_i = \sum_{w \in W(c_i)} \alpha_{i,w}\, v_w        (16)

where the weight for each context word is:

\alpha_{i,w} = \frac{\exp(\mathrm{score}(x_i, v_w))}{\sum_{w' \in W(c_i)} \exp(\mathrm{score}(x_i, v_{w'}))}        (17)

Considering computational efficiency, the score function takes a bilinear form:

\mathrm{score}(x_i, v_w) = x_i^{T} W_a v_w        (18)

where W_a is a parameter matrix. The output of the attention layer is the concatenation of the character embedding x_i and the context vector a_i:

\hat{x}_i = x_i \oplus a_i        (19)
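A sketch of Eqs. (16)-(19) for a single character follows; the bilinear score and the zero context vector used when no lexicon word matches are assumptions:

```python
import torch
import torch.nn.functional as F

def word_context_vector(char_repr, matched, word_emb, pos_emb, w_a):
    """Attention over lexicon matches for one character (cf. Eqs. 16-19).

    char_repr: (d_c,) representation of character c_i
    matched:   list of (word, pos_tag) pairs whose span contains c_i
    word_emb / pos_emb: dict-like embedding tables (word/tag -> tensor of size d_w)
    w_a:       (d_c, d_w) parameter matrix for the assumed bilinear score
    """
    if not matched:
        # Assumption: use a zero context vector when nothing matches.
        return torch.cat([char_repr, torch.zeros(w_a.size(1))])

    # Word embedding if available, otherwise fall back to the POS embedding (Eq. 13).
    values = torch.stack([
        word_emb[w] if w in word_emb else pos_emb[t] for w, t in matched
    ])                                                    # (num_matches, d_w)

    scores = values @ (w_a.transpose(0, 1) @ char_repr)   # x_i^T W_a v_w, Eq. (18)
    alpha = F.softmax(scores, dim=0)                      # Eq. (17)
    context = alpha @ values                              # Eq. (16)
    return torch.cat([char_repr, context])                # Eq. (19)
```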

4.3 Decoding and Training

For decoding, the Viterbi algorithm Viterbi (1967) is used to find the highest-scoring label sequence over an input sentence.

Given a training set with N samples \{(s^{(j)}, y^{(j)})\}_{j=1}^{N}, the loss function is the sentence-level negative log-likelihood with L_2 regularization:

L(\theta) = -\sum_{j=1}^{N} \log P(y^{(j)} \mid s^{(j)}) + \frac{\lambda}{2}\lVert\theta\rVert^2        (20)
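For reference, a minimal sketch of Viterbi decoding over emission and transition scores (names are ours):

```python
import torch

def viterbi_decode(emissions, transitions):
    """Highest-scoring label sequence under emission and transition scores.

    emissions:   (seq_len, num_labels)
    transitions: (num_labels, num_labels), transitions[a, b] = score of a -> b
    """
    seq_len, num_labels = emissions.shape
    score = emissions[0]                          # best score ending in each label
    backpointers = []
    for t in range(1, seq_len):
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)       # best previous label per current label
        backpointers.append(best_prev)

    best_last = int(score.argmax())
    path = [best_last]
    for best_prev in reversed(backpointers):
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))
```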
Figure 3: F1-value against training iterations

5 Experiments

We carry out an extensive set of experiments to investigate the effectiveness of SAN-CRF and the proposed neural type-supervised domain adaptation method across different domains under different settings. F1-value is taken as our main metric.

5.1 Datasets

We separately evaluate the proposed model in in-domain and cross-domain settings. For in-domain evaluation, CTB6 (Chinese Treebank 6.0), PKU and MSR are taken as the datasets. The train/dev/test split of CTB6 follows Zhang et al. (2016), while the splits of PKU and MSR are taken from the SIGHAN Bakeoff 2005 Emerson (2005). For cross-domain evaluation, PKU is used as the source domain, and three Chinese novel datasets, DL (DouLuoDaLu), FR (FanRenXiuXianZhuan) and ZX (ZhuXian) Qiu and Zhang (2015), are used as target domains. Following Zhang et al. (2014), we collect target-domain lexicons from Internet encyclopedia pages (https://baike.baidu.com/item/诛仙/12418, https://baike.baidu.com/item/凡人修仙传/54139, https://baike.baidu.com/item/斗罗大陆/5313). Table 1 shows the statistics of the datasets.

5.2 Experimental Settings

Table 2 shows the values of the model hyper-parameters. For the SAN CWS model, we use the Adam optimizer Kingma and Ba (2014). Following Vaswani et al. (2017), we increase the learning rate linearly for the first warmup_steps steps, and then decrease it proportionally to the inverse square root of the step number; warmup_steps is set to 1000. When BERT is used for character embeddings, the learning rate is set to 5e-6. For the baseline model, we use stochastic gradient descent (SGD) following Yang et al. (2018), with the initial learning rate set to 0.001, which gives better development results.
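A sketch of this schedule, following the Transformer learning-rate formula of Vaswani et al. (2017); the d_model scaling factor is part of that recipe and is an assumption here:

```python
def noam_lr(step, d_model=512, warmup_steps=1000):
    """Linear warm-up for the first `warmup_steps` steps, then decay
    proportional to the inverse square root of the step number."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# e.g. at the end of warm-up: noam_lr(1000) ~= 0.0014
```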

Parameter Value Parameter Value
Char emb size 50 SAN layer num 2
Word emb size 200 SAN head num 8
Bigram emb size 50 SAN hidden size 512
BERT emb size 768 SAN Inner size 2048
LSTM layer 1 SAN Relu dropout 0.1
LSTM hidden 200 Attention dropout 0.1
LSTM input dropout 0.1 Residual dropout 0.1
Batch size 32 Window size 5
Table 2: Hyper-parameter values
#Layer 1 2 3 4 5 6
F1 0.956 0.956 0.956 0.955 0.947 0.931
#Head 2 4 6 8 12 16
F1 0.956 0.956 0.955 0.956 0.957 0.955
Table 3: Effect of numbers of heads and layers

Character and Word Embedding  The pre-trained word embedding size is 200; the embeddings are trained based on word co-occurrence and the directions of word pairs Song et al. (2018), and the word length is restricted to 4. We use the topmost layer output of the pre-trained Chinese Simplified BERT model, with 12 layers, 768 hidden units and 12 heads (https://github.com/huggingface/pytorch-pretrained-BERT), as the character embedding. The bigram embeddings and character unigram embeddings used for attending to words are the same as those of Zhang et al. (2016).

5.3 Development Experiments

We perform development experiments on the CTB6 development set to investigate the influence of the hyper-parameters of the self-attention network on CWS, and compare the performance of SAN, especially local self-attention, with BiLSTM. In addition, we evaluate the effect of utilizing BERT for CWS models.

Figure 3 shows the curve of F1-value against the number of training iterations for different configurations. “_Bigram” denotes the model using both unigram and bigram information, and “_BERT” denotes the model replacing the word2vec character unigram representation with BERT. “SAN” represents the original self-attention network and “L-SAN” represents the local self-attention network. “BiLSTM” is our baseline model, which uses a bidirectional LSTM as the feature extractor.

Models CTB6 PKU MSR
Zhang et al. (2016) 96.0 95.7 97.7
Cai et al. (2017) - 95.8 97.1
Yang et al. (2017) 96.2 96.3 97.5
Zhou et al. (2017) 96.2 96.0 97.8
Zhang et al. (2018) 96.4 96.5 97.8
Ma et al. (2018) 96.7 96.1 98.1
BiLSTM + CRF 95.2 95.1 97.2
L-SAN + CRF 95.2 95.0 96.9
BiLSTM + CRF + BERT 97.2 96.6 98.0
L-SAN + CRF + BERT 97.4 96.7 98.3
Table 4: In-domain results

Width and Depth  Vaswani et al. (2017) show that increasing the number of layers can improve the performance of English-to-German translation. We investigate the effect of the number of layers on CWS by increasing it from 1 to 6 while fixing the number of heads to 8. The results are listed in Table 3. The model achieves the best F1-value of 0.956 with up to 3 layers, after which performance decreases as more layers are added, dropping to 0.931 with 6 layers. We fix the number of layers to 2 for the remaining experiments.

We also vary the number of heads in multi-head self-attention to investigate its effect on CWS performance. The number of layers and the dimension per head are fixed to 2 and 64, respectively. As shown in Table 3, as the number of heads increases from 2 to 16, the performance does not vary much. We fix the number of heads to 8 for the remaining experiments.

Effect of Local Attention  As Figure 3 shows, “L-SAN_Bigram” gives much better results than “SAN_Bigram”, which suggests that long-term dependencies can bring more noise than useful information. The proposed local self-attention network achieves competitive results compared with the baseline BiLSTM model.

Effect of BERT  By replacing word2vec character embeddings with BERT, both the BiLSTM and L-SAN models reach their best F1-values within several epochs, with a significant improvement, which shows that context-dependent character representations benefit the CWS task.

Model ZX FR DL
Liu and Zhang (2012) 87.2 87.5 91.4
Qiu and Zhang (2015) 87.4 86.7 91.9
Ye et al. (2019) 89.6 89.6 93.5
L-SAN + CRF + BERT 90.5 91.1 93.0
L-SAN + CRF + BERT + t 91.8 92.3 94.3
L-SAN + CRF + BERT + t_b 93.1 93.0 95.1
Table 5: Cross-domain results

5.4 Final Results

In-Domain Results  We evaluate our model on three news datasets: CTB6, PKU and MSR. The main results, together with recent state-of-the-art models, are listed in Table 4. Compared with the baseline “BiLSTM+CRF” model, the proposed “L-SAN+CRF” model achieves similar results, which shows that the self-attention network can be a competitive feature extractor for CWS alongside recurrent neural networks. When replacing word2vec character embeddings with BERT, the “BiLSTM+CRF” model gives 41.3%, 30.6% and 31.0% error reductions on CTB6/PKU/MSR, respectively, and the “L-SAN+CRF” model gives 41.3%, 32.7% and 41.3% error reductions, respectively. Finally, “L-SAN+CRF” slightly outperforms “BiLSTM+CRF” when using BERT as the unigram character representation.

Cross-Domain Results  We evaluate our model on the three cross-domain datasets ZX, FR and DL. The main results and three state-of-the-art models are listed in Table 5. “t” means that the neural type-supervised method is used to learn POS embeddings, with domain-specific words generalized to their corresponding tags. In “t_b”, we additionally learn different POS embeddings for different positions within a word.

As shown in Table 5, the F1-value of “L-SAN+CRF+BERT” shows an average 0.7 improvement over the state-of-the-art results of Ye et al. (2019) on ZX and FR, without using Ye et al. (2019)’s domain adaptation techniques. This may be because ZX, FR and DL are all Chinese novels, which contain a large number of noun entities and whose writing styles differ from the news domain. The result shows that BERT is highly effective for cross-domain CWS, even compared with strong domain adaptation methods. The “L-SAN+CRF+BERT+t” model gives 21.15%, 25.96% and 1.54% error reductions on the ZX/FR/DL datasets, respectively, which shows that the proposed neural type-supervised method handles out-of-vocabulary words more effectively. For characters within a word, instead of sharing the same POS embedding of the word, we further distinguish the POS embeddings of characters according to their positions in the word. “L-SAN+CRF+BERT+t_b” gives 33.65%, 32.69% and 24.62% error reductions on the three datasets, respectively. We attribute this to the additional supervision information.

Figure 4: F1-value against the sentence length

5.5 Analysis

Sentence Length  We compare the baseline model and the local self-attention network model, as well as the two models with BERT input representations, on different sentence lengths. Figure 4 shows the F1-value on the CTB6 test set. The two models without BERT show similar performance-length curves, which reach a peak at around 30-character sentences and decrease when the sentence length exceeds 90. One possible reason is that very short sentences are rare while long sentences are semantically more challenging. The two models using BERT both show more stable performance-length curves, which indicates that contextualized BERT representations stabilize performance against sentence length.

Word Count M1 M2 M3
唐三(person name) 273 0.98 0.99 1.00
韩立(person name) 185 0.07 0.67 1.00
戴沐白(person name) 159 0.01 0.31 1.00
小舞(person name) 153 0.90 0.98 1.00
张小凡(person name) 142 0.00 0.06 1.00
玄骨(person name) 114 0.96 0.97 0.98
魂狮(proper name) 90 1.00 1.00 1.00
宁荣荣(person name) 86 0.01 0.57 1.00
朱竹清(person name) 81 0.03 0.76 1.00
魂环(proper name) 72 1.00 1.00 1.00
魂力(proper name) 71 0.97 0.99 1.00
魂兽(proper name) 53 1.00 1.00 1.00
斗魂(proper name) 51 0.71 0.73 0.76
叶知秋(person name) 51 0.00 0.00 0.00
乌丑(person name) 45 0.97 1.00 1.00
average precision 108 0.55 0.73 0.96
Table 6: Segmentation precision of noun entities with the highest frequency.

Noun Entity Segmentation  Noun entities raise a key problem for cross-domain CWS. Table 6 shows the segmentation results of three models on the 15 most frequent noun entities in the three datasets. M1 and M2 represent “L-SAN+CRF+BERT” and “L-SAN+CRF+BERT+t”, respectively, while M3 represents “L-SAN+CRF+BERT+t_b”. As the table shows, the average precision of M1 is 0.55. By using the neural type-supervised domain adaptation method, the average precision of M2 improves by 0.18 absolute.

Some person names are still incorrectly segmented by M2, such as “戴沐白 (person name)” and “张小凡 (person name)”. When incorporating the positional information of characters within words, the average segmentation precision improves further and most noun entities are correctly segmented, except for the word “叶知秋 (person name)”. The reason is that the domain lexicon does not contain “叶知秋”. This shows that our method makes effective use of domain lexicons.
Case Study  We use two examples of neural type-supervised domain adaptation for illustration. In Example 1, “L-SAN+CRF+BERT” fails to handle the domain-specific noun entity “韩立 (person name)”, while the two neural type-supervised domain adaptation methods segment it correctly. In Example 2, only “L-SAN+CRF+BERT+t_b” segments the sentence correctly. One possible reason is that it may be difficult to distinguish between “戴沐白 (person name)”, which is a domain-specific noun entity, and “白虎 (white tiger)”, which is a common noun.

6 Related Work

Chinese Word Segmentation  Chen et al. (2015b, a) extract features based on character representations using LSTM or GRU. Zhang et al. (2016) propose a transition-based neural model which can utilize word-level features. Zhou et al. (2017) train character embeddings with word-based context information on auto-segmented data. Yang et al. (2017) exploit rich external resources through multi-task learning. For cross-domain CWS, Zhang et al. (2014) propose a type-supervised domain adaptation approach for joint CWS and POS-tagging, which shows competitive results compared to token-supervised methods. Qiu and Zhang (2015) investigate CWS for Chinese novels, proposing a method which automatically mines noun entities from novels using a double-propagation algorithm. Zhang et al. (2018) investigate how to integrate an external dictionary into CWS models. Similar to Zhang et al. (2014) and Zhang et al. (2018), our work uses domain lexicons. The difference is that we utilize POS embeddings through an end-to-end neural method.

Self-Attention Network  The self-attention network Vaswani et al. (2017) was first proposed for machine translation. Tan et al. (2018) and Strubell et al. (2018) use SAN for semantic role labeling, where it can directly capture the relationship between two arbitrary tokens in a sequence. Strubell et al. (2018) incorporate linguistic information through multi-task learning, including dependency parsing, part-of-speech tagging and predicate detection. Shen et al. (2018) propose multi-dimensional attention as well as directional information, achieving state-of-the-art results on natural language inference and sentiment analysis tasks.

Kitaev and Klein (2018) show that a novel encoder based on self-attention can lead to state-of-the-art results for the constituency parsing task. Along with this strand of work, we study the influence of global and local attention for CWS and build a SAN-CRF word segmentor, which gives competitive results compared with BiLSTMs.

#Example 1: 韩立也在光罩边缘处止住了下落的身影
      Han Li also stopped the falling figure at the edge of the mask
Gold Segmentation
韩立/也/在/光罩/边缘/处/止住/了/下落/的/身形
Han Li/also/at/the mask/the edge/of/stopped/x/the falling/figure
L-SAN+CRF+BERT
韩/立/也/在/光罩/边缘/处/止住/了/下落/的/身形
Han/Li/also/at/the mask/the edge/of/stopped/x/the falling/figure
L-SAN+CRF+BERT+t
韩立/也/在/光罩/边缘/处/止住/了/下落/的/身形
Han Li/also/at/the mask/the edge/of/stopped/x/the falling/figure
L-SAN+CRF+BERT+t_b
韩立/也/在/光罩/边缘/处/止住/了/下落/的/身形
Han Li/also/at/the mask/the edge/of/stopped/x/the falling/figure
#Example 2: 戴沐白虎掌上利刃弹开
      Dai Mubai pops up the blade on the palm
Gold Segmentation
戴沐白/虎掌/上/利刃/弹开
Dai Mubai/palm/on/blade/pops up
L-SAN+CRF+BERT
戴/沐/白虎/掌/上/利刃/弹开
Dai/Mu/white tiger/palm/on/blade/pops up
L-SAN+CRF+BERT+t
戴沐/白虎/掌上/利刃/弹开
Dai Mu/white tiger/palm/on/blade/pops up
L-SAN+CRF+BERT+t_b
戴沐白/虎掌/上/利刃/弹开
Dai Mubai/palm/on/blade/pops up
Table 7: Examples. x marks a grammatical particle with no direct English translation.

Contextualized Word Representation  Context-dependent word representations pre-trained on large-scale corpora have received much recent attention. ELMo Peters et al. (2018) is based on recurrent neural network language models. OpenAI GPT Radford et al. (2018) builds a left-to-right language model with a multi-layer multi-head self-attention network, which can handle long-term dependencies better than recurrent networks. Different from OpenAI GPT, BERT Devlin et al. (2018) uses a deep bidirectional Transformer pre-trained with a masked LM objective. Our work investigates the effect of contextualized character representations on both in-domain and cross-domain CWS under a unified SAN framework.

7 Conclusion

We investigated the self-attention network for Chinese word segmentation, demonstrating that it can achieve results comparable with recurrent network methods. We found that local attention gives better results than standard SAN. Under SAN, we also investigated the influence of rich character and word features, including BERT character embeddings and a neural attention method for integrating word information into character-based CWS. Extensive in-domain and cross-domain experiments show that the proposed SAN method achieves state-of-the-art performance on both in-domain and cross-domain Chinese word segmentation datasets.

References

  • Cai and Zhao (2016) Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for chinese. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 409–420.
  • Cai et al. (2017) Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and accurate neural word segmentation for chinese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 608–615.
  • Chen et al. (2015a) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1744–1753.
  • Chen et al. (2015b) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long short-term memory neural networks for chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
  • Emerson (2005) Thomas Emerson. 2005. The second international chinese word segmentation bakeoff. In Proceedings of the fourth SIGHAN workshop on Chinese language Processing.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Jiang et al. (2013) Wenbin Jiang, Meng Sun, Yajuan Lü, Yating Yang, and Qun Liu. 2013. Discriminative learning with natural annotations: Word segmentation as a case study. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 761–769.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2676–2686.
  • Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
  • Liu and Zhang (2012) Yang Liu and Yue Zhang. 2012. Unsupervised domain adaptation for joint segmentation and pos-tagging. Proceedings of COLING 2012: Posters, pages 745–754.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Ma et al. (2018) Ji Ma, Kuzman Ganchev, and David Weiss. 2018. State-of-the-art chinese word segmentation with bi-lstms. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4902–4908.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Qiu and Zhang (2015) Likun Qiu and Yue Zhang. 2015. Word segmentation for chinese novels. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
  • Shen et al. (2018) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. Disan: Directional self-attention network for rnn/cnn-free language understanding. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Song et al. (2018) Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. 2018. Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 175–180.
  • Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038.
  • Sun and Xu (2011) Weiwei Sun and Jia Xu. 2011. Enhancing chinese word segmentation using unlabeled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 970–979. Association for Computational Linguistics.
  • Tan et al. (2018) Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Tang et al. (2018) Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? a targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Viterbi (1967) Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans.informat.theory, 13(2):260–269.
  • Xu and Sun (2016) Jingjing Xu and Xu Sun. 2016. Dependency-based gated recursive neural network for chinese word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 567–572.
  • Xue (2003) Nianwen Xue. 2003. Chinese word segmentation as character tagging. International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, 8(1):29–48.
  • Yang et al. (2017) Jie Yang, Yue Zhang, and Fei Dong. 2017. Neural word segmentation with rich pretraining. arXiv preprint arXiv:1704.08960.
  • Yang et al. (2018) Jie Yang, Yue Zhang, and Shuailong Liang. 2018. Subword encoding in lattice lstm for chinese word segmentation. arXiv preprint arXiv:1810.12594.
  • Ye et al. (2019) Yuxiao Ye, Weikang Li, Yue Zhang, Likun Qiu, and Jian Sun. 2019. Improving cross-domain chinese word segmentation with word embeddings. arXiv preprint arXiv:1903.01698.
  • Zhang et al. (2014) Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Type-supervised domain adaptation for joint segmentation and pos-tagging. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 588–597.
  • Zhang et al. (2016) Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. Transition-based neural word segmentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 421–431.
  • Zhang et al. (2018) Qi Zhang, Xiaoyu Liu, and Jinlan Fu. 2018. Neural networks incorporating dictionaries for chinese word segmentation.
  • Zhang and Clark (2007) Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 840–847.
  • Zhou et al. (2017) Hao Zhou, Zhenting Yu, Yue Zhang, Shujian Huang, XIN-YU DAI, and Jiajun Chen. 2017. Word-context character embeddings for chinese word segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 760–766.