Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling

12/17/2021
by   Duc-Vu Nguyen, et al.

Chinese word segmentation and part-of-speech tagging are necessary tasks in computational linguistics and applications of natural language processing. Many researchers still debate the demand for Chinese word segmentation and part-of-speech tagging in the deep learning era. Nevertheless, resolving ambiguities and detecting unknown words are challenging problems in this field. Previous studies on joint Chinese word segmentation and part-of-speech tagging mainly follow the character-based tagging model focusing on modeling n-gram features. Unlike previous works, we propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging following span labeling, in which the main problem is estimating the probability of each n-gram being a word with a part-of-speech tag. We use the biaffine operation over the left and right boundary representations of consecutive characters to model the n-grams. Our experiments show that our BERT-based model SpanSegTag achieved competitive performance on the CTB5, CTB6, and UD datasets, and significant improvements on the CTB7 and CTB9 benchmark datasets, compared with the current state-of-the-art method using BERT or ZEN encoders.



1 Introduction

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are necessary tasks in computational linguistics and applications of natural language processing (NLP). There are two primary approaches to joint CWS and POS tagging: the two-step and one-step methods. The two-step approach first finds words and then assigns POS tags to the found words. ng-low-2004-chinese proposed the one-step approach, which combines CWS and POS tagging into a unified joint task. The one-step approach was proved better than the two-step approach by many prior studies [jiang-etal-2008-cascaded, jiang-etal-2009-automatic, sun-2011-stacked, zeng-etal-2013-graph, zheng-etal-2013-deep, kurita-etal-2017-neural, shao-etal-2017-character, Zhang2018ASA]. These studies proposed various methods incorporating linguistic features or contextual information into their joint models. Remarkably, tian-etal-2020-joint proposed a two-way attention mechanism incorporating both context features and corresponding syntactic knowledge from off-the-shelf toolkits for each input character.

Figure 1: The architecture of SpanSegTag for the joint CWS and POS tagging with two stages via span labeling: word segmentation and POS tagging.

To the best of our knowledge, all previous studies on joint CWS and POS tagging follow the character-based tagging paradigm. Character-based tagging effectively produces the best combination of word boundaries and POS tags. However, this paradigm does not give a clear explanation when processing overlapping ambiguous strings. From the view of experimental psychology on human perception and performance, Ma_2014 concluded that multiple words constituted by the characters in the perceptual span are activated when processing overlapping ambiguous strings. Besides, tian-etal-2020-improving-chinese showed that modeling word-hood for n-gram information is essential for CWS. Likewise, the current state-of-the-art method for joint CWS and POS tagging confirmed the importance of modeling words and their knowledge, e.g., POS tags [tian-etal-2020-joint].

The previous studies from the two views of experimental psychology [Ma_2014] and computational linguistics [tian-etal-2020-improving-chinese, tian-etal-2020-joint] inspired us to propose the span labeling approach for joint CWS and POS tagging. To avoid making the model size dependent on the number of n-grams and their corresponding POS tags, we use spans to model n-grams and n-grams with POS tags, instead of using the memory networks of [tian-etal-2020-improving-chinese, tian-etal-2020-joint]. More particularly, inspired by stern-etal-2017-minimal, ijcai2020-560, and [vund-etal-2021-spanseg], we use the biaffine operation over the left and right boundary representations of consecutive characters to model n-grams and their POS tags. Following the prior work of vund-etal-2021-spanseg, we use a simple post-processing heuristic algorithm instead of other models to deal with the overlapping ambiguity phenomenon [li-etal-2003-unsupervised, gao-etal-2005-chinese]. Finally, we experimented with BiLSTM [hochreiter97] and BERT [devlin-etal-2019-bert] encoders.

Our experiments show that our BERT-based model SpanSegTag achieved competitive performance on the CTB5, CTB6, UD1, and UD2 datasets, and significant improvements on the two large benchmark datasets CTB7 and CTB9, compared with the current state-of-the-art method using BERT or ZEN encoders [tian-etal-2020-joint]. Our SpanSegTag does not perform best on all five Chinese benchmark datasets. However, SpanSegTag achieved good recall of in-vocabulary words and their POS tags on the CTB6, CTB7, and CTB9 datasets. This score is used to measure the performance of a segmenter in resolving ambiguities in word segmentation [gao-etal-2005-chinese].

2 The Proposed Framework

We present the architecture of our proposed framework, namely SpanSegTag, for joint CWS and POS tagging in Figure 1. As we can see in Figure 1, data path (1) indicates the input sentence being fed into the BERT encoder. The hidden state vector from the BERT encoder is chunked into two vectors of the same size, serving as the forward and backward vectors of the familiar BiLSTM encoder. Next, all boundary representations are fed into the Scorer module. Data path (2) indicates the span representations for the word segmentation task, and data path (5) indicates the span representations for the POS tagging task. Data path (3) indicates predicted spans representing predicted word boundaries. The SpanPostProcessor module produces the predicted spans satisfying non-overlapping between every two spans. Finally, given data paths (4) and (5), data path (6) indicates the joint CWS and POS tagging output.

2.1 Joint Chinese Word Segmentation and Part-of-speech Tagging as Two Stages Span Labeling

The input of joint CWS and POS tagging is a sequence of characters $X = x_1 x_2 \dots x_n$ with length $n$. Given the input sentence $X$, the output of CWS is a sequence of words $W = w_1 w_2 \dots w_m$ with length $m$, and the output of Chinese POS tagging is a sequence of POS tags $T = t_1 t_2 \dots t_m$ with length $m$, where $m \le n$. Besides, we have the property that a Chinese word is constituted by one Chinese character or several consecutive characters. Therefore, we use the sequence of characters $x_j x_{j+1} \dots x_{j+k-1}$ to denote that a word is constituted by $k$ consecutive characters beginning at character $x_j$, where $k = 1$ represents single-character words and $k > 1$ represents compound words. We take the inspiration of span representation in constituency parsing [stern-etal-2017-minimal] and use the span $(l, r)$ to represent the word constituted by the consecutive characters $x_{l+1} x_{l+2} \dots x_r$, where $l$ and $r$ ($0 \le l < r \le n$) are the left and right boundary indices of the word, respectively.
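As a minimal illustration of this notation, the following Python sketch (our own, not part of the model; the helper name words_to_spans is hypothetical) converts a segmented sentence into its word spans:

```python
# Convert a list of words into (l, r) character-boundary spans,
# where l and r are the left and right boundary indices of each word.
def words_to_spans(words):
    spans, left = [], 0
    for word in words:
        right = left + len(word)  # exclusive right boundary
        spans.append((left, right))
        left = right
    return spans

# A 5-character sentence segmented into words of lengths 2, 1, and 2
# yields the spans (0, 2), (2, 3), and (3, 5).
print(words_to_spans(["北京", "是", "首都"]))  # [(0, 2), (2, 3), (3, 5)]
```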

After presenting notations, we propose our approach for the joint CWS and POS tagging problem. Firstly, to our knowledge, most recent works focus on modeling the probability that a Chinese character is assigned one tag from the cross product of the BIES segmentation tag set and the Chinese POS tag set. Next, the current state-of-the-art method for CWS approaching BIES tagging, tian-etal-2020-improving-chinese, proposed a word-hood memory to model n-gram information. Additionally, the current state-of-the-art method for joint CWS and POS tagging approaching BIES tagging, tian-etal-2020-joint, showed that modeling n-gram knowledge, e.g., the word and its POS tag, is essential. Therefore, we take inspiration from tian-etal-2020-improving-chinese and tian-etal-2020-joint and focus on modeling words and POS tags in a straightforward way rather than modeling BIES tags of characters. Given the input sentence $X$, our idea is to model the probability that the consecutive Chinese characters $x_{l+1} \dots x_r$ form a word via one formulation. Similarly, given the input sentence $X$, we also model the probability that these consecutive Chinese characters can be assigned a specific POS tag or the non-word tag via one formulation. To summarize, given span representations, we formalize the joint CWS and POS tagging task as two consecutive sub-tasks in our SpanSegTag as follows: (i) binary classification dealing with word segmentation; (ii) multi-class classification dealing with POS tagging.

Formally, the first stage of our SpanSegTag for CWS can be formalized as:

$\hat{W} = \mathrm{SpanPostProcessor}(\hat{Y}_{\mathrm{seg}})$  (1)

where SpanPostProcessor() is introduced in the work of vund-etal-2021-spanseg. SpanPostProcessor() is solely an algorithm for producing word segmentation boundaries that guarantee non-overlapping between every two spans. The set $\hat{Y}_{\mathrm{seg}}$ of predicted spans is defined as follows:

$\hat{Y}_{\mathrm{seg}} = \{(l, r) \mid \mathrm{Scorer}(X).\mathrm{Seg}(l, r) > 0.5,\; 0 \le l < r \le n\}$  (2)

where $n$ is the length of the input sentence $X$. The indices $l$ and $r$ denote the left and right boundaries of a specific span. Scorer().Seg is the scoring module for the span $(l, r)$ of sentence $X$. The output of Scorer().Seg has a value in the range of 0 to 1, since we choose the sigmoid function as the activation function at the last layer of the Scorer().Seg module.

Next, given the set $\hat{W}$ of predicted spans satisfying non-overlapping between every two spans for the input sentence $X$, the second stage of our SpanSegTag, performing Chinese POS tagging, can be formalized as:

$\hat{Y}_{\mathrm{tag}} = \{(l, r, \hat{t}) \mid \hat{t} = \operatorname{argmax}_{t \in \mathcal{T} \cup \{\text{non-word}\}} \mathrm{Scorer}(X).\mathrm{Tag}(l, r, t),\; (l, r) \in \hat{W}\}$  (3)

where $\mathcal{T} \cup \{\text{non-word}\}$ is the union of the Chinese POS tag set $\mathcal{T}$ and the non-word tag, since $\hat{W}$ can include incorrectly predicted spans. Scorer().Tag is the scoring module for the span $(l, r)$ of sentence $X$ assigned tag $t$. To sum up, given the input sentence $X$, the set $\hat{Y}_{\mathrm{tag}}$ includes the predicted spans with their POS tags. Therefore, the set $\hat{Y}_{\mathrm{tag}}$ is the result of the second stage of our SpanSegTag and of the joint CWS and POS tagging task.
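The two stages can be sketched in a few lines of Python. This is our illustrative reading of Equations 1-3, not the authors' code: seg_score and tag_scores are hypothetical callables wrapping Scorer().Seg and Scorer().Tag, and span_post_processor is sketched in subsection 2.2 below.

```python
NON_WORD = "<non-word>"

def predict(n, seg_score, tag_scores, tagset, max_len=7):
    # Stage 1: candidate word spans whose sigmoid score exceeds 0.5
    # (spans longer than max_len characters are discarded, see Eq. 4).
    candidates = {(l, r): s
                  for l in range(n)
                  for r in range(l + 1, min(l + max_len, n) + 1)
                  if (s := seg_score(l, r)) > 0.5}
    # Resolve overlaps and fill missing boundaries (subsection 2.2).
    words = span_post_processor(candidates, n)
    # Stage 2: argmax over the POS tag set plus the non-word tag.
    output = []
    for (l, r) in words:
        scores = tag_scores(l, r)  # dict: tag -> score
        best = max(tagset + [NON_WORD], key=lambda t: scores[t])
        output.append((l, r, best))
    return output
```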

The main idea of our SpanSegTag is formalized through Equations 1, 2, and 3. To train our SpanSegTag, we have to optimize the parameters of the Scorer().Seg and Scorer().Tag modules. As we can clearly see, there are no parameters in the SpanPostProcessor() module. However, optimizing the parameters of Scorer().Tag based on the set $\hat{W}$ lets our SpanSegTag learn indirectly from the result of SpanPostProcessor(). For example, if an incorrect span is assigned the non-word tag, then our SpanSegTag is trained to deal with this case via the Scorer().Tag module.

Therefore, the cost function for training our SpanSegTag is the combined loss of binary classification and multi-class classification. The cost function for training CWS in our SpanSegTag is

$\mathcal{L}_{\mathrm{seg}} = -\frac{1}{|\mathcal{D}|} \sum_{(X, W) \in \mathcal{D}} \sum_{0 \le l < r \le n} \big( y_{(l,r)} \log \mathrm{Scorer}(X).\mathrm{Seg}(l, r) + (1 - y_{(l,r)}) \log (1 - \mathrm{Scorer}(X).\mathrm{Seg}(l, r)) \big)$  (4)

where $\mathcal{D}$ is the training set and $|\mathcal{D}|$ is its size. For each pair $(X, W)$ in the training set $\mathcal{D}$, we compute the binary cross-entropy loss over all spans $(l, r)$, where $0 \le l < r \le n$ and $n$ is the length of sentence $X$. The term $y_{(l,r)}$ has the value 1 if span $(l, r)$ belongs to the word list $W$ of sentence $X$ and 0 otherwise; conversely, the term $1 - y_{(l,r)}$ has the value 1 if span $(l, r)$ does not belong to the word list $W$ of sentence $X$. Notably, during training and prediction, we discard spans with length greater than 7, the maximum n-gram length following [diao-etal-2020-zen], to reduce the number of negative spans.
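Equation 4 reduces to a standard binary cross-entropy over candidate spans. A minimal PyTorch sketch, assuming probs holds the Scorer().Seg outputs for the candidate spans of one sentence and gold marks gold-word spans with 1 (batch averaging over the training set is omitted):

```python
import torch
import torch.nn.functional as F

def seg_loss(probs: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy averaged over all candidate spans of a sentence.
    return F.binary_cross_entropy(probs, gold)

# Three candidate spans, one of which is a gold word:
probs = torch.tensor([0.9, 0.2, 0.6])
gold = torch.tensor([1.0, 0.0, 0.0])
print(seg_loss(probs, gold))  # tensor(0.4149), i.e. -(log 0.9 + log 0.8 + log 0.4)/3
```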

Next, the cost function for training Chinese POS tagging in our SpanSegTag is the cross-entropy loss:

$\mathcal{L}_{\mathrm{tag}} = -\frac{1}{|\mathcal{D}|} \sum_{(X, W, T) \in \mathcal{D}} \sum_{(l, r) \in \hat{W}} \log \mathrm{Scorer}(X).\mathrm{Tag}(l, r, t_{(l,r)})$  (5)

where $t_{(l,r)}$ denotes the ground-truth label of span $(l, r)$ from $\hat{W}$ in the input sentence $X$. Finally, the cost function for training our SpanSegTag is

$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \mathcal{L}_{\mathrm{tag}}$  (6)

2.2 Decoding Algorithm for Predicted Span

As noted in the prior work of vund-etal-2021-spanseg, in the predicted span set $\hat{Y}_{\mathrm{seg}}$ mentioned in Equation 2 there may exist overlapping between two spans. To solve this, vund-etal-2021-spanseg keeps the spans with the highest scores and eliminates the remainder. The overlapping ambiguity phenomenon happens when our SpanSegTag predicts compound words. Additionally, our SpanSegTag encounters the missing word boundary problem, which can be caused by the originally predicted spans, by the consequence of solving overlapping ambiguity, or by spans longer than seven characters mentioned in subsection 2.1. Finally, to deal with the missing word boundary problem, we map all characters not covered by predicted spans to single-character words, following vund-etal-2021-spanseg. The details of this algorithm are given in the work of vund-etal-2021-spanseg. To sum up, SpanPostProcessor() is a heuristic algorithm, while the inference algorithm in [ye-ling-2018-hybrid] is optimal.
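Since the exact algorithm is given in vund-etal-2021-spanseg, the following is only a heuristic sketch of SpanPostProcessor consistent with the description above: keep the higher-scoring span when two spans overlap, then map uncovered characters to single-character words.

```python
def span_post_processor(scored, n):
    """scored: dict mapping candidate span (l, r) to its score; n: sentence length."""
    kept = []
    # Greedily keep spans in order of descending score, skipping overlaps.
    for (l, r), score in sorted(scored.items(), key=lambda kv: -kv[1]):
        if all(r <= kl or l >= kr for (kl, kr) in kept):
            kept.append((l, r))
    # Fill missing word boundaries with single-character words.
    covered = {i for (l, r) in kept for i in range(l, r)}
    kept += [(i, i + 1) for i in range(n) if i not in covered]
    return sorted(kept)

# Spans (0, 2) and (1, 3) overlap: the higher-scoring (0, 2) survives,
# and the uncovered characters 2 and 3 become single-character words.
print(span_post_processor({(0, 2): 0.9, (1, 3): 0.7}, 4))  # [(0, 2), (2, 3), (3, 4)]
```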

2.3 Span Scoring

Inspired by ijcai2020-560, the span scoring module for finding the probability that span $(l, r)$ is a word is computed by using a biaffine operation over the left boundary representation at position $l$ and the right boundary representation at position $r$:

$\mathrm{Scorer}(X).\mathrm{Seg}(l, r) = \sigma\big( [\mathbf{h}^{\mathrm{left}}_{l}; 1]^{\top} \mathbf{W}_{\mathrm{seg}} [\mathbf{h}^{\mathrm{right}}_{r}; 1] \big)$  (7)

where $\mathbf{W}_{\mathrm{seg}} \in \mathbb{R}^{(d+1) \times (d+1)}$ and the symbol $;$ denotes the concatenation operation. Similarly, the span scoring module for finding the score of a POS tag is computed by:

$\mathrm{Scorer}(X).\mathrm{Tag}(l, r, \cdot) = \mathrm{softmax}\big( [\mathbf{h}^{\mathrm{left}}_{l}; 1]^{\top} \mathbf{W}_{\mathrm{tag}} [\mathbf{h}^{\mathrm{right}}_{r}; 1] \big)$  (8)

where $\mathbf{W}_{\mathrm{tag}} \in \mathbb{R}^{(d+1) \times |\mathcal{T} \cup \{\text{non-word}\}| \times (d+1)}$. As mentioned in subsection 2.1, we have $0 \le l < r \le n$, where $n$ is the length of input sentence $X$. The boundary representations $\mathbf{h}^{\mathrm{left}}$ and $\mathbf{h}^{\mathrm{right}}$ are produced by multilayer perceptrons that transform hidden states from the encoder into left and right boundary representations with output dimension $d$ for the CWS and POS tagging tasks. The inputs to these perceptrons are the forward and backward hidden state vectors from the BiLSTM encoder. In case we use the BERT encoder, we chunk the hidden state vector from the BERT encoder into two vectors of the same size as the forward and backward hidden state vectors of the BiLSTM encoder.
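A compact PyTorch sketch of the biaffine Scorer follows. The dimensions and initialization are our assumptions, not the authors' exact configuration; n_tags = 34 assumes the 33 CTB POS tags plus the non-word tag, and the input hidden states are split into halves as described above for the BERT encoder.

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    def __init__(self, hidden: int = 768, mlp: int = 500, n_tags: int = 34):
        super().__init__()
        half = hidden // 2
        self.left_mlp = nn.Sequential(nn.Linear(half, mlp), nn.ReLU())
        self.right_mlp = nn.Sequential(nn.Linear(half, mlp), nn.ReLU())
        # Biaffine weights; the extra +1 dimension appends a bias term
        # to each boundary representation, as in Equations 7 and 8.
        self.w_seg = nn.Parameter(torch.randn(mlp + 1, mlp + 1) * 0.01)
        self.w_tag = nn.Parameter(torch.randn(mlp + 1, n_tags, mlp + 1) * 0.01)

    def forward(self, states: torch.Tensor):
        # states: (n, hidden) encoder outputs, chunked into pseudo
        # forward/backward halves (a BiLSTM provides these directly).
        fwd, bwd = states.chunk(2, dim=-1)
        ones = states.new_ones(states.size(0), 1)
        hl = torch.cat([self.left_mlp(fwd), ones], dim=-1)   # (n, mlp+1)
        hr = torch.cat([self.right_mlp(bwd), ones], dim=-1)  # (n, mlp+1)
        # Word probability for every span; only cells with l < r are used.
        seg = torch.sigmoid(hl @ self.w_seg @ hr.T)               # (n, n)
        tag = torch.einsum("lx,xty,ry->lrt", hl, self.w_tag, hr)  # (n, n, n_tags)
        return seg, tag.softmax(dim=-1)
```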

2.4 Encoder Architecture

To experiment with our proposed SpanSegTag, we use a BiLSTM encoder [hochreiter97] and a BERT encoder for Chinese [devlin-etal-2019-bert]. In case we use the BiLSTM encoder, we use the pre-trained Chinese character embeddings with dimension 64 provided by shao-etal-2017-character. In case we use the BERT encoder, we use only the hidden states of the last layer of BERT, as in tian-etal-2020-joint.

3 Experiments

3.1 Datasets

Datasets # Sent # Char # Word OOV
CTB5 Train 18,104 804,587 493,930 -
Dev 352 11,543 6,821 8.1
Test 348 13,738 8,008 3.5
CTB6 Train 23,420 1,055,583 641,368 -
Dev 2,079 100,316 59,955 5.4
Test 2,796 134,149 81,578 5.6
CTB7 Train 31,112 1,160,209 717,874 -
Dev 10,043 387,209 236,590 5.5
Test 10,292 398,626 245,011 5.2
CTB9 Train 105,971 2,642,998 1,696,340 -
Dev 9,850 209,739 136,468 2.9
Test 15,929 378,502 242,317 3.1
UD Train 3,997 156,309 98,608 -
Dev 500 20,000 12,663 12.1
Test 500 19,206 12,012 12.4
Table 1: Statistics of five Chinese benchmark datasets. We provide the number of sentences, characters, and words. We also compute the out-of-vocabulary (OOV) rate as the percentage of unseen words in the dev and test set.

We employ the CTB5, CTB6, CTB7, and CTB9 benchmark datasets from the Penn Chinese Treebank [xue_xia_chiou_palmer_2005] (we officially employ the Penn Chinese TreeBank data (LDC2016T13) from the Linguistic Data Consortium), which has been widely used in research on joint CWS and POS tagging. There are 33 POS tags in CTB. The train/dev/test splits for CTB5, CTB6, CTB7, and CTB9 follow previous studies [zhang-etal-2014-type, yang-xue-2012-chinese, wang-etal-2011-improving, shao-etal-2017-character]. We also use UD1 and UD2 to denote the datasets using the universal tag set and the Chinese tag set from UD [nivre-etal-2016-universal], respectively, following the research of tian-etal-2020-joint (we use the UD_Chinese-GSD dataset, version 2.4, extracted from https://universaldependencies.org/).

3.2 Implementation

SpanSegTag CTB5 CTB6 CTB7 CTB9 UD1 UD2
Encoder MLP Size Seg Tag Seg Tag Seg Tag Seg Tag Seg Tag Seg Tag
BiLSTM 100 96.71 92.80 94.33 89.43 94.46 89.17 95.64 91.27 91.84 85.21 91.48 84.80
200 96.90 93.08 94.90 90.06 94.70 89.36 95.96 91.57 92.36 85.92 92.27 85.78
300 97.03 93.21 95.00 90.06 94.86 89.39 96.05 91.61 92.43 86.14 92.72 85.93
400 96.82 93.27 95.18 90.16 95.04 89.53 96.15 91.54 93.02 86.45 92.84 86.03
500 97.30 93.39 95.29 90.19 95.10 89.53 96.27 91.61 93.08 86.74 93.12 86.29
BERT 100 98.76 97.78 97.71 95.25 97.06 94.16 97.75 94.92 98.21 95.51 98.22 95.38
200 98.78 97.71 97.66 95.25 97.11 94.24 97.78 95.07 98.23 95.64 98.21 95.50
300 98.56 97.54 97.70 95.24 97.12 94.27 97.74 95.02 98.35 95.72 98.22 95.49
400 98.57 97.64 97.69 95.26 97.05 94.18 97.80 95.10 98.28 95.70 98.17 95.44
500 98.81 97.78 97.69 95.23 97.10 94.22 97.80 95.01 98.30 95.66 98.30 95.44
Table 2: Experimental results on development sets of six Chinese benchmark datasets.

The number of BiLSTM layers is 1, and the hidden state size of the BiLSTM is 200. The dropout rate for the embedding, BiLSTM, and MLPs is 0.1. We inherit the remaining hyper-parameters from the work of [dozat2017deep]. We trained all models for up to 100 epochs with an early stopping strategy. We used the AdamW optimizer [loshchilov2019decoupled] with the default configuration. The batch size for training and evaluating is up to 5,000.

We also did fine-tuning experiments based on BERT [devlin-etal-2019-bert]. We trained all models for up to 100 epochs with an early stopping patience of 15 epochs, following tian-etal-2020-joint. The dropout rate for MLPs is 0.1. We used the AdamW optimizer [loshchilov2019decoupled] with the default configuration. The batch size for training is 16.

All models were selected based on their performance on the development set. Following previous research, the main measure we use is the F-score. To evaluate the F-score of joint CWS and POS tagging, we use the seqeval library (https://github.com/chakki-works/seqeval), following the research of tian-etal-2020-joint. We also use the paired t-test, following the guide of [P18-1128], to test the significance of our results.
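For concreteness, the joint F-score computed by seqeval treats each word together with its POS tag as a chunk over BIES-style character labels; this small example is ours, not from the paper:

```python
from seqeval.metrics import f1_score

# Gold and predicted character-level labels for one sentence: a word is a
# B...E (or S) chunk, and its POS tag is the chunk type.
y_true = [["B-NR", "E-NR", "S-VV", "B-NN", "E-NN"]]
y_pred = [["B-NR", "E-NR", "S-VV", "S-NN", "S-NN"]]

# Two of the three gold (word, tag) chunks are matched exactly:
# precision 2/4, recall 2/3, F1 ≈ 0.571.
print(f1_score(y_true, y_pred))
```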

3.3 Development Performance

In Table 2, we show the performance of SpanSegTag with different output sizes of the MLPs mentioned in subsection 2.3. Concerning the BiLSTM encoder, larger MLP sizes give higher performance on all datasets. Because we regard joint CWS and POS tagging as a span labeling task, it requires more contextual information; in dependency parsing, dozat2017deep likewise chose an MLP size of 500 for unlabeled parsing. Regarding the BERT encoder, the results of different MLP sizes are not as clearly distinguished as those of the BiLSTM encoder, since the BERT encoder provides better contextual information.

CTB5 CTB6 CTB7 CTB9 UD1 UD2
Seg Tag Seg Tag Seg Tag Seg Tag Seg Tag Seg Tag
jiang-etal-2008-cascaded 97.85 93.41 - - - - - - - - - -
kruengkrai-etal-2009-error 97.87 93.67 - - - - - - - - - -
sun-2011-stacked 98.17 94.02 - - - - - - - - - -
wang-etal-2011-improving 98.11 94.18 95.79 91.12 95.65 90.46 - - - - - -
shen-etal-2014-chinese 98.03 93.80 - - - - - - - - - -
kurita-etal-2017-neural 98.41 94.84 - - 96.23 91.25 - - - - - -
shao-etal-2017-character 98.02 94.38 - - - - 96.67 92.34 95.16 89.75 95.09 89.42
Zhang2018ASA 98.50 94.95 96.36 92.51 96.25 91.87 - - - - - -
tian-etal-2020-joint (BERT) 98.77 96.77 97.39 94.99 97.32 94.28 97.75 94.87 98.32 95.60 98.33 95.46
tian-etal-2020-joint (ZEN) 98.81 96.92 97.47 95.02 97.31 94.32 97.77 94.88 98.33 95.69 98.18 95.49
SpanSegTag (BERT) 98.67 96.77 97.53 95.04 97.30 94.50 97.86 95.22 98.06 95.59 98.12 95.54
Table 3: Experimental results on test sets of six Chinese benchmark datasets. Marked improvements are statistically significant compared with TwASP (ZEN) [tian-etal-2020-joint] under a paired t-test. We downloaded all pre-trained models of tian-etal-2020-joint from their public resource https://github.com/SVAIGBA/TwASP; however, we could not reproduce the result on the UD2 dataset.

3.4 Overall Performance

We run the final testing experiments with the BERT encoder on six datasets and compare with previous results, as shown in Table 3. Firstly, our SpanSegTag achieves competitive results on CTB5, UD1, and UD2 compared with the research of tian-etal-2020-joint using the BERT encoder. Our SpanSegTag achieved competitive or higher F-scores on joint CWS and POS tagging even though we get lower CWS performance on CTB5, UD1, and UD2. Besides, our SpanSegTag obtained higher F-scores of joint CWS and POS tagging on CTB6, CTB7, and CTB9 compared with [tian-etal-2020-joint].

Compared with tian-etal-2020-joint using the ZEN [diao-etal-2020-zen] encoder, we note that the ZEN encoder, which enhances n-gram information, was better than the BERT encoder on many Chinese NLP tasks [diao-etal-2020-zen]. Even so, our SpanSegTag with BERT obtained higher joint CWS and POS tagging performance on CTB6, CTB7, CTB9, and UD1. Moreover, our improvements on CTB7 and CTB9 are statistically significant under a paired t-test. We explain our improvements by the fact that we model all n-grams in the input sentence directly for the word segmentation and POS tagging tasks via span labeling, rather than indirectly as in the work of tian-etal-2020-joint. To sum up, our SpanSegTag does not achieve state-of-the-art performance on all six datasets. However, we obtained significant results on the two largest joint CWS and POS tagging datasets, CTB7 and CTB9. To explore the pros and cons of our SpanSegTag, we provide analyses in Section 4.

4 Analysis

4.1 Recall of Out-of-vocabulary and in-vocabulary Words

Table 4: Recall of out-of-vocabulary words (R_OOV) and in-vocabulary words (R_IV) with their POS tags for TwASP (BERT), TwASP (ZEN), and our SpanSegTag.

Inspired by the research of gao-etal-2005-chinese, we test the performance of detecting unknown words with POS tags (R_OOV) and the performance of resolving ambiguities in word segmentation with POS tags (R_IV), as shown in Table 4. The analysis reveals that our SpanSegTag tends to have a higher R_IV than R_OOV. This analysis motivates us to research a multi-view model combining sequence tagging and span labeling in future work.

4.2 Combination Ambiguity String Error

In addition to the analysis in subsection 4.1, we also follow [gao-etal-2005-chinese] to analyze combination ambiguity string (CAS) errors, as shown in Table 5. CAS detection requires judging the syntactic and semantic sense of the segmentation; hence, we only use the CAS measure in a pilot study. Inspired by [gao-etal-2005-chinese], we test on a set of 70 high-frequency CASs for each dataset. The results show that our SpanSegTag resolves CASs slightly better than TwASP [tian-etal-2020-joint] on the CTB6, CTB7, and CTB9 datasets. We hope this error analysis will motivate the research community to improve the joint CWS and POS tagging task.

CTB5 CTB6 CTB7 CTB9 UD1
TwASP (BERT) 96.43 93.72 94.26 94.61 96.40
TwASP (ZEN) 96.43 94.88 94.23 95.47 97.30
Our (BERT) 95.71 95.30 94.72 95.56 97.30
Table 5: CWS accuracies of TwASP [tian-etal-2020-joint] using BERT and ZEN versus our SpanSegTag on 70 high-frequency two-character CASs.

4.3 Model Size and Inference Speed

CTB5 CTB6 CTB7 CTB9 UD1
TwASP (BERT) 514 699 716 650 435
TwASP (ZEN) 989 1,010 1,170 1,100 909
Our (BERT) 433 434 435 441 413
Table 6: Model sizes (MB) of TwASP [tian-etal-2020-joint] using BERT and ZEN versus our SpanSegTag.

In theory, our SpanSegTag is an $O(n^2)$ algorithm due to computing all possible span representations, which is on par with computing the memory networks for context features and corresponding knowledge instances from off-the-shelf toolkits in [tian-etal-2020-joint]. In practice, when using a Tesla V100 GPU via Google Colaboratory, the inference speeds of our SpanSegTag (BERT) and TwASP (BERT) are 264 and 239 sentences/second, respectively. We note that we did not count the time TwASP [tian-etal-2020-joint] spends running off-the-shelf toolkits. Table 6 shows that the number of parameters of our SpanSegTag is independent of the datasets and significantly smaller than that of TwASP [tian-etal-2020-joint].

5 Related Work

The one-step approach for joint CWS and POS tagging was proved better than the two-step one by many prior studies [jiang-etal-2008-cascaded, jiang-etal-2009-automatic, sun-2011-stacked, zeng-etal-2013-graph, zheng-etal-2013-deep, kurita-etal-2017-neural, shao-etal-2017-character, Zhang2018ASA]. Besides, tian-etal-2020-joint confirmed the importance of context features and corresponding knowledge instances from off-the-shelf toolkits. Our work is related to [chen-etal-2016-segmentation] in its use of a matrix representation for CWS, and to [sun-tsou-1995-ambiguity, chen-goodman-1996-empirical, li-etal-2003-unsupervised, gao-etal-2005-chinese, Ma_2014, chen-etal-2016-segmentation] concerning dealing with ambiguity in CWS.

6 Conclusion

In this paper, we propose a neural approach for joint CWS and POS tagging via span labeling. Our proposed approach uses the biaffine operation over the left and right boundary representations of consecutive characters to model n-grams. Our experiments show that our BERT-based model SpanSegTag achieved competitive performance on the CTB5, CTB6, and UD datasets, and significant improvements on the CTB7 and CTB9 benchmark datasets, compared with the current state-of-the-art method TwASP using BERT and ZEN encoders. Our approach does not use any context features or corresponding knowledge instances from off-the-shelf toolkits, and it has a significantly smaller model than TwASP. However, our SpanSegTag has the disadvantage of $O(n^2)$ complexity in running time. For future work, we will explore the architecture of the BERT model [devlin-etal-2019-bert] for joint CWS and POS tagging, because the self-attention primitive of BERT also has $O(n^2)$ complexity and the self-attention mechanism over the input sentence may be related to span representation.

References
