This paper presents new state-of-the-art models for three tasks: part-of-speech tagging, syntactic parsing, and semantic parsing, using the contextualized embedding framework known as BERT. For each task, we first replicate and simplify the current state-of-the-art approach to enhance its model efficiency. We then evaluate our simplified approaches on those three tasks using token embeddings generated by BERT. 12 datasets in both English and Chinese are used for our experiments. The BERT models outperform the previously best-performing models by 2.5% on average (7.5% for the most significant case). Moreover, an in-depth analysis of the impact of BERT embeddings is provided using self-attention, which helps in understanding this rich representation. All models and source codes are publicly available so that researchers can improve upon and utilize them to establish strong baselines for the next decade.
It is no exaggeration to say that word embeddings trained by vector-based language models Mikolov et al. (2013); Pennington et al. (2014); Bojanowski et al. (2017) have changed the game of NLP once and for all. These word embeddings, pre-trained on large corpora, improve downstream tasks by encoding rich word semantics into vector space. However, word senses are ignored by these earlier approaches: a unique vector is assigned to each word, neglecting polysemy in context.
Recently, contextualized embedding approaches have emerged with advanced techniques to dynamically generate word embeddings from different contexts. To address polysemous words, Peters et al. (2018) introduce ELMo, a word-level Bi-LSTM language model. Akbik et al. (2018) apply a similar approach at the character level, called Flair, concatenating the hidden states corresponding to the first and last characters of each word to build that word's embedding. Apart from these unidirectional recurrent language models, Devlin et al. (2018) replace the transformer decoder of Radford et al. (2018) with a bidirectional transformer encoder, then train BERT on a 3.3B-word corpus. After scaling the model size to hundreds of millions of parameters, BERT brings marked improvements to a wide range of tasks without substantial task-specific modifications.
In this paper, we verify the effectiveness and conciseness of BERT by first generating token-level embeddings from it, then integrating them into task-oriented yet efficient model structures (Section 3). With careful investigation and engineering, our simplified models significantly outperform many of the previous state-of-the-art models, achieving the highest scores for 11 out of 12 datasets (Section 4).
To reveal the essence of BERT in these tasks, we analyze our tagging models with self-attention, and find that BERT embeddings capture contextual information better than pre-trained embeddings, but not necessarily better than embeddings generated by a character-level language model (Section 5.1). Furthermore, an extensive comparison between our baseline and BERT models shows that BERT models handle long sentences more robustly (Section 5.2). One of the key findings is that BERT embeddings are much more related to semantics than to syntax (Section 5.3). Our findings are consistent with the training procedure of BERT, providing guiding references for future research.
To the best of our knowledge, this is the first work that tightly integrates BERT embeddings into these three downstream tasks and presents such high performance. All our resources, including the models and source codes, are publicly available (https://github.com/emorynlp/bert-2019).
Rich initial word encodings substantially improve the performance of downstream NLP tasks, and have been studied over decades. Except for matrix factorization methods (Pennington et al., 2014), most approaches train language models to predict words given their contexts. Among these, CBOW and Skip-Gram Mikolov et al. (2013) are pioneers of neural language models, extracting features within a fixed-length window. Joulin et al. (2017) then augment these models with subword information to handle out-of-vocabulary words.
To learn contextualized representations, Peters et al. (2018) apply a bidirectional language model (bi-LM) to a tokenized unlabeled corpus. Similarly, the contextual string embeddings of Akbik et al. (2018) model language at the character level, which efficiently extracts morphological features. However, a bi-LM consists of two unidirectional LMs, each missing either left or right context, leading to potential bias toward one side. To address this limitation, BERT Devlin et al. (2018) employs a masked LM to jointly condition on both left and right contexts, showing impressive improvements in various tasks.
Sequence tagging is one of the most well-studied NLP tasks, and can be directly applied to part-of-speech (POS) tagging and named entity recognition (NER). As a general trend, fine-grained features often result in better performance. Akbik et al. (2018) feed contextual string embeddings into a Bi-LSTM-CRF tagger (Huang et al., 2015), improving tagging accuracy with rich morphological and contextual information. In a more meticulously designed system, Bohnet et al. (2018) generate representations from both string-based and token-based character Bi-LSTM language models, then employ a meta-BiLSTM to integrate them.
Besides, joint learning and semi-supervised learning can lead to better generalization. As a highly end-to-end approach, the character-level transition system proposed by Kurita et al. (2017) benefits from joint learning on Chinese word segmentation, POS tagging, and dependency parsing. Recently, Clark et al. (2018) exploit large-scale unlabeled data with Cross-View Training (CVT), which improves the RNN feature detector shared between the full model and auxiliary modules.
Dependency trees and constituency structures are two closely related syntactic forms. Choe and Charniak (2016) cast constituency parsing as language modeling, achieving high UAS after conversion to dependency trees. Kuncoro et al. (2017) investigate recurrent neural network grammars through ablations and a gated attention mechanism, finding that lexical heads are crucial in phrasal representation.
Recently, graph-based parsers have resurged due to their ability to exploit modern GPU parallelization. Dozat and Manning (2017) successfully implement a graph-based dependency parser with a biaffine attention mechanism, showing impressive performance and decent simplicity. Clark et al. (2018) improve the feature detector of the biaffine parser through CVT and joint learning, while Ma et al. (2018) introduce stack-pointer networks to model the parsing history of a transition-based parser, with a biaffine attention mechanism built in.
Currently, the parsing community is shifting from syntactic dependency tree parsing to semantic dependency graph parsing (SDP). As graph nodes can have multiple heads or zero heads, SDP allows for more flexible representations of sentence meanings. Wang et al. (2018) modify the preconditions of the List-Based Arc-Eager transition system Choi and McCallum (2013), implementing it with Bi-LSTM Subtraction and Tree-LSTM for feature extraction.
Among graph-based approaches, Peng et al. (2017) investigate higher-order structures across different graph formalisms with a tensor scoring strategy, benefiting from multitask learning. Dozat and Manning (2018) replace the softmax cross-entropy in the biaffine parser with sigmoid cross-entropy, successfully turning the syntactic tree parser into a simple yet accurate semantic graph parser.
BERT splits each token into subwords using WordPiece Wu et al. (2016), which do not necessarily reflect any morphology in linguistics. For example, 'Rainwater' gets split into 'Rain' and '##water', while words such as 'running' or 'rapidly' remain unchanged, although typical morphological analysis would split them into run+ing and rapid+ly. To obtain token-level embeddings for tagging and parsing tasks, we experiment with the following two methods:
Since the subwords from each token are trained to predict one another during language modeling, their embeddings must be correlated. Thus, one way is to pick the embedding of the last subword as a representation of the token.
For a compound word like ‘doghouse’ that gets split into ‘dog’ and ‘##house’, the last subword does not necessarily convey the key meaning of the token. Hence, another way is to take the average embedding of the subwords.
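The two pooling strategies can be sketched as follows, assuming BERT subword vectors and a token-to-subword alignment are already available (the function name and array shapes are our own illustration, not the paper's implementation):

```python
import numpy as np

def pool_token_embeddings(subword_vecs, token_spans, method="avg"):
    """Pool BERT subword embeddings into one vector per token.

    subword_vecs: (num_subwords, dim) array of subword embeddings.
    token_spans:  (start, end) subword index ranges, one per token.
    method:       "last" picks the final subword of each token;
                  "avg" averages all subwords of the token.
    """
    pooled = []
    for start, end in token_spans:
        span = subword_vecs[start:end]
        pooled.append(span[-1] if method == "last" else span.mean(axis=0))
    return np.stack(pooled)

# 'The doghouse leaks': 'doghouse' -> 'dog' + '##house' (subwords 1-2)
vecs = np.random.rand(4, 768)
spans = [(0, 1), (1, 3), (3, 4)]
tokens = pool_token_embeddings(vecs, spans)  # (3, 768) token embeddings
```

For 'doghouse', the "avg" method blends the 'dog' and '##house' vectors, while "last" would keep only '##house'.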
Table 1 shows results from a semantic parsing task, PSD (Section 4.3), using the last and average embedding methods with the BERT-Base and BERT-Large models (BERT-Base uses 12 layers, 768 hidden cells, 12 attention heads, and 110M parameters, while BERT-Large uses 24 layers, 1024 hidden cells, 16 attention heads, and 340M parameters). Both models are uncased, since they are reported to achieve high scores for all tasks except for NER Devlin et al. (2018). The average method is chosen for all our experiments since it gives a marginal advantage on the out-of-domain dataset.
While Devlin et al. (2018) report that adding just an additional output layer to the BERT encoder can build powerful models in a wide range of tasks, its computational cost is too high. Thus, we separate the BERT architecture from the downstream models, and feed pre-generated BERT embeddings as input to task-specific encoders.
Alternatively, BERT embeddings can be concatenated with the output of a certain hidden layer of the encoder.
Table 2 shows results from the PSD semantic parsing task (Section 4.3) using the average method from Section 3.1. Feeding BERT embeddings directly as input shows a slight advantage for both BERT-Base and BERT-Large over concatenating them with a hidden layer's output; thus, the former is chosen for all our experiments.
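As a rough sketch, the two integration strategies differ only in where the pre-generated BERT vectors are concatenated (the function and variable names here are illustrative assumptions, not the actual implementation):

```python
import numpy as np

def integrate_bert(bert, pretrained, hidden_out=None, mode="input"):
    """Combine pre-generated BERT embeddings with a task-specific encoder.

    mode="input":  concatenate BERT with the other input embeddings,
                   before the encoder.
    mode="hidden": concatenate BERT with the output of a hidden layer,
                   after (part of) the encoder.
    All arrays hold one row per token.
    """
    if mode == "input":
        return np.concatenate([pretrained, bert], axis=-1)
    return np.concatenate([hidden_out, bert], axis=-1)

bert = np.random.rand(30, 768)   # 30 tokens, BERT-Base dimensions
glove = np.random.rand(30, 100)  # 100-dim pre-trained embeddings
x = integrate_bert(bert, glove)  # (30, 868) encoder input
```

Either way, BERT itself stays frozen and outside the training loop, which is what keeps the downstream models cheap.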
For sequence tagging, the Bi-LSTM-CRF Huang et al. (2015) with the Flair contextual embeddings Akbik et al. (2018) is used to establish a baseline for English. Given a token in a sequence, let s and e be the character offsets of its first and last characters (s ≤ e). The Flair embedding of the token is generated by concatenating two hidden states: the state after the last character from the forward character-level LSTM and the state before the first character from the backward character-level LSTM (Figure 1).
This Flair embedding is then concatenated with a pre-trained token embedding and fed into the Bi-LSTM-CRF. In our approach, we present two models: one substituting the Flair and pre-trained embeddings with BERT, and the other concatenating BERT with the other embeddings. Note that variational dropout is not used in our approach, to reduce complexity.
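A minimal sketch of the Flair-style token embedding, assuming per-character hidden states from the two character-level LSTMs are already computed (the function name and offsets are our own illustration):

```python
import numpy as np

def flair_token_embedding(fwd_states, bwd_states, start, end):
    """Build a Flair-style embedding for one token.

    fwd_states: (num_chars, dim) hidden states of the forward char-LSTM.
    bwd_states: (num_chars, dim) hidden states of the backward char-LSTM.
    start, end: inclusive character offsets of the token.

    The forward state at the token's last character has read the whole
    left context plus the token; the backward state at its first
    character has read the whole right context plus the token.
    """
    return np.concatenate([fwd_states[end], bwd_states[start]])

chars = 20                                   # characters in the sentence
fwd = np.random.rand(chars, 2048)            # 2048 matches Table 15
bwd = np.random.rand(chars, 2048)
emb = flair_token_embedding(fwd, bwd, 5, 9)  # token spans chars 5..9
```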
As Chinese is characterized as a morphologically poor language, the Flair embeddings are not used for tagging tasks; only pre-trained and BERT embeddings are used for our experiments in Chinese.
A simplified variant of the biaffine parser Dozat and Manning (2017) is used for syntactic parsing (Figure 2). Compared to the original version, the trainable word embeddings are removed and lemmas are used instead of forms to retrieve pre-trained embeddings, leading to less complexity yet better generalization. For each token, a feature vector is created by concatenating its pre-trained lemma embedding, its POS embedding learned during training, and its representation from the last layer of BERT. This feature vector is fed into a Bi-LSTM, generating forward and backward recurrent states.
Two multi-layer perceptrons (MLPs) are then used to extract features for each token being a head or a dependent, and two additional MLPs are used to extract the corresponding features for labeled parsing.
The head features of all tokens are stacked into one matrix, augmented with a bias term for the prior probability of each token being a head, and the dependent features are stacked into another matrix. A bilinear classifier over these two matrices predicts head words. Additionally, arc labels are predicted by a biaffine classifier, which combines one bilinear classifier per label for multi-classification.
During training, softmax cross-entropy is used to optimize both the arc and label classifiers. Note that for the optimization of the label classifier, gold heads are used instead of predicted ones. During decoding, a maximum spanning tree algorithm is adopted to search for the optimal tree based on the arc scores.
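The bilinear arc scorer can be sketched as follows (a simplified illustration with our own variable names; the actual parser adds MLP feature extraction and label scoring on top):

```python
import numpy as np

def biaffine_arc_scores(h_head, h_dep, U, b):
    """Score every (dependent, head) pair.

    h_head: (n, d) head representations from the head MLP.
    h_dep:  (n, d) dependent representations from the dependent MLP.
    U:      (d, d) bilinear weight matrix.
    b:      (d,) bias modeling the prior of each token being a head.

    Returns an (n, n) score matrix where entry [i, j] scores token j
    as the head of token i; a softmax over each row gives the head
    distribution trained with cross-entropy.
    """
    bilinear = h_dep @ U @ h_head.T  # (n, n) pairwise scores
    prior = h_head @ b               # (n,) head prior per candidate head
    return bilinear + prior          # prior broadcast over each row

n, d = 5, 8
scores = biaffine_arc_scores(np.random.rand(n, d), np.random.rand(n, d),
                             np.random.rand(d, d), np.random.rand(d))
```

The head-prior bias is what distinguishes the biaffine scorer from a plain bilinear form: frequent heads (e.g. verbs) get a constant boost regardless of the dependent.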
Dozat and Manning (2018) adapted their original biaffine parser to generate dependency graphs for semantic parsing, where each token can have zero to many heads. Since the tree structure is no longer guaranteed, sigmoid cross-entropy is used instead so that an independent binary prediction can be made for every token to be considered a head of any other token. Once arc predictions are made, the label predictions are made by outputting the labels with the highest scores for the predicted arcs, as illustrated in Figure 2.
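The difference from tree decoding can be sketched in a few lines: each candidate arc gets an independent sigmoid decision, then the best label is attached to every selected arc (a threshold of 0.0 on the raw score corresponds to a sigmoid probability of 0.5; the names are our own):

```python
import numpy as np

def sdp_decode(arc_scores, label_scores):
    """Decode a semantic dependency graph.

    arc_scores:   (n, n) raw scores; entry [i, j] scores token j as a
                  head of token i. A token may keep zero or many heads.
    label_scores: (n, n, num_labels) raw label scores per candidate arc.
    """
    edges = arc_scores > 0.0               # sigmoid(score) > 0.5
    labels = label_scores.argmax(axis=-1)  # best label for every pair
    return edges, labels

arcs = np.array([[-1.0, 0.7], [0.2, -0.5]])
lbls = np.random.rand(2, 2, 4)
edges, labels = sdp_decode(arcs, lbls)  # edges: [[False, True], [True, False]]
```

No maximum spanning tree search is needed here, which is precisely what makes the graph parser simpler than the tree parser.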
This updated implementation is further simplified in our approach by removing the trainable word embeddings, the character-level feature detector, and their corresponding linear transformations. Moreover, instead of using an interpolation between the head and label losses, equal weights are applied to both losses, reducing the hyperparameters to tune.
Three sets of experiments are conducted to evaluate the impact of our approaches using BERT (Section 3). For sequence tagging (Section 4.1), part-of-speech tagging is chosen, where each token gets assigned a fine-grained POS tag. For syntactic parsing (Section 4.2), dependency parsing is chosen, where each token finds exactly one head, generating a tree per sentence. For semantic parsing (Section 4.3), semantic dependency parsing is chosen, where each token finds zero to many heads, generating a graph per sentence. Every task is tested on both English and Chinese to ensure robustness across languages.
Standard datasets are adopted for all experiments for fair comparisons with previous approaches. All our models are run three times, and average scores with standard deviations are reported. Section A describes our environmental settings and data splits in detail for the replication of this work.
For part-of-speech tagging, the Wall Street Journal corpus from the Penn Treebank 3 Marcus et al. (1993) is used for English, and the Penn Chinese Treebank 5.1 Xue et al. (2005) is used for Chinese. Table 3 shows tagging results on the test sets.
Test results for part-of-speech tagging, where token-level accuracy is used as the evaluation metric. ALL: all tokens, OOV: out-of-vocabulary tokens.
For English, the baseline is our replication of the Flair model using both GloVe and Flair embeddings (Section 3.3). It shows a slightly lower accuracy (-0.15%) than the original model Akbik et al. (2018) due to the lack of variational dropout. \BERT substitutes GloVe and Flair with BERT embeddings, and +BERT uses all three types of embeddings. The baseline outperforms all BERT models on the ALL test, implying that Flair's Bi-LSTM character language model is more effective than BERT's word-piece approach for this task. No significant difference is found between BERT-Base and BERT-Large. However, an interesting trend is found in the OOV test, where the +BERT model shows good improvement over the baseline. This implies that BERT embeddings can still contribute to the Flair model for OOV tokens, although the CNN character language model from Ma and Hovy (2016) is marginally more effective than +BERT for out-of-vocabulary tokens.
For Chinese, a Bi-LSTM-CRF model with FastText embeddings is used as the baseline (Section 3.3). \BERT, which substitutes the FastText embeddings with BERT, and +BERT, which adds BERT embeddings to the baseline, show progressive improvement over the prior model on both the ALL and OOV tests. +BERT gives an accuracy that is 1.25% higher than the previous state-of-the-art using joint learning between tagging and parsing Wang and Xue (2014).
Our simplified version of the biaffine parser (Section 3.4) is used as the baseline, where GloVe and FastText embeddings are used for English and Chinese, respectively. The baseline model gives a comparable result to the original model Dozat and Manning (2017) for English, yet shows a notably better result for Chinese, which may be due to higher-quality embeddings from FastText. \BERT substitutes the pre-trained embeddings with BERT, and +BERT adds BERT embeddings to the baseline. Moreover, BERT's uncased base model is used for English.
Between \BERT and +BERT, no significant difference is found, implying that the pre-trained embeddings are not particularly useful when coupled with BERT. All BERT models show significant improvements over the baselines for both languages, and outperform the previous state-of-the-art approaches using cross-view training Clark et al. (2018) and stack-pointer networks Ma et al. (2018) by 0.29% and 3% in LAS for English and Chinese, respectively. Considering the simplicity of our +BERT models, these results are remarkable.
The English dataset from the SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing Oepen et al. (2015) and the Chinese dataset from the SemEval 2016 Task 9: Chinese Semantic Dependency Parsing Che et al. (2016) are used for semantic dependency parsing.
Table 5 shows the English results on the test sets. The baseline, \BERT, and +BERT models are similar to the ones in Section 4.2, except they use the sigmoid instead of the softmax function in the output layer to accept multiple heads (Section 3.5). Our baseline is a simplified version of Dozat and Manning (2018); its average scores are 1.2% higher and 1.0% lower than the original model for ID and OOD respectively, due to different hyperparameter settings. +BERT shows good improvement over \BERT on both test sets, implying that BERT embeddings are complementary to the pre-trained embeddings, and surpasses the previous state-of-the-art scores by 3% and 2% for ID and OOD, respectively.
| Model | NEWS UF | NEWS LF | TEXT UF | TEXT LF |
|---|---|---|---|---|
| Artsymenia et al. (2016) | 77.64 | 59.06 | 82.41 | 68.59 |
| Wang et al. (2018) | 81.14 | 63.30 | 85.71 | 72.92 |
| Baseline \ BERT | 82.91 | 67.17 | 90.83 | 80.46 |
| Baseline + BERT | 82.92 | 67.27 | 91.10 | 80.41 |
Table 6 shows the Chinese results on the test sets. No significant difference is found between \BERT and +BERT. +BERT significantly outperforms the previous state-of-the-art by 4% and 7.5% in LF for NEWS and TEXT respectively, which confirms that BERT embeddings are very effective for semantic dependency parsing in both English and Chinese.
This section gives an in-depth analysis of the great results achieved by our approaches (Section 4) to better understand the role of BERT in these tasks.
The performance of the \BERT models is surprisingly low for English POS tagging, compared even to a linear model achieving an accuracy of 97.64% on the same dataset Choi (2016). This aligns with the findings reported for BERT Devlin et al. (2018) and ELMo Peters et al. (2018), another popular contextualized embedding approach, where their POS and named-entity tagging results do not surpass the state-of-the-art. To study how tagging models are trained with BERT embeddings, we augment the baseline and \BERT models in Table 3(a) with dot-product self-attention Luong et al. (2015), and extract their attention weights. We then average the attention matrices decoded from sentences of an equal length, 30 tokens, to find any general trend.
Comparing attention matrices across languages, it is clear that the Chinese matrices are much more checkered, implying that more context is required to make correct predictions in Chinese than in English. This makes sense because Chinese words tend to be more polysemous than English ones Huang et al. (2007), so they rely more on context to disambiguate their categories. For the Flair and BERT models in English, the Flair matrix is more checkered and its diagonal is darker, implying that it uses more context while individual token embeddings convey more information for POS tagging, so their weights are higher than the ones in the BERT matrix. For the FastText and BERT models in Chinese, on the other hand, the BERT matrix is slightly more checkered and its diagonal is darker, indicating that BERT is better suited for this task than FastText.
Figure 4 shows the attention matrices from a sample Chinese sentence. The FastText model mispredicts 出口 (export) and 成套 (whole) as nouns, whereas the BERT model correctly predicts them as a verb and an adjective, respectively. Notice that the BERT model gives the highest attention to 产 (produce) for tagging 出口 (export), which both happen to be verbs, whereas the FastText model gives the highest attention to 设备 (equipment), which is a noun.
The outputs of the baseline and \BERT models on semantic dependency parsing (Table 5) are further analyzed for their robustness on long sentences. The average F1 scores for each sentence-length group, ranging from 1 to 50 tokens, are displayed in Figure 5. For DM and PAS, the baseline scores drop faster than those of \BERT as sentences get longer. For PSD, the drop rates are similar between the two, due to the challenging nature of this dataset Oepen et al. (2015). This reflects that BERT embeddings handle far-distant dependencies in longer sentences better.
One possible explanation for BERT's capability of handling long sentences robustly is the training objective and structure of masked language modeling (MLM). MLM is trained to predict randomly masked tokens through features extracted by a bidirectional Transformer, which takes up to 512 tokens as input. This is about twice as long as what recurrent neural networks typically retain in practice before gradients vanish Khandelwal et al. (2018), and an order of magnitude larger than the context windows used by FastText or GloVe. As a result, BERT embeddings can carry information from much farther-distant tokens, leading to higher performance on tasks requiring contextual understanding, such as parsing.
Prague Semantic Dependencies (PSD) is used for our labeling analysis because it is manually annotated and well-documented Cinková et al. (2006). The average labeled F1 score of each label is ranked by the difference between the baseline and \BERT models in Table 5. Figure 7 shows the top-5 labels on which \BERT outperforms the baseline, and vice versa.
The baseline performs better on certain arguments involving syntactic relations such as LOC-arg (locative), where the relation usually finds a preposition as the head of a noun phrase. \BERT shows robust generalization for arguments involving semantic reasoning, e.g., CRIT (criterion) or COND (condition). For 'tradition' in Figure 6, \BERT correctly classifies the CRIT label, while the baseline misclassifies it as ACT-arg (argument of action). The far-distant relation between 'tradition' and 'reported' requires deeper inference over the context, which may be beyond the capacity of the baseline.
In this paper, we describe our methods of exploiting BERT as token-level embeddings for tagging and parsing tasks. Our experiments empirically show that tagging and parsing can be tackled using much simpler models without losing accuracy. Out of 12 datasets, our approaches with BERT have established new state-of-the-art results for 11 of them. As the first work employing BERT for syntactic and semantic parsing, our approach is much simpler yet more accurate than the previous state-of-the-art.
Through a dedicated error analysis and extensive dissections based on an attention mechanism, we uncover interesting properties of BERT from syntactic, semantic, and multilingual perspectives. Beyond syntactically intensive or morphologically complex tasks, BERT embeddings are well-suited for semantic reasoning in long sentences.
Throughout this paper, we use the following notations for data splits, TRN: training, DEV: development, TST: test.
For English, the Wall Street Journal corpus from the Penn Treebank 3 Marcus et al. (1993) is used with the standard split for part-of-speech tagging. The baseline is our replication of the Flair model Akbik et al. (2018) using embeddings trained by GloVe. Specifically, we use 100-dim GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip) Pennington et al. (2014) trained on Wikipedia 2014 and Gigaword 5, involving 6B tokens in total.
For Chinese, the Penn Chinese Treebank 5.1 Xue et al. (2005) is used with the standard split for POS tagging. The baseline is our replication of the Bi-LSTM-CRF model Huang et al. (2015). We use 300-dim FastText embeddings (https://fasttext.cc/docs/en/crawl-vectors.html) with subword information.
For English, the Wall Street Journal corpus from the Penn Treebank 3 is used with the standard split, converted by the Stanford Parser 3.3.0 (http://nlp.stanford.edu/software/lex-parser.shtml), for syntactic dependency parsing. For Chinese, the Penn Chinese Treebank 5.1 is used with the standard split, converted by the head-finding rules of Zhang and Clark (2008) and the labeling rules of Penn2Malt (https://cl.lingfil.uu.se/~nivre/research/Penn2Malt.html). The POS tags are auto-generated by the POS tagger in NLP4J Choi (2016) (https://emorynlp.github.io/nlp4j/) using 10-way jackknifing on the training set for English, and the gold word segmentation and POS tags are used for Chinese.
For English, the dataset from the SemEval 2015 Task 18 Oepen et al. (2015) is used for semantic dependency parsing. For Chinese, the SemEval 2016 Task 9 Che et al. (2016) dataset is used; the SemEval 2015 Chinese dataset is not used because it is less popular. The POS tags provided in those datasets are used as they are for both English and Chinese, and the provided word segmentation is used for Chinese.
| Baseline | ± 0.24 | ± 0.13 | ± 0.25 | ± 0.81 |
| Baseline \ BERT | ± 0.07 | ± 0.04 | ± 0.19 | ± 0.84 |
| Baseline + BERT | ± 0.13 | ± 0.09 | ± 0.15 | ± 0.10 |
Our models are implemented in MXNet and run on NVIDIA Tesla V100 GPUs. Note that in our implementation, the BERT large cased model requires 15GB of GPU memory, which exceeds the memory limit of a TITAN X (12GB). The training time of the baseline+BERT models on each dataset is listed in Table 14.
The hyperparameter configurations for the English and Chinese tagging models are given in Table 15.
| Hyperparameter | Value |
|---|---|
| GloVe / FastText | 100 / 300 |
| Flair BiLSTM | 1 @ 2048 |
| BiLSTM | 1 @ 256 |
| BERT EN / CN | 1024 / 768 |
| Loss & Optimizer | |
We have similar configurations for both syntactic and semantic dependency parsing in English and Chinese, shown in Table 16.
| Hyperparameter | Value |
|---|---|
| GloVe / FastText | 100 / 300 |
| BiLSTM | 3 @ 400 |
| BERT EN / CN | 768 / 768 |
| Loss & Optimizer | |