Multilingual Constituency Parsing with Self-Attention and Pre-Training

12/31/2018 ∙ by Nikita Kitaev, et al.

We extend our previous work on constituency parsing (Kitaev and Klein, 2018) by incorporating pre-training for ten additional languages, and compare the benefits of no pre-training, ELMo (Peters et al., 2018), and BERT (Devlin et al., 2018). Pre-training is effective across all languages evaluated, and BERT outperforms ELMo in large part due to the benefits of increased model capacity. Our parser obtains new state-of-the-art results for 11 languages, including English (95.8 F1) and Chinese (91.8 F1).


1 Introduction

There has recently been rapid progress in developing contextual word representations that improve accuracy across a range of natural language tasks (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). In our earlier work (Kitaev and Klein, 2018), we showed that such representations are helpful for constituency parsing. However, those results considered only the LSTM-based ELMo representations (Peters et al., 2018), and only for the English language. We now extend this work by substituting BERT (Devlin et al., 2018), showing that a representation built only on self-attention also works. We further demonstrate that pre-training and self-attention are effective across languages by applying our parsing architecture to ten additional languages.

Our parser code and trained models for 11 languages are publicly available at https://github.com/nikitakit/self-attentive-parser.

2 Model

Our parser, as described in Kitaev and Klein (2018), accepts as input a sequence of vectors corresponding to words in a sentence, transforms these representations using one or more self-attention layers, and finally uses them to output a parse tree. We incorporate BERT by taking the token representations from the last layer of a BERT model and projecting them to 512 dimensions (the default size used by our parser) with a learned projection matrix. While our parser operates on vectors aligned to words in a sentence, BERT associates vectors with sub-word units produced by WordPiece tokenization (Wu et al., 2016). We bridge this difference by retaining only the BERT vector corresponding to the last sub-word unit of each word in the sentence. We briefly experimented with other alternatives, such as using only the first sub-word instead, but did not find that this choice had a substantial effect on English parsing accuracy.
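
To make this concrete, the sketch below keeps only the last WordPiece of each word and applies a learned projection down to 512 dimensions. It is a minimal illustration rather than the released implementation: it assumes the HuggingFace transformers interface, and the names word_vectors and project are ours.

    # Minimal sketch (not the released code) of mapping BERT sub-word vectors to
    # one vector per word and projecting to the parser's 512-dimensional input.
    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    project = nn.Linear(bert.config.hidden_size, 512)  # learned projection matrix

    def word_vectors(words):
        pieces, last_piece = [], []
        for word in words:
            pieces.extend(tokenizer.tokenize(word))  # WordPiece sub-word units
            # Position of this word's final sub-word in [CLS] + pieces + [SEP]:
            # index (len(pieces) - 1) within `pieces`, shifted by 1 for [CLS].
            last_piece.append(len(pieces))
        ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + pieces + ["[SEP]"])
        hidden = bert(torch.tensor([ids])).last_hidden_state  # (1, n_pieces + 2, hidden)
        return project(hidden[0, torch.tensor(last_piece)])   # (n_words, 512)

These per-word vectors are what the additional self-attention layers described next operate on.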

The fact that additional layers are applied to the output of BERT – which itself uses a self-attentive architecture – may at first seem redundant, but there are important differences between these two portions of the architecture. The extra layers on top of BERT use word-based tokenization instead of sub-words, apply the factored version of self-attention proposed in Kitaev and Klein (2018), and are randomly-initialized instead of being pre-trained. We found that omitting these additional layers and using the BERT vectors directly hurt parsing accuracies.
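
The sketch below shows where these extra layers sit, reusing word_vectors from the sketch above. It substitutes a standard PyTorch transformer encoder for the factored content/position attention actually used by the parser, so it illustrates only the stacking of randomly initialized word-level layers on top of BERT, not their exact form.

    # Sketch only: a plain nn.TransformerEncoder stands in for the factored
    # self-attention of Kitaev and Klein (2018); it is randomly initialized and
    # operates on one vector per word, on top of the projected BERT outputs.
    import torch.nn as nn

    word_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048),
        num_layers=2,  # a small stack of additional layers (depth is a hyperparameter)
    )

    def encode_sentence(words):
        x = word_vectors(words).unsqueeze(1)   # (n_words, 1, 512); batch of one sentence
        return word_encoder(x).squeeze(1)      # refined word representations for span scoring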

We also extend the parser to predict part-of-speech tags in addition to constituent labels, a feature we include based on feedback from users of our previous parser. Tags are predicted using a small feed-forward network (with only one ReLU nonlinearity) after the final layer of self-attention. This differs slightly from Joshi et al. (2018), where tags are predicted based on span representations instead. The tagging head is trained jointly with the parser by adding an auxiliary softmax cross-entropy loss, averaged over all words present in a given batch.
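
A minimal sketch of such a tagging head is given below; the hidden-layer size and tag-inventory size are placeholders rather than the settings of our released models.

    # Sketch of the auxiliary part-of-speech tagging head: a small feed-forward
    # network with a single ReLU, applied to the final self-attention layer's
    # output for each word, trained with softmax cross-entropy averaged over words.
    import torch.nn as nn

    NUM_TAGS = 45             # placeholder tag-inventory size (treebank-dependent)
    tag_head = nn.Sequential(
        nn.Linear(512, 256),  # placeholder hidden size
        nn.ReLU(),            # the single nonlinearity
        nn.Linear(256, NUM_TAGS),
    )
    tag_loss_fn = nn.CrossEntropyLoss(reduction="mean")  # mean over all words in the batch

    def tagging_loss(word_reprs, gold_tags):
        # word_reprs: (n_words_in_batch, 512); gold_tags: (n_words_in_batch,)
        return tag_loss_fn(tag_head(word_reprs), gold_tags)

During training, this auxiliary loss is simply added to the parsing loss.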

We train our parser with a learning rate of 5 × 10⁻⁵ and batch size 32, where BERT parameters are fine-tuned as part of training. All other hyperparameters are unchanged from Kitaev and Klein (2018) and Devlin et al. (2018).
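
Schematically, joint fine-tuning amounts to handing BERT's parameters and the parser's own parameters to a single optimizer, as in the sketch below; parse_loss and train_batches are placeholders for the parser's actual training objective and data pipeline, and the loop omits details such as learning-rate warmup.

    # Illustrative training loop: BERT is fine-tuned jointly with the parser.
    # parse_loss and train_batches are placeholders for the real objective and data.
    import itertools
    import torch

    params = itertools.chain(bert.parameters(), project.parameters(),
                             word_encoder.parameters(), tag_head.parameters())
    optimizer = torch.optim.Adam(params, lr=5e-5)

    for batch in train_batches:
        optimizer.zero_grad()
        loss = parse_loss(batch)  # span loss plus the auxiliary tagging loss above
        loss.backward()
        optimizer.step()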

3 Comparison of Pre-Training Methods

In this section, we compare BERT, ELMo, and training a parser from scratch on treebank data alone. Our comparison of the different methods for English is shown in Table 1.

Method | Pre-trained on | F1
No pre-training | – | 93.61
ELMo | English | 95.21
BERT (uncased) | Chinese | 93.57
BERT (cased) | 104 languages | 94.97
BERT (uncased) | English | 95.32
BERT (cased) | English | 95.24
BERT-large (uncased) | English | 95.66
BERT-large (cased) | English | 95.70
Ensemble (final 4 models above) | English | 95.87
Table 1: Comparison of parsing accuracy on the WSJ development set when using different word representations. The no-pre-training baseline corresponds to the parser of Kitaev and Klein (2018).

BERT with the “base” hyperparameter settings (12 layers, 12 attention heads per layer, and 768-dimensional hidden vectors) performs comparably or slightly better than ELMo (95.32 vs. 95.21 F1), while a larger version of BERT (24 layers, 16 attention heads per layer, and 1024-dimensional hidden vectors) leads to better parsing accuracy (95.70 F1). These results show that both the LSTM-based architecture of ELMo and the self-attentive architecture of BERT are viable for parsing, and that pre-training benefits from having a high model capacity. We did not observe a sizable difference between using a version of BERT that converts all text to lowercase and a version of BERT that retains case information.

We found that pre-training on only English outperformed multilingual pre-training given the same model capacity, but the relative decrease in error rate was less than 6% (95.24 vs. 94.97 F1). This is a promising result because it supports the idea of using joint multilingual pre-training as a way to provide support for many languages in a resource-efficient manner.

We also conduct a control experiment to try to tease apart the benefits of the BERT architecture and training setup from the effects of the data used for pre-training. We originally attempted to use a randomly-initialized BERT model, but found that it would not train effectively within the range of hyperparameters we tried. (The original BERT models were trained on significantly more powerful hardware, and for a longer period of time, than any of the experiments we report in this paper.) Instead, we trained an English parser using a version of BERT that was pre-trained on the Chinese Wikipedia. Neither the pre-training domain nor the subword vocabulary used is a good fit for the target task; however, English does occur sporadically throughout the Chinese Wikipedia, and the model can losslessly represent English text: all English letters are present in its subword vocabulary, so in the worst case it will decompose an English word into its individual letters. We found that this model achieved performance comparable to a version of our parser designed to be trained on treebank data alone (93.57 vs. 93.61 F1). This result suggests that even when the pre-training data is a poor fit for the target domain, fine-tuning can still produce results comparable to purely supervised training starting from randomly-initialized parameters.

4 Results

Model | LR | LP | F1
Dyer et al. (2016) | – | – | 93.3
Choe and Charniak (2016) | – | – | 93.8
Liu and Zhang (2017) | – | – | 94.2
Fried et al. (2017) | – | – | 94.66
Joshi et al. (2018) | 93.8 | 94.8 | 94.3
Kitaev and Klein (2018) | 94.85 | 95.40 | 95.13
Ours (single model) | 95.46 | 95.73 | 95.59
Ours (ensemble of 4) | 95.51 | 96.03 | 95.77
Table 2: Comparison of F1 scores on the WSJ test set.

Model | Arabic | Basque | French | German | Hebrew | Hungarian | Korean | Polish | Swedish | Avg
Dev (all lengths)
Coavoux and Crabbé (2017) | 83.07 | 88.35 | 82.35 | 88.75 | 90.34 | 91.22 | 86.78 | 94.0 | 79.64 | 87.16
Kitaev and Klein (2018) | 85.94 | 90.05 | 84.42 | 91.39 | 90.78 | 92.32 | 87.90 | 93.76 | 79.71 | 88.47
Ours | 88.62 | 92.08 | 86.97 | 93.32 | 93.84 | 94.57 | 89.99 | 96.20 | 85.65 | 91.25
Test (all lengths)
Björkelund et al. (2014) | 81.32 | 88.24 | 82.53 | 81.66 | 89.80 | 91.72 | 83.81 | 90.50 | 85.50 | 86.12
Coavoux and Crabbé (2017) | 82.92 | 88.81 | 82.49 | 85.34 | 89.87 | 92.34 | 86.04 | 93.64 | 84.0 | 87.27
Kitaev and Klein (2018) | 85.61 | 89.71 | 84.06 | 87.69 | 90.35 | 92.69 | 86.59 | 93.69 | 84.45 | 88.32
Ours | 87.97 | 91.63 | 87.42 | 90.20 | 92.99 | 94.90 | 88.80 | 96.36 | 88.86 | 91.01
Δ (Ours - best previous) | +2.36 | +1.92 | +3.36 | +2.51 | +2.64 | +2.21 | +2.21 | +2.67 | +3.36 | –
Table 3: Results on the SPMRL dataset. All values are F1 scores calculated using the version of evalb distributed with the shared task. The Coavoux and Crabbé (2017) results shown use a character LSTM, whereas their other reported results use predicted part-of-speech tags. The Kitaev and Klein (2018) results shown do not use word embeddings, unlike other results from that work.

Model | LR | LP | F1
Fried and Klein (2018) | – | – | 87.0
Teng and Zhang (2018) | 87.1 | 87.5 | 87.3
Ours | 91.55 | 91.96 | 91.75
Table 4: Comparison of F1 scores on the Chinese Treebank 5.1 test set.

We train and evaluate our model on treebanks for eleven languages: English (see Table 2), the nine languages represented in the SPMRL 2013/2014 shared tasks (Seddah et al., 2013) (see Table 3), and Chinese (see Table 4). For each of these languages, our parser obtains a higher F1 score than any past system we are aware of. The English and Chinese parsers use monolingual pre-training, while the remaining parsers incorporate a version of BERT pre-trained jointly on 104 languages.

5 Conclusion

The remarkable effectiveness of unsupervised pre-training of vector representations of language suggests that future advances in this area can continue to improve the ability of machine learning methods to model syntax (as well as other aspects of language). At the same time, syntactic annotations remain a useful tool due to their interpretability, and we hope that our parsing software may be of use to others.

Acknowledgments

This research used the Savio computational cluster provided by the Berkeley Research Computing program at the University of California, Berkeley.

References