There has recently been rapid progress in developing contextual word representations that improve accuracy across a range of natural language tasks (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). In our earlier work (Kitaev and Klein, 2018), we showed that such representations are helpful for constituency parsing. However, those results considered only the LSTM-based ELMo representations (Peters et al., 2018), and only for the English language. We now show that a purely self-attentive pre-trained model also works, by substituting BERT (Devlin et al., 2018) for ELMo. We further demonstrate that pre-training and self-attention are effective across languages by applying our parsing architecture to ten additional languages.
Our parser code and trained models for 11 languages are publicly available at https://github.com/nikitakit/self-attentive-parser.
Our parser as described in Kitaev and Klein (2018) accepts as input a sequence of vectors corresponding to the words in a sentence, transforms these representations using one or more self-attention layers, and finally uses them to output a parse tree. We incorporate BERT by taking the token representations from the last layer of a BERT model and projecting them to 512 dimensions (the default size used by our parser) with a learned projection matrix. While our parser operates on vectors aligned to the words in a sentence, BERT associates vectors with sub-word units based on WordPiece tokenization (Wu et al., 2016). We bridge this difference by retaining only the BERT vector corresponding to the last sub-word unit of each word in the sentence. We briefly experimented with alternatives, such as using only the first sub-word instead, but did not find that this choice had a substantial effect on English parsing accuracy.
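The sub-word alignment and projection steps can be sketched as follows. This is a minimal NumPy sketch: the hidden sizes, the toy WordPiece segmentation, and the random matrices are illustrative, and in the actual parser the projection matrix is learned during training rather than sampled.

```python
import numpy as np

# Illustrative sizes: BERT-large hidden size 1024, parser input size 512.
HIDDEN, PARSER_DIM = 1024, 512

def align_and_project(subword_vecs, last_subword_idx, proj):
    """Keep only the BERT vector of each word's LAST sub-word unit,
    then project it down to the parser's input dimension."""
    return subword_vecs[last_subword_idx] @ proj  # (num_words, PARSER_DIM)

# Toy example: "the unaffable man" -> WordPiece units
# ["the", "una", "##ffa", "##ble", "man"]; last units sit at indices 0, 3, 4.
rng = np.random.default_rng(0)
subword_vecs = rng.standard_normal((5, HIDDEN))
proj = rng.standard_normal((HIDDEN, PARSER_DIM)) * 0.02  # learned in practice
word_vecs = align_and_project(subword_vecs, np.array([0, 3, 4]), proj)
assert word_vecs.shape == (3, PARSER_DIM)  # one 512-d vector per word
```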
The fact that additional layers are applied to the output of BERT – which itself uses a self-attentive architecture – may at first seem redundant, but there are important differences between these two portions of the architecture. The extra layers on top of BERT use word-based tokenization instead of sub-words, apply the factored version of self-attention proposed in Kitaev and Klein (2018), and are randomly initialized instead of being pre-trained. We found that omitting these additional layers and using the BERT vectors directly hurts parsing accuracy.
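For illustration, a rough single-head sketch of the factored-attention idea (this is a simplified reading, not the exact formulation of Kitaev and Klein (2018)) keeps the content and position halves of each representation separate, so that content-position cross terms never enter the attention computation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def factored_attention(qc, kc, vc, qp, kp, vp):
    """Content halves (qc, kc, vc) and position halves (qp, kp, vp)
    attend independently; the outputs are concatenated so the two
    information streams stay factored for the next layer."""
    out_c = softmax(qc @ kc.T / np.sqrt(qc.shape[-1])) @ vc
    out_p = softmax(qp @ kp.T / np.sqrt(qp.shape[-1])) @ vp
    return np.concatenate([out_c, out_p], axis=-1)

# Toy sentence of 5 positions, 8-d content and 8-d position halves.
rng = np.random.default_rng(0)
qc, kc, vc = rng.standard_normal((3, 5, 8))
qp, kp, vp = rng.standard_normal((3, 5, 8))
out = factored_attention(qc, kc, vc, qp, kp, vp)
assert out.shape == (5, 16)
```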
We also extend the parser to predict part-of-speech tags in addition to constituent labels, a feature we include based on feedback from users of our previous parser. Tags are predicted using a small feed-forward network (with only one ReLU nonlinearity) after the final layer of self-attention. This differs slightly from Joshi et al. (2018), where tags are predicted based on span representations instead. The tagging head is trained jointly with the parser by adding an auxiliary softmax cross-entropy loss, averaged over all words present in a given batch.
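A minimal sketch of such a tagging head follows. The dimensions, initialization, and tag-set size (45 is the Penn Treebank POS tag count) are illustrative; in the parser these parameters are trained jointly with the main parsing objective.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tag_loss(word_vecs, gold_tags, W1, b1, W2, b2):
    """Feed-forward tagging head: one hidden layer with a single ReLU,
    a softmax over the tag set, and cross-entropy averaged over words."""
    hidden = np.maximum(0.0, word_vecs @ W1 + b1)     # the one ReLU
    probs = softmax(hidden @ W2 + b2)                 # (num_words, num_tags)
    nll = -np.log(probs[np.arange(len(gold_tags)), gold_tags])
    return nll.mean()                                 # average over all words

rng = np.random.default_rng(0)
NUM_WORDS, D, H, NUM_TAGS = 6, 512, 256, 45
vecs = rng.standard_normal((NUM_WORDS, D)) * 0.1
W1, b1 = rng.standard_normal((D, H)) * 0.02, np.zeros(H)
W2, b2 = rng.standard_normal((H, NUM_TAGS)) * 0.02, np.zeros(NUM_TAGS)
gold = rng.integers(0, NUM_TAGS, NUM_WORDS)
loss = tag_loss(vecs, gold, W1, b1, W2, b2)
```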
3 Comparison of Pre-Training Methods
In this section, we compare parsers that use BERT, parsers that use ELMo, and parsers trained from scratch on treebank data alone. Our comparison of the different methods for English is shown in Table 1.
|Model||Pre-training||F1|
|BERT (cased)||104 languages||94.97|
|Ensemble (final 4 models above)||–||95.87|
BERT with the “base” hyperparameter settings (12 layers, 12 attention heads per layer, and 768-dimensional hidden vectors) performs comparably or slightly better than ELMo (95.32 vs. 95.21 F1), while a larger version of BERT (24 layers, 16 attention heads per layer, and 1024-dimensional hidden vectors) leads to better parsing accuracy (95.70 F1). These results show that both the LSTM-based architecture of ELMo and the self-attentive architecture of BERT are viable for parsing, and that pre-training benefits from having a high model capacity. We did not observe a sizable difference between using a version of BERT that converts all text to lowercase and a version of BERT that retains case information.
We found that pre-training on only English outperformed multilingual pre-training given the same model capacity, but the relative decrease in error rate was less than 6% (95.24 vs. 94.97 F1). This is a promising result because it supports the idea of using joint multilingual pre-training as a way to provide support for many languages in a resource-efficient manner.
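The error-rate comparison can be checked directly, treating 100 − F1 as the error rate:

```python
# English-only vs. multilingual pre-training (F1 scores from Table 1).
en_f1, multi_f1 = 95.24, 94.97
en_err, multi_err = 100 - en_f1, 100 - multi_f1     # 4.76 vs. 5.03
relative_decrease = (multi_err - en_err) / multi_err
print(f"{relative_decrease:.1%}")  # 5.4%, i.e. less than 6%
```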
We also conduct a control experiment to tease apart the benefits of the BERT architecture and training setup from the effects of the data used for pre-training. We originally attempted to use a randomly-initialized BERT model, but found that it would not train effectively within the range of hyperparameters we tried. (The original BERT models were trained on significantly more powerful hardware, and for a longer period of time, than any of the experiments we report in this paper.) Instead, we trained an English parser using a version of BERT that was pre-trained on the Chinese Wikipedia. Neither the pre-training domain nor the subword vocabulary is a good fit for the target task; however, English does occur sporadically throughout the Chinese Wikipedia, and the model can represent English text losslessly: all English letters are present in its subword vocabulary, so in the worst case it decomposes an English word into individual letters. We found that this model achieved performance comparable to a version of our parser designed to be trained on treebank data alone (93.57 vs. 93.61 F1). This result suggests that even when the pre-training data is a poor fit for the target domain, fine-tuning can still produce results comparable to purely supervised training starting with randomly-initialized parameters.
|Model||LR||LP||F1|
|Dyer et al. (2016)||–||–||93.3|
|Choe and Charniak (2016)||–||–||93.8|
|Liu and Zhang (2017)||–||–||94.2|
|Fried et al. (2017)||–||–||94.66|
|Joshi et al. (2018)||93.8||94.8||94.3|
|Kitaev and Klein (2018)||94.85||95.40||95.13|
|Ours (single model)||95.46||95.73||95.59|
|Ours (ensemble of 4)||95.51||96.03||95.77|
|Model||Arabic||Basque||French||German||Hebrew||Hungarian||Korean||Polish||Swedish||Avg|
|Dev (all lengths)|
|Coavoux and Crabbé (2017)||83.07||88.35||82.35||88.75||90.34||91.22||86.78||94.0||79.64||87.16|
|Kitaev and Klein (2018)||85.94||90.05||84.42||91.39||90.78||92.32||87.90||93.76||79.71||88.47|
|Test (all lengths)|
|Björkelund et al. (2014)||81.32||88.24||82.53||81.66||89.80||91.72||83.81||90.50||85.50||86.12|
|Coavoux and Crabbé (2017)||82.92||88.81||82.49||85.34||89.87||92.34||86.04||93.64||84.0||87.27|
|Kitaev and Klein (2018)||85.61||89.71||84.06||87.69||90.35||92.69||86.59||93.69||84.45||88.32|
|Δ: Ours − Best Previous||+2.36||+1.92||+3.36||+2.51||+2.64||+2.21||+2.21||+2.67||+3.36||–|
|Model||LR||LP||F1|
|Fried and Klein (2018)||–||–||87.0|
|Teng and Zhang (2018)||87.1||87.5||87.3|
We train and evaluate our model on treebanks for eleven languages: English (see Table 2), the nine languages represented in the SPMRL 2013/2014 shared tasks (Seddah et al., 2013) (see Table 3), and Chinese (see Table 4). For each of these languages, our parser obtains a higher F1 score than any past system we are aware of. The English and Chinese parsers use monolingual pre-training, while the remaining parsers incorporate a version of BERT pre-trained jointly on 104 languages.
The remarkable effectiveness of unsupervised pre-training of vector representations of language suggests that future advances in this area can further improve the ability of machine learning methods to model syntax (as well as other aspects of language). At the same time, syntactic annotations remain a useful tool due to their interpretability, and we hope that our parsing software may be of use to others.
This research used the Savio computational cluster provided by the Berkeley Research Computing program at the University of California, Berkeley.
- Björkelund et al. (2014) Anders Björkelund, Ozlem Cetinoglu, Agnieszka Faleńska, Richárd Farkas, Thomas Mueller, Wolfgang Seeker, and Zsolt Szántó. 2014. The IMS-Wrocław-Szeged-CIS entry at the SPMRL 2014 shared task: Reranking and morphosyntax meet unlabeled data. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages. pages 97–102.
- Björkelund et al. (2013) Anders Björkelund, Ozlem Cetinoglu, Richárd Farkas, Thomas Mueller, and Wolfgang Seeker. 2013. (Re)ranking meets morphosyntax: State-of-the-art results from the SPMRL 2013 shared task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages. Association for Computational Linguistics, Seattle, Washington, USA, pages 135–145. http://www.aclweb.org/anthology/W13-4916.
- Choe and Charniak (2016) Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2331–2336. https://doi.org/10.18653/v1/D16-1257.
- Coavoux and Crabbé (2017) Maximin Coavoux and Benoit Crabbé. 2017. Multilingual lexicalized constituency parsing with word-level auxiliary tasks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, pages 331–336. http://aclweb.org/anthology/E17-2053.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs] ArXiv: 1810.04805. http://arxiv.org/abs/1810.04805.
- Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 199–209. https://doi.org/10.18653/v1/N16-1024.
- Fried and Klein (2018) Daniel Fried and Dan Klein. 2018. Policy gradient as a proxy for dynamic oracles in constituency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 469–476. http://aclweb.org/anthology/P18-2075.
- Fried et al. (2017) Daniel Fried, Mitchell Stern, and Dan Klein. 2017. Improving neural parsing by disentangling model combination and reranking effects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 161–166. https://doi.org/10.18653/v1/P17-2025.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 328–339. http://aclweb.org/anthology/P18-1031.
- Joshi et al. (2018) Vidur Joshi, Matthew Peters, and Mark Hopkins. 2018. Extending a parser to distant domains using a few dozen partially annotated examples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1190–1199. http://aclweb.org/anthology/P18-1110.
- Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 2676–2686. http://aclweb.org/anthology/P18-1249.
- Liu and Zhang (2017) Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics 5:413–424. http://aclweb.org/anthology/Q17-1029.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. http://arxiv.org/abs/1802.05365.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- Seddah et al. (2013) Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho D. Choi, Richárd Farkas, Jennifer Foster, Iakes Goenaga, Koldo Gojenola Galletebeitia, Yoav Goldberg, Spence Green, Nizar Habash, Marco Kuhlmann, Wolfgang Maier, Joakim Nivre, Adam Przepiórkowski, Ryan Roth, Wolfgang Seeker, Yannick Versley, Veronika Vincze, Marcin Woliński, Alina Wróblewska, and Eric Villemonte de la Clergerie. 2013. Overview of the SPMRL 2013 shared task: A cross-framework evaluation of parsing morphologically rich languages. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages. Association for Computational Linguistics, pages 146–182. http://www.aclweb.org/anthology/W13-4917.
- Teng and Zhang (2018) Zhiyang Teng and Yue Zhang. 2018. Two local models for neural constituent parsing. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, pages 119–132. http://aclweb.org/anthology/C18-1011.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [cs] ArXiv: 1609.08144. http://arxiv.org/abs/1609.08144.