Universal Dependencies (http://universaldependencies.org/) are growing in popularity due to the cross-lingual consistency and large language coverage of the provided data. The initiative has been able to connect researchers across the globe and now includes 64 treebanks in 45 languages. It is therefore not surprising that the Conference on Computational Natural Language Learning (CoNLL) in 2017 will feature a shared task on “Multilingual Parsing from Raw Text to Universal Dependencies.”
To facilitate further research on multilingual parsing and to enable even small teams to participate in the shared task, we are releasing baseline implementations corresponding to our best models. This short paper describes (1) the model structure employed in these models, (2) how the models were trained and (3) an empirical evaluation comparing these models to those in Andor et al. (2016). Our model builds on the DRAGNN framework (Kong et al., 2017) to improve upon Andor et al. (2016) with dynamically constructed, recurrent transition-based models. The code and the pretrained models are available at the SyntaxNet GitHub repository (https://github.com/tensorflow/models/tree/master/syntaxnet).
We note that this paper describes the parsing model used in the baseline. Further releases will describe any changes to the segmentation model compared to SyntaxNet.
2 Character-based representation
Recent work has shown that learned sub-word representations can improve over both static word embeddings and manually extracted feature sets for describing word morphology. Jozefowicz et al. (2016) use a convolutional model over the characters in each word for language modeling. Similarly, Ling et al. (2015a, b) use a bidirectional LSTM over characters in each word for parsing and machine translation.
Chung et al. (2016) take a more general approach. Instead of modeling each word explicitly, they allow the model to learn a hierarchical “multi-timescale” representation of the input, where each layer corresponds to a (learned) larger timescale.
Our modeling approach is inspired by this multi-timescale architecture in that we generate our computation graph dynamically, but we define the timescales explicitly. The input layer operates on characters and the subsequent layer operates on words, where the word representation is simply the hidden state computed by the first layer at each word boundary. In principle, this structure permits fully dynamic word representations based on left context (unlike previous work) and simplifies recurrent computation at the word level (unlike previous work with standard stacked LSTMs (Gillick et al., 2015)).
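As a minimal sketch of the word-boundary idea described above, the snippet below joins tokens with single spaces and records the index of the last character of each token. The hidden state of a character-level LSTM at those indices would then serve as the word representation. The function names `word_boundary_indices` and `select_word_states`, and the single-space joining, are illustrative assumptions and are not taken from the released code.

```python
from typing import List, Tuple


def word_boundary_indices(tokens: List[str]) -> Tuple[str, List[int]]:
    """Join non-empty tokens with single spaces and return the character
    string together with the index of the last character of every token."""
    text = " ".join(tokens)
    boundaries = []
    position = 0
    for token in tokens:
        position += len(token)           # now one past the token's last character
        boundaries.append(position - 1)  # index of the token's last character
        position += 1                    # skip the separating space
    return text, boundaries


def select_word_states(char_states: list, boundaries: List[int]) -> list:
    """Pick one state per word from a per-character state sequence."""
    return [char_states[i] for i in boundaries]
```

Here `char_states` stands in for the per-character hidden states; any sequence with one entry per character of `text` would work.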
Related work describes alternate neural network architectures for combining character- and word-level modeling for various tasks. Like Ballesteros et al. (2015), we use character-based LSTMs to improve the Stack-LSTM model (Dyer et al., 2015) for dependency parsing, but we share a single LSTM run over the entire sentence.
Our model combines the recurrent multi-task parsing model of Kong et al. (2017) with character-based representations learned by an LSTM. Given a tokenized text input, the model proceeds as follows:
A single LSTM processes the entire character string (including whitespace) left-to-right. The last hidden state in a given token (as given by the word boundaries) is used to represent that word in subsequent parts of the model. (In our UD v1.3 experiments, the raw text string is not available. Since we use gold segmentations, the whitespace is artificially induced, and functions as a “new word” signal for languages with no naturally occurring whitespace.)
A single LSTM processes the word representations (from the first step) in right-to-left order. We call this the “lookahead” model.
A single LSTM processes the lookahead representations right-to-left. This LSTM has a softmax layer which is trained to predict POS tags, and we refer to it as the “tagger” model.
The recurrent compositional parsing model (Kong et al., 2017) predicts the parse tree left-to-right using the arc-standard transition system. Given a stack and an input pointer to the buffer, the parser dynamically links and concatenates the following input representations:
Recurrently, the two steps that last modified the top two tokens on the stack (either SHIFT or REDUCE operations).
From the tagger layer, the hidden representations for the top two stack tokens and the current input token.
From the lookahead layer, the hidden representation for the current input token.
All are projected to 64 dimensions before concatenating.
The parser also extracts 12 discrete features for previously predicted parse labels, the same as in Kong et al. (2017).
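To make the transition system concrete, here is a minimal, unlabeled sketch of arc-standard parsing with a stack and an input pointer. It only replays a given action sequence; it does not include the neural scoring model. The action names `SHIFT`, `LEFT-ARC` and `RIGHT-ARC` follow the usual textbook presentation of arc-standard (the two arc actions are the reduce operations), and the function name `apply_arc_standard` is an illustrative assumption rather than part of the released code.

```python
from typing import List

SHIFT, LEFT_ARC, RIGHT_ARC = "SHIFT", "LEFT-ARC", "RIGHT-ARC"


def apply_arc_standard(num_tokens: int, actions: List[str]) -> List[int]:
    """Replay arc-standard transitions and return each token's head index.

    A head of -1 means the token was never attached (for example, the root).
    """
    stack: List[int] = []
    next_input = 0                   # input pointer into the buffer
    heads = [-1] * num_tokens
    for action in actions:
        if action == SHIFT:
            if next_input >= num_tokens:
                raise ValueError("SHIFT with an empty buffer")
            stack.append(next_input)
            next_input += 1
        elif action in (LEFT_ARC, RIGHT_ARC):
            if len(stack) < 2:
                raise ValueError("arc actions need two tokens on the stack")
            top = stack.pop()
            second = stack.pop()
            if action == LEFT_ARC:   # second-from-top becomes a dependent of top
                heads[second] = top
                stack.append(top)
            else:                    # top becomes a dependent of second-from-top
                heads[top] = second
                stack.append(second)
        else:
            raise ValueError(f"unknown action: {action}")
    return heads
```

For a three-token sentence such as "She eats fish", the sequence SHIFT, SHIFT, LEFT-ARC, SHIFT, RIGHT-ARC attaches both "She" and "fish" to "eats".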
At inference time, we use beam decoding in the parser with a beam size of 8. We do not use local normalization, and instead train the models with “self-normalization” (see below).
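The beam decoding step can be illustrated with a generic sketch that keeps the highest-scoring partial action sequences at each step. This is not the DRAGNN implementation: the callback `expand`, the tuple-based hypothesis format, and the `max_steps` cutoff are assumptions made for illustration, and scores are simply added along a sequence.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (total score, actions so far)


def beam_search(
    expand: Callable[[Tuple[str, ...]], Sequence[Tuple[str, float]]],
    beam_size: int = 8,
    max_steps: int = 100,
) -> Hypothesis:
    """Keep the `beam_size` best partial action sequences at every step.

    `expand(prefix)` returns (action, score) pairs for the next step, or an
    empty sequence when `prefix` is already complete.
    """
    beam: List[Hypothesis] = [(0.0, ())]
    for _ in range(max_steps):
        candidates: List[Hypothesis] = []
        all_finished = True
        for total, prefix in beam:
            options = expand(prefix)
            if not options:
                candidates.append((total, prefix))  # carry finished hypotheses forward
                continue
            all_finished = False
            for action, score in options:
                candidates.append((total + score, prefix + (action,)))
        if all_finished:
            break
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam_size]
    return max(beam, key=lambda h: h[0])
```

With `beam_size=1` this reduces to greedy decoding, which can commit to a locally attractive action that leads to a worse overall sequence. A wider beam lets a lower-scored first action survive long enough to win.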
This model is implemented using the DRAGNN framework in TensorFlow. All code is publicly available at the SyntaxNet repository. The code provides tools to visualize the unrolled structure of the graph at run-time.
We train using the multi-task, maximum-likelihood “stack-propagation” method described in Kong et al. (2017) and Zhang and Weiss (2016). Specifically, we use the gold labels to alternate between two updates:
Tagger: We unroll the first three LSTMs and backpropagate gradients computed from the POS tags.
Parser: We unroll the entire model, and backpropagate gradients computed from the oracle parse sequence.
We use the following schedule: pretrain Tagger for 10,000 iterations. Then alternate Tagger and Parser updates at a ratio of 1:8 until convergence.
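The schedule above can be sketched as a generator that names which update to run at each step. This reads the 1:8 ratio as one Tagger update for every eight Parser updates; the ordering within each cycle and the name `update_schedule` are assumptions for illustration, and the convergence check is left to the caller.

```python
from itertools import islice
from typing import Iterator


def update_schedule(
    pretrain_steps: int = 10_000, parser_per_tagger: int = 8
) -> Iterator[str]:
    """Yield "tagger" or "parser" for each training step, indefinitely."""
    for _ in range(pretrain_steps):      # Tagger-only pretraining phase
        yield "tagger"
    while True:                          # alternate until the caller stops
        yield "tagger"
        for _ in range(parser_per_tagger):
            yield "parser"


# Example: inspect the first cycle after pretraining.
first_cycle = list(islice(update_schedule(), 10_000, 10_009))
```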
To optimize for beam decoding, we regularize the softmax objective to be “self-normalized” (Vaswani et al., 2013; Andreas and Klein, 2015). With this modification to the softmax, the exponentiated scores of the model are encouraged (but not constrained) to sum to one, so the raw scores behave approximately like log-probabilities. We find that this helps mitigate some of the bias induced by local normalization (Andor et al., 2016), while being fast and efficient to train.
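One common way to express self-normalization is to add a penalty on the squared log-partition function to the usual cross-entropy loss. The sketch below assumes that form; the paper does not state the exact penalty or its weight, so `alpha` and the function names are hypothetical.

```python
import math
from typing import Sequence


def log_partition(scores: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(score))) over the action scores."""
    peak = max(scores)
    return peak + math.log(sum(math.exp(s - peak) for s in scores))


def self_normalized_loss(
    scores: Sequence[float], gold: int, alpha: float = 0.1
) -> float:
    """Cross-entropy plus a penalty that pushes log Z towards zero.

    `alpha` is an assumed penalty weight, not a value from the paper.
    """
    log_z = log_partition(scores)
    negative_log_likelihood = log_z - scores[gold]
    return negative_log_likelihood + alpha * log_z ** 2
```

When the scores are already log-probabilities, the log-partition term is zero and the loss reduces to ordinary cross-entropy. Adding a constant to every score leaves the cross-entropy unchanged but increases the penalty.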
Like the ratio above, many hyperparameters, including design decisions, were tuned to find reasonable values before training all 64 baseline models. While the full recipe can be deciphered from the code, here are some key points for practitioners:
We always project the LSTM hidden representations down from 256 to 64 dimensions when we pass from one component to another.
We use moving averages of parameters at inference time.
We use the following ADAM recipe: , and set to be one of (typically ).
We normalize all gradients to have unit norm before applying the ADAM updates.
We use dropout both recurrently and on the inputs, at the same rate (typically 0.7 or 0.8).
We use a minibatch size of 4, with 4 training threads performing asynchronous SGD.
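Two of the points above, gradient normalization and parameter averaging, can be sketched in a few lines of plain Python. The sketch assumes the norm is taken jointly over all gradient tensors (a global L2 norm) and that the moving average is exponential. The paper states neither detail, and the `decay` value and function names are hypothetical.

```python
import math
from typing import List


def normalize_to_unit_norm(
    gradients: List[List[float]], eps: float = 1e-12
) -> List[List[float]]:
    """Rescale all gradient tensors jointly so their global L2 norm is 1."""
    norm = math.sqrt(sum(g * g for tensor in gradients for g in tensor))
    scale = 1.0 / max(norm, eps)      # eps guards against an all-zero gradient
    return [[g * scale for g in tensor] for tensor in gradients]


def update_moving_average(
    average: List[float], params: List[float], decay: float = 0.999
) -> List[float]:
    """Exponential moving average of parameters, used in place of the raw
    parameters at inference time. `decay` is an assumed value."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(average, params)]
```

The normalized gradients would then be passed to the optimizer update, and the averaged parameters would be refreshed after each update.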
4 Comparison to Parsey’s Cousins
Since the test set is not available for the contest, we use v1.3 of the Universal Dependencies treebanks to compare to prior state-of-the-art on 52 languages. Our results are in Table 1. We observe that the new model outperforms the original SyntaxNet baselines, sometimes quite dramatically (e.g., on Latvian, by close to 12% absolute LAS). We note that this is not an exhaustive experiment, and further study is warranted in the future. Nonetheless, these results show that the new baselines compare very favorably to at least one publicly available state-of-the-art baseline.
|Czech-CLTT|77.34|73.40|79.9|75.79|
|Old Church Slavonic|84.86|78.85|87.1|81.47|
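For readers unfamiliar with the attachment metrics used in the comparison, the following sketch computes unlabeled and labeled attachment scores (UAS and LAS) for a single sentence. It is a simplified illustration: the official evaluation scripts may treat punctuation, multi-word tokens and label subtypes differently, and the function name `attachment_scores` is an assumption.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over tokens aligned one-to-one."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))  # head and label
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total
```

An "absolute" improvement in LAS, as quoted above, is the difference between two such percentages in percentage points.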
4.1 CoNLL2017 Shared Task
We provide pre-trained models for all 64 treebanks in the CoNLL2017 Shared Task on the SyntaxNet website. All source code and data are publicly available. Please see the task website for any updates.
We thank our collaborators at the Universal Dependencies and CoNLL2017 Shared Task, as well as Milan Straka of UDPipe. We’d also like to thank all members of the Google Parsing Team (current and former) who made this release possible.
- Andor et al. [2016] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2442–2452, 2016.
- Andreas and Klein [2015] Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In HLT-NAACL, pages 244–249, 2015.
- Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Ballesteros et al. [2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 349–359, 2015.
- Chung et al. [2016] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- Dyer et al. [2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL, pages 334–343, 2015.
- Gillick et al. [2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103, 2015.
- Jozefowicz et al. [2016] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- Kim et al. [2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. arXiv preprint arXiv:1508.06615, 2015.
- Kong et al. [2017] Lingpeng Kong, Chris Alberti, Daniel Andor, Ivan Bogatyy, and David Weiss. DRAGNN: A transition-based framework for dynamically connected neural networks. arXiv preprint, 2017.
- Lankinen et al. [2016] Matti Lankinen, Hannes Heikinheimo, Pyry Takala, Tapani Raiko, and Juha Karhunen. A character-word compositional neural language model for Finnish. arXiv preprint arXiv:1508.06615, 2016.
- Ling et al. [2015a] Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530, 2015a.
- Ling et al. [2015b] Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W Black. Character-based neural machine translation. arXiv preprint arXiv:1511.04586, 2015b.
- Miyamoto and Cho [2016] Yasumasa Miyamoto and Kyunghyun Cho. Gated word-character recurrent language model. arXiv preprint arXiv:1508.06615, 2016.
- Vaswani et al. [2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. Decoding with large-scale neural language models improves translation. In EMNLP, pages 1387–1392. Citeseer, 2013.
- Zhang and Weiss [2016] Yuan Zhang and David Weiss. Stack-propagation: Improved representation learning for syntax. In Proc. ACL, 2016.