
Non-Projective Dependency Parsing via Latent Heads Representation (LHR)

by Matteo Grella and Simone Cangialosi

In this paper, we introduce a novel approach based on a bidirectional recurrent autoencoder to perform globally optimized non-projective dependency parsing via semi-supervised learning. The syntactic analysis is completed at the end of the neural process that generates a Latent Heads Representation (LHR), without any algorithmic constraint and with linear complexity. The resulting "latent syntactic structure" can be used directly in other semantic tasks. The LHR is transformed into the usual dependency tree by computing a simple vector similarity. We believe that our model has the potential to compete with much more complex state-of-the-art parsing architectures.





1 Introduction

Dependency parsing is considered to be a fundamental step in linguistic processing because of its key importance in mediating between linguistic expression and meaning. The task is rather complex, as natural language is implicit, contextual, ambiguous and often imprecise. Recent data-driven deep-learning techniques have produced very successful results in almost all natural language processing tasks, including dependency parsing, thanks to their intrinsic ability to handle noisy inputs and the increased availability of training resources; see Goldberg (2017) for an introduction.

Generally speaking, modern approaches to dependency parsing can be categorized into graph-based and transition-based parsers [Kübler et al.2009]. Most neural dependency parsers take advantage of neural networks only for feature extraction, using those features to boost traditional parsing algorithms and to reduce the need for feature engineering.[1]

[1] The "old school" of state-of-the-art parsers relied on hand-crafted feature functions [Zhang and Nivre2011].

Starting from Chen and Manning (2014) there has been a rise of sophisticated neural architectures for feature representation (Dyer et al. 2015, Ballesteros et al. 2016, Kiperwasser and Goldberg 2016, Dozat and Manning 2017, Liu and Zhang 2017, just to name a few). Many of them use recurrent neural networks. In particular, Kiperwasser and Goldberg (2016) were the first to demonstrate the effectiveness of using a conceptually simple BiLSTM [Graves2008, Irsoy and Cardie2014] to reduce the features to a minimum [Shi et al.2017], achieving state-of-the-art results in both transition-based and graph-based approaches.[2]

[2] A BiLSTM (bidirectional LSTM) is composed of two LSTMs, one reading the sequence in its regular order and the other reading it in reverse.

However, although deep neural networks have proven to be successful in capturing the relevant information for syntactic analysis, their use is still auxiliary to the traditional parsing algorithms. For instance, transition-based parsers use the neural components to predict the transitions, not the syntactic dependencies (arcs) directly. Graph-based parsers, by contrast, use them to assign a weight to each possible arc and then construct the maximum spanning tree [McDonald et al.2005].

In this paper we introduce a novel approach based on a bidirectional recurrent autoencoder to perform globally optimized dependency parsing via semi-supervised learning. The syntactic analysis is completed at the end of the neural process that generates what we call the Latent Heads Representation (LHR), without any algorithmic constraint and with linear complexity. The resulting "latent syntactic structure" can be used directly for other high-level tasks that benefit from syntactic information (e.g., sentiment analysis, sentence similarity, neural machine translation).

We use a simple decoder to transform the LHR into the usual tree representation by computing a vector similarity, with quadratic complexity.

An interesting property of our model compared to other approaches is that it handles unrestricted non-projective dependencies naturally, without increasing the complexity and without requiring any adaptation or post-processing.[3]

[3] This is particularly remarkable, as discontinuities occur in most if not all natural languages: non-projective structures are attested even in languages with a relatively fixed word order, like English or Chinese.

The resulting parser has a very simple architecture and provides appreciable results without using any resource outside the tree-banks (with a baseline of 92.8% UAS on the English Penn Treebank [Marcus et al.1993] annotated with Stanford Dependencies and non-gold tags). We believe that with some tuning our model has the potential to compete with much more complex state-of-the-art parsing architectures.

Figure 2: Illustration of the behavior of the neural model when parsing a sentence. The tokens of a sentence (tk1...tkn) are first transformed into a distributed representation (e1...en) and then encoded into the context vectors (c1...cn) by the context-encoder. The context vectors pass in turn through the heads-encoder, which predicts the latent heads (h1...hn). The decoder finds the top token i by searching for the latent head hi most similar to the root vector. Subsequently, for each token i its head token j is found such that i ≠ j and cj is the most similar to hi. After having found all the heads and removed all the cycles of the resulting dependency tree, the multi-tasking network assigns a deprel label and a POS tag to each token i, taking as input the concatenation of its context vector (ci) and that of its head (cj).

2 Our Approach

2.1 The Idea

Dependency parsing consists of building the syntactic structure of a sentence through word-to-word dependencies (Tesnière 1959, Sgall et al. 1986, Mel'čuk 1988, Hudson 1990), where words are linked by binary asymmetric relations called dependencies.

Like Zhang et al. (2016), we formalize dependency parsing as the task of finding for each word in a sentence its most probable head, without tree structure constraints (see a comparison between our model and Zhang's in Section 5).

We propose a novel approach for dependency parsing based on the ability of autoencoders [Rumelhart et al.1985] to learn a representation with the purpose of reconstructing their own input.[4]

[4] While conceptually simple, autoencoders play an important role in machine learning and are one of the fundamental paradigms for unsupervised learning.

The gist of our idea is to use a bidirectional recurrent autoencoder (Figure 1) to reconstruct, for each i-th input, another j-th input of the same sequence, remaining in the same domain. In this way, we are able to train the network to create an approximate representation of the head of a given token.[5]

[5] Although in a different context, the work of Rama and Çöltekin (2016) is the most similar to ours that we found. They use an LSTM sequence-to-sequence autoencoder to learn a deep word representation that can be used for meaningful comparison across dialects.

As detailed in Section 2.2, the input values of the autoencoder are not fixed: they are tuned on the basis of the errors of the autoencoder itself, resulting in a sort of information rebalancing that is able to converge.

Considering that the network learns by itself a suitable representation for both the input and the output, and that we provide only a "teaching signal" (that is, the index of the target vector of the head to reconstruct, without imposing any particular representation), we consider our approach semi-supervised.

2.2 Unlabeled Parsing

Our model (Figure 2) is composed of two BiRNN[6] encoders which perform two distinct tasks. The first BiRNN (context-encoder) receives as input the tokens of a sentence already encoded in a dense representation.[7] The context-encoder encodes the input tokens into "context vectors" that represent them together with their surrounding context. Its contribution is crucial for the positional information that it adds to the input vectors, especially when a word occurs more than once in a sentence.

[6] In this paper, we use the term BiRNN to abstract the general concept of a bidirectional recurrent network, not to refer to the specific model of Schuster and Paliwal (1997), who extended the simple recurrent network (Elman, 1990).
[7] For example, the token encodings can be obtained by concatenating word and POS embeddings.
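To make the first stage concrete, here is a toy sketch of the context-encoder's wiring, with a minimal bidirectional Elman pass standing in for the BiLSTM; every function name, weight, and dimension is invented for illustration, not taken from the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def elman_pass(inputs, W, U, b):
    """One directional Elman-style recurrent pass; returns the hidden states."""
    h = np.zeros(U.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def context_encode(tokens, d_in=8, d_hid=6):
    """Toy context-encoder: token encodings -> context vectors, built by
    concatenating forward and backward recurrent states (random weights)."""
    Wf, Uf = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))
    Wb, Ub = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))
    b = np.zeros(d_hid)
    fwd = elman_pass(tokens, Wf, Uf, b)
    bwd = elman_pass(tokens[::-1], Wb, Ub, b)[::-1]
    return [np.concatenate([f, bk]) for f, bk in zip(fwd, bwd)]

# toy sentence: 5 tokens, each already the concatenation of word + POS embeddings
sentence = [rng.normal(size=8) for _ in range(5)]
contexts = context_encode(sentence)
```

Each context vector sees the whole sentence: the forward state summarizes the prefix, the backward state the suffix, which is what gives two occurrences of the same word different context vectors.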

The context vectors are in turn given as input to the second BiRNN (heads-encoder), which acts as an autoencoder, transforming them into another representation that we call "latent heads". The aim of the heads-encoder is to associate each context vector to the one that represents its head.[8] It is trained to minimize the difference between a context vector and its representation as latent head.[9]

[8] The output vectors of the BiLSTM pass through a feedforward network that reduces their size to that of the input vectors.
[9] We use the mean squared error during the training phase and the cosine similarity during decoding.

During the training, the dependencies between dependent and governor tokens are taken from the gold dependency trees.

The mean absolute errors are propagated from the heads-encoder all the way back through the context-encoder, down to the initial token embeddings (which are trained together with the model). An optimizer updates the parameters according to the gradients.

The model is trained to predict the latent heads of each sentence without a sequential order, generating all the token dependencies at the same time. For this reason, we consider it globally optimized.
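The global objective can be sketched as follows; this is a hypothetical illustration using the mean-squared-error formulation mentioned in the footnote above, treating the root as an extra target vector, with all names invented here:

```python
import numpy as np

def lhr_training_loss(latent_heads, contexts, gold_heads, root_vector):
    """Sketch of the global objective: the error between each predicted
    latent head h_i and the context vector of token i's gold head (the
    root vector for the top token), averaged over all tokens at once."""
    targets = np.stack([root_vector if g == -1 else contexts[g]
                        for g in gold_heads])
    diff = np.stack(latent_heads) - targets
    return float(np.mean(diff ** 2))

contexts = np.eye(3)               # toy context vectors
root = np.ones(3)
gold_heads = [1, -1, 1]            # token 1 is the top; 0 and 2 depend on it
perfect = [contexts[1], root, contexts[1]]
assert lhr_training_loss(perfect, contexts, gold_heads, root) == 0.0
```

Because the loss averages over every token of the sentence in one shot, no transition sequence or arc ordering is involved: the gradient updates all dependencies simultaneously.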

Thanks to the ability of LSTMs [Hochreiter and Schmidhuber1997] (or equivalent gated recurrent networks) to remember information over long periods, this method allows the model to recognize word-to-word dependencies between arbitrary positions in a sequence of words directly, without the need for any additional transition-based or graph-based framework.

To construct the dependency tree we use a decoder that finds the head (i.e., the governor) of each token by searching for the context vector most similar to its latent head (excluding itself). The top token of the sentence is found before assigning the other heads, by looking for the latent head most similar to a reference vector used to represent the virtual root (see Section 2.4).
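The decoding just described can be sketched as follows — a toy implementation with hand-built vectors; the function names and the tiny example are invented here, not taken from the released code:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def decode(contexts, latent_heads, root_vector):
    """Sketch of the decoder: pick the top token by similarity between its
    latent head and the root vector, then pick every other token's head by
    similarity between its latent head and the candidate context vectors."""
    n = len(contexts)
    top = max(range(n), key=lambda i: cos(latent_heads[i], root_vector))
    heads = {top: -1}                      # -1 marks the virtual root
    for i in range(n):
        if i != top:
            heads[i] = max((j for j in range(n) if j != i),
                           key=lambda j: cos(contexts[j], latent_heads[i]))
    return heads

# toy vectors: token 1 should attach to the root, tokens 0 and 2 to token 1
c = [np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]), np.array([0., 0., 1., 0.])]
root = np.array([0., 0., 0., 1.])
h = [c[1].copy(), root.copy(), c[1].copy()]
assert decode(c, h, root) == {1: -1, 0: 1, 2: 1}
```

The two nested similarity searches make the decoding quadratic in the sentence length, matching the complexity stated earlier for the decoder.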

At test time, we ensure that the dependency tree given in output is well-formed by iteratively identifying and fixing cycles with simple heuristics, without any loss in accuracy.[10]

[10] For each cycle, the fix is done by removing the arc with the lowest score and assigning to its dependent the head that maximizes the latent-head similarity without introducing new cycles.

Like Zhang et al. (2016), we empirically observed that during decoding most outputs are already trees, without the need to fix cycles. This seems to confirm that in both models the linear sequence of tokens itself is sufficient to recover the underlying dependency structure.
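The cycle-repair heuristic described in the footnote above can be sketched like this; it is an assumed reading of that heuristic (detach the weakest arc of a cycle, re-attach outside the cycle), with invented names and a hand-built example:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_cycle(heads):
    """Return the nodes of one cycle in the head map, or None."""
    color = {i: 0 for i in heads}          # 0 unseen, 1 on current path, 2 done
    for start in heads:
        path, i = [], start
        while i != -1 and color[i] == 0:
            color[i] = 1
            path.append(i)
            i = heads[i]
        if i != -1 and color[i] == 1:      # walked back into the current path
            return path[path.index(i):]
        for p in path:
            color[p] = 2
    return None

def fix_cycles(heads, contexts, latent_heads):
    """Break each cycle by detaching its lowest-scoring arc and re-attaching
    the dependent to the best head outside the cycle (the top token, which
    has head -1, can never be inside a cycle, so a candidate always exists)."""
    cyc = find_cycle(heads)
    while cyc:
        worst = min(cyc, key=lambda i: cos(contexts[heads[i]], latent_heads[i]))
        heads[worst] = max((j for j in range(len(contexts)) if j not in cyc),
                           key=lambda j: cos(contexts[j], latent_heads[worst]))
        cyc = find_cycle(heads)
    return heads

# toy cycle 0 <-> 1, with token 2 attached to the root; the weaker arc (0 -> 1)
# is removed and token 0 is re-attached to token 2
c = [np.array([1., 0., 0.]), np.array([0., 1., 0.]), np.array([0., 0., 1.])]
h = [np.array([0., 0.9, 0.2]), np.array([0.8, 0., 0.1]), np.array([0., 0., 0.])]
fixed = fix_cycles({0: 1, 1: 0, 2: -1}, c, h)
assert fixed == {0: 2, 1: 0, 2: -1}
```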

2.3 Labeled Parsing

So far we have described unlabeled parsing. To predict the labels, we introduce a simple module that classifies the dependent-governor pairs, obtained as described in Section 2.2, using their related context vectors as input. If the governor is the root node, the root vector is used instead of the context one.

This labeler is composed of a simple feedforward network and is trained on the gold trees. The training objective is to set the scores of the correct labels above the scores of incorrect ones.[11]

[11] We use a margin-based objective, aiming to maximize the margin between the highest scoring correct label and the highest scoring incorrect label. We also experimented with the cross-entropy loss, activating the output layer with the Softmax, obtaining comparable accuracies but with a slower convergence.
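The margin-based objective mentioned in the footnote can be sketched in a few lines — a generic hinge-style formulation with a margin of 1, assumed here for illustration:

```python
def margin_loss(scores, gold):
    """Margin-based objective sketch: require the gold label's score to
    exceed the best incorrect label's score by at least 1."""
    best_wrong = max(s for i, s in enumerate(scores) if i != gold)
    return max(0.0, 1.0 - (scores[gold] - best_wrong))

assert margin_loss([2.5, 0.3, 0.1], gold=0) == 0.0       # margin 2.2 > 1: no loss
assert abs(margin_loss([1.2, 1.0, 0.1], gold=0) - 0.8) < 1e-9
```

Unlike the cross-entropy alternative, the loss is exactly zero once the margin is satisfied, so confidently classified pairs contribute no gradient.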

It follows that the context-encoder produces a representation that is shared by the heads-encoder and the labeler, receiving two contributions during the training phase. This sharing of parameters can be seen as an instance of multi-task learning [Caruana1997].

As we show in Section 4, this method is effective: training the context-encoder to be good at supporting the prediction of the arc labels significantly improves the convergence of the heads-encoder, increasing the global unlabeled attachments score.

2.3.1 Part-of-Speech Tagger

A typical approach to syntactic parsing assumes that input tokens are morphologically disambiguated with a part-of-speech (POS) tagger before parsing begins. This is problematic, especially for richly inflected languages (e.g., Italian and German), where there is considerable interaction between morphology and syntax, such that neither can be fully disambiguated without considering the other.

To train a parser for a real-world setting, POS tags predicted by an external model are used instead of the gold ones. Modern neural approaches take advantage of pre-trained word embeddings and other token representations (e.g., character embeddings) to overcome the lack of gold information, achieving results similar to those obtained using gold data.

However, most of them focus on dependency parsing without worrying about producing POS tags coherent with the predicted labels.

In contrast, we extend the labeler (Section 2.3) to predict the arc label and the gold coarse-grained part-of-speech jointly, using two feedforward networks that share the same hidden layer. Intuitively, to predict the POS tag of a token the model considers simultaneously the neighboring words, as in most POS taggers, and its syntactic function within the sentence. The label-POS pair is chosen by maximizing the sum of the label score and the POS score, evaluating only the pairs seen in the training set.
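The joint selection step can be sketched as follows; the dictionaries, scores, and label inventory are invented for illustration:

```python
def pick_label_pos(label_scores, pos_scores, seen_pairs):
    """Joint decoding sketch: choose the (deprel, POS) pair with the highest
    summed score, restricted to pairs observed in the training set."""
    return max(seen_pairs,
               key=lambda pair: label_scores[pair[0]] + pos_scores[pair[1]])

label_scores = {"nsubj": 2.0, "amod": 1.5}
pos_scores = {"NOUN": 1.0, "ADJ": 2.0}
seen_pairs = {("nsubj", "NOUN"), ("amod", "ADJ")}   # ("nsubj", "ADJ") never seen
assert pick_label_pos(label_scores, pos_scores, seen_pairs) == ("amod", "ADJ")
```

Restricting the search to pairs attested in training is what keeps the output POS tags coherent with the predicted dependency labels.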

In this configuration we are experimenting with different ways to create the initial token encodings (see Section 4.2).

2.4 Virtual Root

Since dependency parsing relies on a "verb centricity" theory [Tesnière1959], one of the requirements for a well-formed dependency tree is that there is precisely one root, which is usually the main finite verb of a sentence.

Our model complies with this by selecting the token connected to the root before any other dependency.

The root vector is initialized with random values and can be trained only with labeled parsing (Section 2.3). When fine-tuned, it helps to increase the accuracy of the root attachments.

We have tried alternative solutions to avoid having an external root vector. One of these is to force the autoencoder to reconstruct the token itself in case it points to the root. An extensive benchmark to choose the optimal solution has yet to be done.

In addition, we are testing whether our model is misled by garden-path sentences [Frazier and Rayner 1982], which usually create problems for greedy decoding algorithms.

3 Latent Syntactic Structure

Starting from the results of the neural process described in Section 2.2, it is possible to construct a "latent syntactic structure" by concatenating each context vector with its related latent head. A way to prepare the input for other semantic tasks is to use an attention mechanism [Bahdanau et al.2017] as a feature extractor capable of recognizing the relevant information for a given task (Figure 3). This makes it possible to train the parser together with the task to support, incorporating its objective directly, without requiring the latter to interpret the syntactic output.
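The construction of the latent syntactic structure is a simple per-token concatenation, sketched below with invented names and toy vectors:

```python
import numpy as np

def latent_syntactic_structure(contexts, latent_heads):
    """Build the LSS: one row per token, the concatenation of its context
    vector and its latent head, ready to feed a downstream attention layer."""
    return np.stack([np.concatenate([c, h])
                     for c, h in zip(contexts, latent_heads)])

contexts = [np.zeros(4), np.ones(4)]
latent_heads = [np.ones(4), np.zeros(4)]
lss = latent_syntactic_structure(contexts, latent_heads)
assert lss.shape == (2, 8)
```

Each row carries both what the token is in context and where it attaches, so a downstream task can exploit the syntax without ever decoding an explicit tree.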

However, the focus of this paper is dependency parsing and not how to use its results, so we leave this topic to future work (see Section 6).

Figure 3: Example of a Latent Syntactic Structure in input to an Attention Mechanism.

4 Experiments and Results

Please note that this section will be updated with new results soon. At the moment it contains only the results that constitute our baseline.

The parser is implemented in Kotlin, using the SimpleDNN neural network library. The code is available in the LHRParser GitHub repository.

Rather than top parsing accuracy, in this paper we focus on the ability of the proposed model to learn a latent representation capable of capturing the information needed for the syntactic analysis.

A performance evaluation has been carried out on the Penn Treebank (PTB) [Marcus et al.1993] converted to Stanford Dependencies (de Marneffe et al., 2006), following the standard train/dev/test splits and without considering punctuation markers. This dataset contains a few non-projective trees.

Our baseline is obtained following the labeled parsing approach described in Section 2.3. The input token encodings are built by concatenating word and POS embeddings (initialized with random values and fine-tuned during training). The part-of-speech tags are assigned by an automatic tagger.[14]

[14] The predicted POS tags are the same used in Dyer et al. (2015) and Kiperwasser and Goldberg (2016). We thank Kiperwasser for sharing their data with us.

Like Kiperwasser and Goldberg (2016), during training we replace the embedding vector of a word with an "unknown vector" with a probability that is inversely proportional to the frequency of the word in the tree-bank (tuned with a coefficient).
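This word-dropout scheme can be sketched as follows; note that the exact alpha/(alpha + count) form is the one used by Kiperwasser and Goldberg (2016), while the text above only states that the probability is inversely proportional to the word's frequency — the names here are invented:

```python
import random

def maybe_unknown(word, freq, alpha=0.25, rng=random.Random(0)):
    """Word-dropout sketch: replace a word with the unknown token with
    probability alpha / (alpha + count(word)), so rarer words are replaced
    more often during training."""
    p = alpha / (alpha + freq[word])
    return "<UNK>" if rng.random() < p else word

freq = {"the": 5000, "zymurgy": 1}
# a rare word (count 1) is replaced with probability 0.25 / 1.25 = 0.2
drops = sum(maybe_unknown("zymurgy", freq) == "<UNK>" for _ in range(1000))
```

Replacing rare words at training time forces the model to rely on the unknown vector for them, which is exactly the situation it faces at test time on out-of-vocabulary words.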

We optimize the parameters with the Adam [Kingma and Ba2015] update method.[15]

[15] We use the default parameters (α = 0.001, β1 = 0.9, β2 = 0.999).

The hyper-parameters[16] used for our baseline are reported in Table 1 and the related results in Table 2.

[16] We performed a very minimal tuning of the hyper-parameters.

Word embedding dimension 150
POS tag embedding dimension 50
Labeler hidden dimension 100
Labeler hidden activation Tanh
Labeler output activation Softmax
BiLSTMs activations Tanh
Word dropout coefficient 0.25
Table 1: Hyper-parameters used for the baseline.
System Method UAS LAS
This work (baseline) LHR 92.8 90.4
Kiperwasser16 BiLSTM + transition 93.2 91.2
Kiperwasser16 BiLSTM + graph 93.1 91.0
Table 2: Our results are compared with the models of Kiperwasser and Goldberg (2016), which combine the BiLSTMs with traditional parsing algorithms. Both approaches use neither gold tags nor external resources.

Sections 4.1 and 4.2 detail the other experiments in progress.

4.1 Punctuation

Experimental results with traditional parsing algorithms showed that parsing accuracy drops on sentences containing a higher ratio of punctuation. The problem is that, in tree-banks, punctuation is not consistently annotated, and this makes learning and the consequent parsing process difficult. For this reason, the arcs leading to punctuation tokens simply do not count in the standard CoNLL evaluation.

In our model there are no structural constraints on learning the Latent Heads Representation, so a fully complete gold dependency tree is not required. Based on this, during the learning process we skip the reconstruction of the head of the punctuation tokens, letting the model create its own preferred representation. During decoding, the latent heads of the punctuation tokens are treated like the others.

On the one hand, we did not find appreciable improvements applying this method to the PTB; on the other hand, we did not notice a drop in performance in the evaluation that does not consider punctuation.[17] This shows a robust behavior of our approach with inaccurate annotations, and it supports the hypothesis that the relevant information brought by the punctuation tokens is implicitly learned thanks to the bidirectional recurrent mechanism [Grella2018].

[17] It could be interesting to find out which tokens the parser chooses as punctuation heads.

We are doing further experiments to test this technique on other tree-banks with a higher ratio of non-projective sentences.

4.2 Tokens Encoding

A good initial token encoding is crucial to obtaining high accuracy in neural parsing.

We explored different ways to transform the tokens of a sentence into a distributed representation, exploiting the capabilities of our model to predict dependency labels and POS tags jointly (Section 2.3.1).

Character-based representation

Dozat et al. (2017) showed that adding subword information to word embeddings improves parsing accuracy, especially for richly inflected languages.

We follow their approach, concatenating the word embeddings with a character-based representation instead of the POS embeddings.

Part-of-Speech Correction

The input token encodings are built by concatenating word and POS embeddings. As usual, the part-of-speech tags are assigned by a pre-existing automatic tagger. We use the approach described in Section 2.3.1 to learn to predict the gold tags together with the dependency label. During decoding, the output tags of the labeler are assigned to each token, possibly modifying the tags predicted by the external POS tagger. The assumption is that our model should learn how to correct the mistakes made by the tagger.

5 Related Works

Like us, Zhang et al. (2016) formalized dependency parsing as the task of finding for each word in a sentence its most probable head. They propose a graph-based parsing model without tree structure constraints (DeNSe). It employs a bidirectional LSTM to encode the tokens of a sentence, which are used as features of a feedforward network that estimates the most probable head of each token. In their model, the selection of the head of each token is made independently of the other tokens of the sentence, computing the associative score of all combinations of pairs of tokens seen as dependent and governor. The most probable pairs are chosen as arcs of the resulting dependency tree, adjusting ill-formed trees with the Chu-Liu-Edmonds algorithm.

A key difference between LHR and DeNSe lies in the training objective: ours is to globally minimize the mean absolute error between the context vectors and the latent heads, optimizing the BiRNN autoencoder (the heads-encoder); by contrast, the objective of DeNSe is to minimize the negative log likelihood of the independent predictions of each single arc with respect to the gold arcs in all the training sentences.

6 Future Research

In this section we share a few insights about the direction of our future research:

Multi-objective Training

We observed meaningful improvements in unlabeled parsing after adding the labels, using the context-encoder as a shared intermediate layer.

On this basis, we will try to "inject" linguistic knowledge into the model, involving the same encoder in other known tasks (e.g., semantic role labeling, named entity recognition), preparing training objectives targeted at the improvement of specific difficult dependency relations [Ficler and Yoav2017], and creating a "neuralized" version of the lexical information (e.g., lemma, grammatical features, valency) contained in computational dictionaries.

Cross-lingual and Unsupervised Dependency Parsing

In our model there are no structural constraints on learning the Latent Heads Representation, so a fully complete gold dependency tree is not required. We plan to experiment with cross-lingual parsing by training the model on large amounts of incomplete and noisy data, obtained by means of annotation projection [Hwa et al.2005][18] or transfer learning.

[18] Annotation projection is a technique that allows transferring annotations from one language to another within a parallel corpus.

In addition, we will experiment with new approaches to neural language models [Bengio et al.2003] based on our bidirectional recurrent autoencoder, with the aim of improving unsupervised parsing techniques [Jiang et al.2016].

Semantic Tasks

We plan to test the effectiveness of our “latent syntactic structure” evaluating its contribution to a number of semantic tasks: sentiment analysis (Socher et al., 2013b; Tai et al., 2015), semantic sentence similarity (Marelli et al., 2014), textual inference (Bowman et al., 2015) and neural machine translation (Bahdanau et al., 2015; Jean et al., 2015b).

7 Conclusion

Dependency parsing has traditionally been treated as a structured prediction task. In this paper we have introduced an alternative semi-supervised approach that we believe can radically transform the way dependency parsing is performed.

To the best of our knowledge, we are the first to use a bidirectional recurrent autoencoder to recognize word-to-word dependencies between arbitrary positions in a sequence of words directly, without involving any additional framework.

We are investigating what kind of "knowledge of language" the new model captures, extending the tests to grammaticality judgments and visualizing which information the networks consider most important at a given moment [Karpathy et al.2015].[19]

[19] In our experiments we found that the RAN [Kenton et al.2017] is a valid alternative to the LSTM when speed and highly interpretable outputs are important.


  • [Goldberg2017] Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
  • [Kübler et al.2009] Sandra Kübler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
  • [Chen and Manning2014] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.
  • [Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.
  • [Kiperwasser and Goldberg2016] Eliyahu Kiperwasser and Yoav Goldberg. 2016. Easy-first dependency parsing with hierarchical tree LSTMs. Transactions of the Association for Computational Linguistics, 4.
  • [Dozat and Manning2017] Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proc. of ICLR.
  • [Ballesteros et al.2016] Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack-LSTM parser. CoRR, abs/1603.03793.
  • [Liu and Zhang2017] Jiangming Liu and Yue Zhang. 2017. Encoder-decoder shift-reduce syntactic parsing. In Proceedings of IWPT, pages 105–114.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • [Caruana1997] Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75, July.
  • [McDonald et al.2005] Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 91–98, Ann Arbor, Michigan, June. Association for Computational Linguistics.
  • [Tesnière1959] Lucien Tesnière. 1959. Eléments de Syntaxe Structurale. Klincksieck, Paris.
  • [Sgall et al.1986] Petr Sgall, Eva Hajičová, and Jarmilla Panevová. 1986. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. Reidel, Dordrecht.
  • [Mel’cuk1988] Igor’ A. Mel’cuk. 1988. Dependency syntax: theory and practice. State Univ. of New York Pr., Albany, NY.
  • [Hudson1990] R. Hudson. 1990. English Word Grammar. Basil Blackwell, Oxford.
  • [Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, California.
  • [Marcus et al.1993] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • [Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar, October. Association for Computational Linguistics.
  • [Graves2008] Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.
  • [Zhang et al.2016] Xingxing Zhang, Jianpeng Cheng, and Mirella Lapata. 2016. Dependency parsing as head selection. arXiv preprint arXiv:1606.01280.
  • [Kiperwasser and Goldberg2016] Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327.
  • [Iyyer et al.2015] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China, July. Association for Computational Linguistics.
  • [Zhang and Nivre2011] Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA, June. Association for Computational Linguistics.
  • [Karpathy et al.2015] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078.
  • [Hwa et al.2005] Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325.
  • [Rumelhart et al.1985] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning internal representations by error propagation. Technical report, University of California San Diego, Institute for Cognitive Science.
  • [Kenton et al.2017] Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv preprint arXiv:1705.07393.
  • [Bahdanau et al.2017] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [Shi et al.2017] Tianze Shi, Liang Huang, and Lillian Lee. 2017. Fast(er) exact decoding and global training for transition-based dependency parsing via a minimal feature set. arXiv preprint arXiv:1708.09403.
  • [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
  • [Jiang et al.2016] Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 763–771.
  • [Frazier and Rayner 1982] Lyn Frazier and Keith Rayner. 1982. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14(2):178–210.
  • [Rama et al.2016] Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32.
  • [Grella2018] Matteo Grella. 2018. Taking advantage of BiLSTM encoding to handle punctuation in dependency parsing: A brief idea.
  • [Ficler and Yoav2017] Jessica Ficler and Yoav Goldberg. 2017. Improving a strong neural parser with conjunction-specific features. arXiv preprint arXiv:1702.06733.
  • [Dozat et al.2017] Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30.
  • [Grella2017a] Matteo Grella. 2017. Italian Function Words. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • [Grella2017b] Matteo Grella. 2017. Italian Content Words. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.