Transition-Based Dependency Parsing using Perceptron Learner

01/22/2020 ∙ by Rahul Radhakrishnan Iyer, et al. ∙ Carnegie Mellon University

Syntactic parsing with dependency structures has become a standard technique in natural language processing, with many different parsing models available, in particular data-driven models that can be trained on syntactically annotated corpora. In this paper, we tackle transition-based dependency parsing using a perceptron learner. Our proposed model, which adds more relevant features to the perceptron learner, outperforms a baseline arc-standard parser and beats the unlabeled attachment score (UAS) of the MALT and LSTM parsers. We also give possible ways to address parsing of non-projective trees.




1 Introduction

Dependency parsing has been a hot topic in the NLP community for several decades, with seminal contributions achieving state-of-the-art performance in recent years. Transition-based methods have given competitive accuracy and efficiency for dependency parsing. These parsers construct dependency trees using a sequence of transition actions, such as Shift and Reduce, over input sentences. Transition-based dependency parsing has gained considerable interest because it runs fast and performs accurately. It gives complexities as low as O(n) and O(n^2) for projective and non-projective parsing, respectively [33]. The complexity is lower for projective parsing because a parser can deterministically skip tokens violating projectivity, while this property is not assumed for non-projective parsing.

Greedy transition-based dependency parsing has been widely deployed because of its speed [4]; however, state-of-the-art accuracies have been achieved by globally optimized parsers using beam search [38, 11, 39, 2]. These approaches generate multiple transition sequences given a sentence, and pick one with the highest confidence. Coupled with dynamic programming, transition-based dependency parsing with beam search can be done very efficiently and gives significant improvement to parsing accuracy.
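The beam-search strategy described above can be sketched generically: expand every partial transition sequence, score each candidate, and keep only the top k. The toy state and scorer below are illustrative stand-ins, not the cited parsers' implementations; only the beam mechanics are the point.

```python
# Minimal sketch of beam search over transition sequences.
# State, legal_transitions, apply and score are toy stand-ins.

class State:
    """Toy parser state: final once `remaining` transitions are used."""
    def __init__(self, remaining):
        self.remaining = remaining

    def is_final(self):
        return self.remaining == 0

def legal_transitions(state):
    return ["SHIFT", "REDUCE"]

def apply(state, action):
    return State(state.remaining - 1)

def score(state, action):
    # A trained model would score (state, action) pairs; here SHIFT
    # is simply preferred so the search has something to rank.
    return 1.0 if action == "SHIFT" else 0.5

def beam_search(init_state, beam_size=8):
    """Keep the beam_size highest-scoring partial transition sequences."""
    beam = [(0.0, init_state, [])]            # (total score, state, actions)
    while not all(s.is_final() for _, s, _ in beam):
        candidates = []
        for total, state, actions in beam:
            if state.is_final():              # finished sequences stay as-is
                candidates.append((total, state, actions))
                continue
            for t in legal_transitions(state):
                candidates.append(
                    (total + score(state, t), apply(state, t), actions + [t]))
        # prune: keep only the beam_size best partial sequences
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return max(beam, key=lambda b: b[0])      # highest-confidence sequence
```

With beam_size=1 this degenerates to the greedy parser; larger beams trade speed for the global optimization discussed above.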

One downside of beam search is that it always uses a fixed beam size, even when a smaller beam is sufficient for good results. In our experiments, a greedy parser performs as accurately as a parser using beam search about 64% of the time. It would therefore be preferable for the beam size to be proportional to the number of low-confidence predictions the greedy parser makes, rather than fixed; in that case, fewer transition sequences need to be explored to produce the same or similar parse output.

In this work, we only look at dependency parsing of projective trees, and propose ideas and directions to approach parsing of non-projective trees.

A dependency tree is a labeled directed tree T = (V, A, <) with

  • V, a set of nodes, labeled with words

  • A, a set of arcs, labeled with dependency types

  • <, a linear precedence order on V

It is basically a labeled directed graph where

  • V is a set of nodes, one for each position of a word in the sentence, plus a node w_0 corresponding to a dummy root.

  • A is a set of labeled arcs of the form (w_i, l, w_j), where w_i and w_j are nodes and l is a label from some set of labels L. It should also be noted that the root node does not have an incoming arc, every other node has at most one incoming arc, and the graph is weakly connected.
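These conditions can be verified concretely on a head-array encoding of a tree. This is a hedged sketch: the convention that heads[i-1] gives the head of word i, with 0 denoting the dummy root node, is an assumption of this example, not the paper's data format.

```python
def is_well_formed(heads):
    """Check the tree conditions on a head-array encoding:
    heads[i-1] is the head of word i, and 0 is the dummy root node.
    Each word has exactly one incoming arc by construction, the root
    has none, and the graph is a (weakly connected) tree iff every
    word reaches the root without entering a cycle."""
    n = len(heads)
    for i in range(1, n + 1):
        node, seen = i, set()
        while node != 0:                     # follow head links to the root
            if node in seen or not (0 <= node <= n):
                return False                 # cycle or dangling head index
            seen.add(node)
            node = heads[node - 1]
    return True
```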

There are two types of trees that satisfy these conditions:

  • Projective: For every arc (w_i, l, w_j), there is a directed path from w_i to every word w_k such that min(i, j) < k < max(i, j).

    Figure 1: Example of a projective tree
  • Non-projective: This does not satisfy the constraint of a projective tree, in that arcs can cross each other.

    Figure 2: Example of a non-projective tree

In these experiments, we try to parse only projective trees.
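Since the experiments are restricted to projective trees, a projectivity test is useful. The following is an illustrative sketch over the same head-array encoding (heads[i-1] is the head of word i, 0 is the dummy root; this convention is an assumption of the example): an arc is projective iff its head dominates every word lying strictly between the head and the dependent.

```python
def is_projective(heads):
    """heads[i-1] is the head of word i; 0 denotes the dummy root.
    A tree is projective iff for every arc (h, d), h dominates each
    word strictly between positions h and d."""
    n = len(heads)

    def dominates(h, k):
        # Follow head links from k upward; True iff we reach h.
        while k != 0:
            if k == h:
                return True
            k = heads[k - 1]
        return h == 0                        # the root dominates everything

    for d in range(1, n + 1):
        h = heads[d - 1]
        for k in range(min(h, d) + 1, max(h, d)):
            if not dominates(h, k):
                return False                 # a crossing arc exists
    return True
```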

Related Work

Much work has been done on parsing in the NLP community, exploring different kinds of parsing such as phrase-structure parsing and dependency parsing. In particular, a lot of work has been done on dependency parsing; among the seminal contributions achieving state-of-the-art performance are arc-standard/arc-eager transition-based dependency parsers [31, 33] and parsers using LSTMs [5, 1]. There has also been work on graph-based dependency parsing [28, 24, 3, 6].

Recently, several approaches involving natural language processing [22, 21, 15, 14, 19, 13, 18, 16], machine learning [26, 17, 10, 20, 27], deep learning [12, 25] and numerical optimization [36, 23, 35, 8, 37] have also been used in the visual and language domains.

In this work, we attempt to beat the MALT and LSTM parsers by adding more features to a baseline parser that uses a perceptron learner.

Paper Organization

The paper is organized as follows. We formulate the problem and explain the approach in sections (2) and (3), respectively. The experimental results obtained, along with their implications, are discussed in section (4). In section (5), we present ways of parsing non-projective trees. We finally draw conclusions and explore possibilities for future work in the last section (6) of the paper.

2 Problem Formulation

A transition-based dependency parser needs to predict the next parser action at nondeterministic choice points. The mechanism for doing this could be based on heuristics, but the most obvious and flexible solution is to use machine learning. We build upon an arc-standard parser using a perceptron learner incorporating advanced features. The feature templates and approaches taken are described in the next section.

3 Technical Approach

In this work, we approach the transition-based dependency parsing problem using a perceptron learner and build upon an arc-standard parser. We begin by implementing the functions to initialize the parser and execute transitions. We also implement the oracle, which gives the correct transition to take based on the gold-standard dataset, according to Nivre's arc-standard parsing algorithm. Several additional features were incorporated into the parser to enhance the perceptron learner [11]. They are shown in Fig. 3. The baseline arc-standard parser gives a UAS of 52.81.
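The transition functions and oracle described above can be sketched as follows. This is a minimal illustration assuming unlabeled arcs and a gold head array with node 0 as the dummy root; the function names are illustrative, not the paper's implementation.

```python
def parse_step(stack, buffer, arcs, action):
    """Apply one arc-standard transition to a (stack, buffer, arcs) state."""
    if action == "SHIFT":
        stack.append(buffer.pop(0))
    elif action == "LEFT-ARC":            # attach s1 as dependent of s0
        dep = stack.pop(-2)
        arcs.append((stack[-1], dep))
    elif action == "RIGHT-ARC":           # attach s0 as dependent of s1
        dep = stack.pop()
        arcs.append((stack[-1], dep))
    return stack, buffer, arcs

def oracle(stack, gold_heads, arcs):
    """Return the correct transition given the gold tree
    (gold_heads[i] is the gold head of word i; node 0 is the root)."""
    if len(stack) >= 2:
        s0, s1 = stack[-1], stack[-2]
        if s1 != 0 and gold_heads[s1] == s0:
            return "LEFT-ARC"
        # RIGHT-ARC only once every gold dependent of s0 is attached
        if gold_heads[s0] == s1 and all(
                (s0, w) in arcs
                for w in range(1, len(gold_heads))
                if gold_heads[w] == s0):
            return "RIGHT-ARC"
    return "SHIFT"
```

Running the oracle to completion on a gold projective tree yields the transition sequence used to train the classifier.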

Figure 3: Template of features used in the parser [11]

In the feature template shown in Fig. 3, s_i and b_i represent the i-th word of the stack and the buffer, respectively. t(s_i) represents the POS tag of the i-th word of the stack and w(b_i) the word token of the i-th word of the buffer. lc(s_i) and rc(s_i) represent the leftmost and rightmost children of s_i, respectively. In addition to these features, we incorporate some additional ones.
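A hedged sketch of how such stack/buffer features could feed a perceptron learner follows; the templates and names below are illustrative stand-ins in the spirit of Fig. 3, not the actual feature set.

```python
from collections import defaultdict

def extract_features(stack, buffer, words, tags):
    """Stack/buffer templates in the spirit of Fig. 3 (illustrative)."""
    feats = []
    if stack:
        s0 = stack[-1]
        feats += ["s0.w=" + words[s0], "s0.t=" + tags[s0]]
    if buffer:
        b0 = buffer[0]
        feats += ["b0.w=" + words[b0], "b0.t=" + tags[b0]]
    if stack and buffer:                  # a conjoined (pair) template
        feats.append("s0.t|b0.t=" + tags[stack[-1]] + "|" + tags[buffer[0]])
    return feats

def predict(weights, feats, actions):
    """Pick the action whose weight vector scores the features highest."""
    return max(actions, key=lambda a: sum(weights[a][f] for f in feats))

def perceptron_update(weights, feats, gold, pred):
    """Standard perceptron update: reward the gold action and
    penalize the wrongly predicted one."""
    if gold != pred:
        for f in feats:
            weights[gold][f] += 1.0
            weights[pred][f] -= 1.0
```

Training iterates over oracle transitions, predicting an action at each configuration and updating the weights on mistakes.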

4 Results

The enhanced parser was tested on the development set as well as on the test set. The results are given in Table 1.

Model      Development Set (UAS)
Baseline   52.81
Enhanced   82.79
Table 1: Experimental results obtained

We see that the baseline arc-standard parser gives a UAS of 52.81 on the development set, whereas our model with a perceptron learner is able to give 82.79. As we can see, we have obtained a significant improvement over the baseline with the added features.

5 Non-Projective Parsing

We discussed non-projective parsing in section 1, and it is worth noting that the arc-standard parsing algorithm does not apply to non-projective trees. Several modifications and approaches have thus been proposed to address this limitation. Some of them are listed here:

  • Algorithms for non-projective dependency parsing:

    • Constraint Satisfaction Methods [7]

    • McDonald’s spanning tree algorithm [28]

    • Covington’s algorithm [32]

  • Post-processing of projective dependency graphs:

    • Pseudo-projective parsing [30]

    • Corrective modeling [9]

    • Approximate non-projective parsing [29]

Here, we look at two approaches: introducing a swap function in the arc-standard algorithm, and pseudo-projective parsing.

  1. In the non-projective parsing algorithm proposed by Nivre, the algorithm constructs arcs only between adjacent words but can parse arbitrary non-projective trees by swapping the order of the words in the input [34]. Basically, a swap operation was added to the existing arc-standard functions. Writing a configuration as (σ, β, A), with stack σ, buffer β and arc set A, the new set of transitions is:

    • Left-Arc: (σ|i|j, β, A) ⇒ (σ|j, β, A ∪ {(j, i)})

    • Right-Arc: (σ|i|j, β, A) ⇒ (σ|i, β, A ∪ {(i, j)})

    • Shift: (σ, i|β, A) ⇒ (σ|i, β, A)

    • Swap: (σ|i|j, β, A) ⇒ (σ|j, i|β, A)

    It is important to note that the Swap operation is permissible only when the two nodes on top of the stack are still in the original word order (0 < i < j), which prevents the same two nodes from being swapped more than once, and only when the lower node is distinct from the root node. This works for non-projective trees because any input can be sorted using Shift and Swap, and any projective tree can be built using Left-Arc, Right-Arc and Shift. Thus, by changing the word order using Swap, we address the non-projective issue. The oracle is modified accordingly to predict Swap rather than Shift whenever the top two stack nodes satisfy j <_p i, where <_p refers to the projective linear order given by an inorder traversal of the gold tree.

  2. In pseudo-projective parsing [30], the original non-projective tree is transformed into a projective tree. It is called pseudo-projective because the modified tree is not the original tree. Here, each non-projective arc (w_i, l, w_j) in the original tree is replaced by an arc (w_k, l', w_j) such that w_k is the closest ancestor of w_i for which the new arc does not violate the projectivity constraints. After all the computation is done on this projective tree, we can de-pseudo-projectivize it using a heuristic post-processing technique, the output of which is a non-projective tree.
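The Swap transition from approach (1) can be sketched in a few lines. This is an illustrative sketch operating on stack/buffer lists of word positions, with 0 assumed to be the dummy root node; it is not the paper's implementation.

```python
def swap(stack, buffer):
    """Move the second-topmost stack node back to the front of the
    buffer. Permissible only when 0 < stack[-2] < stack[-1]: the two
    top nodes must still be in the original word order (so the same
    pair is never swapped twice) and neither may be the root node 0."""
    assert len(stack) >= 2 and 0 < stack[-2] < stack[-1]
    i = stack.pop(-2)       # remove the second-topmost node
    buffer.insert(0, i)     # it will be re-shifted later, after the swap
    return stack, buffer
```

Repeated Shift and Swap operations effectively re-sort the input into the projective order, after which the standard arcs can be built.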

6 Conclusions and Future work

In this paper, we have implemented a working transition-based dependency parser using a perceptron learner with many additional features, and have shown significant improvements over a baseline arc-standard parser. In future work, other classifier methods such as SVMs can be explored, and additional features can be identified to better train the classifier.


  • [1] M. Ballesteros, C. Dyer, and N. A. Smith (2015) Improved transition-based parsing by modeling characters instead of words with lstms. arXiv preprint arXiv:1508.00657. Cited by: §1.
  • [2] B. Bohnet and J. Nivre (2012) A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 1455–1465. Cited by: §1.
  • [3] X. Carreras (2007) Experiments with a higher-order projective dependency parser.. In EMNLP-CoNLL, pp. 957–961. Cited by: §1.
  • [4] D. M. Cer, M. De Marneffe, D. Jurafsky, and C. D. Manning (2010) Parsing to stanford dependencies: trade-offs between speed and accuracy.. In LREC, Cited by: §1.
  • [5] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith (2015) Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075. Cited by: §1.
  • [6] J. M. Eisner (1996) Three new probabilistic models for dependency parsing: an exploration. In Proceedings of the 16th conference on Computational linguistics-Volume 1, pp. 340–345. Cited by: §1.
  • [7] K. A. Foth, M. Daum, and W. Menzel (2004) A broad-coverage parser for german based on defeasible constraints. Constraint Solving and Language Processing, pp. 88. Cited by: 1st item.
  • [8] H. P. Gupta, T. Venkatesh, S. V. Rao, T. Dutta, and R. R. Iyer (2016) Analysis of coverage under border effects in three-dimensional mobile sensor networks. IEEE Transactions on Mobile Computing 16 (9), pp. 2436–2449. Cited by: §1.
  • [9] K. Hall and V. Novák (2005) Corrective modeling for non-projective dependency parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, pp. 42–52. Cited by: 2nd item.
  • [10] M. Honke, R. Iyer, and D. Mittal (2018) Photorealistic style transfer for videos. arXiv preprint arXiv:1807.00273. Cited by: §1.
  • [11] L. Huang and K. Sagae (2010) Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1077–1086. Cited by: §1, Figure 3, §3.
  • [12] R. Iyer, Y. Li, H. Li, M. Lewis, R. Sundar, and K. Sycara (2018) Transparency and explanation in deep reinforcement learning neural networks. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 144–150. Cited by: §1.
  • [13] R. Iyer, R. Mandrekar, A. Aggarwal, P. Chaphekar, and G. Bhatia (2017) RecoMob: opinion mining for product enhancement. In 2017 International Conference on Computer Communication and Informatics (ICCCI), pp. 1–5. Cited by: §1.
  • [14] R. R. Iyer, K. P. Sycara, and Y. Li (2017) Detecting type of persuasion: is there structure in persuasion tactics?. In CMNA@ ICAIL, pp. 54–64. Cited by: §1.
  • [15] R. R. Iyer, J. Chen, H. Sun, and K. Xu (2019) A heterogeneous graphical model to understand user-level sentiments in social media. arXiv preprint arXiv:1912.07911. Cited by: §1.
  • [16] R. R. Iyer, R. Kohli, and S. Prabhumoye (2020) Modeling product search relevance in e-commerce. arXiv preprint arXiv:2001.04980. Cited by: §1.
  • [17] R. R. Iyer, S. Parekh, V. Mohandoss, A. Ramsurat, B. Raj, and R. Singh (2016) Content-based video indexing and retrieval using corr-lda. arXiv preprint arXiv:1602.08581. Cited by: §1.
  • [18] R. R. Iyer, Y. Pei, and K. Sycara (2019) Simultaneous identification of tweet purpose and position. arXiv preprint arXiv:2001.00051. Cited by: §1.
  • [19] R. R. Iyer and C. P. Rose (2019) A machine learning framework for authorship identification from texts. arXiv preprint arXiv:1912.10204. Cited by: §1.
  • [20] R. R. Iyer, M. Sharma, and V. Saradhi (2020) A correspondence analysis framework for author-conference recommendations. arXiv preprint arXiv:2001.02669. Cited by: §1.
  • [21] R. R. Iyer and K. Sycara (2019) An unsupervised domain-independent framework for automated detection of persuasion tactics in text. arXiv preprint arXiv:1912.06745. Cited by: §1.
  • [22] R. R. Iyer, R. Zheng, Y. Li, and K. Sycara (2019) Event outcome prediction using sentiment analysis and crowd wisdom in microblog feeds. arXiv preprint arXiv:1912.05066. Cited by: §1.
  • [23] R. Iyer and A. H. Tewfik (2012) Optimal ordering of observations for fast sequential detection. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 126–130. Cited by: §1.
  • [24] T. Koo and M. Collins (2010) Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1–11. Cited by: §1.
  • [25] Y. Li, K. Sycara, and R. Iyer (2018) Object-sensitive deep reinforcement learning. arXiv preprint arXiv:1809.06064. Cited by: §1.
  • [26] Y. Li, R. Zheng, T. Tian, Z. Hu, R. Iyer, and K. Sycara (2016) Joint embedding of hierarchical categories and entities for concept categorization and dataless classification. arXiv preprint arXiv:1607.07956. Cited by: §1.
  • [27] Y. Li, R. Zheng, T. Tian, Z. Hu, R. Iyer, and K. Sycara (2016) Joint embeddings of hierarchical categories and entities. arXiv preprint arXiv:1605.03924. Cited by: §1.
  • [28] R. McDonald, F. Pereira, K. Ribarov, and J. Hajič (2005) Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 523–530. Cited by: §1, 2nd item.
  • [29] R. T. McDonald and F. C. Pereira (2006) Online learning of approximate dependency parsing algorithms.. In EACL, Cited by: 3rd item.
  • [30] J. Nivre and J. Nilsson (2005) Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 99–106. Cited by: 1st item, item 2.
  • [31] J. Nivre (2004) Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pp. 50–57. Cited by: §1.
  • [32] J. Nivre (2006) Constraints on non-projective dependency parsing.. In EACL, Cited by: 3rd item.
  • [33] J. Nivre (2008) Algorithms for deterministic incremental dependency parsing. Computational Linguistics 34 (4), pp. 513–553. Cited by: §1, §1.
  • [34] J. Nivre (2009) Non-projective dependency parsing in expected linear time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pp. 351–359. Cited by: item 1.
  • [35] H. Qian, S. Yang, R. Iyer, X. Feng, M. Wellons, and C. Welton (2014) Parallel time series modeling-a case study of in-database big data analytics. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 417–428. Cited by: §1.
  • [36] R. Radhakrishnan, A. K. Singh, S. Bhaumik, and N. K. Tomar (2016) Multiple sparse-grid gauss–hermite filtering. Applied Mathematical Modelling 40 (7-8), pp. 4441–4450. Cited by: §1.
  • [37] R. Radhakrishnan, A. Yadav, P. Date, and S. Bhaumik (2018) A new method for generating sigma points and weights for nonlinear filtering. IEEE Control Systems Letters 2 (3), pp. 519–524. Cited by: §1.
  • [38] Y. Zhang and S. Clark (2008) A tale of two parsers: investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 562–571. Cited by: §1.
  • [39] Y. Zhang and J. Nivre (2011) Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pp. 188–193. Cited by: §1.