1 Introduction
The initial stages of modern NLP models typically deal with token-aligned vector representations. These may begin as word vectors, and can later be modified to incorporate contextual information using neural architectures such as RNNs or the Transformer. Architectures that produce these representations are general-purpose, can be shared across tasks, and can be effectively pretrained on large amounts of data.
Modern parsers make use of such representations, but require leaving the word-synchronous domain when producing their output. For example, our previous parser (Kitaev and Klein, 2018) constructs and operates over representations for each span in the sentence. In this paper, we present an approach to parsing that fully operates in the word-synchronous paradigm. Our method has the following properties:

It is word-synchronous. The entire trainable portion of the parser consists of producing word-aligned feature vectors and predicting labels using these features. As a result, the method can immediately leverage any advances in building general-purpose word representations.

It uses constant bit rate: each position in the sentence is assigned a label from a fixed vocabulary. The set of labels is fully determined by the grammar of the language and does not depend on the specific inputs presented to the parser.

Inference time per word is constant, subject to mild assumptions based on the observation that certain syntactic configurations are difficult for humans to understand and are unattested in treebank data.
2 Related Work
Chart parsing
Chart parsers fundamentally operate over span-aligned rather than word-aligned representations. This is true for both classical methods (Klein and Manning, 2003) and more recent neural approaches (Durrett and Klein, 2015; Stern et al., 2017). The size of a chart is quadratic in the length of the sentence, and the unoptimized CKY algorithm has cubic running time. Because the bit rate and inference time per word are not constant, additional steps must be taken to scale these systems to the paragraph or document level.
Label-based parsing
A variety of approaches have been proposed to mostly or entirely reduce parsing to a sequence labeling task. One family of these approaches is supertagging (Bangalore and Joshi, 1999), which is particularly common for CCG parsing. CCG imposes constraints on which supertags may form a valid derivation, necessitating complex search procedures for finding a high-scoring sequence of supertags that is self-consistent. An example of how such a search procedure can be implemented is the system of Lee et al. (2016), which uses A* search. The required inference time per word is not constant with this method, and in fact the worst-case running time is exponential in the sentence length. Gómez-Rodríguez and Vilares (2018) proposed a different approach that fully reduces parsing to sequence labeling, but the label vocabulary is unbounded: it expands with tree depth and related properties of the input, rather than being fixed for any given language. There have been attempts to address this by adding redundant labels, where each word has multiple correct labels (Vilares et al., 2019), but that only increases the label vocabulary rather than restricting it to a finite set. Our approach, on the other hand, uses just 4 labels in its simplest formulation (hence the name tetra-tagging).
Shift-reduce transition systems
A number of parsers proposed in the literature fall into the broad category of shift-reduce parsers (Henderson, 2003; Sagae and Lavie, 2005; Zhang and Clark, 2009; Zhu et al., 2013). These systems rely on generating sequences of actions, but the actions need not be evenly distributed throughout the sentence. For example, the construction of a deep right-branching tree might involve a series of shift actions (one per word in the sentence), followed by equally many consecutive reduce actions that all cluster at the end of the derivation. Due to the uneven alignment between actions and locations in a sentence, neural network architectures in recent shift-reduce systems (Vinyals et al., 2015; Dyer et al., 2016; Liu and Zhang, 2017) broadly follow an encoder-decoder approach rather than directly assigning labels to positions in the input. Our proposed parser is also transition-based, but there are guaranteed to be exactly two decisions to make after shifting one word and before shifting the next. As a result, the amount of computation required per word is uniform as the algorithm proceeds left-to-right through the sentence.
Left-corner parsing
Our parsing algorithm is inspired by and shares several key properties with left-corner parsing; see Section 3.5 for a discussion of related work in this area.
3 Method
To introduce our method, we first restrict ourselves to only consider unlabeled full binary trees (no labels, no unary chains, and no nodes with more than two children). We defer the discussion of labeling and non-binary structure to Section 3.6.
3.1 Preliminaries
Before we present our word-synchronous parse tree representation, let’s first answer the question: what is the minimal bit rate (per word) required to represent a parse tree? Absent any assumptions about the linguistic infeasibility of certain syntactic configurations, a parser would need to be capable of producing all possible trees over a given set of words. The number of full binary trees over n words is the Catalan number C_{n−1}. Asymptotically, the Catalan numbers scale as C_n ∼ 4^n / (n^{3/2} √π).
From this we can conclude that a parser that selects between 4 possible options per word is asymptotically optimal, in the sense that selecting from a smaller inventory of options is insufficient to encode all possible trees. We exhibit such a method in the next subsection.
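As a quick sanity check on these counts, the following sketch (standard Python, our illustration rather than anything from the paper) computes Catalan numbers exactly and compares them against the asymptotic estimate above:

```python
from math import comb, log2, sqrt, pi

def catalan(n):
    # C_n = (2n choose n) / (n + 1): counts full binary trees with n + 1 leaves
    return comb(2 * n, n) // (n + 1)

n_words = 20
num_trees = catalan(n_words - 1)      # full binary trees over 20 words
bits_total = log2(num_trees)          # bits needed to select one tree

# Asymptotic estimate C_n ~ 4^n / (n^(3/2) * sqrt(pi))
n = n_words - 1
estimate = 4 ** n / (n ** 1.5 * sqrt(pi))

print(num_trees)                      # 1767263190
print(round(bits_total, 1))           # 30.7
print(round(num_trees / estimate, 2)) # 0.94 -- the estimate is already close
```

Since log2 C_{n−1} grows like 2n, about 2 bits per word suffice in the limit, which is exactly what choosing among 4 options per word provides.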
3.2 Labels
Consider the example tree shown in Figure 1. The tree is fully binarized (i.e. every node has either 0 or 2 children) and consists of 5 terminal symbols (A, B, C, D, E) and 4 nonterminal nodes (1, 2, 3, 4). For any full binary parse tree, the number of nonterminals will always be one less than the number of words, so we can construct a one-to-one mapping between nonterminals and fenceposts (i.e. positions between words): each fencepost is matched with the smallest span that crosses it. Equivalently, we can number the nonterminals based on the in-order traversal of the tree and match node 1 with the first fencepost (between words A and B), node 2 with the second fencepost (between words B and C), etc.
For each node, we calculate the direction of its parent, i.e. whether the node is a left-child or a right-child. Although the root node in the tree does not have a parent, by convention we treat it as though it were a left-child (in Figure 1, this is denoted by the dummy parent labeled $).
Our scheme associates each word and fencepost in the sentence with one of four labels:

“↗”: This terminal node is a left-child

“↖”: This terminal node is a right-child

“⇗”: The shortest span crossing this fencepost is a left-child

“⇖”: The shortest span crossing this fencepost is a right-child

We refer to our method as tetra-tagging because it uses only these four labels.
Given a sentence with n words, there are altogether 2n − 1 decisions (each with two options): one for each of the n words and one for each of the n − 1 fenceposts. The label representation of a tree is unique by construction. However, a fully one-to-one mapping between trees and label sequences is not possible because a sentence with n words admits 2^{2n−1} possible label sequences but only C_{n−1} distinct trees.
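To make the encoding concrete, here is a minimal sketch (our illustration, not the authors’ code) that reads the label sequence off a tree with an in-order traversal. We render the four labels with the arrow glyphs ↗/↖ (terminal is a left/right-child) and ⇗/⇖ (the smallest span crossing the fencepost is a left/right-child); leaves are strings and internal nodes are (left, right) pairs:

```python
def tetra_tags(node, is_left=True):
    """Return the tag sequence for `node`; `is_left` records whether the
    node is a left-child (the root counts as a left-child by convention)."""
    if isinstance(node, str):                       # terminal node
        return ["↗" if is_left else "↖"]
    left, right = node
    return (tetra_tags(left, is_left=True)
            + ["⇗" if is_left else "⇖"]             # this node's fencepost
            + tetra_tags(right, is_left=False))

# Right-branching tree (A (B (C D))): n = 4 words yield 2n - 1 = 7 tags
print(tetra_tags(("A", ("B", ("C", "D")))))
# ['↗', '⇗', '↗', '⇖', '↗', '⇖', '↖']
```

Note that the in-order traversal visits words and fenceposts in strict left-to-right sentence order, so the output really is one tag per word and per fencepost.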
In the next subsection we show how the four labels above can be interpreted as actions in a transition-based parser, whereby some label sequences are valid and can be mapped back to trees, while others are invalid. In the following subsection, we describe an efficient dynamic program for finding the highest-scoring valid sequence under a probabilistic model.
3.3 Transition System
In this section, we reinterpret the four labels (“↗”, “↖”, “⇗”, “⇖”) as actions in a transition system that can map from label sequences back to trees. Our transition system maintains a stack of partially-constructed trees, where each element of the stack is one of the following: (a) a terminal symbol, i.e. a word; (b) a complete tree; or (c) a tree with a single empty slot, denoted by a special empty-slot element. An empty slot must be the rightmost leaf node in its tree, but may occur at any depth.
[Figure 2: Example derivation for the tree in Figure 1. Each of the ten rows (steps 0–9) lists the action taken, the resulting stack contents, and the remaining buffer; the buffer starts as “A B C D E”, one word is shifted every other step, and the final step leaves a single complete tree on the stack with an empty buffer.]
The tree operations used are:

MakeNode(left-child, right-child): creates a new tree node.

Combine(parent-tree, child-tree): replaces the empty slot in the parent tree with the child tree.
The decoding system is shown in Algorithm 1, and an example derivation is shown in Figure 2.
Each action in the transition system is responsible for adding a single tree node onto the stack: the actions “↗” and “↖” do this by shifting in a leaf node, while the actions “⇗” and “⇖” construct a new nonterminal node. The transition system maintains the invariant that the topmost stack element is a complete tree if and only if a leaf node was just shifted (i.e. the last action was either “↗” or “↖”), and all other stack elements have a single empty slot.
The actions “↖” and “⇖” both make use of the Combine operation to fill an empty slot on the stack with a newly-introduced node, which makes the new node a right-child. New nodes from the actions “↗” and “⇗”, on the other hand, are introduced directly onto the stack and can become left-children via a later MakeNode operation. As a result, the behavior of the four actions (“↗”, “↖”, “⇗”, “⇖”) matches the label definitions from the previous section.
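The transition system can be sketched as follows (a simplified illustration under our own data representation, not Algorithm 1 itself): trees are nested lists, SLOT marks the single empty slot, and the four arrow tags drive the stack operations described above:

```python
SLOT = None  # placeholder marking a tree's single empty slot

def make_node(left, right):
    # MakeNode(left-child, right-child): create a new tree node
    return [left, right]

def combine(parent, child):
    # Combine(parent-tree, child-tree): fill the empty slot, which is
    # always the rightmost leaf of the parent tree
    node = parent
    while node[1] is not SLOT:
        node = node[1]
    node[1] = child

def decode(tags, words):
    buffer, stack = list(words), []
    for tag in tags:
        if tag == "↗":    # shift a word; it will later become a left-child
            stack.append(buffer.pop(0))
        elif tag == "↖":  # shift a word straight into the top empty slot
            combine(stack[-1], buffer.pop(0))
        elif tag == "⇗":  # new node over the stack top; it is a left-child
            stack.append(make_node(stack.pop(), SLOT))
        else:             # "⇖": new node is combined in as a right-child
            node = make_node(stack.pop(), SLOT)
            combine(stack[-1], node)
    assert len(stack) == 1 and not buffer
    return stack[0]

print(decode(["↗", "⇗", "↗", "⇖", "↖"], ["A", "B", "C"]))
# ['A', ['B', 'C']]
```

Running the decoder on a tag sequence produced from a tree recovers that tree, which is the round-trip property the text describes.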
3.4 Inference
It should be noted that not all sequences of labels are valid under our transition system; in particular:

The stack is initially empty and the only valid initial action is “↗”, which shifts the first word in the sentence from the buffer onto the stack.

The action “⇖” relies on there being more than one element on the stack (lines 16–18 of Algorithm 1).

After executing all actions, the stack should contain a single element. Due to the invariant that the top stack element after a “↗” or “↖” action is always a tree with no empty slots, this single stack element is guaranteed to be a complete tree that spans the full sentence.
A tagging model that directly outputs an arbitrary sequence of labels is not guaranteed to produce a valid tree. Instead, we will work with models that output a probability distribution over sequences of labels; in particular, we will assume that label probabilities are predicted independently for each position in the sentence (conditioned on the input): P(y_1, …, y_{2n−1} | x) = ∏_t P(y_t | x).
We observe that the validity constraints for our transition system can be expressed entirely in terms of the number of stack elements at each point in the derivation, and do not depend on the precise structure of those elements. This property enables an optimal and efficient dynamic program for finding the valid sequence of labels that has the highest probability under the model.
The dynamic program maintains a table of the highest-scoring parser state for each combination of number of actions taken and stack depth. Prior to taking any actions, the stack must be empty. The algorithm then proceeds left-to-right to fill in the highest-scoring stack configurations after action 1, 2, etc.
This same inference algorithm can also be seen as a type of beam search, where at each timestep the set of hypotheses under consideration is updated based on the label scores at that timestep. For each possible stack depth, only the highestscoring hypothesis needs to be retained on the beam to achieve an optimal decode.
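A sketch of this dynamic program follows (our illustration; the input format, variable names, and default depth cap are assumptions). Positions alternate between word tags {↗, ↖} and fencepost tags {⇗, ⇖}, and each hypothesis is summarized by its stack depth alone:

```python
def best_valid_sequence(scores, max_depth=8):
    """scores: one dict per position mapping each candidate tag to a
    log-probability. Returns the highest-scoring valid tag sequence."""
    # chart maps stack depth -> (best score, tag sequence) after t actions
    chart = {0: (0.0, [])}
    for options in scores:
        new_chart = {}
        for depth, (score, tags) in chart.items():
            for tag, s in options.items():
                if tag == "↗":                 # push a word: depth + 1
                    new_depth = depth + 1
                elif tag in ("↖", "⇗"):        # needs a stack element
                    if depth < 1:
                        continue
                    new_depth = depth
                else:                          # "⇖" merges two elements
                    if depth < 2:
                        continue
                    new_depth = depth - 1
                if new_depth > max_depth:
                    continue
                candidate = (score + s, tags + [tag])
                if new_depth not in new_chart or candidate > new_chart[new_depth]:
                    new_chart[new_depth] = candidate
        chart = new_chart
    # a valid derivation ends with exactly one element on the stack
    return chart[1][1]
```

Because only the best hypothesis per depth survives each step, this is exactly the depth-limited beam view described above: the beam width equals the depth cap, yet the decode remains optimal.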
3.5 Connection to leftcorner parsing
The design of our transition system borrows from past work on left-corner parsing, in that it preserves key properties that are motivated by considerations regarding human syntactic processing.
It has been observed that humans have no particular difficulty in processing deep left- and right-branching constructions, but that even a small amount of center-embedding greatly hurts comprehension (Miller and Chomsky, 1963). This disparity has been attributed to cognitive limitations with respect to working memory and processing capabilities.
An analysis of the processing and space requirements of top-down and bottom-up parsing strategies reveals that neither strategy has equivalent treatment of left- and right-branching structure. Left-corner parsing, on the other hand, has the property that space utilization is constant when processing fully left- or right-branching structures, but increases whenever a center-embedded construct is encountered (Abney and Johnson, 1991; Resnik, 1992).
Past work has operationalized these considerations by defining a class of grammars and the order in which grammar rules are applied (Rosenkrantz and Lewis, 1970), or by applying a left-corner transform (or the closely-related right-corner transform) to syntactic trees (Johnson, 1998; Schuler et al., 2010). Our method is instead formulated as a transformation from trees to label sequences, and uses word-synchronous feature vectors to drive the derivation rather than requiring an explicit grammar to fulfill that role.
In the context of our tetra-tag parser, a left-branching tree is characterized by the action sequence “↗⇗↖⇗↖…”, where neither “⇗” nor “↖” changes the size of the stack (see Algorithm 1). A right-branching tree has the action sequence “↗⇗↗⇖↗⇖…↖”, where each “↗” action grows the stack by one element and each “⇖” action shrinks the stack by one element. As a result, the depth of the stack remains effectively constant when parsing either left- or right-branching structures. Larger stack sizes are needed only when center-embedded constructs are encountered.
In practice, the largest stack depth observed at any point in the derivation for any tree in the Penn Treebank (Marcus et al., 1993) is 8. By comparison, the median sentence length in the data is 23, and the largest sentence is over 100 words long.
These observations directly impact how efficiently we can perform inference in our parsing framework. The time complexity of the dynamic program described in the previous section is O(nm), where n is the length of the sentence and m is the maximum possible depth of the stack. If our parser were required to produce arbitrary trees, we would have m = O(n) and an overall time complexity of O(n²). When dealing with natural-language parse trees, however, we can cap the maximum stack depth allowed in our inference procedure, for example by setting m = 8. If we assume that this cap is a constant (as would be the case if it corresponds directly to human cognitive limitations), the time complexity effectively becomes O(n). In other words, our inference procedure will, in practice, require a constant amount of time per word in the sentence.
3.6 Handling of labels and non-binary trees
Thus far, we have only dealt with binary unlabeled trees, but our method can be readily extended to the labeled and non-binary settings.
To incorporate labels, we note that each of our four actions corresponds to a single node in the binary tree. The label to assign to a node can therefore be incorporated into the corresponding action; for example, the action “⇗ S” will construct an S node that is a left-child in the tree, and the action “↖ NP” will construct a single-word Noun Phrase that is a right-child. We do not impose any constraints on valid label configurations, so our inference procedure essentially remains unchanged.
To handle non-binary trees, we follow past work by binarizing the trees and collapsing unary chains. We use fully right-branching binarization, where a dummy label is introduced and assigned to nodes generated as a result of binarization.
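This preprocessing can be sketched as follows (our own tree encoding; the dummy label “∅” and the “::” separator for collapsed chains are this sketch’s choices, not necessarily the paper’s):

```python
def collapse_unary(tree):
    """Collapse unary chains: (S (VP ...)) becomes a node labeled 'S::VP'."""
    label, children = tree
    while len(children) == 1 and isinstance(children[0], tuple):
        child_label, children = children[0]
        label += "::" + child_label
    return (label, [collapse_unary(c) if isinstance(c, tuple) else c
                    for c in children])

def binarize(tree):
    """Fully right-branching binarization; new nodes get the dummy label."""
    label, children = tree
    children = [binarize(c) if isinstance(c, tuple) else c for c in children]
    while len(children) > 2:
        # group the last two children under a dummy node, repeatedly
        children = children[:-2] + [("∅", children[-2:])]
    return (label, children)

print(binarize(("NP", ["a", "b", "c", "d"])))
# ('NP', ['a', ('∅', ['b', ('∅', ['c', 'd'])])])
```

After these two passes every node has exactly two children (or is a single word), so the tetra-tag encoding applies directly; the inverse transforms are applied to the parser’s output before evaluation.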
4 Results
We evaluate our proposed method by training a parser that directly predicts action sequences from pretrained BERT (Devlin et al., 2018) word representations. Two independent projection matrices are applied to the feature vector for the last subword unit within each word: one projection produces scores for actions corresponding to that word, and the other for actions at the following fencepost. The model is trained to maximize the likelihood of the correct action sequence, where BERT parameters are fine-tuned as part of training. We compare our model with our previous chart parser (Kitaev and Klein, 2018), which was fine-tuned from the same initial BERT representations. Unlike the tetra-tagging approach, the chart parser constructs feature vectors for each span in the sentence and uses the cubic-time CKY algorithm for inference.
5 Word-Synchronous vs. Incremental
Thus far we’ve discussed word-synchronous syntactic analysis. A closely related question, arising from a number of perspectives including computational efficiency and cognitive modeling, is whether we can build a fully incremental parser. While our parsing algorithm itself is incremental (meaning that it operates in a strictly left-to-right manner), the BERT word representations underneath are not. Indeed, the BERT model is specifically constructed to be deeply bidirectional.
We experiment with incremental parsing by instead using the GPT-2 (Radford et al., 2019) architecture to construct token-aligned vector representations. The publicly-available GPT-2 model is configured to be roughly comparable to BERT: both models use a 12-layer architecture with 768-dimensional hidden states and 12 self-attention heads per layer. However, in GPT-2 a given position in the sentence is only allowed to attend to what came before it (and not any of the later words).¹ We further augment GPT-2 with a notion of lookahead by allowing the parser to incorporate information from a fixed context window following a word. We opt to add lookahead by introducing extra layers on top of GPT-2, rather than modifying the pre-initialized GPT-2 architecture in a way that deviates from the pretraining conditions. To achieve a lookahead of k words, we first add 8 self-attention layers on top of the GPT-2 architecture. Attention in the first of these extra layers is constrained to look no more than k words into the future. The following 7 layers use fully causal self-attention, meaning that attention to subsequent words is disallowed.

¹BERT and GPT-2 also use different pretraining data and subword tokenization.
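The attention constraints involved can be sketched as boolean masks (plain Python, our illustration; the actual models apply such constraints as additive masks over attention logits):

```python
def attention_mask(n, lookahead):
    """allowed[i][j] is True when position i may attend to position j:
    everything at or before i, plus up to `lookahead` future positions."""
    return [[j <= i + lookahead for j in range(n)] for i in range(n)]

n = 5
causal = attention_mask(n, lookahead=0)   # GPT-2-style: no future positions
window = attention_mask(n, lookahead=2)   # first extra layer with k = 2

print(causal[1])  # [True, True, False, False, False]
print(window[1])  # [True, True, True, True, False]
```

Only the first added layer uses the windowed mask, so any information from the k-word window must flow through that single layer before the remaining causal layers.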
We find that using GPT-2 with no lookahead performs poorly, but that even a modest amount of lookahead leads to substantial quality improvements (Figure 3). We also include in our comparison two architectures that allow arbitrary amounts of lookahead. The first uses GPT-2 but allows unrestricted attention patterns in all 8 layers that follow it. This deeply bidirectional lookahead outperforms lookahead at only a single layer (93.5 F1 vs. 92.3 F1). The second point of comparison is a model that admits no lookahead in the parser-specific self-attention layers, but uses BERT rather than GPT-2. The model with BERT achieves 95 F1 vs. 93.5 F1 for the best application of GPT-2, which suggests that some aspect of the BERT architecture (perhaps its deep bidirectionality during pretraining) makes it more suitable for use in our parser than GPT-2.
6 Conclusion
We present a word-synchronous linearized tree representation that uses a set of four actions. The actions are word-aligned and can be directly predicted from contextualized word vectors using an approach we call tetra-tagging. We present an optimal dynamic programming algorithm for finding the highest-scoring tree from a sequence of tag probabilities and show that, with mild assumptions, the inference algorithm runs in linear time.
References
 Abney and Johnson (1991) Steven P. Abney and Mark Johnson. 1991. Memory requirements and local ambiguities of parsing strategies. Journal of Psycholinguistic Research, 20(3):233–250.
 Bangalore and Joshi (1999) Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An Approach to Almost Parsing. Computational Linguistics, 25(2):237–265.
 Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].

 Durrett and Klein (2015) Greg Durrett and Dan Klein. 2015. Neural CRF Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 302–312. Association for Computational Linguistics.
 Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent Neural Network Grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209. Association for Computational Linguistics.
 Gómez-Rodríguez and Vilares (2018) Carlos Gómez-Rodríguez and David Vilares. 2018. Constituent Parsing as Sequence Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1314–1324. Association for Computational Linguistics.
 Henderson (2003) James Henderson. 2003. Inducing History Representations for Broad Coverage Statistical Parsing. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 103–110.
 Johnson (1998) Mark Johnson. 1998. Finite-state Approximation of Constraint-based Grammars using Left-corner Grammar Transforms. In COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics.
 Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Multilingual Constituency Parsing with Self-Attention and Pre-Training. arXiv:1812.11760 [cs].
 Klein and Manning (2003) Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics.
 Lee et al. (2016) Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2016. Global Neural CCG Parsing with Optimality Guarantees. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2366–2376. Association for Computational Linguistics.
 Liu and Zhang (2017) Jiangming Liu and Yue Zhang. 2017. In-Order Transition-based Constituent Parsing. Transactions of the Association for Computational Linguistics, 5:413–424.
 Marcus et al. (1993) Mitchell P Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2).
 Miller and Chomsky (1963) George A Miller and Noam Chomsky. 1963. Finitary models of language users. Handbook of mathematical psychology, II.
 Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
 Resnik (1992) Philip Resnik. 1992. Left-corner parsing and psychological plausibility. In Proceedings of the 14th Conference on Computational Linguistics - Volume 1, pages 191–197. Association for Computational Linguistics.
 Rosenkrantz and Lewis (1970) Daniel J. Rosenkrantz and Philip M. Lewis. 1970. Deterministic left corner parsing. In IEEE Conference Record of the 11th Annual Symposium on Switching and Automata Theory, pages 139–152. IEEE.
 Sagae and Lavie (2005) Kenji Sagae and Alon Lavie. 2005. A Classifier-Based Parser with Linear Run-Time Complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. Association for Computational Linguistics.
 Schuler et al. (2010) William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2010. Broad-Coverage Parsing Using Human-Like Memory Constraints. Computational Linguistics, 36(1):1–30.
 Stern et al. (2017) Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A Minimal Span-Based Neural Constituency Parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827. Association for Computational Linguistics.
 Vilares et al. (2019) David Vilares, Mostafa Abdou, and Anders Søgaard. 2019. Better, Faster, Stronger Sequence Tagging Constituent Parsers. arXiv:1902.10985 [cs].
 Vinyals et al. (2015) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a Foreign Language. In Advances in Neural Information Processing Systems 28, pages 2755–2763. Curran Associates, Inc.
 Zhang and Clark (2009) Yue Zhang and Stephen Clark. 2009. Transition-Based Parsing of the Chinese Treebank using a Global Discriminative Model. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 162–171. Association for Computational Linguistics.
 Zhu et al. (2013) Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and Accurate Shift-Reduce Constituent Parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443. Association for Computational Linguistics.