1 Introduction
Constituency or phrase structure parsing is a core task in natural language processing (NLP) with myriad downstream applications. Therefore, devising effective and efficient algorithms for parsing has been a key focus in NLP.
With the advancements in neural approaches, various neural architectures have been proposed for constituency parsing as they are able to effectively encode the input tokens into dense vector representations while modeling the structural dependencies between tokens in a sentence. These include recurrent networks
(dyeretal2016recurrent; sternetal2017effective) and, more recently, self-attentive networks (kitaevklein2018constituency). The parsing methods can be broadly distinguished based on whether they employ a greedy transition-based algorithm or a globally optimized chart parsing algorithm. Transition-based parsers (dyeretal2016recurrent; crosshuang2016span; liuzhang2017shift) generate trees autoregressively as a sequence of shift-reduce decisions. Though computationally attractive, the local decisions made at each step may propagate errors to subsequent steps, causing the model to suffer from exposure bias.
Chart parsing methods, on the other hand, learn scoring functions for subtrees and perform global search over all possible trees to find the most probable tree for a sentence
(durrettklein2015neural; gaddyetal2018whats; kitaevklein2018constituency; kitaevetal2019multilingual). In this way, these methods can ensure consistency in predicting structured output. The limitation, however, is that they run slowly, at O(n^3) or higher time complexity.

In this paper, we propose a novel parsing approach that casts constituency parsing into a series of pointing problems (Figure 1). Specifically, our parsing model estimates the pointing score from one word to another in the input sentence, which represents the likelihood of the span covering those words being a legitimate phrase structure (i.e., a subtree in the constituency tree). During training, the likelihoods of legitimate spans are maximized using the cross-entropy loss. This enables our model to enforce structural consistency while avoiding the use of a structured loss that requires expensive CKY inference (gaddyetal2018whats; kitaevklein2018constituency). Training in our model can be fully parallelized without requiring structured inference as in shenetal2018straight; gomezrodriguezvilares2018constituent. Our pointing mechanism also allows efficient top-down decoding with best- and worst-case running times of O(n log n) and O(n^2), respectively.
In experiments on English Penn Treebank parsing, our model without any pretraining achieves 92.78 F1, outperforming all existing methods with similar time complexity. With pretrained BERT (devlinetal2019bert), our model pushes the F1 score to 95.48, which is on par with the state-of-the-art (kitaevetal2019multilingual), while supporting faster decoding. Our model also performs competitively on the multilingual parsing tasks of the SPMRL 2013/2014 shared tasks and establishes a new state of the art in Basque and Swedish. We will release our code at https://ntunlpsg.github.io/project/parser/ptrconstituencyparser
2 Model
Similar to minimalspanbasedparsing, we view constituency parsing as the problem of finding a set of labeled spans over the input sentence. Let S(T) denote the set of labeled spans for a parse tree T. Formally, S(T) can be expressed as
S(T) := { ((i_t, j_t), l_t) : t = 1, ..., |S(T)| }    (1)
where |S(T)| is the number of spans in the tree. Figure 1 shows an example constituency tree and its corresponding labeled span representation.
Following the standard practice in parsing (gaddyetal2018whats; shenetal2018straight), we convert the n-ary tree into a binary form and introduce a dummy label for spans that are not constituents in the original tree but are created as a result of binarization. Similarly, the labels in unary chains, which correspond to nested labeled spans, are collapsed into unique atomic labels, such as S-VP in Fig. 1.
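To make this preprocessing concrete, here is a minimal, self-contained sketch of unary-chain collapsing and right-binarization over tuple-encoded trees. The tuple encoding, the `-` join character, and the `DUMMY` label are our own illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the preprocessing in Section 2: collapse unary
# chains into atomic labels, then right-binarize an n-ary tree. A tree is a
# (label, children) tuple; a leaf is a bare token string.

DUMMY = "<empty>"  # dummy label for spans introduced by binarization

def collapse_unary(tree):
    """Merge unary chains like S -> VP into a single label 'S-VP'."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    while len(children) == 1 and not isinstance(children[0], str):
        sub_label, sub_children = children[0]
        label = f"{label}-{sub_label}"
        children = sub_children
    return (label, [collapse_unary(c) for c in children])

def binarize(tree):
    """Right-binarize: (A, [x, y, z]) -> (A, [x, (DUMMY, [y, z])])."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    while len(children) > 2:
        children = children[:-2] + [(DUMMY, children[-2:])]
    return (label, children)
```

Applying `binarize(collapse_unary(tree))` yields a binary tree whose spans are exactly those used by the pointing formulation below.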
Although our method shares the same "span-based" view as minimalspanbasedparsing, our approach diverges significantly from their framework in how we treat the overall parsing problem and in how we represent and model the spans, as described below.
2.1 Parsing as Pointing
In contrast to previous approaches, we cast parsing as a series of pointing decisions. For each index i in the input sequence, the parsing model points it to another index p(i) in order to identify the tree span (min(i, p(i)), max(i, p(i))). Similar to Pointer Networks (VinyalsNIPS2015), each pointing decision is modeled as a multinomial distribution over the indices of the input tokens (or encoder states). However, unlike the original pointer network, where a decoder state points to an encoder state, in our approach every encoder state h_i points to another encoder state h_j.
In this paper, we generally use i ⇝ j to mean that i points to j. We will refer to the pointing operation either as a function of the encoder states (e.g., h_i ⇝ h_j) or simply of the corresponding indices (e.g., i ⇝ j). Both denote the same operation, where the pointing function takes the encoder state h_i as the query vector and points to h_j by computing an attention distribution over all the encoder states.
Let P(T) denote the set of pointing decisions derived from a tree T by a transformation F, i.e., P(T) = F(T). For the parsing process to be valid, the transformation F and its inverse F^{-1}, which transforms P(T) back to T, should both have the one-to-one mapping property. Otherwise, the parsing model may confuse two different parse trees that share the same pointing representation. In this paper, we propose a novel transformation F that satisfies this property, as defined by the following proposition (proof provided in the Appendix).
Proposition 1
Given a binary constituency tree T for a sentence containing n tokens, the transformation F converts it into a set of pointing decisions P(T) = {(i ⇝ p(i), l_i) : i = 1, ..., n-1} such that (min(i, p(i)), max(i, p(i))) is the largest span that starts or ends at i, and l_i is the label of the non-terminal associated with that span.
To elaborate further, each pointing decision i ⇝ p(i) in P(T) represents a specific span in T. The pointing is directional, while the span it represents is non-directional. In other words, there may exist a position i such that i ⇝ j is in P(T) while j ⇝ i is not. In fact, it is easy to see that if the token at index i is the left child of a subtree, the largest span involving i starts at i; in this case p(i) > i and the pointing represents the span (i, p(i)). On the other hand, if the token is the right child of a subtree, the respective largest span ends at position i, in which case p(i) < i and the pointing represents the span (p(i), i) (e.g., see Figure 1). In addition, as the spans in T are unique, it can be shown that the pointing decisions in P(T) are also distinct from one another (see the Appendix for a proof by contradiction).
Given such a pointing formulation, for every constituency tree there exists a trivial case, 1 ⇝ n, whose label is generally 'S'. Thus, to make our formulation more general, with n inputs and n outputs, and convenient for the method description discussed later on, we add another trivial case, n ⇝ 1. With this generalization, we can represent the pointing decisions of any binary constituency tree as:
P(T) = { (1 ⇝ p(1), l_1), (2 ⇝ p(2), l_2), ..., (n ⇝ p(n), l_n) }    (2)
The pointing representation of the tree in Figure 1 is given at the bottom of the figure. To illustrate, in the parse tree, the largest phrase that starts or ends at token 2 ('enjoys') is the subtree spanning tokens 2 to 5 (i.e., 2 ⇝ 5). In this case, the span starts at token 2. Similarly, the largest phrase that starts or ends at token 4 ('tennis') is the span "enjoys playing tennis", which is rooted at 'VP' and spans tokens 2 to 4 (i.e., 4 ⇝ 2). In this case, the span ends at token 4.
Algorithm 1 describes the procedure to convert a binary tree into its corresponding pointing representation. Specifically, from each leaf token i, the algorithm traverses upward along the hierarchy until it reaches a non-terminal node that does not start or end with i; the node just below it then corresponds to the largest span starting or ending with i.
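The transformation of Algorithm 1 can also be computed span-wise: collect every labeled span of the binarized tree, then keep, for each position i, the largest span that starts or ends at i. The following Python sketch (with assumed helper names `spans` and `tree_to_pointing`, 1-based inclusive indices, and tuple-encoded trees) illustrates this equivalent formulation.

```python
# A minimal sketch (assumed names, not the paper's code): enumerate all
# labeled spans of a binary tree, then derive the pointing representation.
# Spans are (i, j, label) with 1-based, inclusive token indices.

def spans(tree, start=1):
    """Return (i, j, label) for every internal node; own span comes first."""
    label, children = tree
    out, pos = [], start
    for child in children:
        if isinstance(child, str):  # a leaf token occupies one position
            pos += 1
        else:
            sub = spans(child, pos)
            out.extend(sub)
            pos = sub[0][1] + 1  # first entry is the child's own span
    return [(start, pos - 1, label)] + out

def tree_to_pointing(tree, n):
    """For each i, keep the largest span (i', j', label) touching i."""
    pointing = {}
    for (i, j, label) in spans(tree):
        for end in (i, j):
            cur = pointing.get(end)
            if cur is None or (j - i) > (cur[1] - cur[0]):
                pointing[end] = (i, j, label)
    return [pointing.get(i, (i, i, None)) for i in range(1, n + 1)]
```

On a three-token tree this recovers exactly the behaviour described above: positions 1 and n both map to the full sentence span, matching the trivial cases of the generalized formulation.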
2.2 Top-Down Tree Inference
In the previous section, we described how to convert a constituency tree T into a set of pointing decisions P(T). We use this transformation to train the parsing model (described in detail in Sections 2.3 - 2.4). During inference, given a sentence to parse, our decoder, with the help of the parsing model, predicts P(T), from which we can construct the tree T. However, not all sets of pointings guarantee the generation of a valid tree. For example, for a sentence with four tokens, a set of pointings containing both 2 ⇝ 4 and 3 ⇝ 1 does not generate a valid tree, because token 3 cannot belong to both spans (2,4) and (1,3). In other words, simply taking the argmax over the pointing distributions may not generate a valid tree.
Our approach to decoding is inspired by the span-based approach of minimalspanbasedparsing. In particular, to reduce the search space, we score span identification (given by the pointing function) and label assignment separately.
Span Identification.
We adopt a top-down greedy approach formulated as follows:

k* = argmax_{i <= k < j} s_{(i,j)}(k)    (3)

where s_{(i,j)}(k) is the score of placing a split point at position k (i <= k < j), as defined by the following equation:

s_{(i,j)}(k) = p(k ⇝ i) + p(k+1 ⇝ j)    (4)

where p(k ⇝ i) and p(k+1 ⇝ j) are the pointing scores (probabilities) for the spans (i,k) and (k+1,j), respectively. Note that the pointing scores are asymmetric: p(a ⇝ b) may not equal p(b ⇝ a), because pointing from a to b is different from pointing from b to a. This is different from previous approaches, where the score of a span is defined to be symmetric. We build a tree for the input sentence by computing Eq. 3 recursively, starting from the full sentence span (1, n).
In the general case when i < k and k+1 < j, our pointing-based parsing model should learn to assign high scores to the two spans (i,k) and (k+1,j), or equivalently to the pointing decisions k ⇝ i and k+1 ⇝ j. However, the pointing formulation described so far omits the trivial self-pointing decisions i ⇝ i, which represent the singleton spans (i,i). A singleton span is created only when the splitting decision splits a span of size m into a single-token span (singleton span) and a subspan of size m-1, i.e., when k = i or k+1 = j. For instance, in the parsing process of Figure 1(a), the splitting decision at the root span results in a singleton span and a general span. For such a splitting decision, Eq. 3 requires the score of the singleton span, but the set of pointing decisions P(T) does not cover self-pointings. This discrepancy can be resolved by modeling the singleton spans separately. To achieve that, we redefine Eq. 3 as follows:
k* = argmax_{i <= k < j} [ s_l(i,k) + s_r(k+1,j) ]    (5)

where
s_l(i,k) = p_s(i ⇝ i) if k = i, and p_g(k ⇝ i) otherwise;
s_r(k+1,j) = p_s(j ⇝ j) if k+1 = j, and p_g(k+1 ⇝ j) otherwise;

and p_s and p_g respectively represent the scores of the singleton and general pointing functions (to be defined formally in Section 2.3).
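As a small illustration of Eq. 5 (not the authors' code), the split score can be assembled from two hypothetical lookup tables: `p_g[a][b]` for the general pointing score of a ⇝ b, and `p_s[a]` for the singleton self-pointing score of a ⇝ a.

```python
# Hedged sketch of Eq. 5: score of splitting span (i, j) at position k.
# p_g[a][b] and p_s[a] are assumed score lookups (e.g., probabilities
# produced by the general and singleton pointing models).

def split_score(p_g, p_s, i, k, j):
    """Combine the scores of the left subspan (i, k) and right subspan (k+1, j)."""
    left = p_s[i] if k == i else p_g[k][i]           # singleton (i, i) vs. general (i, k)
    right = p_s[j] if k + 1 == j else p_g[k + 1][j]  # singleton (j, j) vs. general (k+1, j)
    return left + right
```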
Remark on structural consistency. It is important to note that since the pointing functions are defined to have a global structural property (each decision identifies the largest span that starts or ends at i), our model inherently enforces structural consistency. The pointing formulation of the parsing problem also makes the training process simple and efficient: it allows us to train the model effectively with a simple cross-entropy loss (see Section 2.4).
Label Assignment.
Label assignment of spans is performed after every split decision. Specifically, as we split a span (i,j) into two subspans (i,k) and (k+1,j), which correspond to the pointing functions k ⇝ i and k+1 ⇝ j, we perform the label assignments for the two new subspans as
l_{(i,k)} = argmax_{l in L_g} P_g(l | k),    l_{(k+1,j)} = argmax_{l in L_g} P_g(l | k+1)    (6)
where P_g is the label classifier for any general (non-unary) span and L_g is the set of possible non-terminal labels. Following shenetal2018straight, we use a separate classifier for determining the labels of the unary spans, e.g., the first layer of labels in Figure 2. Also note that the label assignment is based only on the query vector (the encoder state that is used to point).
Figure 2 illustrates the top-down parsing process for our running example. It consists of a sequence of pointing decisions (Figure 2(a), top to bottom), which are then trivially converted into the parse tree (Figure 2(b)). We also provide pseudocode in Algorithm 2. Specifically, the algorithm finds the best split for the current span using the pointing scores and pushes the newly created subspans into a FIFO queue. The process terminates when there are no more spans to split. Similar to minimalspanbasedparsing, our parsing algorithm has worst- and best-case time complexities of O(n^2) and O(n log n), respectively.
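The decoding loop just described can be sketched as follows. The `split_score(i, k, j)` callback stands in for Eq. 5 and `label_of(i, j)` for the label classifiers; both are placeholders for the trained model, not the authors' implementation.

```python
from collections import deque

# Hedged sketch of the top-down decoder (Algorithm 2): greedily split spans,
# starting from the full sentence span (1, n), using a FIFO queue.

def decode(n, split_score, label_of):
    """Return the list of labeled spans of the predicted tree."""
    tree_spans = []
    queue = deque([(1, n)])  # FIFO queue of spans still to be processed
    while queue:
        i, j = queue.popleft()
        tree_spans.append((i, j, label_of(i, j)))
        if i == j:
            continue  # singleton span: nothing left to split
        # Pick the split point with the highest combined pointing score.
        k = max(range(i, j), key=lambda k: split_score(i, k, j))
        queue.append((i, k))
        queue.append((k + 1, j))
    return tree_spans
```

Because each span is split exactly once and the argmax scans at most n candidate split points, the loop matches the O(n^2) worst-case bound stated above.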
2.3 Model Architecture
We now describe the architecture of our parsing model: the sentence encoder, the pointing model and the labeling model.
Sentence Encoder.
Given an input sequence of n words (w_1, ..., w_n), we first embed each word w_i into its vector representation x_i as:

x_i = [e_i^c ; e_i^w ; e_i^p]    (7)

where e_i^c, e_i^w and e_i^p are respectively the character, word, and part-of-speech (POS) embeddings of the word w_i. Following kitaevklein2018constituency, we use a character LSTM to compute the character embedding of a word. We experiment with both randomly initialized and pretrained word embeddings; if pretrained embeddings are used, the word embedding e_i^w is the sum of the word's randomly initialized embedding and its pretrained embedding. The POS embeddings are randomly initialized.
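The embedding composition above can be sketched as follows; the concatenation in Eq. 7 and all function names are our own illustrative assumptions, with embeddings represented as plain Python lists.

```python
# Hedged sketch of the input representation: pretrained and randomly
# initialized word embeddings are summed, and the character, word, and POS
# embeddings are combined into one token representation.

def word_embedding(random_emb, pretrained_emb=None):
    """Sum the random and pretrained word embeddings when both are available."""
    if pretrained_emb is None:
        return random_emb
    return [r + p for r, p in zip(random_emb, pretrained_emb)]

def token_representation(char_emb, word_emb, pos_emb):
    """Combine the three embeddings into the token representation x_i."""
    return char_emb + word_emb + pos_emb  # list concatenation
```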
The word representations (x_1, ..., x_n) are then passed to a neural sequence encoder to obtain their hidden representations. Since our method does not require any specific encoder, one may use any encoder model, such as a BiLSTM (Hochreiter:1997) or a self-attentive encoder (kitaevklein2018constituency). In this paper, unless otherwise specified, we use the self-attentive encoder as our main sequence encoder because of its efficiency with parallel computation. The model factorizes content and position information in both the self-attention sublayer and the feed-forward layer; details about this factorization are provided in kitaevklein2018constituency.

Pointing and Labeling Models.
The outputs of the aforementioned sequence encoder are used to compute the pointing and labeling scores. More formally, the encoder network produces a sequence of latent vectors (h_1, ..., h_n) for the input sequence (w_1, ..., w_n). After that, we apply four separate position-wise two-layer feed-forward networks, FFN(x) = W_2 max(0, W_1 x + b_1) + b_2, to transform each h_i into task-specific latent representations for the respective pointing and labeling tasks:
h_i^g = FFN_g(h_i),    h_i^s = FFN_s(h_i)    (8)
h_i^l = FFN_l(h_i),    h_i^u = FFN_u(h_i)    (9)
Note that there is no parameter sharing between FFN_g, FFN_s, FFN_l and FFN_u. The pointing functions are then modeled as multinomial (or attention) distributions over the input indices for each input position i as follows:
p_g(. | i) = softmax([h_i^g . h_1^g, ..., h_i^g . h_n^g])    (10)
p_s(. | i) = softmax([h_i^s . h_1^s, ..., h_i^s . h_n^s])    (11)
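Concretely, under the (assumed) dot-product scoring of Eqs. 10-11, each position's pointing distribution is a softmax over its scores against all positions. A dependency-free toy version:

```python
import math

# Toy sketch of Eqs. 10-11: h_p is a list of task-specific vectors (one per
# position); the pointing distribution of position i is a softmax over the
# dot products between h_p[i] and every h_p[j].

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def pointing_distribution(h_p, i):
    """Multinomial distribution over all positions j for the pointing of i."""
    query = h_p[i]
    scores = [sum(q * k for q, k in zip(query, key)) for key in h_p]
    return softmax(scores)
```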
For the label assignment functions, we simply feed the label representations h_i^l and h_i^u into the respective softmax classification layers as follows:
P_g(. | i) = softmax(W_g h_i^l)    (12)
P_u(. | i) = softmax(W_u h_i^u)    (13)

where L_g and L_u are the sets of possible labels for the general and unary spans, respectively, and the rows of W_g and W_u are the class-specific trainable weight vectors.
2.4 Training Objective
We train our parsing model by minimizing the total loss defined as:
L_total(θ) = L_pg(θ_e, θ_g) + L_ps(θ_e, θ_s) + L_lg(θ_e, θ_l) + L_lu(θ_e, θ_u)    (14)

where each individual loss is a cross-entropy loss computed for the corresponding pointing or labeling task, and θ = {θ_e, θ_g, θ_s, θ_l, θ_u} represents the overall model parameters; specifically, θ_e denotes the encoder parameters shared by all components, while θ_g, θ_s, θ_l and θ_u denote the separate parameters of the two pointing functions (p_g, p_s) and the two labeling functions (P_g, P_u), respectively.
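Since Eq. 14 is just a sum of per-decision cross entropies, the objective can be sketched in a few lines; distributions here are plain probability lists, and the function names are our own.

```python
import math

# Sketch of the training objective (Eq. 14): a plain sum of cross-entropy
# terms over pointing and labeling decisions -- no structured (CKY)
# inference is needed during training.

def cross_entropy(dist, gold_index):
    """Negative log-likelihood of the gold index under a probability list."""
    return -math.log(dist[gold_index])

def total_loss(point_dists, gold_points, label_dists, gold_labels):
    loss = sum(cross_entropy(d, g) for d, g in zip(point_dists, gold_points))
    loss += sum(cross_entropy(d, g) for d, g in zip(label_dists, gold_labels))
    return loss
```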
3 Experiments
To show the effectiveness of our approach, we conduct experiments on English and multilingual parsing tasks. For English, we use the standard Wall Street Journal (WSJ) part of the Penn Treebank (PTB) (PTB:Marcus:1993), whereas for multilingual parsing, we experiment with eight languages from the SPMRL 2013-2014 shared tasks (seddahetal2013overview): Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.
For evaluation on PTB, we report the standard labeled precision (LP), labeled recall (LR), and labeled F1 computed by evalb (http://nlp.cs.nyu.edu/evalb/). For the SPMRL datasets, we report labeled F1 and use the same evalb setup as kitaevklein2018constituency.
3.1 English (PTB) Experiments
Setup.
We follow the standard train/valid/test split, which uses Sections 2-21 for training, Section 22 for development, and Section 23 for evaluation. This gives 45K sentences for training, 1,700 for development, and 2,416 for testing. Following previous studies, our model uses POS tags predicted by the Stanford tagger (Toutanova:2003:FPT:1073445.1073478).
For our model, we adopt the self-attentive encoder with hyperparameter settings similar to those of kitaevklein2018constituency. For the general and unary label classifiers (P_g and P_u), the hidden dimension of the corresponding position-wise feed-forward networks is 250; the pointing functions (p_g and p_s) use their own, separately tuned hidden dimensions. Our model is trained using the Adam optimizer (KingmaB14) with mini-batches of sentences. Additionally, we use warmup steps within which the learning rate is increased linearly to the base learning rate. Model selection for testing is performed based on the labeled F1 score on the validation set.

Results for Single Models.
The experimental results on PTB for the models without pretraining are shown in Table 1. As can be seen, our model achieves an F1 of 92.78, the highest among the models using top-down inference strategies. Specifically, our method outperforms minimalspanbasedparsing and shenetal2018straight by about 1 point in F1. Notably, our model with an LSTM encoder achieves an F1 of 92.26, which is still better than all the other top-down parsing methods.
Model  LR  LP  F1 

Top-Down Inference  
minimalspanbasedparsing  93.20  90.30  91.80 
shenetal2018straight  92.00  91.70  91.80 
Our Model  92.81  92.75  92.78 
CKY/Chart Inference  
gaddyetal2018whats      92.10 
kitaevklein2018constituency  93.20  93.90  93.55 
Other Approaches  
gomezrodriguezvilares2018constituent      90.7 
liuzhang2017shift      91.8 
sternetal2017effective  92.57  92.56  92.56 
ZhouZ19  93.64  93.92  93.78 
On the other hand, while kitaevklein2018constituency and ZhouZ19 achieve higher F1 scores, their inference is significantly slower than ours because of their CKY-based algorithms, which run at O(n^3) or higher time complexity. Furthermore, their training objectives involve a structural hinge loss, which requires online CKY inference during training. This makes their training considerably slower than that of our method, which is trained directly with a span-wise cross-entropy loss.
In addition, ZhouZ19 uses external supervision (head information) from the dependency parsing task. Dependency parsing models, in fact, have a strong resemblance to the pointing mechanism that our model employs (maetal2018stack). As such, integrating dependency parsing information into our model may also be beneficial. We leave this for future work.
Results with Pretraining.
Similar to kitaevklein2018constituency and kitaevetal2019multilingual, we also evaluate our model with BERT embeddings (devlinetal2019bert). To accommodate the contextualized token representations, we adjust the number of self-attentive layers and the base learning rate accordingly.
As shown in Table 2, our model achieves an F1 score of 95.48, which is on par with the state-of-the-art models. The advantage of our method, however, is that it is faster: our model runs at O(n^2) worst-case time complexity, while the chart-based inference of kitaevetal2019multilingual runs at O(n^3). A comparison of parsing speed is presented in the following section.
Model  F1 

Our model  95.34 
Our model  95.48 
kitaevklein2018constituency + ELMo  95.13 
kitaevetal2019multilingual  95.59 
Parsing Speed Comparison.
In addition to parsing performance in F1 scores, we also compare our parser against the previous neural approaches in terms of parsing speed. We record the parsing time over the 2,416 sentences of the PTB test set with a batch size of 1, on a machine with an NVIDIA GeForce GTX 1080Ti GPU and an Intel(R) Xeon(R) Gold 6152 CPU. This setup is comparable to that of shenetal2018straight.
Model  # sents/sec 

petrovklein2007improved  6.2 
zhuetal2013fast  89.5 
liuzhang2017shift  79.2 
minimalspanbasedparsing  75.5 
kitaevklein2018constituency  94.40 
shenetal2018straight  111.1 
Our model  130.2 

Model  Basque  French  German  Hebrew  Hungarian  Korean  Polish  Swedish 

spmrl2014  88.24  82.53  81.66  89.80  91.72  83.81  90.50  85.50 
coavouxcrabbe2017multilingual  88.81  82.49  85.34  89.87  92.34  86.04  93.64  84.0 
kitaevklein2018constituency  89.71  84.06  87.69  90.35  92.69  86.59  93.69  84.45 
Our Model  90.23  82.20  84.91  90.63  91.07  85.36  93.99  86.87 
Model  Basque  French  German  Hebrew  Hungarian  Korean  Polish  Swedish 

kitaevetal2019multilingual  91.63  87.43  90.20  92.99  94.90  88.80  96.36  88.86 
Our model  92.02  86.69  90.28  93.67  94.24  88.71  96.14  89.10 

Language  Train  Valid  Test 

Basque  7,577  948  946 
French  14,759  1,235  2,541 
German  40,472  5,000  5,000 
Hebrew  5,000  500  716 
Hungarian  8,146  1,051  1,009 
Korean  23,010  2,066  2,287 
Polish  6,578  821  822 
Swedish  5,000  494  666 
As shown in Table 3, our parser processes 19 more sentences per second than shenetal2018straight, despite the fact that our parsing algorithm runs at O(n^2) worst-case time complexity while the one used by shenetal2018straight can theoretically run at O(n log n). To elaborate further, the algorithm as presented in shenetal2018straight runs at O(n^2) complexity; to achieve O(n log n), it would need to sort the list of syntactic distances, which the provided code (https://github.com/hantek/distanceparser) does not implement. In addition, the speed-up of our method can be attributed to the fact that our algorithm (see Algorithm 2) uses a simple loop, while the algorithm of shenetal2018straight makes many recursive function calls. Recursive implementations tend to be empirically less efficient than equivalent while/for loops because of function-call and stack overheads.
3.2 SPMRL Multilingual Experiments
Setup.
Similar to the English PTB experiments, we use predicted POS tags from external taggers (provided with the SPMRL datasets). The train/valid/test splits are reported in Table 6. For single-model evaluation, we use the same hyperparameters and optimizer setup as in the English PTB experiments. For experiments with pretrained models, we use multilingual BERT (devlinetal2019bert), which was trained jointly on 104 languages.
Results.
The results for the single models are reported in Table 4. Our model achieves the highest F1 scores on Basque and Swedish, exceeding the previous best results by 0.52 and 1.37 F1 points, respectively. Our method also performs competitively with the previous state-of-the-art methods on the other languages.
Table 5 reports the performance of the models using pretrained BERT. Evidently, our method achieves state-of-the-art results in Basque and Swedish, and performs on par with the previous best method of kitaevetal2019multilingual on the other languages. Again, note that our method is considerably faster and easier to train than the method of kitaevetal2019multilingual.
4 Related Work
Prior to the neural tsunami in NLP, parsing methods typically modeled correlations in the output space through probabilistic context-free grammars (PCFGs) built on top of sparse (and discrete) input representations, either in a generative regime (kleinmanning2003accurate), a discriminative regime (finkeletal2008efficient), or a combination of both (charniakjohnson2005coarse). Besides the chart-parser approach, there is also a long tradition of transition-based parsers (sagaelavie2005classifier).
Recently, however, with the advent of powerful neural encoders such as LSTMs (Hochreiter:1997), the focus has shifted towards effective modeling of correlations in the input's latent space, since the output structures are nothing but a function of the input (gaddyetal2018whats). Various neural network models have been proposed to encode the dense input representations and their correlations, achieving state-of-the-art parsing results. To enforce structural consistency, existing neural parsing methods either employ a transition-based algorithm (dyeretal2016recurrent; liuzhang2017shift; tetratagging2019) or a globally optimized chart-parsing algorithm (gaddyetal2018whats; kitaevklein2018constituency).
Meanwhile, researchers have also attempted to convert the constituency parsing problem into tasks that can be solved in alternative ways. For instance, fernandezgonzalezmartins2015parsing transform the phrase structure into a special form of dependency structure; such a dependency structure, however, requires certain corrections when converted back to the corresponding constituency tree. gomezrodriguezvilares2018constituent and shenetal2018straight propose to map the constituency tree of a sentence with n tokens into a sequence of n-1 labels or scalars based on the depth or height of the lowest common ancestors of consecutive token pairs. In addition, methods like those of NIPS2015_5635 and NIPS2017_7181 apply the sequence-to-sequence framework to "translate" a sentence into a linearized form of its constituency tree. While simple, parsers of this type do not guarantee structural correctness, because the syntax of the linearized form is not constrained during decoding.
Our approach differs from previous work in that it represents the constituency structure as a series of pointing representations and has a relatively simpler cross entropy based learning objective. The pointing representations can be computed in parallel, and can be efficiently converted into a full constituency tree using a topdown algorithm. Our pointing mechanism shares certain similarities with the Pointer Network (VinyalsNIPS2015), but is distinct from it in that our method points a word to another word within the same encoded sequence.
5 Conclusion
We have presented a novel constituency parsing method based on a pointing mechanism. Our method utilizes an efficient top-down decoding algorithm that uses pointing functions to score possible spans. The pointing formulation inherently captures global structural properties and allows efficient training with a cross-entropy loss. Our experiments show that our method outperforms all existing top-down methods on the English Penn Treebank parsing task. With pretraining, our method rivals the state-of-the-art method while being faster. On multilingual constituency parsing, it also establishes new state-of-the-art results in Basque and Swedish.
Acknowledgments
We would like to express our gratitude to the anonymous reviewers for their insightful feedback on our paper. Shafiq Joty would like to thank the funding support from his Startup Grant (M4082038.020).
References
Appendix
Proof of Proposition 1
Given P(T) generated from a tree T (here we omit the unary leaves and POS tags), we first define the inverse transformation F^{-1} as the mapping that converts each pointing decision i ⇝ p(i) with label l_i into the labeled span ((min(i, p(i)), max(i, p(i))), l_i). We prove that F^{-1}(P(T)) = T.
A binary tree over n tokens has exactly n-1 internal nodes (or spans).
It is worth noting that for each pointing i ⇝ p(i), (min(i, p(i)), max(i, p(i))) is a span in T. As we consider i from 1 to n-1, there are in total at most n-1 such spans in T (we do not yet know whether these spans are distinct). Therefore, if we can prove that the spans are pairwise distinct for i = 1, ..., n-1, then F^{-1}(P(T)) covers all the spans in T, and therefore F^{-1}(P(T)) = T. We prove this by contradiction.
Assume that there exist i and j with 1 <= i, j <= n-1 and i ≠ j such that the spans represented by i ⇝ p(i) and j ⇝ p(j) are identical. Since i is an endpoint of the first span and j is an endpoint of the second, and the two spans are the same while i ≠ j, the indices i and j must be the two distinct endpoints of this common span.
Now, without loss of generality, let us assume i < j. With this assumption, the two spans are identical if and only if p(i) = j and p(j) = i, in which case (i, j) would be the largest span that both starts at i and ends at j. However, since j <= n-1, the span (i, j) is not the full sentence span, so it must be the left or right child of some (parent) span. If (i, j) is the left child, the parent span also starts at i and is larger than (i, j), contradicting the property that (i, j) is the largest span that starts or ends at i. Similarly, if (i, j) is the right child, the parent span also ends at j and is larger than (i, j), again contradicting the property that (i, j) is the largest span that starts or ends at j.
In conclusion, the n-1 spans are pairwise distinct, so F^{-1}(P(T)) = T. This also guarantees that F and F^{-1} are one-to-one: if there existed two trees T ≠ T' with P(T) = P(T'), we would have T = F^{-1}(P(T)) = F^{-1}(P(T')) = T', a contradiction.