Efficient Constituency Parsing by Pointing

We propose a novel constituency parsing model that casts the parsing problem into a series of pointing tasks. Specifically, our model estimates the likelihood of a span being a legitimate tree constituent via the pointing score corresponding to the boundary words of the span. Our parsing model supports efficient top-down decoding and our learning objective is able to enforce structural consistency without resorting to the expensive CKY inference. The experiments on the standard English Penn Treebank parsing task show that our method achieves 92.78 F1 without using pre-trained models, which is higher than all the existing methods with similar time complexity. Using pre-trained BERT, our model achieves 95.48 F1, which is competitive with the state-of-the-art while being faster. Our approach also establishes new state-of-the-art in Basque and Swedish in the SPMRL shared tasks on multilingual constituency parsing.



1 Introduction

Constituency or phrase structure parsing is a core task in natural language processing (NLP) with myriad downstream applications. Therefore, devising effective and efficient algorithms for parsing has been a key focus in NLP.

With the advancements in neural approaches, various neural architectures have been proposed for constituency parsing, as they can effectively encode the input tokens into dense vector representations while modeling the structural dependencies between tokens in a sentence. These include recurrent networks (dyer-etal-2016-recurrent; stern-etal-2017-effective) and, more recently, self-attentive networks (kitaev-klein-2018-constituency).

The parsing methods can be broadly distinguished by whether they employ a greedy transition-based algorithm or a globally optimized chart parsing algorithm. Transition-based parsers (dyer-etal-2016-recurrent; cross-huang-2016-span; liu-zhang-2017-shift) generate trees autoregressively as a sequence of shift-reduce decisions. Though computationally attractive, the local decisions made at each step may propagate errors to subsequent steps, and these models suffer from exposure bias.

Chart parsing methods, on the other hand, learn scoring functions for subtrees and perform global search over all possible trees to find the most probable tree for a sentence (durrett-klein-2015-neural; gaddy-etal-2018-whats; kitaev-klein-2018-constituency; kitaev-etal-2019-multilingual). In this way, these methods can ensure consistency in predicting structured output. The limitation, however, is that they run slowly, at O(n^3) or higher time complexity.














Span Representation

{((1, 5), S), ((2, 5), ∅), ((2, 4), VP), ((3, 4), S-VP)}

Pointing Representation

{(1 ↦ 5, S), (2 ↦ 5, ∅), (3 ↦ 4, S-VP), (4 ↦ 2, VP), (5 ↦ 1, S)}

Figure 1: A binarized constituency tree for the sentence "She enjoys playing tennis .". The node S-VP is an example of a collapsed atomic label, and ∅ is the dummy label for spans created by binarization. We omit POS tags and singleton spans for simplicity. Below the tree, we show the span and pointing representations of the tree.

In this paper, we propose a novel parsing approach that casts constituency parsing into a series of pointing problems (Figure 1). Specifically, our parsing model estimates the pointing score from one word to another in the input sentence, which represents the likelihood of the span covering those words being a legitimate phrase structure (i.e., a subtree in the constituency tree). During training, the likelihoods of legitimate spans are maximized using the cross entropy loss. This enables our model to enforce structural consistency, while avoiding the use of a structured loss that requires expensive CKY inference (gaddy-etal-2018-whats; kitaev-klein-2018-constituency). The training in our model can be fully parallelized without requiring structured inference as in (shen-etal-2018-straight; gomez-rodriguez-vilares-2018-constituent). Our pointing mechanism also allows efficient top-down decoding, with best and worst case running times of O(n log n) and O(n^2), respectively.

In the experiments with English Penn Treebank parsing, our model without any pre-training achieves 92.78 F1, outperforming all existing methods with similar time complexity. With pre-trained BERT (devlin-etal-2019-bert), our model pushes the F1 score to 95.48, which is on par with the state-of-the-art (kitaev-etal-2019-multilingual) while supporting faster decoding. Our model also performs competitively on the multilingual parsing tasks of the SPMRL 2013/2014 shared tasks and establishes new state-of-the-art results in Basque and Swedish. We will release our code at https://ntunlpsg.github.io/project/parser/ptr-constituency-parser

2 Model

Similar to minimal-span-based-parsing, we view constituency parsing as the problem of finding a set of labeled spans over the input sentence. Let S(T) denote the set of labeled spans for a parse tree T. Formally, S(T) can be expressed as

S(T) := {((i_t, j_t), l_t)}_{t=1}^{|S(T)|}    (1)

where |S(T)| is the number of spans in the tree. Figure 1 shows an example constituency tree and its corresponding labeled span representation.

Following the standard practice in parsing (gaddy-etal-2018-whats; shen-etal-2018-straight), we convert the n-ary tree into a binary form and introduce a dummy label ∅ for spans that are not constituents in the original tree but are created as a result of binarization. Similarly, the labels in unary chains, which correspond to nested labeled spans, are collapsed into unique atomic labels, such as S-VP in Fig. 1.

Although our method shares the same “span-based” view with that of minimal-span-based-parsing, our approach diverges significantly from their framework in the way we treat the whole parsing problem, and the representation and modeling of the spans, as we describe below.

2.1 Parsing as Pointing

In contrast to previous approaches, we cast parsing as a series of pointing decisions. For each index i in the input sequence, the parsing model points it to another index p_i in order to identify the tree span (min(i, p_i), max(i, p_i)), where p_i ≠ i. Similar to Pointer Networks (VinyalsNIPS2015), each pointing decision is modeled as a multinomial distribution over the indices of the input tokens (or encoder states). However, unlike the original pointer network where a decoder state points to an encoder state, in our approach, every encoder state h_i points to another encoder state h_{p_i}.

In this paper, we generally use i ↦ j to mean i points to j. We will refer to the pointing operation either as a function of the encoder states (e.g., h_i ↦ h_j) or simply of the corresponding indices (e.g., i ↦ j). Both denote the same operation, where the pointing function takes the encoder state h_i as the query vector and points to h_j by computing an attention distribution over all the encoder states.

Let P(T) denote the set of pointing decisions derived from a tree T by a transformation F, i.e., P(T) = F(T). For the parsing process to be valid, the transformation F and its inverse F^{-1}, which transforms P(T) back to T, should both have a one-to-one mapping property. Otherwise, the parsing model may confuse two different parse trees that share the same pointing representation. In this paper, we propose a novel transformation that satisfies this property, as defined by the following proposition (proof provided in the Appendix).

Proposition 1

Given a binary constituency tree T for a sentence containing n tokens, the transformation F converts it into a set of pointing decisions P(T) = {(i ↦ p_i, l_i) : i = 1, …, n−1} such that (min(i, p_i), max(i, p_i)) is the largest span that starts or ends at i, and l_i is the label of the nonterminal associated with that span.

To elaborate further, each pointing decision i ↦ p_i in P(T) represents a specific span (min(i, p_i), max(i, p_i)) in T. The pointing is directional, while the span that it represents is non-directional. In other words, there may exist a position i such that i ↦ p_i, while p_i does not point back to i. In fact, it is easy to see that if the token at index i is a left child of a subtree, the largest span involving i starts at i; in this case i < p_i, and i ↦ p_i represents the span (i, p_i). On the other hand, if the token is a right child of a subtree, the respective largest span ends at position i; in this case p_i < i, and i ↦ p_i represents the span (p_i, i) (e.g., see 4 ↦ 2 in Figure 1). In addition, as the spans in S(T) are unique, it can be shown that the pointing decisions in P(T) are also distinct from one another (see the Appendix for a proof by contradiction).

Given such a pointing formulation, for every constituency tree there exists the trivial case 1 ↦ n, whose label is generally 'S'. Thus, to make our formulation more general, with n inputs and n outputs, and convenient for the method description discussed later on, we add another trivial case n ↦ 1. With this generalization, we can represent the pointing decisions of any binary constituency tree as:

P(T) = {(i ↦ p_i, l_i) : i = 1, …, n}    (2)
The pointing representation of the tree in Figure 1 is given at the bottom of the figure. To illustrate, in the parse tree, the largest phrase that starts or ends at token 2 ('enjoys') is the subtree rooted at '∅', which spans from 2 to 5. In this case, the span starts at token 2, so 2 ↦ 5. Similarly, the largest phrase that starts or ends at token 4 ('tennis') is the span "enjoys playing tennis", which is rooted at 'VP'. In this case, the span ends at token 4, so 4 ↦ 2.

Input:  Binary tree T and its span representation S(T)
Output:  Pointing representation P(T)
  P(T) ← empty pointing list
  for i = 1 to n do
      (x, y) ← (i, i)    ▷ initialize current span
      l ← label(x, y)    ▷ initialize label of current span
      while (x, y) is not the root and parent(x, y) starts or ends at i do
          (x, y) ← span(parent(x, y))    ▷ span covered by the parent node
          l ← label(x, y)    ▷ the span's label
      end while    ▷ until i is no longer a start/end point
      p_i ← x + y − i    ▷ the endpoint of (x, y) other than i
      P(T) ← P(T) ∪ {(i ↦ p_i, l)}
  end for
Algorithm 1 Convert a binary tree to its pointing representation

Algorithm 1 describes the procedure to convert a binary tree into its corresponding pointing representation. Specifically, from each leaf token i, the algorithm traverses upward along the hierarchy until it reaches the highest nonterminal node whose span still starts or ends with i. In this way, the largest span starting or ending with i can be identified.
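To make the conversion concrete, the following Python sketch derives the pointing representation of the Figure 1 tree directly from its span set, applying Proposition 1 (point each position to the far end of the largest span touching it). The function name and the dict-based span encoding are our own illustrative choices, not the paper's implementation:

```python
# Sketch: derive the pointing representation (Proposition 1) from the labeled
# span set of a binarized tree. "" stands for the dummy label (∅).
def spans_to_pointing(spans, n):
    """spans: dict mapping (i, j), 1 <= i < j <= n, to a label (including
    the full span (1, n)). Returns {i: (p_i, label)} for i = 1..n."""
    pointing = {}
    for i in range(1, n + 1):
        # internal spans that start or end at position i
        touching = [(a, b) for (a, b) in spans if a == i or b == i]
        a, b = max(touching, key=lambda s: s[1] - s[0])  # largest such span
        p_i = b if a == i else a                         # the other endpoint
        pointing[i] = (p_i, spans[(a, b)])
    return pointing

# Figure 1 example: "She enjoys playing tennis ." (n = 5)
spans = {(1, 5): "S", (2, 5): "", (2, 4): "VP", (3, 4): "S-VP"}
print(spans_to_pointing(spans, 5))
# {1: (5, 'S'), 2: (5, ''), 3: (4, 'S-VP'), 4: (2, 'VP'), 5: (1, 'S')}
```

Because every leaf of a binary tree is an endpoint of the span formed when it first merges with a sibling, the `touching` list is never empty.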

2.2 Top-Down Tree Inference

In the previous section, we described how to convert a constituency tree T into a set of pointing decisions P(T). We use this transformation to train the parsing model (described in detail in Sections 2.3 - 2.4). During inference, given a sentence to parse, our decoder, with the help of the parsing model, predicts a set of pointings, from which we can construct the tree. However, not every set of pointings guarantees the generation of a valid tree. For example, for a sentence with four (4) tokens, the pointing {1 ↦ 4, 2 ↦ 4, 3 ↦ 1, 4 ↦ 1} does not generate a valid tree because token '3' cannot belong to both of the crossing spans (2, 4) and (1, 3). In other words, simply taking the argmax over the pointing distributions may not generate a valid tree.
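The invalidity above is simply a crossing-brackets condition, which can be checked mechanically. The snippet below is a toy illustration (not part of the paper's algorithm); both pointing sets are hypothetical examples:

```python
# Toy validity check: a pointing set can only yield a tree if the spans it
# encodes never cross. Both pointing sets below are hypothetical examples.
def crossing(pointing):
    """pointing: dict i -> p_i. True if any two encoded spans cross."""
    spans = [(min(i, j), max(i, j)) for i, j in pointing.items()]
    for a, b in spans:
        for c, d in spans:
            if a < c <= b < d:  # overlap without nesting -> no tree exists
                return True
    return False

valid = {1: 5, 2: 5, 3: 4, 4: 2, 5: 1}     # the Figure 1 tree
invalid = {1: 4, 2: 4, 3: 1, 4: 1}         # spans (2,4) and (1,3) cross
print(crossing(valid), crossing(invalid))  # False True
```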

Our approach to decoding is inspired by the span-based approach of minimal-span-based-parsing. In particular, to reduce the search space, we score span identification (given by the pointing function) and label assignment separately.

Span Identification.

We adopt a top-down greedy approach formulated as follows:

k̂ = argmax_{i ≤ k < j} s_split(i, k, j)    (3)

where s_split(i, k, j) is the score of having a split point at position k (i ≤ k < j), which divides the span (i, j) into (i, k) and (k+1, j), as defined by the following equation:

s_split(i, k, j) = P(k ↦ i) + P(k+1 ↦ j)    (4)

where P(k ↦ i) and P(k+1 ↦ j) are the pointing scores (probabilities) for the spans (i, k) and (k+1, j), respectively. Note that the pointing scores are asymmetric: P(i ↦ j) may not be equal to P(j ↦ i), because pointing from i to j is different from pointing from j to i. This differs from previous approaches, where the score of a span is defined to be symmetric. We build a tree for the input sentence by computing Eq. 3 recursively, starting from the full sentence span (1, n).

In the general case, when the split at k produces two multi-token spans, our pointing-based parsing model should learn to assign high scores to the two spans (i, k) and (k+1, j), or equivalently to the pointing decisions k ↦ i and k+1 ↦ j. However, the pointing formulation described so far omits the trivial self-pointing decisions i ↦ i, which represent the singleton spans. A singleton span is only created when the splitting decision splits a span into a single-token span (singleton span) and a sub-span covering the rest, i.e., when k = i or k = j − 1. For instance, in the parsing process in Figure 2(a), the splitting decision at the root span (1, 5) results in a singleton span (1, 1) and a general span (2, 5). For this splitting decision, Eq. 4 requires the scores of 1 ↦ 1 and 2 ↦ 5. However, the set of pointing decisions P(T) does not cover the self-pointing 1 ↦ 1. This discrepancy can be resolved by modeling the singleton spans separately. To achieve that, we redefine Eq. 4 as follows:

s_split(i, k, j) = l_k + r_k, where
  l_k = B(i ↦ i) if k = i, and l_k = P(k ↦ i) otherwise;
  r_k = B(j ↦ j) if k + 1 = j, and r_k = P(k+1 ↦ j) otherwise.    (5)

where B and P respectively represent the scores of the singleton (base) and general pointing functions (to be defined formally in Section 2.3).

Remark on structural consistency. It is important to note that since the pointing functions are defined to have a global structural property (i.e., pointing to the largest span that starts or ends with i), our model inherently enforces structural consistency. The pointing formulation of the parsing problem also makes the training process simple and efficient: it allows us to train the model effectively with a simple cross entropy loss (see Section 2.4).

Figure 2: (a) Execution of the pointing parsing algorithm; (b) the output parse tree. The parse tree is inferred for a given sentence and its part-of-speech (POS) tags (predicted by an external POS tagger). Starting with the full sentence span (1, 5) and its label S, we predict the split point using the base (B) and general (P) pointing scores as per Eqns. 3-5. The left singleton span (1, 1) is assigned the label NP and the right span (2, 5) is assigned the label ∅ using the label classifier as per Eq. 6. The recursion of splitting and labeling continues until the process reaches a terminal node. The label assignment for the unary spans is done by the unary classifier.

Label Assignment.

Label assignment of spans is performed after every split decision. Specifically, as we split a span (i, j) into two sub-spans (i, k) and (k+1, j), which correspond to the pointing decisions k ↦ i and k+1 ↦ j, we perform the label assignments for the two new sub-spans as:

l̂_(i,k) = argmax_{l ∈ L} P_label(l | k),  l̂_(k+1,j) = argmax_{l ∈ L} P_label(l | k+1)    (6)

where P_label is the label classifier for any general (non-unary) span and L is the set of possible non-terminal labels. Following shen-etal-2018-straight, we use a separate classifier to determine the labels of the unary spans, e.g., the first layer of labels (NP, ∅, ∅, NP, ∅) in Figure 2. Also, note that the label assignment is based only on the query vector (the encoder state that is used to point).

Figure 2 illustrates the top-down parsing process for our running example. It consists of a sequence of pointing decisions (Figure 2(a), top to bottom), which are then trivially converted to the parse tree (Figure 2(b)). We also provide the pseudocode in Algorithm 2. Specifically, the algorithm finds the best split for the current span using the pointing scores and pushes the newly created sub-spans into the FIFO queue Q. The process terminates when there are no more spans to be split. Similar to minimal-span-based-parsing, our parsing algorithm has worst and best case time complexities of O(n^2) and O(n log n), respectively.

Input:  Sentence length n; pointing scores P, B; label classifiers P_label, P_unary
Output:  Parse tree T̂
  Q ← queue initialized with the full span (1, n)
  G ← ∅    ▷ general spans and their labels
  U ← ∅    ▷ unary spans and their labels
  while Q is not empty do
      (i, j) ← Q.pop()
      if i = j then
          U ← U ∪ {((i, i), argmax_l P_unary(l | i))}
          continue
      end if
      k̂ ← argmax_{i ≤ k < j} s_split(i, k, j)    ▷ Eq. 5
      G ← G ∪ {((i, k̂), l̂_(i,k̂)), ((k̂+1, j), l̂_(k̂+1,j))}    ▷ Eq. 6
      Q.push((i, k̂));  Q.push((k̂+1, j))
  end while
  T̂ ← the tree assembled from the labeled spans in G ∪ U
Algorithm 2 Pointing parsing algorithm
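A runnable sketch of the decoder is given below. The score matrices are toy values hand-set so that the gold pointings of Figure 1 dominate; in the real model they come from the pointing functions of Section 2.3, and label assignment is omitted for brevity:

```python
# Sketch of the top-down decoder (Algorithm 2) over toy pointing scores.
# P[i][j] is the general score for i -> j and B[i] the singleton (base)
# score for i -> i (1-indexed); labels are omitted for brevity.
from collections import deque

def decode(n, P, B):
    spans, queue = [], deque([(1, n)])
    while queue:
        i, j = queue.popleft()
        if i == j:
            continue                      # singleton span: nothing to split

        spans.append((i, j))

        def score(k):                     # Eq. 5: split (i, j) at k
            left = B[i] if k == i else P[k][i]
            right = B[j] if k + 1 == j else P[k + 1][j]
            return left + right

        k = max(range(i, j), key=score)   # greedy best split point
        queue.append((i, k))
        queue.append((k + 1, j))
    return spans

n, eps = 5, 0.01
P = [[eps] * (n + 1) for _ in range(n + 1)]
B = [0.9] * (n + 1)
for i, p in {1: 5, 2: 5, 3: 4, 4: 2, 5: 1}.items():
    P[i][p] = 0.9                         # gold pointings from Figure 1
print(decode(n, P, B))                    # [(1, 5), (2, 5), (2, 4), (3, 4)]
```

Running it recovers exactly the internal spans of the Figure 1 tree, in top-down order.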

2.3 Model Architecture

We now describe the architecture of our parsing model: the sentence encoder, the pointing model and the labeling model.

Sentence Encoder.

Given an input sequence of n words X = (x_1, …, x_n), we first embed each word x_i into its vector representation e_i as:

e_i = [e_i^char; e_i^word; e_i^pos]

where e_i^char, e_i^word, e_i^pos are respectively the character, word, and part-of-speech (POS) embeddings of the word x_i. Following kitaev-klein-2018-constituency, we use a character LSTM to compute the character embedding of a word. We experiment with both randomly initialized and pretrained word embeddings. If pretrained embeddings are used, the word embedding e_i^word is the summation of the word's randomly initialized embedding and the pretrained embedding. The POS embeddings (e_i^pos) are randomly initialized.

The word representations (e_i) are then passed to a neural sequence encoder to obtain their hidden representations. Since our method does not require any specific encoder, one may use any encoder model, such as a Bi-LSTM (Hochreiter:1997) or a self-attentive encoder (kitaev-klein-2018-constituency). In this paper, unless otherwise specified, we use the self-attentive encoder as our main sequence encoder because of its efficiency with parallel computation. The model is factorized into content and position information in both the self-attention sub-layer and the feed-forward layer. Details about this factorization are provided in kitaev-klein-2018-constituency.

Pointing and Labeling Models.

The results of the aforementioned sequence encoding process are used to compute the pointing and labeling scores. More formally, the encoder network produces a sequence of latent vectors H = (h_1, …, h_n) for the input sequence X. We then apply four (4) separate position-wise two-layer feed-forward networks (FFN) to transform H into task-specific latent representations for the respective pointing and labeling tasks:

h_i^p = FFN_p(h_i),  h_i^b = FFN_b(h_i),  h_i^l = FFN_l(h_i),  h_i^u = FFN_u(h_i)

Note that there is no parameter sharing between FFN_p, FFN_b, FFN_l, and FFN_u. The pointing functions are then modeled as multinomial (or attention) distributions over the input indices for each input position i:

P(· | i) = softmax(score(h_i^p, H^p)),  B(· | i) = softmax(score(h_i^b, H^b))

where score(·, ·) computes an attention score between the query vector and every position's representation. For the label assignment functions, we simply feed the label representations h_i^l and h_i^u into the respective softmax classification layers:

P_label(l | i) = softmax({w_l^T h_i^l}_{l ∈ L}),  P_unary(l | i) = softmax({w_l^T h_i^u}_{l ∈ L_u})

where L and L_u are the sets of possible labels for the general and unary spans respectively, and w_l are the class-specific trainable weight vectors.
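For illustration, the sketch below builds a toy pointing distribution in pure Python: a two-layer FFN maps each encoder state to a task-specific vector, and position i is scored against every position by a dot product followed by a softmax. The dot-product score is our assumption for concreteness; the paper's exact score function may differ, and a real implementation would use a tensor library:

```python
# Toy pointing distribution: task-specific FFN + dot-product attention.
# The dot-product scoring is an illustrative assumption, not the paper's spec.
import math, random

random.seed(0)
d, n = 4, 5                                  # toy hidden size, sentence length
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # encoder states
W1 = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
W2 = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]

def ffn(h):
    """Position-wise two-layer FFN with ReLU (biases omitted for brevity)."""
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W1]
    return [sum(w * x for w, x in zip(row, z)) for row in W2]

Hp = [ffn(h) for h in H]                     # task-specific representations

def pointing_dist(i):
    """P(. | i): softmax of dot products of position i against all positions."""
    logits = [sum(a * b for a, b in zip(Hp[i], hj)) for hj in Hp]
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

print(round(sum(pointing_dist(0)), 6))       # 1.0 -- a proper distribution
```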

2.4 Training Objective

We train our parsing model by minimizing the total loss defined as:

L_total(θ) = L_p(θ_e, θ_p) + L_b(θ_e, θ_b) + L_l(θ_e, θ_l) + L_u(θ_e, θ_u)

where each individual loss is a cross entropy loss computed for the corresponding pointing or labeling task, and θ = {θ_e, θ_p, θ_b, θ_l, θ_u} represents the overall model parameters; specifically, θ_e denotes the encoder parameters shared by all components, while θ_p, θ_b and θ_l, θ_u denote the separate parameters catering for the pointing functions P, B and the labeling functions P_label, P_unary, respectively.
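As a toy illustration of one term of this objective, the snippet below computes the cross entropy of made-up pointing distributions against the gold pointings of Figure 1; all probability values are invented for the example:

```python
# Toy cross entropy for the general pointing task: gold pointings from
# Figure 1 against made-up predicted distributions (rows over positions 1..5).
import math

gold = {1: 5, 2: 5, 3: 4, 4: 2, 5: 1}
pred = {
    1: [0.05, 0.05, 0.05, 0.05, 0.80],
    2: [0.10, 0.05, 0.05, 0.10, 0.70],
    3: [0.05, 0.10, 0.05, 0.75, 0.05],
    4: [0.10, 0.70, 0.05, 0.05, 0.10],
    5: [0.80, 0.05, 0.05, 0.05, 0.05],
}
# mean negative log-likelihood of the gold pointing under each distribution
ce = -sum(math.log(pred[i][gold[i] - 1]) for i in gold) / len(gold)
print(round(ce, 4))                          # 0.2895
```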

3 Experiments

To show the effectiveness of our approach, we conduct experiments on English and multilingual parsing tasks. For English, we use the standard Wall Street Journal (WSJ) part of the Penn Treebank (PTB) (PTB:Marcus:1993), whereas for the multilingual setting, we experiment with eight (8) different languages from the SPMRL 2013-2014 shared tasks (seddah-etal-2013-overview): Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.

For evaluation on PTB, we report the standard labeled precision (LP), labeled recall (LR), and labeled F1 computed by evalb (http://nlp.cs.nyu.edu/evalb/). For the SPMRL datasets, we report labeled F1 and use the same evalb setup as kitaev-klein-2018-constituency.

3.1 English (PTB) Experiments


We follow the standard train/valid/test split, which uses sections 2-21 for training, section 22 for development, and section 23 for evaluation. This gives about 40K sentences for training, 1,700 sentences for development, and 2,416 sentences for testing. Following previous studies, our model uses POS tags predicted by the Stanford tagger (Toutanova:2003:FPT:1073445.1073478).

For our model, we adopt the self-attention encoder with hyperparameter details similar to those proposed by kitaev-klein-2018-constituency. For the general and unary label classifiers (P_label and P_unary), the hidden dimension of the corresponding position-wise feed-forward networks is 250. Our model is trained using the Adam optimizer (KingmaB14). Additionally, we use warm-up steps, within which we linearly increase the learning rate up to the base learning rate. Model selection for testing is performed based on the labeled F1 score on the validation set.

Results for Single Models.

The experimental results on PTB for models without pre-training are shown in Table 1. As can be seen, our model achieves an F1 of 92.78, the highest among the models using top-down inference strategies. Specifically, our method outperforms minimal-span-based-parsing and shen-etal-2018-straight by about 1 point in F1. Notably, our model with an LSTM encoder achieves an F1 of 92.26, which is still better than all the other top-down parsing methods.

Model LR LP F1
Top-Down Inference
minimal-span-based-parsing 93.20 90.30 91.80
shen-etal-2018-straight 92.00 91.70 91.80
Our Model 92.81 92.75 92.78
CKY/Chart Inference
gaddy-etal-2018-whats - - 92.10
kitaev-klein-2018-constituency 93.20 93.90 93.55
Other Approaches
gomez-rodriguez-vilares-2018-constituent - - 90.7
liu-zhang-2017-shift - - 91.8
stern-etal-2017-effective 92.57 92.56 92.56
ZhouZ19 93.64 93.92 93.78
Table 1: Results for single models (no pre-training) on the PTB WSJ test set, Section 23.

On the other hand, while kitaev-klein-2018-constituency and ZhouZ19 achieve higher F1 scores, their inference speed is significantly slower than ours because of their use of CKY-based algorithms, which run at O(n^3) time complexity or higher. Furthermore, their training objectives involve a structural hinge loss, which requires online CKY inference during training. This makes their training time considerably slower than that of our method, which is trained directly with span-wise cross entropy loss.

In addition, ZhouZ19 uses external supervision (head information) from the dependency parsing task. Dependency parsing models, in fact, have a strong resemblance to the pointing mechanism that our model employs (ma-etal-2018-stack). As such, integrating dependency parsing information into our model may also be beneficial. We leave this for future work.

Results with Pre-training.

Similar to kitaev-klein-2018-constituency and kitaev-etal-2019-multilingual, we also evaluate our models with BERT (devlin-etal-2019-bert) embeddings. To accommodate the contextualized token representations, we adjust the number of self-attentive layers and the base learning rate accordingly.

As shown in Table 2, our model achieves an F1 score of 95.48, which is on par with the state-of-the-art models. The advantage of our method, however, is that it is faster: our model runs at O(n^2) worst-case time complexity, while that of kitaev-etal-2019-multilingual is O(n^3). A comparison of parsing speed is given in the following section.

Model F1
Our model 95.34
Our model 95.48
kitaev-klein-2018-constituency ELMO 95.13
kitaev-etal-2019-multilingual 95.59
Table 2: Results on the PTB WSJ test set with pre-training.

Parsing Speed Comparison.

In addition to parsing performance in F1 scores, we also compare our parser against previous neural approaches in terms of parsing speed. We record the parsing time over the 2,416 sentences of the PTB test set with a batch size of 1, on a machine with an NVIDIA GeForce GTX 1080Ti GPU and an Intel(R) Xeon(R) Gold 6152 CPU. This setup is comparable to that of shen-etal-2018-straight.

Model # sents/sec
zhu-etal-2013-fast 89.5
liu-zhang-2017-shift 79.2
minimal-span-based-parsing 75.5
kitaev-klein-2018-constituency 94.40
shen-etal-2018-straight 111.1
Our model 130.2
Table 3: Parsing speed for different models computed on the PTB WSJ test set.
Model Basque French German Hebrew Hungarian Korean Polish Swedish
spmrl2014 88.24 82.53 81.66 89.80 91.72 83.81 90.50 85.50
coavoux-crabbe-2017-multilingual 88.81 82.49 85.34 89.87 92.34 86.04 93.64 84.0
kitaev-klein-2018-constituency 89.71 84.06 87.69 90.35 92.69 86.59 93.69 84.45
Our Model 90.23 82.20 84.91 90.63 91.07 85.36 93.99 86.87
Table 4: SPMRL test set results for single models (no pre-training).
Model Basque French German Hebrew Hungarian Korean Polish Swedish
kitaev-etal-2019-multilingual 91.63 87.43 90.20 92.99 94.90 88.80 96.36 88.86
Our model 92.02 86.69 90.28 93.67 94.24 88.71 96.14 89.10

Table 5: SPMRL test set results for models with pre-training.
Language Train Valid Test
Basque 7,577 948 946
French 14,759 1,235 2,541
German 40,472 5,000 5,000
Hebrew 5,000 500 716
Hungarian 8,146 1,051 1,009
Korean 23,010 2,066 2,287
Polish 6,578 821 822
Swedish 5,000 494 666
Table 6: SPMRL Multilingual dataset split.

As shown in Table 3, our parser processes 19 more sentences per second than shen-etal-2018-straight, despite the fact that our parsing algorithm runs at O(n^2) worst-case time complexity while the one used by shen-etal-2018-straight can theoretically run at O(n log n). To elaborate further, the algorithm as presented in shen-etal-2018-straight runs at O(n^2) complexity; to achieve O(n log n), it needs to sort the list of syntactic distances, which the provided code (https://github.com/hantek/distance-parser) does not implement. In addition, the speed-up of our method can be attributed to the fact that our algorithm (see Algorithm 2) uses a while loop, whereas the algorithm of shen-etal-2018-straight makes many recursive function calls. Recursive implementations tend to be less efficient in practice than equivalent while/for loops because of function call and stack management overheads.

3.2 SPMRL Multilingual Experiments


Similar to the English PTB experiments, we use the predicted POS tags from external taggers (provided in the SPMRL datasets). The train/valid/test split is reported in Table 6. For single model evaluation, we use the identical hyper-parameters and optimizer setups as in English PTB. For experiments with pre-trained models, we use the multilingual BERT (devlin-etal-2019-bert), which was trained jointly on 104 languages.


The results for the single models are reported in Table 4. We see that our model achieves the highest F1 scores in Basque and Swedish, exceeding the previous best baselines by 0.52 and 1.37 points in F1, respectively. Our method also performs competitively with the previous state-of-the-art methods on the other languages.

Table 5 reports the performance of the models using pre-trained BERT. Evidently, our method achieves state-of-the-art results in Basque and Swedish, and performs on par with the previous best method of kitaev-etal-2019-multilingual on the other six languages. Again, note that our method is considerably faster and easier to train than theirs.

4 Related Work

Prior to the neural tsunami in NLP, parsing methods typically modeled correlations in the output space through probabilistic context-free grammars (PCFGs) on top of sparse (and discrete) input representations, either in a generative regime (klein-manning-2003-accurate), a discriminative regime (finkel-etal-2008-efficient), or a combination of both (charniak-johnson-2005-coarse). Besides the chart parser approach, there is also a long tradition of transition-based parsers (sagae-lavie-2005-classifier).

Recently, however, with the advent of powerful neural encoders such as LSTMs (Hochreiter:1997), the focus has shifted towards effective modeling of correlations in the input's latent space, as the output structures are nothing but a function of the input (gaddy-etal-2018-whats). Various neural network models have been proposed to effectively encode the dense input representations and correlations, and have achieved state-of-the-art parsing results. To enforce structural consistency, existing neural parsing methods employ either a transition-based algorithm (dyer-etal-2016-recurrent; liu-zhang-2017-shift; tetra-tagging-2019) or a globally optimized chart-parsing algorithm (gaddy-etal-2018-whats; kitaev-klein-2018-constituency).

Meanwhile, researchers have also attempted to convert the constituency parsing problem into tasks that can be solved in alternative ways. For instance, fernandez-gonzalez-martins-2015-parsing transform the phrase structure into a special form of dependency structure. Such a dependency structure, however, requires certain corrections when converting back to the corresponding constituency tree. gomez-rodriguez-vilares-2018-constituent and shen-etal-2018-straight propose to map the constituency tree of a sentence with n tokens into a sequence of labels or scalars based on the depth or height of the lowest common ancestors of pairs of consecutive tokens. In addition, methods like NIPS2015_5635; NIPS2017_7181 apply the sequence-to-sequence framework to "translate" a sentence into a linearized form of its constituency tree. Though simple, parsers of this type do not guarantee structural correctness, because the syntax of the linearized form is not constrained during tree decoding.

Our approach differs from previous work in that it represents the constituency structure as a series of pointing representations and has a relatively simpler cross entropy based learning objective. The pointing representations can be computed in parallel, and can be efficiently converted into a full constituency tree using a top-down algorithm. Our pointing mechanism shares certain similarities with the Pointer Network (VinyalsNIPS2015), but is distinct from it in that our method points a word to another word within the same encoded sequence.

5 Conclusion

We have presented a novel constituency parsing method that is based on a pointing mechanism. Our method utilizes an efficient top-down decoding algorithm that uses pointing functions to score possible spans. The pointing formulation inherently captures global structural properties and allows efficient training with a cross entropy loss. Our experiments show that our method outperforms all existing top-down methods on the English Penn Treebank parsing task. With pre-training, our method rivals the state-of-the-art while being faster. On multilingual constituency parsing, it also establishes new state-of-the-art results in Basque and Swedish.


Acknowledgments

We would like to express our gratitude to the anonymous reviewers for their insightful feedback on our paper. Shafiq Joty would like to thank the funding support from his Start-up Grant (M4082038.020).



Appendix: Proof of Proposition 1

Given P(T) = {(i ↦ p_i, l_i) : i = 1, …, n} generated from a tree T (here we omit the unary leaves and POS tags), we first define the inverse transformation F^{-1} as the procedure that collects the spans (min(i, p_i), max(i, p_i)) together with their labels l_i and assembles them into a tree. We now prove that F^{-1}(P(T)) = T.

A binary tree over n tokens has exactly n−1 internal nodes (or spans). It is noteworthy that for each pointing i ↦ p_i, the pair (min(i, p_i), max(i, p_i)) is a span in T. As we consider i from 1 to n−1, there are in total at most n−1 such spans in P(T) (we do not yet know whether these spans are distinct). Therefore, if we can prove that the spans are distinct for i = 1, …, n−1, then P(T) covers all the spans in T, and consequently F^{-1}(P(T)) = T. We prove this by contradiction.

Assume that there exist i ≠ j with i, j ∈ {1, …, n−1} such that i ↦ p_i and j ↦ p_j represent the same span. The endpoints of the span represented by a pointing are exactly its source and target, so {i, p_i} = {j, p_j} as sets. This means either (i = j and p_i = p_j) or (i = p_j and j = p_i); since i ≠ j by assumption, we must have i = p_j and j = p_i. Now, without loss of generality, let us assume i < j, so the common span is (i, j); by Proposition 1 it is both the largest span that starts at i and the largest span that ends at j. Since j ≤ n−1, the span (i, j) is not the root span (1, n), so it must be a left or right child of another (parent) span. If (i, j) is a left child, the parent span starts at i and is larger than (i, j); this contradicts the property that (i, j) is the largest span that starts or ends at i. Similarly, if (i, j) is a right child, the parent span ends at j and is larger than (i, j); this again contradicts the property that (i, j) is the largest span that starts or ends at j. Hence, all the spans are distinct.

In conclusion, we have F^{-1}(F(T)) = T for every binary tree T. This guarantees that F and F^{-1} are one-to-one: if there existed trees T_1 ≠ T_2 such that F(T_1) = F(T_2), then T_1 = F^{-1}(F(T_1)) = F^{-1}(F(T_2)) = T_2, a contradiction.