1 Introduction
Several recent papers (Shen et al., 2018a; Htut et al., 2018; Shen et al., 2019; Li et al., 2019; Shi et al., 2019) have used a new parsing algorithm to recover hierarchical constituency structures from sequential recurrent language models trained only on sequences of words. This parser, which we call the $\overline{)((}$ parser for reasons described below, was introduced by Shen et al. (2018a). It operates top-down, recursively splitting larger constituents into smaller ones based on estimated changes in “syntactic depth,” until only terminal symbols remain (§2). In contrast to previous work, which has mostly used explicit tree-structured models with stochastic latent variables, this line of work offers an interestingly different approach to the classic problem of inferring hierarchical structure from flat sequences.

In this paper, we make two primary contributions. First, we show, using the same methodology, that phrase structure can be recovered from conventional LSTMs about as well as from the “ordered neurons” LSTM of Shen et al. (2019), an LSTM variant designed to mimic certain aspects of phrasal syntactic processing (§3). Second, we provide a careful analysis of the parsing algorithm (§4), showing that, in contrast to traditional parsers, which can recover any binary tree, this parser can recover only a small fraction of all binary trees, so that many valid structures are simply not reachable. This incomplete support, together with the greedy search procedure used in the algorithm, results in a marked bias toward right-branching structures, which in turn inflates estimates of parsing performance on right-branching languages like English. Since an adequate model of syntax discovery must be able to account for languages with different branching biases and assign valid structural descriptions to any sentence, our analysis indicates that particular care must be taken when relying on this algorithm to analyze the structural competence of language models.
2 $\overline{)((}$ Parsers
Fig. 1 gives the greedy top-down parsing algorithm originally described in Shen et al. (2018a) as a system of inference rules (Goodman, 1999). (Our presentation is slightly different from the one in Shen et al. (2018a); this notational variant was chosen to facilitate comparison with other parsers presented in these terms and to simplify the proof of Prop. 2; see §4.) We call this the $\overline{)((}$ parser, a name we explain in the analysis section (§4). The parser operates on a sentence of length n by taking a vector of scores, one per word, and recursively splitting the sentence into pairs of smaller and smaller constituents until only single terminals remain. Given a constituent and the score vector, only a single inference rule ever applies. When the deduction completes, the collection of constituents encountered during parsing constitutes the parse.
Figure 1: The greedy top-down parser as a deductive system: premises, inference rules (binary, left, right, middle), and goals.
Depending on the interpretation of the scores, the consequent of the middle rule can be changed so that the word triggering the split is placed in the left (rather than the right) child constituent. We refer to these variants as the L- and R-variants of the parser. We analyze the algorithm, and the implications of these variants, below (§4); but first we demonstrate how the algorithm can be used to recover trees from sequential neural language models.
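To make the procedure concrete, the following Python sketch implements the greedy splitting strategy described above. The side conditions and tie-breaking follow our reading of Fig. 1; the function name and score layout are illustrative, not taken from any released implementation.

```python
def greedy_parse(scores, variant="R"):
    """Greedy top-down parse from per-word "syntactic depth" scores.

    scores  : one score per word.
    variant : "R" puts the word that triggers a middle split in the right
              daughter; "L" puts it in the left daughter.
    Returns a binary bracketing as nested tuples of word indices.
    """
    def split(i, j):
        if i == j:                       # a single terminal
            return i
        if j == i + 1:                   # binary rule: two terminals
            return (i, j)
        k = max(range(i, j + 1), key=lambda t: scores[t])  # "deepest" word
        if k == i:                       # left rule: split off the first word
            return (i, split(i + 1, j))
        if k == j:                       # right rule: split off the last word
            return (split(i, j - 1), j)
        if variant == "R":               # middle rule, R-variant
            return (split(i, k - 1), split(k, j))
        return (split(i, k), split(k + 1, j))  # middle rule, L-variant

    return split(0, len(scores) - 1)


# Example: five words whose scores peak on the third word.
print(greedy_parse([0.1, 0.3, 0.9, 0.2, 0.4], variant="R"))
# -> ((0, 1), (2, (3, 4)))
```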
3 Unsupervised Syntax in LSTMs
Shen et al. (2019) propose a modification to LSTMs that imposes an ordering on the memory cells. Rather than computing each forget gate independently given the previous state and input, as is done in conventional LSTMs, the forget gates in Shen et al.'s ordered-neuron LSTM (ON-LSTM) are tied via a new activation function called a cumulative softmax, so that when a higher-level forget gate is on, lower-level ones are forced to be on as well. The score for each position is then defined to be the average forget depth—that is, the average number of neurons that are turned off by the cumulative softmax—at that position. Intuitively, this means that closing more constituents in the tree is linked to forgetting more information about the history.

In this experiment, we operationalize the same linking hypothesis, but we simply sum the values of the forget gates at each time step (variously at different layers, or all layers together) in a conventional LSTM to obtain a measure of how much information is being forgotten when updating the recurrent representation with the most recently generated word. To ensure a fair comparison, both models were trained on the 10k-vocabulary Penn Treebank (PTB) to minimize cross entropy; they used the same number of parameters; and the best model was selected based on validation perplexity. Details of our LSTM language model are given in Appendix A.

Results on the 7,422-sentence WSJ10 set (consisting of sentences from the PTB of up to 10 words) and the PTB23 set (consisting of all sentences from §23, the traditional PTB test set for supervised parse evaluation) are shown in Tab. 1, with the ON-LSTM of Shen et al. (2019) provided as a reference. We note three things. First, the F1 scores of the two models are comparable, although the LSTM baseline is slightly worse on WSJ10. Second, whereas the ON-LSTM model requires that a single layer be selected to compute the parser's scoring criterion (since each layer will have a different expected forget depth), the unordered nature of LSTM forget gates means our baseline can use gates from all layers jointly. Third, the L-variant of the parser is substantially worse.
Model | Variant | WSJ10 | PTB23
---|---|---|---
ON-LSTM-1 | R | 35.2 | 20.0
ON-LSTM-2 | R | 65.1 | 47.7
ON-LSTM-3 | R | 54.0 | 36.6
LSTM-1 | R | 58.4 | 43.7
LSTM-2 | R | 58.4 | 45.1
LSTM-1,2 | R | 60.1 | 47.0
LSTM-1 | L | 43.8 | 31.8
LSTM-2 | L | 47.4 | 35.1
LSTM-1,2 | L | 46.3 | 33.8

Table 1: Parsing F1 on WSJ10 and PTB23. The numeral(s) after each model name indicate the layer(s) whose forget-gate activations supply the parser's scores; R/L indicates the parser variant.
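The linking hypothesis used for the LSTM baseline—sum the forget-gate activations at each time step and hand the resulting per-word scores to the parser—can be sketched in a few lines. The array layout and function below are illustrative assumptions; extracting the gate activations themselves requires re-running the LSTM cell equations with the trained weights, which is not shown. The usage example reuses the greedy_parse sketch from §2.

```python
import numpy as np

def forget_gate_scores(forget_gates, layers=None):
    """Per-word scores from summed forget-gate activations.

    forget_gates : array of shape (T, L, H) holding the forget-gate
                   activations of an L-layer LSTM with H units at each of
                   the T time steps (extraction of the gates not shown).
    layers       : which layers to sum over, e.g. [0], [1], or [0, 1];
                   None means all layers.
    """
    T, L, H = forget_gates.shape
    if layers is None:
        layers = list(range(L))
    # Sum over the chosen layers and hidden units: one number per word,
    # interpreted (per the linking hypothesis in the text) as a measure of
    # how much of the history is forgotten while reading that word.
    return forget_gates[:, list(layers), :].sum(axis=(1, 2))

# Toy usage: random activations for a 5-word sentence and a 2-layer LSTM,
# fed to the greedy_parse sketch from Section 2.
gates = np.random.rand(5, 2, 950)
scores = forget_gate_scores(gates, layers=[0, 1])
tree = greedy_parse(list(scores), variant="R")
```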
4 Analysis of the Parser
We now turn to a formal analysis of the parser, and we show two things: first, that the parser is able to recover only a rapidly decreasing (with length) fraction of valid trees (§4.1); and second, that the parser has a marked bias in favour of returning right-branching structures (§4.2).
4.1 Incomplete support
Here we characterize properties of the trees that are recoverable with the parser, and describe our reason for naming it as we have.
Proposition 1.
Ignoring all single-terminal bracketings, the R-variant parser can generate all valid binary bracketings that do not contain the contiguous string )((. (Proofs of the propositions are given in Appendix B.)
This avoidance of close–open–open leads us to call this the $\overline{)((}$ parser, where the overline indicates that the substring )(( is forbidden. In Fig. 2 we give an example of an unrecoverable parse for a sentence of length 5 (there are 14 binary bracketings of length-5 sentences, but the parser can recover only 13 of them).
Figure 2: A valid binary bracketing that the R-variant parser cannot recover: [ [ some trees ] [ [ grow branches ] leftwards ] ]
Proposition 2.
The number of parses of a string of length n that is recoverable by a $\overline{)((}$ parser is given by N_n, where

N_1 = N_2 = 1, and N_n = 2 N_{n−1} + Σ_{j=2}^{n−2} N_j N_{n−j−1} for n ≥ 3.

(This sequence is https://oeis.org/A082582, which counts a permutation-avoiding path construction that shows up in several combinatorial problems; Baxter and Pudwell, 2015.)

Although this sequence grows quickly with n, Fig. 3 shows that as the length of the input increases, the ratio of extractable parses to the total number of binary parses, which is counted by the Catalan numbers (Motzkin, 1948), converges logarithmically to 0.
Figure 3: The ratio of parses recoverable by the $\overline{)((}$ parser to the total number of binary bracketings, as a function of sentence length.
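As a concrete check on Prop. 2, the short script below evaluates the recurrence (as we have reconstructed it from the proof in Appendix B; the indexing is our reading of that proof) and compares it with the Catalan numbers. The 13-of-14 figure for length-5 sentences quoted above falls out directly.

```python
from math import comb

def catalan(m):
    """The m-th Catalan number: the number of binary trees with m + 1 leaves."""
    return comb(2 * m, m) // (m + 1)

def recoverable_counts(max_n):
    """N_n as given by the recurrence reconstructed from Appendix B:
    N_1 = N_2 = 1 and, for n >= 3,
    N_n = 2*N_{n-1} + sum_{j=2}^{n-2} N_j * N_{n-j-1}."""
    N = [0, 1, 1]                         # N[0] is unused padding
    for n in range(3, max_n + 1):
        N.append(2 * N[n - 1] + sum(N[j] * N[n - j - 1] for j in range(2, n - 1)))
    return N

N = recoverable_counts(10)
for n in range(2, 11):
    total = catalan(n - 1)                # all binary trees over n terminals
    print(n, N[n], total, round(N[n] / total, 3))
# n = 5 prints "5 13 14 0.929": 13 of the 14 length-5 bracketings are reachable.
```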
4.2 Branching direction bias
Since not every binary tree can be recovered by a $\overline{)((}$ parser, we explore to what extent this biases the solutions found by the algorithm. We probe the nature of the bias in two ways. First, in Fig. 4 we plot the marginal probability that each span is a constituent when all binary trees are equiprobable, and compare it with the probability that the R-parser identifies that span as a constituent under a uniform distribution over the relative orderings of the scores (since it is the relative orderings that determine the parse structure). As we can see, there is no directionality bias in the uniform tree model (all rows have constant probabilities), but in the R-parser, the probability of the right-most constituent of a given length is twice that of the left-most constituent of the same length, indicating a right-branching bias. The L-variant marginals (not shown) are the reflection of the R-variant ones across the vertical axis, indicating that it has the same bias, only reversed in direction.
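The directional asymmetry described above is easy to reproduce empirically. The sketch below reuses the greedy_parse function from §2 and estimates, by Monte Carlo over uniformly random score orderings, how often each span is returned as a constituent; the helper names are our own.

```python
import random
from collections import Counter

def collect_spans(tree, out):
    """Return (first, last) indices of a nested-tuple tree from greedy_parse,
    appending every internal span to `out`."""
    if isinstance(tree, int):
        return tree, tree
    lo, _ = collect_spans(tree[0], out)
    _, hi = collect_spans(tree[1], out)
    out.append((lo, hi))
    return lo, hi

def span_marginals(n, variant="R", samples=20000, seed=0):
    """Monte Carlo estimate of P(span is a constituent) when the relative
    ordering of the n scores is uniformly random."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(samples):
        scores = list(range(n))
        rng.shuffle(scores)               # a uniformly random relative ordering
        spans = []
        collect_spans(greedy_parse(scores, variant), spans)
        counts.update(spans)
    return {span: c / samples for span, c in counts.items()}

m = span_marginals(6)
# Right-most length-2 span vs. left-most one: roughly a 2:1 ratio,
# illustrating the right-branching bias discussed above.
print(m.get((4, 5), 0.0), m.get((0, 1), 0.0))
```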
Thus, while the R-variant $\overline{)((}$ parser fails to reach a large number of trees, it nevertheless has a bias toward the right-branching syntactic structures that are common in English. Since parse evaluation is in terms of retrieval of spans (not entire trees), we may reasonably assume that the right-branching bias is more beneficial than the existence of unreachable trees (correct though they may be) is harmful. A final experiment supports this hypothesis: we run a Monte Carlo simulation to approximate the expected F1 score obtained on WSJ10 and PTB23 under the uniform distribution over binary trees, and under the left-skewed and right-skewed distributions induced by the L- and R-variant parsers, respectively. These estimates are reported in Tab. 2 and show that the R-variant parser confers significant advantages over the uniform tree model, while the L-variant parser is worse still.

 | WSJ10 | PTB23
---|---|---
Left-skew | 31.6 | 16.8
Uniform | 33.7 | 18.3
Right-skew | 37.5 | 19.9

Table 2: Monte Carlo estimates of expected F1 under the left-skewed, uniform, and right-skewed tree distributions.
5 Potential Fixes
Is it possible to fix the parser to remove the biases identified above and to make it complete? It is. In general, it is possible to recursively split phrases top down and obtain any possible binary bracketing (Stern et al., 2017; Shen et al., 2018b). The locus of the bias in the $\overline{)((}$ parser is the decision rule for splitting a span into two daughters based on the maximally “deep” word: since the maximal word in a larger span will still be maximal in the resulting smaller span that contains it, certain configurations are necessarily unreachable. However, at least two alternative scoring possibilities suggest themselves: (1) scoring transitions between words rather than individual words, and (2) scoring spans rather than words. By scoring the transitions between words, each potential division point lies outside of the resulting daughter spans, avoiding the problem of having the maximally scoring element present in both the parent and a daughter. By scoring spans rather than words or word gaps, a similar argument holds.
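As an illustration of option (1), the following sketch splits each span at its highest-scoring gap rather than at its deepest word, in the spirit of Stern et al. (2017); because the chosen boundary never appears inside either daughter, every binary bracketing is reachable for some scoring. The code is a schematic example of the idea, not an implementation from any of the cited papers.

```python
def gap_parse(gap_scores, i=0, j=None):
    """Top-down splitter that scores the n - 1 gaps between adjacent words
    instead of the words themselves.

    gap_scores[k] scores the boundary between word k and word k + 1.
    Splitting at the argmax gap keeps the chosen boundary outside both
    daughters, so every binary bracketing is reachable for some scoring.
    """
    if j is None:
        j = len(gap_scores)               # n - 1 gaps means word indices 0..n-1
    if i == j:                            # a single terminal
        return i
    k = max(range(i, j), key=lambda t: gap_scores[t])  # best split point
    return (gap_parse(gap_scores, i, k), gap_parse(gap_scores, k + 1, j))


# A gap scoring that yields the bracketing of Fig. 2, which the word-scoring
# parser cannot produce:
print(gap_parse([0, 3, 1, 2]))            # -> ((0, 1), ((2, 3), 4))
```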
Although these algorithms are well known, they have so far been used only for supervised parsing. The challenge in using either of these decision rules for unsupervised parsing is to find a suitable linking hypothesis that connects the activations of a sequential language model either to changes in syntactic depth from one word to the next (i.e., scores on the gaps) or to scores over all spans in a sequence. Candidates for such quantities in conventional networks do not, however, immediately suggest themselves.
6 Related Work
Searching for latent linguistic structure in the activations of neural networks that were trained on surface strings has a long history, although the correlates of syntactic structure have only recently begun to be explored in earnest.
Elman (1990) used entropy spikes to segment character sequences into words and, quite similarly to this work, Wang et al. (2017) used changes in reset gates in GRU-based autoencoders of acoustic signals to discover phone boundaries.
Hewitt and Manning (2019) found they could use linear models to recover the (squared) tree distance between pairs of words, as well as tree depth, from the contextual embeddings of undirected language models. A large number of papers have also used recursively structured networks to discover syntax; see Shen et al. (2019) for a comprehensive list.
7 Discussion
The learning process that permits children to acquire a robust knowledge of hierarchical structure from the unstructured strings of words in their environments has been an area of active study for over half a century. Despite this scientific effort—not to mention the reliable success that children learning their first language exhibit—we still have no adequate model of the process. Thus, new modeling approaches, like the ones reviewed in this paper, are important. However, understanding what the results are telling us is more than simply a question of F1 scores. In this case, a biased parser that is well matched to English structures makes LSTMs appear to operate more syntactically than they actually may. Of course, the tendency for languages to have consistent branching patterns has been hypothesized to arise from a head-directionality parameter that is set during learning (Chomsky, 1981), and biases of the sort imposed by the $\overline{)((}$ parser could be reasonable if a mechanism for learning the head-directionality parameter were specified. (Indeed, children demonstrate knowledge of word order from their earliest utterances (Brown, 1973), and even pre-lexical infants are aware of word order patterns in their first language before knowing its words (Gervain et al., 2008).) When comparing unsupervised parsing algorithms that aim to shed light on how children discover structure in language, we must acknowledge that the goal is not simply to obtain the best accuracy, but to do so in a plausibly language-agnostic way. If one's goal is not to understand the general problem of structure discovery, but merely to build better parsers, language-specific biases are certainly on the table. However, we argue that such work should make its goals clear, and that it is unreasonable to mix comparisons across models designed with different goals in mind.
This paper has drawn attention to a potential confound in interpreting the results of experiments that use $\overline{)((}$ parsers. We wish to emphasize again that the works employing this parser represent a meaningful step forward on the important scientific question of how hierarchical structure can be inferred from unannotated sentences. However, an awareness of the biases of the algorithms used to assess this question is important.
Acknowledgments
We thank Adhi Kuncoro for his helpful suggestions on this writeup.
References
- Baxter and Pudwell (2015) Andrew M. Baxter and Lara K. Pudwell. 2015. Ascent sequences avoiding pairs of patterns. The Electronic Journal of Combinatorics, 22(1):1–23.
- Brown (1973) Roger A. Brown. 1973. A First Language: The Early Stages. Harvard.
- Chomsky (1981) Noam Chomsky. 1981. Lectures on Government and Binding. Foris.
- Elman (1990) Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14:179–211.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.
- Gervain et al. (2008) Judit Gervain, Marina Nespor, Reiko Mazuka, Ryota Horie, and Jacques Mehler. 2008. Bootstrapping word order in prelexical infants: A Japanese–Italian cross-linguistic study. Cognitive Psychology, 57(1):56–74.
- Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. FRAGE: frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pages 1334–1345.
- Goodman (1999) Joshua Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.
- Hewitt and Manning (2019) John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proc. NAACL.
- Htut et al. (2018) Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. Grammar induction with neural language models: An unusual replication. In Proc. EMNLP.
- Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Li et al. (2019) Bowen Li, Lili Mou, and Frank Keller. 2019. An imitation learning approach to unsupervised parsing. In Proc. ACL.
- Marcus et al. (1993) Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. Interspeech.
- Motzkin (1948) Theodore Motzkin. 1948. Relations between hypersurface cross ratios, and a combinatorial formula for partitions of a polygon, for permanent preponderance, and for non-associative products. Bulletin of the American Mathematical Society, 54:352–360.
- Shen et al. (2018a) Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In Proc. ICLR.
- Shen et al. (2018b) Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018b. Straight to the tree: Constituency parsing with neural syntactic distance. In Proc. ACL.
- Shen et al. (2019) Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In Proc. ICLR.
- Shi et al. (2019) Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually grounded neural syntax acquisition. In Proc. ACL.
- Stern et al. (2017) Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proc. ACL.
- Wang et al. (2017) Yu-Hsuan Wang, Cheng-Tao Chung, and Hung-Yi Lee. 2017. Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries. In Proc. Interspeech.
- Yang et al. (2018) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2018. Breaking the softmax bottleneck: a high-rank RNN language model. Proc. ICLR.
Appendix A Experimental Details
For our experiments, we trained a two-layer LSTM with 950 units and 24M parameters on the Penn Treebank (Marcus et al., 1993, PTB) language modelling dataset with preprocessing from Mikolov et al. (2010). Its lack of architectural innovations makes this model ideal as a baseline. In particular, we refrain from using recent inventions such as the Mixture of Softmaxes (Yang et al., 2018) and FRAGE (Gong et al., 2018). Still, the trained model is within 2 perplexity points of the state of the art (Gong et al., 2018), achieving 55.8 and 54.6 on the validation and test sets, respectively.
As is the case for all models that are strong on small datasets such as the PTB, this one is also heavily regularized. An L2 penalty controls the magnitude of all weights. Dropout is applied to the inputs of the two LSTM layers, to their recurrent connections (Gal and Ghahramani, 2016), and to the output. Optimization is performed by Adam (Kingma and Ba, 2014) with the first-moment (momentum) term disabled, a setting that resembles RMSProp without momentum. Dropout is turned off at test time, when we extract the values of the forget gates.
Appendix B Proofs of Propositions
Although it was convenient to present the results in the reverse order in the main text, here we first prove Proposition 2, the recurrence counting the number of recoverable parses, and then Proposition 1, since the former is used in proving the latter.
B.1 Proof of Proposition 2
To derive the recurrence giving the number of recoverable parses, the general strategy will be to exploit the duality between top-down and bottom-up parsing algorithms by reversing the inference system in Fig. 1 and parsing sentences bottom up. This admits a more analyzable algorithm. In this proof, we focus on the default R-variant, although the analysis is identical for the L-variant.
We begin by noting that after the top-down middle rule applies, the newly created right constituent is immediately used to derive a pair of constituents—its first terminal and the remainder—via either the left or the binary rule, depending on its size. This is because the word that triggered the split has the maximum score in the parent span, and therefore also has the maximum score in the daughter constituent that contains it. Since there is only a single derivation of this daughter from its subconstituents (and vice versa), we can combine these two steps into a single step by transforming the middle rule as follows:
the middle rule is replaced by a ternary rule that rewrites a constituent directly as a left daughter, the splitting terminal, and the remaining material to its right. This ternary form results in a system with the same number of derivations but is easier to analyze.
In the bottom-up version of the parser, we start with the goals of the top-down parser, run the inference rules backwards (from bottom to top), and ultimately derive the top-down parser's premise as the unique goal. Since we want to consider all possible decisions the parser might make, the bottom-up rules are also transformed to disregard the constraints imposed by the score vector. Finally, to count the derivations, we use an inside algorithm in which each item is associated with a number that counts the unique derivations of that item in the bottom-up forest. When combining items to build a new item, the weights of the antecedent items multiply; when an item may be derived in multiple ways, they add (Goodman, 1999).
Let N_n refer to the number of possible parses for a sequence of length n. It is obvious that N_1 = 1, since the only way to construct single-terminal items in the bottom-up parser is with the initialization step (all other rules have constraints that build longer constituents). Likewise N_2 = 1, since there is only one way to build a constituent of length 2, namely via the binary rule, which combines two single-terminal constituents (each of which can be derived in only one way).
Consider the general case where n ≥ 3. Here, there are three ways of deriving a length-n constituent: the left and right rules, and the more general ternary rule. Both left and right combine a tree spanning n − 1 symbols with a single terminal. The single terminal has, as we have seen, one derivation, and the tree has, by induction, N_{n−1} derivations. Thus, the left and right rules contribute 2 N_{n−1} derivations to N_n.
It remains to account for the contribution of the ternary rule. To do so, we observe that it derives a length-n span by combining two smaller spans—of lengths j and n − j − 1, having derivation counts N_j and N_{n−j−1} respectively (by induction)—with the single terminal between them, which has a single derivation (by definition). Thus, for a particular value of j, the contribution to N_n is N_j · N_{n−j−1}. Since j may be any value in the range 2 ≤ j ≤ n − 2, the aggregate contribution of the ternary rule is Σ_{j=2}^{n−2} N_j N_{n−j−1}. Combining the contributions of left, right, and the ternary rule, we obtain

N_n = 2 N_{n−1} + Σ_{j=2}^{n−2} N_j N_{n−j−1}

for n ≥ 3. ∎
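The recurrence can be cross-checked by brute force: enumerate every relative ordering of the scores for short sentences, parse each one with the greedy_parse sketch from §2, and count the distinct bracketings that result. Under our reconstruction of the recurrence, the counts should match N_2, …, N_7.

```python
from itertools import permutations

def distinct_parses(n, variant="R"):
    """Number of distinct trees produced by the greedy parser over every
    relative ordering of n (distinct) scores."""
    return len({greedy_parse(list(p), variant) for p in permutations(range(n))})

print([distinct_parses(n) for n in range(2, 8)])
# Expected from the recurrence above: [1, 2, 5, 13, 35, 97]
```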
B.2 Proof of Proposition 1
We prove this in two parts. First, we show that the string )(( cannot be generated by the parser. Second, we show that when bracketings containing this string are removed from the set of all binary bracketings of n symbols, the cardinality of the remaining set is N_n.
Part 1.
We prove that the string )(( cannot be generated by the parser by contradiction. Assume that the bracketing produced by the parser contains the string )((. For this string to occur in a balanced binary bracketing, there must be at least two symbols to the left of the split (terminals are unbracketed, so the constituent closed by the ) must have length at least 2) and at least three symbols to the right of the split (if the material following the split had length 2, the two opening brackets would imply a unary production, which cannot exist in a binary tree).
Thus, to obtain the split between the material to the left of the ) and the material to its right, the middle rule must have applied, because of the size constraints on the inference rules. However, as we showed in the proof of Prop. 2, after the middle rule applies, a terminal symbol must be the left child of the resulting right daughter constituent; thus we must obligatorily have the sequence )(w, where w is any single terminal symbol. This contradicts our starting assumption. ∎
Part 2.
It remains to show that the only unreachable trees are those that contain )((. Proving this formally is considerably more involved than Part 1, so we only outline the strategy. First, we construct a finite-state machine that generates the language of all strings over the bracket and terminal alphabet that do not contain the contiguous substring )((. Second, we intersect this with a grammar that generates all valid binary bracketed expressions (such as S → (S S) | w, where w is a terminal).
Finally, we remove the bracketing symbols and show that the number of derivations of the resulting terminal string of length n is N_n. ∎
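Proposition 1 can likewise be checked exhaustively for small n: generate every binary tree, write out its bracketing with terminals left unbracketed, and count the trees whose bracket strings avoid the contiguous substring )((. The helpers below are a self-contained sketch of that check.

```python
def all_trees(n):
    """Every binary tree over n terminals, as nested tuples (None = terminal)."""
    if n == 1:
        return [None]
    trees = []
    for k in range(1, n):
        trees += [(left, right) for left in all_trees(k) for right in all_trees(n - k)]
    return trees

def bracket_string(tree):
    """Bracketing of a tree, with terminals written as 'w' and left unbracketed."""
    if tree is None:
        return "w"
    return "(" + bracket_string(tree[0]) + bracket_string(tree[1]) + ")"

for n in range(2, 8):
    trees = all_trees(n)
    avoiding = sum(")((" not in bracket_string(t) for t in trees)
    print(n, avoiding, len(trees))        # e.g. n = 5 -> "5 13 14"
```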