Unsupervised Dependency Parsing: Let's Use Supervised Parsers

04/18/2015 ∙ by Phong Le, et al. ∙ University of Amsterdam

We present a self-training approach to unsupervised dependency parsing that reuses existing supervised and unsupervised parsing algorithms. Our approach, called ‘iterated reranking’ (IR), starts with dependency trees generated by an unsupervised parser, and iteratively improves these trees using the richer probability models used in supervised parsing, which are in turn trained on these trees. Our system achieves a DDA 1.8% higher than the state-of-the-art parser of Spitkovsky et al. (2013) on the WSJ corpus.

1 Introduction

Unsupervised dependency parsing and its supervised counterpart have many characteristics in common: they take as input raw sentences, produce dependency structures as output, and often use the same evaluation metric (DDA, or UAS, the percentage of tokens for which the system predicts the correct head). Unsurprisingly, there has been much more research on supervised parsing (producing a wealth of models, datasets and training techniques) than on unsupervised parsing, which is more difficult, much less accurate and generally uses very simple probability models. Surprisingly, however, there have been no reported attempts to reuse supervised approaches to tackle the unsupervised parsing problem (an idea briefly mentioned in Spitkovsky et al. (2010b)).

There are, nevertheless, two aspects of supervised parsers that we would like to exploit in an unsupervised setting. First, we can increase the model expressiveness in order to capture more linguistic regularities. Many recent supervised parsers use third-order (or higher order) features [Koo and Collins2010, Martins et al.2013, Le and Zuidema2014] to reach state-of-the-art (SOTA) performance. In contrast, existing models for unsupervised parsing limit themselves to using simple features (e.g., conditioning on heads and valency variables) in order to reduce the computational cost, to identify consistent patterns in data [Naseem2014, page 23], and to avoid overfitting [Blunsom and Cohn2010]. Although this makes learning easier and more efficient, the disadvantage is that many useful linguistic regularities are missed: an upper bound on the performance of such simple models, estimated by using annotated data, is 76.3% on the WSJ corpus [Spitkovsky et al.2013], compared to over 93% actual performance of the SOTA supervised parsers.

Second, we would like to make use of information available from lexical semantics, as in Bansal et al. (2014), Le and Zuidema (2014), and Chen and Manning (2014). Lexical semantics is a source for handling rare words and syntactic ambiguities. For instance, if a parser can identify that “he” is a dependent of “walks” in the sentence “He walks”, then, even if “she” and “runs” do not appear in the training data, the parser may still be able to recognize that “she” should be a dependent of “runs” in the sentence “She runs”. Similarly, a parser can make use of the fact that “sauce” and “John” have very different meanings to decide that they have different heads in the two phrases “ate spaghetti with sauce” and “ate spaghetti with John”.

However, applying existing supervised parsing techniques to the task of unsupervised parsing is, unfortunately, not trivial. The reason is that those parsers are designed to be trained on manually annotated data. If we use existing unsupervised training methods (like EM), learning could easily be misled by the large amount of ambiguity naturally embedded in unannotated training data. Moreover, the computational cost could rapidly increase if the training algorithm is not designed properly. To overcome these difficulties we propose a framework, iterated reranking (IR), where existing supervised parsers are trained without the need for manually annotated data, starting with dependency trees provided by an existing unsupervised parser as initialiser. Using this framework, we can employ the work of Le and Zuidema (2014) to build a new system that outperforms the SOTA unsupervised parser of Spitkovsky et al. (2013) on the WSJ corpus.

The contribution of this paper is twofold. First, we show the benefit of using lexical semantics for the unsupervised parsing task. Second, our work is a bridge connecting two research areas: unsupervised parsing and its supervised counterpart. Before going to the next section, in order to avoid confusion introduced by names, it is worth noting that we take existing supervised parsers in untrained form and train them on automatically annotated treebanks.

2 Related Work

2.1 Unsupervised Dependency Parsing

The first breakthrough was made by Klein and Manning (2004) with their dependency model with valence (DMV), the first model to outperform the right-branching baseline on the DDA metric: 43.2% vs 33.6% on sentences up to length 10 in the WSJ corpus. Nine years later, Spitkovsky et al. (2013) achieved much higher DDAs: 72.0% on sentences up to length 10, and 64.4% on all sentences in section 23. During this period, many approaches were proposed to tackle the challenge.

Naseem and Barzilay (2011), Tu and Honavar (2012), Spitkovsky et al. (2012), Spitkovsky et al. (2013), and Marecek and Straka (2013) employ extensions of the DMV but with different learning strategies. Naseem and Barzilay (2011) use semantic cues, which are event annotations from an out-of-domain annotated corpus, in their model during training. Relying on the fact that natural language grammars must be unambiguous in the sense that a sentence should have very few correct parses, Tu and Honavar (2012) incorporate unambiguity regularisation into posterior probabilities. Spitkovsky et al. (2012) bootstrap the learning by slicing up all input sentences at punctuation. Spitkovsky et al. (2013) propose a complete deterministic learning framework for breaking out of local optima using count transforms and model recombination. Marecek and Straka (2013) make use of a large raw text corpus (e.g., Wikipedia) to estimate stop probabilities, using the reducibility principle.

Differing from those works, Bisk and Hockenmaier (2012) rely on Combinatory Categorial Grammars with a small number of hand-crafted general linguistic principles, whereas Blunsom and Cohn (2010) use Tree Substitution Grammars with a hierarchical non-parametric Pitman-Yor process prior biasing the learning towards a small grammar.

2.2 Reranking

Our work relies on reranking, a technique widely used in (semi-)supervised parsing. Reranking requires two components: a $k$-best parser and a reranker. Given a sentence, the parser generates a list of the $k$ best candidates; the reranker then rescores those candidates and picks the one that has the highest score. Reranking was first successfully applied to supervised constituent parsing [Collins2000, Charniak and Johnson2005]. It was then employed in the supervised dependency parsing approaches of Sangati et al. (2009), Hayashi et al. (2013), and Le and Zuidema (2014).

Closest to our work is the series of work on semi-supervised constituent parsing by McClosky and colleagues, e.g. McClosky et al. (2006), using self-training. They use a $k$-best generative parser and a discriminative reranker to parse unannotated sentences, then add the resulting parses to the training treebank and re-train the reranker. Unlike their work, ours targets unsupervised dependency parsing, without manually annotated data, and uses iterated reranking instead of single reranking. In addition, both components, the $k$-best parser and the reranker, are re-trained after each iteration.

3 The IR Framework

Existing training methods for the unsupervised dependency task, such as Blunsom and Cohn (2010), Gillenwater et al. (2011), and Tu and Honavar (2012), are hypothesis-oriented search with the EM algorithm or its variants: training moves from one point, which represents a model hypothesis, to another point. This approach is feasible for optimising models using simple features, since existing dynamic programming algorithms can compute expectations, which are sums over all possible parses, or find the best parse in the whole parse space with low complexity. However, the complexity increases rapidly if rich, complex features are used. One way to reduce the computational cost is to use approximation methods like sampling, as in Blunsom and Cohn (2010).

3.1 Treebank-oriented Greedy Search

Believing that the difficulty of using EM comes from the fact that treebanks are ‘hidden’, leading to the need to compute a sum (or max) over all possible treebanks, we propose a greedy local search scheme based on another training philosophy: treebank-oriented search. The key idea is to explicitly search for concrete treebanks which are used to train parsing models. This scheme thus allows supervised parsers to be trained in an unsupervised parsing setting, since there is an (automatically annotated) treebank at any time.

Given $S$, a set of raw sentences, the search space consists of all possible treebanks $D = \{d(s) \mid s \in S\}$, where $d(s)$ is a dependency tree of sentence $s$. The target of the search is the optimal treebank $D^*$ that is as good as human annotations. Greedy search with this philosophy is as follows: starting at an initial point $D_1$, we pick a point $D_2$ among its neighbours $N(D_1)$ such that

$$D_2 = \arg\max_{D \in N(D_1)} f_{D_1}(D) \qquad (1)$$

where $f_{D_1}$ is an objective function measuring the goodness of $D$ (which may or may not be conditioned on $D_1$). We then continue this search until some stop criterion is satisfied. The crucial factor here is to define $N(D_i)$ and $f_{D_i}$. Below are two special cases of this scheme.

Semi-supervised parsing using reranking [McClosky et al.2006]

This reranking is indeed one-step greedy local search. In this scenario, $N(D_1)$ is the Cartesian product of $k$-best lists generated by a $k$-best parser, and $f_{D_1}$ is a reranker.

Unsupervised parsing with hard-EM [Spitkovsky et al.2010b]

In hard-EM, the target is to maximise the following objective function with respect to a parameter set $\Theta$

$$L(S \mid \Theta) = \prod_{s \in S} \max_{d \in Dep(s)} P_{\Theta}(d) \qquad (2)$$

where $Dep(s)$ is the set of all possible dependency structures of $s$. The two EM steps are thus

  • Step 1: $D_{i+1} = \arg\max_{D} P_{\Theta_i}(D)$

  • Step 2: $\Theta_{i+1} = \arg\max_{\Theta} P_{\Theta}(D_{i+1})$

In this case, $N(D_i)$ is the whole treebank space and $f_{D_i}(D) = P_{\Theta_i}(D)$ with $\Theta_i = \arg\max_{\Theta} P_{\Theta}(D_i)$.

3.2 Iterated Reranking

We instantiate the greedy search scheme by iterated reranking, which requires two components: a $k$-best parser $P$ and a reranker $R$. First, $D_1$ is used to train these two components, resulting in $P_1$ and $R_1$. The parser $P_1$ then generates a set of lists of $k$ candidates, $kD_1$ (whose Cartesian product results in $N(D_1)$), for the set of training sentences $S$. The best candidates, according to reranker $R_1$, are collected to form $D_2$ for the next iteration. This process is halted when a pre-defined stop criterion is met. (It is worth noting that, although $N(D_i)$ has size $O(k^n)$, where $n$ is the number of sentences, reranking only needs to process $O(k \times n)$ parses if these sentences are assumed to be independent.)
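The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `iterated_reranking`, `train_parser` and `train_reranker`, the head-index tree encoding, and the toy components are all hypothetical, and the stop criterion used here (a fixed point or an iteration cap) is only one possible choice.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Tree = Tuple[int, ...]          # head position per token, 0 = ROOT (assumed encoding)
Treebank = Dict[str, Tree]      # sentence -> its current tree


def iterated_reranking(
    sentences: List[str],
    initial: Treebank,
    train_parser: Callable[[Treebank], Callable[[str], List[Tree]]],
    train_reranker: Callable[[Treebank], Callable[[Tree], float]],
    max_iters: int = 100,
) -> Treebank:
    """Greedy treebank search: retrain both components, rerank, repeat."""
    treebank = dict(initial)
    for _ in range(max_iters):
        k_best = train_parser(treebank)      # parser of this iteration
        score = train_reranker(treebank)     # reranker of this iteration
        # Sentences are treated as independent, so the best treebank in the
        # Cartesian product is found by picking the best candidate per sentence.
        new_treebank = {s: max(k_best(s), key=score) for s in sentences}
        if new_treebank == treebank:         # assumed stop criterion: fixed point
            break
        treebank = new_treebank
    return treebank


# Toy components (hypothetical): every two-word sentence has two candidate
# trees, and the "reranker" prefers the tree shape that is most frequent in
# the treebank it was trained on.
def toy_train_parser(treebank: Treebank) -> Callable[[str], List[Tree]]:
    return lambda sentence: [(0, 1), (2, 0)]


def toy_train_reranker(treebank: Treebank) -> Callable[[Tree], float]:
    counts = Counter(treebank.values())
    return lambda tree: float(counts[tree])


sentences = ["he walks", "she runs", "look here"]
initial = {"he walks": (2, 0), "she runs": (2, 0), "look here": (0, 1)}
result = iterated_reranking(sentences, initial, toy_train_parser, toy_train_reranker)
```

In this toy run the minority analysis of "look here" is replaced by the majority shape after one iteration, and the second iteration detects the fixed point. Each iteration scores only $k$ candidates per sentence rather than enumerating the full Cartesian product.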

We could certainly, as in the work of Spitkovsky et al. (2010b) and many bootstrapping approaches, employ only the parser $P$. Reranking, however, brings us two benefits. First, it allows us to employ very expressive models like the ∞-order generative model proposed by Le and Zuidema (2014). Second, it embodies a similar idea to co-training [Blum and Mitchell1998]: $P_i$ and $R_i$, the parser and the reranker at iteration $i$, play the roles of two views of the data.

3.3 Multi-phase Iterated Reranking

Training in machine learning often uses starting big, which is to use up all training data at the same time. However, Elman (1993) suggests that in some cases, learning should start by training simple models on small data and then gradually increase the model complexity and add more difficult data. This is called starting small.

In unsupervised dependency parsing, starting small is intuitive. For instance, given a set of long sentences, learning the fact that the head of a sentence is its main verb is difficult because a long sentence contains many syntactic categories. It would be much easier if we start with only length-one sentences, e.g., “Look!”, since there is only one choice, which is usually a verb. This training scheme was successfully applied by Spitkovsky et al. (2010a) under the name Baby Steps.

We adopt starting small to construct the multi-phase iterated reranking (MPIR) framework. In phase 0, a parser $M$ with a simple model is trained on a set of short sentences $S^{(0)}$ as in traditional approaches. This parser is used to parse a larger set of sentences $S^{(1)} \supseteq S^{(0)}$, resulting in $D^{(1)}_1$. $D^{(1)}_1$ is then used as the starting point for the iterated reranking in phase 1. We continue this process until phase $N$ finishes, with $S^{(i)} \supseteq S^{(i-1)}$ ($i = 1, \dots, N$). In general, we use the resulting reranker in the previous phase to generate the starting point for the iterated reranking in the current phase.
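The data flow between phases can be made concrete with a small sketch. Everything below is an assumption made for illustration: `multi_phase_ir`, `run_ir` and the toy components are hypothetical names, parses are represented as plain strings, and `run_ir` stands in for a full iterated-reranking phase that returns whatever system is used to parse the next, larger sentence set.

```python
from typing import Callable, Dict, List, Sequence

Parse = str                                   # stand-in for a dependency tree
Parser = Callable[[str], Parse]


def multi_phase_ir(
    phase_sets: Sequence[List[str]],
    phase0_parser: Parser,
    run_ir: Callable[[List[str], Dict[str, Parse]], Parser],
) -> Parser:
    """Run one IR phase per sentence set, seeding each with the previous system."""
    current = phase0_parser
    for sentences in phase_sets:
        # The system from the previous phase provides the starting treebank.
        start = {s: current(s) for s in sentences}
        current = run_ir(sentences, start)
    return current


# Toy phase (hypothetical): "refines" each starting parse by appending "+",
# and answers "unseen" for sentences it was not trained on.
def toy_run_ir(sentences: List[str], start: Dict[str, Parse]) -> Parser:
    refined = {s: start[s] + "+" for s in sentences}
    return lambda s: refined.get(s, "unseen")


# Nested sentence sets: each phase adds longer sentences.
phase_sets = [["a"], ["a", "a b"], ["a", "a b", "a b c"]]
final_parser = multi_phase_ir(phase_sets, lambda s: "init", toy_run_ir)
```

The short sentence is refined in every phase, while longer sentences enter later and are refined fewer times, which mirrors the starting-small idea.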

4 Le and Zuidema (2014)’s Reranker

Le and Zuidema (2014)’s reranker is an exception among supervised parsers because it employs an extremely expressive model whose features are ∞-order (in fact, the order is finite but unbounded). To overcome the problem of sparsity, they introduced the inside-outside recursive neural network (IORNN) architecture that can estimate tree-generating models including those proposed by Eisner (1996) and Collins (2003a).

4.1 The ∞-order Generative Model

Le and Zuidema (2014)’s reranker employs the generative model proposed by Eisner (1996). Intuitively, this model is top-down: starting with ROOT, we generate its left dependents and its right dependents. We then generate dependents for each of ROOT’s dependents. The generative process continues recursively until there is no dependent left to generate. Formally, this model is described by the following formula

$$P(T(H)) = \prod_{l=1}^{L} P\big(H^L_l \mid C_{H^L_l}\big)\, P\big(T(H^L_l)\big) \times \prod_{r=1}^{R} P\big(H^R_r \mid C_{H^R_r}\big)\, P\big(T(H^R_r)\big) \qquad (3)$$

where $H$ is the current head, $T(N)$ is the fragment of the dependency parse rooted at $N$, and $C_N$ is the context to generate $N$. $H^L$ and $H^R$ are respectively $H$’s left dependents and right dependents, plus EOC (End-Of-Children), a special token to inform that there are no more dependents to generate. Thus, $P(T(\mathrm{ROOT}))$ is the probability of generating the entire dependency structure $T$.
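The top-down generative story can be illustrated with a short recursive scorer. This is a sketch under simplifying assumptions, not the model used in the paper: the conditioning context is reduced to the head word and the direction (a first-order context), and `Node`, `tree_log_prob` and `cond_prob` are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List

EOC = "EOC"   # End-Of-Children token


@dataclass
class Node:
    word: str
    left: List["Node"] = field(default_factory=list)    # left dependents
    right: List["Node"] = field(default_factory=list)   # right dependents


def tree_log_prob(head: Node, cond_prob: Callable[[str, str, str], float]) -> float:
    """Log-probability of generating the fragment rooted at `head`.

    cond_prob(dependent, head_word, direction) plays the role of
    P(dependent | context); the context is simplified to (head, direction).
    """
    total = 0.0
    for direction, dependents in (("L", head.left), ("R", head.right)):
        for dep in dependents:
            total += math.log(cond_prob(dep.word, head.word, direction))
            total += tree_log_prob(dep, cond_prob)       # recurse into the dependent
        # Every dependent sequence is closed by generating EOC.
        total += math.log(cond_prob(EOC, head.word, direction))
    return total


# "He walks": ROOT -> walks (right dependent), walks -> He (left dependent).
tree = Node("ROOT", right=[Node("walks", left=[Node("He")])])
log_p = tree_log_prob(tree, lambda dep, head, direction: 0.5)
```

With a uniform probability of 0.5 per generation event, the tree above involves eight events (two words and six EOC tokens), so the log-probability is eight times log 0.5.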

Le and Zuidema’s ∞-order generative model is defined as Eisner’s model in which the context to generate a dependent contains all of that dependent’s generated siblings, its ancestors and their siblings. Because of the very large fragments that contexts are allowed to hold, traditional count-based methods are impractical (even if we use smart smoothing techniques). They thus introduced the IORNN architecture to estimate the model.

4.2 Estimation with the IORNN

Figure 1: Inside-Outside Recursive Neural Network (IORNN). Black/white rectangles correspond to inner/outer representations.

An IORNN (Figure 1) is a recursive neural network whose topology is a tree. What makes this network different from traditional RNNs [Socher et al.2010] is that each tree node $u$ carries two vectors: $i_u$, the inner representation, which represents the content of the phrase covered by the node, and $o_u$, the outer representation, which represents the context around that phrase. In addition, information in an IORNN is allowed to flow not only bottom-up as in RNNs, but also top-down. That makes IORNNs a natural tool for estimating top-down tree-generating models.

Applying the IORNN architecture to dependency parsing is straightforward, following the generative story of the ∞-order generative model. First of all, the “inside” part of this IORNN is simpler than what is depicted in Figure 1: the inner representation of a phrase is assumed to be the inner representation of its head. This approximation is plausible since the meaning of a phrase is often dominated by the meaning of its head. The inner representation at each node, in turn, is a function of a vector representation for the word (in our case, the word vectors are initially borrowed from Collobert et al. (2011)), the POS-tag and a capitalisation feature.

Without loss of generality and ignoring directions for simplicity, they assume that the model is generating dependent $u$ for node $h$ conditioning on context $C_u$, which contains all of $u$’s ancestors (including $h$) and their siblings, and all of $u$’s previously generated sisters. Now there are two types of contexts: full contexts of heads (e.g., $h$) whose dependents are being generated, and contexts to generate nodes (e.g., $C_u$). Contexts of the first type are clearly represented by outer representations. Contexts of the other type are represented by partial outer representations, denoted by $\bar{o}_u$. Because the context to generate a node can be constructed recursively by combining the full context of its head and its previously generated sisters, they can compute $\bar{o}_u$ as a function of $o_h$ and the inner representations of its previously generated sisters. On top of $\bar{o}_u$, they put a softmax layer to estimate the probability $P(x \mid C_u)$.

Training this IORNN is to minimise the cross entropy over all dependents. This objective function is indeed the negative log likelihood of the training treebank $D$.
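The link between the softmax layer and the training objective can be checked numerically. The sketch below is illustrative only: it assumes each generation event is given as a list of raw scores over possible outputs plus the index of the observed output, and the function names are hypothetical. It does not model the IORNN itself.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def treebank_nll(events: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Sum of cross-entropy terms, i.e. the negative log likelihood.

    Each event is (scores over candidate outputs, index of the observed output).
    """
    return -sum(math.log(softmax(scores)[target]) for scores, target in events)


# Two generation events with uniform scores: probabilities 1/2 and 1/4.
events = [([0.0, 0.0], 0), ([0.0, 0.0, 0.0, 0.0], 2)]
nll = treebank_nll(events)
```

Because the tree probability is a product of per-dependent probabilities, summing the per-dependent cross-entropy terms gives the negative log of that product; here the two events have probabilities 1/2 and 1/4, so the total is log 8.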

4.3 The Reranker

Le and Zuidema’s (generative) reranker is given by

$$d^* = \arg\max_{d \in \mathcal{D}} P(d)$$

where $P$ (Equation 3, Section 4.1) is computed by the ∞-order generative model, which is estimated by an IORNN; and $\mathcal{D}$ is a $k$-best list.

5 Complete System

Our system is based on the multi-phase IR. In general, any third-party parser for unsupervised dependency parsing can be used in phase 0, and any third-party parser that can generate $k$-best lists can be used in the other phases. In our experiments, for phase 0, we choose the parser of Marecek and Straka (2013), which uses an extension of the DMV model with stop-probability estimates computed on a large corpus. This system has a moderate performance on the WSJ corpus: 57.1% vs the SOTA 64.4% DDA of Spitkovsky et al. (2013). (Marecek and Straka (2013) did not report any experimental result on the WSJ corpus. We use their source code at http://ufal.mff.cuni.cz/udp with the setting presented in Section 6.1. Because the parser does not provide the option to parse unseen sentences, we merge the training sentences (up to length 15) with all the test sentences to evaluate its performance. Note that this result is close to the DDA (55.4%) that the authors reported on the CoNLL 2007 English dataset, which is a portion of the WSJ corpus.) For the other phases, we use the MSTParser (http://sourceforge.net/projects/mstparser/) with the second-order feature mode [McDonald and Pereira2006].

Our system uses Le and Zuidema (2014)’s reranker (Section 4.3). It is worth noting that, in this case, each phase with iterated reranking could be seen as an approximation of hard-EM (see Equation 2) where the first step is replaced by

$$D_{i+1} = \arg\max_{D \in N(D_i)} P_{\Theta_i}(D) \qquad (4)$$

In other words, instead of searching over the whole treebank space, the search is limited to a neighbour set $N(D_i)$ generated by the $k$-best parser $P_i$.

5.1 Tuning Parser

Parser $P_i$ trained on $D_i$ defines the neighbour set $N(D_i)$, which is the Cartesian product of the $k$-best lists in $kD_i$. The position and shape of $N(D_i)$ are thus determined by two factors: how well $P_i$ can fit $D_i$, and $k$. Intuitively, the lower the fitness is, the further $N(D_i)$ moves away from $D_i$; and the larger $k$ is, the larger $N(D_i)$ is. Moreover, the diversity of $N(D_i)$ is inversely proportional to the fitness. When the fitness decreases, patterns existing in the training treebank become less certain to the parser, so patterns that do not exist in the training treebank have more chances to appear in the $k$-best candidates. This leads to high diversity of $N(D_i)$. We blindly set $k$ to one fixed value in all of our experiments.

With the MSTParser, there are two hyper-parameters: iters, the number of epochs, and training-k, the $k$-best parse set size to create constraints during training. training-k is always 1 because constraints from $k$-best parses with almost incorrect training parses are useless.

Because iters controls the fitness of the parser to the training treebank $D_i$, it, as pointed out above, determines the distance from $N(D_i)$ to $D_i$ and the diversity of the former. Therefore, if we want to encourage the local search to explore more distant areas, we should set iters low. In our experiments, we test two strategies: (i) MaxEnc, iters = 1, maximal encouragement, and (ii) MinEnc, iters = 10, minimal encouragement.

5.2 Tuning Reranker

Tuning the reranker is to set values for dim, the dimensions of inner and outer representations, and iters, the number of epochs to train the IORNN. Because the ∞-order model is very expressive and feed-forward neural networks are universal approximators [Cybenko1989], the reranker is capable of perfectly remembering all training parses. In order to avoid this, we set dim = 50, and set iters = 5 for very early stopping.

5.3 Tuning multi-phase IR

Because Marecek and Straka (2013)’s parser does not distinguish training data from test data, we postulate $S^{(1)} = S^{(0)}$. Our system has $N$ phases such that $S^{(0)}$ and $S^{(1)}$ contain all sentences up to length 15, each intermediate $S^{(i)}$ contains all sentences up to a length that grows with $i$, and $S^{(N)}$ contains all sentences up to length 25. Phase 1 halts after 100 iterations whereas all the following phases run with one iteration. Note that we force the local search in phase 1 to run intensively because we hypothesise that most of the important patterns for dependency parsing can be found within short sentences.

6 Experiments

6.1 Setting

We use the Penn Treebank WSJ corpus: sections 02-21 for training, and section 23 for testing. We then apply the standard pre-processing (http://www.cs.famaf.unc.edu.ar/~francolq/en/proyectos/dmvccm) for the unsupervised dependency parsing task [Klein and Manning2004]: we strip off all empty sub-trees, punctuation, and terminals (tagged # and $) not pronounced where they appear; we then convert the remaining trees to dependencies using Collins’s head rules [Collins2003b]. Both word forms and gold POS tags are used. The directed dependency accuracy (DDA) metric is used for evaluation.
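For reference, the DDA metric can be computed as follows. This is a minimal sketch, assuming each tree is encoded as a list of head positions (1-based, with 0 for ROOT); that encoding and the function name are choices made here for illustration.

```python
from typing import Sequence


def directed_dependency_accuracy(
    gold: Sequence[Sequence[int]], predicted: Sequence[Sequence[int]]
) -> float:
    """Fraction of tokens whose predicted head equals the gold head."""
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        if len(gold_heads) != len(pred_heads):
            raise ValueError("gold and predicted trees must cover the same tokens")
        correct += sum(g == p for g, p in zip(gold_heads, pred_heads))
        total += len(gold_heads)
    return correct / total if total else 0.0


# Two sentences of length 2 and 3; heads are 1-based positions, 0 = ROOT.
gold = [[2, 0], [2, 0, 2]]
pred = [[2, 0], [0, 1, 2]]
score = directed_dependency_accuracy(gold, pred)
```

Here three of the five tokens receive the correct head, giving a DDA of 0.6. The score is computed over tokens rather than averaged per sentence, so longer sentences weigh more.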

The vocabulary is taken as a list of words occurring more than two times in the training data. All other words are labelled ‘UNKNOWN’ and every digit is replaced by ‘0’. We initialise the IORNN with the 50-dim word embeddings from Collobert et al. (2011) (http://ml.nec-labs.com/senna/; these word embeddings were learnt without supervision from Wikipedia), and train it with the learning rate […]
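The vocabulary construction can be sketched as below. The order of operations is an assumption: digits are normalised before counting, which the text does not specify, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

UNKNOWN = "UNKNOWN"


def normalise(token: str) -> str:
    """Replace every digit by '0'."""
    return re.sub(r"\d", "0", token)


def build_vocab(sentences: Iterable[List[str]], min_count: int = 3) -> Set[str]:
    """Keep words occurring more than two times (i.e. at least min_count = 3)."""
    counts = Counter(normalise(tok) for sent in sentences for tok in sent)
    return {word for word, n in counts.items() if n >= min_count}


def encode(sentence: List[str], vocab: Set[str]) -> List[str]:
    """Map out-of-vocabulary tokens to UNKNOWN after digit normalisation."""
    return [tok if tok in vocab else UNKNOWN for tok in map(normalise, sentence)]


train = [
    ["in", "1987", "the", "cat"],
    ["in", "1999", "the", "dog"],
    ["in", "2001", "the", "cat"],
]
vocab = build_vocab(train)
encoded = encode(["in", "2015", "the", "bird"], vocab)
```

Under this assumption, three different years collapse to the same type `0000`, which then passes the frequency threshold even though each year occurs only once.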

6.2 Results

We compare our system against recent systems (Table 1 and Section 2.1). Our system with the two encouragement levels, MinEnc and MaxEnc, achieves the highest reported DDAs on section 23: 1.8% and 1.2% higher than Spitkovsky et al. (2013) on all sentences and up to length 10, respectively. Our improvements over the system’s initialiser [Marecek and Straka2013] are 9.1% and 4.4%.

System                         DDA (all)   DDA (@10)
Bisk and Hockenmaier (2012)    53.3        71.5
Blunsom and Cohn (2010)        55.7        67.7
Tu and Honavar (2012)          57.0        71.4
Marecek and Straka (2013)      57.1        68.8
Naseem and Barzilay (2011)     59.4        70.2
Spitkovsky et al. (2012)       61.2        71.4
Spitkovsky et al. (2013)       64.4        72.0
Our system (MinEnc)            66.2        72.7
Our system (MaxEnc)            65.8        73.2
Table 1: Performance on section 23 of the WSJ corpus (all sentences and up to length 10) for recent systems and our system. MinEnc and MaxEnc denote iters = 10 and iters = 1 respectively.
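The improvements quoted in Section 6.2 follow directly from Table 1. The short check below recomputes them, taking the better of the two encouragement levels in each column; the dictionary layout and function names are chosen here for illustration.

```python
# DDA on section 23 as (all sentences, sentences up to length 10), from Table 1.
dda = {
    "Marecek and Straka (2013)": (57.1, 68.8),
    "Spitkovsky et al. (2013)": (64.4, 72.0),
    "Our system (MinEnc)": (66.2, 72.7),
    "Our system (MaxEnc)": (65.8, 73.2),
}


def best_of_ours(column: int) -> float:
    """Best score of our two configurations in the given column."""
    return max(v[column] for name, v in dda.items() if name.startswith("Our system"))


def gain_over(baseline: str) -> tuple:
    """Absolute DDA gain (all sentences, up to length 10) over a baseline."""
    base_all, base_10 = dda[baseline]
    return (round(best_of_ours(0) - base_all, 1), round(best_of_ours(1) - base_10, 1))


gain_sota = gain_over("Spitkovsky et al. (2013)")
gain_init = gain_over("Marecek and Straka (2013)")
```

Note that the two gains in each pair come from different configurations: MinEnc gives the best score on all sentences and MaxEnc the best score on sentences up to length 10.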

6.3 Analysis

In this section, we analyse our system along two aspects. First, we examine three factors which determine the performance of the whole system: encouragement level, lexical semantics, and starting point. We then examine what IR (with the MaxEnc option) contributes to the overall performance by comparing the quality of the treebank obtained at the end of phase 1 against the quality of the treebank given by its initialiser, i.e. Marecek and Straka (2013).

The effect of encouragement level

Figure 2: Difference in DDA between MaxEnc and MinEnc for all phases on their training sets (e.g., the phase whose training set contains all training sentences up to length 17).

Figure 2 shows the differences in DDA between using MaxEnc and MinEnc in each phase: we compute the difference for each phase on its own training set (e.g., the phase whose training set contains all training sentences up to length 17). MaxEnc outperforms MinEnc within phases 1, 2, 3, and 4. However, from phase 5, the latter surpasses the former. This suggests that exploring areas far away from the current point with long sentences is risky. The reason is that long sentences contain more ambiguities than short ones; thus rich diversity, high difference from the current point, but small size (i.e., small $k$) could easily lead the learning down a wrong path.

The performance of the system with the two encouragement levels on section 23 (Table 1) suggests the same. The MaxEnc strategy helps the system achieve the highest accuracy on short sentences (up to length 10). However, it is less helpful than MinEnc on long sentences.

The role of lexical semantics

We examine the role of lexical semantics, which is given by the word embeddings. Figure 3 shows DDAs on training sentences up to length 15 (i.e. $S^{(1)}$) of phase 1 (MaxEnc) with and without the word embeddings. With the word embeddings, phase 1 achieves 71.11%. When the word embeddings are not given, i.e. the IORNN uses randomly generated word vectors, the accuracy drops by 4.2%. This shows that lexical semantics plays a decisive role in the performance of the system.

However, it is worth noting that, even without that knowledge (i.e., with the ∞-order generative model alone), the DDA of phase 1 is 2% higher than before training (66.89% vs 64.9%). This suggests that phase 1 is capable of discovering some useful dependency patterns that are invisible to the parser in phase 0. This, we conjecture, is thanks to the high-order features captured by the IORNN.

Figure 3: DDA of phase 1 (MaxEnc), with and without the word embeddings (denoted by w/ sem and wo/ sem, respectively), on training sentences up to length 15 (i.e. $S^{(1)}$).

The importance of the starting point

Figure 4: DDA of phase 1 (MaxEnc) before and after training with three different starting points provided by three parsers used in phase 0: MS [Marecek and Straka2013], GGGPT [Gillenwater et al.2011], and Harmonic [Klein and Manning2004].

The starting point is claimed to be important in local search. We examine this by using three different parsers in phase 0: (i) MS [Marecek and Straka2013], the parser used in the previous experiments, (ii) GGGPT [Gillenwater et al.2011] (code.google.com/p/pr-toolkit), employing an extension of the DMV model and the posterior regularization framework for training, and (iii) Harmonic, the harmonic initializer proposed by Klein and Manning (2004).

Figure 4 shows DDAs of phase 1 (MaxEnc) on training sentences up to length 15 with three starting points given by those parsers. The starting point is clearly very important to the performance of the iterated reranking: the better the starting point is, the higher the performance of phase 1. However, a remarkable point here is that, in this experiment, the iterated reranking of phase 1 always finds more useful patterns for parsing, whatever the starting point is. We attribute this to the high-order features and lexical semantics, which are not exploited in those parsers.

The contribution of Iterated Reranking

Figure 5: Precision (top) and recall (bottom) over binned HEAD distance of iterated reranking (IR) and its initializer (MS) on the training sentences in phase 1 ($\leq 15$ words).
Figure 6: Correct-head accuracies over POS-tags (sorted in descending order by frequency) of iterated reranking (IR) and its initializer (MS) on the training sentences in phase 1 ($\leq 15$ words).

We compare the quality of the treebank obtained at the end of phase 1 against the quality of the treebank given by the initialiser, Marecek and Straka (2013). Figure 5 shows precision (top) and recall (bottom) over binned HEAD distance. IR helps to improve the precision on all distance bins, especially on the bins corresponding to long distances. The recall is also improved, except on one bin (although the F1-score on this bin is increased). We attribute this improvement to the ∞-order model, which uses very large fragments as contexts and is thus able to capture long dependencies.

Figure 6 shows the correct-head accuracies over POS-tags. IR helps to improve the accuracies over almost all POS-tags, particularly nouns (e.g. NN, NNP, NNS), verbs (e.g. VBD, VBZ, VBN, VBG) and adjectives (e.g. JJ, JJR). However, being affected by the initializer, IR performs poorly on conjunctions (CC) and modal auxiliaries (MD). For instance, in the treebank given by the initializer, almost all modal auxiliaries are dependents of their verbs instead of the other way around.

7 Discussion

Our system is different from the other systems shown in Table 1 as it uses an extremely expressive model, the ∞-order generative model, in which conditioning contexts are very large fragments. Only the work of Blunsom and Cohn (2010), whose resulting grammar rules can contain large tree fragments, shares this property. The difference is that their work needs a pre-defined prior, namely a hierarchical non-parametric Pitman-Yor process prior, to avoid large, rare fragments and for smoothing. The IORNN of our system, in contrast, does that automatically. It learns by itself how to deal with distant conditioning nodes, which are often less informative than close conditioning nodes when computing the generation probabilities. In addition, smoothing comes for free: recursive neural nets are able to map ‘similar’ fragments onto close points [Socher et al.2010], thus an unseen fragment tends to be mapped onto a point close to points corresponding to ‘similar’ seen fragments.

Another difference is that our system exploits lexical semantics via word embeddings, which were learnt without supervision. By initialising the IORNN with these embeddings, the use of this knowledge turns out to be easy and transparent. Spitkovsky et al. (2013) also exploit lexical semantics but in a limited way, using a context-based polysemous unsupervised clustering method to tag words. Although their approach can distinguish polysemes (e.g., ‘cool’ in ‘to cool the selling panic’ and in ‘it is cool’), it is not able to make use of word meaning similarities (e.g., the meaning of ‘dog’ is closer to ‘animal’ than to ‘table’). Naseem and Barzilay (2011)’s system uses semantic cues from an out-of-domain annotated corpus, and thus is not fully unsupervised.

We have shown that IR with a generative reranker is an approximation of hard-EM (see Equation 4). Our system is thus related to the works of Spitkovsky et al. (2013) and Tu and Honavar (2012). However, what we have proposed is more than that: IR is a general framework in which we have more than one option for choosing the $k$-best parser and the reranker. For instance, we can make use of a generative $k$-best parser and a discriminative reranker that are used for supervised parsing. Our future work is to explore this.

The experimental results reveal that the starting point is very important to the iterated reranking with the ∞-order generative model. On the one hand, that is a disadvantage compared to the other systems, which use uninformed or harmonic initialisers. But on the other hand, that is an innovation as our approach is capable of making use of existing systems. The results shown in Figure 4 suggest that if phase 0 uses a better parser which uses a less expressive model and/or less external knowledge than our model, such as the one proposed by Spitkovsky et al. (2013), we can expect even higher performance. The other systems, except Blunsom and Cohn (2010), however, might not benefit from using good existing parsers as initializers because their models are not significantly more expressive than others. (In an experiment, we used Marecek and Straka (2013)’s parser as an initializer for Gillenwater et al. (2011)’s parser. As we expected, the latter was not able to make use of this.)

8 Conclusion

We have proposed a new framework, iterated reranking (IR), which trains supervised parsers without the need for manually annotated data by using an unsupervised parser as an initialiser. Our system, employing Marecek and Straka (2013)’s unsupervised parser as the initialiser, the $k$-best MSTParser, and Le and Zuidema (2014)’s reranker, achieved a DDA 1.8% higher than the SOTA parser of Spitkovsky et al. (2013) on the WSJ corpus. Moreover, we also showed that unsupervised parsing benefits from lexical semantics through the use of word embeddings.

Our future work is to exploit other existing supervised parsers that fit our framework. Besides, taking into account the fast development of the word embedding research [Mikolov et al.2013, Pennington et al.2014], we will try different word embeddings.

Acknowledgments

We thank Remko Scha and three anonymous reviewers for helpful comments. Le thanks Miloš Stanojević for helpful discussion.

References

  • [Bansal et al.2014] Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • [Bisk and Hockenmaier2012] Yonatan Bisk and Julia Hockenmaier. 2012. Simple robust grammar induction with combinatory categorial grammars. In AAAI.
  • [Blum and Mitchell1998] Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100.
  • [Blunsom and Cohn2010] Phil Blunsom and Trevor Cohn. 2010. Unsupervised induction of tree substitution grammars for dependency parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1204–1213. Association for Computational Linguistics.
  • [Charniak and Johnson2005] Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In ACL.
  • [Chen and Manning2014] Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Collins2000] Michael Collins. 2000. Discriminative reranking for natural language parsing. In ICML, pages 175–182.
  • [Collins2003a] Michael Collins. 2003a. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
  • [Collins2003b] Michael Collins. 2003b. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
  • [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
  • [Cybenko1989] George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.
  • [Eisner1996] Jason M Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th conference on Computational linguistics-Volume 1, pages 340–345. Association for Computational Linguistics.
  • [Elman1993] Jeffrey L Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.
  • [Gillenwater et al.2011] Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, and Ben Taskar. 2011. Posterior sparsity in unsupervised dependency parsing. The Journal of Machine Learning Research, 12:455–490.
  • [Hayashi et al.2013] Katsuhiko Hayashi, Shuhei Kondo, and Yuji Matsumoto. 2013. Efficient stacked dependency parsing by forest reranking. Transactions of the Association for Computational Linguistics, 1(1):139–150.
  • [Klein and Manning2004] Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL, pages 478–485.
  • [Koo and Collins2010] Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11. Association for Computational Linguistics.
  • [Le and Zuidema2014] Phong Le and Willem Zuidema. 2014. The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • [Marecek and Straka2013] David Marecek and Milan Straka. 2013. Stop-probability estimates computed on a large corpus improve unsupervised dependency parsing. In ACL (1), pages 281–290.
  • [Martins et al.2013] André FT Martins, Miguel B Almeida, and Noah A Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proc. of ACL.
  • [McClosky et al.2006] David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics, pages 152–159. Association for Computational Linguistics.
  • [McDonald and Pereira2006] Ryan T. McDonald and Fernando C. N. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.
  • [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • [Naseem and Barzilay2011] Tahira Naseem and Regina Barzilay. 2011. Using semantic cues to learn syntax. In AAAI.
  • [Naseem2014] Tahira Naseem. 2014. Linguistically Motivated Models for Lightly-Supervised Dependency Parsing. Ph.D. thesis, Massachusetts Institute of Technology.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12.
  • [Sangati et al.2009] Federico Sangati, Willem Zuidema, and Rens Bod. 2009. A generative re-ranking model for dependency parsing. In Proceedings of the 11th International Conference on Parsing Technologies, pages 238–241. Association for Computational Linguistics.
  • [Socher et al.2010] Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
  • [Spitkovsky et al.2010a] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2010a. From Baby Steps to Leapfrog: How “Less is More” in unsupervised dependency parsing. In Proc. of NAACL-HLT.
  • [Spitkovsky et al.2010b] Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, and Christopher D. Manning. 2010b. Viterbi training improves unsupervised dependency parsing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL-2010).
  • [Spitkovsky et al.2012] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2012. Bootstrapping dependency grammar inducers from incomplete sentence fragments via austere models. In Proceedings of the 11th International Conference on Grammatical Inference.
  • [Spitkovsky et al.2013] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2013. Breaking out of local optima with count transforms and model recombination: A study in grammar induction. In EMNLP, pages 1983–1995.
  • [Tu and Honavar2012] Kewei Tu and Vasant Honavar. 2012. Unambiguity regularization for unsupervised learning of probabilistic grammars. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1324–1334. Association for Computational Linguistics.