1 Introduction
Latent-variable probabilistic context-free grammars (L-PCFGs) have been used in the natural language processing (NLP) community for syntactic parsing for over a decade. They were introduced in the NLP community by matsuzaki05 and prescher2005head, with Matsuzaki et al. using the expectation-maximization (EM) algorithm to estimate them. Their performance on syntactic parsing of English at that stage lagged behind state-of-the-art parsers.
petrov06 showed that one of the reasons the EM algorithm does not yield state-of-the-art parsing models for English is that it does not control well for the model size used in the parser – the number of latent states associated with the various nonterminals in the grammar. As such, they introduced a coarse-to-fine technique to estimate the grammar: it splits and merges nonterminals (with latent state information) with the aim of optimizing the likelihood of the training data. Together with other kinds of fine-tuning of the parsing model, this led to state-of-the-art results for English parsing.
In more recent work, cohen12 described a different family of estimation algorithms for L-PCFGs. This so-called “spectral” family of learning algorithms is compelling because it offers a rigorous theoretical analysis of statistical convergence, and sidesteps the local maxima issues that arise with the EM algorithm.
While spectral algorithms for L-PCFGs are compelling from a theoretical perspective, they have lagged behind in their empirical results on parsing. In this paper we show that one of the main reasons for this is that spectral algorithms require a more careful tuning procedure for the number of latent states than has been advocated until now. In a sense, the relationship between our work and that of cohen13b is analogous to the relationship between the work of petrov06 and that of matsuzaki05: we suggest a technique for optimizing the number of latent states for spectral algorithms, and test it on eight languages.
Our results show that when the number of latent states is optimized using our technique, the parsing models the spectral algorithms yield perform significantly better than the vanilla-estimated models, and, for most of the languages, better than the Berkeley parser of petrov06.
As such, the contributions of this paper are twofold:

We describe a search algorithm for optimizing the number of latent states for spectral learning.

We describe an analysis of spectral algorithms on eight languages (until now the results of L-PCFG estimation with spectral algorithms for parsing were known only for English). Our parsing algorithm is rather language-generic, and does not require significant linguistically-oriented adjustments.
In addition, we dispel the common wisdom that more data is needed with spectral algorithms. Our models yield high performance on treebanks of varying sizes from 5,000 sentences (Hebrew and Swedish) to 40,472 sentences (German).
The rest of the paper is organized as follows. In §2 we describe notation and background. §3 further investigates the need for an optimization of the number of latent states in spectral learning and describes our optimization algorithm, a search algorithm akin to beam search. In §4 we describe our experiments with natural language parsing for Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish. We conclude in §5.
2 Background and Notation
We denote by $[n]$ the set of integers $\{1, \ldots, n\}$. An L-PCFG is a 5-tuple $(\mathcal{N}, \mathcal{I}, \mathcal{P}, m, n)$ where:

$\mathcal{N}$ is the set of nonterminal symbols in the grammar. $\mathcal{I} \subset \mathcal{N}$ is a finite set of in-terminals. $\mathcal{P} \subset \mathcal{N}$ is a finite set of pre-terminals. We assume that $\mathcal{N} = \mathcal{I} \cup \mathcal{P}$ and $\mathcal{I} \cap \mathcal{P} = \emptyset$. Hence we have partitioned the set of nonterminals into two subsets.

$m$ is a function that maps each nonterminal $a \in \mathcal{N}$ to the number of latent states $m(a)$ it uses. The set $[m(a)]$ includes the possible hidden states for nonterminal $a$.

$[n]$ is the set of possible words.

For all $a \in \mathcal{I}$, $b, c \in \mathcal{N}$, $h_1 \in [m(a)]$, $h_2 \in [m(b)]$, $h_3 \in [m(c)]$, we have a binary context-free rule $a(h_1) \to b(h_2)\; c(h_3)$.

For all $a \in \mathcal{P}$, $h \in [m(a)]$, $x \in [n]$, we have a lexical context-free rule $a(h) \to x$.
The estimation of an L-PCFG requires an assignment of probabilities (or weights) to each of the rules $a(h_1) \to b(h_2)\; c(h_3)$ and $a(h) \to x$, and also an assignment of starting probabilities $\pi(a, h)$ for each $a \in \mathcal{I}$ and $h \in [m(a)]$. Estimation is usually assumed to be done from a set of parse trees (a treebank), where the latent states are not included in the data – only the “skeletal” trees, which consist of nonterminals in $\mathcal{N}$.

L-PCFGs, in their symbolic form, are related to regular tree grammars, an old grammar formalism, but they were introduced as statistical models for parsing with latent heads more recently by matsuzaki05 and prescher2005head. Early work on L-PCFGs by matsuzaki05 used the expectation-maximization (EM) algorithm to estimate the grammar probabilities. Indeed, given that the latent states are not observed, EM is a good fit for L-PCFG estimation, since it aims to do learning from incomplete data. This work was further extended by petrov06 to use EM in a coarse-to-fine fashion: merging and splitting nonterminals using the latent states to optimize the number of latent states for each nonterminal.
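To make the formalism concrete, the following minimal sketch computes the probability of a skeletal tree under an L-PCFG by marginalizing over latent state assignments. The toy grammar, rule tables and probability values are hypothetical and only illustrate the model structure; this is not the estimation procedure itself.

```python
# A minimal sketch of the L-PCFG generative model: a skeletal tree is scored by
# summing over latent states. The toy grammar below is hypothetical.
from collections import defaultdict

# Skeletal trees: ("NT", left, right) for binary rules, ("NT", "word") for lexical rules.
tree = ("S", ("NP", "mouse"), ("VP", "ran"))

m = {"S": 2, "NP": 2, "VP": 2}           # latent states per nonterminal
pi = {("S", 0): 0.6, ("S", 1): 0.4}      # starting probabilities pi(a, h)

# Binary rule probabilities p(a(h1) -> b(h2) c(h3)) and lexical probabilities q(a(h) -> x).
p = defaultdict(float)
q = defaultdict(float)
for h1 in range(2):
    for h2 in range(2):
        for h3 in range(2):
            p[("S", h1, "NP", h2, "VP", h3)] = 0.25
q[("NP", 0, "mouse")] = 0.5
q[("NP", 1, "mouse")] = 0.1
q[("VP", 0, "ran")] = 0.3
q[("VP", 1, "ran")] = 0.2

def inside(node):
    """Return a dict h -> inside probability of the skeletal subtree rooted at node."""
    a = node[0]
    if len(node) == 2 and isinstance(node[1], str):          # pre-terminal node
        return {h: q[(a, h, node[1])] for h in range(m[a])}
    left, right = inside(node[1]), inside(node[2])
    b, c = node[1][0], node[2][0]
    return {
        h1: sum(p[(a, h1, b, h2, c, h3)] * left[h2] * right[h3]
                for h2 in left for h3 in right)
        for h1 in range(m[a])
    }

root = inside(tree)
print(sum(pi[(tree[0], h)] * root[h] for h in root))   # probability of the skeletal tree
```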
cohen12 presented a so-called spectral algorithm to estimate L-PCFGs. This algorithm uses linear-algebraic procedures such as singular value decomposition (SVD) during learning. The spectral algorithm of Cohen et al. builds on an estimation algorithm for HMMs by hsu09. (Footnote 1: A related algorithm for weighted tree automata (WTA) was developed by bailly10. However, the conversion from L-PCFGs to WTA is not straightforward, and information is lost in this conversion. See also [Rabusseau et al.2016].)

cohen13b experimented with this spectral algorithm for parsing English. A different variant of a spectral learning algorithm for L-PCFGs was developed by cohen14b. It breaks the problem of L-PCFG estimation into multiple convex optimization problems which are solved using EM.

The family of L-PCFG spectral learning algorithms was further extended by narayan15. They presented a simplified version of the algorithm of cohen12 that estimates sparse grammars and assigns probabilities (instead of weights) to the rules in the grammar, and as such does not suffer from the problem of negative probabilities that arises with the original spectral algorithm (see the discussion in Cohen et al., 2013). In this paper, we use the algorithms of narayan15 and cohen12, and we compare them against state-of-the-art L-PCFG parsers such as the Berkeley parser [Petrov et al.2006]. We also compare our algorithms to other state-of-the-art parsers where elaborate linguistically-motivated feature specifications [Hall et al.2014], annotations [Crabbé2015] and formalism conversions [Fernández-González and Martins2015] are used.
3 Optimizing Spectral Estimation
In this section, we describe our optimization algorithm and its motivation.
3.1 Spectral Learning of L-PCFGs and Model Size
Figure 1: The inside tree [VP [V chased] [NP [D the] [N cat]]] and the outside tree [S [NP [D the] [N mouse]] VP] for the nonterminal VP in the parse tree of the sentence “the mouse chased the cat”.
The family of spectral algorithms for latent-variable PCFGs relies on feature functions that are defined for inside and outside trees. Given a tree, the inside tree for a node contains the entire subtree below that node; the outside tree contains everything in the tree excluding the inside tree. Figure 1 shows an example of inside and outside trees for the nonterminal VP in the parse tree of the sentence “the mouse chased the cat”.
With L-PCFGs, the model dictates that an inside tree and an outside tree that are connected at a node are statistically conditionally independent of each other given the node label and the latent state associated with it. As such, one can identify the distribution over the latent states for a given nonterminal $a$ by using the cross-covariance matrix of the inside and the outside trees, $\Omega^a$. For more information on the definition of this cross-covariance matrix, see cohen12 and narayan15.
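As a small illustration of the inside/outside decomposition of Figure 1, the sketch below splits a parse tree at a node into its inside and outside trees. NLTK is used purely for convenience here and is an assumption, not the authors' tooling; the node position is hard-coded for the example.

```python
# A small sketch of splitting a parse tree into an inside and an outside tree at
# a given node, as in Figure 1. The removed subtree is replaced by a leaf
# carrying its label ("VP").
from nltk import Tree

def inside_outside_trees(tree, position):
    inside = tree[position]                       # the full subtree below the node
    outside = tree.copy(deep=True)
    outside[position] = inside.label()            # cut the subtree, keep its label as a leaf
    return inside, outside

t = Tree.fromstring("(S (NP (D the) (N mouse)) (VP (V chased) (NP (D the) (N cat))))")
ins, out = inside_outside_trees(t, (1,))          # position (1,) is the VP node
print(ins)
print(out)
```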
The L-PCFG spectral algorithms use singular value decomposition (SVD) on $\Omega^a$ to reduce the dimensionality of the feature functions. If $\Omega^a$ is computed from the true L-PCFG distribution, then the rank of $\Omega^a$ (the number of non-zero singular values) gives the number of latent states for $a$ according to the model.
When $\Omega^a$ is instead estimated from data generated by an L-PCFG, the number of latent states for each nonterminal can be exposed by discarding singular values of $\Omega^a$ that are smaller than some threshold value. This means that spectral algorithms give a natural way to select the number of latent states for each nonterminal in the grammar.
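The following sketch illustrates this singular-value criterion for a single nonterminal. The feature matrices are random stand-ins for real inside/outside feature functions, and the threshold and cap are illustrative values rather than the ones used in the paper.

```python
# A minimal sketch of choosing the number of latent states of one nonterminal
# from the singular values of an estimated cross-covariance matrix Omega^a.
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_in, d_out, true_rank = 5000, 50, 40, 6

# Simulate inside/outside feature vectors whose cross-covariance has low rank.
latent = rng.normal(size=(n_samples, true_rank))
inside_feats = latent @ rng.normal(size=(true_rank, d_in)) + 0.1 * rng.normal(size=(n_samples, d_in))
outside_feats = latent @ rng.normal(size=(true_rank, d_out)) + 0.1 * rng.normal(size=(n_samples, d_out))

# Empirical cross-covariance matrix of (centered) inside and outside features.
omega = (inside_feats - inside_feats.mean(0)).T @ (outside_feats - outside_feats.mean(0)) / n_samples

singular_values = np.linalg.svd(omega, compute_uv=False)

def num_latent_states(sv, threshold=1e-2, cap=8):
    """Count singular values above a (relative) threshold, capped at an upper bound."""
    return min(cap, int(np.sum(sv > threshold * sv[0])))

print(singular_values[:10])
print(num_latent_states(singular_values))
```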
However, when the data from which we estimate an L-PCFG model are not drawn from an L-PCFG (the model is “incorrect”), the number of non-zero singular values (or the number of singular values which are large) is no longer sufficient to determine the number of latent states for each nonterminal. This is where our algorithm comes into play: it optimizes the number of latent states for each nonterminal by applying a search algorithm akin to beam search.
3.2 Optimizing the Number of Latent States
As mentioned in the previous section, the number of non-zero singular values of $\Omega^a$ gives a criterion for determining the number of latent states for a given nonterminal $a$. In practice, we cap this number so as not to include small singular values which are close to 0, because of estimation errors in $\Omega^a$.
This procedure does not take into account the interactions that exist between the choices of latent state numbers for the various nonterminals. In principle, given the independence assumptions that L-PCFGs make, choosing the number of latent states based only on the singular values is “statistically correct.” However, because in practice the modeling assumptions that we make (that natural language parse trees are drawn from an L-PCFG) do not hold, we can further improve the accuracy of the model by taking into account the interactions between nonterminals. Another source of difficulty in choosing the number of latent states based on the singular values of $\Omega^a$ is sampling error: in practice, we use data to estimate $\Omega^a$, and as such, even if the model is correct, the rank of the estimated matrix does not have to correspond to the rank of $\Omega^a$ according to the true distribution. As a matter of fact, in addition to neglecting small singular values, the spectral methods of cohen13b and narayan15 also cap the number of latent states for each nonterminal at an upper bound to keep the grammar size small.
petrov06 improve over the estimation described in matsuzaki05 by taking into account the interactions between the nonterminals and their latent state numbers in the training data. They use the EM algorithm to split and merge nonterminals using the latent states, and optimize the number of latent states for each nonterminal so as to maximize the likelihood of a training treebank. Their refined grammar successfully splits nonterminals to various degrees to capture their complexity. We take the analogous step with spectral methods. We propose an algorithm where we first compute $\Omega^a$ on the training data and then optimize the number of latent states for each nonterminal by optimizing the PARSEVAL metric [Black et al.1991] on a development set.
Our optimization algorithm appears in Figure 2. The input to the algorithm is training and development data in the form of parse trees, a basic spectral estimation algorithm in its default setting, an upper bound on the number of latent states that can be used for the different nonterminals, and a beam size which gives a maximal queue size for the beam. The algorithm aims to learn a function $m$ that maps each nonterminal to a number of latent states. It initializes $m$ by estimating a default grammar with the basic estimation algorithm, and then iterates over the nonterminals, improving $m$ so as to optimize the PARSEVAL metric on the development set.
The state of the algorithm includes a queue that consists of tuples of the form $(m, i, s, \ell)$, where $m$ is an assignment of latent state numbers to each nonterminal in the grammar, $i$ is the index of the nonterminal to be explored next in the input nonterminal list, $s$ is the score on the development set for a grammar estimated with $m$, and $\ell$ is a tag indicating whether the next search step for that nonterminal should be coarse or refined.
The algorithm orders these tuples in the queue by their score $s$, and iteratively dequeues elements from the queue. Then, depending on the tag $\ell$, it either makes a refined search for the number of latent states of the dequeued nonterminal, or a coarser one. As such, the algorithm can be seen as a variant of a beam search algorithm.
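The sketch below captures the flavor of this search, not the exact procedure of Figure 2: the helper callbacks `estimate_grammar` and `parseval_f1`, the coarse grid, and the refinement step sizes are all hypothetical placeholders standing in for spectral estimation and development-set parsing.

```python
# A schematic sketch of the latent-state search: a queue of candidate
# assignments m is scored on the development set and expanded either coarsely
# (large jumps in the number of latent states) or in a refined way (small steps
# around the current value). All helpers and step sizes are illustrative.
import heapq
import itertools

def optimize_latent_states(nonterminals, m_default, estimate_grammar, parseval_f1,
                           max_states=32, beam_size=4):
    tiebreak = itertools.count()                    # avoids comparing dicts inside the heap
    m0 = dict(m_default)
    best_score = parseval_f1(estimate_grammar(m0))
    best_m = dict(m0)
    queue = [(-best_score, next(tiebreak), m0, 0, "coarse")]
    while queue:
        _, _, m, i, tag = heapq.heappop(queue)
        if i >= len(nonterminals):
            continue
        a = nonterminals[i]
        if tag == "coarse":
            candidates = [1, 2, 4, 8, 16, 32]            # coarse grid of state numbers
        else:
            candidates = [max(1, m[a] - 1), m[a] + 1]    # refine around the current value
        scored = []
        for value in candidates:
            if value > max_states:
                continue
            m_new = dict(m)
            m_new[a] = value
            score = parseval_f1(estimate_grammar(m_new))
            scored.append((score, m_new))
            if score > best_score:
                best_score, best_m = score, dict(m_new)
        # Coarse expansions are refined next; refined expansions move on to the
        # next nonterminal. Only the best few candidates are kept.
        next_i, next_tag = (i, "refine") if tag == "coarse" else (i + 1, "coarse")
        for score, m_new in sorted(scored, key=lambda t: t[0], reverse=True)[:beam_size]:
            heapq.heappush(queue, (-score, next(tiebreak), m_new, next_i, next_tag))
        queue = heapq.nsmallest(beam_size, queue)   # the beam size caps the queue
    return best_m, best_score
```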
| | Basque | French | GermanN | GermanT | Hebrew | Hungarian | Korean | Polish | Swedish |
|---|---|---|---|---|---|---|---|---|
| train: sent. | 7,577 | 14,759 | 18,602 | 40,472 | 5,000 | 8,146 | 23,010 | 6,578 | 5,000 |
| train: tokens | 96,565 | 443,113 | 328,531 | 719,532 | 128,065 | 170,221 | 301,800 | 66,814 | 76,332 |
| train: lex. size | 25,136 | 27,470 | 48,509 | 77,219 | 15,971 | 40,775 | 85,671 | 21,793 | 14,097 |
| train: #nts | 112 | 222 | 208 | 762 | 375 | 112 | 352 | 198 | 148 |
| dev: sent. | 948 | 1,235 | 1,000 | 5,000 | 500 | 1,051 | 2,066 | 821 | 494 |
| dev: tokens | 13,893 | 38,820 | 17,542 | 76,704 | 11,305 | 30,010 | 25,729 | 8,391 | 9,339 |
| test: sent. | 946 | 2,541 | 1,000 | 5,000 | 716 | 1,009 | 2,287 | 822 | 666 |
| test: tokens | 11,477 | 75,216 | 17,585 | 92,004 | 17,002 | 19,913 | 28,783 | 8,336 | 10,675 |
Table 1: Statistics about the different datasets used in our experiments for the training (“train”), development (“dev”) and test (“test”) sets. “sent.” denotes the number of sentences in the dataset, “tokens” denotes the total number of words in the dataset, “lex. size” denotes the vocabulary size in the training set and “#nts” denotes the number of nonterminals in the training set after binarization.
The search algorithm can be used with any training algorithm for L-PCFGs, including the algorithms of cohen13b and narayan15. These methods, in their default setting, use a function that maps each nonterminal to a fixed number of latent states. In this case, the estimation algorithm takes as input training data in the form of a treebank, decomposes each tree into inside and outside trees at each node, and reduces the dimensionality of the inside and outside feature functions by running SVD on the cross-covariance matrix $\Omega^a$ of the inside and the outside trees for each nonterminal $a$. cohen13b estimate the parameters of the L-PCFG up to a linear transformation using the non-zero singular values of $\Omega^a$, whereas narayan15 use the feature representations induced by the SVD step to cluster the instances of nonterminal $a$ in the training data; these clusters are then treated as latent states that are “observed.” Finally, Narayan and Cohen follow up with a simple frequency-count maximum likelihood estimate of the parameters of the L-PCFG with these latent states.

An important point to make is that the learning algorithms of narayan15 and cohen13b are relatively fast in comparison to the EM algorithm. (Footnote 2: It has been documented in several papers that the family of spectral estimation algorithms is faster than algorithms such as EM, not just for L-PCFGs. See, for example, parikh2012spectral.) They require only one iteration over the data. In addition, the SVD of $\Omega^a$ for these learning algorithms is computed just once, for a large number of latent states; the SVD for a lower rank can then be easily computed from that SVD.
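A heavily simplified sketch of the clustering route, in the spirit of narayan15 but not their implementation, is given below: inside features of a nonterminal's instances are projected through the SVD of the cross-covariance matrix, clustered with k-means, the cluster indices are treated as observed latent states, and rule probabilities follow from relative-frequency counts. The array shapes, inputs and toy rules are illustrative stand-ins.

```python
# A simplified sketch of clustering-based estimation: SVD of the cross-covariance
# matrix gives low-dimensional projections; k-means cluster indices act as
# "observed" latent states; rule probabilities come from frequency counts.
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans

def latent_states_for_nonterminal(inside_feats, outside_feats, num_states):
    """Assign a latent state to every instance of one nonterminal."""
    omega = (inside_feats - inside_feats.mean(0)).T @ (outside_feats - outside_feats.mean(0))
    omega /= inside_feats.shape[0]
    u, _, _ = np.linalg.svd(omega, full_matrices=False)
    projected = inside_feats @ u[:, :num_states]          # dimensionality reduction
    km = KMeans(n_clusters=num_states, n_init=10, random_state=0).fit(projected)
    return km.labels_                                      # one latent state per instance

def mle_from_annotated_rules(annotated_rules):
    """Relative-frequency estimates for rules whose nodes carry latent states."""
    counts = Counter(annotated_rules)                      # rule -> count
    parent_totals = defaultdict(int)
    for (parent, *_), c in counts.items():
        parent_totals[parent] += c
    return {rule: c / parent_totals[rule[0]] for rule, c in counts.items()}

# Toy usage: 200 instances of one nonterminal with 30/20-dimensional features.
rng = np.random.default_rng(0)
states = latent_states_for_nonterminal(rng.normal(size=(200, 30)),
                                       rng.normal(size=(200, 20)), num_states=4)
rules = [(("VP", int(h)), ("V", 0), ("NP", 1)) for h in states]
print(list(mle_from_annotated_rules(rules).items())[:2])
```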
4 Experiments
In this section, we describe our setup for parsing experiments on a range of languages.
4.1 Experimental Setup
Datasets
We experiment with nine treebanks covering eight different morphologically rich languages: Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish. Table 1 shows the statistics of the nine treebanks with their splits into training, development and test sets. Eight of the nine datasets (Basque, French, GermanT, Hebrew, Hungarian, Korean, Polish and Swedish) are taken from the workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL; Seddah et al., 2013). The German corpus in the SPMRL workshop is taken from the TiGer corpus (GermanT; Brants et al., 2004). We also experiment with another German corpus, the NEGRA corpus (GermanN; Skut et al., 1997), in a standard evaluation split. (Footnote 3: We use the first 18,602 sentences as a training set, the next 1,000 sentences as a development set and the last 1,000 sentences as a test set. This corresponds to an 80%/10%/10% split of the treebank.) Words in the SPMRL datasets are annotated with their morphological signatures, whereas the NEGRA corpus does not contain any morphological information.
Data preprocessing and treatment of rare words
We convert all trees in the treebanks to a binary form, train and run the parser in that form, and then transform the trees back when evaluating with the PARSEVAL metric. In addition, we collapse unary rules into unary chains, so that our trees are fully binarized. The column “#nts” in Table 1 shows the number of nonterminals after binarization in the various treebanks. Before binarization, we also drop all functional information from the nonterminals. We use fine tags for all languages except Korean, in line with bjorkelundetal:2013:spmrl. (Footnote 4: In their experiments, bjorkelundetal:2013:spmrl found that fine tags were not useful for Basque either; they did not find a proper explanation for this. In our experiments, however, we found that fine tags were useful for Basque.) To retrieve the fine tags, we concatenate coarse tags with their refinement feature (“AZP”) values. For Korean, there are 2,825 binarized nonterminals, making it impractical to use our optimization algorithm, so we use the coarse tags.
bjorkelundetal:2013:spmrl have shown that morphological signatures for rare words are useful for improving the performance of the Berkeley parser. In our preliminary experiments with naïve spectral estimation, we preprocess rare words in the training set in two ways: (i) we replace them with their corresponding POS tags, and (ii) we replace them with their corresponding POS+morphological signatures. Following bjorkelundetal:2013:spmrl, we consider a word to be rare if it occurs fewer than 20 times in the training data. We experimented with versions of the parser that ignore and do not ignore letter case, and discovered that the parser behaves better when case is not ignored.
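One possible preprocessing pipeline in the spirit of the description above is sketched below: collapse unary chains, binarize, and replace rare words by their POS tag or by a POS+morphology signature. NLTK and the signature map are assumptions for illustration, not the authors' tooling; only the rare-word threshold of 20 comes from the text.

```python
# A sketch of treebank preprocessing: unary collapse, binarization and
# rare-word replacement. The signature dictionary is a hypothetical stand-in
# for POS+morphology signatures.
from collections import Counter
from nltk import Tree

def binarize(tree_str):
    t = Tree.fromstring(tree_str)
    t.collapse_unary(collapsePOS=True, joinChar="+")   # unary chains become single labels
    t.chomsky_normal_form(horzMarkov=0)                # right-factored binarization
    return t

def replace_rare_words(trees, signatures, threshold=20):
    """Replace words seen fewer than `threshold` times by their signature
    (POS+morphology when available, otherwise the POS tag)."""
    counts = Counter(w for t in trees for w in t.leaves())
    for t in trees:
        for pos in t.treepositions("leaves"):
            word = t[pos]
            if counts[word] < threshold:
                t[pos] = signatures.get(word, t[pos[:-1]].label())
    return trees

trees = [binarize("(S (NP (D the) (N mouse)) (VP (V ran)))")]
print(replace_rare_words(trees, signatures={"ran": "V+past"})[0])
```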
| | | Basque | French | GermanN | GermanT | Hebrew | Hungarian | Korean | Polish | Swedish |
|---|---|---|---|---|---|---|---|---|---|---|
| Bk | van | 69.2 | 79.9 | – | 81.7 | 87.8 | 83.9 | 71.0 | 84.1 | 74.5 |
| Bk | rep | 84.3 | 79.7 | – | 82.7 | 89.6 | 89.1 | 82.8 | 87.1 | 75.5 |
| Cl | van (pos) | 69.8 | 73.9 | 75.7 | 78.3 | 88.0 | 81.3 | 68.7 | 90.3 | 70.9 |
| Cl | van (rep) | 78.6 | 73.7 | – | 78.8 | 88.1 | 84.7 | 76.5 | 90.4 | 71.4 |
| Cl | opt | 81.2 | 76.7 | 77.8 | 81.7 | 90.1 | 87.2 | 79.2 | 92.0 | 75.2 |
| Sp | van | 78.1 | 78.0 | 77.6 | 82.0 | 89.2 | 87.7 | 80.6 | 91.7 | 73.4 |
| Sp | opt | 79.0 | 78.1 | 79.0 | 82.9 | 90.3 | 87.8 | 80.9 | 91.7 | 75.5 |
| Bk multiple | | 87.4 | 82.5 | – | 85.0 | 90.5 | 91.1 | 84.6 | 88.4 | 79.5 |
| Cl multiple | | 83.4 | 79.9 | 82.7 | 85.1 | 90.6 | 89.0 | 80.8 | 92.5 | 78.3 |
| Hall et al. ’14 | | 83.7 | 79.4 | – | 83.3 | 88.1 | 87.4 | 81.9 | 91.1 | 76.0 |
| Crabbé ’15 | | 84.0 | 80.9 | – | 84.1 | 90.7 | 88.3 | 83.1 | 92.8 | 77.9 |

Table 2: Parsing results (PARSEVAL F1) on the development sets. “Bk” denotes the Berkeley parser, “Cl” the clustering algorithm of narayan15 and “Sp” the spectral algorithm of cohen13b; “van” denotes vanilla estimation and “opt” estimation with optimized latent state numbers; “pos” and “rep” denote rare-word replacement with POS tags and with POS+morphological signatures, respectively; “multiple” denotes decoding with multiple models.
| | | Basque | French | GermanN | GermanT | Hebrew | Hungarian | Korean | Polish | Swedish |
|---|---|---|---|---|---|---|---|---|---|---|
| Bk | | 74.7 | 80.4 | 80.1 | 78.3 | 87.0 | 85.2 | 78.6 | 86.8 | 80.6 |
| Cl | van | 79.6 | 74.3 | 76.4 | 74.1 | 86.3 | 86.5 | 76.5 | 90.5 | 76.4 |
| Cl | opt | 81.4 | 75.6 | 78.0 | 76.0 | 87.2 | 88.4 | 78.4 | 91.2 | 79.4 |
| Sp | van | 79.9 | 78.7 | 78.4 | 78.0 | 87.8 | 89.1 | 80.3 | 91.8 | 78.4 |
| Sp | opt | 80.5 | 79.1 | 79.4 | 78.2 | 89.0 | 89.2 | 80.0 | 91.8 | 80.9 |
| Bk multiple | | 87.9 | 82.9 | 84.5 | 81.3 | 89.5 | 91.9 | 84.3 | 87.8 | 84.9 |
| Cl multiple | | 83.4 | 80.4 | 82.7 | 80.4 | 89.2 | 89.9 | 80.3 | 92.4 | 82.8 |
| Hall et al. ’14 | | 83.4 | 79.7 | – | 78.4 | 87.2 | 88.3 | 80.2 | 90.7 | 82.0 |
| F&M ’15 | | 85.9 | 78.8 | – | 78.7 | 89.0 | 88.2 | 79.3 | 91.2 | 82.8 |
| Crabbé ’15 | | 84.9 | 80.8 | – | 79.3 | 89.7 | 90.1 | 82.7 | 92.7 | 83.2 |

Table 3: Parsing results (PARSEVAL F1) on the test sets. Abbreviations are as in Table 2; “F&M ’15” denotes fernandezgonzalezmartins:2015.
Spectral algorithms: subroutine choices
The latent state optimization algorithm can work with either the clustering estimation algorithm of narayan15 or the spectral algorithm of cohen13b. In our setup, we first run the latent state optimization algorithm with the clustering algorithm. We then run the spectral algorithm once with the optimized latent state numbers found by the clustering algorithm. We do this because the clustering algorithm leads to sparse estimates, and is therefore significantly faster when iteratively parsing the development set.
Our optimization algorithm is sensitive to the initialization of the number of latent states assigned to each nonterminal, as it sequentially goes through the list of nonterminals and chooses latent state numbers for each nonterminal while keeping the latent state numbers of the other nonterminals fixed. In our setup, we start our search algorithm with the best model from the clustering algorithm, controlling for all hyperparameters: we tune the default function that maps each nonterminal to a fixed number of latent states by running the vanilla version with different values of this fixed number for different languages. Based on our preliminary experiments, we set it to 4 for Basque, Hebrew, Polish and Swedish; 8 for GermanN; 16 for GermanT, Hungarian and Korean; and 24 for French.

We use the same features for the spectral methods as narayan15 for GermanN. For the SPMRL datasets we do not use the head features. These require linguistic understanding of the datasets (because they require head rules for propagating leaf nodes in the tree), and we discovered that simple heuristics for constructing these rules did not yield an increase in performance.
We use the kmeans function in Matlab to do the clustering for the spectral algorithm of narayan15. We experimented with several versions of k-means, and discovered that the version that works best in a set of preliminary experiments is hard k-means. (Footnote 5: To be more precise, we use the Matlab function kmeans while passing it the parameter ‘start’=‘sample’ to randomly sample the initial centroid positions. In our experiments, we found that the default initialization of centroids differs between Matlab14 (random) and Matlab15 (k-means++). Our estimation performs better with random initialization.)
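For readers who do not use Matlab, a rough scikit-learn analogue of the random-centroid initialization described in the footnote might look as follows; this is an assumed equivalent for illustration, not the authors' code, and the data and cluster count are arbitrary.

```python
# A rough scikit-learn analogue of Matlab's kmeans with 'start'='sample':
# initial centroids are sampled at random rather than via k-means++.
import numpy as np
from sklearn.cluster import KMeans

features = np.random.default_rng(0).normal(size=(500, 16))
labels = KMeans(n_clusters=8, init="random", n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```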
Decoding and evaluation
For efficiency, we use a base PCFG without latent states to prune chart items whose marginals fall below a fixed threshold in the dynamic programming chart. This is just a bare-bones PCFG that is estimated using maximum likelihood estimation (with frequency counts). The parser takes part-of-speech tagged sentences as input. We tag the GermanN data using the Turbo Tagger [Martins et al.2010]. For the languages in the SPMRL data we use the MarMot tagger of muellerschmidschutze:2013:EMNLP to jointly predict the POS and morphological tags. (Footnote 6: See bjorkelundetal:2013:spmrl for the performance of the MarMot tagger on the SPMRL datasets.) The parser itself can assign different part-of-speech tags to words to avoid parse failures. This is particularly important for constituency parsing of morphologically rich languages: it helps mitigate tagger errors that arise when long-distance dependencies are present.
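The pruning step can be pictured with the small sketch below. The marginal chart is a toy stand-in (in practice it would come from the inside-outside algorithm on the base PCFG), and the threshold value is illustrative, since the paper's exact value is not reproduced here.

```python
# A minimal sketch of chart pruning with a base PCFG: spans/nonterminals whose
# posterior marginal under the latent-state-free grammar falls below a threshold
# are removed before the more expensive L-PCFG parser fills the chart.
def prune_chart(marginals, threshold=1e-5):
    """marginals maps (i, j, nonterminal) -> posterior probability of that
    constituent under the base PCFG; keep only sufficiently probable items."""
    return {item for item, p in marginals.items() if p >= threshold}

allowed = prune_chart({(0, 2, "NP"): 0.9, (0, 2, "VP"): 1e-7, (0, 5, "S"): 1.0})
print(sorted(allowed))
```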
For all results, we report the F1 measure of the PARSEVAL metric [Black et al.1991]. We use the EVALB program (Footnote 7: We use EVALB from http://nlp.cs.nyu.edu/evalb/ for the GermanN data and its modified version from http://dokufarm.phil.hhu.de/spmrl2014/doku.php?id=shared_task_description for the SPMRL datasets.) with the parameter file COLLINS.prm [Collins1999] for the GermanN data and the SPMRL parameter file, spmrl.prm, for the SPMRL data [Seddah et al.2013].
In this setup, the latent state optimization algorithm terminates in a few hours for all datasets except French and GermanT. The GermanT data has 762 nonterminals to tune over a large development set of 5,000 sentences, whereas the French data has a high average sentence length of 31.43 in the development set. (Footnote 8: To speed up tuning on the French data, we drop sentences of length greater than 46 from the development set, reducing its size from 1,235 to 1,006 sentences.)
Following narayan15, we further improve our results by using multiple spectral models, where noise is added to the underlying features in the training set before the estimation of each model. (Footnote 9: We only use the algorithm of narayan15 for the noisy model estimation. They have shown that decoding with noisy models performs better with their sparse estimates than with the dense estimates of cohen13b.) Using the optimized latent state numbers, we estimate models for each of the three noise induction mechanisms in Narayan and Cohen: Dropout, Gaussian (additive) and Gaussian (multiplicative). To decode with multiple noisy models, we train the MaxEnt reranker of charniak05. (Footnote 10: Implementation: https://github.com/BLLIP/bllipparser. More specifically, we used the programs extractspfeatures, cvlmlbfgs and bestindices. extractspfeatures uses head features; we bypass this for the SPMRL datasets by creating a dummy heads.cc file. cvlmlbfgs was used with the default hyperparameters from the Makefile.) Hierarchical decoding with “maximal tree coverage” over MaxEnt models further improves our accuracy. See narayan15 for more details on the estimation of a diverse set of models, and on decoding with them.
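The three noise induction mechanisms named above can be sketched as simple perturbations of a feature matrix; the dropout rate and noise scale below are illustrative values, not those used in the paper, and the feature matrix is a random stand-in.

```python
# A minimal sketch of the three noise-induction mechanisms applied to a feature
# matrix before estimating each model in the ensemble: dropout, additive
# Gaussian noise and multiplicative Gaussian noise.
import numpy as np

def noisy_copies(features, rng, dropout_rate=0.1, sigma=0.1):
    dropout = features * (rng.random(features.shape) > dropout_rate)
    additive = features + rng.normal(0.0, sigma, features.shape)
    multiplicative = features * rng.normal(1.0, sigma, features.shape)
    return {"dropout": dropout, "additive": additive, "multiplicative": multiplicative}

rng = np.random.default_rng(0)
models = {name: feats.mean() for name, feats in noisy_copies(rng.normal(size=(100, 20)), rng).items()}
print(models)
```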
| language | preterminals: div. | #nts | in-terminals: div. | #nts | all: div. | #nts |
|---|---|---|---|---|---|---|
| Basque | 311  419  196 | 169 | 91  227  152 | 31 | 402  646  348 | 200 |
| French | 839  715  476 | 108 | 1145  1279  906 | 114 | 1984  1994  1382 | 222 |
| GermanN | 425  567  416 | 109 | 323  578  361 | 99 | 748  1145  777 | 208 |
| GermanT | 1251  890  795 | 378 | 1037  1323  738 | 384 | 2288  2213  1533 | 762 |
| Hebrew | 434  442  182 | 279 | 169  544  393 | 96 | 603  986  575 | 375 |
| Hungarian | 457  415  282 | 87 | 186  261  129 | 25 | 643  676  411 | 112 |
| Korean | 1077  980  547 | 331 | 218  220  150 | 21 | 1295  1200  697 | 352 |
| Polish | 252  311  197 | 135 | 132  180  86 | 63 | 384  491  283 | 198 |
| Swedish | 191  284  127 | 106 | 85  345  266 | 42 | 276  629  393 | 148 |

Table 4: Model size statistics (“div.”) for preterminals, in-terminals and all nonterminals in the various treebanks, together with the number of nonterminals in each group (“#nts”); see the discussion in §4.3.
4.2 Results
Table 2 and Table 3 give the results for the various languages. (Footnote 11: See more results at http://cohort.inf.ed.ac.uk/lpcfg/.) Our main focus is on comparing the coarse-to-fine Berkeley parser [Petrov et al.2006] to our method. However, for the sake of completeness, we also present results for other parsers, such as those of hall2014less, fernandezgonzalezmartins:2015 and crabbe:2015:EMNLP.
In line with bjorkelundetal:2013:spmrl, our preliminary experiments with the treatment of rare words suggest that morphological features are useful for all SPMRL languages except French. For Basque, Hungarian and Korean in particular, the improvements are substantial.
Our results show that optimizing the number of latent states with the clustering and spectral algorithms indeed improves these algorithms' performance, and that the improvements generalize to the test sets as well. This was a point of concern, since the optimization algorithm goes through many points in the hypothesis space of parsing models and identifies one that behaves optimally on the development set – as such, it could overfit to the development set. However, this did not happen, and in some cases the increase in accuracy on the test set after running our optimization algorithm is actually larger than the increase on the development set.
While the vanilla estimation algorithms (without latent state optimization) lag behind the Berkeley parser for many of the languages, once the number of latent states is optimized, our parsing models do better for Basque, Hebrew, Hungarian, Korean, Polish and Swedish. For GermanT we perform close to the Berkeley parser (78.2 vs. 78.3). It is also interesting to compare the clustering algorithm of narayan15 to the spectral algorithm of cohen13b. In the vanilla version, the spectral algorithm does better in most cases. However, these differences are narrowed, and in some cases overcome, when the number of latent states is optimized. Decoding with multiple models further improves our accuracy. Our “Cl multiple” results lag behind “Bk multiple.” We believe this is a result of the MaxEnt models' need for head features, which we do not provide. (Footnote 12: bjorkelundetal:2013:spmrl also use the MaxEnt reranker with multiple models of the Berkeley parser, and in their case too the performance after the reranking step is not always significantly better. See footnote 10 on how we create dummy head features for our MaxEnt models.)
Our results show that spectral learning is a viable alternative to the use of expectation-maximization coarse-to-fine techniques. As we discuss later, further improvements have been introduced to state-of-the-art parsers that are orthogonal to the use of a specific estimation algorithm. Some of them can be applied to our setup.
4.3 Further Analysis
In addition to the basic set of parsing results, we also wanted to inspect the size of the parsing models when using the optimization algorithm in comparison to the vanilla models. Table 4 gives this analysis. In this table, we see that in most cases, on average, the optimization algorithm chooses to enlarge the number of latent states. However, for GermanT and Korean, for example, the optimization algorithm actually chooses a smaller model than the original vanilla model.
| preterminal | freq. | b. | a. | preterminal | freq. | b. | a. | preterminal | freq. | b. | a. | preterminal | freq. | b. | a. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PWAT | 64 | 2 | 2 | TRUNC | 614 | 8 | 1 | PIS | 1,628 | 8 | 8 | KON | 8,633 | 8 | 30 |
| XY | 135 | 3 | 1 | VAPP | 363 | 6 | 4 | $*LRB* | 13,681 | 8 | 6 | PPER | 4,979 | 8 | 100 |
| NPNN | 88 | 2 | 1 | PDS | 988 | 8 | 8 | ADJD | 6,419 | 8 | 60 | $. | 17,699 | 8 | 3 |
| VMINF | 177 | 3 | 5 | AVPADV | 211 | 4 | 11 | KOUS | 2,456 | 8 | 1 | APPRART | 6,217 | 8 | 15 |
| PTKA | 162 | 3 | 1 | FM | 578 | 8 | 3 | PIAT | 1,061 | 8 | 8 | ADJA | 18,993 | 8 | 10 |
| VPVVINF | 409 | 6 | 2 | VVIMP | 76 | 2 | 1 | NPPPER | 382 | 6 | 1 | APPR | 26,717 | 8 | 7 |
| PRELAT | 94 | 2 | 1 | KOUI | 339 | 5 | 2 | VVPP | 5,005 | 8 | 20 | VVFIN | 13,444 | 8 | 3 |
| APADJD | 178 | 3 | 1 | VAINF | 1,024 | 8 | 1 | PPPROAV | 174 | 3 | 1 | $, | 16,631 | 8 | 1 |
| APPO | 89 | 2 | 2 | PRELS | 2,120 | 8 | 40 | VAFIN | 8,814 | 8 | 1 | VVINF | 4,382 | 8 | 10 |
| PWS | 361 | 6 | 1 | CARD | 6,826 | 8 | 8 | PTKNEG | 1,884 | 8 | 8 | ART | 35,003 | 8 | 10 |
| KOKOM | 800 | 8 | 37 | NE | 17,489 | 8 | 6 | PTKZU | 1,586 | 8 | 1 | ADV | 15,566 | 8 | 8 |
| VPVVPP | 844 | 8 | 5 | PRF | 2,158 | 8 | 1 | VVIZU | 479 | 7 | 1 | PIDAT | 1,254 | 8 | 20 |
| PWAV | 689 | 8 | 1 | PDAT | 1,129 | 8 | 1 | PPOSAT | 2,295 | 8 | 6 | NN | 68,056 | 8 | 12 |
| APZR | 134 | 3 | 2 | PROAV | 1,479 | 8 | 10 | PTKVZ | 1,864 | 8 | 3 | VMFIN | 3,177 | 8 | 1 |

Table 5: Number of latent states for the GermanN preterminals before (“b.”) and after (“a.”) optimization, together with their frequencies (“freq.”) in the training set.
We further inspected the behavior of the optimization algorithm for the preterminals in GermanN, for which the optimal model chose (on average) a larger number of latent states. Table 5 describes this analysis. We see that in most cases the optimization algorithm chose to decrease the number of latent states for the various preterminals, but in some cases it significantly increases the number of latent states. (Footnote 13: Interestingly, most of the punctuation symbols, such as $*LRB*, $. and $,, drop their latent state number to a significantly lower value, indicating that their interactions with other nonterminals in the tree are minimal.)
Our experiments dispel another “common wisdom” about spectral learning and training data size. It has been believed that spectral learning does not behave very well when small amounts of data are available (compared to maximum likelihood estimation algorithms such as EM); however, our results are better than those of the Berkeley parser for several languages with small training datasets, such as Basque, Hebrew, Polish and Hungarian. The source of this common wisdom is that ML estimators tend to be statistically “efficient”: they extract more information from the data than spectral learning algorithms do. Indeed, there is no reason to believe that spectral algorithms are statistically efficient. However, it is not clear that the ML estimator is statistically efficient for L-PCFGs estimated with the EM algorithm either: MLE is statistically efficient under specific assumptions which are not clearly satisfied in L-PCFG estimation. In addition, when the model is “incorrect” (i.e. when the data are not sampled from an L-PCFG, as we would expect with natural language treebank data), spectral algorithms can yield better results because they can mimic a higher-order model. This can be understood through HMMs. When estimating an HMM of a low order from data generated by a higher-order model, EM does quite poorly. However, if the number of latent states (and the feature functions) is properly controlled with spectral algorithms, a spectral algorithm will learn a “product” HMM, where the states of the lower-order model are products of states of the higher-order model. (Footnote 14: For example, a trigram HMM can be reduced to a bigram HMM where the states are products of the original trigram HMM's states.)
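To spell out the footnote's construction, the following fragment (an illustration, not taken from the paper) writes a trigram HMM as a bigram HMM over product states:

```latex
% A second-order (trigram) HMM over states $s_t$ can be simulated by a
% first-order (bigram) HMM whose states are pairs $\tilde{s}_t = (s_{t-1}, s_t)$:
\begin{align*}
  \tilde{p}\bigl((s_{t-1}, s_t) \mid (s_{t-2}, s_{t-1})\bigr) &= p(s_t \mid s_{t-1}, s_{t-2}),\\
  \tilde{p}\bigl(x_t \mid (s_{t-1}, s_t)\bigr) &= p(x_t \mid s_t).
\end{align*}
% With enough latent states, a learner fitting a first-order model can thus
% mimic the higher-order dependencies in the data.
```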
State-of-the-art parsers for the SPMRL datasets improve on the Berkeley parser in ways which are orthogonal to the use of the basic estimation algorithm and the method for optimizing the number of latent states. These include transformations of the treebanks, such as with unary rules [Björkelund et al.2013], more careful handling of unknown words and better use of morphological information, such as decorating preterminals with it [Björkelund et al.2014, Szántó and Farkas2014], careful feature specifications [Hall et al.2014] and head annotations [Crabbé2015], among other techniques. Some of these techniques can be applied to our case.
5 Conclusion
We demonstrated that a careful selection of the number of latent states in a latent-variable PCFG with spectral estimation has a significant effect on the parsing accuracy of the L-PCFG. We described a search procedure for this kind of optimization, and reported parsing results for eight languages (with nine datasets). Our results demonstrate that when comparing expectation-maximization with coarse-to-fine techniques to our spectral algorithm with latent state optimization, spectral learning performs better on six of the datasets. Our results are comparable to other state-of-the-art results for these languages. Using a diverse set of models to parse these datasets further improves the results.
Acknowledgments
The authors would like to thank David McClosky for his help with running the BLLIP parser and his comments on the paper and also the three anonymous reviewers for their helpful comments. We also thank Eugene Charniak, DK Choe and Geoff Gordon for useful discussions. Finally, thanks to Djamé Seddah for providing us with the SPMRL datasets and to Thomas Müller and Anders Björkelund for providing us the MarMot models. This research was supported by an EPSRC grant (EP/L02411X/1) and an EU H2020 grant (688139/H2020ICT2015; SUMMA).
References
 [Bailly et al.2010] Raphaël Bailly, Amaury Habrard, and François Denis. 2010. A spectral approach for probabilistic grammatical inference on trees. In Proceedings of International Conference on Algorithmic Learning Theory.
 [Björkelund et al.2013] Anders Björkelund, Özlem Çetinoğlu, Richárd Farkas, Thomas Müeller, and Wolfgang Seeker. 2013. (Re)ranking meets morphosyntax: State-of-the-art results from the SPMRL 2013 shared task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages.
 [Björkelund et al.2014] Anders Björkelund, Özlem Çetinoğlu, Agnieszka Faleńska, Richárd Farkas, Thomas Müller, Wolfgang Seeker, and Zsolt Szántó. 2014. Introducing the IMS-Wrocław-Szeged-CIS entry at the SPMRL 2014 shared task: Reranking and morphosyntax meet unlabeled data. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages.
 [Black et al.1991] Ezra W. Black, Steven Abney, Daniel P. Flickinger, Claudia Gdaniec, Ralph Grishman, Philip Harrison, Donald Hindle, Robert J. P. Ingria, Frederick Jelinek, Judith L. Klavans, Mark Y. Liberman, Mitchell P. Marcus, Salim Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of DARPA Workshop on Speech and Natural Language.
 [Brants et al.2004] Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.
 [Charniak and Johnson2005] Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of ACL.
 [Cohen and Collins2014] Shay B. Cohen and Michael Collins. 2014. A provably correct learning algorithm for latent-variable PCFGs. In Proceedings of ACL.
 [Cohen et al.2012] Shay B. Cohen, Karl Stratos, Michael Collins, Dean F. Foster, and Lyle Ungar. 2012. Spectral learning of latent-variable PCFGs. In Proceedings of ACL.
 [Cohen et al.2013] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar. 2013. Experiments with spectral learning of latent-variable PCFGs. In Proceedings of NAACL.
 [Collins1999] Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
 [Crabbé2015] Benoit Crabbé. 2015. Multilingual discriminative lexicalized phrase structure parsing. In Proceedings of EMNLP.
 [Fernández-González and Martins2015] Daniel Fernández-González and André F. T. Martins. 2015. Parsing as reduction. In Proceedings of ACL-IJCNLP.
 [Hall et al.2014] David Hall, Greg Durrett, and Dan Klein. 2014. Less grammar, more features. In Proceedings of ACL.

[Hsu et al.2009]
Daniel Hsu, Sham M. Kakade, and Tong Zhang.
2009.
A spectral algorithm for learning hidden Markov models.
In Proceedings of COLT.  [Martins et al.2010] André F. T. Martins, Noah A. Smith, Eric P. Xing, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar. 2010. TurboParsers: Dependency parsing by approximate variational inference. In Proceedings of EMNLP.
 [Matsuzaki et al.2005] Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of ACL.
 [Müeller et al.2013] Thomas Müeller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higherorder CRFs for morphological tagging. In Proceedings of EMNLP.
 [Narayan and Cohen2015] Shashi Narayan and Shay B. Cohen. 2015. Diversity in spectral learning for natural language parsing. In Proceedings of EMNLP.

[Parikh et al.2012]
Ankur P. Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, and Eric P. Xing.
2012.
A spectral algorithm for latent junction trees.
In
Proceedings of the TwentyEighth Conference on Uncertainty in Artificial Intelligence
.  [Petrov et al.2006] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLINGACL.
 [Petrov2010] Slav Petrov. 2010. Products of random latent variable grammars. In Proceedings of HLT-NAACL.
 [Prescher2005] Detlef Prescher. 2005. Head-driven PCFGs with latent-head statistics. In Proceedings of IWPT.
 [Rabusseau et al.2016] Guillaume Rabusseau, Borja Balle, and Shay B. Cohen. 2016. Lowrank approximation of weighted tree automata. In Proceedings of The 19th International Conference on Artificial Intelligence and Statistics.
 [Seddah et al.2013] Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho D. Choi, Richárd Farkas, Jennifer Foster, Iakes Goenaga, Koldo Gojenola Galletebeitia, Yoav Goldberg, Spence Green, Nizar Habash, Marco Kuhlmann, Wolfgang Maier, Joakim Nivre, Adam Przepiórkowski, Ryan Roth, Wolfgang Seeker, Yannick Versley, Veronika Vincze, Marcin Woliński, Alina Wróblewska, and Eric Villemonte de la Clérgerie. 2013. Overview of the SPMRL 2013 shared task: A crossframework evaluation of parsing morphologically rich languages. In Proceedings of the Fourth Workshop on Statistical Parsing of MorphologicallyRich Languages.
 [Skut et al.1997] Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of ANLP.
 [Szántó and Farkas2014] Zsolt Szántó and Richárd Farkas. 2014. Special techniques for constituent parsing of morphologically rich languages. In Proceedings of EACL.