Introduction
Grammar induction is the task of deriving plausible syntactic structures from raw text, without the use of annotated training data. In the case of dependency parsing, the syntactic structure takes the form of a tree whose nodes are the words of the sentence, and whose arcs are directed and denote head-dependent relationships between words. Inducing such a tree without annotated training data is challenging because of data sparseness and ambiguity, and because the search space of potential trees is huge, making optimization difficult.
Most existing approaches to dependency grammar induction have used inference over graph structures and are based either on the dependency model with valence (DMV) of Klein and Manning (2004) or the maximum spanning tree algorithm (MST) for dependency parsing by McDonald, Petrov, and Hall (2011). State-of-the-art representatives include LCDMV (Noji, Miyao, and Johnson, 2016) and ConvexMST (Grave and Elhadad, 2015). Recently, researchers have also introduced neural networks for feature extraction in graph-based models (Jiang, Han, and Tu, 2016; Cai, Jiang, and Tu, 2017).
Though graph-based models achieve impressive results, their inference procedure requires $O(n^3)$ time complexity. Meanwhile, features in graph-based models must be decomposable over substructures to enable dynamic programming. In comparison, transition-based models allow faster inference with linear time complexity and richer feature sets. Although relying on local inference, transition-based models have been shown to perform well in supervised parsing (Kiperwasser and Goldberg, 2016; Dyer et al., 2015). However, unsupervised transition parsers are not well studied. One exception is the work of Rasooli and Faili (2012), in which search-based structure prediction (Daumé III, 2009) is used with a simple feature set. However, there is still a large performance gap compared to graph-based models.
Recently, Dyer et al. (2016) proposed recurrent neural network grammars (RNNGs), a probabilistic transition-based model for constituency trees. RNNG can be used either in a generative way as a language model or in a discriminative way as a parser. Cheng, Lopez, and Lapata (2017) use an autoencoder to integrate discriminative and generative RNNGs, yielding a reconstruction process with parse trees as latent variables and enabling the two components to be trained jointly on a language modeling objective. However, their work uses observed trees for training and does not study unsupervised learning.
In this paper, we make a more radical departure from the existing literature in dependency grammar induction, by proposing an unsupervised neural variational transition-based parser. Specifically, we first modify the transition actions in the original RNNG into a set of arc-standard actions for projective dependency parsing, and then build a dependency variant of the model of Cheng, Lopez, and Lapata (2017). Although this approach performs well for supervised parsing, when applied in an unsupervised setting, the performance decreases dramatically (see Experiments for details). We hypothesize that this is because the parser is fairly unconstrained without prior linguistic knowledge (Naseem et al., 2010; Noji, Miyao, and Johnson, 2016). Therefore, we augment the model with posterior regularization, allowing us to seamlessly integrate linguistic knowledge in the shape of a small number of universal linguistic rules. In addition, we propose a novel variance reduction method for stabilizing neural variational inference with discrete latent variables. This yields the first known model that makes it possible to use posterior regularization for neural variational inference with discrete latent variables.
When evaluating on the English Penn Treebank and on eight languages from the Universal Dependency (UD) Treebank, we find that our model with posterior regularization outperforms the best unsupervised transition-based dependency parser (Rasooli and Faili, 2012), and approaches the performance of graph-based models. We also show how a weak form of supervision can be integrated elegantly into our framework in the form of rule expectations. Finally, we present empirical evidence for the complexity advantage of transition-based models: our model attains a large speedup compared to a state-of-the-art graph-based model. Code and Supplementary Material are available (https://github.com/libowen2121/VIdependencysyntax).
Background
RNNG is a top-down transition system originally proposed for constituency parsing and generation. There are two variants: the discriminative RNNG and the generative RNNG. The discriminative RNNG takes a sentence as input, and predicts the probability of generating a corresponding parse tree from the sentence. The model uses a buffer to store unprocessed terminal words and a stack to store partially completed syntactic constituents. It then follows top-down transition actions to shift words from the buffer to the stack to construct syntactic constituents incrementally.
The discriminative RNNG can be modified slightly to formulate the generative RNNG, an algorithm for incrementally producing trees and sentences in a generative fashion. In generative RNNG, there is no buffer of unprocessed words, but there is an output buffer for storing words that have been generated. Top-down actions are then specified to generate words and tree nonterminals in pre-order. Though not able to parse on its own, a generative RNNG can be used for language modeling as long as parse trees are sampled from a known distribution.
We modify the transition actions in the original RNNG into a set of arc-standard actions for projective dependency parsing. In the discriminative modeling case, the action space includes:
shift: fetches the first word in the buffer and pushes it onto the top of the stack.
left-reduce: adds a left arc between the top two words of the stack and merges them into a single construct.
right-reduce: adds a right arc between the top two words of the stack and merges them into a single construct.
In the generative modeling case, the shift operation is replaced by a gen operation:
gen: generates a word and adds it to the stack and the output buffer.
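The action set above can be illustrated with a small sketch of the discriminative case. The conventions here are assumptions made for illustration, not taken from the paper: a left arc makes the topmost stack word the head of the word below it, a right arc does the reverse, and "merging into a single construct" is modeled by leaving only the head on the stack. The function name `parse` and the action strings are hypothetical.

```python
def parse(words, actions):
    """Apply arc-standard actions; return (arcs, root) with arcs as (head, dependent) indices.

    Assumed conventions (illustrative only):
      left-reduce  -> top of stack becomes head of the second item
      right-reduce -> second item becomes head of the top of stack
    """
    buffer = list(range(len(words)))  # indices of unprocessed words
    stack = []                        # indices of partially built subtrees (their heads)
    arcs = []
    for action in actions:
        if action == "shift":
            if not buffer:
                raise ValueError("shift with empty buffer")
            stack.append(buffer.pop(0))
        elif action in ("left-reduce", "right-reduce"):
            if len(stack) < 2:
                raise ValueError(action + " needs two items on the stack")
            top = stack.pop()
            second = stack.pop()
            head, dep = (top, second) if action == "left-reduce" else (second, top)
            arcs.append((head, dep))
            stack.append(head)        # the merged construct is represented by its head
        else:
            raise ValueError("unknown action: " + action)
    if buffer or len(stack) != 1:
        raise ValueError("action sequence does not yield a single tree")
    return arcs, stack[0]


if __name__ == "__main__":
    words = ["the", "dog", "barks"]
    actions = ["shift", "shift", "left-reduce", "shift", "left-reduce"]
    print(parse(words, actions))  # ([(1, 0), (2, 1)], 2)
```

A sentence of n words is always parsed with n shifts and n - 1 reductions, so the number of actions grows linearly with sentence length.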
Methodology
To build our dependency grammar induction model, we follow Cheng, Lopez, and Lapata (2017) and propose a dependency-based, encoder-decoder RNNG. This model includes (1) a discriminative RNNG as the encoder for mapping the input sentence into a latent variable, which for the grammar induction task is a sequence of parse actions for building the dependency tree; (2) a generative RNNG as the decoder for reconstructing the input sentence based on the latent parse actions. The training objective is the likelihood of the observed input sentence, which is reformulated as an evidence lower bound (ELBO), and solved with neural variational inference. The REINFORCE algorithm (Williams, 1992) is utilized to handle discrete latent variables in optimization. Overall, the encoder and decoder are jointly trained, inducing latent parse trees or actions from only unlabelled text data. To further regularize the space of parse trees with a linguistic prior, we introduce posterior regularization into the basic framework. Finally, we propose a novel variance reduction technique to train our posterior regularized framework more effectively.
Encoder
We formulate the encoder as a discriminative dependency RNNG that computes the conditional probability $P(a \mid x)$ of the transition action sequence $a$ given the observed sentence $x$. The conditional probability is factorized over time steps, and parameterized by a transitional state embedding $v$:
$$P(a \mid x) = \prod_{t=1}^{|a|} P(a_t \mid v_t) \qquad (1)$$
where $v_t$ is the transitional state embedding of the encoder at time step $t$. The encoder is the actual component for parsing at run time.
Decoder
The decoder is a generative dependency RNNG that models the joint probability $P(x, a)$ of a latent transition action sequence $a$ and an observed sentence $x$. This joint distribution can be factorized into a sequence of action and word (emitted by gen) probabilities, which are parameterized by a transitional state embedding $u$:
$$P(x, a) = \prod_{t=1}^{|a|} P(a_t \mid u_t)\, P(x_t \mid u_t)^{I(a_t = \text{gen})} \qquad (2)$$
where $I$ is an indicator function and $u_t$ is the state embedding at time step $t$. The features and the modeling details of both the encoder and the decoder can be found in the Supplementary Material.
Training Objective
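Consider a latent variable model in which the encoder infers the latent transition actions (i.e., the dependency structure) and the decoder reconstructs the sentence from these actions. The maximum likelihood estimate of the model parameters is determined by the log marginal likelihood of the sentence:
$$\log P(x) = \log \sum_{a} P(x, a) \qquad (3)$$
Since the form of the log likelihood is intractable in our case, we optimize the ELBO (by Jensen's Inequality) as follows:
$$\log P(x) \ge \log P(x) - \mathrm{KL}[Q(a) \,\|\, P(a \mid x)] = \mathbb{E}_{Q(a)}\left[\log \frac{P(x, a)}{Q(a)}\right] = \mathcal{L}_x \qquad (4)$$
where $\mathrm{KL}[Q(a) \,\|\, P(a \mid x)]$ is the Kullback-Leibler divergence and $Q(a)$ is the variational approximation of the true posterior. This training objective is optimized with the EM algorithm. In the E-step, we approximate the variational distribution $Q(a)$ based on the encoder and the observation $x$: $Q(a)$ is parameterized as $Q_\omega(a \mid x)$. Similarly, the joint probability $P(x, a)$ is parameterized by the decoder as $P_\theta(x, a)$.
In the M-step, the decoder parameters $\theta$ can be directly updated by gradient descent via Monte Carlo simulation:
$$\frac{\partial \mathcal{L}_x}{\partial \theta} = \mathbb{E}_{Q_\omega(a \mid x)}\left[\frac{\partial \log P_\theta(x, a)}{\partial \theta}\right] \approx \frac{1}{M} \sum_{m} \frac{\partial \log P_\theta(x, a^{(m)})}{\partial \theta} \qquad (5)$$
where $M$ samples $a^{(m)} \sim Q_\omega(a \mid x)$ are drawn independently to compute the stochastic gradient.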
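For readers who want the intermediate step behind the ELBO in (4), the bound follows from a single application of Jensen's inequality. The derivation below is the standard textbook argument, written with $Q(a)$ for the variational distribution over action sequences and $P(x, a)$ for the joint probability; it assumes $Q(a) > 0$ wherever $P(x, a) > 0$.

```latex
\begin{aligned}
\log P(x) &= \log \sum_{a} Q(a)\,\frac{P(x, a)}{Q(a)} \\
          &\ge \sum_{a} Q(a)\,\log \frac{P(x, a)}{Q(a)} = \mathcal{L}_x
          && \text{(Jensen, since } \log \text{ is concave)} \\
\log P(x) - \mathcal{L}_x
          &= \sum_{a} Q(a)\,\log \frac{Q(a)}{P(a \mid x)}
           = \mathrm{KL}\left[Q(a) \,\|\, P(a \mid x)\right] \ge 0 .
\end{aligned}
```

The last line shows that the gap between the log likelihood and the bound is exactly the KL divergence, so the bound is tight when $Q(a)$ equals the true posterior.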
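For the encoder parameters $\omega$, since the sampling operation is not differentiable, we approximate the gradients using the REINFORCE algorithm (Williams, 1992):
$$\frac{\partial \mathcal{L}_x}{\partial \omega} \approx \frac{1}{M} \sum_{m} l(x, a^{(m)})\, \frac{\partial \log Q_\omega(a^{(m)} \mid x)}{\partial \omega} \qquad (6)$$
where $l(x, a^{(m)})$ is known as the score function and computed as:
$$l(x, a^{(m)}) = \log \frac{P_\theta(x, a^{(m)})}{Q_\omega(a^{(m)} \mid x)} \qquad (7)$$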
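The REINFORCE estimator can be checked on a toy problem. The sketch below is not the parser: it assumes a latent variable with only three values, a softmax "encoder" with one logit per value, and a fixed table of joint log-probabilities standing in for the decoder. All names (`softmax`, `elbo`, `exact_grad`, `reinforce_grad`) are hypothetical. It compares the sampled gradient, weighted by the score log P(x, a) - log Q(a | x), against the exact ELBO gradient.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def elbo(logits, log_joint):
    """ELBO = sum_a Q(a) * (log P(x, a) - log Q(a)) for a tiny discrete latent space."""
    q = softmax(logits)
    return sum(qa * (lp - math.log(qa)) for qa, lp in zip(q, log_joint))


def score(q, log_joint, a):
    """Score of one sample: log P(x, a) - log Q(a | x)."""
    return log_joint[a] - math.log(q[a])


def exact_grad(logits, log_joint):
    """Exact gradient of the ELBO with respect to the encoder logits."""
    q = softmax(logits)
    scores = [score(q, log_joint, a) for a in range(len(q))]
    mean = sum(qa * s for qa, s in zip(q, scores))
    return [qa * (s - mean) for qa, s in zip(q, scores)]


def reinforce_grad(logits, log_joint, num_samples, rng):
    """Monte Carlo estimate: average of score * d log Q(a) / d logits over samples."""
    q = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(q)), weights=q)[0]
        s = score(q, log_joint, a)
        for k in range(len(logits)):
            # For a softmax, d log Q(a) / d logit_k = 1[k == a] - Q(k).
            grad[k] += s * ((1.0 if k == a else 0.0) - q[k]) / num_samples
    return grad


if __name__ == "__main__":
    logits = [0.2, -0.1, 0.4]
    log_joint = [-1.0, -2.0, -3.0]
    print("exact    :", exact_grad(logits, log_joint))
    print("estimated:", reinforce_grad(logits, log_joint, 20000, random.Random(0)))
```

With few samples the estimate fluctuates noticeably around the exact gradient, which is the variance problem that motivates the score-function modifications discussed later.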
Posterior Regularization
As will become clear in the Experiments section, the basic model discussed previously performs poorly when used for unsupervised parsing, barely outperforming a left-branching baseline for English. We hypothesize the reason is that the basic model is fairly unconstrained: without any constraints to regularize the latent space, the induced parses will be arbitrary, since the model is only trained to maximize sentence likelihood (Naseem et al., 2010; Noji, Miyao, and Johnson, 2016).
We therefore introduce posterior regularization (PR; Ganchev et al. 2010) to encourage the neural network to generate well-formed trees. Via posterior regularization, we can give the model access to a small amount of linguistic information in the form of universal syntactic rules (Naseem et al., 2010), which are the same for all languages. These rules effectively function as features, which impose soft constraints on the neural parameters in the form of expectations.
To integrate the PR constraints into the model, a set $\mathcal{Q}$ of allowed posterior distributions over the hidden variables $a$ can be defined as:
$$\mathcal{Q} = \{Q(a) : \exists \xi,\ \mathbb{E}_{Q}[\phi(x, a)] - b \le \xi;\ \|\xi\|_{\beta} \le \varepsilon\} \qquad (8)$$
where $\phi(x, a)$ is a vector of feature functions, $b$ is a vector of given negative expectations, $\xi$ is a vector of slack variables, $\varepsilon$ is a predefined small value and $\|\cdot\|_{\beta}$ denotes some norm. The PR algorithm only works if $\mathcal{Q}$ is non-empty.
In dependency grammar induction, $\phi_k(x, a)$ (the $k$-th element in $\phi(x, a)$) can be set as the negative number of times a given rule (dependency arcs, e.g., Root → Verb, Verb → Noun) occurs in a sentence. We hope to bias the learning so that each sentence is parsed to contain these kinds of arcs more than a threshold in the expectation. The posterior regularized likelihood is then:
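$$\mathcal{L}_{\mathcal{Q}} = \max_{Q \in \mathcal{Q}} \mathcal{L}_x = \log P_\theta(x) - \min_{Q \in \mathcal{Q}} \mathrm{KL}[Q(a) \,\|\, P_\theta(a \mid x)] \qquad (9)$$
Equation (9) indicates that, in the posterior regularized framework, $Q(a)$ not only approximates the true posterior $P_\theta(a \mid x)$ (estimated by the encoder network $Q_\omega(a \mid x)$) but also belongs to the constrained set $\mathcal{Q}$. To optimize $\mathcal{L}_{\mathcal{Q}}$ via the EM algorithm, we get the revised E-step as:
$$Q(a) = \arg\min_{Q \in \mathcal{Q}} \mathrm{KL}[Q(a) \,\|\, Q_\omega(a \mid x)] \qquad (10)$$
Formally, the optimization problem in the E-step can be described as:
$$\min_{Q,\, \xi}\ \mathrm{KL}[Q(a) \,\|\, Q_\omega(a \mid x)] \quad \text{s.t.} \quad \mathbb{E}_{Q}[\phi(x, a)] - b \le \xi;\ \|\xi\|_{\beta} \le \varepsilon \qquad (11)$$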
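The rule-based feature functions used in the constraints can be sketched as follows. This is a minimal illustration, assuming a sentence is given as a list of coarse POS tags plus one head index per word (with -1 marking the root); the function name `rule_features` and the two example rules are illustrative, with the rules taken from the examples mentioned in the text.

```python
def rule_features(tags, heads, rules):
    """Return the feature vector: one negative rule count per rule.

    tags  -- coarse POS tag of each word
    heads -- index of each word's head, or -1 if the word attaches to Root
    rules -- list of (head_tag, dependent_tag) pairs
    """
    counts = [0] * len(rules)
    for dep, head in enumerate(heads):
        head_tag = "Root" if head == -1 else tags[head]
        arc = (head_tag, tags[dep])
        for k, rule in enumerate(rules):
            if arc == rule:
                counts[k] += 1
    # Negative counts, so that "at least this many rule arcs" becomes an upper-bound constraint.
    return [-c for c in counts]


if __name__ == "__main__":
    rules = [("Root", "Verb"), ("Verb", "Noun")]
    tags = ["Noun", "Verb", "Noun"]   # e.g. "dogs chase cats"
    heads = [1, -1, 1]                # both nouns attach to the verb
    print(rule_features(tags, heads, rules))  # [-1, -2]
```

Using negative counts means that requiring the expected feature value to stay below a negative threshold is the same as requiring the expected number of rule-matching arcs to stay above a positive one.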
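Following Ganchev et al. (2010), we can solve the optimization problem in (11) in its Lagrangian dual form. Since our transition-based encoder satisfies the decomposition property, the conditional probability $Q_\omega(a \mid x)$ can be factored as in (1). Thus, the factored primal solution can be written as:
$$Q(a) = \frac{Q_\omega(a \mid x)}{Z(\lambda)} \exp\{-\lambda^{\top} \phi(x, a)\} \qquad (12)$$
where $\lambda$ is the Lagrangian multiplier whose solution is given as $\lambda^{*} = \arg\max_{\lambda \ge 0}\ -b^{\top}\lambda - \log Z(\lambda) - \varepsilon \|\lambda\|_{\beta^{*}}$ (here $\|\cdot\|_{\beta^{*}}$ is the dual norm of $\|\cdot\|_{\beta}$; we use the $\ell_2$ norm for both the primal norm and the dual norm) and $Z(\lambda)$ is given as:
$$Z(\lambda) = \sum_{a} Q_\omega(a \mid x) \exp\{-\lambda^{\top} \phi(x, a)\} \qquad (13)$$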
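We also define the multiplier computed by PR as:
$$\gamma(x, a) = \frac{1}{Z(\lambda)} \exp\{-\lambda^{\top} \phi(x, a)\} \qquad (14)$$
In our case, computing the normalization term $Z(\lambda)$ is intractable for transition-based dependency parsing systems. To address this problem, we view $Z(\lambda)$ as an expectation and estimate it by Monte Carlo simulation as:
$$Z(\lambda) = \mathbb{E}_{Q_\omega(a \mid x)}\left[\exp\{-\lambda^{\top} \phi(x, a)\}\right] \approx \frac{1}{M} \sum_{m} \exp\{-\lambda^{\top} \phi(x, a^{(m)})\} \qquad (15)$$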
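Finally, we compute the gradients for encoder and decoder in the M-step as follows:
$$\frac{\partial \mathcal{L}_{\mathcal{Q}}}{\partial \theta} \approx \frac{1}{M} \sum_{m} \gamma(x, a^{(m)})\, \frac{\partial \log P_\theta(x, a^{(m)})}{\partial \theta}, \qquad \frac{\partial \mathcal{L}_{\mathcal{Q}}}{\partial \omega} \approx \frac{1}{M} \sum_{m} \gamma(x, a^{(m)})\, l(x, a^{(m)})\, \frac{\partial \log Q_\omega(a^{(m)} \mid x)}{\partial \omega} \qquad (16)$$
where $l(x, a^{(m)})$ is the score function computed as in (7). Details of the derivation of the M-step can be found in the Supplementary Material.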
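The Monte Carlo treatment of the normalization term can be sketched in a few lines. The example assumes that each sampled action sequence has already been mapped to its feature vector of negative rule counts and that the Lagrange multipliers are given; the function name `pr_multipliers` and the toy numbers are hypothetical. Each sample is weighted by exp(-lambda · phi) and the weights are divided by their sample mean, which serves as the estimate of the normalization term.

```python
import math


def pr_multipliers(phis, lam):
    """Estimate the normalizer by the sample mean and return (multipliers, normalizer).

    phis -- feature vectors (negative rule counts), one per sampled parse
    lam  -- non-negative Lagrange multipliers, one per rule
    """
    weights = [math.exp(-sum(l * f for l, f in zip(lam, phi))) for phi in phis]
    z_hat = sum(weights) / len(weights)          # Monte Carlo estimate of the normalizer
    return [w / z_hat for w in weights], z_hat


if __name__ == "__main__":
    # Three sampled parses: the first contains the most rule-matching arcs.
    phis = [[-1, -2], [-1, 0], [0, 0]]
    lam = [0.5, 0.5]
    gammas, z_hat = pr_multipliers(phis, lam)
    print(gammas, z_hat)
```

By construction the multipliers are positive and average to one over the sample group, and parses that contain more rule-matching arcs receive larger multipliers.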
Variance Reduction in the M-step
Training a neural variational inference framework with discrete latent variables is known to be a challenging problem (Mnih and Gregor, 2014; Miao and Blunsom, 2016; Miao, Yu, and Blunsom, 2016). This is mainly caused by the sampling step of discrete latent variables which results in high variance, especially at the early stage of training when both encoder and decoder parameters are far from optimal. Intuitively, the score function $l(x, a^{(m)})$ weights the gradient for each latent sample $a^{(m)}$, and its variance plays a crucial role in updating the parameters in the M-step.
To reduce the variance of the score function and stabilize learning, previous work (Mnih and Gregor, 2014; Miao and Blunsom, 2016; Miao, Yu, and Blunsom, 2016) adopts the baseline method (RLBL), redefining the score function as:
$$l_{\text{RLBL}}(x, a^{(m)}) = l(x, a^{(m)}) - b(x) - b_0 \qquad (17)$$
where $b(x)$ is a parameterized, input-dependent baseline (e.g., a neural language model in our case) and $b_0$ is the bias. The baseline method is able to reduce the variance to some extent, but also introduces extra model parameters that complicate optimization. In the following we propose an alternative generic method for reducing the variance of the gradient estimator in the M-step, as well as another task-specific method which results in further improvement.
1. Generic Method
The intuition behind the generic method is as follows: the algorithm takes $M$ latent samples for each input $x$ and a score $l(x, a^{(m)})$ is computed for each sample $a^{(m)}$, hence the variance can be reduced by normalization within the group of samples. This motivates the following normalized score function $l_{\text{RLSN}}(x, a^{(m)})$:
(18) 
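Normalization within a group of samples can be realized in several ways. The sketch below shows one plausible form, offered as an assumption rather than as the paper's exact definition: subtract the group mean and divide by the group standard deviation, with the divisor floored at 1 so that groups of nearly identical scores are not inflated. The floor, the use of the population variance, and the name `normalize_scores` are all choices made for this illustration.

```python
import math


def normalize_scores(scores):
    """Center the scores of one sample group and rescale by max(1, standard deviation).

    Assumed form for illustration; no extra trainable parameters are involved.
    """
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    scale = max(1.0, math.sqrt(variance))
    return [(s - mean) / scale for s in scores]


if __name__ == "__main__":
    print(normalize_scores([10.0, 12.0, 14.0, 16.0]))  # widely spread scores are rescaled
    print(normalize_scores([0.1, 0.2, 0.3]))           # tightly grouped scores are only centered
```

Unlike a learned baseline, this kind of normalization depends only on the scores of the current group.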
2. Task-Specific Method
Besides the generic variance reduction method which applies to discrete neural variational inference in general, we further propose to enhance the quality of the score function for the specific dependency grammar induction task.
Intuitively, the score function $l(x, a^{(m)})$ in (16) weights the gradient of a given sample by a positive or negative value, while $\gamma(x, a^{(m)})$ only weights the gradient by a positive value. As a result, the score function plays a much more significant role in determining the optimization direction. Therefore, we propose to correct the polarity of our $l_{\text{RLSN}}(x, a^{(m)})$ with the number of rules $s(-\phi(x, a^{(m)}))$ that occur in the induced dependency structure, where $s(\cdot)$ returns the sum of vector elements. The refined score function is:
(19) 
where .
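Since the rule count $s(-\phi(x, a^{(m)}))$ provides a natural corrective, we can obtain a simpler variant of (19) by directly using it as the score function:
$$l_{\text{RLC}}(x, a^{(m)}) = s(-\phi(x, a^{(m)})) \qquad (20)$$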
We will experimentally compare the different variance reduction techniques (or score functions) of the reinforcement learning objective.
Experiments
Datasets, Universal Rules, and Setup
English Penn Treebank
We use the Wall Street Journal (WSJ) section of the English Penn Treebank (Marcus, Marcinkiewicz, and Santorini, 1993). The dataset is preprocessed to strip off punctuation. We train our model on sections 2–21, tune the hyperparameters on section 22, and evaluate on section 23. Sentences of length ≤ 10 are used for training, and we report directed dependency accuracy (DDA) on test sentences of length ≤ 10 (WSJ10), and on all sentences (WSJ).
Universal Dependency Treebank
We select eight languages from the Universal Dependency Treebank 1.4 (Nivre et al., 2016). We train our model on training sentences of length ≤ 10 and report DDA on test sentences of length ≤ 15 and ≤ 40. We found that training on short sentences generally increased performance compared to training on longer sentences (e.g., length ≤ 15).
Universal Rules
We employ the universal linguistic rules of Naseem et al. (2010) and Noji, Miyao, and Johnson (2016) for WSJ and the Universal Dependency Treebank, respectively (details can be found in the Supplementary Material). For WSJ, we expand the coarse rules defined in Naseem et al. (2010) with the Penn Treebank fine-grained part-of-speech tags. For example, Verb is expanded as VB, VBD, VBG, VBN, VBP and VBZ.
Setup
To avoid a scenario in which REINFORCE has to work with an arbitrarily initialized encoder and decoder, our posterior regularized neural variational dependency parser is pretrained with the direct reward from PR. (This will be discussed later; for more details on the training, see Supplementary Material.)
We use AdaGrad (Duchi, Hazan, and Singer, 2011) to optimize the parameters of the encoder and decoder, as well as the projected gradient descent algorithm (Bertsekas, 1999) to optimize the parameters of posterior regularization.
We use GloVe embeddings (Pennington, Socher, and Manning, 2014) to initialize English word vectors and FastText embeddings (Bojanowski et al., 2016) for the other languages. Across all experiments, we test both unlexicalized and lexicalized versions of our models. The unlexicalized versions use gold POS tags as model inputs, while the lexicalized versions additionally use word tokens (Le and Zuidema, 2015). We use Brown clustering (Brown et al., 1992) to obtain additional features in the lexicalized versions (Buys and Blunsom, 2015).
We report average DDA and best DDA over five runs for our main parsing results.
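The evaluation metric and the way results are summarized can be sketched as follows. The example assumes that DDA is the percentage of words whose predicted head matches the gold head, pooled over all words of the test set, and that each parse is stored as a list of head indices with -1 for the root; the function names and toy data are hypothetical.

```python
def directed_dependency_accuracy(predicted, gold):
    """Percentage of words whose predicted head index equals the gold head index."""
    correct = 0
    total = 0
    for pred_heads, gold_heads in zip(predicted, gold):
        if len(pred_heads) != len(gold_heads):
            raise ValueError("predicted and gold parses differ in length")
        correct += sum(p == g for p, g in zip(pred_heads, gold_heads))
        total += len(gold_heads)
    return 100.0 * correct / total


def summarize_runs(scores):
    """Return (average, best) over several training runs."""
    return sum(scores) / len(scores), max(scores)


if __name__ == "__main__":
    predicted = [[1, -1, 1], [-1, 0]]
    gold = [[1, -1, 1], [1, -1]]
    print(directed_dependency_accuracy(predicted, gold))  # 60.0
    print(summarize_runs([58.0, 60.0, 62.0]))              # (60.0, 62.0)
```

Pooling over words rather than averaging per sentence gives longer sentences proportionally more weight; the per-sentence alternative would be an equally simple change.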
Exploration of Model Variants
Posterior Regularization
To study the effectiveness of posterior regularization in the neural grammar induction model, we first implement a fully unsupervised model without posterior regularization. This model is trained with variational inference, using the standard REINFORCE objective with a baseline (Mnih and Gregor, 2014; Miao and Blunsom, 2016; Miao, Yu, and Blunsom, 2016) and employing no posterior regularization.
Table 1 shows the results for the unsupervised model, together with the random and left- and right-branching baselines. We observe that the unsupervised model (both the unlexicalized and lexicalized versions) fails to beat the left-branching baseline. These results suggest that without any prior linguistic knowledge, the trained model is fairly unconstrained. A comparison with posterior-regularized results in Table 2 (to be discussed next) reveals the effectiveness of posterior regularization in incorporating such knowledge.



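Model            WSJ10        WSJ
Random           19.1         16.4
Left branching   36.2         30.2
Right branching  20.1         20.6
Unsupervised     33.3 (39.0)  29.0 (30.5)
LUnsupervised    34.9 (36.4)  28.0 (30.2)
Table 1: DDA of the fully unsupervised model (unlexicalized and lexicalized) and of the random, left-branching, and right-branching baselines.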

Pretraining
Unsupervised models in general face a cold-start problem since no gold annotations exist to "warm up" the model parameters quickly. This can be observed in (16): the gradient updates of the model are dependent on the score function $l(x, a^{(m)})$, which in turn relies on the model parameters. At the beginning of training we cannot obtain an accurately approximated $l(x, a^{(m)})$ for updating model parameters. To alleviate this problem, one approach is to ignore the score function in the gradient update at the early stage. In this case, both the encoder and decoder are trained with the direct reward from PR (detailed equations can be found in the Supplementary Material). We test the effectiveness of this approach, which we call pretraining.
Table 2 shows the results of a standard posterior-regularized model with and without pretraining. Both models use the unlexicalized setup. Pretraining raises average DDA from 47.5 to 64.8 on WSJ10 and from 36.7 to 42.0 on WSJ, which makes it a useful way to avoid cold start.



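                WSJ10        WSJ
No Pretraining  47.5 (59.8)  36.7 (46.3)
Pretraining     64.8 (67.1)  42.0 (43.7)
Table 2: DDA of the posterior-regularized model with and without pretraining (unlexicalized setup).

             RLBL  RLSN  RLC   RLPC
Average DDA  58.7  60.8  64.4  66.7
Std. dev.    1.8   0.6   0.3   0.7
Table 3: Average DDA and its standard deviation over five runs for the different score functions.

Variance Reduction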
Previously, we described various variance reduction techniques, or modified score functions, for the reinforcement learning objective. These include the conventional baseline method (RLBL), our sample normalization method (RLSN), sample normalization with additional polarity correction (RLPC), and a simplified version of the latter (RLC). We now compare these techniques; all experiments were conducted with pretraining and on the unlexicalized model.
The experimental results in Table 3 show that RLSN outperforms RLBL on average DDA, which indicates that sample normalization is more effective in reducing the variance of the gradient estimator. We believe the gain comes from the fact that sample normalization does not introduce extra model parameters, whereas RLBL does. Polarity correction further boosts performance. However, polarity correction uses the number of universal rules present in an induced dependency structure, i.e., it is a task-specific method for variance reduction. RLC (the simplified version of RLPC) also achieves competitive performance.
Universal Rules
In our PR scheme, the rule expectations can be uniformly initialized. This approach does not require any annotated training data; the parser is furnished only with a small set of universal linguistic rules. We call this setting UniversalRules.
However, we can initialize the rule expectation non-uniformly, which allows us to introduce a degree of supervision into the PR scheme. Here, we explore one way of doing this: we assume a training set that is annotated with dependency rules (the training portion of the WSJ), based on which we estimate expectations for the universal rules. We call this setting WeaklySupervised.
The results of an experiment comparing these two settings are shown in Table 4. In both cases we use pretraining and the best-performing score function RLPC. Here we report results using both unlexicalized and lexicalized settings. It can be seen that the best-performing UniversalRules model is the unlexicalized one, while the best WeaklySupervised model is lexicalized. Overall, WeaklySupervised outperforms UniversalRules, which demonstrates that our posterior regularized parser is able to effectively use weak supervision in the form of an empirical initialization of the rule expectations.



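Model              WSJ10        WSJ
UniversalRules     54.7 (58.2)  37.8 (39.3)
LUniversalRules    54.7 (56.3)  36.8 (38.1)
WeaklySupervised   66.7 (67.6)  43.6 (45.0)
LWeaklySupervised  68.2 (71.1)  48.6 (50.2)
Table 4: DDA of the UniversalRules and WeaklySupervised settings, unlexicalized and lexicalized (prefix L).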

Parsing Results



Model              WSJ10        WSJ
Graph-based models
ConvexMST          60.8         48.6
HDPDEP             71.9         –
Transition-based models
RF                 37.3 (40.7)  32.1 (33.1)
RF+H1+H2           51.0 (52.7)  37.2 (37.6)
UniversalRules     54.7 (58.2)  37.8 (39.3)
LWeaklySupervised  68.2 (71.1)  48.6 (50.2)
Table 5: Comparison with previous work on the English Penn Treebank. H1 and H2 are two heuristics used in Rasooli and Faili (2012).



Language    RF+H1+H2     LCDMV  ConvMST  LWeaklySup   UnivRules
Length 15
Basque      49.0 (51.0)  47.9   52.5     55.2 (56.0)  52.9 (55.1)
Dutch       26.6 (31.9)  35.5   43.4     38.7 (41.3)  39.6 (40.2)
French      33.2 (37.5)  52.1   61.6     56.6 (57.2)  59.9 (61.6)
German      40.5 (44.0)  51.9   54.4     59.7 (59.9)  57.5 (59.4)
Italian     33.3 (38.9)  73.1   73.2     58.5 (59.8)  59.7 (62.3)
Polish      46.8 (59.7)  66.2   66.7     61.8 (63.4)  57.1 (59.3)
Portuguese  35.7 (43.7)  70.5   60.7     52.5 (54.1)  52.7 (54.2)
Spanish     35.9 (38.3)  65.5   61.6     55.8 (56.2)  55.6 (56.8)
Average     37.6 (43.1)  57.8   59.3     54.9 (56.0)  54.4 (56.1)
Length 40
Basque      45.4 (47.6)  45.4   50.0     51.0 (51.7)  48.9 (51.5)
Dutch       23.1 (30.4)  34.1   45.3     42.2 (44.8)  42.5 (44.3)
French      27.3 (30.7)  48.6   62.0     46.4 (47.5)  55.4 (56.3)
German      32.5 (37.0)  50.5   51.4     55.6 (56.3)  54.2 (56.3)
Italian     27.7 (33.0)  71.1   69.1     54.1 (55.6)  55.7 (58.7)
Polish      43.3 (46.0)  63.7   63.4     57.3 (59.4)  51.7 (52.8)
Portuguese  28.8 (35.9)  67.2   57.9     44.6 (48.6)  45.3 (46.5)
Spanish     26.9 (28.8)  61.9   61.9     50.8 (54.0)  52.4 (53.9)
Average     31.9 (36.2)  55.3   57.6     50.3 (52.2)  50.8 (52.5)
Table 6: DDA on the Universal Dependency Treebank for the two test-length settings.

English Penn Treebank
We compare our unsupervised UniversalRules model and its WeaklySupervised variant with (1) the state-of-the-art unsupervised transition-based parser of Rasooli and Faili (2012), denoted as RF (since we used different preprocessing, we reimplemented their model for a fair comparison), and (2) two state-of-the-art unsupervised graph-based parsers with universal linguistic rules: ConvexMST (Grave and Elhadad, 2015) and HDPDEP (Naseem et al., 2010). Neither of these is transition-based, and thus neither is directly comparable to our approach, but both are useful for reference.
The parser of Rasooli and Faili (2012) is unlexicalized and count-based. To reach the best performance, the authors employed "baby steps" (i.e., they start training on short sentences and gradually add longer sentences (Spitkovsky, Alshawi, and Jurafsky, 2009)), as well as two heuristics called H1 and H2. H1 involves multiplying the probability of the last verb reduction in a sentence by a fixed factor. H2 involves multiplying each Noun → Verb, Adjective → Verb, and Adjective → Noun rule by a fixed factor. These heuristics seem fairly ad hoc; they presumably bias the probability estimates towards more linguistically plausible values.
As the results in Table 5 show, our UniversalRules model outperforms RF on both WSJ10 and full WSJ, achieving a new state of the art for transition-based dependency grammar induction. The RF model does not use universal rules, but its linguistic heuristics play a similar role, which makes our comparison fair. Note that our LWeaklySupervised model achieves a further improvement over UniversalRules, making it comparable with ConvexMST and HDPDEP. This demonstrates the potential of the neural, transition-based dependency grammar induction approach, which should be even clearer on large datasets.
Universal Dependency Treebank
Our multilingual experiments use the UD treebank. There we evaluate the two models that perform the best on the WSJ: the unlexicalized UniversalRules model and the lexicalized LWeaklySupervised model. We use the same hyperparameters as in the WSJ experiments. Again, we mainly compare our models with the transition-based model RF (with heuristics H1 and H2), but we also include the graph-based ConvexMST and LCDMV models for reference.
Table 6 shows the UD treebank results. It can be observed that both UniversalRules and LWeaklySupervised significantly outperform RF on both short and long sentences. The improvement in average DDA is roughly 18 to 19 points on sentences of length 40 (31.9 for RF versus 50.3 and 50.8 for our models). This shows that although the heuristic approach employed by Rasooli and Faili (2012) is useful for English, it does not generalize well across languages, in contrast to our posterior-regularized neural networks with universal rules.
Parsing Speed
To highlight the advantage of our linear time complexity parser, we compare both lexicalized and unlexicalized variants of our parser with a representative DMV-based model (LCDMV) in terms of parsing speed. The results in Table 7 show that our unlexicalized parser achieves a 1.8-fold speedup for short sentences (length 15), and a 16-fold speedup for long sentences (full length). Our parser also loses little parsing speed in the lexicalized setting.


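Sentence length    15    40    All
LCDMV              663   193   74
Our Unlexicalized  1192  1194  1191
Our Lexicalized    939   938   983
Table 7: Parsing speed of LCDMV and of our unlexicalized and lexicalized parsers for different sentence lengths.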
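The speedup factors quoted in the text can be recomputed directly from the table. The snippet below only divides the reported speed values (higher is faster, in whatever unit the table uses); the dictionary layout and the name `speedup` are illustrative.

```python
# Parsing speeds copied from the table above, keyed by sentence-length setting.
SPEEDS = {
    "LCDMV": {"15": 663, "40": 193, "All": 74},
    "Our Unlexicalized": {"15": 1192, "40": 1194, "All": 1191},
    "Our Lexicalized": {"15": 939, "40": 938, "All": 983},
}


def speedup(model, baseline="LCDMV"):
    """Ratio of a model's speed to the baseline's speed for each length setting."""
    return {k: SPEEDS[model][k] / SPEEDS[baseline][k] for k in SPEEDS[baseline]}


if __name__ == "__main__":
    for name in ("Our Unlexicalized", "Our Lexicalized"):
        ratios = speedup(name)
        print(name, {k: round(v, 1) for k, v in ratios.items()})
```

The ratios grow with sentence length because the graph-based parser slows down sharply on long sentences while the transition-based parser's speed stays nearly constant.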

Related Work
In the family of graph-based models, besides LCDMV, ConvexMST, and HDPDEP, a lot of work has focused on improving the DMV, such as adding more types of valence (Headden III, Johnson, and McClosky, 2009), training with artificial negative examples (Smith and Eisner, 2005), and learning initial parameters from shorter sentences (Spitkovsky, Alshawi, and Jurafsky, 2009). Among graph-based models, there is also some work conceptually related to our approach. Jiang, Han, and Tu (2017) combine a discriminative and a generative unsupervised parser using dual decomposition. Cai, Jiang, and Tu (2017) use a CRF autoencoder for unsupervised parsing. In contrast to these two approaches, we use neural variational inference to combine discriminative and generative models.
For transition-based models, Daumé III (2009) introduces a structure prediction approach and Rasooli and Faili (2012) propose a model with simple features based on this approach. Recently, Shi, Huang, and Lee (2017) and Gómez-Rodríguez, Shi, and Lee (2018) show that practical dynamic programming-like global inference is possible for transition-based systems using a minimal set of bidirectional LSTM features. These models achieve competitive performance for projective or non-projective supervised dependency parsing but have not been applied to unsupervised parsing.
Conclusions
In this work, we propose a neural variational transition-based model for dependency grammar induction. The model consists of a generative RNNG for generation, and a discriminative RNNG for parsing and inference. We train the model on unlabelled corpora with an integration of neural variational inference, posterior regularization and variance reduction techniques. This allows us to use a small number of universal linguistic rules as prior knowledge to regularize the latent space. We show that it is straightforward to integrate weak supervision into our model in the form of rule expectations. Our parser obtains a new state of the art for unsupervised transition-based dependency parsing, with linear time complexity and significantly faster parsing speed compared to graph-based models.
In the future, we plan to conduct larger-scale grammar induction experiments with our model. We will also explore better training and optimization techniques for neural variational inference with discrete autoregressive latent variables.
Acknowledgments
We gratefully acknowledge the support of the Leverhulme Trust (award IAF2017019 to FK). We also thank Li Dong and Jiangming Liu at ILCC for fruitful discussions, Yong Jiang at ShanghaiTech for sharing the preprocessed WSJ dataset, and the anonymous reviewers of AAAI19 for their constructive comments.
References
 Bertsekas (1999) Bertsekas, D. P. 1999. Nonlinear programming. Athena scientific Belmont.
 Bojanowski et al. (2016) Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Brown et al. (1992)
Brown, P. F.; Desouza, P. V.; Mercer, R. L.; Pietra, V. J. D.; and Lai, J. C.
1992.
Classbased ngram models of natural language.
Computational linguistics 18(4):467–479.  Buys and Blunsom (2015) Buys, J., and Blunsom, P. 2015. Generative incremental dependency parsing with neural networks. In Proceedings of the ACL Conference.
 Cai, Jiang, and Tu (2017) Cai, J.; Jiang, Y.; and Tu, K. 2017. Crf autoencoder for unsupervised dependency parsing. In Proceedings of the EMNLP Conference.
 Cheng, Lopez, and Lapata (2017) Cheng, J.; Lopez, A.; and Lapata, M. 2017. A generative parser with a discriminative recognition algorithm. In Proceedings of the ACL Conference.
 Daumé III (2009) Daumé III, H. 2009. Unsupervised searchbased structured prediction. In Proceedings of the ICML Conference.
 Duchi, Hazan, and Singer (2011) Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12(Jul):2121–2159.

Dyer et al. (2015)
Dyer, C.; Ballesteros, M.; Ling, W.; Matthews, A.; and Smith, N. A.
2015.
Transitionbased dependency parsing with stack long shortterm memory.
In Proceedings of the ACL Conference.  Dyer et al. (2016) Dyer, C.; Kuncoro, A.; Ballesteros, M.; and Smith, N. A. 2016. Recurrent neural network grammars. In Proceedings of the NAACL Conference.
 Ganchev et al. (2010) Ganchev, K.; Gillenwater, J.; Taskar, B.; et al. 2010. Posterior regularization for structured latent variable models. JMLR 11(Jul):2001–2049.
 GomezRodriguez, Shi, and Lee (2018) GomezRodriguez, C.; Shi, T.; and Lee, L. 2018. Global transitionbased nonprojective dependency parsing. In Proceedings of the ACL Conference.
 Grave and Elhadad (2015) Grave, E., and Elhadad, N. 2015. A convex and featurerich discriminative approach to dependency grammar induction. In Proceedings of the ACL Conference.
 Headden III, Johnson, and McClosky (2009) Headden III, W. P.; Johnson, M.; and McClosky, D. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of the NAACL Conference.
 Jiang, Han, and Tu (2016) Jiang, Y.; Han, W.; and Tu, K. 2016. Unsupervised neural dependency parsing. In Proceedings of the EMNLP Conference.
 Jiang, Han, and Tu (2017) Jiang, Y.; Han, W.; and Tu, K. 2017. Combining generative and discriminative approaches to unsupervised dependency parsing via dual decomposition. In Proceedings of the EMNLP Conference.
 Kiperwasser and Goldberg (2016) Kiperwasser, E., and Goldberg, Y. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. In TACL.
 Klein and Manning (2004) Klein, D., and Manning, C. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the ACL Conference.
 Le and Zuidema (2015) Le, P., and Zuidema, W. 2015. Unsupervised dependency parsing: Let’s use supervised parsers. In Proceedings of the NAACL Conference.
 Marcus, Marcinkiewicz, and Santorini (1993) Marcus, M. P.; Marcinkiewicz, M. A.; and Santorini, B. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.
 McDonald, Petrov, and Hall (2011) McDonald, R.; Petrov, S.; and Hall, K. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the EMNLP Conference.
 Miao and Blunsom (2016) Miao, Y., and Blunsom, P. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the EMNLP Conference.
 Miao, Yu, and Blunsom (2016) Miao, Y.; Yu, L.; and Blunsom, P. 2016. Neural variational inference for text processing. In Proceedings of the ICML Conference.
 Mnih and Gregor (2014) Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In Proceedings of the ICML Conference.
 Naseem et al. (2010) Naseem, T.; Chen, H.; Barzilay, R.; and Johnson, M. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the EMNLP Conference.
 Nivre et al. (2016) Nivre, J.; de Marneffe, M.-C.; Ginter, F.; Goldberg, Y.; Hajic, J.; Manning, C. D.; McDonald, R. T.; Petrov, S.; Pyysalo, S.; Silveira, N.; et al. 2016. Universal dependencies v1: A multilingual treebank collection. In LREC.
 Noji, Miyao, and Johnson (2016) Noji, H.; Miyao, Y.; and Johnson, M. 2016. Using left-corner parsing to encode universal structural constraints in grammar induction. In Proceedings of the EMNLP Conference.
 Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the EMNLP Conference.

 Rasooli and Faili (2012) Rasooli, M. S., and Faili, H. 2012. Fast unsupervised dependency parsing with arc-standard transitions. In Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP.
 Shi, Huang, and Lee (2017) Shi, T.; Huang, L.; and Lee, L. 2017. Fast(er) exact decoding and global training for transition-based dependency parsing via a minimal feature set. In Proceedings of the EMNLP Conference.
 Smith and Eisner (2005) Smith, N. A., and Eisner, J. 2005. Guiding unsupervised grammar induction using contrastive estimation. In Proceedings of IJCAI Workshop on Grammatical Inference Applications.
 Spitkovsky, Alshawi, and Jurafsky (2009) Spitkovsky, V. I.; Alshawi, H.; and Jurafsky, D. 2009. Baby steps: How “less is more” in unsupervised dependency parsing. NIPS: Grammar Induction, Representation of Language and Language Learning.
 Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
Supplementary Material
Derivation of the M-step for Revised EM
Original Score Function
Under the posterior regularized EM framework, the original score function without a baseline should be defined as:
But in practical training, we observed that will assign large weights (larger than 1) to more likely parse trees and small weights (less than 1) to less likely parse trees. Thus
would effectively reverse the direction of optimization, which could dramatically mislead the learning process. To cope with this issue, we simply define the score function for our revised EM algorithm as Eq. (7). We will show how this definition will affect the loss function.
Since
we have
Thus, in theory, can be viewed as a regularization for posterior regularization.
Parameter Updating
In the revised EM algorithm, the parameters of both encoder and decoder should be updated under the distribution of rather than . Since , the gradient for the parameter of the encoder via MC sampling will be:
For the decoder, to boost the performance, we also use the score function for optimization:
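As a generic illustration of the Monte Carlo score-function estimator that this kind of update relies on (Williams, 1992), the sketch below estimates the gradient of an expected score under a toy categorical distribution. It is not the update used in this paper: the distribution, the `score` callable, and all function names are assumptions made for the example. The default of 20 samples mirrors the number of MC samples listed in the hyperparameter table.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_function_gradient(logits, score, num_samples=20, rng=None):
    """MC estimate of d/d(logits) E_{a ~ q}[score(a)], with q = softmax(logits).

    Uses the identity grad E_q[f(a)] = E_q[f(a) * grad log q(a)].
    """
    rng = rng or random.Random(0)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        weight = score(a)
        for k in range(len(logits)):
            # d log q(a) / d logit_k = 1[a == k] - q_k
            grad[k] += weight * ((1.0 if a == k else 0.0) - probs[k])
    return [g / num_samples for g in grad]


def exact_gradient(logits, score):
    """Closed-form gradient of E_q[score(a)], used to check the estimator."""
    probs = softmax(logits)
    mean = sum(p * score(a) for a, p in enumerate(probs))
    return [p * (score(a) - mean) for a, p in enumerate(probs)]
```

With few samples the estimate is noisy, which is why baselines and other variance-reduction techniques are commonly paired with this estimator.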
Linguistic Rules
Tables 8 and 9 present the universal linguistic rules used for WSJ and the Universal Dependency Treebank, respectively.
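Table 8: Universal linguistic rules used for WSJ (head → dependent).
Root → Auxiliary     Noun → Adjective
Root → Verb          Noun → Article
Verb → Noun          Noun → Noun
Verb → Pronoun       Noun → Numeral
Verb → Adverb        Preposition → Noun
Verb → Verb          Adjective → Adverb
Auxiliary → Verb

Table 9: Universal linguistic rules used for the Universal Dependency Treebank (head → dependent).
ROOT → VERB          NOUN → ADJ
ROOT → NOUN          NOUN → DET
VERB → NOUN          NOUN → NOUN
VERB → ADV           NOUN → NUM
VERB → VERB          NOUN → CONJ
VERB → AUX           NOUN → ADP
ADJ → ADV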
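A minimal sketch of how such rules might be represented and checked against a candidate tree, assuming each WSJ pair above reads as head tag followed by dependent tag. The names `WSJ_RULES` and `rule_match_count`, and the convention that a head index of -1 marks the root, are hypothetical choices for this example and are not taken from the paper.

```python
# Head -> dependent PoS pairs, copied from the WSJ rule table above.
WSJ_RULES = frozenset({
    ("Root", "Auxiliary"), ("Root", "Verb"),
    ("Verb", "Noun"), ("Verb", "Pronoun"), ("Verb", "Adverb"), ("Verb", "Verb"),
    ("Auxiliary", "Verb"),
    ("Noun", "Adjective"), ("Noun", "Article"), ("Noun", "Noun"), ("Noun", "Numeral"),
    ("Preposition", "Noun"),
    ("Adjective", "Adverb"),
})


def rule_match_count(heads, tags, rules=WSJ_RULES):
    """Count the arcs of a dependency tree that match a rule.

    heads[i] is the index of the head of token i, or -1 if token i
    attaches to the artificial root; tags[i] is the PoS tag of token i.
    """
    count = 0
    for dep, head in enumerate(heads):
        head_tag = "Root" if head < 0 else tags[head]
        if (head_tag, tags[dep]) in rules:
            count += 1
    return count
```

For the tag sequence Article, Noun, Verb, Adverb, a tree that attaches the article to the noun, the noun and adverb to the verb, and the verb to the root matches a rule on every arc.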
Experimental Details
Model Configuration
Encoder and Decoder
We follow the model configuration of Cheng, Lopez, and Lapata (2017) and build the encoder and decoder from Stack-LSTMs (Dyer et al., 2015). There are two differences: (1) neither the encoder nor the decoder uses the parent non-terminal embedding or the action history embedding; (2) the encoder does not use the adaptive buffer embedding.
RLBL
For the baseline in RLBL, we first pre-train an LSTM language model. We use word embeddings of size 100 and a 2-layer LSTM with hidden size 100, and we tie the weights of the output classifier and the word embeddings. While training RLBL, we keep the LSTM language model fixed and rescale and shift its output to fit the ELBO of the given sentence as:
Hyper-Parameters and Optimization
Table 10: Hyperparameter settings.
Hyperparameter                      Value
Word embedding dimensions           80
PoS embedding dimensions            80
Encoder LSTM dimensions             64
Decoder LSTM dimensions             64
LSTM layers                         1
Encoder dropout                     0.5
Decoder dropout                     0.5
Learning rate                       0.01
Gradient clip                       0.25
Gradient clip (for pre-training)    0.5
Regularization                      1e-4
Hyperparameter in Eq. (8)           0.1
Number of MC samples                20
In all experiments, our lexicalized models use both PoS tag and word embeddings, while our unlexicalized models use only PoS tag embeddings. Table 10 presents the hyperparameter settings. Pre-trained word embeddings are projected into a lower dimension (the word embedding dimension in Table 10). We select the hyperparameter in Eq. (8) via a grid search on the WSJ10 development set. We use Glorot initialization for the parameters and Adagrad for optimization, except for posterior regularization.
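To make the optimizer settings concrete, the following sketch applies one Adagrad step with gradient clipping, using the learning rate (0.01) and gradient clip (0.25) from the hyperparameter table. Whether the clip is applied to the global gradient norm or to individual values is not stated in the text, so norm clipping is an assumption here, as are the function names and the `eps` constant.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def adagrad_step(params, grad, accum, lr=0.01, clip=0.25, eps=1e-8):
    """One Adagrad update (Duchi, Hazan, and Singer, 2011) on plain lists.

    accum holds the running sum of squared gradients per parameter.
    Returns the updated parameters and the updated accumulator.
    """
    grad = clip_by_norm(grad, clip)
    new_accum = [s + g * g for s, g in zip(accum, grad)]
    new_params = [
        p - lr * g / (math.sqrt(s) + eps)
        for p, g, s in zip(params, grad, new_accum)
    ]
    return new_params, new_accum
```

On the first step the accumulator equals the squared clipped gradient, so every parameter with a non-zero gradient moves by roughly the learning rate; later steps shrink as the accumulator grows.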