virnng
Code for "A Generative Parser with a Discriminative Recognition Algorithm"(http://homepages.inf.ed.ac.uk/s1537177/resources/virnng.pdf)
Generative models defining joint distributions over parse trees and sentences are useful for parsing and language modeling, but impose restrictions on the scope of features and are often outperformed by discriminative models. We propose a framework for parsing and language modeling which marries a generative model with a discriminative recognition model in an encoder-decoder setting. We provide interpretations of the framework based on expectation maximization and variational inference, and show that it enables parsing and language modeling within a single implementation. On the English Penn Treebank, our framework obtains competitive performance on constituency parsing while matching the state-of-the-art single-model language modeling score.
Generative models defining joint distributions over parse trees and sentences are good theoretical models for interpreting natural language data, and appealing tools for tasks such as parsing, grammar induction and language modeling (Collins 1999; Henderson 2003; Titov and Henderson 2007; Petrov and Klein 2007; Dyer et al. 2016). However, they often impose strong independence assumptions which restrict the use of arbitrary features for effective disambiguation. Moreover, generative parsers are typically trained by maximizing the joint probability of the parse tree and the sentence, an objective that only indirectly relates to the goal of parsing. At test time, these models require a relatively expensive recognition algorithm (Collins 1999; Titov and Henderson 2007) to recover the parse tree, but the parsing performance consistently lags behind their discriminative competitors (Nivre et al. 2007; Huang 2008; Goldberg and Elhadad 2010), which are directly trained to maximize the conditional probability of the parse tree given the sentence, where linear-time decoding algorithms exist (e.g., for transition-based parsers).

In this work, we propose a parsing and language modeling framework that marries a generative model with a discriminative recognition algorithm in order to have the best of both worlds. The idea of combining these two types of models is not new. For example, Collins and Koo (2005) propose to use a generative model to generate candidate constituency trees and a discriminative model to rank them. Sangati et al. (2009) follow the opposite direction and employ a generative model to re-rank the dependency trees produced by a discriminative parser. However, previous work combines the two types of models in a goal-oriented, pipeline fashion, which lacks model interpretations and focuses solely on parsing.
In comparison, our framework unifies generative and discriminative parsers with a single objective, which connects to expectation maximization and variational inference in grammar induction settings. In a nutshell, we treat parse trees as latent factors generating natural language sentences and parsing as a posterior inference task. We showcase the framework using Recurrent Neural Network Grammars (RNNGs; Dyer et al. 2016), a recently proposed probabilistic model of phrase-structure trees based on neural transition systems. Unlike that work, which introduces separately trained discriminative and generative models, we integrate the two in an auto-encoder which fits our training objective. We show how the framework enables grammar induction, parsing and language modeling within a single implementation. On the English Penn Treebank, we achieve competitive performance on constituency parsing and a state-of-the-art single-model language modeling score.

In this section we briefly describe Recurrent Neural Network Grammars (RNNGs; Dyer et al. 2016), a top-down transition-based algorithm for parsing and generation. There are two versions of RNNG, one discriminative, the other generative. We follow the original paper in presenting the discriminative variant first.
The discriminative RNNG follows a shift-reduce parser that converts a sequence of words into a parse tree. As in standard shift-reduce parsers, the RNNG uses a buffer to store unprocessed terminal symbols and a stack to store partially completed syntactic constituents. At each timestep, one of the following three operations is performed (to be precise, the total number of operations under our description is |X| + 2, with |X| the number of non-terminal labels, since the nt operation varies with the non-terminal choice X):
nt(X) introduces an open non-terminal X onto the top of the stack, represented as an open parenthesis followed by X, e.g., (NP.
shift fetches the terminal in the front of the buffer and pushes it onto the top of the stack.
reduce completes a subtree by repeatedly popping the stack until an open non-terminal is encountered. The non-terminal is popped as well, after which a composite term representing the entire subtree is pushed back onto the top of the stack, e.g., (NP the cat).
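The three operations above can be sketched as a small interpreter over an action sequence. This is a minimal illustration only: the function name `parse`, the string encoding of actions, and the `(symbol, is_open)` stack entries are assumptions made for this sketch, and no neural scoring is involved.

```python
from typing import List, Tuple


def parse(words: List[str], actions: List[str]) -> str:
    """Apply nt(X) / shift / reduce actions and return the bracketed tree."""
    buffer = list(words)                 # unprocessed terminals, front at index 0
    stack: List[Tuple[str, bool]] = []   # (symbol, is_open_nonterminal)
    for action in actions:
        if action.startswith("nt(") and action.endswith(")"):
            # Push an open non-terminal such as "(NP".
            stack.append(("(" + action[3:-1], True))
        elif action == "shift":
            # Move the front of the buffer onto the stack.
            stack.append((buffer.pop(0), False))
        elif action == "reduce":
            # Pop until the nearest open non-terminal, then close the subtree.
            children = []
            while not stack[-1][1]:
                children.append(stack.pop()[0])
            open_nt, _ = stack.pop()
            children.reverse()
            stack.append((open_nt + " " + " ".join(children) + ")", False))
        else:
            raise ValueError("unknown action: " + action)
    if buffer or len(stack) != 1:
        raise ValueError("action sequence does not yield a single tree")
    return stack[0][0]


tree = parse(
    ["the", "cat", "meows"],
    ["nt(S)", "nt(NP)", "shift", "shift", "reduce",
     "nt(VP)", "shift", "reduce", "reduce"],
)
```

Here `tree` holds the bracketing `(S (NP the cat) (VP meows))`; the inner `(NP the cat)` is the composite term described for reduce.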
The above transition system can be adapted with minor modifications to an algorithm that generates trees and sentences. In generator transitions, there is no input buffer of unprocessed words but there is an output buffer for storing words that have been generated. To reflect the change, the previous shift operation is modified into a gen operation defined as follows:
gen generates a terminal symbol and adds it to the stack and the output buffer.
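The generator variant can be sketched in the same style. In this hypothetical encoding each action is a tuple, and gen carries the word it emits; the only change from the parsing transitions is that words are written to an output buffer instead of being read from an input buffer.

```python
from typing import List, Tuple


def generate(actions: List[tuple]) -> Tuple[str, List[str]]:
    """Run nt / gen / reduce actions; return (bracketed tree, generated words)."""
    stack = []    # entries are (symbol, is_open_nonterminal)
    output = []   # output buffer of generated terminals
    for action in actions:
        kind = action[0]
        if kind == "nt":
            stack.append(("(" + action[1], True))
        elif kind == "gen":
            # The generated word goes to both the stack and the output buffer.
            stack.append((action[1], False))
            output.append(action[1])
        elif kind == "reduce":
            children = []
            while not stack[-1][1]:
                children.append(stack.pop()[0])
            open_nt, _ = stack.pop()
            children.reverse()
            stack.append((open_nt + " " + " ".join(children) + ")", False))
        else:
            raise ValueError("unknown action: " + str(kind))
    if len(stack) != 1:
        raise ValueError("action sequence does not yield a single tree")
    return stack[0][0], output


generated_tree, sentence = generate(
    [("nt", "NP"), ("gen", "the"), ("gen", "cat"), ("reduce",)]
)
```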
Our framework unifies generative and discriminative parsers within a single training objective. For illustration, we adopt the two RNNG variants introduced above with our customized features. Our starting point is the generative model (§ 3.1), which allows us to make explicit claims about the generative process of natural language sentences. Since this model alone lacks a bottom-up recognition mechanism, we introduce a discriminative recognition model (§ 3.2) and connect it with the generative model in an encoder-decoder setting. To offer a clear interpretation of the training objective (§ 3.3), we first consider the parse tree as latent and the sentence as observed. We then discuss extensions that account for labeled parse trees. Finally, we present various inference techniques for parsing and language modeling within the framework (§ 3.4).
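The decoder is a generative RNNG that models the joint probability $p(x, y)$ of a latent parse tree $y$ and an observed sentence $x$. Since the parse tree is defined by a sequence of transition actions $a$, we write $p(x, y)$ as $p(x, a)$. (We assume that the action probability does not take the actual terminal choice into account.) The joint distribution $p(x, a)$ is factorized into a sequence of transition probabilities and terminal probabilities (when actions are gen), which are parametrized by a transitional state embedding $u$:

$$p(x, a) = p(a)\, p(x \mid a) = \prod_{t=1}^{|a|} p(a_t \mid u_t)\, p(x_t \mid u_t)^{\mathbb{I}(a_t = \text{gen})} \tag{1}$$

where $\mathbb{I}$ is an indicator function and $u_t$ represents the state embedding at time step $t$. Specifically, the conditional probability of the next action is:

$$p(a_t \mid u_t) = \frac{\exp(\mathbf{a}_t^{\top} u_t + b_a)}{\sum_{a' \in \mathcal{A}} \exp(\mathbf{a}'^{\top} u_t + b_{a'})} \tag{2}$$

where $\mathbf{a}_t$ represents the action embedding at time step $t$, $\mathcal{A}$ the action space and $b_a$ the bias. Similarly, the next word probability (when gen is invoked) is computed as:

$$p(x_t \mid u_t) = \frac{\exp(\mathbf{w}_t^{\top} u_t + b_w)}{\sum_{w' \in \mathcal{W}} \exp(\mathbf{w}'^{\top} u_t + b_{w'})} \tag{3}$$

where $\mathbf{w}_t$ is the embedding of the word $x_t$ and $\mathcal{W}$ denotes all words in the vocabulary.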
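As a rough illustration of Equations (1) and (2), the sketch below scores actions with a softmax over dot products between action embeddings and a state embedding, and accumulates the joint log probability, adding a word term only at gen steps. All names (`action_log_probs`, `joint_log_prob`, the toy vectors) are hypothetical, and plain Python lists stand in for learned parameters.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def action_log_probs(state, action_embeddings, action_biases):
    """Equation (2): log-probability of every action given the state embedding."""
    scores = [dot(e, state) + b for e, b in zip(action_embeddings, action_biases)]
    return log_softmax(scores)


def joint_log_prob(steps):
    """Equation (1) in log space.

    `steps` is a list of (log_p_action, log_p_word) pairs, where log_p_word
    is None unless the action at that step is gen.
    """
    total = 0.0
    for log_p_action, log_p_word in steps:
        total += log_p_action
        if log_p_word is not None:   # indicator I(a_t = gen)
            total += log_p_word
    return total


# Toy state embedding and three action embeddings (assumed values).
toy_log_probs = action_log_probs([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
toy_joint = joint_log_prob([(math.log(0.5), None), (math.log(0.5), math.log(0.25))])
```

Working in log space avoids underflow when the product in Equation (1) runs over long action sequences.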
To satisfy the independence assumptions imposed by the generative model, $u_t$ uses only a restricted set of features defined over the output buffer and the stack; we consider $p(a)$ as a context-insensitive prior distribution. Specifically, we use the following features: 1) the stack embedding $d_t$, which encodes the stack of the decoder and is obtained with a stack-LSTM (Dyer et al. 2015, 2016); 2) the output buffer embedding $o_t$; we use a standard LSTM to compose the output buffer, and $o_t$ is represented as the most recent state of the LSTM; and 3) the parent non-terminal embedding $n_t$, which is accessible in the generative model because the RNNG employs a depth-first generation order. Finally, $u_t$ is computed as:

$$u_t = W_2 \tanh(W_1 [d_t, o_t, n_t] + b_d) \tag{4}$$

where the $W$s are weight parameters and $b_d$ the bias.
The encoder is a discriminative RNNG that computes the conditional probability $q(a \mid x)$ of the transition action sequence $a$ given an observed sentence $x$. This conditional probability is factorized over time steps as:

$$q(a \mid x) = \prod_{t=1}^{|a|} q(a_t \mid v_t) \tag{5}$$

where $v_t$ is the transitional state embedding of the encoder at time step $t$.

The next action is predicted similarly to Equation (2), but conditioned on $v_t$. Thanks to the discriminative property, $q(a \mid x)$ has access to any contextual features defined over the entire sentence and the stack; $q(a \mid x)$ acts as a context-sensitive posterior approximation. Our features (compared to Dyer et al. (2016), the new features we introduce are 3) and 4), which we found empirically useful) are: 1) the stack embedding $s_t$, obtained with a stack-LSTM that encodes the stack of the encoder; 2) the input buffer embedding $i_t$; we use a bidirectional LSTM to compose the input buffer and represent each word as a concatenation of forward and backward LSTM states; $i_t$ is the representation of the word on top of the buffer; 3) to incorporate more global features and a more sophisticated look-ahead mechanism for the buffer, we also use an adaptive buffer embedding $k_t$; the latter is computed by having the stack embedding $s_t$ attend to all remaining embeddings on the buffer with the attention function in Vinyals et al. (2015); and 4) the parent non-terminal embedding $n_t$. Finally, $v_t$ is computed as follows:

$$v_t = W_4 \tanh(W_3 [s_t, i_t, k_t, n_t] + b_e) \tag{6}$$

where the $W$s are weight parameters and $b_e$ the bias.
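The adaptive buffer embedding in feature 3) is an attention-weighted summary of the remaining buffer positions. The sketch below uses plain dot-product scoring as a stand-in; the source uses the attention function of Vinyals et al. (2015), which is not reproduced here, and the name `attend` and the toy vectors are assumptions.

```python
import math


def attend(query, keys):
    """Return (context, weights): a softmax-weighted average of `keys`.

    `query` plays the role of the stack embedding and `keys` the embeddings
    of the words still on the buffer. Dot-product scoring is a simplification.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    unnormalized = [math.exp(s - m) for s in scores]
    z = sum(unnormalized)
    weights = [w / z for w in unnormalized]
    dim = len(keys[0])
    context = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
    return context, weights


# Two identical buffer embeddings receive equal attention weight.
context, weights = attend([1.0, 0.0], [[2.0, 0.0], [2.0, 0.0]])
```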
Consider an auto-encoder whose encoder infers the latent parse tree and whose decoder generates the observed sentence from the parse tree. (Here, gen and shift refer to the same action with different definitions for encoding and decoding.)
The maximum likelihood estimate of the decoder parameters is determined by the log marginal likelihood of the sentence:

$$\log p(x) = \log \sum_{a} p(x, a) \tag{7}$$

We follow expectation-maximization and variational inference techniques to construct an evidence lower bound of the above quantity (by Jensen's inequality), denoted as follows:

$$\log p(x) \geq \mathbb{E}_{q(a \mid x)}\left[\log \frac{p(x, a)}{q(a \mid x)}\right] = \mathcal{L}_x \tag{8}$$

where $p(x, a) = p(x \mid a)\, p(a)$ comes from the decoder or the generative model, and $q(a \mid x)$ comes from the encoder or the recognition model. The objective function in Equation (8), denoted by $\mathcal{L}_x$, is unsupervised and suited to a grammar induction task (see § 4 and Appendix A for a comparison between this objective and the importance sampler of Dyer et al. (2016)). This objective can be optimized with the methods shown in Miao and Blunsom (2016).
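To make the lower bound in Equation (8) concrete, the sketch below evaluates it exactly on a toy problem with only two candidate action sequences, so the expectation can be enumerated rather than sampled. The probabilities and names (`TOY_JOINT`, `elbo`, `log_marginal`) are invented for illustration and are not taken from the paper.

```python
import math


def log_marginal(joint):
    """Log marginal likelihood: log of the joint summed over all latent values."""
    return math.log(sum(joint.values()))


def elbo(joint, q):
    """Expectation under q of (log joint - log q), enumerated exactly."""
    return sum(
        q[a] * (math.log(joint[a]) - math.log(q[a]))
        for a in q
        if q[a] > 0.0
    )


# Toy joint probabilities p(x, a) for one fixed sentence x (assumed values).
TOY_JOINT = {"tree_a": 0.03, "tree_b": 0.01}

# A rough recognition model, and the exact posterior p(a | x).
q_rough = {"tree_a": 0.5, "tree_b": 0.5}
evidence = sum(TOY_JOINT.values())
q_exact = {a: p / evidence for a, p in TOY_JOINT.items()}

bound_rough = elbo(TOY_JOINT, q_rough)
bound_exact = elbo(TOY_JOINT, q_exact)
```

With the rough recognition model the bound is strictly below the log marginal likelihood; with the exact posterior the two coincide, which is why improving the encoder tightens the bound.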
Next, consider the case when the parse tree is observed. We can directly maximize the log likelihood of the parse tree for the encoder output $\log q(a \mid x)$ and the decoder output $\log p(a)$:

$$\mathcal{L}_a = \log q(a \mid x) + \log p(a) \tag{9}$$

This supervised objective leverages the extra information of labeled parse trees to regularize the distributions $q(a \mid x)$ and $p(a)$, and the final objective is:

$$\mathcal{L} = \mathcal{L}_x + \mathcal{L}_a \tag{10}$$

where $\mathcal{L}_x$ and $\mathcal{L}_a$ can be balanced with the task focus (e.g., language modeling or parsing).
We consider two inference tasks, namely parsing and language modeling.
In parsing, we are interested in the parse tree that maximizes the posterior $p(a \mid x)$ (or the joint $p(a, x)$). However, the decoder alone does not have a bottom-up recognition mechanism for computing the posterior. Thanks to the encoder, we can compute an approximated posterior $q(a \mid x)$ in linear time and select the parse tree that maximizes this approximation. An alternative is to generate candidate trees by sampling from $q(a \mid x)$, re-rank them with respect to the joint $p(x, a)$ (which is proportional to the true posterior), and select the sample that maximizes the true posterior.
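The sample-and-re-rank alternative can be sketched as follows. The two toy distributions are invented so that the recognition model and the generative model disagree; `sample_and_rerank`, `TOY_Q`, and `TOY_JOINT` are hypothetical names, and real candidates would be action sequences rather than labels.

```python
import math
import random


def sample_and_rerank(sample_from_q, log_joint, num_samples=100, seed=0):
    """Draw candidates from the recognition model, return the one with the highest joint score."""
    rng = random.Random(seed)
    candidates = {sample_from_q(rng) for _ in range(num_samples)}
    return max(candidates, key=log_joint)


# Toy recognition model q(a | x) and generative joint p(x, a) (assumed values).
TOY_Q = {"tree_a": 0.7, "tree_b": 0.3}
TOY_JOINT = {"tree_a": 0.01, "tree_b": 0.03}


def sample_from_q(rng):
    trees = list(TOY_Q)
    return rng.choices(trees, weights=[TOY_Q[t] for t in trees], k=1)[0]


best_by_q = max(TOY_Q, key=TOY_Q.get)
best_reranked = sample_and_rerank(sample_from_q, lambda t: math.log(TOY_JOINT[t]))
```

In this toy case the encoder alone prefers `tree_a`, while re-ranking the sampled candidates by the joint selects `tree_b`. Re-ranking can only choose among trees that were actually sampled, so its quality depends on the recognition model proposing good candidates.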
In language modeling, our goal is to compute the marginal probability $p(x) = \sum_a p(x, a)$, which is typically intractable. To approximate this quantity, we can use Equation (8) to compute a lower bound of the log likelihood $\log p(x)$ and then exponentiate it to get a pessimistic approximation of $p(x)$. (As a reminder, the language modeling objective is $\exp(\mathrm{NLL} / T)$, where $\mathrm{NLL}$ denotes the total negative log likelihood of the test data and $T$ the token counts.)
Another way of computing $p(x)$ (without lower bounding) would be to use the variational approximation $q(a \mid x)$ as the proposal distribution, as in the importance sampler of Dyer et al. (2016). We discuss details in Appendix A.
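A minimal sketch of the importance-sampling estimate and of the perplexity computation follows. The estimator averages the ratio of joint to proposal probability over samples drawn from the proposal, computed in log space; the function names and the toy numbers are assumptions for illustration.

```python
import math


def log_mean_exp(values):
    """Stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def importance_log_marginal(samples, log_joint, log_q):
    """Estimate the log marginal from samples drawn from the proposal q."""
    return log_mean_exp([log_joint(a) - log_q(a) for a in samples])


def perplexity(total_nll, num_tokens):
    """exp(NLL / T): total negative log likelihood over token count."""
    return math.exp(total_nll / num_tokens)


# Toy joint p(x, a); the proposal here is the exact posterior (assumed values).
TOY_JOINT = {"tree_a": 0.03, "tree_b": 0.01}
POSTERIOR = {"tree_a": 0.75, "tree_b": 0.25}

estimate = importance_log_marginal(
    ["tree_a", "tree_b", "tree_a"],          # stand-in for draws from the proposal
    lambda a: math.log(TOY_JOINT[a]),
    lambda a: math.log(POSTERIOR[a]),
)
```

Because the proposal in this toy example equals the true posterior, every importance weight equals the marginal probability and the estimate is exact whatever the samples are; with an imperfect proposal the estimate varies from one sample set to another.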
Table 1: Hyper-parameters.
Hyper-parameter | Value |
---|---|
Learned word embedding dimensions | 40 |
Pretrained word embedding dimensions | 50 |
POS tag embedding dimensions | 20 |
Encoder LSTM dimensions | 128 |
Decoder LSTM dimensions | 256 |
LSTM layers | 2 |
Encoder dropout | 0.2 |
Decoder dropout | 0.3 |
Our framework is related to a class of variational autoencoders (Kingma and Welling 2014), which use neural networks for posterior approximation in variational inference. This technique has been previously used for topic modeling (Miao et al. 2016) and sentence compression (Miao and Blunsom 2016). Another interpretation of the proposed framework is from the perspective of guided policy search in reinforcement learning (Bachman and Precup 2015), where a generative parser is trained to imitate the trace of a discriminative parser. Further connections can be drawn with the importance-sampling based inference of Dyer et al. (2016). There, a generative RNNG and a discriminative RNNG are trained separately; during language modeling, the output of the discriminative model serves as the proposal distribution of an importance sampler. Compared to their work, we unify the generative and discriminative RNNGs in a single framework, and adopt a joint training objective.

Table 2: Constituency parsing results.
Type | Model | Parsing score |
---|---|---|
Discriminative parsers | Socher et al. (2013) | 90.4 |
 | Zhu et al. (2013) | 90.4 |
 | Dyer et al. (2016) | 91.7 |
 | Cross and Huang (2016) | 89.9 |
 | Vinyals et al. (2015) | 92.8 |
Generative parsers | Petrov and Klein (2007) | 90.1 |
 | Shindo et al. (2012) | 92.4 |
 | Dyer et al. (2016) | 93.3 |
This work | encoder only, $\arg\max_a q(a \mid x)$ | 89.3 |
 | samples from $q(a \mid x)$ re-ranked by $p(a, x)$ | 90.1 |
We performed experiments on the English Penn Treebank dataset; we used sections 2–21 for training, 24 for validation, and 23 for testing. Following Dyer et al. (2015), we represent each word in three ways: as a learned vector, a pretrained vector, and a POS tag vector. The encoder word embedding is the concatenation of all three vectors while the decoder uses only the first two since we do not consider POS tags in generation. Table 1 presents details on the hyper-parameters we used. To find the MAP parse tree $\arg\max_a p(a, x)$ (where $p(a, x)$ is used to rank the output of $q(a \mid x)$) and to compute the language modeling perplexity (where $a \sim q(a \mid x)$), we collect 100 samples from $q(a \mid x)$, the same as Dyer et al. (2016).

Experimental results for constituency parsing and language modeling are shown in Tables 2 and 3, respectively. As can be seen, the single framework we propose obtains competitive parsing performance. Comparing the two inference methods for parsing, ranking approximated MAP trees from $q(a \mid x)$ with respect to $p(a, x)$ yields a small improvement, as in Dyer et al. (2016). It is worth noting that our parsing performance lags behind Dyer et al. (2016). We believe this is due to implementation disparities, such as the modeling of the reduce operation. While Dyer et al. (2016) use an LSTM as the syntactic composition function of each subtree, we adopt a rather simple composition function based on embedding averaging, which gains computational efficiency but loses accuracy.
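An averaging-based composition for reduce could look roughly like the sketch below. The source states only that composition is based on embedding averaging; whether the non-terminal label embedding takes part in the average, as it does here, is an assumption, as are the function name and the toy vectors.

```python
def compose_by_averaging(nonterminal_embedding, child_embeddings):
    """Represent a completed subtree as the element-wise mean of its parts.

    Assumption: the open non-terminal's embedding is averaged together with
    the embeddings of its children.
    """
    vectors = [nonterminal_embedding] + list(child_embeddings)
    dim = len(nonterminal_embedding)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy embeddings for "(NP", "the" and "cat".
np_subtree = compose_by_averaging([0.0, 3.0], [[3.0, 0.0], [3.0, 3.0]])
```

Averaging needs no extra parameters and no recurrent pass over the children, which is the efficiency gain, but it discards word order within the constituent, which is one plausible source of the accuracy loss.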
On language modeling, our framework achieves lower perplexity compared to Dyer et al. (2016) and baseline models. This gain possibly comes from the joint optimization of both the generative and discriminative components towards a language modeling objective. However, we acknowledge a subtle difference between Dyer et al. (2016) and our approach compared to baseline language models: while the latter incrementally estimate the next word probability, our approach (and Dyer et al. 2016) directly assigns probability to the entire sentence. Overall, the advantage of our framework compared to Dyer et al. (2016) is that it opens an avenue to unsupervised training.
Table 3: Language modeling results (perplexity).
Model | Perplexity |
---|---|
KN-5 | 255.2 |
LSTM | 113.4 |
Dyer et al. (2016) | 102.4 |
This work | 99.8 |
We proposed a framework that integrates a generative parser with a discriminative recognition model and showed how it can be instantiated with RNNGs. We demonstrated that a unified framework, which relates to expectation maximization and variational inference, enables effective parsing and language modeling algorithms. Evaluation on the English Penn Treebank revealed that our framework obtains competitive performance on constituency parsing and state-of-the-art results on single-model language modeling. In the future, we would like to perform grammar induction based on Equation (8), with gradient descent and posterior regularization techniques (Ganchev et al. 2010).
We thank three anonymous reviewers and members of the ILCC for valuable feedback, and Muhua Zhu and James Cross for help with data preparation. The support of the European Research Council under award number 681760 “Translating Multiple Modalities into Text” is gratefully acknowledged.
In this appendix we highlight the connections between importance sampling and variational inference, thereby comparing our method with Dyer et al. (2016).
Consider a simple directed graphical model with discrete latent variables $a$ (e.g., $a$ is the transition action sequence) and observed variables $x$ (e.g., $x$ is the sentence). The model evidence, or the marginal likelihood $p(x) = \sum_a p(x, a)$, is often intractable to compute. Importance sampling transforms the above quantity into an expectation over a distribution $q(a)$, which is known and easy to sample from:

$$p(x) = \sum_{a} q(a)\, \frac{p(x, a)}{q(a)} = \mathbb{E}_{q(a)}\left[w(x, a)\right] \tag{11}$$

where $q(a)$ is the proposal distribution and $w(x, a) = p(x, a) / q(a)$ the importance weight. The proposal distribution can potentially depend on the observations $x$, i.e., $q(a) \triangleq q(a \mid x)$.
A challenge with importance sampling lies in choosing a proposal distribution which leads to low variance. As shown in Rubinstein and Kroese (2008), the optimal choice of the proposal distribution is in fact the true posterior $p(a \mid x)$, in which case the importance weight $p(a, x) / p(a \mid x) = p(x)$ is constant with respect to $a$. In Dyer et al. (2016), the proposal distribution depends on $x$, i.e., $q(a) \triangleq q(a \mid x)$, and is computed with a separately-trained, discriminative model. This proposal choice is close to optimal, since in a fully supervised setting $a$ is also observed and the discriminative model can be trained to approximate the true posterior well. We hypothesize that the performance of their importance sampler is dependent on this specific proposal distribution. Besides, their training strategy does not generalize to an unsupervised setting.

In comparison, the variational inference approach approximates the log marginal likelihood $\log p(x)$ with the evidence lower bound. It is a natural choice when one aims to optimize Equation (11) directly:

$$\log p(x) = \log \mathbb{E}_{q(a)}\left[\frac{p(x, a)}{q(a)}\right] \geq \mathbb{E}_{q(a)}\left[\log \frac{p(x, a)}{q(a)}\right] \tag{12}$$

where $q(a)$ is the variational approximation of the true posterior. Again, the variational approximation can potentially depend on the observation $x$ (i.e., $q(a) \triangleq q(a \mid x)$) and can be computed with a discriminative model. Equation (12) is a well-defined, unsupervised training objective which allows us to jointly optimize generative (i.e., $p(x, a)$) and discriminative (i.e., $q(a \mid x)$) models. To further support the observed variable $a$, we augment this objective with supervised terms shown in Equation (10), following Kingma et al. (2014) and Miao and Blunsom (2016).
Equation (12) can also be used to approximate the marginal likelihood $p(x)$ (e.g., in language modeling) with its lower bound. An alternative choice without lower bounding is to use the variational approximation $q(a \mid x)$ as the proposal distribution in importance sampling (Equation (11)). Ghahramani and Beal (1999) show that this proposal distribution leads to improved results of importance samplers.
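The relationship between the bound in Equation (12) and the exact log marginal likelihood can be made explicit with a standard identity. Writing the variational approximation as $q(a \mid x)$ (notation assumed here), the gap between the two is a KL divergence, so the bound is tight exactly when the approximation equals the true posterior, which is also the condition under which the importance weights become constant:

```latex
\begin{align*}
\log p(x) - \mathbb{E}_{q(a \mid x)}\!\left[\log \frac{p(x, a)}{q(a \mid x)}\right]
  &= \mathbb{E}_{q(a \mid x)}\!\left[\log p(x) - \log \frac{p(a \mid x)\, p(x)}{q(a \mid x)}\right] \\
  &= \mathbb{E}_{q(a \mid x)}\!\left[\log \frac{q(a \mid x)}{p(a \mid x)}\right]
   = \mathrm{KL}\big(q(a \mid x) \,\|\, p(a \mid x)\big) \;\geq\; 0 .
\end{align*}
```

The first step uses $p(x, a) = p(a \mid x)\, p(x)$; the $\log p(x)$ terms cancel, leaving only the divergence between the approximation and the true posterior.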
Dyer et al. (2015). Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 334–343.
Ganchev et al. (2010). Journal of Machine Learning Research 11(Jul):2001–2049.