SparseMAP: differentiable sparse structure inference
Deep NLP models benefit from underlying structures in the data---e.g., parse trees---typically extracted using off-the-shelf parsers. Recent attempts to jointly learn the latent structure encounter a tradeoff: either make factorization assumptions that limit expressiveness, or sacrifice end-to-end differentiability. Using the recently proposed SparseMAP inference, which retrieves a sparse distribution over latent structures, we propose a novel approach for end-to-end learning of latent structure predictors jointly with a downstream predictor. To the best of our knowledge, our method is the first to enable unrestricted dynamic computation graph construction from the global latent structure, while maintaining differentiability.READ FULL TEXT VIEW PDF
We define a latent structure model (LSM) random graph as a random dot pr...
We treat projective dependency trees as latent variables in our probabil...
In this paper, we propose an end-to-end graph learning framework, namely...
Models for recommender systems use latent factors to explain the prefere...
This paper presents Sparse Partitioning, a Bayesian method for identifyi...
Translating document renderings (e.g. PDFs, scans) into hierarchical
Full projector compensation aims to modify a projector input image such ...
SparseMAP: differentiable sparse structure inference
Latent structure models are a powerful tool for modeling compositional data and building NLP pipelines Smith (2011). An interesting emerging direction is to dynamically adapt a network’s computation graph, based on structure inferred from the input; notable applications include learning to write programs Bosnjak et al. (2017), answering visual questions by composing specialized modules Hu et al. (2017); Johnson et al. (2017), and composing sentence representations using latent syntactic parse trees Yogatama et al. (2017).
But how to learn a model that is able to condition on such combinatorial variables? The question then becomes: how to marginalize over all possible latent structures? For tractability, existing approaches have to make a choice. Some of them eschew global latent structure, resorting to computation graphs built from smaller local decisions: e.g., structured attention networks use local posterior marginals as attention weights Kim et al. (2017); Liu and Lapata (2018)
, and dani construct sentence representations from parser chart entries. Others allow more flexibility at the cost of losing end-to-end differentiability, ending up with reinforcement learning problemsYogatama et al. (2017); Hu et al. (2017); Johnson et al. (2017); Williams et al. (2018). More traditional approaches employ an off-line structure predictor (e.g., a parser) to define the computation graph Tai et al. (2015); Chen et al. (2017), sometimes with some parameter sharing Bowman et al. (2016). However, these off-line methods are unable to jointly
train the latent model and the downstream classifier via error gradient information.
We propose here a new strategy for building dynamic computation graphs with latent structure, through sparse structure prediction. Sparsity allows selecting and conditioning on a tractable number of global structures, eliminating the limitations stated above. Namely, our approach is the first that:
is fully differentiable;
supports latent structured variables;
can marginalize over full global structures.
This contrasts with off-line and with reinforcement learning-based approaches, which satisfy B and C but not A; and with local marginal-based methods such as structured attention networks, which satisfy A and B, but not C. Key to our approach is the recently proposed inference Niculae et al. (2018), which induces, for each data example, a very sparse posterior distribution over the possible structures, allowing us to compute the expected network output efficiently and explicitly in terms of a small, interpretable set of latent structures. Our model can be trained end-to-end with gradient-based methods, without the need for policy exploration or sampling.
We demonstrate our strategy on inducing latent dependency TreeLSTMs, achieving competitive results on sentence classification, natural language inference, and reverse dictionary lookup.
We describe our proposed approach for learning with combinatorial structures (in particular, non-projective dependency trees) as latent variables.
Let and denote classifier inputs and outputs, and a latent variable; for example, can be the set of possible dependency trees for
. We would like to train a neural network to model
where is a structured-output parsing model that defines a distribution over trees, and is a classifier whose computation graph may depend freely and globally on the structure (e.g., a TreeLSTM). The rest of this section focuses on the challenge of defining such that Eqn. 1 remains tractable and differentiable.
Denote by a scoring function, assigning each tree a non-normalized score. For instance, we may have an arc-factored score , where we interpret a tree as a set of directed arcs , each receiving an atomic score . Deriving given is known as structured inference. This can be written as a -regularized optimization problem of the form
where is the set of all possible probability distributions over . Examples follow.
With negative entropy regularization, i.e., , we recover marginal inference, and the probability of a tree becomes (Wainwright and Jordan, 2008)
This closed-form derivation, detailed in Appendix A, provides a differentiable expression for . However, crucially, since , every tree is assigned strictly nonzero probability. Therefore—unless the downstream is constrained to also factor over arcs, as in rush; lapata—the sum in Eqn. 1 requires enumerating the exponentially large . This is generally intractable, and even hard to approximate via sampling, even when is tractable.
At the polar opposite, setting yields maximum a posteriori () inference (see Appendix A). assigns a probability of to the highest-scoring tree, and to all others, yielding a very sparse . However, since the top-scoring tree (or top-, for fixed ) does not vary with small changes in , error gradients cannot propagate through . This prevents end-to-end gradient-based training for -based latent variables, which makes them more difficult to use. Related reinforcement learning approaches also yield only one structure, but sidestep non-differentiability by instead introducing more challenging search problems.
In this work, we propose using inference (Niculae et al., 2018) to sparsify the set while preserving differentiability. uses a quadratic penalty on the posterior marginals
Situated between marginal inference and inference, assigns nonzero probability to only a small set of plausible trees , of size at most equal to the number of arcs (Martins et al., 2015, Proposition 11). This guarantees that the summation in Eqn. 1 can be computed efficiently by iterating over : this is depicted in Figure 1 and described in the next paragraphs.
To compute (Eqn. 1), we observe that the posterior is nonzero only on a small set of trees , and thus we only need to compute for . The support and values of are obtained by solving the inference problem, as we describe in Niculae et al. (2018). The strategy, based on the active set algorithm (Nocedal and Wright, 1999, chapter 16), involves a sequence of calls (here: maximum spanning tree problems.)
We next show how to compute end-to-end gradients efficiently. Recall from Eqn. 1 where is a discrete index of a tree. To train the classifier, we have , therefore only the terms with nonzero probability (i.e., ) contribute to the gradient. is readily available by implementing in an automatic differentiation library.111Here we assume and to be disjoint, but weight sharing is easily handled by automatic differentiation via the product rule. Differentiation w.r.t. the summation index is not necessary: may use the discrete structure freely and globally. To train the latent parser, the total gradient is the sum We derive the expression of in Appendix B. Crucially, the gradient sum is also sparse, like , and efficient to compute, amounting to multiplying by a -by- matrix. The proof, given in Appendix B, is a novel extension of the backward pass (Niculae et al., 2018).
Our description focuses on probabilistic classifiers, but our method can be readily applied to networks that output any representation, not necessarily a probability. For this, we define a function , consisting of any auto-differentiable computation w.r.t. , conditioned on the discrete latent structure in arbitrary, non-differentiable ways. We then compute
This strategy is demonstrated in our reverse-dictionary experiments in §3.4. In addition, our approach is not limited to trees: any structured model with tractable inference may be used.
. Following the authors, for an input definition, we rank a shortlist of approximately 50k candidate words according to the cosine similarity to the output vector, and report median rank of the expected word, accuracy at 10, and at 100.
We evaluate our approach on three natural language processing tasks: sentence classification, natural language inference, and reverse dictionary lookup.
Unless otherwise mentioned, we initialize with 300-dimensional GloVe word embeddings Pennington et al. (2014) We transform every sentence via a bidirectional LSTM encoder, to produce a context-aware vector encoding word .
We combine the word vectors in a sentence into a single vector using a tree-structured Child-Sum LSTM, which allows an arbitrary number of children at any node Tai et al. (2015). Our baselines consist in extreme cases of dependency trees: where the parent of word is word (resulting in a left-to-right sequential LSTM), and where all words are direct children of the root node (resulting in a flat additive model). We also consider off-line dependency trees precomputed by Stanford CoreNLP Manning et al. (2014).
All networks are trained via stochastic gradient with 16 samples per batch. We tune the learning rate on a log-grid, using a decay factor of
after every epoch at which the validation performance is not the best seen, and stop after five epochs without improvement. At test time, we scale the arc scoresby a temperature chosen on the validation set, controlling the sparsity of the distribution. All hidden layers are -dimensional.222Our dynet Neubig et al. (2017) implementation is available at https://github.com/vene/sparsemap.
We apply our strategy to the SNLI corpus Bowman et al. (2015), which consists of classifying premise-hypothesis sentence pairs into entailment, contradiction or neutral relations. In this case, for each pair (), the running sum is over two latent distributions over parse trees, i.e., For each pair of trees, we independently encode the premise and hypothesis using a TreeLSTM. We then concatenate the two vectors, their difference, and their element-wise product Mou et al. (2016). The result is passed through one tanh hidden layer, followed by the softmax output layer.333For NLI, our architecture is motivated by our goal of evaluating the impact of latent structure for learning compositional sentence representations. State-of-the-art models conditionally transform the sentences to achieve better performance, e.g., 88.6% accuracy in Chen et al. (2017).
The reverse dictionary task aims to compose a dictionary definition into an embedding that is close to the defined word. We therefore used fixed input and output embeddings, set to unit-norm 500-dimensional vectors provided, together with training and evaluation data, by Hill et al. (2016). The network output is a projection of the TreeLSTM encoding back to the dimension of the word embeddings, normalized to unit norm. We maximize the cosine similarity of the predicted vector with the embedding of the defined word.
Classification and NLI results are reported in Table 1. Compared to the latent structure model of latentparse, our model performs better on SNLI (80.5%) but worse on SST (86.5%). On SNLI, our model also outperforms dani (81.6%). To our knowledge, latent structure models have not been tested on subjectivity classification. Surprisingly, the simple flat and left-to-right baselines are very strong, outperforming the off-line dependency tree models on all three datasets. The latent TreeLSTM model reaches the best accuracy on two out of the three datasets. On reverse dictionary lookup (Table 2), our model also performs well, especially on concept classification, where the input definitions are more different from the ones seen during training. For context, we repeat the scores of the CKY-based latent TreeLSTM model of dani, as well as of the LSTM from hill; these different-sized models are not entirely comparable. We attribute our model’s performance to the latent parser’s flexibility, investigated below.
We analyze the latent structures selected by our model on SST, where the flat composition baseline is remarkably strong. We find that our model, to maximize accuracy, prefers flat or nearly-flat trees, but not exclusively: the average posterior probability of the flat tree is 28.9%. In Figure2, the highest-ranked tree is flat, but deeper trees are also selected, including the projective CoreNLP parser output. Syntax is not necessarily an optimal composition order for a latent TreeLSTM, as illustrated by the poor performance of the off-line parser (Table 1). Consequently, our (fully unsupervised) latent structures tend to disagree with CoreNLP: the average probability of CoreNLP arcs is 5.8%; Williams et al. (2018) make related observations. Indeed, some syntactic conventions may be questionable for recursive composition. Figure 3 shows two examples where our model identifies a plausible symmetric composition order for coordinate structures: this analysis disagrees with CoreNLP, which uses the asymmetrical Stanford / UD convention of assigning the left-most conjunct as head (Nivre et al., 2016). Assigning the conjunction as head instead seems preferable in a Child-Sum TreeLSTM.
Our model must evaluate at least one TreeLSTM for each sentence, making it necessarily slower than the baselines, which evaluate exactly one. Thanks to sparsity and auto-batching, the actual slow-down is not problematic; moreover, as the model trains, the latent parser gets more confident, and for many unambiguous sentences there may be only one latent tree with nonzero probability. On SST, our average training epoch is only 4.7 slower than the off-line parser and 6 slower than the flat baseline.
We presented a novel approach for training latent structure neural models, based on the key idea of sparsifying the set of possible structures, and demonstrated our method with competitive latent dependency TreeLSTM models. Our method’s generality opens up several avenues for future work: since it supports any structure for which inference is available (e.g., matchings, alignments), and we have no restrictions on the downstream , we may design latent versions of more complicated state-of-the-art models, such as ESIM for NLI Chen et al. (2017). In concurrent work, spigot proposed an approximate backward pass, relying on a relaxation and a gradient projection. Unlike our method, theirs does not support multiple latent structures; we intend to further study the relationship between the methods.
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013. We thank Annabelle Carrell, Chris Dyer, Jack Hessel, Tim Vieira, Justine Zhang, Sydney Zink, and the anonymous reviewers, for helpful and well-structured feedback.
Foundations and Trends® in Machine Learning, 1(1–2):1–305.
In this section, we provide a brief explanation of the known result that marginal and inference can be expressed as optimization problems of the form
where is the set of all possible probability distributions over , i.e., .
We set , i.e., the negative Shannon entropy (with base ). The resulting problem is well-studied (Boyd and Vandenberghe, 2004, Example 3.25). Its Lagrangian is
The KKT conditions for optimality are
The gradient takes the form
and setting yields the condition
The above implies , which, by complementary slackness, means . Therefore
where we introduced . From the primal feasibility condition, we have
yielding the desired result: .
results in a linear program over a polytope
According to the fundamental theorem of linear programming (Dantzig et al., 1955, Theorem 6), this maximum is achieved at a vertex of . The vertices of are peaked “indicator” distributions, therefore a solution is given by finding any highest-scoring structure, which is precisely inference
Using a small variation of the method described by Niculae et al. (2018), we can compute the gradient of with respect to . This gradient is sparse, therefore both the forward and the backward passes only involve the small set of active trees
. For this reason, the entire latent model can be efficiently trained end-to-end using gradient-based methods such as stochastic gradient descent.
Let denote the posterior probability distribution,444For a measure-zero set of pathologic inputs there may be more than one optimal distribution . This did not pose any problems in practice, where any ties can be broken at random. i.e., the solution of Equation 2 for , where for an appropriately defined indicator matrix . Define , where we denote by the column-subset of indexed by the support . Denote the sum of column of by , and the overall sum of by .
Then, for any , we have
As in the backward step of Niculae et al. (2018, Appendix B), a solution satisfies
where we denote
To simplify notation, we denote where
Putting it all together, we obtain
which is the top branch of the conditional. For the other branch, observe that the support is constant within a neighborhood of , yielding . Importantly, since is computed as a side-effect of the forward pass, the backward pass computation is efficient.