Towards Dynamic Computation Graphs via Sparse Latent Structure

09/03/2018 · Vlad Niculae et al. · Unbabel Inc. · Cornell University

Deep NLP models benefit from underlying structures in the data---e.g., parse trees---typically extracted using off-the-shelf parsers. Recent attempts to jointly learn the latent structure encounter a tradeoff: either make factorization assumptions that limit expressiveness, or sacrifice end-to-end differentiability. Using the recently proposed SparseMAP inference, which retrieves a sparse distribution over latent structures, we propose a novel approach for end-to-end learning of latent structure predictors jointly with a downstream predictor. To the best of our knowledge, our method is the first to enable unrestricted dynamic computation graph construction from the global latent structure, while maintaining differentiability.

1 Introduction

Latent structure models are a powerful tool for modeling compositional data and building NLP pipelines Smith (2011). An interesting emerging direction is to dynamically adapt a network’s computation graph, based on structure inferred from the input; notable applications include learning to write programs Bosnjak et al. (2017), answering visual questions by composing specialized modules Hu et al. (2017); Johnson et al. (2017), and composing sentence representations using latent syntactic parse trees Yogatama et al. (2017).

But how can we learn a model that conditions on such combinatorial variables? In other words, how do we marginalize over all possible latent structures? For tractability, existing approaches must make a choice. Some eschew global latent structure, resorting to computation graphs built from smaller local decisions: e.g., structured attention networks use local posterior marginals as attention weights Kim et al. (2017); Liu and Lapata (2018), and Maillard et al. (2017) construct sentence representations from parser chart entries. Others allow more flexibility at the cost of losing end-to-end differentiability, ending up with reinforcement learning problems Yogatama et al. (2017); Hu et al. (2017); Johnson et al. (2017); Williams et al. (2018). More traditional approaches employ an off-line structure predictor (e.g., a parser) to define the computation graph Tai et al. (2015); Chen et al. (2017), sometimes with some parameter sharing Bowman et al. (2016). However, these off-line methods are unable to jointly train the latent model and the downstream classifier via error gradient information.

We propose here a new strategy for building dynamic computation graphs with latent structure, through sparse structure prediction. Sparsity allows selecting and conditioning on a tractable number of global structures, eliminating the limitations stated above. Namely, our approach is the first that:

  • (A) is fully differentiable;

  • (B) supports latent structured variables;

  • (C) can marginalize over full global structures.

This contrasts with off-line and with reinforcement learning-based approaches, which satisfy (B) and (C) but not (A); and with local marginal-based methods such as structured attention networks, which satisfy (A) and (B), but not (C). Key to our approach is the recently proposed SparseMAP inference (Niculae et al., 2018), which induces, for each data example, a very sparse posterior distribution over the possible structures, allowing us to compute the expected network output efficiently and explicitly in terms of a small, interpretable set of latent structures. Our model can be trained end-to-end with gradient-based methods, without the need for policy exploration or sampling.

We demonstrate our strategy on inducing latent dependency TreeLSTMs, achieving competitive results on sentence classification, natural language inference, and reverse dictionary lookup.

Figure 1: Our method computes a sparse probability distribution over all possible latent structures: here, only two have nonzero probability. For each selected tree $z$, we evaluate $p(y \mid x, z)$ by dynamically building the corresponding computation graph (e.g., a TreeLSTM). The final, posterior prediction is a sparse weighted average.

2 Sparse Latent Structure Prediction

We describe our proposed approach for learning with combinatorial structures (in particular, non-projective dependency trees) as latent variables.

2.1 Latent Structure Models

Let $x$ and $y$ denote classifier inputs and outputs, and $z$ a latent variable; for example, $\mathcal{Z}(x)$ can be the set of possible dependency trees for $x$. We would like to train a neural network to model

(1)   $p(y \mid x) \;=\; \sum_{z \in \mathcal{Z}(x)} \pi(z \mid x)\; p(y \mid x, z),$

where $\pi(z \mid x)$ is a structured-output parsing model that defines a distribution over trees, and $p(y \mid x, z)$ is a classifier whose computation graph may depend freely and globally on the structure $z$ (e.g., a TreeLSTM). The rest of this section focuses on the challenge of defining $\pi(z \mid x)$ such that Eqn. 1 remains tractable and differentiable.
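The sketch below (Python with NumPy, not the authors' code) illustrates how Eqn. 1 is evaluated once the posterior is sparse: the prediction is a weighted sum of structure-conditioned outputs, and only structures with nonzero probability are ever touched. The names `posterior` and `classifier` are hypothetical stand-ins for $\pi(z \mid x)$ and $p(y \mid x, z)$.

```python
# A minimal sketch of Eqn. 1: the network output is an expectation of
# structure-conditioned predictions under the (sparse) parser posterior.
import numpy as np

def expected_prediction(x, posterior, classifier):
    """p(y|x) = sum_z pi(z|x) * p(y|x, z), summing only over nonzero-probability z."""
    total = None
    for z, prob in posterior(x):        # posterior yields (structure, probability) pairs
        if prob == 0.0:
            continue
        p_y = classifier(x, z)          # the computation graph may depend freely on z
        total = prob * p_y if total is None else total + prob * p_y
    return total

# Toy usage: two latent "trees" indexed 0 and 1, and a classifier that branches on z.
toy_posterior = lambda x: [(0, 0.7), (1, 0.3)]
toy_classifier = lambda x, z: np.array([0.9, 0.1]) if z == 0 else np.array([0.2, 0.8])
print(expected_prediction(None, toy_posterior, toy_classifier))  # -> [0.69 0.31]
```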

2.2 Global Inference

Denote by $s : \mathcal{Z}(x) \to \mathbb{R}$ a scoring function, assigning each tree $z$ a non-normalized score $s(z)$. For instance, we may have an arc-factored score $s(z) = \sum_{(h \to m) \in z} s_{h,m}$, where we interpret a tree $z$ as a set of directed arcs $(h \to m)$, each receiving an atomic score $s_{h,m}$. Deriving $\pi(z \mid x)$ given $s$ is known as structured inference. This can be written as an $\Omega$-regularized optimization problem of the form

$\pi(\cdot \mid x) \;=\; \arg\max_{p \in \triangle} \; \sum_{z \in \mathcal{Z}(x)} p(z)\, s(z) \;-\; \Omega(p),$

where $\triangle$ is the set of all possible probability distributions over $\mathcal{Z}(x)$. Examples follow.

Marginal inference.

With negative entropy regularization, i.e., $\Omega(p) = \sum_z p(z) \log p(z)$, we recover marginal inference, and the probability of a tree becomes (Wainwright and Jordan, 2008)

$\pi(z \mid x) \;=\; \frac{\exp(s(z))}{\sum_{z' \in \mathcal{Z}(x)} \exp(s(z'))}.$

This closed-form derivation, detailed in Appendix A, provides a differentiable expression for $\pi(z \mid x)$. Crucially, however, since $\exp(\cdot) > 0$, every tree is assigned strictly nonzero probability. Therefore, unless the downstream classifier $p(y \mid x, z)$ is constrained to also factor over arcs, as in Kim et al. (2017); Liu and Lapata (2018), the sum in Eqn. 1 requires enumerating the exponentially large $\mathcal{Z}(x)$. This is generally intractable, and even hard to approximate via sampling, even when sampling from $\pi$ is tractable.
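As a toy illustration of why a dense posterior is problematic, the snippet below computes the entropy-regularized solution for a handful of explicitly enumerated tree scores: it is exactly the softmax of the scores, so every tree receives strictly positive probability. At realistic scale, $\mathcal{Z}(x)$ cannot be enumerated like this; the scores here are arbitrary toy values.

```python
# Marginal inference over an *enumerated* toy set of structures: the
# entropy-regularized argmax over the simplex is the softmax of the scores.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

tree_scores = np.array([2.0, 1.5, 0.3, -1.0])   # one score per enumerated tree
print(softmax(tree_scores))   # every entry is strictly positive: the posterior is dense
```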

MAP inference.

At the polar opposite, setting $\Omega \equiv 0$ yields maximum a posteriori (MAP) inference (see Appendix A). MAP inference assigns a probability of 1 to the highest-scoring tree and 0 to all others, yielding a very sparse $\pi(z \mid x)$. However, since the top-scoring tree (or the top-$k$, for fixed $k$) does not vary with small changes in the scores $s$, error gradients cannot propagate through MAP. This prevents end-to-end gradient-based training of MAP-based latent variables, which makes them more difficult to use. Related reinforcement learning approaches also yield a single structure, but sidestep non-differentiability by instead introducing more challenging search problems.

2.3 SparseMAP Inference

In this work, we propose using SparseMAP inference (Niculae et al., 2018) to sparsify the set $\mathcal{Z}(x)$ while preserving differentiability. SparseMAP uses a quadratic penalty on the posterior marginals,

$\Omega(p) \;=\; \frac{1}{2}\, \big\lVert A\, p \big\rVert_2^2,$

where the columns of the indicator matrix $A$ encode the arcs of each tree, so that $A\,p$ is the vector of posterior arc marginals.

Situated between marginal inference and MAP inference, SparseMAP assigns nonzero probability to only a small set of plausible trees $\bar{\mathcal{Z}} \subseteq \mathcal{Z}(x)$, of size at most equal to the number of arcs (Martins et al., 2015, Proposition 11). This guarantees that the summation in Eqn. 1 can be computed efficiently by iterating over $\bar{\mathcal{Z}}$: this is depicted in Figure 1 and described in the next paragraphs.
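The snippet below is a toy-scale sketch of the SparseMAP problem for a small, explicitly enumerated set of structures: it maximizes $s^\top p - \frac{1}{2}\lVert A p\rVert^2$ over the simplex with projected gradient ascent and recovers a sparse posterior. The matrix `A`, the arc scores `eta`, and the solver are all illustrative; the actual model never enumerates $\mathcal{Z}(x)$ and instead uses the active-set algorithm described below, driven by maximum spanning tree calls.

```python
# Toy SparseMAP over an enumerated set of structures (columns of A are
# arc-indicator vectors). Only feasible at toy scale.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def sparsemap_toy(A, eta, lr=0.1, iters=5000):
    """Maximize  s^T p - 0.5 * ||A p||^2  over the simplex, with s = A^T eta."""
    s = A.T @ eta
    p = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(iters):
        grad = s - A.T @ (A @ p)        # gradient of the objective w.r.t. p
        p = project_simplex(p + lr * grad)
    return p

# Three toy structures over 4 possible arcs, plus toy arc scores eta.
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
eta = np.array([1.0, -1.0, 0.5, 0.8])
print(sparsemap_toy(A, eta).round(3))   # ~[0.35, 0., 0.65]: one structure drops out
```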

Forward pass.

To compute $p(y \mid x)$ (Eqn. 1), we observe that the posterior $\pi(z \mid x)$ is nonzero only on a small set of trees $\bar{\mathcal{Z}}$, and thus we only need to compute $p(y \mid x, z)$ for $z \in \bar{\mathcal{Z}}$. The support and values of $\pi$ are obtained by solving the SparseMAP inference problem, as described by Niculae et al. (2018). The strategy, based on the active set algorithm (Nocedal and Wright, 1999, chapter 16), involves a sequence of MAP calls (here: maximum spanning tree problems).

Backward pass.

We next show how to compute end-to-end gradients efficiently. Recall from Eqn. 1 that $p(y \mid x) = \sum_{z \in \mathcal{Z}(x)} \pi(z \mid x)\, p(y \mid x, z)$, where $z$ is a discrete index of a tree. To train the classifier parameters $\theta$, we have $\nabla_\theta\, p(y \mid x) = \sum_{z} \pi(z \mid x)\, \nabla_\theta\, p(y \mid x, z)$; therefore only the terms with nonzero probability (i.e., $z \in \bar{\mathcal{Z}}$) contribute to the gradient. Each $\nabla_\theta\, p(y \mid x, z)$ is readily available by implementing $p(y \mid x, z)$ in an automatic differentiation library (here we assume $\theta$ and $\phi$ to be disjoint, but weight sharing is easily handled by automatic differentiation via the product rule). Differentiation w.r.t. the summation index $z$ is not necessary: $p(y \mid x, z)$ may use the discrete structure $z$ freely and globally. To train the latent parser parameters $\phi$, the total gradient is the sum $\nabla_\phi\, p(y \mid x) = \sum_{z} p(y \mid x, z)\, \nabla_\phi\, \pi(z \mid x)$. We derive the expression of $\nabla_\phi\, \pi(z \mid x)$ in Appendix B. Crucially, this gradient sum is also sparse, like $\pi$, and efficient to compute, amounting to multiplication by a $|\bar{\mathcal{Z}}|$-by-$|\bar{\mathcal{Z}}|$ matrix. The proof, given in Appendix B, is a novel extension of the SparseMAP backward pass (Niculae et al., 2018).
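As a small sketch of the parser-side chain rule just described (with illustrative shapes and values, not the paper's code), the gradient of the expected output with respect to the support tree scores is obtained by multiplying the stacked per-tree outputs with the $|\bar{\mathcal{Z}}|$-by-$|\bar{\mathcal{Z}}|$ Jacobian derived in Appendix B:

```python
# Gradient aggregation over the support trees only.
import numpy as np

def grad_output_wrt_scores(outputs, dpi_ds):
    """outputs[z]   = p(y|x,z) for support trees z          (shape k x dim_y)
       dpi_ds[z,z'] = d pi(z|x) / d s(z')                    (shape k x k)
       returns d p(y|x) / d s(z') for each support score z'  (shape k x dim_y)."""
    return dpi_ds.T @ outputs

outputs = np.array([[0.9, 0.1],       # p(y|x,z) for two support trees
                    [0.2, 0.8]])
dpi_ds = np.array([[ 0.5, -0.5],      # a valid SparseMAP Jacobian (rows/columns sum to 0)
                   [-0.5,  0.5]])
print(grad_output_wrt_scores(outputs, dpi_ds))
```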

Generality.

Our description focuses on probabilistic classifiers, but our method can be readily applied to networks that output any representation, not necessarily a probability. For this, we define a function $h_z(x)$, consisting of any auto-differentiable computation w.r.t. $x$, conditioned on the discrete latent structure $z$ in arbitrary, possibly non-differentiable ways. We then compute

$\hat{y}(x) \;=\; \sum_{z \in \bar{\mathcal{Z}}} \pi(z \mid x)\, h_z(x).$

This strategy is demonstrated in our reverse-dictionary experiments in §3.4. In addition, our approach is not limited to trees: any structured model with tractable MAP inference may be used.

3 Experiments

                 subj.   SST     SNLI
left-to-right    92.71   82.10   80.98
flat             92.56   83.96   81.74
off-line         92.15   83.25   81.37
latent           92.25   84.73   81.87

Table 1: Accuracy scores for classification and NLI.
                        seen                    unseen                  concepts
                  rank  acc@10  acc@100   rank  acc@10  acc@100   rank  acc@10  acc@100
left-to-right      17    42.6    73.8      43    33.2    61.8      28    35.9    66.7
flat               18    45.1    71.1      31    38.2    65.6      29    34.3    68.2
latent             12    47.5    74.6      40    35.6    60.1      20    38.4    70.7
Maillard et al.    58    30.9    56.1      40    33.4    57.1      40    57.1    62.6
Hill et al.        12    48      28        22    41      70        69    28      54

Table 2: Results on the reverse dictionary lookup task of Hill et al. (2016). Following the authors, for an input definition, we rank a shortlist of approximately 50k candidate words according to the cosine similarity to the output vector, and report the median rank of the expected word, accuracy at 10, and accuracy at 100.

We evaluate our approach on three natural language processing tasks: sentence classification, natural language inference, and reverse dictionary lookup.

3.1 Common aspects

Word vectors.

Unless otherwise mentioned, we initialize with 300-dimensional GloVe word embeddings Pennington et al. (2014). We transform every sentence via a bidirectional LSTM encoder, producing a context-aware vector encoding of each word.

Dependency TreeLSTM.

We combine the word vectors in a sentence into a single vector using a tree-structured Child-Sum TreeLSTM, which allows an arbitrary number of children at any node Tai et al. (2015). Our baselines consist of two extreme cases of dependency trees: one where the parent of word $i$ is word $i+1$ (resulting in a left-to-right sequential LSTM), and one where all words are direct children of the root node (resulting in a flat additive model). We also consider off-line dependency trees precomputed by Stanford CoreNLP Manning et al. (2014).
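A self-contained NumPy sketch of Child-Sum TreeLSTM composition (Tai et al., 2015) over a dependency tree given as a head array follows; parameter shapes, initialization, and the choice of root are illustrative only (the paper's implementation uses DyNet). The last two lines build trees in the spirit of the two baselines described above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # toy input/hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One (W, U, b) triple per gate: input, forget, output, update.
params = {g: (rng.normal(scale=0.1, size=(D, D)),
              rng.normal(scale=0.1, size=(D, D)),
              np.zeros(D)) for g in "ifou"}

def child_sum_cell(x, child_h, child_c):
    """Compose a node's input vector x with its children's (h, c) states."""
    h_tilde = sum(child_h, np.zeros(D))                 # sum of children's hidden states
    Wi, Ui, bi = params["i"]; Wf, Uf, bf = params["f"]
    Wo, Uo, bo = params["o"]; Wu, Uu, bu = params["u"]
    i = sigmoid(Wi @ x + Ui @ h_tilde + bi)
    o = sigmoid(Wo @ x + Uo @ h_tilde + bo)
    u = np.tanh(Wu @ x + Uu @ h_tilde + bu)
    # One forget gate per child, conditioned on that child's own hidden state.
    f = [sigmoid(Wf @ x + Uf @ hk + bf) for hk in child_h]
    c = i * u + sum(fk * ck for fk, ck in zip(f, child_c))
    h = o * np.tanh(c)
    return h, c

def encode(words, head):
    """head[m] is the index of word m's parent; the root has head -1."""
    children = [[m for m, h in enumerate(head) if h == n] for n in range(len(words))]
    def rec(n):
        hs, cs = zip(*(rec(m) for m in children[n])) if children[n] else ((), ())
        return child_sum_cell(words[n], list(hs), list(cs))
    return rec(head.index(-1))[0]                       # hidden state at the root

words = [rng.normal(size=D) for _ in range(4)]
print(encode(words, head=[1, 2, 3, -1]))   # chain: parent of word i is word i+1
print(encode(words, head=[3, 3, 3, -1]))   # flat-style: all words attach to one node
```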

Neural arc-factored dependency parsing.

We compute arc scores $s_{h,m}$ with one-hidden-layer perceptrons Kiperwasser and Goldberg (2016).
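A hedged sketch of arc scoring with a one-hidden-layer perceptron, in the spirit of Kiperwasser and Goldberg (2016): each candidate arc $h \to m$ is scored from the encoder states of the head and modifier. The BiLSTM is replaced here by random vectors, and all sizes are illustrative.

```python
# Arc scores s[h, m] = v . tanh(W [r_h; r_m] + b) over stand-in encoder states.
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                                  # encoder size, MLP hidden size
W = rng.normal(scale=0.1, size=(H, 2 * D))
b = np.zeros(H)
v = rng.normal(scale=0.1, size=H)

def arc_scores(states):
    """Return a matrix s where s[h, m] scores the arc head h -> modifier m."""
    n = len(states)
    s = np.zeros((n, n))
    for h in range(n):
        for m in range(n):
            if h != m:
                s[h, m] = v @ np.tanh(W @ np.concatenate([states[h], states[m]]) + b)
    return s

states = [rng.normal(size=D) for _ in range(5)]   # stand-ins for BiLSTM outputs
print(arc_scores(states).round(2))
```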

Experimental setup.

All networks are trained via stochastic gradient descent with 16 samples per batch. We tune the learning rate on a log-grid, decaying it after every epoch at which the validation performance is not the best seen, and stopping after five epochs without improvement. At test time, we scale the arc scores $s_{h,m}$ by a temperature chosen on the validation set, controlling the sparsity of the distribution. All hidden layers have the same dimensionality. (Our DyNet Neubig et al. (2017) implementation is available at https://github.com/vene/sparsemap.)

3.2 Sentence classification

We evaluate our models for sentence-level subjectivity classification Pang and Lee (2004) and for binary sentiment classification on the Stanford Sentiment Treebank Socher et al. (2013). In both cases, we use a softmax output layer on top of the Dependency TreeLSTM output representation.

3.3 Natural language inference (NLI)

We apply our strategy to the SNLI corpus Bowman et al. (2015), which consists of classifying premise-hypothesis sentence pairs into entailment, contradiction or neutral relations. In this case, for each pair $(x_P, x_H)$, the running sum is over two latent distributions over parse trees, i.e., $p(y \mid x_P, x_H) = \sum_{z_P} \sum_{z_H} \pi(z_P \mid x_P)\, \pi(z_H \mid x_H)\, p(y \mid x_P, x_H, z_P, z_H)$. For each pair of trees, we independently encode the premise and hypothesis using a TreeLSTM. We then concatenate the two vectors, their difference, and their element-wise product Mou et al. (2016). The result is passed through one tanh hidden layer, followed by the softmax output layer. (For NLI, our architecture is motivated by our goal of evaluating the impact of latent structure for learning compositional sentence representations. State-of-the-art models conditionally transform the sentences to achieve better performance, e.g., 88.6% accuracy in Chen et al. (2017).)
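A minimal sketch of the pair classifier just described, with toy sizes and random weights in place of the trained ones: given TreeLSTM encodings $u$ (premise) and $v$ (hypothesis), it builds $[u; v; u - v; u \odot v]$, applies a tanh hidden layer, and outputs class probabilities via softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 8, 16, 3                                 # encoding size, hidden size, #classes
W1 = rng.normal(scale=0.1, size=(H, 4 * D)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(C, H));     b2 = np.zeros(C)

def nli_head(u, v):
    feats = np.concatenate([u, v, u - v, u * v])   # concat, difference, elementwise product
    hidden = np.tanh(W1 @ feats + b1)
    logits = W2 @ hidden + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()                             # entailment / contradiction / neutral

print(nli_head(rng.normal(size=D), rng.normal(size=D)).round(3))
```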

3.4 Reverse dictionary lookup

The reverse dictionary task aims to compose a dictionary definition into an embedding that is close to the embedding of the defined word. We therefore use fixed input and output embeddings, set to the unit-norm 500-dimensional vectors provided, together with training and evaluation data, by Hill et al. (2016). The network output is a projection of the TreeLSTM encoding back to the dimension of the word embeddings, normalized to unit norm. We maximize the cosine similarity of the predicted vector with the embedding of the defined word.
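The snippet below sketches the training objective and evaluation protocol described above on synthetic data: the loss is one minus the cosine similarity to the defined word's embedding, and evaluation ranks a candidate shortlist by cosine similarity to the predicted vector. The shortlist size and noise level are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_of_target(pred, candidates, target_idx):
    sims = np.array([cosine(pred, c) for c in candidates])
    order = np.argsort(-sims)                      # candidates sorted by similarity
    return int(np.where(order == target_idx)[0][0]) + 1

candidates = rng.normal(size=(1000, 500))          # stand-in for the ~50k-word shortlist
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
target = 42
pred = candidates[target] + 0.1 * rng.normal(size=500)    # a noisy prediction near the target
loss = 1.0 - cosine(pred, candidates[target])             # training loss: 1 - cosine
print(rank_of_target(pred, candidates, target), round(loss, 3))
```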

[Figure 2 displays three dependency parses of the SST test sentence "a vivid cinematic portrait .", with posterior probabilities 28%, 16% (✓), and 13%.]

Figure 2: Three of the sixteen trees with nonzero probability for an SST test example. Flat representations, such as the first tree, perform well on this task, as reflected by the baselines. The second tree, marked with ✓, agrees with the off-line parser.

[Figure 3 displays dependency parses of "lovely and poignant ." (probabilities 22.6% †, 21.4% ✓, 19.84%) and "a deep and meaningful film ." (probabilities 15.33%, 15.27% †, 0% ✓).]

Figure 3: Examples of coordinate structures where our model assigns high probability to a symmetric parse (marked †). While not consistent with the standard asymmetrical parse produced by CoreNLP (marked with ✓), the symmetric analysis may be more appropriate for TreeLSTM composition.

4 Discussion

Experimental performance.

Classification and NLI results are reported in Table 1. Compared to the latent structure model of Yogatama et al. (2017), which reports 80.5% on SNLI and 86.5% on SST, our model performs better on SNLI but worse on SST. On SNLI, our model also outperforms Maillard et al. (2017) (81.6%). To our knowledge, latent structure models have not been tested on subjectivity classification. Surprisingly, the simple flat and left-to-right baselines are very strong, outperforming the off-line dependency tree models on all three datasets. The latent TreeLSTM model reaches the best accuracy on two out of the three datasets. On reverse dictionary lookup (Table 2), our model also performs well, especially on concept classification, where the input definitions differ more from those seen during training. For context, we repeat the scores of the CKY-based latent TreeLSTM model of Maillard et al. (2017), as well as of the LSTM from Hill et al. (2016); these differently sized models are not entirely comparable. We attribute our model's performance to the latent parser's flexibility, investigated below.

Selected latent structures.

We analyze the latent structures selected by our model on SST, where the flat composition baseline is remarkably strong. We find that our model, to maximize accuracy, prefers flat or nearly flat trees, but not exclusively: the average posterior probability of the flat tree is 28.9%. In Figure 2, the highest-ranked tree is flat, but deeper trees are also selected, including the projective CoreNLP parser output. Syntax is not necessarily an optimal composition order for a latent TreeLSTM, as illustrated by the poor performance of the off-line parser (Table 1). Consequently, our (fully unsupervised) latent structures tend to disagree with CoreNLP: the average probability assigned to CoreNLP arcs is 5.8%; Williams et al. (2018) make related observations. Indeed, some syntactic conventions may be questionable for recursive composition. Figure 3 shows two examples where our model identifies a plausible symmetric composition order for coordinate structures: this analysis disagrees with CoreNLP, which uses the asymmetric Stanford/UD convention of assigning the left-most conjunct as head (Nivre et al., 2016). Assigning the conjunction as head instead seems preferable in a Child-Sum TreeLSTM.
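The arc-level statistic used in this comparison can be computed directly from the sparse posterior, as in the short sketch below: the posterior probability of an arc is the total probability of the support trees containing it. The trees and probabilities shown are illustrative, not taken from the model.

```python
from collections import defaultdict

def arc_posteriors(support):
    """support: list of (probability, set of (head, modifier) arcs) with nonzero probability."""
    marginals = defaultdict(float)
    for prob, arcs in support:
        for arc in arcs:
            marginals[arc] += prob
    return dict(marginals)

support = [(0.7, {(0, 1), (1, 2)}),      # e.g., a near-flat tree
           (0.3, {(2, 1), (1, 0)})]      # e.g., a deeper alternative
print(arc_posteriors(support))
# Averaging these values over the arcs of an off-line (e.g., CoreNLP) parse gives
# an agreement statistic like the one reported in the text.
```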

Training efficiency.

Our model must evaluate at least one TreeLSTM for each sentence, making it necessarily slower than the baselines, which evaluate exactly one. Thanks to sparsity and auto-batching, the actual slowdown is not problematic; moreover, as the model trains, the latent parser becomes more confident, and for many unambiguous sentences there may be only one latent tree with nonzero probability. On SST, our average training epoch is only 4.7× slower than the off-line parser and 6× slower than the flat baseline.

5 Conclusions and future work

We presented a novel approach for training latent structure neural models, based on the key idea of sparsifying the set of possible structures, and we demonstrated our method with competitive latent dependency TreeLSTM models. Our method's generality opens up several avenues for future work: since it supports any structure for which MAP inference is available (e.g., matchings, alignments), and since we place no restrictions on the downstream classifier, we may design latent versions of more complicated state-of-the-art models, such as ESIM for NLI Chen et al. (2017). In concurrent work, Peng et al. (2018) proposed SPIGOT, an approximate backward pass relying on a relaxation and a gradient projection. Unlike our method, theirs does not support multiple latent structures; we intend to further study the relationship between the two methods.

Acknowledgments

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013. We thank Annabelle Carrell, Chris Dyer, Jack Hessel, Tim Vieira, Justine Zhang, Sydney Zink, and the anonymous reviewers, for helpful and well-structured feedback.

References

Appendix A Variational formulations of marginal and MAP inference.

In this section, we provide a brief explanation of the known result that marginal and MAP inference can be expressed as optimization problems of the form

(2)   $\max_{p \in \triangle} \;\; \sum_{z \in \mathcal{Z}(x)} p(z)\, s(z) \;-\; \Omega(p),$

where $\triangle$ is the set of all possible probability distributions over $\mathcal{Z}(x)$, i.e., $\triangle := \big\{ p \in \mathbb{R}^{|\mathcal{Z}(x)|} : p(z) \ge 0,\; \sum_{z} p(z) = 1 \big\}$.

Marginal inference.

We set $\Omega(p) = \sum_z p(z) \log p(z)$, i.e., the negative Shannon entropy (with base $e$). The resulting problem is well-studied (Boyd and Vandenberghe, 2004, Example 3.25). Its Lagrangian is

(3)   $\mathcal{L}(p, \lambda, \nu) \;=\; \sum_z p(z)\, s(z) \;-\; \sum_z p(z) \log p(z) \;+\; \sum_z \lambda_z\, p(z) \;+\; \nu \Big( 1 - \sum_z p(z) \Big).$

The KKT conditions for optimality are

(4)   $\nabla_p\, \mathcal{L} = 0, \qquad \sum_z p(z) = 1, \qquad p(z) \ge 0, \qquad \lambda_z \ge 0, \qquad \lambda_z\, p(z) = 0.$

The gradient takes the form

(5)   $\frac{\partial \mathcal{L}}{\partial p(z)} \;=\; s(z) - \log p(z) - 1 + \lambda_z - \nu,$

and setting it to zero yields the condition

(6)   $\log p(z) \;=\; s(z) - 1 + \lambda_z - \nu.$

The above implies $p(z) > 0$, which, by complementary slackness, means $\lambda_z = 0$. Therefore

(7)   $p(z) \;=\; \frac{\exp(s(z))}{Z},$

where we introduced $Z := \exp(1 + \nu)$. From the primal feasibility condition, we have

(8)   $\sum_z p(z) \;=\; \frac{1}{Z} \sum_z \exp(s(z)) \;=\; 1,$

and thus

(9)   $Z \;=\; \sum_z \exp(s(z)),$

yielding the desired result: $\pi(z \mid x) = \exp(s(z)) \big/ \sum_{z'} \exp(s(z'))$.

MAP inference.

Setting $\Omega \equiv 0$ results in a linear program over the polytope $\triangle$,

(10)   $\max_{p \in \triangle} \; \sum_{z} p(z)\, s(z).$

According to the fundamental theorem of linear programming (Dantzig et al., 1955, Theorem 6), this maximum is achieved at a vertex of $\triangle$. The vertices of $\triangle$ are peaked "indicator" distributions, therefore a solution is given by finding any highest-scoring structure, which is precisely MAP inference,

(11)   $\pi(z \mid x) \;=\; \begin{cases} 1, & z = \operatorname{arg\,max}_{z' \in \mathcal{Z}(x)}\, s(z'), \\ 0, & \text{otherwise.} \end{cases}$

Appendix B Derivation of the backward pass.

Using a small variation of the method described by Niculae et al. (2018), we can compute the gradient of $\pi(z \mid x)$ with respect to the tree scores. This gradient is sparse, therefore both the forward and the backward passes only involve the small set of active trees $\bar{\mathcal{Z}}$. For this reason, the entire latent model can be efficiently trained end-to-end using gradient-based methods such as stochastic gradient descent.

Let $p^\star$ denote the posterior probability distribution (for a measure-zero set of pathological inputs there may be more than one optimal distribution $p^\star$; this did not pose any problems in practice, where any ties can be broken at random), i.e., the solution of Eqn. 2 for $\Omega(p) = \frac{1}{2} \lVert A p \rVert_2^2$, where the arc marginals are $\mu = A\, p$ for an appropriately defined indicator matrix $A$. Define $S := (\bar{A}^\top \bar{A})^{-1}$, where we denote by $\bar{A}$ the column-subset of $A$ indexed by the support $\bar{\mathcal{Z}}$. Denote the sum of column $z$ of $S$ by $b_z$, and the overall sum of $S$ by $t := \sum_z b_z$.

Then, for any $z, z' \in \mathcal{Z}(x)$, we have

$\frac{\partial\, \pi(z \mid x)}{\partial\, s(z')} \;=\; \begin{cases} S_{z z'} - \dfrac{b_z\, b_{z'}}{t}, & z \in \bar{\mathcal{Z}} \text{ and } z' \in \bar{\mathcal{Z}}, \\[6pt] 0, & \text{otherwise.} \end{cases}$

Proof.

As in the backward step of Niculae et al. (2018, Appendix B), the restriction $\bar{p}$ of the solution to its support satisfies

(12)   $\bar{p} \;=\; S\, (\bar{s} - \nu\, \mathbf{1}),$

where we denote

(13)   $\nu \;=\; \frac{b^\top \bar{s} - 1}{t},$

with $\bar{s}$ the vector of scores of the support trees. To simplify notation, we write $b := S\,\mathbf{1}$ (so that $b_z$ is the sum of column $z$ of $S$, as above), and expand Eqn. 12 componentwise as

(14)   $\bar{p}_z \;=\; \sum_{z' \in \bar{\mathcal{Z}}} S_{z z'}\, \bar{s}_{z'} \;-\; \nu\, b_z.$

Differentiation yields

(15)   $\frac{\partial \nu}{\partial \bar{s}_{z'}} \;=\; \frac{b_{z'}}{t}.$

Putting it all together, we obtain

(16)   $\frac{\partial \bar{p}_z}{\partial \bar{s}_{z'}} \;=\; S_{z z'} \;-\; \frac{b_z\, b_{z'}}{t},$

which is the top branch of the conditional. For the other branch, observe that the support $\bar{\mathcal{Z}}$ is constant within a neighborhood of $\bar{s}$, yielding $\partial\, \pi(z \mid x) / \partial\, s(z') = 0$ whenever $z \notin \bar{\mathcal{Z}}$ or $z' \notin \bar{\mathcal{Z}}$. Importantly, since $S$ is computed as a side-effect of the forward pass, the backward pass computation is efficient.
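As a numerical sanity check of this result (not part of the paper), the script below solves a toy enumerated SparseMAP problem with projected gradient ascent, forms the closed-form Jacobian $S - b\,b^\top/t$ on the support, and compares it against central finite differences; the matrix `A`, the scores, and the solver are all illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def sparsemap_p(A, s, lr=0.1, iters=20000):
    """Maximize  s^T p - 0.5 * ||A p||^2  over the simplex by projected gradient
    (a toy-scale stand-in for the active-set solver)."""
    p = np.full(len(s), 1.0 / len(s))
    for _ in range(iters):
        p = project_simplex(p + lr * (s - A.T @ (A @ p)))
    return p

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)      # columns: arc-indicator vectors of 3 structures
s = np.array([1.5, -0.5, 1.8])              # tree scores; the second tree drops out
p = sparsemap_p(A, s)
support = np.nonzero(p > 1e-9)[0]

# Closed-form Jacobian on the support: S - b b^T / t.
A_bar = A[:, support]
S = np.linalg.inv(A_bar.T @ A_bar)
b = S.sum(axis=1)
t = S.sum()
J_closed = S - np.outer(b, b) / t

# Central finite differences w.r.t. the support tree scores.
eps = 1e-5
J_fd = np.zeros_like(J_closed)
for j, zj in enumerate(support):
    s_plus, s_minus = s.copy(), s.copy()
    s_plus[zj] += eps
    s_minus[zj] -= eps
    J_fd[:, j] = (sparsemap_p(A, s_plus)[support]
                  - sparsemap_p(A, s_minus)[support]) / (2 * eps)

print(np.allclose(J_closed, J_fd, atol=1e-3))   # True, up to solver tolerance
```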