Jointly Learning Sentence Embeddings and Syntax with Unsupervised Tree-LSTMs

05/25/2017 ∙ by Jean Maillard, et al. ∙ Google University of Cambridge 0

We introduce a neural network that represents sentences by composing their words according to induced binary parse trees. We use Tree-LSTM as our composition function, applied along a tree structure found by a fully differentiable natural language chart parser. Our model simultaneously optimises both the composition function and the parser, thus eliminating the need for externally-provided parse trees which are normally required for Tree-LSTM. It can therefore be seen as a tree-based RNN that is unsupervised with respect to the parse trees. As it is fully differentiable, our model is easily trained with an off-the-shelf gradient descent method and backpropagation. We demonstrate that it achieves better performance compared to various supervised Tree-LSTM architectures on a textual entailment task and a reverse dictionary task.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recurrent neural networks, in particular the Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997) and some of its variants (Graves and Schmidhuber, 2005; Bahdanau et al., 2014)

have been widely applied to problems in natural language processing. Examples include language modelling

(Sundermeyer et al., 2012; Józefowicz et al., 2016), textual entailment (Bowman et al., 2015; Sha et al., 2016), and machine translation (Bahdanau et al., 2014; Sutskever et al., 2014) amongst others.

The topology of an LSTM network is linear: words are read sequentially, normally in left-to-right order. However, language is known to have an underlying hierarchical, tree-like structure (Chomsky, 1957). How to capture this structure in a neural network, and whether doing so leads to improved performance on common linguistic tasks, is an open question. The Tree-LSTM network (Tai et al., 2015; Zhu et al., 2015)

provides a possible answer, by generalising the LSTM to tree-structured topologies. It was shown to be more effective than a standard LSTM in semantic relatedness and sentiment analysis tasks.

Despite their superior performance on these tasks, Tree-LSTM networks have the drawback of requiring an extra labelling of the input sentences in the form of parse trees. These can be either provided by an automatic parser (Tai et al., 2015), or taken from a gold-standard resource such as the Penn Treebank (Kiperwasser and Goldberg, 2016). Yogatama et al. (2016)

proposed to remove this requirement by including a shift-reduce parser in the model, to be optimised alongside the composition function based on a downstream task. This makes the full model non-differentiable so it needs to be trained with reinforcement learning, which can be slow due to high variance.

Our proposed approach is to include a chart parser in the model, inspired by the CYK constituency parser (Cocke, 1969; Younger, 1967; Kasami, 1965)

. Due to the parser being fully differentiable, the entire model can be trained end-to-end for a downstream task by using stochastic gradient descent. Our model is also unsupervised with respect to the parse trees, similar to

Yogatama et al. (2016). We show that the proposed method outperforms other Tree-LSTM architectures based on fully left-branching, right-branching, and supervised parse trees on a textual entailment task and a reverse dictionary task.

2 Related work

Our work can be seen as part of a wider class of sentence embedding models that let their composition order be guided by a tree structure. These can be further split into two groups: (1) models that rely on traditional syntactic parse trees, usually provided as input, and (2) models that induce a tree structure based on some downstream task.

In the first group, Paperno et al. (2014)

take inspiration from the standard Montagovian semantic treatment of composition. They model nouns as vectors, and relational words that take arguments (such as adjectives, that combine with nouns) as tensors, with tensor contraction representing application

(Coecke et al., 2011)

. These tensors are trained via linear regression based on a downstream task, but the tree that determines their order of application is expected to be provided as input.

Socher et al. (2012) and Socher et al. (2013) also rely on external trees, but use recursive neural networks as the composition function.

Instead of using a single parse tree, Le and Zuidema (2015)

propose a model that takes as input a parse forest from an external parser, in order to deal with uncertainty. The authors use a convolutional neural network composition function and, like our model, rely on a mechanism similar to the one employed by the CYK parser to process the trees.

Ma et al. (2015) propose a related model, also making use of syntactic information and convolutional networks to obtain a representation in a bottom-up manner. Convolutional neural networks can also be used to produce embeddings without the use of tree structures, such as in Kalchbrenner et al. (2014).

Bowman et al. (2016) propose an RNN that produces sentence embeddings optimised for a downstream task, with a composition function that works similarly to a shift-reduce parser. The model is able to operate on unparsed data by using an integrated parser. However, it is trained to mimic the decisions that would be taken by an external parser, and is therefore not free to explore using different tree structures. Dyer et al. (2016) introduce a probabilistic model of sentences that explicitly models nested, hierarchical relationships among words and phrases. They too rely on a shift-reduce parsing mechanism to obtain trees, trained on a corpus of gold-standard trees.

In the second group, Yogatama et al. (2016) shows the most similarities to our proposed model. The authors use reinforcement learning to learn tree structures for a neural network model similar to Bowman et al. (2016), taking performance on a downstream task that uses the computed sentence representations as the reward signal. Kim et al. (2017) take a slightly different approach: they formalise a dependency parser as a graphical model, viewed as an extension to attention mechanisms, and hand-optimise the backpropagation step through the inference algorithm.

3 Models

All the models take a sentence as input, represented as an ordered sequence of words. Each word in the vocabulary is encoded as a (learned) word embedding . The models then output a sentence representation , where the output space does not necessarily coincide with the input space .

3.1 Bag of Words

Our simplest baseline is a bag-of-words (BoW) model. Due to its reliance on addition, which is commutative, any information on the original order of words is lost. Given a sentence encoded by embeddings it computes

where is a learned input projection matrix.

3.2 Lstm

An obvious choice for a baseline is the popular Long Short-Term Memory (LSTM) architecture of Hochreiter and Schmidhuber (1997). It is a recurrent neural network that, given a sentence encoded by embeddings , runs for time steps and computes

where is the standard logistic function. The LSTM is parametrised by the matrices ,

, and the bias vector

. The vectors are known as input, forget, and output gates respectively, while we call the vector the candidate update. We take , the -state of the last time step, as the final representation of the sentence.

Following the recommendation of Jozefowicz et al. (2015), we deviate slightly from the vanilla LSTM architecture described above by also adding a bias of 1 to the forget gate, which was found to improve performance.

3.3 Tree-LSTM

Tree-LSTMs are a family of extensions of the LSTM architecture to tree structures (Tai et al., 2015; Zhu et al., 2015). We implement the version designed for binary constituency trees. Given a node with children labelled and , its representation is computed as


where in (1) is a word embedding, only nonzero at the leaves of the parse tree; and and are the node children’s - and -states, only nonzero at the branches. These computations are repeated recursively following the tree structure, and the representation of the whole sentence is given by the -state of the root node. Analogously to our LSTM implementation, here we also add a bias of 1 to the forget gates.

3.4 Unsupervised Tree-LSTM

While the Tree-LSTM is very powerful, it requires as input not only the sentence, but also a parse tree structure defined over it. Our proposed extension optimises this step away, by including a basic CYK-style (Cocke, 1969; Younger, 1967; Kasami, 1965) chart parser in the model. The parser has the property of being fully differentiable, and can therefore be trained jointly with the Tree-LSTM composition function for some downstream task.

neuro linguistic programming rocks
neuro linguistic programming linguistic programming rocks
neuro linguistic linguistic programming programming rocks
neuro linguistic programming rocks
Table 1: Chart for the sentence “neuro linguistic programming rocks”.

The CYK parser relies on a chart data structure, which provides a convenient way of representing the possible binary parse trees of a sentence, according to some grammar. Here we use the chart as an efficient means to store all possible binary-branching trees, effectively using a grammar with only a single non-terminal. This is sketched in simplified form in Table 1 for an example input. The chart is drawn as a diagonal matrix, where the bottom row contains the individual words of the input sentence. The row contains all cells with branch nodes spanning words (here each cell is represented simply by the span – see Figure 1 below for a forest representation of the nodes in all possible trees). By combining nodes in this chart in various ways it is possible to efficiently represent every binary parse tree of the input sentence.

The unsupervised Tree-LSTM uses an analogous chart to guide the order of composition. Instead of storing sequences of words however, here each cell is made up of a pair of vectors representing the state of the Tree-LSTM RNN at that particular node in the tree. The process starts at the bottom row, where each cell is filled in by calculating the Tree-LSTM output (1)-(3) with set to the embedding of the corresponding word. These are the leaves of the parse tree. Then, the second row is computed by repeatedly calling the Tree-LSTM with the appropriate children. This row contains the nodes that are directly combining two leaves. They might not all be needed for the final parse tree: some leaves might connect directly to higher-level nodes, which have not yet been considered. However, they are all computed, as we cannot yet know whether there are better ways of connecting them to the tree. This decision is made at a later stage.

Figure 1: Unsupervised Tree-LSTM network structure for the sentence “neuro linguistic programming rocks”.

Starting from the third row, ambiguity arises since constituents can be built up in more than one way: for example, the constituent “neuro linguistic programming” in Table 1 can be made up either by combining the leaf “neuro” and the second-row node “linguistic programming”, or by combining the second-row node “neuro linguistic” and the leaf “programming”. In these cases, all possible compositions are performed, leading to a set of candidate constituents . Each is assigned an energy, given by



indicates the cosine similarity function and

is a (trained) vector of weights. All energies are then passed through a softmax function to normalise them, and the cell representation is finally calculated as a weighted sum of all candidates using the softmax output:


The softmax uses a temperature hyperparameter

which, for small values, has the effect of making the distribution sparse by making the highest score tend to 1. In all our experiments the temperature is initialised as , and is smoothly decreasing as , where

is the fraction of training epochs that have been completed. In the limit

, this mechanism will only select the highest scoring option, and is equivalent to the argmax operation. The same procedure is repeated for all higher rows, and the final output is given by the -state of the top cell of the chart.

The whole process is sketched in Figure 1 for an example sentence. Note how, for instance, the final sentence representation can be obtained in three different ways, each represented by a coloured circle. All are computed, and the final representation is a weighted sum of the three, represented by the dotted lines. When the temperature in (5) reaches very low values, this effectively reduces to the single “best” tree, as selected by gradient descent.

4 Experiments

All models are implemented in Python 3.5.2 with the DyNet neural network library (Neubig et al., 2017) at commit bffe22b. The code for all following experiments can be found on the first author’s website.111

For training we use stochastic gradient descent with a batch size of 16, which was found to perform better than AdaGrad (Duchi et al., 2010) and similar methods on our development data. Performance on the development data is used to determine when to stop training.

The textual entailment model was trained on a Intel Xeon E5-2660 CPU, and took one and a half weeks to converge. The reverse dictionary model was trained on a NVIDIA GeForce GTX TITAN Black GPU, and took five days to converge.

On top of the baselines already described in §3, for the following experiments we also train two additional Tree-LSTM models that use a fixed composition order: one that uses a fully left-branching tree, and one that uses a fully right-branching tree.

4.1 Textual Entailment

We test our model and baselines on the Stanford Natural Language Inference task (Bowman et al., 2015), consisting of 570 k manually annotated pairs of sentences. Given two sentences, the aim is to predict whether the first entails, contradicts, or is neutral with respect to the second. For example, given “children smiling and waving at camera” and “there are children present”, the model would be expected to predict entailment.

For this experiment, we choose 100D input embeddings, initialised with 100D GloVe vectors (Pennington et al., 2014) and with out-of-vocabulary words set to the average of all other vectors. This results in a word embedding matrix, fine-tuned during training. For the supervised Tree-LSTM model, we used the parse trees included in the dataset.

Given a pair of sentences, one of the models is used to produce the embeddings . Following Yogatama et al. (2016) and Bowman et al. (2016), we then compute

where and are trained model parameters. Finally, the correct label is predicted by , where and are trained model parameters.

Table 2 lists the accuracy and number of parameters for our model, baselines, as well as other sentence embedding models in the literature. When the information is available, we report both the number of intrinsic model parameters as well as the number of word embedding parameters. For other models these figures are based on the data from the SNLI website222 and the original papers.

Model Accuracy # Parameters
Bag-of-words 77.6 % 91 k + 3.7 M
LSTM 81.2 % 161 k + 3.7 M
Left-branching Tree-LSTM 80.5 % 231 k + 3.7 M
Right-branching Tree-LSTM 81.3 % 231 k + 3.7 M
Supervised Tree-LSTM 81.5 % 231 k + 3.7 M
Unupervised Tree-LSTM 81.6 % 231 k + 3.7 M
Bowman et al. (2015), 100D LSTM 77.6 % 220 k + ?
Bowman et al. (2016), 300D LSTM 80.6 % 3.0 M + ?
Bowman et al. (2016), 300D SPINN 83.2 % 3.7 M + ?
Yogatama et al. (2016), 100D latent 80.5 % 500 k + 1.8 M
Munkhdalai and Yu (2017), 300D NSE 84.6 % 3.0 M + ?
Table 2: Test accuracy (higher is better) on the SNLI dataset and number of parameters. We report separately the number of intrinsic model parameters and the number of word embedding parameters.

4.2 Reverse Dictionary

We also test our model and baselines on the reverse dictionary task of Hill et al. (2016), which consists of 852 k word-definition pairs. The aim is to retrieve the name of a concept from a list of words, given its definition. For example, when provided with the sentence “control consisting of a mechanical device for controlling fluid flow”, a model would be expected to rank the word “valve” above other confounders in a list. We use three test sets provided by the authors: two sets involving word definitions, either seen during training or held out; and one set involving concept descriptions instead of formal definitions. Performance is measured via three statistics: the median rank of the correct answer over a list of over 66 k words; and the proportion of cases in which the correct answer appears in the top 10 and 100 ranked words (top 10 accuracy and top 100 accuracy).

As output embeddings, we use the 500D CBOW vectors (Mikolov et al., 2013) provided by the authors. As input embeddings we use the same vectors, reduced to 256 dimensions with PCA. Given a training definition as a sequence of (input) embeddings , the model produces an embedding which is then mapped to the output space via a trained projection matrix . The training objective to be maximised is then the cosine similarity between the definition embedding and the (output) embedding of the word being defined. For the supervised Tree-LSTM model, we additionally parsed the definitions with Stanford CoreNLP (Manning et al., 2014) to obtain parse trees.

We hold out 128 batches from the training set to be used as development data. The softmax temperature in (5) is allowed to decrease as described in §3.4 until it reaches a value of , and then kept constant. This was found to have the best performance on the development set.

Model Median rank Top 10 accuracy Top 100 accuracy
Bag-of-words 75.0 66.0 70.5 30.3% 29.9% 25.8% 53.7% 55.2% 56.6%
LSTM 57.5 59.0 48.5 28.9% 29.7% 29.3% 55.3% 56.8% 57.1%
Left-branching Tree-LSTM 78.0 64.0 48.0 28.9% 28.3% 28.8% 52.7% 54.8% 61.1%
Right-branching Tree-LSTM 70.5 51.0 42.5 30.1% 30.9% 29.8% 54.5% 58.0% 62.1%
Supervised Tree-LSTM 108.5 79.0 160.5 23.1% 26.9% 20.2% 49.0% 52.9% 42.4%
Unsupervised Tree-LSTM 58.5 40.0 40.0 30.9% 33.4% 30.3% 56.1% 57.1% 62.6%
Hill et al. (2016) 512D LSTM 12 22 69 48% 41% 28% 73% 70% 54%
Hill et al. (2016) 500D BoW 22 19 50 44% 43% 34% 65% 69% 60%
Table 3: Median rank (lower is better) and accuracies (higher is better) at 10 and 100 on the three test sets for the reverse dictionary task: seen words (S), unseen words (U), and concept descriptions (C).

Table 3 shows the results for our model and baselines, as well as the models of Hill et al. (2016) which are based on the same cosine training objective. Our bag-of-words model consists of 193.8 k parameters; our LSTM uses 653 k parameters; the fixed-branching, supervised, and unsupervised Tree-LSTM models all use 1.1 M parameters. On top of these, the input word embeddings consist of parameters. Output embeddings are not counted as they are not updated during training.

5 Discussion

The results in Tables 2 and 3 show that the unsupervised Tree-LSTM matches or outperforms all tested baselines.

For the textual entailment task, our model compares favourably to all baselines including the supervised Tree-LSTM, as well as some of the other sentence embedding models in the literature that have a higher number of parameters. Our model could be plausibly improved by combining it with aspects of other models, and we make some concrete suggestions in that direction in §6.

Figure 2: Median rank on the development set for the reverse dictionary task.

In the reversed dictionary task, the very poor performance of the supervised Tree-LSTM can be explained by the unusual tokenisation algorithm used in the dataset of Hill et al. (2016): all punctuation is simply stripped, turning for instance “(archaic) a section of a poem” into “archaic a section of a poem”, or stripping away the semicolons in long lists of synonyms. On the one hand, this might seem unfair on the supervised Tree-LSTM, which received suboptimal trees as input. On the other, it demonstrates the robustness of our method to noisy data. Our model also performed well in comparison to the LSTM and the other Tree-LSTM baselines. Despite the slower training time due to the additional complexity, Figure 2 shows how our model needed fewer training examples to reach convergence in this task.

Following Yogatama et al. (2016), we also manually inspect the learned trees to see how closely they match conventional syntax trees, as would typically be assigned by trained linguists. We analyse the same four sentences they chose. The trees produced by our model are shown in Figure 3. One notable feature of the trees is the fact that verbs are joined with their subject noun phrases first, which differs from the standard verb phrase structure. Type-raising and composition in formalisms such as combinatory categorial grammar (Steedman, 2000) do however allow such constituents. The spans of prepositional phrases in (b), (c) and (d) are correctly identified at the highest level; but only in (d) does the structure of the subtree match convention. As could be expected, other features such as the attachment of the full stops or of some determiners do not appear to match human intuition.

[ [ [ [ a woman ] !0.5cm [ [ wearing sunglasses ] is ] ] !3cm frowning ] !3cm ]


[ [ [ [ a boy ] drags ] [ his sleds ] !1.5cm ] !3.5cm [ [ [ through the ] snow ] ] ]


[ [ [ family members ] !2cm standing ] !2cm [ [ [ outside a ] home ] ] ]


[ [ [ [ two men ] are ] [ playing frisbee ] ] !4.5cm [ [ in [ the park ] ] !1.5cm ] ]

Figure 3: Binary parse trees of sentences from the SNLI dataset induced by the unsupervised Tree-LSTM model.

6 Conclusions

We presented a fully differentiable model to jointly learn sentence embeddings and syntax, based on the Tree-LSTM composition function. We demonstrated its benefits over standard Tree-LSTM on a textual entailment task and a reverse dictionary task. The model is conceptually simple, and easy to train via backpropagation and stochastic gradient descent with popular deep learning toolkits based on dynamic computation graphs such as DyNet

(Neubig et al., 2017)

and PyTorch.


The unsupervised Tree-LSTM we presented is relatively simple, but could be plausibly improved by combining it with aspects of other models. It should be noted in particular that (4), the function assigning an energy to alternative ways of forming constituents, is extremely basic and does not rely on any global information on the sentence. Using a more complex function, perhaps relying on a mechanism such as the tracking LSTM in Bowman et al. (2016)

, might lead to improvements in performance. Techniques such as batch normalization

(Ioffe and Szegedy, 2015) or layer normalization (Ba et al., 2016) might also lead to further improvements.

In future work, it might be possible to obtain trees closer to human intuition by training a model to perform well on multiple tasks instead of a single one, an important feature for intelligent agents to demonstrate (Legg and Hutter, 2007). Techniques such as elastic weight consolidation (Kirkpatrick et al., 2017) have been shown to help with multitask learning, and could be readily applied to our model.