On Tree-Based Neural Sentence Modeling

by   Haoyue Shi, et al.
ByteDance Inc.
Peking University

Neural networks with tree-based sentence encoders have shown better results on many downstream tasks. Most existing tree-based encoders adopt syntactic parsing trees as the explicit structure prior. To study the effectiveness of different tree structures, we replace the parsing trees with trivial trees (i.e., binary balanced tree, left-branching tree and right-branching tree) in the encoders. Though trivial trees contain no syntactic information, those encoders get competitive or even better results on all of the ten downstream tasks we investigated. This surprising result indicates that explicit syntax guidance may not be the main contributor to the superior performance of tree-based neural sentence modeling. Further analysis shows that tree modeling gives better results when crucial words are closer to the final representation. Additional experiments give more clues on how to design an effective tree-based encoder. Our code is open-source and available at https://github.com/ExplorerFreda/TreeEnc.


1 Introduction

Sentence modeling is a crucial problem in natural language processing (NLP). Recurrent neural networks with long short-term memory Hochreiter and Schmidhuber (1997) or gated recurrent units Cho et al. (2014) are commonly used sentence modeling approaches. These models embed sentences into a vector space, and the resulting vectors can be used for classification or sequence generation in downstream tasks.

In addition to the plain sequence of hidden units, recent work on sequence modeling proposes to impose tree structure in the encoder Socher et al. (2013); Tai et al. (2015); Zhu et al. (2015). These tree-based LSTMs introduce the syntax tree as an intuitive structure prior for sentence modeling. They have already obtained promising results in many NLP tasks, such as natural language inference Bowman et al. (2016); Chen et al. (2017c) and machine translation Eriguchi et al. (2016); Chen et al. (2017a, b); Zhou et al. (2017). Li et al. (2015) empirically conclude that syntax tree-based sentence modeling is effective for tasks requiring relatively long-term context features.

On the other hand, some works propose to abandon the syntax tree and adopt a latent tree for sentence modeling Choi et al. (2018); Yogatama et al. (2017); Maillard et al. (2017); Williams et al. (2018). Such latent trees are directly learned from the downstream task with reinforcement learning Williams (1992) or Gumbel softmax Jang et al. (2017); Maddison et al. (2017). However, Williams et al. (2018) empirically show that Gumbel softmax produces unstable latent trees with the same hyper-parameters but different initializations, while reinforcement learning even tends to generate left-branching trees. Neither gives syntactically meaningful latent trees, but each method still obtains considerable improvements in performance. This indicates that syntax may not be the main contributor to the performance gains.

With the above observation, we raise the following questions: What really matters in tree-based sentence modeling? If tree structures are necessary for encoding sentences, what contributes most to the improvement in downstream tasks? We attempt to investigate the driving force behind the improvement brought by latent trees without syntax.

In this paper, we empirically study the effectiveness of tree structures in sentence modeling. We compare the performance of a bi-LSTM and five tree LSTM encoders with different tree layouts, including the syntax tree, the latent tree (from Gumbel softmax) and three kinds of designed trivial trees (binary balanced tree, left-branching tree and right-branching tree). Experiments are conducted on 10 different tasks, which are grouped into three categories, namely single sentence classification (5 tasks), sentence relation classification (2 tasks), and sentence generation (3 tasks). These tasks depend on different granularities of features, and the comparison among them can help us learn more about the results. We repeat all the experiments 5 times and take the average to reduce the instability caused by the random initialization of deep learning models.

We get the following conclusions:

  • Tree structures are helpful to sentence modeling on classification tasks, especially for tasks which need global (long-term) context features, which is consistent with previous findings Li et al. (2015).

  • Trivial trees outperform syntactic trees, indicating that syntax may not be the main contributor to the gains of tree encoding, at least on the ten tasks we investigate.

  • Further experiments show that, given strong priors, tree-based methods give better results when crucial words are closer to the final representation. If structure priors are unavailable, the balanced tree is a good choice, as it makes the path distances between each word and the sentence encoding roughly equal; in that case, the tree encoder can more easily learn by itself which words are crucial.

Figure 1: The encoder-classifier/decoder framework for three different groups of tasks: (a) encoder-decoder framework for sentence generation; (b) encoder-classifier framework for sentence classification; (c) Siamese encoder-classifier framework for sentence relation classification. We apply a multi-layer perceptron (MLP) for classification, and left-to-right decoders for generation in all experiments.

2 Experimental Framework

Figure 2: Examples of different tree structures for the encoder part: (a) parsing tree; (b) balanced tree; (c) Gumbel tree; (d) left-branching tree; (e) right-branching tree.

We show the applied encoder-classifier/decoder framework for each group of tasks in Figure 1. Our framework has two main components: the encoder part and the classifier/decoder part. In general, models encode a sentence into a fixed-length vector, and then apply the vector as the feature for classification or generation.

We fix the structure of the classifier/decoder, and propose to use five different types of tree structures for the encoder part including:

  • Parsing tree. We apply binary constituency tree as the representative, which is widely used in natural language inference Bowman et al. (2016) and machine translation Eriguchi et al. (2016); Chen et al. (2017a). Dependency parsing trees Zhou et al. (2015, 2016a) are not considered in this paper.

  • Binary balanced tree. To construct a binary balanced tree, we recursively divide a group of $n$ leaves into two contiguous groups of sizes $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$, until each group has only one leaf node left.

  • Gumbel trees, which are produced by straightforward Gumbel softmax models Choi et al. (2018). Note that Gumbel trees are not stable for a given sentence Williams et al. (2018), and we only draw one sample among all of them.

  • Left-branching trees. We combine two nodes from left to right to construct a left-branching tree, which is similar to those generated by the REINFORCE-based RL-SPINN model Williams et al. (2018).

  • Right-branching trees. In contrast to left-branching ones, nodes are combined from right to left to form a right-branching tree.
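The three trivial layouts in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: trees are represented as nested pairs, the function names are hypothetical, and the balanced split is assumed to put the larger half (ceil(n/2) leaves) on the left.

```python
from typing import List, Union

# A leaf is a word; an internal node is a (left, right) pair.
Tree = Union[str, tuple]


def balanced_tree(words: List[str]) -> Tree:
    """Recursively split into contiguous groups of ceil(n/2) and floor(n/2) leaves."""
    n = len(words)
    if n == 1:
        return words[0]
    mid = (n + 1) // 2  # ceil(n / 2); assumed to be the size of the left group
    return (balanced_tree(words[:mid]), balanced_tree(words[mid:]))


def left_branching_tree(words: List[str]) -> Tree:
    """Combine nodes from left to right: ((w1 w2) w3) ..."""
    tree: Tree = words[0]
    for word in words[1:]:
        tree = (tree, word)
    return tree


def right_branching_tree(words: List[str]) -> Tree:
    """Combine nodes from right to left: (w1 (w2 (w3 ...)))."""
    tree: Tree = words[-1]
    for word in reversed(words[:-1]):
        tree = (word, tree)
    return tree


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(balanced_tree(sentence))
    print(left_branching_tree(sentence))
    print(right_branching_tree(sentence))
```

None of these builders looks at the words themselves, which is what makes the trees "trivial": the layout depends only on sentence length.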

We show an intuitive view of the five types of tree structures in Figure 2. In addition, existing works Choi et al. (2018); Williams et al. (2018) show that using hidden states of bidirectional RNNs as leaf node representations (bi-leaf-RNN) instead of word embeddings may improve the performance of tree LSTMs, as leaf RNNs help encode context information more completely. Our framework also supports leaf RNNs for tree LSTMs.

3 Description of Investigated Tasks

We conduct experiments on 10 different tasks, which are grouped into 3 categories, namely single sentence classification (5 tasks), sentence relation classification (2 tasks), and sentence generation (3 tasks). Each of the tasks is compatible with the encoder-classifier/decoder framework shown in Figure 1. These tasks cover a wide range of NLP applications, and depend on different granularities of features.

Note that the datasets may use articles or paragraphs as instances, some of which consist of only one sentence. For each dataset, we only pick the subset of single-sentence instances for our experiments, and the detailed meta-data is in Table 1.

3.1 Sentence Classification

First, we introduce four text classification datasets from Zhang et al. (2015), including AG’s News, Amazon Review Polarity, Amazon Review Full and DBpedia. Additionally, noticing that parsing trees were shown to be effective Li et al. (2015) on the task of word-level semantic relation classification Hendrickx et al. (2009), we also add this dataset to our selection.

AG’s News (AGN).

Each sample in this dataset is an article, associated with a label indicating its topic: world, sports, business or sci/tech.

Amazon Review Polarity (ARP).

The Amazon Review dataset is obtained from the Stanford Network Analysis Project (SNAP; McAuley and Leskovec, 2013). It collects a large number of product reviews as paragraphs, each associated with a star rating from 1 (most negative) to 5 (most positive). In this dataset, 3-star reviews are dropped, while the others are classified into two groups: positive (4 or 5 stars) and negative (1 or 2 stars).

Amazon Review Full (ARF).

Similar to the ARP dataset, the ARF dataset is also collected from Amazon product reviews. Labels in this dataset are integers from 1 to 5.


DBpedia.

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia Lehmann et al. (2015). Zhang et al. (2015) select 14 non-overlapping classes from DBpedia 2014 to construct this dataset. Each sample is given by the title and abstract of the Wikipedia article, associated with the class label.

Word-Level Semantic Relation (WSR)

SemEval-2010 Task 8 Hendrickx et al. (2009) is to find semantic relationships between pairs of nominals. Each sample is given by a sentence in which two nominals are explicitly indicated, associated with a manually labeled semantic relation between the two nominals. For example, the sentence “My [apartment]$_{e_1}$ has a pretty large [kitchen]$_{e_2}$ .” has the label component-whole($e_2$, $e_1$). Instead of retrieving the path between the two nominals Li et al. (2015); Socher et al. (2013), we feed the entire sentence together with the nominal indicators (i.e., tags of $e_1$ and $e_2$) as words to the framework. We also ignore the order of $e_1$ and $e_2$ in the labels given by the dataset. Thus, this task becomes a 10-way classification task.
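The WSR preprocessing described above (nominal indicators fed as ordinary words, argument order dropped from the label) might look like the following sketch. The `<e1>`/`<e2>` markup follows the SemEval-2010 Task 8 release format; the function names and the regular-expression tokenizer are assumptions made for illustration.

```python
import re
from typing import List


def tokenize_with_tags(sentence: str) -> List[str]:
    """Split a tagged sentence so that each nominal tag becomes its own token."""
    # Try tags first, then word characters, then single punctuation marks.
    return re.findall(r"</?e[12]>|\w+|[^\w\s]", sentence)


def undirected_label(label: str) -> str:
    """Drop the argument order, e.g. 'Component-Whole(e2,e1)' -> 'Component-Whole'."""
    return label.split("(")[0]


if __name__ == "__main__":
    raw = "My <e1>apartment</e1> has a pretty large <e2>kitchen</e2> ."
    print(tokenize_with_tags(raw))
    print(undirected_label("Component-Whole(e2,e1)"))
```

Collapsing the two directions of each relation is what reduces the label set to the 10 classes used here.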

3.2 Sentence Relation Classification

To evaluate how well a model can capture semantic relation between sentences, we introduce the second group of tasks: sentence relation classification.

Natural Language Inference (NLI).

The Stanford Natural Language Inference (SNLI) Corpus Bowman et al. (2015) is a challenging dataset for sentence-level textual entailment. It has 550K training sentence pairs, as well as 10K for development and 10K for test. Each pair consists of two related sentences, associated with a label which is one of entailment, contradiction and neutral.

Conjunction Prediction (Conj).

Information about the coherence relation between two sentences is sometimes explicit in the text Miltsakaki et al. (2004): this is the case whenever the second sentence starts with a conjunction phrase. Jernite et al. (2017) propose a method to create a conjunction prediction dataset from an unlabeled corpus. They create a list of phrases, which can be classified into nine types, as conjunction indicators. The objective of this task is to recover the conjunction type given two sentences, which can be used to evaluate how well a model captures the semantic meaning of sentences. We apply the method proposed by Jernite et al. (2017) on the Wikipedia corpus to create our Conj dataset.

3.3 Sentence Generation

We also include sentence generation tasks in our experiments, to investigate the representation ability of different encoders over global (long-term) context features. Note that our framework is based purely on encoding, which differs from attention-based approaches.

Paraphrasing (Para).

The Quora Question Pair Dataset (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is widely used to evaluate paraphrasing models Wang et al. (2017); Li et al. (2017b). In this work, we treat the paraphrasing task as a sequence-to-sequence one, and evaluate on it with our sentence generation framework.

Machine Translation (MT).

Machine translation, especially cross-language-family machine translation, is a complex task, which requires models to capture the semantic meanings of sentences well. We apply a large and challenging English-Chinese sentence translation task for this investigation, which is adopted by a variety of neural translation work Tu et al. (2016); Li et al. (2017a); Chen et al. (2017a). We extract the parallel data from the LDC corpora (LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06), selecting 1.2M sentence pairs as our training set, and 20K and 80K as our development set and test set, respectively.

Auto-Encoding (AE).

We extract the English part of the machine translation dataset to form an auto-encoding task, which is also compatible with our encoder-decoder framework.

              #Sentence
Dataset       Train    Dev      Test     #Cls    Avg. Len
Sentence Classification
News          60K      6.7K     4.3K     4       31.5
ARP           128K     14K      16K      2       33.7
ARF           110K     12K      27K      5       33.8
DBpedia       106K     11K      15K      14      20.1
WSR           7.1K     891      2.7K     10      23.1
Sentence Relation
SNLI          550K     10K      10K      3       11.2
Conj          552K     10K      10K      9       23.3
Sentence Generation
Para          98K      2K       3K       N/A     10.2
MT            1.2M     20K      80K      N/A     34.1
AE            1.2M     20K      80K      N/A     34.1
Table 1: Meta-data of the downstream tasks we investigated. For each task, we list the number of instances in the train/dev/test set, the average length (in words) of sentences (source sentence only for generation tasks), as well as the number of classes if applicable.

4 Experiments

                    Sentence Classification                      Sentence Relation   Sentence Generation
Model               AGN      ARP      ARF      DBpedia  WSR      NLI      Conj       Para     MT       AE
Latent Trees
Gumbel              91.8     87.1     48.4     98.6     66.7     80.4     51.2       20.4     17.4     39.5
  +bi-leaf-RNN      91.8     88.1     49.7     98.7     69.2     82.9     53.7       20.5     22.3     75.3
(Constituency) Parsing Trees
Parsing             91.9     87.5     49.4     98.8     66.6     81.3     52.4       19.9     19.1     44.3
  +bi-leaf-RNN      92.0     88.0     49.6     98.8     68.6     82.8     53.4       20.4     22.2     72.9
Trivial Trees
Balanced            92.0     87.7     49.1     98.7     66.2     81.1     52.1       19.7     19.0     49.4
  +bi-leaf-RNN      92.1     87.8     49.7     98.8     69.6     82.6     54.0       20.5     22.3     76.0
Left-branching      91.9     87.6     48.5     98.7     67.8     81.3     50.9       19.9     19.2     48.0
  +bi-leaf-RNN      91.2     87.6     48.9     98.6     67.7     82.8     53.3       20.6     21.6     72.9
Right-branching     91.9     87.7     49.0     98.8     68.6     81.0     51.3       20.4     19.7     54.7
  +bi-leaf-RNN      91.9     87.9     49.4     98.7     68.7     82.8     53.5       20.9     23.1     80.4
Linear Structures
LSTM                91.7     87.8     48.8     98.6     66.1     82.6     52.8       20.3     19.1     46.9
  +bidirectional    91.7     87.8     49.2     98.7     67.4     82.8     53.3       20.2     21.3     67.0
Avg. Length         31.5     33.7     33.8     20.1     23.1     11.2     23.3       10.2     34.1     34.1
Table 2: Test results for different encoder architectures trained by a unified encoder-classifier/decoder framework. We report accuracy for classification tasks, and BLEU score (Papineni et al., 2002; word-level for English targets and char-level for Chinese targets) for generation tasks. Larger is better for both metrics. The best number(s) for each task are in bold. In addition, the average sentence length (in words) of each dataset is attached in the last row with underline.

In this section, we present our experimental results and analysis. Section 4.1 introduces our set-up for all the experiments. Section 4.2 shows the main results and analysis on ten downstream tasks grouped into three classes, which cover a wide range of NLP applications. Given that trivial tree-based LSTMs perform the best among all models, we draw two hypotheses: i) the right-branching tree benefits a lot from a strong structural prior; ii) the balanced tree wins because it treats all words fairly, so that crucial information can be more easily learned by the LSTM gates automatically. We test these hypotheses in Section 4.3. Finally, we compare the performance of linear and tree LSTMs with three widely applied pooling mechanisms in Section 4.4.

4.1 Set-up

In experiments, we fix the structure of the classifier as a two-layer MLP with ReLU activation (we observe that ReLU can significantly boost the performance of Bi-LSTM on SNLI), and the structure of the decoder as a GRU-based recurrent neural network Cho et al. (2014). The hidden-layer size of the MLP is fixed to 1024, while that of the GRU is adapted from the size of the sentence encoding. We initialize the word embeddings with 300-dimensional GloVe vectors Pennington et al. (2014) (http://nlp.stanford.edu/data/glove.840B.300d.zip). We apply a 300-dimensional bidirectional (600-dimensional in total) LSTM as the leaf RNN when necessary. We use the Adam optimizer Kingma and Ba (2015) to train all the models, with a learning rate of 1e-3 and a batch size of 64. In the training stage, we drop the samples in which either the source sentence or the target sentence is longer than 64. We do not apply any regularization or dropout in any experiment except on the task of WSR, on which we tune the dropout rate with respect to the development set. For the datasets without parsing trees, we generate binary parsing trees using ZPar Zhang and Clark (2011) (https://www.sutd.edu.sg/cmsresource/faculty/yuezhang/zpar.html). More details are summarized in the supplementary materials.
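For quick reference, the hyper-parameters stated in this set-up can be gathered into a small configuration, together with the training-time length filter. The values are taken from the paragraph above; the key names and the helper function are hypothetical and do not come from the released code.

```python
from typing import Dict, List, Tuple

# Values quoted from the set-up paragraph; key names are illustrative only.
CONFIG: Dict[str, float] = {
    "mlp_hidden_size": 1024,
    "word_embedding_dim": 300,          # GloVe vectors
    "leaf_rnn_dim_per_direction": 300,  # 600 in total for the bidirectional leaf LSTM
    "learning_rate": 1e-3,              # Adam
    "batch_size": 64,
    "max_length": 64,
}

SentencePair = Tuple[List[str], List[str]]


def drop_long_pairs(pairs: List[SentencePair], max_length: int) -> List[SentencePair]:
    """Keep a training pair only if neither side is longer than max_length tokens."""
    return [
        (source, target)
        for source, target in pairs
        if len(source) <= max_length and len(target) <= max_length
    ]


if __name__ == "__main__":
    pairs = [(["a"] * 10, ["b"] * 12), (["a"] * 70, ["b"] * 5)]
    print(len(drop_long_pairs(pairs, int(CONFIG["max_length"]))))
```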

4.2 Main Results

In this subsection, we compare the results from different encoders. We do not include any attention Wang et al. (2016); Lin et al. (2017) or pooling Collobert and Weston (2008); Socher et al. (2011); Zhou et al. (2016b) mechanism here, in order to avoid distractions and to make the encoder structure the main factor affecting the results. We further analyze pooling mechanisms in Section 4.4.

Table 2 presents the performances of different encoders on a variety of downstream tasks, which lead to the following observations:

Tree encoders are useful on some tasks.

We reach the same conclusion as Li et al. (2015) that tree-based encoders perform better on tasks requiring long-term context features. Setting aside the linearly structured left-branching and right-branching tree encoders, we find that tree-based encoders generally perform better than Bi-LSTMs on the tasks of sentence relation and sentence generation, which may require relatively more long-term context features to obtain better performance. However, the improvements of tree encoders on NLI and Para are relatively small, possibly because the sentences of these two tasks are shorter than the others, so the tree encoder has less opportunity to exploit its advantage in capturing long-term context.

Trivial tree encoders outperform other encoders.

Surprisingly, the binary balanced tree encoder gets the best results on most classification tasks, and the right-branching tree encoder tends to be the best on sentence generation. Note that the binary balanced tree and the right-branching tree are only trivial tree structures, yet they outperform syntactic tree and latent tree encoders. The latent tree is really competitive on some tasks, as its structure is directly tuned by the corresponding task. However, it only beats the binary balanced tree by very small margins on NLI and ARP. We analyze this in Section 4.3.

Larger quantity of parameters is not the only reason of the improvements.

Table 2 shows that tree encoders benefit a lot from adding a leaf LSTM, which brings not only sentence-level information to leaf nodes, but also more parameters than the bi-LSTM encoder has. However, the left-branching tree LSTM has a structure quite similar to a linear LSTM, and it can be viewed as a linear LSTM-on-LSTM structure. It has the same number of parameters as other tree-based encoders, but still falls behind the balanced tree encoder on most of the tasks. This indicates that a larger number of parameters is at least not the only reason for binary balanced tree LSTM encoders to improve over bi-LSTMs.

4.3 Why Trivial Trees Work Better?

The binary balanced tree and the right-branching tree are trivial ones, hardly containing any syntax information. In this section, we analyze in depth why these trees achieve high scores.

4.3.1 Right Branching Tree Benefits from Strong Structural Prior

Figure 3: Saliency visualization of words in learned MT and AE models. Darker means more important to the sentence encoding. Panels: (a) balanced tree, MT; (b) left-branching tree, MT; (c) right-branching tree, MT; (d) Bi-LSTM, MT; (e) balanced tree, AE; (f) left-branching tree, AE; (g) right-branching tree, AE; (h) Bi-LSTM, AE.

We argue that right-branching trees benefit from their strong structural prior. In sentence generation tasks, models generate sentences from left to right, which makes words on the left of the source sentence more important Sutskever et al. (2014). If the encoder fails to memorize the left words, the information about the right words would not help, due to error propagation. In right-branching trees, the left words of the sentence are closer to the final representation, which makes them easier to memorize; we call this the structure prior. Conversely, in left-branching trees, the right words of the sentence are closer to the representation.
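The notion of "closer to the final representation" can be made concrete by counting, for each word, the number of tree edges between its leaf and the root. The sketch below uses a nested-pair tree representation and hypothetical helper names; it only illustrates the path-length argument and is not part of the authors' code.

```python
from typing import List, Union

Tree = Union[str, tuple]  # a leaf is a word; an internal node is a (left, right) pair


def right_branching(words: List[str]) -> Tree:
    tree: Tree = words[-1]
    for word in reversed(words[:-1]):
        tree = (word, tree)
    return tree


def left_branching(words: List[str]) -> Tree:
    tree: Tree = words[0]
    for word in words[1:]:
        tree = (tree, word)
    return tree


def leaf_depths(tree: Tree, depth: int = 0) -> List[int]:
    """Return the distance from the root to each leaf, in left-to-right word order."""
    if isinstance(tree, str):
        return [depth]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


if __name__ == "__main__":
    words = "w1 w2 w3 w4 w5".split()
    print(leaf_depths(right_branching(words)))  # first word is nearest the root
    print(leaf_depths(left_branching(words)))   # last word is nearest the root
```

In the right-branching tree the first word sits one step from the root, while in the left-branching tree it is the last word that does.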

To validate our hypothesis, we propose to visualize the Jacobian as word-level saliency Shi et al. (2018), which can be viewed as the contribution of each word to the sentence encoding:

$$J(s, w) = \|\nabla_w s\|_1 = \sum_{i,j} \left| \frac{\partial s_i}{\partial w_j} \right|,$$

where $s$ denotes the embedding of a sentence, and $w$ denotes the embedding of a word. We can compute the saliency score using backward propagation. For a word in a sentence, a higher saliency score means more contribution to the sentence encoding.
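As a rough illustration of a Jacobian-based saliency score, the sketch below approximates the sum of absolute partial derivatives of a sentence embedding with respect to one word embedding. Several things here are assumptions: the use of an L1-style sum over all Jacobian entries, the toy encoder (a stand-in for a trained model), and central finite differences in place of the backward propagation mentioned in the text.

```python
import math
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[List[Vector]], Vector]


def toy_encoder(words: List[Vector]) -> Vector:
    """Hypothetical stand-in encoder: position-weighted sum followed by tanh."""
    dim = len(words[0])
    weights = [1.0 / (t + 1) for t in range(len(words))]  # earlier words weigh more
    return [math.tanh(sum(a * w[i] for a, w in zip(weights, words))) for i in range(dim)]


def saliency(encoder: Encoder, words: List[Vector], index: int, eps: float = 1e-5) -> float:
    """Approximate sum_{i,j} |d s_i / d w_j| for the word at position `index`."""
    total = 0.0
    for j in range(len(words[index])):
        plus = [list(w) for w in words]
        minus = [list(w) for w in words]
        plus[index][j] += eps
        minus[index][j] -= eps
        s_plus, s_minus = encoder(plus), encoder(minus)
        # Central difference for every output dimension i, then absolute value.
        total += sum(abs(p - m) / (2 * eps) for p, m in zip(s_plus, s_minus))
    return total


if __name__ == "__main__":
    sentence = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]]
    print([round(saliency(toy_encoder, sentence, t), 4) for t in range(len(sentence))])
```

With a real encoder one would obtain the same quantity from automatic differentiation; finite differences are used here only to keep the example dependency-free.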

We present the visualization in Figure 3 using the visualization tool from Lin et al. (2017). It shows that right-branching tree LSTM encoders tend to look at the left part of the sentence, which is very helpful to the final generation performance, as the left words are more crucial. Balanced trees also have this feature, and we think it is because the balanced tree treats all words fairly, so that crucial information can be more easily learned by the LSTM gates automatically.

However, the bi-LSTM and the left-branching tree LSTM also pay much attention to words on the right (especially the last two words), which may be caused by the short path from the right words to the root representation in the two corresponding structures.

Additionally, Table 3 shows that models trained with the same hyper-parameters but different initializations have strong agreement with each other. Thus, “looking at the first words” is a stable behavior of balanced and right-branching tree LSTM encoders in sentence generation tasks. So is “looking at the first and the last words” for Bi-LSTMs and left-branching tree LSTMs.

Model                       MT      AE
Balanced (BiLRNN)           93.1    96.9
Left-Branching (BiLRNN)     94.2    95.4
Right-Branching (BiLRNN)    92.3    95.1
Bi-LSTM                     96.4    96.1
Table 3: Mean average Pearson correlation across five models trained with the same hyper-parameters. For each testing sentence, we compute the saliency scores of words. Cross-model Pearson correlation shows the agreement of two models on one sentence, and the average Pearson correlation is computed over all sentences. We report the mean average Pearson correlation over the model pairs.
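The agreement measure in the caption of Table 3 can be sketched as follows: compute the Pearson correlation of two models' word saliency scores on each sentence, average over sentences, then average over all model pairs. The data layout (`saliency[m][k]` for model `m` and sentence `k`) and the function names are assumptions; the values in Table 3 appear to be this quantity scaled by 100.

```python
import math
from itertools import combinations
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length score lists (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_average_pearson(saliency: List[List[List[float]]]) -> float:
    """saliency[m][k] holds the word saliency scores of model m on sentence k."""
    pair_averages = []
    for a, b in combinations(range(len(saliency)), 2):
        per_sentence = [pearson(x, y) for x, y in zip(saliency[a], saliency[b])]
        pair_averages.append(sum(per_sentence) / len(per_sentence))
    return sum(pair_averages) / len(pair_averages)


if __name__ == "__main__":
    runs = [
        [[0.9, 0.4, 0.1], [0.8, 0.3, 0.2, 0.1]],
        [[0.8, 0.5, 0.2], [0.9, 0.2, 0.2, 0.1]],
        [[0.7, 0.3, 0.1], [0.7, 0.4, 0.1, 0.1]],
    ]
    print(round(mean_average_pearson(runs), 3))
```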

4.3.2 Binary Balanced Tree Benefits from Shallowness

Figure 4: $\rho$-depth and $\rho$-performance lines for three tasks. There is a trend that the depth drops and the performance rises with the growth of $\rho$. Panels: (a) $\rho$-depth line for WSR; (b) $\rho$-Acc. line for WSR; (c) $\rho$-depth line for MT; (d) $\rho$-BLEU line for MT; (e) $\rho$-depth line for AE; (f) $\rho$-BLEU line for AE.
Figure 5: Length-performance lines for the further investigated tasks. We divide test instances into several groups by length, and report the performance on each group respectively. Sentences with length in […] are put into the first group, and each later group […] covers the range of […] in length. Panels: (a) Length-Accuracy lines for WSR; (b) Length-BLEU lines for MT; (c) Length-BLEU lines for AE.

Compared to syntactic and latent trees, the only advantage of the balanced tree we can hypothesize is that it is shallower and more balanced than the others. Shallowness may lead to a shorter path for information propagation from the leaves to the root representation, and makes representation learning easier by reducing the errors accumulated in the propagation process. Balance makes the tree treat all leaf nodes fairly, which makes it easier to automatically select the crucial information over all words in a sentence.

To test our hypothesis, we conduct the following experiments. We select three tasks on which the binary balanced tree encoder beats Bi-LSTMs by a large margin (WSR, MT and AE). We generate random binary trees for sentences, while controlling the depth using a hyper-parameter $\rho$. We start with a group containing all $n$ words (nodes) in the sentence. At each step, we separate the nodes of a group into two contiguous groups of sizes $(\lceil n/2 \rceil, \lfloor n/2 \rfloor)$ with probability $\rho$, and into groups of sizes $(n-1, 1)$ with probability $1-\rho$. Trees generated with $\rho = 0$ are exactly left-branching trees, and those generated with $\rho = 1$ are binary balanced trees. The expected node depth of the tree becomes smaller as $\rho$ varies from 0 to 1.
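The random-tree procedure described above can be sketched as follows, under stated assumptions: the hyper-parameter (called `rho` here) is taken to be the probability of a balanced split into ceil(n/2) and floor(n/2) nodes, and the alternative is the left-branching split into n-1 and 1 nodes. Function names and the nested-pair tree representation are illustrative only.

```python
import random
from typing import List, Union

Tree = Union[str, tuple]  # a leaf is a word; an internal node is a (left, right) pair


def random_tree(words: List[str], rho: float, rng: random.Random) -> Tree:
    """Split balanced with probability rho, otherwise peel off the last word."""
    n = len(words)
    if n == 1:
        return words[0]
    if rng.random() < rho:
        split = (n + 1) // 2  # balanced split: ceil(n/2) and floor(n/2)
    else:
        split = n - 1         # left-branching split: n - 1 and 1
    return (random_tree(words[:split], rho, rng), random_tree(words[split:], rho, rng))


def leaf_depths(tree: Tree, depth: int = 0) -> List[int]:
    if isinstance(tree, str):
        return [depth]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


def mean_leaf_depth(tree: Tree) -> float:
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)


if __name__ == "__main__":
    rng = random.Random(0)
    words = [f"w{i}" for i in range(1, 17)]
    for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
        trials = [mean_leaf_depth(random_tree(words, rho, rng)) for _ in range(200)]
        print(rho, round(sum(trials) / len(trials), 2))
```

Averaging the mean leaf depth over many samples for each value gives a curve of the kind plotted in the depth panels of Figure 4.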

Figure 4 shows that, in general, trees with shallower node depth perform better on all three tasks (for binary trees, shallower also means more balanced), which supports our hypothesis that the binary balanced tree benefits from its shallow and balanced structure.

Additionally, Figure 5 demonstrates that binary balanced trees work especially well with relatively long sentences. As expected, on short-sentence groups, the performance gap between the Bi-LSTM and the binary balanced tree LSTM is not obvious, while it grows as the test sentences become longer. This explains why tree-based encoders give small improvements on NLI and Para: sentences in these two tasks are much shorter than in the others.

4.4 Can Pooling Replace Tree Encoder?

Figure 6: An illustration of the investigated self-attentive pooling mechanism: (a) balanced tree; (b) Bi-LSTM.

Max pooling Collobert and Weston (2008); Zhao et al. (2015), mean pooling Conneau et al. (2017) and self-attentive pooling (also known as self-attention; Santos et al., 2016; Liu et al., 2016; Lin et al., 2017) are three popular and efficient choices to improve sentence encoding. In this part, we will compare the performance of tree LSTMs and bi-LSTM on the tasks of WSR, MT and AE, with each pooling mechanism respectively, aiming to demonstrate the role that pooling plays in sentence modeling, and validate whether tree encoders can be replaced by pooling.

As shown in Figure 6, for linear LSTMs, we apply pooling mechanism to all hidden states; as for tree LSTMs, pooling is applied to all hidden states and leaf states of tree LSTMs. Implementation details are summarized in the supplementary materials.
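The three pooling mechanisms can be sketched over a list of hidden-state vectors as below. The max and mean variants are standard. The self-attentive variant is a simplified assumption (dot-product scores against a single hypothetical learned query vector, followed by a softmax); the paper's exact formulation is left to its supplementary materials.

```python
import math
from typing import List

Vector = List[float]


def max_pooling(states: List[Vector]) -> Vector:
    """Element-wise maximum over all states."""
    return [max(column) for column in zip(*states)]


def mean_pooling(states: List[Vector]) -> Vector:
    """Element-wise average over all states."""
    return [sum(column) / len(states) for column in zip(*states)]


def self_attentive_pooling(states: List[Vector], query: Vector) -> Vector:
    """Softmax-weighted sum of states; `query` stands in for a learned parameter."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    return [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
    print(max_pooling(hidden))
    print(mean_pooling(hidden))
    print(self_attentive_pooling(hidden, query=[1.0, -1.0]))
```

For a tree LSTM, `states` would hold both the internal-node hidden states and the leaf states, as described above.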

Table 4 shows that max and attentive pooling improve all the structures on the task of WSR, but pooling fails to help the tree encoders on MT and hurts every encoder on AE. These two tasks require the encoding to capture complete information about the sentence, while pooling may lose information along the way. The result indicates that, though pooling is effective on some tasks, it cannot fully reproduce the advantages brought by tree structures. Additionally, we think the attention mechanism shares the benefits of balanced tree modeling, as it also treats all words fairly and learns the crucial parts automatically. The paths from the representation to the words in attention are even shorter than those in the balanced tree. Thus the fact that attentive pooling outperforms balanced trees on WSR is not surprising to us.

Model                  WSR       MT        AE
Bi-LSTM                67.4      21.3      67.0
  +max-pooling         71.8 ↑    21.6 ↑    48.0 ↓
  +mean-pooling        64.3 ↓    21.8 ↑    47.8 ↓
  +self-attention      72.5 ↑    21.2 ↓    60.4 ↓
Parsing (BiLRNN)       68.6      22.2      72.9
  +max-pooling         69.7 ↑    21.8 ↓    48.3 ↓
  +mean-pooling        58.0 ↓    21.2 ↓    50.7 ↓
  +self-attention      72.2 ↑    21.5 ↓    69.1 ↓
Balanced (BiLRNN)      69.6      22.3      76.0
  +max-pooling         70.6 ↑    21.6 ↓    48.5 ↓
  +mean-pooling        54.1 ↓    21.3 ↓    52.7 ↓
  +self-attention      72.5 ↑    21.6 ↓    69.5 ↓
Left (BiLRNN)          67.7      21.6      72.9
  +max-pooling         71.2 ↑    20.5 ↓    47.6 ↓
  +mean-pooling        67.3 ↓    21.4 ↓    51.8 ↓
  +self-attention      72.1 ↑    21.6 –    70.2 ↓
Right (BiLRNN)         68.7      23.1      80.4
  +max-pooling         71.6 ↑    21.6 ↓    48.4 ↓
  +mean-pooling        67.2 ↓    22.1 ↓    53.9 ↓
  +self-attention      72.4 ↑    21.6 ↓    68.9 ↓
Table 4: Performance of tree and linear-structured encoders with or without pooling, on the selected three tasks. We report accuracy for WSR, char-level BLEU for MT and word-level BLEU for AE. All of the tree models have bidirectional leaf RNNs (BiLRNN). The best number(s) for each task are in bold. The up and down arrows indicate the increment or decrement of each pooling mechanism, against the baseline of the pure encoder with the same structure.

5 Discussions

Balanced trees for sentence modeling have been explored by Munkhdalai and Yu (2017) and Williams et al. (2018) in natural language inference (NLI). However, Munkhdalai and Yu (2017) focus on designing inter-attention on trees, instead of comparing the balanced tree with other linguistic trees in the same setting. Williams et al. (2018) do compare balanced trees with latent trees, but the balanced tree does not outperform the latent one in their experiments, which is consistent with ours. As analyzed in Section 4.2, sentences in NLI are too short for the balanced tree to show its advantage.

Levy et al. (2018) argue that LSTMs work because of the gates’ ability to compute an element-wise weighted sum. In that case, a tree LSTM can also be regarded as a special case of attention, especially for balanced-tree modeling, which also automatically selects the crucial information from all word representations. Kim et al. (2017) propose tree-structured attention networks, which combine the benefits of tree modeling and attention; the tree structures in their model are also learned rather than taken from syntax trees.

Although binary parsing trees do not produce better numbers than trivial trees on many downstream tasks, it is still worth noting that we are not claiming that parsing trees are useless; they are intuitively reasonable for human language understanding. A recent work Blevins et al. (2018) shows that RNN sentence encodings directly learned from downstream tasks can capture implicit syntax information. Their interesting result may explain why explicit syntactic guidance does not help tree LSTMs. In summary, we still believe in the potential of linguistic features to improve neural sentence modeling, and we hope our investigation can offer some guidance for future exploration of more effective tree-based encoders.

6 Conclusions

In this work, we empirically investigate what contributes most to tree-based neural sentence encoding. We find that trivial trees without syntax surprisingly give better results, compared to the syntax tree and the latent tree. Further analysis indicates that the balanced tree gains from its shallow and balanced properties compared to other trees, and that the right-branching tree benefits from its strong structural prior under the setting of a left-to-right decoder.


We thank Hang Li, Yue Zhang, Lili Mou and Jiayuan Mao for their helpful comments on this work, and the anonymous reviewers for their valuable feedback.


  • Blevins et al. (2018) Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs Encode Soft Hierarchical Syntax. In Proc. of ACL.
  • Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In Proc. of EMNLP.
  • Bowman et al. (2016) Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A Fast Unified Model for Parsing and Sentence Understanding. In Proc. of ACL.
  • Chen et al. (2017a) Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017a. Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder. In Proc. of ACL.
  • Chen et al. (2017b) Kehai Chen, Rui Wang, Masao Utiyama, Lemao Liu, Akihiro Tamura, Eiichiro Sumita, and Tiejun Zhao. 2017b. Neural Machine Translation with Source Dependency Representation. In Proc. of EMNLP.
  • Chen et al. (2017c) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017c. Enhanced LSTM for Natural Language Inference. In Proc. of ACL.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proc. of EMNLP.
  • Choi et al. (2018) Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to Compose Task-Specific Tree Structures. In Proc. of AAAI.
  • Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proc. of ICML.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proc. of EMNLP.
  • Eriguchi et al. (2016) Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In Proc. of ACL.
  • Hendrickx et al. (2009) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proc. of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In Proc. of ICLR.
  • Jernite et al. (2017) Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based Objectives for Fast Unsupervised Sentence Representation Learning. arXiv preprint arXiv:1705.00557.
  • Kim et al. (2017) Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured Attention Networks. In Proc. of ICLR.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. of ICLR.
  • Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia–A Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web.
  • Levy et al. (2018) Omer Levy, Kenton Lee, Nicholas FitzGerald, and Luke Zettlemoyer. 2018. Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum. In Proc. of ACL.
  • Li et al. (2015) Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When Are Tree Structures Necessary for Deep Learning of Representations? In Proc. of EMNLP.
  • Li et al. (2017a) Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017a. Modeling Source Syntax for Neural Machine Translation. In Proc. of ACL.
  • Li et al. (2017b) Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2017b. Paraphrase Generation with Deep Reinforcement Learning. arXiv preprint arXiv:1711.00279.
  • Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-Attentive Sentence Embedding. In Proc. of ICLR.
  • Liu et al. (2016) Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning Natural Language Inference using Bidirectional LSTM Model and Inner-Attention. arXiv preprint arXiv:1605.09090.
  • Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proc. of ICLR.
  • Maillard et al. (2017) Jean Maillard, Stephen Clark, and Dani Yogatama. 2017. Jointly Learning Sentence Embeddings and Syntax with Unsupervised Tree-LSTMs. arXiv preprint arXiv:1705.09189.
  • McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In Proc. of the 7th ACM Conference on Recommender Systems.
  • Miltsakaki et al. (2004) Eleni Miltsakaki, Rashmi Prasad, Aravind K. Joshi, and Bonnie L Webber. 2004. The Penn Discourse Treebank. In Proc. of LREC.
  • Mou et al. (2016) Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural Language Inference by Tree-Based Convolution and Heuristic Matching. In Proc. of ACL.
  • Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Neural Tree Indexers for Text Understanding. In Proc. of EACL.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. of ACL.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. of EMNLP.
  • Santos et al. (2016) Cicero dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive Pooling Networks. arXiv preprint arXiv:1602.03609.
  • Shi et al. (2018) Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning Visually-Grounded Semantics from Contrastive Adversarial Samples. In Proc. of COLING.
  • Socher et al. (2011) Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Proc. of NIPS.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proc. of EMNLP.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proc. of NIPS.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proc. of ACL-IJCNLP.
  • Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling Coverage for Neural Machine Translation. In Proc. of ACL.
  • Wang et al. (2016) Yequan Wang, Minlie Huang, Li Zhao, and Xiaoyan Zhu. 2016. Attention-Based LSTM for Aspect-Level Sentiment Classification. In Proc. of EMNLP.
  • Wang et al. (2017) Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral Multi-Perspective Matching for Natural Language Sentences. In Proc. of IJCAI.
  • Williams et al. (2018) Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do Latent Tree Learning Models Identify Meaningful Structure in Sentences? Transactions of the ACL.
  • Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. In Reinforcement Learning.
  • Yogatama et al. (2017) Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2017. Learning to Compose Words into Sentences with Reinforcement Learning. In Proc. of ICLR.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-Level Convolutional Networks for Text Classification. In Proc. of NIPS.
  • Zhang and Clark (2011) Yue Zhang and Stephen Clark. 2011. Syntactic Processing using the Generalized Perceptron and Beam Search. Computational Linguistics.
  • Zhao et al. (2015) Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-Adaptive Hierarchical Sentence Model. In Proc. of IJCAI.
  • Zhou et al. (2017) Hao Zhou, Zhaopeng Tu, Shujian Huang, Xiaohua Liu, Hang Li, and Jiajun Chen. 2017. Chunk-Based Bi-Scale Decoder for Neural Machine Translation. In Proc. of ACL.
  • Zhou et al. (2015) Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A Neural Probabilistic Structured-Prediction Model for Transition-based Dependency Parsing. In Proc. of ACL.
  • Zhou et al. (2016a) Hao Zhou, Yue Zhang, Shujian Huang, Junsheng Zhou, Xin-Yu Dai, and Jiajun Chen. 2016a. A Search-Based Dynamic Reranking Model for Dependency Parsing. In Proc. of ACL.
  • Zhou et al. (2016b) Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016b. Text Classification Improved by Integrating Bidirectional LSTM with Two-Dimensional Max Pooling. In Proc. of COLING.
  • Zhu et al. (2015) Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long Short-Term Memory Over Recursive Structures. In Proc. of ICML.

Appendix A Implementation Details

Our codebase is built on PyTorch 0.3.0 (https://pytorch.org/docs/0.3.0). All sentences were tokenized with SpaCy (https://spacy.io).

A.1 Sentence Encoding

We use LSTM-based sentence encodings as the extracted features of sentences for downstream classification or generation tasks. We use typical long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) units for linear structures, which can be summarized as:
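
For reference, a standard LSTM formulation consistent with the symbols defined in this section is sketched below. The gate names ($i_t$, $f_t$, $o_t$, $u_t$), the cell state $c_t$, and the parameter names ($W_*$, $U_*$, $b_*$) are assumptions for illustration; the exact parameterization in the authors' implementation may differ.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
u_t &= \tanh(W_u x_t + U_u h_{t-1} + b_u) \\
c_t &= i_t \odot u_t + f_t \odot c_{t-1} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```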

where $t$ indicates the time step of a state; $h_t$ is the hidden state and $x_t$ is the input vector. We apply binary tree LSTM units adapted from Zhu et al. (2015) for binary tree LSTMs, which can be summarized as:
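
As a rough guide, a commonly used binary tree LSTM unit can be written as below. This is a sketch under assumptions rather than the authors' exact equations: the subscripts $p$, $l$ and $r$ (parent, left child, right child), the separate forget gates $f_l$ and $f_r$, and the parameter names are all chosen here for illustration.

```latex
\begin{aligned}
i   &= \sigma(W_i [h_l; h_r] + b_i) \\
f_l &= \sigma(W_{f_l} [h_l; h_r] + b_{f_l}) \\
f_r &= \sigma(W_{f_r} [h_l; h_r] + b_{f_r}) \\
o   &= \sigma(W_o [h_l; h_r] + b_o) \\
u   &= \tanh(W_u [h_l; h_r] + b_u) \\
c_p &= i \odot u + f_l \odot c_l + f_r \odot c_r \\
h_p &= o \odot \tanh(c_p)
\end{aligned}
```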

where the subscript denotes the current state, and denote the left and right child states respectively. We also apply LSTM Hochreiter and Schmidhuber (1997) as leaf-node RNN when necessary.

It is worth noting that a left-branching tree LSTM without a leaf-node RNN is structurally equivalent to a unidirectional LSTM. The only difference between them, which may cause a slight difference in performance, comes from the implementation of the LSTM units.
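
To make the trivial structures concrete, the following minimal sketch builds left-branching, right-branching and balanced binary trees as nested tuples over a token list. The function names are hypothetical and not taken from the authors' codebase, and the split rule for the balanced tree (the left half receives the extra token when the length is odd) is an assumption.

```python
from typing import Any, Sequence


def left_branching(leaves: Sequence[Any]) -> Any:
    """Compose tokens strictly left to right: (((w0, w1), w2), w3)."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = (tree, leaf)
    return tree


def right_branching(leaves: Sequence[Any]) -> Any:
    """Compose tokens strictly right to left: (w0, (w1, (w2, w3)))."""
    tree = leaves[-1]
    for leaf in reversed(leaves[:-1]):
        tree = (leaf, tree)
    return tree


def balanced(leaves: Sequence[Any]) -> Any:
    """Recursively split the span in half (assumed split rule)."""
    if len(leaves) == 1:
        return leaves[0]
    mid = (len(leaves) + 1) // 2  # left half gets the extra token
    return (balanced(leaves[:mid]), balanced(leaves[mid:]))


def depth(tree: Any) -> int:
    """Number of composition steps on the longest leaf-to-root path."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))


tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(left_branching(tokens), depth(left_branching(tokens)))
print(balanced(tokens), depth(balanced(tokens)))
```

In the left-branching tree each internal node combines the running state with exactly one new token, which mirrors the recurrence of a unidirectional LSTM. Its depth grows linearly with sentence length, whereas the depth of the balanced tree grows logarithmically.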

The candidate set of dropout ratio we explore for the task of word-level semantic relation (WSR) is .

A.2 Sentence Relation Classification

In the task of sentence relation classification, the feature vector consists of the concatenation of two sentence vectors, their difference, and their element-wise product (Mou et al., 2016):
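
The feature construction described above can be sketched in plain Python as follows. The function name `relation_features` is hypothetical, and both the order of the four blocks and the use of a signed (rather than absolute) difference are assumptions, since the text does not specify them.

```python
from typing import List


def relation_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate u, v, their difference, and their element-wise product."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimension")
    diff = [a - b for a, b in zip(u, v)]  # signed difference (assumed)
    prod = [a * b for a, b in zip(u, v)]  # element-wise product
    return list(u) + list(v) + diff + prod


features = relation_features([1.0, 2.0], [3.0, 5.0])
print(len(features))  # four times the sentence-vector dimension
```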

A.3 Pooling Mechanism

Following Socher et al. (2011), we apply a pooling mechanism to all leaf states (of tree LSTMs) and hidden states. The pooling methods are described in detail as follows.

Max Pooling.

Max pooling takes the maximum value along each dimension:

where denotes a leaf state in tree LSTMs or a hidden state; for tree LSTMs and for linear LSTMs; denotes the final sentence encoding.

Mean Pooling.

Mean pooling (average pooling) takes the average of all hidden states as the sentence representation, which can be summarized as:
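
Both pooling operations can be illustrated with a short sketch that treats the states as a list of equal-length vectors. The function names are hypothetical; this is not the authors' implementation, which operates on PyTorch tensors.

```python
from typing import List


def max_pool(states: List[List[float]]) -> List[float]:
    """Take the maximum over all states, separately for each dimension."""
    return [max(column) for column in zip(*states)]


def mean_pool(states: List[List[float]]) -> List[float]:
    """Average all states, separately for each dimension."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


# Three states (leaf states or hidden states) of dimension 2.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
print(max_pool(states))
print(mean_pool(states))
```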


We follow Conneau et al. (2017) and Lin et al. (2017) to build a self-attentive mechanism, which can be summarized as:

where denotes attention weights computed by learned parameters and . In all experiments, is a 128-d vector.
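
A self-attentive pooling layer in the spirit of the cited work can be sketched as below. The parameterization is an assumption: each state is scored as `v · tanh(W h)`, the scores are normalized with a softmax, and the sentence encoding is the weighted sum of the states. The names `W` and `v` and the absence of a bias term are illustrative choices, not details confirmed by the text.

```python
import math
from typing import List, Tuple


def self_attentive_pool(
    states: List[List[float]], W: List[List[float]], v: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (sentence encoding, attention weights) for a list of states."""
    scores = []
    for h in states:
        # Project the state with W, apply tanh, then score it against v.
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(a * z for a, z in zip(v, hidden)))
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the states, dimension by dimension.
    dim = len(states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, states)) for k in range(dim)]
    return pooled, weights


states = [[1.0, 0.0], [3.0, 2.0]]
pooled, weights = self_attentive_pool(states, W=[[1.0, 0.0]], v=[1.0])
print(pooled, weights)
```

With an all-zero `v`, every state receives the same score, so the weights are uniform and the result reduces to mean pooling.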