Recently, as deep neural models have become popular in the NLP research community, learning distributed sentence representations has become a basic but crucial problem for a variety of NLP tasks, including but not limited to sentence classification [Kim2014, Tai, Socher, and Manning2015], question answering [Tan et al.2016, Shen, Yang, and Deng2017], and natural language inference [Yu and Munkhdalai2017, Lin et al.2017]. Recurrent neural networks such as LSTM [Hochreiter and Schmidhuber1997] and convolutional neural networks (CNN) [Kim2014] have been adapted to model sentences sequentially. However, models of this type cannot always achieve the best performance because they ignore syntactic structure.
In contrast, RecNN [Socher et al.2011] assigns a vector representation to each word at the leaf nodes of a pre-obtained syntactic parse tree and recursively composes word/phrase pairs to obtain a phrase representation at each non-leaf node. The final representation of the root node is regarded as the sentence representation. This reflects a completely different perspective: natural language allows speakers to determine the meaning of a longer expression from the meanings of its words and the rules used to combine them [Socher et al.2012].
Shortly after the standard RecNN was presented, researchers became aware that, unlike in RNNs, a single group of shared parameters for the semantic composition function limits the capacity of RecNN, because different types of word/phrase pairs require quite different composition rules. Intuitively, for example, the composition function of a verb-noun pair should differ from that of an adverb-adjective pair. Therefore, several RecNN-based models [Socher et al.2012, Socher et al.2013a, Socher et al.2013b, Dong et al.2014, Qian et al.2015, Huang, Qian, and Zhu2017] have been proposed to model the diversity and complexity of semantic composition. Overall, these RecNN variants can be divided into two classes, which partition the composition function according to implicit and explicit rules respectively. The first class of models [Socher et al.2012, Socher et al.2013b, Dong et al.2014] does not take syntactic information into consideration. These models therefore have to determine a suitable composition function without any guidance, which makes them more difficult to learn. The other class of models [Socher et al.2013a, Qian et al.2015, Huang, Qian, and Zhu2017] defines untied composition functions for different Part-Of-Speech (POS) tags, which represent the syntactic roles of words and phrases. These models can select the corresponding function according to the tag information at each node. However, the number of parameters is one or two orders of magnitude larger than that of the standard RecNN, so these models easily suffer from overfitting.
In this paper, we propose Tag-Guided Hyper RecNN/TreeLSTM (TG-HRecNN/TreeLSTM) to learn a dynamic semantic composition function over tree structures with the help of POS tags. We introduce the hypernetwork framework [Ha, Dai, and Le2017] into RecNN/TreeLSTM. In the proposed models, a main RecNN/TreeLSTM, whose parameters are tag-specific, composes word/phrase pairs and learns sentence representations as ordinary RecNNs do, while a hyper RecNN/TreeLSTM dynamically predicts the parameters of the main network under the guidance of POS tags. Our work is inspired by recent progress in dynamic parameter prediction [Bertinetto et al.2016, Jia et al.2016, Ha, Dai, and Le2017], and the motivation is two-fold. First, to obtain tag-specific composition functions without significantly increasing the number of parameters, a good choice is a hypernetwork that has an architecture similar to the main RecNN and takes as inputs the factors that determine the semantic composition rules. These factors include the POS tags of nodes and hidden vectors, both of which have proved useful in previous work. Liu, Qiu, and Huang (2017b) conducted pioneering work that first combined dynamic parameter prediction with RecNN. However, their prediction network, called the meta network, takes only hidden vectors as inputs, so it cannot capture the syntactic information discussed above. Their empirical results are not strong, so it is meaningful to find a way of combining dynamic parameter prediction with RecNN that also achieves excellent performance.
Second, we have observed that the tag distribution is highly imbalanced across datasets. For example, in the Stanford Sentiment Treebank [Socher et al.2013b] corpus, the frequencies of different POS tags range from fewer than ten to tens of thousands. There are over 70 different tag types in total, and the number of tag configurations is much larger once the combinations of parent and child nodes are considered. It is unrealistic to establish a composition function explicitly for each configuration. To avoid the overfitting caused by a discrete representation of POS tags, we represent tags with low-dimensional distributed vectors, which we term tag embeddings.
We focus on learning sentence representations over binarized constituency parse trees, in which each non-leaf node has exactly two children. Figure 1 gives an example of a constituency tree. Note that when sentences are parsed into constituency trees, the POS tags of words and phrases are obtained simultaneously, so our models need no preprocessing other than parsing. In summary, our contributions are as follows:
- To improve the semantic composition process over tree structures, we introduce the hypernetwork framework into RecNN and propose two novel models: TG-HRecNN and TG-HTreeLSTM. The semantic composition parameters of our models are predicted dynamically under the guidance of POS tag information.
- We design an information fusion layer that incorporates POS tag and semantic information at each node to guide the composition process. The clear gap between our models and DC-RecNN/TreeLSTM [Liu, Qiu, and Huang2017b] demonstrates the effectiveness of tag information in dynamic parameter prediction.
- Experiments on five datasets for two typical NLP tasks show that both proposed models consistently obtain significant improvements over RecNN and TreeLSTM. In particular, TG-HTreeLSTM outperforms all existing RecNN-based models and achieves or is competitive with state-of-the-art performance on four sentence classification benchmarks. A qualitative analysis illustrates how our model works.
Recursive Neural Networks
Since RecNN [Socher et al.2011] was proposed, several works have focused on improving the composition function over tree structures. Socher et al. (2012) replace vectors with matrix-vector pairs to represent nodes, so that matrix-vector multiplications can model semantic composition flexibly and adaptively. Socher et al. (2013b) use a more complex bilinear neural tensor layer as the composition function. Dong et al. (2014) improve RecNN by learning multiple composition functions whose outputs are combined in a weighted sum with self-adaptive weights. However, the large number of parameters makes these models difficult to learn. Besides these implicit ways of making the composition function more flexible, Socher et al. (2013a) and Qian et al. (2015) establish a different composition function for each type of syntactic constituent. Qian et al. (2015) also propose a Tag Embedded RNN (TE-RNN), which first uses embedding vectors to represent the POS tags of words/phrases and then takes the tag vectors as inputs of the composition function. A similar idea of exploiting POS tags has also been introduced into TreeLSTM [Huang, Qian, and Zhu2017]. Liu, Qiu, and Huang (2017a) and Liu et al. (2017) aim to address non-compositional phenomena and compose non-compositional phrases with a special function.
In addition, there are other explorations on enhancing RecNN. The LSTM cell can also help to learn long-term dependencies over trees [Tai, Socher, and Manning2015]. Teng and Zhang (2017) propose a bidirectional version of TreeLSTM.
Dynamic Parameter Prediction
The idea of dynamic parameter prediction, in which one network modifies the weights of another, is closely related to the concept of fast weights [Schmidhuber1992], in which one network produces context-dependent weight changes for a second network. Recently, this idea has drawn researchers' attention again because of the renaissance of deep neural networks. Bertinetto et al. (2016) attempt to learn example-dependent network weights for one-shot learning. Jia et al. (2016) introduce a framework that dynamically generates CNN filters depending on the network inputs. Ha, Dai, and Le (2017) propose a hypernetwork framework to generate weights for recurrent networks, which can be seen as a form of relaxed weight sharing in the time dimension. Because of the similarity between RNN and RecNN, we construct our model based on this framework. Liu, Qiu, and Huang (2017b) were the first to employ dynamic parameter prediction to improve RecNN, but their parameters are generated from limited information, so state-of-the-art performance is not achieved.
The hypernetwork framework [Ha, Dai, and Le2017] was proposed to relax the parameter sharing of recurrent networks. It contains two RNNs, called the hyper RNN and the main RNN. The hyper RNN is a standard RNN that dynamically predicts the parameters of the main RNN, while the main RNN models input sequences, such as sentences, as an ordinary RNN does. The update formulation of the main RNN is given by:
All vectors $z$ are outputs of the hyper RNN, and the parameter matrices $W_h$, $W_x$ and the bias $b$ are functions of the corresponding vectors $z_h$, $z_x$ and $z_b$. The input $x_t$, hidden state $h_t$ and nonlinearity $\phi$ are the same as in a standard RNN. The update formulation of the hyper RNN is similar to that of a standard RNN:
where the vectors $z_h$, $z_x$ and $z_b$ are linear functions of the hidden state of the hyper RNN:
Directly projecting a vector $z_h$ into the full matrix $W_h(z_h)$ may be impractical, because it requires maintaining a learnable three-dimensional tensor whose memory usage becomes too large for real problems. An approximation is to revise equation 1 in the following way:
We can see that the predicted vectors now modify the corresponding parameters by linearly scaling each row of a weight matrix by one element of the vector. Although this reduces the degrees of freedom of the parameter prediction process, the memory usage becomes affordable. Note that in practice the dimension of the hypernetwork is much lower than that of the main network, so the total number of model parameters does not increase significantly.
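To make the row-scaling approximation concrete, the following is a minimal pure-Python sketch of one hypernetwork RNN step. It is an illustration under stated assumptions rather than the implementation of Ha, Dai, and Le: the parameter names (`U_hat`, `V_hat`, `Z_*`, `D_*`), the toy sizes, the use of tanh, and the choice to compute the scaling vectors from the current hyper hidden state are all assumptions made here.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(n_in, n_main, n_hyper, n_z, seed=0):
    """Random toy parameters; every name and size is a hypothetical choice."""
    rng = random.Random(seed)
    p = {
        # Hyper RNN: an ordinary RNN over the concatenation [h_prev; x].
        "U_hat": rand_matrix(n_hyper, n_hyper, rng),
        "V_hat": rand_matrix(n_hyper, n_main + n_in, rng),
        # Main RNN weights (shared across time steps) and static bias offset.
        "W_h": rand_matrix(n_main, n_main, rng),
        "W_x": rand_matrix(n_main, n_in, rng),
        "b0": [0.0] * n_main,
    }
    for name in ("h", "x", "b"):
        # Hyper hidden state -> low-dimensional vector z.
        p["Z_" + name] = rand_matrix(n_z, n_hyper, rng)
        # z -> one value per row of the main weight matrix (or per bias entry).
        p["D_" + name] = rand_matrix(n_main, n_z, rng)
    return p


def step(p, h_prev, h_hat_prev, x):
    """One time step; returns the new main and hyper hidden states."""
    x_hat = h_prev + x  # list concatenation [h_prev; x]
    h_hat = [math.tanh(a + c) for a, c in
             zip(matvec(p["U_hat"], h_hat_prev), matvec(p["V_hat"], x_hat))]
    d_h = matvec(p["D_h"], matvec(p["Z_h"], h_hat))
    d_x = matvec(p["D_x"], matvec(p["Z_x"], h_hat))
    bias = [u + v for u, v in
            zip(matvec(p["D_b"], matvec(p["Z_b"], h_hat)), p["b0"])]
    # Scaling row i of W by d[i] gives the same result as d[i] * (W v)[i].
    wh = matvec(p["W_h"], h_prev)
    wx = matvec(p["W_x"], x)
    h = [math.tanh(dh * a + dx * c + b)
         for dh, a, dx, c, b in zip(d_h, wh, d_x, wx, bias)]
    return h, h_hat


params = init_params(n_in=4, n_main=6, n_hyper=3, n_z=2)
h, h_hat = [0.0] * 6, [0.0] * 3
for x_t in ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]):
    h, h_hat = step(params, h, h_hat, x_t)
```

Because scaling row i of a weight matrix by `d[i]` equals multiplying the i-th entry of the matrix-vector product by `d[i]`, the sketch never builds a per-step weight matrix. For each predicted matrix, the hypernetwork stores a projection with N_h × N_z entries instead of a tensor with N_h × N_h × N_z entries, where N_h and N_z (names assumed here) are the main hidden size and the size of z.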
RecNNs model sentences over tree structures. They use a single group of shared parameters to compose word/phrase pairs recursively, which limits their expressive power. Inspired by the fact that word/phrase pairs with different POS tags require different semantic composition functions, we introduce the hypernetwork framework into the vanilla RecNN and TreeLSTM and propose Tag-Guided Hyper RecNN and Tag-Guided Hyper TreeLSTM respectively. Like the hypernetwork framework for RNNs, both consist of a hyper network and a main network. Figure 2 illustrates the proposed model. At each node of the tree, adaptive composition parameters are predicted dynamically according to POS tag information.
Tag-Guided HyperRecNN (TG-HRecNN)
For a non-leaf node $p$ in a constituency parse tree, RecNN obtains its hidden state $h_p$ by composing the hidden states of its left and right children, $h_l$ and $h_r$ respectively. The composition function is a simple affine transformation followed by a nonlinearity:
where $W$ is the parameter matrix, $b$ is the bias, and $\phi$ is a standard element-wise nonlinearity.
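The recursive composition described above can be sketched in a few lines of pure Python. This is a minimal illustration, assuming tanh as the nonlinearity and a nested-tuple tree encoding; the function names, toy embeddings and parameter values are hypothetical.

```python
import math


def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def compose(W, b, h_left, h_right):
    """h_p = tanh(W [h_l; h_r] + b), with one weight matrix shared by all nodes."""
    pre = matvec(W, h_left + h_right)  # '+' concatenates the two lists
    return [math.tanh(a + c) for a, c in zip(pre, b)]


def encode(tree, embeddings, W, b):
    """Return the hidden state of the root of a binary tree.

    A leaf is a word (str); an internal node is a (left, right) tuple.
    """
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return compose(W, b,
                   encode(left, embeddings, W, b),
                   encode(right, embeddings, W, b))


# Toy 2-dimensional embeddings and parameters (hypothetical values).
embeddings = {"a": [0.1, 0.2], "perfect": [0.5, -0.3], "film": [-0.2, 0.4]}
W = [[0.1, 0.2, 0.3, 0.4],
     [-0.4, -0.3, -0.2, -0.1]]
b = [0.0, 0.1]

# (a (perfect film)): the JJ+NN pair is composed first, then DT+NP.
sentence_vector = encode(("a", ("perfect", "film")), embeddings, W, b)
```

Note that the same `W` and `b` are used for the adjective-noun pair and the determiner-phrase pair, which is exactly the parameter sharing that the tag-guided models relax.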
TG-HRecNN consists of a main RecNN and a hyper RecNN. The hyper RecNN is similar to a standard RecNN, and its composition function is given by:
Compared with the vanilla RecNN, there is an additional input $\hat{x}_p$, which carries the POS tag information used to determine the semantic composition at the current node. We design an information fusion layer to compute $\hat{x}_p$ and defer its description to Sec 4.3. $\hat{h}_p$ denotes the hidden state of the hyper RecNN, and its weight matrices and bias are learnable parameters.
Similar to the main RNN in equation 5, the composition function of the main RecNN has dynamic parameters:
The scaling vectors are computed from the hidden state of the hyper RecNN through linear transformations, as in equation 4:
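Putting the pieces of this subsection together, one node of a tag-guided hyper RecNN could look roughly like the sketch below. It is only a plausible reading of the description, not the authors' code: the parameter names, the single hyper matrix over the concatenation of both hyper child states and the tag-aware input, the zero hyper states at the leaves, and the tanh nonlinearity are all assumptions.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(n_main, n_hyper, n_tag_input, seed=0):
    """Toy parameters; every name and size is a hypothetical choice."""
    rng = random.Random(seed)
    return {
        # Hyper RecNN over [hyper_left; hyper_right; x_hat].
        "W_hat": rand_matrix(n_hyper, 2 * n_hyper + n_tag_input, rng),
        "b_hat": [0.0] * n_hyper,
        # Hyper state -> one scaling factor per row of the main matrix.
        "D": rand_matrix(n_main, n_hyper, rng),
        # Hyper state -> dynamic part of the main bias.
        "B": rand_matrix(n_main, n_hyper, rng),
        "b0": [0.0] * n_main,
        # Main RecNN matrix over [h_left; h_right], shared by all nodes.
        "W": rand_matrix(n_main, 2 * n_main, rng),
    }


def compose(p, left, right, x_hat):
    """Compose two children, each given as a (main_state, hyper_state) pair."""
    (h_l, g_l), (h_r, g_r) = left, right
    # Hyper RecNN: combine the children's hyper states with the tag-aware input.
    pre = matvec(p["W_hat"], g_l + g_r + x_hat)
    g_p = [math.tanh(a + c) for a, c in zip(pre, p["b_hat"])]
    # Predict the row-scaling vector and the bias of the main RecNN.
    d = matvec(p["D"], g_p)
    bias = [u + v for u, v in zip(matvec(p["B"], g_p), p["b0"])]
    # Main RecNN: the shared matrix W with each row scaled by d.
    wh = matvec(p["W"], h_l + h_r)
    h_p = [math.tanh(di * a + bi) for di, a, bi in zip(d, wh, bias)]
    return h_p, g_p


params = init_params(n_main=4, n_hyper=3, n_tag_input=2)
leaf_a = ([0.2, -0.1, 0.4, 0.3], [0.0] * 3)
leaf_b = ([-0.3, 0.5, 0.1, -0.2], [0.0] * 3)
# The same pair of children composed under two different tag-aware inputs.
h_first, _ = compose(params, leaf_a, leaf_b, [1.0, 0.0])
h_second, _ = compose(params, leaf_a, leaf_b, [0.0, 1.0])
```

The two calls share every learnable parameter, yet they yield different parent states because the tag-aware input changes the predicted scaling vector. This is the sense in which the composition function becomes tag-specific without a separate matrix per tag.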
Tag-Guided HyperTreeLSTM (TG-HTreeLSTM)
Tai, Socher, and Manning (2015) adapt the standard LSTM cell into a TreeLSTM cell, which can be used as the composition function over constituency parse trees. The hidden state $h_p$ is now calculated as follows:
where $c_p$ denotes the memory cell and $x_p$ is the input of node $p$, which is the word embedding at leaf nodes and zero at non-leaf nodes. $\sigma$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication. $i$, $f$ and $o$ are termed the input gate, forget gate and output gate respectively. The superscripts $l$ and $r$ denote the left child and the right child respectively. For binarized trees, this model computes two untied forget gates, one for each child. All weight matrices and biases are learnable parameters.
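As a concrete reference for the cell just described, here is a minimal pure-Python sketch of a binary TreeLSTM cell with two forget gates. The parameter layout (one `(W, U_l, U_r, b)` tuple per gate) and all names are assumptions made for illustration; in particular, the input projection of the two forget gates is kept untied here only for brevity.

```python
import math


def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


# Input gate, left/right forget gates, output gate, candidate update.
GATES = ("i", "f_l", "f_r", "o", "u")


def zero_params(n_in, n_hidden):
    """All-zero parameters, convenient for checking the cell by hand."""
    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]
    return {g: (zeros(n_hidden, n_in), zeros(n_hidden, n_hidden),
                zeros(n_hidden, n_hidden), [0.0] * n_hidden) for g in GATES}


def tree_lstm_cell(p, x, left, right):
    """Compose two children, each given as a (hidden, memory) pair."""
    (h_l, c_l), (h_r, c_r) = left, right
    pre = {}
    for g in GATES:
        W, U_l, U_r, b = p[g]
        pre[g] = [a + s + t + bias for a, s, t, bias in
                  zip(matvec(W, x), matvec(U_l, h_l), matvec(U_r, h_r), b)]
    i, f_l, f_r, o = (sigmoid(pre[g]) for g in ("i", "f_l", "f_r", "o"))
    u = [math.tanh(a) for a in pre["u"]]
    # Each child's memory passes through its own forget gate.
    c = [ig * ug + fl * cl + fr * cr
         for ig, ug, fl, cl, fr, cr in zip(i, u, f_l, c_l, f_r, c_r)]
    h = [og * math.tanh(cg) for og, cg in zip(o, c)]
    return h, c


params = zero_params(n_in=1, n_hidden=1)
# Non-leaf node: the input is zero and both children carry states.
h, c = tree_lstm_cell(params, [0.0], ([0.3], [1.0]), ([0.1], [3.0]))
```

With all-zero parameters every gate equals 0.5 and the candidate update is 0, so the parent memory is the average of the two child memories. This gives a quick sanity check before real weights are plugged in.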
TG-HTreeLSTM likewise consists of a main TreeLSTM and a hyper TreeLSTM. The composition function of the hyper TreeLSTM is almost identical to that of TreeLSTM, except for the definition of its input $\hat{x}_p$, which is described in Sec 4.3:
where the weight matrices and biases are learnable parameters. We then use the hidden state $\hat{h}_p$ of the hyper TreeLSTM to dynamically predict the parameters of the main TreeLSTM:
All scaling vectors are computed through linear transformations as in equation 9, and the hidden state of the hyper TreeLSTM is obtained by equations 11-12.
Information Fusion Layer
To enable the hyper network to guide the semantic composition of the main network, we incorporate both syntactic information and semantic representations into $\hat{x}_p$, the input of the hyper network at each node in equations 6 and 12. For a non-leaf node $p$, we denote its tag embedding by $e_p$ and the tag embeddings of its children by $e_l$ and $e_r$ respectively. We regard these three tag embeddings as the syntactic information that determines the composition function at each node. We also make use of semantic information about the node. In equation 3, the hyper RNN [Ha, Dai, and Le2017] uses the hidden state of the previous time step and the current input of the main RNN. However, the input at a non-leaf node is zero. To fill this gap, we resort to head lexicalization as used in lexicalized PCFG parsers. A non-leaf node is associated with a head word, which is the head word of one of its children according to pre-defined rules. Teng and Zhang (2017) first exploited head lexicalization in neural networks, in a soft gated way. We calculate the head word of a non-leaf node in a similar way, while using tag embeddings as additional inputs:
where $w_l$ and $w_r$ are the head words of the two child nodes, the gate controls the composition of the head words adaptively, and the weight matrix and bias are learnable parameters. We then calculate $\hat{x}_p$ with two heuristic strategies. The first is to directly concatenate all tag embeddings and semantic representations as follows:
where ReLU denotes the rectified linear unit used as the nonlinearity, and the weight matrix and bias are learnable parameters. The other strategy is to project the tag embeddings and the semantic representation separately and then apply an element-wise multiplication between the two projections:
where the two projection matrices are learnable parameters. We refer to these two strategies as concat and multi, respectively.
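The two fusion strategies and the gated head word can be sketched as follows. This is a hedged illustration rather than the exact layer: the gate inputs, the absence of a nonlinearity in the multi variant, and all parameter names (`W_g`, `W_c`, `W_t`, `W_s` and their biases) are assumptions introduced here.

```python
import math
import random


def affine(W, b, v):
    """Compute W v + b; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(n_word, n_tag, n_out, seed=0):
    """Toy parameters; every name and size is a hypothetical choice."""
    rng = random.Random(seed)
    return {
        "W_g": rand_matrix(n_word, 2 * n_word + 2 * n_tag, rng), "b_g": [0.0] * n_word,
        "W_c": rand_matrix(n_out, 3 * n_tag + n_word, rng), "b_c": [0.0] * n_out,
        "W_t": rand_matrix(n_out, 3 * n_tag, rng), "b_t": [0.0] * n_out,
        "W_s": rand_matrix(n_out, n_word, rng), "b_s": [0.0] * n_out,
    }


def head_word(p, w_l, w_r, e_l, e_r):
    """Soft selection between the children's head words, guided by their tags."""
    pre = affine(p["W_g"], p["b_g"], w_l + w_r + e_l + e_r)
    gate = [1.0 / (1.0 + math.exp(-a)) for a in pre]
    return [g * a + (1.0 - g) * c for g, a, c in zip(gate, w_l, w_r)]


def fuse_concat(p, e_p, e_l, e_r, w_p):
    """concat: one projection of all tag embeddings and the head word, then ReLU."""
    return [max(0.0, a) for a in affine(p["W_c"], p["b_c"], e_p + e_l + e_r + w_p)]


def fuse_multi(p, e_p, e_l, e_r, w_p):
    """multi: separate projections combined by element-wise multiplication."""
    tags = affine(p["W_t"], p["b_t"], e_p + e_l + e_r)
    words = affine(p["W_s"], p["b_s"], w_p)
    return [t * s for t, s in zip(tags, words)]


params = init_params(n_word=3, n_tag=2, n_out=4)
e_p, e_l, e_r = [0.1, 0.3], [0.5, -0.2], [-0.4, 0.2]
w_l, w_r = [0.2, -0.6, 0.1], [0.7, 0.3, -0.5]
w_p = head_word(params, w_l, w_r, e_l, e_r)
x_hat_concat = fuse_concat(params, e_p, e_l, e_r, w_p)
x_hat_multi = fuse_multi(params, e_p, e_l, e_r, w_p)
```

Since the gate lies strictly between 0 and 1, each dimension of the parent's head word is an interpolation of the two children's head words, which is what makes the selection soft rather than rule-based.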
| Model | SST-1 | SST-2 | MR | SUBJ | TREC | SICK |
|---|---|---|---|---|---|---|
| Bidirectional LSTM [Tai, Socher, and Manning2015] | 49.1 | 87.5 | - | - | - | - |
| RecNN [Socher et al.2011] | 43.2 | 82.4 | 76.4 | 91.8 | 90.2 | 74.9 |
| MV-RNN [Socher et al.2012] | 44.4 | 82.9 | - | - | - | 75.5 |
| RNTN [Socher et al.2013b] | 45.7 | 86.4 | - | - | - | 76.9 |
| AdaMC-RNN [Dong et al.2014] | 45.8 | 87.1 | - | - | - | - |
| TG-RNN [Qian et al.2015] | 46.1 | 86.2 | 76.4 | - | - | - |
| TE-RNN [Qian et al.2015] | 47.8 | 86.5 | 77.9 | - | - | - |
| TreeLSTM [Tai, Socher, and Manning2015] | 51.0 | 88.0 | 81.2 | 93.2 | 93.6 | 77.5 |
| AdaHT-LSTM [Liu, Qiu, and Huang2017a] | 50.2 | 87.8 | 81.9 | 94.1 | - | - |
| iTLSTM [Liu et al.2017] | 51.2 | 88.2 | 82.5 | 94.5 | - | - |
| TE-LSTM [Huang, Qian, and Zhu2017] | 52.6 | 89.6 | 82.2 | - | - | - |
| BiTreeLSTM [Teng and Zhang2017] | 53.5 | 90.3 | - | - | 94.8 | - |
| DC-RecNN [Liu, Qiu, and Huang2017b] | - | 86.1 | 80.2 | 93.5 | 91.2 | 77.9 |
| DC-TreeLSTM [Liu, Qiu, and Huang2017b] | - | 87.8 | 81.7 | 93.7 | 93.8 | 80.2 |
Given a sentence and its parse tree, the proposed models compute the hidden state of each node recursively over the tree. The hidden state $h_p$ computed by the proposed models can be regarded as the representation of the phrase spanned by node $p$. In particular, we use the hidden state of the root node as the sentence representation and apply it to two realistic NLP tasks, with a different output layer for each task.
For sentence classification, we predict a label $y$ from a pre-defined class set for a sentence $s$. We directly feed the sentence representation into a softmax classifier.
For text semantic matching, we deal with a classification problem over sentence pairs. Given two sentences $s_1$ and $s_2$, we need to predict a label $y$ that represents the relation between them. We first obtain their representations $h_1$ and $h_2$ with a parameter-shared TG-HRecNN/TreeLSTM and then combine the features in this way:
We feed the combined features into a network with one hidden layer and ReLU activation, followed by a softmax classifier:
The training objective for both tasks is to minimize the cross-entropy between the predicted and true label distributions:
where $N$ is the number of training samples, and $x^{(k)}$ and $y^{(k)}$ are the $k$-th sample and its label in the dataset respectively. The prediction is then given in this way:
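The objective and the prediction rule can be written down compactly. The sketch below assumes the standard formulation: a softmax over class scores, the negative log-likelihood of the true label averaged over samples, and an arg-max prediction. Averaging rather than summing, and the function names, are assumptions made here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true labels."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def predict(logits):
    """Index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Two toy samples with three classes each (hypothetical scores).
batch = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
gold = [0, 2]
loss = cross_entropy(batch, gold)
predictions = [predict(logits) for logits in batch]
```

For text semantic matching the same loss applies; only the vector fed to the classifier changes from a single sentence representation to the combined pair features.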
To evaluate the effectiveness of the proposed models, we conduct experiments on four benchmarks for sentence classification and on the SICK dataset for text semantic matching:
- SST: Stanford Sentiment Treebank [Socher et al.2013b] for sentiment classification (http://nlp.stanford.edu/sentiment/). SST-1 denotes evaluation with fine-grained labels (very positive, positive, neutral, negative, very negative), and SST-2 denotes evaluation with binary labels, where neutral samples are ignored during testing. During training, we also use the phrase-level labels, as previous works do.
- MR: Movie reviews with two polarity classes, positive and negative [Pang and Lee2005] (https://www.cs.cornell.edu/people/pabo/movie-review-data/).
- SUBJ: Subjectivity dataset with two classes, subjective and objective [Pang and Lee2004] (https://www.cs.cornell.edu/people/pabo/movie-review-data/).
- TREC: Question classification dataset [Li and Roth2002].
- SICK: Sentence pairs annotated for textual entailment [Marelli et al.2014], used for text semantic matching.
The detailed dataset statistics are listed in Table 1.
In all experiments, we initialize word embeddings with 300-dimensional GloVe 840B vectors (http://nlp.stanford.edu/projects/glove/) [Pennington, Socher, and Manning2014]. We fine-tune word embeddings during training only on SST. We use the AdaGrad [Duchi, Hazan, and Singer2011] optimizer with an initial learning rate of 0.05. The hidden size of the main networks is 150. The hidden size of the hyper networks is 50 and the input size of the hyper networks is 100. The dimension of the tag embeddings is 50. We apply dropout to both the embedding layer and the output layer with a dropout rate of 0.5. Recurrent dropout [Semeniuta, Severyn, and Barth2016] with a dropout rate of 0.25 is applied to the main networks for sentence classification, and L2 regularization with a coefficient of 3e-5 is applied for text semantic matching. The minibatch size is always 50. We obtain the constituency parse trees and the POS tags of words/phrases using the Stanford Parser [Klein and Manning2003]. The code is implemented with Theano [Theano Development Team2016].
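For convenience, the settings listed above can be collected in a small configuration object. The values are taken from the paragraph above; the dictionary keys and the `task_config` helper are hypothetical names chosen for this sketch.

```python
# Settings shared by both tasks, as stated in the text.
COMMON = {
    "word_embedding": "GloVe 840B",
    "word_embedding_dim": 300,
    "fine_tune_embeddings_on": ("SST",),
    "optimizer": "AdaGrad",
    "learning_rate": 0.05,
    "main_hidden_size": 150,
    "hyper_hidden_size": 50,
    "hyper_input_size": 100,
    "tag_embedding_dim": 50,
    "embedding_dropout": 0.5,
    "output_dropout": 0.5,
    "batch_size": 50,
}


def task_config(task):
    """Return the settings for 'classification' or 'matching'."""
    if task not in ("classification", "matching"):
        raise ValueError("unknown task: " + task)
    config = dict(COMMON)
    if task == "classification":
        # Recurrent dropout on the main networks (sentence classification).
        config["recurrent_dropout_main"] = 0.25
    else:
        # L2 regularization (text semantic matching).
        config["l2_regularization"] = 3e-5
    return config


classification = task_config("classification")
matching = task_config("matching")
```

The hyper network sizes (50 and 100) are much smaller than the main hidden size (150), in line with the earlier remark that the hypernetwork adds few parameters.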
| Model | SST-1 | SST-2 | MR | SUBJ | TREC |
|---|---|---|---|---|---|
| AdaSent [Zhao, Lu, and Poupart2015] | - | - | 83.1 | 95.5 | 92.4 |
| d-TBCNN [Mou et al.2015] | 51.4 | 87.9 | - | - | 96.0 |
| DSCNN-Pretrain [Zhang, Lee, and Radev2016] | 50.6 | 88.7 | 82.2 | 93.9 | 95.6 |
| BLSTM-2DCNN [Zhou et al.2016] | 52.4 | 89.5 | 82.3 | 94.0 | 96.1 |
| NTI [Yu and Munkhdalai2017] | 53.1 | 89.3 | - | - | - |
| BCN+Char+CoVe [McCann et al.2017] | 53.7 | 90.3 | - | - | 95.8 |
| Dimension | Tag Types | Examples of Phrases to Compose |
|---|---|---|
| 16th | JJ+NN | aching+beauty, perfect+film, recent+favourite, implausible+situation |
| 30th | DT+NN | a+movie, this+examination, the+film, no+picture |
| 62nd | NP+PP | the classic films+of Jean Renoir, the most powerful thing+in life, every opportunity+for a breakthrough |
| 79th | ADJP+NN | most impossibly dry+account, far more thoughtful+film, weak or careless+performance |
| 132nd | IN+NP | of+mystery and quietness, in+his bratty character, in+a mess of purposeless violence |
We first compare the proposed models with RecNN-based models as well as baseline models. We then compare TG-HTreeLSTM with state-of-the-art models that are not RecNN-based.
Comparison with RecNN-Based Models
The experimental results of this comparison are shown in columns 2 to 6 of Table 2. First, TG-HRecNN and TG-HTreeLSTM outperform RecNN and TreeLSTM on all datasets. In particular, compared with RecNN and TreeLSTM, TG-HRecNN and TG-HTreeLSTM obtain improvements of about 5.4%/4.8% and 2.7%/2.4% on SST-1/SST-2 respectively, which are much greater than those on the other datasets. We attribute this to the phrase-level labels in SST, which give the dynamic parameter prediction process more supervision during training. The vanilla RecNN is boosted more easily by dynamic parameter prediction, and TG-HRecNN is, surprisingly, competitive with the strong CNN and BiLSTM baselines. Second, our models are superior to TG-RNN, TE-RNN and TE-LSTM, which also exploit POS tags to enhance the expressive power of the semantic composition function. This suggests that the hypernetwork framework is a more effective way to guide the semantic composition function dynamically with POS tags. The proposed models are also much stronger than AdaMC-RNN, AdaHT-LSTM and iTLSTM, which use different composition functions in a self-adaptive way, even though the last two models exploit external knowledge about idioms. Third, the comparison with DC-RecNN/TreeLSTM again demonstrates the superiority of our models. TG-HRecNN and TG-HTreeLSTM outperform DC-RecNN and DC-TreeLSTM by about 1%-2.5% accuracy on SST, MR and TREC, and by a small margin on SUBJ. This supports our conjecture that POS tags are useful information for dynamic parameter prediction. Notably, without tag information, DC-TreeLSTM performs only on par with CNN. Lastly, TG-HTreeLSTM beats all RecNN-based models, including the bidirectional TreeLSTM (BiTreeLSTM) [Teng and Zhang2017], which benefits from a top-down information flow. Compared with BiTreeLSTM, the most competitive RecNN variant, TG-HTreeLSTM has far fewer parameters (0.84M versus 1.30M).
In addition, neither of the two information fusion strategies, concat and multi, is uniformly better: each is more effective in some specific settings.
Comparison with non RecNN-Based State-of-the-art Models
In Table 3, we list the accuracies of our TG-HTreeLSTM model and of state-of-the-art models for sentence classification. Overall, TG-HTreeLSTM performs consistently well on the four datasets and achieves the best scores on SST. AdaSent [Zhao, Lu, and Poupart2015] performs very well on MR and SUBJ and has held the state of the art on these two datasets since 2015. Nevertheless, the gaps between TG-HTreeLSTM and AdaSent are relatively small: 0.2% on MR and 0.6% on SUBJ. On TREC, TG-HTreeLSTM is also competitive with the best model, BLSTM-2DCNN [Zhou et al.2016], with a 0.3% gap. This comparison shows that TG-HTreeLSTM generalizes well across datasets.
Text Semantic Matching
The last column of Table 2 summarizes the performance of different models on SICK. Note that the aim of this experiment is to show that the proposed models can improve RecNN/TreeLSTM on different NLP tasks, not to pursue state-of-the-art performance. We therefore compare only with sentence-encoding-based models, which encode the two sentences into vectors and then classify them as described in Sec 4.4. TG-HRecNN and TG-HTreeLSTM achieve improvements of 3.4% and 6.1% over RecNN and TreeLSTM respectively. Although TG-HRecNN is only competitive with DC-RecNN, TG-HTreeLSTM performs well on this task and outperforms DC-TreeLSTM by 3.4% accuracy.
To explain the effectiveness of the proposed models, we conduct experiments to examine how the hyper RecNN dynamically predicts the composition parameters of the main RecNN. As described in Sec 4, the hyper RecNN modifies the parameters by scaling each row of the parameter matrix. We therefore examine, on the test set of SST, the value of each dimension of the scaling vector in equation 8, which is output by the hyper RecNN and determines the final composition parameters of the main RecNN. We find that large values in some dimensions of this vector occur predominantly at nodes with specific tag types. Table 4 shows some examples of these interpretable dimensions, which support the claim that the proposed models predict reasonable composition parameters for different types of word/phrase pairs by enlarging different rows of the parameter matrix.
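An analysis of this kind can be approximated with a few lines of Python. The sketch below assumes that the scaling vector has been recorded for every non-leaf node together with its tag configuration; the record format, the function name `dominant_tags` and the toy values are hypothetical, and the exact selection criterion used for Table 4 is not specified in the text.

```python
from collections import Counter


def dominant_tags(records, dim, top_k):
    """Count tag configurations among the top_k nodes with the largest value in `dim`.

    `records` is a list of (tag_configuration, scaling_vector) pairs.
    """
    ranked = sorted(records, key=lambda r: r[1][dim], reverse=True)[:top_k]
    return Counter(tag for tag, _ in ranked).most_common()


# Toy records with 2-dimensional scaling vectors (hypothetical values).
records = [
    ("JJ+NN", [0.9, 0.1]),
    ("JJ+NN", [0.8, 0.2]),
    ("DT+NN", [0.1, 0.7]),
    ("DT+NN", [0.2, 0.9]),
    ("NP+PP", [0.3, 0.3]),
]

top_dim0 = dominant_tags(records, dim=0, top_k=2)
top_dim1 = dominant_tags(records, dim=1, top_k=2)
```

A dimension is interpretable in the sense used above when a single tag configuration accounts for most of its top-ranked nodes.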
We also find that large values in some dimensions occur predominantly at nodes with a specific sentiment polarity. In Figure 3, we give an example that visualizes the behaviour of the 13th dimension, which is sensitive to negative sentiment. The label of the whole sentence is very negative, while the labels of the phrases are neutral except for “did it ever get made”. The values of this dimension are much larger at the two nodes with negative labels than at the other nodes. Although no labels are visible at test time, this dimension of the scaling vector carries sentiment information, so our model can make a correct prediction with adaptive composition parameters.
In this paper, we introduce the hypernetwork framework into RecNNs to address the problem caused by shared composition parameters. We propose two novel RecNN variants in which a hyper RecNN, taking POS tag information as input, dynamically predicts the composition parameters of a main RecNN. An information fusion layer is designed to incorporate POS tag and semantic information for parameter prediction. Our models beat all RecNN-based models on five datasets for sentence classification and text semantic matching. The proposed TG-HTreeLSTM achieves or is competitive with the state of the art on four sentence classification benchmarks. We also give a qualitative analysis to explain why our models work well.
Experimental results show that the proposed models are able to encode a sentence into a powerful distributed representation, which will benefit many NLP tasks. In future work, we will employ our models as sentence encoders and apply the encoded sentence embeddings to higher-level tasks, such as document classification and reading comprehension. We will also explore the effectiveness of other syntactic information for guiding dynamic parameter prediction.
- [Bertinetto et al.2016] Bertinetto, L.; Henriques, J. F.; Valmadre, J.; Torr, P. H. S.; and Vedaldi, A. 2016. Learning feed-forward one-shot learners. In Proceedings of NIPS, 523–531.
- [Dong et al.2014] Dong, L.; Wei, F.; Zhou, M.; and Xu, K. 2014. Adaptive multi-compositionality for recursive neural models with applications to sentiment analysis. In Proceedings of AAAI, 1537–1543.
- [Duchi, Hazan, and Singer2011] Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
- [Ha, Dai, and Le2017] Ha, D.; Dai, A. M.; and Le, Q. V. 2017. Hypernetworks. In Proceedings of ICLR.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
- [Huang, Qian, and Zhu2017] Huang, M.; Qian, Q.; and Zhu, X. 2017. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Trans. Inf. Syst. 35(3):26:1–26:27.
- [Jia et al.2016] Jia, X.; Brabandere, B. D.; Tuytelaars, T.; and Gool, L. V. 2016. Dynamic filter networks. In Proceedings of NIPS, 667–675.
- [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, 1746–1751.
- [Klein and Manning2003] Klein, D., and Manning, C. D. 2003. Accurate unlexicalized parsing. In Proceedings of ACL, 423–430.
- [Li and Roth2002] Li, X., and Roth, D. 2002. Learning question classifiers. In Proceedings of COLING.
- [Lin et al.2017] Lin, Z.; Feng, M.; dos Santos, C. N.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. In Proceedings of ICLR.
- [Liu et al.2017] Liu, P.; Qian, K.; Qiu, X.; and Huang, X. 2017. Idiom-aware compositional distributed semantics. In Proceedings of EMNLP, 1215–1224.
- [Liu, Qiu, and Huang2017a] Liu, P.; Qiu, X.; and Huang, X. 2017a. Adaptive semantic compositionality for sentence modelling. In Proceedings of IJCAI, 4061–4067.
- [Liu, Qiu, and Huang2017b] Liu, P.; Qiu, X.; and Huang, X. 2017b. Dynamic compositional neural networks over tree structure. In Proceedings of IJCAI, 4054–4060.
- [Marelli et al.2014] Marelli, M.; Bentivogli, L.; Baroni, M.; Bernardi, R.; Menini, S.; and Zamparelli, R. 2014. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval-2014, 1–8.
- [McCann et al.2017] McCann, B.; Bradbury, J.; Xiong, C.; and Socher, R. 2017. Learned in translation: Contextualized word vectors. In Proceedings of NIPS, 6297–6308.
- [Mou et al.2015] Mou, L.; Peng, H.; Li, G.; Xu, Y.; Zhang, L.; and Jin, Z. 2015. Discriminative neural sentence modeling by tree-based convolution. In Proceedings of EMNLP, 2315–2325.
- [Pang and Lee2004] Pang, B., and Lee, L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL, 271–278.
- [Pang and Lee2005] Pang, B., and Lee, L. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, 115–124.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP, 1532–1543.
- [Qian et al.2015] Qian, Q.; Tian, B.; Huang, M.; Liu, Y.; Zhu, X.; and Zhu, X. 2015. Learning tag embeddings and tag-specific composition functions in recursive neural network. In Proceedings of ACL, 1365–1374.
- [Schmidhuber1992] Schmidhuber, J. 1992. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4(1):131–139.
- [Semeniuta, Severyn, and Barth2016] Semeniuta, S.; Severyn, A.; and Barth, E. 2016. Recurrent dropout without memory loss. In Proceedings of COLING, 1757–1766.
- [Shen, Yang, and Deng2017] Shen, G.; Yang, Y.; and Deng, Z. 2017. Inter-weighted alignment network for sentence pair modeling. In Proceedings of EMNLP, 1179–1189.
- [Socher et al.2011] Socher, R.; Lin, C. C.; Ng, A. Y.; and Manning, C. D. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of ICML, 129–136.
- [Socher et al.2012] Socher, R.; Huval, B.; Manning, C. D.; and Ng, A. Y. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP, 1201–1211.
- [Socher et al.2013a] Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013a. Parsing with compositional vector grammars. In Proceedings of ACL, 455–465.
- [Socher et al.2013b] Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, 1631–1642.
- [Tai, Socher, and Manning2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL, 1556–1566.
- [Tan et al.2016] Tan, M.; dos Santos, C. N.; Xiang, B.; and Zhou, B. 2016. Improved representation learning for question answer matching. In Proceedings of ACL.
- [Teng and Zhang2017] Teng, Z., and Zhang, Y. 2017. Head-lexicalized bidirectional tree lstms. TACL 5:163–177.
- [Theano Development Team2016] Theano Development Team. 2016. Theano: A python framework for fast computation of mathematical expressions. CoRR abs/1605.02688.
- [Yu and Munkhdalai2017] Yu, H., and Munkhdalai, T. 2017. Neural semantic encoders. In Proceedings of EACL, 11–21.
- [Zhang, Lee, and Radev2016] Zhang, R.; Lee, H.; and Radev, D. R. 2016. Dependency sensitive convolutional neural networks for modeling sentences and documents. In Proceedings of NAACL, 1512–1521.
- [Zhao, Lu, and Poupart2015] Zhao, H.; Lu, Z.; and Poupart, P. 2015. Self-adaptive hierarchical sentence model. In Proceedings of IJCAI, 4069–4076.
- [Zhou et al.2016] Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; and Xu, B. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In Proceedings of COLING, 3485–3495.