Sentence-State LSTM for Text Representation

05/07/2018 ∙ by Yue Zhang, et al. ∙ University of Rochester Singapore University of Technology and Design 0

Bi-directional LSTMs are a powerful tool for text representation. On the other hand, they have been shown to suffer various limitations due to their sequential nature. We investigate an alternative LSTM structure for encoding text, which consists of a parallel state for each word. Recurrent steps are used to perform local and global information exchange between words simultaneously, rather than incremental reading of a sequence of words. Results on various classification and sequence labelling benchmarks show that the proposed model has strong representation power, giving highly competitive performances compared to stacked BiLSTM models with similar parameter numbers.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Neural models have become the dominant approach in the NLP literature. Compared to hand-crafted indicator features, neural sentence representations are less sparse, and more flexible in encoding intricate syntactic and semantic information. Among various neural networks for encoding sentences, bi-directional LSTMs (BiLSTM)

Hochreiter and Schmidhuber (1997) have been a dominant method, giving state-of-the-art results in language modelling Sundermeyer et al. (2012), machine translation Bahdanau et al. (2015), syntactic parsing Dozat and Manning (2017) and question answering Tan et al. (2015).

Despite their success, BiLSTMs have been shown to suffer several limitations. For example, their inherently sequential nature endows computation non-parallel within the same sentence Vaswani et al. (2017), which can lead to a computational bottleneck, hindering their use in the industry. In addition, local ngrams, which have been shown a highly useful source of contextual information for NLP, are not explicitly modelled Wang et al. (2016). Finally, sequential information flow leads to relatively weaker power in capturing long-range dependencies, which results in lower performance in encoding longer sentences Koehn and Knowles (2017).

Figure 1: Sentence-State LSTM

We investigate an alternative recurrent neural network structure for addressing these issues. As shown in Figure

1, the main idea is to model the hidden states of all words simultaneously at each recurrent step, rather than one word at a time. In particular, we view the whole sentence as a single state, which consists of sub-states for individual words and an overall sentence-level state. To capture local and non-local contexts, states are updated recurrently by exchanging information between each other. Consequently, we refer to our model as sentence-state LSTM, or S-LSTM in short. Empirically, S-LSTM can give effective sentence encoding after 3 – 6 recurrent steps. In contrast, the number of recurrent steps necessary for BiLSTM scales with the size of the sentence.

At each recurrent step, information exchange is conducted between consecutive words in the sentence, and between the sentence-level state and each word. In particular, each word receives information from its predecessor and successor simultaneously. From an initial state without information exchange, each word-level state can obtain 3-gram, 5-gram and 7-gram information after 1, 2 and 3 recurrent steps, respectively. Being connected with every word, the sentence-level state vector serves to exchange non-local information with each word. In addition, it can also be used as a global sentence-level representation for classification tasks.

Results on both classification and sequence labelling show that S-LSTM gives better accuracies compared to BiLSTM using the same number of parameters, while being faster. We release our code and models at, which include all baselines and the final model.

2 Related Work

LSTM Graves and Schmidhuber (2005)

showed its early potentials in NLP when a neural machine translation system that leverages LSTM source encoding gave highly competitive results compared to the best SMT models

Bahdanau et al. (2015). LSTM encoders have since been explored for other tasks, including syntactic parsing Dyer et al. (2015), text classification Yang et al. (2016) and machine reading Hermann et al. (2015). Bi-directional extensions have become a standard configuration for achieving state-of-the-art accuracies among various tasks Wen et al. (2015); Ma and Hovy (2016); Dozat and Manning (2017). S-LSTMs are similar to BiLSTMs in their recurrent bi-directional message flow between words, but different in the design of state transition.

CNNs Krizhevsky et al. (2012) also allow better parallelisation compared to LSTMs for sentence encoding Kim (2014)

, thanks to parallelism among convolution filters. On the other hand, convolution features embody only fix-sized local n-gram information, whereas sentence-level feature aggregation via pooling can lead to loss of information

Sabour et al. (2017). In contrast, S-LSTM uses a global sentence-level node to assemble and back-distribute local information in the recurrent state transition process, suffering less information loss compared to pooling.

Attention Bahdanau et al. (2015) has recently been explored as a standalone method for sentence encoding, giving competitive results compared to Bi-LSTM encoders for neural machine translation Vaswani et al. (2017). The attention mechanism allows parallelisation, and can play a similar role to the sentence-level state in S-LSTMs, which uses neural gates to integrate word-level information compared to hierarchical attention. S-LSTM further allows local communication between neighbouring words.

Hierarchical stacking of CNN layers LeCun et al. (1995); Kalchbrenner et al. (2014); Papandreou et al. (2015); Dauphin et al. (2017) allows better interaction between non-local components in a sentence via incremental levels of abstraction. S-LSTM is similar to hierarchical attention and stacked CNN in this respect, incrementally refining sentence representations. However, S-LSTM models hierarchical encoding of sentence structure as a recurrent state transition process. In nature, our work belongs to the family of LSTM sentence representations.

S-LSTM is inspired by message passing over graphs Murphy et al. (1999); Scarselli et al. (2009). Graph-structure neural models have been used for computer program verification Li et al. (2016) and image object detection Liang et al. (2016)

. The closest previous work in NLP includes the use of convolutional neural networks

Bastings et al. (2017); Marcheggiani and Titov (2017) and DAG LSTMs Peng et al. (2017) for modelling syntactic structures. Compared to our work, their motivations and network structures are highly different. In particular, the DAG LSTM of TACL1028 is a natural extension of tree LSTM Tai et al. (2015), and is sequential rather than parallel in nature. To our knowledge, we are the first to investigate a graph RNN for encoding sentences, proposing parallel graph states for integrating word-level and sentence-level information. In this perspective, our contribution is similar to that of kim2014convolutional and bahdanau2014neural in introducing a neural representation to the NLP literature.

3 Model

Given a sentence , where represents the th word and is the sentence length, our goal is to find a neural representation of , which consists of a hidden vector for each input word , and a global sentence-level hidden vector . Here represents syntactic and semantic features for under the sentential context, while represents features for the whole sentence. Following previous work, we additionally add and to the two ends of the sentence as and , respectively.

3.1 Baseline BiLSTM

The baseline BiLSTM model consists of two LSTM components, which process the input in the forward left-to-right and the backward right-to-left directions, respectively. In each direction, the reading of input words is modelled as a recurrent process with a single hidden state. Given an initial value, the state changes its value recurrently, each time consuming an incoming word.

Take the forward LSTM component for example. Denoting the initial state as , which is a model parameter, the recurrent state transition step for calculating is defined as follows Graves and Schmidhuber (2005):


where denotes the word representation of ; , , and represent the values of an input gate, an output gate, a forget gate and an actual input at time step , respectively, which controls the information flow for a recurrent cell and the state vector ; , and () are model parameters.

is the sigmoid function.

The backward LSTM component follows the same recurrent state transition process as described in Eq 1. Starting from an initial state , which is a model parameter, it reads the input , , , , changing its value to , , , , respectively. A separate set of parameters , and () are used for the backward component.

The BiLSTM model uses the concatenated value of and as the hidden vector for :

A single hidden vector representation of the whole input sentence can be obtained using the final state values of the two LSTM components:

Stacked BiLSTM

Multiple layers of BiLTMs can be stacked for increased representation power, where the hidden vectors of a lower layer are used as inputs for an upper layer. Different model parameters are used in each stacked BiLSTM layer.

3.2 Sentence-State LSTM

Formally, an S-LSTM state at time step can be denoted by:

which consists of a sub state for each word and a sentence-level sub state .

S-LSTM uses a recurrent state transition process to model information exchange between sub states, which enriches state representations incrementally. For the initial state , we set , where is a parameter. The state transition from to consists of sub state transitions from to and from to . We take an LSTM structure similar to the baseline BiLSTM for modelling state transition, using a recurrent cell for each and a cell for .

As shown in Figure 1, the value of each is computed based on the values of , , , and , together with their corresponding cell values:


where is the concatenation of hidden vectors of a context window, and , , , and are gates that control information flow from and to . In particular, controls information from the input ; , , and control information from the left context cell , the right context cell , and the sentence context cell , respectively. The values of , , , and are normalised such that they sum to 1. is an output gate from the cell state to the hidden state . , , and () are model parameters. is the sigmoid function.

The value of is computed based on the values of for all :


where and are gates controlling information from and , respectively, which are normalised. is an output gate from the recurrent cell to . , and () are model parameters.

Contrast with BiLSTM

The difference between S-LSTM and BiLSTM can be understood with respect to their recurrent states. While BiLSTM uses only one state in each direction to represent the subsequence from the beginning to a certain word, S-LSTM uses a structural state to represent the full sentence, which consists of a sentence-level sub state and word-level sub states, simultaneously. Different from BiLSTMs, for which at different time steps are used to represent , respectively, the word-level states and sentence-level state of S-LSTMs directly correspond to the goal outputs and , as introduced in the beginning of this section. As increases from 0, and are enriched with increasingly deeper context information.

From the perspective of information flow, BiLSTM passes information from one end of the sentence to the other. As a result, the number of time steps scales with the size of the input. In contrast, S-LSTM allows bi-directional information flow at each word simultaneously, and additionally between the sentence-level state and every word-level state. At each step, each captures an increasing larger ngram context, while additionally communicating globally to all other via . The optimal number of recurrent steps is decided by the end-task performance, and does not necessarily scale with the sentence size. As a result, S-LSTM can potentially be both more efficient and more accurate compared with BiLSTMs.

Increasing window size. By default S-LSTM exchanges information only between neighbouring words, which can be seen as adopting a 1-word window on each side. The window size can be extended to 2, 3 or more words in order to allow more communication in a state transition, expediting information exchange. To this end, we modify Eq 2, integrating additional context words to , with extended gates and cells. For example, with a window size of 2, . We study the effectiveness of window size in our experiments.

Additional sentence-level nodes. By default S-LSTM uses one sentence-level node. One way of enriching the parameter space is to add more sentence-level nodes, each communicating with word-level nodes in the same way as described by Eq 3. In addition, different sentence-level nodes can communicate with each other during state transition. When one sentence-level node is used for classification outputs, the other sentence-level node can serve as hidden memory units, or latent features. We study the effectiveness of multiple sentence-level nodes empirically.

3.3 Task settings

We consider two task settings, namely classification and sequence labelling. For classification, is fed to a softmax classification layer:


is the probability distribution of output class labels and

and are model parameters. For sequence labelling, each can be used as feature representation for a corresponding word .

External attention

It has been shown that summation of hidden states using attention Bahdanau et al. (2015); Yang et al. (2016) give better accuracies compared to using the end states of BiLSTMs. We study the influence of attention on both S-LSTM and BiLSTM for classification. In particular, additive attention Bahdanau et al. (2015) is applied to the hidden states of input words for both BiLSTMs and S-LSTMs calculating a weighted sum


Here , and are model parameters.

External CRF

For sequential labelling, we use a CRF layer on top of the hidden vectors for calculating the conditional probabilities of label sequences Huang et al. (2015); Ma and Hovy (2016):

where and are parameters specific to two consecutive labels and .

For training, standard log-likelihood loss is used with regularization given a set of gold-standard instances.

4 Experiments

We empirically compare S-LSTMs and BiLSTMs on different classification and sequence labelling tasks. All experiments are conducted using a GeForce GTX 1080 GPU with 8GB memory.

Dataset Training Development Test
#sent #words #sent #words #sent #words
Movie review Pang and Lee (2008) 8527 201137 1066 25026 1066 25260
Books 1400 297K 200 59K 400 68K
Electronics 1398 924K 200 184K 400 224K
DVD 1400 1,587K 200 317K 400 404K
Kitchen 1400 769K 200 153K 400 195K
Apparel 1400 525K 200 105K 400 128K
Camera 1397 1,084K 200 216K 400 260K
Text Health 1400 742K 200 148K 400 175K
Classification Music 1400 1,176K 200 235K 400 276K
Liu et al. (2017) Toys 1400 792K 200 158K 400 196K
Video 1400 1,311K 200 262K 400 342K
Baby 1300 855K 200 171K 400 221K
Magazines 1370 1,033K 200 206K 400 264K
Software 1315 1,143K 200 228K 400 271K
Sports 1400 833K 200 183K 400 218K
IMDB 1400 2,205K 200 507K 400 475K
MR 1400 196K 200 41K 400 48K
POS tagging Marcus et al. (1993) 39831 950011 1699 40068 2415 56671
NER Sang et al. (2003) 14987 204567 3466 51578 3684 46666
Table 1: Dataset statistics

4.1 Experimental Settings

Datasets. We choose the movie review dataset of pang2008opinion, and additionally the 16 datasets of liu2017adversarial for classification evaluation. We randomly split the movie review dataset into training (80%), development (10%) and test (10%) sections, and the original split of liu2017adversarial for the 16 classification datasets.

For sequence labelling, we choose the Penn Treebank Marcus et al. (1993) POS tagging task and the CoNLL Sang et al. (2003) NER task as our benchmarks. For POS tagging, we follow the standard split Manning (2011), using sections 0 – 18 for training, 19 – 21 for development and 22 – 24 for test. For NER, we follow the standard split, and use the BIOES tagging scheme Ratinov and Roth (2009). Statistics of the four datasets are shown in Table 1.

Hyperparameters. We initialise word embeddings using GloVe Pennington et al. (2014) 300 dimensional embeddings.111 Embeddings are fine-tuned during model training for all tasks. Dropout Srivastava et al. (2014) is applied to embedding hidden states, with a rate of 0.5. All models are optimised using the Adam optimizer Kingma and Ba (2014), with an initial learning rate of 0.001 and a decay rate of 0.97. Gradients are clipped at 3 and a batch size of 10 is adopted. Sentences with similar lengths are batched together. The L2 regularization parameter is set to 0.001.

Model Time (s) Acc # Param
+0 dummy node 56 81.76 7,216K
+1 dummy node 65 82.64 8,768K
+2 dummy node 76 82.24 10,321K
Hidden size 100 42 81.75 4,891K
Hidden size 200 54 82.04 6,002K
Hidden size 300 65 82.64 8,768K
Hidden size 600 175 81.84 17,648K
Hidden size 900 235 81.66 33,942K
Without , 63 82.36 8,768K
With , 65 82.64 8,768K
Table 2: Movie review Dev results of S-LSTM
Figure 2: Accuracies with various window sizes and time steps on movie review development set

4.2 Development Experiments

We use the movie review development data to investigate different configurations of S-LSTMs and BiLSTMs. For S-LSTMs, the default configuration uses and words for augmenting words of a sentence. A hidden layer size of 300 and one sentence-level node are used.

Hyperparameters: Table 2

shows the development results of various S-LSTM settings, where Time refers to training time per epoch. Without the sentence-level node, the accuracy of S-LSTM drops to 81.76%, demonstrating the necessity of global information exchange. Adding one additional sentence-level node as described in Section 

3.2 does not lead to accuracy improvements, although the number of parameters and decoding time increase accordingly. As a result, we use only 1 sentence-level node for the remaining experiments. The accuracies of S-LSTM increases as the hidden layer size for each node increases from 100 to 300, but does not further increase when the size increases beyond 300. We fix the hidden size to 300 accordingly. Without using and , the performance of S-LSTM drops from 82.64% to 82.36%, showing the effectiveness of having these additional nodes. Hyperparameters for BiLSTM models are also set according to the development data, which we omit here.

State transition. In Table 2, the number of recurrent state transition steps of S-LSTM is decided according to the best development performance. Figure 2 draws the development accuracies of S-LSTMs with various window sizes against the number of recurrent steps. As can be seen from the figure, when the number of time steps increases from 1 to 11, the accuracies generally increase, before reaching a maximum value. This shows the effectiveness of recurrent information exchange in S-LSTM state transition.

On the other hand, no significant differences are observed on the peak accuracies given by different window sizes, although a larger window size (e.g. 4) generally results in faster plateauing. This can be be explained by the intuition that information exchange between distant nodes can be achieved using more recurrent steps under a smaller window size, as can be achieved using fewer steps under a larger window size. Considering efficiency, we choose a window size of 1 for the remaining experiments, setting the number of recurrent steps to 9 according to Figure 2.

Model Time (s) Acc # Param
LSTM 67 80.72 5,977K
BiLSTM 106 81.73 7,059K
2 stacked BiLSTM 207 81.97 9,221K
3 stacked BiLSTM 310 81.53 11,383K
4 stacked BiLSTM 411 81.37 13,546K
S-LSTM 65 82.64* 8,768K
CNN 34 80.35 5,637K
2 stacked CNN 40 80.97 5,717K
3 stacked CNN 47 81.46 5,808K
4 stacked CNN 51 81.39 5,855K
Transformer (N=6) 138 81.03 7,234K
Transformer (N=8) 174 81.86 7,615K
Transformer (N=10) 214 81.63 8,004K
BiLSTM+Attention 126 82.37 7,419K
S-LSTM+Attention 87 83.07* 8,858K
Table 3: Movie review development results

S-LSTM vs BiLSTM: As shown in Table 3, BiLSTM gives significantly better accuracies compared to uni-directional LSTM222

using t-test. For the remaining of this paper, we use the same measure for statistical significance.

, with the training time per epoch growing from 67 seconds to 106 seconds. Stacking 2 layers of BiLSTM gives further improvements to development results, with a larger time of 207 seconds. 3 layers of stacked BiLSTM does not further improve the results. In contrast, S-LSTM gives a development result of 82.64%, which is significantly better compared to 2-layer stacked BiLSTM, with a smaller number of model parameters and a shorter time of 65 seconds.

We additionally make comparisons with stacked CNNs and hierarchical attention Vaswani et al. (2017), shown in Table 3 (the CNN and Transformer rows), where indicates the number of attention layers. CNN is the most efficient among all models compared, with the smallest model size. On the other hand, a 3-layer stacked CNN gives an accuracy of 81.46%, which is also the lowest compared with BiLSTM, hierarchical attention and S-LSTM. The best performance of hierarchical attention is between single-layer and two-layer BiLSTMs in terms of both accuracy and efficiency. S-LSTM gives significantly better accuracies compared with both CNN and hierarchical attention.

Influence of external attention mechanism. Table 3 additionally shows the results of BiLSTM and S-LSTM when external attention is used as described in Section 3.3. Attention leads to improved accuracies for both BiLSTM and S-LSTM in classification, with S-LSTM still outperforming BiLSTM significantly. The result suggests that external techniques such as attention can play orthogonal roles compared with internal recurrent structures, therefore benefiting both BiLSTMs and S-LSTMs. Similar observations are found using external CRF layers for sequence labelling.

Model Accuracy Train (s) Test (s)
  socher2011semi 77.70
  socher2012semantic 79.00
  kim2014convolutional 81.50
  qian2016linguistically 81.50
  BiLSTM 81.61 51 1.62
  2 stacked BiLSTM 81.94 98 3.18
  3 stacked BiLSTM 81.71 137 4.67
  3 stacked CNN 81.59 31 1.04
  Transformer (N=8) 81.97 89 2.75
  S-LSTM 82.45* 41 1.53
Table 4: Test set results on movie review dataset (* denotes significance in all tables).
Dataset SLSTM Time (s) BiLSTM Time (s) 2 BiLSTM Time (s)
Camera 90.02* 50 (2.85) 87.05 115 (8.37) 88.07 221 (16.1)
Video 86.75* 55 (3.95) 84.73 140 (12.59) 85.23 268 (25.86)
Health 86.5 37 (2.17) 85.52 118 (6.38) 85.89 227 (11.16)
Music 82.04* 52 (3.44) 78.74 185 (12.27) 80.45 268 (23.46)
Kitchen 84.54* 40 (2.50) 82.22 118 (10.18) 83.77 225 (19.77)
DVD 85.52* 63 (5.29) 83.71 166 (15.42) 84.77 217 (28.31)
Toys 85.25 39 (2.42) 85.72 119 (7.58) 85.82 231 (14.83)
Baby 86.25* 40 (2.63) 84.51 125 (8.50) 85.45 238 (17.73)
Books 83.44* 64 (3.64) 82.12 240 (13.59) 82.77 458 (28.82)
IMDB 87.15* 67 (3.69) 86.02 248 (13.33) 86.55 486 (26.22)
MR 76.2 27 (1.25) 75.73 39 (2.27) 75.98 72 (4.63)
Appeal 85.75 35 (2.83) 86.05 119 (11.98) 86.35* 229 (22.76)
Magazines 93.75* 51 (2.93) 92.52 214 (11.06) 92.89 417 (22.77)
Electronics 83.25* 47 (2.55) 82.51 195 (10.14) 82.33 356 (19.77)
Sports 85.75* 44 (2.64) 84.04 172 (8.64) 84.78 328 (16.34)
Software 87.75* 54 (2.98) 86.73 245 (12.38) 86.97 459 (24.68)
Average 85.38* 47.30 (2.98) 84.01 153.48 (10.29) 84.64 282.24 (20.2)
Table 5: Results on the 16 datasets of liu2017adversarial. Time format: train (test)
(a) CoNLL03
(b) WSJ
Figure 3: Sequence labelling development results.
Model Accuracy Train (s) Test (s)
manning2011part 97.28
collobert2011natural 97.29
sun2014structure 97.36
sogaard2011semisupervised 97.50
huang2015bidirectional 97.55
ma2016end 97.55
yang2017transfer 97.55
BiLSTM 97.35 254 22.50
2 stacked BiLSTM 97.41 501 43.99
3 stacked BiLSTM 97.40 746 64.96
S-LSTM 97.55 237 22.16
Table 6: Results on PTB (POS tagging)
Model F1 Train (s) Test (s)
collobert2011natural 89.59
passos2014lexicon 90.90
luo2015joint 91.20
huang2015bidirectional 90.10
lample2016neural 90.94
ma2016end 91.21
yang2017transfer 91.26
rei:2017:Long 86.26
peters2017semi 91.93
BiLSTM 90.96 82 9.89
2 stacked BiLSTM 91.02 159 18.88
3 stacked BiLSTM 91.06 235 30.97
S-LSTM 91.57* 79 9.78
Table 7: Results on CoNLL03 (NER)

4.3 Final Results for Classification

The final results on the movie review and rich text classification datasets are shown in Tables 4 and 5, respectively. In addition to training time per epoch, test times are additionally reported. We use the best settings on the movie review development dataset for both S-LSTMs and BiLSTMs. The step number for S-LSTMs is set to 9.

As shown in Table 4, the final results on the movie review dataset are consistent with the development results, where S-LSTM outperforms BiLSTM significantly, with a faster speed. Observations on CNN and hierarchical attention are consistent with the development results. S-LSTM also gives highly competitive results when compared with existing methods in the literature.

As shown in Table 5, among the 16 datasets of liu2017adversarial, S-LSTM gives the best results on 12, compared with BiLSTM and 2 layered BiLSTM models. The average accuracy of S-LSTM is 85.6%, significantly higher compared with 84.9% by 2-layer stacked BiLSTM. 3-layer stacked BiLSTM gives an average accuracy of 84.57%, which is lower compared to a 2-layer stacked BiLSTM, with a training time per epoch of 423.6 seconds. The relative speed advantage of S-LSTM over BiLSTM is larger on the 16 datasets as compared to the movie review test test. This is because the average length of inputs is larger on the 16 datasets (see Section 4.5).

4.4 Final Results for Sequence Labelling

Bi-directional RNN-CRF structures, and in particular BiLSTM-CRFs, have achieved the state of the art in the literature for sequence labelling tasks, including POS-tagging and NER. We compare S-LSTM-CRF with BiLSTM-CRF for sequence labelling, using the same settings as decided on the movie review development experiments for both BiLSTMs and S-LSTMs. For the latter, we decide the number of recurrent steps on the respective development sets for sequence labelling. The POS accuracies and NER F1-scores against the number of recurrent steps are shown in Figure 3 (a) and (b), respectively. For POS tagging, the best step number is set to 7, with a development accuracy of 97.58%. For NER, the step number is set to 9, with a development F1-score of 94.98%.

As can be seen in Table 6, S-LSTM gives significantly better results compared with BiLSTM on the WSJ dataset. It also gives competitive accuracies as compared with existing methods in the literature. Stacking two layers of BiLSTMs leads to improved results compared to one-layer BiLSTM, but the accuracy does not further improve with three layers of stacked LSTMs.

For NER (Table 7), S-LSTM gives an F1-score of 91.57% on the CoNLL test set, which is significantly better compared with BiLSTMs. Stacking more layers of BiLSTMs leads to slightly better F1-scores compared with a single-layer BiLSTM. Our BiLSTM results are comparable to the results reported by ma2016end and lample2016neural, who also use bidirectional RNN-CRF structures. In contrast, S-LSTM gives the best reported results under the same settings.

In the second section of Table 7

, yang2017transfer use cross-domain data, obtaining an F-score of 91.26%; rei:2017:Long perform multi-task learning using additional language model objectives, obtaining an F-score of 86.26%; peters2017semi leverage character-level language models, obtaining an F-score of 91.93%, which is the current best result on the dataset. All the three models are based on BiLSTM-CRF. On the other hand, these semi-supervised learning techniques are orthogonal to our work, and can potentially be used for S-LSTM also.

4.5 Analysis

(a) Movie review
(b) CoNLL03
Figure 4: Accuracies against sentence length.
Figure 5: Time against sentence length.

Figure 4 (a) and (b) show the accuracies against the sentence length on the movie review and CoNLL datasets, respectively, where test samples are binned in batches of 80. We find that the performances of both S-LSTM and BiLSTM decrease as the sentence length increases. On the other hand, S-LSTM demonstrates relatively better robustness compared to BiLSTMs. This confirms our intuition that a sentence-level node can facilitate better non-local communication.

Figure 5 shows the training time per epoch of S-LSTM and BiLSTM on sentences with different lengths on the 16 classification datasets. To make these comparisons, we mix all training instances, order them by the size, and put them into 10 equal groups, the medium sentence lengths of which are shown. As can be seen from the figure, the speed advantage of S-LSTM is larger when the size of the input text increases, thanks to a fixed number of recurrent steps.

Similar to hierarchical attention (Vaswani et al., 2017), there is a relative disadvantage of S-LSTM in comparison with BiLSTM, which is that the memory consumption is relatively larger. For example, over the movie review development set, the actual GPU memory consumption by S-LSTM, BiLSTM, 2-layer stacked BiLSTM and 4-layer stacked BiLSTM are 252M, 89M, 146M and 253M, respectively. This is due to the fact that computation is performed in parallel by S-LSTM and hierarchical attention.

5 Conclusion

We have investigated S-LSTM, a recurrent neural network for encoding sentences, which offers richer contextual information exchange with more parallelism compared to BiLSTMs. Results on a range of classification and sequence labelling tasks show that S-LSTM outperforms BiLSTMs using the same number of parameters, demonstrating that S-LSTM can be a useful addition to the neural toolbox for encoding sentences.

The structural nature in S-LSTM states allows straightforward extension to tree structures, resulting in highly parallelisable tree LSTMs. We leave such investigation to future work. Next directions also include the investigation of S-LSTM to more NLP tasks, such as machine translation.


We thank the anonymous reviewers for their constructive and thoughtful comments.


  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
  • Bastings et al. (2017) Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of EMNLP 2017. Copenhagen, Denmark, pages 1957–1967.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR 12(Aug):2493–2537.
  • Dauphin et al. (2017) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In ICML. pages 933–941.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR 2017.
  • Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015.

    Transition-based dependency parsing with stack long short-term memory.

    In Proceedings of ACL 2015. Beijing, China, pages 334–343.
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks pages 602–610.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS. pages 1693–1701.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991 .
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL 2014. Baltimore, Maryland, pages 655–665.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014. Doha, Qatar, pages 1746–1751.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
  • Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. Vancouver, pages 28–39.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. pages 1097–1105.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016.

    Neural architectures for named entity recognition.

    In Proceedings of the 2016 NAACL. San Diego, California, pages 260–270.
  • LeCun et al. (1995) Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10):1995.
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2016. Gated graph sequence neural networks. In ICLR 2016.
  • Liang et al. (2016) Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. 2016. Semantic object parsing with graph lstm. In ECCV. Springer, pages 125–143.
  • Liu et al. (2017) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of ACL 2017. Vancouver, Canada, pages 1–10.
  • Luo et al. (2015) Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of EMNLP 2015. pages 879–888.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL 2016. Berlin, Germany, pages 1064–1074.
  • Manning (2011) Christopher D Manning. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In CICLing. Springer, pages 171–189.
  • Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP 2017. Copenhagen, Denmark, pages 1506–1515.
  • Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: The penn treebank. Computational linguistics 19(2):313–330.
  • Murphy et al. (1999) Kevin P Murphy, Yair Weiss, and Michael I Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In UAI. Morgan Kaufmann Publishers Inc., pages 467–475.
  • Pang and Lee (2008) Bo Pang and Lillian Lee. 2008.

    Opinion mining and sentiment analysis.

    Foundations and Trends® in Information Retrieval 2(1–2):1–135.
  • Papandreou et al. (2015) George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L Yuille. 2015. Weakly-and semi-supervised learning of a dcnn for semantic image segmentation. arXiv preprint arXiv:1502.02734 .
  • Passos et al. (2014) Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In CoNLL. Ann Arbor, Michigan, pages 78–86.
  • Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics 5:101–115.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP 2014. pages 1532–1543.
  • Peters et al. (2017) Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of ACL 2017. Vancouver, Canada, pages 1756–1765.
  • Qian et al. (2016) Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. 2016. Linguistically regularized lstms for sentiment classification. arXiv preprint arXiv:1611.03949 .
  • Ratinov and Roth (2009) Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL. pages 147–155.
  • Rei (2017) Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In Proceedings of ACL 2017. Vancouver, Canada, pages 2121–2130.
  • Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In NIPS. pages 3859–3869.
  • Sang et al. (2003) Tjong Kim Sang, Erik F, and De Meulder Fien. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of HLT-NAACL 2003-Volume 4. pages 142–147.
  • Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
  • Socher et al. (2012) Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP 2012. pages 1201–1211.
  • Socher et al. (2011) Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011.

    Semi-supervised recursive autoencoders for predicting sentiment distributions.

    In Proceedings of EMNLP 2011. pages 151–161.
  • Søgaard (2011) Anders Søgaard. 2011. Semisupervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of ACL 2011. pages 48–52.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.
  • Sun (2014) Xu Sun. 2014. Structure regularization for structured prediction. In NIPS. pages 2402–2410.
  • Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. Lstm neural networks for language modeling. In InterSpeech.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL 2015. Beijing, China, pages 1556–1566.
  • Tan et al. (2015) Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 .
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. pages 6000–6010.
  • Wang et al. (2016) Xingyou Wang, Weijie Jiang, and Zhiyong Luo. 2016. Combination of convolutional and recurrent neural network for sentiment analysis of short texts. In Proceedings of COLING 2016. pages 2428–2437.
  • Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015.

    Semantically conditioned lstm-based natural language generation for spoken dialogue systems.

    In Proceedings of EMNLP 2015. Lisbon, Portugal, pages 1711–1721.
  • Yang et al. (2017) Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR 2017.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL 2016. pages 1480–1489.