Since sentence modeling serves as the basis for a wide range of natural language processing (NLP) tasks, many sentence encoders have been developed to produce vector representations that describe sentence meanings. Traditionally, the representation of a sentence was derived from simple statistics and linguistic rules (e.g., bag-of-words or bag-of-n-grams). However, such models suffer from the lack of context and word order information. With the development of word embedding and deep learning techniques, the focus of sentence modeling has shifted to deriving a compositional sentence representation from a sequence of word vectors using neural networks, such as convolutional neural networks (CNNs) [3, 4] and recurrent neural networks (RNNs).
RNNs are currently the most popular sentence encoding models; they process sentences word by word using a chain structure. Ideally, RNNs have the capability to retain history information from an arbitrarily long context window. However, the problem of vanishing and exploding gradients during model training makes it difficult for RNNs to learn long-range dependencies [6, 7]. To overcome this deficiency, RNNs with long short-term memory (LSTM) units were proposed and have been successfully applied to many NLP tasks, including dependency parsing [10, 11], question answering [12, 13], machine translation [14, 15, 16], and so forth. However, due to their flat chain structure, LSTM-RNNs are incapable of utilizing syntactic information, which is intrinsically embedded in natural languages, to compose words into sentences for semantic representation.
Some efforts have been made to extend chain-structured LSTMs to tree-structured topologies, i.e., Tree-LSTMs [18, 19], to address this issue. Tree-LSTMs process sentences in a hierarchical and recursive manner by propagating information up a given parsing tree to consider long-distance interactions over syntax structures. Tree-LSTMs have outperformed chain-structured LSTM-RNNs in some NLP tasks, such as sentiment classification and semantic relatedness analysis. However, there are still some deficiencies with Tree-LSTMs. First, the construction of Tree-LSTMs depends on external syntax parsers. A high-performance parser is difficult to construct and may not be available for all languages. Second, Tree-LSTMs derive a state vector for each non-leaf node in the parsing tree. Therefore, combining Tree-LSTMs with other sequential encoders that produce a state vector for each word is not straightforward. Third, the recursive tree structure makes the training of Tree-LSTMs time-consuming because the batch-mode computation commonly adopted for training conventional LSTM-RNNs cannot be applied directly to Tree-LSTMs.
Therefore, this paper proposes a novel neural encoder, named sequential neural encoder with latent structured description (SNELSD), for modeling sentences. This model introduces latent chunk-level representations into conventional sequential neural encoders, i.e., LSTM-RNNs, to implicitly consider the compositionality of languages in semantic modeling. Here, word chunks are adopted as intermediate units between words and sentences for sentence modeling. The boundaries of word chunks are hidden and determined in a task-dependent way, which is different from the conventional text chunking 
task in NLP. An SNELSD is a hierarchical chain-structured model that is composed of a detection layer and a description layer. The detection layer estimates the boundaries of latent word chunks and obtains a chunk-level representation for each word. The description layer processes these representations using modified LSTM units. The model parameters are estimated in an end-to-end manner without using parsing trees. Therefore, the outputs of SNELSDs are sequential, and they can be conveniently augmented with other semantic representations, such as word vectors and the states of LSTM-RNNs, to produce a more comprehensive description of sentence meanings.
The proposed SNELSD model has two main characteristics. First, latent word chunks are adopted as the intermediate units between words and sentences in SNELSD to consider syntax-related structure information during sentence modeling. The conventional text chunking task aims to identify constituent parts of sentences (nouns, verbs, adjectives, and so forth) and then link them to higher order units (noun groups, verb groups, phrases, and so on), which is also called shallow parsing. Chunking sentences into several essential parts is also considered to be a basic cognitive mechanism of human reading. For example, previous research indicated that chunking reading material, or separating sentences into meaningfully related parts, improved the reading comprehension of readers with low reading ability. The mental process of chunking words into phrases is considered necessary because the mind cannot hold more than approximately four to seven separate items in short-term memory.
Second, word chunks are treated as hidden units, and the latent chunk-level representations are embedded in a sequential model structure. The SNELSD model is designed in an end-to-end manner, and it learns to split sentences into hidden word chunks without relying on additional text chunking algorithms. In other words, the SNELSD model is capable of exploring the latent structure information of sentences in a task-dependent manner. In contrast to Tree-LSTMs, which adopt tree-structured topologies, the proposed SNELSD model still follows the chain structure, which guarantees efficient model training using batch-mode computation and makes it convenient to combine the model with other sequential sentence encoders.
In this paper, we first introduce the architecture and related computational formulas of our SNELSD model after a brief review of related works. Then, we evaluate the proposed SNELSD model on a natural language inference (NLI) task and a sentiment analysis (SA) task by comparing it with other sentence encoders, including LSTM-RNNs and Tree-LSTMs. The main contributions of this paper are twofold. On the one hand, this paper proposes a novel sequential neural encoder that implicitly considers the influence of syntax structure on semantic modeling. The SNELSD model has a two-layer hierarchical chain structure, which strikes a good balance between the flat chain structure (i.e., LSTM-RNN) and the hierarchical tree structure (i.e., Tree-LSTM). In other words, the SNELSD is a purely sequential model, like LSTM-RNNs, while it also processes sentences hierarchically, as Tree-LSTMs do. However, our SNELSD model does not rely on additional syntax parsing or text chunking modules but rather explores the latent structured information within sentences through end-to-end training. In our experiments on the NLI and SA tasks, the trained SNELSD models captured some useful and regular chunking patterns that match the intrinsic characteristics of the different tasks well. On the other hand, the proposed SNELSD model helps to obtain better performance than ordinary LSTM-RNNs and Tree-LSTMs on both NLI and SA tasks. On the Stanford Natural Language Inference (SNLI) task, the proposed method achieved a new state-of-the-art accuracy among methods without a multi-model ensemble.
II Related Work
II-A Text Chunking
Text chunking, also called shallow parsing, is an NLP task that aims to identify the constituent parts of sentences (nouns, verbs, adjectives, and so on) and then link them to higher order units (noun groups, verb groups, phrases, and so forth). As an alternative to full parsing, which is more complicated and less robust, text chunking has been applied to many NLP tasks, such as information extraction (IE) [25, 26] and question answering (QA), to extract and represent syntax-related information.
The SNELSD model proposed in this paper adopts word chunks as intermediate units between words and sentences for sentence modeling. The use of word chunks here is different from the conventional text chunking task in two aspects. First, the conventional text chunking task is generally performed over sequences of part-of-speech tags 
based on syntactic regular expressions, and its outputs typically have syntactic labels. However, the determination of word chunks in SNELSD models is task dependent rather than syntax dependent, owing to end-to-end model training. Therefore, the construction of SNELSD models does not rely on additional syntax parsing or text chunking algorithms. Second, in contrast to text chunking, which provides explicit boundaries of syntactic units, the word chunks in SNELSD models are latent descriptions, represented by the probabilities of a chunk boundary existing after each word.
II-B Long Short-Term Memory
The long short-term memory (LSTM) model was proposed to address the gradient vanishing and exploding problem when training recurrent neural networks (RNNs). RNNs with LSTM cells (LSTM-RNNs) have been widely used in NLP applications for the sequential modeling of sentences. An LSTM cell is a complex hidden unit. As shown in Fig. 1, it contains three gates, namely, an input gate, an output gate, and a forget gate, which determine whether to utilize the input, whether to create an output, and whether to update the cell memory state, respectively. Therefore, an RNN that uses LSTM cells is capable of remembering information from a long span of time steps. The following equations define a regular LSTM unit:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i),
f_t = σ(W_f x_t + U_f h_{t-1} + b_f),
o_t = σ(W_o x_t + U_o h_{t-1} + b_o),
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t),

where σ(·) is the sigmoid function and ⊙ is the element-wise multiplication between two vectors. At the t-th step, the LSTM unit accepts the word vector x_t as input, and it employs a set of internal vectors, including an input gate i_t, a forget gate f_t, an output gate o_t, and a memory cell c_t, to generate a hidden state h_t. Fig. 2(a) shows the flat chain structure of a 1-layer LSTM-RNN. For a sentence with n words, we can use either the sequence of output hidden states (h_1, ..., h_n) or the last hidden state h_n as the final sentence representation.
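As a rough NumPy sketch of the recurrence above (with illustrative dimensions and randomly initialised, untrained parameters; the stacked weight layout is a convenience, not the paper's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell. W, U, b stack the parameters of the
    input/forget/output gates and the candidate memory (4*d rows in total)."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # shape (4*d,)
    i = sigmoid(z[0:d])                   # input gate i_t
    f = sigmoid(z[d:2*d])                 # forget gate f_t
    o = sigmoid(z[2*d:3*d])               # output gate o_t
    g = np.tanh(z[3*d:4*d])               # candidate memory c~_t
    c_t = f * c_prev + i * g              # update the memory cell
    h_t = o * np.tanh(c_t)                # hidden state
    return h_t, c_t

# encode a toy 3-word "sentence" and keep the last hidden state
rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
W = rng.normal(size=(4 * d_hid, d_in)) * 0.1
U = rng.normal(size=(4 * d_hid, d_hid)) * 0.1
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(3, d_in)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The final `h` (or the full sequence of hidden states) can then serve as the sentence representation.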
To access both the preceding and succeeding contexts, RNNs with bidirectional LSTM units (BLSTM-RNNs) are commonly adopted. A BLSTM-RNN can be viewed as two independent LSTM-RNNs that process a sentence along the forward and backward directions. For example, to encode a sentence X with a length of n, the two independent LSTM-RNNs compute the forward and backward sequences of hidden states. Then, these two sequences are merged by concatenating the two hidden states at the same time position to obtain the final representation of sentence X.
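The bidirectional merge can be sketched as follows (the toy recurrent step here merely stands in for an LSTM; any step function of the same shape works, which is the only assumption made):

```python
import numpy as np

def run_direction(xs, step, d_hid):
    """Run a single-direction recurrent encoder over xs, returning all hidden states."""
    h = np.zeros(d_hid)
    out = []
    for x in xs:
        h = step(x, h)
        out.append(h)
    return out

def toy_step(x, h):
    # a stand-in recurrent update (an LSTM step would be used in practice)
    return np.tanh(0.5 * h + 0.5 * x.mean() * np.ones_like(h))

xs = [np.ones(4) * k for k in range(5)]           # a sentence of length 5
fwd = run_direction(xs, toy_step, 3)              # forward states h_1 ... h_n
bwd = run_direction(xs[::-1], toy_step, 3)[::-1]  # backward states, re-aligned to time order
# concatenate the states at the same time position: each merged state has 2*d dims
merged = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```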
A deep LSTM-RNN can be constructed by stacking multiple recurrent hidden layers one on the top of another. The LSTM units in the first layer accept a word sequence as input, and the upper layers accept the hidden states of the lower layers as input. The final sentence representation is composed of the hidden states of the top layer. A deep BLSTM-RNN can be constructed in a similar way.
Similar to LSTM-RNNs, the SNELSD model proposed in this paper is also a sequential encoder for sentence modeling. However, the SNELSD adopts a two-layer hierarchical chain structure, in which the hidden units in the two layers are not LSTMs, and they have different architectures to fulfill different functions. The details of SNELSDs will be introduced in Section III.
Tree-LSTMs have recently been investigated to incorporate syntactic parsing information for deriving the representation of sentences [18, 19, 29, 30]. Tree-LSTMs are developed from chain-structured LSTMs and have tree-structured network topologies, as shown in Fig. 2(b). At the j-th node of a binary parsing tree, an input vector x_j, which is a word vector at leaf nodes and a zero vector at non-leaf nodes, and the hidden vectors of the two child nodes (the left child h_j^L and the right child h_j^R) are taken as the inputs to calculate the hidden state vector h_j of the current node. These input vectors are used to configure the memory cell c_j and the four gates, i.e., the input gate i_j, the output gate o_j, and the two forget gates f_j^L and f_j^R. Moreover, the memory cell considers the memory vectors c_j^L and c_j^R from the two child nodes. Specifically, the forward propagation of a Tree-LSTM unit can be calculated as follows:

i_j = σ(W_i x_j + U_i^L h_j^L + U_i^R h_j^R + b_i),
f_j^L = σ(W_f x_j + U_{fL}^L h_j^L + U_{fL}^R h_j^R + b_f),
f_j^R = σ(W_f x_j + U_{fR}^L h_j^L + U_{fR}^R h_j^R + b_f),
o_j = σ(W_o x_j + U_o^L h_j^L + U_o^R h_j^R + b_o),
c̃_j = tanh(W_c x_j + U_c^L h_j^L + U_c^R h_j^R + b_c),
c_j = i_j ⊙ c̃_j + f_j^L ⊙ c_j^L + f_j^R ⊙ c_j^R,
h_j = o_j ⊙ tanh(c_j).
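A minimal NumPy sketch of one binary Tree-LSTM composition step, assuming the standard formulation (parameter shapes, the dictionary layout, and the random initialisation are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_params(d_in, d, rng):
    # one (W, U_left, U_right) triple per gate / candidate memory
    names = ["i", "fL", "fR", "o", "g"]
    return {n: (rng.normal(size=(d, d_in)) * 0.1,
                rng.normal(size=(d, d)) * 0.1,
                rng.normal(size=(d, d)) * 0.1) for n in names}

def tree_lstm_node(x, hL, cL, hR, cR, P):
    """One binary Tree-LSTM node: x is a word vector at leaves and a zero
    vector at non-leaf nodes; each forget gate guards one child memory."""
    def gate(n, act):
        W, UL, UR = P[n]
        return act(W @ x + UL @ hL + UR @ hR)
    i  = gate("i",  sigmoid)          # input gate
    fL = gate("fL", sigmoid)          # forget gate for the left child memory
    fR = gate("fR", sigmoid)          # forget gate for the right child memory
    o  = gate("o",  sigmoid)          # output gate
    g  = gate("g",  np.tanh)          # candidate memory
    c = i * g + fL * cL + fR * cR     # memory considers both child memories
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
d_in, d = 4, 3
P = make_params(d_in, d, rng)
zero = np.zeros(d)
# two leaf nodes consume word vectors; their "children" are zero states
hA, cA = tree_lstm_node(rng.normal(size=d_in), zero, zero, zero, zero, P)
hB, cB = tree_lstm_node(rng.normal(size=d_in), zero, zero, zero, zero, P)
# the parent of the two leaves takes a zero input vector
hRoot, cRoot = tree_lstm_node(np.zeros(d_in), hA, cA, hB, cB, P)
```

Because computation recurses over an arbitrary tree shape, steps like these cannot be batched as easily as the chain recurrence, which is the training-efficiency drawback discussed above.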
Similar to Tree-LSTMs, the SNELSD model proposed in this paper also aims to introduce the compositionality of languages into the semantic modeling of sentences. However, the SNELSD model utilizes hidden word chunks and an end-to-end training strategy, thus avoiding the reliance on additional syntax parsing. Furthermore, SNELSD is a sequential model, not a recursive one with tree structures, which guarantees the efficiency of model training and the convenience of combining other sequential encoders.
III Sequential Neural Encoder with Latent Structured Description (SNELSD)
The LSTM-RNN and Tree-LSTM models exploit a flat chain structure and a hierarchical tree structure, respectively. Our SNELSD model utilizes a hierarchical chain structure, which can be considered as a trade-off between LSTM-RNNs and Tree-LSTMs. Specifically, an SNELSD model has a two-layer structure as shown in Fig. 3. The first layer, named the detection layer, predicts the boundaries of latent word chunks in an input sentence and derives a chunk-level vector for each word. The second layer, named the description layer, utilizes modified LSTM units to process these chunk-level vectors in a recurrent manner and produces sequential encoding outputs. The details of these two layers will be introduced in this section.
III-A Detection Layer
The detection layer is designed to segment a sentence into word chunks. For a sentence with n words, the detection layer accepts the sequence of word vectors (x_1, ..., x_n) as input and produces a sequence of chunk-level vector representations (r_1, ..., r_n) together with a sequence of boundary indicators (d_1, ..., d_n). The value d_t represents the possibility of a chunk boundary existing after the word x_t, and it is a continuous measurement between 0 and 1. In other words, the word chunks are considered to be hidden units in the proposed SNELSD model.
The structure of a detection unit is shown in Fig. 4. At the t-th step, the detection unit receives the current word vector x_t, the next word vector x_{t+1}, the chunk boundary indicator d_{t-1} and the chunk-level representation r_{t-1} of the previous step as input, and it returns a chunk boundary indicator d_t together with the updated chunk-level representation r_t at the current step. Mathematically, a detection unit is defined by the following equations.
These equations can be divided into three operations. The first handles the case in which the words x_{t-1} and x_t belong to the same word chunk, meaning that x_t is not at the beginning of a new word chunk. The updated chunk-level representation, denoted here as r̃_t, depends on both the input word x_t and the previous chunk-level representation r_{t-1}, as shown in (13)-(15). Two internal gate vectors, i.e., an input gate and a forget gate, are adopted to control the continuous encoding process within a word chunk. This is similar to the conventional LSTM unit, since a word chunk can be considered a very short sentence. Second, when the word x_{t-1} is at the end of the previous word chunk and the word x_t is at the beginning of a new word chunk, the updated chunk-level representation, denoted here as r̂_t, depends only on the input word x_t, as shown in (16). Finally, these two different chunk-level representations r̃_t and r̂_t are linearly fused using the boundary indicator d_{t-1}, as shown in (17). The boundary indicator d_t, which describes the possibility of a chunk boundary existing after the word x_t, is calculated using x_t and the following word x_{t+1}, as shown in (18).
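The three operations can be sketched as below. This is a hedged approximation only: the exact parameterisation of (13)-(18) is not reproduced here, and all parameter names, shapes, and the specific gating forms are illustrative assumptions consistent with the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def detection_step(x_t, x_next, d_prev, r_prev, P):
    """One detection unit step (illustrative sketch of (13)-(18))."""
    # Case 1: x_t continues the current chunk -- gated update of r_prev,
    # treating the chunk like a very short sentence (LSTM-style gates).
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ r_prev)        # input gate
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ r_prev)        # forget gate
    r_cont = f * r_prev + i * np.tanh(P["Wc"] @ x_t)
    # Case 2: x_t starts a new chunk -- representation from x_t alone.
    r_new = np.tanh(P["Wn"] @ x_t)
    # Linear fusion of the two cases by the previous boundary indicator.
    r_t = d_prev * r_new + (1.0 - d_prev) * r_cont
    # Boundary indicator after x_t, computed from x_t and the following word.
    d_t = float(sigmoid(P["wd"] @ np.concatenate([x_t, x_next])))
    return d_t, r_t

rng = np.random.default_rng(2)
d_in, d_hid = 4, 3
P = {"Wi": rng.normal(size=(d_hid, d_in)) * 0.1,
     "Ui": rng.normal(size=(d_hid, d_hid)) * 0.1,
     "Wf": rng.normal(size=(d_hid, d_in)) * 0.1,
     "Uf": rng.normal(size=(d_hid, d_hid)) * 0.1,
     "Wc": rng.normal(size=(d_hid, d_in)) * 0.1,
     "Wn": rng.normal(size=(d_hid, d_in)) * 0.1,
     "wd": rng.normal(size=(2 * d_in,)) * 0.1}
words = rng.normal(size=(5, d_in))   # 4 words plus one look-ahead vector
d, r = 1.0, np.zeros(d_hid)          # the first word always starts a chunk
boundaries = []
for t in range(4):
    d, r = detection_step(words[t], words[t + 1], d, r, P)
    boundaries.append(d)
```

Note that each boundary indicator is a continuous value in (0, 1), so the chunking stays soft and fully differentiable for end-to-end training.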
III-B Description Layer
The function of the description layer is to derive the representation of a sentence from the sequence of chunk-level representations given by the detection layer. The structure of a description unit is shown in Fig. 5 and is formulated as follows.
At the t-th step, the description unit accepts the chunk-level representation r_t, the boundary indicator d_t and the hidden state h_{t-1} of the previous step as inputs and generates an updated hidden state h_t. A description unit is a modified LSTM unit. The difference is that a blended input vector is calculated to replace r_t for sequential encoding, as shown in (19). In this equation, v is a constant vector to be estimated. The motivation for introducing v is to emphasize the chunk-level representations of the words that end a chunk (i.e., d_t close to 1) and to ignore the chunk-level representations of intermediary words within a chunk (i.e., d_t close to 0).
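A hedged sketch of the blend in (19), under the assumption (consistent with the motivation above) that the boundary indicator linearly interpolates between the chunk-level vector and the learned constant vector:

```python
import numpy as np

def blended_input(r_t, d_t, v0):
    """When d_t is near 1 (x_t ends a chunk) the chunk-level vector r_t passes
    through; when d_t is near 0 (an intermediary word) the blended input falls
    back to the learned constant vector v0. v0 = 0 here is only illustrative."""
    return d_t * r_t + (1.0 - d_t) * v0

r_t = np.array([0.5, -0.2, 0.1])
v0 = np.zeros(3)                       # illustrative value of the constant vector
x_end = blended_input(r_t, 1.0, v0)    # last word of a chunk: keeps r_t
x_mid = blended_input(r_t, 0.0, v0)    # word inside a chunk: falls back to v0
```

The blended vector then feeds a conventional LSTM recurrence in the description layer.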
From (16)-(24), we find that the detection layer degrades to a simple projection layer that processes word vectors and the description unit becomes identical to a conventional LSTM unit when d_t is constantly equal to 1.
An SNELSD model can work in either stand-alone mode or joint mode. In stand-alone mode, the sequence of hidden state vectors produced by the description layer is used directly for sentence representation. In joint mode, these hidden state vectors are further concatenated with word vectors or the output of an auxiliary chain-structured sentence encoder (e.g., a BLSTM-RNN) at each step to produce a more comprehensive description of sentence meanings. In either mode, all the model parameters, including word embeddings and all transformation matrices, are learned in an end-to-end manner.
IV Evaluation Tasks
Two NLP tasks, namely, natural language inference (NLI) and sentiment analysis (SA), are adopted to evaluate the performance of our proposed SNELSD model. In this section, we briefly introduce these two tasks and explain how to apply SNELSD models to accomplish these two tasks.
IV-A Natural Language Inference (NLI)
Understanding entailment and contradiction is fundamental to understanding the meaning of natural language [31, 23].
Thus, the inference about entailment and contradiction is a valuable task for evaluating sentence encoding models.
Specifically, the natural language inference (NLI) task aims to determine whether a natural language hypothesis h can be inferred from a natural language premise p, as described in the following example.
p: Several airlines polled saw costs grow more than expected, even after adjusting for inflation.
h: Some of the companies in the poll reported cost increases.
In this example, the hypothesis h can be regarded as being entailed from the premise p.
In recent years, there have been advances in NLI. One main advancement is the availability of a large annotated dataset, the Stanford Natural Language Inference (SNLI) dataset, which contains 570K human-written English sentence pairs manually labeled by multiple human subjects. For each sentence pair, the annotation indicates whether the premise sentence entails the hypothesis sentence, whether the two contradict each other, or whether they have no inference relation. With its large scale and human-grounded annotation, the SNLI dataset is well suited for training rather complicated sentence encoding models.
We evaluate our SNELSD model based on the infrastructure of the enhanced BLSTM inference model (EBIM), which has achieved nearly state-of-the-art performance on the SNLI task. Let A = (a_1, ..., a_m) and B = (b_1, ..., b_n) denote the word vector sequences of the premise sentence and the hypothesis sentence, respectively. Each a_i or b_j is an l-dimensional embedding vector, which can be initialized with pre-trained word embeddings. The goal is to predict a label y that indicates the inference relationship between A and B. The architecture of EBIM can be divided into four parts, as follows.
IV-A1 Sentence Encoding
First, the premise A and the hypothesis B are input into a sentence encoder to obtain their context representation vectors Ā = (ā_1, ..., ā_m) and B̄ = (b̄_1, ..., b̄_n), as shown in (25) and (26).
In the original EBIM, the sentence encoder is a single-layer BLSTM-RNN model. Here, we replace it with the other models introduced above for comparison, including multi-layer LSTM-RNNs and BLSTM-RNNs, Tree-LSTM, and the SNELSD model proposed in this paper. Note that when a Tree-LSTM is utilized to encode sentences, the numbers of derived representation vectors are equal to the numbers of non-leaf nodes in the binary syntactic trees of the two sentences. Otherwise, the numbers of representation vectors are equal to the sentence lengths m and n because all the other models have sequential structures.
IV-A2 Soft Alignment
The attention weights between each pair of context vectors ā_i and b̄_j are calculated as e_ij = ā_i^T b̄_j.
Subsequently, the attention weights are normalized and utilized to obtain the soft alignment vectors for both sentences as

ã_i = Σ_{j=1}^{n} [exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)] b̄_j,
b̃_j = Σ_{i=1}^{m} [exp(e_ij) / Σ_{k=1}^{m} exp(e_kj)] ā_i.
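The bidirectional soft alignment amounts to two softmax-normalised attention products, which can be sketched in NumPy (sentence lengths and state dimensions are toy values):

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_align(A_bar, B_bar):
    """Soft alignment between encoded sentences A_bar (m x d) and B_bar (n x d):
    dot-product attention weights e_ij, normalised along each direction."""
    E = A_bar @ B_bar.T                       # e_ij = a_i . b_j, shape (m, n)
    A_tilde = softmax(E, axis=1) @ B_bar      # each premise position summarised by B
    B_tilde = softmax(E, axis=0).T @ A_bar    # each hypothesis position summarised by A
    return A_tilde, B_tilde

rng = np.random.default_rng(4)
A_bar = rng.normal(size=(2, 3))   # premise: 2 positions, 3-dim states
B_bar = rng.normal(size=(4, 3))   # hypothesis: 4 positions
A_tilde, B_tilde = soft_align(A_bar, B_bar)
```

Each row of `A_tilde` is a convex combination of the rows of `B_bar`, and vice versa, matching the normalisation in the equations above.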
IV-A3 Inference Collection and Composition
We collect the inference-related information using the context representation vectors and soft alignment vectors calculated above. Specifically, the EBIM model uses vector concatenation, vector difference, and element-wise product to collect the inference sequences M_a and M_b as follows:

m_{a,i} = [ā_i; ã_i; ā_i − ã_i; ā_i ⊙ ã_i],
m_{b,j} = [b̄_j; b̃_j; b̄_j − b̃_j; b̄_j ⊙ b̃_j],

where M_a and M_b are the sequences composed of m_{a,i} and m_{b,j}, respectively. Then, the sequences M_a and M_b are composed using a BLSTM-RNN model together with average and max pooling to obtain a fixed-dimensional discriminative vector v, which describes the inference relationship between the entire premise sentence and its hypothesis. The calculation is as follows:

v = [v_{a,ave}; v_{a,max}; v_{b,ave}; v_{b,max}],

where v_{a,ave} and v_{a,max} denote the average and max pooling results over the composed premise sequence, and v_{b,ave} and v_{b,max} those over the composed hypothesis sequence.
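A compact sketch of the collection and pooling steps (the composition BLSTM-RNN between the two steps is omitted for brevity; the toy matrices only fix the shapes):

```python
import numpy as np

def collect(X, X_tilde):
    """Inference collection for one sentence: concatenate the context vectors,
    their soft-aligned counterparts, their difference and element-wise product."""
    return np.concatenate([X, X_tilde, X - X_tilde, X * X_tilde], axis=1)

def compose_and_pool(Ma, Mb):
    """Average and max pooling over time for both inference sequences,
    concatenated into one fixed-dimensional discriminative vector v."""
    pools = [Ma.mean(axis=0), Ma.max(axis=0), Mb.mean(axis=0), Mb.max(axis=0)]
    return np.concatenate(pools)

A = np.ones((2, 3)); A_tilde = np.full((2, 3), 0.5)    # toy premise vectors
B = np.ones((4, 3)); B_tilde = np.full((4, 3), 0.25)   # toy hypothesis vectors
Ma, Mb = collect(A, A_tilde), collect(B, B_tilde)
v = compose_and_pool(Ma, Mb)
```

Pooling over the time axis is what makes `v` fixed-dimensional regardless of the two sentence lengths.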
IV-A4 Inference Determination
Finally, the vector v is fed into a classifier whose output y is a 3-dimensional vector that indicates the probabilities of the three different inference relationships between the premise sentence and its hypothesis.
IV-B Sentiment Analysis (SA)
Sentiment analysis (SA) aims to determine the attitude of a speaker or writer with respect to certain topics, or the overall contextual polarity or emotional reaction of a text.
There are many commercial and social applications related to sentiment analysis.
In fact, sentiment analysis can utilize multimodal data, including text, speech and video [33, 34, 35].
In natural language processing, a basic sentiment analysis task is to classify the polarity of a given text at the document, sentence, or feature and aspect level.
Therefore, sentence modeling is essential for this task.
In this paper, we evaluate our SNELSD model on the sentiment polarity task using the Stanford Sentiment Treebank dataset (SST) .
This dataset includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences.
Each label can be one of 5 sentiment classes, from very negative to very positive. The following examples are taken from this dataset, in which only the sentence-level labels are shown.
: taking care of my cat offers a refreshingly different slice of Asian cinema
: no movement, no yuks, not much of anything
Since the focus of this paper is sentence modeling, we exclude the phrase-level samples in the SST dataset and only use sentence-level samples for evaluating different sentence encoders.
The overall architecture of the SA model used in this paper is much simpler than that for NLI. For an input sentence X = (x_1, ..., x_T), each x_t is an l-dimensional embedding vector, which can be initialized with pre-trained word embeddings. First, the sequence X is sent into a sentence encoder to obtain the context representations (h̄_1, ..., h̄_T).
Here, the sentence encoder can be an LSTM-RNN or the proposed SNELSD model. Similar to (25) and (26), the number of context representation vectors is equal to the sentence length T, since both the LSTM-RNN and the SNELSD model have sequential structures. Then, a discriminative vector v is obtained by performing average and max pooling on these context representations.
Finally, the vector v is fed into an MLP classifier to determine the sentiment polarity of the input sentence.
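The pooling-plus-MLP head described above can be sketched as follows (the single hidden layer, all dimensions, and the random initialisation are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def sa_head(H, W1, b1, W2, b2):
    """Pool the encoder states H (T x d) with average and max pooling and
    classify the sentence into 5 polarity classes with a small MLP."""
    v = np.concatenate([H.mean(axis=0), H.max(axis=0)])  # average + max pooling
    hidden = np.tanh(W1 @ v + b1)                        # MLP hidden layer
    logits = W2 @ hidden + b2
    p = np.exp(logits - logits.max())                    # stable softmax
    return p / p.sum()                                   # class probabilities

rng = np.random.default_rng(3)
T, d, d_mlp = 6, 4, 8
H = rng.normal(size=(T, d))                              # toy encoder outputs
W1 = rng.normal(size=(d_mlp, 2 * d)) * 0.1; b1 = np.zeros(d_mlp)
W2 = rng.normal(size=(5, d_mlp)) * 0.1;     b2 = np.zeros(5)
probs = sa_head(H, W1, b1, W2, b2)
```

The predicted sentiment class is simply the argmax of `probs`.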
V Experiments on NLI
V-A Experimental Setup
In our experiments, the Stanford Natural Language Inference (SNLI) dataset was adopted. Following previous work, the sentence pairs lacking consensus among multiple human annotators were removed, and the dataset was split into a training set, a development set, and a test set.
The entire NLI model was constructed following the introduction in Section IV-A. Various sentence encoders, including LSTM-RNN, BLSTM-RNN, Tree-LSTM, and the proposed SNELSD model, were integrated to achieve the sentence encoding in (25) and (26). The constituency parse trees for constructing Tree-LSTM models were produced using the Stanford PCFG Parser 3.5.3. All word embeddings and the hidden state vectors of the sentence encoders had 300 dimensions. Specifically, the hidden states of the BLSTM-RNN model had 600 dimensions since it was composed of two unidirectional LSTM-RNNs along different directions. The model parameters of the sentence encoders were estimated in an end-to-end manner together with the parameters of the other parts of the NLI model. The word embeddings were initialized with pre-trained 300-D GloVe 840B vectors. The Adam [39] method was adopted for optimization, with the first momentum set to 0.9 and the second to 0.999. The initial learning rate was 0.0004, and the batch size was 128. Dropout with a rate of 0.5 was applied to the MLP layer and the word embedding layer.
Parts B and C of Table I. Columns: sentence encoder, training accuracy (%), test accuracy (%).

B  [1-layer LSTM; word embedding]    92.9   87.5
   [1-layer BLSTM; word embedding]   92.8   87.7
   [2-layer LSTM; word embedding]    93.9   87.4
   [2-layer BLSTM; word embedding]   92.8   87.7
   [SNELSD; word embedding]          92.2   88.0
C  [2-layer LSTM; 1-layer BLSTM]     95.1   87.8
   [SNELSD; 1-layer BLSTM]           94.5   88.0
V-B Performance of SNELSD in Stand-Alone Mode
We compared the performance of using SNELSD in stand-alone mode for sentence modeling with that of using the LSTM-RNN and Tree-LSTM models. Considering that an SNELSD has a two-layer sequential structure, four different LSTM-RNNs, including 1-layer and 2-layer unidirectional LSTM-RNNs and 1-layer and 2-layer BLSTM-RNNs, were used for comparison. The NLI accuracies of using these sentence encoders are shown in Part A of Table I.
First, we observe that the SNELSD model achieved performance similar to that of the Tree-LSTM model, which exploited syntactic information explicitly. The SNELSD model performed better than the 2-layer LSTM-RNN, which also has a 2-layer unidirectional chain structure. However, the accuracy of the SNELSD model was still lower than that of the 2-layer BLSTM-RNN. One possible reason for this result is the lack of backward processing in SNELSD.
Furthermore, both the 1-layer LSTM-RNN and the 1-layer BLSTM-RNN outperformed their 2-layer counterparts and the two models with structured descriptions (i.e., Tree-LSTM and SNELSD). Although it has been observed empirically that multi-layer RNNs achieve better performance than 1-layer RNNs on some tasks, such as machine translation in an encoder-decoder framework, the gains of using stacked RNNs for sentence modeling are still not theoretically clear. One possible reason for the superiority of 1-layer RNNs in our experiment is that stacking RNN layers or introducing a structured description leads to abstract and compositional representations of sentences, which may be inappropriate for the NLI task. Some less compositional descriptions, such as the meanings of individual key words in sentences, are also important for deciding the inference relationship between two sentences.
In terms of the training efficiency of different models, we observed that the training speed of SNELSD was similar to that of the 2-layer LSTM but significantly faster than that of the Tree-LSTM in our experiments. As discussed in Section I, the sequential architecture of SNELSD guarantees the efficiency of model training.
V-C Performance of SNELSD in Joint Mode
The performance of using SNELSD in joint mode was further investigated through experiments. Here, the hidden state vectors produced by the SNELSD model were concatenated with unprocessed word embeddings or with the output of a 1-layer BLSTM-RNN to achieve the sentence encoding in (25) and (26). Some models that combined the output of LSTM-RNNs, BLSTM-RNNs, and word embeddings for sentence modeling were also constructed for comparison. The results of these models are shown in Parts B and C of Table I.
Comparing the results in Part B of Table I with those in Part A, we find that the concatenation with word embeddings failed to improve the performance of 1-layer RNNs, whereas it increased the accuracy of 2-layer RNNs and the SNELSD model. Comparing the results in Part C with those in Part A, we can observe the positive effects of concatenating the output of 1-layer BLSTM-RNN with 2-layer LSTM-RNN and the SNELSD model. These results demonstrate the importance of utilizing a comprehensive sentence representation for the NLI task.
As shown in Parts B and C of Table I, the proposed joint-mode SNELSD model achieved the highest accuracy among all evaluated sentence encoders. This result demonstrates that, by introducing a latent structured description, the SNELSD model is able to provide some useful structured semantic information for NLI that conventional sequential sentence encoders may ignore.
V-D Comparison with State-of-the-Art Results
Table II. Columns: model, training accuracy (%), test accuracy (%).

(1)  decomposable attention model
(2)  NTI-SLSTM-LSTM                            88.5   87.3
(3)  EBIM                                      92.9   87.7
(4)  ESIM                                      92.6   88.0
(5)  BiMPM                                     –      86.9
(6)  [SNELSD; 1-layer BLSTM]                   92.2   88.0
(7)  [SNELSD; 1-layer BLSTM]*                  94.1   88.3
(8)  ESIM + Syntactic tree-LSTM (Ensemble)     93.5   88.6
(9)  BiMPM (Ensemble)                          –      88.8
(10) [SNELSD; 1-layer BLSTM]* (Ensemble)       93.3   88.7
Here, the proposed method using SNELSD models was compared with other published state-of-the-art methods on the benchmark SNLI task. The results are shown in Table II. All the models compared here utilize the framework of word-by-word attention between the semantic representations of two sentences. The methods marked with (1)-(7) adopted a single model for NLI, and the methods marked with (8)-(10) ensembled multiple models for decision making.
The decomposable attention model marked with (1) can be treated as a simplified version of EBIM. The most obvious difference between them is that the decomposable attention model employs feedforward neural networks, whereas the EBIM model uses BLSTM-RNNs to encode the sentences and to compose the inference information. This model performed considerably worse than the other models shown in Table II.
The NTI-SLSTM-LSTM model marked with (2) utilized a full binary tree structure, which incorporates sequential LSTM-based encoding, recursive networks and a complicated combination of attention mechanisms. However, this model still performed worse than EBIM and joint-mode SNELSD. The tree structure also made training the model time-consuming.
The EBIM model marked with (3) is the same as the model using a 1-layer BLSTM-RNN as the sentence encoder in Table I. The ESIM model marked with (4) is an improved version of the EBIM model that modifies the calculation at the inference collection step described in Section IV-A.
The BiMPM model marked with (5) is almost the same as EBIM. There are two main differences between these models. The first is that the original word representation used in BiMPM is the concatenation of pre-trained GloVe word embeddings and character-composed embeddings derived from an LSTM-RNN. The second is that BiMPM adopts a more complicated matching operation than EBIM. We find that both EBIM and joint-mode SNELSD were superior to BiMPM.
The models marked with (6) and (7) utilized joint-mode SNELSD in an EBIM and an ESIM fashion, respectively. When combining joint-mode SNELSD with ESIM, the outputs of SNELSD and the 1-layer BLSTM-RNN were concatenated after inference collection and dimension reduction. Models (6) and (7) achieved higher NLI accuracies than their counterparts using a 1-layer BLSTM-RNN for sentence modeling. Furthermore, combining joint-mode SNELSD in an ESIM fashion achieved the best result among all single-model methods listed in Table II. A close examination of the test set errors made by models (4) and (7) shows that about 3.5% of the test samples were misclassified by model (4) but classified correctly by model (7). Conversely, about 3.2% of the test samples were misclassified by model (7) but classified correctly by model (4). More than 70% of the errors made by these two models were the same, which implies that common deficiencies still exist in both models. We also observed that some of the common errors were due to the lack of background and commonsense knowledge during inference. How to integrate external knowledge represented by lexical databases, such as WordNet, into the sentence modeling and matching of NLI is a topic worth further investigation.
The model marked with (8) assembled an original ESIM and a modified ESIM using a syntactic Tree-LSTM for sentence encoding and information composition. It achieved a significant improvement over the single ESIM model. The model marked with (9) assembled two BiMPMs with identical architectures. Similarly, we constructed an ensemble model using two instances of model (7) trained independently. Thus, two posterior probability vectors corresponding to the three options of NLI were returned for each sentence pair. Following the model ensemble strategy used in [43, 44], these two posterior probability vectors were averaged to obtain the final one for decision. It can be observed that this model ensemble achieved an accuracy of , which was higher than the accuracy of using a single model and slightly higher than the ensembled ESIM utilizing a syntactic Tree-LSTM. Although the difference between the models marked with (8) and (10) is insignificant, the SNELSD model marked with (10) has other benefits, such as no dependency on external syntactic parsers and a simple end-to-end training framework. Additionally, our SNELSD model did not gain as much from the ensemble as the BiMPM model did. One possible reason is that the performance of our single SNELSD model was quite stable across its independently trained instances.
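The posterior-averaging ensemble described above amounts to an element-wise mean of the two models' output distributions followed by an argmax. A minimal sketch, with hypothetical posterior values:

```python
# Sketch of the posterior-averaging ensemble: two independently trained
# instances each return a 3-way posterior over {entailment, neutral,
# contradiction}; the averaged vector determines the final decision.

def ensemble_decision(post_1, post_2):
    avg = [(p + q) / 2.0 for p, q in zip(post_1, post_2)]
    label = max(range(len(avg)), key=avg.__getitem__)  # argmax
    return label, avg

# Hypothetical posteriors from the two instances for one sentence pair.
p1 = [0.55, 0.30, 0.15]
p2 = [0.40, 0.45, 0.15]
label, avg = ensemble_decision(p1, p2)
print(label, avg)  # label 0 (entailment) wins after averaging
```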
V-E Visualization of Latent Word Chunks in SNELSD for NLI
As described in Section I, an SNELSD model segments sentences into word chunks in a task-dependent manner for sentence modeling. The word chunks are latent and described by a sequence of stochastic chunk boundary indicators. Here, the calculated boundary indicators for several example sentences are displayed to visualize the latent word chunks considered in SNELSD modeling. The single SNELSD model in Part A of Table I was adopted, and the results are shown in Fig. 6. In this figure, each line corresponds to a hypothesis sentence in the SNLI test set. The boundary indicator value is written below each word in the sentence within a red-green color block. The redder the color block, the smaller the indicator value, meaning that a word chunk boundary is less likely to occur after that word. Conversely, the greener the color block, the larger the value and the higher the probability of a chunk boundary existing after that word. For better illustration, red arrows are added in Fig. 6 to indicate the positions with high boundary indicator values.
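The red-green scale of Fig. 6 can be sketched as a linear interpolation between the two colors. The snippet below is an illustrative reconstruction (not the authors' plotting code), with hypothetical words and indicator values:

```python
# Minimal sketch of the red-green colour scale in Fig. 6: a boundary
# indicator near 0 maps to red (boundary unlikely), near 1 to green
# (boundary likely). Sentence and indicator values are hypothetical.

def indicator_to_rgb(d):
    """Linearly interpolate from red (d=0) to green (d=1)."""
    d = min(max(d, 0.0), 1.0)  # clamp to [0, 1]
    return (int(round(255 * (1 - d))), int(round(255 * d)), 0)

sentence   = ["a", "man", "is", "playing", "guitar"]
indicators = [0.05, 0.92, 0.10, 0.30, 0.98]  # hypothetical values
for word, d in zip(sentence, indicators):
    marker = " <- likely chunk boundary" if d > 0.5 else ""
    print(f"{word:10s} d={d:.2f} rgb={indicator_to_rgb(d)}{marker}")
```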
As shown in Fig. 6, the chunking patterns provided by the SNELSD model can partially capture the main grammatical structure of a sentence. Most of the phrase boundaries have high boundary indicator values, which means that they are also judged to have high probabilities of being word chunk boundaries in the SNELSD model. This result is reasonable since the SNELSD model learns how to segment sentences into word chunks through task-dependent, end-to-end model training, and syntax-related information should be useful for the NLI task.
| Part | Sentence encoder | Accuracy (%) |
|------|------------------|--------------|
| B | [1-layer LSTM; word embedding] | 46.2 (1.27) |
| B | [1-layer BLSTM; word embedding] | 46.8 (1.46) |
| B | [2-layer LSTM; word embedding] | 46.0 (1.00) |
| B | [2-layer BLSTM; word embedding] | 46.1 (1.35) |
| B | [SNELSD; word embedding] | 46.7 (0.90) |
| C | [2-layer LSTM; 1-layer BLSTM] | 47.0 (1.14) |
| C | [SNELSD; 1-layer BLSTM] | 47.9 (1.24) |

Table III: Average accuracies (%) with standard deviations of different sentence encoders on SA.
VI Experiments on SA
In contrast to NLI, which concerns the relationship between two sentences, sentiment analysis (SA) classifies a single sentence. In this section, we performed experiments to compare the SNELSD, LSTM-RNN, and Tree-LSTM models on SA, similar to those in Section V.
VI-A Experimental Setup
We used the Stanford Sentiment Treebank (SST) dataset and the same data split as Socher et al. (2013) , in which the numbers of training, development, and test sentences were , , and , respectively. Since the focus of this paper is on sentence modeling, we excluded the phrase-level samples in the SST dataset and only used sentence-level samples to evaluate different sentence encoders.
The framework introduced in Section IV-B was followed to construct the SA model. Different sentence encoders were used to fulfill the sentence encoding in (37) for comparison. During model training, cross entropy was chosen as the loss function, and the Adadelta method was used for optimization. The decay rate and the constant of Adadelta were set to and , respectively. The batch size was 16. All word embeddings and the hidden state vectors of the sentence encoders had 300 dimensions. Specifically, the hidden states of the BLSTM-RNN model had 600 dimensions since it is composed of two unidirectional LSTM-RNNs along different directions. The dropout strategy was not applied, to avoid instability among different training trials. The word embeddings were initialized with pre-trained 300D GloVe 840B vectors . The results are shown in Table III, where the evaluated sentence encoders are the same as those in Table I. All results in Table III are averages and standard deviations over 40 training trials on the test set.
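For reference, the Adadelta update rule used for optimization here can be sketched in a few lines. The decay rate and constant below are the defaults from Zeiler (2012), not necessarily the values used in our experiments:

```python
# Minimal scalar implementation of the Adadelta update rule (Zeiler, 2012).
# rho is the decay rate, eps the numerical stability constant; the defaults
# below are from the original paper, not necessarily this paper's settings.
import math

def adadelta_step(grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update; `state` holds the running averages E[g^2], E[dx^2]."""
    eg2, edx2 = state
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                    # accumulate gradient
    dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * grad  # compute update
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                    # accumulate update
    return dx, (eg2, edx2)

# Toy run: minimise f(x) = x^2 from x = 3.0 (gradient is 2x).
x, state = 3.0, (0.0, 0.0)
for _ in range(500):
    dx, state = adadelta_step(2 * x, state)
    x += dx
print(f"x after 500 steps: {x:.3f}")
```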
VI-B Performance of SNELSD in Stand-Alone Mode
We compared the performance of using SNELSD in stand-alone mode for the sentence encoding in SA with that of using the LSTM-RNN and Tree-LSTM models. The results are shown in Part A of Table III.
It can be observed that the Tree-LSTM model achieved the best performance among all evaluated sentence encoders. (This result was derived using the code for training the constituency Tree-LSTM model in .) Similar to the NLI results shown in Table I, the 2-layer LSTM and BLSTM models also performed worse than their 1-layer counterparts on the SA task, which is consistent with the experimental results in . However, our SNELSD model, which has a similar 2-layer unidirectional chain structure to the 2-layer LSTM-RNN, outperformed both the 2-layer LSTM and the 2-layer BLSTM and obtained almost the same test accuracy as the 1-layer LSTM and BLSTM. This result implies that the detection layer and description layer in our SNELSD model may be capable of utilizing latent structure information in sentences that conventional LSTM units cannot capture.
VI-C Performance of SNELSD in Joint Mode
Comparing the results in Part B of Table III with those in Part A, it can be found that concatenation with word embeddings degraded the performance of all sentence encoders, particularly the 1-layer LSTM and 2-layer BLSTM models, which suffered a significant accuracy reduction of approximately . This result is inconsistent with the results on the NLI task and implies that simply considering the representations of all individual words in sentence modeling may be inappropriate for the SA task.
Examining the results in Part C of Table III, it can be observed that both the 2-layer LSTM and our proposed SNELSD model benefited from concatenation with another 1-layer BLSTM encoder. The joint-mode SNELSD model achieved an accuracy of , the highest among all evaluated sentence encoders. This result further confirms that the SNELSD model can capture some information that ordinary LSTM models cannot.
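The joint-mode combination amounts to concatenating, word by word, the hidden states of the two encoders. A schematic sketch using the dimensions reported above (300 for SNELSD, 600 for the BLSTM, which runs in both directions), with random arrays standing in for real encoder outputs:

```python
# Schematic of the joint-mode combination: the per-word hidden states of the
# SNELSD encoder and of a 1-layer BLSTM are concatenated into one
# representation per word. The arrays below are random stand-ins for the
# actual encoder outputs.
import numpy as np

sentence_len = 7
h_snelsd = np.random.randn(sentence_len, 300)   # SNELSD hidden states
h_blstm  = np.random.randn(sentence_len, 600)   # BLSTM states (2 x 300 dims)

# Joint-mode representation: one 900-dim vector per word.
h_joint = np.concatenate([h_snelsd, h_blstm], axis=-1)
print(h_joint.shape)  # (7, 900)
```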
VI-D Comparison with Published Results
Most existing works on the Stanford Sentiment Treebank (SST) dataset trained their models using both phrase-level and sentence-level labels. The best published results we could find that adopted only the sentence-level samples of this dataset for model training and testing are compared in Table IV.
The S-LSTM marked with (1) is also a tree-structured LSTM model, which is similar to our Tree-LSTM model but initializes its word embeddings randomly. The models marked with (2) and (3) are CNN-based models using Word2Vec and GloVe to initialize word embeddings, respectively. The joint-mode SNELSD model achieved the best performance among all these models.
VI-E Visualization of Latent Word Chunks in SNELSD for SA
We also visualized the chunking patterns obtained when applying the SNELSD model to the SA task. The boundary indicator values of the SNELSD model in Part A of Table III were used for illustration, and the results are shown in Fig. 7. The first 5 sentences had negative sentiment labels, whereas the last 3 were positive. From this figure, it can be found that word chunk boundaries tend to appear after words with strong emotion polarities. This result is reasonable since more attention should be given to these words in sentiment analysis.
VII Conclusion
This paper has proposed a novel sentence encoding model named the sequential neural encoder with latent structure description (SNELSD). This model has a 2-layer hierarchical chain structure and splits sentences into latent word chunks through end-to-end learning. A natural language inference (NLI) task and a sentiment analysis (SA) task were introduced to evaluate the proposed model. The experimental results show that the proposed SNELSD model fits these tasks well and that joint-mode SNELSDs outperform ordinary LSTM models by capturing and utilizing the latent structural information of sentences in a task-dependent manner. Applying the proposed SNELSD model to more tasks, such as paragraph comprehension and question answering, will be part of our future work.
-  Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 3104–3112. [Online]. Available: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
-  Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014, pp. 1746–1751. [Online]. Available: http://aclweb.org/anthology/D14-1181
-  N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modelling sentences,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2014, pp. 655–665. [Online]. Available: http://aclweb.org/anthology/P14-1062
-  Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” CoRR, vol. abs/1506.00019, 2015. [Online]. Available: http://arxiv.org/abs/1506.00019
-  Y. Bengio, P. Y. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994. [Online]. Available: https://doi.org/10.1109/72.279181
-  S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  E. Kiperwasser and Y. Goldberg, “Simple and accurate dependency parsing using bidirectional LSTM feature representations,” Transactions of the Association of Computational Linguistics, vol. 4, pp. 313–327, 2016. [Online]. Available: http://aclweb.org/anthology/Q16-1023
-  G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2016, pp. 260–270. [Online]. Available: http://aclweb.org/anthology/N16-1030
-  Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF models for sequence tagging,” CoRR, vol. abs/1508.01991, 2015. [Online]. Available: http://arxiv.org/abs/1508.01991
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn, 2016.
-  M. Ren, R. Kiros, and R. S. Zemel, “Exploring models and data for image question answering,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2953–2961. [Online]. Available: http://papers.nips.cc/paper/5640-exploring-models-and-data-for-image-question-answering
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014. [Online]. Available: http://arxiv.org/abs/1409.0473
-  T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2015, pp. 1412–1421. [Online]. Available: http://aclweb.org/anthology/D15-1166
-  W. Ling, I. Trancoso, C. Dyer, and A. W. Black, “Character-based neural machine translation,” Computer Science, 2015.
-  D. Dowty, “Compositionality as an empirical problem,” Direct compositionality, no. 14, pp. 23–101, 2007.
-  K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2015, pp. 1556–1566. [Online]. Available: http://aclweb.org/anthology/P15-1150
-  X. Zhu, P. Sobhani, and H. Guo, “Long short-term memory over recursive structures,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 1604–1612. [Online]. Available: http://jmlr.org/proceedings/papers/v37/zhub15.html
-  T. Zhang, F. Damerau, and D. Johnson, “Text chunking based on a generalization of winnow,” Journal of Machine Learning Research, vol. 2, pp. 615–637, 2002. [Online]. Available: http://www.jmlr.org/papers/v2/zhang02c.html
-  C. A. Casteel, “Effects of chunked text-material on reading comprehension of high and low ability readers,” Reading Improvement, vol. 27, no. 4, p. 269, 1990.
-  G. A. Miller, “The magical number seven, plus or minus two: some limits on our capacity for processing information.” Psychological review, vol. 63, no. 2, p. 81, 1956.
-  S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2015, pp. 632–642. [Online]. Available: http://aclweb.org/anthology/D15-1075
-  S. Federici, S. Montemagni, and V. Pirrelli, “Shallow parsing and text chunking: a view on underspecification in syntax,” Cognitive science research paper-university of Sussex CSRP, pp. 35–44, 1996.
-  M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, “Open information extraction from the web,” in IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6-12, 2007, 2007, pp. 2670–2676. [Online]. Available: http://ijcai.org/Proceedings/07/Papers/429.pdf
-  S. Sarawagi and W. W. Cohen, “Semi-Markov conditional random fields for information extraction,” in Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], 2004, pp. 1185–1192. [Online]. Available: http://papers.nips.cc/paper/2648-semi-markov-conditional-random-fields-for-information-extraction
-  R. Soricut and E. Brill, “Automatic question answering: Beyond the factoid,” in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 2004. [Online]. Available: http://aclweb.org/anthology/N04-1008
-  L. Ramshaw and M. Marcus, Third Workshop on Very Large Corpora, 1995, ch. Text Chunking using Transformation-Based Learning. [Online]. Available: http://aclweb.org/anthology/W95-0107
-  Q. Chen, X. Zhu, Z. Ling, S. Wei, and H. Jiang, “Enhancing and combining sequential and Tree LSTM for natural language inference,” arXiv preprint arXiv:1609.06038v1, 2016.
-  P. Le and W. Zuidema, “Compositional distributional semantics with long short term memory,” in Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 2015, pp. 10–19. [Online]. Available: http://aclweb.org/anthology/S15-1002
-  B. MacCartney and C. D. Manning, “Modeling semantic containment and exclusion in natural language inference,” in Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Coling 2008 Organizing Committee, 2008, pp. 521–528. [Online]. Available: http://aclweb.org/anthology/C08-1066
-  B. Pang, L. Lee et al., “Opinion mining and sentiment analysis,” Foundations and Trends® in Information Retrieval, vol. 2, no. 1–2, pp. 1–135, 2008.
-  A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, 2017, pp. 1114–1125. [Online]. Available: http://aclanthology.info/papers/D17-1116/d17-1116
-  S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain, “Convolutional MKL based multimodal emotion recognition and sentiment analysis,” in IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, 2016, pp. 439–448. [Online]. Available: https://doi.org/10.1109/ICDM.2016.0055
-  S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L. Morency, “Context-dependent sentiment analysis in user-generated videos,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017, pp. 873–883. [Online]. Available: https://doi.org/10.18653/v1/P17-1081
-  R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2013, pp. 1631–1642. [Online]. Available: http://aclweb.org/anthology/D13-1170
-  D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003. [Online]. Available: http://aclweb.org/anthology/P03-1054
-  J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014, pp. 1532–1543. [Online]. Available: http://aclweb.org/anthology/D14-1162
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
-  Y. Goldberg, “A primer on neural network models for natural language processing,” J. Artif. Intell. Res., vol. 57, pp. 345–420, 2016. [Online]. Available: https://doi.org/10.1613/jair.4992
-  A. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016, pp. 2249–2255. [Online]. Available: http://aclweb.org/anthology/D16-1244
-  T. Munkhdalai and H. Yu, “Neural tree indexers for text understanding,” CoRR, vol. abs/1607.04492, 2016. [Online]. Available: http://arxiv.org/abs/1607.04492
-  Q. Chen, X. Zhu, Z. Ling, S. Wei, and H. Jiang, “Enhancing and combining sequential and Tree LSTM for natural language inference,” arXiv preprint arXiv:1609.06038v2, 2017.
-  Z. Wang, W. Hamza, and R. Florian, “Bilateral multi-perspective matching for natural language sentences,” arXiv preprint arXiv:1702.03814, 2017.
-  G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
-  M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012. [Online]. Available: http://arxiv.org/abs/1212.5701
-  Y. Zhang and B. C. Wallace, “A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification,” CoRR, vol. abs/1510.03820, 2015. [Online]. Available: http://arxiv.org/abs/1510.03820
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 3111–3119. [Online]. Available: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality