Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval

02/24/2015 ∙ by Hamid Palangi, et al. ∙ Microsoft ∙ The University of British Columbia

This paper develops a model that addresses sentence embedding, a hot topic in current natural language processing research, using recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells. Due to its ability to capture long-term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. In this paper, the LSTM-RNN is trained in a weakly supervised manner on user click-through data logged by a commercial web search engine. Visualization and analysis are performed to understand how the embedding process works. The model is found to automatically attenuate unimportant words and detect the salient keywords in the sentence. Furthermore, these detected keywords are found to automatically activate different cells of the LSTM-RNN, where words belonging to a similar topic activate the same cell. As a semantic representation of the sentence, the embedding vector can be used in many different applications. The automatic keyword detection and topic allocation abilities of the LSTM-RNN allow the network to perform document retrieval, a difficult language processing task, in which the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN. On a web search task, the LSTM-RNN embedding is shown to significantly outperform several existing state-of-the-art methods. We emphasize that the proposed model generates sentence embedding vectors that are especially useful for web document retrieval tasks. A comparison with a well-known general sentence embedding method, the Paragraph Vector, is performed; the results show that the proposed method significantly outperforms it for the web document retrieval task.



I Introduction

Learning a good representation (or features) of input data is an important task in machine learning. In text and language processing, one such problem is learning an embedding vector for a sentence; that is, training a model that can automatically transform a sentence into a vector that encodes its semantic meaning. While word embedding is learned using a loss function defined on word pairs, sentence embedding is learned using a loss function defined on sentence pairs. Sentence embedding usually takes the relationship among the words in the sentence, i.e., the context information, into consideration. Therefore, sentence embedding is more suitable for tasks that require computing semantic similarities between text strings. By mapping texts into a unified semantic representation, the embedding vector can be further used for different language processing applications, such as machine translation [1], sentiment analysis [2], and information retrieval [3]. In machine translation, a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) cells, or LSTM-RNN, is used to encode an English sentence into a vector that contains the semantic meaning of the input sentence, and another LSTM-RNN is then used to generate a French (or other target language) sentence from that vector. The model is trained to best predict the output sentence. In [2], a paragraph vector is learned in an unsupervised manner as a distributed representation of sentences and documents, which is then used for sentiment analysis. Sentence embedding can also be applied to information retrieval, where the contextual information is properly represented by vectors in the same space for fuzzy text matching.


In this paper, we propose to use an RNN to sequentially accept each word in a sentence and recurrently map it into a latent space together with the historical information. As the RNN reaches the last word in the sentence, the hidden activations form a natural embedding vector for the contextual information of the sentence. We further incorporate LSTM cells into the RNN model (i.e., the LSTM-RNN) to address the difficulty of learning long-term memory in RNNs. The learning of such a model is performed in a weakly supervised manner on click-through data logged by a commercial web search engine. Although manually labelled data are expensive to obtain at the scale machine learning requires, logged data with limited feedback signals are massively available thanks to widely used commercial web search engines. Limited feedback information such as click-through data provides a weak supervision signal indicating the semantic similarity between the text on the query side and the clicked text on the document side. To exploit such a signal, the objective of our training is to maximize the similarity between the two vectors mapped by the LSTM-RNN from the query and the clicked document, respectively. Consequently, the learned embedding vectors of the query and the clicked document are specifically useful for the web document retrieval task.

An important contribution of this paper is to analyse the embedding process of the LSTM-RNN by visualizing the internal activation behaviours in response to different text inputs. We show that the embedding process of the learned LSTM-RNN effectively detects the keywords, while attenuating less important words, in the sentence automatically by switching on and off the gates within the LSTM-RNN cells. We further show that different cells in the learned model indeed correspond to different topics, and the keywords associated with a similar topic activate the same cell unit in the model. As the LSTM-RNN reads to the end of the sentence, the topic activation accumulates and the hidden vector at the last word encodes the rich contextual information of the entire sentence. For this reason, a natural application of sentence embedding is web search ranking, in which the embedding vector from the query can be used to match the embedding vectors of the candidate documents according to the maximum cosine similarity rule. Evaluated on a real web document ranking task, our proposed method significantly outperforms many of the existing state-of-the-art methods in NDCG scores. Note that when we refer to a document in this paper, we mean the title (headline) of the document.

II Related Work

Inspired by the word embedding method [4, 5], the authors in [2] proposed an unsupervised learning method to learn a paragraph vector as a distributed representation of sentences and documents, which is then used for sentiment analysis with superior performance. However, the model is not designed to capture the fine-grained sentence structure. In [6], an unsupervised sentence embedding method is proposed that performs well on large corpora of contiguous text, e.g., the BookCorpus [7]. The main idea is to encode a sentence and then decode the previous and next sentences using two separate decoders. The encoder and decoders are RNNs with Gated Recurrent Units (GRU) [8]. However, this sentence embedding method is not designed for the document retrieval task, which has supervision relating queries to clicked and un-clicked documents. In [9], a Semi-Supervised Recursive Autoencoder (RAE) is proposed and used for sentiment prediction. Similar to our proposed method, it does not need any language-specific sentiment parsers. A greedy approximation method is proposed to construct a tree structure for the input sentence, and a vector is assigned to each word, which can become practically problematic for large vocabularies. It works on both unlabeled data and supervised sentiment data.

Similar to the recurrent models in this paper, the DSSM [3] and CLSM [10] models, developed for information retrieval, can also be interpreted as sentence embedding methods. However, DSSM treats the input sentence as a bag of words and does not model word dependencies explicitly. CLSM treats a sentence as a bag of n-grams, where n is defined by a sliding window, and can capture local word dependencies; a max-pooling layer is then used to form a global feature vector. The methods in [11] are also convolution-based networks for Natural Language Processing (NLP). By design, these models cannot capture long-distance dependencies, i.e., dependencies among words belonging to non-overlapping n-grams. In [12], a Dynamic Convolutional Neural Network (DCNN) is proposed for sentence embedding. Similar to CLSM, DCNN does not rely on a parse tree and is easily applicable to any language. However, unlike CLSM, which uses regular max-pooling, DCNN uses dynamic k-max-pooling: instead of keeping only the largest entries among the word vectors in one vector, the k largest entries are kept in k different vectors. DCNN has shown good performance in sentiment prediction and question type classification tasks. In [13], a convolutional neural network architecture is proposed for sentence matching and has shown strong performance in several matching tasks. In [14], Bilingually-constrained Recursive Auto-encoders (BRAE) are proposed to create semantic vector representations for phrases; experiments show that the method performs well in two end-to-end SMT tasks.

Long Short-Term Memory networks were developed in [15] to address the difficulty of capturing long-term memory in RNNs. They have been successfully applied to speech recognition, achieving state-of-the-art performance [16, 17]. In text analysis, the LSTM-RNN treats a sentence as a sequence of words with internal structure, i.e., word dependencies. It encodes the semantic vector of a sentence incrementally, which differs from DSSM and CLSM. The encoding process is performed left-to-right, word-by-word. At each time step, a new word is encoded into the semantic vector, and the word dependencies embedded in the vector are “updated”. When the process reaches the end of the sentence, the semantic vector has embedded all the words and their dependencies, and hence can be viewed as a feature vector representation of the whole sentence. In the machine translation work [1], an input English sentence is converted into a vector representation using an LSTM-RNN, and then another LSTM-RNN is used to generate an output French sentence. The model is trained to maximize the probability of predicting the correct output sentence. In [18], there are two main composition models: the “ADD” model, which is a bag of words, and the “BI” model, which is a summation over bi-gram pairs plus a non-linearity. In our proposed model, instead of a simple summation, we use an LSTM model with letter tri-grams, which keeps valuable information over long intervals (for long sentences) and throws away useless information. In [19], an encoder-decoder approach is proposed to jointly learn to align and translate sentences from English to French using RNNs. The concept of “attention” in the decoder, discussed in that paper, is closely related to how our proposed model extracts keywords on the document side; for further explanation please see section V-A2. In [20], a set of visualizations is presented for RNNs with and without LSTM cells and GRUs. Different from our work, where the target task is sentence embedding for document retrieval, the target tasks in [20] were character-level sequence modelling for text and source code. Interesting observations about the interpretability of some LSTM cells and the statistics of gate activations are presented. In section V-A we show that some of our visualization results are consistent with the observations reported in [20]. We also present more detailed visualizations specific to the document retrieval task using click-through data, as well as visualizations of how our proposed model can be used for keyword detection.

Different from the aforementioned studies, the method developed in this paper trains the model so that sentences that are paraphrases of each other are close in their semantic embedding vectors — see the description in Sec. IV further ahead. Another reason that the LSTM-RNN is particularly effective for sentence embedding is its robustness to noise. For example, in the web document ranking task, the noise comes from two sources: (i) not every word in the query / document is equally important, and we only want to “remember” salient words using the limited “memory”; (ii) a word or phrase that is important to a document may not be relevant to a given query, and we only want to “remember” the words that are useful for computing the relevance of the document to the given query. We will illustrate the robustness of the LSTM-RNN in this paper. The structure of the LSTM-RNN also circumvents the serious limitation of the fixed window size used in CLSM; our experiments show that this difference leads to significantly better results in the web document retrieval task. Furthermore, it allows us to capture keywords and key topics effectively. The models in this paper also do not need the extra max-pooling layer required by CLSM to capture global contextual information, and they do so more effectively.

III Sentence Embedding Using RNNs with and without LSTM Cells

In this section, we introduce the model of recurrent neural networks and its long short-term memory version for learning the sentence embedding vectors. We start with the basic RNN and then proceed to LSTM-RNN.

III-A The basic version of RNN

The RNN is a type of deep neural network that is “deep” in the temporal dimension and has been used extensively in time sequence modelling [21, 22, 23, 24, 25, 26, 27, 28, 29]. The main idea of using the RNN for sentence embedding is to find a dense and low-dimensional semantic representation by sequentially and recurrently processing each word in a sentence and mapping it into a low-dimensional vector. In this model, the global contextual features of the whole text will be in the semantic representation of the last word in the text sequence — see Figure 1, where x(t) is the t-th word, coded as a 1-hot vector, W_h is a fixed hashing operator similar to the one used in [3] that converts the word vector to a letter tri-gram vector, W is the input weight matrix, W_rec is the recurrent weight matrix, y(t) is the hidden activation vector of the RNN, which can be used as a semantic representation of the t-th word, and the y(t) associated with the last word is the semantic representation vector of the entire sentence. Note that this is very different from the approach in [3], where a bag-of-words representation is used for the whole text and no context information is used. This is also different from [10], where a sliding window of fixed size (akin to an FIR filter) is used to capture local features and a max-pooling layer on top captures global features. In the RNN there is neither a fixed-size window nor a max-pooling layer; rather, recurrence is used to capture the context information in the sequence (akin to an IIR filter).

Fig. 1: The basic architecture of the RNN for sentence embedding, where temporal recurrence is used to model the contextual information across words in the text string. The hidden activation vector corresponding to the last word is the sentence embedding vector (blue).

The mathematical formulation of the above RNN model for sentence embedding can be expressed as

l(t) = W_h x(t)    (1)
y(t) = f(W l(t) + W_rec y(t−1) + b)    (2)

where W and W_rec are the input and recurrent matrices to be learned, W_h is a fixed word hashing operator, b is the bias vector and f(·) is assumed to be tanh(·). Note that the architecture proposed here for sentence embedding is slightly different from the traditional RNN in that there is a word hashing layer that converts the high-dimensional input into a relatively lower-dimensional letter tri-gram representation. There is also no per-word supervision during training; instead, the whole sentence has a label. This is explained in more detail in section IV.
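As a concrete illustration, the basic recurrence above can be sketched in a few lines of NumPy. All dimensions, the initialization scale, and the random sparse tri-gram inputs are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Illustrative sizes (assumptions): letter tri-gram dimension and hidden size.
TRIGRAM_DIM = 5000
HIDDEN_DIM = 32

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(HIDDEN_DIM, TRIGRAM_DIM))     # input matrix W
W_rec = rng.normal(scale=0.01, size=(HIDDEN_DIM, HIDDEN_DIM))  # recurrent matrix W_rec
b = np.zeros(HIDDEN_DIM)                                       # bias vector b

def rnn_sentence_embedding(trigram_vectors):
    """Process a sentence word by word; the hidden state after the last
    word is the sentence embedding vector."""
    y = np.zeros(HIDDEN_DIM)                      # y(0) = 0
    for l_t in trigram_vectors:                   # l(t): tri-gram vector of word t
        y = np.tanh(W @ l_t + W_rec @ y + b)      # y(t) = f(W l(t) + W_rec y(t-1) + b)
    return y

# Usage: a 3-word sentence as sparse random tri-gram count vectors.
sentence = [(rng.random(TRIGRAM_DIM) < 1e-2).astype(float) for _ in range(3)]
emb = rnn_sentence_embedding(sentence)            # sentence embedding vector
```

Note that there is neither a window size nor a pooling step here: the recurrence alone carries the context forward, which is the IIR-filter analogy made above.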

III-B The RNN with LSTM cells

Although the RNN performs the transformation from a sentence to a vector in a principled manner, it is generally difficult to learn long-term dependencies within the sequence due to the vanishing gradient problem. One effective solution for this problem in RNNs is to use memory cells instead of neurons, originally proposed in [15] as Long Short-Term Memory (LSTM) and completed in [30] and [31] by adding the forget gate and peephole connections to the architecture.

We use the LSTM architecture illustrated in Fig. 2 for the proposed sentence embedding method. In this figure, i(t), f(t), o(t) and c(t) are the input gate, forget gate, output gate and cell state vector, respectively, W_p1, W_p2 and W_p3 are peephole connections, W_i, W_reci and b_i (i = 1, 2, 3, 4) are the input connections, recurrent connections and bias values, respectively, g(·) and h(·) are tanh(·) functions and σ(·) is the sigmoid function. We use this architecture to find y(t) for each word, and then use the y(t) corresponding to the last word in the sentence as the semantic vector for the entire sentence.

Fig. 2: The basic LSTM architecture used for sentence embedding

Considering Fig. 2, the forward pass of the LSTM-RNN model is as follows:

y_g(t) = g(W_4 l(t) + W_rec4 y(t−1) + b_4)
i(t) = σ(W_3 l(t) + W_rec3 y(t−1) + W_p3 c(t−1) + b_3)
f(t) = σ(W_2 l(t) + W_rec2 y(t−1) + W_p2 c(t−1) + b_2)
c(t) = f(t) ∘ c(t−1) + i(t) ∘ y_g(t)
o(t) = σ(W_1 l(t) + W_rec1 y(t−1) + W_p1 c(t) + b_1)
y(t) = o(t) ∘ h(c(t))

where ∘ denotes the Hadamard (element-wise) product. A diagram of the proposed model with more details is presented in section VI of the Supplementary Materials.
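A sketch of one step of this forward pass in NumPy. The parameter names mirror the gates above; peephole connections are modeled as element-wise (diagonal) weights, and all sizes and values are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(l_t, y_prev, c_prev, p):
    """One LSTM-RNN step with input, forget and output gates plus
    peephole connections. `p` maps parameter names to arrays."""
    y_g = np.tanh(p["W4"] @ l_t + p["Wrec4"] @ y_prev + p["b4"])                     # candidate input
    i_t = sigmoid(p["W3"] @ l_t + p["Wrec3"] @ y_prev + p["Wp3"] * c_prev + p["b3"]) # input gate
    f_t = sigmoid(p["W2"] @ l_t + p["Wrec2"] @ y_prev + p["Wp2"] * c_prev + p["b2"]) # forget gate
    c_t = f_t * c_prev + i_t * y_g                                                   # cell state
    o_t = sigmoid(p["W1"] @ l_t + p["Wrec1"] @ y_prev + p["Wp1"] * c_t + p["b1"])    # output gate
    y_t = o_t * np.tanh(c_t)                                                         # hidden output
    return y_t, c_t

def lstm_sentence_embedding(trigram_vectors, p, hidden_dim):
    """Run the LSTM over a sentence; y(t) at the last word is the embedding."""
    y = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for l_t in trigram_vectors:
        y, c = lstm_step(l_t, y, c, p)
    return y
```

The gating structure, not the particular sizes, is the point here: the input gate i(t) decides how much of the current word enters the cell state, which is exactly the mechanism analyzed later in Sec. V-A.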

IV Learning Method

To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors of sentences with similar meanings as close as possible, and meanwhile, to make those of sentences with different meanings as far apart as possible. This is challenging in practice, since it is hard to collect a large amount of manually labelled data providing a semantic similarity signal between different sentences. Nevertheless, widely used commercial web search engines are able to log massive amounts of data with some limited user feedback signals. For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded and can be used as a weak (binary) supervision signal indicating the semantic similarity between two sentences (on the query side and the document side). In this section, we explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective. Note also that the objective of making sentences with similar meanings as close as possible is similar to the machine translation task, where two sentences belonging to two different languages have similar meanings and we want to make their semantic representations as close as possible.

We now describe how to train the model to achieve the above objective using the click-through data logged by a commercial search engine. For a complete description of the click-through data please refer to section 2 in [32]. To begin with, we adopt the cosine similarity between the semantic vectors of two sentences as a measure of their similarity:

R(Q, D) = y_Q(T_Q)ᵀ y_D(T_D) / (‖y_Q(T_Q)‖ · ‖y_D(T_D)‖)    (3)

where T_Q and T_D are the lengths of sentence Q and sentence D, respectively. In the context of training over click-through data, we will use Q and D to denote a “query” and a “document”, respectively. In Figure 3, we show the sentence embedding vectors corresponding to the query and all the documents, where the subscript + denotes the (clicked) positive sample among the documents, and the subscript j− denotes the j-th (un-clicked) negative sample. All these embedding vectors are generated by feeding the sentences into the RNN or LSTM-RNN model described in Sec. III and taking the y(t) corresponding to the last word — see the blue box in Figure 1.

Fig. 3: The click-through signal can be used as a (binary) indication of the semantic similarity between the sentence on the query side and the sentence on the document side. The negative samples are randomly sampled from the training data.

We want to maximize the likelihood of the clicked document given query, which can be formulated as the following optimization problem:


where denotes the collection of the model parameters; in regular RNN case, it includes and in Figure 1, and in LSTM-RNN case, it includes , , , , , , , , , , , , , and in Figure 2. is the clicked document for -th query, is the probability of clicked document given the -th query, is number of query / clicked-document pairs in the corpus and


where , was defined earlier in (3), is the -th negative candidate document for -th query and denotes the number of negative samples used during training.

The expression in (IV) is a logistic loss over . It upper-bounds the pairwise accuracy, i.e., the 0 - 1 loss. Since the similarity measure is the cosine function, . To have a larger range for , we use for scaling. It helps to penalize the prediction error more. Its value is set empirically by experiments on a held out dataset.
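This training criterion for a single query can be sketched as follows, assuming the sentence embeddings have already been computed; the value of `gamma` below is illustrative, since the paper sets it empirically on held-out data:

```python
import numpy as np

def cosine(a, b):
    """R(Q, D): cosine similarity between two sentence embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def query_loss(r_pos, r_negs, gamma=10.0):
    """Logistic loss for one query: log(1 + sum_j exp(-gamma * delta_j)),
    where delta_j is the cosine-similarity margin between the clicked
    document and the j-th un-clicked (negative) document."""
    deltas = r_pos - np.asarray(r_negs, dtype=float)  # each delta lies in [-2, 2]
    return float(np.log1p(np.sum(np.exp(-gamma * deltas))))

# Usage: a clicked document scored well above 4 negatives gives a near-zero loss.
loss = query_loss(0.9, [0.1, -0.2, 0.0, 0.3], gamma=10.0)
```

The loss only vanishes when the clicked document beats every sampled negative by a clear margin, which is exactly the weak-supervision signal the click-through data provides.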

To train the RNN and LSTM-RNN, we use Back Propagation Through Time (BPTT). The update equations for the parameters Λ at epoch k are as follows:

ΔΛ(k) = μ(k−1) ΔΛ(k−1) − ε(k−1) ∇L(Λ(k−1) + μ(k−1) ΔΛ(k−1))
Λ(k) = Λ(k−1) + ΔΛ(k)    (6)

where ∇L(·) is the gradient of the cost function in (4), ε is the learning rate and μ(k) is a momentum parameter determined by the scheduling scheme used for training. The above equations are equivalent to Nesterov's method in [33]; to see why, please refer to appendix A.1 of [34], where Nesterov's method is derived as a momentum method. The gradient of the cost function, ∇L(Λ), is:

∇L(Λ) = ∑_{r=1}^{N} ∑_{j=1}^{n} ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ    (7)

where T is the number of time steps over which we unfold the network, and

α_{r,j} = −γ e^{−γ Δ_{r,j}} / (1 + ∑_{j'=1}^{n} e^{−γ Δ_{r,j'}})    (8)

The term ∂Δ_{r,j,τ}/∂Λ in (7), and the error signals for the different parameters of the RNN and LSTM-RNN that are necessary for training, are presented in Appendix A. A full derivation of the gradients in both models is given in section III of the supplementary materials.

To accelerate training by parallelization, we use mini-batch training with one large update instead of incremental updates during back propagation through time. To resolve the gradient explosion problem we use the gradient re-normalization method described in [35, 24]. To accelerate convergence, we use Nesterov's method [33] and found it effective in training both the RNN and the LSTM-RNN for sentence embedding.

We have used a simple yet effective schedule for μ(k) for both the RNN and LSTM-RNN models, with one value of μ(k) during the first and last 2% of all parameter updates and another value for the remaining 96% of parameter updates. A fixed step size was used for training the RNN and a fixed step size for training the LSTM-RNN.
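The update rule with gradient re-normalization can be sketched as follows; the quadratic toy objective, step size, momentum and clip threshold below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def nesterov_update(params, velocity, grad_fn, eps, mu, clip_threshold):
    """One Nesterov step: take the gradient at the look-ahead point
    params + mu * velocity, re-normalize it if its norm exceeds the clip
    threshold (gradient re-normalization), then apply the momentum update."""
    g = grad_fn(params + mu * velocity)       # gradient at look-ahead point
    norm = np.linalg.norm(g)
    if norm > clip_threshold:                 # gradient re-normalization
        g = g * (clip_threshold / norm)
    velocity = mu * velocity - eps * g        # Delta Lambda(k)
    return params + velocity, velocity        # Lambda(k), Delta Lambda(k)

# Usage: minimize 0.5 * ||x||^2, whose gradient is simply x.
x = np.array([10.0, -10.0])
v = np.zeros_like(x)
for _ in range(300):
    x, v = nesterov_update(x, v, lambda p: p, eps=0.1, mu=0.9, clip_threshold=5.0)
```

Evaluating the gradient at the look-ahead point rather than at the current parameters is what distinguishes Nesterov's method from classical momentum.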

A summary of the training method for the LSTM-RNN is presented in Algorithm 1.

Inputs: Fixed step size “ε”, Schedule for “μ(k)”, Gradient clip threshold “th_G”, Maximum number of epochs “E”, Total number of query / clicked-document pairs “N”, Total number of un-clicked (negative) documents for a given query “n”, Maximum sequence length for truncated BPTT “T”.
Outputs: Two trained models, one on the query side “Λ_Q”, one on the document side “Λ_D”.
Initialization: Set all parameters in Λ_Q and Λ_D to small random numbers, i = 1, k = 1.
procedure LSTM-RNN(Λ_Q, Λ_D)
       while i ≤ E do
             for “first minibatch” → “last minibatch” do
                    r ← 1
                    while r ≤ N do
                          for j = 1 → n do
                                 Compute α_{r,j}  ▷ use (8)
                                 Compute ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ_Q  ▷ use (14) to (A-B5) in appendix A
                                 Compute ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ_D  ▷ use (14) to (A-B5) in appendix A
                                 Sum the above terms for Λ_Q and Λ_D over j
                          end for
                          Sum the above terms for Λ_Q and Λ_D over r
                          r ← r + 1
                    end while
                    Compute ∇L(Λ_Q)  ▷ use (7)
                    Compute ∇L(Λ_D)  ▷ use (7)
                    if ‖∇L(Λ_Q)‖ > th_G then
                          ∇L(Λ_Q) ← th_G · ∇L(Λ_Q) / ‖∇L(Λ_Q)‖
                    end if
                    if ‖∇L(Λ_D)‖ > th_G then
                          ∇L(Λ_D) ← th_G · ∇L(Λ_D) / ‖∇L(Λ_D)‖
                    end if
                    Update Λ_Q  ▷ use (6)
                    Update Λ_D  ▷ use (6)
                    k ← k + 1
             end for
             i ← i + 1
       end while
end procedure
Algorithm 1 Training LSTM-RNN for Sentence Embedding

V Analysis of the Sentence Embedding Process and Performance Evaluation

To understand how the LSTM-RNN performs sentence embedding, we use visualization tools to analyze the semantic vectors generated by our model. We would like to answer the following questions: (i) How are word dependencies and context information captured? (ii) How does LSTM-RNN attenuate unimportant information and detect critical information from the input sentence? Or, how are the keywords embedded into the semantic vector? (iii) How are the global topics identified by LSTM-RNN?

To answer these questions, we train the RNN with and without LSTM cells on a click-through dataset logged by a commercial web search engine. The training method has been described in Sec. IV. The corpus is as follows. The training set includes 200,000 positive query / document pairs, where only the click signal is used as weak supervision for training the LSTM. The relevance judgement set (test set) is constructed as follows. First, the queries are sampled from a year of search engine logs. Adult, spam, and bot queries are all removed, and queries are “de-duped” so that only unique queries remain. To reflect a natural query distribution, we do not try to control the quality of these queries; for example, our query sets contain around 20% misspelled queries, around 20% navigational queries, 10% transactional queries, etc. Second, for each query, we collect Web documents to be judged by issuing the query to several popular search engines (e.g., Google, Bing) and fetching the top-10 retrieval results from each. Finally, the query-document pairs are judged by a group of well-trained assessors. In this study all queries are preprocessed as follows: the text is white-space tokenized and lower-cased, numbers are retained, and no stemming/inflection treatment is performed. Unless stated otherwise, the experiments use 4 negative samples, i.e., n = 4 in Fig. 3.

We now proceed to perform a comprehensive analysis by visualizing the trained RNN and LSTM-RNN models. In particular, we will visualize the on-and-off behaviors of the input gates, output gates, cell states, and the semantic vectors in LSTM-RNN model, which reveals how the model extracts useful information from the input sentence and embeds it properly into the semantic vector according to the topic information.

Although we gave the full learning formulas for all the model parameters in the previous section, we remove the peephole connections and the forget gate from the LSTM-RNN model for the current task. This is because the length of each sequence, i.e., the number of words in a query or a document, is known in advance, and we set the state of each cell to zero at the beginning of a new sequence; therefore, forget gates are not of great help here. Also, as long as the order of words is kept, the precise timing in the sequence is not of great concern; therefore, peephole connections are not that important either. Removing the peephole connections and the forget gate also reduces training time, since fewer parameters need to be learned.

V-A Analysis

In this section we would like to examine how the information in the input sentence is sequentially extracted and embedded into the semantic vector over time by the LSTM-RNN model.

V-A1 Attenuating Unimportant Information

First, we examine the evolution of the semantic vector and how unimportant words are attenuated. Specifically, we feed the following input sentences from the test dataset into the trained LSTM-RNN model:

  • Query: “hotels in shanghai”

  • Document: “shanghai hotels accommodation hotel in shanghai discount and reservation”

Activations of the input gate, output gate, cell state and the embedding vector for each cell, for the query and the document, are shown in Fig. 4 and Fig. 5, respectively. The vertical axis is the cell index, from 1 to 32, and the horizontal axis is the word index, numbered from left to right in the sequence of words; color codes show activation values.

Fig. 4: Query: “hotels in shanghai”. Since the sentence ends at the third word, all the values to the right of it are zero (green color).
Fig. 5: Document: “shanghai hotels accommodation hotel in shanghai discount and reservation”. Since the sentence ends at the ninth word, all the values to the right of it are zero (green color).

From Figs. 4 and 5, we make the following observations:

  • The semantic representation y(t) and the cell states c(t) evolve over time. Valuable context information is gradually absorbed into c(t) and y(t), so that the information in these two vectors becomes richer over time, and the semantic information of the entire input sentence is embedded into the vector y(t), which is obtained by applying the output gates to the cell states c(t).

  • The input gates evolve in such a way that they attenuate unimportant information and detect important information in the input sentence. For example, in Fig. 5(a), most of the input gate values corresponding to the 3rd, 7th and 9th words have very small values (light green-yellow color), which correspond to the words “accommodation”, “discount” and “reservation”, respectively, in the document sentence. Interestingly, the input gates reduce the effect of these three words in the final semantic representation y(t), so that the semantic similarity between the sentences on the query and document sides is not affected by these words. (If this is not clearly visible, please refer to Fig. 1 in section I of the supplementary materials; we have adjusted the color bar of all figures to the same range, so the structure might not be clearly visible. More visualization examples can be found in section IV of the Supplementary Materials.)

V-A2 Keywords Extraction

In this section, we show how the trained LSTM-RNN extracts the important information, i.e., keywords, from the input sentences. To this end, we backtrack the semantic representations, c(t), over time. We focus on the 10 most active cells in the final semantic representation. Whenever there is a large enough change in a cell's activation value, we assume that an important keyword has been detected by the model. We illustrate the result using the above example (“hotels in shanghai”). The evolution of the activations of the 10 most active cells, c(t), over time is shown in Fig. 6 for the query and the document sentences (likewise, the vertical axis is the cell index and the horizontal axis is the word index in the sentence).
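The backtracking procedure described above can be sketched as follows. The change threshold and the synthetic cell-state matrix are assumptions made for illustration; the paper only requires a "large enough" change:

```python
import numpy as np

def detect_keywords(cell_states, top_k=10, threshold=0.1):
    """cell_states: array of shape (num_words, num_cells) holding c(t)
    after each word. Pick the top_k most active cells at the last word,
    then count, for each word, how many of those cells changed by more
    than `threshold` when that word was read."""
    final_activity = np.abs(cell_states[-1])
    top_cells = np.argsort(final_activity)[-top_k:]   # most active cells at the end
    prev = np.zeros(cell_states.shape[1])             # c(0) = 0
    counts = []
    for c_t in cell_states:
        changed = np.abs(c_t[top_cells] - prev[top_cells]) > threshold
        counts.append(int(changed.sum()))
        prev = c_t
    return counts  # per-word number of top cells flagging it as a keyword

# Usage: 3 words, 12 cells; only the second word perturbs the cell state.
states = np.zeros((3, 12))
states[1, :5] = 1.0
states[2] = states[1]
counts = detect_keywords(states, top_k=10, threshold=0.1)  # -> [0, 5, 0]
```

With real cell states, the first word always produces a large change (the states start at zero), which is why the tables below leave the first word of each reading direction unscored.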

(a) top 10 c(t) for the query
(b) top 10 c(t) for the document
Fig. 6: Activation values, c(t), of the 10 most active cells for Query: “hotels in shanghai” and Document: “shanghai hotels accommodation hotel in shanghai discount and reservation”

From Fig. 6, we also observe that different words activate different cells. In Tables I and II, we show the number of cells each word activates. Note that before the first word of a sequence is presented, the activation values are initially zero, so there is always a considerable change in the cell states after presenting the first word; for this reason, we have not indicated the number of cells detecting the first word as a keyword. Another keyword extraction example can be found in section IV of the supplementary materials.

We used a bidirectional LSTM-RNN to obtain the results in these tables: in the first row, the LSTM-RNN reads sentences from left to right, and in the second row it reads them from right to left. We labelled a word as a keyword if a large enough fraction of the top-10 active cells in both directions declared it a keyword; the boldface numbers in the tables indicate the words for which the number of assigned cells exceeds this threshold. From the tables, we observe that the keywords activate more cells than the unimportant words, meaning that they are selectively embedded into the semantic vector.

TABLE I: Keywords for query: “hotels in shanghai”

Number of assigned cells out of 10:
                 hotels   in   shanghai
Left to Right      -      0       7
Right to Left      6      0       -
Number of assigned
cells out of 10
Left to Right - 4 3 8 1 8 5 3 4
Number of assigned
cells out of 10
Right to Left 4 6 5 4 5 1 7 5 -
TABLE II: Key words for document: “

V-A3 Topic Allocation

Now, we further show that the trained LSTM-RNN model not only detects the keywords, but also allocates them properly to different cells according to the topics they belong to. To do this, we run the trained LSTM-RNN model over the test dataset and search for the keywords detected by each specific cell. For simplicity, we use the following approach: for each given query, we look at the keywords extracted by the 5 most active cells of the LSTM-RNN and list them in Table III. Interestingly, each cell collects keywords of a specific topic. For example, cell 26 in Table III extracts keywords related to the topic “food”, and cells 2 and 6 mainly focus on keywords related to the topic “health”.

Query                                        | Keywords (cells 1–16)                    | Keywords (cells 17–32)
al yo yo sauce                               | yo sauce; sauce                          | yo sauce
atkins diet lasagna                          | diet                                     | diet; diet
blender recipes                              | —                                        | recipes
cake bakery edinburgh                        | bakery                                   | bakery; bakery
canning corn beef hash                       | beef, hash                               | corn, beef
torre de pizza                               | —                                        | pizza; pizza
famous desserts                              | desserts                                 | —
fried chicken                                | chicken; chicken                         | chicken
smoked turkey recipes                        | —                                        | turkey recipes
italian sausage hoagies                      | sausage                                  | hoagies; sausage; sausage
do you get allergy                           | allergy                                  | —
much pain will after total knee replacement  | pain; pain, knee                         | knee replacement
how to make whiter teeth                     | make, teeth; to                          | whiter
illini community hospital                    | community, hospital; hospital; community | hospital; hospital
implant infection                            | infection; infection                     | infection
introductory psychology                      | psychology; psychology                   | psychology
narcotics during pregnancy side effects      | pregnancy; pregnancy, effects, during; during | —
fight sinus infections                       | infections                               | sinus, infections; infections
health insurance high blood pressure         | insurance; blood; high, blood            | high, pressure; insurance, high
all antidepressant medications               | antidepressant, medications              | antidepressant; medications
TABLE III: Keywords assigned to each cell of the LSTM-RNN for different queries from two topics, “food” and “health”.

V-B Performance Evaluation

V-B1 Web Document Retrieval Task

In this section, we apply the proposed sentence embedding method to an important web document retrieval task for a commercial web search engine. Specifically, the RNN models (with and without LSTM cells) embed the sentences from the query and the document sides into their corresponding semantic vectors, and then compute the cosine similarity between these vectors to measure the semantic similarity between the query and candidate documents.
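As a minimal sketch of this scoring step (the function names and dictionary layout are illustrative, not the paper's implementation):

```python
import numpy as np

def cosine_sim(q, d):
    """Cosine similarity between a query embedding and a document embedding
    (each being the final hidden state of the sentence embedding model)."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def rank_documents(q_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by semantic similarity
    to the query, most similar first."""
    scored = [(doc_id, cosine_sim(q_vec, d)) for doc_id, d in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```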

Experimental results for this task are shown in Table IV using the standard metric mean Normalized Discounted Cumulative Gain (NDCG) [36] (the higher, the better) for evaluating the ranking performance of the RNN and LSTM-RNN on a standalone human-rated test dataset. We also trained several strong baselines, such as DSSM [3] and CLSM [10], on the same training dataset and evaluated their performance on the same task. For a fair comparison, our proposed RNN and LSTM-RNN models are trained with the same number of parameters as the DSSM and CLSM models (14.4 M parameters). In addition, Table IV includes two well-known information retrieval (IR) models, BM25 and PLSA, for the sake of benchmarking. BM25 is a state-of-the-art term-matching document ranking model that uses a bag-of-words representation for queries and documents, and is widely used as a baseline in the IR community. PLSA (Probabilistic Latent Semantic Analysis) is a topic model proposed in [37], which we trained using Maximum A Posteriori estimation [38] on the document side of the same training dataset. We experimented with a varying number of topics from 100 to 500 for PLSA, which gave similar performance, and we report in Table IV the results for 500 topics. Results for a language-model-based method, the unigram language model (ULM) with Dirichlet smoothing, are also presented in the table.
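For reference, NDCG@k can be computed from graded relevance labels as follows. This sketch uses the common exponential-gain formulation, which may differ in detail from the evaluation tool used in the paper:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain at rank k, with gain (2^rel - 1)
    and a log2(rank + 1) position discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking normalized by the DCG of the
    ideal (relevance-sorted) ranking; 0.0 if no relevant results exist."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

The mean NDCG reported in Table IV would then be the average of `ndcg_at_k` over all test queries.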

To compare the performance of the proposed method with general sentence embedding methods on the document retrieval task, we also performed experiments using two such methods.

  1. In the first experiment, we used the method proposed in [2], which generates embedding vectors known as Paragraph Vectors; it is also known as doc2vec. It maps each word to a vector and then uses the vectors representing all words inside a context window to predict the vector representation of the next word. The main idea of this method is to include an additional paragraph token from previous sentences in the document inside the context window. This paragraph token is mapped to vector space using a different matrix from the one used to map the words. A precursor of this method is word2vec, proposed in [39]; the only difference is that word2vec does not include the paragraph token.

    To use doc2vec on our dataset, we first trained the doc2vec model on both the training set (about 200,000 query-document pairs) and the test set (about 900,000 query-document pairs). This gives an embedding vector for every query and document in the dataset. We used the following parameters for training:

    • min-count=1 : minimum number of words per sentence; sentences with fewer words are ignored. We set it to 1 to make sure we do not throw anything away.

    • window=5 : fixed window size explained in [2]. We tried different window sizes; they resulted in only about a 0.4% difference in the final NDCG values.

    • size=100 : feature vector dimension. We also tried 400 but did not obtain significantly different NDCG values.

    • sample=1e-4 : the down-sampling ratio for words that occur very frequently in the corpus.

    • negative=5 : the number of noise words, i.e., words used for negative sampling, as explained in [2].

    • We used 30 epochs of training. We ran an experiment with 100 epochs but did not observe much difference in the results.

    • We used gensim [40] to perform experiments.
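The training setup above can be sketched with gensim's Doc2Vec API as follows. This is a hypothetical sketch: the corpus-building helpers are our own, and the parameter names follow gensim ≥ 4.0, where `size` was renamed `vector_size`; the gensim imports are deferred inside the functions so the helpers remain self-contained.

```python
def tokenize(text):
    """Whitespace tokenization used to build the doc2vec corpus."""
    return text.lower().split()

def build_corpus(pairs):
    """Turn (doc_id, text) pairs into gensim TaggedDocument objects."""
    from gensim.models.doc2vec import TaggedDocument
    return [TaggedDocument(words=tokenize(text), tags=[doc_id])
            for doc_id, text in pairs]

def train_doc2vec(corpus):
    """Train doc2vec with the parameters listed in the bullets above."""
    from gensim.models import Doc2Vec
    return Doc2Vec(corpus,
                   vector_size=100,  # size=100 above
                   window=5,
                   min_count=1,
                   sample=1e-4,
                   negative=5,
                   epochs=30)
```

A trained model of this form exposes one embedding vector per tag, which is what the cosine-similarity scoring below consumes.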

    To make sure that a meaningful model was trained, we used the trained doc2vec model to find the most similar words to two sample words in our dataset, e.g., the words “pizza” and “infection”. The resulting words and corresponding scores are presented in section V of the Supplementary Materials. The resulting words show that the trained model is meaningful and can recognise semantic similarity.

    Doc2vec also assigns an embedding vector to each query and document in our test set. We used these embedding vectors to calculate the cosine similarity score between each query-document pair in the test set, and from these scores we calculated the NDCG values reported in Table IV for the Doc2Vec model.

    Comparing the results of the doc2vec model with our proposed method on the document retrieval task shows that the proposed method significantly outperforms doc2vec. One reason is that doc2vec is a very general sentence embedding method that was not designed for document retrieval. This experiment shows that using a general sentence embedding method is not sufficient, and that a task-oriented cost function, such as the one proposed in this paper, is necessary.

  2. In the second experiment, we used the Skip-Thought vectors proposed in [6]. During training, the skip-thought method receives a tuple of contiguous sentences (s_{i-1}, s_i, s_{i+1}); it encodes the sentence s_i using one encoder and tries to reconstruct the previous and next sentences, i.e., s_{i-1} and s_{i+1}, using two separate decoders. The model uses RNNs with Gated Recurrent Units (GRUs), which are shown to perform as well as LSTM. In the paper, the authors emphasize that: “Our model depends on having a training corpus of contiguous text”. Therefore, training it on our training set, where we barely have more than one sentence in a query or document title, would not be fair. However, since their model is trained on 11,038 books from the BookCorpus dataset [7], which includes about 74 million sentences, we can use the trained model as an off-the-shelf sentence embedding method, as the authors conclude in their paper.

    To do this, we downloaded their trained models and word embeddings (more than 2 GB in size) available from “”. Then we encoded each query and its corresponding document title in our test set as vectors.

    We used the combine-skip sentence embedding, which is the concatenation of a uni-skip vector (produced by a unidirectional encoder) and a bi-skip vector (itself the concatenation of one vector from a forward encoder and another from a backward encoder). The authors report their best results with the combine-skip encoder.

    Using the embedding vectors for each query and document, we calculated the similarity scores and NDCG values for the whole test set, which are reported in Table IV.

    The proposed method in this paper performs significantly better than the off-the-shelf skip-thought method on the document retrieval task. Nevertheless, given that skip-thought was used purely as an off-the-shelf sentence embedding method, its results are respectable. This result again confirms that learning embedding vectors with a model and cost function specifically designed for the document retrieval task is necessary.

As shown in Table IV, the LSTM-RNN significantly outperforms all these models and exceeds the best baseline model (CLSM) in NDCG@1 by a statistically significant margin. As we pointed out in Sec. V-A, such an improvement comes from the LSTM-RNN’s ability to embed the contextual and semantic information of the sentences into a finite-dimension vector. In Table IV, we also present the results when different numbers of negative samples, n, are used. Generally, by increasing n we expect the performance to improve, because more negative samples result in a more accurate approximation of the partition function in (IV). The results of using a bidirectional LSTM-RNN are also presented in Table IV. In this model, one LSTM-RNN reads queries and documents from left to right, and the other reads them from right to left. The embedding vectors from the two directions are then concatenated to compute the cosine similarity score and the NDCG values.
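The bidirectional scoring step can be sketched as follows (an illustrative sketch; the vector sizes and function names are assumptions):

```python
import numpy as np

def bidir_embedding(h_left_to_right, h_right_to_left):
    """Concatenate the final hidden states of the two directional
    LSTM-RNNs into a single sentence embedding."""
    return np.concatenate([h_left_to_right, h_right_to_left])

def bidir_cosine(q_fwd, q_bwd, d_fwd, d_bwd):
    """Cosine similarity between bidirectional query and document embeddings."""
    q = bidir_embedding(q_fwd, q_bwd)
    d = bidir_embedding(d_fwd, d_bwd)
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
```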

Model (configuration)                                NDCG@1   NDCG@3   NDCG@10
Skip-Thought                                         26.9%    29.7%    36.2%
Doc2Vec                                              29.1%    31.8%    38.4%
ULM                                                  30.4%    32.7%    38.5%
BM25                                                 30.5%    32.8%    38.8%
PLSA (T=500)                                         30.8%    33.7%    40.2%
DSSM (nhid=288/96), 2 layers                         31.0%    34.4%    41.7%
CLSM (nhid=288/96, win=1), 2 layers, 14.4 M params   31.8%    35.1%    42.6%
CLSM (nhid=288/96, win=3), 2 layers, 43.2 M params   32.1%    35.2%    42.7%
CLSM (nhid=288/96, win=5), 2 layers, 72 M params     32.0%    35.2%    42.6%
RNN (nhid=288), 1 layer                              31.7%    35.0%    42.3%
LSTM-RNN (ncell=32), 1 layer, 4.8 M params           31.9%    35.5%    42.7%
LSTM-RNN (ncell=64), 1 layer, 9.6 M params           32.9%    36.3%    43.4%
LSTM-RNN (ncell=96), 1 layer, n=2                    32.6%    36.0%    43.4%
LSTM-RNN (ncell=96), 1 layer, n=4                    33.1%    36.5%    43.6%
LSTM-RNN (ncell=96), 1 layer, n=6                    33.1%    36.6%    43.6%
LSTM-RNN (ncell=96), 1 layer, n=8                    33.1%    36.4%    43.7%
Bidirectional LSTM-RNN (ncell=96), 1 layer           33.2%    36.6%    43.6%
TABLE IV: Comparison of NDCG performance measures (the higher, the better) for the proposed models and a series of baseline models, where nhid is the number of hidden units, ncell is the number of cells, win is the window size, and n is the number of negative samples (set to 4 unless otherwise stated). Unless stated otherwise, the RNN and LSTM-RNN models are chosen to have the same number of model parameters as the DSSM and CLSM models (14.4 M). In the original table, the best results are shown in boldface.

A comparison of the value of the cost function during training for the LSTM-RNN and the RNN on the click-through data is shown in Fig. 7. From this figure, we conclude that the LSTM-RNN optimizes the cost function in (4) more effectively. Note that all parameters of both models were initialized randomly.

Fig. 7: LSTM-RNN compared to RNN during training: the vertical axis is the training cost in (4) on a logarithmic scale; the horizontal axis is the number of training epochs.

Vi Conclusions and Future Work

This paper addresses deep sentence embedding. We propose a model based on long short-term memory to model long-range context information and embed the key information of a sentence into one semantic vector. We show that the semantic vector evolves over time and only takes useful information from each new input. This is made possible by input gates that detect useless information and attenuate it. Due to the general scarcity of human-labelled data, we proposed and implemented training of the model with a weak supervision signal using user click-through data of a commercial web search engine.

By performing a detailed analysis of the model, we showed that: 1) the proposed model is robust to noise, i.e., it mainly embeds keywords in the final semantic vector representing the whole sentence, and 2) each cell of the proposed model is usually allocated to keywords from a specific topic. These findings were supported by extensive examples. As a concrete sample application of the proposed sentence embedding method, we evaluated it on the important language processing task of web document retrieval. We showed that, for this task, the proposed method significantly outperforms all existing state-of-the-art methods.

This work has been motivated by the earlier successes of deep learning methods in speech [41, 42, 43, 44, 45] and in semantic modelling [3, 10, 46, 47], and it adds further evidence for the effectiveness of these methods. Our future work will extend the methods to include: 1) using the proposed sentence embedding method for other important language processing tasks in which we believe sentence embedding plays a key role, e.g., question answering; 2) exploiting prior information about the structure of the different matrices in Fig. 2 to develop a more effective cost function and learning method; and 3) exploiting an attention mechanism in the proposed model to improve performance and to find out which words in the query are aligned to which words of the document.

Appendix A Expressions for the Gradients

In this appendix we present the final gradient expressions that are necessary for training the proposed models. Full derivations of these gradients are presented in section III of the supplementary materials.
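For orientation, the gradients below descend from the pairwise click-through loss used in the DSSM family of models [3]. In that notation, the per-query loss over the n unclicked documents takes the following form; this is a hedged reconstruction consistent with the negative-sampling discussion in Section V-B, and the smoothing factor γ is an assumption:

```latex
l(\Lambda) \;=\; \log\Big(1 + \sum_{j=1}^{n} \exp\big(-\gamma\,\Delta_j\big)\Big),
\qquad
\Delta_j \;=\; R(Q, D^{+}) - R(Q, D_j^{-}),
```

where $R(\cdot,\cdot)$ denotes the cosine similarity between the corresponding sentence embedding vectors, $D^{+}$ is the clicked document, and $D_j^{-}$ is the $j$-th unclicked document.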

A-A RNN

For the recurrent parameters (we have omitted the subscript for simplicity):


where means -th candidate document that is not clicked and


The same expression as (A-A) holds for the document-side model, with the corresponding subscript. Please also note that:




For the input parameters, :


A full derivation of BPTT for RNN is presented in section III of supplementary materials.

A-B LSTM-RNN

Starting with the cost function in (4), we use the Nesterov method described in (IV) to update the LSTM-RNN model parameters. Here, the parameter being updated is one of the weight matrices or bias vectors in the LSTM-RNN architecture. The general form of the gradient of the cost function is the same as (7). By definition, we have:


We omit the subscripts for simplicity and, in the following subsections, present the gradients for the different parameters of each cell of the LSTM-RNN. This completes the calculation of the gradient in (7), after which (IV) can be used to update the LSTM-RNN model parameters. In the subsequent subsections, the two auxiliary vectors are defined as:


where and are defined in (A-A). Full derivation of truncated BPTT for LSTM-RNN model is presented in section III of supplementary materials.

A-B1 Output Gate

For recurrent connections we have:




The same expression as (17) holds for the document-side model, with the corresponding subscript. For the input connections and the peephole connections, we will have:


The derivative for output gate bias values will be:


A-B2 Input Gate

For the recurrent connections we have:




In equation (A-B2), the document-side quantities are the same as in (A-B2) with the corresponding subscript. For the input connections we will have the following:




For the peephole connections we will have:




For bias values, , we will have:




A-B3 Forget Gate

For the recurrent connections we will have:




For input connections to forget gate we will have:




For peephole connections we have:




For forget gate’s bias values we will have:




A-B4 Input without Gating ()

For recurrent connections we will have: