Sequence to Sequence Learning for Query Expansion

12/25/2018 ∙ by Salah Zaiem, et al. ∙ UQAM 0

Using sequence to sequence algorithms for query expansion has not been explored yet in Information Retrieval literature nor in Question-Answering's. We tried to fill this gap in the literature with a custom Query Expansion engine trained and tested on open datasets. Starting from open datasets, we built a Query Expansion training set using sentence-embeddings-based Keyword Extraction. We therefore assessed the ability of the Sequence to Sequence neural networks to capture expanding relations in the words embeddings' space.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recent works on Search-Oriented chatbots have been focusing on the reranking process after the preselection of a small number (usually around 10) of similar queries or hypothetical answers. Using basic similarity measures (Bag of words vectorization using BM25 or TF-IDF weighting for instance), a few hypotheses are selected for further inspection.

The preselection phase is rarely tackled in the literature. It is nevertheless an important task. Two reasons illustrate its importance. First, missing the correct answer or document in the preselection caps the accuracy at a certain coverage. Second, finding the relevant document in the fewest number of hypotheses makes the reranking process faster by reducing the number of answers to rerank. With the reranking processes getting more and more complex Yan et al. (2016); Qiu et al. (2017) while the need is for faster answer generation, a fast and efficient preselection increases the accuracy and improves the users’ experience.
The importance of Query Expansion in Question Answering is highlighted in Derczynski et al. (2008). To avoid the knock-on effect induced by a failure in any steps of the QA system, Query Expansion through Pseudo Relevance Feedback (PRF) is used to improve the IR component performance. PRF improves the coverage of the Answer Extraction component.

Throughout this work, we will explore a new technique using Sequence to Sequence neural architecture to expand the queries introduced by the users.
Here is a summary of our main contributions :

  • Starting from open datasets, we built a Query Expansion training set using sentence-embeddings-based Keyword Extraction.

  • We assess the ability of the Sequence to Sequence neural networks to capture expanding relations in the words embeddings’ space.

Our work is organized as follows : Section 2 presents the state-of-the-art related to this research. Section 3 presents our approach. Section 4 shows our experiments and and evaluations. Section 5 concludes this paper and presents some perspectives.

2 Related Work

Relevance feedback has been a popular choice for query expansion, starting with the Rocchio Algorithm Salton (1971) in SMART Information Retrieval System. Using a set of relevant and non relevant documents, the original query vector is modified. Pseudo Relevance Feedback relies on terms collected from the most pseudo-relevant documents of a first search. These terms are added to the query and a second search gives the final documents/answers.
But, we can also expand queries using external resources. Query expansion also relied on anthologies and thesaurus-based methods, using large lexical databases like WordNet Miller (1995). The query is expanded by adding terms using synonymy, hypernymy and other semantic relationships.

Recently, the introduction of word embeddings Mikolov et al. (2013) allowed new possibilities for Query Expansion Roy et al. (2016)

. The distributed representations of the words in a query made it possible to produce expansions without extracting them from the documents. Using the centroid of the words introduced and cosine-similar tokens, Kuzi and al

Kuzi et al. (2016) proposed a document-independent expansion method. This method has been therefore integrated to a custom PRF technique yielding particularly interesting results. As referenced by Mitra and Crasswell Mitra and Craswell (2017) , Diaz and al Diaz et al. (2016) showed that locally trained word embeddings along with topic specific language models lead to a significant improvement on Information Retrieval tasks.

Lately, Nogueira and Cho Nogueira and Cho (2017)

proposed a Reinforcement Learning based query expander. After selecting a set of candidate terms from the documents, a search engine looks for the relevant documents. Using relevance judgments, the system measures how much the suggested expansion improved the search accuracy and updates the parameters of the network.

Paraphrase generation is a very close field. Generating new utterances carrying the same meaning expands the initial query and highly increases the robustness of a search-based chatbot McClendon et al. (2014); Kozlowski et al. (2003).

Recently, Prakash and al. Prakash et al. (2016) proposed a model using stacked residual LSTM networks. While it reaches high BLEU and METEOR scores, we lack information about how different the produced queries are from the introduced ones. And therefore, we cannot assess how much do they increase a conversational engine robustness. In addition, Buck and al. Buck et al. (2017) proposed a Reinforcement Learning based method for question reformulation. The reformulation network is updated according to the performance on Question Answering.

2.1 Sentence embeddings

A vectorial representation of whole sentences and documents is useful for many tasks such as Text Classification. Arora and al. Arora et al. (2017) proposed a simple baseline that performs well in several NLP tasks. It starts by averaging weighted pre-trained vectors of the words in the sentence. Then it deletes a common part to all sentences by removing a projection on the first singular vector of the sentences vectors’ matrix. Another approach developed by Conneau and al Conneau et al. (2017) used the SNLI database and a BiLSTM architecture to improve the sentences encoding using Natural Language Inference detection. The network updates its encoding of the sentences to predict the semantic relationship between two sentences. We will use a feature of that BiLSTM encoder to extract the keywords of a sentence in our expansion generation engine.

2.2 Sequence to sequence architectures

Sequence to Sequence is a neural architecture very popular in machine translation since it achieved state of the art results Luong et al. (2015). Proposed by Sutskever and al Sutskever et al. (2014) and Kalchbrenner & Blunsom Kalchbrenner and Blunsom (2013)

, it consists in a two-component model using recurrent neural networks to link variable-length input sequences to variable-length output sequences. The introduced sequence gets encoded by the first component into a vectorial representation. Therefore, the decoder transforms that vector into the target sequence. For the example of automatic translation, the source sequence is a sentence in language A. The decoder outputs the target sentence in language B.

The target sequence is the argmax of

And at each step the next token maximizes :

where is the i-th hidden state of the encoder, c the final vector output by the encoder representing the entire input sentence and the i-th generated token. g is the function learned by the decoder.
The encoding and decoding parts are usually composed of RNN cells. Bengio and al Bengio et al. (1994)

showed that Long Short Term Memory cells were very efficient to deal with vanishing gradient issues. LSTM cells

Hochreiter and Schmidhuber (1997) help the network capture long term dependencies and therefore exploit long sequences.
Attention mechanisms are an extension to the encoder-decoder model. Proposed first by Bahdanau and al. Bahdanau et al. (2014), it helps at each step the decoder with a context vector pointing out where the relevant information about the next token is located. With attention, the formula leading to the generation of the next token becomes :

We remark that the main difference is that the vector c becomes dependent of the rank of the generated token i. The vector is a weighted sum of annotations used in the encoding process ( being the length of the input sequence). Using hidden states, the vector points out how much a part of the source sequence should participate into the generation of the next token.

3 Our approach

3.1 Building the training set

3.1.1 Databases

If a lot of paraphrasing databases exist and are made available for free, they usually consist in very short paraphrases (synonyms or slightly changed reformulations). These datasets are not very appropriate to enhance and expand question queries as they would tackle each token separately, while we are trying to find expansions based on whole sentences. The second case is very long paraphrases ( piece of news on the same special event ). They are too long to be trained for conversational queries. This is why we had to find genuine sources of training material.
We used MultiNLI Williams et al. (2018) and SNLI Bowman et al. (2015)

. NLI stands for Natural Language Inference and describes datasets presenting a list of sentence pairs with the semantic relationship linking the two sentences. The main semantic relationships are ”Neutral” for approximate paraphrase, ”Entailment” for semantic entailment (Example : (A) The president was assassinated. entails (B) The president is dead. )and ”Contradiction”. For both corpuses, we naturally eliminate pairs classified as contradiction as they shall not provide relevant expansions.

We also selected the duplicate pairs from the Quora question pairs dataset and trained our expansion model using the words that do not appear in the first formulation.
Finally, MSCOCO Lin et al. (2014) dataset consists in human annotated captions of over 120K images. Since they are describing the same image, (which usually focuses on only a few objects and generally one prominent object or action), we can assume the words appearing in one description and not in the other are an eventual expansion for the first annotation.

Dataset Size
Stanford NLI 570k
Multi NLI 433K
Quora Duplicates 404k
Images annotations MSCOCO 120K

Table 1: Datasets

3.1.2 Keywords extraction

We will use sentence embeddings to find out which words contribute most to the final vector. This computation is based on the hidden states of the encoder. The last layer in the Infersent model is a Maxpooling one. A max pooling layer performs down-sampling by dividing the input into rectangular pooling regions, and computing the maximum of each region. What we can see in the Figure 1 is the number of times the maxpooling layer chose the hidden state

which is like a sentence representation centered around the t-th token. The words chosen the most times will be our selected keywords. These keywords will compose the expansions.

Figure 1: Keyword extraction with the maxpooling layer (Credits : Conneau and al.)

Example : a picture of an old parade going through a town gives these extracted keywords : [’parade’, ’old’, ’picture’, ’town’]

3.2 The training

3.2.1 Preprocessing

The specificity of the Query Expansion task makes us able to take both A-B and B-A pairs for training. We extract the keywords from the target sequence, and then we remove the ones that appear in the source sequence.
To get targets with similar lengths, we remove the pairs with a target having less than 3 tokens. We also limit the number of target tokens to 6.
We finally get 520k pairs of sentence-expansion, this number may not be large enough for Sequence to Sequence learning, but we are limited by the number of available exploitable datasets. Here is an example from our training set :

Query : who is the president of the U.S? Expansion : american elected actual

3.2.2 Training Model

We initiated the encoder and decoder weights with pre-trained word embeddings. We chose Glove Pennington et al. (2014) 840B with 300 dimensioned vectors.
To choose the hyper parameters, we relied mainly on the best practices given by Britz and al. Britz et al. (2017). They conducted massive tests to give best practices advice for these architectures. We used a Bidirectional LSTM encoder with two layers of 500 hidden units. Graves and Schimdhuber Graves and Schmidhuber (2005) have shown that Bidirectional encoding outperforms unidirectional one. The idea behind is to present the sentence in both forwards and backwards to two separate LSTM networks. It ensures that for every point in the input sequence, the network has complete information about the token preceding and following him. The decoder is a 2-layer LSTM with 500 hidden units.

We used mini batches of 32 examples, and applied a 0.35 dropout probability in the LSTM stacks. We used Stochastic Gradient Descent as our optimizer and started with a learning rate of 0.001. The learning rate goes down with a decay of 0.5 after every epoch. There were 25 epochs of training for a total training time of 85 hours.

The loss function is a Softmax cross entropy loss comparing the words computed by the decoder and the actual ”True” targets.

We use a Bahdanau attention model for the expansion generation. We could think that attention is not needed in our context, as tokens should be produced based on the whole sentence. Yet, focusing on part of the sentences yields better results and was therefore our choice.

After generating the expanding sequence, we remove the words appearing in the initial query, and expand the source sentence with the remaining ones. The training accuracy reaches 25.85 percent. But it is not a relevant metric to test our query expansion model. Instead, using the same ”search” component, we will evaluate the effect of the query expansion on the quality of the results/documents proposed.

Figure 2: Training evolution

4 Evaluation

We tested our query expansion model on three different tasks : Information Retrieval, Text Classification and Answer preselection.

4.1 Information Retrieval

For this task, we will use the TREC Robust 2004 Dataset Vorrhees (2001). It consists in a set of 250 queries and 528,155 documents. For the search component, we will use Apache Lucene search.
We start with the queries, and we expand them using our QE system and then we check the quality of the results provided by Lucene Search using StandardAnalyzer and two weighting schemes : BM25 and TF-IDF.
Here is an example :

  • Query : bullying prevention programs / Expansion : school program security

We will use Mean Average Precision as our metric to evaluate the results provided by our search. Mean Average Precision is the mean of the average precision scores for each query.

with Q the number of queries and :

with n the number of documents, rel(k) an indicator function equaling 1 if the item at rank k is a relevant document, zero otherwise and :

The table shows the MAP results:

Method MAP
TF-IDF without QE 0.2517
TF-IDF with QE 0.2581
BM25 without QE 0.2709
BM25 with QE 0.2783

Table 2: Information Retrieval results

We run a Student t-test to check the significance of the difference. For the TF-IDF vectorization, the p-value reaches 0.434 while it equals 0.42 with BM-25. The difference is obviously not significant at 20% .

4.2 Text Classification

Text Classification is a very popular task in Natural Language Processing. To see how useful our QE engine can be for this task, we will use The Guardian API. We download a set of articles classified by topic from The Guardian newspaper. We will only use headlines for article classification. We divide the set, we take 40000 articles for training and we keep 1000 for testing.

For learning we use the SVM algorithm applied on vectors obtained with a basic TF-IDF vectorization. To make it easier, we only keep seven ”large” labels : Culture, Sport, World, Politics, Business, Science and Media.
Our QE engine intervenes in the testing phase, we compare the results obtained with or without expanding the headlines of the testing set. We then compare the accuracies of the prediction :

Method Accuracy
Without QE 0.7167
With QE 0.7180

Table 3: Text Classification results

We run the Student’s T test on this task also. With a p-value of 0.47427, the difference is not significant at 20% .

4.3 Answer Preselection

This is the task that motivated us first to explore Query Expansion techniques.
We will use the WikiQA Dataset Yang et al. (2015). The WikiQA corpus is an open set of question and sentence pairs, collected and annotated for research on open-domain question answering.
For each question, we start a search on the set of answers with a similarity computation. We select the ten most similar answers, then we count the proportion of relevant answers in the ten hypotheses, taking into account that some questions have less than 10 possible answers. This will be our accuracy measure. Coverage is defined as the proportion of queries that had at least one appropriate answer among the ten hypotheses.
We compare the accuracy measures using or not the Query Expansion engine. Here is an example of queries in the Dataset :

  • Query : How to lose weight ? / Expansion : aerobic sport diet

Method Accuracy Coverage
TF-IDF without QE 0.2871 0.7840
TF-IDF with QE 0.2889 0.7901

Table 4: Answer Preselection results

Due to the number of queries (2117), the t-test shows better results. However, with 0.44855 and 0.31393 p-values respectively for accuracy and coverage, the difference remains not significant at 20%.

5 Qualitative analysis and future tracks

Our Query Expansion engine does improve the results in the different tasks we tested it in, but the progress is far from being impressive, and is logically not statistically significant. Here is a list of the issues limiting the reliability of our QE system and a few future tracks for improvement :

  • The QE system fails to capture the semantic mechanisms behind Query Expansion, and therefore could not expand queries of unseen topics. This may not be that surprising as the task seems very complicated and nothing proofs that the actual embedding space ensures and holds this type of semantic relationships. The expansions are therefore learned through the examples, and the models fails to enrich queries on topics it did not witness before (45% of the queries are not expanded since no new word is added). The testing datasets are mainly composed of such queries. When we have a look to the attention matrices, we find out that the network does not rely on the question formulation in generating the new tokens. It almost only looks for the ”keywords”.

  • This makes us think that although it may not be efficient for open topics, training this model on local entailments would yield great expansion results.

  • The nature of the testing queries, mainly the fact that they contained a lot of named entities, has been a huge handicap for our query expansions. Confronted to unknowns, the network replicates the words of the source sentence and the introduced query does not get any useful expanding word. We tried to remove the named entities in the sentence before expanding it. But as the sentence loses a lot of the information it carried first, the expansions were not very relevant.

  • We will explore the possibility of including the search in the training process. The progress on search would be a loss function updating the weights of the encoder-decoder network. After the first training, a second one ,based on the reward for the search, would refine the parameters of our network and make it more search-oriented.

6 Conclusion

Throughout this work, we introduced a sequence to sequence framework for automatic query expansion. We implemented a genuine keyword extraction method to create the training dataset. When tested on three different tasks, our Query Expansion engines leads to a slight improvement. Although our model seems still unable to propose relevant expansions on unseen topics, it performs well on known ones. For the future, we are thinking about updating the Sequence to sequence network according to the search results. We would add on top of our encoder-decoder network a deep reinforcement learning component. This component would rely on the impact of our expansion on search quality.