Deep Neural Network for Learning to Rank Query-Text Pairs

02/25/2018 ∙ by Baoyang Song, et al. ∙ 0

This paper considers the problem of document ranking in information retrieval systems by Learning to Rank. We propose ConvRankNet, which combines a Siamese Convolutional Neural Network encoder with the RankNet ranking model and can be trained in an end-to-end fashion. We prove a general result justifying the linear test-time complexity of the pairwise Learning to Rank approach. Experiments on the OHSUMED dataset show that ConvRankNet systematically outperforms existing feature-based models.

1 Introduction

This paper considers the problem of document ranking in information retrieval systems by Learning to Rank. Traditionally, ranking models such as TF-IDF or Okapi BM25 (Manning et al., 2008) were tuned by hand, which is both time-consuming and tedious. Learning to Rank, on the other hand, aims to fit the ranking model automatically using machine learning techniques. In recent years, Learning to Rank has drawn much attention and has quickly become one of the most active research areas in information retrieval. A number of supervised and semi-supervised ranking models have been proposed and extensively studied. We refer to (Liu, 2011) for a detailed exposition.

Though successful, these Learning to Rank models are mostly “feature based”. In other words, given a query-document pair $(q, d)$, the inputs to ranking models are vectors of the form $\phi(q, d)$, where $\phi$ is a feature extractor. Widely used features include the TF-IDF similarity and the PageRank score. However, feature-based models suffer from several problems in practice. On the one hand, feature engineering is generally non-trivial and requires much trial and error before distinctive features are found; on the other hand, computing $\phi$ could be very challenging in a real use case. For instance, in the popular Elasticsearch (Gormley and Tong, 2015), there is no direct way to calculate the IDF alone (though the product TF × IDF is readily available).

In this paper, we show how raw query-document pairs can be used directly to fit an existing feature-based ranking model. We develop ConvRankNet, a strong Learning to Rank framework composed of: (1) a Siamese Convolutional Neural Network (CNN) encoder, a module designed to automatically extract feature vectors $v_i$ and $v_j$ from a query $q$ and two documents $d_i$, $d_j$; and (2) RankNet, a successful three-layer neural network-based pairwise ranking model. We also prove a general result justifying the linear test-time complexity of pairwise Learning to Rank approaches. Our experiments show that ConvRankNet significantly improves on state-of-the-art feature-based ranking models.

2 Related Work

Our approach is based on the pairwise Learning to Rank approach (Liu, 2011).

The pairwise approach has been extensively studied in the supervised setting. As its name suggests, it takes a pair of documents $d_i$ and $d_j$ as input and is trained to predict whether $d_i$ is more relevant than $d_j$. Joachims (2002) uses “clickthrough logs” to infer pairwise preferences and trains a linear Support Vector Machine (SVM) on the difference of feature vectors $v_i - v_j$. Burges et al. (2005) introduce a three-layer Siamese neural network with a probabilistic cost function which can be efficiently optimized by gradient descent. Instead of working with a non-smooth cost function, Burges et al. (2007) propose LambdaRank, which directly models the gradient of an implicit cost function. Burges (2010) introduces LambdaMART, the boosted-tree version of LambdaRank. LambdaMART is generally considered the state-of-the-art supervised ranking model.

In the semi-supervised setting, however, there is considerably less work. Szummer and Yilmaz (2011) make use of unlabeled data by breaking the cost function into two parts: one that depends only on labeled data and one that depends only on unlabeled data. They report a statistically significant improvement over the supervised counterpart on some benchmark datasets.

In recent years, CNNs (Krizhevsky et al., 2012) have achieved impressive performance in many domains, including Natural Language Processing (NLP) (Kim, 2014). It has been shown that a CNN can efficiently learn to embed sentences into a low-dimensional vector space while preserving important syntactic and semantic aspects. Moreover, a CNN can be trained in an end-to-end fashion, i.e. little preprocessing and feature engineering are required. These properties have motivated attempts to adapt CNNs to build end-to-end Learning to Rank models.

Severyn and Moschitti (2015) combine a CNN with a pointwise model to rank short query-text pairs and report state-of-the-art results on several Text Retrieval Conference (TREC) tracks. (The pointwise approach treats the feature vectors of query-document pairs independently and considers a regression or multi-class classification problem on the relevance.) Though successful, their approach has several drawbacks: (1) both query and document are limited to a single sentence; (2) the underlying Learning to Rank approach is pointwise, which is rarely used in practice; moreover, it is difficult to take advantage of the pairwise preferences provided by clickthrough logs, which makes the approach impractical for a real use case; (3) the addition of “additional features” to the join layer is questionable. Indeed, the method is claimed to be end-to-end, but the additional features could be so informative that the feature maps learned by the CNN do not play an important role.

We are thus motivated to generalize the idea of Severyn and Moschitti (2015) and to build a truly end-to-end framework whose underlying ranker is a successful Learning to Rank model.

3 ConvRankNet

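Throughout the rest of the paper, we denote by $\mathcal{Q}$ the set of queries and by $\mathcal{D}$ the set of documents. Given $q \in \mathcal{Q}$, let $\mathcal{D}_q \subseteq \mathcal{D}$ be the set of documents which “match” $q$ (for example, $\mathcal{D}_q$ could be the set of documents sharing at least one token with $q$). For $d_i, d_j \in \mathcal{D}_q$, we write $d_i \triangleright d_j$ if $d_i$ is more relevant than $d_j$ ($d_i \triangleleft d_j$ is defined similarly) and $d_i \sim d_j$ if there is a tie. Further, let $S_{ij}$ denote the pairwise preference such that

$$S_{ij} = \begin{cases} 1 & \text{if } d_i \triangleright d_j, \\ 0 & \text{if } d_i \sim d_j, \\ -1 & \text{if } d_i \triangleleft d_j. \end{cases} \tag{1}$$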

In the following we describe ConvRankNet, our system for ranking short query-text pairs in an end-to-end way. It consists of (1) a Siamese CNN encoder, a module designed to automatically extract feature vectors from the query and the texts, and (2) RankNet, the underlying ranking model. Figure 1 gives an illustration of ConvRankNet.

Figure 1: ConvRankNet structure.

3.1 Siamese CNN Encoder

The Siamese CNN Encoder extracts feature vectors. As shown in Figure 1, the encoder consists of three sub-networks sharing the same weights (a.k.a. Siamese network (Bromley et al., 1994)). It is made up of the following major components: sentence matrix, convolution feature maps, activation units, pooling layer and similarity measure.

Sentence Matrix

Given a sentence $s = (w_1, \dots, w_L)$, the sentence matrix $S \in \mathbb{R}^{L \times d}$ is such that its $i$-th row is the embedding of $w_i$, a word (or $n$-gram), into a $d$-dimensional space, obtained by looking up a pre-trained word embedding model.

Convolution Feature Maps, Activation and Pooling

A convolutional layer is used to extract discriminative patterns in the input sentence. A filter $F \in \mathbb{R}^{m \times d}$ of size $m$ is applied on a sliding window of $m$ rows of $S$ representing $m$ consecutive words (or $n$-grams). Note that the filter is of the same width as the sentence matrix; therefore, a column vector $c \in \mathbb{R}^{L-m+1}$ is produced. Formally, the $i$-th component of $c$ is such that

$$c_i = \sum_{j=1}^{m} \sum_{l=1}^{d} F_{jl} \, S_{i+j-1,\,l} + b \tag{2}$$

where $b$ is a bias. A non-linear activation unit $\alpha$ is applied element-wise on $c$, which permits the network to learn non-linearity. A number of activation units are widely used in many settings; in ConvRankNet, the rectified linear (ReLU) function is preferred. The output of the activation unit is further passed to a max-pooling layer. In other words, $c$ is represented by $\hat{c} = \max_i \alpha(c_i)$. In practice, a set of $K$ filters of different sizes is used to produce feature maps $c^{(1)}, \dots, c^{(K)}$. Each $c^{(k)}$ is passed individually through the activation unit and the max-pooling layer, so that we have a vector

$$x = \left(\hat{c}^{(1)}, \dots, \hat{c}^{(K)}\right) \tag{3}$$

in the end.
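The convolution, activation and pooling steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' PyTorch implementation: the function names, the toy sentence matrix and the two toy filters are assumptions made for the example.

```python
def conv_feature_map(S, F, b=0.0):
    """Slide an m x d filter F over the rows of the L x d sentence matrix S.

    Returns the feature map c with L - m + 1 components.
    """
    m, d = len(F), len(F[0])
    feature_map = []
    for i in range(len(S) - m + 1):
        window = S[i:i + m]  # m consecutive word embeddings
        feature_map.append(
            sum(F[j][l] * window[j][l] for j in range(m) for l in range(d)) + b
        )
    return feature_map


def relu(c):
    """Element-wise rectified linear activation."""
    return [max(0.0, value) for value in c]


def encode(S, filters, biases):
    """One max-pooled ReLU activation per filter, concatenated into a vector."""
    return [max(relu(conv_feature_map(S, F, b))) for F, b in zip(filters, biases)]


# Toy sentence matrix: 3 "words" embedded in d = 2 dimensions (hypothetical values).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, -1.0]],             # one filter of size m = 1
    [[1.0, 0.0], [0.0, 1.0]],  # one filter of size m = 2
]
biases = [0.0, 0.0]
x = encode(S, filters, biases)
```

Each filter contributes exactly one component to the output, so the encoder output has as many dimensions as there are filters, regardless of the sentence length.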

Similarity Measure

Given $(q, d_i, d_j)$, the encoder produces three vectors $x_q$, $x_{d_i}$ and $x_{d_j}$ respectively. In order to feed RankNet, two feature vectors $v_i$ and $v_j$ need to be further computed. Severyn and Moschitti (2015) introduce a similarity matrix $M$ and define $\mathrm{sim}(x_q, x_d) = x_q^\top M x_d$. However, such a choice is difficult to fit into a modern deep learning framework. In ConvRankNet, we choose instead

$$v_i = (x_q - x_{d_i})^2, \qquad v_j = (x_q - x_{d_j})^2 \tag{4}$$

where the square is element-wise.

The outputs of the Siamese CNN encoder, $v_i$ and $v_j$, are then piped to a standard RankNet. We choose RankNet for its simple implementation and yet impressive performance on benchmark datasets. Our idea, however, is applicable to any pairwise ranking model.

3.2 RankNet

Proposed in (Burges et al., 2005), RankNet quickly became a popular ranking model and has been deployed in commercial search engines such as Microsoft Bing. It is well studied in the literature. For the sake of completeness, however, we briefly describe its structure here.

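For $d_i, d_j \in \mathcal{D}_q$, suppose that there exists a deterministic target probability $\bar{P}_{ij}$ such that

$$\bar{P}_{ij} = \frac{1}{2}\left(1 + S_{ij}\right) \tag{5}$$

The objective is to learn a posterior probability distribution $P_{ij}$ that is “close” to $\bar{P}_{ij}$. A natural measure of closeness between probability distributions is the binary cross-entropy

$$C_{ij} = -\bar{P}_{ij} \log P_{ij} - \left(1 - \bar{P}_{ij}\right) \log\left(1 - P_{ij}\right) \tag{6}$$

which is minimized when $P_{ij} = \bar{P}_{ij}$. The posterior is modeled by the Bradley-Terry model

$$P_{ij} = \frac{e^{o_i}}{e^{o_i} + e^{o_j}} \tag{7}$$
$$\phantom{P_{ij}} = \frac{1}{1 + e^{-o_{ij}}} \tag{8}$$

where $f$ is a score function, $o_i = f(v_i)$ and $o_{ij} = o_i - o_j$.

Under this assumption, the loss function (6) further becomes

$$C_{ij} = -\bar{P}_{ij}\, o_{ij} + \log\left(1 + e^{o_{ij}}\right) \tag{9}$$

Since (9) is convex in $o_{ij}$, it is straightforward to optimize it by gradient descent.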
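As a sanity check on the loss, the following sketch assumes the standard RankNet formulation of Burges et al. (2005): a logistic (Bradley-Terry) model of the score difference $o_{ij}$ trained with binary cross-entropy. It compares the cross-entropy with the simplified form $-\bar{P}_{ij}\, o_{ij} + \log(1 + e^{o_{ij}})$. The function names are hypothetical and only the Python standard library is used.

```python
import math


def sigmoid(o_ij):
    """Posterior P_ij under the logistic model of the score difference."""
    return 1.0 / (1.0 + math.exp(-o_ij))


def cross_entropy(p_bar, o_ij):
    """Binary cross-entropy between the target p_bar and the posterior P_ij."""
    p = sigmoid(o_ij)
    return -p_bar * math.log(p) - (1.0 - p_bar) * math.log(1.0 - p)


def ranknet_loss(p_bar, o_ij):
    """Simplified form -p_bar * o_ij + log(1 + exp(o_ij)).

    The softplus term is computed in a numerically stable way.
    """
    softplus = max(o_ij, 0.0) + math.log1p(math.exp(-abs(o_ij)))
    return -p_bar * o_ij + softplus


# The two expressions agree for every target value and score difference tried.
max_gap = max(
    abs(cross_entropy(p_bar, o) - ranknet_loss(p_bar, o))
    for p_bar in (0.0, 0.5, 1.0)
    for o in (-2.0, 0.0, 3.0)
)
```

The simplified form avoids taking the logarithm of a probability close to zero, which is why it is usually preferred in practice.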

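Since $d_i \triangleright d_j$ iff $d_j \triangleleft d_i$, we may suppose without loss of generality that every pair $(d_i, d_j)$ is ordered so that $d_i$ is at least as relevant as $d_j$. Moreover, Burges et al. (2005) show that training on ties makes little difference. Therefore, we can consider only document pairs such that $d_i \triangleright d_j$.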
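The construction of training pairs described above can be sketched as follows. This is a minimal illustration assuming integer relevance labels where a larger value means more relevant; the function name and the toy labels are hypothetical.

```python
from itertools import combinations


def preference_pairs(labels):
    """Return index pairs (i, j) such that document i is strictly more relevant
    than document j. Ties are skipped, and each unordered pair appears at most
    once, oriented with the more relevant document first.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs


# Toy relevance labels for four documents matching one query.
labels = [2, 0, 1, 1]
pairs = preference_pairs(labels)
```

Here documents 2 and 3 are tied and therefore produce no training pair, while every other pair is emitted once with the more relevant document first.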

3.3 Time Complexity

In this section we discuss the time complexity of general pairwise Learning to Rank models (and in particular, ConvRankNet).

In the pairwise approach, we generally consider a bivariate function $h: \mathcal{D}_q \times \mathcal{D}_q \to \mathbb{R}$ such that $h(d_i, d_j) > 0$ iff $d_i \triangleright d_j$. $h$ is then used to construct the cost function.

It is clear that the training time complexity is $O(|\mathcal{D}_q|^2)$, since every pair $(d_i, d_j)$ such that $d_i \triangleright d_j$ has to be considered. One may infer that the test cost is also quadratic, since we have to evaluate $h$ on the collection of all possible pairs in the test data and construct a consistent total order (for example, $d_1 \triangleright d_2$, $d_2 \triangleright d_3$ and $d_1 \triangleright d_3$ induce the total order $d_1 \triangleright d_2 \triangleright d_3$).

However, we argue that under a very loose assumption, linear time actually suffices for constructing the total order. (Here we do not take into account the sorting of the scores, which is in general $O(n \log n)$ for $n$ documents; however, if the scores are of fixed precision, one could use e.g. radix sort to achieve linear time complexity.) First recall the following result in graph theory:

Lemma 3.1.

The topological sort of a directed graph $G$ is an order of the vertices such that all edges of $G$ go from left to right in the order. It is shown that

  • $G$ is a directed acyclic graph (DAG) iff there exists a topological sort on $G$ (Skiena, 2008).

  • the topological sort is unique iff there is a directed edge between each pair of consecutive vertices in the topological order (i.e. $G$ has a Hamiltonian path) (Sedgewick and Wayne, 2011).
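Both statements of the lemma can be observed with a short sketch of Kahn's algorithm, a standard topological sorting procedure that is not part of the paper itself. The function name and the vertex numbering 0..n-1 are assumptions of the example. The order is unique exactly when, at every step, a single vertex has no remaining incoming edge.

```python
from collections import deque


def topological_sort(n, edges):
    """Kahn's algorithm on vertices 0..n-1.

    Returns (order, unique). order is None when the graph contains a cycle;
    unique tells whether the topological order is the only possible one.
    """
    successors = [[] for _ in range(n)]
    indegree = [0] * n
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    ready = deque(v for v in range(n) if indegree[v] == 0)
    order, unique = [], True
    while ready:
        if len(ready) > 1:
            unique = False  # several vertices could come next
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if len(order) < n:  # some vertices lie on a cycle
        return None, False
    return order, unique


# A tie-free preference graph on three documents: 0 beats 1 and 2, 1 beats 2.
order, unique = topological_sort(3, [(0, 1), (1, 2), (0, 2)])
```

In the tie-free example every pair of consecutive vertices in the order is joined by an edge, so the order is unique, in line with the second statement of the lemma.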

We have then the following result:

Theorem 3.2.

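Suppose that the hypothesis $h$ has no tie, i.e. $h(d_i, d_j) \neq 0$ for all $d_i \neq d_j$. If there exists a score function $f: \mathcal{D}_q \to \mathbb{R}$ such that $h(d_i, d_j) > 0$ iff $f(d_i) > f(d_j)$, then the total order defined by $h$ on $\mathcal{D}_q$ is the same as that induced by the scores $f(d)$, $d \in \mathcal{D}_q$.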

Proof.

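Consider the directed graph $G$ induced by $h$ on $\mathcal{D}_q$, i.e. there is an edge $d_i \to d_j$ iff $h(d_i, d_j) > 0$. Remark first that $G$ is a DAG. If not, there exists a cycle $d_1 \to d_2 \to \dots \to d_k \to d_1$. Then $f(d_1) > f(d_2) > \dots > f(d_k) > f(d_1)$, a contradiction. By Lemma 3.1, there exists a topological sort. Without loss of generality, denote this sort by $d_1, d_2, \dots, d_n$. Since $h$ has no tie, $d_1 \to d_2 \to \dots \to d_n$ is a Hamiltonian path, thus the topological sort is furthermore unique. It is easy to see that it is nothing but the sort with respect to $f$. ∎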

Corollary 3.2.1.

ConvRankNet has linear test time.
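The practical consequence can be illustrated on a toy example. The sketch below assumes a RankNet-style hypothesis $h(d_i, d_j) = P_{ij} - 1/2$ built from per-document scores; the function names and the score values are made up for the illustration. Ranking the documents by their individual scores requires one score evaluation per document, and the resulting order agrees with $h$ on every pair.

```python
import math
from itertools import permutations


def h(score_i, score_j):
    """Pairwise hypothesis of a RankNet-style model: P_ij - 1/2."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j))) - 0.5


def rank_by_score(scores):
    """Document indices sorted by decreasing score (n evaluations plus one sort)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical scores f(d) for four documents, with no ties.
scores = [0.3, 2.1, -1.0, 0.9]
order = rank_by_score(scores)
position = {doc: rank for rank, doc in enumerate(order)}

# Exhaustive quadratic check, needed only to validate the linear procedure.
consistent = all(
    (h(scores[i], scores[j]) > 0) == (position[i] < position[j])
    for i, j in permutations(range(len(scores)), 2)
)
```

The quadratic loop is there only to verify the claim on the toy data; at test time the sort on the scores is all that is needed.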

4 Experimental Results

In this section we evaluate ConvRankNet on standard benchmark datasets and compare it with standard RankNet and LambdaRank.

4.1 Datasets

Since ConvRankNet is an end-to-end model, we need datasets that give access to the raw queries and documents.

To the best of our knowledge, the OHSUMED dataset (http://mlr.cs.umass.edu/ml/machine-learning-databases/ohsumed/) is the only freely available dataset of this kind. A subset of MEDLINE (a database of medical publications), it consists of queries on medical documents from 1987-1991. The relevance of query-document pairs is provided by human assessors on three levels: non-relevant (n), partially relevant (p) and definitely relevant (d). In particular, non-relevant pairs are explicitly provided. For historical reasons, a query-document pair may be judged independently by more than one assessor. To avoid ambiguity, the highest relevance label is taken in our experiments.

Fold      1     2      3      4      5
Query id  1-21  22-42  43-63  64-84  85-106
Table 1: Partition of OHSUMED dataset.

To compare with classical feature-based models, we also use a synthesized version of OHSUMED (where only feature vectors are accessible), which is included in Microsoft’s LETOR 3.0 dataset (https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). As in LETOR 3.0, we partition the raw OHSUMED dataset into five folds as shown in Table 1.

4.2 Experimental Setup

All models are implemented in PyTorch framework.

In general, queries and documents are not of the same length. Though PyTorch uses a dynamic graph and is capable of handling texts of various lengths, one could then only train the network on one query-document-document triple at a time. In order to perform batch training, both queries and documents are truncated to a fixed number of words, with zero-padding for shorter texts.
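Truncation and zero-padding can be sketched as follows, so that every sentence matrix in a batch has the same shape. The function name, the maximum length and the embedding dimension used here are illustrative assumptions, not the values used in the experiments.

```python
def pad_or_truncate(rows, max_len, dim):
    """Cut a sentence matrix to max_len rows, or append zero rows up to max_len."""
    rows = rows[:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows + padding


short = pad_or_truncate([[1.0, 2.0]], max_len=3, dim=2)
long = pad_or_truncate([[1.0, 2.0]] * 5, max_len=3, dim=2)
```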

We use ConceptNet Numberbatch (https://github.com/commonsense/conceptnet-numberbatch; Speer et al., 2016) as the default word embedding. Part of the ConceptNet open data project, ConceptNet Numberbatch consists of state-of-the-art semantic vectors that can be used directly as representations of word meanings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. Several benchmarks show that ConceptNet Numberbatch outperforms word2vec and GloVe. ConceptNet Numberbatch includes vectors not only for single words but also for some bigrams and trigrams, which are semantically more informative. For example, “la carte” is clearly better characterized than “la” and “carte” separately. To exploit $n$-grams, we use a greedy approach in mapping text to the sentence matrix, i.e. we extend the $n$-gram to be mapped to a vector as far as possible. For example, the toy “sentence” hello world peace would be segmented as (hello world, peace), even though (hello, world peace) is also possible. One possible extension of our work is then to find the semantically optimal segmentation of a sentence. In our network, unknown words are mapped to a random vector and zero padding is mapped to the zero vector.
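The greedy segmentation can be sketched as a longest-match scan over the tokens. This is a minimal sketch under stated assumptions: the function name, the toy vocabulary and the underscore used to join the words of an $n$-gram are illustrative choices, and the maximum $n$-gram length of 3 reflects the bigrams and trigrams mentioned above.

```python
def greedy_segment(tokens, vocab, max_n=3):
    """Greedy longest-match segmentation of a token list into known n-grams.

    At each position the longest n-gram found in vocab is taken. A single
    token is always accepted, even if unknown, so the scan always advances.
    """
    segments, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = "_".join(tokens[i:i + n])
            if n == 1 or candidate in vocab:
                segments.append(candidate)
                i += n
                break
    return segments


# Toy vocabulary in which both bigrams of the example sentence are known.
vocab = {"hello", "world", "peace", "hello_world", "world_peace"}
segments = greedy_segment("hello world peace".split(), vocab)
```

Because the scan proceeds from left to right, "hello world" is consumed first and "world peace" is never considered, which is exactly the behaviour described for the toy sentence.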

Filters of three different sizes, with several copies each, are used for the convolution layer, so that the dimension of the input to RankNet equals the total number of filters. In order to prevent overfitting, a dropout layer (Srivastava et al., 2014) is used after the max-pooling layer during training.

RankNet, LambdaRank and ConvRankNet are all trained for a fixed number of epochs, with a separate learning rate for each model. All tests were performed on an Ubuntu 16.04.4 server with a Xeon CPU and a Tesla P100 GPU. Cross-validations are run in parallel with the help of GNU Parallel (Tange, 2011).

Normalized Discounted Cumulative Gain at truncation level $k$ (NDCG@$k$) is used as the evaluation measure. NDCG@$k$ is a multi-level ranking quality measure widely used in previous work. It ranges from 0 to 1, with 1 for the perfect ranking. We refer to (Manning et al., 2008) for a detailed presentation.
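For concreteness, a minimal NDCG@$k$ sketch is given below. It assumes one common variant of the measure, with gain $2^{rel} - 1$ and a base-2 logarithmic discount; the exact variant used in the experiments is not stated here, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k results, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order produced by a ranker (2 = most relevant).
perfect = ndcg_at_k([2, 1, 0], k=3)
imperfect = ndcg_at_k([1, 2, 0], k=2)
```

A ranking that already lists documents by decreasing relevance scores exactly 1, and swapping the two best documents lowers the score.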

4.3 Results

method        NDCG@1  NDCG@2  NDCG@3  NDCG@4  NDCG@5
ConvRankNet   0.5479  0.5265  0.5204  0.5241  0.5204
LambdaRank    0.5677  0.5267  0.4942  0.4884  0.4780
RankNet       0.5737  0.5362  0.5128  0.4898  0.4746

method        NDCG@6  NDCG@7  NDCG@8  NDCG@9  NDCG@10
ConvRankNet   0.5179  0.5109  0.5139  0.5122  0.5132
LambdaRank    0.4681  0.4604  0.4552  0.4553  0.4503
RankNet       0.4648  0.4608  0.4560  0.4493  0.4461
Table 2: 5-fold cross-validation of NDCG@$k$ on test set.

5-fold cross-validation is performed on both datasets. Table 2 reports NDCG@$k$ at all truncation levels. ConvRankNet is behind RankNet and LambdaRank at $k = 1$ and $k = 2$, but outperforms both from $k = 3$ onwards, and the gap widens for large truncation levels $k$.

             RankNet  LambdaRank  ConvRankNet
RankNet      -        0.575       0.021
LambdaRank   -        -           0.012
ConvRankNet  -        -           -
Table 3: $p$-value of two-tailed Wilcoxon signed-rank test.

A two-tailed Wilcoxon signed-rank test (Dalgaard, 2008) is performed on these values. As we can see from Table 3, the improvement of ConvRankNet over RankNet and LambdaRank is statistically significant ($p = 0.021$ and $p = 0.012$ respectively). Moreover, according to the results reported in (Liu, 2011), ConvRankNet also systematically outperforms existing methods.

5 Discussion and Conclusions

In this paper, we proposed ConvRankNet, an end-to-end Learning to Rank model which directly takes raw queries and documents as input. ConvRankNet is shown to have linear test time and is thus applicable in real-time use cases. Our results indicate that it significantly outperforms existing methods on the OHSUMED dataset.

Future work could study replacing the underlying RankNet module with other, stronger neural network-based models (such as LambdaRank).

Acknowledgements

The author would like to thank Dr. Michalis Vazirigiannis of École Polytechnique for his valuable suggestions.

The author would also like to thank Mr. Geoffrey Scoutheeten and Mr. Édouard d’Archimbaud of BNP Paribas for their valuable comments.

The work is supported by the Data & AI Lab of BNP Paribas.

References