This paper considers the problem of document ranking in information retrieval systems using Learning to Rank. Traditionally, ranking models such as TF-IDF or Okapi BM25 (Manning et al., 2008) were tuned by hand, which is both time-consuming and tedious. Learning to Rank, on the other hand, aims to fit the ranking model automatically using machine learning techniques. In recent years, Learning to Rank has drawn much attention and has quickly become one of the most active research areas in information retrieval. A number of supervised and semi-supervised ranking models have been proposed and extensively studied. We refer to Liu (2011) for a detailed exposition.
Though successful, these Learning to Rank models are mostly “feature based”. In other words, given a query-document pair, the input to the ranking model is a feature vector computed from the pair by a feature extractor. Widely used features include TF-IDF similarity and PageRank score. However, feature-based models suffer from several problems in practice. On the one hand, feature engineering is generally non-trivial and requires much trial and error before distinctive features are found; on the other hand, computing the features can be very challenging in a real use case. For instance, in the popular Elasticsearch (Gormley and Tong, 2015), there is no direct way to calculate IDF alone (though the product TF-IDF is readily available).
In this paper, we show how raw query-document pairs can be used directly to fit an existing feature-based ranking model. We develop ConvRankNet, a strong Learning to Rank framework composed of (1) a Siamese Convolutional Neural Network (CNN) encoder, a module designed to automatically extract two feature vectors from a query and two documents, and (2) RankNet, a successful three-layer neural-network-based pairwise ranking model. We also prove a general result justifying the linear test-time complexity of pairwise Learning to Rank approaches. Our experiments show that ConvRankNet significantly improves on state-of-the-art feature-based ranking models.
2 Related Work
Our approach is based on the pairwise Learning to Rank approach (Liu, 2011).
The pairwise approach has been extensively studied in the supervised setting. As its name suggests, it takes a pair of documents as input and is trained to predict whether the first is more relevant than the second. Joachims (2002) uses clickthrough logs to infer pairwise preferences and trains a linear Support Vector Machine (SVM) on the difference of feature vectors. Burges et al. (2005) introduce a three-layer Siamese neural network with a probabilistic cost function which can be efficiently optimized by gradient descent. Instead of working with a non-smooth cost function, Burges et al. (2007) propose LambdaRank, which directly models the gradient of an implicit cost function. Burges (2010) introduces LambdaMART, the boosted-tree version of LambdaRank. LambdaMART is generally considered the state-of-the-art supervised ranking model.
In the semi-supervised setting, however, there is considerably less work. Szummer and Yilmaz (2011) make use of unlabeled data by breaking the cost function into two parts: one that depends only on labeled data and one that depends only on unlabeled data. They report a statistically significant improvement over the supervised counterpart on some benchmark datasets.
In recent years, CNNs (Krizhevsky et al., 2012) have achieved impressive performance in many domains, including Natural Language Processing (NLP) (Kim, 2014). It has been shown that a CNN can efficiently learn to embed sentences into a low-dimensional vector space while preserving important syntactic and semantic aspects. Moreover, a CNN can be trained in an end-to-end fashion, i.e. little preprocessing and feature engineering are required. Researchers have therefore attempted to adapt CNNs to build end-to-end Learning to Rank models.
Severyn and Moschitti (2015) combine a CNN with a pointwise model (the pointwise approach treats the feature vector of each query-document pair independently and considers a regression or multi-class classification problem on the relevance) to rank short query-text pairs and report state-of-the-art results on several Text Retrieval Conference (TREC) tracks. Though successful, their approach has several drawbacks: (1) both query and document are limited to a single sentence; (2) the underlying Learning to Rank approach is pointwise, which is rarely used in practice. Moreover, it is difficult to take advantage of the pairwise preferences provided by clickthrough logs, which makes the approach impractical for a real use case; (3) the addition of “additional features” to the join layer is questionable. Indeed, the method is claimed to be end-to-end, but the additional features could be so informative that the feature maps learned by the CNN do not play an important role.
We are thus motivated to generalize the idea in Severyn and Moschitti (2015) and to build a real end-to-end framework whose underlying ranker is a successful Learning to Rank model.
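Throughout the rest of the paper, we denote by $\mathcal{Q}$ the set of queries and by $\mathcal{D}$ the set of documents. Given $q \in \mathcal{Q}$, let $\mathcal{D}_q \subset \mathcal{D}$ be the set of documents which “match” $q$ (for example, $\mathcal{D}_q$ could be the set of documents sharing at least one token with $q$). For $d_i, d_j \in \mathcal{D}_q$, we write $d_i \succ d_j$ if $d_i$ is more relevant than $d_j$ ($d_i \prec d_j$ is defined similarly) and $d_i \sim d_j$ if there is a tie. We further record the pairwise preference between $d_i$ and $d_j$, which indicates whether $d_i \succ d_j$, $d_i \prec d_j$ or $d_i \sim d_j$.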
In the following, we describe ConvRankNet, our system for ranking short query-text pairs in an end-to-end way. It consists of (1) the Siamese CNN Encoder, a module designed to automatically extract feature vectors from the query and the texts, and (2) RankNet, the underlying ranking model. Figure 1 gives an illustration of ConvRankNet.
3.1 Siamese CNN Encoder
The Siamese CNN Encoder extracts feature vectors. As shown in Figure 1, the encoder consists of three sub-networks sharing the same weights (a.k.a. Siamese network (Bromley et al., 1994)). It is made up of the following major components: sentence matrix, convolution feature maps, activation units, pooling layer and similarity measure.
Given a sentence, the sentence matrix is built so that each row is the embedding of one word (or $n$-gram) of the sentence into a fixed-dimensional vector space, obtained by looking up a pre-trained word embedding model.
Convolution Feature Maps, Activation and Pooling
A convolutional layer is used to extract discriminative patterns from the input sentence. A filter is applied to a sliding window of consecutive rows of the sentence matrix, which represent consecutive words (or $n$-grams). Note that the filter has the same width as the sentence matrix; sliding it along the sentence therefore produces a column vector, called a feature map. Formally, each component of the feature map is the sum of the element-wise products between the filter and the corresponding window of rows, plus a bias. A non-linear activation unit is then applied element-wise to the feature map, which permits the network to learn non-linearity. A number of activation units are widely used in many settings; in ConvRankNet, the rectified linear unit (ReLU) is preferred. The output of the activation unit is further passed to a max-pooling layer. In other words, each feature map is represented by its maximum activation. In practice, a set of filters of different sizes is used to produce several feature maps. Each of them is passed individually through the activation unit and the max-pooling layer, so that we obtain a vector with one component per filter in the end.
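As a rough illustration of the convolution, activation and pooling steps described above, the following pure-Python sketch applies full-width filters to a small sentence matrix. The function names `conv_relu_maxpool` and `encode`, the toy sentence matrix and the filter values are assumptions made for this example; they are not taken from the ConvRankNet implementation, which is written in PyTorch.

```python
def conv_relu_maxpool(sentence, filt, bias=0.0):
    """Slide one full-width filter over the sentence matrix, apply ReLU,
    then keep the maximum activation (max-pooling over positions)."""
    height = len(filt)  # number of consecutive rows covered by the filter
    feature_map = []
    for start in range(len(sentence) - height + 1):
        window = sentence[start:start + height]
        # Sum of element-wise products between filter and window, plus bias.
        total = bias
        for row, filt_row in zip(window, filt):
            total += sum(x * w for x, w in zip(row, filt_row))
        feature_map.append(max(0.0, total))  # ReLU
    return max(feature_map)  # assumes len(sentence) >= height


def encode(sentence, filters):
    """One pooled value per (filter, bias) pair."""
    return [conv_relu_maxpool(sentence, filt, bias) for filt, bias in filters]


# Toy sentence matrix: 3 "words" embedded in 2 dimensions (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    ([[1.0, 1.0]], 0.0),               # filter covering 1 row
    ([[1.0, 0.0], [0.0, 1.0]], -1.0),  # filter covering 2 rows
]
features = encode(sentence, filters)
print(features)
```

In a Siamese setting the same `filters` would be applied to the query and to both documents, which is what weight sharing amounts to in this simplified picture.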
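Given a query and two documents, the encoder produces three vectors, one for each input. In order to feed RankNet, two feature vectors, one per query-document pair, need to be further computed. Severyn and Moschitti (2015) introduce a similarity matrix $M$ and define the similarity between a query vector $x_q$ and a document vector $x_d$ as the bilinear form $x_q^\top M x_d$. However, such a choice is difficult to fit into a modern deep learning framework. In ConvRankNet, we instead build each feature vector from the query vector and the corresponding document vector using an element-wise square.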
The outputs of the Siamese CNN Encoder, namely the two feature vectors, are then piped to a standard RankNet. We favor RankNet for its simple implementation and yet impressive performance on benchmark datasets. Our idea, however, is applicable to any pairwise ranking model.
Proposed in Burges et al. (2005), RankNet quickly became a popular ranking model and has been deployed in commercial search engines such as Microsoft Bing. It is well studied in the literature. For the sake of completeness, however, we briefly describe its structure here.
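For each pair of documents $(d_i, d_j)$ with feature vectors $x_i$ and $x_j$, suppose that there exists a deterministic target probability $\bar{P}_{ij}$ such that
$$\bar{P}_{ij} = \begin{cases} 1 & \text{if } d_i \text{ is more relevant than } d_j, \\ 1/2 & \text{if there is a tie}, \\ 0 & \text{otherwise}. \end{cases}$$
The objective is to learn a posterior probability distribution $P_{ij}$ that is “close” to $\bar{P}_{ij}$. A natural measure of closeness between probability distributions is the binary cross-entropy
$$C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}), \tag{6}$$
which is minimized when $P_{ij} = \bar{P}_{ij}$. The posterior is modeled by the Bradley-Terry model
$$P_{ij} = \frac{\exp(s_i)}{\exp(s_i) + \exp(s_j)} = \frac{1}{1 + \exp(-(s_i - s_j))},$$
where $f$ is a score function and $s_i = f(x_i)$ and $s_j = f(x_j)$.
Under this assumption, the loss function (6) further becomes
$$C_{ij} = -\bar{P}_{ij} (s_i - s_j) + \log\left(1 + \exp(s_i - s_j)\right).$$
Since $C_{ij}$ is convex in the score difference $s_i - s_j$, it is straightforward to optimize it by gradient descent.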
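Since $d_i$ is more relevant than $d_j$ if and only if $d_j$ is less relevant than $d_i$, we may suppose without loss of generality that in every pair $(d_i, d_j)$ the first document $d_i$ is at least as relevant as $d_j$. Moreover, Burges et al. (2005) show that training on ties makes little difference. Therefore, we need only consider document pairs in which $d_i$ is strictly more relevant than $d_j$.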
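For readers who prefer code, the RankNet pairwise loss of Burges et al. (2005) can be sketched in a few lines of plain Python. The sketch assumes the standard formulation: the posterior is the logistic function of the score difference, and the cross-entropy then simplifies to `-target * diff + log(1 + exp(diff))`. The function names are chosen for this example only, and the scores would in practice come from the neural network rather than being passed in by hand.

```python
import math


def ranknet_probability(s_i, s_j):
    """Modeled probability that document i is ranked above document j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def ranknet_loss(s_i, s_j, target):
    """Cross-entropy between the target and the modeled probability,
    written directly in terms of the score difference.
    `target` is 1.0 (i more relevant), 0.5 (tie) or 0.0 (j more relevant)."""
    diff = s_i - s_j
    # Numerically stable log(1 + exp(diff)).
    softplus = max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return -target * diff + softplus


# A correctly ordered pair costs less than a wrongly ordered one.
good = ranknet_loss(2.0, 0.0, 1.0)
bad = ranknet_loss(0.0, 2.0, 1.0)
print(good < bad)
```

Working with the score difference directly avoids evaluating `log(P)` for probabilities close to zero, which is the usual reason for preferring the simplified form in practice.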
3.3 Time Complexity
In this section we discuss the time complexity of general pairwise Learning to Rank models (and in particular, ConvRankNet).
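In the pairwise approach, we generally consider a bivariate function $h$ on pairs of feature vectors such that $h(x_i, x_j) > 0$ if and only if $d_i$ is predicted to be more relevant than $d_j$; $h$ is then used to construct the cost function.
It is clear that the training time complexity is quadratic in the number of documents per query, since every pair in which one document is more relevant than the other has to be considered. One may infer that the test cost is also quadratic, since we have to evaluate $h$ on all possible pairs of test documents and construct a consistent total order (for example, if $d_1$ is preferred to $d_2$, $d_2$ to $d_3$ and $d_1$ to $d_3$, the induced total order is $d_1, d_2, d_3$).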
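However, we argue that under a very loose assumption, linear time actually suffices for constructing the total order. (Here we do not take into account the sorting of the scores, which is in general $O(n \log n)$ for $n$ documents. However, if the scores are of fixed precision, one could use e.g. radix sort to achieve linear time complexity.) First recall the following result from graph theory:
Lemma 3.1. The topological sort of a directed graph is an ordering of its vertices such that all edges of the graph go from left to right in the ordering. It is shown (Skiena, 2008; Sedgewick and Wayne, 2011) that every directed acyclic graph (DAG) admits at least one topological sort, and that this topological sort is unique if and only if the graph contains a Hamiltonian path.
We then have the following result: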
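Suppose that the hypothesis $h$ has no tie, i.e. $h(x_i, x_j) \neq 0$ for all $i \neq j$. If there exists a score function $f$ such that $h(x_i, x_j) > 0$ if and only if $f(x_i) > f(x_j)$, then the total order defined by $h$ on the documents is the same as that of the usual order $>$ on the scores $f(x_i)$.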
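Proof. Consider the directed graph $G$ induced by $h$ on the documents, i.e. there is an edge $d_i \to d_j$ if and only if $h(x_i, x_j) > 0$. Note first that $G$ is a DAG. If not, there exists a cycle $d_{i_1} \to d_{i_2} \to \dots \to d_{i_k} \to d_{i_1}$. Then $f(x_{i_1}) > f(x_{i_2}) > \dots > f(x_{i_k}) > f(x_{i_1})$, a contradiction. By Lemma 3.1, there exists a topological sort. Without loss of generality, denote this sort by $d_1, d_2, \dots, d_n$. Since $h$ has no tie, $d_1 \to d_2 \to \dots \to d_n$ is a Hamiltonian path, thus the topological sort is furthermore unique. It is easy to see that it is nothing but the sort with respect to $f$. ∎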
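Since RankNet prefers one document to another exactly when its score is strictly greater, the result above applies, and ConvRankNet has linear test time.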
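The practical consequence can be illustrated with a small Python sketch, under the assumption that the pairwise preference is induced by a score function without ties. Ranking by score needs one score evaluation per document followed by a sort, whereas the naive approach evaluates the preference on every pair. The helper names and the toy scores below are hypothetical.

```python
def rank_by_score(docs, score):
    """One score evaluation per document, followed by a single sort."""
    return sorted(docs, key=score, reverse=True)


def rank_by_pairs(docs, prefers):
    """Quadratic baseline: order documents by their number of pairwise wins."""
    wins = {
        d: sum(1 for other in docs if other != d and prefers(d, other))
        for d in docs
    }
    return sorted(docs, key=lambda d: wins[d], reverse=True)


# Toy scores standing in for the output of a trained scoring network.
scores = {"a": 0.2, "b": 1.5, "c": -0.7, "d": 0.9}
docs = list(scores)

by_score = rank_by_score(docs, scores.get)
by_pairs = rank_by_pairs(docs, lambda x, y: scores[x] > scores[y])
print(by_score, by_pairs)
```

Both functions return the same order here because the preference is consistent with the scores; the pairwise version simply does more work to get there.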
4 Experimental Results
In this section we evaluate ConvRankNet on standard benchmark datasets and compare it with standard RankNet and LambdaRank.
Since ConvRankNet is an end-to-end model, we need datasets that give access to the raw queries and documents.
To the best of our knowledge, the OHSUMED dataset (http://mlr.cs.umass.edu/ml/machine-learning-databases/ohsumed/) is the only such freely available dataset. A subset of MEDLINE (a database of medical publications), it consists of queries on medical documents from 1987-1991. The relevance of query-document pairs is provided by human assessors on three levels: non-relevant (n), partially relevant (p) and definitely relevant (d). In particular, non-relevant pairs are explicitly provided. For historical reasons, a query-document pair may be judged independently by several assessors. To avoid ambiguity, the highest relevance label is taken in our experiments.
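The label aggregation rule (keep the highest relevance among the assessors) could look as follows in Python. This is only a sketch: the representation of judgments as `(query_id, doc_id, label)` tuples and the identifiers used are assumptions for illustration, not the actual OHSUMED file format.

```python
# Ordering of the three relevance levels: n < p < d.
LEVELS = {"n": 0, "p": 1, "d": 2}


def highest_labels(judgments):
    """Keep, for each (query, document) pair, the highest label assigned."""
    best = {}
    for query_id, doc_id, label in judgments:
        key = (query_id, doc_id)
        if key not in best or LEVELS[label] > LEVELS[best[key]]:
            best[key] = label
    return best


# Hypothetical judgments: two assessors per pair, disagreeing each time.
judgments = [
    (1, "doc7", "n"),
    (1, "doc7", "p"),
    (1, "doc9", "d"),
    (1, "doc9", "n"),
]
labels = highest_labels(judgments)
print(labels)
```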
To compare with classical feature-based models, we also use a synthesized version of OHSUMED (where only feature vectors are accessible), which is included in Microsoft’s LETOR 3.0 dataset (https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). As in LETOR 3.0, we partition the raw OHSUMED dataset into folds as shown in Table 1.
4.2 Experimental Setup
All models are implemented in the PyTorch framework.
In general, queries and documents are not of the same length. Though PyTorch uses dynamic graphs and is capable of handling texts of various lengths, one could then only train the network on one query-document-document triple at a time. In order to perform batch training, both queries and documents are truncated to a fixed number of words, with zero-padding for shorter texts.
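A minimal sketch of this truncation and zero-padding step, in plain Python, is given below. The function name `pad_or_truncate` and the tiny dimensions are assumptions for the example; the maximum length actually used in the experiments is not restated here.

```python
def pad_or_truncate(rows, max_len, dim):
    """Return exactly `max_len` embedding rows of width `dim`:
    longer inputs are cut, shorter ones are completed with zero vectors."""
    clipped = [list(row) for row in rows[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding


short_text = [[1.0, 2.0]]                         # 1 word, 2 dimensions
long_text = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # 3 words, 2 dimensions

padded = pad_or_truncate(short_text, 2, 2)
truncated = pad_or_truncate(long_text, 2, 2)
print(padded, truncated)
```

Once every text has the same number of rows, the sentence matrices can be stacked into a single batch tensor.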
We use ConceptNet Numberbatch (Speer et al., 2016; https://github.com/commonsense/conceptnet-numberbatch) as the default word embedding. Part of the ConceptNet open data project, ConceptNet Numberbatch consists of state-of-the-art semantic vectors that can be used directly as a representation of word meanings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. Several benchmarks show that ConceptNet Numberbatch outperforms word2vec and GloVe. ConceptNet Numberbatch includes not only vectors for single words but also vectors for some bigrams and trigrams. Bigrams and trigrams are semantically more informative. For example, “la carte” is clearly better characterized than “la” and “carte” separately. To exploit $n$-grams, we use a greedy approach when mapping text to the vector matrix, i.e. we extend the $n$-gram as far as possible before mapping it to a vector. For example, the toy “sentence” hello world peace would be segmented as “hello world | peace”, even though “hello | world peace” is also possible. One possible extension of our work is then to find the semantically optimal segmentation of a sentence. In our network, unknown words are mapped to a random vector and zero padding is mapped to the zero vector.
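The greedy mapping can be sketched as a left-to-right longest-match segmentation. The code below is an illustrative assumption rather than the actual preprocessing: the function name, the `max_n` parameter and the toy vocabulary are hypothetical, and $n$-grams are represented as space-joined strings, which may differ from the key format of the real embedding file.

```python
def greedy_segment(tokens, vocabulary, max_n=3):
    """Left-to-right greedy segmentation: at each position, take the longest
    n-gram (up to `max_n` tokens) that is present in the vocabulary."""
    segments = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            # A single token is always accepted, even if unknown, so that
            # the loop makes progress (unknown words get a random vector).
            if n == 1 or candidate in vocabulary:
                segments.append(candidate)
                i += n
                break
    return segments


# Toy vocabulary in which both bigrams exist but the trigram does not.
vocabulary = {"hello", "world", "peace", "hello world", "world peace"}
segments = greedy_segment("hello world peace".split(), vocabulary)
print(segments)
```

Because the match is greedy, the earlier bigram wins and the later, equally valid bigram is never considered, which is exactly the limitation that an optimal segmentation would address.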
Filters of three different sizes, with multiple copies each, are used for the convolution layer, which fixes the dimension of the RankNet input vector. To prevent overfitting, a dropout layer (Srivastava et al., 2014) is used after the max-pooling layer during training.
RankNet, LambdaRank and ConvRankNet are each trained for a fixed number of epochs with their own learning rate. All tests were performed on an Ubuntu 16.04.4 server with a Xeon CPU and a Tesla P100 GPU. Cross-validation folds are run in parallel with the help of GNU Parallel (Tange, 2011).
Normalized Discounted Cumulative Gain at truncation level $k$ (NDCG@$k$) is used as the evaluation measure. NDCG@$k$ is a multi-level ranking quality measure widely used in previous work. It ranges from 0 to 1, with 1 for the perfect ranking. We refer to Manning et al. (2008) for a detailed presentation.
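For concreteness, one common way of computing NDCG@$k$ is sketched below. Several variants of the gain and discount exist; this sketch assumes the exponential gain `2**rel - 1` and a base-2 logarithmic discount, which may differ in detail from the evaluation script used for the reported results. The function names are chosen for this example.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k results.
    `relevances` lists the graded relevance of results in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades 0, 1, 2 standing for non, partially, definitely relevant.
perfect = ndcg_at_k([2, 1, 0], 3)
reversed_order = ndcg_at_k([0, 1, 2], 3)
print(perfect, reversed_order)
```

Queries without any relevant document are given a score of 0 here; other conventions for that corner case are possible.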
Cross-validation is performed on both datasets. Table 2 reports NDCG@$k$ at all truncation levels. ConvRankNet clearly outperforms RankNet and LambdaRank, especially at large truncation levels.
A two-tailed Wilcoxon signed-rank test (Dalgaard, 2008) is performed on these values. As we can see from Table 3, the improvement of ConvRankNet over RankNet and LambdaRank is statistically significant. Moreover, compared with the results reported in Liu (2011), ConvRankNet also systematically outperforms existing methods.
5 Discussion and Conclusions
In this paper, we proposed ConvRankNet, an end-to-end Learning to Rank model which directly takes raw queries and documents as input. ConvRankNet is shown to have linear test time and is thus applicable to real-time use cases. Our results indicate that it significantly outperforms existing methods on the OHSUMED dataset.
Future work could study replacing the underlying RankNet module with stronger neural-network-based models (such as LambdaRank).
The author would like to thank Dr. Michalis Vazirigiannis of École Polytechnique for his valuable suggestions.
The author would also like to thank Mr. Geoffrey Scoutheeten and Mr. Édouard d’Archimbaud of BNP Paribas for their valuable comments.
The work is supported by the Data & AI Lab of BNP Paribas.
- Manning et al.  Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.
- Liu  Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 2011. ISBN 978-3-642-14266-6.
- Gormley and Tong  Clinton Gormley and Zachary Tong. Elasticsearch: The Definitive Guide. O’Reilly Media, Inc., 1st edition, 2015. ISBN 1449358543, 9781449358549.
- Joachims  Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 133–142, New York, NY, USA, 2002. ACM. ISBN 1-58113-567-X. doi: 10.1145/775047.775067. URL http://doi.acm.org/10.1145/775047.775067.
- Burges et al.  Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 89–96, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102363. URL http://doi.acm.org/10.1145/1102351.1102363.
- Burges et al.  Christopher J. Burges, Robert Ragno, and Quoc V. Le. Learning to rank with nonsmooth cost functions. In P. B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 193–200. MIT Press, 2007. URL http://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf.
- Burges  Christopher J. C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010. URL http://research.microsoft.com/en-us/um/people/cburges/tech_reports/MSR-TR-2010-82.pdf.
- Szummer and Yilmaz  Martin Szummer and Emine Yilmaz. Semi-supervised learning to rank with preference regularization. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 269–278, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8. doi: 10.1145/2063576.2063620. URL http://doi.acm.org/10.1145/2063576.2063620.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257.
- Kim  Yoon Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014. URL http://arxiv.org/abs/1408.5882.
- Severyn and Moschitti  Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 373–382, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3621-5. doi: 10.1145/2766462.2767738. URL http://doi.acm.org/10.1145/2766462.2767738.
- Bromley et al.  Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 737–744. Morgan-Kaufmann, 1994. URL http://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf.
- Skiena  Steven S. Skiena. The Algorithm Design Manual. Springer Publishing Company, Incorporated, 2nd edition, 2008. ISBN 1848000693, 9781848000698.
- Sedgewick and Wayne  Robert Sedgewick and Kevin Wayne. Algorithms. Addison-Wesley Professional, 4th edition, 2011. ISBN 032157351X, 9780321573513.
- Speer et al.  Robert Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. CoRR, abs/1612.03975, 2016. URL http://arxiv.org/abs/1612.03975.
- Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2670313.
- Tange  O. Tange. GNU Parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. URL http://www.gnu.org/s/parallel.
- Dalgaard  Peter Dalgaard. Introductory Statistics with R. Statistics and Computing. Springer, New York, second edition, 2008. ISBN 978-0-387-79053-4. doi: 10.1007/978-0-387-79054-1. URL http://dx.doi.org/10.1007/978-0-387-79054-1.