Learning to Rank in TensorFlow
TensorFlow Ranking is the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. For example, it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive.
With the high potential of deep learning (DL) for real-world data-intensive applications, multiple open source packages have emerged in recent years and are under active development, including TensorFlow abadi2016tensorflow , Caffe jia2014caffe , and MXNet chen2015mxnet . A typical task supported by these packages, as in the ImageNet challenge russakovsky2015imagenet , is to predict image categories, which can be formulated as a multi-class classification problem. However, compared with the comprehensive support for classification or regression in open-source DL packages, there is a paucity of support for ranking problems.
A ranking problem is defined as deriving an ordering over a list of examples that maximizes the utility of the entire list. It is widely applicable in several domains, such as Information Retrieval (IR) and Natural Language Processing (NLP). Important practical applications include web search, recommender systems, machine translation, document summarization, and question answering li2011learning .
In general, a ranking problem is different from classification or regression tasks. While the goal of classification or regression is to predict a label or a value for each individual example as accurately as possible, the goal of ranking is to optimally sort the entire example list, such that the examples of highest relevance are presented first. In practice, relevance can be obtained from human judgment (explicit feedback) or from data such as click logs (implicit feedback). Because of this difference, widely used ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) jarvelin2002cumulated , Mean Reciprocal Rank (MRR) craswell2009mean , Mean Average Precision (MAP) and Average Relevance Position (ARP) zhu2004recall are different from those that are used in classification or regression. These metrics take the ranks of the examples into consideration and are generally designed to emphasize the examples that are ranked higher in the list.
Learning-to-rank (LTR) li2011learning is a set of supervised machine learning techniques that can be used to solve ranking problems. It attempts to learn, from labeled data, a scoring function that maps example feature vectors to real-valued scores. During inference, this scoring function is used to sort and rank examples. Pointwise, pairwise or listwise loss functions are used during training with the goal of optimizing the correctness of the relative order among a list of examples. Using pointwise loss functions can be seen as approximating a ranking problem by a classification or regression problem. In practice, pairwise or listwise loss functions, and advanced loss functions such as LambdaLoss wang2018lambdaloss , tend to outperform the pointwise approaches li2011learning , as they are more closely aligned with optimizing the ranking metrics over the entire list. It is worth noting that in traditional LTR settings the loss function and the ranking metrics are different, since most optimizers require differentiable loss functions whereas all ranking metrics mentioned above are non-continuous over the permutations of a ranked list.
In the past decade, several techniques unique to ranking problems were developed to directly optimize ranking metrics burges2010ranknet ; wang2018lambdaloss . In addition, recent work on unbiased learning-to-rank ai2018unbiased ; Wang+al:2018 ; Joachims:WSDM17 has been proposed to handle position bias in biased data like user clicks, to produce a consistent and unbiased ranker. These approaches usually involve estimating Inverse Propensity Weights to counter position bias. This works well with pairwise or listwise losses, but not with pointwise losses Wang+al:2018 . Thus, a learning-to-rank library that implements advanced listwise loss functions is important for realistic ranking settings, which often involve biased click data.
Existing open source packages for learning-to-rank such as RankLib croft2013lemur and LightGBM ke2017lightgbm have several important drawbacks. First, they were developed for small labeled data sets such as LETOR DBLP:journals/corr/QinL13 (thousands of examples), but not for massive click log datasets (hundreds of millions of examples) that are common in industrial applications. Second, since the existing learning-to-rank packages are not based on deep learning techniques, they cannot naturally handle sparse features like text, and require extensive feature engineering. In contrast, deep learning packages like TensorFlow can effectively handle raw sparse features through embeddings mikolov2013word2vec .
To address this gap, in this paper, we present our effort to build a scalable, comprehensive and configurable industry-grade learning-to-rank library in TensorFlow. Our main contributions are:
We propose a unified library for training large scale learning-to-rank models using deep learning in TensorFlow.
The library is flexible and highly configurable: it provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics.
The library provides support for unbiased learning-to-rank by incorporating inverse propensity weights (IPW) in losses and metrics.
We demonstrate the wide applicability of our framework by experiments on large-scale search and recommendation applications, where we show that listwise losses and ranking metric optimization significantly outperform standard pointwise losses.
Our current implementation of the TensorFlow Ranking library is by no means exhaustive. We envision that this library will provide a convenient open platform for hosting and advancing state-of-the-art ranking models based on deep learning techniques, and thus facilitate both academic research as well as industrial applications.
In this section, we provide a high-level overview of learning-to-rank techniques. We present our setup first and then describe scoring functions, ranking losses and ranking metrics.
Let $\mathcal{X}$ be the universe of examples and $\mathcal{X}^n$ the universe of example lists; similarly let $\mathcal{Y}$ be the universe of labels and $\mathcal{Y}^n$ the universe of label lists. Assume that $\mathcal{Y}$ is an ordered space (such as $\mathbb{R}$ or $\mathbb{N}$), and without loss of generality that larger label values are preferable. In the LTR setting, a training data point is $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$, where $\mathbf{x} = (x_1, \ldots, x_n)$ is a list of $n$ examples and $\mathbf{y} = (y_1, \ldots, y_n)$ is a list of labels. A label list whose labels are monotonically decreasing is called ideal.
Consider search as an example. In this setting, we have a list of documents for each query. Let $q$ be the query and $(d_1, \ldots, d_n)$ be the document list for this query. Each query-document pair becomes an example $x_i = (q, d_i)$. For each $x_i$, we have a corresponding relevance label $y_i$ that can be a binary or graded integer value jarvelin2000ir .
Given a list of examples $\mathbf{x}$ and a corresponding list of ground-truth relevance labels $\mathbf{y}$, $\pi^*$ is the permutation that makes $\mathbf{y}$ ideal. The goal of learning-to-rank is to learn a mapping function $h: \mathcal{X}^n \to S_n$, where $S_n$ is the space of all permutations of size $n$, such that $h(\mathbf{x})$ is close to $\pi^*$ for any given $\mathbf{x}$. Note that $\pi^*$, or alternatively $h(\mathbf{x})$, can be seen to act on a list of items to return a permuted list of items.
Directly finding a mapping function $h$ is difficult, as the space over all possible permutations is intractably large. In practice, a score-and-sort approach is used instead. Let $f: \mathcal{X}^n \to \mathbb{R}^n$ be a scoring function that maps a list of examples $\mathbf{x}$ to a list of scores $f(\mathbf{x})$. $f$ induces the permutation $\pi_f$, such that the scores $f(\mathbf{x})$ reordered by $\pi_f$ are monotonically decreasing.
In the pointwise case, the scoring function $f$ can be decomposed into a per-example scoring function $g$ as shown in Equation 1, where $g: \mathcal{X} \to \mathbb{R}$ maps a feature vector to a real-valued score.
$$f(\mathbf{x}) = \big(g(x_1), \ldots, g(x_n)\big) \qquad (1)$$
The scoring function $f$ or $g$ is typically parameterized; we use deep neural networks in this paper. The notion of scoring function can be extended to more sophisticated ones like multi-item scoring functions wang2018groupwise , where the scores of a set of examples are computed jointly. For the purpose of this paper, we focus on single-item scoring functions, though our library supports more advanced multi-item scoring functions as well.
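As an illustration, the score-and-sort approach with a pointwise scoring function can be sketched in plain Python. This is a minimal sketch and not the library's implementation: the helper names (`score_list`, `induced_ranking`) and the linear scorer standing in for a neural network are assumptions made for the example.

```python
from typing import Callable, List, Sequence

Example = Sequence[float]


def score_list(g: Callable[[Example], float],
               examples: Sequence[Example]) -> List[float]:
    """Pointwise scoring: apply the per-example scorer g to every example."""
    return [g(x) for x in examples]


def induced_ranking(scores: Sequence[float]) -> List[int]:
    """Indices of the examples ordered by decreasing score.

    Python's sort is stable, so tied examples keep their input order.
    """
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def linear_scorer(x: Example) -> float:
    """Hypothetical per-example scorer (a stand-in for a neural network)."""
    weights = (0.5, -1.0)
    return sum(w * v for w, v in zip(weights, x))


examples = [(1.0, 0.2), (3.0, 0.1), (2.0, 2.0)]
scores = score_list(linear_scorer, examples)
order = induced_ranking(scores)
print(order)  # the second example has the highest score and is ranked first
```

Sorting the scores is cheap compared with searching over permutations directly, which is why the learning problem reduces to fitting the scorer.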
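Learning-to-rank techniques aim to find an optimal scoring function $f$ by minimizing a loss objective. In the machine learning setting, the empirical risk for a scoring function $f$, given a loss function $\ell$, is defined as
$$\mathcal{L}(f) = \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \ell\big(\mathbf{y}, f(\mathbf{x})\big),$$
where $\mathcal{D}$ are labeled training data. We define the loss functions in learning-to-rank as follows, writing $\hat{\mathbf{y}} = f(\mathbf{x})$ for the list of predicted scores.
A pointwise loss is defined over each individual example. For example, the sigmoid cross entropy for binary labels $y_j \in \{0, 1\}$ is:
$$\ell(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{n} \Big( y_j \log(p_j) + (1 - y_j) \log(1 - p_j) \Big), \qquad p_j = \frac{1}{1 + \exp(-\hat{y}_j)}.$$
A pairwise loss considers the scores of a pair of examples. It includes hinge loss, logistic loss, etc. For example, the pairwise logistic loss is defined as:
$$\ell(\mathbf{y}, \hat{\mathbf{y}}) = \sum_{j=1}^{n} \sum_{k=1}^{n} \mathbb{I}(y_j > y_k) \log\big(1 + \exp(\hat{y}_k - \hat{y}_j)\big),$$
where $\mathbb{I}(\cdot)$ is the indicator function.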
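The two example losses and the empirical risk can be written out in a few lines of plain Python. The sketch below assumes the standard forms of the sigmoid cross entropy and the pairwise logistic loss; the function names are ours, and the code is not numerically stabilized for very large scores, unlike a production implementation.

```python
import math
from typing import Callable, Sequence, Tuple

Labels = Sequence[float]
Scores = Sequence[float]


def sigmoid_cross_entropy(labels: Labels, scores: Scores) -> float:
    """Pointwise loss: sum of per-example binary cross entropies."""
    total = 0.0
    for y, s in zip(labels, scores):
        p = 1.0 / (1.0 + math.exp(-s))  # predicted relevance probability
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def pairwise_logistic(labels: Labels, scores: Scores) -> float:
    """Pairwise loss: penalize every pair whose more relevant example
    (label y_j > y_k) does not clearly out-score the less relevant one."""
    total = 0.0
    for yj, sj in zip(labels, scores):
        for yk, sk in zip(labels, scores):
            if yj > yk:
                total += math.log1p(math.exp(sk - sj))
    return total


def empirical_risk(loss_fn: Callable[[Labels, Scores], float],
                   scored_data: Sequence[Tuple[Labels, Scores]]) -> float:
    """Mean loss over a collection of (labels, predicted scores) lists."""
    return sum(loss_fn(y, s) for y, s in scored_data) / len(scored_data)


good = pairwise_logistic([1, 0], [2.0, -2.0])  # relevant example scored higher
bad = pairwise_logistic([1, 0], [-2.0, 2.0])   # relevant example scored lower
print(good < bad)
```

Note that the pairwise loss depends only on score differences, so it is unchanged when a constant is added to all scores in a list, whereas the pointwise loss is not.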
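Several standard ranking metrics are used to evaluate ranked lists sorted by scoring functions. In ranking, it is preferable to have fewer errors at higher ranked positions than at the lower ranked positions, which is reflected in many metrics. Our library supports commonly used ranking metrics and we list the following as examples:
$$\mathrm{RR}(\pi_f, \mathbf{y}) = \frac{1}{\min \{ \pi_f(j) : y_j > 0 \}}, \qquad \mathrm{ARP}(\pi_f, \mathbf{y}) = \frac{\sum_{j=1}^{n} y_j \, \pi_f(j)}{\sum_{j=1}^{n} y_j},$$
$$\mathrm{DCG}(\pi_f, \mathbf{y}) = \sum_{j=1}^{n} \frac{2^{y_j} - 1}{\log_2\big(1 + \pi_f(j)\big)}, \qquad \mathrm{NDCG}(\pi_f, \mathbf{y}) = \frac{\mathrm{DCG}(\pi_f, \mathbf{y})}{\mathrm{DCG}(\pi^*, \mathbf{y})},$$
where $\pi_f(j)$ is the rank of $x_j$ in $\mathbf{x}$ ranked according to $f$. MRR is the mean, over lists, of the reciprocal rank (RR) of the first relevant example. ARP is the average of positions of examples weighted by their relevance values zhu2004recall ; wang2018lambdaloss . DCG is the Discounted Cumulative Gain jarvelin2002cumulated , and NDCG is its normalized version. The loss functions are available in the library via the factory method tfr.losses.make_loss_fn.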
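To make these metrics concrete, the following plain-Python sketch computes them for a single list. It is not the library's implementation. It assumes the common DCG form with gain 2^y - 1 and a log2 position discount, and it returns 0 for lists with no relevant example, which is a convention chosen here. All function names are ours.

```python
import math
from typing import List, Sequence


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """1-based rank of each example when the list is sorted by decreasing score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def reciprocal_rank(labels: Sequence[float], scores: Sequence[float]) -> float:
    """Reciprocal rank of the first relevant example; MRR averages this over lists."""
    relevant = [r for r, y in zip(ranks_from_scores(scores), labels) if y > 0]
    return 1.0 / min(relevant) if relevant else 0.0


def average_relevance_position(labels: Sequence[float],
                               scores: Sequence[float]) -> float:
    """Average position of the examples, weighted by their relevance."""
    total = sum(labels)
    if total <= 0:
        return 0.0
    ranks = ranks_from_scores(scores)
    return sum(y * r for y, r in zip(labels, ranks)) / total


def dcg(labels: Sequence[float], scores: Sequence[float]) -> float:
    """Discounted cumulative gain (assumed gain 2^y - 1, log2 discount)."""
    ranks = ranks_from_scores(scores)
    return sum((2 ** y - 1) / math.log2(1 + r) for y, r in zip(labels, ranks))


def ndcg(labels: Sequence[float], scores: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal ordering (sorting by the labels)."""
    ideal = dcg(labels, labels)
    return dcg(labels, scores) / ideal if ideal > 0 else 0.0


labels = [0, 1, 0]
scores = [0.1, 0.3, 0.9]  # the only relevant example ends up at rank 2
print(reciprocal_rank(labels, scores), average_relevance_position(labels, scores))
```

All four functions depend on the scores only through the induced ranks, which is why they are piecewise constant in the scores and cannot be optimized directly by gradient descent.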
Unbiased learning-to-rank Joachims:WSDM17 ; Wang+al:2018 deals with bias in relevance scores that arises from position bias, where users are more likely to click on examples presented higher up the list. One approach to handle this bias is to compute Inverse Propensity Weights (IPW) and use these weights to produce a better ranking estimator. These weights can be estimated per-query or per-example Wang+al:2016 ; Wang+al:2018 and the losses and metrics are reweighted using the inverse propensity weights to counter this bias.
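A small sketch may help to show how per-example inverse propensity weights enter a loss. The example below reweights a pairwise logistic loss by the weight of the clicked (more relevant) example in each pair. This weighting convention, the propensity values and the function names are all assumptions made for illustration; they are not taken from the library or from the cited estimators.

```python
import math
from typing import Dict, List, Sequence


def inverse_propensity_weights(positions: Sequence[int],
                               propensity_by_position: Dict[int, float]) -> List[float]:
    """Weight each example by 1 / P(the user examined its display position)."""
    return [1.0 / propensity_by_position[p] for p in positions]


def weighted_pairwise_logistic(labels: Sequence[float],
                               scores: Sequence[float],
                               weights: Sequence[float]) -> float:
    """Pairwise logistic loss where each pair (j, k) with y_j > y_k is scaled
    by the weight of the more relevant example j (an assumed convention)."""
    total = 0.0
    n = len(labels)
    for j in range(n):
        for k in range(n):
            if labels[j] > labels[k]:
                total += weights[j] * math.log1p(math.exp(scores[k] - scores[j]))
    return total


# Hypothetical examination propensities for display positions 1 to 3.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}
positions = [1, 2, 3]
clicks = [0, 0, 1]          # the user clicked the result shown at position 3
scores = [0.0, 0.0, 0.0]

weights = inverse_propensity_weights(positions, propensity)
loss = weighted_pairwise_logistic(clicks, scores, weights)
print(weights, loss)
```

A click at a rarely examined position is up-weighted, so that in expectation the clicked examples count as if every position had been examined equally often.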
Our library is based on TensorFlow. Similar to design patterns in TensorFlow, it provides functions and closures for users to construct models and also allows users to build custom functions if needed. More specifically, we use the TensorFlow Estimator framework cheng2017tensorflow to build ranking models. This framework supports both local and distributed training. The core component of this framework is a model_fn function that takes features and labels as input and returns loss, prediction, metrics, and training ops, depending on the mode (TRAIN, EVAL, PREDICT). In the following, we first describe our input format and then show how we build a ranking model for learning-to-rank.
A ranking problem usually has a context (e.g., a query) and a list of examples. For the sake of simplicity, we assume that all the examples have the same set of features. These features can be sparse or dense. For sparse features, we use embeddings to densify them. After transformation, for a mini-batch, we have each context feature as a 2-D tensor with shape [batch_size, feature_size], where feature_size is the length of the vector for a feature, which varies from feature to feature. For each example feature, we have a 3-D tensor with shape [batch_size, list_size, feature_size]. Both batch_size and list_size are constant across features.
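The shapes described above can be illustrated with nested Python lists standing in for tensors. The feature names (`query_embedding`, `doc_embedding`, `doc_freshness`) and the sizes are hypothetical and chosen only for this sketch.

```python
from typing import Tuple


def shape(tensor) -> Tuple[int, ...]:
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


batch_size, list_size = 2, 3

# Context feature with feature_size = 4: shape [batch_size, feature_size].
query_embedding = [[0.0] * 4 for _ in range(batch_size)]

# Example features: shape [batch_size, list_size, feature_size].
# feature_size may differ per feature; batch_size and list_size may not.
doc_embedding = [[[0.0] * 5 for _ in range(list_size)] for _ in range(batch_size)]
doc_freshness = [[[0.0] for _ in range(list_size)] for _ in range(batch_size)]

for name, feature in [("query_embedding", query_embedding),
                      ("doc_embedding", doc_embedding),
                      ("doc_freshness", doc_freshness)]:
    print(name, shape(feature))
```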
Our library allows users to specify features (e.g., dense or sparse) using a TensorFlow library such as tf.feature_column. We allow closures for customized feature transformations and provide utility functions for these transformations.
The overall flow of building a model_fn is shown in Figure 1. There are two important components:
Scoring Function. We focus on single-item scoring functions and multi-item scoring functions wang2018groupwise in this paper. A single-item scoring function takes all context features and all features for a single example as input and outputs a score, as defined in Equation 1. A multi-item scoring function extends this to a group of examples. Conceptually, we slice the tensor for each example list into a number of tensors with shape [batch_size, group_size, feature_size], where group_size = 1 for pointwise scoring functions. These group features are combined with context features, and passed to the scoring function to generate scores. After the scoring phase, a voting layer is applied to obtain a tensor with shape [batch_size, list_size] for scores, as shown in Figure 1 for a mini-batch. The scoring function is a user-specified closure which is passed to the ranking model_fn builder.
Ranking Head. The ranking head structure computes ranking metrics and ranking losses, given scores, labels and optionally example weights. The ranking head abstraction allows for the user to specify different losses and metrics, for the same scoring logic, or vice-versa. In our library, both score and labels tensors have the same shape [batch_size, list_size], representing batch_size number of example lists. Ranking head also incorporates example weights, described in Section 2.5, which can be per-example with shape [batch_size, list_size] or for the entire list with shape [batch_size]. Ranking head is available in the library via the factory method tfr.head.create_ranking_head.
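The slicing and voting steps of the scoring component can be sketched as follows. This is an illustration under stated assumptions and not the library's implementation: groups are enumerated as all ordered selections of group_size examples, and the voting layer averages the scores each example receives. The library may form groups and aggregate votes differently, and all names below are ours.

```python
from itertools import permutations
from typing import Callable, List, Sequence

Features = Sequence[float]
GroupScorer = Callable[[Features, List[Features]], List[float]]


def groupwise_scores(group_score_fn: GroupScorer, context: Features,
                     example_list: Sequence[Features], group_size: int) -> List[float]:
    """Score one example list; returns list_size scores."""
    n = len(example_list)
    if not 1 <= group_size <= n:
        raise ValueError("group_size must be between 1 and the list size")
    totals = [0.0] * n
    counts = [0] * n
    for group in permutations(range(n), group_size):
        group_scores = group_score_fn(context, [example_list[j] for j in group])
        for j, s in zip(group, group_scores):
            totals[j] += s   # accumulate the votes for example j
            counts[j] += 1
    return [t / c for t, c in zip(totals, counts)]  # voting: average the votes


def score_batch(group_score_fn: GroupScorer, contexts: Sequence[Features],
                example_lists: Sequence[Sequence[Features]],
                group_size: int = 1) -> List[List[float]]:
    """Apply the scorer to a mini-batch; the result has shape [batch_size, list_size]."""
    return [groupwise_scores(group_score_fn, c, ex, group_size)
            for c, ex in zip(contexts, example_lists)]


def toy_group_scorer(context: Features, group: List[Features]) -> List[float]:
    """Hypothetical scorer: context sum plus each example's own feature sum."""
    return [sum(context) + sum(x) for x in group]


contexts = [[1.0], [0.0]]
example_lists = [[[1.0, 2.0], [0.0, 0.5], [3.0, 0.0]],
                 [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]]
scores = score_batch(toy_group_scorer, contexts, example_lists, group_size=1)
print(scores)
```

With group_size = 1 the voting step is trivial and the function reduces to single-item scoring. Whatever the group size, the ranking head always receives scores of shape [batch_size, list_size].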
As shown in Figure 1, our library provides factory methods to create metrics and losses. Furthermore, our APIs allow users to specify loss functions when creating a ranking head. This enables the user to switch between different loss functions or combine multiple loss functions easily. More importantly, we provide a builder function that takes a scoring function and a ranking head and returns a model_fn to construct a tf.estimator.Estimator. When mode = PREDICT in model_fn, the learned scoring function is exported for serving.
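The builder pattern described here, where a scoring function and a ranking head are combined into a model_fn that behaves differently per mode, can be mimicked in plain Python. The sketch below does not use TensorFlow or the TensorFlow Ranking API. The names `make_ranking_head` and `make_model_fn` are hypothetical, training ops are omitted, and context features are left out for brevity.

```python
import math

TRAIN, EVAL, PREDICT = "train", "eval", "predict"


def pairwise_logistic(labels, scores):
    """Pairwise logistic loss for one list."""
    return sum(math.log1p(math.exp(sk - sj))
               for yj, sj in zip(labels, scores)
               for yk, sk in zip(labels, scores) if yj > yk)


def reciprocal_rank(labels, scores):
    """Reciprocal rank of the first relevant example in one list."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    for position, j in enumerate(order, start=1):
        if labels[j] > 0:
            return 1.0 / position
    return 0.0


def make_ranking_head(loss_fn, metric_fns):
    """Return a head closure that turns scores and labels into loss and metrics."""
    def head(scores, labels, mode):
        output = {"predictions": scores}
        if mode == PREDICT:              # no labels available at serving time
            return output
        pairs = list(zip(labels, scores))
        output["loss"] = sum(loss_fn(y, s) for y, s in pairs) / len(pairs)
        if mode == EVAL:
            output["metrics"] = {
                name: sum(fn(y, s) for y, s in pairs) / len(pairs)
                for name, fn in metric_fns.items()
            }
        return output
    return head


def make_model_fn(score_fn, head):
    """Combine a single-item scoring closure and a head into a model_fn."""
    def model_fn(features, labels, mode):
        scores = [[score_fn(x) for x in example_list] for example_list in features]
        return head(scores, labels, mode)
    return model_fn


head = make_ranking_head(pairwise_logistic, {"mrr": reciprocal_rank})
model_fn = make_model_fn(lambda x: x[0], head)

features = [[(1.0,), (2.0,)], [(0.5,), (0.1,)]]   # [batch_size, list_size, feature_size]
labels = [[0, 1], [1, 0]]                          # [batch_size, list_size]
eval_out = model_fn(features, labels, EVAL)
predict_out = model_fn(features, None, PREDICT)
print(sorted(eval_out), sorted(predict_out))
```

Because the loss and the metrics live in the head, swapping `pairwise_logistic` for another loss requires no change to the scoring closure, which is the separation of concerns this design aims for.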
In this section, we demonstrate the effectiveness of the TensorFlow Ranking library for two real-world ranking scenarios: Gmail search Wang+al:2016 ; Zamani+al:2017 and document recommendation in Google Drive tata2017quick . We focus on these scenarios, since in both cases the model is trained based on large quantities of click data, which is beyond the capabilities of the existing open source packages for ranking such as RankLib croft2013lemur .
Evaluation. The models are evaluated using the metrics defined in Section 2.4. Due to the proprietary nature of the models, we only report relative improvements with respect to a pointwise sigmoid cross entropy loss.
We evaluate several ranking models trained on search logs from Gmail. The privacy of user information is preserved by removal of personal information and anonymization of data using k-anonymization. When a user types a query, five results are shown and user clicks are used as relevance labels for ranking. The set of features consists of dense and sparse features. Some of the sparse features considered are word and character level n-grams derived from queries and email titles. The vocabulary of n-grams is pruned to retain only n-grams that occur across more than a threshold number of users. This is done to preserve user privacy, as well as to promote a common vocabulary for learning shared representations across users. In total we collect around 250M queries and use 10% of them for evaluation. The losses and metrics are weighted by Inverse Propensity Weighting Wang+al:2016 computed to counter position bias.
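Table 1: Gmail search. Improvements relative to the pointwise sigmoid cross entropy baseline.
| Sigmoid Cross Entropy | – | – | – |
| Pairwise Logistic Loss | +1.52 | +1.64 | +1.00 |
| Listwise Softmax Loss | +1.80 | +1.88 | +1.57 |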
Quick Access in Google Drive tata2017quick is a zero-state recommendation engine that surfaces documents currently relevant to the user when she visits the Drive home screen. We evaluate several ranking models trained on user click data over these recommended results. The set of features consists of mostly dense features, as described in Tata et al. tata2017quick . In total we collect around 30M instances and use 10% of them for evaluation.
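Table 2: Quick Access in Google Drive. Improvements relative to the pointwise sigmoid cross entropy baseline.
| Sigmoid Cross Entropy | – | – | – |
| Pairwise Logistic Loss | +0.70 | +1.86 | +0.35 |
| Listwise Softmax Loss | +1.08 | +1.88 | +1.05 |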
The results for Gmail and Google Drive are summarized in Table 1 and Table 2 respectively. In both tables, we observe that the listwise loss performs better than the pairwise loss, which in turn performs better than the pointwise loss. This indicates the importance of listwise losses for ranking problems, as they capture the relevance of the entire example list better than their pointwise and pairwise counterparts.
In this paper we introduced TensorFlow Ranking – a learning-to-rank library that allows users to define flexible ranking models in TensorFlow. The library is highly configurable, and has easy-to-use APIs for scoring mechanisms, loss functions and evaluation metrics. Unlike the existing learning-to-rank open source packages, which are designed for small datasets, TensorFlow Ranking can be used to solve real-world large-scale ranking problems with hundreds of millions of training examples. We empirically demonstrate its performance on Gmail search and Google Drive document recommendation. We also demonstrate the improvements from using ranking estimators in these real-world applications, highlighting the need for such a library. TensorFlow Ranking is available to the open source community with the hope that it facilitates further academic research and industrial applications.