Learning to rank (LTR) has become an essential component for many real-world applications such as search (8186875). In ranking problems, the focus is on predicting the relative order of a list of documents or items given a query. It is thus different from classification problems, whose goal is to predict the class label of a single item. While many neural machine learning techniques have been proposed recently, they are mostly for classification problems. Given the difference between ranking and classification problems, it is interesting to study how neural techniques can be used for ranking problems.
Knowledge distillation (KD) (hinton2015distilling; gou2020knowledge) is one such recently popular technique. The initial goal of KD is to pursue a good trade-off between performance and efficiency. Given a high-capacity teacher model with the desired high performance, a more compact student model is trained using teacher labels (heo2019knowledge; sun2019patient; sanh2019distilbert). Student models trained with KD usually work better than those trained from the original labels without teacher guidance. However, effectively applying KD to ranking problems is not straightforward and has not been well studied, for the following reasons:
First, teacher models in classification typically predict a probability distribution over all classes. Such “dark knowledge” is believed to be a key reason why KD works (hinton2015distilling). However, a teacher model in ranking does not convey such a distribution over pre-defined classes, as ranking models only care about the relative order of items and simply output a single score for each item in a possibly unbounded candidate list. Ranking scores are typically neither calibrated (compared to the probabilistic interpretation of classification scores (guo2017calibration)) nor normalized (compared to classification scores, which sum to one over all classes for most popular losses (hinton2015distilling)). Thus, directly using the outputs of a teacher model as labels for ranking tasks with existing KD methods may be suboptimal.
Second, listwise information over all input items should be considered to achieve the best ranking performance, since the goal is to infer the relative orders among them. For example, listwise losses have been shown to be more effective than other alternatives for LTR problems (cao2007learning). On the other hand, classification tasks almost universally treat each item independently based on the i.i.d. assumption. It is interesting to also consider listwise frameworks when studying KD for ranking problems, which has not been explored in the literature, to the best of our knowledge.
Third, though there is a consensus in the classification KD literature that, given no severe over-fitting, larger teacher models usually work better (he2016deep), recent works show that for traditional LTR problems, it is hard for larger models to achieve better performance (revisit; dasalc) due to the difficulty of applying standard techniques such as data augmentation (compared to image rotation in computer vision (Xie_2020_CVPR)) and the lack of very large human-labeled ranking datasets. This makes it harder to use the standard KD setting to improve the performance of ranking models by distilling from a very large teacher model.
Thus, KD techniques need special adjustments for LTR problems. In this paper, inspired by the Born Again Networks (BAN) (ban) in which student models are configured with the same capacity as teachers and are shown to outperform teachers for classification problems, we study the BAN techniques for ranking problems. To this end, we propose Born Again neural Rankers (BAR) that train the student models using listwise distillation and properly transformed teacher scores, which can achieve new state-of-the-art ranking performance for neural rankers. While existing ranking distillation works such as (rd; pmlr-v130-reddi21a) require a more powerful teacher model and focus on performance-efficiency trade-offs, the primary goal of our paper is to improve ranking performance over state-of-the-art teacher rankers.
In summary, our contributions are as follows:
We propose Born Again neural Rankers (BAR) for learning to rank. This is the first knowledge distillation work that targets better ranking performance without increasing the model capacity.
We show that the key success factors of BAR are (1) an appropriate teacher score transformation function, and (2) a ranking specific listwise distillation loss. Both design choices are tailored for LTR problems and are rarely studied in the knowledge distillation literature.
We provide new theoretical explanations on why BAR works better than other alternatives. This contributes to both the general knowledge distillation and ranking distillation research.
We verify our hypothesis on rigorous public LTR benchmarks and show that BAR is able to significantly improve upon state-of-the-art neural teacher rankers.
2 Related Work
Knowledge distillation has been a popular research topic recently in several areas such as image recognition (hinton2015distilling; romero2014fitnets; park2019relational), natural language understanding (sanh2019distilbert; jiao2019tinybert; aguilar2020knowledge), and neural machine translation (kim2016sequence; chen2017teacher; tan2019multilingual), as a way to generate compact models that achieve good performance-efficiency trade-offs. As we mentioned in the Introduction, both the classical setting (e.g., using pointwise losses on i.i.d. data) and the theoretical analysis (e.g., “dark knowledge” among classes) for classification tasks may not be optimal for the ranking setting.
The major goal of this work is to push the limits of neural rankers on rigorous benchmarks. Although RankNet (ranknet) was introduced over a decade ago, only recently were neural rankers shown to be competitive with well-tuned Gradient Boosted Decision Trees (GBDT) on traditional LTR datasets (dasalc). We build upon (dasalc) and show that BAR can further push the state of the art of neural rankers on rigorous benchmarks. We also provide extra experiments to show that listwise distillation helps neural ranking in other settings.
We are motivated by Born Again Networks (BAN) that were introduced by (ban). BAN is the first work that shows better performance can be achieved by parameterizing the student model the same as the teacher. However, BAN only focuses on classification and direct application of BAN does not help LTR problems. Building upon the general “born again” ideas (Zhang_2019_ICCV; clark2019bam), our contribution in this paper is in developing specific new techniques and theory that make these ideas applicable for the important LTR setting.
Another closely related work is Ranking Distillation (RD) (rd; ensemble_bert), since it studies knowledge distillation for ranking. There are several marked differences between our work and RD. First, RD focuses on performance-efficiency trade-offs, where the student usually underperforms the teacher in terms of ranking metrics (pmlr-v130-reddi21a), while we focus on outperforming the teacher. Second, RD only uses a pointwise logistic loss for distillation. The authors state that “We also tried to use a pair-wise distillation loss when learning from teacher’s top-K ranking. However, the results were disappointing”, without going into more detail or exploring listwise approaches, which we show are the key success factor. Also, Tang and Wang (rd) use a hyperparameter K: items ranked in the top K are labeled as positive and others as negative for distillation. Besides only working for binary datasets, this setting is not very practical, since real-world ranking lists may have very different list sizes or numbers of relevant items. Our method does not require such a parameter. Furthermore, they only evaluated their methods on some recommender system datasets where only user id, item id, and binary labels are available. This is not a typical LTR setting, so the effectiveness of RD over state-of-the-art neural rankers is unclear.
3 Background on Learning to Rank
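For LTR problems, the training data can be represented as a set $\Psi = \{(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^n \times \mathbb{R}^n\}$, where $\mathbf{x}$ is a list of $n$ items $x_i \in \mathcal{X}$ and $\mathbf{y}$ is a list of $n$ relevance labels $y_i \in \mathbb{R}$ for $1 \le i \le n$. We use $\mathcal{X}$ as the universe of all items. In traditional LTR problems, each $x_i$ corresponds to a query-item pair and is represented as a feature vector in $\mathbb{R}^d$, where $d$ is the number of feature dimensions. With slight abuse of notation, we also use $x_i$ as the feature vector and say $x_i \in \mathbb{R}^d$. The objective is to learn a function that produces an ordering of items in $\mathbf{x}$ so that the utility of the ordered list is maximized, that is, the items are ordered by decreasing relevance.
Most LTR algorithms formulate the problem as learning a ranking function to score and sort the items in a list. As such, the goal of LTR boils down to finding a parameterized ranking function $s(\cdot; \Theta): \mathcal{X}^n \to \mathbb{R}^n$, where $\Theta$ denotes the set of trainable parameters, to minimize the empirical loss:
$$\mathcal{L}(s) = \frac{1}{|\Psi|} \sum_{(\mathbf{x}, \mathbf{y}) \in \Psi} l(\mathbf{y}, s(\mathbf{x})), \tag{1}$$
where $l(\cdot)$ is the loss function on a single list.
There are many existing ranking metrics such as NDCG and MAP used in LTR problems. A common property of these metrics is that they are rank-dependent and place more emphasis on the top ranked items. For example, the widely used NDCG metric is defined as
$$\mathrm{NDCG}(\pi_s, \mathbf{y}) = \frac{\mathrm{DCG}(\pi_s, \mathbf{y})}{\mathrm{DCG}(\pi^*, \mathbf{y})}, \tag{2}$$
where $\pi_s$ is a ranked list induced by the ranking function $s$ on $\mathbf{x}$, $\pi^*$ is the ideal list (where $\mathbf{x}$ is sorted by decreasing $\mathbf{y}$), and DCG is defined as:
$$\mathrm{DCG}(\pi, \mathbf{y}) = \sum_{i=1}^{n} \frac{2^{y_i} - 1}{\log_2(1 + \pi(i))}, \tag{3}$$
with $\pi(i)$ denoting the rank of item $i$ in $\pi$.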
In practice, the truncated version that only considers the top-k ranked items, denoted as NDCG@k, is often used.
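As a minimal sketch of the metric, assuming the common exponential gain $2^{y} - 1$ and $\log_2$ rank discount, NDCG and its truncated version NDCG@k can be computed as follows. The function names `dcg` and `ndcg` and the toy scores and labels are illustrative and do not come from the paper's code.

```python
import math


def dcg(labels_in_rank_order, k=None):
    """DCG with gain 2^y - 1 and discount log2(1 + rank); ranks start at 1."""
    top = labels_in_rank_order if k is None else labels_in_rank_order[:k]
    return sum((2.0 ** y - 1.0) / math.log2(1.0 + rank)
               for rank, y in enumerate(top, start=1))


def ndcg(scores, labels, k=None):
    """NDCG of the ranking induced by sorting items by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg(ideal, k)
    # A list without any relevant item has no meaningful NDCG; return 0 here.
    return dcg(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


labels = [2, 0, 1]
perfect = ndcg([3.0, -1.0, 0.5], labels)  # scores agree with the labels
swapped = ndcg([0.5, -1.0, 3.0], labels)  # the two relevant items are swapped
```

Only the order induced by the scores matters here, so any monotone rescaling of the scores leaves the metric unchanged.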
4 Born Again Neural Rankers
4.1 The general formulation of BAR
In addition to the original training data $\Psi$, we assume there is a well-trained teacher model $t(\cdot)$. The goal of BAR is to train a student model $s(\cdot)$, where $s$ is parameterized the same as $t$, with the following loss:
$$\mathcal{L}(s) = \frac{1}{|\Psi|} \sum_{(\mathbf{x}, \mathbf{y}) \in \Psi} \Big( \alpha \, l(\mathbf{y}, s(\mathbf{x})) + (1 - \alpha) \, l^{ba}\big(f(t(\mathbf{x})), s(\mathbf{x})\big) \Big), \tag{4}$$
where $\alpha \in [0, 1]$ is a weighting factor between the two losses, $l^{ba}$ is the additional Born Again loss, and $f$ is a transformation function for teacher scores. The practical necessity of this transformation function is further analyzed in Section 4.3. The value of $\alpha$ can be tuned, and we found that a wide range of $\alpha$ values works well in our experiments.
An illustration of the BAR framework is shown in Fig. 1. The BAR framework uses a multi-objective setting for the student score $s(\mathbf{x})$. In addition to the original loss function, BAR adds a distillation loss between the student scores and the (transformed) teacher scores.
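The two-objective structure can be sketched for a single list as below. This is only an illustration under stated assumptions: the weight `alpha` is assumed to multiply the original loss and `1 - alpha` the distillation loss, and the function name `bar_list_loss`, the stand-in `squared_error` loss, and the toy values are hypothetical rather than taken from the paper.

```python
def bar_list_loss(labels, student_scores, teacher_scores,
                  original_loss, ba_loss, transform, alpha):
    """Combine the original loss and the distillation loss for one list.

    Assumed form: alpha * original + (1 - alpha) * distillation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    # The teacher scores are transformed before being used as labels.
    transformed = [transform(t) for t in teacher_scores]
    original = original_loss(labels, student_scores)
    distill = ba_loss(transformed, student_scores)
    return alpha * original + (1.0 - alpha) * distill


def squared_error(targets, scores):
    """Stand-in loss used only to exercise the function."""
    return sum((t - s) ** 2 for t, s in zip(targets, scores))


def relu(t):
    return max(t, 0.0)


labels = [1.0, 0.0]
student = [0.5, 0.5]
teacher = [2.0, -1.0]
loss = bar_list_loss(labels, student, teacher,
                     squared_error, squared_error, relu, alpha=0.5)
```

Passing the two losses as callables keeps the sketch agnostic to whether each objective is pointwise, pairwise, or listwise.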
4.2 Loss functions
Both the original loss function and the born again loss can be any ranking loss, ranging from pointwise to pairwise and listwise losses (8186875). Generally, listwise and pairwise losses are more effective than pointwise ones (cao2007learning). For example, the teacher models we leverage from (dasalc) use the listwise Softmax cross entropy loss. Specifically, given the labels $\mathbf{y}$ and scores $\mathbf{s}$, the Softmax cross entropy loss is defined over a list of items:
$$l(\mathbf{y}, \mathbf{s}) = -\sum_{i=1}^{n} y_i \log \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}, \tag{5}$$
where $n$ is the number of items in the ranking list.
In experiments we show that the original loss function can be more flexible (e.g., it can be a pointwise or listwise loss), but the born again loss is only effective when it is a listwise one.
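A minimal sketch of the listwise Softmax cross entropy loss for one list, assuming the form $-\sum_i y_i \log \mathrm{softmax}(\mathbf{s})_i$ with unnormalized graded labels. The function name and the example lists are illustrative.

```python
import math


def softmax_cross_entropy(labels, scores):
    """Listwise softmax cross entropy: -sum_i y_i * log softmax(s)_i."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(y * (s - log_z) for y, s in zip(labels, scores))


labels = [2.0, 0.0, 1.0]
good = softmax_cross_entropy(labels, [3.0, 0.0, 1.0])  # scores follow the labels
bad = softmax_cross_entropy(labels, [0.0, 3.0, 1.0])   # irrelevant item on top
```

The loss couples all items of a list through the shared normalizer, which is what distinguishes it from a pointwise loss.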
4.3 Teacher ranker score transformation
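We formalize the necessity of properly transforming the scores from the teacher ranker. First, many popular ranking losses are translation invariant. For example, in the Softmax cross entropy loss, scores appear only in the paired form $s_j - s_i$:
$$l(\mathbf{y}, \mathbf{s}) = -\sum_{i=1}^{n} y_i \log \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}} = \sum_{i=1}^{n} y_i \log \sum_{j=1}^{n} e^{s_j - s_i}. \tag{6}$$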
It is easy to see that the loss stays invariant after adding a constant to all scores. This is also the case for other popular ranking losses such as ApproxNDCG loss (revisit) and the pairwise RankNet loss (burges2010ranknet).
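The same property can be checked numerically for a pairwise loss. The sketch below assumes the usual pairwise logistic form of the RankNet loss, summed over pairs whose labels differ. The function name and the toy values are illustrative.

```python
import math


def ranknet_loss(labels, scores):
    """Pairwise logistic loss summed over pairs with labels[i] > labels[j]."""
    total = 0.0
    for i, yi in enumerate(labels):
        for j, yj in enumerate(labels):
            if yi > yj:
                # Only the score difference enters the loss.
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
    return total


labels = [2.0, 0.0, 1.0]
scores = [1.2, -0.3, 0.4]
shifted = [s + 100.0 for s in scores]  # add a constant to every score
scaled = [s * 10.0 for s in scores]    # change the score scale

base = ranknet_loss(labels, scores)
```

Adding a constant leaves the loss unchanged, while rescaling changes it even though the induced ranking is identical.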
Second, since ranking scores are not normalized (compared to summing up to one for classification over all classes), the teacher ranker’s insensitivity to score scales may lead to numerical issues for distillation. For example, two documents can be ranked correctly with very close scores (e.g., 1.001 and 1.0) that the student may not differentiate effectively, even if we subtract a constant from them. Thus, the scale of the teacher scores also needs attention.
Third, unlike classification problems where predictions are non-negative probabilities, teacher ranker scores can be negative, which makes many ranking losses ill-behaved, such as any cross entropy based loss.
Putting these together, we propose a simple teacher score transformation function that parameterizes $f$ as an affine transformation of teacher scores:
$$f(t) = \max(a \cdot t + b, 0), \tag{7}$$
where $\max(\cdot, 0)$ is a safeguard to make sure the transformed teacher scores are non-negative. The slope $a$ and the intercept $b$ of the affine transformation are treated as hyperparameters. Note that we aim for a general formulation highlighting potential caveats of teacher ranking scores due to the difference from classification problems. The actual tuning depends on teacher ranker behavior and may be simple in practice. For example, when $a = 1$ and $b = 0$, $f$ becomes the standard ReLU function, which turns out to be effective for two out of three datasets used in our experiments.
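A minimal sketch of such a transformation, assuming it takes the form of an affine map followed by clipping at zero. The function name, the parameter names `a` and `b`, the example teacher scores, and the slope 0.01 are hypothetical choices for illustration only.

```python
def transform_teacher_scores(teacher_scores, a=1.0, b=0.0):
    """Affine transform a * t + b, clipped at zero so labels stay non-negative."""
    return [max(a * t + b, 0.0) for t in teacher_scores]


teacher = [-2.0, 0.5, 300.0]
relu_like = transform_teacher_scores(teacher)              # a=1, b=0: plain ReLU
damped = transform_teacher_scores(teacher, a=0.01, b=0.0)  # hypothetical slope
```

With the default parameters the transformation reduces to ReLU, and a slope below one shrinks a wide score range.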
We discuss possible alternatives that we will compare with in experiments. These alternatives help highlight that BAR refers to the specific setting in which a ranking loss with the original labels, listwise distillation on teacher scores, and a tunable affine teacher score transformation are used. The comparison helps to better understand what makes BAR effective in practice.
Pointwise distillation. Instead of the listwise loss, we can use a pointwise loss for the distillation objective in Eq. 4:
$$l^{ba}\big(f(t(\mathbf{x})), s(\mathbf{x})\big) = \sum_{i=1}^{n} l^{pt}\big(f(t(\mathbf{x}))_i, s(\mathbf{x})_i\big). \tag{8}$$
Note that a pointwise loss is decomposed across items independently, following the i.i.d. assumption for regression. Throughout the paper, we use the mean squared error loss as the pointwise loss:
$$l^{pt}(y, s) = (y - s)^2, \tag{9}$$
since the original labels are graded with real values and, as we mentioned in Section 4.3, the range of teacher scores can be unbounded, even for binary labeled datasets.
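The following sketch illustrates how a pointwise squared-error distillation loss treats each item independently. It assumes the per-list loss is the sum of per-item squared errors; the function name and toy values are illustrative.

```python
def pointwise_squared_error_distillation(transformed_teacher, student_scores):
    """Sum of per-item squared errors; each item contributes independently."""
    return sum((t - s) ** 2
               for t, s in zip(transformed_teacher, student_scores))


teacher = [3.0, 1.0, 0.0]
student = [13.0, 11.0, 10.0]  # same ordering as the teacher, shifted by 10
loss = pointwise_squared_error_distillation(teacher, student)
```

The student ranks the items exactly as the teacher does, yet the pointwise loss is large because it penalizes the constant shift, unlike the translation-invariant losses discussed in Section 4.3.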
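Teacher score only distillation. Another way to distill teacher scores is to ignore the original loss function on the original labels ($\alpha = 0$ in Equation 4):
$$\mathcal{L}(s) = \frac{1}{|\Psi|} \sum_{(\mathbf{x}, \mathbf{y}) \in \Psi} l^{ba}\big(f(t(\mathbf{x})), s(\mathbf{x})\big). \tag{10}$$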
The original BAN paper (ban) shows that this single objective formulation works better than the two objective formulation in certain classification scenarios.
Softmax teacher score transformation. It is tempting to apply the Softmax transformation on teacher scores in each list, since it converts all labels to be non-negative while preserving their relative order. We study the Softmax teacher score transformation with temperature $T$ in Section 5.6 and empirically show that it under-performs affine transformations.
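A sketch of this alternative, assuming the standard temperature-scaled softmax applied within each list. The function name and the temperature values are illustrative.

```python
import math


def softmax_transform(teacher_scores, temperature=1.0):
    """Per-list softmax of teacher scores divided by a temperature."""
    scaled = [t / temperature for t in teacher_scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


teacher = [-2.0, 0.5, 3.0]
sharp = softmax_transform(teacher, temperature=0.5)
smooth = softmax_transform(teacher, temperature=5.0)
```

The outputs are positive, preserve the teacher's order, and sum to one within each list; a lower temperature concentrates more mass on the top item.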
5 Experiments
We conduct experiments on three public LTR datasets to show that BAR achieves state-of-the-art performance for neural rankers. We also perform various ablation studies to better understand BAR.
5.1 Experimental setup
We follow the experimental setup of (dasalc) that provides a rigorous LTR benchmark. Three widely adopted data sets for web search ranking are used: Web30K (web30k), Yahoo (yahoo), and Istella (istella). The documents for each query were labeled with multilevel graded relevance judgments by human raters. We compare with a comprehensive list of benchmark methods.
$\lambda$MART$_{GBM}$ (lightgbm) and $\lambda$MART$_{RankLib}$ are the two GBDT-based implementations for LTR.
RankSVM (joachims2006training) is a classic pairwise learning-to-rank model built on SVM.
GSF (gsf) is a neural model using a groupwise scoring function and fully connected layers.
ApproxNDCG (revisit) is a neural model with fully connected layers and a differentiable loss that approximates NDCG (qin2010general).
DLCM (dlcm) is an RNN-based neural model that uses list context information to rerank a list of documents based on $\lambda$MART$_{RankLib}$, as in the original paper.
SetRank (setrank) is a neural model using self-attention to encode the entire list and perform joint scoring.
SetRank$^{re}$ (setrank) is SetRank plus ordinal embeddings based on the initial document ranking generated by $\lambda$MART$_{RankLib}$, as in the original paper.
DASALC (dasalc) is the state-of-the-art neural ranker, combining data transformation and augmentation, effective feature crosses, and listwise context.
DASALC-ens is an ensemble of DASALC models, significantly outperforming the strong $\lambda$MART$_{GBM}$ baseline on 4 out of 9 major metrics.
For the main results of BAR, we obtain the teacher model scores and configurations of DASALC (not DASALC-ens) for each dataset from the authors. For the Web30K and Yahoo datasets, the teacher scores are used as they are (i.e., we use the identity function as the affine transformation in $f$). As the teacher scores for the Istella dataset have a larger range (mean = 28.4, std = 299.4) and cause some numerical issues, we dampen them through the affine transformation in $f$. Better results may be achieved with more tuning. The model architecture configurations, such as the number of hidden layers, are unchanged during BAR training. We simply use $\alpha = 0.5$ to assign equal weights to the original objective and the distillation objective, and show its robustness in Section 5.5. We do perform hyper-parameter searches over dropout rate, learning rate, and data augmentation noise level on validation sets. Note that all BAR models are only born again once (instead of iteratively born again multiple times by treating the previous student model as the new teacher), due to computation overhead and the lack of theoretical guarantees in the original BAN work. BAR is a single neural ranker. BAR-ens is an ensemble of rankers that is tuned on each dataset, with the same architecture from different runs. Note that the candidates in the ensemble for each dataset are supervised by the same teacher.
5.2 Research questions
We want to answer the following research questions by comparing with competitive methods on LTR benchmarks and performing ablation studies:
RQ1: Is the BAR framework able to further push the limits of neural rankers over the state-of-the-art models?
RQ2: Is the listwise distillation loss necessary for BAR to be effective for LTR problems, compared to the common pointwise distillation loss that follows the i.i.d. assumption?
RQ3: Is the dual-objective architecture necessary for BAR, considering that (ban) shows that using only the teacher objective works better in certain scenarios? How robust is the balancing parameter between the two objectives?
RQ4: Is the affine teacher score transformation more effective than the Softmax transformation?
5.3 Main results (RQ1)
Our main results are shown in Table 1. From this table, we can see that the models trained under the BAR framework can significantly push the limits over the state-of-the-art neural rankers without sacrificing efficiency. A single BAR model performs best among non-GBDT methods on 8 out of 9 metrics. BAR-ens universally outperforms DASALC-ens, and can significantly outperform MART on 7 out of 9 metrics.
|Models||Web30K NDCG@k||Yahoo NDCG@k||Istella NDCG@k|
|( over DASALC)||(+1.49%)||(+1.26%)||(+1.30%)||(-0.21%)||(+0.18%)||(+0.24%)||(+0.27%)||(+0.26%)||(+0.23%)|
|( over DASALC-ens)||(+0.64%)||(+0.77%)||(+0.67%)||(+0.29%)||(+0.73%)||(+0.44%)||(+0.15%)||(+0.27%)||(+0.16%)|
5.4 The necessity of listwise distillation (RQ2)
We compare with pointwise distillation in Table 2 on the Web30K dataset. We use the MSE loss and tune $\alpha$ in Eq. 4 for the datasets with graded relevance labels. Results on other datasets are consistent. In this table, the listwise teacher without distillation is the DASALC model, and the listwise teacher plus listwise student is the BAR model. We have the following observations:
By comparing pointwise teacher with listwise teacher (DASALC), we confirm that listwise loss is indeed effective, regardless of model distillations.
We can also see that when listwise distillation is used, it always outperforms the corresponding teacher model, even if the teacher model is a pointwise one. On the other hand, when pointwise distillation is applied, performance gets worse. This confirms the necessity of the listwise distillation loss.
Pointwise distillation does not help even if the teacher model uses pointwise loss, eliminating the potential concern that it does not work due to inconsistency between the two objectives.
5.5 The necessity of two objectives (RQ3)
We study the balancing factor $\alpha$ in Eq. 4, including the special case ($\alpha = 0$) that uses only the teacher objective described in Section 4.4. The results are shown in Table 3 and Figure 3. We can see that student models trained with BAR can outperform the teacher model ($\alpha = 1$) with a wide range of $\alpha$, as long as both objectives are used. The performance decreases significantly if the born again ranker is trained with the teacher objective alone ($\alpha = 0$), which is different from the observations for some classification problems in (ban).
5.6 Softmax score transformation (RQ4)
We empirically show in Figure 2 (Left) that the Softmax score transformation under-performs the simple affine transformation when varying the temperature $T$. A possible reason is that Softmax scores are normalized within each query, and this may limit their power as labels for ranking problems. A better understanding of why certain teacher score transformations are more effective for ranking distillation is a promising future research direction.
6 Understanding BAR
Existing theories of Born Again Networks or knowledge distillation in general include (1) reweighting the data points by confidence (ban), (2) incorporating dark knowledge from negative classes (hinton2015distilling), and (3) ensembling of “multi-view” features in student (allen-zhu2020towards). However, these theories could not explain everything we observed in experiments for ranking distillation, especially, why the listwise distillation is necessary in Section 5.4. Here we provide a new theory that helps to explain BAR effectiveness.
Theorem 1. There exists a way to combine the teacher prediction score and the label to train a student model, such that when the teacher and the student have exactly equivalent capability, the student model can outperform or perform as well as the teacher model after being trained with the same amount of resource.
It is easy to show the necessity: we can simply train the student model on the true labels with $\alpha = 1$ in Eq. (4) and the same computation resources, in which case it performs as well as the teacher model. In Appendix A.1, we show the sufficiency of the theorem when there are data points that are hard to fit or that have erroneous labels.
Here we give an illustrative example of Theorem 1. Consider a training set of three data points with a single input feature , , and corresponding labels are , , , shown as red points in Fig. 3 and a test set of two data points and in cyan. We fit the training points with a single-parameter nonlinear model: , shown in a green dashed line in Fig. 3. To this model, two data points and are easy-to-fit and one data point is hard or mistakenly labeled as it is inconsistent with in test set.
By minimizing a mean squared error loss, shown as the blue line in Fig. 3,
we find two nontrivial solutions at and . Suppose we obtain a teacher model with and scores , , for the three inputs and then train a student model following the BAR method with , which is equivalent to having the student labels , , respectively, shown as the yellow points in Fig. 3. There are again two nontrivial solutions with and for the student. Ignoring the trivial solution at , we find the student solution at achieves a better performance by having a smaller mean square error metric on the test points than of the teacher solution at .
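The equivalence between the two-objective squared-error loss and a single blended label can be checked numerically. The sketch assumes weights `alpha` on the label term and `1 - alpha` on the teacher term; the identity holds for any pair of weights summing to one, and all numeric values below are hypothetical.

```python
def two_objective_loss(y, t, s, alpha):
    """alpha * (y - s)^2 + (1 - alpha) * (t - s)^2 for one data point."""
    return alpha * (y - s) ** 2 + (1.0 - alpha) * (t - s) ** 2


def blended_label_loss(y, t, s, alpha):
    """(blend - s)^2 plus a term that does not depend on the prediction s."""
    blend = alpha * y + (1.0 - alpha) * t
    constant = alpha * (1.0 - alpha) * (y - t) ** 2
    return (blend - s) ** 2 + constant


# Hypothetical values: a label, a teacher score, and a weight.
y, t, alpha = 2.0, 0.5, 0.3
gaps = [abs(two_objective_loss(y, t, s, alpha)
            - blended_label_loss(y, t, s, alpha))
        for s in (-1.0, 0.0, 0.95, 3.0)]
```

Because the two expressions differ only by a constant in the prediction, minimizing the combined loss gives the same solutions as regressing on the blended label.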
Applying Theorem 1 to learning to rank. Some of the labels in LTR datasets could be error-prone as a result of the pointwise evaluation nature of human rating. In the listwise ranking sense: (1) some of the labels may be error-prone, such that a document labeled with 2 might not be more relevant than a document labeled with 1; (2) the graded relevance labels are discrete and may not be able to fully capture in-between relevance levels such as 1.5; (3) a listwise loss only needs the comparison within each individual list, and this can alleviate the impact of noisy labels compared with a pointwise loss. As a result, listwise distillation is able to capture these mistakenly labeled data points and thus performs significantly better than pointwise distillation. This deduction is also consistent with the findings in Section 5.4: the most effective distillation is on models trained with the listwise Softmax loss.
The regularization view. Even if the human labels are not noisy but just contain some hard examples, Theorem 1 implies that a combination like Eq. (4) reduces overfitting on hard examples and plays a regularization role. Qin et al. (dasalc) show that one major bottleneck of modern neural rankers is overfitting, due to the lack of very large-scale ranking datasets and effective regularization techniques. For example, data augmentation is not intuitive for LTR problems, compared to, e.g., image rotations for computer vision tasks. By examining Eq. 4, the second term can be treated as regularization on the original objective. We empirically show that the BAR framework works as a very effective regularization technique for LTR problems. As seen in Figure 2 (Right), the BAR student gets lower ranking metrics on training data and higher ranking metrics on validation data during training, which is the desired behavior of effective regularization.
7 Conclusion
We propose Born Again neural Rankers (BAR) and further push the limits of neural ranker effectiveness without sacrificing efficiency. We show that the key success factors of BAR lie in a proper teacher score transform and a listwise distillation approach specifically designed for ranking problems, which do not follow the common assumptions made in most knowledge distillation work. We further conduct theoretical analysis on why BAR works, filling the gap between existing knowledge distillation theories and the learning to rank setting. As promising directions for future work, we consider advanced teacher score transformations for BAR, and label noise reduction techniques.
Appendix A
A.1 Proof of Theorem 1
In this section, we show that the student model trained with Eq. (4) can outperform the teacher model with the same capacity at an optimal $\alpha$ if there exist hard examples or the labels are noisy.
We first formulate the problem and give the mathematical definitions of “hard examples” and “noisy labels”. When training converges, a deep neural network teacher model approximately satisfies
for any parameter in the total trainable parameters. The first derivative on the right hand side is the loss function dependent gradient. For example:
Mean Squared Error loss: , we have
Pointwise logistic loss: with , we have
Softmax loss in Eq.(5), we have
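As a sanity check for the Softmax case, the sketch below compares an analytic gradient with a finite-difference estimate. It assumes the loss $-\sum_i y_i \log \mathrm{softmax}(\mathbf{s})_i$, whose derivative with respect to $s_k$ is $(\sum_i y_i)\,\mathrm{softmax}(\mathbf{s})_k - y_k$. The function names and toy values are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_cross_entropy(labels, scores):
    return -sum(y * math.log(p) for y, p in zip(labels, softmax(scores)))


def analytic_gradient(labels, scores):
    """d loss / d s_k = (sum of labels) * softmax(s)_k - y_k."""
    total = sum(labels)
    return [total * p - y for y, p in zip(labels, softmax(scores))]


def numeric_gradient(labels, scores, eps=1e-6):
    """Central finite differences, one score at a time."""
    grad = []
    for k in range(len(scores)):
        up = list(scores)
        up[k] += eps
        down = list(scores)
        down[k] -= eps
        grad.append((softmax_cross_entropy(labels, up)
                     - softmax_cross_entropy(labels, down)) / (2 * eps))
    return grad


labels = [0.5, 0.0, 0.5]  # normalized so that the labels sum to one
scores = [1.0, -0.5, 0.2]
analytic = analytic_gradient(labels, scores)
numeric = numeric_gradient(labels, scores)
```

With labels normalized to sum to one, the gradient is simply the softmax prediction minus the label, so it vanishes exactly when the predicted distribution matches the labels.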
If we normalize the labels by having in Softmax loss and define the model gradients on parameter at data point , as a matrix, we then have ranking model solutions satisfying equilibrium equations,
with as a vector of model predictions depending on different losses as shown in above examples. When the rank of is , we only have trivial solutions with . However, noting that the matrix is a nonlinear function of , in general we have the rank of is with non-trivial solutions living in the null space of .
For such a solution, we can distinguish hard and easy examples in the training dataset by quantifying the deviations from the target labels in the trained teacher model: the larger this deviation is, the harder the corresponding data point is. These data points in the training set could be hard in two different senses:
Case One – Hard examples: they are hard in nature, meaning that the trained model does not capture the most deterministic feature of the data points.
Case Two – Noisy labels: they could be mistakenly labeled, meaning that there exists a ground-truth model and the ground-truth model predicts results different from the labels for these data points.
When we train a student model from scratch with combined teacher predictions and labels following Eq.(4) with the teacher model solution as so that , the student solution should then also satisfy Eq.(12) but with a new set of labels . First of all, it’s easy to show that the teacher model with is a solution. But for a complex nonlinear model trained from scratch, it would be almost impossible to end up with the same solution as the teacher in general, especially when the labels are changed, which determine the initial gradients. We thus focus on the general solutions different from .
In Case One, the model output will be insensitive to the change of parameters for the hard points at initial training stage. After easy points are fit well, , the training enters a stage of fitting the hard points with irrelevant features, which directly leads to overfitting. So in practice, we use predictions from a teacher regularized with early-stopping as the distillation labels for students. With such a teacher, the initial gradients of the student model are still dominated by easy data points and in the later stage the contributions of the hard data points are lowered by fraction compared to the training of the teacher model. The overfitting effect will thus be weakened for the student model. In this case, the student model can achieve a better performance on the test data than the teacher model with .
In Case Two, the final solution of the teacher model will be balanced by contributions of the correct labels and wrong labels: let’s say there are out of training data points mistakenly labeled and most are correct labels : on average, the teacher model prediction will deviate from the label by , where are correct labels and are wrong labels. We thus have the teacher predictions deviating from the correct labels by and from the wrong labels by , so the student labels are for correct labels and for wrong labels. After convergence, the student predictions will thus be for correct labels and for wrong labels. When , student predictions on examples with correct labels are consistent with labels and on examples with wrong labels are approximately independent of the label. Assuming that there is no correlation of making the wrong labels in training and test sets, then the student models validated on the test set will result in better metrics than their teachers as they are making predictions closer to the correctly labeled data points.