Neural Document Expansion with User Feedback

08/08/2019 ∙ by Yue Yin, et al. ∙ Microsoft ∙ Tsinghua University

This paper presents a neural document expansion approach (NeuDEF) that enriches document representations for neural ranking models. NeuDEF harvests expansion terms from queries that lead to clicks on the document and weights these expansion terms with learned attention. It is plugged into a standard neural ranker and learned end-to-end. Experiments on a commercial search log demonstrate that NeuDEF significantly improves the accuracy of state-of-the-art neural rankers and expansion methods on queries with different frequencies. Further studies show the contribution of click queries and learned expansion weights, as well as the influence of document popularity on NeuDEF's effectiveness.




1. Introduction

Neural information retrieval (Neu-IR) methods have shown promising results in various search scenarios. These neural ranking models leverage distributed representations (embeddings) to conduct soft matches between queries and documents and, at the same time, leverage large model capacity to revive "classic" IR intuitions, such as translation models (K-NRM), phrase matches (dai2018convolutional), and multi-field evidence combination (zamani2018neural). This paper revisits the document expansion technique and develops NeuDEF, a Neural Document Expansion method that explicitly expands documents using user Feedback signals.

NeuDEF first harvests candidate expansion terms for a document from the queries that lead to user clicks on it (click queries). Then its attention mechanism weights the expansion terms based on both self-attention and the matches between the document and click queries. The weighted expansion terms form an additional document representation and can be integrated into various neural ranking models. During learning, the attention mechanism on expansion terms and the neural ranking model are trained end-to-end using document ranking labels; the integrated system learns how to expand and rank documents jointly.

In our experiments on two search log samples from a commercial search engine, NeuDEF significantly improves the ranking accuracy of its base ranker, K-NRM (K-NRM), and outperforms previous state-of-the-art neural ranking methods and document expansion techniques by large margins. Our additional studies show that NeuDEF's attention mechanism assigns higher weights to novel expansion terms and that NeuDEF generalizes user feedback signals to unseen queries.

2. Neural Document Expansion

This section presents the base ranker, K-NRM (K-NRM); our expansion model, NeuDEF; and the joint learning of the two.

2.1. Base Ranker Recap


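K-NRM (K-NRM) is an interaction-based neural ranking model that uses density estimation kernels to soft match term pairs:

$$f^{rank}(q, d) = \tanh\Big(\sum_{k} w_k \sum_{t_q \in q} \log K_k(t_q, d)\Big),$$

$$K_k(t_q, d) = \sum_{t_d \in d} \exp\Big(-\frac{(\cos(t_q, t_d) - \mu_k)^2}{2\sigma_k^2}\Big).$$

$K_k$ is a Gaussian (RBF) kernel with width $\sigma_k$. It "soft counts" the number of document terms $t_d$ that match the query term $t_q$ near its kernel region $\mu_k$. The match is calculated by the cosine similarity, $\cos(t_q, t_d)$, of their word embeddings, which are learned end-to-end for the whole vocabulary. $w_k$ is the kernel weight to be learned.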
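As a hedged illustration of the kernel pooling described above, the following Python sketch scores one query-document pair from precomputed term embeddings. It assumes the standard K-NRM form (a tanh over the weighted sum of log kernel soft counts); the function names, the toy embeddings, and the `eps` floor inside the logarithm are illustrative assumptions, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kernel_soft_count(q_vec, doc_vecs, mu, sigma):
    """Soft count of document terms whose similarity to q_vec lies near mu."""
    return sum(
        math.exp(-((cosine(q_vec, d_vec) - mu) ** 2) / (2 * sigma ** 2))
        for d_vec in doc_vecs
    )


def knrm_score(query_vecs, doc_vecs, kernels, weights, eps=1e-10):
    """tanh of the weighted sum over kernels of summed log soft counts.

    kernels: list of (mu, sigma) pairs; weights: one weight per kernel.
    eps (an assumption) keeps the logarithm finite when a soft count is ~0.
    """
    total = 0.0
    for (mu, sigma), w in zip(kernels, weights):
        pooled = sum(
            math.log(max(kernel_soft_count(q, doc_vecs, mu, sigma), eps))
            for q in query_vecs
        )
        total += w * pooled
    return math.tanh(total)
```

With a single exact-match kernel, for example `kernels=[(1.0, 0.1)]`, a document containing the query term scores higher than one that does not, which is the behaviour the soft count is meant to capture.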

2.2. Expansion Term Selection and Weighting

NeuDEF leverages user feedback in search logs to find expansion terms. It first considers the click queries $Q^c$ of the document $d$:

$$Q^c = \{q^c \mid \text{click}(q^c, d) = \text{True}\}.$$

$\text{click}(q^c, d)$ is True if there is any click on $d$ from the query $q^c$.

Terms from clicked queries form the candidate expansion set $E$:

$$E = \{e \mid e \in q^c,\ q^c \in Q^c\}.$$

A clicked query is likely to be related to the document, as a user has used the query to search and then clicked on the document.
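The selection step above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `(query, doc_id, clicked)` log format and the whitespace tokenizer are hypothetical stand-ins (a Chinese search log would need a proper word segmenter), and the function names are illustrative.

```python
from collections import defaultdict


def click_queries(click_log):
    """Map each document id to the set of queries with at least one click on it.

    click_log: iterable of (query, doc_id, clicked) tuples (assumed format).
    """
    by_doc = defaultdict(set)
    for query, doc_id, clicked in click_log:
        if clicked:
            by_doc[doc_id].add(query)
    return dict(by_doc)


def expansion_terms(queries):
    """Candidate expansion set: every term of every clicked query."""
    # Whitespace tokenization is an assumption made for this sketch.
    return {term for query in queries for term in query.split()}


log = [
    ("cheap flights", "d1", True),
    ("flights to paris", "d1", True),
    ("weather", "d1", False),  # shown but never clicked, so ignored for d1
    ("weather", "d2", True),
]
clicked = click_queries(log)
terms_d1 = expansion_terms(clicked["d1"])
```

Here `terms_d1` holds the terms of the two clicked queries for `d1`; the unclicked query contributes nothing.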

Given the expansion terms $E$ and clicked queries $Q^c$ of document $d$, NeuDEF calculates the weights of the expansion terms by an attention mechanism.

The attention first uses Multi-Head Self Attention (vaswani2017attention) to weight the document and clicked queries independently. To capture cross-query information and generate better word-level attention weights, we concatenate all the clicked queries of a document and feed them into the transformer (Self-ATT). Then the attention matches each clicked query $q^c$ to the document:

$$f^{att}(q^c, d) = \text{K-NRM}\big(\text{Self-ATT}(q^c),\ \text{Self-ATT}(d)\big).$$

It uses the same architecture as the base ranker but different parameters. Then it calculates the attention weight of term $e$ by summing match scores from the clicked queries it appears in:

$$a(e) = \sum_{q^c \in Q^c,\ e \in q^c} f^{att}(q^c, d).$$

The more related clicked queries the term appears in, the more expansion weight, $a(e)$, it receives.

NeuDEF provides the expansion terms $E$, which introduce user feedback, and their weights $a(e)$, learned by self-attention and document-query matches.
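The final aggregation step, where a term's weight is the sum of the match scores of the clicked queries that contain it, can be sketched as follows. The match scores are passed in as plain numbers; in NeuDEF they come from the learned attention matcher, which this sketch does not model. The function name and whitespace tokenization are assumptions.

```python
def expansion_weights(match_scores):
    """Sum each clicked query's match score into the terms it contains.

    match_scores: dict mapping a clicked query string to its (already
    computed) query-document match score. Placeholder values here.
    """
    weights = {}
    for query, score in match_scores.items():
        # set() so a term repeated inside one query is counted once per query.
        for term in set(query.split()):
            weights[term] = weights.get(term, 0.0) + score
    return weights


weights = expansion_weights({"cheap flights": 0.5, "flights to paris": 0.25})
```

In this toy input, `flights` appears in both clicked queries and therefore accumulates both scores, while `cheap` and `paris` each receive the score of a single query.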

2.3. Joint Learning with Neural Rankers

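NeuDEF is integrated into the base ranker by providing an additional expansion field, which includes the expansion terms $E$ and their attention weights and is linearly combined with the base ranker $f^{rank}$ ($\alpha$ and $\beta$ are the combination weights):

$$f(q, d) = \alpha f^{rank}(q, d) + \beta f^{exp}(q, E).$$

We use a modified K-NRM on the expansion field:

$$f^{exp}(q, E) = \tanh\Big(\sum_{k} w^{exp}_k \sum_{t_q \in q} \log K^{exp}_k(t_q, E)\Big),$$

$$K^{exp}_k(t_q, E) = \sum_{e \in E} a(e) \exp\Big(-\frac{(\cos(t_q, W e) - \mu_k)^2}{2\sigma_k^2}\Big).$$

The modified version weights each expansion term $e$ by its attention weight $a(e)$, and adds a projection $W$ on the expansion word embeddings to distinguish them from the original words. The three K-NRM models used in the attention matcher $f^{att}$, the base ranker $f^{rank}$, and the expansion ranker $f^{exp}$ (Eq. 6) share the same word embeddings and kernel hyper-parameters.

NeuDEF is then trained with the neural ranker using standard pairwise hinge loss:

$$l = \sum_{q} \sum_{(d^+, d^-) \in D_q^{+,-}} \max\big(0,\ 1 - f(q, d^+) + f(q, d^-)\big).$$

$D_q^{+,-}$ are the relevant and irrelevant document pairs of the query $q$. Instead of conducting the expansion and ranking separately, NeuDEF learns the document expansion and document ranking jointly from ranking labels using back-propagation.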
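The linear combination and the pairwise hinge loss can be illustrated on plain scores. This is a sketch under assumptions: the margin of 1 follows the usual K-NRM-style hinge loss rather than a value confirmed by this text, the scores are given numbers rather than model outputs, and the function names are illustrative.

```python
def combined_score(rank_score, expansion_score, alpha, beta):
    """Linear combination of the base ranker and the expansion field scores."""
    return alpha * rank_score + beta * expansion_score


def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Sum of max(0, margin - f(d+) + f(d-)) over (relevant, irrelevant) pairs.

    margin=1.0 is an assumption for this sketch.
    """
    return sum(
        max(0.0, margin - pos + neg) for pos, neg in zip(pos_scores, neg_scores)
    )


score = combined_score(0.5, 0.25, alpha=1.0, beta=2.0)
loss_small_gap = pairwise_hinge_loss([0.5], [0.0])   # pair still inside the margin
loss_large_gap = pairwise_hinge_loss([2.0], [0.0])   # pair separated by more than the margin
```

A pair contributes to the loss only while the relevant document's score exceeds the irrelevant one's by less than the margin, so gradients flow to both the ranker and the expansion weights only for pairs that are not yet well separated.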

3. Experimental Methodology

Dataset. Sogou, a commercial search engine based in China, released a sample of its search log with queries, documents, and user clicks to various academic partners. Our experiments use two training datasets sampled from the Sogou log: Sogou-KNRM, the one used by K-NRM (K-NRM), for a fair comparison, and Sogou-QCL, the public release of the Sogou search log (zheng2018sogou). We follow the same settings as prior research (K-NRM; zheng2018sogou) and, due to space limitations, refer to those papers for more details on the datasets.

Our evaluations on the Sogou-KNRM dataset omit the Testing-SAME setting, which is prone to overfitting (dai2018convolutional), and add torso (50-1000 appearances) and tail (fewer than 50 appearances) queries. We use four evaluation scenarios for the Sogou-KNRM dataset: (1) Testing-RAW Head, Torso, and Tail: user clicks serve as relevance labels, evaluated on head, torso, and tail queries; (2) Testing-DIFF: relevance labels are inferred by a click model (TACM (liu2017time)), evaluated on head queries. For the Sogou-QCL dataset, we use TACM-inferred relevance labels to train and test our model on head queries, following exactly the setting of Table 5 in the original Sogou-QCL paper (zheng2018sogou).

To study the effectiveness of document expansion with the body field, we also use body texts, which are taken from our crawled HTML pages and parsed by Boilerpipe. They are combined linearly with the titles, following the standard multi-field ranking setup.

| Model | Testing-RAW MRR, Head | Testing-RAW MRR, Torso | Testing-RAW MRR, Tail | Testing-DIFF NDCG@1 | Testing-DIFF NDCG@10 |
|---|---|---|---|---|---|
| DRMM (jiafeng2016deep) | 0.2335 -30.1% | 0.3102 -16.0% | 0.2951 -6.4% | 0.2126 -27.5% | 0.3592 -14.3% |
| CDSSM (shen2014learning) | 0.2501 -25.1% | 0.3184 -13.7% | 0.2928 -7.1% | 0.2017 -31.2% | 0.3500 -16.5% |
| K-NRM (K-NRM) | 0.3339 - | 0.3691 - | 0.3152 - | 0.2931 - | 0.4190 - |
| Conv-KNRM (dai2018convolutional) | 0.3382 +1.3% | 0.3645 -1.2% | 0.3218 +2.1% | 0.2988 +1.9% | 0.4204 +0.3% |
| K-NRM+DELM+TF (tao2006language) | 0.3351 +0.4% | 0.3701 +0.3% | 0.3121 -1.0% | 0.2901 -1.0% | 0.4203 +0.3% |
| K-NRM+ExpaNet (tang2017end) | 0.3402 +1.9% | 0.3702 +0.3% | 0.3234 +2.6% | 0.3004 +2.5% | 0.4212 +0.5% |
| K-NRM+DocFreq | +4.9% | 0.3714 +0.6% | +4.6% | +10.0% | 0.4289 +2.4% |
| K-NRM+CQCount | +7.9% | 0.3785 +2.5% | +7.4% | +14.1% | +5.0% |
| NRM-F (zamani2018neural) | +12.2% | +10.9% | +12.5% | +16.6% | +14.0% |
| NeuDEF-TF | +18.2% | +15.0% | +8.8% | +25.3% | +16.8% |
| NeuDEF-NoTrans | +21.4% | +27.0% | +13.7% | +29.1% | +19.9% |
| NeuDEF | +20.9% | +28.1% | +16.6% | +31.6% | +20.7% |

Table 1. Ranking accuracy on the Sogou-KNRM dataset. Percentages are relative performance over K-NRM; cells with only a percentage give the relative change alone. Markers for statistically significant improvements over K-NRM, NRM-F, NeuDEF-TF, and NeuDEF-NoTrans are not shown.

| Model | Testing-RAW MRR, Head | Testing-RAW MRR, Torso | Testing-RAW MRR, Tail | Testing-DIFF NDCG@1 | Testing-DIFF NDCG@10 |
|---|---|---|---|---|---|
| K-NRM(T) | 0.3440 - | 0.3747 - | 0.3244 - | 0.3132 - | 0.4288 - |
| NRM-F(T) | +11.5% | +11.4% | +6.3% | +18.3% | +8.5% |
| NeuDEF-TF(T) | +13.5% | +20.7% | +2.4% | +17.4% | +13.4% |
| NeuDEF-NoTrans(T) | +19.3% | +25.2% | +11.2% | +20.0% | +16.3% |
| NeuDEF(T) | +16.8% | +27.4% | +14.2% | +20.1% | +16.7% |
| K-NRM(B) | 0.2728 - | 0.3226 - | 0.2539 - | 0.2275 - | 0.3744 - |
| NRM-F(B) | +27.8% | +21.1% | +23.9% | +35.9% | +22.9% |
| NeuDEF-TF(B) | +32.3% | +22.3% | +24.6% | +46.6% | +26.3% |
| NeuDEF-NoTrans(B) | +32.1% | +28.3% | +29.1% | +45.2% | +26.3% |
| NeuDEF(B) | +33.2% | +33.6% | +33.8% | +46.7% | +25.9% |

Table 2. Performance on the title (T) and body (B) fields individually on Sogou-KNRM. Relative performance over K-NRM and significant improvements over K-NRM, NRM-F, NeuDEF-TF, and NeuDEF-NoTrans are computed within each field group.

Expansion Candidates. All expansion approaches harvest the expansion terms solely using the queries and clicks in the training split. As there is no overlap in the training and testing queries, the expansion approaches use no information from the testing data.

Evaluation Metrics. Testing-DIFF is evaluated by NDCG@{1, 10} and Testing-RAW is evaluated by MRR (K-NRM). Statistical significance is tested by a permutation test.

The per-document performance of a ranking model with respect to the baseline model is compared by ΔRR, which is computed from the relevance label and RR(q, d), the reciprocal rank of d under q. Better ranking models have positive ΔRR.
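The MRR and NDCG@k metrics named above can be sketched as follows. This is a hedged illustration: the text does not state which NDCG gain function is used, so the exponential gain `2^label - 1` with a log2 discount is an assumption, as are the function names and the convention that a query with no relevant document contributes a reciprocal rank of 0.

```python
import math


def mrr(ranked_labels_per_query):
    """Mean reciprocal rank of the first relevant (label > 0) document.

    ranked_labels_per_query: one list of relevance labels per query,
    ordered by the ranker's output. Queries with no relevant document
    contribute 0 (an assumption of this sketch).
    """
    total = 0.0
    for labels in ranked_labels_per_query:
        for rank, label in enumerate(labels, start=1):
            if label > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_labels_per_query)


def dcg_at_k(labels, k):
    """Discounted cumulative gain with exponential gain (assumed variant)."""
    return sum(
        (2 ** label - 1) / math.log2(position + 2)
        for position, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels, k):
    """DCG@k normalized by the DCG@k of the ideal ordering (0.0 if no gain)."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0
```

For instance, `mrr([[0, 1, 0], [1, 0]])` averages reciprocal ranks of 1/2 and 1, and `ndcg_at_k([0, 1], 1)` is 0 because the only relevant document sits below the cutoff.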

Baselines. Using the same experimental setup as prior research (K-NRM; zheng2018sogou) makes our method directly comparable with their baselines: CDSSM (shen2014learning), DRMM (jiafeng2016deep), K-NRM (K-NRM), and Conv-KNRM (dai2018convolutional). We use their shared implementations to obtain baseline results on torso and tail queries. We also implemented NRM-F (zamani2018neural), the fielded version of CDSSM, using the same fields as in NeuDEF. All neural ranking baselines leverage user feedback following Eq. 8.

We also implemented and compared with document expansion baselines. DELM (tao2006language) is a traditional document expansion language model based on document neighbors. We selected the top 5 words as the document expansion field according to their frequency in all the neighbor documents. ExpaNet (tang2017end) is a neural text expansion model with a memory network generated from document neighbors. We combined the expanded document features from ExpaNet with the soft match features in the original K-NRM's dense layer. We treated all the documents under a specific query as neighbors.
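The frequency-based selection of the top 5 neighbor words can be sketched as below. It covers only the selection step described in the text, not the DELM language model itself; the function name, whitespace tokenization, and the absence of stop-word filtering are assumptions.

```python
from collections import Counter


def top_neighbor_terms(neighbor_docs, k=5):
    """Return the k most frequent terms across all neighbor documents."""
    # Whitespace tokenization and no stop-word removal: assumptions of this sketch.
    counts = Counter(term for doc in neighbor_docs for term in doc.split())
    return [term for term, _ in counts.most_common(k)]


neighbors = ["paris flights deals", "flights paris", "flights hotel"]
expansion_field = top_neighbor_terms(neighbors, k=2)
```

In the toy input, `flights` occurs three times and `paris` twice, so these two terms form the expansion field when `k=2`.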

We also compared with expansion using other meta-data: (1) DocFreq: the number of times the document appears in the search log; (2) CQCount: the number of queries that lead to clicks on the document.

Two simpler versions of NeuDEF are compared too: (1) NeuDEF-TF, which weights expansion terms only by their frequency in the clicked queries; (2) NeuDEF-NoTrans, which does not use transformer’s self attention.
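The frequency-only weighting of NeuDEF-TF can be sketched as a term count over the clicked queries. Normalizing the counts so that they sum to one is an assumption of this sketch, since the text only says the weights come from term frequency; the function name and whitespace tokenization are likewise illustrative.

```python
from collections import Counter


def tf_expansion_weights(clicked_queries):
    """Weight each expansion term by its relative frequency in the clicked queries."""
    counts = Counter(term for query in clicked_queries for term in query.split())
    total = sum(counts.values())
    # Normalization to a distribution is an assumption made for illustration.
    return {term: count / total for term, count in counts.items()}


tf_weights = tf_expansion_weights(["cheap flights", "flights to paris"])
```

Unlike the learned attention, this variant cannot down-weight a frequent but uninformative term, which is the contrast the NeuDEF-TF ablation is designed to expose.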

Implementation Details. All baselines use the same settings as prior research (K-NRM): a 300-dimension embedding layer; a 165,877-word vocabulary; one exact match kernel; and ten kernels equally distributed in (-1, 1). One multi-head attention layer with 4 heads is used in NeuDEF. Documents with no clicked query are not expanded. All methods are trained on the same training data. All neural methods use the Adam optimizer with batch size 64 and early stopping on randomly selected validation data.
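One way to read "one exact match kernel and ten kernels equally distributed in (-1, 1)" is sketched below. Placing the ten soft-match kernel means at the centres of ten equal-width bins, and the exact-match kernel at a similarity of 1, are assumptions of this sketch; kernel widths are not shown because the values are not given in this text.

```python
def kernel_means(n_soft=10, low=-1.0, high=1.0, exact=1.0):
    """Kernel centres: one exact-match kernel plus n_soft evenly spaced ones.

    Bin-centre placement is an assumption made for illustration.
    """
    width = (high - low) / n_soft
    soft = [low + width * (i + 0.5) for i in range(n_soft)]
    return [exact] + soft


means = [round(mu, 1) for mu in kernel_means()]
```

Under these assumptions the soft-match centres run from -0.9 to 0.9 in steps of 0.2, giving eleven kernels in total.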

4. Evaluation

This section presents evaluation results.

4.1. Overall Ranking Accuracy

Table 1 lists the results on Sogou-KNRM. NeuDEF outperforms all baselines. It improves over NRM-F, the previous state-of-the-art, on Head, Torso, and Tail. NeuDEF outperforms its base ranker (K-NRM) on tail queries by 16.6% in MRR. The attention mechanism learns effective expansion weights: NeuDEF significantly outperforms NeuDEF-TF. The transformer helps on the tail, as shown by the comparison with NeuDEF-NoTrans.

Table 2 lists the results of K-NRM, NRM-F, and the NeuDEF variants on the title and body individually. NeuDEF performs the best on each field. Note that prior work has observed that adding body fields does not contribute much (K-NRM; zamani2018neural). How to better model long body text is a future research direction. The performances of NeuDEF and the main baselines on Sogou-QCL (zheng2018sogou) are in Table 3. The trends are similar.

4.2. Learned Expansion Weights

This experiment analyzes the expansion weights learned by NeuDEF's attention mechanism on three groups of expansion terms: those from clicked queries that have No Overlap with the document content, those that have Partial Overlap, and those Contained by the document. Figure 1 shows the distribution of terms in these three groups and their average expansion weights, normalized on each document.

| Model | NDCG@1 | NDCG@10 |
|---|---|---|
| K-NRM (K-NRM) | 0.2409 - | 0.3888 - |
| NRM-F (zamani2018neural) | +5.7% | +14.1% |
| NeuDEF-TF | +13.5% | +18.5% |
| NeuDEF-NoTrans | +16.3% | +19.6% |
| NeuDEF | +16.4% | +19.4% |

Table 3. Accuracy on the Sogou-QCL dataset. Percentages are relative performance over K-NRM. Markers for statistically significant improvements over K-NRM, NRM-F, NeuDEF-TF, and NeuDEF-NoTrans are not shown.

Figure 1. Frequency distribution and normalized attention weights of the expansion terms from three clicked query groups: those that have No Overlap with, have Partial Overlap with, or are Contained by the corresponding document fields.

Some clicked queries have no term overlap with the document title or body (they might be retrieved by query-expansion-like techniques). NeuDEF assigns about one third of its learned attention weights to clicked queries that have no overlap with the document; Figure 1 gives the shares of expansion terms and expansion weights of the No Overlap group for each document field. NeuDEF learns to favor novel expansion terms that are related to but do not appear in the document, which may bring in extra information.
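The three clicked-query groups used in this analysis can be assigned with simple set operations. This sketch assumes that "Contained" means every query term appears in the document field and that terms are compared after whitespace tokenization; both are illustrative assumptions, as is the function name.

```python
def overlap_group(query, document_field):
    """Classify a clicked query by its term overlap with a document field."""
    query_terms = set(query.split())
    doc_terms = set(document_field.split())
    shared = query_terms & doc_terms
    if not shared:
        return "No Overlap"
    if shared == query_terms:
        return "Contained"  # every query term appears in the field (assumed definition)
    return "Partial Overlap"


title = "cheap flights to paris"
groups = [
    overlap_group("airline tickets", title),
    overlap_group("paris flights deals", title),
    overlap_group("flights paris", title),
]
```

The first query shares no term with the title, the second shares some, and the third is fully covered, so the three calls fall into the three groups in order.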

Figure 2. Performance on documents with different numbers of clicked queries. The X-axis is the number of clicked queries. Histograms show the number of documents. Plots and the Y-axis show the average ΔRR compared to K-NRM; higher is better.

4.3. Document Level Performances

In our experiments, all testing queries are "unseen" queries, as they never appear in the training split, nor are they used in the document expansion. The advantage of NeuDEF is that it leverages the feedback signals at the document level: a query might never have appeared before in the query log, but its candidate documents may have been seen before.

This experiment studies NeuDEF's document-level performance with respect to different amounts of user feedback signals. It groups documents based on their number of clicked queries, then evaluates the ΔRR of NeuDEF over K-NRM on each combination of document fields. The results on head queries are shown in Figure 2. Results on torso and tail queries are similar and omitted due to space constraints.

The number of clicked queries per document follows a long-tail distribution. User preferences also heavily favor popular documents; documents with more clicked queries are more likely to be relevant. NeuDEF performs better than K-NRM on all groups and with all types of document content. Even on documents with no clicked queries, where NeuDEF falls back to the base K-NRM model, adding expansion terms provides extra information in training and helps NeuDEF learn better parameters than its base ranker.
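The grouping behind this analysis can be sketched as a bucketed average. The per-document scores are passed in as given numbers, since their exact definition is not reproduced here; the `(num_clicked_queries, delta_rr)` record format, the function name, and the `cap` that merges all documents with many clicked queries into one bucket are assumptions of this sketch.

```python
from collections import defaultdict


def mean_delta_rr_by_click_count(records, cap=5):
    """Group documents by number of clicked queries and average their scores.

    records: iterable of (num_clicked_queries, delta_rr) pairs (assumed format).
    Returns {bucket: (num_documents, mean_delta_rr)}; counts >= cap share a bucket.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for n_clicked, delta_rr in records:
        bucket = min(n_clicked, cap)
        sums[bucket] += delta_rr
        counts[bucket] += 1
    return {b: (counts[b], sums[b] / counts[b]) for b in sorted(counts)}


summary = mean_delta_rr_by_click_count(
    [(0, 0.0), (0, 0.5), (1, 0.25), (7, 0.5), (9, 1.0)]
)
```

The document counts per bucket correspond to the histogram and the bucket means to the plotted curve; capping the bucket keeps the long tail of heavily clicked documents from producing many near-empty groups.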

5. Conclusions and Future Work

This paper presents NeuDEF, a neural document expansion approach that enriches document representations for neural ranking models using user feedback signals. Experiments demonstrate NeuDEF’s effectiveness and its ability to better utilize user feedback signals and generalize them to unseen queries through document expansions. Future work includes bringing expansion terms from external resources and developing more advanced neural expansion models.

6. Acknowledgement

This work is supported by the National Key Research and Development Program of China (No. 2018YFC0831901). We thank Sogou for providing access to the search log.