Cost-sensitive Learning of Deep Semantic Models for Sponsored Ad Retrieval

by Nikit Begwani, et al.

This paper formulates the problem of learning a neural semantic model for IR as a cost-sensitive learning problem. Current semantic models trained on click-through data treat all historical clicked query-document pairs as equivalent. The paper argues that this approach is sub-optimal because of the noisy and long-tail nature of click-through data (especially for sponsored search). It proposes a cost-sensitive (weighted) variant of the state-of-the-art convolutional latent semantic model (CLSM) and explores costing (weighing) strategies for improving the model. It also shows that weighing the pair-wise loss of CLSM tightens the upper bound on (1-NDCG). Experimental evaluation is done on Bing sponsored search and Amazon product recommendation. First, the proposed weighted model is trained on query-ad pairs from Bing sponsored search click logs. Online A/B testing on live search engine traffic shows 11.8% higher click-through rate and 8.2% lower bounce rate compared to the unweighted model. Second, the weighted model trained on the Amazon co-purchased product recommendation dataset shows an improvement of 0.27 at NDCG@1 and 0.25 at NDCG@3 compared to the unweighted model.



1. Introduction

Objective : This paper formulates the problem of learning neural semantic models for IR as a cost-sensitive learning problem. It explores various costing (weighing) techniques for improving neural semantic models, specifically the CLSM model (Shen et al., 2014). In online retrieval, query similarity must be computed against millions of documents within milliseconds, so word-level query-document interaction is not feasible. We therefore rely on representation-based models, of which CLSM is the state of the art.

Sponsored Ad Retrieval : Search engines generate most of their revenue via sponsored advertisements, which are displayed alongside organic search results. An ad retrieval system is designed to select ads in response to user queries. Historically, advertisers created their campaigns by providing an ad and a set of queries (bid terms) on which they want their ad displayed. This scenario is called an exact match. But it is not possible for advertisers to provide bid terms that cover all tail queries by exact match. An advanced match is used to overcome this issue, where user queries are semantically matched with ads. Each serving of an ad, in response to a user query, is called an impression. For a query-ad pair, total clicks divided by total impressions over a time interval is defined as the click-through rate (CTR), while the percentage of times a user returns immediately after clicking an ad is referred to as the bounce rate (BR). A high click-through rate and a low bounce rate are desirable for a sponsored search system.

Earlier information retrieval techniques matched queries with ads based on syntactic similarity (Jones et al., 2000). Much subsequent research went into developing semantic retrieval techniques such as LSA (Deerwester et al., 1990), pLSI (Hofmann, 1999) and LDA (Blei et al., 2003), which map queries and documents into a lower-dimensional semantic space where a match can happen even with no token overlap.

Figure 1. CLSM Architecture

DSSM : Recently, there has been a shift towards neural-network-based semantic models trained using click-through data. DSSM (Huang et al., 2013) is one such representation-based neural network model. It takes a bag-of-words representation of historically clicked query-document pairs and maps them into a lower-dimensional space using a discriminative approach. It uses a set of non-linear layers to generate query and document semantic vectors. The learning objective is to maximize the cosine similarity between the vectors of clicked query-document pairs.

CLSM : DSSM uses a bag-of-words representation for the query/document, which is not suitable for capturing contextual structures. CLSM (Shen et al., 2014) tries to solve this issue by running a contextual sliding window over the input word sequence. As shown in Figure 1, CLSM has a letter-trigram-based word hashing layer, followed by a convolution-layer-based sliding window which generates a local contextual feature vector for each word within its context window. These local features are then aggregated using a max-pool layer to generate the global feature vector, which is then fed to a fully connected layer to generate the high-level semantic vector for the query/document.
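The first two stages described above (letter-trigram word hashing and the contextual sliding window) can be sketched in a few lines. This is an illustrative sketch only: the `#` word-boundary mark, the `<s>` padding token and the window size of 3 are assumptions made here, and the function names are hypothetical.

```python
def letter_trigrams(word):
    """Letter trigrams of a word, padded with '#' boundary marks (assumed convention)."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def sliding_windows(tokens, size=3):
    """One context window per token; '<s>' pads the edges (assumed padding token)."""
    half = size // 2
    padded = ["<s>"] * half + list(tokens) + ["<s>"] * half
    return [padded[i:i + size] for i in range(len(tokens))]


def window_features(tokens, size=3):
    """For each window, the letter trigrams of every word slot, kept in slot order.

    Keeping the slots separate mirrors concatenating per-word trigram vectors.
    """
    return [[letter_trigrams(w) for w in window]
            for window in sliding_windows(tokens, size)]


if __name__ == "__main__":
    print(letter_trigrams("cat"))
    print(sliding_windows("thyroid treatment center".split()))
```

In a full model, each window's trigram features would be turned into a sparse count vector and passed through the convolution layer; that part is omitted here.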

Motivation : Current semantic models trained on click-through data treat all historical clicked query-document pairs as equally important, which is not true. For example, an ad related to "thyroid treatment" can receive a click from a wide variety of queries like Hashimoto disease, hypothyroidism, medication hypothyroidism, Hashimoto, swollen glands neck, heart conditions, liver diseases, Perthes diseases etc. Some of these queries are very specific, while others are generic. Treating all these pairs as equal during training can result in the model learning only generic patterns. Other dimensions of the problem are the noise in click-through data (i.e. not all clicked ads for a query are relevant to the query) and the long-tail nature of click-through data (see Figure 2). For example, we can fairly confidently say that a query-ad pair having 95 clicks from 100 impressions is relevant, but not much can be said about a query-ad pair with 1 click from 1 impression.

Figure 2. Clicks v/s number of queries showing the long tail nature of click-through data. Note that x-axis is on log-scale.

One solution is to generate training data by applying minimum impression, click and CTR thresholds on query-ad pairs. But this can filter out most of the tail queries (since long-tail queries never repeat, all query-ad pairs for such queries will have only one impression), which can result in below-par performance of semantic models on tail queries. This is not an acceptable solution, since semantic models are mainly used for advanced match on tail queries.

To address this issue, we propose that all clicked query-ad pairs should be used for training semantic models. Further, we propose different costing (weighing) techniques on click data, so that the model learns from important pairs.

Contributions : This paper formulates neural semantic model training as a cost-sensitive learning problem. It proposes approaches for re-weighing training data and guiding principles for the same. Online A/B testing of the weighted model on live traffic shows 11.5% more clicks, 11.8% higher CTR and an 8.2% improvement in quality, measured using bounce rate (BR), compared to the unweighted model. Further evaluation of the weighted CLSM model on the Amazon co-purchased dataset shows a 0.27 absolute increase in NDCG@1 and a 0.25 absolute increase in NDCG@3 over the unweighted model.

2. Related Work

2.1. Traditional IR Techniques

Many IR techniques have been proposed for modeling contextual information between queries and documents (Jones et al., 2000) (Bendersky et al., 2011) (Gao et al., 2004) (Gao et al., 2010) (Lavrenko and Croft, 2001) (Lv and Zhai, 2009) (Metzler and Croft, 2005) (Metzler and Croft, 2007). Classical TF-IDF and BM25 (Jones et al. (2000)) techniques are based on a bag-of-words representation. These approaches are further extended by modeling term/n-gram dependencies using Markov Random Field (Metzler and Croft (2005)), Latent Concept Expansion (Metzler and Croft (2007)), dependence model (Gao et al. (2004)) and phrase translation model (Gao et al. (2010)).

2.2. Semantic Models for IR

Classical IR techniques based on lexical matching can fail to retrieve relevant documents due to language/vocabulary mismatch between query and documents. Latent Semantic Models aim to solve this issue by mapping both query and document into a lower dimensional semantic space and then, matching the query with documents based on vector similarity in the latent space. Techniques in this area include Latent Semantic Analysis (Deerwester et al. (1990)), Probabilistic latent semantic indexing (Hofmann (1999)), Latent Dirichlet Allocation (Blei et al. (2003)), LDA based document models (Wei and Croft (2006)), Bi-Lingual Topic Model (Gao et al. (2011)) etc.

Representation based neural models: These models use a neural network to map both query and document to low dimensional latent embedding space and then perform matching in latent space. Models proposed in this category are DSSM (Huang et al. (2013)), CLSM (Shen et al. (2014)), ARC-I (Hu et al. (2014)) etc.

Interaction based neural models: These models compute interaction (syntactic/semantic similarity) between each term of query and document. These interactions are further summarized to generate a matching score. Multiple models have been proposed in this category such as DRMM (Guo et al. (2016)), MatchPyramid (Pang et al. (2016)), aNMM (Yang et al. (2016)), Match-SRNN (Wan et al. (2016)), K-NRM (Xiong et al. (2017)) etc.

2.3. Learning to rank

Learning to rank (LTR) models for IR aim to learn a ranking model which can rank documents for a given query. Liu (2009) categorized LTR approaches into three categories based on learning objective: Pointwise approaches (Fuhr (1989), Cossock and Zhang (2006), Li et al. (2007)), Pairwise approaches (RankNet (Burges et al., 2005)) and Listwise approaches (LambdaRank (Burges, 2010), ListNet (Cao et al., 2007), ListMLE (Xia et al., 2008)).

2.4. Cost Sensitive Learning

Cost-sensitive learning refers to the problem of learning models when different misclassification errors incur different penalties (Elkan (2001)). Zadrozny et al. (2003) showed that learning algorithms can be converted into cost-sensitive learning algorithms by cost-proportionate weighing of training examples or by rejection-based subsampling. Dmochowski et al. (2010) showed that, for the cost-sensitive scenario, the empirical loss is upper bounded by the negative weighted log-likelihood. They further argue that weighted maximum likelihood should be the standard approach for solving cost-sensitive problems.
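To make the rejection-based subsampling idea concrete, here is a minimal sketch in the spirit of Zadrozny et al. (2003): each example is kept with probability proportional to its cost, so a cost-insensitive learner trained on the subsample sees costly examples more often. The function name, the choice of the maximum cost as the normalizer and the fixed seed are assumptions for illustration.

```python
import random


def rejection_sample(examples, costs, seed=0):
    """Keep each example with probability cost / z, where z is the largest cost."""
    if len(examples) != len(costs):
        raise ValueError("examples and costs must have the same length")
    z = max(costs)
    if z <= 0:
        raise ValueError("at least one cost must be positive")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


if __name__ == "__main__":
    pairs = ["pair_a", "pair_b", "pair_c", "pair_d"]   # hypothetical query-ad pairs
    costs = [0.9, 0.1, 0.5, 0.0]                       # hypothetical costs
    print(rejection_sample(pairs, costs))
```

The alternative mentioned above, cost-proportionate weighing, keeps every example and multiplies its loss by the cost instead; that is the route the paper takes.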

3. Cost-sensitive Semantic Model

3.1. Proposed Formulation

Neural semantic models like CLSM (Shen et al., 2014) learn from click-through data. For a given query $Q$, the model expresses the posterior probability of the positive/clicked doc ($D^{+}$) as a softmax over the positive doc and J randomly selected unclicked documents:

$$P(D^{+} \mid Q) = \frac{\exp\left(R(Q, D^{+})\right)}{\sum_{D' \in \mathbf{D}} \exp\left(R(Q, D')\right)}$$

where $\mathbf{D}$ contains $D^{+}$ (clicked doc) and J randomly selected unclicked documents, and $R(Q, d)$ represents the cosine similarity between the semantic vectors of query Q and document d generated by the model.

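The model is learned by minimizing the negative log-likelihood of clicked query-ad pairs:

$$L(\Lambda) = -\log \prod_{Q \in \mathcal{Q}} \; \prod_{D^{+} \in \mathcal{D}_{Q}} P(D^{+} \mid Q)$$

where $\Lambda$ are the model parameters, $\mathcal{Q}$ is the set of all queries and $\mathcal{D}_{Q}$ is the set of all documents which have received a click when displayed for query Q (based on click logs).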
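The softmax posterior and the negative log-likelihood described above can be sketched with plain Python. This is a minimal illustration, not the paper's implementation: the similarity scores are passed in directly as numbers, no smoothing factor is applied to them (an assumption), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def posterior_clicked(r_pos, r_negs):
    """Softmax probability of the clicked doc against J sampled unclicked docs."""
    scores = [r_pos] + list(r_negs)
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)


def negative_log_likelihood(pairs):
    """pairs: iterable of (r_pos, r_negs), one entry per clicked query-doc pair."""
    return -sum(math.log(posterior_clicked(r_pos, r_negs)) for r_pos, r_negs in pairs)


if __name__ == "__main__":
    # Hypothetical similarity scores for one clicked pair and J = 4 negatives.
    print(negative_log_likelihood([(0.8, [0.1, 0.3, -0.2, 0.0])]))
```

With a single negative (J = 1), `negative_log_likelihood` reduces to the logistic pair-wise form log(1 + exp(-(r_pos - r_neg))), which is the case the text turns to next.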

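For the case of J=1 (i.e. one negatively sampled doc $D^{-}$ per clicked doc $D^{+}$), we have:

$$L(\Lambda) = \sum_{Q} \sum_{D^{+}} \log\left(1 + \exp\left(-\left(R(Q, D^{+}) - R(Q, D^{-})\right)\right)\right)$$

Assuming the true label to be 1 for $D^{+}$ and 0 for $D^{-}$, this loss is the same as the pair-wise loss from Burges et al. (2005). As discussed earlier, treating all clicked query-document pairs as equally relevant is not optimal, since click-through data is noisy and has a long-tail nature. To address this issue, we propose to assign the label $y(Q, D)$ of document D based on its probability of generating a click for query Q. Based on click logs, the probability of click for a query-doc pair can be estimated by its click-through rate. Given these real-valued labels, we can now directly optimize a list-wise loss (like DCG). As shown in Burges (2010), DCG can be efficiently optimized by modifying the gradient of each pair-wise term $L_{Q, D^{+}, D^{-}}$ of the loss as follows:

$$\frac{\partial \tilde{L}_{Q, D^{+}, D^{-}}}{\partial \Lambda} = |\Delta \mathrm{DCG}| \cdot \frac{\partial L_{Q, D^{+}, D^{-}}}{\partial \Lambda}$$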

Figure 3. Weighted training of CLSM model incorporates domain knowledge to weigh training pairs.

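where $|\Delta \mathrm{DCG}|$ is the change in DCG on swapping the ranks of $D^{+}$ and $D^{-}$.

For a query Q, let $D_1, \ldots, D_k$ be the top k predicted documents based on model scores, with corresponding true labels $y_1, \ldots, y_k$. Then, DCG@k is defined as:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{y_i}{\log_2(1 + i)}$$

For a training pair $(Q, D^{+})$ with $D^{+}$ at rank $i$, the change in DCG on swapping rank position with a negative doc $D^{-}$ at rank $j$ is:

$$|\Delta \mathrm{DCG}| = \left| y(Q, D^{+}) - y(Q, D^{-}) \right| \cdot \left| \frac{1}{\log_2(1 + i)} - \frac{1}{\log_2(1 + j)} \right|$$

(Note that $y(Q, D^{-})$ will be 0, since $D^{-}$ has zero clicks.)

For J=1, i.e. $\mathbf{D} = \{D^{+}, D^{-}\}$, the two documents occupy ranks 1 and 2, so:

$$|\Delta \mathrm{DCG}| = y(Q, D^{+}) \left(1 - \frac{1}{\log_2 3}\right) \propto y(Q, D^{+})$$

Up to a constant factor, this is equivalent to optimizing the following loss function:

$$L(\Lambda) = -\sum_{Q} \sum_{D^{+}} y(Q, D^{+}) \, \log P(D^{+} \mid Q) \qquad (1)$$

This can be interpreted as weighing each training point by the weight $y(Q, D^{+})$. This shows that in the CLSM scenario, weighing the training data is the same as optimizing DCG, rather than a pair-wise loss. Figure 3 shows a graphical representation of our approach, where the loss for each training point is weighed based on domain knowledge.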
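The proportionality between the DCG change and the label of the clicked document can be checked numerically. The sketch below assumes a linear gain (the label itself) and a log2 position discount for DCG, which is an assumption about the exact DCG form; the function names are hypothetical.

```python
import math


def dcg_at_k(labels, k):
    """DCG with linear gain and log2 discount (assumed form)."""
    return sum(y / math.log2(1 + i) for i, y in enumerate(labels[:k], start=1))


def delta_dcg_on_swap(labels, i, j):
    """Absolute change in DCG when the documents at 0-based positions i and j swap."""
    swapped = list(labels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    k = len(labels)
    return abs(dcg_at_k(labels, k) - dcg_at_k(swapped, k))


if __name__ == "__main__":
    # J = 1: one clicked doc with label y (its CTR) and one unclicked doc with label 0.
    for y in (0.1, 0.3, 0.9):                  # hypothetical CTR values
        print(y, delta_dcg_on_swap([y, 0.0], 0, 1) / y)
```

Under these assumptions the printed ratio is the same for every label, which is the sense in which scaling the pair-wise gradient by the DCG change amounts to weighing each training pair by its label.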

The proposed loss function is general in two ways. First, it can be used to learn any representation and interaction based neural semantic model. Second, different weighing strategies (other than CTR) can also be used based on the problem domain.
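A minimal sketch of the weighted loss, assuming the softmax posterior over one clicked and J unclicked documents and a per-pair weight such as CTR. The function names and the example numbers are hypothetical, and the similarity scores are supplied directly rather than produced by a trained model.

```python
import math


def posterior_clicked(r_pos, r_negs):
    """Softmax probability of the clicked doc against the sampled unclicked docs."""
    scores = [r_pos] + list(r_negs)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)


def weighted_nll(batch):
    """batch: iterable of (weight, r_pos, r_negs); each pair's loss is scaled by its weight."""
    return -sum(w * math.log(posterior_clicked(r_pos, r_negs))
                for w, r_pos, r_negs in batch)


if __name__ == "__main__":
    batch = [
        (0.95, 0.7, [0.1, 0.2, 0.0, -0.1]),   # confident pair, large weight
        (0.05, 0.2, [0.3, 0.1, 0.4, 0.0]),    # noisy pair, small weight
    ]
    print(weighted_nll(batch))
```

Setting every weight to 1 recovers the unweighted loss, so the same training loop can serve both variants; only the weight column changes when a different weighing strategy is plugged in.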

3.2. Relation to Cost-Sensitive Learning

While learning semantic models for ad retrieval, the cost of misclassification is not the same for all documents. Most sponsored ad systems use a cost-per-click model and hence try to maximize click-through rates (CTR). So, for a given query, the cost of misclassification is higher for a doc with a larger expected CTR than for a doc with a lower expected CTR. With this motivation, we treat the problem of learning semantic models using click data as a cost-sensitive learning problem. As shown in Zadrozny et al. (2003) and Dmochowski et al. (2010), the standard approach to solving such problems is to optimize the weighted log-likelihood of the data, where weights are set according to the "costliness" of misclassifying each example:

$$L(\Lambda) = -\sum_{Q} \sum_{D^{+}} c(Q, D^{+}) \, \log P(D^{+} \mid Q)$$

where $c(Q, D^{+})$ is the cost and can be set as

$$c(Q, D^{+}) = y(Q, D^{+}) - y(Q, D^{-}) = y(Q, D^{+})$$

(since $y(Q, D^{-}) = 0$), with $y(Q, D)$ the CTR-based label of Section 3.1.

This shows that proposed weighted loss function of eq. 1 can also be derived by treating semantic model learning as a cost-sensitive learning problem.

3.3. Bounds on Ranking Measure

Chen et al. (2009) showed that, for a given query, pair-wise loss upper bounds (1-NDCG).

where , f is the learned ranking function, G is an increasing function (called Gain function), D is a decreasing function (called position discount function), is the DCG of ideal ranking, is the set of documents to be ranked, are labels for x and is the pair-wise loss. As shown in Chen et al. (2009), this bound can be tightened by introducing weights proportional to in the pair-wise loss as follows:

where is the logistic function ().

Our formulation of eq. 1 can be derived by setting weights as follows:

Since i.e. . This establishes that proposed CTR based weighting of pair-wise loss tightens the upper bound on (1-NDCG).

3.4. Weighing Strategies

We propose a set of guiding principles, which can be used for coming up with weighing strategies:

  1. Weight should be global in nature, i.e. the weight for a (query, doc) pair should be comparable across documents and queries.

  2. Weight should not be biased towards head/torso/tail queries.

  3. Weight should not be biased towards head/torso/tail documents.

  4. Weight should be proportional to the clicks.

(2) and (3) ensure that the learned model is not biased towards a particular set of queries or documents, (1) ensures that global threshold tuning can be done on model outputs and (4) ensures that more clickable pairs are ranked higher.

Given click logs over N queries and M documents, let $I_{ij}$, $c_{ij}$ and $w_{ij}$ represent the number of impressions, number of clicks and weight for the $\{Q_i, D_j\}$ pair. We can then define the following weighing strategies:

nClicks: Computed as the number of clicks for a query-document pair normalized by the total clicks for the query over all documents: $w_{ij} = c_{ij} / \sum_{m=1}^{M} c_{im}$.

CTR: Refers to click-through rate and is computed as the number of clicks divided by the number of impressions: $w_{ij} = c_{ij} / I_{ij}$.
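Both weighing strategies can be computed directly from aggregated click logs. The sketch below assumes the logs are available as a dictionary mapping (query, doc) to (impressions, clicks); that layout, the function names and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def ctr_weights(logs):
    """CTR weight: clicks divided by impressions for each (query, doc) pair."""
    return {pair: clicks / imps for pair, (imps, clicks) in logs.items() if imps > 0}


def nclicks_weights(logs):
    """nClicks weight: clicks of the pair divided by total clicks of its query."""
    clicks_per_query = defaultdict(int)
    for (query, _doc), (_imps, clicks) in logs.items():
        clicks_per_query[query] += clicks
    return {(query, doc): clicks / clicks_per_query[query]
            for (query, doc), (_imps, clicks) in logs.items()
            if clicks_per_query[query] > 0}


if __name__ == "__main__":
    logs = {                                  # hypothetical aggregated click log
        ("q1", "ad_a"): (100, 30),
        ("q1", "ad_b"): (50, 10),
        ("q2", "ad_a"): (10, 5),
    }
    print(ctr_weights(logs))
    print(nclicks_weights(logs))
```

In this toy log the only ad clicked for `q2` gets an nClicks weight of 1.0 regardless of how few clicks it has, which illustrates why nClicks weights are hard to compare across queries.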

| Weighing Strategy | P-1 | P-2 | P-3 | P-4 |
|---|---|---|---|---|
| nClicks | N | Y | N | Y |
| CTR | Y | Y | Y | Y |

Table 1. Weighing strategies and the guiding principles they satisfy.

Table 1 shows the guiding principles satisfied by each of these weighing strategies. CTR satisfies all 4 principles.

4. Experiments

In this section, we discuss the experimental results to demonstrate the impact of weighing the click-through data on CLSM model training. We compare following approaches of training CLSM model:

  1. Curated Training: Only high-confidence clicked query-ad pairs are used for training, where high-confidence query-ad pairs are those with CTR greater than the market average.

  2. Unweighted Training: All historically clicked query-ad pairs are used for training with equal weight.

  3. Weighted Training: All historically clicked query-ad pairs are used for training and weights for these pairs are set based on weighting strategies discussed in section 3.4.
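The three training regimes differ only in which clicked pairs are kept and what weight each pair carries. A minimal sketch, assuming aggregated logs of the form (query, ad) -> (impressions, clicks), a given market-average CTR, and CTR as the weighing strategy; all names and numbers are hypothetical.

```python
def build_training_sets(logs, market_avg_ctr):
    """Return curated, unweighted and weighted training sets as (pair, weight) lists."""
    clicked = {pair: (imps, clicks) for pair, (imps, clicks) in logs.items()
               if clicks > 0 and imps > 0}
    # Curated: only pairs whose CTR exceeds the market average, all with weight 1.
    curated = [(pair, 1.0) for pair, (imps, clicks) in clicked.items()
               if clicks / imps > market_avg_ctr]
    # Unweighted: every clicked pair, all with weight 1.
    unweighted = [(pair, 1.0) for pair in clicked]
    # Weighted: every clicked pair, weighted by its CTR.
    weighted = [(pair, clicks / imps) for pair, (imps, clicks) in clicked.items()]
    return curated, unweighted, weighted


if __name__ == "__main__":
    logs = {                                  # hypothetical aggregated click log
        ("cheap flights", "ad_1"): (200, 40),
        ("flights to goa in may", "ad_2"): (4, 1),
        ("hotel", "ad_3"): (1000, 20),
        ("hotel", "ad_4"): (500, 0),          # never clicked, excluded everywhere
    }
    for name, rows in zip(("curated", "unweighted", "weighted"),
                          build_training_sets(logs, market_avg_ctr=0.05)):
        print(name, rows)
```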

These models are compared using two sets of experiments. Firstly, models are evaluated against a human-labeled set of query-ad pairs and AUC numbers are reported to show that weighing doesn’t deteriorate the offline metric. Secondly, A/B testing is performed on live search engine traffic and user interaction metrics are reported.

To prove the efficacy of weighing in a general scenario, we also evaluate the performance of weighted model on Amazon co-purchased dataset discussed in (McAuley et al., 2015) and report nDCG scores.

4.1. Offline Experiments on Sponsored Search

4.1.1. Dataset and Evaluation Methodology

We take the training and evaluation sets from Bing sponsored search data for the travel vertical. The training data had 11M clicked query-ad pairs and the evaluation data set had 154K human-labeled pairs. The ad is represented by the ad title (as suggested in (Shen et al., 2014)). All pairs are then preprocessed such that the text is white-space tokenized, lowercased and alphanumeric in nature. We don't perform any stemming/inflection.

(a)

| Type of Query | Percentage Share |
|---|---|
| Head | 11.25% |
| Torso | 35.08% |
| Tail | 53.67% |

(b)

| Total | Positive (1) | Negative (0) |
|---|---|---|
| 154K | 115K | 39K |

Table 2. Human-labeled Query-Ad evaluation set

Evaluation data collected from search log is labeled by human judges into positive(1) and negative(0) pairs and Table 2 shows the distribution. Performance of different models is evaluated using Area Under the ROC Curve (AUC-ROC) and Area Under the Precision-Recall Curve (AUC-PR). Note that, the candidate set of documents for a query changes very frequently in sponsored search and labeling of complete candidate document set for queries is not feasible. Hence, a random set of query-document pairs were labeled and AUC is reported.
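For reference, both metrics can be computed from binary labels and model scores without any external library. In this sketch AUC-ROC is the fraction of positive-negative pairs ranked correctly (ties count half), and average precision is used as one common estimate of the area under the precision-recall curve; the paper does not state which estimator it used, so that choice is an assumption. The quadratic pair loop is fine for a toy example but would need a rank-based formulation for a set of 154K pairs.

```python
def auc_roc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive (assumes >= 1 positive)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_score, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


if __name__ == "__main__":
    labels = [1, 0, 1, 0]                 # hypothetical human judgments
    scores = [0.9, 0.8, 0.7, 0.1]         # hypothetical model similarities
    print(auc_roc(labels, scores), average_precision(labels, scores))
```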

4.1.2. Model Comparisons

First set of models (Curated and Unweighted) treat all training pairs to be of equal importance i.e. no weighing (Row 1 and Row 2 in Table 3). Further, we create a second set of weighted models where each query-ad pair is weighted using strategies discussed in section 3.4 i.e. nClicks and CTR.

The neural network weights for each of the models were randomly initialized as suggested in (Müller and Orr, 1998). The models are trained using mini-batch stochastic gradient descent with a mini-batch size of 1024 training samples. Each training point was associated with 4 randomly selected negative samples during training.
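The batching and negative sampling described above can be sketched as follows. The text only says that negatives are randomly selected; drawing them from the ads of other training pairs and excluding ads clicked for the same query are assumptions made here, as are the function names and the fixed seed.

```python
import random


def sample_negatives(pairs, num_neg=4, seed=0):
    """Attach num_neg randomly chosen unclicked ads to every clicked (query, ad) pair."""
    rng = random.Random(seed)
    clicked = {}
    for query, ad in pairs:
        clicked.setdefault(query, set()).add(ad)
    all_ads = sorted({ad for _query, ad in pairs})
    samples = []
    for query, ad in pairs:
        # Assumption: an ad clicked for this query is never used as its negative.
        candidates = [a for a in all_ads if a not in clicked[query]]
        negatives = rng.sample(candidates, min(num_neg, len(candidates)))
        samples.append((query, ad, negatives))
    return samples


def minibatches(items, batch_size=1024):
    """Yield consecutive mini-batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    pairs = [("q%d" % i, "ad%d" % i) for i in range(6)]   # hypothetical clicked pairs
    for sample in sample_negatives(pairs):
        print(sample)
```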

| # | Model Type | AUC-ROC | AUC-PR |
|---|---|---|---|
| 1 | Curated | 75.03% | 89.62% |
| 2 | Unweighted | 78.23% | 90.89% |
| 3 | Weighted-nClicks | 78.38% | 91.03% |
| 4 | Weighted-CTR | 78.61% | 91.22% |

Table 3. Offline evaluation of different CLSM models for sponsored search on human-labeled set.

4.1.3. Results

As shown in Table 3, the Unweighted model shows 1.27 percentage points higher AUC-PR and 3.2 percentage points higher AUC-ROC than the curated model on the human-labeled evaluation set. Weighing further improves on the unweighted model, with the best performing weighted model (Weighted-CTR) having 0.33 points higher AUC-PR and 0.38 points higher AUC-ROC. This demonstrates that on offline human-labeled data, the weighted model performs at least as well as the unweighted model, in fact slightly improving the overall AUC numbers. These improvements were observed by running multiple iterations with different random weight initializations.

| Query Type | Model | AUC-ROC | AUC-PR |
|---|---|---|---|
| Torso | Curated | 79.07% | 92.61% |
| Torso | Unweighted | 81.62% | 93.55% |
| Torso | Weighted-CTR | 81.83% | 93.57% |
| Tail | Curated | 71.48% | 85.84% |
| Tail | Unweighted | 75.45% | 87.74% |
| Tail | Weighted-CTR | 75.85% | 88.40% |

Table 4. Evaluation of CLSM model for sponsored search on human labeled data for torso and tail queries.

Table 4 shows the AUC gains when we break down the total gains into torso and tail queries. We don't consider head queries here because head queries generally trigger exact matches rather than advanced matches. The weighted model shows better AUC, especially in the tail bucket, where it has 0.66 percentage points higher AUC-PR than the unweighted model and 2.56 points higher AUC-PR than the curated model.

| Purchased product | Model | Title of top-1 returned product |
|---|---|---|
| comfort control harness large | Weighted | comfort control harness xxl blue |
| | Unweighted | comfort control harness |
| ford gt red remote control car rc cars | Weighted | licensed shelby mustang gt500 super snake rtr remote control rc cars |
| | Unweighted | lamborghini gallardo superleggera radio remote control car color yellow |
| nasa mars curiosity rover spacecraft poster 13x19 | Weighted | space shuttle blastoff poster 24x36 |
| | Unweighted | imagination nebula motivational photography poster print 24x36 inch |
| wonderful wonder world clockmaker | Weighted | alice country clover ace hearts |
| | Unweighted | olympos |

Table 5. Examples of the top product returned by the CLSM model trained on the Amazon co-purchased dataset. The weighted model is able to capture the specific intent, while products predicted by the unweighted model are more generic. Bold words are those non-overlapping words which contribute to the ten most active neurons at the max-pooling layer.

4.2. Online Experiments on Sponsored Search

4.2.1. Dataset and Evaluation Methodology

Since the domain of sponsored search is very dynamic and an ad being relevant doesn't imply clickability, offline evaluation has limited power and is not sufficient. We perform online A/B testing of the curated, unweighted and weighted-CTR models on Bing live traffic and report user interaction metrics.

% change of metrics in A as compared to B:

| A | B | Clicks | Click-through rate (CTR) | Bounce rate |
|---|---|---|---|---|
| Unweighted | Curated | +10.11% | +10.08% | +2.85% |
| Weighted | Curated | +22.78% | +23.13% | -5.63% |
| Weighted | Unweighted | +11.51% | +11.85% | -8.25% |

Table 6. Online A/B testing of different CLSM models. Weighted model shows 23% higher CTR and 5.6% lower bounce rate as compared to curated model, while 11.8% higher CTR and 8.2% lower bounce rate as compared to unweighted model.

4.2.2. Model Comparisons

We assigned equal online traffic to each of the three models. We also ensured that every other setting on our ad stack is kept the same for these models. We compare these models on three major performance indicators: total clicks, CTR (Click through rate) and bounce rate.

4.2.3. Results

Table 6 shows the A/B testing results when different CLSM models are used for ad retrieval. First, compared to the curated model, the unweighted model generated 10.11% more clicks at a 10.08% higher click-through rate due to better exploration for tail queries, but bounce rate deteriorated by 2.85%. Second, the weighted model performed better than the curated model on all metrics: it generated 22.78% more clicks at a 23.13% higher CTR while reducing bounce rate by 5.63%. Third, the weighted model also showed significantly better metrics than the unweighted model. Ads retrieved by the weighted model generated 11.51% more clicks while increasing CTR by 11.85% and simultaneously reducing bounce rate by 8.25%.
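As a quick arithmetic check on Table 6, relative changes compose multiplicatively: if the unweighted model changes a metric by a% over curated and the weighted model changes it by b% over unweighted, the weighted model should change it by roughly (1 + a/100)(1 + b/100) - 1 over curated. The sketch below assumes the three comparisons were measured on comparable traffic, which the equal traffic split suggests but does not guarantee; the function name is hypothetical.

```python
def compose_changes(first_pct, second_pct):
    """Combined percentage change of two successive relative changes."""
    return ((1 + first_pct / 100.0) * (1 + second_pct / 100.0) - 1) * 100.0


if __name__ == "__main__":
    # (Unweighted vs Curated, Weighted vs Unweighted, reported Weighted vs Curated)
    rows = {
        "clicks": (10.11, 11.51, 22.78),
        "ctr": (10.08, 11.85, 23.13),
        "bounce rate": (2.85, -8.25, -5.63),
    }
    for metric, (a, b, reported) in rows.items():
        print(metric, round(compose_changes(a, b), 2), "reported:", reported)
```

For all three metrics the composed value lands within 0.01 percentage points of the reported Weighted vs Curated row, so the three rows of the table are mutually consistent.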

These results clearly demonstrate the impact of weighing on semantic model training. It is important to note here that apart from increasing the clicks and CTR, there is a huge reduction in terms of bounce rate when we move from unweighted to weighted. We see an increase in bounce rate when we move from curated to unweighted and we attribute this to numerous noisy pairs being directly added to training dataset with equal importance.

4.3. Amazon Product Recommendation

4.3.1. Dataset and Evaluation Methodology

In order to test the efficacy of weighing independent of domain, we experiment with the Amazon co-purchased dataset (McAuley et al., 2015) of related products. This dataset contains a list of product pairs $(P_1, P_2)$ purchased together by users. It contains co-purchase data for 24 categories with a total of about 9 million unique products. We use this dataset to learn semantic models which can be used for the product recommendation task. On average, each product is associated with 36 co-purchased products. We make an 80-20 split of the dataset into train and evaluation sets such that there is no product overlap between them. Models are evaluated on the 20% holdout set.
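The text does not say how the overlap-free 80-20 split was produced. One possible way, shown here purely as an illustration, is to assign every product to a side by hashing its identifier and to keep only pairs whose two products fall on the same side; cross-side pairs are dropped, so the resulting pair counts will not be exactly 80-20. The function names and identifiers are hypothetical.

```python
import hashlib


def side_of(product_id, eval_percent=20):
    """Deterministically assign a product to 'train' or 'eval' via a hash bucket."""
    digest = hashlib.md5(product_id.encode("utf-8")).hexdigest()
    return "eval" if int(digest, 16) % 100 < eval_percent else "train"


def split_pairs(pairs, eval_percent=20):
    """Split co-purchase pairs so that no product appears on both sides."""
    train, evaluation = [], []
    for p1, p2 in pairs:
        s1, s2 = side_of(p1, eval_percent), side_of(p2, eval_percent)
        if s1 == s2 == "train":
            train.append((p1, p2))
        elif s1 == s2 == "eval":
            evaluation.append((p1, p2))
        # Pairs that straddle the two sides are discarded.
    return train, evaluation


if __name__ == "__main__":
    pairs = [("p%d" % i, "p%d" % (i + 1)) for i in range(1000)]  # hypothetical pairs
    train, evaluation = split_pairs(pairs)
    print(len(train), len(evaluation))
```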

| Model | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|
| Unweighted | 63.71 | 69.08 | 72.84 | 79.75 |
| Weighted (Jaccard) | 63.98 | 69.33 | 73.07 | 79.89 |

Table 7. Evaluation of CLSM model trained for product recommendation task on Amazon co-purchased dataset. NDCG metrics are reported on holdout set.

4.3.2. Model Comparison

First, the unweighted CLSM model is trained by assigning equal weight to each co-purchased product pair. Second, the weighted CLSM model is trained by assigning a Jaccard index based weight to each product pair $(P_1, P_2)$. The Jaccard index measures the similarity between two sets and is defined as the size of the intersection divided by the size of the union. Let $Nr(P)$ represent the set of neighbors of product $P$, i.e. the set of products which were co-purchased with $P$; then the Jaccard index between two products $P_1$ and $P_2$ can be defined as:

$$JI(P_1, P_2) = \frac{|Nr(P_1) \cap Nr(P_2)|}{|Nr(P_1) \cup Nr(P_2)|}$$
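A minimal sketch of the Jaccard weight computed from a list of co-purchase pairs. Treating co-purchase as a symmetric relation when building the neighbor sets is an assumption, and the function names and sample pairs are hypothetical.

```python
from collections import defaultdict


def build_neighbors(pairs):
    """Map each product to the set of products it was co-purchased with."""
    neighbors = defaultdict(set)
    for p1, p2 in pairs:
        neighbors[p1].add(p2)
        neighbors[p2].add(p1)   # assumption: co-purchase is symmetric
    return neighbors


def jaccard_weight(neighbors, p1, p2):
    """Size of the intersection of the two neighbor sets divided by size of their union."""
    n1 = neighbors.get(p1, set())
    n2 = neighbors.get(p2, set())
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


if __name__ == "__main__":
    pairs = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "z"), ("a", "b")]
    neighbors = build_neighbors(pairs)
    print(jaccard_weight(neighbors, "a", "b"))
```

Two products that share many co-purchased neighbors get a weight close to 1, while a pair bought together only incidentally gets a weight close to 0, which plays the role CTR plays in the sponsored search setting.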


The performance of models has been measured by mean Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2000), and we report NDCG scores at truncation levels 1, 3, 5 and 10.
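Mean NDCG at a truncation level can be computed as below. The sketch assumes a linear gain and a log2 position discount, and that each evaluation query comes with the true labels of its results listed in the order the model ranked them; the exact gain and discount used in the paper are not stated, and the function names are hypothetical.

```python
import math


def dcg_at_k(labels, k):
    """DCG with linear gain and log2 discount (assumed form)."""
    return sum(y / math.log2(1 + i) for i, y in enumerate(labels[:k], start=1))


def ndcg_at_k(ranked_labels, k):
    """DCG of the model's ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(all_ranked_labels, k):
    """Average NDCG@k over all evaluation queries."""
    return sum(ndcg_at_k(labels, k) for labels in all_ranked_labels) / len(all_ranked_labels)


if __name__ == "__main__":
    # Hypothetical labels of the returned products, in model-ranked order.
    queries = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
    for k in (1, 3, 5, 10):
        print(k, round(100 * mean_ndcg(queries, k), 2))
```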

4.3.3. Results

Table 7 shows the efficacy of the weighted model over the unweighted model in terms of NDCG scores. We see a gain of 0.27 in NDCG@1 and similar gains at other truncation levels. This result shows that weighing is domain agnostic.

5. Discussion

Each section of Table 5 contains the title of the purchased product and the title of the top-1 product returned by the weighted and unweighted models. We further highlight those non-overlapping words between the two titles that contribute to the ten most active neurons at the max-pooling layer.

Figure 4. Top neuron triggers at max-pool layer being mapped to contributing words. Weighted model maps top words like ford, gt, cars to a particular car model by ford. Whereas Unweighted model maps ford, gt, cars to a car model from other company.

In the first example, we see that apart from syntactic word matches, the weighted model is able to match the size, as large and xxl in the respective product titles appear among the ten most active neurons at the max-pooling layer. In the second example, the most active neurons for the weighted model correspond to words like ford, gt; shelby, mustang and snake, all of which refer to a particular car model by Ford named "ford mustang shelby GT500 super snake". For the unweighted model, the most active neurons correspond to words like cars, car, remote, rc, ford; lamborghini, gallardo, superleggera, remote. So the unweighted model only captures the general intent, rather than the specific intent captured by the weighted model. To further examine the learning, we trace the neurons with high activation at the max-pooling layer to the words from the product title. Figure 4 shows that while the weighted model's retrieval is governed by the similarity between words related to a particular car, the retrieval of the unweighted model is governed by a general similarity between two different car models. We see that due to weighing, we are able to retrieve the specific car as opposed to any generic car. Similar observations were made in examples 3 and 4, where the weighted model recommended more specific products as opposed to a generic poster or book.
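The tracing step can be sketched as follows: for every neuron, max-pooling keeps the activation of one window position, so the word at the center of that window can be read off as the word that triggered the neuron. The activation matrix below is made up for illustration, and the function name is hypothetical.

```python
def trace_active_neurons(conv_features, words, top_n=10):
    """Map the top_n most active max-pooled neurons back to their source words.

    conv_features[t][k] is the activation of neuron k for the window centered on words[t].
    Returns (neuron index, word, activation) triples, most active first.
    """
    num_neurons = len(conv_features[0])
    pooled = []
    for k in range(num_neurons):
        # Position whose window gives neuron k its maximum, i.e. what max-pooling keeps.
        t_best = max(range(len(words)), key=lambda t: conv_features[t][k])
        pooled.append((conv_features[t_best][k], k, t_best))
    pooled.sort(reverse=True)
    return [(k, words[t], activation) for activation, k, t in pooled[:top_n]]


if __name__ == "__main__":
    words = ["ford", "gt", "cars"]
    conv_features = [            # hypothetical activations: 3 positions x 3 neurons
        [0.1, 0.9, 0.2],
        [0.8, 0.3, 0.1],
        [0.2, 0.1, 0.7],
    ]
    print(trace_active_neurons(conv_features, words, top_n=2))
```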

6. Conclusion

This paper developed a cost-sensitive approach to training semantic models for IR. It extended the pair-wise loss of the CLSM model by re-weighing the training data points and evaluated various weighing techniques for the same. A/B testing of the proposed model on Bing sponsored search showed significant improvements in click-through rate and bounce rate. The proposed idea is general and is applicable to different tasks like retrieval, recommendation, classification etc.


  • Bendersky et al. (2011) Michael Bendersky, Donald Metzler, and W. Bruce Croft. 2011. Parameterized concept weighing in Verbose Queries (SIGIR).
  • Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. JMLR (2003).
  • Burges (2010) Christopher Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. (2010).
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank Using Gradient Descent. In ICML.
  • Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to Rank: From Pairwise Approach to Listwise Approach (ICML).
  • Chen et al. (2009) Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhiming Ma, and Hang Li. 2009. Ranking Measures and Loss Functions in Learning to Rank (NIPS).
  • Cossock and Zhang (2006) David Cossock and Tong Zhang. 2006. Subset Ranking Using Regression (COLT).
  • Deerwester et al. (1990) Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science (1990).
  • Dmochowski et al. (2010) Jacek P Dmochowski, Paul Sajda, and Lucas C Parra. 2010. Maximum likelihood in cost-sensitive learning: Model specification, approximations, and upper bounds. Journal of Machine Learning Research (2010).

  • Elkan (2001) Charles Elkan. 2001. The Foundations of Cost-sensitive Learning (IJCAI).
  • Fuhr (1989) Norbert Fuhr. 1989. Optimum Polynomial Retrieval Functions Based on the Probability Ranking Principle. TOIS (1989).
  • Gao et al. (2010) Jianfeng Gao, Xiaodong He, and Jian-Yun Nie. 2010. Clickthrough-based Translation Models for Web Search: From Word Models to Phrase Models (CIKM).
  • Gao et al. (2004) Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, and Guihong Cao. 2004. Dependence Language Model for Information Retrieval (SIGIR).
  • Gao et al. (2011) Jianfeng Gao, Kristina Toutanova, and Wen-tau Yih. 2011. Clickthrough-based Latent Semantic Models for Web Search (SIGIR).
  • Guo et al. (2016) Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval (CIKM).
  • Hofmann (1999) Thomas Hofmann. 1999. Probabilistic Latent Semantic Indexing (SIGIR).
  • Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences (NIPS).
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data (CIKM).
  • Järvelin and Kekäläinen (2000) Kalervo Järvelin and Jaana Kekäläinen. 2000. IR evaluation methods for retrieving highly relevant documents. ACM.
  • Jones et al. (2000) K. Sparck Jones, S. Walker, and S. E. Robertson. 2000. A Probabilistic Model of Information Retrieval: Development and Comparative Experiments. Information Processing and Management (2000).
  • Lavrenko and Croft (2001) Victor Lavrenko and W. Bruce Croft. 2001. Relevance Based Language Models (SIGIR).
  • Li et al. (2007) Ping Li, Christopher J. C. Burges, and Qiang Wu. 2007. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting.

  • Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundation and Trends in Information Retrieval (2009).
  • Lv and Zhai (2009) Yuanhua Lv and ChengXiang Zhai. 2009. Positional Language Models for Information Retrieval (SIGIR).
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes (SIGIR).
  • Metzler and Croft (2005) Donald Metzler and W. Bruce Croft. 2005. A Markov Random Field Model for Term Dependencies (SIGIR).
  • Metzler and Croft (2007) Donald Metzler and W. Bruce Croft. 2007. Latent Concept Expansion Using Markov Random Fields (SIGIR).
  • Müller and Orr (1998) K.R. Müller and G.B. Orr. 1998. How to Make Neural Networks Work: Tips and Tricks of the Trade. Springer.
  • Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching As Image Recognition (AAAI).
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval (CIKM).
  • Wan et al. (2016) Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-SRNN: Modeling the Recursive Matching Structure with Spatial RNN (IJCAI).
  • Wei and Croft (2006) Xing Wei and W. Bruce Croft. 2006. LDA-based Document Models for Ad-hoc Retrieval (SIGIR).
  • Xia et al. (2008) Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise Approach to Learning to Rank: Theory and Algorithm (ICML).
  • Xiong et al. (2017) Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling (SIGIR).
  • Yang et al. (2016) Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model (CIKM).
  • Zadrozny et al. (2003) Bianca Zadrozny, John Langford, and Naoki Abe. 2003. Cost-Sensitive Learning by Cost-Proportionate Example Weighting (ICDM).