Micro-Browsing Models for Search Snippets

10/18/2018 ∙ by Muhammad Asiful Islam, et al. ∙ Google

Click-through rate (CTR) is a key signal of relevance for search engine results, both organic and sponsored. CTR of a result has two core components: (a) the probability of examination of a result by a user, and (b) the perceived relevance of the result given that it has been examined by the user. There has been considerable work on user browsing models, to model and analyze both the examination and the relevance components of CTR. In this paper, we propose a novel formulation: a micro-browsing model for how users read result snippets. The snippet text of a result often plays a critical role in the perceived relevance of the result. We study how particular words within a line of snippet can influence user behavior. We validate this new micro-browsing user model by considering the problem of predicting which snippet will yield higher CTR, and show that classification accuracy is dramatically higher with our micro-browsing user model. The key insight in this paper is that varying relatively few words within a snippet, and even their location within a snippet, can have a significant influence on the clickthrough of a snippet.

I Introduction

Web search engines have become an essential tool for navigating the vast amounts of information on the internet. Implicit user feedback, specifically click-through data, is valuable for optimizing search engine results [1, 11, 13, 5]. Click-through data plays an equally important role in estimating the quality of sponsored search results [14].

When using click-through data, the position bias of a result always has to be taken into account, since the same result usually gets a higher click-through rate (CTR) if it is positioned in a more prominent part (e.g., the top) of the page. There have been many papers on better estimating the probability of examination of a result, by using the pattern of clicks on prior results [6, 8, 9, 12], or by using both prior clicks and the relevance of prior results [4, 10, 21, 16]. These models are all macro-models, as they model the examination of the complete result without looking at features based on the content of the result (e.g., snippets of organic search results or creatives of sponsored search results).

However, the result snippet often contains certain words or phrases that make the user more likely to click the result. For example, a user looking for spacious seats while booking a flight might consider “more legroom” to be the most salient feature in the decision to click, whereas another user looking for a better deal may be interested in phrases like “cheap flights” or “20% off”. Once the user sees these words in the snippet, she may decide to click without examining the other words.

Contributions   The key insight in this paper is that relatively few word variations within a snippet, and even where within a snippet particular words are located, can have a meaningful influence on the clickthrough of a snippet. Section II discusses related research. In Section III, we propose a fine-grained model for estimating the probability that users actually read a word (or phrase) in the snippet. We call this the micro-browsing model for search result snippets. Section IV discusses the application of the micro-browsing model to the problem of predicting which of two creatives will have the higher CTR. Detailed experimental results and analysis are presented in Section V, demonstrating the effectiveness of the micro-browsing model. Finally, Section VI outlines some promising areas of future work.

II Related Work

Prior user browsing models for web search results can be partitioned into three groups based on how they estimate the probability that the user examines a specific position:

  • Models that assume examination is independent of the other results for the query.

  • Models that assume examination depends on the pattern of clicks on prior results.

  • Models that assume examination depends on both the pattern of clicks on prior results, and the relevance of prior results.

Some of the models were originally targeted at sponsored search, while others were targeted at organic search results. However, while the parameter values might differ, all of these models are general enough to apply to both organic search and sponsored search.

II-A Examination independent of other results

We use the following notation:

  • Let d_i denote the result at position i in a ranked list of results (whether organic results or sponsored results).

  • Let C_i denote a binary random variable that captures the event that a user clicks on d_i.

  • Let E_i denote the event that the user examines d_i.

The examination hypothesis, originally proposed by Richardson et al. [14] and formalized by Craswell et al. [6], observes that to be clicked, a result must be both examined and relevant:

P(C_i = 1) = P(E_i = 1) \cdot P(C_i = 1 | E_i = 1)    (1)

Richardson et al. [14] assume that the probability a result is viewed depends solely on its position, and is independent of other results.

II-B Examination depends on prior clicks

An implicit assumption in the above formulation is that the probability of examining the result in position i does not depend on click events in other result positions. A plethora of papers explore models that incorporate this information into the examination probabilities.

The cascade hypothesis [6] assumes that users scan each result sequentially without any skips:

P(E_1 = 1) = 1,    P(E_{i+1} = 1 | E_i = 0) = 0

The cascade model [6] further constrains that the user continues examining results until she clicks on a result, and does not examine any additional results after the click:

P(E_{i+1} = 1 | E_i = 1, C_i) = 1 - C_i    (2)

This model is quite restrictive since it allows at most one click per query instance.

The dependent click model (DCM) [10] generalizes the cascade model to instances with multiple clicks:

P(E_{i+1} = 1 | E_i = 1, C_i = 1) = \lambda_i,    P(E_{i+1} = 1 | E_i = 1, C_i = 0) = 1

The authors suggest estimating the position effects \lambda_i using maximum likelihood.

The user browsing model (UBM) [8] is also based on the examination hypothesis, but unlike the cascade model and DCM, does not force P(E_{i+1} = 1 | E_i = 1, C_i = 0) to be 1. In other words, it allows users to stop browsing the current results and instead reformulate the query (or perhaps give up). UBM assumes that the examination probability is determined by the preceding click position.

The Bayesian browsing model (BBM) [12] uses exactly the same browsing model as UBM. However, BBM adopts a Bayesian paradigm for relevance, i.e., BBM considers relevance to be a random variable with a probability distribution, rather than a fixed (but unknown) value to be estimated. In the context of this paper, where we are focused on the user browsing model, UBM and BBM can be considered equivalent.

II-C Examination depends on prior clicks and prior relevance

Next, we summarize models that take into account both clicks on prior results, and the relevance of those results, to predict the probability of examination.

The click chain model (CCM) [9] is a generalization of DCM, obtained by parameterizing the continuation probabilities and by allowing the user to abandon examination of more results:

P(E_{i+1} = 1 | E_i = 1, C_i = 0) = \alpha_1,    P(E_{i+1} = 1 | E_i = 1, C_i = 1) = \alpha_2 (1 - R_{d_i}) + \alpha_3 R_{d_i}

Thus if a user clicks on the previous result, the probability that they go on to examine more results ranges between \alpha_2 and \alpha_3, depending on the relevance R_{d_i} of the previous result.

The general click model (GCM) [21] treats all relevance and examination effects in the model as random variables, which allows online inference within the cascade family. The authors show that all previous models are special cases of GCM, obtained by suitable choices of these random variables.

II-D Post-click models

In our discussion so far, relevance referred to “perceived” relevance – whether the user considers the result relevant before she clicks on the result. Post-click relevance is a measure of whether the user was satisfied with their experience after clicking on the result. Perceived relevance is positively correlated with post-click relevance [15]. However, there are cases where perceived relevance is high and post-click relevance is low (e.g., snippet or creative is inaccurate), or vice versa (e.g., only a small fraction of people searching “yahoo” want answers.yahoo.com – but for those people, it’s perfect). Thus both perceived and post-click relevance are equally important for user satisfaction.

The dynamic Bayesian model (DBM) [4] uses the “user satisfaction” (post-click relevance) of the preceding click to predict whether the user will continue examining additional results:

P(E_{i+1} = 1 | E_i = 1, C_i = 1) = \gamma (1 - s_{d_i})

where s_{d_i} is the satisfaction of the user with the previous clicked result. They propose an EM-type estimation method to estimate \gamma and the user satisfaction variables.

The session utility model (SUM) [7] proposes a user browsing model based on the “intrinsic” (post-click) relevance of the sequence of clicked results in a user session. However, it does not model examination or pre-click relevance.

The click yield prediction framework [18] focuses on predicting group performance in sponsored search, and incorporates multiple factors, such as latent bias, textual features, interactive influences, and ad position correlation.

Coupled logistic regression (CLR) [19] uses features from the ad, the user, and the context, as well as nonlinear features such as conjunction information. It integrates the conjunction information by employing a factorization machine to achieve precise CTR prediction.

Borisov et al. [2] introduce a distributed representation-based approach for modeling user behavior and several neural click models, which learn patterns of user behavior directly from interaction data.

The click sequence model (CSM) [3] predicts the order in which a user will interact with search engine results. CSM is based on a neural network that follows the encoder-decoder architecture, and it uses an attention mechanism to extract the necessary information about the search engine results.

Zhao et al. [20] predict gaze by combining user browsing data with eye-tracking data from a small number of users in a grid-based recommender interface. They use hidden Markov models (HMMs), and evaluate their prediction models on MovieLens.

Our focus in this paper is on correctly estimating examination and perceived relevance. Thus in the rest of the paper, we will use “relevance” as shorthand for “perceived relevance”.

III Micro-Browsing Model

In this section, we describe the proposed micro-browsing model for search snippets.

III-A Model

For a given query q, let r_j be the probability that the term in position j in a snippet S is relevant to the query, and let e_j indicate whether that term has been examined by the user issuing that query, where e_j ∈ {0, 1} and 1 ≤ j ≤ n. The probability of relevance of S to q is then given by:

P(R_S = 1 | q) = \prod_{j=1}^{n} r_j^{e_j}    (3)

where n is the number of terms in the snippet S. Note that for ease of notation, we consider here that the snippet consists of one line having n terms — however, this analysis can be easily generalized to a snippet with multiple lines, in which case the variables r_j and e_j will be indexed by both the line and position numbers.

Note that this model considers that not all terms in a snippet are examined by the user — only a subset of terms (those with corresponding e_j = 1) are examined, and the relevance of the snippet is judged by the user based on the relevance of only these observed terms.
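To make Equation 3 concrete, the following is a minimal sketch (not the authors' implementation) of the relevance probability of a single-line snippet, assuming the per-term relevance probabilities r_j and examination indicators e_j are given as inputs:

```python
from typing import Sequence


def snippet_relevance(term_relevance: Sequence[float],
                      examined: Sequence[int]) -> float:
    """Relevance probability of a snippet under the micro-browsing model.

    term_relevance[j]: probability r_j that the term in position j is
        relevant to the query (hypothetical inputs from some term model).
    examined[j]: indicator e_j in {0, 1} of whether the user read that term.
    Implements P(R_S = 1 | q) = prod_j r_j ** e_j, so only the examined
    terms contribute to the perceived relevance of the snippet.
    """
    assert len(term_relevance) == len(examined)
    prob = 1.0
    for r_j, e_j in zip(term_relevance, examined):
        if e_j:
            prob *= r_j
    return prob


# Example: a user who only reads the 2nd and 3rd terms of a 4-term snippet.
print(snippet_relevance([0.2, 0.9, 0.8, 0.1], [0, 1, 1, 0]))  # ~0.72
```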

III-B Implications for Snippet CTR

Let us consider another snippet S', for which the relevance and examination probabilities are denoted by r'_k and e'_k respectively. Given these two snippets S and S' for the query q, the ratio of their probabilities of relevance based on Equation 3 is:

P(R_S = 1 | q) / P(R_{S'} = 1 | q) = \prod_{j=1}^{n} r_j^{e_j} / \prod_{k=1}^{n'} (r'_k)^{e'_k}    (4)

One way to empirically validate the effectiveness of our micro-browsing model for search snippets is to use a dataset where multiple search snippets relevant to a particular query have different CTRs — the effectiveness of the micro-browsing model can then be measured by its accuracy in predicting which snippet has the higher CTR. As we will describe in Section V-A, one such dataset is available in sponsored search, comprising pairs of snippets that differ in their text, where one snippet has a significantly higher CTR than the other. We use the micro-browsing model to derive features that are used in training a snippet classifier — given a pair of snippets, the classifier is trained to predict which snippet has the higher CTR.

Let us consider a classifier that, for a given query q and a pair of creatives S and S', predicts which creative is more relevant for q. This classifier takes features from S and S' as input, and outputs a 0/1 prediction.

We derive the features of this classifier from the micro-browsing model of search results, outlined in Section III-A. The log of the probability ratio in Equation 4 gives us a score indicative of the relative relevance of the two snippets given the query q:

\log [P(R_S = 1 | q) / P(R_{S'} = 1 | q)] = \sum_{j=1}^{n} e_j \log r_j - \sum_{k=1}^{n'} e'_k \log r'_k    (5)

This motivates us to consider rewrite features — features that rewrite terms in S to terms in S' — in the classifier; the details are discussed in the next section.
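Under the same assumptions as the sketch above, the pairwise score in Equation 5 reduces to a difference of sums of log-relevances over the examined terms of the two snippets. A small illustrative sketch:

```python
import math
from typing import Sequence


def snippet_score(rel_s: Sequence[float], exam_s: Sequence[int],
                  rel_s2: Sequence[float], exam_s2: Sequence[int]) -> float:
    """Log probability ratio of Equation 5 for snippets S and S'.

    Positive values indicate S is predicted to be more relevant (and hence
    more clickable) than S' for the query; negative values favor S'.
    """
    log_p_s = sum(e * math.log(r) for r, e in zip(rel_s, exam_s))
    log_p_s2 = sum(e * math.log(r) for r, e in zip(rel_s2, exam_s2))
    return log_p_s - log_p_s2
```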

IV Predicting Which Snippet Will Have Higher CTR

The snippet classifier is trained using the following features.

IV-A Snippet Classifier Features

Term Features: The snippet is tokenized to get terms — unigrams, bigrams, and trigrams — from its text. The position of a term in a line and the number of the line in the snippet are also considered as features.
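As an illustration only (the paper does not specify its tokenizer or exact position encoding), the following sketch extracts (term, position, line) features of the kind described above:

```python
from typing import List, Tuple


def term_features(snippet_lines: List[str],
                  max_n: int = 3) -> List[Tuple[str, int, int]]:
    """Extract (term, position, line) features from a snippet.

    Terms are word n-grams (n = 1..max_n); position is the 1-based index of
    the n-gram's first word within its line; line is the 1-based line number.
    """
    features = []
    for line_no, line in enumerate(snippet_lines, start=1):
        words = [w.strip(".,!?") for w in line.lower().split()]
        for n in range(1, max_n + 1):
            for pos in range(len(words) - n + 1):
                term = " ".join(words[pos:pos + n])
                features.append((term, pos + 1, line_no))
    return features


print(term_features(["XYZ Airlines", "Find cheap flights to New York."]))
```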

Rewrite Features: These features are extracted from pairs of snippets that are given as input to the snippet classifier. Consider the following example of two snippets:

  • Snippet 1 (S):
    line1: “XYZ Airlines”
    line2: “Find cheap flights to New York.”
    line3: “No reservation costs. Great rates”

  • Snippet 2 (S'):
    line1: “XYZ Airlines”
    line2: “Flying to New York? Get discounts.”
    line3: “No reservation costs. Great rates!”

In this pair of snippets, one rewrite is from the term “find cheap” in position 1 in line 2 of Snippet 1 (S), to the term “get discounts” in position 5 in line 2 of Snippet 2 (S'). This gives us the rewrite tuple (find cheap:1:2, get discounts:5:2). Note that there are other rewrites as well, e.g., “flights” in S got rewritten to “flying” in S'. So, overall, the terms (“find cheap”, “flights”) in S got rewritten to the terms (“flying”, “get discounts”) in S'. Finding out which phrase in S matches which corresponding phrase in S' is, in general, a combinatorial problem. In this example, the best match is “find cheap” → “get discounts” and “flights” → “flying” — we estimate the most likely matches using statistics available from our snippet corpus.

The rewrite matching algorithm we use is as follows: given a pair of snippets differing in one particular phrase rewrite, we assign a score to that phrase rewrite based on how much improvement (lift) in observed click-through rate we see between the two snippets (details in Section V). Collecting these scores across all possible snippet pairs in the complete snippet corpus gives us a database of phrase rewrites with corresponding click-through-rate lift scores. To find the matching phrases between S and S' in a rewrite, we greedily match terms in S with corresponding terms in S' that have a high score in the rewrite database — the intuition here is that a more probable rewrite like “find cheap” → “get discounts” has a higher score in the rewrite database than a less probable rewrite like “find cheap” → “flying”.

After this matching, there could be additional terms in S (not matched to S'), or terms in S' (not matched to S) — we add them as individual term-level features.
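The following is a simplified sketch of the greedy matching just described; the rewrite-score dictionary, the tie-breaking, and the phrase segmentation are assumptions made for illustration, not the production algorithm:

```python
from typing import Dict, List, Tuple

# Hypothetical rewrite database: (source phrase, target phrase) -> CTR-lift score.
REWRITE_SCORES: Dict[Tuple[str, str], float] = {
    ("find cheap", "get discounts"): 2.1,
    ("flights", "flying"): 1.4,
    ("find cheap", "flying"): 0.3,
}


def greedy_match(terms_s: List[str], terms_s2: List[str]):
    """Greedily pair phrases of S with phrases of S' by descending rewrite score.

    Returns the matched rewrite pairs plus the leftover terms of each snippet,
    which are then added as individual term-level features.
    """
    candidates = sorted(
        ((REWRITE_SCORES.get((a, b), 0.0), a, b)
         for a in terms_s for b in terms_s2),
        reverse=True)
    matched, used_s, used_s2 = [], set(), set()
    for score, a, b in candidates:
        if score > 0.0 and a not in used_s and b not in used_s2:
            matched.append((a, b, score))
            used_s.add(a)
            used_s2.add(b)
    leftover_s = [t for t in terms_s if t not in used_s]
    leftover_s2 = [t for t in terms_s2 if t not in used_s2]
    return matched, leftover_s, leftover_s2


print(greedy_match(["find cheap", "flights"], ["flying", "get discounts"]))
# ([('find cheap', 'get discounts', 2.1), ('flights', 'flying', 1.4)], [], [])
```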

Note that these term features and the rewrite features can be derived from the micro-browsing model in Section III. Let us assume that S' can be created from S by rewriting a subset of features in S to corresponding features in S' — the positions of the rewrites are encoded in a set of pairs M, where (p, p') ∈ M indicates that a feature in position p of S has been rewritten to a feature in position p' of S'. We also consider the sets of positions in S and S' that occur in M to be P and P' respectively. Using M, P and P', we can re-factor Equation 5 into the following form:

\sum_{(p, p') \in M} (e_p \log r_p - e'_{p'} \log r'_{p'}) + \sum_{p \notin P} e_p \log r_p - \sum_{p' \notin P'} e'_{p'} \log r'_{p'}    (6)

In the example above, considering only line 2, M = {(1, 5), (3, 1)}, P = {1, 3}, and P' = {1, 5}.

The first term in Equation 6 corresponds to the rewrite features, where the feature in position p in S is rewritten to the feature in position p' in S'; here e_p indicates whether the user saw the feature in position p in result S, while r_p gives the relevance of the feature in position p in result S (and similarly e'_{p'} and r'_{p'} for S'). The second term enumerates over any terms that are in S but not in M after the matching, and the third term enumerates over any terms in S' that are not in M after the matching. The matching in M can be done by various mechanisms — as explained above, we follow a greedy algorithm to find the most likely match based on statistics computed from the data source.
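A small sketch of evaluating the refactored score in Equation 6, assuming the rewrite set M and the per-position relevance and examination values are given (the inputs here are hypothetical):

```python
import math
from typing import Dict, List, Tuple


def refactored_score(rel_s: Dict[int, float], exam_s: Dict[int, int],
                     rel_s2: Dict[int, float], exam_s2: Dict[int, int],
                     rewrites: List[Tuple[int, int]]) -> float:
    """Equation 6: rewrite term plus the leftover terms of S and S'.

    rel_s / exam_s map a position in S to its relevance probability r_p and
    its examination indicator e_p (likewise for S'); rewrites is the set M
    of (position in S, position in S') pairs.
    """
    matched_s = {p for p, _ in rewrites}
    matched_s2 = {p2 for _, p2 in rewrites}
    score = sum(exam_s[p] * math.log(rel_s[p]) -
                exam_s2[p2] * math.log(rel_s2[p2]) for p, p2 in rewrites)
    score += sum(exam_s[p] * math.log(rel_s[p])
                 for p in rel_s if p not in matched_s)
    score -= sum(exam_s2[p2] * math.log(rel_s2[p2])
                 for p2 in rel_s2 if p2 not in matched_s2)
    return score
```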

IV-B Snippet Classification Framework

Fig. 1: Snippet Classification Framework.

The input data for the snippet classifier consists of pairs of snippets with CTR information for each snippet. The snippet classification pipeline consists of two phases (Figure 1 shows the overall flow).

In the first phase, the feature extractor extracts important feature statistics (e.g., the rewrites database) from the snippet pairs. In the second phase, the snippet classifier is trained. Each data instance input to the second phase is a snippet pair (S, S'), and the classifier predicts whether S is better than S' or vice versa. The classifier data generator takes the snippet corpus and the feature statistics database, and for all possible snippet pairs it generates the input features for the classifier. After the model has been trained, we use it for classifying test data.

V Experimental Results

We conducted a set of experiments to evaluate the effectiveness of the snippet classifier derived from the micro-browsing model. As mentioned in Section III, we used a dataset from sponsored search for our evaluation.

Before we give the details of the dataset and the experiments, let us introduce some terminology. A sponsored search result (or ad) consists of creative text that is a brief (e.g., 3 line) description to be displayed by the search engine, and a keyword that specifies the query for which the creative is eligible for display. An advertiser often groups creatives of the same type (e.g., targeting a particular keyword) into a group called the adgroup. An ad impression results when the search engine displays an ad in response to a user query, and a clickthrough results when the user viewing the ad clicks on it and visits the landing page of the ad. The ad CTR for a given query measures the fraction of times an ad impression led to a clickthrough.

V-A Data Set

Advertisers often provide multiple alternative creative texts in a particular adgroup, corresponding to the same keyword used for targeting queries. When these creatives are shown for a query and the keyword used for targeting is the same, any observed difference in CTR can only be caused by differences in the text of the creatives.

We collected such a dataset of creative pairs from advertisers, where the keyword used for targeting was the same and the observed CTR was different. Our AdCorpus consists of ad creatives collected over a particular time period, where each adgroup got at least one click in that time. AdCorpus contains tens of millions of creative pairs, collected from several million adgroups.

V-B Serve Weight

The serve-weight (sw) of a creative in an adgroup denotes the probability that the creative will be shown from the set of creatives of an adgroup. It is computed from clicks and impressions of the different creatives in the adgroup, suitably normalized by the average CTR of the adgroup — this allows serve-weight values of two creatives in different adgroups to be compared, as it accounts for the CTR differences between adgroups.

In Section IV, we described how term and rewrite features are generated from a pair of creatives. We compute the sw-diff for each term feature in a creative pair — if the term is present in one creative but not the other, we define sw-diff as the difference between the serve-weight of the creative containing that term and the serve-weight of the creative not containing that term. For rewrite features, if a term in creative S is rewritten to a term in creative S' for a creative pair, we define sw-diff as the difference of the serve-weights of S and S'. For both term and rewrite features, we define a random variable delta-sw that is set to +1 if sw-diff is positive for that feature and -1 if sw-diff is negative.
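A minimal sketch of the sw-diff and delta-sw definitions above, assuming the serve-weights of the two creatives in a pair have already been computed (the treatment of an exactly zero sw-diff is an assumption; the paper does not specify it):

```python
def sw_diff_term(sw_with_term: float, sw_without_term: float) -> float:
    """sw-diff for a term feature: serve-weight of the creative containing
    the term minus serve-weight of the creative not containing it."""
    return sw_with_term - sw_without_term


def sw_diff_rewrite(sw_source: float, sw_target: float) -> float:
    """sw-diff for a rewrite feature: serve-weight of the creative containing
    the source term minus serve-weight of the creative containing the
    rewritten (target) term."""
    return sw_source - sw_target


def delta_sw(diff: float) -> int:
    """delta-sw is +1 when sw-diff is positive, -1 when it is negative
    (ties are mapped to -1 in this sketch)."""
    return 1 if diff > 0 else -1
```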

V-C Feature Statistics Database

For each feature, we compute the empirical probability p of sw-diff being +1 by estimating the fraction of times delta-sw is +1 over the complete AdCorpus (using Laplace smoothing to address sparsity). We record the odds-ratio of this probability (p / (1 - p)) as the statistic corresponding to that feature in the statistics database. The statistic for each feature in this database intuitively estimates the odds of the presence of the feature causing an increase in creative CTR (since it increases serve-weight). Note that apart from term and rewrite features, we also calculate position features, i.e., for positions of terms and position pairs (source position and target position) for rewrites, we estimate the empirical probability of sw-diff being +1.
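For illustration, one way the smoothed odds statistic for a feature could be computed (the pseudo-count used for Laplace smoothing is an assumption, not a value from the paper):

```python
def feature_odds(num_positive: int, num_total: int, alpha: float = 1.0) -> float:
    """Odds statistic for a feature from its delta-sw observations.

    num_positive: number of creative pairs where delta-sw = +1 for the feature.
    num_total: total number of creative pairs containing the feature.
    The probability is Laplace-smoothed with pseudo-count alpha, and the
    recorded statistic is the odds p / (1 - p).
    """
    p = (num_positive + alpha) / (num_total + 2.0 * alpha)
    return p / (1.0 - p)


print(feature_odds(70, 100))  # > 1: the feature tends to increase CTR
```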

V-D Experiments

We conducted a set of experiments where we systematically ablated features from the classifier model to estimate the effectiveness of different aspects of the snippet classifier derived from the micro-browsing model. Specifically, we consider the following components in our ablation study:

  • Term features: Use n-gram (unigram, bigram, trigram) features from the input creative pair in the classifier. This is equivalent to only using the second and the third terms in Equation 6 in the score for the micro-browsing model, where the examination indicators e_p and e'_{p'} are set to 1 for all terms (i.e., all terms are viewed).

  • Rewrite features: Use term rewrite features for an input creative pair in the classifier. This is equivalent to only using the first term in Equation 6 in the score for the micro-browsing model, where the examination indicators e_p and e'_{p'} are set to 1 for all terms (i.e., all terms are viewed).

  • Position information: Use position information of the terms in the term and rewrite features in the classifier. This is equivalent to considering the examination indicators e_p and e'_{p'} in Equation 6, which can now model which terms are viewed by the user.

  • Feature statistics database: Use the feature statistics estimated from AdCorpus as the initial values of the term and rewrite features in the learning algorithm.

The snippet classifier we train is a logistic regression model with L1 regularization. We turn on these individual components incrementally in the feature set of the logistic regression model, to create multiple snippet classifier models:

  • M1: Term features with no position information, where feature values are initialized from the feature statistics database.

  • M2: Term features with position information, where feature values are initialized from the feature statistics database.

  • M3: Greedy rewrite features with no position information, where feature values are initialized from the feature statistics database.

  • M4: Greedy rewrite features with position information, feature values initialized using the feature statistics database.

  • M5: Greedy rewrite and term features with no position information, where feature values are initialized from the feature statistics database.

  • M6: Greedy rewrite and term features with position information, feature values initialized using the feature statistics database.
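As a rough sketch of how any one of these variants could be trained, the snippet below fits an L1-regularized logistic regression on a synthetic design matrix using scikit-learn; the actual features, labels, and training system used in the paper are not shown here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical design matrix: one row per creative pair, one column per
# term/rewrite (and optionally position) feature; the label is 1 if the
# first creative of the pair has the higher CTR, and 0 otherwise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# L1 regularization performs feature selection over the sparse feature set.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```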

V-D1 Mapping of Micro-browsing model to Classifier

Let us now consider the mapping of the features from the micro-browsing model to the snippet classifier for these six variants. We illustrate this using M4, which considers rewrite features with position information and initialization. Let us consider only the rewrite features in Equation 6, which is derived from the micro-browsing model:

\sum_{(p, p') \in M} (e_p \log r_p - e'_{p'} \log r'_{p'})    (7)

In the logistic regression model, whenever we see a rewrite feature for a pair of creatives S and S', we look up the feature statistics database to see if we have any feature statistics corresponding to that rewrite, and initialize the logistic regression learning algorithm with those feature values. As described in Section IV, we collect feature statistics for rewrites independent of the position of the rewrite terms, to handle sparsity issues (if we considered term × position information in the rewrites, the data would be too sparse). This corresponds to making the following approximation to Equation 7:

\sum_{(p, p') \in M} (e_p \log r_p - e'_{p'} \log r'_{p'}) \approx \sum_{(p, p') \in M} \beta_{p, p'} (\log r_p - \log r'_{p'})    (8)

That is, the position terms and the relevance terms are decoupled, so that the relevance terms can be initialized using the feature statistics database. Note that we also initialize the position features using the rewrite position features calculated from AdCorpus, as described in Section V-C.

Based on Equation 8, we represent the logistic regression problem for M4 as:

\log O = \sum_{(p, p') \in M} \beta_{p, p'} \cdot \theta_{p, p'}    (9)

where O is the odds of S having a higher CTR than S', the \beta_{p, p'} are the classifier features corresponding to the positions of the rewrite terms, and the \theta_{p, p'} are the classifier features corresponding to the relevance of the rewrite terms. If we fix the values of \beta, \theta can be learned as a logistic regression model. Similarly, if we fix the values of \theta, \beta can be learned as a logistic regression model. So, learning model M4 can be framed as an iterative learning of the features \beta and \theta in Equation 9 using two coupled logistic regression models. Note that it can be shown that the iterative learning of \beta and \theta in Equation 9 converges to a fixed point under certain model assumptions, but that is beyond the scope of this paper.
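The sketch below illustrates one way the coupled estimation could be carried out, alternating between the two logistic regressions; the bilinear reading of Equation 9 and the (position id, rewrite id) encoding are assumptions made for this sketch, not the authors' exact procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_coupled_lr(pairs, labels, n_pos, n_rw, iters=5):
    """Alternating fit of position weights (beta) and rewrite relevance
    weights (theta) for a bilinear model log O = sum beta[p] * theta[r].

    pairs: one entry per creative pair, each a list of (pos_id, rewrite_id)
           rewrites observed for that pair (hypothetical encoding).
    labels: 1 if the first creative of the pair has the higher CTR, else 0
            (must contain both classes).
    """
    beta = np.ones(n_pos)
    theta = np.ones(n_rw)
    for _ in range(iters):
        # Fix beta, learn theta: the feature value of rewrite r in an example
        # is the sum of beta[p] over its occurrences.
        X_theta = np.zeros((len(pairs), n_rw))
        for i, example in enumerate(pairs):
            for p, r in example:
                X_theta[i, r] += beta[p]
        lr = LogisticRegression(fit_intercept=False, C=1.0).fit(X_theta, labels)
        theta = lr.coef_[0]
        # Fix theta, learn beta symmetrically.
        X_beta = np.zeros((len(pairs), n_pos))
        for i, example in enumerate(pairs):
            for p, r in example:
                X_beta[i, p] += theta[r]
        lr = LogisticRegression(fit_intercept=False, C=1.0).fit(X_beta, labels)
        beta = lr.coef_[0]
    return beta, theta
```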

Feature Recall Precision F-Measure
M1: Terms only 55.9% 58.2% 0.570
M2: Terms w. pos 64.4% 66.3% 0.653
M3: Rewrites only 59.0% 61.2% 0.601
M4: Rewrites w. pos 70.0% 71.9% 0.709
M5: Rewrites & terms 59.7% 61.8% 0.607
M6: Rewrites & terms w. pos 70.4% 72.1% 0.712
TABLE 1: Accuracy of creative classification using different sets of features

V-D2 Analysis of Results

We performed standard 10-fold cross-validation experiments, where in each cross-validation iteration 90% of the instances are used for training and the remaining 10% are used for testing. Table 1 shows the precision, recall, and F-measure of creative classification for these different models. As seen from the table, the baseline model is M1, which only uses terms from the creatives in the classifier, as in a usual bag-of-terms model. As we add each component, based on features derived from the micro-browsing model, the classifier performance keeps improving. The final model M6, which incorporates all aspects of the micro-browsing model, shows a dramatic improvement in performance compared to M1 — the F-measure increases from 0.570 for M1 to 0.712 for M6.

Fig. 2: Learned term position weights (line 1,2,3).

Let us now look at the values of the position features corresponding to the terms and rewrites. Figure 2 shows the learned term position weights for line 1, 2 and 3.

Table 2 presents the accuracy of creative classification for top and right-hand-side (rhs) ads. The accuracy of the classifier on the top-ads data is slightly higher than that on the rhs data.

Feature Top Rhs
M1: Terms only 57.1% 57.0%
M2: Terms w. position 65.7% 65.1%
M3: Rewrites only 60.2% 59.9%
M4: Rewrites w. position 71.1% 70.8%
M5: Rewrites and terms 60.9% 60.6%
M6: Rewrites and terms w. position 71.4% 71.1%
TABLE 2: Accuracy of creative classification in different configurations (top vs. rhs)

VI Conclusions and Future Work

In this paper, we proposed a micro-browsing model for search result snippets, and presented an application of this model to predicting the better of two creatives. We found that the micro-positions of the terms, as well as of the rewrites, are important features that influence the accuracy of the classifier.

Future work directions include learning the micro-position normalizers and adding more features to improve the accuracy of the classifier. Another promising direction is to use language models to gain a deeper understanding of snippet text, as well as to automatically generate snippets. We would also like to conduct eye-tracking studies [20] to see how the positions of important words in the snippet correlate with the focus areas identified by the eye-tracking models. Finally, we would like to study how our approach can be integrated with attention-based neural network models (e.g., transformer networks [17]).

References

  • [1] E. Agichtein, E. Brill, and S. T. Dumais. Improving web search ranking by incorporating user behavior. In SIGIR, 2006.
  • [2] A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov. A neural click model for web search. In WWW, 2016.
  • [3] A. Borisov, M. Wardenaar, I. Markov, and M. de Rijke. A click sequence model for web search. In SIGIR, 2018.
  • [4] O. Chapelle and Y. Zhang.

    A dynamic bayesian network click model for web search ranking.

    In WWW, 2009.
  • [5] W. Chen, D. Wang, Y. Zhang, Z. Chen, A. Singla, and Q. Yang. A noise-aware click model for web search. In WSDM, 2012.
  • [6] N. Craswell, B. Ramsey, M. Taylor, and O. Zoeter. An experimental comparison of click position-bias models. In WSDM, 2008.
  • [7] G. Dupret and C. Liao. A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine. In WSDM, 2010.
  • [8] G. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR, 2008.
  • [9] F. Guo, C. Liu, A. Kannan, T. Minka, M. Taylor, Y. Wang, and C. Faloutsos. Click chain model in web search. In WWW, 2009.
  • [10] F. Guo, C. Liu, and Y. Wang. Efficient multiple-click models in web search. In WSDM, 2009.
  • [11] T. Joachims. Optimizing search engines using clickthrough data. In KDD, 2002.
  • [12] C. Liu, F. Guo, and C. Faloutsos. BBM: Bayesian browsing model from petabyte-scale data. In KDD, 2009.
  • [13] F. Radlinski and T. Joachims. Query chains: Learning to rank from implicit feedback. In KDD, 2005.
  • [14] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating the click-through rate for new ads. In WWW, 2007.
  • [15] D. Sculley, R. Malkin, S. Basu, and R. J. Bayardo. Predicting bounce rates in sponsored search advertisements. In KDD, 2009.
  • [16] R. Srikant, S. Basu, N. Wang, and D. Pregibon. User browsing models: relevance versus examination. In KDD, 2010.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • [18] D. Yin, B. Cao, J.-T. Sun, and B. D. Davison. Estimating ad group performance in sponsored search. In WSDM, 2014.
  • [19] N. Yin, H. Li, and H. Su. CLR: Coupled logistic regression model for CTR prediction. In ACM TUR-C, 2017.
  • [20] Q. Zhao, S. Chang, F. M. Harper, and J. A. Konstan. Gaze prediction for recommender systems. In RecSys, 2016.
  • [21] Z. Zhu, W. Chen, T. Minka, C. Zhu, and Z. Chen. A novel click model and its applications to online advertising. In WSDM, 2010.