User Model-Based Intent-Aware Metrics for Multilingual Search Evaluation

12/13/2016 ∙ by Alexey Drutsa, et al. ∙ Yandex

Despite the growing importance of the multilingual aspect of web search, no appropriate offline metrics to evaluate its quality have been proposed so far. At the same time, personal language preferences can be regarded as intents of a query. This approach translates the multilingual search problem into a particular task of search diversification. Furthermore, the standard intent-aware approach could be adopted to build a diversified metric for multilingual search on the basis of a classical IR metric such as ERR. The intent-aware approach estimates user satisfaction under a user behavior model. We show, however, that the underlying user behavior model is not realistic in the multilingual case, and the resulting intent-aware metric does not appropriately estimate user satisfaction. We develop a novel approach to building intent-aware user behavior models, which overcome these limitations and convert to quality metrics that better correlate with standard online metrics of user satisfaction.




1 Introduction

There are many countries whose population speaks several languages: a country can have two state languages (e.g., Belgium), have close relations with other countries (e.g., Germany), and be subject to globalization (its citizens actively or passively learn popular international languages). This forces modern search engines to process queries and documents in different languages for users from the same region, which is known as multilingual search or the multilingual aspect of web search Savoy (2005).

The language preferences of a user (the need for relevant documents in a particular language) are not always easily deduced from her query to the search engine and may be ambiguous. For instance, there are words which are the same in different languages (like “table” in English and French), and some named entities have the same meaning in different languages (like “cola” and “CIKM”). In this paper, we argue that the ambiguity problem of language preferences can be solved by diversification of search results with respect to their languages Chang et al. (2011). Diversification has been successfully applied to other types of query intents: navigational/informational Sakai (2014), freshness Styskin et al. (2011), etc.

A comprehensive overview of various research questions and methodologies in the field of multilingual search can be found in Peters et al. (2012), which also includes a large survey of CLEF (Conference and Labs of the Evaluation Forum) and its test collections. To the best of our knowledge, there is no study devoted to the evaluation of multilingual search by means of specialized offline metrics. On the face of it, the task of building a multilingual metric may seem straightforward, since a great number of diversified search metrics exist Chapelle et al. (2011); Sakai (2014). However, the insufficiency of these metrics becomes apparent when one applies them to multilingual search. Usually, in the case of the intent-aware approach Agrawal et al. (2009); Chapelle et al. (2011); Sakai (2014), quality evaluation is based on relevance assessments of documents assigned for each intent individually. However, relevance is essentially independent of the language of a document, as the meaning of a document is not supposed to change after its translation into another language. Therefore, we can only rely on relevance labels that do not account for the language preferences (universal judgments). For this reason, state-of-the-art collections like CLEF contain only universal relevance labels Peters et al. (2012). Hence, we are faced with the problem of determining the per-language relevance probabilities of a document from the document’s universal editorial judgment. According to the traditional intent-aware approach, we should assume that the relevance probability of a document whose language coincides with the intent (implicit language preference) depends on its universal judgment only, and, if the document language does not coincide with the language preference, the document is totally irrelevant Agrawal et al. (2009). This core principle of the intent-aware approach to diversified retrieval evaluation could be a pitfall, because, in a variety of countries, some users can speak or understand two or more languages, though with different degrees of proficiency Peters et al. (2012). The approach described above Agrawal et al. (2009) can produce a diversified metric which does not correctly estimate the satisfaction of such users with search results in different languages.

In our work, we utilize an intent-aware (IA) approach to make a diversified variant of the offline evaluation metric ERR Chapelle et al. (2009, 2011); Sakai (2014); this metric and its modifications Styskin et al. (2011); Chuklin et al. (2013a) are the most popular and well-studied offline metrics used in the search engine industry and academia Collins-Thompson et al. (2014). In order to advance the traditional approach to diversified search evaluation, we modify its underlying user model. Namely, we allow users having one (implicit) language intent to be satisfied by documents in another language. Then, we build a new metric based on this novel intent-aware click model using the technique from Chuklin et al. (2013a). We show experimentally that our extended intent-aware user model outperforms the existing ones in terms of perplexity, and the novel diversified metric (which is based on this IA model) outperforms the studied offline metrics in terms of their correlation with a set of popular absolute online metrics.

2 Framework

We start the development of multilingual metrics from an analysis of the click models that underlie the state-of-the-art offline metrics and their diversified variants. Metrics investigated in this paper include the state-of-the-art ERR Chapelle et al. (2009); Chuklin et al. (2013a) as a baseline and its different modifications, which are based on special cases of the Dynamic Bayesian Network (DBN) click model Chapelle and Zhang (2009).

Click models. Recall that a click model is a probabilistic model that predicts user behavior, in particular her clicks, on a search engine result page (SERP). Specifically, the DBN model Chapelle and Zhang (2009) assumes that a user examines the document snippets on the SERP one by one from top to bottom and may be attracted by a snippet. If the user is attracted by a snippet, she clicks its URL and, with a certain probability, becomes satisfied with the document. If she is not satisfied, she may either proceed to the next snippet or stop. In our work, we restrict ourselves to the simplified version of the DBN model (SDBN) and add the following constraints to align it with the user model underlying the ERR metric Chapelle and Zhang (2009): the user is always attracted by examined snippets, and she never abandons the search results before having examined all of them or having been satisfied with one of them.

In our general framework, we suppose that a user issues a query $q$ and examines the first $N$ documents from the received SERP. Let $\mathcal{I}$ be the set of allowed query intents, $I$ be the random variable of the query’s intent with values in $\mathcal{I}$, $E_k$ be the random event of examination of the $k$-th document $d_k$, and $S_k$ be the random event of satisfaction by the $k$-th document $d_k$. Then the user behavior is modeled as follows. A user issues a query $q$, which has an intent $i \in \mathcal{I}$ with the probability $P(I = i)$, and starts examining the first document ($E_1 = 1$). After the examination of the $k$-th document ($E_k = 1$), she is satisfied ($S_k = 1$) with the probability $R_k^i$. The described user behavior is summarized in the following transition probabilities between the states of the variable $I$ and the events $E_k$ and $S_k$ for $k = 1, \ldots, N$:

$$P(E_k = 1 \mid E_{k-1} = 1, S_{k-1} = 0, I = i) = 1, \qquad P(E_k = 1 \mid E_{k-1} = 0 \text{ or } S_{k-1} = 1) = 0,$$

$$P(S_k = 1 \mid E_k = 1, I = i) = R_k^i, \qquad P(S_k = 1 \mid E_k = 0) = 0,$$

for each intent $i \in \mathcal{I}$, where $E_0 = 1$, $S_0 = 0$ is the initial state of the user behavior (before issuing a query) and the relevance probability $R_k^i$ defines the probability of satisfaction by the document $d_k$ conditioned by the examination at position $k$.
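The transition rules of this framework describe a cascade: position $k$ is examined only if no earlier document satisfied the user. A minimal Python sketch of the resulting per-position probabilities follows, assuming the per-intent relevance probabilities are already available as plain lists. The function names and the toy inputs are illustrative assumptions, not part of the original model description.

```python
from typing import Dict, List


def examination_probs(rel_probs: List[float]) -> List[float]:
    """P(E_k = 1 | intent) for k = 1..N under the constrained SDBN cascade.

    rel_probs[k-1] is the probability of satisfaction at position k given
    examination; the user moves on only while she is not satisfied.
    """
    probs = []
    still_searching = 1.0  # the first document is always examined
    for r in rel_probs:
        probs.append(still_searching)
        still_searching *= 1.0 - r
    return probs


def satisfaction_probs(rel_probs: List[float]) -> List[float]:
    """P(S_k = 1 | intent): the document at k is examined and satisfies the user."""
    return [e * r for e, r in zip(examination_probs(rel_probs), rel_probs)]


def marginal_examination_probs(
    intent_probs: Dict[str, float],
    rel_probs_by_intent: Dict[str, List[float]],
) -> List[float]:
    """Examination probabilities averaged over the intent distribution P(I = i)."""
    n = len(next(iter(rel_probs_by_intent.values())))
    out = [0.0] * n
    for intent, p_intent in intent_probs.items():
        for k, e in enumerate(examination_probs(rel_probs_by_intent[intent])):
            out[k] += p_intent * e
    return out
```

Because the constrained model assumes that every examined snippet is clicked, the marginal examination probability at a position can also be read as the predicted click probability at that position.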

According to Chuklin et al. (2013a), we introduce the following additional constraint on the click models in order to build offline evaluation metrics: the relevance probability $R_k^i$ is determined by the relevance grade of the examined document, i.e., $R_k^i = R^i(r_k)$, where $r_k$ is the relevance grade of the $k$-th document $d_k$. Hence, unlike in the original SDBN model, it is not an individual parameter for each particular query–document pair. The above framework allows us to describe both the SDBN model Chapelle and Zhang (2009); Chuklin et al. (2013a) and its different modifications. In order to obtain a particular click model, one needs to specify the probabilities $P(I = i)$ and $R^i(r)$ for each intent $i \in \mathcal{I}$ and each relevance grade $r$.

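Model-based metrics. Following Chuklin et al. (2013a), we use the state-of-the-art methodology Chapelle et al. (2009); Styskin et al. (2011); Chuklin et al. (2013a) to build an offline quality metric based on a click model of user behavior. The classic effort-based metric on top of the SDBN model is the ERR metric Chapelle et al. (2009); Chuklin et al. (2013a). To the best of our knowledge, we are the first to propose obtaining a diversified metric on top of an intent-aware click model. The common formula Chapelle et al. (2009) for ERR-family metrics is

$$\sum_{i \in \mathcal{I}} P(I = i) \sum_{k=1}^{N} \frac{1}{k}\, R_k^i \prod_{j=1}^{k-1} \bigl(1 - R_j^i\bigr), \qquad (1)$$

where $\mathcal{I}$ is the set of intents, $P(I = i)$ is the probability of the intent $i$, $N$ is the number of examined documents, and $R_k^i$ is the relevance probability of the document at position $k$ under the intent $i$.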
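To make the ERR-family formula concrete, here is a minimal Python sketch that evaluates it for one ranked list. It assumes that the intent probabilities and the per-intent relevance probabilities of the ranked documents are already known; the function name `err_family` and the dictionary-based inputs are hypothetical choices made for illustration.

```python
from typing import Dict, List


def err_family(
    intent_probs: Dict[str, float],
    rel_probs_by_intent: Dict[str, List[float]],
) -> float:
    """Expected reciprocal rank, averaged over query intents.

    intent_probs[i] plays the role of P(I = i); rel_probs_by_intent[i][k-1]
    is the relevance probability of the document at position k under intent i.
    """
    total = 0.0
    for intent, p_intent in intent_probs.items():
        not_satisfied = 1.0  # probability that positions 1..k-1 did not satisfy
        for k, r in enumerate(rel_probs_by_intent[intent], start=1):
            total += p_intent * not_satisfied * r / k
            not_satisfied *= 1.0 - r
    return total
```

With a single intent of probability 1, the function reduces to the intent-agnostic ERR of one ranked list.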


Intent awareness. The classical SDBN model Chapelle and Zhang (2009); Chuklin et al. (2013a) is intent-agnostic, i.e., there are no query intents (formally, $\mathcal{I}$ consists of a single element), and its relevance probabilities are independent of the intent $i$: $R^i(r) = f(r)$ for some map $f$. The simplest approach to introducing intent awareness into a click model is as follows Chuklin et al. (2013b). For the intent $i$, one introduces the per-intent relevance assessments $r_k^i$ to the model’s parameters in place of the (universal) relevance judgments $r_k$ Chapelle et al. (2011); Sakai (2014). The editorial judgments $r_k^i$ must be obtained for each intent individually. (In certain cases, there are no per-intent judgments. This is the case of our multilingual study. We will discuss how we overcome this issue in Section 3.) For instance, for each possible intent of a query and each document, assessors can be instructed to imagine themselves asking the query with that particular intent, and to ignore the value of the document in the contexts of other possible intents Chapelle et al. (2011). In such a way, we obtain a modification of the SDBN model, where $f(r_k^i)$ are substituted as the relevance probabilities $R_k^i$ (note that the function $f$ does not depend on the intent $i$ here, which will be questioned in the next section). The described intent-aware (IA) approach is similar to the one generally used to build an offline metric of diversified search from its intent-agnostic variant Agrawal et al. (2009); Chapelle et al. (2011); Sakai (2014). Therefore, we will use the same terminology for the described way of introducing intent awareness into a click model.

Estimation of model parameters. In order to define the parameters of a model, one either sets them to default ad-hoc values (i.e., based on intuition only) or fits them from query logs. In the first case, for instance, the original ERR metric Chapelle et al. (2009) uses the mapping $f(r) = (2^r - 1)/2^{r_{\max}}$, where $r_{\max}$ is the maximum possible relevance grade; thus, e.g., $R_k^i = f(r_k)$ and $R_k^i = f(r_k^i)$ for the intent-agnostic and IA models, respectively. In the second case, in order to learn a model’s parameters (i.e., the relevance probabilities $R^i(r)$ and the intent probability $P(I = i)$), a likelihood function is optimized Chapelle and Zhang (2009); Chuklin et al. (2013a). In our work, we do this by means of the limited-memory BFGS algorithm Byrd et al. (1995) (a quasi-Newton, gradient-based optimization method).
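As an illustration of the default (non-learned) parameters, the following Python sketch implements the standard ERR grade-to-probability mapping of Chapelle et al. (2009). It assumes that the 5-grade scale is encoded as integers 0 to 4 with 0 standing for "Bad"; this encoding and the function name are assumptions made for the example.

```python
def default_relevance_prob(grade: int, max_grade: int = 4) -> float:
    """Default ERR mapping (2^r - 1) / 2^r_max from a grade to a probability.

    Assumes integer grades 0..max_grade (0 = "Bad"); the numeric encoding of
    the 5-grade scale is an assumption of this sketch.
    """
    if not 0 <= grade <= max_grade:
        raise ValueError("grade must be between 0 and max_grade")
    return (2 ** grade - 1) / 2 ** max_grade


# Default relevance probabilities for the assumed 5-grade scale.
default_table = [default_relevance_prob(g) for g in range(5)]
```

Under this mapping, a "Bad" document never satisfies the user, while a document with the top grade satisfies her with probability 15/16.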

Figure 1: Distribution of sessions with clicks on English documents and clicks on documents written in the native language (in % w.r.t. the total number of sessions with clicks, ).
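| Step | Model | Relevance probabilities depend on |
|---|---|---|
| | Classic intent-agnostic model | the editorial judgment only |
| (1) | IA modification (same params) | the editorial judgment if the document language matches the intent; the grade "Bad" otherwise |
| (2) | IA modification (diff params) | the editorial judgment and the intent if the document language matches the intent; the grade "Bad" otherwise |
| (3) | EIA modification | the editorial judgment, the intent, and the document language |

Table 1: The evolution of the relevance probabilities from the source intent-agnostic model via the IA modifications to the EIA modification.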

3 Multilingual intents & metrics

In the case of multilingual search, the space of query intents $\mathcal{I}$ is the set of languages. The (universal) editorial judgments $r_k$ (we consider the state-of-the-art 5-grade scale for the editorial judgments) do not depend on the document’s language, since the meaning of a document is not supposed to change after its translation into another language. In our work, the baseline is the classical intent-aware approach Agrawal et al. (2009) (see Sec. 2), where the per-language relevance judgment $r_k^i$ of a document $d_k$ in language $l_k$ is defined as follows. If the document’s language $l_k$ does not coincide with the considered language preference $i$, then the document is naturally treated as totally irrelevant to this intent, i.e., $r_k^i = r_k$ if $l_k = i$, and $r_k^i = 0$ ("Bad") otherwise. This way of introducing intent awareness could be a pitfall in the case of multilingual search, for reasons that we explain below. In this paper, we propose a new intent-aware modification of the SDBN model whose relevance probabilities depend on both the editorial judgments and the combination of the language preference and the language of the document.
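The baseline rule for deriving per-language judgments from universal ones is simple enough to state in a few lines of Python. The sketch below assumes integer grades with 0 standing for "Bad" and language codes as plain strings; the names `per_language_grade` and `per_language_serp` are hypothetical.

```python
from typing import List


def per_language_grade(
    universal_grade: int,
    doc_language: str,
    intent_language: str,
    bad_grade: int = 0,
) -> int:
    """Classical IA rule: keep the universal grade only when languages match."""
    return universal_grade if doc_language == intent_language else bad_grade


def per_language_serp(
    grades: List[int], doc_languages: List[str], intent_language: str
) -> List[int]:
    """Apply the rule to every document of a ranked list."""
    return [
        per_language_grade(g, lang, intent_language)
        for g, lang in zip(grades, doc_languages)
    ]
```

For a ranked list with grades `[4, 2, 3]` in languages `["en", "native", "en"]`, a user with the native-language intent is credited only with the second document, which is exactly the behavior questioned in the remainder of this section.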

We modify the SDBN model by increasing, step by step, the degree of freedom of the relevance probabilities $R_k^i$. We start from the version presented in the second block of Table 1. Here the probabilities depend solely on the editorial judgments if the document language $l_k$ coincides with the intent $i$, and always correspond to the relevance grade "Bad" otherwise. First, we hypothesize that a user may search for documents in different languages (e.g., her native language and her second language) with different levels of convenience and success. We conclude that the relevance probabilities of a document with the same editorial judgment might be different for different languages. Therefore, we perform the second step of our modification (denoted by (2) in Table 1), where the relevance probabilities are additionally allowed to depend on the intent $i$ in the case $l_k = i$. We refer to this version of the IA model as the IA with “diff params”, while we refer to the previous one as the IA with “same params”.

Second, recall that there are bilinguals who can speak or understand two languages. Such a user, while preferring documents in one language, could occasionally be satisfied by a document written in another language, even though she did not initially expect documents in this language to contain any relevant information. This situation is supported by observations of user behavior in the query logs of one of the popular search engines operating in a European country. We plotted the distribution of sessions with clicks on English documents and clicks on documents written in the native language in % w.r.t. the total number of sessions with clicks, , in Fig. 1. One can see that users click on documents written in both languages in more than of sessions with 2 clicks ( and of sessions with and clicks respectively). Based on this observation, we conclude that the relevance probabilities should not always correspond to the relevance grade "Bad" when the document language $l_k$ differs from the intent $i$ ($l_k \neq i$). Therefore, we perform the third (final) step of our modification (denoted by (3) in Table 1), allowing the relevance probabilities to depend both on the language preference $i$ and on the document’s language $l_k$, in addition to the (universal) editorial judgment $r_k$. So, in terms of our general framework, we suppose that $R_k^i = R^{i, l_k}(r_k)$, and obtain a new Extended Intent-Aware model (SDBN-EIA).

A particular metric of the ERR family defined by Eq. 1 is determined by the parameters $R_k^i$ (relevance probabilities) and $P(I = i)$ (intent probabilities), $i \in \mathcal{I}$, that are specified by the click model underlying the metric. For instance, the classical ERR is based on the SDBN model, and, thus, it is defined by Eq. 1 with default parameters $R_k^i$ that depend only on the relevance grade $r_k$ and are independent of the query intent (see Sec. 2). By contrast, our novel metric ERR-EIA with extended intent awareness is based on the SDBN-EIA model, and, thus, it is defined by Eq. 1 with $R_k^i = R^{i, l_k}(r_k)$, where $l_k$ is the language of the $k$-th document. Note that the modifications that we propose do not introduce any additional restrictions to the basic click model, but, on the contrary, add more degrees of freedom to it. At the same time, if our assumptions were wrong, we would simply learn from the logs such probabilities as would reduce the extended model to the basic one anyway. However, our experiments demonstrate that both the click models and the metrics built on them benefit from the additional flexibility. The above modifications improve the metrics, because the better the model predicts user behavior, the better it predicts the user satisfaction, which is determined by Eq. 1.
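A minimal Python sketch of how a metric with extended intent awareness could be evaluated for one ranked list is given below. It assumes that the relevance probabilities are stored in a table indexed by the pair (language preference, document language) and then by the grade; the function name `err_eia`, the table layout, and all numeric values are made up for illustration and are not learned parameters from the paper.

```python
from typing import Dict, List, Tuple

# (intent language, document language) -> probability per grade 0..4
Params = Dict[Tuple[str, str], List[float]]


def err_eia(
    intent_probs: Dict[str, float],
    params: Params,
    grades: List[int],
    doc_languages: List[str],
) -> float:
    """ERR-family metric whose relevance probabilities depend on the intent,
    the document language, and the universal grade of each document."""
    total = 0.0
    for intent, p_intent in intent_probs.items():
        not_satisfied = 1.0
        for k, (g, lang) in enumerate(zip(grades, doc_languages), start=1):
            r = params[(intent, lang)][g]
            total += p_intent * not_satisfied * r / k
            not_satisfied *= 1.0 - r
    return total


# Hypothetical parameters: cross-language documents can still satisfy the user.
toy_params: Params = {
    ("native", "native"): [0.0, 0.1, 0.3, 0.6, 0.9],
    ("native", "en"): [0.0, 0.05, 0.1, 0.2, 0.4],
    ("en", "en"): [0.0, 0.1, 0.3, 0.6, 0.9],
    ("en", "native"): [0.0, 0.05, 0.1, 0.2, 0.3],
}
toy_value = err_eia({"native": 0.8, "en": 0.2}, toy_params, [4, 2], ["en", "native"])
```

If every cross-language row of the table is set to the probability of the "Bad" grade, the same function reproduces the classical IA metric, which illustrates why the extension only adds degrees of freedom.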

| SDBN modification | # of params | Perplexity | Gain over the next row, +% |
|---|---|---|---|
| Extended IA (SDBN-EIA) | 21 | 1.268 | +1.02% |
| IA learned “diff params” | 11 | 1.271 | +0.25% |
| IA learned “same params” | 6 | 1.272 | +3.12% |
| Intent-Agnostic learned | 5 | 1.281 | +22.5% |
| IA default | 6 | 1.362 | +17.9% |
| Intent-Agnostic default | 5 | 1.441 | |

Table 2: The average perplexity values for the click models.

4 Experiments

Experimental setup. In our experiments, we consider one of the major web search engines, which operates in one of the European countries (15% of its population have knowledge of foreign languages and 78% of them speak English). In this case, the space of query intents is the set of 2 languages: the native language of 99% of the population of that country and English. Since none of the existing collections for multilingual search evaluation Peters et al. (2012) are provided together with any click data (vital for our learning), we have collected click data from the logs of user interactions with the search engine during a six-month period in 2013. Then, following Chapelle et al. (2009), we perform the following steps to construct our data set. We define a session as an event with one query asked by one user, who received a list of results (URLs) and provided a list of clicked URLs (unlike in Chapelle et al. (2009), our session ends with the last action on its SERP). We restrict all sessions to the top 5 URLs of the first result page (i.e., all further clicks were ignored, and, thus, $N = 5$), since, as also explained in Chapelle et al. (2009), consideration of the top 10 positions would lead to a much smaller intersection between query logs and editorial judgments. Then, we remove the sessions whose results contain at least one document without an editorial judgment, as in Chapelle et al. (2009).

Next, specifically for multilingual search evaluation, we filter our data as follows. We remove sessions with queries containing non-Latin characters. Then, we detect Peters et al. (2012) the language of each document in the top 5. Sessions with documents written in a language outside the set of the two considered languages are removed. Finally, we remove sessions whose user’s location is outside the country under study. The resulting data set has more than 136M sessions and more than 44.8k unique queries. We then split the data randomly into two parts with the ratio . The smaller part is used as the test data and the larger one serves as the training data. We repeat this procedure 100 times in order to apply the paired two-sample t-test and measure the significance level of the obtained results. Then, each click model whose parameters need to be learned from clicks is learned on the training data set (as described in Sec. 2).

Evaluation of the models. In order to evaluate our models on a test set $D$, we use a standard Chuklin et al. (2013b) averaged perplexity metric $p = \frac{1}{N} \sum_{k=1}^{N} p_k$, where, for each position $k$, we calculate the perplexity $p_k$ from the equality

$$\log_2 p_k = -\frac{1}{|D|} \sum_{s \in D} \Bigl( c_k^s \log_2 q_k^s + (1 - c_k^s) \log_2 (1 - q_k^s) \Bigr),$$

where $c_k^s$ is a binary value that indicates the click event at the $k$-th position in the session $s$, and $q_k^s$ is the probability of the click on the $k$-th position in the session $s$ under the model. The better the model, the lower the value of its perplexity (for an ideal model, it is equal to $1$). Also relying on the literature, in order to compute the perplexity gain of a model $A$ over a model $B$, we use the standard formula $(p_B - p_A)/(p_B - 1)$.
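The perplexity computation can be sketched in Python as follows. The sketch assumes that each session is represented by a list of binary click indicators and a list of predicted click probabilities of the same length; the clipping constant `EPS`, which guards against taking the logarithm of zero, and the function names are assumptions of this example.

```python
import math
from typing import List

EPS = 1e-12  # assumed clipping constant to avoid log2(0)


def position_perplexity(clicks: List[int], click_probs: List[float]) -> float:
    """Perplexity at one position over a set of sessions."""
    log_sum = 0.0
    for c, q in zip(clicks, click_probs):
        q = min(max(q, EPS), 1.0 - EPS)
        log_sum += c * math.log2(q) + (1 - c) * math.log2(1.0 - q)
    return 2.0 ** (-log_sum / len(clicks))


def average_perplexity(
    session_clicks: List[List[int]], session_probs: List[List[float]]
) -> float:
    """Perplexity averaged over the positions of the result page."""
    n_positions = len(session_clicks[0])
    per_position = [
        position_perplexity(
            [s[k] for s in session_clicks], [p[k] for p in session_probs]
        )
        for k in range(n_positions)
    ]
    return sum(per_position) / n_positions


def perplexity_gain(p_a: float, p_b: float) -> float:
    """Relative gain of model A over model B; 1.0 is the ideal perplexity."""
    return (p_b - p_a) / (p_b - 1.0)
```

As a sanity check, a model that predicts a click probability of 0.5 everywhere has perplexity 2, and applying `perplexity_gain` to the rounded values 1.362 and 1.441 from Table 2 gives roughly 0.179, in line with the reported +17.9%.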

In Table 2, we report the values of the average perplexity for all click models under study (see Sec. 2 and 3). The differences between all pairs of the models are obtained at a high significance level with p-value . First, we see that our novel intent-aware model outperforms all other studied models (by a 1.02% margin over the second one). Second, we see that there is no big difference (0.25%) between the IA models “same params” and “diff params” (denoted by (1) and (2) in Table 1 respectively). Therefore, we conclude that our third modification (denoted by (3) in Table 1) of the relevance probability dependence gives more benefit than the second one. Third, we see expected results: the order of the models w.r.t. perplexity corresponds to the number of learned parameters, and the models with default parameters have the highest (i.e., worst) perplexity by a large margin. Finally, we conclude that our novel intent-aware click model of user behavior outperforms both the state-of-the-art intent-aware click model and the intent-agnostic model (i.e., SDBN) in its ability to explain user click behavior.

Evaluation of the metrics. In order to compare the metrics under study (see Sec. 3), we calculate the correlation between them and some absolute online metrics over configurations. We choose this method because it is commonly used Chapelle et al. (2011); Chuklin et al. (2013a). First, we utilize the following absolute online metrics Chapelle et al. (2009); Chuklin et al. (2013a): UCTR (a binary value indicating whether the session has a click); MaxRR, MinRR, and MeanRR (the maximal, minimal, and mean reciprocal ranks of the clicks in a session); and PLC (the number of clicks divided by the position of the lowest click). Second, a configuration is a tuple of a query and the top-5 URLs of the ranked documents presented to a user Chapelle et al. (2009); Chuklin et al. (2013a). Our data set has more than 2.1M configurations (i.e., on average, more than configurations per query and more than sessions per configuration). We measure the weighted correlation Chapelle et al. (2009) over the configurations in a test data set between a model-based offline metric and an online metric.
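The online metrics and the weighted correlation can be sketched in Python as follows. The sketch assumes that a session is described by the 1-based ranks of its distinct clicked results, that all metrics are 0 for a session without clicks, and that the weighted correlation is a weighted Pearson coefficient with one weight per configuration (for example, its number of sessions). These conventions and the function names are assumptions of this example rather than details confirmed by the text.

```python
import math
from typing import Dict, List


def online_metrics(click_positions: List[int]) -> Dict[str, float]:
    """Absolute online metrics of one session from 1-based click ranks."""
    if not click_positions:
        # Assumed convention: every metric is 0 for a session without clicks.
        return {"UCTR": 0.0, "MaxRR": 0.0, "MinRR": 0.0, "MeanRR": 0.0, "PLC": 0.0}
    reciprocal_ranks = [1.0 / p for p in click_positions]
    return {
        "UCTR": 1.0,
        "MaxRR": max(reciprocal_ranks),
        "MinRR": min(reciprocal_ranks),
        "MeanRR": sum(reciprocal_ranks) / len(reciprocal_ranks),
        # The lowest click is the one with the largest rank number.
        "PLC": len(click_positions) / max(click_positions),
    }


def weighted_pearson(x: List[float], y: List[float], w: List[float]) -> float:
    """Weighted Pearson correlation between two metrics over configurations."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant metric")
    return cov / math.sqrt(var_x * var_y)
```

In such a setup, one would average each online metric over the sessions of a configuration, compute the offline metric once per configuration, and then pass both vectors with the per-configuration weights to `weighted_pearson`.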

| ERR modification | UCTR | MaxRR | MinRR | MeanRR | PLC |
|---|---|---|---|---|---|
| Extended IA (ERR-EIA) | 0.318 | 0.422 | 0.379 | 0.386 | 0.389 |
| Intent-Agnostic learned (ERR) | 0.283 | 0.337 | 0.309 | 0.314 | 0.317 |
| IA learned “same params” | 0.282 | 0.409 | 0.361 | 0.368 | 0.372 |
| Intent-Agnostic default | 0.27 | 0.326 | 0.298 | 0.303 | 0.305 |
| IA learned “diff params” | 0.268 | 0.394 | 0.347 | 0.354 | 0.357 |
| IA default | 0.131 | 0.178 | 0.152 | 0.157 | 0.158 |

Table 3: The correlations of the multilingual metrics with online metrics.

We compute the correlations using the 100-fold sampling of the previous section: we use the same learned parameters of the click models and calculate the correlation on the test data sets from this sampling. These results are summarized in Table 3 (with p-value ). First, we see that the ranking of our offline metrics with respect to UCTR differs from their ranking with respect to the other absolute metrics. The difference seems to be caused by the definition of UCTR, which, unlike the other 4 metrics, does not account for the positions of clicks. This finding is in line with the results from Chapelle et al. (2009); Chuklin et al. (2013a), where the order of the studied metrics is different with respect to the UCTR metric and with respect to the other online metrics. Second, our novel model-based metric is the clear winner with respect to all absolute metrics. Third, we see that the intent-agnostic metric takes second place w.r.t. UCTR and outperforms some intent-aware metrics w.r.t. the other absolute metrics. Possible explanations of this result, which are discussed in Section 3, encouraged us to study the new intent-aware models. We explain this “strange” result by peculiarities of multilingual diversification: the presence of bilinguals among the search engine users and a high probability of being satisfied by results in both languages penalize the models in which a user who prefers documents in one language cannot be satisfied with documents in another one (the IA models are such models, while the intent-agnostic and EIA ones are not; see Sec. 3).

Finally, we conclude that our novel model-based intent-aware metric outperforms both the state-of-the-art IA metric and the intent-agnostic metric (i.e., ERR) in terms of their correlation with several online metrics. Moreover, its use is necessary because the state-of-the-art IA metric is inferior to the simple intent-agnostic one.

5 Conclusions and future work

In this paper, we were driven by the need to propose a user model and the corresponding metric which are best suited for the case of multilingual search, motivated by the observation that a considerable portion of users can understand two languages and can be satisfied by documents written in one language while searching for documents in another language. As we demonstrated, the straightforward intent-aware modifications of user models do not take such aspects of user behavior into account. Along the way, first, to the best of our knowledge, we proposed a novel method to obtain new metrics of diversified search, which is based on the conversion of intent-aware click models into offline metrics. Second, to the best of our knowledge, we are the first to propose an offline quality evaluation metric which takes the multilingual aspect of search into account. As future work, we can, first, apply our intent-aware modification of metrics to evaluate diversified search based on other types of query intents, such as freshness. Second, we can also experiment with optimization of the click model parameters by directly maximizing the correlation of the metric built on the model with absolute online metrics.


References

  • Agrawal et al. [2009] Agrawal, R., Gollapudi, S., Halverson, A., and Ieong, S. (2009). Diversifying search results. In WSDM.
  • Byrd et al. [1995] Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208.
  • Chang et al. [2011] Chang, Y., Zhang, R., Reddy, S., and Liu, Y. (2011). Detecting multilingual and multi-regional query intent in web search. In AAAI.
  • Chapelle et al. [2011] Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., and Wu, S. L. (2011). Intent-based diversification of web search results: metrics and algorithms. Information Retrieval.
  • Chapelle et al. [2009] Chapelle, O., Metzler, D., Zhang, Y., and Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In CIKM.
  • Chapelle and Zhang [2009] Chapelle, O. and Zhang, Y. (2009). A dynamic Bayesian network click model for web search ranking. In WWW.
  • Chuklin et al. [2013a] Chuklin, A., Serdyukov, P., and De Rijke, M. (2013a). Click model-based information retrieval metrics. In SIGIR.
  • Chuklin et al. [2013b] Chuklin, A., Serdyukov, P., and De Rijke, M. (2013b). Using intent information to model user behavior in diversified search. Advances in Information Retrieval.
  • Collins-Thompson et al. [2014] Collins-Thompson, K., Bennett, P., Diaz, F., Clarke, C. L., and Voorhees, E. M. (2014). TREC 2013 web track overview. Technical report, DTIC Document.
  • Peters et al. [2012] Peters, C., Braschler, M., and Clough, P. (2012). Multilingual information retrieval: From research to practice. Springer Science & Business Media.
  • Sakai [2014] Sakai, T. (2014). Metrics, statistics, tests. Bridging Between IR and Databases.
  • Savoy [2005] Savoy, J. (2005). Comparative study of monolingual and multilingual search models for use with Asian languages. In TALIP.
  • Styskin et al. [2011] Styskin, A., Romanenko, F., Vorobyev, F., and Serdyukov, P. (2011). Recency ranking by diversification of result set. In CIKM.