Reducing Offline Evaluation Bias in Recommendation Systems

by   Arnaud De Myttenaere, et al.
Université Paris 1

Recommendation systems have been integrated into the majority of large online systems. They tailor those systems to individual users by filtering and ranking information according to user profiles. This adaptation process influences the way users interact with the system and, as a consequence, increases the difficulty of evaluating a recommendation algorithm with historical data (via offline evaluation). This paper analyses this evaluation bias and proposes a simple item weighting solution that reduces its impact. The efficiency of the proposed solution is evaluated on real world data extracted from the Viadeo professional social network.




1 Introduction

A recommender system provides a user with a set of possibly ranked items that are supposed to match the interests of the user at a given moment [park2012literature, kantor2011recommender, adomavicius2005toward]. Such systems are ubiquitous in the daily experience of users of online systems. For instance, they are a crucial part of e-commerce, where they help consumers select movies, books, music, etc. that match their tastes. They also provide an important source of revenue, e.g. via targeted ad placements, where the ads displayed on a website are chosen according to the user profile as inferred, for instance, from her browsing history. Commercial aspects set aside, recommender systems can be seen as a way to select and sort information in a personalised way, and as a consequence to adapt a system to a user.

Recommendation algorithms must be evaluated before and during their active use in order to ensure the quality of the recommendations. Live monitoring is generally achieved using online quality metrics such as the click-through rate of displayed ads. This article focuses on offline evaluation, which is done using historical data (which can be recorded during online monitoring). One of the main strategies of offline evaluation consists in simulating a recommendation by removing a confirmation action (click, purchase, etc.) from a user profile and testing whether the item associated with this action would have been recommended based on the rest of the profile [shani2011evaluating]. Numerous variations of this general scheme are used, ranging from removing several confirmations to taking into account item ratings.

While this general scheme is completely valid from a statistical point of view, it ignores various factors that have influenced the historical data, such as the recommendation algorithms previously used.

Assume for instance that several recommendation algorithms are evaluated at time $t_0$ based on the historical data of the user database until $t_0$. Then the best algorithm is selected according to a quality metric associated with the offline procedure and put in production. It starts recommending items to the users. Provided the algorithm is good enough, it generates some confirmation actions. Those actions can be attributed to good user modeling, but also to luck and to a natural attraction of some users to new things. This is especially true when the cost of confirming/accepting a recommendation is low. In the end, the state of the system at a later time $t_1 > t_0$ has been influenced by the recommendation algorithm in production.

Then if one wants to monitor the performance of this algorithm at time $t_1$, the offline procedure sometimes overestimates the quality of the algorithm, because confirmation actions are now frequently triggered by the recommendations, leading to a very high predictability of the corresponding items.

This bias in offline evaluation with online systems can also be caused by other events, such as a promotional offer on some specific products between a first offline evaluation and a second one. Its main effect is to favor algorithms that tend to recommend items that have been favored between $t_0$ and $t_1$, and thus to favor a kind of “winner take all” situation in which the algorithm considered as the best at $t_0$ will probably remain the best one afterwards, even if an unbiased procedure could demote it. While limits of evaluation strategies for recommendation algorithms have been identified in e.g. [HerlockerEtAl2004Evaluating, mcnee2006being, said2013user], the evaluation bias described above has not been addressed in the literature, to our knowledge.

This paper proposes a modification of the classical offline evaluation procedure that reduces the impact of this bias. Following the general principle of weighting instances used in the context of covariate shift [sugiyama2007covariate], we propose to assign a tunable weight to each item. The weights are optimized in order to reduce the bias without discarding new data generated since the reference evaluation.

The rest of the paper is organized as follows. Section 2 describes in detail the setting and the problem addressed in this paper. Section 3 introduces the weighting scheme proposed to reduce the evaluation bias. Section 4 demonstrates the practical relevance of the method on real world data extracted from the Viadeo professional social network (Viadeo is the world’s second largest professional social network, with 55 million members in August 2013).

2 Problem formulation

2.1 Notations and setting

We denote by $U$ the set of users, $I$ the set of items and $D_t$ the historical data available at time $t$. A recommendation algorithm is a function $g$ from $U \times D_t$ to some set built from $I$. We denote by $g_t(u)$ the recommendation computed by $g$ at instant $t$ for user $u$. The recommendation $g_t(u)$ could be a list of items (ordered by decreasing interest), a set of items (with no ranking), a mapping from a subset of $I$ to numerical grades for some items, etc. The specifics are not relevant to the present analysis, as we assume given a quality function $l$, from the product of the result space of $g$ and $I$ to $\mathbb{R}^+$, that measures to what extent an item $i$ is correctly recommended by $g$ at time $t$ via $l(g_t(u), i)$.

Offline evaluation is based on the possibility of “removing” any item $i$ from a user profile ($I_u$ denotes the items associated with $u$). The result is denoted $u_{-i}$, and $g_t(u_{-i})$ is the recommendation obtained at instant $t$ when $i$ has been removed from the profile of user $u$. If $g$ outputs a subset of $I$, then one possible choice for $l$ is $l(g_t(u_{-i}), i) = 1$ when $i \in g_t(u_{-i})$ and 0 otherwise. If $g$ outputs a list of the best items, then $l$ will decrease with the rank of $i$ in this list (it could be, e.g., the inverse of the rank).
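The two quality functions mentioned above can be sketched in a few lines. This is only an illustration: the function names `binary_quality` and `inverse_rank_quality` are hypothetical, and the recommendation is assumed to be either a set of items or a list ranked from best to worst.

```python
def binary_quality(recommended, item):
    """Quality for a set-valued recommendation: 1 if the held-out item is recommended."""
    return 1.0 if item in recommended else 0.0


def inverse_rank_quality(ranked, item):
    """Quality for a ranked list: inverse of the (1-based) rank, 0 if the item is absent."""
    if item not in ranked:
        return 0.0
    return 1.0 / (ranked.index(item) + 1)


# A held-out skill found in second position of a ranked list scores 1/2.
example = inverse_rank_quality(["sql", "python", "marketing"], "python")
```

Both functions return 0 for an item that is not recommended, so a direct recommendation failure contributes nothing to the score.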

Finally, offline evaluation follows a general scheme in which a user $u$ is chosen according to some prior probability on users, $P(u)$ (these probabilities might reflect the business importance of the users, for instance). Given a user, an item is chosen among the items associated with its profile, according to some conditional probability on items, $P_t(i \mid u)$. When an item $i$ is not associated with a user $u$ (that is, $i \notin I_u$), $P_t(i \mid u) = 0$. Notice that while we use a stochastic framework, exhaustive approaches are common in medium-size systems. In this case, the probabilities are interpreted as weights and all the pairs $(u, i)$ (where $i \in I_u$) are used in the evaluation process. In both stochastic and exhaustive evaluations, a very common choice for $P(u)$ is the uniform probability on $U$. It is also quite common to use a uniform probability for $P_t(i \mid u)$, although other choices are possible: for instance, one could favor items recently associated with a profile over older ones.

The two distributions $P(u)$ and $P_t(i \mid u)$ lead to a joint distribution $P_t(u, i) = P_t(i \mid u) P(u)$ on $U \times I$. In an online system, $P_t(i \mid u)$ evolves over time (while $P(u)$ could also evolve over time, we do not consider the effects of such evolution in the present article). For example, if the probability $P_t(i \mid u)$ is uniform over the items associated with user $u$, then as soon as $u$ gets a new item (recommended by an algorithm, for instance), all the probabilities $P_t(j \mid u)$ of that user are modified. The same is true for more complex schemes that take into account the age of the items, for instance.
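A minimal sketch of this effect, assuming a uniform conditional probability over the profile; the helper name `uniform_conditional` and the skill names are illustrative only.

```python
from fractions import Fraction


def uniform_conditional(profile):
    """Uniform conditional probability over the items of one user profile."""
    return {item: Fraction(1, len(profile)) for item in profile}


# The same user before and after accepting one recommended skill.
before = uniform_conditional({"sql", "python"})
after = uniform_conditional({"sql", "python", "marketing"})

# Every pre-existing item loses probability mass: 1/2 becomes 1/3.
shift = {item: after[item] - before[item] for item in before}
```

Accepting a single recommendation therefore changes the selection probability of every item already in the profile, not only that of the new item.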

2.2 Origin of the bias in offline evaluation

The offline evaluation procedure consists in calculating the quality of the recommender $g$ at instant $t$ as $L_t(g) = \mathbb{E}\left(l(g_t(u_{-i}), i)\right)$, where the expectation is taken with respect to the joint distribution, that is

$$L_t(g) = \sum_{(u,i) \in U \times I} P_t(u, i)\, l(g_t(u_{-i}), i). \qquad (1)$$

In very large systems, $L_t(g)$ is approximated by actually sampling from $U \times I$ according to the probabilities $P_t(u, i)$, while in small ones the probabilities are used as weights, as pointed out above.
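The sampling approximation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: both probabilities are uniform, users with an empty profile are skipped, the quality function is binary, and the names `offline_score`, `profiles` and `recommend` are hypothetical.

```python
import random


def offline_score(profiles, recommend, n_draws=20000, seed=0):
    """Monte Carlo estimate of the offline quality of `recommend`.

    profiles:  dict mapping each user to the set of items in the profile
    recommend: function taking a profile (set) and returning recommended items
    """
    rng = random.Random(seed)
    users = sorted(u for u, items in profiles.items() if items)
    hits = 0
    for _ in range(n_draws):
        user = rng.choice(users)                 # uniform prior on users
        item = rng.choice(sorted(profiles[user]))  # uniform choice in the profile
        held_out = profiles[user] - {item}       # profile with the item removed
        hits += 1 if item in recommend(held_out) else 0  # binary quality
    return hits / n_draws


profiles = {
    "u1": {"python", "sql"},
    "u2": {"python"},
    "u3": {"marketing", "sql"},
}
# A constant algorithm: it ignores the profile it receives.
score = offline_score(profiles, lambda profile: {"python", "sql"})
```

On this toy data the exact value of the expectation is 5/6 (only the "marketing" draw of `u3` is a miss), and the estimate should land close to it.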

As a consequence, if two algorithms are evaluated at two different moments, their qualities are not directly comparable. While this problem does not fall exactly into the covariate shift paradigm [Shimodaira2000227], it is related: once a recommendation algorithm is chosen based on a given state of the system, it is almost guaranteed to influence the state of the system once put in production, inducing an increasing distance between its evaluation environment (i.e. the initial state of the system) and the evolving state of the system. This influence of the recommendation algorithm on the state of the system is responsible for the bias, since offline evaluation relies on historical data.

A naive solution to this bias would be to define a fixed evaluation database (a snapshot of the system at $t_0$) and to compare algorithms only with respect to this original database. This is clearly unacceptable for an online system, as it would discard both new users and, more importantly, evolutions of user profiles.

2.3 Real world illustration of the bias

We illustrate the evolution of the item probabilities $P_t(i)$ in an online system with a functionality provided by the Viadeo platform: each user can claim to have some skills that are displayed on his/her profile (examples of skills include project management, marketing, etc.). In order to obtain more complete profiles, skills are recommended to the users via a recommendation algorithm, a practice that obviously has consequences on the probabilities $P_t(i)$, as illustrated in Figure 1.

Figure 1: Impact of recommendation campaigns on the item probabilities: each curve displays the evolution of $P_t(i)$ over time for a given item.

The skill functionality was implemented at time $t = 0$. After 300 days, some of the $P_t(i)$ are roughly static. Probabilities of other items still evolve over time under various influences, but the major sources of evolution are recommendation campaigns. Indeed, two recommendation campaigns were conducted during the observation period: users received personalized recommendations of skills to add to their profiles. The figure shows strong modifications of the $P_t(i)$ shortly after each campaign. In particular, the probabilities of the items which have been recommended increase significantly; this is the case for the black, pink and light blue curves after the first campaign. On the other hand, the probabilities of the items which have not been recommended decrease at the same time. The probabilities then tend to become stable again, until the same phenomenon is observed right after the second recommendation campaign: the curves corresponding to the items that have been recommended again keep increasing. The green curve represents the probability of an item which has been recommended only during the second recommendation campaign. Section 4.2 demonstrates the effects of this evolution on algorithm evaluations.

3 Reducing the evaluation bias

3.1 Principle for reducing the bias

Let us consider a naive algorithm which always recommends the same items whatever the user and the historical data. In other words, $g_t(u) = g$ is constant. Constant algorithms are particularly easy to understand and useful to illustrate the bias due to external factors. Indeed, one can reasonably assume that the score of such algorithms does not vary strongly over time.

A simple transformation of equation (1) shows that for a constant algorithm $g$:

$$L_t(g) = \sum_{i \in I} P_t(i)\, l(g, i), \qquad \text{where } P_t(i) = \sum_{u \in U} P_t(i \mid u)\, P(u). \qquad (2)$$

As a consequence, a way to guarantee a stationary evaluation framework for a constant algorithm is to have constant values for the $P_t(i)$ (the marginal distribution of the items).
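The following sketch computes the item marginals and the score of a constant algorithm on a toy example, to show how a single accepted recommendation shifts the score. It assumes uniform probabilities on users and on profile items and a binary quality function; `item_marginals` and `constant_score` are hypothetical names.

```python
from fractions import Fraction


def item_marginals(profiles):
    """Marginal probability of each item, for uniform user and item probabilities."""
    users = [u for u in profiles if profiles[u]]
    p_user = Fraction(1, len(users))
    marginals = {}
    for u in users:
        p_item = Fraction(1, len(profiles[u]))
        for item in profiles[u]:
            marginals[item] = marginals.get(item, 0) + p_user * p_item
    return marginals


def constant_score(marginals, recommended):
    """Score of a constant algorithm with a binary quality function."""
    return sum(marginals.get(item, 0) for item in recommended)


state_t0 = {"u1": {"sql"}, "u2": {"python"}}
# Later state: u1 has accepted a recommendation of "python".
state_t1 = {"u1": {"sql", "python"}, "u2": {"python"}}

score_python_t0 = constant_score(item_marginals(state_t0), {"python"})
score_python_t1 = constant_score(item_marginals(state_t1), {"python"})
score_sql_t1 = constant_score(item_marginals(state_t1), {"sql"})
```

The algorithm that always recommends "python" goes from 1/2 to 3/4, while the one that always recommends "sql" drops from 1/2 to 1/4, although neither algorithm has changed.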

A natural way to obtain constant values for $P_t(i)$ would be to record those probabilities at $t_0$ and use them subsequently in offline evaluation as the probability of selecting an item. However, this would require reversing the way offline evaluation is done: first select an item, then select a user having this item with a certain conditional probability. But as the probability law $P(u)$ originally defined on users reflects their relative importance and should not be modified, it would be necessary to compute these conditional probabilities such that the overall probability law on users remains close enough to the original one. The computation of the coefficients would need to be done for all users. Keeping the standard offline evaluation procedure and computing coefficients that alter the probabilities of selecting an item for a given user is more efficient, because it can be done for a limited number of key items only (in practice far fewer than the number of users in most real world systems), leading to a much lower complexity.

A strong assumption we make is that in practice reducing offline evaluation bias for constant algorithms contributes to reducing offline evaluation bias for all algorithms.

3.2 Item weights

The $P_t(i \mid u)$ probabilities are thus the only quantities that can be modified in order to reduce the bias of offline evaluation. In particular, $P(u)$ is driven by business considerations related to the importance of individual users and can seldom be manipulated without impairing the associated business metrics. We therefore propose to depart from the classical values for $P_t(i \mid u)$ (such as a uniform probability) in order to mimic static values for $P_t(i)$. This approach is related to the weighting strategy used in the case of covariate shift [sugiyama2007covariate].

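This is implemented via tunable item-specific weights, $\omega = (\omega_i)_{i \in I}$, which induce modified conditional probabilities $P_t(i \mid u, \omega)$. The general idea is to increase the probability of selecting $i$ if $\omega_i$ is larger than 1 and vice versa, so that $\omega$ recalibrates the probability of selecting each item. The simplest way to implement this probability modification is to define $P_t(i \mid u, \omega)$ as follows:

$$P_t(i \mid u, \omega) = \frac{\omega_i\, P_t(i \mid u)}{\sum_{j \in I} \omega_j\, P_t(j \mid u)}. \qquad (3)$$

Other weighting schemes could be used. Notice that these weighted conditional probabilities lead to weighted item probabilities defined by:

$$P_t(i \mid \omega) = \sum_{u \in U} P_t(i \mid u, \omega)\, P(u). \qquad (4)$$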
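A small sketch of this weighting, assuming uniform probabilities on users and on profile items (so that the unweighted conditional probability cancels in the normalization) and a default weight of 1 for items without an explicit weight. The name `weighted_marginals` is hypothetical.

```python
from fractions import Fraction


def weighted_marginals(profiles, weights):
    """Weighted item probabilities for uniform user and item probabilities.

    With a uniform conditional probability over the profile, the weighted
    conditional probability of an item is its weight divided by the sum of
    the weights of the items in the profile.
    """
    users = [u for u in profiles if profiles[u]]
    p_user = Fraction(1, len(users))
    marginals = {}
    for u in users:
        norm = sum(weights.get(j, 1) for j in profiles[u])
        for item in profiles[u]:
            cond = Fraction(weights.get(item, 1)) / norm
            marginals[item] = marginals.get(item, 0) + p_user * cond
    return marginals


state_t1 = {"u1": {"sql", "python"}, "u2": {"python"}}
unweighted = weighted_marginals(state_t1, {})
# A weight below 1 on "python" lowers its probability of being selected.
reweighted = weighted_marginals(state_t1, {"python": Fraction(1, 3)})
```

Here the probability of "python" goes from 3/4 without weights to 5/8 with a weight of 1/3, and the probabilities still sum to 1.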


3.3 Adjusting the weights

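We thus reduce the evaluation bias by leveraging the weights $\omega$ and using the associated distribution $P_{t_1}(i \mid \omega)$ instead of $P_{t_1}(i)$. Indeed, one can choose $\omega$ in such a way that $P_{t_1}(i \mid \omega) \simeq P_{t_0}(i)$. This allows one to use all the data available at time $t_1$ for the offline evaluation while limiting the bias induced by those new data.

This leads to a non-linear system with one equation and one parameter $\omega_i$ per item, such that for all $i \in I$:

$$P_{t_0}(i) = \sum_{u \in U} \frac{\omega_i\, P_{t_1}(i \mid u)}{\sum_{j \in I} \omega_j\, P_{t_1}(j \mid u)}\, P(u).$$

This system cannot be solved easily, and we thus need to approximate its solution using an optimisation algorithm.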

Optimizing the weights amounts to reducing a dissimilarity between the weighted distribution and the original one. We use here the Kullback-Leibler divergence, that is

$$D(\omega) = \sum_{i \in S_{t_0}} P_{t_0}(i) \log \frac{P_{t_0}(i)}{P_{t_1}(i \mid \omega)},$$

where $S_{t_0}$ represents the set of items which have been selected at least once at $t_0$.

The asymmetric nature of $D$ is useful in our context, as it reduces the influence of items that were rare at time $t_0$: they were not very important in the calculation of $L_{t_0}(g)$.

The target probability $P_{t_0}(i)$ is computed once and for all items at the initial evaluation time. The cost of one coordinate of the gradient grows with the number of couples $(u, i)$ with $i \in I_u$ at instant $t_1$, and the whole gradient requires one such coordinate per item. This would be prohibitive on a large system. To limit the optimization cost, we focus on the largest modifications between $t_0$ and $t_1$. More precisely, we compute $P_{t_1}(i)$ once for all $i$ and select the subset of $I$ of size $p$ which exhibits the largest differences in absolute value between $P_{t_0}(i)$ and $P_{t_1}(i)$.

Then $D$ is only optimized with respect to the corresponding $p$ weights, which reduces the cost of each gradient calculation accordingly. Notice that $p$ is therefore an important parameter of the weighting strategy. In practice, we optimize the divergence via a basic gradient descent.
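The procedure can be sketched end to end on a toy example. This is a minimal illustration under several assumptions that are not taken from the source: uniform probabilities on users and on profile items, profiles that only grow between the two instants, a fixed step size `lr`, a small positive floor on the weights, and a gradient obtained by applying the quotient rule to the normalized weights. All function names are hypothetical.

```python
import math


def weighted_marginals(profiles, w):
    """Weighted item probabilities; items without a weight use 1."""
    users = [u for u in profiles if profiles[u]]
    marg = {}
    for u in users:
        norm = sum(w.get(j, 1.0) for j in profiles[u])
        for i in profiles[u]:
            marg[i] = marg.get(i, 0.0) + w.get(i, 1.0) / norm / len(users)
    return marg


def kl(target, current):
    """Kullback-Leibler divergence from the target item distribution."""
    return sum(p * math.log(p / current[i]) for i, p in target.items())


def gradient(profiles, w, target, active):
    """Partial derivatives of the divergence for the active weights."""
    users = [u for u in profiles if profiles[u]]
    current = weighted_marginals(profiles, w)
    ratio = {i: target.get(i, 0.0) / current[i] for i in current}
    grad = dict.fromkeys(active, 0.0)
    for u in users:
        norm = sum(w.get(j, 1.0) for j in profiles[u])
        q = {i: w.get(i, 1.0) / norm for i in profiles[u]}  # weighted conditional
        mean_ratio = sum(ratio[i] * q[i] for i in q)
        for k in q:
            if k in grad:
                grad[k] -= q[k] * (ratio[k] - mean_ratio) / (w[k] * len(users))
    return grad


def fit_weights(profiles_t0, profiles_t1, p, steps=200, lr=0.5):
    """Gradient descent on the p weights whose item probability moved the most."""
    target = weighted_marginals(profiles_t0, {})
    start = weighted_marginals(profiles_t1, {})
    active = sorted(start, key=lambda i: abs(target.get(i, 0.0) - start[i]),
                    reverse=True)[:p]
    w = dict.fromkeys(active, 1.0)
    for _ in range(steps):
        g = gradient(profiles_t1, w, target, active)
        for k in active:
            w[k] = max(w[k] - lr * g[k], 1e-6)  # floor keeps the weights positive
    return w, target


state_t0 = {"u1": {"sql", "python"}, "u2": {"python"}, "u3": {"sql"}}
state_t1 = {"u1": {"sql", "python"}, "u2": {"python"}, "u3": {"sql", "python"}}
w, target = fit_weights(state_t0, state_t1, p=2)
kl_before = kl(target, weighted_marginals(state_t1, {}))
kl_after = kl(target, weighted_marginals(state_t1, w))
```

In this toy case the initial item distribution can be recovered exactly (a weight ratio of 1/3 between "python" and "sql" restores probabilities of 1/2 each), so the divergence should drop close to zero. Only the ratio between weights matters for a given profile, which is why both weights can move during the descent.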

Notice that to implement weight optimization, one needs to compute $P_{t_0}(i)$ and $P_{t_1}(i)$. As pointed out in Section 3.1, these are costly operations. We assume however that evaluating several recommendation algorithms has a much larger cost, because of the repeated evaluation of $L_t(g)$ associated with, e.g., statistical model parameter tuning. Then, while optimizing $\omega$ is costly, it allows one to rely on the efficient classical offline strategy to evaluate recommendation algorithms with a reduced bias.

4 Experimental evaluation

4.1 Data and metrics

The proposed approach is studied on real world data extracted from the Viadeo professional social network. The recommendation setting is the one described in Section 2.3: users can attach skills to their profile. Skills are recommended to the users in order to help them build more accurate and complete profiles. In this context, items are skills. The data set used for the analysis contains 34 448 users and 35 741 items. The average number of items per user is 5.33. The distribution of items per user roughly follows a power law, as shown in Figure 2.

Figure 2: Distribution of items per user

Both probabilities $P(u)$ and $P_t(i \mid u)$ are uniform. The quality function $l$ is given by $l(g_t(u_{-i}), i) = \mathbf{1}_{i \in g_t(u_{-i})}$, where $g_t(u_{-i})$ consists of 5 items. We use constant recommendation algorithms to focus on the direct effects of our weighting proposal, which means here that each algorithm is based on a selection of 5 items that will be recommended to all users.

The quality of a recommendation algorithm, $L_t(g)$, is estimated via stochastic sampling in order to simulate what could be done on a larger data set than the one used for testing. We repeatedly selected 20 000 users (uniformly among the 34 448, with possible repetitions) and then one item per user (according to $P_t(i \mid u)$ or $P_t(i \mid u, \omega)$).

The analysis is conducted on a 201-day period, from day 300 to day 500. Day 0 corresponds to the launch date of the skill functionality. As noted in Section 2.3, two recommendation campaigns were conducted by Viadeo during this period.
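The sampling protocol, with or without item weights, can be sketched as below. This is only an illustration on toy data: the function name `sampled_score`, the profiles and the weight value are assumptions, and items without an explicit weight are given a weight of 1.

```python
import random


def sampled_score(profiles, recommended, weights=None, n_draws=20000, seed=0):
    """Estimate the score of a constant recommendation list by sampling.

    Users are drawn uniformly with repetition; one item per drawn user is
    then selected with probability proportional to its weight.
    """
    rng = random.Random(seed)
    weights = weights or {}
    users = sorted(u for u in profiles if profiles[u])
    hits = 0
    for _ in range(n_draws):
        user = rng.choice(users)
        items = sorted(profiles[user])
        item_weights = [weights.get(i, 1.0) for i in items]
        item = rng.choices(items, weights=item_weights, k=1)[0]
        hits += item in recommended  # binary quality for a constant list
    return hits / n_draws


profiles = {"u1": {"sql", "python"}, "u2": {"python"}, "u3": {"sql", "python"}}
raw = sampled_score(profiles, {"python"})
corrected = sampled_score(profiles, {"python"}, weights={"python": 1 / 3})
```

On this toy data the unweighted expectation is 2/3 and the weighted one is 1/2, so the two estimates should differ by roughly 1/6.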

4.2 Bias in action

We first demonstrate the effect of the bias on two constant recommendation algorithms. The first one is modeled after the actual recommendation algorithm used by Viadeo in the following sense: it recommends the five items most often recommended by Viadeo over a given time interval. The second algorithm takes the opposite approach by recommending the five most frequent items at a given time among the items that were never recommended over the interval considered. In a sense, the first algorithm agrees with Viadeo’s recommendation algorithm, while the second disagrees.

Figure 3: Evolution of the offline score of the first algorithm over time (this algorithm “agrees” with the recommendation algorithm)
Figure 4: Evolution of the offline score of the second algorithm over time (this algorithm “disagrees” with the recommendation algorithm)

Figures 3 and 4 show the evolution of the offline scores of the two algorithms over time. As both algorithms are constant, it would be reasonable to expect minimal variations of their offline evaluation scores. However, in practice the estimated quality of the first algorithm increases by more than 25 %, while the relative decrease of the second reaches 33 %.

4.3 Reduction of the bias

We apply the strategy described in Section 3 to compute optimal weights at different instants and for several values of the number of optimized weights. Results are summarized in Figures 5 and 6.

Figure 5: Evolution of the offline score of the first algorithm over time (this algorithm “agrees” with the recommendation algorithm) when items are weighted (see text for details).
Figure 6: Evolution of the offline score of the second algorithm over time (this algorithm “disagrees” with the recommendation algorithm) when items are weighted (see text for details).

The figures clearly show the stabilizing effect of the weighting strategy on the scores of both algorithms. In the case of the first algorithm, the stabilisation is quite satisfactory with only a limited number of active weights. This is expected, because this algorithm agrees with Viadeo’s recommendation algorithm and therefore recommends items whose probabilities change a lot over time. Those probabilities are exactly the ones that are corrected by the weighting technique.

The case of the second algorithm is less favorable, as no stabilisation occurs when few weights are active. This can be explained by the relative stability over time of the probabilities of the items it recommends (indeed, those items are not recommended by Viadeo during the period under study). The perceived reduction in quality over time is then a consequence of the increased probabilities associated with other items. Because those items are never recommended by the second algorithm, they correspond to direct recommendation failures. In order to stabilize the evaluation, we need to take into account weaker modifications of probabilities, which can only be done by increasing the number of active weights. This is clearly shown by Figure 6.

5 Conclusion

We have analyzed the offline evaluation bias induced by various factors that have influenced the historical data of an online system, such as the recommendation algorithms previously used. Indeed, as recommendations influence users, a recommendation algorithm in production tends to be favored by offline evaluation over time. Conversely, an algorithm with different recommendations will generally see its offline evaluation score decrease over time. To overcome this bias, we have introduced a simple item weighting strategy inspired by techniques designed for tackling the covariate shift problem. We have shown on real world data extracted from the Viadeo professional social network that the proposed technique reduces the evaluation bias for constant recommendation algorithms.

While the proposed solution is very general, we have only focused on the simplest situation of constant recommendations evaluated with a binary quality metric (an item is either in the list of recommended items or not). Future work includes the confirmation of bias reduction on more elaborate algorithms, possibly with more complex quality functions. The trade-off between the computational cost of the proposed solution and its quality should also be investigated in more detail.

Appendix A Algorithmic details

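A.1 Gradient calculation

We optimize $D$ with a gradient-based algorithm, and hence $\nabla D$ is needed. Let $i$ and $k$ be two distinct items; then, from the weighted conditional probabilities of Section 3.2,

$$\frac{\partial P_t(i \mid u, \omega)}{\partial \omega_k} = -\frac{P_t(i \mid u, \omega)\, P_t(k \mid u, \omega)}{\omega_k}.$$

We have also

$$\frac{\partial P_t(i \mid u, \omega)}{\partial \omega_i} = \frac{P_t(i \mid u, \omega)\left(1 - P_t(i \mid u, \omega)\right)}{\omega_i},$$

and therefore for all $k$:

$$\frac{\partial D(\omega)}{\partial \omega_k} = -\sum_{i \in S_{t_0}} \frac{P_{t_0}(i)}{P_{t_1}(i \mid \omega)} \sum_{u \in U} P(u)\, \frac{\partial P_{t_1}(i \mid u, \omega)}{\partial \omega_k}.$$
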
We have implicitly assumed that the evaluation is based on independent draws, and therefore:




Application: if and , then:

Complexity: Assuming we have a sparse matrix such as , we suggest to precalculate and then for each coordinate of the gradient and for each :

  • compute in

  • compute in

Then each consists in a sum of terms computed in , so that we can compute each coordinate of the gradient is .

Thus, as the complexity to compute coordinates of the gradient is .


References

  • [adomavicius2005toward] Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17, 734–749.
  • [HerlockerEtAl2004Evaluating] Herlocker, J. L., Konstan, J. A., Terveen, L. G., & Riedl, J. T. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22, 5–53.
  • [kantor2011recommender] Kantor, P. B., Rokach, L., Ricci, F., & Shapira, B. (Eds.). (2011). Recommender systems handbook. Springer.
  • [mcnee2006being] McNee, S. M., Riedl, J., & Konstan, J. A. (2006). Being accurate is not enough: how accuracy metrics have hurt recommender systems. CHI’06 extended abstracts on Human factors in computing systems (pp. 1097–1101).
  • [park2012literature] Park, D. H., Kim, H. K., Choi, I. Y., & Kim, J. K. (2012). A literature review and classification of recommender systems research. Expert Systems with Applications, 39, 10059–10072.
  • [said2013user] Said, A., Fields, B., Jain, B. J., & Albayrak, S. (2013). User-centric evaluation of a k-furthest neighbor collaborative filtering recommender algorithm. Proceedings of the 2013 conference on Computer supported cooperative work (pp. 1399–1408).
  • [shani2011evaluating] Shani, G., & Gunawardana, A. (2011). Evaluating recommendation systems. In P. B. Kantor, L. Rokach, F. Ricci and B. Shapira (Eds.), Recommender systems handbook, 257–297. Springer.
  • [Shimodaira2000227] Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.
  • [sugiyama2007covariate] Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. The Journal of Machine Learning Research, 8, 985–1005.