Study of a bias in the offline evaluation of a recommendation algorithm

by   Arnaud De Myttenaere, et al.

Recommendation systems have been integrated into the majority of large online systems to filter and rank information according to user profiles. They thus influence the way users interact with the system and, as a consequence, bias the evaluation of the performance of a recommendation algorithm computed using historical data (via offline evaluation). This paper describes this bias and discusses the relevance of a weighted offline evaluation to reduce this bias for different classes of recommendation algorithms.




1 Introduction

A recommender system provides a user with a set of possibly ranked items that are supposed to match the interests of the user at a given moment [park2012literature, kantor2011recommender, adomavicius2005toward]. Such systems are ubiquitous in the daily experience of users of online systems. For instance, they are a crucial part of e-commerce where they help consumers select movies, books, music, etc. that match their tastes. They also provide an important source of revenue, e.g. via targeted ad placements where the ads displayed on a website are chosen according to the user profile as inferred, for instance, from her browsing history. Commercial aspects set aside, recommender systems can be seen as a way to select and sort information in a personalized way, and as a consequence to adapt a system to a user.

Obviously, recommendation algorithms must be evaluated before and during their active use in order to ensure their performance. Live monitoring is generally achieved using online performance metrics (e.g. click-through rate of displayed ads), and several recommendation strategies can be compared using A/B testing and online evaluation, whereas offline evaluation is computed using historical data. However, putting an algorithm in production, then collecting and analyzing data, is generally a long process (many days or weeks). Offline evaluation makes it possible to quickly test several strategies without waiting for real metrics to be collected and without impacting the performance of the online system. One of the main strategies of offline evaluation consists in simulating a recommendation by removing a confirmation action (click, purchase, etc.) from a user profile and testing whether the item associated with this action would have been recommended based on the rest of the profile [shani2011evaluating]. Numerous variations of this general scheme are used, ranging from removing several confirmations to taking into account item ratings.

While this general scheme is completely valid from a statistical point of view, it ignores various factors that have influenced historical data, such as the recommendation algorithms previously used. Even if limits of evaluation strategies for recommendation algorithms have been identified [HerlockerEtAl2004Evaluating, mcnee2006being, said2013user], this protocol is still intensively used in practice.

We study in this paper the general principle of instance weighting proposed in [demytt2014reducing] and show its practical relevance on the simple case of constant recommendation and on two collaborative filtering algorithms. In addition to its good performance, this method is more realistic than the solutions proposed in [HerlockerEtAl2004Evaluating, mcnee2006being], which require a data collection phase based on random recommendations. While this phase allows one to build a bias-free evaluation data set, it also has adverse effects, e.g. on public image or business performance, when used on a live system, as random recommendations are obviously less relevant than the personalized recommendations produced by an algorithm.

The rest of the paper is organized as follows. Section 2 describes in detail the setting and the problem. Section 3 introduces the weighting scheme proposed to reduce the evaluation bias. Section 4 demonstrates the practical relevance of our method for the particular case of constant algorithms and presents experimental results based on real-world data extracted from Viadeo, a professional social network. Section 5 describes the results of our approach on two collaborative filtering algorithms and discusses the reduction of the bias for more elaborate algorithms.

2 Problem formulation

2.1 Notations

We denote $U$ the set of users, $I$ the set of items and $D_t$ the historical data available at time $t$. As users are associated to items, $D_t$ can be represented as a bipartite graph. Let $n_t$ and $m_t$ be the cardinalities of the sets of users and items at time $t$, and let $A_t$ be the adjacency matrix of the bipartite graph given by $D_t$. Then

$$A_t = \begin{pmatrix} 0 & B_t \\ B_t^{\top} & 0 \end{pmatrix}$$

where each $0$ represents a zero matrix of the appropriate size, and $B_t$ is a binary matrix. $B_t$ is called the biadjacency matrix: for each $(u, i)$ in $U \times I$, the entry of $B_t$ associated to $(u, i)$ is 1 if the item $i$ is associated to user $u$ and 0 otherwise. A representation of the data is presented in Figure 1.

Figure 1: Representation of the data as a bipartite graph and notations
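As an illustration of this block structure, the following minimal Python sketch builds the adjacency matrix of a bipartite graph from a small biadjacency matrix. The toy data, the function name `adjacency_from_biadjacency` and the choice of storing users as rows are assumptions made for the example only.

```python
def adjacency_from_biadjacency(B):
    """Build the (n+m) x (n+m) adjacency matrix of a bipartite graph.

    B is a list of n rows (users, by assumption) with m binary entries (items).
    """
    n, m = len(B), len(B[0])
    A = [[0] * (n + m) for _ in range(n + m)]
    for u in range(n):
        for i in range(m):
            if B[u][i]:
                A[u][n + i] = 1  # upper-right block: B
                A[n + i][u] = 1  # lower-left block: transpose of B
    return A


# Hypothetical toy data: 2 users, 3 items.
B = [[1, 0, 1],
     [0, 1, 1]]
A = adjacency_from_biadjacency(B)
```

The two diagonal blocks stay at zero because a bipartite graph has no user-user or item-item edge.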

A recommendation algorithm is a function $g$ from $U \times D_t$ to some set built from $I$. We will denote $g_t(u) = g(u, D_t)$ the recommendation computed by the algorithm $g$ at instant $t$ for user $u$. We assume given a quality function $l$ from the product of the result space of $g$ and $I$ to $\mathbb{R}^+$ that measures to what extent an item $i$ is correctly recommended by $g$ at time $t$ via $l(g_t(u), i)$. We denote $I_t(u)$ the set of items associated to a user $u$, and $U_t(i)$ the set of users associated to the item $i$.

2.2 The classical offline evaluation procedure

Offline evaluation is based on the possibility of “removing” any item $i$ from a user profile; it can be computed using stochastic or exhaustive sampling. Although exhaustive sampling gives more robust results, the stochastic approach is often preferred (especially for large systems) as it is faster and often precise enough to compare several algorithms. The user profile obtained after removing item $i$ from user $u$ is denoted $u_{-i}$, and $g_t(u_{-i})$ is the recommendation obtained at instant $t$ when $i$ has been removed from the profile of user $u$.

Finally, offline evaluation follows a general scheme in which a user $u$ is chosen according to some probability on users $P_t(u)$, which might reflect the business importance of the users. Given a user, an item is chosen among the items associated to his/her profile, according to some conditional probability on items $P_t(i|u)$. When an item $i$ is not associated to a user $u$ (that is $i \notin I_t(u)$), $P_t(i|u) = 0$. A very common choice for $P_t(u)$ is the uniform probability on $U$, and it is also very common to use a uniform probability for $P_t(i|u)$ (other strategies could favor items recently associated to a profile). As the system evolves over time, $P_t(u)$ and $P_t(i|u)$ depend on $t$.

The two distributions $P_t(u)$ and $P_t(i|u)$ lead to a joint distribution $P_t(u, i) = P_t(i|u)\, P_t(u)$ on $U \times I$. In other words, the classical offline evaluation consists in selecting a random node in the users' part of the bipartite graph, and then a random node among the ones associated to the selected user. Many other graph sampling methods could be used (random edge selection, etc.).
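The sampling scheme above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: both distributions are uniform, the quality function is binary (1 if the removed item is recommended), users with an empty profile are skipped, and the names `offline_evaluation`, `profiles` and `constant` are hypothetical.

```python
import random


def offline_evaluation(profiles, recommend, n_draws, rng):
    """Estimate the quality of `recommend` by stochastic leave-one-out sampling.

    profiles: dict user -> set of items
    recommend: function taking a (reduced) profile and returning a set of items
    """
    users = [u for u, items in profiles.items() if items]  # skip empty profiles
    hits = 0
    for _ in range(n_draws):
        u = rng.choice(users)                # uniform draw on users
        i = rng.choice(sorted(profiles[u]))  # uniform draw on the user's items
        reduced = profiles[u] - {i}          # profile with item i removed
        if i in recommend(reduced):          # binary quality function
            hits += 1
    return hits / n_draws


# Hypothetical toy data and a constant algorithm that always recommends "a".
profiles = {"u1": {"a", "b"}, "u2": {"a"}, "u3": {"b", "c"}}
constant = lambda profile: {"a"}
score = offline_evaluation(profiles, constant, 20000, random.Random(0))
```

On this toy data the probability of drawing item "a" is 1/3 × 1/2 + 1/3 × 1 = 1/2, so the estimated score should be close to 0.5.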

2.3 Origin of the bias in offline evaluation

As presented in [li2011unbiased, demytt2014reducing], the classical offline evaluation procedure ignores various factors that have influenced historical data, such as the recommendation algorithms previously used, promotional offers on some specific products, etc. Assume for instance that several recommendation algorithms are evaluated at time $t_0$ based on historical data of the user database until $t_0$. Then the best algorithm is selected according to a quality metric associated to the offline procedure and put in production. It starts recommending items to the users. Provided the algorithm is good enough, it generates some confirmation actions. In other words, the recommendation campaigns introduce many new edges in the bipartite graph representing the data (see Figure 1). Those actions can be attributed to good user modeling but also to luck and to a natural attraction of some users to new things. This is especially true when the cost of confirming/accepting a recommendation is low. In the end, the state of the system at a later time $t_1 > t_0$ has been influenced by the recommendation algorithm in production.

Then if one wants to monitor the performance of this algorithm at time $t_1$, the offline procedure tends to overestimate the quality of the algorithm because confirmation actions are now frequently triggered by the recommendations, leading to a very high predictability of the corresponding items.

Finally, one can decompose the evolution of a recommendation system into two cycles, represented in Figure 2. On the one hand, there is a virtuous circle (also called lean circle) in three steps: first an algorithm is put in production and the data collection process starts, then the collected data are analyzed to measure the performance of the algorithm, and finally the data are also used to select the best algorithm among several new ones by offline evaluation. On the other hand, we also observe a vicious circle, as the algorithm in production influences the users' behavior, which introduces a bias in the historical data used for the offline evaluation procedure.

Figure 2: The evolution of the recommendation system

This bias in offline evaluation with online systems can also be caused by other events, such as a promotional offer on some specific products between a first offline evaluation and a second one. The main effect of this bias is to favor algorithms that tend to recommend items that have been favored between $t_0$ and $t_1$, and thus to favor a kind of “winner take all” situation in which the algorithm considered as the best at $t_0$ will probably remain the best one afterwards, even if an unbiased procedure could demote it. Indeed, the score of an algorithm in production, given by the classical offline evaluation, tends to increase over time. More generally, the classical offline evaluation tends to overestimate (resp. underestimate) the unbiased score of an algorithm similar (resp. orthogonal) to the one in production.

More formally, the classical offline evaluation procedure consists in calculating the quality of the recommendation algorithm $g$ at instant $t$ as $L_t(g) = \mathbb{E}\left[l(g_t(u_{-i}), i)\right]$, where the expectation is taken with respect to the joint distribution:

$$L_t(g) = \sum_{(u, i) \in U \times I} P_t(i|u)\, P_t(u)\, l(g_t(u_{-i}), i) \qquad (1)$$

Then if two algorithms are evaluated at two different moments, their qualities are not directly comparable. Indeed, $P_t(i|u)$ evolves over time in an online system (even if $P_t(u)$ could also evolve over time, we do not consider the effects of such evolution in the present article): once a recommendation algorithm is chosen based on a given state of the system, it starts influencing the state of the system when put in production, inducing an increasing distance between its evaluation environment (i.e. the initial state of the system) and the evolving state of the system. This influence is responsible for a bias in offline evaluation, as it relies on historical data.

A naive solution to correct this bias would be to compare algorithms only with respect to the original database at $t_0$, but this approach is not optimal as it would discard natural evolutions of user profiles.

2.4 Impact of recommendation campaigns on real data

We illustrate the evolution of the probabilities $P_t(i)$ in an online system with a functionality provided by the Viadeo platform: each user can claim to have some skills that are displayed on his/her profile (examples of skills include project management, marketing, etc.). In order to obtain more complete profiles, skills are recommended to the users via a recommendation algorithm, a practice that obviously has consequences on the probabilities $P_t(i)$, as illustrated in Figure 3.

The skill functionality was implemented at time $t = 0$. After 300 days, some of the $P_t(i)$ are roughly static. Probabilities of other items still evolve over time under various influences, but the major sources of evolution are recommendation campaigns. Indeed, two recommendation campaigns have been conducted during the period shown: users have received personalized recommendations of skills to add to their profiles. The figure shows strong modifications of the $P_t(i)$ quickly after each campaign. In particular, the probabilities of the items which have been recommended increase significantly; this is the case for the green, yellow and turquoise curves after the first campaign. On the other hand, the probabilities of the items which have not been recommended decrease at the same time. The probabilities tend to become stable again until the same phenomenon can be observed right after the second recommendation campaign: the curves corresponding to the items that have been recommended again keep increasing. The purple curve represents the selection probability of an item which has been recommended only during the second recommendation campaign. Section 4.2 demonstrates the effects of this evolution on the evaluation of recommendation algorithms.

Figure 3: Impact of recommendation campaigns on the item probabilities: the left figure displays the percentage of observations induced by the recommendations, while the right figure shows examples of the evolution of $P_t(i)$ through time.

3 Reducing the evaluation bias

3.1 A weighted offline evaluation method to reduce the bias

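A simple transformation of equation (1) shows that for a constant algorithm $g$ (i.e. if recommendations are the same for all users): $L_t(g) = \sum_{i \in I} P_t(i)\, l(g_t, i)$, where $P_t(i) = \sum_{u \in U} P_t(i|u)\, P_t(u)$ is the marginal probability of selecting item $i$. As a consequence, a way to guarantee a stationary evaluation framework for a constant algorithm is to have constant values for the marginal distribution of the items, $P_t(i)$.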
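To make this identity concrete, here is a small Python sketch that computes the item marginals and the exact quality of a constant algorithm. It assumes uniform probabilities on users and on each user's items, non-empty profiles and a binary quality function; the function names and the toy data are hypothetical.

```python
from fractions import Fraction


def item_marginals(profiles):
    """P(i) = sum over users of P(i|u) P(u), with both distributions uniform."""
    n = len(profiles)
    marg = {}
    for items in profiles.values():
        for i in items:
            marg[i] = marg.get(i, Fraction(0)) + Fraction(1, n * len(items))
    return marg


def constant_quality(profiles, recommended):
    """Exact quality of a constant algorithm recommending the set `recommended`."""
    marg = item_marginals(profiles)
    return sum(p for i, p in marg.items() if i in recommended)


profiles = {"u1": {"a", "b"}, "u2": {"a"}, "u3": {"b", "c"}}
print(constant_quality(profiles, {"a"}))  # 1/2
```

The quality depends on the data only through the marginals of the recommended items, which is why keeping those marginals fixed keeps the score of a constant algorithm fixed.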

A natural solution would be to record those probabilities at $t_0$ and use them as the probability to select an item in offline evaluation at $t_1$. However, as the selection of users and items leads to a joint distribution, this would require reversing the way offline evaluation is done: first select an item, then select a user having this item with a certain probability, leading to a different probability of user selection. This process leads to a similar problem on users and, as users are far more numerous than items in most systems, it is more efficient to keep the classical evaluation protocol (see section 3.3 for more details).

Moreover, we will see that the recalibration of every item is not necessary to reduce the main part of the bias. Indeed, in practice a few items usually concentrate most of the recommendations (very popular items, discounts on selected products, etc.). Thus one can reduce the major part of the bias by optimizing only the weights of the $p$ items whose probabilities $P_t(i)$ deviate the most between $t_0$ and $t_1$. In practice the choice of $p$ is made according to practical (time) or business constraints.
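The selection of the items to recalibrate can be sketched as follows. The deviation measure used here, the absolute difference between the two marginals, is an assumption made for illustration, as are the function name and the toy values.

```python
def most_shifted_items(p_t0, p_t1, p):
    """Return the p items whose marginal probability changed the most.

    The deviation is measured by an absolute difference (an assumption).
    """
    items = set(p_t0) | set(p_t1)
    shift = {i: abs(p_t1.get(i, 0.0) - p_t0.get(i, 0.0)) for i in items}
    # Sort by decreasing deviation, breaking ties by item name.
    return sorted(items, key=lambda i: (-shift[i], i))[:p]


# Hypothetical marginals before and after a recommendation campaign.
p_t0 = {"a": 0.5, "b": 0.3, "c": 0.2}
p_t1 = {"a": 0.6, "b": 0.15, "c": 0.25}
selected = most_shifted_items(p_t0, p_t1, 2)
```

Only the weights of the selected items would then be optimized, the others being kept at their default value.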

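Thus the weighting strategy that we described in [demytt2014reducing] consists in keeping the classical choice for $P_{t_1}(u)$ and weighting $P_{t_1}(i|u)$, departing from the classical values for $P_{t_1}(i|u)$ (such as a uniform probability) in order to mimic static values for $P_{t_1}(i)$. Given a vector of positive item weights $\omega = (\omega_i)_{i \in I}$, the weighted conditional probabilities are:

$$P_{t_1}(i|u, \omega) = \frac{\omega_i\, P_{t_1}(i|u)}{\sum_{j \in I} \omega_j\, P_{t_1}(j|u)}$$

These weighted conditional probabilities lead to weighted item probabilities defined by:

$$P_{t_1}(i|\omega) = \sum_{u \in U} P_{t_1}(i|u, \omega)\, P_{t_1}(u)$$

Then we suggest minimizing the distance between $P_{t_0}(i)$ and $P_{t_1}(i|\omega)$ by optimizing the Kullback-Leibler divergence, defined by:

$$D(\omega) = \sum_{i \in I_{t_0}} P_{t_0}(i) \log \frac{P_{t_0}(i)}{P_{t_1}(i|\omega)}$$

where $I_{t_0}$ represents the set of items present at $t_0$. The asymmetric nature of this distance is useful in our context to consider time $t_0$ as a reference. Moreover, this asymmetry reduces the influence of rare items at time $t_0$ (as they were not very important in the calculation of $L_{t_0}(g)$).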
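The following Python sketch illustrates the weighted item probabilities and the Kullback-Leibler objective in the uniform case. It assumes a uniform probability on users and on each user's items, non-empty profiles, and weights normalized per user; the function names and the toy data are hypothetical.

```python
import math


def weighted_marginals(profiles, w):
    """Weighted item probabilities with uniform P(u) and uniform P(i|u).

    For an item i in the profile of user u, the weighted conditional
    probability is w[i] / sum of w[j] over the items j of u.
    """
    n = len(profiles)
    marg = {}
    for items in profiles.values():
        total = sum(w[j] for j in items)
        for i in items:
            marg[i] = marg.get(i, 0.0) + w[i] / (total * n)
    return marg


def kl_divergence(p_ref, p_new):
    """Sum of p_ref[i] * log(p_ref[i] / p_new[i]) over items with p_ref[i] > 0."""
    return sum(p * math.log(p / p_new[i]) for i, p in p_ref.items() if p > 0)


# Hypothetical reference marginals at t0 and user profiles at t1.
p_t0 = {"a": 0.5, "b": 0.3, "c": 0.2}
profiles_t1 = {"u1": {"a", "b"}, "u2": {"a"}, "u3": {"b", "c"}, "u4": {"a", "c"}}
uniform = {"a": 1.0, "b": 1.0, "c": 1.0}
d_unweighted = kl_divergence(p_t0, weighted_marginals(profiles_t1, uniform))
```

With all weights equal, the weighted marginals reduce to the classical ones, so `d_unweighted` measures the drift that the weights are meant to compensate.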

3.2 Gradient calculation

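We optimize $D$ with a gradient-based algorithm, and hence $\nabla D$ is needed. Let $i$ and $k$ be two distinct items, then

$$\frac{\partial P_{t_1}(i|u, \omega)}{\partial \omega_k} = -\frac{\omega_i\, P_{t_1}(i|u)\, P_{t_1}(k|u)}{\left(\sum_{j \in I} \omega_j\, P_{t_1}(j|u)\right)^2}$$

We have also

$$\frac{\partial P_{t_1}(k|u, \omega)}{\partial \omega_k} = \frac{P_{t_1}(k|u)}{\sum_{j \in I} \omega_j\, P_{t_1}(j|u)} - \frac{\omega_k\, P_{t_1}(k|u)^2}{\left(\sum_{j \in I} \omega_j\, P_{t_1}(j|u)\right)^2}$$

and therefore for all $k$:

$$\frac{\partial D(\omega)}{\partial \omega_k} = -\sum_{i \in I_{t_0}} \frac{P_{t_0}(i)}{P_{t_1}(i|\omega)}\, \frac{\partial P_{t_1}(i|\omega)}{\partial \omega_k}$$

We have implicitly assumed that the evaluation is based on independent draws, and therefore:

$$\frac{\partial P_{t_1}(i|\omega)}{\partial \omega_k} = \sum_{u \in U} P_{t_1}(u)\, \frac{\partial P_{t_1}(i|u, \omega)}{\partial \omega_k}$$

And in the particular case of uniform selection, i.e. if $P_{t_1}(u) = 1/|U|$ and $P_{t_1}(i|u) = 1/|I_{t_1}(u)|$ for $i \in I_{t_1}(u)$, then:

$$P_{t_1}(i|u, \omega) = \frac{\omega_i}{\sum_{j \in I_{t_1}(u)} \omega_j} \ \text{ for } i \in I_{t_1}(u), \qquad P_{t_1}(i|\omega) = \frac{1}{|U|} \sum_{u \in U_{t_1}(i)} \frac{\omega_i}{\sum_{j \in I_{t_1}(u)} \omega_j}$$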
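As a sanity check of the gradient in the uniform case, the sketch below computes it analytically and compares it with a central finite difference. It is an illustration under assumptions (uniform draws, weights normalized within each user profile, toy data), and the per-user grouping of the terms is one possible arrangement, not necessarily the authors' implementation.

```python
import math


def marginals(profiles, w):
    """Weighted item probabilities under uniform user and item selection."""
    n = len(profiles)
    marg = {}
    for items in profiles.values():
        s = sum(w[j] for j in items)
        for i in items:
            marg[i] = marg.get(i, 0.0) + w[i] / (s * n)
    return marg


def divergence(p_ref, profiles, w):
    """Kullback-Leibler divergence between p_ref and the weighted marginals."""
    marg = marginals(profiles, w)
    return sum(p * math.log(p / marg[i]) for i, p in p_ref.items() if p > 0)


def gradient(p_ref, profiles, w):
    """Analytic gradient of the divergence with respect to each weight."""
    n = len(profiles)
    marg = marginals(profiles, w)
    ratio = {i: p_ref.get(i, 0.0) / marg[i] for i in marg}  # reference / weighted
    grad = {k: 0.0 for k in w}
    for items in profiles.values():
        s = sum(w[j] for j in items)             # sum of weights in the profile
        t = sum(ratio[j] * w[j] for j in items)  # ratio-weighted sum in the profile
        for k in items:
            # Contribution of this user to the partial derivative for item k.
            grad[k] += (t / s ** 2 - ratio[k] / s) / n
    return grad


# Hypothetical toy data.
p_t0 = {"a": 0.5, "b": 0.3, "c": 0.2}
profiles = {"u1": {"a", "b"}, "u2": {"a"}, "u3": {"b", "c"}, "u4": {"a", "c"}}
w = {"a": 1.0, "b": 2.0, "c": 0.5}

grad = gradient(p_t0, profiles, w)
eps = 1e-6
numeric = {}
for k in w:
    up, down = dict(w), dict(w)
    up[k] += eps
    down[k] -= eps
    numeric[k] = (divergence(p_t0, profiles, up)
                  - divergence(p_t0, profiles, down)) / (2 * eps)
```

Since the weighted probabilities are unchanged when all weights are multiplied by the same constant, the gradient is orthogonal to $\omega$, which gives a second cheap consistency check.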

3.3 Complexity

The value of coordinates of the gradient can be computed with a complexity, where is the number of couples with and ().

Indeed let us assume we have computed the sparse biadjacency matrix of the bipartite graph twice: once indexed by rows, and once indexed by columns. Such a matrix can be obtained in and gives access to every element in . Then, in the particular case of uniform sampling it is possible to compute for all in .

Then if has been computed and saved for all (complexity in ), we have in for all .

So, after having computed and saved the values of and for all , the quantity is a sum of elements computed in and every coordinate of the gradient can be computed in

4 Illustration on constant algorithms

4.1 Data and metrics

We consider real-world data extracted from Viadeo, where skills are attached to users' profiles. The objective of the recommendation system is to suggest new skills to users. The dataset contains 18294 users and 180 items (skills), leading to 117376 couples $(u, i)$.

Both probabilities $P_t(u)$ and $P_t(i|u)$ are uniform, and the quality function is given by $l(g_t(u_{-i}), i) = \mathbb{1}_{i \in g_t(u_{-i})}$, where $g_t(u_{-i})$ is a set of 5 items. The quality of a recommendation algorithm, $L_t(g)$, is estimated via stochastic sampling in order to simulate what could be done on a larger data set than the one used for this illustration. We repeatedly selected 20 000 couples (user, item): first we select a user $u$ uniformly, then an item according to $P_t(i|u)$.

The recommendation setting is the one described in Section 2.4: users can attach skills to their profile. Skills are recommended to the users in order to help them to build more accurate and complete profiles. In this context, items are skills. The data set used for the analysis contains 34 448 users and 35 741 items. The average number of items per user is 5.33. The distribution of items per user follows roughly a power law, as shown on Figure 4.

Figure 4: Distribution of items per user

4.2 Impact of previous recommendations campaigns

As described in section 2.2, the offline evaluation of a recommendation algorithm can be computed using a stochastic or an exhaustive approach. Here we describe the impact of previous recommendation campaigns on the offline evaluation score, computed by stochastic sampling on the sample data extracted from Viadeo, which makes it possible to mimic the results that could be computed on bigger datasets. We first demonstrate the effect of the bias on two constant recommendation algorithms. The first one, $g_1$, is modeled after the actual recommendation algorithm used by Viadeo in the following sense: it recommends the five items most recommended during the period under study. The second algorithm, $g_2$, takes the opposite approach by recommending the five most frequent items at a fixed reference time among the items that were never recommended during that period. In a sense, $g_1$ agrees with Viadeo’s recommendation algorithm, while $g_2$ disagrees.

For each couple of selected user $u$ and item $i$, the score given by the offline evaluation procedure of an algorithm $g$ is given by $l(g_t(u_{-i}), i)$. For the experiments we have selected 30 000 couples $(u, i)$, where $u$ is a user chosen uniformly on $U$, and $i$ a skill chosen uniformly on $I_t(u)$ (the set of skills associated to $u$). We consider the quality function $l$ given by $l(g_t(u_{-i}), i) = \mathbb{1}_{i \in g_t(u_{-i})}$, where $g_t(u_{-i})$ represents the top five items suggested by the algorithm $g$ after selecting the couple $(u, i)$. Figure 5 shows the evolution of $L_t(g_1)$ and $L_t(g_2)$ over time. As both algorithms are constant, it would be reasonable to expect minimal variations of their offline evaluation scores. However, in practice the estimated quality of $g_1$ increases by more than 25 %, while the relative decrease of $g_2$ reaches 33 %.

Figure 5: Evolution of $L_t(g_1)$ (left) and $L_t(g_2)$ (right) through time.


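As $l$ is a binary function, it can be considered as a Bernoulli random variable of parameter $\theta$, where $\theta$ corresponds to the expected probability that $l = 1$. Then, after $S$ simulations we have $S$ observations $l_1, \dots, l_S$ (where $l_s$ is the score obtained for the user $u_s$ and the item $i_s$ selected during the $s$-th step of the offline evaluation procedure), and the maximum likelihood estimator of $\theta$ is given by

$$\hat{\theta} = \frac{1}{S} \sum_{s=1}^{S} l_s$$

Thus $S \hat{\theta}$ follows a binomial law, which can be approximated by a Gaussian random variable for $S$ big enough, and a confidence interval for $\theta$ is classically given by

$$\hat{\theta} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{S}}$$

where $z_{1-\alpha/2}$ is the corresponding quantile of the standard normal distribution (about 1.96 for a 95 % interval).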
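For illustration, the normal-approximation interval can be computed as in the short Python sketch below. The function name, the default 95 % level and the hit count in the example are assumptions; only the number of draws (30 000) comes from the text.

```python
import math


def wald_interval(hits, n_draws, z=1.96):
    """Normal-approximation confidence interval for a Bernoulli parameter.

    z = 1.96 corresponds to a 95 % level (an assumption of this sketch).
    """
    p_hat = hits / n_draws
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_draws)
    return p_hat - half_width, p_hat + half_width


# Hypothetical outcome: 6 000 successful recommendations out of 30 000 draws.
low, high = wald_interval(hits=6000, n_draws=30000)
```

With 30 000 draws and an estimated score of 0.2, the half-width is about 0.0045, which is small compared with the score variations discussed in this section.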

4.3 Reducing the bias

We apply the strategy described in Section 3 to compute optimal weights at different instants and for several values of the parameter $p$. Results are summarized in Figure 6.

Figure 6: Evolution of $L_t(g_1)$ (left) and $L_t(g_2)$ (right) through time.

The figures clearly show the stabilizing effect of the weighting strategy on the scores of both algorithms. In the case of algorithm $g_1$, the stabilisation is quite satisfactory with only a small number of active weights. This is expected because $g_1$ agrees with Viadeo’s recommendation algorithm and therefore recommends items whose probabilities change a lot over time. Those probabilities are exactly the ones that are corrected by the weighting method.

The case of algorithm $g_2$ is less favorable, as no stabilisation occurs with small values of $p$. This can be explained by the relative stability over time of the probabilities of the items recommended by $g_2$ (indeed, those items are not recommended during the period under study). The perceived reduction in quality over time is then a consequence of the increased probabilities associated to other items. Because those items are never recommended by $g_2$, they correspond to direct recommendation failures. In order to stabilize the evaluation, we need to take into account weaker modifications of probabilities, which can only be done by increasing $p$, as represented in Figure 6.

Thus, the weighted offline evaluation procedure reduces the bias for the very simple class of constant algorithms. In the next part we discuss the relevance of this procedure to reduce the offline evaluation bias on collaborative filtering algorithms.

5 Experiments on collaborative filtering

5.1 Collaborative filtering algorithms

Collaborative filtering is a very popular class of recommendation algorithms which consists in computing recommendations for a user $u$ using the information available on other users, especially the ones similar to $u$. For example, a classical collaborative filtering algorithm consists in recommending the most frequent items among the ones associated to users having items in common with the user $u$.

More formally, let $x_u^t$ be the vector of items of user $u$ at time $t$ ($x_u^t \in \{0, 1\}^{m_t}$). Then $x_u^t$ is a sparse vector, as most users are associated to only a few items, and corresponds to the column of the biadjacency matrix representing $u$. The objective of collaborative filtering algorithms is to estimate, for a user $u$, a vector of item scores using the information known on other users. In this section we discuss the efficiency of our method to reduce the offline evaluation bias on two different collaborative filtering algorithms:

The first equation is known as collaborative filtering with cosine similarity, whereas the second computes, for an item $i$, the proportion of users associated to $i$ among the ones associated to items possessed by $u$. We will call the latter algorithm naive CF (Collaborative Filtering).

Finally, the recommendation strategy consists in recommending the items having the highest values in the estimated score vector.
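The two families of scores can be illustrated with the Python sketch below. The exact formulas are assumptions: the cosine variant uses a common user-based formulation (sum of the other users' profiles weighted by their cosine similarity with $u$), and the naive variant follows the verbal description above. Excluding already owned items from the recommendation, the function names and the toy data are also choices made for the example.

```python
import math


def cosine_scores(profiles, u):
    """Score items by the cosine-weighted sum of the other users' profiles."""
    scores = {}
    for v, items_v in profiles.items():
        if v == u or not items_v or not profiles[u]:
            continue
        # Cosine similarity between two binary vectors.
        sim = len(profiles[u] & items_v) / math.sqrt(len(profiles[u]) * len(items_v))
        for i in items_v:
            scores[i] = scores.get(i, 0.0) + sim
    return scores


def naive_scores(profiles, u):
    """Proportion of neighbours (users sharing an item with u) holding each item."""
    neighbours = [v for v in profiles if v != u and profiles[v] & profiles[u]]
    scores = {}
    for v in neighbours:
        for i in profiles[v]:
            scores[i] = scores.get(i, 0.0) + 1.0 / len(neighbours)
    return scores


def recommend(scores, owned, k):
    """Return the k highest-scoring items that are not already in the profile."""
    candidates = [i for i in scores if i not in owned]
    return sorted(candidates, key=lambda i: (-scores[i], i))[:k]


# Hypothetical toy data.
profiles = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"b", "c", "d"}, "u4": {"e"}}
top_cosine = recommend(cosine_scores(profiles, "u1"), profiles["u1"], 2)
top_naive = recommend(naive_scores(profiles, "u1"), profiles["u1"], 2)
```

Both scores depend on which users share items, so they react to changes in the structure of the bipartite graph and not only to changes in the item probabilities.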

5.2 Results

We apply the method described in Section 3 to compute optimal weights at different instants and for several values of the parameter $p$. The collaborative filtering algorithms are the ones presented in section 5.1. Results are summarized in Figure 7.

(a) cosine similarity
(b) naive CF
Figure 7: Results on the collaborative filtering with cosine similarity and naive CF, both defined in section 5.1, for several values of $p$ (the number of weights optimized).

Once again the analysis is conducted on a 201-day period, from day 300 to day 500, where day 0 corresponds to the launch date of the skill feature. It is important to notice that two recommendation campaigns were conducted by Viadeo during this period. As we can see in Figure 7, the scores strongly decrease after the first recommendation campaign. Thus those campaigns have strongly biased the collected data, leading to a significant bias in the offline evaluation.

Figure 7 shows the influence of the value of $p$: the higher $p$ is, the more weights are optimized and the more the bias is corrected. However, the efficiency of the recalibration depends on the algorithm. The results show that the weighting protocol reduces the impact of recommendation campaigns on offline evaluation results, as intended. However, it does not lead to the stabilization of the score of collaborative filtering algorithms (while it leads to constant scores for constant algorithms). This can be explained by the nature of collaborative filtering: we cannot expect the score to be constant for such an algorithm, as it depends on the correlations between users, which have been modified by the recommendation campaigns. In other words, the bias can be decomposed in two parts: one depending on the selection probability of each item, and a second one depending on the structure of the data (the edges of the bipartite graph representing the data). Indeed, the structure of the graph has been modified because recommendation campaigns have increased the density of the graph by adding new edges from targeted users to recommended items.

6 Conclusion

Various factors influence historical data and bias the score obtained by classical offline evaluation strategy. Indeed, as recommendations influence users, a recommendation algorithm in production tends to be favored by offline evaluation.

We have presented a new application of the item weighting strategy inspired by techniques designed for tackling the covariate shift problem. Whereas our previous results presented the efficiency of this method for constant algorithms, we have shown that this method also reduces the bias of more elaborate algorithms. However, experiments on collaborative filtering show that the bias can be decomposed in two parts, since previous recommendation campaigns change the selection probability of each item but also modify the structure of the data.

Experiments show that our method is effective at reducing the first bias. Future work will investigate the correction of the structural bias.