1 Introduction
Under various forms and various names, recommendation has become a very common activity on the web. One can think of movie recommendation (Netflix), e-commerce (Amazon), online advertising (everywhere), news recommendation (Digg), personalized radio stations (Pandora) or even job recommendation (LinkedIn). All these applications have their own characteristics, yet the common key idea is to take advantage of the information we may have on a user (profile, demographics, time of day, etc.) in order to identify the most attractive content to serve him/her in a given context. Note that an element of content is generally referred to as an item. To perform recommendation, a piece of software called a recommender system (RS) can use past user/item interactions such as clicks or ratings. In particular, the typical approach is to train a predictor of the ratings and/or clicks of users on items using past data, and to use the resulting predictions to make personalized recommendations. This approach relies on the implicit assumption that past behavior can be used to predict future behavior.
In this paper we consider recommendation applications in which the aforementioned assumption is not reasonable. We refer to this setting as dynamic recommendation. In dynamic recommendation, only a few tens of different items are available for recommendation at any given moment. These items have a limited lifetime and are continuously replaced by new ones with different characteristics. We also consider that the tastes of users change, sometimes dramatically, due to external factors that we do not control. Many examples of dynamic recommendation exist on the web. The most popular one is news recommendation, which can be found on specialized websites such as Digg, on general web portals (Yahoo!) and on the websites of various media (newspapers, TV channels, etc.). Other examples include private auctions, in which the user can buy a limited set of items that changes every day, or an RS that can only recommend the K most recent items (this may apply to movies, videos, songs, etc.). This problem has recently begun to be addressed with online learning solutions, in particular within the contextual bandit framework. Nonetheless, this is not the case in most of the recommendation literature: in the textbooks, dynamic recommendation is handled with content-based recommendation. The idea is to represent an item as a set of features and to try to predict the taste of a user with regard to these features, using an offline predictor as before. We argue that although this can be a good idea in some special cases, it is not the way to go in general:

It requires a continuous labeling effort of new items.

We are limited by what the expert labels: some things are hard to label, such as the appeal of a picture or the quality of a textual summary.

Tastes are not static: a user's interest in some kinds of news can be greatly impacted by the political context. Similarly, interest in clothing can be impacted by fashion, movie stars, etc.
For such systems, the best way to compare the performance of two algorithms is to perform A/B testing on a subset of the web audience. Yet almost no e-commerce website would let a new RS go live just for testing, even on a small portion of the audience, for fear of hurting the user experience and losing money. The entailed engineering effort can also be discouraging. Therefore, being able to evaluate an RS offline is crucial. In classical recommendation, measures of prediction accuracy and other metrics that can be computed on past data are well accepted and trusted in the community. Nevertheless, for the reasons given above, they are irrelevant for online learning algorithms designed for dynamic recommendation. This paper is about the offline evaluation of such algorithms. Some fairly simple replay methodologies do exist in the literature. Nonetheless, they have a well-known yet little-studied drawback: they use only a very small portion of the data at hand. One may argue that the tremendous amount of data available in web applications makes this a marginal problem. In this paper we will show that, on the contrary, it is a major issue that generates a huge bias when evaluating online learning algorithms for dynamic recommendation, and that acquiring more data does not solve the problem. We then propose a solution to this issue that builds on the previously introduced methods and on elements of bootstrapping theory. This solution is backed by a theoretical analysis as well as an empirical study; both clearly exhibit the improvement in evaluation accuracy brought by our new method. Furthermore, the use of bootstrapping allows us to estimate the distribution of our estimate and therefore to get a sense of its quality.
This is a highly desirable property, especially considering that such an evaluation method is designed to decide whether we should take the risk of putting a new algorithm online. The fast theoretical convergence of this estimate is also proved in our analysis. Note that the experiments are run on synthetic data, for reasons that we will detail, and also on a large publicly available dataset.
2 Background on bandits and notations
We motivated the need for online learning solutions in order to deal with dynamic recommendation. A natural way to model this situation is as a reinforcement learning problem
(sutton1998reinforcement), and more precisely using the contextual bandit framework (contextualMAB:aistats2010), which was introduced for the very purpose of news recommendation.

2.1 Contextual Bandits
The bandit problem is also known in the literature as the multi-armed bandit problem, among other variations. It can be traced back to Robbins in 1952 (Robbins:1952) and even Thompson in 1933 (thompson1933). There are many variations in the definition of the problem; the contextual bandit framework as defined in (DBLP:conf/nips/LangfordZ07) is as follows:
Let $\mathcal{X}$ be an arbitrary input space and $\mathcal{A} = \{1, \dots, K\}$ be a set of actions. Let $D$ be a distribution over tuples $(x, \mathbf{r})$ with $x \in \mathcal{X}$ and $\mathbf{r} \in [0,1]^K$ a vector of rewards: in the $(x, \mathbf{r})$ pair, the $a$-th component $r_a$ of $\mathbf{r}$ is the reward associated to action $a$, performed in the context $x$. In recommendation, the context is the information we have on the user (session, demographics, profile, etc.) and an action is the recommendation of an item.

A contextual bandit game is an iterated game in which at each round $t$:

$(x_t, \mathbf{r}_t)$ is drawn from $D$.

$x_t$ is provided to the player.

The player chooses an action $a_t$ based on $x_t$ and on the knowledge it has gathered from the previous rounds of the game.

The reward $r_{a_t,t}$ is revealed to the player, whose score is updated.

The player updates its knowledge based on this new experience.
It is important to note the partial observability of this game: the reward is known only for the action performed by the player. This is what makes offline evaluation a complicated problem. A typical goal is to maximize the sum of rewards obtained after $T$ steps. To succeed, a player has to learn about $D$ and try to exploit that information. Therefore at time $t$, a player faces a classical exploration/exploitation dilemma: either perform an action it is uncertain about in order to improve its model of $D$ (explore), or perform the action it believes to be optimal, although it may not be (exploit).
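The iterated game above can be sketched in a few lines of Python. This is a minimal simulation of the protocol only; the toy environment `toy_draw`, the horizon, and the constant policy are illustrative assumptions, not part of the formal framework:

```python
import random

def play_bandit_game(policy, draw_context_rewards, T, rng):
    """Run T rounds of the contextual bandit game.

    policy: callable (context, history) -> action
    draw_context_rewards: callable (rng) -> (context, reward_vector),
        i.e. one sample from the distribution D.
    Only the reward of the chosen action is observed (partial observability).
    """
    history = []          # observed (context, action, reward) triplets
    total_reward = 0.0
    for t in range(T):
        x, r = draw_context_rewards(rng)   # (x, r) drawn from D
        a = policy(x, history)             # action chosen from x and past knowledge
        total_reward += r[a]               # only r[a] is revealed
        history.append((x, a, r[a]))
    return total_reward, history

# Toy environment with 2 actions; action 1 always pays off.
def toy_draw(rng):
    x = rng.random()
    return x, [0.0, 1.0]

rng = random.Random(0)
score, hist = play_bandit_game(lambda x, h: 1, toy_draw, T=100, rng=rng)
```

The `history` list is exactly the kind of logged data that the offline evaluation methods discussed later consume.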
A simpler variant of this problem in which no contextual information is given, called the multi-armed bandit problem (MAB), has been extensively studied in the literature. One can for instance mention the UCB algorithm (Auer02finitetimeanalysis), which optimistically deals with the dilemma by performing the action with the highest upper confidence bound on the estimated reward expectation. The contextual bandit problem is less studied, due to the additional complexity and assumptions entailed by the context space. The most popular algorithm is without doubt LinUCB (LinUCB), although a few others exist such as epoch-greedy (DBLP:conf/nips/LangfordZ07). LinUCB is basically an extension of the classical UCB that uses the contexts under an assumption of normality: the reward expectation of an action is estimated via a linear regression on the context vectors, and the confidence bound is computed using the dispersion matrix of the context vectors. These two state-of-the-art algorithms are the ones we will evaluate in our experiments.
2.2 Evaluation
We define a contextual bandit algorithm $A$ as taking as input an ordered list of triplets $(x, a, r)$ (a history $h$) and outputting a policy $\pi$. A policy maps $\mathcal{X}$ to $\mathcal{A}$, that is, it chooses an action given a context. Note that we are also interested in evaluating policies. In our setting, which is the most popular one, an algorithm is said to be optimal when it maximizes the expectation of the sum of rewards after $T$ steps:
$$\mathbb{E}\left[\sum_{t=1}^{T} r_{a_t,t}\right].$$
For convenience, we define the per-trial payoff $g_T^A$ as the average click rate after $T$ steps:
$$g_T^A = \frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^{T} r_{a_t,t}\right].$$
Note that for a static algorithm (i.e. one that always outputs the same policy $\pi$), we have that:
$$g_T^A = \mathbb{E}_{(x,\mathbf{r}) \sim D}\left[r_{\pi(x)}\right].$$
Note that from this point on, we will simplify the notations by systematically dropping $T$. $g^A$ is thus the quantity we wish to estimate as the measure of performance of a bandit algorithm.
In order to minimize the risks entailed by playing a new algorithm live, we are also interested in the quality of the estimate. Bootstrapping will enable us to estimate it. To do so we need additional notations. $\mathcal{D}_T^A$ denotes the distribution of the per-trial payoff of $A$ after $T$ steps (so $g^A$ is its mean). Besides estimating $g^A$, our second goal is the computation of an estimator quality assessment $\xi(\mathcal{D}^A)$. Note that typically, $\xi$ can be a quantile, a standard error, a bias or, what we will consider here for simplicity, a confidence region around the mean of $\mathcal{D}^A$ (aka $g^A$).

3 The time acceleration issue with replay methodologies
This section describes the replay methodology, which we call replay, introduced by (Langford_ExploScav_08) and analyzed for the setting we consider by (LiCLW11). This section also highlights the method's limitations, which we overcome in this paper, and is crucial to understand the significance of our contribution.
First of all, as in (LiCLW11), we assume that we have a dataset $S$ that was acquired online using a uniformly random allocation of the actions for $T$ steps. This data collection phase can be referred to as the exploration policy and is our unique source of information on $D$. This random decision making implies that any point in $\mathcal{X} \times \mathcal{A}$ has a non-null probability of being represented in $S$; this allows the evaluation of any policy. In a nutshell, the replay methodology on such a dataset works as follows: for each record $(x, a, r)$ in $S$, the algorithm $A$ is asked to choose an action given $x$. If this action is $a$, the reward $r$ is revealed to $A$ and taken into account to evaluate its performance. If the action is different, the record is discarded. This method is proved to be unbiased in some cases (LiCLW11). Note that the fact that it needs the data to be acquired uniformly at random is quite restrictive. This problem is well studied, and replay can be extended to allow the use of data acquired by a different but known logging policy, at the cost of an increased variance (Langford_ExploScav_08). Some work has been done to reduce this variance and even to allow the use of data for which the logging policy is unknown (DBLP:journals/corr/abs11034601; DBLP:conf/nips/StrehlLLK10). Note also that if the evaluated bandit algorithm is close to the logging policy, the variance may be reduced even further (trbottou2012). Finally, there exist ways to adapt this method to the case where a list of items can be recommended (Langford_ExploScav_08). Although we do not take these considerations into account and keep the simplest assumptions in this paper for clarity, our method is based on the same ideas as replay and could therefore be extended similarly to what is presented in the works we just cited.

Another issue with replay is well known but, to our knowledge, not studied at all. On average, only $T/K$ records are used. Therefore replay outputs an estimate which follows a distribution of mean $g_{T/K}^A$. It is important to keep in mind that $g_{T/K}^A = g_T^A$ if and only if $T \to \infty$ or $A$ is a static policy. See figure 1 for a visual example of this problem. Note that in any other situation $\mathcal{D}_{T/K}^A \neq \mathcal{D}_T^A$, and the same goes for the confidence region $\xi$.
One may argue that when evaluating an RS, plenty of data is available and therefore that $T$ is almost infinite; consequently one may also consider replay to be almost unbiased. This is true within the classical contextual bandit framework considered in the literature. With dynamic recommendation, the main application for this method, this could not be more wrong. Indeed, we argued that in this context everything changes, especially the available items. For instance, in news recommendation a news item remains active from a few hours to two days, and its CTR systematically decreases over time (Agarwal:2009:SME:1526709.1526713). Moreover, we also mentioned reasons to believe that user tastes may change as well. Therefore when evaluating a contextual bandit algorithm, we want to evaluate its behavior against a very dynamic environment, and in particular its ability to adapt to it. The use of replay in such a context is often justified by the fact that the environment can be considered static for small periods of time. This assumption is not necessary, but it makes our point easier to understand. When an algorithm faces a "static" region of size $s$ of a dataset, when being replayed it only has $s/K$ instead of $s$ steps to learn and exploit that knowledge. It is impossible to solve this problem by considering more data, since new data would concern the next region, where different news items with different CTRs are available for recommendation, and potentially users with different tastes. In fact, whatever assumptions we use to characterize how things evolve, using replay is equivalent to playing an algorithm with time going $K$ times faster than in reality. This generates a huge bias. Note that it is most likely because of time acceleration that a non-contextual algorithm resembling UCB won a challenge evaluated by replay on the Yahoo! R6B dataset (yahooToday) (news recommendation). See chapter 4 of Nicol (nicolEval14) for more details.
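The replay methodology of this section, together with its $T/K$ data loss, can be sketched as follows. The uniform logging simulation, the constant policy and the names are illustrative assumptions; only the accept/discard logic is the method itself:

```python
import random

def replay_evaluate(policy_update, dataset):
    """Replay method: a record is kept only when the evaluated policy
    picks the logged action; rewards of other actions are unknown.

    policy_update: callable (history) -> policy, where policy: context -> action
    dataset: list of (context, logged_action, reward) records logged
             uniformly at random over the actions.
    Returns (estimated per-trial payoff, number of records actually used).
    """
    history, clicks, used = [], 0.0, 0
    for x, a_logged, r in dataset:
        pi = policy_update(history)
        if pi(x) == a_logged:        # accept: the choice matches the log
            clicks += r
            used += 1
            history.append((x, a_logged, r))
        # otherwise the record is simply discarded
    return (clicks / used if used else 0.0), used

# Uniform logging over K = 10 actions: replay keeps roughly T / K records.
K, T = 10, 5000
rng = random.Random(1)
log = [(rng.random(), rng.randrange(K), 0.0) for _ in range(T)]
est, used = replay_evaluate(lambda h: (lambda x: 0), log)
```

On this toy log of 5000 records, only about 500 are accepted, which is the time acceleration issue in miniature.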
As a conclusion, we consider in this work a classical contextual bandit framework with a fixed number of steps $T$. We assume that no more than $T$ records can be acquired. Yet it is clear that if we manage to deal with this problem without adding data, we will also be able to deal with the problem of evaluating dynamic recommendation, for which using more data may not be possible.
4 Bootstrapped Replay on Expanded Data
Now that the shortcoming of the replay method is understood, we look for another offline evaluation protocol that does not suffer from the time acceleration issue. The idea we propose stems from bootstrapping, introduced by (citeulike:2825416). Let us first recall the standard bootstrap approach. Basically, the idea is to compute the empirical distribution of an estimator computed on $n$ observations, while one only has access to a dataset $S$ of size $n$. Therefore $B$ new datasets $S_1, \dots, S_B$ of size $n$ are generated by making bootstrap resamples from $S$. A bootstrap resample is generated by drawing $n$ samples with replacement from $S$. Note that this bootstrap procedure is a way to approximate $P$, the underlying distribution of the data. Computing the estimator on all the $S_b$ yields an estimation of its distribution. From a theoretical point of view and under mild assumptions, this estimation converges with no bias at a speed $O(1/n)$. This means that, under an assumption on the concentration speed of a statistic, we are able to estimate a confidence interval on the mean of the statistic much faster than the mean itself. Recall that $\xi$ can be any measure of accuracy (defined in terms of bias, variance, confidence intervals, …) over the statistic we want to study. Here we are interested in confidence intervals over $g^A$.

The core idea of the evaluation protocol we propose in this paper is inspired by bootstrapping and inherits its theoretical properties. Using our notations, here is the description of this new method. From a dataset $S$ of size $T$ with $K$ possible choices of action at each step (we do not require $K$ to be constant over time), we generate $B$ datasets of size $K \times T$ by sampling with replacement, following the non-parametric bootstrap procedure. Then on each dataset $S_b$ we use the classical replay method to compute an estimate $\hat{g}_b$. Therefore $A$ is evaluated on $T$ records on average. This step can be seen as a subsampling step that allows us to return to the classical bootstrap setting. Note that it would not work for a purely deterministic policy, which for obvious reasons would not take advantage of the data expansion (an assumption in the formal analysis will reflect this fact). Our estimate of $g^A$ is given by averaging the $\hat{g}_b$. Together, the $\hat{g}_b$ are also an estimation of $\mathcal{D}^A$, the distribution of the CTR of $A$ after $T$ interactions with $D$, on which we can compute our estimator quality assessment $\xi$; the confidence region around $g^A$ is obtained from $\hat{\sigma}$, the empirical standard deviation of the $B$ bootstrap estimates $\hat{g}_1, \dots, \hat{g}_B$. The complete procedure, called Bootstrapped Replay on Expanded Data (BRED), is implemented in algorithm 1.

To complete the BRED procedure, one last detail is necessary. Each record of the original dataset is contained $K$ times in expectation in each expanded dataset $S_b$. Therefore a learning algorithm may tend to overfit, which would bias the estimate. To prevent this from happening, we introduce a small amount of Gaussian noise on the contexts. This technique is known as jittering and is well studied in the neural network field (citeulike:2512648), where the goal is the same, namely avoiding overfitting. In practice, however, our use is slightly different, as neural networks generally learn not online but on batches of data, each data point being used several times during learning. In bootstrapping theory this technique is known as the smoothed bootstrap and was introduced by (silverman1987bootstrap). We mentioned that bootstrap resampling is a way to approximate $P$. The smoothed bootstrap goes further by performing a kernel density estimation (KDE) of the data and sampling from it. Sampling from a KDE of the data where the kernel is Gaussian with bandwidth $h$ is equivalent to sampling a record uniformly from $S$ and applying a random noise sampled from $\mathcal{N}(0, h^2)$, which is what jittering does. The usual purpose of doing so in statistics is to get a less noisy empirical distribution for the estimator. Note that here we perform a partially smoothed bootstrap, as we only apply a KDE to the part of $D$ that generates the contexts.

5 Theoretical analysis
In this section, we provide a theoretical analysis of our evaluation method BRED. The core loop in BRED is a bootstrap loop; hence, to carry out this analysis, we first restate Theorem 1, which is a standard result of bootstrap asymptotic analysis (blb). Notice a small detail: each bootstrap step estimates a realization of the per-trial payoff. The number of evaluations — which is also the number of non-rejects — is a random variable denoted $T_b$.

Theorem 1.
Suppose that:

$A$ is a recommendation algorithm which generates a fixed policy over time (this hypothesis can be weakened, as discussed in remark 2),

$K$ items may be recommended at each time step,

$\xi(\mathcal{D}_n^A)$ admits an expansion as an asymptotic series
$$\xi(\mathcal{D}_n^A) = z + \frac{p_1}{\sqrt{n}} + \frac{p_2}{n} + \dots$$
where $z$ is a constant independent of the distribution $D$ (as defined in Sec. 2.1), and the $p_i$ are polynomials in the moments of $\mathcal{D}_n^A$ under $D$ (this hypothesis is discussed and explained in remark 1),

The empirical estimator admits a similar expansion:
$$\xi(\hat{\mathcal{D}}_n^A) = z + \frac{\hat{p}_1}{\sqrt{n}} + \frac{\hat{p}_2}{n} + \dots \quad (1)$$

Then, for $n = T/K$ and assuming finite first and second moments of $\mathcal{D}_n^A$, with high probability:
$$\left|\xi(\tilde{\mathcal{D}}_n^A) - \xi(\mathcal{D}_n^A)\right| = O\!\left(\frac{1}{n}\right) \quad (2)$$
where $\tilde{\mathcal{D}}_n^A$ is the resampled distribution of $\hat{\mathcal{D}}_n^A$ using $n$ realizations.
Proof.
The result is a straightforward adaptation of the proof of theorem 3 of (blb). Also note that this theorem is a reformulation of the main bootstrap convergence result as introduced by (citeulike:2825416). ∎
Theorem 2.
Assume that $\xi(\mathcal{D}^A)$ admits an asymptotic expansion in polynomials of the moments of $\mathcal{D}^A$, as in Theorem 1. Then for an algorithm $A$ producing a fixed policy over time, BRED applied on a dataset of size $T$ evaluates the expectation of the $\hat{g}_b$ with no bias and, with high probability, for $T$ and $B$ large enough:
$$\left|\xi(\hat{\mathcal{D}}^A) - \xi(\mathcal{D}^A)\right| = O\!\left(\frac{K}{T}\right).$$
This means that the convergence of the estimator of $\xi(\mathcal{D}^A)$ is much faster than the convergence of the estimator of $g^A$ (which is in $O(\sqrt{K/T})$). This will allow a tight control of the risk that $A$ may be badly evaluated.
The sketch of the proof of theorem 2 is the following: first we prove that the replay strategy is able to estimate the moments of the distribution of the $\hat{g}_b$ fast enough with respect to $T$. The second step consists in using classical results from bootstrap theory to guarantee the unbiased convergence of the aggregation to the true distribution at an $O(K/T)$ speed. The rationale behind this is that the gap introduced by the subsampling is of the order of $1/\sqrt{T}$.
Proof.
At each iteration of the bootstrap loop (indexed by $b$), BRED estimates the CTR using the replay method on a dataset of size $KT$. As the actions in $S$ were chosen uniformly at random, we have $\mathbb{E}[T_b] = T$.

As the policy is fixed, we can use the multiplicative Chernoff bound as in (LiCLW11) to obtain, for every bootstrap step $b$:
$$\mathbb{P}\left(\hat{g}_b > (1+\delta)\, g^A\right) < e^{-T g^A \delta^2 / 3}$$
for any $\delta > 0$ (where $\mathbb{P}(E)$ denotes the probability of event $E$). A similar inequality can be obtained for $\mathbb{P}(\hat{g}_b < (1-\delta)\, g^A)$. Thus, choosing $\delta$ accordingly and using a union bound over the two probabilities, we have with probability at least $1-\gamma$:
$$\left|\hat{g}_b - g^A\right| \leq \sqrt{\frac{3\, g^A \ln(2/\gamma)}{T}},$$
which implies
$$\hat{g}_b = g^A + O\!\left(\frac{1}{\sqrt{T}}\right).$$
So with high probability the first moment of $\mathcal{D}^A$ as estimated by the replay method admits an asymptotic expansion in $1/\sqrt{T}$.

Now we need to focus on higher order terms. All the moments are finite because the reward distribution over $[0,1]$ is bounded. Recall that by hypothesis $\xi(\mathcal{D}^A)$ admits a second order term:
$$\xi(\mathcal{D}^A) = z + \frac{p_1}{\sqrt{T}} + \frac{p_2}{T} + \dots$$
The Chernoff bound can be applied to the empirical higher moments as well, leading to the same $O(1/\sqrt{T})$ control with probability at least $1-\gamma$. So for a large enough $T$, $\xi(\hat{\mathcal{D}}^A)$ admits an expansion in polynomials of $1/\sqrt{T}$. Thus theorem 1 applies, and the aggregation of all the $\hat{g}_b$ allows an estimation of $\xi(\mathcal{D}^A)$. For a large enough number of bootstrap iterations (the value of $B$ in BRED), we obtain a convergence speed in $O(K/T)$ with high probability, which concludes the proof. ∎
After this analysis, we make two remarks about the assumptions that were needed to establish the theorems.
Remark 1: The key point of the theorems is the existence of an asymptotic expansion of $\xi(\mathcal{D}^A)$ and $\xi(\hat{\mathcal{D}}^A)$ in polynomials of the moments. This is a natural hypothesis for the CTR, because the CTR is an average of bounded variables (probabilities of click). Note that the proof of theorem 2 shows that although $T_b$ is random, the expansion remains valid anyway. For a contextual bandit algorithm producing a fixed policy, the mean is going to concentrate according to the central limit theorem (CLT). Furthermore this hypothesis, omnipresent in bootstrap theory (citeulike:2825416), is for instance justified in econometrics by the fact that all the common estimators respect it (horowitz2001bootstrap). Yet this assumption is not verified for a static deterministic policy.

Remark 2: Let us consider algorithms that produce a policy which changes over time (a learning algorithm in particular). After a sufficient amount of recommendations, a reasonable algorithm will produce a policy that no longer changes (if the world is stationary). Thus again the CLT will apply, and we will observe a convergence of the per-trial payoff to its limit in $O(1/\sqrt{T})$. Nevertheless, nothing holds true here while the algorithm is actually learning. This is due to the fact that the Chernoff bound no longer applies, as the steps are not independent. However, the behavior of classical learning algorithms is smooth, especially when randomized (see (exp3) for an example of a randomized version of UCB). (LiCLW11) argue that in this case convergence bounds may be derived for replay (and then applied to BRED) at the cost of a much more complicated analysis including smoothness assumptions. For non-reasonable algorithms, and thus in the general case, no guarantees can be provided. Note finally that a very intuitive way to justify jittering is to consider that it helps the Chernoff bound remain approximately valid in the case of a learning algorithm.
6 Experiments in realistic settings
As we have proved that BRED has promising theoretical guarantees in the setting introduced in (LiCLW11), let us now compare its empirical performance to that of the replay method.
6.1 Synthetic data and discussing Jittering
The first set of experiments was run on synthetic data. Indeed, we need to be able to compare the estimation errors of the two methods on various fixed-size datasets relative to the ground truth: an evaluation against the model itself. Before going any further, let us describe the model we used. It is a linear model with Gaussian noise (as in (LinUCB)) and was built as follows:

a fixed action set (or news set) of size 10.

Each context is generated as the sum of a user profile vector and a Gaussian noise term $\epsilon$.

The CTR of a news item $a$ displayed in a context $x$ is given linearly by $b_a + w_a \cdot x$. Note the non-contextual element $b_a$, and that the noise $\epsilon$ is ignored in the CTR.

Finally, there are two kinds of news: (i) 4 "universal" news items that are interesting in general, like Obama is reelected, for which $b_a$ is high and $w_a = 0$; (ii) 6 specific news items, like New Linux distribution released, for which $b_a$ is low and $w_a$ consists of zeros except for a few relevant weights.
A non-contextual approach would perform decently by quickly learning each news item's base click rate. Yet LinUCB (LinUCB), a contextual bandit algorithm, will do better by learning when to recommend the specific news items. Figure 2 displays the results and interpretation of an experiment which consists in evaluating LinUCB with the different methods. It is clear that BRED converges much faster than the replay method.
Remark: As can be seen in Figure 2, jittering is very important to obtain good performance when evaluating a learning algorithm. Empirically, a good choice for the level of jitter seems to be a decreasing function of the size of the dataset, proportional to the standard deviation of the posterior of the data. The results confirm our intuition: jittering is very important when the dataset is small, but gets less and less necessary as the dataset grows.
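The jittering step itself is a one-liner: resampling a record and adding Gaussian noise to its context is exactly sampling from a Gaussian kernel density estimate of the empirical context distribution (the smoothed bootstrap of Section 4). A minimal sketch, where the bandwidth rule `shrinking_bandwidth` is an illustrative stand-in for the decreasing jitter level discussed above, not the rule used in our experiments:

```python
import random

def jittered_resample(contexts, n_samples, bandwidth, rng):
    """Smoothed bootstrap: draw uniformly with replacement, then add
    N(0, bandwidth^2) noise -- equivalent to sampling from a Gaussian KDE
    of the empirical context distribution."""
    return [contexts[rng.randrange(len(contexts))] + rng.gauss(0.0, bandwidth)
            for _ in range(n_samples)]

def shrinking_bandwidth(n, scale=1.0):
    """Illustrative jitter level decreasing with the dataset size n,
    reflecting the observation that large datasets need less jitter."""
    return scale / n ** 0.5

rng = random.Random(3)
data = [0.0, 1.0, 2.0, 3.0]
sample = jittered_resample(data, 1000, shrinking_bandwidth(len(data)), rng)
```

The jittered sample keeps the mean of the original contexts while breaking the exact duplicates that the expanded datasets would otherwise contain.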
6.2 Real data
Adapting replay to a real-life dataset corresponding to dynamic recommendation is straightforward, although it leads to biased estimations. BRED really needs the assumption of a static world in order to perform the bootstrap resamples. Therefore BRED has to be run on successive windows of the dataset on which a static assumption can be made. This creates a bias/variance trade-off: if the windows are too big, some dynamic properties of the world may be erased (bias); on the contrary, windows that are too small will lead to high-variance bootstrap estimates. To simplify things, we ran experiments assuming a static world on small portions of the Yahoo! R6B dataset. We actually took the smallest number of portions such that a given portion has a fixed pool of news. This experiment is similar to what is done in (LiCLW11): the authors measured the error of the estimated CTR of UCB by the replay method on datasets of various sizes relative to what they call the ground truth: an evaluation of the same algorithm on a real fraction of the audience. As we obviously cannot do that, we used a simple trick. For each portion, we computed an estimation of the ground truth by averaging the estimated CTR of UCB using the replay method on 100 random permutations of the data. For each portion, the experiment consists in subsampling records and evaluating UCB using replay and BRED on this smaller dataset to estimate the ground truth using less data, faking time acceleration. The results and interpretation are shown in Figure 3: the better accuracy of BRED is very clearly illustrated.
7 Conclusion
In this paper, we studied the problem of recommender system evaluation, sticking to a realistic setting: we focused on obtaining a methodology for practical offline evaluation, providing a good estimate using a reasonable amount of data. Previous methods are proved to be asymptotically unbiased with a low speed of convergence on a static dataset, but yield counter-intuitive estimates of performance on real datasets. Here we introduced BRED, a method with a much faster speed of convergence on static datasets (at the cost of losing unbiasedness), which allows it to be much more accurate on dynamic data. Experiments demonstrated our point; they were performed on a publicly available dataset made from Yahoo! server logs and on synthetic data presenting the time acceleration issue. This paper was also meant to highlight the time acceleration issue and the misleading results given by a careless evaluation of an algorithm. Finally, our method comes with a very desirable property in a context of minimizing the risks entailed by putting a new RS online: an extremely accurate estimation of the variability of the estimate it provides.
An interesting line of future work is the automatic selection of the jittering bandwidth. Note that this problem is extensively studied in the context of KDE (scott1992multivariate).
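For instance, Silverman's rule of thumb from the KDE literature ties the bandwidth to the sample's spread and size; a sketch of the standard univariate rule, given here as background on bandwidth selection, not as a rule validated for BRED:

```python
import statistics

def silverman_bandwidth(xs):
    """Silverman's rule-of-thumb bandwidth for a univariate Gaussian KDE:
    h = 0.9 * min(std, IQR / 1.34) * n ** (-1/5)."""
    n = len(xs)
    s = statistics.stdev(xs)
    q = statistics.quantiles(xs, n=4)   # quartiles of the sample
    iqr = q[2] - q[0]                   # interquartile range
    return 0.9 * min(s, iqr / 1.34) * n ** (-0.2)
```

The $n^{-1/5}$ factor makes the bandwidth shrink with the dataset size, which matches the empirical behavior of the jitter level observed in Section 6.1.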
A possible extension of this work is to use BRED to build a "safe controller". Indeed, when a company uses a recommender system that behaves according to a certain policy reaching a certain level of performance, the hope is that when changing the recommendation algorithm, the performance will not drop. As an extension of the work presented here, it is possible to collect some data using the current policy, compute small variations of it with tight confidence intervals over their CTR, and then replace the current policy with the improved one. This may be seen as a kind of "gradient" ascent of the CTR in the space of policies.