1 Introduction
We consider the online version of the item recommendation problem, that is, the one faced by websites. Items may be ads, news, music, videos, movies, books, diapers, … Daily, or even more often, these systems have to cope with users who have never visited the website before, and with new items introduced into the catalog. The appetence of new users for available items, and the appeal of new items to existing users, have to be estimated as fast as possible: this is the cold start problem. Currently, this situation is handled thanks to side information available either about the user or about the item (see
[DBLP:conf/nips/AgarwalCEMPRRZ08, contextualRecommendation]). In this paper, we consider this problem from a different perspective. Though perfectly aware of the potential utility of side information, we consider the problem without any side information, focusing only on acquiring the appetence of new users and the appeal of new items as fast as possible; side information can be mixed with the ideas presented in this paper, and this combination is left as future work. This problem fits perfectly into the sequential decision making framework, and more specifically, the bandit setting without side information. However, in rather sharp contrast with the traditional bandit setting, here the set of bandits is continuously being renewed; the number of bandits is not small, though not huge (from a few dozen to hundreds of arms in general, up to dozens of millions in some applications): this makes the problem very different from the two-armed bandit problem, and asymptotic approximations are irrelevant here; we look for efficient and effective ways to achieve this goal, since we want the proposed solution to be able to cope with real applications on the web. For obvious practical and economical reasons, the strategy cannot merely consist in repeatedly presenting all available items to users until the appetence seems accurately estimated. We have to consider the problem as an exploration vs. exploitation problem in which exploration is a necessary evil to acquire information and eventually improve the performance of the recommendation system (RS for short). This being said, comes the problem of the objective function to optimize. Since the Netflix challenge, at least in the machine learning community, the recommendation problem is often boiled down to a matrix factorization problem, performed in batch, learning on a training set and minimizing the root mean squared error (RMSE) on a testing set. However, the RMSE comes along with heavy flaws:

Using the RMSE, there is no difference between items that are highly rated by a user and items poorly rated by the same user; however, for a user, there is a big difference between well-rated items and the others: the user wants to be recommended items she will rate high; she does not care about unattractive items. To illustrate this idea in a rating context like the Netflix challenge, using integers in the range 1 to 5, making an error between a 4 and a 5 is qualitatively very different from making an error between a 1 and a 2. Furthermore, the restricted set of possible ratings implies that a 5 covers a rather coarse group of highly rated items; if ratings were real numbers, the 5s would spread out, allowing a more precise ranking of preferences by each user. Finally, it is well-known that users have a propensity to rate items they like, rather than items they dislike [steck:kdd2010].

RMSE does not make any difference between the outcome of recommending an item to a heavy user (a user who has already rated a lot of items) and the outcome of the first recommendation to a user during her first visit to the website.

Usually, the training set and the testing set are unordered, all information regarding the history of the interactions being left aside. We then consider average appetence over time, completely neglecting the fact that a given item does not have the same appeal from its birth to its death, and the fact that the appeal of an item is often correlated with the set of items available at a given time, and those available in the past. [koren:td] has shown the importance of taking timestamps into account.

Though item recommendation is often presented as a prediction problem, it is really a ranking problem; however, RMSE is not meant to evaluate a ranking [ckt:recsys2010].
The objective function may be tinkered with to handle some of these aspects. However, we think that the one and only way to really handle the problem of recommendation is to address it as a sequential decision making problem, since the history should be taken into account. Such a sequential decision making problem faces an exploration vs. exploitation dilemma, as detailed in Sec. 4: exploration is meant to acquire information in order to exploit it and perform better subsequently; information gathering has a cost that cannot merely be minimized to 0, or simply left as an unimportant matter. This means that the evaluation of a recommendation algorithm dealing with the cold start problem has to be done online.
Based on these ideas, our contribution in this paper is the following:
we propose an original way to tackle the cold start problem of recommendation systems: we cast this problem as a sequential decision making problem to be played online, which selects items to recommend in order to optimize the exploration/exploitation balance; our solution is then to perform the rating matrix factorization driven by the policy of this sequential decision problem, in order to focus on the most useful terms of the factorization.
The reader familiar with the bandit framework can think of this work as a contextual bandit building its own context from the observed rewards, under the hypothesis of the existence of a latent space of dimension $k$.
We also introduce a methodology to use a classical partially filled rating matrix to assess the online performance of a bandit-based recommendation algorithm.
After introducing our notation in the next section, Sec. 3 presents the matrix factorization approach. Sec. 4 introduces the necessary background in bandit theory. In Sec. 5 and Sec. 6, we solve the cold start setting in the case of new users and in the case of new items. Sec. 7 provides an experimental study on artificial data, and on real data. Finally, we conclude and draw some future lines of work in Sec. 8.
2 Notations and Vocabulary
Uppercase, boldface letters denote matrices, such as $\mathbf{X}$. $\mathbf{X}^{\top}$ is the transpose of $\mathbf{X}$, and $\mathbf{X}_i$ denotes its row $i$. Lowercase, boldface letters denote vectors, such as $\mathbf{x}$. $|\mathbf{x}|$ is the number of components (dimension) of $\mathbf{x}$. Normal letters denote scalar values. With few exceptions, Greek letters are used to denote the parameters of the algorithms. We use calligraphic letters to denote sets, such as $\mathcal{S}$; $|\mathcal{S}|$ is the number of elements of the set $\mathcal{S}$. For a vector $\mathbf{x}$ and a set of integers $\mathcal{S}$ (s.t. $|\mathcal{S}| \le |\mathbf{x}|$), $\mathbf{x}_{\mathcal{S}}$ is the sub-vector of $\mathbf{x}$ composed of the elements of $\mathbf{x}$ whose indices are contained in $\mathcal{S}$. Accordingly, $\mathbf{X}$ being a matrix and $\mathcal{S}$ a set of integers smaller than or equal to the number of rows of $\mathbf{X}$, $\mathbf{X}_{\mathcal{S}}$ is the sub-matrix made of the rows of $\mathbf{X}$ whose indices form $\mathcal{S}$ (the ordering of the elements in $\mathcal{S}$ does not matter, but one can assume that the elements of the set are sorted). We now introduce a set of notations dedicated to the RS problem. We consider:
as we consider a time-evolving number of users and items, we note $n$ the current number of users and $m$ the current number of items. These should be indexed by $t$ to denote time, though in this paper $t$ is often dropped to simplify the notation. $i$ indexes the users, whereas $j$ indexes the items. Without loss of generality, we assume $n \le N$ and $m \le M$, that is, $N$ and $M$ are upper bounds on the numbers of ever-seen users and items (those figures may be as large as necessary).

$\mathbf{R}^*$ represents the ground truth, that is, the matrix of ratings. Obviously, in a real application this matrix is unknown. Each row is associated with one and only one user, whereas each column is associated with one and only one item. Hence, we also use row indices and column indices to represent users and items, respectively.
$\mathbf{R}^*$ is of size $N \times M$; $r^*_{i,j}$ is the rating given by user $i$ to item $j$.
We suppose that there exists an integer $k$ and two matrices $\mathbf{U}^*$ of size $N \times k$ and $\mathbf{V}^*$ of size $M \times k$ such that $\mathbf{R}^* = \mathbf{U}^* {\mathbf{V}^*}^{\top}$. This is a standard assumption [Dror:2011fk].
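As a quick illustration of this low-rank assumption, the following sketch builds a ground-truth matrix as the product of two thin factor matrices (the dimensions and the Gaussian values are purely illustrative, not taken from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k = 200, 100, 5              # users, items, latent dimension (illustrative)

U_star = rng.normal(size=(N, k))   # latent user features
V_star = rng.normal(size=(M, k))   # latent item features
R_star = U_star @ V_star.T         # ground-truth rating matrix, of rank at most k

assert np.linalg.matrix_rank(R_star) <= k
```

Any entry $r^*_{i,j}$ is then the scalar product of the $i$-th row of `U_star` with the $j$-th row of `V_star`.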

Not all ratings have been observed. We denote by $\mathcal{O}$ the set of elements that have been observed (yet). We then define $\mathbf{R}$ by:
$$r_{i,j} = r^*_{i,j} + \eta_{i,j}, \quad \forall (i,j) \in \mathcal{O},$$
where $\eta_{i,j}$ is a noise term with zero mean and finite variance. The $\eta_{i,j}$ are i.i.d. In practice, the vast majority of the elements of $\mathbf{R}$ are unknown.
In this paper, we assume that $\mathbf{R}^*$ is fixed for all time; at a given moment, only a sub-matrix made of $n$ rows and $m$ columns is actually useful. The observed part of $\mathbf{R}^*$ grows along time. That is, the set $\mathcal{O}$ grows along time.
$\mathcal{J}_i$ denotes the set of indices of the columns with available values in row $i$ of $\mathbf{R}$ (i.e. the set of items rated by user $i$). Likewise, $\mathcal{I}_j$ denotes the set of rows of $\mathbf{R}$ with available values in column $j$ (i.e. the set of users who rated item $j$).

Symbols $i$ and $n$ are related to users, thus to rows of the matrices containing ratings, while symbols $j$ and $m$ refer to items, thus to columns of these matrices.

$\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$ denote estimates (in the statistical sense) of the matrices $\mathbf{U}^*$ and $\mathbf{V}^*$ respectively. Their product is denoted $\hat{\mathbf{R}} = \hat{\mathbf{U}} \hat{\mathbf{V}}^{\top}$. At a given moment $t$, the relevant parts of matrices $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$ have dimensions $n \times k$ and $m \times k$ respectively.
To clarify things, let us consider a small example with a few users and items, with $\mathbf{U}^*$ and $\mathbf{V}^*$ given explicitly; then, assuming no noise, the full rating matrix is obtained as $\mathbf{R}^* = \mathbf{U}^* {\mathbf{V}^*}^{\top}$.
We use the term “observation” to mean a triplet (user $i$, item $j$, rating $r_{i,j}$ given to this item by this user). Each known value of $\mathbf{R}$ is an observation. The RS receives a stream of observations. We use the term “rating” to mean the value associated by a user to an item. It can be a rating as in the Netflix challenge, or a no-click/click, no-sale/sale, …
For the sake of legibility, in the online setting we omit the subscript $t$ for time dependency. In particular, the numbers of users and items, the set of observed entries, and the estimated factor matrices should all be subscripted with $t$.
3 Matrix Factorization
Since the Netflix challenge [Bennett07thenetflix], many works have used matrix factorization: the matrix of observed ratings is assumed to be the product of two matrices of low rank $k$. We refer the interested reader to [Koren2009] for a short survey. As most of the values of the rating matrix are unknown, the factorization can only be done using the set of observed values. The classical approach is to solve the regularized minimization problem $(\hat{\mathbf{U}}, \hat{\mathbf{V}}) = \operatorname*{argmin}_{\mathbf{U}, \mathbf{V}} \zeta(\mathbf{U}, \mathbf{V})$, where:
$$\zeta(\mathbf{U}, \mathbf{V}) = \sum_{(i,j) \in \mathcal{O}} \left( r_{i,j} - \mathbf{U}_i \mathbf{V}_j^{\top} \right)^2 + \lambda \cdot \Omega(\mathbf{U}, \mathbf{V}),$$
in which $\lambda \in \mathbb{R}^+$, and the usual regularization term is $\Omega(\mathbf{U}, \mathbf{V}) = \|\mathbf{U}\|^2 + \|\mathbf{V}\|^2$.
$\zeta$ is not convex. The minimization is usually performed either by stochastic gradient descent (SGD) or by alternating least squares (ALS). ALS-WR [Zhou:2008:LPC:1424237.1424269] weighs users and items according to their respective importance in the matrix of ratings: its regularization term is $\Omega(\mathbf{U}, \mathbf{V}) = \sum_i |\mathcal{J}_i| \, \|\mathbf{U}_i\|^2 + \sum_j |\mathcal{I}_j| \, \|\mathbf{V}_j\|^2$. This regularization is known to have a good empirical behavior: limited overfitting, easy tuning of $\lambda$ and $k$, and low RMSE.
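The alternating scheme can be sketched as follows. This is a minimal illustrative implementation (the function name `als` and all default parameter values are ours): it uses the plain $\ell_2$ regularization rather than the ALS-WR weighting, and fixes one factor while solving a ridge regression for each row of the other.

```python
import numpy as np

def als(R, mask, k=5, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares on the observed entries of R.

    mask[i, j] is True when rating R[i, j] has been observed.
    Returns factor matrices U (N x k) and V (M x k) with R ~ U @ V.T.
    """
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = rng.normal(scale=0.1, size=(N, k))
    V = rng.normal(scale=0.1, size=(M, k))
    for _ in range(n_iters):
        for i in range(N):                  # fix V, solve a ridge problem per user
            J = np.flatnonzero(mask[i])
            if J.size:
                A = V[J].T @ V[J] + lam * np.eye(k)
                U[i] = np.linalg.solve(A, V[J].T @ R[i, J])
        for j in range(M):                  # fix U, solve a ridge problem per item
            I = np.flatnonzero(mask[:, j])
            if I.size:
                A = U[I].T @ U[I] + lam * np.eye(k)
                V[j] = np.linalg.solve(A, U[I].T @ R[I, j])
    return U, V
```

Each inner solve is the closed-form ridge solution restricted to the observed entries of that row or column; ALS-WR would additionally scale `lam` by $|\mathcal{J}_i|$ (resp. $|\mathcal{I}_j|$).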
4 Bandits
Let us consider a bandit machine with $K$ independent arms. When pulling arm $a$, the player receives a reward drawn from a probability distribution $\nu_a$. Let $\mu_a$ denote the mean of $\nu_a$, $a^* = \operatorname*{argmax}_a \mu_a$ be the best arm and $\mu^* = \mu_{a^*}$ be the best expected reward (we assume there is only one best arm). The parameters $\nu_a$, $\mu_a$, $a^*$ and $\mu^*$ are unknown. A player aims at maximizing its cumulative reward after $T$ consecutive pulls. More specifically, denoting $a_t$ the arm pulled at time $t$ and $r_t$ the reward obtained at time $t$, the player wants to maximize $\sum_{t=1}^{T} r_t$. As the parameters are unknown, at each timestep (except the last one), the player faces the dilemma:

either exploit by pulling the arm which seems the best according to the estimated values of the parameters;

or explore to improve the estimation of the parameters of the probability distribution of an arm by pulling it;
A well-known approach to handle the exploration vs. exploitation trade-off is the Upper Confidence Bound strategy (UCB) [Auer02finitetimeanalysis], which consists in playing the arm:
$$a_t = \operatorname*{argmax}_a \left( \hat{\mu}_a(t) + \sqrt{\frac{2 \ln t}{n_a(t)}} \right), \qquad (1)$$
where $\hat{\mu}_a(t)$ denotes the empirical mean reward incurred on the pulls of arm $a$ up to time $t$, and $n_a(t)$ corresponds to the number of pulls of arm $a$ up to time $t$. UCB is optimal up to a constant. This equation clearly expresses the exploration-exploitation trade-off: while the first term of the sum ($\hat{\mu}_a(t)$) tends to exploit the seemingly optimal arm, the second term tends to explore less-pulled arms.
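Equation (1) can be sketched as follows; the class name `UCB1` and the Bernoulli-reward usage below are illustrative, not part of the paper:

```python
import math
import random

class UCB1:
    """Classical UCB1 index policy for K independent arms."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # n_a(t): number of pulls per arm
        self.means = [0.0] * n_arms  # empirical mean reward per arm
        self.t = 0

    def select(self):
        self.t += 1
        for a, n in enumerate(self.counts):
            if n == 0:               # pull each arm once before using the index
                return a
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.means[arm] += (reward - self.means[arm]) / n

# Usage on two Bernoulli arms with means 0.2 and 0.8:
random.seed(0)
bandit = UCB1(2)
for _ in range(2000):
    a = bandit.select()
    r = 1.0 if random.random() < (0.2, 0.8)[a] else 0.0
    bandit.update(a, r)
```

After a few thousand pulls, the better arm dominates the pull counts while the worse arm keeps being sampled occasionally, as the exploration bonus dictates.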
Li et al. [LinUCB] extend the bandit setting to contextual arms. They assume that a vector of real features $\mathbf{x}_a \in \mathbb{R}^d$ is associated with each arm $a$, and that the expectation of the reward associated with an arm is $\boldsymbol{\theta}^{*\top} \mathbf{x}_a$, where $\boldsymbol{\theta}^*$ is an unknown vector. The algorithm handling this setting is known as LinUCB. LinUCB follows the same scheme as UCB, in the sense that it consists in playing the arm with the largest upper confidence bound on the expected reward:
$$a_t = \operatorname*{argmax}_a \left( \hat{\boldsymbol{\theta}}^{\top} \mathbf{x}_a + \alpha \sqrt{\mathbf{x}_a^{\top} \mathbf{A}^{-1} \mathbf{x}_a} \right),$$
where $\hat{\boldsymbol{\theta}}$ is an estimate of $\boldsymbol{\theta}^*$, $\alpha$ is a parameter, and $\mathbf{A} = \sum_{s=1}^{t-1} \mathbf{x}_{a_s} \mathbf{x}_{a_s}^{\top} + \mathbf{I}_d$, where $\mathbf{I}_d$ is the identity matrix. Note that $\hat{\boldsymbol{\theta}}^{\top} \mathbf{x}_a$ corresponds to an estimate of the expected reward, while $\alpha \sqrt{\mathbf{x}_a^{\top} \mathbf{A}^{-1} \mathbf{x}_a}$ is an optimistic correction of that estimate. While the objective of UCB and LinUCB is to maximize the cumulative reward, theoretical results [LinUCB, NIPS2011_1243] are expressed in terms of the cumulative regret (or regret for short):
$$R(T) = \sum_{t=1}^{T} \left( \mu_t^* - r_t \right),$$
where $\mu_t^*$ stands for the best expected reward at time $t$ ($\mu^*$ in the UCB setting, $\max_a \boldsymbol{\theta}^{*\top} \mathbf{x}_a$ in the LinUCB setting). Hence, the regret measures how much the player loses (in expectation) in comparison to playing the optimal strategy. Standard results prove regrets of order $\tilde{O}(\sqrt{T})$ or $O(\ln T)$, depending on the assumptions on the distributions and on the precise analysis^{1} (^{1}$\tilde{O}(\cdot)$ means up to a logarithmic factor in $T$).
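A single LinUCB selection step can be sketched as follows; the function name `linucb_select` and the use of one shared design matrix $\mathbf{A}$ are our simplifications of the general algorithm:

```python
import numpy as np

def linucb_select(contexts, A, b, alpha=1.0):
    """One LinUCB selection step over a set of contextual arms.

    contexts: (n_arms, d) array, one feature vector per arm.
    A: (d, d) design matrix, sum of x x^T over past pulls plus the identity.
    b: (d,) response vector, sum of reward * x over past pulls.
    Returns the index of the arm with the largest upper confidence bound.
    """
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                 # ridge estimate of the unknown weight vector
    bonus = np.sqrt(np.einsum("ai,ij,aj->a", contexts, A_inv, contexts))
    ucb = contexts @ theta + alpha * bonus
    return int(np.argmax(ucb))
```

With `alpha = 0` the rule degenerates to the greedy choice; larger `alpha` inflates the ellipsoidal bonus of poorly explored directions of the feature space.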
Of course, LinUCB and, more generally, contextual bandits require the context (values of the features) to be provided. In real applications this is done using side information about the items and the users [Shivaswamy/Joachims/11b], i.e. expert knowledge, categorization of items, Facebook profiles of users, implicit feedback, … The core idea of this paper is to use matrix factorization techniques to build a context online, using the known ratings. To this end, one assumes that the items and the users can be represented in the same space of dimension $k$, and that the rating of user $i$ for item $j$ is the scalar product of $\mathbf{U}_i$ and $\mathbf{V}_j$.
We study the introduction of new items and/or new users into the RS. This is done without using any side information on users or items.
5 Cold Start for a New User
Let us now consider a particular recommendation scenario. At each timestep $t$:

a user $i$ requests a recommendation from the RS,

the RS selects an item $j$ among the set of items that have never been recommended to user $i$ before,

user $i$ returns a rating $r_t = r_{i,j}$ for item $j$.
Obviously, the objective of the RS is to maximize the cumulative reward $\sum_t r_t$.
In the context of such a scenario, the usual matrix factorization approach of RS recommends the item $j$ which has the best predicted rating for user $i$. This corresponds to a pure exploitation (greedy) strategy in the bandit setting, which is well-known to be suboptimal: to be optimal, the RS has to balance exploitation and exploration.
Let us now describe the recommendation algorithm we propose at timestep $t$. We aim at recommending to user $i$ an item $j$ which leads to the best trade-off between exploration and exploitation, in order to maximize the cumulative reward. We assume that the matrix $\mathbf{R}$ is factorized into $\hat{\mathbf{U}} \hat{\mathbf{V}}^{\top}$ by ALS-WR (discussed later), which terminated by optimizing $\hat{\mathbf{U}}$ holding $\hat{\mathbf{V}}$ fixed. In such a context, the UCB approach is based on a confidence interval on the estimated ratings $\hat{r}_{i,j}$ for any allowed item $j$. We assume that we have already observed a sufficient number of ratings for each item, but only a few ratings (possibly none) from user $i$. As a consequence, the uncertainty on $\hat{\mathbf{U}}_i$ is much more important than the uncertainty on any $\hat{\mathbf{V}}_j$. In other words, the uncertainty on $\hat{r}_{i,j}$ mostly comes from the uncertainty on $\hat{\mathbf{U}}_i$. In the following, we express this uncertainty.
Let $\mathbf{U}^*_i$ denote the (unknown) true value of $\hat{\mathbf{U}}_i$, and let us introduce the $k \times k$ matrix:
$$\mathbf{A}_i = \hat{\mathbf{V}}_{\mathcal{J}_i}^{\top} \hat{\mathbf{V}}_{\mathcal{J}_i} + \lambda \cdot |\mathcal{J}_i| \cdot \mathbf{I}_k.$$
As shown by [Zhou:2008:LPC:1424237.1424269], as $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$ come from ALS-WR (whose last iteration optimized $\hat{\mathbf{U}}$ with $\hat{\mathbf{V}}$ fixed):
$$\hat{\mathbf{U}}_i^{\top} = \mathbf{A}_i^{-1} \hat{\mathbf{V}}_{\mathcal{J}_i}^{\top} \mathbf{R}_{i, \mathcal{J}_i}^{\top}.$$
Using Azuma's inequality over the weighted sum of random variables (as introduced by [DBLP:journals/corr/abs12052606] for linear systems), it follows that there exists a value $C$ such that, with probability $1 - \delta$:
$$\left| \hat{r}_{i,j} - r^*_{i,j} \right| \le C \sqrt{\hat{\mathbf{V}}_j \mathbf{A}_i^{-1} \hat{\mathbf{V}}_j^{\top}}.$$
This inequality defines the confidence bound around the estimate $\hat{r}_{i,j}$ of $r^*_{i,j}$. Therefore, a UCB strategy selects the item:
$$j_t = \operatorname*{argmax}_{j \notin \mathcal{J}_i} \left( \hat{\mathbf{U}}_i \hat{\mathbf{V}}_j^{\top} + \alpha \sqrt{\hat{\mathbf{V}}_j \mathbf{A}_i^{-1} \hat{\mathbf{V}}_j^{\top}} \right),$$
where $\alpha$ is an exploration parameter to be tuned. Fig. 1 illustrates the transition from the maximum on a confidence ellipsoid to its closed form.
Our complete algorithm, named BeWARE.User (which stands for “Bandit WARm-up REcommenders”), is described in Alg. 1. The presentation is optimized for clarity rather than for computational efficiency. Of course, if the exploration parameter $\alpha$ is set to $0$, BeWARE.User chooses the same item as ALS-WR. The estimate of the center of the ellipsoid and its size can be influenced by the use of another regularization term. BeWARE.User uses a regularization based on ALS-WR. It is possible to replace all the $|\mathcal{J}_i|$ terms by $1$; this amounts to the standard regularization: we call this slightly different algorithm BeWARE.ALS.User. In fact, one can use any regularization ensuring that $\hat{\mathbf{U}}_i$ is a linear combination of observed rewards. Please note that BeWARE.ALS.User is a LinUCB building its context using matrix decomposition: if the matrix $\hat{\mathbf{V}}$ does not change after an observation, this is exactly LinUCB.
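The selection rule above can be sketched as follows. This is not the paper's Alg. 1: the function name `beware_user_select` is ours, and we assume the inverse design matrix `A_inv` (built from the features of the items the user already rated) has been precomputed.

```python
import numpy as np

def beware_user_select(u_hat, V, rated, A_inv, alpha=0.3):
    """BeWARE.User-style item selection for one (possibly new) user.

    u_hat: (k,) current estimate of the user's latent features.
    V: (M, k) estimated item features (assumed reliable, see text).
    rated: set of item indices already recommended to this user.
    A_inv: (k, k) inverse of the regularized design matrix of the user.
    alpha: exploration parameter (alpha = 0 recovers the greedy choice).
    """
    candidates = [j for j in range(V.shape[0]) if j not in rated]
    scores = [
        V[j] @ u_hat + alpha * np.sqrt(V[j] @ A_inv @ V[j])  # estimate + bonus
        for j in candidates
    ]
    return candidates[int(np.argmax(scores))]
```

Items whose features point in directions where the user's ellipsoid is wide receive a large bonus, so a new user is first probed along her most uncertain latent directions.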
5.1 Discussion on the Analysis of BeWARE.User
The analysis of BeWARE.User is rather similar to the LinUCB proof [NIPS2011_1243], but it requires taking care of the context vectors, which in our case are estimated through a matrix decomposition. As matrix decomposition error bounds are classically not distribution-free [Chatterjee:arXiv1212.1247] (they require at least independence between the observations), we cannot provide a complete proof. However, we can provide one for a modified algorithm using the same LinUCB degradation as [journals/jmlr/ChuLRS11]: the trick is to inject some independence into the observed values in order to guarantee an unbiased estimation.

6 Cold Start for New Items
When a new item is added, it is a larger source of uncertainty than the descriptions of the users. To reflect this fact, we compute a confidence bound over the items instead of the users. As the second step of the ALS is to fix $\hat{\mathbf{U}}$ and optimize $\hat{\mathbf{V}}$, it is natural to adapt our algorithm to handle the uncertainty on $\hat{\mathbf{V}}$ accordingly. This takes care of the exploration upon the arrival of new items. With the same criterion and regularization on $\hat{\mathbf{V}}$ as above, we obtain at timestep $t$:
$$\mathbf{B}_j = \hat{\mathbf{U}}_{\mathcal{I}_j}^{\top} \hat{\mathbf{U}}_{\mathcal{I}_j} + \lambda \cdot |\mathcal{I}_j| \cdot \mathbf{I}_k \quad \text{and} \quad \hat{\mathbf{V}}_j^{\top} = \mathbf{B}_j^{-1} \hat{\mathbf{U}}_{\mathcal{I}_j}^{\top} \mathbf{R}_{\mathcal{I}_j, j}.$$
So, considering the confidence ellipsoid on $\hat{\mathbf{V}}_j$, the upper confidence bound of the rating for user $i$ on item $j$ is:
$$\hat{\mathbf{U}}_i \hat{\mathbf{V}}_j^{\top} + \alpha \sqrt{\hat{\mathbf{U}}_i \mathbf{B}_j^{-1} \hat{\mathbf{U}}_i^{\top}}.$$
This leads to the algorithm BeWARE.Item, presented in Alg. 2. Again, the presentation is optimized for clarity rather than for computational efficiency. BeWARE.Item can be parallelized and has the complexity of one step of ALS. Fig. 2 gives the geometrical intuition behind BeWARE.Item. Again, setting $\alpha = 0$ leads to the same selection as ALS-WR. The regularization (on line 4) can be modified. This algorithm has no straightforward interpretation in terms of LinUCB.
7 Experimental Investigation
In this section, we empirically evaluate our family of algorithms on artificial data and on real datasets. The BeWARE algorithms are compared to:

greedy approaches (denoted Greedy.ALS and Greedy.ALSWR) that always choose the item with the largest current estimated rating (given a decomposition obtained by ALS or by ALS-WR, respectively),

the UCB1 approach [Auer02finitetimeanalysis] (denoted UCB.on.all.users), which considers each reward $r_{i,j}$ as an independent realization of a distribution $\nu_j$. In other words, UCB.on.all.users recommends an item without taking into account the information on the user requesting the recommendation.
On the one hand, the comparison to greedy approaches highlights the need for exploration for an algorithm to be optimal in the online context. On the other hand, the comparison to UCB.on.all.users assesses the benefit of personalizing recommendations.
7.1 Experimental Setting
For each dataset, algorithms start with an empty rating matrix of 100 items and 200 users. Then, the evaluation goes as follows:

select a user uniformly at random among those who have not yet rated all the items,

ask the evaluated algorithm to select an item among those this user has not yet rated,

compute the immediate regret (the difference in ratings between the best item not yet selected and the one actually selected for this user),

iterate until all users have rated all items.
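The evaluation loop above can be sketched as follows, assuming a complete ground-truth matrix as in the artificial setting (the function name `run_offline_evaluation` and its signature are ours):

```python
import numpy as np

def run_offline_evaluation(R_star, policy, n_steps, seed=0):
    """Turn a complete rating matrix into an online bandit problem.

    R_star: (N, M) ground-truth ratings (complete, as in the artificial setting).
    policy: callable (user, candidate_items, history) -> chosen item index.
    Returns the list of per-step immediate regrets.
    """
    rng = np.random.default_rng(seed)
    N, M = R_star.shape
    rated = [set() for _ in range(N)]       # items already shown to each user
    history = []                            # stream of (user, item, rating) triplets
    regrets = []
    for _ in range(n_steps):
        active = [u for u in range(N) if len(rated[u]) < M]
        if not active:                      # every user has rated every item
            break
        u = int(rng.choice(active))         # user picked uniformly at random
        candidates = [j for j in range(M) if j not in rated[u]]
        j = policy(u, candidates, history)
        best = max(R_star[u, jj] for jj in candidates)
        regrets.append(best - R_star[u, j])  # immediate regret of the choice
        rated[u].add(j)
        history.append((u, j, R_star[u, j]))
    return regrets
```

An oracle policy that always picks the candidate with the largest ground-truth rating incurs zero regret, which gives a quick sanity check of the harness.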
The difficulty with real datasets is that the ground truth is unknown; actually, only a very small fraction of the ratings is known, which makes the evaluation of algorithms difficult. To overcome these difficulties, we also provide a comparison of the algorithms on an artificial problem based on a full ground-truth matrix $\mathbf{R}^*$, for a given number of users and items. This matrix is generated as in [Chatterjee:arXiv1212.1247]. Each item belongs to one of a few genres, and each user belongs to one of a few types. For each item $j$ of genre $g$ and each user $i$ of type $\tau$, the ground-truth rating $r^*_{i,j}$ depends only on the pair $(g, \tau)$, its value being drawn uniformly at random from a fixed finite set of ratings. The observed rating is a noisy version of the ground truth: $r_{i,j} = r^*_{i,j} + \eta_{i,j}$.
We also consider real datasets: the Netflix dataset [Bennett07thenetflix] and the Yahoo!Music dataset [Dror:2011fk]. Of course, the major issue with real data is that no dataset provides a complete matrix, which means we no longer have access to the ground truth $\mathbf{R}^*$; this makes the evaluation of algorithms more complex. This issue is usually solved in the bandit literature by using a method based on rejection sampling [LiCLW11]. For a well-constructed dataset, this kind of estimator has no bias and a known bound on the decrease of the error rate [Langford_ExploScav_08].
For all the algorithms, we restrict the possible choices for a user at timestep $t$ to the items with a known rating in the dataset. However, a minimum number of ratings per user is needed to make a meaningful comparison of the algorithms (otherwise, a random strategy is the only reasonable one). As a consequence, with both datasets, we focus on the heaviest users for the top movies/songs. This leads to a matrix with only a small fraction of missing ratings. We insist on the fact that this restriction is necessary for the performance evaluation of the algorithms; obviously, it is not required to use the algorithms in a live RS.
For people used to working on full recommendation datasets, the experiment may seem small. But one has to keep in mind several points:

Each value in the matrix corresponds to one possible observation. After each observation, we are allowed to update our recommender policy. This means that for 4,000 observations, we need to perform 4,000 matrix decompositions.

To evaluate BeWARE precisely, we would need the rating of any user on any item (because BeWARE may choose any of the items for the current user). In the dataset, many of the ratings are unknown, so using a part of the matrix with many unknown ratings would introduce a bias in the evaluation.
We would like to highlight this experimental methodology, which has a unique feature: it allows us to turn any matrix (or tensor) of ratings into an online problem which can be used to test bandit recommendation algorithms. This is of interest because there is currently no standard dataset to evaluate bandit algorithms. To be able to evaluate any bandit algorithm offline on real data, one has to collect data using a uniformly random strategy and use a replay-like methodology [Langford_ExploScav_08]. To the best of our knowledge, the very few datasets with the desired properties are provided by the Yahoo! Webscope program (R6 dataset), as used in the challenge [ic12]. These datasets are only available to academics, which restricts their use. So, it is very interesting to be able to use a more widely available rating matrix (such as the Netflix dataset) to evaluate an online policy. We think that this methodology is another contribution of this paper. A similar trick has already been used in reinforcement learning to turn a reinforcement learning problem into a supervised classification task
[Lagoudakis03reinforcementlearning].

7.2 Experimental Results
Figures 3(a) and 3(b) show that, given a fixed factorization method, the BeWARE strategies improve over their greedy counterparts. Looking more closely at the results, BeWARE based on item uncertainty performs better than BeWARE based on user uncertainty, and BeWARE.User is the only BeWARE strategy beaten by its greedy counterpart (Greedy.ALSWR), on the Netflix dataset. These results demonstrate that an online strategy has to care about exploration to tend towards optimality.
While UCB.on.all.users is almost the worst approach on artificial data (Fig. 3(a)), it surprisingly performs better than all other approaches on the Netflix dataset. We feel that this difference is strongly related to the preprocessing of the Netflix dataset we performed in order to follow the experimental protocol (and to have an evaluation at all). By focusing on the top movies, we keep blockbusters that are appreciated by everyone. With that particular subset of movies, there is no need to adapt the recommendation user per user. As a consequence, UCB.on.all.users suffers a smaller regret than other strategies, as it considers users as independent realizations of the same distribution. It is worth noting that the regret of UCB.on.all.users would increase with the number of items, while the regret of BeWARE scales with the dimensionality of the factorization, which makes BeWARE a better candidate for real applications with many more items to deal with.
Last, on Fig. 3(c) all approaches suffer the same regret.
7.3 Discussion
In a real setting, BeWARE.Item has a desirable property: it tends to favor new items over older ones, because new items have received less feedback than the others and hence have larger confidence bounds. So the algorithm gives them a boost, which is exactly what a web store wants: if a web store accepts new products, it is because it believes the new ones are potentially better than the old ones. Moreover, this allows the recommender policy to make the best of the novelty effect for new items. This natural attraction of users to new items can be very strong, as shown by the Exploration & Exploitation challenge at ICML 2012, which was won by a context-free algorithm [ic12].
The computational cost of the BeWARE methods is the same as that of an additional step of alternating least squares; moreover, some intermediate calculations of the QR factorization can be reused to speed up the computation. So the total cost of BeWARE.Item is almost the same as that of ALS-WR. Even better, while the online setting requires recomputing the factorization at each timestep, this factorization changes only slightly from one iteration to the next. As a consequence, only a few ALS-WR iterations are needed to update the factorization. Overall, the computational cost stays reasonable even in a real application.
8 Conclusion and Future Work
In this paper, we introduced the idea of using bandit algorithms as a principled and effective way to solve the cold start problem in recommendation systems. We think this contribution is conceptually rich, and opens the way to many different studies. We showed on large, publicly available datasets that this approach is also effective, leading to efficient algorithms able to work online, under the expected computational constraints of such systems. Furthermore, the algorithms are quite easy to implement.
Many extensions are currently under study. First, we work on extending these algorithms to use contextual information about users and items. This will require combining the similarity measure with confidence bounds; this might be translated into a Bayesian prior. We also want to analyze the regret bound for a large enough number of items and users. This part can be tricky, as LinUCB still does not have a complete formal analysis, though some insights are available in [NIPS2011_1243].
Another important point is to work on the recommendation of several items at once while getting feedback only for the best one. There is some work on this point in the non-contextual bandit literature; one could try to translate it into our framework [DBLP:journals/jcss/CesaBianchiL12].
Finally, we plan to combine the confidence ellipsoids on both users and items; this is not a straightforward sum of the bounds. However, we feel that such a combination has low odds of providing better results in real applications, but it is interesting from a theoretical perspective, and should lead to even better results on artificial problems.