1 Introduction
Recommender systems have become a core component of today’s personalized online businesses [23, 9]. By connecting various items (e.g., retail products, movies, news articles, advertisements, experts) to potentially interested users, recommender systems enable online webshops (e.g. Amazon, Netflix, Yahoo!) to expand their marketing efforts from a historically small set of bestsellers toward a large variety of long-tail (niche) products [4, 9, 28]. This ability is provided by a personalization algorithm that identifies the preferences of each individual user, which is at the heart of a recommender system.
Predicting user preference is challenging. Usually, the user and item spaces are very large, yet the observations are extremely sparse. Learning from such rare, noisy and largely missing evidence carries a high risk of overfitting. Indeed, this data sparseness issue has been widely recognized as a critical challenge for constructing effective recommender systems.
A straightforward way to build a recommender would be to learn a user’s preference based on the prior interactions between her and the recommender system. Typically, such an interaction is an “opportunity give-and-take” process (cf. Table 1), where at each interaction:
a user inquires the system (e.g. visits a movie recommendation web site);
the system offers a set of (personalized) opportunities (i.e. items) (e.g. recommends a list of movies of potential interest to the user);
the user chooses one item (or more) from these offers and takes actions accordingly (e.g. clicks a link, rents a movie, views a news article, purchases a product).
Somewhat surprisingly, this interaction process has not been fully exploited for learning recommenders. Instead, research on recommender systems has focused almost exclusively on recovering user preference by completing the matrix of user actions, while the actual contexts in which user decisions are made are totally disregarded. In particular, Collaborative Filtering (CF) approaches capture only the action dyads, while the contextual dyads (i.e. the offered items on which the user took no action) are typically treated as missing data. For example, the rating-oriented models aim to approximate the ratings that users assigned to items [25, 20, 24, 1, 6, 15]; the recently proposed ranking-oriented algorithms [29, 16] attempt to recover the ordinal ranking information derived from the ratings. Although this formulation of the recommendation problem has led to numerous algorithms which excel on a number of data sets, including the prize-winning work of [15], we argue here that the formulation is inherently flawed: a preference for Die Hard given a generic set of movies only tells us that the user appreciates action movies; however, a preference for Die Hard over Terminator or Rocky suggests that the user might favor Bruce Willis over other action heroes. In other words, the context of user choice is vital when estimating user preferences.
When it comes to modeling user-recommender interactions, an important question arises: what is the fundamental mechanism underlying user choice behavior? As reflected by its name, collaborative filtering is based on the notion of “collaboration effects”, namely that similar items get similar responses from similar users. This assumption is essential because by encoding the “collaboration” among users, among items, or both, CF greatly alleviates the issue of data sparseness and in turn makes more reliable predictions based on evidence pooled across different items and users.
It has long been recognized in psychology and economics that, besides the effect of collaboration [5, 21], another mechanism governs users’ behavior: competition [17, 19, 3]. In particular, items tend to compete with each other for the attention of users; therefore, axiomatically, a user will pick the best item (i.e. the one with the highest utility) when confronted with a set of alternatives. For example, consider a user with a penchant for action movies by Arnold Schwarzenegger. Given the choice between Sleepless in Seattle and Die Hard, he will likely choose the latter. However, when afforded the choice between the oeuvres of Schwarzenegger, Diesel or Willis, he is clearly more likely to choose Schwarzenegger over the works of Willis. To capture a user’s preference more accurately, it is therefore essential for a recommender model to take such local competition effects into account. Unfortunately, this effect is absent in a large number of collaborative filtering approaches.
In this paper, we present Collaborative Competitive Filtering (CCF) for learning recommender models by modeling users’ choice behavior in their interactions with the recommender system. Similar to matrix factorization approaches for CF, we employ a multiplicative latent factor model to characterize the dyadic utility function (i.e. the utility of an item to a user). In this way, CCF encodes the collaboration effect among users and items in the same way as CF. But instead of learning only from the action dyads (i.e. the “1” entries in Table 1), CCF bases the factorization learning on the whole user-recommender interaction traces. It therefore leverages not only the action dyads but also the dyads in the context without user actions (i.e. the dot entries in Table 1), which were treated as potentially missing data in CF approaches.
To leverage the entire interaction trace for latent factor learning, we devise probabilistic models or optimization objectives to encode the local competition effect underlying the user choice process. We present two formulations with different flavors. The first formulation is derived from the multinomial logit model that has been widely used for modeling user choice behavior (e.g. choice of brands) in psychology [17], economics [18, 19] and marketing science [11]. The second formulation relates closely to the ordinal regression models in content filtering [12] (e.g. web search ranking). Essentially, both formulations attempt to encode the “local optimality of user choices”, encouraging every opportunity taken by a user to be locally the best in the context of the opportunities offered to her. From a machine learning viewpoint, CCF is a hybrid of local and global learning, where a global matrix factorization model is learned by optimizing a local context-aware loss function. We discuss the implementation of CCF, establish efficient learning algorithms and deliver a package that allows distributed optimization on streaming data.
Experiments were conducted on three real-world recommendation data sets. First, on two dyadic data sets, we show that CCF improves over standard CF models by more than 50% in terms of offline top-k ranking. Furthermore, on a commercial recommender system, we show that CCF significantly outperforms CF models in both offline and online evaluations. In particular, CCF achieves up to 7% improvement in offline top-k ranking and up to 13% in terms of online click rate prediction.
2 Preliminaries
2.1 Problem formulation
Consider the user-system interaction in a recommender system: we have a set of users and a set of items; when a user visits the site, the system recommends a set of items, and the user in turn chooses a (possibly empty) subset of the recommended items and takes actions accordingly (e.g. buys some of the recommended products). For ease of explanation, let us temporarily assume that the chosen subset is not empty and contains exactly one item. More general scenarios shall be discussed later.
To build the recommender system, we record a collection of historical interactions, each consisting of a user, the set of items offered to her and the subset she chose, and each indexed by its interaction session. Our goal is to generate recommendations for an incoming visit of a user such that the user’s satisfaction is maximized. Hereafter, we refer to the set of users as the user space, the set of items as the item space, the set of recommended items as the offer set or context, the chosen subset as the decision set, and the chosen item as a decision.
A key component of a recommender system is a model that characterizes the utility of an item to a user, upon which recommendations for a new inquiry from a user can be made by simply ranking items by their utility and recommending the top-ranked ones. Collaborative filtering is by far the most well-known method for modeling such dyadic responses.
2.2 Collaborative filtering
In collaborative filtering we are given observations of dyadic responses, each being an observed response of a user to an item (e.g. the user’s rating of the item, or an indication of whether the user took an action on the item). The whole mapping from user-item pairs to responses constitutes a large matrix. While we might have millions of users and items, only a tiny proportion (considerably less than 1% in realistic data sets) of entries are observable. Note the subtle difference in terms of the data representation: while we record entire sessions, CF only records the dyadic responses.
Collaborative filtering exploits the notion of “collaboration effects”, i.e., similar users have similar preferences for similar items. By encoding collaboration, CF pools the sparse observations in such a way that, for predicting the response of one user to one item, it also borrows observations from other users and items. Generally speaking, existing CF methods fall into one of the following two categories.
Neighborhood models.
A popular class of approaches to CF is based on propagating the observed responses among items or users that are considered neighbors. The model first defines a similarity measure between items or users. Then, an unseen response between a user and an item is approximated based on the responses of neighboring users or items [25, 20], for example by simply averaging the neighboring responses with the similarities as weights.
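As a minimal sketch of the similarity-weighted averaging just described (the function name and the toy numbers are illustrative assumptions, not taken from the paper), a neighborhood prediction could be computed as follows:

```python
def predict_from_neighbors(neighbor_responses, similarities):
    """Approximate an unseen response as the similarity-weighted
    average of the responses of neighboring users (or items)."""
    total_weight = sum(similarities)
    if total_weight == 0:
        raise ValueError("at least one neighbor needs a non-zero similarity")
    weighted_sum = sum(s * y for s, y in zip(similarities, neighbor_responses))
    return weighted_sum / total_weight


# Toy example (hypothetical values): three neighbors rated the item 5, 3 and 4,
# and the first neighbor is twice as similar as each of the other two.
estimate = predict_from_neighbors([5.0, 3.0, 4.0], [0.5, 0.25, 0.25])
```

How the similarities themselves are obtained is left open here; any similarity measure between users or items can be plugged in.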
Latent factor models.
This class of methods learns predictive latent factors to estimate the missing dyadic responses. The basic idea is to associate a latent factor with each user and with each item, and to assume a multiplicative model for the dyadic response: the response depends on the user and the item only through their utility, together with a set of hyperparameters, and the utility is assumed to be a multiplicative function (the inner product) of the two latent factors. (Throughout this paper, we assume each latent factor contains a constant component so as to absorb the user- or item-specific offset into the latent factors.)
This way the factors can explain past responses and in turn make predictions for future ones. This model implicitly encodes the Aldous-Hoover theorem [13] for exchangeable matrices: the responses are independent of each other given the user and item factors. In essence, it amounts to a low-rank approximation of the response matrix that naturally embeds both users and items into a vector space in which the distances directly reflect the semantic relatedness.
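The multiplicative utility and the constant component mentioned in the footnote can be illustrated with a small sketch. The vector layout below (where the constant 1 sits, and which slot holds the offset) is an assumption made for illustration only; the text does not fix it.

```python
def utility(user_factor, item_factor):
    """Multiplicative (inner-product) utility of an item to a user."""
    if len(user_factor) != len(item_factor):
        raise ValueError("factors must have the same dimension")
    return sum(p * q for p, q in zip(user_factor, item_factor))


# Hypothetical layout: two free dimensions, then a user offset paired with a
# constant 1 on the item side, and a constant 1 paired with an item offset.
user_factor = [1.0, 2.0, 0.5, 1.0]    # [p1, p2, user offset, 1]
item_factor = [0.5, -1.0, 1.0, 0.25]  # [q1, q2, 1, item offset]

# Inner product = p1*q1 + p2*q2 + user offset + item offset
score = utility(user_factor, item_factor)
```

With this layout the offsets need no separate bias terms: they are learned like any other coordinate of the latent factors.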
To design a concrete model [2, 22, 26], one needs to specify a distribution for the dependence. Afterwards, the model boils down to an optimization problem. For example, two commonly used formulations are:
2.3 Motivating discussions
Collaborative filtering approaches have made substantial progress and are currently the state-of-the-art techniques for recommender systems. However, we argue here that CF approaches are lacking in several respects. First of all, although data sparseness is a big issue, CF does not fully leverage the wealth of user behavior data. Take the user-recommender interaction process described in §2.1 as an example (cf. Table 1): CF methods typically use only the action dyad of each session, while the other dyads are treated as missing and totally disregarded. This wastes an invaluable learning resource, because these non-action dyads are informative, as shown by the experiments in this paper.
Secondly, most existing CF approaches learn user preference collaboratively by either approximating the dyadic responses [25, 20, 24, 1, 6, 15] or preserving the ordinal ranking information derived from the dyadic responses [29, 16]; none of them models the user choice behavior in recommender systems. In particular, as users choose from competing alternatives, there is naturally a local competition effect among the items being offered in a session. Our work shows that this effect can be an important clue for learning user preference.
Latent factor models are very flexible and can be underdetermined (or overparameterized) even for a rather moderate number of users and items. Together with the above two limitations, this makes CF approaches vulnerable to overfitting [1, 15]. In particular, while most existing CF models might learn consistently on user ratings (numerical values, typically with five levels) if given enough training data, they usually perform poorly on binary responses. For example, for the aforementioned interaction process (cf. Table 1), the response is typically a binary event indicating whether or not an item was accepted by the user. With the non-action dyads being ignored, the responses are exclusively positive observations (either 1 or missing). As a result, we obtain an overly optimistic estimator that is biased toward positive responses and predicts positive for almost all the incoming dyads (see §4.1 for empirical evidence).
3 Collaborative competitive filtering
We present a novel framework for recommender learning by modeling the system-user interaction process. The key insight is that the contexts in which a user’s decisions are made should be taken into account when learning recommender models. In practice, a user could make different decisions when facing different contexts. For instance, an item would not have been chosen by a user if it had not been presented to her in the first place; likewise, the user could choose another item if the context changes such that a better offer (e.g., a more interesting item) is presented to her.
In this section, we describe the framework of collaborative competitive filtering. We start with some axiomatic views of user choice behavior. Following that, we present the learning formulation of CCF. We then develop the optimization algorithms and implementation techniques. We close the section with a discussion of useful extensions.
3.1 Local optimality of user choices
Formally, the individual choice process (i.e. user-recommender interactions) in a recommender system can be viewed as an instance of the opportunity give-and-take (GAT) process.
Definition [GAT]: An opportunity give-and-take process is a process of interactions among an agent, a system and a set of opportunities; at each interaction:
the agent is given a set of opportunities by the system;
the agent makes a decision by taking one of the offered opportunities.
Each opportunity could potentially yield a revenue (utility) if taken, or 0 otherwise. Note that we assume the agent is a priori not aware of all the items, and only through the recommender can she get to know the items; therefore items that are not in the offer set are inaccessible to the agent at that interaction. This is reasonable considering that the number of items is usually very large. Moreover, we assume an agent is a rational decision maker: she knows that her choice of one item will be at the expense of the others, therefore she compares the alternatives before making her choice. In other words, for each decision, the agent considers both revenue and opportunity cost, and decides which opportunity to take based on the potential profit of each opportunity in the offer set. Specifically, the opportunity cost is the potential loss the agent incurs by taking an opportunity that precludes her from taking the other opportunities; the profit is the net gain of a decision. Drawing on rational decision theory [17], we present the following principle of individual choice behavior.
Proposition: A rational decision is a decision that maximizes the profit among the opportunities in the offer set.
This proposition implies the constraint of “local optimality of user choice”, a local competitive effect requiring that the agent always chooses the offer that is locally optimal in the context of the offer set.
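The principle can be made concrete with a small sketch. It assumes that the opportunity cost of an item is the utility of the best alternative given up by taking it (a common reading of opportunity cost; the exact expression is not recoverable from the text here), and the utility values are hypothetical.

```python
def profits(offer_utilities):
    """Profit of each offered item: its utility minus its opportunity cost.

    Assumption: the opportunity cost of an item is the utility of the best
    alternative in the same offer set that is forgone by taking that item.
    """
    result = {}
    for item, value in offer_utilities.items():
        alternatives = [v for other, v in offer_utilities.items() if other != item]
        opportunity_cost = max(alternatives) if alternatives else 0.0
        result[item] = value - opportunity_cost
    return result


def rational_decision(offer_utilities):
    """Pick the offered item with the largest profit."""
    item_profits = profits(offer_utilities)
    return max(item_profits, key=item_profits.get)


# Hypothetical utilities of one user for the items of one offer set.
offer = {"Die Hard": 0.9, "Terminator": 0.7, "Rocky": 0.4}
choice = rational_decision(offer)
```

Under this assumption only the item with the highest utility has a positive profit, so maximizing profit coincides with picking the locally best item, which is exactly the local-optimality constraint.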
3.2 Collaborative competitive filtering
The local-optimality principle induces a constraint which can be translated into an objective function for recommender learning:
or  (1) 
This objective is, however, problematic. First, the inequality constraint restricts the utility function only up to an arbitrary order-preserving transformation (e.g. a monotonically increasing function), and hence cannot yield a unique solution (e.g. a point estimate) [18]. Second, optimization based on the induced objective is computationally intractable due to the max operator. To this end, we present two surrogate objectives, both of which are computationally efficient and show close connections to existing models.
3.2.1 Softmax model
Our first formulation is based on random utility theory [17, 18], which has been extensively used for modeling choice behavior in economics [19] and marketing science [11]. In particular, we assume the utility function consists of two components: (1) a deterministic function characterizing the intrinsic interest of a user in an item, which we quantify using the latent factor model; and (2) a stochastic error term reflecting the uncertainty and complexity of the choice process. (The error term essentially accounts for all the subtle, uncertain and unmeasurable factors that influence user choice behavior, for example a user’s mood, past experience, or other factors such as whether the decision is made in a hurry, together with her friends, or totally unconsciously.) Furthermore, we assume the error term is an independently and identically distributed Weibull (type I extreme value, also known as Gumbel) variable.
Together with the local-optimality principle, these two assumptions yield the following multinomial logit model [19, 18, 11]:
(2) 
Intuitively, this model enforces the local-optimality constraint by using the softmax function as a surrogate for max.
Given a collection of training interactions , the latent factors can be estimated using penalized maximum likelihood via
(3)  
While the above formulation is a convex optimization problem w.r.t. the utilities, since each of the log-likelihood terms in Eq. (3) is concave in them, it is nonconvex w.r.t. the latent factors. We postpone the discussion of optimization algorithms to §3.3.
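To illustrate the softmax surrogate, the sketch below computes the multinomial logit choice probabilities for one session and the corresponding negative log-likelihood term. It is a sketch under assumptions: utilities are passed in as plain numbers (in CCF they would come from the latent factor model), the function names are hypothetical, and the regularization penalty of the full objective is left out.

```python
import math


def choice_probabilities(utilities):
    """Softmax over the utilities of the items in one offer set."""
    top = max(utilities)                           # subtract max for stability
    exps = [math.exp(r - top) for r in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def session_nll(utilities, chosen_index):
    """Negative log-likelihood of the chosen item within its offer set."""
    top = max(utilities)
    log_partition = top + math.log(sum(math.exp(r - top) for r in utilities))
    return log_partition - utilities[chosen_index]


uniform = choice_probabilities([0.0, 0.0, 0.0, 0.0])
probs = choice_probabilities([2.0, 1.0, 0.0])
loss = session_nll([2.0, 1.0, 0.0], chosen_index=0)
```

Note that the loss of a session depends only on utility differences within the offer set: adding a constant to all utilities leaves it unchanged, which is how the context enters the learning problem.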
3.2.2 Hinge model
Our second formulation is based on a simple reduction of the local-optimality constraint. Note that, from Eq. (1), it follows that:
Here the reference point is the average potential utility that the user could possibly gain from the non-chosen items. Intuitively, the above model encourages the utility difference between the chosen and the non-chosen items to be nontrivially greater than random errors. Based on this notion, we present the following formulation, which views the task as a pairwise preference learning problem [12] and uses the average of the non-choices as the negative preference.
(4)  
s.t.: 
This formulation is directly related to the maximum score estimation [18] of the multinomial logit model in Eq. (2). Intuitively, it directly reflects the insight that user decisions are usually made by comparing alternatives and considering the difference of potential utilities. In other words, it learns latent factors by maximizing the margin between the utility of the user’s choice and the average utility of the non-choices.
Again, the optimization is convex w.r.t. the utilities, but nonconvex w.r.t. the latent factors; therefore standard optimization tools such as the large variety of RankSVM [12] solvers are not directly applicable.
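The hinge formulation can be sketched per session as follows. This is an illustration under assumptions: the margin of 1 is a conventional choice rather than a value confirmed by the text, the slack variables of the constrained form are folded into a hinge loss, and the function name is hypothetical.

```python
def session_hinge_loss(utilities, chosen_index, margin=1.0):
    """Hinge loss comparing the chosen item with the average non-choice.

    The loss is zero once the chosen item's utility exceeds the average
    utility of the non-chosen items by at least `margin` (assumed to be 1).
    """
    others = [r for j, r in enumerate(utilities) if j != chosen_index]
    if not others:
        raise ValueError("the offer set needs at least one non-chosen item")
    average_other = sum(others) / len(others)
    return max(0.0, margin - (utilities[chosen_index] - average_other))


satisfied = session_hinge_loss([2.0, 0.5, 0.5], chosen_index=0)   # gap 1.5
violated = session_hinge_loss([1.0, 0.5, 0.5], chosen_index=0)    # gap 0.5
```

Because the non-choices enter only through their average, each session contributes a single pairwise comparison regardless of the offer size.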
3.2.3 Complexity
It is worth noting that our CCF formulations have an appealing complexity that is linear in the number of interactions and in the offer size, where the offer size is typically a very small number. For example, Netflix recommends only a few movies for each visit, and the Yahoo! front page highlights only a few hot news stories for each browser. Therefore, CCF has the same order of complexity as the rating-oriented CF models. Note that the ranking-oriented CF approaches [29, 16] are much more expensive: for each user, the learning complexity is quadratic in the number of items, as they learn the preference of each user by comparing every pair of items.
3.3 Learning algorithms
As we have already mentioned, due to the use of bilinear terms, both of the two CCF variants are nonconvex optimization problems regardless of the choice of the loss functions. While there are convex reformulations for some settings they tend to be computationally inefficient for large scale problems as they occur in industry — the convex formulations require the manipulation of a full matrix which is impractical for anything beyond thousands of users.
Moreover, the interactions between users and items change over time, and it is desirable to have algorithms which process this information incrementally. This calls for learning algorithms that are sufficiently efficient and preferably capable of updating dynamically so as to reflect upcoming data streams, thereby excluding offline learning algorithms such as classical SVD-based factorization algorithms [15] or spectral eigenvalue decomposition methods [16] that involve large-scale matrices.
We use a distributed stochastic gradient variant with averaging, based on the Hadoop MapReduce framework. The basic idea is to decompose the objectives in Eq. (3) or Eq. (4) by running stochastic optimization on sub-blocks of the interaction traces in parallel in the Map phase, and to combine the results in the Reduce phase. The basic structure is analogous to [6, 32].
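The combination step can be pictured with a plain-Python sketch of parameter averaging. It does not use the Hadoop API; the data layout (one dictionary of factor vectors per Map task) and the function name are assumptions for illustration.

```python
def average_factors(block_results):
    """Reduce step: average the factor vectors that several Map tasks
    learned for the same key (e.g. the same item)."""
    sums, counts = {}, {}
    for result in block_results:
        for key, vector in result.items():
            if key not in sums:
                sums[key] = [0.0] * len(vector)
                counts[key] = 0
            sums[key] = [s + v for s, v in zip(sums[key], vector)]
            counts[key] += 1
    return {key: [s / counts[key] for s in sums[key]] for key in sums}


# Hypothetical outputs of two Map tasks, each trained on its own sub-block.
mapper_outputs = [
    {"item_a": [1.0, 0.0], "item_b": [0.0, 2.0]},
    {"item_a": [3.0, 2.0]},
]
combined = average_factors(mapper_outputs)
```

A key that appears in only one sub-block is passed through unchanged, since it is averaged over a single vector.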
Stochastic Optimization.
We derive a stochastic gradient descent algorithm to solve the optimization described in Eq. (3) or Eq. (4). The algorithm is computationally efficient and decouples across different interactions and users, and is therefore amenable to parallel implementation.
The algorithm loops over all the observations and updates the parameters by moving in the direction defined by the negative gradient. Specifically, we can carry out the following update equations on each machine separately:
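For all items in the offer set of the interaction, take a gradient step on the item’s latent factor.
For each user, take a gradient step on the user’s latent factor. The step size is the learning rate, which we anneal by discounting it by a factor of 0.9 after each iteration, as suggested by [14]. The gradients are given by: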
(5)  
(6) 
where H denotes the Heaviside function, i.e. it equals 1 if its argument is positive and 0 otherwise. In our implementation, we approximate it by a continuous function, which helps with convergence.
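The following is a minimal single-machine sketch of such a stochastic gradient step for the softmax variant. It rests on assumptions that are not taken from the paper: an L2 penalty with weight `reg` on the factors touched by the session, plain Python lists as factor vectors, and toy initial values. Only the 0.9 annealing factor comes from the text.

```python
import math


def choice_probs(user_vec, item_vecs, offer):
    """Softmax choice probabilities over the offered items."""
    scores = [sum(p * q for p, q in zip(user_vec, item_vecs[i])) for i in offer]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(user_vec, item_vecs, offer, chosen, lr, reg):
    """One in-place stochastic gradient step on a single session."""
    probs = choice_probs(user_vec, item_vecs, offer)
    old_user = list(user_vec)
    user_grad = [reg * p for p in old_user]
    for item, prob in zip(offer, probs):
        # d(loss)/d(utility of item) = P(item) - 1[item is the chosen one]
        g = prob - (1.0 if item == chosen else 0.0)
        old_item = item_vecs[item]
        for d in range(len(old_user)):
            user_grad[d] += g * old_item[d]
        item_vecs[item] = [q - lr * (g * p + reg * q)
                           for p, q in zip(old_user, old_item)]
    for d in range(len(user_vec)):
        user_vec[d] -= lr * user_grad[d]


# Toy session (hypothetical values): the user picked "a" out of three offers.
user = [0.1, -0.1]
items = {"a": [0.1, 0.2], "b": [-0.2, 0.1], "c": [0.05, -0.05]}
offer, chosen = ["a", "b", "c"], "a"
before = choice_probs(user, items, offer)[0]
lr = 0.5
for epoch in range(20):
    sgd_step(user, items, offer, chosen, lr=lr, reg=0.01)
    lr *= 0.9  # annealing: discount the learning rate after each iteration
after = choice_probs(user, items, offer)[0]
```

Each step touches only the factors of the current user and of the items in the current offer set, which is what makes the updates easy to distribute across sub-blocks of the interaction traces.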
Feature Hashing.
A key challenge in learning CCF models on large-scale data is that the storage of parameters as well as observable features requires a large amount of memory and a reverse index to map user IDs to memory locations. In particular, in recommender systems with hundreds of millions of users the memory requirement would easily exceed what is available on today’s computers (100 million users with 100 latent feature dimensions each amounts to 40GB of RAM). We address this problem by implementing feature hashing [30] on the space of matrix elements. In particular, by allowing random collisions and applying a hash mapping to the latent factors, we keep the entire representation in memory, thus greatly accelerating optimization.
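The idea of hashing matrix elements into a fixed-size parameter array can be sketched as follows. This is a simplified illustration, not the implementation used in the paper: the class name, the use of `zlib.crc32` as the hash function and the key format are assumptions, and refinements such as a sign hash to reduce collision bias are omitted. The memory arithmetic assumes 4-byte floats, which matches the 40GB figure quoted above.

```python
import zlib


class HashedFactors:
    """Latent factors stored in one fixed-size array, addressed by hashing."""

    def __init__(self, num_slots, dim):
        self.dim = dim
        self.slots = [0.0] * num_slots

    def _index(self, entity_id, d):
        # Hash the (entity, dimension) pair straight to a slot; two pairs may
        # collide and then share a parameter. No reverse index is kept.
        key = "{}:{}".format(entity_id, d).encode("utf-8")
        return zlib.crc32(key) % len(self.slots)

    def get(self, entity_id):
        return [self.slots[self._index(entity_id, d)] for d in range(self.dim)]

    def add(self, entity_id, delta):
        for d, value in enumerate(delta):
            self.slots[self._index(entity_id, d)] += value


# Memory of a dense table: 100 million users x 100 dimensions x 4-byte floats.
dense_bytes = 100_000_000 * 100 * 4   # 40 * 10**9 bytes, i.e. about 40GB

factors = HashedFactors(num_slots=1 << 16, dim=4)
factors.add("user:42", [1.0, 1.0, 1.0, 1.0])
```

The memory footprint is fixed by `num_slots` and no longer grows with the number of users; the price is that colliding elements share a parameter.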
3.4 Extensions
We now discuss two extensions of CCF to address the fact that in some cases users choose not to respond to an offer at all and that moreover we may have observed features in addition to the latent representation discussed so far.
Sessions without response
In establishing the CCF framework for modeling user choice behavior, we assumed that for each user-system interaction, the decision set contains at least one item. This assumption is, however, not true in practice. A user’s visit to a recommender system does not always yield an action. For example, users frequently visit online e-commerce websites without making any purchase, or browse a news portal without clicking on an ad. In fact, such non-responded visits may account for the vast majority of the traffic that a recommender system receives. Moreover, different users may have different propensities for taking an action. Here, we extend the multinomial logit model to model both responded and non-responded interactions.
This is accomplished by adding a scalar for each user to capture the action threshold of that user. We assume that, at an interaction, a user takes an effective action only if she feels that the overall quality of the offers is good enough to be worth her attention. In keeping with the logistic model this means that
(7) 
for all items in the offer set, and the probability of no response is given by the remainder. In essence, this amounts to a model where the ‘non-response’ has a certain reserve utility that needs to be exceeded for a user to respond. We may extend the hinge model in the same spirit (we use a trade-off constant to calibrate the importance of the non-responses).
(8) 
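One way to picture the extended logit model is to treat the user's action threshold as the reserve utility of an extra "no response" option that competes with the offered items in the same softmax. The sketch below is written under that assumption; the function name and the numbers are hypothetical.

```python
import math


def response_probabilities(utilities, threshold):
    """Choice probabilities when 'no response' competes with the offers.

    Assumption: the user's threshold acts as the utility of a virtual
    'no response' option inside the same softmax.
    """
    top = max(max(utilities), threshold)
    item_terms = [math.exp(r - top) for r in utilities]
    none_term = math.exp(threshold - top)
    total = none_term + sum(item_terms)
    return [t / total for t in item_terms], none_term / total


offers = [1.0, 0.5, -0.2, 0.0]
item_probs, p_none = response_probabilities(offers, threshold=1.5)
_, p_none_picky = response_probabilities(offers, threshold=3.0)
```

The item probabilities and the no-response probability sum to one, so the probability of no response is indeed the remainder, and a user with a higher threshold responds less often to the same offers.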
Data set | #users | #items | #dyads | Offer size
Social | 1.2M | 400 | 29M | -
Netflix 5-star | 0.48M | 18K | 100M | -
News | 3.6M | 2.5K | 110M | 4
Content features
In previous sections, we use a plain latent factor model for quantifying utility. A known drawback [1] of such a model is that it only captures dyadic data (responses), and therefore generalizes poorly to completely new entities, i.e. unseen users or items for which no observations are available at the training stage. Here, we extend the model by incorporating content features. In particular, we assume that, in addition to the latent features, there exist some observable properties for each user (e.g. a user’s self-crafted registration files) and for each item (e.g. a textual description of an item). We then assume the utility is a function of both types of features (i.e. observable and latent):
where a matrix provides a bilinear form for characterizing the utility based on the content features of the corresponding dyads. This model integrates both collaborative filtering [15] and content filtering [7]. On the one hand, if the user or item has no or merely non-informative observable features, the model degrades to a factorization-style utility model [24]. On the other hand, if we assume that the latent factors are irrelevant, for instance if the user or the item is totally new to the system such that there is no interaction involving either of them, as in a cold-start setting, this model becomes the classical content-based relevance model commonly used in, e.g., webpage ranking [31], advertisement targeting [6], and content recommendation [7].
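A small sketch may help to see the two limiting cases. It assumes that the latent part and the bilinear content part are simply added (the combination rule is not recoverable from the text here), and all names and numbers are hypothetical.

```python
def hybrid_utility(user_latent, item_latent, user_features, item_features, weights):
    """Utility = latent inner product + bilinear form on content features.

    Assumption: the two parts are combined additively.
    """
    latent_part = sum(p * q for p, q in zip(user_latent, item_latent))
    content_part = sum(
        x * weights[a][b] * z
        for a, x in enumerate(user_features)
        for b, z in enumerate(item_features)
    )
    return latent_part + content_part


weights = [[0.5, 0.0], [0.0, -1.0]]   # hypothetical 2x2 bilinear weight matrix

# Both kinds of information available.
full = hybrid_utility([1.0, 2.0], [0.5, 0.5], [1.0, 0.0], [2.0, 1.0], weights)
# Cold start: the new user's latent factor carries no information.
cold_start = hybrid_utility([0.0, 0.0], [0.5, 0.5], [1.0, 0.0], [2.0, 1.0], weights)
# No informative user features: only the factorization part remains.
no_features = hybrid_utility([1.0, 2.0], [0.5, 0.5], [0.0, 0.0], [2.0, 1.0], weights)
```

With zero latent factors the score reduces to the content-based relevance term, and with zero features it reduces to the plain factorization model, mirroring the two cases discussed in the text.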
4 Experiments
We report experimental results on two testbeds. First, we compare the CCF models with CF baselines on two dyadic data sets with simulated choice contexts. We chose simulated data generated from CF data sets because we are unaware of any publicly available data sets directly suitable for CCF. Furthermore, we extend our evaluation to a stricter setting based on user-system interaction session data from a commercial recommender system.
4.1 Dyadic response data
We use dyadic data with binary responses. We compare different recommender models in terms of their top-k ranking performance.
Social network data.
The first data set we used was collected from a commercial social network site, where a user expresses her preference for an item with an explicit indication of “like”. We examine data collected for about one year, involving hundreds of millions of users and a large collection of applications, such as games, sports, news feeds, finance, entertainment, travel, shopping, and local information services. Our evaluation focuses on a random subset consisting of about 400 items, 1.2 million users and 29 million dyadic responses (“like” indications).
Netflix 5 star data.
For the sake of reproducibility of our results, we also report results on a data set derived from the Netflix prize data (http://www.netflixprize.com), one of the most famous public data sets for recommendation. The Netflix data set contains 480K users and 18K movies. We derive binary responses by considering only 5-star ratings as “positive” dyads and treating all the others as missing entries.
For both data sets, we randomly split the data into three pieces, one for training, one for testing and the other for validation.
Evaluation metrics.
We assess the recommendation performance of each model by comparing the top-k suggestions of the model to the true actions taken by a user (i.e. “like” or 5-star). We consider three measures commonly used for assessing top-k ranking performance in the IR community:
AP@k is the average precision. It averages the precision of the top-k ranked list of each query (e.g. user).
AR@k, or average recall, is the average recall of the top-k ranked list of each query.
nDCG@k, or normalized Discounted Cumulative Gain, is the normalized position-discounted precision score. It gives larger credit to top positions.
For all three metrics we use k = 5, since most social networks and movie recommendation sites recommend a similar number of items for each user visit.
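The three measures can be sketched for a single query (user) as follows; the reported scores would then be averages over all users. The exact definitions used in the experiments are not spelled out in the text, so the sketch assumes binary relevance, precision and recall computed on the top-k list, and the usual logarithmic position discount for nDCG.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k suggestions that the user actually liked."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the liked items that appear in the top-k suggestions."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Position-discounted gain, normalized by the best achievable value."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical example: five suggestions, three items the user really liked.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x"}
p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
n5 = ndcg_at_k(ranked, relevant, 5)
```

In this example two of the five suggestions are hits, and the hit at the first position contributes more to nDCG than the hit at the third.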
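Data set | Model | Variant | AP@5 | AR@5 | nDCG@5
Social | CF | | 0.448 | 0.230 | 0.475
Social | CF | Logistic | 0.449 | 0.230 | 0.476
Social | CCF | Softmax | 0.688 | 0.261 | 0.704
Social | CCF | Hinge | 0.686 | 0.260 | 0.702
Netflix 5-star | CF | | 0.135 | 0.022 | 0.145
Netflix 5-star | CF | Logistic | 0.135 | 0.023 | 0.146
Netflix 5-star | CCF | Softmax | 0.186 | 0.033 | 0.189
Netflix 5-star | CCF | Hinge | 0.185 | 0.032 | 0.188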
Evaluation protocol.
We compare the two CCF models (i.e. Softmax and Hinge) with the two standard CF factorization models described in §2.2. For dyadic data with binary responses, the Logistic CF model represents the state of the art [22, 1].
We adopt a fairly strict top-k ranking evaluation. For each user, we assess the top results out of a total preference ordering of the whole item set. In particular, for each user, we consider all the items as candidates; we compute the three measures based on the comparison between the ground truth (the set of items in the test set that the user actually liked) and the top-5 suggestions predicted by each model. For statistical consistency, we employ a cross-validation style procedure. We learn the models on training data with parameters tuned on validation data, and then apply the trained models to the test data to assess the performance. All three measures reported are computed on test data only, and they are averaged over five random repeats (i.e. random splits of the data). (Note that the contextual information, i.e. the offer set of each interaction, is missing for both of the dyadic data sets. We choose these data sets anyway to ensure that the results, at least on the Netflix data set, can be repeated by other research groups. Results on interaction data are reported in §4.2.)
To render the data compatible with CCF, we simulate a fixed-size pseudo-offer set for each interaction. Specifically, for every positive observation, we randomly sample a handful of missing (unobserved) entries of the same user. These sampled dyads are then treated as non-choices, and together with the positive dyad, they are used as the offer set for the current session. In our experiments, we choose 9 pseudo non-choices; in other words, we assume an offer size of 10.
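The simulation of pseudo-offer sets can be sketched as below. The function and variable names are hypothetical, sampling is uniform without replacement (an assumption; the text only says the entries are sampled randomly), and the toy item universe is made up.

```python
import random


def simulate_offer_set(chosen_item, observed_items, all_items, num_non_choices, rng):
    """Build a pseudo-offer set: the positive item plus sampled non-choices.

    Non-choices are drawn uniformly, without replacement, from the items for
    which this user has no observed response.
    """
    candidates = [i for i in all_items
                  if i != chosen_item and i not in observed_items]
    non_choices = rng.sample(candidates, num_non_choices)
    return [chosen_item] + non_choices


rng = random.Random(0)             # fixed seed so the sketch is repeatable
all_items = list(range(100))       # hypothetical item universe
observed = {3, 17, 42}             # items this user responded to positively
offer_set = simulate_offer_set(17, observed, all_items,
                               num_non_choices=9, rng=rng)
```

With nine sampled non-choices plus the positive dyad, every simulated session has an offer size of 10.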
Results and analysis.
We report the mean scores in Table 3. Since the data sets are fairly large, the standard deviations of all values were small enough that we omit them from the results. As can be seen from the table, CCF dramatically outperforms the CF baselines on both data sets. In terms of AP@5, the two CCF models gain about 52.8%–53.6% over the two CF models on the Social data, and 37.0%–37.8% on the Netflix-5 star data. Similar comparisons apply to the nDCG@5 measure. In terms of AR@5, the CCF models outperform the CF competitors by up to 13.5% on Social and 30% on Netflix-5 star. All these improvements are statistically highly significant. Note that these results are quite consistent: the two CF models perform comparably with each other on both data sets, as do the two CCF variants, while there is a noticeable gap between the two groups.
One argument we made in motivating this work is that, since CF models disregard the context information and learn only from positive (action) dyads, they almost inevitably yield overly optimistic predictions (i.e. predicting positive for all possible dyads). We hypothesize that this estimation bias is one of the key reasons why CF models struggle to learn from binary dyadic data. As an empirical validation, in Figure 1 we plot the histograms of the predicted dyadic responses (i.e. entries of the diffused matrices) obtained by a CF model and a CCF model, respectively (see footnote 7). As we can see, the CF model indeed predicts “positive” for most (if not all) dyads; in contrast, the results obtained by the CCF model follow a more realistic power-law distribution [8] (see footnote 8).
Footnote 7: Similar results are obtained with other losses.
Footnote 8: Note that the distribution starts at around 0.5 instead of 0, which is consistent with our intuition that there is actually no truly “irrelevant” item for a user: any item has potential utility, and users choose one item over another based on relative preference rather than absolute utility. This is especially true in this era of information explosion, where a user typically faces so many alternatives that she can only pick the one she likes most while ignoring the others.
In reality, each user can only afford to “like” a few items out of a huge number of alternatives. This power-law property is crucial for information filtering because the goal is to identify a few truly relevant items by filtering out many irrelevant ones. A power-law recommender is desirable in a way analogous to a filter with narrow bandwidth, which effectively filters out the noise (i.e. irrelevant items) and lets only the true signal (i.e. relevant items) pass to the end node (i.e. users).
4.2 User-system interaction data
We now move on to a more realistic evaluation by applying CCF to real user-system interaction data. We evaluate CCF in both an offline test and an online test while comparing its results to both CF baselines.
Data.
We collected a large-scale set of user-system interaction traces from a commercial article (News feeds) recommender system. In each interaction, the system offers four personalized articles to the visiting user, and the user chooses one of them by clicking to read that article. The recommendations change dynamically over time, even during a user’s visit. The system regularly logs every click event of every user visit. It also records the articles being presented to users at a series of discrete time points. To obtain a context set for each user-system interaction, we therefore trace back to the closest recording time point right before the user click, and use the articles presented at that time point as the offer set for the current session. We collected such interaction traces from logged records of over one month. We use a random subset containing 3.6 million users, 2500 items and over 110 million interaction traces. Learning an effective recommender on this data set is particularly challenging because the article pool is refreshed dynamically and each article has a lifetime of only several hours: it appears only once within a particular day and is then pulled from the pool, never to appear again.
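The trace-back step described above can be sketched as follows. This is a minimal illustration rather than the actual logging pipeline: the function name `offer_set_for_click`, the parallel lists `snapshot_times` and `snapshot_offers`, and the integer timestamps are all assumptions introduced here, and “right before” is assumed to mean strictly earlier than the click.

```python
from bisect import bisect_left


def offer_set_for_click(snapshot_times, snapshot_offers, click_time):
    """Return the articles shown at the latest snapshot strictly before click_time.

    snapshot_times  -- recording time points, sorted ascending (hypothetical layout)
    snapshot_offers -- snapshot_offers[i] lists the articles presented at snapshot_times[i]
    Returns None when no snapshot precedes the click.
    """
    # bisect_left finds the first snapshot at or after the click, so the
    # entry just before it is the closest earlier recording time point.
    idx = bisect_left(snapshot_times, click_time) - 1
    if idx < 0:
        return None
    return snapshot_offers[idx]


# Toy log (made-up data): three snapshots with four offered articles each.
times = [10, 20, 30]
offers = [["a", "b", "c", "d"], ["b", "c", "d", "e"], ["c", "d", "e", "f"]]
session_offer_set = offer_set_for_click(times, offers, click_time=25)
```

In this toy log, a click at time 25 is paired with the snapshot recorded at time 20. A real pipeline would presumably also need to check that the clicked article belongs to the recovered offer set; the text does not say how such mismatches are handled.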
Evaluation protocol.
We consider the following two evaluation settings, one offline and the other online.
Similar to the evaluations presented in §4.1, we evaluate the learned recommender models in terms of the top-ranking performance on a hold-out test subset. We follow the same configurations as in §4.1 and use the three ranking measures, i.e. AP@n, AR@n and nDCG@n, as the evaluation metrics. Note that here we use n = 4 instead of 5, because four is the default recommendation size used in the news recommender system.
We further conduct an online test. In particular, for each incoming interaction, we use the trained models to predict which item among the four recommendations will be taken by the user. This prediction is of crucial importance because one of the key objectives of a recommender system is to maximize traffic and monetary revenue by lifting the click-through rate.
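For reference, the three ranking measures can be sketched per user as below. The exact definitions are those of §4.1, which is not reproduced here, so this sketch falls back on common textbook forms as an assumption: precision and recall within the top n, and nDCG with binary gains and a log2 discount. The function names are illustrative; AP@n, AR@n and the reported nDCG@n would then be averages of these values over users.

```python
import math


def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are relevant."""
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / n


def recall_at_n(ranked, relevant, n):
    """Fraction of the relevant items that appear in the top n."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / len(relevant)


def ndcg_at_n(ranked, relevant, n):
    """Binary-gain nDCG: DCG of the ranking divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, item in enumerate(ranked[:n])
        if item in relevant
    )
    ideal_hits = min(n, len(relevant))
    ideal_dcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up example: one user, five ranked items, two of them relevant.
ranked_items = ["a", "b", "c", "d", "e"]
relevant_items = {"b", "e"}
p4 = precision_at_n(ranked_items, relevant_items, 4)
r4 = recall_at_n(ranked_items, relevant_items, 4)
g4 = ndcg_at_n(ranked_items, relevant_items, 4)
```

With n = 4, only item "b" falls inside the cut-off, so one of four recommended items and one of two relevant items is counted.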
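Table 4: Offline top-ranking performance of models trained on 30%, 50% and 70% of the training data.

| Training data | Model | Loss | AP@4 | AR@4 | nDCG@4 |
|---|---|---|---|---|---|
| 30% | CF | | 0.245 | 0.261 | 0.255 |
| 30% | CF | Logistic | 0.246 | 0.263 | 0.257 |
| 30% | CCF | Softmax | 0.262 | 0.278 | 0.274 |
| 30% | CCF | Hinge | 0.261 | 0.278 | 0.273 |
| 50% | CF | | 0.250 | 0.273 | 0.268 |
| 50% | CF | Logistic | 0.252 | 0.276 | 0.269 |
| 50% | CCF | Softmax | 0.266 | 0.285 | 0.278 |
| 50% | CCF | Hinge | 0.265 | 0.285 | 0.277 |
| 70% | CF | | 0.253 | 0.275 | 0.271 |
| 70% | CF | Logistic | 0.253 | 0.276 | 0.274 |
| 70% | CCF | Softmax | 0.267 | 0.287 | 0.280 |
| 70% | CCF | Hinge | 0.267 | 0.286 | 0.280 |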
Offline test results.
In this setting, we train each model on progressively larger proportions (30%, 50% and 70%) of randomly sampled training data, and evaluate each trained model in terms of its offline top-ranking performance. The results are reported in Table 4. The two CCF models consistently outperform the two CF baselines on all three evaluation metrics. Specifically, the CCF models gain up to 6.9% over the two CF models in terms of average precision, up to 6.5% in terms of average recall, and up to 7.5% in terms of nDCG. We also conducted a hypothesis test at a standard significance level; it indicates that all the improvements obtained by CCF are significant.
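As a quick arithmetic check, the “up to” figures quoted above can be reproduced from the 30% training block of Table 4. Two assumptions are made here: that the figures compare CCF with the Softmax loss against the first CF row, and that improvement means the relative gain (new - baseline) / baseline. The helper name `relative_improvement` is likewise illustrative.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Values copied from the 30% training block of Table 4: (AP@4, AR@4, nDCG@4).
cf_first_row = (0.245, 0.261, 0.255)
ccf_softmax = (0.262, 0.278, 0.274)

# Percentage gains per metric, rounded to one decimal place.
gains_percent = [
    round(100 * relative_improvement(new, base), 1)
    for new, base in zip(ccf_softmax, cf_first_row)
]
```

Under these assumptions the three gains come out as 6.9%, 6.5% and 7.5%, matching the numbers in the text. The same computation on the 70% block gives smaller gains, in line with the observation that the advantage is largest when training data are sparse.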
It is worth noting that the improvements obtained by CCF compared to CF baselines are especially evident when the training data are sparser (e.g. using only 30% of training data). This observation empirically validates our argument that the contexts contain substantial useful information for learning recommender models especially when the dyadic action responses are scarce.
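Table 5: Online test results (accuracy of predicting the clicked item) for models trained on 30%, 50% and 70% of the training data.

| Model | Loss | 30% train | 50% train | 70% train |
|---|---|---|---|---|
| Random | | 0.250 | 0.250 | 0.250 |
| CF | | 0.337 | 0.343 | 0.347 |
| CF | Logistic | 0.341 | 0.345 | 0.347 |
| CCF | Softmax | 0.376 | 0.384 | 0.391 |
| CCF | Hinge | 0.377 | 0.385 | 0.391 |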
The offline results obtained by CCF are quite satisfactory. For example, the average precision is up to 0.267 (Table 4), which means that, out of the four recommended items, on average about 1.1 are truly “relevant” (i.e. actually clicked by the user). This performance is promising, especially considering that most of the articles in the content pool are transient and subject to dynamic updates.
Online test results.
We further evaluate the online performance of each compared model by assessing the predicted click rates. The click rate is essential for an online recommender system because it is closely related to both the traffic and the revenue of a webshop. In our evaluation, for each incoming visit, we use the trained models to predict the user’s choice, i.e. we ask the question: “among all the offered items, which one will most likely be clicked?” We use the trained model to rank the items in the offer set, and compare the top-ranked item with the item that was actually taken by the user. We evaluate the results in terms of the prediction accuracy.
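The prediction-accuracy computation described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the experiments: sessions are assumed to be (user, offer set, clicked item) triples, the scoring function is a plain inner product of made-up latent factors, and the names `choice_accuracy` and `dot_score` are hypothetical.

```python
def choice_accuracy(sessions, score):
    """Fraction of sessions whose top-scored offered item is the clicked one.

    sessions -- iterable of (user, offer_set, clicked_item) triples (assumed layout)
    score    -- callable score(user, item) -> float, e.g. a trained model's utility
    """
    correct = 0
    total = 0
    for user, offer_set, clicked in sessions:
        # Rank only within the offer set and keep the top-ranked item.
        predicted = max(offer_set, key=lambda item: score(user, item))
        correct += int(predicted == clicked)
        total += 1
    return correct / total if total else 0.0


# Toy latent factors (made-up numbers) standing in for a trained factor model.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5], "d": [0.1, 0.3]}


def dot_score(user, item):
    """Inner product of the user's and the item's latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


toy_sessions = [
    ("u1", ["a", "b", "c", "d"], "a"),
    ("u2", ["a", "b", "c", "d"], "c"),
    ("u1", ["b", "c", "d", "a"], "a"),
    ("u2", ["a", "b", "c", "d"], "b"),
]
accuracy = choice_accuracy(toy_sessions, dot_score)
```

Because ranking is restricted to the four offered items, a predictor that guesses uniformly at random would be right a quarter of the time, which is the natural reference point for this measure.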
The results are given in Table 5. Because the size of each offer set in the current data set is 4, a random predictor yields an accuracy of 0.25. As seen from the table, while all four models obtain significantly better predictions than the random predictor, the two CCF models further outperform the two CF models by a clear margin. Specifically, we observe 11.3%–12.7% improvements obtained by the CCF models compared to the two CF competitors. These results are notable, especially considering the dynamic nature of the system.
Impact of parameters.
The performance of the two CCF models is affected by the settings of the latent dimensionality and of the two regularization weights. In Figure 2 (see footnote 9), we illustrate how the offline top-ranking performance changes as a function of these parameters, where we use the same value for both regularization weights. Here we report only the results for the nDCG@5 measure because the curves have similar shapes when other measures (including the click rate) are used. As can be seen from the figure, the nDCG curves typically have an inverted U-shape, with the optimal values achieved in the middle. In particular, for both CCF models, a dimensionality around 10 and a regularization weight around 0.0001 yield the best performance; this is also the default parameter setting used to obtain our reported results.
Footnote 9: Due to the heavy computational cost, these results are obtained on a relatively small subset of the data.
Non-responded sessions.
In Section 3.4 we presented two models for encoding non-responded interactions, e.g. a user visits the News website but does not click any of the recommended articles. These approaches are promising because, compared to the responded sessions, the non-responded ones are typically much more plentiful; if learned successfully, this wealth of information has the potential to alleviate the critical data-sparsity issue in recommendation.
Unfortunately, due to the data-logging mechanism of the News recommender system, we were unable to obtain such non-responded interactions (this is subject to future work). Instead, for a preliminary test, we conducted an evaluation on a small set of pseudo non-responded sessions derived from the responded ones. In particular, we hold out a randomly sampled subset of sessions; for each of these sessions, we hide the item clicked by the user and use the remaining items as a non-responded context set, assuming no click for this set. We added this set of derived non-responded sessions to the training set and trained the model on the combined training data. The results from this preliminary evaluation did not show a significant performance improvement. This is likely because the surrogate distribution of the derived sessions is not a valid stand-in for real non-responded sessions. A detailed analysis with more realistic data is the subject of future research.
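The derivation of pseudo non-responded sessions can be sketched as below. This is a minimal illustration under assumptions: sessions are taken to be (user, offer set, clicked item) triples, a missing click is encoded as None, and the function name `make_pseudo_nonresponded`, the `fraction` parameter and the fixed seed are all hypothetical. Whether the sampled sessions also remain in the training data in their original, responded form is not stated in the text and is left open here.

```python
import random


def make_pseudo_nonresponded(sessions, fraction, seed=0):
    """Derive pseudo non-responded sessions from a random subset of responded ones.

    For each sampled session the clicked item is hidden, and the remaining
    offered items form a context set with no click (clicked item set to None).
    """
    rng = random.Random(seed)
    sample_size = int(len(sessions) * fraction)
    sampled_indices = rng.sample(range(len(sessions)), sample_size)
    derived = []
    for index in sampled_indices:
        user, offer_set, clicked = sessions[index]
        remaining = [item for item in offer_set if item != clicked]
        derived.append((user, remaining, None))
    return derived


# Made-up responded sessions: four offered items and one click each.
responded = [
    ("u1", ["a", "b", "c", "d"], "a"),
    ("u2", ["b", "c", "d", "e"], "d"),
    ("u3", ["a", "c", "e", "f"], "f"),
    ("u4", ["a", "b", "e", "f"], "b"),
]
pseudo = make_pseudo_nonresponded(responded, fraction=0.5)

# Combined training data, as in the preliminary test described above.
combined_training = responded + pseudo
```

Each derived session keeps three of the four offered items, which illustrates one way the surrogate can differ from genuine non-responded sessions: a real visit without a click would still have shown four items.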
5 Related Work
Although a natural reflection of a user’s preference is the process of interaction with the recommender, to our knowledge, this interaction data has not been exploited for learning recommender models. Instead, research on recommender systems has focused almost exclusively on learning from dyadic data. In particular, collaborative filtering approaches capture only the user-item dyads with explicit user actions, while the context dyads are typically treated as missing values. For example, rating-oriented models aim to approximate the ratings that users assign to items [25, 20, 24, 1, 6, 15], whereas the recently proposed ranking-oriented algorithms [29, 16] attempt to recover the ordinal ranking information derived from the ratings.
By exploiting past records of user-item dyadic responses for future prediction, using either neighborhood-based [25, 20, 16] or latent-factor-based methods [24, 1, 6, 15, 29], collaborative filtering approaches encode the collaboration effect that similar users have similar preferences for similar items. In this paper, by leveraging user-recommender interaction data, we show that much better recommender performance can be obtained when a local-competition effect underlying user choice behavior is also encoded.
The multinomial logit model we present is derived from random utility theory [17, 18]. The model is well established and has long been widely used in, e.g., psychology [17], economics [19, 18] and marketing science [11]. In particular, [11] applied the model to examine the brand choice of households on grocery data; [10] showed that this model is theoretically and empirically superior to the regression model. More recently, the pioneering work of [9] first applied the model to characterize online choices in recommender systems and investigated how recommender systems impact sales diversity. Following this line of work, we further employ the model to learn factorization models for recommendation.
The Hinge formulation of CCF is closely connected to the pairwise preference learning approaches widely used in Web search ranking [12]. Our model, however, differs from these content filtering models in that, instead of learning a feature mapping as in [12], it uses the formulation to learn a multiplicative latent factor model.
6 Summary and future research
We presented a framework for learning recommenders by modeling user choice behavior in the user-system interaction process. Instead of modeling only the sparse binary events of user actions as in traditional collaborative filtering, the proposed collaborative-competitive filtering models take into account the contexts in which user decisions are made. We presented two models in this spirit, established efficient learning algorithms and demonstrated the effectiveness of the proposed approaches with extensive experiments on three large-scale real-world recommendation data sets.
There are several promising directions for future research.
Attention budget and position bias.
When deriving the CCF model, we assumed that a user decides whether to take an offer solely based on a comparison of utilities. This assumption, however, neglects a factor that might be important in practice. In particular, a user might have a limited attention budget, such that when making choices he pays attention only to a few top-ranked items and totally disregards the others. This position bias is evident in both web search ranking and recommendation. We plan to take it into consideration when building choice models.
Recommender strategy and user behavior.
A key feature of the current paper is that we assume the recommender adopts a deterministic strategic policy when making recommendations. In practice, a recommender could also adaptively react to the users’ actions as well as to its own considerations (e.g. inventory constraints, promotion requirements of certain brands). We would like to extend our analysis to model the interactive process between users and the recommender.
Further empirical validation.
Due to data collection constraints, some parts of the proposed models are not rigorously evaluated in the current paper. We plan to refine the data collection mechanism and conduct further experiments.
References
 [1] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 19–28, 2009.
 [2] E. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems 20, pages 33–40, 2008.
 [3] Y. Bakos and E. Brynjolfsson. Bundling and competition on the internet. Marketing Science, 19(1):63–82, 2000.
 [4] E. Brynjolfsson, Y. J. Hu, and M. D. Smith. Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers. Management Science, 49(11):1580–1596, 2003.
 [5] D. E. Byrne. The attraction paradigm. Academic Press, 1971.
 [6] Y. Chen, D. Pavlov, and J. F. Canny. Large-scale behavioral targeting. In 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 209–218, 2009.
 [7] W. Chu and S.-T. Park. Personalized recommendation on dynamic content using predictive bilinear models. In WWW’09: Proceedings of the 18th international conference on World wide web, pages 691–700, 2009.
 [8] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM’99: Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, pages 251–262, 1999.
 [9] D. M. Fleder and K. Hosanagar. Recommender systems and their impact on sales diversity. In EC’07: Proceedings of the 8th ACM conference on Electronic commerce, pages 192–199, 2007.
 [10] D. H. Gensch and W. W. Recker. The multinomial, multiattribute logit choice model. Journal of Marketing Research, 16(1):124–132, 1979.
 [11] P. M. Guadagni and J. D. Little. A logit model of brand choice calibrated on scanner data. Marketing Science, 2(3):203–238, 1983.

 [12] R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks, pages 97–102, 1999.
 [13] O. Kallenberg. Probabilistic symmetries and invariance principles. Springer, 2005.
 [14] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD’08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434, 2008.
 [15] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
 [16] N. N. Liu and Q. Yang. EigenRank: a ranking-oriented approach to collaborative filtering. In SIGIR’08: Proceedings of the 31st ACM SIGIR conference on Research and development in information retrieval, pages 83–90, 2008.
 [17] R. D. Luce. Individual choice behavior. Wiley, 1959.
 [18] C. F. Manski. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics, (3):205–228, August 1975.
 [19] D. McFadden. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, Academic Press, 1974.
 [20] M. R. McLaughlin and J. L. Herlocker. A collaborative filtering algorithm and evaluation metric that accurately model the user experience. In SIGIR’04: Proceedings of the 27th ACM SIGIR conference on Research and development in information retrieval, pages 329–336, 2004.
 [21] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
 [22] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In NIPS’09: Advances in Neural Information Processing Systems 22, pages 1276–1284, 2009.
 [23] B. Murthi and S. Sarkar. The Role of the Management Sciences in Research on Personalization. Management Science, 49(10):1344–1362, October 2003.

 [24] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML’08: Proceedings of the 25th international conference on Machine learning, pages 880–887, 2008.
 [25] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW’01: Proceedings of the 10th international conference on World Wide Web, pages 285–295, 2001.
 [26] A. Singh and G. Gordon. A unified view of matrix factorization models. In W. Daelemans, B. Goethals, and K. Morik, editors, ECML’08: European Conference on Machine Learning, pages 358–373. Springer, 2008.
 [27] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. Intl. Conf. Machine Learning, 2005.
 [28] T. F. Tan and S. Netessine. Is tom cruise threatened? using netflix prize data to examine the long tail of electronic commerce. Working paper 1361, Wharton school, University of Pennsylvania, 2010.
 [29] M. Weimer, A. Karatzoglou, Q. Le, and A. Smola. Cofi rank - maximum margin matrix factorization for collaborative ranking. In NIPS’07: Advances in Neural Information Processing Systems 20, pages 1593–1600, 2007.
 [30] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In ICML’09: Proceedings of the 26th International Conference on Machine Learning, pages 1113–1120, 2009. ACM.
 [31] Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, and G. Sun. A General Boosting Method and its Application to Learning Ranking Functions for Web Search. In NIPS’08: Advances in Neural Information Processing Systems 20.
 [32] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, 2010.