Recommender systems are part of many online businesses such as Amazon, e-Bay, Netflix, and others. They are tools to learn the interests of customers in order to make customer-specific recommendations of other products. These systems provide successful and valuable marketing strategies, especially due to the expansion of the world wide web and e-commerce. In 2006, Netflix organized the Netflix-Prize, awarding one million dollars for the best algorithm. The prize was won in 2009 by a team of researchers called Bellkor’s Pragmatic Chaos (AT&T Labs) after over three years of competition. The problem has attracted attention in the statistical community, with statisticians working on similar problems, see for instance feuerverger12, FV86, Lebanon12, Zhu14, and the references therein.
We consider a version of recommender systems where an expert opinion ranking is available and is used, together with a partial ranking by a costumer, in order to predict that customer’s complete ranking. Essentially, we want to predict an individual’s ranking of a set of different objects, given an expert opinion ranking of the same objects. More precisely, a set of objects indexed by is to be evaluated and ranked by both, an expert and an individual. The expert ranks all the objects, while the individual ranks only the subset of objects corresponding to the indices in the set . This can happen for instance if the individual does not have knowledge yet of the objects with indices in .
Assume ties are impossible. Let be the permutation group on the set . The experiment provides a complete expert’s ranking as well as an incomplete user’s ranking , where is the user’s rank of object among the objects . The choice for the subscripts and is clarified by the following. We think of the ranks as being induced by ratings or grades measured on a continuous scale: if denote the expert’s grades and if denote the individual’s grades, then and .
If the user had been able to grade all objects, the user’s grades would have been , with corresponding ranking . In view of this, the model is constructed by assuming an underlying set of latent pairs of grades of all objects attributed by both the expert and the individual. Concretely, we let ,
, be realizations of independent random vectors,
, each of which is distributed according to the same bivariate distribution with continuous margins. The continuity assumption makes the marginal distributions of the grades irrelevant to the rankings. We assume that the copula of the joint distribution belongs to some parametric family indexed by a parameterfor which we select a prior. Let and denote the random expert and individual rankings, respectively, of the objects in , and let denote the random user ranking of the objects in . Note that and are random elements in , while is a random element in . The prediction of the user’s complete ranking of all objects is based on the mode of the posterior predictive distribution:
This predicted ranking is then used for instance to recommend new products to the customer.
One difficulty here is the evaluation of the joint probability mass function of the pair of rankings: given a parameter value , we need to compute
In most cases, this probability is not analytically tractable. In the literature, it has been referred to as the rank likelihood; see for instance Hoff07, Hoff14 and Segers14
. The continuity assumption on the marginal cumulative distribution functions makes the margins irrelevant to the evaluation of the probability in (2). The assumption of uniform margins implies that the probability (2) can be evaluated by means of an integral over . Within a Bayesian approach, it also means that we need only put a prior on the copula parameter.
An objective of this work is to find a family of copulas for which a closed-form expression of the posterior predictive distribution in (2) is available. We will show that the Farlie–Gumbel–Morgenstern (fgm) family (Nelsen06, p. 77) satisfies this requirement.
Since the range of dependence that can be modelled by the fgm family is rather restricted, it is natural to ask how to proceed for other parametric copula families, when no explicit formulas for (2) exist. We develop a stochastic algorithm to compute (2) and we assess its accuracy by comparing its output to the results obtained from the exact formulas available for the fgm family.
Another problem for computing the prediction in (1) is that the cardinality of the set of rankings that are compatible with the observed ranking is equal to . This number will usually be so high that it is infeasible to find the maximum in (1) by computing the probabilities on the right-hand side on (1) for all possible . We will instead propose a solution based on an ergodic Monte Carlo Markov chain with the correct limiting distribution. The algorithm is applied to predict user rankings in the MovieLens 100k dataset with a Gaussian copula modelling dependence between expert and user grades.
2. Rank likelihood
Let be the permutation group of the set . A permutation is a bijection from to itself; notation . The group operation, denoted by , is the usual composition of functions, that is, for and in and . The group’s identity element is the identity map, . The inverse of a permutation is denoted by and satisfies .
Let be the set of vectors in having no ties. The rank vector or ranking associated to is defined by
We have for all . We also define if , ensuring that the map is well-defined. A simple but useful property is that the rank map behaves well under composition with permutations: for and , we have
Given two grading vectors , we want to investigate the alignment, or the lack thereof, of the associated rankings and . We would like to know the rank, under , of the object that was attributed rank under the grading . The original index of this object is equal to , and its rank under is equal to . This leads us to the study of the permutation
If , for instance, the gradings and are perfectly aligned and induce the same ranking of the objects.
Recall from the introduction that the random pairs , , represent the expert’s together with the individual’s gradings of objects. Assume that , , are independent and identically distributed (iid) random pairs with continuous margins. With probability one, there are no ties among the gradings. Consider the random rank vectors
As in (4), we want to express the ranking induced by the random grading vector in terms of the one induced by the random grading vector . This motivates the definition of the rank statistic
The joint distribution of is determined by the distribution of .
If , , are iid random pairs with continuous margins, then
The proof of Lemma 2.1 and of the other results in this paper are deferred to the Appendix. Let and be the joint and marginal cumulative distribution functions, respectively, of the random pairs , i.e.,
for . By assumption, and are continuous. Let and for
. The random variablesand
are uniformly distributed onand their joint cumulative distribution function is a copula,
Sklar’s Theorem says that admits the representation
With probability one, the rankings induced by the random vectors and are the same as the ones induced by the random vectors and , respectively. Since the pairs , , are iid with cumulative distribution function given by the copula , it follows that the joint distribution of the rank vectors is determined by . This is formalized by the next theorem. In view of Lemma 2.1, it suffices to study the distribution of .
Let , , be iid random vectors with continuous margins and copula . If the copula has density , then we have
where , , are iid according to the distribution.
Among other things, Theorem 2.2 shows the intuitively obvious property that the marginal distributions of the gradings do not affect the joint distribution of the rank vectors. We shall therefore assume that the marginal distributions of the expert’s and user’s grades are both uniform on . Then we have and , and the random pairs , , are independent and identically distributed according to the copula . This assumption also allows us to unambiguously write .
When the copula function belongs to a parametric family , the distribution of depends on . The probability mass function of , seen as a function of , i.e., the map , for , is sometimes referred to as the rank likelihood. The expression (6) is particularly helpful as it suggests a certain Monte Carlo algorithm to compute this rank likelihood; see Algorithm 5.2.
3. Predictive distribution of compatible rankings
3.1. Compatible rankings
In the predictive distribution (1), any candidate complete ranking by the user should be compatible with the observed ranking of the objects graded by the user. The notion of compatible rankings has already appeared in the literature, see for instance Alvo14. Recall that is the group of permutations of the set .
An incomplete ranking of size , with , is a couple consisting of a permutation and a subset , with . For example, for , an incomplete ranking of size is given by and . This incomplete ranking corresponds to the partial ranking , the second object not (yet) being ranked among the three other ones.
The set of compatible rankings associated to the incomplete ranking is defined as the set of rankings of all objects such that the ranking of the objects in induced by is equal to . Formally, we have
For the example above, we have
To select an -compatible ranking , it suffices to choose the ranks, , of the objects . The ranks of the remaining objects in are then determined by the compatibility constraint. This shows that the cardinality of is equal to .
In the original formulation of the problem, the permutations and represent the expert and the individual’s complete rankings respectively. The set represents the indices of the objects ranked by the individual, with and . One observes the expert’s complete ranking and the user’s ranking of the objects in . Notice that if the expert would have ranked only the objects in , then the expert’s ranks would have been
Further, consider the permutation
In words, is the user’s rank of the object that was given rank by the expert, among the objects graded by the user. If , the identity permutation in , then the rankings by the user and the expert are perfectly aligned.
By using Lemma 2.1, it will be shown in (9) below that for the calculation of the posterior predictive distribution (1), we can simply work with the transformation as our observed data instead of with , , and . However, we need to translate the compatibility constraint to the transformed rankings. This is the purpose of Lemma 3.1 below, which says that is compatible with for the objects in if and only if is compatible with for the objects in the set defined in (7). This statement can be further interpreted as if the objects were lined up in the order of the expert’s preference (and so ) and the user was to rank the objects presented to him in that order: the result would be and .
The order statistics of the expert ranks of the objects graded by the user are denoted by . For the reason explained above, it will be convenient to consider the incomplete ranking with
To apply Lemma 2.1, we would like to switch from to . The following lemma says how this transformation affects the compatibility constraint.
3.2. Predictive distribution
Let be parametric family of bivariate copulas and let , , be a prior density on . Conditionally on , the random pairs , , are iid with common distribution given by . We observe the complete expert ranking as well as the partial user ranking on as above. The posterior predictive distribution, or predictive distribution in short, of the complete user ranking given the data is
for . For the numerator, we use Lemma 2.1 to find that
Summing over all , we find for the denominator that
The marginal distribution of (marginal with respect to ) is
As a consequence, the predictive distribution of given the data is
for . Write with and note from Lemma 3.1 that if and only if . We find that the predictive distribution of given the data is
for . To ease the notation and terminology, we call , , the predictive distribution of the compatible rankings. Since for , we do not need to consider such permutations.
The predicted ranking, , for the user is equal to the mode of the predictive distribution. In view of the above identities, we have
Two questions arise: How to compute the marginal and predictive distributions and , respectively? How to find the mode, , of the predictive distribution? For the family of Farlie–Gumbel–Morgenstern (fgm) copulas, we can find explicit formulas for the marginal probabilities . For other copula families, we propose Monte Carlo algorithms, the performance of which we assess by using the explicit formulas for the fgm family as benchmark.
4. Farlie–Gumbel–Morgenstern copula family
4.1. Rank likelihood
The copulas in the fgm family have the form , with densities , for and with parameter . The fact that the density is polynomial allows us to evaluate the rank likelihood in (6) explicitly. The resulting expression is a polynomial of degree in .
Let , , be iid random vectors with continuous margins and fgm copula , . Then
with , and
The fgm model gives rise to some symmetries in the rank likelihood. Let be the anti-identity. Note that and put .
For and , we have, for the fgm copula family,
Inspecting the proof of Lemma 4.2, we see that the symmetry property (14) holds for any family of copula densities such that and for all , where it is assumed that the parameter set is symmetric around the origin. Besides the fgm family, this includes, after reparametrization, the bivariate Frank, Plackett, and Gauss copula families.
4.2. Marginal distribution of the rank statistic
Let the fgm parameter have prior density over . It follows from Theorem 4.1 that the marginal distribution of is
for . The marginal probabilities
are directly obtained via the calculation of the moments of orderof the prior distribution.
The symmetries found in Lemma 4.2 for the rank likelihood carry through to the marginal distribution. If the prior on is symmetric, which is the case for Jeffreys’ prior discussed below, we obtain the equality
with the anti-identity. This symmetry property reduces the number of distinct values for the probabilities , , .
Next we investigate the mode of the marginal distribution of . In many cases, the identity, , or the anti-identity, , are modes, and sometimes even both rankings are modal. The next result gives sufficient conditions for or to be modal rankings.
Let be a (prior) density on the fgm parameter .
If the odd order moments ofare nonnegative, that is, if , for , then the identity is a mode of the marginal distribution , .
If the odd order moments of are nonpositive, then the anti-identity is a mode of the marginal distribution.
4.3. Prior distributions
We consider two priors on the fgm
family, namely the Beta distribution and Jeffreys’ prior.
For the Beta prior, let with and parameters and , and let denote the resulting density. The moments of are easily computed:
where , for , is the beta function. We have a corollary to Theorem 4.3 for this particular choice of prior.
Let with for and .
If , then the identity is a mode of the marginal distribution of .
If , then the anti-identity is a mode of the marginal distribution of .
We now compute Jeffreys’ prior , for , with the Fisher information at . We have
where , for , is the dilogarithm function. It follows that the Fisher information is
See Figure 1(a) for a graph of Jeffreys’ prior, . Its odd order moments vanish because of symmetry, and its even order moments can be computed by numerical quadrature.
Jeffreys’ prior being symmetric, we compare it with the symmetric subfamily , , of the Beta prior. We consider the total variation distance between and , i.e.,
The minimal value of , obtained numerically, is attained when , with , see Figure 1(b). The symmetric Beta prior with this value for may be a numerically tractable alternative to Jeffreys’ prior.
We illustrate the marginal distributions , , obtained with Jeffreys’ prior and with an asymmetrical Beta prior. The lack of a universal total ordering on makes graphing a bit difficult. We visualize the marginal distributions arising from both priors by plotting the marginal probabilities of against the Kendall distance, , of from the modal rankings, the latter depending on the prior. The Kendall distance (Diaconis88) on is given essentially by the number of discordances between two permutations; more precisely,
In particular, we have the relation , where is the sample version of Kendall’s tau of the sample .
The marginal distributions are illustrated in Figure 2 for an asymmetric Beta prior on the left and for Jeffrey’s prior on the right. The superposition of points is explained by Lemma 4.2 and equation (15). For the asymmetric Beta prior, Corollary 4.4 implies that the mode of the marginal distribution is the anti-identity . Jeffrey’s prior is symmetric, so that, by Theorem 4.3, both the identity, , and the anti-identity, , are modes of the marginal distribution. The symmetry that appears for Jeffrey’s prior is also an artifact of the fgm model and will also appear for other exchangeable and radially symmetric copula families, as discussed after Lemma 4.2. Since , with the anti-identity and , we get
Together with the equality (15), we obtain that the marginal probabilities are symmetrical with respect to the midrange distance .
4.4. Predictive distribution
The posterior predictive distribution, , of the rank statistic is equal to the marginal distribution conditioned on the event ; see (9). The polynomial form of the rank-likelihood (11) induced by the fgm family together with moment formulas for the prior distributions then allow us to compute predictive probabilities exactly.
To provide an example, take as a toy problem the incomplete rankings or in other words, , and , and consider the same two priors as in Figure 2. The predictive distribution , , from equation (9) is illustrated in Figure 3. The predictive distribution associated to Jeffreys’ prior has two modes, and . We will return to this toy example in Section 5.3.
In contrast to the marginal distribution, the posterior predictive distribution arising from Jeffrey’s prior is no longer symmetric around the midrange distance : compare the right-hand panels of Figures 2 and 3. Recall that the symmetry property of the marginal distribution is due to a combination of equations (15) and (17). For the predictive distribution, this explanation breaks down, because implies . Indeed, a compatible ranking must satisfy
but for , we have , and so . In passing, note that, in contrast to the ranking , the ranking may or may not belong to the compatible rankings. Take for instance , , and . On the one hand, we have but . On the other hand, if , we have both and .
5.1. Drawing compatible rankings
The mode of the posterior predictive distribution is the ranking used in order to make a recommendation to the individual. The simulated annealing algorithm proposed in Section 5.2 below gives a way to approximate this mode. It is based on an algorithm to draw random compatible rankings. In practice, the cardinality, , of can be enormous, and a complete listing of all compatible rankings is elusive. One way to draw samples from the uniform distribution over is to draw a permutation randomly from and then turn it to a compatible ranking by rearranging