Antithetic and Monte Carlo kernel estimators for partial rankings

07/01/2018 ∙ by María Lomelí, et al.

In the modern age, rankings data are ubiquitous and useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world are incomplete, which precludes the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators of Gram matrices. These Monte Carlo kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator. An overview of the existing kernels and metrics for permutations is also provided.


1 Motivation

Permutations play a fundamental role in statistical modelling and machine learning applications involving rankings and preference data. A ranking over a set of objects can be encoded as a permutation; hence, kernels for permutations are useful in a variety of machine learning applications involving rankings. Applications include recommender systems, multi-object tracking and preference learning. It is of interest to construct a kernel in the space of the data in order to capture similarities between datapoints and thereby influence the pattern of generalisation. Kernels are used in many machine learning methods. For instance, a kernel input is required for the maximum mean discrepancy (MMD) two-sample test (Gretton et al., 2012), kernel principal component analysis (kPCA) (Schölkopf et al., 1999), support vector machines (Boser et al., 1992; Cortes & Vapnik, 1995), Gaussian processes (GPs) (Rasmussen & Williams, 2006) and agglomerative clustering (Duda & Hart, 1973), among others.

Our main contributions are: (i) a novel and computationally tractable way to deal with incomplete or partial rankings by first representing the marginalised kernel (Haussler, 1999) as a kernel mean embedding of a set of full rankings consistent with an observed partial ranking. We then propose two estimators that can be represented as the corresponding empirical mean embeddings: (ii) a Monte Carlo kernel estimator based on sampling independent and identically distributed rankings from the set of consistent full rankings given an observed partial ranking; (iii) an antithetic variate construction for the marginalised Mallows kernel that gives a lower variance estimator for the kernel Gram matrix. The Mallows kernel has been shown to be an expressive kernel; in particular, Mania et al. (2016) show that the Mallows kernel is an example of a universal and characteristic kernel, and hence it is a useful tool to distinguish samples from two different distributions, and it achieves the Bayes risk when used in kernel-based classification/regression (Sriperumbudur et al., 2011). Jiao & Vert (2015) have proposed a fast approach for computing the Kendall marginalised kernel; however, this kernel is not characteristic (Mania et al., 2016), and hence has limited expressive power.

The resulting estimators are used for a variety of kernel machine learning algorithms in the experiments. We present comparative simulation results demonstrating the efficacy of the proposed estimators for an agglomerative clustering task, a hypothesis test task using the maximum mean discrepancy (MMD) (Gretton et al. , 2012) and a Gaussian process classification task. For the latter, we extend some of the existing methods in the software library GPy (GPy, since 2012).

Since the space of permutations is an example of a discrete space with a non-commutative group structure, the corresponding reproducing kernel Hilbert spaces (RKHS) have only recently been investigated; see Kondor et al. (2007), Fukumizu et al. (2009), Kondor & Barbosa (2010), Jiao & Vert (2015) and Mania et al. (2016). We provide an overview of the connection between kernels and certain semimetrics when working on the space of permutations. This connection allows us to obtain kernels from given semimetrics or semimetrics from existing kernels. We can combine these semimetric-based kernels to obtain novel, more expressive kernels which can be used for the proposed Monte Carlo kernel estimator.

2 Definitions

We first briefly introduce the theory of permutation groups. A particular application of permutations is to use them to represent rankings; in fact, there is a natural one-to-one relationship between rankings of items and permutations. For this reason, we sometimes use ranking and permutation interchangeably. In this section, we state some mathematical definitions to formalise the problem in terms of the space of permutations.

Let $[n] := \{1, \dots, n\}$ be a set of indices for $n$ items, for some $n \in \mathbb{N}$. Given a ranking of these items, we use the notation $\succ$ to denote the ordering of the items induced by the ranking, so that for distinct $i, j \in [n]$, if item $i$ is preferred to item $j$, we will write $i \succ j$. Note that for a full ranking, the corresponding relation $\succ$ is a total order on $[n]$.

We now outline the correspondence between rankings on $[n]$ and the permutation group $S_n$ that we use throughout the paper. In words, given a full ranking of $[n]$, we will associate it with the permutation that maps each ranking position to the correct object under the ranking. More mathematically, given a ranking $i_1 \succ i_2 \succ \cdots \succ i_n$ of $[n]$, we may associate it with the permutation $\sigma \in S_n$ given by $\sigma(j) = i_j$ for all $j \in [n]$. For example, the ranking on $[3]$ given by $2 \succ 3 \succ 1$ corresponds to the permutation $\sigma$ given by $\sigma(1) = 2$, $\sigma(2) = 3$, $\sigma(3) = 1$. This correspondence allows the literature relating to kernels on permutations to be leveraged for problems involving the modelling of ranking data.
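As a small illustration of this correspondence (our own sketch; the helper names are not from the paper), the following encodes a ranking, given as a list of items from most to least preferred, as such a permutation, and checks a pairwise preference via the inverse permutation:

```python
def ranking_to_permutation(ranking):
    """sigma[j] = item occupying ranking position j (0-indexed here)."""
    return tuple(ranking)

def prefers(sigma, x, y):
    """True if item x is ranked above item y under the full ranking sigma."""
    position = {item: j for j, item in enumerate(sigma)}   # sigma^{-1}
    return position[x] < position[y]

# The ranking 2 > 3 > 1 on {1, 2, 3} corresponds to sigma = (2, 3, 1).
sigma = ranking_to_permutation([2, 3, 1])
assert prefers(sigma, 2, 1) and not prefers(sigma, 1, 3)
```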

In the next section, we will review some of the semimetrics on $S_n$ that can serve as building blocks for the construction of more expressive kernels.

2.1 Metrics for permutations and properties

Definition 1

Let $\mathcal{X}$ be any set and $\rho: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ a function, which we write as $\rho(x, y)$ for every $x, y \in \mathcal{X}$. Then $\rho$ is a semimetric if it satisfies the following conditions, for every $x, y \in \mathcal{X}$ (Dudley, 2002):

  1. $\rho(x, y) = \rho(y, x)$, that is, $\rho$ is a symmetric function.

  2. $\rho(x, y) = 0$ if and only if $x = y$.

    A semimetric is a metric if it also satisfies:

  3. $\rho(x, z) \leq \rho(x, y) + \rho(y, z)$ for every $x, y, z \in \mathcal{X}$, that is, $\rho$ satisfies the triangle inequality.

The following are some examples of semimetrics on the space of permutations (Diaconis, 1988). All semimetrics in bold have the additional property of being of negative type. Theorem 2.1, stated below, shows that negative type semimetrics are closely related to kernels.

  1. Spearman’s footrule.

     $d(\sigma, \sigma') = \sum_{i=1}^{n} |\sigma(i) - \sigma'(i)|$.

  2. Spearman’s rank correlation.

     $d(\sigma, \sigma') = \sum_{i=1}^{n} \left(\sigma(i) - \sigma'(i)\right)^2$.

  3. Hamming distance.

     $d(\sigma, \sigma') = \sum_{i=1}^{n} \mathbb{1}\{\sigma(i) \neq \sigma'(i)\}$. It can also be defined as the minimum number of substitutions required to change one permutation into the other.

  4. Cayley distance.

     $d(\sigma, \sigma') = \sum_{j=1}^{n-1} X_j(\sigma\sigma'^{-1})$,

     where the composition operation of the permutation group is denoted by $\sigma\sigma'^{-1}$, and $X_j(\sigma\sigma'^{-1}) = 0$ if $j$ is the largest item in its cycle and is equal to 1 otherwise (Irurozki & Lozano, 2016). It is also equal to the minimum number of pairwise transpositions taking $\sigma$ to $\sigma'$. Finally, it can also be shown to be equal to $n - C(\sigma\sigma'^{-1})$, where $C(\sigma\sigma'^{-1})$ is the number of cycles in $\sigma\sigma'^{-1}$.

  5. Kendall distance.

     $d(\sigma, \sigma') = n_d(\sigma, \sigma')$,

     where $n_d(\sigma, \sigma')$ is the number of discordant pairs for the permutation pair $(\sigma, \sigma')$. It can also be defined as the minimum number of pairwise adjacent transpositions taking $\sigma$ to $\sigma'$.

  6. $\ell_p$ distances. $d(\sigma, \sigma') = \left(\sum_{i=1}^{n} |\sigma(i) - \sigma'(i)|^p\right)^{1/p}$ with $p \geq 1$.

  7. $\ell_\infty$ distance. $d(\sigma, \sigma') = \max_{1 \leq i \leq n} |\sigma(i) - \sigma'(i)|$.
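The following sketch (our own code, not from the paper) implements several of these semimetrics; a permutation is stored as a rank vector, and the Kendall distance is computed as the number of item pairs whose relative order differs:

```python
import numpy as np

# Sketch implementations (ours, not the authors') of the semimetrics above.
# A permutation is a NumPy array sigma with sigma[i] = rank given to item i,
# using the values 0, ..., n-1.

def spearman_footrule(s1, s2):
    return int(np.abs(s1 - s2).sum())

def spearman_rank_correlation(s1, s2):
    return int(((s1 - s2) ** 2).sum())

def hamming(s1, s2):
    return int((s1 != s2).sum())

def kendall(s1, s2):
    # Number of discordant pairs: item pairs ranked in opposite relative order.
    n = len(s1)
    return sum((s1[i] < s1[j]) != (s2[i] < s2[j])
               for i in range(n) for j in range(i + 1, n))

def cayley(s1, s2):
    # n minus the number of cycles of s1 composed with the inverse of s2,
    # i.e. the minimum number of transpositions taking s2 to s1.
    n = len(s1)
    inv_s2 = np.empty(n, dtype=int)
    inv_s2[s2] = np.arange(n)
    comp = s1[inv_s2]
    seen, cycles = np.zeros(n, dtype=bool), 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j], j = True, comp[j]
    return n - cycles

def l_infinity(s1, s2):
    return int(np.abs(s1 - s2).max())

s1, s2 = np.array([0, 1, 2, 3]), np.array([1, 0, 3, 2])
print(spearman_footrule(s1, s2), hamming(s1, s2), kendall(s1, s2), cayley(s1, s2))
# 4 4 2 2
```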

Definition 2

A semimetric $\rho$ is said to be of negative type if for all $m \geq 2$, $x_1, \dots, x_m \in \mathcal{X}$ and $a_1, \dots, a_m \in \mathbb{R}$ with $\sum_{i=1}^{m} a_i = 0$, we have

$$\sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j \rho(x_i, x_j) \leq 0. \qquad (1)$$

In general, if we start with a Mercer kernel for permutations, that is, a symmetric and positive definite function $K: S_n \times S_n \to \mathbb{R}$, the following expression gives a semimetric that is of negative type:

$$\rho(\sigma, \sigma') = K(\sigma, \sigma) + K(\sigma', \sigma') - 2K(\sigma, \sigma'). \qquad (2)$$

A useful characterisation of semimetrics of negative type is given by the following theorem, which states a connection between negative type metrics and a Hilbert space feature representation or feature map $\phi$.

Theorem 2.1

(Berg et al., 1984). A semimetric $\rho$ is of negative type if and only if there exists a Hilbert space $\mathcal{H}$ and an injective map $\phi: \mathcal{X} \to \mathcal{H}$ such that $\rho(x, y) = \lVert \phi(x) - \phi(y) \rVert_{\mathcal{H}}^2$, for all $x, y \in \mathcal{X}$.

Once the feature map from Theorem 2.1 is found, we can directly take its inner product to construct a kernel. For instance, Jiao & Vert (2015) propose an explicit feature representation for the Kendall kernel given by

$$\phi(\sigma) = \frac{1}{\sqrt{\binom{n}{2}}}\left(\operatorname{sign}\left(\sigma(i) - \sigma(j)\right)\right)_{1 \leq i < j \leq n}.$$

They show that the inner product between two such features is a positive definite kernel. The corresponding metric, given by the Kendall distance, can be shown to be the square of the norm of the difference of feature vectors. Hence, by Theorem 2.1, it is of negative type.

Analogously, Mania et al.  (2016) propose an explicit feature representation for the Mallows kernel, given by

where when and .

In the following proposition, an explicit feature representation for the Hamming distance is introduced and we show that it is a distance of negative type.

Proposition 1

The Hamming distance is of negative type with

$$d_H(\sigma, \sigma') = \lVert \phi(\sigma) - \phi(\sigma') \rVert_F^2, \qquad (3)$$

where the corresponding feature representation is a matrix $\phi(\sigma) \in \mathbb{R}^{n \times n}$ given by $\phi(\sigma)_{ij} = \frac{1}{\sqrt{2}}\,\mathbb{1}\{\sigma(i) = j\}$.

Proof

The Hamming distance can be written as a square difference of indicator functions in the following way:

$$d_H(\sigma, \sigma') = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\mathbb{1}\{\sigma(i) = j\} - \mathbb{1}\{\sigma'(i) = j\}\right)^2,$$

where each indicator is one whenever the given entry of the permutation is equal to the corresponding item $j$. Let the $i$-th feature vector be $\phi_i(\sigma) = \frac{1}{\sqrt{2}}\left(\mathbb{1}\{\sigma(i) = 1\}, \dots, \mathbb{1}\{\sigma(i) = n\}\right)$, then

$$d_H(\sigma, \sigma') = \sum_{i=1}^{n}\lVert \phi_i(\sigma) - \phi_i(\sigma') \rVert^2 = \operatorname{Tr}\left[\left(\phi(\sigma) - \phi(\sigma')\right)\left(\phi(\sigma) - \phi(\sigma')\right)^{\top}\right],$$

where the difference of the feature matrices is the matrix with rows $\phi_i(\sigma) - \phi_i(\sigma')$, $i = 1, \dots, n$. This is the square of the usual Frobenius norm for matrices, so by Theorem 2.1, the Hamming distance is of negative type.

Another example is Spearman’s rank correlation, which is a semimetric of negative type since it is the square of the usual Euclidean distance (Berg et al. , 1984).

The two alternative definitions given for some of the distances in the previous examples are handy from different perspectives. One is an expression in terms of either an injective or non-injective feature representation, while the other is in terms of the minimum number of operations needed to change one permutation into the other. Other distances can be defined in terms of such a minimum number of operations; they are called editing metrics (Deza & Deza, 2009). Editing metrics are useful from an algorithmic point of view, whereas metrics defined in terms of feature vectors are useful from a theoretical point of view. Ideally, having both an algorithmic and a theoretical description of a particular metric gives a better picture of which characteristics of a permutation the metric takes into account. For instance, the algorithmic descriptions of the Kendall and Cayley distances correspond to the bubble sort and quicksort algorithms, respectively (Knuth, 1998).
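As a quick check of the bubble sort connection (our own sketch, using the rank-vector conventions of the earlier snippet), the number of adjacent swaps performed by bubble sort, applied to one ranking read off in the order of the other, recovers the Kendall distance:

```python
import numpy as np

def kendall(s1, s2):
    n = len(s1)
    return sum((s1[i] < s1[j]) != (s2[i] < s2[j])
               for i in range(n) for j in range(i + 1, n))

def bubble_sort_swaps(seq):
    """Bubble sort `seq` and count the adjacent transpositions used."""
    a, swaps = list(seq), 0
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1], swaps = a[i + 1], a[i], swaps + 1
    return swaps

rng = np.random.default_rng(0)
s1, s2 = rng.permutation(7), rng.permutation(7)   # two rank vectors
# Reading off s1's ranks in s2's order turns the Kendall distance into an
# inversion count, which is exactly the number of bubble sort swaps.
assert bubble_sort_swaps(s1[np.argsort(s2)]) == kendall(s1, s2)
```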

Figure 1: Kendall and Cayley distances for permutations of . There is an edge between two permutations in the graph if they differ by one adjacent or non-adjacent transposition, respectively.

Another property shared by most of the semimetrics in the examples is the following

Definition 3

Let $(S_n, \cdot)$ denote the symmetric group of degree $n$ with the composition operation. A right-invariant semimetric $d$ (Diaconis, 1988) satisfies

$$d(\sigma, \sigma') = d(\sigma\tau, \sigma'\tau) \quad \text{for all } \sigma, \sigma', \tau \in S_n. \qquad (4)$$

In particular, if we take $\tau = \sigma'^{-1}$ then $d(\sigma, \sigma') = d(\sigma\sigma'^{-1}, e)$, where $e$ corresponds to the identity element of the permutation group.

This property is inherited by the distance-induced kernel from Section 2.2, Example 7. This symmetry is analogous to translation invariance for kernels defined in Euclidean spaces.

2.2 Kernels for $S_n$

If we specify a symmetric and positive definite function or kernel $K: S_n \times S_n \to \mathbb{R}$, it corresponds to defining an implicit feature space representation of a ranking data point. The well-known kernel trick exploits the implicit nature of this representation by performing computations with the kernel function explicitly, rather than using inner products between feature vectors in high or even infinite dimensional space. Any symmetric and positive definite function uniquely defines an underlying reproducing kernel Hilbert space (RKHS); see Appendix A of the supplementary material for a brief overview of the RKHS. Some examples of kernels for permutations are the following:

  1. The Kendall kernel (Jiao & Vert, 2015) is given by

     $K_\tau(\sigma, \sigma') = \dfrac{n_c(\sigma, \sigma') - n_d(\sigma, \sigma')}{\binom{n}{2}}$,
     where $n_c(\sigma, \sigma')$ and $n_d(\sigma, \sigma')$ denote the number of concordant and discordant pairs between $\sigma$ and $\sigma'$ respectively.

  2. The Mallows kernel (Jiao & Vert, 2015) is given by

     $K_M(\sigma, \sigma') = e^{-\lambda\, n_d(\sigma, \sigma')}$, where $\lambda \geq 0$ is a parameter.

  3. The polynomial kernel of degree $m$ (Mania et al., 2016) is given by

     $K_P^{(m)}(\sigma, \sigma') = \left(1 + K_\tau(\sigma, \sigma')\right)^m$, where $K_\tau$ is the Kendall kernel.

  4. The Hamming kernel is given by

     $K_H(\sigma, \sigma') = \sum_{i=1}^{n} \mathbb{1}\{\sigma(i) = \sigma'(i)\}$, the inner product of the feature representation from Proposition 1 up to scaling.

  5. An exponential semimetric kernel is given by

     $K(\sigma, \sigma') = e^{-\lambda d(\sigma, \sigma')}$, where $d$ is a semimetric of negative type and $\lambda \geq 0$.

  6. The diffusion kernel (Kondor & Barbosa, 2010) is given by

     $K_\beta(\sigma, \sigma') = \left[e^{\beta Q}\right]_{\sigma, \sigma'}$, where $\beta > 0$ and $Q$ is a function on pairs of permutations that must satisfy $Q(\sigma, \sigma') = Q(\sigma', \sigma)$ and $\sum_{\sigma'} Q(\sigma, \sigma') = 0$. A particular case is $Q(\sigma, \sigma') = 1$ if $\sigma$ and $\sigma'$ are connected by an edge in some Cayley graph representation of $S_n$, and $Q(\sigma, \sigma') = -\deg(\sigma)$ if $\sigma = \sigma'$, or $0$ otherwise.

  7. The semimetric or distance induced kernel (Sejdinovic et al., 2013): if the semimetric $d$ is of negative type, then a family of kernels $K_{\sigma_0}$, parameterised by a central permutation $\sigma_0$, is given by

     $K_{\sigma_0}(\sigma, \sigma') = \frac{1}{2}\left[d(\sigma, \sigma_0) + d(\sigma', \sigma_0) - d(\sigma, \sigma')\right]$.

If we choose any of the above kernels by itself, it will generally not be complex enough to represent the ranking data’s generating mechanism. However, we can benefit from the allowable operations for kernels to combine kernels and still obtain a valid kernel. Some of the operations which render a valid kernel are the following: sum, multiplication by a positive constant, product, polynomial and exponential (Berlinet & Thomas-Agnan, 2004).
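The brief sketch below (our own code and naming, with rankings given as rank vectors as in the earlier snippets) illustrates two of these kernels and one such combination; the exponential-Hamming kernel is an instance of item 5 above, using the Hamming distance of Proposition 1:

```python
import numpy as np
from math import comb, exp

def concordant_discordant(s1, s2):
    """Counts of concordant and discordant item pairs between two rank vectors."""
    n, nc, nd = len(s1), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if (s1[i] < s1[j]) == (s2[i] < s2[j]):
                nc += 1
            else:
                nd += 1
    return nc, nd

def kendall_kernel(s1, s2):
    nc, nd = concordant_discordant(s1, s2)
    return (nc - nd) / comb(len(s1), 2)

def mallows_kernel(s1, s2, lam=1.0):
    return exp(-lam * concordant_discordant(s1, s2)[1])

def exp_hamming_kernel(s1, s2, lam=1.0):
    # Exponential semimetric kernel built from the Hamming distance,
    # which Proposition 1 shows to be of negative type.
    return exp(-lam * int((np.asarray(s1) != np.asarray(s2)).sum()))

# Sums, products and positive scalings of kernels are again kernels, so a
# product such as the one below remains a valid (and more expressive) kernel.
def combined_kernel(s1, s2, lam=0.5):
    return mallows_kernel(s1, s2, lam) * exp_hamming_kernel(s1, s2, lam)

s1, s2 = np.array([0, 1, 2, 3]), np.array([1, 0, 3, 2])
print(kendall_kernel(s1, s2), mallows_kernel(s1, s2), combined_kernel(s1, s2))
```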

In the case of the symmetric group of degree $n$, $S_n$, there exist kernels that are right-invariant, as defined in Equation (4). This invariance property is useful because it is possible to write down the kernel as a function of a single argument and then obtain a Fourier representation. The caveat is that this Fourier representation is given in terms of certain unitary matrix representations, due to the non-Abelian structure of the group (James, 1978). Even though the space is finite, and every irreducible representation is finite-dimensional (Fukumizu et al., 2009), these Fourier representations do not have closed form expressions. For this reason, it is difficult to work in the spectral domain, in contrast to the Euclidean case. There is also no natural measure to sample from, such as the one provided by Bochner's theorem in Euclidean spaces (Wendland, 2005). In the next section, we will present a novel Monte Carlo kernel estimator for the case of partial rankings data.

3 Partial rankings

Having provided an overview of kernels for permutations, and reviewed the link between permutations and rankings of objects, we now turn to the practical issue that in real datasets, we typically have access only to partial ranking information, such as pairwise preferences and top-$k$ rankings. Following Jiao & Vert (2015), we consider the following types of partial rankings:

Definition 4 (Exhaustive partial rankings, top-$k$ rankings)

Let $n \in \mathbb{N}$. A partial ranking on the set $[n]$ is specified by an ordered collection of disjoint non-empty subsets $A_1 \succ A_2 \succ \cdots \succ A_k$ of $[n]$, for any $k \in \mathbb{N}$. The partial ranking encodes the fact that the items in $A_i$ are preferred to those in $A_j$, for $i < j$, with no preference information specified about the items in $[n] \setminus \bigcup_{i=1}^{k} A_i$. A partial ranking with $\bigcup_{i=1}^{k} A_i = [n]$ is termed exhaustive, as all items in $[n]$ are included within the preference information. A top-$k$ partial ranking is a particular type of exhaustive partial ranking $A_1 \succ \cdots \succ A_{k+1}$, with $|A_i| = 1$ for $i = 1, \dots, k$ and $A_{k+1} = [n] \setminus \bigcup_{i=1}^{k} A_i$. We will frequently identify a partial ranking with the set $R \subseteq S_n$ of full rankings consistent with the partial ranking. Thus, $\sigma \in R$ iff for all $i < j$, and for all $x \in A_i$ and $y \in A_j$, we have $x \succ y$ under $\sigma$. When there is potential for confusion, we will use the term “subset partial ranking” when referring to a partial ranking as a subset $R \subseteq S_n$, and “preference partial ranking” when referring to a partial ranking with the notation $A_1 \succ \cdots \succ A_k$.

Thus, for many practical problems, we require definitions of kernels between partial rankings rather than between full rankings, to be able to deal with datasets containing only partial ranking information. A common approach (Tsuda et al., 2002) is to take a kernel $K$ defined on $S_n$, and use the marginalised kernel, defined on subset partial rankings $R, R' \subseteq S_n$ by

$$\widetilde{K}_p(R, R') = \sum_{\sigma \in R}\sum_{\sigma' \in R'} K(\sigma, \sigma')\, p(\sigma \mid R)\, p(\sigma' \mid R') \qquad (5)$$

for all $R, R' \subseteq S_n$, for some probability distribution $p$ on $S_n$. Here, $p(\cdot \mid R)$ denotes the conditioning of $p$ to the set $R$. Jiao & Vert (2015) use the convolution kernel (Haussler, 1999) between partial rankings,

given by

$$K^{\ast}(R, R') = \frac{1}{|R|\,|R'|}\sum_{\sigma \in R}\sum_{\sigma' \in R'} K(\sigma, \sigma'). \qquad (6)$$

This is a particular case of the marginalised kernel of Equation (5), in which we take the probability mass functions to be uniform over $R$ and $R'$ respectively. In general, computation with a marginalised kernel quickly becomes computationally intractable, with the number of terms on the right-hand side of Equation (5) growing super-exponentially with $n$ for a fixed number of items in the partial rankings $R$ and $R'$; see Appendix D for a numerical example of such growth. An exception is the Kendall kernel case for two interleaving partial rankings of $k$ and $m$ items, or a top-$k$ and top-$m$ ranking. In this case, the sum can be tractably computed (Jiao & Vert, 2015).
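For small $n$, the convolution kernel of Equation (6) can be evaluated exactly by brute-force enumeration of the consistent full rankings, which also makes the growth in the number of terms explicit. The sketch below is our own illustration (block notation and function names are ours), with rankings represented as tuples of items in preference order and the Mallows kernel as the base kernel:

```python
from itertools import permutations
from math import exp

def kendall(sigma, tau):
    """Number of item pairs ranked in opposite relative order."""
    pos_s = {item: i for i, item in enumerate(sigma)}
    pos_t = {item: i for i, item in enumerate(tau)}
    items = list(sigma)
    return sum((pos_s[x] < pos_s[y]) != (pos_t[x] < pos_t[y])
               for a, x in enumerate(items) for y in items[a + 1:])

def mallows(sigma, tau, lam=1.0):
    return exp(-lam * kendall(sigma, tau))

def consistent_full_rankings(blocks, n):
    """All full rankings over items 0..n-1 consistent with the partial ranking
    A_1 > A_2 > ... > A_k given as the list of sets `blocks`."""
    out = []
    for sigma in permutations(range(n)):
        pos = {item: i for i, item in enumerate(sigma)}
        ok = all(pos[x] < pos[y]
                 for a in range(len(blocks)) for b in range(a + 1, len(blocks))
                 for x in blocks[a] for y in blocks[b])
        if ok:
            out.append(sigma)
    return out

def convolution_kernel(blocks1, blocks2, n, lam=1.0):
    R1 = consistent_full_rankings(blocks1, n)
    R2 = consistent_full_rankings(blocks2, n)
    total = sum(mallows(s, t, lam) for s in R1 for t in R2)
    return total / (len(R1) * len(R2))

# Two partial rankings of n = 6 items: {0} > {3} and {2} > {4} > {1}.
print(convolution_kernel([{0}, {3}], [{2}, {4}, {1}], n=6))
```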

We propose a variety of Monte Carlo methods to estimate the marginalised kernel of Equation (5) for the general case, where direct calculation is intractable.

Definition 5

The Monte Carlo estimator approximating the marginalised kernel of Equation (5) is defined, for a collection of partial rankings $R_1, \dots, R_D \subseteq S_n$, by

$$\widehat{K}_p(R_i, R_j) = \frac{1}{M_i M_j}\sum_{l=1}^{M_i}\sum_{m=1}^{M_j} w_l^{(i)} w_m^{(j)} K\!\left(\sigma_l^{(i)}, \sigma_m^{(j)}\right) \qquad (7)$$

for $i, j = 1, \dots, D$, where $\left(\sigma_l^{(i)}\right)_{l=1}^{M_i}$ are random permutations, and $\left(w_l^{(i)}\right)_{l=1}^{M_i}$ are random weights. Note that this general set-up allows for several possibilities:

  • For each $i = 1, \dots, D$, the permutations $\sigma_1^{(i)}, \dots, \sigma_{M_i}^{(i)}$ are drawn exactly from the distribution $p(\cdot \mid R_i)$. In this case, the weights are simply $w_l^{(i)} = 1$ for $l = 1, \dots, M_i$.

  • For each $i = 1, \dots, D$, the permutations $\sigma_1^{(i)}, \dots, \sigma_{M_i}^{(i)}$ are drawn from some proposal distribution $q(\cdot \mid R_i)$, with the weights given by the corresponding importance weights $w_l^{(i)} = p\!\left(\sigma_l^{(i)} \mid R_i\right) / q\!\left(\sigma_l^{(i)} \mid R_i\right)$ for $l = 1, \dots, M_i$.
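The following is a minimal sketch (our own code and naming) of the first variant above, for top-$k$ partial rankings with the uniform conditional $p(\cdot \mid R)$: exact samples are drawn by shuffling the unranked items, and the weights are all equal to one:

```python
import numpy as np
from math import exp

def kendall(sigma, tau):
    pos_s = {item: i for i, item in enumerate(sigma)}
    pos_t = {item: i for i, item in enumerate(tau)}
    items = list(sigma)
    return sum((pos_s[x] < pos_s[y]) != (pos_t[x] < pos_t[y])
               for a, x in enumerate(items) for y in items[a + 1:])

def mallows(sigma, tau, lam=1.0):
    return exp(-lam * kendall(sigma, tau))

def sample_consistent_topk(topk, n, rng):
    """Uniform draw from the full rankings consistent with the top-k partial
    ranking `topk` (the k most preferred items, in order) over items 0..n-1."""
    rest = [x for x in range(n) if x not in topk]
    return tuple(topk) + tuple(int(x) for x in rng.permutation(rest))

def mc_kernel_estimate(topk_i, topk_j, n, M=200, lam=1.0, seed=0):
    """Monte Carlo estimator of Eq. (7) with unit weights, targeting the
    marginalised Mallows kernel of Eq. (5) under uniform conditionals."""
    rng = np.random.default_rng(seed)
    samples_i = [sample_consistent_topk(topk_i, n, rng) for _ in range(M)]
    samples_j = [sample_consistent_topk(topk_j, n, rng) for _ in range(M)]
    return float(np.mean([[mallows(s, t, lam) for t in samples_j]
                          for s in samples_i]))

# Example: a top-2 and a top-3 partial ranking over n = 8 items.
print(mc_kernel_estimate([4, 1], [0, 5, 2], n=8, M=100))
```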

An alternative perspective on the estimator defined in Equation (7), more in line with the literature on random feature approximations of kernels, is to define a random feature embedding for each of the partial rankings $R_1, \dots, R_D$.

More precisely, let $\mathcal{H}$ be the (finite-dimensional) Hilbert space associated with the kernel $K$ on the space $S_n$, and let $\phi: S_n \to \mathcal{H}$ be the associated feature map, so that $\phi(\sigma) = K(\sigma, \cdot)$ for each $\sigma \in S_n$. Then observe that we have $K(\sigma, \sigma') = \langle \phi(\sigma), \phi(\sigma') \rangle_{\mathcal{H}}$ for all $\sigma, \sigma' \in S_n$. We now extend this feature embedding to partial rankings as follows. Given a partial ranking $R \subseteq S_n$, we define the feature embedding of $R$ by

$$\phi(R) = \sum_{\sigma \in R} p(\sigma \mid R)\, \phi(\sigma).$$

With this extension of $\phi$ to partial rankings, we may now directly express the marginalised kernel of Equation (5) as an inner product in the same Hilbert space $\mathcal{H}$:

$$\widetilde{K}_p(R, R') = \langle \phi(R), \phi(R') \rangle_{\mathcal{H}}$$

for all partial rankings $R, R' \subseteq S_n$. If we define a random feature embedding of the partial rankings by

$$\widehat{\phi}(R_i) = \frac{1}{M_i}\sum_{l=1}^{M_i} w_l^{(i)}\, \phi\!\left(\sigma_l^{(i)}\right),$$

then the Monte Carlo kernel estimator of Equation (7) can be expressed directly as

$$\widehat{K}_p(R_i, R_j) = \left\langle \widehat{\phi}(R_i), \widehat{\phi}(R_j) \right\rangle_{\mathcal{H}} \qquad (8)$$

for each $i, j = 1, \dots, D$. This expression of the estimator as an inner product between randomised embeddings will be useful in the sequel.

We provide an illustration of the various RKHS embeddings at play in Figure 2, using the notation of the proof of Theorem 3.2. In this figure, $R$ is a partial ranking, with three consistent full rankings $\sigma_1, \sigma_2, \sigma_3$. The extended embedding $\phi$ applied to $R$ is the barycentre in the RKHS of the embeddings of the consistent full rankings, and a Monte Carlo approximation to this embedding is also displayed.

Figure 2: Visualisation of the various embeddings discussed in the proof of Theorem 3.2. $\sigma_1, \sigma_2, \sigma_3$ are permutations in $S_n$, which are mapped into the RKHS $\mathcal{H}$ by the embedding $\phi$. $R$ is a partial ranking subset which contains $\sigma_1, \sigma_2, \sigma_3$, and its embedding $\phi(R)$ is given as the average of the embeddings of its full rankings. The Monte Carlo embedding induced by Equation (7) is computed by taking the average of a randomly sampled collection of consistent full rankings from $R$.
Theorem 3.1

Let $R \subseteq S_n$ be a partial ranking, and let $\sigma_1, \dots, \sigma_M$ be independent and identically distributed samples from $p(\cdot \mid R)$. The kernel Monte Carlo mean embedding,

$$\widehat{\phi}(R) = \frac{1}{M}\sum_{l=1}^{M}\phi(\sigma_l),$$

is a consistent estimator of the marginalised kernel embedding

$$\phi(R) = \sum_{\sigma \in R} p(\sigma \mid R)\,\phi(\sigma).$$

Proof

Note that the RKHS in which these embeddings take values is finite-dimensional, and the Monte Carlo estimator is the average of iid terms, each of which is equal to the true embedding in expectation. Thus, we immediately obtain unbiasedness and consistency of the Monte Carlo embedding.

Theorem 3.2

The Monte Carlo kernel estimator from Equation (7) does define a positive-definite kernel; further, it yields consistent estimates of the true kernel function.

Proof

We first deal with the positive-definiteness claim. Let $R_1, \dots, R_D$ be a collection of partial rankings, and for each $i = 1, \dots, D$, let $\left(w_l^{(i)}, \sigma_l^{(i)}\right)_{l=1}^{M_i}$ be an i.i.d. weighted collection of complete rankings distributed according to $p(\cdot \mid R_i)$. To show that the Monte Carlo kernel estimator is positive-definite, we observe that by Equation (8), the matrix with $(i, j)$th element given by $\widehat{K}_p(R_i, R_j)$ is the Gram matrix of the vectors $\widehat{\phi}(R_1), \dots, \widehat{\phi}(R_D)$ with respect to the inner product of the Hilbert space $\mathcal{H}$. We therefore immediately deduce that the matrix is positive semi-definite, and therefore the kernel estimator itself is positive-definite. Furthermore, the Monte Carlo kernel estimator is consistent; see Appendix B in the supplementary material for the proof.

Having established that the Monte Carlo estimator is itself a kernel, we note that when it is evaluated at two partial rankings $R_i, R_j$, the resulting expression is not a sum of iid terms; the following result quantifies the quality of the estimator through its variance.

Theorem 3.3

The variance of the Monte Carlo kernel estimator evaluated at a pair of partial rankings $R_i, R_j$, with $M_i$ and $M_j$ Monte Carlo samples respectively, can be written in closed form; the expression and its proof are given in the supplementary material, Appendix C. We have presented some theoretical properties of the embedding corresponding to the Monte Carlo kernel estimator, which confirm that it is a sensible embedding. In the next section, we present a lower variance estimator based on a novel antithetic variates construction.

4 Antithetic random variates for permutations

A common, computationally cheap variance reduction technique in Monte Carlo estimation of expectations of a given function is to use antithetic variates (Hammersley & Morton, 1956), the purpose of which is to introduce negative correlation between samples without affecting their marginal distribution, resulting in a lower variance estimator. Antithetic samples have been used when sampling from Euclidean vector spaces, for which antithetic samples are straightforward to define. However, to the best of our knowledge, antithetic variate constructions have not been proposed for the space of permutations. We begin by introducing a definition for antithetic samples for permutations.

Definition 6 (Antithetic permutations)

Let $R \subseteq S_n$ be a top-$k$ partial ranking. The antithetic operator $A_R$ maps each permutation $\sigma \in R$ to the permutation in $R$ of maximal distance from $\sigma$.

It is not necessarily clear a priori that the antithetic operator of Definition 6 is well-defined, but for the Kendall distance and top-$k$ partial rankings, it turns out that it is indeed well-defined.

Remark 1

For the Kendall distance and top-$k$ partial rankings, the antithetic operators of Definition 6 are well-defined, in the sense that there exists a unique distance-maximising permutation in $R$ from any given $\sigma \in R$. Indeed, the antithetic map $A_R$ when $R$ is a top-$k$ partial ranking has a particularly neat expression; if the partial ranking corresponding to $R$ is $i_1 \succ i_2 \succ \cdots \succ i_k$, and we have a full ranking $\sigma \in R$ (so that $\sigma(j) = i_j$ for $j = 1, \dots, k$), then the antithetic permutation $A_R(\sigma)$ is given by

$$A_R(\sigma)(j) = \begin{cases} \sigma(j) & \text{for } j = 1, \dots, k, \\ \sigma(n + k + 1 - j) & \text{for } j = k+1, \dots, n, \end{cases}$$

that is, the observed top-$k$ prefix is kept and the order of the remaining $n - k$ items is reversed. In this case, we have $d(\sigma, A_R(\sigma)) = \binom{n-k}{2}$.

This definition of antithetic samples for permutations has parallels with the standard notion of antithetic samples in vector spaces, in which typically a sampled vector $x$ is negated to form $-x$, its antithetic sample; $-x$ is the vector maximising the Euclidean distance from $x$, under the restriction of fixed norm.
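In code, the antithetic construction of Remark 1 (as reconstructed above; the snippet and its conventions are ours) keeps the observed prefix and reverses the remaining items, and the maximal-distance property can be checked directly:

```python
def antithetic(sigma, k):
    """Antithetic counterpart of a full ranking sigma (a tuple of items, most
    to least preferred) consistent with a top-k partial ranking: keep the
    observed top-k prefix and reverse the order of the remaining items."""
    sigma = list(sigma)
    return tuple(sigma[:k] + sigma[k:][::-1])

def kendall(sigma, tau):
    """Number of item pairs ranked in opposite relative order by sigma and tau."""
    pos_s = {item: i for i, item in enumerate(sigma)}
    pos_t = {item: i for i, item in enumerate(tau)}
    items = list(sigma)
    return sum((pos_s[x] < pos_s[y]) != (pos_t[x] < pos_t[y])
               for a, x in enumerate(items) for y in items[a + 1:])

# A full ranking of n = 8 items consistent with the top-2 ranking 4 > 1.
sigma, k = (4, 1, 0, 7, 3, 5, 2, 6), 2
n = len(sigma)
# The antithetic pair attains the maximal distance within the consistent set:
# all (n - k choose 2) pairs of unranked items become discordant.
assert kendall(sigma, antithetic(sigma, k)) == (n - k) * (n - k - 1) // 2
```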

Proposition 2

Let $R \subseteq S_n$ be a partial ranking and $(\sigma, A_R(\sigma))$ be an antithetic pair from $R$, with $\sigma$ distributed uniformly in the region $R$. Let $d$ be the Kendall distance and $\sigma' \in S_n$ a fixed permutation, and set $X = d(\sigma, \sigma')$ and $Y = d(A_R(\sigma), \sigma')$; then $X$ and $Y$ have negative covariance.

The proof of this proposition is presented after the relevant lemmas are proved. Since one of the main tasks in statistical inference is to compute expectations of a function of interest, denoted by $f$, once the antithetic variates are constructed, the functional form of $f$ determines whether or not the antithetic variate construction produces a lower variance estimator for its expectation. If $f$ is a monotone function, we have the following corollary.

Corollary 3

Let $f: \mathbb{R} \to \mathbb{R}$ be a monotone increasing (decreasing) function. Then, the random variables $f(X)$ and $f(Y)$, with $X = d(\sigma, \sigma')$ and $Y = d(A_R(\sigma), \sigma')$ as in Proposition 2, have negative covariance.

Proof

The random variable $Y$ from Proposition 2 is equal in distribution to $c - X$, where $c$ is a constant which changes depending on whether $R$ corresponds to a full ranking or an exhaustive partial ranking; see the proof of Proposition 2 in the next section for the specific form of the constants. By Chebyshev's integral inequality (Fink & Jodeit, 1984), the covariance between a monotone increasing (decreasing) function and a monotone decreasing (increasing) function is negative.

The next theorem presents the antithetic empirical feature embedding and the corresponding antithetic kernel estimator. Indeed, if we take the inner product between two such embeddings, this yields the antithetic kernel estimator, which is a function of a pair of partial ranking subsets. In this case, the function $f$ from above is the kernel evaluated at each pair; this is an example of a $U$-statistic (Serfling, 1980, Chapter 5).

Theorem 4.1

Let $R \subseteq S_n$ be a top-$k$ partial ranking, where $S_n$ denotes the space of permutations of $[n]$, and let $\left(\sigma_l, A_R(\sigma_l)\right)_{l=1}^{M}$ be antithetic pairs built from i.i.d. samples $\sigma_1, \dots, \sigma_M$ from the region $R$. The kernel antithetic Monte Carlo mean embedding is

$$\widehat{\phi}_a(R) = \frac{1}{2M}\sum_{l=1}^{M}\left[\phi(\sigma_l) + \phi\!\left(A_R(\sigma_l)\right)\right].$$

It is a consistent estimator of the embedding that corresponds to the marginalised kernel,

$$\phi(R) = \sum_{\sigma \in R} p(\sigma \mid R)\,\phi(\sigma). \qquad (9)$$
Proof

Since the estimator is a convex combination of Monte Carlo kernel mean embeddings, consistency follows.
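A sketch of the resulting antithetic estimator (our own code, reusing the conventions of the earlier snippets): each uniform sample from the consistent set is paired with its antithetic permutation, and the kernel estimate is the inner product of the two pooled empirical embeddings:

```python
import numpy as np
from math import exp

def kendall(sigma, tau):
    pos_s = {item: i for i, item in enumerate(sigma)}
    pos_t = {item: i for i, item in enumerate(tau)}
    items = list(sigma)
    return sum((pos_s[x] < pos_s[y]) != (pos_t[x] < pos_t[y])
               for a, x in enumerate(items) for y in items[a + 1:])

def mallows(sigma, tau, lam=1.0):
    return exp(-lam * kendall(sigma, tau))

def antithetic(sigma, k):
    sigma = list(sigma)
    return tuple(sigma[:k] + sigma[k:][::-1])

def antithetic_samples(topk, n, M, rng):
    """M uniform samples from the set consistent with the top-k ranking `topk`,
    each paired with its antithetic permutation (2M permutations in total)."""
    k, rest = len(topk), [x for x in range(n) if x not in topk]
    out = []
    for _ in range(M):
        sigma = tuple(topk) + tuple(int(x) for x in rng.permutation(rest))
        out.extend([sigma, antithetic(sigma, k)])
    return out

def antithetic_kernel_estimate(topk_i, topk_j, n, M=100, lam=1.0, seed=0):
    """Inner product of the two antithetic Monte Carlo mean embeddings,
    i.e. an antithetic analogue of the estimator in Eq. (7)."""
    rng = np.random.default_rng(seed)
    S_i = antithetic_samples(topk_i, n, M, rng)
    S_j = antithetic_samples(topk_j, n, M, rng)
    return float(np.mean([[mallows(s, t, lam) for t in S_j] for s in S_i]))

print(antithetic_kernel_estimate([4, 1], [0, 5, 2], n=8, M=50))
```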

In the next section, we present the main result about the estimator from Theorem 4.1, namely, that it has lower asymptotic variance than the Monte Carlo kernel estimator from Equation (7).

4.1 Variance of the antithetic kernel estimator

We now establish some basic theoretical properties of antithetic samples in the context of marginalised kernel estimation. In order to do so, we require a series of lemmas to derive the main result in Theorem 4.2 that guarantees that the antithetic kernel estimator has lower asymptotic variance than the Monte Carlo kernel estimator for the marginalised Mallows kernel.

The following result shows that antithetic permutations may be used to achieve coupled samples which are marginally distributed uniformly on the subset of $S_n$ corresponding to a top-$k$ partial ranking.

Lemma 1

If $R$ is a top-$k$ partial ranking and $\sigma$ is distributed uniformly on $R$, then $A_R(\sigma)$ is also distributed uniformly on $R$.

Proof

The proof is immediate from Remark 1, since $A_R$ is bijective on $R$.

Lemma 1 establishes a base requirement of an antithetic sample – namely, that it has the correct marginal distribution. In the context of antithetic sampling in Euclidean spaces, this property is often trivial to establish, but the discrete geometry of $S_n$ makes this property less obvious. Indeed, we next demonstrate that the condition of exhaustiveness of the partial ranking in Lemma 1 is necessary.

Example 1

Let $n = 3$, and consider the partial ranking $1 \succ 2$. Note that this is not an exhaustive partial ranking, as the element $3$ does not feature in the preference information. There are three full rankings consistent with this partial ranking, namely $1 \succ 2 \succ 3$, $1 \succ 3 \succ 2$, and $3 \succ 1 \succ 2$. Encoding these full rankings as permutations, as described in the correspondence outlined in Section 2, we obtain three permutations, which we respectively denote by $\sigma_1, \sigma_2, \sigma_3$. Specifically, we have

$$\sigma_1 = (1, 2, 3), \qquad \sigma_2 = (1, 3, 2), \qquad \sigma_3 = (3, 1, 2),$$

written as lists of items in ranking order. Under the right-invariant Kendall distance, we obtain pairwise distances given by

$$d(\sigma_1, \sigma_2) = 1, \qquad d(\sigma_1, \sigma_3) = 2, \qquad d(\sigma_2, \sigma_3) = 1,$$

so that $\sigma_3$ is the unique farthest permutation from $\sigma_1$ and vice versa, while both $\sigma_1$ and $\sigma_3$ attain the maximal distance from $\sigma_2$. Thus, the marginal distribution of an antithetic sample for the partial ranking $1 \succ 2$ places no mass on $\sigma_2$, and (splitting the tie at $\sigma_2$ evenly) half of its mass on each of $\sigma_1$ and $\sigma_3$, and is therefore not uniform over $\{\sigma_1, \sigma_2, \sigma_3\}$.

We further show that the condition of right-invariance of the metric is necessary in the next example.

Example 2

Let , and suppose is a distance on such that, with the notation introduced in Example 1, we have

Note that is not right-invariant, since

where is given by . Then note that an antithetic sample for the kernel associated with this distance and the partial ranking , is equal to with probability and the other two full rankings with probability

each, and therefore does not have a uniform distribution.

Examples 1 and 2 serve to illustrate the complexity of antithetic sampling constructions in discrete spaces.

The following two lemmas state some useful relationships between the distance between two permutations and the distance for the corresponding antithetic pair, in both the unconstrained and constrained cases, which correspond to not having and having partial ranking information, respectively.

Lemma 2

Let $\sigma, \sigma' \in S_n$, and let $A(\sigma)$ denote the full reversal of $\sigma$ (the antithetic permutation in the absence of any partial ranking information). Then,

$d(\sigma, \sigma') + d(A(\sigma), \sigma') = \binom{n}{2}$.

Proof

This is immediate from the interpretation of the Kendall distance as the number of discordant pairs between two permutations; a distinct pair $i, j \in [n]$ is discordant for $(\sigma, \sigma')$ iff it is concordant for $(A(\sigma), \sigma')$.

In fact, Lemma 2 generalises in the following manner.

Lemma 3

Let $R$ be a top-$k$ ranking $i_1 \succ i_2 \succ \cdots \succ i_k$, and let $\sigma, \sigma' \in R$. Then $d(\sigma, \sigma') + d(A_R(\sigma), \sigma') = \binom{n-k}{2}$.

Proof

As for the proof of Lemma 2, we use the “discordant pairs” interpretation of the Kendall distance. Note that if a distinct pair $x, y \in [n]$ has at least one of $x, y$ among the top-$k$ items $\{i_1, \dots, i_k\}$, then by virtue of the fact that $\sigma, \sigma', A_R(\sigma) \in R$, any pair of these permutations is concordant for $x, y$. Now observe that any distinct pair $x, y \notin \{i_1, \dots, i_k\}$ is discordant for $(\sigma, \sigma')$ iff it is concordant for $(A_R(\sigma), \sigma')$, from the construction of $A_R$ described in Remark 1. The total number of such pairs is $\binom{n-k}{2}$, so we have $d(\sigma, \sigma') + d(A_R(\sigma), \sigma') = \binom{n-k}{2}$, as required.

Next, we show that it is possible to obtain a unique closest element in a given partial ranking set $R$, denoted by $\pi_R(\sigma)$, with respect to any given permutation $\sigma \in S_n$. This is based on the usual generalisation of a distance between a set and a point (Dudley, 2002). We then use this closest element in Lemmas 5 and 6 to obtain useful decompositions of distance identities. Finally, in Lemma 7 we verify that the closest element is also distributed uniformly on a subset of the original set.

Lemma 4

Let $R$ be a top-$k$ partial ranking, and let $\sigma \in S_n$ be arbitrary. There is a unique closest element in $R$ to $\sigma$. In other words, $\operatorname{argmin}_{\nu \in R} d(\sigma, \nu)$ is a set of size 1.

Proof

We use the interpretation of the Kendall distance as the number of discordant pairs between two permutations. Let $R$ be the top-$k$ partial ranking given by $i_1 \succ \cdots \succ i_k$, with $T = \{i_1, \dots, i_k\}$, and let $\nu \in R$. We decompose the Kendall distance between $\sigma$ and $\nu$ as follows:

$$d(\sigma, \nu) = \sum_{\{x, y\} \subseteq T} \mathbb{1}\left[x, y \text{ discordant for } (\sigma, \nu)\right] + \sum_{x \in T,\ y \notin T} \mathbb{1}\left[x, y \text{ discordant for } (\sigma, \nu)\right] + \sum_{\{x, y\} \subseteq [n] \setminus T} \mathbb{1}\left[x, y \text{ discordant for } (\sigma, \nu)\right]. \qquad (10)$$

As $\nu$ varies in $R$, only some of these terms vary. In particular, it is only the third term that varies with $\nu$, and it is minimised, at the value $0$, by the permutation $\nu \in R$ which is in accordance with $\sigma$ on the set $[n] \setminus T$ of unranked items.

Definition 7

Let $R$ be a top-$k$ partial ranking. Let $\pi_R: S_n \to R$ be the map that takes a permutation $\sigma$ to the corresponding Kendall-closest permutation in $R$; by Lemma 4, this is well-defined.

Lemma 5 (Decomposition of distances)

Let $R$ be a top-$k$ partial ranking, $\sigma \in S_n$, and $\nu \in R$. We have the following decomposition of the distance $d(\sigma, \nu)$:

$$d(\sigma, \nu) = d(\sigma, \pi_R(\sigma)) + d(\pi_R(\sigma), \nu).$$

Proof

We compute directly with the discordant pairs definition of the Kendall distance. Again, let $R$ be the partial ranking $i_1 \succ \cdots \succ i_k$, and let $\nu \in R$. We decompose the Kendall distance between $\sigma$ and $\nu$ as before: