Kernel functions based on triplet comparisons

07/28/2016
by Matthäus Kleindessner, et al.

Given only information in the form of similarity triplets "Object A is more similar to object B than to object C" about a data set, we propose two ways of defining a kernel function on the data set. While previous approaches construct a low-dimensional Euclidean embedding of the data set that reflects the given similarity triplets, we aim at defining kernel functions that correspond to high-dimensional embeddings. These kernel functions can subsequently be used to apply any kernel method to the data set.


Introduction

Many machine learning algorithms are based on a notion of similarity between objects. The rationale is that objects that are similar to each other tend to have the same class label, belong to the same clusters, and so on. However, in complex tasks, like estimating suitable age ratings for movies, it can be hard to come up with a meaningful similarity function that can be evaluated automatically. On the other hand, in such scenarios humans often intuitively know which objects should be considered similar, and it is then natural to incorporate human expertise into the machine learning process. One way of doing so is to collect statements like "Movie A is more similar to movie B than to movie C" from people. We refer to such statements as similarity triplets and formally define them as binary answers to distance comparisons

d(x, y) < d(x, z) ?    (1)

where d is a symmetric dissimilarity function on some set X and x, y, z ∈ X. To simplify presentation, we may assume that either d(x, y) < d(x, z) or d(x, z) < d(x, y) holds true whenever y ≠ z. We refer to object x as the anchor object in the distance comparison (1). Note that a similarity triplet might be incorrect, meaning that it claims a positive answer to the comparison (1) although in fact it holds that d(x, y) > d(x, z). It is widely accepted that humans are better and more reliable at providing similarity triplets, that is, at assessing similarity on a relative scale, than at providing numerical similarity estimates on an absolute scale ("The similarity between movie A and movie B is 0.8"). Due to the emergence of crowdsourcing platforms, collecting similarity triplets has become much easier in recent years.

Given a data set D and similarity triplets for its objects, it is not obvious how to solve machine learning problems on D. A general approach is to construct an ordinal embedding of D, that is, to map the objects to a Euclidean space of small dimension such that the given similarity triplets are preserved as well as possible (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012; Kleindessner and von Luxburg, 2014; Terada and von Luxburg, 2014; Arias-Castro, 2015; Amid and Ukkonen, 2015). Once such an ordinal embedding is constructed, one can simply solve a problem on D by solving it on the embedding. Only recently have algorithms been proposed that solve a number of specific problems directly, without constructing an ordinal embedding in an intermediate step (Heikinheimo and Ukkonen, 2013; Ukkonen et al., 2015; Kleindessner and von Luxburg, 2016). With this paper we provide another generic means for solving machine learning problems based on similarity triplets that is different from the ordinal embedding approach. We define two data-dependent kernel functions on D, corresponding to high-dimensional embeddings of D, that can subsequently be used by any kernel method. Our proposed kernel functions measure similarity between two objects in D by comparing to which extent the two objects give rise to similar similarity triplets. The hope is that in doing so we quantify the relative difference in the locations of the two objects within D. Our experiments on both artificial and real data show that this is indeed the case and that the similarity scores defined by our kernel functions are meaningful.

Our kernel functions

Figure 1: Illustration of the idea behind our first kernel function. In order to compute a similarity score between the red point and the blue point, we rank all objects with respect to their distance from the red point and also with respect to their distance from the blue point and compute Kendall's τ between the two rankings; this value is the similarity score between the two points. For comparison, the similarity scores involving the green point are obtained in the same way, by additionally ranking the objects with respect to their distance from the green point.

Let X be an arbitrary set and d be a symmetric dissimilarity function on X: a higher value of d means that two elements of X are more dissimilar (or, likewise, less similar) to each other. Assume we are given a collection S of similarity triplets about the objects of some data set D ⊆ X comprising n objects, which are encoded as follows: an ordered triple (x, y, z) of distinct objects is interpreted as d(x, y) < d(x, z). Similarity triplets in S can be incorrect (compare with the previous section), but for the moment assume that contradicting triples (x, y, z) and (x, z, y) cannot be present in S at the same time. We will discuss how to deal with the general case below.

Our first kernel function is based on the following idea: We fix two objects x and x′. In order to compute a similarity score between x and x′ we would like to rank all objects in D with respect to their distance from x and also rank them with respect to their distance from x′, and take a similarity score between these two rankings as the similarity score between x and x′. One possibility to measure similarity between rankings is given by the famous Kendall tau correlation coefficient (Kendall, 1938), also known as Kendall's τ: for two rankings of n items, Kendall's τ between the two rankings is the fraction of concordant pairs of items minus the fraction of discordant pairs of items. Here, a pair of items i and j is concordant if both rankings order i and j in the same way, and discordant if one ranking places i before j while the other places j before i. Formally, a ranking is represented by a permutation π of {1, …, n}, where π(i) = l means that item i is ranked at the l-th position. Given two rankings π1 and π2, the fraction of concordant pairs equals

2 / (n(n−1)) · #{ (i, j) : 1 ≤ i < j ≤ n and (π1(i) − π1(j)) · (π2(i) − π2(j)) > 0 },    (2)

the fraction of discordant pairs equals

2 / (n(n−1)) · #{ (i, j) : 1 ≤ i < j ≤ n and (π1(i) − π1(j)) · (π2(i) − π2(j)) < 0 },    (3)

and Kendall’s  between and is given by

It has been established only recently that Kendall's τ is actually a kernel function on the set of total rankings (Jiao and Vert, 2015). Consequently, by measuring similarity between the two rankings of objects (one with respect to their distance from x and one with respect to their distance from x′) we would not only compute a similarity score between x and x′, but would even end up with a kernel function on D. This idea is illustrated with an example in Figure 1.
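For concreteness, the following minimal Python sketch computes the quantities (2) and (3) and their difference for two rankings; the function name and the representation of a ranking as a list of positions are illustrative choices, not taken from the paper.

```python
import itertools

def kendall_tau(rank1, rank2):
    """Kendall's tau between two rankings of the same n items.

    rank1[i] and rank2[i] are the positions of item i in the two rankings.
    Returns the fraction of concordant pairs (2) minus the fraction of
    discordant pairs (3).
    """
    n = len(rank1)
    concordant = discordant = 0
    for i, j in itertools.combinations(range(n), 2):
        s = (rank1[i] - rank1[j]) * (rank2[i] - rank2[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    num_pairs = n * (n - 1) / 2
    return (concordant - discordant) / num_pairs

# Two rankings of four items that disagree on exactly one pair:
# 5 concordant and 1 discordant pair out of 6, so tau = 4/6.
print(kendall_tau([0, 1, 2, 3], [0, 1, 3, 2]))
```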

In our situation, the problem is that in most cases S will contain only a small fraction of all possible similarity triplets and also that some of the triplets in S might be incorrect, so that there is no way of ranking all objects with respect to their distance from any fixed object based on the similarity triplets in S. We therefore have to adapt the procedure. For doing so we consider the feature map that corresponds to the kernel function just described. Recall that a feature map corresponding to a kernel function k is a mapping Φ from D to R^m for some m such that k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for all objects x and x′. It is easy to see from (2) and (3) (also compare with Jiao and Vert (2015)) that the feature map corresponding to the described kernel function is given by Φ with entries indexed by the pairs {y, z} of distinct objects in D (with a fixed order within each pair) and

Φ(x)_{y,z} = sign(d(x, z) − d(x, y)) / sqrt(n(n−1)/2).

In our situation, where we are only given S and cannot evaluate these entries in most cases, we have to replace Φ by an approximation: up to a normalizing factor, we simply set an entry of Φ(x) to zero if we cannot evaluate it based on the similarity triplets in S. More precisely, we consider the feature map Φ1 given by

Φ1(x)_{y,z} = 1/sqrt(a_x) · ( +1 if (x, y, z) ∈ S,  −1 if (x, z, y) ∈ S,  0 otherwise ),    (4)

where the entries are again indexed by the pairs {y, z} of distinct objects and a_x denotes the number of similarity triplets in S in which x appears as the anchor object,

and define our first proposed kernel function by

k1(x, x′) = ⟨Φ1(x), Φ1(x′)⟩.    (5)

Note that the scaling factor 1/sqrt(a_x) in the definition of Φ1 in (4), which ensures that the feature embedding lies on the unit sphere, is crucial whenever the number of similarity triplets in which an object appears as anchor object is not approximately constant over the different objects. Also note that we have to assume that every object in D appears at least once as an anchor object in a similarity triplet in S. Otherwise, Φ1 would not be well-defined since we would encounter a denominator equal to zero.
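As an illustration of (4) and (5), the following Python sketch builds the feature vectors Φ1 and the kernel matrix of k1 from a list of triplets given as ordered triples of object indices; the encoding and all names are assumptions made for this sketch, and contradicting or duplicate triples are assumed to have been removed beforehand.

```python
import numpy as np

def k1_kernel_matrix(triplets, n):
    """Kernel matrix of k_1 as in (4) and (5) (a sketch).

    Each triplet (x, y, z) with object indices in {0, ..., n-1} encodes
    d(x, y) < d(x, z), i.e., x is the anchor object.
    """
    # Index the unordered pairs {y, z} with y < z.
    pair_index = {}
    for y in range(n):
        for z in range(y + 1, n):
            pair_index[(y, z)] = len(pair_index)

    phi = np.zeros((n, len(pair_index)))
    for (x, y, z) in triplets:            # d(x, y) < d(x, z)
        entry = 1.0 if y < z else -1.0    # +1 if the smaller-indexed object is the closer one
        phi[x, pair_index[(min(y, z), max(y, z))]] = entry

    # Scale every feature vector onto the unit sphere; dividing by the norm
    # equals the 1/sqrt(a_x) factor of (4). Every object is assumed to
    # appear at least once as an anchor object.
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    return phi @ phi.T                    # entry (i, j) equals k_1(x_i, x_j)
```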

Figure 2: Illustration of the idea behind our second kernel function. In order to compute a similarity score between the red point and the blue point we would like to check, for every pair of other objects, whether the corresponding distance comparisons for the red point and for the blue point yield the same result or not. In this example, there are 32 pairs for which they yield the same result and 17 pairs for which they do not. Hence, we could assign (32 − 17)/49 = 15/49 as the similarity score between the red and the blue point. For comparison, the similarity scores involving the green point would be 3/49 and 1/49.

Our second kernel function is based on a similar idea; however, this time we do not consider x and x′ as anchor objects when measuring their similarity, but rather compare whether they rank similarly with respect to their distances from the various other objects. Concretely, up to normalization, we would like to count the number of pairs (y, z) of objects for which the distance comparisons

d(y, x) < d(y, z)    and    d(y, x′) < d(y, z)

yield the same result and subtract the number of pairs for which these comparisons yield different results. This idea is illustrated in Figure 2. Adapted to our situation, where we are only given a non-exhaustive collection of similarity triplets, it corresponds to considering the feature map Φ2 given by

Φ2(x)_{y,z} = 1/sqrt(b_x) · ( +1 if (y, x, z) ∈ S,  −1 if (y, z, x) ∈ S,  0 otherwise ),    (6)

where the entries are indexed by the ordered pairs (y, z) of distinct objects and b_x denotes the number of similarity triplets in S in which x appears, but not as the anchor object,

and defining our second proposed kernel function by

k2(x, x′) = ⟨Φ2(x), Φ2(x′)⟩.    (7)

Again, the scaling factor 1/sqrt(b_x) in the definition of Φ2 in (6) is crucial whenever some objects occur in more similarity triplets than others. Here we have to assume that for every object in D there is at least one similarity triplet in S in which the object appears, but not as an anchor object.
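Analogously, a minimal sketch of (6) and (7) under the same assumed triplet encoding:

```python
import numpy as np

def k2_kernel_matrix(triplets, n):
    """Kernel matrix of k_2 as in (6) and (7) (a sketch)."""
    # Entries of Phi_2 are indexed by ordered pairs (anchor, other object).
    pair_index = {}
    for y in range(n):
        for z in range(n):
            if y != z:
                pair_index[(y, z)] = len(pair_index)

    phi = np.zeros((n, len(pair_index)))
    for (x, y, z) in triplets:             # d(x, y) < d(x, z)
        phi[y, pair_index[(x, z)]] = 1.0   # y lies inside the ball around x of radius d(x, z)
        phi[z, pair_index[(x, y)]] = -1.0  # z lies outside the ball around x of radius d(x, y)

    # The 1/sqrt(b_x) factor of (6); every object is assumed to appear in at
    # least one triplet as a non-anchor object.
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    return phi @ phi.T                     # entry (i, j) equals k_2(x_i, x_j)
```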

Contradicting similarity triplets

If S contains contradicting triples (x, y, z) and (x, z, y), and there might be triples occurring repeatedly, one can either replace contradicting and multiple triples by just one triple representing a "majority decision" (if there is a tie, remove the triples completely), or one can alter the definition of Φ1 or Φ2 as follows: if s_xyz denotes how often the triple (x, y, z) occurs in S, replace the ±1 entry of Φ1(x) for the pair {y, z} by a value that weighs s_xyz against s_xzy (before normalization). The definition of Φ2 can be revised in an analogous way. In doing so, we incorporate a simple estimate of the likelihood of a triple being correct.
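One concrete way of implementing this weighting, chosen here for illustration and not necessarily identical to the formula used in the paper, is to set the (unnormalized) entry to the normalized difference of the two counts:

```python
from collections import Counter

def weighted_entry(counts, x, y, z):
    """Entry of Phi_1(x) for the pair {y, z} when triples may repeat or contradict.

    counts is a Counter over ordered triples. With s_xyz occurrences of
    (x, y, z) and s_xzy occurrences of (x, z, y), the entry becomes
    (s_xyz - s_xzy) / (s_xyz + s_xzy), an affine transformation of the
    empirical probability that d(x, y) < d(x, z) is the correct answer.
    It reduces to the +1/-1 entries used before when only one of the two
    contradicting triples occurs.
    """
    s_xyz, s_xzy = counts[(x, y, z)], counts[(x, z, y)]
    total = s_xyz + s_xzy
    return 0.0 if total == 0 else (s_xyz - s_xzy) / total
```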

Combining the two kernel functions

Clearly, one can combine k1 and k2 in order to obtain further kernel functions: for parameters α, β ≥ 0 we can define a combined kernel as

α · k1 + β · k2,

and there are further standard ways of building new kernel functions from existing ones, for example taking products or positive combinations of kernels (Shawe-Taylor and Cristianini, 2004).

Reducing diagonal dominance

If the number of given similarity triplets is small, our kernel functions suffer from a problem that is shared by many other kernel functions defined on complex data: the feature maps Φ1 and Φ2 map the objects in D to sparse vectors, that is, vectors in which almost all entries are zero. As a consequence, two different feature vectors Φi(x) and Φi(x′) appear to be almost orthogonal, and the similarity score ki(x, x′) is much smaller than the self-similarity scores ki(x, x) or ki(x′, x′). This phenomenon, usually referred to as diagonal dominance of the kernel function, has been observed to pose difficulties for the kernel methods the kernel function is applied to, and several ways have been proposed for dealing with it (Schölkopf et al., 2002; Greene and Cunningham, 2006). In all our experiments we deal with diagonal dominance in the following simple way: Let k denote a kernel function and K the kernel matrix on D, that is, the matrix with entries k(x, x′) for all pairs of objects, which would be the input to a kernel method. Then we replace K by K − λ_min · I, where I denotes the identity matrix and λ_min is the smallest eigenvalue of K. Note that λ_min is non-negative and that it is the largest number that we can subtract from the diagonal of K such that the resulting matrix is still positive semi-definite and hence can be used by a kernel method.
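In code, this correction amounts to a single eigenvalue computation (a sketch; the function name is illustrative):

```python
import numpy as np

def reduce_diagonal_dominance(K):
    """Subtract the smallest eigenvalue of K from its diagonal.

    For a positive semi-definite kernel matrix K the smallest eigenvalue is
    non-negative, and K minus that multiple of the identity is the most
    aggressive diagonal shift that keeps the matrix positive semi-definite.
    """
    lam_min = np.linalg.eigvalsh(K).min()
    return K - lam_min * np.eye(K.shape[0])
```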

Meaningfulness of our kernel functions

Figure 3: The kernel function k1 measures similarity between two objects essentially by counting in how many of the halfspaces obtained from distance comparisons the two objects reside at the same time. The outcome does not only depend on the distance between the two objects but also on their location within the data set: although the two points in question are located far apart from each other, the kernel function considers them to be very similar. See the running text for details.

Intuitively, our kernel functions measure similarity between x and x′ by quantifying to which extent x and x′ can be expected to be located in the same region of the data space: Think of D as being a subset of a Euclidean space and d being the Euclidean metric. A similarity triplet (x, y, z) then tells us that x resides in the halfspace defined by the hyperplane that is perpendicular to the line segment connecting y and z and goes through the segment's midpoint, namely the halfspace containing y. If there is also a similarity triplet (x′, y, z), then x and x′ are located in the same halfspace (assuming the correctness of the similarity triplets), and this is reflected by a higher value of k1(x, x′). Similarly, a similarity triplet (y, x, z) tells us that x is located in a ball with radius d(y, z) centered at y, and the value of k2(x, x′) is higher if there is a similarity triplet telling us that x′ is located in this ball too, and it is smaller if there is a similarity triplet telling us that x′ is not located in this ball. Note that the similarity scores between x and x′ defined by our kernel functions do not only depend on the distance between x and x′, but also on how the points in D are spread in the space and on the locations of x and x′ within D, since this affects how the various hyperplanes and balls are related to each other. Consider the example illustrated in Figure 3, in which the number of points can be arbitrarily large: there are two points that are located at the maximal distance from each other within the data set, and yet they jointly lie in all the halfspaces obtained from the distance comparisons in question; these halfspaces can even be arranged as a sequence of increasing subsets. It is easy to see that these two points receive a high value of k1, assuming k1 is computed based on all possible similarity triplets, all of which are correct. On the other hand, there are two points whose distance is much smaller, but there are many points of the data set in between them, and the hyperplanes obtained from the distance comparisons with these points separate the two, so that their value of k1 ends up being small. Depending on the task at hand, this may be desirable or not.
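The halfspace interpretation can be checked directly in the Euclidean case; the following small numerical sanity check (not taken from the paper) verifies that d(x, y) < d(x, z) is equivalent to x lying on y's side of the perpendicular bisector of y and z:

```python
import numpy as np

rng = np.random.default_rng(0)
y, z = rng.normal(size=(2, 2))           # two reference points in the plane
for _ in range(1000):
    x = rng.normal(size=2)
    closer_to_y = np.linalg.norm(x - y) < np.linalg.norm(x - z)
    # ||x - y||^2 < ||x - z||^2  <=>  2 x.(z - y) < ||z||^2 - ||y||^2,
    # which is a halfspace condition in x.
    in_halfspace = 2 * x @ (z - y) < z @ z - y @ y
    assert closer_to_y == in_halfspace
```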

Figure 4: Kernel matrices for various data sets consisting of 400 points, based on 10% of all similarity triplets. 1st plot: The data points. 2nd plot: The negated distance matrix. 3rd/4th/5th plot: The kernel matrices corresponding to k1, k2 and the combined kernel. 6th plot: Similarity scores (encoded by color) between a fixed point (shown as a black cross) and the other points, based on one of the kernel functions.

Let us examine the meaningfulness of our kernel functions by calculating them on several visualizable data sets. Each data set consists of 400 points in the plane, and d equals the Euclidean metric. We computed the kernel functions k1, k2 and a combination of the two based on 3,176,040 similarity triplets that were chosen uniformly at random without replacement from the set of all possible similarity triplets (the number of chosen triplets corresponds to 10% of all triplets). The results are shown in Figure 4. Each row corresponds to one data set. The first plot of a row shows the data set. The second plot shows the negated distance matrix on the data set. Next, we can see the kernel matrices. The last plot of a row shows the similarity scores between one fixed point (shown as a black cross) and the other points in the data set. Clearly, the kernel matrices reflect the block structures of the negated distance matrices. As hoped for, the similarity scores decrease as the distance from the fixed point increases. A situation like in the example of Figure 3 does not seem to occur.

Computational complexity

A naive implementation of our kernel functions explicitly computes the feature vectors Φ1(x) or Φ2(x) for all objects x in D, and subsequently calculates the kernel matrix by means of (5) or (7). In doing so, we store the feature vectors as the rows of a feature matrix F1 or F2. Proceeding this way is straightforward and simple, requiring to go through S only once, but computing the kernel matrix then comes at a computational cost of the order of n^4 operations, since the feature vectors have length of order n^2. Note that the number of different distance comparisons of the form (1) is n(n−1)(n−2)/2, so for all but small data sets one might expect S to contain only a small fraction of all possible triplets and the feature vectors to be sparse. By performing (5) or (7) in terms of the matrix multiplication F1·F1^T or F2·F2^T and applying Strassen's algorithm (Higham, 1990) one can slightly reduce the number of operations, but such a naive implementation is still infeasible for any somewhat larger data set. However, this is also the case for ordinal embedding algorithms, which are the current state-of-the-art methods for solving problems based on similarity triplets. All existing ordinal embedding algorithms iteratively solve an optimization problem. For none of these algorithms are theoretical bounds on their complexity available in the literature, but simulations in Kleindessner and von Luxburg (2016) show that their running times are tremendously high. We will also see in the experiments below that computing our kernel functions actually takes much less time than computing an ordinal embedding.

We believe that our kernel functions bear some potential regarding efficiency that should be examined in future work: On the one hand, if the number of given similarity triplets |S| is small, then the feature matrix F1 or F2 is sparse with only |S| non-zero entries, and methods for sparse matrix multiplication decrease the computational complexity (Yuster and Zwick, 2005). On the other hand, if S is assumed to have some particular "structure", that is, one knows which distance comparisons were evaluated for the similarity triplets in S and they are sorted in an appropriate way, one can do better than the naive implementation and compute the kernel matrix by going through S on the fly, without explicitly computing the feature vectors Φ1(x) or Φ2(x). Similarly, considering an active setting instead of the batch setting assumed in this paper (compare with the next section), by applying an appropriate query strategy one might be able to calculate the kernel matrix simultaneously to querying similarity triplets, without computing the feature vectors. However, a crucial question regarding all these suggestions is how small |S| can be such that the kernel functions are still meaningful (also compare with the next section).
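As a small illustration of the first point, the k1 computation from above can be carried out with a sparse feature matrix. This is only a sketch under the same assumed triplet encoding (SciPy's sparse module is used here, which is not mentioned in the paper), with contradicting and duplicate triples assumed to have been removed:

```python
import numpy as np
from scipy import sparse

def k1_kernel_matrix_sparse(triplets, n):
    """Sparse variant of the naive k_1 computation (a sketch).

    The feature matrix has only one non-zero entry per triplet, so it is
    stored in sparse format and the kernel matrix is obtained as F F^T.
    """
    rows, cols, vals = [], [], []
    for (x, y, z) in triplets:                    # d(x, y) < d(x, z)
        a, b = (y, z) if y < z else (z, y)
        rows.append(x)
        cols.append(a * n + b)                    # index of the pair {a, b} with a < b
        vals.append(1.0 if y < z else -1.0)
    F = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n * n))
    # Row normalization (the 1/sqrt(a_x) factor), then K = F F^T.
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    F = sparse.diags(1.0 / norms) @ F
    return (F @ F.T).toarray()
```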

Related work and further background

In the introduction we have defined similarity triplets as binary answers to distance comparisons of the form "Is d(x, y) < d(x, z)?". As such they are a special case of answers to the more general distance comparisons "Is d(x1, x2) < d(x3, x4)?". We refer to any collection of answers to these general comparisons as ordinal distance information. In recent years, a whole new branch of the machine learning literature has emerged that deals with ordinal distance information, both on the empirical and on the theoretical side. Among the work on ordinal distance information in general (see Kleindessner and von Luxburg (2016) for references), similarity triplets have received particular attention, mainly because they can easily be gathered from humans: Given vector data and similarity triplets for it, Schultz and Joachims (2003) learn a parameterized distance function that best explains the given triplets. Jamieson and Nowak (2011) deal with the important question of how many similarity triplets are required for uniquely determining an ordinal embedding of Euclidean data. Algorithms for constructing an ordinal embedding based on similarity triplets (but not on general ordinal distance information) are proposed in Tamuz et al. (2011) and van der Maaten and Weinberger (2012). Although the title of Tamuz et al. (2011) bears resemblance to the title of our paper, the aim of their paper is rather different from ours: by kernel they mean a kernel function corresponding to a low-dimensional ordinal embedding as feature map. Heikinheimo and Ukkonen (2013) present a method for medoid estimation based on statements "Object x is the outlier within the triple of objects (x, y, z)", each of which corresponds to the two similarity triplets d(y, z) < d(y, x) and d(z, y) < d(z, x), and subsequently apply it to devise a crowdclustering algorithm. In Ukkonen et al. (2015), the authors use the same kind of statements for density estimation. Particularly useful for working with similarity triplets in practice might be the work by Wilber et al. (2014), who examine how to minimize time and costs when collecting similarity triplets via crowdsourcing. Producing a number of ordinal embeddings at the same time, each corresponding to a different dissimilarity function based on which a distance comparison (1) might have been evaluated, is studied in Amid and Ukkonen (2015). Kleindessner and von Luxburg (2016) propose algorithms for several problems based on statements "Object x is the most central object within the triple of objects (x, y, z)", each of which comprises the two similarity triplets d(y, x) < d(y, z) and d(z, x) < d(z, y).

Despite the considerable number of papers, to the best of our knowledge we are the first to propose kernel functions that can be evaluated given only ordinal distance information about a data set (and that do not arise as a byproduct of an ordinal embedding) and thus make it possible to apply any kernel method to the data set. Let us make some more comments about ordinal distance information and similarity triplets.

Some remarks on ordinal distance information and similarity triplets

Dealing with ordinal distance information comes with a critical drawback compared to the standard setting of cardinal distance information: while for a data set comprising n objects there are in total "only" n(n−1)/2 distances between objects, there are of the order of n^4 different distance comparisons involving four objects at a time and still n(n−1)(n−2)/2 different comparisons involving three objects (corresponding to similarity triplets). Unless n is rather small, in practice it is thus prohibitive to collect answers to all possible comparisons. However, the hope is that not all distance comparisons are required and that relatively few of them already contain the bulk of the usable information, due to high redundancy. This gives rise to distinguishing between a batch setting, in which one is given an arbitrary collection of similarity triplets for a data set as we assume in this paper, and an active setting, in which one can query similarity triplets and aims at reducing their number by asking only for answers to the most informative distance comparisons (Jamieson and Nowak, 2011; Tamuz et al., 2011). Furthermore, the performance of any method based on similarity triplets has to be examined with respect to the number of similarity triplets it is provided as input. The performance also has to be examined with respect to the amount of incorrect similarity triplets it can deal with: a method is the more useful the fewer similarity triplets it requires and the larger the fraction of incorrect ones it can tolerate while still producing a correct result. In the next section we experimentally examine our proposed kernel functions with respect to these two quantities.

Experiments

We performed several experiments on both artificial and real data that demonstrate the meaningfulness and usefulness of our proposed kernel functions.

Artificial data

We applied the kernel functions k1, k2 and a combination of the two to various artificial data sets and different kernel methods. We compared our proposed approaches, that is, applying one of our kernel functions within a kernel method, to straightforward ordinal embedding approaches. For constructing the ordinal embedding we used the SOE (soft ordinal embedding) algorithm (Terada and von Luxburg, 2014) and made use of its R implementation provided by the authors. In doing so, we set all parameters except the dimension of the embedding space to the provided default values. Collections S of similarity triplets, provided as input to both our kernel functions and the SOE algorithm, were created as follows: Based on a dissimilarity function d, which we specify in each experiment, we created answers to all possible distance comparisons (1) (their number is n(n−1)(n−2)/2). Answers were incorrect with some error probability, independently of each other. We then chose the similarity triplets in S from this set of answers uniformly at random without replacement. An answer being incorrect with a given error probability means that a comparison (1) whose true answer is d(x, y) < d(x, z) yields the answer d(x, z) < d(x, y) with that probability.
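The following Python sketch mirrors this triplet-generation procedure for Euclidean points (it illustrates the described protocol, is not the original R code, and is only practical for moderate n since it enumerates all comparisons):

```python
import itertools
import numpy as np

def sample_triplets(X, fraction, error_prob, rng):
    """Noisy similarity triplets from an array of points X under the Euclidean metric.

    All n(n-1)(n-2)/2 comparisons (1) are answered, every answer is flipped
    independently with probability error_prob, and a uniform subsample of
    the requested size is returned as ordered triples. For n = 400 there
    are 400*399*398/2 = 31,760,400 comparisons, so 10% of them are the
    3,176,040 triplets mentioned for Figure 4.
    """
    n = len(X)
    answers = []
    for x in range(n):
        others = [i for i in range(n) if i != x]
        for y, z in itertools.combinations(others, 2):
            closer = np.linalg.norm(X[x] - X[y]) < np.linalg.norm(X[x] - X[z])
            if rng.random() < error_prob:         # incorrect answer
                closer = not closer
            answers.append((x, y, z) if closer else (x, z, y))
    size = int(fraction * len(answers))
    chosen = rng.choice(len(answers), size=size, replace=False)
    return [answers[i] for i in chosen]
```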

Figure 5: Clustering 300 points from a mixture of three Gaussians in the plane, with d being the Euclidean metric. Purity (8) (average over 100 runs) for kernel k-means based on our three kernel functions (shown in blue, red and pink) and for ordinary k-means applied to the SOE embedding of the data points (in black). 1st/2nd plot: Purity as a function of the fraction of input similarity triplets relative to the number of all possible similarity triplets. 3rd/4th plot: Purity as a function of the error probability.
Figure 6: Clustering 300 points from a mixture of three Gaussians in the plane, with d being the shortest-path distance in the unweighted Euclidean minimum spanning tree on the points. 1st/2nd plot: Purity (8) (average over 100 runs) as a function of the fraction of input similarity triplets relative to the number of all possible similarity triplets, for kernel k-means based on our three kernel functions (shown in blue, red and pink) and for ordinary k-means applied to the SOE embedding of the data points (in black). 3rd/4th plot: Running time in seconds (average over 100 runs) for computing k1 and k2 (in blue and red) and the SOE embedding (in black) as a function of the fraction of input similarity triplets (all computations performed in R).
Figure 7: Various embeddings of 800 USPS digits with d equaling the Euclidean metric. First row: PCA on the original 800 USPS digits, the ordinal embedding, and PCA on the ordinal embedding. Second row: Kernel PCA based on k1, on k2 and on the combined kernel. The ordinal embedding and the kernel functions were computed based on 1% of all possible similarity triplets, which were subject to a nonzero error probability.
Figure 8: Clustering of the food data set. Part of the dendrogram obtained from complete-linkage clustering with one of the proposed kernel functions.

Clustering

We began with clustering a simple data set consisting of 300 points from a mixture of three Gaussians in the plane, similar to the one shown in the first row of Figure 4. The dissimilarity function d equaled the Euclidean metric. For assessing the quality of a clustering we computed its purity. Purity measures the accordance between an inferred clustering and a known ground-truth partitioning of a data set as follows: if the data set consists of classes L1, …, Lm that we would like to recover and the clustering comprises clusters C1, …, Cl, then the purity of the clustering is given by

purity = (1/n) · Σ_{j=1,…,l} max_{i=1,…,m} |Cj ∩ Li|.    (8)

We always have 0 < purity ≤ 1, and a high value indicates a good clustering. Figure 5 shows the purity of the clusterings obtained from kernel k-means (Dhillon et al., 2001) based on our kernel functions and from ordinary k-means applied to the SOE embedding. The results were averaged over 100 runs. We always provided the correct number of clusters, that is three, as input. The dimension of the space of the SOE embedding was correctly chosen as two. In the first and the second plot we can study the purity of the various clusterings as a function of the fraction of provided input similarity triplets relative to the number of all possible similarity triplets. For the smaller error probability (first plot), all considered methods seem to perform similarly; only when they are provided as little as 1% of all similarity triplets as input does the embedding approach seem to be slightly superior. The situation is different for the larger error probability (second plot). In this case the embedding approach clearly outperforms our kernel functions, which can compete only when the fraction of input similarity triplets is 10% or larger. But note that this data set, being Euclidean, is ideally suited for the embedding approach. In the third and the fourth plot we can see the purity of the various clusterings as a function of the error probability when either 5% (third plot) or 20% (fourth plot) of all possible similarity triplets were provided as input. As expected, for all methods the purity decreases as the error probability increases from 0 towards 0.5. Interestingly, the purity of kernel k-means based on our kernel functions then increases again. In hindsight, this is not surprising: if the error probability equals 1 and thus every similarity triplet is incorrect, we simply end up with the feature map −Φ1 or −Φ2, where Φ1 and Φ2 are the feature maps based on only correct triplets, and hence with the same kernel function as for error probability 0. When only 5% of all similarity triplets are provided as input, the embedding approach outperforms our kernel functions in the sense that it yields an almost perfect result even for larger error probabilities, while for our kernel functions this is only the case for small error probabilities. For a larger number of input similarity triplets, namely 20% of all triplets, the difference between the performance of the embedding approach and our kernel functions becomes smaller. This is in agreement with the first and second plot.
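For reference, purity (8) is straightforward to compute; the following is a minimal sketch assuming integer class and cluster labels (the function name is illustrative):

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Purity (8) of a clustering against a ground-truth partitioning."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        total += np.bincount(members).max()   # size of the best-matching class within the cluster
    return total / len(cluster_labels)
```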

Clustering non-Euclidean data

In this experiment we used the same set of points as in the previous one, that is, 300 points from a mixture of three Gaussians in the plane, but this time the dissimilarity function d did not equal the Euclidean metric. Instead, we chose d as the shortest-path distance in the Euclidean minimum spanning tree on the data points. For computing the shortest-path distance we considered the spanning tree as an unweighted graph. The first and the second plot in Figure 6 show the purity (average over 100 runs) of the clusterings obtained from kernel k-means based on k1, k2 and the combined kernel and from ordinary k-means applied to the SOE embedding, as a function of the fraction of input similarity triplets relative to the number of all possible similarity triplets. Again, we provided the correct number of clusters, that is three, as input, and the dimension of the space of the SOE embedding was chosen as two. The results look similar to the ones of the previous experiment, although both the embedding approach and our kernel functions perform worse than in the Euclidean case. For the smaller error probability (first plot), all methods seem to perform similarly, with kernel k-means based on one of our kernel functions being slightly superior. For the larger error probability (second plot), our kernel functions require a significantly higher number of similarity triplets in order to produce a good clustering. Our kernel functions seem to be more sensitive to noise in the similarity triplets than the embedding approach, at least in these experiments.

Running time

In order to make a fair comparison between the running times for computing our kernel functions and for computing an ordinal embedding, we implemented our kernel functions k1 and k2 in R. The third and the fourth plot in Figure 6 show the running times for computing k1 or k2 using our R implementations in the course of the previous experiment. The plots also show the running time for computing the SOE embedding using the R implementation provided by Terada and von Luxburg (2014). The computations were performed on an iMac with a 2.9 GHz Intel Core i5 and 8 GB 1600 MHz DDR3. We can see that computing our kernel functions is approximately ten times faster than computing the SOE embedding in the setting of the third plot, and the difference is even more pronounced in the setting of the fourth plot.

Figure 9: Kernel PCA on the car data set. Projection onto the first two principal components based on one of the proposed kernel functions.

Principal component analysis

We used our kernel functions k1, k2 and their combination to apply kernel PCA (Schölkopf et al., 1999) to a randomly chosen subset of 800 USPS digits. Similarity triplets were created based on d equaling the Euclidean metric. The second row of Figure 7 shows the projections of the data points onto the first two principal components when 1% of all possible similarity triplets was used and the triplets were subject to a nonzero error probability. For comparison, the first row shows the ordinary PCA embedding of the data points (using their coordinates), a two-dimensional ordinal embedding computed with the SOE algorithm from the same similarity triplets that we provided to our kernel functions, and the ordinary PCA embedding of the points of this ordinal embedding. In our opinion, none of the shown embeddings looks superior. In all of them, the data points are somewhat clustered according to their labels, but there is also significant overlap. All embeddings looked similar when we increased the number of input similarity triplets and also when we decreased the error probability (plots omitted). The only notable difference was that in one of these settings the dark blue cluster in the kernel PCA embeddings was compressed in the same way as in the other embeddings.
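Kernel PCA itself only needs a kernel matrix as input; the following is a standard textbook sketch (not the implementation used for the experiments) of how the two-dimensional projections can be obtained from a kernel matrix such as the ones produced by k1 or k2:

```python
import numpy as np

def kernel_pca_embedding(K, dim=2):
    """Project onto the leading principal components given only a kernel matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering in feature space
    Kc = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(Kc)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```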

Real data

Food data set

We applied the kernelized version of complete-linkage clustering based on our proposed kernel functions to the food data set introduced in Wilber et al. (2014). This data set consists of 100 images from a wide range of foods. (According to the authors, the data set contains copyrighted material under the educational fair use exemption to the U.S. copyright law.) Encoding similarity triplets by ordered triples as in the second section, the authors have gathered a total of 190,376 unique triples (which corresponds to about 39% of the number of all possible similarity triplets) via crowdsourcing. However, there are 9,349 contradicting pairs of triples. Figure 8 shows a part of the dendrogram that we obtained when working with one of the kernel functions. Each of the ten clusters depicted there contains quite homogeneous images. For example, the last row only shows desserts, whereas the fourth row only shows vegetables and salads. The same was true when working with the other kernel functions (figures omitted), and we consider the result to provide supporting evidence for the meaningfulness of our kernel functions.

Car data set

In this experiment, we applied kernel PCA to the car data set introduced in Kleindessner and von Luxburg (2016). It consists of 60 images of cars, and the authors have collected statements of the kind "Object x is the most central object within the triple of objects (x, y, z)" for these images. One such statement comprises two similarity triplets: d(y, x) < d(y, z) and d(z, x) < d(z, y). Working with the statements in T_All (one of four collections of statements provided by the authors) and encoding similarity triplets by ordered triples, we ended up with 13,514 triples, of which 12,502 were unique (corresponding to about 12% of the number of all possible similarity triplets). Again, there was a number of contradicting pairs of triples. The projection of the data set onto the first two principal components can be seen in Figure 9. For producing this figure we used one of our kernel functions, but the other two yielded similar projections (figures omitted). The result looks quite reasonable, with the cars obviously arranged in groups of sports cars (top left), ordinary cars (middle right) and off-road/sport utility vehicles (bottom left). Also within these groups there is some reasonable structure. For example, the race-like sports cars are located near to each other and close to the Formula One car, and the red cars at the top are strikingly close. Again, we consider this experiment as supporting our claim that our proposed kernel functions are meaningful and useful.

Discussion

In this paper we have proposed kernel functions that can be evaluated given only similarity triplets about a data set D. These kernel functions can be used with any kernel method in order to solve various machine learning problems. As opposed to existing methods, our kernel functions correspond to high-dimensional embeddings of D. Their construction aims at quantifying the relative difference in the locations of two objects within D. In a number of experiments we have demonstrated the meaningfulness of our kernel functions. Even with a naive implementation, they run much faster than current state-of-the-art methods.

Acknowledgements

This work has been supported by the Institutional Strategy of the University of Tübingen (Deutsche Forschungsgemeinschaft, ZUK 63).

References

  • Agarwal et al. (2007) S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
  • Amid and Ukkonen (2015) E. Amid and A. Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In International Conference on Machine Learning (ICML), 2015.
  • Arias-Castro (2015) E. Arias-Castro. Some theory for ordinal embedding. arXiv:1501.02861 [math.ST], 2015.
  • Dhillon et al. (2001) I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In International Conference on Knowledge Discovery and Data Mining (KDD), 2001.
  • Greene and Cunningham (2006) D. Greene and P. Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In International Conference on Machine Learning (ICML), 2006.
  • Heikinheimo and Ukkonen (2013) H. Heikinheimo and A. Ukkonen. The crowd-median algorithm. In Conference on Human Computation and Crowdsourcing (HCOMP), 2013.
  • Higham (1990) N. Higham. Exploiting fast matrix multiplication within the level 3 BLAS. ACM Transactions on Mathematical Software, 16(4):352–368, 1990.
  • Jamieson and Nowak (2011) K. Jamieson and R. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Allerton Conference on Communication, Control, and Computing, 2011.
  • Jiao and Vert (2015) Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. In International Conference on Machine Learning (ICML), 2015.
  • Kendall (1938) M. Kendall. A new measure of rank correlation. Biometrika, 30(1–2):81–93, 1938.
  • Kleindessner and von Luxburg (2014) M. Kleindessner and U. von Luxburg. Uniqueness of ordinal embedding. In Conference on Learning Theory (COLT), 2014.
  • Kleindessner and von Luxburg (2016) M. Kleindessner and U. von Luxburg. Lens depth function and k-relative neighborhood graph: versatile tools for ordinal data analysis. arXiv:1602.07194 [stat.ML]. Data available on http://www.wsi.uni-tuebingen.de/lehrstuehle/theory-of-machine-learning/people/matthaeus-kleindessner.html, 2016.
  • Schölkopf et al. (1999) B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 327–352. MIT Press, Cambridge, MA, 1999.
  • Schölkopf et al. (2002) B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W. Noble. A kernel approach for learning from almost orthogonal patterns. In European Conference on Machine Learning (ECML), 2002.
  • Schultz and Joachims (2003) M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Neural Information Processing Systems (NIPS), 2003.
  • Shawe-Taylor and Cristianini (2004) J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, 2004.
  • Tamuz et al. (2011) O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. Kalai. Adaptively learning the crowd kernel. In International Conference on Machine Learning (ICML), 2011.
  • Terada and von Luxburg (2014) Y. Terada and U. von Luxburg. Local ordinal embedding. In International Conference on Machine Learning (ICML), 2014. Code available on https://cran.r-project.org/web/packages/loe.
  • Ukkonen et al. (2015) A. Ukkonen, B. Derakhshan, and H. Heikinheimo. Crowdsourced nonparametric density estimation using relative distances. In Conference on Human Computation and Crowdsourcing (HCOMP), 2015.
  • van der Maaten and Weinberger (2012) L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2012.
  • Wilber et al. (2014) M. Wilber, I. Kwak, and S. Belongie. Cost-effective hits for relative similarity comparisons. In Conference on Human Computation and Crowdsourcing (HCOMP), 2014. Data available on http://vision.cornell.edu/se3/projects/cost-effective-hits/.
  • Yuster and Zwick (2005) R. Yuster and U. Zwick. Fast sparse matrix multiplication. ACM Transactions on Algorithms, 1(1):2–13, 2005.