Classification from Triplet Comparison Data

07/24/2019, by Zhenghang Cui et al.

Learning from triplet comparison data has been extensively studied in the context of metric learning, where we want to learn a distance metric between two instances, and ordinal embedding, where we want to learn an embedding of the given instances in a Euclidean space that preserves the comparison order as well as possible. Unlike fully labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Although learning from triplet comparison data has been considered in many applications, the fundamental question of whether we can learn a classifier only from triplet comparison data has remained unanswered. In this paper, we give a positive answer to this important question by proposing an unbiased estimator of the classification risk under the empirical risk minimization framework. Since the proposed method is based on the empirical risk minimization framework, it inherits the advantage that any surrogate loss function and any model, including neural networks, can be easily applied. Furthermore, we theoretically establish an estimation error bound for the proposed empirical risk minimizer. Finally, we provide experimental results showing that our method works well empirically and outperforms various baseline methods.


Code repository: triplet_classification — implementation for "Classification from Triplet Comparison Data".

1 Introduction

Recently, learning from comparison-feedback data has received increasing attention (Heim, 2016; Kleindeßner, 2017). It is usually argued that humans perform better at judging which instances are similar than at identifying each individual instance (Stewart et al., 2005). It is also argued that humans can achieve much better and more reliable performance when assessing similarity on a relative scale ("Instance A is more similar to instance B than to instance C") rather than on an absolute scale (assigning numerical similarity scores to the pairs (A, B) and (A, C)) (Kleindeßner, 2017). Collecting data in this manner has the advantage of avoiding the problem caused by individuals' different assessment scales. Moreover, in some applications, e.g., sensor localization (Liu et al., 2004), the collected absolute similarity scores may only be informative at the level of comparisons. It was shown that keeping only the relative comparison information can help an algorithm be resilient against measurement errors and achieve high accuracy (Xiao et al., 2006).

In this paper, we focus on the problem of learning from triplet comparison data, which is a common form of comparison-feedback data. A triplet comparison $(x_1, x_2, x_3)$ contains the information that instance $x_1$ is more similar to $x_2$ than to $x_3$. As one example, search-engine query logs can readily provide feedback in the form of triplet comparisons (Schultz and Joachims, 2004). Given a list of website links for a query, if links $A$ and $B$ are clicked and link $C$ is not clicked, we can formulate a triplet comparison as $(A, B, C)$.
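As a rough illustration of the click-log example above, the following sketch harvests such triplets from a single query's results; the data format and function name are hypothetical, not taken from the paper.

```python
# Build triplet comparisons (a, b, c) -- "a is more similar to b than to c" --
# from one query's click feedback: every ordered pair of clicked links forms
# (a, b), and every unclicked link serves as c.
def triplets_from_clicks(results, clicked):
    clicked_links = [r for r in results if r in clicked]
    skipped_links = [r for r in results if r not in clicked]
    triplets = []
    for a in clicked_links:
        for b in clicked_links:
            if a == b:
                continue
            for c in skipped_links:
                triplets.append((a, b, c))
    return triplets

print(triplets_from_clicks(["u1", "u2", "u3"], {"u1", "u2"}))
# [('u1', 'u2', 'u3'), ('u2', 'u1', 'u3')]
```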

Learning from triplet comparison data was initially studied in the context of metric learning (Schultz and Joachims, 2004), in which a distance metric between instances that is consistent with the given comparisons is learned from data. The well-known triplet loss for face recognition was proposed in this line of research (Schroff et al., 2015; Yu et al., 2018). Using this loss function, an inductive mapping function can be efficiently learned from triplet comparison image data. At the same time, the problem of ordinal embedding has also been extensively studied (Agarwal et al., 2007; Van Der Maaten and Weinberger, 2012). It aims to learn an embedding of the given instances in a Euclidean space that preserves the order given by the data. Algorithms for large-scale ordinal embedding have also been developed (Anderton and Aslam, 2019). In addition, many other problem settings have been considered for the situation where only triplet comparison data are available, such as nearest neighbor search (Haghiri et al., 2017), kernel function construction (Kleindessner and von Luxburg, 2017), and outlier identification (Kleindessner and Von Luxburg, 2017).

However, learning a binary classifier from triplet comparison data remained untouched until recently. A random forest construction algorithm (Haghiri et al., 2018) was proposed for both classification and regression; however, it requires a labeled dataset and needs to actively query a triplet comparison oracle many times. For passively collected triplet comparison data, a boosting-based algorithm (Perrot and von Luxburg, 2018) was recently proposed that does not access a triplet comparison oracle; however, a set of labeled data is still indispensable to initiate the training process. To the best of our knowledge, this paper is the first to tackle the problem of learning a classifier only from passively obtained triplet comparison data, without access to either a labeled dataset or an oracle.

Contributions:

We show that we can learn a binary classifier from only passively obtained triplet comparison data. We achieve this goal by developing a novel method for learning a binary classifier in this setting with theoretical justification. We work within the empirical risk minimization framework for the classification problem and show that the classification risk can be empirically estimated in an unbiased way given only triplet comparison data. Theoretically, we establish an estimation error bound for the proposed empirical risk minimizer, showing that learning from triplet comparison data is consistent. Unlike clustering and ordinal embedding, our method returns an inductive model that can be applied to unseen test data points. The test data consist of single instances instead of triplet comparisons, since our primary goal is to perform binary classification on unseen data points.

In summary, for the problem of classification using only triplet comparison data, our contributions in this paper are three-fold:

  • We propose an empirical risk minimization method for binary classification using only passively obtained triplet comparison data, which gives us an inductive classifier.

  • We theoretically establish an estimation error bound for our method, showing that the learning is consistent.

  • We experimentally demonstrate the practical usefulness of our method.

2 Related Work

Our problem setting of learning a binary classifier from passively obtained triplet comparison data can be considered a type of weakly-supervised classification problem, where we do not have access to ground-truth labels (Zhou, 2017).

An approach based on constructing an unbiased risk estimator of the true classification risk from weakly-supervised data has been explored in many problem settings; for example, positive-unlabeled classification (du Plessis et al., 2014; Niu et al., 2016) and similarity-unlabeled classification (Bao et al., 2018) can be handled by the framework of learning from two sets of unlabeled data (Lu et al., 2018). Nevertheless, our problem setting is not a special case addressed by Lu et al. (2018) since we have only one set of triplet comparison data. We later show that we can formulate three different distributions, which is fundamentally different from the framework used by  Lu et al. (2018).

Moreover, our problem setting is also different from similarity-dissimilarity-unlabeled classification (Shimada et al., 2019) in the sense that we have access to neither unlabeled data nor similarity/dissimilarity pairs, but only to triplet comparison information. Furthermore, it is important to note that our problem setting is also different from preference learning (Fürnkranz and Hüllermeier, 2010), since we do not want to learn a ranking function but to construct a binary classifier. Although we could first learn a ranking function and then choose a threshold to obtain a binary classifier (Narasimhan and Agarwal, 2013), it is not straightforward to choose a proper threshold. Therefore, instead of this two-stage method, we focus on a method that directly learns a binary classifier from triplet comparison data.

3 Learning A Classifier from Triplet Comparison Data

In this section, we first review the ordinary fully supervised classification setting. Then we introduce the problem setting and assumption for the data generation process of triplet comparison data. Finally, we describe the proposed method for training a binary classifier from only passively obtained triplet comparison data.

3.1 Preliminary

We first briefly introduce the traditional binary classification problem. Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote a $d$-dimensional sample space and $\mathcal{Y} = \{+1, -1\}$ a binary label space. In the fully supervised setting, we usually assume that labeled data $\{(x_i, y_i)\}_{i=1}^{n}$ are drawn from the joint probability distribution with density $p(x, y)$ (Vapnik, 1995). The goal is to obtain a classifier $f\colon \mathcal{X} \to \mathbb{R}$ that minimizes the classification risk

$$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}\big[\ell(f(x), y)\big], \tag{1}$$

where the expectation is taken over the joint density $p(x, y)$ and $\ell\colon \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{+}$ is a loss function that measures how well the classifier estimates the true class label.

In the traditional fully supervised classification setting, we are given both positive and negative training data collectively drawn from the joint density $p(x, y)$. In our case, we still want to train a binary classifier that minimizes the classification risk, even though we do not have fully labeled data.
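For concreteness, here is a minimal sketch of the empirical counterpart of Eq. (1), using the logistic loss as the surrogate; the function names are illustrative, not taken from the paper.

```python
# Empirical classification risk: a Monte Carlo approximation of
# R(f) = E[ell(f(x), y)] under the logistic surrogate loss.
import numpy as np

def logistic_loss(z, y):
    # ell(f(x), y) = log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-y * z))

def empirical_risk(f, X, y):
    # Average the per-sample losses over a labeled dataset (X, y).
    return logistic_loss(f(X), y).mean()
```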

3.2 Generation Process of Triplet Comparison Data

We formulate the underlying generation process of triplet comparison data in order to perform empirical risk minimization. The three samples in a triplet are first generated independently and then shown to a user, who can mark the triplet as proper or not. A proper triplet means that the similarity between the first and second samples is at least as strong as the similarity between the first and third samples. Specifically, a triplet $(x_1, x_2, x_3)$ is proper when its labels $(y_1, y_2, y_3)$ appear as one of the following cases:

$$(+1, +1, +1),\ (+1, +1, -1),\ (-1, -1, +1),\ (-1, -1, -1),\ (+1, -1, -1),\ (-1, +1, +1).$$

Otherwise, the first sample is more similar to the third sample than to the second sample, and the user marks the triplet as not proper. Similarly, this happens exactly when $(y_1, y_2, y_3)$ appears as one of the following cases:

$$(+1, -1, +1),\ (-1, +1, -1).$$

First, three data samples $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$ are generated independently from the underlying joint density $p(x, y)$; then $(x_1, x_2, x_3)$ is collected without knowing the underlying true labels $(y_1, y_2, y_3)$. However, we can collect information about which case a triplet belongs to from user feedback. After receiving feedback from users, we obtain two distinct datasets. The triplets for which the user keeps the order are denoted by

$$\mathcal{D}_S := \{(x_1^i, x_2^i, x_3^i)\}_{i=1}^{n_S},$$

and the triplets for which the user flips the order of the last two samples are denoted by

$$\mathcal{D}_D := \{(x_1^i, x_2^i, x_3^i)\}_{i=1}^{n_D}.$$

Note that the ratio of $n_S$ to $n_D$ is fixed in expectation because we assume the three samples in a triplet are generated independently from $p(x, y)$; thus, the ratio depends only on the underlying class prior probabilities, which are fixed unknown values.
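The following sketch simulates this generation process under the label cases listed above; `sample_xy`, a sampler for $p(x, y)$, is a hypothetical placeholder.

```python
# Simulate one triplet: draw three (x, y) pairs i.i.d. from p(x, y),
# keep the triplet if it is proper, otherwise flip its last two samples.
def make_triplet(sample_xy, rng):
    (x1, y1), (x2, y2), (x3, y3) = (sample_xy(rng) for _ in range(3))
    # Proper: first two labels agree, or the first disagrees with both others.
    proper = (y1 == y2) or (y1 != y2 and y1 != y3)
    if proper:
        return (x1, x2, x3), "kept"      # goes to D_S
    return (x1, x3, x2), "flipped"       # goes to D_D
```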

The two datasets can be considered to be generated from two underlying distributions as indicated by the following lemma.

Lemma 1.

Corresponding to the data generation process described above, let

$$\begin{aligned} p_S(x_1, x_2, x_3) := {} & \frac{1}{1 - \pi_+\pi_-} \Big( \pi_+^2\, p_+(x_1)\, p_+(x_2)\, p(x_3) + \pi_-^2\, p_-(x_1)\, p_-(x_2)\, p(x_3) \\ & \quad + \pi_+\pi_-^2\, p_+(x_1)\, p_-(x_2)\, p_-(x_3) + \pi_+^2\pi_-\, p_-(x_1)\, p_+(x_2)\, p_+(x_3) \Big), \\ p_D(x_1, x_2, x_3) := {} & \pi_+\, p_+(x_1)\, p_+(x_2)\, p_-(x_3) + \pi_-\, p_-(x_1)\, p_-(x_2)\, p_+(x_3), \end{aligned} \tag{2}$$

where $p(x) := \pi_+ p_+(x) + \pi_- p_-(x)$, $\pi_+ := p(y = +1)$ and $\pi_- := p(y = -1)$ are the class prior probabilities that satisfy $\pi_+ + \pi_- = 1$, and $p_+(x) := p(x \mid y = +1)$ and $p_-(x) := p(x \mid y = -1)$ are the class-conditional densities. Then it follows that samples in $\mathcal{D}_S$ and $\mathcal{D}_D$ are independently drawn from $p_S$ and $p_D$, respectively.

Detailed derivation is given in Appendix A.

We denote the pointwise data collected from $\mathcal{D}_S$ and $\mathcal{D}_D$ by ignoring the triplet comparison relation as $\mathcal{X}_{S,1}$, $\mathcal{X}_{S,2}$, $\mathcal{X}_{S,3}$, $\mathcal{X}_{D,1}$, and $\mathcal{X}_{D,3}$, where the first subscript indicates the source dataset and the second the position within the triplet; their marginal densities can be expressed by the following theorem.

Theorem 1.

Samples in $\mathcal{X}_{S,1}$, $\mathcal{X}_{S,3}$, and $\mathcal{X}_{D,1}$ are independently drawn from

$$\tilde{p}_1(x) = \pi_+\, p_+(x) + \pi_-\, p_-(x) = p(x), \tag{3}$$

samples in $\mathcal{X}_{S,2}$ are independently drawn from

$$\tilde{p}_2(x) = \frac{\pi_+^2 (1 + \pi_-)}{1 - \pi_+\pi_-}\, p_+(x) + \frac{\pi_-^2 (1 + \pi_+)}{1 - \pi_+\pi_-}\, p_-(x), \tag{4}$$

and samples in $\mathcal{X}_{D,3}$ are independently drawn from

$$\tilde{p}_3(x) = \pi_-\, p_+(x) + \pi_+\, p_-(x). \tag{5}$$

A proof is given in Appendix B.

Theorem 1 indicates that from triplet comparison data, we can essentially obtain samples drawn independently from three different distributions. We denote the three aggregated datasets as $\mathcal{X}_1 := \mathcal{X}_{S,1} \cup \mathcal{X}_{S,3} \cup \mathcal{X}_{D,1}$, $\mathcal{X}_2 := \mathcal{X}_{S,2}$, and $\mathcal{X}_3 := \mathcal{X}_{D,3}$, with sizes $n_1$, $n_2$, and $n_3$.
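As an illustration, the following sketch pools the triplet elements into the three aggregated datasets, following the grouping in Theorem 1; `D_S` and `D_D` are lists of kept and flipped triplets as produced by the generation sketch above.

```python
# Pool triplet elements into the three pointwise datasets of Theorem 1.
import numpy as np

def pointwise_datasets(D_S, D_D):
    X1 = np.array([t[0] for t in D_S] + [t[2] for t in D_S]
                  + [t[0] for t in D_D])      # samples from p~1 = p(x)
    X2 = np.array([t[1] for t in D_S])        # samples from p~2
    X3 = np.array([t[2] for t in D_D])        # samples from p~3
    return X1, X2, X3
```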

3.3 Unbiased Risk Estimator for Triplet Comparison Data

We now attempt to express the classification risk

$$R(f) = \pi_+\, \mathbb{E}_{x \sim p_+}\big[\ell(f(x), +1)\big] + \pi_-\, \mathbb{E}_{x \sim p_-}\big[\ell(f(x), -1)\big] \tag{6}$$

on the basis of the three pointwise densities presented in Section 3.2.

The classification risk can thus be separately expressed as expectations over $p_+$ and $p_-$. Although we do not have access to data drawn from these two class-conditional distributions, we can obtain data from the three related densities $\tilde{p}_1$, $\tilde{p}_2$, and $\tilde{p}_3$, as indicated by Theorem 1. Letting

$$A := \begin{pmatrix} \pi_+ & \pi_- \\[4pt] \dfrac{\pi_+^2 (1 + \pi_-)}{1 - \pi_+\pi_-} & \dfrac{\pi_-^2 (1 + \pi_+)}{1 - \pi_+\pi_-} \\[4pt] \pi_- & \pi_+ \end{pmatrix}, \tag{7}$$

we can express the relationship between these densities as

$$\begin{pmatrix} \tilde{p}_1(x) \\ \tilde{p}_2(x) \\ \tilde{p}_3(x) \end{pmatrix} = A \begin{pmatrix} p_+(x) \\ p_-(x) \end{pmatrix}. \tag{8}$$

Our goal is to solve the above equation so that we can express $p_+$ and $p_-$ in terms of the three densities from which we have i.i.d. samples. We can then rewrite the classification risk, which we want to minimize, in terms of $\tilde{p}_1$, $\tilde{p}_2$, and $\tilde{p}_3$. A solution of Eq. (8) is given by the following lemma.

Lemma 2.

We can express $p_+$ and $p_-$ in terms of $\tilde{p}_1$, $\tilde{p}_2$, and $\tilde{p}_3$ as

$$\begin{pmatrix} p_+(x) \\ p_-(x) \end{pmatrix} = A^{\dagger} \begin{pmatrix} \tilde{p}_1(x) \\ \tilde{p}_2(x) \\ \tilde{p}_3(x) \end{pmatrix}, \tag{9}$$

provided $\pi_+ \neq \frac{1}{2}$, where $A^{\dagger} := (A^{\top} A)^{-1} A^{\top}$ denotes the Moore-Penrose pseudo-inverse of $A$.

Detailed derivation is given in Appendix C.
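As a numerical sanity check of Lemma 2, the following sketch builds $A$ from a given class prior and computes its pseudo-inverse; the helper name `mixing_matrix` is ours, not the paper's.

```python
# Build the 3x2 matrix A of Eq. (8) for class prior pi_+ and recover the
# coefficients expressing p_+ and p_- as combinations of p~1, p~2, p~3.
import numpy as np

def mixing_matrix(pi_p):
    pi_m = 1.0 - pi_p
    denom = 1.0 - pi_p * pi_m
    return np.array([
        [pi_p, pi_m],                                    # p~1
        [pi_p**2 * (1 + pi_m) / denom,
         pi_m**2 * (1 + pi_p) / denom],                  # p~2
        [pi_m, pi_p],                                    # p~3
    ])

A = mixing_matrix(0.7)
A_dagger = np.linalg.pinv(A)   # 2 x 3; rows give the p_+ and p_- weights
print(A_dagger @ A)            # ~ identity: A has full column rank here
```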

As a result of the above lemma, we can express the classification risk using only triplet comparison data. Letting $\mathcal{L}_+(x) := \ell(f(x), +1)$ and $\mathcal{L}_-(x) := \ell(f(x), -1)$, we have the following theorem.

Theorem 2.

The classification risk can be equivalently expressed as

$$R(f) = \sum_{j=1}^{3} \mathbb{E}_{x \sim \tilde{p}_j}\Big[ \pi'_+\, A^{\dagger}_{1j}\, \mathcal{L}_+(x) + (1 - \pi'_+)\, A^{\dagger}_{2j}\, \mathcal{L}_-(x) \Big], \tag{10}$$

where $\pi'_+$ denotes the class prior of the test dataset.

A proof is given in Appendix D.

In this paper, we consider the common case in which $\pi'_+ = \pi_+$, which means the test dataset shares the same class prior as the training dataset. However, even when $\pi'_+ \neq \pi_+$, which means a class prior shift (Sugiyama, 2012) occurs, our method can still be used as long as $\pi'_+$ is known.
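Putting Theorem 2 together with the common case $\pi'_+ = \pi_+$, a minimal sketch of the resulting empirical risk estimator looks as follows; it reuses `mixing_matrix` from the sketch above, and the surrogate `loss` (e.g., `logistic_loss` from Section 3.1's sketch) and scoring function `f` are placeholders.

```python
# Unbiased empirical risk from the three pointwise datasets: each
# expectation over p~j is replaced by a sample average, weighted by the
# pseudo-inverse coefficients of Eq. (9).
import numpy as np

def unbiased_risk(f, Xs, pi_p, loss):
    """Xs = (X1, X2, X3) from Theorem 1; loss(z, y) is a margin loss."""
    W = np.linalg.pinv(mixing_matrix(pi_p))   # W[0] -> p_+, W[1] -> p_-
    risk = 0.0
    for j, Xj in enumerate(Xs):
        z = f(Xj)
        # pi_+ E_{p_+}[L_+] + pi_- E_{p_-}[L_-], with p_± expanded via W
        risk += (pi_p * W[0, j] * loss(z, +1).mean()
                 + (1 - pi_p) * W[1, j] * loss(z, -1).mean())
    return risk
```

Note that the pseudo-inverse weights can be negative, which is why this estimator can take any surrogate loss and any model under the ERM framework.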

4 Estimation Error Bound

In this section, we establish an estimation error bound for the proposed unbiased risk estimator. Let $\mathcal{F}$ represent a function class specified by a model. First, let $\mathfrak{R}_n(\mathcal{F})$ be the (expected) Rademacher complexity of $\mathcal{F}$, which is defined as

$$\mathfrak{R}_n(\mathcal{F}) := \mathbb{E}_{Z_1, \dots, Z_n} \mathbb{E}_{\boldsymbol{\sigma}} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z_i) \right], \tag{11}$$

where $n$ is a positive integer, $Z_1, \dots, Z_n$ are i.i.d. random variables drawn from a probability distribution with density $\mu$, and $\sigma_1, \dots, \sigma_n$ are Rademacher variables, i.e., random variables that take the values $+1$ and $-1$ with equal probability.

We assume that for any probability density $\mu$, the specified model class $\mathcal{F}$ satisfies $\mathfrak{R}_n(\mathcal{F}) \le \frac{C_{\mathcal{F}}}{\sqrt{n}}$ for some constant $C_{\mathcal{F}} > 0$. Also, let $f^* := \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer and $\hat{f} := \arg\min_{f \in \mathcal{F}} \hat{R}(f)$ be the empirical risk minimizer.

Theorem 3.

Assume the loss function $\ell(z, t)$ is $\rho$-Lipschitz with respect to the first argument $z$ ($0 < \rho < \infty$), and all functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b > 0$ such that $\|f\|_{\infty} \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{t \in \{\pm 1\}} \max_{|z| \le C_b} \ell(z, t)$. Then for any $\delta > 0$, with probability at least $1 - \delta$,

$$R(\hat{f}) - R(f^*) \le C_{\pi, \delta} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} + \frac{1}{\sqrt{n_3}} \right), \tag{12}$$

where $C_{\pi, \delta}$ is a coefficient term that depends on $\rho$, $C_{\mathcal{F}}$, $C_\ell$, and $\delta$, and on the class prior $\pi_+$ through the entries of $A^{\dagger}$; in particular, it diverges as $\pi_+ \to \frac{1}{2}$.

A proof is given in Appendix E.

Since $n_1$, $n_2$, and $n_3$ appear in the denominators, it is clear that when the class prior is fixed, the bound gets tighter as the amount of triplet comparison data increases. However, it is not clear how the bound behaves when we fix the amount of triplet comparison data and change the class prior. Thus, in Figure 1 we show the behavior of the coefficient term $C_{\pi, \delta}$ with respect to the class prior shared by the training and test datasets. From the illustration, we can capture the rough trend that the bound gets tighter as the class prior moves further from $\frac{1}{2}$. We further investigate this behavior in the experiments.

Figure 1: Behaviour of the coefficient term.

5 On Class Prior

In the previous sections, the class prior $\pi_+$ was assumed to be known. In this simple case, we can directly use the proposed algorithm to separate test data as well as identify the correct classes. However, this assumption may not hold in many real-world applications. Two situations can be considered. In the worst case, no information about the class prior is given. Although we can still estimate the class prior from data and obtain a classifier that separates the two classes, we cannot identify the correct class without knowing which class has the higher class prior. A better situation is when we know which class has the higher class prior. By setting this class as the positive class, we can successfully train a classifier to identify the correct class. Thus, we assume that the positive class has the higher class prior, i.e., $\pi_+ > \frac{1}{2}$.

5.1 Class Prior Estimation from Triplet Comparison Data

Noticing that a triplet is flipped with probability $\pi_+\pi_-$ under the generation process of Section 3.2, we can obtain $\mathbb{E}\big[\frac{n_D}{n_S + n_D}\big] = \pi_+\pi_-$. By assuming $\pi_+ > \frac{1}{2}$, we have

$$\pi_+ = \frac{1 + \sqrt{1 - 4\pi_+\pi_-}}{2}. \tag{14}$$

Since we can unbiasedly estimate $\pi_+\pi_-$ by $\frac{n_D}{n_S + n_D}$, the class prior can thus be estimated once the triplet comparison dataset is given.
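A minimal sketch of this estimator from the counts $n_S$ and $n_D$; clipping the flip rate at $1/4$ is our own guard against sampling noise, not part of the paper.

```python
# Estimate pi_+ from the kept/flipped counts via Eq. (14),
# assuming pi_+ > 1/2.
import numpy as np

def estimate_class_prior(n_S, n_D):
    flip_rate = n_D / (n_S + n_D)        # estimates pi_+ * (1 - pi_+)
    flip_rate = min(flip_rate, 0.25)     # pi_+ * pi_- never exceeds 1/4
    return 0.5 * (1.0 + np.sqrt(1.0 - 4.0 * flip_rate))

print(estimate_class_prior(790, 210))    # ~0.7 when the true pi_+ = 0.7
```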

6 Experiments

In this section, we report experiments using real-world datasets to evaluate and investigate the performance of the proposed method for classification from triplet comparison data.

6.1 Baseline methods

KMEANS:

As a simple baseline, we used $k$-means clustering (Macqueen, 1967) with $k = 2$ on all the data instances appearing in the triplets, ignoring all the comparison information.

ITML:

Information-theoretic metric learning (Davis et al., 2007) is a metric learning method that requires pairwise relationships between data instances. From a triplet $(x_1, x_2, x_3)$, we constructed pairwise constraints by treating $(x_1, x_2)$ as similar and $(x_1, x_3)$ as dissimilar. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data. We used the identity matrix as the prior knowledge and kept the slack variable fixed.

TL:

Triplet loss (Schroff et al., 2015) is a loss function proposed in the context of deep metric learning that can learn a metric directly from triplet comparison data. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data.

SERAPH:

Semi-supervised metric learning paradigm with hyper-sparsity (Niu et al., 2014) is a metric learning method based on entropy regularization. We formulated pairwise relationships in the same manner as for ITML. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data.

SU:

SU learning (Bao et al., 2018) is a method for learning a binary classifier from similarity and unlabeled data. We used the same class prior estimation method as in our approach and considered the less similar (third) sample in each triplet as unlabeled data.

6.2 Datasets

UCI datasets:

We used six datasets from the UCI Machine Learning Repository (Asuncion and Newman, 2007). They are binary classification datasets, and we used the given labels for triplet comparison data generation.

Image datasets:

We used the following three image datasets.

The MNIST (Deng, 2012) dataset consists of 70,000 examples, each associated with a label from the ten digits. Each data instance is a $28 \times 28$ gray-scale image; thus, the input dimension is $784$. To form a binary classification problem, we treated the even digits as the positive class and the odd digits as the negative class. The data were standardized to have zero mean and unit variance.

The Fashion MNIST (Xiao et al., 2017) dataset consists of 70,000 examples, each associated with a label from ten fashion item classes. Each data instance is a $28 \times 28$ gray-scale image; thus, the input dimension is $784$. To form a binary classification problem, we treated five classes, i.e., T-shirt/top, Pullover, Dress, Coat, and Shirt, as the positive class, since they all represent upper-body clothing. The data were standardized to have zero mean and unit variance.

The CIFAR-10 (Li et al., 2017) dataset consists of 60,000 examples, each associated with a label from ten classes. Each image is given in a $32 \times 32 \times 3$ format; thus, the input dimension is $3{,}072$. To form a binary classification problem, we treated four classes, i.e., airplane, automobile, ship, and truck, as the positive class, since they all represent artificial objects.

6.3 Proposed method

For the proposed method, we used a fully connected neural network with a single hidden layer and rectified linear units (ReLUs) (Nair and Hinton, 2010) for all the datasets except CIFAR-10. The same hidden-layer width was used throughout all experiments. Adam (Kingma and Ba, 2014) was used for optimization. The neural network architecture used for CIFAR-10 is specified in the appendix.
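For reference, a minimal sketch of such a model in PyTorch; the input dimension, hidden width, and learning rate shown here are placeholders, since the actual values are not recoverable from the text above.

```python
# Single-hidden-layer ReLU network trained with Adam, as used for the
# non-image datasets. d and width are assumed values for illustration.
import torch
import torch.nn as nn

d, width = 64, 100   # input dimension and hidden width (assumed)
model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```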

6.4 Results

The proposed method first estimates the unknown class prior. For the baseline methods, performance is measured by the clustering accuracy $\max(r, 1 - r)$, where $r$ is the error rate. The results for different numbers of triplets are listed in Tables 1, 2, and 3. The best and statistically equivalent methods are shown in boldface according to a one-sided t-test. Also, as shown in Figure 2, the performance of the proposed method with respect to the class prior and the size of the training dataset followed the prediction of the theory in most cases.

Proposed Methods Baselines
Dataset Squared Double Hinge KMEANS ITML TL SERAPH SU
adult 65.54 (0.41) 64.19 (0.61) 71.94 (0.10) 71.04 (1.00) 61.48 (1.36) 71.04 (1.00) 75.88 (0.50)
breast 97.41 (0.28) 96.90 (0.31) 96.20 (0.34) 95.84 (0.29) 93.87 (0.78) 96.72 (0.23) 65.26 (0.76)
diabetes 70.71 (0.84) 64.87 (0.74) 66.69 (0.70) 65.91 (0.69) 64.38 (1.60) 67.44 (0.78) 34.42 (0.73)
magic 61.75 (1.00) 71.91 (0.39) 65.08 (0.17) 64.79 (0.17) 65.42 (0.22) 64.96 (0.19) 34.77 (0.19)
phishing 76.58 (0.30) 74.95 (0.27) 63.43 (0.50) 63.75 (0.23) 57.85 (0.92) 63.42 (0.53) 34.17 (0.22)
spambase 62.08 (1.87) 64.66 (1.04) 63.59 (0.24) 63.24 (0.31) 59.59 (1.57) 63.28 (0.34) 60.27 (0.30)
mnist 79.86 (0.35) 80.78 (0.34) 65.24 (0.25) 0.00 (0.00) 58.26 (1.24) 0.00 (0.00) 50.80 (0.03)
fashion 89.73 (0.33) 91.62 (0.33) 74.90 (1.00) 0.00 (0.00) 76.83 (1.31) 0.00 (0.00) 49.85 (0.08)
cifar10 76.39 (1.57) 66.28 (2.51) 64.17 (0.01) 0.00 (0.00) 60.17 (1.26) 0.00 (0.00) 59.50 (0.50)
Table 1: Experimental results for the first combination of class prior and number of training triplets.
Proposed Methods Baselines
Dataset Squared Double Hinge KMEANS ITML TL SERAPH SU
adult 62.72 (0.57) 59.74 (1.44) 71.44 (0.60) 71.79 (0.20) 58.53 (1.17) 70.54 (1.09) 76.30 (0.04)
breast 96.90 (0.44) 96.53 (0.35) 96.28 (0.29) 96.79 (0.24) 89.67 (1.97) 96.68 (0.27) 64.12 (0.91)
diabetes 69.64 (0.68) 67.08 (0.91) 66.27 (0.65) 64.87 (0.66) 63.15 (1.56) 67.44 (0.68) 33.90 (0.67)
magic 63.86 (1.44) 70.37 (0.36) 64.86 (0.15) 65.03 (0.13) 66.36 (0.30) 64.94 (0.14) 34.83 (0.15)
phishing 75.52 (0.31) 74.57 (0.37) 63.08 (0.47) 63.31 (0.41) 56.37 (1.18) 62.73 (0.76) 33.89 (0.20)
spambase 61.18 (1.11) 59.95 (1.38) 63.55 (0.32) 64.17 (0.31) 59.35 (1.48) 63.53 (0.35) 58.96 (0.44)
mnist 74.23 (0.32) 75.19 (0.50) 64.74 (0.55) 0.00 (0.00) 56.07 (0.87) 0.00 (0.00) 50.87 (0.26)
fashion 83.83 (0.55) 87.86 (0.66) 75.40 (0.34) 0.00 (0.00) 76.66 (1.39) 0.00 (0.00) 49.88 (0.08)
cifar10 66.28 (1.77) 62.63 (2.53) 64.16 (0.01) 0.00 (0.00) 61.26 (1.13) 0.00 (0.00) 59.05 (0.65)
Table 2: Experimental results for the second combination of class prior and number of training triplets.
Proposed Methods Baselines
Dataset Squared Double Hinge KMEANS ITML TL SERAPH SU
adult 58.12 (0.90) 55.10 (1.00) 70.54 (1.50) 70.04 (1.17) 58.28 (0.94) 68.54 (1.67) 75.27 (0.51)
breast 96.68 (0.32) 96.50 (0.35) 95.91 (0.34) 96.24 (0.24) 94.27 (0.68) 96.64 (0.28) 66.20 (0.80)
diabetes 69.25 (0.98) 65.36 (0.89) 64.97 (0.87) 67.27 (0.72) 63.47 (1.22) 67.11 (0.82) 35.23 (0.94)
magic 60.54 (1.88) 68.56 (0.53) 64.88 (0.13) 65.15 (0.14) 66.31 (0.42) 64.97 (0.15) 34.60 (0.34)
phishing 72.22 (0.62) 72.11 (0.65) 63.70 (0.26) 63.71 (0.21) 57.02 (1.41) 63.17 (0.77) 34.03 (0.32)
spambase 57.69 (1.68) 55.74 (1.19) 63.78 (0.34) 63.04 (0.35) 60.78 (1.63) 63.74 (0.25) 58.92 (0.43)
mnist 67.14 (0.67) 70.96 (0.53) 64.49 (1.00) 0.00 (0.00) 57.88 (1.43) 0.00 (0.00) 50.10 (0.62)
fashion 76.67 (0.40) 83.74 (0.55) 74.90 (1.00) 0.00 (0.00) 73.24 (1.80) 0.00 (0.00) 47.97 (0.76)
cifar10 63.14 (1.68) 58.83 (2.16) 64.16 (0.01) 0.00 (0.00) 61.23 (1.18) 0.00 (0.00) 58.65 (0.66)
Table 3: Experimental results for the third combination of class prior and number of training triplets.
Figure 2: Average classification error and standard error over multiple trials.

7 Conclusion

In this paper, we proposed a novel method for learning a classifier from only passively obtained triplet comparison data. We established an estimation error bound for the proposed method, and confirmed that the estimation error decreases as the amount of triplet comparison data increases. We also empirically confirmed that the performance of the proposed method surpassed multiple baseline methods on various datasets. For future work, it would be interesting to investigate alternative methods that can handle a multi-class case.

References

  • Agarwal et al. [2007] Agarwal S, Wills J, Cayton L, Lanckriet G, Kriegman D, Belongie S (2007) Generalized non-metric multidimensional scaling. In: Artificial Intelligence and Statistics, pp 11–18

  • Anderton and Aslam [2019] Anderton J, Aslam J (2019) Scaling up ordinal embedding: A landmark approach. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, California, USA, Proceedings of Machine Learning Research, vol 97, pp 282–290
  • Asuncion and Newman [2007] Asuncion A, Newman D (2007) UCI machine learning repository
  • Bao et al. [2018] Bao H, Niu G, Sugiyama M (2018) Classification from pairwise similarity and unlabeled data. In: ICML, pp 452–461
  • Davis et al. [2007] Davis JV, Kulis B, Jain P, Sra S, Dhillon IS (2007) Information-theoretic metric learning. In: Proceedings of the 24th international conference on Machine learning, ACM, pp 209–216
  • Deng [2012] Deng L (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6):141–142

  • Fürnkranz and Hüllermeier [2010] Fürnkranz J, Hüllermeier E (2010) Preference learning. Springer
  • Haghiri et al. [2017] Haghiri S, Ghoshdastidar D, von Luxburg U (2017) Comparison based nearest neighbor search. arXiv preprint arXiv:1704.01460
  • Haghiri et al. [2018] Haghiri S, Garreau D, Luxburg U (2018) Comparison-based random forests. In: International Conference on Machine Learning, pp 1866–1875
  • Heim [2016] Heim E (2016) Efficiently and effectively learning models of similarity from human feedback. PhD thesis, University of Pittsburgh
  • Kingma and Ba [2014] Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
  • Kleindeßner [2017] Kleindeßner M (2017) Machine learning in a setting of ordinal distance information. PhD thesis, Eberhard Karls Universität Tübingen
  • Kleindessner and von Luxburg [2017] Kleindessner M, von Luxburg U (2017) Kernel functions based on triplet comparisons. In: Advances in Neural Information Processing Systems, pp 6807–6817
  • Kleindessner and Von Luxburg [2017] Kleindessner M, Von Luxburg U (2017) Lens depth function and k-relative neighborhood graph: versatile tools for ordinal data analysis. The Journal of Machine Learning Research 18(1):1889–1940
  • Li et al. [2017] Li H, Liu H, Ji X, Li G, Shi L (2017) Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience 11:309
  • Liu et al. [2004] Liu C, Wu K, He T (2004) Sensor localization with ring overlapping based on comparison of received signal strength indicator. In: 2004 IEEE International Conference on Mobile Ad-hoc and Sensor Systems (IEEE Cat. No. 04EX975), IEEE, pp 516–518
  • Lu et al. [2018] Lu N, Niu G, Menon AK, Sugiyama M (2018) On the minimal supervision for training any binary classifier from only unlabeled data. arXiv preprint arXiv:1808.10585
  • Macqueen [1967] Macqueen J (1967) Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297
  • Moore [1920] Moore EH (1920) On the reciprocal of the general algebraic matrix. Bull Am Math Soc 26:394–395
  • Nair and Hinton [2010] Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp 807–814

  • Narasimhan and Agarwal [2013] Narasimhan H, Agarwal S (2013) On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pp 2913–2921
  • Niu et al. [2014] Niu G, Dai B, Yamada M, Sugiyama M (2014) Information-theoretic semi-supervised metric learning via entropy regularization. Neural computation 26(8):1717–1762
  • Niu et al. [2016] Niu G, du Plessis MC, Sakai T, Ma Y, Sugiyama M (2016) Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In: NeurIPS, pp 1199–1207
  • Penrose and Todd [1954] Penrose BYR, Todd CJA (1954) A generalized inverse for matrices
  • Perrot and von Luxburg [2018] Perrot M, von Luxburg U (2018) Boosting for comparison-based learning. arXiv preprint arXiv:1810.13333
  • du Plessis et al. [2014] du Plessis MC, Niu G, Sugiyama M (2014) Analysis of learning from positive and unlabeled data. In: NeurIPS, pp 703–711
  • Schroff et al. [2015] Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823

  • Schultz and Joachims [2004] Schultz M, Joachims T (2004) Learning a distance metric from relative comparisons. In: Advances in neural information processing systems, pp 41–48
  • Shimada et al. [2019] Shimada T, Bao H, Sato I, Sugiyama M (2019) Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. arXiv:1904.11717
  • Stewart et al. [2005] Stewart N, Brown GD, Chater N (2005) Absolute identification by relative judgment. Psychological review 112(4):881
  • Sugiyama [2012] Sugiyama M (2012) Learning under non-stationarity: Covariate shift adaptation by importance weighting. In: Handbook of Computational Statistics, Springer, pp 927–952
  • Van Der Maaten and Weinberger [2012] Van Der Maaten L, Weinberger K (2012) Stochastic triplet embedding. In: 2012 IEEE International Workshop on Machine Learning for Signal Processing, IEEE, pp 1–6
  • Vapnik [1995] Vapnik VN (1995) The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg

  • Xiao et al. [2017] Xiao H, Rasul K, Vollgraf R (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747
  • Xiao et al. [2006] Xiao L, Li R, Luo J (2006) Sensor localization based on nonmetric multidimensional scaling. STRESS 2:1
  • Yu et al. [2018] Yu B, Liu T, Gong M, Ding C, Tao D (2018) Correcting the triplet selection bias for triplet loss. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 71–87
  • Zhou [2017] Zhou ZH (2017) A brief introduction to weakly supervised learning. National Science Review 5(1):44–53

Appendix A Proof of Lemma 1

Proof.

From the data generation process, we can write the generation distribution for the data in $\mathcal{D}_S$ as

$$p_S(x_1, x_2, x_3) = \frac{p(x_1, x_2, x_3, \text{the triplet is proper})}{p(\text{the triplet is proper})}. \tag{15}$$

Note that the denominator in Eq. (15) can be rewritten as

$$p(\text{the triplet is proper}) = \pi_+^2 + \pi_-^2 + \pi_+\pi_-^2 + \pi_+^2\pi_- = 1 - \pi_+\pi_-; \tag{16}$$

then we have

$$p_S(x_1, x_2, x_3) = \frac{p(x_1, x_2, x_3, \text{the triplet is proper})}{1 - \pi_+\pi_-}. \tag{17}$$

Moreover, the distribution in the numerator of Eq. (17) can be explicitly expressed as

$$\begin{aligned} p(x_1, x_2, x_3, \text{the triplet is proper}) = {} & \pi_+^2\, p_+(x_1)\, p_+(x_2)\, p(x_3) + \pi_-^2\, p_-(x_1)\, p_-(x_2)\, p(x_3) \\ & + \pi_+\pi_-^2\, p_+(x_1)\, p_-(x_2)\, p_-(x_3) + \pi_+^2\pi_-\, p_-(x_1)\, p_+(x_2)\, p_+(x_3), \end{aligned} \tag{18}$$

from the assumption that the three instances in each triplet comparison are generated independently.

Similarly, the underlying density for the data in $\mathcal{D}_D$ (after the order of the last two samples is flipped) can be expressed as

$$p_D(x_1, x_2, x_3) = \frac{\pi_+^2\pi_-\, p_+(x_1)\, p_+(x_2)\, p_-(x_3) + \pi_+\pi_-^2\, p_-(x_1)\, p_-(x_2)\, p_+(x_3)}{\pi_+\pi_-} = \pi_+\, p_+(x_1)\, p_+(x_2)\, p_-(x_3) + \pi_-\, p_-(x_1)\, p_-(x_2)\, p_+(x_3). \tag{19}$$

∎

Appendix B Proof of Theorem 1

Proof.

For simplicity, we give the proof for $\tilde{p}_2$ in Eq. (4); the other cases follow from similar arguments. Noticing that

$$\tilde{p}_2(x_2) = \int\!\!\int p_S(x_1, x_2, x_3)\, \mathrm{d}x_1\, \mathrm{d}x_3, \tag{20}$$

in order to decompose the triplet comparison data distribution into a pointwise distribution, we marginalize $p_S$ with respect to $x_1$ and $x_3$:

$$\tilde{p}_2(x_2) = \frac{(\pi_+^2 + \pi_+^2\pi_-)\, p_+(x_2) + (\pi_-^2 + \pi_+\pi_-^2)\, p_-(x_2)}{1 - \pi_+\pi_-} = \frac{\pi_+^2 (1 + \pi_-)}{1 - \pi_+\pi_-}\, p_+(x_2) + \frac{\pi_-^2 (1 + \pi_+)}{1 - \pi_+\pi_-}\, p_-(x_2). \tag{21}$$

∎

Appendix C Proof of Lemma 2

Proof.

Notice that the equation has an infinite number of solutions: any weight vector $w \in \mathbb{R}^3$ with $A^{\top} w = (1, 0)^{\top}$ reproduces $p_+$ from $(\tilde{p}_1, \tilde{p}_2, \tilde{p}_3)$, and similarly for $p_-$. Letting

$$A = \begin{pmatrix} \pi_+ & \pi_- \\[4pt] a & b \\[4pt] \pi_- & \pi_+ \end{pmatrix}, \qquad a := \frac{\pi_+^2 (1 + \pi_-)}{1 - \pi_+\pi_-}, \quad b := \frac{\pi_-^2 (1 + \pi_+)}{1 - \pi_+\pi_-}, \tag{22}$$

we resort to finding the Moore-Penrose pseudo-inverse [Moore, 1920; Penrose and Todd, 1954], which provides the minimum Euclidean norm solution to the above system of linear equations.

Let $A^*$ denote the conjugate transpose, which coincides with the transpose $A^{\top}$ since $A$ is real. We have

$$A^* A = \begin{pmatrix} \pi_+^2 + \pi_-^2 + a^2 & 2\pi_+\pi_- + ab \\ 2\pi_+\pi_- + ab & \pi_+^2 + \pi_-^2 + b^2 \end{pmatrix}. \tag{23}$$

In the next step, we need to take the inverse of the above matrix. To achieve a proper inverse matrix, we need to introduce another assumption, $\pi_+ \neq \frac{1}{2}$, which guarantees that $A$ has full column rank and hence $\det(A^* A) \neq 0$. Then

$$(A^* A)^{-1} = \frac{1}{\det(A^* A)} \begin{pmatrix} \pi_+^2 + \pi_-^2 + b^2 & -(2\pi_+\pi_- + ab) \\ -(2\pi_+\pi_- + ab) & \pi_+^2 + \pi_-^2 + a^2 \end{pmatrix}. \tag{24}$$

Finally, the Moore-Penrose pseudo-inverse is given by

$$A^{\dagger} = (A^* A)^{-1} A^*. \tag{25}$$

Thus we can express $p_+$ and $p_-$ in terms of $\tilde{p}_1$, $\tilde{p}_2$, and $\tilde{p}_3$ as

$$\begin{pmatrix} p_+(x) \\ p_-(x) \end{pmatrix} = A^{\dagger} \begin{pmatrix} \tilde{p}_1(x) \\ \tilde{p}_2(x) \\ \tilde{p}_3(x) \end{pmatrix}. \tag{26}$$

∎
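As a quick numerical check of this invertibility condition, the determinant of $A^* A$ can be evaluated as the class prior approaches $\frac{1}{2}$; `mixing_matrix` below restates the matrix $A$ from the Section 3.3 sketch.

```python
# det(A^T A) -> 0 as pi_+ -> 1/2, so the pseudo-inverse (and hence
# Lemma 2) breaks down exactly at a balanced class prior.
import numpy as np

def mixing_matrix(p):
    m = 1.0 - p
    d = 1.0 - p * m
    return np.array([[p, m],
                     [p**2 * (1 + m) / d, m**2 * (1 + p) / d],
                     [m, p]])

for pi_p in [0.9, 0.7, 0.55, 0.51, 0.5]:
    A = mixing_matrix(pi_p)
    print(pi_p, np.linalg.det(A.T @ A))   # shrinks to 0 at pi_+ = 0.5
```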

Appendix D Proof of Theorem 2

Proof.

Using Eq. (9), we can rewrite the classification risk as

$$R(f) = \pi'_+ \int \mathcal{L}_+(x)\, p_+(x)\, \mathrm{d}x + (1 - \pi'_+) \int \mathcal{L}_-(x)\, p_-(x)\, \mathrm{d}x = \int \big( \pi'_+\, \mathcal{L}_+(x),\ (1 - \pi'_+)\, \mathcal{L}_-(x) \big)\, A^{\dagger} \begin{pmatrix} \tilde{p}_1(x) \\ \tilde{p}_2(x) \\ \tilde{p}_3(x) \end{pmatrix} \mathrm{d}x, \tag{27}$$

which can then be simplified into Eq. (10). ∎

Appendix E Proof of Theorem 3

Proof.

Letting $w_{+,j} := \pi'_+ A^{\dagger}_{1j}$ and $w_{-,j} := (1 - \pi'_+) A^{\dagger}_{2j}$ for $j \in \{1, 2, 3\}$, and

$$\bar{\mathcal{L}}_j(x) := w_{+,j}\, \mathcal{L}_+(x) + w_{-,j}\, \mathcal{L}_-(x), \tag{28}$$

we can simplify the unbiased risk estimator into the form

$$\hat{R}(f) = \sum_{j=1}^{3} \frac{1}{n_j} \sum_{i=1}^{n_j} \bar{\mathcal{L}}_j\big(x^{(j)}_i\big), \tag{29}$$

where $x^{(j)}_i$ denotes the $i$-th sample of $\mathcal{X}_j$. Then

$$R(\hat{f}) - R(f^*) \le 2 \sup_{f \in \mathcal{F}} \big| \hat{R}(f) - R(f) \big| \le 2 \sum_{j=1}^{3} \sup_{f \in \mathcal{F}} \left| \frac{1}{n_j} \sum_{i=1}^{n_j} \bar{\mathcal{L}}_j\big(x^{(j)}_i\big) - \mathbb{E}_{x \sim \tilde{p}_j}\big[\bar{\mathcal{L}}_j(x)\big] \right|.$$

For the first term,