Implementation for "Classification from Triplet Comparison Data"
Learning from triplet comparison data has been extensively studied in the context of metric learning, where we want to learn a distance metric between two instances, and ordinal embedding, where we want to learn an embedding of the given instances in a Euclidean space that preserves the comparison order as well as possible. Unlike fully-labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Although learning from triplet comparison data has been considered in many applications, an important fundamental question of whether we can learn a classifier only from triplet comparison data has remained unanswered. In this paper, we give a positive answer to this important question by proposing an unbiased estimator for the classification risk under the empirical risk minimization framework. Since the proposed method is based on the empirical risk minimization framework, it inherently has the advantage that any surrogate loss function and any model, including neural networks, can be easily applied. Furthermore, we theoretically establish an estimation error bound for the proposed empirical risk minimizer. Finally, we provide experimental results to show that our method empirically works well and outperforms various baseline methods.
Recently, learning from comparison-feedback data has received increasing attention (Heim, 2016; Kleindeßner, 2017). It is usually argued that humans perform better in the task of evaluating which instances are similar, rather than identifying each individual instance (Stewart et al., 2005). It is also argued that humans can achieve much better and more reliable performance on assessing the similarity on a relative scale (“Instance A is more similar to instance B than to instance C”) rather than on an absolute scale, where a numerical similarity score must be assigned to each of the pairs (A, B) and (A, C) (Kleindeßner, 2017). Collecting data in this manner has the advantage of avoiding the problem caused by individuals’ different assessment scales. On the other hand, the collected absolute similarity scores may only provide information on a comparison level in some applications, e.g., sensor localization (Liu et al., 2004). It was shown that keeping only the relative comparison information can help an algorithm be resilient against measurement errors and achieve high accuracy (Xiao et al., 2006).
In this paper, we focus on the problem of learning from triplet comparison data, which is a common form of comparison-feedback data. A triplet comparison $(x_a, x_b, x_c)$ contains the information that instance $x_a$ is more similar to $x_b$ than to $x_c$. As one example, search-engine query logs can readily provide feedback in the form of triplet comparisons (Schultz and Joachims, 2004). Given a list of website links for a query, if links $x_a$ and $x_b$ are clicked and the link $x_c$ is not clicked, we can formulate a triplet comparison as $(x_a, x_b, x_c)$.
Learning from triplet comparison data was initially studied in the context of metric learning (Schultz and Joachims, 2004), in which a consistent distance metric between two instances is assumed to be learned from data. The well-known triplet loss for face recognition was proposed in this line of research (Schroff et al., 2015; Yu et al., 2018). Using this loss function, an inductive mapping function can be efficiently learned from triplet comparison image data. At the same time, the problem of ordinal embedding has also been extensively studied (Agarwal et al., 2007; Van Der Maaten and Weinberger, 2012). It aims to learn an embedding of the given instances into a Euclidean space that preserves the order given by the data. Algorithms for large-scale ordinal embedding have been developed (Anderton and Aslam, 2019). In addition, many other problem settings have been considered for the situation of using only triplet comparison data, such as nearest neighbor search (Haghiri et al., 2017), kernel function construction (Kleindessner and von Luxburg, 2017), and outlier identification (Kleindessner and Von Luxburg, 2017).
However, learning a binary classifier from triplet comparison data remained untouched until recently. A random forest construction algorithm (Haghiri et al., 2018) was proposed for both classification and regression. However, it first requires a labeled dataset and needs to actively access a triplet comparison oracle many times. For passively collected triplet comparison data, a boosting-based algorithm (Perrot and von Luxburg, 2018) was recently proposed that does not access a triplet comparison oracle. However, a set of labeled data is still indispensable for initiating the training process. To the best of our knowledge, this paper is the first to tackle the problem of learning a classifier only from passively obtained triplet comparison data, without accessing either a labeled dataset or an oracle.
We show that we can learn a binary classifier from only passively obtained triplet comparison data. We achieve this goal by developing a novel method for learning a binary classifier in this setting with theoretical justification. We use the direct risk minimization framework given for the classification problem. We then show that the classification risk can be empirically estimated in an unbiased way given only triplet comparison data. Theoretically, we establish an estimation error bound for the proposed empirical risk minimizer, showing that learning from triplet comparison data is consistent. Our method also returns an inductive model, which is different from clustering and ordinal embedding, and can be applied to unseen test data points. The test data would consist of single instances instead of triplet comparisons since our primary goal is to perform a binary classification task on unseen data points.
In summary, for the problem of classification using only triplet comparison data, our contributions in this paper are three-fold:
We propose an empirical risk minimization method for binary classification using only passively obtained triplet comparison data, which gives us an inductive classifier.
We theoretically establish an estimation error bound for our method, showing that the learning is consistent.
We experimentally demonstrate the practical usefulness of our method.
Our problem setting of learning a binary classifier from passively obtained triplet comparison data can be considered a type of weakly supervised classification problem, where we do not have access to ground-truth labels (Zhou, 2017).
An approach based on constructing an unbiased risk estimator of the true classification risk from weakly-supervised data has been explored in many problem settings; for example, positive-unlabeled classification (du Plessis et al., 2014; Niu et al., 2016) and similarity-unlabeled classification (Bao et al., 2018) can be handled by the framework of learning from two sets of unlabeled data (Lu et al., 2018). Nevertheless, our problem setting is not a special case addressed by Lu et al. (2018) since we have only one set of triplet comparison data. We later show that we can formulate three different distributions, which is fundamentally different from the framework used by Lu et al. (2018).
Moreover, our problem setting is also different from similarity-dissimilarity-unlabeled classification (Shimada et al., 2019) in the sense that we have no access to unlabeled data and similarity and dissimilarity pairs, but only triplet comparison information. Furthermore, it is important to note that our problem setting is also different from preference learning (Fürnkranz and Hüllermeier, 2010), since we do not want to learn a ranking function but construct a binary classifier. Although we can first learn a ranking function and then decide a proper threshold to construct a binary classifier (Narasimhan and Agarwal, 2013), it is not straightforward to choose a proper threshold. Therefore, instead of this two-stage method, we focus on a method that can directly learn a binary classifier from triplet comparison data.
In this section, we first review the ordinary fully supervised classification setting. Then we introduce the problem setting and assumption for the data generation process of triplet comparison data. Finally, we describe the proposed method for training a binary classifier from only passively obtained triplet comparison data.
We first briefly introduce the traditional binary classification problem. We denote $\mathcal{X} \subset \mathbb{R}^d$ as a $d$-dimensional sample space and $\mathcal{Y} = \{+1, -1\}$ as a binary label space. In the fully supervised setting, we usually assume the labeled data $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are drawn from the joint probability distribution with density $p(x, y)$ (Vapnik, 1995). The goal is to obtain a classifier $f: \mathcal{X} \to \mathbb{R}$ that minimizes the classification risk
$$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}\left[\ell(f(x), y)\right],$$
where the expectation is over the joint density $p(x, y)$ and $\ell: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function that measures how well the classifier estimates the true class label.
In the traditional fully supervised classification setting, we are given both positive and negative training data collectively drawn from the joint density $p(x, y)$. However, in our case, we still want to train a binary classifier that minimizes the classification risk, although we do not have fully labeled data.
We formulate the underlying generation process of triplet comparison data in order to perform empirical risk minimization. Three samples in a triplet $(x_a, x_b, x_c)$ are first generated independently, then shown to a user. The user can mark the triplet as proper or not. A proper triplet means that the similarity between the first and second samples is stronger than or the same as the similarity between the first and third samples. Specifically, it means that the three labels $(y_a, y_b, y_c)$ in a triplet appear to be one of the following cases:
$$\mathcal{Y}_1 := \{(+1,+1,+1),\ (+1,+1,-1),\ (+1,-1,-1),\ (-1,+1,+1),\ (-1,-1,+1),\ (-1,-1,-1)\}.$$
Otherwise, it means the first sample is more similar to the third sample than to the second sample; thus, the user chooses to mark the triplet as not proper. Similarly, it means $(y_a, y_b, y_c)$ appears to be one of the following cases:
$$\mathcal{Y}_2 := \{(+1,-1,+1),\ (-1,+1,-1)\}.$$
First, three data samples $(x_a, x_b, x_c)$ are generated independently from the underlying joint density $p(x, y)$, then are collected without knowing the underlying true labels $(y_a, y_b, y_c)$. However, we can collect information about which case a triplet belongs to from user feedback. After receiving feedback from users, we can actually obtain two distinct datasets. The data for which the user chooses to keep the order is denoted as
$$\mathcal{D}_1 := \{(x_{1,a,i},\ x_{1,b,i},\ x_{1,c,i})\}_{i=1}^{n_1}.$$
Similarly, the data for which the user chooses to flip the order is denoted as
$$\mathcal{D}_2 := \{(x_{2,a,i},\ x_{2,b,i},\ x_{2,c,i})\}_{i=1}^{n_2}.$$
Note that the expected ratio of $n_1$ to $n_2$ is fixed because we assume the three samples in a triplet $(x_a, x_b, x_c)$ are generated independently from $p(x, y)$; thus, the ratio
$$\frac{p((y_a, y_b, y_c) \in \mathcal{Y}_1)}{p((y_a, y_b, y_c) \in \mathcal{Y}_2)}$$
is only dependent on the underlying class prior probabilities, which are fixed unknown values.
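The generation process can be checked with a small simulation. The following is a minimal sketch, assuming labels are drawn i.i.d. with a positive-class prior `pi_plus` (a hypothetical parameter name) and that a triplet is marked not proper exactly when the first and third labels agree while the second differs. Under these assumptions, the fraction of kept triplets should be close to $1 - \pi_+\pi_-$.

```python
import numpy as np

def simulate_triplet_labels(pi_plus, n_triplets, seed=0):
    """Draw labels for independent triplets and split them into kept / flipped."""
    rng = np.random.default_rng(seed)
    # Each label is +1 with probability pi_plus, independently of the others.
    y = np.where(rng.random((n_triplets, 3)) < pi_plus, 1, -1)
    y_a, y_b, y_c = y[:, 0], y[:, 1], y[:, 2]
    # Not proper: the first sample shares its label with the third but not the second.
    flipped = (y_a == y_c) & (y_a != y_b)
    return y[~flipped], y[flipped]

pi_plus = 0.7
kept, flipped = simulate_triplet_labels(pi_plus, 200_000)
n1, n2 = len(kept), len(flipped)
kept_fraction = n1 / (n1 + n2)
print(kept_fraction, 1 - pi_plus * (1 - pi_plus))
```

The two printed numbers should agree up to sampling noise, which illustrates that the split between the two datasets depends only on the class prior.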
The two datasets can be considered to be generated from two underlying distributions as indicated by the following lemma.
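Corresponding to the data generation process described above, let
$$p_1(x_a, x_b, x_c) = \frac{1}{\pi_T}\Big[\pi_+^3\, p_+(x_a)p_+(x_b)p_+(x_c) + \pi_+^2\pi_-\, p_+(x_a)p_+(x_b)p_-(x_c) + \pi_+\pi_-^2\, p_+(x_a)p_-(x_b)p_-(x_c) + \pi_+^2\pi_-\, p_-(x_a)p_+(x_b)p_+(x_c) + \pi_+\pi_-^2\, p_-(x_a)p_-(x_b)p_+(x_c) + \pi_-^3\, p_-(x_a)p_-(x_b)p_-(x_c)\Big],$$
$$p_2(x_a, x_b, x_c) = \pi_+\, p_+(x_a)p_-(x_b)p_+(x_c) + \pi_-\, p_-(x_a)p_+(x_b)p_-(x_c),$$
where $\pi_T = 1 - \pi_+\pi_-$, $\pi_+ = p(y = +1)$ and $\pi_- = p(y = -1)$ are the class prior probabilities that satisfy $\pi_+ + \pi_- = 1$, and $p_+(x) = p(x \mid y = +1)$ and $p_-(x) = p(x \mid y = -1)$ are class conditional probabilities. Then it follows that the triplets whose order is kept are independently drawn from $p_1$ and the triplets whose order is flipped are independently drawn from $p_2$.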
Detailed derivation is given in Appendix A.
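We denote the pointwise data collected from $\mathcal{D}_1$ and $\mathcal{D}_2$ by ignoring the triplet comparison relation as $\mathcal{D}_{1,a}$, $\mathcal{D}_{1,b}$, $\mathcal{D}_{1,c}$, $\mathcal{D}_{2,a}$, $\mathcal{D}_{2,b}$, and $\mathcal{D}_{2,c}$, the marginal densities of which can be expressed by the following theorem.
Samples in $\mathcal{D}_{1,a}$, $\mathcal{D}_{1,c}$, $\mathcal{D}_{2,a}$ and $\mathcal{D}_{2,c}$ are independently drawn from
$$\tilde{p}_1(x) = \pi_+ p_+(x) + \pi_- p_-(x),$$
samples in $\mathcal{D}_{1,b}$ are independently drawn from
$$\tilde{p}_2(x) = \frac{(\pi_+^3 + 2\pi_+^2\pi_-)\, p_+(x) + (2\pi_+\pi_-^2 + \pi_-^3)\, p_-(x)}{\pi_T},$$
and samples in $\mathcal{D}_{2,b}$ are independently drawn from
$$\tilde{p}_3(x) = \pi_- p_+(x) + \pi_+ p_-(x),$$
where $\pi_T = 1 - \pi_+\pi_-$.
A proof is given in Appendix B.
Theorem 1 indicates that from triplet comparison data, we can essentially obtain samples that can be drawn independently from three different distributions. We denote the three aggregated datasets as
$$\tilde{\mathcal{D}}_1 = \mathcal{D}_{1,a} \cup \mathcal{D}_{1,c} \cup \mathcal{D}_{2,a} \cup \mathcal{D}_{2,c}, \qquad \tilde{\mathcal{D}}_2 = \mathcal{D}_{1,b}, \qquad \tilde{\mathcal{D}}_3 = \mathcal{D}_{2,b}.$$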
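The mixing proportions behind the three pointwise densities can be verified numerically. The sketch below is an illustration under the same assumptions as the generation process (i.i.d. labels, a triplet flipped exactly when the first and third labels agree and the second differs); the variable names are hypothetical. It measures the fraction of positive labels at each position of the kept and flipped triplets.

```python
import numpy as np

rng = np.random.default_rng(1)
pi_plus, n = 0.7, 300_000
pi_minus = 1.0 - pi_plus
pi_T = 1.0 - pi_plus * pi_minus

# Simulate triplet labels and split them by the user's feedback.
y = np.where(rng.random((n, 3)) < pi_plus, 1, -1)
flipped = (y[:, 0] == y[:, 2]) & (y[:, 0] != y[:, 1])
kept_y, flipped_y = y[~flipped], y[flipped]

def positive_rate(labels):
    """Fraction of +1 labels, i.e. the weight on p_+ in the pointwise mixture."""
    return float(np.mean(labels == 1))

# Weight on p_+ predicted for the second position of kept triplets.
A = (pi_plus**3 + 2 * pi_plus**2 * pi_minus) / pi_T

rates = {
    "D1_a": positive_rate(kept_y[:, 0]),
    "D1_b": positive_rate(kept_y[:, 1]),
    "D1_c": positive_rate(kept_y[:, 2]),
    "D2_a": positive_rate(flipped_y[:, 0]),
    "D2_b": positive_rate(flipped_y[:, 1]),
    "D2_c": positive_rate(flipped_y[:, 2]),
}
for name, rate in rates.items():
    print(name, round(rate, 3))
print("predicted:", pi_plus, round(A, 3), pi_minus)
```

Four of the six positions should show a positive rate near $\pi_+$, the second position of kept triplets a rate near $(\pi_+^3 + 2\pi_+^2\pi_-)/\pi_T$, and the second position of flipped triplets a rate near $\pi_-$.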
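We now attempt to express the classification risk,
$$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}\left[\ell(f(x), y)\right],$$
on the basis of the three pointwise densities presented in Section 3.2.
The classification risk can be separately expressed as the expectations over $p_+(x)$ and $p_-(x)$. Although we do not have access to data drawn from these two distributions, we can obtain data from three related densities $\tilde{p}_1(x)$, $\tilde{p}_2(x)$, and $\tilde{p}_3(x)$ as indicated in Theorem 1. Letting
$$A := \frac{\pi_+^3 + 2\pi_+^2\pi_-}{\pi_T}, \qquad B := \frac{2\pi_+\pi_-^2 + \pi_-^3}{\pi_T}, \qquad T := \begin{pmatrix} \pi_+ & \pi_- \\ A & B \\ \pi_- & \pi_+ \end{pmatrix},$$
we can express the relationship between these densities as
$$\begin{pmatrix} \tilde{p}_1(x) \\ \tilde{p}_2(x) \\ \tilde{p}_3(x) \end{pmatrix} = T \begin{pmatrix} p_+(x) \\ p_-(x) \end{pmatrix}. \tag{8}$$
Our goal is to solve the above equation so that we can express $p_+(x)$ and $p_-(x)$ in terms of the three densities from which we have i.i.d. data samples. To this end, we can rewrite the classification risk, which we want to minimize, in terms of $\tilde{p}_1(x)$, $\tilde{p}_2(x)$ and $\tilde{p}_3(x)$. An answer to Eq. (8) is given by the following lemma.
We can express $p_+(x)$ and $p_-(x)$ in terms of $\tilde{p}_1(x)$, $\tilde{p}_2(x)$ and $\tilde{p}_3(x)$ as
$$p_+(x) = \frac{1}{ac - b^2}\Big[(c\pi_+ - b\pi_-)\,\tilde{p}_1(x) + (cA - bB)\,\tilde{p}_2(x) + (c\pi_- - b\pi_+)\,\tilde{p}_3(x)\Big],$$
$$p_-(x) = \frac{1}{ac - b^2}\Big[(a\pi_- - b\pi_+)\,\tilde{p}_1(x) + (aB - bA)\,\tilde{p}_2(x) + (a\pi_+ - b\pi_-)\,\tilde{p}_3(x)\Big],$$
where $a := \pi_+^2 + A^2 + \pi_-^2$, $b := 2\pi_+\pi_- + AB$, and $c := \pi_-^2 + B^2 + \pi_+^2$.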
Detailed derivation is given in Appendix C.
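The lemma can be sanity-checked numerically. The following sketch assumes the mixing matrix has rows $(\pi_+, \pi_-)$, $(A, B)$, $(\pi_-, \pi_+)$ with $A = (\pi_+^3 + 2\pi_+^2\pi_-)/\pi_T$ and $B = (2\pi_+\pi_-^2 + \pi_-^3)/\pi_T$, and compares a closed-form left inverse built from the Gram matrix with NumPy's `numpy.linalg.pinv`. The function names are hypothetical.

```python
import numpy as np

def mixing_matrix(pi_plus):
    """3x2 matrix mapping (p_+, p_-) to the three pointwise densities."""
    pi_minus = 1.0 - pi_plus
    pi_T = 1.0 - pi_plus * pi_minus
    A = (pi_plus**3 + 2 * pi_plus**2 * pi_minus) / pi_T
    B = (2 * pi_plus * pi_minus**2 + pi_minus**3) / pi_T
    return np.array([[pi_plus, pi_minus], [A, B], [pi_minus, pi_plus]])

def closed_form_pinv(T):
    """Pseudo-inverse (T^T T)^{-1} T^T written out with the scalars a, b, c."""
    a = T[:, 0] @ T[:, 0]
    b = T[:, 0] @ T[:, 1]
    c = T[:, 1] @ T[:, 1]
    gram_inv = np.array([[c, -b], [-b, a]]) / (a * c - b * b)
    return gram_inv @ T.T

T = mixing_matrix(0.7)
T_pinv = closed_form_pinv(T)
print(T_pinv)
print(T_pinv @ T)  # should be close to the 2x2 identity
```

The check only works for $\pi_+ \neq 1/2$; at $\pi_+ = 1/2$ the three rows of the matrix coincide and the Gram matrix is singular.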
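As a result of the above lemma, we can express the classification risk using only triplet comparison data. Letting $\ell_+(x) := \ell(f(x), +1)$ and $\ell_-(x) := \ell(f(x), -1)$, we have the following theorem.
The classification risk can be equivalently expressed as
$$R(f) = \frac{1}{ac - b^2}\Big\{\mathbb{E}_{x \sim \tilde{p}_1}\big[\pi_{\mathrm{test}}(c\pi_+ - b\pi_-)\,\ell_+(x) + (1 - \pi_{\mathrm{test}})(a\pi_- - b\pi_+)\,\ell_-(x)\big] + \mathbb{E}_{x \sim \tilde{p}_2}\big[\pi_{\mathrm{test}}(cA - bB)\,\ell_+(x) + (1 - \pi_{\mathrm{test}})(aB - bA)\,\ell_-(x)\big] + \mathbb{E}_{x \sim \tilde{p}_3}\big[\pi_{\mathrm{test}}(c\pi_- - b\pi_+)\,\ell_+(x) + (1 - \pi_{\mathrm{test}})(a\pi_+ - b\pi_-)\,\ell_-(x)\big]\Big\},$$
where $\pi_{\mathrm{test}}$ denotes the class prior of the test dataset, and $a$, $b$, $c$, $A$, and $B$ are the constants from the above lemma.
A proof is given in Appendix D.
In this paper, we consider the common case in which $\pi_{\mathrm{test}} = \pi_+$, which means the test dataset shares the same class prior as the training dataset. However, even when $\pi_{\mathrm{test}} \neq \pi_+$, which means that a class prior shift (Sugiyama, 2012) occurs, our method can still be used when $\pi_{\mathrm{test}}$ is known.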
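An empirical version of this risk replaces each expectation with a sample mean over the corresponding pointwise dataset. The sketch below is an illustration rather than the authors' implementation: it assumes one-dimensional Gaussian class-conditional densities, the identity scorer $f(x) = x$, the logistic loss, and weights taken from `numpy.linalg.pinv` of the mixing matrix with rows $(\pi_+, \pi_-)$, $(A, B)$, $(\pi_-, \pi_+)$. All function names are hypothetical.

```python
import numpy as np

def mixing_matrix(pi_plus):
    pi_minus = 1.0 - pi_plus
    pi_T = 1.0 - pi_plus * pi_minus
    A = (pi_plus**3 + 2 * pi_plus**2 * pi_minus) / pi_T
    B = (2 * pi_plus * pi_minus**2 + pi_minus**3) / pi_T
    return np.array([[pi_plus, pi_minus], [A, B], [pi_minus, pi_plus]])

def logistic_loss(score, label):
    # log(1 + exp(-label * score)), computed stably.
    return np.logaddexp(0.0, -label * score)

def triplet_risk(scores, pi_plus, pi_test=None, loss=logistic_loss):
    """Empirical risk from scores f(x) on samples of the three pointwise densities."""
    pi_test = pi_plus if pi_test is None else pi_test
    # Rows of the pseudo-inverse weight the three densities for p_+ and p_-.
    w_pos, w_neg = np.linalg.pinv(mixing_matrix(pi_plus))
    risk = 0.0
    for k, s in enumerate(scores):
        risk += pi_test * w_pos[k] * np.mean(loss(s, +1.0))
        risk += (1.0 - pi_test) * w_neg[k] * np.mean(loss(s, -1.0))
    return float(risk)

rng = np.random.default_rng(0)
pi_plus, n = 0.7, 200_000
T = mixing_matrix(pi_plus)

def sample(weight_pos, size):
    """Draw from weight_pos * N(+1, 1) + (1 - weight_pos) * N(-1, 1)."""
    y = np.where(rng.random(size) < weight_pos, 1.0, -1.0)
    return rng.normal(loc=y, scale=1.0), y

x1, _ = sample(T[0, 0], n)
x2, _ = sample(T[1, 0], n)
x3, _ = sample(T[2, 0], n)
x_sup, y_sup = sample(pi_plus, n)

estimate = triplet_risk((x1, x2, x3), pi_plus)          # no labels used
reference = float(np.mean(logistic_loss(x_sup, y_sup)))  # fully supervised
print(estimate, reference)
```

With large samples the label-free estimate should track the supervised risk closely. Some weights are negative, so the estimate can have higher variance than its supervised counterpart on small samples.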
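In this section, we establish an estimation error bound for the proposed unbiased risk estimator. Let $\mathcal{F}$ represent a function class specified by a model. First, let $\mathfrak{R}(\mathcal{F}; n, \mu)$ be the (expected) Rademacher complexity of $\mathcal{F}$, which is defined as
$$\mathfrak{R}(\mathcal{F}; n, \mu) := \mathbb{E}_{Z_1, \ldots, Z_n \sim \mu}\, \mathbb{E}_{\sigma_1, \ldots, \sigma_n}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z_i)\right],$$
where $n$ is a positive integer, $Z_1, \ldots, Z_n$ are i.i.d. random variables drawn from a probability distribution with density $\mu$, and $\sigma_1, \ldots, \sigma_n$ are Rademacher variables, which are random variables that take the value of $+1$ or $-1$ with even probabilities.
We assume for any probability density $\mu$, the specified model $\mathcal{F}$ satisfies $\mathfrak{R}(\mathcal{F}; n, \mu) \leq C_{\mathcal{F}}/\sqrt{n}$ for some constant $C_{\mathcal{F}} > 0$. Also let $f^* := \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer and $\hat{f} := \arg\min_{f \in \mathcal{F}} \hat{R}(f)$ be the empirical risk minimizer, where $\hat{R}(f)$ is the empirical counterpart of $R(f)$ computed from the triplet comparison data.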
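To make the $1/\sqrt{n}$ assumption concrete, the following sketch estimates the empirical Rademacher complexity (conditional on one sample, not the expected version defined above) for an illustrative model class that is not taken from the paper: linear functions $x \mapsto \langle w, x \rangle$ with $\|w\|_2 \leq 1$. For this class the supremum has the closed form $\|\sum_i \sigma_i x_i\|_2 / n$.

```python
import numpy as np

def rademacher_linear(X, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= 1} on the sample X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For the unit ball, sup_w (1/n) sum_i sigma_i <w, x_i> = ||sum_i sigma_i x_i|| / n.
    return float(np.mean(np.linalg.norm(sigma @ X, axis=1)) / n)

rng = np.random.default_rng(1)
for n in (100, 400, 1600):
    X = rng.normal(size=(n, 5))
    # Compare the estimate with sqrt(d / n), the rate expected for Gaussian inputs.
    print(n, rademacher_linear(X), np.sqrt(5 / n))
```

Quadrupling the sample size should roughly halve the estimate, which is the behavior the assumption requires of the chosen model.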
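Assume the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and all functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \leq C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{t \in \{\pm 1\}} \ell(C_b, t)$. Then for any $\delta > 0$, with probability at least $1 - \delta$: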
A proof is given in Appendix E.
Since the number of triplets appears in the denominator, it is obvious that when the class prior is fixed, the bound will get tighter as the amount of triplet comparison data increases. However, it is not clear how the bound will behave when we fix the amount of triplet comparison data and change the class prior. Thus in Figure 1, we show the behavior of the coefficient term with respect to the class prior, taken to be the same for both training and test datasets. From the illustration, we can capture the rough trend that the bound gets tighter when the class prior becomes further from $1/2$. We will further investigate this behavior in experiments.
In the previous sections, the class prior $\pi_+$ is assumed known. For this simple case, we can directly use the proposed algorithm to separate test data as well as identify correct classes. However, this assumption may not hold in many real-world applications. There are two situations that can be considered. In the worst case, no information about the class prior is given. Although we can still estimate the class prior from data and obtain a classifier that is able to separate data of different classes, we cannot identify the correct class without knowing which class has a higher class prior. A better situation is that we know which class has a higher class prior. By setting this class as the positive class, we can successfully train a classifier to identify the correct class. Thus, we assume that the positive class has a higher class prior, which means $\pi_+ > \pi_-$.
Noticing $\pi_T = 1 - \pi_+\pi_-$, we can obtain $\pi_+(1 - \pi_+) = 1 - \pi_T$. By assuming $\pi_+ > \pi_-$, we have
$$\pi_+ = \frac{1 + \sqrt{4\pi_T - 3}}{2}.$$
Since we can unbiasedly estimate $\pi_T$ by $n_1/(n_1 + n_2)$, where $n_1$ and $n_2$ are the numbers of triplets whose order is kept and flipped, respectively, the class prior can thus be estimated once the triplet comparison dataset is given.
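The estimation step amounts to solving a quadratic equation. A minimal sketch follows, assuming $\pi_T = 1 - \pi_+\pi_-$ and taking the larger root so that the positive class is the majority class; the function name and the clipping of the discriminant at zero (to guard against sampling noise) are choices made here for illustration.

```python
import math

def estimate_class_prior(n1, n2):
    """Estimate pi_+ from the numbers of kept (n1) and flipped (n2) triplets."""
    pi_T_hat = n1 / (n1 + n2)
    # pi_+ solves pi_+^2 - pi_+ + (1 - pi_T) = 0; the discriminant is 4 * pi_T - 3.
    discriminant = max(0.0, 4.0 * pi_T_hat - 3.0)
    # The larger root corresponds to the assumption pi_+ > pi_-.
    return 0.5 * (1.0 + math.sqrt(discriminant))

print(estimate_class_prior(79, 21))
print(estimate_class_prior(3, 1))
```

For example, 79 kept and 21 flipped triplets give an estimate of about 0.7, while a kept fraction of 3/4 corresponds to balanced classes.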
In this section, we conducted experiments using real world datasets to evaluate and investigate the performance of the proposed method for triplet classification.
As a simple baseline, we used $k$-means clustering (Macqueen, 1967) with $k = 2$ on all the data instances of triplets while ignoring all the relation information.
Information-theoretic metric learning (ITML) (Davis et al., 2007) is a metric learning method that requires pairwise relationships between data instances. From a triplet $(x_a, x_b, x_c)$, we constructed pairwise constraints as $(x_a, x_b)$ being similar and $(x_a, x_c)$ being dissimilar. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data. We used the identity matrix for prior knowledge and fixed the value of the slack variable.
Triplet loss (Schroff et al., 2015) is a loss function proposed in the context of deep metric learning which can learn a metric directly from triplet comparison data. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data.
Semi-supervised metric learning paradigm with hyper sparsity (SERAPH) (Niu et al., 2014) is a metric learning method based on entropy regularization. We formulated a pairwise relationship in the same manner as with ITML. Using the metric returned by the algorithm, we conducted $k$-means clustering on test data.
SU learning (Bao et al., 2018) is a method for learning a binary classifier from similarity and unlabeled data. We used the same method for estimating the class prior, and considered the less similar sample in a triplet as unlabeled data.
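For reference, the triplet loss baseline penalizes an anchor whose embedding is not closer to the similar sample than to the dissimilar one by at least a margin. The sketch below is a plain NumPy illustration on precomputed embeddings, not the training code used in the experiments; the margin value of 0.2 is an assumption chosen for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge on squared Euclidean distances between embedding batches."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    # Zero loss once the negative is farther than the positive by at least the margin.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])
far = np.array([[3.0, 0.0]])
loss_satisfied = triplet_loss(anchor, near, far)
loss_violated = triplet_loss(anchor, far, near)
print(loss_satisfied, loss_violated)
```

A triplet comparison maps onto this loss by treating the first sample as the anchor, the second as the positive, and the third as the negative.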
We used the following three image datasets.
The MNIST (Deng, 2012) dataset consists of 70,000 examples associated with a label from ten digits. Each data instance is a $28 \times 28$ gray-scale image; thus, the input dimension is 784.
The Fashion MNIST (Xiao et al., 2017) dataset consists of 70,000 examples associated with a label from ten fashion item classes. Each data instance is a $28 \times 28$ gray-scale image; thus, the input dimension is 784. To form a binary classification problem, we treated five classes, i.e., T-shirt/top, Pullover, Dress, Coat, and Shirt, as the positive class since they all represent upper body clothing. The data were standardized to have zero mean and unit variance.
The CIFAR-10 (Li et al., 2017) dataset consists of 60,000 examples associated with a label from ten classes. Each image is given in a $32 \times 32 \times 3$ format; thus, the input dimension is 3,072. To form a binary classification problem, we treated four classes, i.e., airplane, automobile, ship, and truck, as the positive class since they all represent artificial objects.
For the proposed method, we used a fully connected neural network with only one hidden layer and rectified linear units (ReLUs) (Nair and Hinton, 2010) for all the datasets except for CIFAR-10. The width of the hidden layer was kept the same throughout all experiments. Adam (Kingma and Ba, 2014) was used for optimization. The neural network architecture used for CIFAR-10 is specified in the Appendix.
The proposed method estimates the unknown class prior first. For baseline methods, performances are measured by the clustering accuracy $1 - \min(r, 1 - r)$, where $r$ is the error rate. The results for different numbers of triplets are listed in Tables 1, 2, and 3. The best and equivalent methods are shown in bold face based on a one-sided t-test at a fixed significance level. Also, as shown in Figure 2, the performance of the proposed method with respect to the class prior and the size of the training dataset followed the prediction by the theory in most of the cases.
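Because a clustering method cannot tell which cluster is the positive class, its accuracy is naturally scored up to a swap of the two labels. A minimal sketch of such a score follows, assuming it takes the form $1 - \min(r, 1 - r)$ with $r$ the raw error rate; the function name is hypothetical.

```python
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Accuracy up to swapping the two cluster labels: 1 - min(r, 1 - r)."""
    r = float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
    return 1.0 - min(r, 1.0 - r)

print(clustering_accuracy([1, 1, -1, -1], [-1, -1, 1, 1]))
print(clustering_accuracy([1, 1, -1, -1], [1, -1, -1, -1]))
```

By construction this score never falls below 0.5, so it is not directly comparable to the plain accuracy of a classifier that must also name the correct class.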
Table 1
| adult | 65.54 (0.41) | 64.19 (0.61) | 71.94 (0.10) | 71.04 (1.00) | 61.48 (1.36) | 71.04 (1.00) | 75.88 (0.50) |
| breast | 97.41 (0.28) | 96.90 (0.31) | 96.20 (0.34) | 95.84 (0.29) | 93.87 (0.78) | 96.72 (0.23) | 65.26 (0.76) |
| diabetes | 70.71 (0.84) | 64.87 (0.74) | 66.69 (0.70) | 65.91 (0.69) | 64.38 (1.60) | 67.44 (0.78) | 34.42 (0.73) |
| magic | 61.75 (1.00) | 71.91 (0.39) | 65.08 (0.17) | 64.79 (0.17) | 65.42 (0.22) | 64.96 (0.19) | 34.77 (0.19) |
| phishing | 76.58 (0.30) | 74.95 (0.27) | 63.43 (0.50) | 63.75 (0.23) | 57.85 (0.92) | 63.42 (0.53) | 34.17 (0.22) |
| spambase | 62.08 (1.87) | 64.66 (1.04) | 63.59 (0.24) | 63.24 (0.31) | 59.59 (1.57) | 63.28 (0.34) | 60.27 (0.30) |
| mnist | 79.86 (0.35) | 80.78 (0.34) | 65.24 (0.25) | 0.00 (0.00) | 58.26 (1.24) | 0.00 (0.00) | 50.80 (0.03) |
| fashion | 89.73 (0.33) | 91.62 (0.33) | 74.90 (1.00) | 0.00 (0.00) | 76.83 (1.31) | 0.00 (0.00) | 49.85 (0.08) |
| cifar10 | 76.39 (1.57) | 66.28 (2.51) | 64.17 (0.01) | 0.00 (0.00) | 60.17 (1.26) | 0.00 (0.00) | 59.50 (0.50) |

Table 2
| adult | 62.72 (0.57) | 59.74 (1.44) | 71.44 (0.60) | 71.79 (0.20) | 58.53 (1.17) | 70.54 (1.09) | 76.30 (0.04) |
| breast | 96.90 (0.44) | 96.53 (0.35) | 96.28 (0.29) | 96.79 (0.24) | 89.67 (1.97) | 96.68 (0.27) | 64.12 (0.91) |
| diabetes | 69.64 (0.68) | 67.08 (0.91) | 66.27 (0.65) | 64.87 (0.66) | 63.15 (1.56) | 67.44 (0.68) | 33.90 (0.67) |
| magic | 63.86 (1.44) | 70.37 (0.36) | 64.86 (0.15) | 65.03 (0.13) | 66.36 (0.30) | 64.94 (0.14) | 34.83 (0.15) |
| phishing | 75.52 (0.31) | 74.57 (0.37) | 63.08 (0.47) | 63.31 (0.41) | 56.37 (1.18) | 62.73 (0.76) | 33.89 (0.20) |
| spambase | 61.18 (1.11) | 59.95 (1.38) | 63.55 (0.32) | 64.17 (0.31) | 59.35 (1.48) | 63.53 (0.35) | 58.96 (0.44) |
| mnist | 74.23 (0.32) | 75.19 (0.50) | 64.74 (0.55) | 0.00 (0.00) | 56.07 (0.87) | 0.00 (0.00) | 50.87 (0.26) |
| fashion | 83.83 (0.55) | 87.86 (0.66) | 75.40 (0.34) | 0.00 (0.00) | 76.66 (1.39) | 0.00 (0.00) | 49.88 (0.08) |
| cifar10 | 66.28 (1.77) | 62.63 (2.53) | 64.16 (0.01) | 0.00 (0.00) | 61.26 (1.13) | 0.00 (0.00) | 59.05 (0.65) |

Table 3
| adult | 58.12 (0.90) | 55.10 (1.00) | 70.54 (1.50) | 70.04 (1.17) | 58.28 (0.94) | 68.54 (1.67) | 75.27 (0.51) |
| breast | 96.68 (0.32) | 96.50 (0.35) | 95.91 (0.34) | 96.24 (0.24) | 94.27 (0.68) | 96.64 (0.28) | 66.20 (0.80) |
| diabetes | 69.25 (0.98) | 65.36 (0.89) | 64.97 (0.87) | 67.27 (0.72) | 63.47 (1.22) | 67.11 (0.82) | 35.23 (0.94) |
| magic | 60.54 (1.88) | 68.56 (0.53) | 64.88 (0.13) | 65.15 (0.14) | 66.31 (0.42) | 64.97 (0.15) | 34.60 (0.34) |
| phishing | 72.22 (0.62) | 72.11 (0.65) | 63.70 (0.26) | 63.71 (0.21) | 57.02 (1.41) | 63.17 (0.77) | 34.03 (0.32) |
| spambase | 57.69 (1.68) | 55.74 (1.19) | 63.78 (0.34) | 63.04 (0.35) | 60.78 (1.63) | 63.74 (0.25) | 58.92 (0.43) |
| mnist | 67.14 (0.67) | 70.96 (0.53) | 64.49 (1.00) | 0.00 (0.00) | 57.88 (1.43) | 0.00 (0.00) | 50.10 (0.62) |
| fashion | 76.67 (0.40) | 83.74 (0.55) | 74.90 (1.00) | 0.00 (0.00) | 73.24 (1.80) | 0.00 (0.00) | 47.97 (0.76) |
| cifar10 | 63.14 (1.68) | 58.83 (2.16) | 64.16 (0.01) | 0.00 (0.00) | 61.23 (1.18) | 0.00 (0.00) | 58.65 (0.66) |
In this paper, we proposed a novel method for learning a classifier from only passively obtained triplet comparison data. We established an estimation error bound for the proposed method, and confirmed that the estimation error decreases as the amount of triplet comparison data increases. We also empirically confirmed that the performance of the proposed method surpassed multiple baseline methods on various datasets. For future work, it would be interesting to investigate alternative methods that can handle a multi-class case.
Agarwal S, Wills J, Cayton L, Lanckriet G, Kriegman D, Belongie S (2007) Generalized non-metric multidimensional scaling. In: Artificial Intelligence and Statistics, pp 11–18
Deng L (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6):141–142
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp 807–814
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 815–823
Vapnik VN (1995) The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg
Zhou ZH (2017) A brief introduction to weakly supervised learning. National Science Review 5(1):44–53
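From the data generation process, we can consider the generation distribution for data of $\mathcal{D}_1$ as
$$p(x_a, x_b, x_c \mid (y_a, y_b, y_c) \in \mathcal{Y}_1) = \frac{p(x_a, x_b, x_c, (y_a, y_b, y_c) \in \mathcal{Y}_1)}{p((y_a, y_b, y_c) \in \mathcal{Y}_1)}. \tag{15}$$
Note that the denominator in Eq. (15) can be rewritten as
$$p((y_a, y_b, y_c) \in \mathcal{Y}_1) = \pi_+^3 + 2\pi_+^2\pi_- + 2\pi_+\pi_-^2 + \pi_-^3 = 1 - \pi_+\pi_- = \pi_T,$$
then we have
$$p(x_a, x_b, x_c \mid (y_a, y_b, y_c) \in \mathcal{Y}_1) = \frac{p(x_a, x_b, x_c, (y_a, y_b, y_c) \in \mathcal{Y}_1)}{\pi_T}. \tag{17}$$
Moreover, the distribution at the numerator of Eq. (17) can be explicitly expressed as
$$p(x_a, x_b, x_c, (y_a, y_b, y_c) \in \mathcal{Y}_1) = \pi_+^3\, p_+(x_a)p_+(x_b)p_+(x_c) + \pi_+^2\pi_-\, p_+(x_a)p_+(x_b)p_-(x_c) + \pi_+\pi_-^2\, p_+(x_a)p_-(x_b)p_-(x_c) + \pi_+^2\pi_-\, p_-(x_a)p_+(x_b)p_+(x_c) + \pi_+\pi_-^2\, p_-(x_a)p_-(x_b)p_+(x_c) + \pi_-^3\, p_-(x_a)p_-(x_b)p_-(x_c)$$
from the assumption that the three instances in each triplet comparison are generated independently.
Similarly, the underlying density for data of $\mathcal{D}_2$ can be expressed as
$$p(x_a, x_b, x_c \mid (y_a, y_b, y_c) \in \mathcal{Y}_2) = \frac{\pi_+^2\pi_-\, p_+(x_a)p_-(x_b)p_+(x_c) + \pi_+\pi_-^2\, p_-(x_a)p_+(x_b)p_-(x_c)}{\pi_+\pi_-} = \pi_+\, p_+(x_a)p_-(x_b)p_+(x_c) + \pi_-\, p_-(x_a)p_+(x_b)p_-(x_c).$$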
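For simplicity, we give the proof for $\mathcal{D}_{1,a}$; the other cases follow by similar arguments. Noticing that $\pi_+^2 + \pi_+\pi_- + \pi_-^2 = (\pi_+ + \pi_-)^2 - \pi_+\pi_- = \pi_T$, and in order to decompose the triplet comparison data distribution into a pointwise distribution, we marginalize $p_1(x_a, x_b, x_c)$ with respect to $x_b$ and $x_c$:
$$\iint p_1(x_a, x_b, x_c)\, dx_b\, dx_c = \frac{(\pi_+^3 + \pi_+^2\pi_- + \pi_+\pi_-^2)\, p_+(x_a) + (\pi_+^2\pi_- + \pi_+\pi_-^2 + \pi_-^3)\, p_-(x_a)}{\pi_T} = \pi_+ p_+(x_a) + \pi_- p_-(x_a).$$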
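Notice that the equation has an infinite number of solutions. Let
$$T := \begin{pmatrix} \pi_+ & \pi_- \\ A & B \\ \pi_- & \pi_+ \end{pmatrix}, \qquad A = \frac{\pi_+^3 + 2\pi_+^2\pi_-}{\pi_T}, \qquad B = \frac{2\pi_+\pi_-^2 + \pi_-^3}{\pi_T},$$
and let $T^*$ denote the conjugate transpose of $T$. We have
$$T^*T = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, \qquad a := \pi_+^2 + A^2 + \pi_-^2, \quad b := 2\pi_+\pi_- + AB, \quad c := \pi_-^2 + B^2 + \pi_+^2.$$
In the next step, we need to take the inverse of the above matrix. To achieve a proper inverse matrix, we need to introduce another assumption that $\pi_+ \neq 1/2$, which guarantees $ac - b^2 \neq 0$. Then
$$(T^*T)^{-1} = \frac{1}{ac - b^2}\begin{pmatrix} c & -b \\ -b & a \end{pmatrix}.$$
Finally, the Moore-Penrose pseudo inverse is given by
$$T^+ = (T^*T)^{-1}T^* = \frac{1}{ac - b^2}\begin{pmatrix} c\pi_+ - b\pi_- & cA - bB & c\pi_- - b\pi_+ \\ a\pi_- - b\pi_+ & aB - bA & a\pi_+ - b\pi_- \end{pmatrix}.$$
Thus we can express $p_+(x)$ and $p_-(x)$ in terms of $\tilde{p}_1(x)$, $\tilde{p}_2(x)$ and $\tilde{p}_3(x)$ as
$$\begin{pmatrix} p_+(x) \\ p_-(x) \end{pmatrix} = T^+ \begin{pmatrix} \tilde{p}_1(x) \\ \tilde{p}_2(x) \\ \tilde{p}_3(x) \end{pmatrix}.$$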
we can simplify the unbiased risk estimator into the form
For the first term,