Similarity-based Classification: Connecting Similarity Learning to Binary Classification

06/11/2020 · Han Bao, et al.

In real-world classification problems, pairwise supervision (i.e., a pair of patterns with a binary label indicating whether they belong to the same class or not) can often be obtained at a lower cost than ordinary class labels. Similarity learning is a general framework to utilize such pairwise supervision to elicit useful representations by inferring the relationship between two data points, which encompasses various important preprocessing tasks such as metric learning, kernel learning, graph embedding, and contrastive representation learning. Although elicited representations are expected to perform well in downstream tasks such as classification, little theoretical insight has been given in the literature so far. In this paper, we reveal that a specific formulation of similarity learning is strongly related to the objective of binary classification, which spurs us to learn a binary classifier without ordinary class labels, by fitting the product of real-valued prediction functions of pairwise patterns to their similarity. Our formulation of similarity learning not only generalizes many existing ones but also admits an excess risk bound showing an explicit connection to classification. Finally, we empirically demonstrate the practical usefulness of the proposed method on benchmark datasets.


1 Introduction

In pattern recognition, the primary goal is to train a classifier that generalizes well to unseen test patterns. Supervised classification is a central formulation for training such a classifier: given training pairs of a pattern and its corresponding class label, we directly minimize an empirical classification risk, which measures the discrepancy between a given class label and its prediction. This approach is called empirical risk minimization (ERM) and has been studied extensively in the literature [49].

Similarity learning [28] is another learning paradigm, where a pairwise model is built to predict whether a given pair of input patterns is similar or dissimilar, typically as a preprocessing step. We call such labeled pairs of input patterns pairwise supervision, in contrast with ordinary pointwise supervision, where each input pattern is equipped with its class label. Metric learning based on the Mahalanobis distance [56, 9, 19, 53, 8, 38], kernel learning [16, 1, 30, 32, 15], (ε, γ, τ)-good similarity [2, 8], and contrastive representation learning [57, 35, 47, 52, 39, 33] are encompassed in this framework. The obtained pairwise model can be regarded as a metric function in the pattern space. If a good metric is learned, the model is expected to achieve better performance in downstream tasks by capturing inherent structures within data. For this reason, similarity learning has been widely used for various downstream tasks such as classification [16, 2, 25, 44, 40], clustering [10, 56, 19, 53], model selection [30], representation learning [35, 39], and one-shot learning [27].

Several existing studies have investigated theoretical guarantees for downstream classification by assuming that both pointwise and pairwise supervision are available [16, 2, 8, 44, 40]. However, it would be more appealing if we could provably achieve good generalization for downstream classification with only pairwise supervision, since it can be obtained easily in various real-world domains such as geographical analysis [51], chemical experiments [26], click-through feedback [19], computer vision [57, 52], natural language processing [35, 33], and privacy-aware questioning [3].

In this work, we seek to establish a connection between binary classification and similarity learning: with a specific formulation of similarity learning, we can guarantee the generalization performance of pointwise classifiers learned with only pairwise supervision. The basic idea is closely related to excess risk transfer bounds (also known as regret transfer bounds [37]; we use "excess risk" in this paper to avoid confusion with "regret" in online optimization [22]), in which the excess risk of a target problem is connected to that of an alternative problem that we solve in practice. One of the most well-known examples is the excess risk bound for surrogate risks [5, 42], which upper-bounds the excess risk of the (binary) classification error by that of a surrogate risk and thereby justifies classification via surrogate risk minimization. Adopting a similar idea, we first connect the risk functions of similarity learning and binary classification (Section 3.1). At this point, we only focus on the classification performance up to label flipping, which does not require a classifier to predict class labels but only to separate patterns into two groups. Next, we investigate how to recover the correct class assignment explicitly and what minimal information is needed for this purpose (Section 3.2). This result is useful when one wants to predict actual class labels with minimally available information. These findings inspire us to train a pointwise binary classifier from pairwise patterns, by fitting the product of real-valued prediction functions of a pair to their similarity (Section 3.3). Furthermore, we establish a finite-sample excess risk bound and the consistency of the proposed method (Section 4). Finally, we experimentally demonstrate the practical usefulness of the proposed method (Section 5).

Related work.

A number of studies have tried to solve classification with only pairwise supervision. Semi-supervised clustering [51, 6, 7, 58, 13, 38] is one of the common approaches: it assigns a cluster index to each input pattern using only pairwise supervision by performing clustering without violating the pairwise constraints. However, it depends on a cluster assumption [12], which may not hold in many real-world problems, and its generalization performance has not been theoretically guaranteed. Recently, a meta-classification approach has emerged [25, 54], in which a model predicting pairwise labels is decomposed into pointwise classifiers. Wu et al. [54] studied the generalization performance of pairwise prediction, while the pointwise generalization performance of the meta classifier remains unexplored. Historically, Zhang and Yan [59] gave a theoretical justification for a similar approach, but their result only holds for the squared loss in the asymptotic case. More recently, several studies [3, 45, 17, 18] have solved classification with only pairwise supervision by minimizing unbiased estimates of the classification risk. Their approaches come with pointwise generalization error bounds, but their performance deteriorates as the class-prior probabilities become close to uniform.

2 Problem setup

Let X ⊆ R^d be a d-dimensional input pattern space, Y = {+1, −1} be the binary label space, and p(x, y) be the density of an underlying distribution over X × Y. Denote the positive (negative, resp.) class prior by π_+ := p(y = +1) (π_− := p(y = −1), resp.). Let sign(z) = +1 for z ≥ 0 and sign(z) = −1 otherwise.

Binary classification.

The goal of binary classification is to classify unseen patterns into the positive and negative classes. It can be formulated as the problem of finding a classifier f : X → R that minimizes the classification error:

R(f) := E_(x,y)~p[ 1[ sign(f(x)) ≠ y ] ],   (1)

where 1[·] is the indicator function and E denotes the expectation with respect to p(x, y). Typically, we specify a hypothesis class F beforehand and find a minimizer of R in it: f* := argmin_{f ∈ F} R(f). In practice, ERM is applied with a finite number of training data, where the expectation in R(f) is approximated by the sample average.

Similarity learning.

Here, we introduce similarity learning, which aims to learn a good metric that represents the similarity between pairwise patterns. Specifically, we focus on the following formulation: another binary classification problem with pairwise supervision, whose goal is to predict whether a pair of patterns belongs to the same class or not. At this point, we are interested in the minimizer of the pairwise classification error defined by

R_pair(f) := E_(x,y),(x',y')~p×p[ 1[ sign(f(x) f(x')) ≠ y y' ] ].   (2)

The product of class labels, i.e., y y', indicates whether y and y' are the same (y y' = +1) or not (y y' = −1). Throughout this paper, we call R / R_pair the pointwise/pairwise classification error, respectively.
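To make this setting concrete, the following sketch (an illustration, not the authors' code) generates pairwise supervision from an ordinarily labeled dataset by random coupling: two patterns are drawn, and only the product s = y·y' is kept while the individual labels are discarded.

```python
import numpy as np

def make_pairwise_supervision(X, y, n_pairs, seed=0):
    """Randomly couple pointwise examples (X, y) into pairs and keep only
    the pairwise label s = y * y' (+1: same class, -1: different classes)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    s = y[i] * y[j]            # the pointwise labels themselves are then discarded
    return X[i], X[j], s

# toy usage with labels in {-1, +1}
X = np.random.randn(1000, 5)
y = np.where(X[:, 0] + 0.1 * np.random.randn(1000) > 0, 1, -1)
Xa, Xb, s = make_pairwise_supervision(X, y, n_pairs=500)
```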

3 Learning a binary classifier from pairwise supervision

In this section, we show the connection between binary classification and similarity learning (Theorem 1) via the classification errors (1) and (2). This connection enables us to train a pointwise binary classifier with pairwise supervision, i.e., we can propose a two-step algorithm (Section 3.3). All proofs are deferred to the Appendix.

3.1 Step 1: Clustering error minimization

We first introduce a performance metric called the clustering error, which quantifies the discriminative power of a classifier up to label flipping:

R_clu(f) := min{ R(f), R(−f) }.   (3)

The clustering error is often used for the performance evaluation of clustering methods because no class-specific information is available in ordinary clustering scenarios [21]. The clustering error can be connected to the pairwise classification error as follows.

Theorem 1.

Any classifier f satisfies

R_pair(f) = 2 R_clu(f) (1 − R_clu(f)).   (4)

Note that, because t ↦ 2t(1 − t) is strictly increasing on [0, 1/2] and R_clu(f) ∈ [0, 1/2], the following monotonic relationship holds for any two hypotheses f and f':

R_pair(f) ≤ R_pair(f')  ⇔  R_clu(f) ≤ R_clu(f').

Hence, Theorem 1 states that minimization of the pairwise classification error leads to minimization of the clustering error. Thus, minimization or maximization of the pointwise classification error is achieved through minimization of the pairwise classification error. Although it may seem unintuitive to speak of "maximization" of the pointwise classification error, a maximizer of R can easily be converted into a minimizer by flipping the sign of the classifier.
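As a numerical illustration of this connection (under the relation stated in Theorem 1 as reconstructed above), the following Monte-Carlo check uses a deliberately miscalibrated threshold classifier on a toy one-dimensional task; the task and the threshold are assumptions made only for this sketch. The pairwise error of the product classifier matches 2·R_clu·(1 − R_clu) up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D task: true label y = sign(x); classifier thresholds at b instead of 0
n = 200_000
x = rng.normal(size=n)
y = np.where(x > 0, 1, -1)
b = 0.5
pred = np.where(x - b > 0, 1, -1)

eps = np.mean(pred != y)          # pointwise classification error R(f)
clu = min(eps, 1 - eps)           # clustering error (error up to label flipping)

# pairwise error of the product classifier sign(f(x)f(x')) against y * y'
i, j = rng.integers(0, n, size=(2, n))
pair_err = np.mean(pred[i] * pred[j] != y[i] * y[j])

print(f"pointwise error  = {eps:.3f}")
print(f"clustering error = {clu:.3f}")
print(f"pairwise error   = {pair_err:.3f}")
print(f"2*clu*(1-clu)    = {2 * clu * (1 - clu):.3f}")   # agrees with pair_err up to noise
```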

While several studies on (ε, γ, τ)-good similarity [8] and contrastive learning [44, 40] connected similarity learning to downstream classification with the aid of pointwise supervision, our approach directly connects similarity learning to classification without any pointwise supervision.

Surrogate risk minimization.

Here, we discuss surrogate losses for clustering error minimization with pairwise supervision. We define a pairwise hypothesis class by F_pair := {(x, x') ↦ f(x) f(x') | f ∈ F}, where F is a specified class of real-valued prediction functions f : X → R. Theorem 1 suggests that we may minimize R_clu by minimizing R_pair instead. As in standard binary classification, the indicator function appearing in R_pair is replaced with a surrogate loss ℓ, since direct minimization of the indicator function is intractable [5]. Eventually, clustering error minimization is performed via minimization of the pairwise surrogate classification risk

R_ℓ,pair(f) := E_(x,y),(x',y')~p×p[ ℓ( y y' · f(x) f(x') ) ].   (5)

If a classification-calibrated surrogate loss [5] is used, minimization of R_ℓ,pair is expected to lead to minimization of R_pair as well. (If a surrogate loss is classification-calibrated, minimization of the surrogate classification risk leads to minimization of the target classification error; the precise definition can be found in Bartlett et al. [5].) This is justified by Lemma 1 in Section 4.

In practice, we approximate the expectation in R_ℓ,pair by the sample average and obtain a classifier minimizing the empirical risk. This empirical risk minimization is justified by Lemma 2 in Section 4.
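As a concrete instance of Eq. (5), the snippet below (a sketch, not the authors' implementation) evaluates the empirical pairwise surrogate risk of a linear model with the logistic loss; any other margin-based loss could be substituted.

```python
import numpy as np

def logistic_loss(margin):
    # numerically stable log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)

def empirical_pairwise_risk(w, b, Xa, Xb, s, loss=logistic_loss):
    """Empirical version of Eq. (5): the mean of loss(s * f(x) * f(x'))
    for a linear model f(x) = <w, x> + b and pairwise labels s in {-1, +1}."""
    fa = Xa @ w + b
    fb = Xb @ w + b
    return np.mean(loss(s * fa * fb))
```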

3.2 Step 2: Class assignment

For a given hypothesis f, we are interested in the sign σ ∈ {+1, −1} such that σf attains the smaller pointwise classification error, i.e., whether f or −f should be used. We refer to this identification of the optimal sign as class assignment, and denote the optimal class assignment by σ* := argmin_{σ ∈ {+1, −1}} R(σf). Unfortunately, it is hopeless to determine the correct class assignment with only pairwise supervision because it does not provide any class-specific information. Indeed, pairwise labels would not change even if all of the positive and negative labels were flipped. Thus, we require another source of information to obtain the correct class assignment. One may think of a situation where a small number of class labels are available; in such a case, we can determine the class assignment that minimizes the pointwise classification error, as described by Zhang and Yan [59]. Here, we further ask if it is possible to obtain the correct class assignment without any class labels. Surprisingly, we find that this is still possible if the positive and negative proportions are not equal and we know which class is the majority. Based on the equivalent expression of the classification error R given by Shimada et al. [45], this finding is formally stated in the following theorem.

Theorem 2.

Assume that the class prior π_+ is not equal to 1/2. Then, the optimal class assignment σ* can be represented as

(6)

In practice, the assumption that π_+ ≠ 1/2 may be satisfied naturally in imbalanced classification; in anomaly detection, for example, the two classes are usually highly imbalanced. We approximate σ* with a finite number of pairwise patterns. Theoretically, the pairwise patterns used for class assignment should be independent of those used for clustering error minimization.
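The exact estimator in Eq. (6) is built on the equivalent risk expression of Shimada et al. [45] and is not reproduced here. As a rough stand-in that uses only the assumption exploited by Theorem 2 (the majority class is known), the following heuristic sketch picks the sign under which the larger predicted group is labeled as the known majority class; it is a simplification for illustration, not the paper's estimator.

```python
import numpy as np

def assign_sign_by_majority(f_values, majority_is_positive=True):
    """Heuristic class assignment: choose sigma in {+1, -1} so that the larger
    of the two predicted groups gets the label of the known majority class.
    A simplified stand-in for the class-assignment step, not Eq. (8) itself."""
    frac_pos = np.mean(np.asarray(f_values) > 0)
    sigma = 1 if frac_pos >= 0.5 else -1
    return sigma if majority_is_positive else -sigma
```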

3.3 Proposed method

Motivated by Theorems 1 and 2, we propose a two-stage method for pointwise classification with pairwise supervision. Assume that the class prior π_+ is not 1/2 and the majority class is known. Let D := {(x_i, x'_i, s_i)}_{i=1}^{n} be a set of pairwise patterns, where s_i := y_i y'_i and (x_i, y_i) and (x'_i, y'_i) are i.i.d. examples following p(x, y). We randomly divide the n pairs in D into two sets D_1 and D_2 of sizes n_1 and n_2, respectively, satisfying n_1 + n_2 = n.

First, we obtain a minimizer f̂ of the empirical pairwise classification risk with D_1:

f̂ := argmin_{f ∈ F} (1/n_1) Σ_{(x_i, x'_i, s_i) ∈ D_1} ℓ( s_i f(x_i) f(x'_i) ).   (7)

Next, we obtain the class assignment σ̂ with D_2 and the class prior π_+:

(8)

Eventually, σ̂ f̂ can be used for pointwise classification.

If class assignment is not necessary and the goal is just to separate test patterns into two disjoint groups, we may simply set D_1 = D and omit the second step of finding σ̂.
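A minimal end-to-end sketch of the two-step method is given below, assuming a linear model, the logistic loss, and plain SGD in PyTorch (all illustrative choices rather than the authors' exact setup); the class-assignment step uses the simplified majority-based heuristic from the sketch in Section 3.2 instead of the exact estimator (8).

```python
import torch
import torch.nn.functional as F

def train_similarity_classifier(Xa, Xb, s, X_assign, majority_is_positive=True,
                                epochs=200, lr=1e-2):
    """Two-step method sketch.
    Xa, Xb: (n, d) float tensors of paired patterns; s: (n,) tensor in {+1, -1}.
    X_assign: held-out patterns used only for the class-assignment step."""
    model = torch.nn.Linear(Xa.shape[1], 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    # Step 1: clustering error minimization via the pairwise logistic risk (7)
    for _ in range(epochs):
        fa = model(Xa).squeeze(1)
        fb = model(Xb).squeeze(1)
        loss = F.softplus(-s * fa * fb).mean()   # log(1 + exp(-s f(x) f(x')))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Step 2: class assignment (majority-based heuristic; see the caveat above)
    with torch.no_grad():
        frac_pos = (model(X_assign).squeeze(1) > 0).float().mean().item()
    sigma = 1.0 if (frac_pos >= 0.5) == majority_is_positive else -1.0

    def predict(X):
        with torch.no_grad():
            return torch.sign(sigma * model(X).squeeze(1))
    return predict
```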

Relation to existing similarity learning.

First, several existing formulations can be regarded as special cases of the pairwise surrogate risk (5): kernel learning [16] assumes the linear loss ℓ(m) = −m, (ε, γ, τ)-good similarity [8] assumes the hinge loss ℓ(m) = max(0, 1 − m), and contrastive learning [33] assumes the logistic loss ℓ(m) = log(1 + exp(−m)), all by regarding f(x) f(x') as the similarity between x and x'.

Next, Hsu et al. [25] formulated similarity learning in a slightly different way, as maximum likelihood estimation of pairwise supervision (the multi-class formulation in Hsu et al. [25] is simplified to the binary case here for comparison). With t_ij ∈ {0, 1} denoting the pairwise label of (x_i, x_j), their objective is

max_g Σ_(i,j) [ t_ij log⟨g(x_i), g(x_j)⟩ + (1 − t_ij) log(1 − ⟨g(x_i), g(x_j)⟩) ],   (9)

where ⟨g(x_i), g(x_j)⟩ is the inner product of the class-posterior vectors given by multinomial logistic models. On the other hand, our formulation (7) with the logistic loss is

min_{f ∈ F} (1/n_1) Σ_{(x_i, x'_i, s_i) ∈ D_1} log( 1 + exp(−s_i f(x_i) f(x'_i)) ).   (10)

Equation (9) defines similarity by the inner product of class probabilities, while Eq. (10) defines similarity by the product f(x_i) f(x'_i) of real-valued outputs. Equation (10) is often called the inner product similarity (IPS) model [41] (the IPS model originally defines similarity between two vector-valued data representations, hence the name inner product similarity). The IPS model is used in several domains [47, 33, 44, 41]. While both are valid similarity learning methods, Eq. (10) is a more natural extension of classification risk minimization: one can choose arbitrary loss functions, and the pairwise classification risk minimization (7) admits an excess risk bound (Lemma 1 in Section 4).

In short, our formulation is beneficial in two ways: it generalizes the choice of surrogate losses in Eq. (5), and it has a more explicit connection to the clustering error through Eq. (4).
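To make the contrast between Eqs. (9) and (10) concrete, the snippet below writes both per-pair objectives for the binary case. The parameterizations (softmax class posteriors for the MCL form, a single real-valued score for the IPS form) are illustrative sketches, not the exact implementations of Hsu et al. [25] or of this paper.

```python
import torch
import torch.nn.functional as F

def mcl_pair_loss(logits_i, logits_j, t):
    """Eq. (9) style: negative log-likelihood of the pairwise label t in {0, 1},
    where the pair probability is the inner product of class-posterior vectors."""
    p_i = F.softmax(logits_i, dim=-1)        # class probabilities for x_i
    p_j = F.softmax(logits_j, dim=-1)
    p_same = (p_i * p_j).sum()               # <g(x_i), g(x_j)>
    return -(t * torch.log(p_same) + (1 - t) * torch.log(1 - p_same))

def ips_pair_loss(f_i, f_j, s):
    """Eq. (10) style: logistic loss on the product of real-valued outputs,
    with the pairwise label s in {+1, -1}."""
    return F.softplus(-s * f_i * f_j)
```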

4 Theoretical analysis

We theoretically analyze the excess risk of the proposed method. Let f̂ and σ̂ be the solutions obtained by Eqs. (7) and (8), respectively. The excess risk of the proposed method is

R(σ̂ f̂) − R*,   (11)

where R* := inf_f R(f) and the infimum is taken over all measurable functions; R* is also known as the Bayes error rate. An important insight is that if the class assignment is successful, i.e., σ̂ = σ*, the excess risk is equal to the excess risk with respect to clustering error minimization:

R(σ̂ f̂) − R* = R_clu(f̂) − R*.   (12)

We derive a probabilistic guarantee for clustering error minimization using the Rademacher complexity [4]. Recall the pairwise function class F_pair := {(x, x') ↦ f(x) f(x') | f ∈ F}, whose elements are products of the values of a single function in F at the two patterns of a pair. The Rademacher complexity of F_pair is defined as follows [4].

Definition 1.

Let {(x_i, x'_i)}_{i=1}^{n} be i.i.d. random pairwise patterns drawn from a probability distribution with density p(x) p(x'), and τ_1, …, τ_n be Rademacher variables, i.e., independent random signs with Pr(τ_i = +1) = Pr(τ_i = −1) = 1/2 for all i. Then, the Rademacher complexity of F_pair is

ℜ_n(F_pair) := E[ sup_{f ∈ F} (1/n) Σ_{i=1}^{n} τ_i f(x_i) f(x'_i) ].   (13)
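Definition 1 can also be approximated numerically. The sketch below estimates the empirical Rademacher complexity of the product class induced by a norm-bounded linear model; the random search over weight vectors only approximates (and in general lower-bounds) the supremum, and the norm bound B and the linear class are assumptions made for illustration.

```python
import numpy as np

def empirical_rademacher_product_class(Xa, Xb, B=1.0, n_sigma=200, n_w=500, seed=0):
    """Crude Monte-Carlo estimate of the empirical Rademacher complexity of
    {(x, x') -> f(x) f(x') : f(x) = <w, x>, ||w|| <= B} on the given pairs."""
    rng = np.random.default_rng(seed)
    n, d = Xa.shape
    # candidate weight vectors scaled to norm B (random search stands in for the sup)
    W = rng.normal(size=(n_w, d))
    W *= B / np.linalg.norm(W, axis=1, keepdims=True)
    vals = (Xa @ W.T).T * (Xb @ W.T).T         # (n_w, n) values of f(x_i) f(x'_i)
    total = 0.0
    for _ in range(n_sigma):
        tau = rng.choice([-1.0, 1.0], size=n)  # Rademacher variables
        total += np.max(vals @ tau) / n        # approx sup_f (1/n) sum_i tau_i f(x_i) f(x'_i)
    return total / n_sigma
```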

Before obtaining an excess risk bound in terms of the clustering error, we need to bridge the pairwise classification error R_pair and the surrogate risk R_ℓ,pair.

Lemma 1.

If a surrogate loss ℓ is classification-calibrated [5], then there exists a convex, non-decreasing, and invertible ψ such that for any sequence (θ_i) in [0, 1], ψ(θ_i) → 0 if and only if θ_i → 0, and for any measurable function f and probability distribution on X × Y,

ψ( R_pair(f) − R_pair* ) ≤ R_ℓ,pair(f) − R_ℓ,pair*,   (14)

where R_pair* := inf_f R_pair(f) and R_ℓ,pair* := inf_f R_ℓ,pair(f), with the infima taken over all measurable functions.

Although the counterpart of Lemma 1 for the pointwise risks is already known [5, Theorem 1], the proof for the pairwise risks requires special care to treat the product of prediction functions properly.

Then, the excess risk bound is derived based on Theorem 1, Lemma 1, and a uniform deviation bound.

Lemma 2.

Let f_ℓ* be a minimizer of R_ℓ,pair, and f̂ be the minimizer of the empirical risk defined in Eq. (7). Assume that ℓ is ρ-Lipschitz (ρ > 0) and that |f(x)| ≤ C_F for any x ∈ X and f ∈ F, for some C_F > 0. Let C_ℓ := sup_{|m| ≤ C_F²} ℓ(m). For any δ > 0, with probability at least 1 − δ,

(15)

Next, we analyze the error probability of class assignment using pairwise supervision and the class prior.

Lemma 3.

Assume that π_+ ≠ 1/2. Let σ̂ be the solution defined in Eq. (8). Then, we have

(16)

Several observations follow from Lemma 3. As π_+ approaches 1/2, the inequality (16) becomes loose. This comes from the fact that estimating the pointwise classification error from pairwise supervision becomes more difficult as π_+ approaches 1/2 [45]. Moreover, the discriminability of the function f, i.e., how far its classification error is from 1/2, appears in the inequality and is thus directly related to the error rate. Intuitively, if a given function classifies a large portion of the patterns correctly (up to label flipping), the optimal sign can be identified easily.

Finally, an overall excess risk bound is derived by combining Lemma 2 and Lemma 3 as follows.

Theorem 3.

Assume that π_+ ≠ 1/2. Under the same assumptions as in Lemma 2, for any δ > 0, the following holds with probability at least 1 − δ:

(17)

If the Rademacher complexity satisfies ℜ_n(F_pair) = O(1/√n), the right-hand side of Eq. (17) asymptotically approaches the approximation error in probability. For example, a linear-in-parameter model f(x) = ⟨w, φ(x)⟩ + b satisfies ℜ_n(F_pair) = O(1/√n), as shown in Kuroki et al. [29, Lemma 5], where w and b are the weight and bias parameters and φ is a feature mapping. Note that our result is stronger than that of Zhang and Yan [59]: they only provided asymptotic convergence, while Theorem 3 provides a finite-sample guarantee.

Discussion.

Since class assignment admits an exponential decay of the error probability (Lemma 3) under the moderate condition π_+ ≠ 1/2, we may set n_2 much smaller than n_1 in practice. In contrast, our excess risk bound for clustering error minimization (Lemma 2) is governed in part by the ψ-transform. An explicit rate requires a specific choice of loss function: for example, the hinge loss gives ψ(θ) = θ, and hence, under the assumption ℜ_n(F_pair) = O(1/√n), an explicit rate in n follows for the hinge loss. (The logistic loss also admits an explicit ψ-transform; for more examples of ψ, please refer to Steinwart [46, Table 1].) This rate is no slower than in the pointwise supervised case, because pairwise supervision can be generated whenever pointwise supervision is available.

Note again that the proposed method assumes π_+ ≠ 1/2 only in class assignment (Step 2 and Lemma 3), not in clustering error minimization (Step 1 and Lemma 2). This is a subtle but notable difference from earlier similarity learning methods based on unbiased classification risk estimators, which require π_+ ≠ 1/2 even in risk minimization (see, e.g., Shimada et al. [45, Theorem 3]).

Our excess risk transfer bound (Theorem 3) resembles the transfer bounds among binary classification, class probability estimation, and bipartite ranking [37], which show that the excess risks of both binary classification and class probability estimation (CPE) can be bounded from above by that of bipartite ranking. As can be seen in Narasimhan and Agarwal [37, Theorems 4 and 14], the excess risk rate slows down after the reduction of classification/CPE to ranking. The same decay is observed in Theorem 3 as well, which reduces classification to similarity learning. This decay can be regarded as a loss arising from the problem reduction.

5 Experiments

dataset (dim., π_+)    n     Ours        MCL         SD          OVPC        SSP         CKM         KM          SV
adult (123, 0.24)      100   39.8 (1.6)  38.4 (2.1)  30.8 (0.9)  45.0 (0.9)  24.7 (0.3)  28.9 (0.8)  24.9 (0.5)  21.9 (0.4)
                       1000  17.6 (0.3)  17.2 (0.3)  20.5 (0.3)  45.5 (0.7)  24.2 (0.3)  27.9 (0.4)  27.9 (0.5)  15.9 (0.3)
codrna (8, 0.33)       100   24.7 (1.8)  32.3 (1.4)  28.0 (1.3)  32.0 (2.0)  45.5 (1.5)  46.7 (0.6)  42.5 (1.0)  11.0 (0.6)
                       1000   6.3 (0.2)   6.5 (0.2)   8.8 (0.4)  28.3 (2.0)  44.8 (1.6)  46.1 (0.4)  45.4 (0.6)   6.3 (0.2)
ijcnn1 (22, 0.10)      100   16.6 (2.3)  24.9 (2.9)  10.7 (0.3)  41.1 (1.1)  31.6 (2.0)  40.0 (1.3)  31.9 (2.4)   9.1 (0.2)
                       1000   7.7 (0.2)   7.9 (0.2)   8.1 (0.2)  42.0 (1.4)  34.9 (1.7)  45.9 (0.8)  43.4 (0.7)   7.6 (0.2)
phishing (44, 0.68)    100   12.7 (2.3)  12.8 (2.3)  34.6 (1.8)  41.7 (1.0)  46.6 (0.5)  24.4 (3.4)  47.0 (0.5)   7.6 (0.2)
                       1000   6.5 (0.2)   6.3 (0.2)  22.0 (1.0)  43.8 (1.1)  45.5 (0.5)  15.2 (2.7)  46.4 (0.5)   6.3 (0.2)
w8a (300, 0.03)        100   31.5 (1.9)  31.4 (2.1)  11.8 (0.3)  39.7 (1.4)   5.3 (1.2)   6.8 (1.9)   5.5 (1.3)  10.3 (0.4)
                       1000   2.6 (0.2)   2.2 (0.1)   2.6 (0.2)  43.1 (0.8)   3.0 (0.1)   8.9 (2.6)   3.7 (0.5)   2.0 (0.1)

Table 1: Mean clustering error and standard error on the benchmark datasets over repeated trials, for different numbers of training pairs n. Bold numbers indicate outperforming methods (excluding SV): among each configuration, the best one is chosen first, and then the comparable ones are chosen by a one-sided t-test at a prescribed significance level.

This section shows simulation results to confirm our theoretical findings: the sample complexity of clustering error minimization via similarity learning (Lemma 2), the effect of the class prior in similarity learning (Discussion in Section 4), and class assignment without pointwise supervision (Lemma 3). In addition, benchmark results are included for comparison with baselines. All experiments were carried out with a 3.60 GHz Intel Core i7-7700 CPU and a GeForce GTX 1070 GPU. Full results of experiments with more datasets are included in Appendix D.

Clustering error minimization on benchmark datasets.

Tabular datasets from the LIBSVM [11] and UCI [20] repositories and the MNIST dataset [31] were used in the benchmarks. The class labels of MNIST were binarized into even vs. odd digits. Pairwise supervision was artificially generated by randomly coupling pointwise data in the original datasets. We briefly introduce the baseline methods below. Constrained k-means clustering (CKM) [51] and semi-supervised spectral clustering (SSP) [13] are semi-supervised clustering methods based on k-means [34] and spectral clustering [50], respectively. The method proposed by Zhang and Yan [59], called OVPC, and similar-dissimilar classification (SD) [45] are classification methods with generalization guarantees. Meta-classification likelihood (MCL) [25] is an approach based on maximum likelihood estimation over pairwise labels. For reference, the performances of (unsupervised) k-means clustering (KM) and supervised learning (SV) are also included.

For classification methods that require model specification (i.e., ours, SD, MCL, OVPC, and SV), a linear model was used. For computing the empirical pairwise classification risk in Eq. (7) and the empirical pointwise classification risk in SD, the logistic loss was used as the surrogate loss. As the optimizer for ours, SD, MCL, and SV, stochastic gradient descent was used; the minibatch size, learning rate, weight decay, and number of training epochs were fixed across all experiments. For KM and SV, the pairwise data were used for training without the link information. The true class labels were revealed to SV.

First, to see the sample complexity behavior suggested by Lemma 2, classifiers were trained on MNIST while the number of pairwise data n was varied over several values. Figure 1(a) presents the performance of our method and SV. This result demonstrates that the clustering error of the proposed method decreases steadily as n grows, which is consistent with Lemma 2. Moreover, the proposed method performed more efficiently than expected in terms of sample complexity: as discussed in Section 4, we expect our method to require more pairs than the number of pointwise examples SV needs to reach the same error, while Figure 1(a) shows that our method performs comparably to SV with almost the same sample complexity order.

Next, to see the effect of the class prior, we compared our method, SD, and SV under various class priors. In this experiment, training and test data were generated from MNIST under a controlled class prior π_+, which was varied over several values. For each trial, pairs were randomly subsampled from MNIST for training, and the performance was evaluated on a separate set of labeled examples. The average clustering errors and standard errors over ten trials are plotted in Figure 1(b). This result indicates that the proposed method is less affected by the class prior than SD.

Finally, we show the benchmark performance of each method on the tabular datasets in Table 1, where each cell contains the average clustering error and the standard error over repeated trials. For each trial, we randomly subsampled pairs for training and pointwise examples for evaluation. The results demonstrate that the proposed method performs better than most of the baselines and comparably to MCL. In particular, the performance difference between our method and the clustering methods implies that the assumptions required by such clustering methods may not always hold.

Figure 1: Mean clustering error and standard error (shaded areas) over repeated trials on MNIST: (a) fixed class prior with a varying number of pairs; (b) varying class prior with a fixed number of pairs.
Figure 2: Classification error of each threshold classifier (top) and the error probability of the proposed class assignment method over repeated trials (bottom) on the synthetic Gaussian dataset.

Class assignment on a synthetic dataset.

We empirically investigated the performance of the proposed class assignment method on a synthetic dataset. Standard Gaussian class-conditional distributions with different means were used as the underlying distributions, and the class prior π_+ was fixed throughout this experiment. We consider a one-dimensional threshold classifier f_θ, defined by f_θ(x) = +1 if x exceeds the threshold θ and f_θ(x) = −1 otherwise. Given the class prior, we generated pairwise examples from the above distributions and applied the proposed class assignment method to the fixed classifier f_θ. Then, we evaluated whether the estimated class assignment was optimal or not. The number of pairwise examples and the threshold θ were varied over grids of values, and for each configuration the data generation, class assignment, and evaluation procedure was repeated many times.
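Because the exact Gaussian means, class prior, and parameter grids are not recoverable from the extracted text, the sketch below reproduces only the spirit of this protocol with assumed values (class-conditionals N(+1, 1) and N(−1, 1), a threshold classifier, and the simplified majority-based assignment heuristic used in the earlier sketches); it estimates how often the wrong sign is chosen.

```python
import numpy as np

def assignment_error_probability(n_pairs=100, theta=0.0, prior_pos=0.7,
                                 trials=1000, seed=0):
    """Estimate how often the majority-based sign assignment picks the wrong sign
    for the threshold classifier f(x) = x - theta on a 1-D Gaussian mixture.
    Class-conditionals N(+1, 1) / N(-1, 1) and all parameters are assumed values."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        # draw 2 * n_pairs pointwise patterns (the two sides of the pairs)
        y = np.where(rng.random(2 * n_pairs) < prior_pos, 1, -1)
        x = rng.normal(loc=y.astype(float), scale=1.0)
        # majority-based heuristic: the larger predicted group is labeled positive
        frac_pos = np.mean(x - theta > 0)
        sigma_hat = 1 if frac_pos >= 0.5 else -1   # positives are the assumed majority
        # for a moderate threshold, the optimal assignment here is sigma = +1
        errors += (sigma_hat != 1)
    return errors / trials
```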

The error probabilities are depicted in Figure 2. We find that the performance of the proposed class assignment method improves as (i) the number of pairwise examples grows and (ii) the classification error of the given classifier moves away from 1/2. These results are aligned with our analysis in Section 4. Moreover, in additional experiments in Appendix D, we observed that class assignment improves as the class prior becomes farther from 1/2.

6 Conclusion

In this paper, we presented the underlying relationship between similarity learning and binary classification (Theorem 1) and proposed a two-step approach to binary classification with only pairwise supervision, which was validated through experiments. Clustering error minimization in our framework does not rely on a specific choice of surrogate loss and is less affected by the class prior. As a result, our framework subsumes many existing similarity learning methods. We anticipate that this work opens a new direction towards understanding similarity learning.

Acknowledgement

HB was supported by JSPS KAKENHI Grant Number 19J21094, Japan, and JST ACT-I Grant Number JPMJPR18UI, Japan. IS was supported by JST CREST Grant Number JPMJCR17A1, Japan. MS was supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan.

References

  • Bach et al. [2004] F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, page 6, 2004.
  • Balcan et al. [2008] M.-F. Balcan, A. Blum, and N. Srebro. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008.
  • Bao et al. [2018] H. Bao, G. Niu, and M. Sugiyama. Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning, pages 461–470, 2018.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Basu et al. [2002] S. Basu, A. Banerjee, and R. Mooney. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning, 2002.
  • Basu et al. [2008] S. Basu, I. Davidson, and K. Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press, 2008.
  • Bellet et al. [2012] A. Bellet, A. Habrard, and M. Sebban. Similarity learning for provably accurate sparse linear classification. In Proceedings of the 29th International Coference on Machine Learning, pages 1491–1498, 2012.
  • Bilenko et al. [2004] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning, pages 839–846, 2004.
  • Bromley et al. [1994] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems 7, pages 737–744, 1994.
  • Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Chapelle et al. [2010] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2010.
  • Chen and Feng [2012] W. Chen and G. Feng. Spectral clustering: A semi-supervised approach. Neurocomputing, 77:229–242, 2012.
  • Clanuwat et al. [2018] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
  • Cortes et al. [2010] C. Cortes, M. Mohri, and A. Rostamizadeh. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning, pages 239–246, 2010.
  • Cristianini et al. [2002] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems 15, pages 367–373, 2002.
  • Cui et al. [2020] Z. Cui, N. Charoenphakdee, I. Sato, and M. Sugiyama. Classification from triplet comparison data. Neural Computation, 32(3):659–681, 2020.
  • Dan et al. [2020] S. Dan, H. Bao, and M. Sugiyama. Learning from noisy similar and dissimilar data. arXiv preprint arXiv:2002.00995, 2020.
  • Davis et al. [2007] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216, 2007.
  • Dua and Graff [2017] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Fahad et al. [2014] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3):267–279, 2014.
  • Hazan et al. [2016] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hoeffding [1963] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • Horn and Johnson [2012] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.
  • Hsu et al. [2019] Y.-C. Hsu, Z. Lv, J. Schlosser, P. Odom, and Z. Kira. Multi-class classification without multi-class labels. In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • Klein et al. [2002] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the 19th International Conference on Machine Learning, 2002.
  • Koch et al. [2015] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille, 2015.
  • Kulis et al. [2013] B. Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.
  • Kuroki et al. [2019] S. Kuroki, N. Charoenphakdee, H. Bao, J. Honda, I. Sato, and M. Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4122–4129, 2019.
  • Lanckriet et al. [2004] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.
  • LeCun [2013] Y. LeCun. The MNIST database of handwritten digits, 2013. URL http://yann.lecun.com/exdb/mnist.
  • Li and Liu [2009] Z. Li and J. Liu. Constrained clustering by spectral kernel learning. In IEEE 12th International Conference on Computer Vision, pages 421–427, 2009.
  • Logeswaran and Lee [2018] L. Logeswaran and H. Lee. An efficient framework for learning sentence representations. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • MacQueen et al. [1967] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
  • Mikolov et al. [2013] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.
  • Mohri et al. [2012] M. Mohri, A. Rostamizadeh, F. Bach, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • Narasimhan and Agarwal [2013] H. Narasimhan and S. Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In Advances in Neural Information Processing Systems 26, pages 2913–2921, 2013.
  • Niu et al. [2014] G. Niu, B. Dai, M. Yamada, and M. Sugiyama. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Computation, 26(8):1717–1762, 2014.
  • Noroozi and Favaro [2016] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  • Nozawa et al. [2019] K. Nozawa, P. Germain, and B. Guedj. PAC-Bayesian contrastive unsupervised representation learning. arXiv preprint arXiv:1910.04464, 2019.
  • Okuno and Shimodaira [2020] A. Okuno and H. Shimodaira. Hyperlink regression via Bregman divergence. Neural Networks, 126:362–383, 2020.
  • Reid and Williamson [2009] M. D. Reid and R. C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the 26th International Conference on Machine Learning, pages 897–904, 2009.
  • Sakai et al. [2017] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th International Conference on Machine Learning, pages 2998–3006, 2017.
  • Saunshi et al. [2019] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, pages 5628–5637, 2019.
  • Shimada et al. [2019] T. Shimada, H. Bao, I. Sato, and M. Sugiyama. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. arXiv preprint arXiv:1904.11717, 2019.
  • Steinwart [2007] I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.
  • Tang et al. [2015] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.
  • van Rooyen et al. [2015] B. van Rooyen, A. Menon, and R. C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems 28, pages 10–18, 2015.
  • Vapnik [1998] V. Vapnik. Statistical Learning Theory. wiley New York, 1998.
  • von Luxburg [2007] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
  • Wagstaff et al. [2001] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning, volume 1, pages 577–584, 2001.
  • Wang and Gupta [2015] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
  • Weinberger and Saul [2009] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
  • Wu et al. [2020] S. Wu, X. Xia, T. Liu, B. Han, M. Gong, N. Wang, H. Liu, and G. Niu. Multi-class classification from noisy-similarity-labeled data. arXiv preprint arXiv:2002.06508, 2020.
  • Xiao et al. [2017] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Xing et al. [2003] E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems 16, pages 521–528, 2003.
  • Yan et al. [2006] R. Yan, J. Zhang, J. Yang, and A. G. Hauptmann. A discriminative learning framework with pairwise constraints for video object classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):578–593, 2006.
  • Zeng and Cheung [2011] H. Zeng and Y.-M. Cheung. Semi-supervised maximum margin clustering with pairwise constraints. IEEE Transactions on Knowledge and Data Engineering, 24(5):926–939, 2011.
  • Zhang and Yan [2007] J. Zhang and R. Yan. On the value of pairwise constraints in classification and consistency. In Proceedings of the 24th International Conference on Machine Learning, pages 1111–1118. ACM, 2007.

Appendix A Proofs of Theorems and Lemmas

In this section, we provide complete proofs for Theorem 1, Theorem 2, Lemma 1, Lemma 2, and Lemma 3.

A.1 Proof of Theorem 1

We derive an equivalent expression of the pairwise classification error as follows.

(18)

We can transform the above equation as

(19)

Then, we also have

(20)

By combining the results in Eqs. (19) and (20), we finally obtain Eq. (4), which completes the proof of Theorem 1. ∎

A.2 Proof of Theorem 2

The optimal sign σ* can be written as

(21)

According to Shimada et al. [45], the pointwise classification error is equivalently expressed as follows.

Lemma 4 (Theorem 1 in Shimada et al. [45]).

Assume that π_+ ≠ 1/2. Then, the pointwise classification error of a given classifier f can be equivalently represented as

(22)

By plugging Eq. (22) into Eq. (21), we obtain

(23)

Thus, we derive the following result.

(24)

which completes the proof of Theorem 2. Note that σ* can be either sign when R(f) = R(−f), which is equivalent to R(f) = 1/2. Here we arbitrarily set σ* to +1 in this case. ∎

A.3 Proof of Lemma 1

We introduce the following notation:

represents the conditional -risk in the following sense:

where

Define the function ψ as the Fenchel–Legendre biconjugate of the function defined above; this ψ corresponds exactly to the ψ-transform introduced by Bartlett et al. [5].

We will show that the statement of the lemma is satisfied by this ψ, based on the calibration analysis of Steinwart [46]. We further introduce the following notation:

represents the conditional -risk in the following sense:

and

Let be the calibration function [46, Lemma 2.16] defined by

By the consequence of Lemma 2.9 of Steinwart [46], for all implies that . Further, under this condition, Theorem 2.13 of Steinwart [46] implies that is non-decreasing, invertible, and satisfies

for any measurable function . Hence, it is sufficient to show that for all . Indeed, , and for all because is classification calibrated [5, Lemma 2]. From now on, we will see .

First, we simplify the constraint part of . Since