1 Introduction
In supervised classification, a vast amount of labeled training data is needed to train classifiers. However, labels are often not easy to obtain due to high labeling costs
(chapelle2010semi), privacy concerns (warner1965randomized), social bias (nederhof1985methods), and the difficulty of labeling data. For such reasons, there are real-world classification problems where pairwise similarities (i.e., pairs of samples in the same class) and pairwise dissimilarities (i.e., pairs of samples in different classes) might be easier to collect than fully labeled data. For example, in the task of protein function prediction (Klein2002FromIC), knowledge about similarities/dissimilarities can be obtained by experimental means as additional supervision. To handle such pairwise information, similar-unlabeled (SU) classification (bao2018classification) has been proposed, where the classification risk is estimated in an unbiased fashion from only similar pairs and unlabeled data. Although SU classification assumes that only similar pairs and unlabeled data are available, we may also obtain dissimilar pairs in practice. In this case, a method that can handle all of similarities, dissimilarities, and unlabeled data is desirable.

Semi-supervised clustering (wagstaff2001constrained) is one method that can handle both similar and dissimilar pairs, where must-link pairs (i.e., similar pairs) and cannot-link pairs (i.e., dissimilar pairs) are used to obtain meaningful clusters. Existing work provides useful semi-supervised clustering methods based on the ideas that (i) must-links/cannot-links are treated as constraints (basu2002semi; wagstaff2001constrained; li2009constrained; hu2008maximum), (ii) clustering is performed with metrics learned by semi-supervised metric learning (xing2003distance; bilenko2004integrating; weinberger2009distance; davis2007information; niu2012information), and (iii) missing links are predicted by matrix completion (yi2013semi; chiang2015matrix). However, there is a gap between the motivations of clustering and classification algorithms, so applying semi-supervised clustering to classification might cause problems.
For example, most semi-supervised clustering methods rely on geometrical or margin-based assumptions such as the cluster assumption and the manifold assumption (chapelle2010semi), which heavily depend on the structure of the dataset. Therefore, the range of applications of semi-supervised clustering can be restricted. In addition, the objective of semi-supervised clustering is not, in general, the minimization of the classification risk, so it may perform suboptimally in terms of classification accuracy.
In this paper, we propose similar-dissimilar-unlabeled (SDU) classification, where we can utilize all of pairwise similarities, pairwise dissimilarities, and unlabeled data for unbiased estimation of the classification risk. Similarly to SU classification, our method does not require geometrical assumptions on the data distribution and directly minimizes the classification risk. As a first step toward constructing SDU classification, we propose dissimilar-unlabeled (DU) classification and similar-dissimilar (SD) classification, where only dissimilar and unlabeled data, or only similar and dissimilar data, are required. Then, we combine the risks of SU, DU, and SD classification linearly, in a manner similar to positive-negative-unlabeled (PNU) classification (sakai2017semi). One important question is which combination of these three risks is best. To answer this question, we establish estimation error bounds for each algorithm and find that SD and DU classification are likely to outperform SU classification in terms of generalization. Therefore, we claim that the combination of SD and DU classification is the most promising approach for SDU classification. Through experiments, we demonstrate the practical usefulness of our proposed method.
Our contributions can be summarized as follows.

We extend SU classification to DU and SD classification and propose SDU classification as a combination of these algorithms (Sec. 3).

From a comparison of the estimation error bounds, we find that SD classification and DU classification are likely to outperform SU classification, and we provide the insight that the combination of SD and DU classification is the most promising for SDU classification (Sec. 4.2).
2 Preliminary
In this section, we first introduce our problem setting and data generation process of similar pairs, dissimilar pairs, and unlabeled data. Then we review the formulation of the existing SU classification algorithm.
2.1 Problem Setting
Let $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{+1, -1\}$ be a $d$-dimensional example space and a binary label space, respectively. Suppose that each labeled example $(x, y) \in \mathcal{X} \times \mathcal{Y}$
is generated independently from the joint probability distribution with density $p(x, y)$. For simplicity, let $\pi_+$ and $\pi_-$ be the class priors $p(y = +1)$ and $p(y = -1)$, which satisfy the condition $\pi_+ + \pi_- = 1$, and let $p_+(x)$ and $p_-(x)$ be the class-conditional densities $p(x \mid y = +1)$ and $p(x \mid y = -1)$.

The standard goal of supervised binary classification is to obtain a classifier $f: \mathcal{X} \to \mathbb{R}$ which minimizes the classification risk defined by
$$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}[\ell(f(x), y)], \tag{1}$$
where $\ell: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ is a loss function.
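In practice, the classification risk defined above is approximated by a sample average. A minimal sketch with the zero-one loss; the toy data and classifiers below are our own illustration, not the paper's:

```python
import random

def zero_one_loss(z, y):
    """Zero-one loss: 1 if the sign of score z disagrees with label y, else 0."""
    return 0.0 if z * y > 0 else 1.0

def empirical_risk(f, samples, loss=zero_one_loss):
    """Sample-average approximation of R(f) = E[loss(f(x), y)]."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# Toy data: x ~ Uniform(-1, 1), true label y = sign(x).
random.seed(0)
samples = [(x, 1 if x > 0 else -1) for x in (random.uniform(-1, 1) for _ in range(1000))]

risk_good = empirical_risk(lambda x: x, samples)        # matches the labeling rule
risk_off = empirical_risk(lambda x: x - 0.2, samples)   # shifted threshold, errs on x in (0, 0.2]
print(risk_good, risk_off)  # risk_good near 0; risk_off near 0.1
```

The weakly supervised risks discussed later are estimated in the same sample-average fashion, only from pairwise and unlabeled data instead of labeled examples.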
2.2 Generation Process of Training Data
We describe the data generation process of pairwise similar data, pairwise dissimilar data, and unlabeled data. We assume that similar and dissimilar pairs are generated independently from pairwise distributions. We denote the event that two samples $x$ and $x'$ have the same class label (i.e., $y = y'$) by $\sigma = +1$, and the opposite event (i.e., $y \neq y'$) by $\sigma = -1$. Then, pairs are generated from an underlying joint density $p(x, x', \sigma)$ as follows:
$$\{(x_i, x'_i, \sigma_i)\}_{i=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} p(x, x', \sigma), \tag{2}$$
where
$$\sigma_i = +1 \quad \text{if } y_i = y'_i, \tag{3}$$
$$\sigma_i = -1 \quad \text{if } y_i \neq y'_i. \tag{4}$$
Here, the pairs can be decomposed into similar pairs and dissimilar pairs based on the variable $\sigma$:
$$\mathcal{D}_{\mathrm{S}} = \{(x_i, x'_i) \mid \sigma_i = +1\}, \tag{5}$$
$$\mathcal{D}_{\mathrm{D}} = \{(x_i, x'_i) \mid \sigma_i = -1\}. \tag{6}$$
For convenience, we introduce notations representing the similar and dissimilar proportions and conditional densities:
$$\pi_{\mathrm{S}} = p(\sigma = +1), \tag{7}$$
$$\pi_{\mathrm{D}} = p(\sigma = -1), \tag{8}$$
$$p_{\mathrm{S}}(x, x') = p(x, x' \mid \sigma = +1), \tag{9}$$
$$p_{\mathrm{D}}(x, x') = p(x, x' \mid \sigma = -1). \tag{10}$$
Then, we can consider the generation process of similar and dissimilar pairs as
$$\mathcal{D}_{\mathrm{S}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{S}}(x, x'), \qquad \mathcal{D}_{\mathrm{D}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{D}}(x, x').$$
Note that we assume each sample in a pair is generated independently, namely, $p(x, y, x', y') = p(x, y)\,p(x', y')$. Thus, we have
$$\pi_{\mathrm{S}} = \pi_+^2 + \pi_-^2, \qquad \pi_{\mathrm{D}} = 2\pi_+\pi_-,$$
$$p_{\mathrm{S}}(x, x') = \frac{\pi_+^2 p_+(x) p_+(x') + \pi_-^2 p_-(x) p_-(x')}{\pi_{\mathrm{S}}}, \qquad p_{\mathrm{D}}(x, x') = \frac{\pi_+ \pi_- \{p_+(x) p_-(x') + p_-(x) p_+(x')\}}{\pi_{\mathrm{D}}}.$$
We assume unlabeled samples are generated as follows:
$$\mathcal{D}_{\mathrm{U}} = \{x_j\}_{j=1}^{n_{\mathrm{U}}} \overset{\mathrm{i.i.d.}}{\sim} p(x) = \pi_+ p_+(x) + \pi_- p_-(x). \tag{11}$$
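This generation process can be simulated directly. A sketch under an assumed toy model (1-D Gaussian class-conditionals; the concrete distributions, seed, and helper names are our illustration, not part of the paper), using the standard similar-pair proportion pi_S = pi_+^2 + pi_-^2:

```python
import random

PI_P = 0.7  # positive class prior pi_+ (illustrative; the paper's experiments also use 0.7)

def draw_labeled():
    """Draw (x, y) from a toy joint density: y ~ Bernoulli(pi_+), x | y ~ N(2y, 1)."""
    y = 1 if random.random() < PI_P else -1
    return random.gauss(2.0 * y, 1.0), y

def draw_pairs(n_pairs):
    """Draw i.i.d. pairs and split them into D_S / D_D by latent label agreement."""
    d_s, d_d = [], []
    for _ in range(n_pairs):
        (x, y), (xp, yp) = draw_labeled(), draw_labeled()
        (d_s if y == yp else d_d).append((x, xp))
    return d_s, d_d

random.seed(1)
n = 20000
d_s, d_d = draw_pairs(n)
pi_s = PI_P**2 + (1 - PI_P)**2       # similar proportion pi_S = 0.58 for pi_+ = 0.7
print(len(d_s) / n, pi_s)            # empirical fraction of similar pairs vs. theory
unlabeled = [draw_labeled()[0] for _ in range(500)]  # x ~ p(x); labels discarded
```

Note that only the pair sets and the unlabeled set would be visible to a learner; the labels `y` are latent and used here solely to simulate the generation process.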
2.3 SU classification
In (bao2018classification), SU classification was proposed, where the classification risk is estimated in an unbiased fashion only from similar pairs and unlabeled data.
Proposition 1 (Theorem 1 in (bao2018classification)).
We can train a classifier by minimizing the empirical version of the classification risk, the SU risk estimator $\hat{R}_{\mathrm{SU}}(f)$, computed from $\mathcal{D}_{\mathrm{S}}$ and $\mathcal{D}_{\mathrm{U}}$.
3 Proposed Method
In this section, we propose SDU classification, where the classification risk is estimated from similar pairs, dissimilar pairs, and unlabeled data. As preparation for constructing SDU classification, we first extend SU classification to DU and SD classification.
3.1 DU and SD classification
As with SU classification, the classification risk can be estimated only from dissimilar pairs and unlabeled data (DU), or from similar pairs and dissimilar pairs (SD), as follows.
Theorem 1.
We can train a classifier by minimizing the empirical version of the DU risk or the SD risk, denoted $\hat{R}_{\mathrm{DU}}(f)$ or $\hat{R}_{\mathrm{SD}}(f)$, obtained from $\mathcal{D}_{\mathrm{D}}$ and $\mathcal{D}_{\mathrm{U}}$, or from $\mathcal{D}_{\mathrm{S}}$ and $\mathcal{D}_{\mathrm{D}}$, respectively. We call training with these risks DU classification and SD classification, respectively.
3.2 SDU classification
We propose SDU classification by combining SU, DU, and SD classification. The main idea of our method is to combine the risks obtained from SU, DU, and SD data in a manner similar to positive-negative-unlabeled (PNU) classification (sakai2017semi).
With a positive real value $\gamma \in (0, 1)$, we define the following three representations of the classification risk:
$$R^{\gamma}_{\mathrm{SD\text{-}SU}}(f) = \gamma R_{\mathrm{SD}}(f) + (1 - \gamma) R_{\mathrm{SU}}(f), \tag{17}$$
$$R^{\gamma}_{\mathrm{SD\text{-}DU}}(f) = \gamma R_{\mathrm{SD}}(f) + (1 - \gamma) R_{\mathrm{DU}}(f), \tag{18}$$
$$R^{\gamma}_{\mathrm{SU\text{-}DU}}(f) = \gamma R_{\mathrm{SU}}(f) + (1 - \gamma) R_{\mathrm{DU}}(f). \tag{19}$$
Since each of the SU, DU, and SD risks equals the classification risk, any convex combination of them is again the classification risk. We call training with these risks SD-SU classification, SD-DU classification, and SU-DU classification, respectively. Here, a natural question is which one is the most promising algorithm. We claim that the combination of the SD and DU risks is the most promising from the viewpoint of estimation error bounds. We discuss the details in Sec. 4.2.
3.3 Practical Implementation
We investigate the objective function when using a linear classifier $f(x) = w^\top \phi(x) + b$, where $w$ and $b$ are parameters and $\phi$ is a mapping function. For simplicity, we consider the following generalized optimization problem. With positive real values $\lambda_{\mathrm{SU}}, \lambda_{\mathrm{DU}}, \lambda_{\mathrm{SD}}$ which satisfy the condition $\lambda_{\mathrm{SU}} + \lambda_{\mathrm{DU}} + \lambda_{\mathrm{SD}} = 1$, we denote our optimization problem by
$$\min_{w, b} \; \hat{J}(w, b), \tag{20}$$
where
$$\hat{J}(w, b) = \hat{R}_{\mathrm{SDU}}(f) + \frac{\lambda}{2}\|w\|_2^2, \tag{21}$$
$$\hat{R}_{\mathrm{SDU}}(f) = \lambda_{\mathrm{SD}} \hat{R}_{\mathrm{SD}}(f) + \lambda_{\mathrm{SU}} \hat{R}_{\mathrm{SU}}(f) + \lambda_{\mathrm{DU}} \hat{R}_{\mathrm{DU}}(f). \tag{22}$$
Here, $\lambda \geq 0$ is the parameter of the L2 regularization. When $\lambda_{\mathrm{SU}} = 0$, $\lambda_{\mathrm{DU}} = 0$, or $\lambda_{\mathrm{SD}} = 0$, this optimization corresponds to SD-DU, SD-SU, and SU-DU classification, respectively.
From now on, we assume the loss function $\ell$ is a margin loss. As defined in (mohri2018foundations), we call $\ell$ a margin loss function if there exists $\psi: \mathbb{R} \to \mathbb{R}_{\geq 0}$ such that $\ell(z, t) = \psi(tz)$. In general, the optimization problem in Eq. (20) is not convex. However, if we choose a loss $\ell$ that satisfies the following property, the optimization problem becomes convex.
Theorem 2.
Suppose that the loss function $\ell(z, t)$ is a convex margin loss, twice differentiable in $z$ almost everywhere (for every fixed $t \in \{+1, -1\}$), and satisfies the following condition:
$$\ell(z, +1) - \ell(z, -1) = -z. \tag{23}$$
Then, the optimization problem in Eq. (20) is a convex problem.
Next, we consider the case of squared loss and double hinge loss, which satisfy the condition in Eq. (23).
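Both losses can be checked against this condition numerically. A sketch assuming the condition is the linear-odd property ℓ(z, +1) − ℓ(z, −1) = −z used in SU classification, and assuming the standard scalings ℓ_sq(z, t) = (tz − 1)²/4 and double hinge ℓ_DH(z, t) = max(−tz, max(0, (1 − tz)/2)); these concrete forms are our assumptions:

```python
def squared_loss(z, t):
    """Squared loss (tz - 1)^2 / 4, scaled so the linear-odd condition holds."""
    return (t * z - 1.0) ** 2 / 4.0

def double_hinge_loss(z, t):
    """Double hinge loss max(-tz, max(0, (1 - tz) / 2))."""
    m = t * z
    return max(-m, 0.0, (1.0 - m) / 2.0)

# Check l(z, +1) - l(z, -1) = -z on a grid of scores.
for loss in (squared_loss, double_hinge_loss):
    for i in range(-50, 51):
        z = i / 10.0
        assert abs(loss(z, 1) - loss(z, -1) + z) < 1e-12
print("both losses satisfy the linear-odd condition")
```

For the double hinge loss this can also be verified piecewise: on tz ∈ [−1, 1] both branches reduce to (1 ∓ z)/2, and outside that interval one side is 0 while the other is |z|.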
3.3.1 Squared Loss
Suppose that we use the squared loss defined by
$$\ell_{\mathrm{sq}}(z, t) = \frac{1}{4}(tz - 1)^2. \tag{24}$$
Then, the objective function in Eq. (20) can be written as
(25)
where the coefficients are obtained by expanding the squared loss over the training data. We denote by $\mathbf{1}$ the vector whose elements are all ones and by $I$ the identity matrix. Since this function has a quadratic form with respect to the parameter vector $(w^\top, b)^\top$, the solution of this minimization problem can be obtained analytically.

3.3.2 Double Hinge Loss
3.4 Class Prior Estimation
Although the class prior $\pi_+$ must be known before training to compute the empirical risks $\hat{R}_{\mathrm{SU}}$, $\hat{R}_{\mathrm{DU}}$, and $\hat{R}_{\mathrm{SD}}$, it can be estimated from the number of similar pairs $n_{\mathrm{S}}$ and the number of dissimilar pairs $n_{\mathrm{D}}$. First, $\pi_+$ and $\pi_{\mathrm{S}}$ have the following relationship:
$$\pi_+ = \frac{1 + \sqrt{2\pi_{\mathrm{S}} - 1}}{2}. \tag{26}$$
The above equality is obtained by solving $\pi_{\mathrm{S}} = \pi_+^2 + (1 - \pi_+)^2$, taking the solution with $\pi_+ > 1/2$ (the other solution corresponds to swapping the two classes). Note that $n_{\mathrm{S}} / (n_{\mathrm{S}} + n_{\mathrm{D}})$ is an unbiased estimator of $\pi_{\mathrm{S}}$. Thus, $\pi_+$ can be estimated by plugging this estimate into Eq. (26).
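This estimator is a one-liner. A sketch assuming the standard relation π_S = π_+² + (1 − π_+)² with π_+ > 1/2 (function name and simulation setup are ours):

```python
import math
import random

def estimate_prior(n_s, n_d):
    """Estimate pi_+ from pair counts, assuming pi_S = pi_+^2 + (1 - pi_+)^2
    and taking the solution with pi_+ > 1/2."""
    pi_s_hat = n_s / (n_s + n_d)                  # unbiased estimate of pi_S
    return (1.0 + math.sqrt(max(0.0, 2.0 * pi_s_hat - 1.0))) / 2.0

# Round-trip check: with pi_+ = 0.7 we have pi_S = 0.58; simulate pair counts.
random.seed(2)
n = 50000
n_s = sum(random.random() < 0.58 for _ in range(n))
pi_hat = estimate_prior(n_s, n - n_s)
print(pi_hat)  # close to 0.7
```

The inner `max(0.0, ...)` guards against sampling noise pushing the plug-in estimate of π_S slightly below 1/2, in which case the estimator falls back to π_+ = 1/2.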
4 Theoretical Analysis
In this section, we analyze the generalization bound for our algorithms. As the first step, we show estimation error bounds for SU, DU, and SD classification via Rademacher complexity. With those bounds, we compare each algorithm and then we clarify which algorithm among the three SDU classification approaches is the most promising. Finally, we show the estimation error bound for SDU classification.
4.1 Estimation Error Bounds for SU, DU, and SD
First, we investigate estimation error bounds for SU, DU, and SD classification. Let $\mathcal{F}$ be a function class of the specified model.
Definition 1 (Rademacher Complexity).
Let $n$ be a positive integer,
$X_1, \ldots, X_n$ be i.i.d. random variables drawn from a probability distribution with density
$\mu$, $\mathcal{F}$ be a class of measurable functions, and $\sigma_1, \ldots, \sigma_n$ be Rademacher variables, i.e., random variables taking $+1$ and $-1$ with even probabilities. Then the (expected) Rademacher complexity of $\mathcal{F}$ is defined as
$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{X_1, \ldots, X_n} \mathbb{E}_{\sigma_1, \ldots, \sigma_n} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right]. \tag{27}$$
For the function class $\mathcal{F}$ and any probability density $\mu$, we assume
$$\mathfrak{R}_n(\mathcal{F}) \leq \frac{C_{\mathcal{F}}}{\sqrt{n}} \tag{28}$$
for some constant $C_{\mathcal{F}} > 0$. This assumption holds for many models such as the linear-in-parameter model class with bounded weights and bounded features.
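As an illustration of Definition 1 and the O(1/√n) behavior, the Rademacher complexity of the norm-bounded linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ 1} can be estimated by Monte Carlo: by Cauchy-Schwarz, sup over this class of (1/n)Σσᵢ⟨w, xᵢ⟩ equals ‖(1/n)Σσᵢxᵢ‖₂. The 2-D Gaussian data, seed, and function name below are our illustration:

```python
import math
import random

def rademacher_linear(xs, n_trials=500):
    """Monte Carlo estimate of the Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= 1} on a fixed 2-D sample; by Cauchy-Schwarz the
    supremum over w has the closed form || (1/n) sum_i sigma_i x_i ||_2."""
    n = len(xs)
    total = 0.0
    for _ in range(n_trials):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        s0 = sum(s * x[0] for s, x in zip(sigma, xs)) / n
        s1 = sum(s * x[1] for s, x in zip(sigma, xs)) / n
        total += math.hypot(s0, s1)
    return total / n_trials

random.seed(3)
est = {}
for n in (100, 400, 1600):
    xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
    est[n] = rademacher_linear(xs)
    print(n, est[n])  # shrinks roughly like 1 / sqrt(n)
```

Quadrupling the sample size roughly halves the estimate, consistent with the assumed $C_{\mathcal{F}}/\sqrt{n}$ bound.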
Partially based on (bao2018classification), we have estimation error bounds for SU, DU and SD classification as follows.
Theorem 3.
Let $R(f)$ be the classification risk of a function $f$, $f^*$ be its minimizer in the model class, and $\hat{f}_{\mathrm{SU}}$, $\hat{f}_{\mathrm{DU}}$, and $\hat{f}_{\mathrm{SD}}$ be the minimizers of the empirical SU, DU, and SD risks, respectively. Assume the loss function $\ell$ is Lipschitz continuous with respect to the first argument, and all functions in the model class are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \leq C_b$ for any $f$ in the class. Let $C_\ell$ be an upper bound of the loss on this bounded range. For any $\delta > 0$, all of the following inequalities hold with probability at least $1 - \delta$:
(29)  
(30)  
(31) 
where
(32) 
4.2 Comparison of SD, SU, and DU Bounds
Here, we compare the SU, DU, and SD algorithms from the viewpoint of their estimation error bounds. Under the generation process of similar and dissimilar pairs in Eq. (2), we have the following corollary.
Corollary 1.
Proof.
If holds, then
and
These two inequalities indicate and , respectively.
Since we assume the generation process in Eq. (2), the class of each pair (i.e., similar or dissimilar) follows a Bernoulli distribution. Therefore, the number of pairs in each class follows a binomial distribution, namely, $n_{\mathrm{S}} \sim \mathrm{Bin}(n, \pi_{\mathrm{S}})$ and $n_{\mathrm{D}} \sim \mathrm{Bin}(n, \pi_{\mathrm{D}})$. By using Chernoff's inequality (okamoto1959some), the failure probability of the event in question decays exponentially in $n$. Therefore, the claim holds.
∎
From the above discussion, when $n$ is sufficiently large, the estimation error bounds for SD and DU classification are smaller than that for SU classification with high probability. Thus, we claim that the best pairwise combination for the SDU algorithm is SD-DU.
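The binomial concentration step above can be checked numerically. A sketch assuming each pair is similar independently with probability π_S (0.58 below corresponds to π_+ = 0.7); the function name and trial counts are our illustration:

```python
import random

def prob_fewer_similar(pi_s, n_pairs, n_trials=2000):
    """Empirical probability of the event n_S <= n_D when each of n_pairs
    pairs is similar independently with probability pi_s > 1/2."""
    bad = 0
    for _ in range(n_trials):
        n_s = sum(random.random() < pi_s for _ in range(n_pairs))
        if 2 * n_s <= n_pairs:  # n_S <= n_D
            bad += 1
    return bad / n_trials

random.seed(4)
PI_S = 0.58  # pi_+^2 + pi_-^2 for pi_+ = 0.7
probs = {m: prob_fewer_similar(PI_S, m) for m in (10, 100, 1000)}
print(probs)  # the probability decays quickly as the number of pairs grows
```

This matches the Chernoff-type bound: the probability that the minority pair class outnumbers the majority one decays exponentially in the number of pairs.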
4.3 Estimation Error Bounds for SDU
Now we consider the estimation error bound for SDU classification. With the same technique as in Theorem 3, we have the following theorem.
Theorem 4.
Let $R(f)$ be the classification risk of a function $f$, $f^*$ be its minimizer, and $\hat{f}$ be a minimizer of the empirical risk in Eq. (22). Assume the loss function $\ell$ is Lipschitz continuous with respect to the first argument, and all functions in the model class are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \leq C_b$ for any $f$ in the class. Let $C_\ell$ be an upper bound of the loss on this bounded range. For any $\delta > 0$, with probability at least $1 - \delta$,
(33) 
where
(34) 
5 Experiments
In this section, we experimentally evaluate the performance of our SDU algorithm and investigate the behaviors of SU, DU and SD classification.
5.1 Datasets
We conducted experiments on ten benchmark datasets obtained from UCI Machine Learning Repository
(lichman2013uci) and LIBSVM (chang2011libsvm). To convert labeled data into similar and dissimilar pairs, we first determined the positive class prior $\pi_+$. Then we randomly subsampled pairwise similar and dissimilar data following the ratio of $\pi_{\mathrm{S}}$ to $\pi_{\mathrm{D}}$. To obtain unlabeled data, we randomly picked data following the ratio of $\pi_+$ to $\pi_-$. For all experiments, $\pi_+$ was set to 0.7.

5.2 Common Setup
As a classifier, we used the linear-in-input model $f(x) = w^\top x + b$. The weight of the L2 regularization and, for the SDU algorithms, the combination parameter $\gamma$ were selected by 5-fold cross-validation. To estimate the validation error, the empirical risk on SD data equipped with the zero-one loss was used. In each trial, the parameters with the minimum validation error were chosen.
We used the squared loss for the experiments in Sec. 5.3 and Sec. 5.4, and both the squared and double-hinge losses for the experiments in Sec. 5.5. For the experiments in Sec. 5.4 and Sec. 5.5, the class prior $\pi_+$ was estimated from the number of similar and dissimilar pairs by means of Eq. (26).
[Figure: average misclassification rate and standard error as a function of the number of similar and dissimilar pairs over 50 trials. For all experiments, the class prior $\pi_+$ is set to 0.7 and the number of unlabeled samples is set to 500.]

5.3 Comparison of SU, DU, and SD Performances
We compared the performances of SU, DU, and SD classification. We set the number of unlabeled samples to 500 and varied the number of pairwise data. In these experiments, we assumed the true class prior $\pi_+$ is known. As the results in Fig. 3 show, DU and SD classification consistently outperform SU classification. The results are consistent with our analysis in Sec. 4.2 that SD and DU classification are likely to outperform SU classification.³

³ Due to limited space, magnified versions of the experimental results are shown in Appendix C.
5.4 Improvement by Unlabeled Data
We investigated the effect of unlabeled data on classification performance. The number of pairwise data was set to 50. We compared the performance of the three SDU classification methods and SD classification. As the results in Fig. 8 show, when the number of unlabeled data is sufficiently large, SD-DU classification outperforms SD classification. Furthermore, we demonstrate that SD-DU classification consistently performs the best among all SDU classification methods.
Dataset  # pairs  Squared  Double-Hinge  KM  CKM  SSP  ITML
(Squared and Double-Hinge are the proposed SD-DU method; KM, CKM, SSP, and ITML are baselines.)
adult  50  61.9 (0.9)  77.7 (0.6)  65.0 (0.8)  66.6 (1.1)  69.5 (0.3)  62.4 (0.7) 
200  71.4 (0.7)  82.5 (0.3)  63.3 (0.8)  71.9 (0.9)  69.3 (0.3)  60.8 (0.7)  
banana  50  63.9 (1.2)  63.5 (1.1)  52.9 (0.4)  52.7 (0.4)  58.7 (0.7)  53.0 (0.4) 
200  66.5 (0.8)  66.9 (0.7)  52.5 (0.2)  52.5 (0.2)  66.5 (1.3)  52.5 (0.2)  
codrna  50  78.1 (1.1)  68.5 (0.8)  62.6 (0.5)  61.5 (0.4)  54.6 (1.0)  62.7 (0.5) 
200  87.7 (0.6)  72.7 (0.7)  62.8 (0.5)  59.5 (0.5)  53.2 (0.7)  62.5 (0.5)  
ijcnn1  50  64.7 (0.8)  68.8 (0.9)  55.5 (0.6)  54.7 (0.5)  60.9 (0.8)  55.8 (0.6) 
200  75.1 (0.7)  76.3 (0.5)  54.2 (0.3)  53.1 (0.3)  59.5 (0.8)  54.3 (0.4)  
magic  50  65.5 (0.9)  65.1 (1.0)  52.4 (0.2)  51.8 (0.2)  52.7 (0.3)  52.5 (0.2) 
200  73.0 (0.6)  71.4 (0.7)  52.0 (0.2)  51.7 (0.2)  52.6 (0.3)  52.0 (0.2)  
phishing  50  69.4 (0.8)  80.5 (0.9)  62.6 (0.3)  62.6 (0.3)  68.1 (0.3)  62.5 (0.3) 
200  81.7 (0.7)  87.0 (0.4)  62.6 (0.3)  62.8 (0.3)  68.4 (0.3)  62.6 (0.3)  
phoneme  50  67.9 (0.9)  69.2 (0.9)  67.8 (0.3)  68.9 (0.5)  66.5 (1.0)  67.8 (0.3) 
200  73.5 (0.5)  74.4 (0.4)  67.8 (0.3)  71.0 (0.6)  72.0 (0.7)  67.9 (0.3)  
spambase  50  66.7 (0.8)  82.9 (0.6)  63.7 (1.1)  64.2 (1.1)  70.4 (0.3)  61.4 (1.1) 
200  77.9 (0.7)  87.5 (0.3)  61.8 (1.2)  70.4 (0.6)  70.7 (0.3)  60.6 (1.2)  
w8a  50  60.8 (0.9)  73.2 (0.9)  69.3 (0.3)  66.0 (0.7)  64.2 (0.6)  69.0 (0.3) 
200  64.1 (0.7)  80.2 (0.7)  69.3 (0.3)  56.3 (0.6)  67.2 (0.7)  68.6 (0.3)  
waveform  50  72.9 (1.0)  83.2 (1.0)  51.5 (0.2)  51.6 (0.2)  53.3 (0.3)  51.5 (0.2) 
200  83.2 (0.6)  86.7 (0.6)  51.5 (0.2)  51.5 (0.1)  53.1 (0.3)  51.5 (0.2) 
Table 1: Mean classification accuracy (%) with standard errors. Bold numbers indicate the outperforming methods, chosen by a one-sided t-test at the 5% significance level.
5.5 Comparison of SDU and Existing Methods
We evaluated the performance of the proposed SDU classification against four baseline methods. We conducted experiments on each benchmark dataset with 500 unlabeled data and 50 or 200 similar or dissimilar pairs in total. Accuracy was measured in each trial with 500 test samples. Due to limited space, we show only the results of SD-DU classification as a representative of the proposed method. As we can see in Table 1, SD-DU classification performs the best on many datasets. The details of the baseline methods are described below.
K-Means Clustering (KM):
Ignoring all pairwise information, the k-means clustering algorithm
(macqueen1967some) is applied to the training data only. We predicted the labels of test data with the learned clusters.

Constrained K-Means Clustering (CKM): Constrained k-means clustering (wagstaff2001constrained) is a clustering method that uses pairwise similar/dissimilar information as must-link/cannot-link constraints.
Semi-Supervised Spectral Clustering (SSP):
Semi-supervised spectral clustering
(chen2012spectral) is a spectral-clustering-based method, where similar and dissimilar pairs are used for affinity propagation. We fixed the number of neighbors used for k-nearest-neighbor graph construction and the precision parameter for the similarity measurement.

Information-Theoretic Metric Learning (ITML): Information-theoretic metric learning (davis2007information) is a metric-learning-based algorithm, where similar and dissimilar pairs are used for regularizing the covariance matrix. We used the identity matrix as prior information, and the slack parameter was set to 1. To predict labels of test samples, k-means clustering was applied with the obtained metric.
For the clustering algorithms, the number of clusters was set to 2. To evaluate the performances of the k-means-based methods (i.e., KM, CKM, and ITML), test samples were completely separated from training samples: the labels of test samples were predicted based on the clusters obtained only from the training samples. For semi-supervised spectral clustering, we applied the algorithm to both training and test samples so that we could make predictions for the test samples.
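As a concrete reference point, the KM baseline amounts to plain 2-cluster Lloyd's algorithm on the training inputs, followed by nearest-center prediction for test points. A minimal 1-D sketch (the data, seed, initialization, and helper names are our illustration, not the experimental code of the paper):

```python
import random

def kmeans2(points, n_iter=20):
    """Plain 2-center Lloyd's algorithm on 1-D points, initialized at the
    extremes (the KM baseline ignores all pairwise supervision)."""
    c = [min(points), max(points)]
    for _ in range(n_iter):
        clusters = ([], [])
        for p in points:
            clusters[0 if abs(p - c[0]) <= abs(p - c[1]) else 1].append(p)
        # Update each center to its cluster mean; keep the old center if empty.
        c = [sum(cl) / len(cl) if cl else c[i] for i, cl in enumerate(clusters)]
    return c

def predict(centers, p):
    """Assign a test point to the nearest learned center (0 or 1)."""
    return 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1

random.seed(5)
train = [random.gauss(-2, 0.5) for _ in range(100)] + [random.gauss(2, 0.5) for _ in range(100)]
centers = kmeans2(train)
print(sorted(round(x, 2) for x in centers))  # two centers near -2 and +2
```

Because cluster indices carry no class semantics, evaluating such baselines requires matching clusters to classes, which is part of the gap between clustering and risk-minimizing classification discussed in Sec. 1.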
6 Conclusion
In this paper, we proposed a novel weakly supervised classification algorithm that performs empirical risk minimization from pairwise similar, dissimilar, and unlabeled data. We formulated the optimization problem for SDU classification and provided practical solutions with the squared and double-hinge losses. From the estimation error bound analysis, we showed that the SD-DU combination is the most promising for SDU classification. Through experiments on benchmark datasets, we confirmed that our SDU classification outperforms baseline methods.
Acknowledgments
HB was supported by JST ACTI Grant Number JPMJPR18UI. IS was supported by JST CREST Grant Number JPMJCR17A1, Japan. MS was supported by JST CREST Grant Number JPMJCR1403.
References
Appendix A Proofs of Theorems
a.1 Preliminaries
For convenience, we introduce pointwise densities for similar and dissimilar data. By marginalizing the pairwise densities $p_{\mathrm{S}}(x, x')$ and $p_{\mathrm{D}}(x, x')$ over $x'$, we have
(35)  
(36) 
Let one set collect the pointwise samples appearing in $\mathcal{D}_{\mathrm{S}}$ and another collect the pointwise samples appearing in $\mathcal{D}_{\mathrm{D}}$:
(37)  
(38) 
Then, we can consider the generation process of pointwise similar/dissimilar data as
We use the above notations in the proofs of Theorems 2, 3, and 4.
a.2 Proof of Theorem 1
We start from an unbiased risk estimator from SU data in Proposition 1. The classification risk is equivalently represented as:
(39) 
where
Since the pairwise density can be decomposed into the densities of similar pairs and dissimilar pairs, namely, $p(x, x') = \pi_{\mathrm{S}} p_{\mathrm{S}}(x, x') + \pi_{\mathrm{D}} p_{\mathrm{D}}(x, x')$, the expectation over pairs can be decomposed as follows.
(40) 
In addition, the risk over can be equivalently represented as:
(41)  
(42) 
By applying Eqs. (40), (41), and (42) to Eq. (39), we can derive the risk only from the dissimilar and unlabeled distributions or from the similar and dissimilar distributions.
(43) 
(44) 
where we use the relation $\pi_{\mathrm{S}} + \pi_{\mathrm{D}} = 1$. Therefore, the DU and SD risks are also unbiased estimators of the classification risk. ∎
a.3 Proof of Theorem 2
We prove this theorem based on the positive semidefiniteness of the Hessian matrix, similarly to SU classification (bao2018classification). Since $\ell$ is a twice differentiable margin loss, there is a twice differentiable function $\psi$ such that $\ell(z, t) = \psi(tz)$. Our objective function can be written as
(45) 
The second-order derivative of the objective with respect to the parameters can be computed as
(46)
where the margin-loss representation $\ell(z, t) = \psi(tz)$ is employed in the second equality and the condition in Eq. (23) is employed in the last equality. Here, the Hessian of the objective with respect to the parameters is
(47) 
where $A \succeq 0$ means that the matrix $A$ is positive semidefinite. Positive semidefiniteness of the Hessian follows from $\psi'' \geq 0$ (as $\psi$ is convex) and the structure of Eq. (47). Therefore, the objective is convex with respect to $(w, b)$. ∎
a.4 Proof of Theorem 3
We apply a technique similar to that of SU classification to DU and SD classification. From the pointwise decomposition in Sec. A.1, we have the following lemma.