In supervised classification, we need a vast amount of labeled data in the training phase. However, in many real-world problems such as robotics, medical diagnosis and bioinformatics, it is time-consuming and laborious to label a huge amount of unlabeled data. To deal with this problem, weakly-supervised classification has been explored in various setups, including semi-supervised classification (Chapelle & Zien, 2005; Chapelle et al., 2010; Sakai et al., 2017) and positive-unlabeled (PU) classification (Elkan & Noto, 2008; du Plessis et al., 2014, 2015; Niu et al., 2016).
Another line of research from the clustering viewpoint is semi-supervised clustering (Wagstaff et al., 2001; Basu et al., 2002; Klein et al., 2002; Xing et al., 2002; Bar-Hillel et al., 2003; Basu et al., 2004; Bilenko et al., 2004; Weinberger et al., 2005; Davis et al., 2007; Lu, 2007; Li et al., 2008; Kulis et al., 2009; Li & Liu, 2009; Yi et al., 2013; Calandriello et al., 2014; Chen et al., 2014; Chiang et al., 2014), where pairwise similarity and dissimilarity data (a.k.a. must-link and cannot-link constraints) are utilized to guide unsupervised clustering to a desired solution. Semi-supervised clustering and weakly-supervised classification are similar in that they do not use fully-supervised data. However, they are essentially different from the learning theoretic viewpoint: weakly-supervised classification methods are justified as supervised learning methods, while semi-supervised clustering methods are still evaluated as unsupervised learning (see Table 1 for the definitions of classification and clustering). Indeed, weakly-supervised learning methods based on empirical risk minimization (du Plessis et al., 2014, 2015; Niu et al., 2016; Sakai et al., 2017) were shown to achieve the optimal parametric convergence rate to the optimal solution, while no such generalization guarantee is available for semi-supervised clustering methods.
Table 1: Definitions of classification and clustering.
|Classification||The goal is to minimize the true risk (given the zero-one loss) of an inductive classifier. To this end, an empirical risk (given a surrogate loss) on the training data is minimized for training the classifier. The training and testing phases can be clearly distinguished. Classification requires the existence of the underlying joint density.|
|Clustering||The goal is to partition the data at hand into clusters. To this end, density-/margin-/information-based measures are optimized for implementing the low-density separation based on the cluster assumption. The training and testing phases are combined so that all data are in-sample. Clustering does not need the underlying joint density.|
The goal of this paper is to bridge these two similar but substantially different paradigms. More specifically, we propose a novel weakly-supervised learning method called SU classification, where only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data are employed. In SU classification, the information available for training a classifier is similar to that in semi-supervised clustering. However, our proposed method gives an inductive model, which learns decision functions from training data and predicts attributes of out-of-sample data (i.e., unseen test data). Furthermore, the proposed method can not only separate two classes but also identify which class is positive (class identification) under certain conditions.
Pairwise data are useful in various applications such as speaker identification (Bar-Hillel et al., 2003), protein function prediction (Klein et al., 2002) (functional links between proteins can be found by experiments), and lane-finding using GPS data (Wagstaff et al., 2001) (clustering coarse GPS data points into each lane).
For this SU classification problem, our contributions in this paper are three-fold:
- We propose an empirical risk minimization method for SU classification (Section 3). This enables us to obtain an inductive classifier. Under certain loss conditions together with the linear-in-parameter model, its objective function even becomes convex in the parameters.
- We theoretically establish an estimation error bound for our SU classification method (Section 5), showing that the proposed method achieves the optimal parametric convergence rate.
- We experimentally demonstrate the practical usefulness of the proposed SU classification method (Section 6).
Related problem settings are summarized in Figure 1.
2 Binary Classification
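In this section, we briefly formulate the standard binary classification problem. Let $\mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional example space and $\mathcal{Y} = \{+1, -1\}$ be a binary label space. We assume that labeled data $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is drawn from the joint probability distribution with density $p(x, y)$. The goal of binary classification is to obtain a classifier $f: \mathcal{X} \to \mathbb{R}$ which minimizes the classification risk defined as
$$R(f) \triangleq \mathbb{E}_{(X, Y) \sim p}\left[\ell(f(X), Y)\right], \tag{1}$$
where $\mathbb{E}_{(X, Y) \sim p}[\cdot]$ denotes the expectation over the joint distribution $p(X, Y)$ and $\ell: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function.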
In standard supervised classification scenarios, we are given positive and negative training data independently following $p(x, y)$. Then, based on these training data, the classification risk (1) is empirically approximated and the empirical risk minimizer is obtained. However, in many real-world problems, collecting labeled training data is costly. The goal of this paper is to train a binary classifier only from pairwise similarity and unlabeled data, which are cheaper to collect than fully labeled data.
3 Learning from Pairwise Similarity and Unlabeled Data
In this section, we propose a learning method to train a decision function from pairwise similarity and unlabeled data.
3.1 Pairwise Similarity and Unlabeled Data
First, we discuss underlying distributions of similar pairs and unlabeled data, in order to perform the empirical risk minimization.
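Pairwise Similarity: If $x$ and $x'$ belong to the same class, they are said to be pairwise similar (S). We assume that pairwise similar data are drawn following
$$p_{\mathrm{S}}(x, x') = p(x, x' \mid y = y' = +1 \,\lor\, y = y' = -1) = \frac{\pi_+^2\, p_+(x)\, p_+(x') + \pi_-^2\, p_-(x)\, p_-(x')}{\pi_+^2 + \pi_-^2}, \tag{2}$$
where $\pi_+ \triangleq p(y = +1)$ and $\pi_- \triangleq p(y = -1)$ are the class-prior probabilities satisfying $\pi_+ + \pi_- = 1$, and $p_+(x) \triangleq p(x \mid y = +1)$ and $p_-(x) \triangleq p(x \mid y = -1)$ are the class-conditional densities. Eq. (2) means that we draw two labeled data independently following $p(x, y)$, and we accept/reject them if they belong to the same class/different classes.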
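The accept/reject reading of Eq. (2) can be simulated directly. The following is a minimal sketch under assumptions that are not part of the paper: one-dimensional Gaussian class-conditional densities with means $\pm 2$, a class-prior of 0.7, and the hypothetical helper names `sample_labeled` and `sample_similar_pairs`. It checks that the fraction of positive pairs among the accepted pairs is close to $\pi_+^2 / (\pi_+^2 + \pi_-^2)$.

```python
import random

def sample_labeled(rng, prior_pos):
    """Draw (x, y) from an illustrative joint density (assumed Gaussian classes)."""
    y = 1 if rng.random() < prior_pos else -1
    return rng.gauss(2.0 * y, 1.0), y

def sample_similar_pairs(rng, prior_pos, n_pairs):
    """Draw two labeled points independently; keep the pair only if the labels agree."""
    pairs, labels = [], []
    while len(pairs) < n_pairs:
        (x1, y1), (x2, y2) = sample_labeled(rng, prior_pos), sample_labeled(rng, prior_pos)
        if y1 == y2:
            pairs.append((x1, x2))
            labels.append(y1)  # kept only to inspect the simulation, never used for training
    return pairs, labels

rng = random.Random(0)
prior_pos = 0.7
pairs, labels = sample_similar_pairs(rng, prior_pos, 20000)

positive_fraction = sum(1 for y in labels if y == 1) / len(labels)
expected_fraction = prior_pos ** 2 / (prior_pos ** 2 + (1 - prior_pos) ** 2)
print(round(positive_fraction, 3), round(expected_fraction, 3))
```

In an actual SU problem the pair labels are of course unobserved; they are retained here only to verify the sampling scheme.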
Unlabeled Data: We assume that unlabeled (U) data is drawn following the marginal density $p(x)$, which can be decomposed into the sum of the class-conditional densities as
$$p_{\mathrm{U}}(x) = p(x) = \pi_+\, p_+(x) + \pi_-\, p_-(x). \tag{3}$$
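Our goal is to train a classifier only from SU data, which we call SU classification. We assume that we have a similar dataset $\mathcal{D}_{\mathrm{S}}$ and an unlabeled dataset $\mathcal{D}_{\mathrm{U}}$ as
$$\mathcal{D}_{\mathrm{S}} \triangleq \{(x_{\mathrm{S},i}, x'_{\mathrm{S},i})\}_{i=1}^{n_{\mathrm{S}}} \overset{\text{i.i.d.}}{\sim} p_{\mathrm{S}}(x, x'), \qquad \mathcal{D}_{\mathrm{U}} \triangleq \{x_{\mathrm{U},i}\}_{i=1}^{n_{\mathrm{U}}} \overset{\text{i.i.d.}}{\sim} p_{\mathrm{U}}(x).$$
We also use a notation $\tilde{\mathcal{D}}_{\mathrm{S}} \triangleq \{\tilde{x}_{\mathrm{S},i}\}_{i=1}^{2 n_{\mathrm{S}}}$ to denote pointwise similar data obtained by ignoring pairwise relations in $\mathcal{D}_{\mathrm{S}}$.
Lemma 1. $\tilde{\mathcal{D}}_{\mathrm{S}} = \{\tilde{x}_{\mathrm{S},i}\}_{i=1}^{2 n_{\mathrm{S}}}$ are independently drawn following
$$\tilde{p}_{\mathrm{S}}(x) = \frac{\pi_+^2\, p_+(x) + \pi_-^2\, p_-(x)}{\pi_+^2 + \pi_-^2}.$$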
A proof is given in Appendix A in the supplementary material.
Lemma 1 states that a similar data pair $(x, x')$ is essentially symmetric, and $x$ and $x'$ can be regarded as being independently drawn following $\tilde{p}_{\mathrm{S}}$, if we assume the pair $(x, x')$ is drawn following $p_{\mathrm{S}}$. This perspective is important when we analyze the variance of the risk estimator (Section 3.3), and estimate the class-prior (Section 4.2).
3.2 Risk Expression with SU Data
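Below, we attempt to express the classification risk (1) only in terms of SU data. Assume $\pi_+ \neq \frac{1}{2}$, and let $\tilde{\ell}(z)$, $\mathcal{L}_{\mathrm{S},\ell}(z)$ and $\mathcal{L}_{\mathrm{U},\ell}(z)$ be
$$\tilde{\ell}(z) \triangleq \ell(z, +1) - \ell(z, -1), \qquad \mathcal{L}_{\mathrm{S},\ell}(z) \triangleq \frac{1}{2\pi_+ - 1}\,\tilde{\ell}(z), \qquad \mathcal{L}_{\mathrm{U},\ell}(z) \triangleq -\frac{\pi_-}{2\pi_+ - 1}\,\ell(z, +1) + \frac{\pi_+}{2\pi_+ - 1}\,\ell(z, -1).$$
Then we have the following theorem.
Theorem 1. The classification risk (1) can be equivalently expressed as
$$R_{\mathrm{SU},\ell}(f) = \pi_{\mathrm{S}}\, \mathbb{E}_{(X, X') \sim p_{\mathrm{S}}}\left[\frac{\mathcal{L}_{\mathrm{S},\ell}(f(X)) + \mathcal{L}_{\mathrm{S},\ell}(f(X'))}{2}\right] + \mathbb{E}_{X \sim p_{\mathrm{U}}}\left[\mathcal{L}_{\mathrm{U},\ell}(f(X))\right],$$
where $\pi_{\mathrm{S}} \triangleq \pi_+^2 + \pi_-^2$.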
A proof is given in Appendix B in the supplementary material.
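According to Theorem 1, replacing the expectations with sample averages gives a natural candidate for an unbiased estimator of the classification risk (1):
$$\hat{R}_{\mathrm{SU},\ell}(f) = \frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}} \sum_{i=1}^{n_{\mathrm{S}}} \frac{\mathcal{L}_{\mathrm{S},\ell}(f(x_{\mathrm{S},i})) + \mathcal{L}_{\mathrm{S},\ell}(f(x'_{\mathrm{S},i}))}{2} + \frac{1}{n_{\mathrm{U}}} \sum_{i=1}^{n_{\mathrm{U}}} \mathcal{L}_{\mathrm{U},\ell}(f(x_{\mathrm{U},i}))$$
$$\phantom{\hat{R}_{\mathrm{SU},\ell}(f)} = \frac{\pi_{\mathrm{S}}}{2 n_{\mathrm{S}}} \sum_{i=1}^{2 n_{\mathrm{S}}} \mathcal{L}_{\mathrm{S},\ell}(f(\tilde{x}_{\mathrm{S},i})) + \frac{1}{n_{\mathrm{U}}} \sum_{i=1}^{n_{\mathrm{U}}} \mathcal{L}_{\mathrm{U},\ell}(f(x_{\mathrm{U},i})), \tag{5}$$
where in the last line we use the decomposed version of similar dataset $\tilde{\mathcal{D}}_{\mathrm{S}}$ instead of $\mathcal{D}_{\mathrm{S}}$, since the loss form is symmetric.
$\mathcal{L}_{\mathrm{S},\ell}$ and $\mathcal{L}_{\mathrm{U},\ell}$ are illustrated in Figure 2.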
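The risk rewriting can be checked numerically on a small discrete example. The sketch below assumes the SU losses take the form $\mathcal{L}_{\mathrm{S}}(z) = (\ell(z,+1) - \ell(z,-1))/(2\pi_+ - 1)$ and $\mathcal{L}_{\mathrm{U}}(z) = (-\pi_-\ell(z,+1) + \pi_+\ell(z,-1))/(2\pi_+ - 1)$; the three-point distribution, the classifier outputs, and the choice of the logistic loss are illustrative assumptions, and `su_losses` is a hypothetical helper name. It compares the ordinary classification risk with the SU expression, both computed exactly.

```python
import math

def logistic_loss(z, t):
    return math.log1p(math.exp(-t * z))

def su_losses(loss, prior_pos):
    """Build (L_S, L_U) from a loss; requires prior_pos != 1/2."""
    c = 2.0 * prior_pos - 1.0
    prior_neg = 1.0 - prior_pos

    def loss_s(z):
        return (loss(z, +1) - loss(z, -1)) / c

    def loss_u(z):
        return (-prior_neg * loss(z, +1) + prior_pos * loss(z, -1)) / c

    return loss_s, loss_u

# Toy discrete example on x in {0, 1, 2}; all numbers are illustrative.
prior_pos, prior_neg = 0.7, 0.3
p_pos = [0.6, 0.3, 0.1]   # class-conditional p_+(x)
p_neg = [0.1, 0.2, 0.7]   # class-conditional p_-(x)
f = [1.5, 0.2, -1.0]      # classifier outputs f(x)

pi_s = prior_pos ** 2 + prior_neg ** 2
# Pointwise similar density and unlabeled (marginal) density.
p_s_tilde = [(prior_pos ** 2 * a + prior_neg ** 2 * b) / pi_s for a, b in zip(p_pos, p_neg)]
p_u = [prior_pos * a + prior_neg * b for a, b in zip(p_pos, p_neg)]

loss_s, loss_u = su_losses(logistic_loss, prior_pos)

# Ordinary risk, which needs the class labels.
risk_pn = sum(prior_pos * a * logistic_loss(z, +1) + prior_neg * b * logistic_loss(z, -1)
              for a, b, z in zip(p_pos, p_neg, f))
# SU risk, which needs only the similar and unlabeled densities.
risk_su = (pi_s * sum(p * loss_s(z) for p, z in zip(p_s_tilde, f))
           + sum(p * loss_u(z) for p, z in zip(p_u, f)))
print(risk_pn, risk_su)
```

The two values agree up to floating-point error, which is what the theorem predicts for any loss and any classifier when $\pi_+ \neq 1/2$.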
3.3 Minimum-Variance Risk Estimator
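Eq. (5) is one of the candidates of SU risk estimator. However, due to the symmetry of $(x, x')$, we have the following lemma.
Lemma 2. The first term of $R_{\mathrm{SU},\ell}(f)$ can be equivalently expressed as
$$\pi_{\mathrm{S}}\, \mathbb{E}_{(X, X') \sim p_{\mathrm{S}}}\left[\alpha\, \mathcal{L}_{\mathrm{S},\ell}(f(X)) + (1 - \alpha)\, \mathcal{L}_{\mathrm{S},\ell}(f(X'))\right],$$
where $\alpha \in [0, 1]$ is an arbitrary weight.
By Lemma 2, the weighted average
$$\frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}} \sum_{i=1}^{n_{\mathrm{S}}} \left\{\alpha\, \mathcal{L}_{\mathrm{S},\ell}(f(x_{\mathrm{S},i})) + (1 - \alpha)\, \mathcal{L}_{\mathrm{S},\ell}(f(x'_{\mathrm{S},i}))\right\}$$
is also an unbiased estimator of the first term of $R_{\mathrm{SU},\ell}(f)$. Then, a natural question arises: is the risk estimator (5) best among all $\alpha$? We answer this question by the following theorem.
Theorem 2. The equally weighted estimator
$$\frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}} \sum_{i=1}^{n_{\mathrm{S}}} \frac{\mathcal{L}_{\mathrm{S},\ell}(f(x_{\mathrm{S},i})) + \mathcal{L}_{\mathrm{S},\ell}(f(x'_{\mathrm{S},i}))}{2}$$
has the minimum variance among all unbiased estimators of the weighted form above with $\alpha \in [0, 1]$.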
A proof is given in Appendix C.2 in the supplementary material.
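The intuition can be sketched with a short variance calculation. This is an informal illustration rather than the proof in Appendix C.2, and the symbols $A$, $B$, $\sigma^2$ and $c$ are introduced only for this sketch. For a single pair $(X, X') \sim p_{\mathrm{S}}$, write $A$ for the S-loss evaluated at $f(X)$ and $B$ for the S-loss evaluated at $f(X')$. By the symmetry of the pair, $A$ and $B$ have the same variance $\sigma^2$; let $c = \mathrm{Cov}(A, B)$. Then, for a weight $\alpha \in [0, 1]$:

```latex
\begin{align*}
\mathrm{Var}\bigl[\alpha A + (1-\alpha) B\bigr]
  &= \bigl(\alpha^2 + (1-\alpha)^2\bigr)\,\sigma^2 + 2\alpha(1-\alpha)\,c \\
  &= \sigma^2 - 2\alpha(1-\alpha)\,(\sigma^2 - c).
\end{align*}
```

Since $c \le \sigma^2$ by the Cauchy-Schwarz inequality and $\alpha(1-\alpha)$ is maximized at $\alpha = 1/2$ on $[0, 1]$, the variance is smallest for equal weights, which corresponds to averaging the two members of each pair.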
3.4 Objective Function
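Here, we investigate the objective function when the linear-in-parameter model $f(x) = w^\top \phi(x) + w_0$ is employed as a classifier, where $w \in \mathbb{R}^b$ and $w_0 \in \mathbb{R}$ are parameters and $\phi: \mathbb{R}^d \to \mathbb{R}^b$ is basis functions. In general, the bias parameter $w_0$ can be ignored (Footnote 1: Let $\tilde{\phi}(x) \triangleq [\phi(x)^\top\ 1]^\top$ and $\tilde{w} \triangleq [w^\top\ w_0]^\top$; then $\tilde{w}^\top \tilde{\phi}(x) = w^\top \phi(x) + w_0$.). We formulate SU classification as the following empirical risk minimization problem using Eq. (5) together with the $\ell_2$ regularization:
$$\min_{w}\ \hat{J}_\ell(w), \tag{7}$$
where
$$\hat{J}_\ell(w) \triangleq \frac{\pi_{\mathrm{S}}}{2 n_{\mathrm{S}}} \sum_{i=1}^{2 n_{\mathrm{S}}} \mathcal{L}_{\mathrm{S},\ell}\bigl(w^\top \phi(\tilde{x}_{\mathrm{S},i})\bigr) + \frac{1}{n_{\mathrm{U}}} \sum_{i=1}^{n_{\mathrm{U}}} \mathcal{L}_{\mathrm{U},\ell}\bigl(w^\top \phi(x_{\mathrm{U},i})\bigr) + \frac{\lambda}{2} \|w\|^2, \tag{8}$$
and $\lambda > 0$ is the regularization parameter. We need the class-prior $\pi_+$ to solve this optimization problem. We discuss how to estimate $\pi_+$ in Section 4.2.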
Next, we will investigate appropriate choices of the loss function $\ell$. From now on, we focus on margin loss functions (Mohri et al., 2012): $\ell$ is said to be a margin loss function if there exists $\psi: \mathbb{R} \to \mathbb{R}_+$ such that $\ell(z, t) = \psi(tz)$.
In general, our objective function (8) is non-convex even if a convex loss function is used for $\ell$ (Footnote 2: In general, $\mathcal{L}_{\mathrm{U},\ell}$ is non-convex because either $-\frac{\pi_-}{2\pi_+ - 1}\ell(\cdot, +1)$ or $\frac{\pi_+}{2\pi_+ - 1}\ell(\cdot, -1)$ is convex and the other is concave. $\mathcal{L}_{\mathrm{S},\ell}$ is not always convex even if $\ell$ is convex, either.). However, the next theorem, inspired by du Plessis et al. (2015), states that a certain loss function will result in a convex objective function.
Theorem 3. If the loss function $\ell(z, t)$ is a convex margin loss and satisfies the condition
$$\ell(z, +1) - \ell(z, -1) = -z,$$
then $\hat{J}_\ell(w)$ is convex.
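Table 2: Margin loss functions satisfying the conditions in Theorem 3.
|Loss name||$\psi(m)$|
|Squared loss||$\frac{1}{4}(m - 1)^2$|
|Double hinge loss||$\max\left(-m, \max\left(0, \frac{1}{2} - \frac{1}{2}m\right)\right)$|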
Examples of margin loss functions satisfying the conditions in Theorem 3 are shown in Table 2. Below, as special cases, we show the objective function for the squared loss and the double-hinge loss. The detailed derivations are given in Appendix E in the supplementary material.
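The condition on the loss can be spot-checked numerically. The sketch below assumes the condition is $\ell(z, +1) - \ell(z, -1) = -z$ and uses the usual forms of the squared loss $\frac{1}{4}(tz - 1)^2$, the double-hinge loss $\max(-tz, \max(0, \frac{1}{2} - \frac{1}{2}tz))$ and the hinge loss $\max(0, 1 - tz)$; the helper name `satisfies_condition` and the evaluation grid are illustrative choices.

```python
def squared_loss(z, t):
    return 0.25 * (t * z - 1.0) ** 2

def double_hinge_loss(z, t):
    return max(-t * z, max(0.0, 0.5 - 0.5 * t * z))

def hinge_loss(z, t):
    return max(0.0, 1.0 - t * z)

def satisfies_condition(loss, grid, tol=1e-12):
    """Check loss(z, +1) - loss(z, -1) == -z at every grid point (up to tol)."""
    return all(abs(loss(z, +1) - loss(z, -1) + z) <= tol for z in grid)

grid = [i / 10.0 for i in range(-50, 51)]
results = {
    name: satisfies_condition(fn, grid)
    for name, fn in [("squared", squared_loss),
                     ("double_hinge", double_hinge_loss),
                     ("hinge", hinge_loss)]
}
print(results)
```

A grid check is not a proof, but it makes the difference between the hinge and double-hinge losses concrete: the hinge loss violates the identity as soon as $|z| > 1$.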
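Squared Loss: The squared loss is $\ell_{\mathrm{SQ}}(z, t) = \frac{1}{4}(tz - 1)^2$. Substituting $\ell_{\mathrm{SQ}}$ into Eq. (8), the objective function is, up to a constant independent of $w$,
$$\hat{J}_{\mathrm{SQ}}(w) = w^\top \left(\frac{1}{4 n_{\mathrm{U}}} X_{\mathrm{U}}^\top X_{\mathrm{U}} + \frac{\lambda}{2} I\right) w + \frac{1}{2\pi_+ - 1} \left(-\frac{\pi_{\mathrm{S}}}{2 n_{\mathrm{S}}} \mathbf{1}^\top X_{\mathrm{S}} + \frac{1}{2 n_{\mathrm{U}}} \mathbf{1}^\top X_{\mathrm{U}}\right) w,$$
where $\mathbf{1}$ is a vector whose elements are all ones, $I$ is the identity matrix, $X_{\mathrm{S}} \triangleq [\phi(\tilde{x}_{\mathrm{S},1}) \cdots \phi(\tilde{x}_{\mathrm{S},2 n_{\mathrm{S}}})]^\top$, and $X_{\mathrm{U}} \triangleq [\phi(x_{\mathrm{U},1}) \cdots \phi(x_{\mathrm{U},n_{\mathrm{U}}})]^\top$. The minimizer of this objective function can be obtained analytically as
$$\hat{w} = \frac{n_{\mathrm{U}}}{2\pi_+ - 1} \left(X_{\mathrm{U}}^\top X_{\mathrm{U}} + 2 \lambda n_{\mathrm{U}} I\right)^{-1} \left(\frac{\pi_{\mathrm{S}}}{n_{\mathrm{S}}} X_{\mathrm{S}}^\top \mathbf{1} - \frac{1}{n_{\mathrm{U}}} X_{\mathrm{U}}^\top \mathbf{1}\right).$$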
Thus the optimization problem can be easily implemented and solved highly efficiently if the number of basis functions is not so large.
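To make the closed-form solution concrete, here is a minimal pure-Python sketch. It assumes the minimizer has the form $w = \frac{n_U}{2\pi_+ - 1}(X_U^\top X_U + 2\lambda n_U I)^{-1}(\frac{\pi_S}{n_S} X_S^\top \mathbf{1} - \frac{1}{n_U} X_U^\top \mathbf{1})$, which follows from setting the gradient of the squared-loss objective to zero. The toy data (one-dimensional Gaussian classes with means $\pm 2$, features $[x, 1]$), the value of `lam`, and all function names are illustrative assumptions, not the authors' implementation.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_su_squared(pairs, unlabeled, prior_pos, lam=1e-3):
    """pairs: list of (phi, phi') feature vectors; unlabeled: list of phi."""
    pi_s = prior_pos ** 2 + (1.0 - prior_pos) ** 2
    n_s, n_u, dim = len(pairs), len(unlabeled), len(unlabeled[0])
    pointwise = [phi for pair in pairs for phi in pair]  # ignore the pairing
    sum_s = [sum(phi[j] for phi in pointwise) for j in range(dim)]
    sum_u = [sum(phi[j] for phi in unlabeled) for j in range(dim)]
    # A = X_U^T X_U + 2 * lam * n_U * I
    A = [[sum(phi[j] * phi[k] for phi in unlabeled) + (2.0 * lam * n_u if j == k else 0.0)
          for k in range(dim)] for j in range(dim)]
    rhs = [pi_s / n_s * sum_s[j] - sum_u[j] / n_u for j in range(dim)]
    scale = n_u / (2.0 * prior_pos - 1.0)
    return [scale * v for v in solve_linear(A, rhs)]

def sample_point(rng, prior_pos):
    """Toy generator: returns (features [x, 1], label)."""
    y = 1 if rng.random() < prior_pos else -1
    return [rng.gauss(2.0 * y, 1.0), 1.0], y

def make_su_data(rng, prior_pos, n_s, n_u):
    pairs = []
    while len(pairs) < n_s:
        (a, ya), (b, yb) = sample_point(rng, prior_pos), sample_point(rng, prior_pos)
        if ya == yb:  # accept only same-class pairs; labels are then discarded
            pairs.append((a, b))
    unlabeled = [sample_point(rng, prior_pos)[0] for _ in range(n_u)]
    return pairs, unlabeled

rng = random.Random(0)
prior_pos = 0.7
pairs, unlabeled = make_su_data(rng, prior_pos, 500, 500)
w = fit_su_squared(pairs, unlabeled, prior_pos)

test_set = [sample_point(rng, prior_pos) for _ in range(2000)]
accuracy = sum((1 if w[0] * phi[0] + w[1] * phi[1] >= 0 else -1) == y
               for phi, y in test_set) / len(test_set)
print(w, accuracy)
```

The hand-written solver keeps the sketch dependency-free; in practice a numerical linear algebra library would normally be used for the solve, which is cheap whenever the number of basis functions is small.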
Double-Hinge Loss: Since the hinge loss does not satisfy the conditions in Theorem 3, the double-hinge loss $\ell_{\mathrm{DH}}(z, t) = \max\left(-tz, \max\left(0, \frac{1}{2} - \frac{1}{2}tz\right)\right)$ is proposed by du Plessis et al. (2015) as an alternative. Substituting $\ell_{\mathrm{DH}}$ into Eq. (8), we can reformulate the optimization problem as follows:
where $\geq$ for vectors denotes the element-wise inequality. This optimization problem is a quadratic program (QP). The transformation into the standard QP form is given in Appendix E in the supplementary material.
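One way to see why a QP arises is sketched below. This is an illustrative derivation, not necessarily the authors' formulation, which may use a different but equivalent parameterization. It assumes the condition $\ell(z, +1) - \ell(z, -1) = -z$, an $\ell_2$ regularizer with weight $\lambda/2$, and matrices $X_{\mathrm{S}}$ and $X_{\mathrm{U}}$ whose rows are the basis-function vectors of the $2 n_{\mathrm{S}}$ pointwise similar data and the $n_{\mathrm{U}}$ unlabeled data; the slack vector $\xi$ is introduced only for this sketch. Under the condition, the S-loss reduces to $-z/(2\pi_+ - 1)$ and the U-loss to $\ell(z, -1) + \pi_- z/(2\pi_+ - 1)$. For the double-hinge loss, $\ell_{\mathrm{DH}}(z, -1) = \max(z, \max(0, \frac{1}{2} + \frac{1}{2}z))$ is a maximum of three linear pieces, so it can be replaced by a slack variable bounded below by each piece:

```latex
\begin{align*}
\min_{w,\,\xi}\quad
  & -\frac{\pi_{\mathrm{S}}}{2 n_{\mathrm{S}} (2\pi_+ - 1)}\, \mathbf{1}^\top X_{\mathrm{S}} w
    + \frac{\pi_-}{n_{\mathrm{U}} (2\pi_+ - 1)}\, \mathbf{1}^\top X_{\mathrm{U}} w
    + \frac{1}{n_{\mathrm{U}}}\, \mathbf{1}^\top \xi
    + \frac{\lambda}{2}\, w^\top w \\
\text{s.t.}\quad
  & \xi \geq \mathbf{0}, \qquad
    \xi \geq \tfrac{1}{2}\mathbf{1} + \tfrac{1}{2} X_{\mathrm{U}} w, \qquad
    \xi \geq X_{\mathrm{U}} w .
\end{align*}
```

The objective is quadratic in $w$ and linear in $\xi$, and all constraints are linear, so an off-the-shelf QP solver applies. The slack formulation is tight because $\xi$ enters the objective with a positive coefficient.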
4 Relation between Class-Prior and SU Learning
In Section 3, we assume that the class-prior $\pi_+$ is given in advance. In practice, the behavior of SU classification depends on the prior knowledge about $\pi_+$. In this section, we first clarify the relation between behaviors of the proposed method and $\pi_+$, then we propose an algorithm to estimate $\pi_+$ in case we do not have $\pi_+$ in advance.
4.1 Class-Prior-Dependent Behaviors of Proposed Method
We discuss the following three different cases on prior knowledge of $\pi_+$ and summarize them in Table 3.
(Case 1) The class-prior is given: In this case, we can directly solve the optimization problem (7). The solution not only separates data but also identifies classes, i.e., determines which class is positive.
(Case 2) No prior knowledge on the class-prior is given: In this case, we need to estimate $\pi_+$ before solving (7). If we assume $\pi_+ > \pi_-$, the estimation method in Section 4.2 gives an estimator of $\pi_+$. Thus, we can regard the larger cluster as the positive class and solve the optimization problem (7). This time the solution just separates data because we have no prior information for class identifiability.
(Case 3) Magnitude relation of the class-prior is given: Finally, consider the case where we know which class has a larger class-prior. In this case, we also need to estimate $\pi_+$, but surprisingly, we can identify classes. For example, if the negative class has a larger class-prior, first we estimate the class-prior (let $\hat{\pi}$ be an estimated value). Since Algorithm 1 always gives an estimate of the class-prior of the larger class, the positive class-prior is given as $\pi_+ = 1 - \hat{\pi}$. After that, it reduces to Case 1.
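The class-identification step in Case 3 can be illustrated with a small helper. The steps of Algorithm 1 are not reproduced here; as a hedged sketch, suppose an estimate of the similar-pair proportion $\pi_{\mathrm{S}} = \pi_+^2 + \pi_-^2$ were available (an assumption of this example). Inverting that relation gives two roots that sum to one, and the root that is at least $1/2$ is the class-prior of the larger class. The function names `larger_class_prior` and `positive_class_prior` are hypothetical.

```python
import math

def larger_class_prior(pi_s):
    """Invert pi_S = pi^2 + (1 - pi)^2 and return the root with pi >= 1/2."""
    if not 0.5 <= pi_s <= 1.0:
        raise ValueError("pi_S must lie in [0.5, 1]")
    return 0.5 * (1.0 + math.sqrt(2.0 * pi_s - 1.0))

def positive_class_prior(pi_s, positive_is_larger):
    """Case 3: use the known magnitude relation to choose between the two roots."""
    pi_hat = larger_class_prior(pi_s)
    return pi_hat if positive_is_larger else 1.0 - pi_hat

# Example: pi_+ = 0.3 gives pi_S = 0.09 + 0.49 = 0.58.
print(larger_class_prior(0.58))           # prior of the larger (negative) class
print(positive_class_prior(0.58, False))  # prior of the positive class
```

Without the magnitude relation (Case 2), the two roots cannot be told apart from $\pi_{\mathrm{S}}$ alone, which matches the loss of class identifiability described above.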
Remark: In all of the three cases above, our proposed method gives an inductive model, which is learned from training data and generalizes to predict attributes of out-of-sample data (i.e., unseen test data) without any modification. On the other hand, most of the unsupervised/semi-supervised clustering methods can only handle in-sample data (i.e., data at hand given in advance).
4.2 Class-Prior Estimation from Pairwise Similarity and Unlabeled Data
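We propose a class-prior estimation algorithm only from SU data, which needs an assumption on pairwise dissimilar (D) data.
If $x$ and $x'$ belong to different classes, they are said to be pairwise dissimilar, and we assume they are drawn following
$$p_{\mathrm{D}}(x, x') = p\bigl(x, x' \mid (y = +1 \land y' = -1) \lor (y = -1 \land y' = +1)\bigr) = \frac{p_+(x)\, p_-(x') + p_-(x)\, p_+(x')}{2}.$$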
5 Estimation Error Bound
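In this section, we establish an estimation error bound for the proposed method. Hereafter, let $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ be a function class of a specified model.
Let $n$ be a positive integer, $Z_1, \dots, Z_n$ be i.i.d. random variables drawn from a probability distribution with density $\mu$, $\mathcal{H} = \{h: \mathcal{Z} \to \mathbb{R}\}$ be a class of measurable functions, and $\sigma = (\sigma_1, \dots, \sigma_n)$ be Rademacher variables, i.e., random variables taking $+1$ and $-1$ with even probabilities. Then the (expected) Rademacher complexity of $\mathcal{H}$ is defined as
$$\mathfrak{R}(\mathcal{H}; n, \mu) \triangleq \mathbb{E}_{Z_1, \dots, Z_n \sim \mu}\, \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(Z_i)\right].$$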
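In this section, we assume that for any probability density $\mu$, our model class $\mathcal{F}$ satisfies
$$\mathfrak{R}(\mathcal{F}; n, \mu) \leq \frac{C_{\mathcal{F}}}{\sqrt{n}}$$
for some constant $C_{\mathcal{F}} > 0$. This assumption is reasonable because many model classes such as the linear-in-parameter model class $\mathcal{F} = \{f(x) = w^\top \phi(x) \mid \|w\| \leq C_w, \|\phi\|_\infty \leq C_\phi\}$ ($C_w$ and $C_\phi$ are positive constants) satisfy it (Mohri et al., 2012).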
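The $1/\sqrt{n}$ behavior can be illustrated for a norm-bounded linear class. The sketch below is a Monte Carlo estimate of the empirical Rademacher complexity (conditional on one sample, not the expectation over samples), assuming the class $\{x \mapsto w^\top \phi(x) : \|w\|_2 \le C_w\}$ with features bounded in Euclidean norm by $C_\phi$; the uniform toy features and the name `empirical_rademacher_linear` are illustrative assumptions. For this class the supremum over $w$ has the closed form $\frac{C_w}{n} \|\sum_i \sigma_i \phi_i\|_2$, and the estimate is compared against $C_w C_\phi / \sqrt{n}$.

```python
import math
import random

def empirical_rademacher_linear(features, c_w, n_draws, rng):
    """Monte Carlo estimate of E_sigma[ sup_{||w||<=c_w} (1/n) sum_i sigma_i w.phi_i ].

    The supremum equals (c_w / n) * || sum_i sigma_i phi_i ||_2 for this class.
    """
    n, dim = len(features), len(features[0])
    total = 0.0
    for _ in range(n_draws):
        s = [0.0] * dim
        for phi in features:
            sigma = 1.0 if rng.random() < 0.5 else -1.0  # Rademacher variable
            for j in range(dim):
                s[j] += sigma * phi[j]
        total += c_w / n * math.sqrt(sum(v * v for v in s))
    return total / n_draws

rng = random.Random(0)
c_w = 1.0
results = []
for n in (100, 400, 1600):
    features = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    c_phi = max(math.sqrt(sum(v * v for v in phi)) for phi in features)
    estimate = empirical_rademacher_linear(features, c_w, 200, rng)
    bound = c_w * c_phi / math.sqrt(n)
    results.append((n, estimate, bound))
    print(n, round(estimate, 4), round(bound, 4))
```

Quadrupling $n$ should roughly halve the estimate, in line with the assumed $C_{\mathcal{F}}/\sqrt{n}$ rate.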
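Subsequently, let $f^*$ be the Bayes optimal classifier, and $\hat{f} \triangleq \arg\min_{f \in \mathcal{F}} \hat{R}_{\mathrm{SU},\ell}(f)$ be the empirical risk minimizer.
Theorem 4. Assume the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and all functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \leq C_b$ for any $f \in \mathcal{F}$. Let $C_\ell \triangleq \sup_{t \in \{\pm 1\}} \ell(C_b, t)$. For any $\delta > 0$, with probability at least $1 - \delta$,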
A proof is given in Appendix F in the supplementary material.
6 Experiments
In this section, we empirically investigate performances of class-prior estimation and the proposed method for SU classification.
Datasets are obtained from the UCI Machine Learning Repository (Lichman, 2013), the LIBSVM (Chang & Lin, 2011) and the ELENA project (https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/elena.htm). We randomly subsample the original datasets, to maintain that similar data consists of positive pairs and negative pairs with the ratio of $\pi_+^2$ to $\pi_-^2$ (see Eq. (2)), while the ratios of unlabeled and test data are $\pi_+$ to $\pi_-$ (see Eq. (3)).
6.1 Class-Prior Estimation
First, we study the empirical performance of class-prior estimation. We conduct experiments on benchmark datasets. Different dataset sizes are tested, where half of the data are S pairs and the other half are U data.
In Figure 4, KM1 and KM2 are plotted, which are two class-prior estimation algorithms proposed by Ramaswamy et al. (2016). We used them as CPE in Algorithm 1 (we used the authors' implementations published at http://web.eecs.umich.edu/~cscott/code/kernel_CPE.zip). Since
, we use additional heuristic to setin Algorithm 1 of Ramaswamy et al. (2016).
6.2 Classification Complexity
We empirically investigate our proposed method in terms of the relationship between classification performance and the number of training data. We conduct experiments on benchmark datasets with the fixed number of S pairs (fixed to 200), and the different numbers of U data are tested.
The detailed setting about the proposed method is described below.
Proposed Method (SU): We use the linear-in-parameter model (the identity map is chosen for the basis function). In Section 6.2, the squared loss is used, and $\pi_+$ is given (Case 1 in Table 3). In Section 6.3, the squared loss and the double-hinge loss are used (the optimization problem with the double-hinge loss is solved by cvxopt, http://cvxopt.org/), and the class-prior is estimated by Algorithm 1 with KM2 (Ramaswamy et al., 2016) (Case 2 in Table 3). Regularization parameter $\lambda$ is chosen from .
To choose hyperparameters, 5-fold cross-validation is used. Since we do not have fully labeled data in the training phase, Eq. (5) equipped with the zero-one loss is used as a proxy for the misclassification rate (the validation risk). In each experimental trial, the model with minimum validation risk is chosen.
6.3 Benchmark Comparison with Baseline Methods
We compare our proposed method with baseline methods on benchmark datasets. We conduct experiments on each dataset with 500 similar data pairs, 500 unlabeled data and 100 test data. As can be seen from Table 4, our proposed method outperforms baselines for many datasets. The details about the baseline methods are described below.
Baseline 1 (KM): As a simple baseline, we consider $k$-means clustering (MacQueen, 1967). We ignore pair information of S data and apply $k$-means clustering with $k = 2$ to U data.
Baseline 2 (ITML): Information-theoretic metric learning (Davis et al., 2007) is a metric learning method by regularizing the covariance matrix based on prior knowledge, with pairwise constraints. We use the identity matrix as prior knowledge, and the slack variable parameter is fixed to , since we cannot employ the cross-validation without any class label information. Using the obtained metric, $k$-means clustering is applied on test data.
Baseline 3 (SERAPH): Semi-supervised metric learning paradigm with hyper sparsity (Niu et al., 2012) is another metric learning method based on entropy regularization. Hyperparameter choice follows . Using the obtained metric, $k$-means clustering is applied on test data.
Baseline 4 (3SMIC): Semi-supervised SMI-based clustering (Calandriello et al., 2014) models class-posteriors and maximizes mutual information between unlabeled data at hand and their cluster labels. The penalty parameter and the kernel parameter are chosen from and , respectively, via 5-fold cross-validation.
Baseline 5 (DIMC): DirtyIMC (Chiang et al., 2015) is a noisy version of inductive matrix completion, where the similarity matrix is recovered from a low-rank feature matrix. Similarity matrix is assumed to be expressed as , where is low-rank feature representations of input data. After obtaining , $k$-means clustering is conducted on . Two hyperparameters in Eq. (2) in (Chiang et al., 2015) are set to .
Remark: KM, ITML and SERAPH rely on $k$-means, which is trained by using only training data. Test prediction is based on the metric between test data and learned cluster centers. Among the baselines, DIMC can only handle in-sample prediction, so it is trained by using both training and test data at the same time.
|Dataset||Dim||SU (squared)||SU (double-hinge)||KM||ITML||SERAPH||3SMIC||DIMC|
|adult||123||66.3 (1.2)||84.5 (0.8)||58.1 (1.1)||57.9 (1.1)||66.5 (1.7)||58.5 (1.3)||63.7 (1.2)|
|banana||2||64.1 (1.7)||68.2 (1.2)||54.3 (0.7)||54.8 (0.7)||55.0 (1.1)||61.9 (1.2)||64.3 (1.0)|
|cod-rna||8||82.5 (1.1)||71.0 (0.9)||63.1 (1.1)||62.8 (1.0)||62.5 (1.4)||56.6 (1.2)||63.8 (1.1)|
|higgs||28||54.9 (1.6)||69.3 (0.9)||66.4 (1.6)||66.6 (1.3)||63.4 (1.1)||57.0 (0.9)||65.0 (1.1)|
|ijcnn1||22||68.2 (1.3)||73.6 (0.9)||54.6 (0.9)||55.8 (0.7)||59.8 (1.2)||58.9 (1.3)||66.2 (2.2)|
|magic||10||65.9 (1.5)||69.0 (1.3)||53.9 (0.6)||54.5 (0.7)||55.0 (0.9)||59.1 (1.4)||63.1 (1.1)|
|phishing||68||75.2 (1.3)||91.3 (0.6)||64.4 (1.0)||61.9 (1.1)||62.4 (1.1)||60.1 (1.3)||64.8 (1.4)|
|phoneme||5||68.0 (1.4)||70.8 (1.0)||65.2 (0.9)||66.7 (1.4)||69.1 (1.4)||61.3 (1.1)||64.5 (1.2)|
|spambase||57||69.5 (1.3)||85.5 (0.5)||60.1 (1.8)||54.4 (1.1)||65.4 (1.8)||61.5 (1.3)||63.6 (1.3)|
|susy||18||60.7 (1.0)||74.8 (1.2)||55.6 (0.7)||55.4 (0.9)||58.0 (1.0)||57.1 (1.2)||65.2 (1.0)|
|w8a||300||60.5 (1.2)||86.5 (0.6)||71.0 (0.8)||69.5 (1.5)||N/A||60.5 (1.5)||65.0 (2.0)|
|waveform||21||78.6 (1.6)||87.0 (0.5)||56.1 (0.8)||54.8 (0.7)||56.5 (0.9)||56.5 (0.9)||65.0 (0.9)|
is error rate. Bold-faces indicate outperforming methods, chosen by one-sided t-test with the significance level. The result of SERAPH with “w8a” is unavailable due to high-dimensionality and memory constraints.
7 Conclusion
In this paper, we proposed a novel weakly-supervised learning problem named SU classification, where only similar and unlabeled data are needed. This problem is related to semi-supervised clustering in that pairwise data are employed. SU classification even becomes class-identifiable under a certain condition on the class-prior (see Table 3). The optimization problem of SU classification with the linear-in-parameter model becomes convex if we choose certain loss functions such as the squared loss and the double-hinge loss. We established an estimation error bound for the proposed method, and confirmed that the estimation error decreases with the parametric optimal order, as the number of similar data and unlabeled data becomes larger. We also investigated empirical performances and confirmed that our proposed method performs better than baseline clustering methods.
This work was supported by JST CREST JPMJCR1403.
- Bar-Hillel et al. (2003) Bar-Hillel, A., Hertz, T., Shental, N., and Weinshall, D.Z. Learning distance functions using equivalence relations. In ICML, pp. 11–18, 2003.
- Basu et al. (2002) Basu, S., Banerjee, A., and Mooney, R. J. Semi-supervised clustering by seeding. In ICML, pp. 27–34, 2002.
- Basu et al. (2004) Basu, S., Bilenko, M., and Mooney, R. J. A probabilistic framework for semi-supervised clustering. In SIGKDD, pp. 59–68, 2004.
- Bilenko et al. (2004) Bilenko, M., Basu, S., and Mooney, R. J. Integrating constraints and metric learning in semi-supervised clustering. In ICML, pp. 11, 2004.
- Calandriello et al. (2014) Calandriello, D., Niu, G., and Sugiyama, M. Semi-supervised information-maximization clustering. Neural Networks, 57:103–111, 2014.
- Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Chapelle & Zien (2005) Chapelle, O. and Zien, A. Semi-supervised classification by low density separation. In AISTATS 2005, pp. 57–64, 2005.
- Chapelle et al. (2010) Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 1st edition, 2010.
- Chen et al. (2014) Chen, Y., Jalali, A., Sanghavi, S., and Xu, H. Clustering partially observed graphs via convex optimization. Journal of Machine Learning Research, 15(1):2213–2238, 2014.
- Chiang et al. (2014) Chiang, K.-Y., Hsieh, C.-J., Natarajan, N., Dhillon, I. S., and Tewari, A. Prediction and clustering in signed networks: a local to global perspective. The Journal of Machine Learning Research, 15(1):1177–1213, 2014.
- Chiang et al. (2015) Chiang, K.-Y., Hsieh, C.-J., and Dhillon, I. S. Matrix completion with noisy side information. In NIPS, pp. 3447–3455, 2015.
- Davis et al. (2007) Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In ICML, pp. 209–216, 2007.
- du Plessis et al. (2014) du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NIPS, pp. 703–711, 2014.
- du Plessis et al. (2015) du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394, 2015.
- du Plessis et al. (2017) du Plessis, M. C., Niu, G., and Sugiyama, M. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4):463–492, 2017.
- Elkan & Noto (2008) Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In SIGKDD, pp. 213–220, 2008.
- Klein et al. (2002) Klein, D., Kamvar, S. D., and Manning, C. D. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, pp. 307–314, 2002.
- Kulis et al. (2009) Kulis, B., Basu, S., Dhillon, I., and Mooney, R. J. Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1–22, 2009.
- Li & Liu (2009) Li, Z. and Liu, J. Constrained clustering by spectral kernel learning. In ICCV, pp. 421–427, 2009.
- Li et al. (2008) Li, Z., Liu, J., and Tang, X. Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In ICML, pp. 576–583, 2008.
- Lichman (2013) Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
- Lu (2007) Lu, Z. Semi-supervised clustering with pairwise constraints: A discriminative approach. In AISTATS, pp. 299–306, 2007.
- MacQueen (1967) MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281–297, Berkeley, Calif., 1967. University of California Press.
- Mendelson (2008) Mendelson, S. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797–3803, 2008.
- Mohri et al. (2012) Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.
- Niu et al. (2012) Niu, G., Dai, B., Yamada, M., and Sugiyama, M. Information-theoretic semi-supervised metric learning via entropy regularization. In ICML, 2012.
- Niu et al. (2016) Niu, G., du Plessis, M. C., Sakai, T., Ma, Y., and Sugiyama, M. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NIPS, pp. 1199–1207, 2016.
- Ramaswamy et al. (2016) Ramaswamy, H. G., Scott, C., and Tewari, A. Mixture proportion estimation via kernel embedding of distributions. In ICML, pp. 2052–2060, 2016.
- Sakai et al. (2017) Sakai, T., du Plessis, M. C., Niu, G., and Sugiyama, M. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, pp. 2998–3006, 2017.
- Wagstaff et al. (2001) Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. Constrained k-means clustering with background knowledge. In ICML, pp. 577–584, 2001.
- Weinberger et al. (2005) Weinberger, K. Q., Blitzer, J., and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. In NIPS, pp. 1473–1480, 2005.
- Xing et al. (2002) Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning, with application to clustering with side-information. In NIPS, pp. 521–528, 2002.
- Yi et al. (2013) Yi, J., Zhang, L., Jin, R., Qian, Q., and Jain, A. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, pp. 1400–1408, 2013.
Appendix A Proof of Lemma 1
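From the assumption, $(x, x') \sim p_{\mathrm{S}}(x, x')$. In order to decompose pairwise data into pointwise, marginalize $p_{\mathrm{S}}(x, x')$ with respect to $x'$:
$$\int p_{\mathrm{S}}(x, x')\, \mathrm{d}x' = \frac{\pi_+^2\, p_+(x) \int p_+(x')\, \mathrm{d}x' + \pi_-^2\, p_-(x) \int p_-(x')\, \mathrm{d}x'}{\pi_+^2 + \pi_-^2} = \frac{\pi_+^2\, p_+(x) + \pi_-^2\, p_-(x)}{\pi_+^2 + \pi_-^2} = \tilde{p}_{\mathrm{S}}(x).$$
Since a pair $(x, x')$ is independently and identically drawn, both $x$ and $x'$ are drawn following $\tilde{p}_{\mathrm{S}}$.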
Appendix B Proof of Theorem 1
The classification risk (1) can be equivalently expressed as