1 Introduction
In ordinary supervised classification problems, each training pattern is equipped with a label specifying the class the pattern belongs to. Although supervised classifier training is effective, labeling training patterns is often expensive and time-consuming. For this reason, learning from less expensive data has been studied extensively over the last decades, including, but not limited to, semi-supervised learning semi ; zhu03icml ; zhou03nips ; grandvalet04nips ; belkin06jmlr ; mann07icml ; niu13icml ; li15tpami ; yang16icml ; kipf17iclr ; laine17iclr , learning from pairwise/triple-wise constraints pairwise_constraints ; goldberger04nips ; davis07icml ; weinberger09jmlr ; NC:Niu+etal:2014 , and positive-unlabeled learning denis98alt ; elkan08kdd ; ward09biometrics ; blanchard10jmlr ; punips ; puicml ; NIPS:Niu+etal:2016 ; kiryo17nips .

In this paper, we consider another weakly supervised classification scenario with less expensive data: instead of an ordinary class label, only a complementary label $\bar{y}$, which specifies a class that the pattern does not belong to, is available. If the number of classes is large, choosing the correct class label from many candidate classes is laborious, while choosing one of the incorrect class labels is much easier and thus less costly. In the binary classification setup, learning with complementary labels is equivalent to learning with ordinary labels, because complementary label $\bar{y}$ (i.e., not class $\bar{y}$) immediately identifies the ordinary label as the other class. On the other hand, in $K$-class classification for $K \ge 3$, complementary labels are less informative than ordinary labels, because complementary label $\bar{y}$ only means that the ordinary label is one of the remaining $K-1$ classes.
The complementary classification problem may be solved by the method of learning from partial labels partial , where multiple candidate class labels are provided to each training pattern: complementary label $\bar{y}$ can be regarded as an extreme case of partial labels given to all $K-1$ classes other than class $\bar{y}$. Another possibility to solve the complementary classification problem is to consider a multi-label setup multilabel2 , where each pattern can belong to multiple classes: complementary label $\bar{y}$ is translated into a negative label for class $\bar{y}$ and positive labels for the other $K-1$ classes.
Our contribution in this paper is to give a direct risk minimization framework for the complementary classification problem. More specifically, we consider a complementary loss that incurs a large loss if a predicted complementary label is not correct. We then show that the classification risk can be empirically estimated in an unbiased fashion if the complementary loss satisfies a certain symmetric condition; the sigmoid loss and the ramp loss (see Figure 1) are shown to satisfy this symmetric condition. Theoretically, we establish estimation error bounds for the proposed method, showing that learning from complementary labels is also consistent; the order of these bounds achieves the optimal parametric rate $\mathcal{O}_p(1/\sqrt{n})$, where $\mathcal{O}_p$ denotes the order in probability and $n$ is the number of complementarily labeled data.

We further show that our proposed complementary classification can be easily combined with ordinary classification, providing a highly data-efficient classification method. This combination method is particularly useful, e.g., when labels are collected through crowdsourcing crowdsourcing : usually, crowdworkers are asked to give a label to a pattern by selecting the correct class from the list of all candidate classes. This process is highly time-consuming when the number of classes is large. We may instead choose one of the classes randomly and ask crowdworkers whether a pattern belongs to the chosen class or not. Such a yes/no question can be answered much more easily and quickly than selecting the correct class out of a long list of candidates. The pattern is then treated as ordinarily labeled if the answer is yes; otherwise, the pattern is regarded as complementarily labeled.
Finally, we demonstrate the practical usefulness of the proposed methods through experiments.
2 Review of ordinary multi-class classification
Suppose that $d$-dimensional pattern $x \in \mathbb{R}^d$ and its class label $y \in \{1, \ldots, K\}$ are sampled independently from an unknown probability distribution with density $p(x, y)$. The goal of ordinary multi-class classification is to learn a classifier $f(x): \mathbb{R}^d \to \{1, \ldots, K\}$ that minimizes the classification risk with multi-class loss $\mathcal{L}(f(x), y)$:

$$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}\big[\mathcal{L}(f(x), y)\big], \qquad (1)$$

where $\mathbb{E}$ denotes the expectation. Typically, a classifier is assumed to take the following form:
$$f(x) = \mathop{\arg\max}_{y \in \{1, \ldots, K\}} g_y(x), \qquad (2)$$

where $g_y(x): \mathbb{R}^d \to \mathbb{R}$ is a binary classifier for class $y$ versus the rest. Then, together with a binary loss $\ell(z)$ that incurs a large loss for small $z$, the one-versus-all (OVA) loss^1 or the pairwise-comparison (PC) loss defined as follows are used as the multi-class loss ova :

$$\mathcal{L}_{\mathrm{OVA}}(f(x), y) = \ell(g_y(x)) + \frac{1}{K-1} \sum_{y' \neq y} \ell(-g_{y'}(x)), \qquad (3)$$

$$\mathcal{L}_{\mathrm{PC}}(f(x), y) = \sum_{y' \neq y} \ell(g_y(x) - g_{y'}(x)). \qquad (4)$$

^1 We normalize the "rest" loss by $1/(K-1)$ to be consistent with the discussion in the following sections.
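As a concrete illustration of (3) and (4), here is a minimal NumPy sketch. The sigmoid binary loss is used only as a placeholder for $\ell$, and the function names and example scores are ours:

```python
import numpy as np

def sigmoid_loss(z):
    # A binary loss that is large for small margins z (placeholder for ell).
    return 1.0 / (1.0 + np.exp(z))

def ova_loss(g, y, ell=sigmoid_loss):
    # One-versus-all loss (3): loss on the true class's margin plus the
    # "rest" losses on the negated margins, normalized by 1/(K-1).
    K = len(g)
    rest = sum(ell(-g[yp]) for yp in range(K) if yp != y)
    return ell(g[y]) + rest / (K - 1)

def pc_loss(g, y, ell=sigmoid_loss):
    # Pairwise-comparison loss (4): penalizes small differences g_y - g_{y'}.
    K = len(g)
    return sum(ell(g[y] - g[yp]) for yp in range(K) if yp != y)

g = np.array([2.0, -1.0, 0.5])  # hypothetical per-class scores g_y(x)
```

With these scores, both losses are smaller for the top-scoring class than for the others, matching the classifier form (2).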
Finally, the expectation over the unknown density $p(x, y)$ in Eq. (1) is empirically approximated using training samples, which gives a practical classification formulation.
3 Classification from complementary labels
In this section, we formulate the problem of complementary classification and propose a risk minimization framework.
We consider the situation where, instead of ordinary class label $y$, we are given only complementary label $\bar{y}$, which specifies a class that pattern $x$ does not belong to. Our goal is still to learn a classifier that minimizes the classification risk (1), but only from complementarily labeled training samples $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$. We assume that $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$ are drawn independently from an unknown probability distribution with density^2

$$\bar{p}(x, \bar{y}) = \frac{1}{K-1} \sum_{y \neq \bar{y}} p(x, y). \qquad (5)$$

^2 The coefficient $1/(K-1)$ is for the normalization purpose: it would be natural to assume $\bar{p}(x, \bar{y}) \propto \sum_{y \neq \bar{y}} p(x, y)$, since all $y \neq \bar{y}$ equally contribute to $\bar{p}(x, \bar{y})$; in order to ensure that $\bar{p}(x, \bar{y})$ is a valid joint density such that $\sum_{\bar{y}=1}^{K} \int \bar{p}(x, \bar{y}) \, dx = 1$, we must take the coefficient $1/(K-1)$.
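The normalization claim in the footnote can be checked numerically on a toy discrete joint distribution (a sketch; the 4×3 probability table is arbitrary):

```python
import numpy as np

# A toy discrete joint distribution p(x, y) over 4 patterns and K = 3 classes
# (rows: x, columns: y); any nonnegative table summing to 1 works.
rng = np.random.default_rng(0)
p = rng.random((4, 3))
p /= p.sum()
K = p.shape[1]

# Complementary-label density (5): sum p(x, y) over y != ybar, divided by K-1.
p_bar = (p.sum(axis=1, keepdims=True) - p) / (K - 1)

assert np.isclose(p_bar.sum(), 1.0)                   # valid joint density
assert np.allclose(p_bar.sum(axis=1), p.sum(axis=1))  # marginal of x unchanged
```

The second check shows that the marginal density of $x$ is the same under $p$ and $\bar{p}$, a fact used later when relating the two Rademacher complexities.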
Let us consider a complementary loss $\bar{\mathcal{L}}(f(x), \bar{y})$ for a complementarily labeled sample $(x, \bar{y})$. Then we have the following theorem, which allows unbiased estimation of the classification risk from complementarily labeled samples:
Theorem 1.
The classification risk (1) can be expressed as

$$R(f) = (K-1) \, \mathbb{E}_{(x, \bar{y}) \sim \bar{p}(x, \bar{y})}\big[\bar{\mathcal{L}}(f(x), \bar{y})\big] - M_1 + M_2, \qquad (6)$$

if there exist constants $M_1, M_2 \geq 0$ such that for all $x$ and $y$, the complementary loss satisfies

$$\sum_{\bar{y}=1}^{K} \bar{\mathcal{L}}(f(x), \bar{y}) = M_1 \quad \text{and} \quad \mathcal{L}(f(x), y) + \bar{\mathcal{L}}(f(x), y) = M_2. \qquad (7)$$
Proof.
Using the density (5) and then the two constraints in (7), we have

$$(K-1) \, \mathbb{E}_{\bar{p}}\big[\bar{\mathcal{L}}(f(x), \bar{y})\big] = \mathbb{E}_{p(x, y)}\Big[\sum_{\bar{y} \neq y} \bar{\mathcal{L}}(f(x), \bar{y})\Big] = \mathbb{E}_{p(x, y)}\big[M_1 - \bar{\mathcal{L}}(f(x), y)\big] = M_1 - M_2 + R(f),$$

which is equivalent to (6). ∎
The first constraint in (7) can be regarded as a multi-class version of a symmetric constraint that we later use in Theorem 2. The second constraint in (7) means that the smaller $\mathcal{L}(f(x), y)$ is, the larger $\bar{\mathcal{L}}(f(x), y)$ should be; i.e., if "pattern $x$ belongs to class $y$" is correct, then "pattern $x$ does not belong to class $y$" should be incorrect.
With the expression (6), the classification risk (1) can be naively approximated in an unbiased fashion by the sample average as

$$\widehat{R}(f) = \frac{K-1}{n} \sum_{i=1}^{n} \bar{\mathcal{L}}(f(x_i), \bar{y}_i) - M_1 + M_2. \qquad (8)$$
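Theorem 1 can be verified exactly on a toy discrete domain. The sketch below uses the PC complementary loss with the sigmoid binary loss, for which condition (7) holds with $M_1 = K(K-1)/2$ and $M_2 = K-1$ (these constants follow directly from the symmetric property $\ell(z) + \ell(-z) = 1$ of the sigmoid loss); all names and sizes are illustrative:

```python
import numpy as np

def ell(z):  # sigmoid loss, satisfies ell(z) + ell(-z) = 1
    return 1.0 / (1.0 + np.exp(z))

def pc_loss(g, y):          # ordinary PC loss (4)
    return sum(ell(g[y] - g[yp]) for yp in range(len(g)) if yp != y)

def pc_comp_loss(g, ybar):  # complementary PC loss (10)
    return sum(ell(g[y] - g[ybar]) for y in range(len(g)) if y != ybar)

rng = np.random.default_rng(1)
K, n_x = 4, 5
p = rng.random((n_x, K)); p /= p.sum()                # joint density p(x, y)
p_bar = (p.sum(axis=1, keepdims=True) - p) / (K - 1)  # density (5)
g = rng.standard_normal((n_x, K))                     # arbitrary classifier scores

# Exact ordinary risk and exact complementary expectation over the finite domain.
R = sum(p[i, y] * pc_loss(g[i], y) for i in range(n_x) for y in range(K))
E_bar = sum(p_bar[i, yb] * pc_comp_loss(g[i], yb)
            for i in range(n_x) for yb in range(K))
M1, M2 = K * (K - 1) / 2, K - 1
assert np.isclose(R, (K - 1) * E_bar - M1 + M2)  # identity (6) of Theorem 1
```

Replacing the exact expectation `E_bar` by the sample average over complementarily labeled data gives precisely the estimator (8).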
Let us define the complementary losses corresponding to the OVA loss and the PC loss as

$$\bar{\mathcal{L}}_{\mathrm{OVA}}(f(x), \bar{y}) = \frac{1}{K-1} \sum_{y \neq \bar{y}} \ell(g_y(x)) + \ell(-g_{\bar{y}}(x)), \qquad (9)$$

$$\bar{\mathcal{L}}_{\mathrm{PC}}(f(x), \bar{y}) = \sum_{y \neq \bar{y}} \ell(g_y(x) - g_{\bar{y}}(x)). \qquad (10)$$
Then we have the following theorem (its proof is given in Appendix A):

Theorem 2.
If the binary loss satisfies the symmetric condition

$$\ell(z) + \ell(-z) = 1, \qquad (11)$$

then $\bar{\mathcal{L}}_{\mathrm{OVA}}$ satisfies condition (7) with $M_1 = K$ and $M_2 = 2$, and $\bar{\mathcal{L}}_{\mathrm{PC}}$ satisfies condition (7) with $M_1 = K(K-1)/2$ and $M_2 = K-1$. For example, the sigmoid loss $\ell_{\mathrm{S}}(z) = 1/(1 + e^{z})$ (13) and the ramp loss $\ell_{\mathrm{R}}(z) = \frac{1}{2} \max(0, \min(2, 1-z))$ (14) satisfy the symmetric condition (11).
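The symmetric condition of Theorem 2 can be checked numerically. The sketch below verifies $\ell(z) + \ell(-z) = 1$ for the sigmoid and ramp losses on a grid, and checks the first constraint of (7) for the OVA complementary loss at random scores (loss definitions follow the forms stated in the text):

```python
import numpy as np

def sigmoid_loss(z):
    # Sigmoid loss: ell(z) = 1 / (1 + e^z).
    return 1.0 / (1.0 + np.exp(z))

def ramp_loss(z):
    # Ramp loss: ell(z) = (1/2) * max(0, min(2, 1 - z)).
    return 0.5 * np.clip(1.0 - z, 0.0, 2.0)

z = np.linspace(-5.0, 5.0, 201)
for ell in (sigmoid_loss, ramp_loss):
    assert np.allclose(ell(z) + ell(-z), 1.0)  # symmetric condition (11)

# First constraint of (7) for the OVA complementary loss with a symmetric ell:
# summing over all K complementary labels gives the constant M1 = K.
rng = np.random.default_rng(0)
K = 5
g = rng.standard_normal(K)
ova_bar = lambda yb: (sum(sigmoid_loss(g[y]) for y in range(K) if y != yb) / (K - 1)
                      + sigmoid_loss(-g[yb]))
assert np.isclose(sum(ova_bar(yb) for yb in range(K)), K)
```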
4 Estimation error bounds
In this section, we establish the estimation error bounds for the proposed method.
Let $\mathcal{F}$ be a function class for empirical risk minimization and $\sigma_1, \ldots, \sigma_n$ be Rademacher variables; then the Rademacher complexity of $\mathcal{F}$ for $X = \{x_1, \ldots, x_n\}$ of size $n$ drawn from $p(x)$ is defined as follows (mohri12FML, ):

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_X \mathbb{E}_{\sigma} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \Big].$$

Similarly, define the Rademacher complexity $\bar{\mathfrak{R}}_n(\mathcal{F})$ of $\mathcal{F}$ for $\bar{X}$ of size $n$ drawn from $\bar{p}(x)$. Note that $\bar{p}(x) = p(x)$ and thus $\bar{\mathfrak{R}}_n(\mathcal{F}) = \mathfrak{R}_n(\mathcal{F})$, which enables us to express the obtained theoretical results using the standard Rademacher complexity $\mathfrak{R}_n(\mathcal{F})$.
To begin with, let $\tilde{\ell}(z) = \ell(z) - 1/2$ be the shifted loss, so that $\tilde{\ell}(0) = 0$ (in order to apply Talagrand's contraction lemma (ledoux91PBS, ) later), and let the shifted complementary losses be defined following (9) and (10) but with $\tilde{\ell}$ instead of $\ell$; let $L_\ell$ be any (not necessarily the best) Lipschitz constant of $\ell$. Define the corresponding function classes as follows:
Then we can obtain the following lemmas (their proofs are given in Appendices B and C):
Lemma 3.
Let $\bar{\mathfrak{R}}_n(\bar{\mathcal{G}}_{\mathrm{OVA}})$ be the Rademacher complexity of the OVA function class for $\bar{X}$ of size $n$ drawn from $\bar{p}(x)$, defined as
Then,
Lemma 4.
Let $\bar{\mathfrak{R}}_n(\bar{\mathcal{G}}_{\mathrm{PC}})$ be the Rademacher complexity of the PC function class, defined similarly to $\bar{\mathfrak{R}}_n(\bar{\mathcal{G}}_{\mathrm{OVA}})$. Then,
Based on Lemmas 3 and 4, we can derive the uniform deviation bounds of the proposed risk estimators as follows (the proof is given in Appendix D):
Lemma 5.
For any $\delta > 0$, with probability at least $1 - \delta$,

where $\widehat{R}_{\mathrm{OVA}}(f)$ is the empirical risk (8) computed with $\bar{\mathcal{L}}_{\mathrm{OVA}}$, and

where $\widehat{R}_{\mathrm{PC}}(f)$ is the empirical risk (8) computed with $\bar{\mathcal{L}}_{\mathrm{PC}}$.
Let $f^*$ be the true risk minimizer and $\hat{f}$ be the empirical risk minimizer, i.e., $f^* = \arg\min_{f \in \mathcal{F}} R(f)$ and $\hat{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}(f)$, where $\widehat{R}(f)$ denotes either $\widehat{R}_{\mathrm{OVA}}(f)$ or $\widehat{R}_{\mathrm{PC}}(f)$.
Finally, based on Lemma 5, we can establish the estimation error bounds as follows:
Theorem 6.
For any $\delta > 0$, with probability at least $1 - \delta$, the estimation error $R(\hat{f}) - R(f^*)$ is bounded by a quantity of order $\mathfrak{R}_n(\mathcal{F}) + \sqrt{\ln(1/\delta)/n}$ if $\hat{f}$ is trained by minimizing $\widehat{R}_{\mathrm{OVA}}(f)$, and analogously if $\hat{f}$ is trained by minimizing $\widehat{R}_{\mathrm{PC}}(f)$.
Proof.
Based on Lemma 5, the estimation error bounds can be proven through

$$R(\hat{f}) - R(f^*) = \big(R(\hat{f}) - \widehat{R}(\hat{f})\big) + \big(\widehat{R}(\hat{f}) - \widehat{R}(f^*)\big) + \big(\widehat{R}(f^*) - R(f^*)\big) \leq 2 \sup_{f \in \mathcal{F}} \big|\widehat{R}(f) - R(f)\big|,$$

where we used the fact that $\widehat{R}(\hat{f}) \leq \widehat{R}(f^*)$ by the definition of $\hat{f}$. ∎
Theorem 6 also guarantees that learning from complementary labels is consistent: $R(\hat{f}) \to R(f^*)$ as $n \to \infty$. Consider a linear-in-parameter model defined by

$$\mathcal{F} = \big\{ f(x) = \langle w, \phi(x) \rangle_{\mathcal{H}} \;\big|\; \|w\|_{\mathcal{H}} \leq C_w, \; \|\phi(x)\|_{\mathcal{H}} \leq C_\phi \big\},$$

where $\mathcal{H}$ is a Hilbert space with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, $w \in \mathcal{H}$ is a normal vector, $\phi: \mathbb{R}^d \to \mathcal{H}$ is a feature map, and $C_w > 0$ and $C_\phi > 0$ are constants (scholkopf01LK, ). It is known that $\mathfrak{R}_n(\mathcal{F}) \leq C_w C_\phi / \sqrt{n}$ (mohri12FML, ) and thus the estimation error bounds decay in $\mathcal{O}_p(1/\sqrt{n})$ if this $\mathcal{F}$ is used, where $\mathcal{O}_p$ denotes the order in probability. This order is already the optimal parametric rate and cannot be improved without additional strong assumptions on $\bar{p}(x, \bar{y})$, $\bar{\mathcal{L}}$, and $\mathcal{F}$ jointly.
5 Incorporation of ordinary labels
In many practical situations, we may also have ordinarily labeled data in addition to complementarily labeled data. In such cases, we want to leverage both kinds of labeled data to obtain more accurate classifiers. To this end, motivated by sakai17icml , let us consider a convex combination of the classification risks derived from ordinarily labeled data and complementarily labeled data:
$$R_\alpha(f) = \alpha R(f) + (1 - \alpha) \bar{R}(f), \qquad (15)$$

where $\alpha \in [0, 1]$ is a hyperparameter that interpolates between the two risks, $R(f)$ is the classification risk (1), and $\bar{R}(f)$ denotes its expression (6) based on complementarily labeled data. The combined risk (15) can be naively approximated by the sample averages as

$$\widehat{R}_\alpha(f) = \frac{\alpha}{m} \sum_{j=1}^{m} \mathcal{L}(f(x_j), y_j) + (1 - \alpha) \Big( \frac{K-1}{n} \sum_{i=1}^{n} \bar{\mathcal{L}}(f(x_i), \bar{y}_i) - M_1 + M_2 \Big), \qquad (16)$$

where $\{(x_j, y_j)\}_{j=1}^{m}$ are ordinarily labeled data and $\{(x_i, \bar{y}_i)\}_{i=1}^{n}$ are complementarily labeled data.
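A minimal sketch of the combined estimator (16), assuming the per-sample losses have already been computed (function and variable names are ours, not from any library):

```python
import numpy as np

def combined_risk(L_ord, L_comp, alpha, K, M1, M2):
    """Empirical version (16) of the convex combination (15).

    L_ord:  per-sample ordinary losses  L(f(x_j), y_j),     shape (m,)
    L_comp: per-sample complementary losses  L_bar(f(x_i), ybar_i), shape (n,)
    """
    R_ord = np.mean(L_ord)
    R_comp = (K - 1) * np.mean(L_comp) - M1 + M2  # complementary estimator (8)
    return alpha * R_ord + (1 - alpha) * R_comp

# Hypothetical loss values; for K = 3 with the PC loss, M1 = 3 and M2 = 2.
L_ord = np.array([0.2, 0.4])
L_comp = np.array([0.1, 0.3, 0.5])
r = combined_risk(L_ord, L_comp, alpha=0.5, K=3, M1=3.0, M2=2.0)
```

Setting `alpha=1.0` recovers the ordinary empirical risk, and `alpha=0.0` recovers the complementary estimator (8).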
As explained in the introduction, we can naturally obtain both ordinarily and complementarily labeled data through crowdsourcing crowdsourcing . Our risk estimator (16) can utilize both kinds of labeled data to obtain better classifiers.^3 We will experimentally demonstrate the usefulness of this combination method in Section 6.

^3 Note that when pattern $x$ has already been equipped with ordinary label $y$, giving complementary label $\bar{y}$ does not bring us any additional information (unless the ordinary label is noisy).
6 Experiments
In this section, we experimentally evaluate the performance of the proposed methods.
6.1 Comparison of different losses
Here we first compare the performance among four variations of the proposed method with different loss functions: OVA (9) and PC (10), each with the sigmoid loss (13) and the ramp loss (14). We used the MNIST handwritten digit dataset, downloaded from the website of the late Sam Roweis^4 (with all patterns standardized to have zero mean and unit variance), with different numbers of classes: 3 classes (digits "1" to "3") to 10 classes (digits "1" to "9" and "0"). From each class, we randomly sampled 500 data for training and 500 data for testing, and generated complementary labels by randomly selecting one of the complementary classes. From the training dataset, we left out 25% of the data for validating hyperparameters based on (8) with the zero-one loss plugged into (9) or (10).

For all the methods, we used a linear-in-input model $g_y(x) = w_y^\top x + b_y$ as the binary classifier, where $\top$ denotes the transpose, $w_y$ is the weight parameter, and $b_y$ is the bias parameter for class $y$. We added an $\ell_2$-regularization term, with the regularization parameter chosen from a fixed candidate set. Adam adam was used for optimization with 5,000 iterations, with mini-batch size 100. We reported the test accuracy of the model with the best validation score out of all iterations. All experiments were carried out with Chainer chainer .

^4 See http://cs.nyu.edu/~roweis/data.html.
We reported means and standard deviations of the classification accuracy over five trials in Table 1. From the results, we can see that the performance of all four methods deteriorates as the number of classes increases. This is intuitive because the supervised information contained in complementary labels becomes weaker as the number of classes grows.

The table also shows that there is no significant difference in classification accuracy among the four losses. Since the PC formulation is regarded as a more direct approach for classification vapnik (it takes the sign of the difference of the classifiers, instead of the sign of each classifier as in OVA) and the sigmoid loss is smooth, we use PC with the sigmoid loss as a representative of our proposed method in the following experiments.
Method  3 cls  4 cls  5 cls  6 cls  7 cls  8 cls  9 cls  10 cls 

OVA Sigmoid  95.2 (0.9)  91.4 (0.5)  87.5 (2.2)  82.0 (1.3)  74.5 (2.9)  73.9 (1.2)  63.6 (4.0)  57.2 (1.6) 
OVA Ramp  95.1 (0.9)  90.8 (1.0)  86.5 (1.8)  79.4 (2.6)  73.9 (3.9)  71.4 (4.0)  66.1 (2.1)  56.1 (3.6) 
PC Sigmoid  94.9 (0.5)  90.9 (0.8)  88.1 (1.8)  80.3 (2.5)  75.8 (2.5)  72.9 (3.0)  65.0 (3.5)  58.9 (3.9) 
PC Ramp  94.5 (0.7)  90.8 (0.5)  88.0 (2.2)  81.0 (2.2)  74.0 (2.3)  71.4 (2.4)  69.0 (2.8)  57.3 (2.0) 
Best and equivalent methods (with 5% t-test) are highlighted in boldface.
Dataset  Class  Dim  # train  # test  PC/S  PL  ML 
WAVEFORM1  1–3  21  1226  398  
WAVEFORM2  1–3  40  1227  408  
SATIMAGE  1–7  36  415  211  
PENDIGITS  1–5  16  719  336  
6–10  719  335  
even #  719  336  
odd #  719  335  
1–10  719  335  
DRIVE  1–5  48  3955  1326  
6–10  3923  1313  
even #  3925  1283  
odd #  3939  1278  
1–10  3925  1269  
LETTER  1–5  16  565  171  
6–10  550  178  
11–15  556  177  
16–20  550  184  
21–25  585  167  
1–25  550  167  
USPS  1–5  256  652  166  
6–10  542  147  
even #  556  147  
odd #  542  147  
1–10  542  127
6.2 Benchmark experiments
Next, we compare our proposed method, PC with the sigmoid loss (PC/S), with two baseline methods. The first baseline is one of the state-of-the-art partial label (PL) methods partial with the squared hinge loss:^5

^5 We decided to use the squared hinge loss (which is convex) here, since it was reported to work well in the original paper partial .
The second baseline is a multi-label (ML) method multilabel2 , where every complementary label $\bar{y}$ is translated into a negative label for class $\bar{y}$ and positive labels for the other $K-1$ classes. This yields the corresponding multi-label classification loss, where we used the same sigmoid loss as in the proposed method for the binary loss $\ell$.

We used a one-hidden-layer neural network with rectified linear units (ReLU) relu as activation functions, and weight-decay candidates were chosen from a fixed candidate set. Standardization, validation, and optimization details follow the previous experiments.

We evaluated the classification performance on the following benchmark datasets: WAVEFORM1, WAVEFORM2, SATIMAGE, PENDIGITS, DRIVE, LETTER, and USPS. USPS can be downloaded from the website of the late Sam Roweis^6, and all other datasets can be downloaded from the UCI machine learning repository^7. We tested several different settings of class labels, with an equal number of data in each class.

In Table 2, we summarized the specification of the datasets and reported the means and standard deviations of the classification accuracy over 10 trials. From the results, we can see that the proposed method is either comparable to or better than the baseline methods on many of the datasets.

^6 See http://cs.nyu.edu/~roweis/data.html.
^7 See http://archive.ics.uci.edu/ml/.
6.3 Combination of ordinary and complementary labels
Dataset  Class  Dim  # train  # test  OL  CL  OL & CL 
($\alpha = 1$)  ($\alpha = 0$)  ($0 < \alpha < 1$)  
WAVEFORM1  1–3  21  413/826  408  
WAVEFORM2  1–3  40  411/821  411  
SATIMAGE  1–7  36  69/346  211  
PENDIGITS  1–5  16  144/575  336  
6–10  144/575  335  
even #  144/575  336  
odd #  144/575  335  
1–10  72/647  335  
DRIVE  1–5  48  780/3121  1305  
6–10  795/3180  1290  
even #  657/3284  1314  
odd #  790/3161  1255  
1–10  397/3570  1292  
LETTER  1–5  16  113/452  171  
6–10  110/440  178  
11–15  111/445  177  
16–20  110/440  184  
21–25  117/468  167  
1–25  22/528  167  
USPS  1–5  256  130/522  166  
6–10  108/434  147  
even #  108/434  166  
odd #  111/445  147  
1–10  54/488  147
Finally, we demonstrate the usefulness of combining ordinarily and complementarily labeled data. We used (16), with hyperparameter $\alpha$ fixed for simplicity. We divided our training dataset into two subsets, where one subset was labeled ordinarily while the other was labeled complementarily.^8 From the training dataset, we left out 25% of the data for validating hyperparameters based on the zero-one loss version of (16). Other details, such as standardization, the model and optimization, and weight-decay candidates, follow the previous experiments.

We compared three methods: the ordinary label (OL) method corresponding to $\alpha = 1$, the complementary label (CL) method corresponding to $\alpha = 0$, and the combination (OL & CL) method with $0 < \alpha < 1$. The PC and sigmoid losses were commonly used for all methods.

We reported the means and standard deviations of the classification accuracy over 10 trials in Table 3. From the results, we can see that OL & CL tends to outperform OL and CL, demonstrating the usefulness of combining ordinarily and complementarily labeled data.

^8 We used $K-1$ times more complementarily labeled data than ordinarily labeled data, since a single ordinary label corresponds to $K-1$ complementary labels.
7 Conclusions
We proposed a novel problem setting called learning from complementary labels, and showed that an unbiased estimator of the classification risk can be obtained only from complementarily labeled data if the loss function satisfies a certain symmetric condition. Our risk estimator can easily be minimized by any stochastic optimization algorithm such as Adam adam , allowing large-scale training. We theoretically established estimation error bounds for the proposed method and proved that the proposed method achieves the optimal parametric rate. We further showed that our proposed complementary classification can easily be combined with ordinary classification. Finally, we experimentally demonstrated the usefulness of the proposed methods.
The formulation of learning from complementary labels may also be useful in the context of privacy-aware machine learning privacy : a subject may hesitate to answer private questions directly, e.g., in psychological counseling. In such a situation, providing a complementary label, i.e., one of the incorrect answers to the question, would be mentally less demanding. We will investigate this issue in the future.
It is noteworthy that the symmetric condition (11), which the loss should satisfy in our complementary classification framework, also appears in other weakly supervised learning formulations, e.g., in positive-unlabeled learning punips . It would be interesting to investigate the role of this symmetric condition more closely to gain further insight into these different weakly supervised learning problems.
Acknowledgements
GN and MS were supported by JST CREST JPMJCR1403. We thank Ikko Yamane for the helpful discussions.
References
 [1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

 [2] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.
 [3] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
 [4] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
 [5] T. Cour, B. Sapp, and B. Taskar. Learning from partial labels. Journal of Machine Learning Research, 12:1501–1536, 2011.
 [6] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML, 2007.
 [7] F. Denis. PAC learning from positive statistical queries. In ALT, 1998.
 [8] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In NIPS, 2014.
 [9] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
 [10] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
 [11] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.
 [12] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
 [13] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
 [14] J. Howe. Crowdsourcing: Why the power of the crowd is driving the future of business. Crown Publishing Group, 2009.
 [15] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [16] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 [17] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NIPS, 2017.
 [18] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
 [19] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
 [20] Y.F. Li and Z.H. Zhou. Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):175–188, 2015.
 [21] G. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In ICML, 2007.
 [22] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
 [23] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

 [24] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
 [25] G. Niu, B. Dai, M. Yamada, and M. Sugiyama. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Computation, 26(8):1717–1762, 2014.
 [26] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NIPS, 2016.
 [27] G. Niu, W. Jitkrittum, B. Dai, H. Hachiya, and M. Sugiyama. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, 2013.
 [28] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, 2017.
 [29] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.

 [30] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems in NIPS, 2015.
 [31] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
 [32] G. Ward, T. Hastie, S. Barry, J. Elith, and J. Leathwick. Presenceonly data and the EM algorithm. Biometrics, 65(2):554–563, 2009.
 [33] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
 [34] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with sideinformation. In NIPS, 2002.
 [35] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semisupervised learning with graph embeddings. In ICML, 2016.
 [36] T. Zhang. Statistical analysis of some multicategory large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.
 [37] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2003.
 [38] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
Appendix A Proof of Theorem 2
Appendix B Proof of Lemma 3
By definition, so that
After rewriting, we see that
and subsequently,
due to the subadditivity of the supremum.
The first term is independent of $f$ and thus