Positive and unlabeled learning (PU learning) aims at learning from only positive and unlabeled examples, without explicit exposure to negative examples. This setting arises in many practical application scenarios: retrieving information with limited feedback [Onoda et al.2005], text classification where only positive labels are collected [Yu et al.2003], and detecting areas of interest in images where normal samples are available but abnormal samples are scarce and diverse [Zuluaga et al.2011, Li et al.2011]. It is also widely applicable in industrial scenarios such as content censorship [Ren et al.2014, Li et al.2014], disease gene detection [Yang et al.2012] and drug discovery [Liu et al.2017].
Positive labels are considered perfect in most of the literature, while the unlabeled data are not, and they are therefore handled in different ways. The first category of methods tries to identify negative samples in the unlabeled data and converts the problem back to positive-negative classification [Liu et al.2002, Li and Liu2003]. The heuristic strategies in these methods act as external information for recognizing negative samples. However, these strategies often rely heavily on subtle design choices for a single task or dataset, which results in low transferability. The second category takes the unlabeled data as corrupted negative samples. Early approaches attempt to reweight the unlabeled data [Liu et al.2003, Lee and Liu2003] with a smaller penalty per sample, but their performance is upper bounded due to their intrinsic bias, as proved by du Plessis et al. [du Plessis et al.2014, du Plessis et al.2015b], who later developed an approach called uPU that uses a non-convex loss to cancel the bias. This work is extended by nnPU [Kiryo et al.2017] to avoid overfitting by preventing risk estimators from reaching negative values. Hou et al. [Hou et al.2017] further argue that overfitting is still an issue with flexible deep neural networks. They proposed GenPU, a generative adversarial approach, to address the challenge of limited positive data, in which they train two discriminators: one tells fake generated examples from true ones, and the other assigns positive labels to the generated examples that are similar to the positive class.
A common key point behind all the aforementioned solutions is that they try to recover the true distribution of positive and negative data and thus recover the true risk. However, they perform risk rectification at the outcome-of-loss-function level, which is the main cause of the inaccuracy, according to our elaboration in section 2.2. In this paper, we propose a novel method called “Collectively loss function to learn from Positive and Unlabeled data” (cPU) to rectify the predictor instead of the total risk. We collectively gather predictions from predictors and rectify them before the calculation of the loss function. We design our method with the following principles in mind:
Minimum intervention. The only difference between the PU learning setting and regular positive/negative learning is that the negative data are not explicitly labeled. Therefore, we hope that only minimal feature construction is required. Hence, we only process at the prediction level and leave the feature engineering part to the powerful representations of the models (especially neural networks) themselves.
Robustness. Due to the class uncertainty in the unlabeled data, it is demanding to estimate the class prior of the unlabeled data accurately. As a result, we take the collective prediction to balance the randomness of mini-batches.
Our main contributions are threefold. Firstly, we provide an unbiased approach to estimating the posterior probability in the PU learning setting, which is in harmony with very flexible models and learns on a large scale. Secondly, we propose a general framework for studying the behavior of loss functions via elicitation. Thirdly, we derive the collective loss function to rectify the decision boundary drift and theoretically bound the generalization error. We conduct comprehensive experiments in comparison with state-of-the-art approaches.
2 Problem Statement
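Consider the input space $\mathcal{X}$ and label space $\mathcal{Y} = \{0, 1\}$. We denote by $p(x, y)$ the joint distribution over $\mathcal{X} \times \mathcal{Y}$. Let $f: \mathcal{X} \to \mathbb{R}$ be a predictor function and $h(x) = \lfloor \sigma(f(x)) \rceil$ be the classifier, where $\lfloor \cdot \rceil$ means rounding to the nearest integer and $\sigma$ is a sigmoid function. For example, $\sigma(z) = 1 / (1 + e^{-z})$. Let $\ell(f(x), y)$ be the loss function for binary classification. According to statistical learning theory [Vapnik1999], the risk of $f$ is defined by
$$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)}\left[\ell(f(x), y)\right]. \quad (1)$$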
2.1 Conditional Risk for PN Learning
Let denote the posterior probability of the positive class. For clarity, we omit the argument in expressions such as and throughout the paper. The risk decomposes to the conditional form
where . is the marginal distribution of . We expect Fisher consistency [Lin2004] (aka classification-calibration in some literatures) on the predictors, which is a very weak condition. To be specific, if the risk in (2) is minimized, the following equation holds.
An example loss function is the zero-one loss:
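As a concrete illustration of the definitions in this section, the following minimal sketch computes the classifier output and the empirical zero-one risk for a handful of scores. It assumes labels in {0, 1} and a logistic sigmoid; the function names (`sigmoid`, `classify`, `zero_one_loss`, `empirical_risk`) are hypothetical and chosen only for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(scores):
    """Round sigmoid(f(x)) to the nearest integer, giving a label in {0, 1}."""
    return np.rint(sigmoid(scores)).astype(int)

def zero_one_loss(scores, labels):
    """Per-sample zero-one loss: 1 where the predicted label is wrong, else 0."""
    return (classify(scores) != labels).astype(float)

def empirical_risk(scores, labels):
    """Average of the per-sample loss, an empirical estimate of the risk."""
    return zero_one_loss(scores, labels).mean()

scores = np.array([2.0, -1.0, 0.5, -3.0])   # predictor outputs f(x)
labels = np.array([1, 0, 0, 0])             # true labels
print(classify(scores), empirical_risk(scores, labels))
```

Only the third sample is misclassified here, so the empirical risk is one error out of four samples.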
2.2 Risk Estimators for PU Learning
Due to the absence of negative samples in PU learning, risks have to be estimated from only positive and unlabeled samples. In other words, need to be derived via risks from and . Formally, a PU learning system receives training samples from , which can be divided into two not-necessarily-independent components. The labeled positive samples: and unlabeled samples: . Underlyingly, the unlabeled set consists of positive samples and negative samples . As a popular practice, negative labels are assigned to the unlabeled samples [du Plessis et al.2015b, Kiryo et al.2017]. Let be the risk of unlabeled samples, which are drawn from the same distribution of , with loss function assigning all the labels to negative. Unbiased PU (aka uPU) learning methods [du Plessis et al.2014, du Plessis et al.2015b] attempt to estimate the risk for PU learning via subtracting the wrongly included risk :
Non-negative PU (aka nnPU) [Kiryo et al.2017] observed that should be always non-negative. However, this does not always hold, especially when the model becomes flexible (i.e., deep neural networks). To mitigate this drawback, they propose a non-negative risk estimator, thus ensuring the risk will not reach negative values:
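To make the difference between the two estimators tangible, here is a minimal NumPy sketch of their commonly stated empirical forms: uPU combines the positive risk, the unlabeled-as-negative risk, and a subtracted correction term, while nnPU clamps the corrected negative part at zero. The sigmoid loss, the function names, and the argument `prior` (the class prior of positives) are assumptions of this sketch rather than notation from this paper.

```python
import numpy as np

def sigmoid_loss(scores, label):
    """Sigmoid loss: small when the score agrees with the label (1 or 0)."""
    sign = 1.0 if label == 1 else -1.0
    return 1.0 / (1.0 + np.exp(sign * scores))

def _risk_parts(scores_p, scores_u, loss):
    r_p_pos = loss(scores_p, 1).mean()  # positives treated as positive
    r_p_neg = loss(scores_p, 0).mean()  # positives treated as negative
    r_u_neg = loss(scores_u, 0).mean()  # unlabeled treated as negative
    return r_p_pos, r_p_neg, r_u_neg

def upu_risk(scores_p, scores_u, prior, loss=sigmoid_loss):
    """Unbiased PU risk: may become negative for flexible models."""
    r_p_pos, r_p_neg, r_u_neg = _risk_parts(scores_p, scores_u, loss)
    return prior * r_p_pos + r_u_neg - prior * r_p_neg

def nnpu_risk(scores_p, scores_u, prior, loss=sigmoid_loss):
    """Non-negative PU risk: the estimated negative-class part is clamped at 0."""
    r_p_pos, r_p_neg, r_u_neg = _risk_parts(scores_p, scores_u, loss)
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)

# An overconfident model: high scores on positives, low scores on all unlabeled.
scores_p = np.array([5.0, 5.0])
scores_u = np.array([-5.0, -5.0])
print(upu_risk(scores_p, scores_u, prior=0.5))   # negative value
print(nnpu_risk(scores_p, scores_u, prior=0.5))  # clamped, non-negative
```

In the overconfident case above the subtracted term dominates, so the unbiased estimate drops below zero while the clamped estimate does not.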
Nevertheless, minimizing these risk estimators will lead to insufficient penalty for the negative samples. Maximizing , instead of making small, will result in the same effect of minimizing the total risk , which is also natural side effect of minimizing for flexible models and convex surrogate loss functions. Risk estimators are rectification at the outcome-of-loss-function level, which cannot avoid the explosion (i.e., in some worst case an unbounded loss may reach very large value [Kiryo et al.2017]) of some surrogate losses, such as the popular logarithm loss. In these cases, the flexible model overfits the training data well and sampling may include some easy positive examples which sum up to large that overwhelms . Thereupon, can we remedy the problem before loss function?
3 Collective Logarithm Loss for PU Learning
In this section, we firstly address the decision boundary drift problem in section 3.1 and provide rectification of the predictor. Then in section 3.2, we introduce the background of elicitation and how it connects to the design of loss function in normal situations. Finally, in section 3.3, we describe the framework of eliciting the loss function under PU learning setting.
3.1 Rectification of Predictor
To satisfy the Fisher consistency, we hope (3) hold during test, while (6) is biased in general [Kiryo et al.2017], hence leading to biased solutions. A key observation is that the decision boundary is different between training and testing for PU learning problem. Our aim is to rectify the decision boundary so that the classifier for testing also fits the positive and unlabeled train data. We introduce to denote true labels. remains the observed labels, where unlabeled samples are regarded as negative . Let be the posterior probability of testing, namely what we hope to capture by learning. In , the underlying true labels for and for . It is evident that the data distribution for training and testing is different. Let and be the posterior probability for training and testing. Denote by the total sample space, we can estimate these two expectations by the following equations in empirical estimation.
We denote the value in Eqn. (9) as for the rest of the paper. We hope . However based on PU training data, the model may converge biasedly to . Let be the portion of positive data in compared to the whole P class we can derive:
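The gap between the training and testing posteriors can be illustrated with a small simulation. The sketch below assumes that positives are hidden completely at random with a hypothetical fraction `alpha`, in which case the posterior of the observed label is the true posterior scaled by `1 - alpha`. Both the selection assumption and the function names are illustrative and are not taken from the derivation in this section.

```python
import numpy as np

def observed_posterior(eta, alpha):
    """Posterior of the observed (training) label when a fraction alpha of the
    positives is left unlabeled. Assumes positives are hidden at random."""
    return (1.0 - alpha) * eta

def simulate_observed_rate(eta, alpha, n=200_000, seed=0):
    """Monte Carlo estimate of the observed-positive rate at a point whose
    true posterior is eta."""
    rng = np.random.default_rng(seed)
    true_y = rng.random(n) < eta          # underlying true labels
    keeps_label = rng.random(n) >= alpha  # positives that stay labeled
    observed = true_y & keeps_label       # everything else is seen as negative
    return observed.mean()

eta, alpha = 0.6, 0.25
print(observed_posterior(eta, alpha), simulate_observed_rate(eta, alpha))
```

Under these assumptions a model fitted to the observed labels is pulled towards the smaller value, which is one way to read the decision boundary drift discussed above.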
3.2 Preliminary for Elicitation
In statistics and economics, elicitation is the practice of designing reward mechanisms that encourage a predictor to make true predictions. Let be the prediction (i.e., an estimator of ), we have . Savage [Savage1971] defines the total reward as a linear function of .
where and denote the conditional reward for a certain event obtains or not. In binary classification context, and refers to the reward for and [Masnadi-Shirazi and Vasconcelos2008]. Specifically, is regarded as the event obtains and otherwise. The goal of elicitation is to design the rewards in order that a maximizes if and only if when . In other words, no larger reward should be given than when prediction is ideal. Lemma 1 finds the sufficient and necessary condition for it.
Lemma 1 (Savage [Savage1971]).
Let be as defined in (11). Assume that is differentiable, then
holds and if and only if
The equality in (12) holds if and only if . Eqn. (12) also implies that is a strictly convex function of . This is the regular situation where event and prediction are in the same space. The event not observed will never happen, c.f., in PU learning, the event obtains even though it is not observed (i.e., unlabeled).
Masnadi-Shirazi and Vasconcelos [Masnadi-Shirazi and Vasconcelos2008] interpreted loss functions in machine learning as a special form of . We rewrite it in Lemma 2 with and illustrate the process of deriving the logarithm loss with Example 1.
Lemma 2 (Masnadi-Shirazi and Vasconcelos [Masnadi-Shirazi and Vasconcelos2008]).
Lemma 2 bridges the design of a loss function with the reward function .
Example 1 (Eliciting logarithm loss).
Let be defined as follows,
which can be interpreted as the closeness of the prediction to the true label. Intuitively, larger should get larger reward (or smaller penalty). Let be the convex function. Applying (17), we can derive:
The loss function is
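The elicitation property behind the logarithm loss can be checked numerically. The sketch below uses the standard logarithmic rewards, log q when the event obtains and log(1 - q) otherwise, and verifies on a grid that the expected reward is maximized when the prediction q equals the true probability. The names `eta`, `q`, and the helper functions are assumptions made for this illustration.

```python
import numpy as np

def expected_log_reward(eta, q):
    """Expected reward of predicting q when the event has probability eta,
    using reward log(q) if the event obtains and log(1 - q) otherwise."""
    return eta * np.log(q) + (1.0 - eta) * np.log(1.0 - q)

def best_prediction(eta, grid):
    """Grid point that maximizes the expected reward."""
    return grid[np.argmax(expected_log_reward(eta, grid))]

grid = np.linspace(0.001, 0.999, 999)  # candidate predictions, step 0.001
for eta in (0.1, 0.3, 0.75):
    print(eta, best_prediction(eta, grid))
```

The maximum expected reward at q = eta is eta log(eta) + (1 - eta) log(1 - eta), a convex function of eta, and negating the two rewards gives the familiar logarithm (cross-entropy) loss.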
3.3 Eliciting Collective Loss Function for PU Learning
In PU learning, the label of a specific sample in is unknown. We only possess the statistical information of the samples. Therefore, a reward function that suits this kind of collective information is desired. Lemma 2 indicates the symmetry of the link function , which changes in PU learning setting. Let be defined in (18). In PU learning, we must ensure when . A straight-forward solution is to encourage making a certain amount of positive predictions when the labels are negative. The amount is such that the expectation of the predictions equals to , i.e., the positive prior in unlabeled data, because holds. Under this condition, :
Note that we apply an absolute function because when the prediction it is also considered to be a negative event, thus deviates from the correctness. Hence, we derive the rectified reward function as follows. Without loss of generality, we let be logarithm function .
Theorem 1 (Maximum reward in PU learning).
We apply stochastic gradient optimization. Instead of traditional one-loss-per-sample paradigm, we collect the model predictions from multiple samples while update the gradient only once. That is equivalent to ask multiple agents to make decisions under condition of (22) The intuition is as follows: It is difficult to ensure the correctness of a single prediction especially under unlabeled data. The underlying label may be either positive or negative. However, when a batch of samples are considered together, the expectation of the prediction converges to . For all mini-batch , the loss function is as follows.
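The collective idea can be sketched on a single mini-batch as follows. This is only one plausible reading of the description above and not the paper's exact loss: positives are rewarded for a high average prediction, while the unlabeled part is rewarded for an average prediction close to the positive prior, using an absolute deviation inside a logarithm. The function name, the argument `prior`, and the clipping constant `eps` are assumptions of this sketch.

```python
import numpy as np

def collective_pu_loss(probs_p, probs_u, prior, eps=1e-7):
    """Illustrative collective loss for one mini-batch (an assumed form).

    probs_p: predicted positive probabilities for labeled positive samples
    probs_u: predicted positive probabilities for unlabeled samples
    prior:   assumed fraction of positives among the unlabeled samples
    """
    # Positive part: the batch-average prediction should be close to 1.
    mean_p = np.clip(np.mean(probs_p), eps, 1.0)
    loss_p = -np.log(mean_p)
    # Unlabeled part: the batch-average prediction should be close to the
    # prior; deviations in either direction are penalized equally.
    closeness = np.clip(1.0 - abs(np.mean(probs_u) - prior), eps, 1.0)
    loss_u = -np.log(closeness)
    return loss_p + loss_u

probs_p = np.array([1.0, 1.0])
print(collective_pu_loss(probs_p, np.array([1.0, 0.0, 0.0, 0.0]), prior=0.25))
print(collective_pu_loss(probs_p, np.array([0.0, 0.0, 0.0, 0.0]), prior=0.25))
print(collective_pu_loss(probs_p, np.array([1.0, 1.0, 0.0, 0.0]), prior=0.25))
```

In this form the loss vanishes when the batch-level average on the unlabeled samples matches the prior, and it grows symmetrically when the batch predicts too few or too many positives.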
3.5 Estimation error bound
We next theoretically upper bound the generalization error. Let be the empirical risk minimizer corresponding to (26). The learning problem is to find an optimal decision function in the function class where is a constant. Formally, . Let be the Rademacher complexity defined in [Bartlett and Mendelson2001].
Lemma 3 (Ledoux and Talagrand [Ledoux and Talagrand1991]).
Assuming is Lipschitz continuous with constant and , we have
Theorem 2 (Generalization error bound).
For any , with probability at least :
where is the total number of i.i.d. samples corresponding to the Rademacher variables.
The Lipschitz constant is for original cross entropy [Yedida2019] where and
is the input vector norm. This Lipschitz constant also applies for (26), so that the last inequality follows. The penultimate inequality follows from routine proof of generalization bound using Rademacher complexity [Shalev-Shwartz and Ben-David2014, Section 26.1].
| Dataset | #Train | #Test | Details | P class | N class |
| CIFAR-10 | 50000 | 19947 | 32×32 image | ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, and ‘horse’ | ‘airplane’, ‘automobile’, ‘ship’, and ‘truck’ |
| 20ng | 11314 | 7532 | text | ‘alt.’, ‘comp.’, ‘misc.’ and ‘rec.’ | ‘sci.’, ‘soc.’ and ‘talk.’ |
We perform experiments on five real-world datasets: MNIST [LeCun et al.1998], USPS [Hastie et al.2005], SVHN [Netzer et al.2011], CIFAR-10 [Krizhevsky2009] and 20ng (twenty news groups) [Lang1995]. We choose the positive and negative classes in accordance with previous research [Kiryo et al.2017]. The specification of the datasets is given in Table 1. We still need the actual labels for testing the models, hence we use originally labeled data. Specifically, we randomly pick a portion of the P class data and mix it with all the N class data to compose the unlabeled set. The remaining P class data form the positive set.
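The construction of the positive and unlabeled sets from a fully labeled dataset can be sketched as below. The function name `make_pu_split` and the parameter `frac_pos_unlabeled` (the portion of P class data moved into the unlabeled set) are hypothetical, and the toy arrays stand in for a real dataset.

```python
import numpy as np

def make_pu_split(X, y, frac_pos_unlabeled, seed=0):
    """Split labeled data (y in {0, 1}) into a positive set and an unlabeled set.

    A random portion of the positives is mixed with all negatives to form the
    unlabeled set; the remaining positives form the labeled positive set.
    The true labels of the unlabeled set are returned for evaluation only.
    """
    rng = np.random.default_rng(seed)
    pos_idx = rng.permutation(np.flatnonzero(y == 1))
    neg_idx = np.flatnonzero(y == 0)
    n_hidden = int(round(frac_pos_unlabeled * len(pos_idx)))
    hidden_pos, labeled_pos = pos_idx[:n_hidden], pos_idx[n_hidden:]
    unl_idx = np.concatenate([hidden_pos, neg_idx])
    rng.shuffle(unl_idx)  # avoid any ordering that reveals the class
    return X[labeled_pos], X[unl_idx], y[unl_idx]

X = np.arange(20).reshape(10, 2)
y = np.array([1] * 6 + [0] * 4)
X_p, X_u, y_u_true = make_pu_split(X, y, frac_pos_unlabeled=0.5)
print(len(X_p), len(X_u), y_u_true.mean())  # sizes and positive rate in U
```

The positive rate of the unlabeled set returned here is the quantity that a class prior estimate would try to approximate in practice.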
We apply neural networks as the predictor function. Specifically, we apply vanilla vgg-16 structure [Simonyan and Zisserman2014]Softsign) and pre-trained word embedding (300-dimension GloVe [Pennington et al.2014] word embeddings) are same with [Kiryo et al.2017]. For the optimizer, we use Nadam [Dozat2016] with learning rate 0.0005 throughout all models. The parameters in nnPU are set equal to the original paper, i.e., .
We then evaluate the results to show the efficacy of proposed method cPU. We explore the following two common questions in applications: 1) Can it separate the unlabeled positive samples from the negative ones without explicit exposure to negative samples? 2) Is it sensitive to class prior, which may vary and sometimes with uncertainty in real applications?
| MNIST | 0.2 | 0.9920 ± 0.0003 | 0.9868 ± 0.0011 | – | – | 0.9925 ± 0.0003 |
| | 0.3 | 0.9910 ± 0.0006 | 0.9859 ± 0.0010 | – | – | 0.9911 ± 0.0002 |
| | 0.4 | 0.9898 ± 0.0005 | 0.9853 ± 0.0011 | – | – | 0.9907 ± 0.0008 |
| | 0.8 | 0.9772 ± 0.0013 | 0.9787 ± 0.0005 | – | – | 0.9851 ± 0.0006 |
| USPS | 0.2 | 0.9396 ± 0.0015 | 0.9624 ± 0.0030 | 0.934 | – | 0.9606 ± 0.0009 |
| | 0.3 | 0.9398 ± 0.0024 | 0.9638 ± 0.0034 | 0.911 | – | 0.9599 ± 0.0027 |
| | 0.4 | 0.9357 ± 0.0046 | 0.9595 ± 0.0017 | 0.901 | – | 0.9624 ± 0.0017 |
| | 0.8 | 0.9334 ± 0.0031 | 0.9316 ± 0.0077 | – | – | 0.9501 ± 0.0018 |
| SVHN | 0.2 | 0.9082 ± 0.0023 | 0.8972 ± 0.0036 | 0.785 | 0.851 | 0.9150 ± 0.0014 |
| | 0.3 | 0.9044 ± 0.0017 | 0.8995 ± 0.0021 | 0.776 | 0.852 | 0.9102 ± 0.0020 |
| | 0.4 | 0.9027 ± 0.0022 | 0.8953 ± 0.0037 | 0.748 | 0.850 | 0.9083 ± 0.0023 |
| | 0.8 | 0.8679 ± 0.0039 | 0.8569 ± 0.0049 | – | – | 0.8595 ± 0.0019 |
| CIFAR-10 | 0.2 | 0.8534 ± 0.0032 | 0.8374 ± 0.0033 | 0.772 | 0.834 | 0.8610 ± 0.0029 |
| | 0.3 | 0.8427 ± 0.0024 | 0.8264 ± 0.0056 | 0.761 | 0.861 | 0.8556 ± 0.0054 |
| | 0.4 | 0.8351 ± 0.0049 | 0.8178 ± 0.0063 | 0.701 | 0.860 | 0.8446 ± 0.0038 |
| | 0.8 | 0.7636 ± 0.0025 | 0.7494 ± 0.0023 | – | – | 0.7906 ± 0.0021 |
| 20ng | 0.2 | 0.8601 ± 0.0013 | 0.7675 ± 0.0410 | – | – | 0.8601 ± 0.0012 |
| | 0.3 | 0.8589 ± 0.0014 | 0.8132 ± 0.0180 | – | – | 0.8599 ± 0.0034 |
| | 0.4 | 0.8573 ± 0.0050 | 0.8414 ± 0.0047 | – | – | 0.8592 ± 0.0041 |
| | 0.8 | 0.8422 ± 0.0027 | 0.8191 ± 0.0022 | – | – | 0.8428 ± 0.0028 |
Table 2. The results of LDCE and PULD are excerpted from the original papers and therefore have no standard deviation values.
4.1 Comparison to State of the Art
We first show the overall evaluation results on the real-world datasets. We compare our proposed approach with current state-of-the-art PU learning methods: unbiased PU (uPU) [du Plessis et al.2015b], non-negative PU (nnPU) [Kiryo et al.2017], LDCE [Shi et al.2018] and PULD [Zhang et al.2019]. We re-implement uPU and nnPU using the same vgg-16 structure as in our method. We do not compare directly with LDCE and PULD, but simply provide their results for reference because: 1) they require an additional feature construction/engineering process, which is not explicit; 2) these two models rely heavily on the support vector machine [Cortes and Vapnik1995], and thus can neither accommodate loss functions other than the hinge loss nor be fairly compared with neural networks. The experiments are repeated five times, each with a randomly sampled P class. We report the mean and standard deviation of accuracies in Table 2. We can see that our proposed method cPU outperforms the current state-of-the-art methods in most cases and is relatively more stable (smaller standard deviation). On the more difficult dataset CIFAR-10, cPU achieves a healthy 1-4 point accuracy gap with the closest competitor. Note that Zhang et al. [Zhang et al.2019] reported that nnPU performs dramatically worse than other competitors (i.e., the best accuracy is 0.771 for CIFAR-10), which did not happen in our experiments.
In this section, we study a common scenario of PU-learning in which the class prior is not accurately estimated. This usually happens in real applications, where a small sample can be achieved to approximate the class prior . To simulate the scenario, we set and misspecify . The results are shown in Table 3. We can observe that generally the results are worse if deviation of become big. Another phenomenon is that, the bigger , the deviation are more influential to the results. This can be avoided by sampling more data to get better estimation of , since larger indicates more unlabeled data available in real applications. Nevertheless, the fluctuation is acceptable when
varies, which means our proposed approach is robust towards wrongly estimated prior probabilities of P class in unlabeled data.
4.3 Training Process Analysis and Case Study
In order to gain deeper insight into how the loss function takes effect, we project the layer before the last onto a 2D fully connected layer and plot its activations (we uniformly sample 500 examples from the training set for clarity of the plot). For simplicity, we demonstrate with a toy dataset DVC (dog-vs-cat; https://www.kaggle.com/c/dogs-vs-cats/data), in which dogs are mixed with cats to form the unlabeled set. A snapshot of the 1st, 4th and 7th epochs is shown in Figure 1 along with five samples (denoted S1…S5). We observe that some unlabeled dogs are blended with the cats at first. As training progresses, they gradually move towards the positive dogs. S3 is a typical example. This further supports the assertion that unlabeled positive samples are separable even without explicit negative examples. We also analyze the errors. S1 and S2 are special examples that contain humans. We observe that S1 is guided by its positive label and moves towards the positive center, while S2, in which the cat is barely recognizable, moves towards S1 due to their resemblance, leading to a wrong prediction. S4 and S5 are noisy unlabeled samples. As a result, they move back and forth across the borderline. This might be a useful signal for active learning, which is left for future work.
In this paper, we identify the bias caused by class uncertainty in the unlabeled data as the major difficulty for current risk estimators. We propose a novel approach to PU learning, dubbed “cPU”, that collectively processes the predictions. We design the loss function through theoretical elicitation under the PU learning setting and rectification of the predictor. It outperforms the state-of-the-art methods on PU learning and shows robustness against a wrongly estimated class prior on the unlabeled data.
- [Bartlett and Mendelson2001] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2001.
- [Bekker and Davis2018] Jessa Bekker and Jesse Davis. Estimating the class prior in positive and unlabeled data through decision tree induction. In AAAI, 2018.
- [Cortes and Vapnik1995] Corinna Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
- [Dozat2016] Timothy Dozat. Incorporating nesterov momentum into adam. In ICLR Workshop, 2016.
- [du Plessis et al.2014] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
- [du Plessis et al.2015a] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106:463–492, 2015.
- [du Plessis et al.2015b] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
- [Hastie et al.2005] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
- [Hou et al.2017] Ming Hou, Brahim Chaib-draa, C. C. Li, and Qibin Zhao. Generative adversarial positive-unlabelled learning. In IJCAI, 2017.
- [Kiryo et al.2017] Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
- [Krizhevsky2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. In Technical report, 2009.
- [Lang1995] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.
- [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [Ledoux and Talagrand1991] Michel Ledoux and Michel Talagrand. Probability in banach spaces: Isoperimetry and processes. 1991.
- [Lee and Liu2003] Wee Sun Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, 2003.
- [Li and Liu2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, 2003.
- [Li et al.2011] Wenkai Li, Qinghua Guo, and Charles Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE T-GRS, 49:717–725, 2011.
- [Li et al.2014] Huayi Li, Zhiyuan Chen, Bing Liu, Xiaokai Wei, and Jidong Shao. Spotting fake reviews via collective positive-unlabeled learning. ICDM, pages 899–904, 2014.
- [Lin2004] Yi Lin. A note on margin-based loss functions in classification. Statistics & probability letters, 68(1):73–82, 2004.
- [Liu et al.2002] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML, 2002.
- [Liu et al.2003] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Building text classifiers using positive and unlabeled examples. ICDM, pages 179–186, 2003.
- [Liu et al.2017] Yashu Liu, Shuang Qiu, Ping Zhang, Pinghua Gong, Feng Wang, Guoliang Xue, and Jieping Ye. Computational drug discovery with dyadic positive-unlabeled learning. In SDM, 2017.
- [Masnadi-Shirazi and Vasconcelos2008] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In NeurIPS, 2008.
- [Netzer et al.2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
- [Onoda et al.2005] Takashi Onoda, Hikari Murata, and Seiji Yamada. One class support vector machine based non-relevance feedback document retrieval. Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., 1:552–557 vol. 1, 2005.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
- [Ren et al.2014] Yafeng Ren, Dong-Hong Ji, and Hongbin Zhang. Positive unlabeled learning for deceptive reviews detection. In EMNLP, 2014.
- [Savage1971] Leonard J Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.
- [Shalev-Shwartz and Ben-David2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. 2014.
- [Shi et al.2018] Hong Shi, Shaojun Pan, Jian Xi Yang, and Chen Gong. Positive and unlabeled learning via loss decomposition and centroid estimation. In IJCAI, 2018.
- [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- [Vapnik1999] Vladimir Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10 5:988–99, 1999.
- [Yang et al.2012] Peng Yang, Xiaoli Li, Jian-Ping Mei, Chee Keong Kwoh, and See-Kiong Ng. Positive-unlabeled learning for disease gene identification. In Bioinformatics, 2012.
- [Yedida2019] Rahul Yedida. Finding a good learning rate. Blog post, 2019.
- [Yu et al.2003] Hwanjo Yu, ChengXiang Zhai, and Jiawei Han. Text classification from positive and unlabeled documents. In CIKM, 2003.
- [Zhang et al.2019] Chuang Zhang, Dexin Ren, Tongliang Liu, Jian Yang, and Chen Gong. Positive and unlabeled learning with label disambiguation. In IJCAI, 2019.
- [Zuluaga et al.2011] Maria A. Zuluaga, Don Hush, Edgar J. F. Delgado Leyton, Marcela Hernández Hoyos, and Maciej Orkisz. Learning from only positive and unlabeled data to detect lesions in vascular ct images. Medical image computing and computer-assisted intervention : MICCAI … International Conference on Medical Image Computing and Computer-Assisted Intervention, 14 Pt 3:9–16, 2011.