1 Introduction
Positive and unlabeled learning (PU learning) aims at learning from only positive and unlabeled examples, without explicit exposure to negative examples. This setting arises in multiple practical scenarios: retrieving information with limited feedback [Onoda et al.2005], text classification with only positive labels collected [Yu et al.2003], and detecting areas of interest in images where normal samples are available but abnormal samples are scarce and diverse [Zuluaga et al.2011, Li et al.2011]. It is widely applicable in industrial scenarios such as content censorship [Ren et al.2014, Li et al.2014], disease gene detection [Yang et al.2012] and drug discovery [Liu et al.2017].
Positive labels are considered perfect in most of the literature, while the unlabeled data are not and are thus handled in different ways. The first category tries to identify negative samples from the unlabeled data and convert the problem back to positive-negative classification [Liu et al.2002, Li and Liu2003]. The heuristic strategies in these methods act as external information to recognize negative samples. However, these strategies often rely heavily on subtle design for a single task/dataset, which results in low transferability. The second category takes the unlabeled data as corrupted negative samples. Early approaches attempt to reweight the unlabeled data [Liu et al.2003, Lee and Liu2003] with a smaller penalty per sample, but their performance is upper bounded due to an intrinsic bias, as proved by du Plessis et al. [du Plessis et al.2014, du Plessis et al.2015b], who later develop an approach, called uPU, with non-convex losses to cancel the bias. This work is extended by nnPU [Kiryo et al.2017] to avoid overfitting by preventing the risk estimator from reaching negative values. Hou et al. [Hou et al.2017] further argue that overfitting is still an issue with flexible deep neural networks. They propose GenPU, a generative adversarial approach, to address the challenge of limited positive data, whereupon they train two discriminators: one telling fake generated examples from true ones, and the other assigning positive labels to generated examples that are similar to the positive class.
A common idea behind all the aforementioned solutions is that they all try to recover the true distribution of positive and negative data and thus recover the true risk. However, they perform risk rectification at the outcome-of-loss-function level, which is the main cause of the inaccuracy, as we elaborate in Section 2.2. In this paper, we propose a novel method called "Collective loss function to learn from Positive and Unlabeled data" (cPU), which rectifies the predictor instead of the total risk. We collectively gather predictions from predictors and rectify them before the loss function is calculated. We design our method with the following principles in mind:

Minimum intervention. The only difference between the PU learning setting and regular positive/negative learning is that the negative data are not explicitly labeled. Therefore, we hope that minimal feature construction is required. Hence, we only process at the prediction level and leave feature engineering to the powerful representations of the models (especially neural networks) themselves.

Robustness. Due to the class uncertainty in the unlabeled data, it is difficult to estimate the class prior in the unlabeled data accurately. As a result, we take the collective prediction to balance out the randomness of the minibatch.
Our main contributions are threefold. Firstly, we provide an unbiased approach to estimating the posterior probability in the PU learning setting, which is compatible with very flexible models and learns at scale. Secondly, we propose a general framework for studying the behavior of loss functions via elicitation. Thirdly, we derive the collective loss function to rectify the decision boundary drift and theoretically bound the generalization error. We conduct comprehensive experiments in comparison with state-of-the-art approaches.
2 Problem Statement
Consider the input space $\mathcal{X}$ and the label space $\mathcal{Y}=\{0,1\}$; we denote by $p(x,y)$ the joint distribution over $\mathcal{X}\times\mathcal{Y}$. Let $g$ be a predictor function and $f(x)=[\sigma(g(x))]$ be the classifier, where $[\cdot]$ means rounding to the nearest integer and $\sigma$ is a sigmoid function. Let $\ell(f(x),y)$ be the loss function for binary classification. According to statistical learning theory [Vapnik1999], the risk of $f$ is defined by
$$R(f)=\mathbb{E}_{(x,y)\sim p(x,y)}\left[\ell(f(x),y)\right]. \quad (1)$$
2.1 Conditional Risk for PN Learning.
Let $\eta(x)=p(y=1\mid x)$ denote the posterior probability of the positive class. For clarity, we omit the argument $x$ in expressions such as $\eta(x)$ and $f(x)$ throughout the paper. The risk decomposes to the conditional form
$$R(f)=\mathbb{E}_{x\sim p(x)}\left[\eta\,\ell(f,1)+(1-\eta)\,\ell(f,0)\right], \quad (2)$$
where $p(x)$ is the marginal distribution of $x$. We expect Fisher consistency [Lin2004] (aka classification-calibration in some of the literature) for the predictors, which is a very weak condition. Specifically, if the risk in (2) is minimized, the following holds:
$$f = \mathbb{1}\left[\eta > 1/2\right]. \quad (3)$$
An example loss function is the zero-one loss:
$$\ell_{01}(f,y)=\mathbb{1}\left[f \neq y\right]. \quad (4)$$
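As a quick sanity check of Fisher consistency under the zero-one loss, the following sketch verifies that minimizing the conditional risk in (2) yields exactly the thresholding rule in (3). The function name is ours, not the paper's:

```python
def conditional_risk_01(eta, pred):
    """Conditional zero-one risk: eta*l(pred,1) + (1-eta)*l(pred,0), per Eqn. (2)."""
    return eta * float(pred != 1) + (1.0 - eta) * float(pred != 0)

# The risk-minimizing hard prediction thresholds the posterior at 1/2,
# matching the Fisher-consistency condition of Eqn. (3).
for eta in [0.1, 0.3, 0.7, 0.9]:
    best = min([0, 1], key=lambda p: conditional_risk_01(eta, p))
    assert best == int(eta > 0.5)
```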
2.2 Risk Estimators for PU Learning
Due to the absence of negative samples in PU learning, risks have to be estimated from only positive and unlabeled samples. In other words, the risk needs to be derived from risks over the positive and unlabeled data. Formally, a PU learning system receives training samples which can be divided into two not-necessarily-independent components: the labeled positive samples $X_p$, drawn from $p(x\mid y=1)$, and the unlabeled samples $X_u$, drawn from the marginal $p(x)$. Implicitly, the unlabeled set consists of positive samples $X_u^p$ and negative samples $X_u^n$. As a popular practice, negative labels are assigned to the unlabeled samples [du Plessis et al.2015b, Kiryo et al.2017]. Let $\pi_p=p(y=1)$ be the class prior, let $R_p^+(f)=\mathbb{E}_{x\sim p(x\mid y=1)}[\ell(f,1)]$ and $R_p^-(f)=\mathbb{E}_{x\sim p(x\mid y=1)}[\ell(f,0)]$ be the risks on the positive data, and let $R_u^-(f)=\mathbb{E}_{x\sim p(x)}[\ell(f,0)]$ be the risk of the unlabeled samples, which are drawn from the same distribution as the test data, with all labels assigned to negative. Unbiased PU (aka uPU) learning methods [du Plessis et al.2014, du Plessis et al.2015b] estimate the risk for PU learning by subtracting the wrongly included risk $\pi_p R_p^-(f)$:
$$R_{pu}(f)=\pi_p R_p^+(f)+R_u^-(f)-\pi_p R_p^-(f). \quad (5)$$
Non-negative PU (aka nnPU) [Kiryo et al.2017] observed that $R_u^-(f)-\pi_p R_p^-(f)$ should always be non-negative. However, this does not always hold empirically, especially when the model is flexible (e.g., a deep neural network). To mitigate this drawback, they propose a non-negative risk estimator, ensuring the estimate cannot reach negative values:
$$\tilde{R}_{pu}(f)=\pi_p R_p^+(f)+\max\left\{0,\;R_u^-(f)-\pi_p R_p^-(f)\right\}. \quad (6)$$
Nevertheless, minimizing these risk estimators can lead to insufficient penalty for the negative samples. Maximizing $\pi_p R_p^-(f)$, instead of making $R_u^-(f)$ small, results in the same effect of minimizing the total risk; this is a natural side effect of minimizing (5) with flexible models and convex surrogate loss functions. Risk estimators perform rectification at the outcome-of-loss-function level, which cannot avoid the explosion of some surrogate losses (i.e., in the worst case an unbounded loss may reach a very large value [Kiryo et al.2017]), such as the popular logarithm loss. In such cases, the flexible model overfits the training data well, and a sampled minibatch may include some easy positive examples whose losses sum up to a large $\pi_p R_p^-(f)$ that overwhelms $R_u^-(f)$. Thereupon, can we remedy the problem before the loss function?
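For concreteness, the uPU estimator of Eqn. (5) and the non-negative correction of Eqn. (6) can be sketched as follows. This is a minimal NumPy sketch using a logistic surrogate loss with labels in {0, 1}; the function names are our choices, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(g, y):
    """Surrogate loss l(g(x), y) with y in {0, 1}: log loss on sigmoid outputs."""
    p = sigmoid(g)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def upu_risk(g_p, g_u, pi_p):
    """Unbiased PU risk (Eqn. (5)): pi_p*R_p^+ + R_u^- - pi_p*R_p^-."""
    r_p_pos = logistic_loss(g_p, 1).mean()
    r_p_neg = logistic_loss(g_p, 0).mean()
    r_u_neg = logistic_loss(g_u, 0).mean()
    return pi_p * r_p_pos + r_u_neg - pi_p * r_p_neg

def nnpu_risk(g_p, g_u, pi_p):
    """Non-negative PU risk (Eqn. (6)): clip the estimated negative risk at zero."""
    r_p_pos = logistic_loss(g_p, 1).mean()
    r_p_neg = logistic_loss(g_p, 0).mean()
    r_u_neg = logistic_loss(g_u, 0).mean()
    return pi_p * r_p_pos + max(0.0, r_u_neg - pi_p * r_p_neg)

# A model that is very confident on easy positive examples can drive the
# unbiased estimate below zero, while the nnPU estimate stays non-negative:
g_p = np.array([4.0, 5.0])    # scores on labeled positives
g_u = np.array([-4.0, -5.0])  # scores on unlabeled samples
assert upu_risk(g_p, g_u, 0.5) < 0.0 < nnpu_risk(g_p, g_u, 0.5)
```

This illustrates the clipping difference only; both estimators still rectify at the outcome-of-loss-function level discussed above.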
3 Collective Logarithm Loss for PU Learning
In this section, we first address the decision boundary drift problem in Section 3.1 and provide a rectification of the predictor. Then, in Section 3.2, we introduce the background of elicitation and how it connects to the design of loss functions in normal situations. Finally, in Section 3.3, we describe the framework of eliciting the loss function under the PU learning setting.
3.1 Rectification of Predictor
To satisfy Fisher consistency, we hope (3) holds at test time, while (6) is biased in general [Kiryo et al.2017], hence leading to biased solutions. A key observation is that the decision boundary differs between training and testing in the PU learning problem. Our aim is to rectify the decision boundary so that the classifier for testing also fits the positive and unlabeled training data. We introduce $y$ to denote the true labels; $s$ remains the observed labels, where unlabeled samples are regarded as negative ($s=0$). Let $\eta$ be the posterior probability at testing, namely what we hope to capture by learning. In $X_u$, the underlying true label is $y=1$ for $X_u^p$ and $y=0$ for $X_u^n$. It is evident that the data distributions for training and testing are different. Let $\eta_{tr}$ and $\eta_{te}$ be the posterior probabilities for training and testing, respectively. Denoting the total sample space by $X = X_p \cup X_u$, we can estimate these two expectations empirically by the following equations.
(7)  
(8) 
(9) 
We denote the value in Eqn. (9) by $\bar\eta$ for the rest of the paper. We hope the learned posterior matches it; however, based on PU training data, the model may converge to a biased solution. Letting $r$ be the portion of positive data in $X_u$ compared to the whole P class, we can derive:
(10) 
3.2 Preliminary for Elicitation
In statistics and economics, elicitation is the practice of designing reward mechanisms that encourage a predictor to make truthful predictions. Let $\hat\eta$ be the prediction (i.e., an estimator of $\eta$). Savage [Savage1971] defines the total reward as a linear function of the conditional rewards:
(11) 
where $I_1$ and $I_0$ denote the conditional rewards for a certain event obtaining or not. In the binary classification context, $I_1$ and $I_0$ refer to the rewards for $y=1$ and $y=0$ [MasnadiShirazi and Vasconcelos2008]. Specifically, $y=1$ is regarded as the event obtaining, and $y=0$ otherwise. The goal of elicitation is to design the rewards so that the total reward is maximized if and only if $\hat\eta=\eta$. In other words, no reward larger than that of the ideal prediction should be attainable. Lemma 1 gives the sufficient and necessary condition.
Lemma 1 (Savage [Savage1971]).
Let the total reward be as defined in (11). Assume that it is differentiable; then
(12) 
holds, and equality is attained if and only if
(13)  
(14) 
Remark 1.
The equality in (12) holds if and only if $\hat\eta=\eta$. Eqn. (12) also implies that the maximum reward is a strictly convex function of $\eta$. This is the regular situation, where the event and the prediction are in the same space, and an event not observed will never happen; cf. PU learning, where the event $y=1$ can obtain even though it is not observed (i.e., unlabeled).
Masnadi-Shirazi and Vasconcelos [MasnadiShirazi and Vasconcelos2008] interpret loss functions in machine learning as a special form of the reward. We restate this in Lemma 2 and illustrate the process of deriving the logarithm loss in Example 1.
Lemma 2 (Masnadi-Shirazi and Vasconcelos [MasnadiShirazi and Vasconcelos2008]).
Remark 2.
Lemma 2 bridges the design of a loss function with the reward function.
Example 1 (Eliciting logarithm loss).
Let the reward be defined as follows,
(18) 
which can be interpreted as the closeness of the prediction to the true label. Intuitively, a larger closeness should receive a larger reward (or a smaller penalty). Let $J(\eta)=\eta\log\eta+(1-\eta)\log(1-\eta)$ be the convex function. Applying (17), we can derive:
(19) 
The loss function is
$$\ell(\hat\eta,y)=-y\log\hat\eta-(1-y)\log(1-\hat\eta). \quad (20)$$
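A numeric check of the elicitation argument: the expected log loss is optimized exactly when the prediction equals the true posterior (i.e., the log score is proper), so no prediction can earn a larger reward than the truthful one. The function name is ours:

```python
import numpy as np

def expected_log_loss(eta, p_hat):
    """E_y[-y log p_hat - (1 - y) log(1 - p_hat)] when P(y = 1) = eta."""
    return -(eta * np.log(p_hat) + (1.0 - eta) * np.log(1.0 - p_hat))

# Over a fine grid of candidate predictions, the expected loss is minimized
# at p_hat = eta itself, matching the maximum-reward condition of Lemma 1.
grid = np.linspace(0.01, 0.99, 99)
for eta in [0.2, 0.5, 0.8]:
    best = grid[np.argmin(expected_log_loss(eta, grid))]
    assert abs(best - eta) < 0.005
```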
3.3 Eliciting Collective Loss Function for PU Learning
In PU learning, the label of a specific sample in $X_u$ is unknown; we only possess statistical information about the samples. Therefore, a reward function that suits this kind of collective information is desired. Lemma 2 indicates the symmetry of the link function, which changes in the PU learning setting. Let the closeness be defined as in (18). A straightforward solution is to encourage making a certain amount of positive predictions when the observed labels are negative. The amount is such that the expectation of the predictions equals the positive prior in the unlabeled data. Under this condition:
(21) 
Note that we apply an absolute function because a prediction that overshoots the target is also considered a deviation from correctness. Hence, we derive the rectified reward function as follows. Without loss of generality, we let the function be the logarithm.
(22) 
According to [Savage1971, Section 7], (22) must be upper bounded by a maximum reward. We detail this in Theorem 1.
Theorem 1 (Maximum reward in PU learning).
3.4 Implementation
We apply stochastic gradient optimization. Instead of the traditional one-loss-per-sample paradigm, we collect the model predictions from multiple samples while updating the gradient only once. That is equivalent to asking multiple agents to make decisions under the condition of (22). The intuition is as follows: it is difficult to ensure the correctness of a single prediction, especially on unlabeled data, whose underlying label may be either positive or negative. However, when a batch of samples is considered together, the expectation of the predictions converges to the positive prior. For a minibatch, the loss function is as follows.
(26) 
In practice, we treat the positive class prior in the unlabeled data as known during training. Many related works [du Plessis et al.2015a, Bekker and Davis2018] can be applied to estimate it. We further show that our method is insensitive to it in Section 4.2.
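Since Eqn. (26) is not reproduced above, the following is only a speculative sketch of the collective idea described in the text: within a minibatch, the unlabeled predictions are treated as a group, and the deviation of their batch-mean from the positive prior in the unlabeled data is penalized, with the absolute value mirroring the rectification around (21)-(22). The function names and the exact form of the penalty are our assumptions, not the paper's loss:

```python
import numpy as np

def collective_unlabeled_loss(p_hat_u, pi_u):
    """Speculative sketch: treat the unlabeled minibatch collectively.

    Instead of penalizing each unlabeled prediction toward 0, penalize the
    deviation of the batch-mean prediction from the positive prior pi_u in
    the unlabeled data; overshooting pi_u also counts as deviation (|.|).
    """
    mean_pred = np.mean(p_hat_u)
    # Negative log reward of the rectified collective prediction.
    return -np.log(1.0 - np.abs(mean_pred - pi_u))

def batch_loss(p_hat_p, p_hat_u, pi_u):
    """Positive part is ordinary log loss; unlabeled part is collective."""
    pos = -np.mean(np.log(p_hat_p))
    return pos + collective_unlabeled_loss(p_hat_u, pi_u)
```

Under this sketch, an unlabeled batch whose mean prediction already equals pi_u incurs no penalty, while a batch predicted all-positive or all-negative is penalized, which is the rectification behavior the text describes.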
3.5 Estimation error bound
We next theoretically upper bound the generalization error. Let the empirical risk minimizer corresponding to (26) be our learned function. The learning problem is to find an optimal decision function in a function class bounded by a constant. The Rademacher complexity is defined as in [Bartlett and Mendelson2001].
Lemma 3 (Ledoux and Talagrand [Ledoux and Talagrand1991]).
Assuming the loss is Lipschitz continuous with a constant, we have
(27) 
Theorem 2 (Generalization error bound).
For any , with probability at least :
(28) 
where is the total number of i.i.d. samples corresponding to the Rademacher variables.
The Lipschitz constant of the original cross entropy is given in [Yedida2019] in terms of the input vector norm. This Lipschitz constant also applies to (26), so that the last inequality follows. The penultimate inequality follows from the routine proof of generalization bounds using Rademacher complexity [ShalevShwartz and BenDavid2014, Section 26.1].

Table 1: Specification of the datasets.

Dataset  | #Train | #Test | Details     | P class                                       | N class
MNIST    | 60000  | 23878 | 32×32 image | 0, 2, 4, 6, 8                                 | 1, 3, 5, 7, 9
USPS     | 7291   | 2942  | 32×32 image | 0                                             | rest
SVHN     | 73257  | 20718 | 16×16 image | 1, 2, 3, 4, 5                                 | 6, 7, 8, 9, 0
CIFAR10  | 50000  | 19947 | 32×32 image | 'bird', 'cat', 'deer', 'dog', 'frog', 'horse' | 'airplane', 'automobile', 'ship', 'truck'
20ng     | 11314  | 7532  | text        | 'alt.', 'comp.', 'misc.', 'rec.'              | 'sci.', 'soc.', 'talk.'
4 Experiments
We perform experiments on five real-world datasets: MNIST [LeCun et al.1998], USPS [Hastie et al.2005], SVHN [Netzer et al.2011], CIFAR10 [Krizhevsky2009] and 20ng (twenty newsgroups) [Lang1995]. We choose the positive and negative classes in accordance with previous research [Kiryo et al.2017]. The specification of the datasets is described in Table 1. We still need the actual labels for testing the models, hence we use originally labeled data. Specifically, we randomly pick a fraction $r$ of the P class data and mix it with all the N class data to compose the unlabeled set $X_u$; the remaining P class data forms the positive set $X_p$.
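The split described above can be sketched as follows. This is a minimal illustration; `make_pu_split`, its argument names, and the seed handling are ours, not from the paper:

```python
import numpy as np

def make_pu_split(x, y, positive_classes, frac_to_unlabel, seed=0):
    """Compose positive and unlabeled sets from a fully labeled dataset.

    A fraction `frac_to_unlabel` of the P-class samples is mixed with all
    N-class samples to form the unlabeled set; the rest stays labeled P.
    """
    rng = np.random.default_rng(seed)
    is_pos = np.isin(y, positive_classes)
    pos_idx = np.flatnonzero(is_pos)
    rng.shuffle(pos_idx)
    n_hidden = int(len(pos_idx) * frac_to_unlabel)
    hidden_pos = pos_idx[:n_hidden]       # positives hidden in the unlabeled set
    labeled_pos = pos_idx[n_hidden:]      # remaining labeled positives
    unlabeled = np.concatenate([hidden_pos, np.flatnonzero(~is_pos)])
    return x[labeled_pos], x[unlabeled]
```

The true labels of the hidden positives are kept aside only for evaluation, matching the protocol in the text.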
We apply neural networks as the predictor function. Specifically, we apply the vanilla VGG16 structure [Simonyan and Zisserman2014] to encode the input features. For 20ng, all the details, including the model structure (a multilayer perceptron with five layers and Softsign activation functions) and pretrained word embeddings (300-dimensional GloVe [Pennington et al.2014] embeddings), are the same as in [Kiryo et al.2017]. For the optimizer, we use Nadam [Dozat2016] with learning rate 0.0005 for all models. The parameters of nnPU are set as in the original paper.

We then evaluate the results to show the efficacy of the proposed method cPU. We explore the following two common questions in applications: 1) Can it separate the unlabeled positive samples from the negative ones without explicit exposure to negative samples? 2) Is it sensitive to the class prior, which may vary and carry uncertainty in real applications?
Table 2: Classification accuracy (mean ± standard deviation over five runs). The results of LDCE and PULD are excerpted from the original papers, thus without standard deviation values.

Dataset | r   | uPU           | nnPU          | LDCE  | PULD  | cPU (ours)
MNIST   | 0.2 | 0.9920±0.0003 | 0.9868±0.0011 | –     | –     | 0.9925±0.0003
        | 0.3 | 0.9910±0.0006 | 0.9859±0.0010 | –     | –     | 0.9911±0.0002
        | 0.4 | 0.9898±0.0005 | 0.9853±0.0011 | –     | –     | 0.9907±0.0008
        | 0.8 | 0.9772±0.0013 | 0.9787±0.0005 | –     | –     | 0.9851±0.0006
USPS    | 0.2 | 0.9396±0.0015 | 0.9624±0.0030 | 0.934 | –     | 0.9606±0.0009
        | 0.3 | 0.9398±0.0024 | 0.9638±0.0034 | 0.911 | –     | 0.9599±0.0027
        | 0.4 | 0.9357±0.0046 | 0.9595±0.0017 | 0.901 | –     | 0.9624±0.0017
        | 0.8 | 0.9334±0.0031 | 0.9316±0.0077 | –     | –     | 0.9501±0.0018
SVHN    | 0.2 | 0.9082±0.0023 | 0.8972±0.0036 | 0.785 | 0.851 | 0.9150±0.0014
        | 0.3 | 0.9044±0.0017 | 0.8995±0.0021 | 0.776 | 0.852 | 0.9102±0.0020
        | 0.4 | 0.9027±0.0022 | 0.8953±0.0037 | 0.748 | 0.850 | 0.9083±0.0023
        | 0.8 | 0.8679±0.0039 | 0.8569±0.0049 | –     | –     | 0.8595±0.0019
CIFAR10 | 0.2 | 0.8534±0.0032 | 0.8374±0.0033 | 0.772 | 0.834 | 0.8610±0.0029
        | 0.3 | 0.8427±0.0024 | 0.8264±0.0056 | 0.761 | 0.861 | 0.8556±0.0054
        | 0.4 | 0.8351±0.0049 | 0.8178±0.0063 | 0.701 | 0.860 | 0.8446±0.0038
        | 0.8 | 0.7636±0.0025 | 0.7494±0.0023 | –     | –     | 0.7906±0.0021
20ng    | 0.2 | 0.8601±0.0013 | 0.7675±0.0410 | –     | –     | 0.8601±0.0012
        | 0.3 | 0.8589±0.0014 | 0.8132±0.0180 | –     | –     | 0.8599±0.0034
        | 0.4 | 0.8573±0.0050 | 0.8414±0.0047 | –     | –     | 0.8592±0.0041
        | 0.8 | 0.8422±0.0027 | 0.8191±0.0022 | –     | –     | 0.8428±0.0028
4.1 Comparison to State of the Art
We first show the overall evaluation results on the real-world datasets. We compare our proposed approach with current state-of-the-art PU learning methods: unbiased PU (uPU) [du Plessis et al.2015b], non-negative PU (nnPU) [Kiryo et al.2017], LDCE [Shi et al.2018] and PULD [Zhang et al.2019]. We reimplement uPU and nnPU using the same VGG16 structure as in our method. We do not directly compare with LDCE and PULD but provide their results for reference, because: 1) they require an additional feature construction/engineering process, which is not explicit; 2) these two models deeply involve support vector machines [Cortes and Vapnik1995] as their base model, and thus can neither be plugged with loss functions other than the hinge loss nor be fairly compared with neural networks. The experiments are repeated five times with a randomly sampled P class each time. We report the mean and standard deviation of accuracies in Table 2. Our proposed method cPU outperforms the current state-of-the-art methods in most cases and is relatively more stable (smaller standard deviations). On the more difficult dataset CIFAR10, cPU achieves a healthy 1.4-point accuracy gap over the closest competitor. Note that Zhang et al. [Zhang et al.2019] reported that nnPU performs dramatically worse than the other competitors (i.e., its best accuracy is 0.771 for CIFAR10), which did not happen in our experiments.

4.2 Robustness
Table 3: Accuracy of cPU under a misspecified class prior (relative deviation of the estimated prior).

Dataset | r   | −10%   | −5%    | +5%    | +10%
MNIST   | 0.2 | 0.9925 | 0.9924 | 0.9927 | 0.9842
        | 0.3 | 0.9905 | 0.9912 | 0.9911 | 0.9878
        | 0.4 | 0.9908 | 0.9907 | 0.9907 | 0.9881
        | 0.8 | 0.9832 | 0.9842 | 0.9855 | 0.9609
USPS    | 0.2 | 0.9651 | 0.9616 | 0.9621 | 0.9542
        | 0.3 | 0.9641 | 0.9581 | 0.9656 | 0.9517
        | 0.4 | 0.9631 | 0.9631 | 0.9606 | 0.9527
        | 0.8 | 0.9601 | 0.9562 | 0.9283 | 0.9128
SVHN    | 0.2 | 0.9156 | 0.9078 | 0.9124 | 0.9022
        | 0.3 | 0.9106 | 0.9159 | 0.9097 | 0.8921
        | 0.4 | 0.9085 | 0.9096 | 0.8989 | 0.8888
        | 0.8 | 0.8763 | 0.8755 | 0.8402 | 0.8283
CIFAR10 | 0.2 | 0.8597 | 0.8607 | 0.8609 | 0.8543
        | 0.3 | 0.8542 | 0.8527 | 0.8519 | 0.8549
        | 0.4 | 0.8343 | 0.8401 | 0.8435 | 0.8311
        | 0.8 | 0.7778 | 0.7854 | 0.7854 | 0.7797
20ng    | 0.2 | 0.8593 | 0.8594 | 0.8606 | 0.8598
        | 0.3 | 0.8589 | 0.8598 | 0.8602 | 0.8596
        | 0.4 | 0.8590 | 0.8605 | 0.8578 | 0.8563
        | 0.8 | 0.8417 | 0.8421 | 0.8426 | 0.8352
In this section, we study a common scenario of PU learning in which the class prior is not accurately estimated. This usually happens in real applications, where only a small sample is available to approximate the class prior. To simulate the scenario, we misspecify the estimated prior by relative deviations of ±5% and ±10%. The results are shown in Table 3. We observe that the results generally get worse as the deviation grows. Another phenomenon is that the bigger $r$ is, the more influential the deviation becomes. This can be mitigated by sampling more data to obtain a better estimate, since a larger $r$ indicates more unlabeled data available in real applications. Nevertheless, the fluctuation is acceptable as the estimate varies, which means our proposed approach is robust towards wrongly estimated prior probabilities of the P class in unlabeled data.
4.3 Training Process Analysis and Case Study
To gain deeper insight into how the loss function takes effect, we project the layer before the last onto a 2D fully connected layer and plot its activation (we uniformly sample 500 examples from the training set for clarity of the plot). For simplicity, we demonstrate with a toy dataset DVC (dog-vs-cat, https://www.kaggle.com/c/dogs-vs-cats/data), in which dogs are mixed with cats to form the unlabeled set. A snapshot of the 1st, 4th and 7th epochs is shown in Figure 1, along with five samples (denoted S1...S5). We observe that some unlabeled dogs are blended with the cats at first. As training progresses, they gradually move towards the positive dogs; S3 is a typical example. This further supports the assertion that unlabeled positive samples are separable even without explicit negative examples. We also analyze the errors. S1 and S2 are special examples with a human inside. We observe that S1 is guided by its positive label and moves towards the positive center, while S2, in which the cat is barely recognizable, moves towards S1 due to their resemblance, leading to a wrong prediction. S4 and S5 are noisy unlabeled samples; as a result, they move back and forth across the borderline. This might be a useful signal for active learning, which we leave for future work.
5 Conclusion
In this paper, we identify the bias caused by class uncertainty in the unlabeled data as the major difficulty for current risk estimators. We propose a novel approach to PU learning dubbed "cPU" that collectively processes the predictions. We design the loss function through theoretical elicitation under the PU learning setting and rectification of the predictor. cPU outperforms the state-of-the-art methods on PU learning and shows robustness against a wrongly estimated class prior on the unlabeled data.
References
 [Bartlett and Mendelson2001] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2001.

 [Bekker and Davis2018] Jessa Bekker and Jesse Davis. Estimating the class prior in positive and unlabeled data through decision tree induction. In AAAI, 2018.
 [Cortes and Vapnik1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

 [Dozat2016] Timothy Dozat. Incorporating nesterov momentum into adam. In ICLR Workshop, 2016.
 [du Plessis et al.2014] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
 [du Plessis et al.2015a] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106:463–492, 2015.
 [du Plessis et al.2015b] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
 [Hastie et al.2005] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
 [Hou et al.2017] Ming Hou, Brahim Chaibdraa, C. C. Li, and Qibin Zhao. Generative adversarial positiveunlabelled learning. In IJCAI, 2017.
 [Kiryo et al.2017] Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
 [Krizhevsky2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. In Technical report, 2009.
 [Lang1995] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.
 [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [Ledoux and Talagrand1991] Michel Ledoux and Michel Talagrand. Probability in banach spaces: Isoperimetry and processes. 1991.

 [Lee and Liu2003] Wee Sun Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, 2003.
 [Li and Liu2003] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, 2003.
 [Li et al.2011] Wenkai Li, Qinghua Guo, and Charles Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE TGRS, 49:717–725, 2011.
 [Li et al.2014] Huayi Li, Zhiyuan Chen, Bing Liu, Xiaokai Wei, and Jidong Shao. Spotting fake reviews via collective positive-unlabeled learning. In ICDM, pages 899–904, 2014.
 [Lin2004] Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
 [Liu et al.2002] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML, 2002.
 [Liu et al.2003] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Building text classifiers using positive and unlabeled examples. ICDM, pages 179–186, 2003.
 [Liu et al.2017] Yashu Liu, Shuang Qiu, Ping Zhang, Pinghua Gong, Feng Wang, Guoliang Xue, and Jieping Ye. Computational drug discovery with dyadic positiveunlabeled learning. In SDM, 2017.

 [MasnadiShirazi and Vasconcelos2008] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In NeurIPS, 2008.
 [Netzer et al.2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
 [Onoda et al.2005] Takashi Onoda, Hikari Murata, and Seiji Yamada. One class support vector machine based non-relevance feedback document retrieval. In IJCNN, pages 552–557, 2005.
 [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
 [Ren et al.2014] Yafeng Ren, DongHong Ji, and Hongbin Zhang. Positive unlabeled learning for deceptive reviews detection. In EMNLP, 2014.
 [Savage1971] Leonard J Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.
 [ShalevShwartz and BenDavid2014] Shai ShalevShwartz and Shai BenDavid. Understanding machine learning: From theory to algorithms. 2014.
 [Shi et al.2018] Hong Shi, Shaojun Pan, Jian Xi Yang, and Chen Gong. Positive and unlabeled learning via loss decomposition and centroid estimation. In IJCAI, 2018.
 [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [Vapnik1999] Vladimir Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10 5:988–99, 1999.
 [Yang et al.2012] Peng Yang, Xiaoli Li, Jian-Ping Mei, Chee Keong Kwoh, and See-Kiong Ng. Positive-unlabeled learning for disease gene identification. In Bioinformatics, 2012.
 [Yedida2019] Rahul Yedida. Finding a good learning rate. Blog post, 2019.
 [Yu et al.2003] Hwanjo Yu, ChengXiang Zhai, and Jiawei Han. Text classification from positive and unlabeled documents. In CIKM, 2003.
 [Zhang et al.2019] Chuang Zhang, Dexin Ren, Tongliang Liu, Jian Yang, and Chen Gong. Positive and unlabeled learning with label disambiguation. In IJCAI, 2019.
 [Zuluaga et al.2011] Maria A. Zuluaga, Don Hush, Edgar J. F. Delgado Leyton, Marcela Hernández Hoyos, and Maciej Orkisz. Learning from only positive and unlabeled data to detect lesions in vascular CT images. In MICCAI, pages 9–16, 2011.