1 Introduction
Positive-Unlabeled (PU) Learning is a semi-supervised analog of binary classification. Unlike the latter, PU Classification does not require labeled samples from both classes for training. Instead, two samples are required: a labeled sample from the Positive class, and an Unlabeled sample with mixed data from both the Positive and Negative classes, with generally unknown mixing proportions. The objective is to classify the Unlabeled sample, which requires identifying the mixing proportions first.
PU Classification naturally arises in numerous real-world cases where obtaining labeled data from both classes is either complicated or impossible. It is applied in text analysis, when the objective is to detect fake reviews or spam and only some non-fake documents are labeled Ren, Ji, and Zhang (2014); in medicine, when the objective is early diagnosis of type 2 diabetes Claesen et al. (2015); and in bioinformatics, when the objective is to expand a database of disease genes Yang et al. (2012).
We propose a transparent non-parametric method named Difference-of-Estimated-Densities-based Positive-Unlabeled Learning, or DEDPUL. The method simultaneously estimates the proportions of the mixing components in the Unlabeled sample and classifies it. This is unlike the current state-of-the-art method Kiryo et al. (2017), which requires the proportions to be identified in advance. DEDPUL adheres to the following two-step strategy.
In the first step, a Non-Traditional Classifier (NTC) is constructed Elkan and Noto (2008). An NTC is any classifier trained to distinguish the Positive and Unlabeled samples. During learning, the NTC simply treats the Unlabeled data as Negative, which clearly leads to biased estimates. The second step eliminates this bias using explicit estimation of the densities of the NTC predictions for both the Positive and Unlabeled samples.
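The first step can be sketched on toy one-dimensional data. This is a minimal illustration only: the distributions, sample sizes, and the plain gradient-descent logistic classifier are assumptions for the sketch, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D PU data: Positives ~ N(2, 1); Unlabeled = 30% Positives + 70% Negatives ~ N(-2, 1)
x_p = rng.normal(2.0, 1.0, 500)
x_u = np.concatenate([rng.normal(2.0, 1.0, 300), rng.normal(-2.0, 1.0, 700)])

# Step 1: Non-Traditional Classifier -- treat the Unlabeled sample as if it were Negative
x = np.concatenate([x_p, x_u])
t = np.concatenate([np.ones_like(x_p), np.zeros_like(x_u)])  # 1 = labeled P, 0 = U

# Balanced logistic regression fitted by plain full-batch gradient descent
w, b = 0.0, 0.0
weights = np.where(t == 1, 1.0 / len(x_p), 1.0 / len(x_u))  # balance the two samples
for _ in range(2000):
    s = 1.0 / (1.0 + np.exp(-(w * x + b)))
    grad = weights * (s - t)
    w -= 0.5 * np.sum(grad * x)
    b -= 0.5 * np.sum(grad)

def ntc(z):
    """NTC prediction at z: a biased score, high for Positive-like points."""
    return 1.0 / (1.0 + np.exp(-(w * np.asarray(z) + b)))

# High for Positive-like points, low for Negative-like points; the second step
# of DEDPUL then removes the bias of these raw scores.
print(ntc(3.0), ntc(-3.0))
```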
The paper makes several contributions:

We propose equation (11), which connects the posterior probability that an Unlabeled instance is a latent Positive to the density ratio of the NTC predictions for the two samples. This serves as the theoretical justification of DEDPUL.

We improve the state-of-the-art non-negative Risk Estimation method Kiryo et al. (2017) by using Brier loss instead of the originally proposed sigmoid loss while training neural networks. (The sigmoid and Brier loss functions are the mean absolute and mean squared errors, respectively, between the binary classifier's predictions and the correct labels.)
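The two losses mentioned above can be written out directly (a small sketch; the prediction and label arrays are toy values):

```python
import numpy as np

def sigmoid_loss(pred, label):
    # mean absolute error between predictions in [0, 1] and binary labels
    return np.mean(np.abs(pred - label))

def brier_loss(pred, label):
    # mean squared error between predictions in [0, 1] and binary labels
    return np.mean((pred - label) ** 2)

pred = np.array([0.9, 0.2, 0.6])
label = np.array([1.0, 0.0, 1.0])
print(sigmoid_loss(pred, label))  # (0.1 + 0.2 + 0.4) / 3 ~ 0.233
print(brier_loss(pred, label))    # (0.01 + 0.04 + 0.16) / 3 = 0.07
```

The Brier loss penalizes large errors more heavily, which is the property exploited when it replaces the sigmoid loss in nnRE training.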
The rest of the paper is organized as follows. Section 2 introduces notation, formally defines the problems, and solves them in the ideal case of known densities. Section 3 proposes DEDPUL. Section 4 briefly summarizes the history of PU Classification and relates DEDPUL to the existing literature. Section 5 describes the experimental procedure and results. Section 6 concludes.
2 Problem Setup and Notation
In this section we cover the population case, when the densities of the Positive and Unlabeled distributions are known in advance. We introduce the relevant notation and formally define the problems of Mixture Proportions Estimation and PU Classification. At the end of the section we propose equation (11).
Let $f_p(x)$, $f_n(x)$, $f_u(x)$ be the probability density functions of the Positive (P), Negative (N), and Unlabeled (U) distributions of $x$, where $x$ is a vector in an $m$-dimensional feature space. Let $\alpha$ be the proportion of P in U:

$$f_u(x) = \alpha f_p(x) + (1 - \alpha) f_n(x) \qquad (1)$$

Denote by $poster(x)$ the posterior probability that $x \sim f_u$ is sampled from $f_p$ rather than $f_n$. This posterior probability can be computed using the Bayes rule, provided the priors are identified:

$$poster(x) = \frac{\alpha f_p(x)}{f_u(x)} \qquad (2)$$
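With known densities, (1) and (2) are direct to evaluate. A sketch with assumed Gaussian components and an assumed proportion $\alpha = 0.3$:

```python
import numpy as np
from scipy.stats import norm

alpha = 0.3               # assumed proportion of Positives in Unlabeled
f_p = norm(2, 1).pdf      # assumed Positive density
f_n = norm(-2, 1).pdf     # assumed Negative density

def f_u(x):
    # eq. (1): the Unlabeled density is a mixture
    return alpha * f_p(x) + (1 - alpha) * f_n(x)

def poster(x):
    # eq. (2): Bayes rule with known prior alpha
    return alpha * f_p(x) / f_u(x)

# At x = 0 the two components are equally likely, so the posterior equals alpha
print(poster(0.0), poster(3.0), poster(-3.0))
```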
2.1 True Proportions are Unidentifiable
The goal of Mixture Proportions Estimation is to estimate the proportions of the mixing components in U, given the samples from the distributions $f_p$ and $f_u$ respectively. The problem is fundamentally ill-posed even if the distributions $f_p$ and $f_u$ are known. Indeed, a valid estimate of $\alpha$ is any $\tilde{\alpha}$ (tilde denotes an estimate) that fits the following constraint from (1):

$$\tilde{\alpha} f_p(x) \le f_u(x), \quad \forall x \qquad (3)$$

This constraint simply means that a mixing component cannot exceed the mixture itself. In other words, the true proportion $\alpha$ is generally unidentifiable, as it might be any value in the range $[0, \alpha^*]$. However, the upper bound $\alpha^*$ of the range can be identified directly from (3):

$$\alpha^* = \inf_x \frac{f_u(x)}{f_p(x)} \qquad (4)$$
For this reason, estimation of $\alpha^*$ rather than of the true proportion $\alpha$ should be considered the objective of Mixture Proportions Estimation. (Surprisingly, $\alpha$ and not $\alpha^*$ has traditionally been used as the ground truth. We discuss this issue in Subsection 5.2.) Denote by $poster^*(x)$ the posteriors corresponding to $\alpha^*$:

$$poster^*(x) = \frac{\alpha^* f_p(x)}{f_u(x)} \qquad (5)$$
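In the population case, (4) and (5) can be approximated numerically by taking the infimum over a grid. The Gaussian components, the proportion 0.3, and the grid are illustrative assumptions; here the components separate well, so $\alpha^*$ essentially coincides with the true proportion:

```python
import numpy as np
from scipy.stats import norm

alpha = 0.3                                       # assumed true proportion
f_p = norm(2, 1).pdf
f_n = norm(-2, 1).pdf
f_u = lambda x: alpha * f_p(x) + (1 - alpha) * f_n(x)

grid = np.linspace(-6, 10, 4001)
alpha_star = np.min(f_u(grid) / f_p(grid))        # eq. (4): infimum of the density ratio
poster_star = alpha_star * f_p(grid) / f_u(grid)  # eq. (5) evaluated on the grid
print(alpha_star, poster_star.max())
```

Note that $poster^*$ reaches 1 at the infimum point, which is exactly why $\alpha^*$ is the largest valid proportion.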
2.2 Non-Traditional Classifier
Here we discuss how a Non-Traditional Classifier (NTC) may be useful for Mixture Proportions Estimation and PU Classification. Define $y(x)$ as the following likelihood proportion:

$$y(x) = \frac{f_p(x)}{f_p(x) + f_u(x)} \qquad (6)$$

Define NTC as a function $\tilde{y}(x)$ that estimates $y(x)$:

$$\tilde{y}(x) \approx y(x) \qquad (7)$$

In practice, NTC is a balanced binary classifier trained on the Positive and Unlabeled samples.
By definition (6), the proportion (4) and the posteriors (5) can be estimated through $y(x)$:

$$\alpha^* = \inf_x \frac{1 - y(x)}{y(x)} \qquad (8)$$

$$poster^*(x) = \alpha^* \frac{y(x)}{1 - y(x)} \qquad (9)$$
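Equations (8) and (9) are simple element-wise transformations of NTC outputs. A sketch on assumed toy predictions:

```python
import numpy as np

def alpha_star_from_ntc(y):
    # eq. (8): alpha* = inf over x of (1 - y(x)) / y(x), for y = f_p / (f_p + f_u)
    return np.min((1 - y) / y)

def posteriors_from_ntc(y, alpha):
    # eq. (9): poster(x) = alpha * y(x) / (1 - y(x))
    return alpha * y / (1 - y)

# Assumed NTC predictions on an Unlabeled sample, for illustration only
y = np.array([0.8, 0.5, 0.3, 0.7])
a = alpha_star_from_ntc(y)        # min of (0.25, 1.0, 2.33, 0.43) = 0.25
post = posteriors_from_ntc(y, a)  # the point attaining the inf gets posterior 1
print(a, post)
```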
Directly applying (8) and (9) to the output of NTC has been considered by Elkan and Noto (2008) and is referred to as the EN method. We, however, go one step further: we treat $\tilde{y}(x)$ as a random variable.

Let $\tilde{f}_p(y)$, $\tilde{f}_n(y)$, $\tilde{f}_u(y)$ be the probability density functions of the distributions of $\tilde{y}(x)$ with $x \sim f_p$, $x \sim f_n$, and $x \sim f_u$ respectively. Equation (9) shows that the posteriors are unambiguously and monotonically related to $y(x)$. In particular:

$$\frac{f_p(x)}{f_u(x)} = \frac{\tilde{f}_p(\tilde{y}(x))}{\tilde{f}_u(\tilde{y}(x))} \qquad (10)$$

This property is crucial. It means that $\tilde{y}(x)$ is equivalent to $x$ for computation of the posteriors. Combining this with (5), we arrive at the following proposition:

$$poster^*(x) = \alpha^* \frac{\tilde{f}_p(\tilde{y}(x))}{\tilde{f}_u(\tilde{y}(x))} \qquad (11)$$
As equality of the posteriors leads to equality of the priors, (11) leads to:

$$\alpha^* = \inf_y \frac{\tilde{f}_u(y)}{\tilde{f}_p(y)} \qquad (12)$$

Above we keep in mind that:

$$\tilde{f}_u(y) = \alpha \tilde{f}_p(y) + (1 - \alpha) \tilde{f}_n(y) \qquad (13)$$
Equations (12) and (11) allow us to estimate $\alpha^*$ and $poster^*(x)$ using the distributions of the NTC predictions instead of the initial distributions $f_p$ and $f_u$. This provides certain benefits that are discussed in the next section. Note that all the presented equations are correct only for the true distributions. In practice, only the Positive and Unlabeled samples are available. Consequently, in order to apply (11) and (12), both $\tilde{y}(x)$ and the densities $\tilde{f}_p$ and $\tilde{f}_u$ need to be approximated, which leads to a number of issues. These are discussed in the next section.
Also note that most of the equations have been formulated and proven in the literature: analogs of (4) in Blanchard, Lee, and Scott (2010); Scott, Blanchard, and Handy (2013); Jain et al. (2016); analogs of (8) and (9) in Elkan and Noto (2008); an analog of (12) in Jain et al. (2016). However, to the best of our knowledge, (10) and (11) are introduced here for the first time.
3 Algorithm Development
In this section we propose DEDPUL to solve both problems of Mixture Proportions Estimation and PU Classification. The method is summarized in Algorithm 1 and illustrated in Figure 1, while the secondary functions are presented in Algorithm 2.
In the previous section, we have already discussed how to solve the problems of Mixture Proportions Estimation and PU Classification in the case of explicitly known distributions $f_p$ and $f_u$, using (4) and (5). However, this is rarely the case: only the Positive and Unlabeled samples are usually available. Can we use these samples to estimate the densities $f_p$ and $f_u$ and still apply (4) and (5)? Formally, the answer is 'yes'. In practice, however, two crucial issues may arise. Below we formulate these issues and propose solutions, which eventually results in DEDPUL.
The first issue is that the performance of density estimation methods rapidly decreases as the dimensionality of the distribution increases Liu, Lafferty, and Wasserman (2007), which makes estimation of high-dimensional densities difficult. The issue is known as the 'curse of dimensionality'. It may be resolved with a preliminary procedure that reduces the dimensionality of the data. To this end, we propose the NTC transformation (7). After applying this transformation, we may estimate the one-dimensional densities $\tilde{f}_p$ and $\tilde{f}_u$ of the NTC predictions instead of the $m$-dimensional densities $f_p$ and $f_u$. Then, (4) and (5) are replaced with (12) and (11) respectively. Note that the choice of NTC is flexible and depends on the data. In our experiments we use Kernel Density Estimation to estimate $\tilde{f}_p$ and $\tilde{f}_u$. An alternative is to use methods like Kanamori, Hido, and Sugiyama (2009) to directly estimate the density ratio; however, this approach has shown inferior performance. Still, the problem of optimal bandwidth selection remains unsolved.
The second issue is that (12) may systematically underestimate $\alpha^*$, as it relies solely on the infimum point. The reason is noise in the estimates of the NTC predictions and of their densities $\tilde{f}_p$ and $\tilde{f}_u$. To resolve this issue, we propose two alternative estimates.

The first alternative is based on the probabilistic rule that the priors are equal to the expected posteriors. The proposed estimate is the prior probability that equals the mean posterior probability over the Unlabeled sample. If such an estimate exists, it can be identified with an iterative EM algorithm. On the E-step, the posterior probabilities are estimated with (11) using the current estimate of the priors. On the M-step, the prior probability is updated as the mean of the posteriors. The (non-zero) estimate may not exist. In this case we propose the second alternative, which is the estimate where the slope of a specific function changes the most. This function is the difference between the priors and the mean posteriors.
The two approaches are motivated by the two empirically observed behaviors of this function, which mainly differ in whether it crosses zero or not. These are illustrated in the bottom-center part of Figure 1. The implementations of the two approaches are EM and max_slope in Algorithm 2.
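The EM alternative can be sketched as follows. This is a toy that assumes the density ratio of the NTC predictions on the Unlabeled sample has already been estimated; capping the posteriors at 1 is our assumption for the sketch:

```python
import numpy as np

def em_alpha(ratio, n_iter=500):
    """EM iteration for the prior.

    ratio[i] ~ tilde f_u(y_i) / tilde f_p(y_i) on the Unlabeled sample.
    E-step: posteriors via eq. (11); M-step: prior <- mean posterior.
    """
    alpha = 0.5
    for _ in range(n_iter):
        post = np.minimum(alpha / ratio, 1.0)  # E-step, capped at 1
        alpha = post.mean()                    # M-step
    return alpha

# Assumed toy density ratios on the Unlabeled sample; the fixed point here is 2/3
ratio = np.array([0.5, 0.5, 2.0, 2.0])
print(em_alpha(ratio))
```

For these values the update is a contraction (alpha <- 0.5 + alpha/4 once the capped points saturate), so the iteration converges to 2/3.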
Note that the task DEDPUL solves is classification of the Unlabeled sample. Such a problem formulation is known as transductive. In a more general inductive formulation, the task is to build a classifier able to evaluate any new data rather than the Unlabeled sample specifically. To achieve this, the output of the method could be either linearly interpolated over the feature space or substituted into a loss function to train a new classifier, as proposed in Elkan and Noto (2008).

4 Related Work
In this section we provide a brief overview of PU Learning methods and relate DEDPUL to the existing literature. For a more detailed overview, see Bekker and Davis (2018c).
Early PU learning methods mostly concerned text classification and were heuristic by nature
Liu et al. (2002); Yu, Han, and Chang (2002); Li and Liu (2003). The strategy behind these methods was (i) to identify Reliably Negative (RN) instances in U and (ii) to train a traditional classifier on the P and RN samples. The drawbacks of such a strategy are obvious: on the one hand, a large and potentially useful subsample of U is simply ignored; on the other hand, RN may still be contaminated with P. In 2003, two studies considered a different strategy: to adapt logistic regression Lee and Liu (2003) and SVM Liu et al. (2003) to the PU setting by changing their loss functions. These methods successfully outperformed the heuristic approach.

The paper of Elkan and Noto (2008) is often considered a milestone of PU Classification. The authors proposed two PU classification methods. First, they introduced the notion of NTC and algebraically connected its predictions with the posteriors through (9). Second, they considered Unlabeled data as simultaneously Positive and Negative, weighted with opposite weights. By introducing these weights into the loss function, any PN classifier may be learned directly from PU data. Disappointingly, these weights are exactly the posteriors, meaning that the answer is required in order to implement the method. Nevertheless, this general idea of loss function reweighting would later be adopted by the Risk Estimation framework du Plessis, Niu, and Sugiyama (2014); Du Plessis, Niu, and Sugiyama (2015a) and its latest non-negative modification Kiryo et al. (2017), which is currently considered state-of-the-art.
Most of the described methods require prior knowledge of the mixing proportions, which may be considered a bottleneck. Among those, only Elkan and Noto (2008) address this issue by proposing three ways to estimate the proportions (one of which is (8)). Fortunately, multiple studies focus solely on this problem, known as Mixture Proportions Estimation Sanderson and Scott (2014); du Plessis, Niu, and Sugiyama (2015b); Jain et al. (2016); Ramaswamy, Scott, and Tewari (2016); Bekker and Davis (2018b). The state-of-the-art method is KM Ramaswamy, Scott, and Tewari (2016), which is based on mean embeddings of the P and U samples into a reproducing kernel Hilbert space.
We now relate DEDPUL to the existing literature. For example, in Jain et al. (2016) a method of Mixture Proportions Estimation is proposed. The method is based on explicit estimation of the total likelihood of the P and U samples as a function of the proportion. The proposed estimate is the point where the slope of this function changes the most. This is similar to our estimation strategy. Furthermore, the paper proposes to use NTC as a transformation that reduces dimensionality while preserving the proportion. The idea to approach PU Learning with Density Estimation also appears in the literature. For instance, du Plessis, Niu, and Sugiyama (2015b) explicitly estimate the densities $f_p$ and $f_u$, while Charoenphakdee and Sugiyama (2018) estimate their ratio directly. Neither uses NTC in their framework. Next, Kato et al. (2018) fill the gap of proportions estimation in the Risk Estimation framework. They use an iterative EM-like algorithm to identify the proportion that is equal to the mean of the posteriors. A downside is the requirement to retrain the classifier at each step. Still, the approach is similar to our estimation strategy.
Some recent studies concern the question of how PU data is generated Jain et al. (2016); Jain, White, and Radivojac (2016); Bekker and Davis (2018a). Most methods, including DEDPUL, either explicitly or implicitly assume that the distributions of labeled and unlabeled Positives coincide. From the data generation perspective, this can be formulated as the Selected Completely At Random assumption: the probability of a Positive instance being labeled is a constant and does not depend on $x$. A more general alternative to this is the Selected At Random assumption, which allows the labeling probability to be a function of $x$, called the propensity score Bekker and Davis (2018a).
5 Experimental Procedure and Results
We conduct experiments on synthetic and real-world data sets to evaluate the performance of the algorithms (DEDPUL, EN, KM, nnRE). We consider Mixture Proportions Estimation and PU Classification as separate problems and measure performance on them independently. The Mixture Proportions Estimation algorithms try to identify the proportions, while the PU Classification algorithms receive the proportions as input and try to classify the Unlabeled sample. The algorithms are tested on numerous data sets that differ in the initial distributions of the mixing components, in their proportions, and in the extent of their intersection. Additionally, each of these experiments is repeated 10 times for different randomly drawn samples. The algorithms are compared pairwise, and significance is verified using the paired Wilcoxon signed-rank test with a 0.01 p-value threshold.
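The pairwise comparison might be carried out as follows (a sketch; the paired error arrays are toy numbers, not experimental results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired absolute errors of two algorithms over 10 repeated runs (toy values)
errors_a = np.array([0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.02, 0.05, 0.01, 0.03])
errors_b = np.array([0.06, 0.07, 0.05, 0.08, 0.06, 0.09, 0.05, 0.10, 0.04, 0.07])

# Paired Wilcoxon signed-rank test on the per-run differences
stat, p = wilcoxon(errors_a, errors_b)
significant = p < 0.01  # the threshold used in the paper
print(p, significant)
```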
5.1 Data
In the synthetic setting we experiment with mixtures of one-dimensional Laplace distributions. We fix the Positive distribution as a Laplace distribution and mix it with several Laplace distributions that differ in location and scale. For each of these cases, the proportion is varied in {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99}. The sizes of the Positive and Unlabeled samples are fixed as 500 and 2500 respectively.
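Such a synthetic PU sample can be generated as follows. The particular Laplace parameters for the Negative component are an assumed example, since the paper varies them:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pu_sample(alpha, n_p=500, n_u=2500):
    """Toy PU data: Positives ~ Laplace(0, 1); Negatives ~ Laplace(4, 1) (assumed).

    Returns a labeled Positive sample and an Unlabeled sample containing
    a fraction alpha of latent Positives.
    """
    x_p = rng.laplace(0.0, 1.0, n_p)
    n_pos = int(alpha * n_u)
    x_u = np.concatenate([rng.laplace(0.0, 1.0, n_pos),
                          rng.laplace(4.0, 1.0, n_u - n_pos)])
    return x_p, x_u

x_p, x_u = make_pu_sample(alpha=0.25)
print(len(x_p), len(x_u))  # matches the sample sizes used in the paper
```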
In the real-world setting we experiment with eight data sets from the UCI machine learning repository Bache and Lichman (2013) and with the MNIST data set of handwritten digits LeCun, Cortes, and Burges (2010) (Table 1). The proportions are varied in {0.05, 0.25, 0.5, 0.75, 0.95}. The Positive and Unlabeled samples are randomly drawn from the data sets in a stratified manner to satisfy these proportions. The joint size of the samples does not exceed 5000. Categorical features are transformed into numerical features with dummy encoding. Numerical features are normalized.

data set | size | features | P | N
--- | --- | --- | --- | ---
bank | 45211 | 16 | yes | no
concrete | 1030 | 8 | (35.8, 82.6) | (2.3, 35.8)
landsat | 6435 | 36 | 4, 5, 7 | 1, 2, 3
mushroom | 8124 | 22 | p | e
pageblock | 5473 | 10 | 2, 3, 4, 5 | 1
shuttle | 58000 | 9 | 2, 3, 4, 5, 6, 7 | 1
spambase | 4601 | 57 | 1 | 0
wine | 6497 | 12 | red | white
mnist | 70000 | 784 | 1, 3, 5, 7, 9 | 0, 2, 4, 6, 8
5.2 Measures for Performance Evaluation
The synthetic setting provides a straightforward way to evaluate performance. Since the underlying distributions $f_p$ and $f_u$ are known, we calculate the true values of the proportions and the posteriors using (4) and (5) respectively. Then, we directly compare these with the algorithms' estimates using mean absolute errors (Table 2, row 1). In the real-world setting the distributions of the data are unknown, and the straightforward performance measures of the synthetic setting are unavailable. Here, for Mixture Proportions Estimation we use a similar measure but substitute $\alpha^*$ with $\alpha$, while for PU Classification we use accuracy (Table 2, row 2).
setting | Mixture Proportions Estimation | PU Classification
--- | --- | ---
synthetic | $|\tilde{\alpha} - \alpha^*|$ | mean absolute error of the posteriors
real-world | $|\tilde{\alpha} - \alpha|$ | accuracy
Note that such a measure of proportion estimation in real-world data favors the algorithms that consistently underestimate the proportion in the case when $\alpha < \alpha^*$. In this sense, synthetic experiments are more reliable due to the ability to compare directly to $\alpha^*$. Surprisingly, we do not know a single paper that takes this into account, as $\alpha$ and not $\alpha^*$ has traditionally been used as the ground truth for proportion estimation.
5.3 Implementations of Algorithms
DEDPUL is implemented according to Algorithms 1 and 2. As NTC we use an ensemble of 10 neural networks with 1 layer of 32 to 512 neurons, depending on the data. We recommend training the networks on logistic loss with a high learning rate. The predictions of each network are obtained with 3-fold cross-validation and are averaged over the ensemble. Densities of the predictions are computed using Kernel Density Estimation with Gaussian kernels. Instead of the raw predictions, we estimate the densities of their appropriately ranged transformations and apply the corresponding post-transformations. Bandwidths are chosen heuristically as 0.1 and 0.05 for the Positive and Unlabeled predictions respectively. The threshold in monotonize is chosen heuristically.

Elkan-Noto (EN) is implemented as in Elkan and Noto (2008). The paper proposes the posteriors' estimator (9) and three proportions' estimators e1, e2, and e3, where e3 is analogous to (8). We use e3 and e1 in the synthetic and real-world settings respectively. Predictions are obtained with the same NTC as in DEDPUL.
The Kernel Mean based gradient thresholding algorithm (KM1 and KM2) is retrieved from the original paper Ramaswamy, Scott, and Tewari (2016) (http://web.eecs.umich.edu/~cscott/code.html#kmpe). We provide experimental results for KM2, which is the better performing version. As advised, the MNIST data is reduced to 50 dimensions with Principal Component Analysis.
We explore two versions of the non-negative Risk Estimation (nnRE) algorithm Kiryo et al. (2017). In the original nnRE-sigmoid the networks are trained on sigmoid loss, while in nnRE-brier we propose to use Brier loss instead. As the classifier we use an ensemble of 10 neural networks with 3 layers of 32 to 256 neurons, depending on the data. The remaining hyperparameters are chosen heuristically.
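The non-negative risk that both nnRE versions minimize can be sketched in numpy. This is an illustration of the objective, not the training loop; the predictions are assumed values in [0, 1], and Brier loss is plugged in as in nnRE-brier:

```python
import numpy as np

def brier(pred, target):
    # Brier loss: squared error between predictions in [0, 1] and a binary target
    return np.mean((pred - target) ** 2)

def nn_risk(pred_p, pred_u, pi_p, loss=brier):
    """Non-negative PU risk in the spirit of Kiryo et al. (2017).

    pi_p is the (assumed known) proportion of Positives in the Unlabeled data.
    The max(0, .) clips the estimated Negative risk, which would otherwise
    be able to go negative on finite samples.
    """
    risk_p_pos = loss(pred_p, 1.0)  # Positives treated as positives
    risk_p_neg = loss(pred_p, 0.0)  # Positives treated as negatives
    risk_u_neg = loss(pred_u, 0.0)  # Unlabeled treated as negatives
    return pi_p * risk_p_pos + max(0.0, risk_u_neg - pi_p * risk_p_neg)

# Assumed classifier outputs on the Positive and Unlabeled samples
pred_p = np.array([0.9, 0.8, 0.95])
pred_u = np.array([0.1, 0.2, 0.9, 0.15])
print(nn_risk(pred_p, pred_u, pi_p=0.25))
```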
5.4 Experimental results
Experimental results are presented in Figures 2, 3, 4, and 5. The following conclusions may be made. (i) DEDPUL significantly outperforms both the baseline EN and state-of-the-art KM algorithms in Mixture Proportions Estimation in both synthetic and real-world settings (Fig. 2, 3). (ii) DEDPUL significantly outperforms both the baseline EN and state-of-the-art nnRE algorithms in Positive-Unlabeled Classification in both settings (Fig. 4, 5). (iii) The proposed modification nnRE-brier significantly outperforms the originally proposed nnRE-sigmoid in both settings (Fig. 4, 5).
6 Conclusion
We propose DEDPUL, a method that simultaneously solves the problems of Positive-Unlabeled Classification and Mixture Proportions Estimation. The validity of the method is shown through an extensive empirical investigation. The method is also justified theoretically through (11). Still, some questions remain open. For instance, it is yet unclear what distinguishes the cases when the non-zero EM estimate does not exist and why such cases happen. Formal proofs of consistency of the proposed estimates would also be valuable. Next, it is yet unclear how to tune hyperparameters such as bandwidths during density estimation. Finally, several extensions of DEDPUL could be explored, such as extensions to multi-class classification, to the case when both samples are mutually contaminated, and to the case when all three of the Positive, Negative, and Unlabeled samples are available. Application of the method to corruption detection in procurement auctions will be the subject of our future research.
Acknowledgements
I sincerely thank Ksenia Balabaeva and Iskander Safiulin for regular revisions of the paper and for brainstorming the method’s name; Vitalia Eliseeva, Alexander Nesterov, and Alexander Sirotkin for revisions of the later versions. Support from the Basic Research Program of the National Research University Higher School of Economics is gratefully acknowledged.
References
 Bache and Lichman [2013] Bache, K., and Lichman, M. 2013. UCI machine learning repository.
 Bekker and Davis [2018a] Bekker, J., and Davis, J. 2018a. Beyond the selected completely at random assumption for learning from positive and unlabeled data. arXiv preprint arXiv:1809.03207.

 Bekker and Davis [2018b] Bekker, J., and Davis, J. 2018b. Estimating the class prior in positive and unlabeled data through decision tree induction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
 Bekker and Davis [2018c] Bekker, J., and Davis, J. 2018c. Learning from positive and unlabeled data: A survey. arXiv preprint arXiv:1811.04820.

 Blanchard, Lee, and Scott [2010] Blanchard, G.; Lee, G.; and Scott, C. 2010. Semi-supervised novelty detection. Journal of Machine Learning Research 11(Nov):2973–3009.
 Charoenphakdee and Sugiyama [2018] Charoenphakdee, N., and Sugiyama, M. 2018. Positive-unlabeled classification under class prior shift and asymmetric error. arXiv preprint arXiv:1809.07011.
 Claesen et al. [2015] Claesen, M.; De Smet, F.; Gillard, P.; Mathieu, C.; and De Moor, B. 2015. Building classifiers to predict the start of glucose-lowering pharmacotherapy using Belgian health expenditure data. arXiv preprint arXiv:1504.07389.
 du Plessis, Niu, and Sugiyama [2014] du Plessis, M. C.; Niu, G.; and Sugiyama, M. 2014. Analysis of learning from positive and unlabeled data. In Advances in neural information processing systems, 703–711.
 Du Plessis, Niu, and Sugiyama [2015a] Du Plessis, M.; Niu, G.; and Sugiyama, M. 2015a. Convex formulation for learning from positive and unlabeled data. In International Conference on Machine Learning, 1386–1394.
 du Plessis, Niu, and Sugiyama [2015b] du Plessis, M. C.; Niu, G.; and Sugiyama, M. 2015b. Class-prior estimation for learning from positive and unlabeled data. In ACML, 221–236.
 Elkan and Noto [2008] Elkan, C., and Noto, K. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 213–220. ACM.
 Jain et al. [2016] Jain, S.; White, M.; Trosset, M. W.; and Radivojac, P. 2016. Nonparametric semisupervised learning of class proportions. arXiv preprint arXiv:1601.01944.
 Jain, White, and Radivojac [2016] Jain, S.; White, M.; and Radivojac, P. 2016. Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in Neural Information Processing Systems, 2693–2701.
 Kanamori, Hido, and Sugiyama [2009] Kanamori, T.; Hido, S.; and Sugiyama, M. 2009. A least-squares approach to direct importance estimation. Journal of Machine Learning Research 10:1391–1445.
 Kato et al. [2018] Kato, M.; Xu, L.; Niu, G.; and Sugiyama, M. 2018. Alternate estimation of a classifier and the classprior from positive and unlabeled data. arXiv preprint arXiv:1809.05710.
 Kiryo et al. [2017] Kiryo, R.; Niu, G.; du Plessis, M. C.; and Sugiyama, M. 2017. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, 1675–1685.
 LeCun, Cortes, and Burges [2010] LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2.
 Lee and Liu [2003] Lee, W. S., and Liu, B. 2003. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, volume 3, 448–455.
 Li and Liu [2003] Li, X., and Liu, B. 2003. Learning to classify texts using positive and unlabeled data. In IJCAI, volume 3, 587–592.
 Liu et al. [2002] Liu, B.; Lee, W. S.; Yu, P. S.; and Li, X. 2002. Partially supervised classification of text documents. In ICML, volume 2, 387–394. Citeseer.
 Liu et al. [2003] Liu, B.; Dai, Y.; Li, X.; Lee, W. S.; and Yu, P. S. 2003. Building text classifiers using positive and unlabeled examples. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 179–186. IEEE.
 Liu, Lafferty, and Wasserman [2007] Liu, H.; Lafferty, J.; and Wasserman, L. 2007. Sparse nonparametric density estimation in high dimensions using the rodeo. In Artificial Intelligence and Statistics, 283–290.
 Ramaswamy, Scott, and Tewari [2016] Ramaswamy, H.; Scott, C.; and Tewari, A. 2016. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, 2052–2060.
 Ren, Ji, and Zhang [2014] Ren, Y.; Ji, D.; and Zhang, H. 2014. Positive unlabeled learning for deceptive reviews detection. In EMNLP, 488–498.
 Sanderson and Scott [2014] Sanderson, T., and Scott, C. 2014. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, 850–858.
 Scott, Blanchard, and Handy [2013] Scott, C.; Blanchard, G.; and Handy, G. 2013. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference On Learning Theory, 489–511.
 Yang et al. [2012] Yang, P.; Li, X.L.; Mei, J.P.; Kwoh, C.K.; and Ng, S.K. 2012. Positiveunlabeled learning for disease gene identification. Bioinformatics 28(20):2640–2647.
 Yu, Han, and Chang [2002] Yu, H.; Han, J.; and Chang, K. C.-C. 2002. PEBL: Positive example based learning for web page classification using SVM. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, 239–248. New York, NY, USA: ACM.