DEDPUL: Method for Mixture Proportion Estimation and Positive-Unlabeled Classification based on Density Estimation

02/19/2019
by   Dmitry Ivanov, et al.
Higher School of Economics

This paper studies Positive-Unlabeled Classification, the problem of semi-supervised binary classification in the case when the Negative (N) class in the training set is contaminated with instances of the Positive (P) class. We develop a novel method (DEDPUL) that simultaneously solves two problems concerning the contaminated Unlabeled (U) sample: it estimates the proportions of the mixing components (P and N) in U, and it classifies U. Conducting experiments on synthetic and real-world data, we favorably compare DEDPUL with current state-of-the-art methods for both problems. We introduce an automatic procedure for DEDPUL hyperparameter optimization. Additionally, we improve two methods from the literature and achieve DEDPUL-level performance with one of them.


1 Introduction

Positive-Unlabeled (PU) Learning is a semi-supervised analog of binary classification. Unlike the latter, PU Classification does not require labeled samples from both classes for training. Instead, two samples are required: a labeled sample from the Positive class, and an Unlabeled sample with mixed data from both the Positive and Negative classes in generally unknown mixing proportions. The objective is to classify the Unlabeled sample, which first requires identifying the mixing proportions.

PU Classification naturally arises in numerous real-world cases where obtaining labeled data from both classes is either complicated or impossible. It is applied in text analysis, when the objective is to detect fake reviews or spam and only some non-fake documents are labeled Ren, Ji, and Zhang (2014); in medicine, when the objective is early diagnosis of type 2 diabetes Claesen et al. (2015); and in bioinformatics, when the objective is to expand a database of disease genes Yang et al. (2012).

We propose a transparent non-parametric method named Difference-of-Estimated-Densities-based Positive-Unlabeled Learning, or DEDPUL. The method simultaneously estimates the proportions of the mixing components in the Unlabeled sample and classifies it. This is unlike the current state-of-the-art method Kiryo et al. (2017), which requires the proportions to be identified in advance. DEDPUL adheres to the following two-step strategy.

In the first step, a Non-Traditional Classifier (NTC) is constructed Elkan and Noto (2008). An NTC is any classifier trained to distinguish the Positive and Unlabeled samples. During training, the NTC simply treats the Unlabeled data as Negative, which clearly leads to biased estimates. The second step eliminates this bias using explicit estimation of the densities of the NTC predictions for both the Positive and Unlabeled samples.
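To make the first step concrete, here is a minimal sketch of an NTC (our illustration with a simple off-the-shelf classifier; the paper itself uses an ensemble of neural networks, described in Subsection 5.3):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_ntc(x_pos, x_unl):
    """Non-Traditional Classifier: distinguish P from U, treating U as Negative."""
    x = np.concatenate([x_pos, x_unl])
    y = np.concatenate([np.ones(len(x_pos)), np.zeros(len(x_unl))])
    # 'balanced' weights approximate the balanced classifier assumed in eq. (6).
    clf = LogisticRegression(max_iter=1000, class_weight='balanced')
    # Out-of-fold predictions avoid scoring instances the model was fit on.
    preds = cross_val_predict(clf, x, y, cv=3, method='predict_proba')[:, 1]
    return preds[:len(x_pos)], preds[len(x_pos):]  # NTC predictions on P and U
```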

The paper makes several contributions:

  • We propose DEDPUL and empirically show that it outperforms current state-of-the-art methods for both Mixture Proportions Estimation Ramaswamy, Scott, and Tewari (2016) and PU Classification Kiryo et al. (2017).

  • We propose equation (11), which connects the posterior probability that an Unlabeled instance is a latent Positive to the density ratio of NTC predictions for the two samples. This serves as theoretical justification of DEDPUL.

  • We improve the state-of-the-art non-negative Risk Estimation method Kiryo et al. (2017) by using Brier loss instead of the originally proposed sigmoid loss while training neural networks. (Sigmoid and Brier loss functions are the mean absolute and mean squared errors between the binary classifier's predictions and the correct labels.)

The rest of the paper is organized as follows. Section 2 introduces the notation, formally defines the problems, and solves them in the ideal case of known densities. Section 3 proposes DEDPUL. Section 4 briefly summarizes the history of PU Classification and relates DEDPUL to the existing literature. Section 5 describes the experimental procedure and results. Section 6 concludes.

2 Problem Setup and Notation

In this section we cover the population case, when the densities of the Positive and Unlabeled distributions are known in advance. We introduce the relevant notation and formally define the problems of Mixture Proportions Estimation and PU Classification. At the end of the section we propose equation (11).

Let $f_p(x)$, $f_n(x)$, $f_u(x)$ be the probability density functions of the Positive (P), Negative (N), and Unlabeled (U) distributions of $x$, where $x$ is a vector in an $m$-dimensional feature space $\mathbb{R}^m$. Let $\alpha$ be the proportion of $f_p$ in $f_u$:

$$f_u(x) = \alpha f_p(x) + (1 - \alpha) f_n(x) \qquad (1)$$

Denote as $p(x \in P)$ the posterior probability that $x$ is sampled from $f_p$ rather than $f_n$. This posterior probability can be computed using the Bayes rule, provided the priors are identified:

$$p(x \in P) = \alpha \frac{f_p(x)}{f_u(x)} \qquad (2)$$
2.1 True Proportions are Unidentifiable

The goal of Mixture Proportions Estimation is to estimate the proportion $\alpha$ of the mixing components in U, given the samples $X_p$ and $X_u$ from the distributions $f_p$ and $f_u$ respectively. The problem is fundamentally ill-posed even if the distributions $f_p$ and $f_u$ are known. Indeed, a valid estimate of $\alpha$ is any $\tilde{\alpha}$ (tilde denotes estimate) that fits the following constraint from (1):

$$\forall x: \ \tilde{\alpha} f_p(x) \le f_u(x) \qquad (3)$$

This constraint simply means that a mixing component cannot exceed the mixture itself. In other words, the true proportion $\alpha$ is generally unidentifiable, as it might be any value in the range $[0, \alpha^*]$. However, the upper bound $\alpha^*$ of the range can be identified directly from (3):

$$\alpha^* = \inf_x \frac{f_u(x)}{f_p(x)} \qquad (4)$$

For this reason, estimation of $\alpha^*$ rather than of the true proportion $\alpha$ should be considered as the objective of Mixture Proportions Estimation. (Surprisingly, $\alpha$ and not $\alpha^*$ has traditionally been used as the ground truth. We discuss this issue in Subsection 5.2.) Denote as $p^*(x \in P)$ the posteriors corresponding to $\alpha^*$:

$$p^*(x \in P) = \alpha^* \frac{f_p(x)}{f_u(x)} \qquad (5)$$

2.2 Non-Traditional Classifier

Here we discuss how a Non-Traditional Classifier (NTC) may be useful for Mixture Proportions Estimation and PU Classification. Define $y^*(x)$ as the following likelihood proportion:

$$y^*(x) = \frac{f_p(x)}{f_p(x) + f_u(x)} \qquad (6)$$

Define NTC as a function $\tilde{y}(x)$ that estimates $y^*(x)$:

$$\tilde{y}(x) \approx y^*(x) \qquad (7)$$

In practice, NTC is a balanced binary classifier trained on the samples $X_p$ and $X_u$.

By definition (6), the proportion (4) and the posteriors (5) can be estimated through $y^*(x)$:

$$\alpha^* = \inf_x \frac{1 - y^*(x)}{y^*(x)} \qquad (8)$$
$$p^*(x \in P) = \alpha^* \frac{y^*(x)}{1 - y^*(x)} \qquad (9)$$

Directly applying (8) and (9) to the output of NTC has been considered by Elkan and Noto (2008) and is referred to as the EN method. We, however, go one step further: we treat $\tilde{y}(x)$ as a random variable.

Let $\tilde{f}_p(\tilde{y})$, $\tilde{f}_n(\tilde{y})$, $\tilde{f}_u(\tilde{y})$ be the probability density functions of the distributions of $\tilde{y}(x)$ with $x \sim f_p$, $x \sim f_n$, and $x \sim f_u$ respectively. Equation (9) shows that the posteriors are unambiguously and monotonically related to $y^*(x)$. In particular:

$$\forall x_1, x_2: \ y^*(x_1) = y^*(x_2) \implies p^*(x_1 \in P) = p^*(x_2 \in P) \qquad (10)$$

This property is crucial. It means that $y^*(x)$ is equivalent to $x$ for computation of the posteriors. Combining this with (5), we arrive at the following proposition:

$$p^*(x \in P) = \alpha^* \frac{\tilde{f}_p(\tilde{y}(x))}{\tilde{f}_u(\tilde{y}(x))} \qquad (11)$$

As equality of the posteriors leads to equality of the priors, (11) leads to:

$$\alpha^* = \inf_{\tilde{y}} \frac{\tilde{f}_u(\tilde{y})}{\tilde{f}_p(\tilde{y})} \qquad (12)$$

Above we keep in mind that:

$$\tilde{f}_u(\tilde{y}) = \alpha \tilde{f}_p(\tilde{y}) + (1 - \alpha) \tilde{f}_n(\tilde{y}) \qquad (13)$$

Equations (12) and (11) allow us to estimate $\alpha^*$ and $p^*(x \in P)$ using the distributions of the NTC predictions instead of the initial distributions $f_p$ and $f_u$. This provides certain benefits that are discussed in the next section. Note that all the presented equations are correct only for the true distributions. In practice, only the samples $X_p$ and $X_u$ are available. Consequently, in order to apply (11) and (12), both $\tilde{y}(x)$ and its densities $\tilde{f}_p$ and $\tilde{f}_u$ need to be approximated, which leads to a number of issues, also discussed in the next section.

Also note that most of the equations have been formulated and proven in the literature: analogs of (4) in Blanchard, Lee, and Scott (2010); Scott, Blanchard, and Handy (2013); Jain et al. (2016); analogs of (8) and (9) in Elkan and Noto (2008); (12) in Jain et al. (2016). However, to the best of our knowledge, (10) and (11) are introduced here for the first time.
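To make the population case concrete, the following numerical sketch (our illustration, not part of the paper; all parameter choices are assumptions) computes $\alpha^*$ via (4) and via its NTC form (8) for a one-dimensional Laplace mixture with known densities:

```python
import numpy as np
from scipy.stats import laplace

alpha = 0.25                                 # true mixing proportion (assumed)
f_p = laplace(loc=0, scale=1).pdf            # Positive density
f_n = laplace(loc=4, scale=1).pdf            # Negative density
f_u = lambda x: alpha * f_p(x) + (1 - alpha) * f_n(x)  # Unlabeled mixture, eq. (1)

grid = np.linspace(-10, 15, 100001)          # dense grid standing in for inf over x
alpha_star = np.min(f_u(grid) / f_p(grid))   # eq. (4): inf_x f_u(x) / f_p(x)

y_star = f_p(grid) / (f_p(grid) + f_u(grid))       # likelihood proportion, eq. (6)
alpha_star_ntc = np.min((1 - y_star) / y_star)     # eq. (8)

print(alpha_star, alpha_star_ntc)  # both slightly above alpha, since alpha* >= alpha
```

The two estimates coincide exactly, illustrating that the NTC transformation preserves $\alpha^*$.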

3 Algorithm Development

Figure 1: DEDPUL pipeline

In this section we propose DEDPUL to solve both problems of Mixture Proportions Estimation and PU Classification. The method is summarized in Algorithm 1 and illustrated in Figure 1, while the secondary functions are presented in Algorithm 2.

1: Input: $X_p$, $X_u$
2: Output: $\tilde{\alpha}$, $\{\tilde{p}(x \in P) : x \in X_u\}$
3: $\tilde{y}_p, \tilde{y}_u \leftarrow \mathrm{NTC}(X_p, X_u)$ ▷ We use an ensemble of neural networks as NTC
4: $\tilde{f}_p, \tilde{f}_u \leftarrow \mathrm{KDE}(\tilde{y}_p), \mathrm{KDE}(\tilde{y}_u)$ ▷ Kernel Density Estimation
5: $r \leftarrow \tilde{f}_p(\tilde{y}_u) / \tilde{f}_u(\tilde{y}_u)$ ▷ Array of density ratios for each instance of $X_u$
6: $idx \leftarrow \mathrm{argsort}(\tilde{y}_u)$ ▷ Sort by $\tilde{y}_u$
7: $r \leftarrow r[idx]$ ▷ Sort predictions
8: $r \leftarrow \mathrm{monotonize}(r, t)$ ▷ This and other functions are defined in Algorithm 2
9: $r \leftarrow \mathrm{rolling\_median}(r, k)$
10: $r \leftarrow \mathrm{unsort}(r, idx)$ ▷ Return to initial unsorted order
11: $\tilde{\alpha}_{EM}, \tilde{P}_{EM} \leftarrow \mathrm{EM}(r)$
12: $\tilde{\alpha}_{slope}, \tilde{P}_{slope} \leftarrow \mathrm{max\_slope}(r)$
13: if $\tilde{\alpha}_{EM} > 0$ then
14:     Return $\tilde{\alpha}_{EM}$, $\tilde{P}_{EM}$
15: else
16:     Return $\tilde{\alpha}_{slope}$, $\tilde{P}_{slope}$
Algorithm 1 DEDPUL

In the previous section, we have already discussed how to solve the problems of Mixture Proportions Estimation and PU Classification in the case of explicitly known distributions $f_p$ and $f_u$ using (4) and (5). However, this is rarely the case: usually only the samples $X_p$ and $X_u$ are available. Can we use these samples to estimate the densities $f_p$ and $f_u$ and still apply (4) and (5)? Formally, the answer is 'yes'. In practice, however, two crucial issues may arise. Below we formulate these issues and propose solutions, which eventually results in DEDPUL.

1: function EM($r$)
2:     $\alpha \leftarrow \alpha_0$ ▷ Initialize
3:     repeat
4:         $\alpha_{old} \leftarrow \alpha$
5:         $P \leftarrow \min(1, \alpha \cdot r)$ ▷ E-step: posteriors (11) under current priors
6:         $\alpha \leftarrow \mathrm{mean}(P)$ ▷ M-step: priors as mean posteriors
7:     until $|\alpha - \alpha_{old}| < tol$
8:     Return $\alpha$, $P$
9: function max_slope($r$)
10:     $D \leftarrow [\,]$ ▷ Array of function values
11:     for $\alpha$ in range(start=0, end=1, step=$\delta$) do
12:         $P \leftarrow \min(1, \alpha \cdot r)$
13:         $D$.append($\alpha - \mathrm{mean}(P)$) ▷ difference between priors and mean posteriors
14:
15:     $\Delta \leftarrow [\,]$ ▷ Array of function second lags
16:     for i in range(start=1, end=len($D$) $-$ 1, step=1) do
17:         $\Delta$.append($D[i+1] - 2 D[i] + D[i-1]$)
18:     $i^* \leftarrow \arg\max_i \Delta[i]$ ▷ point where the slope changes the most
19:     $\alpha \leftarrow i^* \cdot \delta$
20:     $P \leftarrow \min(1, \alpha \cdot r)$
21:     Return $\alpha$, $P$
22: function monotonize($r$, $t$)
23:     for i in range(start=1, end=len($r$), step=1) do
24:         if $r[i] < r[i-1] - t$ then ▷ violation of monotonicity beyond threshold $t$
25:             $r[i] \leftarrow r[i-1]$
26:     Return $r$
27: function rolling_median($r$, $k$)
28:     $s \leftarrow \mathrm{copy}(r)$
29:     for i in range(start=0, end=len($r$), step=1) do
30:         $s[i] \leftarrow \mathrm{median}(r[\max(0, i-k) : \min(\mathrm{len}(r), i+k+1)])$
31:     Return $s$
Algorithm 2 Secondary functions for DEDPUL

The first issue is that the performance of density estimation methods rapidly decreases as the dimensionality of the distribution increases Liu, Lafferty, and Wasserman (2007), which makes estimation of high-dimensional densities difficult. The issue is known as the 'curse of dimensionality'. It may be resolved with a preliminary procedure that reduces the dimensionality of $X_p$ and $X_u$. To this end, we propose the NTC transformation (7). After applying this transformation, we may estimate the one-dimensional densities $\tilde{f}_p$ and $\tilde{f}_u$ of the NTC predictions instead of the $m$-dimensional densities $f_p$ and $f_u$. Then, (4) and (5) are replaced with (12) and (11) respectively. Note that the choice of NTC is flexible and depends on the data.

In our experiments we use Kernel Density Estimation to estimate $\tilde{f}_p$ and $\tilde{f}_u$. An alternative is to use methods like Kanamori, Hido, and Sugiyama (2009) to directly estimate the density ratio $\tilde{f}_p / \tilde{f}_u$ or even $f_p / f_u$; however, this approach has shown inferior performance. Still, the problem of optimal bandwidth selection remains unsolved.
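A minimal sketch of this step (our illustration; the bandwidth handling is simplified relative to the implementation details in Subsection 5.3):

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_ratio(preds_pos, preds_unl):
    """Estimate the ratio of the densities of NTC predictions on the P and U
    samples, evaluated at every unlabeled prediction."""
    kde_p = gaussian_kde(preds_pos, bw_method=0.1)   # density of predictions on P
    kde_u = gaussian_kde(preds_unl, bw_method=0.05)  # density of predictions on U
    return kde_p(preds_unl) / np.maximum(kde_u(preds_unl), 1e-12)  # avoid /0
```

Note that bw_method in scipy scales with the data's standard deviation, so these values are placeholders rather than the absolute bandwidths reported in Subsection 5.3.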

The second issue is that (12) may systematically underestimate $\alpha^*$, as it relies solely on the infimum point. The reason is noise in the estimates of the NTC predictions and of their densities $\tilde{f}_p$ and $\tilde{f}_u$. To resolve this issue, we propose two alternative estimates.

The first alternative is based on the probabilistic rule that the priors are equal to the expected posteriors. The proposed estimate is $\tilde{\alpha}_{EM}$: the prior probability that equals the mean posterior probability over $X_u$. If it exists, $\tilde{\alpha}_{EM}$ can be identified with an iterative EM algorithm. On the E-step, the posterior probabilities are estimated with (11) using the current estimate of the priors. On the M-step, the prior probability is updated as the mean of the posteriors.

The (non-zero) estimate $\tilde{\alpha}_{EM}$ may not exist. In this case we propose the second alternative $\tilde{\alpha}_{max\_slope}$: the estimate at which the slope of a specific function changes the most. This function is the difference between the priors and the corresponding mean posteriors.

The two approaches are motivated by two empirically observed behaviors of this function, which mainly differ in whether it crosses zero or not. These are illustrated in the bottom-center part of Figure 1. The two approaches are implemented as EM and max_slope in Algorithm 2.
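A sketch of the EM alternative under our reading of the method (the clipping of posteriors at 1 is our assumption; without it, any $\tilde{\alpha}$ would be a trivial fixed point, since the density ratios average to 1 over $X_u$ in expectation):

```python
import numpy as np

def em_alpha(ratios, alpha=0.5, tol=1e-5, max_iter=10000):
    """Find alpha that equals the mean posterior min(1, alpha * r) over X_u.
    `ratios` are the smoothed density ratios of NTC predictions on the U sample."""
    for _ in range(max_iter):
        posteriors = np.minimum(1.0, alpha * ratios)  # E-step, cf. eq. (11)
        new_alpha = posteriors.mean()                  # M-step
        if abs(new_alpha - alpha) < tol:
            break
        alpha = new_alpha
    return alpha, np.minimum(1.0, alpha * ratios)
```

If the iteration collapses to zero, the non-zero estimate does not exist and max_slope is used instead.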

Note that the task DEDPUL solves is classification of the Unlabeled sample. Such a problem formulation is known as transductive. In the more general inductive formulation, the task is to build a classifier able to evaluate any new data rather than $X_u$ specifically. To achieve this, the output of the method can either be linearly interpolated in $\tilde{y}$, or substituted into a loss function to train a new classifier, as proposed in Elkan and Noto (2008).
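For instance, the interpolation variant could look as follows (a sketch; sorted_preds and sorted_posteriors are assumed to come from a completed transductive run, sorted in increasing order of the predictions):

```python
import numpy as np

def inductive_posterior(new_preds, sorted_preds, sorted_posteriors):
    """Posteriors for unseen data: run it through the trained NTC first, then
    linearly interpolate the transductive posteriors over the predictions."""
    return np.interp(new_preds, sorted_preds, sorted_posteriors)
```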

4 Related work

In this section we provide a brief overview of PU Learning methods and relate DEDPUL to the existing literature. For a more detailed overview, see Bekker and Davis (2018c).

Early PU Learning methods mostly concerned text classification and were heuristic by nature Liu et al. (2002); Yu, Han, and Chang (2002); Li and Liu (2003). The strategy behind these methods was (i) to identify Reliably Negative (RN) instances in U and (ii) to train a traditional classifier on the P and RN samples. The drawbacks of such a strategy are obvious: on the one hand, a large and potentially useful subsample of U is simply ignored; on the other hand, RN may still be contaminated with P. In 2003, two studies considered a different strategy: to adapt logistic regression Lee and Liu (2003) and SVM Liu et al. (2003) to the PU setting by changing their loss functions. These methods successfully outperformed the heuristic approach.

The paper of Elkan and Noto (2008) is often considered a milestone of PU Classification. The authors proposed two PU classification methods. First, they introduced the notion of NTC and algebraically connected its predictions with the posteriors through (9). Second, they considered Unlabeled data as simultaneously Positive and Negative, weighted with opposite weights. By introducing these weights into the loss function, any PN classifier may be learned directly from PU data. Disappointingly, these weights are exactly the posteriors, meaning that the answer is already required to implement the method. Nevertheless, this general idea of loss function reweighting would later be adopted by the Risk Estimation framework du Plessis, Niu, and Sugiyama (2014); Du Plessis, Niu, and Sugiyama (2015a) and its latest non-negative modification Kiryo et al. (2017), which is currently considered state-of-the-art.

Most of the described methods require prior knowledge of the mixing proportions, which may be considered a bottleneck. Among them, only Elkan and Noto (2008) address this issue by proposing three ways to estimate the proportions (one of which is (8)). Fortunately, multiple studies focus solely on this problem, known as Mixture Proportions Estimation Sanderson and Scott (2014); du Plessis, Niu, and Sugiyama (2015b); Jain et al. (2016); Ramaswamy, Scott, and Tewari (2016); Bekker and Davis (2018b). The state-of-the-art method is KM Ramaswamy, Scott, and Tewari (2016), which is based on mean embeddings of the P and U samples into a reproducing kernel Hilbert space.

We now relate DEDPUL to the existing literature. For example, in Jain et al. (2016) a method of Mixture Proportions Estimation is proposed that is based on explicit estimation of the total likelihood of $X_p$ and $X_u$ as a function of $\tilde{\alpha}$; the proposed estimate is the point where the slope of this function changes the most, which is similar to our estimation strategy. Furthermore, the paper proposes to use NTC as a transformation that reduces dimensionality while preserving the proportion $\alpha^*$. The idea to approach PU Learning with Density Estimation also appears in the literature. For instance, du Plessis, Niu, and Sugiyama (2015b) explicitly estimate the densities $f_p$ and $f_u$, while Charoenphakdee and Sugiyama (2018) estimate their ratio directly; neither uses NTC in their framework. Next, Kato et al. (2018) fill the gap of proportions estimation in the Risk Estimation framework: they use an iterative EM-like algorithm to identify the proportion that equals the mean of the posteriors, which is also similar to our estimation strategy. A downside is the requirement to retrain the classifier on each step.

Some recent studies concern the question of how PU data is generated Jain et al. (2016); Jain, White, and Radivojac (2016); Bekker and Davis (2018a). Most methods, including DEDPUL, either explicitly or implicitly assume that the distributions of labeled and unlabeled Positives coincide. From the data generation perspective, this can be formulated as the Selected Completely At Random assumption: the probability that a Positive instance is labeled is a constant that does not depend on $x$. A more general alternative is the Selected At Random assumption, which allows the labeling probability to be a function of $x$, called the propensity score Bekker and Davis (2018a).

5 Experimental Procedure and Results

We conduct experiments on synthetic and real-world data sets to evaluate the performance of the algorithms (DEDPUL, EN, KM, nnRE). We consider Mixture Proportions Estimation and PU Classification as separate problems and measure performance on them independently. The Mixture Proportions Estimation algorithms try to identify the proportions, while the PU Classification algorithms receive the proportions as input and try to classify $X_u$. The algorithms are tested on numerous data sets that differ in the initial distributions of the mixing components, in their proportions, and in the extent of their intersection. Each of these experiments is repeated 10 times with different randomly drawn samples. The algorithms are compared pair-wise, and significance is verified using the paired Wilcoxon signed-rank test with a 0.01 p-value threshold.
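Such a pair-wise comparison could be sketched as follows (our illustration with hypothetical error arrays):

```python
import numpy as np
from scipy.stats import wilcoxon

errors_a = np.array([0.02, 0.05, 0.01, 0.04, 0.03])  # hypothetical per-run errors
errors_b = np.array([0.04, 0.07, 0.02, 0.06, 0.05])  # of two compared algorithms
stat, p_value = wilcoxon(errors_a, errors_b)         # paired signed-rank test
print(p_value < 0.01)                                # significance at the threshold
```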

5.1 Data

In the synthetic setting we experiment with mixtures of one-dimensional Laplace distributions. We fix the Positive distribution $f_p$ and mix it with several alternative Negative distributions $f_n$ that differ from it in location and scale. For each of these cases, the proportion $\alpha$ is varied in {0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99}. The sizes of the samples $X_p$ and $X_u$ are fixed at 500 and 2500 respectively.

In the real-world setting we experiment with eight data sets from the UCI machine learning repository Bache and Lichman (2013) and with the MNIST data set of handwritten digits LeCun, Cortes, and Burges (2010) (Table 1). The proportions are varied in {0.05, 0.25, 0.5, 0.75, 0.95}. The samples $X_p$ and $X_u$ are randomly drawn from the data sets in a stratified manner to satisfy these proportions. The joint size of the samples does not exceed 5000. Categorical features are transformed into numerical features with dummy encoding. Numerical features are normalized.

data set    size   dim   positive target values   negative target values
bank        45211  16    yes                      no
concrete    1030   8     (35.8, 82.6)             (2.3, 35.8)
landsat     6435   36    4, 5, 7                  1, 2, 3
mushroom    8124   22    p                        e
pageblock   5473   10    2, 3, 4, 5               1
shuttle     58000  9     2, 3, 4, 5, 6, 7         1
spambase    4601   57    1                        0
wine        6497   12    red                      white
mnist       70000  784   1, 3, 5, 7, 9            0, 2, 4, 6, 8
Table 1: Description of real-world data sets
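As an illustration of this sampling scheme (our sketch; the function and argument names are hypothetical), a fully labeled binary data set can be split into a Positive sample and a contaminated Unlabeled sample with a target proportion alpha:

```python
import numpy as np

def make_pu_data(x, y, n_pos=500, n_unl=2500, alpha=0.25, seed=0):
    """Split labeled data (x, y) into a labeled Positive sample and an
    Unlabeled mixture that contains a fraction `alpha` of latent Positives."""
    rng = np.random.default_rng(seed)
    pos_idx = rng.permutation(np.where(y == 1)[0])
    neg_idx = rng.permutation(np.where(y == 0)[0])
    n_unl_pos = int(alpha * n_unl)              # latent Positives inside U
    x_pos = x[pos_idx[:n_pos]]                  # labeled P sample
    x_unl = np.concatenate([x[pos_idx[n_pos:n_pos + n_unl_pos]],
                            x[neg_idx[:n_unl - n_unl_pos]]])
    return x_pos, x_unl
```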

5.2 Measures for Performance Evaluation

The synthetic setting provides a straightforward way to evaluate performance. Since the underlying distributions $f_p$ and $f_u$ are known, we calculate the true values of the proportions $\alpha^*$ and the posteriors $p^*(x \in P)$ using (4) and (5) respectively. Then, we directly compare these with the algorithms' estimates using mean absolute errors (Table 2, row 1). In the real-world setting the distributions of the data are unknown, and the straightforward performance measures of the synthetic setting are unavailable. Here, for Mixture Proportions Estimation we use a similar measure but substitute $\alpha^*$ with the true proportion $\alpha$, while for PU Classification we use accuracy (Table 2, row 2).

            priors                          posteriors
synthetic   $|\tilde{\alpha} - \alpha^*|$   mean over $X_u$ of $|\tilde{p}(x \in P) - p^*(x \in P)|$
real-world  $|\tilde{\alpha} - \alpha|$     1 - accuracy
Table 2: Performance measures for estimates of priors and posteriors on synthetic and real-world data

Note that such a measure of proportion estimation on real-world data favors the algorithms that consistently underestimate $\alpha^*$ in the case when $\alpha < \alpha^*$. In this sense, the synthetic experiments are more reliable due to the ability to compare directly with $\alpha^*$. Surprisingly, we do not know a single paper that takes this into account: $\alpha$ and not $\alpha^*$ has traditionally been used as the ground truth for proportion estimation.

5.3 Implementations of Algorithms

DEDPUL is implemented according to Algorithms 1 and 2. As NTC we use an ensemble of 10 neural networks with 1 hidden layer of 32 to 512 neurons, depending on the data. We recommend training the networks on logistic loss with a high learning rate. The predictions of each network are obtained with 3-fold cross-validation and are averaged over the ensemble. The densities of the predictions are computed using Kernel Density Estimation with Gaussian kernels. Instead of the raw predictions $\tilde{y}$, we estimate the densities of an appropriately ranged transformation of $\tilde{y}$ and apply the corresponding post-transformations. The bandwidths are chosen heuristically as 0.1 and 0.05 for $\tilde{f}_p$ and $\tilde{f}_u$ respectively, and the threshold in monotonize is likewise chosen heuristically.

Elkan-Noto (EN) is implemented as in Elkan and Noto (2008). The paper proposes the posterior estimator (9) and three proportion estimators e1, e2, and e3, where e3 is analogous to (8). We use e3 in the synthetic setting and e1 in the real-world setting. Predictions are obtained with the same NTC as in DEDPUL.

The Kernel Mean based gradient thresholding algorithms (KM1 and KM2) are retrieved from the original paper Ramaswamy, Scott, and Tewari (2016) (code: http://web.eecs.umich.edu/~cscott/code.html#kmpe). We provide experimental results for KM2, which is the better performing version. As advised, the MNIST data is reduced to 50 dimensions with Principal Component Analysis.

We explore two versions of the non-negative Risk Estimation (nnRE) algorithm Kiryo et al. (2017). In the original nnRE-sigmoid the networks are trained on sigmoid loss, while in nnRE-brier we propose to use Brier loss instead. As the classifier we use an ensemble of 10 neural networks with 3 layers of 32 to 256 neurons, depending on the data. The remaining nnRE hyperparameters are fixed across the experiments.
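The two losses differ only in how they penalize the deviation of predictions from labels; per our footnote in Section 1, they can be sketched as follows (our formulation; nnRE details such as the non-negativity correction are omitted, and predictions are assumed to lie in [0, 1]):

```python
import torch

def sigmoid_loss(pred, target):
    """Mean absolute error between predictions and binary labels: the loss
    used by the original nnRE-sigmoid (for a sigmoid-output network)."""
    return torch.abs(pred - target).mean()

def brier_loss(pred, target):
    """Mean squared error (Brier score): our proposed nnRE-brier replacement."""
    return ((pred - target) ** 2).mean()
```

Either loss then serves as the surrogate loss inside the non-negative risk estimator of Kiryo et al. (2017).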

5.4 Experimental results

Experimental results are presented in Figures 2, 3, 4, and 5. The following conclusions may be made. (i) DEDPUL significantly outperforms both baseline EN and state-of-the-art KM algorithms in Mixture Proportions Estimation in both synthetic and real-world settings (Fig. 2, 3). (ii) DEDPUL significantly outperforms both baseline EN and state-of-the-art nnRE algorithms in Positive-Unlabeled Classification in both settings (Fig. 4, 5). (iii) Proposed modification nnRE-brier significantly outperforms originally proposed nnRE-sigmoid in both settings (Fig. 4, 5).

6 Conclusion

We propose DEDPUL, a method that simultaneously solves the problems of Positive-Unlabeled Classification and Mixture Proportions Estimation. The validity of the method is shown through an extensive empirical investigation, and the method is justified theoretically through (11). Still, some questions remain open. For instance, it is yet unclear what distinguishes the cases when the estimate $\tilde{\alpha}_{EM}$ does not exist and why such cases happen. Formal proofs of consistency of the estimates $\tilde{\alpha}_{EM}$ and $\tilde{\alpha}_{max\_slope}$ would also be valuable. Next, it is yet unclear how to tune hyperparameters such as the bandwidths during density estimation. Finally, several extensions of DEDPUL could be explored, such as extensions to multiclass classification, to the case when both samples are mutually contaminated, and to the case when all three of the Positive, Negative, and Unlabeled samples are available. Application of the method to corruption detection in procurement auctions will be the subject of our future research.

Figure 2: Mixture Proportions Estimation methods on mixtures of Laplace distributions
Figure 3: Mixture Proportions Estimation methods on UCI and MNIST data sets
Figure 4: Positive-Unlabeled Classification methods on mixtures of Laplace distributions
Figure 5: Positive-Unlabeled Classification methods on UCI and MNIST data sets

Acknowledgements

I sincerely thank Ksenia Balabaeva and Iskander Safiulin for regular revisions of the paper and for brainstorming the method’s name; Vitalia Eliseeva, Alexander Nesterov, and Alexander Sirotkin for revisions of the later versions. Support from the Basic Research Program of the National Research University Higher School of Economics is gratefully acknowledged.

References

  • Bache and Lichman [2013] Bache, K., and Lichman, M. 2013. UCI machine learning repository.
  • Bekker and Davis [2018a] Bekker, J., and Davis, J. 2018a. Beyond the selected completely at random assumption for learning from positive and unlabeled data. arXiv preprint arXiv:1809.03207.
  • Bekker and Davis [2018b] Bekker, J., and Davis, J. 2018b. Estimating the class prior in positive and unlabeled data through decision tree induction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
  • Bekker and Davis [2018c] Bekker, J., and Davis, J. 2018c. Learning from positive and unlabeled data: A survey. arXiv preprint arXiv:1811.04820.
  • Blanchard, Lee, and Scott [2010] Blanchard, G.; Lee, G.; and Scott, C. 2010. Semi-supervised novelty detection. Journal of Machine Learning Research 11(Nov):2973–3009.
  • Charoenphakdee and Sugiyama [2018] Charoenphakdee, N., and Sugiyama, M. 2018. Positive-unlabeled classification under class prior shift and asymmetric error. arXiv preprint arXiv:1809.07011.
  • Claesen et al. [2015] Claesen, M.; De Smet, F.; Gillard, P.; Mathieu, C.; and De Moor, B. 2015. Building classifiers to predict the start of glucose-lowering pharmacotherapy using belgian health expenditure data. arXiv preprint arXiv:1504.07389.
  • du Plessis, Niu, and Sugiyama [2014] du Plessis, M. C.; Niu, G.; and Sugiyama, M. 2014. Analysis of learning from positive and unlabeled data. In Advances in neural information processing systems, 703–711.
  • Du Plessis, Niu, and Sugiyama [2015a] Du Plessis, M.; Niu, G.; and Sugiyama, M. 2015a. Convex formulation for learning from positive and unlabeled data. In International Conference on Machine Learning, 1386–1394.
  • du Plessis, Niu, and Sugiyama [2015b] du Plessis, M. C.; Niu, G.; and Sugiyama, M. 2015b. Class-prior estimation for learning from positive and unlabeled data. In ACML, 221–236.
  • Elkan and Noto [2008] Elkan, C., and Noto, K. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 213–220. ACM.
  • Jain et al. [2016] Jain, S.; White, M.; Trosset, M. W.; and Radivojac, P. 2016. Nonparametric semi-supervised learning of class proportions. arXiv preprint arXiv:1601.01944.
  • Jain, White, and Radivojac [2016] Jain, S.; White, M.; and Radivojac, P. 2016. Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in Neural Information Processing Systems, 2693–2701.
  • Kanamori, Hido, and Sugiyama [2009] Kanamori, T.; Hido, S.; and Sugiyama, M. 2009. A least-squares approach to direct importance estimation. Journal of Machine Learning Research 10:1391–1445.
  • Kato et al. [2018] Kato, M.; Xu, L.; Niu, G.; and Sugiyama, M. 2018. Alternate estimation of a classifier and the class-prior from positive and unlabeled data. arXiv preprint arXiv:1809.05710.
  • Kiryo et al. [2017] Kiryo, R.; Niu, G.; du Plessis, M. C.; and Sugiyama, M. 2017. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, 1675–1685.
  • LeCun, Cortes, and Burges [2010] LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist
  • Lee and Liu [2003] Lee, W. S., and Liu, B. 2003. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, volume 3, 448–455.
  • Li and Liu [2003] Li, X., and Liu, B. 2003. Learning to classify texts using positive and unlabeled data. In IJCAI, volume 3, 587–592.
  • Liu et al. [2002] Liu, B.; Lee, W. S.; Yu, P. S.; and Li, X. 2002. Partially supervised classification of text documents. In ICML, volume 2, 387–394. Citeseer.
  • Liu et al. [2003] Liu, B.; Dai, Y.; Li, X.; Lee, W. S.; and Yu, P. S. 2003. Building text classifiers using positive and unlabeled examples. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 179–186. IEEE.
  • Liu, Lafferty, and Wasserman [2007] Liu, H.; Lafferty, J.; and Wasserman, L. 2007. Sparse nonparametric density estimation in high dimensions using the rodeo. In Artificial Intelligence and Statistics, 283–290.
  • Ramaswamy, Scott, and Tewari [2016] Ramaswamy, H.; Scott, C.; and Tewari, A. 2016. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, 2052–2060.
  • Ren, Ji, and Zhang [2014] Ren, Y.; Ji, D.; and Zhang, H. 2014. Positive unlabeled learning for deceptive reviews detection. In EMNLP, 488–498.
  • Sanderson and Scott [2014] Sanderson, T., and Scott, C. 2014. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, 850–858.
  • Scott, Blanchard, and Handy [2013] Scott, C.; Blanchard, G.; and Handy, G. 2013. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference On Learning Theory, 489–511.
  • Yang et al. [2012] Yang, P.; Li, X.-L.; Mei, J.-P.; Kwoh, C.-K.; and Ng, S.-K. 2012. Positive-unlabeled learning for disease gene identification. Bioinformatics 28(20):2640–2647.
  • Yu, Han, and Chang [2002] Yu, H.; Han, J.; and Chang, K. C.-C. 2002. Pebl: Positive example based learning for web page classification using svm. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, 239–248. New York, NY, USA: ACM.