Discriminative adversarial networks for positive-unlabeled learning

by Fangqing Liu, et al.

As an important semi-supervised learning task, positive-unlabeled (PU) learning aims to learn a binary classifier only from positive and unlabeled data. In this article, we develop a novel PU learning framework, called discriminative adversarial networks, which contains two discriminative models represented by deep neural networks. One model Φ predicts the conditional probability of the positive label for a given sample, which defines a Bayes classifier after training, and the other model D distinguishes labeled positive data from those identified by Φ. The two models are simultaneously trained in an adversarial way, like generative adversarial networks, and the equilibrium can be achieved when the output of Φ is close to the exact posterior probability of the positive class. In contrast with existing deep PU learning approaches, DAN does not require class prior estimation, and its consistency can be proved under very general conditions. Numerical experiments demonstrate the effectiveness of the proposed framework.






1 Introduction

In many real-life applications, we are confronted with the task of building a binary classification model from a number of positive data and plenty of unlabeled data, without extra information on the negative data. For example, it is common in disease gene identification [1] that only known disease genes and unknown genes are available, because reliable non-disease genes are difficult to obtain. Similar scenarios occur in deceptive review detection [2], web data mining [3], inlier-based outlier detection [4], etc. Such a task is certainly beyond the scope of standard supervised machine learning, and this is where positive-unlabeled (PU) learning comes in handy.

A straightforward approach for PU learning is to employ a two-step strategy: first, reliable negative data are identified from the unlabeled data by some heuristic techniques [5, 6, 7, 8]; then the classifier can be trained by traditional supervised learning or expectation-maximization-like semi-supervised learning algorithms [9, 10]. Furthermore, the two steps can be iteratively executed so that more negative data can be accurately identified [11]. Most methods based on the two-step strategy assume that the positive and negative data distributions can be well separated with almost non-overlapping supports, which is difficult to satisfy in complex practical problems. Recently, applications of generative adversarial networks (GAN) to PU learning have received growing attention [12, 13], where the generative models learn to generate fake positive and negative samples (or only negative samples), and the classifier is trained by using the fake samples. Experiments show that GAN can improve the performance of PU learning when the size of the labeled positive data is extremely small, but some strong assumptions on the data distributions, including data separability, are still required for the GAN-based methods.

Another widely used approach is to train the classifier by minimizing a weighted loss function, where unlabeled data are interpreted as negative samples with noisy labels, and the weights can be constant hyperparameters [14, 15] or modeled as a continuous weight function according to the estimated mislabeling probabilities [16, 17]. In [18], a universal framework for classification with noisy labels is developed under the data separability assumption, and PU learning can be efficiently performed by the presented rank pruning algorithm as a special case within this framework.

One solution to the PU learning problem with general data distributions is given by [19, 20], where an unbiased estimator for the misclassification risk of supervised learning is derived for PU data, and the classifier can be trained through minimizing this estimate. However, the direct minimization of the estimated risk easily leads to severe overfitting. In order to address this difficulty, a non-negative risk estimator is presented in [21], which is biased but more robust to statistical noise. The main limitation of this approach is that the class prior, i.e., the proportion of positive data (labeled and unlabeled) in the whole dataset, is needed. In practical applications, the class prior can be estimated by class prior estimation methods [22, 23, 24, 25], but the classification performance can be badly affected by an inaccurate estimate.

In this paper, we propose a novel PU learning framework called discriminative adversarial networks (DAN). The key idea of DAN is to approximate the ideal Bayes classifier by reducing the distribution distance between the labeled positive data and those identified by the classifier from the whole dataset. DAN measures and minimizes this distance through a minimax game between the classifier and another discriminative model, by analogy to the well-known generative adversarial networks (GAN) [26], and provides a more efficient way to recover the positive and negative data distributions from unlabeled data than GAN-based PU learning methods. Moreover, it can effectively avoid the phenomenon of mode collapse, from which GAN easily suffers. Both theoretical analysis and experimental results show that the proposed framework can achieve high classification accuracy in general cases without the class prior or the common assumption of data separability in PU learning. The paper is organized as follows. The next section is devoted to the problem statement. Section 3 presents DAN and its detailed mathematical formulation. Section 4 compares our approach with some related works, and experimental results are provided in Section 5. Finally, further discussion and some future research directions for DAN are given in Section 6.

2 Problem statement

Let X = {x_1, …, x_n} be independent samples drawn from an underlying density p(x) with labels y_1, …, y_n ∈ {+1, −1}, where only the first n_P samples are labeled as positive, i.e., y_i = +1 for i ≤ n_P, and the labels of the other samples are unavailable. We further assume that the empirical distribution of the positive data in X is consistent with the ground truth p(x | y = +1). The goal of PU learning is to learn a binary classification model from the positive dataset X_P = {x_1, …, x_{n_P}} and the unlabeled dataset X_U = {x_{n_P + 1}, …, x_n}, which can predict the label of a new instance x.

It is well-known that the optimal classifier in the sense of minimum misclassification probability can be given by sign(f*(x) − 1/2), with f*(x) = p(y = +1 | x) being the conditional probability of the positive label. In the case of positive-negative (PN) learning, where all training samples are labeled, f* can be effectively approximated by minimizing some empirical misclassification risk (e.g., the cross-entropy loss). But such an approximation is difficult for PU learning due to the absence of labeled negative training data.
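The Bayes rule above amounts to thresholding the posterior probability at 1/2. A minimal sketch (the one-dimensional posterior below is a hypothetical illustration, not a model from this paper):

```python
import numpy as np

def bayes_classifier(posterior, x):
    """Predict +1 if p(y=+1|x) > 1/2, else -1 (minimum-error Bayes rule)."""
    return np.where(posterior(x) > 0.5, 1, -1)

# Hypothetical 1-D example: two unit-variance Gaussians centered at +1 and -1
# with equal priors, for which p(y=+1|x) = 1 / (1 + exp(-2x)).
posterior = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))

x = np.array([-2.0, -0.1, 0.1, 2.0])
print(bayes_classifier(posterior, x))  # -> [-1 -1  1  1]
```

In PN learning this posterior is fitted directly from labeled data; the difficulty in PU learning is precisely that no labeled negatives are available for that fit.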

Remark 1.

We only consider here the single-training-set scenario of PU learning, where the labeled positive samples belong to the same sample as the unlabeled ones. Another common scenario in the literature is called case-control [27], where the samples in X_P and X_U are drawn from the positive-class conditional density and the marginal density independently, and the method proposed in this paper can be naturally extended to this scenario (see Section C in Supplementary Information).

3 Discriminative adversarial networks

3.1 Motivation

Unlike some popular PU learning methods [19, 21, 12], the class prior of the positive class, π = p(y = +1), is not assumed to be known in this paper, and only the distributions of the positive data and of the whole dataset are available. According to Bayes' theorem, the two distributions are connected via p_P(x) = p(x | y = +1) = p(y = +1 | x) p(x) / π. Accordingly, for a function Φ we define

q_Φ(x) = Φ(x) p(x) / E_p[Φ],   (1)

which represents the positive data distribution reconstructed by the function Φ. Furthermore, by replacing p with the empirical distribution of the whole dataset X, Eq. (1) can be rewritten as

q_Φ(x) ≈ (1 / (n μ_Φ)) Σ_{i=1}^{n} Φ(x_i) δ(x − x_i),   (2)

where δ denotes the Dirac function and μ_Φ is the mean value of Φ over X. Hence, we can generate samples from q_Φ via resampling X with probability proportional to Φ.

The above analysis suggests that a parametric model Φ of p(y = +1 | x) can be trained via minimizing the distance between q_Φ and the positive data distribution p_P. However, it is worth pointing out that q_{cΦ} = q_Φ for any constant c > 0 according to (1). So, we can only get the value of p(y = +1 | x) up to a proportional constant even if q_Φ = p_P holds exactly. The scale invariance of q_Φ has been thoroughly discussed in the research of mixture proportion estimation, and some theoretical conclusions can be seen in [28, 29]. Here, we make the following assumption so that Φ is identifiable for given p_P and p:

max_x p(y = +1 | x) = 1,   (3)

i.e., at least one sample can be predicted to be positive with probability one, which comprises many practical cases. Under this assumption, we can obtain

p(y = +1 | x) = Φ(x) / max_{x'} Φ(x'),   (4)

if q_Φ = p_P.

3.2 Method

Inspired by the remarkable success of generative adversarial networks (GAN) [30], here we represent Φ as a deep neural network, and define a second deep discriminative model D, which maps a sample x to the probability that x came from the positive data distribution p_P rather than from q_Φ. Then the distance between p_P and q_Φ can be measured and minimized through the following game between Φ and D:

min_Φ max_D V(D, Φ) = E_{x ∼ p_P}[log D(x)] + E_{x ∼ q_Φ}[log(1 − D(x))].   (5)

Intuitively, as illustrated in Fig. 1, D intends to separate the samples uniformly drawn from X_P and those obtained by resampling from X with weights given by Φ, whereas Φ is trained to correctly identify the positive samples in X so as to fool D. Under some technical assumptions, it can be shown that for a fixed Φ,

max_D V(D, Φ) = 2 · JS(p_P ‖ q_Φ) − log 4   (6)

at the limit of infinite data size (see Proposition 2 in Supplementary Information), where JS denotes the Jensen-Shannon divergence. Hence, in the ideal situation, the training procedure converges to the equilibrium point where q_Φ = p_P and D cannot distinguish the two distributions, with D ≡ 1/2, and p(y = +1 | x) can be obtained after the normalization described in (4).
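As a sketch of how the game value can be estimated from data, assuming the GAN-style objective V(D, Φ) = E_{p_P}[log D] + E_{q_Φ}[log(1 − D)] with the q_Φ-expectation computed as a Φ-weighted average over the training data (this is our reading of the elided formula; the function and array names are ours):

```python
import numpy as np

def dan_value(d_pos, d_unl, phi_unl, eps=1e-8):
    """Empirical estimate of V(D, Phi): mean of log D over labeled positives,
    plus a Phi-weighted mean of log(1 - D) over the rest of the data,
    where the weights phi/mean(phi) implement sampling from q_Phi."""
    term_p = np.mean(np.log(d_pos + eps))
    w = phi_unl / (np.mean(phi_unl) + eps)        # resampling weights for q_Phi
    term_u = np.mean(w * np.log(1.0 - d_unl + eps))
    return term_p + term_u

# D maximizes V, Phi minimizes it. At the equilibrium D(x) = 1/2 everywhere,
# so the value should be 2*log(1/2) = -log 4, matching the GAN analysis.
d_half = np.full(100, 0.5)
phi = np.random.default_rng(1).uniform(0.1, 1.0, size=100)
v_eq = dan_value(d_half, d_half, phi)
print(abs(v_eq - 2 * np.log(0.5)) < 1e-3)
```

The check at the end illustrates the equilibrium value −log 4 that the Jensen-Shannon characterization predicts when the two distributions coincide.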

Figure 1: Illustration of the objective function in (5). In resampling, each x_i in the whole dataset is selected with probability proportional to Φ(x_i). Notice the resampling is not actually implemented during the training process, and the involved expected values are calculated through weighted averaging.

The adversarial training method described above can provide satisfying performance when the dimension of the data is small. But for high-dimensional PU learning tasks, the training procedure defined by (5) also suffers from mode collapse like the training of GAN, i.e., Φ tends to predict a near-zero positive probability for a part of the positive samples, especially when the positive data distribution has multiple modes. In order to address this problem, we introduce a penalty factor R(Φ), given in (7), and change the learning objective to (8), so that q_Φ = p_P is still satisfied by the optimal solution; a small constant ε is included in R to avoid singularity. The numerator of R penalizes small values of Φ(x) for x ∈ X_P, and can effectively prevent the phenomenon of mode collapse, because R(Φ) → ∞ as Φ(x) → 0 for some x ∈ X_P. Furthermore, the normalization constraint (3) of the model Φ can be automatically satisfied by solving (8). The denominator of R is designed according to our experimental experience (see Section B in Supplementary Information for some other choices); it increases the gap between the values of Φ on the labeled positive data and on the rest of the data, and can improve the classification performance. More detailed analysis of R and the modified objective is given in Section A of Supplementary Information.

The learning framework developed in this section is similar to GAN, but is based on a zero-sum game between two discriminators instead of one between a generator and a discriminator. Thus, we call this framework discriminative adversarial networks (DAN).

3.3 Implementation

The detailed DAN learning algorithm adopted in this paper is summarized in Algorithm 1, where Φ and D are both deep networks, and sigmoid output neurons (or other bounded output neurons) can be used so that the outputs lie in (0, 1) for all x. For applications in big-data scenarios, all mean values involved in the objective function are approximated by mini-batches in each iteration. Notice that D is updated only by using the gradient of V in Step (9), because the penalty factor R is independent of D. For Φ, the objective value is usually positive under the condition that D performs better than random guessing, which is usually satisfied during training. But when the model is badly initialized, updating Φ according to the gradient of the objective may yield divergence of the algorithm, so we implement the update of Φ as shown in (10) for numerical stability.

1: Input: training data X_P and X_U, initial weights of Φ and D, hyperparameter ε.
2: Output: classifier defined by the normalized Φ.
3: for each training iteration do
4:     Randomly sample mini-batches from X_P and from the whole dataset with a fixed batch size.
5:     Compute the mini-batch estimate of the objective.
6:     Update the weights of D and Φ with their respective step-sizes, as in (9) and (10).
7: end for
8: Normalize Φ according to (11).
Algorithm 1 DAN learning
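The final normalization step of the algorithm can be sketched as follows, assuming (as our reading of the elided normalization step (11) suggests, following the identifiability discussion in Section 3.1) that Φ is rescaled by its maximum over the training data:

```python
import numpy as np

def normalize_phi(phi_values, eps=1e-8):
    """Rescale the trained model outputs so that their maximum is 1,
    making Phi identifiable as the posterior p(y=+1 | x) under the
    assumption that some sample is positive with probability one."""
    return phi_values / (phi_values.max() + eps)

raw = np.array([0.1, 0.4, 0.8])   # hypothetical outputs of the trained network
print(normalize_phi(raw).max())   # close to 1 after normalization
```

After this rescaling, thresholding the normalized output at 1/2 yields the final classifier.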

4 Related work

An important idea of DAN is to approximate p(y = +1 | x) by matching the reconstructed distribution q_Φ and the positive data distribution, which has in fact been investigated in the literature (see, e.g., [31, 32, 33, 34, 24]). However, the direct approximation based on (1) involves probability density estimation and is difficult for high-dimensional applications. In [34, 24], by modeling the ratio between the positive and the whole data densities as a linear combination of basis functions, this problem is transformed into a quadratic programming problem. But the approximation results cannot meet the requirements of classification, and are only applicable to the estimation of the class prior. One main contribution of our approach compared to the previous works is that we find a general and effective way to optimize the model Φ by adversarial training.

It is also interesting to compare DAN to GenPU, a GAN-based PU learning method [12], since they share a similar adversarial training architecture. In DAN, the discriminative model Φ plays the role of the generative model in GAN by approximating the positive data distribution in an implicit way, and can be efficiently trained together with D. In contrast, GenPU is much more time-consuming and easily suffers from mode collapse, as stated in [12], because it contains three generators and two discriminators. (Notice that the penalty factor cannot be applied to GenPU, for the probability densities of samples given by the generators are unknown.) Furthermore, the consistency of GenPU requires the assumptions that the class prior is given and that there is no overlap between the positive and negative data distributions, which are not necessary for DAN.

5 Experiments

In this section, we conduct a series of PU learning experiments on both synthetic and real-world datasets to evaluate the performance of DAN. The detailed settings of the datasets and algorithms are provided in Section D of Supplementary Information, and the software code for DAN is also available (the code will be made publicly available after the blind review process).

We first visualize the learning results of DAN on four two-dimensional toy examples in Fig. 2, from which we can observe that an accurate classification boundary can be deduced from the conditional class probability approximated by DAN even if the positive and negative data cannot be well separated.

Figure 2: Results of DAN learning on four two-dimensional datasets. Top row: samples in the training sets, where each set contains positive samples (in yellow) and negative samples (in green), and some positive samples are labeled. Bottom row: the estimated conditional class probability given by DAN.

Next, we conduct experiments on three benchmark datasets taken from the UCI Machine Learning Repository [35, 36], and the performance of DAN is compared to that of some recently developed PU learning methods, including the unbiased-risk-estimator-based uPU and nnPU [19, 21], the generative-model-based GenPU [12], and the rank pruning (RP) method proposed in [18] (the software codes are downloaded from https://github.com/kiryor/nnPUlearning, https://qibinzhao.github.io/index.html and https://github.com/cgnorthcutt/rankpruning). Considering that uPU and nnPU require the class prior, we implement them under two different conditions: (a) the exact value of the class prior is known, and (b) the class prior is estimated by KM2 [29], one of the state-of-the-art class prior estimation algorithms. For GenPU, the hyperparameters of the algorithm are determined by greedy grid search (see Section D.5 in Supplementary Information). The classification results are summarized in Table 1. It can be seen that DAN outperforms the other methods with high accuracies and low variances on almost all the datasets. Only nnPU obtains a higher accuracy on the Grid Stability dataset with ’unstable’ vs ’stable’ when the exact class prior is given, and its accuracy decreases significantly with the estimated prior. In addition, RP interprets unlabeled data as noisy negative data and can obtain an accurate classifier when the proportion of positive data in the unlabeled data is small. But in the opposite case, where the proportion is too large, RP performs even worse than random guessing (as in Page Blocks with ’2,3,4,5’ vs ’1’ and Grid Stability with ’unstable’ vs ’stable’).

Dataset DAN nnPU nnPU(KM2) uPU uPU(KM2) GenPU RP
Page Blocks
Page Blocks
Grid Stability
Grid Stability
Table 1: Classification accuracies (%) of the compared methods on UCI datasets. The accuracies are evaluated on test sets, and the mean and standard deviation values are computed from independent runs. Definitions of labels (’Positive’ vs ’Negative’) are as follows: Page Blocks: ’1’ vs ’2,3,4,5’; Page Blocks: ’2,3,4,5’ vs ’1’; Grid Stability: ’stable’ vs ’unstable’; Grid Stability: ’unstable’ vs ’stable’; Avila: ’A’ vs the rest; Avila: ’A, F’ vs the rest. Labeled positive data are randomly selected from the training data.

Finally, all the methods are compared on two image datasets, FashionMNIST and CIFAR-10 (downloaded from https://github.com/zalandoresearch/fashion-mnist and https://www.cs.toronto.edu/~kriz/cifar.html), and the classification results are collected in Table 2, where the superior performance of DAN is also evident. Here uPU performs much worse than nnPU due to the overfitting problem [21] (see Fig. 4 in Supplementary Information). Moreover, the performance of GenPU is also unsatisfactory because of the mode collapse of its generators, as shown in Fig. 3. In contrast, different modes of the positive and negative data can be successfully sampled from the distributions defined by Φ and 1 − Φ in DAN.

Dataset DAN nnPU nnPU(KM2) uPU uPU(KM2) GenPU RP
Table 2: Classification accuracies (%) of the compared methods on the FashionMNIST and CIFAR-10 datasets. The accuracies are evaluated on test sets. Definitions of labels (’Positive’ vs ’Negative’) are as follows: FashionMNIST: ’1,4,7’ vs ’0,2,3,5,6,8,9’; FashionMNIST: ’0,2,3,5,6,8,9’ vs ’1,4,7’; CIFAR-10: ’0,1,8,9’ vs ’2,3,4,5,6,7’; CIFAR-10: ’2,3,4,5,6,7’ vs ’0,1,8,9’. Labeled positive data are randomly selected from the training data.
Figure 3: Samples generated by using Φ in DAN and the generative models in GenPU. (a, b) Images resampled from the training set with probability proportional to Φ and 1 − Φ, respectively. (c, d) Images generated by the positive and negative generators in GenPU. True labels (’Positive’ vs ’Negative’) are given by ’1,4,7’ (Trouser, Coat, Sneaker) vs ’0,2,3,5,6,8,9’ (T-shirt/Top, Pullover, Dress, Sandal, Shirt, Bag, Ankle boot).

6 Discussion

The framework of DAN can be viewed as a mixture of discriminative learning and generative learning: a discriminative model is trained by minimizing a loss function defined by a distribution distance, as a generative model would be. Due to the existence of unlabeled data, it is very difficult, if not impossible, to perform PU learning in a purely discriminative manner. Even uPU and nnPU, which are developed based on an estimator of the discriminative loss, still need to model the positive and negative data distributions for the approximation of the class prior. But DAN demonstrates that, in PU learning, the classifier can be trained directly, without solving the problem of probability density estimation as an intermediate step. It is interesting to extend this idea to more general semi-supervised learning problems, such as PNU learning, where some data are labeled as positive or negative while most data are unlabeled; DAN has the potential to address such classification challenges, especially in application scenarios where the labeled positive and negative data cannot cover all modes of the datasets.

It is also worth noting that DAN is a very flexible framework, and its performance can be expected to improve further by utilizing the many advanced GAN techniques developed in recent years. For example, by analogy to WGAN and MMD-GAN, we can establish DAN models based on the Wasserstein metric or the maximum mean discrepancy between distributions. Another future research direction is to investigate robust DAN for semi-supervised learning with noisy labels.


  • [1] P. Yang, X.-L. Li, J.-P. Mei, C.-K. Kwoh, and S.-K. Ng, “Positive-unlabeled learning for disease gene identification,” Bioinformatics, vol. 28, no. 20, pp. 2640–2647, 2012.
  • [2] Y. Ren, D. Ji, and H. Zhang, “Positive unlabeled learning for deceptive reviews detection.,” in EMNLP, pp. 488–498, 2014.
  • [3] B. Liu, Web data mining: exploring hyperlinks, contents, and usage data. Springer Science & Business Media, 2007.
  • [4] A. Smola, L. Song, and C. H. Teo, “Relative novelty detection,” in Artificial Intelligence and Statistics, pp. 536–543, 2009.
  • [5] B. Liu, W. S. Lee, P. S. Yu, and X. Li, “Partially supervised classification of text documents,” in ICML, vol. 2, pp. 387–394, Citeseer, 2002.
  • [6] T. Peng, W. Zuo, and F. He, “Svm based adaptive learning method for text classification from positive and unlabeled documents,” Knowledge and Information Systems, vol. 16, no. 3, pp. 281–301, 2008.
  • [7] F. Lu and Q. Bai, “Semi-supervised text categorization with only a few positive and unlabeled documents,” in International Conference on Biomedical Engineering and Informatics, vol. 7, pp. 3075–3079, IEEE, 2010.
  • [8] S. Chaudhari and S. Shevade, “Learning from positive and unlabelled examples using maximum margin clustering,” in International Conference on Neural Information Processing, pp. 465–473, Springer, 2012.
  • [9] X. Li and B. Liu, “Learning to classify texts using positive and unlabeled data,” in IJCAI, vol. 3, pp. 587–592, 2003.
  • [10] H. Yu, “Single-class classification with mapping convergence,” Machine Learning, vol. 61, no. 1-3, pp. 49–69, 2005.
  • [11] A. Kaboutari, J. Bagherzadeh, and F. Kheradmand, “An evaluation of two-step techniques for positive-unlabeled learning in text classification,” Int. J. Comput. Appl. Technol. Res, vol. 3, pp. 592–594, 2014.
  • [12] M. Hou, B. Chaib-Draa, C. Li, and Q. Zhao, “Generative adversarial positive-unlabeled learning,” in International Joint Conference on Artificial Intelligence, pp. 2255–2261, AAAI Press, 2018.
  • [13] F. Chiaroni, M.-C. Rahal, N. Hueber, and F. Dufaux, “Learning with a generative adversarial network from a positive unlabeled dataset for image classification,” in IEEE International Conference on Image Processing (ICIP), pp. 1368–1372, IEEE, 2018.
  • [14] B. Liu, Y. Dai, X. Li, W. S. Lee, and S. Y. Philip, “Building text classifiers using positive and unlabeled examples.,” in ICDM, vol. 3, pp. 179–188, Citeseer, 2003.
  • [15] Z. Liu, W. Shi, D. Li, and Q. Qin, “Partially supervised classification: based on weighted unlabeled samples support vector machine,” in Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications, pp. 1216–1230, IGI Global, 2008.
  • [16] W. S. Lee and B. Liu, “Learning with positive and unlabeled examples using weighted logistic regression,” in ICML, vol. 3, pp. 448–455, 2003.
  • [17] C. Elkan and K. Noto, “Learning classifiers from only positive and unlabeled data,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220, ACM, 2008.
  • [18] C. G. Northcutt, T. Wu, and I. L. Chuang, “Learning with confident examples: Rank pruning for robust classification with noisy labels,” arXiv preprint arXiv:1705.01936, 2017.
  • [19] M. C. Du Plessis, G. Niu, and M. Sugiyama, “Analysis of learning from positive and unlabeled data,” in Advances in neural information processing systems, pp. 703–711, 2014.
  • [20] M. Du Plessis, G. Niu, and M. Sugiyama, “Convex formulation for learning from positive and unlabeled data,” in International Conference on Machine Learning, pp. 1386–1394, 2015.
  • [21] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama, “Positive-unlabeled learning with non-negative risk estimator,” in Advances in neural information processing systems, pp. 1675–1685, 2017.
  • [22] S. Jain, M. White, M. W. Trosset, and P. Radivojac, “Nonparametric semi-supervised learning of class proportions,” arXiv preprint arXiv:1601.01944, 2016.
  • [23] M. Christoffel, G. Niu, and M. Sugiyama, “Class-prior estimation for learning from positive and unlabeled data,” in Asian Conference on Machine Learning, pp. 221–236, 2016.
  • [24] M. C. du Plessis, G. Niu, and M. Sugiyama, “Class-prior estimation for learning from positive and unlabeled data,” Machine Learning, vol. 106, no. 4, pp. 463–492, 2017.
  • [25] J. Bekker and J. Davis, “Estimating the class prior in positive and unlabeled data through decision tree induction,” in AAAI Conference on Artificial Intelligence, 2018.
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [27] J. Bekker and J. Davis, “Learning from positive and unlabeled data: A survey,” arXiv preprint arXiv:1811.04820, 2018.
  • [28] C. Scott, “A rate of convergence for mixture proportion estimation, with application to learning from noisy labels,” in Artificial Intelligence and Statistics, pp. 838–846, 2015.
  • [29] H. Ramaswamy, C. Scott, and A. Tewari, “Mixture proportion estimation via kernel embeddings of distributions,” in International Conference on Machine Learning, pp. 2052–2060, 2016.
  • [30] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE Signal Proc. Mag., vol. 35, no. 1, pp. 53–65, 2018.
  • [31] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, “Direct importance estimation for covariate shift adaptation,” Annals of the Institute of Statistical Mathematics, vol. 60, no. 4, pp. 699–746, 2008.
  • [32] G. Blanchard, G. Lee, and C. Scott, “Semi-supervised novelty detection,” Journal of Machine Learning Research, vol. 11, no. Nov, pp. 2973–3009, 2010.
  • [33] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
  • [34] M. C. Du Plessis and M. Sugiyama, “Semi-supervised learning of class balance under class-prior change by distribution matching,” Neural Networks, vol. 50, pp. 110–119, 2014.
  • [35] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
  • [36] C. De Stefano, M. Maniaci, F. Fontanella, and A. S. di Freca, “Reliable writer identification in medieval manuscripts through page layout features: The ‘avila’ bible case,” Engineering Applications of Artificial Intelligence, vol. 72, pp. 99–110, 2018.

Appendix A Theoretical analysis of DAN learning

In this section, we analyze the properties of (8) and its optimal solution under the following assumptions.

Assumption 1.

Φ and D have enough capacity, and both n_P and n tend to infinity with their ratio being fixed.

Assumption 2.

The marginal density satisfies p(x) > 0 for all x.

Assumption 3.

There exists a measurable set of positive probability on which p(y = +1 | x) = 1 for all x.

Proposition 1.

defined by (7) satisfies: (i) for . (ii) . (iii) as for some .


The proof of (i) is trivial, and (ii) and (iii) are direct conclusions of the following inequality:


Proposition 2.

For a given ,


and the maximum is achieved when


According to the definition (5), is maximized when


Then, the optimal D is given by (14), and the maximum is


Proposition 3.

If p(y = +1 | x) = 1 on some measurable set with positive probability, and Φ is an optimal solution to (8), then


in probability and .


It follows from Propositions 1 and 2 that the optimal Φ satisfies q_Φ = p_P. Therefore


We can then obtain (17) according to Assumption 3. ∎

Notice that Proposition 3 shows the consistency of DAN learning with normalization step (11).

Appendix B Penalty factors

Besides the penalty factor given in (7), we also considered the following factors:




Here the first quantity denotes the mutual information between a sample and its label, and the average in (22) is taken over the training data. All the above choices of the penalty factor can lead to consistent learning. We choose the factor defined by (7) because it achieves the best performance in our experiments.

Appendix C Case-control scenario

Under the case-control scenario, the empirical approximation (2) of q_Φ becomes


where the mean value of Φ is now taken over the unlabeled dataset X_U. Therefore, the method and theory presented in this paper can be extended to the case-control scenario by defining


Appendix D Experiment details

The FashionMNIST, CIFAR-10 and Avila datasets have already been separated into training and test sets. For the two UCI datasets, we adopt the train_test_split function in scikit-learn to obtain test sets.
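The UCI split can be reproduced along these lines with scikit-learn's train_test_split, as mentioned above (the feature array, test_size and random_state below are placeholder values, not the paper's settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # placeholder features
y = np.array([1] * 5 + [0] * 5)    # placeholder labels

# Hold out a test set; test_size and random_state are illustrative values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
print(len(X_train), len(X_test))  # -> 8 2
```

Stratifying on y keeps the class proportions of the full dataset in both splits, which matters when accuracies are averaged over repeated random splits.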

D.1 Toy examples

The first three toy examples in our experiments are generated by the functions make_circles, make_moons and make_blobs in the scikit-learn package. The dataset of the fourth example is given by a Gaussian mixture model. The other details are shown in Table 3.

Dataset parameters
Concentric circles factor= noise=
Half moons noise=
Blobs cluster_std=
Gaussian mixture model covariance matrix=
Table 3: Parameters of toy examples.
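The generation of the first three toy datasets can be sketched with scikit-learn as follows (the sample size, noise levels, factor and cluster_std are illustrative placeholders, since the actual values in Table 3 were not recoverable):

```python
import numpy as np
from sklearn.datasets import make_blobs, make_circles, make_moons

n = 500  # illustrative sample size

# Concentric circles, half moons, and blobs; parameter values are placeholders.
X1, y1 = make_circles(n_samples=n, factor=0.5, noise=0.05, random_state=0)
X2, y2 = make_moons(n_samples=n, noise=0.1, random_state=0)
X3, y3 = make_blobs(n_samples=n, centers=4, cluster_std=1.0, random_state=0)

print(X1.shape, X2.shape, X3.shape)  # each is (500, 2)
```

Each generator returns two-dimensional features together with binary (or cluster) labels, from which a PU training set is obtained by hiding all but a random subset of the positive labels.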

D.2 UCI datasets

We first describe the UCI datasets used in our experiments in Table 4. Then, we give the detailed experimental settings for each experiment in Table 5.

Dataset size of test set
Page Blocks
Grid Stability
Table 4: Description of UCI datasets used in experiments.
Experiment setting Data amount
Page Blocks ’2,3,4,5’ vs ’1’ = =
Page Blocks ’1’ vs ’2,3,4,5’ = =
Grid Stability ’stable’ vs ’unstable’ = =
Grid Stability ’unstable’ vs ’stable’ = =
Avila ’A’ vs The rest = =
Avila ’A,F’ vs The rest = =
Table 5: Experimental settings for UCI datasets, where the data-amount column reports the number of all labeled and unlabeled positive data in the training sets.

D.3 FashionMNIST and CIFAR-10

The details of the experiments are shown in Table 6. Classification errors of nnPU, uPU and DAN on CIFAR-10 test data with different numbers of epochs are plotted in Fig. 4.
Figure 4: Test errors of DAN, nnPU-KM2 and uPU-KM2 on CIFAR-10 with different numbers of epochs. Left: classes 2,3,4,5,6,7 are positive. Right: classes 0,1,8,9 are positive.
Experiment Setting Data amount KM2
FashionMNIST ’1,4,7’ vs ’0,2,3,5,6,8,9’ = =
FashionMNIST ’0,2,3,5,6,8,9’ vs ’1,4,7’ = =
CIFAR-10 ’0,1,8,9’ vs ’2,3,4,5,6,7’ = =
CIFAR-10 ’2,3,4,5,6,7’ vs ’0,1,8,9’ = =
Table 6: Experimental settings for FashionMNIST and CIFAR-10, where the data-amount column reports the number of all labeled and unlabeled positive data in the training sets.

D.4 Other details

We choose Adam as the optimizer for DAN in our experiments. The architectures of the models in DAN are shown in Table 7.

Dataset Model Network Initial learning rate
Toy examples D MLP with ReLU
Toy examples Φ MLP with ReLU
UCI datasets D MLP with ReLU
UCI datasets Φ MLP with ReLU
Fashion-MNIST D CNN with ReLU
Fashion-MNIST Φ CNN with ReLU
CIFAR-10 D CNN with ReLU
CIFAR-10 Φ CNN with ReLU
Table 7: The architectures details for experiments.

D.5 Choice of hyperparameters of GenPU

GenPU contains four hyperparameters. Although the parameters are coupled for a given class prior in [12], our experience shows that better performance can be achieved by selecting the four parameters independently. Table 8 shows the best hyperparameters, which lead to the largest classification accuracies on the test sets. They are selected by greedy grid search.

Dataset FashionMNIST CIFAR-10 Page Blocks Grid Stability Avila
Table 8: Choice of hyperparameters for GenPU.