Learning Classifiers on Positive and Unlabeled Data with Policy Gradient

10/15/2019 ∙ by Tianyu Li, et al. ∙ 0

Existing algorithms that aim to learn a binary classifier from positive (P) and unlabeled (U) data generally require estimating the class prior or label noise before building a classification model. However, the estimation and classifier learning are normally conducted in a pipeline instead of being jointly optimized. In this paper, we propose to train the two steps alternately using reinforcement learning. Our proposal adopts a policy network to adaptively make assumptions on the labels of unlabeled data, while a classifier is built upon the output of the policy network and provides rewards to learn a better strategy. The dynamic and interactive training between the policy maker and the classifier can exploit the unlabeled data in a more effective manner and yield a significant improvement in classification performance. Furthermore, we present two different approaches to represent the actions sampled from the policy. The first approach considers continuous actions as soft labels, while the other uses discrete actions as hard assignments of labels for unlabeled examples. We validate the effectiveness of the proposed method on two benchmark datasets as well as one e-commerce dataset. The results show that the proposed method consistently outperforms state-of-the-art methods in various settings.




I Introduction

PU learning refers to the problem of learning from a dataset where only a subset of examples are positively labeled and the rest are not annotated at all. It is a critical task due to its prevalence in various real-world applications [31, 16, 23]. In many common situations only positive data are available. For instance, an e-commerce website may only record users who have clicked on advertisements or purchased items, yet it is not possible to simply assume that unlabeled instances are negative. Another example is diagnosis systems that predict whether or not a patient has a certain disease. To build such systems, the already diagnosed patients are naturally treated as positives. Yet, we cannot infer that all undiagnosed patients are not suffering from the disease.

The process of PU learning is conventionally done in two steps: (1) identify likely negative samples from unlabeled data and (2) perform traditional supervised learning on labeled positives and reliable negatives (N) [21, 20, 17, 25]. More recent research focuses on estimating label noise in the unlabeled dataset or the class prior of the training dataset, and then exploits the estimated values during classifier training. The work in [7] made a notable breakthrough by modeling each unlabeled data point as a mix of both positive and negative classes. In the case that the class prior is known, the learning on P and U can be reformulated as a cost-sensitive classification problem [8]. The work in [4] introduces a risk estimator that exploits non-convex loss functions, e.g., the ramp loss, to cancel estimation bias. A more general estimator, which is unbiased and convex by utilizing different loss functions for positive and unlabeled examples, is further proposed in [5]. Incorrectly labeled examples can be removed from the original PU dataset based on noisy label prediction, allowing for training a better classification model [26].

The class prior and label noise rate are essential to existing PU learning approaches and have to be estimated before training the classifier. However, the prior distribution of labels or the possible mislabeled examples in the unlabeled dataset are unknown in typical real-world scenarios [3, 24, 14]. Consequently, the resulting classifier is affected by the estimation accuracy of prior and label noise. Moreover, the two-step process is unidirectional, i.e., there is no feedback from the classification to the prior and label noise estimation. As a result, the pipeline of existing methods leads to non-optimal classification on PU datasets.

This paper proposes a reinforcement learning framework to jointly estimate the labels of unlabeled data and learn a binary classifier. The whole framework can be trained in an end-to-end fashion. Our framework, named policyPU, consists of two components: a policy network and a classifier. The policy network learns to infer label assignment for the unlabeled data, while the classifier is trained using the data and label estimates. The policy network, serving as an agent, formulates the input attribute vector as state and receives rewards from the classifier to update with the policy gradient. It gradually improves its decision making and generates a more accurate output that maximizes the expected reward from the classifier. We present two variants of our framework in terms of learning different policies for unlabeled data. The classifiers use distinct objective functions accordingly. In the first approach, we assume that U data is a combination of P and N [7]. The policy network produces continuous action values within $[0, 1]$ as soft labels. The second approach applies discrete actions of the policy network as hard label assignments, which allows us to use standard supervised learning on the complete dataset. The hard assignment can be obtained by simply thresholding continuous label values.

Regardless of the different strategies, the policy network and the classifier are trained iteratively to learn a policy which makes correct assumptions for U data, and eventually a classifier that fully exploits both P and U so that it has a better generalization ability.

The technical contributions of this paper are summarized as follows:

  1. We propose a policy network for explicitly inferring the label assignment of unlabeled data through a dynamic interaction with the classifier. Compared to existing methods that estimate unlabeled examples beforehand, we exploit the underlying structure of unlabeled data more effectively by taking the targeted classifier performance into consideration.

  2. Two approaches are presented for applying the outcome of the policy network differently. The classifiers are trained with either continuous or discrete actions accordingly. In particular, the continuous actions allow a classifier to treat each unlabeled instance as a mixture of positive and negative.

  3. We conduct comprehensive experiments and show that the classifiers learned by our framework yield consistent improvements in terms of accuracy, the area under the ROC curve (ROCAUC), and the area under the precision-recall curve (PRAUC) on three datasets.

II PU Learning Settings

PU learning aims to build a classifier from positive and unlabeled training data. Although the inputs to positive-negative (PN) learning and PU learning are different, they share the same goal, namely to apply the resulting classifier to distinguish positive and negative samples in test data.

Let $x$ be the feature vector of a sample, $y \in \{0, 1\}$ its true class label and $s \in \{0, 1\}$ its status of being labeled or not. We represent a PU dataset as a set of triplets $(x, y, s)$, which consists of a set of labeled examples $D_L$ and a set of unlabeled examples $D_U$. Since only positive examples are labeled, $s = 1$ indicates $y = 1$. For $s = 0$, either $y = 1$ or $y = 0$ could be true. A general assumption for current PU learning methods is the Selected Completely At Random (SCAR) assumption. It assumes that all labeled samples are selected completely at random from the entire positive example set, indicating that the label status $s$ and the attribute vector $x$ are conditionally independent given the true class [7]. It is formally stated as:

$$p(s = 1 \mid x, y = 1) = p(s = 1 \mid y = 1) = c \qquad (1)$$

The value of $c$ is the constant probability of a positive example being labeled, referred to as the label frequency [2]. Elkan [7] proves the following property, which relates the positive class probability to the label frequency $c$:

$$p(y = 1 \mid x) = \frac{p(s = 1 \mid x)}{c} \qquad (2)$$

Averaging both sides over $x$ gives the corresponding relation for the class prior, $p(y = 1) = p(s = 1) / c$. Equation (2) has been significant for existing PU learning algorithms.
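To make the role of the label frequency concrete, the following minimal sketch applies the SCAR relation of [7], $p(y = 1 \mid x) = p(s = 1 \mid x) / c$, to an estimated labeling probability. The function names and the clipping to 1 are our own illustrative choices, not part of the original method.

```python
def adjust_posterior(p_labeled, label_frequency):
    """Turn p(s=1|x) into p(y=1|x) under SCAR by dividing by c."""
    if not 0.0 < label_frequency <= 1.0:
        raise ValueError("label frequency c must lie in (0, 1]")
    # Estimated probabilities can exceed 1 after division; clipping is our
    # own safeguard, not part of the SCAR relation itself.
    return min(1.0, p_labeled / label_frequency)


def class_prior(labeled_fraction, label_frequency):
    """Class prior p(y=1) = p(s=1) / c, i.e. the same relation averaged over x."""
    return adjust_posterior(labeled_fraction, label_frequency)


# Hypothetical numbers: 20% of all examples are labeled and c = 0.4.
print(adjust_posterior(0.3, 0.4))  # posterior for an example with p(s=1|x) = 0.3
print(class_prior(0.2, 0.4))       # implied class prior
```

The sketch also shows why an inaccurate estimate of $c$ propagates directly into the classifier: every posterior is rescaled by the same factor.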

III Learning Classifiers on PU datasets via Policy Gradient

III-A Overview

Fig. 1: The diagram of the proposed reinforcement learning framework. The policy network takes actions on the input feature vectors. The training data and their actions from the policy are applied to learn a classification model. The policy receives the predicted class label probabilities by the classifier as rewards to update parameters with policy gradient.

This paper presents a framework, policyPU, in which a classifier is learned from positive and unlabeled examples via interacting with a policy network, as shown in Fig. 1. Given a PU dataset, we explore how to learn a more accurate classifier by exploiting the unlabeled examples. Inspired by reinforcement learning [19], our policyPU dynamically adjusts its assumptions to U data after making decisions and receiving rewards from the classifier. Thus, it is able to learn a classifier given a PU dataset in an end-to-end fashion.

To be more specific, the policy network acts as an agent, while the target classifier and the PU dataset serve as the environment in our reinforcement learning setting. The attribute vector of a data instance in the training dataset is the state, and the action represents how the data example is used for classifier training. In practice, a sequence of mini-batches in our training process is formulated as a trajectory. Hence, the interaction between the agent and environment is as follows: the agent (the policy network) takes actions (label assignment) with input states (attribute vectors), and the classifier determines rewards for the agent to update its policy. We denote the two different approaches to learn policies as Weighter and Separator, respectively. In the rest of this section, we first describe the policy networks, and then elaborate on the reward design. This is followed by the description of the classifiers. Finally, the iterative training procedure is presented.

III-B Policy Networks for PU datasets

In order to learn a generalized classification model with P and U data, we would like to make better use of unlabeled examples. We formulate this solution-seeking process as a reinforcement learning task by defining the feature vector $x$ as state and the output of the policy network as action $a$ for input $x$. The goal is then to learn a policy, $\pi_\theta(a \mid x)$, that infers how each data sample in the training dataset contributes to the classifier. Let $D_L$ be the labeled example set, and $D_U$ the unlabeled example set. The objective of the policy network is to generate actions for data instances that maximize its expected reward:

$$J(\theta) = \mathbb{E}_{x \in D_L \cup D_U,\ a \sim \pi_\theta(a \mid x)} \left[ r(x, a) \right] \qquad (3)$$

where $r(x, a)$ is the reward for the data instance with feature vector $x$ after taking action $a$. The reward is defined as the class label probability given by $f$, the classifier in our framework.

III-C Classification Coherence Rewards

The core of our proposed framework is the learning of an effective policy to infer the labels of unlabeled data. To achieve this goal, we seek feedback from the ongoing classifier training process. The intuition behind our reward design is that eventually a good policy will be coherent with the classifier, and this coherence is valid for all data instances and across mini-batches in our framework setting. More specifically, we leverage the probability of positive examples predicted by the classifier as a reference to decide whether an unlabeled data instance may be a plausible P or N, and define the reward function as:

$$r(x, a) = \begin{cases} f(x), & x \in D_L \ \text{or} \ f(x) > \tau \\ 1 - f(x), & \text{otherwise} \end{cases} \qquad (4)$$

where $f(x)$ is the predicted probability by the classifier for input vector $x$, $D_L$ and $D_U$ are the labeled and unlabeled example sets, and the threshold $\tau$ is used as a reference for unlabeled examples. We use the minimum class label prediction of positive examples as our first threshold value, denoted as $\tau_0 = \min_{x \in D_L} f(x)$. Those unlabeled examples with $f(x)$ larger than this value, together with all positive examples, are used to compute the threshold value in Equation (4):

$$\tau = \mathbb{E}_{x \in D_L \cup D_U'} \left[ f(x) \right] \qquad (5)$$

where $D_U' = \{x \in D_U \mid f(x) > \tau_0\}$.

For unlabeled examples of which $f(x) > \tau$, we trust the classifier’s predicted label $y = 1$. Hence, we use their $f(x)$ as reward, the same as that of positive examples in the training dataset, otherwise $1 - f(x)$ as they are predicted to be negative.
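The reward computation for one mini-batch can be sketched as follows. This is our reading of the description above, under stated assumptions: the first threshold is the lowest classifier score among labeled positives, the final threshold is the mean score over labeled positives plus the unlabeled examples above the first threshold, and the reward is the predicted probability of the class an example is taken to belong to. The function name `coherence_rewards` is hypothetical.

```python
def coherence_rewards(probs, labeled):
    """Return (threshold, rewards) for one mini-batch.

    probs:   classifier outputs f(x) in [0, 1]
    labeled: booleans, True for labeled positives (s = 1)
    """
    pos = [p for p, s in zip(probs, labeled) if s]
    if not pos:
        raise ValueError("the mini-batch needs at least one labeled positive")
    # First threshold: the lowest score among labeled positives.
    first = min(pos)
    # Reference set: all labeled positives plus unlabeled examples above it.
    ref = pos + [p for p, s in zip(probs, labeled) if not s and p > first]
    threshold = sum(ref) / len(ref)
    rewards = []
    for p, s in zip(probs, labeled):
        if s or p > threshold:
            rewards.append(p)        # treated as positive: reward f(x)
        else:
            rewards.append(1.0 - p)  # treated as negative: reward 1 - f(x)
    return threshold, rewards


# Two labeled positives followed by three unlabeled examples (made-up scores).
tau, r = coherence_rewards([0.9, 0.6, 0.8, 0.2, 0.7],
                           [True, True, False, False, False])
print(tau, r)
```

In this toy batch the unlabeled example scored 0.7 clears the first threshold but not the final one, so it is rewarded as a negative; this is the kind of near-boundary case the threshold design has to handle.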

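The goal of the policy learning is to optimize the parameter $\theta$ to output actions that can maximize the expected reward on both the labeled and unlabeled data [9]. We update the policy by mini-batch training in practice. Since the policy maker in our framework gets instantaneous rewards from the classifier after each mini-batch, we apply the REINFORCE algorithm to maximize $J(\theta)$ [32, 18, 34]. Its gradient is computed based on the policy gradient theorem [30] as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\ a \sim \pi_\theta(a \mid x)} \left[ r(x, a)\, \nabla_\theta \log \pi_\theta(a \mid x) \right] \qquad (6)$$

where $r(x, a)$ is obtained by Equation (4) for both labeled data and unlabeled data.

Let the batch size be $B$, then the parameter of the policy network is updated via:

$$\theta \leftarrow \theta + \eta\, \frac{1}{B} \sum_{i=1}^{B} r(x_i, a_i)\, \nabla_\theta \log \pi_\theta(a_i \mid x_i) \qquad (7)$$

where $\eta$ is the learning rate.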
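As an illustration of the REINFORCE update, the sketch below performs one mini-batch step for a deliberately simple policy: a Bernoulli policy with $\pi_\theta(a = 1 \mid x) = \sigma(\theta^\top x)$. This linear policy is an assumption made for brevity (the paper uses neural networks), and `reinforce_step` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reinforce_step(theta, states, actions, rewards, lr):
    """One mini-batch REINFORCE update for pi(a=1|x) = sigmoid(theta . x)."""
    grad = [0.0] * len(theta)
    for x, a, r in zip(states, actions, rewards):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        # For a Bernoulli-logistic policy, grad_theta log pi(a|x) = (a - p) * x.
        for j, xi in enumerate(x):
            grad[j] += r * (a - p) * xi
    batch = len(states)
    # Gradient ascent on the expected reward, averaged over the mini-batch.
    return [t + lr * g / batch for t, g in zip(theta, grad)]


theta = reinforce_step([0.0, 0.0], [[1.0, 0.0]], [1], [1.0], lr=0.5)
print(theta)
```

A rewarded action $a = 1$ pushes $\theta$ toward making that action more likely for similar states; with a neural policy the hand-written gradient would be replaced by automatic differentiation.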

III-D Classifiers on PU datasets

With the actions from the policy network, all training data are used to learn a classifier. The classifier is denoted as $f(x; \phi)$, where $f$ is a differentiable classification model parameterized by $\phi$.

In Weighter, the policy network outputs continuous actions, $a \in [0, 1]$, as soft labels for unlabeled data instances. They are used to compute the weighted cost function of the corresponding classifier. The network learns the weighting policy that maximizes the classifier reward. It generates weighted data and receives rewards to update its parameter to improve its judgement. The classifier in Weighter follows the idea from prior work that each unlabeled sample is a combination of a positive example with weight $w(x)$ and a negative example with weight $1 - w(x)$, where $w(x)$ is a continuous real value and $x$ is the feature vector of the example [7]. The objective function to learn the classifier in Weighter is:

$$\mathcal{L}_W(\phi) = -\sum_{x \in D_L \cup D_U} \left[ w(x) \log f(x; \phi) + (1 - w(x)) \log \left(1 - f(x; \phi)\right) \right] \qquad (8)$$

where $w(x)$ is the weight for an unlabeled example to be positive and $f(x; \phi)$ is the predicted probability given feature vector $x$. Here, the action value $a$ for $x$ is directly used as $w(x)$ in the cost function. The loss function is minimized to learn the model $f$, which not only discriminates labeled and unlabeled samples, but also correctly identifies the contribution of unlabeled data to the model training.
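The weighting idea can be sketched with a weighted cross-entropy, in which every example counts as a positive with weight $w$ and as a negative with weight $1 - w$. The choice of cross-entropy as the base loss and the averaging over the batch are assumptions of this sketch; `weighter_loss` is a hypothetical name.

```python
import math


def weighter_loss(probs, weights, eps=1e-12):
    """Weighted cross-entropy over a batch.

    probs:   classifier outputs f(x)
    weights: w(x), the soft label of each example (1.0 for labeled positives,
             the sampled continuous action for unlabeled examples)
    """
    total = 0.0
    for p, w in zip(probs, weights):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= w * math.log(p) + (1.0 - w) * math.log(1.0 - p)
    return total / len(probs)


# One labeled positive (w = 1) and one unlabeled example with soft label 0.3.
print(weighter_loss([0.8, 0.4], [1.0, 0.3]))
```

With $w = 1$ the term reduces to the usual positive log-loss, and with $w = 0$ to the negative log-loss, so hard labels are a special case of this objective.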

The action of Separator is a hard assignment, indicating whether an unlabeled data instance is assigned as positive or negative. The hard assignment can be seen as the soft labels thresholded at a fixed value that is held constant in our experiments. A data sample with action value larger than this threshold is assigned as positive, otherwise negative. The learned policy here aims to identify those data examples in U that can be directly put into the labeled example set. The corresponding classifier is trained on generated P and N data via minimizing a cross-entropy loss function.

As described, according to the discrete actions of the policy network, some unlabeled data are selected as P, denoted as $D_U^{+}$, and the rest are used as negative, denoted as $D_U^{-}$. The classifier for Separator is a standard supervised model with the cross-entropy cost function:

$$\mathcal{L}_S(\phi) = -\sum_{x \in D_L \cup D_U^{+}} \log f(x; \phi) - \sum_{x \in D_U^{-}} \log \left(1 - f(x; \phi)\right) \qquad (9)$$
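A minimal sketch of the Separator pipeline, hard assignment followed by standard cross-entropy, is given below. The threshold is passed in explicitly because its value is not fixed here; the 0.5 used in the usage lines is purely an illustrative assumption, and both function names are hypothetical.

```python
import math


def separate(actions, labeled, threshold):
    """Hard labels: labeled positives stay 1; unlabeled examples become 1
    when their action value exceeds the threshold, otherwise 0."""
    return [1 if s or a > threshold else 0 for a, s in zip(actions, labeled)]


def cross_entropy(probs, hard_labels, eps=1e-12):
    """Mean binary cross-entropy of classifier outputs against hard labels."""
    total = 0.0
    for p, y in zip(probs, hard_labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


# Illustrative threshold of 0.5 (an assumption, not a value from the paper).
labels = separate([0.9, 0.2, 0.1], [False, False, True], threshold=0.5)
print(labels)
print(cross_entropy([0.7, 0.3, 0.8], labels))
```

Because the classifier only sees hard labels, any off-the-shelf supervised learner could be substituted for the cross-entropy model in this variant.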


III-E The iterative training between the policy and the classifier

The policy network and the classifier interact in the following way: unlabeled examples in the training dataset are input to the policy network, which outputs either discrete actions or continuous-valued actions. The classifier takes the positive examples and unlabeled examples with their corresponding action values as input, and generates a class label probability for each data sample. The prediction results of the classifier are used as rewards for the policy maker. During the training, whether an unlabeled data example should be put in the positive dataset, or how it is shared as both positive and negative simultaneously, is learned and adjusted dynamically.

Our framework trains deep neural networks, namely Convolutional Neural Networks (CNNs) and Multilayer Perceptrons (MLPs), as classification models and policy networks. The neural network training is done using mini-batch training. In each epoch, we randomly shuffle the data to create a trajectory of mini-batches. Given a dataset $D$ that consists of both labeled and unlabeled data, a mini-batch with $B$ instances is first randomly sampled from the training dataset to obtain feature vectors as states. For each state $x$, an action $a$ is then taken by the policy network $\pi_\theta$. The generated pair $(x, a)$ is fed to train the classifier $f$. The class label prediction results for data instances in the current mini-batch are in return applied as reward to update the policy. The two networks are jointly optimized and their parameters, $\theta$ and $\phi$, are updated every mini-batch.

In Weighter, the weight of a labeled example is set as $1$ for the objective function in Equation (8), and that of an unlabeled example is the sampled continuous-valued action. In Separator, the labeled instances are used as P in classification directly, while the unlabeled instances are separated based on their sampled actions.

Input: a training dataset $D$ consisting of P and U data
Parameter: $\theta$; $\phi$; batch size $B$; iteration number $T$; policy update frequency $K$
Output: $\pi_\theta$; $f_\phi$

  Initialize target policy: $\theta' \leftarrow \theta$
  for $t = 1$ to $T$ do
     shuffle $D$ to create mini-batches
     for each mini-batch $\{x_i\}_{i=1}^{B}$ do
        Sample action $a_i$ from target policy $\pi_{\theta'}$ for $x_i$;
        Minimize Equation (8)/(9) to learn the classifiers
        using the generated $(x_i, a_i)$;
        Predict class probability $f(x_i)$ for $x_i$;
        Calculate the threshold $\tau$ via Equation (5);
        Get $r(x_i, a_i)$ via Equation (4) for taking action $a_i$;
        Update policy parameter $\theta$ using Equation (7)
     end for
     if $t \bmod K = 0$ then
        Update target policy: $\theta' \leftarrow \theta$
     end if
  end for
Algorithm 1 policyPU learning

The classifier in the framework is built upon the P data and the processed U data. It updates its parameter $\phi$ and predicts class label probabilities, $f(x)$, for all training data instances. They are the rewards for the policy network to update $\theta$.

To reduce the high variance of the returned reward, we adopt a target policy network for sampling actions. The network is updated every $K$ epochs in our experiment. The whole training process is illustrated in Algorithm 1. In addition, a pre-training step, which simply uses unlabeled examples as negatives, is applied before the interactive learning to stabilize the process. First, the classifier is trained using P and U data directly for a few epochs, then the policy network is also pre-trained for several iterations using the prediction outcome of the classifier.

IV Related Work

Elkan et al. [7] first show that the output probability of a classifier which predicts $p(s = 1 \mid x)$ can be adjusted via Equation (2) to get $p(y = 1 \mid x)$ under the SCAR assumption. They also propose that each unlabeled example can be utilized as a combination of a positive with weight $w(x)$ and a negative with complementary weight $1 - w(x)$, where $w(x) = p(y = 1 \mid x, s = 0) = \frac{1 - c}{c} \cdot \frac{p(s = 1 \mid x)}{1 - p(s = 1 \mid x)}$, and $c$ is the label frequency. In addition, a similarity-based method is proposed to associate ambiguous data instances with two similarity weights indicating their resemblance to positive and negative examples, respectively [33].
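For comparison with the learned weights in our framework, the fixed weighting of [7] can be written down directly. The sketch below computes $w(x) = \frac{1 - c}{c} \cdot \frac{g(x)}{1 - g(x)}$ with $g(x) = p(s = 1 \mid x)$; the function name and the clipping to 1 are our own illustrative choices.

```python
def unlabeled_positive_weight(p_labeled, label_frequency):
    """Weight with which an unlabeled example counts as positive,
    w(x) = p(y=1 | x, s=0) = (1 - c)/c * g(x)/(1 - g(x)), g(x) = p(s=1|x)."""
    g, c = p_labeled, label_frequency
    if not 0.0 < c <= 1.0:
        raise ValueError("label frequency c must lie in (0, 1]")
    if not 0.0 <= g < 1.0:
        raise ValueError("p(s=1|x) must lie in [0, 1)")
    # Estimated g(x) can push the ratio above 1; clipping is our safeguard.
    return min(1.0, (1.0 - c) / c * g / (1.0 - g))


# Hypothetical values: c = 0.5 and a classifier output g(x) = 0.2.
print(unlabeled_positive_weight(0.2, 0.5))
```

Unlike this closed-form weight, which depends on a pre-estimated $c$, the Weighter policy adjusts the weights during training using feedback from the classifier.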

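The research in [4] proposes an unbiased risk estimator to learn classifiers. Let $g$ be a decision function, $\ell$ be a loss function and $\pi_p$ be the class prior. The risk of $g$ in the learning on positive and negative examples is:

$$R(g) = \pi_p\, \mathbb{E}_{p}\left[\ell(g(x))\right] + (1 - \pi_p)\, \mathbb{E}_{n}\left[\ell(-g(x))\right] \qquad (10)$$

where $\mathbb{E}_p$, $\mathbb{E}_n$ and $\mathbb{E}_u$ denote expectations over the positive, negative and unlabeled distributions. Given the key observation that $(1 - \pi_p)\, \mathbb{E}_{n}[\ell(-g(x))]$ can be approximated using $\mathbb{E}_{u}[\ell(-g(x))] - \pi_p\, \mathbb{E}_{p}[\ell(-g(x))]$, Equation (10) is rewritten as:

$$R_{pu}(g) = \pi_p\, \mathbb{E}_{p}\left[\ell(g(x))\right] - \pi_p\, \mathbb{E}_{p}\left[\ell(-g(x))\right] + \mathbb{E}_{u}\left[\ell(-g(x))\right] \qquad (11)$$

for PU learning. But this method requires a loss function to satisfy $\ell(z) + \ell(-z) = 1$, e.g., the ramp loss. To apply this unbiased risk estimator to train deep neural networks, a non-negative variation is proposed to alleviate overfitting and to be implemented for large-scale training data by stochastic optimization [15]. Another recent work [29] also converts PU learning to a risk minimization problem, in which it adopts different loss functions for P and U, respectively [10]. The proposed method further shows that its risk minimization in the presence of noisy negative data can be turned into the estimation of the centroid of negative examples.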
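The risk estimators discussed here can be sketched empirically. The code below evaluates the PU risk $\pi_p \hat{\mathbb{E}}_p[\ell(g)] - \pi_p \hat{\mathbb{E}}_p[\ell(-g)] + \hat{\mathbb{E}}_u[\ell(-g)]$ on lists of decision values, using the sigmoid loss $\ell(z) = 1 / (1 + e^{z})$, which satisfies $\ell(z) + \ell(-z) = 1$. The `non_negative` flag clamps the estimated negative-class part at zero in the spirit of [15]. All names are hypothetical and the loss choice is an assumption of this sketch.

```python
import math


def sigmoid_loss(z):
    """l(z) = 1 / (1 + exp(z)), computed stably; l(z) + l(-z) = 1."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def mean(values):
    return sum(values) / len(values)


def pu_risk(scores_p, scores_u, prior, non_negative=False):
    """Empirical PU risk from decision values g(x) on P and U samples."""
    pos_as_pos = prior * mean([sigmoid_loss(z) for z in scores_p])
    pos_as_neg = prior * mean([sigmoid_loss(-z) for z in scores_p])
    unl_as_neg = mean([sigmoid_loss(-z) for z in scores_u])
    # Estimate of the negative-class risk; it can go below zero on finite
    # samples, which is what the non-negative variant guards against.
    neg_part = unl_as_neg - pos_as_neg
    if non_negative:
        neg_part = max(0.0, neg_part)
    return pos_as_pos + neg_part


# A flexible model that scores P high and U low drives the plain estimate negative.
print(pu_risk([5.0], [-5.0], prior=0.9))
print(pu_risk([5.0], [-5.0], prior=0.9, non_negative=True))
```

Both variants take the class prior as an input, which is why they are paired with a separate prior estimator in the baselines.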

As knowing the label noise rate or the class prior of the training dataset simplifies PU learning greatly, plenty of methods have been proposed to estimate them directly from PU datasets [6, 1]. Several mixture proportion estimation methods have been proposed recently, in which the unlabeled data are considered as a mixture of a positive distribution and an unknown negative distribution [28]. [13] proposes to use kernels to model distributions, and the class prior is the sum of weights that represent how much the positive distribution contributes to each kernel. [27] further proposes a similar algorithm but uses the distance between kernel embeddings to find the optimal weights. More recently, [2] proposes a decision tree based approach to estimate the label frequency, and then uses Equation (2) as a medium to get the class prior. RankPruning [26] attempts to identify incorrectly labeled examples in the training dataset, and then prunes those examples with low predicted probabilities after ranking them. Then, a classifier with a weighted cost function is learned on the pruned training dataset.

V Experiments

Dataset | Train No. | Test No. | Feature No. | Classifier | Policy network
MNIST | 60,000 | 10,000 | (1*28*28) | 6-layer CNN | 5-layer CNN
CIFAR-10 | 50,000 | 10,000 | (3*32*32) | 6-layer CNN | 5-layer CNN
UserTargeting | 19,032 | 19,032 | 153 | 6-layer MLP | 4-, 6-layer MLP
TABLE I: Benchmark datasets. The number of data instances in train and test, the feature vector dimension, and the architectures of the classifiers and the policy networks trained in our framework are shown.

We verify whether the classifiers learned by our framework are able to yield classification performance improvements on three real-world datasets: MNIST, CIFAR-10 and one e-commerce dataset (UserTargeting). Comprehensive experiments are designed to test our proposal with various numbers of labeled examples and distinct ratios of positives in the unlabeled dataset. In addition, to avoid a one-sided evaluation, the results of three metrics are presented in the comparison with other state-of-the-art PU learning algorithms.

V-A Datasets

Originally, both MNIST and CIFAR-10 datasets have ten classes. We preprocess and create a binary classification dataset in the same way as [15]. For MNIST, the even digits 0, 2, 4, 6, 8 constitute the P class, while the odd digits 1, 3, 5, 7, 9 constitute the N class. For CIFAR-10, airplane, automobile, ship and truck form the P class, while bird, cat, deer, dog, frog and horse are the N class. In our experiments, we first construct PU datasets from the original training set, then test the learned classifiers on the original test set.

The labeled examples in the UserTargeting dataset are those users who responded positively to certain products on an e-commerce platform. They are used to identify potential users among all other users of this e-commerce platform. We formulate this as a PU learning problem and randomly sample from the users who are not yet labeled as U to create the UserTargeting dataset. This dataset is split 50-50 into train and test. Both splits have 4,758 labeled and 14,274 unlabeled users. The feature vector for each user in this dataset is created based on the user’s online activities over the past 6 months, and its dimension is 153. The details of the benchmark datasets are shown in Table I.

V-B Baselines

Fig. 2: Accuracy comparison on MNIST dataset. The classifier is a 6-layer CNN model, and the policy is a 5-layer CNN model. The number of labeled examples is 300, 500 and 1,000 from the first row to the third row; The percentage of positive examples in the unlabeled data is 0.3, 0.5 and 0.7 from left to right.

We compare our framework against state-of-the-art PU learning algorithms, which are summarized as follows:

  • biased PU: Biased PU learning builds a classification model by using unlabeled data instances as negatives directly.

  • TIcE[2]+nnPU[15]: TIcE (https://dtai.cs.kuleuven.be/software/tice/) utilizes a decision tree induction based method to calculate the label frequency first, and obtains the class prior via Equation (2). Its estimation result is used as input to nnPU (https://github.com/kiryor/nnPUlearning), which is a non-negative unbiased risk estimator.

  • KM2[27]+nnPU[15]: KM2 (http://web.eecs.umich.edu/~cscott/code.html#kmpe) is an efficient algorithm for mixture proportion estimation. It embeds the distributions into a reproducing kernel Hilbert space and uses a quadratic programming solver as a sub-routine. The estimation result is also input to nnPU for classifier learning.

  • RankPruning[26]: RankPruning proposes to remove incorrectly labeled examples in the training dataset before inputting them to build classification models. Note that we adapt its original algorithm (https://github.com/cgnorthcutt/rankpruning), which is designed for learning with noisy positive and negative labels, to fit the PU learning setting.

  • PMPU[11]: PMPU relies on the large positive margin oracle, which claims that positive instances are located far away from the decision boundary. It estimates the labels of unlabeled data in each iteration according to the positive margin shrinkage, and then retrains the classifier based on random sampling.

  • policyPU_separator: The proposed framework in which a policy network learns to generate a hard assignment of labels to unlabeled examples, while a classifier is built on positives and the output results from the policy.

  • policyPU_weighter: The proposed framework that applies a policy network to output continuous actions as the cost function weights for the corresponding classifier.

  • optimal PN: The classification model is built on the training data with ground truth labels. It is used as a reference baseline for other algorithms.

ROCAUC, accuracy and PRAUC are used as the evaluation metrics.
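For readers who want to reproduce the evaluation, two of the metrics can be sketched without external libraries. The ROC AUC below uses its rank interpretation (the probability that a random positive is scored above a random negative, ties counted as one half), and the accuracy uses a decision threshold of 0.5, which is an assumption of this sketch rather than a setting stated here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of P and N scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the true label."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels), accuracy(scores, labels))
```

ROC AUC depends only on the ranking of scores while accuracy depends on where the threshold falls, so a model can rank well and still have low accuracy if its scores are poorly calibrated.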

V-C Experiment setup

We experiment with different numbers of positively labeled examples: 300, 500 and 1,000 labeled examples are tested for MNIST and CIFAR-10. The number of unlabeled examples in the PU datasets is set to , and we report the results using varying proportions of positive examples in the unlabeled set, set to 0.3, 0.5 and 0.7 for adequate verification.
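One way to construct such PU datasets from a fully labeled binary dataset is sketched below. Labeled positives are drawn uniformly at random from the positive pool, which matches the SCAR assumption, and the unlabeled set is filled to a requested positive ratio. The function name, its parameters and the toy sizes in the usage lines are hypothetical.

```python
import random


def make_pu_dataset(positives, negatives, n_labeled, n_unlabeled, pos_ratio, seed=0):
    """Return a list of (x, s) pairs with s = 1 for labeled positives and
    s = 0 for unlabeled examples (a mix of hidden positives and negatives)."""
    rng = random.Random(seed)
    n_pos_u = round(n_unlabeled * pos_ratio)
    n_neg_u = n_unlabeled - n_pos_u
    if n_labeled + n_pos_u > len(positives) or n_neg_u > len(negatives):
        raise ValueError("not enough examples for the requested setting")
    # Sample labeled and hidden positives without overlap.
    pos = rng.sample(positives, n_labeled + n_pos_u)
    labeled = pos[:n_labeled]
    unlabeled = pos[n_labeled:] + rng.sample(negatives, n_neg_u)
    rng.shuffle(unlabeled)
    return [(x, 1) for x in labeled] + [(x, 0) for x in unlabeled]


# Toy pools of integer "examples": 0-99 are positive, 100-199 negative.
data = make_pu_dataset(list(range(100)), list(range(100, 200)),
                       n_labeled=5, n_unlabeled=10, pos_ratio=0.3)
print(len(data), sum(s for _, s in data))
```

Fixing the seed keeps the sampled PU dataset identical across algorithms, which is useful when the same setting is compared over several methods and repeated runs.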

The targeted classifier for MNIST and CIFAR-10 is a 6-layer CNN with 3 convolutional layers ([d-C(3*3,96)-C(3*3,192)-C(1*1,10)-100-1]), while the policy network is a 5-layer CNN model with 2 convolutional layers ([d-C(3*3,96)-C(3*3,10)-100-1]). Two 6-layer MLP models are used for the UserTargeting dataset, paired with two different policy network architectures, respectively. The architectures of the classifiers and the policy networks in our experiments are shown in Table I. For policy networks, we deliberately use a slightly shallower architecture compared to the corresponding targeted classifier. The policy is expected to make rough assumptions at the beginning so that it can gradually adjust itself towards the direction of greater cumulative reward.

We train neural networks using as the optimizer with a batch size of 128 and a learning rate fixed to . Also, we use [22] as the activation function, and apply weight decay and batch normalization [12]. To achieve a fair comparison, we train classifiers with the same architecture and the same parameter settings for different algorithms on all created PU datasets. We average the performance by running each experiment 5 times. All experiments are implemented using Chainer (https://chainer.org/).

V-D Experiments on the MNIST dataset

In the experiment, a PU dataset is first created for each setting with respect to the number of labeled examples and the positive ratio in the unlabeled data. For TIcE and KM2, the feature vectors are first flattened, and then fed to generate class priors. We follow their default settings, and conduct downsampling if the total number of training instances exceeds , the same as [2], for better estimation. Then, these estimated values are input to nnPU for training classification models. The classifier in RankPruning is first trained to remove noisy labels in the unlabeled dataset. We follow its setting to run a 5-fold cross validation in order to prune incorrectly labeled examples and get weights for different classes. Then, its classification model is learned with a weighted cost function on the pruned dataset. For PMPU, we adapt it to a mini-batch training scenario, obtain the large positive margin oracle and resample 3/4 of the unlabeled examples for classifier learning within every mini-batch. As for the optimal PN, the classification model is built with both labeled positive and negative examples. The performance of optimal PN is used as a reference for the other PU learning algorithms. The weight decay for all classifiers is set to , while that of the policy network in Weighter is and in Separator is . We pre-train the policy networks and classifiers using unlabeled examples as negatives for 5 epochs. Then, we run epochs for training and update policy every epochs.

The accuracy with different settings is displayed in Fig. 2. Vertically, it shows experimental results with different numbers of labeled examples: 300, 500 and 1,000 from the top row to the bottom. Horizontally, the figures differ in terms of the percentage of P in U. The experimental results verify that the classifiers learned by the proposed framework are very competitive across different numbers of labeled examples in the training datasets, as well as different ratios of positive examples in the unlabeled data.

In particular, when the ratio of positive data is low (i.e., 0.3), policyPU_weighter is shown to be capable of approaching the accuracy curve of optimal PN learning. We conjecture the reason is that the data used for the classifier to obtain the threshold have a relatively large proportion of labeled true positive data. As a consequence, it is able to produce valid rewards for those unlabeled data. Meanwhile, another observation is that with a larger positive ratio, it takes longer for both of the classifiers in our framework to start to predict reasonably. In particular, policyPU_separator is hardly able to keep increasing its performance when the positive ratio is large, likely due to the influence of the threshold setting in Equation (4). As described, the threshold is calculated based on the expectation over the predicted class label probability of labeled examples and some unlabeled examples chosen based on the first threshold. Compared to solely using labeled examples as the reference, the proposed way is expected to get a balanced threshold value by considering those unlabeled examples which are likely to be positive. Yet, it is also possible to increase the threshold if many positives are far from negatives in the unlabeled dataset. As a result, the policy may get a non-optimal reward from those positives in U data near the decision boundary due to the threshold setting. On the other hand, policyPU_weighter is not impacted as severely as policyPU_separator. We think this is because of the weighting mechanism that the classifier in Weighter holds. It is capable of drawing a more flexible decision boundary even with many data instances near it. Therefore, the class label prediction would be more accurate and more valid as a reward to the policy.

The performance comparison after 300 epochs of training is shown in Tables II, III and IV for the experiments with 300, 500 and 1,000 labeled examples, respectively. Each table shows the ROC_AUC, accuracy and PR_AUC results of three distinct fraction settings. It is shown that almost all algorithms generate consistent performance except biased PU learning, which fails to achieve a good accuracy. The experimental results show that the classifiers learned by the proposed framework can outperform the others and even produce results close to the optimal ones in a few cases.

Model | 0.3 ROC_AUC | 0.3 Accuracy | 0.3 PR_AUC | 0.5 ROC_AUC | 0.5 Accuracy | 0.5 PR_AUC | 0.7 ROC_AUC | 0.7 Accuracy | 0.7 PR_AUC
biased PU | 0.957 | 0.622 | 0.957 | 0.929 | 0.535 | 0.933 | 0.874 | 0.510 | 0.865
TIcE+nnPU | 0.953 | 0.860 | 0.954 | 0.936 | 0.825 | 0.936 | 0.942 | 0.828 | 0.940
KM2+nnPU | 0.951 | 0.867 | 0.954 | 0.938 | 0.861 | 0.943 | 0.906 | 0.722 | 0.905
RankPruning | 0.875 | 0.745 | 0.859 | 0.899 | 0.825 | 0.882 | 0.878 | 0.771 | 0.874
PMPU | 0.956 | 0.861 | 0.957 | 0.931 | 0.827 | 0.933 | 0.911 | 0.793 | 0.911
policyPU_separator | 0.954 | 0.862 | 0.957 | 0.927 | 0.825 | 0.934 | 0.890 | 0.722 | 0.893
policyPU_weighter | 0.975 | 0.916 | 0.975 | 0.948 | 0.880 | 0.948 | 0.915 | 0.818 | 0.909
optimal PN | 0.976 | 0.919 | 0.977 | 0.974 | 0.907 | 0.973 | 0.971 | 0.839 | 0.969
TABLE II: Experiment results on MNIST with CNN classifiers. No. of labeled examples is 300. The percentage of P in U is 0.3, 0.5 and 0.7. ROC_AUC, accuracy and PR_AUC are shown from left to right for each percentage setting.

0.3 0.5 0.7
Model ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC
biased PU 0.976 0.632 0.976 0.955 0.536 0.955 0.898 0.508 0.902
0.963 0.778 0.962 0.962 0.794 0.960 0.948 0.722 0.944
0.970 0.892 0.971 0.962 0.894 0.963 0.924 0.828 0.934
0.941 0.872 0.924 0.770 0.722 0.722 0.832 0.746 0.820
0.971 0.885 0.972 0.954 0.856 0.954 0.921 0.811 0.929
0.973 0.903 0.975 0.943 0.836 0.947 0.903 0.760 0.917
0.986 0.942 0.986 0.979 0.923 0.979 0.950 0.833 0.950
optimal PN 0.989 0.949 0.989 0.990 0.944 0.990 0.986 0.877 0.985
TABLE III: Experiment results on MNIST with CNN classifiers. No. of labeled examples is 500. The percentage of P in U is 0.3, 0.5 and 0.7. ROC_AUC, accuracy and PR_AUC are shown from left to right for each percentage setting.

0.3 0.5 0.7
Model ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC
biased PU 0.983 0.626 0.983 0.973 0.517 0.973 0.937 0.507 0.946
0.979 0.893 0.978 0.974 0.794 0.972 0.975 0.858 0.975
0.969 0.845 0.971 0.973 0.897 0.974 0.905 0.715 0.917
0.965 0.869 0.963 0.776 0.639 0.763 0.932 0.853 0.932
0.980 0.895 0.979 0.970 0.871 0.970 0.931 0.833 0.938
0.979 0.912 0.979 0.969 0.898 0.971 0.879 0.743 0.896
0.991 0.948 0.991 0.989 0.935 0.988 0.978 0.843 0.977
optimal PN 0.993 0.960 0.993 0.994 0.952 0.993 0.991 0.884 0.990
TABLE IV: Experiment results on MNIST with CNN classifiers. No. of labeled examples is 1,000. The percentage of P in U is 0.3, 0.5 and 0.7. ROC_AUC, accuracy and PR_AUC are shown from left to right for each percentage setting.
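
Each setting in these tables is defined by the number of labeled positives and the fraction of positives hidden in the unlabeled set. The sketch below shows one way such a split could be drawn from a fully labeled binary dataset. The function name, the sampling procedure, the unlabeled-set size and the seed are assumptions for illustration and are not taken from the paper.

```python
import random

def make_pu_split(labels, n_labeled, pos_fraction, n_unlabeled, seed=0):
    """Return (labeled_positive_idx, unlabeled_idx) from binary labels.

    n_labeled positives are revealed; the unlabeled set contains
    round(pos_fraction * n_unlabeled) hidden positives and the rest negatives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos_u = round(pos_fraction * n_unlabeled)
    n_neg_u = n_unlabeled - n_pos_u
    if n_labeled + n_pos_u > len(pos) or n_neg_u > len(neg):
        raise ValueError("not enough examples for the requested split")
    labeled = pos[:n_labeled]
    unlabeled = pos[n_labeled:n_labeled + n_pos_u] + neg[:n_neg_u]
    rng.shuffle(unlabeled)
    return labeled, unlabeled

labels = [1] * 1000 + [0] * 1000   # toy stand-in for a binarized dataset
labeled, unlabeled = make_pu_split(labels, n_labeled=300, pos_fraction=0.3, n_unlabeled=1000)
hidden_positive_rate = sum(labels[i] for i in unlabeled) / len(unlabeled)
```

Under biased PU learning every index in `unlabeled` would be trained as a negative, so `hidden_positive_rate` is also the fraction of unlabeled examples that receive a wrong label.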

V-E Experiments on the CIFAR-10 dataset

Fig. 3: Accuracy comparison on the CIFAR-10 dataset. The classifier is a 6-layer CNN model, and the policy is a 5-layer CNN model. The number of labeled examples is 300, 500 and 1,000 from the first row to the third row; the percentage of positive examples in the unlabeled data is 0.3, 0.5 and 0.7 from left to right.

For the experiments on CIFAR-10, we train CNN models as classifiers and policies using the same architectures as in the experiments on MNIST. As before, the TIcE and KM2 algorithms are each run first to estimate the class prior, and the estimates are then fed to nnPU. RankPruning eliminates label noise to create a relatively clean training dataset and then builds a classification model on it. We apply the same pre-training for our proposal and update the policy once every 3 epochs while learning the classifiers. The weight decay for classifiers is , and for policy networks are and , respectively.
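
The update schedule can be sketched as follows. This is a minimal illustration that assumes the classifier is trained in every epoch and the policy is updated in every third epoch. The callbacks `train_classifier_epoch` and `update_policy` are hypothetical placeholders for the actual training routines.

```python
def run_alternating_training(n_epochs, train_classifier_epoch, update_policy, policy_every=3):
    """Train the classifier every epoch and update the policy every `policy_every` epochs."""
    policy_updates = []
    for epoch in range(1, n_epochs + 1):
        train_classifier_epoch(epoch)
        if epoch % policy_every == 0:
            update_policy(epoch)
            policy_updates.append(epoch)
    return policy_updates

classifier_log = []
updates = run_alternating_training(
    n_epochs=10,
    train_classifier_epoch=classifier_log.append,  # stand-in for one epoch of classifier training
    update_policy=lambda epoch: None,              # stand-in for one policy-gradient update
)
```

With 10 epochs the policy is updated at epochs 3, 6 and 9, while the classifier sees all 10 epochs.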

The experimental results are illustrated in Fig. 3. As shown, our framework trains classification models that achieve higher accuracy than the other algorithms. The accuracy curves also show that our proposal sometimes yields even higher accuracy than the classifier trained on fully labeled PN data with the same parameter setting. We believe that if the true positive and negative examples in the U dataset overlap near the decision boundary, the instance weights, and even the hard assignments, given by the policy on these data may serve as an effective regularizer for the classifier. This is an interesting phenomenon worth further investigation.

We also observe interesting learning curves in the accuracy comparison on CIFAR-10, and in a few settings on MNIST as well. In some scenarios, the policy network appears to make inaccurate decisions for unlabeled examples at the beginning, yet quickly corrects itself after a few trials. This suggests that the proposed interactive learning between the policy network and the classifier is effective for learning on PU datasets: even if the policy network is inaccurate at the beginning of training, the coherence rewards provided by the classifier allow the policy network and the target classifier to learn from each other and quickly rectify the policy. The detailed comparison of ROC_AUC, accuracy and PR_AUC after 300 epochs of training is presented in Tables V, VI and VII.
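
The self-correcting behaviour can be reproduced in a toy setting with a REINFORCE-style update. The sketch below is an illustration under strong assumptions, not the training procedure of the paper: the policy is a one-dimensional Bernoulli (logistic) policy, the classifier is a frozen stand-in, and the coherence reward is taken to be +1 when the sampled assignment agrees with the classifier and -1 otherwise.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_step(w, b, xs, classifier_labels, rng, lr=0.1):
    """One REINFORCE update of a Bernoulli policy pi(a=1|x) = sigmoid(w*x + b)."""
    grad_w = grad_b = 0.0
    for x, c in zip(xs, classifier_labels):
        p = sigmoid(w * x + b)
        a = 1 if rng.random() < p else 0      # sample a hard label assignment
        reward = 1.0 if a == c else -1.0      # assumed coherence reward
        grad_w += reward * (a - p) * x        # reward * d/dw log pi(a|x)
        grad_b += reward * (a - p)            # reward * d/db log pi(a|x)
    n = len(xs)
    return w + lr * grad_w / n, b + lr * grad_b / n

rng = random.Random(0)
xs = [-2.0 + 4.0 * i / 199 for i in range(200)]        # toy 1-D features
classifier_labels = [1 if x > 0 else 0 for x in xs]    # frozen stand-in classifier

w, b = -1.0, 0.0    # the policy starts out disagreeing with the classifier
agreement = []
for step in range(300):
    w, b = reinforce_step(w, b, xs, classifier_labels, rng)
    greedy = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
    agreement.append(sum(g == c for g, c in zip(greedy, classifier_labels)) / len(xs))
```

In the full framework the classifier is also updated with the policy's actions, so the reward signal itself moves during training. The toy keeps it fixed only to isolate the policy-gradient step.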

0.3 0.5 0.7
Model ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC
biased PU 0.906 0.683 0.865 0.864 0.622 0.812 0.833 0.607 0.757
0.888 0.682 0.828 0.872 0.657 0.810 0.858 0.646 0.784
0.895 0.814 0.877 0.884 0.736 0.825 0.849 0.691 0.781
0.901 0.792 0.857 0.880 0.754 0.832 0.819 0.577 0.734
0.874 0.774 0.808 0.849 0.721 0.767 0.833 0.721 0.743
0.915 0.835 0.877 0.886 0.808 0.842 0.847 0.750 0.780
0.915 0.830 0.880 0.875 0.798 0.831 0.818 0.741 0.742
optimal PN 0.935 0.919 0.907 0.926 0.779 0.894 0.927 0.623 0.895
TABLE V: Experiment results on CIFAR-10 with CNN classifiers. No. of labeled examples is 300. The percentage of P in U is 0.3, 0.5 and 0.7. ROC_AUC, accuracy and PR_AUC are shown from left to right for each percentage setting.

0.3 0.5 0.7
Model ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC
biased PU 0.926 0.680 0.895 0.899 0.619 0.859 0.859 0.607 0.807
0.900 0.621 0.847 0.878 0.553 0.813 0.881 0.510 0.827
0.915 0.803 0.871 0.903 0.720 0.851 0.885 0.744 0.842
0.929 0.726 0.902 0.901 0.792 0.859 0.837 0.584 0.778
0.880 0.749 0.812 0.880 0.748 0.806 0.865 0.743 0.788
0.930 0.860 0.897 0.910 0.834 0.874 0.876 0.787 0.834
0.935 0.859 0.907 0.907 0.823 0.866 0.847 0.770 0.790
optimal PN 0.949 0.874 0.926 0.945 0.800 0.921 0.939 0.643 0.913
TABLE VI: Experiment results on CIFAR-10 with CNN classifiers. No. of labeled examples is 500. The percentage of P in U is 0.3, 0.5 and 0.7. ROC_AUC, accuracy and PR_AUC are shown from left to right for each percentage setting.

0.3 0.5 0.7
Model ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC
biased PU 0.939 0.703 0.919 0.920 0.618 0.885 0.911 0.602 0.875
0.904 0.641 0.859 0.906 0.592 0.859 0.876 0.443 0.817
0.924 0.811 0.893 0.917 0.745 0.875 0.918 0.703 0.880
0.948 0.825 0.927 0.929 0.712 0.898 0.905 0.667 0.861
0.907 0.764 0.860 0.887 0.731 0.817 0.869 0.762 0.792
0.938 0.856 0.914 0.926 0.846 0.891 0.911 0.833 0.875
0.951 0.884 0.933 0.929 0.853 0.897 0.902 0.816 0.858
optimal PN 0.955 0.879 0.940 0.994 0.814 0.932 0.991 0.650 0.919
TABLE VII: Experiment results on CIFAR-10 with CNN classifiers. No. of labeled examples is 1,000. The percentage of P in U is 0.3, 0.5 and 0.7. ROC_AUC, accuracy and PR_AUC are shown from left to right for each percentage setting.

4-layer policy 6-layer policy
Model ROC_AUC Accuracy PR_AUC ROC_AUC Accuracy PR_AUC
biased PU 0.951 0.872 0.828 0.956 0.885 0.849
TIcE+nnPU 0.957 0.895 0.841 0.958 0.895 0.849
KM2+nnPU 0.957 0.896 0.846 0.955 0.896 0.847
RankPruning 0.949 0.880 0.821 0.956 0.886 0.842
PMPU 0.858 0.631 0.526 0.844 0.629 0.518
policyPU_separator 0.974 0.937 0.898 0.972 0.941 0.892
policyPU_weighter 0.971 0.914 0.885 0.969 0.919 0.883
TABLE VIII: Experiment results on the UserTargeting dataset. The classifiers are two 6-layer MLPs ([d-100-50-50-30-1]), paired with a 4-layer (left) and a 6-layer (right) policy network.

V-F Experiments on the UserTargeting dataset

We train two MLP classifiers with 1,000 labeled examples; they are learned together with a 6-layer MLP and a 4-layer MLP as policy networks, respectively. A narrower architecture, [d-100-50-50-30-1], is used for the neural networks due to the dimension of the user feature vector. The weight decay for our classifiers and policy networks are set to and , separately. Experiment results are summarized after running 1,000 epochs.

The accuracy comparison is presented in Fig. 4. Since the UserTargeting dataset does not contain any true negative examples, there is no comparison to optimal PN learning in this experiment. For the other baseline algorithms, we follow the same training procedure described for the experiments on MNIST and CIFAR-10. Unlike on the other two datasets, PMPU struggles to perform well here. We speculate that this is because this user behavior dataset is noisier than MNIST and CIFAR-10. Hence, the calculation of the significant parameter, , for PMPU may be severely impacted.

We observe that it takes somewhat longer for the policy networks to learn a consistent policy. We assume this is again due to the high noise level, which slows the convergence of policy learning: it is hard for the policy to decide quickly how an unlabeled data instance should be used. RankPruning produces very promising results; its pruning of likely mislabeled examples works well for this user behavior dataset. In addition, both class prior estimation methods seem to have difficulty accurately approximating the true values. Meanwhile, biased PU learning turns out to be a strong baseline for this dataset, as the assumption that the unlabeled instances are mostly negative may actually hold for this particular problem. Yet our classifiers eventually yield comparable performance. Another important observation is that our classifiers do not suffer severely from overfitting in the end. The comparison of ROC_AUC, accuracy and PR_AUC is shown in Table VIII.

Fig. 4: Accuracy comparison on the UserTargeting dataset. The classifier is a 6-layer MLP; the paired policy network is a 4-layer MLP (left) and a 6-layer MLP (right). The number of labeled examples is 1,000.

V-G Verification on policy learning

In this subsection, we examine whether the policy learns to improve its decision making on the unlabeled examples. In our proposed interactive learning mechanism, the policy must update itself towards a better policy during training in order to facilitate the classifier learning. As shown in Fig. 2 and Fig. 3, the classification models learn to yield more accurate predictions. Here, we illustrate the policy's decisions on the unlabeled data to verify whether they also gradually improve during training. Since the policy in Separator directly makes a hard assignment, which is more straightforward to interpret, we use its results as examples for discussion. As indicated in Fig. 5, the rate of correctly assigned data instances improves over the course of training in both scenarios. The drop at the beginning matches one of the observations from the experiment on CIFAR-10: at first the policy makes wrong decisions, but the interactive learning corrects it after a few epochs of trials. Overall, the policy is indeed improving along with the classifier.

This observation is consistent with the theoretical analysis in [35], which proposes a generalized cross entropy loss and derives a bound on the difference in optimal objective function value between using the dataset with true labels and using the dataset with noisy labels. The latter corresponds to PU learning in our experiment. Further, it proves that as the noise rate decreases, the optimal function value on a noisy dataset approaches the one obtained on a clean dataset.
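
For readers unfamiliar with the loss analyzed in [35], the sketch below evaluates its per-example form, L_q = (1 - p^q) / q, where p is the probability the model assigns to the given (possibly noisy) label and q lies in (0, 1]. The function name and the example values are ours; the sketch only illustrates how the loss interpolates between cross entropy (q close to 0) and a linear, MAE-like loss (q = 1).

```python
import math

def generalized_cross_entropy(p_label, q):
    """L_q = (1 - p_label**q) / q for the probability assigned to the given label, 0 < q <= 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p_label ** q) / q

p = 0.2                                          # probability given to a (possibly wrong) label
near_ce = generalized_cross_entropy(p, 1e-6)     # approaches -log(p) as q -> 0
linear = generalized_cross_entropy(p, 1.0)       # equals 1 - p
cross_entropy = -math.log(p)
```

Larger values of q penalize confidently contradicted labels less heavily than cross entropy does, which is the property that makes the loss more tolerant of wrongly assigned labels.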

Fig. 5: Correct assignment rate of the unlabeled examples by the policy in Separator. The experiment is run with 300 labeled examples, and the percentage of positives in the unlabeled data is 0.3. Experiment results on MNIST (left) and on CIFAR-10 (right) are shown.
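
The quantity plotted in Fig. 5 can be computed as in the sketch below, which compares the policy's hard assignments with the hidden true labels of the unlabeled examples. The function name and the toy assignments are hypothetical; the hidden labels are only available here because the benchmark datasets are fully labeled.

```python
def correct_assignment_rate(assigned, true_labels):
    """Fraction of unlabeled examples whose hard assignment matches the hidden true label."""
    if len(assigned) != len(true_labels):
        raise ValueError("inputs must have the same length")
    return sum(a == t for a, t in zip(assigned, true_labels)) / len(true_labels)

true_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]     # hidden labels, 30% positive
early_policy = [0, 0, 1, 1, 0, 0, 1, 0, 0, 0]    # hypothetical assignments early in training
later_policy = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]    # hypothetical assignments after more epochs

early_rate = correct_assignment_rate(early_policy, true_labels)
later_rate = correct_assignment_rate(later_policy, true_labels)
noise_rate = 1.0 - later_rate    # share of unlabeled examples still assigned a wrong label
```

One minus this rate is the label noise the classifier is exposed to on the unlabeled data, which links the curve in Fig. 5 to the noise-rate argument of [35].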

VI Conclusions

This paper proposed a reinforcement learning framework in which a policy network learns to update its assumptions about unlabeled examples, while a classifier builds on the actions taken by the policy, makes predictions, and generates rewards to guide the policy training. Compared to existing PU learning methods, which rely on a pipeline to make estimations on unlabeled examples and then build a classifier, the interactive learning between the policy and the classifier in our proposed framework makes use of U data more effectively and trains a more generalized classifier in an end-to-end fashion. Experimental results on three datasets demonstrate that the classifiers learned by our framework yield performance improvements in terms of ROC_AUC, accuracy and PR_AUC.


  • [1] J. Bekker and J. Davis (2018-11) Learning from positive and unlabeled data: a survey. https://arxiv.org/abs/1811.04820. Cited by: §IV.
  • [2] J. Bekker and J. Davis (2018) Estimating the class prior in positive and unlabeled data through decision tree induction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2712–2719. Cited by: §II, §IV, 2nd item, §V-D.
  • [3] G. Blanchard, G. Lee, and C. Scott (2010-12) Semi-supervised novelty detection. JMLR 11, pp. 2973–3009. Cited by: §I.
  • [4] M. C. du Plessis, G. Niu, and M. Sugiyama (2014-December 08-13) Analysis of learning from positive and unlabeled data. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 703–711. Cited by: §I, §IV.
  • [5] M. C. du Plessis, G. Niu, and M. Sugiyama (2015-July 06-11) Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, pp. 1386–1394. Cited by: §I.
  • [6] M. C. du Plessis and M. Sugiyama (2014-05) Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems E97.D, pp. 1358–1362. Cited by: §IV.
  • [7] C. Elkan and K. Noto (2008-August 24-27) Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, pp. 213–220. Cited by: §I, §I, §II, §II, §III-D, §IV.
  • [8] C. Elkan (2001-August 04-10) The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA, USA, pp. 973–978. Cited by: §I.
  • [9] J. Feng, M. Huang, L. Zhao, Y. Yang, and X. Zhu (2018) Reinforcement learning for relation classification from noisy data. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5779–5786. Cited by: §III-C.
  • [10] W. Gao, L. Wang, Y. Li, and Z. Zhou (2016) Risk minimization in the presence of label noise. In Proceedings of The Thirtieth AAAI Conference on Artificial Intelligence, pp. 1575–1581. Cited by: §IV.
  • [11] T. Gong, G. Wang, J. Ye, Z. Xu, and M. C. Lin (2018) Margin based pu learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3037–3044. Cited by: 5th item.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, pp. 448–456. Cited by: §V-C.
  • [13] S. Jain, M. White, and P. Radivojac (2016-December 05-10) Estimating the class prior and posterior from noisy positives and unlabeled data. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain. Cited by: §IV.
  • [14] S. Jain, M. White, M. W. Trosset, and P. Radivojac (2016) Nonparametric semi-supervised learning of class proportions. CoRR abs/1601.01944. Cited by: §I.
  • [15] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama (2017) Positive-unlabeled learning with non-negative risk estimator. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA. Cited by: §IV, 2nd item, 3rd item, §V-A.
  • [16] W. Li, Q. Guo, and C. Elkan (2010-08) A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing 49 (2), pp. 717–725. Cited by: §I.
  • [17] X. Li and B. Liu (2003-08) Learning to classify texts using positive and unlabeled data. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, pp. 587–592. Cited by: §I.
  • [18] Y. Li and J. Ye (2018-August 19-23) Learning adversarial networks for semi-supervised text classification via policy gradient. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, United Kingdom, pp. 1715–1723. Cited by: §III-C.
  • [19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. CoRR. Cited by: §III-A.
  • [20] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu (2003-11) Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, FL, USA, pp. 179–186. Cited by: §I.
  • [21] B. Liu, W. S. Lee, P. S. Yu, and X. Li (2002-July 08-12) Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 387–394. Cited by: §I.
  • [22] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Cited by: §V-C.
  • [23] F. Mordelet and J. Vert (2010-10) A bagging svm to learn from positive and unlabeled examples. Pattern Recog. Lett 37. Cited by: §I.
  • [24] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari (2013) Learning with noisy labels. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, pp. 1196–1204. Cited by: §I.
  • [25] M. N. Nguyen, X. Li, and S. Ng (2011-July 16-22) Positive unlabeled learning for time series classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, pp. 1421–1426. Cited by: §I.
  • [26] C. G. Northcutt, T. Wu, and I. L. Chuang (2017) Learning with confident examples: rank pruning for robust classification with noisy labels. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI’17. Cited by: §I, §IV, 4th item.
  • [27] H. G. Ramaswamy, C. Scott, and A. Tewari (2016) Mixture proportion estimation via kernel embedding of distributions. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 2052–2060. Cited by: §IV, 3rd item.
  • [28] C. Scott (2015-May 09-12) A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, California, USA, pp. 838–846. Cited by: §IV.
  • [29] H. Shi, S. Pan, J. Yang, and C. Gong (2018) Positive and unlabeled learning via loss decomposition and centroid estimation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2689–2695. Cited by: §IV.
  • [30] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999-November 29-December 04) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, pp. 1057–1063. Cited by: §III-C.
  • [31] G. A. Ward, T. J. Hastie, S. T. Barry, J. Elith, and J. R. Leathwick (2009) Presence-only data and the em algorithm. Biometrics 65 (2), pp. 554–563. Cited by: §I.
  • [32] R. J. Williams (1992-05) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §III-C.
  • [33] Y. Xiao, B. Liu, J. Yin, L. Cao, C. Zhang, and Z. Hao (2011) Similarity-based approach for positive and unlabelled learning. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pp. 1577–1582. Cited by: §IV.
  • [34] T. Zhang, M. Huang, and Z. Zhang (2018) Learning structured representation for text classification via reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 6053–6060. Cited by: §III-C.
  • [35] Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788. Cited by: §V-G.