Adaptive Positive-Unlabelled Learning via Markov Diffusion

by   Paola Stolfi, et al.

Positive-Unlabelled (PU) learning is the machine learning setting in which only a set of positive instances is labelled, while the rest of the data set is unlabelled. The unlabelled instances may be either unspecified positive samples or true negative samples. Over the years, many solutions have been proposed to deal with PU learning. Some techniques treat the unlabelled samples as negative ones, reducing the problem to binary classification with a noisy negative set, while others aim to detect sets of likely negative examples and then apply a supervised machine learning strategy (two-step techniques). The approach proposed in this work falls into the latter category and works in a semi-supervised fashion: motivated and inspired by previous works, a Markov diffusion process with restart is used to assign pseudo-labels to unlabelled instances. Afterward, a machine learning model is trained on the newly assigned classes. The principal aim of the algorithm is to identify a set of instances that is likely to contain positive instances that were originally unlabelled.




1 Introduction

Binary classification in traditional settings refers to the ability to correctly identify positive and negative instances, namely instances labelled as 1 or 0, based on their attributes. This problem is addressed by the so-called supervised learning algorithms, i.e. algorithms that are trained over data sets containing precise information regarding positive and negative instances.

In some cases, binary classification instead refers to the ability to identify positive instances within a set of positive and unlabelled instances: some positive instances are labelled as 1, while the remaining instances are labelled as 0, which in this case reflects the absence of knowledge about an instance. These problems are addressed by semi-supervised learning algorithms, which in this particular context are known as Positive-Unlabelled (PU) learning.

The term PU learning first appeared in the early 2000s. Since then, interest in this learning task has grown, from both a methodological and an applied point of view. Indeed, PU learning arises in many contexts, for instance in disease gene identification (see for instance [1], [2]), where for each disease there is a small number of genes which are known to be relevant for the disease, while there is no knowledge regarding the other genes. The same happens when dealing with customers’ preferences, where the available information concerns what customers like while there is no information regarding what they do not like (e.g., see [3]).

PU learning methods can be divided into two main approaches. In the first approach, the set of unlabelled instances is assumed to be a contaminated set of negative instances, and the contamination is taken into account during the modelling process by weighting the data points or adding penalties on misclassification (see for instance [4], [2], [5], [6]). The second approach, also called two-step techniques, aims at identifying a set of reliable negative examples, namely a non-contaminated set of negative examples, on which to train a supervised learning algorithm (see for instance [7], [1]). The right choice of method when dealing with PU learning, and the resulting performance, depend heavily on the data assumptions; see [8] for a recent and detailed review of these methods.

In this work we consider the latter approach, namely the two-step techniques. In particular, we propose a modification of the method introduced in [1], where the authors propose a multi-class labelling procedure which consists in considering five different labels, namely Positive (P), Likely Positive (LP), Weakly Negative (WN), Likely Negative (LN) and Reliable Negative (RN). This type of labelling identifies a set of examples among the unlabelled ones, namely the LP, which are more likely to be positive. This approach, like all two-step techniques, is based on the separability and smoothness assumptions, which require respectively that the features are able to distinguish between positive and negative instances and that instances which have similar features are more likely to have the same label.

In [1] the Reliable Negative class is composed of those unlabelled instances whose distance from the average feature vector of the positive examples is greater than the average. The other labels are assigned according to the stationary distribution of a Markov process with restart, which we call the label propagation process, where the initial state is a distribution over the Positive and Reliable Negative examples. The Markov matrix is constructed from a slightly modified, non-symmetric version of the Euclidean distance matrix computed over the instance features. Roughly speaking, Likely Positive and Likely Negative are those instances which receive, through the Markov diffusion process, the largest part of the initial distribution from the positive and reliable negative examples, respectively. The remaining instances are labelled as Weakly Negative.
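The Reliable Negative rule attributed to [1] above can be illustrated with a minimal sketch. It assumes that "the average" means the mean distance of the unlabelled instances from the positive centroid, which the text does not state explicitly; the function name and the toy data are hypothetical and are not taken from [1].

```python
import numpy as np

def reliable_negatives_by_centroid(X, is_positive):
    """Return indices of unlabelled rows that lie farther than average
    from the mean feature vector of the labelled positives.

    Assumption: "the average" is the mean distance over unlabelled rows.
    """
    X = np.asarray(X, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    centroid = X[is_positive].mean(axis=0)           # average positive feature vector
    unlabelled_idx = np.flatnonzero(~is_positive)    # candidates for the RN class
    dist = np.linalg.norm(X[unlabelled_idx] - centroid, axis=1)
    return unlabelled_idx[dist > dist.mean()]

# Toy data: two labelled positives near the origin, three unlabelled points.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.1], [5.0, 5.0], [6.0, 5.0]])
is_positive = np.array([True, True, False, False, False])
rn = reliable_negatives_by_centroid(X, is_positive)
```

In this toy example the unlabelled point close to the positives is not selected, while the two distant points are.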

In this work we propose some modifications of this method, which are related to i) the distance matrix defining the Markov chain, ii) the definition of Reliable Negative examples and iii) the thresholds which define the other classes. Some of these modifications were needed in order to apply the method to general PU data sets, while others were proposed in order to make the process of class formation more flexible. The rest of this work is structured as follows. Section 2 contains the description of the PU learning method based on the two-step multi-class labelling technique, with a focus on the novelties introduced in this work. Section 3 describes the machine learning data sets used to test the performance of the method. Section 4 discusses the results obtained, while Section 5 presents the conclusions.

2 Method

In this section we detail the proposed methodology to address positive-unlabelled learning with the introduction of auxiliary classes able to further discriminate between unlabelled samples. As previously discussed, the methodology is based on the PU learning algorithm introduced in [1] and improves some aspects of it by redefining the label propagation matrix and by making the selection of the Reliable Negative instances and the thresholds dynamic, so that the selection of the various classes is adjustable. We call our algorithm APU, which stands for Adaptive Positive-Unlabelled (learning).

Let be a set whose generic element , for , is characterised by the couple where represents the features vector and the initial label. The label propagation process can be defined by the following steps:
Step 1: Compute the matrix , whose elements , with values between 0 and 1, represent the similarity score between elements and , defined as follows


where is the Euclidean distance between the features of elements and , , and . The present definition of preserves the symmetry of the score between the elements and ;
Step 2: Compute the reduced matrix as follows


where all the elements of below the threshold score are set to zero. The threshold is computed as a given quantile of the distribution of the elements in the matrix (for example the quantile can be used as threshold), in order to exclude from the label propagation process all the links between those elements that are poorly related. In order to use a Markov chain with restart to propagate the label in the unlabeled sample we must normalize as follows


where is the diagonal matrix with elements .
Step 3: Recalling that is the set of positive labelled elements, let us define the set of reliable negative elements as , where is the set of elements having zero value on the -th row of . In this way the set of reliable negatives is composed of those elements having no links with any of the positive labelled elements. In order to create a data set as balanced as possible, the threshold can be set so that . It is now possible to define the vector containing the initial labels by setting to 1 the elements in the positive labelled set, , setting to the elements in the reliable negative set, , and setting to zero the remaining elements. In this way we have a distribution of positive values and one of negative values (associated with the Positive and Reliable Negative elements, respectively), keeping the sum of the elements of at zero.
Step 4: Label propagation. Starting from the vector with initial label , a Markov process with restart (citation) is introduced in order to obtain an iterative stationary distribution of propagating labels:


where the parameter is usually set to [1, 9]. Such a process guarantees the conservation of the sum of the elements of for each , and that the positive and negative values diffuse to their neighbors (with probability ) and restart from the initial distribution (with probability ). The iteration continues until a stationary distribution is reached, which is considered to be the case when . The asymptotic vector is called .
Step 5: The remaining labels are defined as follows:


where represents the average number of non-zero elements in the rows of , i.e., the average number of neighbors of each element. Through the parameter , Formula (5) makes the number of likely positives (or likely negatives) adjustable, depending on the number of neighbors in the set of positives (or reliable negatives) which are considered.
Step 6: Classification. A machine learning classifier is trained over the data set containing the new labelling. Three different machine learning (ML) algorithms have been used: Random Forest (RF) [10], Support Vector Machine (SVM) [11], [12] and Multilayer Perceptron (MLP) [13].
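Steps 1 to 4 can be sketched in code under explicit assumptions, since the formulas of this section are not reproduced here. The similarity 1 - d/d_max, the column normalisation, the restart probability of 0.8, the stopping tolerance and the value assigned to reliable negatives (chosen so that the initial vector sums to zero) are all illustrative choices, not the definitions used by APU. The function name `propagate_labels` and the toy data are hypothetical.

```python
import numpy as np

def propagate_labels(X, is_positive, quantile=0.5, restart_prob=0.8,
                     tol=1e-10, max_iter=1000):
    """Sketch of Steps 1-4: similarity, thresholding, RN selection, diffusion."""
    X = np.asarray(X, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)

    # Step 1 (assumed form): symmetric similarity in [0, 1] from Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sim = 1.0 - dist / dist.max()

    # Step 2: drop links below a quantile threshold, then normalise each column
    # so that it sums to one (this keeps the sum of the label vector constant).
    tau = np.quantile(sim, quantile)
    reduced = np.where(sim >= tau, sim, 0.0)
    W = reduced / reduced.sum(axis=0, keepdims=True)

    # Step 3: reliable negatives are unlabelled elements with no link to any positive.
    linked_to_positive = reduced[is_positive].sum(axis=0) > 0
    is_rn = ~is_positive & ~linked_to_positive
    if not is_rn.any():
        raise ValueError("no reliable negatives: try a higher quantile")
    r0 = np.zeros(len(X))
    r0[is_positive] = 1.0
    r0[is_rn] = -is_positive.sum() / is_rn.sum()   # assumed: makes r0 sum to zero

    # Step 4: diffusion with restart, iterated until the vector stops changing.
    r = r0.copy()
    for _ in range(max_iter):
        r_next = (1.0 - restart_prob) * (W @ r) + restart_prob * r0
        if np.abs(r_next - r).sum() < tol:
            r = r_next
            break
        r = r_next
    return r, is_rn

# Toy data: two well separated clusters, two labelled positives in the first one.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
is_positive = np.array([True, True, False, False, False, False])
scores, is_rn = propagate_labels(X, is_positive)
```

In this toy run the unlabelled point in the positive cluster ends up with a positive score and the second cluster with negative scores; Step 5 would then cut these scores into the LP, WN and LN classes using thresholds that are not sketched here.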


2.1 Evaluation criteria

The main feature of APU compared to other PU algorithms is the identification of a set of instances, namely the likely positives, which have a good chance of containing the positive instances that are unlabelled. In addition, since various parameters of the algorithm can be modified, it is able to adapt well to different classification problems. To show how the proposed classifier performs in this task we introduced a parameter representing the proportion of positive instances which are labelled as positive, therefore represents the proportion of positive instances which are unlabelled and that we would like to label as likely positive. We consider several values of , namely . The value means that there are no positive instances among the unlabelled and it has been considered in order to compare APU performance with binary classifiers. The values are situations which can occur in real data sets and in which PU learning algorithms could work well without pre-processing the data set. The value has been considered as a stress test.
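A PU data set of this kind can be simulated from a fully labelled one by hiding part of the positive labels. The sketch below is one possible way to do it; the function name, the random seed and the fraction 0.75 are hypothetical and do not correspond to the values used in the experiments.

```python
import numpy as np

def hide_positive_labels(y_true, labelled_fraction, seed=0):
    """Return PU labels: 1 for positives that stay labelled, 0 for everything else."""
    y_true = np.asarray(y_true)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y_true == 1)
    n_keep = int(round(labelled_fraction * len(pos_idx)))
    keep = rng.choice(pos_idx, size=n_keep, replace=False)  # positives that keep their label
    y_pu = np.zeros_like(y_true)
    y_pu[keep] = 1
    return y_pu

# Four true positives and six true negatives; one positive loses its label.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pu = hide_positive_labels(y_true, labelled_fraction=0.75)
```

With a labelled fraction of 1 the PU labels coincide with the true labels, which corresponds to the case with no positive instances among the unlabelled.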

As performance metrics we used the traditional precision and recall, together with two further measures: modified versions of precision and recall which take into account the proportion of positives among the unlabelled instances and the misclassification error of the positive instances. In more detail, the modified versions of precision and recall are defined as follows:

where and are computed considering all positive instances whether or not they have been labelled as positive while is computed considering only negative instances. In other words, these modified versions are equal to the usual ones if , while they differ from the traditional version when , so as to reward the algorithm when it classifies truly positive instances that are unlabelled.
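The counting rule described above can be sketched as follows. The sketch assumes the usual ratios TP / (TP + FP) and TP / (TP + FN), with true positives and false negatives counted over all truly positive instances (labelled or not) and false positives counted over truly negative instances only. Which predicted classes count as a positive prediction (for example P and LP) is also an assumption, and the function name is hypothetical.

```python
import numpy as np

def modified_precision_recall(y_true, predicted_positive):
    """Precision and recall computed against the true labels, not the PU labels."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(predicted_positive).astype(bool)
    tp = np.sum(pred & y_true)     # true positives, whether labelled or hidden
    fn = np.sum(~pred & y_true)    # truly positive instances that were missed
    fp = np.sum(pred & ~y_true)    # truly negative instances predicted positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
predicted_positive = np.array([1, 1, 1, 0, 1, 1, 0, 0])
prec, rec = modified_precision_recall(y_true, predicted_positive)
```

Because hidden positives count as positives here, a prediction that recovers an unlabelled positive raises both measures instead of being penalised as a false positive.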

3 Data sets

In this section we briefly describe the data sets used to test the performance of the proposed methodology. These data sets were obtained from the UCI Machine Learning Repository [14] and have already been used to test PU learning algorithms [15].

Banknote Authentication data

Photos of specimens of genuine and counterfeit banknotes were used to collect the data. The original photographs have a resolution of 400 x 400 pixels. Gray-scale pictures with a resolution of 660 dpi were obtained. In order to extract features from the photographs, the Wavelet Transform technique was used [14]. This data set is used for binary classification tasks, to determine whether a banknote is genuine or counterfeit. The data set is composed of 1372 instances described by 4 features, namely: variance of the Wavelet Transformed image (continuous), skewness of the Wavelet Transformed image (continuous), kurtosis of the Wavelet Transformed image (continuous) and entropy of the image (continuous). The target variable is binary: 0 for real banknotes and 1 for counterfeit ones.

Pima Indians Diabetes data set

The National Institute of Diabetes and Digestive and Kidney Diseases provided this data [16]. The aim of the data set is to diagnose whether a patient has diabetes using the diagnostic measures included in the data set. The selection of these instances (n = 768 observations in total) from a larger database was subject to several constraints. In particular, all patients are women of Pima Indian heritage who are at least 21 years old. There are multiple medical predictor variables (n = 8) in the data set, as well as one target variable, i.e. whether or not a patient was diagnosed with diabetes. The features are: number of pregnancies the patient has had (integer), BMI (continuous), insulin level (continuous), age (integer), Diabetes Pedigree (continuous), Skin Thickness (integer), blood pressure (integer) and plasma glucose level (integer).

Occupancy Detection data set

This data set is intended to identify whether rooms are occupied or not according to a set of regressors, which are: Date time (date type) of the sample collection (not used for modelling), Temperature expressed in degrees Celsius (continuous), Relative Humidity (continuous), Light (continuous), CO2 in ppm (continuous) and Humidity Ratio, which is expressed in kilograms of water vapor per kilogram of air (continuous). The dependent variable is Occupancy (binary), indicated as 0 for not occupied and 1 for occupied status. The data set contains n = 9752 observations, which are experimental data obtained from time-stamped samples taken every minute [17].

4 Results

To assess the performance of APU we compare it with a binary classifier on the data sets discussed in the previous section. In particular, as mentioned above, we consider different ML algorithms, namely RF, SVM and MLP (see [13] for a detailed explanation of these algorithms), and compare the performance of APU against the binary classifier for each ML algorithm. In order to avoid poor performance due to class imbalance, we adopt balancing procedures both for APU and for the binary classifier: downsampling for APU learning and ADASYN [18] for binary learning. ADASYN is the best balancing procedure for binary classification, while downsampling is the easiest, although it is not very efficient, in particular when the number of positive instances is small compared to the number of unlabelled instances. We do not use ADASYN for APU because it has been designed for binary classification, and an extension to multi-class labelling might not be appropriate for APU: the artificial classes introduced might not be separable, so adding synthetic observations would amount to adding noise to the artificial classes. Without balancing the data sets, APU outperformed the binary classifier drastically. This is why we chose the best performing balancing strategy for the binary case (ADASYN) and the worst (downsampling) for the multi-class classification. We are therefore challenging our algorithm against the best performance that can be obtained by a binary classifier. The proposed algorithm is not intended to outperform the binary classifier in terms of recall and precision; indeed, we expect the two methods to perform similarly. The value of APU lies in the correct classification of the likely positive class, which should contain positive instances which are unlabelled.
To check this we introduce the parameter that identifies the proportion of positive instances which are actually labelled as positive and we test the performance of APU, for several values of , at labelling as likely positive the positive instances that had not been labelled as positive.
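The downsampling used for APU is not specified in detail. One simple reading, sketched below as an assumption, is to sample every class down to the size of the smallest one; the function name, the seed and the toy class sizes are hypothetical.

```python
import numpy as np

def downsample_to_smallest_class(labels, seed=0):
    """Return sorted indices that keep the same number of instances per class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()                       # size of the smallest class
    keep = [rng.choice(np.flatnonzero(labels == c), size=n_min, replace=False)
            for c in classes]
    return np.sort(np.concatenate(keep))

# Toy labelling with the five APU classes and unequal class sizes.
labels = np.array(["P"] * 3 + ["LP"] * 5 + ["WN"] * 10 + ["LN"] * 6 + ["RN"] * 4)
idx = downsample_to_smallest_class(labels)
balanced_counts = np.unique(labels[idx], return_counts=True)[1]
```

This discards instances from the larger classes, which illustrates why downsampling becomes inefficient when the positive class is small compared to the unlabelled set.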

All the results reported here were obtained as an average over 10 runs of APU for each data set and for each ML method. Table 1 shows the results on the Banknote, Pima Indians and Occupancy data sets. As detailed in Section 2.1, we consider as performance measures the classification error of the positive class, identified as in Table 1, which makes it possible to observe improvements in the correct classification of positive instances, and the modified versions of precision and recall, identified respectively with Prec and Rec in Table 1, which make it possible to investigate whether multi-label classification outperforms binary classification when there are positive instances which are unlabelled. Looking at Table 1 it turns out that for , namely when there are no positive instances that are unlabelled, the multi-label classifier APU and the binary classifier have comparable performance over all the data sets with both RF and SVM. Regarding MLP, multi-label classification works a bit better on the Pima Indians data set. For and the multi-label classifier APU and the binary classifier have comparable performance with both RF and SVM over the Banknote, Pima Indians and Occupancy data sets. Regarding MLP, multi-label classification outperforms binary classification on the Banknote and Pima Indians data sets, while they have the same performance on the Occupancy data set. For the performance of multi-label classification APU and binary classification is comparable with both RF and SVM. When using MLP instead, multi-label classification outperforms binary classification on the Banknote and Pima Indians data sets, while on the other data sets the performance is similar.

                     Binary - RF                 APU - RF
Bank        Error    0.00  0.02  0.04  0.18      0.00  0.02  0.06  0.33
            Prec     0.99  0.99  1.00  1.00      0.99  0.99  0.99  1.00
            Rec      1.00  0.98  0.96  0.82      1.00  0.98  0.94  0.68
Pima        Error    0.31  0.35  0.39  0.55      0.24  0.26  0.31  0.41
            Prec     0.61  0.61  0.64  0.62      0.60  0.59  0.61  0.62
            Rec      0.69  0.64  0.62  0.47      0.76  0.74  0.70  0.60
Occupancy   Error    0.01  0.02  0.04  0.16      0.01  0.03  0.05  0.12
            Prec     0.98  0.98  0.98  0.98      0.98  0.98  0.98  0.98
            Rec      0.99  0.98  0.96  0.84      0.99  0.97  0.95  0.89

                     Binary - SVM                APU - SVM
Bank        Error    0.00  0.00  0.00  0.00      0.00  0.00  0.00  0.00
            Prec     0.94  0.95  0.95  0.95      0.94  0.94  0.94  0.95
            Rec      1.00  1.00  1.00  1.00      1.00  1.00  1.00  1.00
Pima        Error    0.27  0.27  0.30  0.30      0.30  0.28  0.28  0.37
            Prec     0.60  0.61  0.60  0.61      0.61  0.60  0.59  0.63
            Rec      0.73  0.72  0.71  0.70      0.70  0.71  0.72  0.64
Occupancy   Error    0.00  0.00  0.00  0.01      0.00  0.00  0.00  0.01
            Prec     0.87  0.93  0.93  0.93      0.95  0.94  0.94  0.94
            Rec      1.00  1.00  1.00  1.00      1.00  1.00  0.99  0.99

                     Binary - MLP                APU - MLP
Bank        Error    0.16  0.24  0.36  0.75      0.08  0.09  0.09  0.10
            Prec     0.95  0.96  0.97  1.00      0.82  0.83  0.82  0.85
            Rec      0.84  0.76  0.65  0.25      0.92  0.92  0.91  0.90
Pima        Error    0.87  0.98  1.00  1.00      0.26  0.36  0.42  0.80
            Prec     0.64  0.53  0.15  0.00      0.52  0.55  0.53  0.57
            Rec      0.13  0.02  0.00  0.00      0.74  0.64  0.58  0.20
Occupancy   Error    0.01  0.01  0.01  0.07      0.01  0.01  0.01  0.11
            Prec     0.97  0.97  0.98  0.98      0.97  0.97  0.97  0.98
            Rec      0.99  0.99  0.99  0.93      0.99  0.99  0.99  0.89
Table 1: The table is divided into three panels, each referring to one ML algorithm: the first panel contains the performance of RF, the second that of SVM and the third that of MLP. Each panel reports the classification error of positives, the modified precision Prec and the modified recall Rec obtained for each data set, namely Bank, Pima and Occupancy, and for each proportion of positive instances that are unlabelled.

5 Conclusions

The search for positive instances in a set of unlabelled ones is a frequently occurring task with important applications, ranging from market research to finding disease-associated genes (DAGs). In particular, this last task is of paramount importance in biomedical research, in which machine learning approaches can be useful to uncover new, previously unknown, disease-associated genes. Often, in the DAG discovery process and in several other similar problems, the non-labelled items are handled as negative examples, with the risk of producing noisy negative sets, given the presence of unlabelled positive instances. To overcome this problem, several positive-unlabelled algorithms have been proposed, such as [1], among others.

In this paper we present the APU algorithm as an extension of the algorithm proposed in [1]. Our new algorithm overcomes some of the problems of the original one. The principal advantage of the proposed algorithm is that it allows a more accurate control of the set of likely positives which, extracted from the set of unlabelled elements, contains, with the highest probability, positive non-labelled elements. We applied the proposed algorithm to various data sets obtained from UCI Machine Learning repository [14].


Partially supported by the ERC Advanced Grant 788893 AMDROMA “Algorithmic and Mechanism Design Research in Online Markets”, the EC H2020RIA project “SoBigData++” (871042), and the MIUR PRIN project ALGADIMAR “Algorithms, Games, and Digital Markets”.


  • [1] Peng Yang, Xiao Li Li, Jian Ping Mei, Chee Keong Kwoh, and See Kiong Ng. Positive-unlabeled learning for disease gene identification. Bioinformatics, 28:2640–2647, 2012.
  • [2] F. Mordelet and J. P. Vert. A bagging svm to learn from positive and unlabeled examples. Pattern Recogn. Lett., 37:201–209, February 2014.
  • [3] Shiyu Chang, Yang Zhang, Jiliang Tang, Dawei Yin, Yi Chang, Mark A Hasegawa-Johnson, and Thomas S Huang. Positive-unlabeled learning in streaming networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 755–764, 2016.
  • [4] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 213–220, 2008.
  • [5] Marc Claesen, Frank De Smet, Johan AK Suykens, and Bart De Moor. A robust ensemble approach to learn from positive and unlabeled data using svm base models. Neurocomputing, 160:73–84, 2015.
  • [6] Ting Ke, Hui Lv, Mingjing Sun, and Lidong Zhang. A biased least squares support vector machine based on mahalanobis distance for pu learning. Physica A: Statistical Mechanics and its Applications, 509:422–438, 2018.
  • [7] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM ’03, page 179, USA, 2003. IEEE Computer Society.
  • [8] Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: a survey, volume 109. Springer US, 2020.
  • [9] Yongjin Li and Jagdish C Patra. Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26(9):1219–1224, 2010.
  • [10] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [11] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [12] Harris Drucker, Chris JC Burges, Linda Kaufman, Alex Smola, Vladimir Vapnik, et al. Support vector regression machines. Advances in neural information processing systems, 9:155–161, 1997.
  • [13] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer series in statistics. Springer, 2001.
  • [14] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • [15] Shantanu Jain, Martha White, and Predrag Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. Advances in neural information processing systems, 29:2693–2701, 2016.
  • [16] Md. Aminul Islam and Nusrat Jahan. Prediction of onset diabetes using machine learning techniques. International Journal of Computer Applications, 180(5):7–11, December 2017.
  • [17] Luis M. Candanedo and Véronique Feldheim. Accurate occupancy detection of an office room from light, temperature, humidity and CO 2 measurements using statistical learning models. Energy and Buildings, 112:28–39, January 2016.
  • [18] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), pages 1322–1328. IEEE, 2008.