Improving Positive Unlabeled Learning: Practical AUL Estimation and New Training Method for Extremely Imbalanced Data Sets

04/21/2020
by Liwei Jiang, et al.

Positive Unlabeled (PU) learning is widely used in many applications, where a binary classifier is trained on data sets consisting of only positive and unlabeled samples. In this paper, we improve PU learning over the state of the art in two aspects. First, existing model evaluation methods for PU learning require the ground truth of unlabeled samples, which is unlikely to be available in practice. To remove this restriction, we propose an asymptotically unbiased, practical AUL (area under the lift) estimation method, which uses raw PU data without prior knowledge of the unlabeled samples. Second, we propose ProbTagging, a new training method for extremely imbalanced data sets, where the number of unlabeled samples is hundreds or thousands of times that of positive samples. ProbTagging introduces probability into the aggregation method: each unlabeled sample is tagged positive or negative with a probability calculated from its similarity to its positive neighbors. Based on these tags, multiple data sets are generated to train different models, which are then combined into an ensemble model. Compared to state-of-the-art work, the experimental results show that ProbTagging can increase the AUC by up to 10% on two artificial PU data sets.
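The abstract only outlines ProbTagging, so the snippet below is a hedged sketch of that outline rather than the paper's algorithm: it assumes a k-nearest-neighbor distance as the similarity to positive neighbors, min-max scaling to turn similarities into tagging probabilities, and scikit-learn decision trees as base learners; the helper names (`tagging_probabilities`, `prob_tagging_ensemble`, `ensemble_scores`) are invented for illustration.

```python
# Minimal sketch of the ProbTagging idea described above (not the authors' code).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def tagging_probabilities(X_pos, X_unl, k=5):
    """Map each unlabeled sample to a pseudo-probability of being positive,
    based on its distance to its k nearest positive neighbors (assumed measure)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_pos)
    dist, _ = nn.kneighbors(X_unl)            # distances to k nearest positive samples
    sim = 1.0 / (1.0 + dist.mean(axis=1))     # simple distance-to-similarity mapping
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-12)


def prob_tagging_ensemble(X_pos, X_unl, n_models=10, k=5, seed=0):
    """Draw several probabilistically tagged data sets and train one model per draw."""
    rng = np.random.default_rng(seed)
    p = tagging_probabilities(X_pos, X_unl, k=k)
    X = np.vstack([X_pos, X_unl])
    models = []
    for _ in range(n_models):
        y_unl = (rng.random(len(X_unl)) < p).astype(int)   # random positive/negative tags
        y = np.concatenate([np.ones(len(X_pos), dtype=int), y_unl])
        models.append(DecisionTreeClassifier().fit(X, y))
    return models


def ensemble_scores(models, X):
    """Average the positive-class scores of the ensemble members."""
    scores = []
    for m in models:
        proba, pos = m.predict_proba(X), np.flatnonzero(m.classes_ == 1)
        scores.append(proba[:, pos[0]] if pos.size else np.zeros(len(X)))
    return np.mean(scores, axis=0)
```

Usage would look like `scores = ensemble_scores(prob_tagging_ensemble(X_pos, X_unl), X_test)`; the k-NN similarity, the min-max scaling, and the decision-tree base learners are placeholders for whatever similarity measure and base model the paper actually uses.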


