PULSNAR – Positive unlabeled learning selected not at random: class proportion estimation when the SCAR assumption does not hold

03/14/2023
by Praveen Kumar, et al.

Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed). This leads to a poor estimate of the proportion, α, of positives among unlabeled examples and to poor model calibration, which in turn makes the decision threshold for selecting positives uncertain. PU learning algorithms can estimate α, the probability that an individual unlabeled instance is positive, or both. We propose two PU learning algorithms to estimate α, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
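
To make the SCAR versus not-SCAR distinction concrete, the following is a minimal simulation sketch. It is not the PULSCAR or PULSNAR algorithm. Everything in it is an illustrative assumption: the one-dimensional "severity" feature, the Gaussian class distributions, the labeling probabilities, and the function name `simulate_pu`. It labels a subset of positives either uniformly (SCAR) or with a probability that grows with severity (not SCAR), then reports the true α among the unlabeled examples and how far the labeled positives drift from the positives left unlabeled.

```python
import numpy as np


def simulate_pu(n_pos=5000, n_neg=5000, scar=True, seed=0):
    """Simulate PU labeling on a hypothetical 1-D 'severity' feature."""
    rng = np.random.default_rng(seed)

    # Assumed toy data: positives centred at 2, negatives at 0.
    x_pos = rng.normal(2.0, 1.0, n_pos)
    x_neg = rng.normal(0.0, 1.0, n_neg)

    if scar:
        # SCAR: every positive is labeled with the same probability,
        # independent of its feature value.
        p_label = np.full(n_pos, 0.3)
    else:
        # Not SCAR: more severe positives are more likely to be labeled
        # (an assumed logistic dependence on the feature).
        p_label = 0.6 / (1.0 + np.exp(-2.0 * (x_pos - 2.0)))

    labeled = rng.random(n_pos) < p_label
    x_labeled = x_pos[labeled]       # the labeled positive set
    x_unl_pos = x_pos[~labeled]      # positives hidden in the unlabeled set

    # True alpha: fraction of positives among all unlabeled examples.
    alpha = x_unl_pos.size / (x_unl_pos.size + x_neg.size)

    # Shift between labeled positives and unlabeled positives.
    # Near zero under SCAR; clearly positive when labeling depends on severity.
    gap = x_labeled.mean() - x_unl_pos.mean()
    return {"alpha": alpha, "gap": gap, "n_labeled": int(labeled.sum())}


if __name__ == "__main__":
    for name, flag in (("SCAR", True), ("not SCAR", False)):
        out = simulate_pu(scar=flag)
        print(f"{name}: alpha={out['alpha']:.3f}, gap={out['gap']:.3f}")
```

Under these assumptions, the labeled positives are a representative sample of all positives only in the SCAR run. In the other run they over-represent severe cases, which is the situation in which a method that assumes SCAR can be expected to misestimate α.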


