On Non-Random Missing Labels in Semi-Supervised Learning

06/29/2022
by Xinting Hu, et al.

Semi-Supervised Learning (SSL) is fundamentally a missing-label problem. Within it, the label Missing Not At Random (MNAR) setting is more realistic and more challenging than the widely adopted yet naive Missing Completely At Random (MCAR) assumption, under which labeled and unlabeled data share the same class distribution. Existing SSL solutions overlook the role of "class" in causing the non-randomness (e.g., users are more likely to label popular classes); we instead explicitly incorporate "class" into SSL. Our method is three-fold: 1) We propose Class-Aware Propensity (CAP), which exploits the unlabeled data to train an improved classifier using the biased labeled data. 2) For rare classes, the model is low-recall but high-precision and therefore discards too many pseudo-labeled data; to encourage rare-class training, we propose Class-Aware Imputation (CAI), which dynamically decreases (or increases) the pseudo-label assignment threshold for rare (or frequent) classes. 3) We integrate CAP and CAI into a Class-Aware Doubly Robust (CADR) estimator for training an unbiased SSL model. Under various MNAR settings and ablations, our method not only significantly outperforms existing baselines but also surpasses other SSL methods that remove label bias. Our code is available at: https://github.com/JoyHuYY1412/CADR-FixMatch.
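The idea behind class-aware thresholding can be illustrated with a small sketch. This is not the paper's CAI formula, which the abstract does not state. It assumes a hypothetical linear rule in which each class's confidence threshold scales with its estimated relative frequency, so rare classes get a lower bar and frequent classes a higher one. The function names, the `base_tau` and `min_tau` parameters, and the frequency values are all assumptions made for illustration.

```python
# Illustrative sketch only: a hypothetical linear rule, not the authors' CAI.

def class_aware_thresholds(class_freq, base_tau=0.95, min_tau=0.5):
    """Map estimated class frequencies to per-class confidence thresholds.

    The most frequent class keeps base_tau; rarer classes get thresholds
    that shrink linearly toward min_tau (assumed rule).
    """
    top = max(class_freq)
    return [min_tau + (base_tau - min_tau) * (f / top) for f in class_freq]


def select_pseudo_labels(probs, thresholds):
    """Return (predicted_class, accepted) for each row of class probabilities.

    A pseudo-label is accepted when the top probability reaches the
    threshold of the predicted class.
    """
    selected = []
    for row in probs:
        label = max(range(len(row)), key=row.__getitem__)
        selected.append((label, row[label] >= thresholds[label]))
    return selected


# Assumed class frequencies: class 0 is popular, class 2 is rare.
thresholds = class_aware_thresholds([0.7, 0.2, 0.1])

# Assumed model outputs for three unlabeled samples.
probs = [
    [0.90, 0.05, 0.05],  # frequent class, fairly confident
    [0.20, 0.70, 0.10],  # mid-frequency class
    [0.30, 0.10, 0.60],  # rare class, modest confidence
]
result = select_pseudo_labels(probs, thresholds)
print(thresholds)
print(result)
```

With a single fixed threshold of 0.95, all three samples would be discarded. Under the assumed class-aware rule, the two rarer-class predictions are kept while the frequent-class prediction is still held to the stricter bar.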

research
08/15/2023

Semi-Supervised Learning with Multiple Imputations on Non-Random Missing Labels

Semi-Supervised Learning (SSL) is implemented when algorithms are traine...
research
11/18/2022

Why pseudo label based algorithm is effective? –from the perspective of pseudo labeled data

Recently, pseudo label based semi-supervised learning has achieved great...
research
12/07/2022

Leveraging Structure for Improved Classification of Grouped Biased Data

We consider semi-supervised binary classification for applications in wh...
research
01/29/2019

Rare geometries: revealing rare categories via dimension-driven statistics

In many situations, the classes of data points of primary interest also ...
research
11/27/2020

They are Not Completely Useless: Towards Recycling Transferable Unlabeled Data for Class-Mismatched Semi-Supervised Learning

Semi-Supervised Learning (SSL) with mismatched classes deals with the pr...
research
02/17/2021

Sinkhorn Label Allocation: Semi-Supervised Classification via Annealed Self-Training

Self-training is a standard approach to semi-supervised learning where t...
research
07/22/2023

Collaboratively Learning Linear Models with Structured Missing Data

We study the problem of collaboratively learning least squares estimates...
