Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak Supervision for Text Classification

05/24/2023
by Chengyu Dong, et al.

Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is arguably the simplest way to generate pseudo-labels, and show that its power was greatly underestimated. We show that the limited performance of seed matching is largely due to the label bias injected by the simple seed-match rule, which prevents the classifier from learning reliable confidence for selecting high-quality pseudo-labels. Interestingly, simply deleting the seed words present in the matched input texts can mitigate the label bias and help learn better confidence. Consequently, the performance achieved by seed matching can be improved significantly, making it on par with or even better than the state-of-the-art. Furthermore, to handle the case where the seed words are unknown, we propose to simply delete the word tokens in the input text randomly with a high deletion ratio. Remarkably, seed matching equipped with this random deletion method can often achieve even better performance than that with seed deletion.
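
The three operations named in the abstract (seed matching, seed deletion, and random deletion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the majority-count matching rule, and the example seed words are all assumptions made for illustration, and the classifier training and confidence-based selection steps are omitted.

```python
import random

def seed_match(tokens, seeds):
    """Assign a pseudo-label by counting seed-word hits per class.

    Assumed rule: the class with the most matching tokens wins;
    returns None when no seed word appears in the text.
    Ties go to the class listed first in `seeds`.
    """
    best_label, best_hits = None, 0
    for label, words in seeds.items():
        hits = sum(1 for t in tokens if t in words)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label

def delete_seeds(tokens, seeds):
    """Seed deletion: drop every token that is a seed word of any class."""
    all_seeds = set().union(*seeds.values())
    return [t for t in tokens if t not in all_seeds]

def random_delete(tokens, ratio, rng):
    """Random deletion: drop each token independently with probability `ratio`."""
    return [t for t in tokens if rng.random() >= ratio]

if __name__ == "__main__":
    # Hypothetical seed words, chosen only for this example.
    seeds = {"sports": {"soccer", "game"}, "politics": {"election", "senate"}}
    tokens = "the senate delayed the election vote".split()

    label = seed_match(tokens, seeds)          # pseudo-label from seed matching
    debiased = delete_seeds(tokens, seeds)     # input with seed words removed
    corrupted = random_delete(tokens, 0.9, random.Random(0))  # high deletion ratio
    print(label, debiased, corrupted)
```

In this sketch the pseudo-label is computed from the original text, while the classifier would be trained on the text after deletion, so it cannot rely on the seed words that produced the label. The deletion ratio of 0.9 is only a placeholder for "high"; the abstract does not state a value.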

research
05/22/2023

A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches

Extremely Weakly Supervised Text Classification (XWS-TC) refers to text c...
research
10/13/2022

LIME: Weakly-Supervised Text Classification Without Seeds

In weakly-supervised text classification, only label names act as source...
research
04/20/2021

Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation

Weakly-supervised text classification aims to induce text classifiers fr...
research
05/24/2022

WeDef: Weakly Supervised Backdoor Defense for Text Classification

Existing backdoor defense methods are only effective for limited trigger...
research
11/05/2017

Multi-label Dataless Text Classification with Topic Modeling

Manually labeling documents is tedious and expensive, but it is essentia...
research
02/20/2020

MEUZZ: Smart Seed Scheduling for Hybrid Fuzzing

Seed scheduling is a prominent factor in determining the yields of hybri...
research
04/30/2020

User-Guided Aspect Classification for Domain-Specific Texts

Aspect classification, identifying aspects of text segments, facilitates...
