Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification

12/04/2017
by Sachin Pawar, et al.

We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph in which each document is represented by a node and edge weights represent similarities between documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by adding the topics as additional nodes. The edge weight between a topic and a text document represents the level of "affinity" between them. Our approach does not require document-level labelling; instead, it expects manual labels only for topic nodes. This significantly reduces the level of supervision needed, as only a few labelled topics are observed to be enough to achieve sufficiently high accuracy. The Label Propagation Algorithm is then run on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare it with state-of-the-art weakly supervised text classification approaches.
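The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how edge weights are computed or which variant of label propagation is used, so the code below assumes a simple clamped, iterative propagation over a hand-written toy graph. The function names `build_graph` and `propagate_labels`, the similarity values, and the document-topic affinities (which in practice would come from LDA) are all hypothetical.

```python
def build_graph(doc_sim, doc_topic):
    """Combine document and topic nodes into one symmetric weight matrix.

    Nodes 0..D-1 are documents; nodes D..D+T-1 are topics.
    doc_sim[i][j] is an assumed document-document similarity and
    doc_topic[i][k] an assumed document-topic affinity (e.g. from LDA).
    """
    num_docs, num_topics = len(doc_sim), len(doc_topic[0])
    n = num_docs + num_topics
    weights = [[0.0] * n for _ in range(n)]
    for i in range(num_docs):
        for j in range(num_docs):
            if i != j:  # no self-loops
                weights[i][j] = doc_sim[i][j]
        for k in range(num_topics):
            weights[i][num_docs + k] = doc_topic[i][k]
            weights[num_docs + k][i] = doc_topic[i][k]
    return weights


def propagate_labels(weights, seed_labels, num_classes, iterations=50):
    """Clamped label propagation; seed_labels maps node index -> class index."""
    n = len(weights)
    # Unlabelled nodes start uniform; seed nodes start (and stay) one-hot.
    scores = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for node, label in seed_labels.items():
        scores[node] = [1.0 if c == label else 0.0 for c in range(num_classes)]

    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in seed_labels or total == 0:
                updated.append(scores[i])  # keep seeds and isolated nodes fixed
                continue
            # New distribution: weighted average of the neighbours' distributions.
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(num_classes)
            ])
        scores = updated
    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]


# Toy data (made up): 4 documents, 2 topics, 2 classes.
doc_sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
doc_topic = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]

graph = build_graph(doc_sim, doc_topic)
# Only the two topic nodes (indices 4 and 5) carry manual labels.
labels = propagate_labels(graph, seed_labels={4: 0, 5: 1}, num_classes=2)
doc_labels = labels[:4]
```

In this toy graph, the second and third documents have equal affinity to both topics, so their predicted classes are determined by their similarity to the other documents. This mirrors the combination of topic-level supervision and document-document similarity that the abstract describes.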


