Weakly Supervised Prototype Topic Model with Discriminative Seed Words: Modifying the Category Prior by Self-exploring Supervised Signals

11/20/2021
by   Bing Wang, et al.

Dataless text classification, a new paradigm of weakly supervised learning, refers to the task of learning from unlabeled documents and a few predefined representative words for each category, known as seed words. Recent generative dataless methods construct document-specific category priors from seed word occurrences only; however, such category priors often carry very limited and even noisy supervised signals. To remedy this problem, we propose a novel formulation of the category prior. First, for each document, we estimate its label membership degree not only by counting seed word occurrences but also by using a novel prototype scheme that captures pseudo-nearest neighboring categories. Second, for each label, we consider its frequency in the corpus as prior knowledge, which is also discriminative for classification. By incorporating the proposed category prior into a previous generative dataless method, we obtain a novel generative dataless method, namely the Weakly Supervised Prototype Topic Model (WSPTM). Experimental results on real-world datasets demonstrate that WSPTM outperforms the existing baseline methods.
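To make the two ingredients of the category prior more concrete, the following is a minimal sketch and not the WSPTM formulation itself, since the abstract gives no formulas. It assumes that prototypes are summed bag-of-words vectors of documents pseudo-labeled by their dominant seed-word category, that membership is the seed count plus cosine similarity to the prototype, and that label frequency is the smoothed share of pseudo-labeled documents. The names `cosine`, `category_priors`, and `smoothing` are hypothetical.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words Counters."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def category_priors(docs, seed_words, smoothing=0.1):
    """Illustrative document-specific category priors (assumed formulation).

    docs: list of token lists; seed_words: dict mapping category -> seed words.
    Returns one dict per document mapping category -> prior probability.
    """
    categories = list(seed_words)
    bows = [Counter(d) for d in docs]

    # Signal 1: seed word occurrences per document and category.
    seed_counts = [
        [sum(bow[w] for w in seed_words[c]) for c in categories] for bow in bows
    ]

    # Assumed prototype scheme: pseudo-label each document by its dominant
    # seed category and accumulate its words into that category's prototype.
    prototypes = {c: Counter() for c in categories}
    label_freq = {c: smoothing for c in categories}
    for bow, counts in zip(bows, seed_counts):
        if max(counts) > 0:
            c = categories[counts.index(max(counts))]
            prototypes[c].update(bow)
            label_freq[c] += 1.0  # Signal 2: corpus-level label frequency.
    total = sum(label_freq.values())

    priors = []
    for bow, counts in zip(bows, seed_counts):
        scores = []
        for c, n in zip(categories, counts):
            # Membership combines seed counts with similarity to the prototype.
            membership = n + cosine(bow, prototypes[c]) + smoothing
            scores.append(membership * label_freq[c] / total)
        z = sum(scores)
        priors.append({c: s / z for c, s in zip(categories, scores)})
    return priors


if __name__ == "__main__":
    docs = [
        ["goal", "team", "match"],
        ["team", "coach", "season"],
        ["stock", "market", "bank"],
        ["bank", "loan", "rates"],
        ["coach", "season", "goal"],  # contains no seed word
    ]
    seeds = {"sports": ["team"], "finance": ["bank"]}
    for prior in category_priors(docs, seeds):
        print(prior)
```

In this toy corpus the last document contains no seed word, so a prior based on seed counts alone would be uninformative for it; under the assumptions above, its similarity to the sports prototype still shifts the prior toward that category.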

research
04/20/2021

Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation

Weakly-supervised text classification aims to induce text classifiers fr...
research
11/05/2017

Multi-label Dataless Text Classification with Topic Modeling

Manually labeling documents is tedious and expensive, but it is essentia...
research
05/23/2022

Seeded Hierarchical Clustering for Expert-Crafted Taxonomies

Practitioners from many disciplines (e.g., political science) use expert...
research
02/22/2016

Empath: Understanding Topic Signals in Large-Scale Text

Human language is colored by a broad range of topics, but existing text ...
research
12/21/2016

Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data

Topic models have been successfully applied in lexicon extraction. Howev...
research
10/27/2022

BERT-Flow-VAE: A Weakly-supervised Model for Multi-Label Text Classification

Multi-label Text Classification (MLTC) is the task of categorizing docum...
research
11/24/2021

Out-of-Category Document Identification Using Target-Category Names as Weak Supervision

Identifying outlier documents, whose content is different from the major...
