Out-of-Category Document Identification Using Target-Category Names as Weak Supervision

11/24/2021
by Dongha Lee, et al.

Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing large text collections. However, because explicit information about the inlier (or target) distribution is absent, existing unsupervised outlier detectors are likely to produce unreliable results, depending on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task, referred to as out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories by using the category names as weak supervision. In practice, this task is widely applicable because the scope of the target categories can be designated flexibly according to users' interests, while only the target-category names are required as minimal guidance. In this paper, we present an out-of-category detection framework that measures how confidently each document belongs to one of the target categories based on its category-specific relevance score. Our framework adopts a two-step approach: (i) it first generates pseudo-category labels for all unlabeled documents by exploiting the word-document similarity encoded in a text embedding space, and then (ii) it trains a neural classifier on the pseudo-labels in order to compute the confidence from its target-category prediction. Experiments on real-world datasets demonstrate that our framework outperforms all baseline methods in detection performance across various scenarios specifying different target categories.
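
The two-step idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes hand-made 2-D vectors in place of a learned text embedding space, a softmax regression in place of the neural classifier, and a hypothetical margin filter (`MIN_MARGIN`) that keeps only clearly pseudo-labeled documents for training. All names, vectors, and thresholds below are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def pseudo_label(doc, cat_vecs):
    """Step (i): most similar category name, plus the margin to the runner-up."""
    sims = [cosine(doc, c) for c in cat_vecs]
    ranked = sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)
    return ranked[0], sims[ranked[0]] - sims[ranked[1]]

def train_softmax(X, y, n_cats, epochs=300, lr=1.0):
    """Step (ii): full-batch gradient descent on cross-entropy over pseudo-labels."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_cats)]
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(n_cats)]
        for x, label in zip(X, y):
            p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
            for k in range(n_cats):
                err = p[k] - (1.0 if k == label else 0.0)
                for j in range(dim):
                    grad[k][j] += err * x[j] / len(X)
        for k in range(n_cats):
            for j in range(dim):
                W[k][j] -= lr * grad[k][j]
    return W

def confidence(W, x):
    """Confidence = highest predicted target-category probability."""
    return max(softmax([sum(w * xi for w, xi in zip(row, x)) for row in W]))

# Hypothetical embeddings of two target-category names, e.g. "sports" and "politics".
category_vecs = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical document embeddings; the last one is equally far from both names.
doc_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]]

labeled = [pseudo_label(d, category_vecs) for d in doc_vecs]

MIN_MARGIN = 0.2  # assumption: train only on clearly pseudo-labeled documents
train = [(d, lab) for d, (lab, margin) in zip(doc_vecs, labeled) if margin >= MIN_MARGIN]
W = train_softmax([d for d, _ in train], [lab for _, lab in train],
                  n_cats=len(category_vecs))

scores = [confidence(W, d) for d in doc_vecs]
THRESHOLD = 0.7  # assumption: illustrative cut-off for flagging documents
out_of_category = [i for i, s in enumerate(scores) if s < THRESHOLD]
print("confidence per document:", [round(s, 3) for s in scores])
print("flagged as out-of-category:", out_of_category)
```

In this toy setting the document that is equally similar to both category names receives a confidence near chance level and is flagged, while the documents close to a single category name score high. A real system would replace the toy vectors with embeddings learned from the corpus and would need a principled way to choose the threshold.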
