Minimally Supervised Categorization of Text with Metadata

05/01/2020
by Yu Zhang, et al.

Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite their success in conventional supervised document classification, existing studies pay little attention to two practical problems: (1) the presence of metadata: in many domains, text is accompanied by additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be incorporated into the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, so categorization must be performed using only a small set of annotated data. To address these two challenges, we propose MetaCat, a minimally supervised framework for categorizing text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results demonstrate the effectiveness of MetaCat over many competitive baselines.
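
The idea of synthesizing training samples from a label-conditioned generative process can be illustrated with a toy sketch. This is not the MetaCat implementation: the abstract does not specify the model, so the cosine-softmax word distribution, the two-dimensional embeddings, the vocabulary, and the function names `word_distribution` and `synthesize_documents` are all assumptions made for illustration. The sketch assumes only that labels and words already live in one shared embedding space, and it samples pseudo-documents for each label from words close to that label.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def word_distribution(label_vec, word_vecs, temperature=0.1):
    """Assumed model: p(word | label) is a softmax over cosine similarities."""
    scores = [cosine(label_vec, w) / temperature for w in word_vecs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def synthesize_documents(label_vecs, word_vecs, vocab,
                         docs_per_label=3, doc_len=5, seed=0):
    """Sample (words, label) pseudo-training pairs for every label."""
    rng = random.Random(seed)
    samples = []
    for label, label_vec in label_vecs.items():
        probs = word_distribution(label_vec, word_vecs)
        for _ in range(docs_per_label):
            words = rng.choices(vocab, weights=probs, k=doc_len)
            samples.append((words, label))
    return samples


# Hypothetical toy data: two labels and six words in a shared 2-D space.
vocab = ["goal", "match", "team", "election", "senate", "vote"]
word_vecs = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3],
             [0.1, 1.0], [0.2, 0.9], [0.3, 0.8]]
label_vecs = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}

synthetic = synthesize_documents(label_vecs, word_vecs, vocab)
for words, label in synthetic:
    print(label, words)
```

The synthetic pairs could then be added to the small annotated set when training a classifier. In a metadata-aware setting, author or tag embeddings would presumably be sampled from the same space in a similar way, though that step is not shown here.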
