Robust Document Representations using Latent Topics and Metadata

10/23/2020
by Natraj Raman, et al.

Task-specific fine-tuning of a pre-trained neural language model with a custom softmax output layer has become the de facto approach to document classification. This technique is inadequate, however, when labeled examples are unavailable at training time and when a document's metadata artifacts must be exploited. We address these challenges by generating document representations that capture both text and metadata artifacts in a task-agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft partition of the input space when generating text embeddings. Specifically, we employ a pre-learned topic model distribution as surrogate labels and construct a loss function based on KL divergence. Our solution also incorporates metadata explicitly rather than merely appending it to the text. The generated document embeddings exhibit compositional characteristics and are used directly by downstream classification tasks to create decision boundaries from a small number of labeled examples, thereby eschewing complicated recognition methods. We demonstrate through extensive evaluation that our proposed cross-model fusion solution outperforms several competitive baselines on multiple datasets.
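To make the objective concrete, below is a minimal PyTorch sketch (not the authors' released code) of the training signal the abstract describes: a document encoder fuses text and metadata features, and its softmax over latent topics is matched via KL divergence to a pre-learned topic model's posterior, which acts as the surrogate label. The encoder architecture, dimensions, and fusion scheme here are illustrative assumptions.

```python
# Hedged sketch of self-supervised training with a topic-model surrogate label.
# `DocEncoder`, the feature sizes, and the additive fusion are assumptions for
# illustration, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 50          # number of topics in the pre-learned topic model (assumed)
TEXT_DIM = 768  # e.g. pooled output size of a pre-trained language model
META_DIM = 32   # size of an explicit metadata feature vector (assumed)

class DocEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Metadata is fused explicitly with the text features,
        # rather than being appended to the input text.
        self.text_proj = nn.Linear(TEXT_DIM, 256)
        self.meta_proj = nn.Linear(META_DIM, 256)
        self.topic_head = nn.Linear(256, K)

    def forward(self, text_feat, meta_feat):
        h = torch.tanh(self.text_proj(text_feat) + self.meta_proj(meta_feat))
        return h, self.topic_head(h)  # document embedding and topic logits

model = DocEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch: pooled text features, metadata features, and the topic-model
# posterior p(topic | doc) serving as the surrogate label (random here).
text_feat = torch.randn(8, TEXT_DIM)
meta_feat = torch.randn(8, META_DIM)
topic_target = F.softmax(torch.randn(8, K), dim=-1)

opt.zero_grad()
_, logits = model(text_feat, meta_feat)
log_q = F.log_softmax(logits, dim=-1)
# KL(target || model); F.kl_div expects log-probabilities as its input.
loss = F.kl_div(log_q, topic_target, reduction="batchmean")
loss.backward()
opt.step()
```

In this setup the topic posterior supplies the soft partition of the input space, so no human labels are needed during training; the learned embedding `h` would then be reused by downstream classifiers trained from a small number of labeled examples.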


Related research

05/01/2020 · Minimally Supervised Categorization of Text with Metadata
Document categorization, which aims to assign a topic label to each docu...

05/22/2023 · Learning Easily Updated General Purpose Text Representations with Adaptable Task-Specific Prefixes
Many real-world applications require making multiple predictions from th...

05/06/2022 · Prompt Distribution Learning
We present prompt distribution learning for effectively adapting a pre-t...

06/07/2021 · LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models
Cross-lingual document representations enable language understanding in ...

09/29/2020 · Zero-Shot Clinical Acronym Expansion with a Hierarchical Metadata-Based Latent Variable Model
We introduce Latent Meaning Cells, a deep latent variable model which le...

06/01/2021 · NewsEmbed: Modeling News through Pre-trained Document Representations
Effectively modeling text-rich fresh content such as news articles at do...

11/07/2021 · How does a Pre-Trained Transformer Integrate Contextual Keywords? Application to Humanitarian Computing
In a classification task, dealing with text snippets and metadata usuall...
