NewsEmbed: Modeling News through Pre-trained Document Representations

06/01/2021
by Jialu Liu, et al.

Effectively modeling text-rich, fresh content such as news articles at the document level is a challenging problem. To ensure a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset that is large beyond the scale of human labeling while maintaining the desired quality. In this work, we address these two challenges by proposing a novel approach to mine semantically relevant fresh documents, together with their topic labels, with little human supervision. Meanwhile, we design a multitask model called NewsEmbed that alternately trains a contrastive learning objective and a multi-label classification objective to derive a universal document encoder. We show that the proposed approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed's competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.
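To make the alternating multitask setup concrete, the following is a minimal PyTorch-style sketch of how a shared document encoder could be trained by switching between a contrastive objective over mined document pairs and a multi-label topic classification objective on successive steps. The function names, batch layout, loss choices, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (PyTorch) of alternating two objectives on one shared document
# encoder: an in-batch contrastive loss over semantically related document pairs,
# and a multi-label topic classification loss with a linear head.
# All names, shapes, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn.functional as F

def contrastive_step(encoder, doc_a, doc_b, temperature=0.05):
    # Each document's mined partner is the positive; all other documents in the
    # batch serve as in-batch negatives.
    za = F.normalize(encoder(doc_a), dim=-1)        # [batch, dim]
    zb = F.normalize(encoder(doc_b), dim=-1)        # [batch, dim]
    logits = za @ zb.t() / temperature              # pairwise cosine similarities
    labels = torch.arange(za.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

def classification_step(encoder, classifier, docs, topic_labels):
    # Multi-label topic prediction from the same encoder's document embedding.
    logits = classifier(encoder(docs))              # [batch, num_topics]
    return F.binary_cross_entropy_with_logits(logits, topic_labels.float())

def train_step(step, encoder, classifier, optimizer, batch):
    # Alternate between the two objectives on even and odd steps.
    optimizer.zero_grad()
    if step % 2 == 0:
        loss = contrastive_step(encoder, batch["doc_a"], batch["doc_b"])
    else:
        loss = classification_step(encoder, classifier, batch["docs"], batch["labels"])
    loss.backward()
    optimizer.step()
    return loss.item()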


Related research

04/15/2021  Empowering News Recommendation with Pre-trained Language Models
Personalized news recommendation is an essential technique for online ne...

06/07/2018  Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints
Semantic annotations have to satisfy quality constraints to be useful fo...

08/15/2020  Label-Wise Document Pre-Training for Multi-Label Text Classification
A major challenge of multi-label text classification (MLTC) is to stimul...

07/04/2017  Multilingual Hierarchical Attention Networks for Document Classification
Hierarchical attention networks have recently achieved remarkable perfor...

06/07/2021  LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models
Cross-lingual document representations enable language understanding in ...

09/03/2017  Understanding the Logical and Semantic Structure of Large Documents
Current language understanding approaches focus on small documents, such...

10/23/2020  Robust Document Representations using Latent Topics and Metadata
Task specific fine-tuning of a pre-trained neural language model using a...
