Patton: Language Model Pretraining on Text-Rich Networks

05/20/2023
by Bowen Jin, et al.

A real-world text corpus sometimes comprises not only text documents but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model text and do not take inter-document structure information into consideration. To this end, we propose Patton, our PretrAining on TexT-Rich NetwOrk framework. Patton includes two pretraining strategies, network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks across five datasets from both the academic and e-commerce domains, where Patton outperforms baselines significantly and consistently.
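As a concrete illustration, below is a minimal, hypothetical PyTorch sketch of how the two pretraining objectives could be combined in one model: neighbor-node embeddings are prepended to the token sequence for network-contextualized masked language modeling, and a pooled sequence representation drives masked node prediction. All class and module names, dimensions, the way network context is injected, and the pooling scheme are assumptions for illustration only, not Patton's actual implementation.

```python
# Hypothetical sketch of joint pretraining on a text-rich network.
# This is NOT Patton's code; it only mirrors the two objectives named in the abstract.
import torch
import torch.nn as nn

class ToyTextRichNetworkPretrainer(nn.Module):
    def __init__(self, vocab_size=30522, hidden=128, num_nodes=1000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.node_emb = nn.Embedding(num_nodes, hidden)      # one embedding per document/node
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mlm_head = nn.Linear(hidden, vocab_size)        # predicts masked tokens
        self.node_head = nn.Linear(hidden, num_nodes)        # predicts a masked neighbor node

    def forward(self, token_ids, neighbor_ids, mlm_labels, masked_node_labels):
        # Network-contextualized MLM (assumed form): prepend neighbor-node
        # embeddings so masked tokens are predicted with textual AND structural context.
        tok = self.token_emb(token_ids)                      # (B, L, H)
        ctx = self.node_emb(neighbor_ids)                    # (B, K, H)
        h = self.encoder(torch.cat([ctx, tok], dim=1))       # (B, K+L, H)

        k = neighbor_ids.size(1)
        mlm_logits = self.mlm_head(h[:, k:, :])              # token positions only
        mlm_loss = nn.functional.cross_entropy(
            mlm_logits.reshape(-1, mlm_logits.size(-1)),
            mlm_labels.reshape(-1),
            ignore_index=-100,                               # unmasked positions are ignored
        )

        # Masked node prediction (assumed form): pool the sequence and
        # classify which neighbor node was masked out of the network.
        pooled = h.mean(dim=1)                               # (B, H)
        node_loss = nn.functional.cross_entropy(self.node_head(pooled), masked_node_labels)

        return mlm_loss + node_loss                          # joint pretraining objective

# Toy usage with random data: 2 documents, 16 tokens each, 3 neighbors each.
model = ToyTextRichNetworkPretrainer()
token_ids = torch.randint(0, 30522, (2, 16))
neighbor_ids = torch.randint(0, 1000, (2, 3))
mlm_labels = torch.full((2, 16), -100)
mlm_labels[:, 5] = token_ids[:, 5]                           # pretend position 5 was masked
masked_node_labels = torch.randint(0, 1000, (2,))
loss = model(token_ids, neighbor_ids, mlm_labels, masked_node_labels)
loss.backward()
```

The key design point this sketch tries to convey is that both losses share one encoder, so gradients from the structural objective shape the same representations used for masked-token prediction.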
