Visually-Augmented Language Modeling

05/20/2022
by Weizhi Wang et al.

Human language is grounded in multimodal knowledge, including visual knowledge such as colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training over massive text corpora, which precludes them from utilizing relevant visual information when necessary. To address this, we propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. Specifically, VaLM builds on a novel text-vision alignment method via an image retrieval module that fetches corresponding images given a textual context. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling by attending to both the text context and the visual knowledge in the retrieved images. We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel. VaLM outperforms the text-only baseline with substantial gains on these reasoning tasks (e.g., +8.66 accuracy).
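The two components described above, retrieving images relevant to a textual context and fusing them into the text representation by attention, can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes pre-computed embeddings in a shared text-image space (as a CLIP-style encoder would provide), uses cosine similarity for retrieval, and models the fusion layer as single-query scaled dot-product attention with a residual connection. All function names, shapes, and the toy data are illustrative.

```python
import numpy as np

def retrieve_images(text_emb, image_embs, k=2):
    """Top-k image retrieval by cosine similarity to the text context.
    Stand-in for VaLM's image retrieval module; assumes a shared
    text-image embedding space (e.g., from a CLIP-style encoder)."""
    t = text_emb / np.linalg.norm(text_emb)
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = im @ t                      # cosine similarity per image
    top = np.argsort(-sims)[:k]        # indices of the k most similar images
    return image_embs[top], sims[top]

def fuse_visual_knowledge(text_emb, retrieved):
    """Sketch of a visual knowledge fusion layer: the text representation
    attends over the retrieved image embeddings (scaled dot-product
    attention) and the result is added residually."""
    d = text_emb.shape[0]
    scores = retrieved @ text_emb / np.sqrt(d)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return text_emb + weights @ retrieved

# Toy data: a 100-image "database" and one text-context embedding.
rng = np.random.default_rng(0)
image_db = rng.normal(size=(100, 64))
context = rng.normal(size=64)

imgs, sims = retrieve_images(context, image_db, k=4)
fused = fuse_visual_knowledge(context, imgs)
print(fused.shape)  # (64,)
```

In the actual model the fusion happens inside the Transformer (per token, with learned projections); the sketch only shows the retrieval-then-attend dataflow that lets the language model condition on visual evidence.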


Related research

Visually-augmented pretrained language models for NLP tasks without images (12/15/2022)
Although pre-trained language models (PLMs) have shown impressive perfor…

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision (10/14/2020)
Humans learn language by listening, speaking, writing, reading, and also…

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it? (09/23/2021)
Large language models are known to suffer from the hallucination problem…

Learning to Imagine: Visually-Augmented Natural Language Generation (05/26/2023)
People often imagine relevant scenes to aid in the writing process. In t…

COFAR: Commonsense and Factual Reasoning in Image Search (10/16/2022)
One characteristic that makes humans superior to modern artificially int…

LLM Itself Can Read and Generate CXR Images (05/19/2023)
Building on the recent remarkable development of large language models (…

Learning to Represent Image and Text with Denotation Graph (10/06/2020)
Learning to fuse vision and language information and representing them i…
