Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination

10/21/2022
by Yue Yang, et al.

Large-scale pretrained language models have made significant advances in solving downstream language understanding tasks. However, they generally suffer from reporting bias: obvious commonsense knowledge is rarely stated explicitly in written text (e.g., "an orange is orange"), so language models never learn it. To overcome this limitation, we develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities. Specifically, we leverage two complementary types of "imaginations": (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks. Notably, fueling language models with imagination lets them exploit visual knowledge to solve plain language tasks. Consequently, Z-LaVI consistently improves the zero-shot performance of existing language models across a diverse set of language tasks.
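The abstract describes the pipeline only at a high level. Below is a minimal sketch of what the visual channel and the ensemble step could look like using an off-the-shelf CLIP from Hugging Face. The weighting factor `lam`, the helper names, and the assumption that the imagined images (whether retrieved or synthesized) are already available as PIL images are all illustrative, not taken from the paper.

```python
# Minimal sketch of a Z-LaVI-style visual channel plus ensemble.
# Assumptions (not from the paper): imagined images are supplied as PIL
# images, and the two channels are combined by weighted soft voting.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(images: list[Image.Image], candidates: list[str]) -> torch.Tensor:
    """Score each answer candidate against the imagined images with CLIP."""
    inputs = processor(text=candidates, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_images, num_candidates):
    # softmax over candidates, then average across the imagined images.
    return out.logits_per_image.softmax(dim=-1).mean(dim=0)

def ensemble(lm_probs: torch.Tensor, vis_probs: torch.Tensor,
             lam: float = 0.5) -> torch.Tensor:
    """Soft-vote between the language model and the visual channel."""
    return lam * lm_probs + (1 - lam) * vis_probs

# Example: pick the candidate with the highest ensembled probability.
# images = [Image.open("imagined_0.jpg"), Image.open("imagined_1.jpg")]
# probs = ensemble(lm_probs, clip_scores(images, ["orange", "blue", "green"]))
# answer = int(probs.argmax())
```

Soft voting between the language model's distribution over candidates and the CLIP-based distribution is one natural way to realize the "jointly exploiting" step; the paper's actual ensembling scheme may differ.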


research · 02/07/2022
Cedille: A large autoregressive French language model
Scaling up the size and training of autoregressive language models has e...

research · 04/13/2023
What does CLIP know about a red circle? Visual prompt engineering for VLMs
Large-scale Vision-Language Models, such as CLIP, learn powerful image-t...

research · 09/26/2022
Can Large Language Models Truly Understand Prompts? A Case Study with Negated Prompts
Previous work has shown that there exists a scaling law between the size...

research · 02/09/2023
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Augmenting pretrained language models (LMs) with a vision encoder (e.g.,...

research · 04/01/2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large foundation models can exhibit unique capabilities depending on the...

research · 02/23/2023
Teaching CLIP to Count to Ten
Large vision-language models (VLMs), such as CLIP, learn rich joint imag...

research · 05/17/2022
M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems
Industrial recommender systems have been growing increasingly complex, m...
