I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data

11/17/2022
by Sophia Gu, et al.

Many high-level skills required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between the embedding spaces of different modalities in contrastive models; we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment, and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
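The core idea can be illustrated with a toy sketch: in a contrastive joint space, text and image embeddings of the same content are close, but separated by a systematic modality gap. One simple mitigation strategy, shown below on synthetic data, is to re-center each modality's embeddings by its own mean. This is an illustrative sketch, not the paper's code; the embeddings, the constant-offset model of the gap, and the re-centering fix are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 200

# Shared "semantic" component for each paired text/image example.
semantics = rng.normal(size=(n, dim))

# Simulate a contrastive joint space with a systematic modality gap:
# image embeddings are shifted by a constant offset relative to text.
gap = 2.0 * rng.normal(size=dim)
text_emb = semantics + 0.1 * rng.normal(size=(n, dim))
image_emb = semantics + gap + 0.1 * rng.normal(size=(n, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_cosine(a, b):
    # Average cosine similarity between corresponding rows of a and b.
    return float(np.mean(np.sum(normalize(a) * normalize(b), axis=-1)))

# Raw cross-modal similarity is depressed by the gap...
raw = mean_cosine(text_emb, image_emb)

# ...while subtracting each modality's mean removes the constant
# offset and recovers the shared semantics.
text_centered = text_emb - text_emb.mean(axis=0)
image_centered = image_emb - image_emb.mean(axis=0)
corrected = mean_cosine(text_centered, image_centered)

print(f"raw similarity:       {raw:.3f}")
print(f"corrected similarity: {corrected:.3f}")
```

A model trained only on text embeddings can then be fed re-centered image embeddings at test time, which is the kind of text-to-image transfer the abstract describes.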

Related research

- 09/30/2022: Linearly Mapping from Image to Text Space. The extent to which text-only language models (LMs) learn to represent t...
- 05/24/2022: On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization. Integrating vision and language has gained notable attention following t...
- 08/17/2019: Language Features Matter: Effective Language Representations for Vision-Language Tasks. Shouldn't language and vision features be treated equally in vision-lang...
- 03/19/2016: Generating Natural Questions About an Image. There has been an explosion of work in the vision & language community d...
- 02/04/2022: Webly Supervised Concept Expansion for General Purpose Vision Models. General purpose vision (GPV) systems are models that are designed to sol...
- 07/02/2017: Modulating early visual processing by language. It is commonly assumed that language refers to high-level visual concept...
- 09/23/2020: X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers. Mirroring the success of masked language models, vision-and-language cou...
