Language Features Matter: Effective Language Representations for Vision-Language Tasks

08/17/2019
by Andrea Burns, et al.

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or are learned from scratch. We believe that language features deserve more attention, and conduct experiments that compare different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments yield some striking results: an average embedding language model outperforms an LSTM on retrieval-style tasks, and state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to improve performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding: http://ai.bu.edu/grovle.
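To make the comparison concrete, here is a minimal sketch of the "average embedding" language model the abstract refers to: a sentence representation formed by mean-pooling fixed word vectors. The vocabulary and random vectors below are toy stand-ins (not GrOVLE or any pretrained embedding), used only to illustrate the mechanism.

```python
import numpy as np

# Toy vocabulary and word vectors; a real setup would load pretrained
# embeddings (e.g. Word2Vec or GrOVLE) instead of random vectors.
rng = np.random.default_rng(0)
vocab = {"a": 0, "dog": 1, "runs": 2, "fast": 3}
dim = 4
embeddings = rng.normal(size=(len(vocab), dim))

def average_embedding(sentence):
    """Represent a sentence as the mean of its in-vocabulary word vectors."""
    idxs = [vocab[w] for w in sentence.lower().split() if w in vocab]
    if not idxs:  # no known words: fall back to a zero vector
        return np.zeros(dim)
    return embeddings[idxs].mean(axis=0)

vec = average_embedding("a dog runs fast")
print(vec.shape)  # (4,)
```

Unlike an LSTM, this model has no trainable sequence parameters and ignores word order, yet the paper finds it competitive or superior on retrieval-style VL tasks.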


