Probing Text Models for Common Ground with Visual Representations

05/01/2020
by Gabriel Ilharco, et al.

Vision, as a central component of human perception, plays a fundamental role in shaping natural language. To better understand how text models are connected to our visual perceptions, we propose a method for examining the similarities between neural representations extracted from words in text and objects in images. Our approach uses a lightweight probing model that learns to map language representations of concrete words to the visual domain. We find that representations from models trained on purely textual data, such as BERT, can be nontrivially mapped to those of a vision model. Such mappings generalize to object categories that were never seen by the probe during training, unlike mappings learned from permuted or random representations. Moreover, we find that the context surrounding objects in sentences greatly impacts performance. Finally, we show that humans significantly outperform all examined models, suggesting considerable room for improvement in representation learning and grounding.
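The probing setup described above can be illustrated with a minimal sketch: fit a lightweight linear map from text-side features to vision-side features, then score it by nearest-neighbor retrieval on held-out items. The data below is synthetic (random stand-ins for BERT word embeddings and visual object features), and the ridge-regression probe and cosine-similarity retrieval are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): in the paper's setting these would be
# contextual word embeddings for concrete nouns (text side) and features of
# the corresponding objects from a vision model (image side).
n_train, n_test = 200, 50
d_text, d_vision = 64, 32

W_true = rng.normal(size=(d_text, d_vision))          # hidden ground-truth map
X_train = rng.normal(size=(n_train, d_text))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, d_vision))
X_test = rng.normal(size=(n_test, d_text))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, d_vision))

# Lightweight linear probe, fit in closed form with ridge regression.
lam = 1e-2
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d_text),
                    X_train.T @ Y_train)

# Evaluate by nearest-neighbor retrieval over held-out visual features:
# for each mapped text vector, is the closest visual vector the matching one?
pred = X_test @ W
pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
Y_n = Y_test / np.linalg.norm(Y_test, axis=1, keepdims=True)
sims = pred_n @ Y_n.T                                  # cosine similarities
top1 = (sims.argmax(axis=1) == np.arange(n_test)).mean()
print(f"top-1 retrieval accuracy: {top1:.2f}")
```

On real features, comparing this accuracy against probes trained on permuted or random representations, as the paper does, is what distinguishes a nontrivial mapping from one that merely memorizes the training categories.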


Related research

- Compositional Mixture Representations for Vision and Text (06/13/2022): Learning a common representation space between vision and language allow...
- Visual grounding of abstract and concrete words: A response to Günther et al. (2020) (06/30/2022): Current computational models capturing words' meaning mostly rely on tex...
- Does Vision-and-Language Pretraining Improve Lexical Grounding? (09/21/2021): Linguistic representations derived from text alone have been criticized ...
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning (11/13/2021): In natural language processing, most models try to learn semantic repres...
- A Visual Tour Of Current Challenges In Multimodal Language Models (10/22/2022): Transformer models trained on massive text corpora have become the de fa...
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (02/11/2021): Pre-trained representations are becoming crucial for many NLP and percep...
- Grounding Visual Representations with Texts for Domain Generalization (07/21/2022): Reducing the representational discrepancy between source and target doma...
