Incorporating Visual Semantics into Sentence Representations within a Grounded Space

02/07/2020
by Patrick Bordes, et al.

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations — the focus of this paper — as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
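To make the two objectives concrete, here is a minimal, hypothetical sketch of what they could look like as loss functions. All names, shapes, and the exact loss forms (centroid clustering for objective 1, pairwise cosine-similarity matching for objective 2) are illustrative assumptions, not the paper's actual implementation.

```python
import math

def cluster_loss(grounded, image_ids):
    # Objective (1), sketched: sentences associated with the same visual
    # content should be close in the grounded space. Here, mean squared
    # distance of each grounded sentence vector to its group's centroid.
    loss = 0.0
    for img in set(image_ids):
        group = [v for v, g in zip(grounded, image_ids) if g == img]
        dim = len(group[0])
        centroid = [sum(v[d] for v in group) / len(group) for d in range(dim)]
        loss += sum((v[d] - centroid[d]) ** 2
                    for v in group for d in range(dim))
    return loss / len(grounded)

def cosine(u, v):
    # Cosine similarity between two vectors (assumed non-zero).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_preservation_loss(grounded, visual):
    # Objective (2), sketched: similarities between related elements
    # should be preserved across modalities, i.e. the pairwise cosine
    # similarities in the grounded space should match those in the
    # visual space.
    n = len(grounded)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = cosine(grounded[i], grounded[j]) - cosine(visual[i], visual[j])
            total += diff ** 2
    return total / (n * n)
```

Under this sketch, the first loss is zero when every sentence embedding coincides with its group's centroid, and the second is zero when the two modalities induce identical similarity structures.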

Related research

04/15/2021
Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training
Language grounding aims at linking the symbolic representation of langua...

03/27/2019
Learning semantic sentence representations from visually grounded language without lexical knowledge
Current approaches to learning semantic representations of sentences oft...

12/02/2017
Improving Visually Grounded Sentence Representations with Self-Attention
Sentence representation models trained only on language could potentiall...

09/30/2019
Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations
With the aim of promoting and understanding the multilingual version of ...

10/19/2020
Image Captioning with Visual Object Representations Grounded in the Textual Modality
We present our work in progress exploring the possibilities of a shared ...

08/29/2019
Probing Representations Learned by Multimodal Recurrent and Transformer Models
Recent literature shows that large-scale language modeling provides exce...

03/26/2016
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Understanding language goes hand in hand with the ability to integrate c...
