Image Captioning with Visual Object Representations Grounded in the Textual Modality

10/19/2020
by   Dušan Variš, et al.

We present work in progress exploring a shared embedding space between the textual and visual modalities. Leveraging the textual nature of object-detection labels and the hypothesized expressiveness of extracted visual object representations, we propose an approach opposite to the current trend: grounding the visual representations in the word embedding space of the captioning system, rather than grounding words or sentences in their associated images. Building on previous work, we add grounding losses to the image captioning (IC) training objective, forcing visual object representations to form more heterogeneous clusters based on their class labels and to copy the semantic structure of the word embedding space. In addition, we analyze the learned object vector space projection and its impact on IC system performance. With only a slight change in performance, the grounded models reach the stopping criterion faster than the unconstrained model, requiring roughly two to three times fewer training updates. Moreover, an improvement in the structural correlation between the word embeddings and both the original and projected object vectors suggests that the grounding is in fact mutual.
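The abstract does not give the exact form of the grounding losses; as a minimal sketch, one common choice is a cosine-distance term that pulls each projected object vector toward the word embedding of its detected class label, added to the captioning loss with a weighting coefficient. The function and tensor names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grounding_loss(object_feats, proj, word_emb, labels):
    """Illustrative grounding loss (assumption, not the paper's exact form):
    project object features into the word embedding space and penalize
    cosine distance to the embedding of each object's class label."""
    projected = proj(object_feats)        # (N, d_word) projected object vectors
    targets = word_emb[labels]            # (N, d_word) class-label embeddings
    cos = F.cosine_similarity(projected, targets, dim=-1)
    return (1.0 - cos).mean()             # 0 when perfectly aligned

# Toy usage with random tensors (shapes are hypothetical).
torch.manual_seed(0)
vocab_size, d_obj, d_word = 10, 16, 8
proj = torch.nn.Linear(d_obj, d_word)     # learned projection into word space
word_emb = torch.randn(vocab_size, d_word)
feats = torch.randn(4, d_obj)             # 4 detected objects
labels = torch.tensor([1, 3, 3, 7])       # their class-label token ids
loss = grounding_loss(feats, proj, word_emb, labels)
# Total objective would then be: caption_loss + lambda_ground * loss
```

In this sketch only the projection is trained against the loss; whether the word embeddings also receive gradients is a design choice, and the paper's observation that grounding is mutual suggests the word embedding space is influenced as well.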

