A Better Loss for Visual-Textual Grounding

08/11/2021
by   Davide Rigoni, et al.

Given a textual phrase and an image, the visual grounding problem is the task of locating the content of the image referenced by the phrase. It is a challenging task with several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In recent years, several works have addressed this problem with heavy and complex models that try to capture visual-textual dependencies better than before. These models typically consist of two main components that focus, respectively, on learning useful multi-modal features for grounding and on improving the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. In this work, we propose a model that, despite using a simple multi-modal feature fusion component, achieves higher accuracy than state-of-the-art models thanks to a more effective loss function, based on the class probabilities, that reaches, on the considered datasets, a better learning balance between the two sub-tasks mentioned above.


research
03/29/2018

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Textual grounding is an important but challenging task for human-compute...
research
11/12/2015

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...
research
05/18/2023

Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement

Using only image-sentence pairs, weakly-supervised visual-textual ground...
research
04/17/2021

TransVG: End-to-End Visual Grounding with Transformers

In this paper, we present a neat yet effective transformer-based framewo...
research
10/07/2019

Adversarial reconstruction for Multi-modal Machine Translation

Even with the growing interest in problems at the intersection of Comput...
research
03/29/2018

Unsupervised Textual Grounding: Linking Words to Image Concepts

Textual grounding, i.e., linking words to objects in images, is a challe...
research
11/21/2020

Deep learning for video game genre classification

Video game genre classification based on its cover and textual descripti...
