A Better Loss for Visual-Textual Grounding

by Davide Rigoni, et al.

Given a textual phrase and an image, the visual grounding problem is the task of locating the image content referenced by the phrase. It is a challenging task with several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In recent years, several works have addressed this problem with heavy and complex models that try to capture visual-textual dependencies better than before. These models typically consist of two main components, which focus respectively on learning useful multi-modal features for grounding and on improving the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. In this work, we propose a model that, despite using a simple multi-modal feature fusion component, achieves higher accuracy than state-of-the-art models thanks to a more effective loss function, based on the class probabilities, that reaches, on the considered datasets, a better learning balance between the two sub-tasks mentioned above.



Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Textual grounding is an important but challenging task for human-compute...

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...

TransVG: End-to-End Visual Grounding with Transformers

In this paper, we present a neat yet effective transformer-based framewo...

Adversarial reconstruction for Multi-modal Machine Translation

Even with the growing interest in problems at the intersection of Comput...

SeqTR: A Simple yet Universal Network for Visual Grounding

In this paper, we propose a simple yet universal network termed SeqTR fo...

Unsupervised Textual Grounding: Linking Words to Image Concepts

Textual grounding, i.e., linking words to objects in images, is a challe...

Support-Set Based Cross-Supervision for Video Grounding

Current approaches for video grounding propose kinds of complex architec...