Top-Down Viewing for Weakly Supervised Grounded Image Captioning

06/13/2023
by Chen Cai, et al.

Weakly supervised grounded image captioning (WSGIC) aims to generate a caption and ground (localize) the predicted object words in the input image without bounding-box supervision. Most recent two-stage solutions follow a bottom-up pipeline: (1) an off-the-shelf object detector first encodes the input image into multiple region features; (2) a soft-attention mechanism then performs captioning and grounding over those regions. However, object detectors are designed mainly to extract object semantics (i.e., object categories), and they break the structured image into isolated region proposals. As a result, the subsequent grounded captioner tends to overfit to predicting the correct object words while overlooking the relations between objects (e.g., what is the person doing?), and it often selects incompatible proposal regions for grounding. To address these difficulties, we propose a one-stage weakly supervised grounded captioner that takes the RGB image directly as input and performs captioning and grounding top-down at the image level. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification; the resulting relation semantics aid the prediction of relation words in the caption. We observe that these relation words not only help the grounded captioner generate more accurate captions but also improve grounding performance. We validate our method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning), and the experimental results demonstrate state-of-the-art grounding performance.
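The abstract gives no implementation details, but its core ideas (one-stage grounding directly from the RGB image, attention weights reused as grounding maps, and a relation module trained with multi-label classification) can be illustrated with a minimal PyTorch-style sketch. Everything below is our own assumption for illustration; the module names, shapes, and hyperparameters are hypothetical and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class RelationAwareGroundedCaptioner(nn.Module):
    """Hypothetical sketch of a one-stage grounded captioner: a visual
    backbone maps the raw RGB image to a grid of patch features, a
    relation head predicts relation words via multi-label classification,
    and cross-attention weights over the patches serve as the weakly
    supervised grounding signal for each generated word."""

    def __init__(self, vocab_size, num_relations, dim=512):
        super().__init__()
        # Stand-in visual backbone: any encoder returning (B, N, dim)
        # patch features would do; a patchify convolution keeps it simple.
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Relation module: multi-label classification over relation words.
        self.relation_head = nn.Linear(dim, num_relations)
        # One cross-attention decoding step (full caption loop omitted).
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)
        self.word_head = nn.Linear(dim, vocab_size)

    def forward(self, image, prev_words):
        # image: (B, 3, H, W); prev_words: (B, T) token ids.
        feats = self.backbone(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        # Multi-label relation logits from pooled image features.
        rel_logits = self.relation_head(feats.mean(dim=1))       # (B, R)
        # Decode words; the attention map doubles as the grounding map.
        queries = self.word_embed(prev_words)                    # (B, T, dim)
        ctx, attn = self.cross_attn(queries, feats, feats)       # attn: (B, T, N)
        word_logits = self.word_head(ctx)                        # (B, T, V)
        return word_logits, rel_logits, attn
```

In a setup like this, the caption head would be trained with cross-entropy, the relation head with a multi-label loss (e.g., BCEWithLogitsLoss) against relation words mined from the ground-truth captions, and the grounded region would be read off the attention map at each predicted object word, all without bounding-box supervision.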


