Fine-Grained Image Captioning with Global-Local Discriminative Objective

07/21/2020
by   Jie Wu, et al.

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions composed of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Figure 1). This is primarily due to (i) the conservative nature of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective, formulated on top of a reference model, to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence to better discern the corresponding image from all others in the entire dataset. From the local perspective, a local discriminative constraint is proposed to increase the attention paid to less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves competitive performance against existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
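To make the two constraints concrete, here is a minimal NumPy sketch of the general idea, not the paper's exact formulation: the global constraint is illustrated as a contrastive ranking loss that pushes each generated caption to be most similar to its own image (a batch stands in for "the entire dataset"), and the local constraint as an inverse-frequency reweighting of per-word losses. The function names, the temperature, and the `alpha` exponent are illustrative assumptions.

```python
import numpy as np

def global_discriminative_loss(sim, temperature=1.0):
    """Contrastive ranking sketch of the global constraint: the caption
    generated for image i should score higher with image i than with any
    other image. `sim` is an (N, N) caption-image similarity matrix whose
    diagonal holds the matched pairs."""
    logits = sim / temperature
    # Log-softmax over candidate images for each caption, then take the
    # log-probability assigned to the matched (diagonal) image.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def local_discriminative_weights(token_counts, alpha=0.5):
    """Sketch of the local constraint: up-weight rare (typically more
    concrete) words with weight proportional to 1 / count**alpha,
    normalized to mean 1 so the overall loss scale is preserved."""
    counts = np.asarray(token_counts, dtype=float)
    weights = counts ** (-alpha)
    return weights / weights.mean()

# Sharper similarity margins between matched and mismatched pairs
# lower the global loss; rare words receive larger local weights.
loss = global_discriminative_loss(np.eye(4) * 3.0)
weights = local_discriminative_weights([5, 5000])  # rare vs. frequent word
```

In this sketch, per-word weights would multiply the usual cross-entropy terms of the captioning loss, so frequent filler words contribute less gradient than rare, descriptive ones.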

