Image Difference Captioning with Pre-training and Contrastive Learning

02/09/2022
by Linli Yao, et al.

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images in natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences require learning a stronger association between vision and language, and 2) the high cost of manual annotation leads to limited supervised data. To address these challenges, we propose a new modeling framework following the pre-training-and-fine-tuning paradigm. Specifically, we design three self-supervised tasks and contrastive learning strategies to align visual differences and text descriptions at a fine-grained level. Moreover, we propose a data expansion strategy to exploit extra cross-task supervision, such as data for fine-grained image classification, to alleviate the shortage of available supervised IDC data. Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed framework. The code and models will be released at https://github.com/yaolinli/IDC.
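As a rough illustration of the contrastive alignment mentioned in the abstract, the sketch below shows a symmetric InfoNCE-style loss between pooled image-difference features and caption features. The function name, feature shapes, and temperature are assumptions made for illustration; they are not the paper's exact objective or implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(diff_feats, text_feats, temperature=0.07):
    """Illustrative InfoNCE-style loss aligning image-difference features
    with difference-caption features (a sketch, not the paper's code).

    diff_feats: (B, D) pooled features of the visual difference of an image pair
    text_feats: (B, D) pooled features of the corresponding difference captions
    """
    # L2-normalize so dot products become cosine similarities
    diff_feats = F.normalize(diff_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (B, B) similarity matrix; matched pairs lie on the diagonal
    logits = diff_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: difference-to-text and text-to-difference
    loss_d2t = F.cross_entropy(logits, targets)
    loss_t2d = F.cross_entropy(logits.t(), targets)
    return (loss_d2t + loss_t2d) / 2
```

In such a setup, each image pair's difference representation is pulled toward its own caption and pushed away from the other captions in the batch, which is one common way to enforce fine-grained vision-language alignment.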


Related research:

- 07/30/2021. CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification. Existing computer vision research in artwork struggles with artwork's fi...
- 06/01/2022. CLIP4IDC: CLIP for Image Difference Captioning. Image Difference Captioning (IDC) aims at generating sentences to descri...
- 09/09/2019. Neural Naturalist: Generating Fine-Grained Image Comparisons. We introduce the new Birds-to-Words dataset of 41k sentences describing ...
- 03/09/2023. Replacement as a Self-supervision for Fine-grained Vision-language Pre-training. Fine-grained supervision based on object annotations has been widely use...
- 05/20/2023. What Makes for Good Visual Tokenizers for Large Language Models? We empirically investigate proper pre-training methods to build good vis...
- 03/06/2023. Neighborhood Contrastive Transformer for Change Captioning. Change captioning is to describe the semantic change between a pair of s...
- 08/31/2023. ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation. Vision-language pre-training (VLP) methods are blossoming recently, and ...
