CLIP4IDC: CLIP for Image Difference Captioning

06/01/2022
by   Zixin Guo, et al.

Image Difference Captioning (IDC) aims to generate sentences that describe the differences between two similar-looking images. Conventional approaches learn captioning models on offline-extracted visual features, so the training signal cannot be propagated back to the fixed feature extractors, which are pre-trained on image classification datasets. Fine-tuning the visual features therefore offers two potential improvements: 1) narrowing the domain gap that arises when a visual extractor trained for image classification is applied to IDC, and 2) relating the extracted visual features to the descriptions of the corresponding changes. We propose CLIP4IDC, which transfers a CLIP model to the IDC task to attain these improvements. Rather than directly fine-tuning CLIP to generate sentences, we first apply a task-specific domain adaptation to improve the extracted features: CLIP is trained on raw pixels to relate image pairs to the described changes. Afterwards, a vanilla Transformer is trained for IDC on the features extracted by CLIP's vision encoder. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC. Our code and models will be released at https://github.com/sushizixin/CLIP4IDC.
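To make the two-stage recipe concrete, below is a minimal sketch of the first (adaptation) stage, assuming PyTorch and OpenAI's CLIP package. The pair-fusion head (fuse), the temperature value, and the loss wiring are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the adaptation stage: contrastively relate an
# (image_before, image_after) pair to the sentence describing the change.
# NOTE: the fusion head and training wiring are assumptions for illustration.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical fusion head: maps the two 512-d image embeddings to a
# single "change" embedding that is matched against the caption embedding.
fuse = torch.nn.Linear(2 * 512, 512).to(device)

def adaptation_loss(img_before, img_after, captions):
    """Symmetric InfoNCE between fused image-pair and text embeddings."""
    e_before = model.encode_image(img_before).float()
    e_after = model.encode_image(img_after).float()
    e_pair = F.normalize(fuse(torch.cat([e_before, e_after], dim=-1)), dim=-1)
    tokens = clip.tokenize(captions).to(device)
    e_text = F.normalize(model.encode_text(tokens).float(), dim=-1)
    logits = e_pair @ e_text.t() / 0.07  # fixed temperature, as in CLIP
    targets = torch.arange(logits.size(0), device=device)
    # Matching pairs and captions lie on the diagonal of the logit matrix.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

In the second stage, the adapted vision encoder's features for both images would be fed to a vanilla Transformer encoder-decoder that generates the difference caption; that captioning head is omitted from this sketch.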


