Learning to Describe Differences Between Pairs of Similar Images

08/31/2018
by Harsh Jhamtani, et al.

In this paper, we introduce the task of automatically generating text to describe the differences between two similar images. We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage. Annotators were asked to succinctly describe all the differences in a short paragraph. As a result, our novel dataset provides an opportunity to explore models that align language and vision and capture visual salience. The dataset may also be a useful benchmark for coherent multi-sentence generation. We perform a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences. We find that, for both single-sentence and multi-sentence generation, the proposed model outperforms models that use attention alone.
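
To make the first-pass visual analysis concrete, the following is a minimal sketch of how clusters of differing pixels might be extracted from an image pair. The abstract does not say how pixels are compared or clustered, so everything here is an assumption for illustration: grayscale images as nested lists, a fixed intensity `threshold`, 4-connected grouping as the clustering step, and the hypothetical helper names `diff_mask` and `pixel_clusters`.

```python
from collections import deque


def diff_mask(img_a, img_b, threshold=30):
    """Mark pixels whose absolute intensity difference exceeds `threshold`.

    Assumption: both images are equally sized grayscale grids (lists of rows).
    """
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def pixel_clusters(mask):
    """Group differing pixels into 4-connected clusters of (row, col) pairs.

    Assumption: connected components stand in for whatever clustering
    the paper actually uses.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    clusters = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search from an unvisited differing pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(cluster)
    return clusters


# Toy image pair: two separate regions change between the frames.
before = [[0] * 5 for _ in range(4)]
after = [row[:] for row in before]
after[0][0] = 200
after[0][1] = 200
after[3][4] = 255

clusters = pixel_clusters(diff_mask(before, after))
print(len(clusters))  # 2
```

In a setup like this, each cluster would serve as a rough stand-in for one changed object, which is the kind of unit the abstract describes aligning with an output sentence.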

research
11/20/2014

Learning a Recurrent Visual Representation for Image Caption Generation

In this paper we explore the bi-directional mapping between images and t...
research
04/13/2021

From Solving a Problem Boldly to Cutting the Gordian Knot: Idiomatic Text Generation

We study a new application for text generation – idiomatic sentence gene...
research
06/15/2023

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

Current perceptual similarity metrics operate at the level of pixels and...
research
03/09/2022

Align-Deform-Subtract: An Interventional Framework for Explaining Object Differences

Given two object images, how can we explain their differences in terms o...
research
03/31/2018

Compare and Contrast: Learning Prominent Visual Differences

Relative attribute models can compare images in terms of all detected pr...
research
06/01/2022

CLIP4IDC: CLIP for Image Difference Captioning

Image Difference Captioning (IDC) aims at generating sentences to descri...
research
02/28/2015

Generating Multi-Sentence Lingual Descriptions of Indoor Scenes

This paper proposes a novel framework for generating lingual description...
