Visual Transformation Telling

05/03/2023
by Xin Hong, et al.

In this paper, we propose a new visual reasoning task, called Visual Transformation Telling (VTT). This task requires a machine to describe the transformation that occurred between every two adjacent states (i.e., images) in a series. Unlike most existing visual reasoning tasks, which focus on state reasoning, VTT emphasizes transformation reasoning. We collected 13,547 samples from two instructional video datasets, CrossTask and COIN, and extracted the desired states and transformation descriptions to create a suitable VTT benchmark dataset. Humans can naturally reason from superficial state differences (e.g., ground wetness) to transformation descriptions (e.g., raining) based on their life experience, but modeling this process to bridge the semantic gap is challenging. We designed TTNet on top of existing visual storytelling models by enhancing the model's sensitivity to state differences and its awareness of transformation context. TTNet significantly outperforms other baseline models adapted from similar tasks, such as visual storytelling and dense video captioning, demonstrating the effectiveness of our modeling of transformations. Through comprehensive diagnostic analyses, we found that TTNet has strong context-utilization abilities, but even with state-of-the-art techniques such as CLIP, challenges in generalization remain to be explored.


