Hopper: Multi-hop Transformer for Spatiotemporal Reasoning

03/19/2021
by Honglu Zhou, et al.

This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained, or carried by other objects. Existing deep learning-based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly.
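The abstract describes an iterative "hopping" mechanism: at each step the model attends over the frames, selects a critical frame, and refines its query before the next hop. A minimal toy sketch of that idea is below. All names, the dot-product scoring, the masking of visited frames, and the residual query update are illustrative assumptions for exposition; the actual Hopper model is a Multi-hop Transformer operating over CNN image features and object tracks, not this simplified loop.

```python
# Toy sketch of iterative multi-hop frame selection (illustrative only;
# not the paper's architecture). Each hop attends over frame features,
# picks the highest-scoring unvisited frame, and blends its features
# into the query before the next hop.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multi_hop(query, frames, num_hops=3):
    """Return the sequence of 'critical' frame indices chosen per hop."""
    hops, visited = [], set()
    for _ in range(num_hops):
        # Attention scores against every frame; mask frames already visited
        scores = [dot(query, f) if i not in visited else float("-inf")
                  for i, f in enumerate(frames)]
        best = max(range(len(frames)), key=scores.__getitem__)
        hops.append(best)
        visited.add(best)
        # Residual-style update: mix the chosen frame's features into the query
        query = [0.5 * q + 0.5 * f for q, f in zip(query, frames[best])]
    return hops

# Tiny example: 4 "frames" with 2-d features, query initially close to frame 0
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
print(multi_hop([1.0, 0.0], frames))  # → [0, 2, 3]
```

The masking forces each hop to move to a new frame, mimicking how the model steps through a few critical frames rather than re-attending to the same one; the query update carries information from earlier hops forward, which is what makes multi-step reasoning possible.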


research

07/20/2021 — Generative Video Transformer: Can Objects be the Words?
Transformers have been successful for many natural language processing t...

10/10/2019 — CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
Computer vision has undergone a dramatic revolution in performance, driv...

03/23/2020 — Learning Object Permanence from Video
Object Permanence allows people to reason about the location of non-visi...

07/14/2022 — Deepfake Video Detection with Spatiotemporal Dropout Transformer
While the abuse of deepfake technology has caused serious concerns recen...

11/10/2022 — Spatiotemporal k-means
Spatiotemporal data is readily available due to emerging sensor and data...

12/01/2014 — Recovering Spatiotemporal Correspondence between Deformable Objects by Exploiting Consistent Foreground Motion in Video
Given unstructured videos of deformable objects, we automatically recove...

01/21/2021 — SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation
In this paper we introduce a Transformer-based approach to video object ...
