Generative Video Transformer: Can Objects be the Words?

07/20/2021
by Yi-Fu Wu, et al.

Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high computational complexity and the lack of natural tokenization. In this paper, we propose the Object-Centric Video Transformer (OCVT), which uses an object-centric approach to decompose scenes into tokens suitable for a generative video transformer. By factoring the video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple interacting objects in a scene and generate future frames of the video. Our model is also significantly more memory-efficient than pixel-based models and is thus able to train on videos of up to 70 frames with a single 48GB GPU. We compare our model with previous RNN-based approaches as well as other possible video transformer baselines. We demonstrate that OCVT performs well compared to these baselines in generating future frames. OCVT also develops useful representations for video reasoning, achieving state-of-the-art performance on the CATER task.
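
To make the memory-efficiency claim concrete, the following is a rough back-of-the-envelope sketch, not the paper's method or measurements. It relies only on the general fact that full self-attention builds a matrix whose size grows quadratically with sequence length. The frame resolution (64x64), the one-token-per-pixel tokenization, and the 10 object tokens per frame are illustrative assumptions; only the 70-frame video length comes from the abstract.

```python
def attention_matrix_entries(tokens_per_frame: int, num_frames: int) -> int:
    """Entries in one full self-attention matrix over the whole video sequence."""
    seq_len = tokens_per_frame * num_frames
    return seq_len * seq_len


NUM_FRAMES = 70          # video length mentioned in the abstract
PIXEL_TOKENS = 64 * 64   # assumption: one token per pixel of a 64x64 frame
OBJECT_TOKENS = 10       # assumption: 10 object tokens per frame

pixel_cost = attention_matrix_entries(PIXEL_TOKENS, NUM_FRAMES)
object_cost = attention_matrix_entries(OBJECT_TOKENS, NUM_FRAMES)

# How many times larger the pixel-level attention matrix is (rounded down).
ratio = pixel_cost // object_cost

print(f"pixel tokens:  {pixel_cost:,} attention entries")
print(f"object tokens: {object_cost:,} attention entries")
print(f"ratio:         about {ratio:,}x")
```

Under these assumed settings the gap comes entirely from the number of tokens per frame, which is why a compact object-level tokenization can make long sequences tractable where a per-pixel one would not.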


