Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers

03/20/2023
by Jaehoon Yoo, et al.

Autoregressive transformers have shown remarkable success in video generation. However, these transformers cannot directly learn long-term dependencies in videos due to the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation caused by the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and for fast inference. Building on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting observable context tokens into a fixed number of latent tokens and conditioning on them to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvements over autoregressive transformers in both quality and speed when generating moderately long videos.
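
The two-stage design sketched in the abstract can be illustrated with a short, self-contained example. The snippet below is a minimal sketch, not the authors' implementation: the module and parameter names (LatentBottleneckDecoder, num_latents, single cross-attention layers) are hypothetical stand-ins for the idea of compressing visible context tokens into a fixed number of latent tokens and then decoding all masked tokens in parallel by cross-attending to those latents, which keeps both stages linear in sequence length.

```python
# Minimal sketch of a MeBT-style latent-bottleneck decoder (illustrative only).
# Assumed/hypothetical names: LatentBottleneckDecoder, num_latents.
import torch
import torch.nn as nn

class LatentBottleneckDecoder(nn.Module):
    """Compress observed context tokens into a fixed set of latent tokens,
    then decode masked positions by cross-attending to those latents."""
    def __init__(self, dim=256, num_latents=64, num_heads=8, vocab_size=1024):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # Encoding: latent queries attend to the visible (unmasked) context tokens.
        self.encode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Decoding: masked-position queries attend to the fixed-size latent summary.
        self.decode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))
        self.head = nn.Linear(dim, vocab_size)  # logits over discrete video tokens

    def forward(self, context_tokens, masked_queries):
        # context_tokens: (B, N_vis, D) embeddings of observed patches
        # masked_queries: (B, N_mask, D) positional queries for masked patches
        B = context_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        # Cross-attention 1: project the context into a fixed number of latents.
        latents, _ = self.encode_attn(latents, context_tokens, context_tokens)
        # Cross-attention 2: decode every masked position in parallel.
        out, _ = self.decode_attn(masked_queries, latents, latents)
        out = out + self.ffn(out)
        return self.head(out)  # (B, N_mask, vocab_size)

# Usage: all masked spatio-temporal tokens are predicted in one parallel pass.
model = LatentBottleneckDecoder()
ctx = torch.randn(2, 512, 256)       # visible token embeddings
queries = torch.randn(2, 1536, 256)  # queries for masked positions
logits = model(ctx, queries)
print(logits.shape)  # torch.Size([2, 1536, 1024])
```

Because the number of latent tokens is fixed, attention cost grows linearly with the number of context and masked tokens, rather than quadratically as in full self-attention over the whole spatio-temporal volume.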


