Higher Order Recurrent Space-Time Transformer

04/17/2021
by Tsung-Ming Tai, et al.

Endowing visual agents with predictive capability is a key step towards video intelligence at scale. The predominant modeling paradigm for this is sequence learning, mostly implemented through LSTMs. Feed-forward Transformer architectures have replaced recurrent model designs in machine-learning applications for language processing, and partly in computer vision as well. In this paper we investigate the competitiveness of Transformer-style architectures for video predictive tasks. To do so we propose HORST, a novel higher-order recurrent layer design whose core element is a spatial-temporal decomposition of self-attention for video. HORST achieves performance competitive with the state of the art on Something-Something-V2 early action recognition and EPIC-Kitchens-55 action anticipation, without relying on any task-specific design. We believe this is promising evidence of causal predictive capability, which we attribute to the recurrent higher-order design of our self-attention.
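
The abstract gives no code, so the following is a minimal PyTorch sketch of the idea it describes: a recurrent cell whose core is self-attention decomposed into a spatial pass (over locations within the current frame) and a temporal pass (over a fixed-length queue of past states, the "higher order" part). The class and parameter names (SpaceTimeAttentionCell, order) are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code) of a higher-order recurrent
# cell with spatially and temporally decomposed self-attention.
import torch
import torch.nn as nn

class SpaceTimeAttentionCell(nn.Module):
    """Attends over space within the current frame, then over a
    fixed-length history of past states (the higher-order part)."""

    def __init__(self, dim: int, order: int = 4, heads: int = 4):
        super().__init__()
        self.order = order  # how many past states the cell keeps
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x, history):
        # x: (B, N, D) tokens of the current frame (N spatial locations)
        # history: list of up to `order` past outputs, each (B, N, D)
        # 1) spatial self-attention within the frame
        s, _ = self.spatial_attn(x, x, x)
        s = self.norm_s(x + s)
        # 2) temporal attention: each location queries its own past
        if history:
            mem = torch.stack(history + [s], dim=2)   # (B, N, T, D)
            B, N, T, D = mem.shape
            q = s.reshape(B * N, 1, D)                # one query per location
            kv = mem.reshape(B * N, T, D)
            t, _ = self.temporal_attn(q, kv, kv)
            s = self.norm_t(s + t.reshape(B, N, D))
        # keep only the most recent `order` states
        history = (history + [s])[-self.order:]
        return s, history

if __name__ == "__main__":
    cell = SpaceTimeAttentionCell(dim=64, order=4)
    history = []
    video = torch.randn(2, 8, 49, 64)  # (batch, time, locations, dim)
    for step in range(video.size(1)):
        out, history = cell(video[:, step], history)
    print(out.shape)  # torch.Size([2, 49, 64])

Because the temporal pass only attends to states already in the queue, the cell is causal by construction, which is one plausible reading of the predictive capability the abstract claims.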


Related research

02/09/2021
Is Space-Time Attention All You Need for Video Understanding?
We present a convolution-free approach to video classification built exc...

07/13/2023
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Recent video recognition models utilize Transformer models for long-rang...

06/12/2021
Video Super-Resolution Transformer
Video super-resolution (VSR), with the aim to restore a high-resolution ...

06/22/2022
NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022
In this report, we describe the technical details of our submission for ...

12/14/2021
Temporal Transformer Networks with Self-Supervision for Action Recognition
In recent years, 2D Convolutional Networks-based video action recognitio...

06/18/2020
I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
Self-attention has emerged as a vital component of state-of-the-art sequ...

07/01/2021
Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition
Deep neural networks based purely on attention have been successful acro...
