Stand-Alone Inter-Frame Attention in Video Models

06/14/2022
by Fuchen Long, et al.

Motion, as the defining characteristic of video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into separate spatial and temporal convolutions, or computing self-attention along the temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold, especially for regions with large deformation. In this paper, we present a new recipe for an inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that delves into the deformation across frames to estimate local self-attention at each spatial location. Technically, SIFA remoulds the deformable design by re-scaling the offset predictions with the difference between the two frames. Taking each spatial location in the current frame as the query, its locally deformable neighbors in the next frame are regarded as the keys/values. SIFA then measures the similarity between the query and keys as stand-alone attention weights, and aggregates the values temporally via a weighted average. We further plug the SIFA block into ConvNets and Vision Transformers, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on the Kinetics-400 dataset. Source code is available at <https://github.com/FuchenUSTC/SIFA>.
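
To make the mechanism concrete, below is a minimal PyTorch sketch of a SIFA-style block, reconstructed from the abstract alone rather than from the official code (see the repository above for the authors' release). It predicts offsets for a local k x k neighborhood from the difference between the two frames, bilinearly samples keys/values at the deformed locations in the next frame, and aggregates the values with softmax-normalized query-key similarities. The class name `SIFABlock`, the residual connection, and all hyper-parameters are illustrative assumptions.

```python
# Minimal PyTorch sketch of a SIFA-style inter-frame attention block.
# Hypothetical reconstruction from the abstract, not the authors'
# implementation; see https://github.com/FuchenUSTC/SIFA for the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SIFABlock(nn.Module):
    """Stand-alone inter-frame attention: frame t queries frame t+1."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        n = kernel_size * kernel_size
        # Offsets for the k*k deformable neighbors, conditioned on the
        # inter-frame difference (approximating the re-scaling idea).
        self.offset_conv = nn.Conv2d(channels, 2 * n, 3, padding=1)
        self.q_proj = nn.Conv2d(channels, channels, 1)
        self.k_proj = nn.Conv2d(channels, channels, 1)
        self.v_proj = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, cur: torch.Tensor, nxt: torch.Tensor) -> torch.Tensor:
        # cur, nxt: (B, C, H, W) feature maps of two consecutive frames.
        B, C, H, W = cur.shape
        n = self.k * self.k
        offsets = self.offset_conv(nxt - cur).view(B, n, 2, H, W)

        # Normalization: one pixel in x/y equals 2/(W-1), 2/(H-1) in the
        # [-1, 1] coordinates used by grid_sample (align_corners=True).
        norm = torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0], device=cur.device)

        # Base sampling grid covering every spatial location.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=cur.device),
            torch.linspace(-1, 1, W, device=cur.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)  # (H, W, 2), (x, y) order

        # Fixed k x k neighborhood shifts in pixels, then normalized.
        r = self.k // 2
        dy, dx = torch.meshgrid(
            torch.arange(-r, r + 1, device=cur.device),
            torch.arange(-r, r + 1, device=cur.device),
            indexing="ij",
        )
        neigh = torch.stack((dx.flatten(), dy.flatten()), dim=-1).float() / norm

        q = self.q_proj(cur)       # queries from the current frame
        k_feat = self.k_proj(nxt)  # keys/values from the next frame
        v_feat = self.v_proj(nxt)

        keys, values = [], []
        for i in range(n):
            # Deformed sampling grid: base location + fixed shift + offset.
            off = offsets[:, i].permute(0, 2, 3, 1) / norm  # (B, H, W, 2)
            grid = base.unsqueeze(0) + neigh[i] + off
            keys.append(F.grid_sample(k_feat, grid, align_corners=True))
            values.append(F.grid_sample(v_feat, grid, align_corners=True))
        keys = torch.stack(keys, dim=1)      # (B, n, C, H, W)
        values = torch.stack(values, dim=1)  # (B, n, C, H, W)

        # Local stand-alone attention over the n deformed neighbors.
        attn = (q.unsqueeze(1) * keys).sum(dim=2) * self.scale  # (B, n, H, W)
        attn = attn.softmax(dim=1)
        out = (attn.unsqueeze(2) * values).sum(dim=1)           # (B, C, H, W)
        return cur + out  # residual aggregation (an assumption here)


if __name__ == "__main__":
    block = SIFABlock(channels=64)
    f_t = torch.randn(2, 64, 28, 28)   # features of frame t
    f_t1 = torch.randn(2, 64, 28, 28)  # features of frame t+1
    print(block(f_t, f_t1).shape)      # torch.Size([2, 64, 28, 28])
```

The dot-product attention over a deformable neighborhood is what distinguishes this design from a plain temporal convolution: the aggregation weights adapt to the content of both frames rather than being fixed.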

Related research

11/15/2022 · Dynamic Temporal Filtering in Video Models
Video temporal dynamics is conventionally modeled with 3D spatial-tempor...

07/20/2020 · Learning Joint Spatial-Temporal Transformations for Video Inpainting
High-quality video inpainting that completes missing regions in video fr...

07/26/2021 · Contextual Transformer Networks for Visual Recognition
Transformer with self-attention has led to the revolutionizing of natura...

08/10/2023 · Temporally-Adaptive Models for Efficient Video Understanding
Spatial convolutions are extensively used in numerous deep video models....

11/13/2022 · SCOTCH and SODA: A Transformer Video Shadow Detection Framework
Shadows in videos are difficult to detect because of the large shadow de...

07/14/2023 · TALL: Thumbnail Layout for Deepfake Video Detection
The growing threats of deepfakes to society and cybersecurity have raise...

07/08/2021 · Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection via Spatial-Temporal Feature Transformation
Precise localization of polyp is crucial for early cancer screening in g...
