ActionFormer: Localizing Moments of Actions with Transformers

02/16/2022
by   Chenlin Zhang, et al.

Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer – a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a light-weight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements upon prior works. Without bells and whistles, ActionFormer achieves 65.6% mAP on THUMOS14, outperforming the best prior model by 8.7 absolute percentage points and crossing the 60% mAP mark for the first time. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.0% average mAP, improving over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release
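The two design ideas the abstract names — local self-attention over a multiscale temporal feature pyramid, and a light-weight anchor-free decoder that classifies every time step and regresses its action boundaries — can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the weights are random, the window size and head layout are hypothetical, and real ActionFormer uses learned multi-head attention with layer norm and convolutions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(feats, window=4):
    """Windowed self-attention: each time step attends only to neighbors
    within +/- `window`, so cost is linear in sequence length."""
    T, D = feats.shape
    # Hypothetical single-head projections (random, for illustration only).
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    out = np.zeros_like(feats)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        attn = softmax(Q[t] @ K[lo:hi].T / np.sqrt(D))
        out[t] = attn @ V[lo:hi]
    return out

def decode_moments(feats, num_classes=3):
    """Light-weight anchor-free decoder: per-timestep class probabilities,
    plus non-negative distances to the action's start and end."""
    T, D = feats.shape
    rng = np.random.default_rng(1)
    cls_head = rng.standard_normal((D, num_classes)) / np.sqrt(D)
    reg_head = rng.standard_normal((D, 2)) / np.sqrt(D)
    cls_scores = softmax(feats @ cls_head)        # (T, num_classes)
    offsets = np.log1p(np.exp(feats @ reg_head))  # (T, 2), softplus >= 0
    return cls_scores, offsets

# Multiscale pyramid: downsample the clip features by striding,
# then attend and decode at every level.
T, D = 16, 8
x = np.random.default_rng(2).standard_normal((T, D))
pyramid = [x[::s] for s in (1, 2, 4)]
for level in pyramid:
    attended = local_self_attention(level)
    scores, offsets = decode_moments(attended)
```

Each timestep `t` at each pyramid level yields a candidate action `(t - offsets[t, 0], t + offsets[t, 1])` scored by `scores[t]`; overlapping candidates would then be merged with non-maximum suppression.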


