Action Keypoint Network for Efficient Video Recognition

01/17/2022
by Xu Chen, et al.

Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, which has yielded a popular family of dynamic video recognition methods. However, existing dynamic methods perform either temporal or spatial selection in isolation, neglecting the fact that redundancy is usually both spatial and temporal at the same time. Moreover, the selected content is usually cropped into fixed shapes, whereas informative content in real videos is distributed in far more diverse shapes. Motivated by these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects informative points scattered across arbitrary-shaped regions as a set of action keypoints and then transforms video recognition into point cloud classification. AK-Net consists of two steps: keypoint selection and point cloud classification. First, it feeds the video into a baseline network and extracts a feature map from an intermediate layer. Each pixel of this feature map is viewed as a spatial-temporal point, and the informative keypoints are selected with self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Consequently, AK-Net brings two-fold efficiency benefits: the keypoint selection step gathers informative content within arbitrary shapes and thus models spatial-temporal dependencies more efficiently, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net consistently improves both the efficiency and the accuracy of baseline methods on several video recognition benchmarks.
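
The abstract does not spell out the exact selection score or ranking criterion, so the following is only a minimal PyTorch sketch of the two steps it describes: scoring every spatial-temporal point of an intermediate feature map, keeping the top-K as keypoints, ordering them into a 1D sequence, and classifying that sequence with cheap 1D convolutions. The module name `ActionKeypointHead`, the learned linear scoring head, and the ordering by flattened (t, y, x) index are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of AK-Net's two steps, assuming a learned linear
# scoring head and coordinate-based ranking (both hypothetical).
import torch
import torch.nn as nn


class ActionKeypointHead(nn.Module):
    """Select K informative spatial-temporal points from a (T, H, W)
    feature map and classify them as an ordered 1D point sequence."""

    def __init__(self, channels: int, num_points: int, num_classes: int):
        super().__init__()
        self.num_points = num_points
        # Assumed scoring head: a learned projection producing one
        # importance score per spatial-temporal point.
        self.score = nn.Linear(channels, 1)
        # Point-cloud classifier: 1D convolutions over the ordered
        # keypoint sequence, cheaper than dense spatial-temporal convs.
        self.classifier = nn.Sequential(
            nn.Conv1d(channels + 3, 256, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(256, num_classes),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, T, H, W) feature map from an intermediate layer
        # of a baseline video network.
        b, c, t, h, w = feat.shape
        points = feat.flatten(2).transpose(1, 2)         # (B, THW, C)

        # Step 1: keypoint selection. Score every point and keep the
        # top-K, so the selected region can take an arbitrary shape
        # instead of a fixed crop.
        scores = self.score(points).squeeze(-1)          # (B, THW)
        _, idx = scores.topk(self.num_points, dim=1)     # (B, K)
        gather = idx.unsqueeze(-1).expand(-1, -1, c)
        keypoints = points.gather(1, gather)             # (B, K, C)

        # Recover (t, y, x) coordinates so each point stays localizable.
        tt = idx // (h * w)
        yy = (idx % (h * w)) // w
        xx = idx % w
        coords = torch.stack([tt, yy, xx], dim=-1).float()  # (B, K, 3)

        # Step 2: ranking criterion. As a stand-in for the paper's
        # criterion, order points by flattened (t, y, x) index so the
        # 1D classifier sees a consistent sequence.
        order = idx.argsort(dim=1)
        keypoints = keypoints.gather(1, order.unsqueeze(-1).expand(-1, -1, c))
        coords = coords.gather(1, order.unsqueeze(-1).expand(-1, -1, 3))

        seq = torch.cat([keypoints, coords], dim=-1)     # (B, K, C + 3)
        return self.classifier(seq.transpose(1, 2))      # (B, num_classes)


# Usage with dummy features standing in for a baseline network's output.
head = ActionKeypointHead(channels=256, num_points=128, num_classes=400)
logits = head(torch.randn(2, 256, 8, 14, 14))
```

Because selection is a plain top-K over per-point scores, the kept points need not form a rectangle; the 1D classifier then touches only K points instead of the full T x H x W grid, which is where the sketch mirrors the paper's claimed efficiency gains.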

Related research

- Spatial-Temporal Transformer for 3D Point Cloud Sequences (10/19/2021)
  Effective learning of spatial-temporal information within a point cloud ...
- Anchor-Based Spatial-Temporal Attention Convolutional Networks for Dynamic 3D Point Cloud Sequences (12/20/2020)
  Recently, learning based methods for the robot perception from the image...
- SK-Net: Deep Learning on Point Cloud via End-to-end Discovery of Spatial Keypoints (03/31/2020)
  Since the PointNet was proposed, deep learning on point cloud has been t...
- DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation (07/31/2023)
  In this technical report, we present our findings from the research cond...
- MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition (09/01/2022)
  Recognizing human actions from point cloud videos has attracted tremendo...
- TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval (07/16/2022)
  Text-Video retrieval is a task of great practical value and has received...
- AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition (09/27/2022)
  Recent research has revealed that reducing the temporal and spatial redu...
