AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

12/28/2021
by Yulin Wang, et al.

Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which leads to slow convergence and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, including the lack of supervision, input diversity, and training stability. Moreover, a conditional-exit technique is proposed to perform temporally adaptive computation on top of AdaFocus without additional training. Extensive experiments on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1 & V2, and Jester) demonstrate that our model significantly outperforms the original AdaFocus and other competitive baselines, while being considerably simpler and more efficient to train. Code is available at https://github.com/LeapLabTHU/AdaFocusV2.
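To make the idea of interpolation-based patch selection concrete, here is a minimal sketch in plain Python. It is not taken from the authors' code: the function names `bilinear_sample` and `crop_patch`, the list-of-rows frame layout, and the clamping at the border are all assumptions made for illustration. The point it shows is that when a patch is read at a continuous center through bilinear interpolation, every output pixel is a weighted sum whose weights vary smoothly with the center coordinates, which is the property that would allow gradients to reach whatever module predicts the center.

```python
import math


def bilinear_sample(frame, x, y):
    """Read a single-channel frame (list of rows) at continuous (x, y).

    Hypothetical helper: assumes the frame is at least 2 x 2 pixels.
    """
    h, w = len(frame), len(frame[0])
    # Clamp the coordinates so the four neighbours stay inside the frame.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0 = min(int(math.floor(x)), w - 2)
    y0 = min(int(math.floor(y)), h - 2)
    dx, dy = x - x0, y - y0
    # The weights (1 - dx), dx, (1 - dy), dy depend continuously on (x, y).
    top = (1 - dx) * frame[y0][x0] + dx * frame[y0][x0 + 1]
    bottom = (1 - dx) * frame[y0 + 1][x0] + dx * frame[y0 + 1][x0 + 1]
    return (1 - dy) * top + dy * bottom


def crop_patch(frame, cx, cy, size):
    """Crop a size x size patch centred at the continuous point (cx, cy)."""
    offset = (size - 1) / 2.0
    return [
        [bilinear_sample(frame, cx - offset + j, cy - offset + i)
         for j in range(size)]
        for i in range(size)
    ]


if __name__ == "__main__":
    # Toy frame whose pixel value is x + 10 * y.
    frame = [[x + 10 * y for x in range(6)] for y in range(6)]
    print(crop_patch(frame, 2.5, 2.5, 2))
    # Shifting the centre by a quarter pixel shifts the values smoothly.
    print(crop_patch(frame, 2.75, 2.5, 2))
```

In an actual training setup, the same computation would normally be expressed with an automatic-differentiation framework so that the gradient with respect to `cx` and `cy` is obtained automatically; the pure-Python version above only illustrates the forward computation.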

research
05/07/2021

Adaptive Focus for Efficient Video Recognition

In this paper, we explore the spatial redundancy in video recognition wi...
research
09/27/2022

AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition

Recent research has revealed that reducing the temporal and spatial redu...
research
08/06/2022

Frozen CLIP Models are Efficient Video Learners

Video recognition has been dominated by the end-to-end learning paradigm...
research
09/15/2023

Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval

Optimizing video inference efficiency has become increasingly important ...
research
05/10/2022

Accelerating the Training of Video Super-Resolution Models

Despite that convolution neural networks (CNN) have recently demonstrate...
research
12/08/2022

Deep Model Assembling

Large deep learning models have achieved remarkable success in many scen...
research
04/09/2021

A Reinforcement-Learning-Based Energy-Efficient Framework for Multi-Task Video Analytics Pipeline

Deep-learning-based video processing has yielded transformative results ...
