VA-RED^2: Video Adaptive Redundancy Reduction

by Bowen Pan, et al.

Performing inference with deep learning models for videos remains a challenge because of the large amount of computation required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames, which can translate into redundancy in the temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy, while videos focusing on objects tend to have more channel redundancy. Here we present an input-dependent redundancy reduction framework, termed VA-RED^2. Specifically, VA-RED^2 uses an input-dependent policy to decide how many features need to be computed along the temporal and channel dimensions. To keep the capacity of the original model, after fully computing the necessary features, we reconstruct the remaining redundant features from them using cheap linear operations. We learn the adaptive policy jointly with the network weights in a differentiable way with a shared-weight mechanism, making it highly efficient. Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves a 20% to 40% reduction in computation (FLOPs) compared to state-of-the-art methods without any performance loss. Project page:
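The channel-side idea in the abstract (compute only some features fully, then rebuild the rest with cheap linear operations) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `choose_keep_ratio` and `reduced_layer`, the candidate ratios, the pooled linear policy, and the per-channel scaling used as the "cheap linear operation" are all assumptions made for illustration. The sketch uses a hard argmax for the policy and does not show the differentiable joint training or the temporal dimension.

```python
import numpy as np

def choose_keep_ratio(x, w_policy, ratios=(0.25, 0.5, 1.0)):
    """Hypothetical input-dependent policy: pool the input, score each ratio."""
    pooled = x.mean(axis=0)              # (C_in,) summary of this input
    logits = pooled @ w_policy           # one score per candidate ratio
    return ratios[int(np.argmax(logits))]

def reduced_layer(x, w_full, w_cheap, keep):
    """Compute a fraction of output channels fully, rebuild the rest cheaply.

    x: (N, C_in), w_full: (C_in, C_out), w_cheap: (C_out,) per-channel scales.
    """
    c_out = w_full.shape[1]
    k = max(1, int(round(keep * c_out)))
    full = x @ w_full[:, :k]             # expensive part: N * C_in * k multiply-adds
    # Assumed cheap linear op: each remaining channel is a scaled copy of a
    # computed channel, costing one multiply per output element.
    src = np.arange(c_out - k) % k
    cheap = full[:, src] * w_cheap[: c_out - k]
    return np.concatenate([full, cheap], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_full = rng.standard_normal((8, 16))
w_cheap = rng.standard_normal(16)
w_policy = rng.standard_normal((8, 3))

keep = choose_keep_ratio(x, w_policy)
y = reduced_layer(x, w_full, w_cheap, keep)
print(keep, y.shape)
```

The output keeps the full channel count regardless of the chosen ratio, which mirrors the abstract's point about preserving the capacity of the original model while spending less computation on redundant features.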



IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers

The self-attention-based model, transformer, is recently becoming the le...

Fine-grained Video Categorization with Redundancy Reduction Attention

For fine-grained categorization tasks, videos could serve as a better so...

AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition

Temporal modelling is the key for efficient video action recognition. Wh...

Feature-dependent Cross-Connections in Multi-Path Neural Networks

Learning a particular task from a dataset, samples in which originate fr...

Robust and Efficient Memory Network for Video Object Segmentation

This paper proposes a Robust and Efficient Memory Network, referred to a...

Dynamic Network Quantization for Efficient Video Inference

Deep convolutional networks have recently achieved great success in vide...

Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus on Videos

Adversarial robustness assessment for video recognition models has raise...
