Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

03/30/2023
by   Pilhyeon Lee, et al.

Temporal action detection aims to predict the time intervals and classes of action instances in a video. Despite their promising performance, existing two-stream models are slow at inference because they rely on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework that builds a strong RGB-based detector by transferring knowledge from the motion modality. Specifically, instead of distilling directly, we propose to learn RGB and motion representations separately and then combine them to perform action localization. The dual-branch design and asymmetric training objectives enable effective motion knowledge transfer while keeping RGB information intact. In addition, we introduce a local attentive fusion to better exploit multimodal complementarity; it is designed to preserve the local discriminability of the features, which is important for action localization. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.
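
To make the dual-branch idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and the abstract does not give these details, so everything below is an assumption made for illustration: the helper names (`linear`, `mse`, `fuse_locally`), the use of a mean squared error as the distillation loss, applying that loss only to the motion branch, and a per-snippet softmax gate standing in for the local attentive fusion. The detection head and its loss are omitted.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear(x, weights):
    """Project one feature vector with a weight matrix given as a list of rows."""
    return [dot(row, x) for row in weights]


def mse(pred, target):
    """Mean squared error between two sequences of feature vectors."""
    total = sum((p - t) ** 2 for u, v in zip(pred, target) for p, t in zip(u, v))
    count = sum(len(u) for u in pred)
    return total / count


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def fuse_locally(rgb_feats, motion_feats, gate_rgb, gate_motion):
    """Hypothetical fusion: each time step is weighted using only its own features.

    Because no information is pooled across time, neighbouring snippets
    cannot smooth each other out (an assumed reading of "local").
    """
    fused = []
    for r, m in zip(rgb_feats, motion_feats):
        a_rgb, a_motion = softmax([dot(gate_rgb, r), dot(gate_motion, m)])
        fused.append([a_rgb * x + a_motion * y for x, y in zip(r, m)])
    return fused


def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


random.seed(0)
T, D = 4, 3  # assumed toy sizes: 4 snippets, 3 feature channels

rgb_input = [rand_vec(D) for _ in range(T)]     # stand-in for RGB backbone features
flow_teacher = [rand_vec(D) for _ in range(T)]  # stand-in for optical-flow teacher features

w_rgb = [rand_vec(D) for _ in range(D)]         # RGB branch weights
w_motion = [rand_vec(D) for _ in range(D)]      # motion branch weights

# Both branches read the same RGB input; no optical flow is needed at inference.
rgb_feats = [linear(x, w_rgb) for x in rgb_input]
motion_feats = [linear(x, w_motion) for x in rgb_input]

# Asymmetric objective (assumed): only the motion branch mimics the teacher,
# so the RGB branch is not pulled toward motion features.
distill_loss = mse(motion_feats, flow_teacher)

fused = fuse_locally(rgb_feats, motion_feats, rand_vec(D), rand_vec(D))
print(f"distillation loss: {distill_loss:.4f}")
print(f"fused sequence: {len(fused)} snippets x {len(fused[0])} channels")
```

In this sketch the flow teacher appears only in the training loss, which is the reason the resulting detector can run on RGB frames alone.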


