AdapNet: Adaptability Decomposing Encoder-Decoder Network for Weakly Supervised Action Recognition and Localization

11/27/2019
by Xiao-Yu Zhang, et al.

The point process is a solid framework for modeling sequential data, such as videos, by exploring the underlying relevance among events. Weakly supervised action recognition and localization in untrimmed videos is a challenging problem in high-level video understanding and has attracted intensive research attention. Knowledge transfer that leverages publicly available trimmed videos as external guidance is a promising way to compensate for the coarse-grained video-level annotations and to improve generalization performance. However, unconstrained knowledge transfer may introduce irrelevant noise and jeopardize the learning model. This paper proposes a novel adaptability decomposing encoder-decoder network that transfers reliable knowledge between trimmed and untrimmed videos for action recognition and localization via bidirectional point process modeling, given only video-level annotations. By decomposing the original features into domain-adaptable and domain-specific components according to their adaptability, trimmed-untrimmed knowledge transfer can be safely confined to a more coherent subspace. An encoder-decoder structure is carefully designed and jointly optimized to facilitate effective action classification and temporal localization. Extensive experiments on two benchmark datasets (THUMOS14 and ActivityNet1.3) clearly corroborate the efficacy of the proposed method.
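As a rough illustration of the decomposition idea described in the abstract, the PyTorch sketch below splits each snippet feature into a domain-adaptable and a domain-specific component, reconstructs the original feature from the two, and classifies only from the adaptable stream under video-level supervision. This is a minimal sketch, not the authors' implementation: the module names, layer sizes, pooling scheme, and loss weights are assumptions, and the paper's bidirectional point process modeling is omitted.

# Minimal sketch (not the authors' code) of adaptability decomposing:
# an encoder splits each snippet feature into a domain-adaptable part
# (shared between trimmed and untrimmed videos) and a domain-specific
# part, a decoder reconstructs the original feature from the two parts,
# and video-level classification uses the domain-adaptable stream only.
# All layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptabilityDecomposer(nn.Module):
    def __init__(self, feat_dim=1024, latent_dim=256, num_classes=20):
        super().__init__()
        # Two parallel encoders: adaptable (transferable) vs. specific subspace.
        self.enc_adaptable = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
        self.enc_specific = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
        # Decoder reconstructs the original feature from both components.
        self.decoder = nn.Linear(2 * latent_dim, feat_dim)
        # Snippet-level classifier on the adaptable subspace; the video-level
        # prediction is obtained by temporal pooling (weak supervision).
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, x):                          # x: (batch, time, feat_dim)
        z_a = self.enc_adaptable(x)                # domain-adaptable component
        z_s = self.enc_specific(x)                 # domain-specific component
        recon = self.decoder(torch.cat([z_a, z_s], dim=-1))
        snippet_logits = self.classifier(z_a)      # temporal class activation scores
        video_logits = snippet_logits.mean(dim=1)  # simple average pooling
        return video_logits, snippet_logits, recon

# Illustrative objective: video-level classification plus feature reconstruction.
model = AdaptabilityDecomposer()
x = torch.randn(4, 100, 1024)                      # 4 untrimmed videos, 100 snippets
labels = torch.randint(0, 20, (4,))
video_logits, snippet_logits, recon = model(x)
loss = nn.functional.cross_entropy(video_logits, labels) + \
       0.1 * nn.functional.mse_loss(recon, x)

In this kind of weakly supervised setting, temporal localization would typically follow by thresholding the snippet-level class activation scores, whereas the actual method additionally exploits bidirectional point process modeling over the detected action snippets.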


Related research

02/20/2019  Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision
05/10/2021  Action Shuffling for Weakly Supervised Temporal Localization
02/09/2020  Weakly-Supervised Multi-Person Action Recognition in 360° Videos
05/02/2019  Large-scale weakly-supervised pre-training for video action recognition
07/14/2022  Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization
12/17/2020  Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
04/28/2022  Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos
