Weakly Supervised Gaussian Networks for Action Detection
Detecting the temporal extents of human actions in videos is a challenging computer vision problem that requires detailed manual supervision, including frame-level labels. This expensive annotation process restricts action detectors to a limited number of categories. We propose a novel action recognition method, called WSGN, that learns to detect actions from "weak supervision", i.e., video-level labels. WSGN learns to exploit both video-specific and dataset-wide statistics to predict the relevance of each frame to an action category. We show that a combination of the local and global channels leads to significant gains on two standard benchmarks, THUMOS14 and Charades. Our method improves by more than 12% mAP over weakly supervised state-of-the-art methods and is only 4% behind the state-of-the-art supervised method on the THUMOS14 dataset for action detection. Similarly, our method is only 0.3% mAP behind a state-of-the-art supervised method on the challenging Charades dataset for action localisation.
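To make the idea of combining video-specific (local) and dataset-wide (global) statistics concrete, the following is a minimal illustrative sketch, not the WSGN formulation itself. It assumes per-frame scores for a single action category, standardises them once with the video's own mean and standard deviation and once with dataset-wide values, squashes each result with a sigmoid, and averages the two. The function names, the sigmoid squashing, the equal-weight average, and the relevance-weighted pooling are all assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev


def gaussian_relevance(scores, mean, std, eps=1e-8):
    """Standardise scores with the given mean/std and squash them into (0, 1).

    Hypothetical helper: the sigmoid squashing is an assumption of this sketch.
    """
    out = []
    for s in scores:
        z = (s - mean) / (std + eps)
        z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


def frame_relevance(video_scores, dataset_mean, dataset_std):
    """Average a local (per-video) and a global (dataset-wide) relevance channel."""
    local = gaussian_relevance(video_scores, fmean(video_scores), pstdev(video_scores))
    global_ = gaussian_relevance(video_scores, dataset_mean, dataset_std)
    # Equal weighting of the two channels is an assumption, not taken from the paper.
    return [0.5 * (l + g) for l, g in zip(local, global_)]


def video_level_score(video_scores, relevance):
    """Pool frame scores into one video-level score, weighted by frame relevance."""
    total = sum(relevance)
    return sum(r * s for r, s in zip(relevance, video_scores)) / total


# Toy example: four frames, the last two respond strongly to the action.
scores = [0.0, 0.0, 4.0, 4.0]
rel = frame_relevance(scores, dataset_mean=1.0, dataset_std=1.0)
pooled = video_level_score(scores, rel)
```

In this sketch the pooled video-level score is the only quantity that a video-level label would supervise, while the per-frame relevance values are what one would threshold to propose temporal extents.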