Learnable pooling with Context Gating for video classification

06/21/2017
by Antoine Miech, et al.

Common video representations often rely on average or maximum pooling of pre-extracted frame features over time. Such an approach provides a simple means of encoding feature distributions, but is likely to be suboptimal. As an alternative, we explore combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time. We also introduce a learnable non-linear network unit, named Context Gating, that aims to model interdependencies between features. We evaluate the method on the multi-modal Youtube-8M Large-Scale Video Understanding dataset using pre-extracted visual and audio features. We demonstrate the improvements provided by Context Gating as well as by the combination of learnable pooling methods. Finally, we show how this leads to the best performance, out of more than 600 teams, in the Kaggle Youtube-8M Large-Scale Video Understanding challenge.
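
To make the contrast in the abstract concrete, here is a minimal NumPy sketch of the average and maximum pooling baselines, followed by a gating step applied to the pooled vector. The abstract does not spell out the form of Context Gating; the element-wise sigmoid gate `y = sigmoid(W x + b) * x` used below is an assumption about its form, and the function names, shapes, and random inputs are hypothetical and purely illustrative. In a real model `W` and `b` would be learned rather than sampled.

```python
import numpy as np


def average_pool(frames):
    """Baseline: mean of frame features over time. frames has shape (T, D)."""
    return frames.mean(axis=0)


def max_pool(frames):
    """Baseline: element-wise maximum of frame features over time."""
    return frames.max(axis=0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def context_gating(x, W, b):
    """Assumed gating form: y = sigmoid(W x + b) * x (element-wise product).

    Each output dimension is the input dimension rescaled by a gate in (0, 1)
    that depends on all input dimensions through W.
    x: (D,), W: (D, D), b: (D,).
    """
    return sigmoid(W @ x + b) * x


# Illustrative data: 30 frames with 8-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 8))
W = rng.normal(scale=0.1, size=(8, 8))  # stand-in for learned weights
b = np.zeros(8)                         # stand-in for learned bias

pooled = average_pool(frames)
pooled_max = max_pool(frames)
gated = context_gating(pooled, W, b)
```

Because the gate for each dimension is computed from the whole input vector, one feature can suppress or pass another, which is one way to read "modeling interdependencies between features".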

research
09/16/2018

Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification

Leveraging both visual frames and audio has been experimentally proven e...
research
07/14/2017

Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

This paper describes our solution for the video recognition task of the ...
research
07/11/2017

Hierarchical Deep Recurrent Architecture for Video Understanding

This paper introduces the system we developed for the Youtube-8M Video U...
research
10/01/2018

Learnable Pooling Methods for Video Classification

We introduce modifications to state-of-the-art approaches to aggregating...
research
08/24/2017

Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction

Frame-level visual features are generally aggregated in time with the te...
research
11/12/2018

NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification

This paper introduces a fast and efficient network architecture, NeXtVLA...
research
01/21/2021

LEAF: A Learnable Frontend for Audio Classification

Mel-filterbanks are fixed, engineered audio features which emulate human...
