CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

by Yuanjun Xiong et al.

This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise performance via a number of other techniques. Specifically, we use the latest deep model architectures, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate audio as a complementary channel, extracting relevant information via a CNN applied to spectrograms. With these techniques, we derive an ensemble of deep models that together attains a high classification accuracy (mAP 93.23%) on the testing set and secured first place in the challenge.
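The two aggregation schemes named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names are invented, and the attention logits are taken as fixed inputs here, whereas in the submission they would be produced by a learned module.

```python
import numpy as np

def topk_pooling(scores, k=3):
    """Aggregate per-segment class scores into a video-level score by
    averaging, for each class, the k highest-scoring segments.

    scores: (num_segments, num_classes) array of segment-level scores.
    Returns a (num_classes,) video-level score vector.
    """
    top = np.sort(scores, axis=0)[-k:]  # k best segments per class
    return top.mean(axis=0)

def attention_weighted_pooling(scores, attention_logits):
    """Aggregate segment scores with softmax attention weights.

    attention_logits: (num_segments,) unnormalized per-segment relevance
    (illustrative stand-in for a learned attention module).
    """
    w = np.exp(attention_logits - attention_logits.max())
    w /= w.sum()  # softmax over segments
    return (w[:, None] * scores).sum(axis=0)

# Example: 3 segments, 2 classes.
scores = np.array([[0.1, 0.9],
                   [0.5, 0.2],
                   [0.3, 0.4]])
print(topk_pooling(scores, k=2))                     # top-2 mean per class
print(attention_weighted_pooling(scores, np.zeros(3)))  # uniform attention = plain mean
```

With uniform attention logits, attention-weighted pooling reduces to average pooling, while top-k pooling interpolates between max pooling (k=1) and average pooling (k=num_segments).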




