Multi-modal Aggregation for Video Classification

10/27/2017
by Chen Chen, et al.

In this paper, we present the solution that ranked first in the Large-Scale Video Classification Challenge (LSVC2017) [1]. We focused on a variety of modalities covering visual, motion, and audio information, and we visualized the aggregation process to better understand how each modality contributes. Among the extracted modalities, we found the temporal-spatial features computed by 3D convolution particularly promising, as they greatly improved performance. With the ensemble model, we attained an mAP of 0.8741, the official metric, on the testing set.
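To make the aggregation idea concrete, the following is a minimal sketch of late fusion over pre-extracted per-modality features. The feature dimensions, the concatenation-plus-linear fusion, and the class count are illustrative assumptions for this sketch, not the authors' actual architecture.

```python
# A minimal sketch of multi-modal aggregation, assuming pre-extracted
# per-video feature vectors for each modality. Dimensions, the simple
# concatenate-then-classify fusion, and the class count are assumptions.
import torch
import torch.nn as nn

class MultiModalAggregator(nn.Module):
    def __init__(self, visual_dim=2048, motion_dim=1024, audio_dim=128,
                 hidden_dim=512, num_classes=500):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Classify from the concatenated modality embeddings.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(3 * hidden_dim, num_classes),
        )

    def forward(self, visual, motion, audio):
        fused = torch.cat([
            self.visual_proj(visual),
            self.motion_proj(motion),
            self.audio_proj(audio),
        ], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in features for a batch of 4 videos.
model = MultiModalAggregator()
logits = model(torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 500])
```

In this sketch the 3D-convolutional temporal-spatial features the paper highlights would simply replace or augment one of the input feature vectors; the fusion step is unchanged.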

Related research

- Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification (09/16/2018): Leveraging both visual frames and audio has been experimentally proven e...
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation (05/25/2023): Recently, video object segmentation (VOS) referred by multi-modal signal...
- CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016 (08/02/2016): This paper presents the method that underlies our submission to the untr...
- Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition (06/27/2018): In this report, our approach to tackling the task of ActivityNet 2018 Ki...
- ModDrop: adaptive multi-modal gesture recognition (12/31/2014): We present a method for gesture detection and localisation based on mult...
- Multi-modal Approach for Affective Computing (04/25/2018): Throughout the past decade, many studies have classified human emotions ...
- HMS: Hierarchical Modality Selection for Efficient Video Recognition (04/20/2021): Videos are multimodal in nature. Conventional video recognition pipeline...
