Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification

09/16/2018
by   Jinlai Liu, et al.
0

Leveraging both visual frames and audio has been experimentally proven effective to improve large-scale video classification. Previous research on video classification mainly focuses on the analysis of visual content among extracted video frames and their temporal feature aggregation. In contrast, multimodal data fusion is achieved by simple operators like average and concatenation. Inspired by the success of bilinear pooling in the visual and language fusion, we introduce multi-modal factorized bilinear pooling (MFB) to fuse visual and audio representations. We combine MFB with different video-level features and explore its effectiveness in video classification. Experimental results on the challenging Youtube-8M v2 dataset demonstrate that MFB significantly outperforms simple fusion methods in large-scale video classification.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/21/2017

Learnable pooling with Context Gating for video classification

Common video representations often deploy an average or maximum pooling ...
research
05/29/2020

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

Recognizing sounds is a key aspect of computational audio scene analysis...
research
10/27/2017

Multi-modal Aggregation for Video Classification

In this paper, we present a solution to Large-Scale Video Classification...
research
12/18/2020

Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates

Current deep learning based video classification architectures are typic...
research
12/20/2017

An Order Preserving Bilinear Model for Person Detection in Multi-Modal Data

We propose a new order preserving bilinear framework that exploits low-r...
research
10/11/2022

AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

An audio-visual event (AVE) is denoted by the correspondence of the visu...
research
06/17/2017

Truly Multi-modal YouTube-8M Video Classification with Video, Audio, and Text

The YouTube-8M video classification challenge requires teams to classify...

Please sign up or login with your details

Forgot password? Click here to reset