Deep Multimodal Learning: An Effective Method for Video Classification

11/30/2018
by Tianqi Zhao, et al.

Videos have become ubiquitous on the Internet, and video analysis can provide a wealth of information for detecting and recognizing objects, as well as for understanding human actions and interactions with the real world. However, with datasets reaching the terabyte scale, effective methods are required. Recurrent neural network (RNN) architectures have been widely used for sequential learning problems such as language modeling and time-series analysis. In this paper, we propose several variations of RNNs, such as stacked bidirectional LSTM/GRU networks with an attention mechanism, to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines visual and audio information at both the video and frame level and achieves strong results; ensemble methods are applied as well. Because of its multimodal characteristics, we call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and our own server, and was evaluated in a well-known video classification competition on Kaggle held by Google.
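The architecture the abstract describes, frame-level visual and audio features fused and fed through a stacked bidirectional GRU with attention pooling, can be sketched as below. This is a minimal illustration and not the authors' implementation: the feature dimensions (1024-d visual, 128-d audio, as in the YouTube-8M feature release), hidden size, layer count, class count, and plain concatenation as the fusion step are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Simple additive attention pooling over time steps."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                          # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)    # attention weights over frames
        return (w * h).sum(dim=1)                  # weighted sum -> (batch, dim)

class BiGRUAttentionClassifier(nn.Module):
    """Stacked bidirectional GRU with attention over fused visual + audio frame features."""
    def __init__(self, visual_dim=1024, audio_dim=128, hidden=512,
                 layers=2, num_classes=3862):
        super().__init__()
        self.rnn = nn.GRU(visual_dim + audio_dim, hidden, num_layers=layers,
                          batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, visual, audio):              # (batch, time, visual_dim), (batch, time, audio_dim)
        x = torch.cat([visual, audio], dim=-1)     # frame-level concatenation fusion
        h, _ = self.rnn(x)                         # (batch, time, 2 * hidden)
        video_repr = self.pool(h)                  # video-level representation
        return self.head(video_repr)               # per-class logits (apply sigmoid for multi-label)

# Example: a batch of 4 videos, each with 300 frames of precomputed features
model = BiGRUAttentionClassifier()
logits = model(torch.randn(4, 300, 1024), torch.randn(4, 300, 128))
```

Here the attention layer stands in for mean or max pooling over frames, letting the model weight informative frames more heavily before the video-level classification head; other fusion strategies (e.g., late fusion of separate visual and audio streams) would swap out the concatenation step.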

