Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

04/15/2018 · Xin Wang, et al. · The Regents of the University of California

A major challenge for video captioning is to combine audio and visual cues. Existing multi-modal fusion methods have shown encouraging results in video understanding. However, the temporal structures of multiple modalities at different granularities are rarely explored, and how to selectively fuse the multi-modal representations at different levels of detail remains uncharted. In this paper, we propose a novel hierarchically aligned cross-modal attention (HACA) framework to learn and selectively fuse both global and local temporal dynamics of different modalities. Furthermore, for the first time, we validate the superior performance of deep audio features on the video captioning task. Finally, our HACA model significantly outperforms the previous best systems and achieves new state-of-the-art results on the widely used MSR-VTT dataset.




1 Introduction

Video captioning, the task of automatically generating a natural-language description of a video, is a crucial challenge in both the NLP and vision communities. In addition to visual features, audio features can also play a key role in video captioning. Figure 1 shows an example where the captioning system makes a mistake when analyzing visual features alone. In this example, it could be very hard even for a human to determine whether the girl is singing or talking by watching without listening. Thus, to describe the video content accurately, a good understanding of the audio signal is essential.

In the multi-modal fusion domain, many approaches attempted to jointly learn temporal features from multiple modalities Wu et al. (2014a), such as feature-level (early) fusion Ngiam et al. (2011); Ramanishka et al. (2016), decision-level (late) fusion He et al. (2015), model-level fusion Wu et al. (2014b), and attention fusion Chen and Jin (2016); Yang et al. (2017). However, these techniques do not learn cross-modal attention and thus fail to selectively attend to a certain modality when producing descriptions.

Another issue is that little effort has been devoted to utilizing the temporal transitions of different modalities at varying granularities of analysis. The temporal structures of a video are inherently layered, since a video usually contains temporally sequential activities (e.g., a video where a person reads a book, then throws it on the table; next, he pours a glass of milk and drinks it). There are strong temporal dependencies among those activities. Meanwhile, understanding each of them requires understanding many action components (e.g., pouring a glass of milk is a complicated action sequence). Therefore, we hypothesize that it is beneficial to learn and align both the high-level (global) and low-level (local) temporal transitions of multiple modalities.

Figure 1: A video captioning example.

Moreover, prior work only employed hand-crafted audio features (e.g. MFCC) for video captioning Ramanishka et al. (2016); Xu et al. (2017); Hori et al. (2017). While deep audio features have shown superior performance on some audio processing tasks like audio event classification Hershey et al. (2017), their use in video captioning needs to be validated.

Figure 2: Overview of our HACA framework. Note that in the encoding stage, for the sake of simplicity, the step size of the high-level LSTM in both hierarchical attentive encoders is 2 here, but in practice it is usually set much larger. In the decoding stage, we only show the computations at a single time step (the decoders behave identically at other time steps).

In this paper, we propose a novel hierarchically aligned cross-modal attentive network (HACA) to learn and align both global and local contexts among different modalities of the video. The goal is to overcome the issues mentioned above and generate better descriptions of the input videos. Our contributions are fourfold: (1) we invent a hierarchical encoder-decoder network to adaptively learn the attentive representations of multiple modalities, including visual attention, audio attention, and decoder attention; (2) our proposed model is capable of aligning and fusing both the global and local contexts of different modalities for video understanding and sentence generation; (3) we are the first to utilize deep audio features for video captioning and empirically demonstrate their effectiveness over hand-crafted MFCC features; and (4) we achieve the new state of the art on the MSR-VTT dataset.

Among the network architectures for video captioning Yao et al. (2015); Venugopalan et al. (2015b), sequence-to-sequence models Venugopalan et al. (2015a) have shown promising results. Pan et al. (2016) introduced a hierarchical recurrent encoder to capture the temporal visual features at different levels. Yu et al. (2016) proposed a hierarchical decoder for paragraph generation, and most recently Wang et al. (2018) invented a hierarchical reinforced framework to generate the caption phrase by phrase. But none has tried to model and align the global and local contexts of different modalities as we do. Our HACA model not only learns the representations of different modalities at different granularities, but also aligns and dynamically fuses them both globally and locally with hierarchically aligned cross-modal attentions.

2 Proposed Model

Our HACA model is an encoder-decoder framework comprising multiple hierarchical recurrent neural networks (see Figure 2). Specifically, in the encoding stage, the model has one hierarchical attentive encoder for each input modality, which learns and outputs both the local and global representations of the modality. (In this paper, visual and audio features are used as the input and hence there are two hierarchical attentive encoders as shown in Figure 2; it should be noted, however, that the model seamlessly extends to more than two input modalities.)

In the decoding stage, we employ two cross-modal attentive decoders: the local decoder and the global decoder. The global decoder attempts to align the global contexts of different modalities and learn the global cross-modal fusion context. Correspondingly, the local decoder learns a local cross-modal fusion context, combines it with the output from the global decoder, and predicts the next word.

2.1 Feature Extractors

To exploit visual and audio cues, we use pretrained convolutional neural network (CNN) models to extract deep visual features and deep audio features, respectively. More specifically, we utilize the ResNet model for image classification He et al. (2016) and the VGGish model for audio classification Hershey et al. (2017).
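To make the extraction step concrete, the snippet below is a minimal PyTorch sketch of how frame-level visual features could be obtained with a pretrained ResNet whose classifier head is removed; the specific ResNet variant, preprocessing, and the helper name extract_frame_features are illustrative assumptions rather than the authors' released pipeline, and the analogous VGGish audio pipeline is omitted.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumed preprocessing: standard ImageNet resizing and normalization on the sampled frames.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()   # keep the 2048-d pooled feature, drop the classifier
resnet.eval()

def extract_frame_features(frames):
    """frames: list of PIL images sampled from one video -> (num_frames, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return resnet(batch)
```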

2.2 Attention Mechanism

For a better understanding of the following sections, we first introduce the soft attention mechanism. Given a feature sequence $(f_1, \dots, f_n)$ and a running recurrent neural network (RNN), the context vector $c_t$ at time step $t$ is computed as a weighted sum over the sequence:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i}\, f_i, \qquad \sum_{i=1}^{n} \alpha_{t,i} = 1$$

These attention weights $\alpha_{t,i}$ can be learned by the attention mechanism proposed in Bahdanau et al. (2014), which gives higher weights to certain features that allow better prediction of the system's internal state.
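A minimal PyTorch sketch of this soft attention is given below, assuming an additive (Bahdanau-style) scoring function; the module name and layer sizes are illustrative, not the authors' exact parameterization. The later sketches in this section reuse this module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Scores each feature against a query state and returns the weighted-sum context."""
    def __init__(self, feat_dim, query_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.query_proj = nn.Linear(query_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, query):
        # features: (batch, seq_len, feat_dim); query: (batch, query_dim)
        energy = torch.tanh(self.feat_proj(features) + self.query_proj(query).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=-1)      # attention weights
        context = torch.bmm(alpha.unsqueeze(1), features).squeeze(1)   # weighted-sum context c_t
        return context, alpha
```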

2.3 Hierarchical Attentive Encoder

Inspired by Pan et al. (2016), the hierarchical attentive encoder consists of two LSTMs. The input to the low-level LSTM is a sequence of temporal features $\{f_1, \dots, f_n\}$:

$$o^{\ell}_i,\, h^{\ell}_i = \mathrm{LSTM}^{\ell}_E(f_i,\, h^{\ell}_{i-1})$$

where $\mathrm{LSTM}^{\ell}_E$ is the low-level encoder LSTM, whose output and hidden state at step $i$ are $o^{\ell}_i$ and $h^{\ell}_i$ respectively. As shown in Figure 2, different from a stacked two-layer LSTM, the high-level LSTM here operates at a lower temporal resolution and runs one step every $s$ time steps. Thus it learns the temporal transitions of the segmented feature chunks of size $s$. Furthermore, an attention mechanism is employed at the connection between these two LSTMs. It learns the context vector $c_j$ over the low-level LSTM's outputs of the current feature chunk $\{o^{\ell}_{(j-1)s+1}, \dots, o^{\ell}_{js}\}$, which is then taken as the input to the high-level LSTM at step $j$:

$$o^{h}_j,\, h^{h}_j = \mathrm{LSTM}^{h}_E(c_j,\, h^{h}_{j-1})$$

where $\mathrm{LSTM}^{h}_E$ denotes the high-level LSTM, whose output and hidden state at step $j$ are $o^{h}_j$ and $h^{h}_j$.

Since we utilize both visual and audio features, there are two hierarchical attentive encoders (one for the visual features and one for the audio features). Hence four sets of representations are learned in the encoding stage: the high-level and low-level visual feature sequences ($O^{h}_v$ and $O^{\ell}_v$), and the high-level and low-level audio feature sequences ($O^{h}_a$ and $O^{\ell}_a$).
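The sketch below shows one way such a hierarchical attentive encoder could be wired up, reusing the SoftAttention module from Section 2.2. For brevity it uses a unidirectional low-level LSTM (the experiments use a bidirectional one) and assumes the sequence length is divisible by the chunk size; all names and dimensions are illustrative.

```python
class HierarchicalAttentiveEncoder(nn.Module):
    """Low-level LSTM over every feature step; high-level LSTM runs once per chunk of
    size s, fed with an attention context over that chunk's low-level outputs."""
    def __init__(self, feat_dim, low_dim, high_dim, chunk_size):
        super().__init__()
        self.chunk_size = chunk_size
        self.low_lstm = nn.LSTM(feat_dim, low_dim, batch_first=True)
        self.high_cell = nn.LSTMCell(low_dim, high_dim)
        self.attn = SoftAttention(low_dim, high_dim, low_dim)   # attention between the two LSTMs

    def forward(self, features):
        # features: (batch, n, feat_dim); n assumed divisible by chunk_size
        low_out, _ = self.low_lstm(features)                    # local (low-level) representations
        batch, n, _ = low_out.shape
        h = low_out.new_zeros(batch, self.high_cell.hidden_size)
        c = torch.zeros_like(h)
        high_out = []
        for j in range(n // self.chunk_size):                   # one high-level step per chunk
            chunk = low_out[:, j * self.chunk_size:(j + 1) * self.chunk_size]
            ctx, _ = self.attn(chunk, h)                        # context over the current chunk
            h, c = self.high_cell(ctx, (h, c))
            high_out.append(h)
        return low_out, torch.stack(high_out, dim=1)            # local and global sequences
```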

2.4 Globally and Locally Aligned Cross-modal Attentive Decoder

In the decoding stage, the representations of different modalities at the same granularity are aligned separately with individual attentive decoders. That is, one decoder is employed to align the high-level features and learn a high-level (global) cross-modal embedding. Since the high-level features capture the temporal transitions of larger chunks and focus on long-range contexts, we call the corresponding decoder the global decoder ($\mathrm{LSTM}^{G}_D$). Similarly, the companion local decoder ($\mathrm{LSTM}^{L}_D$) is used to align the low-level (local) features that attend to fine-grained and local dynamics.

At each time step $t$, the attentive decoders learn the corresponding visual and audio contexts $c^{v}_t$ and $c^{a}_t$ from their respective feature streams using the attention mechanism (see Figure 2). In addition, our attentive decoders also attend over their own previous hidden states and learn aligned decoder contexts $c^{d}_t$ (one per decoder):

$$c^{d}_t = \sum_{k=1}^{t-1} \alpha_{t,k}\, h_{k}$$

where $h_k$ is the decoder's hidden state at step $k$.
Paulus et al. (2017) also show that decoder attention can mitigate the phrase repetition issue.

Each decoder is equipped with a cross-modal attention module, which learns the attention over the contexts of different modalities. The cross-modal attention module selectively attends to different modalities and outputs a fusion context $c^{f}_t$:

$$c^{f}_t = \beta^{v}_t W_v c^{v}_t + \beta^{a}_t W_a c^{a}_t + \beta^{d}_t W_d c^{d}_t$$

where $c^{v}_t$, $c^{a}_t$, and $c^{d}_t$ are the visual, audio, and decoder contexts at step $t$ respectively; $W_v$, $W_a$, and $W_d$ are learnable matrices; and the weights $\beta^{v}_t$, $\beta^{a}_t$, and $\beta^{d}_t$ are learned in a similar manner to the attention mechanism in Section 2.2.
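The following sketch illustrates how such a cross-modal attention module might look, with the modality weights scored from the current decoder hidden state in the same spirit as Section 2.2; the exact scoring form is an assumption, not the authors' formulation.

```python
class CrossModalAttention(nn.Module):
    """Projects the visual, audio, and decoder contexts into a shared space and
    combines them with softmax weights conditioned on the decoder hidden state."""
    def __init__(self, vis_dim, aud_dim, dec_dim, hidden_dim, fuse_dim):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, fuse_dim)
        self.proj_a = nn.Linear(aud_dim, fuse_dim)
        self.proj_d = nn.Linear(dec_dim, fuse_dim)
        self.score_v = nn.Linear(vis_dim + hidden_dim, 1)
        self.score_a = nn.Linear(aud_dim + hidden_dim, 1)
        self.score_d = nn.Linear(dec_dim + hidden_dim, 1)

    def forward(self, c_v, c_a, c_d, hidden):
        scores = torch.cat([
            self.score_v(torch.cat([c_v, hidden], dim=-1)),
            self.score_a(torch.cat([c_a, hidden], dim=-1)),
            self.score_d(torch.cat([c_d, hidden], dim=-1)),
        ], dim=-1)
        beta = F.softmax(scores, dim=-1)                         # one weight per modality
        fused = (beta[:, 0:1] * self.proj_v(c_v)
                 + beta[:, 1:2] * self.proj_a(c_a)
                 + beta[:, 2:3] * self.proj_d(c_d))
        return fused, beta
```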

The global decoder directly takes as input the concatenation of the global fusion context $c^{f,G}_t$ and the embedding $e_{t-1}$ of the word generated at the previous time step:

$$o^{G}_t,\, h^{G}_t = \mathrm{LSTM}^{G}_D([c^{f,G}_t;\, e_{t-1}],\, h^{G}_{t-1})$$
The global decoder's output $o^{G}_t$ is a latent embedding which represents the aligned global temporal transitions of multiple modalities. In contrast, the local decoder receives this latent embedding, mixes it with the local fusion context $c^{f,L}_t$, and then learns a uniform representation $o^{L}_t$ to predict the next word:

$$o^{L}_t,\, h^{L}_t = \mathrm{LSTM}^{L}_D([c^{f,L}_t;\, o^{G}_t],\, h^{L}_{t-1})$$
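As a rough sketch, one decoding step with the two aligned decoders could be implemented as below; whether the previous word embedding is also fed to the local decoder, and the exact concatenation order, are assumptions made for illustration.

```python
class HACADecoderStep(nn.Module):
    """One step of the global and local decoders: the global LSTM cell consumes the
    global fusion context and previous word embedding; the local LSTM cell additionally
    consumes the global output and produces the state used for word prediction."""
    def __init__(self, embed_dim, fuse_dim, global_dim, local_dim, vocab_size):
        super().__init__()
        self.global_cell = nn.LSTMCell(fuse_dim + embed_dim, global_dim)
        self.local_cell = nn.LSTMCell(fuse_dim + global_dim + embed_dim, local_dim)
        self.out_proj = nn.Linear(local_dim, vocab_size)         # projection to the vocabulary

    def forward(self, word_emb, fuse_g, fuse_l, state_g, state_l):
        h_g, c_g = self.global_cell(torch.cat([fuse_g, word_emb], dim=-1), state_g)
        h_l, c_l = self.local_cell(torch.cat([fuse_l, h_g, word_emb], dim=-1), state_l)
        logits = self.out_proj(h_l)                              # unnormalized next-word scores
        return logits, (h_g, c_g), (h_l, c_l)
```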
2.5 Cross-Entropy Loss Function

The probability distribution of the next word is

$$P(w_t \mid w_{1:t-1}, \theta) = \mathrm{softmax}(W_o\, o^{L}_t)$$

where $W_o$ is the projection matrix and $w_{1:t-1}$ is the generated word sequence before step $t$. Let $\theta$ be the model parameters and $w^{*}_{1:T}$ be the ground-truth word sequence; then the cross-entropy loss is

$$L(\theta) = -\sum_{t=1}^{T} \log P(w^{*}_t \mid w^{*}_{1:t-1}, \theta)$$
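In PyTorch, this loss amounts to a standard token-level cross-entropy over the caption; the sketch below is illustrative, and the padding index used to mask padded positions is a hypothetical detail.

```python
def caption_loss(logits, targets, pad_idx=0):
    """logits: (batch, steps, vocab) next-word scores; targets: (batch, steps) ground-truth ids."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,   # assumed padding index; padded positions contribute no loss
    )
```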
3 Experimental Setup

Dataset and Preprocessing

We evaluate our model on the MSR-VTT dataset Xu et al. (2016), which contains 10,000 video clips (6,513 for training, 497 for validation, and the remaining 2,990 for testing). Each video has 20 human-annotated reference captions collected via Amazon Mechanical Turk. To extract the visual features, the pretrained ResNet model He et al. (2016) is applied to video frames sampled at a fixed rate. For the audio features, we process the raw WAV files using the pretrained VGGish model as suggested in Hershey et al. (2017).

Evaluation Metrics

We adopt four diverse automatic evaluation metrics: BLEU, METEOR, ROUGE-L, and CIDEr-D, which are computed using the standard evaluation code from the MS-COCO server Chen et al. (2015).

Training Details

All the hyperparameters are tuned on the validation set. The maximum number of frames is 50, and the maximum number of audio segments is 20. For the visual hierarchical attentive encoder (HAE), the low-level encoder is a bidirectional LSTM with hidden dimension 512 (128 for the audio HAE), and the high-level encoder is an LSTM with hidden dimension 256 (64 for the audio HAE), whose chunk size $s$ is 10 (4 for the audio HAE). The global decoder is an LSTM with hidden dimension 256 and the local decoder is an LSTM with hidden dimension 1024. The maximum step size of the decoders is 16. We use word embeddings of size 512. Moreover, we adopt Dropout Srivastava et al. (2014) with a rate of 0.5 for regularization. The gradients are clipped to the range [-10, 10]. We initialize all the parameters with a uniform distribution in the range [-0.08, 0.08]. The Adadelta optimizer Zeiler (2012) is used with batch size 64. The learning rate is initially set to 1 and then reduced by a factor of 0.5 whenever the current CIDEr score does not surpass the previous best for 4 epochs. The maximum number of epochs is set to 50, and the training data is shuffled at each epoch. Scheduled sampling Bengio et al. (2015) is employed to train the models. Beam search of size 5 is used during test-time inference.
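For concreteness, a sketch of this optimization setup is given below; model, train_loader, val_loader, and evaluate_cider are hypothetical placeholders, caption_loss refers to the sketch in Section 2.5, and scheduled sampling and beam search are omitted.

```python
import torch
import torch.optim as optim

optimizer = optim.Adadelta(model.parameters(), lr=1.0)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=4)    # halve the LR if CIDEr stalls for 4 epochs

for epoch in range(50):
    for batch in train_loader:                        # data reshuffled every epoch by the loader
        optimizer.zero_grad()
        loss = caption_loss(model(batch), batch['targets'])
        loss.backward()
        torch.nn.utils.clip_grad_value_(model.parameters(), 10.0)   # clip gradients to [-10, 10]
        optimizer.step()
    cider = evaluate_cider(model, val_loader)         # validation CIDEr after each epoch
    scheduler.step(cider)
```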

4 Results

Model                    BLEU-4   METEOR   ROUGE-L   CIDEr-D
Top-3 Results from MSR-VTT Challenge 2017
v2t_navigator             40.8     28.2     60.9      44.8
Aalto                     39.8     26.9     59.8      45.7
VideoLAB                  39.1     27.7     60.6      44.1
State of the Art
CIDEnt-RL                 40.5     28.4     61.4      51.7
Dense-Cap                 41.4     28.3     61.1      48.9
HRL                       41.3     28.7     61.7      48.0
Our Models
ATT(v)                    39.6     27.4     59.7      45.8
CM-ATT(va)                41.7     28.6     61.2      48.2
CM-ATT(vad)               41.9     29.1     61.5      48.0
HACA(w/o align)           42.8     29.0     61.8      48.9
HACA                      43.4     29.5     61.8      49.7

Table 1: Results on the MSR-VTT dataset.

4.1 Comparison with the State of the Art

In Table 1, we first list the top-3 results from the MSR-VTT Challenge 2017: v2t_navigator Jin et al. (2016), Aalto Shetty and Laaksonen (2016), and VideoLAB Ramanishka et al. (2016). Then we compare with the state-of-the-art methods on the MSR-VTT dataset: CIDEnt-RL Pasunuru and Bansal (2017), Dense-Cap Shen et al. (2017), and HRL Wang et al. (2018). Our HACA model significantly outperforms all the previous methods and achieves the new state of the art on the BLEU-4, METEOR, and ROUGE-L scores. In particular, we improve the BLEU-4 score from 41.4 to 43.4. The CIDEr score is the second best, lower only than that of CIDEnt-RL, which directly optimizes the CIDEr score during training with reinforcement learning. Note that all the results of our HACA method reported here are obtained by supervised learning only.

4.2 Result Analysis

We also evaluate several baselines to validate the effectiveness of the components in our HACA framework (see Our Models in Table 1). ATT(v) is a generic attention-based encoder-decoder model that attends only to the visual features. CM-ATT is a cross-modal attentive model, which contains one individual encoder for each input modality and employs a cross-modal attention module to fuse the contexts of different modalities. CM-ATT(va) denotes the CM-ATT model with visual attention and audio attention, while CM-ATT(vad) has an additional decoder attention.

As presented in Table 1, our ATT(v) model achieves results comparable to the top-ranked systems from the MSR-VTT challenge. Comparing ATT(v) and CM-ATT(va), we observe a substantial improvement from exploiting the deep audio features and adding cross-modal attention. The results of CM-ATT(vad) further demonstrate that decoder attention is beneficial for video captioning. To test the strength of the aligned attentive decoders, we also report the results of the HACA(w/o align) model, which shares almost the same architecture as the HACA model, except that it has only one decoder that receives both the global and local contexts. Our HACA model clearly obtains superior performance, which demonstrates the effectiveness of the context alignment mechanism.

Features          BLEU-4   METEOR   ROUGE-L   CIDEr-D
video only         39.6     27.4     59.7      45.8
video + MFCC       40.3     28.5     60.8      47.5
video + VGGish     41.7     28.6     61.2      48.2

Table 2: Performance of the cross-modal attention model with various audio features.

4.3 Effect of Deep Audio Features

In order to validate the superiority of deep audio features for video captioning, we report in Table 2 the performance of the CM-ATT model with different audio features. Evidently, the deep VGGish audio features work better than the hand-crafted MFCC audio features for the video captioning task. The comparison also shows the importance of audio features for understanding and describing a video.

4.4 Learning Curves

For a more intuitive view of the model capacity, we plot the learning curves of the CIDEr scores on the validation set in Figure 3. Three models are presented: HACA, HACA(w/o align), and CM-ATT. They are trained on the same input modalities and all equipped with visual, audio, and decoder attention. We can observe that the HACA model performs consistently better than the others and has the largest model capacity.

5 Conclusion

We introduce a generic architecture for video captioning which learns aligned cross-modal attention both globally and locally. It can be plugged into existing reinforcement learning methods for video captioning to further boost performance. Moreover, in addition to the deep visual and audio features, features from other modalities can also be incorporated into the HACA framework, such as optical flow and C3D features.

Figure 3: Learning curves of the CIDEr scores on the validation set. Note that greedy decoding is used during training, while beam search is employed at test time, thus the testing scores are higher than the validation scores here.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems. pages 1171–1179.
  • Chen and Jin (2016) Shizhe Chen and Qin Jin. 2016. Multi-modal conditional attention fusion for dimensional emotion prediction. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, pages 571–575.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770–778.
  • He et al. (2015) Lang He, Dongmei Jiang, Le Yang, Ercheng Pei, Peng Wu, and Hichem Sahli. 2015. Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, pages 73–80.
  • Hershey et al. (2017) Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, pages 131–135.
  • Hori et al. (2017) Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In The IEEE International Conference on Computer Vision (ICCV).
  • Jin et al. (2016) Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong, and Alexander Hauptmann. 2016. Describing videos using multi-modal fusion. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, pages 1087–1091.
  • Ngiam et al. (2011) Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11). pages 689–696.
  • Pan et al. (2016) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. 2016. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 1029–1038.
  • Pasunuru and Bansal (2017) Ramakanth Pasunuru and Mohit Bansal. 2017. Reinforced video captioning with entailment rewards. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 979–985.
  • Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304 .
  • Ramanishka et al. (2016) Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, and Kate Saenko. 2016. Multimodal video description. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, pages 1092–1096.
  • Shen et al. (2017) Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. 2017. Weakly supervised dense video captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Shetty and Laaksonen (2016) Rakshith Shetty and Jorma Laaksonen. 2016. Frame-and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, pages 1073–1076.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research 15(1):1929–1958.
  • Venugopalan et al. (2015a) Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence - video to text. In Proceedings of the IEEE international conference on computer vision. pages 4534–4542.
  • Venugopalan et al. (2015b) Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Wang et al. (2018) Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. 2018. Video captioning via hierarchical reinforcement learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wu et al. (2014a) Chung-Hsien Wu, Jen-Chun Lin, and Wen-Li Wei. 2014a. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies. APSIPA transactions on signal and information processing 3.
  • Wu et al. (2014b) Zuxuan Wu, Yu-Gang Jiang, Jun Wang, Jian Pu, and Xiangyang Xue. 2014b. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, pages 167–176.
  • Xu et al. (2016) Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 5288–5296.
  • Xu et al. (2017) Jun Xu, Ting Yao, Yongdong Zhang, and Tao Mei. 2017. Learning multimodal attention lstm networks for video captioning. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, pages 537–545.
  • Yang et al. (2017) Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, and Jiebo Luo. 2017. Deep multimodal representation learning from temporal data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yao et al. (2015) Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision. pages 4507–4515.
  • Yu et al. (2016) Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 4584–4593.
  • Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 .