Two-Stream Video Classification with Cross-Modality Attention

08/01/2019
by Lu Chi, et al.

Fusing multi-modality information is known to bring significant improvements in video classification. However, the most popular approach to date is still to simply fuse each stream's prediction scores at the last stage, which raises the question of whether there is a more effective way to fuse information across modalities. Following the development of attention mechanisms in natural language processing, attention has seen many successful applications in computer vision. In this paper, we propose a cross-modality attention operation that obtains information from the other modality more effectively than late two-stream fusion. Correspondingly, we implement a compatible block named the CMA block, a wrapper around the proposed attention operation that can be plugged into many existing architectures. In the experiments, we comprehensively compare our method with the two-stream and non-local models widely used in video classification, and all results clearly demonstrate the superior performance of the proposed method. We also analyze the advantages of the CMA block by visualizing its attention maps, which intuitively show how the block helps the final prediction.
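
The abstract does not include an implementation, but the described CMA block reads like a non-local-style attention in which queries come from one stream and keys/values come from the other. The following is a minimal PyTorch sketch under that assumption; the class name CMABlock, the projection sizes, and the zero-initialized output projection are illustrative choices, not the authors' reference code.

```python
import torch
import torch.nn as nn


class CMABlock(nn.Module):
    """Cross-modality attention: one stream attends to the other.

    Hypothetical sketch; layer names and sizes are assumptions, not the
    authors' reference implementation.
    """

    def __init__(self, channels, inner_channels=None):
        super().__init__()
        inner = inner_channels or channels // 2
        # 1x1x1 convolutions project features into query/key/value spaces.
        self.query = nn.Conv3d(channels, inner, kernel_size=1)
        self.key = nn.Conv3d(channels, inner, kernel_size=1)
        self.value = nn.Conv3d(channels, inner, kernel_size=1)
        self.out = nn.Conv3d(inner, channels, kernel_size=1)
        # Zero-init the output projection so the block starts as an identity
        # mapping, a common trick from non-local networks.
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, x_a, x_b):
        # x_a: target stream (N, C, T, H, W); x_b: source stream, same shape.
        n, c, t, h, w = x_a.shape
        q = self.query(x_a).flatten(2)                       # (N, C', THW)
        k = self.key(x_b).flatten(2)                         # (N, C', THW)
        v = self.value(x_b).flatten(2)                       # (N, C', THW)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (N, THW, THW)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)       # (N, C', THW)
        y = y.reshape(n, -1, t, h, w)
        # Residual connection: keep the target stream and add attended info.
        return x_a + self.out(y)


# Toy usage: RGB features attend to optical-flow features.
rgb = torch.randn(2, 256, 8, 14, 14)
flow = torch.randn(2, 256, 8, 14, 14)
out = CMABlock(256)(rgb, flow)  # same shape as rgb
```

In this sketch the block returns the target stream's features plus attention-weighted information gathered from the other modality, so it can be dropped between existing layers of either stream, consistent with the paper's claim that the CMA block can be plugged into many existing architectures.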

