AudioVisual Video Summarization

05/17/2021
by Bin Zhao, et al.

Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, existing approaches exploit only the visual information while neglecting the audio information. In this paper, we argue that the audio modality can help the visual modality to better understand the video content and structure, and thereby benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN consists of three parts: 1) a two-stream LSTM encodes the audio and visual features sequentially by capturing their temporal dependencies; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; 3) a self-attention video encoder captures the global dependencies in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, SumMe and TVsum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
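The three-part structure described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give layer sizes, the fusion mechanism, the input to the self-attention encoder, or the prediction head, so everything below (the names `LSTM`, `self_attention`, `avrn_sketch`, concatenation as the fusion input, attention applied directly to the visual features, and a single linear scoring layer with random weights) is an assumption made only to show how the data might flow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LSTM:
    """Minimal single-layer LSTM with random weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One matrix holds the input, forget, output and candidate gates.
        self.W = rng.normal(0.0, 0.1, (input_dim + hidden_dim, 4 * hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def forward(self, x):
        """Map a (T, input_dim) sequence to (T, hidden_dim) hidden states."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        outputs = []
        for x_t in x:
            z = np.concatenate([x_t, h]) @ self.W + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outputs.append(h)
        return np.stack(outputs)

def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def avrn_sketch(visual, audio, hidden_dim=16, seed=0):
    """Return one importance score in (0, 1) per frame.

    visual: (T, Dv) frame features, audio: (T, Da) aligned audio features.
    """
    rng = np.random.default_rng(seed)
    # 1) Two-stream LSTM: encode each modality's temporal dependency.
    h_v = LSTM(visual.shape[1], hidden_dim, rng).forward(visual)
    h_a = LSTM(audio.shape[1], hidden_dim, rng).forward(audio)
    # 2) Fusion LSTM: here fed with the concatenated streams (assumption).
    fused = LSTM(2 * hidden_dim, hidden_dim, rng).forward(
        np.concatenate([h_v, h_a], axis=1))
    # 3) Self-attention encoder: global dependency across all frames,
    #    applied to the visual features here (assumption).
    global_ctx = self_attention(visual)
    # Prediction: a random linear layer plus sigmoid stands in for the
    # trained scoring head (assumption).
    features = np.concatenate([fused, global_ctx], axis=1)
    w = rng.normal(0.0, 0.1, features.shape[1])
    return sigmoid(features @ w)

if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    frame_scores = avrn_sketch(data_rng.normal(size=(12, 32)),
                               data_rng.normal(size=(12, 8)))
    print(frame_scores.shape)
```

Because the weights are random and untrained, the scores carry no meaning; the sketch only shows that the recurrent streams supply temporal context, the attention step supplies global context, and both feed a per-frame prediction.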


