GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization

04/26/2021
by   Jia-Hong Huang, et al.
Traditional video summarization methods generate a fixed video representation regardless of user interest. Such methods therefore fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one approach to this problem. When it is used to support video exploration, a text-based query is one of the main drivers of summary generation, because it is user-defined. Encoding both the text-based query and the video effectively is therefore important for the task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Evaluated on the existing multi-modal video summarization benchmark, the proposed model is shown to be effective, with an improvement of +5.88 over the state-of-the-art method.
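
To make the idea of a query-driven summary concrete, the following is a minimal sketch, not the model proposed in the paper. It assumes the query and each frame have already been encoded as vectors of equal length, scores frames with plain scaled dot-product attention, and keeps the top-scoring frames. The function names `attention_weights` and `summarize`, and the toy 3-dimensional embeddings, are hypothetical and chosen only for illustration.

```python
import math


def attention_weights(query, frames):
    """Scaled dot-product attention of one query vector over frame vectors."""
    scale = math.sqrt(len(query))
    logits = [
        sum(q * f for q, f in zip(query, frame)) / scale for frame in frames
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(query, frames, k):
    """Return indices of the k highest-weighted frames, in temporal order."""
    weights = attention_weights(query, frames)
    ranked = sorted(range(len(frames)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical embeddings; a real system would use learned text and video encoders.
frames = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]
selected = summarize(query, frames, k=2)
```

In this toy setup, changing `query` changes which frames are selected, which is the behavior that distinguishes a query-dependent summary from a fixed one. The components named in the abstract (the contextualized video summary controller, the interactive attention network, and the video summary generator) are not modeled here.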

research
09/17/2020

Multi-modal Summarization for Video-containing Documents

Summarization of multimedia data becomes increasingly significant as it ...
research
04/07/2020

Query-controllable Video Summarization

When video collections become huge, how to explore both within and acros...
research
08/24/2022

Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization

Most current multi-modal summarization methods follow a cascaded manner,...
research
09/21/2019

Video Skimming: Taxonomy and Comprehensive Survey

Video skimming, also known as dynamic video summarization, generates a t...
research
09/11/2021

A Survey on Multi-modal Summarization

The new era of technology has brought us to the point where it is conven...
research
05/13/2021

DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization

The recent growth of web video sharing platforms has increased the deman...
research
10/19/2022

VTC: Improving Video-Text Retrieval with User Comments

Multi-modal retrieval is an important problem for many applications, suc...
