VideoXum: Cross-modal Visual and Textual Summarization of Videos

by Jingyang Lin, et al.

Video summarization aims to distill the most important information from a source video into either an abridged clip or a textual narrative. Existing methods have traditionally been designed for one output type, video or text, and so ignore the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip and the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated video clip and text narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset, VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After filtering out videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in the reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model, VTSUM-BLIP, to address the challenges of the proposed task. Moreover, we propose a new metric, VT-CLIPScore, to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.



TL;DW? Summarizing Instructional Videos with Task Relevance Cross-Modal Saliency

YouTube users looking for instructions for a specific task may spend a l...

Summary Transfer: Exemplar-based Subset Selection for Video Summarization

Video summarization has unprecedented importance to help us digest, brow...

Text Synopsis Generation for Egocentric Videos

Mass utilization of body-worn cameras has led to a huge corpus of availa...

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

Multimodal summarization with multimodal output (MSMO) has emerged as a ...

Does Video Summarization Require Videos? Quantifying the Effectiveness of Language in Video Summarization

Video summarization remains a huge challenge in computer vision due to t...

Hierarchical3D Adapters for Long Video-to-text Summarization

In this paper, we focus on video-to-text summarization and investigate h...

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

The goal of multimodal summarization is to extract the most important in...
