Translating Text Synopses to Video Storyboards

12/31/2022
by Xu Gu, et al.

A storyboard is a roadmap for video creation that consists of shot-by-shot images visualizing the key plots in a text synopsis. Creating video storyboards, however, remains challenging: it requires not only associating high-level text with images, but also long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence across long sequences of shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models in creating text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work.
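To make the retrieval framing of the task concrete, here is a minimal toy sketch. It is not the encoder-decoder baseline described above. It assumes the synopsis has already been split into plot-point embeddings, that candidate frames have been embedded in the same space, and that there are at least as many frames as plot points. The function names `cosine` and `retrieve_storyboard` and the toy vectors are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_storyboard(text_embs, frame_embs):
    """Greedy retrieval (illustrative assumption, not the paper's model).

    For each plot-point embedding, in synopsis order, pick the index of the
    most similar candidate frame that has not been used yet. The result is an
    ordered list of frame indices, one per plot point.
    """
    used = set()
    storyboard = []
    for t in text_embs:
        best = max(
            (i for i in range(len(frame_embs)) if i not in used),
            key=lambda i: cosine(t, frame_embs[i]),
        )
        used.add(best)
        storyboard.append(best)
    return storyboard

# Toy 2-D embeddings standing in for real text and image features.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_embs = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7], [-1.0, 0.0]]

print(retrieve_storyboard(text_embs, frame_embs))  # [1, 0, 2]
```

This greedy matching scores each image only against its text and ignores how well consecutive shots fit together, which is the coherence problem the abstract says the decoder pre-training is meant to address.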


research
07/01/2022

Video + CLIP Baseline for Ego4D Long-term Action Anticipation

In this report, we introduce our adaptation of image-text models for lon...
research
01/05/2023

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics...
research
04/30/2021

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

Generating videos from text is a challenging task due to its high comput...
research
08/28/2023

CoVR: Learning Composed Video Retrieval from Web Video Captions

Composed Image Retrieval (CoIR) has recently gained popularity as a task...
research
12/21/2022

Esports Data-to-commentary Generation on Large-scale Data-to-text Dataset

Esports, a sports competition using video games, has become one of the m...
research
07/26/2021

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference

Video-and-Language Inference is a recently proposed task for joint video...
research
05/15/2019

Automatic Long-Term Deception Detection in Group Interaction Videos

Most work on automated deception detection (ADD) in video has two restri...
