Imagine This! Scripts to Compositions to Videos

04/10/2018
by Tanmay Gupta, et al.

Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. CRAFT explicitly predicts a temporal layout of the mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database, and fuses them to generate scene videos. Our contributions include sequential training of the components of CRAFT while jointly modeling layout and appearances, and losses that encourage learning compositional representations for retrieval. We evaluate CRAFT on semantic fidelity to the caption, composition consistency, and visual quality. CRAFT outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate CRAFT on FLINTSTONES, a new richly annotated video-caption dataset with over 25,000 videos. For a glimpse of videos generated by CRAFT, see https://youtu.be/688Vv86n0z8.
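
The compose, retrieve, and fuse stages named in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: CRAFT learns each stage from video-caption data, whereas the code below uses hand-written stand-ins. Every name here (`Segment`, `predict_layout`, `retrieve`, `fuse`, `generate`), the box format, and the word-overlap scoring are hypothetical and exist only to illustrate the data flow.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical layout format: (x, y, width, height) on a toy canvas.
Box = Tuple[int, int, int, int]


@dataclass
class Segment:
    """Toy stand-in for a spatio-temporal entity segment in a video database."""
    entity: str        # which entity the segment depicts
    description: str   # text associated with the segment (toy assumption)
    frames: List[str]  # placeholder for per-frame pixel data


def predict_layout(entities: List[str], num_frames: int,
                   canvas_width: int = 100) -> Dict[str, List[Box]]:
    """Composition stand-in: one box per entity per frame.

    A learned model would predict these boxes from the caption; here the
    entities are simply placed side by side and kept static over time.
    """
    slot = canvas_width // max(len(entities), 1)
    return {e: [(i * slot, 0, slot, slot)] * num_frames
            for i, e in enumerate(entities)}


def retrieve(entity: str, caption: str, database: List[Segment]) -> Segment:
    """Retrieval stand-in: pick the entity's segment whose description
    shares the most words with the caption (a learned embedding in practice)."""
    candidates = [s for s in database if s.entity == entity]
    if not candidates:
        raise LookupError(f"no segment for entity {entity!r}")
    words = set(caption.lower().split())
    return max(candidates,
               key=lambda s: len(words & set(s.description.lower().split())))


def fuse(layout: Dict[str, List[Box]], segments: Dict[str, Segment],
         num_frames: int) -> List[List[Tuple[str, Box, str]]]:
    """Fusion stand-in: for each frame, list (entity, box, frame data)."""
    video = []
    for t in range(num_frames):
        video.append([
            (e, layout[e][t], segments[e].frames[t % len(segments[e].frames)])
            for e in layout
        ])
    return video


def generate(caption: str, entities: List[str], database: List[Segment],
             num_frames: int = 3) -> List[List[Tuple[str, Box, str]]]:
    """Run the three toy stages in order: compose, retrieve, fuse."""
    layout = predict_layout(entities, num_frames)
    segments = {e: retrieve(e, caption, database) for e in entities}
    return fuse(layout, segments, num_frames)
```

Calling `generate("Fred is sitting next to Wilma", ["fred", "wilma"], database)` on a small list of `Segment` objects returns one list of placements per frame. The point of the sketch is only the separation of concerns: where entities go, which database segments depict them, and how the two are combined into frames.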


