Learning Language-Visual Embedding for Movie Understanding with Natural-Language

09/26/2016
by   Atousa Torabi, et al.

Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural-language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural-language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human-activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2 on 1000 samples. On the multiple-choice test, our best model achieves an accuracy of 58.11.
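The ranking task described above is typically evaluated with Recall@K: each caption query is scored against every candidate video, and one checks whether the correct video appears among the K highest-scoring candidates. A minimal sketch of this metric (the function name and the use of a dense similarity matrix are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def recall_at_k(similarity, k=10):
    """Fraction of queries whose ground-truth item ranks in the top k.

    similarity[i, j] is the model's score between caption i and video j;
    the correct match for query i is assumed to sit at index i (diagonal).
    """
    # Sort candidates for each query by descending score.
    ranks = (-similarity).argsort(axis=1)
    # A "hit" means the ground-truth index i appears within the top-k list.
    hits = [i in ranks[i, :k] for i in range(similarity.shape[0])]
    return float(np.mean(hits))

# Toy example: 4 caption/video pairs with random scores; boosting the
# diagonal makes every ground-truth match rank first.
rng = np.random.default_rng(0)
sim = rng.random((4, 4)) + np.eye(4)
print(recall_at_k(sim, k=1))  # 1.0: each diagonal entry dominates its row
```

With 1000 candidate videos, a random-scoring model would get Recall@10 of about 1%, which puts the reported 19.2 in context.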


