Mining for meaning: from vision to language through multiple networks consensus

06/05/2018
by Iulia Duta, et al.

Describing visual data in natural language is a very challenging task at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is as much about content at the highest semantic level as about fluent form. Here we propose an approach to describing videos in natural language by reaching a consensus among multiple encoder-decoder networks. Such a consensual linguistic description, which shares common properties with a larger group, has a better chance of conveying the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video, and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.
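The idea of choosing, among several candidate descriptions, the one that shares the most with the rest of the group can be illustrated with a minimal sketch. The abstract does not specify the two phases or the similarity measure, so everything below is an assumption made for illustration: the function names (`ngram_counts`, `similarity`, `consensus_caption`) are hypothetical, the similarity is a simple n-gram overlap score, and the selection is a single pass rather than the paper's two-phase process.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def similarity(a, b, max_n=2):
    """Mean n-gram overlap (F1-style) between two sentences.

    Assumption: a stand-in metric, not the one used in the paper.
    """
    ta, tb = a.lower().split(), b.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        ca, cb = ngram_counts(ta, n), ngram_counts(tb, n)
        overlap = sum((ca & cb).values())  # multiset intersection
        total = sum(ca.values()) + sum(cb.values())
        scores.append(2 * overlap / total if total else 0.0)
    return sum(scores) / len(scores)


def consensus_caption(candidates):
    """Return the candidate with the highest mean similarity to the others."""
    if len(candidates) == 1:
        return candidates[0]

    def mean_similarity(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]


# Hypothetical outputs of four captioning models for the same video.
captions = [
    "a man is playing a guitar",
    "a man plays a guitar on stage",
    "a dog runs in a field",
    "a man is playing guitar",
]
print(consensus_caption(captions))
```

In this toy setup the outlier caption about a dog scores lowest because it shares almost nothing with the other candidates, which mirrors the intuition that a description agreeing with the larger group is more likely to convey the correct meaning.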

research
04/12/2016

Attributes as Semantic Units between Natural Language and Visual Recognition

Impressive progress has been made in the fields of computer vision and n...
research
11/20/2015

Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation

Integrating higher level visual and linguistic interpretations is at the...
research
02/09/2021

The Role of the Input in Natural Language Video Description

Natural Language Video Description (NLVD) has recently received strong i...
research
01/30/2020

Dual Convolutional LSTM Network for Referring Image Segmentation

We consider referring image segmentation. It is a problem at the interse...
research
05/08/2023

Joint Moment Retrieval and Highlight Detection Via Natural Language Queries

Video summarization has become an increasingly important task in the fie...
research
11/30/2020

A Comprehensive Review on Recent Methods and Challenges of Video Description

Video description involves the generation of the natural language descri...
research
11/20/2014

CIDEr: Consensus-based Image Description Evaluation

Automatically describing an image with a sentence is a long-standing cha...
