Predicting Visual Features from Text for Image and Video Caption Retrieval

09/05/2017
by   Jianfeng Dong, et al.

This paper strives to find, among a set of sentences, the one that best describes the content of a given image or video. Unlike existing works, which rely on a joint subspace for image and video caption retrieval, we propose to perform retrieval exclusively in a visual space. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and then transformed into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec to video caption retrieval by predicting from text both 3-D convolutional neural network features and a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, its potential for multimodal query composition, and its state-of-the-art results.
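To make the idea of retrieval in a visual space concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy vocabulary, random untrained weights, and a simplified sentence encoding (a bag-of-words vector concatenated with a mean word embedding) standing in for the multi-scale sentence vectorization named in the abstract. All names, dimensions, and the cosine-similarity ranking are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and dimensions: all hypothetical, chosen only for illustration.
VOCAB = ["a", "dog", "runs", "on", "grass", "man", "rides", "bike"]
WORD_DIM = 16     # size of the toy word embeddings
HIDDEN_DIM = 24   # hidden layer width of the MLP
VISUAL_DIM = 32   # size of the target visual feature

word_index = {w: i for i, w in enumerate(VOCAB)}
word_vectors = rng.normal(size=(len(VOCAB), WORD_DIM))  # random stand-in embeddings


def encode_sentence(sentence):
    """Simplified stand-in for multi-scale sentence vectorization:
    bag-of-words counts concatenated with the mean word embedding."""
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    bow = np.zeros(len(VOCAB))
    for i in ids:
        bow[i] += 1.0
    mean_vec = word_vectors[ids].mean(axis=0) if ids else np.zeros(WORD_DIM)
    return np.concatenate([bow, mean_vec])


# Untrained MLP weights; a real model would learn these from image-caption pairs.
TEXT_DIM = len(VOCAB) + WORD_DIM
W1 = rng.normal(scale=0.1, size=(TEXT_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, VISUAL_DIM))
b2 = np.zeros(VISUAL_DIM)


def predict_visual_feature(text_vec):
    """Map a text vector into the visual feature space with a one-hidden-layer MLP."""
    hidden = np.maximum(0.0, text_vec @ W1 + b1)  # ReLU
    return hidden @ W2 + b2


def rank_captions(image_feature, captions):
    """Rank captions by cosine similarity between the image feature and
    each caption's predicted visual feature (best match first)."""
    predicted = np.stack([predict_visual_feature(encode_sentence(c)) for c in captions])
    predicted = predicted / (np.linalg.norm(predicted, axis=1, keepdims=True) + 1e-12)
    query = image_feature / (np.linalg.norm(image_feature) + 1e-12)
    scores = predicted @ query
    return np.argsort(-scores), scores


captions = ["a dog runs on grass", "a man rides a bike"]
# Placeholder image feature: in practice this would come from a pretrained CNN.
image_feature = predict_visual_feature(encode_sentence(captions[0]))
order, scores = rank_captions(image_feature, captions)
print("best caption:", captions[order[0]])
```

Because the comparison happens directly in the visual feature space, the image side needs no learned projection in this sketch; only the text side is mapped. The placeholder image feature is built from the first caption purely so the example has a well-defined best match.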
