Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction

04/23/2016
by   Jianfeng Dong, et al.
0

This paper strives to find the sentence best describing the content of an image or video. Different from existing works, which rely on a joint subspace for image / video to sentence matching, we propose to do so in a visual space only. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input based on sentence vectorization and a multi-layer perceptron. We thoroughly analyze its architectural design, by varying the sentence vectorization strategy, network depth and the deep feature to predict for image to sentence matching. We also generalize Word2VisualVec for matching a video to a sentence, by extending the predictive abilities to 3-D ConvNet features as well as a visual-audio representation. Experiments on four challenging image and video benchmarks detail Word2VisualVec's properties, capabilities for image and video to sentence matching, and on all datasets its state-of-the-art results.

READ FULL TEXT

page 1

page 3

page 8

research
09/05/2017

Predicting Visual Features from Text for Image and Video Caption Retrieval

This paper strives to find amidst a set of sentences the one best descri...
research
12/06/2017

Learning Semantic Concepts and Order for Image and Sentence Matching

Image and sentence matching has made great progress recently, but it rem...
research
12/19/2022

Diffusing Surrogate Dreams of Video Scenes to Predict Video Memorability

As part of the MediaEval 2022 Predicting Video Memorability task we expl...
research
04/23/2015

Multimodal Convolutional Neural Networks for Matching Image and Sentence

In this paper, we propose multimodal convolutional neural networks (m-CN...
research
03/30/2016

Enhancing Sentence Relation Modeling with Auxiliary Character-level Embedding

Neural network based approaches for sentence relation modeling automatic...
research
10/21/2021

Video and Text Matching with Conditioned Embeddings

We present a method for matching a text sentence from a given corpus to ...
research
08/31/2020

Sentence Guided Temporal Modulation for Dynamic Video Thumbnail Generation

We consider the problem of sentence specified dynamic video thumbnail ge...

Please sign up or login with your details

Forgot password? Click here to reset