Deep Learning Applied to Image and Text Matching

09/14/2015
by   Afroze Ibrahim Baqapuri, et al.

The ability to describe images with natural language sentences is a hallmark of image and language understanding. Such a system has wide-ranging applications, such as annotating images and using natural sentences to search for images. In this project we focus on the task of bidirectional image retrieval: such a system is capable of retrieving an image based on a sentence (image search) and retrieving a sentence based on an image query (image annotation). We present a system based on a global ranking objective function which uses a combination of convolutional neural networks (CNN) and multilayer perceptrons (MLP). It takes an image-sentence pair and processes the two inputs in separate channels, finally embedding them into a common multimodal vector space. These embeddings encode abstract semantic information about the two inputs and can be compared using traditional information retrieval approaches. For each such pair, the model returns a score which is interpreted as a similarity metric. If this score is high, the image and sentence are likely to convey similar meaning; if the score is low, they are likely not to. The visual input is modeled via a deep convolutional neural network. For the textual module, we explore three models. The first is a bag of words with an MLP. The second uses n-grams (bigrams, trigrams, and a combination of trigrams and skip-grams) with an MLP. The third is a more specialized deep network designed for modeling variable-length sequences (SSE). We report performance comparable to recent work in the field, even though our overall model is simpler. We also show that the choice, made at training time, of how negative samples are generated has a significant impact on performance, and can be used to specialize the bidirectional system for one particular task.
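To make the scoring and ranking idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not state the exact similarity function or loss, so this sketch assumes cosine similarity in the joint space and a margin-based (hinge) ranking loss. The function names `cosine_scores` and `bidirectional_ranking_loss`, the `margin` value, and the use of the other items in a batch as negative samples are all illustrative assumptions.

```python
import numpy as np


def cosine_scores(img_emb, txt_emb):
    """Return an (n_images, n_sentences) matrix of cosine similarities.

    img_emb and txt_emb are assumed to be outputs of the image and text
    channels, already projected into the same multimodal vector space.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T


def bidirectional_ranking_loss(scores, margin=0.1):
    """Hinge ranking loss for a batch where scores[i, i] is the true pair.

    Assumption: every mismatched item in the batch serves as a negative
    sample. Other negative-sampling schemes are possible.
    """
    n = scores.shape[0]
    pos = np.diag(scores)
    # Image query i: a wrong sentence j should score below the right one.
    cost_sentences = np.maximum(0.0, margin - pos[:, None] + scores)
    # Sentence query j: a wrong image i should score below the right one.
    cost_images = np.maximum(0.0, margin - pos[None, :] + scores)
    off_diag = ~np.eye(n, dtype=bool)  # ignore the matching pairs
    return (cost_sentences[off_diag].sum() + cost_images[off_diag].sum()) / n


# Toy usage: three perfectly aligned image/sentence embeddings.
aligned = np.eye(3)
loss_aligned = bidirectional_ranking_loss(cosine_scores(aligned, aligned))

# Swapping the first two sentences breaks two of the three pairs.
shuffled = aligned[[1, 0, 2]]
loss_shuffled = bidirectional_ranking_loss(cosine_scores(aligned, shuffled))
```

Summing only the image-query term or only the sentence-query term would bias training toward one retrieval direction, which is one way the choice of negatives could specialize such a system for a single task.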

