Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

by Hadi Abdi Khojasteh, et al.

This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is challenging because the features and representations of text and images are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network that learns vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized in the way benchmark data are. Furthermore, there can be a higher semantic correlation between the pictures and tweets, in contrast to benchmarks in which the descriptions are well-organized. Experimental results on the MS-COCO benchmark dataset show that our model outperforms certain previously presented methods and performs competitively with the state-of-the-art. The code and dataset have been made publicly available.



1 Introduction

The advent of social networks has brought about a plethora of opportunities for everyone to share information online in the form of text, images, video and so forth. As a result, there is a vast amount of raw data on the Net which could be helpful in dealing with many challenges in natural language processing and image recognition. Matching pictures with their textual descriptions is one of these challenges, and research interest in it has been growing (Wang and Chan, 2018; Eisenschtat and Wolf, 2017; Faghri et al., 2017; Lee et al., 2018).

The goal in image-text matching is, given an image, to automatically retrieve a natural language description of this image. In addition, given a caption (textual image description), we want to match it with the most related image found in our dataset as shown in Fig. 1. The process involves modeling the relationship between images and texts or captions used to describe them. This defines the semantics of a language by grounding it to the visual world.

Figure 1: Motivation/Concept Figure: Given an image (caption), the goal in image-text matching is to automatically retrieve the closest textual description (image) for it. The tweets shown are examples from the collected dataset.

Many studies have explored the task of cross-modal retrieval at the level of sentences and image regions (Wang and Chan, 2018; Niu et al., 2017; Karpathy and Fei-Fei, 2015; Liu et al., 2017). Karpathy et al. (2014) match the objects in an image with phrases by using dependency tree relations for sentence fragments and finding a common space for representing fragments. Huang et al. (2017) propose sm-LSTM, which utilizes a multimodal context-modulated global attention scheme and an LSTM to predict the salient instance pairs. Recently, many researchers (Huang et al., 2018; Yan and Mikolajczyk, 2015; Zheng et al., 2017; Donahue et al., 2015; Lev et al., 2016; Mao et al., 2014; Gu et al., 2018) have introduced neural network models for image-caption retrieval that consist of RNNs, CNNs, and additional multimodal layers. In practice, one of the reasons that these deep learning approaches have been on the rise is the availability of abundant information on the Web. The next section describes the proposed model.

2 Model

In this work, we introduce an end-to-end multimodal neural network for learning image and text representations simultaneously. The architecture is illustrated in Fig. 2. It consists of two main subnets, a CNN for input image representation with an embedding and an LSTM to map the captions into the new space. The purpose of the model is to find a mapping from the text and image to a common space in order to represent them with similar embeddings. In this space, an image (text) will have a similar representation to its text (image) but a different one from other texts (images). Once the model is trained, by feeding an image (text) to the network, we find the most similar text (image).

2.1 Image representation

For our initial model, after removing the fully-connected layer from ResNet-50 (Xie et al., 2017), which has been pre-trained on ImageNet (Russakovsky et al., 2015), we treat the remaining layers as an image feature extractor. The inputs of the network are images and the output is a feature vector. Therefore, a dense layer with the size of the text domain is added to the end of the network. This layer, which is now part of the model, is trained with the rest of the network to produce image representations. If we call this vector $I$, which is a representation of the input image, then $\phi_I$ is a visual descriptor that is the result of a forward pass through the network. The forward pass is denoted by $F_I(\cdot)$, which is a non-linear function, and the descriptor is defined as $\phi_I = F_I(I)$. The image model is used as one part of our network with its pre-trained weights, which avoids the time-consuming process of learning a large number of parameters from scratch. Then, we add two fully-connected layers to transform $\phi_I$ into an image feature vector ($f_I$) computed by $f_I = W_I \phi_I + b_I$.

Figure 2: Proposed end-to-end multimodal neural network architecture for learning the image and text representations. Image features are extracted by a CNN with 16 residual blocks and text features are extracted by a recurrent unit. The fully-connected layers then join the two domains by feature transformation.

2.2 Text representation

Each input text ($T$) is first represented by an $n \times d$ matrix, with $n$ being its length and $d$ being the size of the dictionary. To build the dictionary, stop words and punctuation marks are removed and all the words are stemmed using the Porter stemmer. In addition, special characters are removed and the remaining words are converted to lowercase. Each word in the final dictionary is represented by a one-hot $d$-dimensional vector, and every word has an index $l$ in the dictionary. Therefore, for an input sentence with $n$ words, there is a matrix as follows:

$$T(i, j) = \begin{cases} 1, & j = l_i \\ 0, & \text{otherwise} \end{cases}$$

where $i \in [1, n]$ and $j \in [1, d]$.
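As a minimal sketch of the one-hot matrix described above, the following builds one row per word with a single 1 in the column given by that word's dictionary index. The function name `one_hot_matrix` is hypothetical, and indices here are 0-based (a Python convention), whereas the text counts from 1.

```python
def one_hot_matrix(word_indices, dict_size):
    """Return an n x d matrix (list of rows) with exactly one 1 per row.

    word_indices: dictionary index of each word in the sentence (0-based here).
    dict_size: size of the dictionary (number of columns).
    """
    matrix = []
    for idx in word_indices:
        if not 0 <= idx < dict_size:
            raise ValueError(f"index {idx} is outside the dictionary")
        row = [0] * dict_size
        row[idx] = 1  # mark the column that corresponds to this word
        matrix.append(row)
    return matrix


# A three-word sentence over a four-word dictionary.
toy = one_hot_matrix([2, 0, 3], dict_size=4)
```

In practice the matrix is rarely materialized. The integer indices are usually passed straight to an embedding lookup, which is equivalent to multiplying the one-hot rows by an embedding matrix.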

Based on this definition, each text should have a fixed length. In this study, since two datasets with different distributions are employed, the length of each sentence is set to a fixed number. To meet this criterion, when there are several sentences for one image, we concatenate all the words and build one long description for that image. When the description is longer than the expected length, the extra words are removed, and when it is shorter, zero-padding is applied. Therefore, every sentence is represented in a space of the same fixed dimension.
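The preprocessing and fixed-length encoding described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stop-word list is a tiny placeholder, stemming is omitted because a Porter stemmer requires an external library, and `max_len`, `PAD_ID` and all function names are hypothetical.

```python
import string

STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "in"}  # illustrative only
PAD_ID = 0  # index reserved for zero-padding


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and drop stop words (no stemming here)."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]


def build_dictionary(captions):
    """Assign each distinct word an index starting at 1 (0 is kept for padding)."""
    vocab = {}
    for caption in captions:
        for word in tokenize(caption):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(captions_for_image, vocab, max_len):
    """Concatenate all captions of one image, then truncate or zero-pad to max_len."""
    words = []
    for caption in captions_for_image:
        words.extend(tokenize(caption))
    ids = [vocab[w] for w in words if w in vocab]  # out-of-dictionary words are dropped
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


captions = ["A dog runs on the beach.", "The dog is happy!"]
vocab = build_dictionary(captions)
padded = encode(captions, vocab, max_len=6)
truncated = encode(captions, vocab, max_len=3)
```

Reserving index 0 for padding keeps padded positions distinguishable from real words, which matters if the downstream recurrent layer masks them.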

The input of the text representation model is a sequence of integers. In the next step, a word embedding is used to reduce the number of semantically similar words or to remove low-frequency words that do not exist in the dictionary, resulting in a new embedding space. Since the vocabulary size is very large, this reduction helps increase the network's generalizability. The new embeddings are then fed into an LSTM (Gers et al., 2000) to learn a probability distribution over the above-mentioned sequence in order to predict the next word. The output of the LSTM is not used for word-level labeling. Instead, only the last hidden state is used as the representation of the whole text. Therefore, for the input sentence $T$, its text descriptor, denoted by $\phi_T$ and obtained using the function $F_T(\cdot)$, is computed as $\phi_T = F_T(T)$. The final word feature vector ($f_T$) is defined by $f_T = W_T \phi_T + b_T$.

Task              Sentence Retrieval            Image Retrieval
Method            R@1    R@5    R@10   Med r    R@1    R@5    R@10   Med r

1K test images
Random Ranking    0.1    0.6    1.1    631      0.1    0.5    1.0    500
STV (2015)        33.8   67.7   82.1   3        25.9   60.0   74.6   4
DVSA (2015)       38.4   69.9   80.5   1        27.4   60.2   74.8   3
GMM-FV (2015)     39.0   67.0   80.3   3        24.2   59.3   76.0   4
MM-ENS (2015)     39.4   67.9   80.9   2        25.1   59.8   76.6   4
m-RNN (2014)      41.0   73.0   83.5   2        29.0   42.2   77.0   3
m-CNN (2015)      42.8   73.1   84.1   2        32.6   68.6   82.8   3
HM-LSTM (2017)    43.9   -      87.8   2        36.1   -      86.7   3
SPE (2016)        50.1   79.7   89.2   -        39.6   75.2   86.9   -
VQA-A (2016)      50.5   80.1   89.7   -        37.0   70.9   82.9   -
2WayNet (2017)    55.8   75.2   -      -        39.7   63.3   -      -
sm-LSTM (2017)    53.2   83.1   91.5   1        40.7   75.8   87.4   2
RRF-Net (2017)    56.4   85.3   91.5   -        43.9   78.1   88.6   -
VSE++ (2017)      64.6   90.0   95.7   1        52.0   84.3   92.0   1
SCAN (2018)       72.7   94.8   98.4   -        58.8   88.4   94.8   -
Ours              47.5   81.0   91.0   2        48.4   84.3   91.5   2

5K test images
GMM-FV (2015)     17.3   39.0   50.2   10       10.8   28.3   40.1   17
DVSA (2015)       16.5   39.2   52.0   9        10.7   29.6   42.2   14
VQA-A (2016)      23.5   50.7   63.6   -        16.7   40.5   53.8   -
VSE++ (2017)      41.3   71.1   81.2   2        30.3   59.4   72.4   4
SCAN (2018)       50.4   82.2   90.0   -        38.6   69.3   80.4   -
Ours              23.8   53.7   67.3   4        25.6   55.1   68.4   3

Table 1: Image and sentence retrieval results on MS-COCO. “Sentence Retrieval” denotes using an image as query to search for the relevant sentences, and “Image Retrieval” denotes using a sentence to find the relevant image. R@K is Recall@K (high is good). Med r is the median rank (low is good).

2.3 Alignment Objective

Given an aligned collection of image-text pairs, the goal is to learn the image-text similarity score, denoted by $S(I, T)$, which is defined as follows:

$$S(I, T) = -E(f_I, f_T)$$

where $f_I$ and $f_T$ are the same-size image and text representations, which have been projected onto a partial order visual-semantic embedding space. The penalty paid for every true pair of points that disagree is $E(x, y) = \lVert \max(0, y - x) \rVert^2$.

To compute the training loss, the image and text output vectors are constrained to lie in the same space. By merging the image and text embedded models as illustrated in Fig. 2, we achieve the desired visual-semantic model. To learn an order encoding function, we use a hinge-based triplet loss function, which encourages positive examples to have zero penalty and negative examples to have a penalty greater than a margin:

$$\mathcal{L} = \sum_{(I, T)} \Big( \sum_{T'} \max\{0, \alpha - S(I, T) + S(I, T')\} + \sum_{I'} \max\{0, \alpha - S(I, T) + S(I', T)\} \Big)$$

where $S$, the similarity score function, is as described above, $\alpha$ is the margin, and $T'$ and $I'$ are inferred from the ground truth by matching contrastive images with each caption and the reverse.

is discrete variance written as

. For computational efficiency, rather than summing over all the negative samples, we consider only the negatives within a mini-batch.
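A hinge-based triplet ranking loss restricted to in-batch negatives can be sketched as below. It assumes the similarity scores for a mini-batch are already available as a square matrix whose diagonal holds the matched pairs. The function name and the margin value are assumptions, since the text does not state the margin used.

```python
def triplet_ranking_loss(sim, margin=0.2):
    """Sum of hinge terms over all in-batch negatives, in both retrieval directions.

    sim[i][j] is the similarity between image i and text j; sim[i][i] is the
    score of the ground-truth pair. The margin value is an assumed placeholder.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        positive = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i as the query, text j as a contrastive (negative) caption.
            loss += max(0.0, margin - positive + sim[i][j])
            # Text i as the query, image j as a contrastive (negative) image.
            loss += max(0.0, margin - positive + sim[j][i])
    return loss


batch_sim = [[1.0, 0.2],
             [0.9, 0.8]]
batch_loss = triplet_ranking_loss(batch_sim, margin=0.2)
```

A hinge term is zero whenever the matched pair beats a negative by at least the margin, so only violating negatives contribute to the loss. A vectorized version would compute the same sum with matrix operations in the training framework.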

3 Experiments

3.1 Implementation

The proposed method has been implemented in Python with TensorFlow (Abadi et al., 2016) and run on a machine with a GeForce GTX 1080 Ti GPU. For initialization, we employ the GloVe (Pennington et al., 2014) word embeddings trained on Twitter, with a vocabulary size of 1.2 million, 27 billion tokens and 2 billion tweets. Training starts with an Adam optimizer, a learning rate of 0.1 and a batch size of 16, and continues until the loss stops changing. When that happens, the learning rate is divided by 2. This continues until the learning rate reaches its lower bound. Then, the batch size is doubled and the learning rate is reset to 0.1. We repeat this process to optimize the model. During training, a grid search over all the hyper-parameters is carried out for model selection. For efficiency, training is performed in batches, which allows us to do real-time data augmentation on images on the CPU in parallel with training the model on the GPU.
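The learning-rate and batch-size schedule described above can be sketched as a small state update that is called each time the loss plateaus. The function name `on_plateau` is hypothetical, and `min_lr` is a placeholder for the lower bound on the learning rate, whose actual value is not given here.

```python
def on_plateau(lr, batch_size, initial_lr=0.1, min_lr=0.025):
    """Return the (learning rate, batch size) to use after the loss stops changing.

    min_lr is an assumed placeholder for the learning-rate floor.
    """
    if lr <= min_lr:
        # Floor reached: double the batch size and restart from the initial rate.
        return initial_lr, batch_size * 2
    # Otherwise keep the batch size and halve the learning rate.
    return lr / 2, batch_size


# Trace a few plateaus starting from the stated initial settings.
state = (0.1, 16)
history = [state]
for _ in range(3):
    state = on_plateau(*state)
    history.append(state)
```

With the placeholder floor of 0.025, the trace halves the rate twice and then restarts at 0.1 with a batch size of 32. How a plateau is detected (for example, a tolerance on the change in loss over several steps) is left to the caller.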

3.2 Evaluation

Given a sentence (image), all the images (captions) of the test set are retrieved and listed in increasing order of their penalty. We then report the results using Recall and Median Rank. Recall is a metric for assessing how well a system retrieves information in response to a query. It is computed by dividing the number of relevant retrieved results by the total number of relevant instances. In R@K, the top K results are treated as the output and the Recall is computed accordingly. Med r is the median of the ranks at which the relevant instances are retrieved, i.e. the middle number in the sorted sequence of those ranks.
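The two measures can be computed from the penalties as in the sketch below. It makes the simplifying assumption that each query has exactly one ground-truth candidate. With several captions per image, a common convention is to use the best-ranked one. All function names are illustrative.

```python
from statistics import median


def rank_of_match(penalties, true_index):
    """1-based rank of the ground-truth candidate when sorted by increasing penalty."""
    order = sorted(range(len(penalties)), key=lambda j: penalties[j])
    return order.index(true_index) + 1


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth candidate is ranked in the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks):
    """Median of the ground-truth ranks over all queries (lower is better)."""
    return median(ranks)


# Three queries, each scored against three candidates; the second element of
# each pair is the index of the ground-truth candidate.
queries = [([0.1, 0.5, 0.9], 0),
           ([0.4, 0.6, 0.2], 1),
           ([0.3, 0.8, 0.5], 2)]
ranks = [rank_of_match(p, t) for p, t in queries]
```

Here `ranks` holds one rank per query, from which `recall_at_k(ranks, 1)`, `recall_at_k(ranks, 5)` and `median_rank(ranks)` give the numbers reported in a results table.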

The existing measures can be intrinsically problematic (Bernardi et al., 2016). For instance, the retrieval of the exact image (text) is not guaranteed. In such cases, since the exact match has not been retrieved, the query is counted as a miss even though similar instances have been matched. To address this issue, other metrics can be taken into account.

3.3 Data Collection and Results

Several datasets have been published for the image-sentence retrieval task (Rashtchian et al., 2010; Ordonez et al., 2011; Young et al., 2014; Hu et al., 2017; Farhadi et al., 2010). As a proof of concept, we collect a dataset for evaluating and analyzing our method, both to better showcase its ability to generalize and to demonstrate the extensibility of this type of solution to conversational texts and unusual images. Moreover, we used MS-COCO (Lin et al., 2014) to train and test the proposed model. This dataset contains 123,287 images and 616,767 descriptions (Lin et al., 2014). Each image has 5 textual descriptions on average, which were collected by crowdsourcing on AMT. The average caption length is 8.7 words after rare word removal. We follow the protocol in (Karpathy et al., 2014) and use 5000 images for both validation and testing, and also report results on a subset of 1000 testing images in Table 1.

We collected 13751 tweets with 14415 images using a crawler based on the Twitter API. To make sure that the collection is diverse, we first created a list of seed users. Then, the followers of the seed accounts were added to the list. Next, the latest tweets of the users in our list were extracted and saved in the dataset. To make the data appropriate for our task, we removed retweets, tweets with no images, non-English tweets and tweets with fewer than three words. As a result, the dataset has a relatively long description for each image and at least one image for every tweet. In the final step, the dataset was examined by two professionals, who removed unrelated content. Fig. 1 shows samples of the extracted dataset. This collection differs from existing ones in its varied domains, informal texts and high level of correlation between text and image. For instance, the tweets may contain abbreviations, initialisms, hashtags or URLs. On the collected tweets, our model achieves relative improvements of 14.3% in sentence retrieval and 16.4% in image retrieval. The dataset has been made available. (Footnote 1: Dataset, source code and model will be publicly available after publishing the paper.)
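The automatic filtering rules listed above might look like the following sketch. The dictionary keys (`is_retweet`, `image_urls`, `lang`, `text`) are hypothetical field names for an already-downloaded tweet record, not the actual Twitter API schema, and the manual review step is not modeled.

```python
def keep_tweet(tweet):
    """Apply the four filtering rules to one tweet record (a plain dict)."""
    if tweet.get("is_retweet"):
        return False  # retweets are removed
    if not tweet.get("image_urls"):
        return False  # tweets without images are removed
    if tweet.get("lang") != "en":
        return False  # non-English tweets are removed
    if len(tweet.get("text", "").split()) < 3:
        return False  # tweets with fewer than three words are removed
    return True


sample = [
    {"text": "Sunset over the bay tonight", "lang": "en",
     "image_urls": ["img1.jpg"], "is_retweet": False},
    {"text": "So cool", "lang": "en",
     "image_urls": ["img2.jpg"], "is_retweet": False},
    {"text": "Look at this view everyone", "lang": "en",
     "image_urls": [], "is_retweet": False},
    {"text": "Regardez cette belle photo", "lang": "fr",
     "image_urls": ["img3.jpg"], "is_retweet": False},
    {"text": "Great shot of the city", "lang": "en",
     "image_urls": ["img4.jpg"], "is_retweet": True},
]
kept = [t for t in sample if keep_tweet(t)]
```

Only the first record passes all four rules. Counting words by whitespace splitting treats hashtags and URLs as words, which is one of several reasonable conventions.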

4 Discussion

We propose a multimodal image-text matching model using a convolutional neural network and a long short-term memory network along with fully-connected layers. They are employed to map image and text inputs into a shared feature space, where their representations can be compared to find the closest pairs. Additionally, a new dataset of images and tweets extracted from Twitter is introduced, with the aim of having a collection that is characteristically different from the benchmarks. Whereas the descriptions in the benchmarks are well-organized, our dataset has not been standardized and its image-text pairs can have high semantic correlation. Also, because of the variety of domains present in the extracted dataset, the task of image-text matching becomes even more challenging. Therefore, it can be used to carry out new research and to assess the robustness of proposed frameworks. Our experiments on MS-COCO yield improved results over some previously proposed architectures.


References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation, pp. 265–283.
  • R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank (2016) Automatic description generation from images: a survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research 55, pp. 409–442.
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634.
  • A. Eisenschtat and L. Wolf (2017) Linking image and text with 2-way nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4601–4611.
  • F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2017) VSE++: improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612.
  • A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth (2010) Every picture tells a story: generating sentences from images. In European Conference on Computer Vision, pp. 15–29.
  • F. A. Gers, J. A. Schmidhuber, and F. A. Cummins (2000) Learning to forget: continual prediction with LSTM. Neural Computation 12 (10), pp. 2451–2471.
  • J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7181–7189.
  • Y. Hu, L. Zheng, Y. Yang, and Y. Huang (2017) Twitter100k: a real-world dataset for weakly supervised cross-media retrieval. IEEE Transactions on Multimedia 20 (4), pp. 927–938.
  • Y. Huang, W. Wang, and L. Wang (2017) Instance-aware image and sentence matching with selective multimodal LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2310–2318.
  • Y. Huang, Q. Wu, C. Song, and L. Wang (2018) Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6163–6171.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137.
  • A. Karpathy, A. Joulin, and L. F. Fei-Fei (2014) Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pp. 1889–1897.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302.
  • B. Klein, G. Lev, G. Sadeh, and L. Wolf (2015) Associating neural word embeddings with deep image representations using Fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4437–4446.
  • K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216.
  • G. Lev, G. Sadeh, B. Klein, and L. Wolf (2016) RNN Fisher vectors for action recognition and image annotation. In European Conference on Computer Vision, pp. 833–850.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • X. Lin and D. Parikh (2016) Leveraging visual question answering for image-caption ranking. In European Conference on Computer Vision, pp. 261–277.
  • Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew (2017) Learning a recurrent residual fusion network for multimodal matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4107–4116.
  • L. Ma, Z. Lu, L. Shang, and H. Li (2015) Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2623–2631.
  • J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille (2014) Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632.
  • Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua (2017) Hierarchical multimodal LSTM for dense visual-semantic embedding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1881–1889.
  • V. Ordonez, G. Kulkarni, and T. L. Berg (2011) Im2Text: describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pp. 1143–1151.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier (2010) Collecting image annotations using Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 139–147.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013.
  • Q. Wang and A. B. Chan (2018) CNN+CNN: convolutional decoders for image captioning. arXiv preprint arXiv:1805.09019.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
  • F. Yan and K. Mikolajczyk (2015) Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3441–3450.
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78.
  • Z. Zheng, L. Zheng, M. Garrett, Y. Yang, and Y. Shen (2017) Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535.