Multitask Text-to-Visual Embedding with Titles and Clickthrough Data

05/30/2019 ∙ by Pranav Aggarwal, et al. ∙ Adobe

Text-visual (also called semantic-visual) embedding is a central problem in vision-language research. It typically involves mapping an image and a text description to a common feature space through a CNN image encoder and an RNN language encoder. In this paper, we propose a new method for learning text-visual embedding using both image titles and click-through data from an image search engine. We also propose a new triplet loss function that models positive awareness of the embedding, and introduce a novel mini-batch-based hard negative sampling approach for better data efficiency in the learning process. Experimental results show that our proposed method outperforms existing methods and is also effective for real-world text-to-visual retrieval.




1 Introduction

Image search is a well-studied problem in both research and industry. With recent advances in deep learning, cross-modal retrieval between text and images has become a central problem in the field of language and vision. Previous methods mostly rely on tags, either extracted from textual data or automatically inferred from images, to perform text-based image retrieval. However, these methods are prone to errors when the text query is long, because tags lack the flexibility to capture variations in language descriptions.

Text-visual embedding (also called semantic-visual embedding) aims to map text and visual information to the same feature space so that cross-modal retrieval can be performed by nearest neighbor search in that space. It can effectively cope with the limitations of the bag-of-words models common to many image search algorithms. With proper encoding of language information, text-visual embedding is particularly effective for long text queries. Among recent developments, some methods use generative models to obtain image representations for text and then perform nearest neighbor search, while others modify the image feature extraction side to reach a common vector space. Instead, we build a shallow architecture that maps only the text into a space that is easy to extract from images, such as the ResNet [10] or VGG [25] feature space.

In this paper, our contributions are as follows: (1) We propose a new way of bringing images and text into a common space through multi-task training on a combination of a click-through dataset, which captures user intention, and an image caption dataset, which equips our model for long sentences and reduces user query noise, while keeping the image feature extraction architecture unchanged. Using this combination helps our model handle real-world text queries. (2) We propose a novel loss function, which we call the positive aware triplet ranking loss, that combines the advantages of the contrastive loss and the triplet loss. (3) We describe how to select “hard” negatives, which determine the kind of generalization we see in the image or text results by influencing how tightly the entity clusters form.

Figure 1: Conceptual illustration of our training mechanism using one sample: We perform this for title dataset batch and click-through dataset batch separately and the average of the two losses is then back-propagated through the query embedding creator.

2 Related work

The problem described above is part of a broader concept called metric learning, which has been studied in several fields such as machine learning [26], information retrieval [15], and computer vision [17]. The goal of a metric learning algorithm is to project samples from two different domains or modalities to a common latent space so that similar data samples (e.g. from the same class and different domains or modalities) lie close to each other and dissimilar data samples (e.g. samples from different classes) lie far away from each other.

In this paper we consider image and text as two modalities of the data. Canonical Correlation Analysis (CCA) [6] is one of the early methods for finding a common latent space between text and image. [21] extends CCA to learn a common space that takes into account high-level semantic information in the form of multi-label annotations. Like most traditional machine learning algorithms, CCA has intrinsic limitations that may make it unsuitable for large-scale datasets. [28] combines deep learning with CCA and addresses the problem of matching images and captions in a joint latent space learnt with deep canonical correlation analysis (DCCA). For metric learning, a deep network can learn a complex non-linear projection function for each modality of the data using coupled networks [11, 13, 16].

Recently, [18] went beyond coupled networks and designed a cross-modal retrieval approach with only one network. This is done by fusing images and texts before passing them to a single network. [5] significantly improves the state-of-the-art performance on cross-modal image-caption retrieval by using the coupled structure together with two additional generative networks. Adversarial learning is also used in [24] to find a common space using within-modality and across-modality discriminators.

3 Our Text-Visual multi-task training network

3.1 Creating the common embedding space

First we need to fix the embedding space into which we would like our images and text to be projected. For this we make use of existing architectures that are pre-trained on a large corpus of images to predict tags, such as the VGG-19 [25], ResNet-152 or ResNet-50 [10] models. We select the layer before the last activation layer (softmax) as the common embedding space.

3.2 Text embedding creation network

Figure 2: Query embedding creator architecture: This architecture represents the query embedding creator block seen in Figure 1.

Given a text input to the network, we first look up the vector representation of each word. We use pre-trained embeddings such as fastText embeddings [4], which produce a 300-dimensional vector for each word. The sequential information is then captured using Long Short-Term Memory (LSTM) units [7]. Stacking the LSTMs and adding dropout [19] to the LSTM cells helps improve the performance of the model. The output of the last LSTM unit is then passed to a fully connected layer, which maps the dimension of the LSTM output to that of the image embeddings.

3.3 Hard negative image selection

We perform multi-task training in which we train on the title dataset and the click-through dataset separately. In both cases, the input text is associated with a positive image. For the title dataset, the text is the title and the positive image is the image associated with the title. For the click-through dataset, the text is the query and the positive image is the image that was clicked. As for the negative images, we select images within the batch. The negative images must be similar enough to the positive image to increase discrimination by the network, but selected in such a way that they do not belong to the same category. Steps to sample negative images:

1: We use a batch of 512 image embedding samples, where each sample corresponds to either the clicked image associated with a text query (click-through dataset) or the image associated with a title (title dataset). These are the positive image embeddings.

2: For every positive sample in the batch,

2.1: We calculate the squared distance to every other sample in the batch.

2.2: We remove those samples whose associated text shares any word with the text of the positive sample, excluding stop words. For example, if the text query associated with the positive sample is “man on a motorbike”, the words we consider are “man” and “motorbike”, and we remove those samples that have either “man” or “motorbike” in their associated text. If we want to select harder negatives, we instead remove only those samples that have both “man” and “motorbike”.

2.3: From the remaining samples, we select the N negative samples with the smallest squared distance from the positive sample.
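The sampling steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the tiny stop-word list, and the whitespace tokenization are assumptions made for the example.

```python
# Illustrative stop-word list (an assumption; the paper does not specify one).
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "and", "with", "at"}


def content_words(text):
    """Lower-cased words of `text` with stop words removed."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_hard_negatives(embeddings, texts, n_neg, harder=False):
    """For each positive i, return the indices of its n_neg hardest in-batch negatives."""
    words = [content_words(t) for t in texts]
    result = []
    for i, anchor in enumerate(embeddings):
        candidates = []
        for j, other in enumerate(embeddings):
            if j == i:
                continue
            shared = words[i] & words[j]
            if harder:
                # Harder variant: drop a sample only if it contains every
                # content word of the positive sample's text.
                excluded = bool(words[i]) and shared == words[i]
            else:
                # Default (step 2.2): drop a sample sharing any content word.
                excluded = bool(shared)
            if not excluded:
                candidates.append((squared_distance(anchor, other), j))
        # Step 2.3: keep the n_neg closest remaining samples.
        candidates.sort()
        result.append([j for _, j in candidates[:n_neg]])
    return result


texts = ["man on a motorbike", "man riding a horse", "red motorbike",
         "dog in the park", "cat on a sofa"]
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0], [3.0, 3.0]]
print(select_hard_negatives(embeddings, texts, n_neg=1)[0])
print(select_hard_negatives(embeddings, texts, n_neg=2, harder=True)[0])
```

In this toy batch, the default filter discards the two nearest neighbours of “man on a motorbike” because they share “man” or “motorbike”, whereas the harder variant keeps them as negatives.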

3.4 Positive aware triplet ranking loss

Let $d_p$ be the squared distance between the positive image embedding and the query embedding, and $d_n$ the squared distance between the negative image embedding and the query embedding. The positive aware triplet ranking loss is:

$$L_{PATR} = d_p + \max(0,\ \eta - d_n) \tag{1}$$

Here the margin $\eta$ penalizes small values of $d_n$; therefore the higher $\eta$ is, the tighter the clusters are. In the triplet loss [3][20][9], given below,

$$L_{T} = \max(0,\ d_p - d_n + m) \tag{2}$$

the margin $m$ plays the role of $\eta$. When $d_p$ is placed inside the max(), the loss function tries to minimize $d_p - d_n$ by increasing or decreasing the values of $d_p$ and $d_n$ together, with $d_n$ being affected more, so that the gap between them grows automatically. We wanted to consider these two values separately in our loss function. $L_{PATR}$ tries to minimize $d_p$ and, separately, tries to maximize $d_n$. This causes the positive image embeddings to become very similar to the text query embeddings, forcing them to lie in the same cluster, while tightening these clusters by maximizing $d_n$. The effectiveness can be increased by using multiple negative samples, selecting the top N negative samples. Then our loss function becomes:

$$L_{PATR} = d_p + \frac{1}{N}\sum_{i=1}^{N} \max(0,\ \eta - d_{n_i}) \tag{3}$$

Our final multi-task training loss is:

$$L = \frac{1}{2}\left(L_{PATR}^{title} + L_{PATR}^{click}\right) \tag{4}$$
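As a rough numerical illustration of the difference between the two losses, the sketch below implements them on precomputed squared distances. It assumes the positive aware loss is the positive distance plus a hinge on each negative distance, averaged over the N negatives (the averaging is an assumption), and that the multi-task loss is the plain average of the title-batch and click-through-batch losses, as the Figure 1 caption describes. All function and parameter names are hypothetical.

```python
def patr_loss(d_pos, d_negs, eta):
    """Positive aware triplet ranking loss for one query (sketch).

    d_pos  : squared distance between the query and its positive image
    d_negs : squared distances between the query and its N negative images
    eta    : margin applied to the negative distances
    """
    hinge = [max(0.0, eta - d_n) for d_n in d_negs]
    # Averaging over the N negatives is an assumption of this sketch.
    return d_pos + sum(hinge) / len(hinge)


def triplet_loss(d_pos, d_neg, margin):
    """Standard triplet ranking loss on squared distances."""
    return max(0.0, d_pos - d_neg + margin)


def _mean(values):
    return sum(values) / len(values)


def multitask_loss(title_losses, click_losses):
    """Average of the mean title-batch loss and the mean click-through-batch loss."""
    return 0.5 * (_mean(title_losses) + _mean(click_losses))


# A triplet whose negative is already far away: the triplet loss is zero,
# while the positive aware loss still pulls the positive closer.
print(triplet_loss(0.3, 2.0, margin=0.5))
print(patr_loss(0.3, [2.0], eta=1.2))
```

With these values the triplet loss contributes no gradient, while the positive aware loss still equals the positive distance, which matches the motivation given in the text for treating the two distances separately.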


4 Experimental results

4.1 Datasets

Pascal Sentence Dataset [22]: The dataset contains 1K images with 5 captions each. For a fair comparison we use the same dataset partition as [27], i.e. 800 images for training and 200 for testing. The images are already divided into 20 categories. We use the category information only during testing, after correcting a few obvious mislabellings.

Adobe Stock dataset: For training, we collect 1M pairs of user-typed text queries and clicked images from our current Adobe Stock search engine, and 1M caption (title)-image pairs provided by Adobe Stock users. The testing dataset consists of 5K caption-image pairs and 5K user-typed text query-clicked image pairs, which we evaluate separately and then average. The user-provided captions are not always elaborate or precise. We report results on this dataset to show that our loss function coupled with multi-task training outperforms other loss functions and methods in a real-world scenario.

4.2 Implementation

For all experiments we use TensorFlow [14] and the Adam optimizer [12] with decay parameter beta1 = 0.99 to train our model. We stack the LSTMs 5 times, with 25% of the nodes dropped out. To get image features from the VGG architecture we extract the relu7 features (4096 dim), and from the ResNet architectures we extract the pool5 layer (2048 dim).

| Dataset | No. of epochs | Learning rate | LSTM units | Image feature |
|---|---|---|---|---|
| Pascal Sentence | 50 | 0.0005, 0.0001 | 18 | VGG-19, ResNet-152 |
| Adobe Stock | 30 | 0.0001 | 15 | ResNet-50 |

Table 1: Training hyper-parameters and image features

For the Pascal Sentence Dataset we implement our loss function (3) with a margin of 1.0. As we only have caption data for this experiment, we do not use multi-task training. The number of top hard negative samples (N) is 3.

For the Adobe Stock dataset we compare our loss function (4) (margin of 1.2) with a multi-task triplet loss (2) (margin of 0.5) and an l2 loss (baseline). For a fair comparison we use only one negative sample for both our loss and the triplet loss. We also compare our multi-task implementation (4) with models trained on only the click-through dataset or only the caption dataset using equation (1).

4.3 Evaluation Metric

We report the mean Average Precision (mAP) score described in [2] for the Pascal Sentence dataset. Given the first R top-ranked retrieved data samples for each query, we define AP for each query as:

$$AP = \frac{1}{M} \sum_{r=1}^{R} p(r)\, rel(r)$$

Here M is the number of relevant samples retrieved, p(r) is the precision at rank r, and rel(r) is 1 if the sample at rank r is relevant and 0 if not. A retrieved sample is considered relevant if it has the same semantic label as the query. mAP is the mean of AP over all queries. We report mAP@50 (R=50).
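A minimal sketch of this metric follows, assuming AP is the sum of p(r)·rel(r) over the top R ranks divided by the number M of relevant samples retrieved, and that each query is given as a ranked list of 0/1 relevance flags. The function names are illustrative.

```python
def average_precision(relevance, R=50):
    """AP over the top-R results; relevance[r-1] is 1 if rank r is relevant, else 0."""
    hits = 0
    precision_sum = 0.0
    for r, rel in enumerate(relevance[:R], start=1):
        if rel:
            hits += 1
            precision_sum += hits / r  # p(r) * rel(r)
    # M is the number of relevant samples among the top R; AP is 0 if there are none.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists, R=50):
    """Mean of AP over all queries."""
    return sum(average_precision(rel, R) for rel in relevance_lists) / len(relevance_lists)


# Two toy queries: relevant items at ranks 1 and 3, and at rank 2.
print(average_precision([1, 0, 1, 0]))
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
```

For the first toy query the precision is 1 at rank 1 and 2/3 at rank 3, so AP is their mean, 5/6.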

As we do not have semantic labels for the Adobe Stock dataset, we use the measure from [8] for our evaluation, i.e. R@K, which is defined as the percentage of text queries for which the ground-truth image is contained in the first K retrieved results.
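R@K can be sketched as follows, assuming retrieval is a nearest neighbor search by squared distance in the common embedding space and that each query has exactly one ground-truth image. The helper names are hypothetical, and the score is returned as a fraction in [0, 1], the form used in the result tables.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rank_images(query_emb, image_embs):
    """Image indices sorted by increasing squared distance to the query."""
    return sorted(range(len(image_embs)),
                  key=lambda j: squared_distance(query_emb, image_embs[j]))


def recall_at_k(query_embs, image_embs, ground_truth, k):
    """Fraction of queries whose ground-truth image is among the top-k results."""
    hits = sum(
        1 for q, gt in zip(query_embs, ground_truth)
        if gt in rank_images(q, image_embs)[:k]
    )
    return hits / len(query_embs)


images = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [0.9, 0.2], [4.0, 4.0]]
truth = [0, 2, 3]  # index of the ground-truth image for each query
print(recall_at_k(queries, images, truth, k=1))
print(recall_at_k(queries, images, truth, k=3))
```

In this toy setup the second query ranks its ground-truth image third, so it counts as a hit at K=3 but not at K=1.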

4.4 Cross Modal Retrieval

Figure 3: Cross-modal retrieval results using Pascal Sentence Dataset and our method (ResNet-152) (green color represents ground truth). Top box shows Txt2Img results while bottom box shows Img2Txt results.
| Method | Img2Txt | Txt2Img |
|---|---|---|
| Corr-AE [2] | 0.290 | 0.279 |
| ACMR [1] | 0.535 | 0.543 |
| CVS [23] | 0.604 | 0.578 |
| Ours (VGG-19) | 0.634 | 0.533 |
| Ours (ResNet-152) | 0.656 | 0.562 |

Table 2: mAP evaluation using Pascal dataset

In Table 2 we see that both of our models outperform the other methods in Img2Txt retrieval. In Txt2Img retrieval, our ResNet-152 model (0.562) comes close to the best result (CVS, 0.578). We expect this to improve with a larger dataset, which would provide more semantically similar images from which to select the negative samples.

| Method | Caption R@1 | Caption R@10 | Caption R@20 | Click-through R@1 | Click-through R@10 | Click-through R@20 | Average R@1 | Average R@10 | Average R@20 |
|---|---|---|---|---|---|---|---|---|---|
| l2 | 0.118 | 0.430 | 0.546 | 0.071 | 0.286 | 0.371 | 0.094 | 0.358 | 0.459 |
| triplet loss [3] | 0.153 | 0.485 | 0.583 | 0.087 | 0.315 | 0.398 | 0.120 | 0.400 | 0.491 |
| Ours | 0.174 | 0.511 | 0.620 | 0.092 | 0.312 | 0.393 | 0.133 | 0.412 | 0.506 |

Table 3: Ranking evaluations for different loss functions using both caption and click-through Adobe Stock dataset

| Training Dataset | Caption R@1 | Caption R@10 | Caption R@20 | Click-through R@1 | Click-through R@10 | Click-through R@20 | Average R@1 | Average R@10 | Average R@20 |
|---|---|---|---|---|---|---|---|---|---|
| only clicks | 0.076 | 0.322 | 0.425 | 0.091 | 0.312 | 0.404 | 0.084 | 0.317 | 0.414 |
| only titles | 0.180 | 0.514 | 0.619 | 0.052 | 0.193 | 0.262 | 0.116 | 0.354 | 0.441 |
| clicks and titles | 0.174 | 0.511 | 0.620 | 0.092 | 0.312 | 0.393 | 0.133 | 0.412 | 0.506 |

Table 4: Ranking evaluations for different training combinations of caption and click-through dataset with our loss function

4.5 Real-World Analysis for Image Retrieval

In Table 3 we see that our method retrieves the ground-truth images better than the other loss functions on the caption dataset and on average, for all values of K. On the click-through dataset our method is best at R@1, while the triplet loss is slightly ahead at R@10 and R@20.

Table 4 shows that multi-task training with both the click-through and caption (title) datasets combines the advantages of the models trained on only one type of dataset. Each single-dataset model performs well on its own test set but poorly on the other, whereas our approach stays close to the best result on both, which yields the highest average values for all values of K.

Figure 4: Category-level t-SNE clustering of the Pascal Sentences testing dataset using our method (ResNet-152). Here “x” represents an image and “o” represents text. (Please zoom in for better visualization.)

5 Conclusions

We present a shallow deep learning architecture, along with a novel loss function, that maps text into the same vector space as image features. The model takes into account the practical use cases of a search engine, and we show how multi-task training involving user click data and caption data helps achieve better results for real-world use cases. We also demonstrate that the choice of negative samples is an important factor in how the query clusters form. In Figure 4 we see that the text captions are mapped into the visual space. One good example is that the bicycle text (sentence) cluster lies very close to the motorbike text cluster (without much overlap) because the objects are visually similar, which shows that the text embedding has gained visual intuition.


  • [1] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen. Adversarial cross-modal retrieval. pages 154–162, 2017.
  • [2] F. Feng, X. Wang, and R. Li. Cross-modal retrieval with correspondence autoencoder. pages 7–16, 2014.
  • [3] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11:1109–1135, 2010.
  • [4] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • [5] Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7181–7189, 2018.
  • [6] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.
  • [7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [8] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. Order-embeddings of images and language. 2016.
  • [9] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. 2014.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pages 1889–1897, 2014.
  • [12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [13] V. E. Liong, J. Lu, Yap-Peng Tan, and J. Zhou. Deep coupled metric learning for cross-modal matching. IEEE Transactions on Multimedia, 19(6):1234–1244, 2017.
  • [14] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
  • [15] Brian McFee and Gert R Lanckriet. Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 775–782, 2010.
  • [16] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E Papalexakis, and Amit K Roy-Chowdhury. Webly supervised joint embedding for cross-modal image-text retrieval. arXiv preprint arXiv:1808.07793, 2018.
  • [17] Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [18] Shah Nawaz, Muhammad Kamran Janjua, Alessandro Calefati, and Ignazio Gallo. Revisiting cross modal retrieval. arXiv preprint arXiv:1807.07364, 2018.
  • [19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [20] D. Parikh and K. Grauman. Relative attributes. pages 503–510, 2011.
  • [21] V. Ranjan, N. Rasiwasia, and CV Jawahar. Multi-label cross-modal retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pages 4094–4102, 2015.
  • [22] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 139–147, 2010.
  • [23] S. Sah, S. Gopalakrishnan, and R. Ptucha. Cross modal retrieval using common vector space. In CVPR Language and Vision Workshop, 2018.
  • [24] Shagan Sah, Sabarish Gopalakrishnan, and Ray Ptucha. Cross modal retrieval using common vector space.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [26] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, June 2009.
  • [27] Y. Peng, J. Qi, X. Huang, and Y. Yuan. CCL: Cross-modal correlation learning with multi-grained fusion by hierarchical network. JMLR, pages 154–162, 2017.
  • [28] Fei Yan and Krystian Mikolajczyk. Deep correlation for matching images and text. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3441–3450, 2015.