Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are plagued by the issue of training with small-scale datasets covering a limited number of images with ground-truth sentences. Moreover, it is extremely expensive to create a larger dataset by annotating millions of images with sentences and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily-available web images with noisy annotations to learn robust image-text joint representation. Specifically, our main idea is to leverage web images and corresponding tags, along with fully annotated datasets, in training for learning the visual-semantic joint embedding. We propose a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding. Experiments on two standard benchmark datasets demonstrate that our method achieves a significant performance gain in image-text retrieval compared to state-of-the-art approaches.



1. Introduction

Joint embeddings have been widely used in multimedia data mining as they enable us to integrate the understanding of different modalities. These embeddings are usually learned by mapping inputs from two or more distinct domains (e.g., images and text) into a common latent space, where the transformed vectors of semantically associated inputs should be close. Learning an appropriate embedding is crucial for achieving high performance in many multimedia applications involving multiple modalities. In this work, we focus on the task of cross-modal retrieval between images and language (see Fig. 1), i.e., the retrieval of images given a sentence query, and the retrieval of text given a query image.

Most of the success in the image-text retrieval task has been achieved by joint embedding models trained in a supervised way using image-text pairs from hand-labeled image datasets (e.g., MSCOCO (Chen et al., 2015), Flickr30K (Plummer et al., 2015)). Although these datasets cover a significant number of images (e.g., about 80K in MSCOCO and 30K in Flickr30K), creating a larger dataset with image-sentence pairs is extremely difficult and labor-intensive (Krause et al., 2016). Moreover, it is generally feasible to have only a limited number of users annotate training images, which may lead to a biased model (van Miltenburg, 2016; Hu and Strout, 2018; Zhao et al., 2017). Hence, while these datasets provide a convenient modeling assumption, they are very restrictive considering the enormous variety of rich descriptions that a human can compose (Karpathy and Fei-Fei, 2015). Accordingly, although trained models show good performance on benchmark datasets for the image-text retrieval task, such models are unlikely to show satisfactory cross-dataset generalization (training on one dataset, testing on a different dataset) when applied in the open-world setting.

Figure 1. Illustration of image-text retrieval task: Given a text query, retrieve and rank images from the database based on how well they depict the text or vice versa.
Figure 2. The problem setting of our paper. Our goal is to utilize web images associated with noisy tags to learn a robust visual-semantic embedding from a dataset of clean images with ground truth sentences. We test the learned latent space by projecting images and text descriptions from the test set in the embedding and perform cross-modal retrieval.

On the other hand, streams of images with noisy tags are readily available in datasets, such as Flickr-1M (Huiskes and Lew, 2008), as well as in nearly infinite numbers on the web. A practical image-text retrieval system developed with a large number of web images is more likely to be robust. However, inefficient utilization of weakly-annotated images may increase ambiguity and degrade performance. Motivated by this observation, we pose an important question in this paper: Can a large number of web images with noisy annotations be leveraged together with a fully annotated dataset of images with textual descriptions to learn better joint embeddings? Fig. 2 shows an illustration of this scenario. This is an extremely relevant problem to address, since obtaining a large human-annotated training set of image-text pairs is difficult and does not scale.

In this work, we study how to judiciously utilize web images to develop a successful image-text retrieval system. We propose a novel framework that can augment any ranking loss based supervised formulation with weakly-supervised web data for learning robust joint embeddings. Our approach consistently and significantly outperforms previous approaches in cross-modal image-text retrieval tasks. We believe our efforts will encourage researchers working in this area to focus on the importance of large-scale web data for efficiently learning a more comprehensive representation from multimodal data.

1.1. Overview of the Proposed Approach

In the cross-modal image-text retrieval task, an embedding network is learned to project image features and text features into the same joint space, and retrieval is then performed by searching for the nearest neighbor in the latent space. In this work, we attempt to utilize web images annotated with noisy tags for improving joint embeddings trained using a dataset of images and ground-truth sentence descriptions. However, combining web image-tag pairs with image-text pairs in training the embedding is non-trivial. The greatest obstacle arises from noisy tags and the intrinsic difference between the representation of sentence descriptions and tags. The typical representation of sentences is related to, yet clearly different from, that of tags. Sentences are usually represented using an RNN-based encoder with a word-to-vec (Word2Vec) model, providing sequential input vectors to the encoder. In contrast, tags do not have sequential information, and a useful representation of tags can be tf-idf weighted BOW vectors or the average of all Word2Vec vectors corresponding to the tags.

To bridge this gap, we propose a two-stage approach that learns the joint image-text representation. Firstly, we use a supervised formulation that leverages the available clean image-text pairs from a dataset to learn an aligned representation that can be shared across three modalities (e.g., image, tag, text). As tags are not available directly in the datasets, we consider nouns and verbs from a sentence as dummy tags (Fig. 3). We leverage ranking loss based formulation with image-text and image-tags pairs to learn a shared representation across modalities. Secondly, we utilize weakly-annotated image-tags pairs from the web (e.g., Flickr) to update the previously learned shared representation, which allows us to transfer knowledge from thousands of freely available weakly annotated images to develop a better cross-modal retrieval system. Our proposed approach is also motivated by learning using privileged information (LUPI) paradigm (Vapnik and Vashist, 2009; Sharmanska et al., 2013) and multi-task learning strategies in deep neural networks (Ruder, 2017; Bingel and Søgaard, 2017) that share representations between closely related tasks for enhanced learning performance.

1.2. Contributions

We address a novel and practical problem in this paper—how to exploit large-scale web data for learning an effective multi-modal embedding without requiring a large amount of human-crafted training data. Towards solving this problem, we make the following main contributions.

We propose a webly-supervised approach utilizing web image collection with associated noisy tags, and a clean dataset containing images and ground truth sentence descriptions for learning robust joint representations.

We develop a novel framework with pair-wise ranking loss for augmenting a typical supervised method with weakly-supervised web data to learn a more robust joint embedding.

We demonstrate clear performance improvement in image-text retrieval task using proposed web-supervised approach on Flickr30K (Plummer et al., 2015) and MSCOCO datasets (Lin et al., 2014).

Figure 3. A brief illustration of our proposed framework for learning visual-semantic embedding model utilizing image-text pairs from a dataset and image-tag pairs from the web. First, a dataset of images and their sentence descriptions are used to learn an aligned image-text representation. Then, we update the joint representation using web images and corresponding tags. The trained embedding is used in image-text retrieval task. Please see Section 3 for details.

2. Related Work

Visual-Semantic Embedding: Joint visual-semantic models have shown excellent performance on several multimedia tasks, e.g., cross-modal retrieval (Wang et al., 2016; Klein et al., 2015; Huang et al., 2017b; Mithun et al., 2018), image captioning (Mao et al., 2014; Karpathy and Fei-Fei, 2015), image classification (Hubert Tsai et al., 2017; Frome et al., 2013; Gong et al., 2014a), and video summarization (Choi et al., 2017; Plummer et al., 2017). Cross-modal retrieval methods require computing semantic similarity between two different modalities, i.e., vision and language. Learning a joint visual-semantic representation naturally fits our task of image-text retrieval since it is possible to directly compare visual data and sentence descriptions in such a joint space (Faghri et al., 2017; Nam et al., 2017).

Image-Text Retrieval: Recently, there has been significant interest in developing powerful image-text retrieval methods in the multimedia, computer vision, and machine learning communities (Karpathy et al., 2014; Henning and Ewerth, 2017). In (Farhadi et al., 2010), a method for mapping visual and textual data to a common space based on extracting a triplet of object, action, and scene is presented. A number of image-text embedding approaches have been developed based on Canonical Correlation Analysis (CCA) (Yan and Mikolajczyk, 2015; Socher and Fei-Fei, 2010; Hodosh et al., 2013; Gong et al., 2014a). Ranking loss has been used for training the embedding in most recent works relating the image and language modalities for image-text retrieval (Kiros et al., 2014; Frome et al., 2013; Wang et al., 2018a; Faghri et al., 2017; Nam et al., 2017). In (Frome et al., 2013), words and images are projected to a common space utilizing a ranking loss that applies a penalty when an incorrect label is ranked higher than the correct one. A bi-directional ranking loss based formulation is used to project image features and sentence features to a joint space for cross-modal image-text retrieval in (Kiros et al., 2014).

Several image-text retrieval methods extended this work (Kiros et al., 2014) with slight modifications in the loss function (Faghri et al., 2017), similarity calculation (Vendrov et al., 2015; Wang et al., 2018a) or input features (Nam et al., 2017). In (Faghri et al., 2017), the authors modified the ranking loss based on violations incurred by relatively hard negatives; this method is the current state of the art in the image-text retrieval task. An embedding network is proposed in (Wang et al., 2018a) that uses the bi-directional ranking loss along with neighborhood constraints. A multi-modal attention mechanism is proposed in (Nam et al., 2017) to selectively attend to specific image regions and sentence fragments and calculate similarity. A multi-modal LSTM network is proposed in (Huang et al., 2017a) that recurrently selects salient pairwise instances from image and text, and aggregates local similarity measurements for image-sentence matching. Our method complements the works that project words and images to a common space utilizing a bi-directional ranking loss. The proposed formulation could be extended and applied to most of these approaches with little modification.

Webly Supervised Learning: Manually annotating images for training does not scale well to the open-world setting, as it is impracticable to collect and annotate images for all relevant concepts (Li et al., 2017a; Mithun et al., 2016). Moreover, there exist different types of bias in the existing datasets (van Miltenburg, 2016; Torralba et al., 2011; Khosla et al., 2012). In order to circumvent these issues, several recent studies focused on using web images and associated metadata as an auxiliary source of information to train their models (Li et al., 2017b; Gong et al., 2017; Sun et al., 2015). Although web images are noisy, utilizing such weakly-labeled images has been shown to be very effective in many multimedia tasks (Gong et al., 2014b; Li et al., 2017b; Joulin et al., 2016).

Our work is motivated by these works on learning more powerful models by realizing the potential of web data. As MSCOCO, the largest dataset for image-sentence retrieval, has only about 80K training images, we believe it is crucial and practical to complement scarce clean image-sentence data with web images to improve the generalization ability of image-text embedding models. Most relevant to our work is (Gong et al., 2014b), where the authors constructed a dictionary from a few thousand of the most common words and represented text as tf-idf weighted bag of words (BoW) vectors, which ignore word order and represent each caption as a vector of word frequencies. Although such a textual feature representation allows them to utilize the same feature extractor for sentences and sets of tags, it fails to consider the inherent sequential nature of sentences when training image-sentence embedding models.

3. Approach

In this section, we first describe the network structure (Section 3.1). Then, we revisit the basic framework for learning image text mapping using pair-wise ranking loss (Section 3.2). Finally, we present our proposed strategy to incorporate the tags in the framework to learn an improved embedding (Section 3.3).

3.1. Network Structure and Input Feature

Network Structure: We learn our joint embedding model using a deep neural network framework. As shown in Fig. 3, our model has three different branches for utilizing image, sentence, and tags. Each branch has a different expert network for a specific modality, followed by two fully connected embedding layers. The idea is that the expert networks first focus on identifying modality-specific features, and the embedding layers then convert the modality-specific features to modality-robust features. The parameters of these expert networks can be fine-tuned together with the embedding layers. For simplicity, we keep the image encoder (e.g., pre-trained CNN) and the tag encoder (e.g., pre-trained Word2Vec model) fixed in this work. The word embedding and the GRU for sentence representation are trained end-to-end.

Text Representation: For encoding sentences, we use Gated Recurrent Units (GRU) (Chung et al., 2014), which have been used for representing sentences in many recent works (Faghri et al., 2017; Kiros et al., 2014). We set the dimensionality of the joint embedding space, $D$, to 1024. The dimensionality of the word embeddings that are input to the GRU is 300.

Image Representation: For encoding images, we adopt a deep CNN model trained on the ImageNet dataset as the encoder. Specifically, we experiment with the state-of-the-art 152-layer ResNet model (He et al., 2016) and the 19-layer VGG model (Simonyan and Zisserman, 2014) in this work. We extract image features directly from the penultimate fully connected layer. The dimension of the image embedding is 2048 for ResNet152 and 4096 for VGG19.

Tag Representation: We generate the feature representation of tags by summing over the Word2Vec (Mikolov et al., 2013) embeddings of all tags associated with an image and then normalizing it by the number of tags. Averaged word vectors have been shown to be a strong feature for text in several tasks (Yu et al., 2014; Kenter and de Rijke, 2015; Kenter et al., 2016).
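As an illustration of the tag representation, the following minimal Python sketch averages word vectors over the tags of one image. The toy vocabulary `toy_vectors`, the function name `average_tag_vector`, and the choice to skip out-of-vocabulary tags (and to divide by the number of tags actually found) are assumptions made for the example, not details stated in the text.

```python
from typing import Dict, List, Sequence


def average_tag_vector(tags: Sequence[str],
                       word_vectors: Dict[str, List[float]],
                       dim: int) -> List[float]:
    """Sum the vectors of all known tags and divide by how many were found."""
    total = [0.0] * dim
    found = 0
    for tag in tags:
        vec = word_vectors.get(tag.lower())
        if vec is None:
            continue  # assumption: out-of-vocabulary tags are skipped
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found == 0:
        return total  # all-zero vector when no tag is in the vocabulary
    return [t / found for t in total]


# Toy 3-dimensional "Word2Vec" table, purely illustrative.
toy_vectors = {"dog": [1.0, 0.0, 2.0], "beach": [3.0, 2.0, 0.0]}
tag_feature = average_tag_vector(["Dog", "beach", "xyzzy"], toy_vectors, dim=3)
```

In practice the lookup table would be a pre-trained Word2Vec model; the resulting vector is what the tag branch of the network would receive as input.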

3.2. Train Joint Embedding with Ranking Loss

We now describe the basic framework for learning joint image-sentence embedding based on bi-directional ranking loss. Many prior approaches have utilized pairwise ranking loss as the objective for learning joint embedding between visual input and textual input (Kiros et al., 2014; Zheng et al., 2017; Wang et al., 2016; Karpathy et al., 2014). Specifically, these approaches minimize a hinge-based triplet ranking loss in order to maximize the similarity between an image embedding and corresponding text embedding and minimize similarity to all other non-matching ones.

Given an image feature representation $\bar{i}$ ($\bar{i} \in \mathbb{R}^{V}$), the projection on the joint space can be derived as $i = W^{(i)} \bar{i}$ ($i \in \mathbb{R}^{D}$). Similarly, the projection of the input text embedding $\bar{s}$ ($\bar{s} \in \mathbb{R}^{T}$) to the joint space can be derived by $s = W^{(s)} \bar{s}$ ($s \in \mathbb{R}^{D}$). Here, $W^{(i)} \in \mathbb{R}^{D \times V}$ is the transformation matrix that maps the visual content into the joint space and $D$ is the dimensionality of the space. In the same way, $W^{(s)} \in \mathbb{R}^{D \times T}$ maps the input sentence embedding to the joint space. Given the feature representation for words in a sentence, the sentence embedding $\bar{s}$ is found from the hidden state of the GRU. Here, given the feature representation of both images and corresponding text, the goal is to learn a joint embedding characterized by $\theta$ (i.e., $W^{(i)}$, $W^{(s)}$, and GRU weights) such that the image content and semantic content are projected into the joint space. Now, the image-sentence loss function $\mathcal{L}_{IS}$ can be written as,

$$\mathcal{L}_{IS} = \sum_{(i,s)} \Big\{ \sum_{s^{-}} \max\big[0,\ \Delta - f(i,s) + f(i,s^{-})\big] + \sum_{i^{-}} \max\big[0,\ \Delta - f(s,i) + f(s,i^{-})\big] \Big\} \tag{1}$$

where $s^{-}$ is a non-matching text embedding for image embedding $i$, and $s$ is the matching text embedding. This is similar for image embedding $i$ and non-matching image embedding $i^{-}$. $\Delta$ is the margin value for the ranking loss. The scoring function $f(i,s)$ measures the similarity between the images and text in the joint embedded space. In this work, we use cosine similarity in the representation space to calculate similarity, which is widely used in learning image-text embedding and shown to be very effective in many prior works (Zheng et al., 2017; Kiros et al., 2014; Faghri et al., 2017). However, note that our approach does not depend on any particular choice of similarity function.

The first term in Eq. (1) sums over all non-matching text embeddings and encourages, for each image embedding, the matching text embedding to be closer than the non-matching ones in the joint space. Similarly, the second term encourages, for each text embedding, the matching image embedding to be closer in the joint space than the non-matching image embeddings.
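The two terms of this bi-directional ranking loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the training code: the embeddings are taken to be already projected into the joint space, `images[k]` is assumed to match `texts[k]`, every other item in the given lists is treated as a non-matching sample, and the names `cosine` and `vse_loss` are introduced only for this example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def vse_loss(images: List[Sequence[float]],
             texts: List[Sequence[float]],
             margin: float = 0.2) -> float:
    """Sum-of-hinges bi-directional ranking loss over all negatives."""
    n = len(images)
    sim = [[cosine(images[a], texts[b]) for b in range(n)] for a in range(n)]
    loss = 0.0
    for k in range(n):
        pos = sim[k][k]  # similarity of the matching pair
        for j in range(n):
            if j == k:
                continue
            # first term: image k against non-matching text j
            loss += max(0.0, margin - pos + sim[k][j])
            # second term: text k against non-matching image j
            loss += max(0.0, margin - pos + sim[j][k])
    return loss


aligned = vse_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = vse_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With perfectly aligned toy embeddings every hinge is inactive and the loss is zero; swapping the two captions makes all four hinges active.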

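Recently, focusing on hard negatives has been shown to be effective in learning joint embeddings (Faghri et al., 2017; Zheng et al., 2017; Schroff et al., 2015; Wu et al., 2017). Subsequently, the loss in Eq. 1 is modified to focus on hard negatives (i.e., the negative closest to each positive pair) instead of the sum over all negatives in the formulation. For a positive pair $(i,s)$, the hardest negative samples can be identified using $\hat{i} = \arg\max_{i^{-}} f(s, i^{-})$ and $\hat{s} = \arg\max_{s^{-}} f(i, s^{-})$. With the scoring function $f$ and the margin $\Delta$ as in Eq. 1, the loss function can be written as follows,

$$\mathcal{L}_{IS} = \sum_{(i,s)} \Big\{ \max\big[0,\ \Delta - f(i,s) + f(i,\hat{s})\big] + \max\big[0,\ \Delta - f(s,i) + f(s,\hat{i})\big] \Big\} \tag{2}$$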

We name Eq. 1 as VSE loss and Eq. 2 as VSEPP loss. We utilize both of these loss functions in evaluating our proposed approach.
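The hard-negative variant differs from the sum-over-negatives loss only in that each hinge uses the single most similar non-matching sample. A minimal sketch, assuming a precomputed similarity matrix `sim` in which `sim[a][b]` is the score between image `a` and sentence `b` and the diagonal holds the matching pairs (the function name `vsepp_loss` and the toy matrix are illustrative only):

```python
from typing import List


def vsepp_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Bi-directional ranking loss that keeps only the hardest negative."""
    n = len(sim)
    loss = 0.0
    for k in range(n):
        pos = sim[k][k]
        neg_texts = [sim[k][j] for j in range(n) if j != k]
        neg_images = [sim[j][k] for j in range(n) if j != k]
        if not neg_texts:
            continue  # a single pair has no negatives
        # hardest non-matching sentence for image k
        loss += max(0.0, margin - pos + max(neg_texts))
        # hardest non-matching image for sentence k
        loss += max(0.0, margin - pos + max(neg_images))
    return loss


toy_sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
toy_loss = vsepp_loss(toy_sim)
```

For the toy matrix, three hinges are active (0.1, 0.3 and 0.3), so the loss is approximately 0.7.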

3.3. Training Joint Embedding with Web Data

In this work, we try to utilize image-tag pairs from the web for improving joint embeddings trained using a clean dataset with image-sentence pairs. Our aim is to learn a good representation for image-text embedding that ideally ignores the data-dependent noise and generalizes well. Utilization of web data effectively increases the sample size used for training our model and can be considered as implicit data augmentation. However, it is not possible to directly update the embedding (Sec. 3.2) using image-tag pairs. A GRU-based approach is not suitable for representing tags, since tags do not carry the sequential context that sentences have.

Our task can also be considered from the perspective of learning with side or privileged information strategies (Vapnik and Vashist, 2009; Sharmanska et al., 2013), as in our case an additional tag modality is available at training time and we would like to utilize this extra information to train a stronger model. However, directly employing LUPI strategies is also not possible in our case, as the training data do not provide information for all three modalities at the same time. The training datasets (e.g., MSCOCO, Flickr30K) provide only image-sentence pairs and do not provide tags. On the other hand, a web source usually provides images with tags, but no sentence descriptions. To bridge this gap, we propose a two-stage approach to train the joint image-text representation. In the first stage, we leverage the available clean image-text pairs from a dataset to learn an aligned representation that can be shared across three modalities (e.g., image, tag, text). In the second stage, we adapt the model trained in the first stage with web data.

Stage I: Training Initial Embedding. We leverage image-text pairs from an annotated dataset to learn a joint embedding for image, tags, and text. As tags are not available directly in the datasets, we consider nouns and verbs from the relevant sentence as dummy tags for an image (Fig. 3). For learning the shared representation, we combine the image-text ranking loss objective (Sec. 3.2), with image-tag ranking loss objective. We believe combining image-tag ranking loss objective provides a regularization effect in training that leads to more generalized image-text embedding.
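The construction of dummy tags from a caption can be sketched as below. The example assumes the caption has already been tokenized and part-of-speech tagged by some external tagger using Penn Treebank style labels (nouns start with `NN`, verbs with `VB`); the tagger, the function name `dummy_tags`, and the lowercasing and de-duplication steps are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def dummy_tags(tagged_caption: Sequence[Tuple[str, str]]) -> List[str]:
    """Keep nouns (NN*) and verbs (VB*), lowercased, in order, no repeats."""
    tags: List[str] = []
    for word, pos in tagged_caption:
        if pos.startswith(("NN", "VB")):
            lowered = word.lower()
            if lowered not in tags:
                tags.append(lowered)
    return tags


# Hand-tagged toy caption; a real pipeline would obtain the POS labels
# from a tagger.
caption = [
    ("A", "DT"), ("man", "NN"), ("is", "VBZ"), ("riding", "VBG"),
    ("a", "DT"), ("horse", "NN"), ("on", "IN"), ("the", "DT"),
    ("beach", "NN"),
]
caption_tags = dummy_tags(caption)
```

Note that auxiliary verbs such as "is" pass this simple filter; whether such words should be dropped is a design choice the sketch leaves open.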

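Now the goal is to learn a joint embedding characterized by $\theta$ (i.e., $W^{(i)}$, $W^{(s)}$, $W^{(t)}$, and GRU weights) such that the image, sentence, and tags are projected into the joint space. Here, $W^{(t)}$ projects the representation of tags $\bar{t}$ on the joint space as $t = W^{(t)} \bar{t}$. The resulting loss function can be written as follows,

$$\mathcal{L} = \lambda_{1} \mathcal{L}_{IS} + \lambda_{2} \mathcal{L}_{IT} \tag{3}$$

where $\mathcal{L}_{IT}$ represents the image-tag ranking loss objective, which is similar to the image-sentence ranking loss objective $\mathcal{L}_{IS}$ in Sec. 3.2. Similar to the VSEPP loss in Eq. 2, $\mathcal{L}_{IT}$ can be written as,

$$\mathcal{L}_{IT} = \sum_{(i,t)} \Big\{ \max\big[0,\ \Delta - f(i,t) + f(i,\hat{t})\big] + \max\big[0,\ \Delta - f(t,i) + f(t,\hat{i})\big] \Big\} \tag{4}$$

where for a positive image-tag pair $(i,t)$, the hardest negative tag representation can be identified as $\hat{t} = \arg\max_{t^{-}} f(i, t^{-})$ (and the hardest negative image $\hat{i}$ analogously). Note that all tags associated with an image are considered for generating the tag representation in creating an image-tag pair, rather than considering a single tag related to that image. In Eq. 3, $\lambda_{1}$ and $\lambda_{2}$ are predefined weights for the different losses. In the first training stage, both losses are used ($\lambda_{1} > 0$ and $\lambda_{2} > 0$), while in the second stage, the image-text loss is not used ($\lambda_{1} = 0$ and $\lambda_{2} > 0$).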
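The way the two predefined weights switch the objectives on and off across the stages can be sketched as follows. The weight names `lambda_1` and `lambda_2`, the dictionary `STAGE_WEIGHTS`, and in particular the value 1.0 for an active loss are assumptions for illustration; the text only requires that both weights are non-zero in the first stage and that the image-text weight is zero in the second.

```python
from typing import Dict, Tuple


def combined_loss(loss_is: float, loss_it: float,
                  lambda_1: float, lambda_2: float) -> float:
    """Weighted sum of the image-sentence and image-tag ranking losses."""
    return lambda_1 * loss_is + lambda_2 * loss_it


# (lambda_1, lambda_2) per stage; 1.0 is an assumed value for "loss is used".
STAGE_WEIGHTS: Dict[str, Tuple[float, float]] = {
    "stage_1": (1.0, 1.0),  # clean data: both losses active
    "stage_2": (0.0, 1.0),  # web data: only the image-tag loss
}

stage_1_total = combined_loss(0.5, 0.25, *STAGE_WEIGHTS["stage_1"])
stage_2_total = combined_loss(0.5, 0.25, *STAGE_WEIGHTS["stage_2"])
```

In the second stage no sentences are available for web images, so the image-sentence term would not be computed at all; the zero weight expresses the same thing in the combined formula.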

Stage II: Model Adaptation with Web Data. After Stage I converges, we have a shared representation of image, sentence description and tags with a learned image-tag embedding model. In Stage II, we utilize weakly-annotated image-tag pairs from Flickr to update the previously learned embedding network using the image-tag ranking loss. This enables us to transfer knowledge from thousands of freely available weakly-annotated images in learning the embedding. We utilize a smaller learning rate in Stage II, as the network achieves competitive performance after Stage I and tuning the embedding network with a high learning rate on weakly-annotated data may lead to catastrophic forgetting (Kemker et al., 2017).

As web data is very prone to label noise, we found it is hard to learn good representation for our task in many cases. Hence, in Stage II, we adopt a curriculum learning-based strategy in training. Curriculum learning allows the model to learn from easier instances first so they can be used as building blocks to learn more complex ones, which leads to a better performance in the final task. It has been shown in many previous works that appropriate curriculum strategies guide the learner towards better local minima (Bengio et al., 2009). Our idea is to gradually inject difficult information to the learner such that in the early stages of training, the network is presented with images related to frequently occurring concepts/keywords in the clean training set. Images related to rarely occurring concepts are presented at a later stage. Since the network trained in Stage I is more likely to have learned well about frequently occurring concepts, label noise is less likely to affect the network.
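One simple way to realize such a curriculum is to order the web images by how often their associated keyword occurs in the clean captions. The sketch below is an assumption about the mechanics, which the text does not spell out: each web image is represented as an `(image_id, keyword)` pair, word frequency is counted by whitespace tokenization, and the function name `curriculum_order` is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def curriculum_order(web_items: Sequence[Tuple[str, str]],
                     clean_captions: Sequence[str]) -> List[Tuple[str, str]]:
    """Sort web images so that those tied to frequent concepts come first."""
    freq = Counter(
        word for caption in clean_captions for word in caption.lower().split()
    )
    # sorted() is stable, so items with equal frequency keep their order.
    return sorted(web_items, key=lambda item: -freq[item[1]])


clean = ["a dog runs", "a dog sleeps", "a cat sits"]
web = [("w1", "zebra"), ("w2", "cat"), ("w3", "dog"), ("w4", "cat")]
ordered_ids = [image_id for image_id, _ in curriculum_order(web, clean)]
```

Training batches would then be drawn from the front of this ordering in early epochs and from progressively later parts as training continues.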

4. Experiments

We perform experiments on two standard benchmark datasets with the main goal of analyzing the performance of different supervised methods by utilizing large scale web data using our curriculum guided webly supervised approach. Ideally, we would expect an improvement in performance irrespective of the loss function and features used to learn the embedding in Sec. 3.

We first describe the details of the datasets and evaluation metric in Sec. 4.1 and the training details in Sec. 4.2. We report the results of different methods on the MSCOCO dataset in Sec. 4.3 and results on the Flickr30K dataset in Sec. 4.4.

4.1. Datasets and Evaluation Metric

We present experiments on standard benchmark datasets for sentence-based image description: MSCOCO Dataset (Chen et al., 2015) and Flickr30K dataset (Plummer et al., 2015) to evaluate the performance of our proposed framework.

MSCOCO. MSCOCO is a large-scale sentence-based image description dataset. It is the largest image captioning dataset in terms of the number of sentences and the size of the vocabulary. This dataset contains around 123K images. Each image comes with 5 captions. We use the training, validation and testing split of (Karpathy and Fei-Fei, 2015). In this split, the set contains 82,783 training images, 5,000 validation images and 5,000 test images. About 30K images were left out in this split. Some previous works utilize these images for training to improve accuracy. We also report results using these images in training. In most of the previous works, the results are reported by averaging over 5 folds of 1K test images (Kiros et al., 2014; Wang et al., 2018b; Eisenschtat and Wolf, 2017).

Flickr30K. Flickr30K is another standard benchmark dataset for sentence-based image description. The Flickr30K dataset has 31,783 images and 158,915 English captions. Each image comes with 5 captions, annotated by AMT workers. In this work, we follow the dataset split provided in (Karpathy and Fei-Fei, 2015). In this split, the training set contains 29,000 images, the validation set contains 1,000 images and the test set contains 1,000 images.

Table 1. Image-to-Text Retrieval Results on MSCOCO Dataset.

Web Image Collection. We use the photo-sharing website Flickr to retrieve web images with tags and use those images without any additional manual labeling. To collect images, we create a list of the 1,000 most frequently occurring keywords in the MSCOCO and Flickr30K text descriptions and sort them in descending order of frequency. We remove stop-words and group similar words together after performing lemmatization. We then use this list of keywords to query Flickr and retrieve around 200 images per query, together with their tags. In this way, we collect about 210,000 images with tags. We only collect images having at least two English tags, and we do not collect more than 5 images from a single owner. We also use the first 5 tags of each image to detect and remove duplicate images.
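The three collection rules (at least two tags, at most 5 images per owner, duplicates detected through the first 5 tags) can be sketched as a single filtering pass. This is only one plausible reading: the record layout with `id`, `owner`, and `tags` keys, the function name `filter_web_images`, and treating two images as duplicates when their first 5 lowercased tags are identical are all assumptions, and the tags are assumed to be restricted to English beforehand.

```python
from collections import defaultdict
from typing import Dict, List


def filter_web_images(records: List[Dict],
                      min_tags: int = 2,
                      max_per_owner: int = 5,
                      dedup_tags: int = 5) -> List[Dict]:
    """Apply the tag-count, per-owner, and duplicate filters in one pass."""
    kept: List[Dict] = []
    per_owner = defaultdict(int)
    seen_signatures = set()
    for rec in records:
        tags = [t.lower() for t in rec["tags"]]
        if len(tags) < min_tags:
            continue  # too few tags
        if per_owner[rec["owner"]] >= max_per_owner:
            continue  # owner quota reached
        signature = tuple(tags[:dedup_tags])
        if signature in seen_signatures:
            continue  # assumed duplicate: same leading tags
        seen_signatures.add(signature)
        per_owner[rec["owner"]] += 1
        kept.append(rec)
    return kept


toy_records = [
    {"id": "a", "owner": "u1", "tags": ["dog", "beach"]},
    {"id": "b", "owner": "u1", "tags": ["dog"]},
    {"id": "c", "owner": "u2", "tags": ["Dog", "Beach"]},
    {"id": "d", "owner": "u2", "tags": ["cat", "sofa", "home"]},
]
kept_ids = [rec["id"] for rec in filter_web_images(toy_records)]
```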

Evaluation Metric. We use the standard evaluation criteria used in most prior work on the image-text retrieval task (Kiros et al., 2014; Faghri et al., 2017; Dong et al., 2016). We measure rank-based performance by Recall at K (R@K) and Median Rank (MedR). R@K calculates the percentage of test samples for which the correct result is ranked within the top-K retrieved results for the query sample. We project sentences, tags, and images into the embedded space and calculate similarity scores. We report results for R@1 and R@10. Median Rank calculates the median rank of the ground-truth matches in the ranking results.
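Both metrics follow directly from the rank of the ground-truth match for each query. A minimal sketch, assuming 1-based ranks and one rank per query (the helper names `rank_of_match`, `recall_at_k`, and `median_rank` are introduced here for illustration):

```python
from statistics import median
from typing import Sequence


def rank_of_match(scores: Sequence[float], match_index: int) -> int:
    """1-based rank of the matching item when sorted by descending score."""
    target = scores[match_index]
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth match is in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    """Median of the ground-truth ranks over all queries."""
    return median(ranks)


toy_ranks = [1, 3, 12, 1, 7]
r_at_1 = recall_at_k(toy_ranks, 1)
r_at_10 = recall_at_k(toy_ranks, 10)
med_r = median_rank(toy_ranks)
```

For the five toy queries, two matches are ranked first and four fall within the top 10, giving R@1 = 40, R@10 = 80, and a median rank of 3.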

4.2. Training Details

We start training with a learning rate of 0.0002 and keep the learning rate fixed for 10 epochs. We then lower the learning rate by a factor of 10 every 10 epochs. We train Stage I for the initial 20 epochs. We then update the model learned in Stage I with web images in Stage II for another 20 epochs. The embedding networks are trained using the ADAM optimizer (Kingma and Ba, 2014). Gradients are clipped when the norm of the gradients (for the entire layer) exceeds 2. We tried different values for the margin in training and empirically chose a margin of 0.2, which we found performed well consistently on the datasets. We evaluate the model on the validation set after every epoch. The best model is chosen based on the sum of recalls on the validation set to deal with the over-fitting issue. We use a batch size of 128 in the experiments. We also tried other mini-batch sizes of 32 and 64 but did not notice a significant impact on the performance. We used two Tesla K80 GPUs and implemented the network using the PyTorch toolkit.
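The step schedule and the gradient clipping rule can be written down compactly. The sketch below assumes 0-based epoch indices and clipping by rescaling to the maximum norm; it does not model how the schedule interacts with the smaller learning rate used in Stage II, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def learning_rate(epoch: int, base_lr: float = 2e-4,
                  step: int = 10, factor: float = 0.1) -> float:
    """Base rate for the first `step` epochs, then divided by 10 every `step`."""
    return base_lr * factor ** (epoch // step)


def clip_by_norm(grad: Sequence[float], max_norm: float = 2.0) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


lr_first_epochs = learning_rate(0)
lr_after_ten = learning_rate(10)
clipped = clip_by_norm([3.0, 4.0])  # norm 5.0 is rescaled to norm 2.0
```

In a PyTorch training loop the same behavior would normally come from a built-in step scheduler and gradient-norm clipping utility rather than hand-written helpers.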

4.3. Results on MSCOCO Dataset

We report the result of testing on MSCOCO dataset (Lin et al., 2014) in Table 1. To understand the effect of the proposed webly supervised approach, we divide the table in 3 rows (1.1-1.3). We compare our results with several representative image-text retrieval approaches, i.e., Embedding-Net (Wang et al., 2018a), 2Way-Net (Eisenschtat and Wolf, 2017), Sm-LSTM (Huang et al., 2017a), Order-Embedding (Vendrov et al., 2015), SAE (Gong et al., 2014b), VSE (Kiros et al., 2014) and VSEPP (Faghri et al., 2017). For these approaches, we directly cite scores from respective papers when available and select the score of the best performing method if scores for multiple models are reported.

In row-1.2, we report the results of applying two different variants of the pair-wise ranking loss based baseline, VSE and VSEPP, with two different feature representations from (Faghri et al., 2017). VSE (Kiros et al., 2014) is based on the basic triplet ranking loss similar to Eq. 1, and VSEPP (Faghri et al., 2017) is based on the loss function that emphasizes hard negatives as shown in Eq. 2. We consider the VSE and VSEPP loss based formulations as the main baselines for this work. Finally, in row-1.3, results using the proposed approach are reported. To enable a fair comparison, we apply our webly supervised method using the same VSE and VSEPP losses used by the methods in row-1.2.

Effect of Proposed Webly Supervised Training. For evaluating the impact of our approach, we compare results reported in row-1.2 and row-1.3. Our method utilizes the same loss functions and features used in row-1.2 for a fair comparison. From Table 1, we observe that the proposed approach improves performance consistently in all the cases. For the retrieval task, the average performance increase in text-to-image retrieval is 7.5% in R@1 and 3.2% in R@10.

We also compare the proposed approach with web supervised approach SAE(Gong et al., 2014b) (reported in row-1.1). In this regard, we implement SAE based webly supervised approach following (Gong et al., 2014b). We use the same feature and VSEPP ranking loss for a fair comparison and follow the exact same settings for experiments. We observe that our approach consistently performs better.

Table 2. Image-to-Text Retrieval Results on Flickr30K Dataset.
Figure 4. Examples of 4 test images from the Flickr30K dataset and the top-1 retrieved captions for our webly supervised VSEPP-ResNet152 and the standard VSEPP-ResNet, as shown in Table 2. The value in brackets is the rank of the highest ranked ground-truth caption in retrieval. Ground Truth (GT) is a sample from the ground-truth captions. Images 1, 2, and 4 show a few examples where utilizing our approach helps to match the correct caption, compared to using the typical approach.

Effect of Loss Function. When evaluating the performance of the different ranking losses, we observe that our webly supervised approach improves performance for both the VSE and VSEPP based formulations, and the rate of improvement is similar for both (see row-1.2 and row-1.3). Similar to previous works (Faghri et al., 2017; Zheng et al., 2017), we also find that methods using the VSEPP loss perform better than those using the VSE loss. We observe that in the image-to-text retrieval task the performance improvement is higher for the VSEPP based formulation, while in the text-to-image retrieval task the performance improvement is higher for the VSE based formulation.

Effect of Feature. For evaluating the impact of different image features in our webly supervised learning, we compare VGG19 feature based results with ResNet152 feature based results. We find consistent performance improvement with both VGG19 and ResNet152 features. However, the improvement is higher when ResNet152 features are used. In image-to-text retrieval, the average performance improvement in R@1 using ResNet152 features is 4%, compared to 2.3% using VGG19 features. In the text-to-image retrieval task, the average performance improvement in R@1 using ResNet152 features is 11.18%, compared to 3.5% using VGG19 features.

Our webly supervised learning approach is agnostic to the choice of loss function used for cross-modal feature fusion, and we believe more sophisticated ones will only benefit our approach. We use two different variants of the pair-wise ranking loss (VSE and VSEPP) in the evaluation and observe that our approach improves the performance in both cases, irrespective of the feature used to represent the images.

4.4. Results on Flickr30K Dataset

Table 2 summarizes the results on the Flickr30K dataset (Plummer et al., 2015). Similar to Table 1, we divide the table into 3 rows (2.1-2.3) to understand the effect of the proposed approach compared to other approaches. From Table 2, we have the following key observations: (1) Similar to the results on the MSCOCO dataset, our proposed approach consistently improves the performance of the different supervised methods (row-2.2 and row-2.3) in image-to-text retrieval by a margin of about 3%-6% in R@1 and 3%-9% in R@10. The maximum improvement of 6%-9% is observed in the VSEPP-VGG19 case, while the smallest mean improvement of 4.8% is observed in the VSE-VGG19 case. (2) In the text-to-image retrieval task, the average performance improvements using our webly supervised approach are 2.25% and 3.25% in R@1 and R@10, respectively. These improvements once again show that learning from large scale web data covering a wide variety of concepts leads to a robust embedding for cross-modal retrieval tasks. In Fig. 4, we show examples of a few test images from the Flickr30K dataset and the top-1 retrieved captions for the VSEPP-ResNet152 based formulations.

5. Conclusions

In this work, we showed how to leverage web images with tags to assist in training robust image-text embedding models for the target task of image-text retrieval, which has limited labeled data. We address this challenge by proposing a two-stage approach that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a better image-text embedding. Our approach has benefits in both performance and scalability. Extensive experiments demonstrate that our approach significantly improves performance in the image-text retrieval task on two benchmark datasets. Moving forward, we would like to improve our method by utilizing other types of metadata (e.g., social media groups, comments) while learning the multi-modal embedding. Furthermore, the objective of webly supervised learning may suffer when the number of noisy tags associated with web images is unexpectedly high compared to clean, relevant tags. In such cases, we plan to improve our method by designing loss functions or layers specific to noise reduction, providing a more principled way of learning the multi-modal embedding in the presence of significant noise.

Acknowledgement. This work was partially supported by NSF grants IIS-1746031 and CNS-1544969. We thank Sujoy Paul for helpful suggestions and Victor Hill for setting up the computing infrastructure used in this work.