Matching images and sentences demands a fine understanding of both modalities. In this paper, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image / text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss is hard for network learning, since it starts from the two heterogeneous features to build inter-modal relationship. To address this problem, we propose the instance loss which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image / text group can be viewed as a class. So the network can learn the fine granularity from every image/text group. The experiment shows that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply the off-the-shelf features, i.e., word2vec and fixed visual feature. So in a minor contribution, this paper constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
Dual-Path Convolutional Image-Text Embedding https://arxiv.org/abs/1711.05535
Image and text both contain very rich semantics but reside in heterogeneous modalities. Compared with information retrieval within a single modality, image-text matching poses an extra critical challenge, i.e., mapping images and text onto one shared feature space. For example, a model needs to distinguish between “black dog”, “gray dog” and “two dogs” in the text, and understand the visual differences in images depicting “black dog”, “gray dog” and “two dogs”. In this paper, given an unseen image (text) query, we aim to measure its semantic similarity with the text (image) instances in the database and retrieve the true matched texts (images) to the query. Considering the testing procedure, this task requires connecting the two modalities with robust representations. Early work used relatively small datasets, e.g., Wikipedia and Pascal Sentence, which contain around 3,000 and 5,000 image-text pairs, respectively. In recent years, several large-scale datasets with more than 30,000 images, including MSCOCO and Flickr30k, have been introduced. Each image in these datasets is annotated with around five sentences. These large datasets allow deep architectures to learn robust representations and provide challenging evaluation scenarios.
During the past few years, the ranking loss has commonly been used as the objective function [5, 6, 7, 8, 9, 10] for image-text representation learning. The ranking loss aims to make the distance between positive pairs smaller than that between negative pairs by a predefined margin. In image-text matching, every training pair contains a visual feature and a textual feature. The ranking loss focuses on the distance between the two modalities. Its potential drawback is that it does not explicitly consider the feature distribution in a single modality. For example, because the ranking loss does not distinguish between slight differences among images during training, the model may output similar descriptors for two testing images with slightly different semantics. This is clearly undesirable for image / text matching considering the extremely fine granularity of this task. In our experiment, we observe that using the ranking loss alone in end-to-end training may cause the network to be stuck in a local minimum.
What motivates us is the effectiveness of class labels in earlier years of cross-media retrieval [11, 12, 13, 14, 15]. In these works, the class labels are annotated manually and, during testing, the aim is to retrieve images / texts belonging to the same class as the query. In light of this early practice, this paper explores the feasibility of “class labels” in image / text matching, which is an instance retrieval problem. Two differences exist between cross-media retrieval on the category level [13, 15] and on the instance level (considered in this paper). First, the true matches are those with the same category in category-level retrieval, and those with exactly the same content as the query in instance-level retrieval. That is to say, instance-level retrieval has a stricter matching criterion than category-level retrieval. Second, instance-level retrieval does not assume the existence of class labels. In this field of research, only image / text pairs are utilized during training. Given the intrinsic differences between the two tasks, it is non-trivial to directly transfer the experience from using class labels in category-level retrieval to instance-level retrieval.
Without annotated class labels, how can we initiate the investigation of the underlying data structures in the image / text embedding space? In this paper, we name an image and its associated sentences an “image / text group”. Our key assumption is that each “image / text” group is different from the others, and can be viewed as a distinct class (see Fig. 1). So we propose a classification loss called instance loss to classify the image / text groups. Using these unsupervised class labels as supervision, we aim to enforce the model to discriminate between every two images and every two sentences (from different groups). It helps to investigate the fine-grained differences within a single modality (intra-modal) and provides a good initialization for the ranking loss, which is a driving force for end-to-end retrieval representation learning. In more detail, using such an unsupervised assumption, we train the network to classify every image / text group with the softmax loss. In the experiment, we show that the instance loss, which classifies a large number of classes, i.e., 113,287 image / text groups on MSCOCO, is able to converge without any hyper-parameter tuning. Improved retrieval accuracy can be observed as a result of instance loss.
In addition, we notice in the field of image-text matching that most recent works employ off-the-shelf deep models for image feature extraction [16, 17, 18, 19, 8, 9, 20, 21, 22, 10]. The fine-tuning strategy commonly seen in other computer vision tasks [23, 24, 25] is rarely adopted. A drawback of using off-the-shelf models is that these models are usually trained to classify objects into semantic categories [26, 27, 28]. The classification models are likely to miss image details such as color, number, and environment, which may convey critical visual cues for matching images and texts. For example, a model trained on ImageNet can correctly classify the three images as “dog”; but it may not tell the difference between a black dog and a gray dog, or between one dog and two dogs. The ability to convey critical visual cues is a necessary component in instance-level image-text matching. Similar observations have been reported with regard to image captioning. Moreover, for the text feature, word2vec is a popular choice in image-text matching [17, 32, 8, 21]. Aiming to model the context information, the word2vec model is learned through a shallow network to predict neighboring words. However, the word2vec model is initially trained on GoogleNews, which differs substantially from the text in the target dataset. As such, instead of using the off-the-shelf model, we explore the possibility of fine-tuning the word2vec model using image-text matching datasets.
Briefly, inspired by the effectiveness of class labels in early cross-media retrieval, we propose a similar practice in image-text matching called “instance loss”. Instance loss works by providing better weight initialization for the ranking loss, thus producing more discriminative and robust image / text descriptions. Next, we also note that the pretrained CNN models may not meet the fine-grained requirement in image / text matching. So we construct a dual-path CNN to extract image and text features directly from data. The network is end-to-end trainable and yields superior results to using features extracted from off-the-shelf models as input. Our contributions are summarized as follows:
To provide better weight initialization and regularize the dual-path CNN model, we propose a large-number classification loss called instance loss. The robustness and effectiveness of instance loss are demonstrated by classifying each image / text group into one of the 113,287 classes on MSCOCO .
We propose a dual-path CNN model for visual-textual embedding learning (see Fig. 2). In contrast to the commonly used RNN+CNN model using fixed CNN features, the proposed CNN+CNN structure conducts efficient and effective end-to-end fine-tuning.
We note that Ma et al. also apply the CNN structure for text feature learning. The main difference between our method and theirs is two-fold. First, Ma et al. use the ranking loss alone. In our method, we show that the proposed instance loss can further improve the result of ranking loss. Second, Ma et al. use four text CNN models to capture different semantic levels, i.e., word, short phrase, long phrase and sentence. In this paper, only one text CNN model is used and the word-level input is considered. Our model uses the residual block shown in Fig. 3, which combines low-level information, i.e., words, as well as high-level inference to produce the final feature. In the experiments (Table I and Table VIII), we show that, using the same image CNN (VGG-19), our method (with one text CNN) is superior to that of Ma et al. with text model ensembles by a large margin.
The rest of this paper is organized as follows. Section II reviews and discusses the related works. Section III describes the proposed Image-Text CNN Structure in detail, followed by the objective function in Section IV. The training policy is described in Section V. Experimental results and comparisons are discussed in Section VI and conclusions are in Section VII. Furthermore, some qualitative results are included in the Appendix.
The image-text bidirectional retrieval requires both understanding images and sentences in detail. In this section, we discuss some related works.
Deep models for image recognition. Deep models have achieved success in computer vision. The convolutional neural network (CNN) won the ILSVRC12 competition by a large margin. Later, VGGNet and ResNet further deepened the CNN and provided more insights into the network structure. In the field of image-text matching, most recent methods directly use fixed CNN features [16, 17, 18, 19, 8, 9, 20, 21, 22, 10] as input, which are extracted from models pre-trained on ImageNet. While it is efficient to fix the CNN features and learn a visual-textual common space, doing so may lose the fine-grained differences between the images. This motivates us to fine-tune the image CNN branch in image-text matching to provide for more discriminative embedding learning.
Deep models for natural language understanding. For natural language representation, word2vec is commonly used [17, 32, 8, 21, 34]. This model contains two hidden layers, which learn from the context information. In the application of image-text matching, Klein et al. and Wang et al. pool word vectors extracted from the fixed word2vec model to form a sentence descriptor using Fisher vector encoding. Karpathy et al. also utilize fixed word vectors as word-level input. With respect to this routine, this paper proposes an equivalent scheme to fine-tuning the word2vec model, allowing the learned text representations to be adaptable to a specific task, which is, in our case, image-text matching.
Recurrent Neural Networks (RNN) are another common choice in natural language processing [35, 36]. Mao et al. employ an RNN to generate image captions. Similarly, Nam et al. utilize directional LSTM for text encoding, yielding state-of-the-art multi-modal retrieval accuracy. Conversely, our approach is inspired by recent CNN breakthroughs in natural language understanding. For example, Gehring et al. apply CNNs to conduct machine translation, yielding competitive results and more than 9.3x speedup on the GPU. There are also researchers who apply layer-by-layer CNNs for efficient text analysis [39, 40, 41, 42], obtaining competitive results in title recognition, event detection and text content matching. In this paper, in place of RNNs, which are more commonly seen in image-text matching, we explore the usage of CNNs for text representation learning.
Multi-modal learning. There is a growing body of work on the interaction between multiple modalities. Some works focus on efficient cross-modal searching by binary coding and hashing [43, 14, 44]. Others pay more attention to effective retrieval by understanding the semantic meaning, which is close to this work. As for content-based retrieval, one line of methods focuses on category-level retrieval and leverages the category labels in the training set. Sharma et al. extend Canonical Correlation Analysis (CCA) to learning class labels, and Wang et al. learn the shared image-text space based on coupled input with class regression. Wu et al. propose a bi-directional learning to rank for representation learning. Wei et al. perform CNN fine-tuning by classifying categories on the training set and report an improved performance on image-text retrieval. Castrejon et al. deploy the multiple labels to learn the shared semantic space. The second line of methods considers instance-level retrieval and, except for matched image-text pairs, does not provide any category label. Given a query, the retrieval objective is a specific image or related sentences. Some works apply the auto-encoder to project high-dimensional features from different modalities onto a common low-dimensional latent space [48, 49]. Some works deploy pair-wise constraints. He et al. use the assumption that the text and image components in a web document form a pairwise constraint. Zhang et al. consider the verification loss, using a binary classifier to classify the true matches and false matches. Other works widely apply the ranking loss for instance-level retrieval [5, 6, 7, 8, 9, 10]. Karpathy et al. propose a part-to-part matching approach using a global ranking objective. The “SPE” method extends the ranking loss with structure-preserving constraints. SPE is similar to our work in that both works consider the intra-modal distance.
Nevertheless, our work differs significantly from SPE. SPE enforces the model to rank the texts, i.e., considering the feature separability within the text modality only. In comparison, with the proposed instance loss, our method jointly discriminates the two modalities, i.e., images and their associated texts.
Briefly, we focus on instance-level retrieval and propose the instance loss, a novel contribution to the cross-modality community. It views each training image / text group as a distinct class and uses the softmax loss for model training. The assumption is unsupervised. We show that this method converges well and yields consistent improvement.
In this paper, we propose a dual path CNN to simultaneously learn visual and textual representations in an end-to-end fashion, consisting of a deep image CNN for image input and one deep text CNN for sentence input. The entire network only contains four components, i.e., convolution, pooling, ReLU and batch normalisation. Compared to many previous methods which use off-the-shelf image CNNs [16, 17, 18, 19, 8, 9, 20, 21, 22, 10], end-to-end fine-tuning is superior in learning representations that encode image details (see Fig. 2).
We use ResNet-50 pre-trained on ImageNet as a basic model (the final 1000-classification layer is removed) before conducting fine-tuning for visual feature learning. Given an input image $I$, a forward pass of the network produces a 2048-dimension feature vector. Following this feature, we add one fully-connected layer (input dim: 2048), one batch normalization, ReLU and one fully-connected layer (output dim: 2048). We denote the final 2048-dim vector $f_{img}$ as the visual descriptor of the input $I$. The forward pass process of the CNN, which is a non-linear function, is represented by the function $F_{img}(\cdot)$ defined as:

$$ f_{img} = F_{img}(I). \tag{1} $$
Text processing. Next, we describe our text processing method and the text CNN structure. Given a sentence, we first convert it into a code $T$ of size $n \times d$, where $n$ is the length of the sentence and $d$ denotes the size of the dictionary. $T$ is used as the input for the text CNN. We use the word2vec vocabulary as a general dictionary to filter out rare words; if a word does not appear in the dictionary (3,000,000 words), it is discarded. For Flickr30k, we eventually use $d$ words as the dictionary. Every word in Flickr30k can thus find an index $l \in [1, d]$ in the dictionary; for instance, a sentence of 18 words can be converted to an $18 \times d$ matrix. The text input can thus be formulated as:

$$ T(i, j) = \begin{cases} 1, & j = l_i, \\ 0, & \text{otherwise,} \end{cases} \tag{2} $$

where $i \in [1, n]$, $j \in [1, d]$, and $l_i$ is the dictionary index of the $i$-th word. The text CNN needs a fixed-length input. We set a fixed length in this paper, chosen because about 98% of the sentences contain fewer words than this length. If the sentence is shorter than the fixed length, we pad $T$ with zeros. If the sentence is longer than the fixed length, we clip the final several words. Now we obtain the sentence code $T$. We further reshape $T$ into a three-dimensional format, which can be considered as the height, width and channel known in image CNNs [28, 26].
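The encoding step above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `encode_sentence`, the toy four-word dictionary and the fixed length of 6 are hypothetical and are not taken from the paper.

```python
def encode_sentence(words, vocab, fixed_len):
    """Return a fixed_len x len(vocab) one-hot code as a list of rows."""
    # Words missing from the dictionary are discarded, as described in the text.
    ids = [vocab[w] for w in words if w in vocab]
    # Sentences longer than the fixed length lose their final words.
    ids = ids[:fixed_len]
    code = [[0] * len(vocab) for _ in range(fixed_len)]
    for pos, idx in enumerate(ids):
        code[pos][idx] = 1
    # Rows after the last word stay all-zero, which is the zero padding.
    return code


# Hypothetical toy dictionary; "quickly" is absent and is therefore dropped.
vocab = {"a": 0, "black": 1, "dog": 2, "runs": 3}
code = encode_sentence("a black dog quickly runs".split(), vocab, fixed_len=6)
```

Each row holds at most one non-zero entry, so the code is very sparse; a real implementation would likely store word indices and expand them lazily.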
Position shift. We are motivated by the jittering operation in image CNN training. For the text CNN, we apply a data augmentation policy called position shift. In a baseline approach, if the sentence is shorter than the standard input length, a straightforward idea is to pad zeros at the end of the sentence, called left alignment. In the proposed position shift approach, we pad a random number of zeros at the beginning and the end of a sentence. In this manner, shift variations are contained in the text representation, so that the learned embeddings are more robust. In the experiment, we observe that position shift is important to the performance.
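A minimal sketch of position shift on a sequence of word indices follows. The function name `position_shift`, the use of `None` as a stand-in for an all-zero padding position, and the uniform choice of the left offset are assumptions made for illustration; the text only states that a random number of zeros is padded on both sides.

```python
import random


def position_shift(word_ids, fixed_len, pad_id=None, rng=random):
    """Place the sentence at a random offset inside a fixed-length sequence."""
    ids = list(word_ids)[:fixed_len]      # clip over-long sentences
    free = fixed_len - len(ids)           # number of padding slots available
    left = rng.randint(0, free)           # zeros placed before the sentence
    right = free - left                   # remaining zeros go after it
    return [pad_id] * left + ids + [pad_id] * right


# Left alignment (the baseline) corresponds to always choosing left == 0.
shifted = position_shift([5, 6, 7], fixed_len=8, rng=random.Random(0))
```

Because the offset is resampled on every call, the same sentence appears at different positions across epochs, which is the source of the shift variation mentioned above.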
Deep text CNN. In the text CNN, the filter size of the first convolution layer is $1 \times 1 \times d \times 300$, which can be viewed as a lookup table. Using the first convolutional layer, a sentence is converted to word vectors as follows. Given the fixed-length sentence code as input, the first convolution layer results in a tensor holding one 300-dim vector per word position. There are two methods to initialize the first convolutional layer: 1) random initialization, and 2) using the $d \times 300$ matrix from word2vec for initialization. In the experiment, we observe that word2vec initialization is superior to random initialization.
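The lookup-table view can be checked directly: applying a per-position linear map to a one-hot row simply selects one row of the weight matrix. The sketch below uses a hypothetical 3-word dictionary and 2-dim word vectors purely for readability; `first_layer` is an illustrative name, not the paper's code.

```python
def first_layer(one_hot_rows, weight):
    """Apply the same linear map at every word position (one_hot_rows @ weight)."""
    dim = len(weight[0])
    out = []
    for row in one_hot_rows:
        # Only the entry where row == 1 contributes, so this picks a weight row.
        vec = [sum(r * w[k] for r, w in zip(row, weight)) for k in range(dim)]
        out.append(vec)
    return out


# Hypothetical weights: one 2-dim vector per dictionary word.
weight = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Word 1, word 2, then one zero-padded position.
sentence = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
embedded = first_layer(sentence, weight)
```

Padded positions map to the zero vector, and initializing `weight` from pre-trained word vectors makes the layer start out as a standard embedding table that can then be fine-tuned.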
For the rest of the text CNN, residual blocks similar to those of the image CNN are used (see Fig. 3). Similar to the local pattern in the image CNN, every two neighboring components may form a phrase containing content information. We set the filter size of the convolution layers in the basic text block to $1 \times 2$. Additionally, we add the shortcut connection in the basic block, which has been demonstrated to help training deep neural networks. We apply basic blocks with a shortcut connection to form the deep textual network (see Fig. 2). The number of blocks is consistent with the ResNet-50 model in the visual branch. Given a sentence matrix $T$, its text descriptor $f_{text}$ can be extracted in an end-to-end manner from the text CNN $F_{text}(\cdot)$:

$$ f_{text} = F_{text}(T). \tag{3} $$
In this paper, two types of losses are used, i.e., the standard ranking loss and the proposed instance loss. In Section IV-A, we briefly review the formulation of the ranking loss and discuss the limitation of the ranking loss. Section IV-B describes the motivation and the formulation of the instance loss followed by a discussion. The differences between instance loss and ranking loss are discussed, and some primary experiments show the feasibility of instance loss. In Section IV-C, training convergence of the instance loss is discussed.
Ranking loss is a widely used objective function for retrieval problems. We use the cosine distance $D(f_{x_i}, f_{x_j}) = \frac{f_{x_i}}{\|f_{x_i}\|_2} \cdot \frac{f_{x_j}}{\|f_{x_j}\|_2}$ to measure the similarity between two samples, where $f$ is the feature of a sample, and $\|\cdot\|_2$ denotes the L2-norm. The distance value $D \in [-1, 1]$.
To effectively account for two modalities, we follow the ranking loss formulation as in some previous works [6, 9]. Here, $I$ denotes the visual input, and $T$ denotes the text input. Given a quadruplet input $(I_a, T_a, I_n, T_n)$, where $I_a$ and $T_a$ describe the same image / text group and $I_n$, $T_n$ are negative samples, the ranking loss can be written as

$$ L_{rank} = \max[0, \alpha - (D(f_{I_a}, f_{T_a}) - D(f_{I_a}, f_{T_n}))] + \max[0, \alpha - (D(f_{T_a}, f_{I_a}) - D(f_{T_a}, f_{I_n}))], \tag{4} $$

where $D(\cdot, \cdot)$ is the cosine similarity, and $\alpha$ is a margin. Given an image query $I_a$, the similarity score of the correct text match should be higher. Similarly, if we use a sentence query $T_a$, we expect the correct image content to be ranked higher. Ranking loss explicitly builds the relationship between the image and text.
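The bidirectional ranking loss described above can be sketched as follows, assuming one matched image / text pair plus one negative image and one negative text. The function names and the margin value of 0.5 used in the example are illustrative assumptions; the text does not state the margin.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(img_a, txt_a, img_n, txt_n, margin):
    # Image query: the matched text should beat the negative text by `margin`.
    img_term = max(0.0, margin - (cosine(img_a, txt_a) - cosine(img_a, txt_n)))
    # Text query: the matched image should beat the negative image by `margin`.
    txt_term = max(0.0, margin - (cosine(txt_a, img_a) - cosine(txt_a, img_n)))
    return img_term + txt_term


# Well-separated negatives give zero loss; identical negatives are penalized.
easy = ranking_loss([1, 0], [1, 0], [0, 1], [0, 1], margin=0.5)
hard = ranking_loss([1, 0], [1, 0], [1, 0], [1, 0], margin=0.5)
```

Note that both terms compare features across modalities only, which is why this loss places no direct constraint on how two images (or two sentences) relate to each other.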
Limitations of ranking loss. Although widely used, ranking loss has a potential drawback for the application of image-text matching. According to Eq. 4, every pair contains a visual feature and a textual feature. The ranking loss focuses on the distance between the two modalities. So the potential drawback is that the ranking loss does not explicitly consider the feature distribution in a single modality. For instance, given two testing images with slightly different semantics, the model may output similar features. It is clearly undesirable for the extremely fine granularity of this task. In the experiment, using ranking loss alone is prone to get stuck in a local minimum (as to be shown in Fig. 5 and Table II).
Motivation. Some early works use coarse-grained categories, e.g., art, biology, and sport, as the training supervision [11, 12, 15]. The multi-class classification loss has shown good performance. But for instance-level retrieval, the classification loss has not been used. There may be two reasons. First, category-level annotations are missing for most large-scale datasets. Second, if we use the category to train the model, it forces different instances, e.g., a black dog and white dogs, into the same class. This may hinder the CNN from learning the fine-grained differences.
In this paper, we propose the instance loss for instance-level image-text matching. We define an image and its related text descriptions as an image / text group. In specific applications such as language-based person retrieval [53, 33], an image / text group is defined as images and their descriptions which depict the same person (see Fig. 7). Based on image / text groups, our assumption is that each image / text group is distinct (duplicates have been removed from the datasets). Under this assumption, we view each image / text group as a class. So in essence, instance loss is a softmax loss which classifies an image / text group into one of a large number of classes. We want the trained model to be able to tell the difference between every two images as well as every two sentences (from different groups). Formally, we define instance loss below.
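Formulation. For two modalities, we formulate two classification objectives as follows,

$$ P_{visual} = \mathrm{softmax}(W_{share}^{T} f_{img}), \qquad L_{visual} = -\log(P_{visual}(c)), $$

$$ P_{textual} = \mathrm{softmax}(W_{share}^{T} f_{text}), \qquad L_{textual} = -\log(P_{textual}(c)), $$

where $f_{img}$ and $f_{text}$ are image and text features defined in Eq. 1 and Eq. 3, respectively. $W_{share}$ is the parameter of the final fully connected layer (Fig. 2). It can be viewed as concatenated weights $W_{share} = [W_1, W_2, \ldots, W_m]$, one per class. Every weight $W_k$ is a 2048-dim vector. $L$ denotes the loss and $P$ denotes the probability over all classes. $P(c)$ is the predicted probability of the right class $c$. Here we enforce shared weights in the final fully connected layer for the two modalities, because otherwise the learned image and text features may exist in totally different subspaces.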
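The instance loss can be sketched as an ordinary softmax cross-entropy in which the image branch and the text branch share one classifier weight matrix. The 2-dim features and two-group weight matrix below are hypothetical toy values chosen for readability; the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def instance_loss(feature, shared_weight, class_id):
    """Negative log-probability of the correct image / text group."""
    # One logit per group: dot product of the feature with that group's weight.
    logits = [sum(w * f for w, f in zip(row, feature)) for row in shared_weight]
    return -math.log(softmax(logits)[class_id])


# The same weight matrix scores both the image feature and the text feature.
shared_weight = [[2.0, 0.0], [0.0, 2.0]]          # one weight vector per group
loss_img = instance_loss([1.0, 0.1], shared_weight, class_id=0)
loss_txt = instance_loss([0.9, 0.2], shared_weight, class_id=0)
```

Because both modalities are scored against the same weight vectors, lowering both losses for a group pulls its image and text features toward the same region of the embedding space.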
As described in Section V, in the first training stage, the ranking loss is not used and we only use the instance loss; in the second training stage, both losses are used. The final loss function is a combination of the ranking loss and the instance loss, defined as,

$$ L = \lambda_1 L_{rank} + \lambda_2 L_{visual} + \lambda_3 L_{textual}, $$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are predefined weights for different losses.
Discussion. First, we show that instance loss provides better weight initialization than the ImageNet pretrained model. To support this, we compare the image features from the off-the-shelf model pre-trained on ImageNet and the model trained with instance loss. Since the proposed instance loss explicitly considers the intra-modal distance, we observe that the feature correlation between two images is smaller after training with the instance loss (see Fig. 4(b)). In fact, the instance loss encourages the model to find fine-grained image details such as ball, stick, and frisbee to discriminate between image / text groups with similar semantics. We visualize the dog retrieval results in Fig. 10. Our model generalizes well to the test set and remains sensitive to the subtle differences.
Second, we provide an example of two classes to describe the working mechanism of instance loss (Fig. 6). Given an image $x_1$ which belongs to the first class, the softmax loss function imposes the constraint $W_1^{T} f_{x_1} > W_2^{T} f_{x_1}$. Similarly, if $y_1$ is an input sentence belonging to the first class, the softmax loss will lead to the constraint $W_1^{T} f_{y_1} > W_2^{T} f_{y_1}$. The decision boundary indicates equal probability of being classified into the first class and the second class. Since the image and text embedding networks share the same final weight $W$, the features of the same image / text group will be close to each other in the embedding space; the data points from different image / text groups will be pushed away from each other. Therefore, after training with the instance loss, the data points will usually be located on either side of the decision boundary. In this manner, the image / text groups can be separated in the feature space despite the fine-grained differences among them. This property, as shown in Fig. 6 (right), will provide better weight initialization for the subsequent training with both the ranking loss and instance loss.
Third, we demonstrate that using the instance loss alone can lead to a decent initialization. To validate this point, we plot the distribution of the intra-modal intra-class similarity and the distribution of the intra-modal inter-class similarity on the Flickr30k validation set (Fig. 5(b)). We observe that using the instance loss alone leads, in most cases, to the intra-class similarity exceeding the inter-class similarity by a margin, and the mean of the intra-class similarity is correspondingly higher than the mean of the inter-class similarity.
Fourth, using the ranking loss alone achieves a relatively large margin between the positive pairs and negative pairs, but there also exist many “hard” negative pairs (Fig. 5(a)). These “hard” negative pairs usually have a high similarity, which compromises the matching performance of the true matches. Due to the potential drawback of the ranking loss mentioned in Section IV-A, images / texts with slight differences may have similar features, which results in the “hard” negative samples. To quantitatively compare the three models, we propose a simple indicator function,

$$ S = \int_{-1}^{1} \min(P(\theta), Q(\theta)) \, d\theta, $$

which encodes the overlapping area of $P$ and $Q$, the similarity distributions of the positive pairs and the negative pairs, over the range of cosine similarity $[-1, 1]$. Indicator $S \in [0, 1]$. The smaller $S$ is, the better the positive pairs and negative pairs are separated, and thus the better the retrieval performance. $S = 1$ indicates the case where the two distributions, $P$ and $Q$, are completely overlapping. In this worst case, the positive pairs cannot be distinguished from the negative ones, and the retrieval performance is random. At the other extreme, $S = 0$ indicates that the positive pairs and negative pairs are perfectly separable: all the similarity scores of the positive pairs are larger than the similarity scores of the negative pairs. In this best case, the retrieval precision and recall are both 100%. Therefore, a lower indicator score $S$ indicates a better retrieval system.
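The overlap indicator can be approximated numerically from two lists of cosine similarities by histogramming both over $[-1, 1]$ and summing the bin-wise minimum. This is a sketch under assumptions: the bin count of 20, the function name and the toy similarity values are hypothetical, and a histogram is only one possible density estimate.

```python
def overlap_indicator(pos_sims, neg_sims, bins=20):
    """Approximate the overlapping area of two similarity histograms on [-1, 1]."""
    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Map a similarity in [-1, 1] to a bin index; 1.0 goes in the last bin.
            idx = min(int((v + 1.0) / 2.0 * bins), bins - 1)
            counts[idx] += 1
        # Normalize so that the bins sum to 1 (probability mass per bin).
        return [c / len(values) for c in counts]

    p, q = histogram(pos_sims), histogram(neg_sims)
    return sum(min(a, b) for a, b in zip(p, q))


# Perfectly separated pairs give 0; identical distributions give about 1.
separated = overlap_indicator([0.8, 0.9], [-0.5, 0.0])
identical = overlap_indicator([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
```

The result depends on the bin width, so scores are only comparable across models when the same binning is used for all of them.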
In our experiment (Fig. 5), the indicator scores of the three models clearly demonstrate that, in terms of the extent of feature separability, “Full Model” > “Using Instance Loss Alone” > “Using Ranking Loss Alone”. With the indicator function, we quantitatively show that using the ranking loss alone produces more hard negative pairs than the other two methods, which compromises the matching performance of the ranking loss. In comparison, using the instance loss alone produces a smaller indicator value, suggesting better feature separability of the trained model. Importantly, when the two losses, i.e., ranking loss and instance loss, are combined, our full model has the smallest indicator value, indicating the fewest hard negative samples and the best retrieval accuracy among the three methods.
For the retrieval performance, using the instance loss alone can lead to a competitive accuracy in the experiment (Table II). The effect of the instance loss is two-fold. In the first training stage, when used alone, it pre-trains the text CNN and fine-tunes the two fully-connected layers (and one batchnorm layer) of image CNN so that ranking loss can arrive at a better optimization for both modalities in the second stage (Fig. 6). In the second training stage, when used together with ranking loss, it exhibits a regularization effect on the ranking loss.
The instance loss views every image / text group as a class, so the number of training classes is usually large. For instance, we have 29,783 classes when training on Flickr30k. In Fig. 8, we show the training error curves of the image CNN and text CNN during training. We observe that the image CNN converges faster (Fig. 8(a)) because the image CNN is pretrained on ImageNet. The text CNN converges more slowly because most of it is trained from scratch, but it still begins to learn after 20 epochs, and finally converges after 240 epochs.
On the other hand, the convergence property is supported by some previous works. To our knowledge, some practices also suffer from limited data per class, because manually annotating data is usually expensive. For example, in person re-ID, the CUHK03 dataset has 9.6 training samples per class; the VIPeR dataset has 2 training samples per class. The previous works [56, 57] on CUHK03 and VIPeR show that the CNN classification model can be well trained as long as each class has more than a couple of training samples. In our case, there are usually 6 positive training samples per class (1 image and 5 sentences). In the experiment, despite the limited training data, the learned model has a good generalization ability on the validation set and test set, which accords with existing experience [56, 57].
We describe the training policy in this section. We split the training procedure into two stages. In the experiment, we show this policy helps the training.
Stage I: In this stage, we fix the pre-trained weights in the image CNN and use the proposed instance loss to tune the remaining part. The main reason is that most weights of the text CNN are learned from scratch. If we train the image and text CNNs simultaneously, the text CNN may compromise the pre-trained image CNN. We only use the proposed instance loss in this stage, i.e., the weight of the ranking loss is set to zero. It can provide a good initialization for the ranking loss. We note that even after Stage I, our network can achieve competitive results compared to previous works using off-the-shelf CNNs.
Stage II: After Stage I converges, we start Stage II for end-to-end fine-tuning of the entire network. Note that the weights of the image CNN are also fine-tuned. In this stage, we combine the instance loss with the ranking loss, i.e., both losses receive non-zero weights, so that both classification and ranking errors are considered. In Section VI-D, we study the mechanism of the two losses. It can be observed that in Stage II, instance loss and ranking loss are complementary, thus further improving the retrieval result. Instance loss still regularizes the model and pays more attention to discriminating the images and sentences. After Stage II (end-to-end fine-tuning), another round of performance improvement can be observed, and we achieve even more competitive performance.
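The two-stage policy can be summarized as a small schedule over loss weights and trainable parts. In the sketch below, the non-zero weights are set to 1.0 as an assumption (the text only states that the ranking loss is unused in Stage I and that both losses are used in Stage II), and the part names such as `image_cnn_backbone` are hypothetical labels rather than identifiers from the released code.

```python
# Assumed loss weights per stage; only "rank is off in Stage I" comes from the text.
STAGE_WEIGHTS = {
    1: {"rank": 0.0, "visual": 1.0, "textual": 1.0},
    2: {"rank": 1.0, "visual": 1.0, "textual": 1.0},
}


def combined_loss(stage, rank_loss, visual_loss, textual_loss):
    """Weighted sum of the ranking loss and the two instance-loss terms."""
    w = STAGE_WEIGHTS[stage]
    return (w["rank"] * rank_loss
            + w["visual"] * visual_loss
            + w["textual"] * textual_loss)


def trainable_parts(stage):
    """Which parts receive gradient updates in each stage."""
    # Stage I keeps the pre-trained image backbone fixed.
    parts = ["text_cnn", "image_fc_layers"]
    if stage == 2:
        # Stage II fine-tunes the whole network end to end.
        parts.append("image_cnn_backbone")
    return parts


stage1 = combined_loss(1, rank_loss=5.0, visual_loss=1.0, textual_loss=2.0)
stage2 = combined_loss(2, rank_loss=5.0, visual_loss=1.0, textual_loss=2.0)
```

Keeping the schedule in one table makes it easy to see that Stage II differs from Stage I in exactly two ways: the ranking term is switched on and the image backbone is unfrozen.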
|Method||Visual||Textual||Image Query R@1||R@5||R@10||Med r||Text Query R@1||R@5||R@10||Med r|
|DeVise ||ft AlexNet||ft skip-gram||4.5||18.1||29.2||26||6.7||21.9||32.7||25|
|Deep Fragment ||ft RCNN||fixed word vector from ||16.4||40.2||54.7||8||10.3||31.4||44.5||13|
|DCCA ||ft AlexNet||TF-IDF||16.7||39.3||52.9||8||12.6||31.0||43.0||15|
|DVSA ||ft RCNN (init. on Detection)||w2v + ft RNN||22.2||48.2||61.4||4.8||15.2||37.7||50.5||9.2|
|LRCN ||ft VGG-16||ft RNN||23.6||46.6||58.3||7||17.5||40.3||50.8||9|
|m-CNN ||ft VGG-19||4 ft CNN||33.6||64.1||74.9||3||26.2||56.3||69.6||4|
|VQA-A ||fixed VGG-19||ft RNN||33.9||62.5||74.5||-||24.9||52.6||64.8||-|
|GMM-FV ||fixed VGG-16||w2v + GMM + HGLMM||35.0||62.0||73.8||3||25.0||52.7||66.0||5|
|m-RNN ||fixed VGG-16||ft RNN||35.4||63.8||73.7||3||22.8||50.7||63.1||5|
|RNN-FV ||fixed VGG-19||feature from ||35.6||62.5||74.2||3||27.4||55.9||70.0||4|
|HM-LSTM ||fixed RCNN from ||w2v + ft RNN||38.1||-||76.5||3||27.7||-||68.8||4|
|SPE ||fixed VGG-19||w2v + HGLMM||40.3||68.9||79.9||-||29.7||60.1||72.1||-|
|sm-LSTM ||fixed VGG-19||ft RNN||42.5||71.9||81.5||2||30.2||60.4||72.3||3|
|RRF-Net ||fixed ResNet-152||w2v + HGLMM||47.6||77.4||87.1||-||35.4||68.3||79.9||-|
|2WayNet ||fixed VGG-16||feature from ||49.8||67.5||-||-||36.0||55.6||-||-|
|DAN (VGG-19) ||fixed VGG-19||ft RNN||41.4||73.5||82.5||2||31.8||61.7||72.5||3|
|DAN (ResNet-152) ||fixed ResNet-152||ft RNN||55.0||81.8||89.0||1||39.4||69.2||79.1||2|
|Ours (VGG-19) Stage I||fixed VGG-19||ft ResNet-50 (w2v init.)||37.5||66.0||75.6||3||27.2||55.4||67.6||4|
|Ours (VGG-19) Stage II||ft VGG-19||ft ResNet-50 (w2v init.)||47.6||77.3||87.1||2||35.3||66.6||78.2||3|
|Ours (ResNet-50) Stage I||fixed ResNet-50||ft ResNet-50 (w2v init.)||41.2||69.7||78.9||2||28.6||56.2||67.8||4|
|Ours (ResNet-50) Stage II||ft ResNet-50||ft ResNet-50 (w2v init.)||53.9||80.9||89.9||1||39.2||69.8||80.8||2|
|Ours (ResNet-152) Stage I||fixed ResNet-152||ft ResNet-152 (w2v init.)||44.2||70.2||79.7||2||30.7||59.2||70.8||4|
|Ours (ResNet-152) Stage II||ft ResNet-152||ft ResNet-152 (w2v init.)||55.6||81.9||89.5||1||39.1||69.2||80.9||2|
We first introduce the three large-scale image-text retrieval datasets, i.e., Flickr30k, MSCOCO and CUHK-PEDES, followed by the evaluation metric in Section VI-A. Then Section VI-B describes the implementation details and the reproducibility. We discuss the comparison with state of the art and mechanism study in Section VI-C and Section VI-D.
Flickr30k  is one of the large-scale image captioning datasets. It contains 31,783 images collected from Flickr, in which every image is annotated with five text descriptions. The average sentence length is words after removing rare words. We follow the protocol in [62, 6] to split the dataset into 1,000 test images, 1,000 validation images, and 29,783 training images.
MSCOCO contains 123,287 images and 616,767 descriptions. Every image has roughly 5 text descriptions on average. The average length of captions is after rare word removal. Following the protocol in , we randomly select 5,000 images as test data and 5,000 images as validation data. The remaining 113,287 images are used as training data. The evaluation is reported on 1K test images (5 folds) and 5K test images.
CUHK-PEDES  collects images from many different person re-identification datasets. It contains 40,206 images from 13,003 different pedestrians and 80,440 descriptions. On average, each person has 3.1 images, and each image has 2 sentences. The average sentence length is words after we remove rare words. We follow the protocol in , selecting the last 1,000 persons for evaluation. There are 3,074 test images with 6,156 captions, 3,078 validation images with 6,158 captions, and 34,054 training images with 68,126 captions.
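As a quick consistency check, the split sizes and per-item averages quoted for the three datasets can be recomputed from the raw counts. The snippet below only uses numbers stated in the text; the variable names are illustrative.

```python
# (test, validation, training) image counts and the stated total per dataset.
splits = {
    "Flickr30k": ((1000, 1000, 29783), 31783),
    "MSCOCO": ((5000, 5000, 113287), 123287),
    "CUHK-PEDES": ((3074, 3078, 34054), 40206),
}
for name, (parts, total) in splits.items():
    # Each split should partition the full image set.
    assert sum(parts) == total, name

# Averages quoted in the text.
cuhk_images_per_person = 40206 / 13003
cuhk_sentences_per_image = 80440 / 40206
coco_captions_per_image = 616767 / 123287

# CUHK-PEDES caption counts over test, validation and training.
cuhk_captions_total = 6156 + 6158 + 68126
```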
Evaluation Metric. We use two evaluation metrics, i.e., Recall@K and Median Rank. Recall@K is the probability that a true match appears in the top K of the rank list, where a higher score is better. Median Rank is the median rank of the closest ground-truth result in the rank list, where a lower value is better.
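The two metrics can be computed from a query-by-candidate similarity matrix as sketched below. The function names and the data layout (a list of score rows plus a set of matching candidate indices per query) are assumptions made for illustration.

```python
from statistics import median

def first_match_ranks(similarity, ground_truth):
    """1-based rank of the best-ranked true match for each query.

    similarity[q][c] is the score between query q and candidate c;
    ground_truth[q] is the set of candidate indices that match query q.
    """
    ranks = []
    for scores, matches in zip(similarity, ground_truth):
        order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        ranks.append(next(r for r, c in enumerate(order, start=1) if c in matches))
    return ranks

def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median of the first-match ranks over all queries."""
    return median(ranks)
```

Using the rank of the closest ground-truth item handles image queries, which have several correct sentences, and text queries, which have one correct image, with the same code.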
The model is trained by stochastic gradient descent (SGD) with momentum fixed to 0.9 for weight update. While training, the images are resized to pixels which are randomly cropped from images whose shorter size is . We also perform simple data augmentation such as horizontal flipping. For training text input, we conduct position shift (Section III-B) as data augmentation. Dropout is applied to both CNNs, and the dropout rate is . For Flickr30k and MSCOCO, we set the max text length to ; for CUHK-PEDES, we set the max text length to , since most sentences are longer.
In the first training stage, we fix the pre-trained image CNN and train the text CNN only. The learning rate is 0.001. We stop training when the instance loss converges. In the second stage, we add the ranking loss as in Eq. 9 (the margin ) and fine-tune the entire network.
When testing, we can use the trained image CNN and trained text CNN separately. We extract the image feature by image CNN and the text feature by text CNN. We use the cosine distance to evaluate the similarity between the query and candidate images/sentences. It is consistent with the similarity used in the ranking loss objective. The final retrieval result is based on the similarity ranking. We also conduct the horizontal flipping when testing and use the average features (no flip and flip) as the image feature.
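The test-time procedure (flip averaging followed by cosine-similarity ranking) can be sketched as follows. Feature extraction itself is omitted; the helper names are hypothetical and the features are plain Python lists standing in for CNN outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def flip_averaged_feature(feat, feat_flipped):
    """Average the features of an image and its horizontal flip."""
    return [(a + b) / 2.0 for a, b in zip(feat, feat_flipped)]

def rank_candidates(query, candidates):
    """Return candidate indices sorted by decreasing cosine similarity."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)
```

Because the two branches are independent at test time, the candidate features can be extracted once and cached, and only the query feature needs to be computed per search.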
Reproducibility. Our source code is available online (https://github.com/layumi/Image-Text-Embedding). The implementation is based on the Matconvnet package. Since the entire network only uses four components, i.e., convolution, pooling, ReLU and batch normalization, it can be easily ported to other deep learning packages.
Training Time The image CNN (ResNet-50) in our method uses 119 ms per image batch (batch size = 32) on an Nvidia 1080Ti GPU. The text CNN (similar ResNet-50) also uses 117 ms per sentence batch (batch size = 32). Therefore, the image feature and text feature can be simultaneously calculated. Although our implementation is sequential, the model can run in a parallel style efficiently.
We first compare our method with the state-of-the-art methods on the three datasets, i.e., Flickr30k, MSCOCO, and CUHK-PEDES. The compared methods include recent models for bidirectional image and sentence retrieval. For a fair comparison, we present the results based on different image CNN structures, i.e., VGGNet and ResNet. We also summarise the visual and textual embeddings used in these works in Table I and Table VIII. Extensive results are shown in Table I, Table VIII, and Table VII, respectively. On Flickr30k, our results are competitive with the state-of-the-art DAN: Recall@1 = 55.6%, Med = 1 using image queries, and Recall@1 = 39.1%, Med = 2 using text queries. When both methods are based on VGG-19, our method exceeds DAN by 6.2% and 3.5% in Recall@1 using image and text queries, respectively (47.6% vs. 41.4% and 35.3% vs. 31.8% in Table I). On the MSCOCO 1K-test-image setting, we arrive at Recall@1 = 65.6%, Med = 1 using image queries, and Recall@1 = 47.1%, Med = 2 using text queries. On the 5K-test-image setting, we arrive at Recall@1 = 41.2%, Med = 2 using image queries, and Recall@1 = 25.3%, Med = 5 using text queries. CUHK-PEDES is a dataset dedicated to retrieving pedestrian images from textual descriptions. On CUHK-PEDES, we arrive at Recall@1 = 32.15%, Med = 4 with VGG-16. With both methods based on a VGG-16 network, our model has a 6.21% higher Recall@1 than the second best method. Moreover, our model based on ResNet-50 achieves new state-of-the-art performance: Recall@1 = 44.4%, Med = 2 using language descriptions to search for relevant pedestrians. This exceeds the second best method by 18.46% in Recall@1 accuracy.
Note that m-CNN  also fine-tunes the CNN model to extract visual and textual features. m-CNN encompasses four different levels of text matching CNN while we only use one deep textual model with residual blocks. While both are based on VGG-19, our model has higher performance than m-CNN. Compared with a recent arXiv work, VSE++ , our result is also competitive.
|Method||Stage||Image Query||Text Query|
|Only Ranking Loss||I||6.1||27.3||4.9||27.8|
|Only Instance Loss||I||39.9||79.1||28.2||67.9|
|Only Instance Loss||II||50.5||86.0||34.9||75.7|
|Only Ranking Loss||II||47.5||85.4||29.0||68.7|
The effect of Stage I training. We replace the instance loss with the ranking loss in the first stage while fixing the image CNN. As shown in Table II, the performance is limited. As discussed in Section IV-B, ranking loss focuses on inter-modal distance. It may be hard to tune the visual and textual features simultaneously at the beginning. As expected, instance loss, which focuses more on learning intra-modal discriminative descriptors, performs better.
The two losses can work together. In Stage II, the experiment on the validation set verifies that the two losses can work together to improve the final retrieval result (see Table II). Compared with models using only ranking loss or only instance loss, the model with both losses achieves higher performance. In the second stage, instance loss does help to regularize the model.
End-to-end fine-tuning helps. In Stage II, we fine-tune the entire network. For the two general object datasets Flickr30k and MSCOCO, fine-tuning the whole network improves the rank-1 accuracy by approximately 10% (see Table I and Table VIII). ImageNet collects images from the Internet, while the pedestrian dataset CUHK-PEDES collects images from surveillance cameras. The gain from fine-tuning is more pronounced on CUHK-PEDES because of this difference in data distribution. The fine-tuned network (based on ResNet-50) improves Recall@1 by 29.37%. The experiments indicate that end-to-end training is critical to image-sentence retrieval, especially person search.
Do we really need so many classes?
For instance loss, the number of classes is usually large. Is it possible to use fewer classes? We implement the pseudo-category method by k-means clustering on MSCOCO, since MSCOCO has the most images (classes). We use the pool5 feature of ResNet-50 pre-trained on ImageNet to cluster the images into 3,000 and 10,000 categories by k-means. The clustering results are used as pseudo labels for the images to conduct classification. Although clustering decreases the number of training classes and increases the number of samples per class, different instances are forced into the same class and details may be lost (black / gray dog, two dogs), which compromises the accuracy. The retrieval result with k classes on MSCOCO is shown in Table IV. It shows that this strategy is inferior to the instance loss.
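The pseudo-category baseline amounts to replacing the per-instance label with a cluster index. The sketch below uses a plain Lloyd's k-means on toy 2-D points; the function name, the deterministic initialization and the iteration count are assumptions for illustration, and real inputs would be pooled CNN features rather than 2-D points.

```python
def kmeans_labels(features, k, iters=20):
    """Plain Lloyd's k-means; returns one pseudo class label per feature."""
    # Deterministic init for the sketch: the first k points become the centers.
    centers = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, f in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With k smaller than the number of images, several distinct images share one label, which is exactly the loss of instance-level detail described above.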
Deeper Text CNN does not improve the performance. Several previous works report that a Text CNN may not improve the result when the network is very deep [65, 66]. This differs from what is observed in image recognition. In our experiment, we observe a similar result when deepening the Text CNN on Flickr30k and MSCOCO: a deeper Text CNN does not significantly improve the result (see Table III).
|Methods||Dataset||Image-Query R@1||Text-Query R@1|
|Res152 + Res50||Flickr30k||44.4||29.6|
|Res152 + Res152||Flickr30k||44.2||30.7|
|Res152 + Res50||MSCOCO||52.0||38.0|
|Res152 + Res152||MSCOCO||52.8||37.7|
|Methods||Image-Query R@1||Text-Query R@1|
|3000 categories (Stage I)||38.0||26.1|
|10000 categories (Stage I)||44.7||31.3|
|Method||Image Query||Text Query|
|Random initialization ||38.0||78.7||26.6||66.6|
|Method||Image Query||Text Query|
|CNN-RNN (VGG-16) ||fixed||8.07||-||32.47||-|
|Neural Talk (VGG-16) ||fixed||13.66||-||41.72||-|
|GNA-RNN (VGG-16) ||fixed||19.05||-||53.64||-|
|IATV (VGG-16) ||ft||25.94||-||60.48||-|
|Ours (VGG-16) Stage I||fixed||14.26||33.07||43.47||16|
|Ours (VGG-16) Stage II||ft||32.15||54.42||64.30||4|
|Ours (ResNet-50) Stage I||fixed||15.03||31.66||41.62||18|
|Ours (ResNet-50) Stage II||ft||44.40||66.26||75.07||2|
|Method||Visual||Textual||Image Query R@1||R@5||R@10||Med r||Text Query R@1||R@5||R@10||Med r|
|1K test images|
|DVSA ||ft RCNN||w2v + ft RNN||38.4||69.9||80.5||1||27.4||60.2||74.8||3|
|GMM-FV ||fixed VGG-16||w2v + GMM + HGLMM||39.4||67.9||80.9||2||25.1||59.8||76.6||4|
|m-RNN ||fixed VGG-16||ft RNN||41.0||73.0||83.5||2||29.0||42.2||77.0||3|
|RNN-FV ||fixed VGG-19||feature from ||41.5||72.0||82.9||2||29.2||64.7||80.4||3|
|m-CNN ||ft VGG-19||4 ft CNN||42.8||73.1||84.1||2||32.6||68.6||82.8||3|
|HM-LSTM ||fixed CNN from ||ft RNN||43.9||-||87.8||2||36.1||-||86.7||3|
|SPE ||fixed VGG-19||w2v + HGLMM||50.1||79.7||89.2||-||39.6||75.2||86.9||-|
|VQA-A ||fixed VGG-19||ft RNN||50.5||80.1||89.7||-||37.0||70.9||82.9||-|
|sm-LSTM ||fixed VGG-19||ft RNN||53.2||83.1||91.5||1||40.7||75.8||87.4||2|
|2WayNet ||fixed VGG-16||feature from ||55.8||75.2||-||-||39.7||63.3||-||-|
|RRF-Net ||fixed ResNet-152||w2v + HGLMM||56.4||85.3||91.5||-||43.9||78.1||88.6||-|
|Ours (VGG-19) Stage I||fixed VGG-19||ft ResNet-50 (w2v init.)||46.0||75.6||85.3||2||34.4||66.6||78.7||3|
|Ours (VGG-19) Stage II||ft VGG-19||ft ResNet-50 (w2v init.)||59.4||86.2||92.9||1||41.6||76.3||87.5||2|
|Ours (ResNet-50) Stage I||fixed ResNet-50||ft ResNet-50 (w2v init.)||52.2||80.4||88.7||1||37.2||69.5||80.6||2|
|Ours (ResNet-50) Stage II||ft ResNet-50||ft ResNet-50 (w2v init.)||65.6||89.8||95.5||1||47.1||79.9||90.0||2|
|5K test images|
|GMM-FV ||fixed VGG-16||w2v + GMM + HGLMM||17.3||39.0||50.2||10||10.8||28.3||40.1||17|
|DVSA ||ft RCNN||w2v + ft RNN||16.5||39.2||52.0||9||10.7||29.6||42.2||14|
|VQA-A ||fixed VGG-19||ft RNN||23.5||50.7||63.6||-||16.7||40.5||53.8||-|
|Ours (VGG-19) Stage I||fixed VGG-19||ft ResNet-50 (w2v init.)||24.5||50.1||62.1||5||16.5||39.1||51.8||10|
|Ours (VGG-19) Stage II||ft VGG-19||ft ResNet-50 (w2v init.)||35.5||63.2||75.6||3||21.0||47.5||60.9||6|
|Ours (ResNet-50) Stage I||fixed ResNet-50||ft ResNet-50 (w2v init.)||28.6||56.2||68.0||4||18.7||42.4||55.1||8|
|Ours (ResNet-50) Stage II||ft ResNet-50||ft ResNet-50 (w2v init.)||41.2||70.5||81.1||2||25.3||53.4||66.4||5|
Word2vec initialization helps. We compare the results of word2vec initialization and random initialization for the first convolution layer of the text CNN. Note that we remove the words, which have not appeared in the training set, in the training data as well as dictionary. So the weight of the first convolution layer is instead of . is the dictionary size. When testing, words missing from the dictionary are also removed in advance. As shown in Table V, word2vec initialization outperforms random initialization by to . Although word2vec is not trained on the target dataset, it still serves as a proper initialization for the text CNN.
Position shift vs. Left alignment: Text CNN has a fixed-length input. As discussed in Section III-B, left alignment pads zeros at the end of the text input (like aligning the whole sentence left) if the sentence is shorter than . Position shift adds zeros at the end of the text input as well as at the beginning. We conduct the position shift online when reading data from the disk. We run the experiment on the Flickr30k validation set. As shown in Table VI, the model using position shift outperforms the one using left alignment . Position shift serves as a significant data augmentation method for text feature learning.
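The two padding schemes can be contrasted on a list of word indices. This is a sketch under stated assumptions: index 0 is taken to be the padding symbol, the shift offset is drawn uniformly at random, and the function names and the `rng` parameter are hypothetical.

```python
import random

def left_align(tokens, max_len, pad=0):
    """Pad zeros only at the end of the sequence."""
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

def position_shift(tokens, max_len, pad=0, rng=random):
    """Pad zeros at both ends, with a random number of zeros at the beginning."""
    tokens = tokens[:max_len]
    # Assumed: the offset is uniform over all positions that keep the sentence whole.
    offset = rng.randint(0, max_len - len(tokens))
    return [pad] * offset + tokens + [pad] * (max_len - len(tokens) - offset)
```

Calling `position_shift` each time a sentence is read yields a different placement per epoch, so the same sentence acts as several training samples, while `left_align` always produces the same input.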
In Fig. 9 and Fig. 10, we present some visual retrieval results on CUHK-PEDES and Flickr30k, respectively. Our method returns reasonable rank lists. (More qualitative results can be found in Appendix.)
Does Text CNN learn discriminative words? The text CNN is supposed to convey the necessary textual information for image-text matching. To examine whether the text CNN discovers discriminative words, we fix the visual feature. For text input, we remove one word from the sentence each time. If we remove a discriminative word, the matching confidence will drop. In this way, we can determine the learned importance of different words.
The proposed model learns discriminative words. As shown in Fig. 11, we observe that the words which convey object/colour information, e.g., basketball, swing, purple, are usually discriminative. If we remove these words, the matching confidence drops. Conversely, removing prepositions, e.g., with, on, at, in, has only a small impact on the matching confidence.
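The word-removal probe described above can be sketched as a leave-one-out loop. Here `match_score` is a hypothetical stand-in for the trained model: any callable that maps a list of words to a matching confidence against a fixed image feature.

```python
def word_importance(words, match_score):
    """Drop in matching confidence when each word is removed in turn.

    Returns a dict keyed by (position, word) so repeated words stay distinct.
    A larger drop means the word is more discriminative.
    """
    base = match_score(words)
    return {
        (i, w): base - match_score(words[:i] + words[i + 1:])
        for i, w in enumerate(words)
    }
```

Sorting the returned dictionary by value gives a per-sentence ranking of words, which is the kind of information a figure such as Fig. 11 would visualize.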
In this paper, we propose the instance loss for image-text retrieval. It is based on an unsupervised assumption that every image/text group can be viewed as one class. The experiments show that instance loss provides a proper initialization for ranking loss and further regularizes the training. As a minor contribution, we propose a dual-path CNN to conduct end-to-end training on both image and text branches. The proposed method achieves competitive results on two generic retrieval datasets, Flickr30k and MSCOCO. Furthermore, we achieve a +18% improvement on the person retrieval dataset CUHK-PEDES. Our code has been made publicly available. Additional examples can be found in the Appendix.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.