Deep learning exploits large volumes of labeled data to learn powerful models. When the target dataset is small, it is common practice to perform transfer learning using pre-trained models to learn new task-specific representations. However, pre-trained CNNs for image recognition are given limited information about the image during training, namely the label alone. Tasks such as scene retrieval suffer from features learned under this weak supervision and require stronger supervision to better understand the contents of the image. In this paper, we exploit the features learned by caption generating models to learn novel task-specific image representations. In particular, we consider the state-of-the-art captioning system Show and Tell and the dense region description model DenseCap. We demonstrate that, owing to the richer supervision provided during training, the features learned by the captioning system perform better than those of CNNs. Further, we train a siamese network with a modified pair-wise loss to fuse the features learned by Show and Tell and DenseCap and learn image representations suitable for retrieval. Experiments show that the proposed fusion exploits the complementary nature of the individual features and yields state-of-the-art retrieval results on benchmark datasets.
Deep learning has enabled us to learn various sophisticated models using large amounts of labeled data. Computer vision tasks such as image recognition, segmentation, and face recognition require large volumes of labeled data to build reliable models. However, when the training data is not sufficient, it is common practice to use pre-trained models rather than training from scratch in order to avoid over-fitting. This enables us to utilize the large volumes of data on which the pre-trained models were learned and to transfer that knowledge to the new target task. The hierarchical nature of the learned representations and task-specific optimization make it easy to reuse them. This process of reusing pre-trained models and learning new task-specific representations is referred to as transfer learning or fine-tuning the pre-trained models. There exist many successful instances of transfer learning in computer vision using Convolutional Neural Networks (CNNs). A large body of these adaptations are fine-tuned architectures of the well-known recognition models [5, 6, 7, 8, 9, 10] trained on the IMAGENET and Places datasets.
However, these models perform object or scene classification and have very limited information about the image. All that these models are provided with during training is the category label. No other information about the scene is provided. For example, the image shown in Figure 2 has dog and person as labels. Useful information, such as whether the scene is indoor or outdoor, the interaction between the objects, and the presence of other objects in the scene, is missing. Tasks such as image search (similar image retrieval) suffer from the features learned under this weak supervision. Image retrieval requires the models to understand the image contents better (e.g. [12, 13]) to be able to retrieve similar images. The problem becomes more severe when the images have multiple objects and graded relevance scores (multiple similarity levels) instead of binary relevance (similar or dissimilar).
On the other hand, automatic caption generation models (e.g. [14, 1, 15, 16]) are trained with human-written descriptions of the images. These models are trained with stronger supervision than the recognition models. For example, Figure 1 shows a pair of images from the MSCOCO dataset along with their captions. Richer information about the scene is available to these models than mere labels. In this paper, we exploit the features learned via strong supervision by these models and learn task-specific image representations for retrieval via pairwise constraints.
In the case of CNNs, the learning acquired from training for a specific task (e.g. recognition on IMAGENET) is transferred to other vision tasks. Transfer learning followed by task-specific fine-tuning has proven effective in scenarios with little data. However, similar transfer learning is left unexplored in the case of caption generators. To the best of our knowledge, this is the first attempt to exploit that knowledge by fine-tuning the representations learned by caption generators for a retrieval task.
The major contributions of our work can be listed as:
We show that the features learned by the image captioning systems represent image contents better than those of CNNs via image retrieval experiments. We attempt to exploit the strong supervision observed during their training via transfer learning.
We train a siamese network using a modified pair-wise loss suitable for non-binary relevance scores to fuse the complementary features learned by Show and Tell and DenseCap. We demonstrate that the task-specific image representations learned via our proposed fusion achieve state-of-the-art performance on benchmark retrieval datasets.
The paper is organised as follows: Section 2 provides a short summary of Show and Tell and DenseCap before presenting details about the proposed approach to perform transfer learning. This section also discusses the proposed fusion architecture. Section 3 details the experiments performed on benchmark datasets and discusses various aspects along with the results. Finally, Section 4 concludes the paper.
Transfer learning followed by task-specific fine-tuning is a well-known technique in deep learning. In this section, we present an approach to exploit the fine supervision employed by the captioning models and the resulting features. In particular, we target the task of similar image retrieval and learn suitable features.
Throughout the paper, we consider the state-of-the-art captioning model Show and Tell by Vinyals et al. Their model is an encoder-decoder framework consisting of a simple cascade of a CNN and an LSTM. The CNN encodes visual information from the input image and feeds it to the LSTM via a learnable transformation. The result is called the image encoding, shown in green in Figure 3. The LSTM's task is to predict the caption word by word, conditioned on the image and the previous words. The image encoding is the output of a transformation learned on top of the final layer of the CNN (Inception V3) before it is fed to the LSTM. The system is trained end-to-end with image-caption pairs to update the image and word embeddings along with the LSTM parameters. Note that the Inception V3 layers (prior to the image encoding) are frozen (not updated) during the first phase of training and are updated during the later phase.
The features at the image encoding layer (green arrow in Figure 3) are learned from scratch. Note that these are the features input to the text generating part, and they are fed only once. They must therefore summarize all the important visual content in the image that is to be described in the caption, and they need to be more expressive than the deep fully connected layers of typical CNNs trained with weak supervision (labels). Therefore, we consider transferring these features to learn task-specific features for image retrieval. We refer to these features as Full Image Caption features, since the generated caption gives a visual summary of the whole image.
On the other hand, Johnson et al. proposed an approach to densely describe the regions in the image, called the dense captioning task. Their model contains a fully convolutional CNN for object localization followed by an RNN that provides the description. The two modules are linked via a non-linear projection (layer), similar to Show and Tell. The objective is to generalize the tasks of object detection and image captioning. Their model is trained end-to-end on the Visual Genome dataset, which provides object-level annotations and corresponding descriptions. They fine-tune the later layers (from the fifth onward) of the CNN module (VGG architecture) along with training the image encodings and RNN parameters. As with the Full Image Caption features, we consider the image encodings in order to transfer this model's ability to describe regions in the image. The model provides encodings for each of the described image regions along with associated priorities. Figure 2 (right panel) shows an example image and the region descriptions predicted by the DenseCap model. Note that the detected regions and corresponding descriptions are dense and reliable. In order to obtain a summary of the image contents, we perform mean pooling on the representations (features) belonging to the top-K regions (according to the predicted priorities). We refer to the pooled encodings as Densecap features.
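The pooling step described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the released implementation: the function name `mean_pool_top_k`, the list-based layout of the region encodings, and the toy values are all hypothetical.

```python
def mean_pool_top_k(region_features, priorities, k):
    """Average the feature vectors of the k highest-priority regions.

    region_features: list of equal-length feature vectors, one per region
    priorities: one predicted priority score per region (higher is better)
    """
    if not region_features or len(region_features) != len(priorities):
        raise ValueError("need one priority per region feature")
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(region_features))
    # Region indices sorted by descending priority
    order = sorted(range(len(priorities)), key=lambda i: priorities[i], reverse=True)
    top = [region_features[i] for i in order[:k]]
    dim = len(top[0])
    # Element-wise mean over the selected regions
    return [sum(vec[j] for vec in top) / k for j in range(dim)]


# Toy example (hypothetical values): three regions with 2-d encodings
regions = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
scores = [0.9, 0.1, 0.5]
pooled = mean_pool_top_k(regions, scores, k=2)
```

Here the regions with priorities 0.9 and 0.5 are selected, so `pooled` is the element-wise mean of the first and third encodings.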
For tasks such as image retrieval in particular, models trained with strong object-level and attribute-level supervision can provide better pre-trained features than those trained with weak label-level supervision. Therefore, we propose an approach to exploit the Densecap features along with the Full Image Caption features and learn task-specific image representations.
Figure 2 shows the descriptions predicted by Show and Tell and DenseCap for a sample image. Note that the predictions are complementary in nature. Show and Tell provides a summary of the scene: a boy is standing next to a dog. DenseCap, in contrast, provides more details about the scene and objects: the presence of green grass, a metal fence, and a brick wall, and attributes of objects such as black dog, white shirt, etc.
In the proposed approach, we attempt to learn image representations that exploit the strong supervision available from the training process of Show and Tell and DenseCap. Further, we take advantage of the complementary nature of these two features and fuse them to learn task-specific features for image retrieval.
We train a siamese network to fuse the two features. An overview of the architecture is presented in Figure 4. The proposed siamese architecture has two wings. A pair of images is presented to the network along with their relevance score (high for similar images, low for dissimilar ones). In the first layer of the architecture, the Full Image Caption and Densecap features are late fused (concatenated) and presented to the network. A sequence of layers is added on both wings to learn discriminative embeddings. Note that these layers have tied weights (identical transformations in both paths). In the final layer, the features are compared to find the similarity, and the loss is computed with respect to the ground truth relevance. The error is back-propagated to update the network parameters. Our network accepts the complementary information provided by both features and learns a metric via representations suitable for image retrieval. More details about the training are presented in Section 3.4.
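To make the tied-weight structure concrete, the forward pass of such a network can be sketched in plain Python. This is a hedged illustration only: the layer sizes, the ReLU non-linearity, the random initialisation, and the names `make_layer`, `embed` and `siamese_distance` are assumptions introduced here and are not taken from the paper or its released code.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """One fully connected layer: a weight matrix and a zero bias vector."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def embed(features, layers):
    """Apply the shared layers (ReLU between layers, an assumption) and L2-normalise."""
    x = features
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    norm = math.sqrt(sum(v * v for v in x)) or 1.0  # avoid division by zero
    return [v / norm for v in x]


def siamese_distance(caption_a, dense_a, caption_b, dense_b, layers):
    """Distance between two images; both wings reuse the same `layers` (tied weights)."""
    emb_a = embed(caption_a + dense_a, layers)  # fusion by concatenation
    emb_b = embed(caption_b + dense_b, layers)
    return math.dist(emb_a, emb_b)


# Toy example: 2-d caption features and 2-d region features per image
rng = random.Random(0)
layers = [make_layer(4, 3, rng), make_layer(3, 2, rng)]
d_same = siamese_distance([1.0, 2.0], [3.0, 4.0], [1.0, 2.0], [3.0, 4.0], layers)
d_diff = siamese_distance([1.0, 2.0], [3.0, 4.0], [4.0, 3.0], [2.0, 1.0], layers)
```

Because both wings call `embed` with the same `layers` object, any parameter update affects both paths identically, which is what tying the weights means in practice.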
We begin by explaining the retrieval datasets considered for our experiments (the datasets are available at http://val.serc.iisc.ernet.in/attribute-graph/Databases.zip). In order to have a more natural scenario, we consider retrieval datasets that have graded relevance scores instead of binary relevance (similar or dissimilar). We require the relevance to be assigned based on overall visual similarity as opposed to any one particular aspect of the image (e.g. objects). To match these requirements, we consider two datasets, rPascal (ranking Pascal) and rImagenet (ranking Imagenet), composed by Prabhu et al. These datasets are subsets of aPascal and Imagenet respectively. Each dataset contains query images and a set of corresponding relevant images. The relevance scores were assigned by human annotators. The scores are graded, ranging from irrelevant to excellent match.
rPascal: This dataset is composed from the test set of aPascal. The queries comprise indoor and outdoor scenes. The dataset consists of a total of images with an average of reference images per query.
rImagenet: This dataset is composed from the validation set of the ILSVRC 2013 detection challenge. Images containing at least objects are chosen. The queries contain indoor scenes and outdoor scenes. The dataset consists of a total of images with an average of reference images per query.
Figure 5 shows sample images from the two datasets. Note that the first image in each row is the query and the following images are reference images, with relevance scores displayed at the top right corner.
We followed a previously published evaluation procedure. For quantitative evaluation of the performance, we compute the normalized Discounted Cumulative Gain (nDCG) of the retrieved list. nDCG is a standard evaluation metric for ranking algorithms. For all the queries in each dataset, we compute the nDCG value and report the mean nDCG per dataset, evaluated at different ranks (K).
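For reference, nDCG at rank K can be computed as in the following Python sketch. Several variants of the metric exist; this sketch assumes the exponential gain 2^rel - 1 with a log2 rank discount, and it assumes the ranked list contains every reference image of the query so that the ideal ordering can be derived from it. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (exponential gain variant)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k):
    """DCG of the retrieved order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_query_relevances, k):
    """Mean nDCG@k over all queries of a dataset."""
    return sum(ndcg_at_k(r, k) for r in per_query_relevances) / len(per_query_relevances)


# Relevance grades of the retrieved references, in retrieved order (toy values)
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
swapped = ndcg_at_k([0, 3], k=2)
```

A list that is already sorted by decreasing relevance scores 1.0, and any misordering lowers the value, with mistakes near the top of the list penalised most.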
In this subsection, we demonstrate the effectiveness of the features obtained from the caption generation model Show and Tell. For each image, we extract the Full Image Caption features to encode its contents. Note that these are the features learned by the caption generation model via the strong supervision provided during training. Retrieval is performed by computing the distance between the features of the query and those of the reference images, and arranging the reference images in increasing order of distance.
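The retrieval step amounts to a nearest-neighbour ranking, which can be sketched as follows. The text above does not name the distance used at this stage; Euclidean distance is assumed here, and `rank_references` and the toy vectors are hypothetical.

```python
import math


def rank_references(query_feature, reference_features):
    """Return reference indices ordered from nearest to farthest from the query."""
    distances = [math.dist(query_feature, ref) for ref in reference_features]
    return sorted(range(len(reference_features)), key=lambda i: distances[i])


# Toy example: one 2-d query feature and three reference features
query = [0.0, 0.0]
references = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
ranking = rank_references(query, references)
```

The relevance grades of the references, read off in the order given by `ranking`, form the list on which nDCG is evaluated.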
Figures 6 and 7 show the plots of nDCG evaluated at different ranks on the two datasets. As a baseline, we compare the performance of the Full Image Caption features with that of the non-finetuned visual features of the Inception V3 model (blue plot in Figures 6 and 7). Note that these are the features extracted from the last fully connected layer of the Inception V3 model. The Full Image Caption features outperform the non-finetuned visual features by a large margin, emphasizing the effectiveness of the strong supervision.
We considered another baseline using natural language descriptors. For a given image, we predict the text caption using the Show and Tell model. After pre-processing (stop word removal and lemmatization), we encode each of the remaining words using word2vec embeddings and mean pool them to form an image representation. Note that the Full Image Caption features also perform better than this baseline.
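The caption-based baseline can be illustrated with the sketch below. It is a simplified stand-in, not the pipeline used in the experiments: the stop-word list and the 2-d embedding table are toy assumptions replacing a real stop-word list and word2vec vectors, lemmatization is omitted, and `caption_to_vector` is a hypothetical name.

```python
STOP_WORDS = {"a", "an", "the", "is", "to"}  # toy list, an assumption


def caption_to_vector(caption, embeddings, stop_words=STOP_WORDS):
    """Mean pool the embeddings of the non-stop words of a caption.

    Words missing from the embedding table are skipped; returns None when
    no word of the caption can be embedded.
    """
    words = [w for w in caption.lower().split() if w not in stop_words]
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


# Toy 2-d embeddings standing in for word2vec vectors
toy_embeddings = {"boy": [1.0, 0.0], "dog": [0.0, 1.0], "standing": [1.0, 1.0]}
vec = caption_to_vector("A boy is standing next to a dog", toy_embeddings)
```

In this example the word "next" has no entry in the toy table and is skipped, so the result is the mean of the vectors for "boy", "standing" and "dog".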
We also compare the performance of the Full Image Caption features against the state-of-the-art Attribute Graph approach. The Full Image Caption features clearly outperform the Attribute Graph approach on both benchmark datasets.
The proposed fusion architecture (project code can be found at https://github.com/mopurikreddy/strong-supervision) is trained with pairs of images and corresponding relevance scores $y$. The typical pairwise training uses binary relevance scores: similar ($y = 1$) or dissimilar ($y = 0$). The objective is to reduce the distance between the projections of the images if they are similar and to separate them if they are dissimilar. Equation (1) shows the contrastive loss typically used to train siamese networks.
$$E = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n d_n^2 + (1 - y_n) \max(m - d_n, 0)^2 \right] \qquad (1)$$
where $E$ is the prediction error, $N$ is the mini-batch size, $y_n$ is the relevance score (0 or 1), $d_n$ is the distance between the projections of the pair of images, and $m$ is the margin that separates the projections corresponding to a dissimilar pair of images.
However, in practice images can have non-binary relevance scores. To handle more fine-grained relevance, we modified the contrastive loss function to include non-binary scores, as shown in Equation (2),
where $\mathbb{1}(\cdot)$ is the indicator function. Note that the modified loss function favours the nDCG measure by strongly punishing (due to the square term) the distances between images with higher relevance scores.
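The two losses can be compared with a short Python sketch. The binary version is the standard contrastive loss. The graded version is one plausible reading of the description above and should be treated as an assumption rather than the exact published formula: the squared distance of a pair is weighted by its relevance score, and the margin term is switched on by an indicator that fires only for irrelevant pairs (score 0). The function names are hypothetical.

```python
def contrastive_loss(pairs, margin):
    """Standard contrastive loss; each pair is (y, d) with y in {0, 1}."""
    total = sum(y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2 for y, d in pairs)
    return total / (2 * len(pairs))


def graded_contrastive_loss(pairs, margin):
    """Assumed graded variant; each pair is (y, d) with y a non-negative grade."""
    total = 0.0
    for y, d in pairs:
        if y == 0:
            # Indicator term: push irrelevant pairs at least `margin` apart
            total += max(margin - d, 0.0) ** 2
        else:
            # Pull relevant pairs together, weighted by their relevance grade
            total += y * d ** 2
    return total / (2 * len(pairs))


binary_pairs = [(1, 0.5), (0, 0.2)]
graded_pairs = [(3, 1.0), (0, 0.5)]
loss_binary = contrastive_loss(binary_pairs, margin=1.0)
loss_graded = graded_contrastive_loss(graded_pairs, margin=1.0)
```

Under this reading the graded loss reduces to the standard contrastive loss when all scores are 0 or 1, and a highly relevant pair that is far apart contributes more to the loss than a weakly relevant pair at the same distance.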
We train a siamese network with fully connected layers on both the wings, with tied weights. The number of units in each wing are . The representations learned at the last layer are normalized, and the Euclidean distance between them is minimized according to Equation (2). We divide the queries into splits to perform cross-validation and report the mean nDCG. That is, each fold contains image pairs of queries and corresponding reference images for training, and the remaining queries form the evaluation set. On average, each fold contains training pairs for rPascal and pairs for rImagenet.
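The query-level splitting can be sketched as below. The number of folds is a hypothetical parameter (`n_folds`), as are the function name and the integer query identifiers; the point is only that whole queries, rather than individual image pairs, are assigned to the training or evaluation side.

```python
def query_folds(query_ids, n_folds):
    """Yield (train_queries, eval_queries) for each fold, splitting by query."""
    folds = [query_ids[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        train = [q for i, fold in enumerate(folds) if i != held_out for q in fold]
        yield train, folds[held_out]


# Toy example: ten queries and a hypothetical choice of five folds
splits = list(query_folds(list(range(10)), n_folds=5))
```

Training pairs for a fold are then formed from each training query and its reference images, and nDCG is computed on the held-out queries and averaged over folds.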
Figures 8 and 9 show the performance of the task-specific image representations learned via the proposed fusion. For evaluating the performance of the Densecap method, we mean pooled the encodings corresponding to the top-K image regions into a single feature vector. The Full Image Caption feature is also a vector, and the two are concatenated to form the input to the network. Note that transfer learning and fine-tuning through fusion improve the retrieval performance on both datasets.
In this paper, we have presented an approach to exploit the strong supervision observed in the training of caption generation systems. We have demonstrated that image understanding tasks such as retrieval can benefit from this strong supervision compared to weak label-level supervision. Transfer learning followed by task-specific fine-tuning is common in CNN-based vision systems. However, a similar practice is relatively unexplored in the case of these captioning systems. Our approach can potentially open new directions for exploring other sources of stronger supervision and better learning. It can also motivate work at the juncture of vision and language in order to build more intelligent systems.
This work was supported by Defence Research and Development Organization (DRDO), Government of India.
“Learning deep features for scene recognition using places database,” in NIPS, 2014.
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015.
“Deep captioning with multimodal recurrent neural networks (m-RNN),” in ICLR, 2015.