The sense of similarity has been known as the most basic component of human reasoning. Likewise, understanding similarity between images has played essential roles in many areas of computer vision including image retrieval [51, 44, 45, 19], face identification [40, 12, 47], place recognition, pose estimation, person re-identification [41, 10], video object tracking [48, 43], local feature descriptor learning [25, 59], zero-shot learning [7, 58], and self-supervised representation learning. Also, the perception of similarity has been achieved by learning similarity metrics from labeled images, which is called metric learning.
Recent approaches in metric learning have improved performance dramatically by adopting deep Convolutional Neural Networks (CNNs) as their embedding functions. Specifically, such methods train CNNs to project images onto a manifold where two examples are close to each other if they are semantically similar and far apart otherwise. While in principle such a metric can be learned using any type of semantic similarity labels, previous approaches typically rely on binary labels over image pairs indicating whether the image pairs are similar or not. In this respect, only a small subset of real-world image relations has been addressed by previous approaches. Indeed, binary similarity labels are not sufficient to represent sophisticated relations between images with structured and continuous labels, such as image captions [30, 36, 57], human poses [3, 21], camera poses [5, 13], and scene graphs [24, 31].

Metric learning with continuous labels has been addressed in [27, 33, 46, 16, 4]. Such methods, however, reduce the problem by quantizing continuous similarity into binary labels (i.e., similar or dissimilar) and applying the existing metric learning techniques. Therefore, they do not fully exploit rich similarity information in images with continuous labels, as illustrated in Figure 1(a), and require careful tuning of the quantization parameters.

In this paper, we propose a novel method for deep metric learning to overcome the aforementioned limitations. We first design a new triplet loss function that takes full advantage of continuous labels in metric learning. Unlike existing triplet losses [54, 40, 55] that are interested only in the equality of class labels or the order of label distances, our loss aims to preserve ratios of label distances in the learned embedding space. This allows our model to consider degrees of similarities as well as their order and to capture richer similarity information between images, as illustrated in Figure 1(b).

Current methods construct triplets by sampling a positive (similar) and a negative (dissimilar) example to obtain the binary supervision. Here we propose a new strategy for triplet sampling. Given a minibatch composed of an anchor and its neighbors, our method samples every triplet including the anchor by choosing every pair of neighbors in the minibatch. Unlike the conventional approaches, our method does not need to introduce quantization parameters to categorize neighbors into the two classes and can utilize more triplets given the same minibatch.

Our approach can be applied to various problems with continuous and structured labels. We demonstrate the efficacy of the proposed method on three different image retrieval tasks using human poses, room layouts, and image captions, respectively, as continuous and structured labels. In all the tasks, our method outperforms the state of the art, and our new loss and the triplet mining strategy both contribute to the performance boost. Moreover, we find that our approach learns a better metric space even with a significantly lower embedding dimensionality compared to previous ones. Finally, we show that a CNN trained by our method with caption similarity can serve as an effective visual feature for image captioning, and it outperforms an ImageNet pre-trained counterpart in the task.
2 Related Work
In this section, we first review loss functions and tuple mining techniques for deep metric learning, then discuss previous work on metric learning with continuous labels.
2.1 Loss Functions for Deep Metric Learning
Contrastive loss [12, 17, 6] and triplet loss [40, 55, 51] are standard loss functions for deep metric learning. Given an image pair, the contrastive loss minimizes their distance in the embedding space if their classes are the same, and separates them a fixed margin away otherwise. The triplet loss takes triplets of anchor, positive, and negative images, and enforces the distance between the anchor and the positive to be smaller than that between the anchor and the negative. One of their extensions is the quadruplet loss [43, 10], which considers relations between a quadruplet of images and is formulated as a combination of two triplet losses. A natural way to generalize the above losses is to use higher-order relations. For example, the $n$-tuplet loss takes as its input an anchor, a positive, and $n-2$ negative images, and jointly optimizes their embedding vectors. Similarly, the lifted structured loss considers all positive and negative pairs in a minibatch at once by incorporating hard-negative mining functionality within itself. For the same purpose, another approach minimizes the area of intersection between the similarity distributions of positive and negative pairs, and in [44, 28] clustering objectives are adopted for metric learning. All the aforementioned losses utilize image-level class labels or their equivalent as supervision. Thus, unlike ours, it is not straightforward for them to take relations between continuous and/or structured labels of images into account.
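As a minimal sketch of the pairwise behavior described above, the following shows one common form of the contrastive loss. The function names, the factor of 0.5, and the default margin are illustrative assumptions rather than details taken from the cited works.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(f_x, f_y, same_class, margin=1.0):
    """One common form of the contrastive loss (assumed, not from the paper).

    Same-class pairs are pulled together; other pairs are pushed
    until they are at least `margin` apart.
    """
    d = euclidean(f_x, f_y)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Note that the supervision enters only through the binary flag `same_class`, which is why such a loss cannot directly express degrees of similarity.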
2.2 Techniques for Mining Training Tuples
Since tuples of images are used in training, the number of possible tuples increases exponentially with the tuple size. The motivation of mining techniques is that many of these tuples do not contribute to training or can even result in decreased performance. A representative example is semi-hard triplet mining, which utilizes only semi-hard triplets for training since easy triplets do not update the network and the hardest ones may have been corrupted by labeling errors. It also matters how hardness is measured. A common strategy [40, 45] is to utilize pairwise Euclidean distances in the embedding space, e.g., negative pairs with small Euclidean distances are considered hard. In [56, 19, 20], an underlying manifold of embedding vectors, which is ignored by Euclidean distances, is taken into account to improve the effectiveness of mining techniques. Also, multiple levels of hardness have been captured by a set of embedding models with different complexities. Although the above techniques substantially improve the quality of the learned embedding space, they are commonly based on binary relations between image pairs, and thus are not directly applicable to metric learning with continuous labels.
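To make the notion of semi-hard triplets concrete, the sketch below filters negatives for a fixed anchor-positive pair. It assumes the usual definition, in which a negative is semi-hard when it is farther from the anchor than the positive but still within the margin. The function names and the margin value are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def semi_hard_negatives(anchor, positive, negatives, margin=0.2):
    """Keep negatives n with d(a, p) < d(a, n) < d(a, p) + margin.

    Easy negatives (beyond the margin) give zero loss, and the hardest
    ones (closer than the positive) are skipped as potentially noisy.
    """
    d_ap = sq_dist(anchor, positive)
    return [n for n in negatives if d_ap < sq_dist(anchor, n) < d_ap + margin]
```

For instance, with anchor `(0, 0)` and positive `(0.5, 0)`, the negative `(0.6, 0)` is kept, while `(0.2, 0)` (too hard) and `(1, 0)` (too easy) are discarded.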
2.3 Metric Learning Using Continuous Labels
There have been several metric learning methods using data with continuous labels. For example, similarities between human pose annotations have been used to learn an image embedding CNN [27, 33, 46]. This pose-aware CNN then extracts pose information of a given image efficiently without explicit pose estimation, which can be transferred to other tasks relying on pose understanding such as action recognition. Also, caption similarities between image pairs have been used as labels for metric learning, and the learned embedding space enables image retrieval based on a more comprehensive understanding of image content. Other examples of continuous labels that have been utilized for metric learning include GPS data for place recognition and camera frusta for camera relocalization.

However, it is hard for the above methods to take full advantage of continuous labels because they all use conventional metric learning losses based on binary relations. Due to their loss functions, they quantize continuous similarities into binary levels through distance thresholding [4, 33, 46] or nearest neighbor search [27, 16]. Unfortunately, both strategies are unnatural for continuous metric learning and have clear limitations, as illustrated in Figure 2. Furthermore, it is not straightforward to find a proper value for their quantization parameters since there is no clear boundary between positive and negative examples whose distances to the anchors are continuous. To the best of our knowledge, our work is the first attempt to directly use continuous labels for metric learning.
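The two quantization strategies mentioned above can be sketched as follows. The function names, the threshold `tau`, and the neighbor count `k` are hypothetical, and the example only illustrates how binary labels discard distance information.

```python
def quantize_by_threshold(label_dists, tau):
    """Mark a neighbor as positive when its label distance is at most tau."""
    return [d <= tau for d in label_dists]


def quantize_by_knn(label_dists, k):
    """Mark the k neighbors with the smallest label distances as positive."""
    order = sorted(range(len(label_dists)), key=lambda i: label_dists[i])
    positives = set(order[:k])
    return [i in positives for i in range(len(label_dists))]


# Label distances from one anchor to four neighbors (toy values).
dists = [0.1, 0.9, 1.1, 5.0]
by_threshold = quantize_by_threshold(dists, tau=1.0)
by_knn = quantize_by_knn(dists, k=2)
```

In this toy case, the neighbors at distances 0.9 and 1.1 receive opposite labels although they are nearly equally similar to the anchor, while those at 0.1 and 0.9 become indistinguishable. The outcome also changes with the choice of `tau` or `k`.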
3 Our Framework
To address the limitations of existing methods described above, we propose a new triplet loss called log-ratio loss. Our loss directly utilizes continuous similarities without quantization. Moreover, it considers degrees of similarities as well as their rank so that the resulting model can infer sophisticated similarity relations between continuous labels. In addition, we present a new, simple yet effective triplet mining strategy supporting our log-ratio loss since the existing mining techniques in Section 2.2 cannot be used together with our loss. In the following sections, we briefly review the conventional triplet loss for a clear comparison, then present details of our log-ratio loss and the new triplet mining technique.
3.1 Review of Conventional Triplet Loss
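The triplet loss takes a triplet of an anchor, a positive, and a negative image as input. It is designed to penalize triplets violating the rank constraint, namely, that the distance between the anchor and the positive must be smaller than that between the anchor and the negative in the embedding space. The loss is formulated as

$$\ell_{\mathrm{tri}}(a, p, n) = \big[ D(f_a, f_p) - D(f_a, f_n) + \delta \big]_+ , \tag{1}$$

where $f$ indicates an embedding vector, $D(\cdot)$ means the squared Euclidean distance, $\delta$ is a margin, and $[\cdot]_+$ denotes the hinge function. Note that the embedding vectors should be normalized since, without such a normalization, their magnitudes tend to diverge and the margin becomes trivial. For training, gradients of $\ell_{\mathrm{tri}}$ with respect to the embedding vectors are computed by

$$\frac{\partial \ell_{\mathrm{tri}}(a, p, n)}{\partial f_p} = 2 (f_p - f_a) \cdot \mathbb{1}\big( \ell_{\mathrm{tri}}(a, p, n) > 0 \big), \tag{2}$$

$$\frac{\partial \ell_{\mathrm{tri}}(a, p, n)}{\partial f_n} = 2 (f_a - f_n) \cdot \mathbb{1}\big( \ell_{\mathrm{tri}}(a, p, n) > 0 \big), \tag{3}$$

$$\frac{\partial \ell_{\mathrm{tri}}(a, p, n)}{\partial f_a} = - \frac{\partial \ell_{\mathrm{tri}}(a, p, n)}{\partial f_p} - \frac{\partial \ell_{\mathrm{tri}}(a, p, n)}{\partial f_n}, \tag{4}$$

where $\mathbb{1}$ is the indicator function. One may notice that the gradients only consider the directions between the embedding vectors and the rank constraint violation indicator. If the rank constraint is satisfied, all the gradients are zero.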
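The behavior of the conventional triplet loss and its gradients can be illustrated with a small, framework-free sketch. It assumes squared Euclidean distances and a hinge with margin, as in the description above. The function names and the default margin are illustrative, and inputs are expected to be already normalized.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Hinge on D(a, p) - D(a, n) + margin."""
    return max(0.0, sq_dist(f_a, f_p) - sq_dist(f_a, f_n) + margin)


def triplet_grads(f_a, f_p, f_n, margin=0.2):
    """Gradients w.r.t. (f_a, f_p, f_n); all zero when the hinge is inactive."""
    active = 1.0 if triplet_loss(f_a, f_p, f_n, margin) > 0 else 0.0
    g_p = [2.0 * (p - a) * active for a, p in zip(f_a, f_p)]
    g_n = [2.0 * (a - n) * active for a, n in zip(f_a, f_n)]
    g_a = [-(gp + gn) for gp, gn in zip(g_p, g_n)]
    return g_a, g_p, g_n
```

When the positive is already closer than the negative by more than the margin, every gradient vanishes, regardless of how the two distances compare in magnitude.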
3.2 Log-ratio Loss
Given a triplet of samples, we propose a log-ratio loss that aims to approximate the ratio of label distances by the ratio of distances in the learned embedding space. Specifically, we define the loss function as

$$\ell_{\mathrm{lr}}(a, i, j) = \left\{ \log \frac{D(f_a, f_i)}{D(f_a, f_j)} - \log \frac{D(y_a, y_i)}{D(y_a, y_j)} \right\}^2 , \tag{5}$$

where $f$ indicates an embedding vector, $y$ is a continuous label, and $D(\cdot)$ denotes the squared Euclidean distance. Also, $(a, i, j)$ is a triplet of an anchor $a$ and its two neighbors $i$ and $j$ without positive-negative separation, unlike $p$ and $n$ in Eq. (1). By approximating ratios between label distances instead of the distances themselves, the proposed loss allows a metric space to be learned more flexibly regardless of the scale of the labels.

The main advantage of the log-ratio loss is that it allows a learned metric space to reflect degrees of label similarities as well as their rank. Ideally, the distance between two images in the learned metric space will be proportional to their distance in the label space. Hence, an embedding network trained with our loss can represent continuous similarities between images more thoroughly than those focusing only on the rank of similarities like the triplet loss. This property of the log-ratio loss can also be explained through its gradients, which are given by

$$\frac{\partial \ell_{\mathrm{lr}}(a, i, j)}{\partial f_i} = \frac{f_i - f_a}{D(f_a, f_i)} \cdot \ell'_{\mathrm{lr}}(a, i, j), \tag{6}$$

$$\frac{\partial \ell_{\mathrm{lr}}(a, i, j)}{\partial f_j} = \frac{f_a - f_j}{D(f_a, f_j)} \cdot \ell'_{\mathrm{lr}}(a, i, j), \tag{7}$$

$$\frac{\partial \ell_{\mathrm{lr}}(a, i, j)}{\partial f_a} = - \frac{\partial \ell_{\mathrm{lr}}(a, i, j)}{\partial f_i} - \frac{\partial \ell_{\mathrm{lr}}(a, i, j)}{\partial f_j}, \tag{8}$$

where $\ell'_{\mathrm{lr}}(a, i, j)$ is a scalar value computed by

$$\ell'_{\mathrm{lr}}(a, i, j) = 4 \left\{ \log \frac{D(f_a, f_i)}{D(f_a, f_j)} - \log \frac{D(y_a, y_i)}{D(y_a, y_j)} \right\} . \tag{9}$$

As shown in Eq. (6) and (7), the gradients of the log-ratio loss are determined not only by the directions between the embedding vectors but also by $\ell'_{\mathrm{lr}}(a, i, j)$, which quantifies the discrepancy between the distance ratio in the label space and that in the embedding space. Thus, even when the rank constraint is satisfied, the magnitudes of the gradients could be significant if the magnitude of $\ell'_{\mathrm{lr}}(a, i, j)$ is large. In contrast, the gradients of the triplet loss in Eq. (2) and (3) become zero under the same condition.

Another advantage of the log-ratio loss is that it is parameter-free. Unlike ours, the triplet loss requires the margin, which is a hyper-parameter tuned manually and forces embedding vectors to be normalized. Last but not least, we empirically find that the log-ratio loss can outperform the triplet loss even with embeddings of a significantly lower dimensionality, which enables more efficient and effective image retrieval.
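A minimal sketch of the log-ratio loss and of the scalar that scales its gradients is given below. It assumes that the label distances are precomputed and passed in as `d_y_ai` and `d_y_aj`. The function names and the small constant `eps`, added to avoid taking the logarithm of zero, are assumptions of this sketch and not part of the formulation above.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def _log_ratios(f_a, f_i, f_j, d_y_ai, d_y_aj, eps=1e-12):
    """Log distance ratio in the embedding space and in the label space."""
    emb = math.log(sq_dist(f_a, f_i) + eps) - math.log(sq_dist(f_a, f_j) + eps)
    lab = math.log(d_y_ai + eps) - math.log(d_y_aj + eps)
    return emb, lab


def log_ratio_loss(f_a, f_i, f_j, d_y_ai, d_y_aj, eps=1e-12):
    """Squared difference between the two log-ratios."""
    emb, lab = _log_ratios(f_a, f_i, f_j, d_y_ai, d_y_aj, eps)
    return (emb - lab) ** 2


def log_ratio_grad_scale(f_a, f_i, f_j, d_y_ai, d_y_aj, eps=1e-12):
    """Scalar that multiplies the gradient directions (4x the discrepancy)."""
    emb, lab = _log_ratios(f_a, f_i, f_j, d_y_ai, d_y_aj, eps)
    return 4.0 * (emb - lab)
```

As a usage example, take label distances 1 and 4. Embeddings at `[0.0]`, `[1.0]`, `[2.0]` reproduce the ratio and give a loss of nearly zero. Moving the farther neighbor to `[10.0]` keeps the rank constraint satisfied but yields a clearly nonzero loss and gradient scale, whereas the triplet loss would give no gradient in that case.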
3.3 Dense Triplet Mining
The existing triplet mining methods in Section 2.2 cannot be used in our framework since they are specialized to handle images annotated by discrete and categorical labels. Hence, we design our own triplet mining method that is well matched with the log-ratio loss.

First of all, we construct a minibatch $B$ of training samples with an anchor $a$, nearest neighbors of the anchor in terms of label distance, and other neighbors randomly sampled from the remaining ones. Note that including nearest neighbors helps speed up training. Since the label distance between an anchor and its nearest neighbor is relatively small, triplets with a nearest neighbor sample in general induce large log-ratios of label distances in Eq. (9), which may consequently increase the magnitudes of the associated gradients.

Given a minibatch, we aim to exploit all triplets sharing the anchor so that our embedding network can observe the greatest variety of triplets during training. To this end, we sample triplets by choosing every pair of neighbors $(i, j)$ in the minibatch and combining them with the anchor $a$. Furthermore, since $(a, i, j)$ and $(a, j, i)$ have no difference in our loss, we choose only $(a, i, j)$ and disregard $(a, j, i)$ when $D(y_a, y_i) < D(y_a, y_j)$ to avoid duplication. We call the above procedure dense triplet mining. The set of triplets densely sampled from the minibatch $B$ is then given by

$$\mathcal{T}(B) = \big\{ (a, i, j) \;\big|\; D(y_a, y_i) < D(y_a, y_j),\; i \in B \setminus \{a\},\; j \in B \setminus \{a\} \big\} . \tag{10}$$

Note that our dense triplet mining strategy can also be combined with the triplet loss, which is re-formulated as

$$\ell_{\mathrm{dense}}(a, i, j) = \big[ D(f_a, f_i) - D(f_a, f_j) + \delta \big]_+ \quad \text{subject to } (a, i, j) \in \mathcal{T}(B), \tag{11}$$

where the margin $\delta$ is set small compared to that of $\ell_{\mathrm{tri}}$ in Eq. (1) since the label distance between $i$ and $j$ could be quite small when they are densely sampled. This dense triplet loss is a strong baseline for our log-ratio loss. However, it still requires normalization of embedding vectors and ignores degrees of similarities as the conventional triplet loss does. Hence, it can be regarded as an intermediary between the existing approaches in Section 2.3 and our whole framework, and will be empirically analyzed as an ablation study in the next section.
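The dense sampling procedure can be sketched as an enumeration over all pairs of neighbors of a single anchor. In this sketch, the function name, the argument `label_dist`, and the choice to skip pairs with exactly equal label distances are assumptions made for illustration.

```python
from itertools import combinations


def dense_triplets(anchor, neighbors, label_dist):
    """Enumerate every (anchor, i, j) with label_dist(a, i) < label_dist(a, j).

    Each unordered pair of neighbors yields at most one triplet, so that
    (a, i, j) and (a, j, i) are never both produced.
    """
    triplets = []
    for i, j in combinations(neighbors, 2):
        d_i = label_dist(anchor, i)
        d_j = label_dist(anchor, j)
        if d_i < d_j:
            triplets.append((anchor, i, j))
        elif d_j < d_i:
            triplets.append((anchor, j, i))
        # Pairs with equal label distances are skipped (assumption).
    return triplets


# Toy example: samples are indexed by integers with scalar labels.
labels = {0: 0.0, 1: 1.0, 2: 3.0, 3: 2.0}
toy_dist = lambda a, b: (labels[a] - labels[b]) ** 2
toy_triplets = dense_triplets(0, [1, 2, 3], toy_dist)
```

A minibatch with one anchor and m neighbors thus yields up to m(m-1)/2 triplets, with no quantization parameter involved.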
4 Experiments
The effectiveness of the proposed framework is validated on three different image retrieval tasks based on continuous similarities: human pose retrieval on the MPII human pose dataset, room layout retrieval on the LSUN dataset, and caption-aware image retrieval on the MS-COCO dataset. We also demonstrate that an image embedding CNN trained with caption similarities through our framework can be transferred to image captioning as an effective visual representation. In the rest of this section, we first define the evaluation metrics and describe implementation details, then present qualitative and quantitative analyses of our approach on the retrieval and representation learning tasks.
4.1 Evaluation: Measures and Baselines
Evaluation metrics. Since image labels are continuous and/or structured in our retrieval tasks, it is not appropriate to evaluate performance based on standard metrics like Recall@$k$. Instead, following an existing protocol, we adopt two evaluation metrics, mean label distance and a modified version of nDCG [8, 27]. The mean label distance is the average of distances between queries and retrieved images in the label space, and a smaller value means a better retrieval quality. The modified nDCG considers the rank of retrieved images as well as their relevance scores, and is defined as

$$\mathrm{nDCG}_K(q) = \frac{1}{Z_K} \sum_{i=1}^{K} \frac{2^{r_i}}{\log_2 (i + 1)} , \tag{12}$$

where $K$ is the number of top retrievals of our interest and $Z_K$ is a normalization factor to guarantee that the maximum value of $\mathrm{nDCG}_K(q)$ is 1. Also, $r_i$ denotes the relevance between query $q$ and the $i$-th retrieval, which is discounted by $\log_2 (i + 1)$ to place a greater emphasis on one returned at a higher rank. A higher nDCG means a better retrieval quality.

Common baselines. In the three retrieval tasks, our method is compared with its variants for ablation study. These approaches are denoted by combinations of a loss function and a triplet mining strategy, where (Log-ratio) is our log-ratio loss, (Triplet) means the triplet loss, (Dense) denotes the dense triplet mining, and (Binary) indicates the triplet mining based on binary quantization. Specifically, (Binary) is implemented by nearest neighbor search, where the 30 neighbors closest to the anchor are regarded as positive. Our model is then represented as (Log-ratio)+(Dense). We also compare our model with the same network trained with the margin-based loss and distance weighted sampling, a state-of-the-art approach in conventional metric learning. Finally, we present scores of Oracle and ImageNet pretrained ResNet-34 as upper and lower performance bounds. Note that the nDCG of Oracle is always 1.
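The two evaluation measures can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the relevance scores are taken as given inputs, and the normalization factor is assumed to be the discounted gain of the ideal ranking over the candidate pool.

```python
import math


def mean_label_distance(label_dists):
    """Average label-space distance between a query and its retrievals."""
    return sum(label_dists) / len(label_dists)


def dcg(relevances):
    """Discounted gain: 2^r_i / log2(i + 1), with ranks i starting at 1."""
    return sum(2 ** r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))


def ndcg_at_k(retrieved_relevances, pool_relevances, k):
    """nDCG over the top-k retrievals.

    `pool_relevances` holds the relevance of every candidate to the query;
    its k largest values, in descending order, define the ideal ranking
    used for normalization (assumption).
    """
    ideal = sorted(pool_relevances, reverse=True)[:k]
    return dcg(retrieved_relevances[:k]) / dcg(ideal)
```

With this normalization, a ranking that returns the most relevant candidates in descending order of relevance scores exactly 1, and any other ordering scores lower.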
4.2 Implementation Details
Datasets. For the human pose retrieval, we directly adopt the dataset and setting of previous work. Among 22,285 full-body pose images in total, 12,366 images are used for training and 9,919 for testing, while 1,919 images of the test set are used as queries for retrieval. For the room layout retrieval, we adopt the LSUN room layout dataset, which contains 4,000 training images and 394 validation images of 11 layout classes. Since we are interested in continuous and fine-grained labels only, we use only the 1,996 images of layout class 5, which is the class with the largest number of images. Among them, 1,808 images are used for training and 188 for testing, of which 30 images are employed as queries. Finally, for the caption-aware image retrieval, the MS-COCO 2014 caption dataset is used. We follow the Karpathy split, where 113,287 images are prepared for training and 5,000 images each for validation and testing. The retrieval test is conducted only on the testing set, where 500 images are used as queries.

Preprocessing and data augmentation. For the human pose retrieval, we directly adopt the data augmentation techniques used in previous work. For the room layout retrieval, the images are resized for both training and testing, and flipped horizontally at random during training. For the caption-aware retrieval, images are jittered in both scale and location, cropped, and flipped horizontally at random during training. Meanwhile, test images are simply resized and cropped at the center.

Embedding networks and their training. For the human pose and room layout retrieval, we choose ResNet-34 as our backbone network and append a 128-D FC layer on top for embedding. They are optimized by SGD with exponential decay for 15 epochs. For the caption-aware image retrieval, ResNet-101 with a 1,024-dimensional embedding layer is adopted since captions usually contain more comprehensive information than human poses and room layouts. This network is optimized by ADAM for 5 epochs. All the networks are implemented in PyTorch and pretrained on ImageNet before being finetuned.

Hyper-parameters. The size of the minibatch is set to 150 for the human pose, 100 for the room layout, and 50 for the caption-aware image retrieval. The number of nearest neighbors in the minibatch for the dense triplet mining is set to 5 for all experiments. For the common baselines, the margin of the conventional triplet loss is set to 0.2 and that of the dense triplet loss to 0.03.
4.3 Human Pose Retrieval
The goal of human pose retrieval is to search for images similar to the query in terms of the human poses they exhibit. Following prior work, the distance between two poses is defined as the sum of Euclidean distances between body-joint locations. Our model is compared with the previous pose retrieval model called thin-slicing and a CNN for explicit pose estimation as well as the common baselines.

Quantitative evaluation results of these approaches are summarized in Figure 3(a), where our model clearly outperforms all the others. In addition, comparisons between ours and its two variants (Triplet)+(Dense) and (Triplet)+(Binary) demonstrate that both our log-ratio loss and the dense triplet mining contribute to the improvement. Qualitative examples of human pose retrieval are presented in Figure 4. Our model and thin-slicing overall successfully retrieve images exhibiting human poses similar to the queries, while ResNet-34 focuses mostly on object classes and background components. Moreover, ours tends to capture subtle characteristics of human poses (e.g., bending left-arms in Figure 4(b)) and handle rare queries (e.g., Figure 4(e)) better than thin-slicing.

Finally, we evaluate the human pose retrieval performance by varying the embedding dimensionality to show how effective our embedding space is. As illustrated in Figure 5, when decreasing the embedding dimensionality to 16, the performance of our model drops marginally while that of (Triplet)+(Dense) is reduced significantly. Consequently, the 16-D embedding of our model outperforms the 128-D embedding of (Triplet)+(Dense). This result demonstrates the superior quality of the embedding space learned by our log-ratio loss.
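The pose distance described in this section can be sketched in a few lines. The representation of a pose as a list of 2-D joint coordinates in a shared, already aligned coordinate frame is an assumption of this sketch, as is the function name.

```python
import math


def pose_distance(pose_a, pose_b):
    """Sum of Euclidean distances between corresponding body joints.

    Each pose is a sequence of (x, y) joint locations listed in the
    same joint order.
    """
    return sum(math.dist(p, q) for p, q in zip(pose_a, pose_b))
```

For example, two poses that differ only in one joint, displaced by 3 horizontally and 4 vertically, are at distance 5.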
4.4 Room Layout Retrieval
The goal of this task is to retrieve images whose 3-D room layouts are most similar to that of the query image, with no explicit layout estimation at test time. We define the distance between two rooms $i$ and $j$ in terms of their room layouts as $1 - \mathrm{mIoU}(R_i, R_j)$, where $R$ denotes the groundtruth room segmentation map and mIoU denotes mean Intersection-over-Union. Since this paper is the first attempt to tackle the room layout retrieval task, we compare our approach only with the common baselines.

As shown quantitatively in Figure 3(b), the advantage of the dense triplet mining is not significant in this task, probably because room layout labels of the training images are diverse and sparse so that it is not straightforward to sample triplets densely. Nevertheless, our model outperforms all the baselines by a noticeable margin thanks to the effectiveness of our log-ratio loss. Qualitative results of the room layout retrieval are illustrated in Figure 6. As in the case of the pose retrieval, results of the ImageNet pretrained model are frequently affected by object classes irrelevant to room layouts (e.g., bed in Figure 6(b) and sofa in Figure 6(d)), while those of our approach are accurate and robust against such distractors.
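A label distance based on mean Intersection-over-Union between two room segmentation maps can be sketched as below. The maps are assumed to be equally sized 2-D grids of integer region labels, and the mean is taken over the labels that appear in either map. Both choices, like the function names, are assumptions of this sketch.

```python
def mean_iou(seg_a, seg_b):
    """Mean IoU over the region labels present in either segmentation map."""
    flat_a = [c for row in seg_a for c in row]
    flat_b = [c for row in seg_b for c in row]
    ious = []
    for c in set(flat_a) | set(flat_b):
        inter = sum(1 for x, y in zip(flat_a, flat_b) if x == c and y == c)
        union = sum(1 for x, y in zip(flat_a, flat_b) if x == c or y == c)
        ious.append(inter / union)
    return sum(ious) / len(ious)


def layout_distance(seg_a, seg_b):
    """Label distance between two rooms: one minus the mean IoU."""
    return 1.0 - mean_iou(seg_a, seg_b)
```

Identical layouts are at distance 0, and the distance grows toward 1 as the region overlap shrinks, which gives a continuous label distance without any class boundary.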
4.5 Caption-aware Image Retrieval
An image caption describes image content thoroughly. It is not a simple combination of object classes, but involves richer information including their numbers, actions, interactions, and relative locations. Thus, using caption similarities as supervision allows our model to learn image relations based on comprehensive image understanding. Motivated by this, we address the caption-aware image retrieval task, which aims to retrieve images described by captions most similar to those of the query. To define a caption-aware image distance, we adopt a sentence distance metric called Word Mover's Distance (WMD). Let $W(x, y)$ be the WMD between two captions $x$ and $y$. As each image in our target dataset has 5 captions, we compute the distance between two caption sets $X$ and $Y$ through WMD by

$$D(X, Y) = \sum_{x \in X} \min_{y \in Y} W(x, y) + \sum_{y \in Y} \min_{x \in X} W(x, y) . \tag{13}$$

We train our model and the common baselines with the WMD labels. As shown in Figure 3(c), our model outperforms all the baselines, and both the log-ratio loss and the dense triplet mining clearly contribute to the performance boost, while the improvement is moderate due to the difficulty of the task itself. As illustrated in Figure 7, our model successfully retrieves images that contain high-level image content described by queries, such as object-object interactions (e.g., person-umbrella in Figure 7(a)), object actions (e.g., holding something in Figure 7(b,d)), and specific objects of interest (e.g., hydrant in Figure 7(c)). In contrast, the two baselines in Figure 7 often fail to retrieve relevant images, especially those for actions and interactions.
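One way to lift a sentence-level distance to sets of captions is to sum, in both directions, the distance from each caption to its closest counterpart in the other set. The sketch below assumes this symmetric form. To stay self-contained, it uses a toy Jaccard distance over word sets as a stand-in for WMD, which is not the metric used in the experiments. All function names are hypothetical.

```python
def caption_set_distance(captions_x, captions_y, sentence_dist):
    """Symmetric sum of nearest-caption distances between two caption sets."""
    forward = sum(min(sentence_dist(x, y) for y in captions_y) for x in captions_x)
    backward = sum(min(sentence_dist(x, y) for x in captions_x) for y in captions_y)
    return forward + backward


def jaccard_word_distance(x, y):
    """Toy stand-in for WMD: one minus the Jaccard overlap of word sets."""
    words_x, words_y = set(x.split()), set(y.split())
    return 1.0 - len(words_x & words_y) / len(words_x | words_y)
```

Any sentence-level metric with the same call signature, such as a real WMD implementation, could be passed as `sentence_dist` in place of the toy function.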
4.6 Representation Learning for Image Captioning
An ImageNet pretrained CNN has been widely adopted as an initial or fixed visual feature extractor in many image captioning models [9, 52, 14, 38]. As shown in Figure 7, however, similarities between image pairs in the ImageNet feature space do not guarantee their caption similarities. One way to further improve image captioning quality would be to exploit caption labels for learning a visual representation specialized to image captioning.

Motivated by the above observation, we believe that a CNN learned with caption similarities through our continuous metric learning framework can be a way to implement this idea. To this end, we adopt our caption-aware retrieval model described in Section 4.5 as an initial, caption-aware visual feature extractor of two image captioning networks: Att2all2 and Topdown. Specifically, our caption-aware feature extractor is compared with our ImageNet pretrained baseline, and average pooled outputs of their last convolution layers are utilized as caption-aware and ImageNet pretrained features. For training the two captioning networks, we directly follow a previously proposed training scheme, which first pretrains the networks with cross-entropy (XE) loss and then finetunes them using reinforcement learning (RL) with the CIDEr-D metric.
Table 1 quantitatively summarizes the captioning performance of the ImageNet pretrained feature and our caption-aware feature. The scores of the reproduced baseline are similar to or higher than those reported in its original paper. Nonetheless, our caption-aware feature consistently outperforms the baseline in all evaluation metrics and for both captioning models. Also, qualitative examples of captions generated by the models in Table 1 are presented in Figure 8, where the baselines generate incorrect captions while the models based on our caption-aware feature avoid choosing the wrong word and generate better captions.
5 Conclusion
We have presented a novel loss and tuple mining strategy for deep metric learning using continuous labels. Our approach has achieved impressive performance on three different image retrieval tasks with continuous labels using human poses, room layouts, and image captions. Moreover, we have shown that our framework can be used to learn visual representations with continuous labels. In the future, we will explore the effect of label distance metrics and a hard tuple mining technique for continuous metric learning to further improve the quality of the learned metric space.
Acknowledgement: This work was supported by Basic Science Research Program and R&D program for Advanced Integrated-intelligence for IDentification through the National Research Foundation of Korea funded by the Ministry of Science, ICT (NRF-2018R1C1B6001223, NRF-2018R1A5A1060031, NRF-2018M3E3A1057306, NRF-2017R1E1A1A01077999), and by the Louis Vuitton - ENS Chair on Artificial Intelligence.
-  P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In Proc. European Conference on Computer Vision (ECCV), pages 382–398. Springer, 2016.
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang.
Bottom-up and top-down attention for image captioning and visual
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  V. Balntas, S. Li, and V. Prisacariu. Relocnet: Continuous metric learning relocalisation using neural nets. In Proc. European Conference on Computer Vision (ECCV), 2018.
-  J. Bromley, I. Guyon, Y. Lecun, E. Säckinger, and R. Shah. Signature verification using a ”siamese” time delay neural network. In Proc. Neural Information Processing Systems (NIPS), 1994.
-  M. Bucher, S. Herbin, and F. Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In Proc. European Conference on Computer Vision (ECCV), 2016.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and
Learning to rank using gradient descent.
Proc. International Conference on Machine Learning (ICML), 2005.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
-  W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proc. Neural Information Processing Systems (NIPS), 2014.
-  S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
-  A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Niessner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  B. Dai and D. Lin. Contrastive learning for image captioning. In Advances in Neural Information Processing Systems, pages 898–907, 2017.
-  M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, 2014.
-  A. Gordo and D. Larlus. Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
C. Huang, C. C. Loy, and X. Tang.
Local similarity-aware deep feature embedding.In Proc. Neural Information Processing Systems (NIPS), 2016.
-  A. Iscen, G. Tolias, Y. Avrithis, and O. Chum. Mining on manifolds: Metric learning without labels. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
-  A. Karpathy. Neuraltalk2. https://github.com/karpathy/neuraltalk2.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. International Conference on Learning Representations (ICLR), 2015.
-  R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 2017.
-  V. Kumar B G, G. Carneiro, and I. Reid. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In Proc. International Conference on Machine Learning (ICML), 2015.
-  S. Kwak, M. Cho, and I. Laptev. Thin-slicing for pose: Learning to understand pose without explicit pose estimation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  M. T. Law, R. Urtasun, and R. S. Zemel. Deep spectral clustering learning. In Proc. International Conference on Machine Learning (ICML), 2017.
-  C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In Proc. European Conference on Computer Vision (ECCV), 2014.
-  C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In Proc. European Conference on Computer Vision (ECCV), 2016.
-  R. Luo, B. Price, S. Cohen, and G. Shakhnarovich. Discriminability objective for training descriptive captions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  G. Mori, C. Pantofaru, N. Kothari, T. Leung, G. Toderici, A. Toshev, and W. Yang. Pose embeddings: A deep architecture for learning to match human poses. arXiv preprint arXiv:1507.00302, 2015.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 2002.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In AutoDiff, NIPS Workshop, 2017.
-  B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
-  W. V. Quine. Ontological relativity, and other essays. Columbia University Press, New York, 1969.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pages 1–42, April 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li. Embedding deep metric for person re-identification: A study against large variations. In Proc. European Conference on Computer Vision (ECCV), 2016.
-  K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Proc. Neural Information Processing Systems (NIPS), 2016.
-  J. Son, M. Baek, M. Cho, and B. Han. Multi-object tracking with quadruplet convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  H. O. Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  O. Sumer, T. Dencker, and B. Ommer. Self-supervised learning of pose embeddings from spatiotemporal relations in videos. In Proc. IEEE International Conference on Computer Vision (ICCV), 2017.
-  O. Tadmor, T. Rosenwein, S. Shalev-Shwartz, Y. Wexler, and A. Shashua. Learning a metric embedding for face recognition using the multibatch method. In Proc. Neural Information Processing Systems (NIPS), 2016.
-  R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese instance search for tracking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Proc. Neural Information Processing Systems (NIPS), 2016.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  L. Wang, A. Schwing, and S. Lazebnik. Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space. In Advances in Neural Information Processing Systems, pages 5756–5766, 2017.
-  X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
-  K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Proc. Neural Information Processing Systems (NIPS), 2006.
-  P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3d pose estimation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl. Sampling matters in deep embedding learning. In Proc. IEEE International Conference on Computer Vision (ICCV), 2017.
-  P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014.
-  Y. Yuan, K. Yang, and C. Zhang. Hard-aware deeply cascaded embedding. In Proc. IEEE International Conference on Computer Vision (ICCV), 2017.
-  S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Y. Zhang, F. Yu, S. Song, P. Xu, A. Seff, and J. Xiao. Large-scale scene understanding challenge: Room layout estimation.
In this appendix, we first present additional qualitative results omitted from the main paper due to space limits. Results for human pose retrieval, room layout retrieval, and caption-aware image retrieval are given in Sections A.1, A.2, and A.3, respectively. Section A.4 then provides implementation details, further qualitative examples, and an in-depth analysis of image captioning with our caption-aware visual features.
A.1 Human Pose Retrieval
More qualitative examples for human pose retrieval are presented in Figure 9. As in the main paper, results of our model are compared with those of a baseline, (Triplet)+(Binary), which is an advanced version of the previous approach that adopts the same backbone network as ours for a fair comparison. Overall, even the top-64 retrievals of our model remain relevant to the queries, while those of the baseline are sometimes incorrect when a large number of images are retrieved. Also, when the query exhibits a rare pose, as in the 2nd and 4th rows, the baseline includes irrelevant images (e.g., horizontally flipped poses) even among its top-4 retrievals, while ours still works successfully. In Figure 10, we visualize the embedding manifold learned by our approach through 3D t-SNE. In the manifold, poses are locally consistent and change smoothly between two distant coordinates, as illustrated by the series of pose images on the left-hand side of Figure 10. More interestingly, common poses (images with black boundaries) are densely populated, while rare ones (images with red boundaries) are scattered on the other side. This observation suggests that the learned manifold preserves degrees of similarity between images.
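To make the notion of "top-K retrievals" concrete, the following is a minimal sketch of nearest-neighbor retrieval in an embedding space. It assumes that images are ranked by Euclidean distance between their embedding vectors; the function name `retrieve_top_k` and the toy 2D embeddings are hypothetical and serve only as illustration, not as the implementation used in our experiments.

```python
import numpy as np


def retrieve_top_k(query, gallery, k):
    """Return indices of the k gallery embeddings closest to the query.

    Assumption: relevance is ranked by (squared) Euclidean distance in the
    learned embedding space; smaller distance means more similar.
    """
    # Squared Euclidean distance from the query to every gallery embedding.
    dists = np.sum((gallery - query) ** 2, axis=1)
    # Stable sort so that ties keep their original gallery order.
    return np.argsort(dists, kind="stable")[:k]


# Toy 2D embeddings standing in for CNN outputs (hypothetical values).
gallery = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.1])
top2 = retrieve_top_k(query, gallery, k=2)
```

In this toy case the two nearest gallery items are indices 1 and 0, in that order. In practice the gallery would hold the embeddings of all test images and K would range up to 64 as in Figure 9.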
A.2 Room Layout Retrieval
More qualitative examples of retrieved room layouts are illustrated in Figure 11. The images are blended with their groundtruth segmentation masks, and the value above each image is its mIOU score with respect to the query. Note that the higher the mIOU score, the more relevant the image is to the query. As in Section A.1, our model is compared with the baseline (Triplet)+(Binary). In all examples, the top-1 image retrieved by our model has a higher mIOU than that of the baseline, and our top-4 retrievals also have higher mIOU than those of the baseline. In the 1st, 2nd, and 4th rows, some of the baseline's top-4 retrievals are less relevant to the query than its top-8 or top-16 retrievals. In contrast, the mIOU of the images retrieved by our method decreases gradually with retrieval rank, as one would expect. Overall, our model performs better than the baseline.
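As an illustration of the mIOU score shown above each image, the sketch below computes a mean intersection-over-union between two segmentation masks. It assumes that masks are integer label maps of equal size and that the mean is taken over classes present in at least one of the two masks; this averaging convention and the function name `mean_iou` are assumptions made for the example and may differ from the exact protocol used in the experiments.

```python
import numpy as np


def mean_iou(mask_a, mask_b, num_classes):
    """Mean intersection-over-union between two integer label masks.

    Assumption: classes absent from both masks are skipped rather than
    counted as IoU 0 or 1.
    """
    ious = []
    for c in range(num_classes):
        a = mask_a == c
        b = mask_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class c appears in neither mask
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 0.0


# Toy 3x3 layouts with three surface classes (hypothetical values).
query_mask = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 2]])
other_mask = np.array([[0, 0, 0], [0, 1, 1], [2, 2, 1]])
score = mean_iou(query_mask, other_mask, num_classes=3)
```

Here the per-class IoUs are 3/4, 2/4, and 2/3, so the score is their average; identical masks give a score of 1.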
A.3 Caption-aware Image Retrieval
Our model and the baseline (Triplet)+(Binary) are compared qualitatively on more retrieval examples in Figure 12. Note that the baseline can be considered a variant of the existing caption-aware retrieval model, which also employs a triplet loss and a tuple mining strategy based on nearest neighbors, as the baseline does. The queries in the 1st to 3rd rows are images of people performing specific actions (e.g., someone holding a cake), and the query in the 4th row is an image of objects at a particular location (e.g., donuts in a display case). One may notice that the retrieval results can become less relevant after the top 16, since some queries have only a few relevant examples. For the 1st and 2nd queries, both models successfully retrieve relevant images. In the 3rd and 4th rows, on the other hand, our model performs better than the baseline. In the 3rd row, the baseline sometimes retrieves images that are not related to the 'cake' at all. In the 4th row, our model retrieves images related to the 'donut' up to the top 16, whereas the baseline does not work well. For the last query, both models fail to retrieve relevant images since the query image is unusual and the number of test images relevant to it is not sufficiently large.
A.4 Image Captioning
Our experiments are based on an existing implementation. For the att2all2 model, our hyperparameters differ slightly from the original settings in the number of training epochs, the batch size, and the learning rate during reinforcement learning. We first train the captioning model using the cross-entropy loss with a batch size of 16 for 20 epochs; we then increase the batch size to 32, set the learning rate to 5e-4, and run a further 30 epochs of reinforcement learning. For the topdown model, the hyperparameters are the same as for the att2all2 model, except that the batch size is 32 and each LSTM has 1024 hidden units. Note that this setting is different from that of the original work on the topdown model.
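The two-stage schedule described above can be summarized as a small configuration sketch. The class and field names below are hypothetical, and two points are assumptions rather than stated facts: the batch size of 32 for the topdown model is taken to apply to the cross-entropy stage, and the hidden size of the att2all2 model is left unset because the text does not give it.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CaptionTrainConfig:
    xe_batch_size: int                # batch size for cross-entropy training
    xe_epochs: int                    # epochs of cross-entropy training
    rl_batch_size: int                # batch size for reinforcement learning
    rl_epochs: int                    # further epochs of reinforcement learning
    rl_learning_rate: float           # learning rate during reinforcement learning
    lstm_hidden_units: Optional[int]  # None where the text gives no value


# att2all2: cross-entropy with batch size 16 for 20 epochs, then
# reinforcement learning with batch size 32, lr 5e-4, for 30 more epochs.
ATT2ALL2 = CaptionTrainConfig(
    xe_batch_size=16,
    xe_epochs=20,
    rl_batch_size=32,
    rl_epochs=30,
    rl_learning_rate=5e-4,
    lstm_hidden_units=None,
)

# topdown: same as att2all2 except batch size 32 and 1024 hidden units.
# Assumption: the batch size of 32 refers to the cross-entropy stage.
TOPDOWN = replace(ATT2ALL2, xe_batch_size=32, lstm_hidden_units=1024)


def total_epochs(cfg: CaptionTrainConfig) -> int:
    """Total number of training epochs over both stages."""
    return cfg.xe_epochs + cfg.rl_epochs
```

Under these assumptions both models are trained for 50 epochs in total, and the two configurations differ only in the cross-entropy batch size and the LSTM hidden size.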
We introduce six interesting examples in Figure 15 and Figure 15, and two failure cases in Figure 15. Figure 15 includes two complementary pairs for which the model trained on ImageNet-pretrained features confuses the two actions with each other. In contrast, the model trained on caption-aware features generates the sentences without such confusion. This result suggests that the model distinguishes the visual semantics properly rather than repeating one side of each complementary pair. Figure 15 shows two examples from the att2all2 model. These examples show that both models (att2all2 and topdown) have similar characteristics. Two failure cases of our feature are shown in Figure 15. In both cases, the model with our feature fails to produce a proper caption because it fails to find the object correctly.
In Figure 16, we examine the difference in attention for two typical examples. For these examples, the model using ImageNet-pretrained features and the model using caption-aware features generate almost the same caption except for the described action. However, the tendency of attention differs slightly between the two models. The baseline model places strong attention on marginal parts of the image while predicting the action, whereas the model using our feature attends strongly to the object. This result corresponds to the common-sense expectation that one needs to focus on the object in order to notice subtle changes in posture.