Let's Transfer Transformations of Shared Semantic Representations

03/02/2019 ∙ by Nam Vo, et al. ∙ Google Georgia Institute of Technology 12

With a good image understanding capability, can we manipulate the images high level semantic representation? Such transformation operation can be used to generate or retrieve similar images but with a desired modification (for example changing beach background to street background); similar ability has been demonstrated in zero shot learning, attribute composition and attribute manipulation image search. In this work we show how one can learn transformations with no training examples by learning them on another domain and then transfer to the target domain. This is feasible if: first, transformation training data is more accessible in the other domain and second, both domains share similar semantics such that one can learn transformations in a shared embedding space. We demonstrate this on an image retrieval task where search query is an image, plus an additional transformation specification (for example: search for images similar to this one but background is a street instead of a beach). In one experiment, we transfer transformation from synthesized 2D blobs image to 3D rendered image, and in the other, we transfer from text domain to natural image domain.



There are no comments yet.


page 3

page 4

page 5

page 6

page 7

page 8

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: Example of transformations that can be applied to 2 different domains. If they share similar semantics, we can learn on one and transfer to the other.

A smart image to image retrieval system should be able to incorporates user feedbacks such as relevance [24], attribute [13, 37], spatial layout [16] or text [8, 30]. This paper studies the above application; the scenario is that user want to search for images that similar to a reference image, but with an additional specification (such as “change object color to red” or “switch beach background to street”). We formulate a function parameterized by the specification, taking the reference image feature and outputting a new one that represents what the user is looking for; in this work we call such function “transformation”.

Training a vision system that can perform such kind of semantic manipulation can be straightforward. That is if there’s enough labeled data, which unfortunately is not always the case: finding images which contains desired transformation might not be possible, manually transform the images in its native domain could be a costly annotation. In this work we explore an alternative: learn the transformation function in another domain that shares similar semantics. It could be a totally different domain, or a customized simplified version of the original domain.

There are many use cases in which collecting examples in one domain is much easier, or cheaper:

  • We demonstrate this on the synthesized dataset CSS [30], where the same scene can be rendered 3D realistically or 2D simplistically. Rendering these scenes in 3D even with a GPU is still multiple magnitudes slower.

  • The second use case is image and caption [15, 36, 32]. Editing and manipulate images are highly specialized skills while manipulating text is the first thing people learn in school. In fact in our experiment we show how to generate “word replacing” transformation automatically on the fly for training.

  • Other scenarios includes 3D shape and caption [3], streetview image, computer generated image and corresponding category segmentation map [5, 23], facial images and corresponding facial landmarks [22], scene image and scene graph [12], etc. The later domains are usually easier to express transformation on. Even without manual annotation, one can automatically generate “change trees to buildings” transformation on segmentation map, or “make mouth smaller” transformation on facial landmarks.

In this work, we show that one can learn a transformation in one domain and transfer it to another domain by sharing a joint embedding space, assuming they have similar semantics and the transformation is universal to both domains. We demonstrate its usefulness on the image to image retrieval application, where the query is now a reference image plus an additional specification to enhance the query’s expressiveness [30]. 2 datasets are experimented with: the synthesized dataset CSS [30] and the image-caption dataset COCO 2014 [15], shown in Figure 1.

2 Related Works

Image retrieval: beside traditional text-to-image [32] or image-to-image retrieval [31] task, there are many image retrieval applications with other types of search query such as: sketch [25], scene layout [16], relevance feedback [24, 11], product attribute feedback [13, 37, 9, 1], dialog interaction [8] and image text combination query [30]. In this work, the image search query will be a combination of a reference image and a transformation specification. In our setup, labeled retrieval examples are not available, hence a standard training procedure like [37, 30] does not work.

Perhaps the most relevant approach is to learn a joint embedding space and analyze its structure [9, 3]. Here we learn a transformation function and use shared embedding as a way to transfer it.

Zero shot learning aims to recognize novel concepts relying on side data such as attribute [14, 2, 6] or textual description [20, 7]. This side data represents high level semantics with structure and therefore can be manipulated, composed or transformed easily by human. On the other hand, corresponding manipulation, but in the low level feature domain (like raw image), is more difficult.

GAN image generation, style transfer or translation is an active research area where high level semantics modification or synthesization of images is done [21, 10, 4, 26, 34, 19]. For example, “style” can represent a high level semantic feature that one want to enforce on the output. Reed et al [21] generate images from reference images with textual description of new “style”. Another relevant research area is works on translation between scene image, scene graph and text captions [35, 12].

Embedding and Joint embedding is one of the main approaches for many retrieval and zero shot learning tasks. It usually relies on metric learning [33, 31, 29]

in retrieval context; though other supervised setting or even unsupervised learning

[18] can also work. The result will be encoders that embed raw input into a high semantic level feature space, where retrieval or recognition is performed. Our work concerns performing transformation within such space. In [18]

, it is demonstrated that walking or perform vector arithmetic operation there can translate to similar high level semantic changes in the raw image space.

Synthesized data, simulation and domain adaptation: these areas are at high level similar to what we want to do: perform learning on another domain where label is available and apply to the target domain [23, 27, 28]. Here the source and target domains here are similar and the goal is to finetune the model trained on one domain for another by bridging the gap between 2 domains. Differently the task we are studying here requires transferring between 2 completely different domains (i.e. image and text) and so provides similarity supervision to facilitate that.

3 Method

Figure 2: Overview of our transformation transferring by sharing embedding space.

We study the problem of learning a transformation function in one domain and transfer it to another domain; we choose image retrieval for demonstration and quantitative experiments though the formulation might be applicable to other domains or other tasks.

First we formalize the problem: source domain and the target domain have the similar underlying semantics; corresponding supervision is the set where , are labeled similar if and non-similar otherwise.

Supervision for transformation is provided for the source domain: where is a ”before-transformation” example, is transformation specification/parameter and is a ”after-transformation” example. Note that the set of , and can be the same, intersected or mutually exclusive.

Given a similar labeled set but for testing on the target domain instead , the task is to for each query retrieve the correct in the pool of all examples. We propose to learn to do that by: (1) learn a shared semantic representation using supervision and (2) learn to transform that shared representation using supervision .

3.1 Deep Shared Embedding

The first step is to learn the embedding functions for each domains. For convenience, we denote:


We will make use of recent advance in deep learning for this task. In particular, CNN will be used as encoder if the domain is image and LSTM will be used if it is text.

The learning objective is for and to be close each other in this space if and

(which can be from the same domain or different) are labeled similar, and far from each other otherwise. Any any distance metric learning loss function

can be used:

We used the following function :

Where is the softmax cross-entropy function, .

3.2 Transformation in the Embedding Space

The transformation is formalized as a function where are the feature representation of example and the transformation respectively; we extend the definition of so that: when is transformation specification.

There is many kinds of feature fusion techniques which can be used as the transformation function. For example the simple concatenate fusion that we will use:

Where is concatenation operation and is a (learnable) 2 layer feed forward network. For reference, [30] benchmarks different image text fusion mechanisms in image retrieval context, we will also experiment with their proposed method TIRG.

For each transformation example , the learning objective is for close to while being far from other features in the embedding space. We use the same metric learning loss function in previous section to enfore that objective.

3.3 Domain-blind Example Augmentation

Training approach 2D-to-2D 2D-to-3D 3D-to-3D
2D-to-2D image retrieval training (16k examples) 73 - -
2D-to-3D image retrieval (16k) - 43 -
3D-to-3D image retrieval (16k) - - 72
2D-to-2D image retrieval (16k) + 2D-3D shared embedding (1k) 73 57 71
Table 1: R@1 Retrieval performance on CSS.

Note that when defining above functions, we remove domain specific notation from the variables. It’s so that they are applicable regardless which domain the examples are from. In general can also include in-domain similarity supervision and can also include cross-domain transformation supervision, if available.

If the examples in the supervision overlap, transitivity can be applied, for instance if and are labeled similar, and is a valid transformation example, then is also a valid transformation.

If the transformation is reversible (for example [add red cube] and [removed red cube]), then for each , we have also a valid transformation example. Similarly if the transformation is associative or composable and commutative.

Above strategies allow forming a diversed pool of embedding and transformation supervision for training. This can be further enhanced if it’s easy to generate examples on the fly. For instance given a text domain example ”a girl is sleeping on the beach”, and transformation ”replace word with word ”, a lot of examples can be generated by picking different and .

4 Experiment on CSS Dataset

First we experiment transferring transformation between 2 different image domains on the synthesized dataset CSS [30]. It was created for the task of image retrieval from image text compositional query; for each scene there is a 2D simple blobs and 3D rendered version image; an example query is shown in figure3. This naturally fits into our framework: (1) such composition query can be interpreted as a query image plus a transformation, here described by the text (2) we can train on the 2D images (source domain) and test on 3D images (target domain).

Figure 3: Example data of image retrieval with composition query in the CSS dataset.
Figure 4: Example 3D-to-3D retrieval result on CSS Dataset

4.1 Setup

The dataset has  16k (query, target) pairs for training. We use this as supervision for learning the transformation function using 2D images only. The image part of all queries come from a set of 1k base images. We use both 2D and 3D versions of these base images for learning the shared embedding between 2 domains. During test time, we will perform the same retrieval benchmark with the 3D image versions as in [30].

Note that we are pretending that we don’t have access to transformation examples of 3D images. We do have: (1) a lot of transformation examples in the 2D image domain (16k) and (2) a small amount of similarity labels between 2 domains (1k). This set up is to motivate our work: train on source domain and transfer to target domain where supervision is not available or more expensive (in fact [30] states that generating all these 2D images only take minues while it’s days for the 3D version even with a GPU).

4.2 Implementation detail

We used ResNet-18 to encode the 2D and 3D images, and LSTM for the transformation specification text; feature size for all of them is 512. The text feature is treated as transformation parameter. We train for 150k iterations with learning rate of 0.01.

Set up and baselines are the same as [30], we train the same system, but without any transformation transferring, for 3 cases: transformation retrieval for 2D-to-2D images, 2D-to-3D and 3D-to-3D. The main experiment is the 3D-to-3D case where we can compare directly the baseline: learning the transformation in 3D image feature space, versus our approach: learning the transformation in 2D image feature space and share it.

4.3 Result

Figure 5: Example retrieval result on COCO-val2014

We reported R@1 performance in Table 1, some qualitative result is shown in Figure 4. Our approach, without 3D-to-3D training examples, achieves comparable result to training on 3D-to-3D transformation examples. Transferring appears to be very effective for this simple dataset.

Since our method learns a shared embedding, we can do cross domain retrieval. In the 2D-to-3D retrieval case, surprisingly ours outperform actually training on 2D-to-3D examples baseline. This suggests learning a cross domain transformation retrieval is more challenging than learning in-domain and then share.

5 Experiment on Transferring Text Transformation to Image

While composing and manipulating texts is everyday task to people (who can write and read), composing and manipulating images are specialized skills. In this section we attempt to transfer text transformation to images. Note that there’s inherently differences between the vision and language, something can be described in one domain but might be difficult to fully translate to other domain. In [36], the denotation graph is introduced, where each node represents a text expression and is grounded with a set of example images; each edge in the graph represents a text transformation.

5.1 Word replacement text transformation

Figure 6: An example retrieval result: top is the query, bottom is retrieved images and captions.

We choose a very simple transformation to study here: given a text, replace a particular word with another word. As in previous experiment, we use image retrieval as demonstration task. For example if applying a text transformation of [replace beach with street] to the image of a dog running on the beach, we would want retrieve images of dogs running on the street (2nd example in figure 1).

However, exact expectation of the result is hard to define especially if the image is crowded with different things (what street scene is desired, should other objects in the scene be preserved, should the dog be kept at exact pose, or at instance level or at category level, etc). In addition to such ambiguousness, composing images is not trivial, hence collecting labels is very difficult. One can explicitly define a specific transformation in the image domain, then equate it to another specific transformation in the text domain through machine learning; while interesting, it’s not what we want study here. Our approach allows training a transformation in one domain and transfer it to the other domain without any transformation examples in the target domain.

5.2 Setup

We use the COCO train2014 dataset [15] to learn the join embedding of images and texts; it has around 80k images, each is accompanied with 5 captions.

We create a list of hundred of word replacement pairs from a pool of around 100 words (for example ”beach to street”, ”boy to girl”, ”dogs to people”, etc); theses words are manually chosen such that they are popular and visually distinctive concepts. During training, we apply word replacement to the captions to generate transformation examples on the fly.

We used pretrained ResNet-50 for encoding images, not fine-tuning the conv layers, and LSTM for encoding captions and words; the embedding size is 512. The parameter for the transformation is the word replacement pair, will be the concatenation of the encoded representations of the word to be replaced and the new word. As defined in section 3, the transformation function can be a simple 2 layers feed forward network, or a recent technique TIRG [30].

Note that since this word replacement transformation is reversible and generating new text examples on the fly is easy, we take advantage of data augmentation tricks mentioned in section 3.3.

5.3 Qualitative Result on COCO

As mentioned in section 5.1, correct retrieval result can be ambiguous to define, so we mainly focus on demonstration of qualitative result here. The COCO-val2014 with around 40k images will be used as retrieval database.

We show some result in figure 5. Somewhat reasonable retrieval result can be obtained if the replacing words are really popular and visually distinctive; this includes COCO object categories (person, dog, car, …) or common scenes (room, beach, park, …). To us, a reasonable result would be the introduction of the concept represented by the new word, while other elements in the image (subjects, background, composition,…) are kept unchanged as much as possible.

Replacing adjectives and verbs are difficult. Popular adjectives are object like attribute such as ”woody” or ”grassy”. Abstract adjectives are rare in this dataset, some might be ”young”, ”small” or ”big”. Colors might be ones that have better chance of working in our experience since they are often visually distinctive.

Verbs are the most challenging (for example last rows in figure 5

). We speculate the system relies less on verbs to matching image and text since nouns/objects are informative enough and easier to learn from (for context, recent research have demonstrated object recognition performance at superhuman level, but action recognition remains challenging). Also it could be partly because COCO is object oriented, so is ImageNet which is used for pretraining.

Finally, we note that there’s still discrepancy between images and texts even in a shared embedding. In figure 6, we show an example where top ranked images and texts are retrieved but they are not reflecting the same semantics. Hence our task could benefit with improvement in image text matching method (the one we use in this work is basic and straightforward, but slightly inferior to state of the art approaches on image-to-text and text-to-image retrieval benchmarks).

5.4 Simple Image Captions (SIC112) benchmark

Figure 7: The SIC112 dataset

In order to have a quantitative measurement, we collect a small benchmark for this task. First we manually define a set of 112 very simple, attribute-like captions, each can contain a subject (man, woman, dog, etc), a verb (running, sleeping, sitting) and a background (on the street, beach, ect). For each caption we perform a google image search to collect images, then manually filter them. On average we collect 15 images for each caption. We call this the Simple Image Captions 112 (SIC112) dataset, some examples are shown in Figure 7.

With this dataset, we can now test our image retrieval task quantitatively, by using the captions as the label for images. The retrieval is considered success if the retrieved image has the same caption label corresponding to the query image’s caption after applying the word replacement. We use the Recall at rank k (R@k) metric, which is defined as the percentage of test cases in which top k retrieved result contains at least 1 correct image. Note that the dataset is for testing only, training is done on COCO-train2014 as described in previous section.

Baselines: we compare with the following

  1. Image Only: ignore the transformation and do image to image retrieval.

  2. Embedding arithmetic: the word replacing transformation can be done by directly adding and subtracting their corresponding embedding. This simple strategy has been found to be effective in previous works on text embedding [17] (for example “king” - “man” + “woman” = “queen”), image synthesis [18] and 3D model and text joint embedding [3].

  3. Image to Text to Image retrieval: instead of transferring the transformation, this baseline translates the query image to text, perform the transformation natively and then translate it back to image. Here the translation is done by the our image text matching system since it is capable of retrieval in both direction (image to text and text to image). For image to text, our implementation uses COCO-train2014 dataset of  400k captions as text database; an alternative could be an out of the box image captioning system.

  4. Text (Ground truth target caption) to Image retrieval: this is similar to the last baseline, assuming a perfect image to text translation is given, so the ground truth caption will be used as query for retrieval.

Test queries Keep others 2, only change: keep novel subject, All
- subject verb background change background R@1 R@5 R@10
Image Only 0 0 0 0 0 12.9 21.9
Arithmetic 19.2 12.5 23.4 43.5 20.5 59.5 75.0
Img2Text2Img 19.7 14.9 17.0 12.5 17.6 45.7 59.9
Our 27.4 23.5 26.0 38.2 26.6 64.0 78.3
GT caption 40.6 40.3 39.8 29.6 39.6 82.4 92.0
Table 2: R@1 retrieval result on the SIC112 benchmark.
Figure 8: Example retrieval result on SIC112 benchmark.

Result: Some qualitative result is shown in figure 8. The retrieval performance is reported in table 2. For analysis we split test queries into 4 groups: 8205 queries changing subject (for example “girl to boy”), 3282 changing verb, 6564 changing background, and 745 special queries changing background of images which contain novel subjects (such as “stormtrooper” that doesn’t appear in COCO2014train). The last group is to demonstrate the use case where direct translation between image and text might be difficult and transferring might be more appropriate.

We consider the GT caption to image retrieval baseline as the upper bound. Among subject, verb and background, changing verb seems more challenging. Our approach outperforms the other baselines demonstrating it is more beneficial to perform the transformation in the same image domain than translating to text. Still ours is much worse than GT caption baseline, suggesting there is still a lot of room for improvement. In particular our approach could benefit a lot from better image-text joint embedding technique.

On keep novel subject change background queries, translating to text or even using GT caption result in worse performance because the system can not recognize the novel object in text domain. Performing transformation in the native image domain by embedding arithmetic operation or our approach fits this use case better. Arithmetic baseline performs very well on changing background, and even outperform ours when verb is not involved. This baseline also has the advantage that it’s simple and no additional learning need to be done. However we’d expect when the operation is more complex or subtle (for example keep verb unchanged, or change verb, or dealing with more complex captions like in COCO2014), learning a transformation function would be better than relying on simple arithmetic operation.

6 Conclusion

We propose to learn a feature transformation function where no training examples are available by learning such function on another domain with similar semantics (where training examples are abundant). Then it can be transfer to the original target domain by shared embedding feature space. We demonstrate such transformed feature can be very useful in image retrieval application. One can also learn a decoding function, for example, to generate image or text from the feature. Future works could study more complex text transformation and semantic composition beyond simple “word replacing”.


  • [1] K. E. Ak, A. A. Kassim, J. H. Lim, and J. Y. Tham. Learning attribute representations with localization for flexible fashion search. In CVPR, 2018.
  • [2] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 819–826, 2013.
  • [3] K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. arXiv preprint arXiv:1803.08495, 2018.
  • [4] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint arXiv:1711.09020, 2017.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele.

    The cityscapes dataset for semantic urban scene understanding.

    In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [6] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In European conference on computer vision, pages 48–64. Springer, 2014.
  • [7] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
  • [8] X. Guo, H. Wu, Y. Cheng, S. Rennie, and R. S. Feris. Dialog-based interactive image retrieval. arXiv preprint arXiv:1805.00145, 2018.
  • [9] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis. Automatic spatially-aware fashion concept discovery. In ICCV, 2017.
  • [10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976. IEEE, 2017.
  • [11] L. Jiang, A. G. Hauptmann, and G. Xiang. Leveraging high-level and low-level features for multimedia event detection. In ACM MM, 2012.
  • [12] J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. arXiv preprint arXiv:1804.01622, 2018.
  • [13] A. Kovashka, D. Parikh, and K. Grauman. Whittlesearch: Image search with relative attribute feedback. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2973–2980. IEEE, 2012.
  • [14] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 951–958. IEEE, 2009.
  • [15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [16] L. Mai, H. Jin, Z. Lin, C. Fang, J. Brandt, and F. Liu. Spatial-semantic image search by visual feature synthesis. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1121–1130. IEEE, 2017.
  • [17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [18] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [19] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, and J. Lu. Swapnet: Image based garment transfer.
  • [20] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.
  • [21] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
  • [22] S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014.
  • [23] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.
  • [24] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on circuits and systems for video technology, 8(5):644–655, 1998.
  • [25] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4):119, 2016.
  • [26] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. Computer Vision and Pattern Recognition, CVPR, 2017.
  • [27] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2107–2116, 2017.
  • [28] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971. IEEE, 2017.
  • [29] N. Vo and J. Hays. Generalization in metric learning: Should the embedding layer be the embedding layer? arXiv preprint arXiv:1803.03310, 2018.
  • [30] N. Vo, L. Jiang, C. Sun, K. Murphy, L.-J. Li, L. Fei-Fei, and J. Hays. Composing text and image for image retrieval - an empirical odyssey. In CVPR, 2019.
  • [31] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
  • [32] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
  • [33] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
  • [34] V. A. A. R. J. L. C. F. F. Y. J. H. Wenqi Xian, Patsorn Sangkloy. Texturegan: Controlling deep image synthesis with texture patches. Computer Vision and Pattern Recognition, CVPR, 2018.
  • [35] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426, 2017.
  • [36] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • [37] B. Zhao, J. Feng, X. Wu, and S. Yan. Memory-augmented attribute manipulation networks for interactive fashion search. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6156–6164. IEEE, 2017.