With good image understanding capabilities, can we manipulate an image's high-level semantic representation? Such a transformation operation can be used to generate or retrieve similar images with a desired modification (for example, changing a beach background to a street background); similar abilities have been demonstrated in zero-shot learning, attribute composition and attribute-manipulation image search. In this work we show how one can learn transformations with no training examples, by learning them in another domain and then transferring them to the target domain. This is feasible if: first, transformation training data is more accessible in the other domain; and second, both domains share similar semantics, so that one can learn transformations in a shared embedding space. We demonstrate this on an image retrieval task where the search query is an image plus an additional transformation specification (for example: search for images similar to this one, but with a street background instead of a beach). In one experiment we transfer transformations from synthesized 2D blob images to 3D rendered images, and in the other we transfer from the text domain to the natural image domain.
A smart image-to-image retrieval system should be able to incorporate user feedback such as relevance, attributes [13, 37], spatial layout or text [8, 30]. This paper studies such an application: a user wants to search for images similar to a reference image, but with an additional specification (such as "change the object color to red" or "switch the beach background to a street"). We formulate a function, parameterized by the specification, that takes the reference image feature and outputs a new one representing what the user is looking for; in this work we call such a function a "transformation".
Training a vision system to perform this kind of semantic manipulation can be straightforward if there is enough labeled data, which unfortunately is not always the case: finding images that contain the desired transformation might not be possible, and manually transforming images in their native domain could be a costly annotation effort. In this work we explore an alternative: learn the transformation function in another domain that shares similar semantics. It could be a totally different domain, or a customized, simplified version of the original domain.
There are many use cases in which collecting examples in one domain is much easier, or cheaper:
We demonstrate this on the synthesized dataset CSS, where the same scene can be rendered realistically in 3D or simplistically in 2D. Rendering these scenes in 3D, even with a GPU, is still multiple orders of magnitude slower.
The second use case is images and captions [15, 36, 32]. Editing and manipulating images are highly specialized skills, while manipulating text is among the first things people learn in school. In fact, in our experiment we show how to generate "word replacing" transformations automatically on the fly for training.
Other scenarios include 3D shapes and captions, street-view images, computer-generated images and corresponding category segmentation maps [5, 23], facial images and corresponding facial landmarks, scene images and scene graphs, etc. The latter domain in each pair is usually easier to express transformations on. Even without manual annotation, one can automatically generate a "change trees to buildings" transformation on a segmentation map, or a "make mouth smaller" transformation on facial landmarks.
In this work, we show that one can learn a transformation in one domain and transfer it to another by sharing a joint embedding space, assuming the two domains have similar semantics and the transformation is universal to both. We demonstrate its usefulness on the image-to-image retrieval application, where the query is now a reference image plus an additional specification to enhance the query's expressiveness. Two datasets are experimented with: the synthesized dataset CSS and the image-caption dataset COCO 2014, shown in Figure 1.
Image retrieval: besides the traditional text-to-image or image-to-image retrieval tasks, there are many image retrieval applications with other types of search query, such as: sketches, scene layouts, relevance feedback [24, 11], product attribute feedback [13, 37, 9, 1], dialog interaction and combined image-text queries. In this work, the image search query is a combination of a reference image and a transformation specification. In our setup, labeled retrieval examples are not available, hence a standard training procedure like [37, 30] does not work.
Zero-shot learning aims to recognize novel concepts relying on side data such as attributes [14, 2, 6] or textual descriptions [20, 7]. This side data represents high-level semantics with structure and can therefore be manipulated, composed or transformed easily by humans. On the other hand, the corresponding manipulation in the low-level feature domain (like raw images) is more difficult.
GAN image generation, style transfer and translation form an active research area where high-level semantic modification or synthesis of images is performed [21, 10, 4, 26, 34, 19]. For example, "style" can represent a high-level semantic feature that one wants to enforce on the output. Reed et al. generate images from reference images with a textual description of the new "style". Another relevant research area is work on translation between scene images, scene graphs and text captions [35, 12].
Image-text joint embeddings are typically learned in a retrieval context, though other supervised settings or even unsupervised learning can also work. The result is encoders that embed raw input into a high-level semantic feature space, where retrieval or recognition is performed. Our work concerns performing transformations within such a space. Prior work has demonstrated that walking or performing vector arithmetic operations there can translate to similar high-level semantic changes in the raw image space.
Synthesized data, simulation and domain adaptation: at a high level, these areas are similar to what we want to do: perform learning in another domain where labels are available and apply it to the target domain [23, 27, 28]. There, the source and target domains are similar, and the goal is to fine-tune a model trained on one domain for another by bridging the gap between the two domains. Differently, the task we study here requires transferring between two completely different domains (e.g., image and text), and so we provide similarity supervision to facilitate that.
We study the problem of learning a transformation function in one domain and transferring it to another; we choose image retrieval for demonstration and quantitative experiments, though the formulation might be applicable to other domains or tasks.
First we formalize the problem: the source domain X and the target domain Y have similar underlying semantics; correspondence supervision is the set S = {(x, y, s)}, where x ∈ X and y ∈ Y are labeled similar if s = 1 and non-similar otherwise.
Supervision for the transformation is provided in the source domain: T = {(x, t, x+)}, where x is a "before-transformation" example, t is the transformation specification/parameter and x+ is the "after-transformation" example. Note that the sets of before-transformation, after-transformation and correspondence examples can be the same, intersecting or mutually exclusive.
Given a similarly labeled set for testing on the target domain instead, T_Y = {(y, t, y+)}, the task is, for each query (y, t), to retrieve the correct y+ from the pool of all examples. We propose to learn to do this by: (1) learning a shared semantic representation using the supervision S, and (2) learning to transform that shared representation using the supervision T.
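At test time, the retrieval step reduces to nearest-neighbor search in the shared embedding space. A minimal sketch, where the helper name `retrieve` and the assumption that features are L2-normalized (so dot product ranks like cosine similarity) are ours:

```python
import numpy as np

def retrieve(query_feat, pool_feats, k=1):
    """Return indices of the k pool features closest to the query.

    Hypothetical helper: features are assumed L2-normalized so that
    the dot product ranks the same as cosine similarity.
    """
    scores = pool_feats @ query_feat   # (N,) similarity scores
    return np.argsort(-scores)[:k]     # top-k indices, best first
```

In the transfer setting, `query_feat` would be the transformed feature of the query example, and `pool_feats` the embedded target-domain pool.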
The first step is to learn an embedding function for each domain. For convenience, we denote the encoders f_X and f_Y, and write the embedded features as φ_x = f_X(x) and φ_y = f_Y(y).
We make use of recent advances in deep learning for this task. In particular, a CNN is used as the encoder if the domain is images, and an LSTM if it is text.
The learning objective is for two embedded features (which can be from the same domain or different ones) to be close to each other in this space if their examples are labeled similar, and far from each other otherwise. Any distance metric learning loss function can be used. We used a batch-based classification loss: within a mini-batch, each embedded feature should match its labeled-similar counterpart among all candidates, scored by a similarity kernel and trained with the softmax cross-entropy function.
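As a concrete illustration, a batch-based softmax cross-entropy embedding loss of this kind can be sketched as follows in NumPy; the dot-product similarity kernel is our assumption, not necessarily the paper's exact choice:

```python
import numpy as np

def batch_softmax_ce_loss(queries, positives):
    """Sketch of a batch-based classification loss: each query feature
    should match its own positive among all positives in the batch,
    treated as a B-way softmax cross-entropy classification."""
    sims = queries @ positives.T                 # (B, B) similarity logits
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_prob.diagonal().mean()           # CE against identity labels
```

The diagonal entries pair each query with its labeled-similar counterpart; all other batch entries act as negatives.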
The transformation is formalized as a function T(φ, τ), where φ and τ are the feature representations of the example and of the transformation specification respectively; we extend the definition of T so that the raw specification t can be passed directly, with τ being t's encoded feature.
There are many kinds of feature fusion techniques that can be used as the transformation function. For example, the simple concatenation fusion that we will use: the features φ and τ are concatenated and passed through a (learnable) 2-layer feed-forward network. For reference, prior work benchmarks different image-text fusion mechanisms in the image retrieval context; we also experiment with their proposed method, TIRG.
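A minimal sketch of such a concatenation-fusion transformation function; the weight shapes, ReLU nonlinearity and explicit parameter passing are illustrative assumptions:

```python
import numpy as np

def concat_fusion(phi, tau, W1, b1, W2, b2):
    """Concatenation fusion: a 2-layer feed-forward net over the
    concatenated example feature phi and transformation feature tau.
    Weight shapes and the ReLU are illustrative assumptions."""
    h = np.concatenate([phi, tau])     # join the two features
    h = np.maximum(0.0, W1 @ h + b1)   # hidden layer with ReLU
    return W2 @ h + b2                 # output feature, same size as phi
```

In practice the weights would be learned jointly with the encoders under the same metric learning objective.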
For each transformation example (x, t, x+), the learning objective is for the transformed feature T(φ_x, τ) to be close to the feature of x+ while being far from other features in the embedding space. We use the same metric learning loss function as in the previous section to enforce this objective.
|Training setup||2D-to-2D||2D-to-3D||3D-to-3D|
|2D-to-2D image retrieval training (16k examples)||73||-||-|
|2D-to-3D image retrieval (16k)||-||43||-|
|3D-to-3D image retrieval (16k)||-||-||72|
|2D-to-2D image retrieval (16k) + 2D-3D shared embedding (1k)||73||57||71|
Note that when defining the above functions, we removed domain-specific notation from the variables, so that they apply regardless of which domain the examples come from. In general, S can also include in-domain similarity supervision, and T can also include cross-domain transformation supervision, if available.
If the examples in the supervision overlap, transitivity can be applied: for instance, if x1 and x2 are labeled similar, and (x1, t, x3) is a valid transformation example, then (x2, t, x3) is also a valid transformation example.
If the transformation is reversible (for example [add red cube] and [remove red cube]), then for each (x, t, x+) we also obtain the reverse as a valid transformation example. Similar tricks apply if the transformation is associative, composable or commutative.
The above strategies allow forming a diverse pool of embedding and transformation supervision for training. This can be further enhanced if it is easy to generate examples on the fly. For instance, given the text domain example "a girl is sleeping on the beach" and a transformation that replaces one word with another, many examples can be generated by picking different word pairs.
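For illustration, on-the-fly generation of word-replacement transformation examples might look like the following sketch; the helper name and the vocabulary pool are hypothetical:

```python
def make_word_swap_examples(caption, vocab):
    """Generate (before, transformation, after) triplets on the fly by
    replacing one word of the caption with another word from a pool of
    chosen words (`vocab` is a hypothetical pool of visually
    distinctive concepts)."""
    words = caption.split()
    examples = []
    for i, w in enumerate(words):
        if w not in vocab:
            continue
        for new in vocab:
            if new == w:
                continue
            after = " ".join(words[:i] + [new] + words[i + 1:])
            # the transformation parameter is the (old, new) word pair
            examples.append((caption, (w, new), after))
    return examples
```

Because the swap is reversible, each generated triplet also yields its reverse for free, as discussed above.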
First we experiment with transferring transformations between 2 different image domains on the synthesized dataset CSS. It was created for the task of image retrieval from an image-text compositional query; for each scene there are a simple 2D blob version and a 3D rendered version; an example query is shown in figure 3. This naturally fits our framework: (1) such a composition query can be interpreted as a query image plus a transformation, here described by the text; (2) we can train on the 2D images (source domain) and test on the 3D images (target domain).
The dataset has 16k (query, target) pairs for training. We use these as supervision for learning the transformation function using 2D images only. The image part of all queries comes from a set of 1k base images. We use both the 2D and 3D versions of these base images for learning the shared embedding between the two domains. During test time, we perform the same retrieval benchmark with the 3D image versions as in prior work.
Note that we pretend we do not have access to transformation examples of 3D images. We do have: (1) many transformation examples in the 2D image domain (16k) and (2) a small number of similarity labels between the two domains (1k). This setup motivates our work: train on the source domain and transfer to the target domain where supervision is not available or is more expensive (in fact, it is stated that generating all these 2D images takes only minutes, while it takes days for the 3D versions even with a GPU).
We used ResNet-18 to encode the 2D and 3D images, and LSTM for the transformation specification text; feature size for all of them is 512. The text feature is treated as transformation parameter. We train for 150k iterations with learning rate of 0.01.
The setup and baselines are the same as in prior work; we train the same system, but without any transformation transferring, for 3 cases: transformation retrieval for 2D-to-2D images, 2D-to-3D, and 3D-to-3D. The main experiment is the 3D-to-3D case, where we can directly compare the baseline (learning the transformation in the 3D image feature space) with our approach (learning the transformation in the 2D image feature space and sharing it).
We report R@1 performance in Table 1; some qualitative results are shown in Figure 4. Our approach, without 3D-to-3D training examples, achieves results comparable to training on 3D-to-3D transformation examples. Transferring appears to be very effective on this simple dataset.
Since our method learns a shared embedding, we can do cross-domain retrieval. In the 2D-to-3D retrieval case, ours surprisingly outperforms the baseline actually trained on 2D-to-3D examples. This suggests that learning cross-domain transformation retrieval is more challenging than learning in-domain and then sharing.
While composing and manipulating text is an everyday task for people (who can read and write), composing and manipulating images are specialized skills. In this section we attempt to transfer text transformations to images. Note that there are inherent differences between vision and language: something that can be described in one domain might be difficult to fully translate to the other. In prior work, the denotation graph is introduced, where each node represents a text expression grounded with a set of example images, and each edge represents a text transformation.
We choose a very simple transformation to study here: given a text, replace a particular word with another word. As in the previous experiment, we use image retrieval as the demonstration task. For example, applying the text transformation [replace beach with street] to an image of a dog running on the beach, we would want to retrieve images of dogs running on the street (2nd example in figure 1).
However, the exact expected result is hard to define, especially if the image is crowded with different things (what street scene is desired, should other objects in the scene be preserved, should the dog be kept in the exact pose, matched at the instance level or the category level, etc.). In addition to this ambiguity, composing images is not trivial, hence collecting labels is very difficult. One could explicitly define a specific transformation in the image domain, then equate it to another specific transformation in the text domain through machine learning; while interesting, that is not what we want to study here. Our approach allows training a transformation in one domain and transferring it to the other without any transformation examples in the target domain.
We use the COCO train2014 dataset to learn the joint embedding of images and texts; it has around 80k images, each accompanied by 5 captions.
We create a list of hundreds of word replacement pairs from a pool of around 100 words (for example "beach to street", "boy to girl", "dogs to people", etc.); these words are manually chosen to be popular and visually distinctive concepts. During training, we apply word replacement to the captions to generate transformation examples on the fly.
We used a pretrained ResNet-50 for encoding images (not fine-tuning the conv layers) and LSTMs for encoding captions and words; the embedding size is 512. The parameter for the transformation is the word replacement pair: the concatenation of the encoded representations of the word to be replaced and the new word. As defined in section 3, the transformation function can be a simple 2-layer feed-forward network, or the recent technique TIRG.
Note that since this word replacement transformation is reversible, and generating new text examples on the fly is easy, we take advantage of the data augmentation tricks mentioned in section 3.3.
As mentioned in section 5.1, the correct retrieval result can be ambiguous to define, so we mainly focus on qualitative results here. COCO-val2014, with around 40k images, is used as the retrieval database.
We show some results in figure 5. Somewhat reasonable retrieval results can be obtained if the replaced words are popular and visually distinctive; this includes COCO object categories (person, dog, car, ...) and common scenes (room, beach, park, ...). To us, a reasonable result is the introduction of the concept represented by the new word, while other elements of the image (subjects, background, composition, ...) are kept unchanged as much as possible.
Replacing adjectives and verbs is difficult. Popular adjectives are object-like attributes such as "woody" or "grassy". Abstract adjectives are rare in this dataset; some are "young", "small" or "big". In our experience, colors have a better chance of working, since they are often visually distinctive.
Verbs are the most challenging (for example, the last rows in figure 5). We speculate the system relies less on verbs to match images and text, since nouns/objects are informative enough and easier to learn from (for context, recent research has demonstrated object recognition performance at a superhuman level, while action recognition remains challenging). It could also be partly because COCO is object-oriented, as is ImageNet, which is used for pretraining.
Finally, we note that there is still a discrepancy between images and texts even in a shared embedding. In figure 6, we show an example where top-ranked images and texts are retrieved but do not reflect the same semantics. Hence our task could benefit from improvements in image-text matching methods (the one we use in this work is basic and straightforward, but slightly inferior to state-of-the-art approaches on image-to-text and text-to-image retrieval benchmarks).
To obtain a quantitative measurement, we collected a small benchmark for this task. First we manually defined a set of 112 very simple, attribute-like captions; each can contain a subject (man, woman, dog, etc.), a verb (running, sleeping, sitting) and a background (on the street, beach, etc.). For each caption we performed a Google image search to collect images, then manually filtered them. On average we collected 15 images per caption. We call this the Simple Image Captions 112 (SIC112) dataset; some examples are shown in Figure 7.
With this dataset, we can now test our image retrieval task quantitatively by using the captions as labels for the images. A retrieval is considered successful if the retrieved image has the caption label corresponding to the query image's caption after applying the word replacement. We use the Recall at rank k (R@k) metric, defined as the percentage of test cases in which the top k retrieved results contain at least 1 correct image. Note that the dataset is for testing only; training is done on COCO-train2014 as described in the previous section.
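The R@k metric described above can be sketched as follows (the helper name is ours):

```python
def recall_at_k(ranked_lists, correct_sets, k):
    """R@k: fraction of test cases whose top-k retrieved items
    contain at least one correct item.

    ranked_lists: per-query retrieved item ids, best first.
    correct_sets: per-query set of correct item ids.
    """
    hits = sum(
        any(item in correct for item in ranked[:k])
        for ranked, correct in zip(ranked_lists, correct_sets)
    )
    return hits / len(ranked_lists)
```

Here "correct" means any database image whose caption label matches the query caption after the word replacement.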
Baselines: we compare with the following
Image Only: ignore the transformation and do image to image retrieval.
Embedding arithmetic: the word-replacing transformation can be done by directly adding and subtracting the corresponding embeddings. This simple strategy has been found effective in previous works on text embedding (for example "king" - "man" + "woman" = "queen"), image synthesis, and 3D model and text joint embedding.
Image to Text to Image retrieval: instead of transferring the transformation, this baseline translates the query image to text, performs the transformation natively, and then translates back to an image. Here the translation is done by our image-text matching system, since it is capable of retrieval in both directions (image to text and text to image). For image to text, our implementation uses the COCO-train2014 dataset of 400k captions as the text database; an alternative could be an off-the-shelf image captioning system.
Text (Ground truth target caption) to Image retrieval: this is similar to the last baseline, but assumes a perfect image-to-text translation is given, so the ground-truth caption is used as the query for retrieval.
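The embedding arithmetic baseline above can be sketched as follows; the final renormalization step is our assumption:

```python
import numpy as np

def arithmetic_transform(image_feat, old_word_emb, new_word_emb):
    """Embedding-arithmetic baseline: realize 'replace old with new'
    by vector arithmetic in the shared space, then renormalize
    (the normalization step is our assumption)."""
    out = image_feat - old_word_emb + new_word_emb
    return out / np.linalg.norm(out)
```

The result is then used directly as the retrieval query, in contrast to the learned transformation function, which fuses the image feature with the encoded word pair.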
Table 2: retrieval performance, with test queries split by group: keeping the other two attributes unchanged while changing only the subject, verb or background; keeping a novel subject while changing the background; and all queries.
Result: some qualitative results are shown in figure 8, and retrieval performance is reported in table 2. For analysis we split test queries into 4 groups: 8205 queries changing the subject (for example "girl to boy"), 3282 changing the verb, 6564 changing the background, and 745 special queries changing the background of images containing novel subjects (such as "stormtrooper", which does not appear in COCO-train2014). The last group demonstrates the use case where direct translation between image and text might be difficult and transferring might be more appropriate.
We consider the GT-caption-to-image retrieval baseline as the upper bound. Among subject, verb and background, changing the verb seems most challenging. Our approach outperforms the other baselines, demonstrating that it is more beneficial to perform the transformation in the image domain than to translate to text. Still, ours is much worse than the GT caption baseline, suggesting there is a lot of room for improvement; in particular, our approach could benefit from a better image-text joint embedding technique.
On keep-novel-subject change-background queries, translating to text or even using the GT caption results in worse performance, because the system cannot recognize the novel object in the text domain. Performing the transformation in the native image domain, by embedding arithmetic or our approach, fits this use case better. The arithmetic baseline performs very well on changing background, and even outperforms ours when the verb is not involved. This baseline also has the advantage of being simple, with no additional learning needed. However, we expect that when the operation is more complex or subtle (for example keeping the verb unchanged, changing the verb, or dealing with more complex captions as in COCO2014), learning a transformation function will be better than relying on simple arithmetic.
We propose to learn a feature transformation function for which no training examples are available, by learning it on another domain with similar semantics where training examples are abundant. It can then be transferred to the original target domain via a shared embedding feature space. We demonstrate that such transformed features can be very useful in image retrieval applications. One could also learn a decoding function, for example, to generate an image or text from the feature. Future work could study more complex text transformations and semantic composition beyond simple "word replacing".
The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.