One of the peculiar features of human perception is multi-modality. We unconsciously attach attributes to objects, which can sometimes uniquely identify them. For instance, when a person says “apple”, it is quite natural that an image of an apple, which may be green or red, forms in their mind. In information retrieval, the user seeks information from a retrieval system by sending a query. Traditional information retrieval systems allow only a unimodal query, i.e. either a text or an image. Advanced information retrieval systems should enable users to express the concept in their mind by allowing a multi-modal query.
In this work, we consider such an advanced retrieval system, where users can retrieve images from a database based on a multi-modal query. Concretely, we have an image retrieval task where the input query is specified in the form of an image and natural language expressions describing the desired modifications to the query image. Such a retrieval system offers a natural and effective interface. This task has applications in E-Commerce search, surveillance systems and internet search. Fig. 1 shows a potential application scenario of this task.
Recently, Vo et al. proposed the Text Image Residual Gating (TIRG) method for composing the query image and text for image retrieval, achieving state-of-the-art (SOTA) results on this task. However, their approach does not perform well in real-world application scenarios, i.e. with long and detailed texts (see Sec. 4.4). We think the reason is that their approach is too focused on changing the image space and does not give the query text its due importance. The gating connection takes the element-wise product of the query image features with the image-text representation after passing it through two fully connected layers. In short, TIRG assigns huge importance to the query image features by putting them directly into the final composed representation. Similar to [19, 22], they employ an LSTM for extracting features from the query text. This works fine for simple queries but fails for more realistic ones.
In this paper, we attempt to overcome these limitations by proposing ComposeAE, an autoencoder based approach for composing the modalities in the multi-modal query. We employ a pre-trained BERT model for extracting text features, instead of an LSTM. We hypothesize that by jointly conditioning on both left and right context, BERT gives a better representation of complex queries. Similar to TIRG, we use a pre-trained ResNet-17 model for extracting image features. The extracted image and text features have different statistical properties, as they are extracted from independent uni-modal models. We argue that it is not beneficial to fuse them by passing them through a few fully connected layers, as is typically done in image-text joint embeddings.
We adopt a novel approach and map these features to a complex space. We propose that the target image representation is an element-wise rotation of the representation of the source image in this complex space. The information about the degree of rotation is specified by the text features. We learn the composition of these complex vectors and their mapping to the target image space by adopting a deep metric learning (DML) approach. In this formulation, text features take a central role in defining the relationship between query image and target image. This also implies that the search space for learning the composition features is restricted. From a DML point of view, this restriction proves to be quite vital in learning a good similarity metric.
We also propose an explicit rotational symmetry constraint on the optimization problem based on our novel formulation of composing the image and text features. Specifically, we require that multiplication of the target image features with the complex conjugate of the query text features should yield a representation similar to the query image features. We explore the effectiveness of this constraint in our experiments (see Sec. 4.5).
We validate the effectiveness of our approach on three datasets: MIT-States, Fashion200k and Fashion IQ. In Sec. 4, we show empirically that ComposeAE is able to learn a better composition of image and text queries and outperforms the SOTA method. In DML, it has recently been shown that reported improvements are exaggerated and that performance comparisons are often done unfairly. In our experiments, we took special care to ensure a fair comparison. For instance, we introduce several variants of TIRG, some of which show huge improvements over the original TIRG. We also conduct several ablation studies to quantify the contribution of different modules to the improvement in ComposeAE's performance.
Our main contributions are summarized below:
We propose a ComposeAE model to learn the composed representation of image and text query.
We adopt a novel approach and argue that the source image and the target image lie in a common complex space. They are rotations of each other and the degree of rotation is encoded via query text features.
We propose a rotational symmetry constraint on the optimization problem.
ComposeAE outperforms the SOTA method TIRG by a huge margin, i.e., 30.12% on Fashion200k and 11.13% on MIT-States on the Recall@10 metric.
We enhance the SOTA method TIRG to ensure fair comparison and identify its limitations.
2 Related Work
Deep metric learning (DML) has become a popular technique for solving retrieval problems. DML aims to learn a metric such that the distances between samples of the same class are smaller than the distances between samples of different classes. DML has been employed extensively for cross-modal retrieval, i.e. retrieving images based on a text query and retrieving captions from the database based on an image query [24, 9, 28, 2, 8, 26].
In the domain of Visual Question Answering (VQA), many methods have been proposed to fuse the text and image inputs [19, 17, 16]. We review a few closely related methods below. Relationship is a method based on relational reasoning: image features are extracted from a CNN and text features from an LSTM to create a set of relationship features, which are then passed through an MLP and averaged to obtain the composed representation. The FiLM method tries to “influence” the source image by applying an affine transformation to the output of a hidden layer in the network. In order to perform complex operations, this linear transformation needs to be applied to several hidden layers. Another prominent method is parameter hashing, where one of the fully-connected layers in a CNN acts as the dynamic parameter layer.
In this work, we focus on the image retrieval problem based on an image and text query. This task has been studied recently by Vo et al. They propose a gated feature connection in order to keep the composed representation of the query image and text in the same space as that of the target image. They also incorporate a residual connection which learns the similarity between the concatenation of image-text features and the target image features. Another simple but effective approach is Show and Tell, which trains an LSTM to predict the next word in the sequence after it has seen the image and the previous words; the final state of this LSTM is taken as the composed representation. Han et al. present an interesting approach to learn spatially-aware attributes from product descriptions and then use them to retrieve products from the database, but their text query is limited to a predefined set of attributes. Nagarajan et al. proposed an embedding approach, “Attribute as Operator”, where the text query is embedded as a transformation matrix; the image features are then transformed with this matrix to obtain the composed representation.
This task is also closely related to the interactive image retrieval task [4, 21] and the attribute-based product retrieval task [27, 23]. These approaches have limitations: the query texts are limited to a fixed set of relative attributes, multiple rounds of natural language queries are required as input [4, 21], or the query text can only be a single word, i.e. an attribute. In contrast, the input query text in our approach is not limited to a fixed set of attributes and does not require multiple interactions with the user. Different from our work, the focus of these methods is on modeling the interaction between the user and the agent.
3.1 Problem Formulation
Let $\mathcal{X}$ denote the set of query images, $\mathcal{T}$ the set of query texts and $\mathcal{Y}$ the set of target images. Let $\psi(\cdot)$ denote the pre-trained image model, which takes an image as input and returns image features in a $d$-dimensional space. Let $\kappa(\cdot, \cdot)$ denote the similarity kernel, which we implement as a dot product between its inputs. The task is to learn a composed representation of the image-text query, denoted by $\phi(x, t)$ for a query image $x \in \mathcal{X}$ and a query text $t \in \mathcal{T}$, by maximising

$$\max_{\Theta} \; \kappa\big(\phi(x, t), \psi(y)\big), \qquad (1)$$

where $\Theta$ denotes all the network parameters and $y \in \mathcal{Y}$ is the target image.
3.2 Motivation for Complex Projection
In deep learning, researchers aim to formulate the learning problem in such a way that the solution space is restricted in a meaningful way. This helps in learning better and robust representations. The objective function (Equation 1) maximizes the similarity between the output of the composition function of the image-text query and the target image features. Thus, it is intuitive to model the query image, query text and target image lying in some common space. One drawback of TIRG is that it does not emphasize the importance of text features in defining the relationship between the query image and the target image.
Based on these insights of the learning problem, we restrict the compositional learning of query image and text features in such a way that: (i) query and target image features lie in the same space, (ii) text features encode the transition from query image to target image in this space and (iii) transition is symmetric, i.e. some function of the text features must encode the reverse transition from target image to query image.
In order to incorporate these characteristics in the composed representation, we propose that the query image and target image are rotations (transitions) of each other in a complex space. The rotation is determined by the text features. This enables incorporating the desired text information about the image in the common complex space. The reason for choosing the complex space is that some function of text features required for the transition to be symmetric can easily be defined as the complex conjugate of the text features in the complex space (see Fig. 2).
Choosing such a projection also enables us to define a constraint on the optimization problem, referred to as the rotational symmetry constraint (see Eqs. 12, 13 and Sec. 3.4.1). We empirically verify the effectiveness of this constraint in learning better composed representations. We also explore the effect on performance if we fuse image and text information in the real space instead; refer to Sec. 4.5.
An advantage of modelling the reverse transition in this way is that we do not require captions of the query image. This is quite useful in practice, since a user-friendly retrieval system should not ask the user to describe the query image. In public datasets, query image captions are not always available, e.g. for the Fashion IQ dataset. In addition, it forces the model to learn a good “internal” representation of the text features in the complex space.
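The rotation picture above can be checked with a toy example. The sketch below uses made-up two-dimensional complex features and angles; it only illustrates that multiplying element-wise by $e^{j\theta}$ realizes the transition and that multiplying by the conjugate $e^{-j\theta}$ undoes it exactly:

```python
import cmath

# Toy query-image features, already mapped into the complex space
# (hypothetical values, not learned ones).
query = [1.0 + 0.5j, -0.3 + 2.0j]
# Rotation angles that would be predicted from the text features.
theta = [0.7, -1.2]

# Forward transition: element-wise rotation of the query features.
delta = [cmath.exp(1j * t) for t in theta]
target = [d * q for d, q in zip(delta, query)]

# Reverse transition: the complex conjugate of the rotation has unit
# modulus and opposite angle, so it recovers the query features.
recovered = [d.conjugate() * y for d, y in zip(delta, target)]
assert all(abs(r - q) < 1e-12 for r, q in zip(recovered, query))
```

Because each rotation has unit modulus, the transition changes only the phase of every coordinate and never its magnitude, which is exactly the restriction discussed above.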
Interestingly, such restrictions on the learning problem serve as implicit regularization: for instance, the text features only influence the angles of the composed representation. This is in line with recent developments in deep learning theory [15, 13]. Neyshabur et al. showed that imposing simple but global constraints on the parameter space of deep networks is an effective way of analyzing learning theoretic properties and may aid in decreasing the generalization error.
3.3 Network Architecture
Now we describe ComposeAE, an autoencoder based approach for composing the modalities in the multi-modal query. Figure 3 presents the overview of the ComposeAE architecture.
For the image query $x$, we extract the image feature vector $z = \psi(x) \in \mathbb{R}^d$ using the image model $\psi$ (e.g. ResNet-17).
Similarly, for the text query $t$, we extract the text feature vector $q = \beta(t) \in \mathbb{R}^m$ using the BERT model $\beta$.
Since the image features and text features are extracted from independent uni-modal models, they have different statistical properties and follow complex distributions. Typically, in image-text joint embeddings [23, 24], these features are combined using fully connected layers or gating mechanisms.
In contrast to this, we propose that the source image and target image are rotations of each other in some complex space, say $\mathbb{C}^k$. Specifically, the target image representation is an element-wise rotation of the representation of the source image in this complex space. The information of how much rotation is needed to get from the source to the target image is encoded via the query text features. During training, we learn the appropriate mapping functions which give us the composition of the image features $z$ and the text features $q$ in $\mathbb{C}^k$. We learn the angles specifying the element-wise rotation of the source image features from the text features $q$.
More precisely, we learn a mapping $\eta$ from the text features $q$ to a vector of angles $\theta = \eta(q)$ and obtain the coordinate-wise complex rotations via

$$\delta = e^{j\theta},$$

where $e^{(\cdot)}$ denotes the element-wise exponential function and $j$ is the square root of $-1$. The mapping $\eta$ is implemented as a multilayer perceptron (MLP), i.e. two fully-connected layers with non-linear activation.
Next, we learn a mapping function $\gamma$, which maps the image features $z$ to the complex space $\mathbb{C}^k$. $\gamma$ is also implemented as an MLP. The composed representation in the complex space, denoted by $\vartheta$, can be written as:

$$\vartheta = \delta \odot \gamma(z),$$

where $\delta$ is the vector of coordinate-wise rotations learned from the text features and $\odot$ denotes element-wise multiplication.
The optimization problem defined in Eq. 1 aims to maximize the similarity between the composed features and the target image features extracted from the image model. Thus, we need to learn a mapping function $\rho$ from the complex space back to the $d$-dimensional real space where the extracted target image features exist. $\rho$ is implemented as an MLP.
In order to better capture the underlying cross-modal similarity structure in the data, we learn another mapping, denoted as $\rho_{conv}$. It is implemented as two fully connected layers followed by a single convolutional layer. This enables learning effective local interactions among different features. In addition to the complex composition $\vartheta$, $\rho_{conv}$ also takes the raw features $z$ and $q$ as input. $\rho_{conv}$ plays a really important role for queries where the query text asks for a modification that is spatially localized, e.g. a user wants a t-shirt with a different logo on the front (see second row in Fig. 4).
Let $f(\cdot)$ denote the overall composition function, which learns how to effectively compose extracted image and text features for target image retrieval. The final representation, $\phi = f(z, q)$, of the composed image-text features can be written as follows:

$$\phi = a \, \rho(\vartheta) + b \, \rho_{conv}(\vartheta, z, q),$$

where $a$ and $b$ are learnable parameters.
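A minimal PyTorch sketch of this composition pipeline is shown below. The module names (eta, gamma, rho), the layer sizes and the choice to carry the complex vector as a (real, imaginary) pair are our own illustrative assumptions, not the authors' exact implementation; the convolutional branch and the learnable weights a and b are omitted for brevity:

```python
import torch
import torch.nn as nn

class ComplexComposition(nn.Module):
    """Sketch of the complex-space composition (dimensions are assumptions)."""
    def __init__(self, img_dim=512, txt_dim=768, common_dim=256):
        super().__init__()
        # eta: text features -> coordinate-wise rotation angles theta
        self.eta = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU(),
                                 nn.Linear(common_dim, common_dim))
        # gamma: image features -> common (complex) space
        self.gamma = nn.Sequential(nn.Linear(img_dim, common_dim), nn.ReLU(),
                                   nn.Linear(common_dim, common_dim))
        # rho: complex space (real, imag pair) -> real target-image space
        self.rho = nn.Sequential(nn.Linear(2 * common_dim, common_dim), nn.ReLU(),
                                 nn.Linear(common_dim, img_dim))

    def forward(self, img_feat, txt_feat):
        theta = self.eta(txt_feat)          # rotation angles from the text
        z = self.gamma(img_feat)            # image features in common space
        # Element-wise rotation z * e^{j*theta}, kept as (real, imag) parts.
        real = z * torch.cos(theta)
        imag = z * torch.sin(theta)
        return self.rho(torch.cat([real, imag], dim=-1))

model = ComplexComposition()
out = model(torch.randn(4, 512), torch.randn(4, 768))
assert out.shape == (4, 512)
```

A forward pass on random inputs returns features in the 512-dimensional target image space, ready to be compared to the extracted target features with a dot product.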
In autoencoder terminology, the encoder has learnt the composed representation $\phi$ of the image and text query. Next, we learn to reconstruct the extracted image and text features from $\phi$. Separate decoders are learned for each modality, i.e. an image decoder and a text decoder, denoted by $d_{img}$ and $d_{txt}$ respectively. The reason for using the decoders and reconstruction losses is two-fold: first, they act as a regularizer on the learnt composed representation and second, they force the composition function to retain relevant text and image information in the final representation. Empirically, we have seen that these losses reduce the variation in the performance and aid in preventing overfitting.
3.4 Training Objective
We adopt a deep metric learning (DML) approach to train ComposeAE. Our training objective is to learn a similarity metric $\kappa(\cdot, \cdot)$ between the composed image-text query features $\phi$ and the extracted target image features $\psi(y)$. The composition function should learn to map semantically similar points from the data manifold onto metrically close points in $\mathbb{R}^d$. Analogously, it should push the composed representation away from non-similar images in $\mathbb{R}^d$.
For sample $i$ from the training mini-batch of size $N$, let $\phi_i$ denote the composition feature, $\psi(y_i)$ the target image features and $\psi(\tilde{y}_i)$ the features of a randomly selected negative image from the mini-batch. We follow TIRG in choosing the base loss for each dataset.
So, for the MIT-States dataset, we employ a triplet loss with soft margin as the base loss. It is given by:

$$L_{ST} = \frac{1}{MN} \sum_{i=1}^{N} \sum_{m=1}^{M} \log\big\{ 1 + \exp\big( \|\phi_i - \psi(y_i)\| - \|\phi_i - \psi(\tilde{y}_{i,m})\| \big) \big\},$$

where $M$ denotes the number of triplets for each training sample $i$. In our experiments, we choose the same value as mentioned in the TIRG code, i.e. $M = 3$.
For the Fashion200k and Fashion IQ datasets, the base loss is the softmax loss with similarity kernels, denoted as $L_{SM}$. For each training sample $i$, we normalize the similarity between the composed query features $\phi_i$ and the target image features $\psi(y_i)$ by the sum of similarities between $\phi_i$ and all the target images in the batch:

$$L_{SM} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\{\kappa(\phi_i, \psi(y_i))\}}{\sum_{j=1}^{N} \exp\{\kappa(\phi_i, \psi(y_j))\}}.$$

This is equivalent to the classification-based loss in [23, 3, 20, 10].
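Both base losses can be sketched in PyTorch as follows. This is a hedged illustration: the tensor shapes and the use of squared Euclidean distance inside the triplet loss are our assumptions, not the exact TIRG/ComposeAE implementation:

```python
import torch
import torch.nn.functional as F

def soft_triplet_loss(phi, psi_pos, psi_neg):
    """Soft-margin triplet loss: log(1 + exp(d(anchor,pos) - d(anchor,neg)))."""
    d_pos = ((phi - psi_pos) ** 2).sum(dim=-1)
    d_neg = ((phi - psi_neg) ** 2).sum(dim=-1)
    return F.softplus(d_pos - d_neg).mean()

def batch_softmax_loss(phi, psi):
    """Classification-style loss: each composed query should be most
    similar (dot product) to its own target within the mini-batch."""
    logits = phi @ psi.t()                 # B x B similarity kernel
    labels = torch.arange(phi.size(0))    # the matching target is on the diagonal
    return F.cross_entropy(logits, labels)

phi = torch.randn(8, 512)
psi = torch.randn(8, 512)
assert batch_softmax_loss(phi, psi).item() > 0
assert soft_triplet_loss(phi, psi, psi.roll(1, dims=0)).item() > 0
```

In the batch softmax loss, the diagonal of the B x B similarity matrix holds each query's own target, so the loss reduces to an ordinary cross-entropy with labels 0..B-1.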
In addition to the base loss, we also incorporate two reconstruction losses in our training objective. They act as regularizers on the learning of the composed representation. The image reconstruction loss is given by:

$$L_{RI} = \frac{1}{N} \sum_{i=1}^{N} \big\| z_i - d_{img}(\phi_i) \big\|_2^2 .$$

Similarly, the text reconstruction loss is given by:

$$L_{RT} = \frac{1}{N} \sum_{i=1}^{N} \big\| q_i - d_{txt}(\phi_i) \big\|_2^2 .$$
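The two reconstruction terms can be sketched as L2 losses between the decoded and the originally extracted features. The decoder architectures and dimensions below are placeholders (single linear layers instead of the actual decoders):

```python
import torch
import torch.nn as nn

# Hypothetical decoders mapping the composed representation back to the
# extracted feature spaces (dimensions are assumptions).
d_img = nn.Linear(512, 512)   # image decoder
d_txt = nn.Linear(512, 768)   # text decoder

def reconstruction_losses(phi, img_feat, txt_feat):
    # L2 reconstruction of the uni-modal features from the composed code.
    loss_img = ((d_img(phi) - img_feat) ** 2).mean()
    loss_txt = ((d_txt(phi) - txt_feat) ** 2).mean()
    return loss_img, loss_txt

li, lt = reconstruction_losses(torch.randn(4, 512),
                               torch.randn(4, 512), torch.randn(4, 768))
assert li.item() >= 0 and lt.item() >= 0
```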
3.4.1 Rotational Symmetry Loss
As discussed in subsection 3.2, based on our novel formulation of learning the composition function, we can include a rotational symmetry loss in our training objective. Specifically, we require that the composition of the target image features with the complex conjugate of the text features should be similar to the query image features. In concrete terms, first we obtain the complex conjugate of the text features projected in the complex space. It is given by:

$$\delta^{\dagger} = e^{-j\theta}.$$
Let $\vartheta^{\dagger}$ denote the composition of $\delta^{\dagger}$ with the target image features in the complex space. Concretely:

$$\vartheta^{\dagger} = \delta^{\dagger} \odot \gamma(\psi(y)).$$
Finally, we compute the composed representation, denoted by $\phi^{\dagger}$, in the following way:

$$\phi^{\dagger} = a \, \rho(\vartheta^{\dagger}) + b \, \rho_{conv}(\vartheta^{\dagger}, \psi(y), q).$$
The rotational symmetry constraint translates to maximizing the similarity kernel $\kappa(\phi^{\dagger}, z)$ between this representation and the query image features. We incorporate this constraint in our training objective by employing the softmax loss or the soft-triplet loss depending on the dataset.
Since for the Fashion datasets the base loss is $L_{SM}$, we calculate the rotational symmetry loss, $L_{SYM}$, as follows:

$$L_{SYM} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\{\kappa(\phi_i^{\dagger}, z_i)\}}{\sum_{j=1}^{N} \exp\{\kappa(\phi_i^{\dagger}, z_j)\}}.$$
Analogously, the resulting loss function $L_{SYM}$ for MIT-States is given by the soft-margin triplet loss, with $\phi_i^{\dagger}$ taking the role of the composed features and $z_i$ taking the role of the target features.
The total loss is computed as the weighted sum of the above-mentioned losses. It is given by:

$$L_{T} = L_{base} + \lambda_{RI} \, L_{RI} + \lambda_{RT} \, L_{RT} + \lambda_{SYM} \, L_{SYM},$$

where $L_{base} \in \{L_{ST}, L_{SM}\}$ depending on the dataset and $\lambda_{RI}$, $\lambda_{RT}$, $\lambda_{SYM}$ are the corresponding loss weights.
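As a toy numeric illustration of this weighted sum (all weights and loss values below are made up; in practice they depend on the dataset):

```python
# Hypothetical component losses for one batch.
losses = {"base": 0.9, "rec_img": 0.4, "rec_txt": 0.3, "sym": 0.6}
# Hypothetical weights; the base loss enters with weight 1.
weights = {"base": 1.0, "rec_img": 0.1, "rec_txt": 0.1, "sym": 0.5}

total = sum(weights[k] * losses[k] for k in losses)
assert abs(total - 1.27) < 1e-9
```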
4.1 Experimental Setup
We evaluate our approach on three real-world datasets, namely MIT-States, Fashion200k and Fashion IQ. For evaluation, we follow the same protocols as other recent works [23, 6, 17]. We use recall at rank $k$, denoted as R@$k$, as our evaluation metric.
To ensure fair comparison, we keep the same hyperparameters as TIRG and use the same optimizer (SGD with momentum). Similar to TIRG, we use ResNet-17 for image feature extraction to get a 512-dimensional feature vector. In contrast to TIRG, we use pre-trained BERT for encoding the text query. Concretely, we employ BERT-as-service and use Uncased BERT-Base, which outputs a 768-dimensional feature vector for a text query. Further implementation details can be found in the code: https://anonymous.4open.science/r/d1babc3c-0e72-448a-8594-b618bae876dc/.
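Recall at rank k (R@k) counts a query as correct if the labelled target appears among the top-k retrieved images. A minimal sketch with two hypothetical queries:

```python
def recall_at_k(ranked_ids, target_id, k):
    """1 if the labelled target appears in the top-k retrieved ids."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

# Rankings for two hypothetical queries (ids sorted by decreasing similarity).
rankings = [([3, 7, 1, 9], 1), ([5, 2, 8, 4], 4)]
r_at_1 = sum(recall_at_k(r, t, 1) for r, t in rankings) / len(rankings)
r_at_3 = sum(recall_at_k(r, t, 3) for r, t in rankings) / len(rankings)
assert r_at_1 == 0.0   # neither target is ranked first
assert r_at_3 == 0.5   # target 1 is in the top-3 of the first query
```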
| | MIT-States | Fashion200k | Fashion IQ |
| # train queries | 43207 | 172049 | 46609 |
| # test queries | 82732 | 33480 | 15536 |
| Average length of complete text query | 2 | 4.81 | 13.5 |
| Average # of | | | |
We compare the results of ComposeAE with several methods, namely Show and Tell, Parameter Hashing, Attribute as Operator, Relationship, FiLM and TIRG, which we briefly described in Sec. 2.
In order to identify the limitations of TIRG and to ensure fair comparison with our method, we introduce three variants of TIRG. First, we employ the BERT model as the text model instead of an LSTM, referred to as TIRG with BERT. Second, we keep the LSTM but the text query contains the full target captions; we refer to this as TIRG with Complete Text Query. Third, we combine these two variants to get TIRG with BERT and Complete Text Query. The reason for the complete text query baselines is that the original TIRG approach generates the text query by finding the one-word difference between the source and target image captions, disregarding all other words in the target captions.
While such a formulation of queries may be effective on some datasets, the restriction on the specific form (or length) of the text query largely constrains the information that a user can convey to benefit the retrieval process. Thus, such an approach of generating the text query has limited applicability in real-life scenarios, where a user usually describes the desired modification with multiple words. This argument is also supported by several recent studies [5, 4, 21]. In our experiments, the Fashion IQ dataset contains queries asked by humans in natural language, with an average length of 13.5 words (see Table 1). For this reason, we cannot get results for the original TIRG on this dataset.
Table 1 summarizes the statistics of the datasets. The train-test split of the datasets is the same for all the methods.
The MIT-States dataset consists of 60k diverse real-world images, where each image is described by an adjective (state) and a noun (category), e.g. “sliced potato” or “ripe tomato”. There are 245 nouns in the dataset and 49 of them are reserved for testing, i.e. there is no overlap between training and testing queries in terms of nouns (categories). This split tests whether the algorithm is able to learn the composition for unseen nouns (categories). An input image (say “unripe tomato”) is sampled and the text query asks to change the state to ripe. The algorithm is considered successful if it retrieves the correct target image (“ripe tomato”) from the pool of all test images. Note that the description (image caption) itself is not available to the algorithm.
Results on MIT-States (R@1, R@5, R@10):
| Show and Tell | 11.9 | 31.0 | 42.0 |
| Att. as Operator | 8.8 | 27.3 | 39.1 |
| TIRG with BERT | 12.3 | 31.8 | 42.6 |
| TIRG with Complete Text Query | 7.9 | 28.7 | 34.1 |
| TIRG with BERT and Complete Text Query | 13.3 | 34.5 | 46.8 |
Fashion200k  consists of 200k images of 5 different fashion categories, namely: pants, skirts, dresses, tops and jackets. Each image has a human annotated caption, e.g. “blue knee length skirt”.
Fashion IQ is a challenging dataset consisting of 77684 images belonging to three categories: dresses, top-tees and shirts. In contrast to the two other datasets, Fashion IQ has two human-written annotations for each target image. We report the performance on the validation set, as the test set labels are not available.
4.4 Discussion of Results
First, we note that our proposed method ComposeAE outperforms the other methods by a significant margin. On Fashion200k, the performance improvement of ComposeAE over the original TIRG and its enhanced variants is most significant. Specifically, in terms of the R@10 metric, the performance improvement over the second best method is 6.96%, and it is 30.12% over the original TIRG method. Similarly on R@10, for MIT-States, ComposeAE outperforms the second best method by 2.35% and the original TIRG method by 11.13%. For the Fashion IQ dataset, ComposeAE has 2.61% and 3.82% better performance than the second best method in terms of R@10 and R@100 respectively.
Results on Fashion200k (R@1, R@10, R@50):
| Show and Tell | 12.3 | 40.2 | 61.8 |
| TIRG with BERT | 14.2 | 41.9 | 63.3 |
| TIRG with Complete Text Query | 18.1 | 52.4 | 73.1 |
| TIRG with BERT and Complete Text Query | 19.9 | 51.7 | 71.8 |

Results on Fashion IQ (R@10, R@50, R@100):
| TIRG with Complete Text Query | 3.34 | 9.18 | 9.45 |
| TIRG with BERT and Complete Text Query | 11.5 | 28.8 | 28.8 |
Second, we observe that the performance of the methods on the MIT-States and Fashion200k datasets lies in a similar range, whereas it is markedly lower on Fashion IQ. For instance, in terms of R@10, the performance of TIRG with BERT and Complete Text Query is 46.8 and 51.8 on the MIT-States and Fashion200k datasets, while it is 11.5 for Fashion IQ. The reasons which make Fashion IQ the most challenging among the three datasets are: (i) the text query is quite complex and detailed and (ii) there is only one target image per query (see Table 1). That is, even if the algorithm retrieves semantically similar images, they will not be considered correct by the recall metric. For instance, for the first query in Fig. 4, we can see that the second, third and fourth images are semantically similar and modify the image as described by the query text. But if the third image, which is the labelled target image, had not appeared in the top-5, then R@5 would have been zero for this query.
Third, for the MIT-States and Fashion200k datasets, we observe that the TIRG variant which replaces the LSTM with BERT as the text model results in a slight degradation of performance. On the other hand, the performance of the TIRG variant which uses the complete text (caption) query is considerably better than the original TIRG. However, for the Fashion IQ dataset, which represents a real-world application scenario, the performance of TIRG with complete text query is significantly worse. Concretely, TIRG with complete text query performs 253% worse than ComposeAE on R@10. The reason for this huge variation is that the average length of the complete text query for the MIT-States and Fashion200k datasets is 2 and 3.5 words respectively, whereas for Fashion IQ it is 12.4 words. TIRG uses the LSTM model and composes the features in a way which underestimates the importance of the text query. This shows that the TIRG approach does not perform well when the query text description is more realistic and complex.
Fourth, one of the baselines that we introduced, TIRG with BERT and Complete Text Query, shows significant improvement over the original TIRG. Specifically, in terms of R@10, the performance gain over the original TIRG is 8.58% and 21.65% on MIT-States and Fashion200k respectively. This variant is also the second best performing method on all datasets. We think that with a more detailed text query, BERT is able to give a better representation of the query, which in turn improves the performance.
Qualitative Results: Fig. 4 presents some qualitative retrieval results for Fashion IQ. For the first query, we see that all images are in “blue print” as requested by the text query. The second request in the text query was that the dress should have “short sleeves”; four out of the top-5 images fulfill this requirement. For the second query, we can observe that all retrieved images share the same semantics and are visually similar to the target images. Qualitative results for the other two datasets are given in the supplementary material.
Ablation excerpt (MIT-States | Fashion200k | Fashion IQ):
| - Concat in real space | 48.4 | 46.2 | 9.8 |
4.5 Ablation Studies
We have conducted various ablation studies in order to gain insight into which parts of our approach contribute to the high performance of ComposeAE. Table 5 presents the quantitative results of these studies.
Impact of the rotational symmetry loss: its effect on the performance can be seen in Row 2. For the Fashion200k and Fashion IQ datasets, the decrease in performance is quite significant: 7.17% and 12.38% respectively. For MIT-States, the impact of incorporating this loss is not as significant. This may be because the text query is quite simple in the MIT-States case, i.e. 2 words. This needs further investigation.
Efficacy of Mapping to Complex Space: ComposeAE has a complex projection module (see Fig. 3). We removed this module to quantify its effect on the performance. Row 3 shows that there is a drop in performance for all three datasets. This strengthens our hypothesis that it is better to map the extracted image and text features into a common complex space than to simply concatenate them in real space.
Convolutional versus Fully-Connected Mapping: ComposeAE has two modules for mapping the features from the complex space to the target image space, i.e. one with only fully-connected layers and a second one with an additional convolutional layer. Rows 4 and 5 show that the performance is quite similar for the fashion datasets, while for MIT-States, ComposeAE without the convolutional mapping performs much better. Overall, it can be observed that for all three datasets both modules contribute to improving the performance of ComposeAE.
In this work, we propose ComposeAE, which composes the representation of the source image with the modification text by rotating the source image representation in a complex space. This composed representation is mapped to the target image space and a similarity metric is learned. Based on our novel formulation of the problem, we introduce a rotational symmetry loss in our training objective. Our experiments on three datasets show that ComposeAE consistently outperforms the SOTA method on this task. We also enhance the SOTA method TIRG to ensure fair comparison and identify its limitations.
We would like to thank Alan Schelten, Rayyan A. Khan and Till Brychcy for insightful discussions which helped in improving the quality of this work. This work has been supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the WoWNet project IUK-1902-003// IUK625/002.
References

-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
-  (2018) VSE++: improving visual-semantic embeddings with hard negatives.
-  (2005) Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pp. 513–520.
-  (2018) Dialog-based interactive image retrieval. In Advances in Neural Information Processing Systems, pp. 678–688.
-  (2019) The Fashion IQ dataset: retrieving images by combining side information and relative natural language feedback. arXiv preprint arXiv:1905.12794.
-  (2017) Automatic spatially-aware fashion concept discovery. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1463–1471.
-  (2015) Discovering states and transformations in image collections. In CVPR.
-  (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216.
-  (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE CVPR.
-  (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE ICCV, pp. 360–368.
-  (2020) A metric learning reality check.
-  (2018) Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 169–185.
-  (2017) Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956.
-  (2015) Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 40, pp. 1376–1401.
-  (2015) In search of the real inductive bias: on the role of implicit regularization in deep learning. CoRR abs/1412.6614.
-  (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 30–38.
-  (2018) FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
-  (2018) Towards building large scale multimodal domain-aware conversation systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
-  (2017) A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976.
-  (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
-  (2019) Drill-down: interactive retrieval of complex scenes using natural language queries. In Advances in Neural Information Processing Systems, pp. 2647–2657.
-  (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
-  (2019) Composing text and image for image retrieval - an empirical odyssey. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6432–6441.
-  (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE CVPR, pp. 5005–5013.
-  (2018) BERT-as-service. https://github.com/hanxiao/bert-as-service
-  (2018) Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 686–701.
-  (2017) Memory-augmented attribute manipulation networks for interactive fashion search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1520–1528.
-  (2017) Dual-path convolutional image-text embeddings with instance loss. ACM TOMM.
Appendix A Important Notes on Fashion IQ Dataset
In the Fashion IQ dataset, 49% of the annotations describe the target image directly, while 32% compare the target and source images. For example, the first annotation of a query may be “is red with a cat logo on front” and the second “is more pop culture and adolescent”. The dataset consists of three non-overlapping subsets, namely “dress”, “top-tee” and “shirt”. We join the two annotations with the text “ and it” to obtain a description similar to a natural sentence a user might type on an E-Commerce platform. The complete text query then becomes: “is red with a cat logo on front and it is more pop culture and adolescent”. Furthermore, we combine the training sets of all three categories to form a bigger training set and train a single model on it. Analogously, we combine the validation sets to form a single validation set.
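The caption joining and split merging described above can be sketched in a few lines. This is a minimal illustration, not our released preprocessing code; the field names ("captions", "candidate", "target") and the per-category dictionary layout are assumptions about how the Fashion IQ annotations might be loaded.

```python
# Sketch of the Fashion IQ preprocessing described above.
# Assumed annotation fields: "candidate" (source image), "target"
# (target image), "captions" (the two relative annotations).

def join_captions(captions):
    """Join the two annotations with ' and it ' into one text query."""
    return " and it ".join(c.strip() for c in captions)

def build_training_set(annotations_by_category):
    """Merge the per-category splits ("dress", "top-tee", "shirt")
    into a single training set for one model."""
    merged = []
    for category, annotations in annotations_by_category.items():
        for ann in annotations:
            merged.append({
                "category": category,
                "source": ann["candidate"],
                "target": ann["target"],
                "text": join_captions(ann["captions"]),
            })
    return merged
```

For the example above, `join_captions(["is red with a cat logo on front", "is more pop culture and adolescent"])` yields the complete query “is red with a cat logo on front and it is more pop culture and adolescent”.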
A challenge on the Fashion IQ dataset was held at ICCV 2019 (https://sites.google.com/view/lingir/fashion-iq). The challenge website also hosts technical reports submitted by the best performing teams. The numbers reported there are quite high, even for the TIRG approach. We investigated the reasons and concluded that these technical reports use quite different settings, making a fair comparison with our results impossible. The reasons and differences are briefly delineated below:
- They treat Fashion IQ as three independent datasets and train one model per category (“dress”, “top-tee” and “shirt”), which yields better performance for each category.
- They pre-train on external datasets such as FashionGen and Fashion200k. It is well known that such transfer learning (via pre-training) generally increases the performance of a model.
- They employ product attributes as side information in their models. In our experiments, we do not use such side information and rely solely on the image and text query.
- They employ higher-capacity image models such as ResNet-101 and ResNet-152. In the original TIRG and in all our experiments, ResNet-17 is used as the image model.
- Since these reports developed models specifically for the competition, they incorporate several tricks, such as ensembling and data augmentation techniques.
Unfortunately, none of the technical reports has published its code, so we are unable to assess the performance of their models in our experimental setting. In short, we can neither reproduce their results nor fairly compare their models with ours in a common experimental setting.
Appendix B Qualitative Results
Fig. 5 presents some qualitative retrieval examples for the MIT-States dataset. For the first query, two “burnt bush” images are retrieved. We can observe that the other retrieved images share the same semantics and are visually similar to the target image. In the second and third rows, we note that the same objects in different states can look drastically different. This highlights the importance of incorporating the text information into the composed representation.
Some qualitative retrieval results for the Fashion200k dataset are presented in Fig. 6. We observe that the model captures style and color information quite well. In the first row, we see similar sleeveless dresses with sequins. Similarly, in the other two queries, the model successfully retrieves images from the same product categories, i.e., jackets and skirts. Moreover, the retrieved images follow the desired modifications expressed in the query text remarkably well.
It is pertinent to highlight that the captions under the images are the ground truth; they are not available to the model as additional input during training or inference.