Content-based image retrieval (CBIR) is the task of identifying relevant images using representative visual content (such as high-level information in an image) [1, 2, 3, 4]. Instance retrieval at a fine-grained level, on the other hand, can be defined as a finer visual search, for example finding a dress with a design or pattern similar to the query's in a catalogue of dresses, or retrieving a certain species of bird. Fine details (namely attributes) help to tell apart different instances with similar appearances, such as two similar gulls of different types. This work discusses instance retrieval at a fine-grained level, which we simply call instance retrieval throughout the rest of this paper to differentiate it from CBIR.
The importance of semantic attributes in instance retrieval has been emphasized before. Most of the research in fine-grained instance retrieval uses the attributes directly to retrieve instances [7, 8, 9]. An alternative is to use the features extracted from a network pre-trained on ImageNet [10, 11, 4, 12] (similar to some approaches in CBIR). However, this is not a very accurate approach for instance retrieval, since ImageNet includes 1000 object classes which are mostly categorized at a coarse level of recognition; for instance, there are a few categories of birds, but they are not as fine-grained as those in a dataset like CUB200. Fine-tuning a model pre-trained on ImageNet for a fine-grained dataset, on the other hand, requires prior knowledge of the classes in that dataset. In fine-grained instance retrieval, however, we very often receive datasets of images with no specific categories or classes, and the only available information is the attributes, either annotated or extracted from the metadata. This makes fine-grained instance retrieval different from fine-grained instance classification [14, 15], a problem which is often addressed in the context of fine-grained visual recognition.
In this work, we show that it is possible to achieve good instance retrieval results using the global features extracted from a trained multi-attribute recognition network. Similar to CBIR, the global features (e.g. from a fully connected layer) are used by a metric learning method (e.g. Euclidean distance) to retrieve instances similar to the query image; here, however, the network is trained for attribute recognition at a fine-grained level. We choose the VGG16 architecture as it has small filter kernels which make capturing small details possible. To adapt VGG16 for multi-attribute recognition we use a pairwise ranking loss function, which has been shown to be effective for multi-label classification. This simple approach leads to competitive results and is able to capture similar instances, especially those with good visual similarity in terms of texture, color, material and shape. Further, by adopting Bilinear CNN we modify the VGG16 network and introduce a small model (in terms of number of parameters) for multi-attribute recognition at a fine-grained level, which can achieve satisfactory results. The size of the network is always an important factor to consider at the production level.
We experiment with two datasets: the CUB200 dataset, which consists of 11k images of 200 species of birds annotated with 312 attributes, and the dress category of the DeepFashion In-shop Retrieval dataset, with 336 attributes. Previous research in clothes retrieval has rarely considered the fine-grained case where only one category of clothes (e.g. dresses) is available to train on and the retrieval task has to be done against a gallery of diverse clothing types. We show that it is possible to achieve good retrieval results when the network is only exposed to images of one category, the same as that of the query image. An example is a customer who provides a catalogue of dresses for which the attributes are known or can be extracted from the metadata, and who is interested in finding items similar to the ones in the catalogue but from a diverse dataset of clothes. Further, most clothes retrieval techniques rely heavily on landmark detection, whereas our approach ignores landmark information.
The paper is organized as follows. In Section II, we describe the datasets used in the experiments. In Section III, we first discuss the network used for multi-attribute recognition and the smaller network which is based on the bilinear CNN architecture. We then explain which global features are used for retrieval and which metric learning methods are used for retrieving a query item from the gallery. The results are presented in Section IV, and the paper is concluded in Section V.
II Datasets
There are a few fine-grained datasets publicly available [18, 5, 19], among which some provide annotated attributes for fine-grained parts in addition to the instance classes [18, 5] and some do not provide any attribute annotations. Here, we have chosen CUB200, which includes 11k images of 200 bird species. The species are classified at a fine-grained level. An example can be seen in Fig. 1, where all instances in the first row are the same type of woodpecker, whereas the second row shows different species of woodpeckers (the differences are very subtle even for human observers). The birds are annotated with 312 attributes, including the colors of different parts, beak shape, etc. The problem with the CUB200 annotations is that the list of annotated attributes per item is long and not very distinctive, which makes attribute-based retrieval more difficult.
The DeepFashion In-shop Retrieval dataset consists of several categories of clothes for men and women, with 465 attributes overall. The dataset is designed specifically for retrieval; therefore, for each query image there are similar items (some in different colors) available in the gallery. DeepFashion is not a fine-grained dataset; however, each category of clothes can be treated as a fine-grained case. We have chosen the dress category, which is annotated with 336 attributes. The dataset also provides the bounding box for each item as well as landmark points. In our experiments, we do not use the landmark information and only crop the images to the given bounding box to lessen the effect of faces on the network training process. We experiment with both the gallery of dresses only and the gallery of all categories, so that the latter's results can be compared with the benchmark results by FashionNet for the dress category of the In-shop Retrieval dataset.
III Retrieval Method
III-A Multi-Attribute Recognition Network
The first network is VGG16, which is a convolutional neural network with small (3×3) convolutional filter kernels, which makes it suitable for capturing fine details of textures in an image.
We have adapted VGG16 for multi-attribute recognition by using a smooth pairwise ranking loss function:

$$ l_{lsep} = \log\Big(1 + \sum_{v \notin Y_i} \sum_{u \in Y_i} \exp\big(f_v(x_i) - f_u(x_i)\big)\Big) \qquad (1) $$

where $f(x)$ is a label (attribute) prediction model that maps an image $x$ to a K-dimensional label space which represents the confidence scores, and $Y_i$ is the set of true labels of image $x_i$. The model $f$ is designed such that it produces a vector whose values for true labels are greater than those for negative labels (i.e. $f_u(x) > f_v(x)$ for all $u \in Y$, $v \notin Y$). The loss function in (1) enforces this property by calculating the log-sum-exp over all pairs of labels (attributes) and penalizing the values which do not follow this rule. This creates the framework of learning to rank via pairwise comparisons. Equation (1) is a smooth approximation of a similar hinge function [21, 22] used for pairwise comparison. The smooth version makes optimization easier due to its differentiability.
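The ranking behaviour described above can be illustrated with a minimal sketch for a single image. It assumes the log-sum-exp form log(1 + Σ exp(f_v − f_u)) taken over all pairs of a negative label v and a true label u; the function name `lsep_loss` and its list-based interface are illustrative choices, not the implementation used in the experiments.

```python
import math


def lsep_loss(scores, positive_labels):
    """Smooth pairwise ranking loss for one image (illustrative sketch).

    scores: list of K confidence scores, one per attribute.
    positive_labels: set of indices of the attributes that are present.
    """
    positives = [s for i, s in enumerate(scores) if i in positive_labels]
    negatives = [s for i, s in enumerate(scores) if i not in positive_labels]
    # Sum exp(f_v - f_u) over every (negative v, positive u) pair; a pair
    # contributes a large term when the negative outscores the positive.
    pair_sum = sum(math.exp(fv - fu) for fv in negatives for fu in positives)
    return math.log1p(pair_sum)


# A correctly ranked score vector gives a much smaller loss than a reversed one.
well_ranked = lsep_loss([5.0, -5.0], {0})
badly_ranked = lsep_loss([-5.0, 5.0], {0})
```

The loss approaches zero as every true attribute is scored well above every absent one, and it grows roughly linearly with the size of the worst violation.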
In the second network, a bilinear layer is placed right after the last convolutional layer (Conv5_3 + Relu). To save space and reduce the correlation between the feature maps (hence capturing more details), the second copy of the feature map (in Fig. 2) is generated by projecting a copy of the original feature map into a 20-dimensional ICA projection space (which is generated beforehand based on the feature maps from the same training set). Then, the sum of the outer products of the original and the projected feature vectors at each location is calculated and passed through a fully connected layer. The same smooth pairwise ranking loss (1) is used to learn the ranked list of attributes.
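As a rough sketch of the pooling step described above, the following pure-Python function sums, over all spatial locations, the outer product of each feature vector with its projected copy. The `projection` matrix is a hypothetical stand-in for the precomputed ICA projection, and the function name and data layout are assumptions made for illustration.

```python
def bilinear_pool(features, projection):
    """Sum of outer products between features and their projected copies.

    features: list of spatial locations, each a list of C channel values.
    projection: D x C matrix (list of D rows), standing in for the ICA projection.
    Returns a flattened vector of length C * D.
    """
    channels = len(features[0])
    reduced_dim = len(projection)
    pooled = [[0.0] * reduced_dim for _ in range(channels)]
    for x in features:
        # Project the C-dimensional vector at this location down to D dimensions.
        y = [sum(w * xi for w, xi in zip(row, x)) for row in projection]
        # Accumulate the outer product of the original and projected vectors.
        for i in range(channels):
            for j in range(reduced_dim):
                pooled[i][j] += x[i] * y[j]
    return [value for row in pooled for value in row]


# Toy example: 2 locations, C = 3 channels, D = 2 projected dimensions.
toy = bilinear_pool([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]],
                    [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
```

With C = 512 channels and D = 20 projected dimensions the output has 512 × 20 = 10240 entries regardless of the spatial size of the feature map.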
In both networks, the number of labels (see Fig. 2) equals the number of attributes describing the dataset, which is 312 for the CUB200 dataset and 336 for the dress category of the DeepFashion In-shop Retrieval dataset. The important advantage of the second architecture is its size in terms of parameters. For a vocabulary of roughly 300 words (similar to the one for CUB200 or the dress category from DeepFashion) the second architecture is smaller, which makes it suitable for mobile-device applications. The number of parameters for the fully connected layers in VGG16 is shown in Fig. 2. Two of these layers are replaced by the bilinear layer, which uses a projection space to reduce the dimension of one copy of the Conv5_3 + Relu feature map, resulting in a substantial overall saving of space.
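To give a feel for the size difference, the arithmetic below counts the parameters of the layers after the convolutional stack under stated assumptions: the standard VGG16 head (fc6 on a 7×7×512 input, fc7, and an fc8 with one output per attribute), and a bilinear head consisting of a single fully connected layer on the 512 × 20 pooled features. The exact layer configuration is the one given in Fig. 2, so these helper functions and the resulting counts are illustrative only.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of one fully connected layer."""
    return inputs * outputs + outputs


def vgg16_head_params(num_attributes):
    # Assumption: standard VGG16 head, fc6 (7*7*512 -> 4096),
    # fc7 (4096 -> 4096) and fc8 (4096 -> num_attributes).
    return (dense_params(7 * 7 * 512, 4096)
            + dense_params(4096, 4096)
            + dense_params(4096, num_attributes))


def bilinear_head_params(num_attributes, channels=512, reduced_dim=20):
    # Assumption: one fully connected layer on the channels * reduced_dim
    # pooled features; the fixed projection is not counted as trained.
    return dense_params(channels * reduced_dim, num_attributes)


vgg_count = vgg16_head_params(312)
bilinear_count = bilinear_head_params(312)
```

Under these assumptions the fully connected part shrinks from roughly 121 million to roughly 3 million parameters for the 312 CUB200 attributes.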
For the DeepFashion In-shop Retrieval dataset, our models are trained only on the dress category and not on all the categories, since we are interested in the case where only the same category of clothes as the query image is available for training. A practical scenario is when a customer provides a catalogue of one type of clothes (e.g. dresses) and is interested in finding dresses with the same pattern or design within a set of diverse clothes.
To train the models, VGG16 weights pre-trained [24, 25] on the ImageNet dataset are used to initialize the convolutional layers in the second network and all layers in the first architecture.
The models are built on the TensorFlow framework and the experiments are run on an NVIDIA Tesla V100 GPU. For each category the model is trained on average for 14 epochs with a batch size of 16 using the Adam optimizer with a base learning rate of 0.00001.
Throughout the rest of the paper we call the first architecture “VGG16 + MultiAttrib” and the second one “bilin + MultiAttrib”.
III-B Features Used for Retrieval
We use global features from both networks. We have experimented with the last three fully connected layers (fc6, fc7 and fc8) of the VGG16 + MultiAttrib network, out of which the fc6 features gave better retrieval results than the other two. From the bilin + MultiAttrib architecture, we use the feature map from the bilinear layer as well as the outputs of the network, i.e. the scores given to the attributes (312 attributes for the CUB200 dataset and 336 attributes for the dress dataset); the latter is denoted ‘prob’ in Table I.
As mentioned before, it is common to use the features of a network pre-trained on the ImageNet dataset for content-based image retrieval. Here, we compare our results with the ones from the fc6 layer of VGG16 pre-trained on the ImageNet dataset (again, we experimented with different layers and found the fc6 features to be slightly better).
III-C Metric Learning
For most of the experiments the global features are L2 normalized and then the Euclidean distance is used for retrieving the query from the set of images. However, we found that histogram intersection works better when the prob features (the scores) from the bilin + MultiAttrib network are used for retrieval.
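The two ranking schemes can be sketched as follows. This is a minimal illustration assuming feature vectors given as plain lists and, for histogram intersection, non-negative values; the function names and the `use_histogram_intersection` flag are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def histogram_intersection(a, b):
    """Similarity: sum of element-wise minima (assumes non-negative values)."""
    return sum(min(x, y) for x, y in zip(a, b))


def rank_gallery(query, gallery, use_histogram_intersection=False):
    """Return gallery indices ordered from most to least similar to the query."""
    q = l2_normalize(query)
    scored = []
    for index, item in enumerate(gallery):
        g = l2_normalize(item)
        if use_histogram_intersection:
            # Negate the similarity so that a larger overlap sorts first.
            score = -histogram_intersection(q, g)
        else:
            score = euclidean_distance(q, g)
        scored.append((score, index))
    return [index for _, index in sorted(scored)]


gallery = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
by_euclidean = rank_gallery([1.0, 0.0], gallery)
by_intersection = rank_gallery([1.0, 0.0], gallery, use_histogram_intersection=True)
```

Euclidean distance is a dissimilarity while histogram intersection is a similarity, which is why the sketch negates the latter before sorting.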
IV Experiments and Results
As mentioned before, three global feature maps are used for instance retrieval in our experiments: (1) the fc6 layer of the VGG16 + MultiAttrib network, (2) the bilinear layer of the bilin + MultiAttrib network and (3) the scores from the bilin + MultiAttrib network (prob).
Table I shows the retrieval results for the dress category of the DeepFashion In-shop Retrieval dataset (the top section of the table) and for the CUB200 dataset (the bottom section of the table). The model used for producing the features is given in the first column of the table. The second column shows which global features are used for retrieval and the fourth column lists the metric learning methods. The third column gives the number of features in each case. The bilinear layer feature map (from the bilin + MultiAttrib network) has 10240 features, which is the size of the outer product of the original 512 features from the previous layer (Conv5_3 + Relu) and the reduced copy of it (of size 20), i.e. 512 × 20 = 10240. The size of prob (the scores of the bilin + MultiAttrib network) equals the number of attributes for each dataset.
Deep-fashion In-shop Retrieval Dress

|Model|Features|# Features|Metric|Fine-level Similarity (mAP) top-1|Fine-level Similarity (mAP) top-5|Fine-level Similarity (mAP) top-10|Attribute Similarity (IoU) top-1|Attribute Similarity (IoU) top-5|Attribute Similarity (IoU) top-10|
|---|---|---|---|---|---|---|---|---|---|
|VGG16 (imagenet)|fc6|4096|L2 + Euclidean|0.33|0.15|0.10|0.48|0.35|0.31|
|VGG16 + MultiAttrib|fc6|4096|L2 + Euclidean|0.74|0.36|0.22|0.81|0.52|0.42|
|bilin + MultiAttrib|bilinear|10240|L2 + Euclidean|0.70|0.37|0.23|0.77|0.53|0.42|
|bilin + MultiAttrib|prob|336|L2 + Hist_inter|0.69|0.39|0.24|0.77|0.55|0.45|

CUB200

|Model|Features|# Features|Metric|Fine-level Similarity (mAP) top-1|Fine-level Similarity (mAP) top-5|Fine-level Similarity (mAP) top-10|Coarse-level Similarity (mAP) top-1|Coarse-level Similarity (mAP) top-5|Coarse-level Similarity (mAP) top-10|Attribute Similarity (IoU) top-1|Attribute Similarity (IoU) top-5|Attribute Similarity (IoU) top-10|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|VGG16 (imagenet)|fc6|4096|L2 + Euclidean|0.40|0.31|0.27|0.59|0.53|0.48|0.27|0.27|0.27|
|VGG16 + MultiAttrib|fc6|4096|L2 + Euclidean|0.41|0.32|0.28|0.52|0.53|0.50|0.28|0.29|0.28|
|bilin + MultiAttrib|bilinear|10240|L2 + Euclidean|0.43|0.37|0.33|0.63|0.58|0.54|0.29|0.29|0.29|
|bilin + MultiAttrib|prob|312|L2 + Hist_inter|0.42|0.30|0.26|0.55|0.50|0.47|0.29|0.29|0.29|
For the dress dataset the query is retrieved from the gallery of dresses only. The first three result columns report the mean average precision for top-1, top-5 and top-10 retrieval, where the precision for top-k retrieval is calculated as the ratio between the number of relevant retrieved items and the k retrieved items; for instance, if 3 of the top-5 retrieved items are correct, the precision is 0.60. The reported results are averaged over all 1901 dress query images. The last three columns of the table show the attribute similarity, which is calculated as the Intersection over Union (IoU) between the attributes of the query image and those of the items retrieved from the gallery. The best top-1 precision is obtained with the fc6 features of the VGG16 + MultiAttrib network. For top-5 and top-10 retrieval, the output (prob) of the bilin + MultiAttrib network outperforms the other two global features. The first row shows the results for the global features from the fc6 layer of VGG16 pre-trained on ImageNet. Using these features for instance retrieval leads to poor results compared to the global features of the multi-attribute recognition networks.
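The two per-query measures described above can be sketched as follows. The function names and the use of item labels and attribute sets as inputs are illustrative assumptions; how the per-item IoU values are aggregated over the top-k results is not specified here and is left out.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved items that match the query's label."""
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == query_label) / k


def attribute_iou(query_attributes, retrieved_attributes):
    """Intersection over Union between two attribute sets."""
    query_set = set(query_attributes)
    retrieved_set = set(retrieved_attributes)
    union = query_set | retrieved_set
    return len(query_set & retrieved_set) / len(union) if union else 0.0


def mean_over_queries(values):
    """Average a per-query measure over all query images."""
    return sum(values) / len(values)


# 3 of the top-5 retrieved items share the query's label.
example_precision = precision_at_k(["a", "b", "a", "a", "c"], "a", 5)
example_iou = attribute_iou({"lace", "maxi"}, {"lace", "floral", "maxi", "red"})
```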
The bottom section of the table shows the same results for the CUB200 dataset. In addition to the fine-grained retrieval results, we also report the retrieval precision at a coarse level. An example can be seen in the second row of Fig. 1, where all retrieved items are woodpeckers but not all are Downy woodpeckers (which is the query item). For the CUB200 dataset the best results are achieved using the feature maps from the bilinear layer of the bilin + MultiAttrib network.
Another point to notice is that the performance for birds is not as good as for the dress dataset, since the attribute annotation for birds is in general poorer than for dresses: there is a lot of overlap between the attributes of different species, and the list of attributes is very long and less distinctive.
More detailed precision results (for all top-k retrieval k=1, …, 10) for the dress dataset are shown in Fig. 3.
To compare the retrieval results for the In-shop Retrieval dress dataset with the state-of-the-art results, we have retrieved the query dresses against the whole gallery consisting of clothes from all categories (including dresses) and compared the results with the ones by FashionNet. The results are plotted in Fig. 4 for top-k retrieval (k=1, …, 50). We use the same evaluation technique as the FashionNet authors, who calculate retrieval accuracy based on successful retrievals, where a retrieval is successful if at least one item similar to the query is found in the top-k results. The top-1 retrieval results for all three global feature maps extracted from our multi-attribute recognition networks outperform the FashionNet results. Further, the fc6 features from the VGG16 + MultiAttrib network lead for all top-k retrieval results. It is important to note that we are comparing a much simpler technique with a complicated network such as FashionNet, which makes use of landmark information and whose retrieval results are also learned with the triplet loss. The simplicity of our proposed method makes it more suitable for practical applications. Further, the multi-attribute recognition networks used in our experiments have never been exposed to other categories of clothes and are only trained on the dress training set. This addresses the scenario where only the catalogue of one type of clothes desired by a customer is provided for training and we need to query against a diverse clothes dataset.
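The success-based accuracy described above can be sketched in a few lines. The function name and the representation of rankings and relevant items as lists and sets of gallery identifiers are assumptions made for illustration.

```python
def top_k_accuracy(ranked_ids_per_query, relevant_ids_per_query, k):
    """Fraction of queries with at least one relevant item in their top-k results."""
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        # A query counts as a success if any of its top-k items is relevant.
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)


# Two queries: the first finds its match at rank 2, the second at rank 3.
rankings = [[3, 1, 2], [5, 4, 6]]
relevant = [{1}, {6}]
accuracy_curve = [top_k_accuracy(rankings, relevant, k) for k in (1, 2, 3)]
```

Unlike the precision for top-k retrieval, this measure can only stay the same or increase as k grows, which is why it is usually plotted as a curve over k.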
Fig. 5 shows visual examples of the retrieval results using fc6 features from the VGG16 + MultiAttrib network for different types of attributes. The query item is marked with a black rectangular box and the correctly retrieved items are marked with red rectangles. The first row is an example of successful retrieval of texture by our method, where the first two retrieved items are exactly the same as the query and the last two dresses are very similar in terms of texture. The second row shows that all the retrieved items are of the same hue (green/bluish green). Fabric is another important factor in clothes retrieval; an example of the successful retrieval of similar fabrics by our method can be seen in the third row, where all the retrieved dresses are made of lace. Finally, the last row of Fig. 5 shows that the retrieved dresses are all maxi dresses with strapped shoulders, which is an example of the successful retrieval of styles by the fc6 features of VGG16 + MultiAttrib.
V Conclusion
In this paper, we showed that by using the global features from a multi-attribute recognition network we can achieve successful instance retrieval results at a fine-grained level. For both the CUB200 and the DeepFashion In-shop Retrieval dress datasets, the instance retrieval results using the global features of the multi-attribute recognition networks are better than the ones obtained with the global features of the network pre-trained on ImageNet. We demonstrated that for the dress category of the DeepFashion In-shop Retrieval dataset we can obtain competitive retrieval results in comparison to the benchmark FashionNet method. These results are significant considering that our proposed method does not use landmark information and is simpler to implement. Besides, it addresses the scenario where only one category of clothes with annotated attributes is provided for training, but the retrieval needs to be done from a diverse set of clothes. Further, we showed that by adopting the bilinear CNN architecture we can make the network smaller than the original VGG16 and still achieve good retrieval results using global features extracted from the model. The latter design makes the model suitable for mobile-device applications. The visual analysis of the retrieved results for the dresses confirms the effectiveness of our method in retrieving similar items in terms of texture, color, fabric and design.
-  A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 12, pp. 1349–1380, 2000.
-  M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 2, no. 1, pp. 1–19, 2006.
-  Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, “A survey of content-based image retrieval with high-level semantics,” Pattern recognition, vol. 40, no. 1, pp. 262–282, 2007.
-  W. Zhou, H. Li, and Q. Tian, “Recent advance in content-based image retrieval: A literature survey,” arXiv preprint arXiv:1706.06064, 2017.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
-  Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1096–1104.
-  B. Siddiquie, R. S. Feris, and L. S. Davis, “Image ranking and retrieval based on multi-attribute queries,” 2011.
-  M. Rastegari, A. Diba, D. Parikh, and A. Farhadi, “Multi-attribute queries: To merge or not to merge?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3310–3317.
-  X. Cao, H. Zhang, X. Guo, S. Liu, and X. Chen, “Image retrieval and ranking via consistently reconstructing multi-attribute queries,” in European Conference on Computer Vision. Springer, 2014, pp. 569–583.
-  J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning for content-based image retrieval: A comprehensive study,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 157–166.
-  J. Yue-Hei Ng, F. Yang, and L. S. Davis, “Exploiting local features from deep networks for image retrieval,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 53–61.
-  A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, “Visual instance retrieval with deep convolutional networks,” ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in European conference on computer vision. Springer, 2014, pp. 834–849.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  Y. Li, Y. Song, and J. Luo, “Improving pairwise ranking for multi-label image classification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
-  L. Yang, P. Luo, C. Change Loy, and X. Tang, “A large-scale car dataset for fine-grained categorization and verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3973–3981.
-  T.-Y. Liu et al., “Learning to rank for information retrieval,” Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
-  Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,” arXiv preprint arXiv:1312.4894, 2013.
-  J. Weston, S. Bengio, and N. Usunier, “Wsabie: Scaling up to large vocabulary image annotation,” in IJCAI, vol. 11, 2011, pp. 2764–2770.
-  A. Hyvärinen, “Survey on independent component analysis,” 1999.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009, pp. 248–255.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval. Cambridge University Press, 2008, vol. 39.