Deep convolutional neural network models pre-trained for the ImageNet classification task have been successfully adopted to tasks in other domains, such as texture description and object proposal generation, but these tasks require annotations for images in the new domain. In this paper, we focus on a novel and challenging task in the purely unsupervised setting: fine-grained image retrieval. Even with image labels, fine-grained images are difficult to classify, let alone retrieve without supervision. We propose the Selective Convolutional Descriptor Aggregation (SCDA) method. SCDA first localizes the main object in fine-grained images, a step that discards the noisy background and keeps useful deep descriptors. The selected descriptors are then aggregated and reduced in dimensionality into a short feature vector using the best practices we found. SCDA is unsupervised, using no image label or bounding box annotation. Experiments on six fine-grained datasets confirm the effectiveness of SCDA for fine-grained image retrieval. Besides, visualization of the SCDA features shows that they correspond to visual attributes (even subtle ones), which might explain SCDA's high mean average precision in fine-grained retrieval. Moreover, on general image retrieval datasets, SCDA achieves retrieval results comparable with state-of-the-art general image retrieval approaches.
After the breakthrough in image classification using Convolutional Neural Networks (CNNs), models pre-trained for one task (e.g., recognition or detection) have also been applied to domains different from their original purposes (e.g., for describing texture or finding object proposals). Such adaptations of pre-trained CNN models, however, still require further annotations in the new domain (e.g., image labels). In this paper, we show that for fine-grained images, which contain only subtle differences among categories (e.g., varieties of dogs), pre-trained CNN models can both localize the main object and find images of the same variety. Since no supervision is used, we call this novel and challenging task fine-grained image retrieval.
In fine-grained image classification [4, 5, 6, 7, 8, 9], categories correspond to varieties of the same species. The categories are all similar to each other, distinguished only by slight and subtle differences. Therefore, an accurate system usually requires strong annotations, e.g., bounding boxes for objects or even object parts. Such annotations are expensive and unrealistic in many real applications. In response to this difficulty, there have been attempts to categorize fine-grained images with only image-level labels, e.g., [6, 7, 8, 9].
In this paper, we handle a more challenging but more realistic task, i.e., Fine-Grained Image Retrieval (FGIR). In FGIR, given database images of the same species (e.g., birds, flowers or dogs) and a query, the system should return images of the same variety as the query, without resorting to any other supervision signal. FGIR is useful in applications such as biological research and bio-diversity protection. As illustrated in Fig. 1, FGIR is also different from general-purpose image retrieval. General image retrieval focuses on retrieving near-duplicate images based on similarities in their contents (e.g., textures, colors and shapes), while FGIR focuses on retrieving images of the same type (e.g., the same species for animals and the same model for cars). Meanwhile, objects in fine-grained images have only subtle differences, and vary in pose, scale and rotation.
To meet these challenges, we propose the Selective Convolutional Descriptor Aggregation (SCDA) method, which automatically localizes the main object in fine-grained images and extracts discriminative representations for them. In SCDA, only a pre-trained CNN model (from ImageNet which is not fine-grained) is used and we use absolutely no supervision. As shown in Fig. 2, the pre-trained CNN model first extracts convolution activations for an input image. We propose a novel approach to determine which part of the activations are useful (i.e., to localize the object). These useful descriptors are then aggregated and dimensionality reduced to form a vector representation using practices we propose in SCDA. Finally, a nearest neighbor search ends the FGIR process.
We conducted extensive experiments on six popular fine-grained datasets (CUB200-2011 , Stanford Dogs , Oxford Flowers 102 , Oxford-IIIT Pets , Aircrafts  and Cars ) for image retrieval. Moreover, we also tested the proposed SCDA method on standard general-purpose retrieval datasets (INRIA Holiday  and Oxford Building 5K ). In addition, we report the classification accuracy of the SCDA method, which only uses the image labels. Both retrieval and classification experiments verify the effectiveness of SCDA. The key advantages and major contributions of our method are:
We propose a simple yet effective approach to localize the main object. This localization is unsupervised, without utilizing bounding boxes, image labels, object proposals, or additional learning. SCDA selects only useful deep descriptors and removes background or noise, which benefits the retrieval task.
With the ensemble of multiple CNN layers and the proposed dimensionality reduction practice, SCDA has a shorter but more accurate representation than existing deep learning based methods (cf. Sec. IV). For fine-grained images, as presented in Table III, SCDA achieves the best retrieval results. Furthermore, SCDA also obtains accurate results on general-purpose image retrieval datasets, cf. Table V.
As shown in Fig. 8, the compressed SCDA feature has stronger correspondence to visual attributes (even subtle ones) than the deep activations, which might explain the success of SCDA for fine-grained tasks.
Moreover, beyond the specific fine-grained image retrieval task, our proposed method could be treated as one kind of transfer learning, i.e., a model trained for one task (image classification on ImageNet) is used to solve another different task (fine-grained image retrieval). It indeed reveals the reusability of deep convolutional neural networks.
The rest of this paper is organized as follows. Sec. II introduces the related work on general deep image retrieval and fine-grained image tasks. The details of the proposed SCDA method are presented in Sec. III. In Sec. IV, for fine-grained image retrieval, we compare our method with several baseline approaches and three state-of-the-art general deep image retrieval approaches. Moreover, the quality of the SCDA feature is discussed. Sec. V concludes the paper.
We will briefly review two lines of related work: deep learning approaches for image retrieval and research on fine-grained images.
Until recently, most image retrieval approaches were based on local features (SIFT being a typical example) and feature aggregation strategies on top of these local features. Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vector (FV) are two typical feature aggregation strategies. After the success of CNNs, image retrieval also embraced deep learning: out-of-the-box features from pre-trained deep networks were shown to achieve state-of-the-art results in many vision-related tasks, including image retrieval.
Some efforts (e.g., [21, 22, 23, 24, 25, 26, 27]) studied which deep descriptors can be used and how to use them in image retrieval, and have achieved satisfactory results. To improve the invariance of CNN activations without degrading their discriminative ability, the multi-scale orderless pooling (MOP-CNN) method was proposed. MOP-CNN first extracts CNN activations from the fully connected layers for local patches at multiple scale levels, then performs orderless VLAD pooling of these activations at each level separately, and finally concatenates the features. Later work extensively evaluated the performance of such features with and without fine-tuning on related datasets, showing that PCA-compressed deep features can outperform compact descriptors computed on traditional SIFT-like features. It was subsequently found that using sum-pooling to aggregate deep features on the last convolutional layer leads to better performance, giving the sum-pooled convolutional (SPoC) features. Building on that, spatial and per-channel weighting was applied before sum-pooling to create the final aggregation. Another approach proposed a compact image representation derived from the convolutional layer activations that encodes multiple image regions without the need to re-feed multiple inputs to the network. Very recently, several effective usages of CNN activations were investigated for both image retrieval and classification; in particular, aggregating the activations of each layer and concatenating them into the final representation achieved satisfactory results.
However, these approaches directly used the CNN activations/descriptors and encoded them into a single representation, without evaluating the usefulness of the obtained deep descriptors. In contrast, our proposed SCDA method can select only useful deep descriptors and remove background or noise by localizing the main object unsupervisedly. Meanwhile, we have also proposed several good practices of SCDA for retrieval tasks. In addition, the previous deep learning based image retrieval approaches were all designed for general image retrieval, which is quite different from fine-grained image retrieval. As will be shown by our experiments, state-of-the-art general image retrieval approaches do not work well for the fine-grained image retrieval task.
Fine-grained image classification methods can be roughly categorized into three groups. The first group, e.g., [31, 8], attempted to learn a more discriminative feature representation by developing powerful deep models for classifying fine-grained images. The second group aligned the objects in fine-grained images to eliminate pose variations and the influence of camera position. The last group focused on part-based representations. However, because it is not realistic to obtain strong annotations (object bounding boxes and/or part annotations) for a large number of images, more algorithms attempted to classify fine-grained images using only image-level labels, e.g., [6, 7, 8, 9].
All the previous fine-grained classification methods needed image-level labels (some even needed part annotations) to train their deep networks. Few works have touched unsupervised retrieval of fine-grained images. Wang et al. proposed Deep Ranking to learn similarity between fine-grained images. However, it requires image-level labels to build a set of triplets, so it is not unsupervised and cannot scale well to large-scale image retrieval tasks.
One line of research related to FGIR proposed the fine-grained image search problem. That work used the bag-of-words model with SIFT features, while we use pre-trained CNN models. Beyond this difference, a more important difference is how the database is constructed. It built a hierarchical database by merging several existing image retrieval datasets, including fine-grained datasets (e.g., CUB200-2011 and Stanford Dogs) and general image retrieval datasets (e.g., Oxford Buildings and Paris). Given a query, it first determines the query's meta class, and then performs a fine-grained image search if the query belongs to a fine-grained meta category. In FGIR, by contrast, the database contains images of one single species, which is more suitable for fine-grained applications. For example, a bird protection project may not want to find dog images given a bird query. To the best of our knowledge, ours is the first attempt at fine-grained image retrieval using deep learning.
In this section, we propose the Selective Convolutional Descriptor Aggregation (SCDA) method. Firstly, we will introduce the notations used in this paper. Then, we present the descriptor selection process, and finally, the feature aggregation details will be described.
The following notations are used in the rest of this paper. The term “feature map” indicates the convolution results of one channel; the term “activations” indicates the feature maps of all channels in a convolution layer; and the term “descriptor” indicates the d-dimensional component vector of the activations. “pool5” refers to the activations of the max-pooled last convolution layer, and “fc8” refers to the activations of the last fully connected layer.
Given an input image of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements, which include a set of 2-D feature maps S = {S_n} (n = 1, …, d). S_n, of size h × w, is the feature map of the corresponding channel (the n-th channel). From another point of view, T can also be considered as having h × w cells, and each cell contains one d-dimensional deep descriptor. We denote the deep descriptors as X = {x_(i,j)}, where x_(i,j) ∈ R^d is the descriptor at a particular cell (i, j), with i ∈ {1, …, h} and j ∈ {1, …, w}. For instance, by employing the popular pre-trained VGG-16 model to extract deep descriptors, we get a pool5 activation tensor of size 7 × 7 × 512 if the input image is 224 × 224. Thus, on one hand, for this image, we have 512 feature maps (S_1, …, S_512) of size 7 × 7; on the other hand, 49 deep descriptors of 512-d are also obtained.
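As a concrete illustration, the two equivalent views of such an activation tensor can be sketched in a few lines of NumPy; a hypothetical random tensor stands in for real VGG-16 pool5 activations:

```python
import numpy as np

# Hypothetical pool5-like activation tensor for one image: h x w x d.
h, w, d = 7, 7, 512
T = np.random.rand(h, w, d)

# View 1: d feature maps, each of size h x w (one per channel).
feature_maps = np.moveaxis(T, -1, 0)   # shape (512, 7, 7)

# View 2: h*w deep descriptors, each d-dimensional (one per spatial cell).
descriptors = T.reshape(h * w, d)      # shape (49, 512)
```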
What distinguishes SCDA from existing deep learning-based image retrieval methods is that, using only the pre-trained model, SCDA is able to find useful deep convolutional features, which in effect localizes the main object in the image and discards irrelevant and noisy image regions. Note that the pre-trained model is not fine-tuned on the target fine-grained dataset. In the following, we propose our descriptor selection method, and then present quantitative and qualitative localization results.
In Fig. 3, we show some images taken from five fine-grained datasets: CUB200-2011, Stanford Dogs, Oxford Flowers 102, Aircrafts and Cars. We randomly sample several feature maps from the 512 feature maps of pool5 and overlay them on the original images for better visualization. As can be seen from Fig. 3, the activated regions of the sampled feature maps (highlighted in warm color) may indicate semantically meaningful parts of birds/dogs/flowers/aircrafts/cars, but can also indicate background or noisy parts in these fine-grained images.
In addition, the semantic meanings of the activated regions are quite different even for the same channel. For example, in the 464th feature map for birds on the right side, the activated region in the first image indicates the Pine Warbler's tail, while in the second it indicates the Black-capped Vireo's head. In the 274th feature map for dogs, the first indicates the German Shepherd's head, while the second has no activated region at all for the Cockapoo, except for a part of the noisy background. The other examples of flowers, aircrafts and cars show the same characteristics. In addition, some activated regions represent the background, e.g., the 19th feature map for the Pine Warbler and the 418th one for the German Shepherd. Fig. 3 conveys that not all deep descriptors are useful, and one single channel contains at best weak semantic information due to the distributed nature of this representation. Therefore, selecting and using only useful deep descriptors (and removing noise) is necessary. However, in order to decide which deep descriptor is useful (i.e., contains the object we want to retrieve), we cannot count on any single channel individually.
We propose a simple yet effective method (shown in Fig. 2), whose quantitative and qualitative evaluation will be demonstrated in the next section. Although one single channel is not very useful, if many channels fire at the same region, we can expect this region to be an object rather than the background. Therefore, in the proposed method, we add up the obtained activation tensor through the depth direction. The 3-D tensor T thus becomes a 2-D tensor, which we call the “aggregation map”, i.e., A = Σ_{n=1}^{d} S_n (where S_n is the n-th feature map of T). The aggregation map A contains h × w summed activation responses, corresponding to the h × w positions. Based on the aforementioned observation, it is straightforward to say that the higher the activation response of a particular position (i, j), the more likely its corresponding region is part of the object. Additionally, fine-grained image retrieval is an unsupervised problem, in which we have no prior knowledge to exploit. Consequently, we calculate the mean value ā of all positions in A as the threshold to decide which positions localize objects: a position whose activation response is higher than ā indicates that the main object, e.g., a bird, dog or aircraft, might appear there. A mask map M of the same size as A can be obtained as:

M_{i,j} = 1 if A_{i,j} > ā, and M_{i,j} = 0 otherwise,

where (i, j) is a particular position among the h × w positions.
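The aggregation-and-threshold step above can be sketched as follows (a minimal NumPy illustration, not the authors' implementation; `mask_from_activations` is a name we introduce here):

```python
import numpy as np

def mask_from_activations(T):
    """Sum the activation tensor T (h x w x d) over the channel axis to get
    the aggregation map A, then threshold at its mean to obtain the mask M."""
    A = T.sum(axis=-1)               # aggregation map, shape (h, w)
    a_bar = A.mean()                 # mean activation response as threshold
    M = (A > a_bar).astype(int)      # mask map: 1 where the object likely is
    return A, M
```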
In Fig. 3, the figures in the second-to-last column for each fine-grained dataset show some examples of the mask maps for birds, dogs, flowers, aircrafts and cars, respectively. For these figures, we first resize the mask map M using bicubic interpolation, such that its size is the same as the input image, and then overlay the corresponding mask map (highlighted in red) onto the original image. Even though the proposed method is not trained on these datasets, the main objects (e.g., birds, dogs, aircrafts or cars) can be roughly detected. However, as can be seen from these figures, several small noisy parts are still activated on complicated backgrounds. Fortunately, because the noisy parts are usually smaller than the main object, we employ Algorithm 1 to collect the largest connected component of M, denoted as M̃, to get rid of the interference caused by noisy parts. In the last column, the main objects are kept by M̃, while the noisy parts are discarded, e.g., the plant, the cloud and the grass.
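Algorithm 1 itself is not reproduced here, but keeping the largest connected component of a binary mask can be sketched with a plain breadth-first search over 4-connected neighbors (an illustrative stand-in, not the paper's code):

```python
import numpy as np
from collections import deque

def largest_component(M):
    """Keep only the largest 4-connected component of a binary mask M."""
    h, w = M.shape
    seen = np.zeros_like(M, dtype=bool)
    best = np.zeros_like(M)
    for i in range(h):
        for j in range(w):
            if M[i, j] and not seen[i, j]:
                # Flood-fill one component starting from (i, j).
                comp, q = [], deque([(i, j)])
                seen[i, j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and M[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > best.sum():
                    best = np.zeros_like(M)
                    for y, x in comp:
                        best[y, x] = 1
    return best
```

In practice a library routine such as `scipy.ndimage.label` would do the same job more efficiently; the loop above is kept dependency-free for clarity.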
Therefore, we use M̃ to select useful and meaningful deep convolutional descriptors. The descriptor x_(i,j) should be kept when M̃_{i,j} = 1, while M̃_{i,j} = 0 means the position might contain background or noisy parts:

F = { x_(i,j) : M̃_{i,j} = 1 },

where F stands for the selected descriptor set, which will be aggregated into the final representation for retrieving fine-grained images. The whole convolutional descriptor selection process is illustrated in Fig. 2b-2e.
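The selection rule amounts to one boolean-indexing operation over the flattened descriptor grid (an illustrative sketch; the function name is ours):

```python
import numpy as np

def select_descriptors(T, M_tilde):
    """Keep the deep descriptors x_(i,j) whose mask entry M_tilde[i, j] is 1."""
    h, w, d = T.shape
    keep = M_tilde.astype(bool).reshape(h * w)
    return T.reshape(h * w, d)[keep]   # the selected descriptor set
```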
In this section, we give a qualitative and quantitative evaluation of the proposed descriptor selection process. Because four fine-grained datasets (i.e., CUB200-2011, Stanford Dogs, Aircrafts and Cars) supply a ground-truth bounding box for each image, it is natural to evaluate the proposed method for object localization. However, as seen in Fig. 3, the detected regions are irregularly shaped, so the minimum rectangular bounding boxes containing the detected regions are returned as our object localization predictions.
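Extracting the minimum enclosing rectangle from a binary mask reduces to taking extremes of the nonzero coordinates (a sketch; we assume a (row_min, col_min, row_max, col_max) box convention):

```python
import numpy as np

def mask_to_bbox(M):
    """Minimum rectangle (y0, x0, y1, x1) enclosing the nonzero mask region."""
    ys, xs = np.nonzero(M)
    return ys.min(), xs.min(), ys.max(), xs.max()
```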
We evaluate the proposed method for localizing the whole object (birds, dogs, aircrafts or cars) on the corresponding test sets. Example predictions can be seen in Fig. 4. In these figures, the predicted bounding boxes approximate the ground-truth bounding boxes fairly accurately, and some results are even better than the ground truth. For instance, in the first dog image shown in Fig. 4, the predicted bounding box covers both dogs; and in the third one, the predicted box contains less background, which is beneficial to retrieval performance. Moreover, the predicted boxes for Aircrafts and Cars are almost identical to the ground-truth bounding boxes in many cases. However, since we utilize no supervision, some details of the fine-grained objects, e.g., birds' tails, cannot be covered accurately by the predicted bounding boxes.
We also report the results in terms of the Percentage of Correctly Localized Parts (PCP) metric for object localization in Table I. The reported metric is the percentage of whole-object boxes that are correctly localized with more than 50% IoU with the ground-truth bounding boxes. In this table, for CUB200-2011, we show the PCP results of two fine-grained parts (i.e., head and torso) reported by some previous part localization based fine-grained classification algorithms [38, 4, 5]. Here, we first compare the whole-object localization rates with those of the fine-grained parts for a rough comparison. In fact, the torso bounding box is highly similar to that of the whole object in CUB200-2011. By comparing the PCP results for the torso and our whole object, we find that, even though our method is unsupervised, its localization performance is only slightly lower than or even comparable to that of these algorithms using strong supervision, e.g., ground-truth bounding boxes and part annotations (even in the test phase). For Stanford Dogs, our method achieves 78.86% object localization accuracy. Moreover, the results on Aircrafts and Cars are 94.91% and 90.96%, respectively, which validates the effectiveness of the proposed unsupervised object localization method.
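The IoU criterion underlying the PCP metric can be sketched as follows (boxes given as (y0, x0, y1, x1); a generic illustration, not the evaluation code used in the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (y0, x0, y1, x1)."""
    y0 = max(box_a[0], box_b[0])
    x0 = max(box_a[1], box_b[1])
    y1 = min(box_a[2], box_b[2])
    x1 = min(box_a[3], box_b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)   # overlap area (0 if disjoint)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0
```

A predicted box would then count as correctly localized when `iou(pred, gt) > 0.5`.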
Additionally, in our proposed method, the largest connected component of the obtained mask map is kept. We further investigate how this filtering step affects object localization performance by removing it. Based on the raw mask map M, the object localization accuracies are 45.18%, 68.67%, 59.83% and 79.36% for CUB200-2011, Stanford Dogs, Aircrafts and Cars, respectively. The localization accuracy based on M is much lower than that based on M̃, which proves the effectiveness of keeping the largest connected component. Besides, we also relate these drops to the size of the ground-truth bounding boxes. From this point of view, Fig. 5 shows the percentage of each whole image covered by its ground-truth bounding box on the four fine-grained datasets. Most ground-truth bounding boxes of CUB200-2011 and Aircrafts cover less than 50% of the whole image; thus, for these two datasets, the drops are large. However, for Cars, as shown in Fig. 5(d), the distribution of the coverage percentage approaches a normal distribution, and for Stanford Dogs, few ground-truth bounding boxes cover less than 20% or more than 80% of the image. Therefore, for these two datasets, the effect of removing the largest-connected-component processing is small.
Moreover, because our method does not require any supervision, we use a state-of-the-art unsupervised object localization method as the baseline. That method uses off-the-shelf region proposals to form a set of candidate bounding boxes for objects. These regions are then matched across images using a probabilistic Hough transform that evaluates the confidence of each candidate correspondence considering both appearance and spatial consistency. After that, dominant objects are discovered and localized by comparing the scores of candidate regions and selecting those that stand out over other regions containing them. As it is not a deep learning based method, most of its localization results on these fine-grained datasets, reported in Table I, are not satisfactory. Specifically, for many images of Aircrafts, it returns the whole image as the predicted bounding box. However, as shown in Fig. 5(c), only a small percentage of ground-truth bounding boxes approach the size of the whole image, which could explain why its unsupervised localization accuracy on Aircrafts is much worse than ours.
|Dataset|Method|Head|Torso|Whole-object|
|CUB200-2011|Strong DPM|43.49|75.15|–|
||Part-based R-CNN with BBox|68.19|79.82|–|
||Deep LAC|74.00|96.00|–|
||Part-based R-CNN|61.42|70.68|–|
||Unsupervised object discovery|–|–|69.37|
|Stanford Dogs|Unsupervised object discovery|–|–|36.23|
|Aircrafts|Unsupervised object discovery|–|–|42.11|
|Cars|Unsupervised object discovery|–|–|93.05|
After the selection process, the selected descriptor set is obtained. In the following, we compare several encoding or pooling approaches to aggregate these convolutional features, and then give our proposal.
Pooling approaches. We also try two traditional pooling approaches, i.e., global average-pooling and max-pooling, to aggregate the selected deep descriptors:

p_avg = (1/k) Σ_{x ∈ F} x,   p_max = max_{x ∈ F} x (element-wise),

where p_avg and p_max are both d-dimensional, and k = |F| is the number of selected descriptors.
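The two pooled vectors and their concatenation can be sketched as follows (illustrative NumPy; `avgmax_pool` is our name for the concatenated scheme described in the text):

```python
import numpy as np

def avgmax_pool(selected):
    """Aggregate k selected d-dim descriptors (a k x d array) into a
    2d-dim vector by concatenating average- and max-pooling."""
    p_avg = selected.mean(axis=0)   # global average-pooling, d-dim
    p_max = selected.max(axis=0)    # global max-pooling, d-dim
    return np.concatenate([p_avg, p_max])
```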
After encoding or pooling the selected descriptor set into a single vector, for VLAD and FV, square-root normalization and ℓ2-normalization follow; for the max- and average-pooling methods, we only do ℓ2-normalization (square-root normalization did not work well). Finally, cosine similarity is used for the nearest neighbor search. We use two datasets to determine which type of aggregation method is optimal for fine-grained image retrieval. The original training and testing splits provided with the datasets are used. Each image in the testing set is treated as a query, and the training images are regarded as the gallery. The top-1 and top-5 mAP retrieval performance is reported in Table II.
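The normalization and nearest-neighbor search step can be sketched as follows (a generic illustration; after ℓ2-normalization, cosine similarity reduces to a dot product):

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)

def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity
    to the query (both sides l2-normalized first)."""
    q = l2_normalize(query)
    G = np.stack([l2_normalize(g) for g in gallery])
    sims = G @ q                 # cosine similarities
    return np.argsort(-sims)     # best match first
```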
For the parameter choices of VLAD/FV, we follow the suggestions reported in the literature. The number of clusters in VLAD and the number of Gaussian components in FV are both set to 2; as shown in the table, larger values lead to lower accuracy. Moreover, we find that simpler aggregation methods such as global max- and average-pooling achieve better retrieval performance compared with the high-dimensional encoding approaches. These observations are also consistent with findings for general image retrieval. The reason why VLAD and FV do not work well here is the rather small number of deep descriptors that need to be aggregated: the average number of selected deep descriptors per image is 40.12 for CUB200-2011 and 46.74 for Stanford Dogs. We therefore propose to concatenate the global max-pooling and average-pooling representations, “avgmaxPool”, as our aggregation scheme. Its performance is significantly and consistently higher than the others. We use the “avgmaxPool” aggregation as the “SCDA feature” to represent the whole fine-grained image.
|Method|Dimension|CUB200-2011 top-1|CUB200-2011 top-5|Stanford Dogs top-1|Stanford Dogs top-5|
|Fisher Vector (K=2)|2,048|52.04|59.19|68.37|73.74|
|Fisher Vector (K=128)|131,072|45.44|53.10|61.40|67.63|
As studied in [41, 42], an ensemble of multiple layers boosts the final performance. Thus, we also incorporate another SCDA feature produced from the relu5_2 layer, which is three layers in front of pool5 in the VGG-16 model.
Following the same procedure, we get the mask map M̃_relu5_2 from relu5_2. Its activations are less related to semantic meaning than those of pool5; as shown in Fig. 6(c), there are many noisy parts. However, the bird is detected more accurately than by M̃_pool5. Therefore, we combine M̃_pool5 and M̃_relu5_2 to get the final mask map for relu5_2: M̃_pool5 is first upsampled to the size of M̃_relu5_2, and we keep the descriptors whose positions are 1 in both M̃_pool5 and M̃_relu5_2 as the final selected descriptors. The aggregation process remains the same. Finally, we concatenate the SCDA features of pool5 and relu5_2 into a single representation, denoted by “SCDA+”:

SCDA+ = [SCDA_pool5, α · SCDA_relu5_2],

where α is the coefficient for the relu5_2 feature; it is set to 0.5 for FGIR. After that, we ℓ2-normalize the concatenated feature. In addition, the SCDA+ feature of the horizontal flip of the original image is incorporated, which is denoted as “SCDA_flip+” (4,096-d). Additionally, we also tried combining features from more layers, e.g., pool4. However, the retrieval performance improved only slightly (about 0.01%–0.04% top-1 mAP), while the feature dimensionality became much larger than that of the proposed SCDA features.
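The weighted concatenation and final normalization can be sketched as follows (illustrative; `scda_plus` is a name we introduce, with α = 0.5 as in the text):

```python
import numpy as np

def scda_plus(f_pool5, f_relu5_2, alpha=0.5):
    """Concatenate the pool5 SCDA feature with the alpha-weighted relu5_2
    SCDA feature, then l2-normalize the result."""
    f = np.concatenate([f_pool5, alpha * f_relu5_2])
    return f / np.linalg.norm(f)
```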
In this section, we first describe the datasets and the implementation details of the experiments. Then, we report the fine-grained image retrieval results. We also test our proposed SCDA method on two general-purpose image retrieval datasets. As additional evidence of the effectiveness of SCDA, we report the fine-grained classification accuracy obtained by fine-tuning the pre-trained model with image-level labels. Finally, the main observations are summarized.
For fine-grained image retrieval, the empirical evaluation is performed on six benchmark fine-grained datasets, CUB200-2011  (200 classes, 11,788 images), Stanford Dogs  (120 classes, 20,580 images), Oxford Flowers 102  (102 classes, 8,189 images), Oxford-IIIT Pets  (37 classes, 7,349 images), Aircrafts  (100 classes, 10,000 images) and Cars  (196 classes, 16,185 images).
In the experiments, for the pre-trained deep model, the publicly available VGG-16 model is employed to extract deep convolutional descriptors using the open-source library MatConvNet. For all the retrieval datasets, the subtracted mean pixel values for zero-centering the input images are provided by the pre-trained VGG-16 model. All experiments are run on a computer with an Intel Xeon E5-2660 v3, 500 GB main memory, and an Nvidia Tesla K80 GPU.
In the following, we report the results for fine-grained image retrieval. We compare the proposed method with several baseline approaches and three state-of-the-art general image retrieval approaches, SPoC, CroW and R-MAC. The top-1 and top-5 mAP results are reported in Table III.
First, we use SIFT descriptors with Fisher Vector encoding as the handcrafted-feature-based retrieval baseline. The parameters of SIFT and FV used in the experiments follow the literature. The feature dimension is 32,768. Its retrieval performance on CUB200-2011, Stanford Dogs, Oxford Flowers and Oxford Pets is significantly worse than that of the deep learning methods/baselines. The retrieval results on rigid bodies like aircrafts and cars are good, though still worse than those of deep learning retrieval methods. In addition, we also feed the ground-truth bounding boxes instead of the whole images. As shown in Table III, because the ground-truth bounding boxes of these fine-grained images contain just the main objects, “SIFT_FV_gtBBox” achieves significantly better performance than the whole images.
For the fc8 baseline, because it requires input images at a fixed size, the original images are resized to 224 × 224 and then fed into VGG-16. Similar to the SIFT baseline, we also feed the ground-truth bounding boxes instead of the whole images; the fc8 feature of the ground-truth bounding box achieves better performance. Moreover, the retrieval results of the fc8 feature using the bounding boxes predicted by our method are also shown in Table III, and are only slightly lower than the ground-truth ones. This observation validates the effectiveness of our method's object localization once again.
For the pool5 baseline, the descriptors are extracted directly without any selection process. We pool them by both average- and max-pooling and concatenate them into a 1,024-d representation. As shown in Table III, the performance of pool5 is better than “fc8_im”, but much worse than the proposed SCDA feature. In addition, VLAD and FV are employed to encode the selected deep descriptors, denoted as “selectVLAD” and “selectFV” in Table III. The selectVLAD and selectFV features have larger dimensionality, but lower mAP in the retrieval task.
State-of-the-art general image retrieval approaches, e.g., SPoC, CroW and R-MAC, cannot achieve satisfactory results on fine-grained images. Hence, general deep learning image retrieval methods cannot be directly applied to FGIR.
We also report the results of SCDA+ and SCDA_flip+ on these six fine-grained datasets in Table III. In general, SCDA_flip+ is the best among the compared methods. Comparing these results with those of SCDA, we find that the multi-layer ensemble strategy (cf. Sec. III-D) improves the retrieval performance, and the horizontal flip boosts it further. Therefore, if a retrieval task prefers a low-dimensional feature representation, SCDA is the optimal choice; otherwise, the post-processing on features is recommended.
| Method | Dimension | CUB200-2011 (top-1 / top-5) | Stanford Dogs (top-1 / top-5) | Oxford Flowers (top-1 / top-5) | Oxford Pets (top-1 / top-5) | Aircrafts (top-1 / top-5) | Cars (top-1 / top-5) |
| SPoC (w/o cen.) | 256 | 34.79 / 42.54 | 48.80 / 55.95 | 71.36 / 74.55 | 60.86 / 67.78 | 37.47 / 43.73 | 29.86 / 36.23 |
| SPoC (with cen.) | 256 | 39.61 / 47.30 | 48.39 / 55.69 | 65.86 / 70.05 | 64.05 / 71.22 | 42.81 / 48.95 | 27.61 / 33.88 |
In the following, we compare several feature compression methods on the SCDA feature: (a) Singular Value Decomposition (SVD); (b) Principal Component Analysis (PCA); (c) PCA whitening (its results were much worse than the other methods and are omitted); and (d) SVD whitening. We compress the feature to 256-d and 512-d, respectively, and report the compressed results in Table IV. Comparing the results in Table III and Table IV, the compression methods can reduce the dimensionality without hurting the retrieval performance. SVD (which does not remove the mean vector) has slightly higher rates than PCA (which removes the mean vector). The “512-d SVD+whitening” feature achieves better retrieval performance: 2% to 4% higher than the original feature on CUB200-2011 and Oxford Flowers, and 7% to 13% higher on Aircrafts and Cars. Moreover, “512-d SVD+whitening” generally achieves better performance than the other compressed SCDA features despite having fewer dimensions. Therefore, we take it as our optimal choice for FGIR. In the following, we present some retrieval examples based on “512-d SVD+whitening”.
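The SVD-whitening compression described above can be sketched with numpy. This is a minimal illustration under the stated convention that, unlike PCA, the mean vector is not removed: the projection keeps the top singular directions of the gallery matrix and rescales each by its singular value so the projected dimensions have comparable energy.

```python
import numpy as np

def fit_svd_whitening(features, dim=512, eps=1e-8):
    """Fit an SVD-whitening projection on a gallery of features.

    features: (n, d) matrix. Following the text, the mean is NOT removed
    (plain SVD rather than PCA). Returns a (d, dim) projection matrix.
    """
    # SVD of the (uncentered) data matrix.
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    # Keep the top `dim` right-singular vectors and divide by the
    # singular values, which whitens the projected dimensions.
    return vt[:dim].T / (s[:dim] + eps)

def compress(feature, proj):
    """Project one feature and L2-normalize the result."""
    z = feature @ proj
    return z / (np.linalg.norm(z) + 1e-12)
```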
In Fig. 7, we show two successful retrieval results and two failure cases for each fine-grained dataset. As shown in the successful cases, our method works well when the same kind of birds, animals, flowers, aircrafts or cars appears against different kinds of background. In the failure cases, there exist only tiny differences between the query image and the returned ones, which cannot be accurately detected in this purely unsupervised setting. We can also make some interesting observations, e.g., in the last failure case of the flowers and of the pets. For the flowers, there are two correct predictions in the top-5 returned images: even though the flowers in the correct predictions have different colors from the query, our method can still find them. For the pets’ failure cases, the dogs in the returned images have the same pose as in the query image.
In this section, we discuss the quality of the proposed SCDA feature. After SVD and whitening, the leading dimensions of the compressed SCDA feature are the most discriminative, i.e., they directly correspond to semantic visual properties that are useful for retrieval. We use five datasets (CUB200-2011, Stanford Dogs, Oxford Flowers, Aircrafts and Cars) as examples to illustrate this quality. We first select one dimension of “512-d SVD+whitening”, and then sort the images by the value of that dimension in descending order. We then visualize the images in that order, as shown in Fig. 8.
Images in each column share some similar “attributes”, e.g., living in water and opening wings for birds; brown-and-white heads and similar-looking faces for dogs; similarly shaped inflorescences and petals with tiny spots for flowers; similar poses and propellers for aircrafts; and similar viewpoints and car types for cars. Obviously, the SCDA feature has the ability to describe the main objects’ attributes (even subtle ones). Thus, it can produce human-understandable interpretations for fine-grained images, which might explain its success in fine-grained image retrieval. In addition, because the values of the compressed SCDA feature can be positive, negative or zero, it is meaningful to sort these values either in descending order (shown in Fig. 8) or in ascending order; the images returned in ascending order also exhibit some shared visual attributes.
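The ranking used for this visualization is straightforward; a minimal sketch (illustrative, not the paper's code) is:

```python
import numpy as np

def rank_by_dimension(features, dim_index, ascending=False):
    """Order image indices by the value of one compressed-feature
    dimension; images near the top of the ranking tend to share a
    visual attribute, as in Fig. 8.

    features: (n_images, n_dims) matrix of compressed SCDA features.
    """
    order = np.argsort(features[:, dim_index])
    return order if ascending else order[::-1]
```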
To further investigate the effectiveness of the proposed SCDA method, we compare it with three state-of-the-art general image retrieval approaches (SPoC, CroW and R-MAC) on the INRIA Holiday and Oxford Building 5K datasets. Following the protocol in [22, 23], for the Holiday dataset we manually fix images in the wrong orientation by rotating them to the correct orientation, and report the mean average precision (mAP) over 500 and 55 queries for Holiday and Oxford Building 5K, respectively.
In the experiments on these two general image retrieval datasets, we use the SCDA and SCDA_flip features and compress them by SVD whitening. As shown by the results in Table V, the compressed SCDA_flip (512-d) achieves the highest mAP among the proposed variants. In addition, compared with the state of the art, the compressed SCDA_flip (512-d) is significantly better than SPoC and CroW, and comparable with the R-MAC approach. Therefore, the proposed SCDA not only significantly outperforms the general image retrieval state-of-the-art approaches for fine-grained image retrieval, but also obtains comparable results on general-purpose image retrieval tasks.
| Method | Dimension | Holiday | Oxford Building 5K |
| SPoC (w/o cen.) | 256 | 80.2 | 58.9 |
| SPoC (with cen.) | 256 | 78.4 | 65.7 |
| Method of | 9,664 | 84.2 | 71.3 |
| SCDA_flip (SVD whitening) | 256 | 91.6 | 66.4 |
| SCDA_flip (SVD whitening) | 512 | 92.1 | 67.7 |
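The mAP protocol used throughout these comparisons can be sketched as follows; this is the standard definition of average precision over a ranked gallery, given for reference rather than taken from the paper's code.

```python
import numpy as np

def average_precision(ranked_labels):
    """AP of one query.

    ranked_labels: boolean sequence over the ranked gallery, True where
    the returned image is relevant (shares the query's class).
    """
    ranked_labels = np.asarray(ranked_labels, dtype=bool)
    n_rel = ranked_labels.sum()
    if n_rel == 0:
        return 0.0
    hits = np.cumsum(ranked_labels)
    # Precision at each rank where a relevant image occurs (ranks are 1-based).
    precisions = hits[ranked_labels] / (np.flatnonzero(ranked_labels) + 1)
    return float(precisions.mean())

def mean_average_precision(all_ranked_labels):
    """mAP: mean of per-query APs."""
    return float(np.mean([average_precision(r) for r in all_ranked_labels]))
```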
Finally, we compare SCDA with several state-of-the-art fine-grained classification algorithms to validate its effectiveness from the classification perspective.
In the classification experiments, we adopt two strategies to fine-tune the VGG-16 model with only image-level labels. The first strategy directly fine-tunes the pre-trained model of the original VGG-16 architecture, adding horizontal flips of the original images as data augmentation. After obtaining the fine-tuned model, we extract features as the whole-image representations and feed them into a linear SVM to train a classifier.
The other strategy is to build an end-to-end SCDA architecture. Before each epoch, the masks of the two convolutional layers are extracted first. Then, we implement the selection process as an element-wise product between the convolutional activation tensor and the mask matrix. Therefore, the descriptors located in the object region remain, while the other descriptors become zero vectors. In the forward pass of the end-to-end SCDA, we select the descriptors of the two layers as aforementioned, and then both max- and average-pool (followed by normalization) the selected descriptors into the corresponding SCDA feature. After that, the SCDA features of the two layers are concatenated into the so-called “SCDA” feature, which is the final representation of the end-to-end SCDA model. A classification (fc+softmax) layer is then added for end-to-end training. Because the partial derivative of the mask is zero, it does not affect the backward pass of the end-to-end SCDA. After each epoch, the masks are updated based on the SCDA model learned in the last epoch. When the end-to-end SCDA converges, this feature is also extracted.
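A minimal sketch of the mask extraction and the element-wise selection used in this forward pass is given below. It assumes the mask construction described earlier in the paper (positions whose channel-summed aggregation value exceeds its mean, restricted to the largest connected component); the flood-fill implementation here is an illustrative stand-in, not the paper's code.

```python
import numpy as np

def scda_mask(activations):
    """Compute an SCDA-style selection mask for an h x w x c tensor.

    The aggregation map sums activations over channels; positions above
    its mean are kept, then restricted to the largest 4-connected
    component, which tends to cover the main object.
    """
    agg = activations.sum(axis=2)
    candidate = agg > agg.mean()
    h, w = candidate.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for i in range(h):
        for j in range(w):
            if candidate[i, j] and not seen[i, j]:
                # Flood fill one connected component.
                stack, comp = [(i, j)], []
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and candidate[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    mask = np.zeros((h, w), dtype=bool)
    for y, x in best:
        mask[y, x] = True
    return mask

def select(activations, mask):
    # Element-wise product: descriptors outside the mask become zero
    # vectors, so gradients flow only through the selected positions.
    return activations * mask[:, :, None]
```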
For both strategies, the regularization coefficient of the linear SVM is set to 1, letting the classifier learn and then select important dimensions automatically. The classification accuracy comparison is listed in Table VI.
For the first fine-tuning strategy, the classification accuracy of our method (“SCDA (f.t.)”) is comparable or even better than the algorithms trained with strong supervised annotations, e.g., [4, 5]. For these algorithms using only image-level labels, our classification accuracy is comparable with the algorithms using similar fine-tuning strategies ([6, 7, 9]), but still does not perform as well as those using more powerful deep architectures and more complicated data augmentations [8, 31]. For the second fine-tuning strategy, even though “SCDA (end-to-end)” obtains a slightly lower classification accuracy than “SCDA (f.t.)”, the end-to-end SCDA model contains the least number of parameters (i.e., only 15.53M), which attributes to no fully connected layers in the architecture.
| Method | Model | # Para. | Dim. | Birds | Dogs | Flowers | Pets | Aircrafts | Cars |
| PB R-CNN with BBox | Alex-Net | 173.03M | 12,288 | 76.4 | – | – | – | – | – |
| Deep LAC | Alex-Net | 173.03M | 12,288 | 80.3 | – | – | – | – | – |
| PB R-CNN | Alex-Net | 173.03M | 12,288 | 73.9 | – | – | – | – | – |
| Weakly supervised FG | VGG-16 | 135.07M | 262,144 | 79.3 | 80.4 | – | – | – | – |
| Constellations | VGG-19 | 140.38M | 208,896 | 81.0 | 68.6¹ | 95.3 | 91.6 | – | – |
| Bilinear | VGG-16 and VGG-M | 73.67M | 262,144 | 84.0 | – | – | – | 83.9 | 91.3 |
| Spatial Transformer Net | ST-CNN (inception) | 62.68M | 4,096 | 84.1 | – | – | – | – | – |
| SCDA (end-to-end) | VGG-16 (w/o FCs) | 15.53M | 4,096 | 80.1 | 77.4 | 90.2 | 90.3 | 78.6 | 85.1 |
¹ The result on Birds was reported using VGG-19, while the result on Dogs is based on the Alex-Net model.
Thus, our method has fewer dimensions and is simple to implement, which makes SCDA more scalable for large-scale datasets without strong annotations and easier to generalize. In addition, the CroW paper reported the classification accuracy on CUB200-2011 without any fine-tuning (56.5% by VGG-16). We also evaluate the 512-d SCDA feature (containing only the max-pooling part this time, for a fair comparison) without any fine-tuning: the classification accuracy on that dataset is 73.7%, which outperforms theirs by a large margin.
To further investigate the generalization ability of the proposed SCDA method, we additionally conduct experiments on a recently released fine-grained dataset for biodiversity analysis, i.e., the Moth dataset. This dataset includes 2,120 moth images of 675 highly similar classes, which are completely disjoint from the images of ImageNet. In Table VII, we present the retrieval results of SCDA and the other baseline methods. Because several classes in Moth have fewer than five images per class, we only report the top-1 mAP results. Consistent with the observations in Sec. IV-B, the proposed method still outperforms the other baseline methods, which shows that it generalizes well.
| Method | Dimension | top-1 mAP |
| SPoC (with cen.) | 256 | 42.96 |
In this section, we compare the inference speed of our SCDA with that of other methods. Because the methods listed in Table VIII can handle arbitrary image resolutions, different fine-grained datasets yield different speeds. Specifically, much larger images will cause the GPUs to run out of memory; thus, following the original image scaling, we resize overly large images. As the speeds in Table VIII show, it is understandable that SCDA is slower than the pooling baseline. In general, SCDA has a computational speed comparable to CroW and is significantly faster than R-MAC, but slightly slower (by about 1 frame/sec) than SPoC. In practice, if your retrieval task prefers high accuracy, SCDA_flip is recommended; if you prefer efficiency, SCDA is scalable enough for handling large-scale fine-grained datasets while still delivering good retrieval accuracy (cf. Table III).
| Method | CUB200-2011 | Stanford Dogs | Oxford Flowers | Oxford Pets | Aircrafts | Cars |
| SPoC (w/o cen.) | 8.70 | 8.77 | 5.92 | 10.10 | 2.76 | 5.46 |
| SPoC (with cen.) | 8.40 | 8.62 | 5.78 | 10.10 | 2.72 | 5.49 |
In the following, we summarize several empirical observations of the proposed selective convolutional descriptor aggregation method for FGIR.
Simple aggregation methods such as max- and average-pooling achieved better retrieval performance than high-dimensional encoding approaches. The proposed SCDA representation concatenates both the max- and average-pooled features, which achieved the best retrieval performance, as reported in Table II and Table III.
Convolutional descriptors performed better than the representations of the fully connected layer for FGIR. In Table III, the unselected pooling representation and the “selectFV” and “selectVLAD” representations are all based on convolutional descriptors; no matter what kind of aggregation method they used, their retrieval results are (significantly) better than those of the fully connected features.
Selecting descriptors is beneficial to both fine-grained and general-purpose image retrieval. As reported in Table III and Table V, the proposed SCDA method achieved the best results for FGIR, while remaining comparable with the state-of-the-art general image retrieval approaches.
The SVD whitening compression method not only reduces the dimensionality of the SCDA feature, but also improves the retrieval performance, sometimes by a large margin (cf. the results of Aircrafts and Cars in Table IV). Moreover, the compressed SCDA feature can describe the main objects’ subtle attributes, as shown in Fig. 8.
In this paper, we proposed to use only a CNN model pre-trained on non-fine-grained tasks to tackle the novel and difficult fine-grained image retrieval task. We proposed the Selective Convolutional Descriptor Aggregation (SCDA) method, which is unsupervised and requires no additional learning. SCDA first localizes the main object in fine-grained images in an unsupervised manner with high accuracy. The selected (localized) deep descriptors are then aggregated, using the best practices we found, to produce a short feature vector for a fine-grained image. Experimental results showed that, for fine-grained image retrieval, SCDA outperformed all the baseline methods, including the general image retrieval state of the art. Moreover, the SCDA features exhibited well-defined semantic visual attributes, which may explain the high retrieval accuracy for fine-grained images. Meanwhile, SCDA obtained comparable retrieval performance on standard general image retrieval datasets. The satisfactory results on both fine-grained and general-purpose image retrieval datasets validate the benefits of selecting convolutional descriptors.
In the future, we plan to incorporate the selected deep descriptors’ weights to find object parts. Another interesting direction is to explore the possibility of using pre-trained models for more complicated vision tasks, such as unsupervised object segmentation. Indeed, enabling models trained for one task to be reusable for another, different task, particularly without additional training, is an important step toward the development of learnware.
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, Jun. 2015, pp. 3828–3836.
T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, Jun. 2015, pp. 842–850.
M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, Montréal, Canada, Dec. 2015, pp. 2008–2016.
Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.