Fine-Grained Image Retrieval (FGIR) [fgis_tmm, crl, towardscrl, scda, part_based_fgir, adversarial_fgir] is a practical but challenging computer vision task. Given a database of images belonging to various sub-categories of a certain meta-category (e.g., birds, cars and aircraft), it aims to return images of the same sub-category as the query image. In real FGIR applications, previous methods can suffer from slow query speed and redundant storage costs due to both the explosive growth of fine-grained data and the use of high-dimensional real-valued features.
In the literature, learning to hash [LSH:conf/compgeom/DatarIIM04, ITQ:conf/cvpr/GongL11, SH:conf/nips/WeissTF08, CNNH:conf/aaai/XiaPLLY14, DH:conf/cvpr/LiongLWMZ15, DPSH:conf/ijcai/LiWK16, DSH:conf/cvpr/Liu0SC16, DSDH:conf/nips/LiSHT17, DCH:conf/cvpr/CaoLL018, ADSH:conf/aaai/JiangL18, HashGAN:conf/cvpr/DizajiZSYDH18] has proven to be a promising solution for large-scale image retrieval because it can greatly reduce the storage costs and increase the query speeds. As a representative research area of approximate nearest neighbor (ANN) search [LSH:conf/compgeom/DatarIIM04, PQ:journals/pami/JegouDS11, KDTree:journals/cacm/Bentley75], hashing aims to embed data points as similarity-preserving binary codes. Recently, hashing has been successfully applied in a wide range of image retrieval tasks, e.g., face image retrieval [DDH:conf/ijcai/LinLT17], person re-identification [PDH:journals/tip/ZhuKZFT17, CSBT:conf/cvpr/ChenWQLS17], etc. We hereby explore the effectiveness of hashing for fine-grained image retrieval.
To the best of our knowledge, this is the first work to study the fine-grained hashing problem, i.e., the problem of designing hashing methods for fine-grained objects. As shown in Figure 1, the task requires generating compact binary codes for fine-grained images that exhibit both large intra-class variances and small inter-class variances. To deal with this challenging task, we propose a unified end-to-end trainable network, ExchNet, which first learns fine-grained tailored features and then generates the final binary hash codes.
Concretely, our ExchNet consists of three main modules, namely representation learning, local feature alignment and hash code learning, as shown in Figure 2. In the representation learning module, beyond obtaining the holistic image representation (i.e., global features), we also employ the attention mechanism to capture part-level features (i.e., local features) that represent the parts of fine-grained objects. Localizing parts and embedding part-level cues are crucial for fine-grained tasks, since these discriminative but subtle parts (e.g., bird heads or tails) play a major role in distinguishing different sub-categories. Moreover, we develop two kinds of attention constraints, i.e., spatial and channel constraints, which work together to further improve the discriminative ability of these local features. Next, to ensure that these part-level features correspond to the same parts across different fine-grained images, we design an anchor-based feature alignment approach to align these local features. Specifically, in the local feature alignment module, we treat the anchored local features as the “prototype” of each sub-category, obtained by averaging all the local features of that part across images. Once local features are well aligned with their own parts, even if we exchange the local feature of one specific part of an input image with the local feature of the same part of the prototype, the semantics derived from the image representations, and hence the final hash codes, should remain extremely similar. Motivated by this, we perform a feature exchanging operation between the anchored local features and the other learned local features, as illustrated in Figure 3. Finally, to train the network effectively under this feature alignment scheme, we utilize an alternating algorithm that solves the hashing learning problem and updates the anchor features simultaneously.
To quantitatively evaluate both the effectiveness and the efficiency of our ExchNet, we conduct comprehensive experiments on five fine-grained benchmark datasets, including the large-scale ones, i.e., NABirds [NABirds:conf/cvpr/HornBFHBIPB15], VegFru [VegFru:conf/iccv/HouFW17] and Food101 [Food101:conf/eccv/BossardGG14]. In particular, compared with competing approximate nearest neighbor methods, our ExchNet achieves speedups of up to several hundred times for large-scale fine-grained image retrieval without significant accuracy drops. Meanwhile, ExchNet consistently outperforms state-of-the-art generic hashing methods by a large margin on all the fine-grained datasets. Additionally, ablation studies and visualization results justify the effectiveness of our tailored model designs, such as the local feature alignment and the proposed attention approach.
The contributions of this paper are summarized as follows:
We study the novel fine-grained hashing topic to leverage the search and storage efficiency of hash codes for solving the challenging large-scale fine-grained image retrieval problem.
We propose a unified end-to-end trainable network, i.e., ExchNet, to first learn fine-grained tailored features and then generate the final binary hash codes. In particular, the proposed attention constraints, local feature alignment and anchor-based learning scheme all contribute to obtaining discriminative fine-grained representations.
We conduct extensive experiments on five fine-grained datasets to validate both the effectiveness and the efficiency of our proposed ExchNet. Especially on the large-scale datasets, ExchNet exhibits superior performance in terms of speedup, memory usage and retrieval accuracy.
2 Related Work
2.0.1 Fine-Grained Image Retrieval
Fine-Grained Image Retrieval (FGIR) is an active research topic that has emerged in recent years, where the database and query images share small inter-class variance but large intra-class variance. In previous works [fgis_tmm], handcrafted features were initially utilized to tackle the FGIR problem. Powered by deep learning techniques, more and more deep learning based FGIR methods [fgis_tmm, crl, maskcnn, towardscrl, scda, part_based_fgir, adversarial_fgir, piecewise] were proposed. These deep methods can be roughly divided into two groups, i.e., supervised and unsupervised methods. In supervised methods, FGIR is defined as a metric learning problem. Zheng et al. [crl] designed a novel ranking loss and a weakly-supervised attractive feature extraction strategy to improve the retrieval performance. Zheng et al. [towardscrl] improved their former work [crl] with a normalize-scale layer and a de-correlated ranking loss. As to unsupervised methods, Selective Convolutional Descriptor Aggregation (SCDA) [scda] was proposed to first localize the main object in fine-grained images, and then discard the noisy background and keep useful deep descriptors for fine-grained image retrieval.
2.0.2 Deep Hashing
Hashing methods can be divided into two categories, i.e., data-independent methods [LSH:conf/compgeom/DatarIIM04] and data-dependent methods [ITQ:conf/cvpr/GongL11, DPSH:conf/ijcai/LiWK16], based on whether training points are used to learn hash functions. Generally speaking, data-dependent methods, also known as Learning to Hash (L2H) methods, can achieve better retrieval performance by learning from training data. With the rise of deep learning, some L2H methods integrate deep feature learning into hashing frameworks and achieve promising performance. Many deep hashing methods [CNNH:conf/aaai/XiaPLLY14, DH:conf/cvpr/LiongLWMZ15, DPSH:conf/ijcai/LiWK16, DSH:conf/cvpr/Liu0SC16, DSDH:conf/nips/LiSHT17, DCH:conf/cvpr/CaoLL018, ADSH:conf/aaai/JiangL18, HashGAN:conf/cvpr/DizajiZSYDH18, eccv:hashing1, eccv:hashing2, eccv:hashing3, eccv:hashing4, eccv:hashing5] have been proposed for large-scale image retrieval. Compared with deep unsupervised hashing methods [DH:conf/cvpr/LiongLWMZ15, HashGAN:conf/cvpr/DizajiZSYDH18, ADSH:conf/aaai/JiangL18], deep supervised hashing methods [CNNH:conf/aaai/XiaPLLY14, DPSH:conf/ijcai/LiWK16, DSDH:conf/nips/LiSHT17, ADSH:conf/aaai/JiangL18] can achieve superior retrieval accuracy as they can fully exploit the semantic information. Specifically, the early work [CNNH:conf/aaai/XiaPLLY14] was essentially a two-stage method, which learned binary codes in the first stage and performed feature learning guided by the learned binary codes in the second stage. Subsequently, numerous one-stage deep supervised hashing methods appeared, including Deep Pairwise Supervised Hashing (DPSH) [DPSH:conf/ijcai/LiWK16], Deep Supervised Hashing (DSH) [DSH:conf/cvpr/Liu0SC16], and Deep Cauchy Hashing (DCH) [DCH:conf/cvpr/CaoLL018], which aimed to integrate feature learning and hash code learning into an end-to-end framework. Hashing Network (HashNet) [Hashnet:conf/iccv/CaoLWY17] utilized a scaled tanh function to approximate the sign function by gradually increasing the scale parameter. Asymmetric Deep Supervised Hashing (ADSH) [ADSH:conf/aaai/JiangL18] used asymmetric hashing to improve the training efficiency and retrieval performance.
The framework of our ExchNet is presented in Figure 2, which contains three key modules, i.e., the representation learning module, local feature alignment module, and hash code learning module.
3.1 Representation Learning
Learning discriminative and meaningful local features is closely tied to fine-grained tasks [bcnn, isqrt, navigate, macnn, racnn], since these local features greatly help to distinguish sub-categories whose subtle visual differences derive from the discriminative fine-grained parts (e.g., bird heads or tails). Consequently, as shown in Figure 2, beyond the global feature extractor, we also introduce a local feature extractor in the representation learning module. Specifically, with model efficiency in mind, we propose to learn local features with the attention mechanism, rather than with other fine-grained techniques that incur tremendous computation cost, e.g., second-order representations [bcnn, isqrt] or complicated network architectures [navigate, macnn, racnn].
Given an input image, a backbone CNN is utilized to extract a holistic deep feature, which serves as the shared input to both the local feature extractor and the global feature extractor.
It is worth mentioning that the attention is engaged in the middle of the feature extractor, since low-level context information (e.g., colors and edges) is well preserved in the shallow layers of deep neural networks, which is crucial for distinguishing subtle visual differences of fine-grained objects. Then, by feeding the holistic deep feature into the attention generation module, a set of attention maps is generated, each denoting the attentive region of one part cue of the input image. After that, each part-level attention map is multiplied element-wise with the holistic deep feature to select the attentive local feature corresponding to that part, which is formulated as:
where the product is the Hadamard product applied on each channel. For simplicity, the attentive local features of an image are collected into a set, which is subsequently fed into the Local Features Refinement (LFR) network, composed of a stack of convolution layers, to embed these attentive local features into higher-level semantic meanings:
where the output of the network represents the final local feature maps w.r.t. high-level semantics. The local feature vector of each part is then obtained by applying global average pooling (GAP) on the corresponding local feature map:
On the other side, as to the global feature extractor, we directly adopt a Global Features Refinement (GFR) network composed of conventional convolutional operations to embed the holistic deep feature, which is presented by:
The output of the GFR network is the learned global feature, and applying GAP to it gives the corresponding holistic feature vector.
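As an illustration of the local branch described above, the following minimal sketch applies one attention map to every channel of a holistic feature (the Hadamard product) and then pools the result with GAP. It is not the authors' implementation: the function names `hadamard_attend` and `global_average_pool`, the channels-first nested-list layout, and the toy numbers are assumptions made for this example, and the LFR network that sits between the two steps is omitted.

```python
def hadamard_attend(feature, attention):
    """Multiply an H x W attention map onto every channel of a C x H x W feature."""
    return [
        [
            [channel[h][w] * attention[h][w] for w in range(len(attention[0]))]
            for h in range(len(attention))
        ]
        for channel in feature
    ]


def global_average_pool(feature):
    """Average each channel of a C x H x W feature over its spatial positions."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


# Toy holistic feature with 2 channels on a 2 x 2 grid (hypothetical values).
holistic = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
# Toy attention map for one part: attends to the diagonal positions only.
part_attention = [[1.0, 0.0], [0.0, 1.0]]

attended = hadamard_attend(holistic, part_attention)
# In the paper, the attended feature would pass through the LFR network here;
# this sketch pools it directly.
local_vector = global_average_pool(attended)
```

With one attention map per part, repeating the same two steps yields one local feature vector per part.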
Furthermore, to facilitate the learning of localizing local feature cues (i.e., capturing fine-grained parts), we impose the spatial diversity and channel diversity constraints over the local features.
Specifically, it is a natural choice to increase the diversity of local features by differentiating the distributions of attention maps [macnn]. However, over-applied constraints on the learned attention maps might cause a problem: an attention map can have large activation values at spatial positions where the holistic feature is not activated. Instead, in our method, we design and apply constraints on the local features. Concretely, for each local feature, we obtain its “aggregation map” by summing all feature maps along the channel dimension, apply the softmax function to convert it into a valid distribution, and then flatten it into a vector. Based on the Hellinger distance, we propose a spatial diversity induced loss as:
where is used to denote the combinatorial number of ways to pick unordered outcomes from possibilities. The spatial diversity constraint drives the aggregation maps to be activated in spatial positions as diverse as possible. As to the channel diversity constraint, we first convert the local feature vector into a valid distribution, which can be formulated by
Subsequently, we propose a constraint loss over as:
where a hyper-parameter is used to adjust the diversity. Equipping the network with the channel diversity constraint helps suppress redundancies in features along the channel dimension. Overall, our spatial diversity and channel diversity constraints work collaboratively to obtain discriminative local features.
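To make the spatial diversity idea concrete, the sketch below builds the aggregation map of each local feature, turns it into a distribution with a softmax, and measures how far apart the distributions of different parts are using the Hellinger distance. The exact form of the loss is not recoverable from the text here, so defining the loss as one minus the mean pairwise Hellinger distance is an assumption of this example, as are the function names and the toy inputs.

```python
import itertools
import math


def aggregation_distribution(local_feature):
    """Sum a C x H x W feature over channels, softmax over positions, flatten."""
    height, width = len(local_feature[0]), len(local_feature[0][0])
    summed = [
        sum(channel[h][w] for channel in local_feature)
        for h in range(height)
        for w in range(width)
    ]
    peak = max(summed)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in summed]
    total = sum(exps)
    return [e / total for e in exps]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2.0)


def spatial_diversity_loss(local_features):
    """Assumed form: 1 - mean Hellinger distance over all unordered part pairs."""
    dists = [aggregation_distribution(f) for f in local_features]
    pairs = list(itertools.combinations(dists, 2))
    mean_distance = sum(hellinger(p, q) for p, q in pairs) / len(pairs)
    return 1.0 - mean_distance


# Two parts that fire at different positions (diverse) versus the same position.
part_a = [[[10.0, 0.0], [0.0, 0.0]]]
part_b = [[[0.0, 0.0], [0.0, 10.0]]]
loss_diverse = spatial_diversity_loss([part_a, part_b])
loss_same = spatial_diversity_loss([part_a, part_a])
```

Under this assumed form, parts that attend to the same positions receive the maximum loss, while parts that attend to different positions are penalized much less.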
3.2 Learning to Align by Local Feature Exchanging
On top of the representation learning module, aligning the local features is necessary to ensure that they represent and, more importantly, correspond to common fine-grained parts across images, which is essential to fine-grained tasks. Hence, we propose an anchor-based local feature alignment approach assisted by our feature exchanging operation.
Intuitively, local features from the same object part (e.g., bird heads of a bird species) should be embedded with almost the same semantic meaning. As illustrated by Figure 3, our key idea is that, if local features are well aligned, exchanging the features of identical parts between two input images belonging to the same sub-category should not change the generated hash codes. Inspired by that, we propose a local feature alignment strategy that leverages the feature exchanging operation, which happens between learned local features and anchored local features. As a foundation for feature exchanging, a set of dynamic anchored local features is maintained for each class, in which the anchored local feature of each part is obtained by averaging that part's local features over all training samples from the class. At the end of each training epoch, the anchored local features are recalculated and updated. Subsequently, as shown in Figure 4, for a sample of a given category, we exchange half of its learned local features with the corresponding anchored local features of that category. The local features after exchanging are fed into the hashing learning module for generating binary codes and computing similarity preservation losses.
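The exchanging step can be sketched as follows. The text states that half of a sample's local features are exchanged with the anchors of its class; how the half is chosen is not recoverable here, so choosing exactly half of the parts uniformly at random is an assumption of this example, and `exchange_with_anchors` is a hypothetical helper name.

```python
import random


def exchange_with_anchors(local_feats, class_anchors, rng):
    """Replace a random half of the parts with the class anchor of the same part.

    local_feats[j] and class_anchors[j] are the feature vectors of part j.
    Returns the mixed list and the set of part indices that were swapped.
    """
    num_parts = len(local_feats)
    swapped = set(rng.sample(range(num_parts), num_parts // 2))
    mixed = [
        class_anchors[j] if j in swapped else local_feats[j]
        for j in range(num_parts)
    ]
    return mixed, swapped


# Toy sample with four parts and the anchors of its class (hypothetical values).
sample_parts = [[0.0], [1.0], [2.0], [3.0]]
anchors_of_class = [[10.0], [11.0], [12.0], [13.0]]
mixed_parts, swapped_parts = exchange_with_anchors(
    sample_parts, anchors_of_class, random.Random(0)
)
```

Because a part is only ever replaced by the anchor of the same part index, well-aligned features should leave the resulting hash code almost unchanged, which is the training signal the alignment strategy relies on.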
3.3 Hash Code Learning
After obtaining both global features and local features, we concatenate them and feed them into the hashing learning module. Specifically, the hashing network contains a fully connected layer and a sign(·) activation function layer. In our method, we choose asymmetric hashing for ExchNet owing to its flexibility [powerofasym]. Concretely, we utilize two hash functions to learn two different binary codes for the same training sample. The learning procedure is as follows:
where the features are joined by the concatenation operator, and the two outputs are the two different binary codes of the same sample, each of the given code length. Each hash function has its own parameters (we omit the bias term for simplicity). Inspired by [ADSH:conf/aaai/JiangL18], for one of the two hash functions we only keep its binary codes and set the hash function itself implicitly. Hence, we can perform feature learning and binary code learning simultaneously.
To preserve the pairwise similarity, we adopt the squared loss and define the following objective function:
where , is the pairwise similarity label and . We use to denote the parameters of deep neural network and hash layer. The aforementioned process is generally illustrated by Figure 4.
Due to the zero-gradient problem caused by the sign(·) function, the objective becomes intractable to optimize. In this paper, we relax sign(·) into tanh(·) to alleviate this problem. Then, we can derive the following loss function:
where and becomes .
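A minimal sketch of a relaxed asymmetric pairwise squared loss, in the style of ADSH, is given below. The precise objective is not recoverable from the text here, so the following are assumptions of this example: network outputs are relaxed with tanh, the target for each pair is the code length times the similarity label, and the labels take values in {-1, +1}. The function name and the toy inputs are likewise hypothetical.

```python
import math


def relaxed_asymmetric_loss(network_outputs, binary_codes, similarity, code_length):
    """Sum over pairs (i, j) of (tanh(z_i) . v_j - code_length * S_ij)^2."""
    loss = 0.0
    for i, z in enumerate(network_outputs):
        relaxed = [math.tanh(value) for value in z]  # differentiable stand-in for sign
        for j, code in enumerate(binary_codes):
            inner = sum(a * b for a, b in zip(relaxed, code))
            loss += (inner - code_length * similarity[i][j]) ** 2
    return loss


# One sample whose outputs saturate to (+1, +1), compared against two stored codes.
outputs = [[20.0, 20.0]]
stored_codes = [[1, 1], [1, -1]]
labels = [[1, -1]]  # similar to the first code, dissimilar to the second
loss_value = relaxed_asymmetric_loss(outputs, stored_codes, labels, code_length=2)
```

In this toy case the first pair matches its target exactly, while the second pair has inner product 0 against a target of -2 and contributes 4 to the loss.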
3.4 Learning Algorithm
To solve the optimization problem in Equation (13), we design an alternating algorithm to learn , , and . Specifically, we learn one parameter with the others fixed.
3.4.1 Learn with and fixed
When , fixed, we use back-propagation (BP) to update the parameters . In particular, for input sample , we first calculate the following gradient:
Then, we use the back-propagation algorithm to update .
3.4.2 Learn with and fixed
When , are fixed, we rewrite as follows:
Because the codes are constrained to be binary, we learn them column by column, as in ADSH [ADSH:conf/aaai/JiangL18]. Specifically, we can get the closed-form solution for each column as follows:
3.4.3 Learn with and fixed
When , fixed, we use the following equation to update each :
where denotes the number of samples in class .
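The anchor update amounts to a per-class, per-part mean of the local feature vectors. A minimal sketch, assuming local feature vectors are stored as nested lists and using the hypothetical helper name `update_anchors`:

```python
from collections import defaultdict


def update_anchors(local_vectors, labels):
    """Average the local feature vectors of each part over the samples of each class.

    local_vectors[i][j] is the vector of part j of sample i; labels[i] is its class.
    Returns a dict so that anchors[c][j] is the anchor of part j for class c.
    """
    grouped = defaultdict(list)
    for parts, label in zip(local_vectors, labels):
        grouped[label].append(parts)

    anchors = {}
    for label, samples in grouped.items():
        count = len(samples)  # number of samples in this class
        anchors[label] = [
            [
                sum(sample[j][d] for sample in samples) / count
                for d in range(len(samples[0][j]))
            ]
            for j in range(len(samples[0]))
        ]
    return anchors


# Three toy samples with two parts each; the first two belong to class 0.
vectors = [
    [[1.0, 3.0], [0.0, 0.0]],
    [[3.0, 5.0], [2.0, 4.0]],
    [[7.0, 7.0], [1.0, 1.0]],
]
anchors = update_anchors(vectors, labels=[0, 0, 1])
```

According to Section 3.2, this recomputation takes place at the end of each training epoch.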
3.5 Out-of-Sample Extension
When we finish the training phase, we can generate the binary code for the sample by .
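For an unseen sample, code generation can be sketched as a linear hash layer over the concatenated global and local feature vectors followed by the sign function. The weight layout (one weight row per bit), the omission of the bias term, and mapping a zero response to +1 are assumptions of this example; `hash_code` is a hypothetical helper name.

```python
def hash_code(global_vector, local_vectors, weights):
    """Concatenate global and local vectors, apply a linear layer, take the sign."""
    concatenated = list(global_vector)
    for vector in local_vectors:
        concatenated.extend(vector)
    code = []
    for row in weights:  # one weight row per output bit
        response = sum(w * x for w, x in zip(row, concatenated))
        code.append(1 if response >= 0 else -1)  # assumed convention: sign(0) = +1
    return code


# Toy features and a hypothetical 3-bit hash layer.
code = hash_code(
    global_vector=[1.0],
    local_vectors=[[2.0], [-1.0]],
    weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, 0.0]],
)
```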
For comparisons, we select two widely used fine-grained datasets, i.e., CUB [CUB:journals/CalTech/WBSWPB2011] and Aircraft [AirCraft:journals/corr/MajiRKBV13], as well as three popular large-scale fine-grained datasets, i.e., NABirds [NABirds:conf/cvpr/HornBFHBIPB15], VegFru [VegFru:conf/iccv/HouFW17], and Food101 [Food101:conf/eccv/BossardGG14], to conduct experiments.
Specifically, CUB is a bird classification benchmark dataset containing images from 200 bird species, and it is officially split into a training set and a test set. Aircraft contains images of aircraft model variants, likewise split into training and test sets. Moreover, among the large-scale datasets, NABirds covers common species of birds in North America, with separate training and test images. VegFru is a large-scale fine-grained dataset covering vegetables and fruits, also with training and test splits. Food101 contains images of different kinds of food. For each class, test images are manually reviewed for correctness while training images still contain some amount of noise.
4.2 Baselines and Implementation Details
For comparisons with other ANN algorithms, we select two tree-based ANN methods, i.e., BallTree [BallTree:journals/corr/DolatshahHM15] and KDTree [KDTree:journals/cacm/Bentley75], and one product quantization based ANN method, i.e., Product Quantization (PQ) [PQ:journals/pami/JegouDS11]. Linear scan means that we directly perform exhaustive search based on the learned real-valued features. For comparisons with other hashing baselines, we choose eight state-of-the-art generic hashing methods, namely LSH [LSH:conf/compgeom/DatarIIM04], SH [SH:conf/nips/WeissTF08], ITQ [ITQ:conf/cvpr/GongL11], SDH [SDH:conf/cvpr/ShenSLS15], DPSH [DPSH:conf/ijcai/LiWK16], DSH [DSH:conf/cvpr/Liu0SC16], HashNet [Hashnet:conf/iccv/CaoLWY17], and ADSH [ADSH:conf/aaai/JiangL18]. Among these methods, DPSH, DSH, HashNet and ADSH are based on deep learning, while the others are not.
4.2.2 Implementation Details
For comparisons with other ANN algorithms, we carry out experiments on Food101, whose database is the largest. We first utilize the triplet loss [tripletloss], owing to its frequent use in fine-grained retrieval tasks, to learn real-valued feature embeddings of two different dimensionalities. Then, the performance of linear scan is tested on the learned features. More experimental settings for BallTree [BallTree:journals/corr/DolatshahHM15], KDTree [KDTree:journals/cacm/Bentley75] and PQ [PQ:journals/pami/JegouDS11] can be found in the supplementary materials. For our ExchNet, the retrieval procedure is divided into a coarse ranking stage, which selects the top-ranked database items as candidates, and a re-ranking stage, which returns the final top results from these candidates. We directly adopt the real-valued features learned with the triplet loss. As presented in Table 1, we report results including precision at top K (P@K), wall clock time (WC time), speedup ratio, and memory cost.
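The two-stage retrieval procedure can be sketched as follows: hash codes give a cheap coarse ranking by Hamming distance, and only the surviving candidates are re-ranked with real-valued features. Storing codes as Python integers, re-ranking by squared Euclidean distance, and the function names are assumptions of this example; the candidate and result counts are free parameters, not values taken from the paper.

```python
def hamming(code_a, code_b):
    """Hamming distance between two binary codes stored as non-negative integers."""
    return bin(code_a ^ code_b).count("1")


def squared_distance(u, v):
    """Squared Euclidean distance between two real-valued feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrieve(query_code, query_feat, db_codes, db_feats, n_candidates, k):
    """Coarse ranking by Hamming distance, then re-ranking by feature distance."""
    coarse = sorted(
        range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i])
    )[:n_candidates]
    reranked = sorted(coarse, key=lambda i: squared_distance(query_feat, db_feats[i]))
    return reranked[:k]


# Toy database of four items with 4-bit codes and 2-D features.
db_codes = [0b0000, 0b0001, 0b1111, 0b0011]
db_feats = [[5.0, 5.0], [0.1, 0.0], [0.0, 0.0], [1.0, 1.0]]
top = retrieve(0b0000, [0.0, 0.0], db_codes, db_feats, n_candidates=3, k=2)
```

Item 2 has the closest real-valued feature but is discarded by the coarse stage because its code is far from the query, which illustrates why the quality of the hash codes bounds the accuracy of the whole pipeline.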
Our backbone employs the first three stages of ResNet50, and the attention generation module is the fourth stage of ResNet50 without downsample convolutions. The LFR and GFR of ExchNet are independent networks that share the same architecture as the fourth stage of ResNet50. The optimizer is standard mini-batch stochastic gradient descent with weight decay. The mini-batch size is set to 64 and the number of iterations is 100. The learning rate is set to 0.001 and is divided by 10 at the 60-th and 80-th iterations. The hyper-parameter is set to . The number of training epochs is 20. For efficient training, we randomly sample a subset of the training set in each epoch. Specifically, for CUB, Aircraft and Food101, we sample 2,000 samples per epoch, while 4,000 samples are randomly selected for the other datasets. To provide reliable local features for our local feature alignment strategy, the part-level feature exchanging operation is disabled in the first 50 iterations, since neither local nor global features are well learned at that stage and aligning them would be meaningless.
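The step schedule and the warm-up rule for feature exchanging described above can be written down in a few lines. Counting iterations from zero and treating the 60-th, 80-th and 50-th iterations as inclusive thresholds are assumptions of this sketch, as are the helper names.

```python
def learning_rate(iteration, base_lr=0.001, milestones=(60, 80), factor=0.1):
    """Base rate divided by 10 at each milestone that has been reached."""
    reached = sum(1 for milestone in milestones if iteration >= milestone)
    return base_lr * factor ** reached


def exchange_enabled(iteration, warmup_iterations=50):
    """Feature exchanging stays off until the warm-up iterations have passed."""
    return iteration >= warmup_iterations


# Learning rate and exchange flag at a few representative iterations.
schedule = [(it, learning_rate(it), exchange_enabled(it)) for it in (0, 49, 50, 60, 80)]
```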
4.3 Comparisons with other ANN Methods
To demonstrate the practicality and effectiveness of our proposed method, comparisons with other ANN methods are presented in this section. All experiments are conducted based on the hash codes generated by our model.
In Table 1, we present the retrieval performance on the Food101 dataset. Specifically, we present the P@10, WC time, speedup, and memory cost for all methods. We can observe that, compared with linear scan, our method achieves large speedups for both feature dimensionalities. The memory cost of our method is also much lower than that of the tree-based methods. The best speedup and the lowest storage usage demonstrate the practicality of our proposed method. Meanwhile, our method achieves state-of-the-art retrieval accuracy, which shows that ExchNet is the most effective among the compared ANN methods. These results suggest that ExchNet is a practical choice for large-scale fine-grained image retrieval.
4.4 Comparisons with State-of-the-art Hashing Methods
In Table 2, we present the mean average precision (MAP) results for comparisons with state-of-the-art hashing methods on all datasets. From Table 2, we can observe that our method achieves the best retrieval performance in all cases. On the fine-grained datasets of relatively small size (CUB and Aircraft), almost all the generic hashing methods (except for ADSH) fail to achieve satisfactory performance, i.e., they obtain a relatively low MAP. Moreover, our ExchNet considerably outperforms ADSH, the most powerful baseline. This verifies that our proposed method performs well even with limited training data. On the large-scale fine-grained datasets, the improvements become more significant. In particular, compared with the most powerful baselines, we achieve MAP improvements in both the 32-bit and 48-bit experiments on the large-scale VegFru dataset, as well as in the 32-bit and 48-bit experiments on the Food101 dataset. This shows that, with sufficient training data, ExchNet obtains better retrieval results on large-scale fine-grained datasets.
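For reference, MAP can be computed from per-query relevance lists as sketched below. This is the standard definition over the full ranked list; whether the paper truncates the ranking at a fixed depth is not stated here, so the untruncated form is an assumption, and the function names and toy rankings are illustrative only.

```python
def average_precision(relevance):
    """Average of precision@rank over the ranks at which a relevant item appears.

    relevance is the ranked result list of one query, with 1 for a database item
    of the same sub-category as the query and 0 otherwise.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_per_query):
    """Mean of the average precision over all queries."""
    scores = [average_precision(r) for r in relevance_per_query]
    return sum(scores) / len(scores)


# Two toy queries: relevant items at ranks 1 and 3, and at rank 2.
map_value = mean_average_precision([[1, 0, 1], [0, 1]])
```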
4.5 Ablation Studies
4.5.1 Effectiveness of the Exchanging-based Feature Alignment
We verify the effectiveness of the local feature alignment approach (cf. Section 3.2) in this section. The retrieval accuracy is presented in Figure 5, where “Ours w/o Exchange” means that we do not perform the feature exchanging operation (i.e., the local feature alignment) during training. Note that “Ours w/o Exchange” degenerates to ADSH [ADSH:conf/aaai/JiangL18] learned with our proposed representation learning architecture instead of ResNet50. Hence, we also present the results of ADSH.
It can be observed that our method achieves the best accuracy thanks to the feature exchanging operation. Specifically, on the CUB and Aircraft datasets, our proposed method with the exchanging operation performs considerably better than the variant without exchanging. The performance improvement becomes more significant on the large-scale fine-grained datasets (e.g., Food101). The above results illustrate that our proposed local feature alignment strategy is effective, especially on large-scale datasets. Moreover, even when the number of hash bits is limited, our feature alignment strategy still greatly benefits fine-grained retrieval.
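[Figure 5: retrieval accuracy with and without the feature exchanging operation. Panels: (a) CUB, (b) Aircraft, (c) Food101.]
[Figure 6: retrieval accuracy for different numbers of local features. Panels: (a) CUB, (b) Aircraft, (c) Food101.]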
4.5.2 Sensitivity to Hyper Parameter
In our ExchNet, the number of local features, which is also the number of attention maps, is a hyper-parameter. In this section, we present the influence of this hyper-parameter through ablation studies.
As presented in Figure 6, we vary the number of local features over several values. From that figure, it is observed that satisfactory retrieval accuracies are achieved regardless of the value, and that the best fine-grained retrieval accuracy is obtained at an intermediate setting. As analyzed, too many local features (i.e., an excess of object parts) might cause redundancies in local feature representations, while too few local features (i.e., scant object parts) may leave fine-grained images under-represented for distinguishing subtle visual differences. These might be the reasons why too small or too large a number of local features causes slight accuracy drops. Moreover, the comparable retrieval results across different values show that our ExchNet is not sensitive to this hyper-parameter.
In this paper, we studied the practical but challenging fine-grained hashing task, which aims to solve large-scale FGIR problems by leveraging the search and storage efficiency of compact hash codes. Specifically, we proposed a unified network ExchNet to obtain representative fine-grained local and global features by performing our attention approach equipped with the tailored attention constraints. Then, ExchNet utilized its local feature alignment to align these local features to their corresponding object parts across images. Later, an alternating learning algorithm was employed to return the final fine-grained binary codes. Compared with ANN methods and competing generic hash methods, experiments validated both effectiveness and efficiency of our ExchNet. In the future, we would like to explore a more challenging unsupervised fine-grained hashing topic.
Acknowledgements Q. Cui’s contribution was made when he was an intern at Megvii Research Nanjing. This research was supported by the National Key Research and Development Program of China under Grant 2017YFA0700800 and “111” Program B13022.