Fine-grained visual classification (FGVC) refers to classification tasks where the differences between the categories are very subtle. Examples of such tasks are the classification of bird species or differentiating between different car models. The general appearance of the categories is very similar (all birds have two wings and a beak, cars typically have four wheels) and as a result the inter-class variation is small. On the other hand, the intra-class variation can be quite high (for example due to different poses). This makes FGVC a very challenging problem that receives a lot of attention in the research community. State-of-the-art approaches typically involve a backbone CNN such as ResNet or VGG that is extended by a method that localizes and attends to specific discriminative regions. These methods can become quite complex and sometimes require multiple passes through the backbone CNN.
In this work we aim to improve the performance of a given backbone CNN with little increase in complexity and requiring just a single pass through the backbone network. Specifically, we propose the following three steps:
Global k-max pooling: For FGVC models, the final convolutional layer often still covers several spatial positions per feature map (the exact resolution depends on the backbone and the input size). A single feature vector describing the image can then be obtained by using global average or global max pooling. However, to approximate part-based recognition, we propose to use global k-max pooling, where the average over the k maximal features is computed.
Embedding layer: In a typical setup for face verification tasks, the test subjects (classes) are not known during training, which means a standard softmax classifier cannot be trained. CNNs are therefore often used to train a discriminative embedding space in which face images can be compared efficiently and accurately. The embeddings are learned using specifically designed loss functions such as center loss, triplet loss or DFF. We insert such an embedding layer, trained with a loss function based on class means (see Section 4), into the backbone CNN as the penultimate layer. We show that this greatly improves the performance of the softmax classifier.
Localization module: Using bounding boxes to crop the input images typically improves the performance of the classification model. In order to avoid having to rely on human bounding box annotations we train an efficient bounding box detector that can be applied before the image is processed by the backbone CNN. This localization module is lightweight and trained using only the class labels. Bounding box annotations are not needed.
We evaluate our model on three popular FGVC datasets from different domains. The first dataset is CUB200-2011, where the task is the classification of bird species. The second dataset is Stanford cars, where different car models are classified, and the third is FGVC-Aircraft for the classification of different aircraft models. We obtain very competitive results on all three datasets and, to the best of our knowledge, new state-of-the-art results for the latter two.
1.1 Related work
As mentioned in the introduction, FGVC has received a lot of attention in the research community. As a result, many different approaches have been proposed. Using some form of visual attention has been especially popular lately [37, 36, 34, 7, 32, 18, 27]. One of these works is of particular relevance since it also uses an embedding loss. However, this loss is not used to train an independent embedding layer but to regulate the attention mechanism defined there.
Spatial transformations that extract the discriminative parts of the input can also be seen as a form of attention. For example, the well known spatial transformer introduced in  is capable of learning global transformations (affine transformations), but is known to be difficult to train and usually needs a second large network to estimate the transformation parameters. In  a module to learn pixel-wise translations is proposed. However, this module is applied very late in the network, possibly due to being reliant on high-level features. As a result only the last layers can profit from the localized input. In  an ensemble of networks is learned sequentially, where each network is trained based on a spatial transformation derived from the previous network. This means that each input image needs to be passed through multiple networks. Our localization module fits into the category of spatial transformations (limited to scale and translation), but it is very lightweight and easy to train while still being able to significantly boost the recognition performance.
Another line of work proposes to train a Gaussian mixture model based on part proposals provided by selective search. However, this requires a looped training procedure with the EM algorithm.
Another popular approach is based on bilinear models, which can lead to issues with efficiency due to very high-dimensional features and multi-stream architectures. Other second-order pooling methods also often result in very high-dimensional features.
Other approaches include learning global and patch features in an asymmetric multi-stream architecture , learning a complex sequence of data augmentation steps from the data , deep layer aggregation  or the training of very large networks (557 million parameters) . It is also possible to boost the performance by obtaining more training data [4, 15].
The goal of this work is to improve the performance of a given backbone CNN (a ResNet) with lightweight components that do not require multiple passes through the backbone CNN or significantly higher runtime or memory usage during testing. We achieve this goal by adding three components: a localization module, global k-max pooling and an embedding layer. An overview of the resulting model in the testing stage is given in Figure 1. The input image that needs to be classified is forwarded through the localization module, which estimates the bounding box of the object in the image and returns a cropped image. This cropped image is then forwarded through the backbone CNN, which contains global k-max pooling and the embedding layer at the later stages. The classification result is then given by a softmax classification layer.
In the training stage the model is trained jointly with a standard cross-entropy loss applied at the classification layer and a specific loss function applied at the embedding layer (see Section 4).
The localization module is trained in a separate training step (cf. Section 5).
3 Global K-Max Pooling
Often global max pooling (GMP) or global average pooling (GAP) is used between the last convolutional layer and the classification layer of a CNN. These pooling operations collapse the spatial dimensions of the final convolutional layer into a single vector describing the image. For FGVC it has been shown that part-based approaches can boost the classification performance. For this reason we propose to use global k-max pooling (GKMP). This two-step pooling procedure first applies k-max pooling at the last convolutional layer, followed by an averaging operation over the selected maximal values in each feature map. This way the network can learn features that activate at the k most important parts of the image (during back-propagation the error gets propagated through the k most important parts instead of just one as with GMP or all of them as with GAP). This can be seen as a very simple form of attention.
The global k-max pooling layer can be defined as follows. Given an input image, consider the output of the last convolutional layer of a CNN, which has a fixed spatial resolution and contains a number of feature maps. For each feature map, the activations at all spatial positions are collected in a vector and sorted in descending order.
The global k-max pooling operation for a specified k is then defined, per feature map, as the average of the first k entries of this sorted vector (Equation (3)).
This definition corresponds to the pooling operation used in , except in  also the minimal activations are included. However, preliminary experiments suggested that including the minimal activations does not help the recognition performance. Therefore we use global k-max pooling as defined in Equation (3).
Note that for k = 1 Equation (3) results in the standard GMP, while setting k to the number of spatial positions leads to GAP. In this work we use the same value of k in all experiments.
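The pooling operation described in this section can be illustrated with a short sketch in plain Python. This is only a minimal illustration under our own assumptions: the function name `global_k_max_pool` and the nested-list representation of the feature maps are hypothetical and not taken from the original torch7 implementation, and the weighted averaging of Section 3.1 is not covered.

```python
def global_k_max_pool(feature_maps, k):
    """Average of the k largest activations in each feature map.

    feature_maps: list of D feature maps, each a 2-D list (H rows, W columns).
    Returns a list with one pooled value per feature map.
    """
    pooled = []
    for fmap in feature_maps:
        # Flatten the spatial grid and sort the activations in descending order.
        values = sorted((v for row in fmap for v in row), reverse=True)
        if not 1 <= k <= len(values):
            raise ValueError("k must be between 1 and the number of positions")
        # Average over the k maximal values only.
        pooled.append(sum(values[:k]) / k)
    return pooled


# Two toy feature maps with a 2 x 2 spatial resolution (hypothetical values).
fmaps = [
    [[1.0, 5.0], [3.0, 7.0]],
    [[0.0, 2.0], [4.0, 6.0]],
]
gmp = global_k_max_pool(fmaps, 1)   # k = 1 reduces to global max pooling
gkmp = global_k_max_pool(fmaps, 2)  # average of the two largest values
gap = global_k_max_pool(fmaps, 4)   # k = H * W reduces to global average pooling
```

The toy example also shows the two special cases mentioned above: with k = 1 only the maximum survives, and with k equal to the number of spatial positions every activation contributes equally.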
3.1 Global k-max pooling with weighted averaging
4 Embedding layer
The embedding layer is inserted between GKMP and the classification layer. The idea is to map the images into a discriminative embedding space, where the distances between images of the same class are small, while the distances between images of different classes are large. This concept is known from face verification [23, 30, 10] and metric learning. Different from those two tasks, we do not compare images directly in the embedding space, but use it as an intermediate layer (more specifically as the penultimate layer). The classification is done by a standard classification layer trained on the embedding space with cross-entropy. We argue that one of the advantages of this approach is that it can help mitigate the issue of limited training data that we often see in FGVC tasks (see Table 1).
The embedding space is trained using a specific loss function applied directly to the output of this intermediate layer. In this work we use a formulation based on optimizing class means [30, 10] since this is easy to integrate into the training and does not require any specific batch construction schemes (unlike tuplet-based losses such as the triplet loss or the NPair loss). For each class a feature vector is computed (and updated online during training) that describes the class mean within the embedding space. The goal of the loss function is to minimize the distance of each image within a batch to its respective class mean, while maximizing the distance between means of different classes.
The loss function is composed of two parts:
The first part is the within-class (intra-class) loss that minimizes the distances of the images to their class means, while the second part is the between-class (inter-class) loss that maximizes the distances between class means.
4.1 Within-class loss
We use a formulation for the within-class loss that is derived from the center loss, including an additional L2-normalization. Given the set of classes and a batch of images with their class labels, the L2-normalized output of the embedding layer is computed for each image. In each training iteration, the first step is to update the class means using the data points in the current batch and the class means from the previous iteration (Equation (6)). A hyper-parameter of this update can be considered the learning rate of the class means. The update term is built on the Kronecker function, which equals one if its two arguments are identical and zero otherwise, so that each image only contributes to the mean of its own class.
Note that this update creates a functional dependence between the class means and their corresponding images within the batch, which has to be considered during back-propagation.
The within-class loss function then penalizes the distance of each image in the batch to its class mean.
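To make the update of the class means more concrete, the following plain-Python sketch follows the update rule of the original center loss, on which the formulation above is based. It is an assumption that the update used in this work has exactly this form; the function names, the dictionary representation of the class means and the value of `alpha` are hypothetical, and the gradient through the updated means is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def update_class_means(means, features, labels, alpha):
    """Center-loss style update: move each mean towards its batch samples.

    means: dict mapping class -> mean vector, features: embedding vectors,
    labels: class of each feature, alpha: learning rate of the class means.
    """
    updated = {}
    for cls, mean in means.items():
        members = [f for f, y in zip(features, labels) if y == cls]
        # Sum of (mean - feature) over the batch members of this class,
        # divided by (1 + count) so that absent classes are left unchanged.
        delta = [
            sum(m - f[d] for f in members) / (1 + len(members))
            for d, m in enumerate(mean)
        ]
        updated[cls] = [m - alpha * dl for m, dl in zip(mean, delta)]
    return updated


def within_class_loss(means, features, labels):
    """Half the summed squared distance of each feature to its class mean."""
    total = 0.0
    for f, y in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(f, means[y]))
    return 0.5 * total


# Toy batch: two images of class 0, none of class 1 (hypothetical values).
features = [l2_normalize([3.0, 4.0]), l2_normalize([0.0, 2.0])]
labels = [0, 0]
means = {0: [0.0, 0.0], 1: [1.0, 0.0]}
new_means = update_class_means(means, features, labels, alpha=0.5)
loss_before = within_class_loss(means, features, labels)
loss_after = within_class_loss(new_means, features, labels)
```

In this toy batch the mean of class 0 moves towards its two samples, which lowers the within-class loss, while the mean of class 1 is untouched because no image of that class is in the batch.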
4.2 Between-class loss
The between-class loss maximizes the distances between the updated class means (Equation (10)).
The class means entering this loss are the new class means, updated with the features from the current batch as computed in Equation (6). The margin defines a threshold for the distances to be penalized, and a weighting factor controls the contribution of the between-class part to the embedding loss. The loss is computed over the set of all class-pairs in the current batch.
To see why squaring the maximum in Equation (10) is important, we have a look at its gradient with respect to a class mean.
The squared maximization leads to the appearance of the distance in the gradient. As a result the gradient gets larger if the distance between two class means gets smaller which encourages the model to focus more on improving the distance of class-pairs with very close means. This should help to reduce classification errors due to the confusion of images from class-pairs with close class means.
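The effect of squaring the hinge can be illustrated with a small plain-Python sketch. Equation (10) is not reproduced here, so the exact form below (a squared hinge on the squared Euclidean distance, averaged over all pairs and without the weighting factor) is an assumption made for illustration, and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def between_class_loss(means, margin):
    """Mean squared hinge over all pairs of class means (unweighted sketch)."""
    classes = sorted(means)
    pairs = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    total = 0.0
    for a, b in pairs:
        # Only pairs whose distance is below the margin are penalized.
        total += max(margin - squared_distance(means[a], means[b]), 0.0) ** 2
    return total / len(pairs)


def hinge_slopes(dist, margin):
    """Magnitude of d(loss)/d(distance) for the plain and the squared hinge."""
    active = margin - dist > 0
    plain = 1.0 if active else 0.0            # constant while the hinge is active
    squared = 2.0 * max(margin - dist, 0.0)   # grows as the distance shrinks
    return plain, squared


# Three class means on the unit circle (hypothetical values).
means = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}
loss = between_class_loss(means, margin=3.0)
close_pair = hinge_slopes(0.5, margin=3.0)
far_pair = hinge_slopes(2.5, margin=3.0)
```

With the plain hinge, every violating pair receives the same weight, whereas with the squared hinge the weight is proportional to how far the pair is inside the margin, so pairs with very close means dominate the update.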
While the formulation of the between-class loss defined above is close to the class-mean formulation it is based on, the appearance of the distance term in the gradient is a key difference, since in that formulation the maximization in the between-class loss is not squared. In addition there are two more differences. First, the centers are based on L2-normalized features, which makes the choice of the margin easier, since distances between L2-normalized vectors are restricted to a certain range. The second difference is that in the earlier formulation the set of class-pairs is a sampled subset of all class-pairs in the current batch. Since we work with much smaller batch sizes (see Section 6), we use all pairs within a batch.
The embedding layer as defined above is also closely related to the attention regulation proposed in OSME-MAMC . The latter uses a metric learning loss to learn correlations between the output of different attention branches. If the multiple attention branches were replaced by a single fully connected layer then this method would be equivalent to training an embedding layer with the NPair loss .
5 Localization module
It can be observed that cropping the images based on bounding boxes of the objects that need to be recognized can improve the classification performance. Since we do not want to use any annotations apart from the class labels in training (and of course no annotations at all during testing) we need a way to obtain these localizations without any additional annotations if we want to profit from the increased performance the localizations can provide. We design such a localization module based on the following observations.
The final convolutional layer of a trained classification model typically contains higher activations in positions corresponding to the discriminative areas of the image (see Figure 2). By computing the mean over all feature maps, a quite accurate heat map can be computed for the object in the image. This heat map can then be used to find the boundaries of the object within the final layer. A similar observation has been made in earlier work on discriminative localization, but in contrast to that work we do not consider class-specific heat maps due to the small inter-class variations in FGVC.
The downside of this approach is that we obtain these bounding boxes only at the final layer, while the full image needs to be forwarded through all previous layers. One option would be to apply ROI-Pooling after the final layer based on the estimated bounding box. However, this would mean that only the fully connected layers following the last convolutional layer would profit from focusing on the discriminative area of the image. All other layers still have to operate on the full image and we cannot fully exploit the potential of the bounding boxes (see Figure 5). An alternative would be to pass the image through the full backbone network, extract the bounding boxes and then train a second, more precise model based on the bounding boxes. However, with this approach the image would have to be passed through a full network twice, one pass to obtain the bounding box and a second pass for classification. This is not very efficient. To avoid such a multi-pass procedure we propose to train a very lightweight localization module that predicts the bounding boxes and is integrated into the backbone network such that an image can be processed in one pass.
The architecture of the localization module is equivalent to the first few layers of a ResNet-50 (initialized from the trained classification model) up to the end of the first residual block, including an initial down-sampling layer that reduces the spatial resolution of the input. The module has only few parameters and can be added to the classification model without taking up much runtime or memory. Its output has the same size as the mean over the feature maps of the last convolutional layer of the trained classification model. The localization module is trained by feeding an input image through both the trained classification model and the localization module. The outputs are compared using the smooth L1 loss (see Figure 3). During back-propagation the weights of the trained classification model are fixed and only the localization module is trained. This way the localization module learns to directly predict the heat maps.
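As an illustration of the training target, the smooth L1 loss between a predicted and a reference heat map can be sketched in plain Python as follows. The sketch uses the standard Fast R-CNN form of the loss with a threshold of one; whether the loss is averaged or summed over the heat map positions in the original implementation is not stated, so the averaging here is an assumption, as are the function name and the toy values.

```python
def smooth_l1(pred, target):
    """Mean smooth L1 loss between two heat maps given as 2-D lists."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            diff = abs(p - t)
            # Quadratic for small errors, linear for large ones.
            total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
            count += 1
    return total / count


# Reference heat map from the fixed classification model (hypothetical values)
# and the prediction of the localization module.
target_map = [[0.0, 1.0], [2.0, 3.0]]
predicted_map = [[0.5, 1.0], [2.0, 5.0]]
loss_value = smooth_l1(predicted_map, target_map)
```

The linear branch keeps the influence of a few badly predicted positions bounded, which is the usual motivation for preferring this loss over a purely quadratic one.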
Examples of the estimated heat maps are given in Figure 4. The localization module is able to predict the heat maps very accurately, even though less focused on a specific part of the bird (especially in the second and third row).
To obtain the final bounding box we process the heat map using min-max normalization and binarization based on a given threshold. The bounding box is then the smallest rectangle containing all pixels where the normalized heat map has a value greater than this threshold.
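The post-processing steps described above can be sketched in plain Python. This is a minimal illustration under our own assumptions: the function names, the toy feature maps and the threshold value are hypothetical, and the box is returned in heat map coordinates, so scaling it back to the input image is not covered.

```python
def mean_heat_map(feature_maps):
    """Average D feature maps (each H x W) into a single H x W heat map."""
    depth = len(feature_maps)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(fm[r][c] for fm in feature_maps) / depth for c in range(cols)]
        for r in range(rows)
    ]


def bounding_box(heat_map, threshold):
    """Smallest rectangle (top, left, bottom, right), inclusive, covering every
    cell whose min-max normalized value exceeds the threshold."""
    flat = [v for row in heat_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    hits = [
        (r, c)
        for r, row in enumerate(heat_map)
        for c, v in enumerate(row)
        if (v - lo) / span > threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)


# Two toy 4 x 4 feature maps with an object in the center.
fmaps = [
    [[0, 0, 0, 0], [0, 4, 2, 0], [0, 2, 6, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 2, 4, 0], [0, 0, 2, 0], [0, 0, 0, 0]],
]
heat = mean_heat_map(fmaps)
box = bounding_box(heat, threshold=0.5)
```

A lower threshold admits weaker activations and can therefore only enlarge the resulting box, which matches the role of the threshold as the hyper-parameter controlling how tightly the object is cropped.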
6 Experimental evaluation
|Dataset||Training images||Test images||Classes|
|Stanford cars||8144||8041||196|
|FGVC-Aircraft||6667||3333||100|
We evaluate our proposed approach on three popular datasets for fine-grained classification: CUB200-2011 for bird-species classification, Stanford cars for the classification of car models and FGVC-Aircraft for the classification of airplane models. The statistics of the three datasets are given in Table 1. On all datasets we use only the class label annotations. We do not use any additional annotations such as bounding boxes or part annotations.
Most of the evaluation is done using ResNet-50 as backbone CNN. The models are pretrained on ImageNet, from which we remove all images that overlap with the test sets of the three FGVC tasks used for this evaluation.
Since our added components are all lightweight and do not occupy much memory, we are able to train both ResNet variants on a single NVIDIA GeForce GTX 1080 Ti GPU with 11GB memory. We use the same batch size and input resolution for all experiments. The models are trained with standard back-propagation with momentum and weight decay, and the starting learning rate is reduced by a constant factor after a fixed number of epochs. The weighted average pooling (see Section 3.1) is applied as a finetuning step for additional epochs. The approach is implemented using the torch7 framework.
There are a number of hyper-parameters to set for our approach. We perform a very limited search for the hyper-parameters on CUB200-2011 using a validation set separated from the training images. The parameters are quite robust and we can use the same set of hyper-parameters in all other experiments. The threshold for the localization module is the only exception, where we use a slightly smaller value on the Stanford cars and the FGVC-Aircraft datasets. The exact values for the hyper-parameters are given in Table 2.
In Sections 6.4, 6.5 and 6.6 we compare to state-of-the-art approaches using a similar experimental setting, specifically with the same training data (ImageNet and the training set of the FGVC task at hand) and no bounding box or part annotations. Note that for some datasets better results can be achieved by acquiring large amounts of additional training data [4, 15].
6.1 Computational efficiency
A comparison of the computational efficiency between a baseline ResNet-50 and our approach with ResNet-50 as backbone is given in Table 3. With the typical input resolution, the additional complexity in terms of FLOPs (multiply-adds) caused by our approach is very small (one reason is that the localization module operates on low-resolution input). There is also no large increase in parameters, highlighting the efficiency of the proposed approach.
6.2 Localization module
In Figure 5 we analyze the effect on the recognition performance if the image or feature maps are cropped based on ground-truth bounding boxes (ROI-Pooling). We can observe that the earlier the ROI-Pooling is applied the better is the recognition accuracy, confirming our expectation in Section 5.
|Method||Bounding box accuracy (%)|
|ResNet-50 mean feature map||58.6|
The accuracy of the bounding boxes can be calculated by counting a bounding box as correct if the intersection over union (IoU) with the ground-truth bounding box is at least 0.5. We can observe in Table 4 that using the mean over the baseline ResNet-50 feature maps of the last convolutional layer already achieves a better accuracy than previously reported. However, on top of being much more efficient, the localization module is even more accurate. We argue that this is due to the slightly more general heat maps generated by the localization module as a result of the approximation (see Figure 4).
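The evaluation criterion described above can be written down in a few lines of plain Python. This is only a sketch: the function names and the `(x1, y1, x2, y2)` box convention with continuous coordinates are assumptions for illustration, and the toy boxes are not taken from any of the datasets.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(predicted, ground_truth, min_iou=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth is at
    least min_iou."""
    correct = sum(iou(p, g) >= min_iou for p, g in zip(predicted, ground_truth))
    return correct / len(predicted)


# Two toy predictions: the first overlaps well, the second does not.
predicted = [(0, 0, 10, 10), (0, 0, 4, 4)]
ground_truth = [(0, 0, 10, 8), (2, 2, 6, 6)]
accuracy = localization_accuracy(predicted, ground_truth)
```

Note that an over-estimated box can still count as correct under this criterion as long as the overlap stays at or above the threshold, as the first toy prediction shows.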
In Figure 6 we show qualitative results of the bounding boxes estimated for validation images by the localization module. We can observe that the localization module is able to find quite accurate bounding boxes. In some cases (right column of Figure 6) the localization module over-estimates the bounding box due to other objects in the image such as a branch or a distracting background. Another interesting reason for over-estimated bounding boxes can be multiple instances of the given FGVC domain such as two different airplane models in the same image. However, in each of these cases the classifier still has a good chance of finding the correct class, since the objects are still present in the cropped image.
6.3 Ablation study
Table 5 shows how the components contribute to the recognition performance. The two baseline results with GAP and GMP include a fully connected layer before the classification layer. We can observe in Table 5 that adding an embedding loss and adding the localization module lead to the largest boosts in performance (Table 5 also reports the result obtained with ground-truth bounding boxes). The addition of global k-max pooling, weighted average finetuning and the full embedding loss compared to only the within-class part each lead to a smaller improvement.
The effect of the value selected for k in GKMP is illustrated in Figure 7. The special case k = 1 is equivalent to GMP, and setting k to the number of spatial positions of the last convolutional layer is equivalent to GAP. We can observe that an intermediate value seems to be optimal. The improvement compared to GMP is modest in absolute terms, but using a value larger than one also enables us to apply the weighted averaging, which gives another performance boost (see Table 5).
6.4 CUB200-2011
|Method||Backbone||Accuracy (%)|
|Spatial RNN||M-Net/D-Net||89.7|
|Stacked LSTM||GoogleNet||90.4|
In Table 6 we compare our approach to other state-of-the-art methods. In addition to the results reported in Table 5 we also evaluate our method with ResNet-101 as backbone network. This includes all components (GKMP with weighted average pooling, embedding layer with full embedding loss, localization module). Our best result is very competitive, even though it is a little less accurate than the best state-of-the-art results. However, these involve recurrent neural networks, which can be computationally expensive.
6.5 Stanford cars
|Method||Backbone||Accuracy (%)|
|Spatial RNN||M-Net/D-Net||93.4|
The results on the Stanford cars dataset are given in Table 7. Here we run the experiment with ResNet-50 and ResNet-101 with all components included. As mentioned earlier, we use the same hyper-parameters as for CUB200-2011 (apart from the threshold of the localization module). Again, our approach achieves a very competitive result. In fact, to the best of our knowledge, the accuracy we obtain is the best result reported so far, even though only by a small margin.
6.6 FGVC-Aircraft
|Method||Backbone||Accuracy (%)|
|Spatial RNN||M-Net/D-Net||88.4|
Similar to the Stanford cars dataset, we use all components introduced in the previous sections and the same hyper-parameters as for Stanford cars. On FGVC-Aircraft (Table 8) we can also report a very competitive result, which is again, to the best of our knowledge, the best result reported so far.
In this work we presented three efficient methods to improve the classification performance of a backbone CNN for fine-grained visual classification. Specifically, we propose a lightweight localization module that relies only on class label annotations during training. We showed that, even though it is lightweight and trained without bounding box annotations, the localization module can find reliable bounding boxes and significantly boost the recognition performance. Additionally we propose to use global k-max pooling to obtain a global vector describing the image. This approximates part-based modeling and can further be improved by learning weights to regulate the contribution of the maximal values in each feature map. Finally, we project the image descriptor into a discriminative embedding space from which the classification layer makes the classification. As an intermediate layer of the full classification network, the embedding space is trained jointly with the full network and a specific loss function that optimizes class means. We evaluate our approach on three popular FGVC tasks and achieve competitive results on all three. In fact, on Stanford cars and FGVC-Aircraft we can report new best classification accuracies.
References
[1] Destruction and construction learning for fine-grained image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Torch7: a matlab-like environment for machine learning. In BigLearn, NIPS workshop, Vol. 5, pp. 10.
[3] (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[4] Large scale fine-grained categorization and domain-specific transfer learning. In IEEE conference on computer vision and pattern recognition, pp. 4109–4118.
[5] (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
[6] Weldon: weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4743–4752.
[7] (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In IEEE conference on computer vision and pattern recognition, pp. 4438–4446.
[8] (2019-06) Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] (2015) Fast r-cnn. In IEEE international conference on computer vision, pp. 1440–1448.
[10] (2017) Deep fisher faces. In British Machine Vision Conference (BMVC), Vol. 1, pp. 7.
[11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
[12] (2018) GPipe: efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965.
[13] (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025.
[14] (2013) Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. Computer vision and image understanding 117 (5), pp. 479–492.
[15] (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pp. 301–320.
[16] (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561.
[17] (2018-06) Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR).
[18] (2017) Dynamic computational time for visual attention. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 1199–1209.
[19] (2018) Fine-grained image classification with gaussian mixture layer. IEEE Access 6, pp. 53356–53367.
[20] (2015) Bilinear cnn models for fine-grained visual recognition. In IEEE international conference on computer vision, pp. 1449–1457.
[21] (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
[22] (2015) Metric learning with adaptive density discrimination. arXiv preprint arXiv:1511.05939.
[23] Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
[24] (2018) Increasingly specialized ensemble of convolutional neural networks for fine-grained recognition. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 594–598.
[25] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[26] (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865.
[27] (2018) Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 805–821.
[28] (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
[29] (2018) Learning a discriminative filter bank within a cnn for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4148–4157.
[30] (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515.
[31] (2013) Learning weighted geometric pooling for image classification. In 2013 IEEE International Conference on Image Processing, pp. 3805–3809.
[32] (2018) Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE transactions on cybernetics 49 (5), pp. 1791–1802.
[33] (2018) Attend and align: improving deep representations with feature alignment layer for person retrieval. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2148–2153.
[34] (2018) Learning to navigate for fine-grained classification. In ECCV.
[35] (2018) Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412.
[36] (2017) Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia 19 (6), pp. 1245–1256.
[37] (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE international conference on computer vision, pp. 5209–5217.
[38] Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929.