ELoPE: Fine-Grained Visual Classification with Efficient Localization, Pooling and Embedding

11/17/2019, by Harald Hanselmann et al.

The task of fine-grained visual classification (FGVC) deals with classification problems that display a small inter-class variance, such as distinguishing between different bird species or car models. State-of-the-art approaches typically tackle this problem by integrating an elaborate attention mechanism or (part-) localization method into a standard convolutional neural network (CNN). In this work, too, the aim is to enhance the performance of a backbone CNN such as ResNet by including three efficient and lightweight components specifically designed for FGVC. This is achieved by using global k-max pooling, a discriminative embedding layer trained by optimizing class means, and an efficient bounding box estimator that only needs class labels for training. The resulting model achieves new best state-of-the-art recognition accuracies on the Stanford cars and FGVC-Aircraft datasets.


1 Introduction

Fine-grained visual classification (FGVC) refers to classification tasks where the differences between the categories are very subtle. Examples of such tasks are the classification of bird species or differentiating between car models. The general appearance of the categories is very similar (all birds have two wings and a beak, cars typically have four wheels) and as a result the inter-class variation is small. On the other hand, the intra-class variation can be quite high (e.g. due to different poses). This makes FGVC a very challenging problem that receives a lot of attention in the research community. State-of-the-art approaches typically involve a backbone CNN such as ResNet [11] or VGG [25] that is extended by a method that localizes and attends to specific discriminative regions. These methods can become quite complex and sometimes require multiple passes through the backbone CNN.

In this work we aim to improve the performance of a given backbone CNN with little increase in complexity and requiring just a single pass through the backbone network. Specifically, we propose the following three steps:

  • Global k-max pooling: For FGVC models, the final convolutional layer often still has a spatial resolution larger than 1×1 (e.g. for a ResNet-50 with a 448×448 input the resolution is 14×14). A single feature vector describing the image can then be obtained by using global average or global max pooling. However, to approximate part-based recognition, we propose to use global k-max pooling, where the average over the k maximal activations in each feature map is computed.

  • Embedding layer: In a typical setup for face verification tasks, the test subjects (classes) are not known during training, which means a standard softmax classifier cannot be trained. CNNs are therefore often used to train a discriminative embedding space in which face images can be compared efficiently and accurately. The embeddings are learned using specifically designed loss functions such as the center loss [30], the triplet loss [23] or DFF [10]. We insert such an embedding layer, trained with a loss function similar to [10], into the backbone CNN as the penultimate layer. We show that this greatly improves the performance of the softmax classifier.

  • Localization module: Using bounding boxes to crop the input images typically improves the performance of the classification model. In order to avoid having to rely on human bounding box annotations we train an efficient bounding box detector that can be applied before the image is processed by the backbone CNN. This localization module is lightweight and trained using only the class labels. Bounding box annotations are not needed.

We evaluate our model on three popular FGVC datasets from different domains. The first dataset is CUB200-2011 [28] where the task is the classification of bird species. The second dataset is Stanford cars [16] where different car models are classified and the third is FGVC-Aircraft [21] for the classification of different aircraft models. We obtain very competitive results on all three datasets and to the best of our knowledge new best state-of-the-art results for the latter two.

1.1 Related work

As mentioned in the introduction, FGVC has received a lot of attention in the research community. As a result, many different approaches have been proposed. Especially using some form of visual attention has been very popular lately [37, 36, 34, 7, 32, 18, 27]. The work presented in [27] is of particular relevance since it also uses an embedding loss. However, this loss is not used to train an independent embedding layer but to regulate the defined attention mechanism.

Spatial transformations that extract the discriminative parts of the input can also be seen as a form of attention. For example, the well-known spatial transformer introduced in [13] is capable of learning global (affine) transformations, but is known to be difficult to train and usually needs a second large network to estimate the transformation parameters. In [33] a module to learn pixel-wise translations is proposed. However, this module is applied very late in the network, possibly because it relies on high-level features. As a result, only the last layers can profit from the localized input. In [24] an ensemble of networks is learned sequentially, where each network is trained based on a spatial transformation derived from the previous network. This means that each input image needs to be passed through multiple networks. Our localization module fits into the category of spatial transformations (limited to scale and translation), but it is very lightweight and easy to train while still being able to significantly boost the recognition performance.

The work presented in [19] proposes to train a Gaussian mixture model based on part proposals provided by selective search. However, this requires a looped training procedure with the EM algorithm.

Another popular approach is based on bilinear models [20], which can lead to issues with efficiency due to very high dimensional features and multi-stream architectures. Other second-order pooling methods such as the work in [17] also often result in very high dimensional features.

Other approaches include learning global and patch features in an asymmetric multi-stream architecture [29], learning a complex sequence of data augmentation steps from the data [3], deep layer aggregation [35] or the training of very large networks (557 million parameters) [12]. It is also possible to boost the performance by obtaining more training data [4, 15].

2 Overview

Figure 1: Overview of the proposed model including a lightweight localization module, global k-max pooling and an embedding layer.

The goal of this work is to improve the performance of a given backbone CNN (a ResNet [11]) with lightweight components that do not require multiple passes through the backbone CNN or significantly higher runtime or memory usage during testing. We achieve this goal by adding three components: a localization module, global k-max pooling and an embedding layer. An overview of the resulting model in the testing stage is given in Figure 1. The input image that needs to be classified is forwarded through the localization module, which estimates the bounding box of the object in the image and returns a cropped image. This cropped image is then forwarded through the backbone CNN, which contains the global k-max pooling and the embedding layer in its later stages. The classification result is then given by a softmax classification layer.

In the training stage the model is trained jointly with a standard cross-entropy loss applied at the classification layer and a specific loss function applied at the embedding layer:

\mathcal{L} = \mathcal{L}_{CE} + \lambda \, \mathcal{L}_{emb}    (1)

The localization module is trained in a separate training step (cf. Section 5).
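
To make the structure concrete, below is a minimal PyTorch-style sketch of the test-time head in Figure 1 (backbone trunk, global k-max pooling, embedding layer, classifier) together with the joint loss of Equation (1). The original implementation uses the torch7 framework; all names here (ELoPEHead, joint_loss, the embedding dimension of 512) are illustrative assumptions rather than the authors' code, and the localization module is assumed to have already cropped the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ELoPEHead(nn.Module):
    """Backbone trunk -> global k-max pooling -> embedding layer -> classifier."""
    def __init__(self, num_classes=200, emb_dim=512, k=4):
        super().__init__()
        resnet = torchvision.models.resnet50()              # the paper uses ImageNet pretraining
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])  # keep layers up to the last conv layer
        self.k = k
        self.embedding = nn.Linear(2048, emb_dim)            # penultimate embedding layer (Section 4)
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, x):                                    # x: cropped images, e.g. B x 3 x 448 x 448
        h = self.trunk(x)                                    # B x 2048 x 14 x 14
        pooled = h.flatten(2).topk(self.k, dim=2).values.mean(dim=2)  # global k-max pooling (Section 3)
        emb = F.normalize(self.embedding(pooled), dim=1)     # l2-normalized embedding
        return self.classifier(emb), emb

def joint_loss(logits, emb, labels, embedding_loss, lambda_emb=2.0):
    """Equation (1): cross-entropy at the classifier plus a weighted embedding loss."""
    return F.cross_entropy(logits, labels) + lambda_emb * embedding_loss(emb, labels)
```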

3 Global K-Max Pooling

Often global max pooling (GMP) or global average pooling (GAP) is used between the last convolutional layer and the classification layer of a CNN. These pooling operations collapse the spatial dimensions of the final convolutional layer and yield a single vector describing the image. For FGVC it has been shown that part-based approaches [37] can boost the classification performance. For this reason we propose to use global k-max pooling (GKMP). This two-step pooling procedure first applies k-max pooling [14] at the last convolutional layer, followed by an averaging operation over the selected maximal values in each feature map. This way the network can learn features that activate at the most important parts of the image (during back-propagation the error gets propagated through the k most important positions instead of just one as with GMP or all of them as with GAP). This can be seen as a very simple form of attention.

The global k-max pooling layer can be defined as follows. Given an input image X, let h be the output of the last convolutional layer of a CNN, where h has a spatial resolution of H × W (in this work typically 14 × 14) and contains D feature maps. Further, given a specific feature map d, the sorted vector \hat{h}_d contains the values of h_d sorted in descending order:

\hat{h}_d = \text{sort}(h_d), \quad \hat{h}_{d,1} \geq \hat{h}_{d,2} \geq \dots \geq \hat{h}_{d,H \cdot W}    (2)

The global k-max pooling operation for a specified k is then defined as:

g_d = \frac{1}{k} \sum_{i=1}^{k} \hat{h}_{d,i}    (3)

This definition corresponds to the pooling operation used in [6], except that in [6] the minimal activations are also included. However, preliminary experiments suggested that including the minimal activations does not help the recognition performance. Therefore we use global k-max pooling as defined in Equation (3).

Note that if we select k = 1 then Equation (3) results in standard GMP, while the choice k = H · W leads to GAP. In this work we chose k = 4 in all experiments.
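
A minimal sketch of GKMP as defined in Equation (3) is given below, assuming a feature tensor of shape B × D × H × W from the last convolutional layer; the function name and the sanity checks are illustrative additions, not part of the paper.

```python
import torch

def global_k_max_pool(h: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Average of the k largest activations in each feature map (Equation (3))."""
    flat = h.flatten(2)                                                 # B x D x (H*W)
    return flat.topk(min(k, flat.shape[2]), dim=2).values.mean(dim=2)   # B x D

# The 14x14 ResNet-50 feature map used in this work: k=1 recovers GMP, k=196 recovers GAP.
features = torch.randn(8, 2048, 14, 14)
assert torch.allclose(global_k_max_pool(features, k=1), features.amax(dim=(2, 3)))
assert torch.allclose(global_k_max_pool(features, k=196), features.mean(dim=(2, 3)), atol=1e-5)
```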

3.1 Global k-max pooling with weighted averaging

GKMP can be extended by including weights in the averaging operation in Equation (3), similar to [31]:

g_d = \sum_{i=1}^{k} w_{d,i} \, \hat{h}_{d,i}    (4)

This formulation extends the network with only a small number of additional parameters, which regulate the contribution of the k maximal activations in each feature map.
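
Below is a sketch of the weighted variant. The exact parameterization (one learnable weight per rank and per feature map, initialized to 1/k) is an assumption made for illustration, consistent with the reconstruction of Equation (4) above.

```python
import torch
import torch.nn as nn

class WeightedGKMP(nn.Module):
    """Global k-max pooling with learnable averaging weights (cf. Equation (4))."""
    def __init__(self, num_maps: int = 2048, k: int = 4):
        super().__init__()
        self.k = k
        # D x k extra parameters, initialized so the layer starts as plain GKMP.
        self.weights = nn.Parameter(torch.full((num_maps, k), 1.0 / k))

    def forward(self, h: torch.Tensor) -> torch.Tensor:      # h: B x D x H x W
        topk = h.flatten(2).topk(self.k, dim=2).values        # B x D x k, sorted descending
        return (topk * self.weights).sum(dim=2)               # weighted average per feature map
```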

4 Embedding layer

The embedding layer is inserted between GKMP and the classification layer. The idea is to map the images into a discriminative embedding space, where the distances between images of the same class are small, while the distances between images of different classes are large. This concept is known from face verification [23, 30, 10] and metric learning [22]. Different from those two tasks, we do not compare images directly in the embedding space, but use it as an intermediate layer (more specifically, as the penultimate layer). The classification is done by a standard classification layer trained on the embedding space with cross-entropy. We argue that one advantage of this approach is that it can help mitigate the issue of limited training data that we often see in FGVC tasks (see Table 1).

The embedding space is trained using a specific loss function applied directly to the output of this intermediate layer. In this work we use a formulation based on optimizing class means [30, 10], since this is easy to integrate into the training and does not require any specific batch construction schemes (unlike tuplet-based losses such as the triplet loss [23] or the NPair loss [26]). For each class c a feature vector m_c is computed (and updated online during training) that describes the class mean within the embedding space. The goal of the loss function is to minimize the distance of each image within a batch to its respective class mean, while maximizing the distance between means of different classes.

The loss function is composed of two parts:

\mathcal{L}_{emb} = \mathcal{L}_{wc} + \mathcal{L}_{bc}    (5)

The first part is the within-class (intra-class) loss that minimizes the distances of the images to their class means, while the second part is the between-class (inter-class) loss that maximizes the distances between class means.

4.1 Within-class loss

We use the same formulation for the within-class loss as in [10] (which is derived from the center loss proposed in [30]), including an additional \ell_2-normalization. Given C classes and a batch of B images with class labels c_1, \dots, c_B, let x_n be the \ell_2-normalized output of the embedding layer for the n-th image. Assuming we are currently in training iteration t, the first step is to update the class means using the data points in the current batch and the class means from the previous iteration:

m_c^{(t)} = m_c^{(t-1)} - \alpha \, \Delta_c^{(t)}    (6)

where the hyper-parameter \alpha can be considered the learning rate of the class means. The term \Delta_c^{(t)} is defined as

\Delta_c^{(t)} = \frac{\sum_{n=1}^{B} \delta(c_n, c) \, (m_c^{(t-1)} - x_n)}{1 + \sum_{n=1}^{B} \delta(c_n, c)}    (7)

and \delta(\cdot, \cdot) is the Kronecker delta:

\delta(c_n, c) = \begin{cases} 1 & \text{if } c_n = c \\ 0 & \text{otherwise} \end{cases}    (8)

Note that this update creates a functional dependence between the class means and their corresponding images within the batch which has to be considered during back-propagation [10].

The within-class loss function is then defined as

\mathcal{L}_{wc} = \sum_{n=1}^{B} \left\| x_n - m_{c_n}^{(t)} \right\|_2^2    (9)
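
The following sketch mirrors the reconstructed Equations (6)–(9): an online class-mean update followed by the within-class loss. It is a simplification (the means are treated as a plain tensor here and the loss is averaged over the batch), whereas the paper back-propagates through the dependence of the updated means on the batch features; all function names are hypothetical.

```python
import torch

def update_class_means(means, emb, labels, alpha=0.5):
    """means: C x E tensor; emb: B x E l2-normalized embeddings; labels: B long tensor."""
    new_means = means.clone()
    for c in labels.unique():
        mask = labels == c                                             # Kronecker delta of Eq. (8)
        delta = (means[c] - emb[mask]).sum(dim=0) / (1.0 + mask.sum()) # Eq. (7)
        new_means[c] = means[c] - alpha * delta                        # Eq. (6)
    return new_means

def within_class_loss(emb, labels, means):
    """Eq. (9): squared distances of embeddings to their (updated) class means."""
    return (emb - means[labels]).pow(2).sum(dim=1).mean()
```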

4.2 Between-class loss

The between-class loss maximizes the distances between the updated class means:

\mathcal{L}_{bc} = \frac{\beta}{|P|} \sum_{(c, c') \in P} \left[ \max\left(0, \, m - \left\| m_c^{(t)} - m_{c'}^{(t)} \right\|_2 \right) \right]^2    (10)

The terms m_c^{(t)} and m_{c'}^{(t)} are the new class means updated with the features from the current batch as computed in Equation (6). The margin m defines a threshold for the distances to be penalized and \beta controls the contribution of the between-class part to the embedding loss \mathcal{L}_{emb}. The set P with cardinality |P| contains all class-pairs in the current batch.

To see why squaring the maximum in Equation (10) is important, we have a look at the gradient with respect to m_c^{(t)}:

\frac{\partial \mathcal{L}_{bc}}{\partial m_c^{(t)}} = -\frac{2\beta}{|P|} \sum_{c': (c, c') \in P} \max\left(0, \, m - d_{c,c'}\right) \frac{m_c^{(t)} - m_{c'}^{(t)}}{d_{c,c'}}    (11)

with

d_{c,c'} = \left\| m_c^{(t)} - m_{c'}^{(t)} \right\|_2    (12)

The squared maximization leads to the appearance of the distance d_{c,c'} in the factor max(0, m − d_{c,c'}) of the gradient. As a result, the gradient gets larger as the distance between two class means gets smaller, which encourages the model to focus more on improving the distance of class-pairs with very close means. This should help to reduce classification errors due to the confusion of images from class-pairs with close class means.

While the formulation of the between-class loss defined above is close to [10], the appearance of the distance d_{c,c'} in the gradient is a key difference, since in [10] the maximization in the between-class loss is not squared. In addition, there are two more differences. First, the class means are based on \ell_2-normalized features, which makes the choice of the margin m easier, since distances between \ell_2-normalized vectors are restricted to a certain range. The second difference is that in [10] the set P contains only a sampled subset of all class-pairs in the current batch. Since we work with much smaller batch sizes (see Section 6) compared to [10], we use all pairs within a batch.
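
A corresponding sketch of the between-class part over all class pairs occurring in a batch is shown below. It follows the reconstructed Equation (10) with a squared hinge on the \ell_2 distance between class means; the margin 0.75 and weight 16.0 are taken from Table 2 under the assumption that they map to m and \beta, respectively.

```python
import itertools
import torch

def between_class_loss(means, labels, margin=0.75, beta=16.0):
    """Squared-hinge penalty on distances between class means (cf. Eq. (10))."""
    classes = labels.unique().tolist()
    pairs = list(itertools.combinations(classes, 2))
    if not pairs:
        return means.sum() * 0.0                   # fewer than two classes in the batch
    loss = 0.0
    for c1, c2 in pairs:
        dist = torch.norm(means[c1] - means[c2], p=2)
        loss = loss + torch.clamp(margin - dist, min=0.0) ** 2   # squared hinge
    return beta * loss / len(pairs)
```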

The embedding layer as defined above is also closely related to the attention regulation proposed in OSME-MAMC [27]. The latter uses a metric learning loss to learn correlations between the output of different attention branches. If the multiple attention branches were replaced by a single fully connected layer then this method would be equivalent to training an embedding layer with the NPair loss [26].

5 Localization module

It can be observed that cropping the images based on bounding boxes of the objects to be recognized can improve the classification performance. Since we do not want to use any annotations apart from the class labels during training (and of course no annotations at all during testing), we need a way to obtain these localizations without additional annotations in order to profit from the increased performance they can provide. We design such a localization module based on the following observations.

Figure 2: Min-max normalized activations of the last convolutional layer of a trained model. The three images in the middle are the activations in a specific feature map, while the last image shows the mean over all feature maps.

The final convolutional layer of a trained classification model typically contains higher activations at positions corresponding to the discriminative areas of the image (see Figure 2). By computing the mean over all feature maps, a quite accurate heat map can be obtained for the object in the image. This heat map can then be used to find the boundaries of the object within the final layer. A similar observation has been made in [38], but in contrast to [38] we do not consider class-specific heat maps due to the small inter-class variations in FGVC.

The downside of this approach is that we obtain these bounding boxes only at the final layer, while the full image needs to be forwarded through all previous layers. One option would be to apply ROI-Pooling after the final layer based on the estimated bounding box. However, this would mean that only the fully connected layers following the last convolutional layer would profit from focusing on the discriminative area of the image. All other layers would still have to operate on the full image and we could not fully exploit the potential of using the bounding boxes (see Figure 5). An alternative would be to pass the image through the full backbone network, extract the bounding boxes and then train a second, more precise model based on the bounding boxes (similar to [24]). However, with this approach the image would have to be passed through a full network twice: one pass to obtain the bounding box and a second pass for classification, which is not very efficient. To avoid such a multi-pass procedure we propose to train a very lightweight localization module that predicts the bounding boxes and is integrated into the backbone network such that an image can be processed in one pass.

The architecture of the localization module is equivalent to the first few layers of a ResNet-50 (initialized from the trained classification model) up to the end of the first residual block, preceded by an initial down-sampling layer that reduces the input to a low spatial resolution. The module has only a small fraction of the backbone's parameters and can be added to the classification model without taking up much runtime or memory. Its output has a size of 14 × 14, the same as the mean over the feature maps of the last convolutional layer of the trained classification model. The localization module is trained by feeding an input image through both the trained classification model and the localization module. The outputs are compared using the smooth L1 loss [9] (see Figure 3). During back-propagation the weights of the trained classification model are fixed and only the localization module is trained. This way the localization module learns to directly predict the heat maps.
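
A sketch of one training step for the localization module is shown below: the frozen backbone trunk provides the target heat map (the mean over the feature maps of its last convolutional layer), and the lightweight module is trained to regress it. The smooth L1 loss follows the text; the function signature and optimizer handling are assumptions.

```python
import torch
import torch.nn.functional as F

def localization_training_step(backbone_trunk, loc_module, optimizer, images):
    """One distillation-style step: regress the backbone's mean heat map."""
    with torch.no_grad():                                   # the classification model stays fixed
        feats = backbone_trunk(images)                      # B x D x 14 x 14
        target = feats.mean(dim=1, keepdim=True)            # mean heat map, B x 1 x 14 x 14
    pred = loc_module(images)                               # predicted heat map, B x 1 x 14 x 14
    loss = F.smooth_l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```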

Figure 3: Training procedure for the localization module: The localization module learns to directly predict the heat maps by minimizing the mean squared error with the heat maps generated by the backbone CNN.

Examples of the estimated heat maps are given in Figure 4. The localization module is able to predict the heat maps very accurately, even though its predictions are somewhat less focused on specific parts of the bird (especially in the second and third row).

Figure 4: True mean (middle) of the trained classification model compared to estimated mean (right).

To obtain the final bounding box we process the heat map using min-max normalization and binarization based on a given threshold τ. The bounding box is then the smallest rectangle containing all pixels where the normalized heat map has a value greater than τ.
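
The bounding-box extraction itself could look roughly as follows: min-max normalization, binarization at the threshold τ, and the tightest rectangle around all surviving pixels. Scaling the 14 × 14 heat-map coordinates back to pixels with a stride of 32 is an assumption for a 448 × 448 input; the function name and the fallback to the full image are likewise illustrative.

```python
import torch

def heatmap_to_bbox(heatmap: torch.Tensor, tau: float = 0.3, stride: int = 32):
    """heatmap: H x W (e.g. 14 x 14); returns (x_min, y_min, x_max, y_max) in pixels."""
    norm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)  # min-max normalization
    ys, xs = torch.nonzero(norm > tau, as_tuple=True)                          # binarization at tau
    if ys.numel() == 0:                                     # nothing above threshold: keep full image
        return 0, 0, heatmap.shape[1] * stride, heatmap.shape[0] * stride
    return (xs.min().item() * stride, ys.min().item() * stride,
            (xs.max().item() + 1) * stride, (ys.max().item() + 1) * stride)
```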

6 Experimental evaluation

Name                 #Train images   #Test images   #Classes
CUB200-2011 [28]     5994            5794           200
Stanford cars [16]   8144            8041           196
FGVC-Aircraft [21]   6667            3333           100
Table 1: Datasets used for the evaluation.

We evaluate our proposed approach on three popular datasets for fine-grained classification: CUB200-2011 [28] for bird-species classification, Stanford cars [16] for the classification of car models and FGVC-Aircraft [21] for the classification of airplane models. The statistics of the three datasets are given in Table 1. On all datasets we use only the class label annotations. We do not use any additional annotations such as bounding boxes or part annotations.

As backbone CNN we select ResNet [11], where we try the two variants ResNet-50 and ResNet-101 (the analysis in Sections 6.2 and 6.3 is done using ResNet-50 as the backbone CNN). The models are pretrained on ImageNet [5], from which we remove all images that overlap with the test sets of the three FGVC tasks used for this evaluation.

Since our added components are all lightweight and do not occupy much memory, we are able to train both ResNet variants on a single NVIDIA GeForce GTX 1080 Ti GPU with 11GB memory, using small batches of input images with a spatial resolution of 448 × 448. We use this batch size and input resolution for all experiments. The models are trained with standard back-propagation using momentum and weight decay, with a starting learning rate that is reduced by a constant factor partway through training. The weighted average pooling (see Section 3.1) is applied as a finetuning step for a few additional epochs. The approach is implemented using the torch7 framework [2].

There are a number of hyper-parameters to set for our approach. We perform a very limited search for the hyper-parameters on CUB200-2011 using a validation set separated from the training images. The parameters are quite robust and we can use the same set of hyper-parameters in all other experiments. The threshold τ for the localization module is the only exception, where we use a slightly smaller value on the Stanford cars and FGVC-Aircraft datasets. The exact values of the hyper-parameters are given in Table 2.

In Sections 6.4, 6.5 and 6.6 we compare to state-of-the-art approaches using a similar experimental setting, specifically with the same training data (ImageNet and the training set of the FGVC task at hand) and no bounding box or part annotations. Note that for some datasets better results can be achieved by acquiring large amounts of additional training data [4, 15].

Hyper-parameter                   Reference       Value
α (class-mean learning rate)      Equation (6)    0.5
λ (embedding-loss weight)         Equation (1)    2.0
β (between-class weight)          Equation (10)   16.0
m (margin)                        Equation (10)   0.75
k                                 Equation (3)    4
τ (localization threshold)        Section 5       0.3 / 0.2
Table 2: Hyper-parameters used in the experiments. The threshold τ is 0.3 on CUB200-2011 and 0.2 on the other two datasets.

6.1 Computational efficiency

Method     FLOPs     Parameters
Baseline              24M
Ours                  25M
Table 3: Computational efficiency comparison between the baseline ResNet-50 (input 1x3x448x448) and our approach.

A comparison of the computational efficiency between a baseline ResNet-50 and our approach with ResNet-50 as backbone is given in Table 3. With the typical input resolution of 448 × 448, the additional complexity in terms of FLOPs (multiply-adds) caused by our approach is very small (one reason being that the localization module operates on a heavily down-sampled input). There is also no large increase in the number of parameters, highlighting the efficiency of the proposed approach.

6.2 Localization module

Figure 5: Accuracies for ROI-Pooling with ground-truth bounding boxes applied at different depths of a CNN on CUB200-2011 (ResNet-50 with GAP and no embedding learning), compared to the same model without bounding boxes.

In Figure 5 we analyze the effect on the recognition performance when the image or feature maps are cropped based on ground-truth bounding boxes (ROI-Pooling). We can observe that the earlier the ROI-Pooling is applied, the better the recognition accuracy, confirming our expectation in Section 5.

Method Accuracy[%]
GoogleNet-GAP [38] 41.0
ResNet-50 mean feature map 58.6
Localization module 68.9
Table 4: Accuracy of localization if intersection over union (IoU) is at least 0.5.

The accuracy of the bounding boxes is calculated by counting a bounding box as correct if its intersection over union (IoU) with the ground-truth bounding box is at least 0.5. We can observe in Table 4 that using the mean over the feature maps of the last convolutional layer of a baseline ResNet-50 already achieves a better accuracy than what is reported in [38]. However, on top of being much more efficient, the localization module is even more accurate. We argue that this is due to the slightly more general heat maps generated by the localization module as a result of the approximation (see Figure 4).
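
For reference, the localization accuracy of Table 4 can be computed with a few lines; the helper names below are hypothetical and boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth is at least 0.5."""
    correct = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)
```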

In Figure 6 we show qualitative results of the bounding boxes estimated for validation images by the localization module. We can observe that the localization module is able to find quite accurate bounding boxes. In some cases (right column of Figure 6) the localization module over-estimates the bounding box due to other objects in the image such as a branch or a distracting background. Another interesting reason for over-estimated bounding boxes can be multiple instances of the given FGVC domain such as two different airplane models in the same image. However, in each of these cases the classifier still has a good chance of finding the correct class, since the objects are still present in the cropped image.

Figure 6: Examples of bounding boxes detected by the localization module.

6.3 Ablation study

Table 5 shows how the components contribute to the recognition performance. The two baseline results with GAP and GMP include an additional fully connected layer before the classification layer. We can observe in Table 5 that adding the embedding loss and adding the localization module lead to the largest boosts in performance, at about 2% absolute each. The addition of global k-max pooling, of the weighted average finetuning, and of the full embedding loss compared to only the within-class part each lead to an improvement of roughly 0.5%.

Pooling   Embedding loss   Localization module   Weighted average   Accuracy [%]
GAP 81.2
GMP 82.2
GKMP 82.7
GKMP 84.2
GKMP Center-loss [30] 84.1
GKMP DFF [10] 84.6
GKMP 84.4
GKMP 84.9
GKMP 86.9
GKMP 87.4
Table 5: Results on CUB200-2011 with a ResNet-50. The embedding layer is trained either with only the within-class part of the embedding loss or with the full embedding loss (within-class and between-class). Weighted average refers to Section 3.1. For comparison we include results obtained with the center loss [30] and DFF [10].
Figure 7: Accuracies for different values of k (1, 2, 4, 8, 16, 32, 196) in global k-max pooling on CUB200-2011.

The effect of the value selected for k in GKMP is illustrated in Figure 7. The special case k = 1 is equivalent to GMP and k = 196 is equivalent to GAP (the spatial dimension of the last convolutional layer is 14 × 14 = 196). We can observe that a value around k = 4 seems to be optimal. The improvement compared to GMP is only about 0.5% absolute, but using a value larger than 1 also enables us to apply the weighted averaging, which gives another performance boost (see Table 5).

6.4 Cub200-2011

Method Backbone CNN Accuracy[%]
STN [13] BN-Inception 84.1
MA-CNN [37] VGG-19 86.5
GMM [19] VGG-19 86.3
Spatial RNN [32] M-Net/D-Net 89.7
Stacked LSTM [8] GoogleNet 90.4
DLA [35] DLA-102 85.1
FAL [33] ResNet-50 84.2
DT-RAM [18] ResNet-50 86.0
ISE [24] ResNet-50 87.2
DFL-CNN [29] ResNet-50 87.4
NTS-Net [34] ResNet-50 87.5
DCL [1] ResNet-50 87.8
OSME-MAMC [27] ResNet-101 86.5
iSQRT-COV [17] ResNet-101 88.7
Ours ResNet-50 87.4
Ours ResNet-101 88.5
Table 6: Comparison of our approach with other state-of-the-art methods on CUB200-2011.

In Table 6 we compare our approach to other state-of-the-art methods. In addition to the results reported in Table 5, we also evaluate our method with ResNet-101 as backbone network. This includes all components (GKMP with weighted average pooling, embedding layer with full embedding loss, localization module). Our best result of 88.5% is very competitive, even though a little less accurate than the best state-of-the-art results reported in [32] and [8]. However, these involve recurrent neural networks, which can be computationally expensive.

6.5 Stanford cars

Method Backbone CNN Accuracy[%]
MA-CNN [37] VGG-19 92.8
GMM [19] VGG-19 93.5
DFL-CNN [29] VGG-16 93.8
Spatial RNN [32] M-Net/D-Net 93.4
DLA [35] DLA-X-60-C 94.1
DT-RAM [18] ResNet-50 93.1
ISE [24] ResNet-50 94.1
NTS-Net [34] ResNet-50 93.9
DCL [1] ResNet-50 94.5
GPipe [12] AmoebaNet-B 94.8
AutoAugm [3] Inception-v4 94.8
OSME-MAMC [27] ResNet-101 93.0
iSQRT-COV [17] ResNet-101 93.3
Ours ResNet-50 94.5
Ours ResNet-101 95.0
Table 7: Comparison of our approach with other state-of-the-art methods on Stanford cars.

The results on the Stanford cars dataset are given in Table 7. Here we run the experiment with ResNet-50 and ResNet-101 with all components included. As mentioned earlier, we use the same hyper-parameters as for CUB200-2011 (apart from the threshold τ). Again, our approach achieves a very competitive result. In fact, to the best of our knowledge, the accuracy of 95.0% is the best published result, even if by a small margin.

6.6 FGVC-Aircraft

Method Backbone CNN Accuracy[%]
MA-CNN [37] VGG-19 89.9
GMM [19] VGG-19 90.5
DFL-CNN [29] VGG-16 92.0
Spatial RNN [32] M-Net/D-Net 88.4
DLA [35] DLA-X-60 92.9
ISE [24] ResNet-50 90.9
NTS-Net [34] ResNet-50 91.4
DCL [1] ResNet-50 93.0
GPipe [12] AmoebaNet-B 92.9
AutoAugm [3] Inception-v4 92.7
iSQRT-COV [17] ResNet-101 91.4
Ours ResNet-50 93.4
Ours ResNet-101 93.5
Table 8: Comparison of our approach with other state-of-the-art methods on FGVC-Aircraft.

Similar to the Stanford cars experiments, we use all components introduced in the previous sections and the same hyper-parameters as for Stanford cars. Also on FGVC-Aircraft (Table 8) we report a very competitive result of 93.5%, which is again, to the best of our knowledge, the best state-of-the-art result.

7 Conclusion

In this work we presented three efficient methods to improve the classification performance of a backbone CNN for fine-grained visual classification. Specifically, we propose a lightweight localization module that relies only on class label annotations during training. We showed that the localization module can find reliable bounding boxes and significantly boost the recognition performance, even though it is trained without bounding box annotations. Additionally, we propose to use global k-max pooling to obtain a global vector describing the image. This approximates part-based modeling and can be further improved by learning weights that regulate the contribution of the maximal values in each feature map. Finally, we project the image descriptor into a discriminative embedding space on which the classification layer operates. As an intermediate layer of the full classification network, the embedding space is trained jointly with the full network using a specific loss function that optimizes class means. We evaluate our approach on three popular FGVC tasks and achieve competitive results on all three. In fact, on Stanford cars and FGVC-Aircraft we can report new best classification accuracies.

References

  • [1] Y. Chen, Y. Bai, W. Zhang, and T. Mei (2019) Destruction and construction learning for fine-grained image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 6, Table 7, Table 8.
  • [2] R. Collobert, K. Kavukcuoglu, C. Farabet, et al. (2011) Torch7: a matlab-like environment for machine learning. In BigLearn, NIPS workshop, Vol. 5, pp. 10. Cited by: §6.
  • [3] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §1.1, Table 7, Table 8.
  • [4] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie (2018) Large scale fine-grained categorization and domain-specific transfer learning. In IEEE conference on computer vision and pattern recognition, pp. 4109–4118. Cited by: §1.1, §6.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §6.
  • [6] T. Durand, N. Thome, and M. Cord (2016) Weldon: weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4743–4752. Cited by: §3.
  • [7] J. Fu, H. Zheng, and T. Mei (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In IEEE conference on computer vision and pattern recognition, pp. 4438–4446. Cited by: §1.1.
  • [8] W. Ge, X. Lin, and Y. Yu (2019-06) Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.4, Table 6.
  • [9] R. Girshick (2015) Fast r-cnn. In IEEE international conference on computer vision, pp. 1440–1448. Cited by: §5.
  • [10] H. Hanselmann, S. Yan, and H. Ney (2017) Deep fisher faces. In British Machine Vision Conference (BMVC), Vol. 1, pp. 7. Cited by: 2nd item, §4.1, §4.2, §4, §4, Table 5.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2, §6.
  • [12] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen (2018) GPipe: efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965. External Links: Link, 1811.06965 Cited by: §1.1, Table 7, Table 8.
  • [13] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §1.1, Table 6.
  • [14] P. Koniusz, F. Yan, and K. Mikolajczyk (2013) Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. Computer vision and image understanding 117 (5), pp. 479–492. Cited by: §3.
  • [15] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pp. 301–320. Cited by: §1.1, §6.
  • [16] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §1, Table 1, §6.
  • [17] P. Li, J. Xie, Q. Wang, and Z. Gao (2018-06) Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.1, Table 6, Table 7, Table 8.
  • [18] Z. Li, Y. Yang, X. Liu, S. Wen, and W. Xu (2017) Dynamic computational time for visual attention. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 1199–1209. Cited by: §1.1, Table 6, Table 7.
  • [19] J. Liang, J. Guo, X. Liu, and S. Lao (2018) Fine-grained image classification with gaussian mixture layer. IEEE Access 6 (), pp. 53356–53367. External Links: Document, ISSN 2169-3536 Cited by: §1.1, Table 6, Table 7, Table 8.
  • [20] T. Lin, A. RoyChowdhury, and S. Maji (2015) Bilinear cnn models for fine-grained visual recognition. In IEEE international conference on computer vision, pp. 1449–1457. Cited by: §1.1.
  • [21] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §1, Table 1, §6.
  • [22] O. Rippel, M. Paluri, P. Dollar, and L. Bourdev (2015) Metric learning with adaptive density discrimination. arXiv preprint arXiv:1511.05939. Cited by: §4.
  • [23] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: 2nd item, §4, §4.
  • [24] A. Simonelli, F. De Natale, S. Messelodi, and S. R. Bulo (2018) Increasingly specialized ensemble of convolutional neural networks for fine-grained recognition. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 594–598. Cited by: §1.1, §5, Table 6, Table 7, Table 8.
  • [25] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [26] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §4.2, §4.
  • [27] M. Sun, Y. Yuan, F. Zhou, and E. Ding (2018) Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 805–821. Cited by: §1.1, §4.2, Table 6, Table 7.
  • [28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §1, Table 1, §6.
  • [29] Y. Wang, V. I. Morariu, and L. S. Davis (2018) Learning a discriminative filter bank within a cnn for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4148–4157. Cited by: §1.1, Table 6, Table 7, Table 8.
  • [30] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: 2nd item, §4.1, §4, §4, Table 5.
  • [31] C. Weng, H. Wang, and J. Yuan (2013) Learning weighted geometric pooling for image classification. In 2013 IEEE International Conference on Image Processing, pp. 3805–3809. Cited by: §3.1.
  • [32] L. Wu, Y. Wang, X. Li, and J. Gao (2018) Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE transactions on cybernetics 49 (5), pp. 1791–1802. Cited by: §1.1, §6.4, Table 6, Table 7, Table 8.
  • [33] Q. Xu, Y. Sun, Y. Li, and S. Wang (2018) Attend and align: improving deep representations with feature alignment layer for person retrieval. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2148–2153. Cited by: §1.1, Table 6.
  • [34] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang (2018) Learning to navigate for fine-grained classification. In ECCV, Cited by: §1.1, Table 6, Table 7, Table 8.
  • [35] F. Yu, D. Wang, E. Shelhamer, and T. Darrell (2018) Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412. Cited by: §1.1, Table 6, Table 7, Table 8.
  • [36] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan (2017) Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia 19 (6), pp. 1245–1256. Cited by: §1.1.
  • [37] H. Zheng, J. Fu, T. Mei, and J. Luo (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE international conference on computer vision, pp. 5209–5217. Cited by: §1.1, §3, Table 6, Table 7, Table 8.
  • [38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §5, §6.2, Table 4.