Fine-grained visual classification (FGVC) aims at identifying sub-classes of a given object category, e.g., different species of birds, or models of cars and aircraft. It is a much more challenging problem than traditional classification due to the inherently subtle intra-class object variations amongst sub-categories. Most effective solutions to date rely on extracting fine-grained feature representations at local discriminative regions, either by explicitly detecting semantic parts [11, 38, 35, 12, 36] or implicitly via saliency localization [31, 10, 4, 24]. These locally discriminative features are then collectively fused to perform the final classification.
Early work mostly finds discriminative regions with the assistance of manual annotations [2, 21, 34, 37, 16]. However, human annotations are difficult to obtain and can often be error-prone, resulting in performance degradation. Research focus has consequently shifted to training models in a weakly-supervised manner given only category labels [38, 35, 31, 4]. The success of these models can be largely attributed to their ability to locate more discriminative local regions for downstream classification. However, little or no effort has been made towards understanding (i) at which granularities these local regions are most discriminative, e.g., the head or the beak of a bird, and (ii) how information across different granularities can be fused together to improve classification accuracy, e.g., whether the head and the beak can work together.
Information across various granularities is, however, helpful for avoiding the effect of large intra-class variations. For example, experts sometimes need to identify a bird using both the overall structure of a bird’s head and finer details such as the shape of its beak. That is, it is often not sufficient to identify discriminative parts; one must also consider how these parts interact with each other in a complementary manner. Very recent research has focused on the “zooming-in” factor [11, 36], i.e., not just identifying parts, but also focusing on the truly discriminative regions within each part (e.g., the beak, more than the head). Yet these methods mostly focus on a few parts and ignore the others as they zoom in, and do not go beyond simple fusion. More importantly, they do not consider how features from different zoomed-in parts can be fused together in a synergistic manner. In contrast to these approaches, we further argue that one needs not only to identify parts and their most discriminative granularities, but also to determine how parts at different granularities can be effectively merged.
In this paper, we take an alternative stance towards fine-grained classification. We do not explicitly, nor implicitly attempt to mine fine-grained feature representations from parts (or their zoomed-in versions). Instead, we approach the problem with the hypothesis that the fine-grained discriminative information lies naturally within different visual granularities – it is all about encouraging the network to learn at different granularities and simultaneously fusing multi-granularity features together. This can be better explained by Figure 1.
More specifically, we propose a consolidated framework that accommodates part granularity learning and cross-granularity feature fusion simultaneously. This is achieved through two components that work synergistically with each other: (i) a progressive training strategy that effectively fuses features from different granularities, and (ii) a random jigsaw patch generator that encourages the network to learn features at specific granularities. Note that we refrain from using “scale” since we do not apply Gaussian blur filters on image patches, rather we evenly divide and shuffle image patches to form different granularity levels.
As the first contribution, we propose a multi-granularity progressive training framework to learn the complementary information across different image granularities. This differs significantly from prior art, where parts are first detected and later fused in an ad-hoc manner. Our progressive framework works in steps during training, where at each step the training focuses on cultivating granularity-specific information with a corresponding stage of the network. We start with finer granularities, which are more stable, and gradually move on to coarser ones, which avoids the confusion caused by large intra-class variations that appear in large regions. On its own, this is akin to a “zooming out” operation, where the network first focuses on a local region, then zooms out to a larger patch surrounding this local region, and finishes when it reaches the whole image. More specifically, when each training step ends, the parameters trained at the current step are passed on to the next training step as its parameter initialization. This passing operation essentially enables the network to mine information of larger granularity based on the region learned in its previous training step. Features extracted from all stages are concatenated only at the last step to further ensure that complementary relationships are fully explored.
However, applying progressive training naively would not benefit fine-grained feature learning. This is because the multi-granularity information learned via progressive training may tend to focus on similar regions. As the second contribution, we tackle this problem by introducing a jigsaw puzzle generator to form different granularity levels at each training step, while only the last step is still trained with the original images. This effectively encourages the model to operate at the patch level, where patch sizes are specific to a particular granularity. It essentially forces each stage of the network to focus on local patches rather than holistically on the entire image, thereby learning information specific to a given granularity level. This effect is demonstrated in Figure 1. Note that the very recent work of DCL first adopted a jigsaw solver for fine-grained classification. We differ significantly in that we do not employ a jigsaw solver as part of feature learning. Instead, we simply generate jigsaw patches randomly as a means of introducing different object part levels to assist progressive training.
The main contributions of this paper can be summarized as follows:
- We propose a novel progressive training strategy for fine-grained classification. It operates in different training steps, and at each step fuses data from previous levels of granularity, ultimately cultivating the inherent complementary properties across different granularities for fine-grained feature learning.
- We adapt a simple yet effective jigsaw puzzle generator to form different levels of granularity. This allows the network to focus on different “scales” of features as per prior work.
- The proposed Progressive Multi-Granularity (PMG) training framework obtains state-of-the-art or competitive performance on all three standard FGVC benchmark datasets.
2 Related Work
2.1 Fine-Grained Classification
Benefiting from the recent development of neural networks, e.g., VGG and ResNet, the feature extraction capabilities of neural networks have improved significantly. Recent studies on FGVC have moved from the strongly-supervised scenario with extra annotations, e.g., bounding boxes [2, 21, 34, 37, 16], to weakly-supervised conditions with only category labels [11, 38, 35, 12, 36].
In the weakly-supervised configuration, recent studies mainly focus on locating the most discriminative parts, more complementary parts, and parts of various granularities. However, few of them consider how to better fuse information from these discriminative parts. Current fusion techniques can be roughly divided into two categories. (i) The first technique makes predictions based on different parts and then directly combines their probabilities. Zhang et al. trained several networks focusing on features of different granularities to produce diverse prediction distributions, and then weighted their results before combining them. (ii) Other methods concatenate features extracted from different parts for the subsequent prediction [38, 11, 12, 35]. Fu et al. found that region detection and fine-grained feature learning can reinforce each other, and built a series of networks, each of which finds discriminative regions for the next network while making its own predictions. With a similar motivation, Zheng et al. jointly learned part proposals and the feature representations on each part, and located various discriminative parts before prediction. Both of them train a fully-connected fusion layer to fuse features extracted from different parts. Ge et al. went one step further by fusing features from complementary object parts with two LSTMs stacked together.
Fusing features from different parts is still a challenging problem, but few efforts have been made to address it. In this work, we try to address it based on the intrinsic characteristics of fine-grained objects: despite large intra-class variation, subtle details show stability at local regions. Hence, instead of locating the discriminative regions first, we guide the network to learn features progressively from small granularity to large granularity.
2.2 Image Splitting Operation
Splitting an image into pieces of the same size has been utilized for various goals in previous work. Among them, one typical task is solving the jigsaw puzzle [6, 29]. One can go a step further by adopting the jigsaw puzzle solution as the initialization of a weakly-supervised network, which leads to better transformation performance. This method helps the network exploit the spatial relationships within images. In one-shot learning, the image splitting operation is used for augmentation: two images are split and some of their patches are exchanged to generate new training images. In more recent research, DCL first adopted the image splitting operation for FGVC, destructing the global structure to emphasize local details and reconstructing the images to learn semantic correlations among local regions. However, it splits images with the same patch size during the whole training process, which makes it difficult to exploit multi-granularity regions. In this work, we apply a jigsaw puzzle generator to restrict the granularity of learned regions at each training step.
2.3 Progressive Training
The progressive training methodology was originally proposed for generative adversarial networks, where training started with low-resolution images and then progressively increased the resolution by adding layers to the networks. Instead of learning the information from all the scales at once, this strategy allows the network to discover the large-scale structure of the image distribution and then shift attention to increasingly finer-scale details. Recently, the progressive training strategy has been widely utilized for generation tasks [19, 27, 32, 1], since it can simplify the information propagation within the network through intermediate supervision.
For FGVC, the fusion of multi-granularity information is critical to model performance. In this work, we adopt the idea of progressive training to design a single network that can learn this information through a series of training stages. The input images are first split into small patches to train the low-level layers of the model. Then the patch size is progressively increased, and the corresponding high-level layers are added and trained. Most existing work on progressive training focuses on the task of sample generation. To the best of our knowledge, it has not been attempted earlier for the task of FGVC.
In this section, we present our proposed Progressive Multi-Granularity (PMG) training framework. As shown in Figure 2, to address the large intra-class variations, we encourage the model to learn stable fine-grained information in the shallower layers and gradually shift attention to the learning of abstract information of large granularity level in the deeper layers as training progresses.
3.1 Network Architecture
Our network design is generic and can be implemented on top of any state-of-the-art backbone feature extractor, such as ResNet. Let $F$ be our backbone feature extractor, which has $L$ stages. The output feature map from the $l$-th intermediate stage is represented as $F^l \in \mathbb{R}^{H_l \times W_l \times C_l}$, where $H_l$, $W_l$ and $C_l$ are the height, width and number of channels of the feature map at the $l$-th stage, and $l \in \{1, 2, \dots, L\}$. Our objective is to impose a classification loss on the feature maps extracted at different intermediate stages. Hence, in addition to $F$, we introduce a convolution block $H^l_{conv}$ that takes the $l$-th intermediate stage output $F^l$ as input and reduces it to a vector representation $V^l = H^l_{conv}(F^l)$. Thereafter, a classification module $H^l_{class}$, consisting of two fully-connected layers with batch normalization and ELU non-linearity and corresponding to the $l$-th stage, predicts the probability distribution over the classes as $y^l = H^l_{class}(V^l)$. Here, we consider the last $S$ stages: $l = L, L-1, \dots, L-S+1$. Finally, we concatenate the outputs from the last three stages as
$$V^{concat} = \mathrm{concat}[V^{L-2}, V^{L-1}, V^{L}].$$
This is followed by an additional classification module, $y^{concat} = H^{concat}_{class}(V^{concat})$.
3.2 Progressive Training
We adopt progressive training, where we train the low stages first and then progressively add new stages for training. Since the receptive field and representation ability of the low stages are limited, the network is forced to first exploit discriminative information from local details (i.e., object textures). Compared to training the whole network directly, this incremental nature allows the model to locate discriminative information from local details to global structures as the features are gradually sent into higher stages, instead of learning all the granularities simultaneously.
For training the outputs from each stage and the output from the concatenated features, we adopt the cross-entropy (CE) loss between the ground-truth label and the predicted probability distribution.
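For reference, the cross-entropy between a one-hot label and a predicted distribution takes the standard form below. The notation is an assumption made for illustration: $y$ is the one-hot ground-truth label over $m$ classes, $y^l$ is the predicted distribution from the $l$-th stage, and $y^{concat}$ is the predicted distribution from the concatenated features. Any per-term weighting used in the original formulation is not shown.

$$
\mathcal{L}_{CE}(y^{l}, y) = -\sum_{i=1}^{m} y_i \log y^{l}_i,
\qquad
\mathcal{L}_{CE}(y^{concat}, y) = -\sum_{i=1}^{m} y_i \log y^{concat}_i
$$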
At each iteration, a batch of data is used for $S+1$ steps (one for each of the $S$ stage outputs and one for the concatenated features), and we train only one output at each step, in series. Note that all parameters used in the current prediction are optimized, even if they have already been updated in previous steps, which helps all stages of the model work together.
3.3 Jigsaw Puzzle Generator
Jigsaw puzzle solving has been found to be a suitable self-supervised task for representation learning. In contrast, we borrow the notion of the jigsaw puzzle to generate input images for the different steps of progressive training. The objective is to devise regions of different granularity and force the model to learn information specific to the corresponding granularity level at each training step. Given an input image $d \in \mathbb{R}^{3 \times W \times H}$, we equally split it into $n \times n$ patches, each of dimensions $3 \times \frac{W}{n} \times \frac{H}{n}$. Both $W$ and $H$ should be integral multiples of $n$. Then, the patches are shuffled randomly and merged together into a new image $P(d, n)$. Here, the granularity of the patches is controlled by the hyper-parameter $n$.
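The splitting and shuffling described above can be sketched in a few lines. This is a minimal, framework-free illustration rather than the authors' implementation: the function name `jigsaw_generator`, the single-channel nested-list image representation and the `rng` parameter are assumptions made for the example.

```python
import random


def jigsaw_generator(image, n, rng=random):
    """Split a 2-D image (list of rows) into an n x n grid of equal patches,
    shuffle the patches, and stitch them back into an image of the same size.

    Hypothetical helper for illustration; a real pipeline would operate on
    batched multi-channel tensors.
    """
    height, width = len(image), len(image[0])
    if height % n or width % n:
        raise ValueError("height and width must be integral multiples of n")
    ph, pw = height // n, width // n  # patch height and width

    # Cut out the n * n patches in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(n)
        for c in range(n)
    ]
    rng.shuffle(patches)

    # Reassemble: the k-th shuffled patch fills grid cell (k // n, k % n).
    shuffled = []
    for r in range(n):
        for y in range(ph):
            out_row = []
            for c in range(n):
                out_row.extend(patches[r * n + c][y])
            shuffled.append(out_row)
    return shuffled


# Usage: a 4 x 4 image split into 2 x 2 patches of size 2 x 2 each.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
puzzle = jigsaw_generator(image, n=2, rng=random.Random(0))
```

With `n=1` the image is returned unchanged, which corresponds to the last training step that uses the original images.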
Regarding the choice of the hyper-parameter $n$ for each stage, two conditions need to be satisfied: (i) the size of the patches should be smaller than the receptive field of the corresponding stage, otherwise the effectiveness of the jigsaw puzzle generator is reduced; (ii) the patch size should increase proportionately with the receptive fields of the stages. Usually, the receptive field of each stage is approximately double that of the previous stage. Hence, we set $n$ for each stage's output so that it halves from one stage to the next.
During training, a batch of training data $d$ is first augmented into several jigsaw puzzle generator-processed batches, obtaining $P(d, n)$ for each value of $n$. All the jigsaw puzzle generator-processed batches share the same label $y$. Then, for each stage's output, we input the batch generated with the $n$ chosen for that stage, and optimize all the parameters used in this propagation. Figure 2 illustrates the training procedure step by step.
It should be clarified that the jigsaw puzzle generator cannot always guarantee the completeness of all parts that are smaller than the patch size: even such parts still have a chance of being split. However, this is not harmful for model training. We apply random cropping, a standard data augmentation strategy, before the jigsaw puzzle generator, so the patches differ from those of previous iterations. Small discriminative parts that are split at one iteration by the jigsaw puzzle generator will therefore not always be split in other iterations. Hence, this brings the additional advantage of forcing our model to find more discriminative parts at the specific granularity level.
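The order of training steps for one batch can be sketched as a simple schedule. This is an illustrative sketch only: the function name `progressive_steps`, the output labels, and in particular the rule `n = 2 ** (num_stages - stage + 1)` are assumptions chosen so that the patch size doubles (and `n` halves) from one stage to the next, consistent with the doubling receptive field described above.

```python
def progressive_steps(num_stages, num_outputs):
    """Return the ordered training steps for one batch.

    Each step is a pair (output_name, n): the output supervised at that step
    and the number of splits per side used by the jigsaw generator for its
    input. Hypothetical helper for illustration.
    """
    steps = []
    first = num_stages - num_outputs + 1  # shallowest supervised stage
    for stage in range(first, num_stages + 1):
        # ASSUMPTION: n halves each time we move one stage deeper, so the
        # patch size doubles along with the receptive field.
        n = 2 ** (num_stages - stage + 1)
        steps.append((f"stage_{stage}", n))
    # Last step: concatenated features, trained on the original image (n = 1).
    steps.append(("concat", 1))
    return steps


# Usage: a five-stage backbone with the last three stages supervised.
for output_name, n in progressive_steps(num_stages=5, num_outputs=3):
    print(f"train {output_name} on jigsaw input with n={n}")
```

Under these assumptions, shallower stages see finer patches, deeper stages see coarser ones, and the concatenated output is the only one trained on unshuffled images.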
At the inference step, we merely input the original images into the trained model, and the jigsaw puzzle generator is unnecessary. If we only use the prediction from the concatenated features, $y^{concat}$, the FC layers for the other three stages can be removed, which reduces the computational budget. In this case, the final result is the class with the highest predicted probability:
$$C_1 = \arg\max\left(y^{concat}\right).$$
However, the prediction from a single stage, based on information of a specific granularity, is unique and complementary, which leads to better performance when we combine all outputs together with equal weights. Writing $y^l$ for the prediction of the $l$-th stage and summing over the stages used for output, the multi-output combined prediction can be written as
$$C_2 = \arg\max\Big(\sum_{l} y^{l} + y^{concat}\Big).$$
Hence, both the prediction of $y^{concat}$ and the multi-output combined prediction can be obtained with our model. In addition, although all predictions are complementary for the final result, $y^{concat}$ alone is enough for objects whose shapes are relatively smooth, for example, cars. More details of the experiments can be found in Section 4.
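The two inference modes described above can be illustrated with plain Python lists standing in for the predicted distributions. The function names and the example probabilities are hypothetical and chosen only to show a case where the two modes disagree.

```python
def argmax(values):
    """Index of the largest entry in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def predict_concat(y_concat):
    """Prediction using only the output of the concatenated features."""
    return argmax(y_concat)


def predict_combined(stage_outputs, y_concat):
    """Equal-weight combination of every stage output and the concat output."""
    totals = list(y_concat)
    for y in stage_outputs:
        totals = [t + p for t, p in zip(totals, y)]
    return argmax(totals)


# Usage with made-up probabilities over three classes.
y_concat = [0.40, 0.35, 0.25]
stage_outputs = [
    [0.20, 0.50, 0.30],
    [0.30, 0.40, 0.30],
    [0.30, 0.45, 0.25],
]
single = predict_concat(y_concat)
combined = predict_combined(stage_outputs, y_concat)
```

Because the outputs are summed with equal weights, no extra parameters are needed for the combined prediction.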
4 Experiment Results and Discussion
In this section, we evaluate the performance of the proposed method on three fine-grained image classification datasets: Caltech UCSD-Birds (CUB), Stanford Cars (CAR), and FGVC-Aircraft (AIR). First, the implementation details are introduced in Section 4.1. Subsequently, the classification accuracy comparisons with other state-of-the-art methods are provided in Section 4.2. In order to illustrate the advantages of the different components and design choices in our method, a comprehensive ablation study and a visualization are provided in Sections 4.3 and 4.4.
4.1 Implementation Details
We perform all experiments using PyTorch with a version higher than 1.3 on a cluster of GTX 2080 GPUs. The proposed method is evaluated on two widely used backbone networks, VGG16 and ResNet50, which means the total number of stages is $L = 5$. For the best performance, we set , , and . The category labels of the images are the only annotations used for training. The input images are resized to a fixed size of and randomly cropped into , and random horizontal flip is applied for data augmentation when we train the model. During testing, the input images are resized to a fixed size of and cropped from the center into . All the above settings are standard in the literature.
We use the stochastic gradient descent (SGD) optimizer, with batch normalization as the regularizer. The learning rates of the newly added convolution layers and FC layers are initialized as 0.002 and reduced following the cosine annealing schedule during training. The learning rates of the pre-trained convolution layers are maintained at 1/10 of those of the newly added layers. We train all the aforementioned models for up to 300 epochs with a batch size of 16, a weight decay of 0.0005 and a momentum of 0.9.
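A cosine annealing schedule of this kind can be sketched as follows. The exact formula did not survive in the text above, so this is the common form without warm restarts, annealing to zero, and the function name and arguments are assumptions for illustration.

```python
import math


def cosine_annealed_lr(base_lr, epoch, total_epochs):
    """Learning rate at `epoch` under cosine annealing from base_lr to 0.

    ASSUMPTION: no warm restarts and a minimum learning rate of zero.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Usage with the values quoted above: 0.002 for newly added layers,
# one tenth of that for pre-trained layers, 300 epochs in total.
new_layer_lr = cosine_annealed_lr(0.002, epoch=150, total_epochs=300)
pretrained_lr = new_layer_lr / 10
```

The rate starts at the initial value, reaches half of it at the midpoint of training, and decays smoothly to zero at the final epoch.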
4.2 Comparisons with State-of-the-Art Methods
The comparisons of our method with other state-of-the-art methods on CUB-200-2011, Stanford Cars, and FGVC-Aircraft are presented in Table 1. Both the accuracy of the concatenated-feature prediction and the combined accuracy of all four outputs are listed.
| Method | Base Model | CUB (%) | CAR (%) | AIR (%) |
|---|---|---|---|---|
| FT VGG (CVPR18) | VGG16 | 77.8 | 84.9 | 84.8 |
| FT ResNet (CVPR18) | ResNet50 | 84.1 | 91.7 | 88.5 |
| B-CNN (ICCV15) | VGG16 | 84.1 | 91.3 | 84.1 |
| KP (CVPR17) | VGG16 | 86.2 | 92.4 | 86.9 |
| RA-CNN (ICCV17) | VGG19 | 85.3 | 92.5 | - |
| MA-CNN (ICCV17) | VGG19 | 86.5 | 92.8 | 89.9 |
| PC (ECCV18) | DenseNet161 | 86.9 | 92.9 | 89.2 |
| DFL (CVPR18) | ResNet50 | 87.4 | 93.1 | 91.7 |
| NTS-Net (ECCV18) | ResNet50 | 87.5 | 93.9 | 91.4 |
| MC-Loss (TIP20) | ResNet50 | 87.3 | 93.7 | 92.6 |
| DCL (CVPR19) | ResNet50 | 87.8 | 94.5 | 93.0 |
| MGE-CNN (ICCV19) | ResNet50 | 88.5 | 93.9 | - |
| S3N (ICCV19) | ResNet50 | 88.5 | 94.7 | 92.8 |
| Stacked LSTM (CVPR19) | ResNet50 | 90.4 | - | - |
| PMG (Combined Accuracy) | VGG16 | 88.8 | 94.3 | 92.7 |
| PMG (Combined Accuracy) | ResNet50 | 89.6 | 95.1 | 93.4 |
4.2.1 CUB-200-2011
We achieve competitive results on this dataset with a much simpler experimental procedure, since only one network and one propagation are needed during testing. Our method outperforms RA-CNN and MGE-CNN by 4.3% and 1.1%, respectively, even though they build several different networks to learn information of various granularities. They train the classifier of each network separately and then combine their information for testing, which demonstrates the advantage of exploiting multi-granularity information gradually in one network. Besides, although Stacked LSTM achieves better performance than our method, it is a two-phase algorithm that requires Mask-RCNN and CPF to offer complementary object parts and then uses a bi-directional LSTM for classification, which leads to more inference time and a larger computation budget.
4.2.2 Stanford Cars
Our method achieves state-of-the-art performance with ResNet50 as the base model. Since the cars in the Stanford Cars dataset are much more rigid and the performance of the concatenated-feature prediction alone is good enough, the improvement from combining multi-stage outputs is not obvious. The result of our method surpasses PC even though PC improves its performance by adopting a more advanced backbone network, i.e., DenseNet161. Compared with MA-CNN and NTS-Net, which first locate several different discriminative parts and then combine features extracted from them for final classification, we outperform them by large margins of 2.3% and 1.2%, respectively.
4.2.3 FGVC-Aircraft
On this task, the multi-stage outputs combined result of our method also achieves state-of-the-art performance. Although S3N finds both discriminative and complementary parts for feature extraction and applies an additional inhomogeneous transform to highlight these parts, we still outperform it by 0.6% with the same ResNet50 backbone, and show competitive results even when we adopt VGG16 as the base network.
4.3 Ablation Study
We conduct ablation studies to understand the effectiveness of the progressive training strategy and the jigsaw puzzle generator. We choose the CUB-200-2011 dataset for the experiments and ResNet50 as the backbone network, which means the total number of stages is $L = 5$. We first design different runs in which the number of stages used for output, $S$, is progressively increased and no jigsaw puzzle generator is used, as shown in Table 2. The output from the concatenated features is kept for all runs, so the number of steps is $S + 1$. It is clear that increasing $S$ boosts the model performance significantly up to a point. However, we also notice that the accuracy starts to decrease when $S$ becomes too large. A possible reason is that the low-stage layers mainly focus on class-irrelevant features, but the additional supervision forces them to find class-relevant information, which then affects the overall performance.
In Table 3, we report the results of our method with the assistance of the jigsaw puzzle generator. The hyper-parameter $n$ of the jigsaw puzzle generator for each stage follows the pattern described in Section 3.3. Compared with the results in Table 2, the jigsaw puzzle generator improves the model performance on top of progressive training when $S$ is small. As $S$ grows, the model with the jigsaw puzzle generator first shows no advantage, and then the jigsaw puzzle generator lowers the model performance. This is because, once shallow stages are supervised, the split patches are too small to keep meaningful information, which confuses the model training.
According to the above analysis, progressive training is beneficial for the fine-grained classification task when we choose an appropriate $S$. In such a case, the jigsaw puzzle generator can further improve the performance.
|S,n||Accuracy (%)||Combined Accuracy (%)|
|S,n||Accuracy (%)||Combined Accuracy (%)|
4.4 Visualization
In order to illustrate the advantages of the proposed method, we apply Grad-CAM to visualize the convolution layers of the last three stages for both our method and the baseline model. Columns (a)-(c) in Figure 3 visualize the convolution layers from the third to the fifth stage of our model’s backbone, which are supervised by images generated by the jigsaw puzzle generator with progressively larger patches. It is clear in column (a) that, at the third stage, the model concentrates on discriminative parts of small granularity, such as bird eyes and small patterns or textures of the birds’ feathers. In column (c), the fifth stage of the model pays attention to parts of larger granularity. The visualization demonstrates that our model indeed makes predictions based on discriminative parts, moving gradually from small granularity to large granularity.
When compared with the activation maps of the baseline model, our model shows more meaningful concentration on the target object, while the baseline model only shows the correct attention at the last stage. This difference indicates that the intermediate supervision of progressive training can help the model locate useful information at earlier stages. Besides, we find that the baseline model usually concentrates on only one or two parts of the object at the last stage, where it makes its prediction. In contrast, the attention regions of our method nearly cover the whole object at each stage, which indicates that images generated by the jigsaw puzzle generator can force the model to learn more discriminative parts at each granularity level.
5 Conclusion
In this paper, we apply a progressive training strategy to fine-grained classification tasks and propose a novel framework named Progressive Multi-Granularity (PMG) Training with two main components: (i) a novel training strategy that fuses multi-granularity features in a progressive manner, and (ii) a simple jigsaw puzzle generator to form images that contain information of different granularity levels. Our method can be trained end-to-end without any manual annotations other than category labels, and only needs one network with one propagation during testing. We conduct experiments on three widely used fine-grained datasets and obtain state-of-the-art performance on two of them and a competitive result on the other one, which demonstrates the effectiveness of our method.
References
- Image super-resolution via progressive cascading residual network. In CVPR workshops, Cited by: §2.3.
- Poof: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, Cited by: §1, §2.1.
-  (2020) The devil is in the channels: mutual-channel loss for fine-grained image classification. IEEE Transactions on Image Processing. Cited by: Table 1.
-  (2019) Destruction and construction learning for fine-grained image recognition. In CVPR, Cited by: §1, §1, §1, §2.2, Table 1.
-  (2019) Image block augmentation for one-shot learning. In AAAI, Cited by: §2.2.
-  (2010) A probabilistic image jigsaw puzzle solver. In CVPR, Cited by: §2.2.
-  (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §3.1.
-  (2017) Kernel pooling for convolutional neural networks. In CVPR, Cited by: Table 1.
-  (2019) Selective sparse sampling for fine-grained image recognition. In ICCV, Cited by: §4.2.3, Table 1.
-  (2018) Pairwise confusion for fine-grained visual classification. In ECCV, Cited by: §1, §4.2.2, Table 1.
-  (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, Cited by: §1, §1, §2.1, §2.1, §4.2.1, Table 1.
-  (2019) Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In CVPR, Cited by: §1, §2.1, §2.1, §4.2.1, Table 1.
-  (2017) Mask r-cnn. In ICCV, Cited by: §4.2.1.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §2.1, §3.1, §4.1.
-  (1997) Long short-term memory. Neural Computation. Cited by: §4.2.1.
-  (2016) Part-stacked cnn for fine-grained visual categorization. In CVPR, Cited by: §1, §2.1.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §3.1.
-  (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.3.
-  (2019) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §2.3.
-  (2013) 3d object representations for fine-grained categorization. In ICCV workshops, Cited by: §4.
-  (2016) Fast mode decision based on grayscale similarity and inter-view correlation for depth map coding in 3d-hevc. IEEE Transactions on Circuits and Systems for Video Technology 28 (3), pp. 706–718. Cited by: §1, §2.1.
-  (2015) Bilinear cnn models for fine-grained visual recognition. In ICCV, Cited by: Table 1.
-  (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.1.
-  (2019) Cross-x learning for fine-grained visual categorization. In ICCV, Cited by: §1.
-  (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §4.
-  (2017) Automatic differentiation in pytorch. Cited by: §4.1.
-  (2019) Singan: learning a generative model from a single natural image. In ICCV, Cited by: §2.3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1, §4.1.
-  (2014) Solving square jigsaw puzzles with loop constraints. In ECCV, Cited by: §2.2.
-  (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.
-  (2018) Learning a discriminative filter bank within a cnn for fine-grained recognition. In CVPR, Cited by: §1, §1, Table 1.
-  (2018) A fully progressive approach to single-image super-resolution. In CVPR workshops, Cited by: §2.3.
-  (2019) Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In CVPR, Cited by: §2.2, §3.3.
-  (2013) Hierarchical part matching for fine-grained visual categorization. In ICCV, Cited by: §1, §2.1.
-  (2018) Learning to navigate for fine-grained classification. In ECCV, Cited by: §1, §1, §2.1, §2.1, §4.2.2, Table 1.
-  (2019) Learning a mixture of granularity-specific experts for fine-grained categorization. In ICCV, Cited by: §1, §1, §2.1, §2.1, §4.2.1, Table 1.
-  (2014) Part-based r-cnns for fine-grained category detection. In ECCV, Cited by: §1, §2.1.
-  (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, Cited by: §1, §1, §2.1, §2.1, §4.2.2, Table 1.