Fine-grained image classification, also known as sub-category recognition, has been a hot research topic in computer vision and pattern recognition in recent years. Its purpose is to subdivide images belonging to the same basic category (cars, dogs, flowers, birds, etc.) into finer classes. However, due to the subtle inter-class differences and large intra-class differences among sub-categories, fine-grained image classification is more difficult than ordinary image classification. Deep learning has brought tremendous advances to many areas of computer vision [23, 20, 30]; however, its performance on fine-grained classification is not satisfactory, on account of the difficulty of finding informative regions and extracting discriminative features. Classification results are degraded further by varied backgrounds, poses and shooting angles. Previous research [51, 5, 54, 29, 3, 14, 6] on fine-grained classification relies on human annotations. However, annotating fine-grained features is so expensive that these approaches are impractical. Some improvements [55, 56, 57, 48] employing unsupervised learning have been proposed, which can localize informative regions automatically. Unfortunately, most of this work suffers from low accuracy. NTS-net innovatively put forward a multi-agent cooperative learning mechanism, with a Navigator, a Teacher and a Scrutinizer, that increases accuracy. However, we found it hard for NTS-net to accurately detect separate bounding boxes, such as one around a single head.
In this paper, we combine fine-grained image classification with saliency detection to focus more attention on small and distinctive parts. Since saliency detection is a segmentation task with pixel-level annotation, we use a weakly supervised box-level segmentation map instead. In many detection-based works, we found it hard to directly locate the key areas within an object. To solve this problem, we split the process into two stages, using activation-based heatmaps to locate the instance and saliency segmentation to accurately detect informative parts on that instance. We also found previous work, especially fancy attention mechanisms, neither fully satisfying nor explainable. We therefore designed an unambiguous network that shows users whether it is concentrating on the parts they expect.
Our main contributions can be summarized as follows:
We are the first to introduce saliency detection into the fine-grained image classification task, by adding the SPPN (saliency part proposal network) module.
We design a location-heatmap generation mechanism to boost the SPPN module.
We design the R_IOU loss and R_area loss to obtain separate and accurate saliency part proposals, which are assembled into a box-level saliency segmentation mask.
We use a self-supervised mechanism to refine the saliency proposals.
We design a simple and effective dual-attention fusion method, the PAF (part attention filter), which is extremely helpful for the deep bilinear transformation module.
We use a knowledge-distillation-based method to boost both the student and the teacher net.
2 Related Work
2.1 Fine-grained image classification
The difficulty of fine-grained image classification lies in the subtle inter-class differences and large intra-class differences among sub-categories. To successfully distinguish two very similar species, the most important step is to find the informative regions in the image that tell the two species apart.
Spatial Transformer Networks achieve spatial invariance by predicting the locations of informative regions and then warping the image to an ideal position. Bilinear CNN models use two feature extractors that coordinate with each other to detect informative regions and extract features. Another line of work proposes automatic part detection for fine-grained recognition, learning part detectors and part saliency maps. RA-CNN uses mutually reinforcing methods to recursively learn discriminative region attention and region-based feature representation. DVAN improves the diversity of visual attention to extract maximally discriminative features; it consists of four parts: attention region generation, CNN feature extraction, diverse visual attention and classification. FCN attention is a fully convolutional attention network based on reinforcement learning that adaptively selects multi-task-driven attention regions; because it is built on the FCN architecture, it is efficient and can locate multiple object parts and extract features from multiple attention regions at the same time. In HSnet, a Long Short-Term Memory network (LSTM) is unified into a deep recurrent architecture; HSnet (1) generates proposals of informative image parts and (2) integrates all proposals for fine-grained recognition.
2.2 Object detection
Early object detection methods mainly used SIFT or HOG features. Recently, deep-learning-based object detection methods have shown dramatic improvements. Two-stage approaches like R-CNN, OverFeat and SPPnet adopt traditional image-processing methods to generate object proposals, then perform category classification and bounding-box regression. Faster R-CNN first proposed a unified end-to-end framework for object detection, introducing a region proposal network (RPN) that shares its backbone with the detection network and replaces the earlier standalone, time-consuming region proposal methods. On the other hand, one-stage methods, popularized by YOLO [9, 36, 37] and SSD, improve detection speed over Faster R-CNN by employing a single-shot architecture. RetinaNet proposes a new focal loss to address the extreme foreground-background class imbalance, which stands out as a central issue in one-stage detectors. Feature Pyramid Networks (FPN) focus on better handling multi-scale objects and generate anchors from multiple feature maps.
2.3 Saliency Detection
Visual saliency prediction in computer vision aims at estimating the locations in an image that attract human attention. Salient points in an image are understood as the result of a bottom-up process in which a human observer explores the image for a few seconds with no particular task in mind. These data are traditionally measured with eye-trackers, and more recently with mouse clicks or webcams. The salient points over the image are usually aggregated and represented in a saliency map: a single-channel image obtained by convolving each point with a Gaussian kernel. As a result, a gray-scale image or heatmap is generated to represent the probability of each pixel capturing human attention. These saliency maps have been used as soft-attention guides for other computer vision tasks, and also directly in user studies in fields like marketing.
However, it is expensive to acquire pixel-level saliency-map annotations as ground truth. Instead, we adopt a weakly supervised method guided only by image-level classification labels. Prior work has combined object detection with fine-grained image classification to obtain higher accuracy, and early works attempt to use image-level annotations to obtain pixel-level segmentation results. In our work, instead of using fully convolutional neural networks to produce segmentation masks, we use a feature proposal network to perform part-level object detection, which provides box-level saliency information inside an object. We hope to extract distinctive features from those salient areas. Like [7, 34, 21], we use bounding-box information to obtain segmentation masks. Further, we found it feasible to concatenate a saliency detection map with the image as attention guidance.
2.4 Activation Based Heatmap
Recently, several heatmap methods derived from the activation layers of the backbone have been proposed to boost fine-grained image classification. Typically, SCDA is an unsupervised method for locating the main region in fine-grained images without the help of image labels or bounding-box annotations. It takes the activation map after a convolution, ReLU or pooling layer, sums the activations along the depth direction, and computes the mean of the resulting map. Values higher than the mean are kept unchanged, while the others are set to zero. In this way, SCDA generates an activation map that can be coarsely regarded as an object segmentation map. Beyond that, it is also an elegant way to guide the following network as an attention signal.
2.5 Bilinear Pooling
Bilinear pooling was proposed to obtain an invariant and discriminative global representation from convolutional features, and it achieved state-of-the-art results on many fine-grained datasets. However, calculating pairwise interactions between channels causes a high-dimensionality issue, so dimension-reduction methods have been proposed. Specifically, low-rank bilinear pooling reduces feature dimensions before conducting the bilinear transformation, and compact bilinear pooling proposes a sampling-based approximation that reduces feature dimensions by two orders of magnitude without a performance drop. Moreover, feature-matrix normalization has proven important for bilinear features, but we do not use such techniques in our deep bilinear transformation, since computing matrix roots is expensive and impractical to stack deeply in CNNs. Second-order pooling convolutional networks also integrate bilinear interactions into convolutional blocks, but they only use such bilinear features to weight convolutional channels. Deep bilinear transformation uses semantic grouping to select related features and combines them for further bilinear feature transformation.
2.6 Knowledge Distillation
Knowledge distillation is a transfer-learning method that aims to improve the training of a student network by relying on knowledge borrowed from a powerful teacher network. The framework can compress an ensemble of deep networks (the teacher) into a student network of similar depth. To do so, the student is trained to predict the output of the teacher as well as the true classification labels.
However, previous works on this method focus on model compression. In this paper, we are the first to introduce knowledge distillation into fine-grained classification, reducing the discrepancy between the parameter distributions of the teacher net and student net during training.
3.1 Method Overview
Our method consists of two parts: a student net and a teacher net. In the student net, a backbone receives an image as input and serves as the feature extractor. Similar to RetinaNet, we add a feature pyramid network to predict bounding boxes at multiple levels. Inspired by SCDA, we use the activation maps of the last three blocks of the backbone to obtain a location heatmap at the same time. The location heatmap is a coarse localization of the object and has two effects: (1) it judges foreground versus background for the detection task, and (2) it gives the teacher net a coarse attention map for locating the object. Further, the feature pyramid network also drives the region proposal network, where the salient parts of an object are detected as bounding boxes. To obtain accurate informative parts, each bounding box should cover approximately 1/N of the total object area, and the boxes should rarely intersect, since each box should match a distinctive part. To achieve this, we design two losses, the R_area loss and the R_IOU loss. Since the detection lacks ground truth, we apply a self-supervised method using a ranking loss.
After the student net, a location heatmap and a saliency map constructed from the bounding boxes are obtained. We concatenate the two attention maps with the original image and feed the result into the teacher net. First, the two attention maps hint at the location and salient parts of the object. Second, we use the saliency part mask to filter the feature map: discriminative features are highlighted, which greatly helps the semantic grouping and group bilinear stages. After the deep bilinear transformation, the final feature vector is obtained. Since the teacher net outperforms the student net, we use distillation and hint learning to encourage the student net to learn better. The final inference result is an ensemble of the student-net and teacher-net outputs.
3.2 Location Heatmap Module
In our method, we use a location heatmap, similar to the SCDA method, to separate foreground from background. The three ReLU layers corresponding to the last three blocks of the backbone in the student network are aggregated to generate the heatmap. For each, we sum the activation tensor along the depth direction, so the h×w×d 3D tensor becomes an h×w 2D tensor, which we call the "aggregation map" and denote A. We then compute the mean value over all positions of A as the threshold for deciding which positions contain the object: positions whose activation response is higher than the mean indicate the main object and are set to one, and all other positions are set to zero. We concatenate the three aggregation maps and normalize the result. In this way, we obtain a location heatmap that involves not only low-level texture information but also the high-level overall target shape. The location heatmap is used in two places. First, it is fed into the SPPN: since it aggregates the object region, it serves as a coarse target location that guides the SPPN when regressing proposal locations. Second, in the teacher net, the location heatmap is used as input information together with the saliency map.
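The aggregation-and-thresholding step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than our actual implementation: `aggregation_map` and `location_heatmap` are hypothetical helper names, and we assume the three layer activations have already been resized to a common spatial size.

```python
import numpy as np

def aggregation_map(activation):
    """SCDA-style: sum an (h, w, d) activation tensor over depth, then
    binarize at the spatial mean -- positions above the mean are kept."""
    A = activation.sum(axis=-1)                 # (h, w) aggregation map
    return (A > A.mean()).astype(np.float32)

def location_heatmap(activations):
    """Combine the binary masks from several layers (assumed resized to a
    common h x w) and normalize the sum to [0, 1]."""
    masks = np.stack([aggregation_map(a) for a in activations])
    heat = masks.sum(axis=0)
    return heat / max(heat.max(), 1e-8)
```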
3.3 SPPN Module
The saliency part proposal network (SPPN) is a subnet cascaded behind our backbone that predicts the salient parts of an object. Like the widely used region proposal network (RPN) in object detection, our SPPN aims to yield saliency part proposals. Benefiting from the location heatmap, we select as positive samples the anchors that fall inside the activated area of the heatmap; the others are treated as negative samples, for which we skip the subsequent classification and regression. Doing so lowers the learning difficulty for the SPPN.
Similar to RetinaNet, a classification head and a regression head are added at all three levels of the feature pyramid to predict a confidence score and a saliency bounding box for targets of different sizes. We choose the top-N informative regions after NMS and use a self-supervised method to further boost the SPPN. NTS-net uses a similar approach; however, it keeps only the information inside the bounding box and removes everything else. In our view, this amounts to a high-level crop that loses the context. We argue it is more reasonable to preserve the information inside the box while keeping the context at the same time.
In our method, we design a procedure to evaluate the bounding boxes. For the original input image and each bounding box chosen from the N boxes, we first keep the pixels inside that box unchanged; second, we add noise in the areas of the other N−1 boxes; third, we apply a heavy Gaussian blur to all remaining areas. The N processed images obtained this way are fed into our backbone to get N scores, and a ranking loss is applied to realize the self-supervision mechanism.
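The three-step perturbation can be sketched as follows, assuming boxes are given as (x0, y0, x1, y1) pixel coordinates. As a simplification, the heavy Gaussian blur outside the boxes is approximated by replacing those pixels with the global image mean; `perturb_for_box` is a hypothetical name.

```python
import numpy as np

def perturb_for_box(img, boxes, keep_idx, noise_std=0.3, seed=0):
    """Build the self-supervision input for one proposal: everything
    outside all boxes is heavily smoothed (approximated here by the
    global mean), the other N-1 boxes are filled with noise, and the
    kept box is restored unchanged."""
    rng = np.random.default_rng(seed)
    out = np.full_like(img, img.mean())          # stand-in for heavy blur
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        if i != keep_idx:
            out[y0:y1, x0:x1] = rng.normal(
                img.mean(), noise_std, img[y0:y1, x0:x1].shape)
    x0, y0, x1, y1 = boxes[keep_idx]
    out[y0:y1, x0:x1] = img[y0:y1, x0:x1]        # restore kept box last
    return out
```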
3.4 PAF Module
PAF, which stands for part attention filter, is designed to highlight the discriminative parts of an object. First, the saliency part mask is fed into an adaptive layer, which outputs a part attention mask with the same shape as the corresponding feature map. Second, the mask is used as a filter that highlights informative features while ignoring useless areas. Inspired by deep bilinear transformation, we use semantic grouping to gather feature channels that focus on similar areas, and group bilinear pooling to calculate and aggregate intra-group pairwise interactions. The pipeline is shown below:
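The filtering step of PAF can be illustrated as follows. This sketch replaces the learned adaptive layer with a simple nearest-neighbour resize of the binary mask; `part_attention_filter` is a hypothetical name.

```python
import numpy as np

def part_attention_filter(feature, mask):
    """Nearest-neighbour resize of a binary part mask (H, W) to the
    feature map's spatial size, then gate the (h, w, c) features so
    only salient positions pass through."""
    h, w, _ = feature.shape
    ys = np.arange(h) * mask.shape[0] // h
    xs = np.arange(w) * mask.shape[1] // w
    small = mask[ys][:, xs]                      # (h, w) resized mask
    return feature * small[:, :, None]           # broadcast over channels
```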
3.5 Distillation Framework
The distillation framework, borrowed from the knowledge distillation structure, is applied to increase the similarity of the parameter distributions between the teacher net and the student net. In this paper, we introduce intermediate-level hints from the teacher's hidden layers to guide the training process of the student.
The distillation framework can be summarized as follows. First, we define hints as the outputs of hidden layers of the teacher net. Second, a guided layer is selected from the student-net layers to learn from the teacher's hint layer. Third, we add a regressor to the guided layer whose output matches the size of the hint layer. Finally, the soft-target cross-entropy loss is applied at the head of the distillation framework. The loss function of the distillation framework is then defined as
L_distill = L_hint + λ · L_soft,
where L_soft is the soft-target cross-entropy loss and L_hint is the regression loss between the regressor output and the hint.
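A minimal NumPy sketch of this loss, under the assumption that the student features have already passed through the regressor (`guided_feat`); the temperature `T` and weight `lam` are illustrative hyper-parameters, not values from our experiments.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable temperature softmax along the last axis."""
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      guided_feat, hint_feat, T=4.0, lam=0.5):
    """Hint term: L2 between the regressed student (guided) features and
    the teacher's hint features. Soft-target term: cross entropy between
    the temperature-softened teacher and student distributions."""
    l_hint = 0.5 * np.mean((guided_feat - hint_feat) ** 2)
    p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
    l_soft = -np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=-1))
    return l_hint + lam * l_soft
```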
3.6 CSF Module
CSF stands for class score fusion, which is used to ensemble the results from the student net and teacher net. A similar module has been used to fuse the scores of two streams in video classification. Likewise, our teacher-student architecture can be viewed as a two-stream module, so we design a class score fusion module that fuses the result vectors of the two nets with a fully connected layer. A cross-entropy loss is then used to optimize this module.
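A minimal sketch of the fusion step, with the fully connected layer written out as an explicit weight matrix `W` of shape (2C, C) and bias `b` (hypothetical names, standing in for a learned layer):

```python
import numpy as np

def csf_fuse(student_logits, teacher_logits, W, b):
    """Concatenate the two C-dim class-score vectors and fuse them with
    a single fully connected layer (W: (2C, C), b: (C,))."""
    z = np.concatenate([student_logits, teacher_logits], axis=-1)
    return z @ W + b
```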
3.7 Loss function and Optimization
In our model, we design several loss functions to obtain better accuracy. They are described below.
3.7.1 R_IOU Loss
To obtain precise discriminative parts of an object, we try to make the bounding boxes intersect as little as possible; otherwise, more than one bounding box will be predicted for the same part. Inspired by the dice loss, we simply add a hyper-parameter α to control the penalty on the intersection. The formula is shown below:
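A sketch of R_IOU as a pairwise overlap penalty. Our exact formula is dice-inspired; here we simply average the pairwise IoU of the proposals and scale it by α, which captures the intent:

```python
def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def r_iou_loss(boxes, alpha=1.0):
    """Average pairwise IoU over all proposal pairs, scaled by alpha so
    that intersecting proposals are pushed apart."""
    total, n = 0.0, len(boxes)
    for i in range(n):
        for j in range(i + 1, n):
            total += iou(boxes[i], boxes[j])
    return alpha * total / max(1, n * (n - 1) // 2)
```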
3.7.2 R_Area Loss
In our experiments, we find that the R_IOU loss alone is not enough to make the bounding boxes accurately locate single parts of an object, since the boxes are often large and cover areas other than the distinctive part. We therefore argue that an additional constraint reducing the bounding-box area is required. Thanks to the location heatmap, we can coarsely estimate the area S of the whole object. We then set a hyper-parameter N indicating how many parts an object has; N can vary with the type of object. For instance, on the bird dataset CUB-200-2011 we set N=4. The formula is shown below:
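A sketch of R_area, pushing each proposal's area toward S/N, where S is the object area estimated from the location heatmap; the normalization by S is an illustrative choice:

```python
def r_area_loss(boxes, object_area, n_parts=4):
    """Penalize proposals whose area deviates from 1/N of the object
    area S (estimated from the location heatmap)."""
    target = object_area / n_parts
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    return sum(abs(a - target) for a in areas) / (len(areas) * object_area)
```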
3.7.3 Distillation Loss
To let the student and teacher nets collaborate better, we design a distillation loss that combines two parts: hint learning and soft-target learning. The formula is shown below:
3.7.4 Concentrate Loss
For the total of M saliency part proposals, we also introduce a concentrate loss to combine their effects. Technically, each box, as mentioned above, can be converted into a unique image; we feed these M images into our student net (without the SPPN part) again to obtain M logits, each corresponding to class probabilities. We use an ensemble method to obtain the fused logit of the M boxes; in our experiments, we simply average the M logits. The formula is shown below:
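A sketch of the concentrate loss: average the M per-part logit vectors into one fused logit, then apply cross entropy against the image-level label:

```python
import numpy as np

def concentrate_loss(part_logits, label):
    """part_logits: (M, C) array of per-part logits. Average them into
    one fused logit, then compute cross entropy for the true label."""
    fused = np.mean(part_logits, axis=0)            # (C,)
    p = np.exp(fused - fused.max())                 # stable softmax
    p /= p.sum()
    return -np.log(p[label] + 1e-12)
```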
3.7.5 Ranking Loss
Similar to NTS-net, we use a ranking loss to penalize pairs in the wrong order. Let X = {X1, X2, ..., Xn} denote the objects to rank and Y = {Y1, Y2, ..., Yn} their indexing, where Yi < Yj means Xi should be ranked before Xj. We use a pair-wise ranking approach: suppose F(Xi, Xj) takes a value in {0, 1}, where F(Xi, Xj) = 0 means Xi is ranked before Xj. The loss is defined over all pairs, and the goal is to find an optimal F that minimizes the average number of wrongly ordered pairs. The formula is shown below:
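A sketch of the pairwise hinge form commonly used for such ranking losses (the margin value is illustrative): if region i has a higher ground-truth informativeness score than region j, its predicted confidence should exceed j's by at least the margin.

```python
def ranking_loss(info_scores, confidences, margin=1.0):
    """Pairwise hinge ranking over all ordered pairs: whenever region i
    is truly more informative than region j, its predicted confidence
    must beat j's by at least `margin`, else the shortfall is added."""
    loss, pairs = 0.0, 0
    n = len(info_scores)
    for i in range(n):
        for j in range(n):
            if info_scores[i] > info_scores[j]:
                loss += max(0.0, margin - (confidences[i] - confidences[j]))
                pairs += 1
    return loss / max(1, pairs)
```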
3.7.6 Total Loss for DAF-net
The total loss is constructed from three parts: the student net, the teacher net, and the class score fusion (CSF) module.
The student-net loss has five parts; the prediction loss denotes the cross-entropy loss.
For the teacher net and the CSF module, one loss each is used.
We comprehensively evaluate our algorithm on the Caltech-UCSD Birds (CUB-200-2011), Stanford Cars and FGVC Aircraft datasets, which are widely used benchmarks for fine-grained image classification. We do not use any bounding-box/part annotations in any of our experiments. Statistics of all three datasets are shown in Table 1, and we follow the same train/test splits as in the table.
Caltech-UCSD Birds. CUB-200-2011 is a bird classification task with 11,788 images from 200 wild bird species. The ratio of train data and test data is roughly 1 : 1. It is generally considered one of the most competitive datasets since each species has only 30 images for training.
Stanford Cars. Stanford Cars dataset contains 16,185 images over 196 classes, and each class has a roughly 50-50 split. The cars in the images are taken from many angles, and the classes are typically at the level of production year and model (e.g. 2012 Tesla Model S).
FGVC Aircraft. The FGVC Aircraft dataset contains 10,000 images over 100 classes, with a train/test split ratio of around 2:1. The images are of airplanes, and the dataset is organized in a four-level hierarchy, from finer to coarser: Model, Variant, Family, Manufacturer.
|Dataset||#Class||#Train||#Test|
|CUB-200-2011||200||5,994||5,794|
|Stanford Cars||196||8,144||8,041|
|FGVC Aircraft||100||6,667||3,333|
4.2 Implementation Details
In all our experiments, we preprocess images to size 448×448 and fix M=3, meaning 3 regions are used to train the student network for each image (there is no restriction on this hyper-parameter). We use the fully convolutional ResNet-50 as the feature extractor and Batch Normalization as the regularizer. We use momentum SGD with an initial learning rate of 0.01, multiplied by 0.5 after 50 epochs, and a weight decay of 1e-4. The NMS threshold is set to 0.25. Our model is robust to the selection of hyper-parameters.
4.3 Quantitative Results
Overall, our proposed system outperforms all previous methods except WS-DAN. Since we do not use any bounding-box/part annotations, we do not compare with methods that depend on such annotations. Table 2 shows the comparison between our results and the previous best results on CUB-200-2011. Both Inception-V3 and ResNet-50 are strong baselines; however, we found that, combined with the activation-based heatmap, ResNet-50 is the better choice. So in our framework we choose ResNet-50 as the backbone of both the teacher and student nets and achieve 89.1% accuracy.
|Method||Accuracy|
|Our student net||87.6%|
|Our student + teacher net||89.1%|
4.4 Ablation Study
To analyze the influence of the different components in our framework, we design different runs on CUB-200-2011 and report the results in Table 3. The experiments focus on our separate contributions.
-  (2007) An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pp. 473–480. Cited by: §3.5.
-  (2016) SSD: single shot multibox detector. In ECCV, Cited by: §2.2.
-  (2013) Poof: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, Cited by: §1.
-  (2011) Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, Cited by: §2.6.
-  (2014) Bird species categorization using pose normalized deep convolutional nets. In BMVC, Cited by: §1.
-  (2013) Symbiotic segmentation and part localization for fine-grained categorization. In ICCV, pp. 321–328. Cited by: §1.
-  (2015) BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation.. In ICCV, Cited by: §2.3.
-  (2005) Histograms of oriented gradients for human detection. In CVPR, pp. 886–893. Cited by: §2.2.
-  (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §2.2.
-  (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, Cited by: §2.1, §2.3.
-  (2004) Distinctive image features from scale-invariant keypoints. pp. 91–110. Cited by: §2.2.
-  (2016) Compact bilinear pooling. Cited by: §2.5.
-  (2014) Fine-grained categorization by alignments. In ICCV, pp. 1713–1720. Cited by: §1.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587. Cited by: §2.2.
-  (2013) Knowledge matters: importance of prior information for optimization.. In ICLR, Cited by: §3.5.
-  (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. pp. 1904–16. Cited by: §2.2.
-  (2019) Learning deep bilinear transformation for fine-grained image representation. Cited by: §2.5, §3.4.
-  (2015) Spatial transformer networks. In NIPS, pp. 2017–2025. Cited by: §2.1.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.
-  (2017) Simple does it: weakly supervised instance and semantic segmentation.. In CVPR, Cited by: §2.3.
-  (2017) Low-rank bilinear pooling for fine-grained classification. Cited by: §2.5.
-  (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §1.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. Cited by: §3.6.
-  (2017) Fine-grained recognition as hsnet search for informative image parts. In CVPR, Cited by: §2.1.
-  (2015) Bilinear cnn models for fine-grained visual recognition. In ICCV, Cited by: §2.1.
-  (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §2.2.
-  (2018) Focal loss for dense object detection. In TPAMI, Cited by: §3.1.
-  (2012) Dog breed classification using part localization. In ECCV, pp. 172–185. Cited by: §1.
-  (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
-  (2006) Model compression. In KDD, pp. 535–541. Cited by: §2.6.
-  (2015) SalGAN: visual saliency prediction with generative adversarial networks. In Arxiv, Cited by: §2.3.
-  (2015) Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In ICCV, Cited by: §2.3.
-  (2015) From image-level to pixel-level labeling with convolutional networks. In CVPR, Cited by: §2.3.
-  (2017) YOLO9000: better, faster, stronger. In CVPR, Cited by: §2.2.
-  (2018) YOLOv3: an incremental improvement.. Cited by: §2.2.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In IEEE, pp. 1137–1149. Cited by: §2.2.
-  (2017) Selective convolutional descriptor aggregation for fine-grained image retrieval. In IEEE Transactions on Image Processing, Cited by: §2.4, §3.1, §3.2.
-  (2016) Built-in foreground/background prior for weakly-supervised semantic segmentation.. In ECCV, Cited by: §2.3.
-  (2013) Overfeat: integrated recognition, localization and detection using convolutional networks. In Arxiv, Cited by: §2.2.
-  (2016) Distinct class-specific saliency maps for weakly supervised semantic segmentation.. In ECCV, Cited by: §2.3.
-  (2006) A fast learning algorithm for deep belief nets. In Neural Computation, Vol. 18(7), pp. 1527–1554. Cited by: §2.6.
-  Divide the gradient by a running average of its recent magnitude. In Neural Networks for Machine Learning, Cited by: §2.6.
-  (2017) Improved bilinear pooling with cnns. Cited by: §2.5.
-  (2015) Multiple granularity descriptors for fine-grained categorization. In ICCV, pp. 2399–2406. Cited by: §1.
-  (2016) Learning to segment with image-level annotations.. In Pattern Recognition, Cited by: §2.3.
-  (2016) Fully convolutional attention localization networks: efficient attention localization for fine-grained recognition. Cited by: §2.1.
-  (2013) Hierarchical part matching for fine-grained visual categorization. In ICCV, pp. 1641–1648. Cited by: §1.
-  (2017) Focal loss for dense object detection. In IEEE, pp. 2999–3007. Cited by: §2.2.
-  (2018) Learning to navigate for fine-grained classification. Cited by: §1.
-  (2014) Part-based rcnn for fine-grained detection. In ECCV, Cited by: §1.
-  (2016) Picking deep filter responses for fine-grained image recognition. In CVPR, Cited by: §1, §2.1.
-  (2017) Diversified visual attention networks for fine-grained object classification. pp. 1245–1256. Cited by: §1, §2.1.
-  (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, Cited by: §1.
-  (2019) Global second-order pooling convolutional networks. Cited by: §2.5.