Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Image Recognition

by   Xiu-Shen Wei, et al.

Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations. In this paper, we propose a novel end-to-end Mask-CNN model without the fully connected layers for fine-grained recognition. Based on the part annotations of fine-grained images, the proposed model consists of a fully convolutional network to both locate the discriminative parts (e.g., head and torso), and more importantly generate object/part masks for selecting useful and meaningful convolutional descriptors. After that, a four-stream Mask-CNN model is built for aggregating the selected object- and part-level descriptors simultaneously. The proposed Mask-CNN model has the smallest number of parameters, the lowest feature dimensionality and the highest recognition accuracy when compared with state-of-the-art fine-grained approaches.




1 Introduction

Fine-grained recognition tasks, such as identifying the species of a bird, have been popular in computer vision. Since the categories are all similar to each other, they can only be distinguished by slight and subtle differences, which makes fine-grained recognition a challenging problem. Compared to general object recognition tasks, fine-grained recognition benefits more from learning critical parts of the objects, which helps discriminate different subclasses and align objects of the same class Azizpour and Laptev (2012); Huang et al. (2016); Lin et al. (2015a); Zhang et al. (2014a, 2016b).

In the deep learning era, a straightforward way to represent parts is to use deep convolutional features/descriptors. Convolutional descriptors contain more localized (i.e., part) information than the features of the fully connected layers (i.e., whole image). In addition, these deep descriptors are known to correspond to mid-level information, e.g., object parts Zeiler et al. (2013). All the previous part-based fine-grained approaches, e.g., Huang et al. (2016); Lin et al. (2015a); Zhang et al. (2014a, 2016b), directly used the deep convolutional descriptors and encoded them into a single representation, without evaluating the usefulness of the obtained object/part deep descriptors. By using powerful convolutional neural networks Krizhevsky et al. (2012), we may not need to select useful dimensions inside feature vectors, as is done for hand-crafted features Eigenstetter and Ommer (2012); Zhang et al. (2014b). However, since most deep descriptors are not useful or meaningful for fine-grained recognition, it is necessary to select useful deep convolutional descriptors. Recently, selecting deep descriptors has shown its value in the fine-grained image retrieval task Wei et al. (2016). Moreover, it is also beneficial to fine-grained image recognition.

In this paper, by developing a novel deep part detection and descriptor selection scheme, we propose an end-to-end Mask-CNN (M-CNN) model which discards the fully connected layers for fine-grained recognition. We only require the part annotations and image-level labels during training. In M-CNN, given the part annotations, we first separate them into two point sets. One set corresponds to the head part of the fine-grained bird image, and the other is for the torso. Then, the smallest convex polygons that cover each point set are returned as the ground-truth masks, as shown in Fig. 1. The other pixels are background. By treating part localization as a three-class segmentation task, we leverage fully convolutional networks (FCN) Long et al. (2015) to generate masks at test time for both localizing parts and selecting useful deep descriptors, which does not use any annotation during testing. After getting these two part masks, we combine them to form the object mask. Based on these object/part masks, a four-stream Mask-CNN (image, head, torso, object) is built for joint training and aggregating the object-level and part-level cues simultaneously. The architecture of the proposed four-stream M-CNN is shown in Fig. 2. In each stream of M-CNN, we discard the fully connected layers of CNNs. In the last convolutional layer, an input image is represented by multiple deep descriptors. In order to keep only those descriptors corresponding to the object, the object/part masks pre-learned by FCN are used. After that, the selected descriptors of each stream are both average pooled and max pooled into 512-d feature vectors, followed by the standard $\ell_2$-normalization. Finally, the feature vectors of these four streams are concatenated, and then a classification (fc+softmax) layer is added for end-to-end joint training.

(a) Part annotations
(b) Part polygons
Figure 1: We generate the convex polygons (in (b)) for the bird’s head and torso based on the part annotations (red, blue and yellow dots in (a)). Other pixels are treated as background. The two yellow part key points (i.e., nape and throat) are included in both the head and the torso. (Best viewed in color.)
Figure 2: Architecture of the proposed four-stream Mask-CNN. The four streams correspond to the whole image, head, torso and object images/patches, respectively. Note that we removed the fully connected layers. As illustrated in this figure, thanks to the descriptor selection scheme, a large number of descriptors corresponding to background can be discarded by M-CNN, which is beneficial to fine-grained recognition. (This figure is best viewed in color.)

We validate the proposed four-stream M-CNN on the popular Caltech-UCSD Birds-200-2011 Wah et al. (2011) dataset, in which we achieved 85.5% classification accuracy. We also get accurate part localization (84.62% for head and 89.83% for torso). The key advantages and major contributions of the proposed M-CNN model are:


  • To the best of our knowledge, Mask-CNN is the first end-to-end model that selects deep convolutional descriptors for object recognition, especially for fine-grained image recognition.
  • We present a novel and efficient part-based four-stream model for fine-grained recognition. Because we discard the fully connected layers, the proposed M-CNN is efficient in both computation and storage. Compared with state-of-the-art methods, M-CNN has the fewest parameters and the smallest feature dimensionality (60.49M and 8,192-d, respectively). At the same time, it achieves 85.4% classification accuracy on CUB200-2011, which is the highest among existing methods. With the SVD whitening method, our feature representation can be compressed to 4,096-d while improving the accuracy to 85.5%.
  • The part localization performance of the proposed model outperforms other part-based fine-grained approaches which require additional bounding boxes. In particular, M-CNN is about 10% higher than the state-of-the-art for head localization.

2 Related Work

Fine-grained recognition is a challenging problem and has recently emerged as a hot topic. During the past few years, a number of effective fine-grained recognition methods have been developed in the literature Huang et al. (2016); Jaderberg et al. (2015); Lin et al. (2015a, b); Zhang et al. (2014a, 2016b). We can roughly categorize these methods into three groups. The first group, e.g., Jaderberg et al. (2015); Lin et al. (2015b), attempted to learn a more discriminative feature representation by developing powerful deep models for classifying fine-grained images. The second group aligned the objects in fine-grained images to eliminate pose variations and the influence of camera position, e.g., Branson et al. (2014); Gavves et al. (2014); Lin et al. (2015a). The last group focused on part-based representations, because it is widely acknowledged that the subtle difference between fine-grained images mostly resides in the unique properties of object parts.

For the part-based fine-grained recognition methods, Azizpour and Laptev (2012); Lin et al. (2015a); Zhang et al. (2014a) used both bounding boxes of the birds and part annotations during training to learn an accurate part localization model. Then, based on these detected parts, different CNNs are fine-tuned using the detected parts separately. To ensure satisfactory localization results, they even used bounding boxes in the testing phase. In contrast, our method only needs part annotations for training, and does not need any supervision during testing. Moreover, our four-stream M-CNN is a unified framework for capturing object- and part-level information simultaneously. Some other part-based methods considered a weakly supervised setting, in which they categorize fine-grained images with only image-level labels, e.g., Simon and Rodner (2015); Xiao et al. (2015); Zhang et al. (2016a, b). As will be shown by our experiments, the classification accuracy of M-CNN is significantly higher than that of these weakly supervised methods. Meanwhile, the model size of M-CNN is the smallest among all state-of-the-art methods, which makes it efficient to train.

Besides, there are also fine-grained recognition methods based on segmentation, e.g., Huang et al. (2016); Krause et al. (2015). The most significant difference between them and M-CNN is that these methods only use segmentation to localize the whole object Krause et al. (2015) or parts Huang et al. (2016), while we further select useful deep convolutional descriptors using the masks from segmentation. Among them, the part-stacked CNN model Huang et al. (2016) is the work most closely related to ours. In Huang et al. (2016), part-stacked CNN requires both bounding box and part annotations in training, and even needs the bounding boxes during testing. Within the image patch cropped using the bounding box, Huang et al. (2016) treated the image crops around each of the fifteen part key points as fifteen segmentation foreground classes, and used FCN to solve the resulting 16-class segmentation task. After obtaining the trained FCN, it localized these part point positions in the last convolutional layer. Then, deep activations corresponding to the fifteen parts and the whole object were stacked together. Fully connected layers were used for classification. Compared with part-stacked CNN, M-CNN only needs to localize two main parts (head and torso), which makes the segmentation problem much easier and the predictions more accurate. M-CNN achieves high localization accuracy, as will be shown in Table 3. Meanwhile, as demonstrated in Huang et al. (2016), using all the fifteen part activations does not lead to better classification accuracy. Besides, M-CNN’s accuracy on CUB200-2011 is 1.8% higher than that of Huang et al. (2016) using the same baseline network, although we use fewer annotations in training and do not use any annotation in testing (cf. Sec. 4.2.2).

3 The Mask-CNN Model

In this section, we present the proposed four-stream Mask-CNN (M-CNN) model. Firstly, we adopt a fully convolutional network (FCN) Long et al. (2015) to generate the object/part masks for locating object/parts, and more importantly selecting deep descriptors. Then, based on these masks, the four-stream M-CNN is built for joint training and capturing both object- and part-level information.

3.1 Learning Object and Part Masks

The fully convolutional network (FCN) Long et al. (2015) is designed for pixel-wise labeling. FCN can take an input image with any resolution and produce an output of corresponding dimensions. In our method, we use FCN to not only localize the object and parts in fine-grained images, but also treat the segmentation predictions as the object and parts masks for the later descriptor selection process.

Each fine-grained image in the CUB200-2011 Wah et al. (2011) dataset is annotated with part annotations, i.e., fifteen part key points. As shown in Fig. 1, we split these key points into two sets, including the head key points (i.e., the beak, forehead, crown, left eye, right eye, nape and throat) and torso key points (i.e., the back, breast, belly, left leg, right leg, left wing, nape, right wing, tail and throat). Based on the key points, two ground-truth part masks are generated. One is the head mask, which corresponds to the smallest convex polygon covering all the head key points. The other is the torso mask, which is the smallest convex polygon covering the torso key points. In Fig. 1, the red polygon is the head mask, and the blue one is the torso mask. The rest of the image is background. Therefore, we model the part mask learning procedure as a three-class segmentation problem. For effective training, all the training and testing fine-grained images are kept at their original resolutions. Then, we crop an image patch in the middle of the original image as the input. The mask learning network architecture is shown in Fig. 3. In our experiments, we adopted FCN-8s Long et al. (2015) for learning and predicting part masks.
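The ground-truth generation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the label encoding (0 = background, 1 = head, 2 = torso) and the rule that head pixels override torso pixels where the two polygons overlap are all assumptions made for the example. Key points are given as (x, y) pairs with x as the column and y as the row.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain. Returns hull vertices in a consistent
    orientation such that interior points have a non-negative cross
    product with every edge."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_mask(hull, height, width):
    """Boolean mask of pixels lying inside or on the convex polygon."""
    if len(hull) < 3:
        # Degenerate polygon (fewer than three distinct vertices): empty mask.
        return np.zeros((height, width), dtype=bool)
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.ones((height, width), dtype=bool)
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        # A pixel is kept only if it is on the inner side of every edge.
        inside &= (x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0) >= 0
    return inside

def part_label_map(head_pts, torso_pts, height, width):
    """Three-class label map: 0 = background, 1 = head, 2 = torso.
    Assumption: head overrides torso where the polygons overlap."""
    labels = np.zeros((height, width), dtype=np.uint8)
    labels[polygon_mask(convex_hull(torso_pts), height, width)] = 2
    labels[polygon_mask(convex_hull(head_pts), height, width)] = 1
    return labels
```

Since the nape and throat key points belong to both sets, the two polygons generally touch or overlap, so some tie-breaking rule of this kind is needed when turning them into a single label map.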

Figure 3: Demonstration of the mask learning procedure by FCN Long et al. (2015). (Best viewed in color.)

During the FCN inference, without using any annotation, three class heat maps (in the same size as the original input image) are returned for each image. We randomly choose some qualitative examples of the predicted part masks, and show them in Fig. 4. In these figures, the learned masks are overlaid onto the original images. The head part is highlighted in red, and the torso is in blue. The predicted background pixels are in black. As can be seen from these figures, even though the ground-truth part masks are not very accurate, the learned FCN model is able to return more accurate part masks. Meanwhile, these part masks can also localize the part positions. Quantitative results of part localization and object segmentation will be reported in Sec. 4.3 and Sec. 4.4, respectively.

Both part masks, if accurately predicted, will benefit the later deep descriptor selection process and the final fine-grained classification. Therefore, during both training and testing, we will use the predicted masks for both part localization and descriptor selection in M-CNN. We also combine the two masks to form a mask for the whole object, which is called the object mask.
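As a rough sketch of how the three per-class heat maps could be turned into the head, torso and object masks, one can take the per-pixel argmax and form the object mask as the union of the two part masks. The channel order (background, head, torso) and the function name are assumptions made for this example.

```python
import numpy as np

def masks_from_heatmaps(heatmaps):
    """heatmaps: array of shape (3, H, W) with per-class scores.
    Assumed channel order: 0 = background, 1 = head, 2 = torso."""
    labels = np.argmax(heatmaps, axis=0)   # per-pixel predicted class
    head = labels == 1
    torso = labels == 2
    obj = head | torso                     # object mask = union of both parts
    return head, torso, obj
```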

Figure 4: Sixteen random samples of predicted part masks from the testing set. In these figures, we overlay the part mask predicted by FCN (the head highlighted in red and the torso in blue) onto the original images. The pixels predicted as background are in black. (Best viewed in color.)

3.2 Training Mask-CNN

After obtaining the object and part masks, we build the four-stream M-CNN for joint training. The overall architecture of the proposed model is presented in Fig. 2. We take the whole image stream as an example to illustrate the pipeline of each stream in M-CNN.

The inputs of the whole image stream are the original images resized to $h \times h$. In our experiments, we report the results for $h = 224$ and $h = 448$, respectively. The input images are fed into a traditional convolutional neural network, but the fully connected layers are discarded. That is to say, the CNN model used in our proposed M-CNN only contains convolutional, ReLU and pooling layers, which greatly reduces the M-CNN model size. Specifically, we use VGG-16 Simonyan and Zisserman (2015) as the baseline model, and the layers up to and including pool5 are kept. We obtain a $7 \times 7 \times 512$ activation tensor in pool5 if the input image is $224 \times 224$. Therefore, we have 49 deep convolutional descriptors of 512-d, which also correspond to $7 \times 7$ spatial positions in the input images. Then, the learned object mask (cf. Sec. 3.1) is first resized to $7 \times 7$ by nearest-neighbor interpolation, and then used for selecting useful and meaningful deep descriptors. As illustrated in Fig. 2 (c) and (d), a descriptor is kept when it is located in the object region. If it is located in the background region, that descriptor is discarded. In our implementation, the mask is a binary matrix, in which 1 stands for keeping and 0 for discarding. We implement the selection process as an element-wise product between the convolutional activation tensor and the mask matrix, which is similar to the element-wise sum operation in FCN Long et al. (2015). Therefore, the descriptors located in the object region will remain, while the other descriptors will become zero vectors.
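The selection step can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper names are hypothetical, the activation tensor is assumed to be laid out as (height, width, channels), and the nearest-neighbor resize samples the mask at the centre of each output cell, which is one of several reasonable conventions.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2-D mask, sampling at output cell centres."""
    in_h, in_w = mask.shape
    rows = ((np.arange(out_h) + 0.5) * in_h / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * in_w / out_w).astype(int)
    rows = np.minimum(rows, in_h - 1)
    cols = np.minimum(cols, in_w - 1)
    return mask[np.ix_(rows, cols)]

def select_descriptors(activations, mask):
    """activations: (H, W, C) tensor, e.g. 7 x 7 x 512 for a 224 x 224 input.
    mask: binary image-sized matrix (1 = keep, 0 = discard).
    Descriptors at discarded positions become zero vectors."""
    h, w, _ = activations.shape
    small = resize_nearest(mask, h, w).astype(activations.dtype)
    return activations * small[:, :, None]   # element-wise product, broadcast over channels
```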

For these selected descriptors, in the end-to-end M-CNN learning process, we both average pool and max pool them into two 512-d feature vectors. Then, each of them is $\ell_2$-normalized. After that, we concatenate them into a 1,024-d feature as the final representation of the whole image stream.
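A minimal sketch of this aggregation step is given below, assuming the masked activations and the resized boolean mask are available as NumPy arrays. Pooling only over the kept positions and returning a zero vector when nothing is kept are assumptions of this example, not details stated in the text.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)

def pool_stream(selected, keep):
    """selected: (H, W, C) activations after descriptor selection.
    keep: (H, W) boolean mask of retained positions.
    Returns the 2C-d stream feature: [l2(avg pool), l2(max pool)]."""
    kept = selected[keep]                       # (N, C) retained descriptors
    channels = selected.shape[-1]
    if kept.shape[0] == 0:
        # Assumption: fall back to a zero feature if the mask is empty.
        return np.zeros(2 * channels, dtype=selected.dtype)
    avg = l2_normalize(kept.mean(axis=0))
    mx = l2_normalize(kept.max(axis=0))
    return np.concatenate([avg, mx])            # 512 + 512 = 1,024-d for VGG-16 pool5
```

Because the normalization follows the pooling, averaging over the kept positions or over all positions of the masked tensor differs only by a scale factor and yields the same normalized vector; for non-negative activations the max is likewise unaffected by the zeroed positions.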

The streams for head and torso follow processing steps similar to those of the whole image stream. However, their input images are generated differently. After obtaining the two part masks (i.e., the head and torso masks), we use the part masks as part detectors to localize the head and torso in the input images. For each part, we return the smallest rectangular bounding box that contains the part mask region. Based on this bounding box, we crop the image patch that acts as the input of the part stream. The two streams in the middle of Fig. 2 show the head and torso streams in M-CNN. The last stream is the object stream, which crops the image patch by combining the two part masks into an object mask. Thus, its inputs are the main object (i.e., bird) detected by our FCN segmentation network. The inputs of these three streams are all resized to $224 \times 224$ in our experiments.
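The cropping step can be illustrated as follows. The function names are hypothetical, and returning the uncropped image when a predicted mask is empty is an assumption of this sketch; the text does not say how that case is handled.

```python
import numpy as np

def mask_bounding_box(mask):
    """Smallest axis-aligned box containing all non-zero mask pixels.
    Returns (top, bottom, left, right) with exclusive bottom/right, or None."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

def crop_part(image, mask):
    """Crop the image patch covered by the part (or object) mask."""
    box = mask_bounding_box(mask)
    if box is None:
        return image            # assumption: fall back to the full image
    top, bottom, left, right = box
    return image[top:bottom, left:right]
```

The object-stream input follows from the same function by passing the union of the head and torso masks.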

In the classification step shown in Fig. 2 (f), the final 4,096-d image representation is the concatenation of the whole image, head, torso and object features. The last layer of M-CNN is a 200-way classification (fc+softmax) layer for the CUB200-2011 dataset. The four-stream M-CNN is learned end-to-end, with the parameters of the four CNNs learned simultaneously. While training M-CNN, the parameters of the learned FCN segmentation network are fixed.
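The forward pass of this classification step can be sketched as below: four 1,024-d stream features are concatenated into a 4,096-d vector and passed through a single fully connected layer followed by softmax. The function names and the weight layout (classes by feature dimensions) are assumptions of the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(stream_features, weights, bias):
    """stream_features: list of four 1,024-d vectors (image, head, torso, object).
    weights: (num_classes, 4096) matrix; bias: (num_classes,) vector."""
    feat = np.concatenate(stream_features)      # 4 x 1,024 = 4,096-d
    return softmax(weights @ feat + bias)       # class posterior, e.g. 200-way
```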

4 Experiments

In this section, we firstly describe the experimental settings and implementation details. Then, we report the classification accuracy and present discussions about the proposed M-CNN model. Finally, the performance of part localization and object segmentation will also be provided.

4.1 Dataset and Implementation Details

The empirical evaluation is performed on the widely-used fine-grained benchmark Caltech-UCSD 2011 bird dataset Wah et al. (2011). This dataset contains 200 bird categories, and each category has roughly 30 training images. We follow the training and testing splitting included with the dataset. In the training phase, the fifteen part annotations are adopted for generating the part masks’ ground-truth, and meanwhile the image-level labels are used for the end-to-end M-CNN joint training. We need no supervision signals (e.g., part annotations or bounding boxes) when testing.

The proposed Mask-CNN model and the FCN used for generating masks are implemented using the open-source library MatConvNet Vedaldi and Lenc (2014). In our experiments, after getting the learned part masks, we first generate the image patches of the birds’ head, torso and object as described in Sec. 3.2. Then, to facilitate the convergence of the four-stream CNNs, each single stream corresponding to the whole image, head, torso and object is fine-tuned on its input images separately. The CNNs used in each stream are initialized by the popular VGG-16 model Simonyan and Zisserman (2015) pre-trained on ImageNet. In addition, we double the training data by horizontal flipping for all four streams. After fine-tuning each stream, the joint training of the four-stream M-CNN is performed, as shown in Fig. 2. Dropout is not used in M-CNN. At test time, we average the predictions of the image and its flipped copy, and output the class with the highest score as the prediction for a test image. In addition, directly using the softmax predictions results in a slight drop in accuracy compared to logistic regression (LR), which is consistent with the observations in Lin et al. (2015b). Therefore, in the following, the reported results of M-CNN are all achieved by one-vs-all logistic regression Fan et al. (2008) on the extracted features (4,096-d) with the default hyper-parameter setting.
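The test-time flip averaging mentioned above can be sketched as follows. Here `score_fn` is a hypothetical stand-in for whatever maps an image to a vector of class scores (the network plus classifier); it is not part of any library, and the image is assumed to be laid out as (height, width, ...).

```python
import numpy as np

def predict_with_flip(image, score_fn):
    """Average class scores of an image and its horizontal mirror,
    then return the index of the highest-scoring class."""
    scores = score_fn(image)
    flipped_scores = score_fn(image[:, ::-1])   # mirror along the width axis
    return int(np.argmax((scores + flipped_scores) / 2.0))
```

For instance, with a toy scorer that is sensitive to which side of the image carries the signal, the averaged prediction can differ from the single-view one:

```python
toy = lambda img: np.array([img[:, 0].sum(), img[:, -1].sum(), 0.6 * img.sum()])
image = np.array([[1.0, 0.0], [1.0, 0.0]])
single_view = int(np.argmax(toy(image)))        # class 0
averaged = predict_with_flip(image, toy)        # class 2
```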

4.2 Classification Accuracy and Comparisons

We report the classification accuracy on the CUB200-2011 dataset of the proposed four-stream M-CNN model, and compare with the baseline methods and state-of-the-art methods in the literature.

4.2.1 Baseline Methods

In order to validate the effectiveness of the descriptor selection process in M-CNN, we perform two baseline methods which are also based on the proposed four-stream architecture. Different from our M-CNN, these two baseline methods do not contain the descriptor selection part, i.e., the processing shown in Fig. 2 (d).

The first baseline method employs the traditional fully connected layers to conduct classification for each stream, and is called “4-stream FCs”. In “4-stream FCs”, we replace the (b) to (e) parts of each stream in Fig. 2 with a CNN containing fully connected layers (i.e., VGG-16 with only fc8 removed). Thus, the feature generated in the last layer of each stream is a single 4,096-d vector. The remaining procedure is to concatenate the four 4,096-d features into a final 16,384-d one, and to learn a 200-way classification (fc+softmax) layer on the 16,384-d image representation.

The second baseline is similar to the proposed M-CNN. The most prominent difference is that it discards the descriptor selection part, i.e., the processing in Fig. 2 (d). Thus, the convolutional deep descriptors of pool5 in each stream are directly average pooled and max pooled, and then $\ell_2$-normalized, respectively. Therefore, we call it “4-stream Pooling”. The remaining procedures are the same as in the proposed M-CNN.

Table 1 presents the comparison of classification accuracy on the CUB200-2011 dataset, where the input images of the whole image stream are $224 \times 224$. The proposed M-CNN achieves the best classification accuracy. Without descriptor selection, “4-stream Pooling” is about 1% lower than M-CNN. The “4-stream FCs” baseline has the lowest accuracy, which might be caused by overfitting in the fully connected layers.

4-stream FCs | 4-stream Pooling | The proposed 4-stream M-CNN
81.1% | 82.2% | 83.1%
Table 1: Comparison with the baseline methods on CUB200-2011.

4.2.2 Comparisons with state-of-the-art methods

The classification accuracies of the proposed four-stream M-CNN and state-of-the-art methods on CUB200-2011 are presented in Table 2. For fair comparison, we only report results obtained without part annotations in testing.

As aforementioned, when all the inputs are of size $224 \times 224$, the accuracy of the proposed four-stream M-CNN model is 83.1%. Following Lin et al. (2015b), we change the input images of the whole image stream to $448 \times 448$ pixels, which improves the classification performance by 2.1%. We also tried resizing the input images of the object stream to $448 \times 448$, but the accuracy was slightly lower than before.

Moreover, as the ensemble of multiple layers can boost the final performance Long et al. (2015); Wei et al. (2016), after joint training, we also extract the deep descriptors from the layer that is three layers in front of pool5. Then, the predicted part masks are also used to select the corresponding descriptors of the four streams. Similar to the pooling and concatenation processes done for pool5, we can obtain another 4,096-d image representation from this layer. After that, we combine it with the pool5 one into an 8,192-d feature vector (called “4-stream M-CNN” in Table 2), which achieves the best classification accuracy of 85.4% on CUB200-2011. Additionally, we compress the 8,192-d feature vector to 4,096-d by SVD whitening, which reduces the dimensionality and meanwhile improves the accuracy to 85.5%.
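The text does not spell out which SVD whitening variant is used, so the following is only one plausible reading: centre the training features, take the top right singular vectors, and rescale each projected dimension to unit variance. Function names and the `eps` constant are assumptions of this sketch; `out_dim` must not exceed the smaller of the number of training samples and the feature dimension.

```python
import numpy as np

def fit_svd_whitening(features, out_dim, eps=1e-8):
    """features: (N, D) matrix of training features (one row per image).
    Returns the mean and a (D, out_dim) projection that reduces the
    dimensionality and whitens the retained components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    std = s[:out_dim] / np.sqrt(max(features.shape[0] - 1, 1))
    projection = vt[:out_dim].T / (std + eps)
    return mean, projection

def apply_svd_whitening(features, mean, projection):
    """Project (N, D) features to (N, out_dim) whitened features."""
    return (features - mean) @ projection
```

In the setting described above, `features` would hold the 8,192-d training representations and `out_dim` would be 4,096; the same `mean` and `projection` are then applied to test features before classification.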

In addition, because part-stacked CNN Huang et al. (2016) used the Alex-Net model Krizhevsky et al. (2012), we also build another four-stream M-CNN based on Alex-Net. The accuracy of our four-stream M-CNN (Alex-Net) is 78.0%, which is 1.8% higher than that of Huang et al. (2016). Moreover, in the Alex-Net based four-stream M-CNN, the number of parameters is only 9.74M, and the final feature vector is only 2,048-dimensional.

Method | Model | Parameters | Dim. | Acc.
Part-Stacked CNN Huang et al. (2016) | Part-Stacked CNN | 130.80M | 4,096 | 76.2%
PB R-CNN with BBox Zhang et al. (2014a) | Alex-Net | 173.03M | 12,288 | 76.4%
Deep LAC Lin et al. (2015a) | Alex-Net | 173.03M | 12,288 | 80.3%
PB R-CNN Zhang et al. (2014a) | Alex-Net | 173.03M | 12,288 | 73.9%
Pose Normalized CNNs Branson et al. (2014) | Alex-Net | 173.03M | 13,512 | 75.7%
Co-Segmentation Krause et al. (2015) | VGG-19 | 287.30M | 126,976 | 82.0%
Two-Level Xiao et al. (2015) | VGG-16 | 138.35M | 16,384 | 77.9%
Weakly supervised FG Zhang et al. (2016b) | VGG-16 | 138.35M | 262,144 | 79.3%
Constellations Simon and Rodner (2015) | VGG-19 | 143.65M | 208,896 | 81.0%
Bilinear Lin et al. (2015b) | VGG-16 and VGG-M | 73.67M | 262,144 | 84.1%
Spatial Transformer CNN Jaderberg et al. (2015) | ST-CNN (inception) | 62.68M | 4,096 | 84.1%
PDFS Zhang et al. (2016a) | VGG-16 | 138.35M | 69,632 | 84.5%
Our 4-stream M-CNN (224) | VGG-16 (w.o. FCs) | 59.67M | 4,096 | 83.1%
Our 4-stream M-CNN (448) | VGG-16 (w.o. FCs) | 59.67M | 4,096 | 85.2%
Our 4-stream M-CNN (448) | VGG-16 (w.o. FCs) | 60.49M | 8,192 | 85.4%
Table 2: Comparison of classification accuracy on CUB200-2011 with state-of-the-art methods.
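As a sanity check on the model sizes reported for M-CNN in Table 2, the parameter count can be estimated from the standard VGG-16 convolutional configuration (thirteen 3 × 3 convolutional layers) plus one 200-way fully connected classifier. The sketch assumes four independent convolutional backbones and counts biases; these assumptions, and the helper names, are ours rather than the paper's.

```python
# (input channels, output channels) of the thirteen 3x3 conv layers in VGG-16
VGG16_CONV = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one convolutional layer."""
    return c_in * c_out * kernel * kernel + c_out

def mcnn_params(feature_dim, num_streams=4, num_classes=200):
    """Estimated parameter count: one conv backbone per stream plus the classifier."""
    backbone = sum(conv_params(i, o) for i, o in VGG16_CONV)
    classifier = feature_dim * num_classes + num_classes
    return num_streams * backbone + classifier

print(mcnn_params(4096))   # 59678152
print(mcnn_params(8192))   # 60497352
```

Under these assumptions the estimates come to roughly 59.68M and 60.50M, which is within about 0.01M of the 59.67M and 60.49M listed in the table.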

4.3 Part Localization Results

In addition to the qualitative part localization results shown in Sec. 3.1, in this section we quantitatively assess the localization correctness using the Percentage of Correctly Localized Parts (PCP) metric. As reported in Table 3, the metric is the percentage of parts (i.e., the head and torso) that are correctly localized under a 50% IOU threshold with the ground-truth part bounding boxes, which are generated as in Lin et al. (2015a); Zhang et al. (2014a). Comparing the PCP results for the torso, our method outperforms part-based R-CNN Zhang et al. (2014a) and strong DPM Azizpour and Laptev (2012) by a large margin. However, because we do not use any supervision in testing, our torso localization performance is lower than that of Deep LAC Lin et al. (2015a), which used the bounding boxes during testing. In addition, for the head localization task, which is more challenging than the torso one, even though our method only uses part annotations in training, the head localization performance (84.62%) is still significantly higher than that of the other methods.

Method | Head | Torso
Strong DPM Azizpour and Laptev (2012) | 43.49% | 75.15%
Part-based R-CNN with BBox Zhang et al. (2014a) | 68.19% | 79.82%
Deep LAC Lin et al. (2015a) | 74.00% | 96.00%
Part-based R-CNN Zhang et al. (2014a) | 61.42% | 70.68%
Ours | 84.62% | 89.83%
Table 3: Comparison of part localization performance on the CUB200-2011 dataset.
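The PCP metric used in Table 3 can be sketched as below. Boxes are assumed to be given as (left, top, right, bottom) with exclusive right and bottom edges, and a part is counted as correct when its IOU with the ground truth is at least the threshold; whether the comparison is strict or non-strict is an assumption here.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (left, top, right, bottom)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def pcp(predicted_boxes, ground_truth_boxes, threshold=0.5):
    """Fraction of parts whose predicted box overlaps the ground truth enough."""
    correct = sum(
        box_iou(p, g) >= threshold
        for p, g in zip(predicted_boxes, ground_truth_boxes)
    )
    return correct / len(ground_truth_boxes)
```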

4.4 Object Segmentation Performance

Because the CUB200-2011 dataset also supplies the object segmentation ground-truth, we can directly evaluate the learned object masks with a segmentation metric. Fig. 5 shows qualitative segmentation results. Our method based on FCN is generally able to segment the foreground object well, but understandably has trouble segmenting the finer parts of the birds, e.g., claws and beak. Since our goal is not to segment objects, we do not perform any refinement as pre-processing or post-processing. Moreover, we evaluate the segmentation performance quantitatively by the common semantic segmentation metric mean IU (pixel accuracy and region intersection over union) of the ground-truth foreground object with the predicted object masks. It is 72.41% on the testing set. In fact, a better segmentation result will lead to better predicted object/part masks, and also benefit the final classification. To further improve the classification accuracy, some pre-processing methods, e.g., GrabCut Rother et al. (2004), are worth trying to obtain better mask ground-truth than the convex polygons in Fig. 3 (c).

Figure 5: Examples of segmentation results. The first row is the original fine-grained images. The second row is the corresponding segmentation ground-truth. The last row is the predicted results. Note that although the segmentation ground-truth only annotates one bird, there are two birds in the first image and M-CNN correctly finds both.
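For reference, a mean IU computation in the style commonly used for semantic segmentation is sketched below: the per-class intersection over union is averaged over the classes that occur. Whether the 72.41% figure averages over background and foreground or reports the foreground class alone is not stated, so the two-class default here is an assumption.

```python
import numpy as np

def mean_iu(pred, gt, num_classes=2):
    """pred, gt: integer label maps of equal shape (0 = background, 1 = object).
    Returns the mean of per-class intersection over union."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent from both maps
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```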

5 Conclusion

In this paper, we presented the benefits of selecting deep convolutional descriptors in object recognition, especially fine-grained image recognition. By developing the descriptor selection scheme, we proposed a novel end-to-end Mask-CNN (M-CNN) model without the fully connected layers to not only accurately localize object/parts, but also generate object/part masks for selecting deep convolutional descriptors. After aggregating the selected descriptors, the object-level and part-level cues were encoded by the proposed four-stream M-CNN model. Mask-CNN not only achieved 85.5% classification accuracy on CUB200-2011, but also had the fewest parameters and the lowest-dimensional feature representations.

In the future, we plan to solve the part detection problem of M-CNN in the weakly supervised setting, in which we only require the image-level labels. Thus, it will require far less labeling effort to achieve comparable classification accuracy. In addition, another interesting direction is to explore the benefits of descriptor selection for general object categorization.


  • Azizpour and Laptev (2012) H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In ECCV 2012, Part I, LNCS 7572, pages 836–849, 2012.
  • Branson et al. (2014) S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. In BMVC, pages 1–14, 2014.
  • Eigenstetter and Ommer (2012) A. Eigenstetter and B. Ommer. Visual recognition using embedded feature selection for curvature self-similarity. In NIPS, pages 377–385, 2012.
  • Fan et al. (2008) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
  • Gavves et al. (2014) E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Local alignments for fine-grained categorization. IJCV, 111(2):191–212, 2014.
  • Huang et al. (2016) S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked CNN for fine-grained visual categorization. In CVPR, 2016.
  • Jaderberg et al. (2015) M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2008–2016, 2015.
  • Krause et al. (2015) J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In CVPR, pages 5546–5555, 2015.
  • Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  • Lin et al. (2015a) D. Lin, X. Shen, C. Lu, and J. Jia. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In CVPR, pages 1666–1674, 2015.
  • Lin et al. (2015b) T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.
  • Long et al. (2015) J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
  • Rother et al. (2004) C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM TOG, 23:309–314, 2004.
  • Simon and Rodner (2015) M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, pages 1143–1151, 2015.
  • Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, pages 1–14, 2015.
  • Vedaldi and Lenc (2014) A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In ACM MM, pages 689–692, 2014.
  • Wah et al. (2011) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD birds-200-2011 dataset. Tech. Report CNS-TR-2011-001, 2011.
  • Wei et al. (2016) X.-S. Wei, J.-H. Luo, and J. Wu. Selective convolutional descriptor aggregation for fine-grained image retrieval. In arXiv: 1604.04994, pages 1–16, 2016.
  • Xiao et al. (2015) T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, pages 842–850, 2015.
  • Zeiler et al. (2013) M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, pages 2018–2025, 2013.
  • Zhang et al. (2014a) N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV 2014, Part I, LNCS 8689, pages 834–849, 2014.
  • Zhang et al. (2014b) Y. Zhang, J. Wu, and J. Cai. Compact representation for image classification: To choose or to compress? In CVPR, pages 907–914, 2014.
  • Zhang et al. (2016a) X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In CVPR, 2016.
  • Zhang et al. (2016b) Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do. Weakly supervised fine-grained categorization with part-based image representation. TIP, 25(4):1713–1725, 2016.