Object localization and detection is a fundamental computer vision problem, which aims to discover and locate objects of interest within an image. Benefiting from the learning capability of deep convolutional neural networks and large-scale object bounding box annotations, object localization has achieved remarkable performance [19, 22]. However, manually annotating bounding boxes on large-scale datasets is expensive and labor-intensive. This motivates the development of weakly supervised object localization methods, which use only image-level annotations to localize objects. Most existing weakly supervised object localization methods locate objects by training a classification convolutional neural network with image-level annotations [38, 6, 36].
Recently, some works [26, 14, 3, 17, 31] have shifted their attention to the image co-localization problem, which assumes less supervision and requires an image collection containing a single object class. However, most of these efforts first need to generate a large number of object proposals. This can be time-consuming, and the performance depends heavily on the quality of the proposals.
Compared to co-localization methods [26, 14, 3, 17, 31], we address unsupervised object localization in a far more challenging scenario where only a single image is given. As illustrated in Figure 1, the setting of our method is fully unsupervised: it uses no annotations and does not even assume a single dominant class. Recent works [34, 12, 30] demonstrate that models pre-trained on ImageNet can be reused to obtain powerful semantic representations for a given image. Additionally, we observe that the convolutional activations usually fire at the same region, which may be a general part of an object. However, the convolutional activations are only useful for roughly localizing the object regions: these regions are often small and sparse and cannot cover entire objects. To make matters worse, the localization results are not very robust to noisy backgrounds. Therefore, it is of great interest to develop more efficient approaches to automatically discover and utilize these visual patterns, which would help identify integral object regions.
To tackle the above issues, this paper proposes a novel and simple approach, named Object Mining (OM), to mine frequent patterns and discover objects from the pattern mining perspective. The success of our proposed method relies on two key foundations: 1) The CNN features extracted from the pre-trained model have powerful representation and provide abundant semantic and spatial information; 2) Pattern mining techniques can efficiently mine frequently-occurring visual patterns, which often indicate the location of objects in one image.
Object Mining first converts the deep features from the particular layer of a pre-trained CNN model into a set of transactions (i.e. a transaction database). Specifically, we propose an efficient transaction creation strategy: we extract the deep features from multiple convolutional layers and use a tunable threshold to select the descriptors that are used to convert to items. Benefiting from this strategy, we could retain the most useful information while discarding redundant information, which is critical to obtain a suitable input for a mining algorithm. Then we look for relevant but non-redundant patterns automatically through pattern mining techniques. Finally, we merge the selected patterns to generate a support map which represents the object regions. The experimental results show that the discovered regions are not only semantically consistent but also cover the objects accurately.
Our proposed method is simple but effective and requires no training process. Thus, we do not need to design a complex loss function, and we avoid collecting a large amount of annotations, which is labor-intensive. More importantly, compared with co-localization methods, we address unsupervised object localization in a more challenging scenario where only a single image is given, which is more realistic in practice.
To the best of our knowledge, we propose the first usage of pattern mining for fully-unsupervised object localization. The main contributions are summarized as follows:
- We observe that the frequently-occurring patterns in CNN feature maps strongly correspond to the spatial cues of objects. This simple observation leads to an effective unsupervised object discovery and localization method based on pattern mining techniques, named Object Mining (OM).
- We propose an efficient transaction creation strategy to transform the convolutional activations into transactions, which is the key issue for the success of pattern mining techniques.
- Extensive experiments are conducted on four fine-grained datasets, the Object Discovery dataset, ImageNet subsets and the PASCAL VOC 2007 dataset. Our proposed method outperforms other fully-unsupervised methods by a large margin. Compared with co-localization methods, our method achieves competitive performance even for one single image in an unsupervised setting.
2 Related Work
2.1 Unsupervised Object Localization
Fully unsupervised localization is challenging because it relies on no auxiliary information other than the given image. Thus, many methods shift their attention to the image co-localization problem [8, 24, 14, 26, 3, 17, 31]. Fully unsupervised image localization shares some similarities with image co-localization in the sense that neither problem requires image-level labels or bounding box annotations. The key difference is that image co-localization requires a set of images containing objects from the same category, whereas fully unsupervised localization deals with only one single image. Some earlier image co-localization methods [8, 24, 14, 3] address this problem based on low-level features (e.g., SIFT, HOG). Li et al. [17] are the first to use features from the fully connected layer of a pre-trained CNN model to learn a common object detector. However, the spatial correlation of deep descriptors in convolutional layers is lost. Zhou et al. [37, 38] demonstrate that convolutional activations retain spatial and semantic information and have remarkable localization ability. Wei et al. [31] utilize the convolutional activations and further consider their correlations in an image collection to deal with the image co-localization problem. SCDA [30] extracts feature descriptors from the last max-pooling layer of a pre-trained VGG-16 model and employs a simple mean value strategy to locate the main objects in fine-grained images.
Similar to these methods, we also explore convolutional activations in an unsupervised setting. The novelty of our approach is that we develop a multi-layer feature combination strategy and take advantage of the powerful ability of pattern mining techniques to discover objects accurately. Compared with them, we can mine more distinctive object regions, and our localization results cover the target objects more densely and completely.
2.2 Pattern Mining in Computer Vision
Pattern mining techniques have been developed for several decades in the data mining community. Usually, a pattern is a combination of several elements that captures distinctive information. Inspired by this fact, a growing number of researchers have investigated pattern mining for computer vision tasks, including image classification [18, 10], image collection summarization and object retrieval.
A key issue for pattern mining methods is how to transform an image into transaction data that are suitable for pattern mining and retain most of the corresponding discriminative information. Most earlier methods simply treat an individual visual word as an item in a transaction. Owing to its sparsity, the local bag-of-words (LBOW) is usually adopted as the image representation in [21, 33, 1], and each visual word is then treated as an item. However, this operation only considers the absence or presence of a visual word and may lose information during transaction creation. To avoid this issue, Fernando et al. [9] propose a frequent local histograms method that represents an image with histograms of pattern sets. More recently, Li et al. [18] were the first to illustrate how pattern mining techniques can be combined with CNN features, a more appealing alternative to hand-crafted features. In these works, a local patch is transformed into a transaction by treating each dimension index of a CNN activation as an item. However, this approach also loses critical spatial information because the features are extracted from the fully connected layer.
3 The Proposed Method
In this section, we provide details of our OM approach. The overview of the proposed method is illustrated in Figure 2. First, we extract the feature maps from the Pool-5 and ReLU-5 layers of a pre-trained VGG-16 model. Then, the feature maps are converted into a set of transactions and the meaningful patterns are discovered by pattern mining techniques. Finally, we illustrate the details of how to merge the selected patterns to localize potential target regions.
3.1 Notations and terminology
First, we introduce the data mining notations and terminology. Let $A = \{a_1, a_2, \dots, a_M\}$ denote an itemset containing $M$ items. A transaction $T$ satisfies $T \subseteq A$ and $|T| \le M$, where $|T|$ is the number of items in transaction $T$. We define a transaction database $D = \{T_1, T_2, \dots, T_N\}$, where each $T_i \subseteq A$. Given an itemset $I \subseteq A$, we calculate the frequency of the itemset in the transaction database by $|\{T \mid T \in D,\ I \subseteq T\}|$. Then, we define the support value of $I$ as:
$$\mathrm{supp}(I) = \frac{|\{T \mid T \in D,\ I \subseteq T\}|}{N} \in [0, 1],$$
where $|\cdot|$ measures the cardinality, and $N$ is the number of transactions in $D$. An itemset whose support value is larger than a predefined threshold is considered a frequent itemset. Such an itemset is the pattern we want to mine.
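The support definition above can be sketched in a few lines of Python. This is only an illustration: the names `support`, `is_frequent` and `toy_db` are hypothetical, transactions are modelled as frozensets of integer items, and the strict inequality follows the "larger than a predefined threshold" wording.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[int]


def support(itemset: Iterable[int], database: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not database:
        return 0.0
    hits = sum(1 for transaction in database if items <= transaction)
    return hits / len(database)


def is_frequent(itemset: Iterable[int], database: List[Transaction],
                min_support: float) -> bool:
    """An itemset is frequent when its support is strictly above the threshold."""
    return support(itemset, database) > min_support


# Toy database with four transactions (illustrative values only).
toy_db = [
    frozenset({1, 2, 3}),
    frozenset({2, 3}),
    frozenset({2, 4}),
    frozenset({1, 2, 3, 4}),
]
```

On this toy database, the itemset {2, 3} is contained in three of the four transactions, so its support is 0.75.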
3.2 Extracting multi-layer feature maps
Recent works demonstrate that different convolutional layers learn features at different levels. Specifically, the shallow and fine convolutional layers tend to learn simple appearance features (e.g. textures and color), while the deep and coarse convolutional layers learn semantic cues (e.g. meaningful patterns such as a dog’s face or a bird’s head). In our work, the images are passed through a pre-trained VGG-16 [25] model and the feature maps are obtained from the Pool-5 and ReLU-5 layers. We adopt the multi-layer combination because it allows us to take original images of arbitrary size as input, and it also alleviates the loss of useful information caused by considering only single-layer activations.
We randomly select an image from a fine-grained dataset (CUB-200-2011 [28]) and visualize the local response regions of the Pool-5 and ReLU-5 layers. As shown in Figure 3, most semantic parts of a bird are frequently activated at the same location in the feature maps. Moreover, the activations of the two layers sometimes overlap but complement each other very well. Motivated by this observation, we combine these high-level feature maps to retain as much potentially useful information as possible. However, some background regions are also activated; these regions are present in only a few channels and are distributed sparsely. This means that not all activated areas in the feature maps are useful, so it is necessary to further mine the meaningful regions using pattern mining techniques.
3.3 From feature maps to transactions
Converting data into a set of transactions while maintaining useful information is the most critical step when applying pattern mining techniques to computer vision applications. In our method, each feature map is treated as a transaction, denoted by $T$, and each position index activated in the feature map is considered an item $a$. For example, if five positions fire in a feature map, the corresponding transaction includes five items, i.e. $|T| = 5$. The index set of all activated positions from all feature maps, also known as an itemset, is denoted by $A$. Each transaction is a subset of the items in $A$, i.e. $T \subseteq A$. The set of all transactions (i.e. all feature maps) is denoted by $D$. In our scenario, feature maps correspond to transactions, so the number of transactions in $D$ equals the number of feature maps.
In practice, an image is fed into a pre-trained VGG-16 model [25]. We then obtain 512 feature maps from Pool-5 and 512 feature maps from ReLU-5. We resize the Pool-5 feature maps to the same size as the ReLU-5 feature maps by bilinear interpolation, so that all feature maps share one spatial size.
Next, we choose useful descriptors that can be converted into items. In earlier work, a fixed threshold is used to select the top activations from the fully connected layer. However, this may lose some distinctive activations. To avoid this problem, we adopt a more flexible strategy: we take the mean value of the activation responses that exceed a lower cutoff as the tunable threshold. A position whose response magnitude is higher than this threshold is highlighted, and its index is converted into an item.
As noted in earlier work, there are two strict requirements for applying pattern mining techniques: 1) only a small number of items can be included in one transaction; 2) only a set of integers can be recorded in one transaction. Fortunately, our proposed transaction conversion satisfies both conditions. First, the visualization results above show that only small regions are activated in each channel of the feature maps. More importantly, the tunable threshold allows us to select useful descriptors and discard the disturbing ones, so the number of activations kept in each feature map is limited. Second, since we treat the indexes of the selected positions in each feature map as items, a transaction is recorded as a small set of integers. Our conversion strategy therefore provides a suitable input for mining possible objects with pattern mining techniques.
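The transaction creation strategy described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: feature maps are plain nested lists of floats that already share one spatial size, the lower cutoff `floor` defaults to 0.0 (an assumption, since the value is not given here), and the threshold is computed once over all feature maps of the image rather than per channel (also an assumption). The function names are hypothetical.

```python
from typing import FrozenSet, List

FeatureMap = List[List[float]]


def tunable_threshold(feature_maps: List[FeatureMap], floor: float = 0.0) -> float:
    """Mean of all responses strictly above `floor` (assumed cutoff)."""
    values = [v for fmap in feature_maps for row in fmap for v in row if v > floor]
    # With no response above the cutoff, return +inf so that nothing is selected.
    return sum(values) / len(values) if values else float("inf")


def to_transactions(feature_maps: List[FeatureMap],
                    floor: float = 0.0) -> List[FrozenSet[int]]:
    """One transaction per feature map; items are flattened position indices."""
    threshold = tunable_threshold(feature_maps, floor)
    transactions = []
    for fmap in feature_maps:
        width = len(fmap[0])
        items = frozenset(
            r * width + c
            for r, row in enumerate(fmap)
            for c, value in enumerate(row)
            if value > threshold
        )
        transactions.append(items)
    return transactions


# Two toy 2x2 feature maps (illustrative values only).
map_a = [[0.0, 5.0], [1.0, 0.0]]
map_b = [[0.0, 6.0], [0.0, 3.0]]
toy_transactions = to_transactions([map_a, map_b])
```

In the toy example the positive responses are 5, 1, 6 and 3, so the threshold is 3.75, and each feature map keeps only the position with flattened index 1. Each transaction is therefore a small set of integers, matching the two requirements listed above.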
3.4 Mining patterns
Given the transaction database $D$, we apply the Apriori algorithm [2] to find the frequent itemsets. For a given minimum support threshold $supp_{\min}$, an itemset $I$ is considered frequent if its support value satisfies $\mathrm{supp}(I) > supp_{\min}$. In other words, the support threshold determines which patterns are mined.
The overall process of pattern mining is shown in Figure 4. First, we convert the feature maps into a transaction database using the tunable threshold strategy. Note that the selected positions and the discarded positions are represented by red and blue dots, respectively. Then, we calculate the frequency of the items and keep those whose frequency is larger than the minimum support threshold. For example, when several items each have a frequency above the threshold, we denote them together as a frequent pattern. In this way, all frequent patterns are mined and the corresponding regions are discovered.
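The mining step can be illustrated with a compact, level-wise Apriori sketch. It is a simplified textbook-style version written for readability rather than speed, and it is not the implementation used in the experiments; the function name `apriori` and the toy database are hypothetical. As in the text, an itemset is kept only if its support is strictly larger than the minimum support threshold.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori(database: List[FrozenSet[int]],
            min_support: float) -> Dict[FrozenSet[int], float]:
    """Return every itemset whose support is strictly above `min_support`."""
    n = len(database)
    if n == 0:
        return {}

    # Level 1: count single items.
    counts: Dict[FrozenSet[int], int] = {}
    for transaction in database:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n > min_support}
    frequent = dict(current)

    k = 2
    while current:
        previous = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
        }
        current = {}
        for candidate in candidates:
            sup = sum(1 for t in database if candidate <= t) / n
            if sup > min_support:
                current[candidate] = sup
        frequent.update(current)
        k += 1
    return frequent


# Toy database: each transaction lists the activated positions of one feature map.
toy_database = [
    frozenset({1, 2, 3}),
    frozenset({2, 3}),
    frozenset({2, 4}),
    frozenset({1, 2, 3, 4}),
]
toy_patterns = apriori(toy_database, min_support=0.5)
```

With a threshold of 0.5, items 1 and 4 (support exactly 0.5) are discarded, and the mined patterns are {2}, {3} and {2, 3}. Lowering the threshold admits more items, which mirrors the trade-off between background noise and coverage discussed later in the paper.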
3.5 Selecting and Merging the best patterns for object localization
Although patterns are mined in the previous section, some of them may be isolated. Therefore we need to select the optimal patterns for object localization. Here we introduce our selection strategy: spatial continuity. In our method, a mined pattern corresponds to a region in one image. Since one target object is spatially continuous in one image, the object regions represented by selected patterns should also be spatially continuous. In addition, we discover that the isolated regions usually belong to the background of an image. Thus, we attempt to select the largest connected component based on the mined patterns, which would cover the entire target object as much as possible.
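The spatial-continuity selection can be sketched as a search for the largest connected component on the grid of mined positions. The sketch assumes 4-connectivity (the connectivity is not specified in the text) and a binary mask in which a cell is truthy when its position belongs to a mined pattern; the function name and the toy mask are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple


def largest_connected_component(mask: List[List[int]]) -> Set[Tuple[int, int]]:
    """Return the (row, col) cells of the largest 4-connected region of truthy cells."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen: Set[Tuple[int, int]] = set()
    best: Set[Tuple[int, int]] = set()
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first search over the component that contains (r, c).
            component: Set[Tuple[int, int]] = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    return best


# Toy mask: one region of three cells and one isolated cell.
toy_mask = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
toy_component = largest_connected_component(toy_mask)
```

In the toy mask, the isolated cell at row 1, column 3 is dropped, which corresponds to treating isolated regions as background.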
Subsequently, we merge the selected patterns to generate a support map for each image. Specifically, the value of the support map at each position is defined in terms of the frequency of the item that represents that position. Note that the support map is generated from relevant and non-redundant patterns. The support map has the same size as the feature map. To obtain a support map with the same size as the original image, we upsample it by bilinear interpolation. The higher the value at a position, the more likely its corresponding region is part of a target object. To generate a bounding box from the support map, we use a simple technique similar to CAM [38].
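A rough sketch of the merging step follows. It makes several assumptions that go beyond the text: the support map value at a selected position is taken to be the fraction of transactions containing that item, the map is kept at feature-map resolution (the bilinear upsampling is omitted), and the box is drawn around all cells above 20% of the peak value, a convention borrowed from the CAM procedure rather than a setting confirmed for OM. All names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple


def build_support_map(transactions: List[FrozenSet[int]],
                      selected_items: Iterable[int],
                      height: int, width: int) -> List[List[float]]:
    """Place each selected item's frequency at its (row, col) position."""
    smap = [[0.0] * width for _ in range(height)]
    n = len(transactions)
    if n == 0:
        return smap
    for item in selected_items:
        frequency = sum(1 for t in transactions if item in t) / n
        smap[item // width][item % width] = frequency
    return smap


def box_from_support_map(smap: List[List[float]],
                         rel_threshold: float = 0.2
                         ) -> Optional[Tuple[int, int, int, int]]:
    """Tight (x_min, y_min, x_max, y_max) box around cells above a fraction of the peak."""
    peak = max((v for row in smap for v in row), default=0.0)
    if peak <= 0:
        return None
    cells = [(r, c) for r, row in enumerate(smap)
             for c, v in enumerate(row) if v > rel_threshold * peak]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(cols), min(rows), max(cols), max(rows))


# Toy example on a 2x4 grid (illustrative values only).
toy_tx = [frozenset({1, 2}), frozenset({1}), frozenset({1, 5}), frozenset({6})]
toy_map = build_support_map(toy_tx, selected_items={1, 5}, height=2, width=4)
toy_box = box_from_support_map(toy_map)
```

Here item 1 appears in three of four transactions and item 5 in one, so the map holds 0.75 at (row 0, column 1) and 0.25 at (row 1, column 1). Raising `rel_threshold` shrinks the box toward the most frequently activated positions.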
4 Experiments
This section evaluates the performance of our proposed OM approach. We first describe the datasets and the evaluation metric, then report the object localization results, and finally give a detailed discussion.
4.1 Datasets and Evaluation Metric
Datasets: We evaluate our method on four challenging fine-grained datasets: CUB-200-2011 (200 classes, 11,788 images) [28], Stanford Dogs (120 classes, 20,580 images) [15], Stanford Cars-196 (196 classes, 16,185 images) [16] and FGVC-Aircrafts (100 classes, 10,000 images) [20].
To further investigate the localization ability of our proposed OM, we also perform experiments on the Object Discovery dataset [24], six ImageNet subsets [4] following [17, 31], and PASCAL VOC 2007 [7] following [14, 3].
4.2 Implementation Details
4.3 Object Localization
Fine-Grained datasets. Table 1 shows the comparison with the state of the art on four fine-grained datasets. OM reaches a CorLoc of 80.45%, 80.70%, 94.94% and 92.51% on CUB-200-2011, Stanford Dogs, Aircrafts and Cars-196, respectively.
Fully unsupervised localization is a particularly challenging task, and only SCDA [30] has been proposed in the same setting as our method. As shown in Table 1, our method exceeds SCDA [30] on all four datasets (80.45% vs 76.79% on CUB-200-2011, 80.70% vs 78.86% on Dogs, 94.94% vs 94.91% on Aircrafts, 92.51% vs 90.96% on Cars-196), with the largest gain on CUB-200-2011 (3.66 points) and a marginal one on Aircrafts (0.03 points). Moreover, on Cars-196 our method is only slightly below a co-localization method (92.51% vs 93.05%). The key reasons for the improvement are as follows: (1) Our multi-layer strategy can take advantage of spatial and semantic information learned from convolutional layers. (2) The regions selected by OM are more robust and accurate than those of the simple mean strategy of SCDA [30].
Figure 5 visualizes the localized regions of the proposed method compared with SCDA [30]. Our approach obtains decent object localization results in many challenging scenarios. Specifically, in the first and second rows, our method accurately locates small objects. More surprisingly, it discovers the objects precisely even when they are very similar to the background or appear in a complex background. In the third row, our method locates much finer object contours than SCDA. Additionally, the results in the fourth row show that our method can distinguish the importance of the part regions of an object, such as a dog’s head or a car’s front, which is very useful for other high-level vision tasks.
Besides, Table 1 also shows that OM outperforms recent weakly supervised methods by a large margin. A possible reason is as follows. Since only image labels can be used in weakly supervised methods, most of them are trained on top of a classification network. Therefore, the generated discriminative areas are suitable for classification but may not be optimal for localization, i.e. the located areas are small or sparse regions instead of whole object regions. In contrast, our method is fully unsupervised. We simply reuse the pre-trained model and do not fine-tune it on a specific dataset, which is beneficial for localizing object regions when combined with powerful pattern mining techniques. We believe that our work can bring new insight into solving the localization problem.
The Object Discovery dataset. We further evaluate the performance of our proposed OM approach on the Object Discovery dataset. The results are presented in Table 2. Note that [24, 26, 31, 3] are co-localization methods, which utilize a set of images containing the objects from the same category.
In Table 2, we can observe that the OM approach outperforms the unsupervised localization method by 2.6% (85.80% vs 83.20%). Compared with the co-localization methods, we also obtain improvements of 1.61%, 9.22% and 10.64% over three of them. Only one co-localization method scores higher than ours, and it needs a set of images, whereas only a single image is needed in our experiments. In particular, on the “horse” category, which is the most challenging subcategory due to multiple targets and complex backgrounds, our method achieves a new state of the art compared with all other methods. These results suggest that our method is reasonably robust to complex scenarios.
ImageNet Subsets dataset. To demonstrate the generalization capabilities of our proposed OM approach, we also perform the experiment on six subsets of the ImageNet which have not been used to train the CNN model. Table 3 presents the CorLoc metric on these subsets. Note that [3, 17, 31] are co-localization methods, which require the image collection containing the objects from the same category. However, our proposed method is fully unsupervised.
In Table 3, we can see that OM outperforms the unsupervised baseline by a large margin (60.1% vs 37.8%) under the unsupervised setting. Compared with co-localization methods, our OM approach still improves over two of them by about 22.4% and 11.8%, respectively. Object Mining is lower only than the most recent co-localization method, but we achieve the best accuracy of 44.9% and outperform that method by 14.6% on the most difficult category (i.e. rake). The competitive results on the challenging ImageNet subsets demonstrate that incorporating pattern mining techniques can efficiently mine target objects in real-world applications in an unsupervised manner.
PASCAL VOC 2007 dataset. We further evaluate our method on PASCAL VOC 2007. To make an appropriate comparison with other methods, our work follows [14, 3] and uses all images in the trainval set (discarding images that only contain object instances marked as difficult or truncated). We observe that many images in the PASCAL VOC 2007 dataset contain multiple object instances, which results in multiple connected regions after pattern merging, each of which may correspond to a target. Since OM is a fully unsupervised method that uses no annotations of any type, we adopt a straightforward strategy: we keep all the connected components, and each component generates its own bounding box. We consider an image to be correctly localized if one predicted bounding box satisfies the CorLoc condition. The results are reported in Table 4. Our proposed method outperforms the unsupervised localization method by 15.6% (45.1% vs 29.5%). Compared with the co-localization methods [14, 3, 17, 31], our unsupervised OM approach also achieves competitive performance: it is lower than only one of them (45.1% vs 46.9%) and higher than all the others.
Furthermore, in Figure 6, we visualize some localization results of our OM approach on PASCAL VOC 2007. Our approach obtains decent object localization results in many challenging scenarios. Specifically, it discovers the objects precisely even when they are very similar to the background or appear in a complex background, e.g., the black cat on the tree. More surprisingly, although some images contain multiple objects, OM can still effectively localize the target objects.
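The multi-box evaluation rule used for PASCAL VOC 2007 above can be sketched as follows. The sketch assumes the common PASCAL criterion, under which a predicted box matches a ground-truth box when their intersection-over-union exceeds 0.5; the exact condition used in the experiments is not restated here, so the threshold and all function names should be read as assumptions. Boxes are (x_min, y_min, x_max, y_max) tuples.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_localized(predicted: Sequence[Box], ground_truth: Sequence[Box],
                 threshold: float = 0.5) -> bool:
    """True if any predicted box overlaps any ground-truth box above the threshold."""
    return any(iou(p, g) > threshold for p in predicted for g in ground_truth)


def corloc(images: List[Tuple[Sequence[Box], Sequence[Box]]],
           threshold: float = 0.5) -> float:
    """Percentage of images with at least one correctly localized box."""
    if not images:
        return 0.0
    hits = sum(1 for pred, gt in images if is_localized(pred, gt, threshold))
    return 100.0 * hits / len(images)


# Two toy images: (predicted boxes, ground-truth boxes).
toy_images = [
    ([(0, 0, 10, 8), (50, 50, 60, 60)], [(0, 0, 10, 10)]),  # first box has IoU 0.8
    ([(20, 20, 30, 30)], [(0, 0, 10, 10)]),                 # no overlap
]
toy_corloc = corloc(toy_images)
```

Because every connected component contributes a box and a single match suffices, this rule is more permissive than scoring one box per image, which is worth keeping in mind when comparing against methods evaluated with a single prediction.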
4.4 Classification on Fine-Grained Dataset
We employ a fine-grained classification task to further verify the localization ability of our OM approach. We perform classification with two baselines, VGG-16 [25] and VGG-19 [25]. We first apply our OM approach to obtain bounding boxes in both the train and test sets. Then, we fine-tune the baseline models with image-level annotations only. After that, we crop the images according to the bounding boxes to form a cropped image dataset. Finally, we train and test the models on the cropped image dataset.
Table 5 summarizes the classification results on the CUB-200-2011 test set with or without (w/o) bounding box annotations. Using OM, the accuracy improves by 2.58% for VGG-16 (69.72% vs 72.30%) and by 1.90% for VGG-19 (74.99% vs 76.89%). Moreover, we also achieve better classification results than VGGNet-ACoL [36] (72.30% vs 71.90%). The accuracy improvement is mainly due to the discriminative regions mined by the proposed method, which locates the entire object and boosts classification performance.
| Method | BBox annotation | Accuracy (%) |
|---|---|---|
| GoogLeNet-GAP [38] on full image | w/o | 63.00 |
| GoogLeNet-GAP [38] on crop | w/o | 67.80 |
| GoogLeNet-GAP [38] on BBox | BBox | 70.50 |
| VGGNet-ACoL [36] on crop | w/o | 71.90 |
| Ours (VGG-16) on full image | w/o | 69.72 |
| Ours (VGG-16) on crop | w/o | 72.30 |
| Ours (VGG-19) on full image | w/o | 74.99 |
| Ours (VGG-19) on crop | w/o | 76.89 |
4.5 Further Analysis
This section further evaluates the effects of the minimum support threshold and the strategy of convolutional feature combination.
Impact of the minimum support threshold. To investigate the effect of this parameter on localization performance, we vary it over a range of values and identify the value that maximizes the accuracy on each of CUB-200-2011, Stanford Dogs, FGVC-Aircrafts and Stanford Cars-196, as shown in Figure 7. We observe that a well-chosen minimum support threshold improves the performance: a lower threshold admits background noise, while a higher threshold discovers only a small region of the target object.
Strategy of convolutional feature combination. To verify the impact of features extracted from different convolutional layers, we perform experiments with the following settings. Pool-5 refers to feature maps extracted only from the Pool-5 layer, and ReLU-5 refers to feature maps extracted only from the ReLU-5 layer. The localization results using different convolutional layers are reported in Table 6. The results show that combining feature maps extracted from the Pool-5 and ReLU-5 layers brings a clear improvement on all four fine-grained datasets. Additionally, we also attempt to combine feature maps from other layers, e.g., Conv-4 and Pool-4. However, the Conv-4&Pool-4 combination yields lower accuracy, and the cost of pattern mining increases dramatically due to the rapid growth in the number of items.
4.6 Computational complexity
Here, we randomly select 400 images from the CUB-200-2011 dataset to report the computational cost. We perform the experiments on a computer with an Intel Xeon E5-2683 v3 CPU, 128 GB of main memory, and a TITAN Xp GPU. Our proposed OM approach consists of two major steps: (1) feature extraction and (2) pattern mining-based object localization, which includes transaction creation, pattern mining, and support map generation. Feature extraction takes about 0.03 second/image on the GPU and 0.74 second/image on the CPU. The second step takes about 0.21 second/image in both cases. Thus, the total execution time is about 0.24 second/image on the GPU and 0.95 second/image on the CPU, which shows the efficiency of our proposed method in practical scenarios.
5 Conclusion
In this paper, we propose a novel pattern mining-based method, called Object Mining (OM), for unsupervised object discovery and localization. Our method exploits the advantage of data mining and feature representation of pre-trained CNN models. Experimental results show that OM achieves competitive performance on a variety of benchmarks, demonstrating the effectiveness of coupling pattern mining with pre-trained model reuse. Our approach does not need any annotations and still shows surprising localization ability, which provides a new perspective to solve the localization problem.
References
- [1] A. Agarwal and B. Triggs. Multilevel image coding with hyperfeatures. International Journal of Computer Vision, 78(1):15–27, 2008.
- [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In International Conference on Very Large Data Bases, pages 487–499, 1994.
- [3] M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In CVPR, pages 1201–1210, 2015.
- [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
- [5] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
- [6] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, pages 5957–5966, 2017.
- [7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.
- [8] A. Faktor and M. Irani. “Clustering by Composition” – Unsupervised Discovery of Image Categories. Springer Berlin Heidelberg, 2012.
- [9] B. Fernando, E. Fromont, and T. Tuytelaars. Mining mid-level features for image classification. International Journal of Computer Vision, 108(3):186–203, 2014.
- [10] B. Fernando and T. Tuytelaars. Mining multiple queries for image retrieval: On-the-fly learning of an object-specific mid-level representation. In ICCV, pages 2544–2551, 2014.
- [11] E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, and T. Tuytelaars. Local alignments for fine-grained categorization. International Journal of Computer Vision, 111(2):191–212, 2015.
- [12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, pages 447–456, 2015.
- [13] X. He and Y. Peng. Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification. In AAAI, pages 4075–4081, 2017.
- [14] A. Joulin, K. Tang, and F. F. Li. Efficient image and video co-localization with frank-wolfe algorithm. In ECCV, pages 253–268, 2014.
- [15] A. Khosla, N. Jayadevaprakash, B. Yao, and F. F. Li. Novel dataset for fine-grained image categorization. In CVPR Workshop on FGVC, 2011.
- [16] J. Krause, M. Stark, D. Jia, and F. F. Li. 3d object representations for fine-grained categorization. In ICCV Workshops, pages 554–561, 2013.
- [17] Y. Li, L. Liu, C. Shen, and A. V. D. Hengel. Image co-localization by mimicking a good detector’s confidence score distribution. In ECCV, pages 19–34, 2016.
- [18] Y. Li, L. Liu, C. Shen, and A. V. D. Hengel. Mining mid-level visual patterns with deep cnn activations. International Journal of Computer Vision, 121(3):1–21, 2016.
- [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
- [20] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. HAL-INRIA, 2013.
- [21] T. Quack, V. Ferrari, B. Leibe, and L. J. V. Gool. Efficient mining of frequent and distinctive feature configurations. In ICCV, pages 1–8, 2007.
- [22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
- [23] K. Rematas, B. Fernando, F. Dellaert, and T. Tuytelaars. Dataset fingerprints: Exploring image collections through data mining. In CVPR, pages 4867–4875, 2015.
- [24] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In CVPR, pages 1939–1946, 2013.
- [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
- [26] K. Tang, A. Joulin, L. J. Li, and F. F. Li. Co-localization in real-world images. In CVPR, pages 1464–1471, 2014.
- [27] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In ACM International Conference on Multimedia, pages 689–692, 2015.
- [28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds200-2011 dataset. California Institute of Technology, 2011.
- [29] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In ECCV, pages 431–445, 2014.
- [30] X. S. Wei, J. H. Luo, J. Wu, and Z. H. Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 26(6):2868–2881, 2017.
- [31] X. S. Wei, C. L. Zhang, Y. Li, C. W. Xie, J. Wu, C. Shen, and Z. H. Zhou. Deep descriptor transforming for image co-localization. In IJCAI, pages 3048–3054, 2017.
- [32] Z. Xu, D. Tao, S. Huang, and Y. Zhang. Friend or foe: Fine-grained categorization with weak supervision. IEEE Transactions on Image Processing, 26(1):135–146, 2017.
- [33] J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: from visual words to visual phrases. In CVPR, pages 1–8, 2007.
- [34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, volume 8689, pages 818–833, 2014.
- [35] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. In IEEE International Conference on Computer Vision, pages 729–736, 2013.
- [36] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. S. Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, pages 1325–1334, 2018.
- [37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. Computer Science, 2015.
- [38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.