Ensemble of Part Detectors for Simultaneous Classification and Localization

Part-based representation has proven effective for a variety of visual applications. However, automatic discovery of discriminative parts without object/part-level annotations is challenging. This paper proposes a discriminative mid-level representation paradigm based on the responses of a collection of part detectors, which requires only image-level labels. Towards this goal, we first develop a detector-based spectral clustering method to mine representative and discriminative mid-level patterns for detector initialization. The advantage of the proposed pattern mining technique is that the detector-based distance metric focuses only on discriminative details, and a set of such grouped detectors offers an effective way to mine consistent patterns. Relying on the discovered patterns, we further formulate the detector learning process as a confidence-loss sparse Multiple Instance Learning (cls-MIL) task, which considers the diversity of the positive samples while avoiding drift away from the well-localized ones by assigning a confidence value to each positive sample. The responses of the learned detectors form an effective mid-level image representation for both image classification and object localization. Experiments conducted on benchmark datasets demonstrate the superiority of our method over existing approaches.




I Introduction

Object parts that capture crucial characteristics of an image are important in a variety of object recognition and related applications. For instance, in the Deformable Part Model (DPM) [15], an object is modeled as a set of deformable parts organized in a tree structure. In relative attribute learning [35], local parts that are shared across categories are used to learn relative attributes. In fine-grained recognition [47], [50], distinctive parts such as the heads of birds are detected to enable part-based representation. Nevertheless, obtaining informative parts usually requires object-level [15] or even part-level annotations [2], which are tedious and costly to collect for large-scale datasets. Accordingly, it is desirable to discover these parts with minimal human supervision.

The success of the Convolutional Neural Network (CNN) [21] has shed light on the possibility of automatically discovering object parts. It has been shown [48] that the CNN filters at different layers are sensitive to patches with varying receptive fields, i.e., from low-level cues such as edges and corners in earlier layers to semantically meaningful parts or even the whole object in deeper layers. From the viewpoint of detection, the output of the convolutional layers can be interpreted as the detection scores of multiple detectors. In this sense, a CNN learns detectors relevant to the dataset it is trained on. However, since the network is trained with image-level classification losses, these detectors (the hidden layers) are trained only implicitly. As a result, the discriminative power of the CNN detectors is rather weak, and they produce activations with inhomogeneous appearances. Though a collection of such weak detectors boosts the representative ability, there is still room for improvement by enhancing these weak filters.

An alternative way of discovering informative parts automatically is to learn detectors explicitly, which we refer to as weakly supervised detector learning. As shown in Fig. 1, the standard approach for detector learning requires initial patterns (object parts) for detector initialization, and an optimization strategy for detector learning. However, learning part detectors automatically is a classical chicken-and-egg problem: without an accurate appearance model, examples of a part cannot be discovered, while an accurate appearance model cannot be learned without appropriate part exemplars. To meet this challenge, we need to address the following two crucial questions.

Fig. 1: Image representation based on the part responses. Given a set of training images which are only provided with image-level labels, our goal is to mine mid-level patterns (object parts) that capture crucial aspects of an object, and learn a set of part detectors for image representation.

What are the right initial patterns? As the quality of the learned detectors depends heavily on initialization, it is crucial to select appropriate initial patterns. As noted in [39], such patterns should meet two criteria, i.e., representation and discrimination. Representation requires that such patterns occur frequently in images with the same label, while discrimination requires that they are seldom found in images not containing the object of interest. Unfortunately, algorithms that aim at finding such patterns are rather ad hoc and have limited performance. Most previous works [29], [39], [41] start from unsupervised clustering such as k-means to initialize a part model. However, k-means behaves poorly in high dimensions since the distance metric becomes less meaningful, often producing clusters whose instances are inhomogeneous.

How to learn generalized detectors? Given the initial patterns, most weakly supervised learning algorithms follow the pipeline of standard SVM training, or an iterative SVM optimization which alternates between training a classifier and choosing new positive samples [11], [20], [39]. Nevertheless, due to the uncertainty of the initial patterns, such optimization easily gets stuck in a local minimum. Moreover, due to occlusion, illumination variation, and viewpoint variation, the same part can look significantly different across images. As a result, such methods easily latch on to a few samples which are similar to the initial patterns, and generalize poorly. Thus, developing an optimization strategy suited to the weakly supervised paradigm is important for obtaining robust detection performance.

This paper proposes to learn a set of detectors in a weakly supervised paradigm, aiming to solve the above two issues. The main contribution is an iterative optimization strategy for detector learning, which we formulate as a confidence loss sparse Multiple Instance Learning (cls-MIL) task. Different from conventional MIL methods, which represent each positive image with a single instance and treat all images as equally important, cls-MIL represents each positive image as a sparse linear combination of its member instances, and considers the diversity of the positive images while avoiding drift away from the well-localized ones by assigning a confidence value to each positive image. The responses of the learned detectors form an effective mid-level image representation for recognition. Another interesting finding is that, different from most previous methods which treat image classification [41], [52] and object localization [3], [7], [23] separately, the proposed approach is able to effectively integrate the two tasks into a single framework. Benefiting from the discriminative ability of the learned part detectors, the detector responses of our approach are able to indicate the locations of the objects. Experiments conducted on benchmark datasets demonstrate the superiority of the proposed representation.

As the detector learning procedure relies heavily on the initial patterns, a second novelty of our approach is the use of a spectral clustering technique for mining consistent and discriminative patterns. To this end, a selection strategy is first utilized to sample discriminative patches of the corresponding category, followed by exemplar-SVM [26] detector training for each sampled patch. Finally, these exemplar-SVM detectors are grouped via a spectral clustering strategy for pattern mining. Compared with traditional clustering methods which are conducted on the original patches, the clustered detectors are able to focus on discriminative details, and a set of such grouped detectors offers an effective way to mine consistent patterns. Furthermore, an entropy coverage criterion is utilized to measure the discriminativeness of each cluster, which enables us to greedily select clusters for detector learning without worrying about choosing an appropriate number of clusters.

The rest of this paper is organized as follows. Sec. II reviews related works on weakly supervised detector learning. The details of our proposed detector learning method are elaborated in Sec. III. In Sec. IV, we apply the learned detectors for classification and localization. Experiments and discussions are given in Sec. V. Finally, Sec. VI concludes the paper.

II Related Works

Over the past years, much research has aimed at learning part models in an unsupervised or weakly supervised way. Most methods target improving two modules: pattern mining techniques for model initialization, and optimization strategies for detector learning. The learned part models offer a promising way of feature representation, which is beneficial for image recognition and other related applications. In the following, we organize the discussion of part model learning around these aspects.

II-A Pattern Mining Methods

Since ground truth annotations are not available in a weakly supervised paradigm, a number of strategies have been proposed to discover discriminative patches for model initialization. A simple method, taken in [29], [39], [41], starts by randomly sampling a large pool of patches, and employs unsupervised clustering to generate initial patterns for detector learning. Such methods are crude, and most of the returned clusters have inhomogeneous appearances. Hence, many pattern mining techniques have been developed to offer better initialization. Song et al. [40] formulate a constrained submodular algorithm to identify discriminative configurations of visual patterns. Wang et al. [44] propose to discover these latent parts via probabilistic latent semantic analysis on the windows of positive samples and further employ these clusters as sub-categories. Li et al. [24] combine the activations of a CNN with the association rule mining technique to discover representative mid-level patterns. Doersch et al. [11] formulate part discovery from the perspective of the well-known mean-shift algorithm to maximize the density ratio in the feature space. There is a special case in which we do not need to worry about exemplar alignment, i.e., a training set consisting of exactly one part instance [26]. However, a detector trained on a single exemplar has limited discriminative power, and the number of detectors scales with the number of training samples, which is prohibitive for large-scale datasets.

Different from previous approaches which aim at grouping the original patches, this paper performs clustering in terms of the corresponding weak detectors, and makes use of the grouped detectors for pattern mining. In order to generate weak detectors, a selection strategy is first utilized to sample discriminative patches, and each patch is associated with a detector via exemplar-SVM training. Though a single exemplar-SVM detector is weak, a collection of such detectors offers relatively satisfactory localization capacity for pattern mining.

II-B Optimization for Detector Learning

Based on these discovered patterns, most methods employ an iterative learning approach to refine the detectors. Juneja et al. [20] employ an LDA-accelerated version [18] of the exemplar-SVMs [26], which reduces the training cost substantially compared with the standard SVM procedure that involves hard negative mining [15]. However, the detectors are trained with only one positive instance, which results in limited discriminative power. Singh et al. [39] split the training set into two disjoint parts, and refine a part model via an iterative procedure which alternates between clustering on one part and training discriminative classifiers on the other to avoid overfitting. Parizi et al. [29] propose a joint training method which optimizes part models and class-specific weights iteratively. Sun et al. [41] propose a latent SVM model to learn detectors, which tends to select the discriminative parts by enforcing a group sparsity regularizer. However, these methods suffer from complex joint optimization, e.g., [29] takes over five days to train detectors on MIT Indoor-67 [32].

The majority of related works treat weakly supervised detector learning as a Multiple Instance Learning (MIL) task, in which labels are assigned to bags (sets of patterns) instead of individual patterns. A positive bag is a set of instances containing at least one positive example, while a negative bag is a set of instances which are all negative. MIL was originally introduced to solve a problem in biochemistry [10], and a variety of MIL algorithms have been developed over the years. The simplest method is to transform MIL into a standard supervised learning problem by applying the bag's label to all instances in the bag [33]. However, such a method assumes that positive examples are abundant in the positive bags. Andrews et al. [1] present a formulation of MIL as a max-margin SVM problem. Bunescu et al. [6] develop an MIL method which is particularly effective when the positive bags are sparse. When applying MIL to detector learning, the detector is obtained by an iterative procedure which alternates between selecting the highest scoring detection per bag as the positive instance and refining the detector model [5]. However, such a simplified setting is sensitive to initialization and prone to getting stuck in a local minimum.

This paper also formulates weakly supervised detector learning as an MIL task. Different from previous works, we introduce a confidence loss term into the MIL problem when determining the classifier hyperplane. The key insight is that, due to occlusion, illumination variation, and viewpoint variation, it is suboptimal to treat instances from different bags as equally important for detector learning. The introduced confidence loss term measures the reliability of each instance for MIL learning. As a result, the detectors are able to focus on more confident samples and downweight those samples with lower reliability. Furthermore, a cross-validation strategy is introduced to avoid overfitting the initial patterns.

II-C Mid-level Image Representation

A collection of detector responses can be used as a mid-level image representation. The paradigm is inspired by object bank [22], a pioneering work on using detector responses for image representation. The object bank represents an image as a scale-invariant response map of a large number of pre-trained generic object detectors. Following that, most methods employ detection scores as the image representation, and improve the performance by incorporating part responses [36], [39] or via multiple scale pooling [29], [41].

Over the past years, the CNN has become a powerful tool for image representation. Due to the domain mismatch between ImageNet (the source dataset on which the CNN is trained) and the target dataset, previous works attempt to enhance the CNN representation by transfer learning [27], [37], or network fine-tuning [16], [50]. However, these methods need substantial object/part annotations of the target dataset, which is tedious and impractical in real applications. Zhang et al. [49] propose an alternative method to fine-tune the network via saliency-based sampling, which is free of object annotations. Nevertheless, such a method is limited to datasets with relatively simple backgrounds (such as the fine-grained dataset [43]), and may obtain only limited performance improvement on datasets with complex scenes such as the Pascal VOC [14] datasets.

Our approach follows the pipeline of using detector responses as the feature representation. Different from previous works which learn a large number of detectors for classification [20], [24], [29] or focus on learning a single detector for localization [7], [25], [40], [44], this paper integrates classification and localization into a single framework, i.e., we not only determine whether an object is present in an image, but also where the object (if it exists) is. We find that it is possible to use only a few detectors for both classification and localization if each detector is distinctive enough. Such an integrated framework helps to close the gap between these two tasks.

Our feature representation is also related to dictionary learning methods [12], [25], [52], where patches are encoded as a sparse linear combination of dictionary elements, optimized for image reconstruction [12] or recognition [25], [52]. Compared with these approaches, this paper uses detectors as the dictionary elements (basis), and takes the detection responses as the combination coefficients.

III Learning Part Detectors

In this section, we aim to learn a collection of discriminative part detectors automatically for image representation. Our detector learning system consists of two modules: mid-level pattern mining and detector optimization. The pattern mining module first selects patches which are representative and discriminative, and then trains an exemplar-SVM [26] detector from each selected patch. This is followed by a spectral clustering procedure which groups the exemplar-SVM detectors for pattern mining. Furthermore, an entropy coverage criterion is proposed to measure the generalization ability of each cluster. The detector optimization module formulates weakly supervised detector training as a confidence loss sparse MIL (cls-MIL) task, which considers the reliability of each positive sample while alternating between mining new positive samples and retraining the part model. The whole framework of the proposed approach is illustrated in Fig. 2. In the following, we present the detailed design of each module.

Fig. 2: The framework of the proposed approach. Given a set of training images, we first learn a set of exemplar-SVM detectors from the selected discriminative patches, followed by detector clustering to discover patterns which are consistent and discriminative. The mined patterns are seeded for detector learning, which we formulate as a cls-MIL task. The detector responses are applied for both image classification and object localization.

III-A Pattern Mining with Spectral Clustered Detectors

Discovering groups of mid-level patterns that are discriminative and representative is crucial for detector learning. To solve this issue, we first introduce a sampling strategy which aims at selecting the discriminative patches, and propose a detector-based spectral clustering approach to mine consistent patterns. Furthermore, we present an entropy coverage criterion to measure the discriminativeness of each cluster, which enables us to greedily select detectors for image representation. These steps are described as follows:

III-A1 Discriminative patch selection

It is a challenging task to find discriminative patches without object/part annotations. To address this issue, a sampling strategy is introduced to select discriminative and representative patches. Specifically, given an image $I$, we first generate a set of region proposals $\mathcal{R}(I)$ with edge boxes [51], which covers the object of interest with high recall. Denote the features extracted from a CNN (after the ReLU layer) for a region $r$ as $\phi(r)$; the final representation of image $I$ is obtained by sum pooling the features over regions, i.e., $\Phi(I) = \sum_{r \in \mathcal{R}(I)} \phi(r)$. Finally, a one-vs-all SVM classifier is trained on the sum-pooled features $\Phi(I)$. Benefiting from the non-negativity of the CNN features and the additivity of the linear classifier, we select the patches which contribute significantly to the classification score. Specifically, given one category $c$ and its classification model $w_c$, the discriminative patch set of an image $I$ is denoted as:

$$\mathcal{P}_c(I) = \{\, r \in \mathcal{R}(I) : w_c^\top \phi(r) \ge \tau \,\}, \tag{1}$$

where $\tau$ denotes the threshold (set as 1) which enforces selecting the discriminative patches for classification.

In order to prevent the classifier from overfitting the training set $\mathcal{D}$, we equally divide $\mathcal{D}$ into $K$ disjoint and complementary subsets $\mathcal{D}_1, \dots, \mathcal{D}_K$. The classifier is trained on $K-1$ subsets and validated on the remaining one. For generalization, only correctly classified images are retained for discriminative patch selection. Fig. 3 illustrates some discriminative patches selected on the Pascal VOC 2007 dataset. It can be seen that the selected patches mostly lie around the object of interest, and skip irrelevant background regions.

Fig. 3: Examples of the selected discriminative patches (shown in red bounding boxes) on Pascal VOC 2007 [14].
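The selection rule described above can be illustrated with a small NumPy sketch. It assumes that a patch is kept when its own linear score reaches the threshold, ignores the bias term, and uses hypothetical toy features and a hypothetical classifier `w_c`; it is not the authors' implementation.

```python
import numpy as np

def select_discriminative_patches(region_feats, w_c, tau=1.0):
    """Return indices of regions whose own score w_c . phi(r) reaches tau,
    together with the per-region contributions."""
    contributions = region_feats @ w_c  # one score per region proposal
    return np.flatnonzero(contributions >= tau), contributions

# Toy example: 4 region proposals with non-negative (post-ReLU-like) features.
region_feats = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 3.0],
])
w_c = np.array([1.0, 0.2, -0.5])  # hypothetical linear classifier for one class

selected, contributions = select_discriminative_patches(region_feats, w_c)

# With sum pooling and a linear classifier, the image score is the sum of
# the per-region contributions, which is what makes this selection possible.
image_score = region_feats.sum(axis=0) @ w_c
```

Because the image-level score decomposes additively over regions, thresholding the per-region terms picks out the regions that drive the classification decision.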

III-A2 Detector-based clustering

The patch selection process usually generates tens of thousands of patterns per category, and most of them are highly correlated, e.g., some patches describe the heads of dogs, and some others describe the legs of dogs. It is necessary to cluster these patterns into smaller and representative groups for detector initialization. To this end, a common approach is to employ some form of unsupervised clustering such as k-means [29], [39], [41]. However, k-means behaves poorly in high dimensional spaces since the distance metric becomes less meaningful, and often produces clusters whose instances are in no way visually similar. Instead of clustering the original patches, this paper proposes a detector-based spectral clustering strategy, which discovers similar patterns via the grouped detectors.

Fig. 4: Examples of the discovered mid-level patterns with clustered detectors on (a) Pascal VOC 2007 [14] and (b) MIT Indoor-67 [32]. These patterns are obtained from the top responses of each clustered detector.

Inspired by exemplar-SVMs [26], we start learning each detector from only one instance, which avoids the problem of exemplar misalignment. The negative samples are defined as patches which do not contain the object of interest, i.e., all patches sampled from images with different labels. Since the set of negative samples is very large, the standard hard negative mining method [15] is quite expensive. We instead use Linear Discriminant Analysis (LDA) [18] to train a detector, which is an accelerated version of the exemplar-SVMs. Specifically, the detector template is learned simply by $w = \Sigma^{-1}(\mu_p - \mu_0)$, where $\mu_p$ is the mean feature of the positive examples, $\mu_0$ denotes the mean of the features in the whole dataset, and $\Sigma$ is the corresponding covariance matrix. Since each exemplar-SVM detector is supposed to fire only on visually similar examples, we cannot expect it to generalize much. To solve this issue, we follow an iterative procedure [20] which adds new positive samples in each round to enhance the exemplar detectors. In each round, we run the current detector on all other images with the same label, and retrain it by augmenting the positives with the top scored patches. The idea behind this process is to use the detection score as a similarity metric, which emphasizes the distinctive details and suppresses the irrelevant ones.
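A minimal NumPy sketch of an LDA-style detector and one augmentation round follows. The small ridge term `reg`, the synthetic Gaussian features, and the choice of five top-scored candidates per round are assumptions made for illustration only.

```python
import numpy as np

def lda_detector(pos_feats, mu0, sigma, reg=1e-3):
    """LDA template: Sigma^{-1} (mean(pos) - mu0).
    A small ridge term (an assumption here) keeps the solve stable."""
    mu_pos = pos_feats.mean(axis=0)
    sigma_reg = sigma + reg * np.eye(sigma.shape[0])
    return np.linalg.solve(sigma_reg, mu_pos - mu0)

rng = np.random.default_rng(0)
background = rng.normal(size=(500, 4))       # stand-in for whole-dataset features
mu0 = background.mean(axis=0)
sigma = np.cov(background, rowvar=False)

exemplar = np.array([[3.0, 0.0, 0.0, 0.0]])  # a single positive instance
w = lda_detector(exemplar, mu0, sigma)

# One augmentation round: score candidate patches from same-label images,
# add the top-scored ones to the positives, and retrain.
candidates = rng.normal(size=(50, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
top = np.argsort(candidates @ w)[-5:]
positives = np.vstack([exemplar, candidates[top]])
w_refined = lda_detector(positives, mu0, sigma)
```

Since the background mean and covariance are computed once and shared by all detectors, each new detector costs only one linear solve, which is the source of the speed-up over hard negative mining.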

Using exemplar-SVMs, each selected patch is associated with a detector. The key insight of the proposed strategy is that instead of clustering the original patches, we group the corresponding detectors. Specifically, given $n$ exemplar detectors $\{w_1, \dots, w_n\}$ trained from one class $c$, we perform spectral clustering on the similarity matrix $S$ generated from the detectors, and obtain $M$ clusters $\{C_1, \dots, C_M\}$, where $S_{ij}$ denotes the cosine similarity of $w_i$ and $w_j$. Thus, detectors sharing similar response distributions are grouped together. Inspired by the boosting strategy [42], each cluster acts as an integrated detector to discover similar patterns, i.e., the detection score of a patch with respect to a cluster is denoted as:


As an illustration, Fig. 4 shows some examples of the patterns discovered using the clustered detectors. It can be seen that although a single detector is weak, a collection of such detectors offers satisfactory localization capacity. Another advantage of the detector-based pattern mining method is that we can select the most discriminative and representative patterns according to the top responses of the grouped detectors.
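The grouping idea can be sketched in NumPy as follows. This is a deliberately simplified illustration: it splits the detectors into only two groups using the sign of the Fiedler vector of the unnormalized graph Laplacian (a full spectral clustering would use more eigenvectors followed by k-means), and the cluster score is assumed here to be the average detector response, which may differ from the aggregation used in the paper.

```python
import numpy as np

def cosine_similarity_matrix(W):
    """Pairwise cosine similarity between rows of W (one detector per row)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

def spectral_bipartition(S):
    """Two-way split by the sign of the Fiedler vector of L = D - A."""
    A = np.clip(S, 0.0, None)     # keep non-negative affinities
    np.fill_diagonal(A, 0.0)      # no self-loops
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)   # eigenvalues come back in ascending order
    return (vecs[:, 1] > 0).astype(int)

def cluster_score(W_cluster, x):
    """Assumed aggregation: average response of the detectors in a cluster."""
    return float(np.mean(W_cluster @ x))

# Six toy detectors: three respond to the first feature, three to the second.
W = np.array([
    [1.0, 0.1], [1.0, 0.2], [0.9, 0.1],
    [0.1, 1.0], [0.2, 1.0], [0.1, 0.9],
])
S = cosine_similarity_matrix(W)
labels = spectral_bipartition(S)

x = np.array([1.0, 0.0])          # a patch that matches the first group
score_a = cluster_score(W[labels == labels[0]], x)
score_b = cluster_score(W[labels != labels[0]], x)
```

The cluster containing the first three detectors responds strongly to `x`, while the other cluster does not, which mirrors how a group of weak detectors can act as one more reliable detector.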

III-A3 Entropy coverage

The detector-based clustering generates a series of clusters with varying discriminative capacities. The notion of a discriminative cluster is that the detectors within it should be trained from as many images as possible. Such clusters include detectors corresponding to patterns repeated across different images. We propose an entropy coverage criterion to measure the discriminativeness of each cluster. Given $N$ images $\{I_1, \dots, I_N\}$ belonging to the same class and the corresponding clustered detectors $\{C_1, \dots, C_M\}$, the entropy coverage of cluster $C_k$ is defined as:


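where $p_i$ denotes the probability of the detectors in $C_k$ coming from image $I_i$. The sub-term of the entropy coverage is a standard entropy function, which enjoys the following property:

Corollary 1. Denote the entropy function as $H(p) = -\sum_{i=1}^{N} p_i \log p_i$, then $0 \le H(p) \le \log N$ for $p_i \ge 0$, $\sum_{i=1}^{N} p_i = 1$.

Proof. For the left side, since $0 \le p_i \le 1$, we have:

$$H(p) = \sum_{i=1}^{N} p_i \log \frac{1}{p_i} \ge 0. \tag{4}$$

The right side follows from the maximum property of entropy, i.e., the entropy reaches its maximum $\log N$ when the $N$ events are equiprobable.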

According to Corollary 1, the entropy coverage of a cluster is large if the detectors within it are trained from diverse images, and it reaches its maximum when the detectors are drawn equally from all images. The larger the entropy coverage is, the more frequent the patterns that the detectors in the cluster can find. Such an entropy coverage criterion enables us to greedily select clusters for detector initialization without worrying about choosing an appropriate number of clusters. In the experimental section, we show that the optimal number of clusters can be determined by the classification performance.
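As a small numerical illustration of the entropy term, the sketch below computes the entropy of the distribution of source images within one cluster. It covers only the standard entropy function and its bounds; the helper name `source_entropy` and the toy image identifiers are hypothetical.

```python
import numpy as np
from collections import Counter

def source_entropy(source_image_ids):
    """Entropy of the empirical distribution of the images that the
    detectors of one cluster were trained from."""
    counts = np.array(list(Counter(source_image_ids).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Each list holds, per detector in a cluster, the id of its source image.
spread = source_entropy([0, 1, 2, 3])   # detectors come from 4 different images
skewed = source_entropy([0, 0, 0, 1])   # mostly one image
single = source_entropy([0, 0, 0, 0])   # all from the same image
```

A cluster whose detectors come from many different images scores close to the upper bound, while a cluster dominated by one image scores close to zero, so ranking clusters by this value favors patterns that recur across images.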

III-B Detector Optimization with cls-MIL

Although the grouped detectors offer a relatively robust localization capacity, it is far from sufficient. These detectors are trained from a subset of discriminative patches, and are only able to discover patches which are themselves highly discriminative. We cannot ensure that they respond consistently across all the images of that class, especially those that were not correctly classified in cross-validation. Based on these observations, we formulate weakly supervised detector learning as a confidence loss sparse MIL (cls-MIL) task, which considers the diversity of the positive samples while avoiding drift away from the well-localized ones by assigning a confidence value to each mined positive sample.

III-B1 Motivation

To use MIL for detector learning, each image is considered as a bag, and the patches within it as instances. Given a set of training images, we treat images of one particular category as positive bags, and the remaining images as negative bags. Intuitively, if an image is labeled as positive, then at least one patch within it should be treated as a positive instance; if it is labeled as negative, then all patches within it should be treated as negative instances. Standard MIL alternates between selecting the highest scoring detection per bag as the positive instance and refining the detection model. However, it suffers from several issues. First, when training and selection are performed on the same dataset, the detectors latch on to the initial patches they are trained from and prefer them at each round of instance selection. Second, standard MIL often mines a single instance per positive bag and treats all mined instances as equally important, which is often not the case. Due to occlusion, illumination variation, and viewpoint variation, the same part in different images has a varying confidence of positiveness. Based on these observations, a multi-fold cross-validation [7] is introduced to avoid overfitting the initial training samples, and a confidence loss sparse MIL (cls-MIL) technique is proposed to tackle the dataset bias. In the following, we define the problem formally.

III-B2 Problem formulation

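Let $\mathcal{B}$ be the set of bags used for training, which consists of a set of positive bags $\mathcal{B}^+$ and negative bags $\mathcal{B}^-$, i.e., $\mathcal{B} = \mathcal{B}^+ \cup \mathcal{B}^-$. Let $B_i$ denote the bag of image $I_i$, and let $\mathcal{X}^+$ and $\mathcal{X}^-$ be the sets of instances from positive bags and negative bags, respectively. For any instance $x$ from a bag $B_i$, let $\phi(x)$ be the feature vector representation of $x$ (for brevity, we include the bias term in the feature representation). The cls-MIL problem can be formulated as solving the following objective:

$$\begin{aligned} \min_{w,\,\xi}\quad & \frac{1}{2}\|w\|^2 + C \sum_{B_i \in \mathcal{B}^+} \eta_i\, \xi_i + C \sum_{x_j \in \mathcal{X}^-} \xi_j \\ \text{s.t.}\quad & w^\top \phi(B_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\quad \forall B_i \in \mathcal{B}^+, \\ & -w^\top \phi(x_j) \ge 1 - \xi_j,\ \ \xi_j \ge 0,\quad \forall x_j \in \mathcal{X}^-, \end{aligned} \tag{5}$$

where $\phi(B_i)$ is the feature representation of bag $B_i$, $\eta_i$ is the latent variable which measures the positiveness of bag $B_i$, and $C$ is the control parameter of the loss term.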

One remaining issue is how to determine the bag representation $\phi(B_i)$. We would prefer that a positive bag be represented as much as possible by the true positives within it. However, even state-of-the-art region proposal algorithms [51] can only generate patches that contain the object of interest with high recall, not to mention the difficulty of determining the positive samples under the weakly supervised paradigm. To tackle this issue, we introduce a pooling strategy for the representation to improve robustness. Note that among all the given region proposals, only a few instances are the patterns we expect to find (i.e., they are sparse). Based on these observations, each bag is represented as the weighted sum of its mined member instances: $\phi(B_i) = \sum_{x \in \mathcal{S}(B_i)} \omega_x \phi(x)$, where $\omega_x$ is a weight assigned to each instance, and $\mathcal{S}(B_i)$ indicates the patterns selected as the positive "witnesses" in a positive bag $B_i$. In practice, only a few instances per positive bag are selected (we set the number of witnesses to 10), while all the negative instances are taken into consideration.
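The sparse bag representation can be sketched as below. The witnesses are taken to be the top-k scoring instances under the current detector, and the instance weights are assumed to be sigmoid scores normalized to sum to one; this weighting is an illustrative assumption and is not necessarily the rule used in the paper.

```python
import numpy as np

def bag_representation(instance_feats, w, k=10):
    """Represent a positive bag by a weighted sum of its top-k instances.
    The sigmoid-normalised weights are an assumption for illustration."""
    scores = instance_feats @ w
    witnesses = np.argsort(scores)[::-1][:k]     # indices of the top-k instances
    sig = 1.0 / (1.0 + np.exp(-scores[witnesses]))
    weights = sig / sig.sum()                    # weights sum to one
    return weights @ instance_feats[witnesses], witnesses, weights

rng = np.random.default_rng(0)
feats = np.abs(rng.normal(size=(200, 5)))        # 200 proposals, 5-D toy features
w = rng.normal(size=5)                           # hypothetical current detector

rep, witnesses, weights = bag_representation(feats, w, k=10)
```

Using several witnesses rather than the single best-scoring instance makes the bag feature less sensitive to one badly localized proposal.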

Algorithm 1 Weakly Supervised Detector Learning
Input: Positive bags $\mathcal{B}^+$, negative bags $\mathcal{B}^-$, the number of spectral clusters $M$, and the number of iterations $T$.
Output: The learned detector set.
Mid-level Pattern Mining: For instances in the positive bags $\mathcal{B}^+$, mine patterns for detector initialization.
  a) Select discriminative patches with Eq. (1) via cross-validation.
  b) For each selected patch, learn an exemplar-SVM detector.
  c) Perform spectral clustering of the detectors into $M$ clusters.
  d) For each cluster, mine patterns on $\mathcal{B}^+$ according to the cluster detection scores.
Detector Optimization: For each cluster, given the initial patterns discovered by the clustered detectors, solve cls-MIL in Eq. (5) by iteratively updating and optimizing.
  For iteration t = 1 to T
    a) Updating: Update the latent variables via cross-validation. The latent variables in each held-out subset are determined by detectors trained on the remaining subsets, i.e., the instance weights and the confidence loss term of each positive bag are updated.
    b) Optimizing: Solve Eq. (5) via hard negative mining on the negative bags $\mathcal{B}^-$ with the updated latent variables.
  end

III-B3 Optimization

The cls-MIL leads to a non-convex optimization problem due to the introduction of the implicit feature representation for the positive bags and the latent confidence variables . However, the problem is semi-convex, since it becomes convex once these latent variables are fixed. In the following, we solve Eq. (5) via an iterative procedure which alternates between fixing the latent variables and optimizing the detectors. To avoid focusing only on the initial positive samples, the optimization is carried out via cross-validation. Specifically, the training set is equally divided into disjoint and complementary subsets . Starting from the patterns discovered by the clustered exemplar-SVM detectors, the detector is optimized by iteratively Updating the latent variables and Optimizing Eq. (5). In the Updating step, the latent variables in are determined by detectors trained on , i.e., each instance weight of is updated by: , and the confidence loss term , where is a sigmoid function which maps the value into the range of . In the Optimizing step, the detector is optimized according to the updated latent variables via hard negative mining [15].
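The cross-validated Updating step can be illustrated with a minimal sketch. It is not the exact update of Eq. (5): the names `update_latent`, `train_fn` and `mean_trainer` are hypothetical, the latent variable is assumed to be the index of the highest-scoring instance in each positive bag, and the confidence is assumed to be the logistic sigmoid of that score. What the sketch does preserve is the key property described above, namely that the latent variables of a bag are determined by a detector trained on the other folds only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; maps any real value into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def split_folds(n_items, n_folds):
    """Assign item indices to disjoint, complementary folds (round-robin)."""
    return [list(range(k, n_items, n_folds)) for k in range(n_folds)]


def mean_trainer(bags):
    """Toy stand-in for detector training (assumption): mean of all instances."""
    instances = [x for bag in bags for x in bag]
    dim = len(instances[0])
    return [sum(x[d] for x in instances) / len(instances) for d in range(dim)]


def update_latent(pos_bags, n_folds, train_fn):
    """For every positive bag, choose the best instance and a confidence value
    using a detector that never saw the fold containing that bag."""
    folds = split_folds(len(pos_bags), n_folds)
    latent = {}
    for k, held_out in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        w = train_fn([pos_bags[i] for i in train_idx])
        for i in held_out:
            scores = [dot(w, x) for x in pos_bags[i]]
            best = max(range(len(scores)), key=scores.__getitem__)
            latent[i] = (best, sigmoid(scores[best]))
    return latent


pos_bags = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.5]],
    [[1.5, 0.1], [0.2, 0.2]],
    [[0.9, 0.0], [0.1, 0.3]],
]
latent = update_latent(pos_bags, n_folds=2, train_fn=mean_trainer)
```

In a full implementation `train_fn` would be replaced by the Optimizing step (a linear detector trained with hard negative mining), and the two steps would alternate for T iterations.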

Corollary 2. The solution of Eq. (5) is a linear combination of the positive instances and the negative instances , i.e., , where the coefficients and are bounded by: , , respectively.

Proof. The constrained minimization problem in Eq. (5) can be solved with a classical Lagrangian method. The Lagrangian operator can be represented as:


where , , , and denote Lagrange multipliers. The minimization of Lagrangian operator in Eq. (6) with respect to is obtained:


Due to the nonnegativity of and , we have and . Given a test example , the detection score can be represented as:


It can be seen that the final detection score is a weighted combination of the inner products between the training features and the test feature , and is determined only by samples with nonzero coefficients . These samples are called support vectors, since they are the only training samples necessary to define the separating hyperplane. Note that for positive samples, the coefficient is bounded by . From the KKT conditions, it is also possible to see when an example is a support vector: this happens only if the example is on the margin, or if it does not respect the separation conditions in Eq. (5). According to [8], the coefficient for positive samples in different locations is defined as:


For positive bags which do not respect the classification hyperplane, the corresponding coefficient is bounded by , which takes the reliability of into consideration. The regularization term helps to boost the detection performance. If a positive bag is not reliable in the previous round, its contribution to the classification hyperplane in the current round is lowered. As a result, MIL introduces diverse samples for detector learning, while the confidence loss term encourages the detector to focus on positive instances which are good enough and to downweight those instances with lower reliability. The whole procedure of the proposed weakly supervised detector learning algorithm is summarized in Algorithm 1.
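The support-vector view of the detection score can be checked numerically with a small sketch. The signed-coefficient convention (positive for positive instances, negative for negative instances), the optional bias and the function names are assumptions made for illustration; the bounds on the coefficients from Corollary 2 are not modeled here.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def detection_score(coeffs, train_feats, x, bias=0.0):
    """Score x as a weighted sum of inner products with the training features.
    Samples with a zero coefficient are skipped: only support vectors matter."""
    return bias + sum(c * dot(f, x) for c, f in zip(coeffs, train_feats) if c != 0.0)


def weight_vector(coeffs, train_feats):
    """Explicit linear detector: the same coefficients applied to the features."""
    dim = len(train_feats[0])
    return [sum(c * f[d] for c, f in zip(coeffs, train_feats)) for d in range(dim)]


train_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coeffs = [0.5, 0.0, -0.25]          # second sample is not a support vector
x = [2.0, 4.0]

score_dual = detection_score(coeffs, train_feats, x)
score_primal = dot(weight_vector(coeffs, train_feats), x)
```

Both forms give the same score, which is why the learned detector can be stored as a single weight vector at test time.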

Fig. 5: An illustration of how to compute image representation and object heat maps according to the detector responses.

IV Applications: Image Classification and Object Localization

The learned detectors are discriminative for the corresponding category, and an ensemble of the detectors across different categories offers an effective mid-level image representation. In this section, we apply such mid-level representation for image classification and object localization.

IV-A Image Classification

Unsupervised clustering methods have been used for feature representation [31], [45]. Since our learned detectors can be considered as the true visual patterns corresponding to a certain category (as opposed to the clustered ambiguous visual letters in [31], [45]), it makes sense to apply such detectors for image coding. Denote all the learned detectors across different categories as , where is the total number of detectors. Our mid-level feature representation is based on the maximal responses of a collection of detectors. Specifically, given an image and the corresponding region proposals , the feature representation is denoted as: , where is a latent variable indicating the region with maximum response corresponding to detector , i.e., . An illustration of image representation is shown in Fig. 5.
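The max-response encoding can be sketched as follows. The sketch assumes that every detector is a linear model given as a `(weights, bias)` pair and that region features are plain lists of floats; the function name `encode_image` is hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def encode_image(region_feats, detectors):
    """Return one value per detector: its maximal response over all regions.
    Also returns the index of the region that produced each maximum."""
    feature, best_regions = [], []
    for w, b in detectors:
        scores = [dot(w, r) + b for r in region_feats]
        j = max(range(len(scores)), key=scores.__getitem__)
        feature.append(scores[j])
        best_regions.append(j)
    return feature, best_regions


region_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three region proposals
detectors = [([1.0, 0.0], 0.0), ([0.0, 2.0], -1.0)]   # two linear detectors
feature, best_regions = encode_image(region_feats, detectors)
```

The feature dimension equals the number of detectors, independent of the number of region proposals, and the returned region indices are what a visualization of the top detection per image would use.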

Fig. 6: Examples of localization process on Pascal VOC 2007 trainval split. We generate the object heat map and perform grabcut [34] to obtain segmentation mask of the object. Then a tight object bounding box (shown in red) is obtained via enclosing the segmentation mask.

Given the image representation, a conventional SVM classifier is trained to produce the final classification results. Note that the complexity of feature coding with detector responses is very low: it requires no more than a dot product once the features (e.g., CNN) are extracted. In addition, we greedily select detector responses based on the entropy coverage criterion, and find that the performance saturates after the first few detectors are added, which reduces the feature dimension by an order of magnitude. In the experimental section, we will demonstrate the effectiveness of the proposed feature coding approach.

IV-B Object Localization

The learned part detectors are discriminative for the corresponding category, and a collection of them indicates the rough position of the object of interest. In this section, we present a simple object localization technique based on the learned part detectors. The basic idea is to accumulate the part responses into a whole-object heat map, which indicates the potential position of an object. Specifically, starting from a collection of part detectors corresponding to a category, we first define a part map based on detector , in which the confidence that a pixel is contained in an object part is denoted as:


where denotes the patch set that includes pixel , is a sigmoid function, and is a normalization constant such that . Finally, the object map is a weighted linear combination of the part maps obtained by all part detectors, i.e., , where is a weight factor which denotes the reliability of each detector, and is given by . Fig. 5 illustrates examples of how to compute the object heat maps.
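A minimal sketch of the heat map accumulation is given below. Several details are assumptions because the corresponding formulas are not reproduced here: each part map is normalized so that its peak equals 1, the detector weights are supplied by the caller and normalized to sum to 1, and boxes are `(x0, y0, x1, y1)` with exclusive upper bounds. The names `part_map` and `object_map` are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def part_map(height, width, boxes, scores):
    """Sum sigmoid(score) over all boxes covering each pixel, then rescale so
    that the largest value is 1 (assumed normalization)."""
    m = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), s in zip(boxes, scores):
        v = sigmoid(s)
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                m[y][x] += v
    peak = max(max(row) for row in m)
    if peak > 0:
        m = [[v / peak for v in row] for row in m]
    return m


def object_map(part_maps, weights):
    """Weighted linear combination of part maps, with weights normalized to 1."""
    total = sum(weights)
    height, width = len(part_maps[0]), len(part_maps[0][0])
    return [[sum(w * pm[y][x] for w, pm in zip(weights, part_maps)) / total
             for x in range(width)] for y in range(height)]


m1 = part_map(2, 3, [(0, 0, 2, 2), (1, 0, 3, 1)], [0.0, 0.0])
m2 = part_map(2, 3, [(2, 1, 3, 2)], [5.0])
heat = object_map([m1, m2], [3.0, 1.0])
```

Pixels covered by many confident patches of reliable detectors receive the highest values, which is the behavior the heat map relies on.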

The object heat map indicates the most discriminative details of an object, and usually focuses on object parts (e.g., the head of dogs) instead of the whole object. Inspired by [30], which casts localization as a segmentation task, we perform grabcut [34] on the object heat map to generate the segmentation mask. The goal is to propagate the discriminative part details to the whole object using color continuity cues. To this end, the foreground and background are modeled as Gaussian mixture models. The foreground model is estimated from heat map values higher than , and the background model is estimated from values lower than . Finally, we take the bounding box that covers the largest connected component in the generated segmentation mask as the localization result. Some example localization processes are shown in Fig. 6. In the experimental section, we will show that, as a byproduct of the learned discriminative detectors, this localization technique achieves satisfactory localization performance.
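The last step, taking the box around the largest connected component of the segmentation mask, can be sketched without any image library. The grabcut segmentation itself is not reproduced; the sketch starts from a binary mask and assumes 4-connectivity, which is a choice made here for illustration. The function name `largest_component_bbox` is hypothetical.

```python
from collections import deque


def largest_component_bbox(mask):
    """mask: 2D list of 0/1 values. Returns (x_min, y_min, x_max, y_max),
    inclusive, for the largest 4-connected foreground component, or None."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    best_size, best_box = 0, None
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first search over one component.
            seen[sy][sx] = True
            queue = deque([(sy, sx)])
            size, x0, x1, y0, y1 = 0, sx, sx, sy, sy
            while queue:
                y, x = queue.popleft()
                size += 1
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_box = size, (x0, y0, x1, y1)
    return best_box


mask = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
box = largest_component_bbox(mask)
```

Keeping only the largest component discards small spurious foreground regions, at the cost of returning a single box even when several objects of the same class are present.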

V Experiments

In this section, we present an evaluation of the proposed weakly supervised image classification and object localization framework. We also perform an ablation study to understand how various design choices impact the recognition performance.

V-A Datasets and evaluation metrics

We evaluate the proposed approach on three publicly available benchmarks, two for generic object recognition and one for scene recognition. The details of the datasets are briefly summarized as follows:

Pascal VOC 2007: The Pascal VOC 2007 dataset [14] is a widely used benchmark for multi-label image classification and object localization. The benchmark contains a total of 9,963 images spanning 20 generic object classes, of which 5,011 images are used for trainval and the remaining 4,952 images for test. For image classification, we use the trainval split as the training set and the test split as the test set, and the evaluation metric is mean Average Precision (mAP), which complies with the Pascal VOC challenge protocol.
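For reference, the VOC 2007 protocol computes average precision with 11-point interpolation over the precision/recall curve. The sketch below illustrates that computation for one class and is not the official development kit code; the function names and the toy scores are made up for illustration.

```python
def average_precision_11pt(scores, labels):
    """11-point interpolated AP. labels are 1 (positive) or 0 (negative)."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    precisions, recalls, tp = [], [], 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        precisions.append(tp / rank)
        recalls.append(tp / n_pos)
    ap = 0.0
    for t in range(11):
        threshold = t / 10.0
        # Interpolated precision: best precision at any recall >= threshold.
        p = max((p for p, r in zip(precisions, recalls) if r >= threshold), default=0.0)
        ap += p / 11.0
    return ap


def mean_average_precision(per_class_aps):
    return sum(per_class_aps) / len(per_class_aps)


ap_mixed = average_precision_11pt([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_perfect = average_precision_11pt([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

mAP is then the mean of the per-class AP values over the 20 object classes.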

Pascal VOC 2012: The Pascal VOC 2012 dataset [13] is an extended version of the Pascal VOC 2007, which contains a total of 22,531 images, including 11,540 images for trainval and 10,991 images for test. Since ground truth labels are not available for test split, we use the online evaluation server to evaluate recognition performance of the proposed algorithm.

MIT Indoor-67: The MIT Indoor-67 [32] dataset consists of 15,620 images belonging to 67 categories of indoor scenes. It is challenging because of the large ambiguities between categories. We follow the standard train/test split as in [32], i.e., approximately 80 images per class for train and 20 images per class for test. The evaluation metric for MIT Indoor dataset is the mean classification accuracy.

In addition to classification, we also evaluate the localization performance of the proposed approach. We follow previous methods on object localization [4], [7], and evaluate the performance on Pascal VOC trainval set with CorLoc criterion [9]. CorLoc measures the percentage of images with correct localization, i.e., a window is considered to be correct if it has an Intersection-over-Union (IoU) ratio of at least with one of the ground truth instances.
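The CorLoc criterion can be written down in a few lines. In this sketch the IoU threshold is a parameter; its default of 0.5 is the value commonly used with the Pascal VOC overlap criterion and is stated here as an assumption. Boxes are `(x0, y0, x1, y1)` in continuous coordinates, without the pixel-inclusive "+1" convention of the official VOC tools.

```python
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose predicted box overlaps at least one ground
    truth instance with IoU >= threshold.
    predictions: one box per image; ground_truths: list of boxes per image."""
    hits = sum(
        1 for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gts)
    )
    return hits / len(predictions)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [[(0, 0, 2, 2)], [(1, 0, 3, 2)]]
score = corloc(preds, gts)
```

CorLoc is computed per class over the images that contain that class, and the per-class values are then averaged.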

No. of proposals 300 500 1000 2000
mAP 82.4% 83.2% 83.4% 83.7%
TABLE I: Recognition performance on VOC 2007 with different number of region proposals. Results are based on model CaffeNet.

V-B Implementation Details

Models and features. We choose two widely used CNN models for feature extraction: a typical network, CaffeNet [19], and a deeper but more accurate one, VGG-VD [38] (the -layer model). We extract features from the fc6 layer (FC-CNN) after the rectified linear unit (ReLU), which gives a 4096-d nonnegative vector for each region. Edge boxes [51] are used to generate candidate region proposals. In addition to region proposals, edge boxes also provide an objectness score for each region. For computational efficiency, we disregard regions which occupy less than areas of an image, and retain the top scored region proposals as candidates.

Parameter settings. In pattern mining, the number of spectral clusters per category is set as , and the top scored patches per clustered detector are selected as patterns for detector initialization. In detector optimization, the number of iterations is set as , as we find that the detectors do not need more iterations to converge. For all situations where cross-validation is needed, we use typical -fold cross-validation.

V-C Ablation Study

To better understand the relative contribution of each module, we analyze the performance of our approach with different configurations. As the localization can be regarded as a byproduct of the learned detectors, we mainly measure how different designs affect the discriminativeness of the detectors in terms of classification performance.

method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv mAP
MR-CaffeNet [19] 90.4 87.0 87.2 84.1 40.5 76.4 86.9 87.5 60.7 70.5 75.7 82.7 89.4 80.4 93.9 53.9 76.6 66.6 90.9 71.5 77.6
MR-VD [38] 98.3 95.3 96.0 95.0 70.0 90.1 93.8 94.9 73.7 84.6 85.9 94.5 95.4 92.0 97.5 70.6 90.6 79.7 98.1 86.7 89.1


PRE-1000[27] 88.5 81.5 87.9 82.0 47.5 75.5 90.1 87.2 61.6 75.7 67.3 85.5 83.5 80.0 95.6 60.8 76.8 58.0 90.4 77.9 77.7
HCP Alex[46] 95.4 90.7 92.9 88.9 53.9 81.9 91.8 92.6 60.3 79.3 73.0 90.8 89.2 86.4 92.5 66.9 86.4 65.6 94.4 80.4 82.7
HCP VD[46] 98.6 97.1 98.0 95.6 75.3 94.7 95.8 97.3 73.1 90.2 80.0 97.3 96.1 94.9 96.3 78.3 94.7 76.2 97.9 91.5 90.9
WSDDN [4] 93.3 93.9 91.6 90.8 82.5 91.4 92.9 93.0 78.1 90.5 82.3 95.4 92.7 92.4 95.1 83.4 90.5 80.1 94.5 89.6 89.7
EPD CaffeNet 94.6 92.0 90.4 89.3 56.9 81.9 93.0 90.8 67.9 71.7 77.0 84.9 89.7 86.4 97.1 71.8 80.7 69.4 93.8 84.3 83.2
EPD VD 98.6 97.7 97.2 96.0 78.4 92.0 95.8 96.9 76.5 86.9 82.4 94.1 95.3 93.5 98.6 79.4 94.5 80.1 98.6 92.2 91.3
EPD VD+[38] 99.3 97.8 97.6 96.4 79.1 92.9 95.9 97.3 78.0 88.5 87.1 95.4 96.1 94.4 98.7 80.0 94.6 82.9 99.0 92.2 92.2
TABLE II: Recognition average precision (%) on VOC 2007 test split. We report performance with two models: CaffeNet [19] and VGG-VD [38]. The method marked with are those using additional training images.
method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv mAP
PRE-1000[27] 93.5 78.4 87.7 80.9 57.3 85.0 81.6 89.4 66.9 73.8 62.0 89.5 83.2 87.6 95.8 61.4 79.0 54.3 88.0 78.3 78.7
Weak Sup.[28] 96.7 88.8 92.0 87.4 64.7 91.1 87.4 94.4 74.9 89.2 76.3 93.7 95.2 91.1 97.6 66.2 91.2 70.0 94.5 83.7 86.3
HCP Alex[46] 97.7 83.2 92.8 88.5 60.1 88.7 82.7 94.4 65.8 81.9 68.0 92.6 89.1 87.6 92.1 58.0 86.6 55.5 92.5 77.6 81.8
HCP VD[46] 99.1 92.8 97.4 94.4 79.9 93.6 89.8 98.2 78.2 94.9 79.8 97.8 97.0 93.8 96.4 74.3 94.7 71.9 96.7 88.6 90.5
EPD CaffeNet 96.2 84.9 90.7 87.1 61.8 89.9 83.4 92.1 71.1 77.8 73.4 89.6 88.1 89.8 96.4 63.6 82.9 63.7 93.1 82.2 82.9
EPD VD 99.0 90.7 95.5 93.7 78.9 93.2 88.6 97.3 80.5 91.3 81.6 96.0 96.1 95.2 97.9 70.0 93.6 72.3 97.5 89.0 89.9
TABLE III: Recognition average precision (%) on VOC 2012 test. The method marked with are those using additional training images. Available at http://host.robots.ox.ac.uk:8080/anonymous/UKZVBM.html and http://host.robots.ox.ac.uk:8080/anonymous/CD25HO.html.

V-C1 Number of detectors

An advantage of the proposed approach is that detectors are trained from patterns with different coverage entropies. This enables us to greedily select detectors based on the entropy coverage criterion. As shown in Fig. 7, we add detectors one by one to probe how the number of detectors affects the classification performance. Note that the performance improves quickly when a small number of detectors are used (e.g., from 1 to 10), and then tends to be stable and even drops slightly when more detectors are added. This is mainly because the subsequent detectors are not discriminative enough for classification. For computational efficiency, we fix the number of detectors ( per category for VOC 2007 and per category for MIT Indoor-67) for the following experiments.

Fig. 7: The classification performance with respect to the number of detectors per category on (a) VOC 2007 and (b) MIT Indoor-67. The detectors are greedily selected via the entropy coverage criterion.

V-C2 Number of region proposals

To probe the performance with respect to the number of candidate region proposals, we vary the number of region proposals across different settings. Table I shows the results on VOC 2007. The performance is relatively stable (from 2000 to 300 region proposals, the mAP drops by only 1.3%). Considering the performance and computational efficiency, we choose the number of region proposals as .

Fig. 8: The classification performance comparisons with different configurations on the Pascal VOC 2007 test split. BL refers to the baseline which max-pools CNN region features, KM is short for the standard k-means pattern initialization algorithm, PM denotes the proposed pattern mining approach, MIL stands for standard multiple instance learning, and cls-MIL is the proposed confidence loss sparse MIL method. Results are based on model CaffeNet.

V-C3 Effects of different modules

We now compare the results with different configurations to analyze how each module affects the final classification performance. The modules are summarized as follows:

BL: This is the baseline method which directly max-pools multiple region proposal features for classification. It is introduced to help understand how the proposed approach improves the discriminative power of the detectors.

PM/KM: PM denotes the proposed pattern mining method in Sec. III-A, while KM is the standard k-means clustering method that is widely used for detector initialization in previous algorithms [29], [39], [41]. For fair comparisons, we perform k-means clustering on the selected patches with the number of clusters set as .

MIL/cls-MIL: MIL stands for the standard multiple instance learning method which mines new positive samples without considering the confidence of each bag, and cls-MIL is the confidence loss sparse MIL detector optimization strategy proposed in Sec. III-B.

As shown in Fig. 8, both k-means and multiple instance learning do help to improve the classification performance, although with limited gains. The proposed pattern mining and cls-MIL methods surpass their counterparts consistently, e.g., pattern mining improves the accuracy from (k-means) to , and cls-MIL obtains an accuracy improvement of ( vs ) compared with standard MIL. We also find that detector initialization really matters for multiple instance learning, even for the modified cls-MIL ( with k-means, and with pattern mining). This is widely discussed in previous approaches which aim to develop efficient pattern mining methods [24], [3] for detector initialization. However, few works emphasize detector optimization. We demonstrate that both modules are essential, and that their combination achieves a considerable performance improvement.

V-D Image Classification

Method Dimension Accuracy (%)
DMS [11] 13K 64.0
DSFL [29] 13K 77.1
MOP-CNN [17] 13K 68.9
MDPM [24] 3.3K 77.6


FC-CNN CaffeNet [19] 4K 60.3
MR-CNN CaffeNet [19] 4K 65.1
Clustered Detectors CaffeNet 2K 66.3
EPD CaffeNet 2K 69.0
EPD VD 2K 77.9
EPD VD+[38] 6K 80.1
TABLE IV: Comparisons of recognition performance on MIT Indoor-67. Clustered detectors refer to directly using clustered exemplar-SVM detector responses as features.

V-D1 Object Recognition

Tables II and III show the object recognition results of the proposed approach on the Pascal VOC 2007 and 2012 test splits, respectively. To make fair comparisons, we extract CNN features from multiple region proposals and max-pool the region features into a final representation, which we refer to as MR-CNN. The only difference between MR-CNN and our method is then the detectors, since both make use of the same region proposals. From Table II we can see that the proposed detectors improve the classification performance considerably, achieving mAPs of 83.2% with CaffeNet and 91.3% with the very deep model, which are gains of 5.6% and 2.2% over the corresponding MR-CNN features.

Many previous approaches report classification results on the Pascal VOC dataset, and we compare our results with some of the most recent ones. Most previous approaches that achieve high classification results are based on network fine tuning [4], [27], [46]. Since network fine tuning is hard for multi-label images, previous works [27] rely on object annotations to find category specific patches. In [46], the authors proposed a weakly supervised classification framework via two steps of network fine tuning, but it makes use of additional training data, which is more demanding. Our result (91.3%) is slightly better than the best performing one (90.9%) [46], demonstrating that traditional optimization approaches are able to achieve results competitive with CNN fine tuning. Furthermore, the proposed features are complementary to CNN features, and achieve an mAP of 92.2% when combined. For VOC 2012, our method obtains an mAP of 89.9%, which is slightly worse than [46] (90.5%), which makes use of additional training images. The reason is that CNN-based methods become more powerful as the training data grow, while MIL-based methods are relatively robust to the amount of data.

Fig. 9: Some visualizations of the correct and incorrect classification. We show the top detection that makes it look like the corresponding category, and some patches that the detectors are trained on.

V-D2 Scene Recognition

Table IV compares the recognition results on MIT Indoor-67. MR-CNN denotes max-pooling multiple region features for representation, and FC-CNN refers to directly extracting a single global feature from the whole image. Clustered detectors denote the method which relies on the responses of the clustered exemplar-SVM detectors as features. From Table IV we observe that:

MR-CNN is much better than FC-CNN. Using the CaffeNet model, the accuracy is 65.1% with MR-CNN and 60.3% with FC-CNN. This demonstrates that local features are crucial for scene recognition.

The features using clustered detector responses (66.3%) are better than MR-CNN (65.1%), even with half the dimension (2K versus 4K). This is mainly because the CNN is primarily trained on object-centric images instead of scene-centric images. As a result, the weak exemplar-SVM detectors still outperform MR-CNN due to the data-specific representation.

The proposed EPD is much better than the features with clustered responses. Benefiting from the detector optimization strategy, our method obtains an accuracy of 69.0%, a 2.7% improvement over the clustered responses. The performance is boosted to 77.9% when switching to the very deep model. Another observation is that the proposed features are complementary to CNN features, and achieve an accuracy of 80.1% when combined.

Some other approaches also aim at learning discriminative part detectors for recognizing indoor scenes. The method of [29] integrates detector learning and classification through joint training, and [11] poses mid-level pattern discovery as discriminative mode seeking by developing an extension of the classic mean-shift algorithm to density ratio estimation. Our method is closely related to [24], which also makes use of CNN activations for pattern mining. Our method achieves a better result than this best performing method (77.9% vs 77.6%). Many algorithms employ multiple region pooling for the final feature representation. A typical example is MOP-CNN [17], which uses VLAD to encode CNN activations into a bag-of-words representation and achieves an accuracy of 68.9%. Our result (69.0%) is comparable with [17] using the same model, but with a much lower dimension (2K vs 13K).

V-D3 Visualizing Mid-level Patterns

As an illustration, Fig. 9 shows some discovered patterns on the VOC 2007 (top row) and MIT-67 (bottom row) test splits. We show the highest activation region per image, which offers a clue as to why the image is classified as the corresponding category. Specifically, given a test image and the category label assigned to it (whether correct or not), we employ the category-specific detectors to find the region that responds most strongly to the given category, and show some patches that the detector is trained on. For correctly classified images, there often exist discriminative patches that respond strongly to the corresponding detectors. For example, on VOC 2007, the head of a train is important for recognizing trains, and the upper body of a person is important for recognizing persons. Similar results can be found on MIT-67: it is the pillar of a cloister that makes it look like a cloister, and the slide rail that makes bowling look like bowling. It is also helpful to investigate why incorrect results happen. On VOC 2007, the classifier misclassifies a chair as a plant, or a horse as a bicycle, probably because there exist corresponding details, e.g., the wheel of a carriage is similar to bicycle wheels. Similarly, on MIT-67, the window of an office is mistaken for the bars of a baby bed, which are the most discriminative detail for recognizing nursery. These details do look similar and are hard to tell apart. Nevertheless, these observations offer a direction for further improving the recognition performance.

method aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv mAP
Mimick [23] 73.1 45.0 43.4 27.7 6.8 53.3 58.3 45.0 6.2 48.0 14.3 47.3 69.4 66.8 24.3 12.8 51.5 25.5 65.2 16.8 40.0
Con-Clust [3] 66.4 59.3 42.7 20.4 21.3 63.4 74.3 59.6 21.1 58.2 14.0 38.5 49.5 60.0 19.8 39.2 41.7 30.1 50.2 44.1 43.7
MMIL [7] 56.6 58.3 28.4 20.7 6.8 54.9 69.1 20.8 9.2 50.5 10.2 29.0 58.0 64.9 36.7 18.7 56.5 13.2 54.9 59.4 38.8
PLSA [44] 80.1 63.9 51.5 14.9 21.0 55.7 74.2 43.5 26.2 53.4 16.3 56.7 58.3 69.5 14.1 38.3 58.8 47.2 49.1 60.9 48.5
EPD CaffeNet 60.8 55.3 43.8 16.5 29.4 64.5 69.3 49.4 12.6 52.7 29.7 39.1 58.2 81.1 34.0 39.6 58.8 47.8 59.3 53.1 47.7
EPD VD 60.8 58.8 40.8 17.6 24.8 67.0 68.1 50.0 12.2 48.6 27.4 36.5 58.2 78.7 29.7 36.6 63.9 44.4 58.9 55.6 46.9
TABLE V: Object localization precision (%) on VOC 2007 trainval images in terms of CorLoc metric.

Fig. 10: An illustration of the error distribution of the proposed localization method on Pascal VOC 2007 trainval split.

Fig. 11: Examples of localization results on Pascal VOC 2007 trainval split. Correct localizations are marked with red bounding boxes, while failed ones are marked with green. The failures often come from localizing object parts or grouping multiple objects from the same class.

V-E Object Localization

Table V shows the object localization results on the Pascal VOC 2007 trainval split. Benefiting from the learned part detectors, the proposed localization strategy (47.7%) is better than recent methods that are specifically designed for localization [3], [23], and is comparable with [44] (48.5%), which uses latent category learning for object localization. Another observation is that, unlike recognition, using a deeper model does not improve localization (46.9%). This can be explained by the fact that deeper models frequently focus on parts of the object instead of the whole object. Note that all the compared methods are designed for localization and often make use of context information for better localization, while we rely on detectors which are learned for classification, in order to uncover the connection between these two basic tasks. The results demonstrate that image classification and localization can be done simultaneously.

V-E1 Localization Error Analysis

To better understand the localization errors, following [7], [23], we summarize the errors to uncover the pros and cons of our localization method. Each predicted bounding box is categorized into one of the following five cases: 1) correct localization: the IoU overlap with the ground truth is greater than ; 2) the hypothesis is completely inside the ground truth; 3) the ground truth is completely inside the hypothesis; 4) no overlap: the IoU equals zero; and 5) low overlap: none of the above. Fig. 10 shows the error distribution of the proposed method across categories on the Pascal VOC 2007 trainval set. Among the failure modes, the most important one for our method is that an object part is localized instead of the whole object. This is intuitive, since in most situations correct classification only demands catching local discriminative details.
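The five cases can be expressed as a small decision procedure. The sketch assumes a single ground truth box per prediction, boxes given as `(x0, y0, x1, y1)` in continuous coordinates, and an IoU threshold passed as a parameter whose default of 0.5 is an assumption; the function name `categorize` and the label strings are hypothetical.

```python
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)


def contains(outer, inner):
    """True if `inner` lies completely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def categorize(pred, gt, threshold=0.5):
    """Assign a predicted box to one of the five cases, checked in order."""
    overlap = iou(pred, gt)
    if overlap > threshold:
        return "correct"
    if contains(gt, pred):
        return "hypothesis inside ground truth"
    if contains(pred, gt):
        return "ground truth inside hypothesis"
    if overlap == 0.0:
        return "no overlap"
    return "low overlap"


cases = [
    categorize((0, 0, 10, 10), (0, 0, 10, 10)),
    categorize((2, 2, 4, 4), (0, 0, 10, 10)),
    categorize((0, 0, 10, 10), (2, 2, 4, 4)),
    categorize((0, 0, 1, 1), (5, 5, 6, 6)),
    categorize((0, 0, 4, 4), (3, 3, 10, 10)),
]
```

Counting these labels per category gives an error distribution of the kind plotted in Fig. 10; "hypothesis inside ground truth" corresponds to localizing a part instead of the whole object.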

V-E2 Visualizations and Limitations

Fig. 11 shows some localization results on the Pascal VOC 2007 trainval split. Correct localizations are marked with red bounding boxes, while failed ones are marked with green. It can be seen that the proposed localization method is able to find objects when there is only one object of a given category in the image, but falls short in localizing multiple objects of the same category. This is in fact the main challenge for weakly supervised localization [7], and is a promising direction for future research.

V-E3 Classification versus Localization

Comparing classification (Table II) with localization (Table V), we find that the least successfully recognized objects are bottle () and chair (), which are also hard to localize ( and ). This is because they usually occupy a small fraction of the image and appear within cluttered backgrounds. The exception is person, which has a low localization accuracy () but a high recognition accuracy (). This can be explained by the fact that a person is easily recognized by the face, and there usually exist multiple persons in an image, which offers abundant cues for recognition. In contrast, localization fails when the detectors focus on the face, and it is hard to tell individual persons apart in a crowd.

VI Conclusion

In this paper, we propose an effective mid-level image representation approach for visual applications. The proposed framework aims at learning a collection of discriminative part detectors in a weakly supervised paradigm, which needs only the labels of the training images and does not need any object or part annotations. Our approach tackles several key issues in automatic part detector learning. First, we propose an efficient pattern mining technique via spectral clustering of exemplar-SVM detectors. Second, we formulate detector learning as a confidence loss sparse MIL (cls-MIL) task, which considers the diversity of the positive instances while avoiding drifting away from the well localized ones by assigning a confidence value to each positive instance. The proposed method shows notable performance improvements on several recognition benchmarks. Furthermore, we simultaneously consider classification and localization based on the learned detectors, and find that the accumulated responses of part detectors offer satisfactory localization performance, which bridges these two widely studied visual tasks.


  • [1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Proc. Neur. Info. Pro. Sys., pages 561–568, 2002.
  • [2] H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In Proc. Eur. Conf. Comput. Vis., pages 836–849, 2012.
  • [3] H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with convex clustering. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1081–1089, 2015.
  • [4] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 2846–2854, 2016.
  • [5] M. Blaschko, A. Vedaldi, and A. Zisserman. Simultaneous object detection and ranking with weak supervision. In Proc. Neur. Info. Pro. Sys., pages 235–243, 2010.
  • [6] R. C. Bunescu and R. J. Mooney. Multiple instance learning for sparse positive bags. In Int. Conf. Mach. Learn., pages 105–112, 2007.
  • [7] R. G. Cinbis, J. Verbeek, and C. Schmid. Multi-fold mil training for weakly supervised object localization. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 2409–2416, 2014.
  • [8] R. Collobert. Large scale machine learning. Technical report, IDIAP, 2004.
  • [9] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. Int. J. Comput. Vis., 100(3):275–293, 2012.
  • [10] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artifi. Intell., 89(1):31–71, 1997.
  • [11] C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In Proc. Neur. Info. Pro. Sys., pages 494–502, 2013.
  • [12] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process., 15(12):3736–3745, 2006.
  • [13] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis., 111(1):98–136, 2015.
  • [14] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.
  • [15] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1–8, 2008.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 580–587, 2014.
  • [17] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proc. Eur. Conf. Comput. Vis., pages 392–407, 2014.
  • [18] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In Proc. Eur. Conf. Comput. Vis., pages 459–472, 2012.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
  • [20] M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 923–930, 2013.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Neur. Info. Pro. Sys., pages 1097–1105, 2012.
  • [22] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Proc. Neur. Info. Pro. Sys., pages 1378–1386, 2010.
  • [23] Y. Li, L. Liu, C. Shen, and A. van den Hengel. Image co-localization by mimicking a good detector’s confidence score distribution. arXiv preprint arXiv:1603.04619, 2016.
  • [24] Y. Li, L. Liu, C. Shen, and A. van den Hengel. Mid-level deep pattern mining. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 971–980, 2015.
  • [25] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1–8, 2008.
  • [26] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In Proc. Int. Conf. Comput. Vis., pages 89–96, 2011.
  • [27] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1717–1724, 2014.
  • [28] M. Oquab, L. Bottou, I. Laptev, J. Sivic, et al. Weakly supervised object recognition with convolutional neural networks. In Proc. Neur. Info. Pro. Sys., 2014.
  • [29] S. N. Parizi, A. Vedaldi, A. Zisserman, and P. Felzenszwalb. Automatic discovery and optimization of parts for image classification. arXiv preprint arXiv:1412.6598, 2014.
  • [30] O. M. Parkhi, A. Vedaldi, C. Jawahar, and A. Zisserman. The truth about cats and dogs. In Proc. Int. Conf. Comput. Vis., pages 1427–1434, 2011.
  • [31] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In Proc. Eur. Conf. Comput. Vis., pages 143–156, 2010.
  • [32] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 413–420, 2009.
  • [33] S. Ray and M. Craven. Supervised versus multiple instance learning: An empirical comparison. In Int. Conf. Mach. Learn., pages 697–704, 2005.
  • [34] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graph., volume 23, pages 309–314, 2004.
  • [35] R. Sandeep, Y. Verma, and C. Jawahar. Relative parts: Distinctive parts for learning relative attributes. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 3614–3621, 2014.
  • [36] K. J. Shih, I. Endres, and D. Hoiem. Learning discriminative collections of part detectors for object recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(8):1571–1584, 2015.
  • [37] X. Shu, G.-J. Qi, J. Tang, and J. Wang. Weakly shared deep transfer networks for heterogeneous domain knowledge propagation. In ACM Multimedia, pages 35–44, 2015.
  • [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [39] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In Proc. Eur. Conf. Comput. Vis., pages 73–86, 2012.
  • [40] H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. In Proc. Neur. Info. Pro. Sys., pages 1637–1645, 2014.
  • [41] J. Sun and J. Ponce. Learning discriminative part detectors for image classification and cosegmentation. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 3400–3407, 2013.
  • [42] P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comput. Vis., 57(2):137–154, 2004.
  • [43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [44] C. Wang, K. Huang, W. Ren, J. Zhang, and S. Maybank. Large-scale weakly supervised object localization via latent category learning. IEEE Trans. Image Process., 24(4):1371–1385, 2015.
  • [45] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 3360–3367, 2010.
  • [46] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. Hcp: A flexible cnn framework for multi-label image classification. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1901–1907, 2016.
  • [47] S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In Proc. Neur. Info. Pro. Sys., pages 3122–3130, 2012.
  • [48] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. Eur. Conf. Comput. Vis., pages 818–833, 2014.
  • [49] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1134–1142, 2016.
  • [50] X. Zhang, H. Xiong, W. Zhou, and Q. Tian. Fused one-vs-all mid-level features for fine-grained visual categorization. In ACM Multimedia, pages 287–296, 2014.
  • [51] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proc. Eur. Conf. Comput. Vis., pages 391–405, 2014.
  • [52] Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, and X. Jiang. Learning discriminative and shareable features for scene classification. In Proc. Eur. Conf. Comput. Vis., pages 552–568, 2014.