Object parts that capture crucial characteristics of an image are important in a variety of object recognition and related applications. For instance, in the Deformable Part Model (DPM) , an object is modeled as a set of deformable parts organized in a tree structure. In relative attribute learning , local parts that are shared across categories are used to learn relative attributes. In fine-grained recognition , , distinctive parts such as the heads of birds are detected to enable part-based representation. Nevertheless, obtaining informative parts usually requires object-level  or even part-level annotations , which are tedious and costly to collect for large-scale datasets. Accordingly, it is desirable to discover these parts with minimal human supervision.
The success of the Convolutional Neural Network (CNN)  has shed light on the possibility of automatically discovering object parts. It has been revealed that  CNN filters at different layers are sensitive to patches with varying receptive fields, i.e., from low-level cues such as edges and corners in earlier layers to semantically meaningful parts or even whole objects in deeper layers. From the perspective of detection, the output of the convolutional layers can be interpreted as the detection scores of multiple detectors. In this sense, a CNN learns detectors relevant to the dataset it is trained on. However, since the network is trained with image-level classification losses, these detectors (the hidden layers) are trained only implicitly. As a result, the discriminative power of the CNN detectors is rather weak, producing activations with inhomogeneous appearances. Though a collection of such weak detectors boosts the representational ability, there is still room for improvement by enhancing these weak filters.
An alternative method of discovering informative parts automatically is to learn detectors explicitly, which we refer to as weakly supervised detector learning. As shown in Fig. 1, the standard approach for detector learning requires initial patterns (object parts) for detector initialization, and an optimization strategy for detector learning. However, learning part detectors automatically is a classical chicken-and-egg problem: without an accurate appearance model, examples of a part cannot be discovered, while an accurate appearance model cannot be learned without appropriate part exemplars. To solve this challenge, we need to address the following two crucial issues.
What are the right initial patterns? As the quality of the learned detectors depends heavily on initialization, it is crucial to select appropriate initial patterns. As noted in , such patterns should meet two criteria, i.e., representation and discrimination. Representation requires that such patterns frequently occur in images with the same label, while discrimination requires that they are seldom found in images not containing the object of interest. Unfortunately, algorithms aiming at finding such patterns are rather ad hoc and have limited performance. Most previous works , ,  start from unsupervised clustering such as k-means to initialize a part model. However, k-means behaves poorly in high dimensions since the distance metric becomes less meaningful, often producing clusters whose instances are inhomogeneous.
How to learn generalized detectors?
Given the initial patterns, most weakly supervised learning algorithms follow the pipeline of standard SVM training , or an iterative SVM optimization which alternates between training the classifier and choosing new positive samples , . Nevertheless, due to the uncertainty of the initial patterns, such optimization easily gets stuck in a local minimum. On the other hand, due to occlusion, illumination variation, and viewpoint variation, the same part from different images exhibits significant differences. As a result, such methods easily latch on to a few samples which are similar to the initial patterns, but generalize poorly. Thus, developing an optimization strategy under the weakly supervised paradigm is important for obtaining robust detection performance.
This paper proposes to learn a set of detectors in a weakly supervised paradigm, aiming at solving the above two issues. The main contribution is an iterative optimization strategy for detector learning, which we formulate as a confidence loss sparse Multiple Instance Learning (cls-MIL) task. Different from conventional MIL methods, which represent each positive image with a single instance and treat each image as equally important, cls-MIL represents each positive image as a sparse linear combination of its member instances and considers the diversity of the positive images, while avoiding drifting away from the well localized ones by assigning a confidence value to each positive image. The responses of the learned detectors form an effective mid-level image representation for recognition. Another interesting finding is that, different from most previous methods which treat image classification ,  and object localization , ,  separately, the proposed approach is able to effectively integrate the two tasks into a single framework. Benefiting from the powerful discriminative ability of the learned part detectors, the detector responses produced by our approach are able to indicate the locations of the objects. Experiments conducted on benchmark datasets demonstrate the superiority of the proposed representation.
As the detector learning procedure relies heavily on the initial patterns, a second novelty of our approach is the use of a spectral clustering technique for mining consistent and discriminative patterns. To this end, a selection strategy is first utilized to sample discriminative patches of the corresponding category, followed by exemplar-SVM  detector training for each sampled patch. Finally, these exemplar-SVM detectors are grouped via a spectral clustering strategy for pattern mining. Compared with traditional clustering methods, which operate on the original patches, the clustered detectors are able to focus on discriminative details, and a set of such grouped detectors offers an effective way to mine consistent patterns. Furthermore, an entropy coverage criterion is utilized to measure the discriminativeness of each cluster, which enables us to greedily select clusters for detector learning without worrying about choosing an appropriate number of clusters.
The rest of this paper is organized as follows. Sec. II reviews related works on weakly supervised detector learning. The details of our proposed detector learning method are elaborated in Sec. III. In Sec. IV, we apply the learned detectors for classification and localization. Experiments and discussions are given in Sec. V. Finally, Sec. VI concludes the paper.
II Related Works
Over the past years there has been a lot of research aiming at learning part models in an unsupervised or weakly supervised way. Most methods target improving two modules: pattern mining techniques for model initialization, and optimization strategies for detector learning. The learned part models offer a promising way for feature representation, which is beneficial for image recognition and other related applications. In the following, we organize the discussion of part model learning along these two aspects.
II-A Pattern Mining Methods
Since the ground truth annotations are not available in a weakly supervised paradigm, a number of strategies have been proposed to discover discriminative patches for model initialization. A simple method, taken in , , , starts by randomly sampling a large pool of patches and employs unsupervised clustering to generate initial patterns for detector learning. Such methods are clumsy, and most of the returned clusters have inhomogeneous appearances. Hence, many pattern mining techniques have been developed to offer better initialization. Song et al.  formulate a constrained submodular algorithm to identify discriminative configurations of visual patterns. Wang et al.  propose to discover these latent parts via a probabilistic latent semantic analysis on the windows of positive samples and further employ these clusters as sub-categories. Li et al.  combine the activations of a CNN with the association rule mining technique to discover representative mid-level patterns. Doersch et al.  formulate part discovery from the perspective of the well-known mean-shift algorithm to maximize the density ratio in the feature space. There is a special case in which we do not need to worry about exemplar alignment, i.e., a training set consisting of exactly one part instance . However, a detector trained from a single exemplar has limited discriminative power, and the number of detectors scales with the number of training samples, which is tremendous for large-scale datasets.
Different from previous approaches which aim at grouping the original patches, this paper performs clustering on the corresponding weak detectors and makes use of the grouped detectors for pattern mining. To generate weak detectors, a selection strategy is first utilized to sample discriminative patches, and each patch is associated with a detector via exemplar-SVM training. Though a single exemplar-SVM detector is weak, a collection of such detectors offers a relatively satisfactory localization capacity for pattern mining.
II-B Optimization for Detector Learning
Based on these discovered patterns, most methods employ an iterative learning approach to refine the detectors. Juneja et al.  employ an LDA-accelerated version  of the exemplar-SVMs , which reduces the training cost substantially compared with the standard SVM procedure that involves hard negative mining . However, the detectors are trained with only one positive instance, which results in limited discriminative power. Singh et al.  split the training set into two disjoint parts, and a part model is refined via an iterative procedure which alternates between clustering on one subset and training discriminative classifiers on the other to avoid overfitting. Parizi et al.  propose a joint training method which optimizes part models and class-specific weights iteratively. Sun et al.  propose a latent SVM model to learn detectors, which tends to select the discriminative parts by enforcing a group sparsity regularizer. However, these methods suffer from complex joint optimization, e.g.,  takes over five days to train detectors on MIT Indoor-67 .
The majority of related works treat weakly supervised detector learning as a Multiple Instance Learning (MIL) task, in which labels are assigned to bags (sets of patterns) instead of individual patterns. The positive bags are sets of instances containing at least one positive example, while the negative bags are sets of instances which are all negative. MIL was originally introduced to solve a problem in biochemistry , and a variety of MIL algorithms have been developed over the years. The simplest method is to transform MIL into a standard supervised learning problem by applying the bag's label to all instances in the bag . However, such a method assumes that positive examples are abundant in the positive bags. Andrews et al.  present a new formulation of MIL as a max-margin SVM problem. Bunescu et al.  develop an MIL method which is particularly effective when the positive bags are sparse. When applying MIL to detector learning, the detector is obtained by an iterative procedure which alternates between selecting the highest-scoring detection per bag as the positive instance and refining the detector model . However, such a simplified setting is sensitive to initialization and prone to getting stuck in a local minimum.
This paper also formulates weakly supervised detector learning as an MIL task. Different from previous works, we introduce a confidence loss term into the MIL problem when determining the classifier hyperplane. The key insight is that, due to occlusion, illumination variation, and viewpoint variation, it is suboptimal to treat instances from different bags as equally important for detector learning. The introduced confidence loss term measures the reliability of each instance for MIL learning. As a result, the detectors are able to focus on the more confident samples and downweight those with lower reliability. Furthermore, a cross-validation strategy is introduced to avoid overfitting the initial patterns.
II-C Mid-level Image Representation
A collection of detector responses can be used as a mid-level image representation. The paradigm is inspired by object bank , a pioneering work on using detector responses for image representation. The object bank represents an image as a scale-invariant response map of a large number of pre-trained generic object detectors. Following it, many methods employ detection scores as the image representation and improve the performance by incorporating part responses ,  or via multi-scale pooling , .
Over the past years, the CNN has become a powerful tool for image representation. Due to the domain mismatch between ImageNet (the source dataset on which the CNN is trained) and the target dataset, previous works attempt to enhance the CNN representation by transfer learning ,  or network fine-tuning , . However, these methods need substantial object / part annotations on the target dataset, which is tedious and impractical in real applications. Zhang et al.  propose an alternative method to fine-tune the network via saliency-based sampling, which is free of object annotations. Nevertheless, such a method is limited to datasets with relatively simple backgrounds (such as fine-grained datasets ). It may obtain limited performance improvement on datasets with complex scenes such as the Pascal VOC  datasets.
Our approach follows the pipeline of using detector responses as the feature representation. Different from previous works which learn a large number of detectors for classification , ,  or focus on learning a single detector for localization , , , , this paper integrates classification and localization into a single framework, i.e., we not only solve the problem of whether an object is present in an image, but also determine where the object (if it exists) is. We find that it is possible to use only a few detectors for both classification and localization if each detector is distinctive enough. Such an integrated framework is beneficial for closing the gap between these two tasks.
Our feature representation is also related to dictionary learning methods , , , where patches are encoded as a sparse linear combination of dictionary elements, optimized for image reconstruction  or recognition , . Compared with these approaches, this paper uses detectors as the dictionary elements (basis) and chooses detection responses as the combination coefficients.
III Learning Part Detectors
In this section, we aim to learn a collection of discriminative part detectors automatically for image representation. Our detector learning system consists of two modules: mid-level pattern mining and detector optimization. The pattern mining module first selects patches that are representative and discriminative, and an exemplar-SVM  detector is then trained from each selected patch. This is followed by a spectral clustering procedure which groups the exemplar-SVM detectors for pattern mining. Furthermore, an entropy coverage criterion is proposed to measure the generalization ability of each cluster. The detector optimization module formulates weakly supervised detector training as a confidence loss sparse MIL (cls-MIL) task, which considers the reliability of each positive sample by alternating between mining new positive samples and retraining the part model. The whole framework of the proposed approach is illustrated in Fig. 2. In the following, we present the detailed design of each module.
III-A Pattern Mining with Spectral Clustered Detectors
Discovering groups of mid-level patterns that are discriminative and representative is crucial for detector learning. To this end, we first introduce a sampling strategy aiming at selecting discriminative patches, and propose a detector-based spectral clustering approach to mine consistent patterns. Furthermore, we present an entropy coverage criterion to measure the discriminativeness of each cluster, which enables us to greedily select detectors for image representation. These steps are described as follows:
III-A1 Discriminative patch selection
It is a challenging task to find discriminative patches without object / part annotations. To address this issue, a sampling strategy is introduced to select discriminative and representative patches. Specifically, given an image, we first generate region proposals with edge boxes , and the final image representation is obtained by sum-pooling the features over the regions. A one-vs-all SVM classifier is then trained on the sum-pooled features. Benefiting from the non-negativity of CNN features and the additivity of the linear classifier, we select the patches which contribute significantly to the classification score. Specifically, given one category and its classification model, the discriminative patch set of an image is denoted as:
where the threshold (set to 1) enforces the selection of patches that are discriminative for classification.
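Because the classifier is linear and the CNN features are non-negative, the image-level score decomposes additively over patches, so each patch's contribution can be scored directly. A minimal sketch of this selection step follows; the function and variable names are our own, not from the paper:

```python
import numpy as np

def select_discriminative_patches(patch_feats, w, tau=1.0):
    """Select patches whose contribution to the linear classification
    score exceeds a threshold tau (set to 1 in the paper).

    patch_feats: (n_patches, d) non-negative CNN features of the proposals.
    w: weight vector of the one-vs-all SVM trained on sum-pooled features.
    """
    # The image score w . sum_i f_i decomposes additively over patches,
    # so each patch can be ranked by its own dot product with w.
    contributions = patch_feats @ w
    return np.where(contributions > tau)[0]
```

The returned indices point at the proposals retained as discriminative patches for the category.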
To avoid the classifier overfitting the training set, we equally divide the training set into disjoint and complementary subsets. The classifier is trained on all but one subset and validated on the remaining one. For generalization, only correctly classified images are retained for discriminative patch selection. Fig. 3 illustrates some discriminative patches selected on the Pascal VOC 2007 dataset. It can be seen that the selected patches mostly lie around the object of interest and avoid irrelevant background.
III-A2 Detector-based clustering
The patch selection process usually generates tens of thousands of patterns per category, and most of them are highly correlated, e.g., some patches describe the heads of dogs while others describe the legs of dogs. It is necessary to cluster these patterns into smaller, representative groups for detector initialization. To this end, an alternative is to employ some form of unsupervised clustering such as k-means , , . However, k-means behaves poorly in high-dimensional space, since the distance metric becomes less meaningful, and it often produces clusters whose instances are in no way visually similar. Instead of clustering the original patches, this paper proposes a detector-based spectral clustering strategy, which discovers similar patterns via the grouped detectors.
Inspired by exemplar-SVMs , we start by learning detectors from only one instance, which avoids worrying about exemplar misalignment. The negative samples are defined as patches which do not contain the object of interest, i.e., all patches sampled from images with different labels. Since the negative set is very large, the standard hard negative mining method  is quite expensive. We instead use Linear Discriminant Analysis (LDA)  to train a detector, which is an accelerated version of the exemplar-SVM. Specifically, the detector template is obtained simply by whitening the difference between the mean feature of the positive examples and the mean feature of the whole dataset, where the whitening uses the corresponding covariance matrix. Since each exemplar-SVM detector is supposed to fire only on visually similar examples, we cannot expect it to generalize much. To solve this issue, we follow an iterative procedure  which adds new positive samples each round to enhance the exemplar detectors. At each round, we run the current detector on all other images with the same label, and retrain it by augmenting the positives with the top-scoring patches. The idea behind this process is to use the detection score as a similarity metric, which emphasizes the distinctive details and suppresses irrelevant ones.
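The LDA-accelerated template amounts to solving one linear system per exemplar, since the dataset mean and covariance are shared across detectors. A small sketch, assuming the dataset statistics are precomputed (the ridge term is our addition for numerical stability):

```python
import numpy as np

def lda_detector(pos_feats, mean_all, cov_all, reg=0.0):
    """LDA template: w = Sigma^{-1} (mu_pos - mu_all).

    pos_feats: (n_pos, d) features of the exemplar plus its mined positives.
    mean_all, cov_all: mean and covariance of features over the whole dataset.
    reg: small ridge added to the covariance for numerical stability.
    """
    mu_pos = pos_feats.mean(axis=0)
    cov = cov_all + reg * np.eye(cov_all.shape[0])
    # Solving the linear system avoids explicitly inverting the covariance.
    return np.linalg.solve(cov, mu_pos - mean_all)
```

At each retraining round, one simply recomputes `mu_pos` over the augmented positive set; the expensive covariance factorization can be cached.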
Using exemplar-SVMs, each selected patch is associated with a detector. The key insight of the proposed strategy is that, instead of clustering the original patches, we group the corresponding detectors. Specifically, given the exemplar detectors trained from one class, we perform spectral clustering on the similarity matrix generated from the detectors, where each entry is the cosine similarity between a pair of detector templates, and obtain a set of clusters. Thus, detectors sharing similar response distributions are grouped together. Inspired by the boosting strategy , each cluster acts as an integrated detector to discover similar patterns, i.e., the detection score of a patch with respect to a cluster is obtained by aggregating the responses of the cluster's member detectors.
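The grouping step can be sketched as spectral clustering on the pairwise cosine similarities of the detector templates. The following self-contained sketch uses a plain normalized-Laplacian embedding with a simple k-means, standing in for whatever spectral clustering implementation the authors used:

```python
import numpy as np

def spectral_cluster_detectors(W, n_clusters, n_iter=50):
    """Group exemplar detectors (rows of W) via spectral clustering on
    their pairwise cosine similarity."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = np.clip(Wn @ Wn.T, 0.0, None)            # cosine affinity, negatives clipped
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    emb = vecs[:, :n_clusters]                   # smallest eigenvectors span the clusters
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    # plain k-means with deterministic farthest-point initialization
    centers = [emb[0]]
    for _ in range(1, n_clusters):
        dists = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((emb[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels
```

Detectors whose templates point in similar directions, and hence fire on similar patterns, end up with the same cluster label.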
As an illustration, Fig. 4 shows some examples of the patterns discovered using the clustered detectors. It can be seen that although a single detector is weak, a collection of such detectors offers a satisfactory localization capacity. Another advantage of the detector-based pattern mining method is that we can select the most discriminative and representative patterns according to the top responses of the grouped detectors.
III-A3 Entropy coverage
The detector-based clustering generates a series of clusters with varying discriminative capacities. The notion of a discriminative cluster is that the detectors within it should be trained from as many images as possible; such clusters include detectors corresponding to patterns repeated across varying images. We propose an entropy coverage criterion to measure the discriminativeness of each cluster. Given the images belonging to the same class and the corresponding clustered detectors, the entropy coverage of a cluster is defined over the distribution of the images its detectors are trained from, where p_i denotes the probability of the detectors coming from image i. The subterm H(p) = -Σ_i p_i log p_i is a standard entropy function, which enjoys the following property:
Corollary 1. Denote the entropy function as H(p) = -Σ_i p_i log p_i; then 0 ≤ H(p) ≤ log n for any probability distribution p = (p_1, ..., p_n).
Proof. For the left side, since 0 ≤ p_i ≤ 1, each term -p_i log p_i is non-negative, and hence H(p) ≥ 0.
The right side is obvious according to the maximum property of entropy, i.e., the entropy reaches its maximum log n when the n events are equiprobable.
According to Corollary 1, the entropy coverage is large if the clustered detectors within a cluster are trained from diverse images, and it reaches its maximum when the detectors are trained from patterns with an equal distribution over images. The larger the entropy coverage, the more frequent the patterns that the detectors in the cluster can find. Such an entropy coverage criterion enables us to greedily select clusters for detector initialization without worrying about choosing an appropriate number of clusters. In the experimental section, we will see that the optimal number of clusters is determined by the classification performance.
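Concretely, the entropy term of the criterion can be computed from the identities of the images that a cluster's detectors originate from; a minimal sketch (names are ours):

```python
import math
from collections import Counter

def entropy_coverage(source_images):
    """Entropy of the distribution of training images that a cluster's
    detectors originate from; larger values mean more diverse support."""
    counts = Counter(source_images)
    n = len(source_images)
    # -sum_i p_i log p_i over the empirical image distribution
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A cluster whose detectors all come from one image scores 0, while a cluster drawing equally from n distinct images scores log n, matching the bounds of Corollary 1.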
III-B Detector Optimization with cls-MIL
Although the grouped detectors offer a relatively robust localization capacity, they are far from sufficient. These detectors are trained from a subset of discriminative patches, and are only powerful at discovering patches which are similarly discriminative. Hence, we cannot ensure that they respond consistently among all the images of that class, especially those not correctly classified during cross-validation. Based on these observations, we formulate weakly supervised detector learning as a confidence loss sparse MIL (cls-MIL) task, which considers the diversity of the positive samples while avoiding drifting away from the well localized ones by assigning a confidence value to each mined positive sample.
To use MIL for detector learning, each image is considered as a bag, and the patches within it as instances. Given a set of training images, we treat images of one particular category as positive bags, and the rest as negative bags. Intuitively, for each image, if it is labeled as positive, then at least one patch within it should be treated as a positive instance; if it is labeled as negative, then all patches within it should be treated as negative instances. Standard MIL alternates between selecting the highest-scoring detection per bag as the positive instance and refining the detection model. However, it suffers from several issues. First, when training and selection are performed on the same dataset, the detectors latch on to the initial patches they were trained from and prefer them at each round of instance selection. Second, standard MIL often mines a single instance per positive bag and treats each mined instance as equally important, which is often not the case: due to occlusion, illumination variation, and viewpoint variation, the same part from different images carries a varying confidence of positiveness. Based on these observations, a multi-fold cross-validation  is introduced to avoid overfitting the initial training samples, and a confidence loss sparse MIL (cls-MIL) technique is proposed to tackle the dataset bias. In the following we define the problem in a formal way.
III-B2 Problem formulation
Let B be the set of bags used for training, which consists of a set of positive bags B+ and negative bags B-, i.e., B = B+ ∪ B-. Each bag corresponds to one image, and we write X+ and X- for the sets of instances drawn from positive bags and negative bags, respectively. For any instance from a bag, let x be its feature vector representation (for brevity, we include the bias term in the feature representation). The cls-MIL problem can be formulated as solving the following objective:
where φ(B) is the feature representation of bag B, c_B is the latent variable which measures the positiveness of bag B, and λ is the control parameter of the confidence loss term.
One remaining issue is how to determine the bag representation φ(B). We would prefer that a positive bag be represented as much as possible by the true positives within it. However, even the state-of-the-art region proposal algorithms  can only generate patches containing the object of interest with a high recall, not to mention the difficulty of determining the positive samples under a weakly supervised paradigm. To tackle this issue, we introduce a pooling strategy for the representation to improve robustness. Note that among all the given region proposals, only a few instances are the patterns we expect to find, i.e., the selection is sparse. Based on these observations, each bag is represented as the weighted sum of its mined member instances, where each selected instance receives a weight and an indicator denotes the patterns chosen as the positive "witnesses" in a positive bag. In practice, only a few instances per positive bag are selected (we set the number of witnesses to 10), while all the negative instances are taken into consideration.
The cls-MIL objective leads to a non-convex optimization problem due to the introduction of the implicit feature representation for the positive bags and the latent confidence variables. However, the problem is semi-convex, since the optimization becomes convex once these latent variables are fixed. In the following, we solve Eq. (5) via an iterative procedure which alternates between fixing the latent variables and optimizing the detectors. To avoid focusing on the initial positive samples, the optimization is carried out via cross-validation. Specifically, the training set is equally divided into disjoint and complementary subsets. Starting from the patterns discovered by the clustered exemplar-SVM detectors, the detector is optimized by iteratively Updating the latent variables and Optimizing Eq. (5). In the Updating step, the latent variables of one subset are determined by detectors trained on the remaining subsets: the instance weights are updated from the detector responses, and the confidence loss term is computed via a sigmoid function, which maps the detection score into the range of (0, 1). In the Optimizing step, the detector is optimized according to the updated latent variables via hard negative mining .
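One Update/Optimize round of this procedure can be sketched as follows. The witness selection, sparse weighted-sum bag representation, and sigmoid confidences follow the description above; the SVM retraining with hard negative mining is replaced here by a confidence-weighted ridge step purely for self-containedness, and all names are our own:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cls_mil_round(w, pos_bags, neg_feats, m=10, lam=1.0):
    """One Update/Optimize round of a cls-MIL sketch.

    pos_bags: list of (n_i, d) arrays, the instances of each positive image.
    neg_feats: (n_neg, d) array of negative instances.
    Returns the retrained detector and per-bag confidences.
    """
    reps, confs = [], []
    for inst in pos_bags:
        scores = inst @ w
        top = np.argsort(scores)[::-1][:m]        # mined positive "witnesses"
        a = np.maximum(scores[top], 0.0)
        a = a / (a.sum() + 1e-12)                 # instance weights
        reps.append(a @ inst[top])                # sparse weighted-sum bag feature
        confs.append(sigmoid(scores[top].max()))  # confidence of the bag
    X = np.vstack([np.array(reps), neg_feats])
    y = np.hstack([np.ones(len(reps)), -np.ones(len(neg_feats))])
    c = np.hstack([np.array(confs), np.ones(len(neg_feats))])
    # confidence-weighted ridge step standing in for the SVM retraining
    Xw = X * c[:, None]
    w_new = np.linalg.solve(Xw.T @ X + lam * np.eye(X.shape[1]), Xw.T @ y)
    return w_new, np.array(confs)
```

In the cross-validated setting, `w` would come from detectors trained on the other folds, so the witnesses of a fold are never selected by a model fit on that same fold.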
Corollary 2. The solution of Eq. (5) is a linear combination of the positive instances and the negative instances, where the combination coefficients of the positive and negative instances are non-negative and bounded from above.
Proof. The constrained minimization problem in Eq. (5) can be solved with the classical Lagrangian method. The Lagrangian can be written as:
where the introduced non-negative variables denote the Lagrange multipliers. Minimizing the Lagrangian in Eq. (6) with respect to the detector weights yields:
Due to the non-negativity of the Lagrange multipliers, the combination coefficients of the positive and negative instances are non-negative. Given a test example, the detection score can be represented as:
It can be seen that the final detection score is a weighted combination of the inner products between the training features and the test feature, and is determined only by samples with nonzero coefficients. These samples are called support vectors, since they are the only training samples necessary to define the separating hyperplane. Note that for positive samples the coefficient is bounded from above; by the KKT conditions, an example is a support vector only if it lies on the margin or violates the separation conditions in Eq. (5). According to , the coefficient for positive samples in different locations is defined as:
For positive bags which do not respect the classification hyperplane, the corresponding coefficient is bounded by a term that takes the reliability of the bag into consideration. This regularized term helps to boost the detection performance: if a positive bag is not reliable at the previous round, its contribution to the classification hyperplane at the current round is lowered. As a result, MIL introduces diverse samples for detector learning, while the confidence loss term encourages the detector to focus on positive instances which are good enough and to downweight instances with lower reliability. The whole procedure of the proposed weakly supervised detector learning algorithm is summarized in Algorithm 1.
IV Applications: Image Classification and Object Localization
The learned detectors are discriminative for the corresponding category, and an ensemble of detectors across different categories offers an effective mid-level image representation. In this section, we apply this mid-level representation to image classification and object localization.
IV-A Image Classification
Unsupervised clustering methods have been used for feature representation , . Since our learned detectors can be considered as true visual patterns corresponding to a certain category (as opposed to the ambiguous clustered visual letters in , ), it makes sense to apply such detectors for image coding. Denote all the learned detectors across different categories as a set of K detectors, where K is the total number of detectors. Our mid-level feature representation is based on the maximal responses of this collection of detectors. Specifically, given an image and its corresponding region proposals, the k-th dimension of the feature representation is the maximal response of the k-th detector over all the region proposals, with a latent variable indicating the region that attains this maximum. An illustration of the image representation is shown in Fig. 5.
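The coding step is then a single matrix product followed by a max over proposals. A minimal sketch (names are ours):

```python
import numpy as np

def detector_response_features(proposal_feats, detectors):
    """Mid-level representation: the k-th dimension is the maximal response
    of the k-th detector over all region proposals of the image.

    proposal_feats: (n_proposals, d); detectors: (K, d). Returns shape (K,).
    """
    responses = proposal_feats @ detectors.T      # (n_proposals, K) score table
    return responses.max(axis=0)                  # max-pool over proposals
```

Keeping `np.argmax(responses, axis=0)` alongside the max recovers the latent variable, i.e., which proposal fired for each detector, which is what the localization stage reuses.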
Given the image representation, a conventional SVM classifier is trained to produce the final classification results. Note that the complexity of the feature coding using detector responses is very low: once the features (e.g., CNN) are extracted, it involves no more than a dot product operation. On the other hand, we greedily select detector responses based on the entropy coverage criterion, and find that the performance saturates as the first few detectors are added, which decreases the feature dimension by an order of magnitude. In the experimental section, we will demonstrate the effectiveness of the proposed feature coding approach.
IV-B Object Localization
The learned part detectors are discriminative for the corresponding category, and a collection of them offers a rough position of the object of interest. In this section, we present a simple object localization technique based on the learned part detectors. The basic idea is to accumulate the part responses into a whole-object heat map, which indicates the potential position of an object. Specifically, starting from a collection of part detectors corresponding to a category, we first define a part map for each detector; the confidence that a pixel is contained in an object part is denoted as:
where  denotes the patch set that includes pixel ,  is a sigmoid function, and  is a normalization constant such that . Finally, the object map is a weighted linear combination of the part maps of all part detectors, i.e., , where  is a weight factor denoting the reliability of each detector, given by . Fig. 5 illustrates examples of how the object heat maps are computed.
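A sketch of this accumulation is given below. A max over the overlapping patches covering each pixel is assumed for the within-patch pooling (the paper's exact formula is not reproduced here), and the box format and names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def object_heat_map(boxes, detector_scores, weights, img_h, img_w):
    """Accumulate part-detector responses into an object heat map.

    boxes: list of (x1, y1, x2, y2) region proposals (integer pixel coords)
    detector_scores: (K, R) array of raw responses of K part detectors
    weights: (K,) reliability of each detector (assumed to sum to 1)
    """
    heat = np.zeros((img_h, img_w))
    for k, w_k in enumerate(weights):
        part_map = np.zeros((img_h, img_w))
        for (x1, y1, x2, y2), s in zip(boxes, detector_scores[k]):
            # each pixel inside the patch receives the detector confidence;
            # overlapping patches are pooled with a max
            conf = sigmoid(s)
            part_map[y1:y2, x1:x2] = np.maximum(part_map[y1:y2, x1:x2], conf)
        part_map /= part_map.sum() + 1e-12   # normalization constant
        heat += w_k * part_map               # weighted linear combination
    return heat
```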
The object heat map indicates the most discriminative details of an object, and usually focuses on object parts (e.g., the head of dogs) rather than the whole object. Inspired by , which casts localization as a segmentation task, we perform GrabCut on the object heat map to generate a segmentation mask. The goal is to propagate the discriminative part details to the whole object using color continuity cues. To this end, the foreground and background are modeled as Gaussian mixture models: the foreground model is estimated from heat map values higher than , and the background model from values lower than . Finally, we take the bounding box covering the largest connected component of the generated segmentation mask as the localization result. Some example localization processes are shown in Fig. 6. In the experimental section, we will show that, as a byproduct of the learned discriminative detectors, this localization technique achieves satisfactory performance.
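The final step, taking the bounding box of the largest connected component of the binary segmentation mask, can be sketched as follows (a pure-Python flood fill for clarity; in practice a library routine such as OpenCV's connected-components would be used):

```python
import numpy as np
from collections import deque

def largest_component_bbox(mask):
    """Return (x1, y1, x2, y2) of the largest 4-connected foreground
    component in a binary mask (assumed non-empty), as used to turn the
    segmentation into a localization result."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    best, best_size = None, 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                # BFS flood fill of one connected component
                q, comp = deque([(sy, sx)]), []
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) > best_size:
                    best_size, best = len(comp), comp
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return (min(xs), min(ys), max(xs), max(ys))
```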
In this section, we present an evaluation of the proposed weakly supervised image classification and object localization framework. We also perform an ablation study to understand how various design choices impact the recognition performance.
V-A Datasets and Evaluation Metrics
We evaluate the proposed approach on three publicly available benchmarks, two for generic object recognition and one for scene recognition. The details of the datasets are briefly summarized as follows:
Pascal VOC 2007: The Pascal VOC 2007 dataset  is a widely used benchmark for multi-label image classification and object localization. The benchmark contains a total of 9,963 images spanning 20 generic object classes, of which 5,011 images are used for trainval and the remaining 4,952 images for test. For image classification, we use the trainval split as the training set and the test split as the test set, and the evaluation metric is mean Average Precision (mAP), in compliance with the Pascal VOC challenge protocol.
Pascal VOC 2012: The Pascal VOC 2012 dataset  is an extended version of Pascal VOC 2007, containing a total of 22,531 images, including 11,540 images for trainval and 10,991 images for test. Since ground truth labels are not available for the test split, we use the online evaluation server to evaluate the recognition performance of the proposed algorithm.
MIT Indoor-67: The MIT Indoor-67 dataset  consists of 15,620 images belonging to 67 categories of indoor scenes. It is challenging because of the large ambiguity between categories. We follow the standard train/test split as in , i.e., approximately 80 images per class for train and 20 images per class for test. The evaluation metric for MIT Indoor-67 is the mean classification accuracy.
In addition to classification, we also evaluate the localization performance of the proposed approach. We follow previous methods on object localization , , and evaluate the performance on the Pascal VOC trainval set with the CorLoc criterion . CorLoc measures the percentage of images with correct localization, i.e., a window is considered correct if it has an Intersection-over-Union (IoU) ratio of at least  with one of the ground truth instances.
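The CorLoc criterion can be made concrete with a short sketch; the 0.5 default below is the IoU threshold conventionally used with CorLoc:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, thresh=0.5):
    """Fraction of images whose predicted window overlaps at least one
    ground-truth instance with IoU >= thresh."""
    correct = sum(
        any(iou(p, gts_i) >= thresh for gts_i in gts)
        for p, gts in zip(predictions, ground_truths)
    )
    return correct / len(predictions)
```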
| No. of proposals | 300 | 500 | 1000 | 2000 |
V-B Implementation Details
-layer model). We extract features from the fc6 layer (FC-CNN) after the rectified linear unit (ReLU), which yields a -d nonnegative vector for each region. Edge boxes  are used to generate candidate region proposals. In addition to region proposals, edge boxes also provide an objectness score for each region. For computational efficiency, we discard regions that occupy less than  of the image area, and retain the top  scored region proposals as candidates.
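The proposal filtering step can be sketched as follows; the area fraction and top-k values shown are placeholders, since the paper's exact thresholds are not reproduced here:

```python
def filter_proposals(boxes, scores, img_area, min_area_frac=0.01, top_k=500):
    """Discard proposals occupying too small a fraction of the image,
    then keep the top-scored ones (thresholds are illustrative).

    boxes: list of (x1, y1, x2, y2); scores: per-box objectness scores.
    """
    kept = [(s, b) for s, b in zip(scores, boxes)
            if (b[2] - b[0]) * (b[3] - b[1]) >= min_area_frac * img_area]
    kept.sort(key=lambda t: t[0], reverse=True)   # highest objectness first
    return [b for _, b in kept[:top_k]]
```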
Parameter settings. In pattern mining, the number of spectral clusters per category is set to , and the top  scored patches per cluster are selected as patterns for detector initialization. In detector optimization, the number of iterations is set to , as we find that the detectors do not need more iterations to converge. In all situations where cross-validation is needed, we use standard -fold cross-validation.
V-C Ablation Study
To better understand the relative contribution of each module, we analyze the performance of our approach with different configurations. As the localization can be regarded as a byproduct of the learned detectors, we mainly measure how different designs affect the discriminativeness of the detectors in terms of classification performance.
V-C1 Number of detectors
An advantage of the proposed approach is that detectors are trained from patterns with different coverage entropies. This enables us to greedily select detectors based on the entropy coverage criterion. As shown in Fig. 8, we add detectors in order to probe how the number of detectors affects the classification performance. Note that the performance improves quickly when only a few detectors are used (e.g., from 1 to 10), then tends to be stable and even drops slightly when more detectors are added. This is mainly because the subsequent detectors are not discriminative enough for classification. For computational efficiency, we fix the number of detectors ( per category for VOC 2007 and  per category for MIT Indoor-67) in the following experiments.
V-C2 Number of region proposals
To probe the performance with respect to the number of candidate region proposals, we vary the number of region proposals across several settings. Table I shows the results on VOC 2007. The performance is relatively stable (from 2000 down to 300 region proposals, only a  drop). Considering both performance and computational efficiency, we set the number of region proposals to .
V-C3 Effects of different modules
We now compare the results under different configurations to analyze how each module affects the final classification performance. The different modules are summarized as follows:
BL: This is the baseline method, which directly max-pools multiple region proposal features for classification. It is introduced to help understand how the proposed approach improves the discriminative power of the detectors.
PM / KM: PM denotes the proposed pattern mining method in Sec. III A, while KM is the standard -means clustering method widely used for detector initialization in previous algorithms , , . For fair comparison, we perform -means clustering on the selected patches with the number of clusters set to .
MIL / cls-MIL: MIL stands for the standard multiple instance learning method, which mines new positive samples without considering the confidence of each bag, while cls-MIL is the confidence loss sparse MIL detector optimization strategy proposed in Sec. III B.
As shown in Fig. 8, both -means and multiple instance learning help to improve the classification performance, albeit with limited gains. The proposed pattern mining and cls-MIL methods surpass their counterparts consistently, e.g., pattern mining improves the accuracy from  (-means) to , and cls-MIL obtains an accuracy improvement of  ( vs ) compared with standard MIL. We also find that detector initialization matters greatly for multiple instance learning, even for the modified cls-MIL ( with -means, and  with pattern mining). This is widely discussed in previous approaches, which aim to develop efficient pattern mining methods ,  for detector initialization. However, few works emphasize detector optimization. We demonstrate that both modules are essential, and combining them achieves considerable performance improvement.
V-D Image Classification
| Method | Dim. | Acc. (%) |
| FC-CNN CaffeNet  | 4K | 60.3 |
| MR-CNN CaffeNet  | 4K | 65.1 |
| Clustered Detectors CaffeNet | 2K | 66.3 |
V-D1 Object Recognition
Tables II and III show the object recognition results of the proposed approach on the Pascal VOC 2007 and 2012 test splits, respectively. For fair comparison, we extract CNN features from multiple region proposals and max-pool the region features into a final representation, which we refer to as MR-CNN. The only difference between MR-CNN and our method is then the detectors, since they use the same region proposals. From Table II we can see that the proposed detectors improve the classification performance considerably, achieving accuracies of  with CaffeNet and  with the very deep model, which bring  and  gains compared with directly using CNN features.
Many previous approaches report classification results on the Pascal VOC dataset, and we compare our results with the most recent ones. Most previous approaches that achieve high classification results are based on network fine-tuning , , . Since network fine-tuning is hard for multi-label images, previous works  rely on object annotations to find category-specific patches. In , the authors proposed a weakly supervised classification framework via two steps of network fine-tuning, but it makes use of additional training data, which is more demanding. Our result () is slightly better than the best performing one () , demonstrating that traditional optimization approaches can achieve competitive results with CNN fine-tuning. Furthermore, the proposed features are complementary to CNN features, achieving an accuracy of  when combined. For VOC 2012, our method obtains an accuracy of , which is slightly worse than  () that makes use of additional training images. The reason is that CNN-based methods become more powerful as the training data grow, while MIL-based methods are relatively robust to the amount of data.
V-D2 Scene Recognition
Table IV compares the recognition results on MIT Indoor-67. MR-CNN denotes max-pooling multiple region features for representation, and FC-CNN refers to directly extracting a single global feature from the whole image. Clustered detectors denotes the method that relies on the responses of the clustered exemplar-SVM detectors as features. From Table IV we observe the following:
MR-CNN is much better than FC-CNN. Using the CaffeNet model, the accuracy is  with MR-CNN, and  with FC-CNN. This demonstrates that local features are crucial for scene recognition.
The features using clustered detector responses () are better than MR-CNN (), even with half the dimension (2K versus 4K). This is mainly because the CNN is primarily trained on object-centric images rather than scene-centric ones. As a result, the weak exemplar-SVM detectors still outperform MR-CNN thanks to the data-specific representation.
The proposed EPD is much better than the features with clustered responses. Benefiting from the detector optimization strategy, our method obtains an accuracy of , which brings a  improvement compared with the clustered responses. The performance is boosted to  when switching to the very deep model. Another observation is that the proposed features are complementary to CNN features, achieving an accuracy of  when combined.
Some other approaches also aim at learning discriminative part detectors for recognizing indoor scenes. The method of  integrates detector learning and classification by joint training, and  poses mid-level pattern discovery as discriminative mode seeking by developing an extension of the classic mean-shift algorithm to density ratio estimation. Our method is closely related to , which also makes use of CNN activations for pattern mining. Our method achieves a better result compared with the best performing method ( vs ). Many algorithms employ multiple region pooling for the final feature representation. A typical representation is MOP-CNN , which uses VLAD to encode CNN activations into a bag-of-words representation and achieves an accuracy of ; our result () is comparable with  using the same model, but with a much lower dimension (2K vs 13K).
V-D3 Visualizing Mid-level Patterns
As an illustration, Fig. 9 shows some discovered patterns on the VOC 2007 (top row) and MIT-67 (bottom row) test splits. We show the highest activation region per image, which offers a clue as to why it is classified as the corresponding category. Specifically, given a test image and the category label the image is classified with (whether correct or not), we employ category-specific detectors to find which region responds most to the given category, and show some patches that the detector is trained on. For correctly classified images, there often exist discriminative patches that respond significantly to the corresponding detectors, e.g., on VOC 2007, the head of a train is important for recognizing trains, and the upper body of a person is important for recognizing persons. Similar results can be found on MIT-67: it is the pillars of a cloister that make it look like a cloister, and the slide rail that makes bowling look like bowling. It is also helpful to investigate why incorrect results happen: on VOC 2007, a classifier misclassifies a chair as a plant, or a horse as a bicycle, probably because corresponding details exist, e.g., the wheel of a carriage is similar to bicycle wheels. Similar results can be found on MIT-67, where the window of an office is misclassified as the bars of a baby bed, which are most discriminative for recognizing nursery. These details do look similar, and are hard to recognize correctly. Nevertheless, these observations offer a direction for further improving the recognition performance.
V-E Object Localization
Table V shows the object localization results on the Pascal VOC 2007 trainval split. Benefiting from the learned part detectors, the proposed localization strategy () is better than recent methods that are specifically designed for localization , , and is comparable with  (), which uses latent category learning for object localization. Another observation is that, unlike for recognition, using a deeper model does not bring localization improvement (46.9%). This can be explained by the fact that deeper models frequently focus on parts of the object instead of the whole object. Note that all these compared methods are designed for localization and often make use of context information for better localization, while we rely on detectors learned for classification to uncover the connection between these two basic tasks. The results demonstrate that image classification and localization can be done simultaneously.
V-E1 Localization Error Analysis
To better understand the localization errors, following , , we categorize the errors to uncover the pros and cons of our localization method. Each predicted bounding box is assigned to one of the following five cases: 1) correct localization, the IoU overlap with the ground truth is greater than ; 2) the hypothesis is completely inside the ground truth; 3) the ground truth is completely inside the hypothesis; 4) no overlap, the IoU equals zero; and 5) low overlap, none of the above. Fig. 10 shows the error distribution of the proposed method across categories on the Pascal VOC 2007 trainval set. Among the failure modes, the most frequent one is that an object part is localized instead of the whole object. This is intuitive since, in most situations, correct classification only requires capturing local discriminative details.
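The five-way categorization can be sketched as a small helper; the string labels and the 0.5 default threshold are illustrative:

```python
def localization_error_type(pred, gt, thresh=0.5):
    """Classify a predicted box against one ground-truth box into the five
    error-analysis cases (boxes are (x1, y1, x2, y2))."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    overlap = inter / (area(pred) + area(gt) - inter)
    inside = lambda a, b: (a[0] >= b[0] and a[1] >= b[1]
                           and a[2] <= b[2] and a[3] <= b[3])
    if overlap >= thresh:
        return 'correct'
    if inside(pred, gt):
        return 'hypothesis inside ground truth'
    if inside(gt, pred):
        return 'ground truth inside hypothesis'
    if overlap == 0:
        return 'no overlap'
    return 'low overlap'
```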
V-E2 Visualizations and Limitations
Fig. 11 shows some localization results on the Pascal VOC 2007 trainval split. The correct localizations are marked with red bounding boxes, while the failed ones are marked with green. It can be seen that the proposed localization method is able to find objects when there is only one object of a category, but falls short when localizing multiple objects of the same category. In fact, this is the main challenge for weakly supervised localization , and is a promising direction for future research.
V-E3 Classification versus Localization
Comparing classification (Table II) with localization (Table V), we find that the least successfully recognized objects are bottle () and chair (), which are also hard to localize ( and ). This is because they usually occupy a small fraction of the image and appear within cluttered backgrounds. The exception is person, which suffers a low localization accuracy () but enjoys a high recognition accuracy (). This can be explained by the fact that a person is easily recognized by the face, and there usually exist multiple persons in an image, which offers abundant cues for recognition. In contrast, localization fails when focusing on the face, and it is hard to tell apart individual persons in a crowd.
In this paper, we propose an effective mid-level image representation approach for visual applications. The proposed framework aims at learning a collection of discriminative part detectors in a weakly supervised paradigm, which needs only the labels of training images, without any object or part annotations. Our approach tackles several key issues in automatic part detector learning. First, we propose an efficient pattern mining technique via spectral clustering of exemplar-SVM detectors. Second, we formulate detector learning as a confidence loss sparse MIL (cls-MIL) task, which considers the diversity of the positive instances while avoiding drifting away from the well localized ones by assigning a confidence value to each positive instance. The proposed method shows notable performance improvements on several recognition benchmarks. Furthermore, we simultaneously consider classification and localization based on the learned detectors, and find that the accumulated responses of part detectors offer satisfactory localization performance, which bridges these two widely studied visual tasks.
-  S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Proc. Neur. Info. Pro. Sys., pages 561–568, 2002.
-  H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In Proc. Eur. Conf. Comput. Vis., pages 836–849. 2012.
-  H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with convex clustering. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1081–1089, 2015.
-  H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 2846–2854, 2016.
-  M. Blaschko, A. Vedaldi, and A. Zisserman. Simultaneous object detection and ranking with weak supervision. In Proc. Neur. Info. Pro. Sys., pages 235–243, 2010.
-  R. C. Bunescu and R. J. Mooney. Multiple instance learning for sparse positive bags. In Int. Conf. Mach. Learn., pages 105–112, 2007.
-  R. G. Cinbis, J. Verbeek, and C. Schmid. Multi-fold mil training for weakly supervised object localization. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 2409–2416, 2014.
-  Large scale machine learning. Technical report, IDIAP, 2004.
-  T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. Int. J. Compt. Vis., 100(3):275–293, 2012.
-  T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artifi. Intell., 89(1):31–71, 1997.
-  C. Doersch, A. Gupta, and A. A. Efros. Mid-level visual element discovery as discriminative mode seeking. In Proc. Neur. Info. Pro. Sys., pages 494–502, 2013.
-  M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process., 15(12):3736–3745, 2006.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis., 111(1):98–136, 2015.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.
-  P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1–8, 2008.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 580–587, 2014.
-  Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proc. Eur. Conf. Comput. Vis., pages 392–407. 2014.
-  B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In Proc. Eur. Conf. Comput. Vis., pages 459–472. 2012.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
-  M. Juneja, A. Vedaldi, C. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 923–930, 2013.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Neur. Info. Pro. Sys., pages 1097–1105, 2012.
-  L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Proc. Neur. Info. Pro. Sys., pages 1378–1386, 2010.
-  Y. Li, L. Liu, C. Shen, and A. v. d. Hengel. Image co-localization by mimicking a good detector’s confidence score distribution. arXiv preprint arXiv:1603.04619, 2016.
-  Y. Li, L. Liu, C. Shen, and A. van den Hengel. Mid-level deep pattern mining. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 971–980, 2015.
-  J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1–8, 2008.
-  T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-svms for object detection and beyond. In Proc. Int. Conf. Comput. Vis., pages 89–96, 2011.
-  M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1717–1724, 2014.
-  M. Oquab, L. Bottou, I. Laptev, J. Sivic, et al. Weakly supervised object recognition with convolutional neural networks. In Proc. Neur. Info. Pro. Sys., 2014.
-  S. N. Parizi, A. Vedaldi, A. Zisserman, and P. Felzenszwalb. Automatic discovery and optimization of parts for image classification. arXiv preprint arXiv:1412.6598, 2014.
-  O. M. Parkhi, A. Vedaldi, C. Jawahar, and A. Zisserman. The truth about cats and dogs. In Proc. Int. Conf. Comput. Vis., pages 1427–1434, 2011.
-  F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In Proc. Eur. Conf. Comput. Vis., pages 143–156. Springer, 2010.
-  A. Quattoni and A. Torralba. Recognizing indoor scenes. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 413–420, 2009.
-  S. Ray and M. Craven. Supervised versus multiple instance learning: An empirical comparison. In Int. Conf. Mach. Learn., pages 697–704, 2005.
-  C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graphics, 23(3):309–314, 2004.
-  R. Sandeep, Y. Verma, and C. Jawahar. Relative parts: Distinctive parts for learning relative attributes. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 3614–3621, 2014.
-  K. J. Shih, I. Endres, and D. Hoiem. Learning discriminative collections of part detectors for object recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(8):1571–1584, 2015.
-  X. Shu, G.-J. Qi, J. Tang, and J. Wang. Weakly shared deep transfer networks for heterogeneous domain knowledge propagation. In ACM Multimedia, pages 35–44, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  S. Singh, A. Gupta, and A. Efros. Unsupervised discovery of mid-level discriminative patches. In Proc. Eur. Conf. Comput. Vis., pages 73–86, 2012.
-  H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. In Proc. Neur. Info. Pro. Sys., pages 1637–1645, 2014.
-  J. Sun and J. Ponce. Learning discriminative part detectors for image classification and cosegmentation. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 3400–3407, 2013.
-  P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comput. Vis., 57(2):137–154, 2004.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
-  C. Wang, K. Huang, W. Ren, J. Zhang, and S. Maybank. Large-scale weakly supervised object localization via latent category learning. IEEE Trans. Image Process., 24(4):1371–1385, 2015.
-  J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 3360–3367, 2010.
-  Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. Hcp: A flexible cnn framework for multi-label image classification. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1901–1907, 2016.
-  S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In Proc. Neur. Info. Pro. Sys., pages 3122–3130, 2012.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. Eur. Conf. Comput. Vis., pages 818–833, 2014.
-  X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In Proc. IEEE Comput. Vis. Pattern Recognit., pages 1134–1142, 2016.
-  X. Zhang, H. Xiong, W. Zhou, and Q. Tian. Fused one-vs-all mid-level features for fine-grained visual categorization. In ACM Multimedia, pages 287–296, 2014.
-  C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proc. Eur. Conf. Comput. Vis., pages 391–405. 2014.
-  Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, and X. Jiang. Learning discriminative and shareable features for scene classification. In Proc. Eur. Conf. Comput. Vis., pages 552–568. 2014.