Papers with code. Sorted by stars. Updated weekly.
Weakly supervised object detection is a challenging task when provided with image category supervision but required to learn, at the same time, object locations and object detectors. The inconsistency between the weak supervision and learning objectives introduces significant randomness to object locations and ambiguity to detectors. In this paper, a min-entropy latent model (MELM) is proposed for weakly supervised object detection. Min-entropy serves as a model to learn object locations and a metric to measure the randomness of object localization during learning. It aims to principally reduce the variance of learned instances and alleviate the ambiguity of detectors. MELM is decomposed into three components: proposal clique partition, object clique discovery, and object localization. MELM is optimized with a recurrent learning algorithm, which leverages continuation optimization to solve the challenging non-convexity problem. Experiments demonstrate that MELM significantly improves the performance of weakly supervised object detection, weakly supervised object localization, and image classification over state-of-the-art approaches.
Supervised object detection has made great progress in recent years [1, 2, 3, 4, 5, 6], as concluded in the object detection survey [7]. This can be attributed to the availability of large datasets with precise object annotations and, especially, to deep neural networks capable of absorbing the annotation information. Nevertheless, annotating a bounding box for each object in large datasets is laborious, expensive, or even impractical. It is also inconsistent with cognitive learning, which requires solely the presence or absence of a class of objects in a scene, rather than bounding boxes that indicate the precise locations of all objects.
Weakly supervised learning (WSL) refers to methods that rely on training data with incomplete annotations to learn recognition models. Weakly supervised object detection (WSOD) requires solely the image-level annotations indicating the presence or absence of a class of objects in images to learn detectors [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]. It can leverage rich Web images with tags to learn object-level models.
To tackle the WSOD problem, existing approaches often resort to latent variable learning or multi-instance learning (MIL), using redundant object proposals as inputs. The learning objective is designed to choose a true instance from the redundant object proposals of each image so as to minimize the image classification loss. Because object-level annotations are unavailable, WSOD approaches must collect instances from redundant proposals while learning detectors that compromise among the appearances of various objects. This typically requires solving a non-convex model and is therefore challenged by the local minimum problem.
In the learning procedure of weakly supervised deep detection networks (WSDDN) [22], a representative WSOD approach, this problem has been observed, i.e., the collected instances switch among different object parts with great randomness, Fig. 1. Various object parts were capable of minimizing the image classification loss, but experienced difficulty in optimizing object detectors due to their appearance ambiguity. Recent approaches have used image segmentation [30, 28], context information [24], and instance classifier refinement [27] to empirically regularize the learning procedure. However, the issues of principally reducing localization randomness and alleviating the local minimum remain unresolved.
In this paper, we propose a clique-based min-entropy latent model (MELM; source code is available at https://github.com/WinFrand/MELM) to collect instances with minimum randomness, motivated by a classical thermodynamic principle: minimizing entropy results in minimum randomness of a system. Min-entropy is used as a model to learn object locations and a metric to measure the randomness of localization during learning. MELM consists of three components: (1) instance (object and object part) collection with a clique partition module; (2) object clique discovery with a global min-entropy model; (3) object localization with a local min-entropy model, Fig. 2. A clique is defined as a set of object proposals which are spatially related (i.e., overlapping with each other) and class related (i.e., having similar object class scores), Fig. 3. The introduction of proposal cliques facilitates reducing the redundancy of region proposals and optimizing the min-entropy models.
With the clique partition module and min-entropy models, we can collect instances with minimum randomness, activate true object extent, and suppress object parts, Fig. 1. MELM is deployed as a clique partition module and network branches for object clique discovery and object localization on top of a deep convolutional neural network (CNN). Based on the global and local min-entropy models, we adopt a recurrent strategy to train detectors and pursue true object extent using solely image-level supervision. This is based on the prior that in deep networks the image classification task and the object detection task are highly correlated, which allows MELM to recurrently transfer the weak supervision, i.e., image category annotations, to object locations. By accumulating multiple iterations, MELM discovers multiple objects, if such exist, from a single image.
MELM was first proposed in our CVPR paper [31] and is extended both theoretically and experimentally in this full version. The contributions of this paper include: (1) A min-entropy latent model that is integrated with deep networks to effectively collect instances and principally minimize the localization randomness during weakly supervised learning. (2) A clique partition module that facilitates instance collection, object extent activation, and object part suppression. (3) A recurrent learning algorithm that formulates image classification and object detection as a predictor and a corrector, respectively, and leverages continuation optimization to solve the challenging non-convexity problem. (4) State-of-the-art performance of weakly supervised detection, localization, and image classification.
The remainder of this paper is organized as follows. Related works are described in Section 2 and the proposed method is presented in Section 3. Experimental results are given in Section 4. We conclude this paper in Section 5.
WSOD was often solved with a pipelined approach, i.e., an image was first decomposed into object proposals, with which clustering [14, 15, 16], latent variable learning [12, 14, 17, 13, 15], or multiple instance learning [8, 10, 11, 21, 32] was used to perform proposal selection and classifier estimation. With the rise of deep learning, pipelined approaches have been evolving into multiple instance learning (MIL) networks [33, 34, 35, 36, 22, 25, 23, 24, 29, 28, 27, 37, 38].
Clustering. Various clustering methods were based on the hypothesis that a class of object instances shapes a single compact cluster while the negative instances form multiple diffuse clusters. With such a hypothesis, Wang et al. [15, 16] calculated clusters of object proposals using probabilistic latent semantic analysis (pLSA) on positive samples, and employed a voting strategy on these clusters to determine positive sub-categories. Bilen et al. and Song et al. [13, 14] leveraged clustering to initialize latent variables, object regions, part configurations, and sub-categories, and learned object detectors based on the initialization. Clustering is a simple but effective method. Its disadvantage is that a true positive cluster could incorporate significant noise if the objects are surrounded by cluttered backgrounds.
Latent Variable Learning. Latent SVM [26] learned object locations and detectors using an Expectation-Maximization-like algorithm. Probabilistic latent semantic analysis [15, 16] learned object locations in a latent space.
Various latent variable methods were required to solve the non-convexity problem. They often got stuck in a poor local minimum during learning, e.g., falsely localizing object parts or backgrounds. To pursue a stronger minimum, object symmetry and class mutual exclusion information [12], Nesterov’s smoothing [17], and convex clustering [14] were introduced to the optimization function. These approaches can be regarded as regularization which enforces the appearance similarity among objects.
Multiple Instance Learning (MIL). A major approach for tackling WSOD is to formulate it as an MIL problem [8], which treats each training image as a “bag” and iteratively selects high-scored instances from each bag when learning detectors. However, MIL remains prone to random poor solutions. The multi-fold MIL [10, 11] used division of a training set and cross validation to reduce the randomness and thereby prevented training from prematurely locking onto erroneous solutions. Hoffman et al. [21] trained detectors with weak annotations while transferring representations from extra object classes using full supervision (bounding-box annotation) and joint optimization. To reduce the randomness of positive instances, bag splitting was used during the optimization procedure of MILinear [25].
MIL has been updated to MIL networks [22, 27], where the convolutional filters behave as detectors to activate regions of interest on the deep feature maps [39, 40, 41]. Beam search [42] was used to localize objects by leveraging spatial distributions and informative patterns captured in the convolutional layers. To alleviate the non-convexity problem, Li et al. [23] adopted progressive optimization as regularized loss functions. Tang et al. [27] proposed to refine instance classifiers online by propagating instance labels to spatially overlapped instances. Diba et al. [28] proposed weakly supervised cascaded convolutional networks (WCCN), which learned to produce a class activation map and then selected the best object locations on the map by minimizing the segmentation loss.
MIL networks [28, 27, 29] report state-of-the-art performance, but are misled by the inconsistency between data annotations and learning objectives. With image-level annotations, they are capable of learning effective representations for image classification. Without object-level annotation, however, their localization ability is limited. The convolutional filters learned with image-level supervision incorporate redundant patterns, e.g., object parts and backgrounds, which cause localization randomness and model ambiguity.
Recent methods leveraged online instance classifier refinement (OICR) [27, 43] and proposal clusters [29, 43] to improve localization. The iterative generation of the proposal clusters [43] with OICR prevented the network from concentrating on parts of objects. In this paper, we propose to solve the localization randomness problem by introducing proposal cliques and min-entropy latent models. Our defined proposal cliques facilitate reducing the redundancy of proposals and optimizing min-entropy models. Using the clique-based min-entropy models, we can learn instances with minimum randomness, activate object extent, and suppress object parts, Fig. 1.
To translate the image labels to object locations, the MIL network approaches [27, 43] defined multiple network branches: the first one for the basic MIL network and the others for instance classifier refinement. We inherit the multi-branch architecture but add recurrent learning to facilitate the object score feedback [44]. With recurrent learning, the network branches can directly benefit from each other.
In weakly supervised learning, the inconsistency between the supervision (image-level annotation) and the objective (object-level classifier) introduces significant randomness to object localization and ambiguity to detectors. We aim at reducing this randomness to facilitate the collection of instances. To this end, we analyze two factors that cause such randomness: proposal redundancy and location uncertainty. 1) It is known that the objective functions of WSOD models are typically non-convex [8] and have many local minima. Redundant proposals make this worse by introducing more local minima and a larger search space. 2) As the object locations are uncertain, the learned instances may switch among object parts, i.e., local minima.
To reduce the proposal redundancy, we first partition the redundant object proposals into cliques and collect instances which are spatially related (i.e., overlapping with each other) and class related (i.e., having similar object class scores). To minimize localization randomness, we design a global min-entropy model that reflects class and spatial distributions of object cliques. By optimizing the global min-entropy model, discriminative cliques containing objects and object parts are discovered, Fig. 2, and the cliques which lack discriminative information are suppressed. The discovered cliques are used to activate true object extent.
To localize objects in the discovered cliques, a local min-entropy latent model is defined. By optimizing the local min-entropy model, pseudo-objects are estimated and their spatial neighbors are estimated as hard negatives. Such pseudo-objects and hard negatives estimated under the min-entropy principle have minimized randomness during learning, and further improve the performance of object localization, Fig. 2. MELM is deployed as a clique partition module and two network branches for object clique discovery and object localization, Fig. 4. During learning, it leverages the clique partition module to smooth the objective function and a continuation optimization method to solve the challenging non-convexity problem.
Let $x \in X$ denote an image and $y \in Y$ denote labels indicating whether $x$ contains an object or not, where $Y = \{1, 0\}$. $y = 1$ indicates that there is at least one object of the positive class in the image (positive image), while $y = 0$ indicates an image without any object of the positive class (negative image). $h$, denoting an object proposal (location), is a latent variable, and $H$, denoting the object proposals in an image, is the solution space. $H_c$, denoting a proposal clique, is a subset of $H$. $\theta$ denotes the network parameters. The min-entropy latent model (MELM), with object locations $h^*$ and network parameters $\theta^*$ to be learned, is defined as
(1)  $\{h^*, \theta^*\} = \arg\min_{h, \theta} E_d(\theta) + \lambda E_l(h, \theta) \Leftrightarrow \arg\min_{h, \theta} L_d + \lambda L_l$
where $E_d(\theta)$ and $E_l(h, \theta)$ are the global and local entropy models, which respectively serve for object clique discovery and object localization, Fig. 4. $\lambda$ is a regularization weight. $L_d$ and $L_l$ are loss functions based on $E_d(\theta)$ and $E_l(h, \theta)$, respectively.
Given image-level annotations, i.e., the presence or absence of a class of objects in images, the learning objective of MELM is to find a solution that disentangles object instances from noisy object proposals with minimum image classification loss and localization randomness. To this end, MELM is decomposed into three components: clique partition, object clique discovery, and object localization.
Noting that the localization randomness usually occurs among high-scored proposals, we empirically select a set of high-scored (top-200) proposals $\tilde{H}$ to construct the cliques, where $\tilde{H} \subseteq H$ is a subset of all object proposals $H$ in the image.
The proposal cliques $H_c$ are the minimum sufficient cover of $\tilde{H}$ and satisfy the following formulations:
(2)  $\bigcup_{c=1}^{C} H_c = \tilde{H}, \qquad H_c \cap H_{c'} = \emptyset \;\; \forall c \neq c'$
where $c, c' \in \{1, \dots, C\}$ and $C$ is the number of proposal cliques. To partition cliques, the proposals are sorted by their object scores and the following two steps are iteratively performed: 1) Construct a clique using the proposal that has the highest object score among those not yet belonging to any clique. 2) Find the proposals whose overlap with any proposal in the clique is larger than a threshold and merge them into the clique.
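The two partition steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `iou` and `partition_cliques`, the `(x1, y1, x2, y2)` box format, the default threshold value, and the choice to keep merging until a clique stops growing are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def partition_cliques(boxes: Sequence[Box], scores: Sequence[float],
                      threshold: float = 0.7, top_k: int = 200) -> List[List[int]]:
    """Greedily group the top_k highest-scored proposals into disjoint cliques."""
    # Sort by descending object score and keep only the high-scored proposals.
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:top_k]
    cliques: List[List[int]] = []
    while remaining:
        # Step 1: seed a clique with the best proposal not yet in any clique.
        clique = [remaining.pop(0)]
        # Step 2: merge proposals overlapping any clique member above the threshold.
        # Repeating until nothing more is added is an assumption of this sketch.
        grew = True
        while grew:
            grew = False
            for i in list(remaining):
                if any(iou(boxes[i], boxes[j]) > threshold for j in clique):
                    clique.append(i)
                    remaining.remove(i)
                    grew = True
        cliques.append(clique)
    return cliques


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
cliques = partition_cliques(boxes, scores, threshold=0.5)
```

In this toy input the first two boxes overlap with an IoU of about 0.68, so they fall into one clique at a threshold of 0.5 and into separate cliques at 0.7. Every selected proposal ends up in exactly one clique, which matches the cover and disjointness conditions described above.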
During the learning procedure, it is required that the cliques evolve with minimum randomness. At the same time, it is required to discover discriminative cliques containing objects and object parts. The network parameters fine-tuned with such cliques can activate true object extent. To this end, a global min-entropy model is defined as
(3) 
where is the class probability of a clique defined on the object score , as
(4) 
where calculates proposal number in a clique. denotes the last FC layer in the object clique discovery branch that outputs object scores for proposals.
To ensure that the discovered cliques can best discriminate the positive images from negative ones, we further introduce a classificationrelated weight . Based on the prior that the object class probabilities of proposals are correlated with their image class probabilities, the global minentropy is then defined as
(5) 
where , defined as
(6) 
is the classificationrelated weight of clique . Eq. (5) belongs to the Aczél and Daróczy (AD) entropy [45, 46] family and is derivable. Eq. (6) shows that when , is positively correlated to object score of the positive class in a clique, but negatively correlated to scores of all other classes.
With the above definitions, we implement an object clique discovery branch on top of the network, Fig. 4, and define a loss function to learn network parameters, as
(7) 
For positive images, the second term is zero and only the global min-entropy term is optimized. For negative images, the first term is zero and the second term (image classification loss) is optimized.
The cliques discovered by the global min-entropy model constitute a good initialization for object localization, but nonetheless incorporate random false positives, e.g., object parts and/or partial objects with backgrounds. This is caused by the learning objective of object clique discovery, which selects proposals to discriminate positive images from negative ones but does not consider how to precisely localize objects.
A local min-entropy latent model is then defined to localize objects based on the discovered cliques, as
(8) 
where
(9) 
also belongs to the AD entropy [45, 46] family and is also derivable. Different from Eq. (5) which considers the sum of the proposal probabilities globally to predict the image labels, Eq. (9) is designed to locally discriminate each proposal to be positive or negative. is defined as
(10) 
where the weight of a proposal is computed over its neighborhood in the clique, using a Gaussian kernel function (with a kernel parameter) applied to the IoU of two proposals. The Gaussian kernel function returns a high value when the IoU is large, and a low value when it is small. With Eq. (10), we define a “soft” proposal labeling strategy for object localization, which is validated to be less sensitive to noise [47] than the hard thresholding approach defined in [31].
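The behaviour of such a kernel can be illustrated with a short Python sketch. The exact form of the kernel is not recoverable from the text here, so the expression below, a Gaussian of `1 - IoU` with a hypothetical width `sigma`, is only an assumption chosen to be high when the IoU is large and low when it is small. The function names are also hypothetical.

```python
import math
from typing import List


def gaussian_iou_weight(iou_value: float, sigma: float = 0.5) -> float:
    """Assumed kernel: exp(-(1 - IoU)^2 / (2 * sigma^2)); equals 1 at IoU = 1."""
    return math.exp(-((1.0 - iou_value) ** 2) / (2.0 * sigma ** 2))


def soft_labels(ious_to_anchor: List[float], sigma: float = 0.5) -> List[float]:
    """Soft positive weights for the neighbours of an anchor proposal."""
    return [gaussian_iou_weight(v, sigma) for v in ious_to_anchor]


# IoU of four neighbouring proposals with the anchor proposal (toy values).
weights = soft_labels([1.0, 0.8, 0.5, 0.1])
```

Compared with a hard IoU threshold, a proposal just below the cut-off still receives a weight close to that of a proposal just above it, which is one plausible reading of why soft labeling is less sensitive to noise.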
Accordingly, the loss function of the object localization branch is defined as
(11) 
According to the definition of , the proposals close to tend to be true objects, and those far from , , , are hard negatives. Optimizing the loss function produces sparse object proposals of high object probability and suppresses object parts in clique . During the learning procedure, the localization capability of detectors is progressively improved.
MELM is implemented with an integrated deep network, with a clique partition module and two network branches added on top of the FC layers, Fig. 4. The first network branch, designated as the object clique discovery branch, has a global min-entropy layer, which defines the distribution of object probability and aims to find candidate object cliques by optimizing the global entropy and the image classification loss. The second branch, designated as the object localization branch, has a local min-entropy layer and a softmax layer. The local min-entropy layer classifies the object candidates in a clique into pseudo objects (i.e., the instantaneously learned objects) and hard negatives by optimizing the local entropy and the pseudo-object detection loss.
In the learning phase, object proposals are first generated for each image. An ROI-pooling layer atop the convolutional layer (CONV5) is used for efficient feature extraction for these proposals. The MELMs are optimized with a recurrent learning algorithm, which uses forward propagation to select sparse proposals as object instances, and back-propagation to optimize the network parameters with the gradient defined in the Appendix. The object probability of each proposal is recurrently aggregated by multiplying it by the object probability learned in the preceding iteration. In the detection phase, the learned object detectors, i.e., the parameters for the softmax and FC layers, are used to classify proposals and localize objects.
The objective of model learning is to transfer the image category supervision to object locations with min-entropy constraints, i.e., minimum localization randomness.
Recurrent Learning. A recurrent learning algorithm is implemented to transfer the image-level (weak) supervision using an integrated forward and back-propagation procedure, Fig. 5(a). In a feed-forward procedure, the min-entropy latent models discover object cliques and localize objects which are used as pseudo-objects for detector learning. With the learned detectors, the object localization branch assigns all proposals a new object probability, which is used to aggregate the object scores with an element-wise multiply operation in the next learning iteration. In the back-propagation procedure, the object clique discovery and object localization branches are jointly optimized with an SGD algorithm, which propagates gradients generated with the image classification loss and the pseudo-object detection loss. With forward and back-propagation procedures, the network parameters are updated and the image classifiers and object detectors are mutually enforced. The recurrent learning algorithm is described in Alg. 1.
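The element-wise aggregation step can be pictured with a small Python sketch. It is only a toy illustration: the function name `aggregate_scores`, the example numbers, and the renormalisation to a unit sum are assumptions that do not come from the paper.

```python
from typing import List


def aggregate_scores(scores: List[float], prev_probs: List[float]) -> List[float]:
    """Element-wise product of current scores and previous-iteration probabilities.

    Renormalising so the result sums to 1 is an assumption of this sketch.
    """
    product = [s * p for s, p in zip(scores, prev_probs)]
    total = sum(product)
    return [v / total for v in product] if total > 0 else product


scores = [0.5, 0.3, 0.2]      # toy object scores from the clique discovery branch
prev_probs = [0.6, 0.3, 0.1]  # toy object probabilities from the localization branch
aggregated = aggregate_scores(scores, prev_probs)
```

When both branches agree on a proposal, the product concentrates mass on it, so repeated iterations tend to sharpen the score distribution around consistently supported proposals.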
Accumulated Recurrent Learning. Fig. 5(b) shows the proposed accumulated recurrent learning (ARL). In ARL, we add multiple object localization branches, which may localize objects different from those discovered by previous branches. We thus accumulate objects from all previous branches. Doing so not only gives this approach the capability to localize multiple objects in a single image but also improves the robustness to object appearance diversity by learning various objects with multiple detectors.
With the clique partition module and recurrent learning, MELM implements the idea of continuation optimization [48] to alleviate the non-convexity problem.
In continuation optimization, a complex nonconvex objective function is denoted as , where denotes the model parameters. Optimizing is to find the solution
(12) 
While directly optimizing Eq. (12) causes local minimum solutions, a smoothed function is introduced to approximate and facilitate the optimization, as
(13) 
where a smoothness parameter controls the smoothness of the approximate function and a correction function accounts for the remaining gap. The traditional continuation method traces an implicitly defined curve from a starting point to a solution point, where the solution point is the solution of the original objective, reached when the smoothness parameter equals 1. During the procedure, if the approximate function is smooth and its solution is close to that of the original objective, we need only fill the gap between them. This is done by defining a sequence of predictions and corrections that iteratively approximate the original objective function and approach the globally optimal solution.
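The general idea of continuation can be seen on a one-dimensional toy problem. The Python sketch below is a generic illustration, not the MELM objective: the function `f`, the Gaussian smoothing, the schedule of `sigma` values, and the step size are all assumptions made for the example. In MELM the smoothing is attributed to the clique partition module rather than to a Gaussian convolution.

```python
import math


def f(x: float) -> float:
    """Toy non-convex objective with several local minima."""
    return x * x + 3.0 * math.sin(5.0 * x)


def smoothed_grad(x: float, sigma: float) -> float:
    """Gradient of f convolved with a Gaussian of standard deviation sigma.

    Smoothing damps the sine term by exp(-12.5 * sigma^2); sigma = 0 recovers f.
    """
    return 2.0 * x + 15.0 * math.exp(-12.5 * sigma * sigma) * math.cos(5.0 * x)


def descend(x: float, sigma: float, lr: float = 0.01, steps: int = 500) -> float:
    """Plain gradient descent on the smoothed objective."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, sigma)
    return x


# Direct optimization of the original objective from x = 2.
x_direct = descend(2.0, sigma=0.0)

# Continuation: solve a heavily smoothed problem first, then reduce the
# smoothing step by step, warm-starting from the previous solution.
x_cont = 2.0
for sigma in (1.0, 0.5, 0.25, 0.1, 0.0):
    x_cont = descend(x_cont, sigma)
```

Starting from the same point, direct descent stays in the nearby basin to the right of x = 2, whereas the continuation run ends close to x = -0.31, where `f` is lower. Each stage plays the role of a prediction that the next, less smoothed stage corrects.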
The objective function of MELM, defined in Eq. (1), is to find the solution ,
(14) 
Because of the complexity and non-convexity of this objective, we propose to optimize an approximate function,
(15) 
which corresponds to Eq. (1). The approximate function is defined by the clique partition module and is smoother than the original objective. This is achieved by reducing the solution space from thousands of proposals to tens of cliques in each image and averaging the class probability of all proposals in each clique, as defined by Eq. (4).
With the approximate function defined, we explore recurrent predictions and corrections to optimize the model. The gap between the approximate function and the original objective is that the former is defined to discover object cliques while the latter is defined to localize objects. As the solution of the latter (an object) is included in the solution of the former (a clique), the gap can be simply filled by designing a correction model to localize the object in the clique. With recurrent learning, the original objective function is thus progressively approximated.
By optimizing the min-entropy latent models, we obtain object detectors, which are applied to detect objects in test images. The detection procedure involves feature extraction and object localization, Fig. 4. With redundant object proposals extracted by the Selective Search [51] or the EdgeBoxes method [52], a test image is fed to the feature extraction module, and then an ROI-pooling layer is used to extract features for each proposal. The detector outputs object scores for each proposal and a Non-Maximum Suppression (NMS) procedure is used to remove the overlapped proposals.
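The final suppression step is the standard greedy NMS procedure, which can be sketched in Python as follows. The box format, the function names, and the IoU threshold of 0.3 are assumptions for illustration; the text does not state the threshold used.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.3) -> List[int]:
    """Greedy NMS: keep a box only if it does not overlap a higher-scored kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.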
The PASCAL VOC 2007, 2010, 2012 datasets [53], the ILSVRC 2013 dataset [54], and the MSCOCO 2014 dataset [55] are used to evaluate the proposed approach. In what follows, the datasets and experimental settings are first described. The evaluation of the model and comparison with the state-of-the-art approaches are then presented.
Datasets. The VOC datasets have 20 object categories. The VOC 2007 dataset contains 9963 images which are divided into three subsets: 5011 for train and val, and 4952 for test. The VOC 2010 dataset contains 19740 images, of which 10103 are for train and val, and 9637 for test. The VOC 2012 dataset contains 22531 images which are divided into three subsets: 11540 for train and val, and 10991 for test. The ILSVRC 2013 detection dataset is more challenging for object detection as it has 200 object categories. It contains 464278 images, of which 424126 images are for train and val, and 40152 images for test. For comparison with previous works, we split the val set of the ILSVRC 2013 detection dataset into val1 and val2 as in [1], which were used for training and test, respectively. Although it has more training images, the number of images for each object category is much smaller than that in the VOC datasets. The MSCOCO 2014 dataset contains 80 object categories, with challenging aspects including multiple objects, multiple classes, and small objects. On the PASCAL VOC and ILSVRC 2013 datasets the mean average precision (mAP) is used for evaluation. On the MSCOCO 2014 dataset the mAP under multiple IoUs is used.
CNN Models. MELM is implemented with two popular CNN models pre-trained on the ImageNet ILSVRC 2012 dataset. The first CNN model, VGG-CNN-F (VGG-F for short) [56], has an architecture similar to that of AlexNet [57], with 5 convolutional layers and 3 fully connected layers. The second CNN model is VGG16 [58], which has 13 convolutional layers and 3 fully connected layers. For these two CNN models, we replaced the spatial pooling layer after the last convolutional layer with the ROI-pooling layer as in [2]. The FC8 layer in the two CNN models was removed and the MELM model was added.
Object Proposals. The Selective Search [51] or EdgeBoxes method [52] was used to extract about 2000 object proposals for each image. As in the conventional object detection task, we used the fast setting when generating proposals by Selective Search. We also removed the proposals whose width or height is less than 20 pixels.
Learning settings. Following [22, 24, 27, 28], the input images were resized into 5 scales {480, 576, 688, 864, 1200} with respect to the larger side, height or width. The scale of a training image was randomly selected and each image was randomly flipped.
In this way, each test image was augmented into 10 images. For recurrent learning, we employed the SGD algorithm with momentum 0.9, weight decay 5e-4, and batch size 1. The model was trained for 20 epochs, where the learning rate was 5e-3 for the first 15 epochs and 5e-4 for the last 5 epochs. The output scores of each proposal from the 10 augmented images were averaged.
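These settings can be summarised in a short Python sketch. It assumes learning rates of 5e-3 and 5e-4, 0-based epoch indices, and that the 10 test-time images come from the five scales combined with horizontal flipping. The function names are hypothetical and no particular deep learning framework is implied.

```python
from typing import List, Tuple

SCALES = (480, 576, 688, 864, 1200)  # larger image side, as listed in the text


def learning_rate(epoch: int) -> float:
    """Step schedule: 5e-3 for epochs 0-14, 5e-4 for epochs 15-19 (0-based)."""
    return 5e-3 if epoch < 15 else 5e-4


def augmented_views() -> List[Tuple[int, bool]]:
    """Five scales times two flip states gives 10 augmented test images."""
    return [(scale, flipped) for scale in SCALES for flipped in (False, True)]


def average_scores(per_view_scores: List[List[float]]) -> List[float]:
    """Average each proposal's score over all augmented views."""
    n_views = len(per_view_scores)
    return [sum(column) / n_views for column in zip(*per_view_scores)]


views = augmented_views()
mean_scores = average_scores([[1.0, 0.0], [0.0, 1.0]])
```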
Fig. 6 shows that in the discovered cliques, discriminative objects and object parts were collected and the proposals which lack discriminative information were suppressed. With the proposals covering objects and object parts, the global min-entropy model could activate object extent during the back-propagation procedure. It can also be seen that the true object in a clique can be precisely localized after the recurrent learning procedure.
Fig. 7 shows the object cliques from different learning epochs. It can be seen that in the early training stage (Epoch 2), the object clique collected the object extent, i.e., object and object parts. This ensured the object extent activation by the object clique discovery branch. The object localization branch further suppressed the object parts in the object clique (Epoch 4). MELM finally activated the true object extent, suppressed the object parts, and detected objects accurately (Epoch 20).
Fig. 8a shows the evolution of global and local entropy, suggesting that our approach optimizes the min-entropy objective during learning. Fig. 8b provides the gradient evolution of the FC layers. In the early learning epochs, the gradient of the global min-entropy module was slightly larger than that of the local min-entropy module, suggesting that the network focused on optimizing the image classifiers. As learning proceeded, the gradient of the global min-entropy module decreased such that the local min-entropy module dominated the training of the network, indicating that the object detectors were being optimized.
To evaluate the effect of min-entropy, the randomness of object locations was evaluated with localization accuracy and localization variance. Localization accuracy was calculated as the weighted average of the overlaps between the ground-truth object boxes and the learned object boxes. Localization variance was defined as the weighted variance of these overlaps. Fig. 8c and Fig. 8d show that the proposed MELM had significantly greater localization accuracy and lower localization variance than WSDDN. This strongly indicates that our approach effectively reduces localization randomness during weakly supervised learning.
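The two statistics can be written down as a small Python sketch. Which quantity serves as the weight is not recoverable from the text here, so the `weights` argument is left generic, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def weighted_mean_and_variance(overlaps: Sequence[float],
                               weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted mean (accuracy) and weighted variance of box overlaps."""
    total = sum(weights)
    mean = sum(w * o for w, o in zip(weights, overlaps)) / total
    variance = sum(w * (o - mean) ** 2 for w, o in zip(weights, overlaps)) / total
    return mean, variance


# Two learned boxes with IoU 0.5 and 1.0 against the ground truth, equal weights.
accuracy, variance = weighted_mean_and_variance([0.5, 1.0], [1.0, 1.0])
```

A method whose selected boxes jump between object parts across iterations would show a lower weighted mean and a higher weighted variance than one that settles on the full object.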
Such an effect was further illustrated in Fig. 9, where we compared WSDDN with MELM by the localization accuracy and localization variance during the learning. As shown in Fig. 9, MELM significantly reduced the localization randomness and achieved higher localization accuracy than WSDDN. Take the “bicycle” in Fig. 9 for example. In the early training epochs, both WSDDN and MELM failed to localize the objects. In the following training epochs MELM reduced the randomness and achieved high localization accuracy. In contrast, WSDDN switched among object parts and failed to localize the true objects.


CNN  Method  mAP 
VGGF  MELMbase  31.5 
MELMbase+Clique  33.9  
MELMD  33.6  
MELML  36.0  
MELMD+RL  34.1  
MELML+RL  38.4  
VGG16  MELMbase+Clique  29.5 
MELMD  32.6  
MELML  40.1  
MELMD+RL  34.5  
MELML+RL  42.6  
MELMD+ARL  37.4  
MELML1+ARL  46.4  
MELML2+ARL  47.3  

Baseline. The baseline approach was derived by simplifying Eq. (7) to solely model the global entropy. This is similar to WSDDN without the spatial regulariser [22], where the single learning objective is to minimize the image classification loss. This baseline, referred to as “MELM-base” in Table I, achieved 31.5% mAP using the VGG-F network.
Clique Effect. By dividing the object proposals into cliques, the “MELM-base” approach was promoted to “MELM-base+Clique”. Table I shows that the introduction of proposal cliques improved the detection performance by 2.4% (from 31.5% to 33.9%). This occurred because using partitioned cliques reduced the solution space of the latent variable learning, thus reducing the redundancy of object proposals and facilitating a better solution. We also conducted experiments with different values of the overlap threshold, which controls the clique size as defined in Sec. 3.2.1, and summarized the results in Table II. Accordingly, we empirically set the threshold to 0.7 in other experiments.


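Table II. Detection mAP (%) for different values of the clique-size parameter.
Clique-size parameter  0.1  0.3  0.5  0.7  0.9  1
mAP  32.6  34.3  34.4  35.3  33.5  34.4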




Table III. Detection average precision (%) on the PASCAL VOC 2007 dataset.
CNN  Method  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  person  plant  sheep  sofa  train  tv  mAP
VGGF/AlexNet  MILinear [25]  41.3  39.7  22.1  9.5  3.9  41.0  45.0  19.1  1.0  34.0  16.0  21.3  32.5  43.4  21.9  19.7  21.5  22.3  36.0  18.0  25.4
VGGF/AlexNet  Multifold MIL [11]  39.3  43.0  28.8  20.4  8.0  45.5  47.9  22.1  8.4  33.5  23.6  29.2  38.5  47.9  20.3  20.0  35.8  30.8  41.0  20.1  30.2
VGGF/AlexNet  PDA [23]  49.7  33.6  30.8  19.9  13.0  40.5  54.3  37.4  14.8  39.8  9.4  28.8  38.1  49.8  14.5  24.0  27.1  12.1  42.3  39.7  31.0
VGGF/AlexNet  LCL+Context [16]  48.9  42.3  26.1  11.3  11.9  41.3  40.9  34.7  10.8  34.7  18.8  34.4  35.4  52.7  19.1  17.4  35.9  33.3  34.8  46.5  31.6
VGGF/AlexNet  WSDDN [22]  42.9  56.0  32.0  17.6  10.2  61.8  50.2  29.0  3.8  36.2  18.5  31.1  45.8  54.5  10.2  15.4  36.3  45.2  50.1  43.8  34.5
VGGF/AlexNet  ContextNet [24]  57.1  52.0  31.5  7.6  11.5  55.0  53.1  34.1  1.7  33.1  49.2  42.0  47.3  56.6  15.3  12.8  24.8  48.9  44.4  47.8  36.3
VGGF/AlexNet  WCCN [28]  43.9  57.6  34.9  21.3  14.7  64.7  52.8  34.2  6.5  41.2  20.5  33.8  47.6  56.8  12.7  18.8  39.6  46.9  52.9  45.1  37.3
VGGF/AlexNet  OICR [27]  53.1  57.1  32.4  12.3  15.8  58.2  56.7  39.6  0.9  44.8  39.9  31.0  54.0  62.4  4.5  20.6  39.2  38.1  48.9  48.6  37.9
VGGF/AlexNet  MELM  56.4  54.7  30.9  21.1  17.3  52.8  60.0  36.1  3.9  47.8  35.5  28.9  30.9  61.0  5.8  22.8  38.8  39.6  42.1  54.8  38.4
VGG16  WSDDN [22]  39.4  50.1  31.5  16.3  12.6  64.5  42.8  42.6  10.1  35.7  24.9  38.2  34.4  55.6  9.4  14.7  30.2  40.7  54.7  46.9  34.8
VGG16  PDA [23]  54.5  47.4  41.3  20.8  17.7  51.9  63.5  46.1  21.8  57.1  22.1  34.4  50.5  61.8  16.2  29.9  40.7  15.9  55.3  40.2  39.5
VGG16  OICR [27]  58.0  62.4  31.1  19.4  13.0  65.1  62.2  28.4  24.8  44.7  30.6  25.3  37.8  65.5  15.7  24.1  41.7  46.9  64.3  62.6  41.2
VGG16  SelfTaught [29]  52.2  47.1  35.0  26.7  15.4  61.3  66.0  54.3  3.0  53.6  24.7  43.6  48.4  65.8  6.6  18.8  51.9  43.6  53.6  62.4  41.7
VGG16  WCCN [28]  49.5  60.6  38.6  29.2  16.2  70.8  56.9  42.5  10.9  44.1  29.9  42.2  47.9  64.1  13.8  23.5  45.9  54.1  60.8  54.5  42.8
VGG16  TSC [59]  59.3  57.5  43.7  27.3  13.5  63.9  61.7  59.9  24.1  46.9  36.7  45.6  39.9  62.6  10.3  23.6  41.7  52.4  58.7  56.6  44.3
VGG16  WeakRPN [60]  57.9  70.5  37.8  5.7  21.0  66.1  69.2  59.4  3.4  57.1  57.3  35.2  64.2  68.6  32.8  28.6  50.8  49.5  41.1  30.0  45.3
VGG16  MELM  55.6  66.9  34.2  29.1  16.4  68.8  68.1  43.0  25.0  65.6  45.3  53.2  49.6  68.6  2.0  25.4  52.5  56.8  62.1  57.1  47.3
Ensemble  OICREns. [27]  58.5  63.0  35.1  16.9  17.4  63.2  60.8  34.4  8.2  49.7  41.0  31.3  51.9  64.8  13.6  23.1  41.6  48.4  58.9  58.7  42.0
Ensemble  MELMEns.  60.3  65.0  39.5  29.0  17.5  66.1  66.4  44.8  18.6  59.0  48.4  53.2  53.0  67.2  11.0  26.5  50.0  55.7  63.1  62.4  47.8

Min-entropy models. We denoted the min-entropy models by “MELMD” and “MELML” in Table I, which respectively corresponded to object clique discovery and object localization. We trained the models by simply cascading the object clique discovery and object localization branches, without recurrent learning. Table I shows that with VGGF we achieved 33.6% and 36.0% mAP for the object clique discovery and object localization branches, which improved on the baseline “MELMbase” (31.5%) by 2.1% and 4.5%. For VGG16, “MELML” improved on “MELMbase+Clique” from 29.5% to 40.1%, a margin of 10.6%. This demonstrated that the min-entropy models, implemented with the object clique discovery and object localization branches, were pillars of our approach.
Recurrent learning. In Table I, with VGGF, the recurrent learning variants “MELMD+RL” and “MELML+RL” achieved 34.1% and 38.4% mAP, improving on “MELMD” and “MELML” (without recurrent learning) by 0.5% and 2.4%, respectively. With VGG16, “MELMD+RL” and “MELML+RL” achieved 34.5% and 42.6% mAP, improving on “MELMD” and “MELML” by 1.9% and 2.5%. These improvements showed that with recurrent learning (Fig. 4), the object clique discovery and object localization branches benefited from each other and were thus mutually reinforced.
Accumulated recurrent learning. The models with accumulated recurrent learning were denoted by “MELMD+ARL”, “MELML1+ARL”, and “MELML2+ARL” in Table I. In the learning procedure, the high-scored proposals were accumulated into the next branch. When using two object localization branches, “MELML1+ARL” improved the mAP of “MELML+RL” from 42.6% to 46.4% (+3.8%). With three branches, “MELML2+ARL” further improved the mAP from 46.4% to 47.3% (+0.9%), but a fourth branch brought no significant further improvement.


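Table IV. Mean CorLoc (%) of weakly supervised object localization.
CNN  Method  Mean CorLoc
VGGF/AlexNet  MILinear [25]  43.9
VGGF/AlexNet  LCL+Context [16]  48.5
VGGF/AlexNet  PDA [23]  49.8
VGGF/AlexNet  WCCN [28]  52.6
VGGF/AlexNet  Multifold MIL [11]  54.2
VGGF/AlexNet  WSDDN [22]  54.2
VGGF/AlexNet  ContextNet [24]  55.1
VGGF/AlexNet  MELM  58.4
VGG16  PDA [23]  52.4
VGG16  WSDDN [22]  53.5
VGG16  WCCN [28]  56.7
VGG16  MELM  61.4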



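Table V. Detection mAP (%) on the PASCAL VOC 2010, PASCAL VOC 2012, and ILSVRC 2013 datasets.
Dataset  CNN  Method  Dataset Splitting  mAP
PASCAL VOC 2010  VGGF/AlexNet  PDA [23]  train/val  21.4
PASCAL VOC 2010  VGGF/AlexNet  WCCN [28]  trainval/test  28.8
PASCAL VOC 2010  VGGF/AlexNet  MELM  train/val  35.6
PASCAL VOC 2010  VGGF/AlexNet  MELM  trainval/test  36.3
PASCAL VOC 2010  VGG16  PDA [23]  train/val  30.7
PASCAL VOC 2010  VGG16  WCCN [28]  trainval/test  39.5
PASCAL VOC 2010  VGG16  MELM  train/val  37.1
PASCAL VOC 2010  VGG16  MELM  trainval/test  39.9
PASCAL VOC 2012  VGGF/AlexNet  PDA [23]  train/val  22.4
PASCAL VOC 2012  VGGF/AlexNet  MILinear [25]  train/val  23.8
PASCAL VOC 2012  VGGF/AlexNet  WCCN [28]  trainval/test  28.4
PASCAL VOC 2012  VGGF/AlexNet  ContextNet [24]  trainval/test  35.3
PASCAL VOC 2012  VGGF/AlexNet  OICRVGGM [27]  trainval/test  34.6
PASCAL VOC 2012  VGGF/AlexNet  MELM  train/val  36.2
PASCAL VOC 2012  VGGF/AlexNet  MELM  trainval/test  36.4
PASCAL VOC 2012  VGG16  PDA [23]  train/val  29.1
PASCAL VOC 2012  VGG16  SelfTaught [29]  train/val  39.0
PASCAL VOC 2012  VGG16  WCCN [28]  trainval/test  37.9
PASCAL VOC 2012  VGG16  OICR [27]  trainval/test  37.9
PASCAL VOC 2012  VGG16  SelfTaught [29]  trainval/test  38.3
PASCAL VOC 2012  VGG16  TSC [59]  trainval/test  40.0
PASCAL VOC 2012  VGG16  MELM  train/val  40.2
PASCAL VOC 2012  VGG16  MELM  trainval/test  42.4
ILSVRC 2013  VGGF/AlexNet  MILinear [25]  -  9.6
ILSVRC 2013  VGGF/AlexNet  PDA [23]  val1/val2  7.7
ILSVRC 2013  VGGF/AlexNet  WCCN [28]  -  9.8
ILSVRC 2013  VGGF/AlexNet  MELM  val1/val2  13.4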




Table VI. Image classification, pointing localization, and object detection results (%) on the MS COCO 2014 dataset.
Image Classification
Method  mAP  F1C  PC  RC  F1O  PO  RO
CAM [61]  54.4  -  -  -  -  -  -
SPN [37]  56.0  -  -  -  -  -  -
ResNet101 [62]  75.2  69.5  80.8  63.4  74.4  82.2  68.0
MELMVGG16  79.1  72.0  79.3  68.6  76.8  82.5  71.9
Pointing Localization (with class prediction)
Method  WeakSup [34]  Pronet [63]  DFM [42]  SPN [37]  MELM
mAP  41.2  43.5  49.2  55.3  65.1
Object Detection
Method  CNN  mAP@.5  mAP@[.5,.95]
WSDDN [22]  VGGF  10.1  3.1
MELM  VGGF  11.9  4.1
MELM  VGG16  18.8  7.8
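The macro and micro measures reported for image classification on MS COCO (PC, RC, F1C, PO, RO, F1O) can be illustrated with a minimal Python sketch. It assumes that each image has a set of true labels and a set of predicted labels, that macro values are averages of per-class values, and that micro values are computed from counts pooled over all classes. The function name and input layout are hypothetical, and the exact conventions of [64] may differ.

```python
def macro_micro_prf(y_true, y_pred, classes):
    """Macro (per-class averaged) and micro (pooled) precision, recall and F1.

    y_true, y_pred: lists of label sets, one pair per image.
    """
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for truth, pred in zip(y_true, y_pred):
        for c in classes:
            if c in pred and c in truth:
                tp[c] += 1
            elif c in pred:
                fp[c] += 1
            elif c in truth:
                fn[c] += 1

    def safe_div(num, den):
        return num / den if den else 0.0

    def f1(p, r):
        return safe_div(2 * p * r, p + r)

    per_p = [safe_div(tp[c], tp[c] + fp[c]) for c in classes]
    per_r = [safe_div(tp[c], tp[c] + fn[c]) for c in classes]
    per_f1 = [f1(p, r) for p, r in zip(per_p, per_r)]

    n = len(classes)
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = safe_div(tp_all, tp_all + fp_all)
    micro_r = safe_div(tp_all, tp_all + fn_all)
    return {
        "PC": sum(per_p) / n, "RC": sum(per_r) / n, "F1C": sum(per_f1) / n,
        "PO": micro_p, "RO": micro_r, "F1O": f1(micro_p, micro_r),
    }


scores = macro_micro_prf(
    y_true=[{"cat"}, {"cat", "dog"}, {"dog"}],
    y_pred=[{"cat"}, {"cat"}, {"cat", "dog"}],
    classes=["cat", "dog"],
)
```

Under the pooled definition, the micro F1 is the harmonic mean of PO and RO; for the MELM row this gives 2 × 82.5 × 71.9 / (82.5 + 71.9) ≈ 76.8, which agrees with the reported F1O.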

Weakly Supervised Object Detection. Table III compares the detection performance of MELM with the state-of-the-art approaches on the PASCAL VOC 2007 dataset. MELM achieved 38.4% and 47.3% mAP with the VGGF and VGG16 models, respectively. With the popular VGG16 model, MELM outperformed OICR [27], SelfTaught [29], WCCN [28], TSC [59], and WeakRPN [60] by 6.1% (47.3% vs. 41.2%), 5.6% (47.3% vs. 41.7%), 4.5% (47.3% vs. 42.8%), 3.0% (47.3% vs. 44.3%), and 2.0% (47.3% vs. 45.3%), respectively, which are significant margins for the challenging WSOD task. MELM using multiple networks (MELMEns.) outperformed OICREns. (47.8% mAP vs. 42.0% mAP). To further improve the detection performance, we retrained a Fast R-CNN detector using the learned pseudo objects and a ResNet-101 network, and achieved 49.0% mAP.
Table V compares the detection performance of MELM with the state-of-the-art approaches on the VOC 2010 and VOC 2012 datasets. MELM usually outperformed the state-of-the-art approaches. On the VOC 2010 dataset, MELM significantly outperformed WCCN [28] by 7.5% (36.3% vs. 28.8%) with a VGGF model, and was comparable to it with a VGG16 model (39.9% vs. 39.5%). On the VOC 2012 dataset, with a VGGF model, MELM outperformed WCCN [28] and OICR [27] by 8.0% (36.4% vs. 28.4%) and 1.8% (36.4% vs. 34.6%), respectively. With a VGG16 model, MELM outperformed WCCN [28], SelfTaught [29], OICR [27], and TSC [59] by 4.5% (42.4% vs. 37.9%), 4.1% (42.4% vs. 38.3%), 4.5% (42.4% vs. 37.9%), and 2.4% (42.4% vs. 40.0%), respectively.
Specifically, the detection performance for “bicycle” (+4.5%), “cow” (+8.5%), “diningtable” (+14.7%), and “dog” (+9.6%) improved significantly, which shows the general effectiveness of MELM.
Despite the good average performance, our approach failed on the “person” class, as shown in the last image of Fig. 10(a). “Person” is one of the most challenging classes, as people exhibit great appearance variance from clothes, poses, and occlusions. Furthermore, the definition of “person” is not consistent. A “person” could be defined as a pedestrian, a head-and-shoulder, or just a human face. Given such an ambiguous definition, what the algorithm can do is to localize the most discriminative part of a “person”, e.g., the face. We also note that although the performance on “person” decreased, the average performance over all classes significantly increased.
For the object classes with large appearance variance, we observed that the algorithm correctly classified the object regions but often failed to precisely localize them, i.e., the IoU between the learned bounding boxes and the ground truth was smaller than 0.5. When using the “pointing localization” metric [37], the “person” class achieved 97.1% localization accuracy, which shows potential for practical applications.
Fig. 10 shows some of the detection examples. It can be seen that MELM precisely localized objects against cluttered backgrounds and correctly localized multiple object regions in a single image.
Weakly Supervised Object Localization. The Correct Localization (CorLoc) metric [18] was employed to evaluate the localization accuracy. CorLoc is the percentage of images for which the region with the highest object score has at least 0.5 intersection-over-union (IoU) with a ground-truth object region. This experiment was done on the training images, because region selection operates only during the training process.
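The CorLoc definition above can be illustrated with a minimal Python sketch. It assumes boxes are (x1, y1, x2, y2) tuples, that one top-scored box is supplied per image, and that an image counts as correctly localized when this box reaches the IoU threshold with any of its ground-truth boxes. The function names are hypothetical, and per-class bookkeeping (evaluating each class only on images that contain it) is left out.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes_per_image, iou_threshold=0.5):
    """Percentage of images whose top-scored box overlaps some ground-truth box
    with IoU >= iou_threshold."""
    if not top_boxes:
        raise ValueError("at least one image is required")
    hits = sum(
        1
        for top, gts in zip(top_boxes, gt_boxes_per_image)
        if any(iou(top, gt) >= iou_threshold for gt in gts)
    )
    return 100.0 * hits / len(top_boxes)


# Image 1: IoU = 0.81 (hit). Image 2: IoU = 0.16 (miss).
score = corloc(
    top_boxes=[(0, 0, 10, 10), (0, 0, 4, 4)],
    gt_boxes_per_image=[[(1, 1, 10, 10)], [(0, 0, 10, 10)]],
)
```

Because only the single highest-scored region per image is checked, CorLoc measures localization on images known to contain the class, and it does not penalize duplicate or missed instances the way detection mAP does.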
It can be seen in Table IV that with the VGGF model, the mean CorLoc of MELM outperformed the state-of-the-art WSDDN [22] and WCCN [28] by 4.2% (58.4% vs. 54.2%) and 5.8% (58.4% vs. 52.6%), respectively. With the VGG16 model, it outperformed WSDDN [22] and WCCN [28] by 7.9% (61.4% vs. 53.5%) and 4.7% (61.4% vs. 56.7%). Notably, on the “bus”, “car”, “chair”, and “table” classes, MELM outperformed the compared state-of-the-art methods by 7% to 15%. This shows that the clique-based min-entropy strategy is more effective than the image segmentation strategy used in WCCN.


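Table VII. Image classification mAP (%).
CNN  Method  mAP
VGGF/AlexNet  MILinear [25]  72.0
VGGF/AlexNet  AlexNet [57]  82.4
VGGF/AlexNet  WSDDN [22]  85.3
VGGF/AlexNet  WCCN [28]  87.8
VGGF/AlexNet  MELM  87.8
VGG16  VGG16 [58]  89.3
VGG16  WSDDN [22]  89.7
VGG16  WCCN [28]  90.9
VGG16  MELM  93.1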

Image Classification. The object clique discovery and object localization components highlighted informative regions and suppressed disturbing backgrounds, which also benefited image classification. As shown in Table VII, with the VGGF model, MELM achieved 87.8% mAP. With the VGG16 model, MELM achieved 93.1% mAP, which outperformed WSDDN [22] and WCCN [28] by 3.4% (93.1% vs. 89.7%) and 2.2% (93.1% vs. 90.9%), respectively. It is noteworthy that MELM outperformed the VGG16 network, specifically trained for image classification, by 3.8% mAP (93.1% vs. 89.3%).
On the ILSVRC 2013 dataset with 200 object classes (Table V), MELM with VGGF outperformed the WCCN approach by 3.6% (13.4% vs. 9.8%). On the MS COCO 2014 dataset, we evaluated the image classification, pointing localization, and object detection performance and compared it with the state-of-the-art approaches. The evaluation metrics for image classification included macro/micro precision (PC and PO), macro/micro recall (RC and RO), and macro/micro F1-measure (F1C and F1O) [64]. It can be seen in Table VI that for image classification MELM outperformed SPN [37] by 23.1% (79.1% vs. 56.0%). For pointing localization, MELM outperformed SPN by 9.8% (65.1% vs. 55.3%). For object detection, MELM outperformed WSDDN. With these experiments, we set new baselines for weakly supervised object detection on large-scale datasets.
In this paper, we proposed an effective deep min-entropy latent model (MELM) for weakly supervised object detection (WSOD). MELM was deployed as three components (clique partition, object clique discovery, and object localization) and was unified with the deep learning framework in an integrated manner. By partitioning and discovering cliques, MELM provided a new way to learn latent object regions from redundant object proposals. With the min-entropy principle, it can principally reduce the variance of positive instances and alleviate the ambiguity of detectors. With the recurrent learning algorithm, MELM improved the performance of weakly supervised detection, weakly supervised localization, and image classification, in striking contrast with state-of-the-art approaches. The underlying reality is that min-entropy results in minimum randomness of an information system and that recurrent learning takes advantage of continuation optimization, which provides fresh insights for weakly supervised learning problems.
For succinct representation, we denote , , , and as , , , and , respectively.
Derivation for object clique discovery. Given the object score as the input of the entropy models, its gradient can be computed as
(16) 
where the partial derivation of with respect to is computed as
(17) 
where is the clique including . The partial derivation of with respect to is computed as
(18) 
Derivation for object localization. In Eq. (11), the term is used as a pseudo label for , which does not backpropagate gradients. Therefore, the derivation for object localization can be simply computed as
(19) 
The partial derivation of with respect to is calculated with Eq. (18) and Eq. (19).
This work was supported in part by the NSFC under Grant 61836012, 61671427, and 61771447, and Beijing Municipal Science and Technology Commission under Grant Z181100008918014. Qixiang Ye is the corresponding author.
Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2014, pp. 580–587.
A. Stuart, T. Ioannis, and H. Thomas, “Support vector machines for multiple-instance learning,” in Adv. in Neural Inf. Process. Syst. (NIPS), 2002, pp. 561–568.
P. Megha and L. Svetlana, “Scene recognition and weakly supervised object localization with deformable part-based models,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2011, pp. 1307–1314.
IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 302–310.