Multiple Instance Curriculum Learning for Weakly Supervised Object Detection

11/25/2017, by Siyang Li, et al.

When supervising an object detector with weakly labeled data, most existing approaches are prone to getting trapped in discriminative object parts, e.g., finding the face of a cat instead of its full body, because no supervision is provided on the extent of full objects. To address this challenge, we incorporate object segmentation into detector training, which guides the model to correctly localize full objects. We propose the multiple instance curriculum learning (MICL) method, which injects curriculum learning (CL) into the multiple instance learning (MIL) framework. MICL starts by automatically picking easy training examples, where the extent of the segmentation mask agrees with the detection bounding box. The training set is then gradually expanded to include harder examples, so that a strong detector is trained to handle complex images. The proposed MICL method with segmentation in the loop outperforms state-of-the-art weakly supervised object detectors by a substantial margin on the PASCAL VOC datasets.


1 Introduction

Object detection is an important problem in computer vision. In recent years, a series of detectors based on convolutional neural networks (CNNs) have been proposed [Girshick(2015), Ren et al.(2015)Ren, He, Girshick, and Sun, Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg], which perform significantly better than traditional methods (e.g., [Felzenszwalb et al.(2010)Felzenszwalb, Girshick, McAllester, and Ramanan]). These detectors must be supervised with fully labeled data, where both object categories and locations (bounding boxes) are provided. However, we argue that such data are expensive in terms of labeling effort and thus do not scale well: as dataset sizes grow, it becomes extremely difficult to label the locations of all object instances.

In this work, we focus on object detection with weakly labeled data, where only image-level category labels are provided and the object locations are unknown. This type of method is attractive since image-level labels are usually much cheaper to obtain. For each image in a weakly labeled dataset, the image-level label indicates which object categories are present and which are absent. Thus, for each category, we have positive examples, in which at least one instance of that category is present, as well as negative ones, in which no objects of that category exist.

Some previous methods [Li et al.(2016)Li, Huang, Li, Wang, and Yang, Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev, Bilen and Vedaldi(2016)] extract a set of object candidates via unsupervised object proposal methods [Uijlings et al.(2013)Uijlings, van de Sande, Gevers, and Smeulders, Zitnick and Dollár(2014), Li et al.(2017)Li, Zhang, Zhang, Ren, and Kuo, Carreira and Sminchisescu(2010)] and then identify the proposals that lead to high image classification scores for the categories present. However, the best proposals for image classification do not necessarily cover the full objects. For example, to classify an image as "cat", seeing the face is already sufficient and even more robust than seeing the whole body, since the fluffy fur can be confused with that of other animals. Specifically, the best proposals usually focus on discriminative object parts, which oftentimes do not overlap enough with the extent of the full objects and thus become false positives for detection.

Figure 1: The training diagram of our MICL method. It iterates over re-localization and re-training. During re-localization, saliency maps are generated from the current detector. Segmentation seeds are obtained from the saliency maps, which later grow to segmentation masks. We use the segmentation masks to guide the detector to avoid being trapped in the discriminative parts. A curriculum is designed based on the segmentation masks and the current top scoring detections. With the curriculum, the multiple instance learning process can be organized in an easy-to-hard manner for the detector re-training.

To reduce the false positives caused by trapping in discriminative parts, we use segmentation masks to guide the weakly supervised detector within the typical re-localization/re-training loop [Shi and Ferrari(2016), Gokberk Cinbis et al.(2014)Gokberk Cinbis, Verbeek, and Schmid]. The segmentation process starts with a few seeds from the object saliency maps generated by the current detector. Segmentation masks are then obtained by expanding those seeds using the "Seed, Expand and Constrain (SEC)" method [Kolesnikov and Lampert(2016)]. One could use all generated masks to directly supervise the detector. However, the detector would then be misled by hard and noisy examples on which the segmentation network fails to produce reasonably good object masks. To overcome this challenge, we propose the multiple instance curriculum learning (MICL) approach, which combines the commonly used multiple instance learning (MIL) framework with the easy-to-hard learning paradigm of curriculum learning (CL) [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston]. The workflow of the proposed MICL system is shown in Fig. 1. It learns only from the "easy" examples in the re-training step of MIL. The re-trained detector is later used to re-localize the segmentation seeds and object boxes. As this process iterates, the training set gradually expands from easy to hard examples, so the detector learns to handle more complex examples. We identify the easiness of an example by examining the consistency between the results of the detector and the segmenter, without the additional supervision on "easiness" required by traditional CL methods. Once the MICL process is finished, the detector is applied to test images directly.

The contributions of this work are summarized as follows. First, we incorporate a semantic segmentation network to guide the detector to learn the extent of full objects and avoid getting stuck at discriminative object parts. Second, we propose the MICL method, which combines MIL with CL so that the detector is not misled by hard and unreliable examples; our CL process does not require any additional supervision on the "easiness" of the training examples. Third, we demonstrate the superior performance of the proposed MICL method compared with state-of-the-art weakly supervised detectors.

2 Related Work

Weakly supervised object detection. The weakly supervised object detection problem is oftentimes treated as a multiple instance learning (MIL) task [Bilen et al.(2014)Bilen, Pedersoli, and Tuytelaars, Siva and Xiang(2011), Song et al.(2014a)Song, Girshick, Jegelka, Mairal, Harchaoui, Darrell, et al., Song et al.(2014b)Song, Lee, Jegelka, and Darrell, Shi and Ferrari(2016), Gokberk Cinbis et al.(2014)Gokberk Cinbis, Verbeek, and Schmid, Bilen et al.(2015)Bilen, Pedersoli, and Tuytelaars, Siva et al.(2012)Siva, Russell, and Xiang, Siva et al.(2013)Siva, Russell, Xiang, and Agapito]. Each image is considered a bag of instances. An image is labeled as positive for a category if it contains at least one instance of that category, and negative if no instance of that category appears in it. As a non-convex optimization problem, MIL is sensitive to model initialization, and much prior work focuses on good initialization strategies [Song et al.(2014a)Song, Girshick, Jegelka, Mairal, Harchaoui, Darrell, et al., Deselaers et al.(2010)Deselaers, Alexe, and Ferrari]. Recently, researchers have started to exploit powerful CNNs to solve the weakly supervised object detection problem. Oquab et al. [Oquab et al.(2015)Oquab, Bottou, Laptev, and Sivic] convert AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] into a fully convolutional network (FCN) to obtain an object score map, but it gives only a rough location of objects. WSDNN [Bilen and Vedaldi(2016)] constructs a two-branch detection network with a classification branch and a localization branch in parallel. ContextLoc [Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] is built on WSDNN and adds an additional context branch. DA [Li et al.(2016)Li, Huang, Li, Wang, and Yang] proposes a domain adaptation approach that identifies proposals corresponding to objects and uses them as pseudo location labels for detector training. Size Estimate (SE) [Shi and Ferrari(2016)] trains a separate size estimator with additional object size annotations, and the detector is trained by feeding images with decreasing estimated object sizes. Singh et al. [Kumar Singh et al.(2016)Kumar Singh, Xiao, and Jae Lee] take advantage of videos and transfer object appearance to images for object localization. Most of the aforementioned approaches suffer from trapping in discriminative regions, as no specific supervision is provided to learn the extent of full objects.

Weakly supervised image semantic segmentation. Similar to the location labels for detector training, the pixel-wise labels for semantic segmentation are laborious to obtain. Consequently, many researchers focus on weakly supervised image semantic segmentation [Bearman et al.(2016)Bearman, Russakovsky, Ferrari, and Fei-Fei, Lin et al.(2016)Lin, Dai, Jia, He, and Sun, Kolesnikov and Lampert(2016), Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille], where only image-level labels or scribbles on objects are available for training. PointSup [Bearman et al.(2016)Bearman, Russakovsky, Ferrari, and Fei-Fei] trains a segmentation network with only one point annotation per category (or object) and uses an objectness prior to improve the results. Similarly, ScribbleSup [Lin et al.(2016)Lin, Dai, Jia, He, and Sun] trains the segmentation network with scribbles by alternating between the GrabCut [Rother et al.(2004)Rother, Kolmogorov, and Blake] algorithm and FCN training [Long et al.(2015)Long, Shelhamer, and Darrell]. SEC [Kolesnikov and Lampert(2016)] generates coarse location cues from image classification networks as "pseudo scribbles" and constrains the training process with smoothness. In this work, SEC is used to train the segmentation network from seeds.

Curriculum learning. The concept of curriculum learning was proposed by Bengio et al. [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston], who showed that learning from easy to hard examples can be beneficial. It has been applied to various problems in computer vision [Shi and Ferrari(2016), Tudor Ionescu et al.(2016)Tudor Ionescu, Alexe, Leordeanu, Popescu, Papadopoulos, and Ferrari, Lee and Grauman(2011), Pentina et al.(2015)Pentina, Sharmanska, and Lampert], with different definitions of "easy" examples. Some require human labelers to assess the difficulty level of images [Pentina et al.(2015)Pentina, Sharmanska, and Lampert], while others measure easiness based on labeling time [Tudor Ionescu et al.(2016)Tudor Ionescu, Alexe, Leordeanu, Popescu, Papadopoulos, and Ferrari]. In our work, we determine easiness by measuring the consistency between a trained detector and the segmentation masks, without requiring additional human labeling.

3 Proposed MICL Method

The proposed MICL approach starts by initializing a detector to find the most salient candidate (Sec. 3.1). Meanwhile, we obtain saliency maps from a trained classifier/detector. The saliency maps are thresholded to obtain segmentation seeds. Then a segmentation network is trained to grow object masks from those seeds (Sec. 3.2). After that, we inject the curriculum learning (CL) paradigm into the commonly used re-localization/re-training multiple instance learning (MIL) framework [Gokberk Cinbis et al.(2014)Gokberk Cinbis, Verbeek, and Schmid] for weakly supervised object detection, leading to the multiple instance curriculum learning (MICL) approach. With MICL, we further train the detector to learn the extent of objects under the guidance of the segmentation network (Sec. 3.3).

3.1 Detector Initialization

To start the MIL process detailed in Sec. 3.3, we need an initial detector with some localization capability. We achieve this by first training a whole-image classifier and then tuning it into a detector that identifies the most salient candidate. Similar to the methods described in [Bilen and Vedaldi(2016), Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev, Teh et al.(2016)Teh, Rochan, and Wang], we first extract Selective Search (SS) [Uijlings et al.(2013)Uijlings, van de Sande, Gevers, and Smeulders] object proposals. To find the most salient candidate (MSC) among the proposals, we then add a "saliency" branch, composed of an FC layer, in parallel with the classification branch, as shown in Fig. 2.

Figure 2: The detector with a saliency branch to find the most salient candidate (MSC) for image classification.

This branch takes the region of interest (RoI, denoted by $R$) features from the second last layer of the VGG16 network [Simonyan and Zisserman(2015)] as the input and computes its saliency. The softmax operation is then applied over all RoIs, so that the saliency scores for category $c$, denoted by $\alpha_{R,c}$, sum to one, i.e., $\sum_{R} \alpha_{R,c} = 1$. The saliency score is used to aggregate the RoI classification scores $s_{R,c}$ into image-level scores:

$p_c = \sum_{R} \alpha_{R,c} \, s_{R,c}$   (1)

Then, the whole network can be trained with image-level labels using the multi-label cross-entropy loss. We rank the RoIs by the combined scores $\alpha_{R,c} \, s_{R,c}$ and record the top-scoring RoI for the curriculum learning process detailed in Sec. 3.3.
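To make the aggregation concrete, the following NumPy sketch (with hypothetical variable names, not the authors' code) computes image-level scores from per-RoI classification scores and the saliency-branch outputs, as in Eq. (1):

```python
import numpy as np

def image_level_scores(roi_cls_scores, roi_saliency_logits):
    """Aggregate per-RoI classification scores into image-level scores.

    roi_cls_scores:      (num_rois, num_classes) classification scores s_{R,c}
    roi_saliency_logits: (num_rois, num_classes) raw outputs of the saliency branch
    Returns image-level scores p_c of shape (num_classes,).
    """
    # Softmax over RoIs (axis 0) so that the saliency scores per class sum to one.
    logits = roi_saliency_logits - roi_saliency_logits.max(axis=0, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=0, keepdims=True)   # alpha_{R,c}, sums to 1 over R
    # Saliency-weighted sum of RoI classification scores (Eq. 1).
    return (alpha * roi_cls_scores).sum(axis=0)

# The top-scoring RoI per class (the MSC box used later) can be taken as
# np.argmax(alpha * roi_cls_scores, axis=0).
```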

3.2 Segmentation-based Seed Growing (SSG)

In this module, we first obtain saliency maps from a classification (a single region of interest) or detection (multiple regions of interest) network and then train a segmentation network to expand object masks from the salient regions.

Saliency map for a single region. Several methods [Zhou et al.(2016)Zhou, Khosla, Lapedriza, Oliva, and Torralba, Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman, Zhang et al.(2016)Zhang, Lin, Brandt, Shen, and Sclaroff] have been proposed to automatically identify saliency maps from a trained image classification network. The saliency map for category $c$, denoted by $M_c$, describes the discriminative power of each cell $(u, v)$ on the feature map generated by the trained classification network. Here, we use CAM [Zhou et al.(2016)Zhou, Khosla, Lapedriza, Oliva, and Torralba] to obtain the saliency maps for the existing categories and Grad [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman] for the background, as elaborated below.

When applied to a GAP classification network (GAP refers to global average pooling; a GAP network has only the top layer fully connected (FC), preceded by a GAP layer that averages the 3-D feature map into a feature vector), the CAM saliency map is defined as

$M_c(u, v) = \sum_{k} w_{k,c} \, f_k(u, v)$   (2)

where $(u, v)$ are the cell coordinates on the feature map from the last convolutional layer, $f_k(u, v)$ is the response of the $k$-th unit at this layer, and $w_{k,c}$ are the weight parameters of the fully connected (FC) layer corresponding to category $c$.
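As a minimal illustration of Eq. (2) (assuming the last convolutional feature map and the FC weights have already been extracted; names are hypothetical), the CAM map is a class-weighted sum over feature channels:

```python
import numpy as np

def cam_saliency(feature_map, fc_weights, category):
    """Class Activation Map (Eq. 2).

    feature_map: (H, W, K) responses f_k(u, v) of the last conv layer
    fc_weights:  (K, num_classes) weights w_{k,c} of the final FC layer
    category:    class index c
    Returns an (H, W) saliency map M_c.
    """
    # Weighted sum over the K channels with the class-specific FC weights.
    return np.tensordot(feature_map, fc_weights[:, category], axes=([2], [0]))
```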

The Grad background saliency map is defined as

$M_{bg}(u, v) = 1 - \frac{1}{Z} \max_{c \in \mathcal{C}} \left| \frac{\partial p_c}{\partial I(u, v)} \right|$   (3)

where $p_c$ is the output of the classification network, indicating the probability that the input image $I$ belongs to category $c$, $\mathcal{C}$ denotes the set of existing categories, and $Z$ is a normalization factor so that the maximum saliency over all existing categories is normalized to one. More specifically,

$Z = \max_{u, v} \max_{c \in \mathcal{C}} \left| \frac{\partial p_c}{\partial I(u, v)} \right|$   (4)
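A sketch of this computation, under the reconstruction above (assuming a differentiable classifier; `grad_wrt_image` is a hypothetical helper returning the gradient of $p_c$ with respect to the input image), could look as follows:

```python
import numpy as np

def background_saliency(image, existing_classes, grad_wrt_image):
    """Grad-style background saliency (Eqs. 3-4), a hedged sketch.

    image:            (H, W, 3) input image
    existing_classes: list of category indices present in the image
    grad_wrt_image:   callable(image, c) -> (H, W, 3) gradient of p_c w.r.t. the image
    Returns an (H, W) background saliency map.
    """
    # Per-pixel foreground saliency: max absolute gradient over color channels
    # and over the categories present in the image.
    fg = np.zeros(image.shape[:2])
    for c in existing_classes:
        g = np.abs(grad_wrt_image(image, c)).max(axis=2)
        fg = np.maximum(fg, g)
    z = fg.max() + 1e-12          # normalization factor (Eq. 4)
    return 1.0 - fg / z           # background is salient where no object is (Eq. 3)
```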

Aggregated saliency map from multiple regions. The outputs of a detector are essentially a group of bounding boxes with classification scores. We propose a generalized saliency map mechanism that can be applied to detectors with an RoI pooling layer (e.g., Fast R-CNN [Girshick(2015)], as used in this work). As illustrated in Fig. 3(a), given an RoI $R$ and its feature map $F_R$ obtained by the RoI pooling operator (which has a fixed size because the classifier is composed of FC layers), one can compute the saliency map within $R$ as

$M_{R,c} = s_{R,c} \sum_{k} W_{k,c} \odot F_{R,k}$   (5)

where $\odot$ denotes element-wise multiplication, $W_{k,c}$ are the classifier weights for channel $k$ and category $c$ arranged on the spatial grid of $F_R$, and $s_{R,c}$ is the classification score for $R$. To obtain the saliency maps for the entire image, the RoI saliency maps are aggregated via

$M_c(u, v) = \sum_{R} \hat{M}_{R,c}(u, v)$   (6)

where $\hat{M}_{R,c}$ is obtained by resizing $M_{R,c}$ to the RoI size using bilinear interpolation and then padding it to the image size, as shown in Fig. 3(b).

It is worthwhile to point out that Eq. (5) reduces to the single-region or whole-image saliency map for a classification network with no RoI inputs (i.e., the entire image as one RoI). When the image classification network is a GAP network, it further reduces to Eq. (2), as derived in the original CAM [Zhou et al.(2016)Zhou, Khosla, Lapedriza, Oliva, and Torralba]. These relations are explained in the supplementary materials.
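The aggregation in Eq. (6) can be sketched as below (a simplified illustration with hypothetical shapes and integer pixel coordinates assumed for the RoIs; `scipy.ndimage.zoom` provides the bilinear resize):

```python
import numpy as np
from scipy.ndimage import zoom

def aggregate_roi_saliency(roi_maps, rois, image_shape):
    """Aggregate per-RoI saliency maps into an image saliency map (Eq. 6).

    roi_maps:    list of (h, w) saliency maps M_{R,c} on the pooled grid
    rois:        list of (x0, y0, x1, y1) RoI boxes in integer image coordinates
    image_shape: (H, W) of the input image
    """
    full = np.zeros(image_shape, dtype=np.float32)
    for m, (x0, y0, x1, y1) in zip(roi_maps, rois):
        rh, rw = y1 - y0, x1 - x0
        # Bilinear resize of the pooled-grid map to the RoI size.
        resized = zoom(m, (rh / m.shape[0], rw / m.shape[1]), order=1)
        # Clip to the image canvas, then place (pad) and accumulate.
        resized = resized[: image_shape[0] - y0, : image_shape[1] - x0]
        full[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] += resized
    return full
```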

Figure 3: (a) Compute the saliency map in a RoI, and (b) aggregate RoI saliency maps to the image saliency maps.

Object mask growing from discriminative regions. Segmentation seeds are obtained by thresholding these saliency maps, and the seeds from CAM and Grad are simply pooled together. We adopt the SEC method [Kolesnikov and Lampert(2016)] to train the DeepLab [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] network to expand masks from those seeds. In the first round of training, the seeds come from a classification network, while in later rounds they come from the currently trained detector. The trained segmentation network is applied to all training images, and a bounding box is drawn around the largest connected component in each mask. Thus, one instance of each existing category is localized. This location information guides the detector training process described in Sec. 3.3.
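The step from a saliency map to seeds, and from a segmentation mask to a bounding box around its largest connected component, can be sketched as follows (the object threshold of 0.2 matches the implementation details in Sec. 4.1; helper names are illustrative, not the authors' code):

```python
import numpy as np
from scipy.ndimage import label

def seeds_from_saliency(saliency, threshold=0.2):
    """Binary segmentation seeds obtained by thresholding a saliency map."""
    return saliency >= threshold

def box_from_mask(mask):
    """Tight bounding box (x0, y0, x1, y1) around the largest connected
    component of a binary segmentation mask; None if the mask is empty."""
    labeled, num = label(mask)
    if num == 0:
        return None
    sizes = np.bincount(labeled.ravel())[1:]   # component sizes, background excluded
    largest = np.argmax(sizes) + 1
    ys, xs = np.where(labeled == largest)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```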

3.3 Multiple Instance Curriculum Learning

The commonly used MIL framework usually starts with an initial detector and then alternates between updating the boxes (re-localization) and updating the model (re-training). In re-localization, the current detector is applied to the training images and the highest-scoring box is saved. In re-training, the detector is re-trained on the saved boxes.

To re-train the initialized detector from Sec. 3.1, one may use the highest-scoring boxes produced by the detector itself, but it easily gets stuck at the same box. Alternatively, one can use the bounding boxes from the SSG network, but those boxes may not be reliable due to inaccurate segmentation seeds. In other words, relying solely on the initial detector or the segmenter leads to sub-optimal results. To avoid misleading the detector with unreliable boxes, we organize the MIL process on a curriculum that requires the detection boxes and segmentation masks to agree with each other. Details are elaborated below.

SSG-Guided Detector Training. The easy-to-hard learning principle proposed in [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston] has proven helpful in training weakly supervised object detectors [Shi and Ferrari(2016), Tudor Ionescu et al.(2016)Tudor Ionescu, Alexe, Leordeanu, Popescu, Papadopoulos, and Ferrari]. However, most previous methods require additional human supervision or object size information to determine the hardness of a training example, which is expensive to acquire. Instead of seeking additional supervision, we determine the hardness of an example by measuring the consistency between the outputs of the detector and the SSG network. The consistency is defined as the intersection over union (IoU) of the boxes from the SSG and the detector:

$\mathrm{consistency}(I, c) = \mathrm{IoU}\left(B^{det}_{I,c},\, B^{seg}_{I,c}\right)$   (7)

where $I$ represents a positive example for category $c$; $B^{det}_{I,c}$ and $B^{seg}_{I,c}$ stand for the bounding boxes of the object predicted by the detector and the SSG network, respectively. As shown in Fig. 1, an example is considered easy if

$\mathrm{IoU}\left(B^{det}_{I,c},\, B^{seg}_{I,c}\right) \ge T$   (8)

where $T$ is a threshold controlling the hardness of the selected examples. We argue the validity of this criterion from two perspectives. First, those examples are easier because the goal for the detector is to mimic the mask expansion ability of the SSG, and an example appears easier if the detector already produces something similar (i.e., the gap between the achieved result and the learning target is small). Second, the object localization on those examples is confirmed by both the detector and the SSG, meaning that the predicted locations are more reliable. In other words, if $B^{det}_{I,c}$ significantly deviates from $B^{seg}_{I,c}$, it tends to be unreliable. We verify the reliability of the pseudo locations on the selected examples in Sec. 4.3.
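A small sketch of the selection criterion (Eqs. 7-8), followed by the box averaging described next; the box format and data layout are illustrative assumptions:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_easy(det_boxes, seg_boxes, threshold):
    """Keep the (image, class) pairs whose detector and SSG boxes agree (Eq. 8),
    and average the two boxes as the pseudo location label for re-training."""
    easy = {}
    for key in det_boxes.keys() & seg_boxes.keys():   # key = (image_id, class)
        if iou(det_boxes[key], seg_boxes[key]) >= threshold:
            easy[key] = tuple((a + b) / 2.0
                              for a, b in zip(det_boxes[key], seg_boxes[key]))
    return easy
```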

For an existing category $c$, one instance is localized by taking the average of $B^{det}_{I,c}$ and $B^{seg}_{I,c}$ on each selected example. Those localized instances on easy examples are used for further detector training in a fully supervised manner. In this work, we use the popular Fast R-CNN detector [Girshick(2015)] with Selective Search [Uijlings et al.(2013)Uijlings, van de Sande, Gevers, and Smeulders] as the object proposal generator.

Re-localization and re-training. The detector trained on easy examples lacks the ability to handle hard examples because it focuses on easy ones in the aforementioned training round. Thus, we gradually include more training examples by adopting the re-localization/re-training iterations of the MIL framework [Gokberk Cinbis et al.(2014)Gokberk Cinbis, Verbeek, and Schmid], as illustrated in Fig. 1. In the re-localization step, the trained detector is applied to the whole training set and the highest-scoring boxes for the existing categories are recorded as the new $B^{det}$. Meanwhile, the outputs of the detector are used to re-localize segmentation seeds, based on which the SSG network is re-trained to generate the new $B^{seg}$. By applying the same example selection criterion, another training subset is identified; it contains more examples because the detector and the SSG produce more similar results after learning from each other, and the Fast R-CNN detector is re-trained on it. The MIL process alternates between re-training and re-localization until all training examples are included. After training is finished, the detector is applied directly to test images. A minimal sketch of this outer loop is given below.
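The following sketch fixes only the control flow of the MICL outer loop; all components are passed in as callables standing in for the modules of Secs. 3.1-3.3 (placeholders, not the authors' implementation):

```python
def micl_training(examples, threshold, init_detector, relocalize, grow_masks,
                  select_easy, retrain, max_rounds=10):
    """Sketch of the MICL loop: alternate re-localization and re-training,
    gradually expanding the easy subset until all examples are included.

    examples: the full set of (image_id, class) training pairs
    The remaining arguments are callables implementing Secs. 3.1-3.3.
    """
    detector = init_detector(examples)                 # MSC initialization (Sec. 3.1)
    for _ in range(max_rounds):
        det_boxes = relocalize(detector, examples)     # top-scoring detector boxes
        seg_boxes = grow_masks(detector, examples)     # SSG boxes from detector seeds (Sec. 3.2)
        easy = select_easy(det_boxes, seg_boxes, threshold)   # curriculum, Eqs. (7)-(8)
        detector = retrain(detector, easy)             # fully supervised Fast R-CNN re-training
        if len(easy) == len(examples):                 # all training examples included
            break
    return detector
```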

4 Experiments

4.1 Experiment Settings

Datasets. To evaluate the performance of the weakly supervised MICL detector, we conduct experiments on the PASCAL VOC 2007 and 2012 datasets (abbreviated as VOC07 and VOC12 below) [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman], where 20 object categories are labeled. For the MICL detector training, we only use image-level labels, with no human labeled bounding boxes involved.

Evaluation metrics. Following the evaluation of fully supervised detectors, we use the average precision (AP) of each category on the test set as the performance metric. In addition, we use another metric, Correct Localization (CorLoc) [Deselaers et al.(2012)Deselaers, Alexe, and Ferrari], to evaluate weakly supervised detectors; it is usually measured on training images. CorLoc is the percentage of true positives among the most confident predicted boxes for the existing categories. A predicted box is a true positive if it overlaps sufficiently with one of the ground-truth object boxes. The IoU threshold is set to 50% for both metrics.
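For concreteness, a minimal sketch of the CorLoc computation (reusing the `iou` helper sketched in Sec. 3.3; the data layout is a hypothetical one):

```python
def corloc(pred_boxes, gt_boxes, iou_thresh=0.5):
    """CorLoc: fraction of (image, class) pairs whose most confident predicted box
    overlaps some ground-truth box of that class with IoU >= iou_thresh.

    pred_boxes: {(image_id, class): (x0, y0, x1, y1)} most confident prediction
    gt_boxes:   {(image_id, class): [list of ground-truth boxes]}
    Uses iou() as defined in the earlier sketch.
    """
    hits = sum(
        any(iou(box, gt) >= iou_thresh for gt in gt_boxes.get(key, []))
        for key, box in pred_boxes.items()
    )
    return hits / len(pred_boxes) if pred_boxes else 0.0
```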

Implementation Details. The backbone architecture for all modules is the VGG16 network [Simonyan and Zisserman(2015)], pre-trained on the ImageNet dataset [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei]. The whole system is implemented with the TensorFlow library [Abadi et al.(2016)Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, et al.]. The saliency map thresholds in SSG are 0.2 for objects and 0.9 for background (the background threshold is lowered so that background seeds occupy at least 10% of the saliency map, as in [Kolesnikov and Lampert(2016)]). The parameter T in Eq. (8) balances the number of selected examples and their easiness: a higher T means easier but fewer examples, trading off between overfitting to a small set of clean examples and learning from a large but noisy set. We set T empirically. Other details are found in the supplementary materials.

4.2 Experimental Results

Figure 4: Qualitative detection results. Correctly detected ground-truth objects are shown in green boxes and the corresponding predicted locations in blue. Objects that the model fails to detect are shown in yellow, and false positive detections in red.
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv mAP
WSDNN-Ens[Bilen and Vedaldi(2016)] 46.4 58.3 35.5 25.9 14.0 66.7 53.0 39.2 8.9 41.8 26.6 38.6 44.7 59.0 10.8 17.3 40.7 49.6 56.9 50.8 39.3
WSDNN[Bilen and Vedaldi(2016)] 43.6 50.4 32.2 26.0 9.8 58.5 50.4 30.9 7.9 36.1 18.2 31.7 41.4 52.6 8.8 14.0 37.8 46.9 53.4 47.9 34.9
DA[Li et al.(2016)Li, Huang, Li, Wang, and Yang] 54.5 47.4 41.3 20.8 17.7 51.9 63.5 46.1 21.8 57.1 22.1 34.4 50.5 61.8 16.2 29.9 40.7 15.9 55.3 40.2 39.5
ConLoc[Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] 57.1 52.0 31.5 7.6 11.5 55.0 53.1 34.1 1.7 33.1 49.2 42.0 47.3 56.6 15.3 12.8 24.8 48.9 44.4 47.8 36.3
Attn[Teh et al.(2016)Teh, Rochan, and Wang] 48.8 45.9 37.4 26.9 9.2 50.7 43.4 43.6 10.6 35.9 27.0 38.6 48.5 43.8 24.7 12.1 29.0 23.2 48.8 41.9 34.5
SE[Shi and Ferrari(2016)] - - - - - - - - - - - - - - - - - - - - 37.2
MICL 61.2 51.9 47.1 13.5 10.1 52.1 56.9 71.0 7.6 36.4 49.7 64.5 63.0 57.8 27.9 16.6 30.4 53.8 41.1 40.3 42.6
Table 1: Comparison of per-class AP and mAP on the VOC07 test set. Note that the gaps between the previous approaches and our method are particularly large on categories such as "cat", "dog", and "horse". The improvements come from the SSG network, which grows the bounding boxes from the discriminative parts (i.e., faces) to cover the full objects.

The average precision (AP) on the VOC07 test set is shown in Tab. 1. Our MICL method achieves an mAP of 42.6%, which is 3.1% higher than that of the second best method, DA [Li et al.(2016)Li, Huang, Li, Wang, and Yang]. Note that DA [Li et al.(2016)Li, Huang, Li, Wang, and Yang] needs to cache the RoI features from pre-trained CNN models for MIL, which demands a large amount of disk space. Our proposed method avoids feature caching and thus is more scalable. The third best method relies on ensembles [Bilen and Vedaldi(2016)] (WSDNN-Ens in Tab. 1). Compared directly with the results without ensembles (WSDNN in Tab. 1), our method is 7.7% higher. Some visualized detection results are shown in Fig. 4.

Tab. 2 shows the CorLoc evaluation on the VOC07 trainval set. We achieve 2.5% higher CorLoc than WSDNN-Ens [Bilen and Vedaldi(2016)]. Also, compared with DA [Li et al.(2016)Li, Huang, Li, Wang, and Yang], which ranks second in mAP, our MICL method is superior by 8.1%. Note that in terms of both AP and CorLoc, our MICL detector performs much better on certain categories such as "cat", "dog" and "horse". Objects in these categories usually have very discriminative parts (i.e., faces). The improvements on those categories come from the SSG network, which grows the bounding boxes from the discriminative regions.

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv Avg.
WSDNN-Ens[Bilen and Vedaldi(2016)] 68.9 68.7 65.2 42.5 40.6 72.6 75.2 53.7 29.7 68.1 33.5 45.6 65.9 86.1 27.5 44.9 76.0 62.4 66.3 66.8 58.0
WSDNN[Bilen and Vedaldi(2016)] 65.1 63.4 59.7 45.9 38.5 69.4 77.0 50.7 30.1 68.8 34.0 37.3 61.0 82.9 25.1 42.9 79.2 59.4 68.2 64.1 56.1
DA[Li et al.(2016)Li, Huang, Li, Wang, and Yang] 78.2 67.1 61.8 38.1 36.1 61.8 78.8 55.2 28.5 68.8 18.5 49.2 64.1 73.5 21.4 47.4 64.6 22.3 60.9 52.3 52.4
ConLoc[Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] 83.3 68.6 54.7 23.4 18.3 73.6 74.1 54.1 8.6 65.1 47.1 59.5 67.0 83.5 35.3 39.9 67.0 49.7 63.5 65.2 55.1
WSC[Diba et al.(2016)Diba, Sharma, Pazandeh, Pirsiavash, and Van Gool] 83.9 72.8 64.5 44.1 40.1 65.7 82.5 58.9 33.7 72.5 25.6 53.7 67.4 77.4 26.8 49.1 68.1 27.9 64.5 55.7 56.7
MICL 85.3 58.4 68.5 30.4 20.9 67.2 77.1 84.6 24.3 69.5 51.5 80.0 79.8 85.3 46.2 44.9 52.1 64.6 61.3 60.0 60.5
Table 2: Comparison of the CorLoc on the VOC07 trainval set.

To the best of our knowledge, only [Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] and [Li et al.(2016)Li, Huang, Li, Wang, and Yang] report results on the VOC12 dataset. We follow the training/testing split in [Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] and show the results in Tab. 3. We improve the mAP and the CorLoc by 3.6% and 8.4%, respectively. For performance benchmarking against [Li et al.(2016)Li, Huang, Li, Wang, and Yang], we estimate the mAP of our MICL method on the VOC12 val set by applying the detector trained on the VOC07 trainval set. The obtained mAP is 37.8%, which is 8% higher than that of [Li et al.(2016)Li, Huang, Li, Wang, and Yang]. We point out that the VOC07 trainval set does not overlap with the VOC12 val set, and its size is about the same as that of the VOC12 train set used in [Li et al.(2016)Li, Huang, Li, Wang, and Yang].

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv Avg.
AP: ConLoc[Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] 64.0 54.9 36.4 8.1 12.6 53.1 40.5 28.4 6.6 35.3 34.4 49.1 42.6 62.4 19.8 15.2 27.0 33.1 33.0 50.0 35.3
AP: MICL 65.5 57.3 53.4 5.4 11.5 48.8 45.4 80.5 7.6 35.2 25.3 75.8 59.5 68.8 18.0 17.0 24.7 37.7 25.8 14.1 38.9
CorLoc: ConLoc[Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] 78.3 70.8 52.5 34.7 36.6 80.0 58.7 38.6 27.7 71.2 32.3 48.7 76.2 77.4 16.0 48.4 69.9 47.5 66.9 62.9 54.8
CorLoc: MICL 84.9 78.6 76.9 30.1 32.6 80.0 69.7 90.6 32.1 67.7 47.4 85.5 85.3 85.9 41.4 50.7 62.8 62.7 57.9 41.7 63.2
Table 3: Results on the VOC12 dataset, where the AP and the CorLoc are measured on the test and trainval set, respectively.

4.3 Analyses

On the effectiveness of curriculum learning. To validate the effectiveness of curriculum learning, we set up three baseline Fast R-CNN detectors: (1) trained with pseudo location labels from the initial MSC detector, with no segmentation cue used, (2) trained with pseudo location labels from the SSG network, and (3) based on MIL but without a curriculum (where the subset of training examples is selected randomly in each round and the average of $B^{det}$ and $B^{seg}$ is used as the pseudo location label). The ablation study is conducted on the VOC07 dataset. The comparison of the MICL approach and the three baselines is given in Tab. 4. Adding the segmentation cue into the detector training boosts the CorLoc by 10.9% and the mAP by 4.4% over the initial MSC detector. In parallel, MIL without a curriculum gives an even slightly larger improvement of 11.9% in CorLoc and 6.3% in mAP. By introducing curriculum learning (CL), we obtain a total improvement of 18.0% in CorLoc and 10.6% in mAP. We also compare the CorLoc change during training with and without CL in Fig. 5.

MSC SSG MIL MICL
CorLoc 42.5 53.4 54.4 60.5
mAP 32.0 36.4 38.3 42.6
Table 4: Performance comparison of the three Fast R-CNN baselines and the one with the proposed multiple instance curriculum learning paradigm.
Subset All (SSG) All (MSC)
CorLoc 72.9 53.2 42.4
mAP 38.0 36.3 32.0
Table 5: Comparison of the CorLoc on the selected training subset versus on the whole set and mAP on the test set achieved by the correspondingly trained detectors.

On the easy example selection criterion. One reason that curriculum learning is effective is that the pseudo location labels on the selected subset are more reliable than average, as shown in Tab. 5. This can be explained by the different preferences of the SSG network and the MSC detector when they localize objects: the SSG tends to group nearby instances of the same category due to the lack of instance-level information, resulting in boxes larger than the objects; the MSC detector, in contrast, may focus on the most discriminative regions, which are usually smaller than the true objects. This intuition is confirmed by Fig. 6, where we analyze the localization errors of the SSG and the MSC. Among mis-localized objects, the errors can be classified into three categories: too large, too small, and others (details about how mis-localized objects are categorized are provided in the supplementary materials). It is clear that the SSG tends to generate bounding boxes larger than the target, while the MSC prefers smaller ones. Thus, if the results from the SSG and the MSC are consistent, the bounding box is neither too small nor too large, indicating a reliable location. We also train a Fast R-CNN detector on the easy subset only and compare it with detectors trained on the whole set with pseudo locations from the SSG and the MSC, respectively. As shown in Tab. 5, the detector trained on easy examples achieves 1.7% higher mAP than the second best, even though it sees fewer examples, which indicates the importance of the quality of the pseudo location labels.

Figure 5: Comparison of the CorLoc performance of the Fast R-CNN detector with and without curriculum learning.
Figure 6: Percentages of three error types among the mis-localized objects from the SSG and the MSC.

5 Conclusion

In this work, we proposed the MICL method, which injects a segmentation network to overcome the challenge that detectors trained without manually labeled tight object bounding boxes often focus on the most discriminative regions. In the proposed MICL approach, where segmentation-guided MIL is organized on a curriculum, the detector is trained to learn the extent of full objects from easy to hard examples, and the easiness is determined automatically by measuring the consistency between the results of the current detector and the segmenter. Extensive experimental results demonstrate the benefits of the segmentation network and the power of the easy-to-hard curriculum learning paradigm.

Acknowledgements

This research was partially supported by Ittiam Systems. We thank Yaguang Li, Junting Zhang and Heming Zhang for helpful discussions and proofreading.

References

  • [Abadi et al.(2016)Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, et al.] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • [Bearman et al.(2016)Bearman, Russakovsky, Ferrari, and Fei-Fei] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In European Conference on Computer Vision, pages 549–565. Springer, 2016.
  • [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
  • [Bilen and Vedaldi(2016)] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [Bilen et al.(2014)Bilen, Pedersoli, and Tuytelaars] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with posterior regularization. In British Machine Vision Conference, volume 3, 2014.
  • [Bilen et al.(2015)Bilen, Pedersoli, and Tuytelaars] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1081–1089, 2015.
  • [Carreira and Sminchisescu(2010)] Joao Carreira and Cristian Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3241–3248. IEEE, 2010.
  • [Chen et al.(2014)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
  • [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
  • [Deselaers et al.(2010)Deselaers, Alexe, and Ferrari] Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari. Localizing objects while learning their appearance. In European conference on computer vision, pages 452–466. Springer, 2010.
  • [Deselaers et al.(2012)Deselaers, Alexe, and Ferrari] Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari. Weakly supervised localization and learning with generic knowledge. International journal of computer vision, 100(3):275–293, 2012.
  • [Diba et al.(2016)Diba, Sharma, Pazandeh, Pirsiavash, and Van Gool] Ali Diba, Vivek Sharma, Ali Pazandeh, Hamed Pirsiavash, and Luc Van Gool. Weakly supervised cascaded convolutional networks. arXiv preprint arXiv:1611.08258, 2016.
  • [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
  • [Felzenszwalb et al.(2010)Felzenszwalb, Girshick, McAllester, and Ramanan] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • [Girshick(2015)] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
  • [Gokberk Cinbis et al.(2014)Gokberk Cinbis, Verbeek, and Schmid] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Multi-fold mil training for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2409–2416, 2014.
  • [Kantorov et al.(2016)Kantorov, Oquab, Cho, and Laptev] Vadim Kantorov, Maxime Oquab, Minsu Cho, and Ivan Laptev. Contextlocnet: Context-aware deep network models for weakly supervised localization. In European Conference on Computer Vision, pages 350–365. Springer, 2016.
  • [Kolesnikov and Lampert(2016)] Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  • [Kumar Singh et al.(2016)Kumar Singh, Xiao, and Jae Lee] Krishna Kumar Singh, Fanyi Xiao, and Yong Jae Lee. Track and transfer: Watching videos to simulate strong human supervision for weakly-supervised object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [Lee and Grauman(2011)] Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1721–1728. IEEE, 2011.
  • [Li et al.(2016)Li, Huang, Li, Wang, and Yang] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3512–3520, 2016.
  • [Li et al.(2017)Li, Zhang, Zhang, Ren, and Kuo] Siyang Li, Heming Zhang, Junting Zhang, Yuzhuo Ren, and C.-C. Jay Kuo. Box refinement: Object proposal enhancement and pruning. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 979–988. IEEE, 2017.
  • [Lin et al.(2016)Lin, Dai, Jia, He, and Sun] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
  • [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
  • [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [Oquab et al.(2015)Oquab, Bottou, Laptev, and Sivic] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
  • [Papandreou et al.(2015)Papandreou, Chen, Murphy, and Yuille] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1742–1750, 2015.
  • [Pentina et al.(2015)Pentina, Sharmanska, and Lampert] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H Lampert. Curriculum learning of multiple tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5492–5500, 2015.
  • [Ren et al.(2015)Ren, He, Girshick, and Sun] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
  • [Rother et al.(2004)Rother, Kolmogorov, and Blake] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), volume 23, pages 309–314. ACM, 2004.
  • [Shi and Ferrari(2016)] Miaojing Shi and Vittorio Ferrari. Weakly supervised object localization using size estimates. In European Conference on Computer Vision, pages 105–121. Springer, 2016.
  • [Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [Simonyan et al.(2013)Simonyan, Vedaldi, and Zisserman] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • [Siva and Xiang(2011)] Parthipan Siva and Tao Xiang. Weakly supervised object detector learning with model drift detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 343–350. IEEE, 2011.
  • [Siva et al.(2012)Siva, Russell, and Xiang] Parthipan Siva, Chris Russell, and Tao Xiang. In defence of negative mining for annotating weakly labelled data. In European Conference on Computer Vision, pages 594–608. Springer, 2012.
  • [Siva et al.(2013)Siva, Russell, Xiang, and Agapito] Parthipan Siva, Chris Russell, Tao Xiang, and Lourdes Agapito. Looking beyond the image: Unsupervised learning for object saliency and detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3238–3245, 2013.
  • [Song et al.(2014a)Song, Girshick, Jegelka, Mairal, Harchaoui, Darrell, et al.] Hyun Oh Song, Ross B Girshick, Stefanie Jegelka, Julien Mairal, Zaid Harchaoui, Trevor Darrell, et al. On learning to localize objects with minimal supervision. In ICML, pages 1611–1619, 2014a.
  • [Song et al.(2014b)Song, Lee, Jegelka, and Darrell] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems, pages 1637–1645, 2014b.
  • [Teh et al.(2016)Teh, Rochan, and Wang] E Teh, Mrigank Rochan, and Yang Wang. Attention networks for weakly supervised object localization. BMVC, 2016.
  • [Tudor Ionescu et al.(2016)Tudor Ionescu, Alexe, Leordeanu, Popescu, Papadopoulos, and Ferrari] Radu Tudor Ionescu, Bogdan Alexe, Marius Leordeanu, Marius Popescu, Dim P. Papadopoulos, and Vittorio Ferrari. How hard can it be? estimating the difficulty of visual search in an image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [Uijlings et al.(2013)Uijlings, van de Sande, Gevers, and Smeulders] Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
  • [Zhang et al.(2016)Zhang, Lin, Brandt, Shen, and Sclaroff] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision, pages 543–559. Springer, 2016.
  • [Zhou et al.(2016)Zhou, Khosla, Lapedriza, Oliva, and Torralba] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
  • [Zitnick and Dollár(2014)] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.