In recent years, object proposal generation has become crucial for modern object detection methods as an important pre-processing step [girshick2014rich, he2014spatial, girshick2015fast]. It aims to identify a small number (usually on the order of hundreds or thousands) of candidate regions that possibly contain class-agnostic objects of interest in an image. Compared with exhaustive search schemes such as sliding windows [felzenszwalb2010object], object proposal methods significantly reduce the number of candidates to be examined and benefit object detection in two ways: they reduce computation time and allow more sophisticated classifiers to be applied.
Most existing object proposal methods can be roughly divided into two categories: classic methods based on low-level cues and modern methods based on convolutional neural networks (CNNs). The former mainly exploit low-level image features, including edges, gradients and saliency [cheng2014bing, zitnick2014edge, alexe2010object, uijlings2013selective, manen2013prime, zhang2011proposal], to localize regions possibly containing objects. Typically they either follow a bottom-up paradigm, e.g. hierarchical image segmentation [uijlings2013selective, arbelaez2014multiscale], or examine densely distributed windows [cheng2014bing, zitnick2014edge]. However, it is difficult for them to balance localization quality and computational efficiency: they cannot provide high-quality object proposals without incurring expensive computational cost. On the other hand, CNN-based methods either directly predict the coordinates of all the objects in an image [Erhan2013Scalable] or scan the image with a fully convolutional network (FCN) [ren2015faster, jieobject] to find regions of high objectness (“objectness” measures membership to foreground objects vs. background). Although they can achieve a high recall rate under relatively loose overlap criteria, i.e. a low intersection over union (IoU) threshold, this type of method usually fails to provide a high recall rate under stricter criteria, i.e. a high IoU threshold, suggesting poor localization quality.
Ideally, a generic object proposal generator should offer the following desired features: high recall rate on objects of various categories with only a few proposals, good localization quality for each specific object instance and high computation efficiency. In this work, we make an effort to develop the object proposal method toward these targets.
Our method is motivated by a statistical study on the scale of objects in a collection of natural images. In Figure 2, we plot the distribution of object scales (measured by number of pixels) over the training and validation sets of the PASCAL VOC detection benchmark [everingham2014pascal]. From the figure, one can observe that small-scale objects actually dominate the distribution. Similar observations also hold for the ILSVRC 2013 and 2014 benchmarks [russakovsky2014imagenet]. Unfortunately, most existing methods perform poorly in localizing objects of such small sizes in terms of the best overlap. (The best overlap of a particular ground-truth object is defined as the maximal intersection over union (IoU) among all the given proposals w.r.t. this object. Throughout the paper, Average Best Overlap (ABO) is obtained by averaging the best overlap over all the ground-truth objects.) Based on these empirical observations, we argue that the quality of small object localization is one main bottleneck for further improving the recall rate and ABO of object proposal methods. Therefore, we focus on tackling this challenging problem in this work.
In particular, we develop a novel CNN-based object proposal method which contains a pixel-wise object proposal network, sharing a similar spirit with object segmentation networks [chen2014semantic, long2014fully, liang2015proposal]. Here “pixel-wise” means that, for every pixel in an image, our proposed network model predicts a bounding box of the object containing this pixel. Such a pixel-level comprehensive object proposal strategy fully exploits the available annotations for object segmentation (such annotations can be readily collected from many public benchmark datasets) and substantially improves the quality of object proposals by increasing the chances of accurately hitting the ground-truth object. As the receptive field of each pixel in a CNN is a local region around the pixel, directly predicting the coordinates of the bounding box is challenging due to the various spatial displacements of objects. We thus propose to predict, for each pixel, the offset of the bounding box w.r.t. this pixel.
We then take a further step to enhance the localization precision for small-scale objects. We propose a new scale-aware strategy for object proposal, which is inspired by the divide-and-conquer philosophy. Specifically, we train two independent networks, each of which predicts bounding box coordinates for objects at a different scale (small or large). Then, for each pixel, we obtain two candidate object proposals. To adaptively fuse them, we introduce another object confidence network. The network consists of two branches: one for predicting objectness confidence and the other for weighting the large-/small-size object localization networks. (Throughout the paper, we use “large-size network”/“small-size network” to refer to a localization network trained specifically for localizing objects of large/small sizes.) The objectness branch predicts the likelihood of each pixel coming from an object of interest, and the large-/small-size weighting branch trades off the contributions of the large-size and small-size networks to the final prediction by predicting the probability of the pixel belonging to an object of a large size. In the training phase, the size of an object can be easily inferred from its annotated segmentation mask, which is used for training the proposed network. For a new image without annotation, both the large-size and small-size object localization networks predict the bounding box coordinates, which are combined according to the weights from the confidence network. An overview of the proposed network model is presented in Figure 1.
Therefore, the scale-aware coordinate prediction can achieve superior localization quality for a wide range of object sizes: for any object size, the final result always considers and robustly fuses the bounding boxes predicted by the two localization networks, based on a reliable large-/small-size weighting mechanism.
To further improve the performance of localizing small objects, we employ a multi-scale strategy for object proposal on a new image. This is inspired by the observation that by enlarging the challenging small object into a larger one, the coordinates prediction error of the small object will be scaled down, as in the case of zooming in on a small object to obtain a clearer view for humans or cameras. Finally, a superpixel based bounding box refinement operation is applied to fine tune the proposals.
In short, we make the following contributions to object proposal generation. Firstly, we introduce a segmentation-like pixel-wise localization network to densely predict the object coordinates for each pixel. Secondly, we develop a scale-aware object localization strategy which combines the predictions from a large-size and a small-size network with a weighting mechanism to boost the coordinates prediction accuracy for a wide range of object sizes. Thirdly, we conduct extensive experiments on the PASCAL VOC 2007 and ILSVRC 2013 datasets. The results demonstrate that our proposed approach outperforms the state-of-the-art methods by a significant margin, verifying the superiority of the proposed scale-aware pixel-wise object proposal network.
II Related Work
The existing object proposal generation methods can be classified into three types: window scoring methods, segment grouping methods and CNN-based methods.
Window scoring methods design different scoring strategies to predict, for each candidate window, the confidence that it contains an object of interest. Generally, this type of method first initializes a set of candidate windows across scales and positions in an image, then sorts them with a scoring model and selects the top-ranked windows as proposals. Objectness [alexe2012measuring] selects the initial proposals from the salient regions in an image and sorts them based on multiple low-level cues, such as color, edges, location, size, etc. [zhang2011proposal] proposed a cascade of SVMs trained on gradient features to estimate the objectness. BING [cheng2014bing] trains a simple linear SVM on image gradients and applies it in a sliding window scheme to find the highest-scored windows as object proposals. Edge Boxes [zitnick2014edge] also operates in a sliding window manner, but relies on a carefully hand-designed scoring model which sums the edge strengths fully inside the window. Window scoring methods are usually computationally efficient as they do not output segmentation masks for the proposals. However, it seems difficult for them to achieve a high recall rate under strict overlap criteria (high IoU thresholds), which suggests poor localization quality. This can probably be attributed to the discrete sampling of the sliding windows, which are all at pre-defined scales and positions.
[Table I: architecture of the localization network. Columns: hole size | training map size | receptive field size | #weight]
Segment grouping methods are usually initialized with an oversegmentation that yields superpixels for an image. Different merging strategies are then adopted to group similar segments hierarchically and generate object proposals at all scales. Generally, they follow a bottom-up scheme which relies on diverse low-level image cues including color, shape and texture. For example, Selective Search [uijlings2013selective] iteratively merges the most similar segments to form proposals based on several low-level cues. Randomized Prim [manen2013prime] learns a randomized merging strategy based on the superpixel connectivity graph. Multiscale Combinatorial Grouping (MCG) [arbelaez2014multiscale] utilizes multi-scale hierarchical segmentations based on edge strength, and the obtained proposals are then ranked using features including size, location, shape and contour. Geodesic object proposal [krahenbuhl2014geodesic] also depends on superpixels as initialization, then computes a geodesic distance transform and selects certain level sets of the distance transform as object proposals. [kk-lpo-15] proposes learning conditional random fields (CRFs) at multiple scales to classify the superpixels into objects or background. Generally, compared with window scoring methods, segment grouping methods achieve more consistent and acceptable recall under both loose and strict overlap criteria, indicating a better localization ability. Nevertheless, these methods often produce high-quality proposals only by running multiple segmentations at different scales and in different color spaces, and are thus quite computationally expensive and time-consuming.
CNN-based methods follow the great success of convolutional neural networks (CNNs) in other vision tasks [krizhevsky2012imagenet, wei2015hcp, szegedy2014going, liang2015towards], especially semantic segmentation [wei2015stc, wei2016learning, liang2015reversible]. They leverage the powerful discriminative ability of CNNs either to extract visual features as inputs to other techniques that produce proposals, or to directly regress the coordinates of all the object bounding boxes in an image. MultiBox [Erhan2013Scalable] trains a network to directly predict a fixed number of proposals and their confidences in an image and ranks them by the obtained confidences. RPN [ren2015faster] uses a fully convolutional network (FCN) to densely generate the proposals in each local patch based on several pre-defined “anchors” in the patch. DeepProposal [ghodrati2015deepproposal] hunts for the proposals in a sliding window manner by using the CNN features from the final to the beginning layers and training a cascade of linear classifiers to obtain the highest-scored windows. Current CNN-based methods typically achieve high recall with only a small number of proposals under loose overlap criteria (low IoU thresholds). But similar to window scoring methods, they can hardly achieve a high recall rate under stricter overlap criteria (high IoU thresholds). To improve the localization quality of object proposals, our approach instead predicts the object locations in a pixel-wise manner, which gives many more chances to localize each object with high precision. This also takes full advantage of the publicly available segmentation mask annotations. The object coordinate prediction part is similar to [huang2015densebox], which deals with the object detection task. In addition, our scale-aware prediction strategy provides adaptive, accurate prediction for both large-size and small-size objects, which also distinguishes our method from others.
III Scale-aware Pixel-wise Proposal Network
The proposed Scale-aware Pixel-wise Object Proposal Network (SPOP-net) is based on a pixel-wise, segmentation-like object coordinate prediction network, and includes a scale-aware localization mechanism for predicting the coordinates of objects of different sizes. In addition, a multi-scale prediction strategy is employed during testing to boost the localization of small objects. Finally, a superpixel boundary based proposal refinement is introduced to further improve the proposal precision. We elaborate on all the components of SPOP-net in this section.
III-A Pixel-wise Localization Network
The proposed Scale-aware Pixel-wise Object Proposal Network (SPOP-net) takes an image of any size as input and predicts the location of the object w.r.t. each pixel in the image. More concretely, for each pixel, SPOP-net predicts the normalized coordinates of the bounding box of the object that contains the pixel. The predictions from background pixels are meaningless, but they are ranked low because of the low objectness scores they obtain and thus make no difference to the recall performance of the top-ranked proposals, as detailed later. In this subsection, we first explain the architecture of SPOP-net and then elaborate on how to train and apply it.
Our SPOP-net is built upon a pre-trained DeepLab-LargeFOV segmentation network [chen2014semantic]. Its architecture is shown in Table I. The last layer of our localization network has a large receptive field, which enables SPOP-net to “see” a large region of the image in its last layer and predict the object bounding boxes effectively.
For each pixel $i$, the pixel-wise localization network aims to predict the normalized bounding box coordinates $\mathbf{t}_i = (x_1/w,\ y_1/h,\ x_2/w,\ y_2/h)$ of the object that contains this pixel. Here $(x_1, y_1)$ and $(x_2, y_2)$ denote the coordinates of the top-left and bottom-right corners of the object bounding box containing the pixel; $h$ and $w$ represent the height and the width of the image plane respectively. Therefore, for a single object, all the pixels inside this object are given the same ground-truth values $\mathbf{t}_i$. We train the pixel-wise localization network to minimize a localization error that is proportional to the Euclidean distance between the predicted coordinate vector $\hat{\mathbf{t}}_i$ and the ground-truth coordinate vector $\mathbf{t}_i$ for all the foreground pixels. The loss function $L_{loc}$ is defined as

$$L_{loc} = \sum_i \mathbb{1}_i^{fg}\, \lVert \hat{\mathbf{t}}_i - \mathbf{t}_i \rVert_2, \tag{1}$$

where $\hat{\mathbf{t}}_i$ is the predicted 4-d object coordinate vector, and $\mathbb{1}_i^{fg}$ is a binary variable indicating whether the pixel $i$ is a foreground one: it takes $1$ if the pixel is from a foreground object and $0$ otherwise. Such a filtered loss (through $\mathbb{1}_i^{fg}$) enables the localization network to concentrate on localizing foreground objects without being distracted by background pixels in the training phase. In the practical implementation, as the final layer is smaller than the input image, we resize the ground-truth coordinate map to the same small size as the final layer.
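As an illustration of the foreground-filtered loss, the following NumPy sketch sums the per-pixel Euclidean distance between predicted and ground-truth coordinate vectors over foreground pixels only. It is a minimal sketch rather than the training code: the function name, the array layout `(H, W, 4)` and the absence of any normalization constant are assumptions made for the example.

```python
import numpy as np

def masked_localization_loss(pred, target, fg_mask):
    """Sum of per-pixel Euclidean distances, counted on foreground pixels only.

    pred, target: arrays of shape (H, W, 4) holding one normalized
                  bounding box (x1/w, y1/h, x2/w, y2/h) per pixel.
    fg_mask:      array of shape (H, W); 1 for foreground, 0 for background.
    The layout and the lack of a scaling constant are assumptions.
    """
    dist = np.linalg.norm(pred - target, axis=-1)  # (H, W) distances
    return float(np.sum(fg_mask * dist))           # background contributes 0

# Toy example: one foreground pixel with an error, one background pixel with an error.
target = np.zeros((2, 3, 4))
pred = np.zeros((2, 3, 4))
pred[0, 0] = [0.3, 0.4, 0.0, 0.0]   # foreground pixel, distance 0.5
pred[1, 2] = [1.0, 0.0, 0.0, 0.0]   # background pixel, ignored by the mask
fg_mask = np.zeros((2, 3))
fg_mask[0, 0] = 1
loss = masked_localization_loss(pred, target, fg_mask)
```

Only the foreground pixel contributes, so the background error does not affect the loss value.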
However, due to possible spatial displacement (e.g. two identical objects could appear at different locations in an image), accurately predicting the absolute object bounding box coordinates is difficult: the two objects present the same visual input to the model, but the locations the model needs to learn to predict are totally different. To solve this issue, for each pixel, we change the learning targets from the absolute object bounding box coordinates to the offsets from the pixel to the object bounding box. E.g. for the bounding box coordinate $x_1$, we change the target from $x_1/w$ to $(x_1 - x_p)/w$, where $x_p$ is the $x$-coordinate of the pixel itself. Changing the coordinates to offsets can be conveniently achieved by element-wise summation of the output of the second-to-last layer and the spatial coordinate map (the $x_p/w$ or $y_p/h$ values of all the pixels themselves). Then the absolute object bounding box coordinates can be used as learning targets for the final layer. In this way, applying the absolute coordinate learning targets to the final layer is equivalent to applying the following object coordinate offsets to the second-to-last layer:

$$\left(\frac{x_1 - x_p}{w},\ \frac{y_1 - y_p}{h},\ \frac{x_2 - x_p}{w},\ \frac{y_2 - y_p}{h}\right).$$
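The relation between offsets and absolute coordinates can be sketched in a few lines of NumPy. The example below builds a normalized spatial coordinate map and adds it to a map of predicted offsets to recover absolute box coordinates; the helper names and the channel order `(x1, y1, x2, y2)` are assumptions made for illustration, not part of the described network.

```python
import numpy as np

def coordinate_map(h, w):
    """Normalized pixel coordinates, arranged as (x/w, y/h, x/w, y/h) per pixel."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xn, yn = xs / w, ys / h
    return np.stack([xn, yn, xn, yn], axis=-1)   # shape (h, w, 4)

def offsets_to_absolute(offsets):
    """Element-wise sum of the offset map and the coordinate map."""
    h, w, _ = offsets.shape
    return offsets + coordinate_map(h, w)

def absolute_to_offsets(boxes):
    """Inverse mapping: offset targets that correspond to absolute targets."""
    h, w, _ = boxes.shape
    return boxes - coordinate_map(h, w)

# Every pixel of a 4x8 map is assigned the same box (x1, y1, x2, y2) = (2, 1, 6, 3).
h, w = 4, 8
box = np.array([2 / w, 1 / h, 6 / w, 3 / h])
absolute = np.tile(box, (h, w, 1))
offsets = absolute_to_offsets(absolute)
recovered = offsets_to_absolute(offsets)
```

In this toy setting the absolute targets are identical for all pixels of the object, while the offset targets differ per pixel and depend only on the displacement between the pixel and the box corners.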
Then we can directly obtain the absolute object proposal coordinates from the predictions of the final layer. The output map of the final layer is smaller than the input image, and all the subsequent procedures (e.g. refinement, ranking and NMS) operate only on this smaller output map. This is because we leverage pixel-level prediction of proposals only to have a higher chance of hitting the ground-truth objects accurately, rather than to perform pixel-level classification as DeepLab does. Resizing the smaller output map back to the original size would make the subsequent refinement, ranking and NMS steps much more computationally expensive without bringing a significant performance improvement.
III-B Scale-aware Localization
A fully trained pixel-wise localization network can predict the coordinates of object bounding boxes w.r.t. each pixel of an image. However, a single network model may not be able to handle all the annotated objects well, since their sizes are quite diverse, and it offers only inferior localization performance for objects of small sizes. To verify this point, we conduct the following preliminary experiments to evaluate the errors of bounding box prediction for large and small objects, using a single pixel-wise localization network trained on the annotated objects of all sizes. The evaluation results are shown in Table II.
From Table II, one can observe that the network trained on objects of all sizes produces an error for small objects that is markedly larger than the error for large objects. This demonstrates the poor localization ability of a single network model for small objects.
The difficulty of accurately localizing both large and small objects using a single network arguably lies in handling the highly diverse offsets of large and small objects. Apart from this, another difficulty comes from the extreme imbalance between the numbers of training pixels from large and small objects. Because of this imbalance, the training error of large objects dominates the training loss being minimized.
Also, we empirically verify the sample imbalance through statistics on the pixel-level distribution of the annotations in terms of the area of the object (see Figure 4) since our pixel-wise localization network is trained on pixel-level annotations.
[Table: prediction errors. Columns: large objects | small objects]
To improve the localization accuracy for small objects, we propose a scale-aware localization strategy. In this strategy, two localization networks sharing the same architecture are trained on two non-overlapping subsets of the objects. The large-size network is trained only on the pixels belonging to large objects and the small-size network is trained only on the pixels belonging to small objects. The loss functions to be optimized for the large-size and small-size networks are shown in Eqn. (2) and Eqn. (3) respectively:

$$L_{loc}^{L} = \sum_i \mathbb{1}_i^{L}\, \lVert \hat{\mathbf{t}}_i^{L} - \mathbf{t}_i \rVert_2, \tag{2}$$

$$L_{loc}^{S} = \sum_i \mathbb{1}_i^{S}\, \lVert \hat{\mathbf{t}}_i^{S} - \mathbf{t}_i \rVert_2, \tag{3}$$

where $\hat{\mathbf{t}}_i^{L}$ and $\hat{\mathbf{t}}_i^{S}$ are the coordinate vectors predicted for pixel $i$ by the large-size and small-size networks, $\mathbf{t}_i$ is the ground-truth coordinate vector, and $\mathbb{1}_i^{L}$ and $\mathbb{1}_i^{S}$ are binary indicators showing whether the pixel $i$ belongs to a large object or a small object. The effectiveness of training such scale-aware networks is validated by evaluating the errors of small object location prediction with the small-size network; see Table III. During the testing phase, the two networks work simultaneously, each outputting its own prediction for an image. Then, the predictions from the two networks are combined with an adaptive weighting scheme.
The weight is output by a network trained to classify pixels as belonging to large or small objects; it equals the confidence, obtained in the last layer of this network, that the pixel belongs to a large object. We term this classification network the “confidence network”.
The structure of the confidence network is illustrated in Figure 5. Apart from the large/small classification branch, the confidence network also outputs the objectness confidence in another branch aiming to classify all the pixels into two categories, i.e., foreground pixels and background pixels.
In the confidence network, the two branches share the convolutional features in the lower layers. The last shared feature maps are then fed into the two branches. The intuition for dividing the confidence network into two branches at the higher layers is that, across different tasks, the low-level features are usually common and can be shared [zeiler2014visualizing], while the semantically high-level features extracted by the higher layers may be totally different. For example, the foreground/background classification task prefers common features that are insensitive to object size, whereas the large/small classification task aims to extract features that discriminate between large and small objects. The large receptive field in the last layer of the confidence network provides a sufficiently large view for both the foreground/background and the large/small classification.
The objective function to be optimized when training the confidence network is a multi-task cross-entropy loss:

$$L_{conf} = -\sum_i \left[ o_i^{*} \log o_i + (1 - o_i^{*}) \log (1 - o_i) \right] - \sum_i o_i^{*} \left[ s_i^{*} \log s_i + (1 - s_i^{*}) \log (1 - s_i) \right]. \tag{4}$$

Here $o_i^{*}$ and $o_i$ are the ground-truth label of the foreground/background classification and the predicted confidence of being a foreground pixel for pixel $i$, respectively. $s_i^{*}$ and $s_i$ are the ground-truth label of the large/small object classification and the predicted confidence of being contained in a large object for pixel $i$, respectively. Note that the second term is only activated when $o_i^{*}$ equals 1, indicating that the pixel belongs to a foreground object. After the large object confidence $s_i$ for the pixel $i$ is obtained, the final predicted coordinates of the object it belongs to are the weighted sum of the predictions by the large-size and small-size networks:

$$\hat{\mathbf{t}}_i = s_i\, \hat{\mathbf{t}}_i^{L} + (1 - s_i)\, \hat{\mathbf{t}}_i^{S}, \tag{5}$$

where $\hat{\mathbf{t}}_i^{L}$ and $\hat{\mathbf{t}}_i^{S}$ are the predictions by the large-size and the small-size network respectively. We then treat the object coordinates predicted at each pixel as an initial proposal, which is passed to the later proposal refinement and non-maximum suppression (NMS) steps to obtain the final object proposals.
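The adaptive fusion of the two localization networks amounts to a per-pixel convex combination. The NumPy sketch below illustrates it under assumed array shapes (`(H, W, 4)` for the coordinate maps and `(H, W)` for the large-object confidence); the function and argument names are hypothetical.

```python
import numpy as np

def fuse_predictions(pred_large, pred_small, large_conf):
    """Per-pixel weighted sum of the large-size and small-size predictions.

    pred_large, pred_small: (H, W, 4) coordinate maps from the two networks.
    large_conf:             (H, W) confidence in [0, 1] that the pixel
                            belongs to a large object.
    """
    weight = large_conf[..., None]                  # broadcast over the 4 coordinates
    return weight * pred_large + (1.0 - weight) * pred_small

# Toy example with two pixels: the first trusts the large-size network fully,
# the second mostly trusts the small-size network.
pred_large = np.ones((1, 2, 4))
pred_small = np.zeros((1, 2, 4))
large_conf = np.array([[1.0, 0.25]])
fused = fuse_predictions(pred_large, pred_small, large_conf)
```

A confidence of 1 reproduces the large-size prediction, a confidence of 0 reproduces the small-size prediction, and intermediate values interpolate between the two boxes.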
III-C Multi-scale Inference
To further enhance the accuracy of small object localization, we propose to employ a multi-scale prediction strategy in the testing phase. The motivation is quite straightforward: by enlarging a challenging small object into a larger one, the coordinate prediction error for the small object is scaled down, which is similar to zooming in on a small object to improve the localization accuracy. All the proposals obtained in the enlarged image are then mapped back to their corresponding positions at the original scale.
Therefore, given a testing image, in addition to its original scale, we resize it to a larger scale and run the prediction process on it as well. Specifically, at both the original scale and the enlarged scale, we simultaneously run the two localization networks (i.e. large-size and small-size) and the confidence network, and combine the two location predictions weighted by the large object confidence of the corresponding scale. As all the feed-forward computations of the networks are independent and can be performed in parallel, the computation time can remain relatively low.
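To make the mapping between scales concrete, the following sketch maps proposals found in the enlarged image back to the original image and pools them with the original-scale proposals. It assumes boxes are given in pixel coordinates `(x1, y1, x2, y2)` of the image they were predicted on; with this convention a localization error of e pixels in the enlarged image becomes e / scale pixels after mapping back. The function names are hypothetical.

```python
import numpy as np

def map_boxes_to_original(boxes, scale):
    """Map (N, 4) pixel boxes from the enlarged image to the original image."""
    return np.asarray(boxes, dtype=float) / scale

def merge_scales(boxes_orig, scores_orig, boxes_enlarged, scores_enlarged, scale=2.0):
    """Pool proposals from both scales in original-image coordinates."""
    mapped = map_boxes_to_original(boxes_enlarged, scale)
    boxes = np.vstack([np.asarray(boxes_orig, dtype=float), mapped])
    scores = np.concatenate([scores_orig, scores_enlarged])
    return boxes, scores

# One proposal per scale; the enlarged-scale box is halved when mapped back.
boxes, scores = merge_scales(
    boxes_orig=[[5, 5, 50, 60]], scores_orig=[0.9],
    boxes_enlarged=[[20, 40, 60, 80]], scores_enlarged=[0.7],
    scale=2.0,
)
```

The pooled boxes and scores can then be passed on to refinement, ranking and NMS as a single set.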
III-D Proposal Refinement
We then refine the two sets of proposals obtained in both original and enlarged scales. An inherent weakness for object localization by regressing the four coordinates with CNN is that the objectness and coordinates ground-truths only permit determining the most discriminative foreground windows. Therefore, even though the windows decided by the localization networks are likely to overlap with target objects, it cannot be ensured that they are able to delineate object boundaries well.
To take object boundaries into consideration, we utilize a superpixel boundary based window refinement method, similar to [Chen2015Improving]. The main idea is to expand or shrink the proposals so that their four sides align better with the boundaries of the superpixels. The reason for using superpixels is that their boundaries are informative indicators of object boundaries, and superpixels can be generated efficiently with off-the-shelf algorithms (e.g. SLIC [achanta2012slic]). Specifically, for each proposal, we generate two refined versions: the minimum bounding rectangle of all the superpixels entirely inside this proposal, and the minimum bounding rectangle of all the superpixels either entirely inside this proposal or straddling it (see Figure 7). As illustrated in Figure 7, expansion and shrinkage offer two possible ways of getting close to the ground-truth box for proposals with different location biases w.r.t. the ground truth. Therefore, we pass both versions of the refined proposals, as well as the initial proposals, to the later proposal ranking and NMS processing.
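The two refined versions can be sketched as follows for a given superpixel label map. This is a simplified illustration under stated assumptions: `labels` is an integer map with one superpixel id per pixel (for instance produced by an off-the-shelf SLIC implementation), boxes are inclusive pixel coordinates `(x1, y1, x2, y2)`, and the function names are hypothetical.

```python
import numpy as np

def bounding_rect(mask):
    """Minimum bounding rectangle (x1, y1, x2, y2) of a boolean mask, or None."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

def refine_with_superpixels(labels, box):
    """Return the shrunk and expanded versions of a proposal.

    shrunk:   bounding rectangle of superpixels entirely inside the box.
    expanded: bounding rectangle of superpixels inside or straddling the box.
    """
    x1, y1, x2, y2 = box
    in_box = np.zeros(labels.shape, dtype=bool)
    in_box[y1:y2 + 1, x1:x2 + 1] = True
    touching = np.unique(labels[in_box])         # superpixels with a pixel in the box
    outside = np.unique(labels[~in_box])         # superpixels with a pixel outside it
    inside_only = np.setdiff1d(touching, outside)
    shrunk = bounding_rect(np.isin(labels, inside_only))
    expanded = bounding_rect(np.isin(labels, touching))
    return shrunk, expanded

# Toy label map: a 6x6 image tiled by nine 2x2 superpixels.
rows, cols = np.meshgrid(np.arange(6), np.arange(6), indexing="ij")
labels = (rows // 2) * 3 + (cols // 2)
shrunk, expanded = refine_with_superpixels(labels, box=(1, 1, 4, 4))
```

In this toy case only the central superpixel lies entirely inside the proposal, so the shrunk box collapses onto it, while the expanded box grows to cover every superpixel the proposal touches. When no superpixel lies entirely inside the proposal, the sketch returns `None` for the shrunk version.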
In the proposal ranking stage, we sort all the proposals (including the initial ones and the two refined versions at both the original and enlarged scales) by their objectness confidence $o_i$. Recall that $o_i$ is the output of the foreground/background classification branch of the confidence network. For each initial proposal, its two refined versions are assigned the same objectness confidence as the initial proposal itself. Finally, standard non-maximum suppression (NMS) is employed to remove highly overlapped, redundant proposals.
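Ranking followed by NMS can be illustrated with the standard greedy procedure. The sketch below is a generic NumPy implementation, not the code used in the experiments; the default threshold of 0.8 mirrors the NMS overlap threshold reported in the experimental setup, and the function names are hypothetical.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    iw = np.clip(np.minimum(box[2], boxes[:, 2]) - np.maximum(box[0], boxes[:, 0]), 0, None)
    ih = np.clip(np.minimum(box[3], boxes[:, 3]) - np.maximum(box[1], boxes[:, 1]), 0, None)
    inter = iw * ih
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / np.maximum(area + areas - inter, 1e-12)

def rank_and_nms(boxes, scores, iou_threshold=0.8):
    """Sort proposals by score and greedily drop those overlapping a kept one."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # keep only sufficiently distinct boxes
    return keep

# The second box overlaps the first with IoU 0.9 and is suppressed.
boxes = [[0, 0, 10, 10], [0, 0, 10, 9], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
kept = rank_and_nms(boxes, scores)
```

The returned indices are ordered by decreasing objectness confidence, so taking the first k entries gives the top-k proposals.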
IV Experiments and Discussion
IV-A Experimental Setups
The proposed Scale-aware Pixel-wise Object Proposal Network (SPOP-net) is trained on the SBD annotations [hariharan2011semantic] of the PASCAL VOC 2012 trainval set, which provides images with fine segmentation mask annotations. We manually label the objects containing more than pixels as large objects and those containing less than pixels as small ones. Considering the unbalanced pixel samples when training the large-/small-size weighting branch, for each large object, we randomly sample pixels in it for training to balance the number of training pixels belonging to large and small objects. Both the “confidence network” and the two localization networks are trained using the published DeepLab code [chen2014semantic], which is based on the publicly available deep learning platform Caffe [jia2014caffe], and the biases are initialized with . The initial learning rate is for the pre-trained layers in the DeepLab-LargeFOV network and for the newly-added layers. All of them are reduced by a scale of after every epochs. The mini-batch size is set as . We train the network for about epochs. The overlap threshold for NMS in our experiments is set to 0.8 for a good trade-off between the recall at low IoU thresholds (e.g. 0.5) and high IoU thresholds (e.g. 0.8). The training images are all resized to 513×513. During testing, for the original scale, all the images are directly fed into the networks without any scaling; for the enlarged scale, all the images are enlarged by a factor of 2.
The proposed SPOP-net is then extensively evaluated on the PASCAL VOC 2007 testing set, which is the most widely used benchmark for comparing object proposal algorithms. It contains images with annotated objects (including “hard” objects) in bounding boxes. We are not able to evaluate on the PASCAL VOC 2012 testing set because its ground truths are not publicly released. Since missed objects can never be recovered in the post-classification stage of a proposal-based object detection pipeline, object recall rate is naturally regarded as the standard evaluation metric for object proposal algorithms. We also evaluate the localization quality measured by Average Best Overlap (ABO). In addition, the object detection performance of our proposals in the Fast-RCNN [girshick2015fast] detection pipeline is evaluated to validate their effectiveness in the object detection task. Finally, we evaluate the generalization ability by testing the recall rate on the ILSVRC 2013 validation set using our network trained on PASCAL VOC 2012.
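The two evaluation metrics can be computed from the IoU matrix between ground-truth boxes and proposals. The sketch below follows the common convention that a ground-truth object counts as recalled when its best overlap reaches the IoU threshold; this convention, the box format `(x1, y1, x2, y2)` and the function names are assumptions made for the example.

```python
import numpy as np

def iou_matrix(gt_boxes, proposals):
    """IoU between every ground-truth box (rows) and every proposal (columns)."""
    gt = np.asarray(gt_boxes, dtype=float)[:, None, :]
    pr = np.asarray(proposals, dtype=float)[None, :, :]
    iw = np.clip(np.minimum(gt[..., 2], pr[..., 2]) - np.maximum(gt[..., 0], pr[..., 0]), 0, None)
    ih = np.clip(np.minimum(gt[..., 3], pr[..., 3]) - np.maximum(gt[..., 1], pr[..., 1]), 0, None)
    inter = iw * ih
    area_gt = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    area_pr = (pr[..., 2] - pr[..., 0]) * (pr[..., 3] - pr[..., 1])
    return inter / np.maximum(area_gt + area_pr - inter, 1e-12)

def best_overlaps(gt_boxes, proposals):
    """Best overlap of each ground-truth object: maximal IoU over all proposals."""
    return iou_matrix(gt_boxes, proposals).max(axis=1)

def recall_at(gt_boxes, proposals, iou_threshold):
    """Fraction of ground-truth objects whose best overlap reaches the threshold."""
    return float(np.mean(best_overlaps(gt_boxes, proposals) >= iou_threshold))

def average_best_overlap(gt_boxes, proposals):
    """ABO: mean of the best overlaps over all ground-truth objects."""
    return float(np.mean(best_overlaps(gt_boxes, proposals)))

gt_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
proposals = [[0, 0, 10, 10], [20, 20, 30, 25], [50, 50, 60, 60]]
abo = average_best_overlap(gt_boxes, proposals)
recall_loose = recall_at(gt_boxes, proposals, 0.5)
recall_strict = recall_at(gt_boxes, proposals, 0.8)
```

In this toy example the second object is covered only by a half-height proposal, so it is recalled under the loose threshold but missed under the strict one, which is the kind of gap between loose and strict criteria discussed throughout the paper.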
IV-B Ablation Studies
We first study the effectiveness of the four components in our method: the pixel-wise localization network (basic setting), scale-aware localization, multi-scale inference and proposal refinement. Several simplified variants of the SPOP-net are tested in terms of the object recall rate on the PASCAL VOC 2007 testing set. Specifically, we use the prediction at the original scale only, without scale-awareness and proposal refinement, as our baseline, which is referred to as single scale. Without scale-awareness, only one localization network is trained on all of the foreground pixels, including both large-size and small-size ones. Then, we cumulatively add scale-awareness, multi-scale inference and proposal refinement to the baseline to see the benefits of each component. Please note that multi-scale inference here indicates prediction at two scales, namely the original image scale and the 2× enlarged scale.
Figure 8 shows the recall and average best overlap (ABO) comparisons under different scenarios between the four variants, i.e. single scale, single scale with scale-awareness, multi-scale with scale-awareness, and multi-scale with scale-awareness and refinement. The numbers of proposals of S-scale and S-scale+SA are around 500 because most proposals are filtered out by NMS, as pixel-wise localization networks generate highly overlapped proposals (see Figure 14). From Figures 8(a), 8(b), 8(c), 8(e), 8(f) and 8(g), we find that both scale-awareness and multi-scale inference improve the recall under both low and high IoU thresholds. Proposal refinement, however, harms the recall under low IoU thresholds when the number of proposals is small. The reason probably lies in the much larger number of proposals after refinement, since every initial proposal is accompanied by its two refined versions. Although this increases the chances of getting close to the ground truths, which can boost the recall for a large number of proposals, it also causes too many duplicate proposals to concentrate on a small area, which lowers the recall under loose IoU criteria when only a small number of proposals is required. The average best overlap in Figure 8(d) shows a similar trend to the recall, suggesting the benefits of all three components in terms of localization quality.
We then study the contributions of all the components for different object areas. Figure 9 presents the distributions, w.r.t. object area, of the objects detected by the four variants of SPOP-net and of the ground truths. It is found that the baseline variant, i.e. single scale without scale-awareness and refinement, can hit most of the big objects but performs poorly for small objects. The scale-aware weighted combination mechanism and multi-scale inference help improve the recall for small objects significantly, which shows the effectiveness of both the proposed scale-aware localization strategy and multi-scale inference.
To further verify the effectiveness of scale-awareness and multi-scale inference in localizing small objects, we break the SPOP-net up into four building blocks, i.e. the large-size network and the small-size network in the original scale, and the large-size network and the small-size network in the enlarged scale, and investigate their respective contributions to the final localization. We evaluate the average best overlap (ABO) of the four building blocks for ground-truth objects of different areas. Figure 10 shows the ABO versus object area curves of the four building blocks. As the object becomes larger, the large-size network in the original scale predicts more accurate localization results. The small-size network in the original scale achieves its highest ABO when the object area is around , but it performs poorly for very small objects. The small-size network in the enlarged scale compensates for this weakness and gives the best performance for very small objects, owing to the enlarged view of these objects. The large-size network in the enlarged scale performs best for middle-size objects containing to pixels, serving as the bridge between the large-size network in the original scale and the small-size networks in both scales. The probable reason for this behavior is that small objects, once enlarged, become "large objects" that are easier for the large-size network to predict, whereas originally large objects become even larger and can no longer be covered by the receptive field, which makes them difficult to localize precisely. In both the original scale and the enlarged scale, the result after scale-aware fusion achieves the higher of the two ABOs obtained by the large-size and small-size networks, validating the effectiveness of the adaptive scale-aware fusion strategy.
Investigating the building blocks of the proposed SPOP-net shows that they complement each other in localizing objects of different areas, which ensures that the SPOP-net performs well over a wide range of object sizes.
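The scale-aware fusion evaluated above can be illustrated with a small sketch. The exact weighting formula is not restated in this section, so the code assumes a simple convex combination of the two predicted boxes, with the weight taken as the confidence that the object is "large"; the function name and the numbers are hypothetical.

```python
def fuse_boxes(box_large, box_small, w_large):
    """Blend the boxes predicted by the large-size and small-size networks.

    w_large is assumed to be the large-/small-size classifier's confidence
    (in [0, 1]) that the object is large; the small-size prediction gets
    the complementary weight 1 - w_large.
    """
    if not 0.0 <= w_large <= 1.0:
        raise ValueError("w_large must lie in [0, 1]")
    return tuple(
        w_large * l + (1.0 - w_large) * s
        for l, s in zip(box_large, box_small)
    )


# Hypothetical predictions from one pixel, as (x1, y1, x2, y2).
fused = fuse_boxes((0, 0, 100, 100), (10, 10, 50, 50), w_large=0.25)
print(fused)
```

With a confident size estimate the fused box stays close to the prediction of the better-suited network, which is consistent with the fused ABO tracking the higher of the two curves.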
IV-C Comparisons on Object Recall
We compare our SPOP-net with the following state-of-the-art object proposal methods: BING [cheng2014bing], Edge Boxes [zitnick2014edge], Geodesic Object Proposal [krahenbuhl2014geodesic], MCG [arbelaez2014multiscale], Objectness [alexe2012measuring], Selective Search [uijlings2013selective] and Region Proposal Network (RPN with VGG-16) [ren2015faster]. We first evaluate object recall on the PASCAL VOC 2007 testing set, which contains images with about annotated objects. Proposals of most state-of-the-art methods were provided by Hosang et al. [Hosang2015arXiv] in a standard format. For the DeepProposal approach, we directly downloaded the pre-computed proposals from the official website (https://github.com/aghodrati/deepproposal).
Figure 11(a) and 11(b) show the recall when varying the number of proposals for different IoU thresholds. Under a loose IoU threshold, RPN takes the lead for both small and large numbers of proposals. DeepProposal 50 also performs well under low IoU thresholds (e.g. 0.5). Given a stricter IoU threshold , our SPOP-net performs best almost consistently. We also plot the average recall (AR) versus the number of proposals for all the methods in Figure 11(c), because AR summarizes proposal performance across IoU thresholds and correlates well with object detection performance [Hosang2015arXiv]. The proposed SPOP-net also takes the first place in AR regardless of the number of proposals. Figure 11(d) shows the average best overlap (ABO) when changing the number of proposals. The proposed SPOP-net shows good localization quality, especially when the number of proposals is more than . Figure 11(e), 11(f) and 11(g) show the recall when the IoU threshold changes within the range [, ] for different numbers of proposals. RPN performs well with a small number of proposals when a low IoU threshold () is set. When the number of proposals increases from to , our SPOP-net gradually shows its advantage. Especially for the top proposals, the SPOP-net performs best across a wide range of IoU thresholds from to ; these thresholds have the strongest correlation with object detection performance and are thus typically desired in practice [Hosang2015arXiv].
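To make the AR metric concrete, the sketch below computes it from the best IoU achieved for each ground-truth object. It assumes the convention of [Hosang2015arXiv], in which recall is averaged over IoU thresholds from 0.5 to 1; this range is an assumption here, and the function names and toy values are hypothetical.

```python
def average_recall(best_ious, lo=0.5):
    """Closed form of AR: recall averaged over IoU thresholds in [lo, 1].

    Integrating the recall-vs-IoU curve reduces to the mean of
    max(best_iou - lo, 0), divided by the width of the range.
    """
    gain = sum(max(b - lo, 0.0) for b in best_ious)
    return gain / (len(best_ious) * (1.0 - lo))


def average_recall_numeric(best_ious, lo=0.5, steps=1000):
    """Same quantity by explicitly averaging recall over many thresholds."""
    width = (1.0 - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * width  # midpoint of each threshold bin
        total += sum(b >= t for b in best_ious) / len(best_ious)
    return total / steps


# Toy best-IoU values for three ground-truth objects (hypothetical).
best = [0.9, 0.6, 0.3]
print(average_recall(best), average_recall_numeric(best))
```

Both functions agree up to discretization error, which shows that AR rewards proposals for every increment of overlap above the lower threshold rather than for crossing a single cutoff.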
Figure 13 shows the average best overlap (ABO) of the proposed SPOP-net and of several state-of-the-art methods for ground-truth objects of different areas. For most object sizes, the SPOP-net shows outstanding performance. In particular, for small objects whose area is less than about , the SPOP-net takes the first place, achieving an ABO higher than . RPN achieves a good ABO of around for objects whose areas are more than pixels, but can hardly reach a higher ABO even when the object is large. This may explain why the recall of RPN is very high under a loose IoU threshold (e.g. ) but decreases sharply as the IoU threshold increases beyond . The classic methods based on low-level cues (e.g. Selective Search, MCG, GOP) perform very well for large objects but are inferior to the two CNN-based methods (i.e. SPOP-net, RPN) for small ones.
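An ABO-versus-area curve of this kind can be produced by grouping the ground-truth objects into area bins and averaging the best IoU within each bin. The following sketch assumes that the best IoU per ground-truth box has already been computed; the bin edges, function name and toy values are hypothetical and are not those used for Figure 13.

```python
from bisect import bisect_right


def abo_by_area(gt_boxes, best_ious, edges):
    """Mean best IoU per area bin.

    edges holds the ascending lower edges of the bins (in pixels); the last
    bin is open-ended. Returns one value per bin, or None for an empty bin.
    """
    sums = [0.0] * len(edges)
    counts = [0] * len(edges)
    for (x1, y1, x2, y2), best in zip(gt_boxes, best_ious):
        area = (x2 - x1) * (y2 - y1)
        k = bisect_right(edges, area) - 1  # index of the bin containing area
        if k < 0:
            continue  # area below the first edge: ignore
        sums[k] += best
        counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy data (hypothetical): areas 400, 900, 2500 and 40000 pixels.
gt = [(0, 0, 20, 20), (0, 0, 30, 30), (0, 0, 50, 50), (0, 0, 200, 200)]
best = [0.5, 0.7, 0.8, 0.9]
curve = abo_by_area(gt, best, edges=[0, 1000, 10000])
print(curve)
```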
To better understand what enables the SPOP-net to work well, we visualize the intermediate output maps of both the localization and confidence networks in Figure 14. For each image, we show its "objectness confidence map", its "offsets map" pointing to the object center, and its proposals. We argue that the first key is reliable objectness prediction, as proposals predicted by pixels with low objectness confidence are ranked lower. Given an accurate objectness confidence, each pixel inside a ground-truth object predicts its own location of this object, as shown in the "offsets maps", which greatly increases the chances of precise localization. Another advantage of this pixel-wise prediction is that most of the bounding boxes predicted by pixels within the same object overlap heavily and can be easily filtered by NMS. Normally only a few proposals remain after NMS, which improves the recall of the top-ranked proposals. For small objects, to overcome the inherent weakness that fewer pixels are available to propose the correct locations, a scale-aware prediction is adopted, which relies on an accurate estimation of the object size (i.e. large or small) and combines the predictions of the two networks.
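The filtering effect of NMS on heavily overlapping pixel-wise predictions can be sketched with the standard greedy procedure below. This is an illustration under assumptions: the IoU threshold of 0.7 is not taken from this section, and the boxes and scores are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already kept box does not exceed iou_thresh.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Three near-duplicate boxes, as neighbouring pixels of one object might
# predict, plus one box on a different object (all values hypothetical).
boxes = [(10, 10, 50, 50), (11, 10, 50, 51), (12, 11, 51, 50),
         (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7, 0.6]
kept = nms(boxes, scores)
print(kept)
```

Only one box per object survives in this toy case, which mirrors the observation that few proposals remain after NMS.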
The running speed of the SPOP-net and of other state-of-the-art methods is presented in Table IV. The parameter settings for each method are as follows. We choose the single color space (i.e. RGB) proposal computation for BING and the "Fast" version of Selective Search. For the remaining methods, we directly run their default code. As can be seen, window scoring methods and CNN-based methods are much faster than segment grouping methods. Inference for an image of PASCAL VOC size (e.g. *) takes s for our SPOP-net on a single TITAN X GPU. Specifically, testing one network at the original scale and at the enlarged scale takes s and s on a single TITAN X GPU, respectively. Moreover, as the computations within the different CNNs of the SPOP-net are independent of each other, training and testing can be accelerated by parallel computation over multiple GPUs. Although it is not one of the fastest object proposal methods (compared to BING, RPN and Edge Boxes), our approach is still competitive in speed among the proposal generators. Like all CNN-based methods, we do, however, rely on GPU computation for efficient inference, here through the Caffe library [jia2014caffe]. To further reduce the running time, CNN speedup methods such as FFT, batch parallelization, or truncated SVD could be used in the future.
We also evaluate the proposed SPOP-net on the MS COCO [lin2014Microsoft] 2014 validation set; the results are shown in Figure 12. The SPOP-net model here is trained on the MS COCO training set, which contains more than pixel-level annotated images. For a fair comparison with DeepMask, the state-of-the-art approach based on segmentation annotations, we only evaluate on the first images. Note that we directly used the public DeepProposal model to extract proposals on MS COCO images. DeepMask performs well, especially with low IoU thresholds (e.g. ) and few proposals (e.g. proposals). The performance of the proposed SPOP-net gradually increases, and it demonstrates its superiority as the number of proposals grows. Specifically, SPOP-net outperforms DeepMask in terms of recall at IoU (Figure 12(d)), recall at IoU (Figure 12(e)), ABO (Figure 12(f)) and average recall (Figure 12(g)) when the number of proposals is more than . Figure 12(h) and Figure 12(i) show the average recall of all the methods on large and small objects, respectively. One can observe that SPOP-net performs best on small objects in terms of AR, which clearly validates its superiority in localizing small objects.
|RPN ( props)|
|RPN ( props)||73.6|
IV-D Object Detection Performance
We analyze how object proposals affect detection quality when used with an object detector. We utilize the standard Fast-RCNN [girshick2015fast] framework as the benchmark and choose the publicly released VGG -layer [simonyan2014very] detector trained on the VOC 2007 trainval set in all the evaluation experiments. The proposals of the proposed SPOP-net, Selective Search, Edge Boxes, MCG and RPN are evaluated. Note that RPN itself integrates proposal generation and detection in a unified framework, called Faster-RCNN. To be fair, we do not adopt this unified detector for object detection with RPN proposals in our evaluation, because it shares the weights of the layers used for both proposal generation and object detection. These layers are trained on class-specific annotations with object category information that is not employed in training the other methods. For SPOP-net, Selective Search, Edge Boxes and MCG, we select the top proposals to pass through the object detector for post-classification. RPN needs only a small number of proposals to achieve high recall, and additional proposals bring little improvement in recall while introducing more false positives, as also reported in [ren2015faster]. For RPN, we therefore add an extra setting that uses only the top proposals for detection.
The detection mean average precision (mAP) and the average precision of the categories are presented in Table V. The proposed SPOP-net wins on categories among the categories of PASCAL VOC 2007 and also achieves the best mAP . Using RPN proposals for detection, an mAP of can be obtained. With only proposals, RPN achieves a better mAP than with proposals. This confirms the good performance of RPN when generating a small number of proposals.
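For reference, the sketch below shows how a per-category AP and the resulting mAP can be computed. It assumes the 11-point interpolated AP conventionally used for PASCAL VOC 2007, and it assumes that the detections have already been sorted by score and matched to ground-truth objects; the function names, category names and values are hypothetical.

```python
def ap_11_point(is_true_positive, num_gt):
    """11-point interpolated average precision.

    is_true_positive: one bool per detection, sorted by descending score.
    num_gt: number of ground-truth objects of this category.
    """
    tp = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += int(hit)
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    ap = 0.0
    for i in range(11):
        t = i / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= t.
        p = max((p for p, r in zip(precisions, recalls) if r >= t),
                default=0.0)
        ap += p / 11
    return ap


def mean_ap(per_class_ap):
    """mAP: unweighted mean of the per-category APs."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Toy example: three ranked detections for a category with two objects.
ap_a = ap_11_point([True, False, True], num_gt=2)
ap_b = ap_11_point([True, True], num_gt=2)
print(ap_a, mean_ap({"class_a": ap_a, "class_b": ap_b}))
```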
IV-E Generalization to Unseen Categories
The high recall rate that our approach achieves on the PASCAL VOC 2007 testing set does not guarantee that it has learned a generic notion of objectness, or that it can predict object proposals for images containing novel objects from unseen categories, because the model may be highly tuned to the categories of PASCAL VOC. To investigate whether it can predict proposals for categories unseen during training, we evaluate our approach on the ImageNet ILSVRC 2013 validation set, which contains more than images with around annotated objects in categories.
As Figure 15 shows, the overall trend of the SPOP-net remains consistent with that on PASCAL VOC 2007. Specifically, with a small number of proposals (e.g. proposals), the SPOP-net does not perform as well as MCG, RPN and Edge Boxes, but it shows its superiority when the number of proposals reaches (see Figure 15(b)). In terms of average recall (AR) and average best overlap (ABO), the SPOP-net is also one of the best methods across a broad range of proposal numbers. It is worth mentioning that RPN does not perform as well as it does on PASCAL VOC 2007: an obvious drop is seen under all the evaluation scenarios in Figure 15 compared to Figure 11. This may result from the category information employed when training the layers that the RPN network shares with class-specific detectors. Such class-awareness enables RPN to fit the categories of PASCAL VOC 2007 better but impairs its ability to generalize to unseen categories.
Given the high recall rate that the SPOP-net retains when evaluated on ILSVRC 2013, no significant overfitting to the PASCAL VOC categories is observed. In other words, the proposed approach has learned a generic notion of objectness and generalizes well to unseen categories.
V Summary and Conclusions
In this paper, we developed an effective scale-aware pixel-wise localization network for object proposal generation. The network fully exploits the available pixel-wise segmentation annotations and predicts proposals in a pixel-wise manner. Each proposal combines two proposals predicted by two networks specialized for different object sizes. The combination follows a weighting mechanism that uses the confidence produced by a large-/small-size object classification model. This strategy is shown to enhance the localization accuracy on small objects. The proposed SPOP-net achieved significant improvements over the state-of-the-art methods on the PASCAL VOC 2007 testing set. Used in the Fast-RCNN detector, the proposals of the SPOP-net also provide the highest mAP, benefiting from the high recall rate of the proposed model. In the future, we plan to extend our method to handle both the object proposal generation and the bounding box regression steps to achieve better localization performance.