## 1 Introduction

Objectness is a concept introduced to the computer vision community by [1]. It aims to produce a set of regions (*i.e.*, object proposals) that have a high probability of containing objects. The main advantage of object proposals is that, while keeping a high recall, they dramatically reduce the search space from the whole image over all positions, scales and aspect ratios to only a few suggested regions. Therefore, it is an important technique for downstream vision tasks such as object recognition and detection.

Since the object proposal step mostly serves as preprocessing, a successful proposal algorithm should satisfy several requirements. First, the algorithm should be fast enough; otherwise, its advantage over the sliding window paradigm is lost. Second, it should produce a manageable number of proposals with a high recall. Many works focus on these two aspects.

For example, in the work of [5], Cheng *et al.* designed an effective feature descriptor and proposed an acceleration technique to approximate the original feature. They thus achieved very high speed without much loss of accuracy. In [18], Uijlings *et al.* started with low-level superpixels and carefully designed some simple yet effective features that can deal with a variety of image conditions. The proposals were then generated by grouping the superpixels according to the handcrafted features. As the grouping process incurs little computational cost, their algorithm is comparatively efficient. Notably, their model is fully unsupervised and there is no parameter to be learned or tuned.

There are also many works that aim at a higher recall. Typically, complicated features are designed in those works. In [4, 7, 3], for example, various visual cues, such as segmentation, boundary and saliency, are utilized to describe a candidate region. Subsequently, based on the similarity of region features, a hierarchical grouping strategy is adopted to form the final object proposals.

Figure 1: Existing methods (*e.g.*, Selective Search) generate a set of proposals as shown in (b). Typically, *only* the top-$k$ candidates are fed to further vision tasks like object detection. In (c) and (d), we show the top-$k$ results produced by SVM and our model respectively. Clearly, our ranking model is superior to (c) since there are fewer inaccurate proposals within the top-$k$ candidates.

Usually, a proposal algorithm produces a large number of candidates. Hence, existing algorithms provide a confidence score for each candidate, indicating the probability that it contains an object. Commonly used schemes for objectness scoring are summarized in [21]. Among them, the large margin based SVM, or one of its variants, is a popular framework [7]. Given all the candidates of an image, SVM takes the relative ranks of all pairs as constraints.

However, imposing such strict rankings on every candidate is possibly unnecessary, and sometimes unreasonable. To see this, consider an extreme case in which two candidates have IoU (intersection over union) 0.01 and 0.001. Both are effectively wrong proposals, so constraining the first candidate to rank higher than the second does not make sense. As we only care about the top-$k$ candidates, a full ranking algorithm such as SVM is not suitable for object proposals. In Figure 1, we give an example showing that an accurate prediction for the top-$k$ candidates is more important than obtaining the rank of all candidates.

Therefore, in this paper, we propose a new partial ranking algorithm. The overview of the procedure is illustrated in Figure 2. Given an ensemble of object candidates produced by an existing method, we compute their IoU with the groundtruth and then split them into two subsets: the top-$k$ candidates and the last $n-k$ candidates ($n$ is the total number of candidates). In our work, we use the output of Selective Search (SS) [18] to obtain the candidates, although any prior proposal algorithm can be utilized. The feature used here is the well-known HOG descriptor [6], which is described in Section 3.1. Then we learn a large margin based model that promotes a high score for the top-$k$ candidates. The derivation of our model and the learning algorithm are elaborated in Section 3.4. Finally, we state the testing phase in Section 3.5.

The main difference between our model and other ranking methods is that, when training the model, we split the candidates of each image into two subsets: one with the top-$k$ rankings and the other consisting of the remaining candidates. We only compare a candidate from the first subset with a candidate from the second subset, instead of comparing all pairs of candidates. In this formulation, our model can focus on obtaining a reliable prediction for only the top-$k$ candidates rather than learning to rank all the candidates. Also note that our partial ranking model is different from top-$k$ ranking models in information retrieval, which aim to provide an accurate ranking within the top-$k$ retrieved items [15]. In our case, we only require that any candidate among the top-$k$ is scored higher than the last $n-k$ candidates. An accurate ranking within the top-$k$ candidates is not necessary because, when the proposals are used for further processing such as recognition, we typically do not care about their order.
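The split described above can be sketched in a few lines. This is only an illustration under our own assumptions: the helper names `split_top_k` and `partial_pairs` and the toy IoU values are hypothetical and do not come from the experiments.

```python
def split_top_k(ious, k):
    """Return (top, rest): indices of the k highest-IoU candidates and of the others."""
    order = sorted(range(len(ious)), key=lambda i: ious[i], reverse=True)
    return order[:k], order[k:]


def partial_pairs(ious, k):
    """Pairs (p, q) meaning 'candidate p should outscore candidate q'.

    Only cross-subset pairs are produced; candidates inside the same
    subset are never compared with each other.
    """
    top, rest = split_top_k(ious, k)
    return [(p, q) for p in top for q in rest]


ious = [0.05, 0.81, 0.001, 0.62, 0.01]  # toy IoU values (assumption)
top, rest = split_top_k(ious, k=2)
pairs = partial_pairs(ious, k=2)
print(top, rest)   # indices of the two best candidates, then the others
print(len(pairs))  # number of cross-subset pairs, k * (n - k)
```

Note that the two near-zero candidates from the example in the introduction (IoU 0.01 and 0.001) both fall into the second subset here, so no constraint is ever placed between them.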

### 1.1 Contribution

The main technical contribution of this paper is as follows. Based on the observation that the widely used full ranking algorithm may not be suitable for object proposals, we propose a new partial ranking algorithm, PRanking, which produces a more reliable prediction for the top-$k$ candidates. After some “equivalent” transformation, our problem can be reduced to the large margin framework and can be learned efficiently.

## 2 Related Work

Recent progress in object proposals has drawn many researchers to the area. In [12], the proposed algorithm achieves the best of both worlds: accuracy and efficiency. Benefiting from powerful convolutional neural networks, [10] proposed regions with CNN features (R-CNN), which improved mean average precision (mAP) by more than 30% relative to the previous best result on VOC2012. With the adoption of regions, R-CNN outperformed the state-of-the-art CNN architecture OverFeat on the 200-class ILSVRC2013 detection dataset.

Generally, object proposal methods can be classified into two categories based on whether low-level segmentations are needed in advance. The low-level segmentation based methods usually generate seed superpixels, followed by some grouping strategy to form the final object proposals [11][1][18][3]. The graph-based image segmentation [9] is broadly used to generate low-level superpixels. In [11][13], the hierarchical occlusion boundary segmentation [2] is used to create seed regions and their segmentations. Existing grouping strategies usually learn a model based on multiple cues. For instance, [11] represented each region with contour/edge shape, color, and texture. Then they learned weights of regions and hierarchically grouped them as object candidates. [1] presented a generic objectness measure composed of saliency, edge density, superpixel straddling and so on. The measures are combined in a Bayesian framework to quantify how likely a window is to cover an object. Selective Search [18] exhaustively merged adjacent superpixels based on complementary similarity measurements including colour, texture, size and so forth. Endres *et al.* learned cascaded features to rank candidates based on edge distributions near region boundaries [7]. Rahtu *et al.* provided new objectness features and combined hypotheses by learning a ridge regression model [17]. Manen *et al.* employed features similar to those of Selective Search and merged superpixels based on a randomized version of Prim’s algorithm [14]. [16] greedily merged adjacent superpixels by global and local searches with SIFT and RGB descriptors. The RIGOR method generated a pool of overlapping segmentations and proposed a piecewise-linear regression tree for grouping [13]. The MCG method proposed a multi-scale hierarchical segmentation and grouped multiscale regions by features of size/location, shape and contours [3].

Only three methods are not based on low-level segmentations: Constrained Parametric Min-Cuts (CPMC) [4], Binarized Normed Gradients (Bing) [5] and EdgeBoxes [21]. CPMC generated overlapping segmentations for each image by solving min-cut problems and merged regions by their mid-level cues such as graph partition, region and gestalt properties. Bing learned an object model by cascaded SVMs at multiple scales. It sped up model learning by proposing a novel binarized normed gradient feature, *i.e.*, Bing. EdgeBoxes evaluated boxes by the number of contours they enclose in a sliding window fashion. Unlike previous methods based on the edge feature, EdgeBoxes removed straddling contours, which greatly improved the proposal quality.

Hosang *et al.* provided a careful analysis of the state-of-the-art object proposal methods [12]. According to [12], Bing is the fastest so far. EdgeBoxes obtains the best recall but needs its parameters tuned for each desired overlap threshold. Selective Search is relatively fast and achieves a high recall. It is also attractive as it has no parameters to tune. Thus, Selective Search has been widely used in top detection methods [19][10]. Proposal based detection methods, such as R-CNN, use the proposals as the search space. Thus, proposals with a high overlap with the groundtruth should be assigned a high score. To this end, some proposal generation methods utilized ranking algorithms [7][4][3]. Specifically, [7][20] ranked regions using structured learning based on various visual cues. CPMC [4] trained a Random Forest to regress the object overlap with the groundtruth. MCG [3] adopted a ranking algorithm similar to that of CPMC and used Maximum Marginal Relevance measures to diversify the ranking. All the re-ranking methods mentioned above aimed to provide an accurate full rank for all candidates. However, it is difficult and not necessary to learn the difference between proposals with similar IoU. For example, the difference between boxes with similar IoU, like 0.8 and 0.81, is not easy to characterize. This motivates us to learn a partial ranking model in this paper.

## 3 Problem Setup

In this paper, we focus on a new re-ranking method for object proposals. Assume that for each image, we have an ensemble of candidates (*i.e.*, bounding boxes) $B = \{b_1, \dots, b_n\}$ (for simplicity, we assume that each image has the same number $n$ of proposals). Each ensemble of candidates is associated with a vector $y = (y_1, \dots, y_n)$, with each $y_j$ being the IoU to the groundtruth of the candidate $b_j$. Denote the input space as

$$\mathcal{X} = \{\, x = (\phi(b_1), \dots, \phi(b_n)) \,\}, \tag{1}$$

where $\phi(b_j)$ is some feature descriptor for $b_j$, and the output space as

$$\mathcal{Y} = \{\, y = (y_1, \dots, y_n) \,\}. \tag{2}$$

We are interested in learning the prediction function

$$f : \mathcal{X} \to \mathcal{Y}. \tag{3}$$

To be more detailed, we assume that

$$f_j(x) = w \cdot \phi(b_j), \quad j = 1, \dots, n, \tag{4}$$

where $w$ is the weight vector we aim to learn and “$\cdot$” denotes the inner product. In this way, the mapping function is formulated as follows:

$$f(x) = \big(w \cdot \phi(b_1), \dots, w \cdot \phi(b_n)\big). \tag{5}$$

In the following, we will elaborate the design of the feature descriptor $\phi$ and the learning algorithm for the weight vector $w$.

### 3.1 Feature

A key requirement of a successful proposal algorithm is efficiency. Thus we use simple handcrafted features for computational efficiency. Given a bounding box $b$, its feature descriptor $\phi(b)$ used here is the well-known HOG feature [6]. One may extract more effective features, such as those produced by convolutional neural networks (CNNs), whose performance was carefully studied in [10]. However, in this paper we focus on the ranking model rather than on choosing the most effective features.

### 3.2 Partial Ranking Model

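Given a training set $\{(x^i, y^i)\}_{i=1}^{m}$, where $m$ denotes the number of training images, we aim to learn the weight vector $w$ such that the top-$k$ candidates in each $x^i$ are scored higher than the others. Assume without loss of generality that each $y^i$ is in descending order, and let $b^i_j$ denote the $j$-th candidate of the $i$-th image. Thus, we actually solve the following convex optimization problem:

$$\text{find } w \quad \text{s.t.} \quad w \cdot \phi(b^i_p) > w \cdot \phi(b^i_q), \quad \forall\, i \in [m],\ p \in [k],\ q \in [n] \setminus [k], \tag{6}$$

where $[n]$ denotes the integer set $\{1, \dots, n\}$.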

### 3.3 Comparison with SVM

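In previous work, a commonly utilized ranking model is SVM, which is formulated as follows:

$$\text{find } w \quad \text{s.t.} \quad w \cdot \phi(b^i_p) > w \cdot \phi(b^i_q), \quad \forall\, i \in [m],\ 1 \le p < q \le n, \tag{7}$$

where $b^i_p$ denotes the $p$-th candidate of the $i$-th image, with candidates sorted in descending order of IoU.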

Note that in the formulation (6), we divide the set $[n]$ into two subsets: $[k]$ and $[n] \setminus [k]$. The constraints are only imposed between the two subsets: any candidate among the top-$k$ ranks enjoys a higher confidence than those among the last $n-k$ ranks. This formulation is motivated by the practical usage of object proposals: basically, proposals alleviate the drawback of the sliding window scheme by reducing the search space from the whole image (over each position and scale) to a manageable number of regions (*i.e.*, bounding boxes). After that, one may extract finer visual cues on only the proposals for accurate recognition and detection. Thus, a good prediction for the top-$k$ candidates is essentially sufficient for a successful proposal algorithm. From this point of view, it is more reasonable to compare the top-$k$ candidates (which are of interest to the user) with the remaining ones, rather than comparing every pair from the ensemble of candidates (which is the formulation of (7)).

Moreover, note that in our formulation (6), there is no comparison between candidates within $[k]$ nor between those within $[n] \setminus [k]$, since when we post-process the proposals for further recognition, we pay little attention to the order produced by the ranking model.

As a matter of fact, one may have noticed that the constraints in Problem (6) are a subset of those in Problem (7). That is, our formulation is a reasonable *relaxation* of the classical SVM model. Besides the advantages mentioned above, the learning procedure of our problem can be expected to be more efficient than that of SVM, since in our model the number of constraints per image is reduced from $n(n-1)/2$ to $k(n-k)$.
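As a rough worked example of this reduction (our own arithmetic, assuming one strict pairwise constraint per compared pair and taking 1000 candidates per image with a top set of 20, the values used in Section 4.2), the per-image constraint counts are

$$
\underbrace{\binom{n}{2}}_{\text{full ranking}} = \frac{1000 \cdot 999}{2} = 499{,}500,
\qquad
\underbrace{k\,(n-k)}_{\text{partial ranking}} = 20 \cdot 980 = 19{,}600 .
$$

Under these assumptions the partial model keeps roughly 4% of the pairwise constraints, which is about a 25-fold reduction.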

At last, as we only constrain candidates belonging to different subsets, we can transform our problem into the well known large margin based framework, stated as follows.

### 3.4 Learning

#### 3.4.1 Large Margin Model

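In practical problems, the data can be corrupted and the hard constraints in Eq. (6) may be violated. Thus, for the robustness of the model, it is necessary to derive a soft formulation. To this end, we first transform Eq. (6) into the hard large margin based model:

$$\min_{w}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w \cdot \phi(b^i_p) \ge 1,\quad w \cdot \phi(b^i_q) \le -1, \quad \forall\, i \in [m],\ p \in [k],\ q \in [n] \setminus [k]. \tag{8}$$

The above formulation is “equivalent” to Problem (6) in the sense that the confidence of the top-$k$ candidates produced by the model is higher than that of the last $n-k$ candidates. In other words, the reformulation keeps the relative rank of the subsets $[k]$ and $[n] \setminus [k]$.

By introducing non-negative slack variables $\xi^i_j$, we obtain the soft margin formulation:

$$\min_{w,\,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \sum_{j=1}^{n} \xi^i_j \quad \text{s.t.} \quad w \cdot \phi(b^i_p) \ge 1 - \xi^i_p,\quad w \cdot \phi(b^i_q) \le -1 + \xi^i_q,\quad \xi^i_j \ge 0, \tag{9}$$

for all $i \in [m]$, $p \in [k]$ and $q \in [n] \setminus [k]$, where $C$ is a trade-off parameter. We set it to 1 for a soft margin.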
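A minimal sketch of how a soft margin model of this kind could be fitted for a single image is given below. It minimises the squared norm of the weights plus the summed hinge losses, treating the top candidates by IoU as positives with target score at least +1 and the rest as negatives with target score at most -1. The solver (batch subgradient descent), the function name `train_partial_ranker`, the learning rate, the epoch count and the toy data are all illustrative assumptions, not the implementation used in our experiments.

```python
def train_partial_ranker(features, ious, k, C=1.0, lr=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + C * (sum of hinge losses) by batch subgradient descent.

    features: list of per-candidate feature vectors for one image
    ious:     IoU of each candidate with the ground truth
    k:        number of top candidates treated as the positive subset
    """
    order = sorted(range(len(ious)), key=lambda i: ious[i], reverse=True)
    labels = [-1.0] * len(ious)
    for i in order[:k]:
        labels[i] = 1.0              # top-k by IoU: target score >= +1
    dim = len(features[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)               # gradient of the 0.5*||w||^2 term
        for x, y in zip(features, labels):
            score = sum(wj * xj for wj, xj in zip(w, x))
            if y * score < 1.0:      # margin violated, so the slack is active
                for j in range(dim):
                    grad[j] -= C * y * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data (assumption): five candidates with 2-D features.
features = [[1.0, 0.2], [-1.0, 0.3], [0.8, -0.1], [-0.7, -0.2], [-0.9, 0.1]]
ious = [0.9, 0.1, 0.7, 0.05, 0.0]
w = train_partial_ranker(features, ious, k=2)
scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in features]
print(scores)  # candidates 0 and 2 should receive the highest scores
```

With several training images, the same loss would simply be summed over all images before each update.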

#### 3.4.2 Connection to Binary SVM

At first sight, our soft margin formulation is similar to binary SVM. We remark here on some differences. First, for binary SVM, the positive and negative samples are usually picked by some binary evaluation metric. For example, for image classification, a sample is considered to be positive if it contains some object. There is no *ranking* information in binary SVM. In our case, we actually rank the samples and select the top-$k$ candidates as positive.

Also, the derivation of our model is different from that of binary SVM. We actually derive (9) from Eq. (6). Note that when we transform Eq. (6) into Eq. (8), we require the top-$k$ candidates to have confidence greater than 1 and the others to have confidence less than $-1$. However, a more flexible way is to allow the confidence of the top-$k$ candidates to be greater than some positive variable $\rho$, and that of the others to be less than $-\rho$. In this way, we have the adaptive soft margin problem:

(10) |

This adaptive model is of interest and we leave it for future work.

### 3.5 Testing

During the testing phase, we first obtain all the candidates by running SS on the test sample. Then we extract the HOG feature for each bounding box, whose confidence can subsequently be computed by Eq. (5). After a sorting step, we have an estimate of the top-$k$ candidates.
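The scoring and sorting steps can be sketched as follows, assuming the features have already been extracted and a weight vector has been learned. The function name `rerank`, the weight values and the toy features are hypothetical.

```python
def rerank(features, w, k):
    """Score every candidate with the linear model and return the k best indices."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in features]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


w_demo = [1.0, -0.5]                       # hypothetical learned weights
feats_demo = [[0.2, 0.0], [0.9, 0.1], [0.5, 1.0], [0.7, 0.0]]
top_demo, scores_demo = rerank(feats_demo, w_demo, k=2)
print(top_demo)  # indices of the two highest-scoring candidates
```

Only the membership of the returned set matters for later stages; the order inside it is a by-product of sorting.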

## 4 Experiments

### 4.1 The Dataset and Metrics

We evaluate our method on the challenging PASCAL Visual Object Classes (VOC) 2007 dataset [8]. The VOC2007 dataset contains 9,963 images in 20 classes and is divided into “train”, “val”, and “test” subsets. We conduct our experiments on the “train” and “test” splits. The employed evaluation metrics are Detection Rate (DR) and Mean Average Best Overlap (MABO) as defined in [18]. The DR metric is derived from the Pascal Overlap Criterion and is widely used to evaluate the quality of proposals [1][12]. It considers an object covered when the overlap (IoU) between its ground truth region and at least one proposal is larger than a given threshold. The Average Best Overlap (ABO) averages, for each class, the best overlap between each ground truth region and the proposals. The MABO is the mean ABO over all classes.
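The metrics above can be sketched as follows for a single image. This is an illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples with continuous coordinates (no pixel-inclusive +1 convention), and the function names and toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_overlaps(gt_boxes, proposals):
    """Best IoU achieved by any proposal, for each ground truth box."""
    return [max((iou(g, p) for p in proposals), default=0.0) for g in gt_boxes]


def detection_rate(gt_boxes, proposals, threshold):
    """Fraction of ground truth boxes covered with IoU above the threshold."""
    best = best_overlaps(gt_boxes, proposals)
    return sum(b > threshold for b in best) / len(best)


def mabo(gt_by_class, proposals):
    """Mean over classes of the average best overlap (ABO) of each class."""
    abos = []
    for boxes in gt_by_class.values():
        best = best_overlaps(boxes, proposals)
        abos.append(sum(best) / len(best))
    return sum(abos) / len(abos)


# Toy data (assumption): two objects of different classes, two proposals.
gt_by_class = {"dog": [(0, 0, 10, 10)], "cat": [(20, 20, 30, 30)]}
proposals = [(0, 0, 10, 8), (18, 18, 30, 30)]
all_gt = [g for boxes in gt_by_class.values() for g in boxes]
print(detection_rate(all_gt, proposals, 0.5))
print(detection_rate(all_gt, proposals, 0.7))
print(mabo(gt_by_class, proposals))
```

In a full evaluation the best overlaps would be pooled over all test images before averaging, and the proposal list would be truncated to the top-ranked candidates under study.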

### 4.2 Experiment Setting

We set $n = 1000$ in our experiments, *i.e.*, there are 1000 candidates for re-ranking. We vary the IoU threshold among three values: 0.5, 0.7 and 0.9.

We obtain proposals from Selective Search (SS). In the training stage, we obtain 1000 boxes per image from Selective Search on the VOC2007 training set, and $k$ is set to 20. To avoid a memory bottleneck, we use only a subset of the last candidates rather than all $n-k$ of them. We extract the HOG feature [6] for all regions, each of which is resized to 50 by 60. The cell size of HOG is 8. In the testing stage, we take 1000 proposals in sequence from Selective Search on the VOC2007 test set. Then, we extract the HOG features of the boxes and use our model to re-rank them.

### 4.3 Experiment Results

We show some example results in Figure 5, where, for a better view, only the top-10 proposals are drawn on each image. Red bounding boxes are those produced by SS, and the results of PRanking are shown as green bounding boxes. We can see that our method provides a more accurate estimate of the top-10 proposals. Moreover, Selective Search tends to produce diverse boxes at various locations, whereas our top proposals are near the groundtruth. For instance, in Figure 5(c), SS covers a pen and the torso of a person, while ours all contain heads, which are more accurate proposals for detection. In Figure 5(n), ours mainly cover the head of the dog, while no boxes from SS contain it. Our re-ranking model can select not only boxes covering parts of the object but also large boxes, such as in Figure 5(p).

Then we report the DR produced by SS and our PRanking model under different thresholds in Figure 4. Note that we train the model with the top-20 candidates but draw the DR curve for all numbers of top candidates. We see that our PRanking model generalizes well to other top-$k$ detection rates.

Specifically, as shown in Table 1 with IoU threshold 0.5, the detection rate of Selective Search is below 80 percent for the top-200 proposals (79.86), whereas PRanking reaches 84.25 percent. Our PRanking obtains an average increase of 3.54 percent in detection rate compared with the original proposals. When we raise the threshold to 0.7, *i.e.*, a stricter evaluation metric, our method achieves a larger improvement. As shown in Table 2, the detection rate of PRanking is 7.00 and 7.95 percentage points higher than that of Selective Search for the top-100 and top-500 proposals, respectively. To achieve a detection rate of 78 percent, Selective Search requires about 1000 proposals, whereas PRanking needs only about half that number. Table 3 shows the detection rates at an IoU threshold of 0.9. The first proposal of Selective Search is better (1.21 versus 0.78 percent), but our method achieves much higher detection rates as the number of proposals increases. Specifically, PRanking reaches a DR of 30.65 percent for the top-200 proposals, while Selective Search reaches only 22.95 percent. With 1000 proposals, PRanking obtains a detection rate of 45.78 percent, while that of Selective Search is still below 40 percent.

Table 4 reports the MABO scores. The first proposal of Selective Search has a slightly higher IoU. As the number of proposals increases, the MABO score rises steadily, and our PRanking enjoys a considerable improvement in quality (about 0.0365 increase). For instance, our MABO score reaches 0.8059 with 500 proposals, while Selective Search reaches 0.7646 with the same number and only 0.7935 with 800.

In summary, although the proposals of Selective Search are already of high quality, our model obtains a clear increase in both detection rates and MABO scores. This demonstrates that our idea of partial ranking is reasonable. The model can learn the margin between bounding boxes with high IoU and those with low IoU, so that bounding boxes with high IoU are ranked high.

Table 1: Detection rate (%) on the VOC2007 test set at IoU threshold 0.5. Columns give the number of top-ranked proposals.

Algorithms | 1 | 10 | 50 | 100 | 200 | 500 | 800 | 1000 |
---|---|---|---|---|---|---|---|---|
SS | 13.75 | 37.44 | 62.34 | 71.48 | 79.86 | 88.16 | 91.50 | 92.87 |
SS+PRanking | 14.98 | 41.71 | 66.51 | 76.09 | 84.25 | 91.96 | 94.25 | 94.87 |

Table 2: Detection rate (%) on the VOC2007 test set at IoU threshold 0.7. Columns give the number of top-ranked proposals.

Algorithms | 1 | 10 | 50 | 100 | 200 | 500 | 800 | 1000 |
---|---|---|---|---|---|---|---|---|
SS | 5.99 | 17.25 | 38.54 | 48.28 | 58.22 | 70.08 | 75.72 | 78.26 |
SS+PRanking | 6.68 | 22.69 | 44.81 | 55.28 | 65.91 | 78.03 | 82.17 | 83.19 |

Table 3: Detection rate (%) on the VOC2007 test set at IoU threshold 0.9. Columns give the number of top-ranked proposals.

Algorithms | 1 | 10 | 50 | 100 | 200 | 500 | 800 | 1000 |
---|---|---|---|---|---|---|---|---|
SS | 1.21 | 3.93 | 11.31 | 16.93 | 22.95 | 31.78 | 36.55 | 38.66 |
SS+PRanking | 0.78 | 5.64 | 15.96 | 22.59 | 30.65 | 40.98 | 44.81 | 45.78 |

Table 4: MABO on the VOC2007 test set. Columns give the number of top-ranked proposals.

Algorithms | 1 | 10 | 50 | 100 | 200 | 500 | 800 | 1000 |
---|---|---|---|---|---|---|---|---|
SS | 0.1990 | 0.4082 | 0.5739 | 0.6393 | 0.6986 | 0.7646 | 0.7935 | 0.8062 |
SS+PRanking | 0.1909 | 0.4249 | 0.6018 | 0.6718 | 0.7380 | 0.8059 | 0.8288 | 0.8342 |

## 5 Conclusion

In this paper, based on the observation that it is typically not necessary to derive a full ranking of all candidates, we propose a new partial ranking model for object proposals. The main difference between our model and full ranking models, such as SVM, is that we only constrain the relative order of two subsets: the top-$k$ candidates and the last $n-k$ candidates. We then show that such a model can be equivalently transformed into the large margin based framework, in the sense of keeping the relative ranks of the two subsets. In the experiments, we show that after the partial re-ranking step, the detection rate of the proposals produced by our algorithm improves consistently at IoU thresholds of 0.5, 0.7 and 0.9.

## References

- [1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80. IEEE, 2010.
- [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In CVPR, pages 2294–2301. IEEE, 2009.
- [3] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, pages 328–335, 2014.
- [4] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, pages 3241–3248. IEEE, 2010.
- [5] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. Bing: Binarized normed gradients for objectness estimation at 300fps. In CVPR, pages 3286–3293, 2014.
- [6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893. IEEE, 2005.
- [7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, pages 575–588. Springer, 2010.
- [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
- [9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
- [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
- [11] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, pages 1030–1037. IEEE, 2009.
- [12] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? arXiv preprint arXiv:1406.6962, 2014.
- [13] A. Humayun, F. Li, and J. M. Rehg. Rigor: Reusing inference in graph cuts for generating object regions. In CVPR, pages 336–343, 2014.
- [14] S. Manen, M. Guillaumin, and L. V. Gool. Prime object proposals with randomized prim’s algorithm. In ICCV, pages 2536–2543. IEEE, 2013.
- [15] S. Niu, J. Guo, Y. Lan, and X. Cheng. Top-k learning to rank: labeling, ranking and evaluation. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 751–760. ACM, 2012.
- [16] P. Rantalankila, J. Kannala, and E. Rahtu. Generating object segmentation proposals using global and local search. In CVPR, pages 2417–2424, 2014.
- [17] E. Rahtu, J. Kannala, and M. Blaschko. Learning a category independent object detection cascade. In ICCV, pages 1052–1059. IEEE, 2011.
- [18] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
- [19] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 17–24. IEEE, 2013.
- [20] P. Yadollahpour, D. Batra, and G. Shakhnarovich. Discriminative re-ranking of diverse segmentations. In CVPR, pages 1923–1930. IEEE, 2013.
- [21] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405. Springer, 2014.
