Current high-quality object detection approaches use the scheme of salience-based object proposal methods followed by post-classification using deep convolutional features. This spurred recent research in improving object proposal methods. However, domain-agnostic proposal generation has the principal drawback that the proposals come unranked or with very weak ranking, making it hard to trade off quality for running time. This raises the more fundamental question of whether high-quality proposal generation requires careful engineering or can be derived from data alone. We demonstrate that learning-based proposal methods can effectively match the performance of hand-engineered methods while allowing for very efficient runtime-quality trade-offs. Using the multi-scale convolutional MultiBox (MSC-MultiBox) approach, we substantially advance the state-of-the-art on the ILSVRC 2014 detection challenge data set, with 0.5 mAP for a single model and 0.52 mAP for an ensemble of two models. MSC-MultiBox significantly improves the proposal quality over its predecessor, the MultiBox method: AP increases from 0.42 to 0.53 for the ILSVRC detection challenge. Finally, we demonstrate improved bounding-box recall compared to Multiscale Combinatorial Grouping with fewer proposals on the Microsoft-COCO data set.
of the 2014 Imagenet object detection competition, make use of salience-based object localization, in particular Selective Search  followed by some post-classification method using features from a deep convolutional network.
Given the fact that the best salience-based methods can reach up to 95% coverage of all objects at 0.5 overlap threshold on the detection challenge validation set, it is tempting to focus on improving the post-classification ranking alone while considering the proposal generation part to be solved. However, this might be a premature conclusion: a better way of ranking the proposals is to cut down their number at generation time already. In the ideal case, we will be able to achieve high coverage with very few proposals. This can improve not only the running time but also the quality, because the post-classification stage would need to handle fewer potential false positives. Furthermore, a strong proposal ranking function provides a way to balance recall versus running-time in a simple, consistent manner by just selecting appropriate thresholds: use a high threshold for use cases where speed is essential, and a low threshold when quality matters most.
Motivated by the fact that hand-engineered features are getting replaced by higher-quality deep neural network features for image classification [13, 12, 26], we show that the same trend holds for proposal generation. In Section 4.6 we demonstrate that our purely learned proposal method closely rivals salience-based methods in performance, at a significantly lower computational cost. Furthermore, the ability to directly learn region proposal methods is a key advantage, as it is easy to adapt the model to new domains such as medical or aerial imaging, or to specific use cases such as recognizing only certain objects. In contrast, hand-engineered proposal methods are typically tuned for natural objects with clear segmentation; they do less well in domains where distinguishing objects requires more subtle cues, and they cannot restrict their proposals to objects of interest.
Our work builds upon the MultiBox approach, which was an earlier attempt to learn a proposal generation model but was never directly competitive with the best expert-engineered alternatives. We demonstrate that switching to the latest Inception-style architecture and utilizing multi-scale convolutional predictors of bounding box shape and confidence, in combination with an Inception-based post-classification model, significantly improves the proposal quality and the final object detection quality. Combining this with a simple but efficient contextual model, we end up with a single system that scales to a variety of use cases from real-time to very high-quality detection and achieves a new state-of-the-art result on the ImageNet detection challenge.
In summary, the main contributions of our approach are:
Improved network architecture for bounding box generation, including multi-scale convolutional bounding box predictors.
Integration of a context model during post-classification, which improves performance.
200 classes detection at mAP with proposals per image generated by our box proposal method.
0.5 mAP with a single model and 0.52 mAP with an ensemble of three post-classifiers and two MultiBox proposal generators.
Additionally, in Sec. 4 we analyze the effect of the various components of the MSC-Multibox model.
The previous state-of-the-art paradigm in detection was to use part-based models [6, 5] such as Deformable Part Models (DPMs). Sadeghi and Forsyth developed a framework with several configurable runtime-quality trade-offs and demonstrated real-time detection using DPMs on the PASCAL 2007 detection data.
Deep neural network architectures with repeated convolution and pooling layers [7, 13] have more recently become the dominant approach for large-scale and high-quality recognition and detection. Szegedy et al.  used deep neural networks for object detection formulated as a regression onto bounding box masks. Sermanet et al.  developed a multi-scale sliding window approach using deep neural networks, winning the ILSVRC2013 localization competition.
The original work on MultiBox  also used deep networks, but focused on increasing efficiency and scalability. Instead of producing bounding box masks, the MultiBox approach directly produces bounding box coordinates, and avoids linear scaling in the number of classes by making class-agnostic region proposals. In our current work (detailing improvements to MultiBox) we demonstrate greatly increased recall of object locations by increasing the number of potential proposals with a fixed budget of evaluated proposals. We also demonstrate improvements to the training strategy and underlying network architecture that yield state-of-the-art performance.
Other recent works have also attempted to improve the scalability of the now-predominant R-CNN detection framework. He et al. proposed Spatial Pyramid Pooling (SPP), which engineers robustness to aspect-ratio variation into the network. They also improve the speed of evaluating Selective Search proposals by classifying mid-level CNN features (generated from a single feed-forward pass) rather than pushing all image crops through a full CNN. They report a speedup of roughly two orders of magnitude over R-CNN using their method.
Compared to the SPP approach, we show a comparable efficiency improvement by drastically reducing the number and improving the quality of region proposals via our MultiBox network, which also associates a confidence score to each proposal. Architectural changes to the underlying network and contextual post-classification were the main factors in reaching high quality. We emphasize that MultiBox and SPP are complementary in the sense that spatial pyramid pooling can be added to the underlying ConvNet if desired, and post-classification of proposals can be sped up in the same way with no change to the MultiBox objective.
Another way in which the efficiency of detection methods can be improved is by unifying the detection and classification models, reusing as much computation as possible and in the process abandoning the idea of data-independent region proposals. An example of such an approach is the work of Pinheiro et al., who propose a convolutional neural network model with two branches: one that generates class-agnostic masks, and a second that predicts the likelihood of a given patch being centered on an object. Inference is efficient since the model is applied convolutionally on an image and one can get the class scores and segmentation masks using a single model.
The YOLO approach by Redmon et al. is similar in that it uses a single network, trained end to end, to predict bounding boxes and class probabilities. The difference is that it divides the input image into a grid of cells and predicts the coordinates and confidences of objects contained in the cells. This approach is fast, but limited in that each grid cell can only contain one object by construction, with the grid being quite coarse. It is also unclear to what extent these results can translate to good performance on data sets with significantly more objects, such as the ILSVRC detection challenge.
Faster R-CNN is a technique that merges the convolutional features of the full-image network with the detection network, thereby simultaneously predicting boxes and objectness scores. The detection network, called the Region Proposal Network (RPN), is trained end to end in an alternating fashion with the Fast R-CNN network; its objective is to produce good region proposals. The RPN is thus quite similar to the MultiBox approach described in this paper; the two approaches were developed concurrently. The biggest similarity is the usage of priors (called “anchors” in the Faster R-CNN work) that are designed to be translation invariant and that are predicted from the top layer feature map. Our multi-scale priors are different in that we use multiple tapering layers, while the Faster R-CNN approach predicts boxes of many scales from a single feature map. Other differences include the fact that in our approach the confidences are class-agnostic, and that we use different box regression and classification losses. Notably, we also use radically different network architectures, with parts designed specifically to overcome the shortcomings of networks designed for classification. Finally, we argue that our two-stage setup scales well to a higher number of classes: the Faster R-CNN work uses many thousands of priors, and scaling that approach to thousands of classes is not obvious. It would ultimately be interesting to disentangle which of these differences are important by comparing the two methods on the same evaluation set.
In order to describe the changes to the original MultiBox method, let us revisit its basic tenets. The fundamental idea is to train a convolutional network that outputs the coordinates of the object bounding boxes directly. However, this is just half of the story, since we would also like to rank the proposals by their likelihood of being an accurate bounding box for an object of interest. In order to achieve this, the MultiBox loss is the weighted sum of the following two losses:
Confidence: a logistic loss on the estimates of a proposal corresponding to an object of interest.
Location: a loss corresponding to some similarity measure between the objects and the closest matching object box predictions. By default we used L2 distance.
The network is an improved Inception-style  convolutional network, followed by a structured output module producing a set of bounding box coordinates and confidence scores. In the original MultiBox solution, the predictors were fully connected to the top layer of the network. Here we propose a multi-scale convolutional architecture described below.
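Let $l_i$ be the $i$-th set of predicted box coordinates for an image, and let $g_j$ be the $j$-th ground-truth box coordinates. At training time, for each image, we perform a bipartite matching between predictions and ground-truth boxes. We denote $x_{ij} = 1$ to indicate that the $i$-th prediction is matched to the $j$-th ground-truth, and $x_{ij} = 0$ otherwise. Note that $x$ is constrained so that $\sum_i x_{ij} = 1$. Given a matching between predictions and ground truth, the location loss term can be written as
$$F_{loc}(x, l, g) = \frac{1}{2} \sum_{i,j} x_{ij} \, \lVert l_i - g_j \rVert_2^2 .$$
Given the predicted scores $c$, the confidence loss term can be written as follows:
$$F_{conf}(x, c) = -\sum_{i,j} x_{ij} \log(c_i) - \sum_i \Big(1 - \sum_j x_{ij}\Big) \log(1 - c_i) .$$
The overall objective is a weighted sum of both terms:
$$F(x, l, g, c) = \alpha \, F_{loc}(x, l, g) + F_{conf}(x, c) .$$
We train the network with stochastic gradient descent. For each training example with ground truth $g$ and network output $(l, c)$ we compute the matching by picking the minimizer of the loss:
$$x^* = \arg\min_x F(x, l, g, c) \quad \text{subject to } x_{ij} \in \{0, 1\}, \; \sum_i x_{ij} = 1,$$
and update the network parameters following the gradient evaluated at the matching that was found.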
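As a rough illustration of the matching step, the sketch below brute-forces the assignment of ground-truth boxes to prediction slots on a tiny example, using a loss made of a squared L2 location term plus a logistic confidence term. The function names, the weight `alpha=0.3`, the requirement that matched slots be distinct, and the toy data are all assumptions made for illustration; a practical implementation would use an efficient assignment solver rather than enumeration.

```python
import itertools
import math

def location_loss(pred_box, gt_box):
    """Half the squared L2 distance between two boxes (x1, y1, x2, y2)."""
    return 0.5 * sum((p - g) ** 2 for p, g in zip(pred_box, gt_box))

def total_loss(assignment, boxes, scores, gt_boxes, alpha):
    """Loss of one matching; assignment[j] is the slot matched to ground truth j."""
    matched = set(assignment)
    loss = 0.0
    for j, i in enumerate(assignment):
        loss += alpha * location_loss(boxes[i], gt_boxes[j])
        loss -= math.log(scores[i])        # matched slots should be confident
    for i, c in enumerate(scores):
        if i not in matched:
            loss -= math.log(1.0 - c)      # unmatched slots should not be
    return loss

def best_matching(boxes, scores, gt_boxes, alpha=0.3):
    """Brute force: every ground truth gets its own distinct prediction slot."""
    candidates = itertools.permutations(range(len(boxes)), len(gt_boxes))
    return min(candidates,
               key=lambda a: total_loss(a, boxes, scores, gt_boxes, alpha))

# Toy data (hypothetical): three slots, two ground-truth boxes.
boxes = [(0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9), (0.0, 0.0, 1.0, 1.0)]
scores = [0.8, 0.7, 0.1]
gt_boxes = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4)]
matching = best_matching(boxes, scores, gt_boxes)
```

Here `matching[j]` gives the slot assigned to ground-truth box `j`; the gradient step would then be taken with this matching held fixed.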
The MultiBox setup is to predict locations (the four coordinates) and confidences for a constant number of boxes. We call the associated outputs of the network “slots”: each slot corresponds to one predicted proposal. However, these proposals might be low confidence, in which case the network predicts that the associated box does not correspond to any object in the image. Our goal is to maximize the coverage of the high-confidence predictions. Our network is an “objectness” detector, but our notion of what constitutes an object depends on the task we try to tackle.
A crucial detail of our approach is that we do not let the proposals float freely, but impose diversity by introducing a prior for each box output slot of the network. Let us assume that our network predicts a fixed number of boxes together with their confidences; each of those output slots is then associated with a prior rectangle. These rectangles are computed before training the network in a way that matches the distribution of object boxes in the training set. Our goal is to maximize the expected coverage of this constant set of priors at a given Jaccard (IOU) overlap threshold. In the original MultiBox work, the goal was to maximize the expected overlap between each ground-truth object box and the best matching prior. Here we instead try to find a set of priors that optimizes the coverage of the matching ground-truth bounding boxes. Intuitively, we want the best proposal generation method that is independent of the image pixels and has the maximum coverage at a given overlap threshold.
As in MultiBox, the bounding box predicted by slot $i$ of the network is interpreted with respect to its prior $p_i$. That is, we regress toward the offset, relative to $p_i$, of the ground-truth box closest to that prior, and at inference time the network output for slot $i$ is combined with $p_i$ to obtain the predicted box. Erhan et al. took a similar approach, but they tried to maximize the expected overlap as opposed to the coverage. However, that is a highly non-convex objective function, so they resorted to the heuristic of performing $k$-means clustering of the ground-truth object boxes of the training set and took the $k$-means centroids as priors. Here, we take a different approach that is closely related to that of Faster R-CNN and exploits the expected translation invariance of the object locations in the data set. The priors are assumed to lie on grids with grid lines parallel to the image boundary. Formally, we assume that our set of prior boxes is the union of boxes placed regularly on those grids, where each grid is a regular two-dimensional grid of a given resolution and each prior is a template box displaced by a grid position.
In addition to the top layer of our base network, we add a prediction tree to our network as depicted in Fig. 2. We have a dedicated layer for producing prediction locations and scores for each of a sequence of progressively coarser grids (the coarsest grid is created by applying average pooling on the top base network layer). Each tile of each grid except the coarsest is responsible for predicting outputs with priors of different aspect ratios. The top grid is used for predicting the single largest prior.
Each of the resulting priors is associated with one location output slot and its associated confidence output slot of the network. The outputs are emitted by the corresponding output layers shown in Fig. 2.
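The construction of priors on regular grids can be sketched as follows. The grid sizes, aspect ratios, `scale_factor`, the center/size box encoding and the purely additive `decode` step are hypothetical choices for illustration only; they are not the values or the exact parameterization used in the paper.

```python
def grid_priors(grid_sizes, aspect_ratios, scale_factor=1.5):
    """Place template boxes on regular grids over the unit square.

    Returns (cx, cy, w, h) tuples in normalized image coordinates.
    """
    priors = []
    for n in grid_sizes:
        cell = 1.0 / n
        for row in range(n):
            for col in range(n):
                cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
                for ratio in aspect_ratios:
                    # Equal-area templates whose width/height equals `ratio`.
                    w = scale_factor * cell * ratio ** 0.5
                    h = scale_factor * cell / ratio ** 0.5
                    priors.append((cx, cy, w, h))
    # The coarsest level contributes a single largest prior (the whole image).
    priors.append((0.5, 0.5, 1.0, 1.0))
    return priors

def decode(prior, offset):
    """Interpret a network output as an additive offset to its prior (assumed)."""
    return tuple(p + o for p, o in zip(prior, offset))

priors = grid_priors(grid_sizes=[4, 2], aspect_ratios=[0.5, 1.0, 2.0])
```

With these toy settings the prior count is (16 + 4) tiles times 3 aspect ratios, plus the one whole-image prior. Because every tile of a grid shares the same templates, the priors are translation invariant by construction.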
In large-scale data sets such as ImageNet, there are many missing true-positive labels. In the confidence term of the MultiBox training objective, a large loss will be incurred if the model assigns a high confidence to a true positive object in the image that is missing a label. We hypothesize that the dissonance caused by missing or noisy training data may encourage the model to be overly conservative in its predictions and thereby reduce the recall of MultiBox proposals. To deal with the issue, we adopted the “hard bootstrapping” approach of Reed et al. .
Training with this method is equivalent to reformulating the confidence objective in terms of the set of indices of the top-$K$ most confident predictions. In practice, we precompute this set for every image within a batch before computing the gradients. The learning iterates between “generating data” according to the previous model state, and then updating the model based on the augmented data. In our experiments we initialized the network with networks pre-trained without bootstrapping, and then fine-tuned with the bootstrapped objective.
For both the MultiBox localizer model and the post-classifier, we use new variants of the Inception architecture. This is a 42-layer-deep convolutional network operating over a fixed-size receptive field and containing over 130 layers in total. We use the top convolutional layer as described earlier. The extra side heads are removed for simplicity. The exact architecture topology is given in the supplementary model.txt file that can be downloaded together with the source file of this paper. We also employ the spatially sensitive grid-size reduction technique depicted in Figure 3.
MSC-MultiBox can be used in two ways: as a one-shot detector that produces object locations and confidences, or as a class-agnostic localizer providing region proposals to a post-classifier. In the high-quality regime, it is essential to zoom into the actual object proposals and perform an extra post-classification step to maximize performance. For this use case we again utilize the Inception architecture.
As a motivation for designing a new network architecture, we noted that the post-classifier network not only needs to produce the correct label for each class, but it also needs to decide whether the object overlaps the crop occupying the center part of the receptive field. (We follow the cropping methodology of the R-CNN  paper.) This requires the network to be spatially sensitive.
We hypothesized that the large pooling layers of traditional network architectures (which are also inherited by the Inception architecture) might be detrimental for accurately predicting spatial information. This leads to the construction of a variant of the Inception network in which strided convolutions are used in parallel to the large pooling layers in the Inception modules when reducing the grid size. This is depicted in Fig. 3.
It is known that the global context can be useful when making predictions for local image regions. Most high-performing detectors use elaborate schemes to update scores or take whole-image classification into account. Instead of working with scores, we just concatenate the whole image features with the object features, where the feature vector is taken from the topmost layer before the classifier. See Fig.4.
Note, however, that two separate models are used for the context and object features, and they do not share weights.
The context classification network is trained first with the logistic objective, meaning that we have a separate logistic classifier for each class and the sum of their losses is used as the total objective of the whole network. We do not use the classifier output of the context network at object proposal evaluation time. The combiner network in Fig. 4 is trained in a second step, after the whole-image features have been extracted. The combiner uses a softmax classifier, since each bounding box can only have a single class. A designated “background” class is used for crops that do not overlap any of the objects with sufficient intersection-over-union (IOU) similarity.
Another interesting feature of our approach is that it allows for a computationally efficient form of ensembling at evaluation time. First we extract context features for large crops in the image. In our case we used the whole image, equally sized squares in each corner, and one square of the same size at the center of the image. After the context features for each of those crops are extracted, the final score is given by the average of the combiner classifier scores evaluated for each pair of context and object features. This results in a modest but consistent mAP improvement at a relatively small additional cost, provided there are many proposals per image and the combiner classifier is much cheaper to evaluate than the feature extraction.
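The context ensembling described above amounts to averaging the combiner output over several context feature vectors for a fixed object feature vector. The sketch below illustrates this with a hypothetical `toy_combiner` standing in for the trained softmax combiner; the feature values and the two-class output are made up for the example.

```python
def ensemble_score(combiner, context_features, object_feature):
    """Average the combiner's class scores over all context crops."""
    per_context = [combiner(ctx, object_feature) for ctx in context_features]
    num_classes = len(per_context[0])
    return [sum(scores[k] for scores in per_context) / len(per_context)
            for k in range(num_classes)]

def toy_combiner(context_feature, object_feature):
    """Hypothetical stand-in for the trained combiner: two-class scores."""
    a = (context_feature[0] + object_feature[0]) / 2.0
    return [a, 1.0 - a]

# One object feature paired with three context crops (e.g. whole image, corner, center).
contexts = [[0.2], [0.4], [0.6]]
scores = ensemble_score(toy_combiner, contexts, [0.8])
```

The object feature is extracted once per proposal and the context features once per image, so only the cheap combiner is evaluated for every (context, object) pair.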
All three models: the MultiBox, the context and post-classifier were trained with the Google DistBelief machine learning system using stochastic gradient descent. The context and post-classifier networks reported in this paper had been pretrained on the million images of the ILSVRC classification challenge task. We used only the classification labels during pretraining and ignored any available location information. The pretraining was done according to the prescriptions of . All other models were trained with AdaGrad. There were two major factors that affected the performance of our models:
The ratio of positives versus negatives during the training of the post-classifier. A ratio of negatives versus positive samples gave good results.
Geometric distortions like random size and aspect ratio distortions proved to be crucial, especially for the MultiBox model. We have employed random aspect ratio distortions of up to in random (either horizontal or vertical) directions.
In this section we discuss aspects of the underlying convolutional network that benefited the detection performance. First, we found that switching from a Zeiler-Fergus-style network (detailed in ) to an Inception-style network greatly improved the quality of the MultiBox proposals (see Fig. 5). A thorough ablative study of the underlying network is not the focus of this paper, but we observed that for a given budget , both the (class-agnostic) AP and maximum recall increased substantially by the change, as shown in Fig. 6.
Figure 6 also shows that with the Inception-style convolutional networks, increasing the number of priors from around 150 (used in the original MultiBox paper ) to 800 provided a large benefit. Beyond 800, we did not notice a significant improvement.
In this section we present an analysis of the runtime-quality trade-off in our proposed method. The detection runtime is determined mostly by the number of network evaluations, which scales linearly with the number of proposal boxes. Since MultiBox scores the region proposal boxes, we can achieve the best quality for the number of network evaluations we can afford by evaluating only the top-$K$ most confident ones.
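Because every proposal carries a confidence, limiting the evaluation budget is a simple selection step. A minimal sketch, assuming proposals are represented as hypothetical `(box, confidence)` pairs:

```python
import heapq

def top_k_proposals(proposals, k):
    """Keep the k most confident (box, confidence) pairs, highest first."""
    return heapq.nlargest(k, proposals, key=lambda p: p[1])

def above_threshold(proposals, threshold):
    """Alternative budget control: keep proposals scoring above a threshold."""
    return [p for p in proposals if p[1] > threshold]

proposals = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
budgeted = top_k_proposals(proposals, 2)
confident = above_threshold(proposals, 0.4)
```

A fixed `k` bounds the number of post-classifier evaluations per image, whereas a threshold lets the number of evaluations vary with image content.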
Figure 7 shows that performance degrades very gracefully with computational budget. Compared to the highest-quality operating point (where the mAP leveled off), very competitive performance can be achieved with an order of magnitude fewer network evaluations. Also worth noting is that quality does not increase indefinitely with the number of proposals; swamping the post-classifier with low-quality proposals actually reduces the quality.
We used the same networks to generate both the contextual and non-contextual features, but the non-contextual network was trained without the extra context features. Both have a softmax classifier at the top, and neither used hard negative mining. Both were pre-trained on the ImageNet classification challenge and used the same 42-layer-deep Inception variant as the MultiBox proposal generation model. Table 1 shows that adding contextual features greatly improves results.
|Contextual MSC-MultiBox as in fig 4|
The most efficient MultiBox solution generates proposals from a single network evaluation on a single image crop. We can increase the quality at the cost of a few more network evaluations by taking multiple crops of the image at multiple scales and locations, and combining all of the generated proposals and applying non-maximal suppression.
In the MultiBox case, one needs to be cautious: if the proposals are kept indiscriminately, then the system will produce high-confidence boxes from partial objects that overlap the crop. This naive implementation ends up with a loss of quality. Our solution was to drop all the proposals that are not completely contained in the sub-window of the crop. However, this implies that MultiBox should be applied on highly overlapping windows. We ran two experiments in which a crop was slid over the image such that each window overlaps each of its neighboring windows by at least a given fraction (a different fraction in each experiment) along the dimension in which they are adjacent. This leaves enough room for small objects to be picked up by at least one of the crops evaluated with MultiBox.
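The multi-crop procedure can be sketched in two pieces: choosing crop positions so that neighbors overlap by at least a given fraction, and discarding proposals that are not fully contained in a crop's sub-window. The helper names, the one-dimensional treatment of offsets, and the example sizes are assumptions for illustration; `min_overlap` is assumed to be strictly less than 1.

```python
import math

def crop_offsets(image_size, crop_size, min_overlap):
    """1-D crop start positions whose neighbors overlap by >= min_overlap * crop_size."""
    if crop_size >= image_size:
        return [0]
    max_stride = crop_size * (1.0 - min_overlap)
    span = image_size - crop_size
    steps = math.ceil(span / max_stride)   # fewest steps that respect max_stride
    return [round(i * span / steps) for i in range(steps + 1)]

def fully_inside(box, window):
    """True if box (x1, y1, x2, y2) lies completely within window."""
    return (box[0] >= window[0] and box[1] >= window[1]
            and box[2] <= window[2] and box[3] <= window[3])

def keep_contained(proposals, window):
    """Drop (box, confidence) proposals that cross the window boundary."""
    return [p for p in proposals if fully_inside(p[0], window)]

offsets = crop_offsets(image_size=500, crop_size=300, min_overlap=0.5)
kept = keep_contained([((10, 10, 50, 50), 0.9), ((-5, 10, 50, 50), 0.8)],
                      window=(0, 0, 100, 100))
```

In two dimensions the same offsets would be computed independently along each axis, and the proposals kept from all crops would then be merged before non-maximum suppression.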
Table 2 demonstrates that we can get almost mAP improvement by taking multiple image crops in the proposal generating step. The resulting number of proposals increases from per image to per image on average, but is still significantly lower than that used by Selective Search.
In this section we combine multi-scale convolutional MultiBox proposals with context features and a post-classifier network on the full 200-category ILSVRC2014 detection challenge data set.
|MSC-MultiBox multi-crop (0.625)|
Table 1 shows several rows, each of which lies on a different point along the runtime-quality trade-off. Note that our improved MultiBox pipeline with a single crop yields mAP, which exceeds last year’s GoogLeNet ensemble validation performance in the ILSVRC2014 competition, and is even higher than the latest and best known result published with Deep-ID-Net . In addition, we attain superior performance at the high-precision operating point. Given a single Multibox region proposal network and a single post-classifier model, we obtain mAP.
We obtain even better results by using an ensemble of models. Naive ensembling, such as the one done by the GoogLeNet team on the ILSVRC 2014 detection challenge , uses a single Multibox network to propose boxes and then averages the result of several post-classifier models on the boxes. When we tried this with 3 post-classifier models, we got a mAP of – a slight improvement.
We wanted to leverage the results of several different MultiBox models as well. Intuition suggests that box proposals that are consistent across several different MultiBox models are more likely to be high-quality proposals. To capture this, we designed the following ensembling approach for MultiBox models. For the boxes of each MultiBox model, we can use either a single post-classifier model or the average of the scores of several post-classifier models, obtaining a set of bounding boxes and a score for each class. For each class score, we aggregate scores from the other MultiBox models, taking into account the Jaccard overlap between the bounding boxes.
Put in words, this aggregation reinforces detections that have consistent matches in the other MultiBox results, both in terms of location (high Jaccard overlap) and high score. After computing these scores for all detections of all MultiBox models, we apply non-max suppression to keep only the best ones. This ensembling approach, applied with two MultiBox models, yielded a substantial mAP improvement over the naive version.
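The exact aggregation formula is not recoverable from the text here, so the sketch below shows only one plausible form that matches the verbal description: each detection's score is increased by its best overlap-weighted match in every other model. This specific rule (score plus the maximum of Jaccard overlap times score per other model), the function names and the toy boxes are assumptions, not the paper's definition.

```python
def jaccard(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def reinforce(detections, other_models):
    """Assumed rule: score + sum over other models of max(IoU * other score)."""
    out = []
    for box, score in detections:
        bonus = sum(max((jaccard(box, b) * s for b, s in dets), default=0.0)
                    for dets in other_models)
        out.append((box, score + bonus))
    return out

# Detections for one class from two hypothetical MultiBox models.
model_a = [((0, 0, 10, 10), 0.5), ((20, 20, 30, 30), 0.4)]
model_b = [((0, 0, 10, 10), 0.8)]
boosted = reinforce(model_a, [model_b])
```

Under this rule a detection confirmed by another model at the same location gains score, while an isolated detection keeps its original score; non-max suppression over the pooled detections would follow.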
|Deep Insight ensemble|
|MSC-MultiBox multi-crop, one model|
|Ensemble of two models of MSC-MultiBox|
Table 3 demonstrates that multi-scale convolutional MultiBox establishes a new state-of-the-art by a healthy margin.
|category||AP||Recall at 60% precision|
|proposals||Recall at Jaccard overlap|
In this section, we are comparing the coverage of our class-agnostic proposal generation method with the state-of-the-art Multiscale Combinatorial Grouping  approach on the Microsoft-COCO  validation set.
For this purpose, we trained a class-agnostic MultiBox model on top of the Inception-v3 network using the TensorFlow large-scale distributed system with asynchronous gradient descent, 30 model replicas and 2 million batches.
For MultiBox, we have evaluated the crops from each image at three scales:
The whole image was warped to the receptive field of the network.
A square crop was slid over the image such that adjacent crops overlap by at least a fixed fraction. Only those proposals are kept that are completely contained in the center square of the crop.
A smaller square crop was slid over the image, again with a minimum overlap between adjacent crops. This crop is scaled up to the receptive field. Again, all predicted proposals not fully contained in the center square are ignored.
Finally, for each image, we took the union of all proposals from each crop and ran non-maximum-suppression with Jaccard threshold .
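The final merging step is standard greedy non-maximum suppression over the union of proposals. A minimal sketch follows, assuming proposals are `(box, score)` pairs with boxes as `(x1, y1, x2, y2)`; the threshold value in the example is arbitrary and not the one used in the experiments.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(proposals, threshold):
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box too much."""
    kept = []
    for box, score in sorted(proposals, key=lambda p: p[1], reverse=True):
        if all(iou(box, k) <= threshold for k, _ in kept):
            kept.append((box, score))
    return kept

merged = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
survivors = non_max_suppression(merged, threshold=0.5)
```

A higher threshold keeps more near-duplicate boxes, which can help recall at tight overlap criteria at the cost of more proposals per image.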
To compute recall, the proposals are ranked by their confidence scores. We took 15 different pre-sigmoid score thresholds, which gave rise to various average numbers of proposals per image. The results are reported in Table 5 and the corresponding Figure 8.
As one can see, MultiBox significantly outperforms MCG below 2000 proposals, especially for lower overlap thresholds. MCG only outperforms MultiBox at the tighter overlap thresholds and with large numbers of proposals. However, we expect that MultiBox might do better with a less aggressive non-maximum-suppression threshold (exceeding the currently used threshold) when optimizing for recall at tight overlap thresholds.
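The recall measurement used in this comparison can be sketched as follows: for a given score threshold, keep the proposals above it and count the fraction of ground-truth boxes matched by at least one kept proposal at the chosen Jaccard overlap. The function names, box encoding and toy values are assumptions; `gt_boxes` is assumed to be non-empty.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def recall_at(proposals, gt_boxes, iou_threshold, score_threshold):
    """Fraction of ground-truth boxes covered by a proposal above score_threshold."""
    kept = [box for box, score in proposals if score > score_threshold]
    covered = sum(1 for g in gt_boxes
                  if any(iou(g, b) >= iou_threshold for b in kept))
    return covered / len(gt_boxes)

gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [((0, 0, 10, 10), 2.0), ((21, 21, 31, 31), -1.0)]  # pre-sigmoid scores
strict = recall_at(proposals, gt_boxes, iou_threshold=0.5, score_threshold=0.0)
loose = recall_at(proposals, gt_boxes, iou_threshold=0.5, score_threshold=-2.0)
```

Sweeping `score_threshold` traces out recall as a function of the average number of proposals per image, which is how curves like those in Figure 8 are typically produced.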
In this work we demonstrated a method for high-quality object detection that is simple, efficient and practical to use at scale.
The proposed framework flexibly allows the choice of operating point along the runtime-quality trade-off curve. Even using single-crop multi-scale convolutional MultiBox with only several dozen proposals per image on average, we exceed the previously-reported state-of-the-art ILSVRC2014 detection performance, outperforming even highly-tuned ensembles using costly Selective Search proposal generation. At the high-quality end of the curve, we outperform the nearest reported mAP by over relative.
We conclude that learning-based proposal generation has closed the performance gap with state-of-the-art engineered proposal generation methods, MCG  in our study, while reducing the computational cost of detection. This is mostly the result of improved underlying network architecture especially the use of multi-scale convolutional proposal generation. Improvements in training methodology, context modeling and inference-time tricks like multi-crop evaluation and in-model ensembling resulted in modest, but significant cumulative gains on ILSVRC Detection 2014. Multi-scale convolutional MultiBox is not just a computationally more efficient replacement for static proposal generating algorithms; by providing a smaller number of higher-quality proposals, multi-scale convolutional MultiBox improves the overall object detection performance.
BING: Binarized normed gradients for objectness estimation at 300fps. In IEEE CVPR, 2014.