CVPR18 Paper: Multi-scale Location-aware Kernel Representation for Object Detection
Although Faster R-CNN and its variants have shown promising performance in object detection, they only exploit simple first-order representations of object proposals for final classification and regression. Recent classification methods demonstrate that integrating high-order statistics into deep convolutional neural networks can achieve impressive improvement, but their goal is to model whole images by discarding location information, so they cannot be directly adopted for object detection. In this paper, we make an attempt to exploit high-order statistics in object detection, aiming at generating more discriminative representations for proposals to enhance the performance of detectors. To this end, we propose a novel Multi-scale Location-aware Kernel Representation (MLKP) to capture high-order statistics of deep features in proposals. Our MLKP can be efficiently computed on a modified multi-scale feature map using a low-dimensional polynomial kernel approximation. Moreover, different from existing orderless global representations based on high-order statistics, our proposed MLKP is location retentive and sensitive so that it can be flexibly adopted for object detection. Integrated into the Faster R-CNN framework, the proposed MLKP achieves very competitive performance with state-of-the-art methods, and improves Faster R-CNN on the PASCAL VOC 2007, VOC 2012 and MS COCO benchmarks (by 4.9% mAP on VOC 2007). Code is available at: https://github.com/Hwang64/MLKP.
Object detection is one of the most fundamental and popular topics in the computer vision community, and it has attracted a great deal of attention over the past decades. Fast and effective object detection plays a key role in many applications, such as autonomous driving, surgical navigation and video surveillance. With the rapid development of deep convolutional neural networks (CNNs) [35, 16, 36], the performance of object detection has improved significantly. R-CNN is among the first methods to exploit the outputs of deep CNNs to represent pre-generated object proposals. It greatly improves on the traditional DPM and its variants, which employ hand-crafted features. Going beyond R-CNN, Fast R-CNN introduces a Region of Interest (RoI) pooling layer that generates representations of all object proposals on the feature map with only one CNN pass. This avoids passing each object proposal separately through the deep CNN and leads to much faster training and testing. Furthermore, Faster R-CNN designs a region proposal network (RPN) that learns to generate proposals instead of relying on proposals pre-generated by traditional methods [40, 37, 1]. By combining the RPN with the Fast R-CNN network (FRN), Faster R-CNN forms a unified framework that is trained end to end. Faster R-CNN has shown promising performance in object detection and has become a strong baseline owing to its good trade-off between effectiveness and efficiency.
Subsequently, numerous methods [26, 23, 3, 15, 20, 24, 34] have been proposed to further improve Faster R-CNN. These methods mainly focus on one issue: the original Faster R-CNN only exploits the feature map from a single convolutional layer (i.e., the last layer), which discards information at other resolutions and hurts small objects in particular. As illustrated in Fig. 1 (a), Faster R-CNN fails to detect some small objects, such as persons far from the camera. Two research directions address this problem: feature map concatenation [3, 15, 20, 24, 34] and pyramidal feature hierarchy [26, 23]. Concatenation-based methods obtain a coarse-to-fine representation for each object proposal by concatenating the outputs of different convolutional layers (e.g., several layers of VGG-16 in HyperNet) into one single feature map. Methods based on pyramidal feature hierarchy combine the outputs of different convolutional layers in a pyramid manner. Each combination gives its own prediction, and all detection results are fused by non-maximum suppression. As shown in Fig. 1 (b) and (c), methods based on concatenation (e.g., HyperNet) and on pyramidal feature hierarchy (e.g., RON) both improve performance by using features of different resolutions.
Although the aforementioned methods improve detection accuracy over Faster R-CNN, they all exploit only the first-order statistics of feature maps to represent object proposals in RoI pooling. Recent research on challenging fine-grained visual categorization [4, 22, 38, 6] shows that high-order statistical representations can capture more discriminative information than first-order ones and obtain promising improvements. Based on this observation, we propose a Multi-scale Location-aware Kernel Representation (MLKP) to incorporate high-order statistics into the RoI pooling stage for effective object detection. As illustrated in Fig. 1, the methods based on concatenation and on pyramidal feature hierarchy mis-locate the occluded persons far from the camera. Furthermore, these methods mis-locate the plant, which is very similar to the background and therefore difficult to detect. Owing to the use of more discriminative high-order statistics, our MLKP predicts more accurate locations for all objects.
Fig. 2 gives an overview of our proposed MLKP. By modifying an existing multi-scale strategy, we exploit features of multiple layers in different convolution blocks and concatenate them into one single feature map. Then, we compute the high-order statistics on this feature map. It is well known that the dimension of such statistical information is in general $d^{r}$, where $d$ is the dimension of the features and $r$ denotes the order of the statistics. In our case, $d$ is usually very high (e.g., 512 or 2048 in [35, 16]), which results in much higher dimensional representations [28, 38] and incurs high computation and memory costs. To overcome this problem, we adopt high-order methods based on polynomial kernel approximation, which can efficiently generate low-dimensional high-order representations. To this end, the kernel representation is reformulated as a convolution operation followed by an element-wise product. Going beyond the high-order kernel representation, we introduce a trainable location-weight structure to measure the contribution of different locations, making our representation location sensitive. Finally, the representations of different orders are concatenated for classification and regression. Note that instead of the global average pooling used for polynomial kernel representations in prior work, we use max RoI pooling, which is more suitable for object detection. To sum up, our MLKP is a multi-scale, location-aware, high-order representation designed for effective object detection.
(1) We propose a novel Multi-scale Location-aware Kernel Representation (MLKP), which, to the best of our knowledge, makes the first attempt to incorporate discriminative high-order statistics into representations of object proposals for effective object detection.
(2) Our MLKP is based on polynomial kernel approximation, so it can efficiently generate low-dimensional high-order representations. Moreover, MLKP is inherently location retentive and sensitive, which guarantees that it can be flexibly adopted for object detection.
(3) Experiments on three widely used benchmarks demonstrate that our MLKP significantly improves performance over the original Faster R-CNN and performs favorably in comparison to state-of-the-art methods.
In contrast to region-based detection methods (e.g., Faster R-CNN and its variants), an alternative line of research designs region-free detection methods. Among them, YOLO [31, 32] and SSD [10, 29] are two representative methods. YOLO uses a single neural network to predict bounding boxes and class probabilities directly from full images, and trains the network with a loss function defined in terms of detection performance. Different from YOLO, SSD discretizes the output space of bounding boxes into a set of default boxes over several specific convolution layers. At inference, it computes the score of each default box belonging to each object category. Although region-free methods train and run faster than region-based ones, they discard the generation of region proposals, so they often struggle with small objects and cannot filter out negative samples belonging to the background. Furthermore, experimental results show that our method obtains higher accuracy than state-of-the-art region-free detection methods (see Sec. 4.3 for more details). Note that it is not straightforward to incorporate our MLKP into region-free methods, where there are no object proposals for MLKP to represent. This interesting problem is worth investigating in the future.
Recent works have shown that integrating high-order statistics with deep CNNs can improve classification performance [4, 6, 18, 28, 25, 38]. Among them, the global second-order pooling methods [18, 25, 28] are plugged into deep CNNs to represent whole images: the sum of outer products of convolutional features is first computed, and then element-wise power normalization, matrix logarithm normalization and matrix power normalization are performed, respectively. Wang et al. embed a trainable global Gaussian distribution into deep CNNs, which exploits first-order and second-order statistics of deep convolutional features. However, all these methods generate very high dimensional orderless representations, which cannot be directly adopted for object detection. The methods in [4, 6] adopt polynomial and Gaussian RBF kernel functions to approximate high-order statistics, respectively. Such methods can efficiently generate low-dimensional high-order representations. However, different from methods designed for whole-image classification, our MLKP is location retentive and sensitive, which guarantees that it can be flexibly adopted for object detection.
In this section, we introduce the proposed Multi-scale Location-aware Kernel Representation (MLKP). First, we introduce a modified multi-scale feature map to effectively utilize multi-resolution information. Then, a low-dimensional high-order representation is obtained by polynomial kernel function approximation. Furthermore, we propose a trainable location-weight structure incorporated into the polynomial kernel function approximation, resulting in a location-aware kernel representation. Finally, we show how to apply our MLKP to object detection.
The original Faster R-CNN only utilizes the feature map of the last convolution layer for object detection. Many recent works [3, 15, 24, 34] show that feature maps of earlier convolution layers have higher resolution and help to detect small objects. These methods demonstrate that combining feature maps of different convolutional layers can improve detection performance. The existing multi-scale object detection networks all exploit the feature map of the last layer in each convolution block (e.g., in VGG-16). Although more feature maps usually bring more improvement, they incur higher computation and memory costs, especially for layers that are closer to the input.
Different from the aforementioned multi-scale strategies, this paper proposes to exploit feature maps of multiple layers in each convolution block. As illustrated in Fig. 3, our method first concatenates different convolution layers in the same convolution block (e.g., two layers of one block in VGG-16), then performs an element-wise sum over the feature maps of different convolution blocks (e.g., two blocks of VGG-16).
Since different convolution blocks produce feature maps of different sizes, the feature maps must be brought to the same size before the element-wise sum can be performed. To this end, an upsampling operation, implemented as a deconvolution layer, is used to enlarge the resolution of the feature map from the later block. Finally, we add a convolution layer with stride 2 to recover the feature map size of the original model, because after upsampling the feature map is twice as large as the original one. The experimental results in Sec. 4.2 show that our modified multi-scale feature map achieves higher accuracy with less computational cost.
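The fusion order described above (concatenate within a block, upsample the later block, sum across blocks, then apply a stride-2 convolution) can be traced at the level of tensor shapes. The following is a minimal sketch, not the authors' implementation: the shapes `(512, 40, 60)` and `(512, 20, 30)` are hypothetical, the choice of two layers per block is an assumption, and `upsample_2x` and `conv_stride2` are shape-only stand-ins for the deconvolution and stride-2 convolution layers (assuming padding that exactly halves an even-sized map).

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def concat_channels(a: Shape, b: Shape) -> Shape:
    """Channel-wise concatenation of two layers from the same block."""
    if a[1:] != b[1:]:
        raise ValueError("layers in one block must share the spatial size")
    return (a[0] + b[0], a[1], a[2])


def upsample_2x(s: Shape) -> Shape:
    """Shape-only stand-in for the deconvolution layer (doubles H and W)."""
    return (s[0], s[1] * 2, s[2] * 2)


def elementwise_sum(a: Shape, b: Shape) -> Shape:
    """Element-wise sum across blocks; requires identical shapes."""
    if a != b:
        raise ValueError("element-wise sum needs identical shapes")
    return a


def conv_stride2(s: Shape) -> Shape:
    """Shape-only stand-in for the final stride-2 convolution (halves H and W)."""
    return (s[0], s[1] // 2, s[2] // 2)


def fuse(early_a: Shape, early_b: Shape, late_a: Shape, late_b: Shape) -> Shape:
    early = concat_channels(early_a, early_b)      # within the earlier block
    late = concat_channels(late_a, late_b)         # within the later block
    summed = elementwise_sum(early, upsample_2x(late))
    return conv_stride2(summed)                    # back to the later block's size


# Hypothetical layer shapes: the earlier block has twice the resolution.
fused_shape = fuse((512, 40, 60), (512, 40, 60), (512, 20, 30), (512, 20, 30))
print(fused_shape)
```

The sketch makes one constraint explicit: the element-wise sum only works if both blocks contribute the same number of channels after concatenation, which holds here because each block concatenates two 512-channel layers.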
Recent progress on the challenging fine-grained visual categorization task demonstrates that integrating high-order representations with deep CNNs can bring promising improvements [4, 6, 28, 38]. However, none of these methods can be directly adopted for object detection, because their representations are high dimensional and lose the location information of the feature map. Hence, we present a location-aware polynomial kernel representation to overcome these limitations and to integrate high-order representations into object detection.
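Let $\mathbf{X} \in \mathbb{R}^{K \times M \times N}$ be a 3D feature map from a specific convolution layer, and let $\mathbf{x} \in \mathbb{R}^{K}$ denote the local descriptor at one spatial location of $\mathbf{X}$. Then, we define a linear predictor $f$ on the high-order statistics of $\mathbf{X}$,

$$f(\mathbf{X}) = \sum_{r=1}^{R} \Big\langle \mathcal{W}^{r}, \sum_{\mathbf{x} \in \mathbf{X}} \otimes_{r} \mathbf{x} \Big\rangle = \sum_{r=1}^{R} \sum_{\mathbf{x} \in \mathbf{X}} \sum_{c_{1}, \dots, c_{r}} \mathcal{W}^{r}_{c_{1}, \dots, c_{r}} \prod_{s=1}^{r} x_{c_{s}}, \tag{1}$$

where $R$ is the number of orders, $\mathcal{W}^{r}$ is an $r$-th order tensor containing the weights of the order-$r$ predictor, $\otimes_{r} \mathbf{x}$ is the $r$-fold outer product of $\mathbf{x}$, and $x_{c_{s}}$ denotes the $c_{s}$-th element of $\mathbf{x}$. Suppose that $\mathcal{W}^{r}$ can be approximated by $D^{r}$ rank-1 tensors, i.e.

$$\mathcal{W}^{r} = \sum_{d=1}^{D^{r}} a^{r,d}\, \mathbf{u}_{1}^{r,d} \otimes \cdots \otimes \mathbf{u}_{r}^{r,d}. \tag{2}$$

Then Eqn. (1) can be further rewritten as

$$f(\mathbf{X}) = \Big\langle \mathbf{w}^{1}, \sum_{\mathbf{x} \in \mathbf{X}} \mathbf{x} \Big\rangle + \sum_{r=2}^{R} \Big\langle \mathbf{a}^{r}, \sum_{\mathbf{x} \in \mathbf{X}} \mathbf{z}^{r}(\mathbf{x}) \Big\rangle, \tag{3}$$

where $\mathbf{w}^{1} = \mathcal{W}^{1}$ is the first-order weight, $\mathbf{z}^{r}(\mathbf{x}) = [z^{r,1}(\mathbf{x}), \dots, z^{r,D^{r}}(\mathbf{x})]^{\top}$ with $z^{r,d}(\mathbf{x}) = \prod_{s=1}^{r} \langle \mathbf{u}_{s}^{r,d}, \mathbf{x} \rangle$, and $\mathbf{a}^{r} = [a^{r,1}, \dots, a^{r,D^{r}}]^{\top}$ is the weight vector.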
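Based on Eqn. (3), we can compute a representation of arbitrary order by learning the weights $\mathbf{w}^{1}$, $\mathbf{a}^{r}$ and $\mathbf{u}_{s}^{r,d}$ ($r = 2, \dots, R$; $d = 1, \dots, D^{r}$; $s = 1, \dots, r$). Following prior work on polynomial kernel approximation, we first focus on the computation of $\mathbf{z}^{r}$, whose $d$-th entry at a location with descriptor $\mathbf{x}$ is $\prod_{s=1}^{r} \langle \mathbf{u}_{s}^{r,d}, \mathbf{x} \rangle$. By defining $\mathbf{U}_{s}^{r} = [\mathbf{u}_{s}^{r,1}, \dots, \mathbf{u}_{s}^{r,D^{r}}]^{\top}$, each factor $\mathbf{U}_{s}^{r} \mathbf{x}$ can be obtained at all locations by a $1 \times 1$ convolution layer with $D^{r}$ channels, where $r$ and $D^{r}$ indicate the order and the rank of the tensor, respectively. In general, $D^{r}$ is much larger than the dimension of the original feature map in order to obtain a better approximation. In this way, the high-order polynomial kernel representation can be computed as follows. Given an input feature map $\mathbf{X}$, we compute the feature map $\mathcal{Z}^{r}$ of the $r$-th order representation by performing $r$ separate $1 \times 1$ convolutions with $D^{r}$ channels on $\mathbf{X}$ (denoted as $\mathcal{Z}_{s}^{r}$, $s = 1, \dots, r$), followed by an element-wise product of all resulting feature maps, i.e. $\mathcal{Z}^{r} = \mathcal{Z}_{1}^{r} \odot \cdots \odot \mathcal{Z}_{r}^{r}$. Finally, global sum-pooling is adopted to obtain the orderless representation that serves as the input to the linear predictor $f$.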
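The computation described above can be illustrated with a small worked example. The sketch below assumes that each "convolution" is an inner product between the local descriptor and a set of learned vectors at every location (i.e., a 1×1 convolution), and that the order-r map is the element-wise product of r such projections. It is written in plain Python on a tiny hypothetical feature map; the function names and all numeric values are illustrative. For order 2 it also checks the identity that motivates the approximation: for a rank-1 weight tensor built from two vectors, the explicit second-order predictor equals the product of two inner products.

```python
from typing import List

Vector = List[int]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def dot(u: Vector, x: Vector) -> int:
    return sum(ui * xi for ui, xi in zip(u, x))


def project(feature_map: FeatureMap, vectors: List[Vector]) -> FeatureMap:
    """1x1 convolution: inner product of every local descriptor with each vector."""
    return [[[dot(u, x) for u in vectors] for x in row] for row in feature_map]


def kernel_map(feature_map: FeatureMap, projections: List[List[Vector]]) -> FeatureMap:
    """Order-r map: element-wise product of r projected maps (r = len(projections))."""
    maps = [project(feature_map, vectors) for vectors in projections]
    out = maps[0]
    for other in maps[1:]:
        out = [[[a * b for a, b in zip(pa, pb)] for pa, pb in zip(ra, rb)]
               for ra, rb in zip(out, other)]
    return out


def explicit_order2(u1: Vector, u2: Vector, x: Vector) -> int:
    """Explicit second-order predictor with the rank-1 weight tensor u1 (outer) u2."""
    return sum(u1[i] * u2[j] * x[i] * x[j]
               for i in range(len(x)) for j in range(len(x)))


# Hypothetical 2x2 feature map with 3 channels, and an order-2 map with 2 channels.
X = [[[1, 2, 3], [0, 1, -1]],
     [[2, 0, 1], [1, 1, 1]]]
U1 = [[1, 0, 2], [0, 1, 1]]
U2 = [[1, 1, 0], [2, 0, -1]]

Z2 = kernel_map(X, [U1, U2])
print(Z2[0][0])
```

Because no pooling is applied, `Z2` keeps one vector per spatial location, which is the property the location-aware representation relies on.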
The orderless representation, however, is unsuitable for object detection. The location information is discarded by the global sum-pooling, which makes the representation ineffective for bounding box regression. Fortunately, we note that convolution and element-wise product preserve location information. Thus, we can simply remove the global sum-pooling operation and use the feature map $\mathcal{Z}^{r}$ as the kernel representation. Moreover, the dimension of $\mathcal{Z}^{r}$ is $D^{r}$ (e.g., $D^{r}$ equals 4,096 in our network), which is far smaller than the size of explicit high-order representations, which grows with a power of the dimension $K$ of the original feature map (e.g., $K$ equals 1,024).
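To put these dimensions in perspective, recall that an explicit order-r statistic of a d-dimensional feature has d to the power r entries. The short sketch below tabulates that growth for the feature dimensions quoted earlier in the text (512 and 2048) and compares it with a 4,096-dimensional kernel representation. The function name is illustrative, and the comparison is plain arithmetic rather than a measurement from the paper.

```python
def explicit_dim(feature_dim: int, order: int) -> int:
    """Number of entries in an explicit order-`order` statistic of a feature vector."""
    return feature_dim ** order


COMPACT_DIM = 4096  # dimension of the kernel representation quoted in the text

for feature_dim in (512, 2048):
    for order in (2, 3):
        full = explicit_dim(feature_dim, order)
        print(f"d={feature_dim}, order={order}: explicit={full}, "
              f"x{full // COMPACT_DIM} larger than {COMPACT_DIM}")
```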
Furthermore, some parts of the feature maps are more useful for locating objects, and they should be given larger weights. To this end, we propose a location-aware representation by integrating a location weight into the high-order kernel representation. To compute our location-aware kernel representation, we introduce a learnable weight on the order-$r$ feature map $\mathcal{Z}^{r}(\mathbf{X})$,

$$\mathcal{G}^{r}(\mathbf{X}) = \mathcal{Z}^{r}(\mathbf{X}) \odot \operatorname{Rep}\big(m(\mathbf{X}; \Theta_{m})\big), \tag{4}$$

where $\odot$ denotes the element-wise product, $m$ is a learnable CNN block with parameters $\Theta_{m}$ that produces a location-aware weight matrix, and $\operatorname{Rep}$ is a re-mapping operation that duplicates this matrix along the channel direction to form a tensor of the same size as $\mathcal{Z}^{r}(\mathbf{X})$ for the subsequent element-wise product, as shown in Fig. 4. A residual block without the identity skip-connection is used as our location-weight network. After passing through three different convolutional layers, a weight map is obtained in which each point represents the contribution of that location to the detection results.
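Finally, the location-weighted representations $\mathcal{G}^{r}(\mathbf{X})$ of different orders $r = 1, \dots, R$ are concatenated into one single feature map to generate the high-order polynomial kernel representation,

$$\mathcal{G}(\mathbf{X}) = \big[\mathcal{G}^{1}(\mathbf{X}), \mathcal{G}^{2}(\mathbf{X}), \dots, \mathcal{G}^{R}(\mathbf{X})\big]. \tag{5}$$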
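The weighting and concatenation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the weight map `m` stands in for the output of the location-weight network (here just fixed numbers), the same weight map is shared by all orders, and the function names and values are hypothetical. Multiplying every channel at a location by that location's weight is equivalent to duplicating the weight map along the channel direction and taking an element-wise product.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]
WeightMap = List[List[float]]         # indexed as [row][col]


def apply_location_weight(z: FeatureMap, m: WeightMap) -> FeatureMap:
    """Scale all channels at location (i, j) by the scalar weight m[i][j]."""
    return [[[m[i][j] * v for v in z[i][j]] for j in range(len(z[i]))]
            for i in range(len(z))]


def concat_orders(maps: List[FeatureMap]) -> FeatureMap:
    """Concatenate per-order feature maps along the channel direction."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[[v for fm in maps for v in fm[i][j]] for j in range(cols)]
            for i in range(rows)]


# Hypothetical 1x2 maps: an order-2 map with 2 channels, an order-3 map with 1 channel.
z2 = [[[1.0, 2.0], [3.0, 4.0]]]
z3 = [[[1.0], [2.0]]]
m = [[0.5, 2.0]]  # stand-in for the learned location weights

g = concat_orders([apply_location_weight(z, m) for z in (z2, z3)])
print(g)
```

The result keeps its spatial layout, so a proposal can later be pooled from it region by region.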
Moreover, instead of using global average pooling to compute the polynomial kernel representation as in prior work, we propose to apply max RoI pooling on the concatenated feature map, which computes a high-order polynomial kernel representation for each object proposal and thereby preserves location information.
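For readers unfamiliar with max RoI pooling, the following sketch shows the operation on a single channel: the proposal region is divided into a fixed grid of bins and the maximum is taken in each bin. The bin boundaries use the floor/ceil rule common in Fast R-CNN style implementations; this rule, the coordinate convention (exclusive right and bottom edges, in feature-map units) and the function name are assumptions of this sketch rather than details taken from the text.

```python
import math
from typing import List, Tuple


def max_roi_pool(channel: List[List[float]],
                 roi: Tuple[int, int, int, int],
                 out_h: int, out_w: int) -> List[List[float]]:
    """Max-pool the region roi = (x0, y0, x1, y1) of one channel into out_h x out_w bins.

    x1 and y1 are exclusive; the region must contain at least one pixel.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    if h < 1 or w < 1:
        raise ValueError("empty region of interest")
    pooled = []
    for ph in range(out_h):
        r0 = y0 + (ph * h) // out_h                  # floor of the bin start
        r1 = y0 + math.ceil((ph + 1) * h / out_h)    # ceil of the bin end
        row = []
        for pw in range(out_w):
            c0 = x0 + (pw * w) // out_w
            c1 = x0 + math.ceil((pw + 1) * w / out_w)
            row.append(max(channel[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


channel = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
print(max_roi_pool(channel, (0, 0, 4, 4), 2, 2))
print(max_roi_pool(channel, (1, 1, 4, 4), 2, 2))
```

Applied channel by channel to the concatenated kernel representation, this yields a fixed-size descriptor for every proposal regardless of the proposal's size.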
The derivatives required for training can be obtained during back-propagation through the location-weight network. Although a separate location weight could be learned for each order of kernel representation by using multiple location-weight networks, we share the same location weight across all orders to balance effectiveness and efficiency.
Having described the proposed multi-scale location-aware kernel representation (MLKP), we now illustrate how to apply it to object detection. As shown in Fig. 5, we adopt a detection pipeline similar to that of Faster R-CNN. Specifically, we first pass an input image through the convolution layers of a basic CNN model (e.g., VGG-16 or ResNet). Then, we compute the proposed MLKP on the outputs of the convolutional layers while generating object proposals with a region proposal network (RPN). Finally, an RoI pooling layer combining MLKP with the RPN is used for classification and regression. This network can be trained in an end-to-end manner.
In this section, we evaluate our proposed method on three widely used benchmarks: PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO. We first describe the implementation details of our MLKP, and then conduct ablation studies on PASCAL VOC 2007. Finally, comparisons with state-of-the-art methods on the three benchmarks are given.
We initialize the network using an ImageNet-pretrained basic model [35, 16], and the weights of the layers in MLKP are initialized with the Xavier method. Training proceeds in two steps. In the first step, we freeze all layers in the basic model and train only the layers of MLKP and the RPN. In the second step, the whole network is trained in two stages with a decreasing learning rate. Our programs are implemented with the Caffe toolkit on an NVIDIA 1080Ti GPU. Following the settings commonly used in Faster R-CNN, the input images are first normalized, and we employ two kinds of deep CNNs as basic networks: VGG-16 and ResNet-101. The mean Average Precision (mAP) is used to measure different detectors. Note that we use single-scale training and testing throughout all experiments, and compare with state-of-the-art methods on three datasets without bells and whistles.
In this subsection, we evaluate the key components of our MLKP on PASCAL VOC 2007, including the multi-scale feature map, the location-aware polynomial kernel representation, and the effect of the location-weight network. Following common practice, we train the network on the union of the train/validation sets of VOC 2007 and VOC 2012, and report the results on the test set of VOC 2007 for comparison.
In this part, we investigate the effect of different strategies for generating multi-scale feature maps on detection performance. In general, integrating more convolutional feature maps brings more improvement but leads to higher computational cost. The existing multi-scale methods all consider only the feature map of the last layer in each convolution block. Different from them, we propose a modified strategy that exploits feature maps of multiple layers in each convolution block. Tab. 1 lists the mAP and inference speed (Frames Per Second, FPS) of multi-scale feature maps with different convolutional layers on PASCAL VOC 2007. Note that we employ the original Faster R-CNN for object detection here, in order to isolate the effect of the different strategies.
From the results of Tab. 1, we can see that integration of more convolutional feature maps indeed brings more improvements. However, feature map with obtains only gain over one with , but runs about two times slower. For our modified strategy, is superior to single layer of over with comparable inference time. The gains of two layers in over one may owe to the complementarity of different layers within a convolution block, which enhances representation of proposals. Meanwhile, can further improve over with less additional computational cost. Finally, outperforms and with same or less inference time. The above results demonstrate that our modified strategy is more efficient and effective than existing ones.
|Method|mAP|Inference Time (FPS)|

|Order|Dimension|Single-scale feature map: mAP / Inference Time (FPS)|Multi-scale feature map: mAP / Inference Time (FPS)|
|---|---|---|---|
|1|-|73.2 / 15|76.5 / 11|
|2|2048|76.4 / 14|77.7 / 10|
|2|4096|76.5 / 14|77.5 / 10|
|3|2048|76.6 / 13|77.8 / 10|
|3|4096|76.6 / 12|78.1 / 10|
|3|8192|76.2 / 10|77.7 / 8|
Next, we analyze the proposed location-aware kernel representation under the settings of single-scale and multi-scale feature maps. As shown in Eqn. (3), our location-aware kernel representation involves two parameters, i.e., the order $r$ and the dimension $D^{r}$. To obtain compact representations for efficient detection, this paper only considers orders up to 3. Meanwhile, the dimension $D^{r}$ varies from 2048 to 8192. We summarize the results of MLKP with various orders and dimensions in Tab. 2, from which the following conclusions can be drawn.
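|Method|Data|mAP|aero|bike|bird|boat|bottle|bus|car|cat|chair|cow|table|dog|horse|mbike|person|plant|sheep|sofa|train|tv|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Faster R-CNN|07+12|73.2|76.5|79.0|70.9|65.5|52.1|83.1|84.7|86.4|52.0|81.9|65.7|84.8|84.6|75.5|76.7|38.8|73.6|73.9|83.0|72.6|
|Faster R-CNN *|07+12|76.4|79.8|80.7|76.2|68.3|55.9|85.1|85.3|89.8|56.7|87.8|69.4|88.3|88.9|80.9|78.4|41.7|78.6|79.8|85.3|72.0|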
First, our location-aware kernel representation improves the baseline with first-order representation by up to 3.4% and 1.6% mAP under the settings of single-scale and multi-scale feature maps, respectively (Tab. 2). This demonstrates the effectiveness of the high-order statistics brought by the location-aware kernel representation. Second, appropriate high-order statistics achieve promising performance, but the gains tend to saturate as the order becomes larger. Integrating too many high-order statistics therefore yields diminishing gains, and a higher-dimensional mapping does not necessarily lead to better results. Finally, the multi-scale feature map and the location-aware kernel representation each significantly improve detection performance with little additional inference time, and their combination achieves a further improvement.
In the final part of this subsection, we assess the effect of the location-weight network on our MLKP. Here, we employ the kernel representation with order 3 and dimension 4096, which achieves the best result as shown above. Note that our location-weight network in Fig. 4 is very small and adds only a small cost per image. The results of the kernel representation with and without the location-weight network are illustrated in Fig. 6. We can see that the location-weight network brings an improvement under various settings of feature maps. Meanwhile, for more effective feature maps, the location-weight network obtains larger gains. Note that when the most effective feature maps are employed, the location-weight network still obtains an improvement, which is nontrivial since the counterpart without the location-weight network is already a strong baseline.
To further evaluate our method, we compare our MLKP with several recently proposed state-of-the-art methods on three widely used benchmarks, i.e., PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO.
On PASCAL VOC 2007, we compare our method with seven state-of-the-art methods. Specifically, the network is trained for 120k iterations in the first step with a learning rate of 0.001. Then, the whole network is fine-tuned for 50k iterations with a learning rate of 0.001 and for 30k iterations with a learning rate of 0.0001. The weight decay and momentum are set to 0.0005 and 0.9, respectively, in all three steps. For a fair comparison, all competing methods except R-FCN use a single input size without multi-scale training/testing and box voting. The results (mAP and AP of each category) of different methods with VGG-16 or ResNet-101 models are listed in Tab. 3.
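The training schedule above can be written down as a small lookup, which makes the phase boundaries explicit. This is a sketch for illustration only: the iteration counts, learning rates, weight decay and momentum are taken from the text, while the data structure, the phase descriptions and the function name `phase_at` are assumptions and not part of the authors' Caffe configuration.

```python
from typing import List, Tuple

WEIGHT_DECAY = 0.0005  # used in all three phases
MOMENTUM = 0.9         # used in all three phases

# (description, number of iterations, learning rate), in training order
VOC07_SCHEDULE: List[Tuple[str, int, float]] = [
    ("train MLKP and RPN layers, basic model frozen", 120_000, 0.001),
    ("fine-tune whole network", 50_000, 0.001),
    ("fine-tune whole network", 30_000, 0.0001),
]


def phase_at(iteration: int,
             schedule: List[Tuple[str, int, float]] = VOC07_SCHEDULE) -> Tuple[str, float]:
    """Return (phase description, learning rate) for a global iteration index."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    start = 0
    for description, length, learning_rate in schedule:
        if iteration < start + length:
            return description, learning_rate
        start += length
    raise ValueError("iteration beyond the end of the schedule")


for it in (0, 120_000, 170_000):
    print(it, phase_at(it))
```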
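|Method|Data|mAP|aero|bike|bird|boat|bottle|bus|car|cat|chair|cow|table|dog|horse|mbike|person|plant|sheep|sofa|train|tv|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Faster R-CNN|07++12|70.4|84.9|79.8|74.3|53.9|49.8|77.5|75.9|88.5|45.6|77.1|55.3|86.9|81.7|80.9|79.6|40.1|72.6|60.9|81.2|61.5|
|Faster R-CNN *|07++12|73.8|86.5|81.6|77.2|58.0|51.0|78.6|76.6|93.2|48.6|80.4|59.0|92.1|85.3|84.8|80.7|48.1|77.3|66.5|84.7|65.6|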
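|Method|Training set|Avg. Precision, IOU: 0.5:0.95|Avg. Precision, IOU: 0.5|Avg. Precision, IOU: 0.75|Avg. Precision, Area: S|Avg. Precision, Area: M|Avg. Precision, Area: L|Avg. Recall, #Det: 1|Avg. Recall, #Det: 10|Avg. Recall, #Det: 100|Avg. Recall, Area: S|Avg. Recall, Area: M|Avg. Recall, Area: L|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Faster R-CNN|trainval|21.9|42.7|23.0|6.7|25.2|34.6|22.5|32.7|33.4|10.0|38.1|53.4|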
As reported in Tab. 3, when the VGG-16 model is employed, our MLKP improves Faster R-CNN by 4.9% mAP. Because our method shares the same framework as Faster R-CNN, we attribute these significant gains to the proposed MLKP. Meanwhile, our method also outperforms HyperNet, ION-Net and RON. The improvements over these methods demonstrate the effectiveness of the location-aware high-order kernel representation. In addition, our MLKP is superior to the state-of-the-art region-free method SSD.
When the ResNet-101 model is adopted, our MLKP again outperforms Faster R-CNN. Meanwhile, it is superior to DSSD, which incorporates multi-scale information into the SSD method. Additionally, MLKP slightly outperforms R-FCN, even though R-FCN exploits a multi-scale training/testing strategy. Note that R-FCN obtains a lower mAP when using single-scale training/testing.
Following the same experimental settings as on PASCAL VOC 2007, we compare our method with five state-of-the-art methods on PASCAL VOC 2012. We train the network on the training and validation sets of PASCAL VOC 2007 and PASCAL VOC 2012, together with the test set of PASCAL VOC 2007. The results of different methods on the test set of VOC 2012 are reported in Tab. 4. Our MLKP achieves the best results with both VGG-16 and ResNet-101 models, and it improves over Faster R-CNN with both backbones. These results again verify the effectiveness of our MLKP. Note that the results on both VOC 2007 and VOC 2012 show that our method achieves impressive improvement in detecting small and hard objects, e.g., bottles and plants.
Finally, we compare our MLKP with four state-of-the-art methods on the challenging MS COCO benchmark. MS COCO contains 80k training images, 40k validation images and 80k testing images from 80 classes. We train our network on the trainval35k set following the common settings and report the results obtained from the test-dev2017 evaluation server. Because test-dev2017 and test-dev2015 contain the same images, the results obtained from them are comparable. As MS COCO is a large-scale benchmark, we employ four GPUs to accelerate the training process. We adopt the same hyper-parameter settings as for the PASCAL VOC datasets, but train the network for more iterations because MS COCO contains many more training images. Specifically, we train the network in the first step for 600k iterations, and then perform fine-tuning with learning rates of 0.001 and 0.0001 for 150k and 100k iterations, respectively. We adopt the single-scale training and single-scale testing strategy, and use the standard MS COCO evaluation metric for comparison.
The comparison results are given in Tab. 5. As Faster R-CNN only reported two numerical results, we conduct additional experiments using its released model. With the VGG-16 network, our MLKP improves over Faster R-CNN at IOU=[0.5:0.05:0.95] and is superior to the other competing methods. With the ResNet-101 backbone, our method outperforms the state-of-the-art region-free methods DSSD and SSD. In particular, the proposed MLKP improves over competing methods in detecting small and medium-sized objects.
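The notation IOU=[0.5:0.05:0.95] refers to averaging precision over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below only enumerates those thresholds and averages a list of per-threshold AP values; it is not the official COCO evaluation code, the function names are illustrative, and any AP values passed to it are the caller's own.

```python
from typing import List


def coco_iou_thresholds() -> List[float]:
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [(10 + i) / 20 for i in range(10)]


def average_precision_over_iou(ap_at_threshold: List[float]) -> float:
    """Mean of per-threshold AP values, one value per IoU threshold."""
    thresholds = coco_iou_thresholds()
    if len(ap_at_threshold) != len(thresholds):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_at_threshold) / len(ap_at_threshold)


print(coco_iou_thresholds())
```

Because the stricter thresholds require tighter boxes, this averaged metric rewards accurate localization more than the single-threshold AP at IoU 0.5.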
This paper proposes a novel location-aware kernel approximation method to represent object proposals for effective object detection, which, to the best of our knowledge, is among the first to exploit high-order statistics for improving the performance of object detection. Our MLKP takes full advantage of the statistical information of multi-scale convolutional features. The significant improvement over its first-order counterparts demonstrates the effectiveness of our MLKP. The experimental results on three popular benchmarks show that our method is very competitive, indicating that the integration of high-order statistics is an encouraging direction for improving object detection. As our MLKP is framework-independent, we plan to extend it to other frameworks.
Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
Matrix backpropagation for deep networks with structured layers. In ICCV, 2015.
Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.