Object detection aims at localizing instances of real-world objects within an image automatically. The past few years have witnessed significant progress in this field, arguably benefiting from the rapid development and wide application of deep convolutional neural network (CNN) models [1, 2, 3, 4, 5].
Among CNN-based object detectors, the Region-based Convolutional Neural Network (R-CNN) [6] is seen as a milestone and has achieved state-of-the-art performance. The R-CNN model detects objects by using a deep CNN to classify region proposals that possibly contain objects [7, 8, 9]. Inspired by R-CNN, two follow-up models, i.e. Fast R-CNN [10] and Faster R-CNN [11], have been developed to further improve detection accuracy as well as computational efficiency. These two methods share a similar pipeline which has become the de facto standard nowadays: first apply Region-of-Interest (RoI) pooling to extract proposal features from the feature maps produced by a CNN, and then minimize multi-task loss functions for simultaneous localization and classification. Many recent works [12, 13, 14, 15] have also demonstrated the effectiveness of this pipeline. However, most of those state-of-the-art methods localize objects using only information within a specific proposal, which may be insufficient for accurately detecting challenging objects, such as those with low resolution, small scale or heavy occlusion.
It has been believed for quite a long time that contextual information is beneficial for various visual recognition tasks. Many previous studies [16, 17, 18, 19, 20, 21] have demonstrated considerable improvements for object detection brought by exploiting contextual information. For example, Chen et al. [20] proposed a contextualized SVM model for complementary object detection and classification, and they provided state-of-the-art results using hand-crafted features. More recently, Bell et al. [21] presented a seminal work devoted to integrating contextual information into the Fast R-CNN framework. In [21], a spatial recurrent neural network was employed to model the contextual information of interest around proposals. However, despite the fact that they provide new state-of-the-art performance, few of those methods demonstrate a significant benefit of contextual information for object detection. How to effectively model and integrate contextual information into current state-of-the-art detection pipelines is a worthwhile problem that has not yet been fully studied.
Intuitively, a global view of the background of an image can provide useful contextual information. For example, if one wants to detect a specific car within a scene image, objects such as a person, a road or another car that usually co-occur with the target may provide useful clues for detecting it. However, not all background information helps improve object detection performance; incorporating meaningless background noise may even hurt it. Therefore, identifying useful contextual information is necessarily the first step towards effectively utilizing context for object detection. In addition to such “global” context, a local view into the neighborhood of an object proposal region can also provide useful cues for inferring the contents of a specific proposal. For example, the surrounding environment (e.g. “road”) and discriminative parts of the object (e.g. “wheels”) can obviously benefit detecting the object (e.g. “car”).
Motivated by the above observations, in this work we propose a novel Attention to Context Convolution Neural Network (AC-CNN) model for better “contextualizing” region-based CNN detectors (such as Fast R-CNN). AC-CNN captures contextual information from both global and local views, and effectively identifies helpful global context through an attention mechanism. Taking the image shown in Figure 1 as an example, AC-CNN exploits local context inside and surrounding a specific proposal at multiple scales. As aforementioned, looking into the interior local context helps discover discriminative object parts, while exploiting the exterior local context gives cues about what objects are present in the scene. Taking the surrounding local context into consideration thus provides an enhanced feature representation of a specific proposal for object recognition.
In addition, to identify useful contextual information from a global view, AC-CNN employs an attention-based recurrent neural network that consists of multiple stacked Long Short-Term Memory (LSTM) layers to recurrently find locations within the input image that are valuable for object detection. As shown in Figure 1, the locations (around the target car and another car) with stronger semantic correlations are discovered with higher confidences. In contrast, background noise that may hurt the detection performance is successfully suppressed on the attention map. Then, combining the feature maps of all locations guided by the attentive location map produces a “cleaned” global contextual feature to assist the recognition of each proposal.
The main contributions of the proposed AC-CNN method can be summarized as follows:
- We propose a novel Attention to Context CNN (AC-CNN) object detection model which effectively contextualizes popular region-based CNN detection models. To the best of our knowledge, this work is the first research attempt to detect objects by exploiting both local and global context with attention.
- The attention-based contextualized sub-network in AC-CNN recurrently generates an attentive location map for each image that helps to incorporate the most discriminative global context into local object detection.
- The inside and outside local contextual information for each proposal is captured by the proposed multi-scale contextualized sub-network.
- Extensive experiments on PASCAL VOC 2007 and VOC 2012 demonstrate the effectiveness of the proposed AC-CNN: it outperforms the popular Fast R-CNN by 2.0% and 2.2% in mAP on VOC 2007 and VOC 2012, respectively. We also visualize the automatically produced attention maps and seek to provide an in-depth understanding of the role of global and local context in successful object detectors.
2 Related Work
2.1 CNN-based Object Detection Methods
Object detection aims to recognize and localize each object instance with an accurate bounding box. Currently, most of the state-of-the-art detection pipelines follow the Region-based Convolutional Neural Network (R-CNN) [6]. In R-CNN, object proposals are first generated with hand-crafted methods (e.g. Selective Search [7], Edge Boxes [8] and MCG [9]) from the input image, and then classification and bounding box regression are performed to identify the target objects. However, training an R-CNN model is usually expensive in both space and time. Meanwhile, object classification and bounding box regression are implemented through a multi-stage pipeline. To enhance computational efficiency and detection accuracy, the Fast R-CNN framework [10] was proposed to jointly classify object proposals and refine their locations through multi-task learning. To further reduce the time of proposal generation, a novel Region Proposal Network (RPN) [11] was proposed, which can be seamlessly embedded in the Fast R-CNN framework for proposal generation. However, those methods only consider information extracted from a specific proposal in training, and cues from context are not well exploited. In this work, we dedicate our efforts to attending to contextual information on top of the state-of-the-art pipeline to improve the accuracy of object detection. In particular, both local and global contextual information w.r.t. a specific proposal is taken into account for better object detection.
2.2 RNN-based Object Detection Methods
Recently, LSTM has shown outstanding performance on the tasks of image captioning [22], video description [23, 24], people detection [25] and action recognition [26], benefiting from its excellent ability to model long-range information. Most of those existing works adopt a CNN accompanied by several LSTMs to address specific visual recognition problems. Specifically, Sharma et al. [26] proposed a soft visual attention model based on LSTMs for action recognition. Yao et al. [23] proposed to use CNN features and an LSTM decoder to generate video descriptions. Stewart et al. [25] employed a recurrent LSTM layer for people detection. In this work, we offer the first research attempt to apply LSTM to learning useful global contextual information with guidance from annotated class labels. Feature cubes of the entire image are taken as the input to a recurrent model consisting of multiple LSTM layers. With the recurrent model, the contextual slices beneficial for the detection task are iteratively highlighted to provide powerful feature representations for object detection.
3 Attention to Context Convolution Neural Network (AC-CNN)
As shown in Figure 2, the proposed AC-CNN method is built on the VGG-16 ImageNet model and the Fast R-CNN [10] object detection framework. AC-CNN takes an image and its object proposals provided by Selective Search [7] as inputs. In recent successful object detection frameworks, such as Fast R-CNN [10] and Faster R-CNN [11], features of generated proposals are pooled from the last convolutional layer. Similarly, in our model, an input image first passes through the convolutional and pooling layers to generate a feature cube. Then, two context-aware sub-networks (our main contributions) with attention to local and global context are introduced. For convenience of illustration, we term these two sub-networks the multi-scale contextualized sub-network and the attention-based contextualized sub-network, respectively. In the following subsections, we explain these two sub-networks in more detail.
3.1 Multi-scale Contextualized Sub-network
A local view into the neighborhood of a specific proposal can provide useful cues for inferring the corresponding content. Beyond the original bounding box of a specific proposal, two additional scales are therefore employed to exploit inside and outside contextual information and enhance the feature representation.
We use I and P to denote the input image and one specific proposal throughout the paper. The proposal P is encoded by its size (w, h) and the coordinates of its center (x, y), i.e., P = (x, y, w, h). To exploit inside and outside contextual information of P, we crop from the feature cube with three scaling factors: 0.8, 1.2 and 1.8. We denote the features of P pooled from the crops of the feature cube at the three scales as F_{0.8}, F_{1.2} and F_{1.8}. Considering implementation, since the output feature should have a dimension compatible with that of the pre-trained VGG-16 model on ImageNet, we need to resize the pooled features to the fixed input size of the first fully-connected layer before feeding them into it.
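As a concrete sketch, the three-scale cropping can be implemented by rescaling each proposal box around its center and clipping to the image boundary. The helper below is our own illustration, not code from the paper:

```python
# Hypothetical helper illustrating the three-scale proposal cropping:
# a box given by its centre (cx, cy) and size (w, h) is rescaled by a
# factor and clipped to the image boundary.
def scale_box(cx, cy, w, h, factor, img_w, img_h):
    sw, sh = w * factor, h * factor
    x1 = max(0.0, cx - sw / 2.0)
    y1 = max(0.0, cy - sh / 2.0)
    x2 = min(float(img_w), cx + sw / 2.0)
    y2 = min(float(img_h), cy + sh / 2.0)
    return x1, y1, x2, y2

# The three scaling factors used in the paper: 0.8 (inside), 1.2 and 1.8 (outside).
crops = [scale_box(100, 100, 50, 40, f, 300, 300) for f in (0.8, 1.2, 1.8)]
```

The clipping step matters for large factors such as 1.8, where the enlarged context window can extend past the image border.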
To meet this requirement on the size of the input features, we first concatenate the features pooled at the three scales along the channel axis. Each pooled feature is L2-normalized and scaled to match the original amplitudes. Then, we use a convolution operator to reduce the channel dimension of the concatenated feature back to that of a single-scale feature.
Finally, the reduced feature passes through two fully-connected layers to generate the feature representation of P, which is denoted as F_l. Here the subscript l denotes “local” context.
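The fusion step above can be sketched in numpy as follows. The shapes are illustrative (512 channels and a 7 × 7 pooled size, matching VGG-16 conventions), and the 1 × 1 channel reduction is our reading of the convolution operator described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three RoI-pooled features, one per scale, each of shape (C, S, S).
C, S = 512, 7
feats = [rng.standard_normal((C, S, S)) for _ in range(3)]

def l2_normalize(x, scale=1.0, eps=1e-12):
    # Normalise the whole pooled feature to unit L2 norm, then rescale.
    return scale * x / (np.linalg.norm(x) + eps)

# Concatenate the normalised features along the channel axis: (3C, S, S).
concat = np.concatenate([l2_normalize(f) for f in feats], axis=0)

# A 1x1 convolution is a channel-mixing matrix multiply at each spatial cell,
# reducing 3C channels back to C so the result fits the first FC layer.
W = rng.standard_normal((C, 3 * C)) * 0.01
reduced = np.einsum('oc,chw->ohw', W, concat)  # shape (C, S, S)
```

In a real network W would be learned; here it is random only to show the shape bookkeeping.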
3.2 Attention-based Contextualized Sub-network
A global view of the entire image can provide useful contextual information for object detection. To exploit global context beneficial for object detection and filter out noisy contextual information, we propose to adapt an attention-based recurrent model [26] to adaptively identify positive contextual information from the global view. Since input images usually have different sizes, the shapes of the feature cubes from the last convolutional layer are also different. To compute the global context with a consistent set of parameters, the feature cube is pooled into a fixed spatial size of K × K. Based on this pooled feature cube, the recurrent model learns an attention map of the same K × K shape to highlight the regions that may benefit object detection from the global view.
We now illustrate how to build the attention model to learn an attention map for extracting global contextual information. Denote the feature slices in the pooled feature cube as X_1, X_2, ..., X_{K^2}, where each X_i is a D-dimensional feature vector at one spatial location. The recurrent model is composed of three layers of Long Short-Term Memory (LSTM) units. We adopt the LSTM implementation discussed in the work by Sharma et al. [26]:

(i_t, f_t, o_t, g_t) = (σ, σ, σ, tanh) M(h_{t-1}, x_t),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t,
h_t = o_t ⊙ tanh(c_t),

where i_t, f_t, c_t, o_t and h_t denote the input gate, forget gate, cell state, output gate and hidden state of the LSTM, respectively, and x_t is the input to the LSTM at time-step t. M: R^{d+D} → R^{4d} is an affine transformation applied to the concatenation of h_{t-1} and x_t, where d is the dimensionality of all of i_t, f_t, o_t, c_t and h_t, and D is the dimensionality of x_t. Besides, σ and ⊙ correspond to the logistic sigmoid activation and element-wise multiplication, respectively.
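As a concrete illustration of this update, the following numpy sketch performs one LSTM step under the standard formulation. All sizes and parameter values here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden size d and input (feature) size D; the affine map M sends the
# concatenated (h_{t-1}, x_t) of size d + D to the 4d gate pre-activations.
d, D = 8, 16
M = rng.standard_normal((4 * d, d + D)) * 0.1
b = np.zeros(4 * d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = M @ np.concatenate([h_prev, x_t]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g        # cell state update
    h_t = o * np.tanh(c_t)          # hidden state
    return h_t, c_t

h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.standard_normal(D), h, c)
```

Because h_t = o_t ⊙ tanh(c_t) with o_t in (0, 1), each hidden unit stays strictly inside (-1, 1).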
At each time-step t, the attention model predicts a weighted map l_t, a softmax over the K × K locations. This is a probabilistic estimation of whether the corresponding region in the input image is beneficial for object classification from the global view. The location softmax at time-step t is computed as follows:

l_{t,i} = exp(W_i^T h_{t-1}) / Σ_{j=1}^{K^2} exp(W_j^T h_{t-1}),   i = 1, ..., K^2,

where W_i is the weight mapping to the i-th element of the location softmax, and the attended location is a random variable which takes a 1-of-K^2 value. With these probabilities, the attended feature can be calculated by taking the expectation over the feature slices at different regions, following the soft attention mechanism. The feature taken as the input to the LSTM at the next time-step is defined as x_{t+1}, which can be calculated as

x_{t+1} = Σ_{i=1}^{K^2} l_{t,i} X_i,

where X is the feature cube and X_i is the i-th slice of the feature cube.
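The location softmax and the soft-attention expectation can be sketched as follows. This is a minimal numpy illustration with assumed sizes K = 7, D = 512 and d = 8; the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, d = 7, 512, 8
X = rng.standard_normal((K * K, D))      # K*K feature slices, each D-dimensional
W = rng.standard_normal((K * K, d)) * 0.1  # one weight vector per location
h_prev = rng.standard_normal(d)          # previous LSTM hidden state

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

# Location softmax over the K*K spatial cells, conditioned on h_{t-1}.
l_t = softmax(W @ h_prev)

# Soft attention: the next LSTM input is the expectation of the slices
# under the location distribution.
x_next = l_t @ X
```

The same mechanism covers the initialization in the next paragraph: replacing l_t with the uniform distribution gives the mean slice fed to the initializing perceptrons.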
The cell state and the hidden state of the LSTM are initialized following the strategy proposed in [22] for faster convergence:

c_0 = f_c( (1/K^2) Σ_{i=1}^{K^2} X_i ),   h_0 = f_h( (1/K^2) Σ_{i=1}^{K^2} X_i ),

where f_c and f_h are two multilayer perceptrons. These values are used to calculate the first location softmax l_1, which determines the initial input x_1.
Figure 3 illustrates the process of producing the global attention-based feature. It can be observed that a D-dimensional global attention-based feature can be obtained by combining the features of all locations according to the attentive location map. Finally, this feature passes through two fully-connected layers to produce the feature representation of P with global contextual information, which is denoted as F_g.
3.3 Learning with Multi-task Loss Function
Denote u as the ground-truth label out of C object categories in total (u = 0 indicates background). With the obtained feature representations F_l and F_g of P, the loss function for jointly optimizing object classification and bounding box regression can be defined as

L = L_cls([F_l, F_g], u) + [u ≥ 1] L_loc(t, t*),

where [F_l, F_g] indicates concatenating the two features along the channel axis, and [u ≥ 1] is equal to 1 when u ≥ 1 and 0 otherwise. L_cls is the cross-entropy loss [10] for the ground-truth class u, and L_loc is the smooth L1 loss proposed in [10] between the predicted bounding box offsets t and the regression targets t*. It should be noted that only the feature F_l from the local context net is used for bounding box regression. The corresponding justifying experiment is provided in the next section.
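A minimal sketch of this objective, using our own implementation of the standard cross-entropy and smooth L1 terms from Fast R-CNN (the function names and the lam weight are illustrative):

```python
import numpy as np

def cross_entropy(scores, label):
    # Numerically stable log-softmax followed by negative log-likelihood.
    s = scores - scores.max()
    log_probs = s - np.log(np.exp(s).sum())
    return -log_probs[label]

def smooth_l1(pred, target):
    # Quadratic near zero, linear beyond |diff| = 1, summed over coordinates.
    diff = np.abs(pred - target)
    per_dim = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_dim.sum()

def multi_task_loss(scores, label, box_pred, box_target, lam=1.0):
    loss = cross_entropy(scores, label)
    if label > 0:                   # [u >= 1]: regress foreground boxes only
        loss += lam * smooth_l1(box_pred, box_target)
    return loss
```

The background case (label 0) contributes only the classification term, matching the indicator in the loss above.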
4 Experimental Results
4.1 Experimental Setting
Datasets and Evaluation Metrics We evaluate the proposed AC-CNN on two mainstream datasets: PASCAL VOC 2007 and VOC 2012 [28]. These two datasets contain 9,963 and 22,531 images respectively, and they are divided into train, val and test subsets. The model for VOC 2007 is trained on the trainval splits from VOC 2007 (5,011) and VOC 2012 (11,540). The model for VOC 2012 is trained on all images from VOC 2007 (9,963) and the trainval split from VOC 2012 (11,540). The evaluation metrics are average precision (AP) and mean average precision (mAP), complying with the PASCAL challenge protocols.
Implementation Details The proposed AC-CNN is built on the Caffe platform [29]. The VGG-16 model is pre-trained on the ILSVRC 2012 [30] classification and localization dataset and can be downloaded from the Caffe Model Zoo. We fine-tune the proposed AC-CNN based on the pre-trained VGG-16. During fine-tuning, we use 2 images per mini-batch, which contains 128 selected object proposals. Following Fast R-CNN, 25% of the object proposals in each mini-batch are selected as foreground, i.e. those having an Intersection over Union (IoU) overlap with a ground-truth bounding box larger than 0.5, and the rest are background. During training, images are horizontally flipped with a probability of 0.5 to augment the training data. All the newly added layers (all the fully-connected layers in the global context net, the convolutional layer in the local context net, and the last fully-connected layers for object classification and bounding box regression) are randomly initialized from zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001, respectively.
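The foreground/background sampling rule described above can be sketched as follows. The helpers are our own; the IoU computation and the 25% / 0.5 constants follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def iou(a, b):
    # Intersection over Union for boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def sample_rois(ious, batch=128, fg_frac=0.25, fg_thresh=0.5):
    # Draw 25% of the minibatch from proposals with IoU > 0.5 (foreground),
    # fill the rest with background proposals.
    ious = np.asarray(ious)
    fg = np.flatnonzero(ious > fg_thresh)
    bg = np.flatnonzero(ious <= fg_thresh)
    n_fg = min(int(batch * fg_frac), len(fg))
    fg_pick = rng.choice(fg, n_fg, replace=False)
    bg_pick = rng.choice(bg, batch - n_fg, replace=len(bg) < batch - n_fg)
    return fg_pick, bg_pick

# Toy example: 40 foreground-quality proposals and 200 background ones.
fg_idx, bg_idx = sample_rois([0.6] * 40 + [0.1] * 200)
```

With a batch of 128 and fg_frac = 0.25, this yields 32 foreground and 96 background RoIs per minibatch.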
We run Stochastic Gradient Descent (SGD) for 120K and 150K iterations in total to train the network parameters for VOC 2007 and VOC 2012, respectively. The AC-CNN is trained on an NVIDIA GeForce Titan X GPU and an Intel Core i7-4930K CPU @ 3.40 GHz. The initial learning rate of all layers is set to 0.001 and decreased to one tenth of the current rate after 50K and 60K iterations for VOC 2007 and VOC 2012, respectively.
[Table 1: per-class AP (%) on VOC 2007 and VOC 2012]
4.2 Performance Comparisons
We evaluate the proposed AC-CNN on VOC 2007 and the more challenging VOC 2012 by submitting the results to the public evaluation server. Table 1 provides the comparison between the proposed AC-CNN and Fast R-CNN (FRCN). All the experimental results are based on the model trained on the VOC 2007 trainval set merged with the VOC 2012 trainval set.
It can be observed that our method obtains mAP scores of 72.0% and 70.6% on VOC 2007 and VOC 2012, outperforming the baselines by 2.0% and 2.2%, respectively. Our method achieves better detection results than FRCN on most categories. In particular, small or occluded objects often appear in some specific categories such as bottle, chair and potted plant. The improvements on these three challenging categories are 5.2%, 5.7% and 3.1%, which further validates the effectiveness of AC-CNN for detecting challenging objects.
4.3 Ablation Analysis on Training Pipeline
To validate the effectiveness of experimental settings in this work, we evaluate some components of AC-CNN.
4.3.1 Global Context Attention Method
In the proposed AC-CNN framework, we employ an attention-based recurrent model to capture positive contextual information from the global view. To validate the effectiveness of the LSTM-based attention mechanism for global context, we compare it with another pooling method, i.e. average pooling, on the VOC 2007 test set, as shown in Table 2. It can be observed that the proposed recurrent model performs better than the average pooling scheme. It should be noted that, compared with “AC-CNN”, the performance drops by 0.4% when global context is added via average pooling. The reason may be that not all information from the entire image is useful: by averaging the features of all regions, some potential noise is also introduced, decreasing the detection accuracy. With the recurrent model, an attentive location map can be optimized to highlight those regions that have a positive effect on object detection. Therefore, we use the LSTMs to compute the global context in this work.
[Table 2: mAP (%) on VOC 2007 test with different global context methods]
4.3.2 Contributions of Each Sub-network
The proposed AC-CNN consists of two sub-networks, i.e. the attention-based contextualized sub-network and the multi-scale contextualized sub-network, which are utilized to capture global and local contextual information, respectively. We compare the detection results of the different sub-networks on the VOC 2007 test set, as shown in Table 3. “AC-CNN minus L” represents the model using only attention-based global context information for object detection. “AC-CNN minus G” represents the model using only multi-scale local context information for object detection. It can be observed that the performance drops by 0.6% if we remove either sub-network, which further validates the effectiveness of each component of the proposed AC-CNN.
| Method | mAP (%) |
| --- | --- |
| AC-CNN minus L | 71.4 |
| AC-CNN minus G | 71.4 |
| AC-CNN | 72.0 |
4.3.3 Effectiveness of Multi-scale Setting
We compare the detection performance of different scale settings based on the multi-scale contextualized sub-network, as shown in Table 4. It can be observed that the performance decreases if we remove either additional scale (0.8 or 1.8). The reason is that the inside (or outside) surrounding context of a specific proposal cannot be well exploited after removing the corresponding scale, i.e. 0.8 (or 1.8). To further validate the effectiveness of contextual information, we also add one more scale (i.e. 2.7) for comparison. It can be observed that the multi-scale contextualized sub-network achieves a further improvement with more scales. However, to balance the detection accuracy against the consumption of time and GPU memory, we choose the multi-scale setting “0.8+1.2+1.8” in this work.
| Method | mAP (%) |
| --- | --- |
| AC-CNN minus G (0.8 + 1.2) | 71.3 |
| AC-CNN minus G (1.2 + 1.8) | 71.1 |
| AC-CNN minus G (0.8 + 1.2 + 1.8) | 71.4 |
| AC-CNN minus G (0.8 + 1.2 + 1.8 + 2.7) | 71.6 |
[Table 5: mAP (%) on VOC 2007 test with different feature representations for bounding box regression]
4.3.4 No Global Context for Bounding Box Regression
We do not use the concatenated feature from both the local context net and the global context net for bounding box regression. The reason is that bounding box regression is a fundamentally different task from object classification. The feature of a specific proposal is expected to represent the difference between its current position and the ground-truth position; adding the feature from the global view would dilute this difference, which may decrease the accuracy of bounding box regression. We compare the results of the different settings on the VOC 2007 test set, as shown in Table 5. It can be observed that a 0.1% improvement is obtained by removing the global context information from the bounding box regression task.
4.3.5 Detection Error Analysis
The tool of Hoiem et al. [31] is employed to analyze the detection errors in this work. As shown in Figure 4, we plot pie charts with the percentage of detections that are correct, as well as false positives due to bad localization, confusion with similar categories, confusion with other categories, and confusion with background or unlabeled objects. It can be observed that the proposed AC-CNN model achieves a considerable reduction in the percentage of false positives for challenging categories. Specifically, it is well known that small objects usually appear in bottle and potted plant. The improvements on these two challenging categories are both 5%, which validates the effectiveness of the proposed context-aware method for objects with small scales. A similar observation can also be drawn from Figure 5, where we plot the top-ranked false positive types of the results from FRCN and the proposed AC-CNN.
4.4 Visual Comparison
Figure 6 shows some input images and the corresponding attentive maps computed by the attention-based contextualized sub-network. It can be observed that those regions which may benefit the object classification task are highlighted. Some selected detection results on VOC 2007 test set are shown in Figure 7. For some small objects, e.g. boat, bottle and potted plant, the proposed AC-CNN achieves better detection results. Besides, for the third group, the occluded chair can also be detected with the AC-CNN model. Therefore, our AC-CNN can effectively advance the Fast-RCNN by incorporating valuable contextual information.
5 Conclusion
In this paper, we proposed a novel Attention to Context Convolution Neural Network (AC-CNN) for object detection. Specifically, AC-CNN advances the traditional Fast R-CNN to a contextualized object detector, which effectively incorporates the attention-based contextualized sub-network and the multi-scale contextualized sub-network into a unified framework. To capture global context, the attention-based contextualized sub-network recurrently generates an attentive location map for the input image by employing stacked LSTM layers. With the attentive location map, the global contextual feature can be produced by selectively combining the feature cubes of all locations. To capture local context, the multi-scale contextualized sub-network exploits the inside and outside contextual information of each specific proposal by scaling the corresponding bounding box with three pre-defined ratios. Extensive experiments on VOC 2007 and VOC 2012 demonstrate that the proposed AC-CNN achieves a significant improvement by exploiting contextual information.
-  Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems. (2012) 1106–1114
-  Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations. (2014)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. arXiv preprint arXiv:1409.4842 (2014)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014) 580–587
-  Uijlings, J., van de Sande, K., Gevers, T., Smeulders, A.: Selective search for object recognition. International Journal on Computer Vision 104(2) (2013) 154–171
-  Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: European Conference on Computer Vision. (2014) 391–405
-  Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014) 328–335
-  Girshick, R.: Fast r-cnn. In: IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1440–1448
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Neural Information Processing Systems. (2015) 91–99
-  Liang, X., Liu, S., Wei, Y., Liu, L., Lin, L., Yan, S.: Towards computational baby learning: A weakly-supervised approach for object detection. In: IEEE International Conference on Computer Vision. (2015) 999–1007
-  Liang, X., Wei, Y., Shen, X., Jie, Z., Feng, J., Lin, L., Yan, S.: Reversible recursive instance-level object segmentation. arXiv preprint arXiv:1511.04517 (2015)
-  Zeng, X., Ouyang, W., Wang, X.: Window-object relationship guided representation learning for generic object detections. arXiv preprint arXiv:1512.02736 (2015)
-  Gidaris, S., Komodakis, N.: Object detection via a multi-region and semantic segmentation-aware cnn model. In: IEEE International Conference on Computer Vision. (2015) 1134–1142
-  Choi, M.J., Lim, J.J., Torralba, A., Willsky, A.S.: Exploiting hierarchical context on a large database of object categories. In: IEEE Conference on Computer Vision and Pattern Recognition. (2010) 129–136
-  Galleguillos, C., Belongie, S.: Context based object categorization: A critical survey. Computer Vision and Image Understanding 114(6) (2010) 712–722
-  Oliva, A., Torralba, A.: The role of context in object recognition. Trends in cognitive sciences 11(12) (2007) 520–527
-  Zweig, A., Weinshall, D.: Exploiting object hierarchy: Combining models from different category levels. In: IEEE International Conference on Computer Vision. (2007) 1–8
-  Chen, Q., Song, Z., Dong, J., Huang, Z., Hua, Y., Yan, S.: Contextualizing object detection and classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(1) (2015) 13–27
-  Bell, S., Zitnick, C.L., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143 (2015)
-  Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044 (2015)
-  Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. In: IEEE International Conference on Computer Vision. (2015) 4507–4515
-  Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014)
-  Stewart, R., Andriluka, M.: End-to-end people detection in crowded scenes. arXiv preprint arXiv:1506.04878 (2015)
-  Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention. arXiv preprint arXiv:1511.04119 (2015)
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal on Computer Vision 88(2) (2010) 303–338
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: ACM Multimedia, ACM (2014) 675–678
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition. (2009) 248–255
-  Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: European Conference on Computer Vision. (2012) 340–353