Attentive Contexts for Object Detection

03/24/2016 ∙ by Jianan Li, et al. ∙ 0

Modern deep neural network based object detection methods typically classify candidate proposals using their interior features. However, global and local surrounding contexts, which are believed to be valuable for object detection, are not yet fully exploited by existing methods. In this work, we take a step towards understanding what is a robust practice to extract and utilize contextual information to facilitate object detection. Specifically, we consider the following two questions: "how to identify useful global contextual information for detecting a certain object?" and "how to exploit local context surrounding a proposal for better inferring its contents?". We provide preliminary answers to these questions by developing a novel Attention to Context Convolution Neural Network (AC-CNN) based object detection model. AC-CNN effectively incorporates global and local contextual information into the region-based CNN (e.g. Fast RCNN) detection model and provides better object detection performance. It consists of one attention-based global contextualized (AGC) sub-network and one multi-scale local contextualized (MLC) sub-network. To capture global context, the AGC sub-network recurrently generates an attention map for an input image to highlight useful global contextual locations, through multiple stacked Long Short-Term Memory (LSTM) layers. To capture surrounding local context, the MLC sub-network exploits both the inside and outside contextual information of each specific proposal at multiple scales. The global and local context are then fused together to make the final detection decision. Extensive experiments on PASCAL VOC 2007 and VOC 2012 demonstrate the superiority of the proposed AC-CNN over well-established baselines. In particular, AC-CNN outperforms the popular Fast-RCNN by 2.0% and 2.2% in terms of mAP on VOC 2007 and VOC 2012, respectively.


1 Introduction

Object detection aims at localizing instances of real-world objects within an image automatically. The past few years have witnessed significant progress in this field, arguably benefiting from the rapid development and wide application of deep convolutional neural network (CNN) models [1, 2, 3, 4, 5].

Among the CNN-based object detectors, the Region-based Convolutional Neural Network (R-CNN) [6] is seen as a milestone and has achieved state-of-the-art performance. The R-CNN model detects objects by using a deep CNN to classify region proposals that possibly contain objects [7, 8, 9]. Inspired by R-CNN, two follow-up models, i.e. Fast R-CNN [10] and Faster R-CNN [11], have been developed to further improve the detection accuracy as well as computational efficiency. These two methods share a similar pipeline that has become the de facto standard: first apply Region-of-Interest (RoI) pooling to extract proposal features from feature maps produced by the CNN, and then minimize multi-task loss functions for simultaneous localization and classification. Many recent works [12, 13, 14, 15] have also demonstrated the effectiveness of this pipeline. However, most of these state-of-the-art methods localize objects using only information within a specific proposal, which may be insufficient for accurately detecting challenging objects, such as those with low resolution, small scale or heavy occlusion.

It has long been believed that contextual information is beneficial for various visual recognition tasks. Many previous studies [16, 17, 18, 19, 20, 21] have demonstrated considerable improvements for object detection brought by exploiting contextual information. For example, Chen et al. [20] proposed a contextualized SVM model for complementary object detection and classification, and they provided state-of-the-art results using hand-crafted features. More recently, Bell et al. [21] presented a seminal work devoted to integrating contextual information into the Fast-RCNN framework. In [21], a spatial recurrent neural network was employed to model the contextual information of interest around proposals. However, although these methods report new state-of-the-art performance, few of them demonstrate a significant benefit from contextual information for object detection. How to effectively model and integrate contextual information into current state-of-the-art detection pipelines remains a worthwhile problem that has not been fully studied.

Intuitively, a global view of the background of an image can provide useful contextual information. For example, if one wants to detect a specific car within a scene image, objects such as a person, a road or another car that usually co-occur with the target may provide useful clues for detecting it. However, not all background information helps object detection; incorporating meaningless background noise may even hurt the detection performance. Therefore, identifying useful contextual information is a necessary first step towards effectively utilizing context for object detection. In addition to such “global” context, a local view into the neighborhood of one object proposal region can also provide useful cues for inferring the contents of a specific proposal. For example, the surrounding environment (e.g. “road”) and a discriminative part of the object (e.g. “wheels”) can help detect the object (e.g. “car”).

Figure 1: Illustration of incorporating both local and global contextual information for guiding object detection. For the local context, beyond the original bounding box of a specific proposal, inside and outside contextual information are employed to enhance the feature representation. For the global context, an attention based recurrent model is utilized to obtain positive contextual information (the highlighted region) from the global view.

Motivated by the above observations, in this work, we propose a novel Attention to Context Convolution Neural Network (AC-CNN) model for better “contextualizing” the region-based CNN detectors (such as Fast RCNN). AC-CNN captures contextual information from both global and local views, and effectively identifies helpful global context through an attention mechanism. Taking the image shown in Figure 1 as an example, AC-CNN exploits local context inside and surrounding a specific proposal at multiple scales. As aforementioned, looking into the interior local context information helps discover some discriminative object parts and exploiting external local context gives local cues to determine what objects are present in the scene. Taking surrounding local context into consideration provides an enhanced feature representation of a specific proposal for object recognition.

In addition, to identify useful contextual information from a global view, AC-CNN employs an attention-based recurrent neural network that consists of multiple stacked Long Short-Term Memory (LSTM) layers to recurrently find locations within the input image that are valuable for object detection. As shown in Figure 1, the locations (around the target car and another car) with stronger semantic correlations are discovered with higher confidences. In contrast, background noise that may hurt the detection performance is successfully suppressed on the attention map. Then, combining the feature maps of all locations guided by the attentive location map produces a “cleaned” global contextual feature to assist recognition of each proposal.

The main contributions of the proposed AC-CNN method can be summarized as follows:

  • We propose a novel Attention to Context CNN (AC-CNN) object detection model which effectively contextualizes popular region-based CNN detection models. To the best of our knowledge, this work is the first research attempt to detect objects by exploiting both local and global context with attention.

  • The attention-based contextualized sub-network in AC-CNN recurrently generates the attentive location map for each image that helps to incorporate the most discriminative global context into local object detection.

  • The inside and outside local contextual information for each proposal are captured by the proposed multi-scale contextualized sub-network.

  • Extensive experiments on PASCAL VOC 2007 and VOC 2012 demonstrate the effectiveness of the proposed AC-CNN: it outperforms the popular Fast-RCNN by 2.0% and 2.2% in mAP on VOC 2007 and VOC 2012, respectively. We also visualize the automatically produced attention maps and seek to provide an in-depth understanding of the role of global and local context for successful object detectors.

2 Related Work

2.1 CNN-based Object Detection Methods

Object detection aims to recognize and localize each object instance with an accurate bounding box. Currently, most state-of-the-art detection pipelines follow the Region-based Convolutional Neural Network (R-CNN) [6]. In R-CNN, object proposals are first generated from the input image with hand-crafted methods (e.g. Selective Search [7], Edge Boxes [8] and MCG [9]), and then classification and bounding box regression are performed to identify the target objects. However, training an R-CNN model is expensive in both space and time, and object classification and bounding box regression are implemented through a multi-stage pipeline. To improve computational efficiency and detection accuracy, the Fast R-CNN framework [10] was proposed to jointly classify object proposals and refine their locations through multi-task learning. To further reduce the time of proposal generation, the Region Proposal Network (RPN) [11] was proposed, which can be seamlessly embedded in the Fast R-CNN framework. However, those methods only consider information extracted from a specific proposal in training, and cues from context are not well exploited. In this work, we focus on attending to contextual information on top of the state-of-the-art pipeline to improve the accuracy of object detection. In particular, both local and global contextual information w.r.t. a specific proposal are taken into account.

2.2 RNN-based Object Detection Methods

Recently, LSTM has shown outstanding performance for the tasks of image captioning [22], video description [23, 24], people detection [25] and action recognition [26], benefiting from its excellent ability to model long-range information. Most of those existing works adopt a CNN accompanied by several LSTMs to address specific visual recognition problems. Specifically, Sharma et al. [26] proposed a soft visual attention model based on LSTMs for action recognition. Yao et al. [23] proposed to use CNN features and an LSTM decoder to generate video descriptions. Stewart et al. [25] employed a recurrent LSTM layer for people detection. In this work, we offer the first research attempt to apply LSTM to learning the useful global contextual information with guidance from annotated class labels. Feature cubes of the entire image are taken as the input to a recurrent model consisting of multiple LSTM layers. With the recurrent model, contextual slices beneficial for the detection task are iteratively highlighted to provide powerful feature representations for object detection.

Figure 2: Details on how the proposed AC-CNN exploits contextual information for object detection. AC-CNN consists of two main sub-networks, i.e. the attention-based contextualized sub-network and the multi-scale contextualized sub-network. An image is first fed into a convolutional network to produce the feature cube. Then the feature cube passes through the multi-scale contextualized sub-network for local context information extraction. The bounding box of each proposal is scaled with three pre-defined factors and feature representations from the bounding boxes are extracted by an RoI pooling layer. The feature representations, after L2-normalization, concatenation, scaling and dimension reduction, are then fed into two fully-connected layers. In the attention-based contextualized sub-network, the feature cube is first pooled into a cube with a fixed scale. Then, a recurrent attention model including three LSTM layers is adopted to recurrently detect useful regions from a global view. Finally, a global context feature is pooled based on the calculated attention map and fed into two fully-connected layers. AC-CNN uses the output feature from the multi-scale contextualized sub-network for bounding box regression. The concatenation of the output features from the two sub-networks is used for object classification.

3 Attention to Context Convolution Neural Network (AC-CNN)

As shown in Figure 2, the proposed AC-CNN method is built on the VGG-16 ImageNet model [4] and the Fast-RCNN [10] object detection framework. AC-CNN takes an image and its object proposals provided by Selective Search [7] as inputs. In recent successful object detection frameworks, such as Fast R-CNN [10] and Faster R-CNN [11], features of generated proposals are pooled from the last convolutional layer. Similarly, in our model, an input image first passes through the convolutional and pooling layers to generate a feature cube. Then, two context-aware sub-networks (our main contributions) with attention to local and global context are introduced. For convenience of illustration, we term these two sub-networks the multi-scale contextualized sub-network and the attention-based contextualized sub-network, respectively. In the following subsections, we explain these two sub-networks in more detail.

3.1 Multi-scale Contextualized Sub-network

A local view into the neighborhood of a specific proposal can bring some useful cues for inferring the corresponding content. Beyond the original bounding box of a specific proposal, two more scales used to exploit inside and outside contextual information are employed to enhance the feature representation.

We use $I$ and $p$ to denote the input image and one specific proposal throughout the paper. The proposal $p$ is encoded by its size $(w, h)$ and the coordinates of its center $(x, y)$, i.e., $p = (x, y, w, h)$. To exploit inside and outside contextual information of $p$, we crop $p$ from the feature cube with three scaling factors: $\lambda_1 = 0.8$, $\lambda_2 = 1.2$ and $\lambda_3 = 1.8$. We denote the features of $p$ pooled from crops of the feature cube with different sizes as $F_{\lambda_1}$, $F_{\lambda_2}$ and $F_{\lambda_3}$. Since the input to the first fully-connected layer must have the same dimension as in the VGG-16 model pre-trained on ImageNet, we need to resize the features to $512 \times 7 \times 7$ before feeding them into the first fully-connected layer.

To meet this requirement on the size of input features, we first concatenate the features of crops with different sizes into $F = \mathrm{concat}[F_{\lambda_1}, F_{\lambda_2}, F_{\lambda_3}]$, where $\mathrm{concat}[\cdot]$ indicates the operation of concatenating each pooled feature along the channel axis. Each pooled feature is L2-normalized before concatenation, and the result is re-scaled to match the original amplitudes. Then, we use a $1 \times 1$ convolution to reduce the shape of $F$ from $1536 \times 7 \times 7$ to $512 \times 7 \times 7$.

Finally, the reduced feature passes through two fully-connected layers to generate the feature representation of $p$, which is denoted as $F_l$. Here the subscript $l$ denotes “local” context.
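
The multi-scale cropping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names (`scale_box`, `to_corners`, `multi_scale_rois`, `l2_normalize`) are hypothetical, the box is assumed to be encoded as center plus size, and clipping the enlarged box to the image boundary is an assumption the text does not state. The scale factors 0.8, 1.2 and 1.8 are the ones reported in Section 4.3.3.

```python
import math

SCALES = (0.8, 1.2, 1.8)  # inside, near-original and outside context (Section 4.3.3)

def scale_box(box, factor):
    """Scale a box (x, y, w, h) about its center (x, y)."""
    x, y, w, h = box
    return (x, y, w * factor, h * factor)

def to_corners(box, img_w, img_h):
    """Convert a center-encoded box to (x1, y1, x2, y2), clipped to the image (assumed)."""
    x, y, w, h = box
    x1 = max(0.0, x - w / 2.0)
    y1 = max(0.0, y - h / 2.0)
    x2 = min(float(img_w), x + w / 2.0)
    y2 = min(float(img_h), y + h / 2.0)
    return (x1, y1, x2, y2)

def multi_scale_rois(box, img_w, img_h, scales=SCALES):
    """Return one region of interest per scale factor for a single proposal."""
    return [to_corners(scale_box(box, s), img_w, img_h) for s in scales]

def l2_normalize(vec, scale=1.0, eps=1e-12):
    """L2-normalize a flattened pooled feature, then multiply by a scale (hypothetical value)."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [scale * v / norm for v in vec]

rois = multi_scale_rois((100.0, 80.0, 50.0, 40.0), img_w=200, img_h=150)
```

Each returned box would then be RoI-pooled from the shared feature cube, so the extra scales add pooling work but no additional convolutional passes.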

3.2 Attention-based Contextualized Sub-network

A global view of the entire image can provide useful contextual information for object detection. To exploit global context beneficial for object detection and filter out noisy contextual information, we adapt an attention-based recurrent model [26] to adaptively identify positive contextual information from the global view. Since input images usually have different sizes, the shapes of the feature cubes from the last convolutional layer also differ. To calculate the global context with consistent parameters, the feature cube is pooled into a fixed shape of $D \times K \times K$, with $K$ fixed in our experiments. Based on the pooled feature cube, the recurrent model learns an attention map with the shape of $K \times K$ to highlight the regions that may benefit object detection from the global view.
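
Pooling a variable-sized feature cube to a fixed spatial grid can be sketched as below. The text does not say which pooling operator is used, so max pooling over adaptive bins is an assumption here, and the names `adaptive_max_pool`, `pool_cube` and the grid side `K` are introduced only for illustration.

```python
import math

def adaptive_max_pool(fmap, K):
    """Pool one H x W channel (list of rows) to K x K by taking the max in each bin."""
    H, W = len(fmap), len(fmap[0])
    out = []
    for i in range(K):
        # Bin i covers rows [floor(i*H/K), ceil((i+1)*H/K)), which is never empty.
        r0, r1 = (i * H) // K, math.ceil((i + 1) * H / K)
        row = []
        for j in range(K):
            c0, c1 = (j * W) // K, math.ceil((j + 1) * W / K)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out

def pool_cube(cube, K):
    """Apply the same pooling to every channel of a D x H x W feature cube."""
    return [adaptive_max_pool(channel, K) for channel in cube]

fmap = [[r * 6 + c for c in range(6)] for r in range(4)]  # toy 4 x 6 channel
pooled = adaptive_max_pool(fmap, 2)
```

Because the bin edges depend only on the input size and on `K`, the recurrent model that follows always sees the same number of locations regardless of the image resolution.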

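We now illustrate how to build the attention model to learn an attention map for extracting global context information. Denote the feature slices in the feature cube as $X = [X_1, \dots, X_{K^2}]$, where each $X_i$ is $D$-dimensional. The recurrent model is composed of three layers of Long Short-Term Memory (LSTM) units. We adopt the LSTM implementation discussed in the work by Sharma et al. [26]:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} M \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix} \tag{1}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{2}$$
$$h_t = o_t \odot \tanh(c_t) \tag{3}$$

where $i_t$, $f_t$, $c_t$, $o_t$ and $h_t$ denote the input gate, forget gate, cell state, output gate and hidden state of the LSTM, respectively, and $g_t$ is the candidate update of the cell state. $x_t$ is the input to the LSTM at time-step $t$. $M: \mathbb{R}^a \to \mathbb{R}^b$ is an affine transformation consisting of parameters with $a = d + D$ and $b = 4d$, where $d$ is the dimensionality of all of $i_t$, $f_t$, $o_t$, $g_t$, $c_t$ and $h_t$. Besides, $\sigma$ and $\odot$ correspond to the logistic sigmoid activation and the element-wise multiplication, respectively.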
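
A single LSTM update of the kind referenced in Eqs. (1)-(3) can be written out in plain Python as follows. This is a sketch of the standard LSTM cell under stated assumptions, not the authors' Caffe code: the affine map is split into a weight matrix `M` and a separate `bias`, and the row ordering (input gate, forget gate, output gate, candidate) is a convention chosen here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(M, bias, h_prev, c_prev, x_t):
    """One LSTM update.

    M has 4*d rows and d + D columns, bias has 4*d entries; rows are ordered
    [input gate, forget gate, output gate, candidate] (assumed convention).
    """
    d = len(h_prev)
    v = list(h_prev) + list(x_t)  # stacked vector (h_{t-1}; x_t)
    pre = [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(M, bias)]
    i = [sigmoid(z) for z in pre[0:d]]
    f = [sigmoid(z) for z in pre[d:2 * d]]
    o = [sigmoid(z) for z in pre[2 * d:3 * d]]
    g = [math.tanh(z) for z in pre[3 * d:4 * d]]
    # New cell state: keep part of the old state, add the gated candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy call with d = 2 hidden units and a D = 3 dimensional input.
M = [[0.0] * 5 for _ in range(8)]
bias = [0.0] * 8
h1, c1 = lstm_step(M, bias, [0.0, 0.0], [1.0, -2.0], [0.3, 0.1, 0.2])
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is simply halved, which makes the toy call easy to check by hand.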

At each time-step $t$, the attention model predicts a weighted map $l_{t+1}$, a softmax over the $K \times K$ locations. This is a probabilistic estimate of whether the corresponding region in the input image is beneficial for object classification from the global view. The location softmax at time-step $t$ is computed as follows:

$$l_{t,i} = p(L_t = i \mid h_{t-1}) = \frac{\exp(W_i^\top h_{t-1})}{\sum_{j=1}^{K \times K} \exp(W_j^\top h_{t-1})}, \quad i \in \{1, \dots, K^2\} \tag{4}$$

where $W_i$ is the weights mapping to the $i$-th element of the location softmax and $L_t$ is a random variable which takes 1-of-$K^2$ values. With these probabilities, the attended feature can be calculated by taking the expectation over the feature slices at different regions following the soft attention mechanism [27]. The feature taken as the input to the LSTM at the next time-step is defined as $x_t$, which can be calculated as

$$x_t = \sum_{i=1}^{K^2} l_{t,i} X_i \tag{5}$$

where $X$ is the feature cube and $X_i$ is the $i$-th slice of the feature cube.
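
The soft attention step, a softmax over locations followed by an expectation over feature slices, can be sketched as below. The function names and the toy sizes are hypothetical; the weight matrix `W` holds one row per location, as the description of Eq. (4) suggests.

```python
import math

def location_softmax(W, h_prev):
    """Softmax over locations; W has one weight row per location."""
    scores = [sum(w * a for w, a in zip(row, h_prev)) for row in W]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(slices, weights):
    """Expected feature: attention-weighted sum of the D-dimensional slices."""
    D = len(slices[0])
    return [sum(wt * s[k] for wt, s in zip(weights, slices)) for k in range(D)]

# Toy example: a 2 x 2 grid (4 locations) of 2-dimensional slices.
slices = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
W = [[0.0, 0.0, 0.0] for _ in range(4)]  # all-zero weights give uniform attention
weights = location_softmax(W, [0.1, 0.2, 0.3])
context = attend(slices, weights)
```

With uniform weights the attended feature reduces to the plain average of the slices, which is the average-pooling baseline that Section 4.3.1 compares against; learned weights let the model down-weight uninformative locations instead.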

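The cell state and the hidden state of the LSTM are initialized following the strategy proposed in [22] for faster convergence:

$$c_0 = f_{\mathrm{init},c}\left(\frac{1}{K^2} \sum_{i=1}^{K^2} X_i\right) \tag{6}$$
$$h_0 = f_{\mathrm{init},h}\left(\frac{1}{K^2} \sum_{i=1}^{K^2} X_i\right) \tag{7}$$

where $f_{\mathrm{init},c}$ and $f_{\mathrm{init},h}$ are two multilayer perceptrons. These values are used to calculate the first location softmax $l_1$, which determines the initial input $x_1$.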

Figure 3 illustrates the process of producing the global attention-based feature. It can be observed that a $D$-dimensional global attention-based feature can be obtained by combining the features of all locations according to the attentive location map. Finally, the feature passes through two fully-connected layers to produce the feature representation of $p$ with global contextual information, which is denoted as $F_g$.

Figure 3: Illustration of how to produce the global attention-based feature. With the attention-based contextualized sub-network, a $K \times K$ attentive location map is calculated to selectively combine the features of all locations into a global attention-based feature.

3.3 Learning with Multi-task Loss Function

Denote $u$ as the ground-truth label out of $C + 1$ categories in total ($u = 0$ indicates background). With the obtained feature representations $F_l$ and $F_g$, the loss function for jointly optimizing object classification and bounding box regression can be defined as

$$L = L_{cls}(\mathrm{concat}[F_l, F_g], u) + [u \geq 1] \, L_{reg}(F_l, v) \tag{8}$$

where $\mathrm{concat}[\cdot, \cdot]$ indicates concatenating two features along the channel axis, and $[u \geq 1]$ is equal to 1 when $u \geq 1$ and 0 otherwise. $L_{cls}$ is the cross-entropy loss [1] for the ground-truth class $u$, and $L_{reg}$ is the smooth $L_1$ loss proposed in [10], evaluated against the ground-truth regression target $v$. It should be noted that only the feature from the local context net is used for bounding box regression. The corresponding experimental justification is provided in the next section.
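
The structure of this multi-task loss, a classification term for every proposal plus a regression term that is switched off for background, can be sketched as follows. The smooth L1 definition is the one from Fast R-CNN [10]; the function names are hypothetical, the two terms are assumed to be equally weighted, and the classifier and regressor outputs are passed in directly instead of being computed from the features.

```python
import math

def cross_entropy(logits, u):
    """Softmax cross-entropy for ground-truth class index u."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[u]

def smooth_l1(pred, target):
    """Smooth L1 loss summed over box coordinates: 0.5*d^2 if |d| < 1, else |d| - 0.5."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multitask_loss(logits, u, box_pred, box_target):
    """Classification loss plus box regression loss for non-background proposals (u >= 1)."""
    loss = cross_entropy(logits, u)
    if u >= 1:  # class 0 is background and has no box target
        loss += smooth_l1(box_pred, box_target)
    return loss

bg_loss = multitask_loss([0.0, 0.0], 0, [9.0, 9.0, 9.0, 9.0], [0.0, 0.0, 0.0, 0.0])
fg_loss = multitask_loss([0.0, 0.0], 1, [0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
```

In AC-CNN the `logits` would come from the concatenated local and global features, while `box_pred` would come from the local feature only.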

4 Experimental Results

4.1 Experimental Setting

Datasets and Evaluation Metrics We evaluate the proposed AC-CNN on two mainstream datasets: PASCAL VOC 2007 and VOC 2012 [28]. These two datasets contain 9,963 and 22,531 images respectively, and they are divided into train, val and test subsets. The model for VOC 2007 is trained on the trainval splits of VOC 2007 (5,011 images) and VOC 2012 (11,540 images). The model for VOC 2012 is trained on all images from VOC 2007 (9,963) and the trainval split of VOC 2012 (11,540). The evaluation metrics are Average Precision (AP) and mean Average Precision (mAP), complying with the PASCAL challenge protocols.
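
The text only states that AP and mAP follow the PASCAL protocols. As an illustration, the sketch below implements the 11-point interpolated AP commonly associated with VOC 2007; later VOC editions instead use the area under the monotone precision envelope, so the choice of variant is an assumption here. The function names and the toy precision/recall points are hypothetical.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated AP: mean of the best precision at recall >= 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for k in range(11):
        t = k / 10.0
        best = max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
        ap += best / 11.0
    return ap

def mean_ap(ap_per_class):
    """mAP: unweighted mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy detector: first detection correct (recall 0.5, precision 1.0),
# full recall reached at precision 0.5.
ap = voc07_ap([0.5, 1.0], [1.0, 0.5])
```

For the toy curve, six recall thresholds (0.0 to 0.5) see precision 1.0 and five (0.6 to 1.0) see precision 0.5, giving 8.5 / 11.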

Implementation Details The implementation of the proposed AC-CNN adopts VGG-16 [4] as the bottom CNN architecture. Our experiments use the publicly available Fast-RCNN [10] built on the Caffe [29] platform. The VGG-16 model is pre-trained on the ILSVRC 2012 [30] classification and localization dataset and can be downloaded from the Caffe Model Zoo. We fine-tune the proposed AC-CNN from the pre-trained VGG-16. During fine-tuning, each mini-batch consists of 2 images and 128 selected object proposals. Following Fast R-CNN, 25% of the proposals in each mini-batch are foreground, i.e. they have an Intersection over Union (IoU) overlap larger than 0.5 with a ground-truth bounding box, and the rest are background. During training, images are horizontally flipped with a probability of 0.5 to augment the training data. All the newly added layers (all the fully-connected layers in the global context net, the convolutional layer in the local context net, and the last fully-connected layers for object classification and bounding box regression) are randomly initialized with zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001, respectively.
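
The proposal sampling rule can be sketched as follows. This is an illustration under stated assumptions: 128 proposals over 2 images is taken to mean 64 per image, every proposal that is not foreground counts as background (as the text says, whereas Fast R-CNN itself also applies a lower IoU bound to background), and the function names are hypothetical.

```python
import random

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_rois(proposals, gt_boxes, per_image=64, fg_fraction=0.25,
                fg_thresh=0.5, rng=random):
    """Sample up to 25% foreground proposals (IoU > 0.5) and fill the rest with background."""
    fg, bg = [], []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        (fg if best > fg_thresh else bg).append(p)
    n_fg = min(int(per_image * fg_fraction), len(fg))
    n_bg = min(per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)

# Toy image: one ground-truth box, 20 well-overlapping and 100 distant proposals.
gt = [(0.0, 0.0, 10.0, 10.0)]
props = [(0.0, 0.0, 10.0, 10.0 + 0.01 * k) for k in range(20)]
props += [(100.0 + k, 100.0, 110.0 + k, 110.0) for k in range(100)]
fg_sample, bg_sample = sample_rois(props, gt)
```

If an image has fewer than 16 foreground proposals, this sketch fills the remaining slots with background so the per-image count stays at 64 whenever enough proposals exist.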

We run Stochastic Gradient Descent (SGD) for 120K and 150K iterations in total to train the network parameters for VOC 2007 and VOC 2012, respectively. The AC-CNN is trained on an NVIDIA GeForce Titan X GPU with an Intel Core i7-4930K CPU @ 3.40 GHz. The initial learning rate of all layers is set to 0.001 and is decreased to one tenth of its current value after 50K and 60K iterations for VOC 2007 and VOC 2012, respectively.
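
The learning-rate schedule can be written as a small function. The sketch assumes a single drop at the stated iteration; a solver that repeats the decay at every multiple of the step size would lower the rate a second time before training ends, and the text does not say which behaviour was used. The name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(iteration, base_lr=0.001, drop_at=50000, gamma=0.1):
    """Base rate before `drop_at` iterations, one tenth of it from then on (single drop assumed)."""
    return base_lr * gamma if iteration >= drop_at else base_lr

# VOC 2007 setting: 120K iterations, drop after 50K.
lr_voc07 = [learning_rate(it) for it in (0, 49999, 50000, 119999)]
# VOC 2012 setting: 150K iterations, drop after 60K.
lr_voc12 = [learning_rate(it, drop_at=60000) for it in (0, 59999, 60000, 149999)]
```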

AP (%)          VOC 2007            VOC 2012
                FRCN    AC-CNN      FRCN    AC-CNN
aeroplane       77.0    -           82.3    -
bicycle         78.1    -           78.4    -
bird            69.3    -           -       -
boat            59.4    -           52.3    -
bottle          38.3    -           38.7    -
bus             80.1    -           77.8    -
car             78.6    -           71.6    -
cat             86.7    -           89.3    -
chair           42.8    -           44.2    -
cow             78.8    -           73.0    -
dining table    68.9    -           55.0    -
dog             83.5    -           87.5    -
horse           82.0    -           80.5    -
motorbike       76.6    -           80.8    -
person          69.9    -           72.0    -
potted plant    31.8    -           35.1    -
sheep           70.1    -           68.3    -
sofa            71.4    -           65.7    -
train           80.4    -           80.4    -
tv/monitor      70.4    -           64.2    -
mAP             70.0    72.0        68.4    70.6
Table 1: Comparison of object detection results on VOC 2007 test and VOC 2012 test.

4.2 Performance Comparisons

We evaluate the proposed AC-CNN on VOC 2007 and the more challenging VOC 2012 by submitting the results to the public evaluation server. Table 1 compares the proposed AC-CNN with Fast-RCNN (FRCN). All the experimental results are based on the model trained on the VOC 2007 trainval set merged with the VOC 2012 trainval set.

Our method obtains mAP scores of 72.0% and 70.6% on VOC 2007 and VOC 2012, outperforming the baselines by 2.0% and 2.2%, respectively. Our method reaches better detection results than FRCN on most categories. In particular, small or occluded objects often appear in categories such as bottle, chair and potted plant. The improvements on these three challenging categories are 5.2%, 5.7% and 3.1%, which further validates the effectiveness of AC-CNN for detecting challenging objects.

4.3 Ablation Analysis on Training Pipeline

To validate the effectiveness of experimental settings in this work, we evaluate some components of AC-CNN.

4.3.1 Global Context Attention Method

In the proposed AC-CNN framework, we employ an attention-based recurrent model to capture positive contextual information from the global view. To validate the benefit of the LSTM-based attention for global context, we compare it with a simpler pooling method, i.e. average pooling. The two methods are compared on VOC 2007 test in Table 2. The proposed recurrent model performs better: adding global context via average pooling yields an mAP 0.4% lower than the full “AC-CNN”. The reason may be that not all information from the entire image is useful. By averaging the features of all regions, some noise may also be introduced, decreasing the detection accuracy. With the recurrent model, an attentive location map can be optimized to highlight the regions that have a positive effect on object detection. Therefore, we use the LSTMs to compute the global context in this work.

Global Context mAP (%)
Average Pooling 71.6
Attention-based Pooling 72.0
Table 2: Effect of different global context methods.

4.3.2 Contributions of Each Sub-network

The proposed AC-CNN consists of two sub-networks, i.e. the attention-based contextualized sub-network and the multi-scale contextualized sub-network. These two sub-networks are utilized to capture global and local contextual information, respectively. We compare the detection results of different sub-networks on VOC 2007 test in Table 3. “AC-CNN minus L” represents the model that only uses attention-based global context information for object detection. “AC-CNN minus G” represents the model that only uses multi-scale local context information for object detection. The performance drops by 0.6% if either sub-network is removed, which further validates the effectiveness of each component in the proposed AC-CNN.

Model mAP (%)
AC-CNN minus L 71.4
AC-CNN minus G 71.4
AC-CNN 72.0
Table 3: Effect of different sub-networks.

4.3.3 Effectiveness of Multi-scale Setting

We compare the detection performance with different scale settings based on the multi-scale contextualized sub-network in Table 4. The performance decreases if either additional scale (0.8 or 1.8) is removed. The reason is that the inside (or outside) context of a specific proposal cannot be well exploited once the corresponding scale, i.e. 0.8 (or 1.8), is removed. To further validate the effectiveness of contextual information, we also add one more scale (i.e. 2.7) for comparison. The multi-scale contextualized sub-network improves further with more scales. However, to balance the detection accuracy against the cost in time and GPU memory, we choose the scale setting “0.8+1.2+1.8” in this work.

Model mAP (%)
AC-CNN minus G (0.8 + 1.2) 71.3
AC-CNN minus G (1.2 + 1.8) 71.1
AC-CNN minus G (0.8 + 1.2 + 1.8) 71.4
AC-CNN minus G (0.8 + 1.2 + 1.8 + 2.7) 71.6
Table 4: Comparison of multi-scale contextualized sub-network with different scale settings.

Feature Representation mAP (%)
Local+Global Context 71.9
Local Context 72.0
Table 5: Comparison of bounding box regression based on different feature representations.

4.3.4 No Global Context for Bounding Box Regression

We do not use the concatenated feature from both the local context net and the global context net for bounding box regression. The reason is that bounding box regression is a very different task from object classification. The feature of a specific proposal is expected to represent the difference between its current position and the ground-truth position. Adding the feature from the global view dilutes this difference, which may decrease the accuracy of bounding box regression. We compare the results of the two settings on VOC 2007 test in Table 5. Removing the global context information from the bounding box regression task yields a 0.1% improvement.

Figure 4: Analysis of top ranked false positives on VOC 2007 test. Fraction of top $N$ detections ($N$ is the number of objects in the category) that are correct (Cor), or false positives due to poor localization (Loc), confusion with similar objects (Sim), confusion with other VOC objects (Oth), or confusion with background or unlabeled objects (BG). We only show the graphs for challenging classes, i.e. boat, bottle, chair and pottedplant, due to space limitations. Top row: the results of the FRCN model. Bottom row: the results of the proposed AC-CNN model.
Figure 5: Top ranked false positive types on VOC 2007 test. We only show the graphs for challenging classes, i.e. boat, bottle, chair and pottedplant, due to space limitations. Top row: the results of the FRCN model. Bottom row: the results of the proposed AC-CNN model.
Figure 6: Examples of input images and the corresponding attentive location maps generated by the attention-based contextualized sub-network.
Figure 7: Examples of detection results produced by FRCN and AC-CNN. Red color and green color indicate the ground-truth bounding boxes and the predicted results, respectively.

4.3.5 Detection Error Analysis

The tool of Hoiem et al. [31] is employed to analyze the detection errors in this work. As shown in Figure 4, we plot pie charts with the percentage of detections that are correct and of false positives due to bad localization, confusion with similar categories and other categories, and confusion with background or unlabeled objects. The proposed AC-CNN model considerably reduces the percentage of false positives for challenging categories. In particular, small objects usually appear in bottle and pottedplant. The improvements on these two challenging categories are both 5%, which validates the effectiveness of the proposed context-aware method for objects with small scales. A similar observation can be made from Figure 5, where we plot the top ranked false positive types of the results from FRCN and the proposed AC-CNN.

4.4 Visual Comparison

Figure 6 shows some input images and the corresponding attentive maps computed by the attention-based contextualized sub-network. It can be observed that those regions which may benefit the object classification task are highlighted. Some selected detection results on VOC 2007 test set are shown in Figure 7. For some small objects, e.g. boat, bottle and potted plant, the proposed AC-CNN achieves better detection results. Besides, for the third group, the occluded chair can also be detected with the AC-CNN model. Therefore, our AC-CNN can effectively advance the Fast-RCNN by incorporating valuable contextual information.

5 Conclusion

In this paper, we propose a novel Attention to Context Convolution Neural Network (AC-CNN) for object detection. Specifically, AC-CNN advances the traditional Fast-RCNN to a contextualized object detector by incorporating the attention-based contextualized sub-network and the multi-scale contextualized sub-network into a unified framework. To capture global context, the attention-based contextualized sub-network recurrently generates the attentive location map for the input image by employing stacked LSTM layers. With the attentive location map, the global contextual features can be produced by selectively combining the feature cubes of all locations. To capture local context, the multi-scale contextualized sub-network exploits the inside and outside contextual information of each specific proposal by scaling the corresponding bounding box with three pre-defined ratios. Extensive experiments on VOC 2007 and VOC 2012 demonstrate that the proposed AC-CNN makes a significant improvement by exploiting contextual information.

References