[CVIU 2019] ASSD learns to highlight useful regions on the feature maps while suppressing the irrelevant information, thereby providing reliable guidance for object detection.
This paper proposes a new deep neural network for object detection. The proposed network, termed ASSD, builds feature relations in the spatial space of the feature map. With the global relation information, ASSD learns to highlight useful regions on the feature maps while suppressing the irrelevant information, thereby providing reliable guidance for object detection. Compared to methods that rely on complicated CNN layers to refine the feature maps, ASSD is simple in design and computationally efficient. Experimental results show that ASSD competes favorably with state-of-the-art methods, including SSD, DSSD, FSSD and RetinaNet.
In recent years, object detection has developed rapidly with the aid of convolutional neural networks (CNNs). Generally, CNN-based object detectors can be divided into two types: one-stage and two-stage detectors. The two-stage object detectors, such as R-CNN, Fast and Faster R-CNN [7, 26] and SPPnet, are proposal driven, with a second stage for refining the detection. However, these two-stage detectors are inefficient for real-time applications due to the decoupled multi-stage processing. In contrast, the one-stage object detectors, including YOLO, YOLOv2 and SSD, model object detection as a simple regression problem and encapsulate all the computation in a single feed-forward CNN, thereby speeding up the detection to a large extent. However, one-stage detectors are generally less accurate than two-stage ones, mainly because of the extreme foreground-background class imbalance of the dense anchor boxes. To address this issue, RetinaNet proposes a focal loss to train its FPN-based one-stage detector. However, the focal loss is parameter-sensitive, and it requires exhaustive experiments to obtain the optimal parameters.
In this paper, we aim to improve one-stage detectors from a different perspective. We propose to discover the intrinsic feature relations on the feature map so as to focus the detector on regions that are critical to the detection task. Our key motivation comes from the human vision system. When perceiving a scene, humans first glance at it and then instantly figure out the contents through global dependency analysis. Besides, when the eyes focus on a fixation point, the resolution of the neighboring regions decreases. To simulate this mechanism, we design an attention unit that is capable of analyzing the importance of features at different positions based on the global feature relations. The attention unit is fully differentiable and operates in-place. This design generates attention maps that highlight the useful regions and suppress the irrelevant information. Compared to methods that only build relations among proposals [11, 31], our method considers the global feature correlations at the pixel level and conforms to the visual mechanism of humans.
We choose SSD as our base one-stage detector, as it provides a good trade-off among simplicity, speed and accuracy. Combined with the attention unit, we term the resulting object detector Attentive SSD (ASSD). ASSD is simpler in design and more effective at refining the contextual semantics than existing SSD-based detectors (see Fig. 1). In particular, DSSD relies on a complex feature pyramid to encourage the information flow among different layers; while achieving better accuracy than the original SSD, it is relatively complex and thus computationally inefficient. Another recent approach, FSSD, builds additional fusion modules for multi-scale feature aggregation, but only achieves marginal improvements upon SSD. In contrast to these works, our ASSD retains the original structure of SSD and employs a single efficient attention unit to refine the object information from each layer (see Fig. 1d). This design preserves the advantages of the original SSD while being more effective at learning object features. We demonstrate the advantages of ASSD on representative benchmark datasets, including PASCAL VOC and COCO. Experimental results validate the superiority of ASSD over the state-of-the-art methods in terms of accuracy and efficiency. Our main contributions can be summarized as follows:
We propose to incorporate pixel-wise feature relations into the one-stage detector. Our design follows the human vision mechanism and facilitates the object feature learning.
The proposed network preserves the simplicity and efficiency of SSD while being more accurate.
We perform a series of experiments to validate the advantages of ASSD. The experimental results show that ASSD competes favorably with the state-of-the-arts in terms of accuracy and efficiency.
Object detection involves localization and classification. From traditional methods based on hand-crafted features (e.g., SIFT and HOG) to recent CNN-based models, the past decades have witnessed significant development of object detection techniques. In recent years, CNN-based object detectors have gained remarkable success; they can generally be divided into two categories: proposal-driven two-stage detectors and regression-oriented one-stage detectors.
The two-stage object detectors are composed of two decoupled operations: proposal generation and box refinement. The pioneering work, R-CNN, utilizes selective search to generate region proposals and classifies them with class-specific linear SVMs using the learned CNN features. The major weakness of R-CNN is that it performs a forward pass for each proposal, leading to an extremely inefficient model. To solve this issue, SPPnet shares the CNN computation among all proposals, whereas Fast R-CNN replaces the SVMs with fully-connected layers (FCs) to enable single-stage training without additional feature caching. Faster R-CNN goes a step further and introduces a region proposal network (RPN), in which the proposal computation is performed on shared CNN features, thereby largely speeding up the detection process. In a more aggressive manner, R-FCN replaces the FCs with position-sensitive score maps and encodes translation-variance information into these maps, leading to a fully convolutional network (FCN) for accurate object detection. Another recent work, FPN, employs a top-down pyramid structure to reuse the higher-resolution feature maps from the feature hierarchy and has achieved state-of-the-art results. Two-stage object detectors are quite effective at object feature learning; however, they are generally computationally inefficient.
Different from two-stage detectors, one-stage object detectors discard the region proposal stage, thereby making the detection more efficient. YOLO uses a single CNN to simultaneously predict multiple bounding boxes as well as their class probabilities. While being extremely fast, YOLO is far less accurate than the two-stage models. Instead of directly predicting the coordinates of bounding boxes, YOLOv2 employs anchor boxes to facilitate the detection and considerably improves the accuracy. From a different perspective, SSD builds a pyramid CNN network on top of the backbone and detects objects of different scales from the multi-scale feature maps in a single forward pass, achieving better performance than YOLOv2. Based on SSD and similar to FPN, DSSD employs top-down pyramid CNN layers to improve the accuracy, but at the cost of computational efficiency. FSSD inserts a fusion module at the bottom of the feature pyramid to enhance the accuracy of SSD; while still being fast, it only achieves marginal improvements in accuracy. Other works, such as RefineDet, DSOD and STOD, also improve the detection accuracy of SSD, either by refining the anchors or by aggregating the feature maps at different scales. CornerNet follows a different strategy and improves the detection accuracy with a keypoint-based detector. The recent RetinaNet builds a one-stage detector on top of FPN and proposes a focal loss for better training. RetinaNet is efficient at inference; however, it requires a large effort for loss-function parameter tuning. In this work, we show that by explicitly modeling feature relations, our ASSD model competes favorably with RetinaNet without heavy parameter tuning.
Visual attention mechanisms are generally used to exploit salient visual information and facilitate visual tasks such as object recognition. Many visual attention methods exist in the literature. For example, the saliency-based visual attention model selects attended locations from saliency maps. In contrast, RAM, AttentionNet and RA-CNN search and crop the useful regions recurrently. In particular, RAM employs a Recurrent Neural Network (RNN) and reinforcement learning to discover the target. AttentionNet explores the direction that leads to the real object through CNN classification. RA-CNN also uses reinforcement learning to learn discriminative region attention and region-based feature representations. The common characteristic of these methods is that they focus only on single-instance problems. For multi-object recognition, AC-CNN, LPA and RelationNet have been proposed to discover a global contextual guidance. AC-CNN examines the global context through stacked Long Short-Term Memory (LSTM) units. LPA learns the attention maps from the compatibility scores between the shallow and deep layers. RelationNet correlates the geometry features and appearance information between proposals to generate and forward attentive features; it is designed specifically for two-stage object detectors and, in practice, achieves only a slight improvement.
The self-attention mechanism has been widely used in the natural language processing (NLP) field to model long-range dependencies within a sentence. LSTMN develops an attention memory network that discovers the relations between tokens to enhance the memorization capability of LSTM. Structured self-attentive sentence embedding introduces self-attention in a bidirectional LSTM to generate a 2-D matrix representation of the embeddings, where each row attends to a different part of the sentence. Transformer draws global dependencies between input and output based solely on attention mechanisms. Inspired by Transformer, in this work we build the long-range dependencies among all feature pixels within the feature map itself. In a similar spirit to Transformer, our ASSD is capable of attending to different regions for more effective object detection.
SSD performs detection on multi-scale feature maps to handle various object sizes effectively. However, the shallow layers lack semantic information and are therefore insufficient for detecting small objects. One way to solve this problem is to add more CNN layers that further refine the feature maps, or to exhaustively inject semantics from deep layers into the shallow ones. Considering that speed is the key advantage of one-stage object detectors, we instead aim to improve the SSD accuracy at a small extra computational cost. To this end, we construct a small network, namely the attention unit, and embed it into SSD to improve the detection accuracy. Our ASSD network architecture is illustrated in Fig. 2. Specifically, we use ResNet101 (conv1-5) as the backbone. The pyramid convolutional blocks (conv6-9) follow the same design as the original SSD. The feature maps from conv3-9 are used to detect objects of different scales. ASSD places the attention unit between the feature map and the prediction module, where the box regression and object classification are performed.
| Method | Backbone | Training Data | mAP | Input Size | FPS | GPU | #Anchors | #Parameters |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | VGG16 | 07+12 | 73.2 | 1000×600 | 7 | Titan X | 6000 | 134.7M |
We adapt the self-attention mechanism from the sequence transduction problem to our task. In sequence transduction, the self-attention mechanism draws global dependencies between the input and output sequences through an attention function, which maps a query and a set of key-value pairs to an output. In self-attention, the attention is driven by the input features themselves and is used for refining those features. Here we recast our problem as a similar query problem that estimates the relevant information from the input features in order to build global pixel-level feature correlations.
Suppose $X \in \mathbb{R}^{C \times N}$ is the feature map at a given scale, with $C$ and $N$ representing the number of channels and the total number of spatial locations in the feature map, respectively. We first linearly transform the feature map $X$ into three different feature spaces $f$, $g$ and $h$, i.e., $f(X) = W_f X$, $g(X) = W_g X$, and $h(X) = W_h X$, where $W_f, W_g \in \mathbb{R}^{\bar{C} \times C}$ and $W_h \in \mathbb{R}^{C \times C}$ with $\bar{C} = C/8$. The attention score matrix $S = f(X)^{\top} g(X) \in \mathbb{R}^{N \times N}$ is then calculated by the matrix multiplication of $f(X)^{\top}$ and $g(X)$, as shown in Fig. 2. Each row of the attention score matrix is normalized by a softmax operation:

$$A_{i,j} = \frac{\exp(S_{i,j})}{\sum_{j=1}^{N} \exp(S_{i,j})},$$

where the $i$-th row $A_{i,:}$ describes the pixel relations when querying the $i$-th location of the feature map. We call $A$ an attention map. Note that the reason we transform the input feature into the reduced-dimension spaces $f$ and $g$ is to lower the computational cost. The matrix multiplication of $f(X)^{\top}$ and $g(X)$ calculates the feature similarities and creates an attention map that reveals the feature relations; such pixel-wise relations are learned through the network.
Next, we apply a matrix multiplication between $h(X)$ and the attention map $A$. In this way we compute an updated feature map as the weighted sum of the individual features at each location. Finally, we add the matrix multiplication result back to the input feature map $X$:

$$Y = X + h(X) A^{\top}.$$

The attention map $A$ relates the long-range dependencies of features at all positions and therefore learns the global context of the feature map. It highlights the relevant parts of the feature map and guides the detection with refined information.
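To make the computation above concrete, the following is a minimal PyTorch sketch of such an attention unit. The module and layer names, and the use of 1×1 convolutions for the linear transforms with a channel reduction of 8, are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """Self-attention over all spatial locations of a feature map
    (hypothetical sketch of the unit described in the text)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 1)
        # f and g project into a reduced C/8-dimensional space to cut cost
        self.f = nn.Conv2d(channels, mid, kernel_size=1)
        self.g = nn.Conv2d(channels, mid, kernel_size=1)
        # h keeps the full channel dimension
        self.h = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, hgt, wid = x.shape
        n = hgt * wid
        f = self.f(x).view(b, -1, n)               # (B, C/8, N)
        g = self.g(x).view(b, -1, n)               # (B, C/8, N)
        h = self.h(x).view(b, c, n)                # (B, C, N)
        # S[i, j] = similarity between locations i and j
        scores = torch.bmm(f.transpose(1, 2), g)   # (B, N, N)
        attn = torch.softmax(scores, dim=-1)       # row-wise softmax -> A
        # Weighted sum of h-features per query location, then residual add
        out = torch.bmm(h, attn.transpose(1, 2))   # (B, C, N)
        return x + out.view(b, c, hgt, wid)
```

Because the unit ends with a residual addition and preserves the input shape, it can be dropped in-place between any feature map and its prediction module.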
Motivated by FSSD , we fuse the contextual information from layer4 and layer5 into layer3 to enrich its semantics. In our experiment, we find the fusion operation alone does not notably improve the detection accuracy (see Table 3). Instead, it even decreases the accuracy a bit with more computational cost. The reason would be that the three layers possess different receptive fields and have different capabilities; further, the concatenation and conv transformation would possibly neutralize the relative importance of the three layers and suppress the critical features in original layer3. However, when we place the attention unit after the fusion operation, there is a noticeable improvement (see Table 3). It is possible that semantics from the deep layers help the attention unit to discover useful information that resides in the original layer3. Finally, when only applying the attention unit, we observe inferior performance in contrast to the model with both fusion and attention mechanisms. This indicates that the feature fusion and attention are complementary to each other. The semantic fusion process can be formulated as:
$$X_{\text{fuse}} = \mathrm{Conv}\big(\mathrm{Concat}\big[X^{(3)},\ \mathrm{Up}(X^{(4)}),\ \mathrm{Up}(X^{(5)})\big]\big),$$

where $X^{(l)}$ is the feature map at layer $l$, $l \in \{3, 4, 5\}$, and $\mathrm{Up}(\cdot)$ denotes upsampling. In the concatenation operation, layer4 and layer5 are upsampled through bilinear interpolation in order to align their spatial sizes with that of layer3.
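The fusion step (bilinear upsampling, concatenation, and a convolutional mix) can be sketched in PyTorch as follows. The module name, the use of a 1×1 convolution, and the output width are our assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticFusion(nn.Module):
    """Fuses layer4/layer5 semantics into layer3 (hypothetical sketch)."""

    def __init__(self, c3, c4, c5, out_channels):
        super().__init__()
        # 1x1 conv mixes the concatenated channels (an assumption)
        self.conv = nn.Conv2d(c3 + c4 + c5, out_channels, kernel_size=1)

    def forward(self, x3, x4, x5):
        # Bilinearly upsample the deeper maps to layer3's spatial size
        size = x3.shape[-2:]
        x4 = F.interpolate(x4, size=size, mode="bilinear", align_corners=False)
        x5 = F.interpolate(x5, size=size, mode="bilinear", align_corners=False)
        # Concatenate along the channel axis, then transform
        return self.conv(torch.cat([x3, x4, x5], dim=1))
```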
We follow the same anchor box generation method as SSD. Specifically, we use aspect ratios $\{1, 2, \frac{1}{2}\}$ for anchor boxes on the feature maps of conv3, 8, 9 and $\{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$ for anchor boxes on the feature maps of conv4-7. Each box has a minimum scale $s_{\min}$ and a maximum scale $s_{\max}$, where the scale is regularly spaced over the feature map layers and $s_{\max}$ is the $s_{\min}$ of the next layer. The normalized width and height of an anchor box are calculated by $w = s\sqrt{a_r}$ and $h = s/\sqrt{a_r}$, where $s = \sqrt{s_{\min} s_{\max}}$ for the additional box with $a_r = 1$, and $s = s_{\min}$ otherwise. We use hard negative mining to address the positive-negative box class imbalance, as in the original SSD. We also employ the same data augmentations and loss functions as SSD.
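As an illustration, the anchor sizing can be computed as below. This sketch assumes the original SSD convention, where an extra square box with the geometric-mean scale $\sqrt{s_{\min} s_{\max}}$ is added for aspect ratio 1; the function name is ours:

```python
import math

def anchor_sizes(s_min, s_max, ratios):
    """Normalized (w, h) pairs for the anchor boxes of one feature-map
    layer, following the SSD sizing convention (an assumption)."""
    boxes = []
    for ar in ratios:
        # w = s * sqrt(ar), h = s / sqrt(ar), with s = s_min
        boxes.append((s_min * math.sqrt(ar), s_min / math.sqrt(ar)))
    # Extra square box for ratio 1, using the geometric-mean scale
    s_prime = math.sqrt(s_min * s_max)
    boxes.append((s_prime, s_prime))
    return boxes

# Example: a layer with s_min = 0.2, s_max = 0.34 and ratios {1, 2, 1/2}
# yields 4 boxes, matching the 4-box layers of SSD.
layer_boxes = anchor_sizes(0.2, 0.34, [1, 2, 0.5])
```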
Our model is implemented with PyTorch and trained on 8 NVIDIA Tesla K80 GPUs. The weights of the ResNet101 backbone are pretrained on ImageNet. We use the Stochastic Gradient Descent (SGD) algorithm to optimize the ASSD weights, with a momentum of 0.9, a weight decay of 0.0005 and an initial learning rate of 0.001. Following the settings of SSD, DSSD and FSSD, we train and evaluate ASSD at two input resolutions: 321×321 and 513×513. In particular, we set the mini-batch size to 10 images per GPU for ASSD321 and 8 images per GPU for ASSD513.
We conduct experiments on two common datasets: PASCAL VOC and COCO. The PASCAL VOC dataset contains 20 object classes. We evaluate ASSD on the PASCAL VOC 2007/2012 test sets. The COCO dataset includes 80 object categories. In this work, we use the COCO 2017 dataset, which has the same train, validation and test images as COCO 2014, ensuring a fair comparison with the state-of-the-art methods. Note that RetinaNet does not report PASCAL VOC detection results; therefore, we only compare the accuracy and speed of RetinaNet on the COCO dataset.
We first evaluate our ASSD on the PASCAL VOC 2007 test set, with the primary goal of comparing the speed and accuracy of ASSD with state-of-the-art methods. The training dataset here is the union of the 2007 trainval and 2012 trainval sets. We train ASSD321 for 280 epochs, where the initial learning rate of 0.001 decreases by a factor of 0.1 at the 200th and 250th epochs. For ASSD513, we train for 180 epochs, with a learning rate decay of 0.1 at the 120th and 170th epochs. As shown in Table 2, at a comparably fast speed, ASSD achieves a large improvement in accuracy compared to SSD, DSSD and FSSD.
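A learning-rate schedule like the ASSD321 one described above (0.001 initial, decayed by 0.1 at epochs 200 and 250) can be expressed with PyTorch's built-in `MultiStepLR`; the stand-in parameter below is a placeholder for the model weights, not the actual ASSD network:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder parameter; in practice this would be the ASSD model's weights.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
# Decay the learning rate by 0.1 at epochs 200 and 250 (the ASSD321 schedule).
scheduler = MultiStepLR(optimizer, milestones=[200, 250], gamma=0.1)

for epoch in range(280):
    # ... one training epoch over the VOC 07+12 trainval union ...
    optimizer.step()
    scheduler.step()  # lr ends at ~1e-5 after both decays
```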
We perform an ablation study to explore the effects of the attention unit and semantic fusion on detection accuracy and speed. Here we investigate four models, SSD513, SSD513+fusion, SSD513+att and SSD513+fusion+att, on the PASCAL VOC 2007 test set. It can be observed from Table 3 that the fusion module alone does not show a noticeable accuracy improvement; on the contrary, it brings a little extra computational overhead. In contrast, the attention unit alone leads to a significant performance improvement, and combining it with the fusion module boosts performance further. We conjecture that the attention unit is able to analyze the contextual semantics at different levels and select the useful information for guiding a better detection.
We compare the detection accuracy of ASSD with the state-of-the-art methods on the PASCAL VOC 2012 test set. The mAP is evaluated by the online PASCAL VOC evaluation server. We present a detailed comparison of the average precision (AP) for each class in Table 4. The training dataset contains the 2007 trainval+test and 2012 trainval sets, and we follow similar training settings as for PASCAL VOC 2007. From Table 4, it can be seen that ASSD513 improves the detection accuracy for most of the classes. A likely reason is that the attention unit figures out the pixel-level feature relationships and therefore enhances the model's ability to distinguish objects of different classes.
We train and validate ASSD on the COCO training set (118k images) and validation set (5k images), and compare with the state-of-the-art methods on COCO test-dev; the detection performance is evaluated by the online evaluation server. We train ASSD321 for 160 epochs with a learning rate decay of 0.1 at the 100th and 150th epochs. ASSD513 is trained for 140 epochs, with the learning rate decreased after 80 and 130 epochs. As shown in Table 5, ASSD achieves a large improvement over SSD, DSSD and FSSD. Besides, at a similar input resolution, ASSD513 obtains better accuracy than RetinaNet500, especially for AP at different object area thresholds. In particular, at an IoU threshold of 0.5, ASSD513 has a 2.4% improvement over RetinaNet500. Furthermore, Table 5 also shows that ASSD is more effective at detecting small, medium and large objects. Note that, with the above superiority in detection accuracy, ASSD513 (6.1 FPS on a K40) still achieves a speed comparable to RetinaNet500 (6.8 FPS on a K40).
To better investigate the attention mechanism, we visualize the attention maps at different scales by projecting them onto the original images. Here we utilize the PASCAL VOC 2007 test set, which contains 20 classes. From Fig. 3, we observe that the attention maps highlight the crucial locations of objects, indicating that the feature relations help the model concentrate on useful regions. At shallow layers, the attention map guides the model to focus on small objects, while at deep layers it highlights objects of large sizes. Moreover, the attention map suppresses the negative regions, which is of great help for fast rejection of negative anchor boxes.
In this paper, we propose an attentive single shot multibox detector, termed ASSD, for more effective object detection. Specifically, ASSD utilizes a fast and light-weight attention unit to help discover feature dependencies and focus the model on useful and relevant regions. ASSD improves the accuracy of SSD by a large margin at a small extra cost of computation. Moreover, ASSD competes favorably with the other state-of-the-art methods. In particular, it achieves better performance than the one-stage detector RetinaNet, while being easier to train without the need to heavily tune the loss parameters.