ASSD: Attentive Single Shot Multibox Detector

by   Jingru Yi, et al.
Rutgers University

This paper proposes a new deep neural network for object detection. The proposed network, termed ASSD, builds feature relations in the spatial space of the feature map. With the global relation information, ASSD learns to highlight useful regions on the feature maps while suppressing the irrelevant information, thereby providing reliable guidance for object detection. Compared to methods that rely on complicated CNN layers to refine the feature maps, ASSD is simple in design and is computationally efficient. Experimental results show that ASSD competes favorably with the state-of-the-arts, including SSD, DSSD, FSSD and RetinaNet.



There are no comments yet.


page 3

page 7


Enhancement of SSD by concatenating feature maps for object detection

We propose an object detection method that improves the accuracy of the ...

ScarfNet: Multi-scale Features with Deeply Fused and Redistributed Semantics for Enhanced Object Detection

Convolutional neural network (CNN) has led to significant progress in ob...

MDCN: Multi-Scale, Deep Inception Convolutional Neural Networks for Efficient Object Detection

Object detection in challenging situations such as scale variation, occl...

S-DOD-CNN: Doubly Injecting Spatially-Preserved Object Information for Event Recognition

We present a novel event recognition approach called Spatially-preserved...

Deep Regionlets for Object Detection

A key challenge in generic object detection is being to handle large var...

Improving Object Detection with Inverted Attention

Improving object detectors against occlusion, blur and noise is a critic...

DNN Feature Map Compression using Learned Representation over GF(2)

In this paper, we introduce a method to compress intermediate feature ma...

Code Repositories


[CVIU 2019] ASSD learns to highlight useful regions on the feature maps while suppressing the irrelevant information, thereby providing reliable guidance for object detection.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In recent years, object detection has experienced a rapid development with the aid of convolutional neural networks (CNN). Generally, the CNN-based object detectors can be divided into two types: one-stage object detector and two-stage object detector. The two-stage object detectors, such as R-CNN

[8], Fast and Faster R-CNN [7, 26] and SPPnet [9], are proposal driven, with a second stage for refining the detection. However, these two-stage object detectors are inefficient for real-time applications due to the decoupled multi-stage processing. In contrast, the one-stage object detectors, including YOLO [24], YOLO-v2 [25] and SSD [21], propose to model the object detection as a simple regression problem and encapsulate all the computation in a single feed-forward CNN, thereby speeding up the detection to a large extent. However, the one-stage detectors are generally less accurate than the two-stage ones. The main reason would be the extreme foreground-background class imbalance of the dense anchor boxes [18]. To solve this issue, RetinaNet [18] proposes a focal loss to train its FPN-based [17] one-stage detector. However, the focal loss is parameter sensitive, and it would require exhaustive experiments to obtain the optimal parameters.

Figure 1: The structures of different SSD-based detectors. (a) SSD [21], (b) DSSD [5], (c) FSSD [16], (d) ASSD (Ours).

In this paper, we aim to improve the one-stage detectors from a different perspective. We propose to discover the intrinsic feature relations on the feature map to focus the detector on regions that are critical to the detection task. Our key motivation comes from the human vision system. When perceiving a scene, humans first glance at the scene and then instantly figure out the contents through global dependency analysis. Besides, when the eyeballs focus on a fixation point, the resolution of the neighboring regions decreases. To simulate such human vision mechanism, we design an attention unit that is capable of analyzing the importance of features at different positions, based on the global feature relations. The attention unit is fully differentiable and in-place. This design generates the attention maps which highlight the useful regions and suppress the irrelevant information. Compared to methods that only build relations among proposals [11, 31], our method considers the global feature correlations at pixel level and conforms to the visual mechanism of humans.

We choose the SSD as our base one-stage detector, which provides the optimal trade-off among simplicity, speed and accuracy. Combined with the attention unit, we term the resulting object detector as Attentive SSD (ASSD). ASSD is simpler in design and more effective at refining the contextual semantics compared to the existing SSD-based detectors (see Fig. 1). In particular, DSSD [5] relies on a complex feature pyramid to encourage the information flow among different layers. While achieving better accuracies than the original SSD, it is relatively more complex and thus computationally inefficient. Another recent approach, FSSD [16], builds additional fusion modules for multi-scale feature aggregation, but only achieves marginal improvements upon SSD. In contrast to these works, our ASSD retains the original structure of SSD and employs a single efficient attention unit to refine the object information from each layer (see Fig. 1d). This design preserves the advantages of the original SSD while being more effective at learning object features. We demonstrate the advantages of ASSD on a number of representative benchmark datasets, including PASCAL VOC [4] and COCO [19]. Experimental results validate the superiority of ASSD compared to the state-of-the-arts in terms accuracy and efficiency. Our main contributions can be summarized as follows:

  1. We propose to incorporate pixel-wise feature relations into the one-stage detector. Our design follows the human vision mechanism and facilitates the object feature learning.

  2. The proposed network preserves the simplicity and efficiency of SSD while being more accurate.

  3. We perform a series of experiments to validate the advantages of ASSD. The experimental results show that ASSD competes favorably with the state-of-the-arts in terms of accuracy and efficiency.

2 Related Works

Figure 2: Overview of the ASSD architecture. The backbone of ASSD (conv1-5) is ResNet101 [10]. The extra convolutional blocks follow the same settings as the original SSD [21]

. Batch normalization and ReLU are used in all layers. The feature maps are displayed as “number of channels

height width”. Feature map from conv3 is enhanced by fusion of feature maps from conv3-5.

2.1 Object Detection

Object detection involves localization and classification. From traditional hand-crafted feature-based methods (e.g., SIFT [27] and HOG [3]) to recent CNN-based models, last decades have witnessed a significant development of object detection techniques. In recent years, CNN-based object detectors have gained remarkable success and generally can be divided into two categories: the proposal-driven two-stage detectors, and the regression-oriented one-stage detectors.

The two-stage object detectors are composed of two decoupled operations: proposal generation and box refinement. The pioneering work, R-CNN [8]

, utilizes selective search to generate region proposals and classifies them with class-specific linear SVM using the learned CNN features. The major weakness of R-CNN is that it needs to perform the forward pass for each proposal, leading to an extremely inefficient model. To solve this issue, SPPnet

[9] suggests sharing the CNN computation for all proposals, whereas Fast R-CNN [7] replaces the SVM with fully-connected layers (FCs) to enable single-stage training without additional feature caching. Faster R-CNN [26] goes a step further and introduces a region proposal network (RPN) where the proposal computation is performed through shared CNN features, thereby largely speeding up the detection process. In a more aggressive manner, R-FCN [2]

replaces the FCs with position-sensitive score maps and encodes translation variance information into these maps, leading to a variance insensitive fully convolutional network (FCN) for accurate object detection. Another recent work, FPN

[17], employs a top-down pyramid structure to reuse the higher-resolution features maps from the feature hierarchy and has achieved the state-of-the-art results. Two-stage object detectors are quite effective at object feature learning. However, they are generally inefficient in computation.

Different from two-stage detectors, one-stage object detectors discard the region proposal stage, thereby making the detection more efficient. YOLO [24]

proposes to use a single CNN to simultaneously predict multiple bounding boxes as well as their class probabilities. While being extremely fast, YOLO is far less accurate than the two-stage models. Instead of directly predicting the coordinates of bounding boxes, YOLOv2

[25] employs the anchor boxes to facilitate the detection and improves the accuracy a lot. From a different perspective, SSD [21] builds a pyramid CNN network on top of the backbone, and detects objects of different scales from the multi-scale feature maps in a single forward pass. SSD has achieved better performance than YOLOv2. Based on SSD and similar to FPN, DSSD [5] employs top-down pyramid CNN layers to improve the accuracy but at the cost of computational efficiency. FSSD [16] inserts a fusion module at the bottom of the feature pyramid to enhance the accuracy of SSD. While still being fast, FSSD only achieves marginal improvements upon SSD in accuracy. Other works, such as RefineDet [32], DSOD [28] and STOD [33], also improve the detection accuracy of SSD either through refinining the anchors or by aggregating the feature maps at different scales. CornerNet [14] follows a different strategy and improves the detection accuracy with keypoint-based object detectors. The recent work, RetinaNet [18]

, builds the one-stage detector based on FPN and proposes a focal loss for better training. RetinaNet is efficient in inference; however, it requires a large effort for loss function parameter tuning. In this work, we show that by explicitly modeling the feature relations, our ASSD model competes favorably with RetinaNet without heavy tuning of parameters.

2.2 Visual Attention

Visual attention mechanism is generally used to exploit the salient visual information and facilitate visual tasks such as object recognition. There are many visual attention methods in the literature. For example, the saliency-based visual attention model

[12] selects attended locations from saliency maps. In contrast, RAM [22], AttentionNet [30] and RA-CNN [6]

search and crop the useful regions recurrently. In particular, RAM employs Recurrent Neural Network (RNN) and reinforcement learning to discover the target. AttentionNet explores the direction that leads to the real object through CNN classification. RA-CNN also uses reinforcement learning to learn the discriminative region attention and region-based feature representation. The common characteristic of these methods is that they only focus on single instance problems. For multi-object recognition, AC-CNN

[15], LPA [13] and RelationNet [11]

have been proposed to discover a global contextual guidance. AC-CNN examines the global context through the stacked Long Short-Term Memory (LSTM) units. LPA learns the attention maps from the compatibility scores between the shallow and deep layers. RelationNet correlates the geometry features and appearance information between proposals to generate and forward the attentive features, and it is designed specifically for the two-stage object detectors. In practice, RelationNet only achieves a slight improvement.

2.3 Self-Attention

The self-attention mechanism has been widely used in natural language processing (NLP) field to model long-range dependencies of a sentence. LSTMN

[1] develops an attention memory network that discovers the relations between tokens to enhance the memorization capability of LSTM. Structured self-attentive sentence embedding [20] introduces self-attention in the bidirectional LSTM to generate a 2-D matrix representation of the embeddings, where each row attends to a different part of the sentence. Transformer [29] draws global dependencies between input and output based solely on attention mechanisms. Inspired by Transformer, in this work we build the long-range dependencies among all feature pixels within the feature map itself. In a similar spirit to Transformer, our ASSD is capable of attending to different regions for more effective object detection.

3 Attentive SSD

SSD [21] performs the detection on multi-scale feature maps to handle various object sizes effectively. However, the shallow layer lacks semantic information and is therefore insufficient for detecting small objects. One way to solving this problem is to build more CNN layers to make further refinements of the feature maps or inject semantics from deep layers to the shallow ones exhaustively. Considering that speed is the key advantage of one-stage object detectors, we aim to improve the SSD accuracy with small extra computational cost. To this end, we construct a small network, namely attention unit, and embed it into SSD to improve the detection accuracy. Our ASSD network architecture is illustrated in Fig. 2. Specifically, we use ResNet101 (conv1-5) [10] as the backbone. The pyramid convolutional blocks (conv6-9) follow the same design as the original SSD [21]. The feature maps from conv3-9 are used to detect objects with different scales. ASSD places the attention unit between the feature map and the prediction module, where the box regression and object classification are performed.

Layer Name Output Size Specifications
conv1 256256 7

7, 64, stride 2

conv2_x 128128
conv3_x 6464
conv4_x 3232
conv5_x 88
Table 1: Architecture of ASSD with ResNet101 backbone. ReLU and Batch Normalization are used in hidden layers. The input size is 513 513.
Method Backbone Training Data mAP Input Size FPS GPU #Anchors #Parameters
Faster R-CNN VGG16 07+12 73.2 1000600 7 Titan X 6000 134.7M
Faster R-CNN ResNet101 07+12 76.4 1000600 2.4 K40 300 -
R-FCN ResNet101 07+12 79.5 1000600 9 Titan X 300 50.9M
YOLOv2 Darknet19 07+12 78.6 544544 40 Titan X - -
RetinaNet300 ResNet101 07+12 62.9 300300 11.4 K80 15354 55.7M
RetinaNet300+att ResNet101 07+12 64.9 300300 11.1 K80 15354 55.8M
SSD300* VGG16 07+12 77.5 300300 46 Titan X 8732 -
SSD321 ResNet101 07+12 77.1 321321 11.2 Titan X 17080 56.8M
FSSD300 VGG16 07+12 78.8 300300 65.8 1080Ti 8732 -
DSSD321 ResNet101 07+12 78.6 321321 9.5 Titan X 17080 -
ASSD300 VGG16 07+12 80.0 300300 11.8 K40 8732 29.4M
ASSD321 ResNet101 07+12 79.5 321321 27.5/11.4 Titan X/K40 10325 66.7M
RefineDet320 VGG16 07+12 79.5 320320 12.9 K80 6375 32.1M
RefineDet320+att VGG16 07+12 80.0 320320 12.0 K80 6375 33.9M
RetinaNet500 ResNet101 07+12 72.2 500500 7.1 K80 35964 55.7M
RetinaNet500+att ResNet101 07+12 73.4 500500 6.7 K80 35964 55.8M
SSD512* VGG16 07+12 79.5 512512 19 Titan X 24564 -
SSD513 ResNet101 07+12 80.6 513513 6.8 Titan X 43688 57.5M
FSSD513 VGG16 07+12 80.9 512512 35.7 1080Ti 24564 -
DSSD513 ResNet101 07+12 81.5 513513 5.5 Titan X 43688 -
ASSD512 VGG16 07+12 81.6 512512 3.4 K40 24564 30.2M
ASSD513 ResNet101 07+12 83.0 513513 16.0/6.1 Titan X/K40 25844 67.5M
RefineDet512 VGG16 07+12 81.2 512512 5.6 K80 16320 32.1M
RefineDet512+att VGG16 07+12 82.2 512512 5.0 K80 16320 33.9M
Table 2: Comparison of speed and accuracy on PASCAL VOC2007 test. 07+12: 07 trainval+12 trainval. We compared our ASSD with Faster R-CNN [26, 10], R-FCN [2],YOLOv2 [25], SSD300*, SSD512* [21], SSD321, SSD513, DSSD321, DSSD513 [5], FSSD300, FSSD513 [16], RefineDet [32]. Att is the abbreviation for attention module.

3.1 Attention Unit

We adapt the self-attention mechanism from the sequence transduction problem [29]

to our task. In sequence transduction, self-attention mechanism draws global dependencies between the input and output sequences by an attention function, which maps a query and a set of key-value pairs to an output. In self-attention, the attention is motivated by the input features and used for refining these features. Here we repurpose our problem as a similar query problem that estimates the relevant information from the input features in order to build global pixel-level feature correlations.

Suppose is the feature map at a given scale , with and

representing the number of channels and total spatial locations in the feature map, respectively. We first linearly transform the feature map

into three different feature spaces and , i.e., , , and , where and with . The attention score matrix is then calculated by the matrix multiplication of and , as shown in Fig. 2. Each row of the attention score matrix is normalized by a softmax operation:


where describes the pixel relations when querying the -th location of the feature map. We call as an attention map. Note that, the reason we transform the input feature into and is to reduce computational cost. The matrix computation of and calculates the feature similarities and creates an attention map that reveals the feature relations. Note that such pixel-wise relations are learned through the network.

Next, we apply a matrix multiplication between and the attention maps . In this way we compute an updated feature map as the weighted sums of individual features at each location. Finally, we add the matrix multiplication result back to the input feature map :


Attention map relates the long-range dependencies of features at all positions and therefore learns global contexts of the feature map. It highlights the relevant parts of the feature map and guides the detection with refined information.

Method Backbone Time (s) mAP
SSD513 ResNet101 0.1417 79.75
SSD513+fusion ResNet101 0.1466 79.57
SSD513+att ResNet101 0.1593 82.13
SSD513+fusion+att ResNet101 0.1648 82.95
Table 3: Ablation Study on PASCAL VOC2007 test dataset. Training dataset is 07+12: 07 trainval+12 trainval. Time is evaluated on a single NVIDIA K40 GPU. Note that SSD513+fusion is different from FSSD513 [16].
Method Backbone mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
AC-CNN VGG16 70.6 83.2 80.8 70.8 54.9 42.1 79.1 73.4 89.7 47.0 75.9 61.8 87.8 80.9 81.8 74.4 37.8 71.6 67.7 83.1 67.4
Faster ResNet101 73.8 86.5 81.6 77.2 58.0 51.0 78.6 76.6 93.2 48.6 80.4 59.0 92.1 85.3 84.8 80.7 48.1 77.3 66.5 84.7 65.6
R-FCN ResNet101 77.6 86.9 83.4 81.5 63.8 62.4 81.6 81.1 93.1 58.0 83.8 60.8 92.7 86.0 84.6 84.4 59.0 80.8 68.6 86.1 72.9
YOLOv2 Darknet19 73.4 86.3 82.0 74.8 59.2 51.8 79.8 76.5 90.6 52.1 78.2 58.5 89.3 82.5 83.4 81.3 49.1 77.2 62.4 83.8 68.7
Retina300 ResNet101 59.8 73.9 68.2 65.1 43.6 32.3 67.2 58.8 83.0 39.6 58.4 4.6 80.9 67.8 69.2 70.0 34.6 57.1 48.4 73.8 58.3
Retina300+att ResNet101 61.5 76.5 70.6 66.2 42.1 34.1 69.3 59.4 87.2 42.6 59.0 47.5 83.8 69.0 72.9 71.6 37.3 58.9 48.3 74.7 59.9
SSD300* VGG16 75.8 88.1 82.9 74.4 61.9 47.6 82.7 78.8 91.5 58.1 80.0 64.1 89.4 85.7 85.5 82.6 50.2 79.8 73.6 86.6 72.1
SSD321 ResNet101 75.4 87.9 82.9 73.7 61.5 45.3 81.4 75.6 92.6 57.4 78.3 65.0 90.8 86.8 85.8 81.5 50.3 78.1 75.3 85.2 72.5
DSSD321 ResNet101 76.3 87.3 83.3 75.4 64.6 46.8 82.7 76.5 92.9 59.5 78.3 64.3 91.5 86.6 86.6 82.1 53.3 79.6 75.7 85.2 73.9
ASSD300 VGG16 77.5 88.7 85.6 78.0 65.7 54.1 82.6 78.2 91.8 59.7 84.0 65.0 90.4 87.6 88.3 83.7 53.5 81.1 70.4 86.8 75.5
ASSD321 ResNet101 76.4 89.6 84.3 76.7 64.40 49.30 81.7 77.0 92.2 57.80 81.3 64.0 91.6 86.5 85.8 82.1 53.0 80.0 70.9 87.2 71.8
Retina512 ResNet101 67.7 80.4 74.0 73.4 53.5 49.7 73.0 71.2 88.2 45.8 69.7 50.6 87.1 74.0 76.8 78.9 45.6 69.1 51.3 77.2 65.0
Retina512+att ResNet101 68.8 81.4 77.6 73.3 54.1 53.0 74.3 72.27 85.01 48.5 71.5 50.0 87.6 77.4 77.3 80.0 49.5 71.6 53.2 72.9 66.3
SSD512* VGG16 78.5 90.0 85.3 77.7 64.3 58.5 85.1 84.3 92.6 61.3 83.4 65.1 89.9 88.5 88.2 85.5 54.4 82.4 70.7 87.1 75.6
SSD513 ResNet101 79.4 90.7 87.3 78.3 66.3 56.5 84.1 83.7 94.2 62.9 84.5 66.3 92.9 88.6 87.9 85.7 55.1 83.6 74.3 88.2 76.8
DSSD513 ResNet101 80.0 92.1 86.6 80.3 68.7 58.2 84.3 85.0 94.6 63.3 85.9 65.6 93.0 88.5 87.8 86.4 57.4 85.2 73.4 87.8 76.8
ASSD512 VGG16 80.0 89.8 87.7 81.5 70.6 60.0 85.3 84.7 93.6 61.8 84.9 66.1 90.9 88.6 87.9 86.6 57.7 86.7 71.5 86.5 77.4
ASSD513 ResNet101 81.3 92.1 89.2 82.5 71.5 60.4 85.5 84.8 93.9 63.7 88.6 67.4 92.6 90.2 89.0 86.5 60.4 88.2 73.4 88.6 77.0
Table 4: PASCAL VOC2012 test detection results. Training dataset is 07++12: 07 trainval+07 test+12 trainval. Results are evaluated by online PASCAL VOC evaluation server. We compared the performance of our ASSD321 and ASSD513 with AC-CNN [15], Faster R-CNN [10], R-FCN [2],YOLOv2 [25], SSD300*, SSD512* [21], SSD321, SSD513, DSSD321, DSSD 513 [5], RefineDet [32].
Method Training Data Backbone Avg. Precision, IoU: Avg. Precision, Area: Avg. Recall, #Dets: Avg. Recall, Area:
0.5:0.95 0.5 0.75 S M L 1 10 100 S M L
Faster R-CNN trainval VGG16 21.9 42.7 - - - - - - - - - -
R-FCN trainval ResNet101 29.9 51.9 - 10.8 32.8 45.0 - - - - - -
YOLOv2 trainval35k Darknet19 21.6 44.0 19.2 5.0 22.4 35.5 20.7 31.6 33.3 9.8 36.5 54.4
SSD300* trainval35k VGG16 25.1 43.1 25.8 6.6 25.9 41.4 23.7 35.1 37.2 11.2 40.4 58.4
SSD321 trainval35k ResNet101 28.0 45.4 29.3 6.2 28.3 49.3 25.9 37.8 39.9 11.5 43.3 64.9
FSSD300 trainval35k VGG16 27.1 47.7 27.8 8.7 29.2 42.2 24.6 37.4 40.0 15.9 44.2 58.6
DSSD321 trainval35k ResNet101 28.0 46.1 29.2 7.4 28.1 47.6 25.5 37.1 39.4 12.7 42.0 62.6
ASSD321 trainval35k ResNet101 29.2 47.8 30.9 6.9 33.3 47.9 26.3 38.7 40.2 10.4 46.0 64.8
RefineDet320 trainval35k VGG16 29.4 49.2 31.3 10.0 32.0 44.4 - - - - - -
SSD512* trainval35k VGG16 28.8 48.5 30.3 10.9 31.8 43.5 26.1 39.5 42.0 16.5 46.6 60.8
SSD513 trainval35k ResNet101 31.2 50.4 33.3 10.2 34.5 49.8 28.3 42.1 44.4 17.6 49.2 65.8
FSSD512 trainval35k VGG16 31.8 52.8 33.5 14.2 35.1 45.0 27.6 42.4 45.0 22.3 49.9 62.0
DSSD513 trainval35k ResNet101 33.2 53.3 35.2 13.0 35.4 51.1 28.9 43.5 46.2 21.8 49.1 66.4
RetinaNet500 trainval35k ResNet101 34.4 53.1 36.8 14.7 38.5 49.1 - - - - - -
ASSD513 trainval35k ResNet101 34.5 55.5 36.6 15.4 39.2 51.0 29.9 45.6 47.6 22.8 52.2 67.9
RefineDet512 trainval35k VGG16 33.0 54.5 35.5 16.3 36.3 44.3 - - - - - -
Table 5: COCO test-dev detection results, which is evaluated by online evaluation server. We compared the our ASSD321 and ASSD512 performances with Faster R-CNN[26], R-FCN[2], YOLOv2[25], SSD300*, SSD500*[21], SSD321, SSD513, DSSD321, DSSD513 [5], FSSD300, FSSD512[16], RetinaNet500[18], RefineDet[32].

3.2 Semantic Fusion

Motivated by FSSD [16], we fuse the contextual information from layer4 and layer5 into layer3 to enrich its semantics. In our experiment, we find the fusion operation alone does not notably improve the detection accuracy (see Table 3). Instead, it even decreases the accuracy a bit with more computational cost. The reason would be that the three layers possess different receptive fields and have different capabilities; further, the concatenation and conv transformation would possibly neutralize the relative importance of the three layers and suppress the critical features in original layer3. However, when we place the attention unit after the fusion operation, there is a noticeable improvement (see Table 3). It is possible that semantics from the deep layers help the attention unit to discover useful information that resides in the original layer3. Finally, when only applying the attention unit, we observe inferior performance in contrast to the model with both fusion and attention mechanisms. This indicates that the feature fusion and attention are complementary to each other. The semantic fusion process can be formulated as:


where is the feature map at layer , and

. In the concatenation operation, layer4 and layer5 are upsampled through bilinear interpolation in order to align their sizes with that of layer3.

Figure 3: Visualization of attention maps on PASCAL VOC 2007 test set. The attention maps are calculated from feature maps of different scales. For a given input image, the attention maps highlight the useful regions of different sizes, as indicated by the heat regions. The attention map will be used as the weighted sum of spatial features at each location. Therefore, the features of unrelated regions such as background are suppressed. In this way, the attention maps helps the model focus on the real targets and thereby improves the detection accuracy.

4 Implementation Details

We follow the same anchor box generating method as SSD [21]. Specifically, we use aspect ratio for anchor boxes on feature maps conv3,8,9 and for anchor boxes on feature maps of conv4-7. Each box has a minimum scale and a maximum scale , where the scale is regularly spaced over the feature map layers and is the of next layer. The normalized width and height of an anchor box are calculated by and , where for , otherwise . We use hard negative mining to solve the positive-negative box class imbalance problem as in the original SSD [21]. Also, we employ the same data augmentations and the same loss functions as SSD.

Our model is implemented with Pytorch


and trained on 8 NVIDIA Tesla K80 GPUs. The weights of ResNet101 backbone are pretrained on ImageNet. We use Stochastic Gradient Descent (SGD) algorithm to optimize ASSD weights, with a momentum of 0.9, a decay of 0.0005 and an initial learning rate of 0.001. Following the settings of SSD, DSSD and FSSD, we train and evaluate ASSD on two input resolution images:

and . In particular, we set the mini-batch size to 10 images per GPU for ASSD321 and 8 images per GPU for ASSD512.

5 Experiments

We conduct experiments on two common datasets: PASCAL VOC [4] and COCO [19]. The PASCAL VOC dataset contains 20 object classes for object detection challenge. We evaluate ASSD on the PASCAL VOC 2007/2012 test set. The COCO dataset includes 80 object categories. In this work, we use COCO 2017 dataset, which has the same train, validation and test images as COCO 2014. Hence we have a fair comparison with the state-of-the-art methods. Note that RetinaNet [18] does not have PASCAL VOC detection results. Therefore we only compare the accuracy and speed of RetinaNet on COCO dataset.

5.1 Pascal Voc 2007

We first evaluate our ASSD on PASCAL VOC 2007 test set with a primary goal of comparing the speed and accuracy of ASSD with state-of-the-art methods. The training dataset we use here is a union of 2007 trainval and 2012 trainval. We train ASSD321 for 280 epochs, where the initial learning rate of 0.001 decreases by 0.1 at the 200th epoch and the 250th epoch. For ASSD513, we train for 180 epochs, with a learning rate decay of 0.1 at the 120th and 170th epochs. As shown in Table 

2, with a comparable fast speed, ASSD achieves a large improvement in accuracy compared to SSD, DSSD, and FSSD.

5.2 Ablation study on PASCAL VOC 2007

We perform ablation study to explore the effects of attention unit and semantic fusion on detection accuracy and speed. Here we investigate four models, SSD513, SSD513+fusion, SSD513+att, SSD513+fusion+att, on the PASCAL VOC 2007 test set. It can be observed from Table 3 that the fusion module alone does not show noticeable accuracy improvement. On the contrary, it brings a little more computational overhead. In contrast, attention unit alone leads to a significant performance improvement. When combining the attention unit with the fusion module, we observe further boost of performance. We conjecture that the attention unit may have the ability to analyze the contextual semantics at different levels and select the useful information for guiding a better detection.

5.3 Pascal Voc 2012

We compare the detection accuracy of ASSD with the state-of-the-art methods on the PASCAL VOC 2012 test set. The mAP is evaluated by online PASCAL VOC evaluation server. We present a detailed comparison of average precision (AP) for each class in Table 4. The training dataset contains 2007 trainval+test and 2012 trainval. We follow similar training settings as PASCAL VOC 2007. From Table 4, it can be seen that ASSD513 improves the detection accuracy for most of the classes. The reason would be that the attention unit figures out the pixel-level feature relationships and therefore enhances the model ability to distinguish objects of different classes.

5.4 Coco

We train and validate ASSD on COCO training dataset (118k) and validation dataset (5k). We compare with the state-of-the-art methods on COCO test-dev. The detection performance is evaluated by the online evaluation server. We train ASSD321 for 160 epochs with a learning rate decay of 0.1 at the 100th epoch and the 150th epoch. ASSD513 is trained for 140 epochs, and the learning rate decreases after 80 and 130 epochs. As illustrated in Table 5, ASSD achieves a large improvement over SSD, DSSD and FSSD. Besides, at a similar input resolution, ASSD513 obtains better accuracies than RetinaNet500, especially for AP at different object area thresholds. In particular, when the intersection over union (IoU) is higher than 0.5, ASSD513 has a 2.4% improvement compared to RetinaNet500. Furthermore, from Table 5 it can also be observed that ASSD is more effective at detecting the small, medium and large objects. Note that, with the above superiority in detection accuracy, ASSD513 (6.1FPS K40) still achieves comparable speed as RetinaNet500 (6.8FPS K40).

5.5 Attention Visualization

To better investigate the attention mechanism, we visualize the attention maps of different scales. In particular, we project the attention maps onto the original images. Here we utilize the PASCAL VOC 2007 test set, which contains 20 classes. From Fig. 3, we observe that the attention maps highlight the crucial locations of objects, indicating the feature relations help the model concentrate on useful regions. At shallow layers, the attention map guides the model to focus on small objects; while at deep layers, the attention map highlights objects with large sizes. Moreover, it can also be observed that the attention map suppresses the negative regions, which would be of great help for fast determination of negative anchor boxes.

6 Conclusion

In this paper, we propose an attentive single shot multibox detector, termed ASSD, for more effective object detection. Specifically, ASSD utilizes a fast and light-weight attention unit to help discover feature dependencies and focus the model on useful and relevant regions. ASSD improves the accuracy of SSD by a large margin at a small extra cost of computation. Moreover, ASSD competes favorably with the other state-of-the-art methods. In particular, it achieves better performance than the one-stage detector RetinaNet, while being easier to train without the need to heavily tune the loss parameters.


  • [1] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In EMNLP, 2016.
  • [2] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
  • [3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893. IEEE, 2005.
  • [4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective.

    International Journal of Computer Vision

    , 111(1):98–136, Jan. 2015.
  • [5] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • [6] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, volume 2, page 3, 2017.
  • [7] Ross Girshick. Fast r-cnn. In ICCV, December 2015.
  • [8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361. Springer, 2014.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [11] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, June 2018.
  • [12] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 20(11):1254–1259, 1998.
  • [13] Saumya Jetley, Nicholas A Lord, Namhoon Lee, and Philip HS Torr. Learn to pay attention. In ICLR, 2018.
  • [14] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
  • [15] Jianan Li, Yunchao Wei, Xiaodan Liang, Jian Dong, Tingfa Xu, Jiashi Feng, and Shuicheng Yan. Attentive contexts for object detection. IEEE TMM, 19(5):944–954, 2017.
  • [16] Zuoxin Li and Fuqiang Zhou. Fssd: Feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960, 2017.
  • [17] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 3, 2017.
  • [18] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE TPAMI, 2018.
  • [19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [20] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In ICLR, 2017.
  • [21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016.
  • [22] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In NIPS, pages 2204–2212, 2014.
  • [23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [24] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [25] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In CVPR, pages 6517–6525, 2017.
  • [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [27] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, pages 3626–3633, 2013.
  • [28] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision, pages 1919–1927, 2017.
  • [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [30] Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony S Paek, and In So Kweon. Attentionnet: Aggregating weak directions for accurate object detection. In CVPR, pages 2659–2667, 2015.
  • [31] Xingyu Zeng, Wanli Ouyang, Junjie Yan, Hongsheng Li, Tong Xiao, Kun Wang, Yu Liu, Yucong Zhou, Bin Yang, Zhe Wang, et al. Crafting gbd-net for object detection. IEEE transactions on pattern analysis and machine intelligence, 40(9):2109–2123, 2018.
  • [32] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 4203–4212, 2018.
  • [33] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, and Yi Xu. Scale-transferrable object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 528–537, 2018.