SpotNet: Self-Attention Multi-Task Network for Object Detection

Humans are very good at directing their visual attention toward relevant areas when they search for different types of objects. For instance, when we search for cars, we will look at the streets, not at the top of buildings. The motivation of this paper is to train a network to do the same via a multi-task learning approach. To train visual attention, we produce foreground/background segmentation labels in a semi-supervised way, using background subtraction or optical flow. Using these labels, we train an object detection model to produce foreground/background segmentation maps as well as bounding boxes while sharing most model parameters. We use those segmentation maps inside the network as a self-attention mechanism to weight the feature map used to produce the bounding boxes, decreasing the signal of non-relevant areas. We show that by using this method, we obtain a significant mAP improvement on two traffic surveillance datasets, with state-of-the-art results on both UA-DETRAC and UAVDT.


page 1

page 4

page 5

page 6


Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation

We address the problem of weakly-supervised semantic segmentation (WSSS)...

SparseTT: Visual Tracking with Sparse Transformers

Transformers have been successfully applied to the visual tracking task ...

Exploring Reciprocal Attention for Salient Object Detection by Cooperative Learning

Typically, objects with the same semantics are not always prominent in i...

Custom Object Detection via Multi-Camera Self-Supervised Learning

This paper proposes MCSSL, a self-supervised learning approach for build...

Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer

Camouflaged object detection intends to discover the concealed objects h...

Learning To Pay Attention To Mistakes

As evidenced in visual results in <cit.><cit.><cit.><cit.><cit.>, the pe...

Realistic Image Generation using Region-phrase Attention

The Generative Adversarial Network (GAN) has recently been applied to ge...

I Introduction

There is increasing interest in automatic road user detection for intelligent transportation systems, advanced driver assistance systems, traffic surveillance, etc. Road user detection has its own set of challenges and difficulties, such as the high speed of some road users, the frequent occlusion between them and the small size of road users appearing afar. Despite huge improvements in the last years thanks to advancements in deep learning-based methods, results still need to be improved for reliable practical applications.

Recently, a new family of object detectors were proposed based on keypoint detection rather than based on bounding box classification [law2018cornernet, duan2019centernet, zhou2019objectsaspoints]. This approach presents several advantages, including not having to manually design anchor boxes and having to process fewer candidate boxes. Detecting objects in this way is deceptively simple and elegant, and quite fast. It yields state-of-the-art accuracy results on several datasets. Therefore, in this work, we build upon CenterNet [zhou2019objectsaspoints]

by designing a novel convolutional neural network (CNN) model that directs its attention towards the areas of interest and thus decreases the probability of having false detections in incongruous areas.

Our contributions are: 1) a self-attention mechanism based on multi-task learning (object detection and segmentation) and 2) a semi-supervised training method that capitalizes on automatic foreground/background segmentation annotations.

The idea of attention and self-attention has been around for some time now, most notably in image captioning 


and natural language processing (NLP) 


. In those works, neural networks are trained to learn which parts of the input are the most important to solve the task. But they do so progressively, using recurrent neural networks. Can a simple CNN learn which areas it should use to increase its visual attention? In this work, we show that it is indeed possible and beneficial for object detection by using a semi-supervised training approach and multi-task learning. The network is trained for both object detection and foreground/background segmentation, the latter being also used to weight object detection feature maps. Indeed, the foreground/background segmentation is used in an internal attention mechanism that gives more weight to areas useful for detection. In figure 


, we can visualize what the network learns from this approach, that is to concentrate the keypoint search on areas where there are indeed road users, and therefore reducing the response of any other neuron. One can see this process as shining a spotlight on relevant areas and dimming the lights everywhere else. Therefore, we named our method, SpotNet.

Fig. 1: A visualisation of the attention map produced by SpotNet on top of its corresponding image, from the UAVDT [du2018unmanned] dataset.

This attention approach is particularly beneficial for keypoint-based methods since we are globally looking for keypoints on the whole image at the same time, and not just classifying the object in a cropped bounding box. However, a question remains. How can we train such a self-attention process? Typically, object detection datasets do not provide the segmentation ground-truth since it is very costly and time-consuming to produce. Instead, we rely on classical computer vision techniques to generate automatic pixel-wise annotation labels and on datasets providing video sequences instead of single frames to train the network. In the case of fixed camera video sequences, we successfully employ a background subtraction method to obtain the automatic annotations, while in the case of moving camera video sequences, we rely on dense optical flow for the same purpose.

Although we use imperfect foreground/background segmentation annotations, we can train a network to produce quality segmentation maps by using multi-task learning. The detection and segmentation tasks are trained jointly by sharing all the parameters of the backbone network. Both tasks are mutually beneficial. Indeed, by producing a better segmentation, the object detection task benefits from a better attention mechanism. And by producing better object detection, the parameters of the backbone network get better at recognizing the features of interest from the images to improve the segmentation maps. We validated our method on two popular traffic scene datasets, and we show that our method is the state-of-art on these datasets by improving significantly the performance of the base network (CenterNet).

Ii Related Work

Object detection as meant in this paper is the task of drawing a rectangular bounding box around objects of interest in an image, as well as producing a class label for each box. All state-of-the-art object detection methods have been based on deep learning since its rise. They can broadly be split up into two main categories, two-stage and one-stage methods.

Two-stage methods divide the task of object detection into two steps, producing a set of object candidates, and then computing a score, a label and a coordinate offset for each box. The first deep learning-based method was R-CNN [rcnn_Girshick_2014_CVPR], which used an external method to produce box candidates, namely selective search [uijlings2013selective]. It then passed each candidate in a CNN to compute features for each box. A classification is done on those features by a SVM afterwards. Fast R-CNN [Girshick_2015_ICCV_fast] aimed to increase the speed of R-CNN by passing the whole image through a CNN once, and afterwards just cropped the relevant parts of the feature map for each box candidate for classification. Faster R-CNN [ren2015faster] is a further improvement that introduced the RPN, a region proposal network that shares most of its parameters with the classification and regression parts, making it even faster and more efficient than its predecessors. RFCN [RFCN_NIPS2016_6465] further builds upon Faster R-CNN by learning to detect and classify parts of objects and then using a grid of parts to vote on each object. Cascade R-CNN [cai2018cascade] addresses the problems of the mismatch between the minimum IOU (Intersection over union) used to evaluate during inference, and the minimum IOU used to select a positive sample during training. They also address overfitting during training by training while progressively increasing the IOU thresholds.

One-stage methods aim to reduce the processing time of two-stage methods by removing the candidate proposal phase and by detecting objects directly from the feature map. The first one-stage method was YOLO [redmon2016you_yolo] which divides the input image into a regular grid and makes each cell predict two bounding boxes. Further iterations of the method, YOLOv2 [Redmon_2017_CVPR_YOLO2] and YOLOv3[redmon2018yolov3] built upon it by using anchor boxes, a better backbone network and several other tweaks. SSD [liu2016ssd] addresses the multi-scale detection problem by combining feature maps at multiple spatial resolutions and then applying anchor boxes to look for objects. RetinaNet [lin2018focal] uses an FPN (Feature pyramid network) [Lin_2017_CVPR_FPN] to produce a multi-scale pyramid of features and applies a set of anchor boxes followed by non-maximal suppression to find objects. CornerNet [law2018cornernet] uses the Hourglass network [newell2016stacked] paired with corner pooling layers to detect a set of top-left corners and bottom-right corners, and combines them with a learned embedding. Keypoint Triplets [duan2019centernet] builds upon CornerNet by improving the corner pooling layers, and by also detecting a center keypoint to validate each object. Objects as Points [zhou2019objectsaspoints] detects an object as a center keypoint and regresses the size of the object to find the bounding box.

Attention mechanisms in object detection have been around for a while. In Geometric Proposal for Faster R-CNN [amin2017geometric]

, the authors re-rank the proposals of the region proposal network depending on a geometric estimation of the scene, outperforming the standard Faster R-CNN by large margin. Their geometric estimation of the scene is mostly based on vehicle scale. The HAT 


method uses a hierarchical attention mechanism that first trains a part-specific attention model. Then a LSTM models the relations between those parts, making it a part-aware detector. FG-BR Net 

[fu2019foreground] uses background subtraction methods to produce a foreground image that is fed as another input to the network. They also introduce a feedback process from the detection outputs to the background subtraction to keep static objects in the foreground image. Compared to these models, our attention mechanism is simple, elegant and fast. Furthermore, we do not need background subtraction at inference, only during the training phase.

Iii Proposed Method

Figure 2 shows a detailed overview of our complete model.

Fig. 2: Overview of SpotNet: the input image first passes through a double-stacked hourglass network; the segmentation head then produces an attention map that multiplies the final feature map of the backbone network; the final center keypoint heatmap is then produced as well as the size and coordinate offset regressions for each object.

Iii-a Base network

Our method is based upon CenterNet [zhou2019objectsaspoints], not to be confused with the homonym method CenterNet, or keypoint triplets [duan2019centernet]. This method trains a backbone network to recognize the center point of objects by assigning the center pixel of a box to be the ground-truth center and gives a reduced loss for other close points. The width and height of the bounding box are regressed, as well as the coordinate offset of the box (to compensate from the error caused by the smaller spatial resolution of the output). The final output is thus a center point heatmap for each possible label, an object size for each point, and an offset for each point, the size and offset being label agnostic.

Iii-B Multi-task Learning

Our main idea is to train a network to perform multiple tasks, to make it better for at least one of the tasks. In our case, we train a network to perform segmentation of objects of interest while performing bounding box detection, thus making the shared parameters more generic and less prone to overfitting. To do this, we add a two-class (foreground/background) segmentation head to the network and train this head with semi-supervised annotations from training datasets (more details in subsection III-D).

The added segmentation head takes as input a feature map that has been reduced by a factor of four in terms of spatial dimension when compared to the input. It consists of three 3 3 convolutions, with upsampling layers in between. The channel dimension is reduced to 1 in the last convolution, thus resulting in a segmentation map that is the same width and height as the input, with a single channel. The loss, , used to train this head is the binary cross-entropy, given by


where is the annotation label for sample , its predicted label by the network and the number of samples. We found out during our experiments that it works better than the initial mean squared error loss that we had tried initially.

Iii-C Self-Attention Mechanism

To further benefit from our learned segmentation map, we implement a simple yet effective self-attention mechanism within the network. Once we obtain our segmentation map, we downsample it by a factor of 4 to reduce it to the spatial dimension of the original feature map. To attenuate the response at locations unlikely to contain an object of interest, we multiply every channel of the feature map with our segmentation map, thus reducing the probability of false positives in irrelevant areas.

Iii-D Semi-Supervised annotations

To train our model to produce foreground/background segmentation maps, we had to produce semi-supervised pixel-wise segmentation annotations. To do that, we took advantage of having access to full video sequences, despite training and evaluating on a single frame at a time. For the fixed camera video sequences, we used the background subtraction method PAWCS [st2015self_pawcs]. Since background subtraction is not designed to work with a moving background, for the moving camera video sequences, we used Farneback optical flow [farneback2003two] followed by some basic image processing and a threshold on motion magnitude. For both automatic two-class segmentation results, we then do an intersection with the ground-truth bounding boxes for each frame to reduce noise and to obtain pixel-wise segmentation annotations only for the object categories to detect. All other object categories, not inside ground-truth training bounding boxes, are therefore labelled as background. This results in fairly good foreground/background segmentation maps, with sometimes squared corners at one or more sides, due to the intersection with bounding boxes, as can be seen in Figure 3. In our experiments, we find that not only are these non-perfect segmentation annotations good enough to train good attention maps, it also allows our segmentation head to produce segmentation maps comparable to good unsupervised foreground/background segmentation methods.

It should be noted that although our method requires videos for training to obtain the semi-supervised segmentation annotations, once trained, it can be applied to single images.

Fig. 3: Example of semi-supervised annotations on UA-DETRAC [Wen2015Tracking] produced by PAWCS [st2015self_pawcs] and the intersection with the ground-truth bounding boxes.

Iii-E Training for multiple tasks

To adapt the training loss of the whole network, we added the binary cross-entropy loss of our segmentation head (equation 1) to the original CenterNet loss. The center point heatmap loss is calculated with the focal loss [lin2018focal], and the losses for the regressions for the offset and width/heigth are formulated as L1 losses as in the original paper [zhou2019objectsaspoints]. The total loss is given by


The total loss is thus the sum of all losses, with the width and height regression having less weight than the others, 0.1 compared to 1.

Iv Experiments

Iv-a Datasets

To validate the effectiveness of our method, we trained and evaluated it against other state-of-the-art methods on two datasets of traffic scenes, namely UA-DETRAC[Wen2015Tracking] and UAVDT[du2018unmanned]. Figure 4 and Figure 5 show example frames of UA-DETRAC and UAVDT respectively, with their ground-truth. These two datasets were captured with very different settings, UA-DETRAC being filmed with a fixed camera for every scene, and UAVDT with a moving camera. Both datasets have pre-determined test sets, and we used a subset of the training data to do the validation.

Fig. 4: Sample from UA-DETRAC with the ground-truth bounding boxes in yellow.
Fig. 5: Sample from UAVDT with the ground-truth bounding boxes in yellow.

Evaluation is done using the Matlab code provided by the authors of both datasets. A strict training and validation protocol was followed and the testing data was never seen by the network before the final evaluation. The metric used for evaluation is the mAP, the mean Average Precision, with a minimum IOU of 0.7 between inferred and ground-truth bounding boxes. The minimum IOU is the minimum overlap of a bounding box with the ground-truth to be considered a true detection. The IOU is computed as the intersection of the boxes divided by the union of the boxes. The mean average precision is the mean of the average precisions for all classes for multiple values of recall, ranging from 0 to 1 with small steps, typically of 0.1.

Iv-B Implementation Details

We used the stacked hourglass network as our backbone because it shows the best performance for keypoint estimation. This network is composed of modules of downsampling and convolutions followed by upsampling and convolutions with skip connections in an encoder-decoder fashion. For our experiments, we use the Hourglass-104 version as in [law2018cornernet]

which stacks two encoder-decoder modules. We implemented the model in PyTorch 0.4.1 using Cuda 10.0. Experiments were run on a workstation with 32 GB of RAM and a NVIDIA GTX 1080Ti GPU.

Iv-C Object detection results

The experimental results are shown in Table I for UA-DETRAC and in Table II for UAVDT. We outperform our baseline, CenterNet, by a very significant margin on both datasets while being the state-of-art results on both datasets as well. The results are very coherent, showing approximately the same percentage of improvement over CenterNet on both datasets, the absolute value on UAVDT being smaller.

For UA-DETRAC, not only do we outperform all previously published results, we do so in every category, showing the benefit of our self-attention mechanism based on multi-task learning. Moreover, the improvements are particularly impressive for the category hard and cloudy, meaning that our model is particularly good for hard examples. It is interesting to note that the improvement for the easy category is very small, due to the mAP values being already very high. Nonetheless, improvement is consistent across all categories. At the moment of writing, our model outperforms every published result on this dataset, including ensemble models from challenges 


The UAVDT dataset is more difficult than UA-DETRAC due to its high density of small vehicles and aerial point of view, but the percentage of improvement remains consistent. Our model also outperforms every published result on this dataset by a very significant margin.

Model Overall Easy Medium Hard Cloudy Night Rainy Sunny
SpotNet (ours) 86.80% 97.58% 92.57% 76.58% 89.38% 89.53% 80.93% 91.42%
CenterNet[duan2019centernet] 83.48% 96.50% 90.15% 71.46% 85.01% 88.82% 77.78% 88.73%
FG-BR_Net [fu2019foreground] 79.96% 93.49% 83.60% 70.78% 87.36% 78.42% 70.50% 89.8%
HAT [wu2019hierarchical] 78.64% 93.44% 83.09% 68.04% 86.27% 78.00% 67.97% 88.78%
GP-FRCNNm [amin2017geometric] 77.96% 92.74% 82.39% 67.22% 83.23% 77.75% 70.17% 86.56%
R-FCN [RFCN_NIPS2016_6465] 69.87% 93.32% 75.67% 54.31% 74.38% 75.09% 56.21% 84.08%
EB [EB_wang2017evolving] 67.96% 89.65% 73.12% 53.64% 72.42% 73.93% 53.40% 83.73%
Faster R-CNN [ren2015faster] 58.45% 82.75% 63.05% 44.25% 66.29% 69.85% 45.16% 62.34%
YOLOv2 [Redmon_2017_CVPR_YOLO2] 57.72% 83.28% 62.25% 42.44% 57.97% 64.53% 47.84% 69.75%
RN-D [perreault2019road] 54.69% 80.98% 59.13% 39.23% 59.88% 54.62% 41.11% 77.53%
3D-DETnet [3D_detnet_li20183d] 53.30% 66.66% 59.26% 43.22% 63.30% 52.90% 44.27% 71.26%
TABLE I: Results on the UA-DETRAC dataset [Wen2015Tracking]. 3D-DETNet results are from [3D_detnet_li20183d], and others results are reported as in the results section of the UA-DETRAC website (Boldface: best result, Italic: indicates our baseline).
Model Overall
SpotNet (Ours) 52.80%
CenterNet[duan2019centernet] 51.18%
Wang et al[wang2019learning] 37.81%
R-FCN [RFCN_NIPS2016_6465] 34.35%
SSD [liu2016ssd] 33.62%
Faster-RCNN [ren2015faster] 22.32%
RON [kong2017ron] 21.59%
TABLE II: Results on the UAVDT [du2018unmanned] dataset (Boldface: best result, Italic: indicates our baseline).

Iv-D Foreground/Background segmentation results

Although that was not our principal objective, it is nonetheless interesting to see how we do on specialised foreground/background benchmarks. To produce results, we used our best model trained on UA-DETRAC and ran it on three sequences of the dataset [goyette2012changedetection] containing only vehicles (because UA-DETRAC includes only annotations for vehicles). To obtain the foreground, we took the attention maps produced by our network, applied a binary threshold and then masked the resulting image with the bounding boxes detected by our network to remove noise. We can see in table III that our method produces competitive results, although we do not quite reach state-of-the-art foreground/background performance. Figure 6 shows qualitative results on a few frames. Our method does not always fit the object boundaries very well. This is expected since the training annotations are imperfect. Nevertheless, we outperform several classical methods, at no additional cost when producing bounding boxes. It is important to note that a limitation of our model is that it must be trained on the objects we want to segment.

Model Average F-Measure
PAWCS [st2015self_pawcs] 0.872
SuBSENSE [st2014subsense] 0.831
SpotNet (Ours) 0.806
SGMM [evangelio2012splitting_sgmm] 0.766
KNN [zivkovic2006efficient_knn] 0.731
GMM [stauffer1999adaptive_gmm] 0.709
TABLE III: Results on the [goyette2012changedetection] dataset. Results are averaged for sequences “highway”, “traffic” and “boulevard” (Boldface: best result).
(a) Input Image
(b) Ground-truth
(c) SpotNet (ours)
(d) PAWCS [st2015self_pawcs]
(e) SGMM [evangelio2012splitting_sgmm]
(f) GMM [stauffer1999adaptive_gmm]
Fig. 6: Example of foreground/background segmentation maps obtained with several segmentation methods. First row: frame 1015 of “highway”, second row: frame 967 of “traffic”, third row: frame 883 of “boulevard”.

V Discussion

V-a Ablation study

To detail the contribution of each part of our model, we conducted an ablation study on UA-DETRAC. Table IV shows that even though multi-task learning helps, the biggest contribution comes from combining our attention process with it. To further understand the contribution of each part, we draw the precision/recall curve (Figure 7) compared to several other methods on UA-DETRAC. On this curve, we can note that the multi-task learning by itself (SpotNet No Attention) helps to be more precise, but does not help to detect more objects, i.e. to reach improved values of recall. On the other hand, the attention mechanism does both, it helps to be even more precise for the same values of recall (fewer false positives), and it also allows the model to detect more and reach significantly higher values of recall.

Attention Multi-Task Overall Easy Medium Hard Cloudy Night Rainy Sunny
86.80% 97.58% 92.57% 76.58% 89.38% 89.53% 80.93% 91.42%
84.57% 96.72% 90.85% 73.16% 86.53% 88.76% 78.84% 90.10%
83.48% 96.50% 90.15% 71.46% 85.01% 88.82% 77.78% 88.73%
TABLE IV: Ablation study on the UA-DETRAC [Wen2015Tracking] dataset.
Fig. 7: Precision/Recall curve of our model compared with a variant and other methods.

Since the network is looking for keypoints on the whole image, it is natural that concentrating the search on learned foreground pixels will increase the probability that the keypoints found belong to the objects of interest, thus reducing the rate of false positives. Furthermore, the experiments show that this increases recall because the network can concentrate on useful information.

It is expected that learning the segmentation task jointly with the object detection task can be mutually beneficial, since both tasks have a large overlap in what needs to be learned. The main difference is that object detection needs to separate instances, while segmentation needs a more precise border around the objects. We show that semi-supervised annotations are good enough for our purpose, and multi-task learning by itself, based on those annotations, improves precision.

V-B Limitations of our Model

One of the limitations of our model is the fact that it needs semi-supervised annotations to be trained properly. However, we believe that in most real-world applications, video sequences are available and we can thus run background subtraction or optical flow to generate them. In other cases, pre-trained semantic segmentation methods could be used to obtain the desired annotations.

Vi Conclusion

In this paper, we presented a novel multi-task model equipped with a self-attention process, and we trained it with semi-supervised annotations. We show that these improvements allow us to reach state-of-the-art performance on two traffic scenes datasets with different settings. We argue that not only does this improve accuracy by a large margin, it also provides instance segmentations of the road users almost at no cost.


We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [RDCPJ 508883 - 17], and the support of Genetec.