
Focal Loss Dense Detector for Vehicle Surveillance

by   Xiaoliang Wang, et al.

Deep learning has been widely recognized as a promising approach in many computer vision applications. In particular, one-stage and two-stage object detectors are regarded as the two most important groups of Convolutional Neural Network based object detection methods. One-stage detectors usually outperform two-stage detectors in speed; however, they normally trail in detection accuracy. In this study, the focal loss based RetinaNet, a one-stage object detector, is used for vehicle detection so as to match the speed of regular one-stage detectors while surpassing two-stage detectors in accuracy. State-of-the-art performance is shown on the DETRAC vehicle dataset.


I Introduction

Traffic surveillance systems are broadly used to monitor traffic conditions, and vehicle detection plays a significant role in many vision-based traffic surveillance applications. Vehicles need to be located in videos or images of the traffic scene. Further processing, such as vehicle tracking and vehicle counting, can then build on the obtained vehicle locations. The extracted bounding box, which contains the vehicle, can also be used for other purposes, such as vehicle type and model recognition. However, a fair number of challenges remain in the development of vehicle detection technology. One of them is occlusion, which hinders accurate vehicle detection. In real traffic surveillance scenarios, detection performance can also be affected by varying illumination and weather conditions. Vehicles and other objects cast shadows, which easily give rise to false positives. Vehicles differ in shape, size and color, and different poses change a vehicle's appearance, which makes vehicle detection even more challenging. Previously, different feature extraction techniques have been employed for vehicle detection, relying on the rigid characteristics of vehicles [1] and unique part-based features [2]. Recently, the Convolutional Neural Network (CNN) has proved to be a promising approach for extracting features from Regions of Interest in images. Many CNN based methods have been proposed for vehicle detection and classification [3, 4, 5]. Nonetheless, superior detection accuracy and low processing latency can hardly be achieved at the same time.

In this study, we deploy a focal loss based Convolutional Neural Network object detection method, RetinaNet [6], to undertake the vehicle detection task on the DETRAC [7] dataset. Our experimental results show that RetinaNet can be adapted to perform faster and more accurate vehicle detection than previous methods.

II Overview of Focal Loss Dense Object Detector

II-A Evolution of Object Detection Methods

A group of classic object detection methods has developed over time. First came the sliding-window approach, in which a classifier is applied on a dense image grid. Representative early work includes [8, 9], which use Convolutional Neural Networks for handwritten digit recognition. Boosted object detectors for face detection were explored in [10], which led to wide adoption of such methods in the related area. Integral channel features [11] and HOG [12] led to breakthroughs in pedestrian detection. DPMs [13] made dense detectors applicable to general object detection and consistently achieved remarkable results on PASCAL [14]. However, with the revival of deep learning based methods for computer vision [15], the traditional sliding-window approach was replaced by two-stage detectors, which have dominated object detection recently.

For two-stage object detectors, Selective Search [16] is an early work in which the first stage generates a sparse set of proposals that may contain objects and the second stage classifies each proposal as foreground or background. R-CNN [17] uses a Convolutional Neural Network for the second-stage classification and achieves even higher detection accuracy. However, R-CNN passes each object proposal in an image through the CNN independently for feature extraction, which leads to high latency. To accelerate this, SPPnet [18] and Fast R-CNN [19] pass the entire input image through the CNN only once. In Faster R-CNN [20], the first stage is a CNN, the Region Proposal Network (RPN), that generates Region of Interest (ROI) proposals; the second stage is a CNN that both refines the region proposals and classifies the objects. The critical part is that the RPN shares the full-image convolutional features with the detection network. Many improvements to the Faster R-CNN framework have been proposed [21, 22, 23, 24, 25].

For one-stage object detectors, OverFeat [26] was one of the pioneering works based on deep networks. More recently, SSD [27, 28] and YOLO [29, 30] have become the typical one-stage methods. Huang et al. [31] discuss and analyze the accuracy and speed trade-offs among different CNN based object detectors. According to their analysis, two-stage object detectors are normally more accurate than one-stage object detectors, whereas one-stage object detectors are faster. Only recently has a one-stage detector, RetinaNet [6], achieved accuracy comparable to two-stage detectors while maintaining fast speed.

Most one-stage detectors face the problem of class imbalance. The detectors evaluate a huge number of candidate locations, only a few of which contain objects of interest. The easy negatives, which carry little useful information, make training inefficient; moreover, because they are so numerous, they can dominate the loss and produce degenerate models. Many studies [32, 10, 13, 33, 28] employ hard negative mining to gain more information from hard samples during training. More complicated sampling or reweighting schemes are explored in [34]. The focal loss, introduced in the next part, is proposed to address the class imbalance issue.

II-B Focal Loss Dense Object Detector

The standard Cross Entropy (CE) loss for binary classification is shown below:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$

In the above equation, $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$. As analyzed in [6], this regular Cross Entropy loss can easily be dominated by the sample imbalance between the foreground and background classes, which leads to instability when training one-stage object detectors. The focal loss function is proposed to solve this issue.

The Focal Loss can be defined as below.

$$\mathrm{FL}(p, y) = \begin{cases} -\alpha (1 - p)^{\gamma} \log(p) & \text{if } y = 1 \\ -(1 - \alpha)\, p^{\gamma} \log(1 - p) & \text{otherwise} \end{cases}$$

A weighting factor $\alpha \in [0, 1]$ is incorporated for class 1 and $1 - \alpha$ for class -1. As in the Cross Entropy (CE) loss, $p$ represents the estimated probability for class 1. The parameter $\gamma$ controls the rate at which easy examples are down-weighted. With the default initialization, a binary classifier outputs y = −1 or 1 with equal probability. In that case, because of class imbalance, the loss from the dominant class contributes most of the total loss and can cause training to fail to converge. To further prevent this instability, a 'prior' variable $\pi$ is introduced, through which the value of $p$ estimated by the model for the rare foreground class is set low, such as 0.01, at the beginning of training. This initialization helps the system avoid diverging in training.
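The down-weighting effect can be checked numerically. The following is a minimal sketch, assuming the alpha-balanced form of the focal loss from [6], in which a positive sample is weighted by α(1 − p)^γ and a negative sample by (1 − α)p^γ. The function names and the two sample probabilities are illustrative and do not come from the paper's implementation.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy. p: estimated probability of class 1; y: +1 or -1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss (assumed form, see the lead-in)."""
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


# Two background (y = -1) locations: an easy one (p = 0.01) and a hard one (p = 0.5).
for p in (0.01, 0.5):
    ce = cross_entropy(p, -1)
    fl = focal_loss(p, -1)
    print(f"p={p}: CE={ce:.6f}  FL={fl:.6f}  FL/CE={fl / ce:.6f}")
```

For a negative sample the ratio FL/CE equals (1 − α)p^γ, so with γ = 2 and α = 0.25 the easy negative keeps only 0.75 × 0.01² of its cross entropy loss, while the hard negative keeps 0.75 × 0.5².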

γ α mAP
0 0.75 69.63
0.1 0.75 69.92
0.2 0.75 70.28
0.5 0.5 71.24
1.0 0.25 71.85
2.0 0.25 72.38
5.0 0.25 70.87
TABLE I: Varying γ and α for Focal Loss
(a) Foreground
(b) Background
Fig. 1: Cumulative distribution functions of the normalized loss for positive and negative samples for different values of γ for a converged model.
method category FPS mAP car van bus others
SSD(ResNet-50) one-stage 22.74 65.36 79.64 66.34 86.43 29.04
Faster R-CNN two-stage 12.21 71.93 88.57 76.42 90.09 32.63
RetinaNet(ResNet-50) one-stage 23.37 72.38 89.01 76.87 90.49 33.15
SSD(ResNet-101) one-stage 20.54 66.88 81.06 67.76 88.15 30.56
Faster R-CNN two-stage 10.18 73.27 89.92 77.82 91.44 33.91
RetinaNet(ResNet-101) one-stage 21.23 73.79 90.43 78.35 91.86 34.52
TABLE II: Accuracy and Speed Results on DETRAC dataset.
(a) Daytime
(b) Nighttime
(c) Sunny Weather
(d) Rainy Weather
Fig. 2: RetinaNet based Vehicle Detection Result on DETRAC Dataset

III Experiment

The dataset we use for the experiments is the DETRAC [7] vehicle dataset. It includes video taken in both daytime and nighttime and covers different weather conditions, such as sunny, cloudy and rainy. Four vehicle categories are defined in the dataset: car, bus, van and others (trucks and vehicles with trailers fall into the others group). The algorithm we use is the RetinaNet proposed in [6]. RetinaNet matches the speed of previous one-stage detectors while surpassing the accuracy of existing state-of-the-art two-stage detectors on the MSCOCO dataset [35]. The RetinaNet architecture uses a Feature Pyramid Network (FPN) [22] backbone on top of a feedforward ResNet architecture [36] to generate a rich, multi-scale convolutional feature pyramid. The base ResNet models are pre-trained on ImageNet. For the final convolutional layer of the classification subnet, we set the bias initialization to $b = -\log((1 - \pi)/\pi)$, where $\pi$ is the prior described in Section II-B, in our experiments. As explained previously, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value early in training.
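The effect of the prior on the initial loss can be illustrated with a short calculation. This is a sketch under two assumptions: the bias follows the initialization described in [6], b = −log((1 − π)/π), and π = 0.01 is used as the example value mentioned in Section II-B. The helper names are hypothetical.

```python
import math


def prior_bias(pi):
    """Bias b chosen so that sigmoid(b) == pi, i.e. b = -log((1 - pi) / pi)."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


pi = 0.01  # example prior value (assumption, see the lead-in)
b = prior_bias(pi)

# If the final-layer weights start near zero, every anchor's initial logit is
# close to b, so its initial foreground probability is close to pi.
p0 = sigmoid(b)

# Cross entropy of one background anchor at initialization,
# with the prior and with the default 0.5 starting probability.
ce_with_prior = -math.log(1.0 - p0)
ce_default = -math.log(1.0 - 0.5)
print(f"b={b:.4f}  p0={p0:.4f}  CE with prior={ce_with_prior:.4f}  CE default={ce_default:.4f}")
```

Because background anchors vastly outnumber foreground ones, lowering the per-anchor background loss at initialization in this way keeps the summed loss of the first iterations small.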

The code was implemented with MXNet [37] and run on a server equipped with two Intel 10-core Xeon E5-2630 CPUs and an NVIDIA Tesla K80 GPU. In our experiments, RetinaNet is trained with stochastic gradient descent (SGD). Unless otherwise specified, all models are trained for 110k iterations with an initial learning rate of 0.01, which is divided by 10 at 70k and again at 90k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. A weight decay of 0.0001 and a momentum of 0.9 are used.
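The learning rate schedule described above can be written as a small step function. This is a framework-independent sketch, not the MXNet code used in the experiments; the function name is hypothetical, and applying each drop exactly at the stated iteration (rather than one step later) is an assumption.

```python
def learning_rate(iteration, base_lr=0.01, steps=(70_000, 90_000), factor=0.1):
    """Step schedule: start at base_lr and multiply by `factor` at each step boundary."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= factor
    return lr


# Learning rate at a few points of the 110k-iteration run.
for it in (0, 69_999, 70_000, 90_000, 109_999):
    print(f"iteration {it}: lr = {learning_rate(it):.6f}")
```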

Focal loss is used as the loss on the output of the classification subnet in RetinaNet. Heuristically, we find that the setting $\gamma = 2.0$, $\alpha = 0.25$ is robust for RetinaNet and works best in our experiments, as shown in Table I.

To explore the effect of the focal loss further, we analyze the distribution of the loss for a converged model. For this analysis, we select RetinaNet with the ResNet-101 architecture trained with $\gamma = 2$ (which obtained 73.79 mAP). We apply this model to a large number of randomly selected test images and record the predicted probabilities for the negative and positive samples. We compute the focal loss for the negative and positive samples separately and normalize the loss of each group so that it sums to one. The normalized losses are then sorted from low to high to obtain Cumulative Distribution Functions (CDFs). The CDFs for positive and negative samples under different settings of $\gamma$ are shown in Figure 1. According to the foreground results in Figure 1(a), the setting of $\gamma$ has only a minor effect on the CDF. Around 18% of the hardest positive samples account for roughly half of the positive loss. As $\gamma$ increases, more of the loss is concentrated in the top 18% of examples, but the effect is small. However, as shown in Figure 1(b), the setting of $\gamma$ affects negative samples significantly. For small $\gamma$, the positive and negative CDFs look fairly similar. But as $\gamma$ becomes larger, considerably more weight is placed on the hard negative samples. With $\gamma = 2$ (our training setting), the vast majority of the loss comes from a small portion of the examples. This supports the claim that focal loss attenuates the impact of easy negatives and focuses attention on the hard negative samples.
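The normalization and sorting procedure behind these CDFs can be sketched in a few lines. The predicted probabilities below are hypothetical toy values, not measurements from the trained model, and the helper names are illustrative. The constant (1 − α) factor of the negative-sample loss is omitted because normalizing the group to one removes it.

```python
import math


def negative_focal_loss(p, gamma):
    """Focal loss of a background sample with predicted foreground probability p."""
    return -(p ** gamma) * math.log(1.0 - p)


def loss_cdf(losses):
    """Sort losses from low to high, normalize their sum to one, and accumulate."""
    ordered = sorted(losses)
    total = sum(ordered)
    cdf, running = [], 0.0
    for value in ordered:
        running += value / total
        cdf.append(running)
    return cdf


def top_fraction_share(losses, fraction):
    """Share of the total loss carried by the hardest `fraction` of the samples."""
    ordered = sorted(losses, reverse=True)
    k = max(1, round(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical toy data: 990 easy negatives (p = 0.01) and 10 hard ones (p = 0.9).
probs = [0.01] * 990 + [0.9] * 10
shares = {}
for gamma in (0.0, 0.5, 2.0):
    losses = [negative_focal_loss(p, gamma) for p in probs]
    shares[gamma] = top_fraction_share(losses, 0.01)
    print(f"gamma={gamma}: hardest 1% of negatives carry {shares[gamma]:.4f} of the loss")

# Full CDF for gamma = 2; its last value is 1 by construction.
cdf = loss_cdf([negative_focal_loss(p, 2.0) for p in probs])
```

On this toy data the share of the loss carried by the hardest negatives grows with γ, which is the same qualitative trend the negative-sample CDFs describe.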

To compare with previous work, we evaluated three methods: SSD, Faster R-CNN and RetinaNet. As introduced previously, SSD and RetinaNet are one-stage object detectors, while Faster R-CNN is a two-stage object detector. We chose either 50 or 101 for the ResNet depth. The accuracy and speed results are shown in Table II. The focal loss based RetinaNet achieves higher mAP than the representative two-stage detector, Faster R-CNN (72.38 vs. 71.93 and 73.79 vs. 73.27). In addition, RetinaNet runs at roughly twice the inference speed of the two-stage detector (23.37 vs. 12.21 FPS and 21.23 vs. 10.18 FPS). Figure 2 depicts RetinaNet detection results on the DETRAC dataset under different illumination and weather conditions. Figure 2(a) and Figure 2(b) show detection results at different times of day, which reflect different lighting conditions: daytime and nighttime. Figure 2(c) and Figure 2(d) show detection results under different weather conditions: sunny and rainy. RetinaNet with focal loss performs well in all of these environments.
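The speed and accuracy gaps in Table II can be tabulated directly. The sketch below copies the FPS and mAP values from the table. Pairing each Faster R-CNN row with the backbone of the rows around it is an assumption, since the table does not label those rows, and the helper name is hypothetical.

```python
# (FPS, mAP) copied from Table II. The backbone assigned to each Faster R-CNN
# row is inferred from row order (assumption; the table leaves it unlabeled).
table_ii = {
    "ResNet-50": {
        "SSD": (22.74, 65.36),
        "Faster R-CNN": (12.21, 71.93),
        "RetinaNet": (23.37, 72.38),
    },
    "ResNet-101": {
        "SSD": (20.54, 66.88),
        "Faster R-CNN": (10.18, 73.27),
        "RetinaNet": (21.23, 73.79),
    },
}


def compare(rows, ours="RetinaNet", baseline="Faster R-CNN"):
    """Return (FPS ratio, mAP difference) of `ours` relative to `baseline`."""
    fps_ours, map_ours = rows[ours]
    fps_base, map_base = rows[baseline]
    return fps_ours / fps_base, map_ours - map_base


for backbone, rows in table_ii.items():
    speedup, gain = compare(rows)
    print(f"{backbone}: {speedup:.2f}x FPS, {gain:+.2f} mAP vs. Faster R-CNN")
```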

IV Conclusions

In this study, we categorized recent Convolutional Neural Network based object detectors into two groups: one-stage and two-stage object detectors. As shown by the experimental results, RetinaNet, a one-stage object detector, achieves state-of-the-art performance for vehicle detection compared with two-stage object detectors. The incorporated focal loss function, which addresses the critical class imbalance issue of ordinary one-stage object detectors, gives rise to the performance boost. More vehicle-specific patterns may be explored to further improve the one-stage object detector.