Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection

by   Xianzhi Du, et al.
University of Maryland

We propose a deep neural network fusion architecture for fast and robust pedestrian detection. The proposed network fusion architecture allows for parallel processing of multiple networks for speed. A single shot deep convolutional network is trained as a object detector to generate all possible pedestrian candidates of different sizes and occlusions. This network outputs a large variety of pedestrian candidates to cover the majority of ground-truth pedestrians while also introducing a large number of false positives. Next, multiple deep neural networks are used in parallel for further refinement of these pedestrian candidates. We introduce a soft-rejection based network fusion method to fuse the soft metrics from all networks together to generate the final confidence scores. Our method performs better than existing state-of-the-arts, especially when detecting small-size and occluded pedestrians. Furthermore, we propose a method for integrating pixel-wise semantic segmentation network into the network fusion architecture as a reinforcement to the pedestrian detector. The approach outperforms state-of-the-art methods on most protocols on Caltech Pedestrian dataset, with significant boosts on several protocols. It is also faster than all other methods.



There are no comments yet.


page 4

page 10


Multispectral Deep Neural Networks for Pedestrian Detection

Multispectral pedestrian detection is essential for around-the-clock app...

Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation

Multispectral pedestrian detection has attracted increasing attention fr...

An FPGA-Accelerated Design for Deep Learning Pedestrian Detection in Self-Driving Vehicles

With the rise of self-driving vehicles comes the risk of accidents and t...

A Content-Based Late Fusion Approach Applied to Pedestrian Detection

The variety of pedestrians detectors proposed in recent years has encour...

Gated Multi-layer Convolutional Feature Extraction Network for Robust Pedestrian Detection

Pedestrian detection methods have been significantly improved with the d...

Detecting The Objects on The Road Using Modular Lightweight Network

This paper presents a modular lightweight network model for road objects...

End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation

Traditional pedestrian collision warning systems sometimes raise alarms ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Pedestrian detection has applications in various areas such as video surveillance, person identification, image retrieval, and advanced driver assistance systems (ADAS). Real-time accurate detection of pedestrians is a key for adoption of such systems. A pedestrian detection algorithm aims to draw bounding boxes which describe the locations of pedestrians in an image in real-time. However, this is difficult to achieve due to the tradeoff between accuracy and speed

[8]. Whereas a low-resolution input will, in general, result in fast object detection but with poor performance, better object detection can be obtained by using a high-resolution input at the expense of processing speed. Other factors such as crowded scene, non-person occluding objects, or different appearances of pedestrians (different poses or clothing styles) also make this problem challenging.

The general framework of pedestrian detection can be decomposed into region proposal generation, feature extraction, and pedestrian verification

[9]. Classic methods commonly use sliding window based techniques for proposal generation, histograms of gradient orientation (HOG) [6] or scale-invariant feature transform (SIFT) [22]

as features, and support vector machine (SVM)

[2] or Adaptive Boosting [11]

as the pedestrian verification methods. Recently convolutional neural networks have been applied to pedestrian detection. Hosang et al.

[14] use SquaresChnFtrs [1] method to generate pedestrian proposals and train a AlexNet [17] to perform pedestrian verification. Zhang et al. [31] use a Region Proposal Network (RPN) [23] to compute pedestrian candidates and a cascaded Boosted Forest [7]

to perform sample re-weighting to classify the candidates. Li et al.

[18] train multiple Fast R-CNN [12] based networks to detect pedestrians with different scales and combine results from all networks to generate the final results.

Figure 1: The whole pipeline of our proposed work.

We propose a deep neural network fusion architecture to address the pedestrian detection problem, called Fused Deep Neural Network (F-DNN). Compared to previous methods, the proposed system is faster while achieving better detection accuracy. The architecture consists of a pedestrian candidiate generator, which is obtained by training a deep convolutional neural network to have a high detection rate, albeit a large false positive rate. A novel network fusion method called soft-rejection based network fusion is proposed. It employs a classification network, consisting of multiple deep neural network classifiers, to refine the pedestrian candidates. Their soft classification probabilities are fused with the original candidates using the soft-rejection based network fusion method. A parallel semantic segmentation network, using deep dilated convolutions and context aggregation

[30], delivers another soft confidence vote on the pedestrian candidates, which are further fused with the candidate generator and the classification network.

Our work is evaluated on the Caltech Pedestrian dataset [8]. We improve log-average miss rate on the ’Reasonable’ evaluation setting from 9.58% (previous best result [31]) to 8.65% (8.18% with semantic segmentation network). Meanwhile, our speed is times faster ( times faster for ’Reasonable’ test). Our numerical results show that the proposed system is accurate, robust, and efficient.

The rest of this paper is organized as follows. Section 2 provides a detailed description of each step of our method. Section 3 discusses the experiment results and explores the effectiveness of each component of our method. Section 4 draws conclusions and discusses about potential future work.

2 The Fused Deep Neural Network

The proposed network architecture consists of a pedestrian candidate generator, a classification network, and a pixel-wise semantic segmentation network. The pipeline of the proposed network fusion architecture is shown in Figure 1.

For the implementation described in this paper, the candidate generator is a single shot multi-box detector (SSD) [20]. The SSD generates a large pool of candidates with the goal of detecting all true pedestrians, resulting in a large number of false positives. Each pedestrian candidate is associated with its localization box coordinates and a confidence score. By lowering the confidence score threshold above which a detection candidate is accepted, candidates of various sizes and occlusions are generated from the primary detector. The classification network consists of multiple binary classifiers which are run in parallel. We propose a new method for network fusion called soft-rejection based network fusion (SNF). Instead of performing hard binary classification, which either accepts or rejects candidates, the confidence scores of the pedestrian candidates are boosted or discounted based on the aggregated degree of confidence in those candidates from the classifiers. We further propose a method for utilizing the context aggregation dilated convolutional network with semantic segmentation (SS) as another classifier and integrating it into our network fusion architecture. However, due to the large input size and complex network structure, the improved accuracy comes at the expense of a significant loss in speed.

Figure 2: The structure of SSD. 7 output layers are used to generate pedestrian candidates in this work.

2.1 Pedestrian Candidate Generator

We use SSD to generate pedestrian candidates. The SSD is a feed-forward convolutional network which has a truncated VGG16 as the base network. In VGG16 base, pool5 is converted to

with stride one, and fc6 and fc7 are converted to convolutional layers with atrous algorithm

[30]. Additional 8 convolutional layers and a global average pooling layer are added after the base network and the size of each layer decreases progressively. Layers ’conv4_3’, ’fc7’, ’conv6_2’, ’conv7_2’, ’conv8_2’, ’conv9_2’, and ’pool6’ are used as the output layers. Since ’conv4_3’ has a much larger feature scale, an L2 normalization technique is used to scale down the feature magnitudes [21]. After each output layer, bounding box(BB) regression and classification are performed to generate pedestrian candidates. Figure 2 shows the structure of SSD.

For each output layer of size , a set of default BBs at different scales and aspect ratios are placed at each location. convolutional kernels are applied to each location to produce classification scores and BB location offsets with respect to the default BB locations. A default BB is labeled as positive if it has a Jaccard overlap greater than 0.5 with any ground truth BB, otherwise negative (as shown in Equation (1)).


where and represent the area covered by the default BB and the ground true BB, respectively. The training objective is given as Equation (2):


where is the softmax loss and is the Smooth L1 localization loss [12], is the number of positive default boxes, and is a constant weight term to keep a balance between the two losses. For more details about SSD please refer to [20]. Since SSD uses 7 output layers to generate multi-scale BB outputs, it provides a large pool of pedestrian candidates varying in scales and aspect ratios. This is very important to the following work since pedestrian candidates generated here should cover almost all the ground truth pedestrians, even though many false positives are introduced at the same time. Since SSD uses a fully convolutional framework, it is fast.

Figure 3: Left figure shows the pedestrian candidates’ BBs on an image. Right figure shows the SS mask over the BBs. We can visualize several false positives (such as windows and cars) are softly rejected by the SS mask.

2.2 Classification Network and Soft-rejection based DNN Fusion

The classification network consists of multiple binary classification deep neural networks which are trained on the pedestrian candidates from the first stage. All candidates with confidence score greater than and height greater than 40 pixels are collected as the new training data for the classification network. For each candidate, we scale it to a fixed size and directly use positive/negative information collected from Equation (1) for labeling.

After training, verification methods are implemented to generate the final results. Traditional hard binary classification results in hard rejection and will eliminate a pedestrian candidate based on a single negative vote from one classification network. Instead, we introduce the SNF method which works as follows: Consider one pedestrian candidate and one classifier. If the classifier has high confidence about the candidate, we boost its original score from the candidate generator by multiplying with a confidence scaling factor greater than one. Otherwise, we decrease its score by a scaling factor less than one. We define ”confident” as a classification probability of at least . To prevent any classifier from dominating, we set as the lower bound for the scaling factor. Let be the classification probability generated by the classifier for this candidate, the scaling factor is computed as Equation (3).


where and are chosen as and by cross validation. To fuse all classifiers, we multiply the candidate’s original confidence score with the product of the confidence scaling factors from all classifiers in the classification network. This can be expressed as Equation (4).


The key idea behind SNF is that we don’t directly accept or reject any candidates, instead we scale them with factors based on the classification probabilities. This is because a wrong elimination of a true pedestrian (e.g. as in hard-binary classification) cannot be corrected, whereas a low classification probability can be compensated for by larger classification probabilities from other classifiers.

2.3 Pixel-wise semantic segmentation for object detection reinforcement

We utilize an SS network, based on deep dilated convolutions and context aggregation [30], as a parallel classification network. The SS network is trained on the Cityscapes dataset for driving scene segmentation [5]. To perform dense prediction, the SS network consists of a fully convolutional VGG16 network, adapted with dilated convolutions as the front end prediction module, whose output is fed to a multi-scale context aggregation module, consisting of a fully convolutional network whose convolutional layers have increasing dilation factors.

An input image is scaled and directly processed by the SS network, which produces a binary mask with one color showing the activated pixels for the pedestrian class, and the other color showing the background. We consider both the ‘person’ and ‘rider’ categories in Cityscapes dataset as pedestrians, and the remaining classes as background. The SS mask is intersected with all detected BBs from the SSD. We propose a method to fuse the SS mask and the original pedestrian candidates. The degree to which each candidate’s BB overlaps with the pedestrian category in the SS activation mask gives a measure of the confidence of the SS network in the candidate generator’s results. We use the following strategy to fuse the results: If the pedestrian pixels occupy at least of the candidate BB area, we accept the candidate and keep its score unaltered; Otherwise, we apply SNF to scale the original confidence scores. This is summarized in Equation (5):


where represents the area of the BB, represents the area within covered by semantic segmentation mask, , and are chosen as and by cross validation. As we don’t have pixel-level labels for pedestrian detection datasets, we directly implement the SS model [30] trained on the Cityscape dataset [5]. Figure 3 shows an example of this method and how we fuse it into the existing model.

SNF with an SS network is slightly different from SNF with a classification network. The reason is that the SS network can generate new detections which have not been produced by the candidate generator, which is not the case for the classification network. To address this, the proposed SNF method eliminates new detections from the SS network. The idea is illustrated in Figure 4.

Figure 4: Implementing semantic segmentation as a reinforcement to pedestrian detection problem.

3 Experiments and result analysis

3.1 Data and evaluation settings

We evaluate the proposed method on the most popular pedestrian detection dataset: Caltech Pedestrian dataset. The Caltech Pedestrian dataset contains 11 sets (S0-S10), where each set consists of 6 to 13 one-minute long videos collected from a vehicle driving through an urban environment. There are about 250,000 frames with about 350,000 annotated BBs and 2300 unique pedestrians. Each BB is assigned with one of the three labels: ’Person’, ’People’ (large group of individuals), and ’Person?’ (unclear identifications). The original frame size is

. The log-average miss rate (L-AMR) is used as the performance evaluation metric

[8]. L-AMR is computed evenly spaced in log-space in the range to by averaging miss rate at the rate of nine false positives per image (FPPI) [8]. There are multiple evaluation settings defined based on the height and visible part of the BBs. The most popular settings are listed in Table 1.

Setting Description
Reasonable 50+ pixels. Occ. none or partial
All 20+ pixels. Occ. none, partial, or heavy
Far 30- pixels
Medium 30-80 pixels
Near 80+ pixels
Occ. none 0% occluded
Occ. partial 1-35% occluded
Occ. heavy 35-80% occluded
Table 1: Evaluation settings for Caltech Pedestrian dataset.
Method Reasonable All Far Medium Near Occ. none Occ. partial Occ. heavy
SCF+AlexNet [14] 23.32% 70.33% 100% 62.34% 10.16% 19.99% 48.47% 74.65%
SAF R-CNN [18] 9.68% 62.6% 100% 51.8% 0% 7.7% 24.8% 64.3%
MS-CNN [3] 9.95% 60.95% 97.23% 49.13% 2.60% 8.15% 19.24% 59.94%
DeepParts [26] 11.89% 64.78% 100% 56.42% 4.78% 10.64% 19.93% 60.42%
CompACT-Deep [4] 11.75% 64.44% 100% 53.23% 3.99% 9.63% 25.14% 65.78%
RPN+BF [31] 9.58% 64.66% 100% 53.93% 2.26% 7.68% 24.23% 69.91%
F-DNN (Ours) 8.65% 50.55% 77.37% 33.27% 2.96% 7.10% 15.41% 55.13%
F-DNN+SS (Ours) 8.18% 50.29% 77.47% 33.15% 2.82% 6.74% 15.11% 53.76%
Table 2: Detailed breakdown performance comparisons of our models and other state-of-the-art models on the 8 evaluation settings. All numbers are reported in L-AMR.

3.2 Training details and results

To train the SSD candidate generator, all images which contain at least one annotated pedestrian from Caltech training set, ETH dataset [10], and TudBrussels dataset [29] are used. By using both original and flipped images, it provides around 68,000 images. Among all annotations, only ’Person’ and ’People’ categories are included. We further classify ’Person’ into ’Person_full’ and ’Person_occluded’. This results in 109,000 pedestrians in ’Person_full’, 60,000 pedestrians in ’Person_occluded’, and 35,000 pedestrians in ’People’. All the images are in their original size . To set up the SSD model, we place 7 default BBs with aspect ratios [0.1, 0.2, 0.41a, 0.41b, 0.8, 1.6, 3.0] on top of each location of all output feature maps. All default BBs except 0.41b have relative heights [0.05, 0.1, 0.24, 0.38, 0.52, 0.66, 0.80] for the 7 output layers. The heights for 0.41b are [0.1, 0.24, 0.38, 0.52, 0.66, 0.80, 0.94]. Since 0.41 is the average aspect ratio for all annotated pedestrians, we use two default BBs with slightly different heights. The aspect ratio ’1.6’ and ’3.0’ are designed for ’People’. By doing so, we can generate a rich pool of candidates so as not to lose any ground truth pedestrians. We then fine-tune our SSD model from the Microsoft COCO [19]

pre-trained SSD model for 40k iterations using stochastic gradient descent (SGD), with a learning rate of

. All the layers after each output layer are randomly initialized and trained from scratch.

To train the classification network, we use all pedestrian cadidates generated by SSD as well as all ground truth BBs. All the training samples are horizontally flipped with probability . This results in positive samples and negative samples, with a ratio . As the high similarity between consecutive frames sometimes leads to identical training samples, we scale all samples to a fixed size of , and then randomly crop a patch for data augmentation (center cropping for testing). We then fine-tune two models in parallel: ResNet-50 [13] and GoogleNet [24]

, from their ImageNet pre-trained models. All models are trained using SGD with a learning rate of


For the SS network, since there is a lack of well-labeled SS dataset for pedestrian detection, we directly implement the network trained on the Cityscapes dataset with image size . To preserve the aspect ratio, we scale all images to height

, then pad black pixels on both sides.

All the above models are built on Caffe deep learning framework


Evaluation on Caltech Pedestrian: The detailed breakdown performance of our two models (without and with semantic segmentation) on this dataset is shown in Table 2. We compare with all the state-of-the-art methods reported on Caltech Pedestrian website. We can see that both of our models significantly outperform others on almost all evaluation settings. On the ’Reasonable’ setting, our best model achieves L-AMR, which has a relative improvement from the previous best result by RPN+BF. On the ’All’ evaluation setting, we achieve , a relative improvement of from by MS-CNN [3]. The L-AMR VS. FPPI plots for the ’Reasonable’ and ’All’ evaluation settings are shown in Figure 5 and Figure 6. F-DNN refers to fusing the SSD with the classification network, whereas F-DNN+SS refers to fusing the SSD with both the classification network and the SS network. Results from VJ [28] and HOG [6] are plotted as the baselines on this dataset.

3.3 Result analysis

3.3.1 Effectiveness of classification network

We explore how effective the classification network refines the original confidence scores of the pedestrian candidates. As many false positives are introduced from SSD, the main goal of the classification network is to decrease the scores of the false positives. By using ResNet-50 with classification probability as the confidence scaling threshold, scores of the false positives are decreased on the Caltech Pedestrian testing set. This improves the performance on the ’Reasonable’ setting from to . A similar result is obtained by GoogleNet- scores of the false positives are decreased, which improves performance from to . As the two classifiers do not decrease scores of the same set of false positives, by fusing their results with SNF, almost all false positives are covered, and the L-AMR is further improved to . Concerning limits to performance, if we were able to train an oracle classifier with classification accuracy , the L-AMR would be improved to only . This is shown in Table 3.

Method Reasonable
SSD 13.06%
SSD+GoogleNet 9.41%
SSD+ResNet-50 8.97%
SSD+GoogleNet+ResNet-50 (F-DNN) 8.65%
SSD+Oracle classifier 4%
Table 3: Effectiveness of the classification network.

3.3.2 Soft rejection versus hard rejection

SNF plays an important role in our system. Hard rejection is defined as eliminating any candidate which is classified as a false positive by any of the classifiers. The performance of hard-rejection based fusion depends on the performance of all classifiers. A comparison between the two methods is shown in Table 4. We also compare against the case when the SSD is fused with the SS network only, labeled as ’SSD+SS’. For the classification network, a classification probability threshold is used for hard rejection, while for the SS network, an overlap ratio of is used. We can see that hard rejection hurts performance significantly, especially for the classification network. All numbers are reported in L-AMR on the ’Reasonable’ evaluation setting.

Method Hard rejection Soft rejection
SSD+SS 13.4% 11.57%
F-DNN 20.67% 8.65%
F-DNN+SS 22.33%% 8.18%
Table 4: Performance comparisons on Caltech ’Reasonable’ setting between soft rejection and hard rejection. The original L-AMR of SSD alone is

3.3.3 Robustness on challenging scenarios

The proposed method performs much better than all other methods on challenging scenarios such as the small pedestrian scenario, the occluded pedestrian scenario, crowded scenes, and the blurred input image. Figure 7 visualizes the results of the ground truth annotation, our method, and RPN-BF (previous state-of-the-art method). The four rows represent the four challenging scenarios and the four columns represent the BBs from the ground true annotations, the pedestrian candidates generated by SSD alone, our final detection results, and the results from RPN-BF method. By comparing the third column with the second column, we can see that the classification network and the SS network are able to filter out most of the false positives introduced by the SSD detector. By comparing the third column with the last column, we can see our method is more robust and accurate on the challenging scenarios than RPN-BF method.

Figure 5: L-AMR VS. FPPI plot on ’Reasonable’ evaluation setting.
Figure 6: L-AMR VS. FPPI plot on ’All’ evaluation setting.
Figure 7: Detection comparisons on four challenging pedestrian detection scenarios. The four rows represent the small pedestrian scenario, the occluded pedestrian scenario, the crowed scenes, and the blurred input image. The four columns represent the ground true annotations, the pedestrian candidates generated by SSD alone, our final detection results, and the results from the RPN-BF method.

3.3.4 Speed analysis

There are 4 networks in the proposed work. As the SSD uses a fully convolutional framework, its processing time is s per image. For the speed of the classification network, we perform two tests: The first test runs on all pedestrian candidates; The second test runs only on candidates above 40 pixels in height. The second test is targeting only the ’Reasonable’ evaluation setting. Using parallel processing, the speed for the classification network equals its slowest classifier, which is s and s per image for the two tests. The overall processing time of F-DNN is s and s per image, which is to times faster than other methods. Note that the speed for GoogleNet on ’Reasonable’ test is only s per image. If we only fuse the SSD with GoogleNet, we can still achieve the state-of-the-art performance while being s per image. We also test fusing SSD with SqueezeNet [15] only to achieve s per image, which is above 10 frames per second (with a L-AMR around ). As the SS network uses size input with a more complex network structure, it’s processing time is s per image. As it can be processed in parallel with the whole F-DNN pipeline, the overall processing time will be limited to s per image. We use one NVIDIA TITIAN X GPU for all the speed tests. All the classifiers in the classification network are processed in parallel on one single GPU. Table 5 compares the processing speed of our methods and the other methods.

Method Speed on TITAN X
(seconds per image)
CompACT-Deep 0.5
SAF R-CNN 0.59
RPN+BF 0.5
F-DNN 0.3
F-DNN (Reasonable) 0.16
SSD+GoogleNet (Reasonable) 0.11
SSD+SqueezeNet (Reasonable) 0.09
F-DNN+SS 2.48
Table 5: A comparison of speed among the state-of-the-art models.

4 Conclusion and Discussion

We presented an effective and robust pedestrian detection system based on the fusion of multiple well-trained DNNs. We trained an SSD for rich pedestrian candidates generation. Then we designed a soft-rejection based network fusion approach to fuse binary classifiers based on ResNet-50 and GoogleNet to refine the candidates. We also proposed a method of using semantic segmentation to improve the pedestrian detection.

Experiments with different settings on the most popular dataset showed our method works well on pedestrians of different scales and occlusions. We outperformed all previous models, while also being faster than them.

For future work, we want to explore the fusion of more classifiers into our existing model to make it even stronger. To train the classifiers, instead of using binary labeling strategy, we want to explore the label smoothing method introduced in [25]. Moreover, since we showed in this work that semantic segmentation can help solve the object detection problem, we would also like to explore how an object detection algorithm can help solve the instance-aware semantic labeling problem [27].

Acknowledgment Thanks to Arvind Yedla and Marcel Nassar for the helpful discussions.