Pedestrian detection has long been an attractive topic in computer vision, with significant impact on both research and industry. It is essential for scene understanding and has a wide range of applications such as video surveillance, robotic automation and intelligent driving-assistance systems. The task is often challenged by pedestrians with large variations in pose, appearance and size, under real-life scenarios with complex backgrounds.
Several works have adapted the generic object detector in terms of anchors and feature strides to accommodate the pedestrian detection problem. Multi-layer Channel Features (MCF) and RPN+BF proposed to concatenate feature representations from multiple layers of a Convolutional Neural Network (CNN) and to replace the downstream classifier of Faster R-CNN with boosted classifiers so as to improve performance on hard samples. Compared to traditional methods, CNN-based methods are equipped with more powerful feature representations; the challenges of pose and appearance variation can be addressed well in most circumstances. However, there is still much room for improvement in detecting pedestrians under large variations in scale.
The visual appearance and the feature representation of large-size and small-size pedestrians are significantly different. For this reason, it is intuitive to use different feature representations for detecting objects of different sizes. It has been claimed that the features which best balance abstraction level and resolution come from different convolutional layers. A Scale-Aware Multi-resolution (SAM) CNN method was proposed which achieves good feature representation by choosing the most suitable feature combination for pedestrians of different sizes from multiple convolutional layers. The limitation of this method is that the way the multi-layer features are combined is hand-designed; hence, only a limited number of heuristic, fixed feature combinations has been explored.
In this paper, we investigate a more advanced approach which can automatically select combinations of multi-layer features for detecting pedestrians of various sizes. A pedestrian proposal network is used to generate pedestrian candidates, after which we propose a gated feature extraction network that adaptively provides discriminative features for pedestrian candidates of different sizes. In the proposed gated multi-layer feature extraction framework, a squeeze unit is applied to reduce the dimension of the Region of Interest (RoI) feature maps pooled from each convolutional layer. It is an essential component of the gated feature extraction network for achieving a good balance between performance and computational and memory complexity. Subsequently, a gate unit is applied to decide whether the features from a convolutional layer are essential for the representation of a given RoI. We investigate two gate models to manipulate and select features from multiple layers, namely a spatial-wise gate model and a channel-wise gate model. We expect the features manipulated by the two gate models to exhibit stronger inter-dependencies among channels or among spatial locations, respectively. Experimental results show that the proposed method achieves state-of-the-art performance on the challenging CityPersons dataset.
2 Proposed Method
2.1 Overview of the Proposed Detection Framework
The proposed gated multi-layer feature extraction network aims to generate discriminative features for pedestrians with a wide range of scale variations through end-to-end learning. An overview of the proposed method is shown in Fig. 1. An input image is first passed through the backbone network for convolutional feature extraction. We employ the VGG16 network as the backbone. The 13 convolutional layers of VGG16 can be regarded as five convolutional blocks, i.e., conv1 to conv5. Features from different layers of a CNN represent different levels of abstraction and meanwhile have different receptive fields, which provide different cues for pedestrian detection. Our gated network takes the features from all five convolutional blocks as input and thereafter selects the most discriminative feature components for pedestrian candidates of different sizes. A Region Proposal Network (RPN) is used to generate candidate pedestrian proposals. Given the candidate proposals, the gated multi-layer feature extraction network manipulates the CNN feature maps from each convolutional block and generates representative features for each region of interest (RoI).
The proposed gated multi-layer feature extraction network realizes an automatic re-weighting of multi-layer convolutional features. Nevertheless, the gated network requires additional convolutional layers, which induce a deeper RoI-wise sub-network at the cost of higher computational complexity and memory occupation. To remedy this issue, our gated sub-network includes a squeeze unit which reduces the dimension of the feature maps.
As illustrated in Fig. 1, feature maps from each convolutional block of the backbone network are first compressed by a squeeze unit; the RoI features pooled from the squeezed lightweight feature maps are then passed through gate units for feature selection, and are finally integrated at the concatenation layer.
2.2 The Squeeze Unit
A squeeze unit is used to reduce the input feature dimension of the RoI-wise sub-network in the proposed gated feature extraction network. Let us denote the input feature maps as $\mathbf{X} \in \mathbb{R}^{W \times H \times C}$, which have spatial size $W \times H$ and $C$ channels. The squeeze unit maps the input feature maps to the lightweight output feature maps $\mathbf{Y} \in \mathbb{R}^{W \times H \times C'}$ with $C' < C$ by applying a 1-by-1 convolution, i.e.,
$$\mathbf{y}_i = \mathbf{w}_i * \mathbf{X}, \quad i = 1, \dots, C',$$
where $\mathbf{w}_i$ is the $i$-th learned filter in the squeeze network and '$*$' denotes convolution.
The squeeze ratio is defined as $r = C / C'$, the ratio between the numbers of input and output channels. In Section 3.2, we will show that a properly selected squeeze ratio reduces the RoI-wise sub-network parameters without noticeable performance degradation.
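As an illustration, a 1-by-1 convolution is just a per-pixel linear map over the channel dimension, so the squeeze unit can be sketched in a few lines of NumPy. The weights, the 512-channel input and the ratio r = 4 below are placeholder assumptions for illustration, not the paper's trained parameters:

```python
import numpy as np

def squeeze_unit(x, c_out, seed=0):
    """Reduce the channel dimension of a feature map via a 1-by-1 convolution.

    A 1-by-1 convolution applies the same linear map over channels at every
    pixel, so it reduces to a matrix product. Filters are random placeholders.
    x: (H, W, C) feature map -> (H, W, c_out).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], c_out)) * 0.01  # C x C' filter bank
    return x @ w

# squeezing 512-channel features with ratio r = 4 keeps C' = 128 channels
feat = np.random.default_rng(1).standard_normal((14, 14, 512))
squeezed = squeeze_unit(feat, 512 // 4)
print(squeezed.shape)  # (14, 14, 128)
```

The spatial resolution is untouched; only the channel count shrinks, which is why the subsequent RoI-wise sub-network becomes cheaper.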
RoI-pooling is performed on the squeezed lightweight feature maps. The pooled features then pass through a gate unit for feature selection. The gate unit manipulates the CNN features to highlight the feature channels or feature components that are most suitable for a particular RoI, while suppressing redundant or unimportant ones.
2.3 The Gate Unit
A gate unit is used to manipulate RoI features pooled from the squeezed lightweight convolutional feature maps. Generally, a gate unit consists of a convolutional layer, two fully connected (fc) layers and a Sigmoid function at the end for output normalization. Given regional feature maps $\mathbf{X}$, the output of a gate unit can be expressed as:
$$\mathbf{g} = \sigma\big(\mathbf{W}_2\,\delta(\mathbf{W}_1\,\mathrm{conv}(\mathbf{X}))\big),$$
where $\sigma$ denotes the Sigmoid function, $\delta$ denotes the ReLU activation [16], and $\mathbf{W}_1$ and $\mathbf{W}_2$ are the learnable parameters of the gate network.
The output of a gate unit is used to manipulate the regional feature maps through an element-wise product:
$$\tilde{\mathbf{X}} = \mathbf{g} \odot \mathbf{X},$$
where $\odot$ denotes the element-wise product.
The manipulated features output from a gate unit have the same size as the input RoI features, and enhance the information that is helpful for identifying the pedestrian within the RoI. We have designed two types of gate units based on how the RoI feature maps are manipulated, namely the spatial-wise selection gate model and the channel-wise selection gate model. The channel-wise selection gate model is able to strengthen the inter-dependencies among feature channels, while the spatial-wise selection gate model enhances the feature capacity in terms of spatial locations.
2.3.1 Spatial-wise selection gate module
The gate unit for spatial-wise selection outputs a 2-dimensional (2D) map that is element-wise multiplied with the RoI feature maps. As shown in Figure 2, a 1-by-1 convolution first collapses the channel dimension, producing a 2D map with the same spatial resolution as the input feature maps. This map is then passed through two fully connected (fc) layers and a Sigmoid function for normalization. The obtained 2D spatial mask is used to modulate the feature representation at every spatial location of the input: the feature values from all feature channels at spatial location (i, j) are modulated by the mask coefficient at (i, j).
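The spatial-wise gate can be sketched as follows: a 1-by-1 convolution collapses the channels, two fc layers and a Sigmoid produce a 2D mask, and the mask rescales every channel at each location. All weights and sizes in this NumPy sketch are illustrative placeholders, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_gate(x, seed=0):
    """Sketch of the spatial-wise selection gate (random placeholder weights).

    x: (H, W, C) pooled RoI features. A 1x1 convolution collapses the channels
    into a 2D map, two fc layers plus a Sigmoid turn it into a spatial mask,
    and the mask rescales every channel at each spatial location.
    """
    h, w, c = x.shape
    rng = np.random.default_rng(seed)
    w_conv = rng.standard_normal(c) * 0.01            # 1x1 conv -> one channel
    m = (x @ w_conv).reshape(-1)                      # flattened (H*W,) map
    fc1 = rng.standard_normal((h * w, h * w)) * 0.01
    fc2 = rng.standard_normal((h * w, h * w)) * 0.01
    mask = sigmoid(np.maximum(m @ fc1, 0.0) @ fc2).reshape(h, w)
    return x * mask[:, :, None]                       # broadcast over channels

roi = np.ones((7, 7, 128))
gated = spatial_gate(roi)
print(gated.shape)  # (7, 7, 128)
```

Note the broadcast in the last line: one mask coefficient per location multiplies all channels at that location, which is exactly the spatial-wise modulation described above.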
2.3.2 Channel-wise selection gate module
The gate model for channel-wise selection generates a vector with one entry per feature channel through depth-wise separable convolution. As shown in Figure 3, this vector is further passed through two fc layers and a Sigmoid function. The obtained gate vector is thereafter used to modulate the convolutional features along the channel dimension: all feature values within the k-th channel are modulated by the k-th coefficient of the gate vector.
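Analogously, the channel-wise gate can be sketched with one spatial filter per channel summarising that channel into a scalar, followed by two fc layers and a Sigmoid producing per-channel coefficients. Again, all weights here are random placeholders for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_gate(x, seed=0):
    """Sketch of the channel-wise selection gate (random placeholder weights).

    x: (H, W, C) pooled RoI features. A depth-wise convolution summarises each
    channel into one value, two fc layers plus a Sigmoid give a C-dimensional
    gate vector, and each channel is rescaled by its coefficient.
    """
    h, w, c = x.shape
    rng = np.random.default_rng(seed)
    dw = rng.standard_normal((h, w, c)) * 0.01   # one spatial filter per channel
    v = (x * dw).sum(axis=(0, 1))                # (C,) channel descriptor
    fc1 = rng.standard_normal((c, c)) * 0.01
    fc2 = rng.standard_normal((c, c)) * 0.01
    g = sigmoid(np.maximum(v @ fc1, 0.0) @ fc2)  # (C,) gate coefficients
    return x * g[None, None, :]                  # rescale each channel

roi = np.ones((7, 7, 128))
gated = channel_gate(roi)
print(gated.shape)  # (7, 7, 128)
```

In contrast to the spatial-wise gate, every spatial location within a channel receives the same coefficient, so the gate reweights whole channels rather than locations.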
3.1 Dataset and Experimental Setups
CityPersons is a recent pedestrian detection dataset built on top of the CityScapes dataset for semantic segmentation. It includes 5,000 images captured in several cities in Germany, containing about 35,000 persons plus around 13,000 ignored regions in total. Both full-body bounding box annotations of all persons and annotations of the visible person parts are provided. We conduct our experiments on CityPersons using the reasonable train/validation sets for training and testing, respectively.
Evaluation metrics: performance is measured by the log-average miss rate (MR) over false positives per image (FPPI) in the range from 10^-2 to 10^0 (denoted MR^-2). We evaluate four subsets with different ranges of pedestrian height and different visibility levels as follows:
1) All: height ∈ [20, ∞) and visibility ∈ [0.20, 1],
2) Small (Sm): height ∈ [50, 75] and visibility ∈ [0.65, 1],
3) Occlusion (Occ): height ∈ [50, ∞) and visibility ∈ [0.20, 0.65],
4) Reasonable (R): height ∈ [50, ∞) and visibility ∈ [0.65, 1].
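The log-average miss rate itself is straightforward to compute. The sketch below follows the common Caltech-style convention of sampling the miss rate at nine FPPI reference points evenly spaced in log space over [1e-2, 1e0]; the exact sampling rule and the toy detector curve are illustrative assumptions:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate over FPPI in [1e-2, 1e0] (Caltech-style MR^-2).

    Samples the miss rate at nine FPPI reference points evenly spaced in log
    space; if no operating point lies below a reference, the first (highest)
    miss rate is used. Returns the geometric mean of the sampled values.
    fppi and miss_rate are parallel arrays sorted by increasing FPPI.
    """
    refs = np.logspace(-2.0, 0.0, num=9)
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # miss rate at the largest FPPI not exceeding the reference point
        sampled.append(miss_rate[idx[-1]] if idx.size else miss_rate[0])
    sampled = np.clip(np.asarray(sampled), 1e-10, None)  # guard log(0)
    return float(np.exp(np.mean(np.log(sampled))))

# toy detector curve: miss rate falls as more false positives are tolerated
fppi = np.array([0.004, 0.02, 0.08, 0.3, 0.7, 2.0])
mr = np.array([0.9, 0.7, 0.4, 0.2, 0.12, 0.08])
print(log_average_miss_rate(fppi, mr))
```

Using a geometric rather than arithmetic mean makes the metric sensitive to improvements across the whole FPPI range, including the strict low-FPPI regime.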
Network training and experimental setup: The loss function contains a classification loss term and a regression loss term as in Faster R-CNN. Stochastic gradient descent with momentum is used for optimization. A single image is processed at a time, and for each image 256 randomly sampled anchors are used to compute the loss of a mini-batch. The momentum is set to 0.9 and weight decay is applied. Weights for the backbone network (i.e.,
convolutional blocks) are initialized from a network pre-trained on the ImageNet dataset. The learning rate is kept fixed for the first 80k iterations and then lowered for the remaining 30k iterations. All experiments are performed on a single TITAN X (Pascal) GPU.
3.2 Effectiveness of Squeeze Ratio
The squeeze ratio affects the network in terms of feature capacity and computational cost. To investigate its effect, we conduct experiments using features from multiple convolutional layers squeezed at different ratios, which reduce the number of parameters in the subsequent RoI-wise sub-networks by factors of up to 8 accordingly. The performances are compared in Table 1. We find that the squeeze unit can reduce the RoI-wise sub-network parameters without noticeable performance degradation. We adopt the reduction ratio that offers a good trade-off between performance and computational complexity.
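To make the parameter saving concrete, the following sketch counts the weights of a hypothetical first fc layer in the RoI-wise sub-network when all five VGG16 blocks are RoI-pooled, concatenated, and squeezed by ratio r. The 7x7 pooling size and the 4096-unit fc layer are assumptions for illustration, not the paper's configuration:

```python
# VGG16 block output channels (conv1..conv5); 7x7 RoI pooling and a
# 4096-unit fc layer are assumed for illustration only.
channels = [64, 128, 256, 512, 512]
pool = 7 * 7
fc_units = 4096

def first_fc_params(squeeze_ratio):
    """Weight count of the first fc layer after squeezing every block."""
    squeezed = sum(c // squeeze_ratio for c in channels)
    return pool * squeezed * fc_units

for r in (1, 2, 4, 8):
    print(r, first_fc_params(r))  # the count shrinks by the squeeze ratio
```

Because the fc input dimension is proportional to the total channel count, squeezing every block by r cuts the first fc layer's parameters by the same factor r.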
3.3 Effectiveness of the Proposed Gate Models
Baseline: we use a modified version of Faster R-CNN as our baseline detector. To generate pedestrian candidates, the region proposal network uses anchors of a single aspect ratio at multiple scales. The baseline detector adopts feature maps from only one convolutional block for feature representation, whose limited resolution restrains its capability of detecting small pedestrians. We therefore dilate the features by a factor of two, which enlarges the receptive field without increasing the filter size.
For our “spatial-wise gate” and “channel-wise gate” models, we use features extracted by the proposed gated multi-layer feature extraction network with the respective gate model. As can be seen from Table 2, both the spatial-wise gate model and the channel-wise gate model improve upon the baseline detector, demonstrating the effectiveness of the proposed gated multi-layer feature extraction. More specifically, the spatial-wise gate model achieves better performance on the “Occlusion” subset, while the channel-wise gate model achieves better performance on the “Small” subset.
| Method | All | Sm | Occ | R |
| --- | --- | --- | --- | --- |
| Adapted FRCNN | - | - | - | 15.40 |
| Repulsion Loss | 44.45 | 42.63 | 56.85 | 13.22 |
We also compare our proposed pedestrian detector with several state-of-the-art pedestrian detectors in Table 2, including Adapted FRCNN, Repulsion Loss, and Occlusion-aware R-CNN (OR-RCNN). Repulsion Loss and OR-RCNN are recent pedestrian detection methods proposed to address the occlusion problem. For fair comparison, all performances are evaluated on the original image size of the CityPersons validation set. We observe that both of the proposed gate models surpass the other approaches under the “All”, “Occlusion” and “Small” settings, with the most notable improvements on the “Occlusion” and “Small” subsets. Our channel-wise gate model exceeds the state-of-the-art method OR-RCNN by a large margin on the “Small” subset, which highlights the effectiveness of our method for small-size pedestrian detection. On the “Occlusion” subset, which includes some severely occluded pedestrians, our spatial-wise gate model achieves the best performance, surpassing the second-best pedestrian detector.
In this paper, we proposed a gated multi-layer convolutional feature extraction network for robust pedestrian detection. Convolutional features from the backbone network are fed into the gated network, which adaptively selects discriminative feature representations for pedestrian candidate regions. A squeeze unit for feature dimension reduction is applied before the gate unit that performs RoI-wise feature manipulation. The two proposed types of gate models manipulate the feature maps in a channel-wise and a spatial-wise manner, respectively. Experiments on the CityPersons pedestrian dataset demonstrate the effectiveness of the proposed method for robust pedestrian detection.
-  (2017) Learning multilayer channel features for pedestrian detection. IEEE Transactions on Image Processing 26 (7), pp. 3210–3220. Cited by: §1.
-  (2015) The Cityscapes dataset. In CVPR Workshop on The Future of Datasets in Vision, Cited by: §3.1.
-  (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297. Cited by: §1.
-  (2005) Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, Vol. 1, pp. 886–893. Cited by: §1.
-  (2009) ImageNet: A Large-Scale Hierarchical Image Database. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, Cited by: §3.1.
-  (2014) Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8), pp. 1532–1545. Cited by: §1.
-  (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645. Cited by: §1.
-  (2006) Extremely randomized trees. Machine learning 63 (1), pp. 3–42. Cited by: §1.
-  (2015) Fast R-CNN. In International Conference on Computer Vision (ICCV), Cited by: §3.1.
-  (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.3.2.
-  (2018-04) Scale-aware Fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia 20 (4), pp. 985–996. Cited by: §1.
-  (2017) Enhanced pedestrian detection using deep learning based semantic image segmentation. In 2017 22nd International Conference on Digital Signal Processing (DSP), pp. 1–5. Cited by: §1.
-  (2018-09) SAM-RCNN: scale-aware multi-resolution multi-channel pedestrian detection. In British Machine Vision Conference (BMVC), Cited by: §1, §1.
-  (2016-10) Fast head-shoulder proposal for deformable part model based pedestrian detection. In 2016 IEEE International Conference on Digital Signal Processing (DSP), pp. 457–461. Cited by: §1.
-  (2017) What can help pedestrian detection?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §2.3.
-  (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7), pp. 971–987. Cited by: §1.
-  (1997) Pedestrian detection using wavelet templates. In CVPR, Vol. 97, pp. 193–199. Cited by: §1.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.2, §3.3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.
-  (2018) Repulsion loss: detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7774–7783. Cited by: §3.3, Table 2.
-  (2016) Is Faster R-CNN doing well for pedestrian detection? CoRR abs/1607.07032. Cited by: §1.
-  (2016) How far are we from solving pedestrian detection? CoRR abs/1602.01237. Cited by: §1.
-  (2017) CityPersons: A diverse dataset for pedestrian detection. CoRR abs/1702.05693. Cited by: §1, §1, §3.1, §3.3, Table 2.
-  (2018-09) Occlusion-aware R-CNN: detecting pedestrians in a crowd. In European Conference on Computer Vision (ECCV), Cited by: §3.3, Table 2.