Gated Multi-layer Convolutional Feature Extraction Network for Robust Pedestrian Detection

10/25/2019 ∙ by Tianrui Liu, et al. ∙ 0

Pedestrian detection methods have been significantly improved with the development of deep convolutional neural networks. Nevertheless, robustly detecting pedestrians with large variations in size and with occlusions remains a challenging problem. In this paper, we propose a gated multi-layer convolutional feature extraction method which can adaptively generate discriminative features for candidate pedestrian regions. The proposed gated feature extraction framework consists of squeeze units, gate units and a concatenation layer, which perform feature dimension squeezing, feature element manipulation and combination of convolutional features from multiple CNN layers, respectively. We propose two different gate models which manipulate the regional feature maps in a channel-wise selection manner and a spatial-wise selection manner, respectively. Experiments on the challenging CityPersons dataset demonstrate the effectiveness of the proposed method, especially in detecting small-size and occluded pedestrians.




1 Introduction

Pedestrian detection has long been an attractive topic in computer vision, with significant impact on both research and industry. It is essential for scene understanding and has a wide range of applications, such as video surveillance, robotics automation and intelligent driving assistance systems. The task is challenging because pedestrians vary widely in pose, appearance and size, and because real-life scenes have complex backgrounds.

Traditional pedestrian detectors [4, 6, 15, 23, 7, 14] exploit various hand-engineered feature representations, such as Haar [18], local binary pattern [17], as well as the Histogram of Oriented Gradient (HOG) feature and its variations. These feature representations are used in conjunction with a classifier, for instance a support vector machine [3] or boosted forests [8], to perform pedestrian detection via classification. Recent advances in deep neural networks have brought significant improvements to pedestrian detection methods [22, 11, 24, 12, 13]. Zhang et al. [24] tailored the well-known Faster R-CNN [19] object detector in terms of anchors and feature strides to accommodate pedestrian detection problems. Multi-layer Channel Features (MCF) [1] and RPN+BF [22] proposed to concatenate feature representations from multiple layers of a Convolutional Neural Network (CNN) and to replace the downstream classifier of Faster R-CNN with boosted classifiers to improve the performance on hard sample detection. Compared to the traditional methods, CNN-based methods are equipped with more powerful feature representations. The challenges of pedestrian detection regarding pose and appearance variations can be addressed well in most circumstances. However, there is still a lot of room for improvement in detecting pedestrians under large variations in scale.

The visual appearance and the feature representation of large-size and small-size pedestrians are significantly different. For this reason, it is intuitive to use different feature representations for detecting objects of different sizes. In [13], it has been claimed that the features that best balance feature abstraction level and resolution come from different convolutional layers. A Scale-Aware Multi-resolution (SAM) CNN-based method [13] was proposed which achieves good feature representation by choosing the most suitable feature combination for pedestrians of different sizes from multiple convolutional layers. The limitation of this method is that the way the multi-layer features are combined is hand-designed. Hence, there are only a limited number of heuristic and fixed feature combinations.

In this paper, we investigate a more advanced approach which can automatically select combinations of multi-layer features for detecting pedestrians of various sizes. A pedestrian proposal network is used to generate pedestrian candidates, and thereafter we propose a gated feature extraction network which can adaptively provide discriminative features for pedestrian candidates of different sizes. In the proposed gated multi-layer feature extraction framework, a squeeze unit is applied to reduce the dimension of the Region of Interest (RoI) feature maps pooled from each convolutional layer. It is an essential component of the gated feature extraction network for achieving a good balance between performance and computational and memory complexity. Next, a gate unit is applied to decide whether the features from a convolutional layer are essential for the representation of this RoI. We investigate two gate models to manipulate and select features from multiple layers, namely, a spatial-wise gate model and a channel-wise gate model. We expect that the features manipulated by the two proposed gate models will have stronger inter-dependencies among spatial locations or among channels, respectively. Experimental results show that the proposed method achieves state-of-the-art performance on the challenging CityPersons dataset [24].

Figure 1: Overview of the proposed gated multi-layer convolutional feature extraction method. Feature maps from each convolutional block of the backbone network are first passed through a squeeze unit for dimension reduction. The squeezed feature maps are then passed through gate units for feature manipulation, and finally integrated using a concatenation layer.

2 Proposed Method

2.1 Overview of the Proposed Detection Framework

The proposed gated multi-layer feature extraction network aims to generate discriminative features for pedestrians with a wide range of scale variations by end-to-end learning. The overview of the proposed method is shown in Fig. 1. An input image is first passed through the backbone network for convolutional feature extraction. We employ the VGG16 network [20] as the backbone network. There are 13 convolutional layers in VGG16, which can be regarded as five convolutional blocks, i.e., Conv1, Conv2, Conv3, Conv4, and Conv5. Features from different layers of a CNN represent different levels of abstraction and have different receptive fields, which can provide different cues for pedestrian detection. Our gated network takes the features from all five convolutional blocks as input and thereafter selects the most discriminative feature components for pedestrian candidates of different sizes. A Region Proposal Network (RPN) is used to generate a number of candidate pedestrian proposals. Given the candidate proposals, the gated multi-layer feature extraction network manipulates the CNN feature maps from each convolutional block and generates representative features for each region of interest (RoI).

The proposed gated multi-layer feature extraction network helps to realize an automatic re-weighting of multi-layer convolutional features. Nevertheless, the gated network requires additional convolutional layers which induce a deeper RoI-wise sub-network at the cost of higher complexity and higher memory occupation. To remedy this issue, our gated sub-network includes a squeeze unit which reduces the dimension of the feature maps.

As illustrated in Fig. 1, feature maps from each convolutional block of the backbone network are first compressed by a squeeze unit, then the RoI features pooled from the squeezed lightweight feature maps are passed through gate units for feature selection, and finally integrated at the concatenation layer.
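The final integration step can be illustrated with a small sketch. It is not the authors' implementation: the helper names `concat_channels` and `constant_feature` and the toy shapes are assumptions made for illustration, and each RoI feature is stored as a plain nested list laid out as channels x height x width.

```python
# Hypothetical sketch: concatenate gated RoI features from several
# convolutional blocks along the channel axis.
# Layout assumption: each feature is a list of channels, each channel
# a list of rows, each row a list of floats (C x H x W).

def concat_channels(features):
    """Concatenate C_i x H x W features into one (sum of C_i) x H x W feature."""
    if not features:
        raise ValueError("need at least one feature map")
    height = len(features[0][0])
    width = len(features[0][0][0])
    merged = []
    for feat in features:
        for channel in feat:
            # All blocks must share the RoI-pooled spatial size.
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError("spatial sizes differ between blocks")
            merged.append([row[:] for row in channel])
    return merged


def constant_feature(channels, height, width, value):
    """Build a toy C x H x W feature filled with a single value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Toy RoI features from three blocks with 2, 3 and 4 channels on a 2 x 2 grid.
roi_features = [constant_feature(c, 2, 2, float(c)) for c in (2, 3, 4)]
combined = concat_channels(roi_features)
combined_shape = (len(combined), len(combined[0]), len(combined[0][0]))
```

Because RoI pooling gives every block the same spatial size, the only quantity that grows with the number of blocks is the channel count, which is what the squeeze unit keeps in check.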

2.2 The Squeeze Unit

A squeeze unit is used to reduce the input feature dimension of the RoI-wise sub-network in the proposed gated feature extraction network. Let us denote the input feature maps as $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{C_{in}}]$, which have spatial size $H \times W$ and $C_{in}$ channels. The squeeze unit maps the input feature maps $\mathbf{V}$ to the lightweight output feature maps $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{C_{out}}]$ with $C_{out} < C_{in}$ by applying a 1 by 1 convolution, i.e.,

$$\mathbf{u}_m = \mathbf{f}_m * \mathbf{V},$$

where $\mathbf{f}_m$ is the $m$-th learned filter in the squeeze network for $m = 1, \ldots, C_{out}$, and '$*$' denotes convolution.

The squeeze ratio is defined as $r = C_{in} / C_{out}$. In Section 3.2, we will show that a properly selected squeeze ratio reduces the number of RoI-wise sub-network parameters without noticeable performance degradation.
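To make the squeeze operation concrete, the following sketch applies a 1 by 1 convolution as a per-pixel linear map over channels. It is a minimal illustration under stated assumptions: the function name `squeeze_1x1`, the omission of a bias term, the toy weights, and the reading of the squeeze ratio as input channels divided by output channels are all assumptions, not details taken from the authors' code.

```python
# Hypothetical sketch of a squeeze unit: a 1 x 1 convolution is a
# per-pixel linear map from c_in input channels to c_out output channels.

def squeeze_1x1(feature, filters):
    """Apply a 1 x 1 convolution.

    feature: C_in x H x W nested list.
    filters: C_out x C_in nested list of weights (bias omitted for brevity).
    Returns a C_out x H x W nested list.
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    output = []
    for weights in filters:
        if len(weights) != c_in:
            raise ValueError("each filter needs one weight per input channel")
        out_channel = [
            [sum(weights[k] * feature[k][i][j] for k in range(c_in))
             for j in range(width)]
            for i in range(height)
        ]
        output.append(out_channel)
    return output


# Toy input: 4 channels on a 2 x 2 grid; channel k is filled with k + 1.
feature = [[[float(k + 1)] * 2 for _ in range(2)] for k in range(4)]
# Two filters, so the (assumed) squeeze ratio is 4 / 2 = 2.
filters = [[1.0, 0.0, 0.0, 0.0],      # copies channel 0
           [0.25, 0.25, 0.25, 0.25]]  # averages all channels
squeezed = squeeze_1x1(feature, filters)
squeeze_ratio = len(feature) / len(squeezed)
```

The spatial size is untouched; only the channel dimension shrinks, which is why the squeeze can be applied once per feature map before any RoI is pooled.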

RoI-pooling [19] is performed on the squeezed lightweight feature maps. The pooled features then pass through a gate unit for feature selection. The gate unit manipulates the CNN features to highlight the most suitable feature channels or feature components for a particular RoI, while suppressing the redundant or unimportant ones.
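As a reminder of what RoI pooling does to one channel of a squeezed feature map, here is a simplified sketch. The function name `roi_max_pool`, the integer cell coordinates and the bin-splitting rule are assumptions chosen for brevity; production implementations also handle the image-to-feature-map scale and sub-cell coordinates.

```python
# Hypothetical sketch of RoI max pooling on a single feature channel.

def roi_max_pool(channel, roi, out_h, out_w):
    """Max-pool a rectangular region into a fixed out_h x out_w grid.

    channel: H x W nested list of floats.
    roi: (top, left, bottom, right) in feature-map cells,
         with bottom and right exclusive.
    """
    top, left, bottom, right = roi
    roi_h, roi_w = bottom - top, right - left
    if roi_h <= 0 or roi_w <= 0:
        raise ValueError("RoI must cover at least one cell")
    pooled = []
    for i in range(out_h):
        # Bin i spans rows [r0, r1); ceiling division keeps every bin non-empty.
        r0 = top + (i * roi_h) // out_h
        r1 = top + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * roi_w) // out_w
            c1 = left + -(-((j + 1) * roi_w) // out_w)
            row.append(max(channel[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4 x 4 channel holding the values 0..15 in row-major order.
channel = [[float(4 * r + c) for c in range(4)] for r in range(4)]
whole = roi_max_pool(channel, (0, 0, 4, 4), 2, 2)
inner = roi_max_pool(channel, (1, 1, 3, 3), 1, 1)
```

Whatever the size of the candidate box, the pooled output has a fixed grid size, so the gate units that follow can use fully connected layers of fixed dimension.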

2.3 The Gate Unit

A gate unit is used to manipulate the RoI features pooled from the squeezed lightweight convolutional feature maps. Generally, a gate unit consists of a convolutional layer, two fully connected (fc) layers and a Sigmoid function at the end for output normalization. Given regional feature maps $\mathbf{R}$, the output of a gate unit can be expressed as:

$$\mathbf{G} = \sigma\left(\mathbf{W}_2 \, \delta\left(\mathbf{W}_1 \, (\mathbf{F} * \mathbf{R})\right)\right),$$

where $\sigma(\cdot)$ denotes the Sigmoid function, $\delta(\cdot)$ denotes the ReLU activation function [16], '$*$' denotes convolution, and $\mathbf{F}$, $\mathbf{W}_1$ and $\mathbf{W}_2$ are the learnable parameters of the gate network.

The output of a gate unit is used to manipulate the regional feature maps through an element-wise product:

$$\tilde{\mathbf{R}} = \mathbf{G} \odot \mathbf{R},$$

where $\odot$ denotes the element-wise product.

The manipulated features $\tilde{\mathbf{R}}$ output by a gate network have the same size as the input RoI features $\mathbf{R}$, and enhance the information that is helpful for identifying the pedestrian within this RoI. We have designed two types of gate units based on how the RoI feature maps are manipulated, namely, the spatial-wise selection gate model and the channel-wise selection gate model. The channel-wise selection gate model is able to increase the inter-dependencies among different feature channels, while the spatial-wise selection gate model enhances the feature capacity in terms of spatial locations.

Figure 2: The spatial-wise selection gate unit. The squeezed RoI-pooled features are transformed to a 2D map via a 1 by 1 convolution. The 2D map is followed by two fully connected layers, which generate another 2D map to be fed into the Sigmoid function.
Figure 3: The channel-wise selection gate unit. The squeezed RoI-pooled features are transformed to a 1D vector via depth-wise separable convolution. This vector is followed by two fully connected layers, which generate another vector to be fed into the Sigmoid function.

2.3.1 Spatial-wise selection gate module

The gate unit for spatial-wise selection outputs a 2-dimensional (2D) map $\mathbf{G}$ of size $h \times w$, which is multiplied element-wise with the RoI feature maps of size $h \times w \times c$. As shown in Figure 2, a 1 by 1 convolution first produces a 2D map with the same spatial resolution as the input feature maps. This 2D map is then passed through two fully connected (fc) layers and a Sigmoid function for normalization. The resulting 2D spatial mask is used to modulate the feature representation at every spatial location of the input features: the feature values from all feature channels at spatial location $(i, j)$ are modulated by the coefficient $\mathbf{G}(i, j)$.
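The spatial-wise gate can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the function name `spatial_gate`, the placement of the ReLU after the first fc layer, the absence of bias terms and the toy identity weights are all choices made here for clarity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mat_vec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def spatial_gate(roi, conv_w, fc1, fc2):
    """Hypothetical spatial-wise gate.

    roi:    C x H x W RoI features.
    conv_w: C weights of the 1 x 1 convolution.
    fc1:    hidden x (H*W) weights; fc2: (H*W) x hidden weights.
    Returns (mask, gated) where mask is H x W and gated is C x H x W.
    """
    channels, height, width = len(roi), len(roi[0]), len(roi[0][0])
    # The 1 x 1 convolution collapses the channels into one H x W map.
    flat = [sum(conv_w[k] * roi[k][i][j] for k in range(channels))
            for i in range(height) for j in range(width)]
    hidden = [max(0.0, h) for h in mat_vec(fc1, flat)]   # fc + ReLU
    coeffs = [sigmoid(z) for z in mat_vec(fc2, hidden)]  # fc + Sigmoid
    mask = [coeffs[i * width:(i + 1) * width] for i in range(height)]
    # Every channel at location (i, j) is scaled by the same coefficient.
    gated = [[[roi[k][i][j] * mask[i][j] for j in range(width)]
              for i in range(height)] for k in range(channels)]
    return mask, gated


# Toy example: 2 channels on a 2 x 2 grid, identity fc weights.
identity4 = [[1.0 if a == b else 0.0 for b in range(4)] for a in range(4)]
roi = [[[1.0, -1.0], [0.0, 2.0]], [[1.0, -1.0], [0.0, 2.0]]]
mask, gated = spatial_gate(roi, [1.0, 1.0], identity4, identity4)
```

In this toy run, the location with the strongest summed response keeps most of its magnitude, while locations with zero or negative response are scaled by 0.5.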

2.3.2 Channel-wise selection gate module

The gate model for channel-wise selection generates a vector of size $1 \times 1 \times c$ through depth-wise separable convolution [10], where $c$ is the number of channels of the RoI feature maps. As shown in Figure 3, this vector is further passed through two fc layers and a Sigmoid function. The resulting vector $\mathbf{G}$ is then used to modulate the convolutional features along the channel dimension: all the feature values within the $k$-th ($k = 1, \ldots, c$) channel are modulated by the $k$-th coefficient of $\mathbf{G}$.
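A comparable sketch for the channel-wise gate is given below. Several simplifications are assumptions made here: the depth-wise step is reduced to one full-size kernel per channel (so that each channel yields a single scalar), the point-wise mixing is left to the fc layers, biases are omitted, and the function name `channel_gate` and the toy weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mat_vec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def channel_gate(roi, depthwise, fc1, fc2):
    """Hypothetical channel-wise gate.

    roi:       C x H x W RoI features.
    depthwise: C kernels of size H x W, one per channel (assumed to
               cover the whole RoI so each channel gives one scalar).
    fc1:       hidden x C weights; fc2: C x hidden weights.
    Returns (coeffs, gated) where coeffs has C entries.
    """
    channels, height, width = len(roi), len(roi[0]), len(roi[0][0])
    # Depth-wise step: each channel is filtered only by its own kernel.
    pooled = [sum(depthwise[k][i][j] * roi[k][i][j]
                  for i in range(height) for j in range(width))
              for k in range(channels)]
    hidden = [max(0.0, h) for h in mat_vec(fc1, pooled)]  # fc + ReLU
    coeffs = [sigmoid(z) for z in mat_vec(fc2, hidden)]   # fc + Sigmoid
    # Every value in channel k is scaled by the k-th coefficient.
    gated = [[[roi[k][i][j] * coeffs[k] for j in range(width)]
              for i in range(height)] for k in range(channels)]
    return coeffs, gated


# Toy example: channel 0 is all -1, channel 1 is all +1, on a 2 x 2 grid.
ones = [[1.0, 1.0], [1.0, 1.0]]
identity2 = [[1.0, 0.0], [0.0, 1.0]]
roi = [[[-1.0, -1.0], [-1.0, -1.0]], [[1.0, 1.0], [1.0, 1.0]]]
coeffs, gated = channel_gate(roi, [ones, ones], identity2, identity2)
```

Compared with the spatial-wise sketch, the mask here has one entry per channel instead of one entry per location, so the two gates differ only in the axis along which the coefficients are broadcast.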

3 Experiments

3.1 Dataset and Experimental Setups

CityPersons [24] is a recent pedestrian detection dataset built on top of the CityScapes dataset [2], which was designed for semantic segmentation. The dataset includes 5,000 images captured in several cities in Germany. In total, there are about 35,000 persons and around 13,000 additional ignored regions. Both bounding box annotations of all persons and annotations of visible person parts are provided. We conduct our experiments on CityPersons using the reasonable train/validation sets for training and testing, respectively.

Evaluation metrics: Performance is measured using the log-average miss rate (MR) over false positives per image (FPPI) ranging from to (MR). We evaluated four subsets with different ranges of pedestrian height () and different visibility levels () as follows:

1) All: and ,

2) Small (Sm): and ,

3) Occlusion (Occ): and ,

4) Reasonable (R): and .
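For readers unfamiliar with the metric, the sketch below computes a log-average miss rate from a miss-rate/FPPI curve. It follows a commonly used convention, which is an assumption here rather than a detail confirmed by this text: nine reference points spaced evenly in log space between 10^-2 and 10^0 FPPI, combined by a geometric mean. The function name and its default arguments are hypothetical.

```python
import math


def log_average_miss_rate(fppi, miss_rate, low_exp=-2.0, high_exp=0.0,
                          num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    fppi:      FPPI values of the curve, in ascending order.
    miss_rate: miss rate at each FPPI value (same length as fppi).
    The range and number of reference points are assumed defaults.
    """
    step = (high_exp - low_exp) / (num_points - 1)
    refs = [10 ** (low_exp + k * step) for k in range(num_points)]
    sampled = []
    for ref in refs:
        # Use the last curve point whose FPPI does not exceed the reference;
        # if the curve starts later, count every pedestrian as missed.
        value = 1.0
        for x, y in zip(fppi, miss_rate):
            if x <= ref:
                value = y
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        sampled.append(max(value, 1e-10))
    return math.exp(sum(math.log(v) for v in sampled) / len(sampled))


# A detector whose miss rate is constant gets exactly that value.
flat_curve = log_average_miss_rate([0.001, 0.1, 1.0], [0.5, 0.5, 0.5])
# A detector that improves from 0.8 to 0.2 once FPPI reaches 0.05.
step_curve = log_average_miss_rate([0.001, 0.05], [0.8, 0.2])
```

Averaging in log space means that a low miss rate at a few operating points pulls the score down more than an arithmetic mean would.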

Network training and experimental setup: The loss function contains a classification loss term and a regression loss term as in Faster-RCNN [9]. Stochastic gradient descent with momentum is used for loss function optimization. A single image is processed at a time, and for each image 256 randomly sampled anchors are used to compute the loss of a mini-batch. The momentum is set as 0.9 and the weight decay is set as . Weights for the backbone network (i.e., convolutional blocks) are initialized from the network pre-trained using the ImageNet dataset [5], while the other convolutional layers are initialized from a Gaussian distribution with mean 0 and standard deviation . We set the learning rate to for the first 80k iterations and for the remaining 30k iterations. All experiments are performed on a single TITANX Pascal GPU.

3.2 Effectiveness of Squeeze Ratio

The squeeze ratio affects the network in terms of feature capacity and computational cost. To investigate its effects, we conduct experiments using features from multiple convolutional layers that have been squeezed by which will reduce the number of parameters in the following RoI-wise sub-networks by a factor of and 8, accordingly. The performances are compared in Table 1. We find that the squeeze network can reduce the RoI-wise sub-network parameters without noticeable performance degradation. We use the reduction ratio which is a good trade-off between performance and computational complexity.

squeeze ratio   All     Small   Occlusion   Reasonable
                43.70   39.65   56.97       14.49
                43.02   42.02   55.60       14.35
                42.93   44.33   56.34       14.63
                44.52   39.98   58.06       14.85
Table 1: Miss rate (MR%) on the CityPersons validation set using different squeeze ratios.
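To give a feel for how a squeeze ratio changes the width of the concatenated RoI feature, the sketch below does the channel bookkeeping for a VGG16 backbone. The block widths (64, 128, 256, 512, 512) are the standard VGG16 values; the candidate ratios, the use of one ratio for all blocks and the helper name `squeezed_channels` are assumptions made for illustration.

```python
# Output channels of the five VGG16 convolutional blocks.
VGG16_BLOCK_CHANNELS = (64, 128, 256, 512, 512)


def squeezed_channels(block_channels, ratio):
    """Channels per block after squeezing every block by the same ratio."""
    if ratio < 1:
        raise ValueError("squeeze ratio must be at least 1")
    return [max(1, c // ratio) for c in block_channels]


# Illustrative ratios only; total width of the concatenated feature per ratio.
concat_width = {
    ratio: sum(squeezed_channels(VGG16_BLOCK_CHANNELS, ratio))
    for ratio in (1, 2, 4, 8)
}
```

The concatenated width shrinks in proportion to the ratio, and so does the input size of any layer in the RoI-wise sub-network that consumes it.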

3.3 Effectiveness of the Proposed Gate Models

Baseline: We use a modified version of Faster-RCNN [19] as our baseline detector. To generate pedestrian candidates, we use anchors of a single ratio of with scales for the region proposal network. The baseline detector only adopts the feature maps for feature representation. The limited feature resolution of restrains the capability for detecting small pedestrians. We dilate the features by a factor of two, which enlarges the receptive field without increasing the filter size.
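The effect of dilation on the receptive field can be checked with a one-dimensional toy example. This is a generic illustration of dilated convolution, not the baseline's actual layer configuration; the function names and the all-ones kernel are assumptions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel whose taps are `dilation` cells apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D cross-correlation with a dilated kernel."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(kernel[t] * signal[start + t * dilation]
            for t in range(len(kernel)))
        for start in range(len(signal) - span + 1)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 1.0, 1.0]  # three weights in both cases
dense = dilated_conv1d(signal, kernel, dilation=1)
dilated = dilated_conv1d(signal, kernel, dilation=2)
```

With a dilation of two, the same three weights span five input positions instead of three, so the receptive field grows while the number of parameters stays fixed.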

For our “spatial-wise gate” and “channel-wise gate” models, we use features extracted by the proposed gated multi-layer feature extraction network with the respective gate model applied. As can be seen from Table 2, both the spatial-wise gate model and the channel-wise gate model improve upon the Baseline detector. These results demonstrate the effectiveness of our proposed gated multi-layer feature extraction. More specifically, the spatial-wise gate model achieves better performance on the “Occlusion” subset, while the channel-wise gate model achieves better performance on the “Small” subset.

Model                All     Sm      Occ     R
FRCNN [Baseline]     44.6    40.46   56.19   16.44
Adapted FRCNN [24]   -       -       -       15.40
Repulsion Loss [21]  44.45   42.63   56.85   13.22
OR-CNN [25]          42.32   42.31   55.68   12.81
Spatial-wise gate    41.72   39.46   52.18   14.01
Channel-wise gate    41.76   37.62   53.53   13.49
Table 2: Comparison of pedestrian detection performance (in terms of MR%) of our proposed gate models with state-of-the-art methods on the CityPersons dataset.

We also compare our proposed pedestrian detector with several state-of-the-art pedestrian detectors in Table 2, including Adapted FRCNN [24], Repulsion Loss [21], and Occlusion-aware R-CNN (OR-CNN) [25]. Repulsion Loss [21] and OR-CNN [25] are recent pedestrian detection methods proposed to address occlusion problems. For a fair comparison, all results are evaluated on the original image size () of the CityPersons validation dataset. We can observe that both proposed gate models surpass the other approaches under the “All”, “Occlusion” and “Small” settings. The most notable improvements are on the “Occlusion” and “Small” subsets. Our channel-wise gate model exceeds the state-of-the-art method OR-CNN [25] by a large margin of 4.69% on the “Small” subset, which highlights the effectiveness of our method for small-size pedestrian detection. On the “Occlusion” subset, which includes some severely occluded pedestrians, our spatial-wise gate model achieves the best performance, with an MR of 52.18%, surpassing the second best pedestrian detector by .
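The comparisons drawn from Table 2 can be reproduced with a few lines of arithmetic. The miss rates below are copied from Table 2; the dictionary layout and the helper name `best_per_subset` are assumptions made for this sketch.

```python
# Miss rates (MR%, lower is better) copied from Table 2.
# None marks entries that are not reported.
TABLE2 = {
    "FRCNN [Baseline]": {"All": 44.6, "Sm": 40.46, "Occ": 56.19, "R": 16.44},
    "Adapted FRCNN": {"All": None, "Sm": None, "Occ": None, "R": 15.40},
    "Repulsion Loss": {"All": 44.45, "Sm": 42.63, "Occ": 56.85, "R": 13.22},
    "OR-CNN": {"All": 42.32, "Sm": 42.31, "Occ": 55.68, "R": 12.81},
    "Spatial-wise gate": {"All": 41.72, "Sm": 39.46, "Occ": 52.18, "R": 14.01},
    "Channel-wise gate": {"All": 41.76, "Sm": 37.62, "Occ": 53.53, "R": 13.49},
}


def best_per_subset(table):
    """Return {subset: (model, miss_rate)} for the lowest reported miss rate."""
    best = {}
    for model, scores in table.items():
        for subset, value in scores.items():
            if value is None:
                continue
            if subset not in best or value < best[subset][1]:
                best[subset] = (model, value)
    return best


best = best_per_subset(TABLE2)
# Gap between OR-CNN and the channel-wise gate on the "Sm" subset,
# rounded to two decimals to hide floating-point noise.
small_margin = round(TABLE2["OR-CNN"]["Sm"] - TABLE2["Channel-wise gate"]["Sm"], 2)
```

The lookup shows the gate models leading on the "All", "Sm" and "Occ" columns, while OR-CNN keeps the lowest miss rate on the "R" column.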

4 Conclusions

In this paper, we proposed a gated multi-layer convolutional feature extraction network for robust pedestrian detection. Convolutional features from the backbone network are fed into the gated network to adaptively select discriminative feature representations for pedestrian candidate regions. A squeeze unit for feature dimension reduction is applied before the gate unit for RoI-wise feature manipulation. The two types of gate models that we proposed manipulate the feature maps in a channel-wise manner and in a spatial-wise manner, respectively. Experiments on the CityPersons pedestrian dataset demonstrate the effectiveness of the proposed method for robust pedestrian detection.


References

  • [1] J. Cao, Y. Pang, and X. Li (2017) Learning multilayer channel features for pedestrian detection. IEEE Transactions on Image Processing 26 (7), pp. 3210–3220.
  • [2] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2015) The cityscapes dataset. In CVPR Workshop on The Future of Datasets in Vision.
  • [3] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297.
  • [4] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, Vol. 1, pp. 886–893. ISBN 0769523722.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on.
  • [6] P. Dollár, R. Appel, S. Belongie, and P. Perona (2014) Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8), pp. 1532–1545.
  • [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32 (9), pp. 1627–1645. ISSN 0162-8828.
  • [8] P. Geurts, D. Ernst, and L. Wehenkel (2006) Extremely randomized trees. Machine Learning 63 (1), pp. 3–42.
  • [9] R. Girshick (2015) Fast R-CNN. In International Conference on Computer Vision (ICCV).
  • [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  • [11] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan (2018-04) Scale-aware Fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia 20 (4), pp. 985–996. ISSN 1520-9210.
  • [12] T. Liu and T. Stathaki (2017-08) Enhanced pedestrian detection using deep learning based semantic image segmentation. In 2017 22nd International Conference on Digital Signal Processing (DSP), pp. 1–5.
  • [13] T. Liu, M. Elmikaty, and T. Stathaki (2018-09) SAM-RCNN: scale-aware multi-resolution multi-channel pedestrian detection. In British Machine Vision Conference (BMVC).
  • [14] T. Liu and T. Stathaki (2016-10) Fast head-shoulder proposal for deformable part model based pedestrian detection. In 2016 IEEE International Conference on Digital Signal Processing (DSP), pp. 457–461.
  • [15] J. Mao, T. Xiao, Y. Jiang, and Z. Cao (2017) What can help pedestrian detection?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [16] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
  • [17] T. Ojala, M. Pietikainen, and T. Maenpaa (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7), pp. 971–987.
  • [18] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio (1997) Pedestrian detection using wavelet templates. In CVPR, Vol. 97, pp. 193–199.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [20] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [21] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen (2018) Repulsion loss: detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7774–7783.
  • [22] L. Zhang, L. Lin, X. Liang, and K. He (2016) Is Faster R-CNN doing well for pedestrian detection?. CoRR abs/1607.07032.
  • [23] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele (2016) How far are we from solving pedestrian detection?. CoRR abs/1602.01237.
  • [24] S. Zhang, R. Benenson, and B. Schiele (2017) CityPersons: A diverse dataset for pedestrian detection. CoRR abs/1702.05693.
  • [25] S. Zhang (2018-09) Occlusion-aware R-CNN: detecting pedestrians in a crowd. In European Conference on Computer Vision (ECCV).