1 Introduction
Crowd gathering can lead to huge loss of life and property and cause terrible social consequences [1]. Subway stations, railway stations and scenic areas are typical crowd gathering scenes. In Shanghai, China, the daily average passenger volume of the subway reached 10,330,000 in 2018, up more than 11% from a year earlier. Therefore, research on methods to avoid crowd gathering has important application value for public management.
Pedestrian counting plays an essential role in crowd monitoring [2]. In the past years, computer vision and machine learning methods have been widely applied to the field of pedestrian counting. Researchers have proposed various pedestrian counting methods, which can be divided into three categories [3]: detection-based methods [4] [5], regression-based methods [6] [7] and density-based methods [8] [9]. Detection-based methods locate objects accurately but often suffer from scale variance, cluttered backgrounds and occlusions. Regression-based methods can obtain more accurate counts, but it is hard to design a good feature representation for them. Recently, density-based methods, which use a two-dimensional Gaussian kernel to simulate a pedestrian head, have demonstrated strong performance in extremely crowded and large-view-angle scenes such as ShanghaiTech
[9] and WorldExpo [8]. As shown in Figure 1, density-based methods have two main shortcomings in scenes with high perspective distortion. One is the error of Gaussian kernel simulation; the other is the misidentification of cluttered background caused by perspective distortion. In Figure 1, red lines reflect the error of Gaussian kernel simulation. Due to perspective distortion, Gaussian kernels with different sizes and sigma settings are established to simulate the heads in different regions of the scene. This is effective for heads far away from the camera. When a head is close to the camera, however, it is not appropriate due to complicated textures such as long hair, collars and hats. As a result, the neural network cannot learn a good counting result for the region close to the camera. Yellow lines in Figure 1 show that density-based methods misidentify cluttered background as heads, for example clothes' texture, bags' corners and pedestrians' elbow angles. The reason for this phenomenon is that heads of multiple scales are trained together and the network learns an average representation for heads of all sizes. Therefore, using a density-based method in scenes with high perspective distortion cannot obtain an optimal counting result. All the aforementioned methods apply a single method to the entire frame, while He et al. [7] proposed a straight-line double-region pedestrian counting method, which makes the best use of the features of each region to choose an appropriate counting method for it. However, a straight line may cut one head into two parts, which leads to counting errors. Driven by this work, we propose a dynamic region division algorithm to keep the completeness of counting objects. The contributions of this paper are summarized as follows: 1) Utilizing the object bounding boxes obtained by YoloV3 and the expectation division line of the scene, the boundary between the nearby region and the distant one is generated under the premise of keeping the completeness of heads.
2) Appropriate learning models are applied to count pedestrians in each obtained region. In the distant region, a novel inception dilated convolutional neural network is proposed to solve the problem of choosing the dilation rate. In the nearby region, YoloV3 is used to detect pedestrians at multiple scales.

2 Related Work
Counting by density map The density map was first introduced into the pedestrian counting field by Lempitsky et al. [2], who used a 2D Gaussian kernel to model one pedestrian. Then Fiaschi et al. [6] used a random forest to regress the object density and improved training efficiency. With the powerful ability of deep learning, Zhang
et al. [8] first explored deep models for crowd counting and used two 2D Gaussian kernels to model a pedestrian's head and body separately. Huang et al. [10] considered it inaccurate to use a Gaussian kernel to model the pedestrian's body; they applied semantic segmentation to extract the body part instead and achieved more accurate counting results. Other methods, however, abandoned the modeling of the body and only modeled the pedestrian head. Zhang et al. [9] proposed a geometry-adaptive method to generate proper kernels for different head sizes, but it was only suitable for extremely crowded scenes. Besides, the multi-column CNN (MCNN) was introduced to use filters of different sizes to model the density maps corresponding to heads of different scales. Following this work, later methods mainly focused on improving the network structure and adding extra information to the network. Li et al. [11] conducted an experiment showing that the multi-column structure was inferior to a deeper network. Information about the number of people was widely added to networks through various schemes [12] [13]. Zhao et al. [3] embedded perspective information into a deconvolution network. This information raised the counting accuracy, but Gaussian kernels were still used to model heads of larger sizes.

Counting by detection
Traditional methods used the histogram of oriented gradients (HOG) as pedestrian-level features and the support vector machine as the classifier to detect pedestrians in specific scenes [4], but these hand-crafted features severely suffered from light variance and scale variance. Region-based convolutional neural networks (R-CNNs) [14] used features extracted from a CNN and improved detection performance. This method can be summarized as two-stage processing, proposal and classification, but is hard to accelerate. YOLO [5] provided a new one-stage solution for detection and significantly improved the speed. It converted classification into regression over sub-grids and abandoned the proposal process. Following YOLO, some methods, such as SSD [15] and YOLOV3 [16], paid attention to supporting multi-scale object detection. Although detection methods have achieved tremendous performance and can be used in sparse crowd scenes, it is hard for them to substitute for density-based methods in crowded scenes.

3 Dynamic region division algorithm
3.1 Overview
To overcome the error of Gaussian kernel simulation and the misidentification of cluttered background caused by perspective distortion, a novel algorithm framework is proposed. Figure 2 shows the flow chart of the algorithm. Since we find that Gaussian kernels are not suitable for simulating large heads, based on the straight-line double-region pedestrian counting method [7], we propose a dynamic region division algorithm to keep the completeness of counting objects. Utilizing the object bounding boxes obtained by YoloV3 and the expectation division line of the scene, the boundary between the nearby region and the distant one is generated under the premise of retaining whole heads. Then in the nearby region, we apply the YoloV3 detector to detect pedestrians, which avoids identifying cluttered background as heads. In the distant region, we introduce dilated convolution layers into our density-map-based network to enlarge the receptive field, and further design an inception module to address the problem of how to choose the dilation rate. Finally, we fuse the counting results from the two parts and also obtain the total distribution information.
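The fusion step of this pipeline can be sketched as follows. This is a minimal illustration, not the actual implementation: `detect_pedestrians`, `divide_regions` and `estimate_density` are hypothetical stand-ins for the YoloV3 detector, the dynamic region division and the IDCNN density estimator.

```python
def count_frame(frame, detect_pedestrians, divide_regions, estimate_density):
    """Fuse detection-based and density-based counts for one frame.

    A sketch under the assumption that `divide_regions` returns a per-pixel
    mask where 1 marks the distant region and 0 the nearby one.
    """
    boxes = detect_pedestrians(frame)            # bounding boxes (x1, y1, x2, y2)
    mask = divide_regions(frame, boxes)          # 1 = distant, 0 = nearby
    # Nearby count: detections whose box center falls in the nearby region.
    nearby = sum(1 for (x1, y1, x2, y2) in boxes
                 if mask[(y1 + y2) // 2][(x1 + x2) // 2] == 0)
    # Distant count: integrate the predicted density map over the distant region.
    density = estimate_density(frame)
    distant = sum(d for row_m, row_d in zip(mask, density)
                  for m, d in zip(row_m, row_d) if m == 1)
    return nearby + distant
```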
3.2 Dynamic region division
To maximize the advantages of different methods in their corresponding regions, it is important to divide the regions properly. He et al. [7] used a straight line to divide, but this causes the error that one head may be cut into two parts by the line. To avoid this problem, we propose a dynamic region division algorithm. The detailed steps are described below.
(1) For each frame in the surveillance video V = {f_1, f_2, …, f_T} with a resolution of W × H_f, the YoloV3 detector is used to detect the pedestrians. We record the position of each detected pedestrian's center in all frames of the video as

P = {(x_i, y_i) | i = 1, 2, …, M}    (1)
(2) In order to calculate the distribution of the detected pedestrians along the frame height, we count the number of detected pedestrians at each height h (from bottom to top) and record the number as n_h. Then we obtain the probability p_h, which denotes the probability of a pedestrian detection at height h:

p_h = n_h / Σ_{h'} n_{h'}    (2)
(3) Then we calculate the expectation H as the height for the irregular region division later:

H = Σ_h h · p_h    (3)
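Steps (2) and (3) can be sketched in a few lines of Python; this is an illustrative reading of the formulas, with `height_counts` standing in for the per-height detection counts n_h:

```python
def expected_division_height(height_counts):
    """Expectation of the detection-height distribution.

    height_counts[h] = number of detected pedestrian centers at height h.
    Returns H = sum_h h * p_h, where p_h = n_h / sum_h' n_h'.
    """
    total = sum(height_counts)
    probs = [n / total for n in height_counts]        # Eq. (2)
    return sum(h * p for h, p in enumerate(probs))    # Eq. (3)
```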
(4) To avoid cutting one head into two parts, we calculate a dynamic division mask based on the detected pedestrians and the expectation height H. For each detected pedestrian bounding box b_i, we record its top-left corner (x_i^1, y_i^1) and bottom-right corner (x_i^2, y_i^2). We first find the pedestrian bounding boxes whose head part satisfies y_i^1 < H < y_i^1 + α (y_i^2 − y_i^1), which means that these heads would be divided into two parts if a straight line were used. Here α is the proportion of the head to the whole body, and we set α = 0.3 in our experiments. Then we input these boxes into Algorithm 1 to get a mask, where 1 in the mask represents the distant region and 0 denotes the nearby region. We can obtain the regions through this mask. Figure 3 shows an example of dynamic region division. Heads marked with yellow rectangles would be cut into two parts by the straight red line [7], but our dynamic region division can effectively keep the completeness of heads.
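Since Algorithm 1 itself is not reproduced here, the following is a minimal sketch of one plausible mask construction consistent with the description above. It assumes boxes given as (x1, y1, x2, y2) with y growing downwards and `alpha` as the head proportion; the real algorithm may differ in detail.

```python
def division_mask(width, height, boxes, H, alpha=0.3):
    """Dynamic division mask: 1 = distant region, 0 = nearby region."""
    # Start from a straight division at row H.
    mask = [[1 if y < H else 0 for _ in range(width)] for y in range(height)]
    for (x1, y1, x2, y2) in boxes:
        head_bottom = y1 + alpha * (y2 - y1)     # lower edge of the head part
        if y1 < H < head_bottom:                 # head would be cut by row H
            for y in range(int(H), min(int(head_bottom) + 1, height)):
                for x in range(x1, min(x2, width)):
                    mask[y][x] = 1               # extend distant region over the head
    return mask
```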
4 Counting model
4.1 Counting model for distant region
Li et al. [11] proposed CSRNet for crowd counting, which introduced dilated convolution to improve the traditional convolutional neural network. Dilated convolution enlarges the receptive field without increasing the number of parameters or the amount of computation. However, it is hard to choose the dilation rate, and CSRNet prepared four dilation rate configurations and decided on the final rate according to performance. Inspired by Szegedy et al. [17], we propose an inception layer to address the problem of choosing the dilation rate, as shown in Figure 4. The main idea is that instead of needing to pick one of these dilation rates, we can concatenate all the outputs and let the network learn whatever parameters it wants to use. There are three dilated convolution kernels with the same kernel size but different dilation rates in our inception layer. Following [11], we choose 1, 2 and 3 as the dilation rates for the three kernels. We then concatenate these three outputs along the depth channel. The upper part of Figure 4 shows our IDCNN in detail. There are three inception layers in our network, with a max pooling layer behind each of the first two inception dilated modules. We do not remove max pooling because it is harder for a fully convolutional neural network to converge at the original resolution than at a downsampled resolution. In our network, all convolution kernels are of size 3 × 3. After the third inception dilated module, we use a convolution layer with dilation rate 2 and a 1 × 1 convolution to generate the density map.

In the distant region, the head scale is small and it is suitable to use a Gaussian kernel to simulate a head. For an input RGB frame, we aim to output a density map [2]. The ground truth density map is created as:
D_j(p) = (1/N) Σ_{P_i ∈ A_j} G(p; P_i, σ_i²)    (4)

where D_j(p) denotes the density value of pixel p in the density map corresponding to the j-th frame. P_i is the center position of a pedestrian head, while A_j is the collection of all annotated head centers. A normalized 2D Gaussian kernel G is used to model a head with variance σ_i². N is the number of annotated heads, and the whole distribution is normalized by its reciprocal. Besides, to better simulate the head, σ_i is related to the perspective map M. Manual annotation was used to obtain M in [8], but we take advantage of the detector in the nearby region and use linear regression to calculate M from all the detected bounding boxes, and set σ_i according to M.

Table 1. MAE of different methods on the Subway station pedestrian dataset.

Methods | Jinshajiang road | Jing'an temple | South railway station | People square | Xujiahui | Average
He et al. [7] | 2.96 | 1.66 | 2.82 | 3.90 | 2.42 | 2.75
MCNN [9] | 2.17 | 1.44 | 2.35 | 5.35 | 2.13 | 2.69
CSRNet [11] | 2.10 | 1.43 | 2.76 | 5.65 | 2.16 | 2.82
Ours | 2.07 | 1.59 | 2.81 | 3.41 | 1.61 | 2.30
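As an illustration of how such a ground-truth density map can be generated: the sketch below places one normalized 2D Gaussian per annotated head, assuming a single fixed sigma for all heads for brevity, whereas the paper ties the sigma of each head to the perspective map.

```python
import numpy as np

def density_map(shape, head_centers, sigma=4.0):
    """Ground-truth density map: one normalized 2D Gaussian per head center.

    Each Gaussian is normalized to sum to 1, so the map integrates to the
    number of annotated heads.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape)
    for (cx, cy) in head_centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        g /= g.sum()                    # each head contributes exactly 1
        dmap += g
    return dmap
```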
4.2 Counting model for nearby region
In the nearby region, pedestrians have detailed pedestrian-level features. Therefore, it is not accurate to use a Gaussian kernel to simulate a head, and we use a detection-based method to count pedestrians instead. Recently, Redmon et al. [16] proposed the YOLOV3 detection method. It is an incremental improvement of YOLO and supports multi-scale detection, which is suitable for pedestrian detection in our nearby regions. Since we only need to detect pedestrians, we modify the output layer and only reserve three anchor boxes in each grid, each with the structure [p, x, y, w, h], where p is the probability of a detected pedestrian, (x, y) is the center of the pedestrian, and (w, h) are the width and height of the bounding box. People-labeled data in the COCO dataset [18] are used to train this modified detector.

5 Experiment
5.1 Experiment dataset
We evaluate the proposed algorithm through extensive experiments on the publicly available Subway station pedestrian dataset [7]. The dataset covers five typical subway station scenes in Shanghai. These scenes are mainly located at transfer corridors, which have severe perspective distortion and large variance of head scale. Since we use a density-based method in the distant region, it is necessary to annotate each head position in the frames. Therefore, we add annotations of head positions to the original dataset. For each scene, there are 420 frames for training and 180 frames for testing. The average pedestrian count is 28.78.
5.2 Model training
Training Due to our division algorithm, the distant region is dynamic. Therefore, it cannot be the input of the IDCNN directly, and we instead pad the region with zeros to make it a rectangle. Since there are two max pooling layers in the IDCNN, the ground-truth density map is downsampled to 1/4 of the original height and width. To augment the training set for the IDCNN, we flip the frames horizontally to double the training set. Following [7], we use the mean absolute error (MAE) as the evaluation metric, which is defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |z_i − ẑ_i|    (5)

where N is the number of test images, ẑ_i is the estimated number of pedestrians in the i-th frame, and z_i is the actual number of pedestrians in the i-th frame. MAE indicates the accuracy of the estimates.
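The metric above amounts to a one-line computation over the per-frame counts:

```python
def mae(estimated, actual):
    """Mean absolute error between estimated and true per-frame counts."""
    assert len(estimated) == len(actual)
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)
```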
5.3 Results and discussion
To demonstrate the effectiveness of our proposed method, we compare our results with three methods on the subway station pedestrian dataset: one based on traditional machine learning [7] and two deep-learning-based methods [9] [11]. The results of the extensive experiments are reported in Table 1. It can be observed that the proposed dynamic region learning algorithm obtains the lowest average MAE on the test frames. It is notable that during the training and testing of [7], a background mask is used to obtain effective feature representations on the ROI (region of interest). We do not use such a background mask and instead process the whole frame. This demonstrates that dynamic division preserves more complete information than a straight line [7] and that the density-based deep representation is more effective than hand-crafted features in the distant region. At the same time, our method does not achieve the lowest MAE in the Jing'an temple and South railway station scenes. Analyzing these scenes, we found that pedestrians in them are more likely to be occluded by other pedestrians and have larger occlusion areas.
The counting and distribution estimation results are shown in Figure 5. The first column shows the counting curve for the test frames in each scene. The second column shows one sample image and the ground-truth pedestrian count. The third column shows the corresponding estimation result. It can be observed that our estimation curve is close to the ground truth in most cases, which means our method meets the requirements of public management.
6 Conclusion
In this paper, we propose a dynamic region learning algorithm for pedestrian counting in subway surveillance videos. The novel dynamic region division meets the challenge of perspective distortion and avoids cutting heads into two parts. In the nearby region, we retrain a YOLOV3 detector as a substitute for inaccurate Gaussian kernels when modeling large-scale heads. In the distant region, inception modules are used to automatically choose the dilation rate and achieve better performance. The final fused results are more accurate than those of previous methods.
References
 [1] Zimei Liu, Yun Chen, and Kefan Xie, "Research on the impact of crowd flow on crowd risk in large gathering spots," in Industrial Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration (ICIICII), 2016 International Conference on. IEEE, 2016, pp. 368–371.
 [2] Victor Lempitsky and Andrew Zisserman, “Learning to count objects in images,” in Advances in neural information processing systems, 2010, pp. 1324–1332.
 [3] Muming Zhao, Jian Zhang, Fatih Porikli, Chongyang Zhang, and Wenjun Zhang, "Learning a perspective-embedded deconvolution network for crowd counting," in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 403–408.

 [4] Navneet Dalal and Bill Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. IEEE, 2005, vol. 1, pp. 886–893.
 [5] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
 [6] Luca Fiaschi, Ullrich Köthe, Rahul Nair, and Fred A Hamprecht, “Learning to count with regression forest and structured labels,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 2685–2688.

 [7] Gaoqi He, Qi Chen, Dongxu Jiang, Xingjian Lu, and Yubo Yuan, "A double-region learning algorithm for counting the number of pedestrians in subway surveillance videos," Engineering Applications of Artificial Intelligence, vol. 64, pp. 302–314, 2017.
 [8] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang, "Cross-scene crowd counting via deep convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833–841.
 [9] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma, "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 589–597.
 [10] Siyu Huang, Xi Li, Zhongfei Zhang, Fei Wu, Shenghua Gao, Rongrong Ji, and Junwei Han, “Body structure aware deep crowd counting,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1049–1059, 2018.
 [11] Yuhong Li, Xiaofan Zhang, and Deming Chen, “Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1091–1100.
 [12] Deepak Babu Sam, Neeraj N Sajjan, R Venkatesh Babu, and Mukundhan Srinivasan, “Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn,” in Conference on Computer Vision and Pattern Recognition, 2018.
 [13] Vishwanath A Sindagi and Vishal M Patel, "CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting," in Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on. IEEE, 2017, pp. 1–6.
 [14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
 [15] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, "Ssd: Single shot multibox detector," in European conference on computer vision. Springer, 2016, pp. 21–37.
 [16] Joseph Redmon and Ali Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
 [17] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
 [18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, "Microsoft coco: Common objects in context," in European conference on computer vision. Springer, 2014, pp. 740–755.