CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes
We propose a network for Congested Scene Recognition called CSRNet to provide a data-driven, deep learning method that can understand highly congested scenes, perform accurate count estimation, and present high-quality density maps. The proposed CSRNet is composed of two major components: a convolutional neural network (CNN) as the front-end for 2D feature extraction and a dilated CNN as the back-end, which uses dilated kernels to deliver larger receptive fields and to replace pooling operations. CSRNet is easy to train because of its pure convolutional structure. To the best of our knowledge, CSRNet is the first implementation using dilated CNNs for crowd counting tasks. We demonstrate CSRNet on four datasets (the ShanghaiTech dataset, the UCF_CC_50 dataset, the WorldExpo'10 dataset, and the UCSD dataset) and deliver state-of-the-art performance on all of them. On the ShanghaiTech Part_B dataset, CSRNet achieves an MAE that is 47.3% lower than the previous state-of-the-art method. We extend the application to counting other objects, such as vehicles in the TRANCOS dataset. Results show that CSRNet significantly improves the output quality, with 15.4% lower MAE than the previous state-of-the-art approach.
A growing number of network models have been developed [1, 2, 3, 4, 5] to deliver promising solutions for crowd flow monitoring, assembly control, and other security services. Methods for congested scene analysis have developed from simple crowd counting (which outputs the number of people in the targeted image) to density map presentation (which displays characteristics of the crowd distribution). This development follows the demand of real-life applications: the same number of people can have completely different crowd distributions (as shown in Fig. 1), so counting the number of people alone is not enough. The distribution map helps us obtain more accurate and comprehensive information, which can be critical for making correct decisions in high-risk situations such as stampedes and riots. However, it is challenging to generate accurate distribution patterns. One major difficulty comes from the prediction manner: since the density values are predicted pixel by pixel, output density maps must be spatially coherent so that they present a smooth transition between neighboring pixels. In addition, diversified scenes, e.g., irregular crowd clusters and different camera perspectives, make the task difficult, especially for traditional methods that do not use deep neural networks (DNNs). Recent work on congested scene analysis relies on DNN-based methods because of the high accuracy they have achieved in semantic segmentation tasks [7, 8, 9, 10, 11] and the significant progress they have made in visual saliency. An additional benefit of using DNNs comes from the enthusiastic hardware community, where DNNs are rapidly investigated and implemented on GPUs, FPGAs [14, 15, 16], and ASICs. Among them, the low-power, small-size schemes are especially suitable for deploying congested scene analysis in surveillance devices.
Previous works on congested scene analysis are mostly based on multi-scale architectures [4, 5, 18, 19, 20]. They have achieved high performance in this field, but their designs introduce two significant disadvantages as networks go deeper: a large amount of training time and an ineffective branch structure (e.g., the multi-column CNN (MCNN)). We design an experiment, reported in Table 1, to demonstrate that the MCNN does not perform better than a deeper, regular network. The main reason for using MCNN is the flexible receptive fields provided by convolutional filters of different sizes across the columns. Intuitively, each column of MCNN is dedicated to a certain level of congestion. However, the effectiveness of MCNN may not be prominent. We present Fig. 2 to illustrate the features learned by the three separate columns (representing large, medium, and small receptive fields) in MCNN and evaluate them on the ShanghaiTech Part_A dataset. The three curves in this figure share very similar patterns (estimated error rate) for 50 test cases with different congestion densities, meaning that each column in such a branch structure learns nearly identical features. This runs against the original intention of the MCNN design, which is to learn different features in each column.
In this paper, we design a deeper network called CSRNet for counting crowds and generating high-quality density maps. Unlike the latest works such as [4, 5], which use the deep CNN as an ancillary component, we focus on designing a CNN-based density map generator. Our model uses pure convolutional layers as the backbone to support input images with flexible resolutions. To limit the network complexity, we use small convolution filters (like 3 × 3) in all layers. We deploy the first 10 layers from VGG-16 as the front-end and dilated convolution layers as the back-end to enlarge receptive fields and extract deeper features without losing resolution (since pooling layers are not used in the back-end). By taking advantage of this structure, we outperform the state-of-the-art crowd counting solution (an MCNN-based solution called CP-CNN) with 7%, 47.3%, 10.0%, and 2.9% lower Mean Absolute Error (MAE) on the ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, and WorldExpo’10 datasets respectively. We also achieve high performance on the UCSD dataset with an MAE of 1.16. After extending this work to vehicle counting on the TRANCOS dataset, we achieve 15.4% lower MAE than the current best approach, called FCN-HA.
The rest of the paper is structured as follows. Sec. 2 presents the previous works for crowd counting and density map generation. Sec. 3 introduces the architecture and configuration of our model while Sec. 4 presents the experimental results on several datasets. In Sec. 5, we conclude the paper.
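|Method||Parameters||MAE||MSE|
|Col. 1 of MCNN||57.75k||141.2||206.8|
|Col. 2 of MCNN||45.99k||160.5||239.0|
|Col. 3 of MCNN||25.14k||153.7||230.2|
|A deeper CNN||83.84k||93.0||142.2|
Table 1: Comparison of the three MCNN columns with a deeper, single-column CNN. In the deeper network, each convolutional layer is followed by a ReLU layer, and the network also uses max-pooling layers. Results show that the single-column version achieves higher performance on the ShanghaiTech Part_A dataset, with the lowest MAE and Mean Squared Error (MSE).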
Following the idea proposed by Loy et al., the potential solutions for crowd scene analysis can be classified into three categories: detection-based methods, regression-based methods, and density estimation-based methods. By incorporating deep learning, CNN-based solutions show even stronger ability in this task and outperform the traditional methods.
Most early research focuses on detection-based approaches that use a moving-window-like detector to detect people and count them. These methods require well-trained classifiers to extract low-level features from the whole human body (like Haar wavelets and HOG (histogram of oriented gradients)). However, they perform poorly on highly congested scenes, since most of the targeted objects are obscured. To tackle this problem, researchers detect particular body parts instead of the whole body to complete the crowd scene analysis.
Since detection-based approaches cannot be adapted to highly congested scenes, researchers deploy regression-based approaches to learn the relations among features extracted from cropped image patches and then calculate the number of particular objects. More features, such as foreground and texture features, have been used for generating low-level information. Following similar approaches, Idrees et al. propose a model that extracts features by employing Fourier analysis and SIFT (scale-invariant feature transform) interest-point based counting.
Regression-based solutions overlook one critical feature, called saliency, which causes inaccurate results in local regions. Lempitsky et al. propose a method to solve this problem by learning a linear mapping between the features in a local region and its object density map. It integrates the saliency information during the learning process. Since the ideal linear mapping is hard to obtain, Pham et al. use random forest regression to learn a non-linear mapping instead of the linear one.
The literature also focuses on CNN-based approaches to predict the density map because of their success in classification and recognition [34, 21, 35]. Walach and Wolf demonstrate a method with selective sampling and layered boosting. Instead of using patch-based training, Shang et al. try an end-to-end regression method using CNNs, which takes the entire image as input and directly outputs the final crowd count. Boominathan et al. present the first work that purely uses convolutional networks and a dual-column architecture for generating density maps. Marsden et al. explore single-column fully convolutional networks, while Sindagi et al. propose a CNN that uses high-level prior information to boost the density prediction performance. An improved structure is proposed by Zhang et al., who introduce a multi-column based architecture (MCNN) for crowd counting. A similar idea is shown by Onoro and Sastre, who present a scale-aware, multi-column counting model called Hydra CNN for object density estimation. It is clear that the CNN-based solutions outperform the previous works mentioned in Sec. 2.1 to 2.3.
Most recently, Sam et al. propose the Switch-CNN, which uses a density level classifier to choose different regressors for particular input patches. Sindagi et al. present a Contextual Pyramid CNN, which uses CNNs to estimate context at various levels for achieving lower count error and better quality density maps. These two solutions achieve state-of-the-art performance, and both use a multi-column based architecture (MCNN) and a density level classifier. However, we observe several disadvantages in these approaches: (1) Multi-column CNNs are hard to train with the training method described in the MCNN work. Such a bloated network structure requires more time to train. (2) Multi-column CNNs introduce redundant structure, as we mentioned in Sec. 1. Different columns seem to perform similarly without obvious differences. (3) Both solutions require a density level classifier before sending pictures into the MCNN. However, the granularity of the density level is hard to define in real-time congested scene analysis, since the number of objects keeps changing over a large range. Also, using a fine-grained classifier means that more columns need to be implemented, which makes the design more complicated and causes more redundancy. (4) These works spend a large portion of their parameters on density level classification to label the input regions instead of allocating parameters to the final density map generation. Since the branch structure in MCNN is not efficient, the lack of parameters for generating the density map lowers the final accuracy. Taking all of the above disadvantages into consideration, we propose a novel approach that concentrates on encoding deeper features in congested scenes and generating high-quality density maps.
The fundamental idea of the proposed design is to deploy a deeper CNN for capturing high-level features with larger receptive fields and generating high-quality density maps without brutally expanding network complexity. In this section, we first introduce the architecture we proposed, and then we present the corresponding training methods.
We choose VGG-16 as the front-end of CSRNet because of its strong transfer learning ability and its flexible architecture, which makes it easy to concatenate the back-end for density map generation. In CrowdNet, the authors directly carve the first 13 layers from VGG-16 and add a convolutional layer as the output layer. The absence of modifications results in very weak performance. Other architectures use VGG-16 as the density level classifier for labeling input images before sending them to the most suitable column of the MCNN, while CP-CNN incorporates the result of classification with the features from the density map generator. In these cases, VGG-16 acts as an ancillary component without significantly boosting the final accuracy. In this paper, we first remove the classification part of VGG-16 (the fully-connected layers) and build the proposed CSRNet with the convolutional layers of VGG-16. The output size of this front-end network is 1/8 of the original input size. If we continued to stack more convolutional layers and pooling layers (the basic components of VGG-16), the output size would shrink further, and it would be hard to generate high-quality density maps. Inspired by the works [10, 11, 40], we deploy dilated convolutional layers as the back-end to extract deeper saliency information while maintaining the output resolution.
One of the critical components of our design is the dilated convolutional layer. A 2-D dilated convolution can be defined as follows:

$$y(m, n) = \sum_{i=1}^{M} \sum_{j=1}^{N} x(m + r \times i,\; n + r \times j)\, w(i, j)$$

$y(m, n)$ is the output of the dilated convolution from input $x(m, n)$ and a filter $w(i, j)$ with length $M$ and width $N$. The parameter $r$ is the dilation rate. If $r = 1$, a dilated convolution turns into a normal convolution.
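As a rough illustration of the definition above, the following NumPy sketch sums input samples spaced `r` pixels apart, weighted by the filter. The function name `dilated_conv2d`, the zero-based indexing, and the "valid" border handling (no padding) are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def dilated_conv2d(x, w, r=1):
    """Valid-mode 2-D dilated convolution (correlation form):
    y[m, n] = sum_{i, j} x[m + r*i, n + r*j] * w[i, j], zero-based indices."""
    M, N = w.shape
    # Spatial extent covered by the dilated kernel along each axis.
    span_h = (M - 1) * r + 1
    span_w = (N - 1) * r + 1
    out_h = x.shape[0] - span_h + 1
    out_w = x.shape[1] - span_w + 1
    y = np.zeros((out_h, out_w), dtype=float)
    for i in range(M):
        for j in range(N):
            # Shifted window of the input sampled at offset (r*i, r*j).
            y += w[i, j] * x[r * i : r * i + out_h, r * j : r * j + out_w]
    return y

x = np.arange(49, dtype=float).reshape(7, 7)
w = np.ones((3, 3))
y1 = dilated_conv2d(x, w, r=1)  # normal convolution, output shape (5, 5)
y2 = dilated_conv2d(x, w, r=2)  # dilation rate 2, output shape (3, 3)
```

With `r=1` the loop reduces to an ordinary 3 × 3 filter; with `r=2` the same nine weights cover a 5 × 5 window of the input.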
Dilated convolutional layers have been shown to significantly improve accuracy in segmentation tasks [10, 11, 40], and they are a good alternative to pooling layers. Although pooling layers (e.g., max and average pooling) are widely used for maintaining invariance and controlling overfitting, they also dramatically reduce the spatial resolution, meaning that the spatial information of the feature map is lost. Deconvolutional layers [41, 42] can alleviate the loss of information, but the additional complexity and execution latency may not be suitable for all cases. Dilated convolution is a better choice: it uses sparse kernels (as shown in Fig. 3) as an alternative to the combination of pooling and convolutional layers. This characteristic enlarges the receptive field without increasing the number of parameters or the amount of computation (by contrast, adding more convolutional layers enlarges the receptive field but introduces more operations). In dilated convolution, a small-size kernel with a $k \times k$ filter is enlarged to $k + (k - 1)(r - 1)$ with dilated stride $r$. Thus it allows flexible aggregation of multi-scale contextual information while keeping the same resolution. Examples can be found in Fig. 3, where the normal convolution gets a $3 \times 3$ receptive field and the two dilated convolutions deliver $5 \times 5$ and $7 \times 7$ receptive fields respectively.
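The receptive field sizes quoted for Fig. 3 can be checked with a few lines of arithmetic. This sketch assumes the standard relation for a dilated kernel, k + (k − 1)(r − 1), where k is the kernel size and r the dilation rate; the helper name `effective_kernel_size` is hypothetical.

```python
def effective_kernel_size(k, r):
    """Side length covered by a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

# A 3 x 3 kernel with dilation rates 1, 2, and 3.
sizes = [effective_kernel_size(3, r) for r in (1, 2, 3)]
print(sizes)  # [3, 5, 7]
```

The number of weights stays at nine in all three cases; only the spacing between the sampled input positions changes.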
For maintaining the resolution of the feature map, dilated convolution shows distinct advantages over the scheme of using pooling, convolution, and deconvolution. We pick one example for illustration in Fig. 4. The input is an image of crowds, and it is processed by two approaches separately to generate outputs of the same size. In the first approach, the input is downsampled by a max-pooling layer with factor 2 and then passed to a convolutional layer with a 3 × 3 Sobel kernel. Since the generated feature map is only 1/2 the size of the original input, it needs to be upsampled by the deconvolutional layer (bilinear interpolation). In the other approach, we use dilated convolution and adapt the same 3 × 3 Sobel kernel to a dilated kernel with a stride factor of 2. The output has the same dimensions as the input (meaning that pooling and deconvolutional layers are not required). Most importantly, the output from the dilated convolution contains more detailed information (see the zoomed-in portions).
We propose four network configurations of CSRNet in Table 3, which have the same front-end structure but different dilation rates in the back-end. For the front-end, we adapt a VGG-16 network (without the fully-connected layers) and only use 3 × 3 kernels. Using more convolutional layers with small kernels is more efficient than using fewer layers with larger kernels when targeting the same receptive field size.
After removing the fully-connected layers, we determine the number of layers to use from VGG-16. The most critical part is the tradeoff between accuracy and resource overhead (including training time, memory consumption, and the number of parameters). Experiments show that the best tradeoff is achieved when keeping the first ten layers of VGG-16 with only three pooling layers instead of five, which suppresses the detrimental effect of the pooling operation on output accuracy. Since the output (density map) of CSRNet is smaller (1/8 of the input size), we use bilinear interpolation with a factor of 8 for scaling, so that the output has the same resolution as the input image. At the same size, the results generated by CSRNet can be compared with the ground truth using the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity in Image).
In this section, we provide specific details of CSRNet training. By taking advantage of the regular CNN network (without branch structures), CSRNet is easy to implement and fast to deploy.
Following a previously proposed method of generating density maps, we use geometry-adaptive kernels to tackle highly congested scenes. By blurring each head annotation using a Gaussian kernel (which is normalized to 1), we generate the ground truth considering the spatial distribution of all images from each dataset. The geometry-adaptive kernel is defined as:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) \times G_{\sigma_i}(x), \quad \text{with } \sigma_i = \beta \bar{d}_i$$

For each targeted object $x_i$ in the ground truth $\delta$, we use $\bar{d}_i$ to indicate the average distance to its $k$ nearest neighbors. To generate the density map, we convolve $\delta(x - x_i)$ with a Gaussian kernel with parameter $\sigma_i$ (standard deviation), where $x$ is the position of a pixel in the image. In our experiments, we follow the referenced configuration, where $\beta = 0.3$ and $k = 3$. For inputs with sparse crowds, we adapt the Gaussian kernel to the average head size to blur all the annotations. The setups for different datasets are shown in Table 2.
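A minimal sketch of this ground-truth generation is given below. It assumes that each head contributes a Gaussian whose standard deviation is `beta` times the mean distance to its `k` nearest annotated neighbors, and that each Gaussian is renormalized on the image grid so that it sums to 1. The function name `density_map`, the default values, the `min_sigma` floor, and the fallback for a single annotation are all assumptions of this sketch.

```python
import numpy as np

def density_map(points, shape, beta=0.3, k=3, min_sigma=1.0):
    """points: iterable of (row, col) head positions; shape: (H, W)."""
    H, W = shape
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    rows, cols = np.mgrid[0:H, 0:W]
    dmap = np.zeros(shape, dtype=float)
    for i, (pr, pc) in enumerate(pts):
        # Distances from this head to all annotations, sorted ascending.
        d = np.sort(np.sqrt(((pts - pts[i]) ** 2).sum(axis=1)))
        d = d[1 : k + 1]  # drop the zero self-distance, keep k neighbors
        # Assumption: fall back to min_sigma when there are no neighbors.
        sigma = max(beta * d.mean(), min_sigma) if d.size else min_sigma
        g = np.exp(-((rows - pr) ** 2 + (cols - pc) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()  # each head adds a total mass of 1
    return dmap

heads = [(10, 10), (10, 20), (30, 15), (25, 40)]
dm = density_map(heads, (48, 64))
print(round(dm.sum(), 6))  # integrates to the number of annotated heads
```

Because every kernel is normalized, summing the map recovers the annotated count, which is what makes the density map usable for counting.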
We crop 9 patches from each image at different locations with 1/4 size of the original image. The first four patches contain four quarters of the image without overlapping while the other five patches are randomly cropped from the input image. After that, we mirror the patches so that we double the training set.
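The cropping scheme can be sketched as follows. This example interprets "1/4 size" as half the height and half the width (a quarter of the area), which matches the four non-overlapping quarters; the function name `crop_patches` and the use of NumPy's random generator are assumptions. In practice the same crops and flips would also be applied to the ground-truth density map.

```python
import numpy as np

def crop_patches(img, rng, n_random=5):
    """Return 4 quarter patches + n_random random patches, plus their mirrors."""
    H, W = img.shape[:2]
    ph, pw = H // 2, W // 2  # patch size: a quarter of the image area
    # Top-left corners of the four non-overlapping quarters.
    origins = [(0, 0), (0, pw), (ph, 0), (ph, pw)]
    # Randomly placed patches of the same size.
    for _ in range(n_random):
        top = int(rng.integers(0, H - ph + 1))
        left = int(rng.integers(0, W - pw + 1))
        origins.append((top, left))
    patches = [img[t : t + ph, l : l + pw] for t, l in origins]
    # Horizontal mirroring doubles the training set.
    mirrored = [p[:, ::-1] for p in patches]
    return patches + mirrored

rng = np.random.default_rng(0)
image = np.arange(8 * 12).reshape(8, 12)
training_patches = crop_patches(image, rng)
print(len(training_patches))  # 18
```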
We use a straightforward way to train CSRNet as an end-to-end structure. The first 10 convolutional layers are fine-tuned from a well-trained VGG-16. For the other layers, the initial values come from a Gaussian initialization with a standard deviation of 0.01. Stochastic gradient descent (SGD) is applied with a fixed learning rate of 1e-6 during training. We choose the Euclidean distance to measure the difference between the ground truth and the estimated density map, which is similar to other works [19, 18, 4]. The loss function is given as follows:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| Z(X_i; \Theta) - Z_i^{GT} \right\|_2^2$$

where $N$ is the size of the training batch and $Z(X_i; \Theta)$ is the output generated by CSRNet with parameters $\Theta$. $X_i$ represents the input image, while $Z_i^{GT}$ is the ground truth result for the input image $X_i$.
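The Euclidean loss can be written in a few lines. This sketch assumes the common form with a 1/(2N) factor, i.e., half of the batch-averaged squared L2 distance between predicted and ground-truth density maps; the function name `euclidean_loss` and the (N, H, W) array layout are assumptions.

```python
import numpy as np

def euclidean_loss(pred, gt):
    """pred, gt: arrays of shape (N, H, W) holding N density maps each."""
    n = pred.shape[0]
    diff = pred - gt
    # Sum of squared pixel differences over the batch, scaled by 1 / (2N).
    return float((diff ** 2).sum() / (2.0 * n))

pred = np.zeros((2, 2, 2))
gt = np.ones((2, 2, 2))
loss = euclidean_loss(pred, gt)
print(loss)  # 2.0
```

Note that this is a per-pixel loss on the density map, not on the final count, so it penalizes misplaced density even when the total count happens to be right.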
|Configurations of CSRNet|
|input(unfixed-resolution color image)|
|back-end (four different configurations)|
Table 3: Configuration of CSRNet. All convolutional layers use padding to maintain the previous size. The convolutional layers’ parameters are denoted as “conv-(kernel size)-(number of filters)-(dilation rate)”; max-pooling layers are conducted over a 2 × 2 pixel window with stride 2.
Compared to the previous state-of-the-art methods, our model is smaller, more accurate, and easier to train and deploy. In this section, the evaluation metrics are introduced, and then an ablation study on the ShanghaiTech Part_A dataset is conducted to analyze the configuration of our model (shown in Table 3). Along with the ablation study, we evaluate our proposed method and compare it to the previous state-of-the-art methods on all five datasets. The implementation of our model is based on the Caffe framework.
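The MAE and the MSE are used for evaluation, and they are defined as:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|$$

$$MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|^2}$$

where $N$ is the number of images in one test sequence and $C_i^{GT}$ is the ground truth count. $C_i$ represents the estimated count, which is defined as follows:

$$C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} z_{l,w}$$

$L$ and $W$ are the length and width of the density map respectively, while $z_{l,w}$ is the pixel at $(l, w)$ of the generated density map. $C_i$ is the estimated count for image $X_i$.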
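The evaluation metrics can be sketched as below, assuming the estimated count is the sum of all pixels in the density map and that the reported "MSE" is the square root of the mean squared count error. The function names `count_from_density` and `mae_mse` are hypothetical.

```python
import numpy as np

def count_from_density(dmap):
    """Estimated count: the sum of all pixels in the density map."""
    return float(np.sum(dmap))

def mae_mse(est_counts, gt_counts):
    """Return (MAE, MSE), where MSE is the root of the mean squared error."""
    err = np.asarray(est_counts, dtype=float) - np.asarray(gt_counts, dtype=float)
    mae = float(np.abs(err).mean())
    mse = float(np.sqrt((err ** 2).mean()))
    return mae, mse

maps = [np.full((4, 5), 0.5), np.full((2, 2), 5.0)]  # toy density maps
est = [count_from_density(m) for m in maps]           # [10.0, 20.0]
mae, mse = mae_mse(est, [12.0, 16.0])
```

For the toy values above, the count errors are −2 and 4, giving an MAE of 3 and an MSE of √10. The MSE weights large errors more heavily, so it is read as a measure of robustness alongside the MAE.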
We also use the PSNR and SSIM to evaluate the quality of the output density map on the ShanghaiTech Part_A dataset. To calculate the PSNR and SSIM, we follow a previously published preprocessing procedure, which includes resizing the density map (to the same size as the original input) with interpolation and normalizing both the ground truth and the predicted density map.
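For reference, the PSNR part of this evaluation can be sketched as follows. The exact preprocessing is not spelled out here, so this example simply assumes that both maps already have the same size and have been normalized to the range [0, 1]; the function name `psnr` and the `peak` parameter are assumptions, and SSIM is omitted because it needs a windowed computation.

```python
import numpy as np

def psnr(pred, gt, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB for maps scaled to [0, peak]."""
    mse = float(np.mean((np.asarray(pred, float) - np.asarray(gt, float)) ** 2))
    if mse == 0.0:
        return float("inf")  # identical maps
    return 10.0 * np.log10(peak ** 2 / mse)

gt_map = np.zeros((4, 4))
pred_map = np.full((4, 4), 0.1)
score = psnr(pred_map, gt_map)  # about 20 dB for a uniform error of 0.1
```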
In this subsection, we perform an ablation study to analyze the four configurations of CSRNet on the ShanghaiTech Part_A dataset, a new large-scale crowd counting dataset that includes 482 images of congested scenes with 241,667 annotated persons. It is challenging to count from these images because of the extremely congested scenes, the varied perspectives, and the unfixed resolution. The four configurations are shown in Table 3. CSRNet A is the network in which all dilation rates are 1. CSRNet B and D use dilation rates of 2 and 4 in their back-ends respectively, while CSRNet C combines dilation rates of 2 and 4. All four models have the same number of parameters (16.26M), so we can compare the effect of different dilation rates. After training on the ShanghaiTech Part_A dataset using the method described in Sec. 3.2, we apply the evaluation metrics defined in Sec. 4.1. We try dropout to prevent potential overfitting, but it brings no significant improvement, so we do not include dropout in our model. The detailed evaluation results are shown in Table 4, where CSRNet B achieves the lowest error (the highest accuracy). Therefore, we use CSRNet B as the proposed CSRNet for the following experiments.
The ShanghaiTech crowd counting dataset contains 1198 annotated images with a total of 330,165 persons. The dataset consists of two parts: Part_A contains 482 images of highly congested scenes randomly downloaded from the Internet, while Part_B includes 716 images of relatively sparse crowd scenes taken from streets in Shanghai. Our method is evaluated and compared to six other recent works, and the results are shown in Table 5. Our method achieves the lowest MAE (the highest accuracy) on Part_A among all methods, with 7% lower MAE than the state-of-the-art solution called CP-CNN. CSRNet also delivers 47.3% lower MAE on Part_B compared to CP-CNN. To evaluate the quality of the generated density maps, we compare our method to MCNN and CP-CNN on the Part_A dataset, following the evaluation metrics in Sec. 3.2. Samples of the test cases can be found in Fig. 5. Results are shown in Table 6, which indicates that CSRNet achieves the highest SSIM and PSNR. We also report the quality results for the ShanghaiTech dataset in Table 11.
|Method||Part_A MAE||Part_A MSE||Part_B MAE||Part_B MSE|
|Zhang et al. ||181.8||277.7||32.0||49.8|
|Marsden et al. ||126.5||173.5||23.8||33.1|
The UCF_CC_50 dataset includes 50 images with different perspectives and resolutions. The number of annotated persons per image ranges from 94 to 4543, with an average of 1280. 5-fold cross-validation is performed following the standard setting. Comparisons of MAE and MSE are listed in Table 7, while the quality of the generated density maps can be found in Table 11.
|Method||MAE||MSE|
|Idrees et al. ||419.5||541.6|
|Zhang et al. ||467.0||498.5|
|Onoro et al.  Hydra-2s||333.7||425.2|
|Onoro et al.  Hydra-3s||465.7||371.8|
|Walach et al. ||364.4||341.4|
|Marsden et al. ||338.6||424.5|
The WorldExpo’10 dataset  consists of 3980 annotated frames from 1132 video sequences captured by 108 different surveillance cameras. This dataset is divided into a training set (3380 frames) and a testing set (600 frames) from five different scenes. The region of interest (ROI) is provided for the whole dataset. Each frame and its dot maps are masked with ROI during preprocessing, and we train our model following the instructions given in Sec. 3.2. Results are shown in Table 8. The proposed CSRNet delivers the best accuracy in 4 out of 5 scenes and it achieves the best accuracy on average.
|Method||Sce.1||Sce.2||Sce.3||Sce.4||Sce.5||Avg.|
|Chen et al. ||2.1||55.9||9.6||11.3||3.4||16.5|
|Zhang et al. ||9.8||14.1||14.3||22.2||3.7||12.9|
|Shang et al. ||7.8||15.4||14.9||11.8||5.8||11.7|
The UCSD dataset has 2000 frames captured by surveillance cameras. These scenes contain sparse crowds varying from 11 to 46 persons per image. The region of interest (ROI) is also provided. Because the resolution of each frame is fixed and small (238 × 158), it is difficult to generate a high-quality density map after frequent pooling operations. We therefore preprocess the frames by using bilinear interpolation to resize them to 952 × 632. Among the 2000 frames, we use frames 601 through 1400 as the training set and the rest as the testing set, following the standard split. Before blurring the annotations as described in Sec. 3.2, all the frames and the corresponding dot maps are masked with the ROI. The accuracy on the UCSD dataset is shown in Table 9; we outperform most of the previous methods in the MAE category, with MCNN as the exception. The results indicate that our method can handle not only counting tasks for extremely dense crowds but also tasks for relatively sparse scenes. We also provide the quality of the generated density maps in Table 11.
Beyond crowd counting, we set up an experiment on the TRANCOS dataset for vehicle counting to demonstrate the robustness and generalization of our approach. TRANCOS is a public traffic dataset containing 1244 images of different congested traffic scenes captured by surveillance cameras, with 46796 annotated vehicles. The region of interest (ROI) is also provided for the evaluation. The perspectives of the images are not fixed, and the images are collected from very different scenarios. The Grid Average Mean Absolute Error (GAME) is used for evaluation in this test. The GAME is defined as follows:

$$GAME(L) = \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{l=1}^{4^L} \left| e_n^l - gt_n^l \right| \right)$$

where $N$ is the number of images in the testing set, and $e_n^l$ is the estimated result of the input image $n$ within region $l$. $gt_n^l$ is the corresponding ground truth result. For a specific level $L$, GAME($L$) subdivides the image using a grid of $4^L$ non-overlapping regions which cover the full image, and the error is computed as the sum of the MAE in each of these regions. When $L = 0$, the GAME is equivalent to the MAE metric.
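A small sketch of the GAME computation is shown below. It assumes that level L splits each density map into a 2^L × 2^L grid (4^L regions), sums the absolute count error over the regions of each image, and averages over images. The function name `game` and the use of evenly spaced integer cell boundaries are assumptions.

```python
import numpy as np

def game(pred_maps, gt_maps, L):
    """Grid Average Mean Absolute Error at level L for lists of 2-D maps."""
    cells = 2 ** L  # 2^L x 2^L grid, i.e. 4^L non-overlapping regions
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        H, W = pred.shape
        r_edges = np.linspace(0, H, cells + 1).astype(int)
        c_edges = np.linspace(0, W, cells + 1).astype(int)
        for a in range(cells):
            for b in range(cells):
                rs = slice(r_edges[a], r_edges[a + 1])
                cs = slice(c_edges[b], c_edges[b + 1])
                # Absolute count error within this grid cell.
                total += abs(pred[rs, cs].sum() - gt[rs, cs].sum())
    return total / len(pred_maps)

pred = np.zeros((4, 4)); pred[0, 0] = 1.0  # one object, top-left corner
gt = np.zeros((4, 4)); gt[3, 3] = 1.0      # one object, bottom-right corner
print(game([pred], [gt], 0), game([pred], [gt], 1))  # 0.0 2.0
```

The toy case shows why GAME is stricter than MAE: the total count is correct (level 0 error is zero), but the object is in the wrong place, which level 1 penalizes.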
One of the compared methods deploys a combination of fully convolutional neural networks (FCN) and a long short-term memory network (LSTM). Results are shown in Table 10, with three examples shown in Fig. 6. Our model achieves a significant improvement on all four GAME metrics. Compared to the referenced prior result, CSRNet delivers 67.7% lower GAME(0), 60.1% lower GAME(1), 48.7% lower GAME(2), and 22.2% lower GAME(3), making it the best solution. We also present the quality of the generated density maps in Table 11.
|Method||GAME 0||GAME 1||GAME 2||GAME 3|
|Fiaschi et al. ||17.77||20.14||23.65||25.99|
|Lempitsky et al. ||13.76||16.72||20.72||24.36|
In this paper, we proposed a novel architecture called CSRNet for crowd counting and high-quality density map generation with an easy-to-train, end-to-end approach. We used dilated convolutional layers to aggregate multi-scale contextual information in congested scenes. By taking advantage of the dilated convolutional layers, CSRNet can expand the receptive field without losing resolution. We demonstrated our model on four crowd counting datasets with state-of-the-art performance. We also extended our model to the vehicle counting task, where it achieved the best accuracy as well.
This work was supported by the IBM-Illinois Center for Cognitive Computing System Research (C3SR) - a research collaboration as part of the IBM AI Horizons Network.
In this appendix, additional results generated by CSRNet on five datasets (ShanghaiTech, UCF_CC_50, WorldExpo’10, UCSD, and TRANCOS) are presented to demonstrate the validity of our design. Two criteria, the PSNR (Peak Signal-to-Noise Ratio) and the SSIM (Structural Similarity in Image), are used to evaluate the quality of the generated density maps. Samples from these five datasets are shown in Fig. 7 to Fig. 12, which represent a variety of density levels.