I Introduction
In recent years, vision-based crowd analysis has been extensively researched, due to its wide applications in crowd management, traffic control, urban planning, and surveillance. As one of the most important applications, crowd counting has been studied extensively [Zhang_2016_CVPR_MCNN, Idrees_2018_ECCV_CL, Ma_2019_ICCV_BL]. With the recent progress of deep learning techniques, the performance of crowd counting has been significantly elevated [Li_2018_CVPR_CSRNet, Ma_2019_ICCV_BL, Yan_2019_ICCV_PGCNet, Liu_2019_ICCV_DSSINet]. Convolutional neural network (CNN) based methods have demonstrated excellent performance on the task of counting dense crowds in images. Most CNN-based methods first estimate a density map via a deep neural network and then derive the count from it
[Zhang_2016_CVPR_MCNN, Li_2018_CVPR_CSRNet, Cao_2018_ECCV_SA, Wan_2019_ICCV_adaptive_map, Yan_2019_ICCV_PGCNet]. Specifically, the concept of the density map, where the integral (sum) over any subregion equals the number of objects in that region, was first proposed in [NIPS2010_4043_map]. Existing crowd counting benchmarks provide point annotations for each crowd image, in which each point is located on the head of a person in the crowd. To train a CNN-based crowd counting model, the point annotations need to be converted to a density map in advance. Lempitsky et al. [NIPS2010_4043_map] proposed to use a normalized 2D Gaussian kernel to convert the point annotations to a ground-truth density map. Typically, the Gaussian kernel size is fixed while converting point annotations, but this trivial approach degrades the counting performance, since the scales of individuals in the crowd image may vary greatly. To produce better ground-truth density maps, Zhang et al. [zhang2015cross] provided manual estimations of the perspective maps for the crowd images, but it is laborious to provide accurate perspective information for all images captured in various scenarios. Zhang et al. [Zhang_2016_CVPR_MCNN] introduced the geometry-adaptive kernel to create ground-truth density maps. They assume that the crowd is evenly distributed, so the kernel size can be estimated by the average distance between each point annotation and its nearest neighbors. Generally, the geometry-adaptive kernel is effective for estimating the kernels of dense point annotations, but it is inaccurate when the crowd is not evenly distributed. As we know, with different Gaussian kernel sizes, the ground-truth density maps converted from point annotations can be sharper with smaller kernel sizes or smoother with larger kernel sizes (as shown in Fig. 2(b) and (c)). Empirically, it is easier for a CNN model to fit smoother crowd density maps than sharper ones.
This is probably because sharper ground-truth density maps contain far more zero values than non-zero ones, which makes it difficult for the network to fit the ground-truth. Nevertheless, the individual information contained in smoother density maps is relatively vague compared with sharper ones. The performance of the crowd counting model is thus degraded, and prior density map generation approaches can hardly handle this problem.
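To make the density-map conversion concrete, here is a minimal pure-Python sketch (not the authors' implementation; the kernel size and standard deviation are illustrative) that converts point annotations into a density map with a fixed normalized Gaussian kernel, so that the map integrates to the crowd count:

```python
import math

def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian kernel (sums to 1), so each person contributes 1 to the map."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    s = sum(v for row in k for v in row)
    return [[v / s for v in row] for row in k]

def points_to_density(points, h, w, size=15, sigma=4.0):
    """Stamp one normalized kernel per head annotation (x, y)."""
    dmap = [[0.0] * w for _ in range(h)]
    ker = gaussian_kernel(size, sigma)
    c = size // 2
    for (px, py) in points:
        for dy in range(size):
            for dx in range(size):
                y, x = py + dy - c, px + dx - c
                if 0 <= y < h and 0 <= x < w:
                    dmap[y][x] += ker[dy][dx]
    return dmap

# Three annotated heads well inside an 80x80 image: the map should sum to ~3.
d = points_to_density([(20, 20), (40, 30), (60, 50)], 80, 80)
count = sum(v for row in d for v in row)
```

Because each kernel is normalized to sum to 1, integrating the map over the whole image recovers the number of annotated heads (up to border clipping), which is exactly the property introduced in [NIPS2010_4043_map].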
To mitigate this problem, our work tackles it in several respects. First, we propose a parametric perspective-aware density map generation approach. We assume that the individuals on the same horizontal line (or the same row) of the image are at a similar distance from the camera, so the kernel sizes for the point annotations on the same row should be the same. Thus, to determine the kernel size for each row of the image, we introduce a metric, effective density, used to measure the density of the highly aggregated segments on each row. By linearly mapping the effective densities to a manually-defined kernel size range, we can easily produce perspective-aware density maps, controlled by a small number of parameters, as the ground-truths for model training. Second, we propose a simple CNN-based architecture featuring two output branches, which are supervised by a multi-task loss with low-resolution and high-resolution ground-truth density maps. With the supervision of high- and low-resolution density maps, our model is able to generate the crowd distribution at a relatively high dimension (i.e., 1/4 of the input size).
Last but not least, we propose an iterative distillation optimization algorithm for progressively enhancing the performance of our network model. As mentioned, although our network can regress the crowd distribution, its performance is still constrained by the quality of the density maps. To benefit learning from more accurate yet hard-to-learn density maps (i.e., the sharp density maps generated by small Gaussian kernels), we propose to iteratively distill the network with a previously trained identical network. Distillation techniques have previously been used to compress networks or improve network capability [hinton2015distilling, yang2019snapshot, furlanello2018born]. Here, we employ distillation to strengthen the capability of our network as the training objective of the crowd counting model becomes more and more challenging. Particularly, during distillation, our proposed parametric density map generation approach can be iteratively utilized to generate sharper ground-truth density maps as the training objectives. In experiments, we demonstrate that our perspective-aware density map generation method is better than prior generation techniques. Besides, we show that, although our network architecture is simple, our approach still obtains state-of-the-art performance compared with other methods.
In summary, our contributions are as follows:

We introduce a parametric perspective-aware density map generation method to generate ground-truth density maps from crowd point annotations, so as to train a crowd counting model that can estimate crowd density maps with relatively high spatial dimension.

We present a novel iterative distillation algorithm that enhances our model while progressively reducing the Gaussian kernel sizes of the ground-truth density maps, which further improves the counting performance.

In experiments, we show that our simple network architecture can reach state-of-the-art performance compared against the latest approaches on public benchmarks.
II Related work
In this section, we survey the most related works on density map generation and knowledge distillation.
II-A Density map generation for crowd counting
The perspective distortion in images is one of the main factors that hinder accurate counting of the crowd. Due to the shooting scene, angle, terrain distribution, and other factors, the perspective of each image is different. It is a natural idea to use perspective information to improve the ground-truth density map. By manually measuring the height of people at different positions in the image, [zhang2015cross] calculated the perspective of 108 scenes in the WorldExpo'10 dataset, and used the perspective to generate appropriate ground-truth density maps for the dataset. However, most datasets contain various scenes (such as ShanghaiTech [Zhang_2016_CVPR_MCNN] and UCF-QNRF [Idrees_2018_ECCV_CL]), so manually estimating the perspective of each image is very laborious. Thus, Zhang et al. [Zhang_2016_CVPR_MCNN] proposed the geometry-adaptive estimation technique: the Gaussian covariance of each annotated point is estimated by calculating its average distance to the surrounding points, thereby generating a ground-truth density map that more closely matches the actual distribution of the crowd. Shi et al. [shi2019counting] proposed a non-uniform kernel estimation. They assume that the crowd is unevenly distributed throughout the crowd image. They first used the geometry-adaptive estimation technique to estimate the covariance of all points, and then calculated the average covariance over the local neighborhood of each point, which prevents the Gaussian distribution of a single point from becoming too large or too small. Compared with these methods, we propose a new method to produce ground-truth density maps, by assuming that the points on each row of the image have Gaussian kernels of similar size and estimating the perspective of the entire image via rough density maps without any supervision.
II-B Knowledge distillation
Knowledge distillation is a deep learning training technique that trains a network model (i.e., the student model) to mimic the behavior of an independently trained model (i.e., the teacher model). It was originally used for model compression, where the student model is lightweight. However, some recent studies [furlanello2018born, yang2019snapshot] have found that distillation between models with identical network structures can obtain better classification results. Particularly, Yang et al. [yang2019snapshot] proposed the snapshot distillation method, which divides the training process into two stages. In the first stage, the network performs normal training; then the first-stage model is used as a teacher to guide the student model in the second-stage training. Inspired by these works, we propose our distillation algorithm for the task of crowd counting.
III Our Proposed Method
In this section, we first introduce our deep network architecture. Then, to progressively strengthen the performance of our network, we present a distillation-based optimization approach together with our parametric density map generation method.
III-A Network architecture
In order to train a crowd counting model, we follow the principle of prior frameworks (e.g. [zhang2015cross]), in which the deep neural network aims to generate a density map from the input crowd image, and the crowd count is measured by accumulating the values over the entire density map.
As shown in Fig. 1, we adopt a vanilla deep convolutional neural network (CNN) as the backbone network to estimate the count of a crowd image, which consists of a downsampling module and an upsampling module. The downsampling module adopts the first 10 convolutional layers of the pretrained VGG-16 model [simonyan2014very], which extract a deep visual feature representation from the input crowd image. Then, two transposed convolutional layers are applied to upsample the spatial dimension of the feature maps to 1/2 of the input dimension (i.e., 1/4 of the input size). As depicted in Table I, most prior counting models produce low-resolution density maps (i.e., no more than 1/16 of the original size), yet producing a high-resolution density map benefits many downstream applications such as crowd analysis. The reason we adopt such a simple network architecture is to demonstrate the effect of our proposed density map generation and distillation method.
Method  Year  Ratio 

MBTTBFSCFB [Sindagi_2019_ICCV_MBTTBF]  2019  1/256 
CSRNet [Li_2018_CVPR_CSRNet]  2018  1/64 
ADCrowdNet [Liu_2019_CVPR_ADCrowd]  2019  1/64 
CSRNet+PACNN [Shi_2019_CVPR_PACNN]  2019  1/64 
CAN [Liu_2019_CVPR_CAN]  2019  1/64 
BL [Ma_2019_ICCV_BL]  2019  1/64 
PACNN [Shi_2019_CVPR_PACNN]  2019  1/64 
MCNN [Zhang_2016_CVPR_MCNN]  2016  1/16 
Ours    1/4 
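As a sanity check on the ratios in Table I (our own illustration, assuming the ratios are by area), the following sketch computes the output resolution implied by the backbone's three max-pooling layers and the two transposed convolutional layers:

```python
def output_ratios(num_pools=3, num_up=2):
    """The first 10 VGG-16 conv layers include 3 max-pools (each halves the
    side length); each transposed conv layer doubles the side length back."""
    side = (0.5 ** num_pools) * (2 ** num_up)  # fraction of the input side length
    return side, side * side                    # (per-side ratio, area ratio)

side, area = output_ratios()  # 1/2 per side, i.e. the 1/4 area ratio in Table I
```

Under the same convention, CSRNet's 1/8-per-side output corresponds to the 1/64 entry in the table.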
However, it is challenging for a vanilla deep neural network to produce a high-resolution density map. To make the network prediction robust, we introduce a multi-task loss by employing two separate upsampling branches for low-resolution and high-resolution supervision. As illustrated in Fig. 1, the losses of these two branches are supervised by ground-truth density maps of two different scales, respectively. In particular, the low-resolution branch consists of two convolutional layers. Hence, the network is trained via a standard pixel-wise L2 loss:
(1) $\mathcal{L} = \sum_{i}\Big(\big\|\hat{D}^{h}_{i}-D^{h}_{i}\big\|_{2}^{2} + \big\|\hat{D}^{l}_{i}-D^{l}_{i}\big\|_{2}^{2}\Big)$

where the sum runs over the crowd images $X_i$ in the training set, $\hat{D}^{h}_{i}$ and $\hat{D}^{l}_{i}$ are the high-resolution and low-resolution outputs of our network for $X_i$, and $D^{h}_{i}$ and $D^{l}_{i}$ represent the corresponding high-resolution and low-resolution ground-truth density maps, respectively.
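The multi-task objective of Eq. (1) is just the sum of two pixel-wise L2 terms; a minimal sketch (plain Python lists standing in for density maps, function names our own):

```python
def l2(pred, gt):
    """Pixel-wise squared-error between a predicted and a ground-truth density map."""
    return sum((p - g) ** 2
               for prow, grow in zip(pred, gt)
               for p, g in zip(prow, grow))

def multitask_loss(pred_hi, gt_hi, pred_lo, gt_lo):
    """Eq. (1): supervise the high- and low-resolution branches jointly."""
    return l2(pred_hi, gt_hi) + l2(pred_lo, gt_lo)

# Toy 1x2 high-res map and 1x1 low-res map: loss = (0^2 + 1^2) + 0.5^2 = 1.25
loss = multitask_loss([[1.0, 2.0]], [[1.0, 1.0]], [[0.5]], [[0.0]])
```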
III-B Parametric perspective-aware density map
The supervision of our crowd counting network requires the ground-truth crowd density map. Existing crowd counting benchmarks provide point annotations for each crowd image, in which each point annotation represents the position of a person in the crowd. To obtain the crowd density map, following [NIPS2010_4043_map], prior crowd counting works convert the point annotations of a crowd image into a density map by applying a Gaussian kernel over each point. Most prior methods apply a Gaussian kernel with a fixed kernel size [NIPS2010_4043_map, Cao_2018_ECCV_SA] or a geometry-adaptive kernel size [8099912_switch, Li_2018_CVPR_CSRNet, Liu_2019_CVPR_CAN, Liu_2019_ICCV_DSSINet, Ma_2019_ICCV_BL, xu2019learn, xiong2019open, shen2018crowd]. These density conversion techniques do not consider the perspective information or the non-uniform crowd distribution of the crowd images. Based on the recent findings of [Wan_2019_ICCV_adaptive_map], the generated density map may affect the performance of the model, so these trivial density map generation methods may not achieve satisfactory results. Thus, we propose a spatial-aware parametric method for generating adaptive density maps.
Our proposed density conversion approach is based on a simple assumption that the majority of persons located on the same row of the image are at a similar distance from the camera (e.g. Fig. 2(a)). This assumption may not be suited to certain crowd scenes, but it empirically works well for most scenes. Thus, subject to this assumption, each row of the density map corresponds to a Gaussian kernel with the same size, which approximates the scale variance caused by the perspective view of the crowd image.
To estimate the Gaussian kernel size for each row, we first apply a fixed, large Gaussian kernel (i.e., a large standard deviation) to produce a rough yet smooth density map $\bar{D}$ as a prior (see Fig. 2(b)). Due to the perspective effect of the image, the density depends not only on the spatial distribution of the crowd but also on the distance from the camera. For instance, given two persons sitting side-by-side, they appear closer together when they are farther away from the camera, and vice versa. Thus, we can approximately recover the relation between density and distance from the camera for each row. We propose to calculate the effective density $e_{j}$ for the $j$-th row of $\bar{D}$:

(2) $e_{j} = \dfrac{\sum_{x}\bar{D}(x,j)\,\delta\big(\bar{D}(x,j)>\epsilon\big)}{\sum_{x}\delta\big(\bar{D}(x,j)>\epsilon\big)}$

(3) $\delta(c) = \begin{cases}1, & \text{if the condition } c \text{ is satisfied}\\ 0, & \text{otherwise}\end{cases}$

where $\bar{D}(x,j)$ refers to the density value at location $(x,j)$ of the image, $\delta(\cdot)$ denotes the Dirac delta (indicator) function, which equals $1$ when the condition in the bracket is satisfied and $0$ otherwise, and $\epsilon$ refers to the manually defined threshold.
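The effective density of Eqs. (2)-(3) can be sketched as follows (our own illustration; `eps` stands for the threshold $\epsilon$):

```python
def effective_density(row, eps):
    """Eqs. (2)-(3): average density over the cells of one row whose value
    exceeds the threshold eps (the 'highly aggregated' segments); rows with
    no cell above the threshold are assigned zero."""
    kept = [v for v in row if v > eps]
    return sum(kept) / len(kept) if kept else 0.0

# Only the 0.2 and 0.4 cells pass the threshold: (0.2 + 0.4) / 2 = 0.3
e = effective_density([0.0, 0.2, 0.4, 0.0], 0.1)
```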
[Fig. 2: (a) input crowd image; (b) density map generated with a large fixed kernel; (c) density map generated by our parametric perspective-aware method]
In essence, the effective density $e_{j}$ measures the average density of the most dense segments on each row of the crowd image while filtering out the less dense parts. Thus, we can determine the kernel size for each row according to its $e_{j}$. To accomplish this, we apply a linear mapping between the maximum and minimum values of $e_{j}$ (i.e., $e_{\max}$ and $e_{\min}$) and the largest and smallest kernel sizes (i.e., the standard deviations $\sigma_{\max}$ and $\sigma_{\min}$), respectively. Specifically, $\sigma_{\max}$ and $\sigma_{\min}$ are manually determined, while $e_{\max}$ and $e_{\min}$ are measured over the entire training set. Hence, the Gaussian kernel size $\sigma_{j}$ of the $j$-th row can be computed as:

(4) $\sigma_{j} = a\,e_{j} + b$

where

(5) $a = \dfrac{\sigma_{\min}-\sigma_{\max}}{e_{\max}-e_{\min}}$

(6) $b = \sigma_{\max} - a\,e_{\min}$

so that the densest (farthest) rows receive the smallest kernels. Thus, after assigning the kernel size for each row, we can obtain the density map determined by $\sigma_{\max}$ and $\sigma_{\min}$, denoted as $D(\sigma_{\max},\sigma_{\min})$. We illustrate an example of our parametric method in Fig. 2(c). In practice, $\sigma_{\min}$ is often fixed at a small value, while we mainly tune the value of $\sigma_{\max}$ to achieve various density maps. Therefore, for the sake of simplicity, the notation can be simplified as $D(\sigma_{\max})$. In the extreme case where $\sigma_{\max}$ equals $\sigma_{\min}$, the generated density map is the same as one generated with a universally fixed kernel size.
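The row-wise kernel assignment of Eqs. (4)-(6) amounts to a single linear map; a sketch (our own, assuming the densest, i.e. farthest, rows receive the smallest kernels):

```python
def row_sigma(e_j, e_min, e_max, s_min, s_max):
    """Eqs. (4)-(6): linearly map a row's effective density e_j onto the
    kernel range [s_min, s_max], sending e_min -> s_max and e_max -> s_min."""
    a = (s_min - s_max) / (e_max - e_min)   # Eq. (5)
    b = s_max - a * e_min                   # Eq. (6)
    return a * e_j + b                      # Eq. (4)

# Least dense row gets the largest kernel; densest row gets the smallest.
s_near = row_sigma(0.1, 0.1, 0.9, 2.5, 25.0)  # -> 25.0
s_far = row_sigma(0.9, 0.1, 0.9, 2.5, 25.0)   # -> 2.5
```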
As described in the following section, our proposed density map generation technique can benefit the training of the crowd counting model by providing perspective-aware, parametric density maps.
III-C Distillation-based crowd counting
As shown in Fig. 2(b) and (c), with a larger kernel size (i.e., a larger $\sigma_{\max}$ for $D(\sigma_{\max})$), the density map appears smoother, and vice versa. On one hand, for a deep neural network that regresses density maps, smoother density maps are easier to learn than sharper ones. On the other hand, training the regression network on such smoother density maps may degrade the results of crowd counting, since smoother density maps blur the individual information of the crowd image. In practice, the parameters are often chosen empirically to trade off counting accuracy against training difficulty.
III-C1 Crowd counting network distillation
In order to break the constraint imposed by the ground-truth density maps, we propose a distillation-based optimization method that progressively reduces the difficulty of learning from sharper density maps so as to obtain a better solution. Knowledge distillation was proposed in [hinton2015distilling] and has been applied to network compression. Specifically, a lightweight model is applied to learn the behavior of a large network. Recent studies find that identical network structures can also benefit from distillation [furlanello2018born, yang2019snapshot]. Here, we employ a similar optimization strategy to iteratively train our network.
First, we train our network according to Eq. 1, which can be simply expressed as:

(7) $\mathcal{L}_{0} = \ell\big(S_{0}(X),\, D(\sigma_{\max})\big)$

where $\ell(\cdot,\cdot)$ denotes the multi-task loss of Eq. 1, $S_{0}$ denotes the initial network, and $X$ denotes the input crowd image.
Next, we treat the trained network as the teacher model (denoted as $T$) and leverage it to train an identical network (i.e., the student model $S$) from scratch. Thus, the trained student model at this stage can be treated as the teacher model of the next stage, i.e., $T_{t+1}=S_{t}$, where $t$ denotes the timestamp of the training stage. Hence, the training loss of the $t$-th stage in distillation can be expressed as follows:

(8) $\mathcal{L}_{t} = \ell\big(S_{t}(X),\, D(\sigma_{\max})\big) + \lambda\,\ell\big(S_{t}(X),\, T_{t}(X)\big)$

where $\lambda$ denotes the constant balance weight. The first term is the standard supervised loss, while the second term aligns the outputs of the teacher and student models, forcing the student to approach the behavior of the teacher.
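The two-term distillation objective of Eq. (8) can be sketched as follows (a toy illustration with scalar "maps" and an arbitrary base loss; names are our own):

```python
def distill_loss(student_out, gt_map, teacher_out, lam, loss_fn):
    """Eq. (8): ground-truth supervision plus a term aligning the student
    with the frozen teacher (the previous stage's model); lam is the
    constant balance weight."""
    return loss_fn(student_out, gt_map) + lam * loss_fn(student_out, teacher_out)

# Toy demo with scalar 'maps' and an absolute-error loss:
# |1 - 0| + 0.5 * |1 - 2| = 1.5
toy_loss = distill_loss(1.0, 0.0, 2.0, 0.5, lambda a, b: abs(a - b))
```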
The distillation-based optimization can progressively optimize the network. However, as mentioned, inappropriately computed ground-truth density maps will degrade the counting performance. Furthermore, static ground-truth density maps (i.e., maps that remain the same throughout distillation) limit the optimal performance discoverable by distillation. Thus, in the following section, we introduce a simple yet effective method to produce adaptive density maps.
III-C2 Density-aware distillation
To further strengthen the network performance, we improve the distillation method by utilizing our parametric perspective-aware density map generation technique. According to Eq. 8, the student model $S_{t}$ will learn the behavior of the teacher model $T_{t}$. But, with a static density map as ground-truth, the distillation may not further improve the performance. However, it is difficult to directly train a network from sharp density maps generated with a small kernel size, which often leads to poor performance. With distillation, the model is able to progressively adapt to sharper and sharper density maps. Particularly, in each stage of distillation, we slightly increase the difficulty of the task by introducing density maps with a smaller kernel size. Specifically, at every stage of distillation, the training loss is modified as below:

(9) $\mathcal{L}_{t} = \ell\big(S_{t}(X),\, D(\sigma^{t}_{\max})\big) + \lambda\,\ell\big(S_{t}(X),\, T_{t}(X)\big)$

where $\sigma^{t}_{\max}$ refers to the maximal Gaussian kernel size at the $t$-th training stage. After each stage, $\sigma^{t}_{\max}$ will be updated, i.e., $\sigma^{t+1}_{\max}=\gamma\,\sigma^{t}_{\max}$, where $\gamma$ is a constant within the value range of $(0,1)$. Our algorithm is summarized in Algorithm 1. By iteratively distilling the model while reducing the kernel size to generate sharper density maps for the network to learn, the performance of the student model can be iteratively strengthened.
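Algorithm 1 can be sketched as the following loop (our own paraphrase; `train_stage` stands for one full training run under Eq. (7) when no teacher exists, or Eq. (9) when one does):

```python
def iterative_distillation(train_stage, sigma_max, gamma, num_stages):
    """Each stage trains a fresh student against ground-truth maps rendered
    with a shrinking sigma_max, guided by the previous stage's model.
    train_stage(teacher, sigma) -> newly trained model."""
    teacher = train_stage(None, sigma_max)       # initial training, Eq. (7)
    sigmas = [sigma_max]
    for _ in range(num_stages):
        sigma_max = gamma * sigma_max            # sharpen the targets
        sigmas.append(sigma_max)
        teacher = train_stage(teacher, sigma_max)  # distillation stage, Eq. (9)
    return teacher, sigmas

# Dummy trainer that just records the sigma it was given, so we can
# inspect the kernel schedule (25 -> 20 -> 16 with gamma = 0.8).
_, sigmas = iterative_distillation(lambda t, s: s, 25.0, 0.8, 2)
```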
IV Experimental Results
In this section, we first introduce the datasets and metrics used for evaluation, as well as the implementation details. Then, we conduct comparison experiments with the state-of-the-art methods on public benchmarks. Last, we perform an ablation study to investigate the density map generation, the distillation algorithm, and the multi-task loss of our model.
IV-A Datasets and evaluation metrics
We evaluate our approach on public crowd counting benchmarks: ShanghaiTech Part A/B [Zhang_2016_CVPR_MCNN], UCF-QNRF [Idrees_2018_ECCV_CL], and UCSD [chan2008privacy]. Particularly, the ShanghaiTech dataset has a total of 1198 crowd images, divided into Part A and Part B: Part A contains 482 images, and Part B contains 716 images. The UCF-QNRF dataset has 1535 high-resolution images with a total of 1.25 million annotations, of which 1201 images are for training and 334 for testing. Besides, the UCSD dataset has 2000 frames with relatively smaller density.
Following prior works, MSE and MAE are used as evaluation metrics, which are defined as follows:
(10) $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|C_{i}-\hat{C}_{i}\big|$

(11) $\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(C_{i}-\hat{C}_{i}\big)^{2}}$

where $N$ is the number of test images, $\hat{C}_{i}$ indicates the estimated count (the integral of the estimated density map) of the $i$-th image, and $C_{i}$ indicates the ground-truth count of the $i$-th image.
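These two metrics can be computed directly from per-image counts, as in this short sketch:

```python
import math

def mae(gt_counts, est_counts):
    """Eq. (10): mean absolute counting error over the test set."""
    return sum(abs(g - e) for g, e in zip(gt_counts, est_counts)) / len(gt_counts)

def mse(gt_counts, est_counts):
    """Eq. (11): root of the mean squared counting error (the convention
    used in the crowd-counting literature)."""
    return math.sqrt(sum((g - e) ** 2
                         for g, e in zip(gt_counts, est_counts)) / len(gt_counts))

# Two test images with errors of 2 and 4 people:
m1 = mae([10, 20], [12, 16])  # (2 + 4) / 2 = 3.0
m2 = mse([10, 20], [12, 16])  # sqrt((4 + 16) / 2) = sqrt(10)
```

Note that MSE as defined here is sensitive to occasional large errors, which is why the two metrics are usually reported together.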
IV-B Implementation details
In practice, we set $\sigma_{\min}$ as 2.5 and the initial $\sigma_{\max}$ as 25. In the first distillation time step, $\sigma_{\max}$ is set as 20 and $\lambda$ as 0.5. In the initialization stage, the momentum is set as 0.95. Since we do not normalize the input image, the batch size is set as 1. Other training settings differ across the benchmarks, so we elaborate on them below, respectively.
Method  Year  SHA  SHB  

MAE  MSE  MAE  MSE  
MCNN [Zhang_2016_CVPR_MCNN]  2016  110.2  173.2  26.4  41.3 
Switching CNN [8099912_switch]  2017  90.4  135.0  21.6  33.4 
SANet [Cao_2018_ECCV_SA]  2018  67.0  104.5  8.4  13.6 
CSRNet [Li_2018_CVPR_CSRNet]  2018  68.2  115.0  10.6  16.0 
icCNN [ranjan2018iterative]  2018  69.8  117.3  10.4  16.7 
CSRNet+PACNN [Shi_2019_CVPR_PACNN]  2019  62.4  102.0  7.6  11.8 
ADCrowdNet [Liu_2019_CVPR_ADCrowd]  2019  63.2  98.9  7.6  13.9 
PACNN [Shi_2019_CVPR_PACNN]  2019  66.3  106.4  8.9  13.5 
CAN [Liu_2019_CVPR_CAN]  2019  62.3  100.0  7.8  12.2 
BL [Ma_2019_ICCV_BL]  2019  62.8  101.8  7.7  12.7 
TEDnet [jiang2019crowd]  2019  64.2  109.1  8.2  12.8 
HACCN [sindagi2019ha]  2019  62.9  94.9  8.1  13.4 
Ours    61.1  104.7  7.5  12.0 
Method  Year  MAE  MSE 

MCNN [Zhang_2016_CVPR_MCNN]  2016  277.0  426.0 
Switching CNN [8099912_switch]  2017  228.0  445.0 
CL [Idrees_2018_ECCV_CL]  2018  132.0  191.0 
HACCN [sindagi2019ha]  2019  118.1  180.4 
RANet [zhang2019relational]  2019  111.0  190.0 
CAN [Liu_2019_CVPR_CAN]  2019  107.0  183.0 
TEDnet [jiang2019crowd]  2019  113.0  188.0 
SPN+L2SM [xu2019learn]  2019  104.7  173.6 
SDCNet [xiong2019open]  2019  104.4  176.1 
SFCN [wang2019learning]  2019  102.0  171.4 
DSSINet [Liu_2019_ICCV_DSSINet]  2019  99.1  159.2 
MBTTBFSCFB [Sindagi_2019_ICCV_MBTTBF]  2019  97.5  165.2 
Ours    92.9  159.2 
Method  Year  MAE  MSE 

MCNN [Zhang_2016_CVPR_MCNN]  2016  1.07  1.35 
Switching CNN [8099912_switch]  2017  1.62  2.10 
ConvLSTM [xiong2017spatiotemporal]  2017  1.30  1.79 
BSAD [huang2017body]  2017  1.00  1.40 
ACSCP [shen2018crowd]  2018  1.04  1.35 
CSRNet [Li_2018_CVPR_CSRNet]  2018  1.16  1.47 
SANet [Cao_2018_ECCV_SA]  2018  1.02  1.29 
SANet+SPANet [Cheng_2019_ICCV_SPANet]  2019  1.00  1.28 
ADCrowdNet [Liu_2019_CVPR_ADCrowd]  2019  0.98  1.25 
Ours    0.98  1.24 
Input image  Bayesian [Ma_2019_ICCV_BL]  CSRNet [Li_2018_CVPR_CSRNet]  Ours  Ground-truth 
Estimated count  464.4  481.5  460.4  460 
Estimated count  269.9  274.4  231.7  212 
Estimated count  797.8  917.5  758.1  760 
Estimated count  542.7  556.3  440.2  423 
Estimated count  260.5  255.7  252.2  250 
Estimated count  404.8  466.1  389.7  381 
IV-C Comparison results
In this section, we present the quantitative results on different benchmarks, including ShanghaiTech Part A and Part B [Zhang_2016_CVPR_MCNN], UCF-QNRF [Idrees_2018_ECCV_CL], and UCSD [chan2008privacy]. We demonstrate that the counting performance achieved by our simple network architecture is better than or comparable to the state-of-the-art models.
ShanghaiTech Part A/B: To train our model, we follow the general training protocol. When training the initial counting model, we train for 100 epochs, and the learning rate is decayed by 10 times after 60, 80, and 90 epochs, respectively. During distillation, the learning rate is reduced by 10 times after 20, 60, and 80 epochs. We compare our method with recent works, including MCNN [Zhang_2016_CVPR_MCNN], Switching CNN [8099912_switch], SANet [Cao_2018_ECCV_SA], CSRNet [Li_2018_CVPR_CSRNet], icCNN [ranjan2018iterative], PACNN [Shi_2019_CVPR_PACNN], ADCrowdNet [Liu_2019_CVPR_ADCrowd], CAN [Liu_2019_CVPR_CAN], BL [Ma_2019_ICCV_BL], TEDnet [jiang2019crowd], and HACCN [sindagi2019ha]. The results are shown in Table II. Although our model is based on a vanilla CNN architecture, our performance is comparable to the state-of-the-art works. For reference, our network architecture is similar to that of CSRNet [Li_2018_CVPR_CSRNet], except that our network does not incorporate dilated convolution layers. As observed, despite our simple architecture, our result is significantly better than that of CSRNet.

UCF-QNRF: To train and validate our model on this dataset, we scale the long edge of each crowd image to 1080 pixels, while maintaining the original aspect ratio. For data augmentation, we crop 4 patches from each image, each patch being 1/4 of the original image size. The learning rate is decayed by 10 times at three scheduled epochs during initial training. During the distillation stage, the learning rate decays by 10 times after 80, 100, and 110 epochs, respectively. The comparison results of our method against the state-of-the-art methods, including MCNN [Zhang_2016_CVPR_MCNN], Switching CNN [8099912_switch], CL [Idrees_2018_ECCV_CL], HACCN [sindagi2019ha], RANet [zhang2019relational], CAN [Liu_2019_CVPR_CAN], TEDnet [jiang2019crowd], SPN+L2SM [xu2019learn], SDCNet [xiong2019open], SFCN [wang2019learning], DSSINet [Liu_2019_ICCV_DSSINet], and MBTTBF-SCFB [Sindagi_2019_ICCV_MBTTBF], are shown in Table III. As observed, in terms of MAE, our model outperforms the latest methods (e.g. [Sindagi_2019_ICCV_MBTTBF]) by more than 4 points.
UCSD: During training, we upscale the images by two times and leverage the provided regions of interest. We first set the Gaussian kernel size as 7 to generate the ground-truth density maps, and then set it as 5 and 3 to produce density maps for distillation, respectively. Other settings follow the general training protocol. As depicted in Table IV, compared with MCNN [Zhang_2016_CVPR_MCNN], Switching CNN [8099912_switch], ConvLSTM [xiong2017spatiotemporal], BSAD [huang2017body], ACSCP [shen2018crowd], CSRNet [Li_2018_CVPR_CSRNet], SANet [Cao_2018_ECCV_SA], SPANet [Cheng_2019_ICCV_SPANet], and ADCrowdNet [Liu_2019_CVPR_ADCrowd], our model shows state-of-the-art performance. Since each image of this dataset contains a small number of people, the differences amongst the comparison methods are not significant. Even so, our approach is still comparable to the latest methods (e.g. [Liu_2019_CVPR_ADCrowd]).
Qualitative results: In Fig. 4, we show several examples in which our results are compared against those of two representative crowd counting approaches, the Bayesian-loss-based method [Ma_2019_ICCV_BL] and CSRNet [Li_2018_CVPR_CSRNet]. Perceptually, since our model generates density maps with a relatively higher resolution, the visual quality of our produced density maps is much better and closer to the ground-truth.
Input image  Ground-truth  
Estimated count  697  686  678  679  589 
IV-D Ablation study
We perform the ablation study on the ShanghaiTech Part A dataset [Zhang_2016_CVPR_MCNN]. We thoroughly dissect the components of our method, including our generated ground-truth density maps, the multi-task loss of our network, and our distillation method.
Density map generation  MAE  Distillation  MAE 

+ Fixed  67.91  +  62.55 
+ Adaptive  66.25  +  61.15 
+ Nonuniform  64.58  +  61.09 
+ Ours  62.55  +  61.79 
+  62.72  
Single loss  68.86  Multitask loss  66.25 
Density map generation. We apply different strategies, including a fixed kernel size ("+Fixed"), the geometry-adaptive kernel size ("+Adaptive"), non-uniform kernel estimation ("+Non-uniform"), and our proposed perspective-aware kernel estimation, to generate "ground-truth" density maps from point annotations to supervise our model. For comparison, the crowd counting network is trained without distillation. The results are depicted in the left part of Table V. As observed, our generated density maps benefit the performance of the crowd counting network, significantly outperforming the others by at least 2 points in terms of MAE. In Fig. 6, we illustrate several examples to show the difference between our proposed density map generator and previous methods. As observed, with the default settings, our approach produces sharper density maps than prior methods, which delivers more accurate spatial information to the counting model and thus leads to better estimated counts.
Distillation. In the right part of Table V, we iteratively run our distillation algorithm and measure the performance of the model at each stage. As observed, the distillation progressively promotes the counting performance of the model from 62.55 to 61.09, which is comparable to the state-of-the-art methods. But when the kernel size shrinks to 5, the performance cannot improve further, due to the limit of distillation. After each distillation stage, the model learns new supervision information on top of the previous stage, and as the Gaussian kernel decreases, the density maps carry sharper position information. For reference, we also use the density maps generated with the smallest kernel size to train our model from scratch. As observed, its result is worse than that of our distilled model.
We illustrate an example of distillation in Fig. 5. Particularly, we observe that, after distilling the model for two time steps, the results of the model are obviously improved, which may be due to the fact that the kernels of the density distribution near the camera become smaller and sharper. After distilling our model for more than two time steps, the performance hardly gains further improvement.
Multi-task loss. In the bottom row of Table V, our baseline is a vanilla CNN-based model without the extra branch for supervising the low-resolution density map (i.e., single loss). Both networks are trained using the density maps produced by the geometry-adaptive kernel. As observed, its MAE (i.e., 68.86) is significantly worse than that of the model with the multi-task loss (i.e., 66.25).
V Conclusion and Future Works
In this paper, we propose a perspective-aware density map generation method that adaptively produces ground-truth density maps from point annotations, enabling the trained crowd counting model to accomplish superior performance over prior density map generation techniques. Besides, leveraging our density map generation method, we propose an iterative distillation algorithm to progressively enhance our model with identical network structures, without significantly sacrificing the resolution of the output density maps. In experiments, we demonstrate that, with our simple convolutional neural network architecture strengthened by our proposed training algorithm, our model outperforms or is comparable with the state-of-the-art methods.
Although our model obtains satisfactory performance with a simple architecture, we have not validated whether our approach can be adapted to more complex network models. As future work, we will explore the possibility of transferring our algorithm to advanced network models. On the other hand, distilling networks is time-consuming and there are many hyperparameters to tune. We will investigate more efficient algorithms for this purpose.