1 Introduction
Given crowded images, a crowd counting model estimates the number of people contained in a given area. If performed accurately, this has far-reaching applications in crowd monitoring, urban planning, traffic management, and disaster relief [57, 51, 52]. With advanced occlusion robustness and counting efficiency, counting-by-density [29, 75, 10, 25] has become the method of choice over other related techniques [30, 21, 12, 24]. This strategy estimates a pixel-level density map and counts the crowd by summing over the pixels in the given area.

Although effective, counting-by-density is still challenged by scale variations induced by perspective distortion. As illustrated in Figure 1, the scales of people vary inversely with their distances to the camera, causing an inconsistent crowd distribution in the density map. To address this issue, most methods [75, 10, 40] employ deep Convolutional Neural Networks (CNNs) to exploit multi-scale features for density estimation in multi-scaled scenes. In particular, different-sized filters are arranged in parallel in multiple columns to capture multi-scale features for accurate counting in [75, 44, 53], while in [10, 25, 26], different filters are grouped into blocks and then stacked sequentially in one column. At the heart of these solutions, multi-scale capability originates from the compositional nature of CNNs [7, 67], where convolutions with various receptive fields are composed hierarchically by hand. However, these manual designs demand prohibitive expert effort and suffer from heuristically fixed receptive fields.
We therefore develop a Neural Architecture Search (NAS) [77, 48] based approach to automatically discover multi-scale counting-by-density models. NAS is enabled by the compositional nature of CNNs and guided by human expertise in designing task-specific search spaces and strategies. For vision tasks, NAS first bloomed in image-level classification [78, 35, 47, 46], where novel architectures were found to progressively transform spatial details into semantically deep features. Counting-by-density, however, is a pixel-level task that requires spatially preserving architectures with restrained down-sampling strides. Accordingly, the successes of NAS in image classification are not immediately transferable to crowd counting. Although attempts have been made to deploy NAS in image segmentation for pixel-level classification [13, 34, 42], they still cannot address counting-by-density, which is a pixel-level regression with scale variations across the inputs.
In NAS-Count we propose a counting-oriented NAS framework with a specific search space, search strategy, and supervision method, which we use to develop our Automatic Multi-Scale Network (AMSNet). First, we employ a two-level search space [54, 34]. On the micro level, multi-scale cells are automatically explored to extract and fuse multi-scale features sufficiently. Pooling operations are limited to preserve spatial fidelity, and dilated convolutions are utilized instead for receptive field enlargement. On the macro level, multi-path encoder-decoder architectures are searched to produce a high-quality density map. The fully-convolutional encoder-decoder is the architecture of choice for pixel-level tasks [49, 71, 10], and the multi-path variant can better aggregate features encoded at different scales [33, 26, 38]. Second, we adopt a differentiable one-shot search strategy [37, 34, 70] to improve search efficiency, wherein architecture parameters are jointly optimized with a gradient-based optimizer. Third, to address the pixel-level isolation problem [10, 31] of the traditional mean square error (MSE) loss, we introduce a novel Scale Pyramid Pooling Loss (SPPLoss) to optimize AMSNet, which leverages a pyramidal pooling architecture to enforce supervision with multi-scale structural awareness. By jointly searching AMSNet and SPPLoss, NAS-Count flexibly exploits multi-scale features and addresses the scale variation issue in counting-by-density. NAS-Count is illustrated in Figure 2.
Main contributions of NAS-Count include:

- To the best of our knowledge, NAS-Count is the first attempt at introducing NAS to crowd counting, where a multi-scale architecture is automatically developed to address the scale variation issue.
- A counting-specific two-level search space is developed in NAS-Count, from which a spatially preserving encoder-decoder architecture (AMSNet) is discovered efficiently with a differentiable search strategy using stochastic gradient descent (SGD).
- A novel Scale Pyramid Pooling Loss (SPPLoss) is searched automatically to replace MSE supervision, which helps produce higher-quality density maps by optimizing structural information at multiple scales.
- By jointly searching AMSNet and SPPLoss, NAS-Count reports the best overall counting and density estimation performance on four challenging benchmarks, considerably surpassing other state-of-the-art methods, which all require demanding expert involvement.
2 Related work
In this section, we briefly review related literature in the domains of crowd counting and neural architecture search.
2.1 Crowd Counting Literature
Existing counting methods can be categorized into counting-by-detection [17, 21, 30, 63], counting-by-regression [11, 50, 24, 74, 28], and counting-by-density strategies. For comprehensive surveys of crowd counting, please refer to [51, 52, 57, 27]. The first strategy is vulnerable to occlusions due to its requirement of explicit detection. Counting-by-regression avoids this requirement by directly regressing to a scalar count, but forfeits the ability to localize the crowd. The counting-by-density strategy, initially introduced in [29], counts the crowd by first estimating a density map using hand-crafted [29, 20] or deep CNN [75, 31, 64, 39] features, then summing over all pixel values in the map. Since counting-by-density is a pixel-level regression task, the CNN architectures deployed for it tend to follow the encoder-decoder formulation. To handle scale variations with multi-scale features, single-column [10, 25, 65] and multi-column [75, 5, 44, 62] encoders have been used, where different-sized convolution kernels are arranged sequentially or in parallel to extract features. For the decoder, hour-glass architectures with a single decoding path have been adopted [10, 25, 76], while a novel multi-path variant is gaining increasing attention for superior multi-scale feature aggregation [26, 38, 43, 42].

2.2 NAS Fundamentals
Since its inception, NAS [77, 48] has been intensively studied with favorable outcomes [18, 72]. The general efforts in developing new NAS solutions focus on designing new search spaces and search strategies. For the search space, existing methods can be categorized into searching the network (macro) space [48, 77], the cell (micro) space [78, 35, 47, 37, 45], or exploring such a two-level space jointly [54, 34]. The cell-based search space is the most popular, where the ensemble of cells in the network is hand-engineered to reduce the exponential search space for fast computation. The search strategy is essentially an optimizer that finds the best architecture maximizing a targeted task objective. Random search [4, 23], reinforcement learning [77, 78, 61, 8, 1], neuro-evolutionary algorithms [60, 48, 41, 36, 47, 66], and gradient-based methods [37, 68, 9] have been used to solve this optimization problem, but the first three suffer from prohibitive computation costs. Although many attempts have been made to accelerate them, such as parameter sharing [45, 19, 8, 3], hierarchical search [35, 34], and deploying proxy tasks with cheaper search spaces [78] and training procedures [2], they remain far less efficient than gradient-based methods. Gradient-based NAS, represented by DARTS [37], follows the one-shot strategy [6], wherein a hyper-graph is established using differentiable architectural parameters. Based on the hyper-graph, an optimal sub-graph is explored within it by solving a bi-level optimization with gradient-descent optimizers. Notably, this one-shot strategy is inherently memory-heavy, and the bi-level optimization is prone to collapse with excessively selected skip-connections. These drawbacks of the one-shot search strategy are further investigated in DARTS variants including DARTS+ [32], P-DARTS [15], PC-DARTS [70], Single-Path NAS [59], and AutoDeepLab [34].
2.3 NAS Applications
NAS has shown great promise with discovered recurrent and convolutional neural networks in both sequential language modeling [58] and multi-level vision tasks. In computer vision, NAS has excelled at image-level classification [78, 46, 35, 47], which is a customary starting point for developing new classifiers that output spatially coarsened labels. NAS was later extended to both bounding-box and pixel-level tasks, represented by object detection [22, 16, 69] and segmentation [13, 34, 42], where the search spaces are modified to better preserve the spatial information in the feature map. In [13], a pixel-level-oriented search space and a random-search NAS were introduced for the segmentation task. In [42], a similar search space was adopted, but the authors employed a reinforcement-learning-based search method. Nonetheless, both methods suffer from formidable computation costs and are orders of magnitude slower than NAS-Count. In [34], the authors searched a two-level space with a more efficient gradient-based method, yet that work is dedicated to the pixel-level classification of semantic segmentation, which still differs from the per-pixel regression in counting-by-density.

3 NAS-Count Methodology
NAS-Count efficiently searches a multi-scale encoder-decoder network, the Automatic Multi-Scale Network (AMSNet), in a counting-specific search space, and optimizes it with a jointly searched Scale Pyramid Pooling Loss (SPPLoss); both are shown in Figure 2. The encoder and decoder in AMSNet consist of searched multi-scale feature extraction cells and multi-scale feature fusion cells, respectively, and SPPLoss deploys a two-stream pyramidal pooling architecture whose pooling cells are searched as well. By searching AMSNet and SPPLoss together end-to-end, the receptive fields established in the two architectures can cooperate to harvest the multi-scale capability needed to address the scale-varied counting problem. NAS-Count details are discussed in the following subsections.
3.1 Automatic Multi-Scale Network
AMSNet is searched with a differentiable one-shot strategy in a two-level search space. To address scale variations in the pixel-level counting-by-density problem, we search for a multi-path encoder-decoder architecture, where multi-scale features are adaptively extracted and aggregated. In particular, the down-sampling strides are limited in AMSNet to preserve spatial fidelity. To improve search efficiency, NAS-Count adopts the continuous relaxation and partial channel connections described in [70].
3.1.1 AMSNet Encoder
The encoder of AMSNet is composed of a set of multi-scale feature extraction cells. The $i$-th cell in the encoder takes the outputs of the previous two cells, feature maps $F_{i-2}$ and $F_{i-1}$, as input and produces an output feature map $F_i$. We define each cell as a directed acyclic graph containing $N$ nodes $\{x_1, \dots, x_N\}$, each representing a propagated feature map. We set $N=7$: two input nodes, four intermediate nodes, and one output node. Each directed edge in a cell indicates a convolutional operation performed between a pair of nodes, searched from a space of 9 operations:
- 3×3 common convolution;
- 3×3, 5×5, 7×7 dilated convolutions with rate 2;
- 3×3, 5×5, 7×7 depth-wise separable convolutions;
- skip-connection;
- no-connection (zero).
To preserve spatial fidelity in the extracted features, the extraction cell involves no down-sampling operations. To compensate with enlarged receptive fields, we utilize dilated convolutions in place of normal ones. Besides, we adopt depth-wise separable convolutions to keep the searched architecture parameter-friendly. Skip-connections instantiate the residual learning scheme, which helps improve multi-scale capacity and enhance gradient flow during back-propagation.
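To make the search space concrete, the nine candidate operations can be sketched as a PyTorch operation table. This is an illustrative fragment under the kernel sizes reconstructed above (the names and padding choices are ours), not the authors' released code:

```python
import torch.nn as nn

class Zero(nn.Module):
    """The no-connection operation: an all-zero map of the input's shape."""
    def forward(self, x):
        return x.mul(0.0)

# The nine candidate edge operations; C is the cell channel width, kept
# constant since extraction cells are "normal" cells.
OPS = {
    'conv_3x3': lambda C: nn.Conv2d(C, C, 3, padding=1),
    'dil_3x3':  lambda C: nn.Conv2d(C, C, 3, padding=2, dilation=2),
    'dil_5x5':  lambda C: nn.Conv2d(C, C, 5, padding=4, dilation=2),
    'dil_7x7':  lambda C: nn.Conv2d(C, C, 7, padding=6, dilation=2),
    'sep_3x3':  lambda C: nn.Sequential(nn.Conv2d(C, C, 3, padding=1, groups=C),
                                        nn.Conv2d(C, C, 1)),
    'sep_5x5':  lambda C: nn.Sequential(nn.Conv2d(C, C, 5, padding=2, groups=C),
                                        nn.Conv2d(C, C, 1)),
    'sep_7x7':  lambda C: nn.Sequential(nn.Conv2d(C, C, 7, padding=3, groups=C),
                                        nn.Conv2d(C, C, 1)),
    'skip':     lambda C: nn.Identity(),
    'none':     lambda C: Zero(),
}
```

Note that the dilated paddings are chosen so every operation preserves the spatial resolution, consistent with the no-down-sampling constraint above.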
Within each cell, a specific intermediate node $x_j$ is connected to all previous nodes $x_i$, $i < j$. Edges are established between every pair of connected nodes $x_i$ and $x_j$, forming a densely-connected hyper-graph. On a given edge $(i, j)$ in the graph, following the continuously-relaxed differentiable search discussed in [37], the associated operation is defined as a summation of all possible operations weighted by the architectural parameter $\alpha$:
$$\bar{o}^{(i,j)}(x_i) = \sum_{o \in \mathcal{O}} \sigma\big(\alpha_o^{(i,j)}\big) \cdot o\big(S_{i,j} * x_i\big) + \big(1 - S_{i,j}\big) * x_i \qquad (1)$$
In the above equation, $\sigma$ is a softmax function and $|\mathcal{O}|$ indicates the volume of the micro-level search space. The binary vector $S_{i,j}$ performs a channel-wise sampling on $x_i$, where $1/K$ of the channels are randomly selected to improve search efficiency; $K$ is set to 4 as proposed in [70]. $\alpha_o^{(i,j)}$ is a learnable parameter denoting the importance of each operation $o$ on edge $(i, j)$. In addition, each edge is also associated with another architecture parameter $\beta^{(i,j)}$ indicating its importance. Accordingly, an intermediate node is computed as a weighted sum over all edges connected to it:
$$x_j = \sum_{i \in \mathcal{P}_j} \sigma\big(\beta^{(i,j)}\big) \cdot \bar{o}^{(i,j)}(x_i) \qquad (2)$$
Here $\mathcal{P}_j$ includes all previous nodes in the cell, and the output of the cell is a concatenation of all its intermediate nodes. The cell architecture is determined by the two architectural parameters $\alpha$ and $\beta$, which are jointly optimized with the weights of the convolutions through a bi-level optimization; for details please refer to [37]. To recover a deterministic architecture from the continuous relaxation, the most important edges and their associated operations are selected by taking the largest softmax-normalized values of $\alpha$ and $\beta$.
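The relaxed computation of Eqs. 1 and 2 can be sketched as follows, assuming the PC-DARTS-style partial channel sampling of [70]. The function names are placeholders, and the channel masking is a readability-oriented simplification of the original channel-shuffled concatenation:

```python
import torch
import torch.nn.functional as F

def mixed_edge(x, ops, alpha, K=4):
    """Eq. (1): softmax(alpha)-weighted sum of candidate ops, applied to a
    randomly sampled 1/K of the channels; the rest bypass the edge untouched."""
    C = x.size(1)
    idx = torch.randperm(C, device=x.device)[: C // K]
    mask = torch.zeros(1, C, 1, 1, device=x.device)
    mask[:, idx] = 1.0
    sampled = mask * x                          # S_{i,j} * x_i
    weights = F.softmax(alpha, dim=0)           # sigma(alpha^{(i,j)})
    mixed = sum(w * op(sampled) for w, op in zip(weights, ops))
    return mixed + (1.0 - mask) * x             # + (1 - S_{i,j}) * x_i

def node_output(inputs, edge_ops, alphas, beta):
    """Eq. (2): edge-importance (beta) weighted sum over incoming edges."""
    edge_w = F.softmax(beta, dim=0)             # sigma(beta^{(i,j)})
    return sum(edge_w[i] * mixed_edge(x, edge_ops[i], alphas[i])
               for i, x in enumerate(inputs))
```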
In the encoder, we first apply a convolution to preliminarily encode the input image into a shallow feature map. Afterwards, two convolutions are inserted after the first and third extraction cells, each doubling the channel dimension of the features. The searched extraction cell is a normal cell that keeps the feature channel dimension unchanged. Spatially, we reduce the feature resolution only twice, with stride-2 max pooling layers, to preserve the spatial fidelity of the features, and double the channels before each of the two down-sampling operations. Additionally, within each extraction cell, an extra convolution is attached to each input node, adjusting its channel dimension to one-fourth of the cell's final output dimension.

3.1.2 AMSNet Decoder
The decoder of AMSNet deploys a multi-scale feature fusion cell followed by an up-sampling module. We construct the hyper-graph of the fusion cell to take multiple features as input while producing a single output, conforming to the aggregative nature of a decoder. The search in this hyper-graph is similar to that of the extraction cell. A fusion cell takes three encoder output feature maps as input and consists of $N=6$ nodes: three input nodes, two intermediate nodes, and one output node. After the relaxation formulated in Eqs. 1 and 2, the architecture of a fusion cell is determined by its associated architecture parameters $\alpha$ and $\beta$. By optimizing $\beta$ on the three edges connecting the decoder with three extraction cells in the encoder, NAS-Count fully explores the macro-level architecture of AMSNet, such that different single- or multi-path encoder-decoder formulations are automatically searched to discover the best feature aggregation for producing high-quality density maps.
As shown in Figure 2, $M$ denotes the number of extraction cells in the encoder and $C$ is the number of channels in the output of the last cell. To improve efficiency, we first employ a smaller proxy network, with $M=6$ and $C=256$, to search the cell architecture. Upon deployment, we enlarge the network to $M=8$ and $C=512$ for better performance. Through the multi-scale aggregation instantiated in the decoder, we obtain a feature map with 128 channels, which is then processed by an up-sampling module containing two convolutions interleaved with nearest-neighbor interpolation layers. The output of the up-sampling module is a single-channel density map with restored spatial resolution, which is then used to compute the SPPLoss.
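For illustration, such an up-sampling module might look like the following sketch; the kernel sizes and the intermediate width are assumptions, since the text only fixes the 128-channel input and the single-channel output:

```python
import torch.nn as nn

# Illustrative up-sampling head: two convolutions interleaved with
# nearest-neighbor interpolation, undoing the encoder's two 2x poolings.
upsampling_module = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, kernel_size=3, padding=1),  # width halving is assumed
    nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 1, kernel_size=3, padding=1),    # single-channel density map
)
```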
3.2 Scale Pyramid Pooling Loss
The default loss for optimizing counting-by-density models is the per-pixel mean square error (MSE) loss. Supervising only this pixel-wise difference between the estimated density map and the corresponding ground truth assumes strong pixel-level isolation, and thus fails to reflect structural differences in multi-scale regions [10, 31]. Motivated by the Atrous Spatial Pyramid Pooling (ASPP) module designed in [14], we address this problem of MSE with a new supervision architecture, the Scale Pyramid Pooling Loss (SPPLoss), in which non-parametric pooling layers are stacked into a two-stream pyramid. As shown in Figure 4, the estimated map and the ground truth are fed into the two streams and progressively coarsened, and MSE losses are calculated between the pooled maps at each level. This is equivalent to computing structural differences with increasing region-level receptive fields, and can therefore better supervise the pixel-level estimation model at different scales. Instead of setting the pooling layers manually as in [26], NAS-Count searches the most effective SPPLoss architecture jointly with AMSNet, so that the multi-scale capabilities composed in both architectures can better collaborate to resolve the scale variation problem in counting-by-density. Specifically, each stream in SPPLoss deploys four cascaded nodes: one input node, which is the predicted density map (or the given ground truth), and three further nodes produced by three cascaded searched pooling layers. The search space for the operation performed on each edge contains six different pooling layers:
- max pooling layers with stride 2, in three kernel sizes;
- average pooling layers with stride 2, in three kernel sizes.
The search for SPPLoss adopts a differentiable strategy similar to that detailed in Section 3.1. Notably, since SPPLoss is inherently a pyramid, its macro-level search space takes a cascaded form instead of a densely-connected hyper-graph. Accordingly, we only need to optimize the operation-wise architecture parameter $\alpha$:
$$\bar{o}^{(l)}(x_l) = \sum_{o \in \mathcal{O}_p} \sigma\big(\alpha_o^{(l)}\big) \cdot o(x_l) \qquad (3)$$
Here, $o$ indexes the 6 different pooling operations in $\mathcal{O}_p$, and $x_l$ represents an estimated map or ground truth at level $l$. Since both have only one channel, we do not apply partial channel connections (i.e., $K$ is set to 1). The same cascaded architecture is shared by both streams of SPPLoss. Using the best searched architecture depicted in Figure 4, SPPLoss is computed as:
$$L_{SPP} = \sum_{l=0}^{3} \frac{1}{M_l} \big\| P^{l}(\hat{Y}) - P^{l}(Y) \big\|_2^2 \qquad (4)$$
$M_l$ denotes the number of pixels in the map at level $l$, $P^{l}$ indicates the searched pooling operation, and the superscript $l$ is the layer index ranging from 0 to 3. $P^{0}$ is the special case where the MSE loss is computed directly between the estimated map $\hat{Y}$ and the ground truth $Y$.
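Given the searched pooling operation per level, Eq. 4 reduces to a few lines; in this sketch the `pools` list stands in for the discretized operations $P^{1}$ to $P^{3}$:

```python
import torch.nn.functional as F

def spp_loss(pred, gt, pools):
    """Eq. (4): MSE accumulated over a two-stream pooling pyramid.

    pred, gt: (B, 1, H, W) density maps; pools: the three searched pooling
    layers (level 0 is the identity, i.e. plain pixel-wise MSE).
    F.mse_loss averages over pixels, matching the 1/M_l normalization.
    """
    loss = F.mse_loss(pred, gt)             # level 0: no pooling
    for pool in pools:                      # levels 1..3
        pred, gt = pool(pred), pool(gt)     # coarsen both streams in lock-step
        loss = loss + F.mse_loss(pred, gt)
    return loss
```

After discretization, `pools` is simply a list of three concrete layers, e.g. `[nn.MaxPool2d(2), nn.AvgPool2d(2), nn.MaxPool2d(2)]`; this particular sequence is illustrative, not the searched result.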

4 Experiments
We search and evaluate AMSNet and SPPLoss on the ShanghaiTech [75], WorldExpo’10 [74], UCF_CC_50 [24] and UCF-QNRF [25] crowd counting datasets.
4.1 Implementation Details
The original annotations provided by these datasets are coordinates pinpointing the location of each individual in the crowd. To soften these hard regression labels for better convergence, we apply a normalized 2D Gaussian filter that converts each coordinate map into a density map, on which every individual is represented by a Gaussian response with a radius of 15 pixels [65].
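A minimal sketch of this conversion, assuming a fixed-width Gaussian whose truncated support matches the 15-pixel radius (the authors' exact kernel normalization may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_density(points, height, width, radius=15):
    """Rasterize head coordinates into a density map whose sum is the count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:                               # one unit of mass per person
        density[min(int(y), height - 1), min(int(x), width - 1)] += 1.0
    # A normalized Gaussian spreads each unit while preserving the total count.
    # Mapping the 15-pixel radius to sigma = radius / 4 (so truncate * sigma
    # spans the radius) is our assumption, not the paper's specification.
    return gaussian_filter(density, sigma=radius / 4.0, truncate=4.0)
```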
4.1.1 Architecture Search
The architectures of AMSNet and SPPLoss, i.e., their corresponding architecture parameters $\alpha$ and $\beta$, are jointly searched on the split UCF-QNRF [25] training set. We perform the search on this dataset because it has the most challenging scenes with large crowd counts and density variations; the search costs approximately 21 TITAN Xp GPU hours. Benefiting from the continuous relaxation, we optimize all architecture parameters and network weights jointly using gradient descent. Specifically, the first-order optimization proposed in [37] is adopted, in which the architecture parameters and the network weights are optimized alternately. For the architecture parameters, we set the learning rate to 6e-4 with a weight decay of 1e-3. We follow the implementation of [70, 34], in which a warm-up training of the network weights is first conducted for 40 epochs, and the search is stopped early at 80 epochs. For training the network weights, we use a cosine learning rate decaying from 0.001 to 0.0004 and a weight decay of 1e-4. Data augmentation with random sampling, flipping, and rotation is conducted to alleviate overfitting.
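The alternating optimization can be sketched as two interleaved optimizers over disjoint parameter sets. In this sketch, `model`, the two loaders, `criterion` (the relaxed SPPLoss), and the choice of momentum SGD for the weights are assumptions; the hyper-parameters are taken from the text:

```python
import torch

# Architecture parameters (alpha, beta) vs. convolution weights.
arch_optimizer = torch.optim.Adam(model.arch_parameters(),
                                  lr=6e-4, weight_decay=1e-3)
weight_optimizer = torch.optim.SGD(model.weight_parameters(), lr=1e-3,
                                   momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    weight_optimizer, T_max=80, eta_min=4e-4)   # cosine decay 0.001 -> 0.0004

for epoch in range(80):                         # search stops early at 80 epochs
    for (x_w, y_w), (x_a, y_a) in zip(train_loader, val_loader):
        if epoch >= 40:                         # 40-epoch weight warm-up first
            arch_optimizer.zero_grad()
            criterion(model(x_a), y_a).backward()   # first-order alpha/beta step
            arch_optimizer.step()
        weight_optimizer.zero_grad()
        criterion(model(x_w), y_w).backward()       # weight step
        weight_optimizer.step()
    scheduler.step()
```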

4.1.2 Architecture Training
After the architectures of AMSNet and SPPLoss are acquired by searching on the UCF-QNRF dataset, we re-train the network weights from scratch on each of the other datasets. We re-initialize the weights with Xavier initialization and employ Adam with an initial learning rate of 1e-3, decayed by a factor of 0.8 every 15K iterations.
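In PyTorch terms, the re-training setup amounts to the following sketch, with `model`, `train_loader`, and `criterion` (the searched SPPLoss) as placeholders:

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Xavier initialization for all convolutions, as described above."""
    if isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Stepping the scheduler once per iteration makes step_size count iterations,
# giving the 0.8 decay every 15K iterations described in the text.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15000, gamma=0.8)

for images, gt_density in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), gt_density)
    loss.backward()
    optimizer.step()
    scheduler.step()
```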
4.1.3 Architecture Evaluation
Upon deployment, we feed whole images into AMSNet without patching, obtaining high-quality density maps free from boundary artifacts. In counting-by-density, the crowd count on an estimated density map equals the summation over all pixels. To evaluate counting performance, we employ the mean absolute error (MAE) and the mean squared error (MSE) metrics:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \big| C_i - C_i^{GT} \big| \qquad (5)$$
$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( C_i - C_i^{GT} \big)^2} \qquad (6)$$
where $N$ is the number of images in the test set, and $C_i$ and $C_i^{GT}$ represent the predicted and ground-truth counts of the $i$-th image. We also utilize the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) metrics to evaluate density map quality [56].
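In code, the two counting metrics reduce to a few lines (a sketch; note that, per Eq. 6, the reported "MSE" is actually a root-mean-square error):

```python
import numpy as np

def counting_metrics(pred_maps, gt_counts):
    """MAE (Eq. 5) and MSE (Eq. 6) over a test set.

    pred_maps: predicted density maps; gt_counts: ground-truth counts.
    The predicted count of an image is the sum over its density map.
    """
    pred_counts = np.array([float(m.sum()) for m in pred_maps])
    errors = pred_counts - np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(errors).mean()
    mse = np.sqrt((errors ** 2).mean())   # root of the mean squared error
    return mae, mse
```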
Table 1: Counting accuracy comparison (MAE and MSE, lower is better) on ShanghaiTech, UCF_CC_50, and UCF-QNRF.

| Method | Part_A MAE | Part_A MSE | Part_B MAE | Part_B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | UCF-QNRF MAE | UCF-QNRF MSE |
|---|---|---|---|---|---|---|---|---|
| Zhang et al. [74] | 181.8 | 277.7 | 32.0 | 49.8 | 467.0 | 498.5 | – | – |
| MCNN [75] | 110.2 | 173.2 | 26.4 | 41.3 | 377.6 | 509.1 | 277 | 426 |
| CP-CNN [56] | 73.6 | 106.4 | 20.1 | 30.1 | 295.8 | 320.9 | – | – |
| CSRNet [31] | 68.2 | 115.0 | 10.6 | 16.0 | 266.1 | 397.5 | – | – |
| SANet [10] | 67.0 | 104.5 | 8.4 | 13.6 | 258.4 | 334.9 | – | – |
| TEDNet [26] | 64.2 | 109.1 | 8.2 | 12.8 | 249.4 | 354.5 | 113 | 188 |
| ANF [73] | 63.9 | 99.4 | 8.3 | 13.2 | 250.2 | 340.0 | 110 | 174 |
| PACNN+ [55] | 62.4 | 102.0 | 7.6 | 11.8 | 241.7 | 320.7 | – | – |
| CAN [39] | 62.3 | 100.0 | 7.8 | 12.2 | 212.2 | 243.7 | 107 | 183 |
| AMSNet | 58.0 | 96.2 | 7.1 | 10.4 | 208.6 | 296.3 | 103 | 165 |
4.2 Search Result Analysis
The best searched multi-scale feature extraction and fusion cells, as well as the SPPLoss architecture, are illustrated in Figure 4. As shown, the extraction cell keeps the spatial and channel dimensions unchanged (1×1 convolutions are employed to manipulate channel dimensions inside the cells). The extraction cell primarily exploits dilated convolutions over normal ones, conforming to the fact that, in the absence of heavy down-sampling, pixel-level models rely on dilation to enlarge receptive fields. Furthermore, different kernel sizes are employed in the extraction cell, reflecting its multi-scale capability for addressing scale variations. By taking in three encoded features and generating one output map, the fusion cell constitutes a multi-path decoding hierarchy, in which primarily non-dilated convolutions with smaller kernels are selected to aggregate features precisely and parameter-efficiently. In addition, the deployed skip-connections and varied kernels contribute to the decoder's multi-scale capability.
Table 2: Counting accuracy (MAE), density map quality (PSNR, SSIM), and model size on ShanghaiTech Part_A.

| Method | MAE | PSNR | SSIM | Size |
|---|---|---|---|---|
| MCNN [75] | 110.2 | 21.4 | 0.52 | 0.13M |
| Switch-CNN [53] | 90.4 | – | – | 15.11M |
| CP-CNN [56] | 73.6 | 21.72 | 0.72 | 68.4M |
| CSRNet [31] | 68.2 | 23.79 | 0.76 | 16.26M |
| SANet [10] | 67.0 | – | – | 0.91M |
| TEDNet [26] | 64.2 | 25.88 | 0.83 | 1.63M |
| ANF [73] | 63.9 | 24.1 | 0.78 | 7.9M |
| AMSNet | 58.0 | 26.96 | 0.89 | 3.79M |
| AMSNet_light | 61.8 | 25.93 | 0.84 | 1.51M |
4.3 Ablation Study on Searched Architectures
For ablation purposes, we employ the architecture proposed in [10] as the baseline encoder (composed of four inception-like blocks). The baseline decoder cascades two convolutions interleaved with nearest-neighbor interpolation layers, and the plain MSE loss serves as the baseline supervision. By swapping each searched module with its baseline counterpart, we obtain the ablation results on the ShanghaiTech Part_A dataset reported in Table 3. The table is partitioned row-wise into three groups, each row indicating a specific configuration; the MAE and PSNR metrics measure counting accuracy and density map quality.
Architectures in the first two groups (four rows) are optimized with the plain MSE loss. As shown, the searched AMSNet encoder improves counting accuracy and density map quality by 12.0% and 7.8% respectively, while the searched decoder brings 8.5% and 1.7% improvements. In the third group, AMSNet is supervised by different loss functions to demonstrate their efficacy. The Spatial Abstraction Loss (SAL) proposed in [26] adopts a hand-designed pyramidal architecture, and surpasses plain MSE supervision in both counting and density estimation performance. These improvements are further enhanced by deploying SPPLoss, showing that the searched pyramid benefits counting and density estimation by supervising multi-scale structural information.

Table 3: Ablation study on ShanghaiTech Part_A.

| Module ablated | Configuration | MAE | PSNR |
|---|---|---|---|
| Encoder | Baseline Encoder + Baseline Decoder | 69.1 | 23.54 |
| Encoder | AMSNet Encoder + Baseline Decoder | 60.8 | 25.52 |
| Decoder | Baseline Encoder + Baseline Decoder | 69.1 | 23.54 |
| Decoder | Baseline Encoder + AMSNet Decoder | 63.2 | 23.94 |
| Supervision | AMSNet + MSE | 59.4 | 25.96 |
| Supervision | AMSNet + SAL | 58.7 | 26.20 |
| Supervision | AMSNet + SPPLoss | 58.0 | 26.96 |
4.4 Hyper-parameter Study
Due to the heavy deployment of the extraction cell, the size and performance of AMSNet depend largely on two hyper-parameters, $M$ and $C$, denoting the number of extraction cells and their output channel dimension. As illustrated in Figure 5, $M=8$ and $C=512$ render the best counting performance, but populate AMSNet with 3.79M parameters. When decreasing $C$ to 256, the size of AMSNet shrinks dramatically at the expense of some accuracy; nevertheless, $M=8$ still produces the best MAE in this case. As a result, we configure AMSNet with $M=8$ and $C=512$, and also maintain an AMSNet_light with $C=256$ in the experiments.
We compare the counting accuracy and density map quality of AMSNet and AMSNet_light with other state-of-the-art counting methods in Table 2. As shown, AMSNet reports the best MAE and PSNR overall while being heavier than only three of the compared methods. AMSNet_light, meanwhile, is the third-lightest model and achieves the best performance of all methods except AMSNet.
Table 4: MAE comparison on the five test scenes of WorldExpo’10.

| Method | S1 | S2 | S3 | S4 | S5 | Avg. |
|---|---|---|---|---|---|---|
| MCNN [75] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| SANet [10] | 2.6 | 13.2 | 9.0 | 13.3 | 3.0 | 8.2 |
| CAN [39] | 2.9 | 12.0 | 10.0 | 7.9 | 4.3 | 7.4 |
| ECAN [39] | 2.4 | 9.4 | 8.8 | 11.2 | 4.0 | 7.2 |
| TEDNet [26] | 2.3 | 10.1 | 11.3 | 13.8 | 2.6 | 8.0 |
| AT-CSRNet [76] | 1.8 | 13.7 | 9.2 | 10.4 | 3.7 | 7.8 |
| AMSNet | 1.6 | 8.8 | 10.8 | 10.4 | 2.5 | 6.8 |
4.5 Performance and Comparison
We compare the counting-by-density performance of NAS-Count with other state-of-the-art methods on four challenging datasets. In particular, the counting accuracy comparison is reported in Tables 1 and 4, while the density map quality results are shown in Table 2.
4.5.1 Counting Accuracy
ShanghaiTech. The ShanghaiTech dataset is composed of Part_A and Part_B, with 1198 images in total, and is one of the largest and most widely used datasets in crowd counting. As shown in Table 1, AMSNet leads on ShanghaiTech in terms of both MAE and MSE. On Part_A, we surpass the second best by 6.9% and 3.2% in MAE and MSE; on Part_B, we achieve improvements of 6.6% and 11.9%, respectively.
UCF_CC_50. The UCF_CC_50 dataset contains 50 images of varying resolutions and densities. Given the sample scarcity, we follow the standard protocol [24] and use 5-fold cross-validation to evaluate performance. As shown in Table 1, AMSNet achieves the best MAE and the second-best MSE, improving MAE by 1.7%.
UCF-QNRF. The UCF-QNRF dataset introduced by Idrees et al. [25] has images with the highest crowd counts and largest density variation, ranging from 49 to 12865 people per image. These characteristics make UCF-QNRF extremely challenging, and only selected methods have published results on it. Nonetheless, we produce the best MAE and MSE on this dataset, leading the second best by 3.7% and 5.2%.
WorldExpo’10. The WorldExpo’10 dataset [74] contains 3980 images covering 108 different scenes. The training set contains 3380 images, and the test set includes 600 frames from 5 different scenes. As shown in Table 4, AMSNet achieves the lowest average MAE over the five scenes and performs the best on three scenes individually.
4.5.2 Density Map Quality
As shown in Table 2, we employ the PSNR and SSIM indices to compare the quality of density maps estimated by different methods. AMSNet performs the best on both indices, outperforming the second best by 4.0% and 6.7%. Notably, even AMSNet_light, the third-lightest model, still generates higher-quality density maps than all prior methods. Qualitatively, density maps produced by MCNN [75], SANet [10], TEDNet [26], ANF [73], and our AMSNet on ShanghaiTech Part_A are displayed in Figure 6, and further density maps generated by AMSNet on all employed datasets are shown in Figure 3.
5 Conclusion
NAS-Count is the first endeavor to introduce neural architecture search into counting-by-density. In this paper, we developed the state-of-the-art AMSNet encoder-decoder as well as the SPPLoss supervision paradigm. Specifically, AMSNet employs a novel composition of multi-scale feature extraction and fusion cells, both searched efficiently from a counting-specific search space with a gradient-based strategy. SPPLoss extends the normal MSE loss with a scale pyramid architecture, which helps supervise structural information in the density map at multiple scales. By jointly searching AMSNet and SPPLoss end-to-end, NAS-Count sidesteps tedious hand-designing efforts, obtaining a multi-scale model automatically in less than 1 GPU day, and demonstrates the best overall performance on four challenging datasets.
References
- [1] (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.2.
- [2] (2017) Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823. Cited by: §2.2.
- [3] (2018) Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558. Cited by: §2.2.
- [4] (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §2.2.
- [5] (2016) Crowdnet: a deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 640–644. Cited by: §2.1.
- [6] (2017) SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §2.2.
- [7] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
- [8] (2018) Efficient architecture search by network transformation. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §2.2.
- [9] (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.2.
- [10] (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision, pp. 734–750. Cited by: §1, §1, §1, §2.1, §3.2, Figure 6, §4.3, §4.5.2, Table 1, Table 2, Table 4.
- [11] (2009) Bayesian poisson regression for crowd counting. In Proceedings of the International Conference on Computer Vision, pp. 545–551. Cited by: §2.1.
- [12] (2012) Feature mining for localised crowd counting. In Proceedings of the British Machine Vision Conference, Vol. 1, pp. 3. Cited by: §1.
- [13] (2018) Searching for efficient multi-scale architectures for dense image prediction. In Proceedings of the Advances in Neural Information Processing Systems, pp. 8699–8710. Cited by: §1, §2.3.
- [14] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §3.2.
- [15] (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760. Cited by: §2.2.
- [16] (2019) Detnas: neural architecture search on object detection. arXiv preprint arXiv:1903.10979. Cited by: §2.3.
- [17] (2012) Pedestrian detection: an evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 743–761. Cited by: §2.1.
- [18] (2019) Neural architecture search: a survey.. Journal of Machine Learning Research 20 (55), pp. 1–21. Cited by: §2.2.
- [19] (2017) Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528. Cited by: §2.2.
- [20] (2012) Learning to count with regression forest and structured labels. In Proceedings of the International Conference on Pattern Recognition, pp. 2685–2688. Cited by: §2.1.
- [21] (2009) Marked point processes for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2913–2920. Cited by: §1, §2.1.
- [22] (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045. Cited by: §2.3.
- [23] (2017) Google vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. Cited by: §2.2.
- [24] (2013) Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554. Cited by: §1, §2.1, §4.5.1, §4.
- [25] (2018) Composition loss for counting, density map estimation and localization in dense crowds. arXiv preprint arXiv:1808.01050. Cited by: §1, §1, §2.1, §4.1.1, §4.5.1, §4.
- [26] (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6133–6142. Cited by: §1, §1, §2.1, §3.2, Figure 6, §4.3, §4.5.2, Table 1, Table 2, Table 4.
- [27] (2018) Beyond counting: comparisons of density maps for crowd analysis tasks-counting, detection, and tracking. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.1.
- [28] (2017) Mixture of counting cnns: adaptive integration of cnns specialized to specific appearance for crowd counting. arXiv preprint arXiv:1703.09393. Cited by: §2.1.
- [29] (2010) Learning to count objects in images. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1324–1332. Cited by: §1, §2.1.
- [30] (2008) Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In Proceedings of the International Conference on Pattern Recognition, pp. 1–4. Cited by: §1, §2.1.
- [31] (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100. Cited by: §1, §2.1, §3.2, Table 1, Table 2.
- [32] (2019) Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §2.2.
- [33] (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation.. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 5. Cited by: §1.
- [34] (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92. Cited by: §1, §1, §2.2, §2.2, §2.3, §4.1.1.
- [35] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision, pp. 19–34. Cited by: §1, §2.2, §2.3.
- [36] (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §2.2.
- [37] (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.2, §2.2, §3.1.1, §3.1.1, §4.1.1.
- [38] (2019) Crowd counting with deep structured scale integration network. In Proceedings of the International Conference on Computer Vision, pp. 1774–1783. Cited by: §1, §2.1.
- [39] (2019-06) Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, Table 1, Table 4.
- [40] (2018) Leveraging unlabeled data for crowd counting by learning to rank. arXiv preprint arXiv:1803.03095. Cited by: §1.
- [41] (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §2.2.
- [42] (2019) Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9126–9135. Cited by: §1, §2.1, §2.3.
- [43] (2018) Light-weight refinenet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272. Cited by: §2.1.
- [44] (2016) Towards perspective-free object counting with deep learning. In Proceedings of the European Conference on Computer Vision, pp. 615–629. Cited by: §1, §2.1.
- [45] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.2.
- [46] (2019) Aging evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §1, §2.3.
- [47] (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1, §2.2, §2.3.
- [48] (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §1, §2.2.
- [49] (2015) U-net: convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241. Cited by: §1.
- [50] (2009) Crowd counting using multiple local features. In Digital Image Computing: Techniques and Applications, 2009. DICTA’09., pp. 81–88. Cited by: §2.1.
- [51] (2015) An evaluation of crowd counting methods, features and regression models. Computer Vision and Image Understanding 130, pp. 1–17. Cited by: §1, §2.1.
- [52] (2015) Recent survey on crowd density estimation and counting for visual surveillance. Engineering Applications of Artificial Intelligence 41, pp. 103–114. Cited by: §1, §2.1.
- [53] (2017) Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 6. Cited by: §1, Table 2.
- [54] (2016) Convolutional neural fabrics. In Proceedings of the Advances in Neural Information Processing Systems, pp. 4053–4061. Cited by: §1, §2.2.
- [55] (2019) Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7279–7288. Cited by: Table 1.
- [56] (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the International Conference on Computer Vision, pp. 1879–1888. Cited by: §4.1.3, Table 1, Table 2.
- [57] (2018) A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters 107, pp. 3–16. Cited by: §1, §2.1.
- [58] (2019) The evolved transformer. arXiv preprint arXiv:1901.11117. Cited by: §2.3.
- [59] (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2.2.
- [60] (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127. Cited by: §2.2.
- [61] (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2.2.
- [62] (2016) Learning to count with cnn boosting. In Proceedings of the European Conference on Computer Vision, pp. 660–676. Cited by: §2.1.
- [63] (2011) Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3401–3408. Cited by: §2.1.
- [64] (2019-06) Learning from synthetic data for crowd counting in the wild. In Proceedings of the The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- [65] (2018) In defense of single-column networks for crowd counting. arXiv preprint arXiv:1808.06133. Cited by: §2.1, §4.1.
- [66] (2017) Genetic cnn. In Proceedings of the International Conference on Computer Vision, pp. 1379–1388. Cited by: §2.2.
- [67] (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: §1.
- [68] (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §2.2.
- [69] (2019) Auto-fpn: automatic network architecture adaptation for object detection beyond classification. In Proceedings of the International Conference on Computer Vision, pp. 6649–6658. Cited by: §2.3.
- [70] (2019) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. arXiv preprint arXiv:1907.05737. Cited by: §1, §2.2, §3.1.1, §3.1, §4.1.1.
- [71] (2017) Stacked hourglass network for robust facial landmark localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2025–2033. Cited by: §1.
- [72] (2019) Nas-bench-101: towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635. Cited by: §2.2.
- [73] (2019-10) Attentional neural fields for crowd counting. In Proceedings of the International Conference on Computer Vision, Cited by: Figure 6, §4.5.2, Table 1, Table 2.
- [74] (2015) Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 833–841. Cited by: §2.1, §4.5.1, Table 1, §4.
- [75] (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597. Cited by: §1, §1, §2.1, Figure 6, §4.5.2, Table 1, Table 2, Table 4, §4.
- [76] (2019-06) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, Table 4.
- [77] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.2.
- [78] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §1, §2.2, §2.3.