NAS-Count: Counting-by-Density with Neural Architecture Search

by   Yutao Hu, et al.

Most of the recent advances in crowd counting have evolved from hand-designed density estimation networks, where multi-scale features are leveraged to address scale variation, but at the expense of demanding design efforts. In this work, we automate the design of counting models with Neural Architecture Search (NAS) and introduce an end-to-end searched encoder-decoder architecture, Automatic Multi-Scale Network (AMSNet). The encoder and decoder in AMSNet are composed of different cells discovered from counting-specific search spaces, each dedicated to extracting and aggregating multi-scale features adaptively. To resolve the pixel-level isolation issue in training density estimation models, AMSNet is optimized with a novel Scale Pyramid Pooling Loss (SPPLoss), which exploits a pyramidal architecture to achieve structural supervision at multiple scales. During training time, AMSNet and SPPLoss are searched end-to-end efficiently with differentiable NAS techniques. When testing, AMSNet produces state-of-the-art results that are considerably better than hand-designed models on four challenging datasets, fully demonstrating the efficacy of NAS-Count.




1 Introduction

Given crowded images, a crowd counting model estimates the number of people contained in a given area. If performed accurately, this has far-reaching applications in crowd monitoring, urban planning, traffic management, and disaster relief [57, 51, 52]. With advanced occlusion robustness and counting efficiency, counting-by-density [29, 75, 10, 25] has become the method of choice over other related techniques [30, 21, 12, 24, 12]. These techniques estimate a pixel-level density map and count the crowd by summing over the pixels in the given area.

Figure 1: A demonstration of the scale variation issue in the counting-by-density task. Due to varied distances to the camera, objects are perspectively distorted in the input with varied scales.

Although effective, counting-by-density is still challenged by scale variations induced by perspective distortion. As illustrated in Figure 1, the scales of people vary inversely with their distances to the camera, causing an inconsistent crowd distribution in the density map. To address this issue, most methods [75, 10, 40] employ deep Convolutional Neural Networks (CNNs) to exploit multi-scale features for density estimation in multi-scaled scenes. In particular, different-sized filters are arranged in parallel in multiple columns to capture multi-scale features for accurate counting in [75, 44, 53], while in [10, 25, 26], different filters are grouped into blocks and then stacked sequentially in one column. At the heart of these solutions, multi-scale capability originates from the compositional nature of CNNs [7, 67], where convolutions with various receptive fields are composed hierarchically by hand. However, these manual designs demand prohibitive expert effort and suffer from heuristically fixed receptive fields.

We therefore develop a Neural Architecture Search (NAS) [77, 48] based approach to automatically discover multi-scaled counting-by-density models. NAS is enabled by the compositional nature of CNNs and guided by human expertise in designing task-specific search spaces and strategies. For vision tasks, NAS first flourished in image-level classification [78, 35, 47, 46], where novel architectures are found to progressively transform spatial details into semantically deep features. Counting-by-density is, however, a pixel-level task that requires spatially preserving architectures with restrained down-sampling strides. Accordingly, the successes of NAS in image classification are not immediately transferable to crowd counting. Although attempts have been made to deploy NAS in image segmentation for pixel-level classification [13, 34, 42], they still do not address counting-by-density, which is a pixel-level regression with scale variations across the inputs.

Figure 2: An illustration of NAS-Count with the AMSNet architecture and SPPLoss supervision; all searched cells are outlined in black. For the given input size, the output dimensions of each extraction and fusion cell are marked accordingly.

In NAS-Count, we propose a counting-oriented NAS framework with a specific search space, search strategy, and supervision method, which we use to develop our Automatic Multi-Scale Network (AMSNet). First, we employ a two-level search space [54, 34]. On the micro-level, multi-scaled cells are automatically explored to extract and fuse multi-scale features sufficiently. Pooling operations are limited to preserve spatial fidelity, and dilated convolutions are utilized instead for receptive field enlargement. On the macro-level, multi-path encoder-decoder architectures are searched to produce a high-quality density map. The fully-convolutional encoder-decoder is the architecture of choice for pixel-level tasks [49, 71, 10], and the multi-path variant can better aggregate features encoded at different scales [33, 26, 38]. Second, we adopt a differentiable one-shot search strategy [37, 34, 70] to improve search efficiency, wherein architecture parameters are jointly optimized with a gradient-based optimizer. Third, in order to address the pixel-level isolation problem [10, 31] of the traditional mean square error (MSE) loss, we introduce a novel Scale Pyramid Pooling Loss (SPPLoss) to optimize AMSNet, which leverages a pyramidal pooling architecture to enforce supervision with multi-scale structural awareness. By jointly searching AMSNet and SPPLoss, NAS-Count flexibly exploits multi-scale features and addresses the scale variation issue in counting-by-density. NAS-Count is illustrated in Figure 2.

The main contributions of NAS-Count include:

  • To the best of our knowledge, NAS-Count is the first attempt at introducing NAS for crowd counting, where a multi-scale architecture is automatically developed to address the scale variation issue.

  • A counting-specific two-level search space is developed in NAS-Count, from which a spatially-preserved encoder-decoder architecture (AMSNet) is discovered efficiently with a differentiable search strategy using stochastic gradient descent (SGD).

  • A novel Scale Pyramid Pooling Loss (SPPLoss) is searched automatically to replace MSE supervision, which helps produce higher-quality density maps by optimizing structural information on multiple scales.

  • By jointly searching AMSNet and SPPLoss, NAS-Count reports the best overall counting and density estimation performance on four challenging benchmarks, considerably surpassing other state-of-the-art methods, all of which require demanding expert involvement.

2 Related work

In this section, we briefly review related literature in the domains of crowd counting and neural architecture search.

2.1 Crowd Counting Literature

Existing counting methods can be categorized into counting-by-detection [17, 21, 30, 63], counting-by-regression [11, 50, 24, 74, 28], and counting-by-density strategies. For comprehensive surveys of crowd counting, please refer to [51, 52, 57, 27]. The first strategy is vulnerable to occlusions due to the requirement of explicit detection. Counting-by-regression avoids this requirement by directly regressing to a scalar count, but forfeits the ability to perceive the localization of crowds. The counting-by-density strategy, initially introduced in [29], counts the crowd by first estimating a density map using hand-crafted [29, 20] or deep CNN [75, 31, 64, 39] features, then summing over all pixel values in the map. Because this is a pixel-level regression task, CNN architectures deployed in counting-by-density methods tend to follow the encoder-decoder formulation. In order to handle scale variations with multi-scale features, single-column [10, 25, 65] and multi-column [75, 5, 44, 62] encoders have been used, where different-sized convolution kernels are arranged sequentially or in parallel to extract features. For the decoder, an hour-glass architecture with a single decoding path has been adopted [10, 25, 76], while a multi-path variant is gaining increasing attention for superior multi-scale feature aggregation [26, 38, 43, 42].

Figure 3: An illustration of generated density maps on ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, UCF-QNRF and WorldExpo’10 respectively. The first row shows the input images; the second and third rows depict the ground truth and estimated density maps.

2.2 NAS Fundamentals

Since its inception, NAS [77, 48] has been intensively studied with favorable outcomes [18, 72]. Efforts to develop new NAS solutions generally focus on designing new search spaces and search strategies. For search space, existing methods can be categorized into searching the network (macro) space [48, 77], the cell (micro) space [78, 35, 47, 37, 45], or exploring such a two-level space [54, 34] jointly. The cell-based search space is the most popular, where the arrangement of cells in the network is hand-engineered to reduce the exponential search space for fast computation. The search strategy is essentially an optimizer that finds the architecture maximizing a targeted task objective. Random search [4, 23], reinforcement learning [77, 78, 61, 8, 1], neuro-evolutionary algorithms [60, 48, 41, 36, 47, 66], and gradient-based methods [37, 68, 9] have been used to solve the optimization problem, but the first three suffer from prohibitive computation costs. Many attempts have been made to accelerate them, such as parameter sharing [45, 19, 8, 3], hierarchical search [35, 34], and deploying proxy tasks with cheaper search spaces [78] and training procedures [2], yet they are still far less efficient than gradient-based methods.

Gradient-based NAS, represented by DARTS [37], follows the one-shot strategy [6] wherein a hyper-graph is established using differentiable architectural parameters. Within this hyper-graph, an optimal sub-graph is searched for by solving a bi-level optimization with gradient-descent optimizers. Notably, this one-shot strategy is inherently memory-heavy, and the bi-level optimization is prone to collapse when skip-connections are selected excessively. These drawbacks of the one-shot search strategy are further investigated in DARTS variants including DARTS+ [32], P-DARTS [15], PC-DARTS [70], Single-Path NAS [59], and AutoDeepLab [34].
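For reference, the bi-level problem solved by DARTS-style methods is usually written as below. The notation is the common DARTS convention and is an assumption here rather than something stated in this paper: $w$ are the network weights, $\alpha$ the architecture parameters, and $\mathcal{L}_{train}$, $\mathcal{L}_{val}$ the training and validation losses.

```latex
\min_{\alpha} \; \mathcal{L}_{val}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) = \arg\min_{w} \; \mathcal{L}_{train}(w, \alpha)
```

The first-order approximation avoids differentiating through the inner problem by treating the current weights as $w^{*}(\alpha)$ and alternating gradient steps on the weights and on the architecture parameters.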

2.3 NAS Applications

NAS has shown great promise with discovered recurrent or convolutional neural networks in both sequential language modeling [58] and multi-level vision tasks. In computer vision, NAS has excelled at image-level classification tasks [78, 46, 35, 47], which are a customary starting point for developing new classifiers that output spatially coarsened labels. NAS was later extended to both bounding-box and pixel-level tasks, represented by object detection [22, 16, 69] and segmentation [13, 34, 42], where the search spaces are modified to better preserve the spatial information in the feature map. In [13], a pixel-level oriented search space and a random search NAS were introduced to the pixel-level segmentation task. In [42], a similar search space was adopted, but the authors employed a reinforcement learning based search method. Nonetheless, both methods suffer from formidable computation costs and are orders of magnitude slower than NAS-Count. In [34], the authors searched a two-level search space with a more efficient gradient-based method, yet it is dedicated to solving the pixel-level classification in semantic segmentation, which still differs from the per-pixel regression in counting-by-density.

3 NAS-Count Methodology

NAS-Count efficiently searches a multi-scaled encoder-decoder network, the Automatic Multi-Scale Network (AMSNet) as shown in Figure 2, in a counting-specific search space. It is then optimized with a jointly searched Scale Pyramid Pooling Loss (SPPLoss), also shown in Figure 2. The encoder and decoder in AMSNet consist of searched multi-scale feature extraction cells and multi-scale feature fusion cells, respectively, and SPPLoss deploys a two-stream pyramidal pooling architecture where the pooling cells are searched as well. By searching AMSNet and SPPLoss together end-to-end, the receptive fields established in these two architectures can cooperate with each other to harvest the ideal multi-scale capability for addressing the scale-varied counting problem. NAS-Count details are discussed in the following subsections.

3.1 Automatic Multi-Scale Network

AMSNet is searched with the differentiable one-shot strategy in a two-level search space. To address scale variations in the pixel-level counting-by-density problem, we search for a multi-path encoder-decoder architecture, where multi-scale features are adaptively extracted and aggregated. In particular, the down-sampling strides are limited in AMSNet to preserve spatial fidelity. To improve the search efficiency, NAS-Count adopts a continuous relaxation and partial channel connection as described in [70].

3.1.1 AMSNet Encoder

The encoder of AMSNet is composed of a set of multi-scale feature extraction cells. The $l$-th cell in the encoder takes the outputs of the previous two cells, feature maps $x_{l-2}$ and $x_{l-1}$, as input and produces an output feature map $x_l$. We define each cell as a directed acyclic graph containing $N_e$ nodes, i.e. $x^i$ with $0 \le i \le N_e - 1$, each representing a propagated feature map. We set $N_e = 7$, comprising two input nodes, four intermediate nodes, and one output node. Each directed edge in a cell indicates a convolutional operation $o_e(\cdot)$ performed between a pair of nodes, and $o_e$ is searched from the search space $\mathcal{O}_e$ with 9 operations:

  • one common convolution;

  • three dilated convolutions of different kernel sizes, each with rate 2;

  • three depth-wise separable convolutions of different kernel sizes;

  • skip-connection;

  • no-connection (zero).

To preserve spatial fidelity in the extracted features, the extraction cell involves no down-sampling operations. To compensate for the receptive-field enlargement that down-sampling would otherwise provide, we utilize dilated convolutions in place of normal ones. Besides, we adopt depth-wise separable convolutions to keep the searched architecture parameter-friendly. Skip connections instantiate the residual learning scheme, which helps improve multi-scale capacity and enhances gradient flow during back-propagation.
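To make the receptive-field argument concrete, the sketch below computes the spatial extent covered by a dilated kernel and the receptive field of a stack of stride-1 convolutions. The function names and the kernel sizes in the usage loop are illustrative assumptions, not values taken from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one (possibly dilated) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs, applied in order.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf


if __name__ == "__main__":
    # Illustrative kernel sizes only (assumption).
    for k in (3, 5, 7):
        print(k, effective_kernel_size(k, 1), effective_kernel_size(k, 2))
```

With rate 2, a kernel covers almost twice the extent of its non-dilated counterpart at the same parameter count, which is why dilation can stand in for down-sampling when spatial resolution must be kept.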

Within each cell, a specific intermediate node $x^i$ is connected to all previous nodes $x^{i-1}, x^{i-2}, \dots, x^{0}$. Edges $e^{(i,j)}$ are established between every pair of connected nodes $x^i$ and $x^j$, forming a densely-connected hyper-graph. On a given edge $e^{(i,j)}$ in the graph, following the continuously-relaxed differentiable search discussed in [37], its associated operation $\bar{o}^{(i,j)}$ is defined as a summation of all possible operations weighted by the architectural parameter $\alpha$:

$$\bar{o}^{(i,j)}\left(x^j\right) = \sum_{m=1}^{|\mathcal{O}_e|} f\left(\alpha_{o_m}^{(i,j)}\right) \cdot o_m\left(S^{(i,j)} * x^j\right) + \left(1 - S^{(i,j)}\right) * x^j \tag{1}$$

In the above equation, $f(\cdot)$ is a softmax function over the candidate operations and $|\mathcal{O}_e|$ indicates the volume of the micro-level search space. Vector $S^{(i,j)}$ is applied to perform a channel-wise sampling on $x^j$, where $1/K$ of the channels are randomly selected to improve the search efficiency. $K$ is set to 4 as proposed in [70]. $\alpha_{o_m}^{(i,j)}$ is a learnable parameter denoting the importance of each operation $o_m$ on an edge $e^{(i,j)}$.

In addition, each edge is also associated with another architecture parameter $\beta^{(i,j)}$ which indicates its importance. Accordingly, an intermediate node $x^i$ is computed as a weighted sum of all edges connected to it:

$$x^i = \sum_{j<i} f\left(\beta^{(i,j)}\right) \cdot \bar{o}^{(i,j)}\left(x^j\right) \tag{2}$$

Here $j < i$ includes all previous nodes in the cell. The output of the cell is a concatenation of all its intermediate nodes. The cell architecture is determined by the two architectural parameters $\alpha$ and $\beta$, which are jointly optimized with the weights of the convolutions through a bi-level optimization. For details please refer to [37]. To recover a deterministic architecture from the continuous relaxation, the most important edges as well as their associated operations are selected by computing $\arg\max$ on $\alpha$ and $\beta$.
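The relaxation on a single edge can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate operations are arbitrary callables standing in for convolutions, and the names `mixed_op` and `discretize` are hypothetical. A random 1/K subset of channels is routed through the softmax-weighted mixture while the remaining channels bypass it, following the partial channel connection idea of [70].

```python
import numpy as np


def softmax(v):
    """Numerically stable softmax over a 1D array of logits."""
    e = np.exp(v - np.max(v))
    return e / e.sum()


def mixed_op(x, ops, alpha, K=4, rng=None):
    """Continuous relaxation of one edge with partial channel sampling.

    x     : feature map of shape (channels, height, width)
    ops   : candidate operations, each mapping (c, h, w) -> (c, h, w)
    alpha : one architecture logit per candidate operation
    K     : only 1/K of the channels pass through the candidates
    """
    rng = np.random.default_rng() if rng is None else rng
    channels = x.shape[0]
    mask = np.zeros(channels, dtype=bool)
    mask[rng.choice(channels, size=channels // K, replace=False)] = True

    weights = softmax(np.asarray(alpha, dtype=float))
    out = x.astype(float).copy()        # unsampled channels bypass the ops
    selected = out[mask]
    out[mask] = sum(w * op(selected) for w, op in zip(weights, ops))
    return out


def discretize(alpha):
    """Recover a single operation: index of the largest architecture weight."""
    return int(np.argmax(alpha))
```

During search, `alpha` would be updated by gradient descent together with the operation weights; after search, `discretize` keeps only the strongest candidate on each retained edge.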

In the encoder, we apply a convolution to preliminarily encode the input image into a feature map with $C/4$ channels, where $C$ is the channel dimension output by the last extraction cell. Afterwards, two convolutions are implemented after the first and third extraction cells, each doubling the channel dimension of the features. Our searched extraction cell is a normal cell that keeps the feature channel dimension unchanged. Spatially, we reduce the feature resolution only twice, with stride-two max pooling layers, to preserve the spatial fidelity of the features, and double the channels before each of the two down-sampling operations. Additionally, within each extraction cell, an extra convolution is attached to each input node, adjusting its feature channels to one-fourth of the cell's final output dimension.
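The channel and resolution bookkeeping described above can be traced with a small helper. This is a sketch under assumptions: the stem is taken to output one quarter of the final channel count (implied by the two doublings), each max pooling is assumed to follow its channel-doubling convolution directly, and the defaults of 8 cells and 512 output channels are the deployment values reported in Section 3.1.2. The function name is hypothetical.

```python
def trace_encoder_shapes(height, width, num_cells=8, out_channels=512):
    """Return (stage, channels, height, width) after each encoder stage."""
    channels = out_channels // 4          # assumed stem width
    shapes = [("stem conv", channels, height, width)]
    for cell in range(1, num_cells + 1):
        # Extraction cells keep both channels and resolution unchanged.
        shapes.append((f"extraction cell {cell}", channels, height, width))
        if cell in (1, 3):
            channels *= 2                 # channel-doubling convolution
            shapes.append((f"doubling conv after cell {cell}", channels, height, width))
            height, width = height // 2, width // 2   # stride-2 max pooling
            shapes.append((f"max pool after cell {cell}", channels, height, width))
    return shapes


if __name__ == "__main__":
    for stage in trace_encoder_shapes(512, 512):
        print(stage)
```

Under these assumptions the encoder output keeps one quarter of the input resolution in each spatial dimension, far less down-sampling than a typical classification backbone.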

3.1.2 AMSNet Decoder

The decoder of AMSNet deploys a multi-scale feature fusion cell followed by an up-sampling module. We construct the hyper-graph of the fusion cell to take multiple features as input while outputting just one, conforming to the aggregative nature of a decoder. The search in this hyper-graph is similar to that of the extraction cell. A fusion cell takes three encoder output feature maps as input and consists of six nodes: three input nodes, two intermediate nodes, and one output node. After the relaxation formulated in Eqs. 1 and 2, the architecture of a fusion cell is determined by its associated architecture parameters $\alpha$ (operation weights) and $\beta$ (edge weights). By optimizing $\beta$ on the three edges connecting the decoder with three extraction cells in the encoder, NAS-Count fully explores the macro-level architecture of AMSNet, such that different single- or multi-path encoder-decoder formulations are automatically searched to discover the best feature aggregation for producing high-quality density maps.

As shown in Figure 2, $L$ denotes the number of extraction cells in the encoder and $C$ is the number of channels in the output of the last cell. To improve efficiency, we first employ a smaller proxy network, with $L=6$ and $C=256$, to search the cell architecture. Upon deployment, we enlarge the network to $L=8$ and $C=512$ for better performance. Through the multi-scale aggregation instantiated in the decoder, we obtain a feature map with 128 channels, which is then processed by an up-sampling module containing two convolutions interleaved with nearest-neighbor interpolation layers. The output of the up-sampling module is a single-channel density map with restored spatial resolution, which is then used in computing the SPPLoss.

3.2 Scale Pyramid Pooling Loss

The default loss function for optimizing counting-by-density models is the per-pixel mean square error (MSE) loss. By supervising only the pixel-wise difference between the estimated density map and the corresponding ground truth, one assumes strong pixel-level isolation, such that the loss fails to reflect structural differences in multi-scale regions [10, 31]. Motivated by the Atrous Spatial Pyramid Pooling (ASPP) module designed in [14], we address this problem with a new supervision architecture we call the Scale Pyramid Pooling Loss (SPPLoss), where non-parametric pooling layers are stacked into a two-stream pyramid. As shown in Figure 4, after feeding the estimated map and the ground truth into each stream, they are progressively coarsened and MSE losses are calculated on each level between the pooled maps. This is equivalent to computing the structural difference with increasing region-level receptive fields, and can therefore better supervise the pixel-level estimation model on different scales.

Instead of setting the pooling layers manually as in [26], NAS-Count searches for the most effective SPPLoss architecture jointly with AMSNet. In this way, the multi-scale capabilities composed in both architectures can better collaborate to resolve the scale variation problem in counting-by-density. Specifically, each stream in SPPLoss deploys four cascaded nodes. Among them, one input node is the predicted density map (or the given ground truth). The other three nodes are produced through three cascaded searched pooling layers. The search space for the operation performed on each edge contains six different pooling layers:

  • three max pooling layers of different kernel sizes, each with stride 2;

  • three average pooling layers of different kernel sizes, each with stride 2.

The search for SPPLoss adopts a differentiable strategy similar to the one detailed in Section 3.1. Notably, as SPPLoss is inherently a pyramid, its macro-level search space takes a cascaded form instead of a densely-connected hyper-graph. Accordingly, we only need to optimize the operation-wise architecture parameter $\alpha$ as follows:

$$x^{i+1} = \sum_{m=1}^{6} f\left(\alpha_{o_m}^{(i)}\right) \cdot o_m\left(x^{i}\right) \tag{3}$$

Here, $m$ indexes the 6 different pooling operations $o_m$, $f(\cdot)$ is a softmax function, and $x^{i}$ represents an estimated map or ground truth at a specific level $i$. Since both of them have only one channel, we do not apply partial channel connections (i.e. $K$ is set to 1). The same cascaded architecture is shared by both streams of SPPLoss. Using the best searched architecture as depicted in Figure 4, SPPLoss is computed as:

$$\mathcal{L}_{SPP} = \sum_{l=0}^{3} \frac{1}{N^{l}} \left\| P^{l}(E) - P^{l}(G) \right\|_2^2 \tag{4}$$

Here $E$ and $G$ are the estimated and ground-truth density maps, $N^{l}$ denotes the number of pixels in the map at level $l$, $P^{l}$ indicates the searched pooling operations applied up to level $l$, and the superscript $l$ is the layer index ranging from 0 to 3. $l = 0$ is the special case where the MSE loss is computed directly between $E$ and $G$.
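A pyramid loss of this kind can be sketched in NumPy as follows. The pyramid configuration `EXAMPLE_PYRAMID` (kernel sizes and pooling types) is a hypothetical stand-in, since the searched configuration is only given in Figure 4; the helper names are likewise illustrative. The same cascaded pooling is applied to both maps, and an MSE term is accumulated at every level, starting with the unpooled maps.

```python
import numpy as np


def pool2d(x, kernel, stride, mode):
    """Non-parametric 2D pooling (no padding) over a single-channel map."""
    h = (x.shape[0] - kernel) // stride + 1
    w = (x.shape[1] - kernel) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            r, c = i * stride, j * stride
            out[i, j] = reduce_fn(x[r:r + kernel, c:c + kernel])
    return out


# Hypothetical pyramid: (kernel, mode) for the three cascaded pooling layers.
EXAMPLE_PYRAMID = [(2, "avg"), (2, "avg"), (2, "max")]


def spp_loss(estimate, ground_truth, pyramid=EXAMPLE_PYRAMID, stride=2):
    """Sum of per-level MSE terms over a shared two-stream pooling pyramid."""
    est = np.asarray(estimate, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    loss = np.mean((est - gt) ** 2)           # level 0: plain per-pixel MSE
    for kernel, mode in pyramid:
        est = pool2d(est, kernel, stride, mode)
        gt = pool2d(gt, kernel, stride, mode)
        loss += np.mean((est - gt) ** 2)      # coarser structural term
    return float(loss)
```

Because each level compares region statistics rather than individual pixels, two maps that differ only by small local shifts are penalized less at coarse levels than under plain MSE, while a missing or misplaced crowd region is penalized at every level.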

Figure 4: Detailed illustrations of the best searched cells. The circled additive sign denotes element-wise or scalar additions.

4 Experiments

We search and evaluate AMSNet and SPPLoss on the ShanghaiTech [75], WorldExpo’10 [74], UCF_CC_50 [24] and UCF-QNRF [25] crowd counting datasets.

4.1 Implementation Details

The original annotations provided by the datasets are coordinates pinpointing the location of each individual in the crowd. To soften these hard regression labels for better convergence, we apply a normalized 2D Gaussian filter to convert the coordinate map into a density map, in which each individual is represented by a Gaussian response with a radius of 15 pixels [65].
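The label conversion can be sketched as below. This is an illustration under assumptions rather than the authors' preprocessing code: the Gaussian standard deviation (`radius / 3`) is an assumed choice that the text does not specify, annotations are taken to be integer (row, column) positions, and the function names are hypothetical. Since each kernel is normalized to sum to one, the density map sums to the number of annotated people, apart from mass cropped at the image border.

```python
import numpy as np


def gaussian_kernel(radius=15, sigma=None):
    """Normalized 2D Gaussian on a (2*radius+1)^2 window; sums to 1."""
    sigma = radius / 3.0 if sigma is None else sigma   # assumed choice
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def density_map(points, height, width, radius=15):
    """Place one unit-mass Gaussian per annotated (row, col) position."""
    kernel = gaussian_kernel(radius)
    # Pad so that kernels near the border can be added without clipping logic.
    padded = np.zeros((height + 2 * radius, width + 2 * radius))
    for r, c in points:
        padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1] += kernel
    return padded[radius:-radius, radius:-radius]


if __name__ == "__main__":
    dm = density_map([(50, 50), (60, 80)], height=128, width=128)
    print(dm.shape, dm.sum())   # the sum is the crowd count
```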

4.1.1 Architecture Search

The architectures of AMSNet and SPPLoss, i.e. their corresponding architecture parameters $\alpha$ and $\beta$, are jointly searched on the split UCF-QNRF [25] training set. We choose to perform the search on this dataset as it has the most challenging scenes with large crowd counts and density variations; the search costs approximately 21 TITAN Xp GPU hours. Benefiting from the continuous relaxation, we optimize all architecture parameters and network weights jointly using gradient descent. Specifically, the first-order optimization proposed in [37] is adopted, in which the network weights $w$ and the architecture parameters $\alpha$, $\beta$ are optimized alternately. For the architecture parameters, we set the learning rate to 6e-4 with a weight decay of 1e-3. We follow the implementation in [70, 34], where a warm-up training of the network weights is first conducted for 40 epochs and the search is stopped early at 80 epochs. For training the network weights, we use a cosine learning rate that decays from 0.001 to 0.0004, and a weight decay of 1e-4. Data augmentation with random sampling, flipping, and rotation is conducted to alleviate overfitting.
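A cosine schedule between the two stated learning rates could look like the sketch below. Whether the schedule is stepped per epoch or per iteration, and over how many steps, is not stated in the text; the per-epoch form and the function name are assumptions.

```python
import math


def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=4e-4):
    """Cosine annealing from lr_max at epoch 0 to lr_min at the last epoch."""
    progress = epoch / max(1, total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 20, 40, 60, 79):
        print(epoch, cosine_lr(epoch, total_epochs=80))
```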

Figure 5: Illustrated hyper-parameter analysis. $L$ is the number of extraction cells, and $C$ denotes the number of channels of the feature map generated by the last extraction cell. The bottom-left corner indicates superior counting results, and the best parameters are colored in the legend.

4.1.2 Architecture Training

After the architectures of AMSNet and SPPLoss are acquired by searching on the UCF-QNRF dataset, we re-train their network weights from scratch on each of the other datasets. We re-initialize the weights with Xavier initialization, and employ Adam with initial learning rate set to 1e-3. This learning rate is decayed by 0.8 every 15K iterations.
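The re-training schedule can be written as a step decay. The sketch assumes that "decayed by 0.8" means the learning rate is multiplied by 0.8 at every 15K-iteration boundary; the function name is hypothetical.

```python
def step_decay_lr(iteration, base_lr=1e-3, gamma=0.8, step=15_000):
    """Learning rate after `iteration` updates under a step decay schedule."""
    return base_lr * gamma ** (iteration // step)


if __name__ == "__main__":
    for it in (0, 15_000, 30_000, 45_000):
        print(it, step_decay_lr(it))
```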

4.1.3 Architecture Evaluation

Upon deployment, we directly feed whole images into AMSNet without patching, aiming to obtain high-quality density maps free from boundary artifacts. In counting-by-density, the crowd count of an estimated density map equals the summation of all its pixels. To evaluate the counting performance, we employ the mean absolute error (MAE) and the mean squared error (MSE) metrics:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|^2}$$

Here $N$ is the number of images in the test set, and $C_i$ and $C_i^{GT}$ represent the predicted and ground-truth counts of the $i$-th image. We also utilize the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity in Image) metrics to evaluate density map quality.
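The two counting metrics are straightforward to compute from per-image counts. Note that the quantity the crowd counting literature calls MSE conventionally includes a square root, so it is a root mean squared error; the sketch follows that convention. The function name is an illustrative assumption.

```python
import math


def counting_errors(predicted, ground_truth):
    """Return (MAE, MSE) over per-image crowd counts.

    MSE follows the crowd counting convention and includes the square root.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - g for p, g in zip(predicted, ground_truth)]
    n = len(diffs)
    mae = sum(abs(d) for d in diffs) / n
    mse = math.sqrt(sum(d * d for d in diffs) / n)
    return mae, mse


if __name__ == "__main__":
    # Each predicted count would be the sum of one estimated density map.
    print(counting_errors([10.0, 20.0], [12.0, 16.0]))
```

Because of the squared term, MSE weighs large per-image errors more heavily than MAE, so it is often read as an indicator of robustness across images.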


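| Method | Part_A MAE | Part_A MSE | Part_B MAE | Part_B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE | UCF-QNRF MAE | UCF-QNRF MSE |
|---|---|---|---|---|---|---|---|---|
| Zhang et al. [74] | 181.8 | 277.7 | 32.0 | 49.8 | 467.0 | 498.5 | - | - |
| MCNN [75] | 110.2 | 173.2 | 26.4 | 41.3 | 377.6 | 509.1 | 277 | 426 |
| CP-CNN [56] | 73.6 | 106.4 | 20.1 | 30.1 | 295.8 | 320.9 | - | - |
| CSRNet [31] | 68.2 | 115.0 | 10.6 | 16.0 | 266.1 | 397.5 | - | - |
| SANet [10] | 67.0 | 104.5 | 8.4 | 13.6 | 258.4 | 334.9 | - | - |
| TEDNet [26] | 64.2 | 109.1 | 8.2 | 12.8 | 249.4 | 354.5 | 113 | 188 |
| ANF [73] | 63.9 | 99.4 | 8.3 | 13.2 | 250.2 | 340.0 | 110 | 174 |
| PACNN+ [55] | 62.4 | 102.0 | 7.6 | 11.8 | 241.7 | 320.7 | - | - |
| CAN [39] | 62.3 | 100.0 | 7.8 | 12.2 | 212.2 | 243.7 | 107 | 183 |
| AMSNet | 58.0 | 96.2 | 7.1 | 10.4 | 208.6 | 296.3 | 103 | 165 |

Table 1: Estimation errors on the ShanghaiTech (Part_A and Part_B), the UCF_CC_50, and the UCF-QNRF datasets. A dash marks results that were not reported.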

4.2 Search Result Analysis

The best searched multi-scale feature extraction and fusion cells, as well as the SPPLoss architecture, are illustrated in Figure 4. As shown, the extraction cell keeps the spatial and channel dimensions unchanged (additional convolutions are employed to manipulate the channel dimensions in the cells). The extraction cell primarily exploits dilated convolutions over normal ones, conforming to the fact that in the absence of heavy down-sampling, pixel-level models rely on dilations to enlarge receptive fields. Furthermore, different kernel sizes are employed in the extraction cell, showing its multi-scale capability in addressing scale variations. By taking in three encoded features and generating one output map, the fusion cell constitutes a multi-path decoding hierarchy, wherein primarily non-dilated convolutions with smaller kernels are selected to aggregate features more precisely and with fewer parameters. In addition, the deployed skip connections and varied kernels contribute to the decoder's multi-scale capability.

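| Method | MAE | PSNR | SSIM | Size |
|---|---|---|---|---|
| MCNN [75] | 110.2 | 21.4 | 0.52 | 0.13M |
| Switch-CNN [53] | 90.4 | - | - | 15.11M |
| CP-CNN [56] | 73.6 | 21.72 | 0.72 | 68.4M |
| CSRNet [31] | 68.2 | 23.79 | 0.76 | 16.26M |
| SANet [10] | 67.0 | - | - | 0.91M |
| TEDNet [26] | 64.2 | 25.88 | 0.83 | 1.63M |
| ANF [73] | 63.9 | 24.1 | 0.78 | 7.9M |
| AMSNet | 58.0 | 26.96 | 0.89 | 3.79M |
| AMSNet_light | 61.8 | 25.93 | 0.84 | 1.51M |

Table 2: Model size and performance comparison among state-of-the-art counting methods on the ShanghaiTech Part_A. A dash marks results that were not reported.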

4.3 Ablation Study on Searched Architectures

For ablation purposes, we employ the architecture proposed in [10] as the baseline encoder (composed of four inception-like blocks). The baseline decoder cascades two convolutions interleaved with nearest-neighbor interpolation layers. The normal MSE loss is utilized as baseline supervision. By ablating different modules with its baseline, the ablation study result on the ShanghaiTech Part_A dataset is reported in Table 3. This table is partitioned into three groups row-wise, and each row indicates a specific configuration. The MAE and PSNR metrics are used to show the counting accuracy and density map quality.

Architectures in the first two groups (four rows) are optimized with the normal MSE loss. As shown, the searched AMSNet encoder improves counting accuracy and density map quality by 12.0% and 7.8%, while the searched decoder brings 8.5% and 1.7% improvements respectively. In the third group, AMSNet is supervised by different loss functions to demonstrate their efficacy. The Spatial Abstraction Loss (SAL) proposed in [26] adopts a hand-designed pyramidal architecture, which surpasses the normal MSE supervision on both counting and density estimation performance. These improvements are further enhanced by deploying SPPLoss, showing that the searched pyramid benefits counting and density estimation by supervising multi-scale structural information.

Figure 6: Visualization of density maps generated by state-of-the-art methods. The first two columns show the inputs and the corresponding ground truth. The remaining columns show the outputs of MCNN [75], SANet [10], TEDNet [26], ANF [73], and AMSNet, in that order.

| Group | No. | Configuration | MAE ↓ | PSNR ↑ |
|---|---|---|---|---|
| Encoder Architecture | 1 | Baseline Encoder + Baseline Decoder | 69.1 | 23.54 |
| Encoder Architecture | 2 | AMSNet Encoder + Baseline Decoder | 60.8 | 25.52 |
| Decoder Architecture | 1 | Baseline Encoder + Baseline Decoder | 69.1 | 23.54 |
| Decoder Architecture | 3 | Baseline Encoder + AMSNet Decoder | 63.2 | 23.94 |
| Supervision | 3 | AMSNet + MSE | 59.4 | 25.96 |
| Supervision | 4 | AMSNet + SAL | 58.7 | 26.20 |
| Supervision | 5 | AMSNet + SPPLoss | **58.0** | **26.96** |

Table 3: Ablation study results. Best performance is bolded, and arrows indicate the favorable directions of the metric values.

4.4 Hyper-parameter Study

Due to the heavy deployment of the extraction cell, the size and performance of AMSNet largely depend on two hyper-parameters, $L$ and $C$, denoting the number of extraction cells and the output channel dimension of the last cell, respectively. As illustrated in Figure 5, $L=8$ and $C=512$ render the best counting performance, but populate AMSNet with 3.79M parameters. When decreasing $C$ to 256, the size of AMSNet also shrinks dramatically, but at the expense of decreased accuracy. Nevertheless, $L=8$ still produces the best MAE in this case. As a result, we configure our AMSNet with $L=8$, $C=512$, and also maintain an AMSNet_light with $L=8$, $C=256$ in the experiment.

We compare the counting accuracy and density map quality of both AMSNet and AMSNet_light with other state-of-the-art counting methods in Table 2. As shown, AMSNet reports the best MAE and PSNR overall, while being heavier than three other methods. AMSNet_light, on the other hand, is the third-lightest model and achieves the best performance apart from AMSNet.

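| Method | S1 | S2 | S3 | S4 | S5 | Ave. |
|---|---|---|---|---|---|---|
| MCNN [75] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| SANet [10] | 2.6 | 13.2 | 9.0 | 13.3 | 3.0 | 8.2 |
| CAN [39] | 2.9 | 12.0 | 10.0 | 7.9 | 4.3 | 7.4 |
| ECAN [39] | 2.4 | 9.4 | 8.8 | 11.2 | 4.0 | 7.2 |
| TEDNet [26] | 2.3 | 10.1 | 11.3 | 13.8 | 2.6 | 8.0 |
| AT-CSRNet [76] | 1.8 | 13.7 | 9.2 | 10.4 | 3.7 | 7.8 |
| AMSNet | 1.6 | 8.8 | 10.8 | 10.4 | 2.5 | 6.8 |

Table 4: The MAE comparison on WorldExpo’10 over the five test scenes (S1 to S5) and their average.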

4.5 Performance and Comparison

We compare the counting-by-density performance of NAS-Count with other state-of-the-art methods on four challenging datasets. In particular, the counting accuracy comparison is reported in Tables 1 and 4, while the density map quality result is shown in Table 2.

4.5.1 Counting Accuracy

ShanghaiTech. The ShanghaiTech is composed of Part_A and Part_B with in total of 1198 images. It is one of the largest and most widely used datasets in crowd counting. As shown in Table 1, AMSNet leads others on ShanghaiTech in terms of both MAE and MSE. On Part_A, we surpass the second best by 6.9% and 3.2% in MAE and MSE. On Part_ B we achieve 6.6 and 11.9 improvements respectively.

UCF_CC_50 The UCF_CC_50 dataset contains 50 images of varying resolutions and densities. In consideration of sample scarcity, we follow the standard protocol [24] and use 5-fold cross-validation to evaluate method performance. As shown in Table 1, AMSNet achieves the best MAE and second MSE, elevating the MAE performance by 1.7%.

UCF-QNRF. The UCF-QNRF dataset introduced by Idrees et al. [25] has images with the highest crowd counts and largest density variation, ranging from 49 to 12865 people per image. These characteristics make UCF-QNRF extremely challenging for counting models, and only selected methods have published their results. Nonetheless, we produce the best MAE and MSE scores on this dataset, leading the second best by 3.7% and 5.2%, respectively.

WorldExpo’10. The WorldExpo’10 dataset [74] contains 3980 images covering 108 different scenes. The training set contains 3380 images, and the test set includes 600 frames from 5 different scenes. As shown in Table 4, AMSNet achieves the lowest average MAE over the five scenes, and also performs the best on three of the scenes individually.
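The Table 4 claims can be cross-checked with a few lines of arithmetic. The per-scene values below are copied from Table 4; the variable and function names are illustrative only.

```python
# Per-scene MAE (S1..S5) copied from Table 4.
table4 = {
    "MCNN":      [3.4, 20.6, 12.9, 13.0, 8.1],
    "SANet":     [2.6, 13.2, 9.0, 13.3, 3.0],
    "CAN":       [2.9, 12.0, 10.0, 7.9, 4.3],
    "ECAN":      [2.4, 9.4, 8.8, 11.2, 4.0],
    "TEDNet":    [2.3, 10.1, 11.3, 13.8, 2.6],
    "AT-CSRNet": [1.8, 13.7, 9.2, 10.4, 3.7],
    "AMSNet":    [1.6, 8.8, 10.8, 10.4, 2.5],
}
scene_names = ["S1", "S2", "S3", "S4", "S5"]

def scene_average(per_scene_mae):
    """Average MAE over scenes, rounded to one decimal as in the table."""
    return round(sum(per_scene_mae) / len(per_scene_mae), 1)

averages = {method: scene_average(values) for method, values in table4.items()}

# Method with the lowest average MAE, and the scenes where AMSNet is lowest.
best_overall = min(averages, key=averages.get)
amsnet_best_scenes = [
    name for idx, name in enumerate(scene_names)
    if min(table4, key=lambda m: table4[m][idx]) == "AMSNet"
]
```

Recomputing the averages reproduces the Ave. column, with AMSNet lowest overall and lowest on S1, S2, and S5, while ECAN and CAN lead on S3 and S4.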

4.5.2 Density Map Quality

As shown in Table 2, we employ PSNR and SSIM indices to compare the quality of density maps estimated by different methods, and AMSNet performs the best on both indices, outperforming the second best by 4.0% and 6.7%, respectively. Notably, even when deploying AMSNet_light, which is the third lightest model, we still generate the highest-quality density maps. Qualitatively, density maps produced by MCNN [75], SANet [10], TEDNet [26], ANF [73], and our AMSNet on ShanghaiTech Part_A are displayed in Figure 6. We further showcase more density maps generated by AMSNet on all employed datasets in Figure 3.
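For reference, PSNR between an estimated and a ground-truth density map can be sketched as below. This uses the textbook definition; taking the peak as the maximum of the ground-truth map is an assumption, since the normalization used in the experiments is not stated here. SSIM is omitted because it needs windowed statistics.

```python
import math

def psnr(estimated, reference, peak=None):
    """Peak signal-to-noise ratio (dB) between two equally sized 2-D maps.

    Assumption: when `peak` is not given, the maximum value of the
    reference map is used as the signal peak.
    """
    flat_est = [v for row in estimated for v in row]
    flat_ref = [v for row in reference for v in row]
    if len(flat_est) != len(flat_ref) or not flat_ref:
        raise ValueError("maps must be non-empty and of equal size")
    mse = sum((e - r) ** 2 for e, r in zip(flat_est, flat_ref)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical maps
    if peak is None:
        peak = max(flat_ref)
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy maps with made-up values.
reference_map = [[0.0, 1.0], [0.0, 0.0]]
estimated_map = [[0.0, 0.5], [0.0, 0.0]]
score = psnr(estimated_map, reference_map)
```

A higher PSNR means the estimated map is closer to the ground truth in a pixel-wise sense, which is why it is paired with SSIM to also capture structural similarity.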

5 Conclusion

NAS-Count is the first endeavor to introduce neural architecture search into counting-by-density. In this paper, we developed the state-of-the-art AMSNet encoder-decoder as well as the SPPLoss supervision paradigm. Specifically, AMSNet employs a novel composition of multi-scale feature extraction and fusion cells, both of which are searched efficiently from a counting-specific search space using a gradient-based strategy. SPPLoss extends the normal MSE loss with a scale pyramid architecture, which helps to supervise structural information in the density map at multiple scales. By jointly searching AMSNet and SPPLoss end-to-end, NAS-Count replaces tedious hand-design efforts, obtaining a multi-scale model automatically in less than 1 GPU day, and demonstrates the best overall performance on four challenging datasets.
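The idea of supervising a density map at several scales can be illustrated with a toy pyramid loss. This is not the searched SPPLoss: fixed block averaging stands in for the searched pooling operators, and the pyramid sizes and all function names are assumptions made for illustration.

```python
def mean_squared_error(a, b):
    """Pixel-wise MSE between two equally sized 2-D maps."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def block_average(density_map, out_size):
    """Pool an H x W map to out_size x out_size by averaging equal blocks."""
    h, w = len(density_map), len(density_map[0])
    if h % out_size or w % out_size:
        raise ValueError("map size must be divisible by out_size")
    bh, bw = h // out_size, w // out_size
    return [[sum(density_map[i * bh + di][j * bw + dj]
                 for di in range(bh) for dj in range(bw)) / (bh * bw)
             for j in range(out_size)]
            for i in range(out_size)]

def pyramid_loss(estimated, ground_truth, out_sizes=(1, 2)):
    """Pixel-wise MSE plus MSE between pooled maps at each pyramid level."""
    loss = mean_squared_error(estimated, ground_truth)
    for size in out_sizes:
        loss += mean_squared_error(block_average(estimated, size),
                                   block_average(ground_truth, size))
    return loss

def one_hot_map(row, col, size=4):
    """A size x size map with a single unit of density at (row, col)."""
    return [[1.0 if (r, c) == (row, col) else 0.0 for c in range(size)]
            for r in range(size)]

target = one_hot_map(0, 0)
near_miss = pyramid_loss(one_hot_map(0, 1), target)  # off by one pixel
far_miss = pyramid_loss(one_hot_map(3, 3), target)   # opposite corner
```

Plain pixel-wise MSE gives both misplaced predictions the same error, whereas the pooled terms penalize the far miss more than the near one, which is the kind of structural signal a multi-scale loss is meant to provide.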


References

  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §2.2.
  • [2] B. Baker, O. Gupta, R. Raskar, and N. Naik (2017) Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823. Cited by: §2.2.
  • [3] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pp. 549–558. Cited by: §2.2.
  • [4] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §2.2.
  • [5] L. Boominathan, S. S. Kruthiventi, and R. V. Babu (2016) Crowdnet: a deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 640–644. Cited by: §2.1.
  • [6] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §2.2.
  • [7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
  • [8] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang (2018) Efficient architecture search by network transformation. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §2.2.
  • [9] H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2.2.
  • [10] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision, pp. 734–750. Cited by: §1, §1, §1, §2.1, §3.2, Figure 6, §4.3, §4.5.2, Table 1, Table 2, Table 4.
  • [11] A. B. Chan and N. Vasconcelos (2009) Bayesian poisson regression for crowd counting. In Proceedings of the International Conference on Computer Vision, pp. 545–551. Cited by: §2.1.
  • [12] K. Chen, C. C. Loy, S. Gong, and T. Xiang (2012) Feature mining for localised crowd counting. In Proceedings of the British Machine Vision Conference, Vol. 1, pp. 3. Cited by: §1.
  • [13] L. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In Proceedings of the Advances in Neural Information Processing Systems, pp. 8699–8710. Cited by: §1, §2.3.
  • [14] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §3.2.
  • [15] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760. Cited by: §2.2.
  • [16] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun (2019) Detnas: neural architecture search on object detection. arXiv preprint arXiv:1903.10979. Cited by: §2.3.
  • [17] P. Dollar, C. Wojek, B. Schiele, and P. Perona (2012) Pedestrian detection: an evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 743–761. Cited by: §2.1.
  • [18] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21. Cited by: §2.2.
  • [19] T. Elsken, J. Metzen, and F. Hutter (2017) Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528. Cited by: §2.2.
  • [20] L. Fiaschi, U. Köthe, R. Nair, and F. A. Hamprecht (2012) Learning to count with regression forest and structured labels. In Proceedings of the International Conference on Pattern Recognition, pp. 2685–2688. Cited by: §2.1.
  • [21] W. Ge and R. T. Collins (2009) Marked point processes for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2913–2920. Cited by: §1, §2.1.
  • [22] G. Ghiasi, T. Lin, and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045. Cited by: §2.3.
  • [23] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley (2017) Google vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. Cited by: §2.2.
  • [24] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554. Cited by: §1, §2.1, §4.5.1, §4.
  • [25] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. arXiv preprint arXiv:1808.01050. Cited by: §1, §1, §2.1, §4.1.1, §4.5.1, §4.
  • [26] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6133–6142. Cited by: §1, §1, §2.1, §3.2, Figure 6, §4.3, §4.5.2, Table 1, Table 2, Table 4.
  • [27] D. Kang, Z. Ma, and A. B. Chan (2018) Beyond counting: comparisons of density maps for crowd analysis tasks-counting, detection, and tracking. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.1.
  • [28] S. Kumagai, K. Hotta, and T. Kurita (2017) Mixture of counting cnns: adaptive integration of cnns specialized to specific appearance for crowd counting. arXiv preprint arXiv:1703.09393. Cited by: §2.1.
  • [29] V. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1324–1332. Cited by: §1, §2.1.
  • [30] M. Li, Z. Zhang, K. Huang, and T. Tan (2008) Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In Proceedings of the International Conference on Pattern Recognition, pp. 1–4. Cited by: §1, §2.1.
  • [31] Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100. Cited by: §1, §2.1, §3.2, Table 1, Table 2.
  • [32] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li (2019) Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §2.2.
  • [33] G. Lin, A. Milan, C. Shen, and I. D. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 5. Cited by: §1.
  • [34] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92. Cited by: §1, §1, §2.2, §2.2, §2.3, §4.1.1.
  • [35] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision, pp. 19–34. Cited by: §1, §2.2, §2.3.
  • [36] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §2.2.
  • [37] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1, §2.2, §2.2, §3.1.1, §3.1.1, §4.1.1.
  • [38] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin (2019) Crowd counting with deep structured scale integration network. In Proceedings of the International Conference on Computer Vision, pp. 1774–1783. Cited by: §1, §2.1.
  • [39] W. Liu, M. Salzmann, and P. Fua (2019) Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.1, Table 1, Table 4.
  • [40] X. Liu, J. van de Weijer, and A. D. Bagdanov (2018) Leveraging unlabeled data for crowd counting by learning to rank. arXiv preprint arXiv:1803.03095. Cited by: §1.
  • [41] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §2.2.
  • [42] V. Nekrasov, H. Chen, C. Shen, and I. Reid (2019) Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9126–9135. Cited by: §1, §2.1, §2.3.
  • [43] V. Nekrasov, C. Shen, and I. Reid (2018) Light-weight refinenet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272. Cited by: §2.1.
  • [44] D. Onoro-Rubio and R. J. López-Sastre (2016) Towards perspective-free object counting with deep learning. In Proceedings of the European Conference on Computer Vision, pp. 615–629. Cited by: §1, §2.1.
  • [45] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §2.2.
  • [46] E. Real, A. Aggarwal, Y. Huang, and Q. Le (2019) Aging evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §1, §2.3.
  • [47] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1, §2.2, §2.3.
  • [48] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §1, §2.2.
  • [49] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241. Cited by: §1.
  • [50] D. Ryan, S. Denman, C. Fookes, and S. Sridharan (2009) Crowd counting using multiple local features. In Digital Image Computing: Techniques and Applications, 2009. DICTA’09., pp. 81–88. Cited by: §2.1.
  • [51] D. Ryan, S. Denman, S. Sridharan, and C. Fookes (2015) An evaluation of crowd counting methods, features and regression models. Computer Vision and Image Understanding 130, pp. 1–17. Cited by: §1, §2.1.
  • [52] S. A. M. Saleh, S. A. Suandi, and H. Ibrahim (2015) Recent survey on crowd density estimation and counting for visual surveillance. Engineering Applications of Artificial Intelligence 41, pp. 103–114. Cited by: §1, §2.1.
  • [53] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 6. Cited by: §1, Table 2.
  • [54] S. Saxena and J. Verbeek (2016) Convolutional neural fabrics. In Proceedings of the Advances in Neural Information Processing Systems, pp. 4053–4061. Cited by: §1, §2.2.
  • [55] M. Shi, Z. Yang, C. Xu, and Q. Chen (2019) Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7279–7288. Cited by: Table 1.
  • [56] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the International Conference on Computer Vision, pp. 1879–1888. Cited by: §4.1.3, Table 1, Table 2.
  • [57] V. A. Sindagi and V. M. Patel (2018) A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters 107, pp. 3–16. Cited by: §1, §2.1.
  • [58] D. R. So, C. Liang, and Q. V. Le (2019) The evolved transformer. arXiv preprint arXiv:1901.11117. Cited by: §2.3.
  • [59] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2.2.
  • [60] K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127. Cited by: §2.2.
  • [61] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §2.2.
  • [62] E. Walach and L. Wolf (2016) Learning to count with cnn boosting. In Proceedings of the European Conference on Computer Vision, pp. 660–676. Cited by: §2.1.
  • [63] M. Wang and X. Wang (2011) Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3401–3408. Cited by: §2.1.
  • [64] Q. Wang, J. Gao, W. Lin, and Y. Yuan (2019) Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.1.
  • [65] Z. Wang, Z. Xiao, K. Xie, Q. Qiu, X. Zhen, and X. Cao (2018) In defense of single-column networks for crowd counting. arXiv preprint arXiv:1808.06133. Cited by: §2.1, §4.1.
  • [66] L. Xie and A. Yuille (2017) Genetic cnn. In Proceedings of the International Conference on Computer Vision, pp. 1379–1388. Cited by: §2.2.
  • [67] S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: §1.
  • [68] S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §2.2.
  • [69] H. Xu, L. Yao, W. Zhang, X. Liang, and Z. Li (2019) Auto-fpn: automatic network architecture adaptation for object detection beyond classification. In Proceedings of the International Conference on Computer Vision, pp. 6649–6658. Cited by: §2.3.
  • [70] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong (2019) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. arXiv preprint arXiv:1907.05737. Cited by: §1, §2.2, §3.1.1, §3.1, §4.1.1.
  • [71] J. Yang, Q. Liu, and K. Zhang (2017) Stacked hourglass network for robust facial landmark localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2025–2033. Cited by: §1.
  • [72] C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter (2019) Nas-bench-101: towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635. Cited by: §2.2.
  • [73] A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao (2019) Attentional neural fields for crowd counting. In Proceedings of the International Conference on Computer Vision. Cited by: Figure 6, §4.5.2, Table 1, Table 2.
  • [74] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 833–841. Cited by: §2.1, §4.5.1, Table 1, §4.
  • [75] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597. Cited by: §1, §1, §2.1, Figure 6, §4.5.2, Table 1, Table 2, Table 4, §4.
  • [76] M. Zhao, J. Zhang, C. Zhang, and W. Zhang (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.1, Table 4.
  • [77] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.2.
  • [78] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §1, §2.2, §2.3.