Crowd Scene Analysis by Output Encoding

01/27/2020 · Yao Xue et al. · Xi'an Jiaotong University, University of Alberta

Crowd scene analysis is receiving growing attention due to its wide range of applications. Grasping accurate crowd locations (rather than merely crowd counts) is important for spatially identifying high-risk regions in congested scenes. In this paper, we propose a Compressed Sensing based Output Encoding (CSOE) scheme, which casts the detection of pixel coordinates of small objects as signal regression in an encoded signal space. CSOE boosts localization performance in circumstances where targets are highly crowded but do not vary hugely in scale. In addition, proper receptive field sizes are crucial for crowd analysis because of human size variations. We create Multiple Dilated Convolution Branches (MDCB), which offer a set of different receptive field sizes and improve localization accuracy when object sizes change drastically within an image. We also develop an Adaptive Receptive Field Weighting (ARFW) module, which further addresses scale variation by adaptively emphasizing informative channels that have appropriate receptive field sizes. Experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance across four mainstream datasets and performs especially well in highly crowded scenes. More importantly, the experiments support our insights that it is crucial to tackle target size variation in crowd analysis, and that casting crowd localization as regression in an encoded signal space is highly effective.


I Introduction

The wide deployment of surveillance cameras in many cities has stimulated recent research interest in the visual analysis of crowd scenes. Such analysis has a wide range of real-world applications, such as crowd surveillance, traffic monitoring and planning, and even cell counting.

Mainstream approaches can be summarized into two categories: counting by density prediction and counting by detection. State-of-the-art methods use regression-based models (e.g., object density estimators) that explicitly learn to count the objects of interest. These counting-by-density-prediction approaches [39] [31] [54] [32] [5] [35] have achieved superior performance on several existing counting datasets [54] [8] [18].

Density prediction methods measure the deviation of the output density from a ground-truth density during training. In order to train the density predictor, one has to create the ground-truth density map by smoothing point (people head) annotations. This smoothing operation is extremely sensitive to high crowd density: when objects are densely packed, neighboring peaks in the density map merge, introducing errors at the very beginning of the pipeline. Additionally, sparse object locations create an imbalance between positive and negative samples in the cost function.
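
To make the peak-merging issue concrete, the following minimal NumPy/SciPy sketch (our illustration, not the paper's code; head positions and the smoothing width are hypothetical) builds a ground-truth density map by Gaussian smoothing of a dot map:

```python
# Sketch: ground-truth density map from point annotations, and why
# neighboring peaks merge when the crowd is dense.
import numpy as np
from scipy.ndimage import gaussian_filter

H, W = 64, 64
dot_map = np.zeros((H, W), dtype=np.float32)
# Two hypothetical head annotations only 3 pixels apart.
dot_map[32, 30] = 1.0
dot_map[32, 33] = 1.0

density = gaussian_filter(dot_map, sigma=4.0)  # smoothing spreads each dot
print(density.sum())        # ~2.0: the global count survives smoothing
print(density[32, 28:36])   # but the two peaks have blurred into one blob
```

The global count is preserved, but two heads a few pixels apart become indistinguishable in the smoothed map, which is exactly the error source described above.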

Recent research [31] has indicated that predicting only the object count or a global density map for congested scenes is insufficient for real-world demands such as public safety or traffic flow monitoring. Grasping accurate crowd positions (rather than merely global density) is important for spatially identifying high-risk regions within whole monitoring images. Counting by density prediction has nevertheless been the recent fashion. Although the crowd count can be estimated precisely at the whole-image level, the predicted density map can deviate largely from the true density map at the sub-image level. We illustrate this phenomenon in Fig. 1, where the global true count (361) and estimated count (365) are quite close to each other. But the estimated density does not reliably approximate the ground truth in specific image regions, whose true counts are 117, 63 and 16, but whose estimated counts are 138, 56 and 12.

Fig. 1: Examples of predicted density maps for the ShanghaiTech dataset (true count: 361; prediction: 365). Left column: crowd image. Middle column: ground truth. Right column: prediction.

Despite the fashion of counting by density prediction, many studies [31] [18] [25] [33] still propose to fulfill crowd counting and crowd localization simultaneously. Recent work [18] [31] has argued that tackling the localization task brings noteworthy benefits for counting, such as counting error correction and enabling localization-based applications, e.g., human tracking. Counting-by-localization approaches formulate counting as a classic computer vision problem: object detection. For crowd scenes with little occlusion and low density, a well-trained detector is able to localize objects, from which the object count follows naturally. These methods strive to directly predict the pixel-level x,y-coordinates of objects. In this way, inevitable system prediction errors directly affect the pixel-level coordinates of detected objects, e.g., position shifts or size scaling of bounding boxes or dotted annotations. For general object detection, a position shift or size scaling of tens of pixels may be acceptable. For accurate crowd localization, however, such shift or scaling poses much bigger challenges in crowded scenes where objects are densely distributed, and can result in completely false detections. It is often observed that the more crowded an area is, the more inaccurate the detections become.

In this paper, we treat the task of crowd localization as an application of integrating Compressed Sensing based Output Encoding (CSOE) with supervised learning by a CNN. As the output space of the crowd localization problem is sparse (only a few pixel locations are people head centroids), we can employ CSOE here. Furthermore, CS theory dictates that pairwise distances in the sparse space are approximately maintained in the compressed space [16]. So, even after the output space encoding, the CNN still targets the original output space in an equivalent distance norm.
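
The distance-preservation property cited above can be checked directly. The following small NumPy demo (our illustration; dimensions and sparsity levels are arbitrary assumptions) compares distances between two sparse vectors before and after a random Gaussian projection:

```python
# Random Gaussian projections approximately preserve pairwise distances
# between sparse vectors, so regression targets in the compressed space
# remain faithful surrogates for the original output space.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4096, 256, 10            # ambient dim, compressed dim, sparsity

def random_sparse(rng, n, k):
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    return x

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # sensing matrix
x1, x2 = random_sparse(rng, n, k), random_sparse(rng, n, k)

print(np.linalg.norm(x1 - x2))               # distance in the sparse space
print(np.linalg.norm(Phi @ x1 - Phi @ x2))   # nearly the same after encoding
```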

The principle behind our CSOE module is straightforward. CS converts the sparse output pixel space into dense, short vectors. As a regressor, a trained CNN predicts the compressed vectors. A reconstruction algorithm then recovers the sparse head locations in the output pixel space. In other words, we seek a different route that casts the problem of detecting a variable number of small objects into a task of signal regression in an encoded signal space. Compared to the pixel-coordinate representation of a crowd image, a signal representation is more robust to inevitable system errors.

On the basis of CSOE, we further render the CNN+CS structure end-to-end trainable. CS-based encoding started with the work of Hsu et al. [16], who proved a generalization bound on the prediction error. The bound depends on two factors: how well the machine learner has predicted, and how well the recovery process has worked. In this work, we jointly optimize both the machine learner and the recovery process by implementing them as CNN-based observation layers and sparse-coding-based reconstruction layers of an end-to-end trainable network. In addition, we derive a backpropagation rule for the reconstruction layers. Thus, the end-to-end training not only takes place within the observation layers but also back-propagates error signals to optimize the parameters of the reconstruction layers, which removes the risk of gradient vanishing in the deep reconstruction layers. This differs from the conventional sequential pipeline, where each component is optimized independently, which can cause error accumulation.

Scale variation is a crucial challenge for object localization. One solution is to deploy in-network feature pyramids; for example, FPN [30] adds a top-down connection to incorporate semantically strong high-level features. Facing the issue of scale variation, we create Multiple Dilated Convolution Branches (MDCB) that share convolution weights but have different receptive field sizes for objects of different sizes. Furthermore, we deploy center pooling [11] to introduce the visual patterns within objects into the centroid point detection process.

Due to variations in distance and capture angle between the camera and the crowd, targets exhibit huge size variation. Recent crowd analysis works [54] [29] [17] [28] have suggested that the receptive field sizes of a neural network should not be fixed, but modulated by the stimulus. Unfortunately, this property has not received much attention in the construction of deep learning models. In this paper, we present a nonlinear approach to aggregate information from multiple kernels and realize adaptive changes of receptive field size. Specifically, we introduce an Adaptive Receptive Field Weighting (ARFW) module, which consists of a triplet of operations: spatially Aggregate, inter-channel Weight, and Modulate. On top of a backbone that generates multiple branches with various kernel sizes corresponding to different receptive field sizes, the Aggregate operator produces a channel descriptor by aggregating feature maps across their spatial dimensions. This descriptor provides a global receptive field within channel-wise feature maps. The Weight operator is based on two fully connected layers and produces a set of weights over the channels. The Modulate operator rescales the feature maps of different receptive field sizes according to the channel weights. Through this mechanism, networks can learn to use global information to adaptively emphasize informative channels that have appropriate receptive field sizes and suppress less useful ones.

The contributions of this paper are summarized as follows. (1) We propose the Compressed Sensing based Output Encoding (CSOE) scheme, which casts object localization as a signal regression task; CSOE boosts localization performance in circumstances where targets are highly crowded but do not vary hugely in scale. (2) We create Multiple Dilated Convolution Branches (MDCB), which offer a set of different receptive field sizes and improve localization accuracy when object sizes change drastically within an image. Unlike traditional convolution+pooling operations, MDCB avoids the excessive loss of detail resolution that poses huge challenges to high-density crowd scene analysis. (3) To further address the scale variation issue, we propose an Adaptive Receptive Field Weighting (ARFW) mechanism through which networks learn to use global information to adaptively emphasize informative channels that have appropriate receptive field sizes. (4) In the observation head, we enrich geometric center information by center pooling to capture the more recognizable visual patterns that lie within objects but not always at their geometric centers. (5) We render the method end-to-end trainable by deriving an independent backpropagation rule for the reconstruction layers, preventing the gradient vanishing and error accumulation brought by conventional cascaded networks.

Fig. 2: System overview of the proposed method.

II Related Work

Density prediction approaches

Regression-based algorithms are developed to regress the object count of interest from crowd images. Recently, end-to-end trainable deep neural networks (DNNs) have been widely adopted for the crowd counting task. These DNNs are optimized to predict a density map that approximates the ground-truth crowd distribution. For example, Zhang et al. [6] solve the cross-scene crowd counting problem with a deep convolutional neural network trained on density maps and global counts. The quality of the density map can have a non-negligible impact on counting results; [19] reports a significant boost in crowd counting performance after adopting a density map refinement framework.

Localization approaches

Early counting-by-detection methods [36] [47] [48] [26] rely on hand-crafted features, which cannot handle highly congested scenes with occlusions well. Nowadays, deep learning models have become the key solution to the object detection problem. Therefore, many studies apply deep learning based detection frameworks to crowd counting and have achieved remarkable performance improvements. [43] proposes an end-to-end trainable human detector for crowded scenes. For scenes with little occlusion and low density, a well-trained detector is able to localize objects; for example, DecideNet [20] deploys a human detector to rectify counting in low-density regions. However, in many crowd scenes people's heads are so small that bounding box annotations are not suitable, so dot annotations on head centers are usually used for crowd localization. This limits the application of detection-based methods in crowd counting. Additionally, high density is a hurdle for traditional counting-by-detection approaches. In this paper, we demonstrate that a detection-based method can not only give highly precise localization results, but also obtain comparable or even better counting performance by introducing compressed sensing techniques into the detection framework.

Compressed sensing based output encoding

Compressed sensing (CS) [4] [14] [10] and sparse coding (SC) [13] have emerged as frameworks for signal acquisition and reconstruction, with rich theoretical results and significant practical applications, such as MRI scan time reduction [2] and economical camera design [12]. CS-based encoding has a rather modest presence in the literature, where it has been applied with linear and non-linear machine learners. One early study of output encoding with error-correcting ability [9] showed superior accuracy. Later, redundancy in the output representation [45] yielded more accurate predictions. Recently, non-linear predictors such as a Bayesian learner [24], decision trees [21] and CNNs [49] have been used. Viswanathan et al. [24] used Bayesian inference with CS and showed good prediction accuracy. Decision trees and gradient boosting have also been used in conjunction with CS encoding to yield good prediction accuracies [21]. Recent studies [49] [51] [34] focus on cross-domain applications of compressive sensing and deep learning. For example, [49] develops a CS-based tumor cell localization scheme and proposes an end-to-end training network. However, unlike in crowd analysis, size variations of tumor cells are much smaller because of the strict operating conditions under which medical microscopy slides are made (tissue collection, sectioning, staining, scanning, etc.). Consequently, [49] does not address the target size variation issue. Without the ability to offer different receptive field sizes, [49] also cannot adaptively emphasize informative channels that have appropriate receptive field sizes.

Channel-wise attention mechanisms

Proper regulation of informative channels can have a positive effect on the overall performance of DNNs. Hu et al. [17] use a lightweight gating mechanism in SENet to adaptively recalibrate feature maps based on channel-wise dependencies. SKNet [27] aggregates the feature maps of different-sized kernels via selection weights to self-modulate receptive field sizes and achieves superior performance in object recognition. [37] proposes weighted channel dropout to filter channels according to their activation status and improves detection performance at a slight computational cost.

Multi-scale architectures

Considerable research attention has been paid to the scale variation issue. Several initial works [42] [53] [23] deploy multi-scale image pyramids to refine counting performance in areas where objects are densely present. Recent progress has taken spatial locality [32], cross-scale aggregation [5], and adaptive scale [35] into consideration. Gathering multiple branches with distinct targets has proven effective, as in Switch-CNN [39] and RAZ-Net [31]. [54] develops a multi-column CNN that uses different convolution kernel sizes to deal with varying density. [38] adopts a scale-aware training scheme for a multi-branch architecture to give each branch a specialty for corresponding scales and achieves remarkable improvements over the baseline. Additionally, enlarging the receptive field of a deep network is another insightful idea; for example, CSRnet [29] deploys a sequence of dilated convolutions and takes human body structure into consideration.

III The Proposed Method

III-A System Overview

The proposed detection framework consists of two components: (1) a crowd location encoding scheme based on compressed sensing, (2) an end-to-end trainable network which is made up of observation layers and sparse reconstruction layers. The structure of the whole framework is shown in Fig. 2.

To encode training labels, we propose a crowd location encoding scheme, which converts people locations from a pixel-space representation to a compressed signal representation. Each training pair, consisting of a crowd image and its signal, then trains a CNN as a multi-label regression model. We employ a joint loss function during training, as it is suitable for both signal regression and signal reconstruction. During testing, the observation layers of the network predict the crowd location signal for each test image; the sparse reconstruction layers then predict the pixel-level crowd locations.

III-B Crowd Location Encoding Scheme

Our proposed method relies on encoding people head locations into a dense code that a CNN predicts from an input image. We use a form of encoding that we refer to as encoding by Radon transform [22] followed by a random projection [50]. Radon transform is often seen as a mapping from Cartesian rectangular coordinates to polar coordinates, and is widely used for image reconstruction from the projections associated with cross-sectional scans of an object [22].

Referring to the encoding method shown in Fig. 2, let $D$ denote the binary (0/1) ground-truth head location matrix, with one entry per image pixel. In the first step of the encoding method, $D$ is converted to another sparse matrix $R$ by the Radon transform, which projects $D$ along radial lines oriented at specific angles. Here we use angles varying uniformly over the range $[0, 179]$ degrees. The transform results in a matrix $R$ with one column per projection angle, which we flatten into a sparse vector $\mathbf{x} \in \mathbb{R}^{n}$.

Since the Radon transform [22] of the people head locations is itself a sparse signal, in the second step of the encoding method we apply CS-based encoding, which compresses the sparse vector $\mathbf{x} \in \mathbb{R}^{n}$ into a much shorter, denser vector $\mathbf{y} \in \mathbb{R}^{m}$ with a sensing matrix $\Phi$:

$$\mathbf{y} = \Phi \mathbf{x} \qquad (1)$$

where $\Phi \in \mathbb{R}^{m \times n}$ is a random Gaussian sensing matrix (each element is independently and identically distributed zero-mean Gaussian) with $m \ll n$. CS theory [4, 14] states that given $\mathbf{y}$ and $\Phi$, a convex optimization can recover $\mathbf{x}$ provided the sensing matrix satisfies a restricted isometry property (RIP) and $m \geq c\,k \log(n/k)$, where $c$ is a small constant greater than one and $k$ is the maximum number of non-zero elements in $\mathbf{x}$.
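
As a concrete illustration of this two-step encoding, here is a minimal sketch assuming scikit-image's `radon`; the image size, head positions, and the $1/\sqrt{m}$ scaling of the sensing matrix are illustrative choices, not the paper's configuration:

```python
# Sketch: Radon transform of the binary head map, then compression with
# a random Gaussian sensing matrix (Eq. (1)).
import numpy as np
from skimage.transform import radon

rng = np.random.default_rng(0)

head_map = np.zeros((64, 64), dtype=np.float32)
head_map[20, 31] = 1.0                   # hypothetical head annotations
head_map[45, 12] = 1.0

theta = np.arange(0, 180)                # angles in [0, 179] degrees
sinogram = radon(head_map, theta=theta)  # still a sparse 2-D signal

x = sinogram.ravel()                     # sparse vector of length n
n, m = x.size, 512                       # m << n
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y = Phi @ x                              # short, dense code the CNN regresses
```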

Given $\mathbf{y}$ and $\Phi$, the recovery of $\mathbf{x}$ typically relies on a convex optimization with a penalty expressed by the $\ell_1$ norm:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \; \frac{1}{2} \left\| \mathbf{y} - \Phi\mathbf{x} \right\|_2^2 + \lambda \left\| \mathbf{x} \right\|_1 \qquad (2)$$

where $\lambda$ is a non-negative weight balancing the two terms in the cost function (2). Various algorithms exist today that can optimize (2); examples include orthogonal matching pursuit (OMP) [3] and the dual augmented Lagrangian (DAL) [44]. In this work, we realize the recovery process with an end-to-end neural network structure; details can be found in Section III-D.
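
For reference, a self-contained sketch of solving (2) with an off-the-shelf solver; we use scikit-learn's Lasso as a convenient stand-in for OMP [3] or DAL [44], and the dimensions and penalty weight are illustrative assumptions. In the paper the recovery is instead realized by trainable reconstruction layers (Section III-D).

```python
# Sketch: l1-penalized recovery of a sparse vector from its compressed code.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, m, k = 1024, 128, 10
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)

Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y = Phi @ x_true

solver = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
solver.fit(Phi, y)
x_hat = solver.coef_                       # approximate sparse recovery
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # small, up to shrinkage bias
```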

III-C Signal Regression by Observation Layers

We utilize a CNN to build a regression model between a crowd image and its compressed people-head-location signal $\mathbf{y}$.

III-C1 Backbone

We adopt the truncated VGG-16 network, i.e., the first 13 layers of VGG-16, as the input structure of our backbone; such a truncated network has shown superior transfer ability for crowd analysis ([31], [29]). A VGG-16 model pre-trained on ImageNet is used to initialize the backbone.

To equip the backbone with different receptive field sizes, we append to the truncated VGG-16 a Multiple Dilated Convolution Branches (MDCB) architecture with different dilation rates, adapting the receptive fields to objects of different scales. To control the receptive field of the backbone, we use dilation rates varying from 1 to 3 for 3×3 convolutions. As the output of the backbone, we concatenate the feature maps from the different branches along the channel dimension to form a set of feature responses with different receptive field sizes. An illustration of our backbone is shown in the left part of Fig. 4.

Fig. 3: Illustration of Adaptive Receptive Field Weighting (ARFW) block.

A risk of our multi-branch block is that it introduces many parameters, which may cause overfitting. To prevent this, we make the different branches share the same structure and weights, varying only the dilation rate between branches. The advantages of weight sharing are threefold. First, it needs no extra parameters compared with the original backbone network. Second, it reflects our motivation that objects of different sizes should be processed by a uniform transformation, differing only in scale. Finally, in this way the same set of parameters is fully trained over different scale ranges under different receptive fields.
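
A minimal PyTorch sketch of this shared-weight multi-branch design as we read it (channel counts are illustrative assumptions, not the paper's configuration):

```python
# MDCB sketch: branches reuse one 3x3 convolution weight but apply it with
# different dilation rates, then outputs are concatenated channel-wise.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDCB(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, dilations=(1, 2, 3)):
        super().__init__()
        self.dilations = dilations
        # One shared weight/bias reused by every branch (no extra parameters).
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.kaiming_normal_(self.weight, nonlinearity='relu')

    def forward(self, x):
        branches = [
            F.relu(F.conv2d(x, self.weight, self.bias,
                            padding=d, dilation=d))  # padding=d keeps spatial size
            for d in self.dilations
        ]
        return torch.cat(branches, dim=1)  # responses at several receptive fields

feats = torch.randn(1, 512, 32, 32)        # e.g. a truncated-VGG feature map
out = MDCB()(feats)                        # shape: (1, 256 * 3, 32, 32)
```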

III-C2 Adaptive Receptive Field Weighting

Adaptive Receptive Field Weighting contains the following three modules.

Fig. 4: Configuration of the backbone and observation head. All convolutional layers use padding. Convolution layer parameters are denoted as "Conv-(dilation rate)-(stride)-(kernel × kernel)-(number of filters)". Center pooling is conducted over a 3×3 pixel window with stride 2.

Aggregate: Spatial Information Embedding

In order to exploit channel dependencies, we first consider the signal to each channel in the output features. Each learned filter operates with a local receptive field and consequently can only exploit contextual information within that receptive field. To tackle this issue, we aggregate feature maps across their spatial dimensions, so that global information is embedded into a channel descriptor. To do this, we use global average pooling to generate channel-wise statistics: a statistic $\mathbf{z} \in \mathbb{R}^{C}$ is generated by shrinking the feature maps $V$ through their spatial dimensions $H \times W$, so that the $c$-th element of $\mathbf{z}$ is computed by:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} v_c(i, j) \qquad (3)$$

Adaptive Channel-wise Weighting

To make use of the information embedded in the aggregate operation, the second operation aims to fully capture channel-wise dependencies. This operation should be capable of learning a nonlinear interaction between channels and allow multiple channels to be emphasized simultaneously rather than enforcing a one-hot activation. To fulfill this objective, we choose to employ a gating mechanism with a sigmoid activation:

$$\mathbf{s} = \sigma\!\left( W_2 \, \delta(W_1 \mathbf{z}) \right) \qquad (4)$$

where $\sigma$ is the sigmoid function, $\delta$ refers to the ReLU function, and $W_1$ and $W_2$ denote the parameters of the two FC layers. To aid model generalization, we parameterize the gating mechanism by forming a bottleneck with two fully-connected (FC) layers: a nonlinear dimensionality-reduction layer with reduction ratio $r$, followed by a dimensionality-increasing layer returning to the number of input channels. Deploying two FCs is beneficial for modeling the complex dependencies between channels while limiting model complexity. This operator maps the input-specific descriptor $\mathbf{z}$ to a set of channel weights $\mathbf{s}$.

Modulate Channels

The last operator modulates the feature maps of different receptive field sizes according to the channel weights. The final output of the ARFW block is obtained by rescaling $V$ with the activations $\mathbf{s}$:

$$\tilde{x}_c = s_c \cdot v_c \qquad (5)$$

where $s_c \cdot v_c$ refers to channel-wise multiplication between the scalar $s_c$ and the feature map $v_c$. The ARFW block intrinsically introduces dynamics conditioned on the input, and can be regarded as a self-attention function over channels.
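
A PyTorch sketch of the ARFW block as we read it (Aggregate → Weight → Modulate), closely resembling a squeeze-and-excitation block [17]; the channel count and reduction ratio $r$ are illustrative assumptions:

```python
# ARFW sketch: global average pooling, two-FC bottleneck with sigmoid
# gating, then channel-wise rescaling of the multi-receptive-field maps.
import torch
import torch.nn as nn

class ARFW(nn.Module):
    def __init__(self, channels=768, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)  # dimensionality reduction
        self.fc2 = nn.Linear(channels // r, channels)  # back to input channels
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, v):                    # v: (B, C, H, W)
        b, c, _, _ = v.shape
        z = v.mean(dim=(2, 3))               # Aggregate: Eq. (3), global avg pool
        s = self.sigmoid(self.fc2(self.relu(self.fc1(z))))  # Weight: Eq. (4)
        return v * s.view(b, c, 1, 1)        # Modulate: Eq. (5)

v = torch.randn(1, 768, 32, 32)              # concatenated MDCB responses
out = ARFW()(v)                              # same shape, channels reweighted
```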

III-C3 Observation Head

The observation head takes the weighted feature maps from ARFW as input and predicts the encoded signal. Besides convolution and fully-connected layers, we further introduce the visual patterns within objects into the center point detection process by using center pooling [11]. Given a feature map as input, to determine whether a pixel is a center point, center pooling finds the maximum values along the pixel's horizontal and vertical directions and adds them together. In doing so, center pooling enriches center information within objects, since the geometric centers of objects do not always convey recognizable visual patterns. The configuration of the observation head is shown in the right part of Fig. 4.
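
A sketch of the center pooling operation as described above (our reading of [11], not the CenterNet reference code):

```python
# Center pooling: for each location, take the maximum response along its
# row and along its column, and sum the two. Strong responses anywhere
# inside an object thereby reinforce its center point.
import torch

def center_pool(fmap: torch.Tensor) -> torch.Tensor:
    # fmap: (B, C, H, W)
    row_max = fmap.max(dim=3, keepdim=True).values  # max over each row (W axis)
    col_max = fmap.max(dim=2, keepdim=True).values  # max over each column (H axis)
    return row_max + col_max                        # broadcasts back to (B, C, H, W)

fmap = torch.randn(1, 64, 32, 32)
pooled = center_pool(fmap)
```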

III-D Crowd Localization by Sparse Reconstruction Layers

Here we present a novel end-to-end trainable network for crowd localization with CS-based output encoding. Given the generalization bound of [16], jointly optimizing the prediction and the recovery in an end-to-end fashion should prove superior.

The bottom diagram of Fig. 2 shows the end-to-end structure of the network. The input image goes through observation layers, composed of a CNN, that output a dense vector $\hat{\mathbf{y}}$, which is compared to the ground-truth dense vector $\mathbf{y}$. The predicted dense vector is fed to reconstruction layers that reconstruct a sparse vector $\hat{\mathbf{x}}$, which is compared to the ground-truth sparse vector $\mathbf{x}$ by the $\ell_1$ norm. Thus, the cost function is a mixture of $\ell_2$ and $\ell_1$ norms:

$$L = \left\| \hat{\mathbf{y}} - \mathbf{y} \right\|_2^2 + \gamma \left\| \hat{\mathbf{x}} - \mathbf{x} \right\|_1 \qquad (6)$$

where $\gamma$ is a hyper-parameter that balances dense-vector errors against sparse-vector errors. To optimize the weights of the observation layers and the reconstruction layers jointly, we train the whole model against the overall loss (6) using gradient descent with backpropagation.

Let $g_{\hat{\mathbf{x}}}$ and $g_{\hat{\mathbf{y}}}$ denote the partial derivatives of the $\ell_1$ term in the loss function (6) with respect to $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$, respectively. The following backpropagation rule relates $g_{\hat{\mathbf{x}}}$ and $g_{\hat{\mathbf{y}}}$ (due to space limits, the derivation is provided in our additional materials):

(7)
(8)

The rules (7), (8) may not be numerically stable or efficient in batch training mode, as they involve inverting a different matrix for each image. We therefore derived an approximate, numerically stable, and efficient backpropagation rule for batch training (see additional materials):

(9)
(10)

Notice that using a standard toolbox, e.g., TensorFlow [1], requires both the observation and the reconstruction layers to be differentiable. This requirement leads to the architecture shown in the bottom diagram of Fig. 2. The observation layers, being a CNN, are differentiable. For the reconstruction layers, we use the differentiable learned iterative shrinkage and thresholding algorithm (LISTA) [15] architecture, which computes approximate sparse vectors with a recurrent neural network run for a limited number of iterations. The sparse reconstruction layers have their own trainable parameters. Thus, the entire architecture is differentiable and end-to-end trainable. We implemented this end-to-end trainable model using TensorFlow [1].
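
A compact PyTorch sketch of LISTA-style reconstruction layers [15]: a fixed number of iterations of $\mathbf{x} \leftarrow \mathrm{soft}(W_e \hat{\mathbf{y}} + S \mathbf{x}, \theta)$ with trainable $W_e$, $S$ and $\theta$, so gradients of the joint loss (6) can flow through the recovery step. Layer sizes and the unrolling depth T are illustrative; the paper's implementation uses TensorFlow [1].

```python
# LISTA sketch: unrolled, differentiable sparse recovery.
import torch
import torch.nn as nn

class LISTA(nn.Module):
    def __init__(self, m=256, n=4096, T=8):
        super().__init__()
        self.We = nn.Linear(m, n, bias=False)   # maps dense code to sparse space
        self.S = nn.Linear(n, n, bias=False)    # mutual-inhibition matrix
        self.theta = nn.Parameter(torch.full((n,), 0.1))  # soft-threshold levels
        self.T = T

    def soft(self, u):
        # Soft-thresholding (shrinkage) with learned per-coordinate thresholds.
        return torch.sign(u) * torch.clamp(u.abs() - self.theta, min=0.0)

    def forward(self, y_hat):                   # y_hat: (B, m) from observation layers
        b = self.We(y_hat)
        x = self.soft(b)
        for _ in range(self.T - 1):
            x = self.soft(b + self.S(x))
        return x                                # (B, n) approximate sparse vector

# Joint loss of Eq. (6): l2 on the dense code plus gamma * l1 on the sparse vector.
def joint_loss(y_hat, y, x_hat, x, gamma=0.1):
    return ((y_hat - y) ** 2).sum() + gamma * (x_hat - x).abs().sum()
```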

IV Experiment

IV-A Datasets and Evaluation Metrics

We evaluate the proposed method on four public crowd analysis benchmarks, whose basic information is summarized in Table I. It is worth mentioning that UCF-QNRF is the newest and largest crowd counting and localization dataset, with 1535 high-resolution images and over 1.25M annotated heads.

Dataset              Resolution    No. of images   Av. Count
ShanghaiTech-A [54]  589 × 868     482             501
ShanghaiTech-B [54]  768 × 1024    716             123
WorldExpo [8]        576 × 720     3980            56
UCF-QNRF [18]        2013 × 2902   1535            815

TABLE I: Summary of evaluation benchmarks.

For the crowd counting task, we use the mainstream evaluation metrics: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). With $\hat{c}_i$ the predicted count and $c_i$ the true count for image $i$, over $N$ test images the MAE and RMSE are

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{c}_i - c_i \right| \qquad (11)$$

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \hat{c}_i - c_i \right)^2 } \qquad (12)$$
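
Direct NumPy implementations of Eqs. (11) and (12):

```python
import numpy as np

def mae(pred_counts, true_counts):
    pred, true = np.asarray(pred_counts, float), np.asarray(true_counts, float)
    return np.mean(np.abs(pred - true))

def rmse(pred_counts, true_counts):
    pred, true = np.asarray(pred_counts, float), np.asarray(true_counts, float)
    return np.sqrt(np.mean((pred - true) ** 2))

print(mae([365], [361]), rmse([365], [361]))  # the Fig. 1 example: 4.0 4.0
```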

For the accurate crowd localization task, we adopt Precision, Recall and F1-score as evaluation metrics, following [18].



Method               ShanghaiTech-A   ShanghaiTech-B   WorldExpo                             UCF-QNRF
                     MAE     RMSE     MAE     RMSE     S1    S2     S3    S4     S5    Ave.  MAE     RMSE
CSR-Net [29]         58.2    115.0    10.6    16.0     2.9   11.5   8.6   16.6   3.4   8.6   –       –
Cascaded-CNN [40]    101.3   152.4    20.0    31.1     4.8   32.5   10.8  13.3   4.5   13.2  252     514
CP-CNN [41]          73.6    106.4    20.1    30.1     2.9   14.7   10.5  10.4   5.8   8.9   –       –
Switch-CNN [39]      90.4    135.0    21.6    33.4     4.4   15.7   10.0  11.0   5.9   9.4   228     445
MCNN [54]            110.2   173.2    26.4    41.3     3.4   20.6   12.9  13.0   8.1   11.6  277     426
SA-Net (patch) [5]   67.0    104.5    8.4     13.6     2.6   13.2   9.0   13.3   3.0   8.2   –       –
RAZ-Net [31]         65.1    106.7    8.4     14.1     2.0   11.8   9.0   13.6   3.3   8.0   116     195
ACSCP [52]           75.7    102.7    17.2    27.4     2.8   14.05  9.6   8.1    2.9   7.5   –       –
DecideNet [20]       –       –        20.75   29.42    2.0   13.14  8.9   17.40  4.75  9.23  –       –
Crowd-CNN [7]        –       –        32.0    49.8     9.8   14.1   14.3  22.2   3.7   12.9  –       –
Proposed             56.1    96.8     9.2     13.9     2.0   10.4   8.2   11.0   3.1   7.1   109.4   157.6
---------------------------------------------------------------------------------------------------------
Proposed (no-CSOE)   67.1    105.8    10.7    15.9     2.6   10.9   10.3  12.5   3.6   7.82  153.4   266.3
Proposed (no-CP)     63.9    98.7     10.4    18.3     2.9   11.9   8.9   12.0   5.3   9.6   112.8   164.1
Proposed (no-ARFW)   65.8    106.9    11.9    13.7     2.6   14.7   7.1   12.8   5.6   10.6  96.3    165.4

TABLE II: Counting performances on four crowd benchmarks.

IV-B Crowd Counting

In the first experiment, we compare our proposed approach with state-of-the-art approaches on the counting task. Quantitative results are shown in Table II, which includes strong competitors such as ACSCP [52], CSR-Net [29] and RAZ-Net [31]. The "Proposed" method refers to the end-to-end trainable model with CSOE+CP+MDCB+ARFW.

For the four evaluation datasets, we randomly divide the original training images into a training set (80%) and a validation set (20%). We perform a random grid search on the validation set to tune the hyper-parameters of the proposed algorithm, and evaluate the algorithm on the testing set. For example, the UCF-QNRF dataset contains 1535 images, comprising 1201 training images (961 training and 240 validation) and 334 testing images. The best-performing hyper-parameter values from the random grid search are selected separately for the ShanghaiTech-A, ShanghaiTech-B, WorldExpo and UCF-QNRF datasets.

Results on the testing sets are summarized in Table II. On ShanghaiTech-A, the proposed method achieves the best performance in terms of MAE and RMSE, demonstrating its effectiveness in outdoor scenes with significant perspective variations and complex background clutter. On ShanghaiTech-B, the proposed method outperforms most algorithms, with SA-Net (patch) and RAZ-Net as close competitors. It is worth mentioning that SA-Net (patch) was evaluated at the patch level, which differs from the standard protocol in the literature; when evaluated at the image level, the performance of SA-Net degrades severely, as also observed by [46]. Compared to RAZ-Net, the proposed method obtains lower RMSE. On the newest crowd dataset, UCF-QNRF, the proposed method achieves remarkable performance, followed by RAZ-Net. Similar to RAZ-Net, the proposed method adopts a fusion scheme to solve the two related tasks of crowd counting and localization. However, by introducing the compressed sensing based encoding from pixel-level crowd locations to a robust vector representation, the proposed method can use accurate crowd location information to supervise layer-wise weight optimization throughout the end-to-end training process. Thus, the proposed method further reduces the MAE and RMSE of existing counting methods (see the gap between "Proposed" and RAZ-Net).

Fig. 5: Detection F1-scores with respect to average crowd density. The people distribution in the testing sample groups ranges from sparse to highly dense.
Fig. 6: Red dots represent the ground truth; green dots are the locations detected by the proposed approach. Bottom row: results obtained by the network with Compressed Sensing based Output Encoding (CSOE); top row: results obtained by the network without CSOE. CSOE boosts localization performance in circumstances where targets are highly crowded but do not vary hugely in scale.

IV-C Crowd Localization

The second experiment compares our proposed approach with state-of-the-art crowd localization methods: RAZ-Net [31], Compo-CNN [18], MCNN [54], PSDDN [33], etc. Table III reports their performance in terms of Precision and Recall. On ShanghaiTech-A, the proposed approach clearly outperforms the other competitors in both Precision and Recall. Similar to [33], we find that crowds in ShanghaiTech-A are much denser than those in ShanghaiTech-B. On ShanghaiTech-B, the proposed approach is superior to RAZ-Net, Compo-CNN and MCNN. On WorldExpo, the proposed approach achieves the highest Precision (0.820) and Recall (0.812). On the newest and most challenging dataset, UCF-QNRF, although most methods show a slight decline in Precision and Recall, the proposed approach improves over RAZ-Net and Compo-CNN, its two strongest competitors. In addition, representative localization results are shown in Fig. 6 and Fig. 7, where red dots represent the ground truth and green dots the locations detected by the proposed approach. Even for very dense crowds, the proposed method still generates precise localization results.



Method              ShanghaiTech-A      ShanghaiTech-B      WorldExpo           UCF-QNRF
                    Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
ACSCP [52]          0.792      0.828    0.790      0.601    0.737      0.796    0.756      0.597
DecideNet [20]      0.822      0.733    0.808      0.788    0.685      0.812    0.593      0.630
Crowd-CNN [7]       0.819      0.779    0.754      0.793    0.738      0.782    0.781      0.651
RAZ-Net [31]        0.865      0.697    0.841      0.758    0.795      0.731    0.815      0.711
Compo-CNN [18]      0.790      0.723    0.781      0.739    0.716      0.754    0.717      0.675
MCNN [54]           0.765      0.817    0.768      0.780    0.724      0.783    0.710      0.724
PSDDN [33]          0.760      0.806    0.824      0.760    0.809      0.775    0.788      0.675
Proposed            0.873      0.792    0.867      0.805    0.820      0.812    0.824      0.783
--------------------------------------------------------------------------------------------------
Proposed (no-CSOE)  0.836      0.745    0.827      0.794    0.779      0.786    0.792      0.719
Proposed (no-CP)    0.851      0.779    0.845      0.798    0.801      0.788    0.796      0.748
Proposed (no-ARFW)  0.860      0.762    0.852      0.787    0.809      0.793    0.804      0.751

TABLE III: Localization performances on four crowd benchmarks.
Fig. 7: Red dots: ground truth; green dots: localization results of the proposed approach. The results are obtained with both Compressed Sensing based Output Encoding and Center Pooling in use. Left column: without Multiple Dilated Convolution Branches (MDCB); middle column: with MDCB; right column: with both MDCB and Adaptive Receptive Field Weighting (ARFW). MDCB handles huge scale variation while avoiding excessive loss of detail resolution. ARFW further addresses scale variation by adaptively emphasizing informative channels that have appropriate receptive field sizes.

Effect of High Crowd Density

To further evaluate existing crowd localization methods, the third experiment investigates a significant issue in crowd analysis: how well do the methods work in highly crowded scenes? To clarify the effect of high crowd density, we explore the accuracies of the five aforementioned crowd localization approaches with respect to varying crowd density. We rank test samples from the WorldExpo [8] and UCF-QNRF [18] datasets according to the number of people present, resulting in 14080 images of size 200-by-200. We divide all test samples into 33 groups whose average crowd densities increase gradually from extremely sparse to extremely dense; for example, images in the first group contain only 1 person, while images in the 15th group contain 40.8 people on average. Fig. 5 presents the F1-scores of the five crowd localization methods on the 33 test sample groups. In the first 12 groups, where average crowd densities are not very high [0-30], RAZ-Net, Compo-CNN, MCNN or PSDDN achieves superior F1-scores and outperforms "Proposed" in 4 groups. But in the last 21 groups, whose average crowd densities are much higher, "Proposed" clearly preserves its discrimination ability in highly crowded scenes. The relative F1-score gains over the four methods increase with average crowd density. More specifically, it is in the range of average crowd density [60-135] (i.e., the 20th to 33rd groups) that Compo-CNN, MCNN and PSDDN show rapid and obvious performance declines with increasing average crowd density. In the most crowded group, the F1-scores of RAZ-Net, MCNN and PSDDN have dropped to the range 0.40-0.48, while "Proposed" maintains an F1-score of 0.637. The trend is clear: as crowd density increases, the accuracy gap between "Proposed" and the other methods widens, supporting our claim that regression in an encoded signal space is better than detecting pixel coordinates of small objects in pixel space for crowd analysis in highly congested scenes.

IV-D Ablation Study

We carry out ablation experiments to better understand the effect of four major components of the proposed method.

Compressed Sensing based Output Encoding (CSOE)

To investigate the role of CSOE, we design a comparison neural network that directly predicts the heatmap of head centroid positions without the CSOE and reconstruction modules. Fig. 6 shows localization results obtained by the network with CSOE (bottom) and without CSOE (top). The "Proposed (no-CSOE)" rows of Table II and Table III give the counting and localization results without CSOE on the four datasets. Specifically, CSOE boosts counting and localization performance in highly crowded circumstances, especially when target density is high but scale variation is not huge. The big Precision gap between "Proposed" and "Proposed (no-CSOE)" can also be observed.

Center Pooling (CP)

The "Proposed (no-CP)" rows of Table II and Table III give the counting and localization results without center pooling on the four datasets. Center pooling slightly improves localization accuracy, especially in terms of Precision, as it helps mine the recognizable features within objects.

Multiple Dilated Convolution Branches (MDCB)

MDCB helps to improve localization accuracy when head size changes drastically, since its multiple branches offer a set of different receptive field sizes. In addition, unlike the conventional convolution+pooling operation, dilated convolution does not bring excessive loss of detail resolution.

Adaptive Receptive Field Weighting (ARFW)

The "Proposed (no-ARFW)" rows of Table II and Table III give the counting and localization results without ARFW on the four datasets. To further handle huge scale variation within an image, we propose Adaptive Receptive Field Weighting (ARFW), which enables our model to use global information to adaptively emphasize informative channels that have appropriate receptive field sizes. Fig. 7 depicts localization results obtained by three models with different configurations: (1) left column: CSOE+CP only; (2) middle column: CSOE+CP+MDCB; (3) right column: CSOE+CP+MDCB+ARFW.

The receptive field size of the baseline model (CSOE+CP) is fixed, and such a fixed receptive field size is often hard to balance between small and large targets. Taking the second row as an example, we speculate that the receptive field size of the baseline is relatively small: the localization of distant small targets is acceptable, but multiple detections occur on a single large target close to the camera. With the introduction of MDCB and ARFW, the complete model (CSOE+CP+MDCB+ARFW) has a set of different receptive field sizes and is capable of adaptively emphasizing the appropriate one.

Components Configuration

Table IV presents F1-scores of the proposed method using different configurations on the four crowd benchmarks. Here CSOE, MDCB, CP and ARFW are used separately or jointly to fully explore their influence on localization performance. Not surprisingly, the complete model CSOE+MDCB+CP+ARFW achieves the best results. We observe that CSOE+MDCB+CP+ARFW performs much better than CSOE+MDCB+CP and MDCB+CP+ARFW, demonstrating that both CSOE and ARFW are crucial and contribute substantially to the overall performance. The small gap between CSOE+MDCB+CP+ARFW and CSOE+MDCB+ARFW indicates that CP brings a slight benefit, similar to the finding in [11]. It is also interesting that MDCB+ARFW acquires the highest F1-scores among the four two-component configurations: CSOE+MDCB, CSOE+CP, MDCB+CP and MDCB+ARFW. This provides strong evidence for the importance of tackling target size variation in crowd analysis, while CSOE is the best among all single-component configurations. Please note that ARFW must work with MDCB, so there is no configuration where only ARFW is used. The strong performance of CSOE further supports our earlier claim that regression in an encoded signal space is better than detecting pixel coordinates of small objects in pixel space for crowd analysis in highly congested scenes.

CSOE  MDCB  CP  ARFW    Sh-A    Sh-B    Wor     UCF
✓                       0.695   0.689   0.677   0.665
      ✓                 0.684   0.670   0.665   0.652
            ✓           0.637   0.610   0.608   0.571
✓     ✓                 0.761   0.752   0.697   0.762
✓           ✓           0.748   0.732   0.720   0.711
      ✓     ✓           0.723   0.738   0.690   0.709
      ✓         ✓       0.787   0.802   0.755   0.729
✓     ✓     ✓           0.808   0.818   0.801   0.777
      ✓     ✓   ✓       0.788   0.810   0.783   0.754
✓     ✓         ✓       0.823   0.821   0.794   0.791
✓     ✓     ✓   ✓       0.831   0.834   0.816   0.803

TABLE IV: Localization performances (F1-score) of the proposed method using different configurations on the four crowd benchmarks. Please note that ARFW must work with MDCB, so there is no configuration where only ARFW is used.

V Conclusion

In this paper, we developed several core modules, CSOE, MDCB, ARFW and sparse reconstruction layers, and integrated them into an end-to-end trainable network. A wide range of experiments shows the effectiveness of the proposed method, which presents state-of-the-art performance across multiple datasets and achieves especially strong results in scenes with high crowd density. The experiments support our insights that it is crucial to tackle target size variation in crowd analysis, and that casting crowd localization as regression in an encoded signal space is highly effective in crowd scenes. We hope these insights prove useful for other crowd analysis tasks.

References

  • [1] M. Abadi et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • [2] S. Birns, B. Kim, S. Ku, K. Stangl, and D. Needell (2016) A practical study of longitudinal reference based compressed sensing for MRI. CoRR abs/1608.04728.
  • [3] T. T. Cai and L. Wang (2011) Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory 57(7), pp. 4680–4688.
  • [4] E. J. Candes and J. K. Romberg (2005) Practical signal recovery from random projections. Proc. SPIE 5674, pp. 76–86.
  • [5] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In European Conference on Computer Vision (ECCV), pp. 757–773.
  • [6] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [8] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [9] T. G. Dietterich and G. Bakiri (1995) Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, pp. 263–286.
  • [10] D. L. Donoho (2006) Compressed sensing. IEEE Transactions on Information Theory 52, pp. 1289–1306.
  • [11] K. Duan et al. (2019) CenterNet: keypoint triplets for object detection. In IEEE International Conference on Computer Vision (ICCV).
  • [12] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk (2008) Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine 25(2), pp. 83–91.
  • [13] M. Elad (2010) Sparse and redundant representations: from theory to applications in signal and image processing. 1st edition, Springer.
  • [14] E. J. Candes, J. Romberg, and T. Tao (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52, pp. 489–509.
  • [15] K. Gregor and Y. LeCun (2010) Learning fast approximations of sparse coding. In ICML, pp. 399–406.
  • [16] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang (2009) Multi-label prediction via compressed sensing. arXiv:0902.1284 [cs.LG].
  • [17] J. Hu, L. Shen, and G. Sun (2017) Squeeze-and-excitation networks. CoRR abs/1709.01507.
  • [18] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In European Conference on Computer Vision (ECCV).
  • [19] J. Wan and A. Chan (2019) Adaptive density map generation for crowd counting. In IEEE International Conference on Computer Vision (ICCV).
  • [20] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann (2018) DecideNet: counting varying density crowds through attention guided detection and density estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [21] A. Joly (2016) Exploiting random projections and sparsity with random forests and gradient boosting methods. arXiv:1704.08067.
  • [22] A. C. Kak and M. Slaney (2002) Principles of computerized tomographic imaging. Medical Physics 29(1), pp. 107–107.
  • [23] D. Kang and A. Chan (2018) Crowd counting by adaptively fusing predictions from an image pyramid. In British Machine Vision Conference (BMVC).
  • [24] A. Kapoor, R. Viswanathan, and P. Jain (2012) Multilabel classification using Bayesian compressed sensing. In Advances in Neural Information Processing Systems 25, pp. 2645–2653.
  • [25] I. H. Laradji, N. Rostamzadeh, P. O. Pinheiro, D. Vázquez, and M. W. Schmidt (2018) Where are the blobs: counting by localization with point supervision. CoRR.
  • [26] M. Li, Z. Zhang, K. Huang, and T. Tan (2008) Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection. In International Conference on Pattern Recognition, pp. 1–4.
  • [27] X. Li et al. (2019) Selective kernel networks. CoRR abs/1903.06586.
  • [28] Y. Li et al. (2019) Scale-aware trident networks for object detection. In IEEE International Conference on Computer Vision (ICCV).
  • [29] Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [30] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2016) Feature pyramid networks for object detection. CoRR abs/1612.03144.
  • [31] C. Liu, X. Weng, and Y. Mu (2019) Recurrent attentive zooming for joint crowd counting and precise localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [32] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin (2018) Crowd counting using deep recurrent spatial-aware network. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 849–855.
  • [33] Y. Liu, M. Shi, Q. Zhao, and X. Wang (2019) Point in, box out: beyond counting persons in crowds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6469–6478.
  • [34] X. Lu, W. Dong, P. Wang, G. Shi, and X. Xie (2018) ConvCSNet: a convolutional compressive sensing framework based on deep learning. CoRR abs/1801.10342.
  • [35] Z. Lu and M. Shi (2018) Crowd counting via scale-adaptive convolutional neural network. In IEEE Winter Conference on Applications of Computer Vision (WACV).
  • [36] M. Rodriguez, I. Laptev, J. Sivic, and J. Audibert (2011) Density-aware person detection and tracking in crowds. In IEEE International Conference on Computer Vision (ICCV), pp. 2423–2430.
  • [37] S. Hou and Z. Wang (2019) Weighted channel dropout for regularization of deep convolutional neural network. In AAAI.
  • [38] D. B. Sam and R. V. Babu (2018) Top-down feedback for crowd counting convolutional neural network. In AAAI.
  • [39] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [40] V. A. Sindagi and V. M. Patel (2017) CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).
  • [41] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In IEEE International Conference on Computer Vision (ICCV).
  • [42] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In IEEE International Conference on Computer Vision (ICCV).
  • [43] R. Stewart and M. Andriluka (2016) End-to-end people detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2325–2333.
  • [44] R. Tomioka, T. Suzuki, and M. Sugiyama (2011) Super-linear convergence of dual augmented-Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research 12(8), pp. 1537–1586.
  • [45] G. Tsoumakas, I. Katakis, and I. Vlahavas (2011) Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering 23(7), pp. 1079–1089.
  • [46] J. Wan, W. Luo, B. Wu, A. B. Chan, and W. Liu (2019) Residual regression with semantic prior for crowd counting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4036–4045.
  • [47] M. Wang and X. Wang (2011) Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3401–3408.
  • [48] B. Wu and R. Nevatia (2005) Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In IEEE International Conference on Computer Vision (ICCV) 1, pp. 90–97.
  • [49] Y. Xue, G. Bigras, J. Hugh, and N. Ray (2019) Training convolutional neural networks and compressed sensing end-to-end for microscopy cell detection. IEEE Transactions on Medical Imaging 38(11), pp. 2632–2641.
  • [50] Y. Xue and N. Ray (2018) Output encoding by compressed sensing for cell detection with deep convnet. In AAAI Workshop on Artificial Intelligence Applied to Assistive Technologies and Smart Environments, pp. 159–165.
  • [51] Y. Yang, J. Sun, H. Li, and Z. Xu (2018) ADMM-CSNet: a deep learning approach for image compressive sensing. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
  • [52] Z. Shen, Y. Xu, B. Ni, M. Wang, and X. Yang (2018) Crowd counting via adversarial cross-scale consistency pursuit. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [53] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang (2017) Multi-scale convolutional neural networks for crowd counting. In IEEE International Conference on Image Processing (ICIP), pp. 465–469.
  • [54] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 589–597.