Hybrid Graph Neural Networks for Crowd Counting

01/31/2020 ∙ by Ao Luo, et al.

Crowd counting is an important yet challenging task due to large variations in scale and density. Recent investigations have shown that distilling rich relations among multi-scale features and exploiting useful information from an auxiliary task, i.e., localization, are vital for this task. Nevertheless, how to comprehensively leverage these relations within a unified network architecture remains a challenging problem. In this paper, we present a novel network structure called Hybrid Graph Neural Network (HyGnn), which aims to address this problem by interweaving the multi-scale features for crowd density and its auxiliary task (localization), and performing joint reasoning over a graph. Specifically, HyGnn integrates a hybrid graph that jointly represents the task-specific feature maps of different scales as nodes, and two types of relations as edges: (i) multi-scale relations capturing the feature dependencies across scales, and (ii) mutually beneficial relations building bridges for cooperation between counting and localization. Thus, through message passing, HyGnn can distill rich relations between the nodes to obtain more powerful representations, leading to robust and accurate results. HyGnn performs remarkably well on four challenging datasets: ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF_QNRF, outperforming state-of-the-art approaches by a large margin.


Introduction

Crowd counting, with the purpose of analyzing large crowds quickly, is a crucial yet challenging computer vision and AI task. It has drawn considerable attention due to its potential applications in public security and planning, traffic control, crowd management, public space design, etc.

As with many other computer vision tasks, the performance of crowd counting has been substantially improved by Convolutional Neural Networks (CNNs). Recently, the state-of-the-art crowd counting methods [19, 21, 38, 11] have mostly followed the density-based paradigm: given an image or video frame, a CNN-based regressor is trained to estimate the crowd density map, whose values are summed to give the overall crowd count.
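The density-based paradigm can be illustrated with a tiny numpy sketch; the map values below are made up for illustration, but they show how per-pixel density integrates to a count:

```python
import numpy as np

# A density-map regressor outputs a per-pixel head-density estimate;
# the predicted crowd count is simply the sum over all pixels.
# Toy example: a 4x4 "density map" whose mass integrates to 3 people.
density_map = np.zeros((4, 4), dtype=np.float64)
density_map[0, 0] = 1.0       # one person concentrated in a single pixel
density_map[2, 1:3] = 0.5     # one person spread over two pixels
density_map[3, :] = 0.25      # one person spread over a whole row

count = density_map.sum()     # -> 3.0
```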

Figure 1: Illustration of the proposed HyGnn model. (a) Input image, in which crowds exhibit heavy overlaps and occlusions. (b) Backbone, a truncated VGG-16 model. (c) Domain-specific branches: one for crowd counting and the other for localization. (d) HyGnn, which represents the features from different scales and domains as nodes, and the relations between them as edges. After several message passing iterations, multiple types of useful relations are built. (e) Crowd density map (for counting) and localization map (the auxiliary task).

Recent studies [30, 3, 14, 15] have shown that multi-scale information, or relations across multiple scales, helps to capture contextual knowledge that benefits crowd counting. Moreover, crowd counting and its auxiliary task (localization), despite analyzing the crowd scene from different perspectives, can provide beneficial clues for each other [19, 18]. The crowd density map offers guidance information and self-adaptive perception for precise crowd localization; conversely, crowd localization helps to alleviate the local inconsistency issue in the density map. This mutual cooperation, which we call the mutually beneficial relation, is a key factor in estimating a high-quality density map. However, most methods consider the crowd counting problem from only one aspect while ignoring the other. Consequently, they fail to fully utilize the multiple types of useful relations or structural dependencies during learning and inference, resulting in sub-optimal results.

One primary reason is the lack of a unified and effective framework capable of modeling the different types of relations (i.e., multi-scale relations and mutually beneficial relations) within a single model. To address this issue, we introduce a novel Hybrid Graph Neural Network (HyGnn), which formulates crowd counting and localization as a graph-based, joint reasoning procedure. As shown in Fig. 1, we build a hybrid graph consisting of two types of nodes, i.e., counting nodes storing density-related features and localization nodes storing location-related features, with two different pairwise relationships (edge types) between them. By interweaving the multi-scale and multi-task features and progressively propagating information over the hybrid graph, HyGnn can fully leverage the different types of useful information, and is capable of distilling the valuable, high-order relations among them for much more comprehensive crowd analysis.

HyGnn is easy to implement and end-to-end learnable. Importantly, it has two major benefits compared to existing crowd counting models [19, 21, 38, 11]. (i) HyGnn interweaves crowd counting and localization through joint, multi-scale, graph-based processing rather than the simple combination used in most existing solutions. It thereby significantly strengthens the information flow between tasks and across scales, enabling the augmented representation to incorporate more useful priors learned from the auxiliary task and from different scales. (ii) HyGnn explicitly models and reasons about all relations (multi-scale relations and mutually beneficial relations) simultaneously over a hybrid graph, whereas most existing methods cannot handle such complicated relations. HyGnn can therefore effectively capture their dependencies to overcome inherent ambiguities in crowd scenes. Consequently, our predicted crowd density map is potentially more accurate and more consistent with the true crowd localization.

In our experiments, we show that HyGnn performs remarkably well on four widely used benchmarks and surpasses prior methods by a large margin. Our contributions are summarized in three aspects:

  • We present a novel end-to-end learnable model, namely Hybrid Graph Neural Network (HyGnn), for joint crowd counting and localization. To the best of our knowledge, HyGnn is the first deep model capable of explicitly modeling and mining high-level relations between counting and its auxiliary task (localization) across different scales through a hybrid graph model.

  • HyGnn is equipped with a unique multi-tasking property, where different types of nodes, connections (or edges), and message passing functions are parameterized by different neural designs. With this property, HyGnn can more precisely leverage cooperative information between crowd counting and localization to boost counting performance.

  • We conduct extensive experiments on four well-known benchmarks including ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF_QNRF, on which we set new records.

Figure 2: Overview of our HyGnn model. The model is built on a truncated VGG-16 and includes a Domain-specific Feature Learning Module to extract features from different domains. A novel HyGnn distills multi-scale and cross-domain information to learn better representations. Finally, the multi-scale features are fused to produce the density map for counting as well as the auxiliary task prediction (the localization map).

Related Works

Crowd Counting and Localization.

Early works [36] in crowd counting use detection-based methods, employing handcrafted features like Haar [37] and HOG [6] to train the detector. The overall performance of these algorithms is rather limited due to various occlusions. Regression-based methods, which avoid solving the hard detection problem, have become mainstream and achieved great performance breakthroughs. Traditionally, regression models [4, 13, 25] learn a mapping between low-level image features and the object count or density using Gaussian process or random forest regressors. Recently, various CNN-based counting methods have been proposed [42, 43, 19, 21, 38, 11] to better deal with different challenges, predicting a density map whose values are summed to give the count. In particular, the scale variation issue has attracted the most attention in recent CNN-based methods [5, 35]. On the other hand, as observed in some recent research [9, 19, 18], although current state-of-the-art methods report accurate crowd counts, they may produce density maps that are inconsistent with the true density. One major reason is the lack of crowd localization information. Some recent studies [45, 19] have tried to exploit useful information from localization in a unified framework. However, they merely share the underlying representations or interweave the two task modules for more robust representations. In contrast, our HyGnn takes a better way to utilize the mutual guidance information: explicitly modeling and iteratively distilling the mutually beneficial relations across scales within a hybrid graph. For a more comprehensive survey, we refer interested readers to [12].

Graph Neural Networks.

The essential idea of Graph Neural Networks (GNNs) is to enhance node representations by propagating information between nodes. Scarselli et al. [29] first introduced the concept of the GNN, extending recursive neural networks to process graph-structured data. Li et al. [17] proposed to improve the representation capacity of GNNs by using Gated Recurrent Units (GRUs). Gilmer et al. [7] generalized GNNs with the message passing neural network framework. Recently, GNNs have been successfully applied to attribute recognition [23], human-object interactions [26], action recognition [33], etc. Our HyGnn shares with the above methods the idea of fully exploiting the underlying relationships between multiple latent representations through a GNN. However, most existing GNN-based models are designed to deal with only one relation type, which may limit their power. To overcome this limitation, our HyGnn is equipped with a multi-tasking property, i.e., it parameterizes different types of connections (or edges) and message passing functions with different neural designs, which distinguishes HyGnn from all existing GNNs.

Methodology

Preliminaries

Problem Formulation.

Let the crowd counting model be represented by a function that takes an image as input and generates the corresponding crowd density map (for counting) as well as the auxiliary task prediction, i.e., the localization map. Given the ground-truth density map and localization map, our goal is to learn powerful domain-specific representations that minimize the errors between the estimated maps and their ground truths. Notably, the two tasks share a common meta-objective, and both ground-truth maps are obtained from the same point annotations without additional target labels.

Notations.

To achieve this goal, we need to distill the underlying dependencies between multi-task and multi-scale features. Given the multi-scale density feature maps and multi-scale localization feature maps, we represent them with a directed graph G = (V, E), where V is the set of nodes and E the set of edges. The nodes in our HyGnn are further grouped into two types: the set of counting (density) nodes and the set of localization nodes; the two latent domains contain the same number of nodes. Accordingly, there are two types of edges between them: (i) cross-scale edges, which stand for the multi-scale relations between nodes of different scales within the same domain; and (ii) cross-domain edges, which reflect the mutually beneficial relations between nodes of the same scale in different domains. For each node, we learn an updated representation by aggregating the representations of its neighbors. Finally, the updated multi-scale features are fused to produce the final representation of each domain, which is used to generate the corresponding output. Note that we only consider multi-scale relations between nodes in the same domain, and mutually beneficial (cross-domain) relations between nodes of the same scale. Since our graph model is designed to simultaneously deal with two different node and relation types, we term it a Hybrid Graph Neural Network (HyGnn), detailed in the following section.
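The node and edge sets described above can be sketched as plain index sets; this is a minimal illustration assuming N = 3 scales per domain, with the domain labels `cnt`/`loc` chosen here for readability (they are not from the paper's code):

```python
# Minimal sketch of the hybrid graph's node and edge index sets.
N = 3  # number of scales per domain

# One counting node and one localization node per scale.
nodes = [(d, s) for d in ("cnt", "loc") for s in range(N)]

# Cross-scale edges: same domain, different scales (directed).
cross_scale = [((d, i), (d, j))
               for d in ("cnt", "loc")
               for i in range(N) for j in range(N) if i != j]

# Cross-domain edges: same scale, opposite domains (both directions).
cross_domain = [(("cnt", s), ("loc", s)) for s in range(N)] + \
               [(("loc", s), ("cnt", s)) for s in range(N)]
```

With N = 3 this yields 6 nodes, 12 cross-scale edges, and 6 cross-domain edges.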

Hybrid Graph Neural Network (HyGnn)

Overview.

The key idea of our HyGnn is to perform message propagation iterations over the graph to jointly distill and reason about all relations between crowd counting and the auxiliary task (localization) across scales. Generally, as shown in Fig. 2, HyGnn maps the given image to the final predictions through three phases. First, in the domain-specific feature extraction phase, HyGnn generates the multi-scale density features and localization features through a Domain-specific Feature Learning Module (DFL), and represents these features with a graph. Second, a parametric message passing phase runs for K iterations to propagate messages between nodes and to update the node representations according to the received messages. Third, a readout phase fuses the updated multi-scale features of each domain into final representations and maps them to the output density and localization maps. Note that, as crowd counting is our main task, we emphasize its accuracy during the learning process.

Domain-specific Feature Learning Module (DFL).

DFL is one of the major modules of our model; it extracts the multi-scale, domain-specific features from the input image. DFL is composed of three parts: one front-end and two domain-specific back-ends. The front-end is based on the well-known VGG-16 and maps the RGB image to shared underlying representations. More specifically, the first 10 layers of VGG-16 are deployed as the front-end, which is shared by the two tasks. Meanwhile, two series of convolution layers with different dilation rates are appended as the back-ends. With their large receptive fields, the stacked dilated convolutions are tailored for learning domain-specific features. In addition, a Pyramid Pooling Module (PPM) [44] is applied in each domain-specific back-end to extract multi-scale features, followed by an interpolation layer that ensures the multi-scale feature maps have the same size.
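The pooling-then-interpolation step can be sketched as follows; this toy version substitutes simple non-overlapping average pooling and nearest-neighbour upsampling for the actual PPM and interpolation layers, which is a deliberate simplification:

```python
import numpy as np

def avg_pool(x, k):
    """Non-overlapping k x k average pooling (H and W divisible by k)."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def upsample_nearest(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def pyramid_features(x, factors=(1, 2, 4)):
    """Pool at several scales, then interpolate every map back to the
    input size so all scale-specific features share one resolution."""
    return [upsample_nearest(avg_pool(x, k), k) for k in factors]

feat = np.arange(16, dtype=np.float64).reshape(4, 4)
pyramid = pyramid_features(feat)   # three maps, all 4x4
```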

Figure 3: The architecture of the learnable adapter. The adapter takes the node representation of one (source) domain as input and outputs adaptive convolution parameters; the adaptive representation is generated conditioned on the target-domain node representation.

Node Embedding.

In our HyGnn, each counting or localization node, indexed by its scale, is associated with an initial node embedding (or node state). We use the domain-specific feature maps produced by DFL as the initial node representations. Taking an arbitrary counting node as an example, its initial representation is calculated by:

(1)

where the result is a 3D tensor feature (the batch dimension is omitted), obtained by applying the pyramid pooling operation followed by the interpolation operation. The initial representation of a localization node is defined analogously:

(2)

Cross-scale Edge Embedding.

A cross-scale edge connects two nodes from the same domain but different scales. The cross-scale edge embedding is used to distill the multi-scale relation between the two nodes as the edge representation. To this end, we employ a relation function to capture the relations:

(3)

where a combination function merges the two node features. Following [40], we model the combination as the difference between the node embeddings, which alleviates the symmetry issue in feature combination; a convolution is then applied to learn the edge embedding in a data-driven way. Each element of the resulting embedding reflects the pixel-level relations between nodes of different scales, so the embedding can be regarded as a feature that depicts the multi-scale relationships between nodes.
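The difference-based relation function can be sketched in numpy; here the learned convolution is reduced to a 1x1 linear map over channels, purely for illustration:

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 'convolution' on a (C, H, W) tensor: per-pixel linear
    map over channels, w has shape (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def cross_scale_edge(h_i, h_j, w):
    """Edge embedding from node i to node j: a convolution applied to
    the difference of node embeddings (the relation function of Eq. 3)."""
    return conv1x1(h_j - h_i, w)

C, H, W = 2, 3, 3
h_i = np.ones((C, H, W))          # embedding at scale i
h_j = 2.0 * np.ones((C, H, W))    # embedding at scale j
w = np.eye(C)                     # identity kernel for the demo
e = cross_scale_edge(h_i, h_j, w) # pixel-level relation map
```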

Figure 4: Detailed illustration of the cross-domain edge embedding and message aggregation. Please see text for details.

Cross-domain Edge Embedding.

Since our HyGnn is designed to fully exploit the complementary knowledge contained in the nodes of different domains, one major challenge is to overcome the “domain gap” between them. Rather than directly combining features as in the cross-scale edge embedding, we first adapt the node representation of one (source) domain conditioned on the node representation of the other (target) domain to overcome the domain difference. Inspired by [2], we integrate a learnable adapter into our HyGnn to transform the original node representation into an adaptive representation:

(4)

In the above function, the adaptive representation is obtained by a convolution whose dynamic kernels are predicted by a one-shot learner from a single exemplar. Following [24], as shown in Fig. 3, we implement the learner as a small CNN with learnable parameters.

Given the adaptive representation, the cross-domain edge embedding can be formulated as:

(5)

where the result is a 3D tensor containing the hidden representation of the cross-domain relation. The detailed architecture can be found in Fig. 4.
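The adapter idea can be sketched as follows; in this simplified version the one-shot learner is a single linear map predicting per-channel scaling kernels from the target node (the paper uses a small CNN predicting full convolution kernels, so this is only a sketch of the conditioning mechanism):

```python
import numpy as np

def one_shot_learner(h_target, w_learner):
    """Tiny 'one-shot learner': predicts a per-channel dynamic kernel
    from the target-domain node embedding (global average pooled)."""
    pooled = h_target.mean(axis=(1, 2))   # (C,) summary of the target node
    return w_learner @ pooled             # dynamic kernel, shape (C,)

def adapt(h_source, h_target, w_learner):
    """Adaptive representation: source features re-weighted per channel
    by kernels predicted from the target node, i.e. a dynamic
    convolution reduced to channel-wise scaling."""
    k = one_shot_learner(h_target, w_learner)
    return h_source * k[:, None, None]

C, H, W = 2, 4, 4
h_src = np.ones((C, H, W))
h_tgt = np.full((C, H, W), 2.0)
w = np.eye(C)                             # learner weights (identity demo)
h_adapted = adapt(h_src, h_tgt, w)        # conditioned on the target node
```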

Cross-scale Message Aggregation.

In our HyGnn, we employ different aggregation schemes for each node to aggregate feature messages from its neighbors. For messages passed between nodes of the same domain but different scales, we have:

(6)

where the cross-scale message passing function (aggregator) maps each edge embedding into a link weight. Note that, since our HyGnn is devised for a pixel-level task, the link weight between nodes takes the form of a 2D map. The aggregator thus applies pixel-wise weights to the neighboring node's features when aggregating information.
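A toy sketch of the aggregation: here the 2D link weight is obtained by squashing the channel-averaged edge embedding through a sigmoid, which is an assumption for illustration (the actual weighting function is learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_cross_scale(h_neighbors, edges):
    """Aggregate messages from cross-scale neighbours: each edge
    embedding is turned into a pixel-wise 2D link-weight map that
    gates the neighbour's features before summation."""
    msg = np.zeros_like(h_neighbors[0])
    for h_j, e_ji in zip(h_neighbors, edges):
        weight = sigmoid(e_ji.mean(axis=0, keepdims=True))  # (1, H, W) map
        msg += weight * h_j                                  # pixel-wise gating
    return msg

C, H, W = 2, 3, 3
h = [np.ones((C, H, W)), np.ones((C, H, W))]     # two neighbour states
e = [np.zeros((C, H, W)), np.zeros((C, H, W))]   # zero edges -> weight 0.5
m = aggregate_cross_scale(h, e)
```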

Cross-domain Message Aggregation.

As the cross-domain discrepancy is significant in the high-dimensional feature space and distribution, directly passing one node's learned representation to its neighboring nodes for aggregation is sub-optimal. Therefore, we formulate cross-domain message passing as an adaptive representation learning process conditioned on the target node. We use the same idea as in the cross-domain edge embedding, i.e., a one-shot adapter predicts the message that should be passed:

(7)

where the cross-domain message passing function is an adapter conditioned on the node embedding of the target domain. A small CNN with learnable parameters serves as the one-shot learner that predicts the dynamic convolutional kernels, which encode the guidance information to be propagated between the two nodes.

Two-stage Node State Update.

In each step, our HyGnn first aggregates the information from the cross-domain nodes of the same scale using Eq. 7. Each node thus obtains an intermediate state by taking into account its received cross-domain message and its prior state. Here, following [27], we apply a Gated Recurrent Unit (GRU) [1] as the update function:

(8)

Then, HyGnn performs message passing across scales within the same domain using Eq. 3 and aggregates the messages using Eq. 6. Afterwards, each node reaches its new state for the iteration by combining the cross-scale message with its intermediate state:

(9)
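The two-stage update can be sketched with a minimal GRU cell operating on flattened node states; the random weights below are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(h, m, Wz, Wr, Wh):
    """Minimal GRU cell: h is the previous node state, m the aggregated
    incoming message; biases are omitted for brevity."""
    x = np.concatenate([m, h])
    z = sigmoid(Wz @ x)                               # update gate
    r = sigmoid(Wr @ x)                               # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([m, r * h]))
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d = 4
Wz, Wr, Wh = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(3))
h_prev = rng.standard_normal(d)
m_cross_domain = rng.standard_normal(d)
m_cross_scale = rng.standard_normal(d)

# Stage 1: cross-domain message -> intermediate state (Eq. 8).
h_mid = gru_update(h_prev, m_cross_domain, Wz, Wr, Wh)
# Stage 2: cross-scale message -> new state for this iteration (Eq. 9).
h_new = gru_update(h_mid, m_cross_scale, Wz, Wr, Wh)
```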

Readout Function.

After K message passing iterations, the updated multi-scale features of the two domains are merged to form their final representations:

(10)

where the merge functions are implemented by concatenation. The final representations are then fed into a convolution layer to obtain the per-pixel predictions.

Loss.

Our HyGnn is implemented to be fully differentiable and end-to-end trainable. The loss for each task is computed after the readout functions, and the error propagates back according to the chain rule. Here, we simply employ the Mean Square Error (MSE) loss to optimize the network parameters for the two tasks:

(11)

where both terms are MSE losses combined by a weighting factor. As our main task is crowd counting, we set the weight to emphasize the accuracy of the counting results.
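The joint objective can be sketched directly; the combination weight `lam` used below is a placeholder value, not the paper's actual setting:

```python
import numpy as np

def joint_loss(d_pred, d_gt, l_pred, l_gt, lam=0.5):
    """Joint objective: MSE on the density map plus a weighted MSE on
    the auxiliary localization map."""
    l_den = np.mean((d_pred - d_gt) ** 2)   # counting (density) loss
    l_loc = np.mean((l_pred - l_gt) ** 2)   # auxiliary localization loss
    return l_den + lam * l_loc

d_pred = np.zeros((4, 4)); d_gt = np.ones((4, 4))
l_pred = np.zeros((4, 4)); l_gt = np.ones((4, 4))
loss = joint_loss(d_pred, d_gt, l_pred, l_gt, lam=0.5)   # -> 1.5
```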

 Methods             ShanghaiTech A  ShanghaiTech B  UCF_CC_50       UCF_QNRF
                     MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE
 Crowd CNN [42]      181.8   277.7   32.0    49.8    467.0   498.0   -       -
 MC-CNN [43]         110.2   173.2   26.4    41.3    377.6   509.1   277     426
 Switching CNN [28]  90.4    135.0   21.6    33.4    318.1   439.2   228     445
 CP-CNN [34]         73.6    106.4   20.1    30.1    298.8   320.9   -       -
 D-ConvNet [32]      73.5    112.3   18.7    26.0    288.4   404.7   -       -
 L2R [22]            72.0    106.6   13.7    21.4    279.6   388.9   -       -
 CSRNet [16]         68.2    115.0   10.6    16.0    266.1   397.5   -       -
 PACNN [31]          66.3    106.4   8.9     13.5    267.9   357.8   -       -
 RA2-Net [19]        65.1    106.7   8.4     14.1    -       -       116     195
 SFCN [39]           64.8    107.5   7.6     13.0    214.2   318.2   124.7   203.5
 TEDNet [11]         64.2    109.1   8.2     12.8    249.4   354.2   113     188
 ADCrowdNet [20]     63.2    98.9    7.6     13.9    257.1   363.5   -       -
 HyGnn (Ours)        60.2    94.5    7.5     12.7    184.4   270.1   100.8   185.3
Table 1: Comparison with other state-of-the-art crowd counting methods on four benchmark crowd counting datasets using the MAE and MSE metrics.

Experiments

In this section, we empirically validate our HyGnn on four public counting benchmarks (ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50 and UCF_QNRF). First, we conduct ablation experiments to verify the effectiveness of our hybrid graph model and of multi-task learning. Then, HyGnn is evaluated on all of these public benchmarks and its performance is compared with state-of-the-art approaches.

Datasets.

We use ShanghaiTech [43], UCF_CC_50 [8] and UCF_QNRF [10] for benchmarking our HyGnn. ShanghaiTech provides 1,198 annotated images containing more than 330K people with head-center annotations; it includes two subsets, ShanghaiTech A and ShanghaiTech B. UCF_CC_50 provides 50 images with 63,974 head annotations in total; its small volume and large count variance make it a very challenging dataset. UCF_QNRF is the largest dataset to date, containing 1,535 images divided into training and testing sets of 1,201 and 334 images, respectively. All of these benchmarks have been widely used for performance evaluation by state-of-the-art approaches.

Implementation Details and Evaluation Protocol.

To make a fair comparison with existing works, we use a truncated VGG as the backbone network. Specifically, the first 10 convolutional layers of VGG-16 are used as the front-end and shared by the two tasks. Following [16], our counting and localization back-ends are each composed of 8 dilated convolutional layers.

We train the network with the Adam optimizer. For data augmentation, the training images and the corresponding ground truths are randomly flipped and cropped at different locations. In the testing phase, we simply feed the whole image into the model to predict the counting and localization results.

We adopt the Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate the performance. They are defined as:

MAE = (1/N) Σᵢ |Cᵢ − Cᵢᴳᵀ|,    MSE = √( (1/N) Σᵢ (Cᵢ − Cᵢᴳᵀ)² )    (12)

where N is the number of testing images, and Cᵢ and Cᵢᴳᵀ are the estimated count and the ground-truth count of the i-th testing image, respectively.
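The two metrics can be implemented directly; note that the "MSE" conventionally reported in crowd counting is the root of the mean squared count error:

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error over per-image crowd counts."""
    return np.mean(np.abs(pred - gt))

def mse(pred, gt):
    """Root of the mean squared count error (the 'MSE' metric
    conventionally reported in crowd counting)."""
    return np.sqrt(np.mean((pred - gt) ** 2))

# Hypothetical per-image counts for three test images.
pred = np.array([100.0, 210.0, 95.0])
gt = np.array([110.0, 200.0, 100.0])
# mae(pred, gt) ~ 8.33, mse(pred, gt) ~ 8.66
```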

 Methods                               MAE    MSE
 Baseline Model (a truncated VGG)      68.2   115.0
 Baseline + PSP [44]                   65.3   106.8
 Baseline + Bidirectional Fusion [41]  65.1   105.9
 Single-task GNN                       62.5   103.4
 Multi-task GNN w/o adapter            62.4   101.8
 HyGnn (N=2, K=3)                      62.1   100.8
 HyGnn (N=3, K=3)                      60.2   94.5
 HyGnn (N=5, K=3)                      60.2   94.1
 HyGnn (N=3, K=1)                      65.4   109.2
 HyGnn (N=3, K=3)                      60.2   94.5
 HyGnn (N=3, K=5)                      60.1   94.4
 HyGnn (full model)                    60.2   94.5
Table 2: Analysis of the proposed method. Our results are obtained on ShanghaiTech A.

Ablation Study.

Extensive ablation experiments are performed on ShanghaiTech A to verify the impact of each component of our HyGnn. Results are summarized in Tab. 2.

Effectiveness of HyGnn. To show the importance of our HyGnn, we provide a baseline model without HyGnn, i.e., our backbone model: the truncated VGG with dilated back-ends. As shown in Tab. 2, HyGnn significantly outperforms the baseline by 8.0 in MAE (68.2 vs. 60.2) and 20.5 in MSE (115.0 vs. 94.5). This is because HyGnn can simultaneously model the multi-scale and cross-domain relationships that are important for accurate crowd counting.

Multi-task GNN vs. Single-task GNN. To evaluate the advantage of multi-task cooperation, we provide a single-task model that only formulates the cross-scale relationship. In our experiments, HyGnn outperforms the single-task graph neural network by 2.3 in MAE (62.5 vs. 60.2) and 8.9 in MSE (103.4 vs. 94.5). This is because HyGnn is able to distill the mutual benefits between density and localization, while the single-task graph neural network ignores this important information.

Effectiveness of the Cross-domain Edge Embedding. Our HyGnn carefully handles cross-domain information with a learnable adapter. To evaluate its effectiveness, we provide a multi-task GNN without the learnable adapter, in which features from different domains are fused directly through the aggregation operation. As shown in Tab. 2, our cross-domain edge embedding achieves better performance in both MAE (60.2 vs. 62.4) and MSE (94.5 vs. 101.8), indicating that it better leverages the information from the other domain.

Figure 5: Density and localization maps generated by our HyGnn. We also show the counting map estimated by CSRNet for comparison. Clearly, our HyGnn produces more accurate results.

Node Numbers in HyGnn. In our model, each domain contains N nodes. To investigate the impact of the node number, we report the performance of HyGnn with different values of N. We find that increasing the number of scales from N=2 to N=3 improves the performance significantly (Tab. 2), whereas further increasing it to N=5 yields only a slight improvement, possibly due to redundant information in the additional features. Considering the trade-off between efficiency and performance, we set N=3 in the following experiments.

Message Passing Iterations K. To evaluate the impact of the number of message passing iterations K, we report the performance of our model with different values of K. Each message passing iteration in HyGnn includes two cascaded steps: (i) cross-domain message passing and (ii) cross-scale message passing. We find that increasing from K=1 to K=3 improves the performance considerably, while further iterations (K=5) bring only a slight improvement. Therefore, we set K=3, with which HyGnn converges to a good result.

GNN vs. Other Multi-feature Aggregation Methods. Here, we conduct an ablation to evaluate the advantage of GNN-based aggregation. To exclude confounding factors, we use a single-task GNN to distill the underlying relationships between multi-scale features, and compare it with two well-known multi-scale feature aggregation methods (PSP [44] and Bidirectional Fusion [41]). As shown in Tab. 2, our GNN-based method outperforms both by a clear margin.

Comparison with State-of-the-art.

We compare our HyGnn with the state-of-the-art in terms of counting performance.

Quantitative Results. As shown in Tab. 1, our HyGnn consistently achieves better results than other methods on four widely used benchmarks. Specifically, on ShanghaiTech Part A our method improves the previous best results from 63.2 to 60.2 in MAE and from 98.9 to 94.5 in MSE. Although previous methods have made remarkable progress on ShanghaiTech Part B, our HyGnn also achieves the best performance there, surpassing top approaches such as ADCrowdNet [20] (7.6 MAE / 13.9 MSE) and SFCN [39] (7.6 MAE / 13.0 MSE) with 7.5 MAE and 12.7 MSE. On the most challenging UCF_CC_50, HyGnn achieves a considerable gain, decreasing the MAE from the previous best of 214.2 to 184.4 and the MSE from 318.2 to 270.1. On UCF_QNRF, HyGnn also outperforms other methods by a large margin, improving on the best existing MAE, produced by TEDNet [11], from 113 to 100.8. Compared with other top-ranked methods, HyGnn produces more accurate results because it leverages free-of-cost localization information and jointly reasons about all relations among the features.

Qualitative Results. Fig. 5 provides visual comparisons of the predicted density maps and counts with CSRNet [16]; we also show the localization results. We observe that HyGnn achieves much more accurate count estimations and preserves more consistency with the real crowd distributions. This is because HyGnn can distill beneficial information from the auxiliary task through the graph.

Conclusions

In this paper, we propose a novel method for crowd counting based on a hybrid graph model. To the best of our knowledge, it is the first deep neural network model that distills both multi-scale and mutually beneficial relations within a unified graph for crowd counting. The whole HyGnn is end-to-end differentiable and able to handle the different relation types effectively, while the domain gap between the two tasks is also carefully considered. In our experiments, HyGnn achieves significant improvements over recent state-of-the-art methods on four benchmarks. We believe HyGnn can also incorporate other knowledge, e.g., foreground information, for further improvements.

Acknowledgement. This work was supported in part by the National Key R&D Program of China (No.2017YFB1302300) and the NSFC (No.U1613223).

References

  • [1] N. Ballas, L. Yao, C. Pal, and A. Courville (2015) Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432.
  • [2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi (2016) Learning feed-forward one-shot learners. In NeurIPS.
  • [3] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV.
  • [4] K. Chen, S. Gong, T. Xiang, and C. Change Loy (2013) Cumulative attribute space for age and crowd density estimation. In CVPR.
  • [5] F. Dai, H. Liu, Y. Ma, J. Cao, Q. Zhao, and Y. Zhang (2019) Dense scale network for crowd counting. CoRR abs/1906.09707.
  • [6] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In CVPR.
  • [7] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. CoRR abs/1704.01212.
  • [8] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In CVPR.
  • [9] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In ECCV.
  • [10] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In ECCV.
  • [11] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR.
  • [12] D. Kang, Z. Ma, and A. B. Chan (2018) Beyond counting: comparisons of density maps for crowd analysis tasks—counting, detection, and tracking. TCSVT 29 (5), pp. 1408–1422.
  • [13] V. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In NeurIPS.
  • [14] X. Li, F. Yang, H. Cheng, J. Chen, Y. Guo, and L. Chen (2017) Multi-scale cascade network for salient object detection. In ACM MM.
  • [15] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen (2018) Contour knowledge transfer for salient object detection. In ECCV.
  • [16] Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR.
  • [17] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In ICLR.
  • [18] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao (2019) Density map regression guided detection network for RGB-D crowd counting and localization. In CVPR.
  • [19] C. Liu, X. Weng, and Y. Mu (2019) Recurrent attentive zooming for joint crowd counting and precise localization. In CVPR.
  • [20] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In CVPR.
  • [21] W. Liu, M. Salzmann, and P. Fua (2019) Context-aware crowd counting. In CVPR, Cited by: Introduction, Introduction, Crowd Counting and Localization..
  • [22] X. Liu, J. van de Weijer, and A. D. Bagdanov (2018) Leveraging unlabeled data for crowd counting by learning to rank. In CVPR, Cited by: Table 1.
  • [23] Z. Meng, N. Adluru, H. J. Kim, G. Fung, and V. Singh (2018) Efficient relative attribute learning using graph neural networks. In ECCV, Cited by: Graph Neural Networks..
  • [24] X. Nie, J. Feng, Y. Zuo, and S. Yan (2018) Human pose estimation with parsing induced learner. In CVPR, Cited by: Cross-domain Edge Embedding..
  • [25] V. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada (2015) Count forest: co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, Cited by: Crowd Counting and Localization..
  • [26] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, Cited by: Graph Neural Networks..
  • [27] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, Cited by: Two-stage Node State Update..
  • [28] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In CVPR, Cited by: Table 1.
  • [29] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. TNNLS 20 (1), pp. 61–80. Cited by: Graph Neural Networks..
  • [30] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang (2018) Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, Cited by: Introduction.
  • [31] M. Shi, Z. Yang, C. Xu, and Q. Chen (2019) Revisiting perspective information for efficient crowd counting. In CVPR, Cited by: Table 1.
  • [32] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng (2018) Crowd counting with deep negative correlation learning. In CVPR, Cited by: Table 1.
  • [33] C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan (2018) Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV, Cited by: Graph Neural Networks..
  • [34] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In CVPR, Cited by: Table 1.
  • [35] R. R. Varior, B. Shuai, J. Tighe, and D. Modolo (2019) Scale-aware attention network for crowd counting. CoRR abs/1901.06026. Cited by: Crowd Counting and Localization..
  • [36] P. Viola, M. J. Jones, and D. Snow (2005) Detecting pedestrians using patterns of motion and appearance. IJCV 63 (2), pp. 153–161. Cited by: Crowd Counting and Localization..
  • [37] P. Viola, M. Jones, et al. (2001) Rapid object detection using a boosted cascade of simple features. Cited by: Crowd Counting and Localization..
  • [38] J. Wan, W. Luo, B. Wu, A. B. Chan, and W. Liu (2019) Residual regression with semantic prior for crowd counting. In CVPR, Cited by: Introduction, Introduction, Crowd Counting and Localization..
  • [39] Q. Wang, J. Gao, W. Lin, and Y. Yuan (2019) Learning from synthetic data for crowd counting in the wild. In CVPR, Cited by: Table 1, Comparison with State-of-the-art..
  • [40] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. TOG. Cited by: Cross-scale Edge Embedding..
  • [41] F. Yang, X. Li, H. Cheng, Y. Guo, L. Chen, and J. Li (2018) Multi-scale bidirectional fcn for object skeleton extraction. In AAAI, Cited by: Ablation Study., Table 2.
  • [42] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In CVPR, Cited by: Crowd Counting and Localization., Table 1.
  • [43] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, Cited by: Crowd Counting and Localization., Table 1, Datasets..
  • [44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: Domain-specific Feature Learning Module (DFL)., Ablation Study., Table 2.
  • [45] M. Zhao, J. Zhang, C. Zhang, and W. Zhang (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In CVPR, Cited by: Crowd Counting and Localization..