
Efficient Crowd Counting via Structured Knowledge Transfer

by   Lingbo Liu, et al.

Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications. However, most previous works relied on heavy backbone networks and required prohibitive runtimes, which would seriously restrict their deployment scopes and cause poor scalability. To liberate these crowd counting models, we propose a novel Structured Knowledge Transfer (SKT) framework integrating two complementary transfer modules, which can generate a lightweight but still highly effective student network by fully exploiting the structured knowledge of a well-trained teacher network. Specifically, an Intra-Layer Pattern Transfer sequentially distills the knowledge embedded in single-layer features of the teacher network to guide feature learning of the student network. Simultaneously, an Inter-Layer Relation Transfer densely distills the cross-layer correlation knowledge of the teacher to regularize the student's feature evolution. In this way, our student network can learn compact and knowledgeable features, yielding high efficiency and competitive performance. Extensive evaluations on three benchmarks well demonstrate the knowledge transfer effectiveness of our SKT for extensive crowd counting models. In particular, only having one-sixteenth of the parameters and computation cost of original models, our distilled VGG-based models obtain at least 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.




Code Repositories

SKT distillation (ACM MM 2020)

1. Introduction

Crowd counting, whose objective is to automatically estimate the total number of people in monitored scenes, is an important technique of crowd analysis (Li et al., 2015; Zhan et al., 2008). With the rapid increase of urban population, this task has attracted extensive interest in academic and industrial fields, due to its wide-ranging applications in video surveillance (Zhang et al., 2017), congestion alerting (Semertzidis et al., 2010) and traffic prediction (Liu et al., 2018c).

Method | RMSE | #Param (M) | FLOPs (G) | GPU (ms) | CPU (s)
DISSNet (Liu et al., 2019a) | 159.20 | 8.86 | 8670.09 | 3677.98 | 378.80
CAN (Liu et al., 2019c) | 183.00 | 18.10 | 2594.18 | 972.16 | 149.56
CSRNet* (Li et al., 2018) | 233.32 | 16.26 | 2447.91 | 823.84 | 119.67
BL* (Ma et al., 2019) | 158.09 | 21.50 | 2441.23 | 595.72 | 130.76
Ours | 156.82 | 1.35 | 155.30 | 90.96 | 9.78
Table 1. The Root Mean Squared Error (RMSE), parameters, FLOPs, and inference time of our SKT network and four state-of-the-art models on the UCF-QNRF (Idrees et al., 2018) dataset. The FLOPs and parameters are computed with an input size of 2032×2912, and the inference times are measured on an Intel Xeon E5 CPU (2.4 GHz) and a single Nvidia GTX 1080 GPU. The models with * are reimplemented by us. More efficiency analysis can be found in Table 8.

Recently, deep neural networks (Zhang et al., 2015, 2016; Sindagi and Patel, 2017b; Liu et al., 2018b; Cao et al., 2018; Qiu et al., 2019; Liu et al., 2019b; Zhang et al., 2019) have become mainstream in the task of crowd counting and have made remarkable progress. To acquire better performance, most of the state-of-the-art methods (Li et al., 2018; Liu et al., 2019c; Ma et al., 2019; Liu et al., 2019a; Yan et al., 2019) utilize heavy backbone networks (such as the VGG model (Simonyan and Zisserman, 2014)) to extract hierarchical features. Nevertheless, requiring large computation cost and running at low speeds, these models are exceedingly inefficient, as shown in Table 1. For instance, DISSNet (Liu et al., 2019a) requires 3.7 s on an Nvidia 1080 GPU and 379 s on an Intel Xeon CPU to process a 2032×2912 image. This seriously restricts their deployment scope and causes poor scalability, particularly on edge computing devices with limited computing resources. Moreover, to handle citywide surveillance videos in real time, we may need thousands of high-performance GPUs, which are expensive and energy-consuming. Under these circumstances, a cost-effective model is highly desired for crowd counting.

Thus, one fundamental question is how we can acquire an efficient crowd counting model from existing well-trained but heavy networks. A series of efforts (Han et al., 2015; Tai et al., 2015; Cai et al., 2017; Liu et al., 2018e) have been made to compress and speed up deep neural networks. However, most of them either require cumbersome hyper-parameter search (e.g., the per-layer sensitivity for parameter pruning) or rely on specific hardware platforms (e.g., weight quantization and half-precision floating-point computing). Recently, knowledge distillation (Hinton et al., 2015) has become a desirable alternative due to its broad applicability: it trains a small student network to mimic the knowledge of a complex teacher network. Numerous works (Romero et al., 2014; Zagoruyko and Komodakis, 2016; Sau and Balasubramanian, 2016; Zhang et al., 2018; Mirzadeh et al., 2019) have verified its effectiveness in image classification. However, density-based crowd counting is a challenging pixel-labeling task: once a pre-trained model is distilled, it is very difficult to simultaneously preserve the original or similar response value at every location of its crowd density map. Therefore, extensive knowledge needs to be transferred into student networks for high-quality density map generation.

In this study, we aim to improve the efficiency of existing crowd counting models in a general and comprehensive way. Fortunately, we observe that the structured knowledge of deep networks is implicitly embedded in i) single-layer features that carry image content, and ii) cross-layer correlations that encode feature updating patterns. To this end, we develop a novel framework termed Structured Knowledge Transfer (SKT), which fully exploits the structured knowledge of teacher networks with two complementary transfer modules, i.e., an Intra-Layer Pattern Transfer (Intra-PT) and an Inter-Layer Relation Transfer (Inter-RT). First, our Intra-PT takes a set of representative features extracted from a well-trained teacher network to sequentially supervise the corresponding features of a student network, analogous to using the teacher's knowledge to progressively correct the student's learning deviation. As a result, the student's features exhibit distributions and patterns similar to those of its supervisor.

Second, our Inter-RT densely computes the relationships between pairwise features of the teacher network and then utilizes such knowledge to help the student network regularize the long short-term evolution of its hierarchical features. Thereby, the student network can learn the solution procedure flow of its teacher. Thanks to the tailor-designed SKT framework, our lightweight student network can effectively learn compact and knowledgeable features, yielding high-quality crowd density maps.

In experiments, we apply the proposed SKT framework to compress and accelerate a series of existing crowd counting models (e.g., CSRNet (Li et al., 2018), BL (Ma et al., 2019) and SANet (Cao et al., 2018)). Extensive evaluations and analyses on three representative benchmarks demonstrate the transfer effectiveness of our method. Only having one-sixteenth of the parameters and computation cost of the original models, our distilled VGG-based models obtain at least 6.5× speed-up on GPU and 9× speed-up on CPU. Moreover, these lightweight models preserve competitive performance, and even achieve state-of-the-art results on the Shanghaitech (Zhang et al., 2016) Part-A and UCF-QNRF (Idrees et al., 2018) datasets. In summary, the major contributions of this work are four-fold:

  • To the best of our knowledge, we are the first to focus on improving the efficiency of existing crowd counting models. This is an important premise to scale up these models to real-world applications.

  • We propose a general and comprehensive Structured Knowledge Transfer framework, which can generate lightweight but effective crowd counting models with two cooperative knowledge transfer modules.

  • An Intra-Layer Pattern Transfer and an Inter-Layer Relation Transfer are incorporated to fully distill the structured knowledge of well-trained models.

  • Extensive experiments on three benchmarks show the effectiveness of our method. In particular, our distilled VGG-based models have an order of magnitude speed-up while even achieving state-of-the-art performance.

2. Related Works

2.1. Crowd Counting

Crowd counting has been extensively studied for decades. Early works (Ge and Collins, 2009; Li et al., 2008) estimated the crowd count by directly locating people with pedestrian detectors. Subsequently, some methods (Ryan et al., 2009; Chen et al., 2012) learned a mapping between handcrafted features and crowd count with regressors. Using only low-level image information, these methods were highly efficient, but their performance was far from satisfactory for real-world applications.

Figure 1. The proposed Structured Knowledge Transfer (SKT) framework for crowd counting. With two complementary distillation modules, our SKT can effectively distill the structured knowledge of a pre-trained teacher network into a small student network. First, an Intra-Layer Pattern Transfer sequentially distills the inherent knowledge in a teacher's feature to enhance the corresponding student's feature with a cosine metric. Second, an Inter-Layer Relation Transfer enforces the student network to learn the long short-term feature relationships of the teacher network, thereby fully mimicking the teacher's flow of solution procedure (FSP). Notice that FSP matrices are densely computed between representative features in our framework; for the conciseness of this figure, we only show some FSP matrices of adjacent features.

Recently, we have witnessed the great success of convolutional neural networks (Zhang et al., 2015; Onoro-Rubio and López-Sastre, 2016; Walach and Wolf, 2016; Sam et al., 2017; Liu et al., 2018b; Shi et al., 2018, 2019a; Lian et al., 2019; Zhang et al., 2019; Sindagi and Patel, 2019; Yan et al., 2019) in crowd counting. Most of these previous approaches focused on improving the performance of deep models. To this end, they tended to use heavy backbone networks (e.g., the VGG model (Simonyan and Zisserman, 2014)) to extract representative features. For instance, Li et al. (Li et al., 2018) combined a VGG-16-based front-end network and a dilated convolutional back-end network to learn hierarchical features for crowd counting. Liu et al. (Liu et al., 2019c) introduced an expanded context-aware network that learned both image features and geometry features with two truncated VGG16 models. Liu et al. (Liu et al., 2019a) utilized three parallel VGG16 networks to extract multiscale features and then conducted structured refinements. Recently, Ma et al. (Ma et al., 2019) proposed a new Bayesian loss for crowd counting and verified its effectiveness on VGG19. Although the aforementioned methods make impressive progress, their performance advantages come at the cost of burdensome computation, so it is hard to directly apply them to practical applications. In contrast, we take both performance and computation cost into consideration. In this work, we aim to improve the efficiency of existing crowd counting models while preserving their performance.

2.2. Model Compression

Parameter quantization (Gong et al., 2014), parameter pruning (Han et al., 2015) and knowledge distillation (Hinton et al., 2015) are three commonly-used types of algorithms for model compression. Specifically, quantization methods (Zhou et al., 2016; Jacob et al., 2018) compress networks by reducing the number of bits required to represent weights, but they usually rely on specific hardware platforms. Pruning methods (Li et al., 2016; He et al., 2017; Zhu and Gupta, 2017) remove redundant weights or channels of layers. However, most of them use weight masks to simulate pruning, and massive post-processing is needed to achieve real speed-up. By contrast, knowledge distillation (Sau and Balasubramanian, 2016; Zhang et al., 2018; Mirzadeh et al., 2019) is more general, and its objective is to transfer knowledge from a heavy network to a small network. Recently, knowledge distillation has been widely studied. For instance, Hinton et al. (Hinton et al., 2015) trained a distilled network with the soft output of a large, highly regularized network. Romero et al. (Romero et al., 2014) improved the performance of student networks with both the outputs and the intermediate features of teacher networks. Zagoruyko and Komodakis (Zagoruyko and Komodakis, 2016) utilized activation-based and gradient-based spatial attention maps to transfer knowledge between two networks. Recently, some works (Xu et al., 2017; Heo et al., 2019) adopted adversarial learning to model knowledge transfer between teacher and student networks. Nevertheless, most of these previous methods were proposed for image classification. To the best of our knowledge, we are the first to utilize knowledge distillation to improve the efficiency of crowd counting.

3. Method

In this work, a general Structured Knowledge Transfer (SKT) framework is proposed to address the efficiency problem of existing crowd counting models. Its architecture is shown in Fig. 1. Specifically, an Intra-Layer Pattern Transfer (Intra-PT) and an Inter-Layer Relation Transfer (Inter-RT) are incorporated into our framework to fully transfer the structured knowledge of teacher networks to student networks.

In this section, we take the VGG16-based CSRNet (Li et al., 2018) as an example to introduce the working modules of our SKT framework. The student network is a 1/n-CSRNet, in which the channel number of each convolutional layer (except the last layer) is 1/n of that in the original CSRNet. Compared with the heavy CSRNet, the lightweight 1/n-CSRNet has only about 1/n² of the parameters and computation cost, but it suffers serious performance degradation. Thus, our objective is to improve the performance of 1/n-CSRNet as far as possible by transferring the knowledge of CSRNet. Notice that our SKT is general and is also applicable to other crowd counting models (e.g., BL (Ma et al., 2019) and SANet (Cao et al., 2018)). Several distilled models are analyzed and compared in Section 4.
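To make the compression arithmetic concrete, here is a rough Python sketch (ours, not the authors' code): a convolution with c_in input and c_out output channels and kernel size k has about c_in·c_out·k² weights, so scaling every width by 1/n shrinks the parameter count by roughly 1/n². The width schedule below is a hypothetical VGG-like example, not CSRNet's exact configuration.

```python
# Rough arithmetic sketch (not the paper's code): why scaling every
# layer's channel number by 1/n cuts parameters by roughly 1/n^2.

def conv_params(c_in, c_out, k=3):
    """Weights of a k x k convolution, ignoring biases."""
    return c_in * c_out * k * k

def total_params(channels, k=3):
    """Parameters of a plain conv stack with the given channel widths."""
    return sum(conv_params(channels[i], channels[i + 1], k)
               for i in range(len(channels) - 1))

# A hypothetical VGG-like width schedule (illustrative only).
teacher = [3, 64, 128, 256, 512]
student = [3] + [c // 4 for c in teacher[1:]]   # width scaled by 1/4; input stays RGB

ratio = total_params(student) / total_params(teacher)
print(round(ratio, 3))  # 0.063, close to 1/16 = 0.0625 (the first layer deviates slightly)
```

The first layer deviates from the exact 1/n² ratio because its input (the RGB image) is not scaled, which is why the distilled models have *about* one-sixteenth of the original parameters at 1/4 width.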

3.1. Feature Extraction

In general, teacher networks have been pre-trained on standard benchmarks. The learned knowledge can be explicitly represented as parameters or implicitly embedded into features. Similar to the previous works (Romero et al., 2014; Zagoruyko and Komodakis, 2016), we perform knowledge transfer on the feature level. The knowledge in the features of teacher networks is treated as supervisory information and can be utilized to guide the representation learning of student networks. Therefore, before conducting knowledge transfer, we need to extract the hierarchical features of teacher/student networks in advance.

As shown in Fig. 1, given an unconstrained image I, we simultaneously feed it into CSRNet and 1/n-CSRNet for feature extraction. For convenience, the i-th dilated convolutional layer in the back-end network of CSRNet is renamed "Conv_i" in our work, so the feature at layer Conv_i of CSRNet can be uniformly denoted as F_i^t. Similarly, we use F_i^s to represent the Conv_i feature of 1/n-CSRNet. Notice that F_i^t and F_i^s share the same resolution, but F_i^s has only 1/n of the channels of F_i^t. Since the Inter-RT in Section 3.3 computes feature relations densely, we only perform distillation on some representative features, in order to reduce the computational cost during the training phase. We denote the groups of selected features of CSRNet and 1/n-CSRNet as F^t and F^s, respectively.

3.2. Intra-Layer Pattern Transfer

As described in the above subsection, the extracted features F_i^t implicitly contain the learned knowledge of CSRNet. To improve the performance of the lightweight 1/n-CSRNet, we design a simple but effective Intra-Layer Pattern Transfer (Intra-PT) module, which sequentially transfers the knowledge of the selected features of CSRNet to the corresponding features of 1/n-CSRNet. Formally, we enforce F_i^s to learn the patterns of F_i^t and optimize the parameters of 1/n-CSRNet by maximizing their distribution similarity.

Specifically, our Intra-PT is composed of two steps. The first step is channel adjustment. As features F_i^s and F_i^t have different channel numbers, it is unsuitable to directly compute their similarity. To eliminate this issue, we generate a group of interim features by feeding each feature F_i^s into a convolutional layer, which is expressed as:

    F̂_i^s = W_i ∗ F_i^s,

where W_i denotes the parameters of the convolutional layer. The output F̂_i^s is the embedding feature of F_i^s and its channel number is the same as that of F_i^t.

The second step is similarity computation and knowledge transfer. Since the Euclidean distance is too restrictive and may cause rote learning in student networks, we adopt a relatively liberal metric, the cosine, to measure the similarity of two features. Specifically, the similarity of F̂_i^s and F_i^t at location (x, y) is calculated by:

    S_i(x, y) = Σ_{c=1}^{C} F̂_i^s(x, y, c) · F_i^t(x, y, c) / ( |F̂_i^s(x, y)| · |F_i^t(x, y)| ),

where C denotes the channel number of feature F_i^t, F_i^t(x, y, c) is the response value of F_i^t at location (x, y) of the c-th channel, and |·| is the length of a vector. Thus, the loss function of our Intra-PT is defined as follows:

    L_intra = Σ_i 1/(H_i · W_i) · Σ_{x=1}^{H_i} Σ_{y=1}^{W_i} ( 1 − S_i(x, y) ),

where H_i and W_i are the height and width of feature F_i^t. By minimizing this simple loss and back-propagating the gradients, our method can effectively transfer knowledge and optimize the parameters of 1/n-CSRNet.
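The Intra-PT loss can be sketched in NumPy as follows (an illustration under our notation, not the released code); the student feature is assumed to have already passed through the channel-adjustment layer so that both features share the same shape:

```python
import numpy as np

def intra_pt_loss(f_s, f_t, eps=1e-8):
    """Intra-layer pattern transfer loss: mean (1 - cosine similarity)
    over all spatial locations. f_s, f_t: (C, H, W) features of equal
    shape (f_s is the student feature after channel adjustment)."""
    # Flatten spatial dims: each column is the C-dim vector at one location.
    s = f_s.reshape(f_s.shape[0], -1)
    t = f_t.reshape(f_t.shape[0], -1)
    cos = (s * t).sum(0) / (np.linalg.norm(s, axis=0) *
                            np.linalg.norm(t, axis=0) + eps)
    return float((1.0 - cos).mean())

# Identical features give (near-)zero loss; unrelated random features do not.
rng = np.random.default_rng(0)
f_t = rng.standard_normal((16, 8, 8))
print(intra_pt_loss(f_t, f_t))                                    # ~0.0
print(intra_pt_loss(rng.standard_normal((16, 8, 8)), f_t) > 0.5)  # True
```

Because only the per-location direction of the feature vector matters, this metric constrains the distribution of the student feature without forcing it to reproduce the teacher's exact response magnitudes.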

3.3. Inter-Layer Relation Transfer

“Teaching one to fish is better than giving him fish.” Thus, a student network should also be encouraged to learn how to solve a problem. Inspired by (Yim et al., 2017), the flow of solution procedure (FSP) can be modeled with the relationship between features from two layers, and such a relationship is a kind of meaningful knowledge. In this subsection, we develop an Inter-Layer Relation Transfer (Inter-RT) module, which densely computes the pairwise feature relationships (FSP matrices) of the teacher network to regularize the long short-term feature evolution of the student network.

Let us introduce the details of our Inter-RT. We first present the generation of the FSP matrix. For two general features F_i and F_j with C_i and C_j channels and a shared spatial resolution H × W, we compute their FSP matrix G_{i,j} with a channel-wise inner product. Specifically, its value at index (p, q) is calculated by:

    G_{i,j}(p, q) = 1/(H · W) · Σ_{x=1}^{H} Σ_{y=1}^{W} F_i(x, y, p) · F_j(x, y, q).

Notice that FSP matrix computation is conducted on features with the same resolution. However, the features in F^t have various resolutions. To address this issue and simultaneously reduce the FSP computation cost, we consistently resize all features in F^t to the smallest resolution among them with max pooling. In the same way, all features in F^s are also resized to that resolution.
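Under the definition above, the FSP matrix can be sketched in NumPy (illustrative, not the authors' implementation):

```python
import numpy as np

def fsp_matrix(f_a, f_b):
    """FSP matrix between (C_a, H, W) and (C_b, H, W) features sharing
    the same spatial resolution. Entry (p, q) is the inner product of
    channel p of f_a and channel q of f_b, averaged over the H*W positions."""
    c_a, h, w = f_a.shape
    c_b = f_b.shape[0]
    a = f_a.reshape(c_a, h * w)
    b = f_b.reshape(c_b, h * w)
    return a @ b.T / (h * w)

f1 = np.ones((4, 2, 3))          # constant features make the result easy to check
f2 = 2 * np.ones((6, 2, 3))
print(fsp_matrix(f1, f2).shape)  # (4, 6)
print(fsp_matrix(f1, f2)[0, 0])  # 2.0  (1 * 2, averaged over 6 positions)
```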

Rather than sparsely computing FSP matrices only for adjacent features, we design a Dense FSP strategy to better capture the long short-term evolution of features. Specifically, we generate an FSP matrix G_{i,j}^t for every pair of resized features in F^t. Similarly, a matrix G_{i,j}^s is also computed for every pair of resized features in F^s. Finally, the loss function of our Inter-RT is calculated as follows:

    L_inter = Σ_{i<j} ‖ G_{i,j}^t − G_{i,j}^s ‖_2^2.

By minimizing the distances between these FSP matrices, the knowledge of CSRNet can be transferred to 1/n-CSRNet.
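The Dense FSP strategy can then be sketched as follows (illustrative NumPy; we assume all features have already been max-pooled to a common resolution, and that paired teacher/student FSP matrices have matching shapes, e.g., with the student features channel-adjusted as in Intra-PT):

```python
import numpy as np

def fsp(f_a, f_b):
    """Channel-wise inner product (FSP matrix), averaged over positions."""
    h, w = f_a.shape[1:]
    return f_a.reshape(f_a.shape[0], -1) @ f_b.reshape(f_b.shape[0], -1).T / (h * w)

def dense_fsp_loss(teacher_feats, student_feats):
    """Sum of squared L2 distances between teacher and student FSP matrices
    over EVERY pair of layers (the Dense FSP strategy), rather than over
    adjacent layers only. All features: (C, H, W) at a shared resolution,
    with matching channel counts between paired teacher/student features."""
    n = len(teacher_feats)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            g_t = fsp(teacher_feats[i], teacher_feats[j])
            g_s = fsp(student_feats[i], student_feats[j])
            loss += float(((g_t - g_s) ** 2).sum())
    return loss

rng = np.random.default_rng(1)
feats = [rng.standard_normal((8, 4, 4)) for _ in range(3)]
print(dense_fsp_loss(feats, feats))              # 0.0: identical networks
print(dense_fsp_loss(feats, feats[::-1]) > 0.0)  # True: a mismatched "flow"
```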

Figure 2. Illustration of the complementarity of hard and soft ground-truth (GT). (b) is a hard GT generated from point annotations with geometry-adaptive Gaussian kernels. (c) is a soft GT predicted by CSRNet, while (d) is the estimated map of 1/n-CSRNet. The hard GT may contain some blemishes (e.g., inaccurate scales and positions of human heads, or unmarked heads). For example, the red box in (b) shows human heads with inaccurate scales. We find that the soft GT may be relatively reasonable in some regions and is complementary to the hard GT. Thus, they can be incorporated to train the student network.

3.4. Learn from Soft Ground-Truth

In our work, a density map generated from point annotations is termed the hard ground-truth. We find that the density maps predicted by the teacher network are complementary to hard ground-truths. As shown in Fig. 2, some regions of hard ground-truths may contain blemishes (e.g., inaccurate scales and positions of human heads, or unmarked heads). Fortunately, with its powerful knowledge, a well-trained teacher network may predict relatively reasonable maps in such regions. These predicted density maps can also be treated as knowledge, and we call them soft ground-truths. In this work, we train our student network with both the hard and soft ground-truths.

As shown in Fig. 1, we use D^gt to represent the hard ground-truth of image I. The predicted map of CSRNet is denoted as D^t and the output map of 1/n-CSRNet is denoted as D^s. Since 1/n-CSRNet is expected to simultaneously learn the knowledge of the hard ground-truth and the soft ground-truth, we define the loss function on density maps as follows:

    L_den = ‖ D^s − D^gt ‖_2^2 + ‖ D^s − D^t ‖_2^2.

Finally, we optimize the parameters of 1/n-CSRNet by minimizing the losses of all knowledge transfers:

    L = L_intra + L_inter + L_den.
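As a minimal sketch of the density-map supervision and the overall objective (equal weighting of the terms is an assumption for illustration; the actual balance may differ):

```python
import numpy as np

def density_loss(d_s, d_hard, d_soft):
    """Supervise the student map with BOTH the annotation-derived hard GT
    and the teacher-predicted soft GT. Equal (unit) weights are assumed
    here purely for illustration."""
    return float(((d_s - d_hard) ** 2).mean() + ((d_s - d_soft) ** 2).mean())

def total_loss(l_intra, l_inter, l_density):
    """Overall objective as a plain sum (the weighting is an assumption)."""
    return l_intra + l_inter + l_density

d_hard = np.zeros((32, 32))
d_soft = 0.1 * np.ones((32, 32))
d_s = 0.05 * np.ones((32, 32))
print(round(density_loss(d_s, d_hard, d_soft), 4))  # 0.005
```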
4. Experiments

4.1. Experiment Settings

In this work, we conduct extensive experiments on the following three public benchmarks of crowd counting.

Shanghaitech (Zhang et al., 2016): This dataset contains 1,198 images with 330,165 annotated people. It is composed of two parts: Part-A contains 482 images of congested crowd scenes, where 300 images are used for training and 182 for testing, and Part-B contains 716 images of sparse crowd scenes, with 400 images for training and the rest for testing.

UCF-QNRF (Idrees et al., 2018): As one of the most challenging datasets, UCF-QNRF contains 1,535 images captured from unconstrained crowd scenes with huge variations in scale, density and viewpoint. Specifically, 1,201 images are used for training and 334 for testing. There are about 1.25 million annotated people in this dataset and the number of persons per image varies from 49 to 12,865.

WorldExpo’10 (Zhang et al., 2015): It contains 1,132 surveillance videos in total, captured by 108 cameras during the Shanghai WorldExpo 2010. Specifically, 3,380 images from 103 scenes are used as the training set and 600 images from the other five scenes as the test set. Regions of Interest (ROI) are provided to specify the counting regions for the test set.

Following (Cao et al., 2018; Li et al., 2018), we adopt the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) to quantitatively evaluate the performance of crowd counting. Specifically, they are defined as follows:

    MAE = 1/N · Σ_{i=1}^{N} | C_i − C_i^gt |,    RMSE = sqrt( 1/N · Σ_{i=1}^{N} ( C_i − C_i^gt )^2 ),

where N is the number of test images, and C_i and C_i^gt are the predicted and ground-truth counts of the i-th image, respectively.
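For example, the two metrics can be computed directly from per-image predicted and ground-truth counts:

```python
import math

def mae(pred, gt):
    """Mean Absolute Error over test images."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root Mean Squared Error over test images."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

pred = [105, 230, 48]   # toy predicted counts
gt   = [100, 250, 50]   # toy ground-truth counts
print(round(mae(pred, gt), 2))   # 9.0
print(round(rmse(pred, gt), 2))  # 11.96
```

Because RMSE squares the per-image error, it penalizes large miscounts more heavily than MAE, which is why the two metrics can rank methods differently.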

CPR | #Param (M) | FLOPs (G) | RMSE (w/o transfer)
1 | 16.26 | 205.88 | 105.99
1/2 | 4.07 | 51.77 | 137.32
1/3 | 1.81 | 23.11 | 140.29
1/4 | 1.02 | 13.09 | 146.40
1/5 | 0.64 | 8.45 | 149.40
Table 2. Performance comparison under different channel preservation rates (CPR) on Shanghaitech Part-A. #Param denotes the number of parameters (M). FLOPs is the number of FLoating point OPerations (G), computed on an image of the average resolution of Shanghaitech Part-A.

4.2. Ablation Study

4.2.1. Exploration on Channel Preservation Rate

In this work, we compress existing crowd counting models by reducing their channel numbers. A model runs more efficiently if it has fewer parameters/channels; however, its performance may also degrade consequently. Thus, the balance between efficiency and accuracy should be investigated. In this section, we first conduct an ablation study to evaluate the influence of the Channel Preservation Rate (CPR) on model performance.

In Table 2, we summarize the performance of CSRNet variants trained with different CPRs. As can be observed, the original CSRNet has 16.26M parameters and achieves an RMSE of 105.99 while consuming 205.88G FLOPs. When preserving half the channels, 1/2-CSRNet has a 4× reduction in parameters and FLOPs. However, without knowledge transfer, its performance degrades badly, with the RMSE increasing to 137.32. By applying our SKT, 1/2-CSRNet exhibits an obvious performance gain. When the CPR decreases to 1/4, the model is further reduced in size and FLOP consumption; what's more, the 1/4-CSRNet with knowledge transfer has only a negligible performance drop. When the CPR decreases further, 1/5-CSRNet meets a relatively large performance drop with a smaller gain in parameter and FLOPs reduction. Therefore, we consider that a rough balance is reached when the CPR is 1/4, and this setting is widely adopted in our following experiments.

Transfer | Configuration | MAE | RMSE
W/O Transfer | – | 89.65 | 146.40
Intra-PT | L2 | 76.61 | 120.56
Intra-PT | Cos | 74.99 | 117.58
Inter-RT | S-FSP | 79.22 | 133.21
Inter-RT | D-FSP | 73.25 | 120.77
Intra-PT & Inter-RT | L2 + D-FSP | 72.89 | 117.92
Intra-PT & Inter-RT | Cos + D-FSP | 71.55 | 114.40
Table 3. Performance of 1/4-CSRNet distilled with different transfer configurations on Shanghaitech Part-A. D-FSP and S-FSP refer to Dense FSP and Sparse FSP, respectively.

4.2.2. Effect of Different Transfer Configurations

We further perform experiments to evaluate the effect of different transfer configurations of our framework. This ablation study is conducted on 1/4-CSRNet and the results are summarized in Table 3. When trained with only our Intra-Layer Pattern Transfer module, 1/4-CSRNet obtains an evident performance gain, decreasing MAE by at least 13 and RMSE by at least 25.8. We observe that using the cosine (Cos) as the similarity metric performs better than using the Euclidean (L2) distance. The possible reason is that the cosine metric enforces the consistency of feature distribution between teacher and student networks, while the Euclidean distance further enforces location-wise similarity, which is too restrictive for knowledge transfer. On the other hand, using only our Inter-Layer Relation Transfer module also boosts 1/4-CSRNet's performance by a large margin, with MAE decreased by at least 10 and RMSE decreased by at least 13. It is worth noting that the Dense FSP strategy achieves a quite impressive performance gain, decreasing the MAE of the 1/4-CSRNet without transfer by 16.40 (relatively 18.3%). When combining the proposed Intra-Layer and Inter-Layer transfers to form our overall framework, 1/4-CSRNet's performance is further boosted. Specifically, with the cosine metric and Dense FSP, 1/4-CSRNet achieves the best performance (MAE 71.55, RMSE 114.40) among all transfer configurations of our framework.

Ground-Truth Type | MAE | RMSE
Hard | 72.94 | 116.68
Soft | 74.89 | 118.33
Hard + Soft | 71.55 | 114.40
Table 4. Performance of 1/4-CSRNet trained with different ground-truths on Shanghaitech Part-A.
Figure 3. Visualization of the feature maps of different models on Shanghaitech Part-A. The first and fifth columns are the features of the complete CSRNet and the naive 1/4-CSRNet. The middle three columns show the student features of 1/4-CSRNet+SKT, 1/4-CSRNet+AB (Heo et al., 2019) and 1/4-CSRNet+AT (Zagoruyko and Komodakis, 2016). The bottom three rows are the channel-wise average features at layers Conv3_1, Conv4_1 and Conv5_1, respectively. Thanks to the tailor-designed Intra-PT and Inter-RT, our 1/4-CSRNet+SKT can fully absorb the structured knowledge of CSRNet, thus the generated features are very similar to those of the teacher.
Category | Method | MAE | RMSE
Baseline | CSRNet | 68.43 | 105.99
Baseline | 1/4-CSRNet | 89.65 | 146.40
Quantization | DoReFa (Zhou et al., 2016) | 80.02 | 124.10
Quantization | QAT (Jacob et al., 2018) | 75.50 | 128.09
Pruning | L1Filter (Li et al., 2016) | 85.18 | 135.82
Pruning | CP (He et al., 2017) | 82.05 | 130.65
Pruning | AGP (Zhu and Gupta, 2017) | 78.51 | 125.83
Distillation | FitNets (Romero et al., 2014) | 87.32 | 140.34
Distillation | DML (Zhang et al., 2018) | 85.23 | 138.10
Distillation | NST (Huang and Wang, 2017) | 76.26 | 116.57
Distillation | AT (Zagoruyko and Komodakis, 2016) | 74.65 | 127.06
Distillation | AB (Heo et al., 2019) | 75.73 | 123.28
Distillation | SKT (Ours) | 71.55 | 114.40
Table 5. Performance of different compression algorithms on Shanghaitech Part-A.

4.2.3. Effect of Soft Ground-Truth

In this section, we conduct experiments to evaluate the effect of the soft ground-truth (GT) on performance. As shown in Table 4, when only the soft GT generated by the teacher network is used as supervision, the 1/4-CSRNet's performance is slightly worse than with the hard GT. This nevertheless indicates that the soft GT provides useful information, since it does not cause severe performance degradation. Furthermore, the model's performance improves when we utilize both the soft GT and the hard GT to supervise training. This further demonstrates that the soft GT is complementary to the hard GT and that we can indeed transfer knowledge of the teacher network through the soft GT.

Method | Part-A MAE | Part-A RMSE | Part-B MAE | Part-B RMSE
MCNN (Zhang et al., 2016) | 110.2 | 173.2 | 26.4 | 41.3
SwitchCNN (Sam et al., 2017) | 90.4 | 135.0 | 21.6 | 33.4
DecideNet (Liu et al., 2018a) | – | – | 21.5 | 31.9
CP-CNN (Sindagi and Patel, 2017b) | 73.6 | 106.4 | 20.1 | 30.1
DNCL (Shi et al., 2018) | 73.5 | 112.3 | 18.7 | 26.0
ACSCP (Shen et al., 2018) | 75.7 | 102.7 | 17.2 | 27.4
L2R (Liu et al., 2018d) | 73.6 | 112.0 | 13.7 | 21.4
IG-CNN (Babu Sam et al., 2018) | 72.5 | 118.2 | 13.6 | 21.1
IC-CNN (Ranjan et al., 2018) | 68.5 | 116.2 | 10.7 | 16.0
CFF (Shi et al., 2019b) | 65.2 | 109.4 | 7.2 | 12.2
SANet* | 75.33 | 122.2 | 10.45 | 17.92
1/4-SANet | 97.36 | 155.43 | 14.79 | 23.43
1/4-SANet + SKT | 78.02 | 126.58 | 11.86 | 19.83
CSRNet* | 68.43 | 105.99 | 7.49 | 12.33
1/4-CSRNet | 89.65 | 146.40 | 10.82 | 16.21
1/4-CSRNet + SKT | 71.55 | 114.40 | 7.48 | 11.68
BL* | 61.46 | 103.17 | 7.50 | 12.60
1/4-BL | 88.35 | 145.47 | 12.25 | 19.77
1/4-BL + SKT | 62.73 | 102.33 | 7.98 | 13.13
Table 6. Performance comparison on the Shanghaitech dataset (MAE and RMSE for Part-A and Part-B). The models with symbol * are our reimplemented teacher networks.

4.3. Comparison with Model Compression Algorithms

Undoubtedly, some existing compression algorithms can also be applied to compress the crowd counting models. To verify the superiority of the proposed SKT, we compare our method with ten representative compression algorithms.

In Table 5, we summarize the performance of different compression algorithms on Shanghaitech Part-A. Specifically, quantizing the parameters of CSRNet with 8 bits, DoReFa (Zhou et al., 2016) and QAT (Jacob et al., 2018) obtain MAEs of 80.02 and 75.50, respectively. When we employ the official setting of CP (He et al., 2017) to prune CSRNet, the compressed model obtains an MAE of 82.05 with 6.89M parameters. To match the number of parameters of 1/4-CSRNet, L1Filter (Li et al., 2016) and AGP (Zhu and Gupta, 2017) prune 93.75% of the parameters, and their MAEs are above 78. Furthermore, six distillation methods including our SKT are applied to distill CSRNet to 1/4-CSRNet. As can be observed, our method achieves the best performance in both MAE and RMSE. The feature visualization in Fig. 3 also shows that our features are much better than those of other compression methods. These quantitative and qualitative superiorities are attributed to the fact that the tailor-designed Intra-PT and Inter-RT can fully distill the knowledge of teachers. What's more, the proposed SKT is easy to implement and the distilled crowd counting models can be directly deployed on various edge devices. In summary, among the various existing compression algorithms, our SKT fits the crowd counting task best.

4.4. Comparison with Crowd Counting Methods

To demonstrate the effectiveness of the proposed SKT, we also conduct comparisons with state-of-the-art crowd counting methods from both performance and efficiency perspectives. Besides CSRNet (Li et al., 2018), we also apply our SKT framework to distill two other representative models, BL (Ma et al., 2019) and SANet (Cao et al., 2018). Specifically, the former is based on VGG19, and we obtain a lightweight 1/4-BL with the same transfer configuration as CSRNet. Similar to GoogLeNet (Szegedy et al., 2015), the latter SANet adopts multi-column blocks to extract features. For SANet, we transfer knowledge on the output features of each block, yielding a lightweight 1/4-SANet.
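The 1/4-width students above can be derived by uniformly scaling the channel widths of the teacher's configuration. A minimal sketch (the channel list is illustrative and not the exact CSRNet/BL backbone):

```python
# Illustrative VGG-style channel configuration; 'M' marks a max-pooling layer.
# These widths are for illustration only, not the exact teacher backbone.
TEACHER_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]

def scale_cfg(cfg, ratio):
    """Scale every channel width by `ratio`, leaving pooling markers intact."""
    return [c if c == 'M' else max(1, int(c * ratio)) for c in cfg]

student_cfg = scale_cfg(TEACHER_CFG, 1 / 4)
print(student_cfg)  # [16, 16, 'M', 32, 32, 'M', 64, 64, 64, 'M', 128, 128, 128]
```

Since the pooling structure and depth are preserved, the student's intermediate feature maps stay spatially aligned with the teacher's, which is what allows the intra-layer and inter-layer transfers to be applied layer by layer.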

4.4.1. Performance Comparison

The performance comparison with recent state-of-the-art methods on the Shanghaitech, UCF-QNRF and WorldExpo’10 datasets is reported in Tables 6, 7 and 9, respectively. As can be observed, the BL model is the best-performing existing method, achieving the lowest MAE and RMSE on almost all these datasets. CSRNet and SANet also perform relatively well among the compared methods. However, when reduced in model size to gain efficiency, the 1/4-BL, 1/4-CSRNet and 1/4-SANet models without knowledge transfer suffer heavy performance degradation compared with the original models. By applying our SKT method, these lightweight models obtain results comparable to the original models, and even achieve better performance on some datasets. For example, as shown in Table 6, our 1/4-CSRNet+SKT outperforms the original CSRNet in both MAE and RMSE on Shanghaitech Part-B, while 1/4-BL+SKT obtains a new state-of-the-art RMSE of 102.33 on Shanghaitech Part-A. It can also be observed from Table 7 that 1/4-BL+SKT achieves an impressive state-of-the-art RMSE of 156.82 on the UCF-QNRF dataset. These results demonstrate that our SKT can effectively compress various crowd counting models while preserving satisfactory performance.

Idrees et al.  (Idrees et al., 2013) 315 508
MCNN (Zhang et al., 2016) 277 426
Encoder-Decoder (Badrinarayanan et al., 2015) 270 478
CMTL (Sindagi and Patel, 2017a) 252 514
SwitchCNN (Sam et al., 2017) 228 445
Resnet-101 (He et al., 2016) 190 277
CL (Idrees et al., 2018) 132 191
TEDnet (Jiang et al., 2019) 113 188
CAN (Liu et al., 2019c) 107 183
S-DCNet (Xiong et al., 2019) 104.40 176.10
DSSINet (Liu et al., 2019a) 99.10 159.20
SANet* 152.59 246.98
1/4-SANet 192.47 293.96
1/4-SANet + SKT 157.46 257.66
CSRNet* 145.54 233.32
1/4-CSRNet 186.31 287.65
1/4-CSRNet + SKT 144.36 234.64
BL* 87.70 158.09
1/4-BL 135.64 224.72
1/4-BL + SKT 96.24 156.82
Table 7. Performance (MAE and RMSE) of different methods on the UCF-QNRF dataset. The models with * are our reimplemented teacher networks.
Method #Param Shanghaitech A (576×864) Shanghaitech B (768×1024) WorldExpo’10 (576×720) UCF-QNRF (2032×2912)
DSSINet (Liu et al., 2019a) 8.86 729.20 296.32 32.39 1152.31 471.83 49.49 607.66 250.69 26.13 8670.09 3677.98 378.80
CAN (Liu et al., 2019c) 18.10 218.20 79.02 7.99 344.80 117.12 20.75 181.83 68.00 6.84 2594.18 972.16 149.56
CSRNet (Li et al., 2018) 16.26 205.88 66.58 7.85 325.34 98.68 19.17 171.57 57.57 6.51 2447.91 823.84 119.67
BL (Ma et al., 2019) 21.50 205.32 47.89 8.84 324.46 70.18 19.63 171.10 40.52 6.69 2441.23 595.72 130.76
SANet (Cao et al., 2018) 0.91 33.55 35.20 3.90 52.96 52.85 11.42 27.97 29.84 3.13 397.50 636.48 87.50
1/4-CSRNet + SKT 1.02 13.09 8.88 0.87 20.69 12.65 1.84 10.91 7.71 0.67 155.69 106.08 9.71
1/4-BL + SKT 1.35 13.06 7.40 0.88 20.64 10.42 1.89 10.88 6.25 0.69 155.30 90.96 9.78
1/4-SANet + SKT 0.058 2.52 11.83 1.10 3.98 16.86 2.10 2.10 9.72 0.92 29.92 368.04 18.64
Table 8. The inference efficiency of state-of-the-art methods. #Param denotes the number of parameters, while FLOPs is the number of FLoating point OPerations. For each dataset, the three columns report FLOPs, GPU time, and CPU time at the listed average resolution. The execution time is computed on an Nvidia GTX 1080 GPU and a 2.4 GHz Intel Xeon E5 CPU. The units are million (M) for #Param, giga (G) for FLOPs, millisecond (ms) for GPU time, and second (s) for CPU time, respectively.
Method S1 S2 S3 S4 S5 Avg
Chen et al. (Chen et al., 2013) 2.1 55.9 9.6 11.3 3.4 16.5
Zhang et al. (Zhang et al., 2015) 9.8 14.1 14.3 22.2 3.7 12.9
MCNN (Zhang et al., 2016) 3.4 20.6 12.9 13.0 8.1 11.6
Shang et al. (Shang et al., 2016) 7.8 15.4 14.9 11.8 5.8 11.7
IG-CNN (Babu Sam et al., 2018) 2.6 16.1 10.1 20.2 7.6 11.3
ConvLSTM (Xiong et al., 2017) 7.1 15.2 15.2 13.9 3.5 10.9
IC-CNN (Ranjan et al., 2018) 17.0 12.3 9.2 8.1 4.7 10.3
SwitchCNN (Sam et al., 2017) 4.4 15.7 10.0 11.0 5.9 9.4
DecideNet (Liu et al., 2018a) 2.00 13.14 8.90 17.40 4.75 9.23
DNCL (Shi et al., 2018) 1.9 12.1 20.7 8.3 2.6 9.1
CP-CNN (Sindagi and Patel, 2017b) 2.9 14.7 10.5 10.4 5.8 8.86
PGCNet (Yan et al., 2019) 2.5 12.7 8.4 13.7 3.2 8.1
TEDnet (Jiang et al., 2019) 2.3 10.1 11.3 13.8 2.6 8.0
SANet* 2.92 15.22 14.86 14.73 4.20 10.39
1/4-SANet 3.77 19.93 19.33 18.42 6.36 13.56
1/4-SANet + SKT 3.42 16.13 15.82 15.37 4.91 11.13
CSRNet* 1.58 13.55 14.70 7.29 3.28 8.08
1/4-CSRNet 1.96 15.70 20.59 8.52 3.70 10.09
1/4-CSRNet + SKT 1.77 12.32 14.49 7.87 3.10 7.91
BL* 1.79 10.70 14.12 7.08 3.19 7.37
1/4-BL 1.97 18.39 28.95 8.12 3.94 12.27
1/4-BL + SKT 1.41 10.45 13.10 7.63 4.08 7.34
Table 9. MAE of different methods on the WorldExpo’10 dataset. The models with symbol * are our reimplemented teacher networks.

4.4.2. Efficiency Comparison

A critical goal of this work is to achieve model efficiency. To further verify the superiority of SKT, we also compare our method with existing crowd counting models in terms of inference efficiency. In Table 8, we summarize the model sizes and inference efficiencies of different models. Specifically, we report the inference time of processing an image at the average resolution of each dataset, on GPU and on CPU only, along with the number of FLOPs consumed. The average resolution of images for each dataset is listed in the first row of Table 8.

As can be observed, all original models except SANet have a large number of parameters. When we compress these models with the proposed SKT, the generated models achieve a 16× reduction in model size and FLOPs, along with an order-of-magnitude speed-up. For example, when testing a 2032×2912 image from UCF-QNRF, our 1/4-BL+SKT only requires 90.96 milliseconds on GPU and 9.78 seconds on CPU, being 6.5×/13.4× faster than the original BL model. On Shanghaitech Part-A, our 1/4-CSRNet+SKT takes 8.88 milliseconds on GPU (7.5× speed-up) and 0.87 seconds on CPU (9.0× speed-up) to process a 576×864 image. Interestingly, we find that 1/4-SANet+SKT runs slower than 1/4-BL+SKT, although SANet is much faster than BL. This is mainly because 1/4-SANet+SKT has many stacked/parallel features with small volumes, and the feature communication/synchronization consumes some extra time. In summary, the distilled VGG-based models achieve very impressive efficiency with satisfactory performance.
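The speed-up factors quoted above can be reproduced directly from the Table 8 entries (GPU times in milliseconds, CPU times in seconds):

```python
# Values copied from Table 8.
bl_gpu, bl_cpu = 595.72, 130.76           # BL on UCF-QNRF
skt_bl_gpu, skt_bl_cpu = 90.96, 9.78      # 1/4-BL + SKT on UCF-QNRF

csr_gpu, csr_cpu = 66.58, 7.85            # CSRNet on Shanghaitech Part-A
skt_csr_gpu, skt_csr_cpu = 8.88, 0.87     # 1/4-CSRNet + SKT on Part-A

print(round(bl_gpu / skt_bl_gpu, 1))      # 6.5  (BL GPU speed-up)
print(round(bl_cpu / skt_bl_cpu, 1))      # 13.4 (BL CPU speed-up)
print(round(csr_gpu / skt_csr_gpu, 1))    # 7.5  (CSRNet GPU speed-up)
print(round(csr_cpu / skt_csr_cpu, 1))    # 9.0  (CSRNet CPU speed-up)
```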

5. Conclusion

In this work, we propose a general Structured Knowledge Transfer (SKT) framework to improve the efficiency of existing crowd counting models. Specifically, an Intra-Layer Pattern Transfer and an Inter-Layer Relation Transfer are incorporated to fully transfer the structured knowledge from a heavy teacher network to a lightweight student network. Extensive evaluations on three standard benchmarks show that the proposed SKT can efficiently compress a wide range of crowd counting models (e.g., CSRNet, BL and SANet). In particular, our distilled VGG-based models achieve at least 6.5× speed-up on GPU and 9.0× speed-up on CPU, while preserving very competitive performance.


  • D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing cnn. In CVPR, pp. 3618–3626. Cited by: Table 6, Table 9.
  • V. Badrinarayanan, A. Kendall, and R. Cipolla (2015) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561. Cited by: Table 7.
  • Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. In CVPR, pp. 5918–5926. Cited by: §1.
  • X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV, pp. 734–750. Cited by: §1, §1, §3, §4.1, §4.4, Table 8.
  • K. Chen, S. Gong, T. Xiang, and C. Change Loy (2013) Cumulative attribute space for age and crowd density estimation. In CVPR, pp. 2467–2474. Cited by: Table 9.
  • K. Chen, C. C. Loy, S. Gong, and T. Xiang (2012) Feature mining for localised crowd counting. In BMVC, Vol. 1, pp. 3. Cited by: §2.1.
  • W. Ge and R. T. Collins (2009) Marked point processes for crowd counting. In CVPR, pp. 2913–2920. Cited by: §2.1.
  • Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: §2.2.
  • S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: Table 7.
  • Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1389–1397. Cited by: §2.2, §4.3, Table 5.
  • B. Heo, M. Lee, S. Yun, and J. Y. Choi (2019) Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI, Vol. 33, pp. 3779–3787. Cited by: §2.2, Figure 3, Table 5.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.2.
  • Z. Huang and N. Wang (2017) Like what you like: knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219. Cited by: Table 5.
  • H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pp. 2547–2554. Cited by: Table 7.
  • H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, Cited by: Table 1, §1, §4.1, Table 7.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pp. 2704–2713. Cited by: §2.2, §4.3, Table 5.
  • X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR, pp. 6133–6142. Cited by: Table 7, Table 9.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2.2, §4.3, Table 5.
  • M. Li, Z. Zhang, K. Huang, and T. Tan (2008) Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In ICPR, pp. 1–4. Cited by: §2.1.
  • T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan (2015) Crowded scene analysis: a survey. T-CSVT 25 (3), pp. 367–386. Cited by: §1.
  • Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pp. 1091–1100. Cited by: Table 1, §1, §1, §2.1, §3, §4.1, §4.4, Table 8.
  • D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao (2019) Density map regression guided detection network for rgb-d crowd counting and localization. In CVPR, pp. 1821–1830. Cited by: §2.1.
  • J. Liu, C. Gao, D. Meng, and A. G. Hauptmann (2018a) Decidenet: counting varying density crowds through attention guided detection and density estimation. In CVPR, pp. 5197–5206. Cited by: Table 6, Table 9.
  • L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin (2019a) Crowd counting with deep structured scale integration network. In ICCV, pp. 1774–1783. Cited by: Table 1, §1, §2.1, Table 7, Table 8.
  • L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin (2018b) Crowd counting using deep recurrent spatial-aware network. In IJCAI, Cited by: §1, §2.1.
  • L. Liu, R. Zhang, J. Peng, G. Li, B. Du, and L. Lin (2018c) Attentive crowd flow machines. In ACM MM, pp. 1553–1561. Cited by: §1.
  • N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu (2019b) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In CVPR, pp. 3225–3234. Cited by: §1.
  • W. Liu, M. Salzmann, and P. Fua (2019c) Context-aware crowd counting. In CVPR, pp. 5099–5108. Cited by: Table 1, §1, §2.1, Table 7, Table 8.
  • X. Liu, J. van de Weijer, and A. D. Bagdanov (2018d) Leveraging unlabeled data for crowd counting by learning to rank. In CVPR, Cited by: Table 6.
  • X. Liu, J. Pool, S. Han, and W. J. Dally (2018e) Efficient sparse-winograd convolutional neural networks. arXiv preprint arXiv:1802.06367. Cited by: §1.
  • Z. Ma, X. Wei, X. Hong, and Y. Gong (2019) Bayesian loss for crowd count estimation with point supervision. In ICCV, pp. 6142–6151. Cited by: Table 1, §1, §1, §2.1, §3, §4.4, Table 8.
  • S. Mirzadeh, M. Farajtabar, A. Li, and H. Ghasemzadeh (2019) Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393. Cited by: §1, §2.2.
  • D. Onoro-Rubio and R. J. López-Sastre (2016) Towards perspective-free object counting with deep learning. In ECCV, pp. 615–629. Cited by: §2.1.
  • Z. Qiu, L. Liu, G. Li, Q. Wang, N. Xiao, and L. Lin (2019) Crowd counting via multi-view scale aggregation networks. In ICME, Cited by: §1.
  • V. Ranjan, H. Le, and M. Hoai (2018) Iterative crowd counting. In ECCV, Cited by: Table 6, Table 9.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. Cited by: §1, §2.2, §3.1, Table 5.
  • D. Ryan, S. Denman, C. Fookes, and S. Sridharan (2009) Crowd counting using multiple local features. In DICTA, pp. 81–88. Cited by: §2.1.
  • D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In CVPR, Vol. 1, pp. 6. Cited by: §2.1, Table 6, Table 7, Table 9.
  • B. B. Sau and V. N. Balasubramanian (2016) Deep model compression: distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650. Cited by: §1, §2.2.
  • T. Semertzidis, K. Dimitropoulos, A. Koutsia, and N. Grammalidis (2010) Video sensor network for real-time traffic monitoring and surveillance. IET intelligent transport systems 4 (2), pp. 103–112. Cited by: §1.
  • C. Shang, H. Ai, and B. Bai (2016) End-to-end crowd counting via joint learning local and global count. In ICIP, pp. 1215–1219. Cited by: Table 9.
  • Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang (2018) Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, pp. 5245–5254. Cited by: Table 6.
  • M. Shi, Z. Yang, C. Xu, and Q. Chen (2019a) Revisiting perspective information for efficient crowd counting. In CVPR, pp. 7279–7288. Cited by: §2.1.
  • Z. Shi, P. Mettes, and C. G. Snoek (2019b) Counting with focus for free. In ICCV, pp. 4200–4209. Cited by: Table 6.
  • Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng (2018) Crowd counting with deep negative correlation learning. In CVPR, pp. 5382–5390. Cited by: §2.1, Table 6, Table 9.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §2.1.
  • V. A. Sindagi and V. M. Patel (2017a) Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, pp. 1–6. Cited by: Table 7.
  • V. A. Sindagi and V. M. Patel (2017b) Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, pp. 1879–1888. Cited by: §1, Table 6, Table 9.
  • V. A. Sindagi and V. M. Patel (2019) Multi-level bottom-top and top-bottom feature fusion for crowd counting. In ICCV, pp. 1002–1012. Cited by: §2.1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §4.4.
  • C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067. Cited by: §1.
  • E. Walach and L. Wolf (2016) Learning to count with cnn boosting. In ECCV, pp. 660–676. Cited by: §2.1.
  • F. Xiong, X. Shi, and D. Yeung (2017) Spatiotemporal modeling for crowd counting in videos. In ICCV, Cited by: Table 9.
  • H. Xiong, H. Lu, C. Liu, L. Liu, Z. Cao, and C. Shen (2019) From open set to closed set: counting objects by spatial divide-and-conquer. In ICCV, pp. 8362–8371. Cited by: Table 7.
  • Z. Xu, Y. Hsu, and J. Huang (2017) Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513. Cited by: §2.2.
  • Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding (2019) Perspective-guided convolution networks for crowd counting. In ICCV, pp. 952–961. Cited by: §1, §2.1, Table 9.
  • J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In CVPR, pp. 4133–4141. Cited by: §3.3.
  • S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §1, §2.2, §3.1, Figure 3, Table 5.
  • B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L. Xu (2008) Crowd analysis: a survey. Machine Vision and Applications 19 (5-6), pp. 345–357. Cited by: §1.
  • A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao (2019) Attentional neural fields for crowd counting. In ICCV, pp. 5714–5723. Cited by: §1, §2.1.
  • C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pp. 833–841. Cited by: §1, §2.1, §4.1, Table 9.
  • S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura (2017) Fcn-rlstm: deep spatio-temporal neural networks for vehicle counting in city cameras. In ICCV, pp. 3687–3696. Cited by: §1.
  • Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In CVPR, pp. 4320–4328. Cited by: §1, §2.2, Table 5.
  • Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, pp. 589–597. Cited by: §1, §1, §4.1, Table 6, Table 7, Table 9.
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §2.2, §4.3, Table 5.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878. Cited by: §2.2, §4.3, Table 5.