Crowd Counting by Self-supervised Transfer Colorization Learning and Global Prior Classification

05/20/2021 ∙ by Haoyue Bai, et al. ∙ The Hong Kong University of Science and Technology

Labeled crowd scene images are expensive and scarce. To significantly reduce the requirement of the labeled images, we propose ColorCount, a novel CNN-based approach by combining self-supervised transfer colorization learning and global prior classification to leverage the abundantly available unlabeled data. The self-supervised colorization branch learns the semantics and surface texture of the image by using its color components as pseudo labels. The classification branch extracts global group priors by learning correlations among image clusters. Their fused resultant discriminative features (global priors, semantics and textures) provide ample priors for counting, hence significantly reducing the requirement of labeled images. We conduct extensive experiments on four challenging benchmarks. ColorCount achieves much better performance as compared with other unsupervised approaches. Its performance is close to the supervised baseline with substantially less labeled data (10% of the original one).




1 Introduction

Crowd counting is to estimate the number of closely packed objects in an image of an unconstrained scene (for concreteness, we use people as objects in this paper) Sindagi and Patel (2018); Zitouni et al. (2016); Bai and Chan (2020). It has wide applications in public safety, people monitoring, and traffic management Onoro-Rubio and López-Sastre (2016); Lempitsky and Zisserman (2010); Chan, Liang, and Vasconcelos (2008). Despite much study, crowd counting remains a challenging problem due to severe occlusion, large scale variation, uneven distribution of people, etc.

Via density map regression, Convolutional Neural Network (CNN) based methods have recently been shown to be promising with multi-branch architectures, local-global context fusion and attention mechanisms Zhang et al. (2016b); Sam, Surya, and Babu (2017); Cao et al. (2018); Boominathan, Kruthiventi, and Babu (2016); Liu et al. (2018a); Kang and Chan (2018); Sam et al. (2019); Wang et al. (2019); Liu, van de Weijer, and Bagdanov (2018); Bai, Wen, and Gary Chan (2019); Ma et al. (2020). However, these approaches are highly data-driven, i.e., they require a voluminous amount of diverse labeled data in the training process. These data are expensive due to intensive annotation. The labeling cost is especially high for crowd images because each individual target has to be annotated. This is the major reason that only a few hundred annotated images are available in current crowd counting datasets Idrees et al. (2013); Zhang et al. (2016b); Chan, Liang, and Vasconcelos (2008). Such small datasets are often not sufficient to achieve good transferability, leading to over-fitting and limiting application to diverse real-world scenarios.

While labeled crowd images are expensive and scarce, unlabeled ones are widely available at virtually no cost. In this work, we study how to significantly reduce the need for labeled data by leveraging these abundant, freely available unlabeled images as training data for crowd counting. Our scheme, termed ColorCount, is a novel approach based on self-supervised transfer colorization learning and global prior classification to extract the discriminative features of crowds, so as to markedly reduce overfitting and the need for costly labeled data. Colorization is to hallucinate a plausible color version of a grayscale photograph Zhang, Isola, and Efros (2016). The key idea is that the semantics and local texture patterns obtained in the colorization process of an image reflect the density of people in a region, and hence provide important clues to its people count. Though we do not know the exact count in a region, we may use the fact that the number of people in a tight-texture region (typically a high-density area) is likely to be higher than that in a loose-texture region (typically a low-density area) or background (non-people regions). Furthermore, we extract global discriminative features for counting by conditioning on categorical counting group priors. This can be viewed as a coarse-to-fine process to further fine-tune the count.

(a) The semantics and local texture constraints.
(b) The global group priors.
Figure 1: Illustration of the discriminative features learned for counting: the local texture constraints (Figure 1(a)) and the global group priors (Figure 1(b)). n denotes the number of people in a target crowd scene image.

ColorCount consists of two sequential stages: self-supervised colorization pre-training using unlabeled data, followed by fine-tuning using labeled data. In pre-training, it has two branches, the colorization and classification branches, which fuse together to count the crowd. We illustrate the principles in Figure 1. The colorization branch exploits colorization as an auxiliary task for counting and treats the color components of unlabeled images as the supervision signal (pseudo labels) to train its network. Self-supervised colorization via color loss and self-reconstruction extracts both the semantics and the surface texture of the scene in each unlabeled image. For example, the background sky is typically blue, the background grass is typically green, etc. As illustrated in Figure 1(a), the semantics and local textures provide clues on the people count in a region. Let n denote the number of people in a region. The count in a tight-texture region (high-density area) is larger than that in a loose-texture region (low-density area) or the background, i.e., n_tight > n_loose and n_tight > n_background. The network learns the discriminative features of the image, which sheds light on the count of the objects.

For the classification branch, though we do not know the exact number of people in each image of a group, the classification step would generally result in n(I_i) > n(I_j) for any images I_i and I_j drawn from a denser group i and a sparser group j, respectively (Figure 1(b)). Noting that the colorization branch does not capture global features of the image, ColorCount employs this classification branch, which extracts global group priors by learning correlations among image clusters. The two branches are combined with a joint loss to estimate the count. In other words, ColorCount transfers the fused local and global knowledge learned from the colorization and classification of unlabeled images to count the objects. By leveraging abundantly available unlabeled data, ColorCount is much more scalable, flexible and applicable to general operating conditions.

To the best of our knowledge, this is the first piece of crowd counting work on joint self-supervised transfer colorization learning and global prior classification to leverage unlabeled images. Using colorization as an auxiliary task together with global prior classification, ColorCount jointly captures general semantics, counting-related local texture features and global image group priors to learn discriminative features and achieve adaptivity on counting tasks. We conduct extensive experiments on several public benchmark datasets and demonstrate that ColorCount significantly reduces the amount of labeled data required and achieves much better performance given the same labeled dataset as compared with state-of-the-art unsupervised schemes.

This paper is organized as follows. We present related work in Section 2, and describe the details of ColorCount in Section 3. We conduct extensive experiments based on four real-world datasets and present the results in Section 4. We conclude in Section 5.

2 Related Work

In this section, we review the literature related to crowd counting. Section 2.1 presents supervised learning based approaches, and Section 2.2 discusses utilizing unlabeled data for crowd counting.

2.1 Supervised Learning for Crowd Counting

Supervised learning based CNNs for crowd counting mainly focus on effective network design Zhang et al. (2016b); Sam, Surya, and Babu (2017); Cao et al. (2018); Varior et al. (2019); Idrees et al. (2013); Boominathan, Kruthiventi, and Babu (2016); Liu et al. (2018a); Kang and Chan (2018); Liu et al. (2018b); Li, Zhang, and Chen (2018); Ma, Shuai, and Cheng (2021); Yang et al. (2020). MCNN proposes multi-column convolutional neural networks with different filter sizes to address the scale variation problem Zhang et al. (2016b). Based on MCNN, Switching-CNN designs a patch-based switching module with a multi-column structure, which enlarges the scale range and better handles scale variations Sam, Surya, and Babu (2017). Researchers have also studied stacking several multi-column blocks with densely upsampled layers to generate high-quality density maps Cao et al. (2018).

Crowdnet uses a combination of deep and shallow fully convolutional layers to predict the density map Boominathan, Kruthiventi, and Babu (2016), which effectively extracts both high-level semantic information and low-level features. AFP adopts an across-scale attention scheme to adaptively fuse different scales and adapt to scale changes within an image Kang and Chan (2018). Researchers have also incorporated LSTM modules into DRSAN to learn spatial information Liu et al. (2018b). CSRNet shows that dilated convolutions can be used to enlarge the receptive field and promote accurate crowd estimation Li, Zhang, and Chen (2018).

While impressive, the approaches mentioned above require considerable diversified labeled data for training to reduce over-fitting. The labeling task for crowd images is especially expensive and tedious, since hundreds or even thousands of individuals need to be labeled in one image. Thus current crowd counting datasets Zhang et al. (2016b); Idrees et al. (2013); Chan, Liang, and Vasconcelos (2008); Chen et al. (2012); Idrees et al. (2018); Zhang et al. (2016a); Bai and Chan (2021) are typically small and cannot satisfy the needs of real-world applications. Our work substantially reduces the need for labeled data by leveraging freely available unlabeled crowd images.

2.2 Utilizing Unlabeled Data for Crowd Counting

Recently, leveraging unlabeled data in an unsupervised manner has drawn much attention, and some researchers have attempted to use unlabeled data for crowd counting. This is an alternative way to address over-fitting and reduce the demand for human annotation. GWTA-CCNN trains most of the parameters of its model without any labeled data via a reconstruction loss, and the remaining ones with supervision Sam et al. (2019). However, the L2 reconstruction loss cannot fully extract discriminative features for counting tasks and introduces much unrelated information during training. CCWId makes use of a large-scale synthesized dataset for crowd counting and uses CycleGAN to alleviate the domain gap Wang et al. (2019). However, the distribution of synthesized people differs from real data, which results in a broader domain gap compared with using unlabeled real-world crowd scene images for counting.

Self-supervised learning is a subset of unsupervised learning methods. It learns visual features from large-scale annotation-free images via carefully designed auxiliary tasks Jing and Tian (2019). L2R studies the use of ranking as a self-supervised pretext task for crowd counting Liu, van de Weijer, and Bagdanov (2018); Liu, Van De Weijer, and Bagdanov (2019). However, ranking is a weak supervision signal for counting that cannot fully extract discriminative features for learning to count, and the multi-task framework, which minimizes an additional ranking loss, is sensitive to its hyperparameters. Carefully designed self-supervised pretext tasks can effectively learn useful visual features from large-scale unlabeled data for real-world crowd counting.

The labeling work for the crowd counting task is expensive and tedious, while unlabeled images are cheap and abundant. Our target is to capture discriminative features for counting by utilizing unlabeled data, reducing the need for intensive human annotation. Our proposed ColorCount effectively extracts this information from unlabeled data by self-supervised colorization learning and global prior classification, with local texture constraints and global categorical priors.

Figure 2: The framework of our proposed ColorCount: Transfer Colorization Learning with Classification for Crowd Counting. The first row is the pre-training stage with pretext task. The second row is the fine-tuning stage for crowd counting (main task).

3 ColorCount: Transfer Colorization Learning with Classification for
Crowd Counting

In this section, we describe the details of ColorCount. In Section 3.1, we present the problem formulation and baseline solution. Section 3.2 shows how to use colorization as an auxiliary task given group priors for crowd counting.

3.1 Problem Formulation and Baseline

Labeled images for crowd counting are expensive because we need to label each of the individual targets within an image. In this paper, we propose ColorCount, a novel scheme combining self-supervised transfer colorization learning and global prior classification for crowd counting, which leverages unlabeled data and reduces the requirement for intensive human annotation. As Figure 2 shows, ColorCount contains two training stages: the pre-training stage with the pretext task (the first row of Figure 2) and the fine-tuning stage with the main task (the second row of Figure 2).

In the first stage, we create pretext training samples from the unlabeled image sets. There are two branches in the pre-training stage: the colorization branch and the classification branch. Each original unlabeled image is divided into its lightness channel (as the input image) and its color components (as the pseudo label). The network can thus be trained by taking the L channel of the unlabeled image as input and the automatically obtained color components as the supervision signal (the ab color components in the CIE Lab color space). Besides, we also pre-train the network with global group constraints. There are three ways to obtain group priors, as discussed in Section 4.

In the second stage, we fine-tune the network for crowd counting with limited labeled data. The network is fine-tuned with the real counting labels. We propose an Interleaved Group Convolution based Crowd Counting Network (IGCCNet) as our baseline counting branch, which consists of three modules: the pre-trained frontend, the interleaved group counting module, and the context fusion module. The counting branch is optimized with the Euclidean loss in our experiments, defined as L(Θ) = 1/(2N) Σ_i ||F(X_i; Θ) − D_i||², where Θ denotes the network parameters, N is the number of pixels, F(X_i; Θ) is the estimated density and D_i the ground truth. The loss function in the fine-tuning stage is not limited to the Euclidean loss; it can be any general function. We use the Euclidean loss because it has been widely adopted in crowd counting with reportedly good performance, including in our comparison schemes; for fairness, we hence use it here.
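As an illustration, the per-pixel Euclidean loss over density maps can be sketched as follows (a minimal NumPy version; the paper's actual implementation is in PyTorch, and the function name here is ours):

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean loss between predicted and ground-truth
    density maps: L = 1/(2N) * sum((pred - gt)^2), N = number of pixels."""
    n = pred.size
    return float(np.sum((pred - gt) ** 2) / (2.0 * n))

# Toy 4x4 density maps; the count is the integral (sum) of the map.
pred = np.zeros((4, 4)); pred[1, 1] = 0.5; pred[2, 2] = 0.5
gt = np.zeros((4, 4)); gt[1, 1] = 1.0
loss = euclidean_loss(pred, gt)
counts = (pred.sum(), gt.sum())  # both maps integrate to a count of 1
```

Note that both toy maps integrate to the same count, yet the loss is nonzero: the Euclidean loss penalizes where density is placed, not just the total.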

To be specific, the pre-trained frontend captures the transferred visual features. The interleaved group counting module, a stack of interleaved group blocks Zhang et al. (2017b), is the main block that further enlarges the receptive field with limited parameters and effectively utilizes counting features for density map prediction. Each block sequentially contains two complementary interleaved group convolutions: a primary group convolution applying spatial convolution on each primary partition, and a secondary group point-wise convolution in which the channels of each secondary partition come from different primary partitions. This operation is more efficient in terms of computation and parameters. Finally, the context fusion module fully utilizes the extracted features to achieve accurate crowd estimation.
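The channel rearrangement between the primary and secondary group convolutions can be sketched as follows (a NumPy illustration of the interleaving pattern only, with hypothetical names; the actual module also contains the convolutions themselves):

```python
import numpy as np

def interleave_channels(x, primary_partitions):
    """Rearrange channels of an (n, c, h, w) tensor so that each
    secondary partition draws its channels from different primary
    partitions, as in interleaved group convolutions."""
    n, c, h, w = x.shape
    g = primary_partitions
    return (x.reshape(n, g, c // g, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

# 8 channels, 2 primary partitions [0..3] and [4..7]:
x = np.arange(8).reshape(1, 8, 1, 1)
y = interleave_channels(x, 2)
# each consecutive channel pair now mixes both primary partitions
```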

3.2 Training with Colorization Given
Group Priors

We design ColorCount to learn discriminative features for counting. ColorCount is built on self-supervised colorization baselines, with a texture constraint and a categorical group constraint included to effectively capture local and global learning-to-count features. In this section, we present the details of our pre-training with colorization given group priors.

Colorization loss. This is used for the first stage of ColorCount. Each original unlabeled image is divided into its lightness channel (as the input) and its color components (as the pseudo label). The target is to learn a mapping Ŷ = F(X) from the lightness channel X to the two ab color channels Y in the CIE Lab color space, where Ŷ indicates the prediction, Y denotes the ground truth, and H, W denote the image dimensions.

Instead of directly minimizing the Euclidean loss between Ŷ and Y, the values of both the lightness channel and the ab channels are quantized into grids. We then minimize a multinomial cross-entropy loss between the two quantized color distributions to fully capture the semantic visual features Zhu et al. (2017); Zhang et al. (2017a). For the input X, we learn a mapping Ẑ = G(X) to a probability distribution over the q possible colors. We then minimize the multinomial cross-entropy between the prediction Ẑ and the ground-truth color distribution Z converted from the ground-truth color Y:

L_color(Ẑ, Z) = −Σ_{h,w} v(Z_{h,w}) Σ_q Z_{h,w,q} log Ẑ_{h,w,q},

where h, w index the image dimensions, q indexes the quantized ab values, and v(·) is a weighting term that rebalances the loss based on color-class rarity. Finally, the color value Ŷ can be converted from the predicted probability distribution Ẑ.
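A minimal sketch of this quantized-color cross-entropy follows (NumPy; the bin layout, grid size and names are our illustrative choices, and we use hard class labels rather than soft-binned distributions):

```python
import numpy as np

GRID, AMIN, AMAX = 10, -110, 110   # hypothetical ab-bin layout
BINS = (AMAX - AMIN) // GRID       # 22 bins per axis
Q = BINS * BINS                    # q = 484 quantized colors

def quantize_ab(ab):
    """Map continuous ab values (..., 2) to a discrete color class."""
    idx = np.clip((ab - AMIN) // GRID, 0, BINS - 1).astype(int)
    return idx[..., 0] * BINS + idx[..., 1]

def colorization_loss(pred_logits, gt_ab):
    """Multinomial cross-entropy between the predicted distribution
    over Q colors and the quantized ground-truth color class."""
    h, w, q = pred_logits.shape
    z = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    cls = quantize_ab(gt_ab)
    nll = -np.log(prob.reshape(-1, q)[np.arange(h * w), cls.ravel()] + 1e-12)
    return float(nll.mean())

# A uniform prediction over the 484 colors gives a loss near log(484).
loss = colorization_loss(np.zeros((2, 2, Q)), np.zeros((2, 2, 2)))
```

The rarity weighting v(·) would multiply each pixel's negative log-likelihood by a per-class weight before averaging; it is omitted here for brevity.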


GAN loss. To facilitate the pre-training process, we apply CycleGAN Zhu et al. (2017) with two mapping functions G: X → Y and F: Y → X. For the mapping function G and its discriminator D_Y, the objective is the adversarial loss L_GAN(G, D_Y) = E_y[log D_Y(y)] + E_x[log(1 − D_Y(G(x)))].

Figure 3: Visualization of learning from colorization based on our collected unlabeled data. The first row is the original image in the pre-training dataset, the second row is the L channel of the original image, the third row is the ground-truth ab channels, and the last row is our predicted colorization results.

Self-reconstruction loss. The same applies to the mapping function F and its discriminator D_X. The cycle reconstruction loss is L_rec(G, F) = E_x[||F(G(x)) − x||_1] + E_y[||G(F(y)) − y||_1].
Texture loss. To fully capture the local spatial information in the feature maps, a texture loss term is adopted, as widely used in style transfer Gatys, Ecker, and Bethge (2016). After mapping into the feature space of a pretrained VGG-19 architecture, we compute the Gram matrices of the output features. The texture loss is then the mean squared error between the feature correlations given by the Gram matrices:

L_tex = Σ_l 1/(4 N_l² M_l²) Σ_{i,j} (G_ij^l − A_ij^l)²,

where N_l is the number of feature maps of VGG-19 layer l, M_l is the product of the layer-l feature map height and width, and G^l, A^l are the Gram matrices of the two feature sets. The Gram matrix is the inner product of vectorized feature maps: distinctive features become more significant, while features with smaller element values shrink further after the inner product. The effect of the Gram matrix is thus to magnify the characteristics of the data and capture texture details.
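The Gram-matrix texture loss can be sketched as follows (NumPy, single-layer version with the 1/(4N²M²) normalization of Gatys et al.; function names are ours):

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (channels, h, w) feature map: inner products
    between the vectorized channel responses."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T

def texture_loss(feat_a, feat_b):
    """Squared error between Gram matrices, normalized by
    4 * N^2 * M^2 (N feature maps, M spatial positions)."""
    c, h, w = feat_a.shape
    g_a, g_b = gram_matrix(feat_a), gram_matrix(feat_b)
    return float(np.sum((g_a - g_b) ** 2) / (4.0 * c**2 * (h * w) ** 2))

feat = np.arange(12.0).reshape(3, 2, 2)   # toy 3-channel feature map
identical = texture_loss(feat, feat)      # same texture statistics
different = texture_loss(feat, np.zeros_like(feat))
```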

Classification loss. Given the global group priors, the images are categorized into several groups of different people density Haralick, Shanmugam, and Dinstein (1973); Cireşan, Meier, and Schmidhuber (2012). This can be viewed as a coarse-to-fine process. Training the network with this classification loss enables it to learn counting-related discriminative features for the subsequent fine-tuning stage. The loss function is the cross-entropy L_cls = −Σ_{k=1}^{K} y_k log p_k, where K is the number of groups, y_k is the one-hot group label, and p_k is the softmax probability of group k.
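A sketch of this group cross-entropy (NumPy, batch form; names are ours):

```python
import numpy as np

def group_classification_loss(logits, labels):
    """Cross-entropy over K density groups: softmax the logits, then
    take the negative log-likelihood of each image's group label."""
    z = logits - logits.max(axis=1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = -np.log(prob[np.arange(len(labels)), labels] + 1e-12)
    return float(nll.mean())

# 4 images, 3 density groups (e.g., low/med/high); uninformative
# zero logits give a loss near log(3).
loss = group_classification_loss(np.zeros((4, 3)), np.array([0, 1, 2, 0]))
```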

Therefore, the full pre-training loss function is the sum of the above terms:

L = L_color + L_GAN + L_rec + L_tex + L_cls.
After the pre-training stage, we fine-tune the network using limited labeled images via the Euclidean loss in our experiments, as discussed in Section 3.1.

4 Experiments and Illustrative Results

In this section, we discuss the experiment details and results to evaluate our approach. Section 4.1 presents implementation details and the datasets. In Section 4.2, we describe the evaluation metrics and our comparison schemes. Section 4.3 shows the illustrative results of our ColorCount scheme. Section 4.4 details the ablation study.

Dataset                               Avg. Resolution   Images   Total Count   Mean    Min   Max
UCF-QNRF Idrees et al. (2018)         2013 × 2902       1,525    1,251,642     815     49    12,865
ShanghaiTech A Zhang et al. (2016b)   589 × 868         482      241,677       501     33    3,139
ShanghaiTech B Zhang et al. (2016b)   768 × 1024        716      88,488        123     9     578
UCF_CC_50 Idrees et al. (2013)        2101 × 2888       50       63,974        1,279   94    4,543

Table 1: Statistics of the four labeled crowd counting datasets.

Method                                       Labeled data       Initialization   MAE     MSE
MCNN Zhang et al. (2016b)                    full               None             26.4    41.3
Switching-CNN Sam, Surya, and Babu (2017)    full               None             21.6    33.4
ACSCP Shen et al. (2018)                     full               None             17.2    27.4
DRSAN Liu et al. (2018b)                     full               None             11.1    18.2
CSRNet Li, Zhang, and Chen (2018)            full               ImageNet         10.6    16.0
Ours: ColorCount                             random sampling    Unlabeled data   14.33   26.70
Ours: ColorCount                             random sampling    Unlabeled data   8.77    14.12

Table 2: MAE and MSE error on the ShanghaiTech B dataset with different amounts of training data and different initialization.

4.1 Implementation Details and Datasets

Training stage one is optimized with a batch size of 25. In the second stage, we fine-tune our model with limited real-world labeled data. The weights of the first layer in the transferred channel-wise encoder are duplicated into three copies to accommodate the three-channel input in the fine-tuning stage with labeled images. This stage is optimized with the Adam solver. Our framework is implemented with PyTorch 0.4.0 and CUDA v9.0. The code will be released.

For the categorical group priors of the classification in the pre-training stage, there are three different approaches to generate the image group sets for learning to count:

  • Ranking-based group priors: cropping and sampling a decreasing sequence of sub-images of each original image. The number of people in an original image is no smaller than that in its sub-images, which serves as natural prior information for global groups Liu, van de Weijer, and Bagdanov (2018); Wilcoxon (1992).

  • Clustering-based group priors: using clustering-based methods to recognize and classify the image datasets into various groups based on the density features presented in the images Yang et al. (2010).

  • Classification-based group priors: generating classification labels in the image keyword query stage. We can jointly use adverbs of different degree as keywords to build the datasets Wang et al. (2010).

There is a trade-off between the group annotation cost and the level of label noise. Ranking-based group priors come at no additional cost from the original unlabeled images, but the priors are weaker than those of the other two methods. Clustering-based algorithms are also a good way to obtain pre-clustering labels at low cost. Classification-based group priors generate categorical labels during the query process, which is practical and easily obtained in real-world applications. Currently, we use this classification-based group approach (the group number is set to three, low/med/high, in our experiments) and formulate it as a cross-entropy loss function, as shown in Section 3.2. Though there is a cost in low/med/high labeling, it is significantly lower than the labeling cost of original crowd counting tasks.
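As an illustration of the ranking-based option, nested centered crops yield a free ordering of (unknown) counts, since each crop contains the next one (a NumPy sketch; the crop ratio and level count are hypothetical parameters):

```python
import numpy as np

def nested_crops(image, levels=3, ratio=0.75):
    """A decreasing sequence of centered crops. Each crop is contained
    in the previous one, so its people count can only be smaller or
    equal -- an annotation-free group ordering signal."""
    crops = [image]
    for _ in range(levels - 1):
        h, w = crops[-1].shape[:2]
        nh, nw = int(h * ratio), int(w * ratio)
        top, left = (h - nh) // 2, (w - nw) // 2
        crops.append(crops[-1][top:top + nh, left:left + nw])
    return crops

crops = nested_crops(np.ones((100, 100)))
sizes = [c.shape for c in crops]  # strictly shrinking crops
```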

Figure 4: Qualitative results of the second training stage. The first row is the original image, the second row is the ground truth, and the third row is the estimated density map.

Method                                        Labeled data (random sampling)   Training process       Data collection   MAE     MSE
L2R Liu, van de Weijer, and Bagdanov (2018)   labeled data                     Multi-task             Example query     14.4    23.8
L2R Liu, van de Weijer, and Bagdanov (2018)   labeled data                     Multi-task             Keyword           13.7    21.4
Ours: ColorCount                              labeled data                     Pre-train, fine-tune   Keyword           14.33   26.70
Ours: ColorCount                              labeled data                     Pre-train, fine-tune   Keyword           8.77    14.12

Table 3: Comparison with other self-supervised learning based crowd counting methods, all leveraging both unlabeled and labeled data.

For a fair comparison, we collect a large unlabeled dataset of crowd images from the internet by keyword query. This follows the same procedure as our comparison scheme L2R Liu, van de Weijer, and Bagdanov (2018); Liu, Van De Weijer, and Bagdanov (2019), which also makes use of unlabeled data for crowd counting. We search Google Images with keywords that have a high probability of containing a crowd scene, such as Demonstration, Trainstation, Mall, Studio, and Beach. Besides, we jointly use adverbs of different crowd degree in the keyword query to roughly classify the crowd scene images. We delete images that are not relevant to our problem. Finally, we collected a dataset of 2,418 high-resolution crowd images, with a total storage size of 2.6 GB.

We evaluate our ColorCount on four challenging crowd counting datasets: UCF-QNRF Idrees et al. (2018), ShanghaiTech A Zhang et al. (2016b), ShanghaiTech B Zhang et al. (2016b), and UCF_CC_50 Idrees et al. (2013). The statistics of the four labeled crowd counting datasets are shown in Table 1. To show the superior performance of our approach in reducing the requirement for labeled data, we randomly sample various subsets of all four datasets for fine-tuning in our experiments. We conduct each experiment several times with different random samples, and the results show little difference. Besides, we generate the ground truth for labeled crowd images by blurring each head annotation with a geometry-adaptive Gaussian kernel, as shown in Figure 4.
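Ground-truth generation from head annotations can be sketched as follows (NumPy, with a fixed σ for simplicity; the geometry-adaptive variant instead scales σ with each head's nearest-neighbor distances):

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Place a normalized Gaussian at each head annotation (x, y).
    Each Gaussian is renormalized after truncation at the borders so
    that every person contributes exactly 1 to the map's integral."""
    h, w = shape
    dmap = np.zeros((h, w))
    yy, xx = np.mgrid[0:h, 0:w]
    for x, y in points:
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()
    return dmap

dmap = density_map([(10, 10), (30, 20)], (50, 50))
# dmap.sum() recovers the number of annotated heads
```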

4.2 Evaluation Metrics and Comparison Schemes

For crowd counting, two metrics are used for evaluation, Mean Absolute Error (MAE) and Mean Squared Error (MSE), which are defined as

MAE = (1/M) Σ_{i=1}^{M} |z_i − ẑ_i|,   MSE = sqrt( (1/M) Σ_{i=1}^{M} (z_i − ẑ_i)² ),

where M is the total number of test images, z_i denotes the ground-truth count of the i-th image, and ẑ_i represents the estimated count.
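These metrics can be sketched directly (NumPy; note that "MSE" in the crowd-counting literature conventionally denotes the root of the mean squared count error):

```python
import numpy as np

def mae_mse(gt_counts, est_counts):
    """MAE and (root) MSE over per-image crowd counts."""
    gt = np.asarray(gt_counts, dtype=float)
    est = np.asarray(est_counts, dtype=float)
    mae = float(np.mean(np.abs(gt - est)))
    mse = float(np.sqrt(np.mean((gt - est) ** 2)))
    return mae, mse

mae, mse = mae_mse([10, 20], [12, 16])  # count errors of 2 and 4 people
```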

The comparison schemes in our experiments are below:

  • Supervised learning methods: MCNN Zhang et al. (2016b), Switching-CNN Sam, Surya, and Babu (2017), ACSCP Shen et al. (2018), DRSAN Liu et al. (2018b), CSRNet Li, Zhang, and Chen (2018), PGCNet Yan et al. (2019). These supervised learning based methods are trained on labeled data without unlabeled images.

  • Self-supervised learning methods: L2R Liu, van de Weijer, and Bagdanov (2018). This method studies the use of ranking as a self-supervised pretext task for crowd counting and formulates it as a multi-task framework using both labeled and unlabeled data. It collects unlabeled data in two ways: query by example and query by keyword.

  • Almost unsupervised learning methods: GWTA-CCNN Sam et al. (2019). This scheme trains most of its parameters via unsupervised reconstruction loss, and only the last layers are trained with labeled data.

4.3 Illustrative Results

In this section, we evaluate and analyze the results of our approach on four real-world labeled datasets. To show the superior performance of our approach in utilizing unlabeled data and reducing the need for labeled images, we randomly sample various subsets of the original labeled data for training instead of directly adopting the original fully annotated datasets. This demonstrates the capability of ColorCount for real-world applications with limited labeled training data.

Method                              Training process      UCF-QNRF        Part A          Part B        UCF_CC_50
                                                          MAE    MSE      MAE    MSE      MAE   MSE     MAE    MSE
GWTA-CCNN Sam et al. (2019)         Self-reconstruction   N/A    N/A      154.7  229.4    N/A   N/A     433.7  583.3
IGCCNet                             Plus ColorCount       244.2  439.3    73.6   118.1    14.3  26.7    316.0  429.5
IGCCNet                             Plus ColorCount       216.1  346.8    66.5   109.8    8.8   14.1    259.6  375.4
MCNN Zhang et al. (2016b)           None                  277.0  N/A      110.2  173.2    26.4  41.3    377.6  509.1
MCNN Zhang et al. (2016b)           Plus ColorCount       266.8  460.6    100.5  156.3    22.6  32.5    338.5  461.2
CSRNet Li, Zhang, and Chen (2018)   ImageNet              N/A    N/A      68.2   115.0    10.6  16.0    266.1  397.5
CSRNet Li, Zhang, and Chen (2018)   Plus ColorCount       N/A    N/A      65.9   112.3    7.4   12.1    236.3  310.9

Table 4: Ablation study on four challenging crowd counting datasets.

Compared with supervised learning methods. Table 2 presents the results of our ColorCount compared with supervised learning approaches with different architectures and initialization methods. Our approach, using only a small fraction of the labeled data, outperforms most of the listed supervised learning methods, which train on the full labeled datasets. Even with less labeled data, our ColorCount outperforms MCNN Zhang et al. (2016b), Switching-CNN Sam, Surya, and Babu (2017) and ACSCP Shen et al. (2018) in terms of both MAE and MSE. This good performance comes at the cost of leveraging unlabeled data, which is cheap and abundant. This clearly shows that our ColorCount can largely reduce the demand for labeled crowd images by leveraging unlabeled data, making it well suited for real-world applications.

Compared with self-supervised learning methods. As shown in Table 3, we compare ColorCount with the self-supervised learning based method L2R Liu, van de Weijer, and Bagdanov (2018), which leverages both labeled and unlabeled data. For a fair comparison, we use an even smaller amount of labeled data and follow a data collection procedure similar to L2R by utilizing keyword queries. Even with less labeled data, our ColorCount achieves much lower MAE and MSE values than the self-supervised L2R approach.

Compared with an almost unsupervised learning method. We compare ColorCount with the almost unsupervised learning method GWTA-CCNN Sam et al. (2019) in Table 4. The second row of Table 4 reports the results of GWTA-CCNN. Our approach significantly improves crowd counting performance in terms of MAE and MSE, outperforming the almost unsupervised method by a large margin.

Qualitative results. Figure 3 shows the visualization of our unlabeled dataset and the color prediction results in the first training stage. The first row is the original image, the second row is the L channel of the original image, the third row is the ground-truth ab channels, and the last row is our generated color images. These qualitative results show that ColorCount achieves plausible colorization, e.g., the sky is painted blue and the grass is colorized green, indicating that the network effectively captures semantic and visual texture features. This provides useful hints for the crowd counting task with limited labeled data. Figure 4 presents the qualitative results of the crowd counting task. The first row is the original image, the second row is the ground truth, and the third row is the generated density map, which shows good counting results.

4.4 Ablation Study

As Table 4 shows, we conduct an ablation study on four challenging labeled crowd counting datasets. The first column lists the baseline model used in each experiment. The second column presents the pre-training scheme used in the first training stage and how many labeled images are used in the second stage. The symbol N/A in the table means that a certain experiment was not conducted in the related work, and we do not compare with it. IGCCNet is our designed counting branch in ColorCount; the experiments discussed above are mainly based on this network. The original MCNN and CSRNet are fully supervised crowd counting methods, which only make use of the labeled dataset; their original results are reported in the fifth and seventh rows.

Furthermore, we conduct experiments on MCNN plus ColorCount and CSRNet plus ColorCount, i.e., each network is pre-trained on unlabeled data with our pre-training scheme and then fine-tuned with labeled images. The results with ColorCount are consistently better than those obtained with labeled data alone. These experiments show that our approach achieves superior performance with limited labeled data, and that our self-supervised transfer colorization learning greatly reduces the annotation requirement for crowd counting, helping bring the task to real-world applications.
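The pre-train-then-fine-tune pattern above can be sketched with a toy model: a shared "backbone" is first trained with a proxy head on plentiful unlabeled data, then reused and fine-tuned with a fresh counting head on a small labeled set. All shapes, names, and the linear backbone here are illustrative stand-ins, not the paper's IGCCNet:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))            # shared backbone (toy)
head_color = rng.normal(scale=0.1, size=(8, 2))   # proxy (colorization) head
head_count = rng.normal(scale=0.1, size=(8, 1))   # downstream (counting) head

def step(X, Y, W, H, lr=1e-3):
    """One gradient step on the squared error of X @ W @ H vs. Y,
    updating both the backbone W and the head H."""
    err = X @ W @ H - Y
    grad_H = (X @ W).T @ err
    grad_W = X.T @ err @ H.T
    return W - lr * grad_W, H - lr * grad_H

# Stage 1: pre-train backbone + proxy head on plentiful "unlabeled" data;
# the color channels act as free supervision.
X_u = rng.normal(size=(256, 8))
Y_color = X_u @ rng.normal(size=(8, 2))           # synthetic proxy targets
for _ in range(200):
    W, head_color = step(X_u, Y_color, W, head_color)

# Stage 2: reuse the pre-trained backbone and fine-tune it together with
# a fresh counting head on a much smaller labeled set.
X_l = rng.normal(size=(16, 8))
Y_count = X_l @ rng.normal(size=(8, 1))           # synthetic count targets
loss_before = float(np.mean((X_l @ W @ head_count - Y_count) ** 2))
for _ in range(200):
    W, head_count = step(X_l, Y_count, W, head_count)
loss_after = float(np.mean((X_l @ W @ head_count - Y_count) ** 2))
```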

Unlabeled data can be accessed at virtually no cost, in contrast with the expensive annotation process of crowd counting. Our focus is therefore on reducing the requirement for labeled crowd counting images. Table 2 and Table 4 show that ColorCount achieves comparable results with much less labeled data. Since a direct comparison with the fully supervised baselines is difficult, we design two sets of experiments for a fair comparison: 1) ColorCount with much less labeled data (Table 2, Table 4); and 2) ColorCount with the same amount of labeled data (Table 4).

5 Conclusion

In this paper, we propose ColorCount, a novel approach that combines self-supervised transfer colorization learning with group prior classification to leverage widely available annotation-free images for the crowd counting task. ColorCount uses colorization as a proxy task to learn discriminative features (general semantic information and local texture features) for counting, and jointly extracts global image group priors via classification. The fusion of global priors and local texture information provides ample, highly transferable knowledge in the form of discriminative counting features. As a result, the second training stage can fine-tune the network parameters with much less labeled crowd counting data. Extensive experiments on four challenging datasets demonstrate our superior performance: ColorCount achieves much better MAE and MSE than other unsupervised learning methods and approaches the supervised baselines with much less labeled data.


  • Bai and Chan (2020) Bai, H.; and Chan, S.-H. G. 2020. CNN-based Single Image Crowd Counting: Network Design, Loss Function and Supervisory Signal. arXiv preprint arXiv:2012.15685 .
  • Bai and Chan (2021) Bai, H.; and Chan, S.-H. G. 2021. Motion-guided Non-local Spatial-Temporal Network for Video Crowd Counting. arXiv preprint arXiv:2104.13946 .
  • Bai, Wen, and Gary Chan (2019) Bai, H.; Wen, S.; and Gary Chan, S.-H. 2019. Crowd counting on images with scale variation and isolated clusters. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 0–0.
  • Boominathan, Kruthiventi, and Babu (2016) Boominathan, L.; Kruthiventi, S. S.; and Babu, R. V. 2016. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM international conference on Multimedia, 640–644. ACM.
  • Cao et al. (2018) Cao, X.; Wang, Z.; Zhao, Y.; and Su, F. 2018. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), 734–750.
  • Chan, Liang, and Vasconcelos (2008) Chan, A. B.; Liang, Z.-S. J.; and Vasconcelos, N. 2008. Privacy preserving crowd monitoring: Counting people without people models or tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1–7. IEEE.
  • Chen et al. (2012) Chen, K.; Loy, C. C.; Gong, S.; and Xiang, T. 2012. Feature mining for localised crowd counting. In BMVC, volume 1, 3.
  • Cireşan, Meier, and Schmidhuber (2012) Cireşan, D.; Meier, U.; and Schmidhuber, J. 2012. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745 .
  • Gatys, Ecker, and Bethge (2016) Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2414–2423.
  • Haralick, Shanmugam, and Dinstein (1973) Haralick, R. M.; Shanmugam, K.; and Dinstein, I. H. 1973. Textural features for image classification. IEEE Transactions on systems, man, and cybernetics 610–621.
  • Idrees et al. (2013) Idrees, H.; Saleemi, I.; Seibert, C.; and Shah, M. 2013. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2547–2554.
  • Idrees et al. (2018) Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; and Shah, M. 2018. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), 532–546.
  • Jing and Tian (2019) Jing, L.; and Tian, Y. 2019. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. arXiv preprint arXiv:1902.06162 .
  • Kang and Chan (2018) Kang, D.; and Chan, A. 2018. Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid. arXiv preprint arXiv:1805.06115 .
  • Lempitsky and Zisserman (2010) Lempitsky, V.; and Zisserman, A. 2010. Learning to count objects in images. In Advances in neural information processing systems, 1324–1332.
  • Li, Zhang, and Chen (2018) Li, Y.; Zhang, X.; and Chen, D. 2018. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1091–1100.
  • Liu et al. (2018a) Liu, J.; Gao, C.; Meng, D.; and Hauptmann, A. G. 2018a. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5197–5206.
  • Liu et al. (2018b) Liu, L.; Wang, H.; Li, G.; Ouyang, W.; and Lin, L. 2018b. Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601 .
  • Liu, van de Weijer, and Bagdanov (2018) Liu, X.; van de Weijer, J.; and Bagdanov, A. D. 2018. Leveraging unlabeled data for crowd counting by learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7661–7669.
  • Liu, Van De Weijer, and Bagdanov (2019) Liu, X.; Van De Weijer, J.; and Bagdanov, A. D. 2019. Exploiting Unlabeled Data in CNNs by Self-supervised Learning to Rank. IEEE transactions on pattern analysis and machine intelligence .
  • Ma, Shuai, and Cheng (2021) Ma, Y.-J.; Shuai, H.-H.; and Cheng, W.-H. 2021. Spatiotemporal Dilated Convolution with Uncertain Matching for Video-based Crowd Estimation. IEEE Transactions on Multimedia .
  • Ma et al. (2020) Ma, Z.; Wei, X.; Hong, X.; and Gong, Y. 2020. Learning Scales from Points: A Scale-aware Probabilistic Model for Crowd Counting. In Proceedings of the 28th ACM International Conference on Multimedia, 220–228.
  • Onoro-Rubio and López-Sastre (2016) Onoro-Rubio, D.; and López-Sastre, R. J. 2016. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, 615–629. Springer.
  • Sam et al. (2019) Sam, D. B.; Sajjan, N. N.; Maurya, H.; and Babu, R. V. 2019. Almost Unsupervised Learning for Dense Crowd Counting. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, volume 27.
  • Sam, Surya, and Babu (2017) Sam, D. B.; Surya, S.; and Babu, R. V. 2017. Switching convolutional neural network for crowd counting. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4031–4039. IEEE.
  • Shen et al. (2018) Shen, Z.; Xu, Y.; Ni, B.; Wang, M.; Hu, J.; and Yang, X. 2018. Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5245–5254.
  • Sindagi and Patel (2018) Sindagi, V. A.; and Patel, V. M. 2018. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters 107: 3–16.
  • Varior et al. (2019) Varior, R. R.; Shuai, B.; Tighe, J.; and Modolo, D. 2019. Scale-Aware Attention Network for Crowd Counting. arXiv preprint arXiv:1901.06026 .
  • Wang et al. (2010) Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; and Gong, Y. 2010. Locality-constrained linear coding for image classification. In 2010 IEEE computer society conference on computer vision and pattern recognition, 3360–3367. Citeseer.
  • Wang et al. (2019) Wang, Q.; Gao, J.; Lin, W.; and Yuan, Y. 2019. Learning from Synthetic Data for Crowd Counting in the Wild. arXiv preprint arXiv:1903.03303 .
  • Wilcoxon (1992) Wilcoxon, F. 1992. Individual comparisons by ranking methods. In Breakthroughs in statistics, 196–202. Springer.
  • Yan et al. (2019) Yan, Z.; Yuan, Y.; Zuo, W.; Tan, X.; Wang, Y.; Wen, S.; and Ding, E. 2019. Perspective-Guided Convolution Networks for Crowd Counting. In Proceedings of the IEEE International Conference on Computer Vision, 952–961.
  • Yang et al. (2020) Yang, Y.; Li, G.; Du, D.; Huang, Q.; and Sebe, N. 2020. Embedding Perspective Analysis into Multi-Column Convolutional Neural Network for Crowd Counting. IEEE Transactions on Image Processing .
  • Yang et al. (2010) Yang, Y.; Xu, D.; Nie, F.; Yan, S.; and Zhuang, Y. 2010. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19(10): 2761–2773.
  • Zhang et al. (2016a) Zhang, C.; Kang, K.; Li, H.; Wang, X.; Xie, R.; and Yang, X. 2016a. Data-driven crowd understanding: A baseline for a large-scale crowd dataset. IEEE Transactions on Multimedia 18(6): 1048–1061.
  • Zhang, Isola, and Efros (2016) Zhang, R.; Isola, P.; and Efros, A. A. 2016. Colorful image colorization. In European conference on computer vision, 649–666. Springer.
  • Zhang et al. (2017a) Zhang, R.; Zhu, J.-Y.; Isola, P.; Geng, X.; Lin, A. S.; Yu, T.; and Efros, A. A. 2017a. Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999 .
  • Zhang et al. (2017b) Zhang, T.; Qi, G.-J.; Xiao, B.; and Wang, J. 2017b. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, 4373–4382.
  • Zhang et al. (2016b) Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; and Ma, Y. 2016b. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 589–597.
  • Zhu et al. (2017) Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, 2223–2232.
  • Zitouni et al. (2016) Zitouni, M. S.; Bhaskar, H.; Dias, J.; and Al-Mualla, M. E. 2016. Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques. Neurocomputing 186: 139–159.