Neuron Linear Transformation: Modeling the Domain Shift for Crowd Counting

04/05/2020 ∙ by Qi Wang, et al.

Cross-domain crowd counting (CDCC) is a hot topic due to its importance in public safety. The purpose of CDCC is to reduce the domain shift between the source and target domains. Recently, typical methods attempt to extract domain-invariant features via image translation and adversarial learning. When it comes to specific tasks, we find that the task gap ultimately manifests in the parameters of the model, and the domain shift can be represented clearly by the differences in model weights. To describe the domain gap directly at the parameter level, we propose a Neuron Linear Transformation (NLT) method, which learns the shift at the neuron level and then transfers the source model to the target model. Specifically, for a given neuron of a source model, NLT exploits a small amount of labeled target data to learn a group of parameters, which update the target neuron via a linear transformation. Extensive experiments and analysis on six real-world datasets validate that NLT achieves top performance compared with other domain adaptation methods. An ablation study also shows that NLT is robust and more effective than supervised training and fine-tuning. Furthermore, we will release the code after the paper is accepted.


I Introduction

Currently, crowd understanding is playing an increasingly important role in building an intelligent society. As a large research field, it involves many hot topics. In scenes with sparse crowd distribution, crowd understanding mainly includes crowd detection [19, 25], crowd behavior analysis [50, 55], crowd segmentation [15, 5], and crowd tracking [13, 31]. In scenes with high density, such as an image containing thousands of people, crowd understanding mainly focuses on counting and density estimation [18, 43, 27, 46, 21]. In this paper, we work on the crowd counting problem.

Fig. 1: The domain shift from different views. 1) Visual domain shift, such as brightness, background, character features, etc. 2) When it comes to specific tasks, the domain shift is reflected in the model’s parameter distribution.

Crowd counting, which generates a pixel-level density map and sums all of its pixels to predict how many people are in an image, has become a popular task due to its widespread practical applications: public management, traffic flow prediction, scene understanding [55, 61], video analysis [59, 60], etc. Specifically, it can be used for public safety in many situations, such as political rallies and sports events [39]. Besides, density estimation can also help locate people in some sparse scenes [20]. In traditional supervised learning, many excellent algorithms [24, 43, 32, 62, 49, 48] have continually improved the counting metrics on the existing datasets from different angles.

However, traditional supervised learning requires a large amount of labeled data, and unfortunately, pixel-level annotation is often costly. According to statistics, completing the QNRF dataset [20] took 2,000 human-hours. For the recently established NWPU dataset [51], the time cost is as high as 3,000 human-hours. Even though researchers invest a lot of time and money in building datasets, the existing datasets are still limited in scale.

Because of the small scale of some existing datasets, the above models may suffer from overfitting to different extents, and their performance drops significantly when they are applied in real life. Thus, Cross-Domain Crowd Counting (CDCC), which focuses on improving performance in the target domain by using data from the source domain, has attracted researchers’ attention. Wang et al. [52] propose a domain adaptation method for crowd counting, SE CycleGAN, which translates synthetic data to photo-realistic scenes and then applies the trained model in the wild. Gao et al. [11] present a high-quality image translation method based on feature disentanglement. Other works [14, 12] adopt adversarial learning to extract domain-invariant features in the source and target domains. In short, general Unsupervised Domain Adaptation (UDA) methods concentrate on image style and feature similarity. The upper box in Fig. 1 demonstrates the appearance differences.

Nevertheless, reducing the domain shift at the image and feature levels is not tailored to the counting task: this strategy does not directly affect the counting performance, and it is not optimal. For example, SE CycleGAN [52] and DACC [11] focus on maintaining local consistency to improve the translation quality in congested regions. When the model is applied to sparse scenes (Mall [6], UCSD [4]), this loss may be redundant. In other words, there are task gaps in the existing UDA-style methods. Besides, since the target labels are unseen by UDA models, they do not work well: they produce coarse predictions in congested regions and estimation errors in the background.

Given a specific task, we find that the domain shift is reflected in the parameters of models trained on different domains. Specifically, we train the model with synthetic data and with real-scene data separately, and then calculate the average value of each kernel in a specific layer. The lower box in Fig. 1 reports the distribution histogram. It can be seen intuitively that the parameters trained on both datasets follow a Gaussian distribution, and the difference lies in their mean and variance. Thus, we conclude that the domain shift between different datasets can be measured by the parameter distribution of the specific model.
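The per-kernel statistics described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `kernel_means` and `describe`, the flat-list representation of a kernel, and the toy weight values are hypothetical and are not taken from the paper.

```python
import statistics


def kernel_means(layer):
    """Average value of each kernel.

    `layer` is a list of kernels; each kernel is a flat list of weights
    (for example 9 values for a 3x3 kernel). This layout is an assumption.
    """
    return [sum(kernel) / len(kernel) for kernel in layer]


def describe(layer):
    """Mean and (population) variance of the per-kernel averages of a layer."""
    means = kernel_means(layer)
    return statistics.mean(means), statistics.pvariance(means)


# Hypothetical weights of the same layer after training on two domains.
synthetic_layer = [[0.1] * 9, [0.3] * 9, [0.2] * 9]
real_layer = [[0.4] * 9, [0.8] * 9, [0.6] * 9]

syn_mean, syn_var = describe(synthetic_layer)
real_mean, real_var = describe(real_layer)
print("synthetic:", syn_mean, syn_var)
print("real:     ", real_mean, real_var)
```

Comparing the two (mean, variance) pairs is one simple way to summarize how far the two parameter distributions are apart.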

Based on the above observation, these parameter-level differences can be modeled by a linear transformation. Thus, this paper proposes a Neuron Linear Transformation (NLT) method to handle cross-domain crowd counting. Specifically, we first train a source model using traditional supervised learning. We then exploit a small amount of labeled target data to learn two matrices (a product factor and a bias) for each source neuron. Finally, the source neurons are updated by a linear transformation; the results are treated as target neurons and applied to the target data. The entire process is shown in Fig. 2.

Fig. 2: The flowchart of our proposed Neuron Linear Transformation (NLT), which consists of three components: 1) The source model is trained with the synthetic data. 2) The domain factor and domain bias parameters in NLT are defined to model the domain shift. 3) After the transferred parameters are loaded into the target model, the few-shot data are fed into the target model to update the domain shift parameters.

In summary, the main contributions of this paper are:

  • Propose a novel Neuron Linear Transformation (NLT) method to model the domain shift. It is the first time that the domain shift is measured at the parameter level.

  • Exploit few-shot target data to approximate the real domain shift, which significantly reduces the annotation cost.

  • Outperform the traditional methods on six real-world crowd counting datasets when facing the same problem. The experiments also show that NLT has higher practical value than UDA methods.

II Related Work

In this section, we briefly review the relevant works on three tasks: supervised crowd counting, cross-domain crowd counting with synthetic data, and few-shot learning.

Supervised Crowd Counting. In recent years, supervised crowd counting algorithms have mostly focused on scale variability. From the scale-aware perspective, Zhang et al. [58] propose a three-column network with different kernels for scale perception in 2016. López-Sastre et al. [35] introduce a HydraCN with three columns, where each column is fed by a patch from the same image at a different scale. Two years later, Wu et al. [54] develop a powerful multi-column scale-aware CNN with an adaptation module to fuse the sparse and congested columns. In the same year, AFP [22] generates a density map by fusing the attention map and the intermediate density map in each column. ic-CNN [36] generates a high-resolution density map by passing the features and predicted map from the low-resolution CNN to the high-resolution CNN. Last year, Hossain et al. [17] employ a scale-aware attention network, where each column is weighted with the output of a global scale attention network and a local scale attention network. Apart from multi-column scale-aware architectures, single-column scale-aware CNNs, such as SANet [2] and SaCNN [57], generally perform better in recent research. To combine multi-column and single-column scale-aware CNNs, CSRNet [26], CAN [30] and FPNCC [3] develop models that contain multiple paths only in several parts of the network.

From the context-aware perspective, CP-CNN [42] designs a global context estimator and a local context estimator to classify the density level of the full image and of its patches, respectively. Switching-CNN [40] employs an extra column CNN to deliver the best performance for a given patch. DRSAN [29] designs a module named Recurrent Spatial-Aware Refinement (RSAR) to refine the density map. In 2019, RAZ-Net [28] divides the training phase into two steps: first, a main CNN is trained as a typical density map regressor with an extra column that proposes a region to zoom into; then, another CNN is trained to recurrently refine the proposed zooming regions. Meanwhile, SAAN [17] designs three types of CNN, namely Multi-scale Feature Extractor (MFE), Global Scale Attention (GSA) and Local Scale Attention (LSA), which explore the local context to improve the counting performance.

Cross-domain Crowd Counting. In addition to the exploration mentioned above, a new research topic called cross-domain crowd counting is beginning to attract researchers' interest. This task aims to transfer what the model learns from one dataset to another unseen dataset. One of the earliest studies is by Wang et al. [52], who establish a large-scale synthetic dataset to pre-train a model, which improves the robustness on real-world datasets after fine-tuning. Besides fine-tuning, they also train a counter without using any real-world labeled data, by using CycleGAN [63] and SE CycleGAN [52] to generate realistic images. Recently, several efforts have followed this work. DACC [11], a domain adaptation method based on image translation and Gaussian-prior reconstruction, achieves new state-of-the-art results on several mainstream datasets. At the same time, some works [12, 14] extract domain-invariant features based on adversarial learning. Experimental results show that these methods can narrow the domain shift to some extent.

Overall, the current research on learning from synthetic data for crowd counting is still in its infancy. However, the intersection of synthetic data and real-world data proves to be particularly fertile ground for new ideas, and we firmly believe that this field will become more significant over time.

Few-shot Learning. Since our cross-domain crowd counting method involves a small number of target domain samples, we introduce some studies related to few-shot learning. Few-shot learning builds on prior experience with very similar tasks for which large-scale training sets are available, and then trains a deep learning model using only a few training examples. Early few-shot learning methods [1, 7, 8] are based on hand-crafted features. Vinyals et al. [47] use a memory component in a neural network to learn a common representation from very little data. Snell et al. [44] propose Prototypical Networks, which map examples into a vector space. Ravi and Larochelle [37] use an LSTM-based meta-learner to learn an update rule for training a neural network learner. Model-Agnostic Meta-Learning (MAML) [9] learns a model parameter initialization that generalizes better to similar tasks. Similar to MAML, Reptile [34] executes stochastic gradient descent for a number of iterations on a given task, and then gradually moves the initialization weights in the direction of the weights obtained after those iterations. Santoro et al. [41] propose Memory-Augmented Neural Networks (MANNs) to memorize information about previous tasks and leverage it to learn a learner for new tasks. SNAIL [33] is a generic meta-learner architecture that learns a common feature vector for the training images to aggregate information from past experiences. Most of the above few-shot learning methods target classification tasks. For crowd counting, [16] proposes a one-shot learning approach for adapting to a target scene using one labeled example, and [38] applies MAML [9] to learn scene-adaptive crowd counting with few-shot learning.

III Approach

This section describes the detailed methodology for cross-domain crowd counting. First, we define the problem that we want to solve. Then, NLT, a linear operation at the neuron level, is designed to model the domain shift. Finally, we introduce how to integrate NLT into the transformation from the source model to the target model. Fig. 2 illustrates the entire framework.

III-A Problem Setup

In this paper, we tackle the existing problems of domain-adaptive crowd counting at the parameter level with a transformation. The setting assumes access to a source domain (synthetic data) with labeled crowd images. Besides, a target domain (real-scene data) provides few-shot images with labeled density maps. The purpose is to train a source domain model on the source data, and to learn a representable domain shift from the few-shot target data, which is parameterized by the domain factors and the domain biases. Finally, a well-performing target model is generated by combining the source model parameters with the domain shift parameters.

III-B Neuron Linear Transformation

Inspired by the neuron-level scale and shift operation [45], we propose a Neuron Linear Transformation (NLT) method to describe the domain gap, which makes the domain gap clearly visible. To model the domain shift, we assume that the source model and the target model belong to the same linear space. Each neuron in the target model can then be obtained from the corresponding neuron in the source model by a linear transformation.

This domain adaptation method has two advantages: 1) The target model inherits the good feature extraction ability of the source model and preserves its generalization. 2) Compared with fine-tuning all parameters of the target model, only a few parameters need to be optimized in the target model with NLT, which reduces the probability of overfitting for few-shot learning in the target domain. For each source domain neuron, we define a corresponding domain factor and domain bias. The neuron-level linear transformation then multiplies the source neuron by its domain factor and adds its domain bias, namely,

(1)
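As a minimal sketch of the neuron-level linear transformation, the snippet below scales each source neuron by a factor and adds a bias. The function name `neuron_linear_transform`, the flat-list representation of a neuron, and the use of one scalar factor and one scalar bias per neuron are assumptions made for illustration; the paper's exact parameter shapes are not recoverable from the text here.

```python
def neuron_linear_transform(source_neurons, factors, biases):
    """Return target neurons computed as factor * source_weight + bias.

    Assumption: each neuron is a flat list of weights, and every neuron has
    one scalar factor and one scalar bias shared by all of its weights.
    """
    return [
        [f * w + b for w in neuron]
        for neuron, f, b in zip(source_neurons, factors, biases)
    ]


# Two toy source neurons with three weights each (illustrative values).
source = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]

# Factor 1 and bias 0 leave the source neurons unchanged.
identity = neuron_linear_transform(source, [1.0, 1.0], [0.0, 0.0])

# Any other factor/bias pair shifts the neurons towards a different model.
shifted = neuron_linear_transform(source, [2.0, 0.5], [0.1, -0.2])
print(identity)
print(shifted)
```

With a factor of 1 and a bias of 0 the target model equals the source model, so learning can start from the source solution and only move as far as the target data require.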

III-C Modeling the Domain Shift

In this section, we introduce how to use Neuron Linear Transformation (NLT) to model domain shift from the source domain to the target domain.

First, we introduce the architecture of the model. The source domain model can be any crowd counting model. However, for a fair comparison, a simple encoder-decoder structure is designed following the previous works [52, 14, 11, 12]. As shown in Fig. 2, the first four layers of VGG-16 are adopted as the backbone in the encoder stage, so the output feature map is 1/8 the size of the input image. In the decoder stage, a 3x3 convolutional layer is used to halve the feature channels, and then an up-sampling layer is followed by a 3x3 convolutional layer to reduce the channels. After three repetitions, a 1x1 convolutional layer outputs the predicted density map. The training of the source domain model is similar to that of a traditional supervised crowd counting network, except that the training data come from the synthetic dataset GCC. The source model parameters are optimized by gradient descent as follows,

(2)

where the loss is a standard MSE loss on the source model's predictions for the training data, averaged over a batch of source images, and the step size is the learning rate of the source model.

Second, we introduce how to embed NLT into the training of the target model. As shown in Fig. 2, the target model keeps the same architecture as the source model, but the number of parameters involved in training is different. The parameters in the target model are transferred from the source model, and the goal of the transformation is to make up for the task gap. Expressing this transformation mathematically is what we regard as modeling the domain shift. Specifically, we model the domain shift by transferring all neurons in the source model to the target model with the proposed NLT. As a result, in the target model, we define two groups of additional parameters, the domain factors and the domain biases, to achieve the model-level linear transformation. Each group contains one entry per neuron of the source model. According to Equ. (1), the mapping can be expressed as follows,

(3)

where the domain shift factor is initialized to 1 and the domain shift bias is initialized to 0.

Fig. 3: The effects of our NLT and other training methods on learning process and performance. (a)(b)(c) and (d)(e)(f) show the validation loss and performance on Shanghai Tech Part A and B dataset, respectively.

                                Shanghai Tech Part A           Shanghai Tech Part B
Method               DA  FS     MAE    MSE    PSNR   SSIM      MAE   MSE   PSNR   SSIM
NoAdpt                          188.0  279.6  20.91  0.670     20.1  29.2  26.62  0.895
Supervised                      107.2  165.9  21.53  0.623     16.0  26.7  26.8   0.932
Fine-tuning                     105.7  167.6  21.72  0.702     13.8  22.3  27.0   0.931
NLT (ours)                      93.8   157.2  21.89  0.729     11.8  19.2  27.58  0.937
IFS [11]+NLT (ours)             90.1   151.6  22.01  0.741     10.8  18.3  27.69  0.932
TABLE I: The performance of different training methods on Shanghai Tech Part A and Shanghai Tech Part B.

Since we introduce learnable parameters to describe the task gap in the target model, some labeled target domain images are needed to learn them. However, in keeping with the domain adaptation setting, we only use a small amount of data to support the training. The source model parameters are learned in the update phase of the source model, but they are frozen when the target model is updated. After the calculation of Equ. 3, the transformed parameters participate in the feedforward pass of the target model. Therefore, only the gradients of the domain factors and domain biases need to be calculated during backpropagation; that is, only these two groups of parameters are learned in the target model. Since VGG-16 uses 3x3 convolution kernels, the updated parameters in the target model amount to only a fraction of the source model parameters. The loss for optimizing these parameters is defined as follows,

(4)

where the former term is the density estimation loss on the few-shot data, which has the same form as the loss of the source model and is computed from the predicted density map of each input image and its ground-truth density map. The latter term is the L2 regularization loss of the domain factors and domain biases, whose purpose is to prevent overfitting in the target domain; it is scaled by a weighting parameter. Finally, the target model is optimized as follows,

(5)

where the step size is the learning rate of the target model.
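The target-side loss and update can be illustrated on a toy problem with a single scalar weight. This is a sketch under assumptions, not the paper's implementation: the function `target_loss_and_grads`, the scalar model `y = theta_t * x`, the plain squared L2 penalty on the factor and bias, the plain gradient descent step, and all numeric values are hypothetical.

```python
def target_loss_and_grads(theta_s, f, b, data, lam):
    """MSE on few-shot pairs plus an L2 penalty on f and b.

    Returns the loss and its gradients with respect to f and b only;
    the source weight theta_s is treated as a frozen constant.
    """
    theta_t = f * theta_s + b                      # transformed target weight
    n = len(data)
    mse = sum((theta_t * x - y) ** 2 for x, y in data) / n
    loss = mse + lam * (f ** 2 + b ** 2)
    d_theta_t = sum(2 * (theta_t * x - y) * x for x, y in data) / n
    grad_f = d_theta_t * theta_s + 2 * lam * f     # chain rule through f * theta_s
    grad_b = d_theta_t + 2 * lam * b               # chain rule through + b
    return loss, grad_f, grad_b


theta_s = 2.0                                      # frozen source weight
f, b = 1.0, 0.0                                    # factor starts at 1, bias at 0
few_shot = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]    # toy target pairs, y = 3x
lr, lam = 0.01, 1e-4

initial_loss, _, _ = target_loss_and_grads(theta_s, f, b, few_shot, lam)
for _ in range(200):
    _, grad_f, grad_b = target_loss_and_grads(theta_s, f, b, few_shot, lam)
    f -= lr * grad_f
    b -= lr * grad_b
final_loss, _, _ = target_loss_and_grads(theta_s, f, b, few_shot, lam)
print(initial_loss, final_loss, f * theta_s + b)
```

Only `f` and `b` change during the loop, so the number of trainable quantities stays small even though the effective target weight moves from the source value towards the value that fits the few-shot data.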

IV Implementation Details

Executive Stream. In the training phase, the workflow follows steps ① to ⑥ in Fig. 2, and each iteration updates the parameters of two models. First, the source model parameters are updated on a batch sampled from the GCC data (steps ① to ③). Second, the domain shift parameters are updated with the few-shot data provided in the target domain (steps ④ to ⑥). Finally, the parameters of the target model are obtained by NLT, as shown in Equ. 3. In the validation phase, we split off a validation set for each target domain from its training data. In the testing phase, we use the model that performs best on the validation set for inference.
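The alternating update described above can be sketched on a scalar toy model. Everything here is an assumption for illustration: the helper `mse_grad`, the one-weight model, the toy batches, the learning rates, and the use of plain gradient descent in place of the optimizer used in the paper. The regularization term is omitted to keep the sketch short.

```python
def mse_grad(theta, batch):
    """Gradient of the mean squared error of y ~ theta * x with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)


source_batch = [(1.0, 2.0), (2.0, 4.0)]      # toy synthetic pairs, y = 2x
few_shot_batch = [(1.0, 3.0), (2.0, 6.0)]    # toy target pairs, y = 3x

theta_s = 0.0            # source weight, learned on synthetic data
f, b = 1.0, 0.0          # domain factor and domain bias
lr_source, lr_target = 0.05, 0.02

for _ in range(300):
    # Source update: fit the source weight on the synthetic batch.
    theta_s -= lr_source * mse_grad(theta_s, source_batch)

    # Target update: theta_s is frozen; only f and b receive gradients.
    g = mse_grad(f * theta_s + b, few_shot_batch)
    f, b = f - lr_target * g * theta_s, b - lr_target * g

# Target weight obtained by the linear transformation of the source weight.
theta_t = f * theta_s + b
print(theta_s, theta_t)
```

In this toy run the source weight settles on the synthetic relation while the transformed weight settles on the target relation, which mirrors the two-model update in each iteration.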

Parameter Setting. In each iteration, we input synthetic images and target few-shot images. The Adam algorithm [23] is used to optimize the networks. The learning rate for the source model in Equ. 2 is set as , and the learning rate for the target model in Equ. 5 is initialized as . The parameter for the target model loss function in Equ. 4 is fixed to . Our code is developed based on the Framework [10] on NVIDIA GTX Ti GPU.

Scene Regularization. In other fields of domain adaptation, such as semantic segmentation, the object distribution in street scenes is highly consistent. In contrast, current real-world crowd datasets differ greatly in density range: for example, the count in the Mall [6] dataset ranges from to , whereas the count in the GCC [52] dataset ranges from to . To avoid negative adaptation caused by the different density ranges, we adopt the scene regularization strategy proposed by [52] and [12]. In other words, we add filter conditions to select proper synthetic images from GCC as the source domain data for each real-world dataset.
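One way to picture such a filter condition is a simple count-range selection over image metadata. This is a hypothetical sketch: the function `select_source_images`, the `name` and `count` fields, and all numbers below are invented for illustration, and the actual filter conditions in [52] and [12] may use other attributes.

```python
def select_source_images(images, min_count, max_count):
    """Keep synthetic images whose crowd count lies in [min_count, max_count]."""
    return [img for img in images if min_count <= img["count"] <= max_count]


# Hypothetical metadata for a few synthetic images.
synthetic_images = [
    {"name": "scene_a", "count": 12},
    {"name": "scene_b", "count": 45},
    {"name": "scene_c", "count": 900},
    {"name": "scene_d", "count": 3200},
]

# Illustrative ranges for a sparse and a congested target dataset.
sparse_subset = select_source_images(synthetic_images, 10, 60)
dense_subset = select_source_images(synthetic_images, 500, 4000)
print([img["name"] for img in sparse_subset])
print([img["name"] for img in dense_subset])
```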

Fig. 4: The effects of the domain shift parameters (domain factor and domain bias). (a)(b)(c) and (d)(e)(f) show the validation loss and performance on Shanghai Tech Part A and B dataset, respectively.

                                Shanghai Tech Part A           Shanghai Tech Part B
Method               DA  FS     MAE    MSE    PSNR   SSIM      MAE   MSE   PSNR   SSIM
NoAdpt                          188.0  279.6  20.91  0.670     20.1  29.2  26.62  0.895
Fine-tuning                     105.7  167.6  21.72  0.702     13.8  22.3  27.0   0.931
Factor only                     109.2  161.3  21.49  0.758     13.5  23.5  27.26  0.921
Bias only                       107.8  169.9  21.14  0.796     12.8  20.6  27.17  0.916
NLT (factor and bias)           93.8   157.2  21.89  0.729     11.8  19.2  27.58  0.937
TABLE II: The effectiveness of the domain shift parameters (domain factor and domain bias) on the testing set of Shanghai Tech Part A and B.

V Experiments

In this section, we first report the experimental evaluation metrics and the selected datasets, and then a comprehensive ablation study is performed to illustrate the effectiveness of our proposed method. Next, we analyze the shifting phenomenon of different real-world datasets and synthetic dataset from the perspective of statistics. In addition, we also discuss the effect of selected few-shot data on performance improvement. Finally, we present the testing results and visualization results of our method in six real-world datasets.

V-A Evaluation Criteria

Counting Error. Following the evaluation criteria widely used in crowd counting, the counting error is reflected in two metrics, namely Mean Absolute Error (MAE) and Mean Square Error (MSE). MAE measures the average magnitude of the prediction error, while MSE measures the robustness of the model to outliers. For both, lower is better. They are defined as follows:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|^{2}} \qquad (6)$$

where $N$ is the number of images to be tested, and $y_i$ and $\hat{y}_i$ are the ground truth and estimated number of people for the $i$-th sample; the estimated number is obtained by summing all the pixel values in the density map.
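A small sketch of the two counting metrics follows. It assumes the convention common in crowd counting of reporting MSE as the square root of the mean squared count error; the helper names `count_from_density` and `mae_mse` and the toy values are hypothetical.

```python
import math


def count_from_density(density_map):
    """Estimated count: the sum of all pixel values of a density map."""
    return sum(sum(row) for row in density_map)


def mae_mse(gt_counts, pred_counts):
    """Return (MAE, MSE) over paired ground-truth and predicted counts.

    MSE here is the root of the mean squared error (assumed convention).
    """
    n = len(gt_counts)
    errors = [pred - gt for gt, pred in zip(gt_counts, pred_counts)]
    mae = sum(abs(e) for e in errors) / n
    mse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, mse


# Toy density map whose pixels sum to about 3 people.
toy_map = [[0.0, 0.1, 0.2, 0.0],
           [0.3, 0.9, 0.4, 0.1],
           [0.0, 0.5, 0.4, 0.1]]
toy_count = count_from_density(toy_map)

gt = [10.0, 20.0, 30.0]
pred = [12.0, 17.0, 30.0]
mae, mse = mae_mse(gt, pred)
print(toy_count, mae, mse)
```

Because MSE squares each error before averaging, a single badly mispredicted image raises it much more than it raises MAE.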

Density Map Quality. To further evaluate the predictive quality of the model, we also calculate PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity in Image) [53]. For these two metrics, the larger the value, the higher the quality of the predicted density map.
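For reference, PSNR between a ground-truth and a predicted density map can be computed as in the sketch below. The function name `psnr`, the nested-list map layout, and passing the peak value explicitly are assumptions; the paper does not state how the peak value or any normalization is chosen, and SSIM is not covered here.

```python
import math


def psnr(reference, estimate, peak):
    """Peak Signal-to-Noise Ratio in dB between two equally sized maps.

    `peak` is the maximum possible signal value; how it is chosen for
    density maps is an assumption left to the caller.
    """
    flat_ref = [v for row in reference for v in row]
    flat_est = [v for row in estimate for v in row]
    mse = sum((r - e) ** 2 for r, e in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf          # identical maps
    return 10 * math.log10(peak ** 2 / mse)


reference_map = [[0.0, 1.0], [1.0, 0.0]]
estimated_map = [[0.0, 0.9], [1.0, 0.1]]
score = psnr(reference_map, estimated_map, peak=1.0)
print(score)
```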

                                Shanghai Tech Part A           Shanghai Tech Part B           UCF-QNRF
Method               DA  FS     MAE    MSE    PSNR   SSIM      MAE   MSE   PSNR   SSIM      MAE    MSE    PSNR   SSIM
CycleGAN [63]                   143.3  204.3  19.27  0.379     25.4  39.7  24.60  0.763     257.3  400.6  20.80  0.480
SE CycleGAN [52]                123.4  193.4  18.61  0.407     19.9  28.3  24.78  0.765     230.4  384.5  21.03  0.660
FA [12]                         -      -      -      -         16.0  24.7  -      -         -      -      -      -
FSC [14]                        129.3  187.6  21.58  0.513     16.9  24.7  26.20  0.818     221.2  390.2  23.10  0.7084
IFS [11]                        112.4  176.9  21.94  0.502     13.1  19.4  28.03  0.888     211.7  357.9  21.94  0.687
NoAdpt (ours)                   188.0  279.6  20.91  0.670     20.1  29.2  26.62  0.895     276.8  453.7  22.22  0.692
NLT (ours)                      93.8   157.2  21.89  0.729     11.8  19.2  27.58  0.937     172.3  307.1  22.81  0.729
IFS [11]+NLT (ours)             90.1   151.6  22.01  0.741     10.8  18.3  27.69  0.932     157.2  263.1  23.01  0.744

                                WorldExpo’10 (only MAE)                UCSD                           MALL
Method               DA  FS     S1    S2    S3    S4    S5    Avg.     MAE    MSE    PSNR   SSIM      MAE   MSE   PSNR   SSIM
CycleGAN [63]                   4.4   69.6  49.9  29.2  9.0   32.4     -      -      -      -         -     -     -      -
SE CycleGAN [52]                4.3   59.1  43.7  17.0  7.6   26.3     -      -      -      -         -     -     -      -
FA [12]                         5.7   59.9  19.7  14.5  8.1   21.6     2.0    2.43   -      -         2.47  3.25  -      -
IFS [11]                        4.5   33.6  14.1  30.4  4.4   17.4     1.76   2.09   24.42  0.950     2.31  2.96  25.54  0.933
NoAdpt (ours)                   5.0   89.9  63.1  20.8  17.1  39.2     12.79  13.22  23.94  0.899     6.20  6.96  24.65  0.879
NLT (ours)                      2.3   22.8  16.7  19.7  3.9   13.1     1.58   1.97   25.29  0.942     1.96  2.55  26.92  0.967
IFS [11]+NLT (ours)             2.0   15.3  14.7  18.8  3.4   10.8     1.48   1.81   25.58  0.965     1.86  2.39  27.03  0.944
TABLE III: The performance of other domain adaptation (DA) methods and the proposed NLT on the six real-world datasets. FS refers to few-shot data from the target domain.

V-B Datasets

The synthetic dataset GCC [52] is the only source domain. As for the target domain, to ensure the sufficiency of our experiments, we select two datasets each from the high-density, medium-density and low-density groups, six datasets in total, namely UCF-QNRF [20], Shanghai Tech Part A [58], Shanghai Tech Part B [58], WorldExpo’10 [56], Mall [6] and UCSD [4].

Source Domain Dataset. GCC is a large-scale synthetic dataset, which is sampled from virtual scenes by a computer mod. It contains of accurately annotated images with a total of instances. There is an average of people in each image.

Congested Crowd Dataset. UCF-QNRF is collected from a shared image website. Therefore, the dataset contains a variety of scenes. It consists of images ( training and testing images), with annotated instances. The average number of people is 815 per image. Shanghai Tech Part A is also randomly collected from the Internet with different scenarios. It consists of images ( training and testing images) with different resolutions. The average number of people in an image is .

Moderate Crowd Dataset. Shanghai Tech Part B is captured from the surveillance camera on the Nanjing Road in Shanghai, China. It contains samples ( training and testing images). The scenes are relatively uniform, with an average of people per picture. WorldExpo’10 consists of labeled images, which are collected from surveillance scenes ( scenes for training and the remaining scenes for testing) in Shanghai WorldExpo. The average number of people is 50 per image.

Sparse Crowd Dataset. Mall is captured from a surveillance camera installed in a shopping mall, which records the ( for training and for testing) sequential frames. The average number of people in each image is . UCSD consists of frames (frames for training and the others for testing) collected from a single-scene surveillance video. The average number of pedestrians in each image is .

Fig. 5: Exemplar results of adaptation from GCC to Shanghai Tech Part A/B dataset. Row 1 and 2 come from Shanghai Tech Part A, and others are from Part B.
Fig. 6: More visual samples of adaptation from GCC to other four real-world datasets with our proposed NLT.

V-C Ablation Study

We present our ablation experiments from two perspectives. First, regarding the few-shot data, we demonstrate the impact of different training methods. Second, for the proposed NLT, we discuss the effects of the domain factor and the domain bias on modeling the domain shift. The following experiments are conducted on the Shanghai Tech Part A and B datasets, and in both cases the few-shot data are a subset of the training set.

Compared with Other Training Methods. Five training methods are used to demonstrate the role of few-shot data in narrowing the domain gap. The specific settings are as follows:

  • NoAdpt. Train the model on the GCC dataset.

  • Supervised. Train the model on few-shot data.

  • Fine-tuning. Train the model on the GCC dataset and fine-tune it with few-shot data.

  • NLT (ours). Train the model from GCC to the real-world dataset with our NLT and training strategy.

  • IFS+NLT (ours). Replace the original GCC data with the IFS [11] translated GCC [52] data in the previous setting.

As shown in Fig. 3, we draw the loss and performance curves on the validation set during training. Taking the Shanghai Tech Part A dataset as an example, it is difficult to reduce the validation loss without domain adaptation. Supervised training and fine-tuning with few-shot data can significantly reduce the loss, but they easily suffer from overfitting. Compared with supervised training and fine-tuning, our NLT reaches a lower validation loss and inhibits overfitting. In Fig. 3 (b) and (c), the MAE and MSE curves also illustrate the effectiveness of NLT. Similarly, in Fig. 3 (d), (e) and (f), Shanghai Tech Part B shows the same trend as Shanghai Tech Part A, which indicates that our method is suitable for both dense and sparse scenes.

Table I shows the results on the testing set. The results without adaptation are unsatisfactory, which confirms the large gap between real scenes and synthetic data mentioned in the introduction. As shown in lines 4 and 5, both fully supervised training and fine-tuning a GCC pre-trained model with few-shot data yield better results than no domain adaptation, which shows that few-shot data have a significant effect on narrowing the domain gap. Comparing lines 4 and 5, fine-tuning the pre-trained GCC model achieves better results than full supervision, which indicates that the synthetic data play a vital role in improving the generalization of the model. Comparing lines 5 and 6, the proposed NLT and training strategy outperform the fine-tuning operation. Taking MAE as an example, on Shanghai Tech Part A, NLT reduces the counting error by 13.1% compared with fine-tuning; on Shanghai Tech Part B, the reduction is 16.1%. In addition, we also test NLT on the images stylized by IFS [11], and the results improve further. This is only a simple validation, and our later experiments are still based on the original GCC. In conclusion, our method is a win-win approach: the few-shot data drive the training of NLT to narrow the domain gap, while NLT maximizes the potential of the few-shot data.

The Influence of Domain Shift Parameters. In our NLT, two groups of parameters are set to learn the transformation of neurons, namely the domain factor and the domain bias. To verify the validity and compatibility of these parameters, we conduct three experiments to show their effects on modeling the domain shift: using only the factor, using only the bias, and using both of them to learn the domain shift. The details of the experiments are shown in Fig. 4.

The red curves represent training without domain adaptation. At the beginning of training, the validation loss decreases, but as training goes on, it keeps rising. The reason is that a model trained with synthetic data has a limited ability to fit real data; once this limit is passed, the model continuously drifts away from the target domain. The blue and green curves show the effectiveness of the domain factor and the domain bias, respectively; both of them can greatly reduce the loss and improve performance. It is worth noting that the factor is not prone to overfitting but converges slowly, while the bias converges faster but overfits easily. When the two are used together, they complement each other and perform best.

The results on the test set are shown in Table II. The learnable factor and bias parameters each amount to only a fraction of the source model parameters, whereas the fine-tuning operation updates all parameters of the source model. On Shanghai Tech Part A, for example, only part of the training set is treated as few-shot data, yet the factor and the bias each achieve results similar to fine-tuning. This indicates that it is effective to use the factor and the bias to represent the domain shift. The best results are achieved when the two are combined to form the Neuron Linear Transformation (NLT) for modeling the domain shift.

V-D Adaptation Results on Real-world Datasets

In this section, we test the performance of the NLT by using it to learn the domain shift from GCC to six real-world datasets and compare it with the other domain adaptation methods.

Metrics Report. Table III lists the results for the four metrics (MAE/MSE/PSNR/SSIM). Compared with the image translation methods (CycleGAN [63], SE CycleGAN [52], and IFS [11]) and the feature adversarial learning methods (FA [12] and FSC [14]), our method performs better by using annotated data in the target domain. Taking MAE as an example, as the lavender row shows, NLT reduces the counting error by 16.5%, 10.0%, 18.6%, 29.9%, 10.0%, and 15.2% on the six real-world datasets, respectively, compared with the above methods. On PSNR and SSIM, which reflect the quality of the generated density map, we also achieve a significant improvement, indicating that introducing few-shot data from the target domain is important for suppressing noise in the background region. The experiments on datasets with different densities also demonstrate the universality of NLT for cross-domain counting tasks.
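For readers who want to reproduce the counting metrics, the following is a minimal sketch. It assumes the convention common in crowd counting, where MAE is the mean absolute count error and "MSE" is the root of the mean squared count error; the authoritative definitions are those given in Section V-A. The function name and the sample counts are hypothetical.

```python
import math


def counting_errors(predicted, ground_truth):
    """Return (MAE, MSE) over per-image crowd counts.

    Assumption: MSE is the root of the mean squared error, as is
    customary in crowd counting; check Section V-A for the exact form.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - g for p, g in zip(predicted, ground_truth)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, mse


# Hypothetical per-image counts, for illustration only.
mae, mse = counting_errors([10.0, 20.0], [12.0, 16.0])
print(mae, mse)
```

Because the squared term penalizes large errors more heavily, MSE is usually read as a robustness indicator, while MAE reflects average accuracy.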

Furthermore, we also discuss the combination of NLT with other domain adaptation methods. Here, we translate the GCC [52] dataset into a realistic style by using IFS [11], the best image translation method currently known for cross-domain crowd counting. The translated images are then treated as source domain data, and the proposed NLT is applied to achieve the domain adaptation. The final test results on the six real-world datasets are shown in the light cyan row of Table III. Compared with the original IFS [11], NLT decreases the MAE by 19.8%, 17.6%, 25.7%, 37.9%, 16.0%, and 19.5% on the six real-world datasets, respectively.

Visualization Results. Fig. 5 shows the visualization results of no adaptation and of the proposed NLT. Column 3 shows the results without domain adaptation. The regression results are not acceptable in a congested scene like Shanghai Tech Part A, especially for the gray-scale image in Row 2. On Shanghai Tech Part B, the counting results without domain adaptation are fairly close to the ground truth, but problems remain in the details and the background. For example, the red box in Row 3 shows that the regression values are still weak even though the overall trend is correct. Besides, estimation errors in the background region also hurt performance, as the red box in Row 4 shows. After domain adaptation, these problems are alleviated. In general, NLT improves the density map in both counting values and details, which reflects the effectiveness of using NLT to learn the domain shift.

Fig. 7: The averages of domain factor and domain bias in each layer of the network.

To demonstrate the effect of our domain adaptation more intuitively, we show more results in Fig. 6. To save space, we only report the ground truths and the prediction results. The performance on the different datasets shows that NLT is effective for cross-domain counting tasks with different crowding levels.

V-E Statistical Analysis of Domain Shift

The domain factor and the domain bias are defined as the parameters that model the shift from the source domain to the target domain, and they are initialized to 1 and 0, respectively. Driven by few-shot data, they are updated to narrow the domain gap. In Section V-C, we verified their effectiveness through task performance. In this section, we further analyze the significance of these parameters from the perspective of mathematical statistics.
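The initialization described above means that the transformation starts as the identity, so the target model initially equals the source model. A minimal sketch follows, assuming one scalar factor and one scalar bias per kernel applied as w_t = f * w_s + b; the exact form is the one defined in the method section, and all names and weights here are hypothetical.

```python
def init_shift_params(num_kernels):
    """Domain factors start at 1 and domain biases at 0 (identity transform)."""
    return [1.0] * num_kernels, [0.0] * num_kernels


def neuron_linear_transform(source_kernels, factors, biases):
    """Apply w_t = f * w_s + b to every weight of each source kernel.

    Assumption: f and b are scalars shared by all weights of a kernel.
    """
    return [
        [f * w + b for w in kernel]
        for kernel, f, b in zip(source_kernels, factors, biases)
    ]


# Hypothetical flattened source kernels, for illustration only.
source = [[0.5, -0.25], [1.5, 2.0]]
factors, biases = init_shift_params(len(source))
assert neuron_linear_transform(source, factors, biases) == source

# After training on few-shot data, only factors and biases have changed.
print(neuron_linear_transform(source, [0.5, 2.0], [0.25, -1.0]))
```

The three ablation settings correspond to training only the factors (biases fixed at 0), only the biases (factors fixed at 1), or both, while the source weights stay frozen in every case.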

The network we use consists of convolutional layers, and each convolution kernel contains a domain factor and a domain bias parameter. For the well-trained model, we calculate the mean values of the factor and the bias at each layer. The statistical results are shown in Fig. 7, where the factor means are shown relative to the initial value. As Fig. 7 (a) shows, at most layers the mean values of the factor and the bias are below their initial values (1 and 0, respectively). Therefore, the effect of the factor and the bias is to reduce the parameters of the GCC model. We call this shift a "down domain shift". The distribution for UCF-QNRF shown in Fig. 7 (b) is similar to that of Shanghai Tech Part A. Both datasets are collected from the Internet, so they have similar distributions. In Fig. 7 (c), the averages of the factor and the bias are greater than their initial values in most layers. We define this shift as an "up domain shift". In Fig. 7 (d), the factor and the bias are distributed on both sides. We define this shift as an "up-down domain shift". In addition, comparing Fig. 7 (d)(e)(f), where different proportions of the training set are treated as few-shot data to learn the domain shift, the resulting distributions are basically the same. This reveals that only a few labeled target domain images are needed to learn the representation of the domain shift.
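The per-layer statistics and the three shift types can be sketched as follows. This is only an illustration: reading "most layers" as a strict majority, and requiring both the factor mean and the bias mean to fall on the same side, are assumptions, and the function names are hypothetical.

```python
from statistics import mean


def layer_means(values_per_layer):
    """Average the per-kernel parameters (factors or biases) of each layer."""
    return [mean(values) for values in values_per_layer]


def classify_shift(factor_means, bias_means, factor_init=1.0, bias_init=0.0):
    """Label the shift as 'down', 'up', or 'up-down'.

    Assumption: a layer counts as 'below' (or 'above') only when both
    its factor mean and its bias mean are below (or above) the initial
    values, and 'most layers' means a strict majority.
    """
    pairs = list(zip(factor_means, bias_means))
    below = sum(1 for f, b in pairs if f < factor_init and b < bias_init)
    above = sum(1 for f, b in pairs if f > factor_init and b > bias_init)
    if below > len(pairs) / 2:
        return "down"
    if above > len(pairs) / 2:
        return "up"
    return "up-down"


# Hypothetical per-kernel values for a three-layer network.
factors = layer_means([[0.9, 0.7], [0.95, 0.85], [1.1, 1.3]])
biases = layer_means([[-0.1, -0.3], [-0.05, -0.15], [0.2, 0.4]])
print(classify_shift(factors, biases))
```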

V-F Analysis in Selecting Few-shot Data

Fig. 8: The testing results for NLT and supervised training with different ratios of few-shot data on Shanghai Tech Part B.

Since our domain adaptation method requires a few labeled target domain images, in this section we discuss the effect of selecting different proportions of few-shot data on NLT. As shown in Fig. 8, we carry out the experiments on the Shanghai Tech Part B dataset. The horizontal axis represents the proportion of training images regarded as few-shot data, while the vertical axis represents the MAE and MSE on the Shanghai Tech Part B test set. The blue and green curves illustrate that the performance of NLT improves as the amount of few-shot data increases. In addition, compared with traditional supervised training, the proposed NLT is better in every data setting. Therefore, it can be concluded that NLT is robust to the selection of target domain data. However, the original intention of this paper is to use only a little target domain data to narrow the gap between synthetic and real-world data, so in the reported results we adopt only a small fraction of the training set of each dataset as few-shot data.
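A few-shot subset of a given proportion can be drawn as in the sketch below. The text does not state how the images are chosen, so uniform random sampling with a fixed seed is an assumption, and the function name and file names are hypothetical.

```python
import random


def select_few_shot(image_ids, ratio, seed=0):
    """Pick a reproducible random subset covering `ratio` of the training set.

    Assumption: images are sampled uniformly at random without replacement.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    ids = list(image_ids)
    k = max(1, round(len(ids) * ratio))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    return sorted(rng.sample(ids, k))


# Hypothetical training list of 400 images, for illustration only.
train_ids = [f"img_{i:04d}.jpg" for i in range(400)]
subset = select_few_shot(train_ids, ratio=0.1)
print(len(subset))
```

Fixing the seed keeps the few-shot split identical across NLT and the supervised baseline, so that curves like those in Fig. 8 compare methods on the same data.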

VI Conclusions

In this paper, we summarize the problems that existing cross-domain crowd counting methods have in expressing the domain shift, and we define the domain adaptation problem as a transformation of parameters at the model level. To convert the source model into the target model, we propose a Neuron Linear Transformation (NLT) method to model the domain shift. The introduced domain shift parameters are optimized by few-shot learning. Extensive experiments show that, by using a small amount of target domain data, our method outperforms other domain adaptation methods. Besides, it also expresses the domain shift better. Considering the versatility of NLT, we will explore its application to other domain adaptation tasks in future work, such as semantic segmentation and pedestrian Re-ID.

References

  • [1] E. Bart and S. Ullman (2005) Cross-generalization: learning novel classes from a single example by feature replacement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 672–679. Cited by: §II.
  • [2] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision, pp. 734–750. Cited by: §II.
  • [3] T. W. Cenggoro, A. H. Aslamiah, and A. Yunanto (2019) Feature pyramid networks for crowd counting. Procedia Computer Science 157, pp. 175–182. Cited by: §II.
  • [4] A. B. Chan, Z. J. Liang, and N. Vasconcelos (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. Cited by: §I, §IV, §V-B.
  • [5] A. B. Chan and N. Vasconcelos (2008) Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE transactions on pattern analysis and machine intelligence 30 (5), pp. 909–926. Cited by: §I.
  • [6] K. Chen, C. C. Loy, S. Gong, and T. Xiang (2012) Feature mining for localised crowd counting. In BMVC, Vol. 1, pp. 3. Cited by: §I, §V-B.
  • [7] L. Fei-Fei (2006) Knowledge transfer in learning to recognize visual objects classes. In Proceedings of the International Conference on Development and Learning, pp. 11. Cited by: §II.
  • [8] M. Fink (2005) Object classification from a single example utilizing class relevance metrics. In Proceedings of the Advances in neural information processing systems, pp. 449–456. Cited by: §II.
  • [9] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume, pp. 1126–1135. Cited by: §II.
  • [10] J. Gao, W. Lin, B. Zhao, D. Wang, C. Gao, and J. Wen (2019) C^3 framework: an open-source pytorch code for crowd counting. arXiv preprint arXiv:1907.02724. Cited by: §IV.
  • [11] J. Gao, T. Han, Q. Wang, and Y. Yuan (2019) Domain-adaptive crowd counting via inter-domain features segregation and gaussian-prior reconstruction. arXiv preprint arXiv:1912.03677. Cited by: §I, §I, §II, §III-C, TABLE I, 5th item, §V-C, §V-D, §V-D, TABLE III.
  • [12] J. Gao, Q. Wang, and Y. Yuan (2019) Feature-aware adaptation and structured density alignment for crowd counting in video surveillance. arXiv preprint arXiv:1912.03672. Cited by: §I, §II, §III-C, §IV, §V-D, TABLE III.
  • [13] S. Gao, Q. Ye, J. Xing, A. Kuijper, Z. Han, J. Jiao, and X. Ji (2017) Beyond group: multiple person tracking via minimal topology-energy-variation. IEEE Transactions on Image Processing 26 (12), pp. 5575–5589. Cited by: §I.
  • [14] T. Han, J. Gao, Y. Yuan, and W. Qi (2020) Focus on semantic consistency for cross-domain crowd understanding. arXiv preprint arXiv:2002.08623. Cited by: §I, §II, §III-C, §V-D, TABLE III.
  • [15] E. Heim, A. Seitel, J. Andrulis, F. Isensee, C. Stock, T. Ross, and L. Maier-Hein (2017) Clickstream analysis for crowd-based object segmentation with confidence. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2814–2826. Cited by: §I.
  • [16] M. A. Hossain, M. Kumar, M. Hosseinzadeh, O. Chanda, and Y. Wang One-shot scene-specific crowd counting. Cited by: §II.
  • [17] M. Hossain, M. Hosseinzadeh, O. Chanda, and Y. Wang (2019) Crowd counting using scale-aware attention networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1280–1288. Cited by: §II, §II.
  • [18] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, and J. Han (2017) Body structure aware deep crowd counting. IEEE Transactions on Image Processing 27 (3), pp. 1049–1059. Cited by: §I.
  • [19] H. Idrees, K. Soomro, and M. Shah (2015) Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. IEEE transactions on pattern analysis and machine intelligence 37 (10), pp. 1986–1998. Cited by: §I.
  • [20] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. arXiv preprint arXiv:1808.01050. Cited by: §I, §I, §V-B.
  • [21] X. Jiang, L. Zhang, P. Lv, Y. Guo, R. Zhu, Y. Li, Y. Pang, X. Li, B. Zhou, and M. Xu (2019) Learning multi-level density maps for crowd counting. IEEE transactions on neural networks and learning systems. Cited by: §I.
  • [22] D. Kang and A. Chan (2018) Crowd counting by adaptively fusing predictions from an image pyramid. arXiv preprint arXiv:1805.06115. Cited by: §II.
  • [23] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV.
  • [24] I. H. Laradji, N. Rostamzadeh, P. O. Pinheiro, D. Vazquez, and M. Schmidt (2018) Where are the blobs: counting by localization with point supervision. In Proceedings of the European Conference on Computer Vision, pp. 547–562. Cited by: §I.
  • [25] W. Li, V. Mahadevan, and N. Vasconcelos (2013) Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence 36 (1), pp. 18–32. Cited by: §I.
  • [26] Y. Li, X. Zhang, and D. Chen (2018) Csrnet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1091–1100. Cited by: §II.
  • [27] M. Ling and X. Geng (2019) Indoor crowd counting by mixture of gaussians label distribution learning. IEEE Transactions on Image Processing 28 (11), pp. 5691–5701. Cited by: §I.
  • [28] C. Liu, X. Weng, and Y. Mu (2019) Recurrent attentive zooming for joint crowd counting and precise localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1217–1226. Cited by: §II.
  • [29] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin (2018) Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601. Cited by: §II.
  • [30] W. Liu, M. Salzmann, and P. Fua (2019) Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5099–5108. Cited by: §II.
  • [31] X. Liu, D. Tao, M. Song, L. Zhang, J. Bu, and C. Chen (2014) Learning to track multiple targets. IEEE transactions on neural networks and learning systems 26 (5), pp. 1060–1073. Cited by: §I.
  • [32] Z. Ma, X. Wei, X. Hong, and Y. Gong (2019) Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6142–6151. Cited by: §I.
  • [33] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141. Cited by: §II.
  • [34] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §II.
  • [35] D. Onoro-Rubio and R. J. López-Sastre (2016) Towards perspective-free object counting with deep learning. In Proceedings of the European Conference on Computer Vision, pp. 615–629. Cited by: §II.
  • [36] V. Ranjan, H. Le, and M. Hoai (2018) Iterative crowd counting. In Proceedings of the European Conference on Computer Vision, pp. 270–285. Cited by: §II.
  • [37] S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: §II.
  • [38] M. K. K. Reddy, M. Hossain, M. Rochan, and Y. Wang (2020) Few-shot scene adaptive crowd counting using meta-learning. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 2814–2823. Cited by: §II.
  • [39] S. A. M. Saleh, S. A. Suandi, and H. Ibrahim (2015) Recent survey on crowd density estimation and counting for visual surveillance. Engineering Applications of Artificial Intelligence 41, pp. 103–114. Cited by: §I.
  • [40] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4031–4039. Cited by: §II.
  • [41] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In Proceedings of the International conference on machine learning, pp. 1842–1850. Cited by: §II.
  • [42] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1861–1870. Cited by: §II.
  • [43] V. A. Sindagi and V. M. Patel (2019) Ha-ccn: hierarchical attention-based crowd counting network. IEEE Transactions on Image Processing 29, pp. 323–335. Cited by: §I, §I.
  • [44] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Proceedings of the Advances in neural information processing systems, pp. 4077–4087. Cited by: §II.
  • [45] Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019) Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412. Cited by: §III-B.
  • [46] Y. Tian, Y. Lei, J. Zhang, and J. Z. Wang (2019) Padnet: pan-density crowd counting. IEEE Transactions on Image Processing. Cited by: §I.
  • [47] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Proceedings of the Advances in neural information processing systems, pp. 3630–3638. Cited by: §II.
  • [48] J. Wan and A. Chan (2019) Adaptive density map generation for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1130–1139. Cited by: §I.
  • [49] J. Wan, W. Luo, B. Wu, A. B. Chan, and W. Liu (2019) Residual regression with semantic prior for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4031–4040. Cited by: §I.
  • [50] Q. Wang, M. Chen, F. Nie, and X. Li (2020) Detecting coherent groups in crowd scenes by multiview clustering. T-PAMI 42 (1), pp. 46–58. Cited by: §I.
  • [51] Q. Wang, J. Gao, W. Lin, and X. Li (2020) NWPU-crowd: a large-scale benchmark for crowd counting. arXiv preprint arXiv:2001.03360. Cited by: §I.
  • [52] Q. Wang, J. Gao, W. Lin, and Y. Yuan (2019) Learning from synthetic data for crowd counting in the wild. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 8198–8207. Cited by: §I, §I, §II, §III-C, §IV, 5th item, §V-B, §V-D, §V-D, TABLE III.
  • [53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §V-A.
  • [54] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He (2019) Adaptive scenario discovery for crowd counting. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2382–2386. Cited by: §II.
  • [55] S. Yi, H. Li, and X. Wang (2016) Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance. IEEE transactions on image processing 25 (9), pp. 4354–4368. Cited by: §I, §I.
  • [56] C. Zhang, K. Kang, H. Li, X. Wang, R. Xie, and X. Yang (2016) Data-driven crowd understanding: a baseline for a large-scale crowd dataset. IEEE Transactions on Multimedia 18 (6), pp. 1048–1061. Cited by: §V-B.
  • [57] L. Zhang, M. Shi, and Q. Chen (2018) Crowd counting via scale-adaptive convolutional neural network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1113–1121. Cited by: §II.
  • [58] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 589–597. Cited by: §II, §V-B.
  • [59] Z. Zhang, M. Wang, and X. Geng (2015) Crowd counting in public video surveillance by label distribution learning. Neurocomputing 166, pp. 151–163. Cited by: §I.
  • [60] B. Zhao, X. Li, and X. Lu (2019) CAM-rnn: co-attention model based rnn for video captioning. IEEE Transactions on Image Processing 28 (11), pp. 5552–5565. Cited by: §I.
  • [61] B. Zhao, X. Li, and X. Lu (2019) Property-constrained dual learning for video summarization. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
  • [62] M. Zhao, J. Zhang, C. Zhang, and W. Zhang (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12736–12745. Cited by: §I.
  • [63] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint. Cited by: §II, §V-D, TABLE III.