I Introduction
Currently, accelerating the understanding of crowds plays an increasingly important role in building an intelligent society. As a broad research field, it involves many hotspots. In scenes with sparse crowd distribution, crowd understanding mainly includes crowd detection [19, 25], crowd behavior analysis [50, 55], crowd segmentation [15, 5], and crowd tracking [13, 31]. In scenes with high-level density, such as an image containing thousands of people, crowd understanding mainly focuses on counting and density estimation [18, 43, 27, 46, 21]. In this paper, we strive to work on the existing crowd counting problem.

Crowd counting, which generates a pixel-level density estimation map and sums all of its pixels to predict how many people are in an image, has become a popular task due to its widespread practical applications: public management, traffic flow prediction, scene understanding
[55, 61], video analysis [59, 60], etc. Specifically, it can be used for public safety in many situations, such as political rallies and sports events [39]. Besides, density estimation can also be used to help detect the locations of people in sparse scenes [20]. In traditional supervised learning, many excellent algorithms [24, 43, 32, 62, 49, 48] constantly refresh the counting metrics from different angles on the existing datasets. However, traditional supervised learning requires a large amount of labeled data to drive it, and unfortunately, pixel-level annotation is often costly. According to statistics, the entire annotation procedure for the QNRF dataset [20] took 2,000 human-hours. On the recently established NWPU dataset [51], the time cost is even as high as 3,000 human-hours. Even though researchers invest a lot of time and money to build the datasets, the existing datasets are still limited in scale.
Because of the small scale of some existing datasets, the above models may suffer from overfitting to different extents, and there is a significant performance drop when applying them in real life. Thus, Cross-Domain Crowd Counting (CDCC), which focuses on improving the performance in the target domain by using data from the source domain, attracts researchers' attention. Wang et al. [52] propose crowd counting via a domain adaptation method, SE CycleGAN, which translates synthetic data to photo-realistic scenes and then applies the trained model in the wild. Gao et al. [11] present a high-quality image translation method with feature disentanglement. [14, 12] adopt adversarial learning to extract domain-invariant features in the source and target domains. In a word, general Unsupervised Domain Adaptation (UDA) methods concentrate on image style and feature similarity. The upper box in Fig. 1 demonstrates the appearance differences.
Nevertheless, domain adaptation at the image and feature level is not sensitive to the counting task: this strategy does not directly affect the counting performance, so it is not optimal. For example, SE CycleGAN [52] and DACC [11] focus on maintaining local consistency to improve the translation quality in congested regions. When applying the model to sparse scenes (Mall [6], UCSD [4]), this loss may be redundant. In other words, there are task gaps in the existing UDA-style methods. Besides, since the target labels are unseen by UDA models, they do not work well: coarse predictions in congested regions and estimation errors in the background.
Given a specific task, we find that the domain shift can be reflected in the parameters of models trained on different domains. Specifically, we use synthetic data and real-scene data to train the model, respectively, and then calculate the average value of each kernel in a specific layer. The lower box in Fig. 1 reports the distribution histograms. We can intuitively see that the parameters supervised with both datasets follow Gaussian distributions, and the difference lies in their means and variances. Thus, we conclude that the domain shift between different datasets can be measured by the parameter distributions of a specific model.
Based on the above observation, these differences at the parameter level can be simulated by a linear transformation. Thus, this paper proposes a Neuron Linear Transformation (NLT) method to handle cross-domain crowd counting. To be specific, we first train a source model using traditional supervised learning. Then we exploit a few labeled target data to learn two parameters (a factor and a bias) for each source neuron. Finally, these neurons are updated by a linear transformation, treated as target neurons, and applied to the target data. The entire process is shown in Fig. 2.
In summary, the main contributions of this paper are:

Propose a novel Neuron Linear Transformation (NLT) method to model the domain shift. To the best of our knowledge, it is the first time that the domain shift is measured at the parameter level.

Exploit few-shot target data to approximate the real domain shift, which significantly reduces the annotation costs.

Outperform the traditional methods on six real-world crowd counting datasets when facing the same problem. The experiments also evidence that NLT has higher practical value than UDA methods.
II Related Work
In this section, we briefly review the relevant works on three tasks: supervised crowd counting, cross-domain crowd counting with synthetic data, and few-shot learning.
Supervised Crowd Counting. In recent years, supervised crowd counting algorithms have mostly focused on scale variability. From the perspective of scale-awareness, Zhang et al. [58] propose a three-column network with different kernel sizes for scale perception in 2016. López-Sastre et al. [35] introduce a three-column Hydra CNN, where each column is fed a patch from the same image at a different scale. Two years later, Wu et al. [54] developed a powerful multi-column scale-aware CNN with an adaptation module to fuse the sparse and congested columns. In the same year, AFP [22] generates a density map by fusing the attention map and an intermediate density map in each column. ic-CNN [36] generates a high-resolution density map by passing the features and predicted map from a low-resolution CNN to a high-resolution CNN. Last year, Hossain et al. [17] employed a scale-aware attention network, where each column is weighted by the outputs of a global scale attention network and a local scale attention network. Beyond multi-column scale-aware architectures, single-column scale-aware CNNs generally perform better in recent research, such as SANet [2] and SaCNN [57]. To combine multi-column and single-column scale-aware CNNs, CSRNet [26], CAN [30] and FPNCC [3] develop models containing multiple paths in only several parts of the networks.
From the perspective of context-awareness, CP-CNN [42] designs a global context estimator and a local context estimator to classify the density levels of the full image and its patches, respectively. Switching-CNN [40] employs an extra column CNN to deliver the best-performing regressor for a given patch. DRSAN [29] designs a module named Recurrent Spatial-Aware Refinement (RSAR) to refine the density map. In 2019, RAZNet [28] divides the training phase into two steps: first, a main CNN is trained as a typical density map regressor with an extra column that proposes regions to zoom; then another CNN is trained to recurrently refine the proposed zooming regions. Meanwhile, SAAN [17] designs three types of CNN modules, a Multi-scale Feature Extractor (MFE), Global Scale Attention (GSA) and Local Scale Attention (LSA), which exploit the local context to improve the counting performance.

Cross-domain Crowd Counting. In addition to the explorations mentioned above, a new research hotspot, cross-domain crowd counting, is beginning to interest researchers. In this task, the model is supposed to transfer what it learns from one dataset to another unseen dataset. One of the earliest studies is launched by Wang et al. [52], who establish a large-scale synthetic dataset to pre-train a model that improves the robustness on real-world datasets via a fine-tuning operation. Besides fine-tuning, they also train a counter without using any real-world labeled data, which is accomplished by using CycleGAN [63] and SE CycleGAN [52] to generate realistic images. Recently, several efforts have followed it: DACC [11], a domain adaptation method based on image translation and Gaussian-prior reconstruction, achieves new state-of-the-art results on several mainstream datasets. At the same time, some works [12, 14] extract domain-invariant features based on adversarial learning. Experimental results show that those methods can narrow the domain shift to some extent.
Overall, the current research on learning from synthetic data for crowd counting is still in its infancy. However, the intersection of synthetic data and real-world data proves to be particularly fertile ground for groundbreaking new ideas, and we firmly believe that this field will become more significant over time.
Few-shot Learning.
Since our cross-domain crowd counting method involves a small number of target domain samples, we hereby introduce some studies related to few-shot learning. Few-shot learning builds on prior experience with very similar tasks for which large-scale training sets are available, and then trains a deep learning model using only a few training examples. Early few-shot learning methods [1, 7, 8] are based on hand-crafted features. Vinyals et al. [47] use a memory component in a neural net to learn a common representation from very little data. Snell et al. [44] propose Prototypical Networks, which map examples to an embedding vector space. Ravi and Larochelle [37] use an LSTM-based meta-learner to learn an update rule for training a neural network learner. Model-Agnostic Meta-Learning (MAML) [9] learns a model parameter initialization that generalizes better to similar tasks. Similar to MAML, Reptile [34] executes stochastic gradient descent for a number of iterations on a given task, and then gradually moves the initialization weights toward the weights obtained after those iterations. Santoro et al. [41] propose Memory-Augmented Neural Networks (MANNs) to memorize information about previous tasks and leverage it to learn a learner for new tasks. SNAIL [33] is a generic meta-learner architecture that learns a common feature vector for the training images to aggregate information from past experiences. Most of the above few-shot learning methods target classification tasks. For crowd counting, [16] proposes a one-shot learning approach for adapting to a target scene using one labeled example, and [38] applies MAML [9] to learn scene-adaptive crowd counting with few-shot learning.

III Approach
This section describes the detailed methodology for cross-domain crowd counting. Firstly, we define the problem that we want to solve. Then, NLT, a linear operation at the neuron level, is designed to model the domain shift. Finally, we introduce how to integrate NLT into the transformation from the source model to the target model. Fig. 2 illustrates the entire framework.
III-A Problem Setup
In this paper, we strive to tackle the existing problems of domain-adaptive crowd counting from the parameter level with a transformation. The setting assumes access to a source domain (synthetic data) with labeled crowd images. Besides, a target domain (real-scene data) provides few-shot images with labeled density maps. The purpose is to first train a source model with parameters θ_S on the source data, then learn a representable domain shift via few-shot learning, parameterized by the domain factors γ and the domain biases β, and finally generate a well-performing target model with parameters θ_T by combining the source model with the domain-shift parameters.
III-B Neuron Linear Transformation
Inspired by the neuron-level scale and shift operation [45], we propose a Neuron Linear Transformation (NLT) method to describe the domain gap, which makes the domain gap clearly visible. To model the domain shift, we assume that the source model and the target model lie in the same linear space. Each neuron in the target model can then be obtained from the corresponding neuron in the source model by a linear transformation.
This domain adaptation method has two advantages: 1) The target model inherits the good feature extraction ability of the source model and preserves its generalization. 2) Compared with fine-tuning all parameters of the target model, only a few parameters need to be optimized with NLT, which reduces the probability of overfitting during few-shot learning in the target domain. For each source-domain neuron parameter w_S, we define a corresponding domain factor γ and domain bias β. Then the neuron-level linear transformation can be expressed as:

    w_T = γ · w_S + β.    (1)
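To make Equ. (1) concrete, here is a minimal, framework-free sketch; the function and variable names are ours, not from the paper's released code:

```python
# Minimal sketch of Equ. (1): each target neuron is a linear transform of
# the corresponding source neuron, w_T = gamma * w_S + beta.
# All names here are illustrative assumptions.

def nlt_transform(source_neuron, gamma, beta):
    """Map one source-domain neuron (its kernel weights, flattened)
    to the target domain with one scalar factor and one scalar bias."""
    return [gamma * w + beta for w in source_neuron]

# A toy 3x3 kernel flattened to nine weights:
kernel = [0.2, -0.1, 0.05, 0.3, 0.0, -0.2, 0.1, 0.15, -0.05]
target_kernel = nlt_transform(kernel, gamma=1.0, beta=0.0)
# At the initial values gamma=1, beta=0, the target neuron equals the
# source neuron, i.e. target_kernel == kernel.
```

Note that the two scalars act on the whole kernel, which is what keeps the number of learnable target-side parameters small.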
III-C Modeling the Domain Shift
In this section, we introduce how to use Neuron Linear Transformation (NLT) to model domain shift from the source domain to the target domain.
First, we introduce the architecture of the model. The source domain model can adopt any crowd counting network; however, for a fair comparison, a simple encoder-decoder structure is designed following the previous works [52, 14, 11, 12]. As shown in Fig. 2, the first four convolutional blocks of VGG-16 are adopted as the backbone in the encoder stage; that is, the output feature map is 1/8 the size of the input image. In the decoder stage, a 3×3 convolutional layer is used to reduce the feature channels by half, and then an upsampling layer is followed by another 3×3 convolutional layer to further reduce the channels. After three repetitions, a 1×1 convolutional layer outputs the predicted density map. The training of the source domain model is similar to that of a traditional supervised crowd counting network, except that the training data is the synthetic dataset GCC. The source model parameters θ_S are optimized by gradient descent as follows,
    θ_S ← θ_S − η_S · ∂L_S/∂θ_S,  L_S = (1/(2·N_S)) Σ_{i=1}^{N_S} ‖F(x_i^S; θ_S) − y_i^S‖²,    (2)

where L_S is a standard MSE loss, N_S is the batch size of the source model, F(x_i^S; θ_S) is the source model prediction of the training data x_i^S, and η_S denotes the learning rate.
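As a sanity check on the encoder-decoder bookkeeping described above, a small helper can trace the feature-map shapes. It assumes the encoder output has 512 channels at 1/8 resolution and that each decoder stage halves the channels while doubling the resolution; both are our assumptions, since the text leaves the exact channel widths unstated:

```python
# Hypothetical shape tracer for the encoder-decoder sketched in the text.
# Assumptions: encoder output is (H/8, W/8, 512); each of the three decoder
# stages halves the channels and doubles the spatial size; a final 1x1
# convolution maps to a single-channel density map.

def decoder_shapes(h, w, enc_channels=512):
    shapes = [(h // 8, w // 8, enc_channels)]   # encoder output
    fh, fw, fc = shapes[-1]
    for _ in range(3):                          # three upsampling stages
        fc //= 2                                # 3x3 conv halves channels
        fh, fw = fh * 2, fw * 2                 # upsampling doubles resolution
        shapes.append((fh, fw, fc))
    shapes.append((fh, fw, 1))                  # 1x1 conv -> density map
    return shapes

# For a 512x512 input, the prediction returns to full resolution:
# decoder_shapes(512, 512)[-1] == (512, 512, 1)
```

Three doublings exactly undo the encoder's 1/8 downsampling, which is why the decoder repeats its stage three times.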
Second, we introduce how to embed NLT into the target model training. As shown in Fig. 2, the target model keeps the same architecture as the source model, but the number of parameters involved in training is different. The parameters of the target model are transferred from the source model, and the goal of the transformation is to make up for the task gap. To achieve the transformation, we express it mathematically; this process is regarded as modeling the domain shift. Specifically, we model the domain shift by transferring all neurons in the source model to the target model with the proposed NLT. As a result, in the target model, we define two groups of additional parameters, the factors γ and the biases β, to achieve the model-level linear transformation. Assuming that the source model contains n neurons in total, the number of factors and the number of biases are both n. According to Equ. (1), the mapping can be expressed as follows,
    θ_T = γ ⊙ θ_S + β,    (3)

where γ represents the domain shift factors, initialized to 1, and β represents the domain shift biases, initialized to 0; ⊙ denotes the neuron-wise application of Equ. (1).
Since we introduce the learnable parameters γ and β to describe the task gap in the target model, some labeled target domain images are needed to learn them. However, to satisfy the requirements of domain adaptation, we only use a few data to support the training. θ_S is learned in the update phase of the source model, but it is frozen when the target model is updated. After the calculation of Equ. 3, θ_T participates in the feed-forward of the target model. Therefore, only the gradients of γ and β need to be calculated in the backward pass; that is, only γ and β are learned in the target model. Since the convolution kernels of VGG-16 are 3×3 (nine weights per kernel, versus one factor and one bias), the updated parameters in the target model amount to 2/9 of those in the source model. The loss for optimizing the parameters is defined as follows,
    L_T = (1/(2·N_T)) Σ_{i=1}^{N_T} ‖F(x_i^T; θ_T) − y_i^T‖² + λ(‖γ‖₂² + ‖β‖₂²),    (4)

where the former term is the density estimation loss on the few-shot data, the same as the loss of the source model; x_i^T and y_i^T are the input image and its density map, and F(x_i^T; θ_T) is the predicted density map. The latter term is the L2 regularization loss on the parameters γ and β, with the purpose of preventing overfitting in the target domain; λ is the weighting parameter. Finally, the target model is optimized as follows,
Table I: Comparison of different training methods on Shanghai Tech Part A and B. DA: domain adaptation; FS: few-shot data.

Method               DA  FS | Shanghai Tech Part A        | Shanghai Tech Part B
                            | MAE    MSE    PSNR   SSIM   | MAE   MSE   PSNR   SSIM
NoAdpt               ✗   ✗  | 188.0  279.6  20.91  0.670  | 20.1  29.2  26.62  0.895
Supervised           ✗   ✔  | 107.2  165.9  21.53  0.623  | 16.0  26.7  26.8   0.932
Fine-tuning          ✔   ✔  | 105.7  167.6  21.72  0.702  | 13.8  22.3  27.0   0.931
NLT (ours)           ✔   ✔  | 93.8   157.2  21.89  0.729  | 11.8  19.2  27.58  0.937
IFS [11]+NLT (ours)  ✔   ✔  | 90.1   151.6  22.01  0.741  | 10.8  18.3  27.69  0.932
    (γ, β) ← (γ, β) − η_T · ∂L_T/∂(γ, β),    (5)

where η_T denotes the learning rate of the target model.
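As an illustration, the target objective of Equ. (4), a density MSE term on the few-shot data plus an L2 penalty on the domain-shift parameters, can be sketched in plain Python. The flattened-list representation and the name target_loss are our assumptions:

```python
# Illustrative sketch of the target objective in Equ. (4): a density MSE
# term over the N_T few-shot samples plus an L2 penalty on the domain-shift
# parameters gamma and beta. Flattened lists stand in for density maps.

def target_loss(preds, gts, gammas, betas, lam=1e-4):
    n = len(preds)
    density = sum((p - g) ** 2 for p, g in zip(preds, gts)) / (2 * n)
    reg = sum(g * g for g in gammas) + sum(b * b for b in betas)
    return density + lam * reg

# With perfect predictions, only the regularizer contributes:
# target_loss([12.0], [12.0], gammas=[1.0], betas=[0.0], lam=1e-4) == 1e-4
```

Only gammas and betas carry gradients during target training; the density term simply flows through the frozen source weights.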
IV Implementation Details
Executive Stream. In the training phase, the workflow is shown in Fig. 2 1⃝–6⃝; one iteration requires updating the parameters of both models. First, θ_S is updated according to a batch sampled from the GCC data via 1⃝–3⃝. Second, the domain shift parameters are updated with the few-shot data provided in the target domain via 4⃝–6⃝. Finally, the parameters of the target model are obtained by NLT, as shown in Equ. 3. In the validation phase, we divide a validation set for each target domain from its training data. In the testing phase, we use the best-performing model on the validation set for inference.
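The alternating stream above can be illustrated with a deliberately tiny stand-in model, a single weight y = w·x with analytic gradients. The data, learning rates, and iteration count are toy assumptions, not the paper's settings:

```python
# Toy, runnable sketch of the alternating training stream: the one-weight
# model y = w * x stands in for the counting network. Source data has slope
# 2, target few-shot data has slope 3; all values are illustrative.

def mse_grad(w, pairs):
    # d/dw of the mean squared error of y = w * x over (x, y) pairs
    n = len(pairs)
    return sum(2.0 * (w * x - y) * x for x, y in pairs) / n

src = [(1.0, 2.0), (2.0, 4.0)]             # synthetic-like source data
tgt = [(1.0, 3.0), (2.0, 6.0)]             # few-shot target data

w_s, gamma, beta = 0.0, 1.0, 0.0           # gamma, beta init as in Equ. (3)
for _ in range(200):
    # Steps 1-3: supervised update of the source weight on source data.
    w_s -= 0.05 * mse_grad(w_s, src)
    # Steps 4-6: freeze w_s; update (gamma, beta) on few-shot target data.
    w_t = gamma * w_s + beta               # NLT mapping, Equ. (3)
    g = mse_grad(w_t, tgt)                 # dL/dw_t
    gamma -= 0.05 * g * w_s                # chain rule: dw_t/dgamma = w_s
    beta -= 0.05 * g                       # chain rule: dw_t/dbeta = 1

# w_s approaches 2, while the NLT-mapped target weight gamma * w_s + beta
# approaches 3, without ever updating w_s on target data.
```

The point of the sketch is the gradient routing: the target loss updates only the domain-shift parameters, exactly as in steps 4⃝–6⃝.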
Parameter Setting. In each iteration, we input a batch of synthetic images and the target few-shot images. The Adam algorithm [23] is used to optimize the networks. The learning rate η_S for the source model in Equ. 2 and the learning rate η_T for the target model in Equ. 5 are set separately, and the weighting parameter λ in the target loss of Equ. 4 is fixed. Our code is developed based on the Framework [10] on an NVIDIA GTX Ti GPU.

Scene Regularization. In other fields of domain adaptation, such as semantic segmentation, the object distribution in street scenes is highly consistent. Unlike this, current real-world crowd datasets differ greatly in density range; for example, the counts in the MALL [6] dataset lie in a narrow low range, whereas the GCC [52] dataset spans a much wider range. To avoid negative adaptation caused by the different density ranges, we adopt the scene regularization strategy proposed by [52] and [12]. In other words, we add some filter conditions to select proper synthetic images from GCC as the source domain data for each real-world dataset.
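The filtering idea can be sketched as follows; the record format, function name, and count bounds are invented for illustration (the paper's actual filter conditions follow [52, 12]):

```python
# Sketch of scene regularization: keep only synthetic images whose person
# count falls inside the target dataset's density range.

def filter_source(records, lo, hi):
    """records: (image_id, person_count) pairs; keep counts in [lo, hi]."""
    return [img for img, cnt in records if lo <= cnt <= hi]

gcc_like = [("a", 5), ("b", 800), ("c", 30), ("d", 3000)]
# For a sparse target such as MALL, keep only low-count synthetic scenes:
sparse_source = filter_source(gcc_like, lo=0, hi=100)   # -> ["a", "c"]
```

For a congested target such as UCF-QNRF, the bounds would instead be shifted toward high counts, so each target dataset gets its own slice of GCC.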
Table II: Effects of the domain shift parameters on Shanghai Tech Part A and B. DA: domain adaptation; FS: few-shot data.

Method       DA  FS | Shanghai Tech Part A        | Shanghai Tech Part B
                    | MAE    MSE    PSNR   SSIM   | MAE   MSE   PSNR   SSIM
NoAdpt       ✗   ✗  | 188.0  279.6  20.91  0.670  | 20.1  29.2  26.62  0.895
Fine-tuning  ✔   ✔  | 105.7  167.6  21.72  0.702  | 13.8  22.3  27.0   0.931
Factor (γ)   ✔   ✔  | 109.2  161.3  21.49  0.758  | 13.5  23.5  27.26  0.921
Bias (β)     ✔   ✔  | 107.8  169.9  21.14  0.796  | 12.8  20.6  27.17  0.916
NLT (γ, β)   ✔   ✔  | 93.8   157.2  21.89  0.729  | 11.8  19.2  27.58  0.937
V Experiments
In this section, we first report the experimental evaluation metrics and the selected datasets, and then perform a comprehensive ablation study to illustrate the effectiveness of our proposed method. Next, we analyze the shift between different real-world datasets and the synthetic dataset from a statistical perspective. In addition, we also discuss the effect of the selected few-shot data on the performance improvement. Finally, we present the testing results and visualization results of our method on six real-world datasets.
V-A Evaluation Criteria
Counting Error.
According to the evaluation criteria widely used in crowd counting, the counting error is usually reflected in two metrics, namely the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). MAE measures the average deviation of the predicted counts, while MSE measures the robustness of the model to outliers. For both, lower is better. They are defined as follows:

    MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|,  MSE = √( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ),    (6)

where N is the number of images to be tested, and y_i and ŷ_i are the ground-truth and estimated numbers of people for the i-th sample, each obtained by summing all the pixel values in the density map.
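The two metrics can be transcribed directly, assuming the usual crowd-counting convention that "MSE" denotes the root of the mean squared error:

```python
# Transcription of the MAE/MSE counting metrics. Counts would come from
# summing density-map pixels; plain numbers stand in for them here.
import math

def mae(gt_counts, pred_counts):
    n = len(gt_counts)
    return sum(abs(g - p) for g, p in zip(gt_counts, pred_counts)) / n

def mse(gt_counts, pred_counts):
    n = len(gt_counts)
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gt_counts, pred_counts)) / n)

# Three toy test images with absolute errors 10, 5 and 2:
gt, pred = [100, 50, 10], [90, 55, 12]
# mae(gt, pred) -> 17/3; mse(gt, pred) -> sqrt(129/3) = sqrt(43)
```

The squared term inside MSE is what makes it the more sensitive of the two to a single badly estimated image.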
Density Map Quality.
To further evaluate the quality of the predicted density maps, we also calculate the PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) [53]. For these two metrics, the larger the value, the higher the quality of the predicted density map.

Table III: Comparison with other domain adaptation methods on Shanghai Tech Part A/B and UCF-QNRF. DA: domain adaptation; FS: few-shot data.

Method               DA  FS | Shanghai Tech Part A        | Shanghai Tech Part B       | UCF-QNRF
                            | MAE    MSE    PSNR   SSIM   | MAE   MSE   PSNR   SSIM   | MAE    MSE    PSNR   SSIM
CycleGAN [63]        ✔   ✗  | 143.3  204.3  19.27  0.379  | 25.4  39.7  24.60  0.763  | 257.3  400.6  20.80  0.480
SE CycleGAN [52]     ✔   ✗  | 123.4  193.4  18.61  0.407  | 19.9  28.3  24.78  0.765  | 230.4  384.5  21.03  0.660
FA [12]              ✔   ✗  | –      –      –      –      | 16.0  24.7  –      –      | –      –      –      –
FSC [14]             ✔   ✗  | 129.3  187.6  21.58  0.513  | 16.9  24.7  26.20  0.818  | 221.2  390.2  23.10  0.7084
IFS [11]             ✔   ✗  | 112.4  176.9  21.94  0.502  | 13.1  19.4  28.03  0.888  | 211.7  357.9  21.94  0.687
NoAdpt (ours)        ✗   ✗  | 188.0  279.6  20.91  0.670  | 20.1  29.2  26.62  0.895  | 276.8  453.7  22.22  0.692
NLT (ours)           ✔   ✔  | 93.8   157.2  21.89  0.729  | 11.8  19.2  27.58  0.937  | 172.3  307.1  22.81  0.729
IFS [11]+NLT (ours)  ✔   ✔  | 90.1   151.6  22.01  0.741  | 10.8  18.3  27.69  0.932  | 157.2  263.1  23.01  0.744
Table III (continued): Comparison on WorldExpo’10, UCSD and MALL.

Method               DA  FS | WorldExpo’10 (only MAE)           | UCSD                       | MALL
                            | S1   S2    S3    S4    S5    Avg. | MAE    MSE    PSNR   SSIM  | MAE   MSE   PSNR   SSIM
CycleGAN [63]        ✔   ✗  | 4.4  69.6  49.9  29.2  9.0   32.4 | –      –      –      –     | –     –     –      –
SE CycleGAN [52]     ✔   ✗  | 4.3  59.1  43.7  17.0  7.6   26.3 | –      –      –      –     | –     –     –      –
FA [12]              ✔   ✗  | 5.7  59.9  19.7  14.5  8.1   21.6 | 2.0    2.43   –      –     | 2.47  3.25  –      –
IFS [11]             ✔   ✗  | 4.5  33.6  14.1  30.4  4.4   17.4 | 1.76   2.09   24.42  0.950 | 2.31  2.96  25.54  0.933
NoAdpt (ours)        ✗   ✗  | 5.0  89.9  63.1  20.8  17.1  39.2 | 12.79  13.22  23.94  0.899 | 6.20  6.96  24.65  0.879
NLT (ours)           ✔   ✔  | 2.3  22.8  16.7  19.7  3.9   13.1 | 1.58   1.97   25.29  0.942 | 1.96  2.55  26.92  0.967
IFS [11]+NLT (ours)  ✔   ✔  | 2.0  15.3  14.7  18.8  3.4   10.8 | 1.48   1.81   25.58  0.965 | 1.86  2.39  27.03  0.944
V-B Datasets
The synthetic dataset GCC [52] is the only source domain. As for the target domain, to ensure the sufficiency of our experiments, we select two datasets each from the high-density, medium-density and low-density categories, for a total of six datasets: UCF-QNRF [20], Shanghai Tech Part A [58], Shanghai Tech Part B [58], WorldExpo’10 [56], Mall [6] and UCSD [4].
Source Domain Dataset. GCC is a large-scale synthetic dataset, which is sampled from virtual scenes by a computer game mod. It contains accurately annotated images with a total of instances. There is an average of people in each image.

Congested Crowd Dataset. UCF-QNRF is collected from an image-sharing website; therefore, the dataset contains a variety of scenes. It consists of images ( training and testing images), with annotated instances. The average number of people is 815 per image. Shanghai Tech Part A is also randomly collected from the Internet with different scenarios. It consists of images ( training and testing images) with different resolutions. The average number of people in an image is .

Moderate Crowd Dataset. Shanghai Tech Part B is captured from surveillance cameras on Nanjing Road in Shanghai, China. It contains samples ( training and testing images). The scenes are relatively uniform, with an average of people per image. WorldExpo’10 consists of labeled images, which are collected from surveillance scenes ( scenes for training and the remaining scenes for testing) at the Shanghai WorldExpo. The average number of people is 50 per image.

Sparse Crowd Dataset. Mall is captured from a surveillance camera installed in a shopping mall, which records ( for training and for testing) sequential frames. The average number of people in each image is . UCSD consists of frames ( frames for training and the others for testing) collected from a single-scene surveillance video. The average number of pedestrians in each image is .
V-C Ablation Study
We present our ablation experiments from two perspectives. First, regarding the few-shot data, we demonstrate their impact under different training methods. Second, for the proposed NLT, we discuss the effects of the factor γ and the bias β on modeling the domain shift. The following experiments are conducted on the Shanghai Tech Part A and B datasets, and the same proportion of each training set is selected as few-shot data.
Compared with Other Training Methods. Five training methods are used to demonstrate the role of few-shot data in narrowing the domain gap. The specific settings are as follows:

NoAdpt. Train the model on the GCC dataset.

Supervised. Train the model on few-shot data.

Fine-tuning. Train the model on the GCC dataset and fine-tune it with few-shot data.

NLT (ours). Train the model from GCC to the real-world dataset with our NLT and training strategy.
As shown in Fig. 3, we draw the loss and performance curves on the validation set during training. Taking the Shanghai Tech Part A dataset as an example, it is difficult to reduce the validation loss without domain adaptation. Supervised training and fine-tuning with few-shot data can significantly reduce the loss, but they easily suffer from overfitting. Compared to supervised training and fine-tuning, our NLT reaches a lower validation loss and inhibits overfitting. In Fig. 3 (b) and (c), the MAE and MSE curves also illustrate the effectiveness of NLT. Similarly, in Fig. 3 (d), (e) and (f), Shanghai Tech Part B shows the same trend as Part A, which proves that our method is suitable for both dense and sparse scenes.
Table I shows the results on the testing sets. The results without adaptation are unsatisfactory, which validates the vast distance between real scenes and synthetic data mentioned in our introduction. As shown in lines 4 and 5, both fully supervised training and fine-tuning a GCC-pretrained model with few-shot data yield better results than no domain adaptation, showing that few-shot data has a significant effect on narrowing the domain gap. Comparing lines 4 and 5, fine-tuning the pretrained GCC model achieves better results than full supervision alone, which indicates that the synthetic data plays a vital role in improving the generalization of the model. Comparing lines 5 and 6, the proposed NLT and its training strategy outperform the fine-tuning operation. Taking MAE as an example, NLT reduces the counting error by 13.1% on Shanghai Tech Part A and by 16.1% on Part B compared with fine-tuning. In addition, we also test NLT on the images stylized by IFS [11], and the results show that our method is further improved. This is just a simple validation; our later experiments are still based on the original GCC. In conclusion, our method is a win-win approach: the few-shot data drive the training of NLT to narrow the domain gap, while NLT maximizes the potential of the few-shot data.
The Influence of Domain Shift Parameters. In our NLT, two groups of parameters, the factor γ and the bias β, are set to learn the transformation of neurons. To verify the validity and compatibility of these parameters, we conduct three experiments to show their effects on modeling the domain shift: using the factor γ only, the bias β only, and both of them. The details of the experiments are shown in Fig. 4.
The red curves represent no domain adaptation. At the beginning of training, the loss decreases, but as time goes on, it keeps rising. The reason is that the model trained with synthetic data has a limited ability to fit the real data; once that limit is passed, the model continuously deviates from the target domain. The blue and green curves show the effectiveness of the domain factor and the domain bias, respectively: both of them can greatly reduce the loss and improve the performance. It is worth noting that the factor does not overfit easily but converges slowly, while the bias converges faster but overfits easily. When the two are combined, they complement each other and perform best.
The results on the test set are shown in Table II. The learnable parameters for the factor and for the bias are each 1/9 of the source model's parameters (one scalar per nine-weight 3×3 kernel), whereas the fine-tuning operation updates all parameters of the source model. On Shanghai Tech Part A, for example, with only a small portion of the training set treated as few-shot data, the factor and the bias alone achieve results similar to fine-tuning. It appears that using the factor and the bias to represent the domain shift is effective. The best results are achieved when combining the two to form the Neuron Linear Transformation (NLT) for modeling the domain shift.
V-D Adaptation Results on Real-world Datasets
In this section, we test the performance of NLT by using it to learn the domain shift from GCC to six real-world datasets and compare it with other domain adaptation methods.
Metrics Report. Table III lists the statistical results on the four metrics (MAE/MSE/PSNR/SSIM). Compared with the image translation methods (CycleGAN [63], SE CycleGAN [52] and IFS [11]) and the feature adversarial learning methods (FA [12] and FSC [14]), our method performs better by using a little annotated data from the target domain. Taking MAE as an example, as the lavender row shows, NLT reduces the counting errors by 16.5%, 10.0%, 18.6%, 29.9%, 10.0%, and 15.2% on the six real-world datasets, respectively, compared with the above methods. On PSNR and SSIM, which represent the quality of the generated density maps, we also achieve a significant improvement, indicating that introducing few-shot data from the target domain is important for noise cancellation in the background region. Experiments on datasets with different densities also demonstrate the universality of NLT for cross-domain counting tasks.
Furthermore, we also discuss the combination of NLT with other domain adaptation methods. In this article, we obtain stylistically realistic versions of the GCC [52] dataset by using IFS [11], the best-known image translation method for cross-domain crowd counting. These translated images are then treated as source domain data, and the proposed NLT is applied to achieve the domain adaptation. The final test results on the six real-world datasets are shown in Table III (light cyan row). Compared with the original IFS [11], adding NLT decreases the MAE by 19.8%, 17.6%, 25.7%, 37.9%, 16.0%, and 19.5% on the six real datasets, respectively.
Visualization Results. Fig. 5 shows the visualization results without adaptation and with the proposed NLT. Column 3 shows the results without domain adaptation: the regression results are not acceptable in congested scenes like Shanghai Tech Part A, especially for the grayscale image in Row 2. On Shanghai Tech Part B, the counting results without domain adaptation are close to the ground truth, but problems remain in the details and the background. For example, the red box in Row 3 shows that the regressed values are still too weak despite the correct trend. Besides, estimation errors in the background region also hurt the performance, as the red box in Row 4 shows. After domain adaptation, the above problems are alleviated. In general, NLT improves the density maps both in counting values and in details, which reflects the effectiveness of using NLT to learn the domain shift.
To demonstrate our domain adaptation effect more intuitively, we show more results in Fig. 6. To save space, we only report the ground truths and the prediction results. Judging from the performance on the different datasets, NLT is effective for cross-domain counting tasks at different crowding levels.
V-E Statistical Analysis of Domain Shift
The domain factor γ and the domain bias β are defined as the parameters that model the shift from the source domain to the target domain, initialized to 1 and 0, respectively. Driven by few-shot data, they are updated to narrow the domain gap. In Section V-C, we verified their effectiveness through task performance. In this section, we further analyze the significance of these parameters from the perspective of mathematical statistics.
Each convolutional layer of our network contains one domain factor and one domain bias per convolution kernel. For the well-trained model, we calculate the mean values of the factor and bias at each layer. The statistical results are shown in Fig. 7, where the mean value of the factor is reported relative to its initial value. As shown in Fig. 7 (a), at most layers the mean values of factor and bias fall below their initial values (1 and 0, respectively). The effect of factor and bias is therefore to reduce the parameters of the GCC model; we call this a “down domain shift”. The distribution of UCF-QNRF shown in Fig. 7 (b) is similar to that of Shanghai Tech Part A; both datasets are collected from the Internet, so they share a similar distribution. In Fig. 7 (c), the averages of factor and bias exceed their initial values at most layers; we define this as an “up domain shift”. In Fig. 7 (d), factor and bias are distributed on both sides of their initial values; we define this as an “up-down domain shift”. In addition, comparing Fig. 7 (d), (e), and (f), where different proportions of the training set are treated as few-shot data to learn the domain shift, the resulting distributions are basically the same. This reveals that only a few labeled target-domain images are needed to learn the representation of the domain shift.
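The per-layer statistics can be computed as in the following sketch. The factor/bias arrays are hypothetical, and `layer_shift` is our illustrative helper rather than code from the paper:

```python
import numpy as np

def layer_shift(factors, biases):
    """For each layer, report the mean factor relative to its initial
    value 1 and the mean bias relative to its initial value 0.
    Negative pairs indicate a 'down domain shift', positive pairs an
    'up domain shift', and mixed signs an 'up-down domain shift'."""
    return [(float(np.mean(f)) - 1.0, float(np.mean(b)))
            for f, b in zip(factors, biases)]

# Two hypothetical layers, each with a few per-kernel factors/biases.
factors = [np.array([0.95, 0.90, 0.92]), np.array([1.05, 1.10, 1.02])]
biases  = [np.array([-0.02, -0.01, -0.03]), np.array([0.02, 0.01, 0.04])]

stats = layer_shift(factors, biases)
# Layer 0: both means below the initial values -> down domain shift.
# Layer 1: both means above the initial values -> up domain shift.
```

Plotting these per-layer pairs across all layers gives curves of the kind shown in Fig. 7.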
V-F Analysis of Few-shot Data Selection
Since our domain adaptation method requires a few labeled target-domain images, in this section we discuss the effect of different proportions of few-shot data on NLT. As shown in Fig. 8, we carry out the experiments on the Shanghai Tech Part B dataset. The horizontal axis represents the proportion of training images regarded as few-shot data, while the vertical axis represents the MAE and MSE on the Shanghai Tech Part B test set. The blue and green curves illustrate that NLT performance improves as the amount of few-shot data increases. In addition, compared with traditional supervised training, the proposed NLT is better in every data setting. It can therefore be concluded that NLT is robust to the selection of target-domain data. However, since the original intention of this paper is to use only a few target-domain images to narrow the gap between synthetic and real-world data, we adopt only a small proportion of the training set of each dataset as few-shot data in the reported results.
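Selecting a fixed proportion of the target training set as few-shot data can be sketched as below. This is a generic helper under our own assumptions (the function name, seed, and the 10% example proportion are illustrative), not the paper's released code:

```python
import random

def sample_few_shot(train_images, proportion, seed=0):
    """Randomly select `proportion` of the target-domain training
    images (at least one) to serve as the few-shot set."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = max(1, int(len(train_images) * proportion))
    return rng.sample(train_images, k)

# Shanghai Tech Part B has 400 training images; a 10% setting would
# pick 40 of them (the proportion here is only an example value).
image_ids = [f"img_{i:04d}" for i in range(400)]
few_shot = sample_few_shot(image_ids, proportion=0.10)
```

Sweeping `proportion` over a range of values and re-training NLT at each setting reproduces the kind of curve reported in Fig. 8.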
VI Conclusions
In this paper, we summarize the existing problems of cross-domain crowd counting methods in expressing domain shift and define the domain adaptation problem as a transformation of parameters at the model level. To convert the source model into the target model, we propose a Neuron Linear Transformation (NLT) method to model the domain shift, and the introduced domain-shift parameters are optimized by few-shot learning. Extensive experiments show that our method outperforms other domain adaptation methods that use target-domain data, and that it expresses the domain shift more effectively. Considering the versatility of NLT, in future work we will explore its application to other domain adaptation tasks, such as semantic segmentation and pedestrian Re-ID.
References

[1] (2005) Cross-generalization: learning novel classes from a single example by feature replacement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 672–679. Cited by: §II.
[2] (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision, pp. 734–750. Cited by: §II.
[3] (2019) Feature pyramid networks for crowd counting. Procedia Computer Science 157, pp. 175–182. Cited by: §II.
[4] (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7. Cited by: §I, §IV, §V-B.
[5] (2008) Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (5), pp. 909–926. Cited by: §I.
[6] (2012) Feature mining for localised crowd counting. In BMVC, Vol. 1, pp. 3. Cited by: §I, §V-B.
[7] (2006) Knowledge transfer in learning to recognize visual objects classes. In Proceedings of the International Conference on Development and Learning, pp. 11. Cited by: §II.
[8] (2005) Object classification from a single example utilizing class relevance metrics. In Proceedings of the Advances in Neural Information Processing Systems, pp. 449–456. Cited by: §II.
[9] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135. Cited by: §II.
[10] (2019) C3 framework: an open-source PyTorch code for crowd counting. arXiv preprint arXiv:1907.02724. Cited by: §IV.
[11] (2019) Domain-adaptive crowd counting via inter-domain features segregation and gaussian-prior reconstruction. arXiv preprint arXiv:1912.03677. Cited by: §I, §I, §II, §III-C, TABLE I, 5th item, §V-C, §V-D, §V-D, TABLE III.
[12] (2019) Feature-aware adaptation and structured density alignment for crowd counting in video surveillance. arXiv preprint arXiv:1912.03672. Cited by: §I, §II, §III-C, §IV, §V-D, TABLE III.
[13] (2017) Beyond group: multiple person tracking via minimal topology-energy-variation. IEEE Transactions on Image Processing 26 (12), pp. 5575–5589. Cited by: §I.
[14] (2020) Focus on semantic consistency for cross-domain crowd understanding. arXiv preprint arXiv:2002.08623. Cited by: §I, §II, §III-C, §V-D, TABLE III.
[15] (2017) Clickstream analysis for crowd-based object segmentation with confidence. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2814–2826. Cited by: §I.
[16] One-shot scene-specific crowd counting. Cited by: §II.
[17] (2019) Crowd counting using scale-aware attention networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1280–1288. Cited by: §II, §II.
[18] (2017) Body structure aware deep crowd counting. IEEE Transactions on Image Processing 27 (3), pp. 1049–1059. Cited by: §I.
[19] (2015) Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (10), pp. 1986–1998. Cited by: §I.
[20] (2018) Composition loss for counting, density map estimation and localization in dense crowds. arXiv preprint arXiv:1808.01050. Cited by: §I, §I, §V-B.
[21] (2019) Learning multi-level density maps for crowd counting. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
[22] (2018) Crowd counting by adaptively fusing predictions from an image pyramid. arXiv preprint arXiv:1805.06115. Cited by: §II.
[23] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV.
[24] (2018) Where are the blobs: counting by localization with point supervision. In Proceedings of the European Conference on Computer Vision, pp. 547–562. Cited by: §I.
[25] (2013) Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (1), pp. 18–32. Cited by: §I.
[26] (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100. Cited by: §II.
[27] (2019) Indoor crowd counting by mixture of gaussians label distribution learning. IEEE Transactions on Image Processing 28 (11), pp. 5691–5701. Cited by: §I.
[28] (2019) Recurrent attentive zooming for joint crowd counting and precise localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1217–1226. Cited by: §II.
[29] (2018) Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601. Cited by: §II.
[30] (2019) Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5099–5108. Cited by: §II.
[31] (2014) Learning to track multiple targets. IEEE Transactions on Neural Networks and Learning Systems 26 (5), pp. 1060–1073. Cited by: §I.
[32] (2019) Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6142–6151. Cited by: §I.
[33] (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141. Cited by: §II.
[34] (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §II.
[35] (2016) Towards perspective-free object counting with deep learning. In Proceedings of the European Conference on Computer Vision, pp. 615–629. Cited by: §II.
[36] (2018) Iterative crowd counting. In Proceedings of the European Conference on Computer Vision, pp. 270–285. Cited by: §II.
[37] (2016) Optimization as a model for few-shot learning. Cited by: §II.
[38] (2020) Few-shot scene adaptive crowd counting using meta-learning. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 2814–2823. Cited by: §II.
[39] (2015) Recent survey on crowd density estimation and counting for visual surveillance. Engineering Applications of Artificial Intelligence 41, pp. 103–114. Cited by: §I.
[40] (2017) Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4031–4039. Cited by: §II.
[41] (2016) Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning, pp. 1842–1850. Cited by: §II.
[42] (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1861–1870. Cited by: §II.
[43] (2019) HA-CCN: hierarchical attention-based crowd counting network. IEEE Transactions on Image Processing 29, pp. 323–335. Cited by: §I, §I.
[44] (2017) Prototypical networks for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §II.
[45] (2019) Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412. Cited by: §III-B.
[46] (2019) PaDNet: pan-density crowd counting. IEEE Transactions on Image Processing. Cited by: §I.
[47] (2016) Matching networks for one shot learning. In Proceedings of the Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §II.
[48] (2019) Adaptive density map generation for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1130–1139. Cited by: §I.
[49] (2019) Residual regression with semantic prior for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4031–4040. Cited by: §I.
[50] (2020) Detecting coherent groups in crowd scenes by multi-view clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (1), pp. 46–58. Cited by: §I.
[51] (2020) NWPU-Crowd: a large-scale benchmark for crowd counting. arXiv preprint arXiv:2001.03360. Cited by: §I.
[52] (2019) Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8198–8207. Cited by: §I, §I, §II, §III-C, §IV, 5th item, §V-B, §V-D, §V-D, TABLE III.
[53] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §V-A.
[54] (2019) Adaptive scenario discovery for crowd counting. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2382–2386. Cited by: §II.
[55] (2016) Pedestrian behavior modeling from stationary crowds with applications to intelligent surveillance. IEEE Transactions on Image Processing 25 (9), pp. 4354–4368. Cited by: §I, §I.
[56] (2016) Data-driven crowd understanding: a baseline for a large-scale crowd dataset. IEEE Transactions on Multimedia 18 (6), pp. 1048–1061. Cited by: §V-B.
[57] (2018) Crowd counting via scale-adaptive convolutional neural network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1113–1121. Cited by: §II.
[58] (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597. Cited by: §II, §V-B.
[59] (2015) Crowd counting in public video surveillance by label distribution learning. Neurocomputing 166, pp. 151–163. Cited by: §I.
[60] (2019) CAM-RNN: co-attention model based RNN for video captioning. IEEE Transactions on Image Processing 28 (11), pp. 5552–5565. Cited by: §I.
[61] (2019) Property-constrained dual learning for video summarization. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
[62] (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12736–12745. Cited by: §I.
[63] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint. Cited by: §II, §V-D, TABLE III.