Leveraging Self-Supervision for Cross-Domain Crowd Counting

03/30/2021 ∙ by Weizhe Liu, et al. ∙ EPFL

State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, these data-driven approaches rely on large amounts of annotated data to achieve good performance, which prevents them from being deployed in emergencies, during which annotations are either too costly or cannot be obtained fast enough. One popular solution is to use synthetic data for training. Unfortunately, due to domain shift, the resulting models generalize poorly on real imagery. We remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to tell upside-down real images from regular ones, and we incorporate into it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning purposes. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting methods without any extra computation at inference time.


1 Introduction

Figure 1: Motivation. Top row: Synthetic and real images unseen during training. Middle row: Ground-truth people density maps. The total number of people obtained by integrating these maps is overlaid on the images. Bottom row: People density maps estimated by the network of [86], with the estimated total number of people overlaid. Because the network has been trained on synthetic images, the estimated number of people in the synthetic image is very close to the correct one. This is not the case for the real image because of the large domain shift between synthetic and real images.

Crowd counting is important for applications such as video surveillance and traffic control. For example, during the current COVID-19 pandemic, it has a role to play in monitoring social distancing and slowing down the spread of the disease. Most state-of-the-art approaches rely on regressors to estimate the local crowd density in individual images, which they then proceed to integrate over portions of the images to produce people counts. The regressors typically use Random Forests [35], Gaussian Processes [4], or more recently Deep Networks [101, 107, 56, 68, 90, 76, 71, 49, 37, 67, 74, 102, 42, 48, 66, 26, 60, 3], with most state-of-the-art approaches now relying on the latter.

Unfortunately, training such deep networks in a traditional supervised manner requires a great deal of ground-truth annotation. This is expensive and time-consuming and has slowed down the deployment of data-driven approaches. One way around this difficulty is to use synthetic data for training purposes. However, there is usually too much domain shift—change in statistical properties—between real and synthetic images for networks trained in this manner to perform well, as shown in Fig. 1.

In this paper, we remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. We force our network to learn perspective-aware features on the real images and build into it the ability to use these features to predict its own uncertainty using a fast variant of the ensemble method [13] to effectively use pseudo labels for fine-tuning. We train it as follows:

  1. Initially, we use synthetic images, unlabeled real images, and upside-down versions of the latter. We train the network not only to give good results on the synthetic images but also to recognize whether the real images are upside-up or upside-down. This simple approach to self-supervision forces the network to learn features that are perspective-aware on the real images.

  2. At the end of this first training phase, in which we perform image-wise self-supervision on the real images, our network is semi-trained and the uncertainties attached to the people densities it estimates have meaning. We exploit them to provide pixel-wise self-supervision by treating the densities the network is confident about as pseudo labels, which we use as if they were ground-truth labels to re-train the network. We iterate this process until convergence.

Our contribution is therefore a novel approach to self-supervision for cross-domain crowd counting that relies on stochastic density maps, that is, maps with uncertainties attached to them, instead of the more traditional deterministic density maps. Furthermore, it explicitly leverages a specificity of the crowd counting problem, namely the fact that perspective distortion affects density counts. We will show that it consistently outperforms the state-of-the-art cross-domain crowd counting methods.

2 Related Work

Given a single image of a crowded scene, the currently dominant approach to counting people is to train a deep network to regress a people density estimate at every image location. This density is then integrated to deliver an actual count [43, 50, 72, 45, 27, 108, 103, 84, 38, 40, 96, 55, 41, 91]. Most methods work on counting people from individual images [92, 73, 77, 9, 83, 99, 100] while others account for temporal consistency in video sequence [90, 104, 14, 44, 47, 46].

While effective, these approaches require a large annotated dataset for training purposes, which is hard to obtain in many real-world scenarios. Unsupervised domain adaptation seeks to address this difficulty. We discuss earlier approaches to it, first in a generic context and then for the specific purpose of crowd counting.

Unsupervised Domain Adaptation.

Unsupervised domain adaptation aims to align the source and target domain feature distributions given annotated data only in the source domain. A popular approach is to learn domain-invariant features by adversarial learning [80, 16, 21, 81, 7, 22, 65, 105, 8, 106, 31, 54, 10, 11, 24, 53, 89], which leverages an extra discriminator network to narrow the gap between the two domains. Another way to bridge the domain gap is to define a specific domain shift metric that is then minimized during training [51, 52, 28, 12, 82, 58, 29, 62, 33, 95, 39, 34, 93, 94, 36, 59]. Other widely used approaches include generating realistic-looking synthetic images [69, 20, 2, 98, 97], incorporating self-training [70, 6, 18, 75], transferring model weights between different domains [63, 64], and using domain-specific batch normalization [5]. The method of [79] introduces a self-supervised auxiliary task, such as detecting image rotation in unlabeled target-domain images, for cross-domain image classification and served as an inspiration to us.

Crowd Counting.

Most of the techniques described above are intended for classification problems and very few have been demonstrated for crowd counting purposes.

One exception is the line of work of [86, 17, 87], which trains the deep model on synthetic images and then narrows the domain gap by using a CycleGAN [109] extension to translate the synthetic images so that they look real, and then re-trains the model on these translated images. A limitation of this work is that the translated images, while more realistic than the original synthetic ones, are still not truly real.

Another exception is the method of [78]. It uses pseudo labels generated by a network trained on synthetic images as though they were ground-truth labels. It relies on Gaussian Processes to estimate the variance of the pseudo labels and to minimize it. However, the uncertainty of these pseudo labels is not estimated or taken into account, and the computational requirements can become very large when many synthetic images are used simultaneously.

The method of [19] uses adversarial learning to align features across different domains. However, it relies on extra discriminator networks, which are complicated and hard to train. The methods of [61, 23, 88] leverage a few target labels to bridge the domain gap and therefore incur extra annotation costs.

By contrast to these approaches, ours explicitly takes uncertainty into account and leverages a specificity of the crowd counting problem, namely the fact that perspective distortion matters.

3 Approach

We propose a fully unsupervised approach to fine-tuning a network that has been trained on annotated synthetic data, so that it can operate effectively on real data despite a potentially large domain shift. At the heart of our method is a network that estimates people density at every location while incorporating a variant of the deep ensemble approach [13] to provide uncertainties about these estimates. The key to success is to first pre-train this network so that these uncertainties are meaningful and then to exploit them to recursively fine-tune the network.

Figure 2: Two-stage approach. Top: During the first training stage, we use synthetic images, real images, and flipped versions of the latter. The network is trained to output the correct people density for the synthetic images and to classify the real images as being flipped or not. Bottom: During the second training stage, we use synthetic and real images. We run the previously trained network on the real images and treat the least uncertain people density estimates as pseudo labels. We then fine-tune the network on both kinds of images and iterate the process.

We have therefore developed a two-stage approach that first relies on real images and upside-down versions of these to provide an image-wise supervisory signal. We use them to train the network not only to give good results on the synthetic images but also to recognize whether the real images are upside-up or upside-down. This yields a partially-trained network that can operate on real images and return meaningful uncertainty values along with the density values. We then exploit these uncertainties to provide a pixel-wise supervisory signal by treating the people density estimates the network is most confident about as pseudo labels, which we treat as ground truth and use to re-train the network. We iterate this process until the network predictions stabilize. Fig. 2 depicts our complete approach.

3.1 Network Architecture

Formally, let $\mathcal{S} = \{(\mathbf{x}^s_i, \mathbf{y}^s_i)\}_{i=1}^{N_s}$ be a synthetic source-domain dataset, where $\mathbf{x}^s_i$ denotes a color synthetic image and $\mathbf{y}^s_i$ the corresponding crowd density map. The target-domain dataset is defined as $\mathcal{T} = \{\mathbf{x}^t_j\}_{j=1}^{N_t}$ without ground-truth crowd density labels, where $\mathbf{x}^t_j$ denotes a color real image. In most real-world scenarios, we have $N_t \ll N_s$. Our goal is to learn a model that performs well on the target-domain data.

To this end, we use a state-of-the-art encoder/decoder architecture for people density estimation [86]. We chose this one because it has already been used by cross-domain crowd counting approaches and therefore allows for a fair comparison of our approach against earlier ones. Let $E$ and $D_{\text{den}}$ be the encoder and decoder networks that jointly form the people density estimation network of [86]. Given an input image $\mathbf{x}$, $E$ returns the deep features $\mathbf{f} = E(\mathbf{x})$ that $D_{\text{den}}$ takes as input to return the density map $\mathbf{y} = D_{\text{den}}(\mathbf{f})$.

One way to enable self-supervision for classification purposes is to use a partially trained network to predict labels and associated probabilities, and to treat the most probable ones as pseudo labels that can be used for training purposes as though they were ground-truth labels [98, 97]. This strategy is widely used to provide pixel-wise [111] and image-wise [110] self-supervision to address classification problems. If the probability measure is reliable and allows the discarding of potentially erroneous labels, repeating this procedure several times results in the network being progressively refined without any need for ground-truth labels.

Figure 3: Masksembles approach. During training, for every input vector, a binary mask is selected from a set of pre-generated masks and is used to zero out a corresponding set of features. Performing the inference several times using different masks then yields an ensemble-like behavior.

To implement a similar mechanism in our context, we need more than labels at the image-level. We require estimates of which individual densities in an estimated density map are likely correct and which are not. In other words, we need a stochastic crowd density map instead of the deterministic one that existing methods produce. Among all the methods that can be used to turn our network into one that returns such stochastic density maps, MC-Dropout [15] and Deep Ensembles [32] have emerged as two of the most popular ones. Both of those methods exploit the concept of ensembles to produce uncertainty estimates. Deep Ensembles are widely acknowledged to yield significantly more reliable uncertainty estimates [57, 1]. However, they require training many different copies of the network, which can be very slow and memory consuming. Instead, we rely on Masksembles, a recent approach [13] that operates on the same basic principle as MC-Dropout. However, instead of achieving randomness by dropping different subsets of weights for each observed sample, it relies on a set of pre-computed binary masks that specify the network parameters to be dropped. Fig. 3 depicts this process.

In practice, we attach a Masksembles layer to the first convolutional layer of the decoder. During training, for each sample in a batch we randomly choose one of the masks and set the corresponding weights to one or zero in the Masksembles layer, which drops the corresponding parts of the model just like standard dropout. During inference, we run the model multiple times, once per mask, to obtain a set of predictions and, ultimately, an uncertainty estimate. This provides uncertainty estimates that are almost as reliable as those of Deep Ensembles but without having to train multiple networks, and it is therefore much faster and easier to train. Formally, we write

$$\bar{\mathbf{y}} = \frac{1}{M} \sum_{m=1}^{M} D_{\text{den}}^{m}\big(E(\mathbf{x})\big) , \qquad (1)$$

$$\boldsymbol{\sigma} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \Big(D_{\text{den}}^{m}\big(E(\mathbf{x})\big) - \bar{\mathbf{y}}\Big)^{2}} , \qquad (2)$$

where $\mathbf{x}$ is the input image, $M$ is the number of masks, and $D_{\text{den}}^{m}$ is the modified network used with mask $m$. $\bar{\mathbf{y}}$ and $\boldsymbol{\sigma}$ have the same size as the input image, and we treat the individual values of $\boldsymbol{\sigma}$ as pixel-wise uncertainties.
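To make this concrete, here is a minimal PyTorch-style sketch, not the authors' released code, of how a set of pre-computed binary masks can be applied to the features entering the decoder and how the per-mask predictions can be aggregated into a mean density map and a pixel-wise uncertainty map. The `Encoder` and `Decoder` modules, the number of masks, and the keep ratio are placeholders.

```python
import torch
import torch.nn as nn


class MasksemblesDensityNet(nn.Module):
    """Density estimator with a Masksembles-style layer between encoder and decoder.

    A fixed set of binary channel masks is generated once. During training, one mask
    is picked at random per sample; at inference the model is run once per mask to
    obtain an ensemble-like set of density maps.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 num_masks: int = 4, feat_channels: int = 512, keep_ratio: float = 0.75):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        # Pre-compute num_masks binary masks over the decoder's input channels.
        keep = max(1, int(feat_channels * keep_ratio))
        masks = torch.zeros(num_masks, feat_channels)
        for m in range(num_masks):
            kept = torch.randperm(feat_channels)[:keep]
            masks[m, kept] = 1.0
        self.register_buffer("masks", masks)  # shape: (num_masks, feat_channels)

    def forward(self, x: torch.Tensor, mask_id: int) -> torch.Tensor:
        feats = self.encoder(x)                           # (B, C, H, W) deep features
        mask = self.masks[mask_id].view(1, -1, 1, 1)      # broadcast over batch and space
        return self.decoder(feats * mask)                 # masked features -> density map

    @torch.no_grad()
    def predict_with_uncertainty(self, x: torch.Tensor):
        """Mean density map and pixel-wise standard deviation over all masks (Eqs. 1-2)."""
        preds = torch.stack([self.forward(x, m) for m in range(self.masks.shape[0])])
        return preds.mean(dim=0), preds.std(dim=0)
```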

3.2 Image-Wise Self-Supervision

Figure 4: Upside-up vs Upside-down. (a) Original image. Due to perspective effects, the 2D projection of people is smaller at the top of the image and the people density appears to be larger. (b) In the upside-down image the effect is reversed. To allow the decoder to distinguish between these two cases, the encoder must produce perspective-aware features that can operate in the real images.

The encoder $E$ and density decoder $D_{\text{den}}$ can be trained in a supervised fashion using the synthetic training set $\mathcal{S}$, but that does not guarantee that they will work well on real images. Hence, we introduce the auxiliary task decoder $D_{\text{cls}}$ shown at the top of Fig. 2, whose task is to classify an image as being oriented normally or being upside-down from the features produced by the encoder $E$. To train the resulting two-branch network, we use synthetic images from $\mathcal{S}$ along with real images from $\mathcal{T}$ and flipped versions of these, such as the ones shown in Fig. 4. For the synthetic images, the output of $D_{\text{den}}$ should minimize the usual loss given the ground-truth density maps and, for the real images, the output of $D_{\text{cls}}$ should minimize a cross-entropy loss for binary classification as being either upside-up or upside-down.

Formally, we introduce the loss function

$$L_1 = \sum_{(\mathbf{x}^s, \mathbf{y}^s) \in \mathcal{S}} L_{\text{den}}(\mathbf{x}^s, \mathbf{y}^s) + \lambda \sum_{\mathbf{x}^t \in \mathcal{T}} L_{\text{cls}}(\mathbf{x}^t, c^t) , \qquad (3)$$

which we minimize with respect to the weights of the encoder $E$ and the two decoders $D_{\text{den}}$ and $D_{\text{cls}}$, where $\lambda$ balances the two terms. $L_{\text{den}}$ is the distance between the predicted people density map and the ground-truth one, while $L_{\text{cls}}$ is the cross-entropy loss for binary classification given the ground-truth upside-up or upside-down label $c^t$ for image $\mathbf{x}^t$. We use this label only for the real images because we have ground-truth annotations for the synthetic ones. As will be shown in the results section, these annotations provide sufficient supervision for the synthetic images, and applying the image-wise supervision to them as well brings no obvious improvement.

Note that $D_{\text{den}}$ and $D_{\text{cls}}$ use the same encoder $E$. To minimize $L_{\text{cls}}$ and hence correctly estimate whether an input image is upside-down or not, $E$ must extract meaningful features from the real images and not only from the synthetic ones. Furthermore, these features must enable the decoder $D_{\text{cls}}$ to handle scene perspective, that is, the fact that people densities are typically higher at the top of the image than at the bottom in upside-up images. In other words, minimizing $L_{\text{cls}}$ forces $E$ to produce perspective-aware features, while minimizing $L_{\text{den}}$ forces the decoder $D_{\text{den}}$ to operate on such features to properly estimate people densities on the synthetic images. In this way, we make $E$ produce features that are appropriate both for synthetic and real images, hence mitigating the domain shift between the two, as will be demonstrated in the results section.
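As a concrete illustration of Eq. 3, the following sketch performs one first-stage gradient step, assuming PyTorch modules `encoder`, `density_head`, and `flip_head` standing in for $E$, $D_{\text{den}}$, and $D_{\text{cls}}$, a pixel-wise squared error as the density loss, and an illustrative weight `lambda_cls`; these are assumptions rather than the exact published training configuration.

```python
import torch
import torch.nn.functional as F


def first_stage_step(encoder, density_head, flip_head, optimizer,
                     syn_img, syn_density, real_img, lambda_cls=1.0):
    """One gradient step of the image-wise self-supervision stage (cf. Eq. 3).

    Synthetic images are supervised with their ground-truth density maps; real
    images are randomly flipped upside-down and the auxiliary head must predict
    whether the flip happened.
    """
    # Randomly decide whether to turn the real image upside-down.
    flipped = bool(torch.rand(()) > 0.5)
    if flipped:
        real_img = torch.flip(real_img, dims=[-2])   # flip along the vertical axis

    # Density loss on the synthetic image (L_den), here a pixel-wise squared error.
    pred_density = density_head(encoder(syn_img))
    loss_den = F.mse_loss(pred_density, syn_density)

    # Binary cross-entropy on the real image's orientation (L_cls).
    flip_logit = flip_head(encoder(real_img)).view(-1)          # one logit per image
    flip_label = torch.full_like(flip_logit, 1.0 if flipped else 0.0)
    loss_cls = F.binary_cross_entropy_with_logits(flip_logit, flip_label)

    loss = loss_den + lambda_cls * loss_cls
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```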

Source-domain data $\mathcal{S}$.
Unlabeled target-domain data $\mathcal{T}$.
procedure First Stage($\mathcal{S}$ and $\mathcal{T}$)
     Initialize the weights of the people density estimation network with single encoder $E$ and two decoders $D_{\text{den}}$ and $D_{\text{cls}}$
     for each of $N_1$ gradient iterations do
          Pick one source-domain image $\mathbf{x}^s$
          Pick one target-domain image $\mathbf{x}^t$
          Generate one random variable $p \sim U(0, 1)$
          if $p > 0.5$ then
               Flip $\mathbf{x}^t$ upside-down
          else
               Do nothing
          end if
          Minimize $L_1$ of Eq. 3
     end for
end procedure
Generate pseudo labels for $\mathcal{T}$ using the network trained in the first stage
procedure Second Stage($\mathcal{S}$, $\mathcal{T}$, and pseudo labels for $\mathcal{T}$)
     for each of $N_2$ recursive iterations do
          for each of $N_3$ gradient iterations do
               Pick one source-domain image $\mathbf{x}^s$
               Pick one target-domain image $\mathbf{x}^t$
               Minimize $L_2$ of Eq. 4
          end for
          Update the pseudo labels
     end for
end procedure
Algorithm 1 Two-Stage Training Algorithm

This first training stage is summarized by the first procedure of Alg. 1.

3.3 Pixel-Wise Self-Supervision

After the first training stage described above, our model can produce both a density map $\bar{\mathbf{y}}$ and its corresponding uncertainty $\boldsymbol{\sigma}$. Let $F_0$ be the corresponding network. We can now refine its weights to create increasingly better tuned networks $F_k$ for $k \geq 1$ by iteratively minimizing

$$L_2 = \sum_{(\mathbf{x}^s, \mathbf{y}^s) \in \mathcal{S}} L_{\text{den}}(\mathbf{x}^s, \mathbf{y}^s) + \lambda_t \sum_{\mathbf{x}^t \in \mathcal{T}} L_{\text{den}}\big(\mathbf{M}^t \odot D_{\text{den}}(E(\mathbf{x}^t)),\, \mathbf{M}^t \odot \tilde{\mathbf{y}}^t\big) , \qquad (4)$$

where $\tilde{\mathbf{y}}^t = F_{k-1}(\mathbf{x}^t)$ is the pseudo label for image $\mathbf{x}^t$ and $\mathbf{M}^t$ is a binary mask that is one for all densities whose uncertainty falls below the threshold $\theta$, that is, the most uncertain estimates are discarded. In other words, at each iteration we use the densities produced by $F_{k-1}$ for which the uncertainty is low enough as pseudo labels to train $F_k$.

This second training stage is summarized by the second procedure of Alg. 1.
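The selection of dependable pseudo labels can be sketched as below, assuming the `predict_with_uncertainty` helper from the earlier Masksembles sketch and a keep ratio `theta` corresponding to the fraction of most certain pixels that are retained; this is an illustrative implementation of the masking in Eq. 4, not the authors' code.

```python
import torch


@torch.no_grad()
def make_pseudo_labels(model, real_img, theta=0.9):
    """Pseudo labels and a binary mask keeping only the theta most certain pixels."""
    density, uncertainty = model.predict_with_uncertainty(real_img)
    # Per-image uncertainty threshold: keep pixels below the theta-quantile.
    thresholds = torch.quantile(uncertainty.flatten(1), theta, dim=1)
    mask = (uncertainty <= thresholds.view(-1, 1, 1, 1)).float()
    return density, mask


def pseudo_label_loss(pred_density, pseudo_density, mask):
    """Masked squared error used as the target-domain term of Eq. 4 (sketch)."""
    diff = (pred_density - pseudo_density) ** 2
    return (diff * mask).sum() / mask.sum().clamp(min=1.0)
```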

4 Experiments

In this section, we first introduce the evaluation metrics and benchmark datasets we use in our experiments. We then provide the implementation details and compare our approach to state-of-the-art methods. Finally, we perform a detailed ablation study.

Figure 5: Density maps (input image, ground truth, estimated people density). We indicate the ground-truth and estimated total number of people in the bottom left corner of the density maps. Note how close our estimates are to the ground truth. Please refer to the supplementary material for additional such images.

4.1 Evaluation Metrics

Previous works in crowd density estimation use the mean absolute error (MAE) and the root mean squared error (RMSE) as evaluation metrics [86, 78]. They are defined as

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right| , \qquad \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( z_i - \hat{z}_i \right)^2} ,$$

where $N$ is the number of test images, $z_i$ denotes the true number of people inside the ROI of the $i$-th image and $\hat{z}_i$ the estimated number of people. In the benchmark datasets discussed below, the ROI is the whole image except when explicitly stated otherwise. The number of people is recovered by integrating over the pixels of the predicted density maps.
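For reference, the two metrics can be computed from per-image true and estimated counts as follows; the counts themselves are obtained by summing each predicted density map over its ROI. This is a straightforward sketch with hypothetical variable names.

```python
import numpy as np


def counting_errors(true_counts, pred_counts):
    """MAE and RMSE over the per-image people counts of a test set."""
    true_counts = np.asarray(true_counts, dtype=np.float64)
    pred_counts = np.asarray(pred_counts, dtype=np.float64)
    mae = np.mean(np.abs(true_counts - pred_counts))
    rmse = np.sqrt(np.mean((true_counts - pred_counts) ** 2))
    return mae, rmse


# Counts are obtained by summing each predicted density map over its ROI, e.g.:
# pred_counts = [density_map[roi].sum() for density_map, roi in zip(density_maps, rois)]
```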

4.2 Benchmark Datasets

GCC [86].

It is the synthetic dataset we use. It consists of 15,212 images containing 7,625,843 people annotations. It features 400 different scenes, including both indoor and outdoor ones.

ShanghaiTech [107].

It is a real image dataset that comprises 1,198 annotated images with 330,165 people in them. It is divided into Part A with 482 images and Part B with 716. In Part A, 300 images form the training set and, in Part B, 400 do. The remainder are used for testing purposes.

UCF-QNRF [26].

It is a real image dataset that comprises 1,535 images with 1,251,642 people in them. The training set comprises 1,201 of these images. Unlike in ShanghaiTech, there are dramatic variations both in crowd density and image resolution.

UCF_CC_50 [25].

It is a real image dataset that contains only 50 images, with people counts ranging from 94 to 4,543, which makes it challenging for a deep-learning approach. For a fair comparison, we use the same 5-fold cross-validation protocol as in [86, 78]: We partition the images into five 10-image groups. In turn, we then pick four groups for training and the remaining one for testing. This gives us 5 sets of results and we report their average.

WorldExpo’10 [101].

It is a real image dataset that comprises 1,132 annotated video sequences collected from 103 different scenes. There are 3,980 annotated frames, with 3,380 of them used for training purposes. Each scene contains a Region Of Interest (ROI) in which people are counted. As in previous work [86], we report the MAE for each scene along with the average over all scenes.

(a) ShanghaiTech Part A (MAE / RMSE):
No Adapt: 160.0 / 216.5
Cycle-GAN [109]: 143.3 / 204.3
SE Cycle-GAN [86]: 123.4 / 193.4
SE Cycle-GAN(JT) [85]: 119.6 / 189.1
SE+FD [19]: 129.3 / 187.6
GP [78]: 121 / 181
OURS: 109.2 / 168.1

(b) ShanghaiTech Part B (MAE / RMSE):
No Adapt: 22.8 / 30.6
Cycle-GAN [109]: 25.4 / 39.7
SE Cycle-GAN [86]: 19.9 / 28.3
SE Cycle-GAN(JT) [85]: 16.4 / 25.8
SE+FD [19]: 16.9 / 24.7
GP [78]: 12.8 / 19.2
OURS: 11.4 / 17.3

(c) UCF_CC_50 (MAE / RMSE):
No Adapt: 487.2 / 689.0
Cycle-GAN [109]: 404.6 / 548.2
SE Cycle-GAN [86]: 373.4 / 528.8
SE Cycle-GAN(JT) [85]: 370.2 / 512.0
GP [78]: 355 / 505
OURS: 336.5 / 486.1

(d) UCF-QNRF (MAE / RMSE):
No Adapt: 275.5 / 458.5
Cycle-GAN [109]: 257.3 / 400.6
SE Cycle-GAN [86]: 230.4 / 384.5
SE Cycle-GAN(JT) [85]: 225.9 / 385.7
SE+FD [19]: 221.2 / 390.2
GP [78]: 210 / 351
OURS: 198.3 / 332.9

(e) WorldExpo'10 (MAE per scene):
Model: Scene1 / Scene2 / Scene3 / Scene4 / Scene5 / Average
No Adapt: 4.4 / 87.2 / 59.1 / 51.8 / 11.7 / 42.8
Cycle-GAN [109]: 4.4 / 69.6 / 49.9 / 29.2 / 9.0 / 32.4
SE Cycle-GAN [86]: 4.3 / 59.1 / 43.7 / 17.0 / 7.6 / 26.3
SE Cycle-GAN(JT) [85]: 4.2 / 49.6 / 41.3 / 19.8 / 7.2 / 24.4
GP [78]: - / - / - / - / - / 20.4
OURS: 4.0 / 31.9 / 23.5 / 19.4 / 4.2 / 16.6
Table 1: Comparative results on different datasets. (a) ShanghaiTech Part A. (b) ShanghaiTech Part B. (c) UCF_CC_50. (d) UCF-QNRF. (e) WorldExpo’10. Our approach consistently and clearly outperforms previous state-of-the-art methods on all the datasets.

4.3 Implementation Details

For a fair comparison with previous work [86, 78], we use SFCN [86] as the crowd density regressor and Adam [30] for parameter updates with a fixed learning rate. After a grid search on a single dataset, as discussed below, we set the loss weight $\lambda$ of Eq. 3 and the loss weight $\lambda_t$ and uncertainty threshold $\theta$ of Eq. 4 once and use the same values for all our experiments.

To estimate uncertainty, we generate one stochastic density map per mask for each image and take their standard deviation to be our uncertainty measure. We set the threshold value $\theta$ of Eq. 4 to 90%, which means that the 10% most uncertain pseudo labels are discarded and that we keep the other 90% as pseudo labels for model training. This large percentage is appropriate because there are large areas of the real images that do not contain anyone and for which the pseudo labels are very dependable. We will show below that removing only 10% of the labels suffices to substantially boost performance over keeping all pseudo labels.

Recall that we drop the auxiliary decoder $D_{\text{cls}}$ in the second training stage. In the final evaluation phase, we generate only one density map for each image instead of averaging multiple estimates; as we show in the supplementary material, the performance is similar in both cases. Hence our model does not require any extra computation at inference time. Fig. 5 depicts qualitative results on the ShanghaiTech Part B dataset, and we provide additional ones in the supplementary material along with more details about the model.

4.4 Comparing against Recent Techniques

In Tab. 1, we compare our results to those of state-of-the-art domain adaptation approaches for each one of the public benchmark datasets, as currently reported in the literature. In each case, we reprint the results as given in these papers and add those of OURS, that is, of our method. We consistently and clearly outperform all other methods on all the datasets. And, since we use the same SFCN network architecture as the methods of [86, 78], the performance boost is directly attributable to our approach of domain adaptation.

In [86], the authors report fully supervised MAE results on ShanghaiTech Part B and UCF-QNRF of 9.4 and 124.7, respectively, to be compared to our own unsupervised values of 11.4 and 198.3. In other words, our unsupervised approach performs almost as well as a supervised one on ShanghaiTech Part B, while there still remains a gap on UCF-QNRF. This is because the crowds in both the synthetic source domain and in ShanghaiTech Part B are still mostly sparse enough for bodies to be visible. By contrast, in UCF-QNRF, the crowds are denser. Hence, it often happens that only heads are visible, which creates a larger domain gap between source and target images. This gap could be bridged in future work either by using a synthetic dataset that itself features denser crowds or, more ambitiously, by using a detection pipeline that focuses more on heads and would naturally reduce the domain gap.

4.5 Ablation Study

We perform an ablation study on the UCF-QNRF dataset to confirm the role of the self-supervision loss terms, the setting of the hyper-parameters, the impact of the stochastic density map, and the choice of auxiliary task, and to compare against other uncertainty estimation techniques.

Self-Supervision.

We compare our complete model against several variants. BASELINE uses the SFCN crowd density estimator trained on the synthetic data and without any domain adaptation. OURS-IMG involves the first image-wise training stage but not the second. OURS-IMG-SYN also involves only the first image-wise training stage but both real and synthetic images can be flipped upside down, whereas in OURS-IMG only the real ones are. Conversely, OURS-PIX skips the first image-wise training and involves only the second pixel-wise training stage. OURS-DUP is similar to our complete approach except for the fact that it uses both pixel-wise and image-wise supervision during the second training stage whereas OURS only uses pixel-wise supervision by that point.

Model (self-supervision: image-wise / image-wise on synthetic / pixel-wise / image-wise in 2nd stage): MAE / RMSE
BASELINE (- / - / - / -): 275.5 / 458.5
OURS-IMG (yes / - / - / -): 242.8 / 407.6
OURS-IMG-SYN (yes / yes / - / -): 243.0 / 406.8
OURS-PIX (- / - / yes / -): 208.3 / 346.9
OURS (yes / - / yes / -): 198.3 / 332.9
OURS-DUP (yes / - / yes / yes): 198.5 / 331.7
Table 2: Ablation study on self-supervision. Both image-wise and pixel-wise self-supervision boost the performance and combining both further improves performance. By contrast, using image-wise self-supervision during the second stage, as opposed to the first, makes no obvious difference.

As shown in Tab. 2, both OURS-IMG and OURS-PIX outperform BASELINE, which shows that both training stages matter. However, OURS does even better, which confirms that properly pre-training the network before using pixel-wise supervision matters. Since OURS-IMG-SYN and OURS-DUP achieve similar performance to OURS-IMG and OURS, respectively, we drop image-wise self-supervision for synthetic images and in the second stage for simplicity.

Table 3: Ablation study on hyper-parameters. Panels (a)-(d) report MAE / RMSE for different values of the hyper-parameters discussed in Sec. 4.3. The best setting achieves 198.3 / 332.9, and we thus use it for all the experiments.

Hyper-Parameter Selection.

We tested different values for the hyper-parameters we use, that is, $\lambda$ in Eq. 3 and $\lambda_t$ and $\theta$ in Eq. 4. As shown in Tab. 3, one particular setting yields the best results on this dataset, and we used the same values for all the others. Note that keeping only the 90% most certain pseudo labels delivers much better performance than keeping all of them, which confirms that throwing away as few as 10% of the pseudo labels makes a very significant difference.

Stochastic Density Map.

To test whether generating a stochastic density map instead of a deterministic one has a significant impact on performance, we compare the performance of BASELINE, which generates a deterministic map, with that of a version of it that includes Masksembles to generate a stochastic map but still performs no domain adaptation. As can be seen in Tab. 4, the version with Masksembles does slightly better, but not by a significant amount. Therefore, Masksembles by itself does not account for the large improvements we saw in Tab. 1.

Model: MAE / RMSE
BASELINE: 275.5 / 458.5
BASELINE+Masksembles: 273.1 / 447.9
Table 4: Ablation study on the stochastic density map. Generating a stochastic density map slightly improves the performance, but not by a significant amount.

Choice of Auxiliary Tasks.

Our choice of upside-down images to provide the self-supervision signal during the first phase of training may seem arbitrary. To show that it is not, we tried variants in which we flip the images left-right (OURS-MIRROR), rotate them by 90 degrees (OURS-90), or rotate them by 270 degrees (OURS-270). As can be seen in Tab. 5, OURS-MIRROR performs on par with OURS-PIX, the model trained without any image-wise supervision. OURS-90 and OURS-270 do slightly better, but OURS is clearly best. This confirms the importance of flipping the images upside-down, which helps the network deal with perspective effects.

Model: MAE / RMSE
OURS-PIX: 208.3 / 346.9
OURS-MIRROR: 208.1 / 346.0
OURS-90: 205.5 / 344.7
OURS-270: 204.8 / 342.1
OURS: 198.3 / 332.9
Table 5: Ablation study on auxiliary task. We tested different auxiliary tasks for image-wise supervision. Flipping the image upside-down yields the best performance and we used it for all other experiments.

Uncertainty Estimation.

Model: Correlation / MAE / RMSE
MC-Dropout [15]: 0.18 / 209.8 / 344.9
Deep Ensembles [32]: 0.44 / 199.7 / 331.8
OURS: 0.46 / 198.3 / 332.9
Table 6: Ablation study on uncertainty estimation. The Masksembles approach we use to measure model uncertainty performs better than MC-Dropout and on par with Deep Ensembles in terms of all three measures, and at a much lower computational cost.

We use Masksembles [13] for uncertainty estimation because of its effectiveness and simplicity. However, we could also have used MC-Dropout [15] or Deep Ensembles [32]. We tested both and report the results in Tab. 6. In addition to the usual MAE and RMSE, we also computed the Pearson correlation coefficient

$$\rho = \frac{\sum_{i=1}^{n} (e_i - \bar{e})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{n} (e_i - \bar{e})^2} \sqrt{\sum_{i=1}^{n} (u_i - \bar{u})^2}} , \qquad (5)$$

where $n$ is the sample size and the $e_i$ and $u_i$ are pixel-wise samples of the counting error and uncertainty value, respectively. $\rho$ lies in $[-1, 1]$ and the higher its value, the more correlated the uncertainty is to the MAE error. In other words, when $\rho$ is large, it makes sense to discard uncertain densities as probably wrong and not to use them as pseudo labels. As can be seen in Tab. 6, using Masksembles [13] as in OURS clearly outperforms MC-Dropout [15] and is comparable with Deep Ensembles [32]. However, training Deep Ensembles takes three times longer, which motivates our use of Masksembles.
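For completeness, the coefficient of Eq. 5 can be computed from paired samples of counting error and uncertainty as in the short sketch below, where `errors` and `uncertainties` are assumed to be matched 1-D arrays.

```python
import numpy as np


def pearson_correlation(errors, uncertainties):
    """Pearson correlation between pixel-wise counting errors and uncertainty values."""
    e = np.asarray(errors, dtype=np.float64)
    u = np.asarray(uncertainties, dtype=np.float64)
    e_c, u_c = e - e.mean(), u - u.mean()
    return float((e_c * u_c).sum() / np.sqrt((e_c ** 2).sum() * (u_c ** 2).sum()))
```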

5 Conclusion

We have proposed an approach that combines image-wise and pixel-wise self-supervision to substantially increase cross-domain crowd counting performance when only annotations for synthetic images are available. However, our approach does not require the source images to be synthetic and could take advantage of additional annotations when available. In future work, we will therefore expand it to use multiple datasets of real-world images with partial annotations.

Acknowledgments This work was supported in part by the Swiss National Science Foundation.

References

  • [1] A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, 2020.
  • [2] M. Binkowski, R. D. Hjelm, and A. Courville. Batch Weight for Domain Adaptation with Mass Shift. In International Conference on Computer Vision, 2019.
  • [3] X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting. In European Conference on Computer Vision, 2018.
  • [4] A.B. Chan and N. Vasconcelos. Bayesian Poisson Regression for Crowd Counting. In International Conference on Computer Vision, pages 545–551, 2009.
  • [5] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han. Domain-Specific Batch Normalization for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [6] C. Chen, W. Xie, W. Huang, Y. Rong, X. Ding, Y. Huang, T. Xu, and J. Huang. Progressive Feature Alignment for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [7] Y. Chen, W. Li, and L. Van Gool. ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes. In Conference on Computer Vision and Pattern Recognition, pages 7892–7901, 2018.
  • [8] Z. Chen, J. Zhuang, X. Liang, and L. Lin. Blending-Target Domain Adaptation by Adversarial Meta-Adaptation Networks. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [9] Z. Cheng, J. Li, Q. Dai, X. Wu, and A. G. Hauptmann. Learning Spatial Awareness to Improve Crowd Counting. In International Conference on Computer Vision, 2019.
  • [10] S. Cicek and S. Soatto. Unsupervised Domain Adaptation via Regularized Conditional Alignment. In International Conference on Computer Vision, 2019.
  • [11] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian. Gradually Vanishing Bridge for Adversarial Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [12] Z. Deng, Y. luo, and J. Zhu. Cluster Alignment with a Teacher for Unsupervised Domain Adaptation. In International Conference on Computer Vision, 2019.
  • [13] N. Durasov, T. Bagautdinov, P. Baque, and P. Fua. Masksembles for Uncertainty Estimation. In Conference on Computer Vision and Pattern Recognition, 2021.
  • [14] Y. Fang, B. Zhan, W. Cai, S. Gao, and B. Hu. Locality-Constrained Spatial Transformer Network for Video Crowd Counting. International Conference on Multimedia and Expo, 2019.
  • [15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • [16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • [17] J. Gao, T. Han, Q. Wang, and Y. Yuan. Domain-adaptive Crowd Counting via Inter-domain Features Segregation and Gaussian-prior Reconstruction. In arXiv Preprint, 2019.
  • [18] X. Gu, J. Sun, and Z. Xu. Spherical Space Domain Adaptation with Robust Pseudo-Label Loss. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [19] T. Han, J. Gao, Y. Yuan, and Q. Wang. Focus on Semantic Consistency for Cross-Domain Crowd Understanding. In International Conference on Acoustics, Speech, and Signal Processing, 2020.
  • [20] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle Consistent Adversarial Domain Adaptation. In International Conference on Machine Learning, pages 1989–1998, 2018.
  • [21] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the Wild: Pixel-Level Adversarial and Constraint-Based Adaptation. In arXiv Preprint, 2016.
  • [22] W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional Generative Adversarial Network for Structured Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2018.
  • [23] M. A. Hossain, M. Kumar K, M. Hosseinzadeh, O. Chanda, and Y. Wang. One-Shot Scene-Specific Crowd Counting. In British Machine Vision Conference, 2019.
  • [24] L. Hu, M. Kan, S. Shan, and X. Chen. Unsupervised Domain Adaptation with Hierarchical Gradient Synchronization. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [25] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images. In Conference on Computer Vision and Pattern Recognition, pages 2547–2554, 2013.
  • [26] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah. Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In European Conference on Computer Vision, 2018.
  • [27] X. Jiang, Z. Xiao, B. Zhang, and X. Zhen. Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [28] G. Kang, L. Jiang, Y. Yang, and A.G. Hauptmann. Contrastive Adaptation Network for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [29] M. Kim, P. Sabu, B. Gholami, and V. Pavlovic. Unsupervised Visual Domain Adaptation: A Deep Max-Margin Gaussian Process Approach. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [30] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimisation. In International Conference on Learning Representations, 2015.
  • [31] V.K. Kurmi, S. Kumar, and V.P. Namboodiri. Attending to Discriminative Certainty for Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [32] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems, 2017.
  • [33] C.-Y. Lee, T. Batra, M. H. Baig, and D. Ulbricht. Sliced Wasserstein Discrepancy for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [34] S. Lee, D. Kim, N. Kim, and S.-G. Jeong. Drop to Adapt:learning Discriminative Features for Unsupervised Domain Adaptation. In International Conference on Computer Vision, 2019.
  • [35] V. Lempitsky and A. Zisserman. Learning to Count Objects in Images. In Advances in Neural Information Processing Systems, 2010.
  • [36] M. Li, Y. Zhai, Y. Luo, P. Ge, and C. Ren. Enhanced Transport Distance for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [37] Y. Li, X. Zhang, and D. Chen. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [38] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao. Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [39] J. Liang, R. He, Z. Sun, and T. Tan. Distant Supervised Centroid Shift: A Simple and Efficient Approach to Visual Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [40] C. Liu, X. Weng, and Y. Mu. Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [41] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin. Crowd Counting with Deep Structured Scale Integration Network. In International Conference on Computer Vision, 2019.
  • [42] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. Crowd Counting Using Deep Recurrent Spatial-Aware Network. In International Joint Conference on Artificial Intelligence, 2018.
  • [43] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu. Adcrowdnet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [44] W. Liu, K. Lis, M. Salzmann, and P. Fua. Geometric and Physical Constraints for Drone-Based Head Plane Crowd Density Estimation. International Conference on Intelligent Robots and Systems, 2019.
  • [45] W. Liu, M. Salzmann, and P. Fua. Context-Aware Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [46] W. Liu, M. Salzmann, and P. Fua. Counting People by Estimating People Flows. In arXiv Preprint, 2020.
  • [47] W. Liu, M. Salzmann, and P. Fua. Estimating People Flows to Better Count Them in Crowded Scenes. In European Conference on Computer Vision, 2020.
  • [48] X. Liu, J.V.D.Weijer, and A.D. Bagdanov. Exploiting Unlabeled Data in CNNs by Self-Supervised Learning to Rank. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), August 2019.
  • [49] X. Liu, J.V.d. Weijer, and A.D. Bagdanov. Leveraging Unlabeled Data for Crowd Counting by Learning to Rank. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [50] Y. Liu, M. Shi, Q. Zhao, and X. Wang. Point In, Box Out: Beyond Counting Persons in Crowds. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [51] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning Transferable Features with Deep Adaptation Networks. In International Conference on Machine Learning, pages 97–105, 2015.
  • [52] M. Long, J. Wang, and M.I. Jordan. Deep Transfer Learning with Joint Adaptation Networks. In International Conference on Machine Learning, pages 2208–2217, 2017.
  • [53] Z. Lu, Y. Yang, X. Zhu, C. Liu, Y. Song, and T. Xiang. Stochastic Classifiers for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [54] X. Ma, T. Zhang, and C. Xu. GCAN: Graph Convolutional Adversarial Network for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [55] Z. Ma, X. Wei, X. Hong, and Y. Gong. Bayesian Loss for Crowd Count Estimation with Point Supervision. In International Conference on Computer Vision, 2019.
  • [56] D. Onoro-Rubio and R.J. López-Sastre. Towards Perspective-Free Object Counting with Deep Learning. In European Conference on Computer Vision, pages 615–629, 2016.
  • [57] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, and J. Snoek. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In Advances in Neural Information Processing Systems, pages 13991–14002, 2019.
  • [58] Y. Pan, T. Yao, Y. Li, Y. Wang, C.-W. Ngo, and T. Mei. Transferrable Prototypical Networks for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [59] X. Peng, Y. Li, and K. Saenko. Domain2vec: Domain Embedding for Unsupervised Domain Adaptation. In European Conference on Computer Vision, 2020.
  • [60] V. Ranjan, H. Le, and M. Hoai. Iterative Crowd Counting. In European Conference on Computer Vision, 2018.
  • [61] M. K. K. Reddy, M. Hossain, M. Rochan, and Y. Wang. Few-Shot Scene Adaptive Crowd Counting Using Meta-Learning. In IEEE Winter Conference on Applications of Computer Vision, 2020.
  • [62] S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci. Unsupervised Domain Adaptation Using Feature-Whitening and Consensus Loss. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [63] A. Rozantsev, M. Salzmann, and P. Fua. Residual Parameter Transfer for Deep Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pages 4339–4348, 2018.
  • [64] A. Rozantsev, M. Salzmann, and P. Fua. Beyond Sharing Weights for Deep Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801–814, 2019.
  • [65] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
  • [66] D.B. Sam, S.V. Peri, M.N. Sundararaman, A. Kamath, and R.V. Babu. Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [67] D.B. Sam, N.N. Sajjan, R.V. Babu, and S. M. Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [68] D.B. Sam, S. Surya, and R.V. Babu. Switching Convolutional Neural Network for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, page 6, 2017.
  • [69] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to Adapt: Aligning Domains Using Generative Adversarial Networks. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [70] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Learning Transferrable Representations for Unsupervised Domain Adaptation. In Advances in Neural Information Processing Systems, pages 2110–2118, 2016.
  • [71] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang. Crowd Counting via Adversarial Cross-Scale Consistency Pursuit. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [72] M. Shi, Z. Yang, C. Xu, and Q. Chen. Revisiting Perspective Information for Efficient Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [73] Z. Shi, P. Mettes, and C. G. M. Snoek. Counting with Focus for Free. In International Conference on Computer Vision, 2019.
  • [74] Z. Shi, L. Zhang, Y. Liu, and X. Cao. Crowd Counting with Deep Negative Correlation Learning. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [75] I. Shin, S. Woo, F. Pan, and I.S. Kweon. Two-Phase Pseudo Label Densification for Self-Training Based Domain Adaptation. In European Conference on Computer Vision, 2020.
  • [76] V.A. Sindagi and V.M. Patel. Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs. In International Conference on Computer Vision, pages 1879–1888, 2017.
  • [77] V.A. Sindagi and V.M. Patel. Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [78] V.A. Sindagi, R. Yasarla, D. S. Babu, R. V. Babu, and V.M. Patel. Learning to Count in the Crowd from Limited Labeled Data. In European Conference on Computer Vision, 2020.
  • [79] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros. Unsupervised Domain Adaptation through Self-Supervision. In arXiv Preprint, 2019.
  • [80] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous Deep Transfer Across Domains and Tasks. In International Conference on Computer Vision, pages 4068–4076, 2015.
  • [81] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial Discriminative Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • [82] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez. Advent: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation. In Conference on Computer Vision and Pattern Recognition, pages 2517–2526, 2019.
  • [83] J. Wan and A. B. Chan. Adaptive Density Map Generation for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [84] J. Wan, W. Luo, B. Wu, A. B. Chan, and W. Liu. Residual Regression with Semantic Prior for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [85] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese. Densefusion: 6D Object Pose Estimation by Iterative Dense Fusion. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [86] Q. Wang, J. Gao, W. Lin, and Y. Yuan. Learning from Synthetic Data for Crowd Counting in the Wild. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [87] Q. Wang, J. Gao, W. Lin, and Y. Yuan. Pixel-Wise Crowd Understanding via Synthetic Data. International Journal of Computer Vision, 2020.
  • [88] Q. Wang, T. Han, J. Gao, and Y. Yuan. Neuron Linear Transformation: Modeling the Domain Shift for Crowd Counting. IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [89] Y. Wu, D. Inkpen, and A. El-Roby. Dual Mixup Regularized Learning for Adversarial Domain Adaptation. In European Conference on Computer Vision, 2020.
  • [90] F. Xiong, X. Shi, and D. Yeung. Spatiotemporal Modeling for Crowd Counting in Videos. In International Conference on Computer Vision, pages 5161–5169, 2017.
  • [91] H. Xiong, H. Lu, C. Liu, L. Liu, Z. Cao, and C. Shen. From Open Set to Closed Set: Counting Objects by Spatial Divide-And-Conquer. In International Conference on Computer Vision, 2019.
  • [92] C. Xu, K. Qiu, J. Fu, S. Bai, Y. Xu, and X. Bai. Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [93] R. Xu, G. Li, J. Yang, and L. Lin. Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation. In International Conference on Computer Vision, 2019.
  • [94] R. Xu, P. Liu, L. Wang, C. Chen, and J. Wang. Reliable Weighted Optimal Transport for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [95] X. Xu, X. Zhou, R. Venkatesan, G. Swaminathan, and O. Majumder. d-SNE: Domain Adaptation Using Stochastic Neighborhood Embedding. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [96] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding. Perspective-Guided Convolution Networks for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [97] Y. Yang, D. Lao, G. Sundaramoorthi, and S. Soatto. Phase Consistent Ecological Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2020.
  • [98] Y. Yang and S. Soatto. Fda: Fourier Domain Adaptation for Semantic Segmentation. In Conference on Computer Vision and Pattern Recognition, pages 4085–4095, 2020.
  • [99] A. Zhang, J. Shen, Z. Xiao, F. Zhu, X. Zhen, X. Cao, and L. Shao. Relational Attention Network for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [100] A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao. Attentional Neural Fields for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [101] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
  • [102] L. Zhang, Z. Shi, M. Cheng, Y. Liu, J. Bian, J.T. Zhou, G. Zheng, and Z. Zeng. Nonlinear Regression via Deep Negative Correlation Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [103] Q. Zhang and A. B. Chan. Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [104] S. Zhang, G. Wu, J.P. Costeira, and J.M.F. Moura. FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras. In International Conference on Computer Vision, 2017.
  • [105] W. Zhang, W. Ouyang, W. Li, and D. Xu. Collaborative and Adversarial Network for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
  • [106] Y. Zhang, H. Tang, K. Jia, and M. Tan. Domain-Symmetric Networks for Adversarial Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [107] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
  • [108] M. Zhao, J. Zhang, C. Zhang, and W. Zhang. Leveraging Heterogeneous Auxiliary Tasks to Assist Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [109] J.-Y. Zhu, T. Park, P. Isola, and A.A. Efros. Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks. In International Conference on Computer Vision, pages 2223–2232, 2017.
  • [110] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang. Confidence Regularized Self-Training. In International Conference on Computer Vision, pages 5982–5991, 2019.
  • [111] Yang Zou, Zhiding Yu, B.V.K. Vijaya Kumar, and Jinsong Wang. Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training. In European Conference on Computer Vision, 2018.