Using Depth for Pixel-Wise Detection of Adversarial Attacks in Crowd Counting

11/26/2019 ∙ by Weizhe Liu, et al. ∙ EPFL

State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, deep learning approaches are vulnerable to adversarial attacks, which, in a crowd-counting context, can lead to serious security issues. However, attack and defense mechanisms have been virtually unexplored in regression tasks, let alone for crowd density estimation. In this paper, we investigate the effectiveness of existing attack strategies on crowd-counting networks, and introduce a simple yet effective pixel-wise detection mechanism. It builds on the intuition that, when attacking a multitask network, in our case estimating crowd density and scene depth, both outputs will be perturbed, and thus the second one can be used for detection purposes. We will demonstrate that this significantly outperforms heuristic-based and uncertainty-based strategies.




1 Introduction

State-of-the-art crowd counting algorithms [69, 71, 44, 49, 63, 54, 50, 38, 27, 48, 53, 34, 20, 47, 4] rely on Deep Networks to regress a crowd density, which is then integrated to estimate the number of people in the image. Their application can have important societal consequences, for example when they are used to assess how many people attended a demonstration or a political event.

In the “Fake News” era, it is therefore to be feared that hackers might launch adversarial attacks to bias the output of these models for political gain. While such attacks have been well studied for classification networks [15, 23, 41, 43, 42, 6], they remain largely unexplored territory for people counting and even for regression at large. The only related approach we know of [46] is very recent and specific to attacking optical flow networks, leaving the pixel-wise detection of attacks untouched.

In this paper, our goal is to blaze a trail in that direction. Our main insight is that if a two-stream network is trained to regress both the people density and the scene depth, it becomes very difficult to affect one without affecting the other. In other words, pixels that have been modified to alter the density estimate will also produce incorrect depths, which can be detected by estimating depth by unrelated means. When a ground-truth depth-map can be kept safe from the attacker, this is easily done. When this is not an option, we will show that using statistics from depth-maps acquired earlier suffices to detect tampering at a later date.

Figure 1: Density and Depth (DaD) model. A two-stream network is trained to regress both the people density and the scene depth. The pixels that have been attacked to alter the density estimate will also produce incorrect depths and can thus be detected.

Fig. 1 depicts our approach. We will show that it is robust to both black-box attacks, in which the adversary is unaware of the adversarial detector, and white-box attacks, in which the adversary can access not only the density regression network but also our adversarial detector. In other words, even when our approach is exposed, a hacker cannot mount an effective attack while avoiding detection. This is because in surveillance videos recorded by a fixed camera, untampered depth map pixels tend to change less over time than RGB image pixels. In a crowded scene, depth measurements are affected by the appearance and disappearance of people but the perturbations remain small, whereas those of RGB pixels can be much larger due to illumination and appearance changes. Therefore, even if the attacker has access to the depth maps, it remains difficult to guarantee that they will be altered in a consistent and undetectable way. We will refer to this as the Density and Depth (DaD) model.

Our contribution therefore is an effective approach to foiling adversarial attacks on people counting algorithms that rely on deep learning. Its principle is very generic and could be naturally extended to other image regression tasks. Our experiments on several benchmark datasets demonstrate that it outperforms heuristic-based and uncertainty-based attack detection strategies.

2 Related Work

Crowd Counting.

Early crowd counting methods [61, 60, 29] tended to rely on counting-by-detection, that is, explicitly detecting individual heads or bodies and then counting them. Unfortunately, in very crowded scenes, occlusions make detection difficult, and these approaches have been largely displaced by counting-by-density-estimation ones. The latter rely on training a regressor to estimate people density in various parts of the image and then integrating. This trend began in [8, 25, 12], using either Gaussian Process or Random Forests regressors. Even though approaches relying on low-level features [9, 7, 3, 45, 8, 19] can yield good results, they have now mostly been superseded by CNN-based methods [69, 63, 34, 54, 65, 44, 50, 71, 49, 4, 59, 35, 39, 51, 37, 36, 21, 72, 70, 58, 28, 31, 66, 40, 33, 64, 65, 52, 55, 10, 57, 67, 68], a survey of which can be found in [54]. In this paper, we therefore focus on attacks against these.

Defense against Adversarial Attacks.

Deep networks trained to solve classification problems are vulnerable to adversarial attacks [15, 23, 41, 43, 42, 6]. Existing attack strategies can be roughly split into two categories, optimization-based and gradient-based. The former [43, 42, 6] involve terms related to the class probabilities, which makes the latter [15, 23, 41] better candidates for attacks against deep regression models. The very recent work of [46] is the only one we know of that examines adversarial attacks for a regression task, that is, optical flow estimation. However, it does not propose defense mechanisms, which by contrast is the focus of this paper.

In the context of classification, one popular defense is adversarial training [56], which augments the training data with adversarial examples and has been shown to outperform many competing methods [16, 14, 18, 1, 26, 11]. However, it needs access to adversarial examples, which are often not available ahead of time and expensive to generate during training. As a consequence, several alternative approaches have been proposed. These include training auxiliary classifiers, ranging from simple linear regressors to complex neural networks, to predict whether a sample is adversarial or not. However, as shown in [5], such detection mechanisms can easily be defeated by an adversary targeting them directly.

In any event, none of these methods are designed to detect attacks at the pixel-level. Even the few researchers who have studied adversarial attacks for semantic segmentation [62, 30], which is a pixel-level prediction task, do not go beyond detection at the global image level.

A seemingly natural approach to performing pixel-level attack detection would be to rely on prediction uncertainty. In [11], the authors argue that Bayesian uncertainty [13] is a reliable way to detect the adversarial examples because the perturbed pixels generally have much higher uncertainty values. Uncertainty can be computed using dropout, as in [13], learned from data [22], or estimated using the negative log-likelihood of each prediction [24]. In our experiments, we will extend this strategy to pixel-wise adversarial attack detection and show that our approach significantly outperforms it.

3 Density and Depth Model

As discussed in Section 2, most state-of-the-art crowd counting algorithms rely on a deep network regressor $\mathcal{F}_d$ that takes an image $\mathbf{I}$ as input and returns $\mathcal{F}_d(\mathbf{I}; \Theta)$, an estimated density map, which should be as close as possible to a ground-truth one $\mathbf{D}^{gt}$ in $L^2$ norm terms. Here, $\Theta$ stands for the network’s weights, which have been optimized for this purpose.

An adversarial attack then involves generating a perturbed image $\mathbf{I}^{adv} = \mathcal{P}(\mathbf{I}, \mathbf{D}^{gt})$, where $\mathcal{P}$ maps the input image and ground-truth densities to $\mathbf{I}^{adv}$ in such a way that $\mathbf{I}^{adv}$ is visually indistinguishable from $\mathbf{I}$ while yielding a crowd density estimate $\mathcal{F}_d(\mathbf{I}^{adv}; \Theta)$ that is as different as possible from the ground-truth one.

We will review the best known ways to generate such attacks in Section 4.1. Here, our concern is to define $\mathcal{F}_d$ so as to defeat them by ensuring that they are easily detected. To this end, we leverage an auxiliary task, depth estimation, as follows.

Network Architecture.

Figure 2: Density and Depth model.

An input RGB image is first encoded to deep features by an encoder network. Then, these features are decoded to a crowd density map and a depth map by two different decoder networks. At inference time, we detect adversarial attacks in a pixel-wise manner by observing the depth estimation errors.

| layer | encoder | layer | decoder |
|---|---|---|---|
| 1 - 2 | 3 × 3 × 64 conv-1 | 1 | 3 × 3 × 512 conv-2 |
| | 2 × 2 max pooling | 2 | 3 × 3 × 512 conv-2 |
| 3 - 4 | 3 × 3 × 128 conv-1 | 3 | 3 × 3 × 512 conv-2 |
| | 2 × 2 max pooling | 4 | 3 × 3 × 256 conv-2 |
| 5 - 7 | 3 × 3 × 256 conv-1 | 5 | 3 × 3 × 128 conv-2 |
| | 2 × 2 max pooling | 6 | 3 × 3 × 64 conv-2 |
| 8 - 10 | 3 × 3 × 512 conv-1 | 7 | 1 × 1 × 1 conv-1 |
| 11 | context-aware features [37] | | |

Table 1: Layers of the encoder and of the decoder. All layers are convolutional and described in terms of “(kernel size) × (kernel size) × (number of filters) conv-(dilation rate)”, except for layer 11 of the encoder that produces context-aware features [37] using weighted multi-scale features generated by Spatial Pyramid Pooling [17].

Instead of training a single regressor that predicts only people density, we train a two-stream network where one stream predicts people density and the other depth at each pixel. We write

$$\mathbf{D}^{est} = \mathcal{F}_d(\mathbf{I}; \Theta), \qquad \mathbf{Z}^{est} = \mathcal{F}_z(\mathbf{I}; \Theta), \tag{1}$$

where $\mathbf{D}^{est}$ and $\mathbf{Z}^{est}$ are the estimated densities and depths while $\mathcal{F}_d$ and $\mathcal{F}_z$ are two regressors parameterized by the weights $\Theta$. The network that implements $\mathcal{F}_d$ and $\mathcal{F}_z$ comprises a single encoder and two decoders, one that produces the densities and the other the depths. It is depicted in Fig. 2 and we provide details about its architecture in Table 1. Note that some of the weights in $\Theta$ are specific to the first decoder and others to the second.

As the two decoders use the same set of features as input, it is difficult to tamper with the results of one without affecting that of the other, as we will see in Section 4.4. More specifically, if pixel $j$ is perturbed to change the local density estimate, the local depth estimate is likely to be affected as well. We therefore take the relative error in depth estimation with respect to the ground-truth depth map $\mathbf{Z}^{gt}$,

$$r_j = \frac{\left| Z_j^{est} - Z_j^{gt} \right|}{Z_j^{gt}}, \tag{2}$$

to be an indicator of a potential disturbance. In practice, we label a pixel as potentially tampered with if this error exceeds the value that delimits the top 5% of such errors in the training dataset, and we will evaluate the influence of this hyper-parameter in Section 4.6. In test sequences for which the ground-truth depth map can also be tampered with by the attackers, we can use the statistics of the training depth maps to also detect such tampering, as will be shown in Section 4.5.
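
The thresholding rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`relative_depth_error`, `top_fraction_threshold`, `detect_tampered`) are hypothetical, depth maps are represented as nested lists, and "top 5%" is read as the value above which 5% of the training errors lie.

```python
def relative_depth_error(z_est, z_gt, floor=1e-8):
    """Per-pixel relative error |z_est - z_gt| / z_gt for 2D lists of depths.

    `floor` guards against division by zero (an assumption of this sketch).
    """
    return [[abs(e - g) / max(g, floor) for e, g in zip(row_e, row_g)]
            for row_e, row_g in zip(z_est, z_gt)]


def top_fraction_threshold(train_errors, top_fraction=0.05):
    """Largest training error that is NOT among the top `top_fraction`."""
    ordered = sorted(train_errors)
    k = max(1, round(len(ordered) * top_fraction))  # size of the top group
    return ordered[max(0, len(ordered) - k - 1)]


def detect_tampered(z_est, z_gt, threshold):
    """Binary mask: 1 where the relative depth error exceeds the threshold."""
    errors = relative_depth_error(z_est, z_gt)
    return [[1 if e > threshold else 0 for e in row] for row in errors]
```

In use, the threshold would be computed once from relative errors collected on untampered training images and then applied to every test frame.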

Network Training.

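Given a set of $N$ training images $\{\mathbf{I}_i\}_{1 \le i \le N}$ with corresponding ground-truth density maps $\{\mathbf{D}_i^{gt}\}$ and ground-truth depth maps $\{\mathbf{Z}_i^{gt}\}$, we learn the weights $\Theta$ of the two regressors by minimizing the loss

$$L(\Theta) = \frac{1}{2B} \sum_{i=1}^{B} \left( \left\| \mathbf{D}_i^{gt} - \mathcal{F}_d(\mathbf{I}_i; \Theta) \right\|_2^2 + \lambda \left\| \mathbf{Z}_i^{gt} - \mathcal{F}_z(\mathbf{I}_i; \Theta) \right\|_2^2 \right), \tag{3}$$

where $B$ is the batch size and $\lambda$ is a hyper-parameter that balances the contributions of the two losses. We found empirically that $\lambda = 0.01$ yields the best overall performance, as will be shown in Section 4.4.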
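
To make the role of the balancing hyper-parameter concrete, the following sketch evaluates a two-term training loss on a batch of predictions. It assumes squared L2 distances for both terms and a normalization by twice the batch size; these choices, the nested-list representation, and the names `squared_l2` and `joint_loss` are assumptions of the example rather than details confirmed by the text.

```python
def squared_l2(a, b):
    """Sum of squared differences between two 2D maps given as nested lists."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def joint_loss(d_est, d_gt, z_est, z_gt, lam=0.01):
    """Density loss plus lam times depth loss, averaged over the batch.

    Each argument is a list (the batch) of 2D maps.
    """
    batch_size = len(d_est)
    total = 0.0
    for de, dg, ze, zg in zip(d_est, d_gt, z_est, z_gt):
        total += squared_l2(de, dg) + lam * squared_l2(ze, zg)
    return total / (2 * batch_size)
```

Setting `lam=0.0` reduces this to a density-only loss, which corresponds to training the density regressor alone.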

To obtain the ground-truth density maps $\mathbf{D}_i^{gt}$, we rely on the same strategy as previous work [27, 49, 71, 48, 32]. In each training image $\mathbf{I}_i$, we annotate a set of 2D points that denote the position of each human head in the scene. The corresponding ground-truth density map $\mathbf{D}_i^{gt}$ is obtained by convolving an image containing ones at these locations and zeroes elsewhere with a Gaussian kernel of mean $\mu$ and variance $\sigma^2$.
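
The construction of a ground-truth density map can be illustrated as follows. Convolving an image of unit impulses with a symmetric kernel is equivalent to stamping the kernel at each annotated head, which is what this sketch does. The zero-mean kernel, the truncation radius, the value of `sigma`, and the function names are all assumptions chosen for the example.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized, zero-mean 2D Gaussian kernel of size (2*radius + 1)^2."""
    weights = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
                for dx in range(-radius, radius + 1)]
               for dy in range(-radius, radius + 1)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def density_map(height, width, heads, sigma=1.0, radius=3):
    """Stamp one normalized Gaussian per annotated head position (row, col)."""
    kernel = gaussian_kernel(sigma, radius)
    density = [[0.0] * width for _ in range(height)]
    for r, c in heads:
        for ky, kernel_row in enumerate(kernel):
            for kx, w in enumerate(kernel_row):
                y, x = r + ky - radius, c + kx - radius
                if 0 <= y < height and 0 <= x < width:
                    density[y][x] += w
    return density
```

Because each kernel sums to one, integrating the map recovers the head count as long as the heads lie far enough from the image border for the kernel not to be cropped.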


4 Experiments

In this section, we first introduce the existing adversarial attack methods that can be used against a deep regressor and describe the evaluation metric and the benchmark datasets we used to assess their performance. We then use them against our approach to demonstrate its robustness, and conclude with an ablation study that demonstrates that our approach is robust to the hyper-parameter setting and works well when used in conjunction with several recent crowd density regressors [71, 27, 37].

4.1 Attacking a Deep Regressor

While there exist many adversarial attacks [15, 23, 41, 43, 42, 6], their effectiveness has been proven mostly against classifiers and far more rarely against regressors [46]. As discussed in Section 2, the gradient-based methods [15, 23, 41] are the most suitable ones to attack regressors and we focus on the so-called Fast Gradient Sign Methods (FSGMs), which are the most successful and widely used ones. We will distinguish between black-box attacks, in which the attacker does not know that we use depth for verification purposes, and white-box attacks, in which they do.

Black-Box Attacks.

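If the attacker is unaware that we use the depth map for verification purposes, they will only try to affect the density map. They might then use one of the following variants of FSGM.

Untargeted FSGM (FSGM-U(n)) [15, 23].

It generates adversarial examples designed to increase the network loss as much as possible for the correct answer, thereby preventing the network from predicting it. Given an input image $\mathbf{I}$, the ground-truth density $\mathbf{D}^{gt}$, and the regressor $\mathcal{F}_d$ of Eq. 1 parametrized by $\Theta$, the attack is performed by iterating

$$\begin{aligned} \mathbf{I}_0^{adv} &= \mathbf{I}, \\ \mathbf{I}_{i+1}^{adv} &= \mathrm{clip}\left( \mathbf{I}_i^{adv} + \alpha \, \mathrm{sign}\left( \nabla_{\mathbf{I}} L\left( \mathcal{F}_d(\mathbf{I}_i^{adv}; \Theta), \mathbf{D}^{gt} \right) \right), \epsilon \right) \end{aligned} \tag{4}$$

$n$ times. Here, $L$ is the network loss, $\alpha$ is the step size, and $\mathrm{clip}(\cdot, \epsilon)$ keeps the result within $\epsilon$ of the original image. The adversarial example is then taken to be $\mathbf{I}_n^{adv}$ and we will refer to this as FSGM-U(n). It is a single-step or multiple-step attack without target and guarantees that the resulting perturbation is bounded by $\epsilon$. For consistency with earlier work [15, 23], when $n = 1$, we reformulate this attack as

$$\mathbf{I}^{adv} = \mathbf{I} + \epsilon \, \mathrm{sign}\left( \nabla_{\mathbf{I}} L\left( \mathcal{F}_d(\mathbf{I}; \Theta), \mathbf{D}^{gt} \right) \right). \tag{5}$$

Unless otherwise specified, we set $\alpha$, $\epsilon$, and $n$ as recommended in earlier work [23]. These numbers are chosen to substantially increase the crowd counting error while keeping the perturbation almost imperceptible to the human eye. We will analyze the sensitivity of our approach to these values in Section 4.6. An example of this attack is shown in Fig. 3. By comparing Fig. 3(d) and (e), we can see that the attack made some people “disappear”.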
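
The iterative untargeted attack can be sketched on a toy problem. The example below replaces the deep regressor with a hypothetical linear model whose output is a single count, so that the gradient of a squared-error loss with respect to the image is available in closed form. The model, the loss, the pixel range, and the values of `alpha`, `eps`, and `n` are illustrative assumptions, not the settings used in the experiments.

```python
def sign(v):
    """Sign of a number as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def predict(weights, image):
    """Toy linear 'regressor': weighted sum of the pixel values."""
    return sum(w * p for w, p in zip(weights, image))


def loss_gradient(weights, image, target):
    """Gradient of 0.5 * (predict - target)^2 with respect to the image."""
    residual = predict(weights, image) - target
    return [residual * w for w in weights]


def fsgm_untargeted(weights, image, d_gt, alpha=1.0, eps=4.0, n=5):
    """Take n signed-gradient ascent steps, clipped to an eps-ball and [0, 255]."""
    adv = list(image)
    for _ in range(n):
        grad = loss_gradient(weights, adv, d_gt)
        stepped = [a + alpha * sign(g) for a, g in zip(adv, grad)]
        adv = [min(max(s, x - eps, 0.0), x + eps, 255.0)
               for s, x in zip(stepped, image)]
    return adv
```

With a real network, the closed-form gradient would be replaced by automatic differentiation, but the update and clipping steps keep the same structure.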

Figure 3: Crowd density estimation with original and perturbed images. (a) Original image. (b) Image under FSGM-U(19) attack. (c) Ground-truth density map with 51 people. (d) Density map inferred from the original image, leading to an estimate of 51.8 people. (e) Density map inferred from the perturbed image (b), yielding an estimated number of people of 18.1. Note the mismatch in density map and people counts between the original image and the perturbed one.

Targeted FSGM (FSGM-T(n)) [23].

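Instead of simply preventing the network from finding the right answer $\mathbf{D}^{gt}$, we can target a specific wrong answer $\mathbf{D}^{T}$. This is achieved using the slightly modified iterative scheme

$$\begin{aligned} \mathbf{I}_0^{adv} &= \mathbf{I}, \\ \mathbf{I}_{i+1}^{adv} &= \mathrm{clip}\left( \mathbf{I}_i^{adv} - \alpha \, \mathrm{sign}\left( \nabla_{\mathbf{I}} L\left( \mathcal{F}_d(\mathbf{I}_i^{adv}; \Theta), \mathbf{D}^{T} \right) \right), \epsilon \right). \end{aligned} \tag{6}$$

Again, we take the adversarial example to be $\mathbf{I}_n^{adv}$ and use the same values as before for $\alpha$ and $\epsilon$. We will refer to this as FSGM-T(n). In our experiments, we take the targets to be the true value plus one, which creates an obvious error while yielding tampered images that are indistinguishable from the original ones.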

White-Box Attacks.

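If the attacker knows that we are using the depth maps and has access to both $\mathcal{F}_d$ and $\mathcal{F}_z$, the two regressors of Eq. 1, their natural reaction will be to try to modify the density maps while leaving the depth maps as unchanged as possible. To this end, we propose the following exposed variations of the untargeted and targeted FSGM attacks described above.

Untargeted Exposed FSGM (FSGM-UE(n)).

The iterative scheme becomes

$$\begin{aligned} \mathbf{I}_0^{adv} &= \mathbf{I}, \\ \mathbf{I}_{i+1}^{adv} &= \mathrm{clip}\left( \mathbf{I}_i^{adv} + \alpha \, \mathrm{sign}\left( \nabla_{\mathbf{I}} \left[ L\left( \mathcal{F}_d(\mathbf{I}_i^{adv}; \Theta), \mathbf{D}^{gt} \right) - \lambda L\left( \mathcal{F}_z(\mathbf{I}_i^{adv}; \Theta), \mathbf{Z}^{gt} \right) \right] \right), \epsilon \right), \end{aligned} \tag{7}$$

where $\mathbf{Z}^{gt}$ is the ground-truth depth map. When $n = 1$, we reformulate the final line of Eq. 7 as in Eq. 5 for consistency with earlier work. The additional term in the loss function aims to preserve the predicting power of $\mathcal{F}_z$ while compromising that of $\mathcal{F}_d$ as much as possible. We again use the same values as before for $\alpha$ and $\epsilon$, and $\lambda$ is the same balancing factor as in Eq. 3.

Targeted Exposed FSGM (FSGM-TE(n)).

Similarly, the targeted attack iterative scheme becomes

$$\begin{aligned} \mathbf{I}_0^{adv} &= \mathbf{I}, \\ \mathbf{I}_{i+1}^{adv} &= \mathrm{clip}\left( \mathbf{I}_i^{adv} - \alpha \, \mathrm{sign}\left( \nabla_{\mathbf{I}} \left[ L\left( \mathcal{F}_d(\mathbf{I}_i^{adv}; \Theta), \mathbf{D}^{T} \right) + \lambda L\left( \mathcal{F}_z(\mathbf{I}_i^{adv}; \Theta), \mathbf{Z}^{gt} \right) \right] \right), \epsilon \right). \end{aligned} \tag{8}$$

When $n = 1$, we again reformulate the final line of Eq. 8 as in Eq. 5.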
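
The idea behind the exposed untargeted attack, increasing the density loss while penalizing changes in the depth loss, can be sketched with two hypothetical linear heads that share the same input. Everything in this example is an assumption made for illustration: the linear heads `w_d` and `w_z`, the squared-error losses, the way the two terms are combined (density loss minus `lam` times depth loss), and the parameter values.

```python
def sign(v):
    """Sign of a number as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def exposed_gradient(w_d, w_z, image, d_gt, z_gt, lam):
    """Image gradient of 0.5*(d - d_gt)^2 - lam * 0.5*(z - z_gt)^2."""
    r_d = dot(w_d, image) - d_gt  # density residual
    r_z = dot(w_z, image) - z_gt  # depth residual
    return [r_d * wd - lam * r_z * wz for wd, wz in zip(w_d, w_z)]


def fsgm_untargeted_exposed(w_d, w_z, image, d_gt, z_gt,
                            lam=1.0, alpha=1.0, eps=2.0, n=3):
    """Signed-gradient ascent on the combined objective, clipped to an eps-ball."""
    adv = list(image)
    for _ in range(n):
        grad = exposed_gradient(w_d, w_z, adv, d_gt, z_gt, lam)
        stepped = [a + alpha * sign(g) for a, g in zip(adv, grad)]
        adv = [min(max(s, x - eps), x + eps) for s, x in zip(stepped, image)]
    return adv
```

In this toy setting the two heads depend on different pixels, so the attacker can degrade the density estimate while keeping the depth estimate accurate. The premise of the approach described here is that a shared encoder makes such a clean separation hard to achieve.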

4.2 Evaluation Datasets

We use three different datasets to evaluate our approach. The first two are RGB-D datasets with ground-truth depth value obtained from sensors. Since depth sensors may not always be available, we also evaluate our model with a third dataset that contains RGB images with an accurate perspective map. This perspective map is a depth map computed from the scene geometry instead of using depth sensors. As such, it only represents the scene, not the people in it. This will let us show that our approach not only works for RGB-D datasets but also achieves remarkable performance in RGB images if scene geometry is available.

ShanghaiTechRGBD [28].

This is a large-scale RGB-D dataset with 2,193 images and 144,512 annotated heads. The valid depth ranges from 0 to 20 meters due to the limitation in depth sensors. The lighting condition ranges from very bright to very dark in different scenarios. We use the same setting as in [28], with 1,193 images as training set and the remaining ones as test set, and normalize the depth values from [0,20] to [0,1] for both training and evaluation.

Micc [2].

It is acquired by a fixed indoor surveillance camera. This dataset is divided into three video sequences, named FLOW, QUEUE and GROUPS. The crowd motion varies from sequence to sequence. In the FLOW sequence, people walk from point to point. In the QUEUE sequence, people walk in a line. In the GROUPS sequence, people move inside a controlled area. There are 1,260 frames in the FLOW sequence with 3,542 heads. The QUEUE sequence contains 5,031 heads in 918 frames, and the GROUPS sequence encompasses 1,180 frames with 9,057 heads. We follow the same setting as in [28], taking 20% of the images of each scene as training set and using the remaining ones as test set.

Venice [37].

The above two RGB-D datasets contain depth information acquired by sensors. Such information is hard to obtain in outdoor environments, particularly if the scene is far from the camera. Therefore, we also evaluate our approach on the Venice dataset. This dataset contains RGB images and an accurate perspective map of each scene. It was obtained using the grid-like ground pattern, as shown in Fig. 4, and thus does not depend on any depth sensor. The dataset contains 4 different sequences for a total of 167 annotated frames with a fixed 1,280 × 720 resolution. Our experimental setting follows that of [37, 36], with 80 images from a single long sequence as training data, and the images from the remaining 3 sequences for testing purposes.

Figure 4: Example image from Venice. In the Venice dataset, each image is provided with a perspective map, which is computed from the grid-like ground pattern.

4.3 Metrics and Baselines

In all our experiments, we partition the images into four parts and tamper with one while leaving the other three untouched. We then measure two things:

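  • How well can we detect the pixels that have been tampered with? We measure this in terms of the mean Intersection over Union

    $$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| \mathbf{M}_i^{est} \cap \mathbf{M}_i^{gt} \right|}{\left| \mathbf{M}_i^{est} \cup \mathbf{M}_i^{gt} \right|}, \tag{9}$$

    where $N$ is the number of images, $\mathbf{M}_i^{est}$ is 1 for the pixels in image $i$ predicted to have been tampered with according to Eq. 2 and $\mathbf{M}_i^{gt}$ is the ground-truth perturbation mask.

  • How well do modifications of the depth map correlate with modifications of the predicted density? As in many previous works [71, 69, 44, 49, 63, 54, 37, 36] we quantify these modifications in terms of the mean absolute error for densities and depths along with the root mean squared error for density. They are defined as

    $$\mathrm{MAE}_d = \frac{1}{N} \sum_{i=1}^{N} \left| c_i - \hat{c}_i \right|, \qquad \mathrm{RMSE}_d = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( c_i - \hat{c}_i \right)^2}, \qquad \mathrm{MAE}_z = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T_i} \sum_{j=1}^{T_i} \left| z_i^j - \hat{z}_i^j \right|, \tag{10}$$

    where $N$ is the number of test images, $c_i$ and $z_i^j$ denote the true number of people in the $i$th image and depth value at pixel $j$ of the $i$th image, and $\hat{c}_i$ and $\hat{z}_i^j$ are the estimated values. $T_i$ is the number of tampered pixels in the $i$th image. In practice $\hat{c}_i$ is obtained by integrating the predicted people densities.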
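
The two kinds of measurements listed above can be computed as in the following sketch. It assumes that the mean IoU is an average of per-image IoU values between binary masks and that counting errors are computed from per-image true and estimated counts. The function names and the convention of returning 1.0 for two empty masks are choices made for the example.

```python
import math


def iou(pred_mask, gt_mask):
    """Intersection over Union of two binary masks given as nested lists."""
    inter = union = 0
    for row_p, row_g in zip(pred_mask, gt_mask):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0


def mean_iou(pred_masks, gt_masks):
    """Average of the per-image IoU values."""
    return sum(iou(p, g) for p, g in zip(pred_masks, gt_masks)) / len(pred_masks)


def count_errors(true_counts, est_counts):
    """Return (MAE, RMSE) between true and estimated people counts."""
    diffs = [t - e for t, e in zip(true_counts, est_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

For intuition, if exactly a quarter of the pixels are tampered with and a random half are flagged, the expected intersection covers 1/8 of the image and the expected union 5/8, which gives an IoU of about 0.20.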

In the absence of prior art on defenses against attacks of density estimation algorithms, we use the following baselines for comparison purposes.

  • RANDHALF and RANDQUARTER. We randomly label either half or a quarter of the pixels as being under attack, given that we know a priori that exactly a quarter are. We introduced RANDHALF to show that using a random rate other than the true one does not help.


  • Heteroscedastic uncertainty. Since adversarial attacks are caused by modifying the input image, they can be seen as a form of heteroscedastic aleatoric uncertainty [22], which assumes that the observation noise varies with the input. We threshold the uncertainty values to classify each pixel as perturbed or not and report the results obtained with the best threshold.

  • ENSEMBLES. We use the approach of [24], a scalable method for estimating predictive uncertainty from deep networks that uses a proper scoring rule as the training criterion. The optional adversarial training process is not used as we do not know the potential attackers in advance. As before, we threshold the uncertainty values to obtain a pixel-wise classification map and report the best results.

  • BAYESIAN. We further compare our model with Bayesian uncertainty [13], which uses dropout as a Bayesian approximation of model uncertainty. Again, we threshold the uncertainty value and report the results for the best threshold.

The baseline models are trained with the same crowd density regression networks as our approach.

4.4 Comparative Performance

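| | ShanghaiTechRGBD: MAE (density) | RMSE (density) | MAE (depth) | MICC: MAE (density) | RMSE (density) | MAE (depth) | Venice: MAE (density) | RMSE (density) | MAE (depth) |
|---|---|---|---|---|---|---|---|---|---|
| Original image | 4.32 | 7.16 | 0.04 | 0.52 | 0.67 | 1.36 | 21.92 | 24.74 | 1.13 |
| FSGM-U(1) | 61.56 | 71.58 | 0.12 | 2.45 | 3.01 | 7.46 | 78.75 | 88.65 | 2.16 |
| FSGM-T(1) | 60.31 | 70.08 | 0.12 | 1.66 | 1.87 | 7.77 | 202.54 | 204.65 | 2.62 |
| FSGM-U(19) | 64.55 | 75.11 | 0.14 | 3.13 | 3.75 | 7.68 | 48.56 | 57.83 | 1.76 |
| FSGM-T(19) | 62.86 | 73.09 | 0.13 | 1.90 | 2.15 | 7.77 | 112.17 | 115.24 | 1.93 |
| FSGM-UE(1) | 58.14 | 68.34 | 0.11 | 2.72 | 3.33 | 6.66 | 58.40 | 66.94 | 2.03 |
| FSGM-TE(1) | 53.64 | 63.43 | 0.11 | 2.47 | 3.02 | 6.65 | 171.76 | 174.31 | 2.53 |
| FSGM-UE(19) | 63.81 | 74.30 | 0.10 | 2.44 | 2.97 | 5.26 | 42.74 | 51.20 | 1.80 |
| FSGM-TE(19) | 52.89 | 62.43 | 0.10 | 2.30 | 2.81 | 5.26 | 95.71 | 99.13 | 1.92 |

Table 2: Error Summary of Crowd Density and Depth Estimation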
Figure 5: Pixel-wise adversarial detection on ShanghaiTechRGBD. Original image, image under FSGM-U(1) attack, region of interest (ROI, red), ground-truth attacked region within the ROI (red), and attacked region estimated by our method (OURS). Note how similar the attacked region mask produced by OURS is to the ground truth.

Using the CAN [37] architecture.

CAN is an encoder-decoder crowd density estimation architecture that delivers excellent performance. We use it to implement $\mathcal{F}_d$ and duplicate its decoder to implement $\mathcal{F}_z$. Recall from Section 3 that we use the hyper-parameter $\lambda$ of Eq. 3 to balance the people density estimation loss and the depth estimation loss while training $\mathcal{F}_d$ and $\mathcal{F}_z$. In Table 3, we report the performance of the two regressors as a function of the value of $\lambda$. Setting $\lambda = 0.01$ yields the best performance overall and we use regressors trained using this value in all our other experiments. Interestingly, training $\mathcal{F}_d$ and $\mathcal{F}_z$ jointly yields a better density regressor than training $\mathcal{F}_d$ alone, which is what we do when we set $\lambda$ to zero.

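| $\lambda$ | ShanghaiTechRGBD: MAE (density) | RMSE (density) | MAE (depth) | MICC: MAE (density) | RMSE (density) | MAE (depth) | Venice: MAE (density) | RMSE (density) | MAE (depth) |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 4.82 | 7.23 | NA | 0.91 | 0.98 | NA | 23.51 | 38.92 | NA |
| 0.001 | 4.76 | 7.19 | 0.21 | 0.86 | 0.93 | 2.26 | 21.81 | 24.91 | 2.59 |
| 0.01 | 4.32 | 7.16 | 0.04 | 0.52 | 0.67 | 1.36 | 21.92 | 24.74 | 1.13 |
| 0.1 | 4.61 | 7.41 | 0.03 | 0.61 | 0.73 | 1.43 | 23.12 | 26.52 | 1.23 |
| 1.0 | 4.80 | 7.26 | 0.04 | 0.89 | 0.93 | 1.47 | 23.27 | 32.16 | 1.32 |
| 10.0 | 4.92 | 8.01 | 1.16 | 0.98 | 1.04 | 1.72 | 25.43 | 39.65 | 1.86 |

Table 3: Error Summary of Crowd Density and Depth Estimation for different $\lambda$ values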

We report the counting and depth errors with and without attack in Table 2 for the three datasets. All the attacks cause a large increase in crowd counting error, which always comes with a substantial increase in depth estimation error. The exposed methods slightly reduce this increase but at the cost of also making the attack less effective.

FSGM-U(1) 0.20 0.14 0.23 0.35 0.23 0.54
FSGM-T(1) 0.20 0.14 0.23 0.32 0.24 0.54
FSGM-U(19) 0.20 0.14 0.28 0.36 0.23 0.58
FSGM-T(19) 0.20 0.14 0.28 0.33 0.24 0.57
FSGM-UE(1) 0.20 0.14 0.24 0.28 0.23 0.52
FSGM-TE(1) 0.20 0.14 0.21 0.30 0.23 0.51
FSGM-UE(19) 0.20 0.14 0.20 0.33 0.23 0.45
FSGM-TE(19) 0.20 0.14 0.25 0.30 0.24 0.47
Table 4: mIoU of Pixel-Wise Adversarial Detection on ShanghaiTechRGBD
FSGM-U(1) 0.20 0.14 0.30 0.35 0.28 0.46
FSGM-T(1) 0.20 0.14 0.33 0.32 0.26 0.49
FSGM-U(19) 0.20 0.14 0.30 0.30 0.27 0.49
FSGM-T(19) 0.20 0.14 0.32 0.37 0.23 0.49
FSGM-UE(1) 0.20 0.14 0.30 0.35 0.26 0.41
FSGM-TE(1) 0.20 0.14 0.28 0.31 0.28 0.41
FSGM-UE(19) 0.20 0.14 0.31 0.33 0.28 0.40
FSGM-TE(19) 0.20 0.14 0.30 0.34 0.27 0.40
Table 5: mIoU of Pixel-Wise Adversarial Detection on MICC
FSGM-U(1) 0.20 0.14 0.24 0.19 0.23 0.42
FSGM-T(1) 0.20 0.14 0.26 0.20 0.22 0.49
FSGM-U(19) 0.20 0.14 0.25 0.16 0.22 0.36
FSGM-T(19) 0.20 0.14 0.26 0.18 0.23 0.38
FSGM-UE(1) 0.20 0.14 0.22 0.20 0.23 0.40
FSGM-TE(1) 0.20 0.14 0.23 0.20 0.25 0.48
FSGM-UE(19) 0.20 0.14 0.26 0.21 0.24 0.38
FSGM-TE(19) 0.20 0.14 0.22 0.19 0.23 0.41
Table 6: mIoU of Pixel-Wise Adversarial Detection on Venice

In Tables 4, 5, and 6, we report the pixel-wise adversarial detection accuracy for the ShanghaiTechRGBD, MICC and Venice datasets. Our approach outperforms all the baseline models by a large margin for all the attacks. In Fig. 5, we show a qualitative result.

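Figure 6: Detection accuracy with different backbones. We report the mIoU of different backbones on the ShanghaiTechRGBD, MICC, and Venice datasets.

Figure 7: Detection accuracy with different perturbation strengths. We report the mIoU for different $\epsilon$ values on the ShanghaiTechRGBD, MICC, and Venice datasets.

| $\lambda$ | Attack | mIoU | MAE (density) | RMSE (density) | MAE (depth) |
|---|---|---|---|---|---|
| 0.01 | FSGM-UE(1) | 0.52 | 58.14 | 68.34 | 0.11 |
| 0.01 | FSGM-TE(1) | 0.51 | 53.64 | 63.43 | 0.11 |
| 0.01 | FSGM-UE(19) | 0.45 | 63.81 | 74.30 | 0.10 |
| 0.01 | FSGM-TE(19) | 0.47 | 52.89 | 62.43 | 0.10 |
| 1.0 | FSGM-UE(1) | 0.41 | 36.62 | 42.51 | 0.09 |
| 1.0 | FSGM-TE(1) | 0.38 | 35.73 | 41.44 | 0.09 |
| 1.0 | FSGM-UE(19) | 0.36 | 36.72 | 40.14 | 0.08 |
| 1.0 | FSGM-TE(19) | 0.36 | 33.73 | 38.12 | 0.08 |
| 100.0 | FSGM-UE(1) | 0.36 | 18.73 | 22.31 | 0.08 |
| 100.0 | FSGM-TE(1) | 0.33 | 15.59 | 23.32 | 0.07 |
| 100.0 | FSGM-UE(19) | 0.30 | 14.83 | 19.11 | 0.06 |
| 100.0 | FSGM-TE(19) | 0.31 | 17.62 | 20.68 | 0.07 |

Table 7: Detection accuracy and error rates for different $\lambda$ values on ShanghaiTechRGBD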

Using the CSRNet [27] and MCNN [71] architectures.

To show that the above results are not tied to the CAN architecture, we re-ran our experiments using CSRNet [27] and MCNN [71]. As can be seen in Fig. 6, we get similar mIoU scores for all three, with a slight advantage for the more recent CAN.

4.5 Tampering with the Ground-Truth Depth Map

The results of Section 4.4 were obtained under the assumption that we have access to a ground-truth depth map that is safe from attack. In some scenarios, this might not be the case and the attacker might be able to tamper with the depth map. Fortunately, even if this were the case, the attack would still be detectable as follows. Given $N$ training depth maps $\{\mathbf{Z}_i\}_{1 \le i \le N}$ recorded by a fixed depth sensor, we can record the min and max depth values for each pixel $j$ as

$$Z_j^{min} = \min_{1 \le i \le N} Z_{i,j}, \qquad Z_j^{max} = \max_{1 \le i \le N} Z_{i,j}. \tag{11}$$

Given ground-truth depth maps that are exposed and can be tampered with, the tampered depth value at pixel $j$ of the ground-truth depth map can be written as

$$\tilde{Z}_j = Z_j^{gt} + \gamma \bar{Z}, \tag{12}$$

where $\bar{Z}$ is the mean depth value in the ground-truth depth maps in the test dataset, and $\gamma$ is a scalar that represents the perturbation strength of a potential attack. We take the perturbation to be a function of $\bar{Z}$ because the ground-truth depth map is not an input to our network. If the attacker were able to choose an appropriate $\gamma$ for each pixel of the ground-truth depth map, the tampering indicator of Eq. 2 could be compromised. Fortunately, such an attack is very likely to be detected using the following simple but effective approach. If $\tilde{Z}_j < Z_j^{min}$ or $\tilde{Z}_j > Z_j^{max}$, we label the ground-truth pixel as potentially tampered with. As shown in Fig. 8, the pixel-wise detection accuracy is over 90% even for extremely small perturbation strengths $\gamma$ and quickly increases from there. This makes the attacker’s task difficult indeed.
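
The range check on exposed depth maps is simple enough to sketch directly. The example assumes that depth maps are nested lists of equal size recorded by a fixed sensor. The function names `depth_range` and `flag_out_of_range` are hypothetical.

```python
def depth_range(train_depth_maps):
    """Per-pixel (min, max) over a list of 2D training depth maps."""
    first = train_depth_maps[0]
    z_min = [row[:] for row in first]
    z_max = [row[:] for row in first]
    for depth in train_depth_maps[1:]:
        for y, row in enumerate(depth):
            for x, z in enumerate(row):
                z_min[y][x] = min(z_min[y][x], z)
                z_max[y][x] = max(z_max[y][x], z)
    return z_min, z_max


def flag_out_of_range(depth, z_min, z_max):
    """Binary mask: 1 where a depth value leaves the range seen in training."""
    return [[1 if (z < lo or z > hi) else 0
             for z, lo, hi in zip(row, row_lo, row_hi)]
            for row, row_lo, row_hi in zip(depth, z_min, z_max)]
```

The check relies on the observation made earlier that, for a fixed camera, untampered depth values at a given pixel vary little over time, so the recorded range is tight.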

Figure 8: Detection accuracy on MICC. We report the detection accuracy for depth values tampered with different strengths.

4.6 Sensitivity Analysis

FSGM-U(1) 0.42 0.48 0.54 0.51
FSGM-T(1) 0.45 0.49 0.54 0.49
FSGM-U(19) 0.41 0.47 0.58 0.50
FSGM-T(19) 0.44 0.52 0.57 0.50
FSGM-UE(1) 0.39 0.43 0.52 0.48
FSGM-TE(1) 0.42 0.44 0.51 0.46
FSGM-UE(19) 0.36 0.42 0.45 0.40
FSGM-TE(19) 0.40 0.42 0.47 0.44
Table 8: Pixel-Wise Adversarial Detection on ShanghaiTechRGBD for different indicator values
FSGM-U(1) 0.38 0.40 0.46 0.42
FSGM-T(1) 0.40 0.42 0.49 0.46
FSGM-U(19) 0.37 0.41 0.49 0.43
FSGM-T(19) 0.37 0.40 0.49 0.42
FSGM-UE(1) 0.35 0.38 0.41 0.44
FSGM-TE(1) 0.33 0.38 0.41 0.44
FSGM-UE(19) 0.36 0.44 0.40 0.38
FSGM-TE(19) 0.39 0.46 0.40 0.38
Table 9: Pixel-Wise Adversarial Detection on MICC for different indicator values
FSGM-U(1) 0.38 0.40 0.42 0.40
FSGM-T(1) 0.36 0.42 0.49 0.40
FSGM-U(19) 0.35 0.38 0.36 0.45
FSGM-T(19) 0.36 0.43 0.38 0.33
FSGM-UE(1) 0.38 0.46 0.40 0.38
FSGM-TE(1) 0.40 0.42 0.48 0.41
FSGM-UE(19) 0.35 0.36 0.38 0.44
FSGM-TE(19) 0.38 0.45 0.41 0.38
Table 10: Pixel-Wise Adversarial Detection on Venice for different indicator values

We now quantify the influence of the two main hyper-parameters introduced in Section 4.1 that control the intensity of the attacks.

Perturbation value.

We change the value of $\epsilon$ in Eq. 4 and Eq. 5 from 1.0 to 35.0 for all attacks and plot the resulting mIoU in Fig. 7. Our model can detect very weak attacks with $\epsilon$ down to 1.0 and its performance quickly increases for larger values. In the supplementary material, we will exhibit the monotonic relationship between $\epsilon$ and the people density estimation error. Even for the weakest attacks, there is already a small perturbation of the density estimates (a counting error of around 6 for ShanghaiTechRGBD) that then becomes much larger as $\epsilon$ increases. The number of iterations $n$ is set as recommended in earlier work [23].

Threshold value.

In Tables 8, 9 and 10, we report mIoU values on each dataset as a function of the threshold we use to classify a pixel as tampered with or not, depending on the ratio of Eq. 2. A threshold of 5% gives the best results across all attacks.

Strength of White-Box Attacks.

To check the robustness of our model against white-box attacks, we evaluate different values of $\lambda$ in the loss term of Eq. 7, whose role is to keep the depth estimate as steady as possible, on ShanghaiTechRGBD. We tested our approach for values of $\lambda$ ranging from 0.01 to 100.0 and report the detection accuracy results along with the crowd counting error and depth error in Table 7. For larger values of $\lambda$, both the crowd density error and the detection rate drop. In other words, increasing $\lambda$ makes the attack harder to detect but also weaker. We show the same trend in the other datasets in the supplementary material.

5 Conclusion and Future Perspectives

In this paper, we have shown that estimating density and depth jointly in a two-stream network could be leveraged to detect adversarial attacks against crowd-counting models at the pixel level. Our experiments have demonstrated this to be the case even when the attacker knows our detection strategy, or has access to the ground-truth depth map. In essence, our approach is an instance of a broader idea: One can leverage multi-task learning to detect adversarial attacks. In the future, we will therefore study the use of this approach for other tasks, such as depth estimation, optical flow estimation, and semantic segmentation.

Acknowledgments This work was supported in part by the Swiss National Science Foundation. We also would like to thank Krzysztof Lis and Krishna Nakka for helpful discussions.


  • [1] A.N. Bhagoji, D. Cullina, and P. Mittal. Dimensionality reduction as a defense against evasion attacks on machine learning classifiers. In arXiv preprint arXiv:1704.02654, 2017.
  • [2] E. Bondi, L. Seidenari, A.D. Bagdanov, and A.D. Bimbo. Real-time people counting from depth imagery of crowded environments. International Conference on Advanced Video and Signal Based Surveillance, 2014.
  • [3] G. J. Brostow and R. Cipolla. Unsupervised Bayesian Detection of Independent Motion in Crowds. In Conference on Computer Vision and Pattern Recognition, pages 594–601, 2006.
  • [4] X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting. In European Conference on Computer Vision, 2018.
  • [5] N. Carlini and D. Wagner. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In ACM Workshop on Artificial Intelligence and Security, 2017.
  • [6] N. Carlini and D. Wagner. Towards Evaluating the Robustness of Neural Networks. In IEEE Symposium on Security and Privacy, pages 39–57, 2017.
  • [7] A.B. Chan, Z.S.J. Liang, and N. Vasconcelos. Privacy Preserving Crowd Monitoring: Counting People Without People Models or Tracking. In Conference on Computer Vision and Pattern Recognition, 2008.
  • [8] A.B. Chan and N. Vasconcelos. Bayesian Poisson Regression for Crowd Counting. In International Conference on Computer Vision, pages 545–551, 2009.
  • [9] K. Chen, C.C. Loy, S. Gong, and T. Xiang. Feature Mining for Localised Crowd Counting. In British Machine Vision Conference, page 3, 2012.
  • [10] Z. Cheng, J. Li, Q. Dai, X. Wu, and A. G. Hauptmann. Learning Spatial Awareness to Improve Crowd Counting. In International Conference on Computer Vision, 2019.
  • [11] R. Feinman, R.R. Curtin, S. Shintre, and A.B. Gardner. Detecting Adversarial Samples from Artifacts. In preprint arXiv:1703.00410, 2017.
  • [12] L. Fiaschi, U. Koethe, R. Nair, and F. Hamprecht. Learning to Count with Regression Forest and Structured Labels. In International Conference on Pattern Recognition, pages 2685–2688, 2012.
  • [13] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • [14] Z. Gong, W. Wang, and W. Ku. Adversarial and clean data are not twins. In arXiv preprint arXiv:1704.04960, 2017.
  • [15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. International Conference on Learning Representations, 2015.
  • [16] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel. On the (Statistical) Detection of Adversarial Examples. In arXiv preprint arXiv:1702.06280, 2017.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In European Conference on Computer Vision, 2014.
  • [18] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On Detecting Adversarial Perturbations. International Conference on Learning Representations, 2017.
  • [19] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images. In Conference on Computer Vision and Pattern Recognition, pages 2547–2554, 2013.
  • [20] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-maadeed, N. Rajpoot, and M. Shah. Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In European Conference on Computer Vision, 2018.
  • [21] X. Jiang, Z. Xiao, B. Zhang, and X. Zhen. Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [22] Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Advances in Neural Information Processing Systems, 2017.
  • [23] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial Machine Learning at Scale. International Conference on Learning Representations, 2017.
  • [24] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems, 2017.
  • [25] V. Lempitsky and A. Zisserman. Learning to Count Objects in Images. In Advances in Neural Information Processing Systems, 2010.
  • [26] X. Li and F. Li. Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics. In International Conference on Computer Vision, 2017.
  • [27] Y. Li, X. Zhang, and D. Chen. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [28] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao. Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [29] Z. Lin and L.S. Davis. Shape-Based Human Detection and Segmentation via Hierarchical Part-Template Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):604–618, 2010.
  • [30] K. Lis, K. Nakka, M. Salzmann, and P. Fua. Detecting the Unexpected via Image Resynthesis. In International Conference on Computer Vision, 2019.
  • [31] C. Liu, X. Weng, and Y. Mu. Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization. Conference on Computer Vision and Pattern Recognition, 2019.
  • [32] J. Liu, C. Gao, D. Meng, and A.G. Hauptmann. Decidenet: Counting Varying Density Crowds through Attention Guided Detection and Density Estimation. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [33] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin. Crowd Counting with Deep Structured Scale Integration Network. International Conference on Computer Vision, 2019.
  • [34] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. Crowd Counting Using Deep Recurrent Spatial-Aware Network. In International Joint Conference on Artificial Intelligence, 2018.
  • [35] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu. ADCrowdNet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding. Conference on Computer Vision and Pattern Recognition, 2019.
  • [36] W. Liu, K. Lis, M. Salzmann, and P. Fua. Geometric and Physical Constraints for Drone-Based Head Plane Crowd Density Estimation. International Conference on Intelligent Robots and Systems, 2019.
  • [37] W. Liu, M. Salzmann, and P. Fua. Context-Aware Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [38] X. Liu, J.V.d. Weijer, and A.D. Bagdanov. Leveraging Unlabeled Data for Crowd Counting by Learning to Rank. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [39] Y. Liu, M. Shi, Q. Zhao, and X. Wang. Point in, Box out: Beyond Counting Persons in Crowds. Conference on Computer Vision and Pattern Recognition, 2019.
  • [40] Z. Ma, X. Wei, X. Hong, and Y. Gong. Bayesian Loss for Crowd Count Estimation with Point Supervision. International Conference on Computer Vision, 2019.
  • [41] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • [42] S.M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [43] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A Simple and Accurate Method to Fool Deep Neural Networks. In Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
  • [44] D. Onoro-Rubio and R.J. López-Sastre. Towards Perspective-Free Object Counting with Deep Learning. In European Conference on Computer Vision, pages 615–629, 2016.
  • [45] V. Rabaud and S. Belongie. Counting Crowded Moving Objects. In Conference on Computer Vision and Pattern Recognition, pages 705–711, 2006.
  • [46] A. Ranjan, J. Janai, A. Geiger, and M. J. Black. Attacking Optical Flow. In International Conference on Computer Vision, 2019.
  • [47] V. Ranjan, H. Le, and M. Hoai. Iterative Crowd Counting. In European Conference on Computer Vision, 2018.
  • [48] D.B. Sam, N.N. Sajjan, R.V. Babu, and M. Srinivasan. Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [49] D.B. Sam, S. Surya, and R.V. Babu. Switching Convolutional Neural Network for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, page 6, 2017.
  • [50] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang. Crowd Counting via Adversarial Cross-Scale Consistency Pursuit. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [51] M. Shi, Z. Yang, C. Xu, and Q. Chen. Revisiting Perspective Information for Efficient Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [52] Z. Shi, P. Mettes, and C. G. M. Snoek. Counting with Focus for Free. In International Conference on Computer Vision, 2019.
  • [53] Z. Shi, L. Zhang, Y. Liu, and X. Cao. Crowd Counting with Deep Negative Correlation Learning. In Conference on Computer Vision and Pattern Recognition, 2018.
  • [54] V.A. Sindagi and V.M. Patel. Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs. In International Conference on Computer Vision, pages 1879–1888, 2017.
  • [55] V.A. Sindagi and V.M. Patel. Multi-Level Bottom-Top and Top-Bottom Feature Fusion for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [56] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing Properties of Neural Networks. arXiv Preprint, 2013.
  • [57] J. Wan and A. B. Chan. Adaptive Density Map Generation for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [58] J. Wan, W. Luo, B. Wu, A. B. Chan, and W. Liu. Residual Regression with Semantic Prior for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [59] Q. Wang, J. Gao, W. Lin, and Y. Yuan. Learning from Synthetic Data for Crowd Counting in the Wild. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [60] X. Wang, B. Wang, and L. Zhang. Airport Detection in Remote Sensing Images Based on Visual Attention. In International Conference on Neural Information Processing, 2011.
  • [61] B. Wu and R. Nevatia. Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In International Conference on Computer Vision, 2005.
  • [62] Chaowei Xiao, Ruizhi Deng, Bo Li, Fisher Yu, Mingyan Liu, and Dawn Song. Characterizing adversarial examples based on spatial consistency information for semantic segmentation. In European Conference on Computer Vision, pages 217–234, 2018.
  • [63] F. Xiong, X. Shi, and D. Yeung. Spatiotemporal Modeling for Crowd Counting in Videos. In International Conference on Computer Vision, pages 5161–5169, 2017.
  • [64] H. Xiong, H. Lu, C. Liu, L. Liu, Z. Cao, and C. Shen. From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer. In International Conference on Computer Vision, 2019.
  • [65] C. Xu, K. Qiu, J. Fu, S. Bai, Y. Xu, and X. Bai. Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [66] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding. Perspective-Guided Convolution Networks for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [67] A. Zhang, J. Shen, Z. Xiao, F. Zhu, X. Zhen, X. Cao, and L. Shao. Relational Attention Network for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [68] A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao. Attentional Neural Fields for Crowd Counting. In International Conference on Computer Vision, 2019.
  • [69] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
  • [70] Q. Zhang and A. B. Chan. Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs. In Conference on Computer Vision and Pattern Recognition, 2019.
  • [71] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
  • [72] M. Zhao, J. Zhang, C. Zhang, and W. Zhang. Leveraging Heterogeneous Auxiliary Tasks to Assist Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.