
The Vulnerability of Semantic Segmentation Networks to Adversarial Attacks in Autonomous Driving: Enhancing Extensive Environment Sensing

Enabling autonomous driving (AD) can be considered one of the biggest challenges in today's technology. AD is a complex task accomplished by several functionalities, with environment perception being one of its core functions. Environment perception is usually performed by combining the semantic information captured by several sensors, e.g., LiDAR or camera. The semantic information from the respective sensor can be extracted by using convolutional neural networks (CNNs) for dense prediction. In the past, CNNs consistently showed state-of-the-art performance on several vision-related tasks, such as semantic segmentation of traffic scenes using nothing but the red-green-blue (RGB) images provided by a camera. Although CNNs obtain state-of-the-art performance on clean images, almost imperceptible changes to the input, referred to as adversarial perturbations, may lead to fatal deception. The goal of this article is to illuminate the vulnerability aspects of CNNs used for semantic segmentation with respect to adversarial attacks, and to share insights into some of the existing adversarial defense strategies. We aim to clarify the advantages and disadvantages associated with applying CNNs for environment perception in AD to serve as a motivation for future research in this field.





I Introduction

The desire for mobility is a driving force in progressing technology, with autonomous driving (AD) clearly being the next major step in automotive technology along with electromobility. An AD vehicle is a highly complex system with several sensors and subcomponents, one of them being vehicle-to-everything (V2X) communication.

Fig. 1: An autonomous driving (AD) research vehicle equipped with radio detection and ranging (RaDAR, colored in orange), light detection and ranging (LiDAR, colored in yellow), and camera sensors (colored in purple). The sensors are placed at different locations to obtain an extensive environment sensing.

In the context of AD, V2X communication has several applications, e.g., path planning and decision making [Zeng2019], or systems for localization and cooperative perception [kim2015impact]. All autonomous systems need a perception stage which constitutes the first step in the process chain of sensing the environment. The purpose of cooperative perception systems in AD is the exploitation of information stemming from other traffic participants to increase safety, efficiency and comfort aspects while driving [hobert2015enhancements]. The common concept lies in information transmission between various vehicles as well as between vehicles and back-end servers over any kind of (wireless) transmission channel. The transmitted information ranges from trajectories of the ego vehicle and other traffic participants over vehicle state information to sensor data coming from radio detection and ranging (RaDAR), light detection and ranging (LiDAR), and camera, and assists in constructing a more complete model of the physical world.

Each decision of an AD vehicle is based on the underlying environment perception and is intended to lead to an appropriate action. Hence, the proper perception of the environment is an essential ingredient for reducing road accidents to a bare minimum to foster public acceptance of AD. The most common sensors of a single AD vehicle’s environment perception system ([Bengler2014, Levinson2011, Wei2013]) are illustrated in Fig. 1.

Fig. 2: A simple adversarial attack using the iterative least-likely class method (LLCM) [Kurakin2017a] to fool the ICNet [Zhao2018a] on a hand-picked image from the Cityscapes validation set; (a) clean input image, (b) semantic segmentation of clean input image, (c) adversarial example, and (d) semantic segmentation of adversarial example.

Several external sensors, i.e., RaDAR, LiDAR, and camera, are mounted on an AD vehicle. RaDAR sensors are already widely used in multiple automotive functions and are considered to play a key role in enabling AD ([Engels2017, Patole2017]). LiDAR sensors are capable of detecting obstacles [Levinson2011] and were already used in numerous AD competitions [Bengler2014]. Camera sensors on the other hand are mainly used for detecting lane markings or traffic signs [Wei2013], but can also be used for object detection and semantic segmentation [Zhao2018a]. The data captured by the three sensor groups is gathered within a central processing unit to extract semantic information from the environment.

Over the past few years, interest in employing deep neural networks (DNNs) has increased noticeably, as they have consistently achieved state-of-the-art performance in multiple vision-related tasks and benchmarks, including semantic segmentation for AD ([Cordts2016, Long2015]). Semantic segmentation is a classical computer vision task, where each pixel of an RGB image is assigned to a corresponding semantic class, see Fig. 2 (a), (b). Since such camera-based technology is both cheaper and uses less data compared to LiDAR-based technology, it is of special interest for AD. Recent progress in semantic segmentation enables real-time processing [Zhao2018a], making this an even more promising technology for AD applications.

Nevertheless, the environment perception system of an AD vehicle is a highly safety-relevant function. Any error can lead to catastrophic outcomes in the real world. While DNNs revealed promising functional performance in a wide variety of tasks, they show vulnerability to certain input patterns, denoted as adversarial examples [Szegedy2014]. Adversarial examples are almost imperceptibly altered versions of an image and are able to fool state-of-the-art DNNs in a highly robust manner, see Fig. 2 (c), (d). Assion et al. [Assion2019] showed that a virtually unlimited set of adversarial examples can be created on each state-of-the-art machine learning model. This intriguing property of DNNs is of special concern when looking at their applications in AD and needs to be addressed further by DNN certification methods ([Dvijotham2018, Wu2018]) or means of uncertainty quantification [Michelmore2019]. Cooperative perception, for example, can be seen as one of the weak spots in the data processing during the environment perception of an AD vehicle. It can be used as a loophole to inject adversarial examples that fool AD vehicles in range. Note that this is only one of many possible scenarios in which adversarial examples can find their way into the system.

In this article, we will examine the vulnerability of DNNs towards adversarial attacks, while focusing on environment perception for AD. For this purpose, we chose semantic segmentation as the underlying function we want to perform adversarial attacks on, since it is a promising technology for camera-based environment perception. The remainder of this article is structured as follows: First, we give a brief overview of semantic segmentation and introduce the ICNet [Zhao2018a] as a potential network topology, which we will then adopt for our experiments. Second, we continue with adversarial attacks, starting with simple image classification and extending to adversarial attacks for semantic segmentation. We demonstrate several visual examples to raise awareness of DNNs’ vulnerability towards adversarial attacks. Third, we examine techniques for defending against the adversarial attacks shown before and compare the obtained qualitative results. Lastly, we conclude by providing final remarks and discussing some future research directions, pointing out that certification is an important aspect to ensure a certain level of robustness when employing DNNs. The article is intended to sensitize the reader to vulnerability issues of DNNs in environment perception for AD and to stir interest in the development of new defense strategies for adversarial attacks.

II Semantic Segmentation

An RGB image is a high-dimensional source of data, with pixels being the smallest units of semantic information. Semantic segmentation is a popular method to extract the semantic information from an RGB image, where each pixel is tagged with a label taken from a finite set of classes. Today’s state of the art in semantic segmentation is dominated by convolutional neural networks (CNNs), a special form of DNNs. This section introduces some mathematical notation regarding CNNs and gives an overview of the CNN architecture used for semantic segmentation throughout this article.

II-A Mathematical Notation

For the sake of simplicity, we first assume having a CNN, which takes one input image and outputs only a corresponding class for the entire image. Hence, we begin with simple image classification and then extend to semantic segmentation.

Fig. 3: Architectural overview of ICNet [Zhao2018a]. The ICNet takes different scales of an RGB image as inputs (left gray block) to output a semantic segmentation mask (right gray block). The encoder consists of three scale-dependent parts to extract multi-scale features from the inputs (shades of blue). Each of these three encoder parts performs a downsampling by a factor of eight during feature extraction. To save computational complexity, the bigger scales are limited to low-level and mid-level feature extraction. The extracted multi-scale features are then fused within the decoder by a multi-scale fusion block (light magenta), before performing final upsampling to obtain a full-resolution semantic segmentation mask with respect to the input.

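First of all, the input image is denoted as $x \in \mathcal{X} \subset \mathbb{G}^{H \times W \times C}$, with image height in pixels $H$, image width in pixels $W$, number of color channels $C$, dataset $\mathcal{X}$, and the set of integer gray values $\mathbb{G}$. Each image contains gray values $x_i \in \mathbb{G}^{C}$ at each pixel position $i \in \mathcal{I}$, with $\mathcal{I}$ being the set of pixel positions, having the cardinality $|\mathcal{I}| = H \cdot W$. Smaller patches of an image are denoted as $x_{\mathcal{I}_i} \in \mathbb{G}^{h \times w \times C}$, with patch height in pixels $h$, patch width in pixels $w$, and the set of pixel positions $\mathcal{I}_i \subset \mathcal{I}$ with $i$ being the center pixel and $|\mathcal{I}_i| = h \cdot w$. For the special case of $h = w = 1$, we obtain $x_{\mathcal{I}_i} = x_i$. A CNN usually consists of several layers $\ell \in \mathcal{L}$ containing feature map activations $f_\ell(x) \in \mathbb{R}^{H_\ell \times W_\ell \times C_\ell}$ of the respective layer $\ell$, and 1st layer input image $x$, with the set of layers $\mathcal{L}$, feature map height $H_\ell$, feature map width $W_\ell$, and number of feature maps $C_\ell$. Fed with the input image $x$, a CNN for image classification $\mathfrak{F}$ outputs a probability score $P(s \mid x) \in [0, 1]$ for each class $s \in \mathcal{S}$, with the set of classes $\mathcal{S} = \{1, 2, \dots, N\}$ and the number of classes $N$, leading to

$$ y = \mathfrak{F}(x) = \big( P(s \mid x) \big)_{s \in \mathcal{S}} \in [0, 1]^{N}. \tag{1} $$
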
For better readability, the CNN parameters are omitted in our notation. The predicted class $s^{*}$ for the input image $x$ is then obtained by

$$ s^{*} = \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid x). \tag{2} $$

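From now on, a CNN capable of performing semantic segmentation is considered. The respective CNN outputs a probability $P(s \mid i, x)$ for each pixel position $i \in \mathcal{I}$ of the input image $x$ and class $s \in \mathcal{S}$. Altogether, it outputs class scores for all pixel positions $i \in \mathcal{I}$ and classes $s \in \mathcal{S}$, leading to

$$ y = \mathfrak{F}(x) \in [0, 1]^{H \times W \times N}. \tag{3} $$

The semantic segmentation mask $m = (m_i)_{i \in \mathcal{I}}$ containing the predicted class $m_i$ at each pixel position $i$ of the input image $x$ is then obtained by

$$ m_i = \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid i, x). \tag{4} $$
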
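The performance of such a CNN is measured by the mean intersection-over-union (mIoU)

$$ \mathrm{mIoU} = \frac{1}{N} \sum_{s \in \mathcal{S}} \frac{\mathrm{TP}(s)}{\mathrm{TP}(s) + \mathrm{FP}(s) + \mathrm{FN}(s)}, \tag{5} $$

with the class-specific true positives $\mathrm{TP}(s)$, false positives $\mathrm{FP}(s)$, and false negatives $\mathrm{FN}(s)$.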
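As a minimal sketch of how the mIoU can be computed from a predicted and a ground-truth label map, the following NumPy function accumulates a confusion matrix and averages the per-class intersection-over-union. The function name `mean_iou` and the choice to skip classes that occur in neither map are assumptions made for this illustration; benchmark evaluations such as Cityscapes typically accumulate the counts over the entire dataset before averaging.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union of two integer label maps of equal shape."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class s, but labelled differently
    fn = conf.sum(axis=1) - tp   # labelled as class s, but predicted differently
    denom = tp + fp + fn
    valid = denom > 0            # assumption: ignore classes absent from both maps
    return float(np.mean(tp[valid] / denom[valid]))

# Toy usage with two classes on a 2 x 2 label map.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so `score` equals 7/12.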

II-B Architecture for Semantic Segmentation

Today’s state-of-the-art CNN architectures for semantic segmentation are often based on the work of Long et al. [Long2015]. They proposed to use a CNN, pretrained on image classification, as a feature extractor and further extend it to recover the original image resolution. The extended part is often referred to as the decoder and fulfills the task of gathering, reforming and rescaling the extracted features for the task of semantic segmentation. One characteristic of this proposed network architecture is the absence of fully connected layers. Such CNNs are therefore called fully convolutional networks (FCNs).

Especially for AD, a real-time capable state-of-the-art CNN that is robust to minimal changes in the input is needed. Arnab et al. [Arnab2018] analyzed the robustness of various CNNs for semantic segmentation towards simple adversarial attacks ([Goodfellow2015, Kurakin2017a]), and concluded that CNNs using the same input at different scales are often most robust. The ICNet developed by Zhao et al. [Zhao2018a] combines both properties: a lightweight CNN architecture and multi-scale inputs. The overall structure of the ICNet is depicted in Fig. 3. The ICNet is designed to extract multi-scale features by taking different scales of the image as inputs. The extracted multi-scale features are fused before being upsampled to obtain a full-resolution semantic segmentation mask. The ICNet mainly profits from the combination of high-resolution low-level features (e.g., edges) with low-resolution high-level features (e.g., spatial context). For the sake of reproducibility, an openly available reimplementation of the ICNet based on TensorFlow is used and tested on the widely applied Cityscapes dataset [Cordts2016]. Cityscapes serves as a good dataset for exploring CNNs using semantic segmentation for AD, having pixel-wise annotations for 5000 images (validation, training and test set combined), with relevant classes such as pedestrians and cars. The reimplementation of the ICNet achieves 67.26 % mIoU on the Cityscapes validation set and runs at about 19 fps on our Nvidia Tesla P100 and about 26 fps on our Nvidia Geforce GTX 1080Ti with an input resolution of . These numbers are promising and indicate that semantic segmentation could serve as a technology for the environment perception system of AD vehicles.

III Adversarial Attacks

Although CNNs exhibit state-of-the-art performance in several vision-related fields of research, Szegedy et al. [Szegedy2014] revealed their vulnerability towards certain input patterns. The CNN topologies they investigated were fooled by just adding small and imperceptible patterns to the input image. An algorithm producing such adversarial perturbations is called an adversarial attack and a perturbed image is referred to as an adversarial example.

Based on the observations of Szegedy et al., new approaches arose for crafting adversarial examples more efficiently ([Athalye2018, Carlini2017, Goodfellow2015, Kurakin2017a, Moosavi-Dezfooli2016]) and were even extended to dense prediction tasks, e.g., semantic segmentation ([Assion2019, Metzen2017, Mopuri2018]). In the following, two types of adversarial attacks will be introduced: individual adversarial attacks, which aim at fooling a CNN on the basis of one particular input image, and universal adversarial attacks, which aim at fooling a CNN on a whole set of images at the same time.

III-A Individual Adversarial Perturbations

For the sake of simplicity, CNNs for image classification are considered in the following to describe the basic nature of targeted and non-targeted adversarial attacks using individual adversarial perturbations. As shown before, image classification can be easily extended to semantic segmentation.

Common adversarial attacks aim at fooling a CNN, so that the predicted class $s^{*}$ does not match the ground truth class $\bar{s}$ of the input image $x$. One example of such an adversarial attack is the fast gradient sign method (FGSM) introduced by Goodfellow et al. [Goodfellow2015]. FGSM adopts the loss function $J(y, \bar{y})$ that is used during training of the underlying CNN, with the network output $y$ and the ground truth $\bar{y}$, and computes the adversarial example by

$$ x^{\mathrm{adv}} = x + r = x + \varepsilon \cdot \mathrm{sign}\big( \nabla_x J(y, \bar{y}) \big), \tag{6} $$

with the adversarial perturbation $r$, the step size $\varepsilon$, and the gradient $\nabla_x J$ with respect to the input image $x$. Note that $\mathrm{sign}(\cdot) \in \{\pm 1\}^{H \times W \times C}$. FGSM lets the perturbation effectively increase the loss in each dimension by manipulating the input image in the positive (“+”) gradient direction. Thus, one is not limited to using the ground truth $\bar{y}$ as depicted in (6), but can in fact use the output $y$ of the respective DNN.
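To make the FGSM update concrete, here is a minimal NumPy sketch on a toy linear softmax classifier, which stands in for the CNN so that the input gradient can be written analytically. The helper names (`softmax`, `cross_entropy`, `input_gradient`, `fgsm`), the toy weights, and the clipping to the gray-value range [0, 255] are assumptions made for this illustration; with a real network the gradient would come from automatic differentiation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift logits for numerical stability
    return e / e.sum()

def cross_entropy(W, b, x, label):
    """Loss J of a toy linear softmax classifier (hypothetical stand-in for the CNN)."""
    return -np.log(softmax(W @ x + b)[label])

def input_gradient(W, b, x, label):
    """Analytic gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[label] -= 1.0           # dJ/dlogits = p - onehot(label)
    return W.T @ p

def fgsm(W, b, x, label, eps):
    """Single FGSM step: shift every input dimension by eps towards higher loss."""
    r = eps * np.sign(input_gradient(W, b, x, label))
    return np.clip(x + r, 0.0, 255.0)   # stay within the valid gray-value range

# Toy usage: two "pixels", two classes, ground-truth class 0.
W, b = np.eye(2), np.zeros(2)
x = np.array([3.0, 1.0])
x_adv = fgsm(W, b, x, label=0, eps=0.5)
```

In this toy setting the step moves the input from (3, 1) to (2.5, 1.5), which lowers the score of the ground-truth class and therefore raises the loss, while no input dimension changes by more than `eps`.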

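Kurakin et al. [Kurakin2017a] extended FGSM to an iterative algorithm, changing the adversarial perturbation slightly in each iteration by a small step size $\alpha$. To prevent the adversarial perturbation’s magnitude from getting too large, it is upper-bounded by

$$ \| r \|_{\infty} \leq \varepsilon, \tag{7} $$

with $\varepsilon$ being the upper bound of the infinity norm $\| \cdot \|_{\infty}$. This way, the perceptibility of the adversarial perturbation is controlled by adjusting $\varepsilon$ accordingly. For the iterative case, (6) extends to

$$ x^{\mathrm{adv}}_{\tau + 1} = x^{\mathrm{adv}}_{\tau} + \alpha \cdot \mathrm{sign}\big( \nabla_x J(y_{\tau}, \bar{y}) \big), \tag{8} $$

with $\tau$ being the current iteration index, $y_{\tau}$ the network output for $x^{\mathrm{adv}}_{\tau}$, and therefore $x^{\mathrm{adv}}_{\tau}$ the adversarial example at iteration $\tau$. (Footnote 2: The total number of iterations is set by flooring .)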
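The iterative variant can be sketched in the same toy setting: after every small step, the adversarial example is projected back into the infinity-norm ball around the clean input. The names `iterative_fgsm`, `alpha`, and `num_iter` are assumptions for this sketch, and the number of iterations is deliberately left as a free parameter rather than derived from the perturbation bound.

```python
import numpy as np

def input_gradient(W, b, x, label):
    """Cross-entropy gradient w.r.t. the input of a toy linear softmax classifier."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    p[label] -= 1.0               # dJ/dlogits = p - onehot(label)
    return W.T @ p

def iterative_fgsm(W, b, x, label, eps, alpha, num_iter):
    """Repeated small sign-gradient steps, kept within an eps-ball (infinity norm)."""
    x_adv = x.astype(float).copy()
    for _ in range(num_iter):
        x_adv = x_adv + alpha * np.sign(input_gradient(W, b, x_adv, label))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # enforce the perturbation bound
        x_adv = np.clip(x_adv, 0.0, 255.0)         # keep valid gray values
    return x_adv

# Toy usage: the perturbation saturates at the eps-bound after four steps.
W, b = np.eye(2), np.zeros(2)
x = np.array([3.0, 1.0])
x_adv = iterative_fgsm(W, b, x, label=0, eps=1.0, alpha=0.25, num_iter=10)
```

Because the projection is applied after every step, the final perturbation never exceeds `eps` in any dimension, no matter how many iterations are run.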

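Considering AD vehicles, there exists no ground truth for the data being inferred. As already pointed out, a naive attacking idea in this setup would be finding an adversarial perturbation $r$, such that (classification!)

$$ \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid x + r) \neq \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid x). \tag{9} $$

Such an attack is the least-likely class method (LLCM) introduced by Kurakin et al. [Kurakin2017a]. LLCM aims at finding an adversarial perturbation $r$ to obtain

$$ \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid x + r) = s^{\circ} = \underset{s \in \mathcal{S}}{\arg\min}\; P(s \mid x), \tag{10} $$

with the least-likely class $s^{\circ}$ of the input image $x$. Different from before, the adversarial example using LLCM is obtained by taking a step in the negative direction of the gradient with respect to the input image $x$, according to

$$ x^{\mathrm{adv}} = x - \varepsilon \cdot \mathrm{sign}\big( \nabla_x J(y, y^{\circ}) \big), \tag{11} $$

where $y^{\circ}$ denotes the target corresponding to $s^{\circ}$, thereby minimizing the loss function. Similar to FGSM, LLCM can also be performed in an iterative fashion, where in each step a small adversarial perturbation is added to the respective input image.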
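A single LLCM step differs from FGSM in two ways: the target is the class the model itself considers least likely, and the step goes against the gradient so that the loss towards that target decreases. The sketch below again uses a toy linear softmax classifier in place of the CNN; the function name `llcm_step` and the toy weights are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift logits for numerical stability
    return e / e.sum()

def llcm_step(W, b, x, eps):
    """One least-likely class step on a toy linear softmax classifier."""
    p = softmax(W @ x + b)
    target = int(np.argmin(p))                   # least-likely class for the clean input
    grad = W.T @ (p - np.eye(len(p))[target])    # gradient of the loss towards that target
    x_adv = np.clip(x - eps * np.sign(grad), 0.0, 255.0)   # negative gradient direction
    return x_adv, target

# Toy usage: class 1 is least likely for x, and the step raises its score.
W, b = np.eye(2), np.zeros(2)
x = np.array([3.0, 1.0])
x_adv, target = llcm_step(W, b, x, eps=0.5)
```

No ground truth is needed here, which matches the deployment situation described above: the target is derived purely from the network output on the clean input.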

Another well-known approach for crafting adversarial examples is DeepFool [Moosavi-Dezfooli2016], introduced by Moosavi-Dezfooli and his colleagues. Compared with FGSM and LLCM, DeepFool does not only search for individual adversarial perturbations, but also tries to find the minimal adversarial perturbation, with respect to an $\ell_p$-norm, that changes the network’s output. This leads us to the following equation

$$ r_{\min} = \underset{r}{\arg\min}\; \| r \|_{p}, \quad \text{s.t.} \quad \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid x + r) \neq \underset{s \in \mathcal{S}}{\arg\max}\; P(s \mid x), \tag{12} $$

with $\| \cdot \|_{p}$ being the $\ell_p$-norm restricting the magnitude of $r$. Moosavi-Dezfooli et al. primarily experimented with $p = 2$, showing DeepFool’s superiority in terms of speed and magnitude compared to FGSM, when targeting the same error rate for the respective CNN. We will not go into further detail here, but refer the interested reader to [Moosavi-Dezfooli2016] for more information about DeepFool.

Carlini and Wagner proposed an approach that proved to be extremely effective at circumventing adversarial example detection mechanisms [Carlini2017]. They use

$$ \| r \|_{p} + c \cdot f(x + r) \tag{13} $$

as an objective function, with $c$ being a hyperparameter and $f(\cdot)$ being a loss function. Athalye et al. [Athalye2018] adopted this approach and as a result managed to circumvent several state-of-the-art defense mechanisms. We refer the interested reader to [Athalye2018] and [Carlini2017] for more fine-grained information about both approaches and their specific variations.

So far, we introduced adversarial attacks that were successfully applied to image classification. Arnab et al. [Arnab2018] did the first extensive analysis on the behavior of different CNN architectures for semantic segmentation using FGSM and LLCM, both performed iteratively and non-iteratively. They report results on a large variety of CNN architectures, including both lightweight CNN architectures and heavyweight CNN architectures. The main observation was that network models using residual connections are often more robust when it comes to adversarial attacks. In addition, lightweight CNN architectures tend to be almost equally robust as heavyweight CNN architectures. In summary, the results on the Cityscapes dataset demonstrated the vulnerability of CNNs in general. We show a typical attack in Fig. 2 using the iterative LLCM on the ICNet with the hyper parameters and . Despite being mostly imperceptible to the human eye, the adversarial example leads to a dramatically altered network output. To show the overall effect on the Cityscapes validation set, we computed the mIoU ratio for the iterative LLCM and the non-iterative LLCM using different values for $\varepsilon$. The mIoU ratio is defined by

$$ Q = \frac{\mathrm{mIoU}_{\mathrm{adv}}}{\mathrm{mIoU}_{\mathrm{clean}}}, \tag{14} $$

with $\mathrm{mIoU}_{\mathrm{adv}}$ being the mIoU on adversarially perturbed images $x^{\mathrm{adv}}$, and $\mathrm{mIoU}_{\mathrm{clean}}$ being the mIoU on clean images $x$. The results are plotted in Fig. 4. As expected, the stronger the adversarial perturbation (in terms of $\varepsilon$), the lower the mIoU on adversarial examples, and thus a lower mIoU ratio is obtained. As pointed out by Arnab et al. [Arnab2018], we also observe that the non-iterative LLCM is even stronger than its iterative counterpart, which contradicts the original observation made by Kurakin et al. [Kurakin2017a] on image classification. Arnab et al. argue that this phenomenon might be a dataset property of Cityscapes, since the effect does not occur on their second dataset (Pascal VOC 2012). Nonetheless, we do not investigate this further, as we want to focus on more realistic-looking adversarial attacks in the following.

Fig. 4: Adversarial attacks on the ICNet using the iterative or the non-iterative least-likely class method (LLCM) from Kurakin et al. [Kurakin2017a] on the Cityscapes validation set with different values for $\varepsilon$, an upper bound of the $\ell_{\infty}$-norm of the adversarial perturbation $r$. We set for the non-iterative LLCM and for the iterative LLCM. A lower mIoU ratio means a stronger adversarial attack. Note that the non-iterative LLCM appears to be even more aggressive than the iterative LLCM.

Fig. 5: Adversarial attacks on the ICNet using the dynamic nearest neighbor method (DNNM) [Metzen2017] on two example images, one with pedestrians and one with cars, from the Cityscapes validation set. The adversarial examples aim at removing pedestrians (first row) and cars (second row) from the scene. (a) clean input image, (b) semantic segmentation output on clean input image, (c) adversarial example created by DNNM, and (d) semantic segmentation output on adversarial example created by DNNM.

Metzen et al. [Metzen2017] introduced new adversarial attacks for semantic segmentation. Instead of only fooling the CNN, they additionally wanted the respective CNN to output more realistic-looking semantic segmentation masks. To do so, Metzen et al. developed two methods. The first method uses a fake semantic segmentation mask $m(x')$ instead of the original semantic segmentation mask $m(x)$, with $x' \in \mathcal{X}$, meaning that the fake segmentation mask refers to an existing image of the dataset $\mathcal{X}$. The overall assumption of Metzen et al. is that a possible attacker might invest time to create a few uncorrelated fake semantic segmentation masks themselves. Assuming that the attacker wants to use the same fake semantic segmentation mask to fool the respective CNN on several images, they are restricted to stationary situations to operate unnoticed, i.e., the AD vehicle does not move and thus the scenery captured by the camera sensor changes only slightly. Because of this operational constraint, we call this method the stationary segmentation mask method (SSMM). The second method modifies the CNN’s original semantic segmentation mask $m$ by replacing a predefined objective class $o \in \mathcal{S}$ at each corresponding pixel position $i \in \mathcal{I}_o$ by the spatial nearest-neighbor class $n_i$, with $n_i \neq o$. Here, $\mathcal{I}_o$ is the set of all pixel positions where $m_i = o$ holds. By completely removing the objective class from the semantic segmentation mask, we obtain

$$ m_i^{\mathrm{DNNM}} = \begin{cases} n_i, & \text{if } i \in \mathcal{I}_o, \\ m_i, & \text{otherwise}, \end{cases} \tag{15} $$

with $m_i^{\mathrm{DNNM}}$ being the new target class at pixel position $i$. Metzen et al. suggested using the Euclidean distance between two pixel positions $i$ and $j$ in order to find the nearest-neighbor class $n_i = m_j$ satisfying $m_j \neq o$. In contrast to SSMM, the realistic-looking fake semantic segmentation mask created using this method is now unique for each real semantic segmentation output. Additionally, specific properties, such as the correlation between two consecutive real semantic segmentation outputs, are transferred to the created fake ones. Altogether, a possible attacker is able to create a sequence of correlated realistic-looking fake semantic segmentation masks, making this kind of attack suitable for situations where the respective AD vehicle moves. Due to these properties, we call this method the dynamic nearest neighbor method (DNNM). The application of DNNM has the potential to create safety-relevant perception errors for AD. This can be seen in Fig. 5, where DNNM is used to remove pedestrians or cars from the scene. The adversarial examples were created by setting

followed by the same procedure as with iterative LLCM. Astonishingly, the semantic classes different from the objective class are completely preserved and the nearest neighbor class seems to be a good estimate for the regions occluded by the objective class, thereby dangerously providing a plausible but wrong semantic segmentation mask.
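The construction of the DNNM target mask can be sketched as follows: every pixel carrying the objective class is overwritten with the class of the spatially nearest pixel (Euclidean distance) that carries a different class. The function name `dnnm_target_mask` is hypothetical, and the brute-force distance computation is an assumption chosen for readability; for full-resolution masks a distance transform would be the more practical choice.

```python
import numpy as np

def dnnm_target_mask(mask, objective_class):
    """Replace each pixel of `objective_class` by the class of its nearest
    (Euclidean) pixel that belongs to any other class."""
    mask = np.asarray(mask)
    obj = np.argwhere(mask == objective_class)    # pixel positions to overwrite
    keep = np.argwhere(mask != objective_class)   # candidate nearest neighbours
    target = mask.copy()
    if len(obj) == 0 or len(keep) == 0:
        return target                             # nothing to replace, or nothing to copy from
    # Brute-force squared distances; memory grows with len(obj) * len(keep).
    d2 = ((obj[:, None, :] - keep[None, :, :]) ** 2).sum(axis=2)
    nearest = keep[d2.argmin(axis=1)]
    target[obj[:, 0], obj[:, 1]] = mask[nearest[:, 0], nearest[:, 1]]
    return target

# Toy usage: class 2 (e.g. "pedestrian") is removed from a 1 x 4 mask.
mask = np.array([[0, 2, 2, 1]])
fake = dnnm_target_mask(mask, objective_class=2)
```

In the toy mask, the left objective pixel is closer to class 0 and the right one is closer to class 1, so the resulting target mask is `[[0, 0, 1, 1]]`. Ties between equally distant neighbours are resolved by array order in this sketch.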

III-B Universal Adversarial Perturbations

So far, we discussed approaches that generate adversarial perturbations for single input images. In reality, however, it is hard for a possible attacker to generate adversarial examples for each incoming image of an environment perception system, considering a camera running at 20 fps. Therefore, in AD applications a special interest lies in single adversarial perturbations being capable of fooling a CNN on a set of input images, e.g., a video sequence. This class of adversarial perturbations is called universal adversarial perturbation (UAP).

One of the first works towards finding UAPs was done by Moosavi-Dezfooli et al. [Moosavi-Dezfooli2017]. Their idea was to find a UAP that fools the CNN on almost all images of some image set in an image classification task (again, only one class per image). To achieve this, they used the DeepFool algorithm in an iterative fashion to solve the optimization problem


with the subset of respective images for which the CNN is fooled, and the set of all respective images the UAP is optimized on. The UAP is again constrained by

$$ \| r \|_{p} \leq \varepsilon, \tag{17} $$

with $\| r \|_{p}$ being the $\ell_p$-norm of $r$, and $\varepsilon$ being its upper bound. In their experiments, Moosavi-Dezfooli et al. obtained the best results setting and . Different from all the attacks shown before, the UAP generalizes well beyond the images it was optimized on, meaning the UAP can even fool a respective system on a disjoint set of images on which the UAP was not optimized.
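The norm constraint on a UAP is commonly enforced by projecting the perturbation back onto the norm ball after each update, and the resulting single perturbation is then added to every image. The sketch below illustrates both operations in NumPy; the function names `project_lp` and `apply_uap`, the restriction to p = 2 and p = infinity, and the clipping to the range [0, 255] are assumptions for this illustration, not details taken from the cited work.

```python
import numpy as np

def project_lp(r, eps, p):
    """Project a perturbation onto the l_p ball of radius eps (p = 2 or p = inf)."""
    r = np.asarray(r, dtype=float)
    if p == np.inf:
        return np.clip(r, -eps, eps)            # clip every entry independently
    if p == 2:
        norm = np.linalg.norm(r)                # Euclidean norm over all entries
        return r if norm <= eps else r * (eps / norm)
    raise ValueError("only p = 2 and p = inf are sketched here")

def apply_uap(images, r):
    """Add the same perturbation r to every image of a batch (broadcast over axis 0)."""
    return np.clip(np.asarray(images, dtype=float) + r, 0.0, 255.0)

# Toy usage: one perturbation, projected once, applied to a batch of two "images".
r = project_lp([3.0, -4.0], eps=2.0, p=np.inf)
batch_adv = apply_uap(np.zeros((2, 2)), r)
```

The key property of a universal perturbation is visible in `apply_uap`: the perturbation is computed once and reused for every incoming image, so no per-image optimization is needed at attack time.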

Fig. 6: Adversarial attacks on the ICNet using a (single) universal adversarial perturbation created by Fast Feature Fool (FFF) [Mopuri2018]. We show its effectiveness in fooling the ICNet on four example images with cars or pedestrians from the Cityscapes validation set. Each row corresponds to one attack scenario; (a) clean input image, (b) semantic segmentation of clean input image, (c) adversarial example created by FFF, and (d) semantic segmentation of adversarial example created by FFF.

While Moosavi-Dezfooli et al. use samples from a set of images to craft UAPs, Mopuri et al. [Mopuri2017] introduced a dataset-independent method named Fast Feature Fool (FFF). In the following we are considering the formulation of FFF from their extended work [Mopuri2018]. Adopting the overall objective in (16), FFF aims at finding a UAP that increases the mean activation in each layer , without any knowledge about the respective images to fool. This is done by minimizing the following loss function


with respect to $r$ (as 1st layer input image), constrained by (17) with . We used FFF on the ICNet to show the effectiveness and transferability of UAPs on several images taken from the Cityscapes validation set by following Mopuri et al. in choosing . The obtained results for some images are illustrated in Fig. 6. While not generating realistic-looking semantic segmentation masks as DNNM does, FFF still completely fools the ICNet on several diverse images, and the UAP needs to be computed only once. Moreover, safety-critical classes such as pedestrians and cars are removed from the scene in all examples, underlining again the risk of adversarial attacks for AD. Note that the particular danger of this method for AD lies in the fact that it just requires a generic adversarial pattern to be added to any unknown sensor data during driving, causing major errors in the output segmentation mask.

IV Adversarial Defense

So far, we demonstrated that DNNs can be fooled in many different ways by means of almost imperceptible modifications of the input image. This behavior of DNNs poses challenges for their application within environment perception in AD. Therefore, appropriate adversarial defense strategies are needed to decrease the risk of DNNs being completely fooled by adversarial examples. In this section, we present some adversarial defense strategies that have been hypothesized and developed to defend against adversarial attacks. In general, adversarial defense strategies can be distinguished as being specific or agnostic to the model at hand. In the following, we will provide a brief introduction to model-specific defense techniques, but then we will focus on model-agnostic ones.

IV-A Model-Specific Defense Techniques

Model-specific defense techniques aim at modifying the behavior of a specific DNN in such a way that the respective DNN becomes more robust towards adversarial examples. Note that such a technique can most often be applied to numerous DNN topologies; however, once applied, it defends only the specific DNN at hand. One well-known and intuitive model-specific defense technique is adversarial training. In adversarial training, the original training samples of the DNN are extended with their adversarial counterparts, e.g., created by FGSM from Goodfellow et al. as shown before, and the DNN is then retrained with this set of clean and adversarially perturbed images. Although the performance of the DNN on adversarial examples increases ([Goodfellow2015, Moosavi-Dezfooli2017]), the effect is still marginal [Moosavi-Dezfooli2016]. More importantly, it is also not clear which amount or type of adversarial examples is sufficient to increase the DNN’s robustness up to a desired level. Xie et al. [Xie2019] investigated the effect of adversarial examples on the feature maps in several layers. Their observation was that adversarial examples create noise-like patterns in the feature maps. To counter this, they proposed to add trainable denoising layers containing a denoising operation followed by a convolution operation. Xie et al. obtained the best results by using the non-local means algorithm (NLM) [Buades2005] for feature denoising. Bär et al. [Baer2019] explored the effectiveness of teacher-student approaches in defending against adversarial attacks. Here, an additional student DNN is included to increase the robustness against adversarial attacks, assuming that the potential attacker has difficulty dealing with a constantly adapting student DNN. It was concluded that, in combination with simple output voting schemes, this approach could be a promising model-specific defense technique.
Nevertheless, a major drawback of model-specific defense techniques is that the respective DNN has to be retrained, or the network architecture has to be modified, which is not always possible when using pre-trained DNNs.

IV-B Model-Agnostic Defense Techniques

In contrast to model-specific defense techniques, model-agnostic defense techniques, once developed, can be applied in conjunction with any model, as they do not modify the model itself but rather the input data. In particular, the model does not need to be retrained. Hence, such a technique serves as an image pre-processing step in which the adversarial perturbation is removed from the input image.

[Fig. 7 panels: (a) clean output, (b) DNNM attack, (c) defended by NLM, (d) defended by IQ, (e) defended by NLM+IQ; bottom line: avg. mIoU in %]

Fig. 7: Adversarial attacks on the ICNet using the dynamic nearest neighbor method (DNNM) [Metzen2017], defended by image quilting (IQ) [Guo2018] and the non-local means algorithm (NLM) [Buades2005]. Both image rows correspond to the examples shown in Fig. 5. The first row contains an example, where DNNM was used to remove pedestrians from the scene, while the second row contains an example, where DNNM was used to remove cars instead; (a) clean output, (b) adversarial output using DNNM, (c) adversarial output using DNNM defended by NLM, (d) adversarial output using DNNM defended by IQ, and (e) adversarial output using DNNM defended by NLM and IQ combined. The mIoU values in the bottom line refer to the average mIoU over the entire Cityscapes validation set.

Guo et al. [Guo2018] analyzed the effectiveness of non-differentiable input transformations in destroying adversarial examples. Non-differentiability is an important property of adversarial defense strategies, considering that the majority of adversarial attacks are built on gradient-based optimization. Guo et al. used image quilting (IQ) amongst other input transformation techniques and observed IQ to be an effective model-agnostic defense against several adversarial attacks. IQ is a technique wherein the input image $x$ is viewed as a puzzle of small patches $x_{[i]}$, with $i \in \mathcal{I}$ being the position of the center pixel. To remove potential adversaries from an image, each of its patches $x_{[i]}$, irrespective of being adversarially perturbed or not, is replaced by a nearest neighbor patch $\hat{x}_{[i]} \in \mathcal{P}$ to obtain a quilted image $x^{\mathrm{IQ}}$, with $\mathcal{P}$ being a large set of patches created beforehand from random samples of clean images. The aim is to synthetically construct an adversary-free image having the original semantic content.
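
The patch replacement at the core of IQ can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: it uses non-overlapping patches and a plain nearest-neighbor lookup with the Euclidean distance, whereas full image quilting additionally blends overlapping patches along seams. The function names `extract_patches` and `quilt` are hypothetical, and the patch database here is built from a single clean image only for demonstration.

```python
import numpy as np

def extract_patches(img, size):
    """Cut an (H, W, C) image into non-overlapping size x size patches (flattened)."""
    h, w, c = img.shape
    hh, ww = h - h % size, w - w % size          # crop to a multiple of the patch size
    p = img[:hh, :ww].reshape(hh // size, size, ww // size, size, c)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, size * size * c)

def quilt(img, patch_db, size):
    """Replace every patch of img by its nearest neighbor (L2) from patch_db."""
    h, w, c = img.shape
    hh, ww = h - h % size, w - w % size
    q = extract_patches(img, size).astype(float)
    db = patch_db.astype(float)
    # squared Euclidean distances between all query and all database patches
    d2 = (q ** 2).sum(1)[:, None] - 2.0 * q @ db.T + (db ** 2).sum(1)[None, :]
    nearest = db[np.argmin(d2, axis=1)]
    # reassemble the selected patches into an image (cropped size hh x ww)
    out = nearest.reshape(hh // size, ww // size, size, size, c)
    return out.transpose(0, 2, 1, 3, 4).reshape(hh, ww, c)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((16, 16, 3))              # stand-in for a clean image
    patch_db = extract_patches(clean, 4)         # stand-in for the patch set
    perturbed = clean + 0.01 * rng.standard_normal(clean.shape)
    restored = quilt(perturbed, patch_db, 4)
    print("max deviation from clean image:", np.abs(restored - clean).max())
```

Because every output pixel is copied from the clean patch database, no component of the perturbation survives directly; the price is the approximation error of the nearest-neighbor patches.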

[Fig. 8 panels: (a) clean output, (b) FFF attack, (c) defended by NLM, (d) defended by IQ, (e) defended by NLM+IQ; bottom line: avg. mIoU in %]

Fig. 8: Adversarial attacks on the ICNet using Fast Feature Fool (FFF) [Mopuri2018], defended by image quilting (IQ) [Guo2018] and the non-local means algorithm (NLM) [Buades2005]. We show results on the four semantic segmentation outputs from Fig. 6 using the Cityscapes validation set. Each row corresponds to an attack scenario to be defended by NLM, IQ, or a combination of both; (a) clean output, (b) adversarial output using FFF, (c) adversarial output using FFF defended by NLM, (d) adversarial output using FFF defended by IQ, and (e) adversarial output using FFF defended by NLM and IQ combined. The mIoU values in the bottom line refer to the average mIoU over the entire Cityscapes validation set.

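Another model-agnostic defense technique is the non-local means algorithm (NLM) from Buades et al. [Buades2005]. NLM aims at denoising the input image. To accomplish this, NLM replaces each pixel value by

$$x^{\mathrm{NLM}}_{i} = \sum_{j \in \mathcal{I}} w_{i,j}\, x_{j}, \tag{19}$$

with the NLM-denoised pixel $x^{\mathrm{NLM}}_{i}$, the set of pixel positions $\mathcal{I}$, the inter-pixel weighting factor $w_{i,j}$, for which $\sum_{j \in \mathcal{I}} w_{i,j} = 1$ holds, and the pixel value $x_{j}$ at position $j$. The inter-pixel weighting factor relates the respective pixel $x_{i}$ at pixel position $i$ to the pixel $x_{j}$ at pixel position $j$. It is defined by

$$w_{i,j} = \frac{1}{\alpha_{i}} \exp\left(-\frac{\lVert x_{[i]} - x_{[j]} \rVert^{2}_{2,a}}{h^{2}}\right),$$

with the patches $x_{[i]}$ and $x_{[j]}$ centered at pixel positions $i$ and $j$, the squared Gaussian weighted Euclidean distance $\lVert \cdot \rVert^{2}_{2,a}$, with $a > 0$ as the standard deviation of the Gaussian kernel, the hyperparameter for the degree of filtering $h$, and the normalizing factor $\alpha_{i}$. By incorporating the squared Gaussian weighted Euclidean distance, a large weight is put on pixels $x_{j}$ whose neighborhood $x_{[j]}$ looks similar to $x_{[i]}$, the neighborhood of the respective pixel to be denoised.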

The idea behind NLM is to remove the high local dependency of adversarial perturbations. Nevertheless, applying NLM on the complete input image, as stated in (19), can be computationally demanding. Thus, the search window is often reduced to an image region of size . Note that , with .
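
The weighted averaging of NLM with a restricted search window can be sketched as below. This is a slow, loop-based illustration for a grayscale image, not an optimized implementation. The function names, the default values for patch size, search window, degree of filtering `h`, and kernel standard deviation `a`, the normalization of the Gaussian kernel, and the reflective border padding are all assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(size, a):
    """Normalized 2-D Gaussian kernel with standard deviation a."""
    r = np.arange(size) - size // 2
    g = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2.0 * a ** 2))
    return g / g.sum()

def nlm_denoise(img, patch=3, search=7, h=0.1, a=1.0):
    """Non-local means for a 2-D grayscale float image (odd patch/search sizes)."""
    pr, sr = patch // 2, search // 2
    padded = np.pad(img, pr + sr, mode="reflect")
    kernel = gaussian_kernel(patch, a)
    out = np.zeros_like(img, dtype=float)
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            rc, cc = r + pr + sr, c + pr + sr        # pixel i in padded coordinates
            ref = padded[rc - pr:rc + pr + 1, cc - pr:cc + pr + 1]
            num, den = 0.0, 0.0
            for dr in range(-sr, sr + 1):            # restricted search window
                for dc in range(-sr, sr + 1):
                    rj, cj = rc + dr, cc + dc        # neighbor pixel j
                    nb = padded[rj - pr:rj + pr + 1, cj - pr:cj + pr + 1]
                    # squared Gaussian-weighted Euclidean patch distance
                    d2 = np.sum(kernel * (ref - nb) ** 2)
                    w = np.exp(-d2 / h ** 2)         # unnormalized weight
                    num += w * padded[rj, cj]
                    den += w                         # normalizing factor
            out[r, c] = num / den
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.full((12, 12), 0.5)
    noisy = clean + 0.1 * rng.standard_normal(clean.shape)
    denoised = nlm_denoise(noisy, h=0.3)
    print("MSE noisy:   ", np.mean((noisy - clean) ** 2))
    print("MSE denoised:", np.mean((denoised - clean) ** 2))
```

The nested loops make the cost per pixel proportional to the search window area times the patch area, which is why restricting the search window matters in practice.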

Now let’s look into results of both model-agnostic defense methods, IQ and NLM, on the adversarial examples shown before. For IQ, the patch dataset was created using samples from the Cityscapes training set. Here, we followed Guo et al. and collected patches of size pixels in total. Increasing the size of the patch dataset will lead to better approximations of the patches, but on the other hand also increases the search space. The same holds when decreasing the size of the patches up to a certain level.

For NLM, patches and of size were used and the image region for neighbor pixel was restricted according to to keep an adequate algorithm complexity. The degree of filtering was computed by , with being an estimate for the Gaussian noise standard deviation on the input image .

Using these settings, we tested IQ, NLM, as well as a combined version of both, denoted as NLM+IQ, on the adversarial attacks shown in Section III (see Fig. 5 and Fig. 6). It is important to note that we applied both defense methods without any extensive hyperparameter search. The adversarial defenses on DNNM-attacked images are depicted in Fig. 7. From left to right, the original semantic segmentation mask is reconstructed increasingly well, with the combination of NLM and IQ showing the best results (Fig. 7 (e)). Comparing NLM and IQ separately, it can be seen that IQ reconstructs the original semantic segmentation mask more precisely than NLM. The same behavior can be observed when looking at the mIoU values in Fig. 7, where we report averages over the entire Cityscapes validation set. Altogether, the results show that by combining NLM with IQ one can lower the destructiveness of DNNM, an important and relieving observation.

The adversarial defenses on FFF-attacked images are illustrated and supported by the corresponding average mIoU values on the Cityscapes validation set in Fig. 8. Here, it is not trivial to judge by only looking at the images, which defense is superior, IQ or NLM. In some cases, NLM seems to lead to the better results, whereas in other cases IQ seems to outperform NLM. Yet, looking at the average mIoU values for the entire Cityscapes validation set leads to the conclusion that overall NLM is superior to IQ. Moreover, combining NLM with IQ again shows the best results leading to an overall significant improvement in restoration of the segmentation masks. This observation is both extremely important and relieving, as the existence of UAPs is particularly dangerous for the use case of DNNs in AD.
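
For readers unfamiliar with the mIoU metric used in the comparison above, the following sketch shows one common way to compute it from a confusion matrix. It is an illustration only and not the evaluation code behind the reported numbers; the function names and the ignore label value of 255 are assumptions, and whether per-image values are averaged or one confusion matrix is accumulated over the dataset depends on the evaluation protocol.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_label=255):
    """Rows: ground-truth class, columns: predicted class."""
    valid = target != ignore_label               # skip pixels without a label
    idx = num_classes * target[valid].astype(int) + pred[valid].astype(int)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(conf):
    """Mean intersection over union over all classes that occur."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0                          # ignore classes that never occur
    return float(np.mean(tp[present] / union[present]))

if __name__ == "__main__":
    target = np.array([[0, 0], [1, 1]])
    pred = np.array([[0, 1], [1, 1]])
    print("mIoU:", mean_iou(confusion_matrix(pred, target, num_classes=2)))
```

Confusion matrices of several images can be summed before calling `mean_iou` to obtain a dataset-level value.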

Even though we observe a certain level of effectiveness in using model-agnostic defense methods, there is still room left for improvement in defending against adversarial attacks. The works of Carlini and Wagner [Carlini2017] and Athalye et al. [Athalye2018] are just two of many representative examples showing that existing defenses can be bypassed. Carlini and Wagner bypassed several state-of-the-art detection systems for adversarial examples with their approach, whereas Athalye et al. circumvented the non-differentiability property of some state-of-the-art defenses by different gradient approximation methods.

V Summary and Future Directions

Deep neural networks (DNNs) are one of the most promising technologies for the use case of environment perception in autonomous driving (AD). Assuming the environment perception system consists of several camera sensors, a DNN trained for semantic segmentation can be used to perform extensive environment sensing in real-time. Nevertheless, today’s state-of-the-art DNNs still unveil flaws when fed with specifically crafted inputs, denoted as adversarial examples. It was step-by-step demonstrated that it is quite easy and intuitive to craft adversarial examples for individual input images using the least-likely class method (LLCM) or the dynamic nearest neighbor method (DNNM) by simply performing gradient updates on the clean input image. It is even possible to craft adversarial examples to fool not only one but a set of images using the Fast Feature Fool (FFF) method, without any knowledge of the respective input image to be perturbed. This in turn highlights the importance of appropriate defense strategies. From a safety-concerned perspective, the lack of robustness shown by DNNs is a highly relevant and important challenge to deal with, before AD vehicles are released for public use.

DNNs’ lack of robustness evoked the need for defense strategies and other fallback strategies regarding the safety relevance for AD applications. Model-agnostic defense strategies only modify the potentially perturbed input image to decrease the effect of adversarial attacks. This way, an already pretrained DNN can be used without the need of retraining or modifying the DNN itself. We explored two model-agnostic defense strategies, namely image quilting (IQ) and the non-local means algorithm (NLM), both on DNNM and FFF attacks, where the combination of IQ and NLM shows the best results on almost all images. Nevertheless, although clearly robustifying the DNNs towards adversarial attacks, the current state of research in model-agnostic defense strategies also showed that vulnerability of DNNs is not entirely solved yet. However, ensembles of model-agnostic defenses could be promising for tackling adversarial attacks, as well as intelligent redundancy, e.g., by teacher-student approaches. We would also like to point out that certification methods ([Dvijotham2018, Wu2018]) should be further investigated to really obtain provable robustness.

What does this mean regarding the application of DNNs for AD? Are today’s DNNs not suitable for safety-critical applications in AD? We would argue that this is to some extent true, if we only consider applying model-agnostic defenses without certification. DNN training and DNN understandability are two highly dynamic academic fields of research. Research so far has mainly focused on increasing the performance of DNNs, widely neglecting their robustness and certification. In order to develop employable machine learning-based functions that are realistically usable in a real-world setting, it is extremely important to establish their robustness against slight input alterations in addition to improving the task performance. Furthermore, new mature defense and certification strategies are needed, including fusion approaches, redundancy concepts, and modern fallback strategies. We especially recommend that automotive companies focus on certification of DNNs. Otherwise, doors would open for potentially fatal attacks, which in turn would have consequences on public acceptance of AD.


The authors gratefully acknowledge support of this work by Volkswagen Group Automation, Wolfsburg, Germany, and would like to thank Nico M. Schmidt and Zeyun Zhong for their help in setting up final experiments.


Andreas Bär received his B.Eng. degree from Ostfalia University of Applied Sciences, Wolfenbüttel, Germany, in 2016, and his M.Sc. degree from Technische Universität Braunschweig, Braunschweig, Germany, in 2018, where he is currently a Ph.D. degree candidate in the Faculty of Electrical Engineering, Information Technology, and Physics. His research interests include convolutional neural networks for camera-based environment perception and the robustness of neural networks to adversarial attacks. In 2020, he won the Best Paper Award at the Workshop on Safe Artificial Intelligence for Automated Driving, held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition, along with coauthors Serin John Varghese, Fabian Hüger, Peter Schlicht, and Tim Fingscheidt.

Jonas Löhdefink received his B.Eng. degree from Ostfalia University of Applied Sciences, Wolfenbüttel, Germany, in 2015, and his M.Sc. degree from Technische Universität Braunschweig, Braunschweig, Germany, in 2018, where he is currently a Ph.D. degree candidate in the Faculty of Electrical Engineering, Information Technology, and Physics. His research interests include learned image compression and quantization approaches by means of convolutional neural networks and generative adversarial networks.

Nikhil Kapoor received his B.Eng. degree from the Army Institute of Technology, Pune, India, in 2012, and his M.Sc. degree from RWTH Aachen University, Germany, in 2018. Currently, he is a Ph.D. degree candidate at Technische Universität Braunschweig, Braunschweig, Germany, in cooperation with Volkswagen Group Research. His research focuses on training strategies that range from improving the robustness of neural networks for camera-based perception tasks to augmentations and adversarial perturbations using concept-based learning.

Serin John Varghese received his B.Eng. degree from the University of Pune, India, in 2013, and his M.Sc. degree from Technische Universität Chemnitz, Germany, in 2018. Currently, he is a Ph.D. degree candidate at Technische Universität Braunschweig, Braunschweig, Germany, in cooperation with Volkswagen Group Research. His research is focused on compression techniques for convolutional neural networks used for perception modules in automated driving, with a focus on not only inference times but also maintaining, and even improving, the robustness of neural networks.

Fabian Hüger received his M.Sc. degree in electrical and computer engineering from the University of California, Santa Barbara, as a Fulbright scholar in 2009. He received his Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from the University of Kassel, Germany, in 2010 and 2014, respectively. He joined Volkswagen Group Research, Germany, in 2010, and his current research is focused on safe and efficient use of artificial intelligence for autonomous driving.

Peter Schlicht received his Ph.D. degree in mathematics from the University of Leipzig, Germany. After a two-year research stay at the Ecole Polytechnique Fédérale, Lausanne, Switzerland, he joined Volkswagen Group Research, Wolfsburg, Germany, in 2016 as an artificial intelligence (AI) architect. There he deals with research questions on AI technologies for automatic driving. His research interests include methods used for monitoring, explaining, and robotizing deep neural networks as well as securing them.

Tim Fingscheidt received his Dipl.-Ing. and Ph.D. degrees in electrical engineering, both from RWTH Aachen University, Germany, in 1993 and 1998, respectively. Since 2006, he has been a full professor with the Institute for Communications Technology, Technische Universität Braunschweig, Braunschweig, Germany. He received the Vodafone Mobile Communications Foundation prize in 1999 and the 2002 prize of the Information Technology branch of the Association of German Electrical Engineers (VDE ITG). In 2017, he coauthored the ITG award-winning publication, “Turbo Automatic Speech Recognition.” He has been the speaker of the Speech Acoustics Committee ITG AT3 since 2015. He served as an associate editor of IEEE Transactions on Audio, Speech, and Language Processing (2008–2010) and was a member of the IEEE Speech and Language Processing Technical Committee (2011–2018). His research interests include speech technology and vision for autonomous driving. He is a Senior Member of IEEE.