Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving

01/17/2021 ∙ by James Tu, et al. ∙ University of Toronto, University of Illinois at Urbana-Champaign, Cornell University

Modern self-driving perception systems have been shown to improve upon processing complementary inputs such as LiDAR with images. In isolation, 2D images have been found to be extremely vulnerable to adversarial attacks. Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features. Furthermore, existing works do not consider physically realizable perturbations that are consistent across the input modalities. In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle. We focus on physically realizable and input-agnostic attacks as they are feasible to execute in practice, and show that a single universal adversary can hide different host vehicles from state-of-the-art multi-modal detectors. Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features. Furthermore, we find that in modern sensor fusion methods which project image features into 3D, adversarial attacks can exploit the projection process to generate false positives across distant regions in 3D. Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly. However, we find that standard adversarial defenses still struggle to prevent false positives which are also caused by inaccurate associations between 3D LiDAR points and 2D pixels.


1 Introduction

Recent advances in self-driving perception have shown that fusing information from multiple sensors (e.g., camera, LiDAR, radar) [gupta2014learning, song2016deep, wang2018depth, qi2018frustum, liang2019multi, radarnet] leads to superior performance when compared to approaches relying on single sensory inputs. Such performance gains are primarily due to the complementary information contained in the measurements provided by the different types of sensors. For example, LiDAR sensors provide accurate 3D geometry while cameras capture rich appearance information.

Figure 1: In this work, we generate a universal adversarial mesh to hide various host vehicles from state-of-the-art multi-sensor object detectors. Our attack produces consistent perturbations across image and LiDAR modalities.

Modern perception models which rely on deep neural networks (DNNs) have been found to be extremely vulnerable to adversarial attacks when processing images in isolation [eykholt2018robust, xie2017adversarialdetect, lu2017adversarial, phystexture, ranjan2019attacking]. Adversarial examples can be thought of as perturbations to the sensory inputs which do not alter the semantic meaning of the scene, yet drastically change a DNN’s output, resulting in incorrect predictions. Such vulnerabilities can lead to catastrophic consequences in safety-critical applications. In the context of self-driving, most efforts have investigated attacks against single-sensor inputs, such as image-only attacks [eykholt2018robust, ranjan2019attacking] and LiDAR-only attacks [tu2020physically]. Towards multi-modal robustness, [wang2020towards] perturbs LiDAR and image inputs independently, producing perturbations that are inconsistent across modalities; such perturbations may not be physically realizable and are therefore less threatening in practice. On the other hand, physically realizable approaches such as [cao2020msf] only search over shape and ignore texture, which is crucial for corrupting image inputs. Furthermore, these prior works do not attempt to generate universal perturbations, which are perhaps the most threatening in practice. Such perturbations are input-agnostic and can successfully attack any input in the training distribution with high probability, meaning they can be executed without prior knowledge of the scene and can consistently disrupt models that process sensory information over time.

This paper demonstrates the susceptibility of multi-sensor detection models to physically realizable and input-agnostic adversarial perturbations. To create a physically realizable attack which is also feasible to execute, we focus on object insertion attacks [eykholt2018robust, athalye2018synthesizing, xiao2019meshadv, xiang2019generating, tu2020physically], as they can be carried out via the deployment of physical objects in the real world. Following [tu2020physically], we insert the adversarial object into the scene by placing it on the rooftop of a host vehicle. We render the adversary into LiDAR and image inputs to ensure perturbations are consistent across modalities and that our attack is physically realizable. To achieve a high degree of realism in our attacks, we consider LiDAR and image occlusion as well as environmental lighting in our rendering process as shown in Fig. 2. Furthermore, we perform rendering in a differentiable manner to enable end-to-end learning of the adversarial geometry and texture. During training, our adversary is optimized with respect to all vehicles in the training distribution to create a universal attack which can be applied to any vehicle in any scene.

We conduct an empirical evaluation of our proposed attack on the KITTI [Geiger2012CVPR] self-driving dataset and a novel large-scale self-driving dataset Xenith using the multi-sensor detector MMF [liang2019multi]. We generate input-agnostic adversarial examples that successfully hide host vehicles from state-of-the-art detectors in both datasets. More importantly, we find that incorporating image inputs makes the model more vulnerable when compared to using LiDAR alone, as successful attacks are primarily caused by the brittle image features. Moreover, the projection of image features into 3D allows the adversary to generate false detections in distant regions. Nonetheless, we show that false negative failures can be mitigated significantly by applying feature denoising and adversarial training. However, we observe that distant false positives are much harder to correct with adversarial defenses, as they are also caused by inaccurate mappings between 2D pixels and 3D LiDAR points during fusion.

2 Related work

Adversarial attacks were first discovered in the 2D image domain, where small perturbations on the pixels were shown to generate drastically different prediction results for object classification [szegedy2014intriguing, goodfellow2015explaining]. Networks trained for object detection and semantic segmentation have also been shown to exhibit such vulnerability [xie2017adversarialdetect, lu2017adversarial, chen2018shapeshifter, liu2018dpatch, li2018robust, wei2019transferable]. Early methods [szegedy2014intriguing, goodfellow2015explaining, deepfool, universal] assume knowledge of the victim model's gradients and are referred to as white-box attacks. It was later shown that black-box attacks can achieve similar success [practicalblackbox, decisionattack, qeba]. Defense and robustness evaluation procedures for adversarial attacks have also been explored [athalye2018obfuscated, papernot2016limitations, carlini2017towards, ilyas2019adversarial, carlini2019evaluating, tsipras2018robustness, xie2019feature, engstrom2019exploring].

Aside from changing the pixel values by a small amount, various other ways to “perturb” an image have been proposed. Object insertion attacks are realistic attacks that insert an object into an image that changes the network output but does not introduce changes in semantics [advpatch, athalye2018synthesizing, eykholt2018robust, yang2020patchattack]. These attacks were originally designed as stickers that can be attached to a target object, and have since also been applied to the lens of a camera [camsticker]. Semantic adversarial attacks [semanticadv, semanticadviccv, bhattad2019unrestricted], on the other hand, use a generative model to edit the semantic properties of an image while maintaining the original identity. Image rendering is also a popular technique in these non-pixel-based attacks and can be made differentiable [diffrender]; using differentiable rendering, [beyondimage] showed that adversarial attacks can be mounted by changing lighting and illumination. Various other object insertion attacks design camouflage textures that can be wrapped around the target object [camou, phystexture, advcamou, universalcamou, advlogo].

Figure 2: Simulating the addition of a mesh onto a vehicle rooftop in a realistic manner. First, rooftop approximation is performed to determine the placement location and heading. Then, LiDAR points and pixels are rendered with directional lighting to approximate sunlight. Finally, a dense depth image is generated with depth completion and used to handle occlusion.

The safety and robustness of self-driving cars against adversarial attacks have also been widely studied. Aside from the typical image-based attacks introduced above, a body of work has focused on point clouds as the input modality, since self-driving vehicles are usually equipped with LiDAR sensors. [xiang2019generating, advpc, pointsetsadv, hamdi2020advpc] directly perturb the location and cardinality of point clouds. Although similar to image-based attacks in spirit, these attacks usually cannot be physically realized in 3D space by a LiDAR sensor since the perturbed points are not guaranteed to be generated by ray projection. Towards more realistic attacks, [sensorattack, robustlidar] developed spoofing attacks that add malicious LiDAR points in 3D space, while other approaches [xiao2019meshadv, cao2019adversarial, tu2020physically] instead optimized adversarial 3D mesh surfaces and used differentiable ray-casting mechanisms to generate LiDAR point clouds.

Despite the fact that multi-modal sensor configurations are widely used on self-driving vehicles [contfuse, liang2019multi, mvxnet, fadadu2020multi, bijelic2020seeing, chadwick2019distant], research on multi-modal sensor attacks is still very limited. Several preliminary works show the possibility of attacking multi-sensor fusion networks [wang2020towards, cao2020msf, yu2020multimodal]. However, [wang2020towards] does not consider consistency across data modalities when perturbing the image input, [cao2020msf] does not consider image texture, resulting in a lack of attack expressivity, and [yu2020multimodal] considers neither. Although experiments show that attacking multiple modalities together can yield a stronger attack, recent research has also drawn the seemingly contrary conclusion that multi-task learning can produce more robust networks [mao2020multitask]. We believe it is an interesting open question whether multi-sensor fusion can be made more robust when the attacks are both input-agnostic and physically realizable.

3 Multi-sensor Adversarial Learning

In this section, we present a general method for learning an adversarial textured mesh to attack multi-sensor object detectors. We focus on attacking the LiDAR-image object detector MMF [liang2019multi], a state-of-the-art multi-sensor network architecture employed in modern self-driving systems. We believe our study is broadly applicable, as the multi-sensor fusion module we attack is a common building block in related works [fadadu2020multi, bijelic2020seeing, chadwick2019distant, radarnet]. Specifically, we require the adversarial multi-sensor attack to be (1) input-agnostic so that it can be applied in different environments, (2) geometrically consistent across image and LiDAR input modalities, and (3) fully automatic so that it can be executed at scale. To this end, we build our framework upon physically realizable LiDAR attacks [tu2020physically, cao2019adversarial], differentiable rendering techniques [liu2019soft], and multi-modal sensor fusion networks [liang2019multi]. Our attacks focus on vehicles as they are the most common object of interest on the road.

Preliminaries:

We consider a bird’s eye view (BEV) object detection model f that takes the front camera image I ∈ R^{H×W×3} and the LiDAR point cloud X ∈ R^{N×3} as input. Here, the dimensions H and W represent the image height and width, respectively, and N represents the number of LiDAR points, which can vary in each frame. The object detector is trained on bird’s eye view (BEV) bounding box annotations, with each bounding box instance parameterized by b = (x, y, w, h, θ). Here, x and y are the coordinates of the bounding box center, w and h indicate the width and height, respectively, and θ represents the orientation.

Figure 3: Illustration of the multi-sensor fusion operation described in Eq. 1.

In order to process both image and LiDAR data modalities, the object detector uses two separate branches to extract features from each modality (see Fig. 4): an image encoder E_im and a LiDAR encoder E_pc. To fuse 2D image features with 3D LiDAR features, the image features are projected onto the BEV voxels by a continuous fusion layer which performs 3D sensor fusion. Here, each BEV voxel samples nearby LiDAR points, obtains the corresponding image pixel for each point, and uses these assignments to map image pixels onto the BEV voxel. Image features are then projected into 3D using these assignments. Finally, the fusion branch aggregates the projected image feature map and the LiDAR feature map by addition. Specifically, the overall model inference can be summarized as follows:

y = f(I, X) = D( F_im + F_pc ),   F_im = proj(E_im(I); X),   F_pc = E_pc(X)    (1)

where F_im and F_pc are the projected image features and the LiDAR features, respectively, proj(·) denotes the continuous fusion projection into BEV, and D is the detection header applied to the fused BEV features. Such a design choice has been shown to improve over processing each input modality in isolation, as cameras provide rich semantic information but struggle to extract accurate 3D geometry, while LiDAR captures accurate geometry but is less rich semantically.
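
To make the fusion operation concrete, the sketch below shows one simplified way such LiDAR-guided projection and additive fusion could be implemented with K = 1 nearest point per voxel. The feature maps, the camera projection matrix cam_proj, and the BEV grid settings are illustrative assumptions, not the MMF implementation, which additionally processes geometric offsets with an MLP.

```python
import torch

def fuse_image_into_bev(img_feat, lidar_feat, points, cam_proj, voxel_size=0.2,
                        bev_range=((0.0, 70.0), (-40.0, 40.0))):
    """Minimal sketch of LiDAR-guided image-to-BEV feature projection (K = 1).

    img_feat:   (C, Hf, Wf) image feature map
    lidar_feat: (C, Hb, Wb) LiDAR BEV feature map
    points:     (N, 3) LiDAR points in the ego frame
    cam_proj:   (3, 4) assumed camera projection matrix (intrinsics @ extrinsics)
    """
    C, Hf, Wf = img_feat.shape
    _, Hb, Wb = lidar_feat.shape

    # 1. Project each LiDAR point into the image to find its corresponding pixel.
    homo = torch.cat([points, torch.ones(len(points), 1)], dim=1)      # (N, 4)
    uvw = homo @ cam_proj.T                                            # (N, 3)
    u = (uvw[:, 0] / uvw[:, 2]).long().clamp(0, Wf - 1)
    v = (uvw[:, 1] / uvw[:, 2]).long().clamp(0, Hf - 1)
    point_feat = img_feat[:, v, u]                                     # (C, N)

    # 2. Assign each point (and hence its image feature) to a BEV voxel.
    col = ((points[:, 0] - bev_range[0][0]) / voxel_size).long().clamp(0, Wb - 1)
    row = ((points[:, 1] - bev_range[1][0]) / voxel_size).long().clamp(0, Hb - 1)
    proj_img_feat = torch.zeros_like(lidar_feat)
    proj_img_feat[:, row, col] = point_feat                            # K = 1 assignment

    # 3. Fuse by addition, as in Eq. (1).
    return lidar_feat + proj_img_feat
```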

Figure 4: Overview of the attack pipeline. The adversarial mesh is rendered into both LiDAR and image inputs in a differentiable manner. The inputs are then processed by a multi-sensor detection model which outputs bounding box proposals. An adversarial loss is then applied to generate false negatives by suppressing correct proposals and false positives by encouraging false proposals. Since the entire pipeline is differentiable, gradient can flow from the adversarial loss to mesh parameters.

3.1 Multi-sensor Simulation for Object Insertion

In this work, we design a framework to insert a textured mesh into the scene so that both appearance and shape can be perturbed to attack multi-sensor perception systems. We attach a triangle mesh onto the roof of a host vehicle, as such placement is physically realizable in the real world. The mesh is parameterized by vertex coordinates V ∈ R^{n_v×3}, faces F (triples of vertex indices), and per-face textures T, where n_v, n_f, and r represent the number of vertices, the number of triangle faces, and the per-face texture resolution, respectively. For scalability reasons, we do not consider transparency, reflective materials, or shadowing, as handling each case would require sophisticated physics-based rendering. Instead, we approximate the sensor simulation using LiDAR ray-tracing and a light-weight differentiable image renderer. Both the image and LiDAR rendering pipelines are differentiable, allowing gradients from LiDAR points and image pixels to flow into the mesh parameters during optimization. The overall pipeline of multi-sensor simulation for object insertion is illustrated in Fig. 2.

Rooftop Approximation:

First, we estimate the center of the vehicle’s rooftop to determine the 3D location for placing the adversary. Following [engelmann2017samp, tu2020physically, najibi2020dops], we represent our vehicle objects using signed distance functions (SDFs) and further project them onto a low-dimensional shape manifold using PCA. For each vehicle, we then optimize the low-dimensional latent code that minimizes the fitting error between the vehicle point cloud and the shape manifold. The fitted SDF is then converted to a watertight vehicle mesh with Marching Cubes [lorensen1987marching]. We select the top 20cm of the vehicle as the approximate rooftop and use the rooftop center and vehicle heading to determine the exact pose for object insertion.
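
As a minimal illustration of the last step, the following sketch selects the top 20 cm of the fitted mesh and returns its centroid together with the vehicle heading as the insertion pose; verts and heading are assumed inputs (the fitted watertight mesh vertices in the ego frame and the yaw from the vehicle's bounding box).

```python
import numpy as np

def rooftop_pose(verts: np.ndarray, heading: float, band: float = 0.2):
    """Approximate the rooftop center of a fitted vehicle mesh.

    verts:   (M, 3) vertices of the watertight vehicle mesh
    heading: vehicle yaw angle (radians) taken from its bounding box
    band:    vertical band below the highest point treated as the roof (20 cm)
    """
    z_max = verts[:, 2].max()
    roof = verts[verts[:, 2] >= z_max - band]      # top 20 cm of the vehicle
    center = roof.mean(axis=0)                     # approximate rooftop center
    # The adversarial mesh is placed at `center`, rotated by `heading` about z.
    return center, heading
```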

LiDAR Simulation:

To simulate insertion in the LiDAR sweep, we sample rays according to LiDAR sensor specifications used to collect the original sweep, such as the number of beams and the horizontal rotation rate. We then compute the intersection of these rays and mesh faces using the Moller-Trumbore algorithm [moller1997fast, tu2020physically] to obtain a simulated point cloud of the adversarial mesh. These simulated points are then added to the original LiDAR sweep.
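
For reference, a scalar (single ray, single triangle) NumPy version of the Moller-Trumbore intersection test could look as follows; our actual simulation batches this computation over all LiDAR rays and mesh faces.

```python
import numpy as np

def ray_triangle_intersect(orig, direc, v0, v1, v2, eps=1e-8):
    """Moller-Trumbore: return hit distance t along the ray, or None if no hit."""
    e1, e2 = v1 - v0, v2 - v0
    pvec = np.cross(direc, e2)
    det = e1.dot(pvec)
    if abs(det) < eps:                 # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = orig - v0
    u = tvec.dot(pvec) * inv_det
    if u < 0.0 or u > 1.0:             # outside the triangle (first barycentric)
        return None
    qvec = np.cross(tvec, e1)
    v = direc.dot(qvec) * inv_det
    if v < 0.0 or u + v > 1.0:         # outside the triangle (second barycentric)
        return None
    t = e2.dot(qvec) * inv_det
    return t if t > eps else None      # intersection point is orig + t * direc
```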

Image Simulation:

To render the adversary into the image, we extract the intrinsics and extrinsics of the camera sensor that captured the original image. We then use a light-weight differentiable renderer, SoftRas [liu2019soft], to simulate the mesh pixels. Using a soft rasterizer during optimization allows gradients to flow from occluded and far-range vertices, enabling better 3D reasoning from pixel gradients. To enhance the fidelity of rendered images, we model sunlight as a directional light source at infinite distance.
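
The directional lighting amounts to a simple Lambertian shading term applied per face; the sketch below illustrates the idea with assumed ambient and diffuse coefficients, which are illustrative rather than the exact values or shading model used by the renderer in our experiments.

```python
import numpy as np

def shade_lambertian(albedo, normals, sun_dir, ambient=0.4, diffuse=0.6):
    """Shade per-face albedo with a directional light at infinite distance.

    albedo:  (F, 3) per-face RGB texture values in [0, 1]
    normals: (F, 3) unit face normals
    sun_dir: (3,) unit vector pointing towards the sun
    """
    cos = np.clip(normals @ sun_dir, 0.0, 1.0)[:, None]   # cosine of incidence angle
    return np.clip(albedo * (ambient + diffuse * cos), 0.0, 1.0)
```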

Occlusion Reasoning:

As we insert a new object into the scene, the rendering process must also consider occlusion for both the original and newly rendered LiDAR points and image pixels. To handle LiDAR occlusions, we compare the depth of the rendered points with the original points in the LiDAR sweep. For each ray that returns a point both in the original sweep and on the adversarial mesh, we keep the closer point and discard the occluded one. To handle image occlusion, per-pixel depth estimates are needed. We therefore first project LiDAR points onto the image, which yields a sparse depth image since the image has a much higher resolution. We then use a depth completion model [chen2019learning], which takes the sparse depth and RGB images as input and outputs dense per-pixel depth estimates. Using the dense depth map, we discard rendered pixels which have greater depth than the corresponding pixel in the original image. After discarding occluded pixels, we overlay the remaining rendered pixels onto the original image. Note that we do not attack the depth completion model, as it is a preprocessing step used to increase the fidelity of our multi-sensor simulation.
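
Both occlusion tests reduce to depth comparisons. A simplified sketch is given below, assuming per-ray correspondences between original and rendered LiDAR returns and a completed dense depth map; all names are illustrative.

```python
import numpy as np

def resolve_lidar_occlusion(orig_depth, mesh_depth):
    """Per ray, keep whichever return (scene or adversarial mesh) is closer.

    orig_depth: (R,) depth of the original LiDAR return for each ray (inf if none)
    mesh_depth: (R,) depth of the simulated mesh return for each ray (inf if none)
    """
    keep_mesh = mesh_depth < orig_depth
    return np.where(keep_mesh, mesh_depth, orig_depth), keep_mesh

def composite_rendered_pixels(image, rendered_rgb, rendered_depth, dense_depth, mask):
    """Overlay rendered mesh pixels that are not occluded by the original scene.

    image:          (H, W, 3) original camera image
    rendered_rgb:   (H, W, 3) rendered mesh pixels
    rendered_depth: (H, W) depth of rendered mesh pixels
    dense_depth:    (H, W) completed per-pixel depth of the original image
    mask:           (H, W) bool mask of pixels covered by the rendered mesh
    """
    visible = mask & (rendered_depth < dense_depth)   # discard occluded mesh pixels
    out = image.copy()
    out[visible] = rendered_rgb[visible]
    return out
```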

3.2 Universal Adversarial Example Generation

With differentiable rendering methods that simulate consistent perturbations across image and LiDAR input modalities, we now describe the process to generate multi-sensor adversarial examples that fool the object detection model f. We denote the detection output after perturbation as

y′ = f(I′, X′)    (2)

where I′ and X′ are the image and LiDAR point cloud after perturbation, and y′ represents the detection outputs.

Adversarial Objectives:

We consider two adversarial objectives with different focuses: one for false negatives and one for false positives. To generate false negative attacks, we follow prior work [xie2017adversarialdetect, tu2020physically] on attacking object detectors by suppressing all relevant bounding box proposals for the host vehicle. A proposal is relevant if its confidence score is greater than 0.1 and it overlaps with the ground-truth bounding box. The adversarial loss then minimizes the confidence of all relevant candidates:

L_FN = − Σ_{ŷ ∈ y′ : s(ŷ) > 0.1} IoU(ŷ, b*) · log(1 − s(ŷ))    (3)

where IoU denotes the intersection over union operator, s(ŷ) is the confidence score of proposal ŷ, and b* is the corresponding ground-truth box we aim to attack.

Alternatively, we aim to generate false bounding box proposals that do not overlap with any ground-truth boxes in the scene, which we refer to as false positive generation. The false positive adversarial loss increases the confidence of the false positive candidates as follows:

L_FP = − Σ_{ŷ ∈ Y_FP} log s(ŷ)    (4)

where Y_FP ⊂ y′ is the subset of bounding box proposals with no overlap with any ground-truth object bounding box.
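
A minimal PyTorch sketch of the two objectives in Eqs. (3) and (4) is given below; it assumes the detector exposes proposal confidence scores and precomputed IoUs against the ground truth, and the helper names are placeholders rather than the detector's actual interface.

```python
import torch

def false_negative_loss(scores, ious_with_target, score_thresh=0.1):
    """Eq. (3): suppress all relevant proposals on the host vehicle.

    scores:           (P,) proposal confidences in (0, 1)
    ious_with_target: (P,) IoU of each proposal with the targeted ground-truth box
    """
    relevant = (scores > score_thresh) & (ious_with_target > 0)
    if not relevant.any():
        return scores.sum() * 0.0                      # keep the graph connected
    s, iou = scores[relevant], ious_with_target[relevant]
    return -(iou * torch.log(1.0 - s + 1e-6)).sum()

def false_positive_loss(scores, max_iou_with_gt):
    """Eq. (4): boost the confidence of proposals overlapping no ground-truth box."""
    candidates = max_iou_with_gt <= 0.0
    if not candidates.any():
        return scores.sum() * 0.0
    return -torch.log(scores[candidates] + 1e-6).sum()
```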

Mesh Regularization:

Besides the adversarial objective, we use an additional regularization term to encourage realism in the mesh geometry. Specifically, we use a mesh Laplacian regularizer [liu2019soft], which encourages smooth object surface geometries:

L_lap = Σ_i ||δ_i||²,   δ_i = v_i − (1/|N(i)|) Σ_{j ∈ N(i)} v_j    (5)

where δ_i is the offset from vertex v_i to the centroid of its immediate neighbors N(i).
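
For a watertight mesh such as ours, this regularizer can be computed from the face list alone, e.g. with the following sketch of a uniform Laplacian; each neighbor is counted once per incident face, which cancels in the average for watertight meshes.

```python
import torch

def laplacian_loss(verts, faces):
    """Eq. (5): sum of squared offsets from each vertex to its neighbors' centroid.

    verts: (n_v, 3) float tensor of vertex coordinates
    faces: (n_f, 3) long tensor of vertex indices per triangle
    """
    n_v = verts.shape[0]
    nbr_sum = torch.zeros_like(verts)
    nbr_cnt = torch.zeros(n_v, 1, device=verts.device)
    ones = torch.ones(faces.shape[0], 1, device=verts.device)
    for a, b in [(0, 1), (1, 2), (2, 0)]:           # the three edges of each triangle
        i, j = faces[:, a], faces[:, b]
        nbr_sum.index_add_(0, i, verts[j])
        nbr_sum.index_add_(0, j, verts[i])
        nbr_cnt.index_add_(0, i, ones)
        nbr_cnt.index_add_(0, j, ones)
    delta = verts - nbr_sum / nbr_cnt.clamp(min=1)  # offset to neighbor centroid
    return (delta ** 2).sum()
```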

Learning Input-Agnostic Attacks:

Overall, our optimization objective can be summarized as

L = L_FN + λ_FP · L_FP + λ_lap · L_lap    (6)

where λ_FP and λ_lap are coefficients that weight the relative importance of the false positive loss term and the mesh regularization term. We employ this objective to optimize the shape and appearance of the inserted object over the entire dataset to generate an input-agnostic adversarial example.

Therefore, we aim to optimize the expected loss across all vehicles in the training distribution. Specifically, we denote the optimal adversary as

V*, T* = argmin_{V, T} E_{(I, X, b*) ∼ D} [ L(V, T; I, X, b*) ]    (7)

where D is the training distribution and (I′, X′) used in L are obtained by rendering the mesh (V, T) into (I, X). Note that we keep the mesh topology, i.e., the faces F, unchanged during learning. Furthermore, we constrain the physical scale of the adversary by an axis-aligned 3D box. Namely, we require every vertex v ∈ V to satisfy v ∈ [x_min, x_max] × [y_min, y_max] × [z_min, z_max], where the three intervals represent the box constraints along the x-, y-, and z-axis, respectively.

We optimize the expected loss by learning a single adversarial mesh with respect to all vehicles in the training set. Since our proposed pipeline is differentiable end-to-end, the adversarial mesh is optimized using projected gradient descent to respect the box constraints on the mesh vertices. In our experiments, we also conduct attacks targeting a single modality. To achieve this, we disable the gradient flow to the untargeted input branch, while still simulating the mesh into both modalities to maintain physical consistency across image and LiDAR.
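
Schematically, the universal attack can be optimized as in the following sketch. Here render_into_inputs, detector, adv_loss, dataset.sample, and all hyperparameter defaults are placeholders standing in for the differentiable simulation of Sec. 3.1, the MMF model, and the objective of Eq. (6); the clamp at the end implements the projection step of projected gradient descent.

```python
import torch

def optimize_universal_adversary(dataset, detector, render_into_inputs, adv_loss,
                                 verts_init, tex_init, box_lo, box_hi,
                                 steps=10000, lr_tex=1e-2, lr_vert=1e-3):
    """Sketch of learning a single input-agnostic adversarial mesh (Eq. (7))."""
    verts = verts_init.clone().requires_grad_(True)      # mesh geometry V
    textures = tex_init.clone().requires_grad_(True)     # per-face textures T
    opt = torch.optim.Adam([{"params": [textures], "lr": lr_tex},
                            {"params": [verts], "lr": lr_vert}])
    for _ in range(steps):
        # Sample a random host vehicle / scene from the training distribution.
        image, lidar, host_box = dataset.sample()
        # Differentiably render the mesh into both modalities (Sec. 3.1).
        img_adv, lid_adv = render_into_inputs(image, lidar, host_box, verts, textures)
        proposals = detector(img_adv, lid_adv)
        loss = adv_loss(proposals, host_box, verts)       # Eq. (6)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                             # PGD projection step
            verts.clamp_(min=box_lo, max=box_hi)          # vertex box constraints
            textures.clamp_(0.0, 1.0)                     # valid texture range
    return verts.detach(), textures.detach()
```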

3.3 Multi-sensor Adversarial Robustness

Upon conducting successful object insertion attacks, we also study defense mechanisms against such attacks. Compared to the single-sensor setting, achieving multi-sensor adversarial robustness is even more challenging. First, each single input modality could be attacked even when the perturbations on the other input sensors are non-adversarial. Second, adversarial perturbations from each single input modality can interact with each other, which is a unique aspect in the multi-sensor setting. Thus, we need to deal with not only perturbations at each input modality but also their effect in the fusion layer.

We employ adversarial training, as it is the most standard and reliable approach to defense. Adversarial training can be formulated as solving for the model parameters

w* = argmin_w E_{(I, X, Y) ∼ D} [ max_{V, T} L_det( f_w(I′, X′), Y ) ]    (8)

which minimize the empirical risk under perturbation. Here, L_det is the loss function used to train the detection model f_w. This is achieved by training the detection model f_w against perturbations generated by our threat model. While adversarial training is typically performed on image perturbations that are cheap to generate with only a few PGD steps [madry2018towards], our adversarial example generation is prohibitively expensive for the inner loop of the min-max objective. Thus, instead of generating a strong adversary from scratch at every iteration, we adopt free adversarial training [shafahi2019adversarial] and continuously update the same adversary to reduce computation.
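
Under the same placeholder interfaces as above, the defense loop can be sketched as follows; the 5:1 ratio of adversary updates to model updates follows the schedule described in Sec. 4.3, and verts and textures are assumed to be leaf tensors with requires_grad=True.

```python
import torch

def free_adversarial_training(dataset, detector, render_into_inputs, adv_loss,
                              det_loss, verts, textures, box_lo, box_hi,
                              epochs=10, adv_steps=5):
    """Sketch of 'free' adversarial training: a single persistent adversary is
    updated continuously instead of being regenerated from scratch each step."""
    model_opt = torch.optim.Adam(detector.parameters(), lr=1e-4)
    adv_opt = torch.optim.Adam([verts, textures], lr=1e-3)
    for _ in range(epochs):
        for image, lidar, labels, host_box in dataset:
            # Inner maximization: strengthen the persistent adversary (Eq. (8)).
            for _ in range(adv_steps):
                img_adv, lid_adv = render_into_inputs(image, lidar, host_box,
                                                      verts, textures)
                loss_adv = adv_loss(detector(img_adv, lid_adv), host_box, verts)
                adv_opt.zero_grad()
                loss_adv.backward()
                adv_opt.step()
                with torch.no_grad():
                    verts.clamp_(min=box_lo, max=box_hi)   # project onto the 3D box
                    textures.clamp_(0.0, 1.0)
            # Outer minimization: update the detector on the perturbed inputs.
            img_adv, lid_adv = render_into_inputs(image, lidar, host_box,
                                                  verts, textures)
            loss_det = det_loss(detector(img_adv, lid_adv), labels)
            model_opt.zero_grad()
            loss_det.backward()
            model_opt.step()
    return detector
```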

4 Experimental Evaluations

In this section, we first describe the datasets we employ, our attack protocols, and our evaluation metrics. We then present our empirical findings for white-box attacks on each dataset and black-box transfer attacks across datasets. Finally, we explore several defense mechanisms, including adversarial training, towards a more robust multi-sensor object detector.

4.1 Experimental Setting

Datasets:

We conduct our experiments on two self-driving datasets: KITTI [Geiger2012CVPR] and Xenith. Xenith was collected by a fleet of self-driving cars driving around several cities in North America. For both datasets, we use snippets captured in the daytime and consider object detection within 70 meters forward and 40 meters to the left and right of the ego-car, using the front camera image and the LiDAR point cloud. KITTI and Xenith images are provided at different native resolutions. We treat each vehicle in a frame as a separate sample and require the adversarial mesh inserted on the rooftop of a host vehicle to be at least 70% visible in the image. Overall, we have 12,284 vehicle samples in KITTI and 77,818 in Xenith, which we further split into training and test sets at a 7:3 ratio.

Multi-sensor Object Detection:

We adopt the multi-sensor network architecture of MMF [liang2019multi] to perform object detection in bird’s eye view (BEV) on the KITTI dataset. On Xenith, we use a slightly different architecture which has been tuned for improved performance (see the supplementary material for details). The multi-sensor detector has two separate branches to extract features from images and LiDAR: LiDAR points are voxelized and processed as a BEV image with 2D convolutions, while images are processed with a ResNet backbone [he2016deep]. For sensor fusion, each BEV voxel performs a K-nearest-neighbour (KNN) search to sample close-by LiDAR points. We follow the implementation of prior work [liang2019multi] and set K = 1 in our fusion modules.

Figure 5: Evaluation of attack that hides the host vehicle. We plot the host vehicle recall across IoU thresholds. Only attacking LiDAR yields very weak attacks and attacking the image produces significantly stronger perturbations.

Metrics:

Following prior work on rooftop object insertion attacks [tu2020physically], we aim to generate false negatives and make the host vehicle “disappear” from the detector. Thus, we evaluate the recall on the host vehicle across various IoU thresholds. In addition, we evaluate the false negative attack success rate (FN ASR), defined as the percentage of host vehicles detected before perturbation that are no longer detected after perturbation. We consider a host vehicle to be detected if there exists an output bounding box with greater than 0.7 IoU with the vehicle. In addition to generating missed detections on the host vehicle, the attack may also generate false positives. We consider an output bounding box to be a false positive if its maximum overlap with any ground-truth box is less than 0.3 IoU. Furthermore, we do not count boxes which overlap with the clean detections or with the host vehicle. We evaluate the false positive attack success rate (FP ASR) as the percentage of attacks which generate at least one false positive. Finally, we evaluate the overall attack success rate (ASR) as the percentage of attacks which successfully create a false positive or a false negative.
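
In code, the per-sample success criteria can be sketched as below; the inputs are assumed to be precomputed NumPy arrays of IoUs, and the 0.3 IoU threshold for excluding overlaps with clean detections and the host vehicle is an assumption for illustration.

```python
def is_false_negative(iou_host_clean, iou_host_adv, det_thresh=0.7):
    """Host vehicle detected before perturbation but not after (counts towards FN ASR).

    iou_host_clean / iou_host_adv: (P,) IoU of every clean / perturbed detection
    with the host vehicle's ground-truth box.
    """
    return bool((iou_host_clean >= det_thresh).any()
                and not (iou_host_adv >= det_thresh).any())

def has_false_positive(max_iou_gt, max_iou_clean, max_iou_host, fp_thresh=0.3):
    """At least one perturbed detection overlapping neither ground truth, the clean
    detections, nor the host vehicle (counts towards FP ASR)."""
    fp = ((max_iou_gt < fp_thresh) & (max_iou_clean < fp_thresh)
          & (max_iou_host < fp_thresh))
    return bool(fp.any())

def attack_succeeded(fn_flag, fp_flag):
    """Overall ASR counts attacks producing either a false negative or false positive."""
    return fn_flag or fp_flag
```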

Figure 6: Placing the adversarial mesh on a host vehicle can hide the host vehicle completely from state-of-the-art detectors. The same mesh is used for all vehicles in a dataset, as the attack is input-agnostic with respect to the training distribution.

Implementation Details:

The adversarial mesh is initialized as an icosphere. Per-face textures are parameterized using a texture atlas with a fixed texture resolution for each face. To generate the adversarial mesh, we apply the Laplacian regularizer and the false positive loss with fixed coefficients λ_lap and λ_FP during optimization. We further constrain the scale of the mesh by an axis-aligned 3D box that bounds the x, y, and z coordinates of the vertices. We use Adam [kingma2014adam] to optimize the mesh parameters, with separate learning rates for textures and vertex coordinates. To target either the LiDAR or image branch in isolation, we disable gradient flow to the other branch during the backward pass to the adversary.

4.2 Universal Adversarial Attacks

Hiding Host Vehicle:

We evaluate the drop in recall in detecting the host vehicle, as missed detections can affect the self-driving vehicle’s planning towards the most dangerous outcomes. We sweep IoU thresholds and visualize the IoU-recall curve in Fig. 5. First, we show that the insertion of non-adversarial objects with randomized shape and appearance has little impact on the detector. In contrast, an adversarial mesh perturbing both input modalities leads to a significant drop in recall. Moreover, when perturbing the LiDAR and image inputs in isolation, we find that targeting the LiDAR inputs alone yields very weak attacks, while targeting the image alone is almost as strong as perturbing both modalities. This suggests that the image inputs are significantly less robust to the proposed object insertion attack.

Attack Success Rates:

In Table 1 and Table 2, we further analyze the results in terms of attack success rates. We also consider meshes with randomly generated geometry and texture as a baseline for comparison. We observe similar trends of image features being significantly more vulnerable. In addition to causing missed detections, the adversarial mesh is able to attack the detector by generating proposals for objects that do not exist in the real world. Furthermore, we compare against prior work [tu2020physically] that attacks a LiDAR-only object detector. In this case, incorporating image inputs boosts robustness to LiDAR attacks at the cost of being significantly more vulnerable to image perturbations.

Detector        Attack                      FN ASR    FP ASR    ASR
LiDAR           LiDAR [tu2020physically]    31.85%    4.84%     33.23%
LiDAR + Image   Random                      5.68%     2.01%     7.64%
LiDAR + Image   LiDAR                       7.99%     2.36%     10.11%
LiDAR + Image   Image                       26.06%    3.40%     28.43%
LiDAR + Image   Both                        32.76%    4.38%     34.68%
Table 1: Evaluation of attack success rates on KITTI, compared with random meshes and a LiDAR-only model.
Detector        Attack                      FN ASR    FP ASR    ASR
LiDAR           LiDAR [tu2020physically]    23.80%    10.70%    32.60%
LiDAR + Image   Random                      5.06%     4.15%     9.17%
LiDAR + Image   LiDAR                       9.52%     6.21%     15.33%
LiDAR + Image   Image                       42.81%    10.78%    49.59%
LiDAR + Image   Both                        43.15%    11.77%    49.76%
Table 2: Evaluation of attack success rates on Xenith, compared with random meshes and a LiDAR-only model.
Figure 7: Visualization of distant false positive detections and corrupted image features in the image plane and after projection to 3D. Due to the camera’s perspective projection, object insertion attacks can create distant false positives.

Qualitative Examples:

Qualitative examples are shown in Fig. 6. First, we show that the detector fails to detect the host vehicle once we place the adversarial mesh on its rooftop. Note that we show detections in the image rather than in LiDAR for ease of viewing. The same adversarial mesh is used for all examples in a dataset, as our attack is input-agnostic with respect to the training distribution. Furthermore, we show in Fig. 7 that our adversarial mesh generates false positives at very distant locations. Detections are visualized in BEV since distant objects appear too small in the image. Additionally, we visualize image features in the image plane and the visual cone of image features projected into 3D, showing that long-range false positives are caused by the perspective projection of cameras during fusion.

Black-box Transfer Attacks:

We conduct transfer attacks across datasets and show results in Table 3. Overall, our transfer attack on the target dataset is stronger than attacking only the LiDAR input modality on the source dataset, especially when transferring from Xenith to KITTI. On the other hand, transferability is likely limited by differences in image resolution and sensor hardware, which are beyond the scope of this paper but an interesting direction for future work.

Source    Target    FN ASR    FP ASR    ASR
KITTI     KITTI     32.76%    4.38%     34.68%
KITTI     Xenith    14.20%    2.86%     16.88%
Xenith    KITTI     12.64%    6.12%     18.22%
Xenith    Xenith    43.15%    11.77%    49.76%
Table 3: Black-box transfer attack results between KITTI and Xenith. We observe some transferability even between two datasets collected with different sensor hardware and in different geographic locations.

4.3 Improving Robustness

Attacks Against Defense Methods:

As our empirical findings suggest that the image features are more vulnerable, we first employ an existing image-based defense that removes high-frequency components through JPEG compression [dziugaite2016study]. In addition, we conduct adversarial training against the attacker. Since generating a strong adversary is extremely expensive due to the simulation pipeline, we employ a strategy similar to Free Adversarial Training [shafahi2019adversarial] and reuse past perturbations by continuously updating the same adversarial object. Specifically, we perform 5 updates to the adversary per update to the model. We combine feature denoising [xie2019feature] with adversarial training to further enhance robustness against image perturbations in particular. We report the success rates as well as the average precision (AP) at 0.7 IoU to study the trade-off between adversarial robustness and performance on benign data [tsipras2018robustness].

As shown in Table 4, we find that JPEG compression is largely ineffective as a defense. We hypothesize this is because the input-agnostic adversary is rendered at various poses during training and therefore does not rely on the high-frequency signals that JPEG compression removes. In comparison, our adversarial training effectively reduces the overall attack success rate from 49.76% to 14.97%, while dropping AP by only 0.5%. Finally, adding non-local mean blocks after every residual block in the image processing backbone further improves robustness, roughly halving the false negative ASR (Table 4).

Discussions and Future Work:

While the existing defense methods are effective at recovering missed detections, they struggle to prevent false positives. We believe this is because the distant false positives shown in Fig. 7 are only partially due to vulnerability to adversarial perturbations. In fact, such examples exploit erroneous associations between objects that are distant in 3D: the mapping between a mesh pixel and a LiDAR point far away from the mesh enables such attacks. These false associations can easily occur if the assigned pixel for a LiDAR point is shifted by only a few pixels, since objects which are far apart in 3D may appear very close in 2D. We identify two reasons why this can occur in practice. First, due to the receptive field of DNN activations, an adversarial object can influence pixels outside its physical boundaries. Second, while LiDAR sweeps are collected continuously with a rotating sensor, images are captured instantaneously at regular intervals. Consequently, the camera extrinsics used for projection become outdated for LiDAR points captured before and after the image. Thus, to achieve more robust sensor fusion for images and LiDAR, fusion modules must reason about 3D geometry, contextual information, and the temporal information of LiDAR points to generate mappings between image pixels and LiDAR points more intelligently. We hope these findings will inspire future work towards more robust sensor fusion methods.

Defense FN ASR FP ASR ASR AP(clean)
None 43.15% 11.77% 49.76% 84.64%
JPEG [dziugaite2016study] 43.19% 9.45% 49.60% 84.52%
Adv Train [shafahi2019adversarial] 7.83% 8.29% 14.97% 84.16%
Adv FD [xie2019feature] 3.57% 7.53% 10.82% 83.97%
Table 4: Defense results on Xenith. Adversarial training significantly boosts robustness to false negatives and is further improved with feature denoising. JPEG compression is not effective. However, all of these methods still struggle to defend against false positives.

5 Conclusions

We present a novel method that places an adversarial object on the rooftop of a host vehicle to attack multi-sensor fusion-based object detectors in self-driving. The proposed method is capable of hiding the host vehicle and generating false detections at the same time. Compared to existing methods, our attack works in a more realistic setting, as we attack both image and LiDAR input modalities by generating input-agnostic and physically realizable adversarial perturbations. Experiments reveal the vulnerability of multi-sensor fusion-based object detectors in both white-box and black-box settings, primarily due to non-robust image features. While adversarial training with image feature denoising can effectively recover the missed detections, it still struggles to prevent false positives without deeper reasoning about 3D geometry in feature fusion. We believe this work will open up new research opportunities and challenges in the field of multi-sensor adversarial learning.

References

Appendix A Additional Implementation Details

Rooftop Approximation:

We convert all the vehicle meshes from our vehicle bank to a set of SDF volumes, which are then projected into a lower-dimensional latent code with PCA. For each point cloud P to which we wish to fit a mesh, we then optimize the latent code z such that

z* = argmin_z (1/|P|) Σ_{p ∈ P} SDF_z(p)²    (9)

Here, SDF_z is the SDF volume decoded from z, and z* minimizes the mean squared distance from each point in P to the zero-level surface of the SDF. To learn z, we initialize the latent code and perform gradient updates using Adam [kingma2014adam] with a fixed learning rate. Furthermore, the point cloud is transformed to a canonical coordinate frame. While we attempt this approximation for all point clouds, a good fit is not always achievable since distant or occluded point clouds contain a limited number of points. Therefore, we fall back to using the top of the bounding box annotation as the roof if the mean squared distance after fitting is greater than a preset threshold.
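
A sketch of this latent-code fitting is given below, assuming a differentiable decode_sdf(z, points) helper that returns the SDF value of each point under the PCA-decoded shape; the helper and the hyperparameter defaults are placeholders, not our actual decoder or settings.

```python
import torch

def fit_shape_latent(points, decode_sdf, latent_dim=25, steps=100, lr=0.1):
    """Fit a low-dimensional shape code to a vehicle point cloud (Eq. (9)).

    points:     (M, 3) vehicle LiDAR points in the canonical frame
    decode_sdf: callable (z, points) -> (M,) SDF values of the decoded shape
    """
    z = torch.zeros(latent_dim, requires_grad=True)    # latent shape code
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        sdf_vals = decode_sdf(z, points)               # signed distance at each point
        loss = (sdf_vals ** 2).mean()                  # mean squared distance to surface
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```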

Depth Completion:

Our depth completion model adopts the architecture introduced in Chen et al. [chen2017rethinking], and we initialize the model with pre-trained weights from the COCO dataset. The model takes RGB images concatenated with a sparse depth image obtained by projecting LiDAR points onto the image plane. For training, we use the official depth labels for KITTI and aggregated LiDAR depths for Xenith as supervision. We adopt the training loss and schedule from Chen et al. [chen2019learning].

Xenith Model:

We use a slightly different variation of the MMF model [liang2019multi] for Xenith that is tuned for more complex scenarios and faster inference. The differences can be summarized as follows:

  1. The refinement module is removed as two-stage detection is slow in practice.

  2. We modify the LiDAR feature extraction network from residual blocks to a feature pyramid network [lin2017feature] style architecture to enable cross-scale fusion of features at different resolutions.

  3. The number of image-LiDAR fusion modules is reduced from 4 to 1, fusing only at a single feature resolution.

Appendix B Additional Experimental Results

Ablation Studies on the Adversary Size:

To understand how much the size of the adversary affects the strength of the attack, we vary the size of the box constraints on the vertex parameters and measure the attack success rates. Results are shown in Fig. 8. As expected, the attack becomes stronger as the constraints on the vertex coordinates are relaxed. The improvements are more noticeable in the false negative success rates.

Figure 8: The size of the adversary is varied as we sweep the constraints on the vertex coordinates. Larger meshes lead to stronger attacks, as expected.

Ablation Studies on λ_FP:

In our main experiments the false positive loss weight λ_FP is fixed; here we vary this weighting coefficient to analyze how the adversary adapts to focus more on generating either false negatives or false positives. We sweep λ_FP and show the results in Fig. 9. At larger values of λ_FP, the attack trades most of the false negative success for false positives. In our main experiments, we choose λ_FP to focus more on false negatives, as missed detections are far more problematic.

Figure 9: Sweep of λ_FP, the coefficient which weights the false positive component of the loss. Increasing λ_FP trades false negative success for false positives.

Details on KNN:

Following previous implementations of multi-sensor fusion [liang2019multi], each BEV voxel performs a K-nearest-neighbor search to query LiDAR points; we set K = 1 in our main experiments. Here, we vary K and conduct our attacks on models that are retrained to use more LiDAR points for each BEV voxel during fusion. Results are shown in Table 5. Note that K = 0 is equivalent to a LiDAR-only model. Overall, there is no clear trend between this parameter and robustness to our attack.

K       FN ASR    FP ASR    ASR       AP (clean)
K = 0 23.80% 10.70% 32.60% 83.02
K = 1 43.15% 11.77% 49.76% 84.64
K = 3 25.80% 13.35% 37.36% 84.50
K = 5 36.67% 7.13% 41.54% 84.44
K = 7 59.25% 7.55% 62.05% 84.74
Table 5: We sweep , the number of LiDAR points used to query image features for each BEV voxel. Overall, there is no clear trend between and robustness to our attack.
Figure 10: Visualization of the attack success rate across different locations in bird’s eye view. The ego vehicle is at the origin, with the axes being the longitudinal and lateral directions. Attacks are stronger on host vehicles that are more distant from the ego vehicle.

Success Rate Visualization:

To better understand where the attack is strong, we visualize the attack success rate across locations in BEV. The visualization is shown in Fig. 10. Host vehicles that are farther away are much easier to attack than those close by.

Qualitative Examples:

Additional qualitative examples are shown in Fig. 11 for Xenith and Fig. 12 for KITTI.

Figure 11: Additional qualitative examples on Xenith.
Figure 12: Additional qualitative examples on KITTI.