Adversarial Attacks on Camera-LiDAR Models for 3D Car Detection

Most autonomous vehicles (AVs) rely on LiDAR and RGB camera sensors for perception. Using these point cloud and image data, perception models based on deep neural nets (DNNs) have achieved state-of-the-art performance in 3D detection. The vulnerability of DNNs to adversarial attacks has been heavily investigated in the RGB image domain and more recently in the point cloud domain, but rarely in both domains simultaneously. Multi-modal perception systems used in AVs can be divided into two broad types: cascaded models, which use each modality independently, and fusion models, which learn from different modalities simultaneously. We propose a universal and physically realizable adversarial attack for each type, and study and contrast their respective vulnerabilities to attacks. We place a single adversarial object with a specific shape and texture on top of a car with the objective of making this car evade detection. Evaluating on the popular KITTI benchmark, our adversarial object made the host vehicle escape detection by each model type nearly 50% of the time. The dense RGB input contributed more to the success of the adversarial attacks on both cascaded and fusion models. We found that the fusion model was relatively more robust to adversarial attacks than the cascaded model.




I Introduction

Autonomous vehicles (AVs) employ RGB cameras and LiDAR sensors to generate complementary representations of the scene in the form of dense 2D RGB images and sparse 3D point clouds. Using these data, 3D car detection models based on deep neural networks (DNNs) have achieved state-of-the-art performance. Images and point clouds are utilized differently depending on the model type (cascaded or fusion), as shown in Figs. 2 and 3. Cascaded models use each modality independently. Images are used by a 2D detection DNN to generate proposals of search spaces where a car may reside. LiDAR points in each proposed region are then extracted for 3D point-based detection [frustum, convnet, roarnet]. On the other hand, fusion models [epnet, mvx] use DNNs to extract and combine image and point cloud features in parallel. Afterwards, a combined representation is sent through a DNN for 3D detection.

Fig. 1: Left: A car is detected with accurate bounding boxes. Right: The same car is not detected after the adversarial attack. We place a 3D adversarial mesh with learned shape and texture on top of the car. The adversarial texture fools the 2D RGB detection pipeline, and the red points are the rendered LiDAR points added to the point cloud to fool the 3D point cloud pipeline.

The aforementioned models rely on DNNs, which are known to be vulnerable to adversarial attacks that slightly alter the input but greatly affect the output. These attacks were initially observed and studied in the RGB image domain [adversarial, patch, sharif2016accessorize]. Realistic adversarial attacks on 3D detection models that take only point clouds as input were recently investigated in [mich2019, uber]. They placed a perturbed mesh in the 3D scene and rendered it differentiably by simulating a LiDAR. However, this prior work only considered one modality, while most AVs rely on both modalities (images and point clouds) for perception. As safety is critical for AVs, this paper investigates realistic adversarial attacks on detection models that take both images and point clouds as input. Specifically, we consider the common cascaded and fusion architectures.


Fig. 2: The proposed attack pipeline on the cascaded F-PN model [frustum]. Gradient flow is represented by dashed lines. Purple is for texture changing gradients, and red is for shape changing gradients. The car with the adversarial object goes undetected.

Recently, [uber2] used differentiable rendering to mount a multi-sensor adversarial attack on the fusion-based bird’s eye view (BEV) detector MMF [mmf]. While [uber2] considered only fusion models, we study adversarial attacks on both cascaded and fusion camera-LiDAR models in the context of AVs. Moreover, their model and its fusion module are not commonly used or publicly available, which makes their results harder to generalize. The two models we consider are both open-source and built upon commonly used backbones, as explained in the Victim Models section.


We propose a universal and realistic adversarial attack against cascaded and fusion 3D car detection models. The malicious adversarial object is placed on top of a car to avoid occlusion. This object is then rendered to both the point cloud and the corresponding RGB image by differentiable renderers, thus maintaining consistent and realistic characteristics across inputs. The shape and texture of the adversarial object are trainable parameters that are perturbed adversarially. This adversarial object is trained on the entire KITTI [kitti] training set, which makes our attack universal, i.e., this single object is capable of reducing the accuracy of a perception system in different 3D scenes. For the cascaded type, we chose Frustum-PointNet (F-PN) [frustum], a pioneering cascaded model on whose architecture many later works were based. As for the fusion type, we chose EPNet [epnet], which is a state-of-the-art multi-modal 3D detection model. EPNet works directly on LiDAR points (not their projection to BEV) to avoid information loss due to projection. EPNet [epnet] focuses on the 3D detection task and is more accurate than MMF [mmf] in both 3D and BEV detection.

The proposed attack shows how this adversarial object can make its host vehicle evade detection. This is especially dangerous in the context of self-driving cars where car detection accuracy is consequential in autonomous decision making. We not only report the attack success rate in hiding cars, but also the reduction in the average precision after an attack on all input vehicles, to aid in comparing our results to previous and future work on adversarial attacks and defenses. We find that both models were vulnerable to attacks that manipulate the input image. This can be due to the dense and brittle nature of RGB features. Moreover, attacks that target only the point cloud input are much more effective on the cascaded model than the fusion one. This is probably because fusion models directly incorporate image features which can lend robustness against LiDAR-only adversarial attacks.

II Related Work

Adversarial attacks were first discovered and studied in RGB image classification DNNs [adversarial]. These attacks would change the pixel values of an image slightly and lead to great errors in the DNN output. However, such pixel-wise attacks were not realistic and would usually work on a single image. So, constraints were established to ensure the physical feasibility of an attack [patch, sharif2016accessorize]. Also, universal attacks were proposed where a single perturbation was effective across the input distribution [moosavi2017universal]. Our attack is very different from the aforementioned work: We aim not to learn a patch or particular pixel values, but an adversarial texture on a 3D object that can attack a model when rendered to RGB images. Moreover, we target both the image and the point cloud input.

Early work on adversarial attacks on point cloud models focused on perturbing point clouds by slightly moving, adding, or removing a few individual points [yang2019adversarial, Xiang_2019_CVPR]. Due to LiDAR properties such perturbations were largely unrealizable physically. Towards a realistic point cloud attack on detection models in AVs, [mich2019, uber] used an adversarial 3D mesh that is placed somewhere in the scene around an AV. These approaches still focus only on a single modality, while an AV usually utilizes both.

In [wang2020towards], Wang et al. study adversarial attacks on multi-modal DNNs by shifting individual points in the 3D space and using a 2D RGB sticker. As stated earlier, simple point perturbations can be unrealistic due to LiDAR sensor properties, and they also ignore cross-modality consistency. Towards a realistic multi-modal attack on perception systems, [uber2] used a single 3D object to attack a fusion-based bird’s eye view (BEV) detector. However, they don’t consider cascaded models, which are more interpretable and less computationally expensive than fusion models [fusionrev]. This makes cascaded models easier to train, test, and deploy in industry and thus more critical to examine in practice, while fusion models are relatively less mature. While [uber2] focuses on reporting attack success rates, we report both attack success rates and the overall reduction in the car detection accuracy of a multi-modal detector. This allows us to compare with the detector’s initial accuracy as well as with other attack or defense methods. Finally, their victim model and their attack are based on BEV detection, which is different from our 3D detection task (e.g., BEV detection doesn’t estimate 3D bounding boxes, which are necessary for most AVs). Also, BEV learning is done on projected point clouds, which leads to information loss and can make their attack results harder to extend to fusion multi-modal DNNs that reason on LiDAR points directly.



Fig. 3: The proposed attack pipeline on the fusion EPNet model [epnet]. The gradient flow is represented by dashed lines. Purple is for texture changing gradients, and red is for shape changing gradients. The car with the adversarial object goes undetected.

III Multi-Modal Adversarial Attack

In this work, we want to learn a single 3D adversarial object that is placed on top of a car in a 3D scene, to avoid occlusion, with the aim of making that car evade detection. This object is rendered to both the point cloud and the RGB image as if it were present in the original scene (see Figs. 2 and 3). We use this attack to study how this adversarial object affects the car detection accuracy of an AV’s camera-LiDAR 3D detection model. We focus on car detection since it’s one of the most safety-critical tasks of an AV’s perception system. Below we elaborate on the adversarial object, differentiable rendering, the victim models, and the attack objective functions.

III-A Adversarial Object

To craft a realistic adversarial object we choose a mesh as a graphical representation. A mesh can maintain realistic 3D physical geometry and texture and can also be differentiably rendered to RGB images and point cloud.

To train the shape of the object, following previous work [mich2019, uber], we start with an initial mesh with $N$ vertices, where each initial vertex is defined as $v_i^0 \in \mathbb{R}^3$. We deform the shape of the mesh by displacing each vertex with a displacement vector $\Delta v_i \in \mathbb{R}^3$, which is a learnable parameter, to produce an adversarial mesh with vertices $v_i$, as in Eqn. (1). We use the bounding box location and orientation to build a transformation matrix $T$ that puts the mesh on top of a host car and gives it the same orientation as the car’s heading direction.

$$v_i = T\,(v_i^0 + \Delta v_i) \tag{1}$$
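The deformation and placement step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `place_mesh_on_car`, the z-up coordinate convention, the box center being the geometric center of the car, and the transformation being a yaw rotation followed by a translation to the roof are all choices made here, not details confirmed by the text.

```python
import math

def place_mesh_on_car(initial_vertices, displacements, car_center, car_height, yaw):
    """Deform a mesh and move it onto the roof of a car (illustrative sketch).

    initial_vertices, displacements: lists of (x, y, z) in the mesh's local frame.
    car_center: assumed geometric center (x, y, z) of the car's 3D bounding box.
    car_height: height of the bounding box along the vertical (z) axis.
    yaw: heading angle of the car about the z axis, in radians.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    roof_z = car_center[2] + car_height / 2.0
    placed = []
    for (x0, y0, z0), (dx, dy, dz) in zip(initial_vertices, displacements):
        # Learnable displacement applied in the local frame.
        x, y, z = x0 + dx, y0 + dy, z0 + dz
        # Rotate about z so the mesh follows the car's heading.
        xr = cos_y * x - sin_y * y
        yr = sin_y * x + cos_y * y
        # Translate so the local origin sits on the roof of the car.
        placed.append((xr + car_center[0], yr + car_center[1], z + roof_z))
    return placed
```

In a real attack the displacements would be tensors tracked by an automatic differentiation framework so that gradients can flow back to them; the arithmetic is the same.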


For the texture of the adversarial mesh, a learnable vector $c_i \in \mathbb{R}^3$ is assigned to each vertex to represent an RGB color. To produce the texture of the mesh, each face is colored by the interpolation of the colors from its three vertices. The mesh is then rendered and added to the 2D RGB camera space with the given projection matrix.
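As a small illustration of per-vertex color interpolation, the sketch below blends the three vertex colors of a face with barycentric weights. That the interpolation is barycentric is an assumption (it is the usual choice in rasterizers), and `interpolate_color` is a hypothetical helper name.

```python
def interpolate_color(bary, vertex_colors):
    """Blend three RGB vertex colors at a point inside a triangular face.

    bary: (w0, w1, w2) barycentric weights of the point, summing to 1.
    vertex_colors: three (r, g, b) tuples, one per vertex of the face.
    """
    return tuple(
        sum(w * color[channel] for w, color in zip(bary, vertex_colors))
        for channel in range(3)
    )

# A point closer to the first (red) vertex gets a mostly red color.
red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
mostly_red = interpolate_color((0.5, 0.25, 0.25), (red, green, blue))
```

Because the color varies linearly across each face, the resulting texture changes smoothly, which matches the smooth appearance the text aims for.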

To ensure the object is realistic we put constraints on the size and smoothness of the mesh geometry and interpolate between vertex colors to produce the adversarial texture. The result is a realistic smooth texture without abrupt changes in color (a typical observation in RGB adversarial attacks using pixel-wise perturbations or dense texture representations).

III-B Differentiable Rendering

To realistically render a mesh to point cloud, we need to find which LiDAR rays would intersect with the mesh if it was placed in the original scene. We simulate the LiDAR used in capturing the dataset’s point cloud by producing rays at the same angular frequencies and incorporating a small amount of noise that is present in this specific LiDAR. We then calculate the intersection points between these rays and our mesh’s faces in the 3D scene using the Möller–Trumbore intersection algorithm [raycast]. Finally, for each ray with intersection points we take the nearest point, and add all the resulting points to the LiDAR point cloud scene.
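The ray casting step described above can be sketched as follows. This is a minimal, non-differentiable pure-Python version of the Möller–Trumbore test plus a nearest-hit search; the helper names are hypothetical, and the sensor noise and the specific angular pattern of the LiDAR are omitted.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle_distance(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance along the ray to the triangle, or None."""
    edge1, edge2 = _sub(v1, v0), _sub(v2, v0)
    pvec = _cross(direction, edge2)
    det = _dot(edge1, pvec)
    if abs(det) < eps:
        return None  # Ray is parallel to the triangle's plane.
    inv_det = 1.0 / det
    tvec = _sub(origin, v0)
    u = _dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = _cross(tvec, edge1)
    v = _dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(edge2, qvec) * inv_det
    return t if t > eps else None  # Ignore hits behind the sensor.

def nearest_hit(origin, direction, vertices, faces):
    """Return the closest intersection point of one ray with a mesh, or None."""
    best = None
    for i, j, k in faces:
        t = ray_triangle_distance(origin, direction, vertices[i], vertices[j], vertices[k])
        if t is not None and (best is None or t < best):
            best = t
    if best is None:
        return None
    return tuple(o + best * d for o, d in zip(origin, direction))
```

Calling `nearest_hit` once per simulated LiDAR ray and collecting the non-`None` results gives the points that would be appended to the scene's point cloud.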

To train our adversarial texture, the rendering process needs to be differentiable. We therefore use the fast, differentiable rendering tools developed in [ravi2020pytorch3d] to render our adversarial mesh from the 3D scene to RGB images, as shown in Fig. 2. For image rendering, we use Phong shading and simulate sunlight by pointing a light source from above at an angle.

III-C Victim Models

Frustum-PointNet (F-PN) [frustum] and EPNet [epnet] are the representative cascaded and fusion victim models respectively. We chose F-PN [frustum] for many reasons: It’s a pioneering model in the area of multi-modal detection for AVs and many works [convnet, roarnet] were developed based on its architecture and popular backbone DNN [pointnet]. This means our attack could be a threat to other similar cascaded camera-LiDAR 3D detection models. Finally, F-PN showed competitive results on the KITTI benchmark.

As shown in Fig.  2, F-PN deals with each input modality separately. First, it takes an RGB image through a 2D detection DNN, which proposes 2D bounding boxes. These bounding boxes are then projected to 3D space, thus producing frustum-shaped 3D search spaces that surround each object. Points within each frustum are extracted and sent through two PointNet-based [pointnet] DNNs for 3D instance segmentation, and 3D box estimation. F-PN directly outputs one 3D bounding box estimation from a given point cloud frustum, since the assumption is that there must be a single object in a frustum. This implicit bias can pose a challenge for adversarial attacks that seek to suppress detection by targeting only the point cloud pipeline. Also, the post-proposal point cloud is very sparse and small in size so there is not much maneuvering space for our realistic adversarial attack to take place. Despite these challenges our attack was successful as discussed in section IV.
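To make the frustum extraction step concrete, here is a rough sketch that keeps the points whose image projection falls inside a 2D box. It assumes the points are already expressed in the camera frame and that `proj` is a 3x4 pinhole projection matrix; the function name and argument layout are illustrative, not F-PN's actual interface.

```python
def points_in_frustum(points, proj, box2d):
    """Select the 3D points that project inside a 2D bounding box.

    points: list of (x, y, z) in the camera frame (z pointing forward).
    proj: 3x4 projection matrix given as three rows of four numbers.
    box2d: (u_min, v_min, u_max, v_max) in pixels.
    """
    u_min, v_min, u_max, v_max = box2d
    kept = []
    for x, y, z in points:
        hu, hv, hw = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in proj)
        if hw <= 0.0:
            continue  # Point is behind the camera.
        u, v = hu / hw, hv / hw
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept
```

Only the points returned by such a filter reach the point cloud pipeline, which is why an object added inside the frustum has so few points to work with.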

For the image detection pipeline in F-PN, we use YOLOv3 [yolov3]. It is a standard fast 2D object detection DNN that outputs region proposals, classifications, and confidence scores. The F-PN and YOLOv3 models were pre-trained on the KITTI dataset.

From the fusion architecture we attack the multi-modal 3D detection model EPNet [epnet]. Unlike the cascaded model, this model takes both image and point cloud inputs simultaneously, and then learns and fuses their features in parallel. EPNet uses a novel fusion module that enhances point features with semantic image features at multiple scales down the network. Its image stream is built on standard 2D convolutional neural nets and its point cloud stream is based on modules developed in [pointnet++]. EPNet uses the enhanced point features to generate many 3D bounding box proposals for each foreground point of a car and then uses a refinement network to filter proposals and refine bounding box dimensions. We note that there are some similarities between this network and the LiDAR-only model PointRCNN [pointrcnn]. For example, PointRCNN also generates proposals from each foreground point of each car, thus generating many proposals for each car. Because of this, a previous point cloud-only adversarial attack [uber] found it very difficult to suppress all of these proposals. While EPNet’s usage of image features gave it a higher car detection accuracy than PointRCNN, it is interesting to see how incorporating image features affects the robustness of a multi-modal model that’s similar to a relatively robust LiDAR-only model.

III-D Objective Functions

In F-PN we need two separate objective functions since 2D and 3D detection are independent. For 2D image detection, we minimize the sum of all objectness scores given to car detections. For 3D detection, assuming $M$ input points $p_1, \dots, p_M$, F-PN uses a segmentation network to produce a segmentation mask determining background and car points and giving each point a segmentation score $s_j$. We minimize the maximum score given to a “car” point using binary cross entropy, as shown in Eqn. (2). Finally, following previous work [uber], we weight the objective function by the intersection over union (IoU) score between the ground truth bounding box $y^*$ and the predicted bounding box $y$:

$$\mathcal{L}_{\mathrm{cas}} = -\,\mathrm{IoU}(y^*, y)\,\log\bigl(1 - \max_j s_j\bigr) \tag{2}$$


EPNet outputs many bounding box proposals for each detected vehicle; each proposal $y$ has a confidence score $s$. Similar to [uber], we aim to suppress the relevant proposals $(y, s) \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of proposals with scores above a threshold. We use binary cross entropy again to minimize the scores given to these proposals, as shown in Eqn. (3), and we again weight the objective by the IoU score with the ground truth box $y^*$:

$$\mathcal{L}_{\mathrm{fus}} = \sum_{(y, s) \in \mathcal{Y}} -\,\mathrm{IoU}(y^*, y)\,\log(1 - s) \tag{3}$$

This objective applies to both shape and texture optimization and runs end-to-end since this is a fusion model.
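A minimal sketch of the two suppression objectives described above, written in plain Python. It assumes that "binary cross entropy" here means the cross entropy against a target label of zero, i.e. a term of the form -log(1 - s); the function names and the `threshold` parameter are hypothetical, since the score threshold used for EPNet is not given in this text.

```python
import math

def suppression_term(score, iou, eps=1e-7):
    """IoU-weighted cross entropy that pushes a confidence score toward zero."""
    s = min(max(score, 0.0), 1.0 - eps)  # Keep log() finite for scores near 1.
    return -iou * math.log(1.0 - s)

def cascaded_loss(point_scores, iou):
    """F-PN style: penalize only the highest 'car' segmentation score."""
    return suppression_term(max(point_scores), iou)

def fusion_loss(proposals, threshold):
    """EPNet style: sum over proposals whose score passes the threshold.

    proposals: list of (score, iou_with_ground_truth) pairs.
    threshold: hypothetical cutoff selecting the relevant proposals.
    """
    return sum(suppression_term(s, iou) for s, iou in proposals if s >= threshold)
```

Weighting by IoU means that confident, well-localized predictions contribute the most to the loss, so the optimizer spends its effort on the detections that matter.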

Fig. 4: Victim vehicle recall results as the IoU threshold varies. We notice high recall in F-PN under low IoU thresholds, since it always assumes the existence of a car in a frustum. Also, the fusion model is much more robust to LiDAR-only attacks when compared to the cascaded model. However, both models are very vulnerable to image attacks.

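We also want to ensure that the geometry of the mesh is smooth and realistic, so we minimize a Laplacian loss [liu2019soft]: $\mathcal{L}_{\mathrm{lap}} = \sum_i \lVert \delta_i \rVert_2^2$, where $\delta_i$ is the distance between a vertex $v_i$ and the centroid of its neighbors $\mathcal{N}(v_i)$:

$$\delta_i = v_i - \frac{1}{\lvert \mathcal{N}(v_i) \rvert} \sum_{v_j \in \mathcal{N}(v_i)} v_j \tag{4}$$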
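The Laplacian smoothing term described above can be sketched as below. This version uses uniform neighbor weights (each neighbor counts equally), which is an assumption; `laplacian_loss` is a hypothetical name and the mesh is given as plain lists.

```python
def laplacian_loss(vertices, faces):
    """Sum of squared distances between each vertex and its neighbors' centroid.

    vertices: list of (x, y, z); faces: list of (i, j, k) vertex index triples.
    """
    neighbors = {i: set() for i in range(len(vertices))}
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    total = 0.0
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue  # Isolated vertices contribute nothing.
        centroid = [sum(vertices[j][k] for j in nbrs) / len(nbrs) for k in range(3)]
        total += sum((vertices[i][k] - centroid[k]) ** 2 for k in range(3))
    return total

# A fan of four triangles around a central vertex: raising the center
# into a spike increases the loss compared with keeping the fan flat.
fan_faces = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
ring = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
flat = laplacian_loss([(0.0, 0.0, 0.0)] + ring, fan_faces)
spiky = laplacian_loss([(0.0, 0.0, 3.0)] + ring, fan_faces)
```

Minimizing this term discourages isolated spikes, which keeps the learned shape closer to something that could be manufactured.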


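The overall adversarial attack objective function for EPNet is shown in Eqn. (5), and for the point cloud pipeline in the cascaded F-PN in Eqn. (6), where $\mathcal{L}_{\mathrm{fus}}$ and $\mathcal{L}_{\mathrm{cas}}$ denote the adversarial losses of Eqns. (3) and (2) respectively, $\mathcal{L}_{\mathrm{lap}}$ is the Laplacian loss, and $\lambda$ is a weight coefficient for the Laplacian smoothing loss.

$$\mathcal{L}_{\mathrm{EPNet}} = \mathcal{L}_{\mathrm{fus}} + \lambda\,\mathcal{L}_{\mathrm{lap}} \tag{5}$$

$$\mathcal{L}_{\mathrm{F\text{-}PN}} = \mathcal{L}_{\mathrm{cas}} + \lambda\,\mathcal{L}_{\mathrm{lap}} \tag{6}$$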



Fig. 5: Examples from the multi-modal attack on the fusion EPNet model [epnet]. We show detection on the image and LiDAR point cloud before and after an attack. Note that cars with the adversarial mesh on them are undetected. Bounding boxes are shown with 8 corner points and 7 lines instead of 12 to avoid clutter.

IV Experiments

IV-A Experiment Setup

We use the KITTI dataset [kitti], a popular benchmark for 3D detection in autonomous driving, to train the adversarial object. Following prior work [frustum, epnet], we use 3,712 LiDAR+image samples for training and 3,769 for validation. All reported results in the following section are from the validation set. This is a universal attack, i.e., the shape and texture of the object are trained on the entire training set.

The attack pipeline for the cascaded model is shown in Fig. 2. We divide our attack into two stages, as F-PN has two separate pipelines: RGB detection and point cloud detection. To attack F-PN’s point cloud pipeline, we use an initial 40 cm radius isotropic sphere mesh with 162 vertices and 320 faces. Projected gradient descent is used to keep the object’s size within a bounding box. We use a small mesh because in cascaded models the point cloud pipeline takes fewer points (limited 3D space from the 2D region proposal) than models that input the entire scene’s point cloud. We use ground truth 2D bounding boxes to generate the frustums used in training and evaluating the attack on the point cloud pipeline. An ADAM optimizer is used to iteratively deform the mesh to minimize Eqn. (6). Once we have an adversarial shape we move to the next stage: we render it to 2D images and use an ADAM optimizer to learn a universal adversarial texture for a multi-modal attack. As mentioned earlier, the texture is produced via interpolation of learned per-vertex colors. We use YOLOv3-generated 2D bounding boxes to measure the effect of an image-only attack and an image+point cloud attack on detection. In the cascaded case, the image-only and LiDAR-only attacks are mainly an ablation study, since we only render to the modality under consideration. We use these attacks to see which input modality causes higher vulnerability.

For the fusion model, we train the shape and texture simultaneously, and end-to-end as in Fig. 3. We use the same initial sphere mesh as in the F-PN attack, but since this model takes the whole 3D scene we allow a slightly larger adversarial mesh by keeping width and length under 120 cm and height under 80 cm. We also use projected gradient descent with an ADAM optimizer to learn the geometry and texture of the adversarial mesh. For the LiDAR-only attack, we learn an adversarial shape while keeping the texture a constant grey color. For the image-only attack, we use a constant sphere shape and learn the adversarial texture of this sphere. For the image+LiDAR attack, we allow adversarial perturbations to shape and texture.
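The projection step of projected gradient descent can be sketched as a simple clamp of every vertex coordinate to the allowed box. The sketch below uses a plain gradient step instead of ADAM for brevity, assumes the box is centered on the mesh's local origin, and uses hypothetical names; the half-extents of 0.6 m, 0.6 m, and 0.4 m correspond to the 120 cm by 120 cm by 80 cm limit stated for the fusion attack.

```python
def projected_step(vertices, grads, lr, half_extents):
    """One gradient step followed by projection onto an axis-aligned box.

    vertices, grads: lists of (x, y, z) in the mesh's local frame.
    lr: step size (a plain gradient step; the text uses ADAM instead).
    half_extents: (hx, hy, hz), half of the allowed size along each axis.
    """
    projected = []
    for v, g in zip(vertices, grads):
        stepped = [vi - lr * gi for vi, gi in zip(v, g)]
        # Projection: clamp each coordinate back into [-h, h].
        projected.append(tuple(max(-h, min(h, c)) for c, h in zip(stepped, half_extents)))
    return projected

# Fusion-attack size limit: width and length under 120 cm, height under 80 cm.
FUSION_HALF_EXTENTS = (0.6, 0.6, 0.4)
clamped = projected_step([(0.5, 0.0, 0.3)], [(-1.0, 0.0, -1.0)], 0.2, FUSION_HALF_EXTENTS)
```

The same clamp applies unchanged after an ADAM update, since the projection only depends on the updated vertex positions.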

In the results, we report the drop in a model’s BEV average precision (AP) after putting the mesh on all cars in the validation set. We show the results for 3 detection difficulty levels (based on occlusion) with an IoU threshold of 0.7. This can help compare our results with future attack and defense methods. Moreover, we can compare the model’s AP on the KITTI benchmark before and after an attack. This can provide insight into the robustness of a perception model. We use the new KITTI evaluation protocol that uses 40 recall points instead of 11 to get clearer and fairer results.
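For readers unfamiliar with the 40-point protocol, the sketch below computes an interpolated average precision over the recall positions 1/40, 2/40, ..., 1. It is a simplified illustration, not the official KITTI evaluation code: it takes a precision-recall curve as input and skips the matching of detections to ground truth by difficulty level.

```python
def average_precision_r40(recalls, precisions):
    """Interpolated AP over 40 equally spaced recall positions.

    recalls, precisions: paired samples of a precision-recall curve, each in [0, 1].
    At every recall position r, the interpolated precision is the best precision
    achieved at any recall >= r (0 if the detector never reaches r).
    """
    total = 0.0
    for k in range(1, 41):
        r = k / 40.0
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(reachable) if reachable else 0.0
    return total / 40.0
```

Multiplying the result by 100 gives values on the same scale as the AP tables in this section. A detector that is perfectly precise but only finds half of the cars scores 0.5, which shows how hiding cars from the detector directly lowers AP.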

We also report the attack success rate, which is the percentage of detected vehicles that go undetected after placing the adversarial object. We count a vehicle detected if its IoU with the ground truth is greater than 0.7. We also report the recall of victim vehicles across different IoU thresholds. Recall is the number of cars detected over the number of all cars in the dataset. As a control we compare our adversarial attack results to just applying a grey spherical mesh instead of an adversarial mesh.
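The two metrics defined above can be written down directly. In this sketch each car is summarized by the best IoU between any predicted box and its ground truth box (0 if nothing was predicted), once before and once after the adversarial object is added; the function names are illustrative.

```python
def attack_success_rate(ious_before, ious_after, threshold=0.7):
    """Percentage of originally detected cars that are missed after the attack."""
    detected = [i for i, iou in enumerate(ious_before) if iou > threshold]
    if not detected:
        return 0.0
    hidden = sum(1 for i in detected if ious_after[i] <= threshold)
    return 100.0 * hidden / len(detected)

def recall(ious, threshold):
    """Fraction of all cars whose best prediction exceeds the IoU threshold."""
    return sum(1 for iou in ious if iou > threshold) / len(ious)

# Toy example with five cars; the last one was never detected to begin with.
before = [0.90, 0.80, 0.75, 0.72, 0.30]
after = [0.85, 0.20, 0.00, 0.71, 0.30]
rate = attack_success_rate(before, after)
recall_after = recall(after, 0.7)
```

Note that the success rate is normalized by the cars detected before the attack, while recall is normalized by all cars, so the two numbers are not simple complements of each other.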

Attack Type          Easy    Moderate   Hard
No Attack (Clean)    91.17   85.08      81.00
No Attack (Sphere)   52.81   51.46      48.19
LiDAR Only           37.36   37.79      39.10
Image Only           50.71   36.17      31.07
LiDAR + Image        29.04   23.52      19.66
TABLE I: Cascaded: F-PN car detection AP results

Attack Type          Easy    Moderate   Hard
No Attack (Clean)    95.94   88.83      88.50
No Attack (Sphere)   94.74   87.47      84.58
LiDAR Only           93.92   87.02      82.56
Image Only           37.99   41.32      37.25
LiDAR + Image        22.05   25.63      22.58
TABLE II: Fusion: EPNet car detection AP results

IV-B Results & Discussion

First, we find that a multi-modal adversarial attack was successful in hiding nearly half the cars from both models, as shown in Table III. This drop in recall led to a drop in the models’ AP, going from nearly 96% to 22% and from 91% to 29% in the fusion and cascaded models respectively under the easy scenario, as shown in Tables I and II. This sharp drop in AP can also indicate that our attack introduced some false positives. Moreover, we mentioned in section III that EPNet can be thought of as the multi-modal fusion counterpart to the LiDAR-only PointRCNN model [pointrcnn], which was attacked in [uber] with an attack success rate of 32.3%. While EPNet gained accuracy over PointRCNN by incorporating image features using a novel fusion method, it became much more vulnerable to image-based adversarial attacks. As shown in Table III, a multi-modal adversarial attack on EPNet had an attack success rate of 50.64%, a gain of about 18 percentage points over the attack on PointRCNN. Examples from the attack on EPNet are shown in Fig. 5.

We find that LiDAR-only attacks are much more effective on the cascaded model than the fusion one. From Fig. 4, we can see that, as the IoU threshold increases, recall drops quickly with the insertion of any object (in this case a sphere) on top of a car even without an attack. Recall is reduced further after the insertion of our adversarial object. This is also reflected in the sharp decreases in AP under a 0.7 IoU threshold, as shown in Table I. This can be attributed to the fact that cascaded models are mainly trained on frustum-shaped limited 3D spaces that just encapsulate a car. This means that the model is not exposed to objects around or on cars. This poses a grave danger for autonomous driving, since correct estimation of the dimensions and locations of surrounding vehicles and objects is important to autonomous decision making. On the other hand, fusion models are not much affected by a sphere or a LiDAR-only adversarial attack. This is probably because fusion models directly incorporate image features, which can lend robustness against LiDAR-only adversarial attacks. Also, EPNet generates many proposals for a single car from foreground points, and thus it can be challenging to suppress them all in a LiDAR-only attack.

Attack Type          Cascaded: F-PN   Fusion: EPNet
No Attack (Sphere)   24.39%           5.45%
LiDAR Only           34.24%           5.91%
Image Only           42.8%            48.64%
LiDAR + Image        55.60%           50.64%
TABLE III: Attack success rate results for the two models

Both architectures are very vulnerable to adversarial attacks that target the image pipeline. In cascaded models, the 2D proposals decide exactly where to search for objects, so if that stage fails the entire model fails. Using image-based region proposals also forgoes a main purpose of LiDAR, which is to avoid cases where lighting and occlusion affect detection and localization. Also, RGB attacks are much simpler and less computationally expensive than point cloud attacks, and they’re more easily reproduced in real life, which can make them more dangerous. The fusion model was heavily affected as well because of the image features. As shown in Table III, an image-only attack on EPNet had a success rate of 48.64%. This can be due to the known issue of brittle image features in DNNs. In the fusion case, adding a LiDAR attack to the image attack improved the success rate by only 2 percentage points. On the other hand, adding the LiDAR attack in the cascaded case improved the attack success rate by nearly 13 percentage points.
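The comparisons in this discussion are differences between success rates, so they are best read as percentage points. The short check below recomputes them from the numbers in Table III and the 32.3% PointRCNN figure quoted from [uber]; the variable names are only for illustration.

```python
# Attack success rates in percent, copied from Table III.
cascaded = {"image_only": 42.8, "lidar_plus_image": 55.60}
fusion = {"image_only": 48.64, "lidar_plus_image": 50.64}
pointrcnn_lidar_attack = 32.3  # Success rate reported in [uber].

def gain(after, before):
    """Difference between two success rates, in percentage points."""
    return round(after - before, 2)

lidar_gain_cascaded = gain(cascaded["lidar_plus_image"], cascaded["image_only"])
lidar_gain_fusion = gain(fusion["lidar_plus_image"], fusion["image_only"])
gain_over_pointrcnn = gain(fusion["lidar_plus_image"], pointrcnn_lidar_attack)
```

The three differences come out to 12.8, 2.0, and 18.34 percentage points respectively.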

V Conclusion

We proposed a universal and physically realistic adversarial attack on multi-modal 3D detection models used in car perception. We manipulated mesh geometry and texture and used differentiable rendering to study the vulnerability of both representative cascaded and fusion camera-LiDAR models. We found that both model types are vulnerable to the proposed multi-modal adversarial attacks mainly due to the brittle image features. We also showed that the proposed attack can successfully make a car evade detection from the two studied models nearly 50% of the time.