Messing Up 3D Virtual Environments: Transferable Adversarial 3D Objects

09/17/2021 ∙ by Enrico Meloni, et al. ∙ Università di Siena

In the last few years, the scientific community has shown a remarkable and increasing interest in 3D Virtual Environments, training and testing Machine Learning-based models in realistic virtual worlds. On one hand, these environments could also become a means to study the weaknesses of Machine Learning algorithms, or to simulate training settings that allow Machine Learning models to gain robustness to 3D adversarial attacks. On the other hand, their growing popularity might also attract those who aim at creating adversarial conditions to invalidate the benchmarking process, especially in the case of public environments that accept contributions from a large community of people. Most of the existing Adversarial Machine Learning approaches focus on static images, and little work has been done on how to deal with 3D environments and on how a 3D object should be altered to fool a classifier that observes it. In this paper, we study how to craft adversarial 3D objects by altering their textures, using a tool chain composed of easily accessible elements. We show that it is possible, and indeed simple, to create adversarial objects using off-the-shelf limited surrogate renderers that can compute gradients with respect to the parameters of the rendering process, and, to a certain extent, to transfer the attacks to more advanced 3D engines. We propose a saliency-based attack that intersects the two classes of renderers in order to focus the alteration on those texture elements that are estimated to be effective in the target engine, evaluating its impact on popular neural classifiers.




I Introduction

The classic approach to the development and evaluation of Machine Learning algorithms has always relied on the availability of datasets that collect samples acquired from the real-world setting in which the algorithms are expected to operate. In the case of Computer Vision, in the last few years we observed a remarkable diffusion of simulators that are progressively gaining an important role in the development pipeline of novel algorithms [3, 17, 30]. The visual quality of these simulators has significantly improved, making the rendered scene very close to the real-world appearance, thus offering a manageable way to set up experiments in controlled and reproducible conditions that are visually similar to the real target setting, despite being artificially generated [10, 24, 35].

On the flip side, gaining popularity also implies that there could be a large number of people potentially eager to poison or to voluntarily corrupt data inside publicly available 3D Virtual Environments, with the aim of injecting backdoors in Machine Learning systems trained in such environments or of spoiling benchmarking procedures. As a matter of fact, once a malicious 3D object has been crafted, it can be plugged into multiple 3D scenes, so that its effect spreads to every scene, and every rendered view, that includes it. This might become extremely dangerous in those cases in which a large community collaborates on the development of an open project about Virtual Environments [6, 34]. In contrast, when dealing with datasets of images or videos, altering some data in an adversarial manner [5] will only affect such data, and not other images or videos that are about the same subject. Of course, moving to a more constructive perspective, it is also important to consider that purposely and openly augmenting 3D worlds with objects generated in an adversarial context could be useful to train more robust Machine Learning-based models or to better evaluate their quality [31].

Even if the Adversarial Machine Learning community exploited rendered views of a 3D object to craft real objects that fool a classifier [2], most of the efforts are oriented towards the case of datasets of images. However, in the last years a large number of novel rendering schemes were proposed, ranging from Neural Renderers [9, 32] to more generic Differentiable Renderers [15], which allow the user to compute gradients with respect to the parameters of the renderer, including meshes, textures, and others. Of course, these renderers make it easier to alter elements belonging to the 3D world with the purpose of optimizing a target objective function, thus opening the way to deeper investigations on how Adversarial Machine Learning can impact 3D Virtual Environments.

In this paper, we propose a novel study on the generation of adversarial 3D objects in the context of 3D Virtual Environments. We exploit the most straightforward differentiable rendering tools that were recently made public, directly interfaced with popular Machine Learning libraries, and that do not require robust skills in 3D graphics. Our goal is to investigate how easy it is to use these tools with the precise purpose of fooling a classifier that processes views produced by a high-end 3D engine. As a matter of fact, 3D Virtual Environments are usually based on renderers (target renderers) that have more advanced features than the ones that are supported by versatile differentiable renderers, which we will refer to as surrogate renderers. For this reason, we investigate how effective the process of transferring the surrogate-based adversarial 3D objects to the target renderer is. We consider a realistic case study in which PyTorch3D [28] and the popular Unity3D engine [14] constitute the surrogate and target renderer, respectively. Among several 3D Virtual Environments that exploit Unity3D, we selected the recently published SAILenv [24], which is advertised as simple to integrate with Machine Learning algorithms. Our analysis will focus on the alteration of the textures of different objects, observed from multiple views, even if it could be extended to any other parameter supported by the surrogate renderer (mesh, lighting, etc.). In order to reduce the number of texture elements that are altered by the attack, we propose to consider only those texels that are estimated to be more effective when rendered by the target engine. The saliency associated to different views rendered in the target engine is used as an indirect way to restrict the attack to limited regions of the textures.
We evaluate the transferability of the proposed attack strategy using popular neural classifiers, experimentally confirming that it is indeed possible to rely on simple software to set up a tool chain that can craft adversarial 3D objects for 3D Virtual Environments. The resulting adversarial objects can be used to augment the object library of SAILenv, and exploited by the community to improve or test the robustness of newly developed classifiers. To the best of our knowledge, we are the first ones to provide evidence of the extent to which this process is effective, opening the road to further investigations.

II Background

The three pillars that support the analysis of this paper are 3D virtual environments, rendering software, and the generation of adversarial examples, that we describe in the following subsections, together with recent related work.

II-A 3D Virtual Environments

In recent years there has been an emerging paradigm shift in the way Machine Learning algorithms are evaluated, focusing on algorithms and agents that do not simply learn from datasets of images, videos or text but instead learn in 3D Virtual Environments. Due to the improved photorealism of the rendered scenes, there has been substantial growth in the demand for 3D simulators to support a variety of research tasks [3, 10, 17, 24, 30, 34, 35]. Most virtual environments allow for basic agent navigation in closed-door environments and limited physical interaction, while some also have photo-realistic and moving objects. It is not surprising to see 3D Virtual Environments dedicated to Machine Learning and, more generally, AI that are built within 3D game engines, designed to render high-quality graphics at high frame rates. Among a variety of recent 3D environments, we mention DeepMind Lab [3] (Quake III Arena engine), VR Kitchen [11], CARLA [7] (Unreal Engine 4), AI2Thor [17], CHALET [38], VirtualHome [27], ThreeDWorld [10], SAILenv [24] (Unity3D game engine), HabitatSim [30], iGibson [35], SAPIEN [36] (other engines).

These environments are designed to be interfaced with high-level programming languages and, in turn, with common Machine Learning libraries. The visual quality of the rendered scenes depends both on the features of the 3D engine and on the design of the 3D models that are shared together with the environment itself. Several different tasks are studied using these 3D simulators, such as generic robot navigation, visual recognition, visual QA – see [8] and references therein. Some environments are developed in the context of open projects that might benefit from the contributions of large communities [6, 17, 34], while others are based on closed source solutions [10]. To the best of our knowledge, their flexibility as tools to deepen the understanding of Adversarial Machine Learning has not been the subject of specific studies yet.

II-B Renderers

Rendering is the process of generating an image from a 3D scene by means of a computer program, which is known as the rendering software, or renderer for short. In a nutshell, and skipping several details, we could think of rendering as a function $r$ from a 3D scene $s$ and a camera $c$ to a 2D image $v$,

$$v = r(s, c), \tag{1}$$

where any $s$ is composed of 3D objects (meshes), lights, and other elements. During rendering, objects are projected onto the camera view plane, taking into account lights and the relevant properties of the objects. The nature of these properties and the influence of light vary depending on the renderer we use. Modern 3D engines support high-end rendering facilities, among which we mention Physically Based Rendering (PBR), which indicates a broad range of technologies simulating the behavior of light impacting and bouncing on the so-called materials, which allow the object to react to the light sources in a realistic manner. Each material has specific properties and texture maps defining its roughness, reflectivity, occlusion and so forth, depending on the engine specifications. For example, the standard shader in Unity3D [14] supports the definition of color and opacity (Albedo Texture Map) or how metallic and smooth a material should be (Metallic Smoothness Texture Map), together with several other properties (Albedo Map Color, Ambient Occlusion Texture Map, Smoothness Multiplier, Normal Multiplier) [14]. In contrast, in non-PBR renderers most of the information commonly contained in a PBR material is held by a single texture called diffuse map. Diffuse maps are usually hand-made by artists or “baked” within external software. As a matter of fact, generic renderers compute the function $r$ of Eq. (1) by means of non-differentiable operations.

In the last years, the scientific community focused on alternative tools to implement a rendering function. In particular, researchers studied neural models to learn Eq. (1) from data [16, 25, 29], or, more specifically, they promoted new rendering software that allows the user to compute gradients with respect to several parameters involved in the rendering process [20, 26, 28]. Among the latter category, we mention PyTorch3D [28], which implements a differentiable rendering API based on the widely used machine learning framework PyTorch. Despite being extremely versatile, several differentiable renderers do not support PBR [20, 28] or other advanced rendering facilities, thus not reaching the level of photorealism that is typical of high-end non-differentiable renderers.

II-C Adversarial Objects

The growing diffusion of deep learning methods and applications in real-life scenarios [13] poses serious concerns on their robustness. In particular, the vulnerability of their prediction performances to intentionally designed alterations of input data, i.e., adversarial examples [4, 5, 33], has been proven using several methods, such as Fast Gradient Sign Method (FGSM) [12], Projected Gradient Descent (PGD) [23] and many others [1].

Let us consider a classification task and a generic annotated pair $(x, y)$, where $x$ denotes an input pattern and $y$ is the associated supervision. We also consider a neural network classifier $f$ with parameters $\theta$. Let us indicate with $\hat{y}$ the output yielded by the classifier when processing $x$, $\hat{y} = f(x, \theta)$, and with $L(f(x, \theta), y)$ the loss function that measures the mismatch between the prediction and the ground truth. Neural classifiers have been proved to be vulnerable to the injection of adversarial perturbations in the input space, resulting in the misclassification of the pattern at hand. In particular, an adversarial input $\hat{x}$ causes $f$ to make wrong predictions, i.e., $f(\hat{x}, \theta) = \tilde{y}$ with $\tilde{y} \neq y$. In order to inject a perturbation that can be considered imperceptible to humans, a set of admissible perturbations $\Delta$ is defined. A common choice, that we will consider henceforth, limits the perturbation to fall within the $\ell_p$-ball of radius $\epsilon$. In the simplest case, the goal of the attacker is to find $\hat{x}^\star$ as solution of the following optimization problem,

$$\hat{x}^\star = \arg\max_{\hat{x} \,:\, \hat{x} - x \in \Delta} L(f(\hat{x}, \theta), y). \tag{2}$$

In the specific case of PGD, problem (2) is solved by an iterative scheme, possibly including random restarts,

$$\hat{x}^{k+1} = \Pi_{x,\epsilon}\Big(\hat{x}^{k} + \alpha \, \mathrm{sign}\big(\nabla_{\hat{x}} L(f(\hat{x}^{k}, \theta), y)\big)\Big), \tag{3}$$

being $\hat{x}^{0} = x$ (or $\hat{x}^{0} = x + \eta$, with $\eta$ that is randomly generated), $k$ the iteration index, $\alpha$ the step length, and $\Pi_{x,\epsilon}$ the operator that projects its argument onto an $\ell_p$-ball with radius $\epsilon$ centered on the original example.
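The iterative PGD scheme can be illustrated with a minimal, self-contained Python sketch. Everything in it is an assumption made for illustration: the classifier is a toy logistic model with hand-picked weights `w` and `b` (not one of the networks studied in the paper), the perturbation set is an $\ell_\infty$-ball, the step follows the sign of the gradient, and no random restart is used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, y, w, b):
    """Cross-entropy loss of a toy logistic classifier and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = [(p - y) * wi for wi in w]  # d loss / d x for a logistic model
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.02, steps=20):
    """Gradient-sign ascent on the loss, projected onto the l_inf ball of radius eps around x."""
    x_adv = list(x)  # start from the clean example (no random start)
    for _ in range(steps):
        _, g = loss_and_grad(x_adv, y, w, b)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip every coordinate back into [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy problem.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, -0.2, 0.1], 1
x_adv = pgd_linf(x, y, w, b)
clean_loss, _ = loss_and_grad(x, y, w, b)
adv_loss, _ = loss_and_grad(x_adv, y, w, b)
```

With a deep classifier the only part that changes is `loss_and_grad`, which would be provided by automatic differentiation instead of a closed-form expression.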

In the case of 3D data, prior work [21, 22] has shown that carefully-designed 2D adversarial examples fail to fool classifiers in the physical 3D world under several image transformations, such as changes in viewpoint, angle or other conditions (camera noise or light variation). In order to generalize attacks to such contexts, Athalye et al. [2] proposed adversarial examples that are robust over a certain distribution of transformations. They introduced the so-called Expectation over Transformation (EOT), where Eq. (2) becomes

$$\hat{x}^\star = \arg\max_{\hat{x} \,:\, \hat{x} - x \in \Delta} \mathbb{E}_{t \sim \mathcal{T}}\left[ L(f(t(\hat{x}), \theta), y) \right], \tag{4}$$

being $\mathcal{T}$ a distribution of transformations and $t$ a transformation sampled from $\mathcal{T}$, while $t(\hat{x})$ is the input of the classifier. Moreover, $\Delta$ is the set of perturbations for which

$$\mathbb{E}_{t \sim \mathcal{T}}\left[ d(t(\hat{x}), t(x)) \right] < \epsilon, \tag{5}$$

where $d$ is a distance function and $\epsilon$ bounds the expected distance. This approach basically introduces expectations both in the objective function and in the perturbation-related constraint. What is important to consider is that $\mathcal{T}$ covers a wide variety of transformations, including special operations that consist in using $\hat{x}$ as a texture of a 3D object and rendering it to a 2D image. The authors of [2] use this intuition to physically create real-world 3D objects that are adversarial over different visual poses. Of course, this requires a renderer (Section II-B) that is differentiable, and [2] is based on specific ad-hoc operations that cannot be easily implemented in a general setting.
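To make the role of the expectation concrete, the sketch below estimates the expected loss and its gradient by averaging over a finite sample of transformations, and plugs that estimate into a projected gradient-sign loop. It is only an illustration under stated assumptions: the classifier is a toy logistic model, each transformation is a hypothetical brightness/contrast change `t(x) = scale * x + shift` (standing in for rendering under a sampled pose), and the constraint is simplified to an $\ell_\infty$ bound on the input rather than an expected distance between transformed inputs.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, y, w, b):
    """Cross-entropy loss of a toy logistic classifier and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss, [(p - y) * wi for wi in w]

def expected_loss_and_grad(x_hat, y, w, b, transforms):
    """Monte Carlo estimate of the expected loss over transformations and of its gradient."""
    n = len(transforms)
    mean_loss, mean_grad = 0.0, [0.0] * len(x_hat)
    for scale, shift in transforms:
        t_x = [scale * xi + shift for xi in x_hat]  # transformed input t(x_hat)
        loss, g = loss_and_grad(t_x, y, w, b)
        mean_loss += loss / n
        # Chain rule through t: d t(x) / d x = scale for every coordinate.
        mean_grad = [mg + scale * gi / n for mg, gi in zip(mean_grad, g)]
    return mean_loss, mean_grad

# Hypothetical sample of 16 transformations and toy problem.
rng = random.Random(0)
transforms = [(rng.uniform(0.8, 1.2), rng.uniform(-0.05, 0.05)) for _ in range(16)]
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, -0.2, 0.1], 1
eps, alpha = 0.1, 0.02

x_hat = list(x)
for _ in range(20):
    _, g = expected_loss_and_grad(x_hat, y, w, b, transforms)
    x_hat = [xh + alpha * ((gi > 0) - (gi < 0)) for xh, gi in zip(x_hat, g)]
    x_hat = [min(max(xh, xi - eps), xi + eps) for xh, xi in zip(x_hat, x)]

clean_loss, _ = expected_loss_and_grad(x, y, w, b, transforms)
adv_loss, _ = expected_loss_and_grad(x_hat, y, w, b, transforms)
```

The perturbation is optimized once and evaluated under every sampled transformation, which is what makes it robust to the transformation rather than tuned to a single view.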

Other very recent works focus on different aspects of the rendering process. In [40], the authors consider the tasks of Visual Question Answering (VQA) and 3D shape classification, and they perturb multiple physical parameters such as the material, the illumination or the normal map. However, they keep a fixed view of the rendered object, which might create artifacts when the object is observed from different locations. Liu et al. [19] proposed perturbations that are focused on lighting. Their work is based on an ad-hoc physically-based differentiable renderer capable of backpropagating gradients to the parametric space. MeshAdv [37] alters the object meshes, using a neural renderer applied to models with constant reflectance, i.e., very simple textures. The authors investigate the robustness under various viewpoints and the transferability to black-box renderers under controlled rendering parameters. A recent work by Yao et al. [39] leverages multi-view attacks inspired by EOT, in order to devise 3D adversarial objects perturbing the texture space, investigating the attack quality using multiple classifiers. Finally, Liu et al. [18] consider the case of embodied agents performing navigation and question answering. To better attack the task at hand, the perturbations are focused on the salient stimuli characterizing the temporal trajectory followed by the embodied agent to complete its task.

III Adversarial Attacks to 3D Virtual Environments

We consider the problem of generating adversarial 3D objects in the context of the 3D Virtual Environments of Section II-A. The existing experiences in crafting 3D adversarial objects (Section II-C) have shown that it is indeed possible to create attacks that fool the classifier of a rendered scene. However, existing works are strongly based on ad-hoc solutions, sometimes using specifically created renderers, limiting the attacks to a single view or considering very simple textures. They usually assume that the attacker has access to low-level properties of the renderer, such as the mapping of the view-space coordinates to the texture-space coordinates, or that renderers can be modified to expose additional information [2]. Unfortunately, these assumptions make their findings hard to adapt to more general cases. Another remarkable limit is that rendering engines (Section II-B) are pieces of software that require advanced skills not only in programming, but also in computer graphics, in order to be modified to accommodate attack procedures, or they might not be open source.

We focus on a more generic perspective, based on a realistic setting in which the attacker has the goal of creating adversarial objects for a certain target renderer on which they have limited control. We assume the attacker to have some skills in Adversarial Machine Learning but not necessarily an advanced knowledge of computer graphics. We explore the idea of synthesizing 3D adversarial objects using off-the-shelf popular software packages, assembled into a specifically designed tool chain, with the goal of being able to craft malicious examples that can then be transferred to the target renderer of the considered 3D Virtual Environment. We report the structure of the proposed tool chain in Fig. 1.

Fig. 1: Structure of our adversarial object generation procedure. Saliency is computed exploiting the target renderer, as highlighted by the dotted line.

Our computational pipeline takes into account two different renderers (Fig. 1, white boxes). One of them is the already introduced target renderer, while the other one is what we refer to as surrogate renderer. The latter is a differentiable renderer on which the attacker has complete control, a reasonable assumption considering that many open-source differentiable renderers have been recently made available to the research community. As discussed in Section II-B, it is likely that differentiable renderers will not perfectly match the quality of the target renderer, so that we focus on the specific case in which there is an evident difference between the outcome of the two renderers. Of course, we assume that the 3D Virtual Environment allows users to introduce and render custom objects. However, care must be taken in adapting the object data format between the two renderers, since there might be a misalignment between the type of models expected by the 3D Virtual Environment and by the surrogate engine, requiring specific adaptations (Fig. 1, leftmost and rightmost blocks).

Let us consider a certain object $o$ of class $y$, a scene $s$ and a camera $c$. The notation $o$ indicates the object and all its properties (mesh, textures, etc.), and, for the sake of simplicity, we indicate with $o + \delta$ an alteration of the object obtained by perturbing its properties by an offset $\delta$. We consider the image (view) $v$ that we get when plugging $o + \delta$ into scene $s$, and rendering the whole 3D data when observed from camera $c$. We overload the notation of $r$ in Eq. (1) to introduce the dependence on $o + \delta$,

$$v = r(o + \delta, s, c). \tag{6}$$

A neural network classifier $f$ (Fig. 1, top green box) processes $v$. The classifier prediction is evaluated by a loss function $L$ (Fig. 1, mid green box) that drives the generation of the adversarial object, inspired by the EOT of Eq. (4), even if fully focused on transformations in the 3D world. In particular,

$$\delta^\star = \arg\max_{\delta} \mathbb{E}_{(s, c) \sim \mathcal{T}}\left[ L(f(r(o + \delta, s, c), \theta), y) \right], \tag{7}$$

where $\mathcal{T}$ includes different camera positions and orientations, different lighting conditions and, in the most generic cases, different backgrounds. Eq. (7) is paired with a norm-based constraint that ensures $\| r(o + \delta, s, c) - r(o, s, c) \|_p \leq \epsilon$, for all $(s, c) \in \mathcal{T}$. This view-based constraint acts as an indirect measure to ensure that $o$ is not changing in a too evident way.

In order to solve Eq. (7) we need to exploit a differentiable renderer that allows us to compute the gradient with respect to the properties of object $o$. In general, we do not know in advance how many of such properties will be concretely altered by $\delta$, since it depends on what transformations are considered and on the details of the experimental setup. Of course, perturbing a small subset of such properties could be a desirable feature to ensure that the object is not altered in a too evident way. Inspired by this consideration, we propose an approach that is agnostic to the type of adversarial example generation algorithm. In particular, we exploit the fact that while the target renderer cannot be used to compute gradients with respect to $o$, we can indeed use it to render an object view, that we indicate with $\tilde{v}$ for a certain pair $(s, c)$. Then, we can straightforwardly compute the gradients of $L$ with respect to the pixels of the image $\tilde{v}$. The pixels with the largest (absolute) gradients are the ones to which the classifier is more sensitive. Once we rescale such gradients in $[0, 1]$, we can select a custom threshold $\gamma$ to build a binary saliency map (Fig. 1, bottom green box) that tells which pixels of the target rendering the classifier is significantly sensitive to. Moving to the surrogate renderer, such a map can be used to prevent gradients from being back-propagated through non-salient pixels, which, in turn, is expected to reduce the number of properties of $o$ that will be altered.
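The construction of the binary saliency map can be sketched in a few lines of Python. The function names `binary_saliency` and `mask_gradient` are hypothetical, and so are two details the text leaves open: the rescaling is done here with a min-max normalization of the absolute gradients, and a pixel is marked salient when its rescaled value is greater than or equal to the threshold. Images are plain nested lists with one value per pixel.

```python
def binary_saliency(pixel_grads, threshold):
    """Binarize per-pixel loss gradients computed on a view rendered by the target renderer.

    pixel_grads: 2D list with the gradient of the loss w.r.t. each pixel.
    threshold:   value in [0, 1]; higher values keep fewer pixels.
    """
    mags = [[abs(g) for g in row] for row in pixel_grads]
    lo = min(min(row) for row in mags)
    hi = max(max(row) for row in mags)
    span = hi - lo
    if span == 0:
        # Flat gradient: no pixel stands out, so nothing is marked as salient.
        return [[0 for _ in row] for row in mags]
    return [[1 if (m - lo) / span >= threshold else 0 for m in row] for row in mags]

def mask_gradient(surrogate_grads, mask):
    """Zero the surrogate-view gradients on non-salient pixels before back-propagation."""
    return [[g * m for g, m in zip(grow, mrow)] for grow, mrow in zip(surrogate_grads, mask)]

# Hypothetical 2x2 gradient image.
grads = [[0.0, -0.5],
         [0.25, 1.0]]
mask = binary_saliency(grads, threshold=0.5)
masked = mask_gradient(grads, mask)
```

In an automatic differentiation framework the same masking would typically be applied to the gradient of the rendered image, so that only texels visible through salient pixels receive an update.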

III-A Case Study

We instantiated the strategy of Fig. 1 into a specific case study, that will also drive our experiments. Our choices are completely driven by simplicity, selecting tools that are recent, freely available, and that do not require advanced skills in computer graphics. In particular, we considered a recent 3D Virtual Environment named SAILenv [24] (Section II-A), that is open-source and claimed to be simple to interface with Machine Learning models. SAILenv exploits Unity3D, that is our target renderer, which satisfies the assumption of allowing the attacker to render custom objects while having limited low-level control over the renderer, since it is based on proprietary code and cannot be “easily” modified. As surrogate renderer we focused on the recent PyTorch3D [28] (Section II-B), that is completely based on Autograd and thus trivial to integrate with a PyTorch-based classifier for gradient computation. The two renderers have some remarkable differences. SAILenv uses PBR while PyTorch3D is based on diffuse-based rendering, which takes into account only the surface color and a much simplified light model. Both are discussed in Section II-B. We approximate the diffuse maps rendered by PyTorch3D with the Albedo Texture Maps used within Unity3D. This approximation holds best for neutral illumination settings and for low-reflectivity materials. See Fig. 2 for a comparison of the rendering capabilities of the two renderers.

(a) PyTorch3D              (b) SAILenv (Unity3D)

Fig. 2: Rendering capabilities of the surrogate (a) and target (b) renderers.

PyTorch3D allows gradient estimation for several parameters of the object and of the scene (surface color, object geometry, lighting, etc.). For the scope of this paper, we will focus only on the surface color texture. This is a very challenging setting due to the aforementioned limited rendering facilities of PyTorch3D, making this case study a very good representative of the previously described attack scenario.

We qualitatively show in Fig. 3 how the saliency maps, computed using Unity3D over multiple views, are projected back onto the texture space, accumulating their contributions on the texels, then rendering the 3D objects. While computing this projection in the target renderer is not straightforward, this can be easily done following the texturing routine of PyTorch3D, and that is how we created the figure. We can appreciate how the larger saliency areas only cover a subportion of the texels.

Fig. 3: Multi-view saliency maps (target renderer), projected into the surface of the object (projection computed by the surrogate engine) – red=high; blue=low.
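The accumulation of multi-view saliency on the texels can be illustrated with the following sketch. It assumes that, for each view, a pixel-to-texel index map is available (for instance derived from the surrogate renderer's texturing step); the function name `accumulate_saliency`, the flat texel indexing and the use of `None` for background pixels are all hypothetical choices made for this example.

```python
def accumulate_saliency(views, num_texels):
    """Sum per-view pixel saliency onto the texels each pixel is textured from.

    views: list of (saliency, pixel_to_texel) pairs, both 2D lists of equal shape;
           pixel_to_texel holds a flat texel index, or None for background pixels.
    """
    texel_saliency = [0.0] * num_texels
    for saliency, pixel_to_texel in views:
        for s_row, t_row in zip(saliency, pixel_to_texel):
            for s, t in zip(s_row, t_row):
                if t is not None:
                    texel_saliency[t] += s
    return texel_saliency

# Two hypothetical 2x2 views of an object whose texture has 5 texels.
view_a = ([[1, 0], [1, 1]], [[0, 1], [None, 2]])
view_b = ([[0, 1], [1, 0]], [[2, 2], [3, None]])
texel_saliency = accumulate_saliency([view_a, view_b], num_texels=5)

# Fraction of texels touched by at least one salient pixel.
covered = sum(1 for v in texel_saliency if v > 0) / len(texel_saliency)
```

In this toy case only three of the five texels receive any saliency, which mirrors the observation that salient areas cover a subportion of the texture.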

Notice that the data adapters of Fig. 1 (i.e., Surrogate Renderer Adapter and Target Renderer Adapter) play a crucial role in our case study, since Unity3D stores objects in a different format (FBX) than the one used by PyTorch3D (OBJ). We implemented a source object converter by means of a Blender script, created from scratch. The final adversarial object is then converted back to the Unity3D format through a plugin that is internal to Unity3D, and finally rendered in SAILenv.

In order to evaluate the impact of our strategies on different networks, we selected two popular and powerful deep neural image classifiers trained on ImageNet, namely InceptionV3 and MobileNetV2. The former is a state-of-the-art image classifier, the latter is a smaller model, still very accurate. We considered 10 different objects from the SAILenv library, associated to classes that are supported by the classifiers. The objects are shown in Fig. 4, comparing their appearance in the surrogate and target renderers.

Fig. 4: Objects considered in our case study, rendered using PyTorch3D (top) and in SAILenv/Unity3D (bottom).

As adversarial object generation method, we implemented the PGD attack described in Section II-C, using the cross entropy loss.

IV Experimental Results

We performed several experiments to evaluate the proposed attack strategy in the case study of Section III-A. Our implementation of what we propose and study in this paper, and the 3D models used in the experiments, are available online. Each object is rendered from different views, keeping the camera at a fixed distance which was manually chosen to obtain an iconic image of the object, i.e., so that the object covers most of the picture. The camera turns around the object over a range of azimuth angles and also changes its elevation. The range over which the elevation is changed is manually chosen for each object in order to avoid unnatural viewing orientations that would lower the classification accuracy even without any attack. We considered a directional light, similar to the way sunlight shines on objects, coming from the front and at a fixed elevation. The background scene of each object is composed of a uniform color, either white or black, selecting the one that maximized the recognition accuracy.
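As an illustration of such a multi-view setup, the sketch below places cameras on a sphere around the object at a fixed distance. The function name `orbit_cameras`, the y-up axis convention, the distance value and the 20 x 3 grid of azimuths and elevations are assumptions made for this example; the text only states that the views vary in azimuth and elevation, and Fig. 7 refers to 60 views per object.

```python
import math

def orbit_cameras(distance, azimuths_deg, elevations_deg):
    """Camera positions on a sphere centered on the object (y axis pointing up)."""
    cameras = []
    for elev in elevations_deg:
        for azim in azimuths_deg:
            e, a = math.radians(elev), math.radians(azim)
            x = distance * math.cos(e) * math.sin(a)
            y = distance * math.sin(e)
            z = distance * math.cos(e) * math.cos(a)
            cameras.append((x, y, z))
    return cameras

# Hypothetical grid: 20 azimuths over a full turn, 3 elevations.
azimuths = [i * 360.0 / 20 for i in range(20)]
elevations = [0.0, 15.0, 30.0]
cams = orbit_cameras(2.0, azimuths, elevations)
```

Each position would then be paired with a look-at orientation towards the object center before rendering in either renderer.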

We explored attacks that progressively yield larger alterations in the original textures, considering $\epsilon \in \{0.05, 0.1, 0.5\}$, comparing cases in which we do not use saliency maps or when the maps are binarized with different thresholds of tolerance $\gamma$, with the remaining attack hyperparameters kept fixed. It is important to remark that we are considering the $\ell_\infty$ norm to bound the perturbations, so that, given the same $\epsilon$, we can have a very different number of altered texels. Altering fewer texels is expected to reduce the probability of letting humans recognize the adversarial object, and that is the goal of the proposed saliency-map-based procedure. We used two metrics to evaluate the quality of the adversarial objects. The first one is the accuracy drop $\rho$, that is the ratio of the variation of average accuracy (before and after the attack, referred to as $a$ and $\hat{a}$, respectively) to the initial accuracy, while the second one is the percentage $\tau$ of texels that are altered by the attack procedure. Formally,

$$\rho = \frac{a - \hat{a}}{a}, \qquad \tau = 100 \cdot \frac{\| \hat{u} - u \|_0}{n}, \tag{8}$$

being $u$ the texture tensor composed of $n$ elements, $\hat{u}$ its adversarial counterpart, and $\| \cdot \|_0$ the $\ell_0$ norm. We computed both the metrics within the PyTorch3D renderer and the SAILenv (Unity3D) renderer. In the former case, we are basically exploring a white-box scenario, where the system we attack is the one on which we evaluate the result. In the latter case, we consider the impact of the adversarial object once it is transferred to a target environment, in a very challenging black-box setting, due to the previously described differences between the two renderers.
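The two metrics are simple enough to be written down directly. The sketch below is a plain-Python illustration with hypothetical function names; textures are flattened into lists of values, and returning `None` when the initial accuracy is zero is an assumption chosen to mirror the "n.a." entries of Table I.

```python
def accuracy_drop(acc_before, acc_after):
    """Relative accuracy drop: (accuracy before - accuracy after) / accuracy before."""
    if acc_before == 0:
        return None  # undefined when the object is never recognized before the attack
    return (acc_before - acc_after) / acc_before

def altered_texel_percentage(texture, adv_texture, tol=0.0):
    """Percentage of texture elements changed by the attack (an l_0-style count)."""
    changed = sum(1 for u, v in zip(texture, adv_texture) if abs(u - v) > tol)
    return 100.0 * changed / len(texture)

# Hypothetical values.
drop = accuracy_drop(0.8, 0.2)       # most views are misclassified after the attack
negative = accuracy_drop(0.5, 0.6)   # the attack accidentally helps the classifier
pct = altered_texel_percentage([0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.45])
```

A negative value of the first metric corresponds to the case in which the accuracy after the attack is higher than before it.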

In Table I we report the main results of our experiments, showing the accuracy drop $\rho$ for the considered objects and the average result (last column; we indicate with n.a. those objects that were not correctly recognized by MobileNetV2 in their original state). In the case of the surrogate renderer, it is evident that even lower values of $\epsilon$ are usually enough to achieve a near-total drop of accuracy, with the exception of the Tennis Racket, for which a higher $\epsilon$ is needed.



Surrogate renderer (PyTorch3D)
Classifier   ε         1      2      3      4      5      6      7      8      9     10    Avg
MobileNetV2  0.05   1.00   n.a.   0.86   1.00   1.00   1.00   0.98   0.95   0.92   1.00   0.97
             0.10   1.00   n.a.   0.86   1.00   1.00   1.00   0.98   1.00   1.00   1.00   0.98
             0.50   1.00   n.a.   0.86   1.00   1.00   1.00   1.00   0.95   1.00   1.00   0.98
InceptionV3  0.05   1.00   1.00   0.98   1.00   1.00   0.97   0.95   1.00   0.05   1.00   0.89
             0.10   1.00   1.00   0.95   1.00   0.98   0.97   1.00   0.97   0.45   1.00   0.93
             0.50   1.00   1.00   1.00   1.00   1.00   0.94   0.97   1.00   0.97   1.00   0.99

Target renderer (SAILenv/Unity3D)
Classifier   ε         1      2      3      4      5      6      7      8      9     10    Avg
MobileNetV2  0.05   0.76   n.a.   0.62   1.00   0.72   0.60   0.21   0.00   0.81   n.a.   0.59
             0.10   0.72   n.a.   0.62   1.00   0.74   0.65   0.43   0.00   0.90   n.a.   0.63
             0.50   0.76   n.a.   0.62   1.00   0.74   0.68   0.41   0.00   0.94   n.a.   0.64
InceptionV3  0.05   0.37   0.87  -0.12   0.65   0.07   0.00   0.10   0.16   0.00   1.00   0.31
             0.10   0.41   0.77  -0.15   0.65   0.08   0.21   0.34   0.47   0.00   1.00   0.38
             0.50   0.39   0.73  -0.08   0.65   0.12   0.09   0.41   0.42   0.07   1.00   0.38

TABLE I: Accuracy drop in each considered object (columns 1 to 10, one per object) and average result (Avg).

When the attack is transferred to SAILenv, we can observe that the classifiers are still fooled in a non-negligible manner. Of course, the extent to which the attack has effect is reduced, as expected, but it is surprising to see that even if the difference between the two renderers in our case study is significant, the attack can impact the outcome of the classification in the target 3D Virtual Environment. With the exception of Lamp Floor, Pot, Tennis Racket in the case of InceptionV3, and Teapot, Living Room Table for MobileNetV2, where the attack yields no evident accuracy drops (in one case also a negative drop, meaning that it is slightly improving the classification), the other adversarial objects reduce the accuracy of the classifiers, with a pretty strong effect in the case of MobileNetV2.

In Fig. 5, we report the 2D plot of the percentage of altered texels $\tau$ against the accuracy drop $\rho$ (all objects), taking into account different values of $\epsilon$ and saliency thresholds $\gamma$. When using PyTorch3D, several points are clustered on the right side of the plot, associated to a large $\rho$. In the case of SAILenv and InceptionV3 as a classifier, the majority of points are concentrated in a limited range of $\rho$, with some attacks reaching very large drops. In the case of SAILenv and MobileNetV2, more attacks have $\rho$ approaching its maximum. As already discussed, even a small $\epsilon$ might end up altering a significant amount of texels. However, the plots show that using saliency is a good solution to identify a trade-off between $\rho$ and $\tau$. In particular, the attacks in which no saliency information is used are usually located in the upper-right quadrant of the plot: they have a high impact but they heavily alter the textures. Attacks with the largest saliency threshold are instead usually located in the lower-left quadrant: they have a low impact but they are also harder for humans to notice, since they alter fewer texels. When using a lower $\gamma$ we get results distributed in the central part of the plot: a good impact on the classifier, obtained by altering a relatively small number of texels.

Fig. 5: Accuracy drop versus percentage of altered texels. Each point is an adversarial object; colors and markers distinguish the different values of $\epsilon$ and of the saliency threshold $\gamma$.

We qualitatively evaluated the renderings of the adversarial objects, reporting in Fig. 6 two examples that fool the InceptionV3 classifier in both renderers. While we do not see any evident difference between the original and the altered objects, we still observe a huge gap between what the surrogate and the target renderers produce, which underlines the importance of the results of this study.

Fig. 6: For the surrogate (a) and target (b) renderers, we report two objects (one per row), before (left) and after (right) the attack.

In Fig. 7 we consider all the views of these objects, reporting how the predictions of InceptionV3 are distributed. Before the attack, most of the predictions fall on the ground-truth class, while after the attack they are spread over multiple incorrect classes.

[Fig. 7 panels: Remote Control (PyTorch3D, SAILenv) and Dining Table (PyTorch3D, SAILenv).]

Fig. 7: Number of correct predictions made by InceptionV3 out of 60 different views of the objects of Fig. 6, before and after having attacked them.
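The per-view analysis behind Fig. 7 amounts to counting, over all views of an object, how many predictions hit the ground-truth class and how the remaining ones spread over other classes. The sketch below shows one way to compute this; the helper name `view_summary` is hypothetical, and the label lists are invented placeholders, not the paper's results.

```python
from collections import Counter
from typing import Sequence


def view_summary(predictions: Sequence[str], target: str):
    """Return (number of correct views, histogram of the wrong labels)."""
    counts = Counter(predictions)
    correct = counts.get(target, 0)
    wrong = Counter({label: n for label, n in counts.items() if label != target})
    return correct, wrong


# Hypothetical labels for 60 views; invented, not the paper's results.
target = "remote control"
before = [target] * 55 + ["mouse"] * 5
after = [target] * 10 + ["mouse"] * 20 + ["cellular telephone"] * 30

correct_before, wrong_before = view_summary(before, target)
correct_after, wrong_after = view_summary(after, target)
print(correct_before, dict(wrong_before))
print(correct_after, dict(wrong_after))
```

Keeping the histogram of wrong labels, rather than only the count of correct views, makes it possible to see whether an attack pushes the predictions towards a single incorrect class or scatters them over several.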

V Conclusions and Future Work

We presented a novel study on the transferability of adversarial 3D objects, created using an off-the-shelf differentiable renderer and then moved to a powerful 3D engine that underlies several recent 3D Virtual Environments. Our analysis showed that it is indeed possible to set up a tool chain based on simple elements that do not require advanced skills in computer graphics, and to use it to craft malicious 3D objects. Experiments on texture-oriented manipulations showed that attacks can be transferred to fool popular neural classifiers, also when considering an estimated saliency of the texels. There is certainly room for future work on improving the effectiveness of the attacks (e.g., by considering other parameters of the renderer, such as the mesh). However, we expect our results to draw the attention of the scientific community to this double-sided aspect: on one hand, it could be an issue for community-open 3D Virtual Environments; on the other hand, it is an opportunity to create even more powerful testing environments, purposely populated with adversarial examples.


  • [1] N. Akhtar and A. S. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access 6, pp. 14410–14430. Cited by: §II-C.
  • [2] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2018) Synthesizing robust adversarial examples. In International Conference on Machine Learning, pp. 284–293. Cited by: §I, §II-C, §III.
  • [3] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, et al. (2016) Deepmind lab. arXiv:1612.03801. Cited by: §I, §II-A.
  • [4] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Cited by: §II-C.
  • [5] B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §I, §II-C.
  • [6] M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, et al. (2020) Robothor: an open simulation-to-real embodied ai platform. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3164–3174. Cited by: §I, §II-A.
  • [7] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. In Conference on robot learning, pp. 1–16. Cited by: §II-A.
  • [8] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan (2021) A survey of embodied ai: from simulators to research tasks. arXiv:2103.04918. Cited by: §II-A.
  • [9] S. A. Eslami, D. J. Rezende, et al. (2018)

    Neural scene representation and rendering

    Science 360 (6394), pp. 1204–1210. Cited by: §I.
  • [10] C. Gan, J. Schwartz, S. Alter, M. Schrimpf, et al. (2020) Threedworld: a platform for interactive multi-modal physical simulation. arXiv:2007.04954. Cited by: §I, §II-A, §II-A.
  • [11] X. Gao, R. Gong, T. Shu, X. Xie, S. Wang, and S. Zhu (2019) Vrkitchen: an interactive 3d virtual environment for task-oriented learning. arXiv:1903.05757. Cited by: §II-A.
  • [12] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv:1412.6572. Cited by: §II-C.
  • [13] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu (2020) A survey of deep learning techniques for autonomous driving. Journal of Field Robotics 37 (3), pp. 362–386. Cited by: §II-C.
  • [14] J. K. Haas (2014) A history of the unity game engine. Cited by: §I, §II-B.
  • [15] H. Kato, D. Beker, M. Morariu, T. Ando, T. Matsuoka, W. Kehl, and A. Gaidon (2020) Differentiable rendering: a survey. arXiv:2006.12057. Cited by: §I.
  • [16] H. Kato, Y. Ushiku, and T. Harada (2017) Neural 3d mesh renderer. arXiv:1711.07566. Cited by: §II-B.
  • [17] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) Ai2-thor: an interactive 3d environment for visual ai. arXiv:1712.05474. Cited by: §I, §II-A, §II-A.
  • [18] A. Liu, T. Huang, X. Liu, Y. Xu, Y. Ma, X. Chen, S. J. Maybank, and D. Tao (2020) Spatiotemporal attacks for embodied agents. In European Conference on Computer Vision, pp. 122–138. Cited by: §II-C.
  • [19] H. D. Liu, M. Tao, C. Li, D. Nowrouzezahrai, and A. Jacobson (2018) Beyond pixel norm-balls: parametric adversaries using an analytically differentiable renderer. In International Conference on Learning Representations, Cited by: §II-C.
  • [20] S. Liu, T. Li, W. Chen, and H. Li (2019) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. In IEEE/CVF International Conference on Computer Vision, Cited by: §II-B.
  • [21] J. Lu, H. Sibai, E. Fabry, and D. Forsyth (2017) No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv:1707.03501. Cited by: §II-C.
  • [22] Y. Luo, X. Boix, G. Roig, T. Poggio, and Q. Zhao (2015) Foveation-based mechanisms alleviate adversarial examples. arXiv:1511.06292. Cited by: §II-C.
  • [23] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083. Cited by: §II-C.
  • [24] E. Meloni, L. Pasqualini, M. Tiezzi, M. Gori, and S. Melacci (2020) SAILenv: Learning in Virtual Visual Environments Made Simple. In 25th International Conference on Pattern Recognition, Cited by: §I, §I, §II-A, §III-A.
  • [25] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421. Cited by: §II-B.
  • [26] M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob (2019) Mitsuba 2: a retargetable forward and inverse renderer. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–17. Cited by: §II-B.
  • [27] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba (2018) Virtualhome: simulating household activities via programs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8494–8502. Cited by: §II-A.
  • [28] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W. Lo, J. Johnson, and G. Gkioxari (2020) Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501. Cited by: §I, §II-B, §III-A.
  • [29] K. Rematas and V. Ferrari (2020) Neural voxel renderer: learning an accurate and controllable rendering tool. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5417–5427. Cited by: §II-B.
  • [30] M. Savva, A. Kadian, O. Maksymets, et al. (2019) Habitat: a platform for embodied ai research. In IEEE/CVF International Conference on Computer Vision, pp. 9339–9347. Cited by: §I, §II-A.
  • [31] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free!. arXiv:1904.12843. Cited by: §I.
  • [32] V. Sitzmann, M. Zollhoefer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. Advances in Neural Information Processing Systems 32, pp. 1121–1132. Cited by: §I.
  • [33] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv:1312.6199. Cited by: §II-C.
  • [34] L. Weihs, J. Salvador, K. Kotar, U. Jain, K. Zeng, R. Mottaghi, and A. Kembhavi (2020) Allenact: a framework for embodied ai research. arXiv:2008.12760. Cited by: §I, §II-A, §II-A.
  • [35] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese (2020) Interactive gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters 5 (2), pp. 713–720. Cited by: §I, §II-A.
  • [36] F. Xiang, Y. Qin, K. Mo, et al. (2020) Sapien: a simulated part-based interactive environment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11097–11107. Cited by: §II-A.
  • [37] C. Xiao, D. Yang, B. Li, J. Deng, and M. Liu (2019) Meshadv: Adversarial meshes for visual recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6898–6907. Cited by: §II-C.
  • [38] C. Yan, D. Misra, A. Bennnett, A. Walsman, Y. Bisk, and Y. Artzi (2018) Chalet: cornell house agent learning environment. arXiv:1801.07357. Cited by: §II-A.
  • [39] P. Yao, A. So, T. Chen, and H. Ji (2020) On multiview robustness of 3d adversarial attacks. In Practice and Experience in Advanced Research Computing, pp. 372–378. Cited by: §II-C.
  • [40] X. Zeng, C. Liu, Y. Wang, W. Qiu, L. Xie, Y. Tai, C. Tang, and A. L. Yuille (2019) Adversarial attacks beyond the image space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4302–4311. Cited by: §II-C.