SIM2REALVIZ: Visualizing the Sim2Real Gap in Robot Ego-Pose Estimation

by Theo Jaunet, et al.

The robotics community has started to rely heavily on increasingly realistic 3D simulators for large-scale training of robots on massive amounts of data. But once robots are deployed in the real world, the simulation gap, as well as changes in the real world (e.g. lighting, object displacements), leads to errors. In this paper, we introduce Sim2RealViz, a visual analytics tool to assist experts in understanding and reducing this gap for robot ego-pose estimation tasks, i.e. the estimation of a robot’s position using trained models. Sim2RealViz displays details of a given model and the performance of its instances in both simulation and the real world. Experts can identify environment differences that impact model predictions at a given location and, through direct interactions with the model, explore hypotheses to fix them. We detail the design of the tool and report case studies on how models exploit the regression-to-the-mean bias and how this can be addressed, and on how models are perturbed by the vanishing of landmarks such as bikes.




1 Introduction

Visual navigation is at the core of most autonomous robotic applications such as self-driving cars or service robotics. One of the main challenges for the robot is to efficiently explore the environment, to robustly identify navigational space, and eventually to find the shortest paths in complex environments with obstacles. The Robotics and Deep Learning communities have introduced models trained with Reinforcement Learning (RL), Inverse RL, or Imitation Learning, targeting complex scenarios requiring visual reasoning beyond waypoint navigation and novel ways to interact with robots, e.g., combining vision, robotics, and natural language processing through queries like “Where are my keys?”. Current learning algorithms are not sample-efficient enough, so this kind of capability requires an extremely large amount of data. In the case of RL, this is in the hundreds of millions or in the billions of interactions; this simply cannot be addressed in a reasonable amount of time using a physical robot in a real environment, which may also damage itself in the process.

To tackle this issue, the field heavily relies on simulation, where training can proceed significantly faster than in physical (wall clock) time on fast modern hardware, easily distributing multiple simulated environments over a large number of cores and machines. However, neural networks trained in simulated environments often perform poorly when deployed on real-world robots and environments, mainly due to the “sim2real gap”, i.e. the lack of accuracy and fidelity in simulating real-world environment conditions such as, among others, image acquisition conditions and sensor noise, but also furniture changes and other moved objects. The exact nature of the gap is often difficult to pinpoint. It is well known that adversarial examples, in which only a few pixels change in ways that humans consider small artifacts or cannot detect at all, can directly alter the decisions of trained models Goodfellow et al. (2014); Moosavi-Dezfooli et al. (2016); Liu et al. (2018).

The sim2real gap is currently addressed by various methods, including domain randomization, where the physical reality is considered to be a single parametrization of a large variety of simulations Tobin et al. (2017); Mehta et al. (2020), and domain adaptation, i.e. explicitly adapting a model trained in simulation to the real world Tzeng et al. (2017); Chadha and Andreopoulos (2019). However, identifying the sources of the sim2real gap would help experts in designing and optimizing transfer methods by directly targeting simulators and design choices of the agents themselves. To this end, we propose Sim2RealViz, a visual analytics interface aiming to understand the gap between a simulator and a physical robot. We claim that this tool is helpful to gather insights on the studied agents’ behavior by comparing decisions made in simulation and in the real-world physical environment. Sim2RealViz exposes different trajectories, and their divergences, into which a user can dive for further analysis. In addition to behavior analysis, it provides features designed to explore and study the models’ inner representations, and thus grasp differences between the simulated environment and the real world as perceived by agents. Experts can rely on multiple coordinated views, which can be used to compare model performances estimated with different metrics such as a distance, an orientation, or a UMAP projection of latent representations. In addition, experts have three different approaches at their disposal to highlight the estimated sim2real gap, overlaid either on 3D projective inputs or on a bird’s eye view (“Geo-map”) of the environment.

Sim2RealViz targets domain experts, referred to as model builders and trainers Strobelt et al. (2017); Hohman et al. (2019). The goal is assistance during real-world deployment, pinpointing root causes of decisions. Once a model is trained in simulation, those experts are often required to adapt it to real-world conditions through transfer learning and similar procedures.

Sim2RealViz provides information on the robot’s behavior, and has been designed to help end-users by building trust Dragan et al. (2013); Lipton (2018). In Section 5, we report on insights gained through experiments using Sim2RealViz: how a selection of pre-trained neural models exploits specific sensor data, hints on their internal reasoning, and their sensitivity to sim2real gaps.

2 Context and problem definition

We study trained models for ego-localization of mobile robots in navigation scenarios, which regress the coordinates and camera angle from observed RGB and depth images. Physical robots take these inputs from a depth camera, whereas in simulation they are rendered using computer graphics software from a 3D scan of the environment. Fig. 2 provides a concrete example, where two images are taken at the same spatial coordinates ①, one from simulation and the other from a physical robot. As our goal is to estimate the sim2real gap, we do not focus on generalization to unseen environments. Instead, our simulated environment corresponds to a 3D scanned version of the same physical environment in which the robot navigates, which allows precise estimation of the difference in localization performance, as the gap leads to differences in predicted positions. The full extent of the gap, and how it may affect models, is hard for humans to understand, which makes it difficult to make design choices and optimize decisions for sim2real transfer.

Figure 2: In the studied robot ego-localization task, an RGB-D image ① is given to a trained model ②, which uses it to regress the location and orientation angle in the environment from which this image was taken ③. As illustrated above, images taken from the same coordinates in simulation and in the real world ① may lead to different predictions due to differences such as, here, the additional presence of a bike in the scene. We are interested in reducing the gap between the sim and real predictions.

Simulation. We use the Habitat Savva et al. (2019) simulator and load a high-fidelity 3D scan of a modern office building created with the Matterport3D software Chang et al. (2017) from individual 360-degree camera images taken at multiple viewpoints. The environment can differ from the physical place due to estimation errors of geometry, texture, lighting, and alignment of the individual views, as well as changes made after the acquisition, such as moved furniture or opened/closed doors. The Habitat simulator handles environment rendering and agent physics, starting with the agent’s shape and size (e.g., a cylinder with diameter 0.2 m and height 1.5 m), its action space (e.g., turn left, turn right, or move forward), and its sensors (e.g., a simulated RGB-D camera). Depending on the hardware, the simulator can produce up to 10,000 frames per second, allowing agents to be trained on billions of interactions in a matter of days.
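As a rough sanity check of the throughput claim, the arithmetic can be sketched as follows. The sketch assumes that the peak rate of 10,000 frames per second is sustained and that one rendered frame counts as one interaction; the function name and the number of parallel instances are hypothetical and only serve the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_collect(interactions, frames_per_second=10_000, n_instances=1):
    """Wall-clock days needed to simulate `interactions` frames.

    Assumes the peak frame rate is sustained on every instance and that
    one frame equals one agent interaction (both are simplifications).
    """
    frames_per_day = frames_per_second * n_instances * SECONDS_PER_DAY
    return interactions / frames_per_day


single = days_to_collect(1_000_000_000)
parallel = days_to_collect(1_000_000_000, n_instances=8)
print(f"1 instance: {single:.2f} days, 8 instances: {parallel:.2f} days")
```

Under these assumptions, one billion interactions take a little over one day on a single instance, which is consistent with "billions of interactions in a matter of days".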

Real world. Our physical robot is a “Locobot” Locobot featuring an RGB-D camera and an additional lidar sensor which we installed. We use the lidar and the ROS NavStack to collect ground-truth information on the robot’s position and angle, used as a reference to evaluate ego-pose localization performance in the real world. To increase precision, we do not build the map online with SLAM; instead, we export a global map from the 3D scan described above and align this map with the lidar scan using the ROS NavStack.

Ego-pose estimation: the trained agent. Traditionally, ego-pose estimation of robots is performed from various inputs such as lidar, odometry, or visual input. Localization from RGB data is classically performed with keypoint-based approaches and matching/alignment Brachmann et al. (2021). More recently, this task has been addressed using end-to-end training of deep networks. We opted for the latter and, inspired by PoseNet Kendall et al. (2015), trained a deep convolutional network to take a stacked RGB-D image and directly output a vector of three values: the two position coordinates and the orientation angle. The model is trained on images sampled from the simulator with varying positions and orientations while ensuring a minimal distance between data points. We optimize a loss between predictions and ground truth (GT) over the training samples that is built from a norm of the prediction error and a weighting hyper-parameter set to 0.3 in our experiments.
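To make the role of the weighting hyper-parameter concrete, here is a minimal sketch of a PoseNet-style loss. The exact form is an assumption: we take the L2 norm of the position error and add the weight times the absolute orientation error, with the angular difference wrapped to [-pi, pi]. The name `pose_loss`, the choice of norm, the wrapping, and the decision to weight the orientation term are all illustrative and not taken from the formulation above.

```python
import math


def pose_loss(predictions, targets, weight=0.3):
    """Mean loss over samples given as (x, y, angle_in_radians) tuples.

    Assumed form: ||position error||_2 + weight * |wrapped angle error|.
    """
    total = 0.0
    for (px, py, pa), (tx, ty, ta) in zip(predictions, targets):
        position_error = math.hypot(px - tx, py - ty)
        # Wrap the angular difference so that angles just below 2*pi and
        # just above 0 are treated as close to each other.
        angle_error = abs(math.atan2(math.sin(pa - ta), math.cos(pa - ta)))
        total += position_error + weight * angle_error
    return total / len(predictions)


position_only = pose_loss([(3.0, 4.0, 0.0)], [(0.0, 0.0, 0.0)])
angle_only = pose_loss([(0.0, 0.0, math.pi / 2)], [(0.0, 0.0, 0.0)])
wrapped = pose_loss([(0.0, 0.0, 2 * math.pi - 0.1)], [(0.0, 0.0, 0.1)])
```

With a weight of 0.3, an orientation error of one radian costs as much as a position error of 0.3 m, so the weight sets how the two kinds of error are traded off during training.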

3 Related work

Closing the sim2real gap and transfer learning. Addressing the sim2real gap relies on methods for knowledge transfer, which usually combine a large number of samples and/or interactions obtained with a simulator with a significantly smaller number of samples collected from the real world. Although machine learning is a primary way of addressing the transfer, it remains important to assess and analyze the main sources of discrepancies between simulation and real environments. A common strategy is to introduce noise to the agent state based on statistics collected in the real world Savva et al. (2019). Additionally, tweaking the collision detection algorithm to prevent wall sliding has been shown to improve the real-world performance of navigation policies trained in simulation Kadian et al. (2020), which tend to exploit inaccurate physics simulation. Another approach is to uniformly alter simulation parameters through domain randomization, e.g., modifying lighting and object textures, to encourage models to learn invariant features during training Tobin et al. (2017); Tremblay et al. (2018). This line of work highly benefits from domain expert knowledge on the targeted environment, which can provide randomizations closer to reality Prakash et al. (2019); James et al. (2019).

A different family of methods addresses the sim2real gap through domain adaptation, which focuses on modifying trained models and the features they learned from simulation to match those needed for high performance in real environments. This has been explored with different statistical methods from the machine learning toolbox, including discriminative adversarial losses Tzeng et al. (2017); Chadha and Andreopoulos (2019). Feature-wise adaptation has also been addressed through losses Fang et al. (2018); Tzeng et al. (2020) and through fine-tuning Rusu et al. (2017). Instead of creating invariant features, other approaches perform domain adaptation at pixel level Bousmalis et al. (2017, 2018). Despite great results, domain adaptation suffers from the need for real-world data, which is often hard to come by. We argue that domain experts and model builders need assistance in understanding the main sources of sim2real gaps, which can then be leveraged for targeted domain transfer, e.g. through specific types of representations or custom losses.

Visual analytics and interpretability of deep networks. Due to their often generic computational structures, their extremely high number of trainable parameters (up to the order of billions), and the enormous amounts of data on which they have been trained, deep neural networks have a reputation of not being interpretable and of providing predictions that cannot be understood by humans, hindering their deployment to critical applications. The main challenge is to return control over the decision process, which has been delegated to training data, to humans, engineers, and model builders. This arguably requires the design of new tools capable of analyzing the decision process of trained models. The goal is to help domain experts improve models, closing the loop, and to build trust in end-users. These goals have been addressed by both the visualization and the machine learning communities Hohman et al. (2019).

Convolutional neural networks have been studied by exposing their gradients over input images, along with filters Zeiler and Fergus (2014). Combined with visual analytics Liu et al. (2016), this approach provided a glimpse of how sensitive the neurons of those models are to different patterns in the input. Gradients can also be visualized in combination with input images to highlight the elements to which the model is attracted Springenberg et al. (2014), with respect to the desired output (e.g., a class).

More recently, those approaches have been extended with class-driven attributions of features Olah et al. (2018); Carter et al. (2019), which can also be formulated as a graph to determine the decision process of a model through interpretation of features (e.g., black fur for bears) Hohman et al. (2020). However, these approaches are often limited to image classification tasks such as ImageNet, as they need features to exploit human-interpretable concepts from given images.

Interpretable robotics. This classical line of work remains an under-explored challenge when applied to regression tasks such as robot ego-localization, our targeted application, in which attributions may be harder to interpret. To our knowledge, visualization of transfer learning, especially when targeting sim2real, is an under-explored area, in particular for navigation tasks, where experiments with real physical robots are harder to perform compared to, say, grasping problems. In Szabó et al. (2020), the evolution of features is explored before and after transfer through pair-wise alignment. Systems such as Ma et al. (2020) address transfer gaps through multiple coordinated views and an in-depth analysis of model weights and features w.r.t. domains. In Jaunet et al. (2020), the focus is on inspecting the latent memory of agents navigating in simulated environments. Finally, common visualizations consist of heatmaps designed to illustrate results in papers from ML communities such as Tzeng et al. (2017); Zhu et al. (2019).

Despite providing insights on how models adapt to different domains, and in contrast to our work, those methods have not been designed to directly target what parts of the environment, or which sensor settings may produce sim2real gaps, which we consider as relevant information for domain experts.

4 Sim2RealViz: A visual analytics tool to explore the sim2real gap

We introduce Sim2RealViz, an interactive visual analytics tool designed to assist domain experts in conducting in-depth analyses of the performance gaps between simulation and real environments of models whose primary task is ego-localization. The tool is implemented in JavaScript with the D3 library Bostock et al. (2011) to run in modern browsers, and it directly interacts with models implemented in PyTorch Paszke et al. (2019). The tool and its source code are available as an open-source project.

4.1 Tasks analysis

Sim2RealViz supports the following tasks we identified through literature review and discussions with experts in Robotics:

  1. Fine-grained assessment of the model performance gap between SIM and REAL: the choice of the best-performing sim2real transfer method (e.g., fine-tuning, domain randomization, etc.) and the optimization of its hyper-parameters. This requires experts to study a large number of predictions in SIM and REAL from a large number of observed images and to evaluate performance distributions over different environment conditions and factors of variation.

  2. Identification of the source of the performance gap: what are the factors of variation in the environment, agent, or trained model which are responsible for the performance gap? This is inherently difficult, as the sources may be due to the global environment (differences in, e.g., lighting or 3D scanning performance), the agent (e.g., differences in camera focal length or height), or changes due to the time span between scanning and physical deployment (e.g., furniture changes). In addition, some gaps may also be beyond human comprehension, such as adversarial noise. For a human designer, it may not be immediately clear which differences will have the largest impact on prediction performance.

  3. Closing the sim2real gaps: successful knowledge transfer requires the simulator to be as close as possible to the real-world scenario with respect to the factors of variation identified in task 2. The goal is to close the loop and increase prediction performance using the insights gained from using Sim2RealViz.

4.2 Design rationale

Our design is centered around the comparison of simulation instances and real-world ones. As we deal with complex objects, and because sim2real gaps can emerge from various sources, we implemented several views with different comparison strategies Gleicher et al. (2011). Sim2RealViz follows the overview+detail interface scheme Cockburn et al. (2009), ranging from the most global views (left) to the most specific ones (right). To ease the comparison, simulation and real-world data are displayed next to each other within each view, with, if possible, simulation on the left side and real-world on the right side. The objective of the Statistics view (Fig. 1 ①) is to help in quickly identifying the performance of a model and to grasp global behavior with simple visualizations. The Geo-map (Fig. 1 ②) is key in providing context on the instance predictions, and for users to grasp what factors of variation may cause sim2real gaps. Finally, the Instance view (Fig. 1 ③) displays how models may perceive sim2real gaps under different scopes. To encode the main information related to the gap, we used three colors corresponding to either sim, real, or gt across the visualizations. We also used color to encode the distance between two sets of coordinates, or the intensity of the models’ attention towards parts of input images, using a continuous turbo Mikhailov (2019) color scale, commonly used by experts, to emphasize the most critical instances, i.e. those with high values.

Figure 3: Conversion from pixels on a first-person point-of-view image to coordinates on a bird’s eye geo-map (left). Such a process, used in Sim2RealViz to display global heatmaps (right) on the geo-map, relies on ground-truth, image, and camera information. To optimize their computation, geo-maps are discretized into squares larger than a pixel, as a trade-off between the accuracy of projections and the responsiveness of feedback to the user.

4.3 Main-stream workflow

We now describe a typical workflow for using Sim2RealViz:

  1. Models are pre-loaded and their overall performances on both sim and real are displayed on the top left of Sim2RealViz (Fig. 1 ①).

  2. After model selection, users can start a fine-grained performance analysis of sim and real models by observing global statistics views. These comprise a UMAP projection of embeddings in which each dot is an instance and its color encodes how far it is from its counterpart (sim or real), a radial bar chart of predicted or ground-truth orientation, and a distribution of positions in which each room is a bar whose height corresponds to the number of predictions it contains. In any of those views, users can select a set of instances to be inspected (e.g., a cluster in UMAP).

  3. Any such selection updates a geo-map (Fig. 1 ②), i.e. a “geometric” bird’s eye view, in which users can inspect the predictions at a finer scale. Users can switch the geo-map to either color-mode, which only displays ground-truth positions with their colors indicating how far apart sim and real predictions are, or full-mode, which displays sim predictions, ground-truth positions, and real predictions. Instances can be selected for further inspection by hovering over them with the mouse cursor.

  4. An instance selection updates the instance view (Fig. 1 ③) and displays heatmaps, which highlight the portions of images on which the model focuses most, or which it perceives as different. Such a heatmap is also back-projected over the geo-map to highlight the portions of the environment which most likely carry sim2real gaps (Fig. 1 ④).

The views in Sim2RealViz are multi-coordinated, i.e. any of them, including the geo-map, can be used as an entry point to formulate complex queries such as “which instances perform poorly in simulation but well in the real world while being in a selected region of the environment?”. Concretely, those combinations of selections can be done using set operations (union, intersection, and complement), which can be selected through interactions with the corresponding views. This is further emphasized by the fact that the performance differences between simulation and real world are also color-encoded on the geo-map.
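The example query above can be sketched with plain set operations. This is only an illustration of the selection logic, not the tool's implementation: the instance fields (`sim_err`, `real_err`, `x`), the error threshold, and the region bounds are hypothetical.

```python
def select(instances, predicate):
    """Return the ids of the instances that satisfy `predicate`."""
    return {inst["id"] for inst in instances if predicate(inst)}


# Hypothetical instances: prediction errors in sim/real and an x position.
instances = [
    {"id": 0, "sim_err": 0.2, "real_err": 1.9, "x": 1.0},
    {"id": 1, "sim_err": 1.5, "real_err": 0.3, "x": 2.0},
    {"id": 2, "sim_err": 1.7, "real_err": 0.2, "x": 8.0},
    {"id": 3, "sim_err": 0.1, "real_err": 0.2, "x": 2.5},
]
all_ids = {inst["id"] for inst in instances}

poor_in_sim = select(instances, lambda inst: inst["sim_err"] > 1.0)
poor_in_real = select(instances, lambda inst: inst["real_err"] > 1.0)
in_region = select(instances, lambda inst: 0.0 <= inst["x"] <= 5.0)

# Complement, then intersections: poor in sim, good in real, inside region.
good_in_real = all_ids - poor_in_real
query = poor_in_sim & good_in_real & in_region

# Union: instances that perform poorly in at least one domain.
poor_anywhere = poor_in_sim | poor_in_real
```

Each view contributes one set, so any view can start the query and the others refine it.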

4.4 Heatmaps

To facilitate the inspection of the sim2real gap through image comparisons, Sim2RealViz provides heatmaps superimposed over images, from a selected instance, to draw user attention towards key portions of inputs extracted by the trained model (Fig. 1 ③). Feature-wise visualizations are essential, as visual differences between simulated and real-world images perceived by humans may not correspond to differences in features with a high impact on model decisions. Fig. 1 ③ illustrates the result of three approaches to generate those heatmaps, as follows (from top to bottom):

Regression activation mapping. Inspired by grad-CAM Selvaraju et al. (2017) and RAM Wang and Yang (2017), we design heatmaps to highlight regions of the input that have a high impact on model prediction. For each forward pass of a model, we collect feature maps from the last CNN layer and multiply them by the weights of the last FC layer, obtaining an overlay of the size of the feature map, which is then re-scaled to fit the input image and normalized to fit a turbo color scale (Sec. 4.2). Similarity of activation maps between two similar sim and real images suggests a similarity of the two input images from the perspective of the model’s reasoning.
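The computation described above can be sketched in a few lines. The sketch assumes one FC weight per feature channel (as when global pooling sits between the last CNN layer and the FC layer) and uses nearest-neighbor resizing; the function names and the toy values are hypothetical.

```python
def activation_map(feature_maps, fc_weights):
    """Weighted sum over channels: one value per spatial cell."""
    channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(fc_weights[c] * feature_maps[c][r][col] for c in range(channels))
         for col in range(width)]
        for r in range(height)
    ]


def upscale_nearest(grid, out_height, out_width):
    """Nearest-neighbor resize of the small map to the input image size."""
    in_height, in_width = len(grid), len(grid[0])
    return [
        [grid[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


def normalize(grid):
    """Min-max scale values to [0, 1] so they can index a color scale."""
    values = [v for row in grid for v in row]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero on a flat map
    return [[(v - low) / span for v in row] for row in grid]


# Toy example: 2 channels of 2x2 features, weights for one regressed output.
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 2.0]]]
weights = [1.0, 0.25]
heatmap = normalize(upscale_nearest(activation_map(features, weights), 4, 4))
```

In the toy example, the top-left region receives the highest value and the bottom-right region half of it, which is what a color scale would then render.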

Sim/Real occlusion. Occlusion sensitivity Zeiler and Fergus (2014) is a common method to visualize how neural networks in computer vision rely on some portions of their input images. It consists of applying gray patches over an image, forwarding it to a model, and observing the impact on the model’s prediction. By sampling a set of patches, we can then overlay the input with this information, with blue indicating that the occluded prediction is closer to the original ground truth, and red otherwise.

In our case, the intuition and solution are slightly different from the standard case. We are interested in areas of the input image where the model performance is improved when the real-world observation is replaced by simulated information, indicating a strong sim2real gap. We therefore occlude input REAL images with RGB or Depth patches from the corresponding simulated image. The size of the patches is governed by a Heisenberg-like uncertainty trade-off between localization performance and measurement power, and we selected the patch size that proved most suitable for analyzing images under this trade-off. A further advantage of this approach is the possibility to discriminate between gaps in RGB or Depth input.
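A minimal sketch of this sim/real occlusion follows, under simplifying assumptions: images are single-channel 2D lists, the model is any callable returning an (x, y) prediction, and patches tile the image without overlap. The names `paste_patch`, `sim_occlusion_map`, and `toy_model` are hypothetical.

```python
import math


def paste_patch(real, sim, top, left, size):
    """Copy of `real` in which one square patch is replaced by `sim` pixels."""
    patched = [row[:] for row in real]
    for r in range(top, min(top + size, len(real))):
        for c in range(left, min(left + size, len(real[0]))):
            patched[r][c] = sim[r][c]
    return patched


def sim_occlusion_map(model, real, sim, ground_truth, size):
    """Per-patch error reduction obtained by pasting sim content into real.

    Positive scores mark patches where simulated content brings the
    prediction closer to the ground truth, i.e. a likely sim2real gap.
    """
    base_error = math.dist(model(real), ground_truth)
    scores = []
    for top in range(0, len(real), size):
        row = []
        for left in range(0, len(real[0]), size):
            patched = paste_patch(real, sim, top, left, size)
            row.append(base_error - math.dist(model(patched), ground_truth))
        scores.append(row)
    return scores


def toy_model(image):
    """Stand-in for a trained network: x follows the mean pixel value."""
    pixels = [v for row in image for v in row]
    return (sum(pixels) / len(pixels), 0.0)


sim_image = [[1.0] * 4 for _ in range(4)]
real_image = [row[:] for row in sim_image]
for r in range(2):
    for c in range(2):
        real_image[r][c] = 0.0  # the real image differs in the top-left corner

scores = sim_occlusion_map(toy_model, real_image, sim_image, (1.0, 0.0), size=2)
```

Running the same loop once with RGB patches and once with depth patches would separate the two kinds of gaps, as noted above.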

Feature map distance. Another approach implemented in Sim2RealViz is to gather the feature map of the last CNN layer during a forward pass on both the simulated image and its corresponding real-world image, and then compute a distance between them. The result is a matrix with the size of the feature map, which is then overlaid like the activation mapping. After some iterations, we opted for the product of the cosine distance, which favors small changes, and the L1 distance, which is more inclined to produce small spots. Such a product offers a trade-off between highlighting every change, at the risk of over-plotting, and focusing on only one specific spot, at the risk of losing relevant changes.
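The product of the two distances can be sketched as below. We assume here that the distance is computed per spatial cell, across the channel vectors of the two feature maps; this granularity, the epsilon term, and the function names are assumptions made for the illustration.

```python
import math


def cell_distance(u, v, eps=1e-8):
    """Product of cosine distance and L1 distance between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine_distance = 1.0 - dot / (norms + eps)
    l1_distance = sum(abs(a - b) for a, b in zip(u, v))
    return cosine_distance * l1_distance


def feature_distance_map(sim_features, real_features):
    """One distance per spatial cell of two channel-first feature maps."""
    channels = len(sim_features)
    height, width = len(sim_features[0]), len(sim_features[0][0])
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            u = [sim_features[k][r][c] for k in range(channels)]
            v = [real_features[k][r][c] for k in range(channels)]
            row.append(cell_distance(u, v))
        result.append(row)
    return result


# Toy 2-channel, 1x2 feature maps: the first cell matches, the second differs.
sim_features = [[[1.0, 1.0]], [[0.0, 0.0]]]
real_features = [[[1.0, 0.0]], [[0.0, 1.0]]]
distances = feature_distance_map(sim_features, real_features)
```

Because the two distances are multiplied, a cell is highlighted only when both the direction and the magnitude of its features change, which is the trade-off described above.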

4.5 Contextualization on the global geo-map

As illustrated in Fig. 1 ④ and in Fig. 3, information from the individual first-person 3D projective input images, including heatmaps, can be projected into the global bird’s eye view, and thus overlaid over the geo-map. This is possible thanks to ground-truth information, i.e. the coordinates and orientation of the instance, combined with information on the calibrated onboard cameras (simulated and real) themselves, such as their field of view, position on the robot, resolution, and the range of the depth sensor. To do so, the environment is discretized into blocks initially filled with zeroes, and images are downsampled. Each image cell is converted into (x, y) coordinates, and its average heatmap value is added to the environment block closest to those coordinates. Finally, the values of environment blocks are normalized to fit the turbo color scale and then displayed as an overlay on the geo-map. This process can also be applied to the complete dataset at once to provide an overview of sim2real gaps of the environment as perceived by a model. Fig. 3 shows the conversion of heatmaps from the complete real-world dataset to a geo-map overlay using different aggregation strategies. This overlay can be displayed using the make-heat button from the geo-map view (Fig. 1 ②).
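The projection can be sketched as follows, under strong simplifications that are ours: the image is reduced to a single row of cells (heat and depth already averaged per column), depth is treated as the range along the viewing ray, and the camera sits at the robot position. All names and parameter values are hypothetical.

```python
import math


def project_heatmap(grid, block_size, pose, depth_row, heat_row,
                    fov_deg, max_range):
    """Accumulate one row of heat values into the bird's-eye `grid` in place.

    `pose` is the ground-truth (x, y, heading in radians) of the instance.
    """
    x0, y0, heading = pose
    fov = math.radians(fov_deg)
    n_cols = len(depth_row)
    for col, (depth, heat) in enumerate(zip(depth_row, heat_row)):
        if depth <= 0 or depth > max_range:
            continue  # outside the usable range of the depth sensor
        # Bearing of the cell center: left-most column at +fov/2.
        bearing = fov / 2 - (col + 0.5) / n_cols * fov
        x = x0 + depth * math.cos(heading + bearing)
        y = y0 + depth * math.sin(heading + bearing)
        row_idx, col_idx = int(y // block_size), int(x // block_size)
        if 0 <= row_idx < len(grid) and 0 <= col_idx < len(grid[0]):
            grid[row_idx][col_idx] += heat


def normalize_grid(grid):
    """Scale block values to [0, 1] for display with a color scale."""
    peak = max(max(row) for row in grid)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in grid]


grid = [[0.0] * 4 for _ in range(4)]
# Robot at (0.5, 0.5) looking along +x; one cell sees a surface 2 m ahead.
project_heatmap(grid, 1.0, (0.5, 0.5, 0.0), [2.0], [3.0], 90, 5.0)
# A reading beyond the sensor range is ignored.
project_heatmap(grid, 1.0, (0.5, 0.5, 0.0), [10.0], [7.0], 90, 5.0)
overlay = normalize_grid(grid)
```

Calling `project_heatmap` for every instance of a dataset before normalizing gives the kind of global overlay described above; larger blocks make the accumulation cheaper at the cost of spatial precision.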

Figure 4: By clicking on the adjust button (Fig. 1 ③), users can display sliders that can be used to fine-tune real-world images with filters and observe how this affects models’ predictions.

4.6 Exploration of input configurations

To check the impact of simulation gaps due to global imaging parameters, Sim2RealViz provides ways to adjust real-world images through filters such as brightness, contrast, temperature, and dynamic range of depth. Any adjustment on a selected instance updates the corresponding prediction in real time. Once a set of adjustments is validated by the user, it can be saved, applied to the whole real-world dataset, and treated as a new model in the model gaps overview (Fig. 1 ①) for further analysis. The configuration of inputs can also be used on images from simulation, to analyze the performance of the model under specific Domain Randomization configurations. Of course, for changes to the simulation to have an impact on real-world predictions, the models need to be retrained; however, adjusting simulation images in Sim2RealViz can help in manually extracting parameters of the real-world camera, assisting in the configuration of the simulator. Producing images with direct feedback should reduce the workload usually required to configure simulators and real-world robots.
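Two of these filters can be sketched as simple per-pixel operations. The definitions below are assumptions chosen for the illustration (brightness as a multiplicative gain, contrast as a scaling around mid-gray, pixel values in [0, 1]); the filters used in the tool may be defined differently.

```python
def clip(value):
    """Keep a pixel value inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))


def adjust(image, brightness=0.0, contrast=0.0):
    """Apply brightness then contrast to a 2D image with values in [0, 1].

    brightness=0.15 multiplies every pixel by 1.15; contrast=0.2 stretches
    values away from mid-gray (0.5) by 20%.
    """
    return [
        [clip((v * (1.0 + brightness) - 0.5) * (1.0 + contrast) + 0.5)
         for v in row]
        for row in image
    ]


def predict_with_filters(model, image, **filters):
    """Re-run a model (any callable on images) on the adjusted image."""
    return model(adjust(image, **filters))


image = [[0.4, 1.0]]
brighter = adjust(image, brightness=0.15)
unchanged = adjust(image)
mean_after = predict_with_filters(
    lambda img: sum(img[0]) / len(img[0]), image, brightness=0.15)
```

Applying the same validated settings to every real-world image yields the adjusted dataset that is then treated as a new model entry.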

Figure 5: By using the full encoding, we can observe that most real-world predictions are located in the first half of the environment ①. Hence, instances sampled from the left half of the environment provide the worst predictions. However, when we slightly increase the brightness of each real-world image, we can observe that instances are more evenly distributed over the environment ②.

5 Case studies and closing the loop

We report on illustrative case studies we conducted to demonstrate how Sim2RealViz can be used to provide insights on how different neural models may be influenced by sim2real gaps. During these experiments, Sim2RealViz is loaded with the following methods for sim2real transfer: vanilla (i.e. no transfer, deployment as-is), dataAug (i.e. with domain randomization over brightness, contrast, dynamic range, and hue), and fine-tuning on real-world images. We use visual data extracted from two different trajectories of the physical Locobot agent in the real environment, performed several months apart and at different times of the day, which provides a diverse range of sim2real gaps and favors generalization. Those models and data are available in our GitHub repository.

Insights on sim2real gaps grasped using Sim2RealViz can be leveraged from two different perspectives echoing current sim2real transfer approaches. First, similar to Domain Adaptation, we can apply global modifications to the REAL images (e.g., brightness), which can be set up as filters and used in Sim2RealViz. Second, related to Domain Randomization, we can modify the simulator settings and train a new model on the modified simulation. In what follows, we describe several types of sim2real gaps which have been identified and partially addressed in our experiments.

Unveiling biases in predictions. Once the tool is loaded, users can observe how models trained and transferred with different methods perform on simulated and real-world data, as shown in Fig. 1 ①. We report that the best real-world performance is reached using dataAug, with an average accuracy of 84%, rather than fine-tuning, with an average accuracy of 80%. This performance is evaluated on traj#1, whereas traj#2 had been used for fine-tuning on real-world data, ensuring generalization over experimental conditions in the environment. In what follows, we focus on the dataAug model, which a user can further analyze by clicking on its corresponding diamond (Fig. 1 ①). To assess what the worst real-world predictions are, users can apply the min filter, which on this data displays instances sampled from the left corridor regardless of the model used. We conjecture that corridors are among the most difficult instances, as they are quite large and lack discriminative landmarks. However, we can also observe that the right-side corridor is among the most successful predictions. By selecting those instances and hovering over them, we can inspect activation heatmaps and observe that model attention is driven towards the limit between a wooden path (unique to the right corridor) and a wall. The model thus seems to have learned to discriminate between corridors, which suggests that the confusion between them may be due to other reasons. By switching the encoding on the geo-map to full, i.e. with sim, real, and gt positions displayed (Fig. 5 ①), we observe that the predicted positions of the vanilla model lie in the right half of the environment. This is a very strong bias, which means that most real-world predictions for instances on the left are incorrect despite being accurate in simulation. A similar phenomenon is also observed for the dataAug model, for which instances in the left corridor yield predictions pointing to the middle of the environment, which is also an unreachable area.

Closing the loop with filters. We verify the hypothesis of regression to the mean, which is often an “easy” shortcut solution for a model in the absence of regularities in the data, or when regularities are not learned. We perform global adjustments of the imaging parameters of the real-world images as described in Sec. 4.6. In particular, setting both RGB and depth inputs to zero leads to the same constant predictions in the middle of the environment, corroborating the hypothesis. While adjusting the brightness filter, we noticed that the darker the left-corridor images were, the further to the right side of the environment the predictions moved. Conversely, when images were made 15% brighter with a filter, the predictions of the vanilla model reached the left side of the environment, leading to a slight improvement in overall performance of 1.5% (Fig. 5 ②).
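The two manipulations above can be illustrated with a minimal sketch. It is not the filter implementation of Sim2RealViz (Sec. 4.6 describes the actual filters); the function names `adjust_brightness` and `zero_input`, the nested-list image layout, and the pixel values are assumptions made for illustration only.

```python
def adjust_brightness(image, factor):
    """Scale every channel of every pixel by `factor`, clamped to [0, 255]."""
    return [
        [tuple(min(255, max(0, round(c * factor))) for c in pixel) for pixel in row]
        for row in image
    ]


def zero_input(image):
    """Replace every channel value by zero, keeping the image shape."""
    return [[tuple(0 for _ in pixel) for pixel in row] for row in image]


# A tiny 2x2 RGB image standing in for a real-world frame (hypothetical values).
frame = [
    [(100, 120, 140), (200, 220, 240)],
    [(0, 20, 40), (60, 80, 100)],
]

brighter = adjust_brightness(frame, 1.15)  # 15% brighter, as in the case study
blank = zero_input(frame)                  # all-zero input for the constant-prediction check
```

Feeding `blank` to a model and obtaining the same prediction everywhere is the kind of check that corroborates a regression-to-the-mean hypothesis; a depth map could be zeroed in the same way.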

Figure 6: With global heatmaps of feature-distance, we can observe (in red) areas of the environment that may be affected by a sim2real gap. Those areas correspond to changes in objects present in the simulation, for instance, as illustrated here, a fire-extinguisher. After removing such objects from the simulation and retraining a model on the resulting data, we can observe that these objects no longer appear among the most highlighted areas.

Sim2real gaps due to layout changes. Trained models deployed to real environments need to be resilient to dynamic layout changes such as opened or closed doors, the presence of humans, or moved furniture and other objects. In Sim2RealViz, this can be investigated using the global heatmap with feat-dist, in which some areas of the environment noticeably stand out (red in Fig. 6). By hovering over instances on the geo-map which have those areas in view, and browsing their corresponding images, we can observe that those areas are triggered by different factors. For instance, in Fig. 6 ②, the highlighted area corresponds to the presence of a bike in the simulated data, which was not present when the real-world data was captured. Other areas correspond to changed furniture and to imperfections of the simulation when rendering, for instance, a fire-extinguisher (Fig. 6 ①). Such behavior, which can be observed across models, may benefit from specific attention while addressing sim2real transfer.

Editing the simulator. In order to test such a hypothesis, we edited the simulator and removed the objects corresponding to two red areas. This new simulation is then used to generate new data aligned with the real world. Using these data, we trained a new dataAug model and loaded its predictions in Sim2RealViz to evaluate the influence of those changes on real-world performance. Fig. 6 shows the global feat-dist heatmap on trajectory#1 created with the new model, taking into account the changes. We can see that the areas with the most significant differences are more uniformly distributed over the environment. Since global heatmaps are normalized over the complete dataset, this indicates that, to the model, those areas are now closer in feature space.
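The effect of dataset-wide normalization can be seen in a small sketch. The exact feat-dist computation and normalization used by Sim2RealViz are not restated here, so the Euclidean distance, the min-max normalization, and all numeric values below are assumptions chosen for illustration.

```python
import math


def feature_distance(sim_feat, real_feat):
    """Euclidean distance between a simulated and a real feature vector (assumed metric)."""
    return math.dist(sim_feat, real_feat)


def normalize(values):
    """Min-max normalize distances over the complete dataset to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-area distances. Before the edit, one area (index 2,
# e.g. a missing object) dominates the scale and flattens everything else.
before = [1.0, 2.0, 10.0, 1.5]
# After removing the object from the simulator, its distance drops.
after = [1.0, 2.0, 2.5, 1.5]

heat_before = normalize(before)  # index 2 is 1.0, all others stay below 0.12
heat_after = normalize(after)    # values now spread over the whole [0, 1] range
```

Because the scale is shared across the dataset, removing a single dominant area lets the remaining differences occupy the full color range, which is consistent with the more uniform heatmap described above.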

The experiments described above have led to a better understanding of the sim2real gap of our baseline agent, and we reported more robust localization performance once these insights were leveraged, either by modifying the simulator or by learning filters for the existing model. We hope that Sim2RealViz will be adopted and will facilitate the design and training of robotic agents.

6 Conclusion and perspectives

We introduced Sim2RealViz, an interactive visual analytics tool designed to perform an in-depth analysis of the emergence of sim2real gaps from neural networks applied to robot ego-localization. Sim2RealViz supports both overview and comparison of the performance of different neural models, whose instances can be browsed based on metrics such as performance or distribution. Those metrics can be combined using set operations to formulate more elaborate queries. We also reported scenarios of use of Sim2RealViz to investigate how models are inclined to exploit biases, such as regression to the mean, and are easily disturbed by layout changes, such as moved objects.
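As a rough illustration of combining metrics with set operations, the sketch below treats each filter result as a set of instance identifiers. The identifiers and the names `worst_real` and `left_corridor` are hypothetical and do not reflect the tool's internal data structures.

```python
# Hypothetical instance ids returned by two independent filters.
worst_real = {3, 7, 12, 18}      # e.g. instances kept by a "min" performance filter
left_corridor = {7, 8, 12, 30}   # e.g. instances located in the left corridor

# Intersection: poorly predicted instances that lie in the left corridor.
both = worst_real & left_corridor

# Union: instances matched by at least one filter.
either = worst_real | left_corridor

# Difference: poorly predicted instances outside the left corridor.
only_worst = worst_real - left_corridor
```

Each result is again a set, so queries can be chained to narrow or widen a selection step by step.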

We plan to submit an extended version of this article to the journal special issue of the workshop. It will include a deeper motivation of the design of Sim2RealViz, advanced implementation challenges, additional scenarios of use of the tool, and insights that experts found using it.

We acknowledge support from French Agency ANR through AI-chair grant “Remember” (ANR-20-CHIA-0018).


  • M. Bostock, V. Ogievetsky, and J. Heer (2011) D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). Cited by: §4.
  • K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE international conference on robotics and automation (ICRA), pp. 4243–4250. Cited by: §3.
  • K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3722–3731. Cited by: §3.
  • E. Brachmann, M. Humenberger, C. Rother, and T. Sattler (2021) On the limits of pseudo ground truth in visual camera re-localisation. arXiv preprint arXiv:2109.00524. Cited by: §2.
  • S. Carter, Z. Armstrong, L. Schubert, I. Johnson, and C. Olah (2019) Activation atlas. Distill. Cited by: §3.
  • A. Chadha and Y. Andreopoulos (2019) Improved techniques for adversarial discriminative domain adaptation. IEEE Transactions on Image Processing 29, pp. 2622–2637. Cited by: §1, §3.
  • A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: Learning from RGB-D Data in Indoor Environments. International Conference on 3D Vision (3DV). Cited by: §2.
  • A. Cockburn, A. Karlson, and B. B. Bederson (2009) A review of overview+ detail, zooming, and focus+ context interfaces. ACM Computing Surveys (CSUR) 41 (1), pp. 1–31. Cited by: §4.2.
  • A. D. Dragan, K. C. Lee, and S. S. Srinivasa (2013) Legibility and predictability of robot motion. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 301–308. Cited by: §1.
  • K. Fang, Y. Bai, S. Hinterstoisser, S. Savarese, and M. Kalakrishnan (2018) Multi-task domain adaptation for deep learning of instance grasping from simulation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3516–3523. Cited by: §3.
  • M. Gleicher, D. Albers, R. Walker, I. Jusufi, C. D. Hansen, and J. C. Roberts (2011) Visual comparison for information visualization. Information Visualization 10 (4), pp. 289–309. ISSN 1473-8716, 1473-8724. Cited by: §4.2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  • F. M. Hohman, M. Kahng, R. Pienta, and D. H. Chau (2019) Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers. IEEE Transactions on Visualization and Computer Graphics. ISSN 1077-2626. Cited by: §1, §3.
  • F. Hohman, H. Park, C. Robinson, and D. H. Chau (2020) Summit: scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE Transactions on Visualization and Computer Graphics (TVCG). Cited by: §3.
  • S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis (2019) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12627–12637. Cited by: §3.
  • T. Jaunet, R. Vuillemot, and C. Wolf (2020) DRLViz: understanding decisions and memory in deep reinforcement learning. Computer Graphics Forum (EuroVis). Cited by: §3.
  • A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra (2020) Sim2Real predictivity: does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters 5 (4), pp. 6670–6677. Cited by: §3.
  • A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946. Cited by: §2.
  • Z. C. Lipton (2018) The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16 (3), pp. 31–57. Cited by: §1.
  • M. Liu, S. Liu, H. Su, K. Cao, and J. Zhu (2018) Analyzing the noise robustness of deep neural networks. In 2018 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 60–71. Cited by: §1.
  • M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu (2016) Towards better analysis of deep convolutional neural networks. IEEE transactions on visualization and computer graphics 23 (1), pp. 91–100. Cited by: §3.
  • LoCoBot: An Open Source Low Cost Robot. Cited by: §2.
  • Y. Ma, A. Fan, J. He, A. Nelakurthi, and R. Maciejewski (2020) A visual analytics framework for explaining and diagnosing transfer learning processes. IEEE Transactions on Visualization and Computer Graphics. Cited by: §3.
  • B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull (2020) Active domain randomization. In Conference on Robot Learning, pp. 1162–1176. Cited by: §1.
  • A. Mikhailov (2019) External Links: Link Cited by: §4.2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1.
  • C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev (2018) The building blocks of interpretability. Distill. Cited by: §3.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §4.
  • A. Prakash, S. Boochoon, M. Brophy, D. Acuna, E. Cameracci, G. State, O. Shapira, and S. Birchfield (2019) Structured domain randomization: bridging the reality gap by context-aware synthetic data. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7249–7255. Cited by: §3.
  • A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell (2017) Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning, pp. 262–270. Cited by: §3.
  • M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Cited by: §2, §3.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.4.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for Simplicity: The All Convolutional Net. Cited by: §3.
  • H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush (2017) Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE transactions on visualization and computer graphics 24 (1), pp. 667–676. Cited by: §1.
  • R. Szabó, D. Katona, M. Csillag, A. Csiszárik, and D. Varga (2020) Visualizing transfer learning. arXiv preprint arXiv:2007.07628. Cited by: §3.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. Cited by: §1, §3.
  • J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield (2018) Training deep networks with synthetic data: bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Cited by: §3.
  • E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and T. Darrell (2020) Adapting deep visuomotor representations with weak pairwise constraints. In Algorithmic Foundations of Robotics XII, pp. 688–703. Cited by: §3.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7167–7176. Cited by: §1, §3, §3.
  • Z. Wang and J. Yang (2017) Diabetic retinopathy detection via deep convolutional networks for discriminative localization and visual explanation. arXiv preprint arXiv:1703.10757. Cited by: §4.4.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §3, §4.4.
  • F. Zhu, L. Zhu, and Y. Yang (2019) Sim-real joint reinforcement transfer for 3d indoor navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11388–11397. Cited by: §3.