Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation

by   Marco Rosano, et al.
University of Catania

Navigating complex indoor environments requires a deep understanding of the space in which the robotic agent acts, to correctly inform the navigation process towards the goal location. In recent learning-based navigation approaches, the scene understanding and navigation abilities of the agent are achieved simultaneously by collecting the required experience in simulation. Unfortunately, even though simulators represent an efficient tool to train navigation policies, the resulting models often fail when transferred to the real world. One possible solution is to provide the navigation model with mid-level visual representations containing important domain-invariant properties of the scene. But which representations best facilitate the transfer of a model to the real world? And how can they be combined? In this work we address these issues by proposing a benchmark of Deep Learning architectures that combine a range of mid-level visual representations to perform a PointGoal navigation task, following a Reinforcement Learning setup. All the proposed navigation models have been trained with the Habitat simulator on a synthetic office environment and have been tested in the corresponding real-world environment using a real robotic platform. To efficiently assess their performance in a real context, we propose a validation tool that generates realistic navigation episodes inside the simulator. Our experiments show that navigation models can benefit from the multi-modal input and that our validation tool can provide a good estimate of the expected navigation performance in the real world, while saving time and resources. The acquired synthetic and real 3D models of the environment, together with the code of our validation tool built on top of Habitat, are publicly available at the following link:






1 Introduction

Creating a robot able to navigate autonomously inside an indoor environment, relying solely on egocentric visual observations captured by its on-board camera, is a challenging but attractive research goal. Indeed, such input imaging devices are becoming increasingly affordable while being able to capture rich information about the surrounding space, in the form of RGB images, which are processed by Deep Learning (DL) models to extract useful properties of the environment (e.g., presence of objects, people, free space, type of room, depth, etc.) [obj_det_survey, sem_seg_survey, taskonomy2018] and to perform operations in real-world scenarios [bonin2008visual, rescue_robots]. Visual navigation approaches have been successfully applied when the goal to be reached is specified as coordinates [habitat19iccv], images [zhu2017target], object categories [chaplot2020object], room types [roomNav2020] or language instructions [chen2011language_nav], showing how DL models can be exploited to obtain effective navigation policies, given their ability to learn directly from data and generalize to unseen environments. In particular, Deep Reinforcement Learning (DRL) showed that it is possible to learn a navigation policy without densely labeling all training examples, allowing a robotic agent to collect the required knowledge by performing navigation episodes inside a photo-realistic simulator, following a trial-and-error setup [zhu2017target, habitat19iccv]. Although simulated environments are increasingly photo-realistic, models trained in simulation struggle to effectively transfer their abilities to real spaces, due to two main factors: 1) the visual difference between virtual and real observations (domain shift); 2) the difference in robot dynamics between the simulated and the real world (i.e., real sensor measurements and robot movements are noisy and subject to failures).
Thus, several domain adaptation techniques have been proposed to address the virtual-real visual gap problem [wang2018da_survey]. These methods apply pixel- or feature-level transformations to the input images to reduce the gap between the two domains. In the case of pixel-level transformations, the goal is to translate images from the source domain to the target domain in order to make them visually indistinguishable; in the case of feature-level transformations, a visual encoder, usually a Convolutional Neural Network (CNN), is trained to reproject the representation vectors of images belonging to the two domains into the same compact subspace, so that domain-level differences between representations are minimized.
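As an illustration of the feature-level strategy, the sketch below computes a linear-kernel maximum mean discrepancy between feature batches from the two domains; driving such a term to zero is one common alignment objective. All names, shapes and values are illustrative, not taken from the cited methods.

```python
import numpy as np

def linear_mmd(feat_src, feat_tgt):
    """Squared distance between the mean embeddings of two feature batches.

    A minimal stand-in for a feature-level alignment objective: minimizing
    it pushes an encoder to map virtual and real observations into the
    same region of the feature space.
    """
    delta = feat_src.mean(axis=0) - feat_tgt.mean(axis=0)
    return float(delta @ delta)

# Toy example: two batches of 128-D features, one per domain.
rng = np.random.default_rng(0)
sim_feats = rng.normal(0.0, 1.0, size=(32, 128))
real_feats = rng.normal(0.5, 1.0, size=(32, 128))
gap = linear_mmd(sim_feats, real_feats)
```

In practice this term would be added to the task loss and backpropagated through the encoder; adversarial objectives play an analogous role in GAN-based variants.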

Other approaches [taskonomy2018] aim to extract more explicit scene information from RGB observations, such as depth or surface normals, so that the resulting geometric or semantic representations are invariant with respect to the domain. This results in more robust models that can be employed to perform effective downstream tasks such as visual navigation. While the effectiveness of visual navigation approaches has been extensively proved through evaluations in virtual environments [Wijmans2020DD-PPO:, gupta_cognitive], physical evaluation in the real world remains difficult to carry out, mainly due to time and resource constraints (e.g., in terms of the human supervision required to perform the experiments) and the fragile nature of the robotic platforms. Indeed, hardware components (motors, batteries, etc.) are subject to wear and failures, whereas collisions with obstacles and bumpy rides can easily harm the integrity of the robotic platform. These limitations represent a real obstacle to carrying out extensive evaluation processes. Mid-level representations of images such as surface normals, keypoints and depth maps have proved useful to improve the performance of navigation models [navigateMidlevel2018] and to reduce the visual domain gap [robustPoliciesChen2020]. Furthermore, depending on what the agent observes during a navigation episode, some perception abilities could be more useful than others to successfully accomplish the navigation task. Despite these reasonable intuitions, a systematic evaluation of their use to facilitate the transfer of a policy learned in simulation to the real world is still missing.

Figure 1: Illustration of the proposed visual navigation training, evaluation and test setup. Training and evaluation are performed leveraging a locomotion simulator on two separate 3D models of the environment. During training (first row, in light blue), a geometrically accurate 3D model of the environment is used to learn an optimal navigation policy. To evaluate the navigation policy (second row, in orange), a photo-realistic 3D model of the environment is used to provide a good estimate of the expected navigation performance in the real world. In both cases, a set of mid-level representations is extracted from the RGB observations and combined by one of the proposed modality fusion architectures to perform the navigation task. The navigation models are then tested by directly deploying them on a real robot to perform real-world navigation episodes (third row, in green).

In this work, we investigate whether mid-level representations of the input RGB images can improve the transfer of a policy learned in simulation to real data. Specifically, we exploit the models presented in [taskonomy2018] to extract mid-level representations from RGB observations and consider a variety of deep learning architectures for visual navigation which perform early, mid and late fusion of the extracted mid-level representations. Our models are trained to perform visual navigation in a DRL setting, which offers the opportunity to efficiently train navigation policies entirely in simulated environments, thanks to recent advances in simulation tools [habitat19iccv, xiazamirhe2018gibsonenv] and RL algorithms [schulman2017ppo, Wijmans2020DD-PPO:]. This avoids the need to collect experience in a real scenario using a physical robotic platform, which is prohibitively expensive in most circumstances. We show how the number of mid-level representations used, the type of geometric information they capture and the adopted model architecture contribute to the improvement of navigation performance. We observed that a more elaborate fusion strategy, together with an increasing number of mid-level representations, can lead to superior performance in terms of robustness and reliability, especially on complex navigation episodes, when compared to navigation models that rely on a smaller set of mid-level representations or employ a more naive fusion mechanism.

We train our models using the Habitat simulator [habitat19iccv] on a synthetic version of a real environment. We then test the ability of the learned policy to transfer to real-world navigation performed with a real robot. To facilitate the evaluation in a realistic context, we also propose an evaluation tool built on top of Habitat, which is able to simulate realistic trajectories by leveraging observations collected in the real space. A complete overview of the proposed framework is depicted in Figure 1. More specifically, our tool makes it possible to evaluate navigation models in a simulated environment while providing real-world observations, without having to deploy a robotic agent in the physical world. The learned navigation policy is then tested in the real environment using a custom robotic platform equipped with accurate actuators. To prove that the navigation policy learned in simulation can transfer to the real world, we tested all our visual navigation models on a set of real navigation episodes characterized by different levels of difficulty. We show that, overall, all the proposed visual navigation models based on mid-level representations can be successfully deployed in the real context, albeit with different navigation capabilities. Moreover, the realistic evaluation tool can provide a good estimate of the expected navigation performance in the real world, with the advantage of not requiring the physical deployment of the robot during the development of the navigation algorithm. This drastically reduces evaluation time, making it possible to systematically evaluate a large number of navigation models without human supervision and without the risk of damaging the robotic platform or the deployment environment.

In summary, the contributions of this work are as follows:

  1. we investigate a variety of learning-based multi-modal fusion architectures to exploit information from a set of mid-level representations, to facilitate the transfer of the navigation policy trained in the virtual environment to the real twin;

  2. we show to what extent the number of mid-level representations used, the type of information contained in the considered representations and the adopted model architecture affect the final navigation policies and their transferability from the virtual to the real space. Intuitively, a more elaborate fusion strategy, together with an increasing number of mid-level representations, can lead to superior models in terms of performance and adaptation;

  3. we propose an evaluation tool which is able to simulate realistic trajectories, leveraging a set of observations collected in a real space;

  4. we show the effectiveness of the proposed evaluation tool by comparing its estimates with the performance measured in real navigation episodes using a robotic platform. The tool allows a fast and inexpensive assessment of a model's capabilities and represents a good proxy for real-world performance.

The remainder of the paper is organized as follows. Section 2 discusses the related works. In Section 3, we describe the proposed approach. The experimental settings are discussed in Section 4, whereas the results are presented in Section 5. Section 6 concludes the paper and outlines future work.

2 Related Works

Different works on visual navigation have faced the problem of assessing navigation policies in real settings [robothor, lichaplot2020unsupervised] but, despite their effort, the number of executed navigation trajectories is still limited by time and budget constraints. Some works approximated real-world complexity by constructing a grid of real images and performing their evaluations in simulation [zhu2017target, rosano2020comparison]. Despite the efficiency of these approaches, this results in an oversimplified agent-environment interaction and, consequently, in an unreliable estimation of real-world performance. In our work, we aim to keep the real-world estimation process inside the simulator, leveraging a dense set of geolocalized images collected in the real world, to take advantage of the efficient execution of navigation episodes in simulated environments while avoiding the downsides of a highly discretized grid-world.

Our work relates to approaches belonging to a range of topics, including simulators for visual navigation, embodied visual navigation and simulated-to-real domain adaptation. We report the most relevant connections between our work and the state of the art in the subsections below.

2.1 Embodied Navigation Simulators

The development of advanced simulators [savva2017minos, house3d, xiazamirhe2018gibsonenv, nvidia2021isaac] used in conjunction with realistic large-scale 3D indoor datasets laid the foundations for the design of learning-based navigation models, which can learn the desired behaviour through realistic interactions with the scene. To foster research on robotic agents that perform increasingly complex tasks, more recent works [newhabitat2021, gibson2.0] released highly interactive environments comprising a large number of active objects, within which the agent can experience a wide variety of realistic interactions. Embodied simulators are designed to be used with third-party 3D datasets, which have different characteristics and are designed following different approaches. The 3D spaces proposed in [Matterport3D, xiazamirhe2018gibsonenv, dai2017scannet] reproduce real-world indoor rooms and have been acquired using special 3D scanners. This allows for the collection of a large number of photo-realistic 3D environments at a relatively low cost, but the final 3D reconstruction may present holes or artifacts due to imperfect scans. In contrast, the 3D models in [replica19arxiv, openRooms2021, kolve2017ai2, newhabitat2021] are replicas of realistic indoor spaces, accurately designed by artists. The survey of Möller et al. [moller] contains a detailed section on state-of-the-art datasets and simulators for robot navigation.

To facilitate the assessment of navigation performance in real-world, the authors of [robothor] released a set of 3D virtual environments for training purposes and allowed researchers to physically test the obtained navigation models on the real equivalents through a remote deployment application.

In this work, we aim to train a navigation model in simulation on a virtual environment and test its ability to transfer to the same real space. In addition, our framework allows for a good estimation of the real-world performance avoiding the physical execution of the navigation episodes. Our work builds on top of an existing simulator [habitat19iccv] and extends its capability to efficiently test navigation policies on real observations.

2.2 Embodied Visual Navigation

The problem of robot visual navigation has been studied for decades by the research community [bonin2008visual, thrun2002probabilistic]. In its classic formulation, the navigation process can be thought of as a composition of sub-problems: 1) construction of the map of the environment; 2) localization inside the map; 3) path-planning to the goal position; 4) navigation policy execution. The environmental map can be provided beforehand or reconstructed with a Structure from Motion (SfM) pipeline [colmap] using a set of images of the space. Localization is then performed by comparing new observations with the previously collected data. In SLAM-based methods [cadena2016past, fuentes2015visual], map reconstruction and localization are performed at the same time. Navigation is then performed after a path to the goal is computed. These methods have been implemented in several scenarios, but they present significant limitations, such as scalability to large environments, accumulation of localization error and robustness to dynamic scenarios. Recently, learning-based visual navigation approaches emerged as effective alternatives to the classic navigation pipelines, promising to learn navigation policies in an end-to-end way, receiving images as input and returning actions as output, avoiding all the intermediate steps [zhu2017target, mirowski2016learning]. Depending on the type of goal to be reached, several deep learning models have been proposed, performing ObjectGoal [chaplot2020object, morad2021embodied] or RoomGoal [roomNav2020] navigation, instruction following [chen2011language_nav, krantz2020navgraph, anderson2018vision, fried2018speaker], or question answering [das2018embodied, gordon2018iqa], which requires the agent to navigate to the appropriate location in order to provide the correct answer about a property of the environment.

When goals are specified as coordinates (PointGoal) or observations of the environment (ImageGoal), the task is referred to as geometric navigation, given the requirement for the navigation model to reason about the geometry of the 3D space in order to accomplish the task. Recent geometric navigation approaches investigated the use of a variety of learning architectures: Zhu et al. [zhu2017target] used reactive feed-forward networks for ImageGoal navigation; Savva et al. [habitat19iccv] included a recurrent module to embed past experience, enforcing the sequential nature of navigation; Wijmans et al. [Wijmans2020DD-PPO:] improved the scalability of the model, collecting billions of frames of experience. Chen et al. [chen2020soundspaces] introduced the use of sound together with images to reason about the surrounding space and to guide the agent towards the goal. Chaplot et al. [chaplot2020learning] and Chen et al. [chen2018slam] used spatial memories and planning modules, whereas Savinov et al. [savinov2018semiparametric] and Chaplot et al. [chaplot2020topological] used topological memories to represent the environment.

The methods investigated in this paper fall in the class of geometric navigation approaches, where the model needs to reach a goal specified as coordinates. Similarly to the method proposed by Savva et al. [habitat19iccv], we trained an RL-based navigation model consisting of both convolutional and recurrent modules, which at each step returns the ID of the action to be executed, sampled from a set of possible discrete actions. However, our approach differs in the type of input it receives and in the final goal of the task. Motivated by the promising results obtained by Sax et al. [navigateMidlevel2018] and Chen et al. [robustPoliciesChen2020], who showed the generalization ability offered by mid-level representations [taskonomy2018] in transferring navigation capabilities to unseen environments, we propose a benchmark to investigate smart multi-modal fusion strategies to obtain optimal navigation policies that can successfully transfer to the real world. Shen et al. [situational2019] proposed to learn a mid-level representation fusion model to extract meaningful semantic cues from the virtual 3D space, to guide the agent towards a specific object in a highly discretized grid-world. In contrast, our approach aims at learning how to leverage the correct combination of geometric cues that better transfer to the real world, considering a continuous state space and a goal specified as coordinates.

2.3 Simulated to Real Domain Adaptation

Domain adaptation methods aim to reduce the domain gap between virtual and real observations by learning domain-invariant image representations, in order to transfer action policies learned in simulation to the real world. The proposed simulators for visual navigation generally address this issue following two strategies: by making the environment highly photo-realistic or by randomizing the properties of the virtual environment. In the former, the goal is to make the simulation appear as similar to the real world as possible. In the latter, the idea is to expose the model to a highly dynamic environment to avoid overfitting to a specific style and to allow for a style-agnostic representation of the space. Domain randomization was successfully applied to robotic grasping [james2019grasp], drone control [Loquercio2020DeepDR, sadeghi2016cad2rl] and vision-and-language navigation [vln-pano2real]. Other approaches considered training on synthetic data and then fine-tuning on real observations [grasp2017finetune, rosano2020navigation], when real-world data is available beforehand. Real observations can also be employed to perform adaptation at feature level [adda, kouw2016feature], pixel level [hu2018duplex, cyclegan] or both [hoffman2018cycada]. More recently, different works were proposed specifically for sim2real transfer of visuomotor policies. For instance, Li et al. [lichaplot2020unsupervised] proposed a GAN-based model to decouple style and content of visual observations and introduced a consistency loss term to enforce a style-invariant image representation. Rao et al. [rao2020rl] introduced an RL-aware consistency term to help preserve task-relevant features during image translation. The authors of [truong2021bidirectional] followed instead a bi-directional strategy, using a CycleGAN-based [cyclegan] real2sim adaptation model for the visual observations and a sim2real adaptation module for the physical dynamics.
Rather than using adaptation modules to reduce the sim-real gap, we followed the idea of Sax et al. [navigateMidlevel2018] and Chen et al. [robustPoliciesChen2020] and trained our navigation policy on top of mid-level representations, which contain crucial geometric or semantic cues of the environment and are invariant to the application domain. This allows a direct deployment of the navigation model trained in simulation to the real world. We take advantage of these properties and focus our work on the search for optimal modality fusion strategies.

Figure 2: Examples of mid-level representations. The first column shows the input RGB images. Columns two to five show: surface normals, keypoints3D, curvature and depth. Each mid-level representation captures a different property of the observed scene.

3 Method

The goal of our approach is twofold: 1) enable a robotic agent to learn, entirely in simulation, an optimal navigation policy that successfully transfers to the real world. This is achieved by leveraging a set of mid-level visual representations [taskonomy2018] extracted from the RGB observations collected by the agent during the navigation task, which capture a range of different properties of the scene. To this end, we benchmark several deep learning architectures that learn how to adaptively weight the contribution of each representation at every navigation step, depending on the perception of the agent; 2) provide a simple tool to reliably estimate the expected performance of the navigation models in the real world, by running realistic navigation episodes entirely in simulation. The proposed training and evaluation framework builds on top of the Habitat simulator [habitat19iccv] and involves the acquisition of two 3D models of the environment, which are aligned and used to perform both virtual and realistic navigation episodes. The navigation performance is also validated by conducting real-world experiments, using a real robotic platform to assess the usefulness of the proposed tool. In the following, we first describe the problem setup and the proposed modality fusion strategies, then provide details about the data acquisition process and the adopted policy evaluation protocol.

3.1 Problem Setup

We consider the problem of PointGoal visual navigation in indoor environments. In this context, an agent equipped with an RGB camera is placed at a random location of the environment and is required to navigate towards the goal coordinates, relying solely on visual observations to reason about the surrounding space and execute the best possible actions. No information about the layout of the environment is provided to the agent. At each timestep, the agent receives an RGB observation that is processed by a set of transformation models from [taskonomy2018] to output a list of mid-level representations. Figure 2 shows examples of mid-level representations obtained from the respective RGB images and the different scene properties they are able to capture. These representations are then passed to a fusion module, which learns how to combine them to produce a final compact vector containing the most meaningful information about the agent's current observation. Our navigation policy is parametrized by a neural network which, given the visual representations and information about the goal to reach, outputs the action to perform at the current timestep. This process is repeated until the goal is reached or a given step budget is exhausted.

The navigation models were trained entirely in simulation following an RL setup. In RL, the agent performs actions inside the virtual environment and collects rewards or penalties (negative rewards), depending on whether the actions reduced the distance to the goal or not. The objective of the training process is to find an optimal navigation policy that allows the agent to reach the goal along the shortest path by maximizing the sum of the collected rewards.
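A reward of this kind is typically shaped as the per-step reduction in distance to the goal, plus a small time penalty and a terminal success bonus. The sketch below illustrates the idea; the function name and constants are illustrative, not the values used in this work.

```python
def pointgoal_reward(prev_dist, curr_dist, reached_goal,
                     success_reward=2.5, slack_penalty=-0.01):
    """Dense shaped reward commonly used for PointGoal training.

    The agent is rewarded for reducing its (geodesic) distance to the
    goal, pays a small time penalty at every step, and receives a
    terminal bonus on success.
    """
    reward = slack_penalty + (prev_dist - curr_dist)
    if reached_goal:
        reward += success_reward
    return reward

# A step that brings the agent 0.25 m closer to the goal:
r = pointgoal_reward(prev_dist=3.00, curr_dist=2.75, reached_goal=False)
# A final step that reaches the goal:
r_final = pointgoal_reward(prev_dist=0.30, curr_dist=0.10, reached_goal=True)
```

The slack penalty discourages aimless wandering, while the distance term provides a dense signal that makes the sparse success event learnable.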

Figure 3: Popular multi-modality fusion strategies. a) In Early-fusion models, the input modalities are combined beforehand (usually concatenated along the channel dimension) and the unified representation is then provided as input to the Deep Learning model. b) In Mid-fusion models, each modality is processed by a separate encoder that outputs an intermediate embedding vector. All the vectors are then combined by a fusion module to produce the model's final output. c) The Late-fusion model, instead, is an ensemble of separate models, each processing a single modality and producing a distinct output. A fusion module then collects the outputs of the models to return the final decision. As depicted in the figure, the outputs of the models are denoted by the blue circles and usually represent probability distributions over the discrete set of actions.

3.2 Mid-level Representations Fusion

The idea of combining different visual representations to improve the navigation abilities of an agent was first explored in [navigateMidlevel2018], and different works on visual navigation have later followed this approach [mousavian, morad2021embodied]. In these investigations, fusion mechanisms have often been limited to a simple stacking of the different input representations. The authors of [situational2019] explored more advanced fusion schemes in the context of ObjectGoal navigation, but their experiments were conducted in a poorly realistic setup (a discretized grid-world) and did not consider the deployment of the learned models in the real world. The intuition behind using a range of different perception abilities comes from the agent's need to capture the most meaningful properties of the surrounding area, which can correctly inform the decision-making module to perform the best possible action. While moving, the agent traverses sections of the environment with different characteristics (e.g., narrow corridors, open-space rooms) containing a variety of obstacles: large furniture (e.g., cabinets, drawers); small pieces of furniture (e.g., coffee tables), which can be harder to perceive given their size; chairs and tables with thin legs; and carpets, which can also represent an obstacle for the agent. This suggests that an adaptive perception-ability selection module could help avoid the agent's premature failures, leveraging the best visual cues at every navigation step.
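One possible mechanism for such adaptive weighting is channel-level gating in the spirit of squeeze-and-excitation: globally pool each feature map, pass the result through a small bottleneck MLP, and rescale each channel by a learned gate. The numpy sketch below is purely illustrative (names, shapes and random weights are assumptions, not the paper's architecture).

```python
import numpy as np

def channel_attention(feature_maps, w1, w2):
    """Squeeze-and-excitation style reweighting of feature maps.

    feature_maps: (C, H, W) activations after a conv layer.
    w1, w2: weights of the bottleneck MLP in the attention branch,
            with shapes (C//r, C) and (C, C//r); r is the reduction ratio.
    """
    squeezed = feature_maps.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates in (0, 1)
    return feature_maps * gates[:, None, None]       # rescale each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
maps = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
out = channel_attention(maps, w1, w2)
```

Because the gates depend on the current observation, channels that carry the most useful cues for the scene at hand can be emphasized step by step.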

In this work, we leverage the deep models proposed in [taskonomy2018], trained on a large dataset of RGB images to capture a variety of different geometric and semantic properties. To investigate the benefits that visual representation fusion strategies can offer to models performing PointGoal navigation, we propose a variety of deep convolutional networks performing early, mid and late fusion. Figure 3 shows an overview of the employed fusion schemes. More specifically, we considered five deep architectures:

  • a classic convolutional model performing early fusion of the mid-level representations. It represents the simplest combination strategy and can be considered a baseline for more elaborate fusion models (Figure 3a);

  • two convolutional models with a channel-level attention mechanism. This architecture represents a variant of the Early-fusion model depicted in Figure 3a, which performs a weighting of the feature maps after every convolutional layer, similarly to what is done in [squeeze_excitation]. Assuming that different feature maps contain different properties of the input observation, this architecture offers the chance to learn to focus on the most relevant ones. The two models differ in the type of pooling used in the attention branches;

  • a Mid-fusion model (Figure 3b), which processes each mid-level representation in a dedicated convolutional branch and then condenses their outputs in the final shared layers. This architecture offers the chance to specialize portions of the network to exploit the visual cues contained in specific mid-level representations;

  • a Late-fusion model (Figure 3c), which represents an ensemble of networks, each trained separately on a single mid-level representation. Each network outputs a probability distribution over actions, and a final policy fusion module aggregates them to select the final action, based on a context summary representation.

More detailed information about the architectures of the proposed fusion models is reported in Section 4.2.
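The three schemes above can be contrasted at the level of tensor shapes. The numpy sketch below uses toy stand-ins (flattened maps instead of convolutional encoders, random logits instead of trained policies); all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four single-channel 16x16 mid-level maps (toy stand-ins).
modalities = {name: rng.normal(size=(1, 16, 16))
              for name in ("normals", "keypoints3d", "curvature", "depth")}
n_actions = 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# a) Early fusion: stack along the channel axis, then one shared encoder.
early_input = np.concatenate(list(modalities.values()), axis=0)   # (4, 16, 16)

# b) Mid fusion: one encoder per modality, then concatenate the embeddings.
embeddings = [m.reshape(-1)[:32] for m in modalities.values()]    # fake 32-D codes
mid_embedding = np.concatenate(embeddings)                        # (128,)

# c) Late fusion: each per-modality policy emits an action distribution,
#    and a fusion module (here, a plain average) picks the final action.
per_policy = [softmax(rng.normal(size=n_actions)) for _ in modalities]
late_probs = np.mean(per_policy, axis=0)
action = int(np.argmax(late_probs))
```

The trade-off is visible even at this level: early fusion shares all parameters across modalities, mid fusion dedicates an encoder to each, and late fusion only interacts at the action-distribution level.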

3.3 3D Datasets Acquisition and Orientation

Our training and evaluation tool requires the acquisition of two 3D models of the same environment: a geometrically accurate 3D model that can be acquired using a 3D scanner, such as Matterport 3D, and a photo-realistic 3D model reconstructed from a set of real-world observations, using a Structure from Motion (SfM) algorithm [colmap]. The first model is an accurate replica of the real environment but with limited photorealism. The scanning process returns a 3D mesh that can be natively imported inside Habitat [habitat19iccv] and used to train the navigation policy. On the contrary, the second model is a sparse photo-realistic but geometrically inaccurate reconstruction of the environment. The SfM process returns a 3D pointcloud in which all images are labeled with their camera pose (position and orientation). Figure 4 compares the two 3D models. It is worth noting that this model can not be directly used in the simulator and a dedicated interface was developed as part of our tool to allow its employment inside the simulation platform. Because the two 3D models are acquired separately using two different approaches, they might present a scale and a rotation offset, that should be minimized by following a maps alignment procedure. One possible solution is to manually search the parameters of the affine transformation, that is then applied to one or both 3D models to match the coordinates of the other 3D model. To make this process automatic, we leveraged an image-based alignment procedure111We used the model_aligner function of the COLMAP software to transform the coordinate system of the real-world 3D model to match the one of a set of observations sampled from the virtual 3D model. To this end, we used the Habitat simulator to collect images from random locations together with their camera pose. 
Although the images belong to two different 3D models and their appearance does not match perfectly, the alignment procedure turned out to be robust against visual differences and successfully recovered the coordinate system transformation.
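At its core, this alignment amounts to estimating a similarity transform (scale, rotation, translation) between corresponding camera positions in the two models. The following sketch uses the classic Umeyama least-squares fit, which is the kind of estimate COLMAP's model_aligner computes internally; the function name and interface are ours, not COLMAP's:

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate scale s, rotation R, translation t such that
    dst ≈ s * R @ src + t, in the least-squares sense (Umeyama fit).
    src, dst: (N, 3) arrays of corresponding camera positions."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)           # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                           # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The recovered transform can then be applied to every camera pose (and point) of the real-world reconstruction to express it in the virtual model's coordinate system.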

3.4 Generation of Realistic Navigation Episodes in Simulation

Once the 3D models are aligned, they can be exploited to generate realistic navigation episodes in simulation. At first, the navigation trajectory is generated by the simulator on top of the virtual 3D model. Then, the virtual agent performs the navigation task and, at each step, the perceived virtual observation is systematically replaced with the real-world image closest in space to the current agent position. In more detail, at each step the current pose of the agent is extracted from the simulator and used to retrieve the nearest real image from the real-world 3D model. The retrieved real observation is then processed by the visual navigation module to output the action to take in the virtual environment. This process is repeated until the end of the navigation episode. Considering that the agent moves on the floor surface and that its camera does not change its height nor its pitch and roll angles, the 6DoF camera poses were transformed to 3DoF coordinates, with the first two degrees of freedom representing the X and Z cartesian coordinates on the ground plane and the third representing the camera orientation as the angle around the Y axis, perpendicular to the XZ plane. This transformation simplifies the subsequent image retrieval process, which consists of two steps: 1) filter the images by angle and subsequently 2) filter the resulting subset by coordinates. Because the image retrieval time is crucial to perform a fast policy evaluation, we leveraged the efficiency of the FAISS library [FAISS] to perform a fast search over thousands of records in a fraction of a second. We transformed each camera heading angle θ into a unit vector v = (cos θ, sin θ) and calculated the angle difference as the cosine similarity between the corresponding vectors. From our experiments we found that a similarity threshold of 0.96 ensures good results. After filtering the real-world images by angle, we apply a second filter to the resulting subset of images based on the X-Z coordinates. Finally, the nearest image is chosen to replace the virtual observation. Thus, the navigation episode is performed in simulation but the policy depends on the real-world observations.
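The two-step retrieval can be sketched as follows (NumPy for clarity; the paper relies on FAISS for the fast large-scale search; function and variable names are ours):

```python
import numpy as np

def retrieve_nearest_image(agent_pose, db_xz, db_yaw, sim_thresh=0.96):
    """Two-step retrieval of the closest real-world image.
    agent_pose: (x, z, yaw) of the simulated agent (yaw in radians).
    db_xz: (N, 2) X-Z positions of the real images.
    db_yaw: (N,) heading angles of the real images.
    Returns the index of the selected image, or None if no image
    passes the heading filter."""
    x, z, yaw = agent_pose
    # 1) heading filter: cosine similarity between unit heading vectors
    v_agent = np.array([np.cos(yaw), np.sin(yaw)])
    v_db = np.stack([np.cos(db_yaw), np.sin(db_yaw)], axis=1)
    mask = v_db @ v_agent >= sim_thresh
    if not mask.any():
        return None
    # 2) position filter: nearest X-Z neighbour among the survivors
    idx = np.flatnonzero(mask)
    d2 = ((db_xz[idx] - np.array([x, z])) ** 2).sum(axis=1)
    return int(idx[np.argmin(d2)])
```

In the actual tool, step 2 would be served by a FAISS index (e.g. an exact L2 index over the X-Z coordinates) instead of the brute-force NumPy search shown here.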

Figure 4: A view of the virtual (left) and real-world (right) 3D models of the considered office environment. The virtual model is geometrically accurate and allows for the sampling of images from any position but it is limited in terms of photo-realism. The real-world model is a sparser collection of localized real-world images (each red marker represents the position of an image).

4 Experimental Settings

4.1 Dataset Acquisition

We carried out our experiments in an office environment; Figure 5 shows the floor plan of the considered space. The virtual 3D model was acquired using a Matterport 3D scanner and the resulting 3D mesh was imported inside the Habitat simulator to perform the training of the navigation policies. Instead, the real-world 3D model was reconstructed for testing purposes using the COLMAP [colmap] software, starting from a set of ~32k RGB images of the environment, collected using a robotic platform equipped with a Realsense d435i camera. This resulted in a sparse 3D pointcloud where each image is labeled with its camera pose relative to the 3D reconstruction. To capture the real-world images, the agent followed a simple exploration policy aimed at covering all the traversable space as uniformly as possible, proceeding along straight trajectories, stopping and turning around by a random angle to avoid collisions and continue the acquisition. The real-world image set was acquired in about 3.5 hours at 3fps, with a robot's maximum speed of 0.25m/s. As already mentioned in Section 3.3, the real-world model was aligned to match the coordinate system of the virtual model. For this purpose, an "alignment set" of 6k images was randomly sampled from the virtual environment together with the relative camera poses. These images were registered inside the real-world 3D model using COLMAP and then used by the image-based alignment function to compute the final coordinate system alignment.

Figure 5: Top view of the office environment considered in our experiments.
Figure 6: Overview of the proposed visual navigation models, which follow distinct mid-level representation fusion strategies. All the models are comprised of two parts: 1) a visual encoder and 2) a controller. The visual encoder (red background) is responsible for effectively combining the different mid-level representations provided as input to produce a meaningful vector embedding of the scene. The controller (light blue background) takes this embedding as input, together with additional information about the coordinates of the navigation goal and the previously performed action, to output the action to take and the estimated "quality" of the current state with respect to reaching the destination. The LSTM layers allow the model to embed the history of the navigation episode at each timestep, given the sequential nature of the task. Each model was decomposed into modules, which are detailed on the top right of the figure. See the text for the discussion of the different models.

4.2 Proposed Navigation Models

In this work, we propose the use of mid-level representations as visual input for our navigation models mainly for two reasons: 1) each representation is able to capture different properties of the environment, so that they can be selectively exploited depending on what the agent is experiencing during the navigation episode; 2) they are robust to domain shift, which translates into navigation policies that can be trained on synthetic observations and deployed to the real world without performing further domain adaptation. To investigate the benefits offered by a multi-modal visual input, we performed experiments considering a variety of mid-level representations, varying the number of input modalities and following different fusion strategies. In our experiments we leveraged four mid-level models from [taskonomy2018] to extract four mid-level representations: surface normals, 3D keypoints, curvature, and depth map. Examples of these representations can be observed in Figure 2. We considered these representations because they are able to capture different geometric properties of the environment, which is ideal given that in our navigation setup the goal is specified as coordinates in space and that the task requires geometric reasoning. Each of these models receives an RGB image and outputs a compact tensor. We found that this compact representation provides the navigation model with the information required to successfully perform the downstream task. Moreover, it allows the design of compact navigation models, which results in faster training and easier deployment on real robotic platforms, whose computational resources are usually limited. As highlighted in Section 3.2, we proposed several deep fusion models performing modality fusion at different levels. Each model consists of two blocks: a visual encoder, which processes the mid-level representations received as input through convolutional layers, and a controller, which receives the features processed by the visual encoder, in addition to further useful data, to output the navigation policy. We used the same controller for all the proposed navigation models but different visual encoding architectures. Specifically, we proposed five navigation models implementing five distinct visual encoders, also depicted in Figure 6:

  • the "Simple" model consists of two convolutional layers with 3x3 kernels and 64 and 128 intermediate feature maps, respectively. After each convolutional layer we introduced a GroupNorm normalization layer, to take into account the highly correlated data in the batch, and a ReLU activation function. The aggregation module collects all the considered mid-level representations and stacks them along their feature dimension to produce a unified representation of the agent's observation. This representation is then provided as input to the model. Figure 6a) illustrates the architecture of the model;

  • the "Squeeze-and-Excitation" (SE) model introduces a feature-level attention module after each convolutional layer, including the input layer, to weight the different feature maps depending on their content. Each attention module consists of a global pooling layer, two fully connected (FC) layers with a ReLU activation function in-between, and a final sigmoid activation function which returns the weights used to perform the feature map-level attention. We tested two variants of the same model, one with a global average pooling and the other with a global max pooling in the attention module. We refer to them as "SE attention (avg pool) model" and "SE attention (max pool) model" respectively. Figure 6b) depicts the aforementioned model;

  • the "Mid-fusion" model consists of a set of parallel visual encoders, one per input representation. In this architecture, each encoder has the chance to focus on a single mid-level representation and the final output of the model is given by the combination of the intermediate outputs of the various visual branches. In practice, each branch is a replica of the visual encoder of the "Simple" model, which produces a compressed visual representation as output. These "intermediate" representations coming out of all the branches are subsequently concatenated along their channel dimension to form the final visual representation. We also considered more advanced combination strategies but, in our experiments, simple concatenation returned the best results. Figure 6c) presents a scheme of the model;

  • the "Late-fusion" model differs from the previous models because it consists of a set of full navigation models (visual encoder + controller) trained independently on single, distinct mid-level representations. At each navigation step, the models output action candidates, which are combined depending on the current agent's perception to produce the final action. More specifically, each navigation model outputs a probability distribution over a discrete set of actions that the agent can perform, and an additional policy fusion module is responsible for adaptively weighting the individual models' outputs to obtain the final action probability. The policy fusion module is a replica of the visual encoder of the "Simple" model, which takes a stack of the considered mid-level representations as input and outputs the weights (a probability distribution over the number of models) to balance the contribution of every navigation model to the final output. Figure 6d) summarizes the entire architecture.

    Given N the number of considered models, A the number of actions in the discrete action set, M ∈ R^(N×A) the matrix containing the models' candidate action distributions (one per row) and w ∈ R^N the output of the policy fusion module, the final action probability distribution p ∈ R^A is equal to:

    p = w⊤M,  i.e.  p_j = Σ_{i=1}^{N} w_i M_{ij}.
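This late-fusion combination is a convex combination of the per-model action distributions; a minimal NumPy sketch (function name is ours):

```python
import numpy as np

def fuse_policies(action_probs, fusion_weights):
    """Combine the action distributions of N independently trained
    policies into one distribution.
    action_probs: (N, A) matrix, one action distribution per model.
    fusion_weights: (N,) distribution over models, produced by the
    policy fusion module from the stacked mid-level representations.
    Returns the (A,) fused action distribution."""
    return fusion_weights @ action_probs  # convex combination of rows
```

Because the weights and each row of the matrix sum to one, the fused output is itself a valid probability distribution over the action set.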

The controller consists of two LSTM layers [lstm] which take as input the visual representation coming from the visual encoder, the action produced by the navigation model at the previous timestep, and the information about the goal coordinates relative to the robot's current position, and output an action probability distribution together with a value representing the "quality" of the current agent's location given the goal to be reached (i.e. it is an Actor-Critic RL model [actor-critic]). The information about the previous action and the goal coordinates are both projected to two separate vectors of size 32 and then concatenated to the output of the visual encoder, which is a 512-d vector, to produce the final 576-d vector that is fed to the controller. The use of recurrent layers helps the model deal with the sequential nature of the navigation task.
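A minimal PyTorch sketch of such an actor-critic controller, under the dimensions quoted above (512-d visual embedding, 32-d projections of the previous action and of the goal); the 2-D goal encoding, the LSTM hidden size and the class interface are our assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Actor-critic controller: 512-d visual embedding + 32-d previous
    action + 32-d goal = 576-d input to a 2-layer LSTM."""
    def __init__(self, n_actions=4, hidden=512):
        super().__init__()
        self.prev_action_emb = nn.Embedding(n_actions + 1, 32)  # +1: "no action yet"
        self.goal_fc = nn.Linear(2, 32)            # goal as 2-D relative coords (assumed)
        self.lstm = nn.LSTM(512 + 32 + 32, hidden, num_layers=2)
        self.actor = nn.Linear(hidden, n_actions)  # action logits
        self.critic = nn.Linear(hidden, 1)         # state value

    def forward(self, visual_emb, prev_action, goal, hidden_state=None):
        x = torch.cat([visual_emb,
                       self.prev_action_emb(prev_action),
                       self.goal_fc(goal)], dim=-1)
        out, hidden_state = self.lstm(x.unsqueeze(0), hidden_state)
        out = out.squeeze(0)
        return self.actor(out), self.critic(out), hidden_state
```

The recurrent hidden state is carried across timesteps of an episode, which is how the model embeds the navigation history.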

4.3 Training Details and Evaluation

All the proposed navigation models have been trained on the synthetic version of the considered office environment, following the setup of [Wijmans2020DD-PPO:]. We used the Habitat simulator [habitat19iccv] to sample a set of 100k virtual navigation episodes beforehand, which were then used to train each navigation model for 5 million frames. This threshold was set experimentally, as it allowed the agents to collect the required experience and to converge to optimal training metrics.

The navigation models have been trained with one, two, three and four mid-level representations as input, following an incremental setup. To train a model with n modalities, we keep the n-1 representations that led to the best performing model and add each of the remaining ones, one at a time. As a result, we have trained and evaluated a total of 34 different navigation models.
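The incremental setup above is a greedy selection over modality sets; it can be sketched as follows, where train_and_eval is a stand-in for training a model on a given modality set and returning its validation score (e.g. SPL):

```python
def greedy_modality_selection(all_modalities, train_and_eval):
    """Greedy incremental selection of input modalities.
    At each round, extend the current best set with each remaining
    modality, train/evaluate, and keep the best extension.
    Returns the sequence of selected sets with their scores."""
    selected, history = [], []
    while len(selected) < len(all_modalities):
        candidates = [m for m in all_modalities if m not in selected]
        scored = [(train_and_eval(selected + [m]), m) for m in candidates]
        best_score, best_m = max(scored)
        selected = selected + [best_m]
        history.append((tuple(selected), best_score))
    return history
```

Note that each round trains one model per remaining modality, which is how the 34-model total arises across the proposed architectures.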

In order to speed up the training procedure we leveraged the DD-PPO architecture proposed in [Wijmans2020DD-PPO:], a distributed variant of the popular PPO reinforcement learning algorithm [schulman2017ppo], which allows multiple agents to be trained in parallel on one or multiple GPUs. Additionally, we adopted an input caching system that doubled the training speed (from 80fps to 160fps for a model receiving two modalities as input, trained with 4 parallel processes per GPU on two Nvidia Titan X GPUs).

We formulated the task as PointGoal navigation, where the agent is asked to navigate towards a destination provided as a coordinate in the environment. At each timestep the agent chooses one of the four possible actions available: move forward by a fixed distance, turn left or turn right by a fixed angle, or STOP. The navigation episode ends when the STOP action is performed or when the maximum number of execution steps is reached. Given the size of our environment, we fixed this threshold to 200 steps.
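The resulting episode protocol can be sketched generically as a step-budgeted loop (the action ids, function names and interfaces are ours, not the Habitat API):

```python
# Hypothetical discrete action ids for the PointGoal setup.
STOP, FORWARD, LEFT, RIGHT = 0, 1, 2, 3

def run_episode(policy_step, env_step, max_steps=200):
    """Generic PointGoal episode loop: query the policy, execute the
    action, stop on STOP or when the step budget is exhausted.
    policy_step(obs) -> action; env_step(action) -> next observation
    (env_step(None) returns the initial observation).
    Returns the number of steps taken."""
    obs = env_step(None)
    for t in range(max_steps):
        action = policy_step(obs)
        if action == STOP:
            return t + 1
        obs = env_step(action)
    return max_steps
```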

Our visual navigation models have been evaluated according to two standard metrics used to measure the performance of agents acting in indoor spaces: Success Rate (SR) and Success weighted by (normalized inverse) Path Length (SPL). The SR measures the effectiveness of the navigation policy at reaching the goal. It is defined as the ratio between the number of successful navigation episodes and the total number of performed episodes:

SR = (1/N) Σ_{i=1}^{N} S_i

where N is the total number of performed episodes and S_i is a boolean value indicating the success of the i-th episode. The SPL takes into account the path followed by the agent and can be thought of as a measure of the efficiency of the navigation model with respect to a perfect agent following the shortest geodesic path to the goal. It is defined as:

SPL = (1/N) Σ_{i=1}^{N} S_i · l_i / max(p_i, l_i)

where N is the total number of performed episodes, S_i is a boolean value indicating the success of the i-th episode, l_i is the shortest geodesic path length from the starting position to the goal position of the i-th episode and p_i is the agent's path length in the i-th episode. In case of a perfectly executed navigation episode, the per-episode term assumes the value of 1. On the contrary, if the navigation policy fails, it assumes the value of 0. The navigation models have been evaluated in simulation on 1000 episodes, each defined by a starting and a goal position. Episodes have been sampled beforehand taking into account their complexity. Indeed, to avoid excessively simple navigation episodes, they have been filtered to ensure that the ratio between the geodesic distance and the euclidean distance from the starting position to the goal is greater than a fixed threshold, as already suggested in [habitat19iccv]. An episode is considered successful if the agent calls the STOP action within a fixed distance of the goal, and unsuccessful otherwise. For the evaluation we used the same step budget used during training.
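The two metrics are direct transcriptions of their definitions; a short sketch:

```python
def success_rate(successes):
    """SR: fraction of successful episodes.
    successes: list of 0/1 flags, one per episode."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by (normalized inverse) Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where l_i is the
    shortest geodesic path length and p_i the agent's path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, agent_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

The max(p_i, l_i) in the denominator guards against agents that (due to localization noise) report a path shorter than the geodesic optimum, capping the per-episode term at 1.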

Figure 7: The custom robotic platform used to test the navigation policies in the real world. It is equipped with a Realsense d435i camera to perceive the environment, and with wheel encoders and accurate actuators to perform precise movements.
Figure 8: The real-world navigation episodes considered in our testing setup. Each blue spot represents a starting/goal position, with the blue arrows indicating the starting heading directions of the robot. The trajectories are highlighted by a set of dashed curves having different colors. The difficulty of the navigation episodes varies depending on their length and on the presence of obstacles along the path, which require advanced reasoning ability to successfully reach the destination.

4.4 Baseline Navigation Models

We compared the proposed models with three baselines, which share the same architecture but receive different types of RGB images as input. This architecture consists of: 1) an SE [squeeze_excitation]-ResNeXt50 [resnext], a larger visual encoder compared to the ones used to process the mid-level representations, suitable for processing the lower-level information contained in the input images; 2) a controller, identical to the one used in the proposed mid-level models. We considered the model pretrained on the Gibson [xiazamirhe2018gibsonenv] and Matterport3D [Matterport3D] datasets for 2.5 billion steps, released by [Wijmans2020DD-PPO:], which was then adapted to our office environment. Specifically, the baseline models are as follows:

  • the "RGB Synthetic" model was trained on the synthetic observations coming from the proposed virtual environment, for 5 million steps. It is considered to assess to what extent a navigation model trained purely in the virtual domain can be transferred directly to the real world, without further adaptation or supervision;

  • the "RGB Synthetic + Real" model was trained for 2.5 million steps on the synthetic observations and then fine-tuned for another 2.5 million steps on real-world images, which belong to a separate real-world 3D model of the same environment, counting ~25K geo-referenced images. As with the real-world 3D model used for evaluation, this model was aligned to the virtual 3D model before being used for training. This navigation model shows how using observations of the target domain during training can improve navigation performance, even though collecting and exploiting such observations is often expensive or unfeasible. We expect this model to reach near-optimal navigation performance;

  • the "RGB Synthetic + CycleGAN" [cyclegan] model was trained for 2.5 million steps on the synthetic observations and then fine-tuned for another 2.5 million steps on "fake" real observations, obtained by transforming the synthetic images to have the appearance of the real ones. A CycleGAN [cyclegan] unsupervised domain adaptation model was trained on two sets of unpaired synthetic and real-world images (5K for each domain), randomly sampled from the virtual and real-world 3D models used for training, respectively. This navigation model shows the benefit of employing an unsupervised domain adaptation model during training to reduce the virtual-real domain gap; unlike the "RGB Synthetic + Real" model, it does not require the reconstruction of a real-world 3D model, although it still relies on observations of both domains.

Pre-training the “RGB Synthetic + Real” and the “RGB Synthetic + CycleGAN” models on virtual observations resulted in higher performance compared to the same navigation models trained directly on real-world images or transformed images only, as already highlighted in [rosano2020navigation].

4.5 Real-world Evaluation

To validate the navigation results reported by the proposed realistic evaluation framework based on real observations and, more generally, to assess the ability of the proposed navigation models to operate in a real context without performing any additional sim2real domain adaptation, we carried out experiments in the office environment using a real robotic platform. We leveraged a robot equipped with accurate sensors and actuators, able to perform precise movements. This is a desirable feature because, as already highlighted in other works [rosano2020comparison, arewemakingprogress], imprecise actions can lead to significant drops in performance. Although this is an important issue to address, in this work we focus on the visual understanding ability of the navigation models to support decision making during the navigation process, thus we defer the investigation of the impact of noisy sensing and actuation to future work. The robot also came with a Realsense d435i camera mounted to match the point of view of the virtual agent, as can be seen in Figure 7. We set up a client-server communication system to move the computation from the limited hardware of the robot to a more powerful machine. At each navigation step, the robot takes an RGB image of the real environment and sends it to the server. On the server, the image is processed by the navigation model, which returns the action to execute; the action is sent back to the robot, which executes it. The wheel encoders of the robot provide the system with feedback about the executed motion.
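The request/response loop between robot and server boils down to reliably exchanging an image and an action id over a socket. A minimal length-prefixed framing sketch (helper names are ours; the paper does not specify its wire protocol):

```python
import struct

def send_frame(sock, payload: bytes):
    """Send a message prefixed with its 4-byte big-endian length,
    so the receiver knows exactly how much to read."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, looping over partial recv() results."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def recv_frame(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The robot would send_frame the encoded camera image and recv_frame the chosen action; the server does the mirror image of that per navigation step.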

In total, in this experimental setting we considered six navigation trajectories with an increasing level of difficulty, to assess the capabilities of the different models to understand the surrounding environment and to take the appropriate actions accordingly. The sampled trajectories are illustrated in Figure 8. Most episodes require the navigation models to reason about the obstacles interposed between the current position of the agent and the goal, and to find the best path given the agent's understanding of the layout of the space, inferred from the current and previous observations collected during the navigation episode. For instance, all goals are out of the line of sight from the agent's starting pose and, in most of the episodes, the goal is not visible for most of the navigation time; in some episodes an obstacle appears suddenly (i.e. in episode 4 the agent turns around towards the goal and faces the pillar at a very short distance); in other episodes (episodes 5 and 6) a movable obstacle was placed at test time only, in order to test the ability of the navigation models to cope with obstacles never seen during training and to deal with new space layouts. All navigation models have been tested on all the real-world trajectories and, to verify the repeatability and reliability of the learned navigation policies, each navigation episode has been repeated three times. Evaluating one model on one episode took about 5 minutes on average, for a total of ~1710 minutes, or 28.5 hours, required to complete the task. It should be noted that, in general, evaluating a large number of navigation models in real settings is time-consuming and requires constant human supervision. Moreover, many influencing factors should be taken into account to minimize the time spent carrying out the task. 
For instance, aspects such as how slippery or uneven the floor is, the grip of the robot's wheels, the failure rate of the robot's actuators and the robot's battery life heavily influence the amount of time needed to perform an extensive performance evaluation or, in some cases, can totally compromise its execution. Given the high costs involved in assessing performance on a real robot, the value offered by the proposed evaluation tool, which drastically reduces the testing time to a few seconds per episode by considering real images, is immediately evident.

5 Results

5.1 Evaluation on Realistic Trajectories

Table 1 reports the performance in terms of SPL and SR of all the proposed visual navigation models, tested on our real-world 3D environment. The notation A + B denotes the types and the number of mid-level representations provided as input to the navigation models. As previously highlighted, to increase the number of input modalities we followed a greedy approach, expanding the model that reported the best result by adding one additional representation while retaining the already used ones. For instance, considering the "Mid-fusion" model with three modalities, we took the best performing "Mid-fusion" model with two modalities (i.e. surface normals + keypoints3d, "n + k") and extended it with a third one, curvature or depth, to obtain the "n + k + c" and the "n + k + d" "Mid-fusion" models.

Mid-level representations Navigation model SPL SR
(surface) normals (n) Simple model 0.4877 0.6180
keypoints3d (k) Simple model 0.4396 0.5740
curvature (c) Simple model 0.4262 0.5690
depth (d) Simple model 0.4417 0.5560
n+k Simple model 0.4972 0.6330
n+c Simple model 0.4295 0.5530
n+d Simple model 0.3515 0.4650
n+k+c Simple model 0.4349 0.5410
n+k+d Simple model 0.4314 0.5380
n+k+c+d Simple model 0.5233 0.6630
n+k SE att. (avg Pool) 0.5078 0.6400
n+c SE att. (avg Pool) 0.4683 0.6260
n+d SE att. (avg Pool) 0.3847 0.5330
n+k+c SE att. (avg Pool) 0.5014 0.6340
n+k+d SE att. (avg Pool) 0.4142 0.5490
n+k+c+d SE att. (avg Pool) 0.5440 0.6840
n+k SE att. (max Pool) 0.4878 0.6410
n+c SE att. (max Pool) 0.4488 0.5897
n+d SE att. (max Pool) 0.4585 0.6025
n+k+c SE att. (max Pool) 0.4943 0.6550
n+k+d SE att. (max Pool) 0.5101 0.6420
n+k+c+d SE att. (max Pool) 0.4487 0.5730
n+k Mid-fusion 0.4512 0.6150
n+c Mid-fusion 0.4286 0.5846
n+d Mid-fusion 0.4128 0.5627
n+k+c Mid-fusion 0.4851 0.6890
n+k+d Mid-fusion 0.4870 0.6850
n+k+c+d Mid-fusion 0.4441 0.5910
n+k Late-fusion 0.4944 0.6300
n+c Late-fusion 0.4845 0.6174
n+d Late-fusion 0.4573 0.5828
n+k+c Late-fusion 0.5329 0.6790
n+k+d Late-fusion 0.5284 0.6720
n+k+c+d Late-fusion 0.5561 0.7110
RGB Synthetic SE-ResNeXt50 0.2610 0.3990
RGB Synthetic + Real SE-ResNeXt50 0.8269 0.9640
RGB Synthetic + CG SE-ResNeXt50 0.5985 0.7500
Table 1: Performance of all the considered visual navigation models. We proposed 5 different architectures, each of them receiving between 1 and 4 mid-level representations as input. Together with 3 RGB baselines, a total of 37 navigation models have been trained and tested. The results are reported in terms of SPL (Success weighted by Path Length) and SR (Success Rate) at reaching the navigation goal.

A more compact version of the performance achieved is summarized in Figure 9, which reports the SPL of the best performing model of each pair {model type, number of input modalities}.

First of all, all the proposed models largely outperformed the "RGB Synthetic" baseline model, clearly showing the presence of a significant sim-real domain gap, which was successfully reduced by the adoption of mid-level representations. Compared to the "RGB Synthetic + CycleGAN" baseline (last row of Table 1), all models reported slightly lower SPL and SR values, while not requiring any observation of the target domain to perform any adaptation. Indeed, the two best performing mid-level models, namely the "SE attention (avg pool)" model and the "Late-fusion" model, both with 4 modalities as input, achieved an SPL of 0.5440 and 0.5561 respectively, against the 0.5985 of the "RGB Synthetic + CycleGAN" model. As expected, the "RGB Synthetic + Real" model benefited from the supervised adaptation procedure, reporting an SPL of 0.8269 and a near perfect SR of 0.9640. Interestingly, the "Simple" model reported good performance even in its basic variant with 1 modality as input, indicating the high capability of mid-level representations to embed relevant properties of the scene that are meaningful for the navigation task. Increasing the number of input modalities, we can observe an improvement of the performance with 2 and 4 modalities, but a decrease in the case of 3 modalities. We hypothesize that this inconsistent behavior may be caused by the compact size of the considered model which, together with a very simple modality fusion scheme, could have failed to correctly manage the additional data received as input. A similar behavior can be observed with the "SE attention (max pool)" model and the "Mid-fusion" model, whose results increased with 3 modalities and dropped with 4 modalities. In contrast, the "SE attention (avg pool)" model reported roughly the same performance when passing from 2 to 3 input modalities and showed a significant improvement when trained on 4 modalities, overall achieving one of the best results. 
Moreover, it always outperformed the "Simple" model receiving the same number of input modalities. This confirms the effectiveness of the feature-level attention mechanism, while also showing the importance of carefully designing the feature aggregation scheme, given the performance gap with the similar "SE attention (max pool)" model.

The "Late-fusion" model showed the most consistent behavior, with a performance that increased with the number of input modalities. It achieved the best result among models with 3 modalities and the absolute best result with 4 modalities, with an SPL of 0.5561 and an SR of 0.7110. We believe that the overall model benefited from the specialization of the different branches on specific mid-level representations, which had the chance to independently learn the meaningful, although limited, information to retain from the input representations. With the policy fusion module then, the model had the chance to decide which branch is more or less likely to output an optimal action, given the specific perception.

Figure 9: Performance of visual navigation models evaluated on the real-world 3D environment using the Habitat simulator. The performance is reported in terms of SPL (Success weighted by Path Length).

5.2 Real-world Evaluation

Figure 10: Performance of visual navigation models evaluated in the real world using a real robotic platform. The performance is reported in terms of SPL (Success weighted by Path Length).

Figure 10 reports the overall performance obtained in the real-world evaluation across all the navigation episodes. First of all, almost all models successfully reached the navigation goals, with the exception of the "Simple" model, which failed in 1 trial out of 18 (the total number of executed trajectories), reporting an SR of 17/18 (≈0.94). Also, the "RGB Synthetic" baseline showed significant limitations during the execution of the real-world trajectories, achieving a considerably lower SR.

Taking a look at the baselines, the "RGB Synthetic" model reported the lowest result, confirming that a navigation policy trained in simulation cannot be directly transferred to the real world due to the persistence of a sim-real domain shift, which should be addressed with the design of appropriate tools. Interestingly, the "RGB Synthetic + CycleGAN" model returned the best result among the baselines, suggesting that even a general unsupervised domain adaptation technique can effectively help the visual navigation model to address the domain gap.

Generally, all the proposed mid-level representation models reported good results, with an average SPL greater than the estimation produced by our realistic evaluation tool. Starting with the "Simple" model with 1 modality, it reported a remarkable SPL, competitive with the result of the "RGB Synthetic + Real" baseline. The "Simple" model with 2 modalities reported an even better result, surpassing the performance of all the baselines. Despite this promising improvement, increasing the number of input representations did not lead to better results. As already hypothesized in the previous subsection, this may be caused by the basic architecture of the model, which limits the scaling of performance with the number of input modalities. The "SE attention (avg pool)" model achieved interesting performance, with the model trained on 2 modalities reporting an SPL greater than the models trained on 3 and 4 modalities. The "SE attention (max pool)" model reported similar results, with its maximum SPL achieved by the model trained on 4 modalities. Both models showed limited benefit from the additional input, with stable or slightly decreasing performance. In contrast, a different trend can be observed with the "Mid-fusion" and "Late-fusion" models, whose performance increased with the number of input representations, both peaking with 4 modalities. In this case, the models succeeded at exploiting the extra modalities provided as input, positioning themselves among the best performing navigation models.

Figure 11: Performance of navigation models in the real world, reported separately for each of the considered trajectories. The episodes differ in complexity, as reflected in the evolution of the results.

A more detailed overview of the real-world results is provided in Figure 11, which reports the SPL values for each of the considered real-world trajectories. As expected, in episode 1 most of the navigation models succeeded in following the optimal path, which was short and free of obstacles. Similar results were reported for episode 2, which presents a more challenging scenario but was successfully handled by most of the proposed models, with very few differences. In episodes 3 to 6 we observe a general decrease in performance, as more sophisticated reasoning abilities are required to cope with the complexity of the trajectories. The “Simple” model with 2 modalities performed consistently well across all episodes, outperforming all the proposed models in episodes 1, 2 and 4, and still reporting competitive results in the remaining ones. The “SE attention (avg pool)” model with 2 modalities excelled in episode 5, the one with the longest trajectory, and performed reasonably well in the remaining episodes. Good performance was also reported by the “Late-fusion” model with 4 modalities, showing how a multi-source input and a more complex architecture can lead to more stable behavior across a variety of scenarios, with performance in each episode consistently superior or at least comparable to that of the “Simple” model with 1 modality.

Figure 12: SPL values estimated by our tool in simulation, vs. SPL values measured in the real-world evaluation, after performing the same navigation episodes. The chart shows that the evaluation tool can provide a good estimate of the expected performance in the real world, while being slightly optimistic.

In summary, we observe that using a spectrum of mid-level representations is a prerequisite for designing robust visual navigation models. In our analysis no single model prevailed over the others, but many of them performed well on complex navigation trajectories. Indeed, they achieved better results than a classic single-modality model, and their results matched or exceeded those of navigation models that had access to observations of the target domain during training (the “RGB Synthetic + Real” and “RGB Synthetic + CycleGAN” baselines). The real-world evaluation also allowed us to assess the robustness of the multi-modal navigation models in the presence of new obstacles in the scene that were never seen during training.

To assess the ability of the proposed realistic evaluation tool to predict the expected performance of a visual navigation model, we replicated the proposed real-world navigation episodes inside the Habitat simulator and then ran an evaluation following the same setup as the real-world evaluation. Figure 12 shows the relation between the real-world and the estimated SPL values for all navigation models. Overall, the proposed tool appears to provide a good estimation of the models' performance. The validation SPL values are not far off the real performance in most cases, with a Mean Absolute Error (MAE) of . We also measured a Pearson correlation coefficient of with a p-value of , suggesting that our tool is likely to provide a performance estimate that is at least better than the average real-world SPL value.
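The two agreement measures used above can be computed directly from the paired lists of simulated and real-world SPL values; a minimal self-contained sketch (names are illustrative):

```python
import statistics

def mae(estimated, real):
    """Mean Absolute Error between estimated and real SPL values."""
    return sum(abs(e - r) for e, r in zip(estimated, real)) / len(estimated)

def pearson(x, y):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

In practice the correlation and its p-value can equivalently be obtained with `scipy.stats.pearsonr`.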

For a more in-depth analysis of the reliability of the provided estimation, we report the percentage of SPL values correctly estimated by the proposed evaluation tool as a function of the accepted estimation error. Specifically, an estimate is considered correct if the SPL value measured in the real-world test was at most points worse. We consider this metric to understand whether an increment in the estimated performance reflects an increment in real-world performance, while still allowing a margin of error. As shown in Figure 13, more than of the estimated performance values were more optimistic than the real performance by at most SPL points, and more than of the estimates reported an SPL value at most points higher than the real SPL. We believe these results are fairly satisfactory, given the benefit offered by the evaluation tool in terms of the time and resources normally required to assess the performance of a navigation policy in the real world with a real robotic platform.
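The accuracy curve of Figure 13 amounts to a simple count; a sketch under the assumption that `eps` plays the role of the accepted SPL error and the estimate may be optimistic by at most `eps` points:

```python
def fraction_within(estimated, real, eps):
    """Fraction of cases where the simulated SPL estimate exceeds the
    real-world SPL by at most `eps` points (pessimistic estimates
    always count as correct)."""
    correct = sum(1 for e, r in zip(estimated, real) if e - r <= eps)
    return correct / len(estimated)
```

Sweeping `eps` over a range of values and plotting `fraction_within` yields the monotonically increasing curve of the figure.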

Figure 13: Percentage of SPL values correctly estimated by the proposed evaluation tool for varying levels of accepted SPL error. The estimation is considered correct if the SPL reported during evaluation is at most points higher than the SPL obtained in the real world ( varies along the x-axis).

6 Conclusion

In this work we investigated the impact of using a variety of visual representations as input to DL models performing PointGoal navigation, to improve their understanding of the surrounding environment and to allow them to transfer navigation skills learned purely in simulation to the real world. We proposed a set of modality-fusion models to combine an increasing number of mid-level representations and trained them in simulation on the proposed office environment. To facilitate the assessment of navigation performance in the real world, we also proposed a validation tool that leverages a photorealistic 3D model of the environment to simulate realistic trajectories, while keeping the advantages of maintaining the process inside the simulator. A further real-world test with a robotic platform confirmed the effectiveness of the evaluation tool and showed that navigation policies trained in simulation can successfully be deployed in the real world without performing domain adaptation. Our results suggest that navigation models can benefit from the additional mid-level representations provided as input: most of the considered models reported performance comparable to the baselines that had access to real-world observations during training, with even the smaller models reaching relevant results.

7 Acknowledgements

This research is supported by OrangeDev s.r.l., by Next Vision s.r.l., by the project MEGABIT - PIAno di inCEntivi per la RIcerca di Ateneo 2020/2022 (PIACERI) – linea di intervento 2, DMI - University of Catania, and by the grant MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.